ELIPS treats the GPU as an optional engine — never a runtime dependency. Domain code talks to IndexPort; when GPU is built in, the index happens to live in device memory. Everything else — selection, allocation, batching, fallback — is out of band, behind narrow ports.
Thesis
Three commitments shape every header under include/elips/gpu_engine/:
- Interface Segregation. Five ports (compute, memory, kernel, stream, index) instead of one giant facade.
- Failure is expected. Every GPU operation returns std::expected<T, GpuError>: device-lost, OOM, and unsupported metric are values, not exceptions.
- Optional, additive. The CPU index and serving path stand on their own. GPU is a build-time switch (-DELIPS_GPU_ENABLED=ON), not a runtime requirement.
Layered architecture
Ports — interface segregation
The engine is decomposed across five virtual interfaces so each caller depends only on the slice it actually needs.
// GpuPort.hpp — the umbrella port (init + compute + top_k)
class GpuPort {
public:
virtual std::expected<void, GpuError> initialize(const GpuConfig&) = 0;
virtual void shutdown() noexcept = 0;
virtual GpuDeviceInfo device_info() const noexcept = 0;
virtual std::expected<GpuBuffer, GpuError> allocate_device(size_t bytes) = 0;
virtual void free_device(GpuBuffer) noexcept = 0;
virtual std::expected<void, GpuError> upload(const void*, GpuBuffer&, size_t) = 0;
virtual std::expected<void, GpuError> download(const GpuBuffer&, void*, size_t) = 0;
virtual std::expected<void, GpuError> compute_distances_batch(
std::span<const float> queries, std::span<const float> database,
std::span<float> out, size_t nq, size_t nb, size_t dim, elips::Metric) = 0;
virtual std::expected<void, GpuError> top_k(
std::span<const float> distances,
std::span<uint32_t> indices_out, std::span<float> values_out,
size_t nq, size_t nb, size_t k) = 0;
virtual void synchronize() = 0;
virtual bool is_idle() const noexcept = 0;
};
The dedicated ports keep call sites honest:
- GpuMemoryPort: allocate / deallocate / pinned host buffers / bytes_used / available / peak.
- GpuKernelPort: explicit metric-typed kernels (cosine_fp32, euclidean_fp32, dot_product_fp32).
- GpuStreamPort: fine-grained synchronisation surface for callers that schedule their own streams.
- GpuIndexPort: implements both IndexPort and IndexTransferPort; exposes build_from_batch, search_batch, export_to_cpu_index / import_from_cpu_index, plus backend_name() and device_bytes_used().
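To illustrate why the split matters, here is a minimal sketch of a caller that depends on a memory-statistics slice only. MemoryStatsPort, FakeMemory, and fits_on_device are hypothetical names introduced for this example; the real GpuMemoryPort signatures are not shown in this document and may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the statistics part of a memory port.
class MemoryStatsPort {
public:
    virtual ~MemoryStatsPort() = default;
    virtual std::size_t bytes_used() const noexcept = 0;
    virtual std::size_t available() const noexcept = 0;
};

// This caller needs nothing but memory statistics, so it never sees
// kernels, streams, or index types.
inline bool fits_on_device(const MemoryStatsPort& mem, std::size_t bytes) {
    return bytes <= mem.available();
}

// Test double: exercises the caller without any GPU present.
class FakeMemory final : public MemoryStatsPort {
public:
    FakeMemory(std::size_t used, std::size_t total) : used_(used), total_(total) {
        assert(used <= total);
    }
    std::size_t bytes_used() const noexcept override { return used_; }
    std::size_t available() const noexcept override { return total_ - used_; }

private:
    std::size_t used_;
    std::size_t total_;
};
```

Because fits_on_device takes only the narrow interface, a unit test can pass a FakeMemory and run on a CPU-only build.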
GpuConfig
enum class GpuPolicy { Auto, PreferGpu, RequireGpu, CpuOnly, Specific };
enum class IndexBuildMode { GpuBuild_CpuServe, GpuBuild_GpuServe, Hybrid };
enum class GpuIndexAlgorithm { Auto, CagraGraph, IvfFlat, IvfPq, BruteForce };
enum class GpuPrecision { FP32, FP16, Int8, Auto };
struct GraphIndexBuildParams {
size_t intermediate_graph_degree{128};
size_t graph_degree{64};
enum class BuildAlgo { IvfPq, NnDescent, IterativeSearch } build_algo{...};
size_t nn_descent_iterations{20};
float compression_ratio{0.0f};
};
struct IvfPqBuildParams {
size_t n_lists{1024};
uint32_t pq_dim{0};
uint32_t pq_bits{8};
bool add_data_on_build{true};
size_t kmeans_n_iters{20};
float kmeans_trainset_fraction{0.5f};
};
struct GpuConfig {
GpuPolicy policy{GpuPolicy::Auto};
std::string preferred_backend; // "cuda" | "hip" | "metal" | ""
int32_t device_index{-1}; // -1 = let selector decide
IndexBuildMode index_build_mode{IndexBuildMode::GpuBuild_GpuServe};
GpuIndexAlgorithm algorithm{GpuIndexAlgorithm::Auto};
size_t device_memory_pool_bytes{0}; // 0 = auto-size
size_t pinned_host_pool_bytes{256UL * 1024 * 1024};
bool use_unified_memory{false}; // Apple / Grace Hopper
size_t default_ef_search_gpu{64};
size_t dynamic_batch_window_us{500};
size_t dynamic_batch_max_size{256};
bool enable_fp16_search{false};
GraphIndexBuildParams graph_params;
IvfPqBuildParams ivf_pq_params;
GpuPrecision search_precision{GpuPrecision::Auto};
bool auto_rebuild_on_startup{false};
float rebuild_threshold_ratio{0.1f};
bool enable_profiling{false};
bool emit_kernel_timings{false};
};
policy drives the selector: Auto tries the GPU and falls back to CPU; PreferGpu warns but still falls back; RequireGpu errors if no device is usable; CpuOnly short-circuits the engine entirely; Specific pins to preferred_backend / device_index.
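The policy rules above can be summarised as a small decision function. This is a sketch under stated assumptions, not the selector's actual code: Engine, Resolution, and resolve are hypothetical names, gpu_usable stands for "the selector found a working device", and treating Specific as a hard error when its pinned device is unusable is an assumption the text does not confirm.

```cpp
#include <cassert>
#include <optional>

enum class GpuPolicy { Auto, PreferGpu, RequireGpu, CpuOnly, Specific };
enum class Engine { Gpu, Cpu };

struct Resolution {
    std::optional<Engine> engine;  // empty means a hard error: nothing to run on
    bool warn{false};              // true when the fallback should be logged
};

// Hypothetical helper mirroring the policy description.
inline Resolution resolve(GpuPolicy policy, bool gpu_usable) {
    switch (policy) {
        case GpuPolicy::CpuOnly:
            return {Engine::Cpu, false};  // the GPU is never consulted
        case GpuPolicy::Auto:
            return {gpu_usable ? Engine::Gpu : Engine::Cpu, false};
        case GpuPolicy::PreferGpu:
            return {gpu_usable ? Engine::Gpu : Engine::Cpu, !gpu_usable};
        case GpuPolicy::RequireGpu:
        case GpuPolicy::Specific:  // assumption: a pinned device that is unusable is an error
            if (gpu_usable) return {Engine::Gpu, false};
            return {std::nullopt, false};
    }
    assert(false && "unknown GpuPolicy");
    return {std::nullopt, false};
}
```

Under this model the difference between Auto and PreferGpu is only whether the fallback is reported, while RequireGpu is the one policy that turns a missing device into a failure.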
Backends
Three concrete backends live under src/gpu_engine/backends/:
- CUDA (backends/cuda/): NVIDIA path, ships the cosine_fp32.cu kernel.
- HIP / ROCm (backends/hip/): AMD path, with a parallel cosine_fp32.hip kernel.
- Metal (backends/metal/): Apple Silicon, leans on unified memory (has_unified_memory in GpuDeviceInfo).
Each backend implements GpuPort, registers its capabilities in GpuDeviceInfo (FP16/BF16/Int8 support, CAGRA, IVF-PQ, dynamic batching, half-precision search), and is selected by GpuSelector::rank_backend together with the device list returned by GpuDeviceManager.
Index family
Every GPU index is a final subclass of GpuIndexPort, which itself extends both IndexPort and IndexTransferPort. Domain code keeps using IndexPort.
- GpuBruteForceIndex: exhaustive search on device, ideal for small/medium vaults and ground-truth probes.
- GpuIVFFlatIndex: inverted-file with flat lists; tunable n_lists.
- GpuIVFPQIndex: inverted-file with product quantisation (pq_dim, pq_bits); large-scale, memory-efficient.
- GpuGraphIndex: CAGRA-style graph; build via IVF-PQ, NN-descent, or iterative search.
- GpuHybridIndex: composite for mixed dense / lexical paths.
- GpuDistributedIndex: multi-GPU index with a DistributedMode selector.
Dynamic batcher
Concurrent single-query searches coalesce into one kernel launch. DynamicBatcher opens a window of dynamic_batch_window_us µs (default 500), buffers up to dynamic_batch_max_size queries (default 256), then flushes — either when the window closes or the buffer fills.
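The flush rule can be modelled in isolation. The sketch below is a single-threaded model of the "window closes or buffer fills" decision only; BatchWindow is a hypothetical class, and the assumption that the window opens when the first query of a batch arrives is not stated in the source. The real DynamicBatcher is concurrent and hands back futures.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical model of the flush decision; it tracks counts, not queries.
class BatchWindow {
public:
    using Clock = std::chrono::steady_clock;

    BatchWindow(std::chrono::microseconds window, std::size_t max_size)
        : window_(window), max_size_(max_size) {
        assert(max_size > 0);
    }

    // Records one query arriving at `now`; returns true if the batch must flush.
    bool add(Clock::time_point now) {
        if (pending_ == 0) opened_at_ = now;  // assumption: first query opens the window
        ++pending_;
        return should_flush(now);
    }

    // Flush when the buffer is full or the window has elapsed.
    bool should_flush(Clock::time_point now) const {
        if (pending_ == 0) return false;
        return pending_ >= max_size_ || (now - opened_at_) >= window_;
    }

    // Empties the buffer and returns how many queries share the launch.
    std::size_t flush() {
        const std::size_t n = pending_;
        pending_ = 0;
        return n;
    }

private:
    std::chrono::microseconds window_;
    std::size_t max_size_;
    std::size_t pending_{0};
    Clock::time_point opened_at_{};
};
```

With the defaults quoted above, a BatchWindow(std::chrono::microseconds(500), 256) would flush after 256 queries or 500 µs, whichever comes first; passing the time in explicitly keeps the rule testable without sleeping.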
std::future<std::vector<SearchResult>>
DynamicBatcher::enqueue(std::span<const float> q, size_t k);
struct BatchStats {
size_t queries_coalesced{0};
size_t kernel_launches{0};
float avg_batch_size{0.0f};
float p99_latency_us{0.0f};
};
Memory
GpuMemoryManager implements GpuMemoryPort: a slab allocator over a device-side pool sized by device_memory_pool_bytes, plus a pinned host pool sized by pinned_host_pool_bytes for fast H2D / D2H transfers. GpuMemoryPool provides a simpler in-pool allocator used by the search pipeline for scratch buffers.
On Apple Silicon, use_unified_memory avoids redundant copies — the same allocation is visible to host and GPU. See ADR-GPU-006.
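As a rough illustration of slab-style accounting, the following host-side sketch hands out fixed-size slabs from a byte budget and tracks the bytes_used / available / peak figures listed for the memory port. SlabPool and its fixed slab size are assumptions made for the example; it touches no device memory and says nothing about how GpuMemoryManager sizes or searches its slabs.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model: one flag per slab, no real allocation.
class SlabPool {
public:
    SlabPool(std::size_t pool_bytes, std::size_t slab_bytes) : slab_bytes_(slab_bytes) {
        assert(slab_bytes > 0);
        in_use_.assign(pool_bytes / slab_bytes, false);
    }

    // Returns a slab index, or nullopt when the pool is exhausted
    // (the analogue of GpuError::InsufficientMemory).
    std::optional<std::size_t> allocate() {
        for (std::size_t i = 0; i < in_use_.size(); ++i) {
            if (!in_use_[i]) {
                in_use_[i] = true;
                ++used_slabs_;
                if (used_slabs_ > peak_slabs_) peak_slabs_ = used_slabs_;
                return i;
            }
        }
        return std::nullopt;
    }

    void deallocate(std::size_t slab) {
        assert(slab < in_use_.size() && in_use_[slab]);
        in_use_[slab] = false;
        --used_slabs_;
    }

    std::size_t bytes_used() const { return used_slabs_ * slab_bytes_; }
    std::size_t available() const { return (in_use_.size() - used_slabs_) * slab_bytes_; }
    std::size_t peak() const { return peak_slabs_ * slab_bytes_; }

private:
    std::size_t slab_bytes_;
    std::vector<bool> in_use_;
    std::size_t used_slabs_{0};
    std::size_t peak_slabs_{0};
};
```

The linear scan keeps the sketch short; a production allocator would more likely keep a free list so allocation does not depend on pool size.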
Index transfer
GpuIndexTransferManager moves built indexes between device and host:
clone(source, destination); // CPU↔CPU or GPU↔GPU
clone_cpu_to_gpu(cpu_src, gpu_dst); // GpuBuild_CpuServe → upload
clone_gpu_to_cpu(gpu_src, cpu_dst); // export for serving on CPU
Together with IndexBuildMode::GpuBuild_CpuServe, this unlocks the canonical pattern: build a CAGRA graph on a beefy training box, then serve it from CPU nodes with no GPU dependency. See ADR-GPU-003.
Error model
enum class GpuError {
DeviceNotFound, InsufficientMemory, KernelLaunchFailed,
TransferFailed, IndexBuildFailed, UnsupportedMetric,
InitializationFailed, BackendUnavailable,
};
[[nodiscard]] std::expected<...> op(...); // every port method
Errors flow up as values; the caller decides whether to fall back to CPU, retry on another device, or surface the error. The fallback chain is codified in ADR-GPU-008.
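A caller-side sketch of "errors as values" might look like the following. It uses std::variant as a stand-in for std::expected so that it compiles under C++17; with a C++23 toolchain the same shape works with std::expected directly. SearchOutcome, Action, triage, and search_with_fallback are hypothetical names, the sketch retries the same search rather than picking another device, and the error-to-action mapping is illustrative only: the actual chain is the one defined in ADR-GPU-008.

```cpp
#include <cassert>
#include <functional>
#include <variant>
#include <vector>

enum class GpuError {
    DeviceNotFound, InsufficientMemory, KernelLaunchFailed,
    TransferFailed, IndexBuildFailed, UnsupportedMetric,
    InitializationFailed, BackendUnavailable,
};

using Ids = std::vector<unsigned>;
using SearchOutcome = std::variant<Ids, GpuError>;  // stand-in for std::expected<Ids, GpuError>

enum class Action { Retry, FallBackToCpu, Surface };

// Illustrative mapping only, not the documented fallback chain.
inline Action triage(GpuError e) {
    switch (e) {
        case GpuError::KernelLaunchFailed:
        case GpuError::TransferFailed:
            return Action::Retry;
        case GpuError::IndexBuildFailed:
            return Action::Surface;
        default:
            return Action::FallBackToCpu;
    }
}

inline SearchOutcome search_with_fallback(
        const std::function<SearchOutcome()>& gpu_search,
        const std::function<Ids()>& cpu_search,
        int max_retries = 1) {
    assert(max_retries >= 0);
    for (int attempt = 0;; ++attempt) {
        SearchOutcome out = gpu_search();
        if (std::holds_alternative<Ids>(out)) return out;  // GPU succeeded
        const GpuError err = std::get<GpuError>(out);
        switch (triage(err)) {
            case Action::Retry:
                if (attempt < max_retries) continue;  // try the GPU again
                return cpu_search();                  // retries exhausted
            case Action::FallBackToCpu:
                return cpu_search();
            case Action::Surface:
                return err;                           // hand the error to the caller
        }
    }
}
```

No exception crosses the boundary: every branch is an ordinary return, so the caller can inspect the outcome and decide what to log or count as a fallback event.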
Metrics & profiling
gpu_stats() on ElipsInstance returns a GpuMetricsSnapshot:
struct GpuMetricsSnapshot {
std::string backend, device_name;
size_t device_memory_used_bytes, device_memory_total_bytes;
size_t index_build_count, index_build_time_total_ms;
float index_build_speedup_vs_cpu_avg;
size_t search_kernel_launches_total;
size_t search_p50_latency_us, search_p99_latency_us;
float batch_avg_size, batch_coalescing_ratio;
bool fp16_search_enabled;
size_t fallback_events_total;
size_t kernel_errors_total;
size_t pinned_memory_pool_used_bytes;
};
Per-kernel timing is opt-in via GpuConfig::emit_kernel_timings; recent samples are retrievable from GpuProfiler::recent_timings(n).
Build flags
# CPU-only — the default
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
# GPU-enabled — backends auto-detected from the host toolchain
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DELIPS_GPU_ENABLED=ON
cmake --build build -j
ctest --test-dir build --output-on-failure
Reference: ADR-GPU-001 through ADR-GPU-010 in Design decisions.