ELIPS treats the GPU as an optional engine — never a runtime dependency. Domain code talks to IndexPort; when GPU is built in, the index happens to live in device memory. Everything else — selection, allocation, batching, fallback — is out of band, behind narrow ports.

Thesis

Three commitments shape every header under include/elips/gpu_engine/:

Interface Segregation. Five ports (compute, memory, kernel, stream, index) instead of one giant facade.
Failure is expected. Every fallible GPU operation returns std::expected<T, GpuError>; a missing device, OOM, and an unsupported metric are values, not exceptions.
Optional, additive. The CPU index and serving path stand on their own. GPU is a build-time switch (-DELIPS_GPU_ENABLED=ON), not a runtime requirement.
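
As a rough illustration of the third commitment, the sketch below shows one way a build-time switch can keep domain code on a single interface. It assumes the CMake option surfaces as a preprocessor definition named ELIPS_GPU_ENABLED, which the text does not confirm. The IndexPort stand-in, describe(), CpuIndex, GpuIndex, and make_index are hypothetical names used only for this example.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for IndexPort; the real interface is richer.
class IndexPort {
public:
    virtual ~IndexPort() = default;
    virtual std::string describe() const = 0;  // illustrative method, not ELIPS API
};

class CpuIndex final : public IndexPort {
public:
    std::string describe() const override { return "cpu"; }
};

#if defined(ELIPS_GPU_ENABLED)  // assumed to mirror the CMake option
class GpuIndex final : public IndexPort {
public:
    std::string describe() const override { return "gpu"; }
};
#endif

// Domain code calls this and only ever sees IndexPort.
inline std::unique_ptr<IndexPort> make_index(bool gpu_usable) {
#if defined(ELIPS_GPU_ENABLED)
    if (gpu_usable) return std::make_unique<GpuIndex>();
#else
    (void)gpu_usable;  // GPU code is compiled out entirely
#endif
    return std::make_unique<CpuIndex>();
}
```

In a CPU-only build the GPU class does not exist at all, so nothing at runtime can depend on it.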

Layered architecture

Surface → orchestration → ports → backends. Each layer only knows the layer directly beneath it.

Ports — interface segregation

The engine is decomposed across five virtual interfaces so each caller depends only on the slice it actually needs.

cpp

// GpuPort.hpp — the umbrella port (init + compute + top_k)
class GpuPort {
public:
    virtual std::expected<void, GpuError>      initialize(const GpuConfig&) = 0;
    virtual void                               shutdown() noexcept = 0;
    virtual GpuDeviceInfo                      device_info() const noexcept = 0;

    virtual std::expected<GpuBuffer, GpuError> allocate_device(size_t bytes) = 0;
    virtual void                               free_device(GpuBuffer) noexcept = 0;
    virtual std::expected<void, GpuError>      upload(const void*, GpuBuffer&, size_t) = 0;
    virtual std::expected<void, GpuError>      download(const GpuBuffer&, void*, size_t) = 0;

    virtual std::expected<void, GpuError>      compute_distances_batch(
        std::span<const float> queries, std::span<const float> database,
        std::span<float> out, size_t nq, size_t nb, size_t dim, elips::Metric) = 0;

    virtual std::expected<void, GpuError>      top_k(
        std::span<const float> distances,
        std::span<uint32_t> indices_out, std::span<float> values_out,
        size_t nq, size_t nb, size_t k) = 0;

    virtual void synchronize() = 0;
    virtual bool is_idle() const noexcept = 0;
};
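
The signatures above leave the buffer layout implicit. The CPU reference below is a sketch under stated assumptions: row-major queries (nq × dim), database (nb × dim), and distance matrix (nq × nb), with smaller values ranking first in top_k. None of this is confirmed by the header excerpt, and reference_l2_batch / reference_top_k are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical CPU reference: squared Euclidean distance for every (query, row) pair.
// Assumed layout: queries is nq x dim, database is nb x dim, out is nq x nb, all row-major.
inline void reference_l2_batch(const std::vector<float>& queries,
                               const std::vector<float>& database,
                               std::vector<float>& out,
                               std::size_t nq, std::size_t nb, std::size_t dim) {
    out.assign(nq * nb, 0.0f);
    for (std::size_t q = 0; q < nq; ++q) {
        for (std::size_t b = 0; b < nb; ++b) {
            float acc = 0.0f;
            for (std::size_t d = 0; d < dim; ++d) {
                const float diff = queries[q * dim + d] - database[b * dim + d];
                acc += diff * diff;
            }
            out[q * nb + b] = acc;
        }
    }
}

// Hypothetical CPU reference for top_k: the k smallest distances per query row,
// written as nq x k row-major arrays (k is clamped to nb).
inline void reference_top_k(const std::vector<float>& distances,
                            std::vector<std::uint32_t>& indices_out,
                            std::vector<float>& values_out,
                            std::size_t nq, std::size_t nb, std::size_t k) {
    k = std::min(k, nb);
    indices_out.assign(nq * k, 0);
    values_out.assign(nq * k, 0.0f);
    std::vector<std::uint32_t> order(nb);
    for (std::size_t q = 0; q < nq; ++q) {
        std::iota(order.begin(), order.end(), 0u);
        const float* row = distances.data() + q * nb;
        std::partial_sort(order.begin(), order.begin() + k, order.end(),
                          [row](std::uint32_t a, std::uint32_t b) { return row[a] < row[b]; });
        for (std::size_t i = 0; i < k; ++i) {
            indices_out[q * k + i] = order[i];
            values_out[q * k + i] = row[order[i]];
        }
    }
}
```

A reference like this is mainly useful for checking a device backend's output on small inputs.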

The dedicated ports keep call sites honest:

GpuMemoryPort — allocate / deallocate / pinned host buffers / bytes_used / available / peak.
GpuKernelPort — explicit metric-typed kernels: cosine_fp32, euclidean_fp32, dot_product_fp32.
GpuStreamPort — fine-grained synchronisation surface for callers that schedule their own streams.
GpuIndexPort — implements both IndexPort and IndexTransferPort; exposes build_from_batch, search_batch, export_to_cpu_index / import_from_cpu_index, plus backend_name() and device_bytes_used().
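
To make the benefit of segregation concrete, here is a minimal sketch of a caller that depends only on memory accounting. The real GpuMemoryPort signatures are not shown in this section, so MemoryStatsView, has_headroom, and FakeMemoryStats are hypothetical names, and the method names are assumptions derived from the capability list above.

```cpp
#include <cstddef>

// Hypothetical narrow slice: only the accounting queries a scheduler might need.
class MemoryStatsView {
public:
    virtual ~MemoryStatsView() = default;
    virtual std::size_t bytes_used() const noexcept = 0;
    virtual std::size_t bytes_available() const noexcept = 0;
};

// A caller that needs only accounting depends on the narrow slice...
inline bool has_headroom(const MemoryStatsView& mem, std::size_t needed_bytes) noexcept {
    return mem.bytes_available() >= needed_bytes;
}

// ...so a unit test can substitute a trivial fake without touching a device.
class FakeMemoryStats final : public MemoryStatsView {
public:
    FakeMemoryStats(std::size_t used, std::size_t total) : used_(used), total_(total) {}
    std::size_t bytes_used() const noexcept override { return used_; }
    std::size_t bytes_available() const noexcept override { return total_ - used_; }

private:
    std::size_t used_;
    std::size_t total_;
};
```

Because has_headroom never sees kernels, streams, or indexes, changes to those ports cannot force it to recompile or be retested.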

GpuConfig

cpp

enum class GpuPolicy        { Auto, PreferGpu, RequireGpu, CpuOnly, Specific };
enum class IndexBuildMode   { GpuBuild_CpuServe, GpuBuild_GpuServe, Hybrid };
enum class GpuIndexAlgorithm{ Auto, CagraGraph, IvfFlat, IvfPq, BruteForce };
enum class GpuPrecision     { FP32, FP16, Int8, Auto };

struct GraphIndexBuildParams {
    size_t intermediate_graph_degree{128};
    size_t graph_degree{64};
    enum class BuildAlgo { IvfPq, NnDescent, IterativeSearch } build_algo{...};
    size_t nn_descent_iterations{20};
    float  compression_ratio{0.0f};
};
struct IvfPqBuildParams {
    size_t   n_lists{1024};
    uint32_t pq_dim{0};
    uint32_t pq_bits{8};
    bool     add_data_on_build{true};
    size_t   kmeans_n_iters{20};
    float    kmeans_trainset_fraction{0.5f};
};

struct GpuConfig {
    GpuPolicy           policy{GpuPolicy::Auto};
    std::string         preferred_backend;        // "cuda" | "hip" | "metal" | ""
    int32_t             device_index{-1};         // -1 = let selector decide
    IndexBuildMode      index_build_mode{IndexBuildMode::GpuBuild_GpuServe};
    GpuIndexAlgorithm   algorithm{GpuIndexAlgorithm::Auto};

    size_t              device_memory_pool_bytes{0};            // 0 = auto-size
    size_t              pinned_host_pool_bytes{256UL * 1024 * 1024};
    bool                use_unified_memory{false};               // Apple / Grace Hopper

    size_t              default_ef_search_gpu{64};
    size_t              dynamic_batch_window_us{500};
    size_t              dynamic_batch_max_size{256};
    bool                enable_fp16_search{false};

    GraphIndexBuildParams graph_params;
    IvfPqBuildParams      ivf_pq_params;
    GpuPrecision          search_precision{GpuPrecision::Auto};

    bool  auto_rebuild_on_startup{false};
    float rebuild_threshold_ratio{0.1f};

    bool  enable_profiling{false};
    bool  emit_kernel_timings{false};
};

policy drives the selector: Auto tries the GPU and falls back to CPU if no device is usable; PreferGpu does the same but warns on fallback; RequireGpu returns an error if no device is usable; CpuOnly short-circuits the engine entirely; Specific pins to preferred_backend / device_index.
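
The policy rules can be sketched as a small pure function. This is an illustration, not the selector's actual code: Engine, PolicyDecision, and resolve_policy are hypothetical names, and treating Specific like RequireGpu when the pinned device is unusable is an assumption the text does not settle.

```cpp
#include <optional>

enum class GpuPolicy { Auto, PreferGpu, RequireGpu, CpuOnly, Specific };
enum class Engine { Gpu, Cpu };  // hypothetical outcome type

struct PolicyDecision {
    std::optional<Engine> engine;  // empty: no usable engine, report an error
    bool warn;                     // true: caller should emit a fallback warning
};

// `gpu_usable` stands for "the selector found a usable device"; for Specific it
// means the pinned backend / device_index is usable.
inline PolicyDecision resolve_policy(GpuPolicy policy, bool gpu_usable) {
    if (policy == GpuPolicy::CpuOnly) return {Engine::Cpu, false};  // engine bypassed
    if (gpu_usable) return {Engine::Gpu, false};
    switch (policy) {
        case GpuPolicy::Auto:      return {Engine::Cpu, false};  // quiet fallback
        case GpuPolicy::PreferGpu: return {Engine::Cpu, true};   // fallback with warning
        default:                   return {std::nullopt, false}; // RequireGpu; Specific (assumed)
    }
}
```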

Backends

Three concrete backends live under src/gpu_engine/backends/:

CUDA (backends/cuda/) — NVIDIA path, ships cosine_fp32.cu kernel.
HIP / ROCm (backends/hip/) — AMD path, parallel cosine_fp32.hip kernel.
Metal (backends/metal/) — Apple Silicon, leans on unified memory (has_unified_memory in GpuDeviceInfo).

Each backend implements GpuPort, registers its capabilities in GpuDeviceInfo (FP16/BF16/Int8 support, CAGRA, IVF-PQ, dynamic batching, half-precision search), and is selected by GpuSelector::rank_backend together with the device list returned by GpuDeviceManager.

Index family

Every GPU index is a final subclass of GpuIndexPort, which itself extends both IndexPort and IndexTransferPort. Domain code keeps using IndexPort.

GpuBruteForceIndex — exhaustive search on device, ideal for small/medium vaults and ground-truth probes.
GpuIVFFlatIndex — inverted-file with flat lists; tunable n_lists.
GpuIVFPQIndex — inverted-file with product quantisation (pq_dim, pq_bits); large-scale, memory-efficient.
GpuGraphIndex — CAGRA-style graph; build via IVF-PQ, NN-descent, or iterative search.
GpuHybridIndex — composite for mixed dense / lexical paths.
GpuDistributedIndex — multi-GPU index with a DistributedMode selector.

Dynamic batcher

Concurrent single-query searches coalesce into one kernel launch. DynamicBatcher opens a window of dynamic_batch_window_us µs (default 500), buffers up to dynamic_batch_max_size queries (default 256), then flushes — either when the window closes or the buffer fills.

A short batching window converts many small queries into a few large kernel launches. Each query still resolves through its own std::future.

cpp

std::future<std::vector<SearchResult>>
DynamicBatcher::enqueue(std::span<const float> q, size_t k);

struct BatchStats {
    size_t queries_coalesced{0};
    size_t kernel_launches{0};
    float  avg_batch_size{0.0f};
    float  p99_latency_us{0.0f};
};
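
The two flush triggers and the first three BatchStats counters can be modelled without threads or futures. The sketch below is a simplified, single-threaded model: BatchWindow is a hypothetical name, time is passed in explicitly, and computing avg_batch_size as queries_coalesced / kernel_launches is an assumption about how the real field is derived.

```cpp
#include <chrono>
#include <cstddef>

struct BatchStats {  // first three fields of the struct above
    std::size_t queries_coalesced{0};
    std::size_t kernel_launches{0};
    float       avg_batch_size{0.0f};
};

// Hypothetical model of the coalescing rule only; no search is performed.
class BatchWindow {
public:
    using Micros = std::chrono::microseconds;

    BatchWindow(Micros window, std::size_t max_size)
        : window_(window), max_size_(max_size) {}

    // Buffers one query at time `now`; returns true if it filled the buffer and flushed.
    bool add(Micros now) {
        if (pending_ == 0) opened_at_ = now;  // first query opens the window
        ++pending_;
        if (pending_ >= max_size_) { flush(); return true; }
        return false;
    }

    // Timer tick; returns true if the window had closed and the buffer was flushed.
    bool tick(Micros now) {
        if (pending_ > 0 && now - opened_at_ >= window_) { flush(); return true; }
        return false;
    }

    const BatchStats& stats() const noexcept { return stats_; }

private:
    void flush() {
        stats_.queries_coalesced += pending_;
        ++stats_.kernel_launches;  // one launch for the whole batch
        stats_.avg_batch_size = static_cast<float>(stats_.queries_coalesced) /
                                static_cast<float>(stats_.kernel_launches);
        pending_ = 0;
    }

    Micros      window_;
    std::size_t max_size_;
    Micros      opened_at_{0};
    std::size_t pending_{0};
    BatchStats  stats_;
};
```

With the defaults quoted above, this model would be constructed as BatchWindow(std::chrono::microseconds(500), 256).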

Memory

GpuMemoryManager implements GpuMemoryPort: a slab-allocator over a device-side pool sized by device_memory_pool_bytes, plus a pinned host pool sized by pinned_host_pool_bytes for fast H2D / D2H transfers. GpuMemoryPool provides a simpler in-pool allocator used by the search pipeline for scratch buffers.
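
For intuition about the scratch-buffer side, here is a minimal sketch of an in-pool allocator with the used / available / peak accounting that GpuMemoryPort exposes. It is a simple bump arena over byte offsets, not the slab allocator itself. ScratchArena, its method names, and the 256-byte default alignment are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>

// Hypothetical bump arena: hands out offsets into a pre-sized pool.
class ScratchArena {
public:
    explicit ScratchArena(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of the new block, or nullopt when the pool is exhausted
    // (a real port would surface that as GpuError::InsufficientMemory).
    // `alignment` must be a power of two.
    std::optional<std::size_t> allocate(std::size_t bytes, std::size_t alignment = 256) {
        const std::size_t start = (used_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || bytes > capacity_ - start) return std::nullopt;
        used_ = start + bytes;
        peak_ = std::max(peak_, used_);
        return start;
    }

    // Releases every scratch block at once, e.g. after a search batch completes.
    void reset() noexcept { used_ = 0; }

    std::size_t bytes_used() const noexcept { return used_; }
    std::size_t bytes_available() const noexcept { return capacity_ - used_; }
    std::size_t peak_bytes() const noexcept { return peak_; }

private:
    std::size_t capacity_;
    std::size_t used_{0};
    std::size_t peak_{0};
};
```

A failed allocation leaves the arena unchanged, so the caller can fall back or retry with a smaller batch.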

On Apple Silicon, use_unified_memory avoids redundant copies — the same allocation is visible to host and GPU. See ADR-GPU-006.

Index transfer

GpuIndexTransferManager moves built indexes between device and host:

cpp

clone(source, destination);                 // CPU↔CPU or GPU↔GPU
clone_cpu_to_gpu(cpu_src, gpu_dst);         // upload a CPU-resident index to the device
clone_gpu_to_cpu(gpu_src, cpu_dst);         // GpuBuild_CpuServe → export for serving on CPU

Together with IndexBuildMode::GpuBuild_CpuServe, this unlocks the canonical pattern: build a CAGRA graph on a beefy training box, then serve it from CPU nodes with no GPU dependency. See ADR-GPU-003.

Error model

cpp

enum class GpuError {
    DeviceNotFound, InsufficientMemory, KernelLaunchFailed,
    TransferFailed, IndexBuildFailed, UnsupportedMetric,
    InitializationFailed, BackendUnavailable,
};

[[nodiscard]] std::expected<...> op(...);   // every port method

Errors flow up as values; the caller decides whether to fall back to CPU, retry on another device, or surface. The fallback chain is codified in ADR-GPU-008.
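
The caller-side decision might look like the sketch below. ADR-GPU-008 is the authority on the actual fallback chain; the grouping of errors into device-specific and device-independent, along with the names Recovery and choose_recovery, is an assumption made purely for illustration.

```cpp
enum class GpuError {
    DeviceNotFound, InsufficientMemory, KernelLaunchFailed,
    TransferFailed, IndexBuildFailed, UnsupportedMetric,
    InitializationFailed, BackendUnavailable,
};

enum class Recovery { RetryOtherDevice, FallBackToCpu, Surface };  // hypothetical

// Maps an error value to one of the three options named above.
inline Recovery choose_recovery(GpuError err, bool gpu_required, bool other_device_available) {
    bool device_specific = false;
    switch (err) {
        // Assumed to depend on the particular device, so another one may succeed.
        case GpuError::InsufficientMemory:
        case GpuError::KernelLaunchFailed:
        case GpuError::TransferFailed:
        case GpuError::IndexBuildFailed:
            device_specific = true;
            break;
        // Assumed to hold for every device of the backend.
        case GpuError::DeviceNotFound:
        case GpuError::UnsupportedMetric:
        case GpuError::InitializationFailed:
        case GpuError::BackendUnavailable:
            break;
    }
    if (device_specific && other_device_available) return Recovery::RetryOtherDevice;
    return gpu_required ? Recovery::Surface : Recovery::FallBackToCpu;
}
```

Listing every enumerator in the switch, with no default case, lets the compiler warn when a new GpuError value is added without a recovery rule.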

Metrics & profiling

gpu_stats() on ElipsInstance returns a GpuMetricsSnapshot:

cpp

struct GpuMetricsSnapshot {
    std::string backend, device_name;
    size_t device_memory_used_bytes, device_memory_total_bytes;
    size_t index_build_count, index_build_time_total_ms;
    float  index_build_speedup_vs_cpu_avg;
    size_t search_kernel_launches_total;
    size_t search_p50_latency_us, search_p99_latency_us;
    float  batch_avg_size, batch_coalescing_ratio;
    bool   fp16_search_enabled;
    size_t fallback_events_total;
    size_t kernel_errors_total;
    size_t pinned_memory_pool_used_bytes;
};

Per-kernel timing is opt-in via GpuConfig::emit_kernel_timings; recent samples are retrievable from GpuProfiler::recent_timings(n).
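
How ELIPS derives the p50 / p99 fields is not stated here. One common approach, shown as a sketch, is a nearest-rank percentile over a window of latency samples; percentile_us is a hypothetical helper, and the sample window is assumed to come from something like the recent timings mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over latency samples in microseconds.
// Takes the samples by value so the caller's buffer is left untouched.
// `pct` is expected in (0, 100]; an empty window yields 0.
inline std::size_t percentile_us(std::vector<std::size_t> samples, double pct) {
    if (samples.empty()) return 0;
    const double rank = std::ceil(pct / 100.0 * static_cast<double>(samples.size()));
    std::size_t idx = rank < 1.0 ? 0 : static_cast<std::size_t>(rank) - 1;
    idx = std::min(idx, samples.size() - 1);
    // Partially orders the copy so that samples[idx] holds the value it would
    // have in a full sort, which is all a single percentile needs.
    std::nth_element(samples.begin(), samples.begin() + idx, samples.end());
    return samples[idx];
}
```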

Build flags

bash

# CPU-only — the default
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# GPU-enabled — backends auto-detected from the host toolchain
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DELIPS_GPU_ENABLED=ON

cmake --build build -j
ctest --test-dir build --output-on-failure

Reference: ADR-GPU-001 through ADR-GPU-010 in Design decisions.