GPUs hate small launches. The DynamicBatcher exists to make them stop being small.

Concurrent queries wait inside a tiny window, then ride one kernel launch as a batch.

Configurable knobs (include/elips/gpu_engine/GpuConfig.hpp)

window_us — how long to wait for stragglers (microseconds)
max_batch — hard cap on coalesced queries per launch
algorithm — brute-force, IVF-Flat, IVF-PQ, graph (CAGRA), hybrid

GpuIngestionPipeline streams record vectors onto the device. GpuQuantizationPipeline trains IVF-PQ centroids and pre-encodes residuals. GpuSearchPipeline orchestrates the batched distance + top-k path. GpuProfiler instruments every stage.

cpp

auto cfg = elips::Config{}
    .dimension(768)
    .metric(elips::Metric::cosine)
    .gpu(elips::gpu::GpuConfig{}
        .policy(elips::gpu::GpuPolicy::PreferGpu)
        .algorithm(elips::gpu::Algorithm::cagra)
        .window_us(500)
        .max_batch(64));