How it works

Most neural search systems follow a cascade pipeline: a fast first-stage retriever (BM25 or a bi-encoder) pulls roughly 1,000 candidates, then a slow but accurate reranker (a cross-encoder) re-scores the top ~100. The reranker alone takes 100–500ms and accounts for most of the latency in Information Retrieval (IR) applications. This is the standard Retrieval Augmented Generation (RAG) pipeline in most production systems today.

The question Hydra poses: can we eliminate the complexity of cascade pipelines, reducing latency and the number of moving parts, while maintaining competitive retrieval quality?

Hydra replaces the multi-model pipeline of standard RAG applications with a single BERT model that produces both sparse and dense retrieval signals in one forward pass. The sparse signal (SPLADE) handles semantic term expansion: it knows “temperature” relates to “heat” and “weather”. The dense signal (ColBERT) acts as a reranker, capturing fine-grained token-level alignment via “late interaction”. Combined with BM25 for exact keyword matching, the three signals are fused via Reciprocal Rank Fusion (RRF).
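For reference, Reciprocal Rank Fusion combines ranked lists using only rank positions, so no cross-signal score normalization is needed. A minimal sketch of generic RRF (the function name and the k = 60 constant are the usual convention from the RRF literature, not taken from Hydra's code):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ordering.

    rankings: list of lists, each ordered best-first.
    k: damping constant (60 is the commonly used default).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse BM25, SPLADE, and ColBERT orderings of the same candidates.
fused = reciprocal_rank_fusion([
    ["d3", "d1", "d7"],   # BM25 order
    ["d1", "d3", "d9"],   # SPLADE order
    ["d1", "d7", "d3"],   # ColBERT order
])
print(fused)  # ['d1', 'd3', 'd7', 'd9']
```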


The three signals

SPLADE (Sparse Lexical and Expansion Model)

When you search for something like “why did the roman empire collapse”, a keyword system only matches documents containing those exact words. SPLADE uses a learned vocabulary to expand the query with related terms the model knows are semantically connected, such as “fall” and “rome”.

Now a document titled “The Fall of Rome” matches, even though neither “fall” nor “Rome” appeared in the original query. The underlying problem is vocabulary mismatch, which SPLADE reduces by learning which words relate to which.

Technically, SPLADE repurposes BERT’s Masked Language Modeling (MLM) head. The MLM head was trained to predict missing words from context, meaning it already knows which words are interchangeable. SPLADE takes those predictions and turns them into a sparse retrieval signal that can be looked up in an inverted index, similar to BM25.
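A minimal sketch of the standard SPLADE formulation: run the MLM head, apply log(1 + ReLU(·)) to the vocabulary logits, and max-pool over the sequence. It is shown here with the teacher checkpoint named later in this post; Hydra's own head may differ in detail.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Any SPLADE checkpoint works; this one is the teacher model mentioned below.
model_name = "naver/splade-cocondenser-ensembledistil"
tok = AutoTokenizer.from_pretrained(model_name)
mlm = AutoModelForMaskedLM.from_pretrained(model_name).eval()

query = "why did the roman empire collapse"
batch = tok(query, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**batch).logits                       # (1, seq_len, vocab_size)

weights = torch.log1p(torch.relu(logits))              # SPLADE activation
weights = weights * batch["attention_mask"].unsqueeze(-1)   # zero out padding
sparse_vec, _ = weights.max(dim=1)                     # max-pool over tokens -> (1, vocab_size)

# Inspect the top expansion terms the model added to the query.
top = torch.topk(sparse_vec[0], k=10)
terms = tok.convert_ids_to_tokens(top.indices.tolist())
print(list(zip(terms, [round(v, 2) for v in top.values.tolist()])))
```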


ColBERT (Contextualized Late Interaction over BERT)

Traditional search compresses an entire document into a single vector and compares it to the query vector. This is fast but lossy: a 768-dimensional vector can’t capture the granular details of a 200+ word passage.

ColBERT keeps every token’s embedding instead of pooling into a single vector. At query time, each query token finds its best-matching document token (MaxSim): for each query word, find the closest document word, then sum those scores. “Neural” matches “neural”, “network” matches “network” or “architecture” — fine-grained alignment without the cost of a full cross-encoder.

Document embeddings are precomputed offline, so at query time it’s just dot products over stored vectors.
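A minimal sketch of the MaxSim scoring this describes; the embedding dimensions are illustrative, and the real embeddings come from the shared backbone rather than random tensors.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT late-interaction score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings.
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings.
    For each query token, take the similarity of its best-matching document
    token, then sum those maxima over the query tokens.
    """
    sim = query_emb @ doc_emb.T           # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()    # MaxSim per query token, then sum

# Toy example with random embeddings standing in for precomputed ones.
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```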

BM25 (Best Matching 25)

Standard, classic term-frequency scoring. If you search for “Geoffrey Hinton”, you get exact matches. BM25 handles author names, paper IDs, and specific terminology that the learned models might underweight. It runs over the same inverted index as SPLADE with zero additional overhead.
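For completeness, the textbook Okapi BM25 score for a single document; k1 and b are the common defaults, and this is a sketch rather than the post's actual index implementation.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Okapi BM25 for one document.

    doc_freq: dict mapping term -> number of documents containing that term.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in doc_freq:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score
```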


How they combine

  1. Encode: The query passes through a single BERT model (using the ONNX runtime and quantized to INT8) on a CPU. One forward pass produces both the SPLADE sparse vector and ColBERT token embeddings. The raw tokens are also passed to BM25.
  2. Retrieve: SPLADE and BM25 each look up their posting lists in a shared inverted index. SPLADE uses the top 10 expansion terms; BM25 uses the raw query tokens. Scores are normalized and fused with a 70/30 weighted sum.
  3. Rescore: The top 50 candidates from the fused list are rescored with ColBERT via MaxSim. Document token embeddings are stored as INT8 in a memory-mapped file, which means the OS pages in only the ~1MB needed for 50 documents. A sketch of this fuse-and-rescore step follows the list.
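A hedged sketch of steps 2 and 3: fuse min-max-normalized SPLADE and BM25 scores with the 70/30 weighting, then rescore the top 50 with MaxSim. The weights and the cutoff come from the text; the data structures, min-max normalization, and names are assumptions.

```python
import numpy as np
import torch

def fuse_and_rescore(splade_scores, bm25_scores, colbert_doc_embs, query_emb,
                     splade_weight=0.7, bm25_weight=0.3, rescore_k=50):
    """splade_scores, bm25_scores: dict doc_id -> raw score from the inverted index.
    colbert_doc_embs: dict doc_id -> (num_doc_tokens, dim) normalized token embeddings.
    query_emb: (num_query_tokens, dim) normalized ColBERT query embeddings."""
    def minmax(scores):
        vals = np.array(list(scores.values()), dtype=np.float32)
        lo, hi = vals.min(), vals.max()
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    s_norm, b_norm = minmax(splade_scores), minmax(bm25_scores)
    candidates = set(s_norm) | set(b_norm)
    fused = {d: splade_weight * s_norm.get(d, 0.0) + bm25_weight * b_norm.get(d, 0.0)
             for d in candidates}
    top = sorted(fused, key=fused.get, reverse=True)[:rescore_k]

    # Late-interaction rescoring of the short list only.
    rescored = {}
    for d in top:
        sim = query_emb @ colbert_doc_embs[d].T
        rescored[d] = sim.max(dim=1).values.sum().item()
    return sorted(rescored, key=rescored.get, reverse=True)
```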

A concrete example

Searching “diseases spread by mosquitoes in tropical regions” across roughly 3M arXiv papers:

Total time: 66ms server-side (35ms encode, 29ms retrieve, 2ms rescore), evaluating 724,967 candidates.


Hydra — One model, two heads

This idea builds on SPLATE (Formal et al., 2024), which showed that SPLADE and ColBERT representations can come from the same BERT backbone. SPLATE attached an adapter to a frozen ColBERT model to produce sparse vectors but still used two separate models at serving time. Hydra takes this further and proposes a jointly-trained model: one backbone, two heads, with a single forward pass.

The key architectural choice: SPLADE and ColBERT share a single BERT backbone. Running ColBERTv2 and SPLADE separately requires 219M parameters and two forward passes; Hydra uses a single 109.5M-parameter backbone and a single forward pass, adding only 720,954 parameters for both heads (98,304 for ColBERT, 622,650 for SPLADE).

The backbone is initialized from ColBERTv2 (already trained for dense retrieval). This is transfer learning: rather than training from scratch, Hydra inherits ColBERTv2’s language understanding and adapts it. Layers 0-9 are frozen to preserve those pretrained representations, and only layers 10-11 and the two heads are fine-tuned, so just 15.5M of the 110.2M total parameters are updated (86% frozen).
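A sketch of the layout this describes: a shared encoder with layers 0-9 frozen, a 128-dimensional ColBERT projection (768 × 128 = 98,304 parameters, matching the figure above), and a SPLADE head shaped like a BERT MLM transform head with its decoder tied to the embedding matrix (590,592 + 1,536 + 30,522 = 622,650 parameters, which matches the quoted count but is an inference, not confirmed). The checkpoint name and all identifiers are illustrative; the real backbone starts from ColBERTv2 weights.

```python
import torch
from torch import nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("bert-base-uncased")   # stand-in; Hydra starts from ColBERTv2
hidden, vocab = backbone.config.hidden_size, backbone.config.vocab_size

# ColBERT head: 768 -> 128 linear projection, no bias (98,304 parameters).
colbert_head = nn.Linear(hidden, 128, bias=False)

# SPLADE head (guess): MLM-style transform + LayerNorm + vocab bias, decoder
# weight tied to the input embeddings (590,592 + 1,536 + 30,522 = 622,650).
splade_transform = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden))
vocab_bias = nn.Parameter(torch.zeros(vocab))

# Freeze the embeddings and encoder layers 0-9; layers 10-11 and both heads stay trainable.
for p in backbone.embeddings.parameters():
    p.requires_grad = False
for i, layer in enumerate(backbone.encoder.layer):
    for p in layer.parameters():
        p.requires_grad = i >= 10

def forward(input_ids, attention_mask):
    h = backbone(input_ids, attention_mask=attention_mask).last_hidden_state
    colbert_embs = nn.functional.normalize(colbert_head(h), dim=-1)       # token embeddings
    tied_decoder = backbone.embeddings.word_embeddings.weight             # (vocab, hidden)
    splade_logits = splade_transform(h) @ tied_decoder.T + vocab_bias     # (batch, seq, vocab)
    return colbert_embs, splade_logits
```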

Configuration                      Parameters
ColBERTv2 + SPLADE (separate)      220.0M
SPLATE                             110.6M
Hydra                              110.2M

The SPLADE head is trained via knowledge distillation from a pretrained SPLADE teacher model (naver/splade-cocondenser-ensembledistil). The teacher produces target sparse vectors, and the student learns to match them via Mean Squared Error (MSE). This lets the SPLADE head learn the teacher’s sparse expansions without the backbone drifting from its dense-retrieval pretraining.
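A minimal sketch of that objective as described above: both student and teacher logits are pooled into SPLADE vectors and compared with MSE. Names are illustrative, not Hydra's actual training code.

```python
import torch
import torch.nn.functional as F

def splade_distillation_loss(student_logits, teacher_logits, attention_mask):
    """MSE between student and teacher SPLADE vectors.

    Both logit tensors are (batch, seq_len, vocab_size); the sparse vector is
    the usual SPLADE pooling: log1p(relu(logits)), max-pooled over tokens.
    """
    def to_sparse(logits):
        acts = torch.log1p(torch.relu(logits)) * attention_mask.unsqueeze(-1)
        return acts.max(dim=1).values          # (batch, vocab_size)

    return F.mse_loss(to_sparse(student_logits), to_sparse(teacher_logits))
```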

Why freeze most of the backbone?

The first version of Hydra (v2) trained all 12 layers jointly. The result: the SPLADE head barely learned (weights moved only 2.4% from initialization), while the backbone’s hidden states drifted massively from ColBERTv2’s original representations (cosine similarity as low as 0.29). The SPLADE expansion terms were broken: searching “what causes climate change” returned terms like that, i, he instead of climate, weather, cause.

As a sanity check, simply attaching a pretrained SPLADE head to ColBERTv2 with no training at all (a “Frankenstein” baseline) scored 0.6515 on SciFact, better than the fully trained v2 model (0.6264). This confirmed that backbone drift was the problem.

Method                                         SciFact   NFCorpus
All layers unfrozen, no distillation           0.6264    0.2712
No training (pretrained SPLADE + ColBERTv2)    0.6515    0.3273
v3: layers 0–9 frozen + distillation           0.6659    0.3376
ColBERTv2 standalone                           0.6422    0.3255
SPLADE standalone                              0.7089    0.3539

Freezing layers 0-9 and training only layers 10-11 with SPLADE distillation (v3) gave the best results, exceeding both the no-training baseline and standalone ColBERTv2. The frozen layers preserve the general language understanding from pretraining, while the top layers adapt to produce both signals without drifting.

The sparsity collapse problem

A naive approach, training both heads with contrastive ranking losses, fails. The SPLADE ranking loss rewards activating fewer, more discriminative terms (in other words, increasing sparsity): if the model only activates the terms “climate” and “change”, positive documents score high while negatives score near zero. This drives a “sparsity collapse”: a large number of terms are activated initially, then fewer than 10 by the end of training. We tested three configurations (with FLOPS regularization, without FLOPS, without top-k during training) and all collapsed identically. Knowledge distillation from a pretrained SPLADE teacher was the only approach that maintained rich term expansions: because the target itself is a rich term expansion, there is no incentive to collapse.
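For reference, the FLOPS regularizer mentioned above (from the SPLADE papers) penalizes the average activation per vocabulary term, which encourages sparser vectors; a minimal sketch:

```python
import torch

def flops_regularizer(sparse_batch: torch.Tensor) -> torch.Tensor:
    """FLOPS penalty: squared mean activation per vocabulary term, summed
    over the vocabulary. sparse_batch: (batch, vocab_size) SPLADE vectors."""
    return (sparse_batch.mean(dim=0) ** 2).sum()
```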

Training

The model is trained on MS MARCO passage ranking (502,939 examples), each with 1 positive passage and 7 hard negatives from BM25 and dense retrieval models via sentence-transformers/msmarco-hard-negatives. Training runs for 3 epochs on an A100 GPU.

The training curves show clear diminishing returns across epochs. The SPLADE distillation loss drops sharply in the first few thousand steps, then creeps down slowly over the remaining ~89K steps. The ColBERT contrastive loss plateaus early and barely moves across all three epochs (1.80 → 1.77 → 1.76). Total improvement from epoch 2 to epoch 3 is under 1%.

Smoothed training loss (rolling average, window=10) over 94K steps. SPLADE distillation converges by step ~5K. ColBERT contrastive loss plateaus early and barely improves across epochs.

Epoch   Total loss   ColBERT loss   SPLADE loss
1       0.9306       1.7997         0.3512
2       0.8177       1.7668         0.1850
3       0.8098       1.7559         0.1790

The training loss keeps decreasing, but the benchmarks on the validation set are concerning. SPLADE improves steadily with more training (distillation keeps it stable), but ColBERT’s contribution degrades: layers 10-11 are pulled toward the SPLADE loss during training, which hurts ColBERT’s fine-grained token matching.

Checkpoint        SciFact (e2e)
Step 2,000        0.6691
Step 5,000        0.6657
Step 10,000       0.6659
Epoch 1 (31K)     0.6583
Epoch 2 (63K)     0.6557
Epoch 3 (94K)     0.6505

Step 5,000 is the sweet spot where both heads are working well together. After that, more training helps SPLADE marginally but destroys ColBERT’s contribution. We select the step 5,000 checkpoint based on joint performance across both evaluation datasets. While step 2,000 achieves marginally higher SciFact (0.6691 vs 0.6657), step 5,000 shows stronger NFCorpus performance (0.3380 vs 0.3351) and more stable ColBERT contribution across both datasets. The production model uses the step 5,000 checkpoint.

This is a key finding: joint training with distillation converges rapidly; extended training degrades the dense retrieval head. The loss curves look like the model is improving, but the end-to-end quality peaks early and then declines.

The distillation loss required a 1000x scale factor to balance against ColBERT’s contrastive loss. Without scaling, the raw MSE distillation loss (~0.0006) was approximately 3000x smaller than the ColBERT loss (~1.8), meaning ColBERT’s gradients completely dominated in training. This caused SPLADE term collapse, with queries producing only 2-3 meaningful terms instead of 10+. The 1000x multiplier brings both losses to comparable magnitude (~0.4 vs ~1.8), allowing both heads to receive meaningful gradient signals through the shared layers.
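In code terms, the balancing described above amounts to something like the following; the 1000x constant comes from the text, and the variable names are illustrative.

```python
# Scale the SPLADE distillation MSE so both heads receive comparable gradients
# through the shared, unfrozen layers 10-11.
DISTILL_SCALE = 1000.0

def total_loss(colbert_contrastive_loss, splade_mse_loss):
    # Raw MSE (~0.0006) is scaled to ~0.4 against a ~1.8 contrastive loss.
    return colbert_contrastive_loss + DISTILL_SCALE * splade_mse_loss
```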


Why it’s fast

The common assumption is that neural search requires GPUs. Hydra challenges this: model inference is extremely fast, with the trade-off of a one-time offline indexing run. Once the index is built, serving requires no GPU at all; the entire query pipeline runs on CPU.
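A hedged sketch of that CPU serving path with ONNX Runtime and dynamic INT8 quantization. The file names, graph input names, output ordering, and tokenizer checkpoint are placeholders, not Hydra's actual artifacts.

```python
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoTokenizer

# Assumes a float ONNX export of the encoder already exists at this path.
quantize_dynamic("hydra_encoder.onnx", "hydra_encoder.int8.onnx",
                 weight_type=QuantType.QInt8)

session = ort.InferenceSession("hydra_encoder.int8.onnx",
                               providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

batch = tok("diseases spread by mosquitoes in tropical regions",
            return_tensors="np", truncation=True)
outputs = session.run(None, {
    "input_ids": batch["input_ids"].astype(np.int64),
    "attention_mask": batch["attention_mask"].astype(np.int64),
})
# outputs would hold the SPLADE sparse vector and the ColBERT token embeddings
# in whatever order the exported graph defines (an assumption here).
```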


Benchmarks

Latency

Measured on the live deployment (16 vCPU, 32 GB RAM, no GPU) across 1,450 queries, excluding network overhead.

p50 (median): 55ms
p95 (95th percentile): 78ms
p99 (99th percentile): 94ms
max: 129ms

Server-side latency across 1,450 queries on CPU (16 vCPU, no GPU). Every query runs the full neural pipeline live.

Pipeline breakdown

Charts: p50 and p95 latency for the neural and retrieval stages, the overall latency distribution, and latency vs. query length (p50 values across 1,450 queries).

Network latency (browser ↔ server) adds 150–250ms depending on distance to US-West. The server-side compute is not the bottleneck — the speed of light is.

Retrieval quality

Latency means nothing if the results are bad. Evaluated zero-shot on BEIR (NDCG@10), with no fine-tuning on the target datasets:

Method                         SciFact   NFCorpus   Pipeline                 Latency
BM25 (keyword only)            0.665     0.325      1 stage                  ~10ms
bi-encoder + cross-encoder     ~0.735    ~0.369     2 stages (typical RAG)   ~200–500ms
ColBERTv2                      0.642     0.326      1 stage                  ~50ms
SPLADE                         0.709     0.354      1 stage                  ~30ms
Hydra (e2e)                    0.6657    0.3380     1 stage                  ~54ms

The typical RAG pipeline (bi-encoder retrieval + cross-encoder reranker) achieves the best quality, but at 200–500ms latency, with a GPU requirement and two separate models to deploy and maintain. BM25 is fast and simple but misses semantic matches entirely.

Hydra sits in a unique position: near-GPU quality at CPU speed, in a single model. It outperforms ColBERTv2 despite sharing a backbone across two heads with 86% of parameters frozen. It doesn’t match the full bi-encoder + reranker cascade — but it eliminates the reranker entirely, halving infrastructure complexity and latency.


Known limitations

Proper name matching: Searching “Geoffrey Hinton” works, but “George Hinton” returns papers by anyone named George and anyone named Hinton separately. BM25 matches individual tokens, not multi-word entities. SPLADE doesn’t help — it expands semantically, not by correcting names. A dedicated entity recognition layer would fix this, but it’s not part of the current pipeline.

Vocabulary mismatch in SPLADE: BERT’s WordPiece tokenizer splits unfamiliar words into subwords. “COVID” becomes “co” + “##vid”, which can produce unrelated expansion terms.

First-stage recall ceiling: ColBERT only rescores what SPLADE + BM25 filters. If a relevant document isn’t in the top 100 candidates, ColBERT never sees it.

Single model tradeoff: Standalone SPLADE and bi-encoder models outperform Hydra on benchmarks (they dedicate full model capacity to one retrieval method). Hydra trades peak accuracy for the ability to compute all three signals in one low-latency forward pass.


References

  1. Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. SIGIR 2020.

  2. Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. 2021.

  3. Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. SPLADE: Sparse lexical and expansion model for first stage ranking. SIGIR 2021.

  4. Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. SIGIR 2022.

  5. Thibault Formal, Carlos Lassance, and Stéphane Clinchant. SPLATE: Sparse Late Interaction Retrieval. 2024.

  6. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS 2021.