Reranker - Cross-Encoder Rescoring
Core Principle
A reranker (cross-encoder) takes a query and a candidate document, concatenates them into a single input [query] [SEP] [document], and produces a relevance score. Unlike Embedding Models for Semantic Similarity, which encode query and document independently, the reranker's transformer attention layers can directly compare words in the query against words in the document in the same forward pass.
This joint processing is the fundamental architectural difference. It's why rerankers catch things bi-encoders miss: negation, subtle relevance distinctions, and whether a document actually answers the question or merely mentions the same topic.
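The pointwise flow can be sketched in plain Python. The token-overlap `score` function below is a toy stand-in for a real cross-encoder forward pass (in practice you'd call a model such as sentence-transformers' CrossEncoder); only the joint input format and the pair-at-a-time scoring loop are the point.

```python
# Sketch of pointwise cross-encoder rescoring. The toy token-overlap
# score stands in for a real transformer forward pass over the joint
# "[CLS] query [SEP] document [SEP]" input.

def make_pair_input(query: str, document: str) -> str:
    # A cross-encoder sees both texts in ONE sequence, so attention
    # can compare query tokens against document tokens directly.
    return f"[CLS] {query} [SEP] {document} [SEP]"

def score(query: str, document: str) -> float:
    # Stand-in relevance score: fraction of query tokens found in the
    # document. A real reranker replaces this with a model forward pass.
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def rerank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    # One forward pass per (query, document) pair, then sort by score.
    scored = [(score(query, doc), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "The capital of France is Paris.",
    "France exports wine and cheese.",
]
ranking = rerank("what is the capital of France", docs)
print(ranking[0][1])  # -> "The capital of France is Paris."
```

Note that every candidate costs one full scoring pass, which is exactly the cost problem the next section addresses.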
Why Not Use It for Everything?
The cross-encoder must run a full forward pass for every (query, document) pair. For a corpus of 100,000 documents, that’s 100,000 forward passes per query — far too slow. A bi-encoder computes document vectors once offline; at query time, only the query needs encoding, and similarity is a cheap vector operation.
This is why rerankers only operate on a pre-filtered candidate set (typically 50-100 documents from BM25 - Best Matching 25 Ranking Function or Hybrid Search - Combining Sparse and Dense Retrieval), not the full corpus.
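The two-stage shape can be sketched as follows. Both `embed` and `cross_score` are toy stand-ins (a set of tokens instead of a dense vector, token overlap instead of a model), but the cost structure they illustrate is the real one: the corpus index is built once offline, and the expensive pairwise scorer only touches the top-K candidates.

```python
# Two-stage retrieve-then-rerank sketch. embed() stands in for a
# bi-encoder and cross_score() for a cross-encoder; both are toys.

def tokenize(text: str) -> set[str]:
    return set(text.lower().replace(".", "").split())

def embed(text: str) -> set[str]:
    # Toy "embedding": the set of lowercased tokens. A real bi-encoder
    # returns a dense vector from one forward pass.
    return tokenize(text)

def similarity(a: set[str], b: set[str]) -> float:
    # Cheap vector-style comparison (Jaccard overlap here).
    return len(a & b) / max(len(a | b), 1)

def cross_score(query: str, document: str) -> float:
    # Toy cross-encoder: fraction of query tokens present in the doc.
    q = tokenize(query)
    return len(q & tokenize(document)) / max(len(q), 1)

def search(query: str, corpus: list[str], index: list[set[str]],
           first_pass_k: int = 100) -> list[str]:
    # Stage 1: score ALL documents with the cheap similarity, keep top-K.
    q_vec = embed(query)  # only the query is encoded at query time
    candidates = sorted(range(len(corpus)),
                        key=lambda i: similarity(q_vec, index[i]),
                        reverse=True)[:first_pass_k]
    # Stage 2: run the expensive pairwise scorer on just K candidates.
    reranked = sorted(candidates,
                      key=lambda i: cross_score(query, corpus[i]),
                      reverse=True)
    return [corpus[i] for i in reranked]

corpus = ["Paris is the capital of France.",
          "Berlin is the capital of Germany.",
          "France borders Germany."]
index = [embed(d) for d in corpus]   # built once, offline
print(search("capital of France", corpus, index, first_pass_k=2))
```

Swapping stage 1 for BM25 or hybrid retrieval changes only how `candidates` is produced; the reranker's role is unchanged.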
The Sweet Spot for Candidate Count
Research suggests reranking 50-100 candidates is optimal. Feeding more documents to the reranker shows diminishing returns and can actually degrade performance: the paper "Drowning in Documents" (2024) found that scaling the number of reranked documents ultimately hurt recall for modern cross-encoders. The reranker is trained on the kind of candidates a retriever produces (a mix of near-misses and true positives), not random corpus samples, so it works best on that distribution.
Types of Rerankers
- Pointwise cross-encoders — score each (query, doc) pair independently. Most common. Examples: Cohere Rerank, Zerank, Jina Reranker, BGE-reranker.
- Listwise rerankers — use an LLM to look at multiple candidates at once and output a permutation (ranking). More expensive but can capture inter-document relationships. Examples: RankGPT, RankZephyr.
- ColBERT / late interaction — a middle ground where query and document tokens each get their own embeddings, then interact via cheap token-level operations at search time. Near cross-encoder quality at much better latency.
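The late-interaction scoring rule (ColBERT's MaxSim) is simple enough to state directly. Below is a minimal numpy sketch with random stand-in token embeddings; only the scoring formula is the point.

```python
import numpy as np

# ColBERT-style late interaction (MaxSim): every query token and every
# document token keeps its own embedding. The score is, for each query
# token, its best match among document tokens, summed over the query.
# Document token embeddings can be precomputed offline, so only this
# cheap max/sum runs at query time.

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim).
    # Rows are assumed L2-normalized, so dot products are cosine sims.
    sims = query_emb @ doc_emb.T            # (q_tokens, d_tokens)
    return float(sims.max(axis=1).sum())    # best doc token per query token

rng = np.random.default_rng(0)

def fake_token_embs(n_tokens: int, dim: int = 8) -> np.ndarray:
    # Random stand-in for a real encoder's per-token embeddings.
    x = rng.normal(size=(n_tokens, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

q = fake_token_embs(4)
docs = [fake_token_embs(12), fake_token_embs(9)]
scores = [maxsim(q, d) for d in docs]
print(scores)  # one late-interaction score per candidate document
```

Because the interaction is just a matrix multiply followed by a max and a sum, it is orders of magnitude cheaper than a transformer forward pass per pair, which is where the latency advantage comes from.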
Current Landscape (early 2026)
Top performers: Zerank 2 (highest ELO in head-to-head benchmarks), Cohere Rerank v4.0 Pro (close second), Voyage Rerank 2.5 (best latency-quality tradeoff). Open-source: Qwen3-Reranker series (0.6B/4B/8B), Jina Reranker v2, BGE-reranker-v2-m3.
When Reranking Helps Most
The biggest gains come when first-pass retrieval returns results with high lexical or semantic overlap but varying actual relevance — overlapping content, ambiguous queries, or corpora where many documents discuss similar topics in different ways. The benefit scales with content ambiguity, not raw corpus size.
Related Ideas
- Embedding Models for Semantic Similarity
- BM25 - Best Matching 25 Ranking Function
- Hybrid Search - Combining Sparse and Dense Retrieval
- ColBERT and Late Interaction Models
- LLM as a Judge for Preference Annotation
References
- Nogueira, R., & Cho, K. (2019). “Passage Re-ranking with BERT.” arXiv:1901.04085.
- Khattab, O., & Zaharia, M. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR.
- Boytsov, L., et al. (2024). “Drowning in Documents: Consequences of Scaling Reranker Inference.” arXiv:2411.11767.