A reranker (cross-encoder) takes a query and a candidate document, concatenates them as [query] [SEP] [document], and produces a single relevance score. Unlike Embedding Models (bi-encoders), which encode query and document independently, the cross-encoder’s attention layers compare query tokens against document tokens directly, in the same forward pass. That joint processing is the architectural difference, and it’s why rerankers catch what bi-encoders miss: negation, subtle relevance distinctions, and whether a document actually answers the question vs. merely mentioning the same topic.

The cost is that every (query, document) pair needs its own forward pass. For a corpus of 100,000 documents that’s 100,000 forward passes per query, far too slow for first-pass retrieval. A bi-encoder computes document vectors once offline; at query time only the query needs encoding, and similarity is a cheap vector operation. So rerankers only operate on a pre-filtered candidate set, typically 50-100 candidates from BM25 or Hybrid Search.
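The two-stage shape described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `retrieve_then_rerank`, `pair_scorer`, and the toy vectors and scores are all hypothetical names and values, and the dot product merely stands in for whatever first-stage retriever (BM25, Hybrid Search) is actually in use.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Cheap first-stage similarity between two precomputed vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_vec: Vector,
    doc_vecs: Sequence[Vector],
    pair_scorer: Callable[[int], float],
    k: int = 100,
    top_n: int = 10,
) -> List[int]:
    # Stage 1: score every document with a cheap vector operation
    # and keep only the k best candidates.
    candidates = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )[:k]
    # Stage 2: one expensive scorer call per surviving candidate
    # (this stands in for a cross-encoder forward pass).
    reranked = sorted(candidates, key=pair_scorer, reverse=True)
    return reranked[:top_n]


# Toy data (made up): four documents, keep the top 2, then rerank them.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query_vec = [1.0, 0.0]
calls: List[int] = []


def toy_pair_scorer(doc_id: int) -> float:
    calls.append(doc_id)  # record each "forward pass"
    return {0: 0.2, 1: 0.9}.get(doc_id, 0.0)


top = retrieve_then_rerank(query_vec, doc_vecs, toy_pair_scorer, k=2, top_n=1)
print(top, len(calls))  # [1] 2
```

The expensive scorer runs `k` times per query regardless of corpus size, which is the whole point of pre-filtering: only the cheap first stage scales with the number of documents.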

Counter-intuitively, reranking more than ~50-100 candidates often degrades quality. Jacob et al. (2024), “Drowning in Documents”, found that scaling reranked-document count ultimately hurt Recall for modern cross-encoders. The reranker is trained on the kind of candidates a retriever produces (a mix of near-misses and true positives), not random corpus samples, so it works best on that distribution.
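One way to check whether this effect shows up on your own data is to sweep the rerank depth and watch the metric. The sketch below assumes you already have a first-stage ranking, a pair scorer, and relevance labels; `sweep_rerank_depth`, `recall_at_n`, and the toy scores are hypothetical, and the toy numbers are contrived to show the failure mode, not taken from the paper.

```python
from typing import Callable, Dict, Iterable, List, Set


def recall_at_n(ranked_ids: List[str], relevant_ids: Set[str], n: int) -> float:
    """Fraction of relevant documents that appear in the top n."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:n]) & relevant_ids) / len(relevant_ids)


def sweep_rerank_depth(
    first_stage_ids: List[str],
    pair_score: Callable[[str], float],
    relevant_ids: Set[str],
    depths: Iterable[int],
    n: int = 10,
) -> Dict[int, float]:
    results = {}
    for k in depths:
        # Rerank only the top-k first-stage candidates.
        candidates = first_stage_ids[:k]
        reranked = sorted(candidates, key=pair_score, reverse=True)
        results[k] = recall_at_n(reranked, relevant_ids, n)
    return results


# Contrived example: the scorer overrates two documents that only
# enter the candidate set at depth 4.
first_stage_ids = ["d1", "d2", "d3", "d4"]
toy_scores = {"d1": 0.5, "d2": 0.1, "d3": 0.9, "d4": 0.95}
results = sweep_rerank_depth(
    first_stage_ids, toy_scores.__getitem__, {"d1"}, depths=[2, 4], n=1
)
print(results)  # {2: 1.0, 4: 0.0}
```

If the metric peaks at a moderate depth and then falls, that depth is a reasonable default for the candidate set size.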

Three flavors dominate. Pointwise cross-encoders score each pair independently (Cohere Rerank, Zerank, Jina, BGE) and are the default. Listwise rerankers use an LLM to look at multiple candidates at once and output a permutation (RankGPT, RankZephyr); they are more expensive but can capture inter-document relationships. ColBERT and Late Interaction Models sit between bi-encoders and cross-encoders: they precompute token-level document embeddings and apply a cheap token-to-token interaction at search time, reaching near cross-encoder quality at much lower latency.
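For the late-interaction case, the "cheap interaction" in ColBERT is the MaxSim operator: each query token takes its best match among the document's token embeddings, and those maxima are summed. The sketch below uses hand-written 2-d vectors in place of real token embeddings; the function names and data are illustrative only.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (made up). In practice the document side is
# computed offline and only the query tokens are encoded at search time.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query token

print(maxsim(query_tokens, doc_a))  # 2.0
print(maxsim(query_tokens, doc_b))  # 1.0
```

Because document token embeddings are precomputed, the search-time cost is a set of dot products and maxima rather than a full transformer pass per pair, which is where the latency advantage over a cross-encoder comes from.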