Show an LLM the results from two retrieval configurations side by side (anonymized) for the same query, and ask which set of results is more relevant. Track wins and losses as an Elo rating across many queries. The pairwise judgment sidesteps the well-known LLM bias toward generous absolute relevance labels: when forced to choose, the model’s ranking tends to track human ranking even when its absolute scores drift high.

The setup uses the same data pipeline as LLM as a Judge for Preference Annotation, except that instead of scoring each (query, document) pair in isolation, you score (query, results-from-config-A, results-from-config-B) triples. The prompt template is something like: “Given this query, which of these two result sets better answers it? Pick A, B, or tie. Explain briefly.” Anonymize the configurations (don’t tell the LLM which is BM25 vs hybrid) so its prior on retrieval methods doesn’t bleed in.
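A minimal sketch of the prompt construction, assuming each configuration's results arrive as a list of text snippets. The names `build_pairwise_prompt`, `resolve_verdict`, and `format_results` are hypothetical helpers, not part of any library, and the call to the judge model is left out. Shuffling which configuration lands in slot A is an extra precaution against position bias that goes beyond the anonymization described above.

```python
import random

# Prompt wording taken from the template above; the layout around it is an assumption.
PROMPT_TEMPLATE = (
    "Given this query, which of these two result sets better answers it? "
    "Pick A, B, or tie. Explain briefly.\n\n"
    "Query: {query}\n\n"
    "Result set A:\n{set_a}\n\n"
    "Result set B:\n{set_b}\n"
)


def format_results(docs):
    """Render a ranked list of snippets as numbered lines."""
    return "\n".join(f"{i}. {doc}" for i, doc in enumerate(docs, start=1))


def build_pairwise_prompt(query, results_by_config, rng):
    """Build an anonymized prompt from {config_name: [snippet, ...]}.

    Returns the prompt and a label -> config_name mapping that stays
    on our side and is never shown to the judge.
    """
    names = list(results_by_config)
    if len(names) != 2:
        raise ValueError("expected exactly two configurations")
    rng.shuffle(names)  # randomize which configuration is shown as A
    label_to_config = {"A": names[0], "B": names[1]}
    prompt = PROMPT_TEMPLATE.format(
        query=query,
        set_a=format_results(results_by_config[names[0]]),
        set_b=format_results(results_by_config[names[1]]),
    )
    return prompt, label_to_config


def resolve_verdict(verdict, label_to_config):
    """Map the judge's 'A' / 'B' / 'tie' back to a config name (None for a tie)."""
    cleaned = verdict.strip().lower()
    if cleaned == "tie":
        return None
    label = cleaned.upper()
    if label not in label_to_config:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return label_to_config[label]
```

Keeping the label-to-configuration mapping outside the prompt means the judge only ever sees "A" and "B"; the verdict is translated back to a configuration name after the fact.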

Elo accumulates over many comparisons, the same way it does in chess or LMArena. Each query is a “match”; the winning configuration gains rating and the losing one loses it, with the size of the transfer depending on the rating gap: an upset win moves more points than an expected one. After enough matches, the Elo ranking is robust to occasional bad judgments because individual errors tend to average out. This is the same reason aggregate-revealed-preference signals like OpenRouter rankings tend to outperform single-shot evaluations.
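The update rule can be sketched with the standard Elo formulas. The K-factor of 32, the starting rating of 1500, and the 400-point scale are conventional chess defaults used here as assumptions, and `update_elo` and `rate_configs` are hypothetical names. A tie is scored as 0.5 for each side.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, config_a, config_b, score_a, k=32.0):
    """Apply one match in place. score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    delta = k * (score_a - expected_score(ratings[config_a], ratings[config_b]))
    ratings[config_a] += delta  # A gains what B loses, so the total is conserved
    ratings[config_b] -= delta
    return ratings


def rate_configs(matches, k=32.0, initial=1500.0):
    """Fold a sequence of (config_a, config_b, score_a) matches into ratings."""
    ratings = {}
    for config_a, config_b, score_a in matches:
        ratings.setdefault(config_a, initial)
        ratings.setdefault(config_b, initial)
        update_elo(ratings, config_a, config_b, score_a, k=k)
    return ratings
```

Sequential Elo depends on the order in which matches are processed, so one option is to shuffle the query order (or average over several shuffles) before comparing final ratings.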

When to reach for pairwise over absolute. Pairwise wins when absolute scores would be untrustworthy: when quality differences are small, when the judge model is systematically generous, or when you are comparing methods rather than scoring documents. Absolute scoring wins when you need a per-document label for downstream training (e.g. fine-tuning a reranker), or when you only have one configuration to evaluate. Both can run on the same query log; the choice is per-evaluation-purpose, not per-pipeline.