LLM as a Judge for Preference Annotation

LLM-as-judge produces directionally correct relevance labels at an order of magnitude lower annotation cost than manual labeling. Given a (query, document) pair, the LLM answers “does this document help answer this query?” and returns a score or binary label. The result is an evaluation dataset for your search pipeline, one you can build in an afternoon instead of a week. It’s the technique behind much of recent IR benchmarking work (Zheng et al. 2023, Thomas et al. 2023).

The motivating problem is sharper for personal search than for academic benchmarks. Evaluating search quality on a personal knowledge base (an Obsidian vault, a note archive) has a unique constraint: only you know what’s relevant, so you can’t crowdsource annotations the way IR papers do. But sitting down to label hundreds of (query, result) pairs is tedious enough that nobody does it. LLM-as-judge offers a middle path: its approximate labels are directionally correct for most queries, and you manually correct only the cases where the LLM got it wrong.

The practical loop has three steps. (1) Passive query logging: every search you run gets logged with the query, the top N results, and which results you clicked (an implicit positive signal). (2) Batch LLM scoring: periodically feed the accumulated (query, result) pairs to an LLM with a prompt like “rate this document’s relevance: highly / somewhat / not. Explain briefly.” (3) Human review of disagreements: spend manual annotation time only where the LLM’s judgment conflicts with your click data or where you’re uncertain. See LLM Pairwise Preference Judging for a more robust variant that avoids the biases of absolute scoring by showing the LLM the results of two retrieval configurations side by side.
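The scoring and review steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `LoggedResult`, `build_prompt`, `parse_label`, and `review_queue` are hypothetical names, the `judge` argument stands in for whatever LLM client you use (any function that takes a prompt string and returns the reply text), and the rule for what counts as a disagreement is an assumption you may want to loosen or tighten.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# Label set taken from the example prompt wording; adjust to your own scale.
LABELS = ("highly", "somewhat", "not")


@dataclass(frozen=True)
class LoggedResult:
    """One (query, result) pair from the passive query log (hypothetical schema)."""
    query: str
    doc_id: str
    text: str
    clicked: bool  # implicit positive signal


def build_prompt(query: str, text: str) -> str:
    return (
        f"Query: {query}\n"
        f"Document: {text}\n"
        "Rate this document's relevance: highly / somewhat / not. "
        "Answer with the rating first, then explain briefly."
    )


def parse_label(reply: str) -> Optional[str]:
    """Return the leading label of the reply, or None if it is not recognised."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,:;") if words else ""
    return first if first in LABELS else None


def review_queue(
    rows: Iterable[LoggedResult],
    judge: Callable[[str], str],  # assumption: wraps your LLM call
) -> List[Tuple[LoggedResult, Optional[str], str]]:
    """Score every logged pair and keep only those that need a human look."""
    queue = []
    for row in rows:
        label = parse_label(judge(build_prompt(row.query, row.text)))
        if label is None:
            queue.append((row, label, "unparseable reply"))
        elif row.clicked and label == "not":
            queue.append((row, label, "clicked but judged not relevant"))
        elif not row.clicked and label == "highly":
            # Weaker signal: an unclicked result is not necessarily irrelevant.
            queue.append((row, label, "judged highly relevant but never clicked"))
    return queue
```

Passing the judge in as a plain callable keeps the loop independent of any particular model or API, and makes it easy to dry-run the review logic with a stub before spending tokens.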

With labeled (query, document, relevance) triples you can compute standard IR metrics (Hit@k, MRR, nDCG@k) on your actual queries and data, comparing configurations such as BM25 alone vs. Hybrid Search, or with vs. without a Reranker. The labels are more than an eval dataset; they are also a tuning signal for the weighting between BM25 and embeddings, the reranker prompt wording, or even a small embedding model fine-tuned on your notion of similarity.
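As a rough sketch of how these metrics fall out of the labels, the functions below take a ranked list of document IDs for one query plus a mapping from document ID to a numeric grade. The mapping highly = 2, somewhat = 1, not = 0 is an assumption, as are the function names. The nDCG here uses linear gain; some implementations use an exponential gain of 2^grade - 1 instead, which gives different numbers.

```python
import math
from typing import Dict, Mapping, Sequence


def hit_at_k(ranked: Sequence[str], grades: Mapping[str, int], k: int) -> float:
    """1.0 if any of the top-k results has a positive grade, else 0.0."""
    return 1.0 if any(grades.get(d, 0) > 0 for d in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], grades: Mapping[str, int]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is relevant."""
    for rank, doc_id in enumerate(ranked, start=1):
        if grades.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: Sequence[str], grades: Mapping[str, int], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    gains = [grades.get(d, 0) for d in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate(
    runs: Mapping[str, Sequence[str]],          # query -> ranked doc IDs
    judgments: Mapping[str, Mapping[str, int]],  # query -> {doc ID: grade}
    k: int = 10,
) -> Dict[str, float]:
    """Average each metric over the queries that have judgments."""
    queries = [q for q in runs if q in judgments]
    if not queries:
        return {"hit@k": 0.0, "mrr": 0.0, "ndcg@k": 0.0}
    n = len(queries)
    return {
        "hit@k": sum(hit_at_k(runs[q], judgments[q], k) for q in queries) / n,
        "mrr": sum(reciprocal_rank(runs[q], judgments[q]) for q in queries) / n,
        "ndcg@k": sum(ndcg_at_k(runs[q], judgments[q], k) for q in queries) / n,
    }
```

Running `evaluate` once per configuration (for example one `runs` mapping for BM25 alone and one for hybrid search) over the same `judgments` gives directly comparable numbers. Unjudged documents count as grade 0 here, which penalises a configuration for surfacing results the LLM never scored, so it is worth scoring the union of results across all configurations before comparing.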

Limitations are real. LLMs have their own biases: they may favor well-written documents over messy personal notes that contain the actual answer. Deeply personal content (“that thing from the meeting with Will”) falls outside the LLM’s context entirely; those cases still need human labels. Token costs aren’t free either, though small models or local inference keep batch scoring cheap.

Achhina's Digital Garden
