LLM as a Judge for Preference Annotation

Core Principle

Instead of manually labeling search results as relevant or irrelevant, you use an LLM to judge relevance on your behalf. Given a (query, document) pair, the LLM answers “does this document help answer this query?” and returns a score or binary label. This lets you build evaluation datasets for your search pipeline at a fraction of the cost of human annotation.
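A minimal sketch of the judging step, assuming a graded three-way label scale. The prompt template, label names, and parser are illustrative, not a fixed interface; the actual LLM call is whatever client you use.

```python
# Hypothetical sketch: build a relevance-judgment prompt and map the LLM's
# free-text reply to a graded score. All names here are illustrative.

JUDGE_PROMPT = """Given the query and the document below, rate the document's
relevance as exactly one of: highly relevant, somewhat relevant, not relevant.
Explain briefly.

Query: {query}

Document:
{document}
"""

LABELS = {"highly relevant": 2, "somewhat relevant": 1, "not relevant": 0}

def build_judge_prompt(query: str, document: str) -> str:
    return JUDGE_PROMPT.format(query=query, document=document)

def parse_judgment(reply: str) -> int:
    """Map the model's reply to a graded score (2 / 1 / 0)."""
    text = reply.lower()
    # Check "not relevant" first so a negation isn't mistaken
    # for a positive label.
    for label in ("not relevant", "somewhat relevant", "highly relevant"):
        if label in text:
            return LABELS[label]
    raise ValueError(f"unparseable judgment: {reply!r}")
```

Binary labels work too; the graded scale is what nDCG later consumes.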

Evaluating search quality on a personal knowledge base (like an Obsidian vault) has a unique problem: only you know what’s relevant, so you can’t crowdsource annotations like academic IR benchmarks do. But sitting down and labeling hundreds of (query, result) pairs is tedious enough that nobody does it.

LLM-as-judge creates a middle path: the LLM provides approximate labels that are directionally correct for most queries, and you only need to manually correct the cases where the LLM got it wrong. This reduces annotation effort by an order of magnitude while still producing a usable evaluation dataset.

How It Works in Practice

The most practical version for personal search evaluation:

  1. Passive query logging — every search you run gets logged with the query, top N results, and which results you clicked (implicit positive signal)
  2. Batch LLM scoring — periodically feed the accumulated (query, result) pairs to an LLM with a prompt like: “Given this query, rate this document’s relevance as: highly relevant / somewhat relevant / not relevant. Explain briefly.”
  3. Human review of disagreements — focus manual annotation time on cases where the LLM’s judgment conflicts with your click data, or where you’re uncertain
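Step 3 can be mechanized as a simple filter. A sketch, assuming each logged record carries a click flag and the LLM's graded score (field names are assumptions, not a fixed schema):

```python
# Flag (query, result) pairs where the LLM's judgment conflicts with click
# behavior, so manual review time goes only to the contested cases.

def find_disagreements(records):
    """records: dicts with 'query', 'doc_id', 'clicked' (bool), and
    'llm_score' (0 = not, 1 = somewhat, 2 = highly relevant)."""
    flagged = []
    for r in records:
        # Clicked but judged irrelevant: the LLM may be missing context.
        if r["clicked"] and r["llm_score"] == 0:
            flagged.append(r)
        # Skipped but judged highly relevant: possible LLM generosity,
        # or a result you overlooked.
        elif not r["clicked"] and r["llm_score"] == 2:
            flagged.append(r)
    return flagged
```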

Evaluation Metrics This Enables

With labeled (query, document, relevance) triples, you can compute standard IR metrics:

  • Hit@k — does at least one relevant document appear in the top k results?
  • MRR (Mean Reciprocal Rank) — on average, how high is the first relevant result?
  • nDCG@k — do the top k results have the right ordering of relevance?
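All three metrics are a few lines each. A sketch over per-query ranked lists: Hit@k and MRR take binary labels (a set of relevant doc ids), while nDCG consumes the graded scores from the judge:

```python
import math

def hit_at_k(ranked, relevant, k):
    """True if any relevant doc id appears in the top k."""
    return any(d in relevant for d in ranked[:k])

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result; 0 if none found.
    Average this over queries to get MRR."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """gains: doc_id -> graded relevance (e.g. 0 / 1 / 2)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```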

These metrics let you compare search configurations (BM25 alone vs. Hybrid Search - Combining Sparse and Dense Retrieval, with vs. without Reranker - Cross-Encoder Rescoring) on your actual queries and data.

Head-to-Head Comparison (Elo Method)

A variant used by production benchmarks: show the LLM the top results from two different search methods side by side (anonymized), and ask which set is more relevant. Track wins and losses as an Elo rating. This pairwise setup is more robust than absolute scoring because it avoids the LLM’s tendency to be generous with relevance labels.
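The rating update itself is the standard Elo formula. A minimal sketch; the K-factor of 32 and starting rating of 1000 are conventional choices, not prescribed values:

```python
K = 32  # conventional K-factor; controls how fast ratings move

def elo_update(rating_a, rating_b, a_won):
    """Return updated (rating_a, rating_b) after one head-to-head
    comparison, where the judge preferred A's results iff a_won."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    rating_a += K * (score_a - expected_a)
    rating_b += K * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b
```

Start every configuration at the same rating (say 1000) and fold in each judged pair as a match.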

Limitations

  • LLMs have their own relevance biases — they may favor well-written documents over messy personal notes that contain the actual answer
  • For deeply personal content (e.g., “that thing from the meeting with Will”), the LLM lacks the context to judge accurately — these cases still require human labels
  • LLM judgments are not free — batch scoring hundreds of pairs has a token cost, though small models or local inference keep this cheap

Practical Workflow for an Obsidian Vault

Log queries and results to a JSONL file with a relevance field defaulting to null. Use an LLM to pre-fill the relevance field. Review and correct labels when you have downtime. Over weeks, this accumulates a personal evaluation dataset that reflects your search patterns and relevance preferences.
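A sketch of the logging half, assuming one JSON object per (query, result) pair. Field names are illustrative; `relevance` starts as `null` and is later pre-filled by the LLM and corrected by hand:

```python
import json

def log_search(path, query, results):
    """Append one JSONL record per (query, result) pair, with
    relevance left as null for later annotation."""
    with open(path, "a", encoding="utf-8") as f:
        for rank, doc_id in enumerate(results, start=1):
            record = {
                "query": query,
                "doc_id": doc_id,
                "rank": rank,
                "relevance": None,  # serialized as null; filled in later
            }
            f.write(json.dumps(record) + "\n")
```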

This dataset can then be used to tune search configuration (BM25 vs. embedding weights, reranker prompt wording) or even fine-tune a small embedding model to encode your notion of similarity.
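Tuning against the dataset can be as simple as a sweep. A hypothetical sketch: try several BM25/embedding mixing weights and keep the one that maximizes MRR on your labeled queries (`search_fn` and `alpha` are assumptions about how your pipeline exposes the weight):

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank over queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def tune_alpha(search_fn, queries, relevant_sets, alphas):
    """search_fn(query, alpha) -> ranked doc ids. Returns the alpha
    with the highest MRR on the labeled queries."""
    return max(
        alphas,
        key=lambda a: mrr([search_fn(q, a) for q in queries], relevant_sets),
    )
```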

References

  • Zheng, L., et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS.
  • Thomas, P., et al. (2024). “Large Language Models can Accurately Predict Searcher Preferences.” SIGIR.