Designing GenAI Evaluations - Process and Metrics

Summary

A literature synthesis on the process of writing an LLM evaluation: how to scope scenarios, choose metrics for both quality and compute, defend against contamination, and use LLM-as-a-judge reliably. Built from 22 papers including HELM, the Chang et al. methodology survey, MT-Bench / Chatbot Arena, BIG-Bench, MMLU, the Gu et al. LLM-as-judge survey, LiveCodeBench, SWE-Bench, AgentBench, MedHELM/VHELM, and the contamination literature (CONDA, PaCoST, preference leakage). Full long-form report lives at ~/docs/llm-evaluation-research-report.md.

Key Claims

Evaluation is multi-dimensional from the start. HELM (Liang et al., 2023) established the scenarios × metrics matrix: every scenario gets all 7 metric categories — accuracy, calibration, robustness, fairness, bias, toxicity, efficiency — not just one accuracy number. The matrix covers 98/112 cells = 87.5%; intentional gaps where a metric is ill-defined.
A scenario has structure. HELM’s three-axis decomposition: Task × Domain (What/Who/When) × Language. E.g., (QA, (clinical notes, doctors, now), English). Selection criteria: coverage, minimality, user-facing.
Closed-form benchmarks measure capability; open-ended benchmarks measure preference. MT-Bench (Zheng et al., NeurIPS 2023) explicitly argues MMLU/HellaSwag/HumanEval cannot distinguish RLHF-aligned chat models from base models. Pair both.
LLM-as-judge is competitive with humans, but biased. GPT-4 ↔ human pairwise agreement is 85% on MT-Bench (864 votes), 87% on Chatbot Arena (1,944 votes); human ↔ human is only 81%. Position bias: GPT-4 is only 65% consistent under swap (best of major judges). Verbosity attack: Claude-v1 / GPT-3.5 fooled 91.3% of the time, GPT-4 only 8.7%. Self-enhancement: Claude-v1 favors its own answers with a 25% higher win rate, GPT-4 with a 10% higher win rate.
Reference-guided judging is the strongest fix for math/code. Math failure rate drops 70% → 15% when the judge generates its own answer first. CoT alone is insufficient (judge often “makes exactly the same mistake” as the candidate).
Position swap + majority vote are the two empirically consistent reliability interventions ([Gu et al. survey](https://arxiv.org/abs/2411.15594)). Self-explanation actively hurts; mean@5 is dominated by majority@5; ensembling depends critically on which judges you pool.
Compute must be reported alongside quality. HELM defines denoised (real provider, contention-removed) and idealized (uniform A100+Megatron) inference runtime, decomposed as F(prompt_tokens) + g · output_tokens. Training cost: e = n_GPU × W_GPU × t × PUE. Single-axis quality leaderboards systematically over-recommend frontier models — fine-tuned encoders match LLM zero-shot quality at 1–2 OOM lower cost (Gonzalez 2026).
Contamination is now the default state. CONDA 2024 collected 566 entries across 91 sources; PaCoST concluded “almost all models and benchmarks we tested are suspected contaminated.” Defenses: living benchmarks (LiveCodeBench, Chatbot Arena, FRESHQA, Xiezhi), statistical detection (PaCoST, DCR), held-out splits.
Preference leakage is a new, harder-to-detect contamination vector. Li et al. 2025: when the data generator and the judge come from the same model family, scores inflate from correlated preferences, not memorization.
Trajectory-level evaluation matters for agents. Single-shot benchmarks miss multi-step tool use. AgentBench, SWE-Bench, GAIA score full trajectories; verifier-based scoring is current SOTA, process reward models are the active frontier.
Domain HELMs work. The HELM matrix has been ported to medicine (MedHELM, 35 benchmarks across 5 clinical task categories), vision (VHELM), audio (Yang et al. 2025). Method generalizes: define the domain scenario taxonomy first, instantiate the same metric battery.
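
The two judge-reliability interventions named above (position swap, then majority vote) can be sketched in a few lines. This is a minimal illustration, not the procedure from any of the cited papers: the `Judge` callable interface, the verdict strings, and the toy judges are all assumptions made for the example, and treating a swap-inconsistent verdict as a tie is one conservative choice among several.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a judge takes (question, first_answer, second_answer)
# and returns "first", "second", or "tie". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge in both presentation orders; keep the verdict only if they agree."""
    forward = judge(question, a, b)   # answer A shown first
    backward = judge(question, b, a)  # answer B shown first
    # Translate positional verdicts back to the underlying answers.
    fwd = {"first": "A", "second": "B", "tie": "tie"}[forward]
    bwd = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return fwd if fwd == bwd else "tie"  # inconsistent under swap -> tie


def majority_verdict(judges: Sequence[Judge], question: str, a: str, b: str) -> str:
    """Pool swap-consistent verdicts from several judges; no strict plurality -> tie."""
    votes = Counter(swap_consistent_verdict(j, question, a, b) for j in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Toy judges (assumptions for illustration only).
def always_first(question: str, x: str, y: str) -> str:
    return "first"  # pure position bias


def prefers_longer(question: str, x: str, y: str) -> str:
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


if __name__ == "__main__":
    q, a, b = "What is 2 + 2?", "4", "The answer is 4."
    print(swap_consistent_verdict(always_first, q, a, b))
    print(majority_verdict([prefers_longer, prefers_longer, always_first], q, a, b))
```

The position-biased toy judge collapses to a tie under swap, so it cannot tip the vote. The length-preferring judge stays consistent under swap, which is a reminder that swapping removes position bias only and does nothing about verbosity bias.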

Notable Quotes

“A scenario instantiates a desired use case for a language model… operationalized through a list of instances, divided into a training set and one or more test sets.” — Liang et al., HELM

“Too often in AI, this has come to mean the system should be accurate in an average sense. While (average) accuracy is an important… property… accuracy is often not sufficient for a system to be useful.” — Liang et al., HELM (rationale for the scenarios × metrics matrix)

“MMLU and HELM cannot effectively tell the difference between [RLHF-]aligned models and the base models.” — Zheng et al., MT-Bench (the case for preference benchmarks)

“Almost all models and benchmarks we tested are suspected contaminated more or less.” — Zhang et al., PaCoST

My Reactions

The HELM “denoised vs. idealized” inference-runtime split is the single best idea in compute-side evaluation; most production teams report neither. Should be the default.
The empirical finding that self-explanation hurts judge reliability is counterintuitive and worth burning into memory — the field’s instinct is “more reasoning = better”; with judges, it surfaces noise.
“Preference leakage” is the most under-discussed risk in current eval pipelines. Any team using a single judge family for both data generation and evaluation has unmeasured score inflation.
The 9-step recipe in the long report (state the decision → scenarios → metrics → static+dynamic → judge stack → cost-quality Pareto → contamination defense → uncertainty → documentation) is a useful internal spec template.
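
To make the compute-side formulas from Key Claims concrete, here is a small worked sketch of the runtime decomposition F(prompt_tokens) + g · output_tokens and the training-energy formula e = n_GPU × W_GPU × t × PUE. All numbers, the linear form chosen for F, and the units (seconds, kW, hours, kWh) are assumptions for illustration, not values from HELM.

```python
from typing import Callable


def inference_runtime_s(
    prompt_tokens: int,
    output_tokens: int,
    prompt_cost_fn: Callable[[int], float],  # F: seconds to process the prompt
    per_output_token_s: float,               # g: seconds per generated token
) -> float:
    """Runtime = F(prompt_tokens) + g * output_tokens."""
    return prompt_cost_fn(prompt_tokens) + per_output_token_s * output_tokens


def training_energy_kwh(n_gpu: int, gpu_power_kw: float, hours: float, pue: float) -> float:
    """Energy = n_GPU * W_GPU * t * PUE (assuming kW per GPU and t in hours)."""
    return n_gpu * gpu_power_kw * hours * pue


if __name__ == "__main__":
    # Hypothetical prompt-cost curve: fixed overhead plus a linear term.
    def prompt_cost(n: int) -> float:
        return 0.05 + 0.0002 * n

    runtime = inference_runtime_s(1000, 200, prompt_cost, per_output_token_s=0.02)
    print(f"runtime ~ {runtime:.2f} s")   # 0.25 s prompt + 4.00 s generation

    energy = training_energy_kwh(n_gpu=64, gpu_power_kw=0.4, hours=100, pue=1.1)
    print(f"energy ~ {energy:.0f} kWh")   # 64 * 0.4 * 100 * 1.1 = 2816
```

Under this reading, the denoised and idealized variants would differ only in where F and g are measured (the real provider with contention removed, versus uniform reference hardware), not in the shape of the formula.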

Practical Recipe (extracted)

State the decision the evaluation will inform.
Define scenarios using HELM’s decomposition: task × domain (what / who / when) × language.
For each scenario, choose metrics from the 7 HELM categories. Always include accuracy, ≥1 robustness perturbation, ≥1 cost metric.
Pair static and dynamic benchmarks; large gaps signal contamination.
For open-ended outputs use LLM-as-judge with the reliability stack: reference-guided where possible, position swap, majority vote across ≥3 judges, meta-evaluate against ~100 human-labeled gold pairs.
Report quality, cost, latency as Pareto frontiers, not single rankings.
Run contamination tests (PaCoST/DCR) on suspect benchmarks; prefer post-cutoff data; keep a private hold-out.
Bootstrap CIs over instances, and over judge runs.
Document prompts, decoding parameters, ICL examples, judge prompts, judge model versions, seeds.
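
Two of the steps above, reporting a cost-quality Pareto frontier and bootstrapping confidence intervals over instances and judge runs, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the model names and numbers are invented, quality is "higher is better" and cost is "lower is better", and the two-level percentile bootstrap (resample instances, then resample judge runs within each instance) is one reasonable scheme rather than the method of any cited paper.

```python
import random
from typing import Dict, List, Sequence, Tuple


def pareto_frontier(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of non-dominated models, given name -> (quality, cost)."""
    frontier = []
    for name, (q, c) in models.items():
        dominated = any(
            q2 >= q and c2 <= c and (q2 > q or c2 < c)
            for other, (q2, c2) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: models[n][1])  # cheapest first


def bootstrap_ci(
    scores_by_instance: Sequence[Sequence[float]],  # one list of judge-run scores per instance
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean score, resampling instances and judge runs."""
    rng = random.Random(seed)
    n = len(scores_by_instance)
    estimates = []
    for _ in range(n_resamples):
        sampled = rng.choices(scores_by_instance, k=n)  # resample instances
        per_instance = []
        for runs in sampled:
            resampled_runs = rng.choices(runs, k=len(runs))  # resample judge runs
            per_instance.append(sum(resampled_runs) / len(resampled_runs))
        estimates.append(sum(per_instance) / n)
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical (quality, cost per 1k queries) pairs.
    models = {
        "frontier-llm": (0.90, 10.0),
        "mid-llm": (0.85, 2.0),
        "fine-tuned-encoder": (0.84, 0.1),
        "dominated-llm": (0.80, 3.0),
    }
    print(pareto_frontier(models))
    # -> ['fine-tuned-encoder', 'mid-llm', 'frontier-llm']

    # 100 instances, 3 judge runs each (synthetic scores).
    scores = [[1, 1, 1]] * 60 + [[1, 0, 1]] * 20 + [[0, 0, 0]] * 20
    print(bootstrap_ci(scores))
```

Reporting the frontier rather than a single ranking keeps the cheap-but-nearly-as-good option visible. The interval from `bootstrap_ci` reflects both which instances were sampled and how noisy the judge was on each.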

Connections

Dimensions of LLM Quality — what each metric category is trying to measure
LLM Benchmark Reference — the parallel cheat-sheet of which benchmark to read for which axis
LLM Comparison Sources — meta-leaderboards that aggregate these
LLM as a Judge for Preference Annotation — the use case where this methodology is most often applied
LLM Pairwise Preference Judging — pairwise variant, with the bias mitigations from this synthesis
Small Local LLMs as Judges — contrast: what changes when the judge isn’t GPT-4
Issues with Popular Benchmarks and Evaluation Methods - Mellanie Lemman — adjacent take from a NeurIPS talk on rigor in eval design

Sources cited (22)

Foundation: Liang et al. HELM (arXiv:2211.09110) · Chang et al. Survey on Evaluation of LLMs (arXiv:2307.03109) · Srivastava et al. BIG-Bench (arXiv:2206.04615) · Suzgun et al. BIG-Bench Hard (arXiv:2210.09261) · Hendrycks et al. MMLU (arXiv:2009.03300).

Preference / judges: Zheng et al. MT-Bench / Chatbot Arena (arXiv:2306.05685) · Chiang et al. Chatbot Arena platform (arXiv:2403.04132) · Gu et al. Survey on LLM-as-a-Judge (arXiv:2411.15594).

Code / agents / domain: Chen et al. HumanEval / pass@k (arXiv:2107.03374) · Jain et al. LiveCodeBench (arXiv:2403.07974) · Liu et al. AgentBench (arXiv:2308.03688) · Jimenez et al. SWE-Bench (arXiv:2310.06770) · Bedi et al. MedHELM (arXiv:2505.23802) · Lee et al. VHELM (arXiv:2410.07112) · Yang et al. Audio-LM eval survey (arXiv:2505.15957).

Contamination / reliability: Sainz et al. CONDA 2024 (arXiv:2407.21530) · Palavalli et al. Contamination taxonomy (arXiv:2407.08716) · Zhang et al. PaCoST (arXiv:2406.18326) · Xu et al. DCR (arXiv:2507.11405) · Li et al. Preference Leakage (arXiv:2502.01534) · Dragoi et al. Beyond Pass@k (arXiv:2510.08325) · He-Yueya et al. Psychometric Alignment (arXiv:2407.15645) · Gonzalez Cost-Aware Model Selection (arXiv:2602.06370).
