Core Principle
For each capability axis in Dimensions of LLM Quality, there is a small set of canonical benchmarks the field has converged on. Knowing which benchmark maps to which axis lets you read a model release post intelligently and spot the gaps the lab didn’t report.
Why This Matters
Labs pick which benchmarks to put on their release page. If a model release lists MMLU-Pro and GPQA but not SWE-bench, that’s a signal — not just an omission. You can only read those signals if you know what each benchmark measures and what the saturation level is.
Benchmarks also rot. MMLU and HumanEval are saturated and contaminated. Citing them as evidence today is a tell that someone isn’t paying attention.
Evidence/Examples
General reasoning & knowledge
- MMLU-Pro — successor to MMLU, harder, less saturated. Paper
- GPQA Diamond — PhD-level science, “Google-proof”. Paper
- HLE (Humanity’s Last Exam) — frontier, ~3K expert questions. Site
- BIG-Bench Hard (BBH) — still cited, mostly saturated; ignore for frontier comparisons
Math
- MATH — competition problems; largely saturated at the frontier. Paper
- AIME 2024/2025 — current frontier differentiator
- FrontierMath — research-level, mostly unsolved. Epoch AI
Coding
- HumanEval / MBPP — legacy, saturated, contaminated; ignore
- LiveCodeBench — refreshed from LeetCode, contamination-resistant. Site
- SWE-bench Verified — real GitHub issue fixes; current gold standard for agentic coding. Site
- Aider polyglot — multi-file edits across languages. Leaderboard
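Most of the coding benchmarks above report pass@k (did any of k sampled solutions pass the tests?). As a refresher, the standard unbiased estimator introduced with HumanEval can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n samples drawn per problem, c of which passed the tests.
    Returns the estimated probability that at least one of k
    randomly chosen samples passes."""
    if n - c < k:
        # Fewer failures than draws: at least one draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples and 50 correct, pass@1 reduces to the raw rate:
print(pass_at_k(200, 50, 1))  # 0.25
```

Note that pass@1 with n > 1 samples is just the mean pass rate, which is why leaderboards can report it from many generations per problem.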
Instruction following
- IFEval — verifiable constraints (format, length, casing, etc.). Paper
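What makes IFEval clean is that every constraint has a programmatic checker, so scoring needs no judge model. A minimal sketch of what such checkers look like (the function names and constraints here are illustrative, not IFEval's actual implementation):

```python
import json

# Illustrative IFEval-style verifiable checkers. Each instruction in the
# prompt maps to one deterministic check on the raw response string.
def check_all_lowercase(response: str) -> bool:
    return response == response.lower()

def check_word_count_at_least(response: str, n: int) -> bool:
    return len(response.split()) >= n

def check_valid_json(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

def check_ends_with(response: str, suffix: str) -> bool:
    return response.strip().endswith(suffix)

response = "all lowercase words here, ending as asked."
print(check_all_lowercase(response))           # True
print(check_word_count_at_least(response, 5))  # True
```

Because the checks are deterministic, IFEval scores are reproducible in a way judge-based instruction-following evals are not.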
Tool use / agents
- BFCL (Berkeley Function Calling Leaderboard) — Site
- τ-bench (tau-bench) — multi-turn retail/airline agents. Paper
- GAIA — general AI assistant tasks. Paper
Long context
- RULER — synthetic retrieval and variable-tracking tasks at controlled context lengths. Paper
Multimodal
- MMMU — college-level multi-discipline vision. Site
- MathVista — visual math
- ChartQA, DocVQA — charts and documents
Contamination-resistant aggregate
- LiveBench — rotated monthly; covers reasoning, math, coding, language, IF, data. Site
Implications
- A “saturation date” is the first thing to check on any benchmark. If frontier models score >95%, the benchmark is dead and won’t differentiate.
- Contamination is the silent killer. Anything scraped from the public web before training is suspect. Prefer benchmarks that refresh (LiveBench, LiveCodeBench) or that test private/recent items (SWE-bench Verified on filtered issues).
- Lab self-reports are not independent runs. When an outside party reproduces a benchmark, that’s the result that should update your beliefs.
- For agentic coding work specifically, SWE-bench Verified > everything else right now. For tool use, BFCL.
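The saturation check is mechanical enough to sketch. The threshold and the scores below are illustrative assumptions, not real leaderboard numbers:

```python
SATURATION_THRESHOLD = 95.0  # percent; rough rule of thumb, not a standard

def is_saturated(frontier_scores: list[float]) -> bool:
    """A benchmark is dead for frontier comparison when even the
    lowest-scoring frontier model clears the threshold: the remaining
    spread is noise, not signal."""
    return min(frontier_scores) > SATURATION_THRESHOLD

# Made-up scores for three hypothetical frontier models:
print(is_saturated([98.1, 97.4, 96.9]))  # True: won't differentiate
print(is_saturated([62.0, 55.3, 48.7]))  # False: still has headroom
```

In practice you'd pull the top-N scores from a live leaderboard rather than hard-code them.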
Related Ideas
- Dimensions of LLM Quality — what each benchmark is trying to measure
- LLM Comparison Sources — leaderboards that aggregate these benchmarks
- Small Local LLMs as Judges
Questions
- What’s the half-life of a benchmark before contamination kills it? (LiveCodeBench refreshes monthly for a reason.)
- How well does benchmark performance actually transfer to private workloads? (Anecdotally: loose correlation, not tight.)
- Is there a benchmark for “tool use under adversarial conditions” yet? (BFCL is pretty clean-room.)