Core Principle
For each capability axis in Dimensions of LLM Quality, there is a small set of canonical benchmarks the field has converged on. Knowing which benchmark maps to which axis lets you read a model release post intelligently and spot the gaps the lab didn’t report.
Why This Matters
Labs pick which benchmarks to put on their release page. If a model release lists MMLU-Pro and GPQA but not SWE-bench, that’s a signal — not just an omission. You can only read those signals if you know what each benchmark measures and what the saturation level is.
Benchmarks also rot. MMLU and HumanEval are saturated and contaminated. Citing them as evidence today is a tell that someone isn’t paying attention.
Evidence/Examples
General reasoning & knowledge
- MMLU-Pro — successor to MMLU, harder, less saturated. Paper
- GPQA Diamond — PhD-level science, “Google-proof”. Paper
- HLE (Humanity’s Last Exam) — frontier, ~3K expert questions. Site
- BIG-Bench Hard (BBH) — still cited, mostly saturated; ignore for frontier comparisons
Math
- MATH — competition problems; largely saturated at the frontier. Paper
- AIME 2024/2025 — current frontier differentiator
- FrontierMath — research-level, mostly unsolved. Epoch AI
Coding
- HumanEval / MBPP — legacy, saturated, contaminated; ignore
- LiveCodeBench — refreshed from LeetCode, contamination-resistant. Site
- SWE-bench Verified — real GitHub issue fixes; current gold standard for agentic coding. Site
- Aider polyglot — multi-file edits across languages. Leaderboard
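Most of the coding benchmarks above report pass@k (did any of k sampled solutions pass the tests?). As a refresher, the standard unbiased estimator introduced with HumanEval can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n samples drawn per problem, c of which passed the tests.
    Returns the estimated probability that at least one of k
    randomly chosen samples passes."""
    if n - c < k:
        # Fewer failures than draws: at least one draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 200 samples and 50 correct, pass@1 reduces to the raw rate:
print(pass_at_k(200, 50, 1))  # 0.25
```

Note that pass@1 with n > 1 samples is just the mean pass rate, which is why leaderboards can report it from many generations per problem.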
Instruction following
- IFEval — verifiable constraints (format, length, casing, etc.). Paper
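What makes IFEval clean is that every constraint has a programmatic checker, so scoring needs no judge model. A minimal sketch of what such checkers look like (the function names and constraints here are illustrative, not IFEval's actual implementation):

```python
import json

# Illustrative IFEval-style verifiable checkers. Each instruction in the
# prompt maps to one deterministic check on the raw response string.
def check_all_lowercase(response: str) -> bool:
    return response == response.lower()

def check_word_count_at_least(response: str, n: int) -> bool:
    return len(response.split()) >= n

def check_valid_json(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

def check_ends_with(response: str, suffix: str) -> bool:
    return response.strip().endswith(suffix)

response = "all lowercase words here, ending as asked."
print(check_all_lowercase(response))           # True
print(check_word_count_at_least(response, 5))  # True
```

Because the checks are deterministic, IFEval scores are reproducible in a way judge-based instruction-following evals are not.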
Tool use / agents
- BFCL (Berkeley Function Calling Leaderboard) — Site
- τ-bench (tau-bench) — multi-turn retail/airline agents. Paper
- GAIA — general AI assistant tasks. Paper
Long context
- RULER — synthetic retrieval and variable-tracking tasks at controlled context lengths. Paper
Multimodal
- MMMU — college-level multi-discipline vision. Site
- MathVista — visual math
- ChartQA, DocVQA — charts and documents
Contamination-resistant aggregate
- LiveBench — rotated monthly; covers reasoning, math, coding, language, IF, data. Site
Implications
- A “saturation date” is the first thing to check on any benchmark. If frontier models score >95%, the benchmark is dead and won’t differentiate.
- Contamination is the silent killer. Anything scraped from the public web before training is suspect. Prefer benchmarks that refresh (LiveBench, LiveCodeBench) or that test private/recent items (SWE-bench Verified on filtered issues).
- Lab self-reports are not independent runs. When an outside party reproduces a benchmark, that’s the result that should update your beliefs.
- For agentic coding work specifically, SWE-bench Verified > everything else right now. For tool use, BFCL.
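The saturation check is mechanical enough to sketch. The threshold and the scores below are illustrative assumptions, not real leaderboard numbers:

```python
SATURATION_THRESHOLD = 95.0  # percent; rough rule of thumb, not a standard

def is_saturated(frontier_scores: list[float]) -> bool:
    """A benchmark is dead for frontier comparison when even the
    lowest-scoring frontier model clears the threshold: the remaining
    spread is noise, not signal."""
    return min(frontier_scores) > SATURATION_THRESHOLD

# Made-up scores for three hypothetical frontier models:
print(is_saturated([98.1, 97.4, 96.9]))  # True: won't differentiate
print(is_saturated([62.0, 55.3, 48.7]))  # False: still has headroom
```

In practice you'd pull the top-N scores from a live leaderboard rather than hard-code them.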
Related Ideas
- Dimensions of LLM Quality — what each benchmark is trying to measure
- LLM Comparison Sources — leaderboards that aggregate these benchmarks
- Small Local LLMs as Judges
Questions
- What’s the half-life of a benchmark before contamination kills it? (LiveCodeBench refreshes monthly for a reason.)
- How well does benchmark performance actually transfer to private workloads? (Anecdotally: loose correlation, not tight.)
- Is there a benchmark for “tool use under adversarial conditions” yet? (BFCL is pretty clean-room.)