Core Principle

No single leaderboard is canonical, and the people who think hardest about this (Karpathy, Willison) explicitly distrust the most popular one (LMArena). The credible sources fall into three buckets: aggregators, revealed-preference signals, and your own evals. The right workflow uses all three in sequence — never just one.

Why This Matters

Public benchmarks are gameable, leaderboards get gamed, and the top spot reshuffles weekly. Picking a model from a single leaderboard is how you end up with a model that benches well and underperforms on your real workload. Knowing which sources are trustworthy and why lets you triangulate instead of trusting one number.

Evidence/Examples

Aggregators (start here for the snapshot)

  • Artificial Analysis — closest thing to a canonical comparison dashboard. Composite Intelligence Index, speed (tok/s), TTFT, price, context length, side-by-side for ~every API model. Simon Willison routinely links it in release roundups as a starting point.
  • LiveBench — contamination-resistant aggregate, refreshed monthly. Covers reasoning, math, coding, language, instruction following, data analysis.
  • llm-stats.com — useful cross-section of public benchmarks per model.

Revealed preference (what people/apps actually pick)

  • OpenRouter rankings — Karpathy’s explicit recommendation. Real production apps routing real tokens, with private evals and money on the line. He called it “very difficult to game” because the signal is tokens spent, not votes cast. (Karpathy on X)
  • LMArena — blind human pairwise votes, Elo-ranked. Historically canonical for “feel” but increasingly distrusted: see the Leaderboard Illusion paper. Karpathy soured on it after Gemini scored #1 but felt worse than alternatives in daily use, while Claude 3.5 felt top-tier and ranked low. Willison flagged Meta testing 27 Llama 4 variants and only shipping the winner as evidence of selective-disclosure gaming.

Task-specific leaderboards

  • SWE-bench — resolving real GitHub issues, for coding agents.
  • Aider leaderboards — code editing and refactoring across models.
  • BFCL (Berkeley Function Calling Leaderboard) — tool/function calling.

Your own evals (the only signal that actually matters)

Simon Willison’s consistent line across his evals tag: the only eval that matters is the one you write for your own task. Take 10–20 prompts from your real workload, run them blind across candidates, and rank the outputs yourself. No public leaderboard substitutes for this.

A litellm proxy (or similar OpenAI-compatible router) makes this trivial: same code path, swap the model ID, same harness scores them all.
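A minimal sketch of that harness. The client call is injected as a plain function (in practice it would be an OpenAI-compatible call through a litellm-style proxy); the function name, model IDs, and stub client below are illustrative, not a specific product’s API. The key step is shuffling which model produced which answer so the ranking stays blind:

```python
import random

def collect_blind(models, prompts, complete, seed=0):
    """Run every prompt through every candidate model, then hide which
    output came from which model so you can rank them blind.

    `complete(model_id, prompt) -> str` is whatever client you already
    use; only the model ID changes between candidates.
    """
    rng = random.Random(seed)
    key = {}       # prompt -> {anonymous label -> model id}; reveal after ranking
    sheets = []    # one blind answer sheet per prompt
    labels = [chr(ord("A") + i) for i in range(len(models))]
    for prompt in prompts:
        shuffled = models[:]
        rng.shuffle(shuffled)  # fresh label->model mapping per prompt
        sheet = {"prompt": prompt, "answers": {}}
        for label, model in zip(labels, shuffled):
            sheet["answers"][label] = complete(model, prompt)
            key.setdefault(prompt, {})[label] = model
        sheets.append(sheet)
    return sheets, key

# Usage with a stub client (swap in a real API call in practice):
def fake_complete(model, prompt):
    return f"[{model}] response to: {prompt}"

sheets, key = collect_blind(
    ["model-a", "model-b", "model-c"],
    ["Summarize this contract", "Write a regex for ISO dates"],
    fake_complete,
)
```

Rank the sheets by hand, then look up `key` to see which model won. Because the harness only varies the model ID, the comparison stays apples-to-apples.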

Implications

  • Practical workflow: Artificial Analysis for the 30-second snapshot → OpenRouter rankings to cross-check with revealed preference → task-specific leaderboard (SWE-bench, Aider, BFCL) for your use case → your own eval set on real prompts.
  • The Karpathy view (revealed preference) and Willison view (your own evals) aren’t in conflict. They’re the same idea at different scales: aggregate-revealed-preference vs. personal-revealed-preference. Both sidestep gameable benchmarks by anchoring on real usage.
  • Treat aggregators as a filter to pick 3–4 candidates, not a verdict. Always finish with your own eval.
  • A leaderboard you can game with selective disclosure (publish 1 of 27 variants) is fundamentally different from a leaderboard you can only game by getting users to actually pay for tokens.

Questions

  • Will OpenRouter rankings stay un-gameable as more usage flows through it? (At some point the incentive to game it grows.)
  • Is there a public leaderboard that weights revealed preference + objective benchmarks + human preference into one score?
  • How do you build a personal eval set efficiently without spending a week on it? (Probably: log real prompts for a week, sample 20.)