Core Principle

No single leaderboard is canonical, and the people who think hardest about this (Karpathy, Willison) explicitly distrust the most popular one (LMArena). The credible sources fall into three buckets: aggregators, revealed-preference signals, and your own evals. The right workflow uses all three in sequence — never just one.

Why This Matters

Public benchmarks are gameable, leaderboards get gamed, and the top spot reshuffles weekly. Picking a model from a single leaderboard is how you end up with a model that benches well and underperforms on your real workload. Knowing which sources are trustworthy and why lets you triangulate instead of trusting one number.

Evidence/Examples

Aggregators (start here for the snapshot)

  • Artificial Analysis — closest thing to a canonical comparison dashboard. Composite Intelligence Index, speed (tok/s), TTFT, price, context length, side-by-side for ~every API model. Simon Willison routinely links it in release roundups as a starting point.
  • LiveBench — contamination-resistant aggregate, refreshed monthly. Covers reasoning, math, coding, language, instruction following, data analysis.
  • llm-stats.com — useful cross-section of public benchmarks per model.

Revealed preference (what people/apps actually pick)

  • OpenRouter rankings — Karpathy’s explicit recommendation. Real production apps routing real tokens, with private evals and money on the line. He called it “very difficult to game” because the signal is tokens spent, not votes cast. (Karpathy on X)
  • LMArena — blind human pairwise votes, Elo-ranked. Historically canonical for “feel” but increasingly distrusted: see the Leaderboard Illusion paper. Karpathy soured on it after Gemini scored #1 but felt worse than alternatives in daily use, while Claude 3.5 felt top-tier and ranked low. Willison flagged Meta testing 27 Llama 4 variants and only shipping the winner as evidence of selective-disclosure gaming.

Task-specific leaderboards

  • SWE-bench — resolving real GitHub issues, for coding agents.
  • Aider leaderboards — code editing and refactoring across models.
  • BFCL (Berkeley Function Calling Leaderboard) — tool/function calling.

Your own evals (the only signal that actually matters)

Simon Willison’s consistent line across his evals tag: the only eval that matters is the one you write for your own task. Take 10–20 prompts from your real workload, run them blind across candidates, and rank the outputs yourself. No public leaderboard substitutes for this.

A litellm proxy (or similar OpenAI-compatible router) makes this trivial: same code path, swap the model ID, same harness scores them all.
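A minimal sketch of that harness. The client call is injected as a plain function (in practice it would be an OpenAI-compatible call through a litellm-style proxy); the function name, model IDs, and stub client below are illustrative, not a specific product’s API. The key step is shuffling which model produced which answer so the ranking stays blind:

```python
import random

def collect_blind(models, prompts, complete, seed=0):
    """Run every prompt through every candidate model, then hide which
    output came from which model so you can rank them blind.

    `complete(model_id, prompt) -> str` is whatever client you already
    use; only the model ID changes between candidates.
    """
    rng = random.Random(seed)
    key = {}       # prompt -> {anonymous label -> model id}; reveal after ranking
    sheets = []    # one blind answer sheet per prompt
    labels = [chr(ord("A") + i) for i in range(len(models))]
    for prompt in prompts:
        shuffled = models[:]
        rng.shuffle(shuffled)  # fresh label->model mapping per prompt
        sheet = {"prompt": prompt, "answers": {}}
        for label, model in zip(labels, shuffled):
            sheet["answers"][label] = complete(model, prompt)
            key.setdefault(prompt, {})[label] = model
        sheets.append(sheet)
    return sheets, key

# Usage with a stub client (swap in a real API call in practice):
def fake_complete(model, prompt):
    return f"[{model}] response to: {prompt}"

sheets, key = collect_blind(
    ["model-a", "model-b", "model-c"],
    ["Summarize this contract", "Write a regex for ISO dates"],
    fake_complete,
)
```

Rank the sheets by hand, then look up `key` to see which model won. Because the harness only varies the model ID, the comparison stays apples-to-apples.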

Implications

  • Practical workflow: Artificial Analysis for the 30-second snapshot → OpenRouter rankings to cross-check with revealed preference → task-specific leaderboard (SWE-bench, Aider, BFCL) for your use case → your own eval set on real prompts.
  • The Karpathy view (revealed preference) and Willison view (your own evals) aren’t in conflict. They’re the same idea at different scales: aggregate-revealed-preference vs. personal-revealed-preference. Both sidestep gameable benchmarks by anchoring on real usage.
  • Treat aggregators as a filter to pick 3–4 candidates, not a verdict. Always finish with your own eval.
  • A leaderboard you can game with selective disclosure (publish 1 of 27 variants) is fundamentally different from a leaderboard you can only game by getting users to actually pay for tokens.

Questions

  • Will OpenRouter rankings stay un-gameable as more usage flows through it? (At some point the incentive to game it grows.)
  • Is there a public leaderboard that weights revealed preference + objective benchmarks + human preference into one score?
  • How do you build a personal eval set efficiently without spending a week on it? (Probably: log real prompts for a week, sample 20.)