Core Principle
“How good is this model?” is not one question. LLM quality decomposes into roughly a dozen orthogonal axes, and a model can be frontier on one while mediocre on another. Picking a model means picking which axes matter for your workload, then ranking on those — not on a generic “intelligence” score.
Why This Matters
A single composite ranking flattens a multi-dimensional object into a line and hides the tradeoffs that actually matter in practice. The model that wins LMArena overall might be slow, expensive, weak at tool calling, and bad at long context. Treating “strongest” as monolithic leads to picking models that look great on a leaderboard and disappoint in production.
Evidence/Examples
The axes people actually evaluate on:
Capability axes
- Reasoning / knowledge — multi-step logic, factual recall, science, math
- Coding — writing, editing, debugging, repo-level agentic work
- Instruction following — does it actually do what you asked, in the format you asked
- Tool use / function calling — picks the right tool, fills args, chains calls correctly
- Long context — retrieval and reasoning across 100K–1M tokens
- Multimodality — vision, audio, video in/out
- Multilingual — non-English quality
- Creative writing — voice, originality, structure
- Safety / refusal calibration — refuses real harm without over-refusing benign asks
Operational axes
- Speed (tokens/sec) — throughput once generation starts
- TTFT (time to first token) — how interactive it feels
- Context window — max input size in tokens
- Cost — $/Mtok, usually priced separately for input and output tokens
- Intelligence-per-dollar — quality score divided by blended price; often the most decision-relevant axis in practice
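Intelligence-per-dollar can be made concrete with a small sketch. All model names, quality scores, and prices below are made up for illustration; the blended-cost formula assumes a workload token mix (here 75% input) that you would measure for your own traffic.

```python
# Hypothetical numbers for illustration -- not real model prices or scores.

def blended_cost_per_mtok(input_price: float, output_price: float,
                          input_frac: float = 0.75) -> float:
    """Blend input/output $/Mtok prices by the workload's token mix."""
    return input_price * input_frac + output_price * (1 - input_frac)

def intelligence_per_dollar(quality_score: float, input_price: float,
                            output_price: float, input_frac: float = 0.75) -> float:
    """Quality divided by blended price: higher means better value."""
    return quality_score / blended_cost_per_mtok(input_price, output_price, input_frac)

# Two made-up models: A scores higher on quality, B is far cheaper.
a = intelligence_per_dollar(quality_score=80, input_price=3.00, output_price=15.00)
b = intelligence_per_dollar(quality_score=70, input_price=0.25, output_price=1.25)
```

Note how the cheaper model wins on this axis despite the lower quality score, which is exactly the tradeoff a single leaderboard rank hides.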
Aggregate signals
- Human preference — blind pairwise votes; captures “feel” that benchmarks miss
- Revealed preference — what tokens production apps actually route (see LLM Comparison Sources)
Implications
- Before picking a model, write down which 2–4 axes dominate your workload. RAG pipeline = long context + cost. Coding agent = tool use + coding + cost. Customer support bot = instruction following + speed + safety.
- Be suspicious of any “best LLM” claim that doesn’t name the axis.
- Two models can both be “frontier” along different axes, with neither dominating the other.
- Cost and speed are first-class quality dimensions, not afterthoughts. A model that’s 5% smarter at 10× the cost is rarely the right pick.
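The “pick 2–4 axes, then rank” step above can be sketched as a weighted sum over per-axis scores. Everything here is a toy: the models, the 0–100 scores, and the weights are invented placeholders standing in for whatever benchmark numbers you actually collect.

```python
# Made-up per-axis scores on a 0-100 scale; models and numbers are illustrative only.
MODELS = {
    "model_a": {"long_context": 90, "cost_efficiency": 40, "tool_use": 85, "coding": 88},
    "model_b": {"long_context": 70, "cost_efficiency": 95, "tool_use": 60, "coding": 65},
}

def rank_models(models: dict, weights: dict) -> list:
    """Rank models by a weighted sum over only the axes the workload cares about."""
    def score(axes: dict) -> float:
        return sum(axes[name] * w for name, w in weights.items())
    return sorted(models, key=lambda m: score(models[m]), reverse=True)

# A RAG pipeline weights long context and cost; a coding agent weights tool use and coding.
rag_ranking = rank_models(MODELS, {"long_context": 0.5, "cost_efficiency": 0.5})
agent_ranking = rank_models(MODELS, {"tool_use": 0.5, "coding": 0.5})
```

With these toy numbers the two workloads rank the same two models in opposite orders, which is the point: “best” is a function of the weights, and the weights come from your workload.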
Related Ideas
- LLM Benchmark Reference — concrete tests per axis
- LLM Comparison Sources — where to look up scores across axes
- Guide to LLM abstractions
- Small Local LLMs as Judges
Questions
- Which axes correlate with each other in practice? (Reasoning + math probably do; tool use + reasoning probably don’t.)
- Is there a defensible “general intelligence” composite, or is every weighting arbitrary?
- How fast do these axes shift relative to each other as models improve? (Coding seems to outpace creative writing.)