Core Principle
“How good is this model?” is not one question. LLM quality decomposes into roughly a dozen orthogonal axes, and a model can be frontier on one while mediocre on another. Picking a model means picking which axes matter for your workload, then ranking on those — not on a generic “intelligence” score.
Why This Matters
A single composite ranking flattens a multi-dimensional object into a line and hides the tradeoffs that actually matter in practice. The model that wins LMArena overall might be slow, expensive, weak at tool calling, and bad at long context. Treating “strongest” as monolithic leads to picking models that look great on a leaderboard and disappoint in production.
Evidence/Examples
The axes people actually evaluate on:
Capability axes
- Reasoning / knowledge — multi-step logic, factual recall, science, math
- Coding — writing, editing, debugging, repo-level agentic work
- Instruction following — does it actually do what you asked, in the format you asked
- Tool use / function calling — picks the right tool, fills args, chains calls correctly
- Long context — retrieval and reasoning across 100K–1M tokens
- Multimodality — vision, audio, video in/out
- Multilingual — non-English quality
- Creative writing — voice, originality, structure
- Safety / refusal calibration — refuses real harm without over-refusing benign asks
Operational axes
- Speed (tokens/sec) — throughput once generation starts
- TTFT (time to first token) — how interactive it feels
- Context window — max input size in tokens
- Cost — $/Mtok, usually priced separately for input and output tokens
- Intelligence-per-dollar — quality score divided by blended price; often the most decision-relevant axis in practice
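Intelligence-per-dollar can be made concrete with a small sketch. All model names, quality scores, and prices below are made up for illustration; the blended-cost formula assumes a workload token mix (here 75% input) that you would measure for your own traffic.

```python
# Hypothetical numbers for illustration -- not real model prices or scores.

def blended_cost_per_mtok(input_price: float, output_price: float,
                          input_frac: float = 0.75) -> float:
    """Blend input/output $/Mtok prices by the workload's token mix."""
    return input_price * input_frac + output_price * (1 - input_frac)

def intelligence_per_dollar(quality_score: float, input_price: float,
                            output_price: float, input_frac: float = 0.75) -> float:
    """Quality divided by blended price: higher means better value."""
    return quality_score / blended_cost_per_mtok(input_price, output_price, input_frac)

# Two made-up models: A scores higher on quality, B is far cheaper.
a = intelligence_per_dollar(quality_score=80, input_price=3.00, output_price=15.00)
b = intelligence_per_dollar(quality_score=70, input_price=0.25, output_price=1.25)
```

Note how the cheaper model wins on this axis despite the lower quality score, which is exactly the tradeoff a single leaderboard rank hides.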
Aggregate signals
- Human preference — blind pairwise votes; captures “feel” that benchmarks miss
- Revealed preference — what tokens production apps actually route (see LLM Comparison Sources)
Implications
- Before picking a model, write down which 2–4 axes dominate your workload. RAG pipeline = long context + cost. Coding agent = tool use + coding + cost. Customer support bot = instruction following + speed + safety.
- Be suspicious of any “best LLM” claim that doesn’t name the axis.
- Two models can both be “frontier” along different axes, with neither dominating the other.
- Cost and speed are first-class quality dimensions, not afterthoughts. A model that’s 5% smarter at 10× the cost is rarely the right pick.
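The “pick 2–4 axes, then rank” step above can be sketched as a weighted sum over per-axis scores. Everything here is a toy: the models, the 0–100 scores, and the weights are invented placeholders standing in for whatever benchmark numbers you actually collect.

```python
# Made-up per-axis scores on a 0-100 scale; models and numbers are illustrative only.
MODELS = {
    "model_a": {"long_context": 90, "cost_efficiency": 40, "tool_use": 85, "coding": 88},
    "model_b": {"long_context": 70, "cost_efficiency": 95, "tool_use": 60, "coding": 65},
}

def rank_models(models: dict, weights: dict) -> list:
    """Rank models by a weighted sum over only the axes the workload cares about."""
    def score(axes: dict) -> float:
        return sum(axes[name] * w for name, w in weights.items())
    return sorted(models, key=lambda m: score(models[m]), reverse=True)

# A RAG pipeline weights long context and cost; a coding agent weights tool use and coding.
rag_ranking = rank_models(MODELS, {"long_context": 0.5, "cost_efficiency": 0.5})
agent_ranking = rank_models(MODELS, {"tool_use": 0.5, "coding": 0.5})
```

With these toy numbers the two workloads rank the same two models in opposite orders, which is the point: “best” is a function of the weights, and the weights come from your workload.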
Related Ideas
- LLM Benchmark Reference — concrete tests per axis
- LLM Comparison Sources — where to look up scores across axes
- Guide to LLM abstractions
- Small Local LLMs as Judges
Questions
- Which axes correlate with each other in practice? (Reasoning + math probably do; tool use + reasoning probably don’t.)
- Is there a defensible “general intelligence” composite, or is every weighting arbitrary?
- How fast do these axes shift relative to each other as models improve? (Coding seems to outpace creative writing.)