“How good is this model?” is not one question. Quality decomposes into roughly a dozen largely independent axes, and a model can be frontier on one while mediocre on another. Picking a model means picking which axes matter for your workload, then ranking on those, not on a generic “intelligence” score. A single composite ranking flattens a multi-dimensional object into a line and hides the trade-offs that matter in practice. The model that wins LMArena overall might still be slow, expensive, weak at tool calling, and bad at long context.

The capability axes that leaderboards score are reasoning and knowledge, coding (writing, editing, debugging, repo-level agentic work), instruction following, tool use and function calling, long context (retrieval and reasoning across 100K-1M tokens), multimodality, multilingual quality, creative writing, and safety/refusal calibration. Each has distinct benchmarks; see LLM Benchmark Reference for the canonical tests per axis.

The operational axes are first-class quality dimensions, not afterthoughts. A model that’s 5% smarter at 10× the cost is rarely the right pick. Speed (tokens/sec), TTFT (time to first token), context window, and cost ($/Mtok, input plus output) all factor into deployment decisions, and intelligence-per-dollar (quality divided by price) is the most practical composite. Two aggregate signals layer on top of the per-axis benchmarks: human preference from blind pairwise votes captures “feel” that benchmarks miss, and revealed preference (which models production apps actually route their tokens to) is the most game-resistant signal of all (see LLM Comparison Sources).
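To make the “5% smarter at 10× the cost” comparison concrete, here is a minimal Python sketch of intelligence-per-dollar. Everything in it is an illustrative assumption rather than real data: the model names, quality scores, and prices are made up, and the 75/25 input/output token mix is a placeholder you would replace with the ratio measured on your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    quality: float       # hypothetical score on the benchmark you care about
    input_price: float   # $ per million input tokens (made-up figure)
    output_price: float  # $ per million output tokens (made-up figure)


def blended_price(m: ModelSpec, input_share: float = 0.75) -> float:
    """Blend input and output prices by the assumed share of input tokens."""
    return input_share * m.input_price + (1.0 - input_share) * m.output_price


def quality_per_dollar(m: ModelSpec, input_share: float = 0.75) -> float:
    """Quality points bought per blended dollar per million tokens."""
    return m.quality / blended_price(m, input_share)


# Two hypothetical models: `big` scores 5% higher but costs 10x as much.
big = ModelSpec("model-a", quality=63.0, input_price=10.0, output_price=30.0)
small = ModelSpec("model-b", quality=60.0, input_price=1.0, output_price=3.0)

for m in (big, small):
    print(m.name, round(blended_price(m), 2), round(quality_per_dollar(m), 2))
```

Under these assumed numbers the cheaper model delivers far more quality per dollar, which is the sense in which a small quality lead rarely justifies a large price gap. The ratio is only meaningful when the quality score comes from an axis that matters for the workload.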

Before picking a model, write down which 2-4 axes dominate your workload. RAG pipeline = long context plus cost. Coding agent = tool use plus coding plus cost. Customer support bot = instruction following plus speed plus safety. Be suspicious of any “best LLM” claim that doesn’t name the axis: two models can both be “frontier” along different axes and neither dominates.
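The advice above can be sketched as a weighted ranking over only the axes a workload names, plus a dominance check. This is a rough illustration under stated assumptions, not a recommended scoring method: the model names, the per-axis scores (normalized to 0-1, higher is better, so a cheaper model gets a higher `cost` score), and the weights are all hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]

# Hypothetical, normalized per-axis scores (0-1, higher is better).
models: Dict[str, Scores] = {
    "model-a": {"coding": 0.95, "tool_use": 0.9, "long_context": 0.5, "cost": 0.3},
    "model-b": {"coding": 0.7, "tool_use": 0.7, "long_context": 0.9, "cost": 0.8},
}
ALL_AXES = ["coding", "tool_use", "long_context", "cost"]

# Workload profiles: only the axes that dominate, with assumed weights.
coding_agent = {"tool_use": 0.4, "coding": 0.4, "cost": 0.2}
rag = {"long_context": 0.6, "cost": 0.4}


def workload_score(scores: Scores, weights: Dict[str, float]) -> float:
    """Weighted average over the axes the workload names; others are ignored."""
    total = sum(weights.values())
    return sum(scores[axis] * w for axis, w in weights.items()) / total


def rank(candidates: Dict[str, Scores], weights: Dict[str, float]) -> List[str]:
    """Model names sorted from best to worst for the given workload."""
    return sorted(
        candidates,
        key=lambda name: workload_score(candidates[name], weights),
        reverse=True,
    )


def dominates(a: Scores, b: Scores, axes: List[str]) -> bool:
    """True if a is at least as good as b on every axis and better on one."""
    return all(a[x] >= b[x] for x in axes) and any(a[x] > b[x] for x in axes)


print("coding agent:", rank(models, coding_agent))
print("RAG pipeline:", rank(models, rag))
print(
    "a dominates b:",
    dominates(models["model-a"], models["model-b"], ALL_AXES),
)
```

With these made-up scores the two workloads pick different winners and neither model dominates the other across all axes, which is why a “best LLM” claim that names no axis carries little information.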