Core Principle
Andrej Karpathy’s autoresearch inverts the ML researcher’s role: instead of editing Python, the human edits a Markdown file, `program.md` (a “super lightweight skill”), that instructs an agent to autonomously iterate on a single `train.py` file. The agent runs fixed 5-minute experiments (~12/hour, ~100 overnight), measures `val_bpb`, keeps improvements on a feature branch, and merges them. The leverage point shifts from training code to the agent’s research-org instructions.
Why This Matters
This is the first credible demonstration that the iterative-experiment-driven loop, the “bread and butter” of two decades of ML research per Karpathy, can be executed end-to-end by an LLM agent on a real, well-tuned project. It is not a toy benchmark: improvements stack, transfer across scales, and beat hand-tuned baselines. If true at frontier scale, it reframes ML research labor as agent orchestration rather than direct experimentation.
Evidence/Examples
Round 1 results (Karpathy, Mar 9 2026):
- ~2 days of agent runtime on a depth-12 nanochat model
- ~700 autonomous changes attempted, ~20 retained as additive improvements
- All ~20 improvements transferred from d12 to d24, validating small-scale proxy as a search target
- “Time to GPT-2” leaderboard metric: 2.02h → 1.80h (~11% improvement) on top of an already-manually-tuned baseline
Specific findings the agent surfaced (and Karpathy had missed):
- Parameterless QKnorm was missing a scaler multiplier; attention was too diffuse
- Value Embeddings benefited from regularization (none was being applied)
- Banded attention window was too conservative
- AdamW beta values were misconfigured
- Weight decay schedule and network initialization both improved
Design choices that make the loop work:
- Single editable file (`train.py`): keeps diffs reviewable and scope manageable
- Fixed 5-minute wall-clock budget: makes experiments comparable across architectural changes (model size, batch size, optimizer) and across machines, since the agent always optimizes “best result in 5 min on this hardware”
- Single scalar metric (`val_bpb`): vocab-size-independent, so architectural changes are fairly compared
- Branch-and-merge discipline: agent works on a feature branch, merges only on improvement
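The budget-then-merge discipline above can be sketched in a few lines. This is a hypothetical illustration, not code from the repo: `run_experiment`, `BUDGET_SECONDS`, and `autoresearch_loop` are stand-in names, and a random score replaces a real 5-minute `train.py` run whose output would be parsed for `val_bpb`.

```python
import random

BUDGET_SECONDS = 5 * 60  # fixed 5-minute wall-clock budget per experiment

def run_experiment(change_id: int) -> float:
    """Stand-in for one budgeted train.py run; returns val_bpb (lower is better).
    A real loop would launch training under a wall-clock timeout instead."""
    rng = random.Random(change_id)       # deterministic stand-in result
    return 1.0 + rng.uniform(-0.05, 0.05)

def autoresearch_loop(n_attempts: int) -> tuple[float, list[int]]:
    """Try n_attempts changes on a 'feature branch'; merge only improvements."""
    best = run_experiment(0)             # baseline val_bpb
    kept = []
    for change in range(1, n_attempts + 1):
        score = run_experiment(change)   # candidate change, same fixed budget
        if score < best:                 # keep only strict improvements
            best = score
            kept.append(change)          # "merge" the winning change
    return best, kept
```

Because every candidate gets the same wall-clock budget and is scored on the same scalar, wildly different changes (optimizer tweaks, model-size changes) remain directly comparable.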
Karpathy’s meta-observation (Mar 5 tweet): over ~2 weeks, he iterated more on the “meta-setup” tuning agent flows than on the nanochat repo directly. The optimization target itself shifted up a level.
Implications
- The fixed-time-budget + scalar-metric + single-file pattern is generalizable. Karpathy’s claim: any metric that is reasonably efficient to evaluate (or has efficient proxy metrics, e.g. training a smaller network) can be autoresearched by an agent swarm. Worth asking of any optimization problem: does it factor this way?
- Proxy-scale search appears to transfer. d12 improvements transferring cleanly to d24 is the load-bearing assumption for the entire frontier-lab vision. If this holds at larger ratios, it justifies the “swarm tunes small models, promote winners to larger scales” architecture.
- Frontier labs will industrialize this. Karpathy: “All LLM frontier labs will do this. It’s the final boss battle.” The hard parts at scale aren’t conceptual but engineering: multi-file repos, distributed training, multi-agent coordination, scale promotion pipelines.
- `program.md` is the new artifact of expertise. The repo explicitly frames `program.md` as a “super lightweight skill”. This connects directly to the broader pattern in Agent Skills Spec Design Principles: the value-adding human work is increasingly the prompt/skill that conditions the agent, not the code the agent edits.
- Real risk of Goodharting on the proxy metric. Karpathy himself flagged suspicion about ClimbMix being too clean a win on `val_bpb`. An agent optimizing 700 attempts against a single scalar will exploit any weakness in that scalar: classic Goodhart’s Law dynamics applied at machine speed. Robustness of the metric becomes critical infrastructure.
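The “swarm tunes small models, promote winners to larger scales” idea reduces to a filter over the small-scale winners. A minimal sketch, assuming hypothetical `evaluate` and `promote` helpers; a real pipeline would retrain each kept change at d24 and compare its `val_bpb` against the d24 baseline, rather than use the toy scoring below.

```python
D24_BASELINE = 0.85  # assumed baseline val_bpb at the larger scale

def evaluate(change_id: int, depth: int) -> float:
    """Toy stand-in val_bpb for a change at a given model depth (lower is
    better). Pretends 4 of every 5 changes transfer from d12 to d24."""
    base = {12: 1.00, 24: D24_BASELINE}[depth]
    transfers = change_id % 5 != 0
    delta = -0.01 if (depth == 12 or transfers) else +0.01
    return base + delta

def promote(kept_at_d12: list[int]) -> list[int]:
    """Keep only d12 winners that also beat the d24 baseline."""
    return [c for c in kept_at_d12 if evaluate(c, 24) < D24_BASELINE]
```

In round 1 this filter would have been a no-op, since all ~20 d12 improvements transferred; the open question is whether that stays true at larger scale gaps.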
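One possible mitigation for the Goodhart risk, sketched under the assumption that a second metric the agent never optimizes against is available (a holdout split, for instance): accept a change only when the target metric improves and the holdout does not regress. `accept_change` and `slack` are illustrative names, not from the repo.

```python
def accept_change(opt_before: float, opt_after: float,
                  holdout_before: float, holdout_after: float,
                  slack: float = 0.001) -> bool:
    """Guard against Goodharting: require the optimized metric to improve
    while a holdout metric the agent never sees regresses by at most
    `slack`. Lower is better for both metrics."""
    improved = opt_after < opt_before
    not_gamed = holdout_after <= holdout_before + slack
    return improved and not_gamed
```

A genuine improvement moves both metrics together; a change that games only the target trips the holdout check and is rejected before merge.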
Related Ideas
- Agent Skills Spec Design Principles — `program.md` is explicitly framed as a lightweight skill; same pattern of conditioning agents via Markdown rather than code
- Goodhart’s Law — when a measure becomes a target, it ceases to be a good measure; the central failure mode of any metric-driven autoresearch loop
Questions
- How does `program.md` evolve into “research org code”? What does a mature multi-agent `program.md` look like compared to the bare-bones default?
- At what scale does d12→d24 proxy transfer break down? Is there a “scale gap” beyond which improvements stop transferring?
- How do you design metrics that are robust to 700+ adversarial optimization attempts by an agent without succumbing to Goodhart’s Law? Does this require holdout metrics, ensemble metrics, or periodic human spot-checks?
- Karpathy mentions “round 2” with multiple collaborating agents. What coordination patterns work? Independent exploration with merge? Specialist roles (architect, optimizer, debugger)?
- Does the fixed-time-budget framing privilege certain architectures (e.g. shallow/wide over deep/narrow) by accident? If so, does that bias the search trajectory in ways that don’t generalize beyond the budget?
Sources
- karpathy/autoresearch — repository and README
- karpathy/nanochat — parent project the recipe operates on
- Karpathy tweet, Mar 5 2026 — introduction, “post-AGI” framing, meta-setup observation
- Karpathy tweet, Mar 9 2026 — round-1 results, d12→d24 transfer, frontier-lab claim