Core Principle

Andrej Karpathy’s autoresearch inverts the ML researcher’s role: instead of editing Python, the human edits a Markdown file, program.md (a “super lightweight skill”), that instructs an agent to autonomously iterate on a single train.py file. The agent runs fixed 5-minute experiments (~12/hour, ~100 overnight), measures val_bpb, keeps improvements on a feature branch, and merges them. The leverage point shifts from the training code to the agent’s research-org instructions.
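The loop itself is simple to sketch. A minimal simulation of the keep-only-if-better discipline (all names hypothetical; a real harness would launch train.py under a wall-clock budget on a feature branch and parse val_bpb from its logs, rather than fake the result):

```python
import random

def run_experiment(change_id: int, budget_s: int = 300) -> float:
    """Stand-in for one fixed-budget training run, returning val_bpb.
    Hypothetical: a real harness would run train.py for budget_s seconds
    (unused in this stand-in) and read the metric from its output."""
    random.seed(change_id)  # deterministic fake result per change
    return 1.0 + random.uniform(-0.05, 0.05)

def autoresearch(n_attempts: int) -> tuple[float, list[int]]:
    """Merge-only-on-improvement loop: each candidate change is kept
    only if it lowers val_bpb below the running best."""
    best_bpb = run_experiment(0)          # baseline run
    merged: list[int] = []
    for change_id in range(1, n_attempts + 1):
        bpb = run_experiment(change_id)   # candidate on a feature branch
        if bpb < best_bpb:                # lower bits-per-byte is better
            best_bpb = bpb
            merged.append(change_id)      # "merge" the branch
    return best_bpb, merged
```

The ratio in the Round 1 numbers (~700 attempts, ~20 merges) falls out of exactly this structure: most candidates lose to the running best and are discarded.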

Why This Matters

This is the first credible demonstration that the iterative-experiment-driven loop, the “bread and butter” of two decades of ML research per Karpathy, can be executed end-to-end by an LLM agent on a real, well-tuned project. It is not a toy benchmark: improvements stack, transfer across scales, and beat hand-tuned baselines. If true at frontier scale, it reframes ML research labor as agent orchestration rather than direct experimentation.

Evidence/Examples

Round 1 results (Karpathy, Mar 9 2026):

  • ~2 days of agent runtime on a depth-12 nanochat model
  • ~700 autonomous changes attempted, ~20 retained as additive improvements
  • All ~20 improvements transferred from d12 to d24, validating small-scale proxy as a search target
  • “Time to GPT-2” leaderboard metric: 2.02h → 1.80h (~11% improvement) on top of an already-manually-tuned baseline

Specific findings the agent surfaced (and Karpathy had missed):

  • Parameterless QKnorm was missing a scalar multiplier; attention was too diffuse
  • Value Embeddings benefited from regularization (none was being applied)
  • Banded attention window was too conservative
  • AdamW beta values were misconfigured
  • Weight decay schedule and network initialization both improved

Design choices that make the loop work:

  • Single editable file (train.py): keeps diffs reviewable and scope manageable
  • Fixed 5-minute wall-clock budget: makes experiments comparable across architectural changes (model size, batch size, optimizer) and across machines, since the agent always optimizes “best result in 5 min on this hardware”
  • Single scalar metric (val_bpb): vocab-size-independent so architectural changes are fairly compared
  • Branch-and-merge discipline: agent works on a feature branch, merges only on improvement
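The bits-per-byte metric itself is easy to pin down. A sketch of the standard definition (function name illustrative, not nanochat’s API): cross-entropy is summed in nats, converted to bits, and normalized by the raw byte count of the validation text rather than by token count.

```python
import math

def val_bpb(total_loss_nats: float, n_bytes: int) -> float:
    """Bits-per-byte: total validation cross-entropy converted from
    nats to bits, divided by the number of raw bytes in the text.
    Normalizing by bytes (not tokens) makes models with different
    tokenizer vocab sizes directly comparable."""
    return total_loss_nats / math.log(2) / n_bytes
```

Normalizing by tokens instead would reward larger vocabularies (fewer tokens for the same text); dividing by bytes removes that degree of freedom, which is what makes architectural changes fairly comparable under this metric.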

Karpathy’s meta-observation (Mar 5 tweet): over ~2 weeks, he iterated more on the “meta-setup” (the agent flows and their tuning) than on the nanochat repo directly. The optimization target itself shifted up a level.

Implications

  1. The fixed-time-budget + scalar-metric + single-file pattern is generalizable. Karpathy’s claim: any metric that is reasonably efficient to evaluate (or has efficient proxy metrics, e.g. training a smaller network) can be autoresearched by an agent swarm. Worth asking of any optimization problem: does it factor this way?

  2. Proxy-scale search appears to transfer. d12 improvements transferring cleanly to d24 is the load-bearing assumption for the entire frontier-lab vision. If this holds at larger ratios, it justifies the “swarm tunes small models, promote winners to larger scales” architecture.

  3. Frontier labs will industrialize this. Karpathy: “All LLM frontier labs will do this. It’s the final boss battle.” The hard parts at scale aren’t conceptual but engineering: multi-file repos, distributed training, multi-agent coordination, scale promotion pipelines.

  4. program.md is the new artifact of expertise. The repo explicitly frames program.md as a “super lightweight skill”. This connects directly to the broader pattern in Agent Skills Spec Design Principles: the value-adding human work is increasingly the prompt/skill that conditions the agent, not the code the agent edits.

  5. Real risk of Goodharting on the proxy metric. Karpathy himself flagged suspicion that ClimbMix was too clean a win on val_bpb. An agent optimizing 700 attempts against a single scalar will exploit any weakness in that scalar: classic Goodhart’s Law dynamics applied at machine speed. Robustness of the metric becomes critical infrastructure.
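One hedged mitigation (my assumption, not something the repo implements): require the gain to replicate on a holdout split the agent never optimizes against, so merges filter out exploits specific to the primary metric.

```python
def should_merge(primary_before: float, primary_after: float,
                 holdout_before: float, holdout_after: float) -> bool:
    """Merge only if val_bpb improved on BOTH the agent-visible split
    and a hidden holdout split (lower is better). Hypothetical guard,
    not part of nanochat: a Goodharted change typically wins on the
    primary metric while regressing on the holdout, and is rejected."""
    return primary_after < primary_before and holdout_after < holdout_before
```

The cost is extra evaluation per candidate; the benefit is that the scalar the agent hammers on 700 times is no longer the sole gatekeeper.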

Connections

  • Agent Skills Spec Design Principles — program.md is explicitly framed as a lightweight skill; same pattern of conditioning agents via Markdown rather than code
  • Goodhart’s Law — when a measure becomes a target, it ceases to be a good measure; the central failure mode of any metric-driven autoresearch loop

Questions

  • How does program.md evolve into “research org code”? What does a mature multi-agent program.md look like compared to the bare-bones default?
  • At what scale does d12→d24 proxy transfer break down? Is there a “scale gap” beyond which improvements stop transferring?
  • How do you design metrics that are robust to 700+ adversarial optimization attempts by an agent without succumbing to Goodhart’s Law? Does this require holdout metrics, ensemble metrics, or periodic human spot-checks?
  • Karpathy mentions “round 2” with multiple collaborating agents. What coordination patterns work? Independent exploration with merge? Specialist roles (architect, optimizer, debugger)?
  • Does the fixed-time-budget framing privilege certain architectures (e.g. shallow/wide over deep/narrow) by accident? If so, does that bias the search trajectory in ways that don’t generalize beyond the budget?

Sources