Core Principle

Andrej Karpathy’s autoresearch inverts the ML researcher’s role: instead of editing Python, the human edits a Markdown file, program.md (a “super lightweight skill”), that instructs an agent to autonomously iterate on a single train.py file. The agent runs fixed 5-minute experiments (~12/hour, ~100 overnight), measures val_bpb, keeps improvements on a feature branch, and merges them. The leverage point shifts from the training code to the agent’s research-org instructions.
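The loop itself is simple to sketch. A minimal simulation of the keep-only-if-better discipline (all names hypothetical; a real harness would launch train.py under a wall-clock budget on a feature branch and parse val_bpb from its logs, rather than fake the result):

```python
import random

def run_experiment(change_id: int, budget_s: int = 300) -> float:
    """Stand-in for one fixed-budget training run, returning val_bpb.
    Hypothetical: a real harness would run train.py for budget_s seconds
    (unused in this stand-in) and read the metric from its output."""
    random.seed(change_id)  # deterministic fake result per change
    return 1.0 + random.uniform(-0.05, 0.05)

def autoresearch(n_attempts: int) -> tuple[float, list[int]]:
    """Merge-only-on-improvement loop: each candidate change is kept
    only if it lowers val_bpb below the running best."""
    best_bpb = run_experiment(0)          # baseline run
    merged: list[int] = []
    for change_id in range(1, n_attempts + 1):
        bpb = run_experiment(change_id)   # candidate on a feature branch
        if bpb < best_bpb:                # lower bits-per-byte is better
            best_bpb = bpb
            merged.append(change_id)      # "merge" the branch
    return best_bpb, merged
```

The ratio in the Round 1 numbers (~700 attempts, ~20 merges) falls out of exactly this structure: most candidates lose to the running best and are discarded.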

Why This Matters

This is the first credible demonstration that the iterative-experiment-driven loop, the “bread and butter” of two decades of ML research per Karpathy, can be executed end-to-end by an LLM agent on a real, well-tuned project. It is not a toy benchmark: improvements stack, transfer across scales, and beat hand-tuned baselines. If true at frontier scale, it reframes ML research labor as agent orchestration rather than direct experimentation.

Evidence/Examples

Round 1 results (Karpathy, Mar 9 2026):

  • ~2 days of agent runtime on a depth-12 nanochat model
  • ~700 autonomous changes attempted, ~20 retained as additive improvements
  • All ~20 improvements transferred from d12 to d24, validating small-scale proxy as a search target
  • “Time to GPT-2” leaderboard metric: 2.02h → 1.80h (~11% improvement) on top of an already-manually-tuned baseline

Specific findings the agent surfaced (and Karpathy had missed):

  • Parameterless QKnorm was missing a scalar multiplier; attention was too diffuse
  • Value Embeddings benefited from regularization (none was being applied)
  • Banded attention window was too conservative
  • AdamW beta values were misconfigured
  • Weight decay schedule and network initialization both improved

Design choices that make the loop work:

  • Single editable file (train.py): keeps diffs reviewable and scope manageable
  • Fixed 5-minute wall-clock budget: makes experiments comparable across architectural changes (model size, batch size, optimizer) and across machines, since the agent always optimizes “best result in 5 min on this hardware”
  • Single scalar metric (val_bpb): vocab-size-independent so architectural changes are fairly compared
  • Branch-and-merge discipline: agent works on a feature branch, merges only on improvement
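The bits-per-byte metric itself is easy to pin down. A sketch of the standard definition (function name illustrative, not nanochat’s API): cross-entropy is summed in nats, converted to bits, and normalized by the raw byte count of the validation text rather than by token count.

```python
import math

def val_bpb(total_loss_nats: float, n_bytes: int) -> float:
    """Bits-per-byte: total validation cross-entropy converted from
    nats to bits, divided by the number of raw bytes in the text.
    Normalizing by bytes (not tokens) makes models with different
    tokenizer vocab sizes directly comparable."""
    return total_loss_nats / math.log(2) / n_bytes
```

Normalizing by tokens instead would reward larger vocabularies (fewer tokens for the same text); dividing by bytes removes that degree of freedom, which is what makes architectural changes fairly comparable under this metric.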

Karpathy’s meta-observation (Mar 5 tweet): over ~2 weeks, he iterated more on the “meta-setup” (the agent flows and their tuning) than on the nanochat repo directly. The optimization target itself shifted up a level.

Implications

  1. The fixed-time-budget + scalar-metric + single-file pattern is generalizable. Karpathy’s claim: any metric that is reasonably efficient to evaluate (or has efficient proxy metrics, e.g. training a smaller network) can be autoresearched by an agent swarm. Worth asking of any optimization problem: does it factor this way?

  2. Proxy-scale search appears to transfer. d12 improvements transferring cleanly to d24 is the load-bearing assumption for the entire frontier-lab vision. If this holds at larger ratios, it justifies the “swarm tunes small models, promote winners to larger scales” architecture.

  3. Frontier labs will industrialize this. Karpathy: “All LLM frontier labs will do this. It’s the final boss battle.” The hard parts at scale aren’t conceptual but engineering: multi-file repos, distributed training, multi-agent coordination, scale promotion pipelines.

  4. program.md is the new artifact of expertise. The repo explicitly frames program.md as a “super lightweight skill”. This connects directly to the broader pattern in Agent Skills Spec Design Principles: the value-adding human work is increasingly the prompt/skill that conditions the agent, not the code the agent edits.

  5. Real risk of Goodharting on the proxy metric. Karpathy himself flagged suspicion that ClimbMix was too clean a win on val_bpb. An agent optimizing 700 attempts against a single scalar will exploit any weakness in that scalar: classic Goodhart’s Law dynamics applied at machine speed. Robustness of the metric becomes critical infrastructure.
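One hedged mitigation (my assumption, not something the repo implements): require the gain to replicate on a holdout split the agent never optimizes against, so merges filter out exploits specific to the primary metric.

```python
def should_merge(primary_before: float, primary_after: float,
                 holdout_before: float, holdout_after: float) -> bool:
    """Merge only if val_bpb improved on BOTH the agent-visible split
    and a hidden holdout split (lower is better). Hypothetical guard,
    not part of nanochat: a Goodharted change typically wins on the
    primary metric while regressing on the holdout, and is rejected."""
    return primary_after < primary_before and holdout_after < holdout_before
```

The cost is extra evaluation per candidate; the benefit is that the scalar the agent hammers on 700 times is no longer the sole gatekeeper.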

Connections

  • Agent Skills Spec Design Principles — program.md is explicitly framed as a lightweight skill; same pattern of conditioning agents via Markdown rather than code
  • Goodhart’s Law — when a measure becomes a target, it ceases to be a good measure; the central failure mode of any metric-driven autoresearch loop

Questions

  • How does program.md evolve into “research org code”? What does a mature multi-agent program.md look like compared to the bare-bones default?
  • At what scale does d12→d24 proxy transfer break down? Is there a “scale gap” beyond which improvements stop transferring?
  • How do you design metrics that are robust to 700+ adversarial optimization attempts by an agent without succumbing to Goodhart’s Law? Does this require holdout metrics, ensemble metrics, or periodic human spot-checks?
  • Karpathy mentions “round 2” with multiple collaborating agents. What coordination patterns work? Independent exploration with merge? Specialist roles (architect, optimizer, debugger)?
  • Does the fixed-time-budget framing privilege certain architectures (e.g. shallow/wide over deep/narrow) by accident? If so, does that bias the search trajectory in ways that don’t generalize beyond the budget?

Sources