Summary
GEPA (Genetic-Pareto) is a sample-efficient prompt optimizer for compound AI systems that uses natural-language reflection on execution and evaluation traces, combined with Pareto-based candidate selection, to evolve module prompts. Across six tasks (HotpotQA, IFBench, HoVer, PUPA, AIME-2025, LiveBench-Math) on Qwen3-8B and GPT-4.1-Mini, GEPA beats GRPO (a leading RL method) by up to 20% while using up to 35× fewer rollouts, and beats MIPROv2 by ~10%+ aggregate. Accepted to ICLR 2026 (Oral). Code: github.com/gepa-ai/gepa.
Key Claims
- Language is a richer learning medium than scalar rewards. Serialized rollouts (instructions, reasoning, tool calls, evaluator messages) carry far more diagnostic signal than the scalar reward GRPO sees, so reflection-based optimizers can extract more learning per rollout.
- Reflective prompt mutation works. GEPA samples rollouts on a minibatch, gathers a feedback_text from a feedback function µ_f (compiler errors, failed rubrics, per-hop signals), and asks a reflection LM to attribute success/failure to specific prompt elements and propose a revised module instruction. Round-robin module selection across |M| modules.
- Pareto-based candidate selection beats greedy. Instead of always mutating the global best, GEPA records the per-instance best across all candidates, prunes dominated candidates, and stochastically samples from the Pareto frontier weighted by how many tasks each candidate “wins”. This avoids the local-optimum trap that hits SelectBest (TextGrad-style) and BeamSearch (APO-style); +7.33% and +6.4% aggregate over those baselines on Qwen3-8B.
- Instruction-only optimization now beats joint instruction+few-shot optimization. A reversal of prior findings (Wan et al., 2024); attributed to better instruction-following in modern LLMs plus GEPA’s reflective design. GEPA prompts are also up to 9.2× shorter than MIPROv2’s, reducing inference cost and latency.
- Cross-model generalization. Prompts optimized on Qwen3-8B and evaluated on GPT-4.1-Mini (“GEPA-Qwen-Opt”) still gained +9% aggregate, beating MIPROv2/TextGrad/Trace which optimized directly on GPT-4.1-Mini.
- Inference-time search. Passing the eval set as D_train lets GEPA “overfit” deliberately. On NPUEval (AMD XDNA2 kernels) with GPT-4o, GEPA pushes a sequential-refinement agent from 4.25% mean vector utilization (no RAG) to 30.52%, beating MIPROv2 + RAG (19.03%). On KernelBench (CUDA, 35-task subset) GEPA pushes fast_1 from ~0% to 20%+.
- Adversarial use. Inverting the reward yields universal task-preserving distractor prompts; on AIME-2025 with GPT-5-Mini, an adversarial prefix (random trivia + strict format directive) dropped pass@1 from 76% to 10%.
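The Pareto-based selection rule from the claims above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `scores` layout (candidate name mapped to a list of per-instance scores), the function names, and the use of plain score-vector dominance for pruning are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional


def pareto_weights(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Return {candidate: number of instances it wins} for frontier candidates.

    scores[c][i] is candidate c's score on instance i (hypothetical layout).
    """
    names = list(scores)
    n_instances = len(scores[names[0]])

    # 1. Per instance, record every candidate that ties for the best score.
    winners = []
    for i in range(n_instances):
        best = max(scores[c][i] for c in names)
        winners.append({c for c in names if scores[c][i] == best})

    # 2. Assumed dominance test: a is at least as good as b everywhere
    #    and strictly better on at least one instance.
    def dominates(a: str, b: str) -> bool:
        pairs = list(zip(scores[a], scores[b]))
        return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

    # 3. Start from candidates that win somewhere, then prune dominated ones.
    frontier = set().union(*winners)
    frontier = {c for c in frontier
                if not any(dominates(o, c) for o in frontier if o != c)}

    # 4. Weight each survivor by how many instances it wins.
    return {c: sum(c in w for w in winners) for c in sorted(frontier)}


def sample_candidate(scores: Dict[str, List[float]],
                     rng: Optional[random.Random] = None) -> str:
    """Stochastically pick a frontier candidate, proportional to its win count."""
    rng = rng or random.Random()
    weights = pareto_weights(scores)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


toy_scores = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 1.0],
    "c": [0.5, 0.5, 0.5],  # best average of none, never best on any instance
    "d": [0.0, 1.0, 0.0],  # ties on instance 1 but is dominated by "b"
}
print(pareto_weights(toy_scores))
```

In the toy table, `b` wins two instances and `a` wins one, so `b` is drawn roughly twice as often as `a`. Candidate `c` is never the best on any instance and `d` is dominated by `b`, so neither is ever selected. A greedy rule would instead keep mutating whichever single candidate has the highest mean.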
Headline Numbers
| Setting | Aggregate gain over baseline |
|---|---|
| GEPA on Qwen3-8B | +9.62% (up to 20% on HotpotQA) |
| GRPO on Qwen3-8B (24k rollouts) | +3.68% |
| MIPROv2 on Qwen3-8B | +2.61% |
| GEPA+Merge on GPT-4.1-Mini | +13.33% |
| MIPROv2 on GPT-4.1-Mini | +5.64% |
| GEPA-Qwen-Opt → GPT-4.1-Mini (transfer) | +9.00% |
GEPA’s rollout budgets ranged 1.8k–7.1k vs GRPO’s 24k. The majority of GEPA’s budget is spent on validation tracking, not learning signal: only 79–737 train rollouts are needed to reach the optimum.
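As a rough sanity check on the budget figures just quoted, the ratios can be computed directly. The sketch below uses only the rounded numbers from this section (24k, 1.8k–7.1k, 79–737); the variable and function names are illustrative, and the 35× headline figure is the paper's own comparison and is not re-derived here.

```python
# Rounded rollout counts as quoted in this section.
GRPO_ROLLOUTS = 24_000
GEPA_TOTAL_RANGE = (1_800, 7_100)   # full budget, including validation tracking
GEPA_TRAIN_RANGE = (79, 737)        # train rollouts needed to reach the optimum


def grpo_ratio(gepa_rollouts: int) -> float:
    """How many times more rollouts GRPO used than the given GEPA count."""
    return GRPO_ROLLOUTS / gepa_rollouts


# Ratios against GEPA's full budget (smallest and largest budget).
total_ratios = [round(grpo_ratio(n), 1) for n in GEPA_TOTAL_RANGE]
# Ratios when only train rollouts are counted.
train_ratios = [round(grpo_ratio(n), 1) for n in GEPA_TRAIN_RANGE]

print("vs. total budget:", total_ratios)
print("vs. train rollouts only:", train_ratios)
```

Counting the full budget, GRPO uses roughly 3.4× to 13.3× more rollouts; counting only train rollouts, the gap is far larger. Which denominator is used matters a great deal when comparing efficiency claims.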
Method (in one paragraph)
A compound AI system Φ = (M, C, X, Y) has a set of LLM modules M, each with prompt π_i (frozen weights θ_i). GEPA evolves Π_Φ only. It maintains a candidate pool P, selects a candidate via the Pareto rule, picks one module via round-robin, runs the candidate on a minibatch, gathers (score, feedback_text) from µ_f, asks a reflection LM to rewrite that module’s prompt, and re-evaluates on the minibatch. If the new variant beats its parent, it joins the pool and is evaluated on D_pareto for selection bookkeeping. Optional crossover (“Merge”) pulls the best version of each module from two distinct lineages.
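The loop in the paragraph above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released GEPA code: `run`, `reflect`, and `select` are hypothetical callables standing in for the system rollout plus µ_f, the reflection LM, and the Pareto selection rule, and the rollout accounting and the minibatch size of 3 are choices made for the example. Merge is omitted.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Prompts = Dict[str, str]        # module name -> prompt text
Feedback = Tuple[float, str]    # (score, feedback_text) as returned by mu_f


def gepa_loop(
    seed: Prompts,
    train: List[object],
    pareto_set: List[object],
    run: Callable[[Prompts, object], Feedback],       # hypothetical: rollout + mu_f
    reflect: Callable[[str, str, List[str]], str],    # hypothetical: reflection LM
    select: Callable[[List[List[float]]], int],       # hypothetical: Pareto rule
    budget: int,
    batch_size: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Prompts]:
    rng = rng or random.Random(0)
    modules = list(seed)
    pool = [seed]
    # Per-instance scores on D_pareto, one row per candidate in the pool.
    pareto_scores = [[run(seed, x)[0] for x in pareto_set]]
    rollouts = len(pareto_set)
    step = 0

    while rollouts < budget:
        parent = pool[select(pareto_scores)]
        module = modules[step % len(modules)]   # round-robin module choice
        step += 1

        # Run the parent on a minibatch and collect scores plus text feedback.
        batch = rng.sample(train, min(batch_size, len(train)))
        results = [run(parent, x) for x in batch]
        parent_score = sum(score for score, _ in results)
        rollouts += len(batch)

        # Ask the reflection step to rewrite only the chosen module's prompt.
        feedback = [text for _, text in results]
        child = {**parent, module: reflect(module, parent[module], feedback)}

        # Re-evaluate the child on the same minibatch.
        child_score = sum(run(child, x)[0] for x in batch)
        rollouts += len(batch)

        # Accept only if the child beats its parent, then score it on D_pareto.
        if child_score > parent_score:
            pool.append(child)
            pareto_scores.append([run(child, x)[0] for x in pareto_set])
            rollouts += len(pareto_set)

    return pool
```

Note how the `pareto_set` evaluations are charged to the same budget as the minibatch rollouts: every accepted child costs `len(pareto_set)` extra rollouts that provide selection bookkeeping rather than new feedback text.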
My Reactions
- The framing is the strongest part: the argument that the trace itself carries signal that scalar rewards collapse away is a clean lens. It reframes prompt optimization as credit assignment with a richer feedback channel rather than as discrete search.
- The Pareto sampling result is the part I’d expect to generalize beyond GEPA: greedy selection collapsing into a single lineage matches what shows up in many evolutionary-search settings, and the per-instance Pareto front is a cheap diversity mechanism.
- Caveat: most of GEPA’s “budget” is validation-tracking rollouts, not learning rollouts. The 35× sample-efficiency framing leans on that distinction. Headline still holds, but the implementation cost depends on how D_pareto is sized.
- Unresolved: Merge helps on GPT-4.1-Mini but degrades Qwen3-8B with the same hyperparameters. Adaptive scheduling of mutation vs. crossover is open.
- The adversarial result (76% → 10% pass@1 from injected trivia + strict format directive) is a useful red-team handle worth tracking in eval harnesses.
Connections
- DSPy — GEPA targets the same compound-AI-system abstraction; co-authored by Khattab, Zaharia, Potts.
- MIPROv2 — direct prior SOTA; GEPA more than doubles its aggregate gain.
- GRPO / RLVR — the contrast point; GEPA’s claim is that reflective prompt evolution dominates RL in low-rollout regimes.
- TextGrad, Trace (OptoPrime), APO — other prompt-space optimizers; GEPA’s Pareto sampling is the differentiator vs. their greedy/beam strategies.
- Reflexion, Self-Refine — reflection-as-feedback predecessors operating at inference time rather than as an outer optimization loop.
- AlphaEvolve, OpenEvolve, EvoPrompt — evolutionary-search relatives; GEPA distinguishes itself with per-instance Pareto fronts and textual feedback.
- Quality-Diversity Search / MAP-Elites (Mouret & Clune, 2015) — the “illumination” lineage GEPA borrows from.