GEPA’s optimize_anything: evolving a prompt that makes a small model draw a GIF
Optimize anything, not just prompts
Most prompt optimizers reduce a run to a single scalar reward and search over that. GEPA (Genetic-Pareto) does something different: it hands the full execution trace — error messages, metrics, stdout, even rendered images — to a reflection LLM, which reads why a candidate did poorly and proposes a targeted rewrite. It then keeps a Pareto frontier of candidates rather than a single best.
optimize_anything is GEPA’s universal entry point. The artifact under optimization can be a prompt, a block of code, an agent’s control flow, or a config — anything expressible as text. You provide two things:
- A seed candidate (the starting artifact, as a string or dict).
- An evaluator that runs the candidate, returns a numeric score, and emits diagnostics (“Actionable Side Information”, ASI) that the reflection model reads.
Everything else — mutation prompts, candidate selection, Pareto search — is handled internally. There are three modes, selected by whether you pass a dataset/valset: single-task (solve one problem; the candidate is the solution), multi-task (a batch of related problems with cross-transfer), and generalization (a skill that must transfer to unseen examples). This note uses single-task search.
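To make the difference between a scalar and a diagnostic concrete, here is a toy sketch. Everything in it is hypothetical: `toy_evaluator` and its phrase rubric are invented for illustration and are not part of GEPA. Two candidates earn the same score, yet only the diagnostics say what each one is missing.

```python
def toy_evaluator(candidate: str) -> tuple[float, list[str]]:
    """Hypothetical rubric: score a prompt by which required phrases it mentions."""
    required = ["24 frames", "seamless loop", "under 100 KB"]
    missing = [phrase for phrase in required if phrase not in candidate]
    score = 1.0 - len(missing) / len(required)
    # The diagnostics are what a reflection step would read; the score alone
    # cannot tell these two failure modes apart.
    diagnostics = [f"missing requirement: {phrase!r}" for phrase in missing]
    return score, diagnostics


score_a, notes_a = toy_evaluator("Render 24 frames as a seamless loop.")
score_b, notes_b = toy_evaluator("Render 24 frames, under 100 KB.")
print(score_a, notes_a)
print(score_b, notes_b)
```

A search that sees only the two identical scores has nothing to separate these candidates; one that reads the notes knows which sentence to add to each.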
The example: make a small model reliably produce a GIF
Small open-weights models are cheap but inconsistent at one-shot code generation. The exact failure here is subtle: ask qwen-2.5-7b for “a bouncing-ball GIF” and it does produce a valid GIF — but the frame count wanders (6, 20, 10, 6 across runs), the loop often doesn’t close (the ball ends mid-air), and the playback speed is whatever the model felt like. That inconsistency is exactly what a sharper system prompt can fix.
So the setup is:
- A LangGraph “animator” agent — a small open-weights model (routed through OpenRouter) writes Pillow code, runs it, and self-repairs on errors.
- The artifact GEPA optimizes is that agent’s system prompt.
- The evaluator renders the resulting GIF and scores it on four concrete things: exactly 24 frames, a seamless loop (ball returns to its start), ~1 s total playback, and small file size.
- GEPA’s reflection model reads those diagnostics and evolves the prompt.
The result GEPA found, end to end: a seamless 23-frame loop at 200×200, 40 ms/frame (the animated GIF is not reproduced here).

Setup
```bash
pip install gepa langgraph langchain-openai pillow numpy
```

```bash
# OpenRouter key: read from env, never commit it.
export OPENROUTER_API_KEY="sk-or-v1-..."
```

```python
import os, re, tempfile, traceback
from typing import TypedDict

import numpy as np
from PIL import Image, ImageChops
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langgraph.graph import StateGraph, START, END

# A small open-weights model on OpenRouter. Swap freely; see openrouter.ai/models.
SMALL_MODEL = "qwen/qwen-2.5-7b-instruct"
TARGET_FRAMES = 24


def small_llm() -> ChatOpenAI:
    # OpenRouter is OpenAI-compatible: point ChatOpenAI at its base_url.
    # temperature=0.2 keeps sampling noise low (not zero) so the score mostly
    # reflects the *prompt*, not sampling luck; this matters when an optimizer
    # reads that score.
    return ChatOpenAI(
        model=SMALL_MODEL,
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
        temperature=0.2,
    )


def extract_code(text: str) -> str:
    # Pull the first fenced block out of the reply; fall back to the raw text.
    m = re.search(r"```(?:python)?\n(.*?)```", text, re.S)
    return m.group(1) if m else text
```
The LangGraph animator
A three-node graph: generate code → run it → repair on failure, for at most three attempts in total (one generation plus up to two repairs). The system prompt, the thing GEPA evolves, is captured in a closure by build_graph, so each candidate compiles its own graph.
```python
class GenState(TypedDict):
    code: str
    error: str
    gif_path: str
    attempts: int


def build_graph(system_prompt: str):
    llm = small_llm()

    def generate(state: GenState) -> dict:
        msg = llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content="Write the function. Animation: a red ball "
                                 "bouncing up and down on a white background"),
        ])
        return {"code": extract_code(msg.content), "attempts": state["attempts"] + 1}

    def run(state: GenState) -> dict:
        # WARNING: exec'ing model-written code. Only run this in a sandbox/container.
        path = tempfile.mktemp(suffix=".gif")
        try:
            ns: dict = {}
            exec(state["code"], ns)  # candidate is expected to define make_gif(path)
            ns["make_gif"](path)
            Image.open(path).verify()  # did it actually write a real GIF?
            return {"gif_path": path, "error": ""}
        except Exception:
            return {"gif_path": "", "error": traceback.format_exc(limit=3)}

    def repair(state: GenState) -> dict:
        msg = llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=(
                f"This code:\n{state['code']}\n\nfailed with:\n{state['error']}\n\n"
                "Return the corrected function only."
            )),
        ])
        return {"code": extract_code(msg.content), "attempts": state["attempts"] + 1}

    def route(state: GenState) -> str:
        # Retry only while there is an error and the attempt budget is not spent.
        return "repair" if state["error"] and state["attempts"] < 3 else END

    g = StateGraph(GenState)
    g.add_node("generate", generate)
    g.add_node("run", run)
    g.add_node("repair", repair)
    g.add_edge(START, "generate")
    g.add_edge("generate", "run")
    g.add_conditional_edges("run", route, {"repair": "repair", END: END})
    g.add_edge("repair", "run")
    return g.compile()
```
exec() on model-generated code is genuinely dangerous: a confused model can emit code that deletes files or makes network calls. In anything beyond a throwaway demo, run the candidate in a locked-down subprocess, container, or microVM, not in your optimizer process. I ran this on a throwaway machine.
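As a first step in that direction, the sketch below moves the candidate into a child interpreter with a timeout and without the API key in its environment. The helper name `run_candidate_isolated` is hypothetical and not part of the original setup. It gives process isolation only: the child can still touch the filesystem and the network, so it is not a substitute for a container or microVM.

```python
import os
import subprocess
import sys
import tempfile


def run_candidate_isolated(code: str, gif_path: str, timeout_s: float = 20.0) -> str:
    """Run candidate code in a child interpreter; return "" on success, else an error string.

    Hypothetical helper: adds a separate process and a timeout, NOT a sandbox.
    """
    # Driver script: the candidate's definitions followed by one call to make_gif.
    driver = f"{code}\n\nmake_gif({gif_path!r})\n"
    # Do not hand the API key to model-written code.
    env = {k: v for k, v in os.environ.items() if k != "OPENROUTER_API_KEY"}
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "candidate.py")
        with open(script, "w", encoding="utf-8") as fh:
            fh.write(driver)
        try:
            proc = subprocess.run(
                [sys.executable, script],
                cwd=workdir, env=env, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return f"timed out after {timeout_s}s"
    if proc.returncode != 0:
        return proc.stderr[-500:]  # tail of the traceback, enough for a repair prompt
    return ""
```

In the run node, a call like this could stand in for the exec() lines, with the returned string feeding the error field; checking the written file with Image.open(path).verify() can stay in the parent process.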
Scoring a GIF (the objective GEPA optimizes against)
First, render the GIF and pull out raw metrics — frame count, frame-to-frame motion, how far the last frame is from the first (loop seam), duration, size:
```python
def render_and_measure(system_prompt: str):
    app = build_graph(system_prompt)
    final = app.invoke({"code": "", "error": "", "gif_path": "", "attempts": 0})
    if not final["gif_path"]:
        return {"ok": False, "error": final["error"], "attempts": final["attempts"]}

    img = Image.open(final["gif_path"])
    n_frames = getattr(img, "n_frames", 1)
    motion, prev, first, last = 0.0, None, None, None
    for i in range(n_frames):
        img.seek(i)
        frame = np.asarray(img.convert("L"), dtype=np.int16)  # vectorised; fast even at 100+ frames
        if i == 0:
            first = frame
        last = frame
        if prev is not None:
            motion += float(np.abs(frame - prev).sum())
        prev = frame
    motion /= max(n_frames - 1, 1)  # mean absolute change per frame step
    loop_gap = float(np.abs(last - first).sum()) if first is not None else 0.0
    return {
        "ok": True, "frames": n_frames, "motion": motion, "loop_gap": loop_gap,
        "duration": img.info.get("duration"), "size_kb": os.path.getsize(final["gif_path"]) / 1024,
        "attempts": final["attempts"],
    }
```
Then turn those into a score in [0, 1]. The one non-obvious trick: normalise the loop seam by the GIF’s own average motion. The absolute pixel difference between the first and last frame depends on the canvas and ball size the model happened to pick, but a seamless loop closes about as smoothly as any other frame step, so loop_gap / motion ≈ 1 is the size-independent signal:
```python
def score_metrics(m: dict):
    if not m.get("ok"):
        return 0.0, {"failed": True, "error": m.get("error", "")[:300]}
    n, dur, motion = m["frames"], (m.get("duration") or 0), m["motion"]
    moving = 1.0 if motion > 5_000 else 0.0  # not a static frame
    frame_acc = max(0.0, 1 - abs(n - TARGET_FRAMES) / TARGET_FRAMES)
    loop_ratio = m["loop_gap"] / (motion + 1.0)  # ~1 == seamless
    loop_score = moving * max(0.0, 1 - max(0.0, loop_ratio - 1.0) / 3.0)
    play_score = max(0.0, 1 - abs(n * dur - 1000) / 1000)  # aim ~1 s playback
    size_score = 1.0 if m["size_kb"] < 100 else 0.0
    score = (0.20 * moving + 0.30 * frame_acc + 0.20 * loop_score
             + 0.20 * play_score + 0.10 * size_score)
    return round(score, 3), {
        "frames": n, "loop_ratio": round(loop_ratio, 2),
        "duration_ms": dur, "play_score": round(play_score, 2),
        "size_kb": round(m["size_kb"], 1), "attempts": m["attempts"],
    }
```
Finally, the evaluator GEPA actually calls. Diagnostics go out through oa.log(): GEPA captures those lines per evaluation and feeds them to the reflection model so it can see why a prompt under-performed. Because the model is still slightly stochastic, each call takes two samples and returns their mean (not the max), so the optimizer rewards a prompt that is reliably good, not just occasionally lucky:
```python
import gepa.optimize_anything as oa


def evaluate(candidate: str) -> float:
    scores = []
    for _ in range(2):
        score, info = score_metrics(render_and_measure(candidate))
        oa.log(f"score={score} {info}")  # captured as Actionable Side Information
        scores.append(score)
    return sum(scores) / len(scores)
```
Running optimize_anything
The one place the published docs lag the package (gepa==0.1.1): GEPAConfig requires all three of engine, reflection, and tracking, and EngineConfig requires max_workers. The minimal-but-complete call:
```python
from gepa.optimize_anything import GEPAConfig, EngineConfig, ReflectionConfig, TrackingConfig

SEED_PROMPT = (
    "You are a Python animator. Write a single function `make_gif(path)` that uses "
    "Pillow to render an animated GIF and save it to `path`. Output only code."
)

# GEPA's reflection LM goes through LiteLLM, which reads OPENROUTER_API_KEY and uses
# the `openrouter/<model-id>` prefix, a DIFFERENT string format than the
# `base_url` + `provider/model` style ChatOpenAI uses above.
config = GEPAConfig(
    engine=EngineConfig(max_metric_calls=12, max_workers=2, raise_on_exception=False),
    reflection=ReflectionConfig(reflection_lm="openrouter/qwen/qwen-2.5-7b-instruct"),
    tracking=TrackingConfig(),
)

result = oa.optimize_anything(
    seed_candidate=SEED_PROMPT,
    evaluator=evaluate,  # single-task mode: no dataset/valset
    objective=(
        "Evolve the system prompt so the small model reliably writes Pillow code that "
        "renders a seamless looping GIF of a red ball bouncing on white: exactly 24 "
        "frames, the ball returns to its start, one loop plays in ~1 s, under 100 KB."
    ),
    config=config,
)
print(result.best_candidate)  # the evolved system prompt
```
What actually happened
I ran exactly this, end to end, against the live OpenRouter API. The evolved prompt didn’t just score higher — it made the small model consistent, which was the real problem. Four fresh samples from each prompt:
| System prompt | Mean score | The four samples | Frames | Loop |
|---|---|---|---|---|
| Seed (hand-written) | 0.625 | 0.58, 0.75, 0.80, 0.38 | 6 / 20 / 10 / 6 | mostly broken |
| GEPA-evolved (qwen-7b reflection) | 0.972 | 0.97, 0.97, 0.97, 0.97 | 23 / 23 / 23 / 23 | seamless |
The seed swings between 0.38 and 0.80 run-to-run; the evolved prompt scored a stable 0.972 on all four samples. It cost 12 metric calls, i.e. GEPA ran the evaluator about 12 times, and each evaluation renders and scores two GIFs.
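A quick arithmetic check shows where the last few hundredths went. This sketch re-applies the weights from score_metrics to the evolved GIF's reported metrics (23 frames at 40 ms/frame). It assumes the GIF counted as moving, closed its loop at a ratio of 1 or below, and stayed under 100 KB; the table implies all three but does not list them, so the `loop_ratio` and `size_kb` values passed in are placeholders. The `breakdown` helper is hypothetical.

```python
TARGET_FRAMES = 24


def breakdown(frames: int, duration_ms: int, loop_ratio: float, size_kb: float) -> dict:
    """Per-term contributions using the same weights as score_metrics (moving assumed)."""
    frame_acc = max(0.0, 1 - abs(frames - TARGET_FRAMES) / TARGET_FRAMES)
    loop_score = max(0.0, 1 - max(0.0, loop_ratio - 1.0) / 3.0)
    play_score = max(0.0, 1 - abs(frames * duration_ms - 1000) / 1000)
    size_score = 1.0 if size_kb < 100 else 0.0
    terms = {
        "moving": 0.20 * 1.0,
        "frames": 0.30 * frame_acc,
        "loop": 0.20 * loop_score,
        "playback": 0.20 * play_score,
        "size": 0.10 * size_score,
    }
    terms["total"] = sum(terms.values())
    return terms


# Evolved GIF: 23 frames at 40 ms; loop_ratio and size_kb are assumed values.
evolved = breakdown(frames=23, duration_ms=40, loop_ratio=1.0, size_kb=50.0)
for name, value in evolved.items():
    print(f"{name:9s} {value:.4f}")
```

Under those assumptions the total comes to about 0.9715, in line with the reported 0.972. The only points lost are for the one missing frame (0.2875 of a possible 0.30) and for a 920 ms rather than 1000 ms loop (0.184 of a possible 0.20).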
The interesting part is how it won. GEPA’s reflection step didn’t merely reword the instructions — it tightened them and appended a full reference implementation to the prompt:
```text
You are a Python animator. Write a single function `make_gif(path)` … with exactly 24 frames … Ensure the ball returns to its starting position, creating a seamless loop. Choose the frame duration so one loop plays in about 1 second. The final GIF should be under 100 KB. Output only code.

from PIL import Image, ImageDraw
import numpy as np

def make_gif(path):
    frame_duration = 41  # ~1 second for 24 frames
    ...  # a working bounce, ball_x = linspace there-and-back, save with loop=0
```
That is a legitimate — and well-known — prompt-optimization move: when the objective is precise, the cheapest way to make a small model reliable is to hand it a worked template in its system prompt. GEPA discovered that from the score diagnostics alone, with no hint that “add an example” was an option.
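One detail in that template deserves a note: it asks for 41 ms per frame, while the finished GIF reports 40 ms. A likely explanation, offered here as an inference rather than something the run logged, is that the GIF format stores frame delays in hundredths of a second, so 41 ms cannot be represented exactly. The helper below is hypothetical and assumes the encoder rounds down.

```python
def stored_delay_ms(requested_ms: int) -> int:
    """Delay a GIF can actually store: whole hundredths of a second (10 ms units).

    Assumption: the encoder truncates rather than rounds. For 41 ms both give 40.
    """
    return (requested_ms // 10) * 10


per_frame = stored_delay_ms(41)
for frames in (23, 24):
    print(f"{frames} frames x {per_frame} ms = {frames * per_frame} ms per loop")
```

With 23 frames that gives a 920 ms loop, which is the playback figure the scorer would see for this GIF.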
Small vs. frontier reflection model
GEPA’s guidance recommends a frontier model for reflection_lm, so I ran the identical setup again with openrouter/anthropic/claude-sonnet-4.6 doing the reflection — same 12-call budget, same small task model. The result was a useful surprise:
| reflection_lm | Seed mean | Evolved mean (4 fresh samples) |
|---|---|---|
| qwen-2.5-7b (small) | 0.625 | 0.972 (0.97 ×4, rock-steady) |
| claude-sonnet-4.6 (frontier) | 0.590 | 0.624 (0.39, 0.83, 0.63, 0.65) |
Same budget, and the small reflection model won decisively. Don’t over-read this — it is one run each on a noisy, 12-call budget, not evidence that small beats frontier in general. The instructive part is why: the qwen run’s winning move was to paste a complete working implementation into the prompt, a blunt and near-unbeatable lever for making a weak task model reliable. The frontier run spent its budget on cleaner instruction-only rewrites that nailed the frame count (21–22) but never closed the loop (loop ratio stuck ~2.9). At a budget this small, the outcome is dominated by which lever the reflection model happens to pull, and by run-to-run variance — not by reflection-model horsepower. To draw a real conclusion you’d run several seeds of each and compare distributions. Treat the frontier recommendation as the safe default for subtle objectives, not a guarantee that a bigger reflector wins at tiny budgets.
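To see what a loop ratio near 3 means under the scoring rule, the sketch below rebuilds the loop-seam metric on synthetic data. It is a simplified stand-in, not the evaluator itself: frames are 1-D pixel rows rather than images, the `moving` gate is left out, and the helper names and ball geometry are made up for illustration.

```python
def frame(ball_pos: int, width: int = 60, ball: int = 9) -> list[int]:
    """A 1-D stand-in for a grayscale frame: white (255) with a dark ball (0)."""
    return [0 if ball_pos <= x < ball_pos + ball else 255 for x in range(width)]


def abs_diff(a: list[int], b: list[int]) -> int:
    """Summed absolute pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b))


def loop_ratio(positions: list[int]) -> float:
    """loop_gap / (mean frame-to-frame motion + 1), mirroring score_metrics."""
    frames = [frame(p) for p in positions]
    steps = [abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    motion = sum(steps) / len(steps)
    loop_gap = abs_diff(frames[-1], frames[0])
    return loop_gap / (motion + 1.0)


def loop_score(ratio: float) -> float:
    return max(0.0, 1 - max(0.0, ratio - 1.0) / 3.0)


step = 3
down = list(range(0, 31, step))   # ball positions 0, 3, ..., 30
closed = down + down[-2:0:-1]     # there and back, stopping one step short of the start
one_way = down                    # ends on the far side, so the repeat "jumps"

for name, path in [("closed", closed), ("one-way", one_way)]:
    r = loop_ratio(path)
    print(f"{name:8s} ratio={r:.2f} loop_score={loop_score(r):.2f}")
```

The closed path lands at a ratio of about 1 because the wrap-around step looks like any other step, while the one-way path lands near 3 because the wrap-around moves the whole ball. A one-way path is only one way to reach such a ratio, not a claim about what the frontier run's GIFs did. At the reported ratio of about 2.9, loop_score is 1 − 1.9/3 ≈ 0.37, so the loop term contributes roughly 0.07 of its possible 0.20.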
How the loop works
| Step | What happens |
|---|---|
| Evaluate | The candidate prompt drives the LangGraph agent; the GIF is rendered and scored (×2, averaged). |
| Log ASI | oa.log(...) captures frame count, loop ratio, playback, errors — the reasons behind the score. |
| Reflect | The reflection LLM reads that trace and proposes a targeted prompt rewrite. |
| Select | GEPA keeps a Pareto frontier of candidates and samples the next parent from it. |
Because reflection reads the trace rather than just the scalar, it can target the actual failure mode (“frames wander, loop doesn’t close” → “specify exactly 24 frames and require the ball to return”) instead of blindly mutating text. That is what let the qwen run settle on a stable prompt within ~12 evaluations.
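The Select step rests on Pareto dominance, which is easy to sketch in isolation. The code below illustrates the general idea only: it is not GEPA's implementation, the candidate names and per-objective scores are invented, and GEPA's own frontier bookkeeping may be organised differently.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(candidates: dict[str, tuple[float, ...]]) -> list[str]:
    """Names of candidates that no other candidate dominates (higher is better)."""
    return [
        name for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items() if other_name != name)
    ]


# Hypothetical per-objective scores: (frame accuracy, loop score, playback score).
candidates = {
    "seed":        (0.25, 0.40, 0.50),
    "exact-count": (0.92, 0.37, 0.80),
    "closed-loop": (0.60, 1.00, 0.70),
    "weaker-loop": (0.55, 0.90, 0.70),
}
print(pareto_frontier(candidates))
```

Here "exact-count" and "closed-loop" both survive because each is best at something the other is not. Keeping both, rather than only the higher average, preserves two different partial solutions for later rewrites to build on.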
Caveats and honest notes
Tested 2026-06-29 end-to-end against the live OpenRouter API with gepa 0.1.1, langgraph 1.2.6, langchain-openai 1.3.3, pillow 12.2.0, litellm 1.90.0, Python 3.11. The seed→evolved numbers above are from a real run.
- The evolved artifact depends on numpy because the small model reached for it and numpy was installed in my environment. If you drop numpy, that specific evolved prompt’s template would fail at exec: install it, or add “standard library + Pillow only” to the objective so GEPA evolves a numpy-free solution.
- It’s a small, single-task, 12-call demo. The scoring weights and the ~1 s / 24-frame targets are illustrative. With a budget this small and a stochastic task model, expect run-to-run variance in the seed baseline (mine ranged 0.38–0.80); average a few samples before trusting a delta.
- Reflection model: see the small-vs-frontier note above. For subtler objectives, start with a frontier reflection_lm.
- “Small local model”: OpenRouter is a hosted router, not local. For truly local weights, point base_url at an Ollama/vLLM endpoint and use a LiteLLM ollama/... string for reflection_lm; the structure is identical.
- Model IDs change. If a model string 404s, look it up in the live OpenRouter catalogue.
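The advice to average a few samples before trusting a delta can be applied to the numbers already reported here. The sketch below uses only the per-sample scores from the first results table. Comparing the gain against the seed's own spread is a crude sanity check with four samples per arm, not a significance test.

```python
from statistics import mean, stdev

# Per-sample scores copied from the results table (rounded to two decimals there).
seed = [0.58, 0.75, 0.80, 0.38]
evolved = [0.97, 0.97, 0.97, 0.97]

for name, samples in [("seed", seed), ("evolved", evolved)]:
    print(f"{name:8s} mean={mean(samples):.3f} stdev={stdev(samples):.3f} "
          f"range={min(samples):.2f}-{max(samples):.2f}")

# Is the gain larger than the seed's own run-to-run spread?
delta = mean(evolved) - mean(seed)
print(f"delta={delta:.3f} vs seed stdev={stdev(seed):.3f}")
```

The seed's standard deviation is roughly 0.19, so a single seed sample can easily sit 0.2 away from its mean. A delta of that size from one sample per arm would mean little, whereas the gain here is well beyond that spread.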