I benchmarked 6 prompt-optimization frameworks on the same task. Here is what each one actually optimizes.

TL;DR: I ran six prompt-optimization frameworks against the same task and the same eval metric over a few weeks. They are not interchangeable: some are full programming models, some are single search algorithms, and some are platform features bolted onto an existing stack. What mattered for me was which one optimized against MY metric on MY dataset and let me swap the search algorithm without rewriting the harness. Here is the rundown as of June 2026.

"Prompt optimization" is at least three different things

The term covers three things: a programming model where you declare structure and an optimizer compiles the prompts (DSPy), a single search algorithm you point at a prompt (GEPA, TextGrad), and a platform feature that wraps one of those. Comparing them as if they were the same product is the first mistake. The real axis is narrower: what objective does it optimize, and can you plug in your own metric and data?

The six, and what each actually optimizes

DSPy: the ecosystem standard. You write declarative LM programs (signatures, modules) and an optimizer (for example MIPROv2 or BootstrapFewShot) compiles the prompts and few-shot examples. It is the most mature of the six and has the biggest community. The cost is buying into the DSPy programming model; it is a framework, not a drop-in.

GEPA: a standalone evolutionary-Pareto optimizer, strong when the solution space is complex and you want a diverse set of candidates rather than one. It is an algorithm, so you bring the harness.
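To make the "diverse set of candidates" point concrete, here is a minimal sketch of Pareto filtering over per-objective scores. It is not GEPA's implementation, only the selection idea behind keeping a front instead of a single winner. The helper names (`dominates`, `pareto_front`), the candidate prompts, and the scores are all made up for illustration.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one score per objective, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        beaten = any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not beaten:
            front.append(name)
    return front


# Hypothetical prompts scored on (accuracy, 1 - normalized cost).
candidates = {
    "terse": (0.71, 0.90),
    "verbose": (0.83, 0.40),
    "few_shot": (0.80, 0.55),
    "rambling": (0.70, 0.35),
}
front = pareto_front(candidates)
print(front)  # "rambling" is dropped because "terse" beats it on both objectives
```

The three survivors each win on some trade-off, which is why a single "best prompt" number can hide useful candidates.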

TextGrad: treats the prompt as a text variable and runs "textual gradient" descent, where an LLM critiques the output and proposes edits. Elegant for iterative refinement; you supply the loss.
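The loop is easier to see in code. This is a sketch of the critique-then-edit idea, not TextGrad's API: `refine`, `critique`, and `apply_edit` are hypothetical names, and the toy stand-ins replace the LLM calls so the loop runs without a model. Accepting an edit only when the loss drops is my own assumption for the sketch, not something TextGrad prescribes.

```python
from typing import Callable, Tuple


def refine(
    prompt: str,
    loss: Callable[[str], float],            # your metric, lower is better
    critique: Callable[[str], str],          # stands in for the LLM critic
    apply_edit: Callable[[str, str], str],   # stands in for the LLM editor
    steps: int = 5,
) -> Tuple[str, float]:
    """Greedy loop: critique, edit, and keep the edit only if the loss drops."""
    best, best_loss = prompt, loss(prompt)
    for _ in range(steps):
        feedback = critique(best)
        candidate = apply_edit(best, feedback)
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss


# Toy stand-ins: the "loss" counts requirements the prompt fails to mention.
REQUIRED = ["cite sources", "answer in JSON"]


def toy_loss(p: str) -> float:
    return float(sum(req not in p for req in REQUIRED))


def toy_critique(p: str) -> str:
    missing = [req for req in REQUIRED if req not in p]
    return missing[0] if missing else ""


def toy_edit(p: str, feedback: str) -> str:
    return f"{p} Always {feedback}." if feedback else p


final_prompt, final_loss = refine(
    "Summarize the document.", toy_loss, toy_critique, toy_edit
)
print(final_prompt, final_loss)
```

In a real run, `critique` and `apply_edit` would each be a model call, and `loss` would be your eval metric on a dev set.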

Future AGI agent-opt: an Apache-2.0 library (github.com/future-agi/agent-opt) that puts six optimizers behind one optimize() call: Random Search, Bayesian (Optuna), ProTeGi, Meta-Prompt, PromptWizard, and GEPA. You can swap the algorithm without touching your dataset or evaluator. It scores against any metric (reusing the metrics from their ai-evaluation SDK, or your own) and any LLM via LiteLLM, as of June 2026. The draw for me was the swap: most tools lock you into one algorithm, and the right algorithm depends on the problem shape.
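Here is roughly what I mean by "swap the algorithm without touching your dataset or evaluator", as a pure-Python sketch. This is NOT agent-opt's API; `optimize`, `random_search`, `exhaustive_search`, and `toy_metric` are hypothetical names, and the toy metric is a placeholder for actually running a model and grading its output.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]                            # (input, expected label)
Metric = Callable[[str, Sequence[Example]], float]   # higher is better
Search = Callable[[List[str], Callable[[str], float], int], str]


def random_search(pool: List[str], score: Callable[[str], float], budget: int) -> str:
    """Score a random sample of the pool and return the best one."""
    rng = random.Random(0)
    sample = rng.sample(pool, min(budget, len(pool)))
    return max(sample, key=score)


def exhaustive_search(pool: List[str], score: Callable[[str], float], budget: int) -> str:
    """Score the first `budget` candidates in order and return the best one."""
    return max(pool[:budget], key=score)


def optimize(search: Search, pool: List[str], dataset: Sequence[Example],
             metric: Metric, budget: int = 10) -> str:
    """The harness: dataset and metric stay fixed, only `search` changes."""
    return search(pool, lambda prompt: metric(prompt, dataset), budget)


dataset = [("great film", "positive"), ("dull plot", "negative")]
pool = [
    "Classify the review.",
    "Classify the review as positive or negative.",
    "Label the sentiment: positive.",
]


def toy_metric(prompt: str, data: Sequence[Example]) -> float:
    # Placeholder for "run the model and grade it": rewards prompts naming each label.
    return sum(label in prompt for _, label in data) / len(data)


best_random = optimize(random_search, pool, dataset, toy_metric, budget=3)
best_exhaustive = optimize(exhaustive_search, pool, dataset, toy_metric, budget=3)
print(best_random == best_exhaustive)
```

The point of the shape is that `dataset` and `toy_metric` appear once; changing the algorithm is a one-argument change.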

Arize Prompt Learning: a feedback-loop approach that optimizes prompts from production signals. Fits if you already live in the Arize observability stack.

MLflow: the MLOps platform added prompt optimization tooling. Useful if MLflow is already your tracking backbone, less so as a standalone.

I am not crowning one. DSPy if you want the full programming model and ecosystem; GEPA if your space is complex and multi-objective; agent-opt if you want to try several algorithms against your own metric without rewriting the harness. They optimize different things.

The methodologist's test: what is the objective?

The question I care about most is whether it optimizes against a real metric computed on YOUR data, or a generic proxy. A tool that "improves your prompt" without naming the objective is just reshuffling. All six can optimize against a metric; the difference is how hard it is to plug in YOUR metric, your judge, your pass rate, your cost-adjusted score. That integration, not the search algorithm, is where most of the value and most of the friction live.

One caution from my own lane: if the objective is an LLM judge, calibrate it before you optimize against it. An optimizer will faithfully exploit an uncalibrated judge, the same post-hoc theatre I complain about in hallucination detection, and hand you a prompt that games a metric you do not trust. Zheng et al. (2023) on judge bias and human agreement is the thing to read before you point a search loop at a judge score.
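One way to check a judge before optimizing against it is to measure its agreement with human labels on a small sample. The sketch below computes Cohen's kappa by hand; kappa is one common agreement statistic, not the only option, and the labels are invented. What counts as "calibrated enough" is your call.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels on the same eight outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Here raw agreement is 75%, but the judge says "pass" far more often than the humans do, so the chance-corrected figure is lower. That lenient bias is exactly what a search loop would learn to exploit.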

FAQ

Is this just for DSPy users? No. DSPy is a programming model; GEPA, TextGrad, and agent-opt are usable without adopting DSPy.

Which algorithm should I pick? It depends on the space: Bayesian or random for small and cheap, ProTeGi or TextGrad for iterative refinement, GEPA for complex multi-objective. This is exactly why a library that lets you swap is convenient: you do not have to commit up front.

Does prompt optimization overfit? Easily, if you optimize and evaluate on the same set. Hold out a test set the optimizer never sees, same as in any ML workflow.
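A minimal sketch of the holdout, assuming your eval set is a list of (input, expected) pairs. `split` and `toy_score` are hypothetical helpers, and the "memorized" scorer is a deliberately extreme stand-in for a prompt that has overfit to the examples it was tuned on.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split(examples: Sequence[T], test_fraction: float = 0.3,
          seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into (dev, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


examples = [(f"question {i}", f"answer {i}") for i in range(10)]
dev, test = split(examples)

# A deliberately overfit "prompt": it only knows the dev answers.
memorized = dict(dev)


def toy_score(data: Sequence[Tuple[str, str]]) -> float:
    return sum(memorized.get(q) == a for q, a in data) / len(data)


gap = toy_score(dev) - toy_score(test)
print(len(dev), len(test), gap)
```

Run the optimizer on `dev` only and report the score on `test`; a large gap between the two is the overfitting signal.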

Open question

Every one of these optimizes the prompt against a fixed metric, but the metric is the thing I am least sure of. If my eval metric is slightly wrong, the optimizer will faithfully exploit its flaws and hand me a prompt that games the metric. I do not have a clean way to optimize the prompt while staying robust to my own metric being imperfect. If you have one, that is the comment I want.
