If your LLM app is a folder full of f-strings and prompt templates you're afraid to touch, you already know the problem. Change the wording to fix one edge case, and three others break. Swap the underlying model, and every carefully tuned prompt has to be re-tuned by hand. There's no test suite, no metric, no compiler — just vibes and a growing pile of prompt_v7_FINAL_actually.py files.
DSPy is the fix. It's a Python framework from Stanford NLP that lets you program language models instead of prompting them: you declare what each step should do as a typed signature, wire steps together as modules, and then let an optimizer write and tune the actual prompts for you against a metric you define. This is a practical playbook for getting from hand-written prompts to a compiled, optimizable program.
Assumed background: you're comfortable with Python and have called an LLM API before. You do not need any ML background.
Why "program, don't prompt" is more than a slogan
The core insight behind DSPy is a separation of concerns. A prompt tangles three things together: what you want (the task), how the model should approach it (the strategy), and the exact words that happen to make a given model behave. DSPy splits those apart.
You specify the task and strategy in code. DSPy generates the words — and, crucially, can re-generate them when you change models or requirements. That's why teams in production lean on it: Shopify reports using DSPy for metadata extraction across all shops with a roughly 550× cost reduction, and Databricks, Dropbox, JetBlue, and Replit all run DSPy pipelines in production. The framework passed 6.4 million monthly downloads and 35k GitHub stars, so this is battle-tested infrastructure, not a research toy.
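The split between the task and the wording can be illustrated without any LLM at all. The following is a toy sketch, not DSPy's internals: `TaskSpec`, `render_plain`, and `render_json_style` are hypothetical names invented here to show one task description being turned into two different prompt wordings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """What we want: an instruction plus named inputs and outputs."""
    instruction: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def render_plain(spec: TaskSpec, values: dict[str, str]) -> str:
    """One possible wording: labeled fields, one per line."""
    lines = [spec.instruction, ""]
    lines += [f"{name}: {values[name]}" for name in spec.inputs]
    lines += [f"{name}:" for name in spec.outputs]
    return "\n".join(lines)


def render_json_style(spec: TaskSpec, values: dict[str, str]) -> str:
    """A different wording for the same task: ask for a JSON object."""
    keys = ", ".join(spec.outputs)
    fields = "; ".join(f"{name}={values[name]!r}" for name in spec.inputs)
    return f"{spec.instruction} Inputs: {fields}. Reply as JSON with keys: {keys}."


# The task is declared once; only the renderer changes.
triage = TaskSpec(
    instruction="Route a customer support ticket to the right team.",
    inputs=("ticket",),
    outputs=("urgency", "team"),
)
values = {"ticket": "I was double-charged."}
plain_prompt = render_plain(triage, values)
json_prompt = render_json_style(triage, values)
```

The point of the sketch is only that the task definition is written once and the renderer is swappable; in DSPy that rendering step is handled by the framework rather than by code you maintain.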
Step 1: Install and configure
DSPy requires Python 3.10+ and is MIT-licensed. Install the current release:
pip install -U dspy
Then point it at a model. DSPy uses LiteLLM under the hood, so the provider prefix (openai/, anthropic/, ollama/, etc.) selects the backend:
import dspy
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)
That single configure call is what makes model-swapping trivial later. Nothing downstream hardcodes the model.
Step 2: Declare a signature
A signature is a typed input/output contract — the DSPy equivalent of a function declaration. The simplest form is an inline string:
classify = dspy.Predict("ticket -> urgency, team")
But the class form is where signatures earn their keep, because you get typed outputs and a docstring that becomes the task instruction:
from typing import Literal

class Triage(dspy.Signature):
    """Route a customer support ticket to the right team."""
    ticket: str = dspy.InputField()
    urgency: Literal["low", "high"] = dspy.OutputField()
    team: str = dspy.OutputField()

classify = dspy.Predict(Triage)
result = classify(ticket="My invoice is wrong and I was double-charged!")
# Example output (varies by model): Prediction(urgency='high', team='billing')
Notice what you did not write: no "You are a helpful assistant," no "respond only with JSON," no examples. The Literal["low", "high"] type constrains the output, and DSPy handles turning your signature into whatever prompt format the current model needs.
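To see why a `Literal` type narrows what counts as a valid answer, here is a minimal standard-library sketch. It is an illustration only and makes no claim about how DSPy parses or validates outputs; `coerce_urgency` is a hypothetical helper.

```python
from typing import Literal, get_args

Urgency = Literal["low", "high"]


def coerce_urgency(raw: str) -> str:
    """Normalize a raw model string and check it against the Literal's allowed values."""
    allowed = get_args(Urgency)  # the tuple of values declared in the Literal
    value = raw.strip().lower()
    if value not in allowed:
        raise ValueError(f"expected one of {allowed}, got {raw!r}")
    return value
```

Anything outside the declared set fails loudly instead of flowing downstream as a free-form string, which is the practical benefit of typing the output field.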
Step 3: Pick a module (change strategy without rewriting the task)
The same signature can run under different modules, which control how the model executes:
# Direct completion
classify = dspy.Predict(Triage)
# Add step-by-step reasoning
classify = dspy.ChainOfThought(Triage)
# Add tools and a reasoning loop (search_kb is a plain Python function you define)
classify = dspy.ReAct(Triage, tools=[search_kb])
This is the payoff of separating task from strategy. Want to A/B test whether chain-of-thought helps your classifier? Change one line. The task definition never moves. DSPy ships a family of modules including Predict, ChainOfThought, ReAct, ProgramOfThought, BestOfN, and Refine.
You compose modules with ordinary Python — no DSL, no graph builder:
class FactCheck(dspy.Module):
    def __init__(self):
        super().__init__()
        # Stage 1 extracts claims; stage 2 checks each claim against the article
        self.find = dspy.ChainOfThought("article -> claims: list[str]")
        self.verify = dspy.ChainOfThought("claim, source -> verdict")

    def forward(self, article):
        found = self.find(article=article)
        return [self.verify(claim=c, source=article) for c in found.claims]
That's a two-stage pipeline with a loop and typed intermediate values, written in plain control flow.
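Because the pipeline is plain control flow, the wiring can be checked without calling a model. The sketch below mirrors the shape of `FactCheck` but takes its two steps as injected callables; `FactCheckFlow`, `fake_find`, and `fake_verify` are hypothetical stand-ins, not DSPy classes.

```python
from types import SimpleNamespace


class FactCheckFlow:
    """Same control flow as the FactCheck module, with the two steps injected."""

    def __init__(self, find, verify):
        self.find = find
        self.verify = verify

    def forward(self, article):
        found = self.find(article=article)
        return [self.verify(claim=c, source=article) for c in found.claims]


def fake_find(article):
    # Stand-in for the claim extractor: treat each sentence as one claim.
    claims = [s.strip() for s in article.split(".") if s.strip()]
    return SimpleNamespace(claims=claims)


def fake_verify(claim, source):
    # Stand-in for the verifier: a claim is "supported" if it appears in the source.
    verdict = "supported" if claim in source else "unsupported"
    return SimpleNamespace(claim=claim, verdict=verdict)


flow = FactCheckFlow(fake_find, fake_verify)
results = flow.forward("The sky is blue. Water is wet.")
```

A test like this only covers the loop and the field names passed between stages; it says nothing about how well a real model performs either step.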
Step 4: Write a metric
Optimization needs a number to climb. A DSPy metric is just a function that scores a prediction against an example and returns a float (or a bool):
def urgency_match(example, prediction, trace=None):
    return float(example.urgency == prediction.urgency)
For fuzzier tasks like RAG or summarization, DSPy provides built-in metrics such as SemanticF1 and CompleteAndGrounded that use an LM to judge quality. The metric is the single most important thing you'll write — the optimizer will happily maximize a bad metric, so make it reflect what you actually care about.
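A metric can also give partial credit. The following is a sketch under assumptions: `triage_metric` is a hypothetical name, the examples are made up, and the branch on `trace` follows the common convention of returning a strict pass/fail when an optimizer supplies a trace, which you should confirm against the DSPy version you use.

```python
from types import SimpleNamespace


def triage_metric(example, prediction, trace=None):
    """Score urgency and team separately, then average the two."""
    urgency_ok = example.urgency == prediction.urgency
    team_ok = example.team.strip().lower() == prediction.team.strip().lower()
    score = (urgency_ok + team_ok) / 2
    # Assumption: a non-None trace means an optimizer is collecting
    # demonstrations, so only fully correct predictions should pass.
    if trace is not None:
        return score == 1.0
    return score


# Hypothetical gold label and two predictions, built without DSPy.
gold = SimpleNamespace(urgency="high", team="billing")
half = SimpleNamespace(urgency="high", team="Support")
full = SimpleNamespace(urgency="high", team=" Billing ")
```

Here `triage_metric(gold, half)` gives half credit because only the urgency matches, while `full` scores 1.0 once whitespace and case are normalized.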
Step 5: Compile with an optimizer
This is the part that has no equivalent in hand-prompting. You hand DSPy your program, a training set of examples, and your metric; it searches for the best prompt configuration — instructions and few-shot demonstrations — automatically.
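The current recommended starting point is GEPA (Reflective Prompt Evolution), introduced in a July 2025 paper. It uses an LM to reflect on failures and evolve better instructions, so it needs a reflection model, and its metric must accept two extra arguments (pred_name and pred_trace) that identify the predictor being scored:
def gepa_urgency_match(gold, pred, trace=None, pred_name=None, pred_trace=None):
    return float(gold.urgency == pred.urgency)

optimizer = dspy.GEPA(metric=gepa_urgency_match, auto="medium", reflection_lm=lm)
optimized = optimizer.compile(classify, trainset=labeled_tickets)
optimized.save("triage_v2.json")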
The auto="medium" setting lets DSPy budget the search for you; you can dial it to "light" or "heavy" depending on how much compute you want to spend. DSPy's own documentation shows optimization moving an extraction task from a 62% baseline to 89% on the same small model — the gain comes entirely from better prompts, not a bigger model.
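Whatever the optimizer reports, it is worth comparing the baseline and the compiled program on examples neither has seen. DSPy ships `dspy.Evaluate` for this; the sketch below is a dependency-free stand-in that shows the same idea. `average_score`, the `devset` rows, and the two stub programs are hypothetical.

```python
from types import SimpleNamespace


def average_score(program, devset, metric):
    """Run the program on each held-out ticket and average the metric."""
    if not devset:
        raise ValueError("devset is empty")
    total = 0.0
    for example in devset:
        prediction = program(ticket=example.ticket)
        total += float(metric(example, prediction))
    return total / len(devset)


def urgency_match(example, prediction, trace=None):
    return float(example.urgency == prediction.urgency)


# Hypothetical held-out examples.
devset = [
    SimpleNamespace(ticket="Double-charged, fix it now!", urgency="high"),
    SimpleNamespace(ticket="How do I change my avatar?", urgency="low"),
    SimpleNamespace(ticket="Site is down for all users", urgency="high"),
    SimpleNamespace(ticket="Feature idea: dark mode", urgency="low"),
]


def baseline(ticket):
    # Stand-in for an unoptimized program: always answers "low".
    return SimpleNamespace(urgency="low")


def compiled(ticket):
    # Stand-in for an optimized program: keys off two urgent-sounding words.
    urgent = any(word in ticket.lower() for word in ("charged", "down"))
    return SimpleNamespace(urgency="high" if urgent else "low")


baseline_score = average_score(baseline, devset, urgency_match)
compiled_score = average_score(compiled, devset, urgency_match)
```

Keeping the held-out set separate from the training set is what makes the before-and-after numbers comparable.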
The other workhorse optimizer is MIPROv2 (June 2024), which jointly tunes instructions and few-shot demonstrations across every predictor in your program in three phases: bootstrap demonstrations, propose instructions, then search over combinations. Note that MIPROv2's Optuna-backed search is now an optional dependency — install it with pip install dspy[optuna] if you need it. Other options include COPRO, the BootstrapFewShot family, and SIMBA; the docs' "choosing an optimizer" guide walks through when to reach for each.
Step 6: Save, load, and swap models
An optimized program serializes to JSON — the compiled prompts and demonstrations, not code:
optimized.save("triage_v2.json")
# Rebuild the same program structure you compiled, then load the saved state into it
loaded = dspy.Predict(Triage)
loaded.load("triage_v2.json")
Here's the real test of whether the abstraction paid off. When the next cheaper model lands, you change the dspy.LM(...) line and re-run compile. You don't hand-edit a single prompt. AWS documented exactly this pattern — using DSPy to migrate prompts from a larger model down to a smaller, cheaper one on Amazon Nova. That's the workflow hand-tuned prompts can never give you.
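The swap-and-recompile loop can be written as a small selection routine. This is a sketch with made-up numbers: `pick_cheapest_passing` is a hypothetical helper, and `compile_and_score` stands in for whatever function would configure `dspy.LM(model_name)`, re-run the optimizer, and score the result on a held-out set.

```python
def pick_cheapest_passing(candidates, compile_and_score, min_score):
    """Try models from cheapest to priciest; return the first that clears the bar."""
    for model_name, _cost in sorted(candidates.items(), key=lambda kv: kv[1]):
        score = compile_and_score(model_name)
        if score >= min_score:
            return model_name, score
    return None  # no candidate met the threshold


# Hypothetical model names, relative costs, and post-compile scores.
candidates = {"provider/large": 10.0, "provider/small": 1.0, "provider/medium": 3.0}
fake_scores = {"provider/large": 0.91, "provider/medium": 0.88, "provider/small": 0.71}

choice = pick_cheapest_passing(candidates, fake_scores.__getitem__, min_score=0.85)
```

With these invented numbers the routine skips the cheapest model, which misses the bar, and settles on the mid-priced one without ever compiling the most expensive candidate.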
When DSPy is worth it — and when it isn't
DSPy is not free complexity. Be honest about the trade-off.
Reach for it when you have a task you can measure (you can write a metric and gather even 20–50 labeled examples), when you expect to swap models or providers, or when prompt quality directly drives cost or accuracy at scale. Those are exactly the conditions where automatic optimization compounds.
Skip it when you're building a one-off script, a freeform chatbot with no gradable output, or a prototype you'll throw away next week. If you can't define a metric, the optimizers have nothing to climb, and you're paying the abstraction cost for none of the benefit.
A reasonable adoption path: start by expressing your existing prompts as signatures and modules with plain dspy.Predict. Get that working with zero optimization. Then, once you have a metric and a handful of examples, run an optimizer and compare. You keep the maintainability win immediately and add the optimization win when you're ready.
The Bottom Line
DSPy's bet is that prompt engineering should look like software engineering: declare typed contracts, compose modules, measure with a metric, and let a compiler tune the details. The moment you can write a metric and collect a few dozen examples, you unlock automatic optimization that routinely turns a mediocre baseline into a strong one on the same model — and, just as valuable, lets you swap to a cheaper model by re-compiling instead of re-writing. Start small: convert one prompt to a signature today, and add an optimizer the day you have something to measure.


