A great discovery from my recent project work: DSPy.
While building the content generation pipeline for Canviz, I encountered a recurring engineering problem—it was extremely difficult to maintain stable "problem explanation quality + canvas script usability" through prompts alone. Whenever I switched models or added new grade levels, I had to re-tune the entire string of prompts. DSPy offered me a systematic solution that's worth sharing separately.
## The Fundamental Contradiction of Prompt Engineering
Before diving into DSPy, I need to clarify one thing: Why is writing prompts an engineering problem, not just a matter of technique?
Traditional prompts have a fatal design flaw: they mix "what I want to do" with "how to tell the model to do it."
That natural language prompt you write simultaneously handles two things:

- describing the task's logic (what inputs to accept, what outputs to produce);
- the "incantation" tuned for this specific model.
Take a math teaching scenario as an example—the logic of "explaining a chicken-and-rabbit problem" is eternal, but the incantation to make GPT explain it well versus making Claude Sonnet explain it well can be quite different. Once you switch models, or change from third grade to fifth grade, that incantation might fail. Worse yet, there's no systematic way to fix it—you can only rely on intuition and trial-and-error.
This is what software engineering calls the hard-coding problem. For ordinary logic, we've long learned not to hard-code; but for AI pipelines, we willingly lock the most core logic into a fragile string.
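To see the hard-coding problem in miniature, here is a toy contrast in plain Python (all names here are hypothetical illustrations, not DSPy API): the first version fuses the task logic with model-specific wording and a fixed grade level; the second keeps only the interface, leaving the wording to be derived later.

```python
# Hard-coded: task logic and the model-specific "incantation" fused into one string
HARDCODED_PROMPT = (
    "You are a gentle and patient math teacher. "        # wording tuned for one model
    "Explain this problem to a third grader: {problem}"  # grade level baked in
)
prompt = HARDCODED_PROMPT.format(problem="A cage holds chickens and rabbits...")

# Declarative: only the task's interface survives; no wording for any particular model
def build_task_spec(problem: str, grade: int) -> dict:
    return {
        "inputs": {"problem": problem, "grade": grade},  # grade is now a parameter
        "outputs": ["explanation", "key_concept"],
    }

spec = build_task_spec("A cage holds chickens and rabbits...", grade=5)
```

Switching models or grade levels breaks the first version; the second only changes its inputs.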
DSPy's author, Stanford's Omar Khattab, describes this problem as:
> "Existing LM pipelines are typically implemented using hard-coded prompt templates, discovered through trial and error, and extremely brittle."
## What Is DSPy? What's Its Core Insight?
DSPy (Declarative Self-improving Python) is a framework open-sourced by Stanford NLP Lab in 2023, published at ICLR 2024. Its core proposition is:
> **Programming language models, not prompting them.**
It offers an elegant solution: completely separate the task's interface description from the specific prompt implementation.
You only need to tell DSPy:

- what each step inputs and outputs;
- the logical structure of the entire pipeline;
- your evaluation criteria.
Then DSPy's Compiler and Optimizer will automatically find the best prompt for you—tailored to your chosen model, your data, and your metrics.
To borrow the official analogy: This is like jumping from assembly language to high-level languages, or from writing raw SQL to using an ORM.
## Three Core Concepts to Understand DSPy's Full Picture
### 1. Signature: The Task's Type Signature
Signature is DSPy's interface description. It tells the framework what this step does, not how to do it, using a type-declaration-like approach:
```python
import dspy

class ExplainMathProblem(dspy.Signature):
    """Explain a math problem to students of a specified grade, using language appropriate to their cognitive level."""
    problem: str = dspy.InputField(desc="Original text of the math problem")
    grade: int = dspy.InputField(desc="Student grade level, e.g., 3 for third grade")
    explanation: str = dspy.OutputField(desc="Step-by-step explanation suitable for the grade, friendly and easy to understand")
    key_concept: str = dspy.OutputField(desc="Core concept tested by this problem, explained in one sentence")
```
Notice: you haven't written any prompt at all. This only contains the semantics of the interface, without any "you are a gentle and patient math teacher..." type of prompting.
### 2. Module: Composable Functional Units
Module is DSPy's execution unit, inspired by PyTorch's `nn.Module`. You can compose them like building blocks to construct a complete teaching content generation pipeline:
```python
class MathLessonPipeline(dspy.Module):
    def __init__(self):
        # Step 1: Explain the problem
        self.explain = dspy.ChainOfThought(ExplainMathProblem)
        # Step 2: Generate a Dinogsp geometry visualization script from the explanation
        self.generate_diagram = dspy.Predict(
            "problem, explanation -> dinogsp_script: str"
        )
        # Step 3: Create a practice problem of the same type
        self.make_exercise = dspy.Predict(
            "problem, key_concept, grade -> exercise: str, answer: str"
        )

    def forward(self, problem, grade):
        # Explain
        step1 = self.explain(problem=problem, grade=grade)
        # Generate diagram
        step2 = self.generate_diagram(
            problem=problem,
            explanation=step1.explanation
        )
        # Create practice problem
        step3 = self.make_exercise(
            problem=problem,
            key_concept=step1.key_concept,
            grade=grade
        )
        return dspy.Prediction(
            explanation=step1.explanation,
            dinogsp_script=step2.dinogsp_script,
            exercise=step3.exercise,
            answer=step3.answer
        )
```
This entire three-step pipeline doesn't contain a single word of prompt—everything written is logic.
DSPy includes several classic reasoning strategy modules:
| Module | Reasoning Method | Application in Teaching |
|---|---|---|
| `dspy.Predict` | Direct prediction | Problem difficulty grading, concept tagging |
| `dspy.ChainOfThought` | Chain of Thought (CoT) | Step-by-step problem-solving explanation |
| `dspy.ReAct` | Reasoning-Action loop | Calling external tools to validate scripts |
| `dspy.ProgramOfThought` | Program-based thinking | Generating executable math calculation code |
### 3. Optimizer: The Automatic Tuning Engine
This is the most magical part of DSPy, where its truly unique value lies.
You need to provide:

- an evaluation dataset (e.g., 100 problems, each with manually annotated examples of good explanations);
- an evaluation metric function (to judge whether a generated explanation is good).
Then call the optimizer, which will automatically search for the optimal combination of prompt instructions and few-shot examples:
```python
# Define the evaluation metric: is the explanation age-appropriate, is the diagram script parseable?
def lesson_quality_metric(example, prediction, trace=None):
    explanation_ok = len(prediction.explanation) > 50               # basic length
    script_parseable = validate_dinogsp(prediction.dinogsp_script)  # script usability
    grade_appropriate = check_vocabulary_level(
        prediction.explanation, example.grade
    )  # age-appropriate vocabulary
    return explanation_ok and script_parseable and grade_appropriate

# Optimize using MIPROv2
optimizer = dspy.MIPROv2(metric=lesson_quality_metric, auto="medium")
optimized_pipeline = optimizer.compile(
    MathLessonPipeline(),
    trainset=annotated_lessons  # your annotated data
)

# Save the result; load it directly in production without re-optimizing
optimized_pipeline.save("./optimized_math_lesson.json")
```
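The metric calls `validate_dinogsp()` and `check_vocabulary_level()`, which DSPy does not provide: they're yours to implement. Purely as a sketch, here are hypothetical placeholder versions, with a balanced-parentheses check standing in for real script parsing and an average-word-length cap standing in for a real readability model:

```python
def validate_dinogsp(script: str) -> bool:
    """Placeholder: non-empty text with balanced parentheses passes."""
    if not script.strip():
        return False
    depth = 0
    for ch in script:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing paren with no opener
                return False
    return depth == 0

def check_vocabulary_level(text: str, grade: int) -> bool:
    """Placeholder readability proxy: cap average word length by grade (threshold is made up)."""
    words = text.split()
    if not words:
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    return avg_len <= 4 + grade
```

Real versions would use an actual Dinogsp parser and a proper grade-level vocabulary list; the point is that these functions, not DSPy itself, define what "good" means to the optimizer.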
A medium-level optimization costs about $10 and takes 20 minutes to run, resulting in a teaching content generation system automatically tuned for your chosen model and specified grade-level data.
## Looking at the Data
DSPy's official documentation provides a set of impressive data:
On the HotPotQA multi-hop reasoning task (which requires combining information across documents, very similar to the logical structure of math word problems), running `dspy.ReAct` with the gpt mini series:
- Before optimization: 24% accuracy
- After MIPROv2 optimization with 500 samples: 51% accuracy
That's more than double, achieved not by switching to a more expensive model, but by teaching this smaller model to handle this type of task better.
## The Essential Difference from LangChain/LlamaIndex
You might wonder how DSPy differs from LangChain—for instance, if you're already using LangChain, do you need to switch?
LangChain / LlamaIndex are tool chain orchestration frameworks. They connect components like LLMs, vector databases, and tool calls, but the prompts themselves are still strings written by humans. If you switch models, you still have to manually modify the prompts.
DSPy is an AI program compilation framework. It doesn't just connect components—it takes over the generation and optimization of prompts. Humans are responsible for writing the logic, while it translates that into the most effective natural language instructions for a specific model.
Specifically for math teaching scenarios: if you built a "generate third-grade explanations" pipeline with LangChain, and tomorrow the product requires fifth-grade support, you need to manually go back and modify all related prompt strings—because the vocabulary and logical depth requirements for fifth grade have changed. With DSPy, you only change the input parameter grade=5, then rerun compilation, and the framework will automatically adjust the internal prompting strategy.
If I were to make an analogy: LangChain is an automated assembly line, while DSPy is a high-level language with a JIT compiler.
## My Developer Perspective: What It Solves, What's Still Missing
After all this praise, I should also mention what I think it still lacks.
**What DSPy truly solves:**

- **Pain of model migration**: switching from GPT-5.4 to the cheaper Kimi 2.5 takes just one recompile, with no manual prompt edits;
- **Multi-step joint optimization**: explanation quality and diagram-script usability were previously hard to optimize at the same time, but DSPy's compiler can search for a global optimum across both;
- **Reproducible experiments**: optimization results are saved as JSON, shareable with the team, and version-controlled. Goodbye to "which document has that best-performing prompt we used before?"
**Current limitations:**

- **Evaluation metrics are the hard part**: functions like `validate_dinogsp()` have to be written by you, and writing them well isn't easy. DSPy's optimization effectiveness depends heavily on metric quality; vague metrics lead the optimizer to game the system;
- **Optimization isn't free**: medium-level optimization on 100 samples costs about $2, and with multiple grade levels and problem types the cost rises significantly with data volume;
- **The debugging experience is still maturing**: when an optimized pipeline still underperforms, it can be hard to tell whether the cause is insufficient data, a flawed metric, or the model's inherent capability ceiling.
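The "vague metrics get gamed" point can be made concrete with a toy comparison. Both metrics below are deliberately simplistic illustrations, not real rubrics:

```python
# A metric that only checks length can be satisfied by meaningless output
def vague_metric(explanation: str) -> bool:
    return len(explanation) > 50  # length alone says nothing about quality

# A (still toy) stricter metric adds a structural requirement as a quality proxy
def stricter_metric(explanation: str) -> bool:
    return len(explanation) > 50 and "Step" in explanation

padding = "great " * 20             # 120 characters of pure filler
assert vague_metric(padding)        # passes: an optimizer could converge on this
assert not stricter_metric(padding) # fails: the structural check filters it out
```

An optimizer searching against `vague_metric` has every incentive to produce padding; tightening the metric is how you steer it toward what you actually want.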
## When Should You Use DSPy?
If you're encountering any of the following situations, it's worth seriously considering DSPy:
✅ **Very suitable:**

- You're building multi-step LLM pipelines (explanation + diagram + practice problems is exactly this structure)
- You need to switch between different models (for cost control, or to match model capability to age group)
- You have an evaluation dataset and want quantifiable improvement
- You're tired of modifying prompts by feel and want a systematic optimization method
- Your application needs long-term maintenance in production
⚠️ **Not quite suitable:**

- You're just quickly validating an idea, with no need for long-term maintenance
- The task has no clear evaluation metrics, leaving the optimizer nothing to work with
## Final Thoughts
I think DSPy's approach is good because it proposes a more engineering-sound way of thinking:
> **Prompts in AI pipelines are essentially parameters of the program, not the program's source code.**
Just as I wouldn't hard-code neural network weights into source code, I shouldn't treat prompts tuned for a specific model as the program logic itself. These weights should be systematically learnable, optimizable, savable, and transferable.
The logic of teaching content is stable—step-by-step, illustrated, age-appropriate expression; but how to guide the model to achieve all this will constantly change with model updates, grade expansions, and problem type additions. Using DSPy to separate the two enables a truly maintainable AI teaching system.
🙋‍♀️ If you're also working on AI education, feel free to connect.
## References
- DSPy official documentation: dspy.ai
- Paper: *DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines*, ICLR 2024
- GitHub: stanfordnlp/dspy
- Optimizer details: dspy.ai/learn/optimization/optimizers