For a year, I treated LLMs like a command line. Type instructions, pray for output. Tweaking wording, adding "IMPORTANT:", moving sentences around like a ritual.
Then I found DSPy.
Here's what changed.
───
The prompt treadmill
I had folders of prompts. v1.txt, v2_final.txt, v2_final_REALLY_final.txt. None of them documented why they worked.
The worst part? When something broke, I couldn't tell if it was the prompt, the model, or the data. No version control. No tests. Just vibes.
───
The DSPy shift
DSPy (from Stanford NLP) flips the mental model:
You don't write prompts. You write Python.
class AnalyzeStartup(dspy.Signature):
    """Analyze a startup pitch."""
    pitch: str = dspy.InputField()
    viability_score: int = dspy.OutputField()
    strengths: list[str] = dspy.OutputField()
    weaknesses: list[str] = dspy.OutputField()
    verdict: str = dspy.OutputField()
That's it. No "You are an expert startup analyst...". No "Respond in JSON format...". Just types and descriptions.
DSPy compiles this into a prompt. When you need better prompts, you run an optimizer — DSPy rewrites them based on examples that work.
What actually changed
I stopped memorizing prompt tricks:
Before: "If I put examples before instructions, it works better. Sometimes. Unless it's GPT-4o."
After: I write a signature. DSPy figures out the best prompt format.
The prompt becomes an implementation detail. I care about inputs, outputs, and behavior — not phrasing.
I can finally test my LLM code:
Before: "Let me manually check if this output looks right..."
After:
def test_startup_analyzer():
    result = startup_analyzer(pitch="We're building AI for dog grooming...")
    assert 1 <= result.viability_score <= 10
    assert len(result.strengths) > 0
    assert len(result.weaknesses) > 0
Real tests. In my test suite. With assertions.
Model swaps are one line:
Before: Every model needed prompt tuning. GPT-4 liked it one way, Claude another, Gemini another. Prompts were coupled to specific models.
After:
# Swap models
lm = dspy.LM("openai/gpt-4o-mini")
# lm = dspy.LM("anthropic/claude-3-sonnet")
# lm = dspy.LM("gemini/gemini-2.0-flash")
dspy.configure(lm=lm)
Same code. Different model. DSPy handles the prompt translation.
Optimizers do the tuning for me
This one still blows my mind.
Instead of manually tweaking prompts, I hand DSPy some examples of good outputs and let it figure out the best prompt:
optimizer = dspy.BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(StartupAnalyzer(), trainset=train_examples)
DSPy runs experiments. Finds examples that work. Builds the prompt. I just review the results.
The mental model shift
Old way: LLMs are magic boxes you talk to in English. Success depends on your prompting skills.
DSPy way: LLMs are function calls. You declare the interface. The framework handles the implementation.
It's the difference between writing raw SQL queries scattered across your codebase vs using an ORM. One is brittle, untyped, hard to refactor. The other is structured, testable, maintainable.
When to use DSPy
• You're building a real product, not a demo
• You need reliability, not just "it works sometimes"
• You want to swap models without rewriting prompts
• You're tired of the prompt treadmill
If you want to go deeper
I wrote a full guide on building with DSPy — practical chapters, real code, the things I had to learn the hard way.
It's called Harmless DSPy. Chapter 1 is free if you want to see if it's your thing.
DSPy is developed by Omar Khattab and team at Stanford NLP. It's open source, actively maintained, and genuinely changed how I build with LLMs.