For a year, I treated LLMs like a command line. Type instructions, pray for output. Tweaking wording, adding "IMPORTANT:", moving sentences around like a ritual.
Then I found DSPy.
Here's what changed.
───
The prompt treadmill
I had folders of prompts. v1.txt, v2_final.txt, v2_final_REALLY_final.txt. None of them documented why they worked.
The worst part? When something broke, I couldn't tell if it was the prompt, the model, or the data. No version control. No tests. Just vibes.
───
The DSPy shift
DSPy (from Stanford NLP) flips the mental model:
You don't write prompts. You write Python.
class AnalyzeStartup(dspy.Signature):
    """Analyze a startup pitch."""
    pitch: str = dspy.InputField()
    viability_score: int = dspy.OutputField()
    strengths: list[str] = dspy.OutputField()
    weaknesses: list[str] = dspy.OutputField()
    verdict: str = dspy.OutputField()
That's it. No "You are an expert startup analyst...". No "Respond in JSON format...". Just types and descriptions.
DSPy compiles this into a prompt. When you need better prompts, you run an optimizer — DSPy rewrites them based on examples that work.
What actually changed
I stopped memorizing prompt tricks:
Before: "If I put examples before instructions, it works better. Sometimes. Unless it's GPT-4o."
After: I write a signature. DSPy figures out the best prompt format.
The prompt becomes an implementation detail. I care about inputs, outputs, and behavior — not phrasing.
I can finally test my LLM code:
Before: "Let me manually check if this output looks right..."
After:
def test_startup_analyzer():
    result = startup_analyzer(pitch="We're building AI for dog grooming...")
    assert 1 <= result.viability_score <= 10
    assert len(result.strengths) > 0
    assert len(result.weaknesses) > 0
Real tests. In my test suite. With assertions.
Model swaps are one line:
Before: Every model needed prompt tuning. GPT-4 liked it one way, Claude another, Gemini another. Prompts were coupled to specific models.
After:
# Swap models
lm = dspy.LM("openai/gpt-4o-mini")
# lm = dspy.LM("anthropic/claude-3-sonnet")
# lm = dspy.LM("gemini/gemini-2.0-flash")
dspy.configure(lm=lm)
Same code. Different model. DSPy handles the prompt translation.
Optimizers do the tuning for me
This one still blows my mind.
Instead of manually tweaking prompts, I hand DSPy some examples of good outputs and let it figure out the best prompt:
optimizer = dspy.BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(StartupAnalyzer(), trainset=train_examples)
DSPy runs experiments. Finds examples that work. Builds the prompt. I just review the results.
The mental model shift
Old way: LLMs are magic boxes you talk to in English. Success depends on your prompting skills.
DSPy way: LLMs are function calls. You declare the interface. The framework handles the implementation.
It's the difference between writing raw SQL queries scattered across your codebase vs using an ORM. One is brittle, untyped, hard to refactor. The other is structured, testable, maintainable.
When to use DSPy
• You're building a real product, not a demo
• You need reliability, not just "it works sometimes"
• You want to swap models without rewriting prompts
• You're tired of the prompt treadmill
If you want to go deeper
I wrote a full guide on building with DSPy — practical chapters, real code, the things I had to learn the hard way.
It's called Harmless DSPy. Chapter 1 is free if you want to see if it's your thing.
DSPy is developed by Omar Khattab and team at Stanford NLP. It's open source, actively maintained, and genuinely changed how I build with LLMs.