Alain Airom (Ayrom)

Posted on Jun 13

Shift Your Paradigm: Building Self-Improving LLM Workflows with DSPy

#dspy #bookreview

A book review of “Building LLM Applications with DSPy”

Disclaimer: I want to be completely transparent that I am not affiliated with Manning Publications, the publishers of the MEAP edition, nor am I affiliated with the authors of this book. While I happen to be connected with one of the authors on LinkedIn, we have no financial ties or business relationships whatsoever. This review and synthesis are purely based on my independent interest in the framework.

TL;DR: Introduction & Historical Evolution of DSPy

The story behind DSPy

Image from the official DSPy GitHub repository

DSPy (Declarative Self-improving Python), originally introduced as Demonstrate-Search-Predict, was created by researchers at Stanford University’s NLP Group.

The project is primarily led by Omar Khattab, a Ph.D. candidate at Stanford, alongside his advisor Matei Zaharia — a legendary figure in computer science who previously co-created Apache Spark and co-founded Databricks.

The inception and evolution of DSPy can be traced through a series of key milestones in the Stanford NLP Group’s research timeline:

1. The Retrieval Foundation: ColBERT (2020)

Before DSPy, Omar Khattab and Matei Zaharia focused heavily on improving how language models retrieve information. In 2020, they introduced ColBERT (Contextualized Late Interaction over BERT), a highly influential, fast, and accurate retrieval model. This deep expertise in information retrieval later became a core pillar of DSPy’s multi-hop reasoning capabilities.

2. The Stepping Stone: Demonstrate-Search-Predict (Late 2022)

In December 2022, the team posted a paper to arXiv (arXiv:2212.14024) titled “Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP”. This research introduced the DSP framework. At this stage, it wasn’t yet the fully generalized programmatic compiler we know today; it was specifically designed to build systems that could systematically alternate between searching a knowledge base (Search) and prompting an LLM (Predict) using validated examples (Demonstrate) to solve complex, multi-hop question-answering tasks.

3. The Paradigm Shift: DSPy (Late 2023)

Recognizing that manual prompting was a fundamental bottleneck for all LLM applications — not just retrieval tasks — the Stanford team significantly generalized and rebuilt the framework. In October 2023, they released the seminal paper: “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.”

With this release, the acronym officially evolved into Declarative Self-improving Python (DSPy). The framework shifted from a niche retrieval pattern to a generalized programming paradigm. Inspired by how PyTorch separated network architecture from optimization weights, Khattab, Zaharia, and a growing team of co-authors (including researchers like Christopher Potts and Keshav Santhanam) introduced the concepts of Signatures, Modules, and Teleprompters (Optimizers).

The Vision

By treating the prompt as a fluid, optimizable hyperparameter rather than a hardcoded string, the creators at Stanford designed DSPy to do for AI pipelines what compilers did for high-level programming languages: abstract away the messy, hardware-level details (in this case, raw prompt strings specific to certain LLMs) so developers could write clean, resilient, and portable code.

@article{Khattabetal2022,
  author  = {Khattab, Omar and Santhanam, Keshav and Li, Xiang Lisa and Hall, David and Liang, Percy and Potts, Christopher and Zaharia, Matei},
  title   = {Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP},
  journal = {arXiv},
  year    = {2022},
  doi     = {10.48550/arxiv.2212.14024}
}

@article{Khattabetal2023,
  author  = {Khattab, Omar and Santhanam, Keshav and Li, Xiang Lisa and Hall, David and Liang, Percy and Potts, Christopher and Zaharia, Matei},
  title   = {DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines},
  journal = {OpenReview},
  year    = {2023},
  url     = {https://openreview.net/forum?id=sY5N0zY5Od}
}

Why DSPy Matters

The traditional development cycle for Large Language Model (LLM) applications relies heavily on prompt engineering. Developers write a prompt, evaluate its output against a couple of inputs, tweak phrases like “think step-by-step” or “imagine my life depends on it” when it fails, and repeat this manual trial-and-error loop indefinitely. This approach introduces significant liabilities: prompts are highly fragile, minor changes can dramatically shift model behaviors, and a prompt carefully crafted for one model often collapses when migrating to a different backend.

DSPy completely reimagines this paradigm by shifting the industry from prompt engineering to prompt programming.

Instead of hardcoding manual strings, you write clean, modular Python code to declare the inputs and outputs your task requires. DSPy treats prompts like hyperparameters in a traditional machine learning system. It applies computational search techniques, such as hill climbing, Bayesian optimization, and genetic algorithms, to systematically discover, compile, and refine the best-performing instructions and examples for your application.
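To make the "prompts as hyperparameters" idea concrete, here is a minimal sketch of a search over candidate instructions, each scored by a metric on a small labeled set. It does not use DSPy at all: `fake_model`, the candidate instructions, and the dataset are hypothetical stand-ins, and a real optimizer would call an LLM and use a smarter search strategy than scoring every candidate.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled data: (input text, expected label).
DATASET: List[Tuple[str, str]] = [
    ("I want my money back", "refund"),
    ("The app crashes on launch", "bug"),
    ("Please refund my order", "refund"),
]

# Hypothetical candidate instructions (the "hyperparameter" being tuned).
CANDIDATES: List[str] = [
    "Answer briefly.",
    "Classify the message as refund or bug.",
    "Classify as refund or bug. Crashes and errors are bugs.",
]


def fake_model(instruction: str, text: str) -> str:
    """Deterministic stand-in for an LLM call; its behavior depends on the instruction."""
    lowered = text.lower()
    if "refund or bug" not in instruction:
        return "unknown"
    if "refund" in lowered or "money back" in lowered:
        return "refund"
    if "Crashes" in instruction and "crash" in lowered:
        return "bug"
    return "unknown"


def score(instruction: str, model: Callable[[str, str], str]) -> float:
    """Fraction of dataset examples labeled correctly under this instruction."""
    hits = sum(model(instruction, text) == label for text, label in DATASET)
    return hits / len(DATASET)


def search_best_instruction(candidates: List[str]) -> Tuple[str, float]:
    """Score every candidate and keep the best one (ties go to the earliest)."""
    scored = [(score(c, fake_model), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


best_instruction, best_score = search_best_instruction(CANDIDATES)
print(best_instruction, best_score)
```

The point of the sketch is the shape of the loop: the program structure and the metric stay fixed, and only the instruction text varies until the score stops improving.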

A Quick Note on My Background and Full Disclosure: I’m not a traditionally trained data scientist. My journey into the world of Large Language Models began a few years ago, and everything I know today has been forged in the trenches — built on relentless internet research, reading every book I can get my hands on (and it costs me a monthly fortune… 🫠), and, most importantly, getting my hands dirty with real-world, hands-on implementations. I write this from the perspective of a practitioner sharing what works in practice.

Prompt engineering is a human heuristic; DSPy relies on computation to programmatically compile prompts that humans could never discover or optimize by hand.

Core Application Areas of DSPy

DSPy excels in structuring multi-component, complex pipelines where manual prompt engineering becomes unmanageable. Its primary deployment domains include:

Intent & Multi-Label Classification: Mapping messy user requests to accurate system routes.
Retrieval-Augmented Generation (RAG): Orchestrating multi-hop document retrieval pipelines that ground responses in retrieved sources and reduce hallucinations.
LLM-as-a-Judge Evaluation: Creating reliable scoring modules to systematically validate model outputs.
Agentic Frameworks & Chatbots: Structuring self-correcting agents capable of multi-step tool execution and conversational loop state management.

DSPy can be seen as a complementary technology: a transmission and optimization layer that sits on top of the model “engines” supplied by providers such as IBM, Google, and Microsoft.

Systematic Optimization: Projects like Semantic Kernel or LangChain focus on orchestration — chaining events together. DSPy focuses on programmatically optimizing the prompts and demonstrations used by the models. It makes any application built on Microsoft, Google, or IBM infrastructure perform more reliably.
Model Agnostic (Compiler, not a Model): DSPy is designed to work across different LLM backends. A developer can write one piece of DSPy code, switch the underlying model from an OpenAI GPT model (Microsoft) to a Google Gemini model or an IBM Granite model, and then use DSPy’s compiler to re-optimize for the new backend.
Empowers the Ecosystem: By shifting from fragile manual prompting to systematic programming, DSPy makes the entire AI ecosystem (which these major companies dominate) more robust, easier to develop for, and more ready for enterprise-grade production.

Technical Synthesis of the Book Chapters

The book outlines a disciplined, iterative blueprint for building production-grade LLM-oriented applications and systems using DSPy.

Declaring Tasks via Signatures and Modules (Chapters 1–3)

The foundational building blocks of any DSPy application are Signatures and Modules. A signature defines what the task is by enforcing explicit input and output fields, while a module represents how that task is executed (e.g., direct prediction or chain-of-thought).

import dspy
from typing import List

# Configure the global language model setting
lm = dspy.LM("openai/gpt-4o-mini", api_key="your-api-key")
dspy.settings.configure(lm=lm)

# Define a Class-Based Signature for clean, type-hinted parsing
class IntentSignature(dspy.Signature):
    """Classify the incoming customer service message into an explicit category."""
    message: str = dspy.InputField()
    labels: List[str] = dspy.InputField()
    intent_label: str = dspy.OutputField()

# Instantiate a Chain of Thought predictive module using the signature
classifier = dspy.ChainOfThought(IntentSignature)

# Execute the module cleanly like a standard Python function
response = classifier(
    message="I deserve a refund for the frozen app.",
    labels=["Subscription Cancel", "Refund Request", "Bug Report"]
)
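# Recent DSPy releases expose the chain-of-thought text as `reasoning` (older ones used `rationale`)
print(f"Reasoning: {response.reasoning}")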
print(f"Predicted Intent: {response.intent_label}")

Images and code sample(s) provided here are from the book.

Rigorous Dataset Curation and Evaluators (Chapter 4)

DSPy is strictly data-driven. To move away from eyeballing arbitrary test outputs, you must curate structured datasets (using dspy.Example) and construct robust metric functions that score outputs.

# Constructing explicit training/validation data structures
examples = [
    dspy.Example(
        message="I want to end my subscription", 
        labels=["Cancel Subscription", "Refund Request"], 
        intent_label="Cancel Subscription"
    ).with_inputs("message", "labels"),
    dspy.Example(
        message="The screen is frozen", 
        labels=["Cancel Subscription", "Bug Report"], 
        intent_label="Bug Report"
    ).with_inputs("message", "labels")
]

# Defining a functional metric to check accuracy
def validate_intent_metric(example, prediction, trace=None):
    return example.intent_label == prediction.intent_label

# DSPy's built-in evaluation harness for executing batch test simulations
from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=examples, metric=validate_intent_metric, num_threads=4)
# evaluator(classifier) runs the module over the devset and reports the aggregate metric score

Code samples provided here are from the book.
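Exact string equality is often too strict for LLM output, where casing and stray whitespace vary. As a hedged illustration, the sketch below keeps the same `(example, prediction, trace=None)` shape as the metric above but normalizes both labels first. The `SimpleNamespace` objects are stand-ins for `dspy.Example` and prediction objects (an assumption: only attribute access is needed), and `normalize` and `accuracy` are hypothetical helper names.

```python
from types import SimpleNamespace


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences do not count as errors."""
    return " ".join(label.lower().split())


def normalized_intent_metric(example, prediction, trace=None) -> bool:
    """Compare labels after normalization; `trace` is accepted but unused here."""
    return normalize(example.intent_label) == normalize(prediction.intent_label)


def accuracy(examples, predictions) -> float:
    """Average the metric over paired examples and predictions."""
    results = [normalized_intent_metric(e, p) for e, p in zip(examples, predictions)]
    return sum(results) / len(results)


# Stand-ins for dspy.Example / prediction objects.
gold = [
    SimpleNamespace(intent_label="Bug Report"),
    SimpleNamespace(intent_label="Refund Request"),
]
preds = [
    SimpleNamespace(intent_label="  bug report "),
    SimpleNamespace(intent_label="Cancel Subscription"),
]

print(accuracy(gold, preds))
```

The first pair matches after normalization and the second does not, so half of the predictions count as correct.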

Systematic Few-Shot and Instruction Optimization (Chapters 5–6)

Once signatures, datasets, and metrics are finalized, DSPy uses Optimizers to compile the prompt.

LabeledFewShot: Randomly samples valid examples from your dataset and formats them directly into few-shot demonstrations.
BootstrapFewShot: Runs your pipeline end-to-end, uses the LLM to generate internal execution traces (like internal chain-of-thought steps), validates them against your metric, and saves the successful runs as optimal few-shot examples.
MIPROv2 & COPRO: Use a language model to propose and rephrase candidate task instructions, then search over those candidates for the best-scoring ones (Bayesian optimization in MIPROv2, coordinate ascent in COPRO).

from dspy.teleprompt import BootstrapFewShot

# Define the compilation configuration setup
optimizer = BootstrapFewShot(
    metric=validate_intent_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2
)

# compile() returns an optimized copy of the module with bootstrapped few-shot demos attached
optimized_classifier = optimizer.compile(student=classifier, trainset=examples)

Images and code sample provided here are from the book.
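The bootstrapping idea behind BootstrapFewShot can be sketched in plain Python: run a teacher program over the training set, keep only the runs that pass the metric, and reuse those runs (including their intermediate reasoning) as demonstrations. This is a simplified illustration, not DSPy's implementation; `bootstrap_demos`, `toy_teacher`, and `exact_match` are hypothetical names, and the teacher is a rule-based stand-in for an LLM-backed module.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def bootstrap_demos(
    trainset: List[Example],
    teacher: Callable[[str], Dict[str, str]],
    metric: Callable[[Example, Dict[str, str]], bool],
    max_demos: int = 2,
) -> List[Dict[str, str]]:
    """Run the teacher on each training example and keep traces that pass the metric."""
    demos: List[Dict[str, str]] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        trace = teacher(example["message"])  # includes intermediate reasoning
        if metric(example, trace):
            demos.append({"message": example["message"], **trace})
    return demos


def toy_teacher(message: str) -> Dict[str, str]:
    """Hypothetical stand-in for an LLM-backed program that emits reasoning plus a label."""
    if "frozen" in message or "crash" in message:
        return {"reasoning": "Mentions a malfunction.", "intent_label": "Bug Report"}
    return {"reasoning": "No malfunction mentioned.", "intent_label": "Refund Request"}


def exact_match(example: Example, trace: Dict[str, str]) -> bool:
    """Accept a trace only if its label equals the gold label."""
    return example["intent_label"] == trace["intent_label"]


trainset = [
    {"message": "The screen is frozen", "intent_label": "Bug Report"},
    {"message": "I want to end my subscription", "intent_label": "Cancel Subscription"},
    {"message": "Refund me please", "intent_label": "Refund Request"},
]

demos = bootstrap_demos(trainset, toy_teacher, exact_match, max_demos=2)
for demo in demos:
    print(demo)
```

The second training example is dropped because the teacher mislabels it, which is the filtering step that keeps bad traces out of the few-shot prompt.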

Self-Correcting Data Validation Guardrail

Code provided by IBM Bob in “Ask” mode!

Post-Book Implementation Challenge!

After reading the book, you have the knowledge required to assemble modular applications that extend beyond foundational patterns. The code below demonstrates an idea for a practical deployment. The system ingests unstructured raw text, extracts it into a JSON payload, and validates that payload against a strict Pydantic schema. When validation fails on syntax or schema grounds, it passes the malformed JSON and the validation error message to a second, chain-of-thought corrector module and re-validates the corrected output before returning it.

import dspy
from typing import Literal
from pydantic import BaseModel, Field, ValidationError
import json

lm = dspy.LM("openai/gpt-4o-mini", api_key="sk-your-key-here", temperature=0.0)
dspy.settings.configure(lm=lm)

# 1. Define target downstream schema validation object
# The constraints are enforced in the types, not only described, so that
# Pydantic actually rejects a malformed email or an unknown status.
class UserProfileSchema(BaseModel):
    user_id: int = Field(..., description="Unique integer ID")
    email: str = Field(..., pattern=r"^[^@\s]+@[^@\s]+\.[^@\s]+$", description="Valid standard email address string")
    account_status: Literal["active", "suspended", "pending"] = Field(..., description="Must be exactly 'active', 'suspended', or 'pending'")

# 2. Define DSPy Signatures for generation and dynamic error correction
class ExtractDataSignature(dspy.Signature):
    """Extract raw input text and convert it into a valid formatted JSON object matching the target schema."""
    raw_text: str = dspy.InputField()
    target_schema_description: str = dspy.InputField()
    extracted_json: str = dspy.OutputField(desc="Raw parsable JSON block containing extracted keys")

class SelfCorrectSignature(dspy.Signature):
    """Examine prior malformed JSON output alongside validation errors to emit a structurally corrected JSON block."""
    malformed_json: str = dspy.InputField()
    validation_error_msg: str = dspy.InputField()
    corrected_json: str = dspy.OutputField(desc="Cleaned, valid, schema-compliant JSON object")

# 3. Create a Modular, Multi-Hop Program with an explicit internal feedback loop
class GuardrailDataExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extractor = dspy.Predict(ExtractDataSignature)
        self.corrector = dspy.ChainOfThought(SelfCorrectSignature)

    def forward(self, raw_text: str):
        schema_desc = (
            "JSON fields required: user_id (int), email (str), "
            "account_status (string literal matching 'active', 'suspended', 'pending')"
        )

        # Step 1: Initial parsing extraction execution attempt
        pred = self.extractor(raw_text=raw_text, target_schema_description=schema_desc)
        extracted_text = pred.extracted_json

        # Clean potential markdown wrapping artifacts if emitted by basic models
        if "```json" in extracted_text:
            extracted_text = extracted_text.split("```json")[1].split("```")[0].strip()

        # Step 2: Validate against Pydantic constraint layer
        try:
            parsed_data = json.loads(extracted_text)
            UserProfileSchema(**parsed_data)
            return dspy.Prediction(valid=True, data=parsed_data, logs="Success on initial pass.")
        except (ValidationError, json.JSONDecodeError) as e:
            error_feedback = str(e)

            # Step 3: Self-correction trigger branch passing runtime feedback traces to a CoT corrector module
            correction_pred = self.corrector(
                malformed_json=extracted_text,
                validation_error_msg=error_feedback
            )

            corrected_text = correction_pred.corrected_json
            if "```json" in corrected_text:
                corrected_text = corrected_text.split("```json")[1].split("```")[0].strip()

            try:
                final_data = json.loads(corrected_text)
                UserProfileSchema(**final_data)
                return dspy.Prediction(valid=True, data=final_data, logs=f"Corrected after failure. Error was: {error_feedback}")
            except Exception as final_fail:
                return dspy.Prediction(valid=False, data=corrected_text, logs=f"Self-correction collapsed: {str(final_fail)}")

# --- Execution Simulation ---
guardrail_pipeline = GuardrailDataExtractor()

# Intentionally noisy input designed to cause parsing errors (invalid email format and illegal account state string)
noisy_input_string = (
    "Customer record update log: The individual with id number 9942 has an address at 'john_at_domain_com'. "
    "We need to flag his profile as initialized immediately."
)

result = guardrail_pipeline(raw_text=noisy_input_string)
print("\n--- Guardrail Program Output Evaluation Matrix ---")
print(f"Validation Status Verified: {result.valid}")
print(f"Final Parsed Safe Record Output: {json.dumps(result.data, indent=2)}")
print(f"Internal Pipeline Processing Logs: {result.logs}")
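The extractor above strips Markdown fences in two places with the same split logic. One possible refactor, shown here as a sketch and not as the book's or the author's code, is a small helper that handles fenced output with or without a `json` tag and falls back to the raw text. The names `strip_code_fence` and `parse_model_json` are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this listing readable

# Matches a fenced block (with or without a "json" tag) and captures its body.
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none is found."""
    match = _FENCED.search(text)
    return match.group(1) if match else text.strip()


def parse_model_json(text: str):
    """Strip any fence, then parse; json.JSONDecodeError propagates to the caller."""
    return json.loads(strip_code_fence(text))


wrapped = FENCE + 'json\n{"user_id": 9942}\n' + FENCE
print(parse_model_json(wrapped))
print(parse_model_json('  {"user_id": 1}  '))
```

Because parsing errors still surface as `json.JSONDecodeError`, the helper could slot into the existing `try`/`except` branches without changing how failures reach the corrector.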

Agentic Implementation use-case

Another possible DSPy implementation is an agent running on an IBM Granite model hosted on Hugging Face, which DSPy can reach through a served inference endpoint.

Below is a sketch of a Multi-Hop Research Agent built with DSPy. The agent connects to the ibm-granite/granite-3.0-8b-instruct model, takes retrieved passages as context (mocked here with an in-memory list standing in for a vector store or custom retrieval method), exposes its intermediate reasoning, and uses a simple self-correction loop to retry when it reports missing context.

pip install dspy-ai transformers torch

The code:

import os
import dspy
from typing import List

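# 1. Connect DSPy to the IBM Granite model
# We use the official 'ibm-granite/granite-3.0-8b-instruct' model.
# Assumption: the model is served behind an OpenAI-compatible endpoint
# (for example a TGI or vLLM server); the URL below is a placeholder.
# HF_TOKEN matters only if that server must download gated weights.
os.environ.setdefault("HF_TOKEN", "your_huggingface_token_here")

print("Connecting to the IBM Granite endpoint...")
granite_llm = dspy.LM(
    "openai/ibm-granite/granite-3.0-8b-instruct",
    api_base="http://localhost:8080/v1",  # placeholder, adjust to your server
    api_key="local",  # dummy key for servers that do not check it
    temperature=0.2
)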

# Configure DSPy globally to use Granite as the default Language Model
dspy.settings.configure(lm=granite_llm)


# 2. Define the Agent's Signature (The Behavioral System Contract)
class ResearchAgentSignature(dspy.Signature):
    """
    An advanced multi-hop research agent designed to answer complex user queries 
    by analyzing provided context chunks. The agent must declare its step-by-step 
    rationale before formulating a synthesized, factual final response.
    """
    user_query: str = dspy.InputField(desc="The multi-part question or task the user wants resolved.")
    retrieved_context: str = dspy.InputField(desc="Relevant contextual passages, documentation, or search results.")

    agent_thought: str = dspy.OutputField(desc="Internal analytical chain-of-thought, reasoning steps, or gap analysis.")
    final_response: str = dspy.OutputField(desc="The concrete, verified final answer to the user query.")


# 3. Build the Agentic Module with Self-Correction Loops
class GraniteResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought prepends its own reasoning field; the signature itself then asks for 'agent_thought' before 'final_response'
        self.research_step = dspy.ChainOfThought(ResearchAgentSignature)

    def forward(self, query: str, context_documents: List[str]):
        # Format list of documents into a single readable text block
        formatted_context = "\n".join([f"-[Doc {i+1}]: {doc}" for i, doc in enumerate(context_documents)])

        current_query = query
        max_attempts = 2

        for attempt in range(max_attempts):
            # Execute the core LLM call using the compiled/optimized signature
            prediction = self.research_step(
                user_query=current_query, 
                retrieved_context=formatted_context
            )

            # Simple Agentic Self-Correction Rule:
            # If the model discovers it lacks enough context in its thought process, 
            # we modify the query dynamically to perform a broader internal pass.
            if "insufficient context" in prediction.agent_thought.lower() and attempt == 0:
                print(f"⚠️ [Agent Feedback Loop]: Granite detected missing info. Adjusting strategy...")
                current_query = f"{query} (Provide a best-effort answer using generalized structural reasoning if direct facts are sparse)."
                continue

            return dspy.Prediction(
                rationale=prediction.agent_thought,
                answer=prediction.final_response
            )

# 4. Simulation / Execution
if __name__ == "__main__":
    # Mocking a vector database retrieval output for the context input
    sample_knowledge_base = [
        "IBM Granite 3.0 models are architectures trained on over 12 trillion tokens of high-quality data.",
        "Granite 3.0 Instruct models show significant enhancements in safety, enterprise tool governance, and RAG architectures.",
        "DSPy allows users to separate pipeline structure from prompt engineering choices by treating prompts as parameters."
    ]

    # Instantiate the agent
    agent = GraniteResearchAgent()

    # Run a complex query
    user_question = "How does IBM Granite 3.0's architecture fit into programmatic framework environments like DSPy?"

    print(f"\n🚀 Dispatching Agent with query: '{user_question}'\n")
    output = agent(query=user_question, context_documents=sample_knowledge_base)

    # Output the results
    print("================ AGENT REASONING STEP ================")
    print(output.rationale)
    print("\n================== FINAL RESPONSE ==================")
    print(output.answer)

Conclusion: A Solid Launchpad into Prompt Programming

To wrap up, Building LLM Applications with DSPy provides an incredibly structured blueprint for anyone looking to transition from the fragile trial-and-error of manual prompt engineering to a disciplined, code-first optimization paradigm. It breaks down the mechanics of signatures, modules, and data-driven teleprompters, turning abstract machine-learning concepts into practical software engineering patterns. While this Manning Early Access Program (MEAP) edition is not yet fully finished, it has already served as a remarkably strong foundation for my own DSPy learning journey, even in its current form.

>>> Thanks for reading <<<
