DEV Community

Cover image for 5 Prompt Engineering Patterns That Actually Work in Production
klement Gunndu
klement Gunndu

Posted on

5 Prompt Engineering Patterns That Actually Work in Production

Most prompt engineering guides teach you to write "Act as a senior developer" and call it a day.

That works in ChatGPT. It fails in production. The moment your prompt runs inside an automated pipeline — no human reviewing outputs, no chance to retry manually — you need patterns that enforce correctness structurally, not hopefully.

These 5 patterns come from running LLM calls in automated systems where bad outputs mean broken pipelines, not just awkward chat responses. Each one includes working Python code you can copy into your project today.

1. Separate System Prompts From User Input

The most common production bug: stuffing instructions and user data into the same message. The model treats everything equally, and your carefully crafted instructions get diluted by the user's input.

The fix is structural. Every major LLM API separates system-level instructions from user messages. Use that separation.

Here's how it works with the Anthropic Python SDK (as of v0.49+, March 2026):

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a code reviewer. Return only: PASS, FAIL, or NEEDS_REVIEW. "
           "Include a one-line reason. No other text.",
    messages=[
        {"role": "user", "content": f"Review this function:\n\n{code_snippet}"}
    ],
)

verdict = response.content[0].text
Enter fullscreen mode Exit fullscreen mode

The system parameter is not a suggestion — it is a separate instruction channel that shapes the model's behavior before it sees the user message. When you put review criteria in the system prompt and the code in the user message, the model treats them differently. Instructions stay instructions. Data stays data.

Why this matters in production: When your system prompt and user input live in the same string, a sufficiently long user input pushes your instructions out of the model's attention window. Separating them prevents prompt injection by design, not by hope.

The pattern: Put constraints, output format, and role definition in system. Put variable data in messages. Never mix them.

2. Force Structured Output With Pydantic

Parsing free-text LLM responses with regex is the production equivalent of catching rain with your hands. It works sometimes. It fails at 3 AM on a Saturday.

Structured output forces the model to return data that matches your exact schema. No parsing. No "the model forgot to include the field." The schema is the contract.

Here's how it works with the OpenAI Python SDK using the Responses API (as of v1.66+, March 2026):

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class CodeReview(BaseModel):
    verdict: str = Field(description="PASS, FAIL, or NEEDS_REVIEW")
    reason: str = Field(description="One-line explanation")
    severity: int = Field(ge=1, le=5, description="1=minor, 5=critical")

response = client.responses.parse(
    model="gpt-4o",
    input=[
        {"role": "system", "content": "Review the code. Return structured output."},
        {"role": "user", "content": f"Review:\n\n{code_snippet}"},
    ],
    text_format=CodeReview,
)

review = response.output_parsed  # CodeReview instance
print(review.verdict)   # "FAIL"
print(review.severity)  # 4
Enter fullscreen mode Exit fullscreen mode

The text_format parameter takes a Pydantic BaseModel class. The SDK handles JSON schema generation and response deserialization automatically. Your downstream code receives a typed Python object, not a string.

If you're using the Chat Completions API instead:

completion = client.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Review the code. Return structured output."},
        {"role": "user", "content": f"Review:\n\n{code_snippet}"},
    ],
    response_format=CodeReview,
)

review = completion.choices[0].message.parsed
Enter fullscreen mode Exit fullscreen mode

Why this matters in production: Without structured output, you write regex to extract fields, handle edge cases where the model wraps its response in markdown, and debug silent failures when a field is missing. With Pydantic, the contract is enforced at the API level. If the response doesn't match your schema, you get an error — not corrupted data.

The pattern: Define your output as a Pydantic model. Pass it to the API. Never parse free text in an automated pipeline.

3. Use Few-Shot Examples to Lock In Format

System prompts define the rules. Few-shot examples show the rules in action.

When you need consistent formatting across thousands of calls — extracting data, classifying inputs, generating reports — few-shot examples reduce variance more than any instruction ever will. Models learn by imitation, not just instruction.

Here's how it works with the Anthropic SDK:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    system="Extract the action items from meeting notes. "
           "Return each as: - [OWNER] Action description (DEADLINE)",
    messages=[
        # Few-shot example 1
        {
            "role": "user",
            "content": "Notes: Sarah will update the API docs by Friday. "
                       "Mike needs to fix the login bug before sprint end.",
        },
        {
            "role": "assistant",
            "content": "- [Sarah] Update API docs (Friday)\n"
                       "- [Mike] Fix login bug (sprint end)",
        },
        # Few-shot example 2
        {
            "role": "user",
            "content": "Notes: Team agreed to defer the redesign. "
                       "Jake to send the Q3 report to finance by EOD Tuesday.",
        },
        {
            "role": "assistant",
            "content": "- [Jake] Send Q3 report to finance (EOD Tuesday)",
        },
        # Actual input
        {
            "role": "user",
            "content": f"Notes: {meeting_notes}",
        },
    ],
)
Enter fullscreen mode Exit fullscreen mode

Notice: the "defer the redesign" note produced no action item. That second example teaches the model to skip non-actionable statements. Without it, models tend to generate phantom action items from every sentence.

Why this matters in production: Instructions describe what you want. Examples describe what it looks like. The gap between "extract action items" and "extract action items formatted exactly like this, omitting non-actionable statements" is where production bugs live. Few-shot examples close that gap.

The pattern: Include 2-3 examples that cover your edge cases. At least one example should show what the model should NOT include. Alternate user/assistant roles to simulate a real conversation.

4. Chain-of-Thought for Complex Reasoning

When a model needs to classify, compare, or decide — not just extract — asking for the answer directly produces unreliable results. Chain-of-thought prompting forces the model to show its reasoning before committing to an answer.

This is the difference between "guess the answer" and "work through the problem."

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=(
        "You are a security reviewer analyzing code for vulnerabilities.\n\n"
        "For each code snippet, follow these steps:\n"
        "1. Identify what the code does (2-3 sentences)\n"
        "2. List potential security issues (be specific)\n"
        "3. For each issue, state the attack vector\n"
        "4. Give your final verdict: SAFE, REVIEW, or VULNERABLE\n\n"
        "Always complete all 4 steps before giving the verdict."
    ),
    messages=[
        {"role": "user", "content": f"Review this code:\n\n{code_snippet}"}
    ],
)
Enter fullscreen mode Exit fullscreen mode

The key is step 4: "Always complete all 4 steps before giving the verdict." Without that constraint, the model often jumps to the verdict after step 1 and rationalizes backward. Forcing sequential reasoning produces more accurate classifications.

Here's a tighter version using XML tags for structure (works well with Claude models):

system_prompt = """Analyze the code for security issues.

<thinking>
Step 1: What does this code do?
Step 2: What security issues exist?
Step 3: What are the attack vectors?
</thinking>

<verdict>SAFE | REVIEW | VULNERABLE</verdict>
<reason>One sentence explaining the verdict</reason>

Always complete <thinking> before writing <verdict>."""
Enter fullscreen mode Exit fullscreen mode

XML tags give you parseable structure in the output. You can extract the verdict with a simple string search instead of hoping the model puts it in the right place.

Why this matters in production: On classification tasks, chain-of-thought prompting improves accuracy. Anthropic's own prompt engineering documentation recommends this pattern for complex analytical tasks. The model catches edge cases during reasoning that it would miss when jumping to conclusions.

The pattern: Number the reasoning steps. Put the final answer last. Tell the model to complete all steps before answering. Use XML tags if you need to parse specific sections from the output.

5. Template Variables With LangChain

When you're running the same prompt structure across different inputs — different customers, different documents, different code files — hardcoded strings become unmaintainable. Prompt templates separate the structure from the data.

Here's how it works with LangChain's ChatPromptTemplate (as of langchain-core 0.3+, March 2026):

from langchain_core.prompts import ChatPromptTemplate

review_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a {role} reviewing {artifact_type}. "
     "Apply these standards: {standards}. "
     "Return: verdict (PASS/FAIL), issues found, suggestions."),
    ("human", "Review this {artifact_type}:\n\n{content}"),
])

# Reuse across different review types
code_messages = review_prompt.format_messages(
    role="senior Python developer",
    artifact_type="pull request",
    standards="PEP 8, type hints required, no bare exceptions",
    content=pr_diff,
)

doc_messages = review_prompt.format_messages(
    role="technical writer",
    artifact_type="API documentation",
    standards="all endpoints documented, examples for each, error codes listed",
    content=api_docs,
)
Enter fullscreen mode Exit fullscreen mode

One template. Two completely different review contexts. The structure stays consistent; the variables change per call.

For more complex workflows where you need conversation history:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

multi_turn_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a {role}. Be concise and specific."),
    MessagesPlaceholder("history"),
    ("human", "{input}"),
])
Enter fullscreen mode Exit fullscreen mode

MessagesPlaceholder injects a list of previous messages, letting you build multi-turn conversations without string concatenation.

Why this matters in production: When prompt logic lives in f-strings scattered across your codebase, changing the output format means finding and updating every instance. Templates centralize prompt logic. You version them, test them, and swap them without touching business logic.

The pattern: Define templates once. Pass variables at call time. Use MessagesPlaceholder for conversation history. Store templates as constants or in config files — not inline in business logic.

The Meta-Pattern: Combine Them

These 5 patterns are not alternatives. They stack.

A production-ready LLM call typically uses 3-4 of these together: system prompt separation (Pattern 1) + structured output (Pattern 2) + few-shot examples (Pattern 3), all wrapped in a reusable template (Pattern 5). Add chain-of-thought (Pattern 4) when the task requires reasoning.

The difference between a prompt that works in development and one that works in production is not cleverness. It is structure. Structured prompts produce predictable outputs. Predictable outputs don't break pipelines at 3 AM.

Start with one pattern. Add the next when your current approach fails. By the time you're using all five, your LLM calls behave more like function calls — inputs in, typed outputs out, no surprises.


Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (9)

Collapse
 
seryllns_ profile image
Seryl Lns

This was a great insight. I never thought about using XML tags like this to structure LLM outputs.

Turning the response into something easily parseable while also forcing the model to complete the reasoning steps before the verdict is a really clever pattern.
Learned something new here.
Thanks !!

Collapse
 
klement_gunndu profile image
klement Gunndu

The reasoning-before-verdict ordering is the key insight there — once the model commits to structured analysis inside those XML tags, the final output is grounded in its own chain of thought rather than pattern-matching to a quick answer.

Collapse
 
seryllns_ profile image
Seryl Lns

Yeah that makes sense
What clicked for me was that once you enforce structure like this, the model stops behaving like a chatbot and starts acting more like a deterministic step in a pipeline.

Thread Thread
 
klement_gunndu profile image
klement Gunndu

Exactly — that mental shift from 'chatbot' to 'pipeline step' is the key. Once you treat the model as a function with typed inputs and outputs, you can compose, test, and version prompts the same way you'd handle any other code.

Collapse
 
mihirkanzariya profile image
Mihir kanzariya

The system prompt vs user message separation is something I wish more people talked about. I've seen so many codebases where everything gets shoved into one giant string and then people wonder why the model ignores half their instructions lol. The few-shot pattern is underrated too — I started doing this for structured extraction tasks and the consistency improvement was night and day compared to just describing the format.

Collapse
 
klement_gunndu profile image
klement Gunndu

The system prompt vs user message split is one of those things that seems obvious once you see it, but almost nobody structures their prompts that way in practice. Moving static instructions to the system prompt and keeping the user message dynamic is essentially free consistency improvement.

The few-shot observation is spot on too. For structured extraction, showing the model 2-3 examples of the exact output format eliminates most of the format drift you get with description-only prompting. It works because the model pattern-matches the examples rather than interpreting your description of the format.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.