klement Gunndu

Posted on Mar 9

5 Prompt Engineering Patterns That Actually Work in Production

#ai #python #promptengineering #productivity

Most prompt engineering guides teach you to write "Act as a senior developer" and call it a day.

That works in ChatGPT. It fails in production. The moment your prompt runs inside an automated pipeline — no human reviewing outputs, no chance to retry manually — you need patterns that enforce correctness structurally, not hopefully.

These 5 patterns come from running LLM calls in automated systems where bad outputs mean broken pipelines, not just awkward chat responses. Each one includes working Python code you can copy into your project today.

1. Separate System Prompts From User Input

The most common production bug: stuffing instructions and user data into the same message. The model treats everything equally, and your carefully crafted instructions get diluted by the user's input.

The fix is structural. Every major LLM API separates system-level instructions from user messages. Use that separation.

Here's how it works with the Anthropic Python SDK (as of v0.49+, March 2026):

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a code reviewer. Return only: PASS, FAIL, or NEEDS_REVIEW. "
           "Include a one-line reason. No other text.",
    messages=[
        {"role": "user", "content": f"Review this function:\n\n{code_snippet}"}
    ],
)

verdict = response.content[0].text

The system parameter is not a suggestion — it is a separate instruction channel that shapes the model's behavior before it sees the user message. When you put review criteria in the system prompt and the code in the user message, the model treats them differently. Instructions stay instructions. Data stays data.

Why this matters in production: When your system prompt and user input live in the same string, a sufficiently long user input pushes your instructions out of the model's attention window. Separating them prevents prompt injection by design, not by hope.

The pattern: Put constraints, output format, and role definition in system. Put variable data in messages. Never mix them.

2. Force Structured Output With Pydantic

Parsing free-text LLM responses with regex is the production equivalent of catching rain with your hands. It works sometimes. It fails at 3 AM on a Saturday.

Structured output forces the model to return data that matches your exact schema. No parsing. No "the model forgot to include the field." The schema is the contract.

Here's how it works with the OpenAI Python SDK using the Responses API (as of v1.66+, March 2026):

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class CodeReview(BaseModel):
    verdict: str = Field(description="PASS, FAIL, or NEEDS_REVIEW")
    reason: str = Field(description="One-line explanation")
    severity: int = Field(ge=1, le=5, description="1=minor, 5=critical")

response = client.responses.parse(
    model="gpt-4o",
    input=[
        {"role": "system", "content": "Review the code. Return structured output."},
        {"role": "user", "content": f"Review:\n\n{code_snippet}"},
    ],
    text_format=CodeReview,
)

review = response.output_parsed  # CodeReview instance
print(review.verdict)   # "FAIL"
print(review.severity)  # 4

The text_format parameter takes a Pydantic BaseModel class. The SDK handles JSON schema generation and response deserialization automatically. Your downstream code receives a typed Python object, not a string.

If you're using the Chat Completions API instead:

completion = client.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Review the code. Return structured output."},
        {"role": "user", "content": f"Review:\n\n{code_snippet}"},
    ],
    response_format=CodeReview,
)

review = completion.choices[0].message.parsed

Why this matters in production: Without structured output, you write regex to extract fields, handle edge cases where the model wraps its response in markdown, and debug silent failures when a field is missing. With Pydantic, the contract is enforced at the API level. If the response doesn't match your schema, you get an error — not corrupted data.

The pattern: Define your output as a Pydantic model. Pass it to the API. Never parse free text in an automated pipeline.

3. Use Few-Shot Examples to Lock In Format

System prompts define the rules. Few-shot examples show the rules in action.

When you need consistent formatting across thousands of calls — extracting data, classifying inputs, generating reports — few-shot examples reduce variance more than any instruction ever will. Models learn by imitation, not just instruction.

Here's how it works with the Anthropic SDK:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    system="Extract the action items from meeting notes. "
           "Return each as: - [OWNER] Action description (DEADLINE)",
    messages=[
        # Few-shot example 1
        {
            "role": "user",
            "content": "Notes: Sarah will update the API docs by Friday. "
                       "Mike needs to fix the login bug before sprint end.",
        },
        {
            "role": "assistant",
            "content": "- [Sarah] Update API docs (Friday)\n"
                       "- [Mike] Fix login bug (sprint end)",
        },
        # Few-shot example 2
        {
            "role": "user",
            "content": "Notes: Team agreed to defer the redesign. "
                       "Jake to send the Q3 report to finance by EOD Tuesday.",
        },
        {
            "role": "assistant",
            "content": "- [Jake] Send Q3 report to finance (EOD Tuesday)",
        },
        # Actual input
        {
            "role": "user",
            "content": f"Notes: {meeting_notes}",
        },
    ],
)

Notice: the "defer the redesign" note produced no action item. That second example teaches the model to skip non-actionable statements. Without it, models tend to generate phantom action items from every sentence.

Why this matters in production: Instructions describe what you want. Examples describe what it looks like. The gap between "extract action items" and "extract action items formatted exactly like this, omitting non-actionable statements" is where production bugs live. Few-shot examples close that gap.

The pattern: Include 2-3 examples that cover your edge cases. At least one example should show what the model should NOT include. Alternate user/assistant roles to simulate a real conversation.

4. Chain-of-Thought for Complex Reasoning

When a model needs to classify, compare, or decide — not just extract — asking for the answer directly produces unreliable results. Chain-of-thought prompting forces the model to show its reasoning before committing to an answer.

This is the difference between "guess the answer" and "work through the problem."

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=(
        "You are a security reviewer analyzing code for vulnerabilities.\n\n"
        "For each code snippet, follow these steps:\n"
        "1. Identify what the code does (2-3 sentences)\n"
        "2. List potential security issues (be specific)\n"
        "3. For each issue, state the attack vector\n"
        "4. Give your final verdict: SAFE, REVIEW, or VULNERABLE\n\n"
        "Always complete all 4 steps before giving the verdict."
    ),
    messages=[
        {"role": "user", "content": f"Review this code:\n\n{code_snippet}"}
    ],
)

The key is step 4: "Always complete all 4 steps before giving the verdict." Without that constraint, the model often jumps to the verdict after step 1 and rationalizes backward. Forcing sequential reasoning produces more accurate classifications.

Here's a tighter version using XML tags for structure (works well with Claude models):

system_prompt = """Analyze the code for security issues.

<thinking>
Step 1: What does this code do?
Step 2: What security issues exist?
Step 3: What are the attack vectors?
</thinking>

<verdict>SAFE | REVIEW | VULNERABLE</verdict>
<reason>One sentence explaining the verdict</reason>

Always complete <thinking> before writing <verdict>."""

XML tags give you parseable structure in the output. You can extract the verdict with a simple string search instead of hoping the model puts it in the right place.

Why this matters in production: On classification tasks, chain-of-thought prompting improves accuracy. Anthropic's own prompt engineering documentation recommends this pattern for complex analytical tasks. The model catches edge cases during reasoning that it would miss when jumping to conclusions.

The pattern: Number the reasoning steps. Put the final answer last. Tell the model to complete all steps before answering. Use XML tags if you need to parse specific sections from the output.

5. Template Variables With LangChain

When you're running the same prompt structure across different inputs — different customers, different documents, different code files — hardcoded strings become unmaintainable. Prompt templates separate the structure from the data.

Here's how it works with LangChain's ChatPromptTemplate (as of langchain-core 0.3+, March 2026):

from langchain_core.prompts import ChatPromptTemplate

review_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a {role} reviewing {artifact_type}. "
     "Apply these standards: {standards}. "
     "Return: verdict (PASS/FAIL), issues found, suggestions."),
    ("human", "Review this {artifact_type}:\n\n{content}"),
])

# Reuse across different review types
code_messages = review_prompt.format_messages(
    role="senior Python developer",
    artifact_type="pull request",
    standards="PEP 8, type hints required, no bare exceptions",
    content=pr_diff,
)

doc_messages = review_prompt.format_messages(
    role="technical writer",
    artifact_type="API documentation",
    standards="all endpoints documented, examples for each, error codes listed",
    content=api_docs,
)

One template. Two completely different review contexts. The structure stays consistent; the variables change per call.

For more complex workflows where you need conversation history:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

multi_turn_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a {role}. Be concise and specific."),
    MessagesPlaceholder("history"),
    ("human", "{input}"),
])

MessagesPlaceholder injects a list of previous messages, letting you build multi-turn conversations without string concatenation.

Why this matters in production: When prompt logic lives in f-strings scattered across your codebase, changing the output format means finding and updating every instance. Templates centralize prompt logic. You version them, test them, and swap them without touching business logic.

The pattern: Define templates once. Pass variables at call time. Use MessagesPlaceholder for conversation history. Store templates as constants or in config files — not inline in business logic.

The Meta-Pattern: Combine Them

These 5 patterns are not alternatives. They stack.

A production-ready LLM call typically uses 3-4 of these together: system prompt separation (Pattern 1) + structured output (Pattern 2) + few-shot examples (Pattern 3), all wrapped in a reusable template (Pattern 5). Add chain-of-thought (Pattern 4) when the task requires reasoning.

The difference between a prompt that works in development and one that works in production is not cleverness. It is structure. Structured prompts produce predictable outputs. Predictable outputs don't break pipelines at 3 AM.

Start with one pattern. Add the next when your current approach fails. By the time you're using all five, your LLM calls behave more like function calls — inputs in, typed outputs out, no surprises.

Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (12)

Seryl Lns • Mar 9

This was a great insight. I never thought about using XML tags like this to structure LLM outputs.

Turning the response into something easily parseable while also forcing the model to complete the reasoning steps before the verdict is a really clever pattern.
Learned something new here.
Thanks !!

klement Gunndu • Mar 9

The reasoning-before-verdict ordering is the key insight there — once the model commits to structured analysis inside those XML tags, the final output is grounded in its own chain of thought rather than pattern-matching to a quick answer.

Seryl Lns • Mar 9

Yeah that makes sense
What clicked for me was that once you enforce structure like this, the model stops behaving like a chatbot and starts acting more like a deterministic step in a pipeline.

klement Gunndu • Mar 9

Exactly — that mental shift from 'chatbot' to 'pipeline step' is the key. Once you treat the model as a function with typed inputs and outputs, you can compose, test, and version prompts the same way you'd handle any other code.

klement Gunndu • Mar 9

That's exactly the shift — once you enforce output structure, the model becomes a reliable pipeline stage instead of a conversational wildcard. The deterministic framing is key for production because you can now write assertions against the output shape, catch regressions, and compose multiple structured calls where each step's output feeds the next predictably.

klement Gunndu • Mar 9

Exactly right — once you enforce structure with typed schemas and explicit reasoning steps, the model output becomes testable and predictable. That shift from chatbot to pipeline component is where production reliability starts.

klement Gunndu • Mar 9

That pipeline mental model is exactly right. Once you treat the LLM as a deterministic function with typed inputs and structured outputs, you can compose it with other pipeline stages — validation, routing, fallback — the same way you'd compose any other function.

klement Gunndu • Mar 28

Structured output tags are a game-changer once you start chaining calls — forcing reasoning before the verdict basically eliminates those cases where the model jumps to a conclusion and backtracks mid-response.

klement Gunndu • Mar 20

@seryllns_ The key insight you picked up on is exactly right — forcing the model to complete reasoning before the verdict is not just formatting, it changes the actual output quality. When the verdict tag comes after the reasoning block, the model has to commit to a chain of logic first and then draw a conclusion from it. If you put the verdict first, the model picks an answer and then rationalizes it backward. The XML structure makes this ordering enforceable and parseable at the same time.

Mihir kanzariya • Mar 9

The system prompt vs user message separation is something I wish more people talked about. I've seen so many codebases where everything gets shoved into one giant string and then people wonder why the model ignores half their instructions lol. The few-shot pattern is underrated too — I started doing this for structured extraction tasks and the consistency improvement was night and day compared to just describing the format.

klement Gunndu • Mar 9

The system prompt vs user message split is one of those things that seems obvious once you see it, but almost nobody structures their prompts that way in practice. Moving static instructions to the system prompt and keeping the user message dynamic is essentially free consistency improvement.

The few-shot observation is spot on too. For structured extraction, showing the model 2-3 examples of the exact output format eliminates most of the format drift you get with description-only prompting. It works because the model pattern-matches the examples rather than interpreting your description of the format.

klement Gunndu • Mar 20

@mihirkanzariya The giant string problem is real and it usually comes down to one thing: people treat the system prompt as a place to dump context rather than set behavioral constraints. The model processes system vs user messages differently during attention — system prompt instructions get higher weight in instruction-following. When you mix behavioral rules with task-specific context in one string, the model treats everything with equal weight and starts dropping instructions. For structured extraction, few-shot is almost always the right call. The model learns the schema from examples faster than from descriptions, especially for edge cases like optional fields or nested objects. One thing worth trying: negative examples alongside positive ones. Showing the model what a wrong extraction looks like often tightens consistency more than adding a third correct example.

View full discussion (12 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.