System prompt architecture, few-shot design patterns, chain-of-thought for structured outputs, evaluation frameworks and version control. The stuff that actually matters in production.
I want to push back on something.
There's a version of the "prompt engineering" conversation that treats it like a party trick, a collection of clever phrasings that coax better responses out of chatbots. "Be specific." "Think step by step." "You are an expert in..." This advice isn't wrong. It's just nowhere near sufficient for building AI features that work reliably in production.
The gap between "prompt that works in a demo" and "prompt architecture that works in production at volume" is significant. It involves system prompt design, few-shot example curation, output structure enforcement, evaluation methodology and versioning practices that most tutorials never touch.
This is the playbook for the serious version of the skill. If you're building AI features at a Series A, B, or C company and you're still thinking about prompts as strings you pass to an API, this article is for you.
1. System Prompt Architecture
Most developers treat the system prompt as a paragraph of instructions. Enterprise-grade system prompts are structured documents with distinct sections that serve different purposes.
Here's the architecture we use:
```python
SYSTEM_PROMPT = """
## ROLE AND CONTEXT
You are a financial document analyst for {company_name}.
You work with {document_types} and your outputs are used by {audience}.
Your analysis directly affects {business_consequence}.

## CORE CAPABILITIES
- Extract structured data from unstructured financial documents
- Identify anomalies, risks and compliance gaps
- Generate actionable summaries calibrated to audience expertise

## BEHAVIOURAL CONSTRAINTS
- Never fabricate figures not present in the source document
- When uncertain, express uncertainty explicitly with language like
  "The document appears to indicate..." rather than stating as fact
- Do not provide investment advice or specific financial recommendations
- If a document is illegible or incomplete, state this clearly

## OUTPUT REQUIREMENTS
- Always respond in valid JSON matching the schema below
- Use null for fields you cannot determine, never guess
- Include a confidence_score (0.0-1.0) for extracted numerical values

## OUTPUT SCHEMA
{json_schema}

## EXAMPLES
{few_shot_examples}
"""
```
Each of the four instruction sections does specific work:

Role and Context establishes who the model is and what stakes are involved. The phrases "your outputs are used by [audience]" and "directly affects [business consequence]" aren't rhetorical; they genuinely shift how the model calibrates its confidence and thoroughness.
Core Capabilities sets scope. What the model is for, not what it isn't. Positive framing outperforms negative framing ("you can do X" outperforms "never do Y" in most cases).
Behavioural Constraints handles the failure modes that matter in production. Hallucination, overconfidence, scope creep. Each constraint is specific to an actual failure you've observed, not general safety boilerplate.
Output Requirements is non-negotiable for production systems. Your downstream code needs predictable output. Enforce structure here, not in a separate parsing step that assumes the model cooperated.
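Because the template's placeholders are plain `str.format` fields, assembling the final prompt is a one-liner. A minimal sketch, using a trimmed-down version of the template above with hypothetical values:

```python
# Minimal sketch of rendering a sectioned system-prompt template.
# Field names mirror the template above; the values are hypothetical.
PROMPT_TEMPLATE = (
    "## ROLE AND CONTEXT\n"
    "You are a financial document analyst for {company_name}.\n"
    "Your outputs are used by {audience}.\n\n"
    "## OUTPUT SCHEMA\n"
    "{json_schema}"
)

rendered = PROMPT_TEMPLATE.format(
    company_name="Acme Capital",
    audience="the compliance team",
    # Substituted values are inserted verbatim, so a JSON schema's
    # braces need no escaping; only literal braces in the template
    # string itself would need doubling.
    json_schema='{"total_revenue": "number | null"}',
)
```

One design note: keep the template a module-level constant and render at request time, so the instruction text stays diffable in version control while the injected values vary per deployment.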
2. Few-Shot Design: The Patterns That Actually Work
Few-shot examples are the most underinvested part of most prompt architectures. Teams spend a week on the instruction text and 20 minutes on the examples. This is backwards.
The model learns more from examples than from instructions. Here's how to build them properly:
```python
import json
import random


class FewShotLibrary:
    """
    Manages few-shot examples with metadata for
    selection and evaluation.
    """

    def __init__(self):
        self.examples = []

    def add_example(
        self,
        input_text: str,
        ideal_output: dict,
        tags: list[str],
        difficulty: str,  # "easy", "medium", "hard", "edge_case"
        failure_mode_covered: str | None = None,
    ):
        self.examples.append({
            "input": input_text,
            "output": ideal_output,
            "tags": tags,
            "difficulty": difficulty,
            "failure_mode": failure_mode_covered,
            "times_selected": 0,
        })

    def select_for_prompt(
        self,
        n_examples: int = 5,
        include_difficulties: list[str] | None = None,
        required_tags: list[str] | None = None,
    ) -> list[dict]:
        """
        Select examples strategically rather than randomly.
        Always include at least one edge case.
        """
        pool = self.examples.copy()
        if required_tags:
            pool = [e for e in pool
                    if any(t in e["tags"] for t in required_tags)]
        if include_difficulties:
            pool = [e for e in pool
                    if e["difficulty"] in include_difficulties]

        selected = []
        # Always include one edge case if available
        edge_cases = [e for e in pool if e["difficulty"] == "edge_case"]
        if edge_cases:
            selected.append(edge_cases[0])
            pool = [e for e in pool if e not in selected]

        # Fill remaining slots with mixed difficulty
        remaining = random.sample(
            pool,
            min(n_examples - len(selected), len(pool))
        )
        selected.extend(remaining)
        return selected

    def format_for_prompt(self, examples: list[dict]) -> str:
        """Format selected examples for injection into the system prompt."""
        formatted = []
        for ex in examples:
            formatted.append(
                f"Input: {ex['input']}\n"
                f"Output: {json.dumps(ex['output'], indent=2)}"
            )
        return "\n\n---\n\n".join(formatted)
```
Three design principles embedded in this:
Cover failure modes explicitly: Every known failure mode should have at least one example that demonstrates the correct handling. If the model sometimes fabricates figures when a document is partially illegible, you need an example showing exactly how to handle that case.
Include edge cases deliberately: Random selection of examples leaves your hardest cases underrepresented. Always reserve at least one slot for a case that represents a known difficulty.
Separate example storage from prompt construction: Inline examples are unmaintainable. A library with metadata lets you experiment with different example sets without touching the instruction text.
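The selection policy can be seen in miniature without the class. This sketch, using made-up example records shaped like the library's entries, always reserves one slot for an edge case and formats the result for injection:

```python
import json
import random

# Hypothetical example records, shaped like the library's entries.
examples = [
    {"input": "Invoice: total $1,200", "output": {"total": 1200},
     "difficulty": "easy"},
    {"input": "Smudged scan, total unreadable", "output": {"total": None},
     "difficulty": "edge_case"},
    {"input": "Invoice: total $980, VAT $49", "output": {"total": 980},
     "difficulty": "medium"},
]

# Reserve one slot for an edge case, fill the rest randomly.
edge = [e for e in examples if e["difficulty"] == "edge_case"][:1]
rest = [e for e in examples if e not in edge]
selected = edge + random.sample(rest, k=1)

# Format for injection into the system prompt.
block = "\n\n---\n\n".join(
    f"Input: {e['input']}\nOutput: {json.dumps(e['output'])}"
    for e in selected
)
```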
3. Chain-of-Thought for Structured Output
Chain-of-thought prompting is usually taught as a way to get better reasoning. For production systems, its more important use is as a reliability mechanism for structured output.
The pattern:
```python
import json
import re

STRUCTURED_COT_TEMPLATE = """
Analyse the following document and extract the required information.

Think through your analysis in <reasoning> tags before producing
your final output. Your reasoning should:

1. Identify the document type and its reliability indicators
2. Note any ambiguities or data quality issues
3. Explain your confidence level for each extracted field
4. Flag any compliance or risk concerns

After your reasoning, produce your output in the exact JSON schema
specified. Do not include any text outside the JSON block.

<reasoning>
[Your analysis here; this will be parsed separately for quality review]
</reasoning>

<output>
{json_output_here}
</output>

Document to analyse:
{document_text}
"""


def parse_cot_response(response_text: str) -> dict:
    """
    Parse a chain-of-thought response into reasoning
    and structured output.
    """
    reasoning_match = re.search(
        r"<reasoning>(.*?)</reasoning>",
        response_text,
        re.DOTALL
    )
    output_match = re.search(
        r"<output>(.*?)</output>",
        response_text,
        re.DOTALL
    )
    reasoning = reasoning_match.group(1).strip() \
        if reasoning_match else None
    output_text = output_match.group(1).strip() \
        if output_match else None
    try:
        structured_output = json.loads(output_text) \
            if output_text else None
    except json.JSONDecodeError:
        structured_output = None
    return {
        "reasoning": reasoning,
        "output": structured_output,
        "parse_success": structured_output is not None,
    }
```
The reasoning section isn't just for debugging; it's a quality signal. When you log reasoning alongside outputs, patterns in the reasoning text predict output quality. A reasoning section that says "the document is partially illegible in section 3" should correlate with lower confidence scores in the output. When it doesn't, you've found a prompt calibration issue.
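One way to exploit that signal is a simple lint over logged responses. This is a sketch with a made-up hedge-word list and the `confidence_score` field from the schema above; the threshold is an assumption, not a recommendation:

```python
import re

# Hypothetical calibration lint: hedged language in the reasoning
# should correlate with lower confidence scores in the output.
HEDGE_PATTERNS = re.compile(
    r"illegible|unclear|appears to|partially|ambiguous", re.IGNORECASE
)

def flag_calibration_issue(reasoning: str, output: dict,
                           threshold: float = 0.9) -> bool:
    """Return True when the reasoning hedges but the output is confident."""
    hedged = bool(HEDGE_PATTERNS.search(reasoning or ""))
    confidence = output.get("confidence_score", 0.0)
    return hedged and confidence >= threshold
```

A hedged reasoning paired with a 0.97 confidence score is exactly the mismatch worth surfacing in a review queue.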
4. Evaluation Framework
This is the part most teams skip and regret. An evaluation framework is not optional; it's what lets you make changes to your prompts without flying blind.
```python
import re
from typing import Callable


class PromptEvaluator:
    def __init__(self, evaluation_dataset: list[dict]):
        """
        evaluation_dataset: list of {
            "input": str,
            "expected_output": dict,
            "test_type": str  # "accuracy", "format", "edge_case"
        }
        """
        self.dataset = evaluation_dataset
        self.results = []

    def run_evaluation(
        self,
        prompt_version: str,
        model_fn: Callable[[str], str],
    ) -> dict:
        """Run a full evaluation pass and return metrics."""
        results = []
        for test_case in self.dataset:
            response = model_fn(test_case["input"])
            parsed = parse_cot_response(response)
            result = {
                "test_type": test_case["test_type"],
                "parse_success": parsed["parse_success"],
                "field_accuracy": self._calculate_field_accuracy(
                    parsed["output"],
                    test_case["expected_output"]
                ) if parsed["output"] else 0.0,
                "hallucination_detected": self._check_hallucination(
                    parsed["output"],
                    test_case["input"]
                ) if parsed["output"] else False,
            }
            results.append(result)

        metrics = {
            "prompt_version": prompt_version,
            "total_cases": len(results),
            "parse_success_rate": sum(
                r["parse_success"] for r in results
            ) / len(results),
            "average_field_accuracy": sum(
                r["field_accuracy"] for r in results
            ) / len(results),
            "hallucination_rate": sum(
                r["hallucination_detected"] for r in results
            ) / len(results),
            "by_test_type": self._group_by_type(results),
        }
        return metrics

    def _calculate_field_accuracy(
        self,
        actual: dict,
        expected: dict,
    ) -> float:
        if not actual or not expected:
            return 0.0
        correct = 0
        total = 0
        for key, expected_value in expected.items():
            if key in actual:
                total += 1
                if self._values_match(actual[key], expected_value):
                    correct += 1
        return correct / total if total > 0 else 0.0

    def _values_match(self, actual, expected, tolerance=0.02) -> bool:
        """Match with relative tolerance for numerical values."""
        if isinstance(expected, (int, float)) and \
                isinstance(actual, (int, float)):
            if expected == 0:
                return actual == 0
            return abs(actual - expected) / abs(expected) <= tolerance
        return str(actual).strip() == str(expected).strip()

    def _check_hallucination(
        self,
        output: dict,
        source_text: str,
    ) -> bool:
        """
        Basic hallucination check: verify numerical values
        in the output appear in the source text.
        This is a heuristic, not a complete check.
        """
        if not output:
            return False
        source_numbers = set(re.findall(r"\d+\.?\d*", source_text))

        def extract_numbers(obj):
            numbers = set()
            if isinstance(obj, dict):
                for v in obj.values():
                    numbers.update(extract_numbers(v))
            elif isinstance(obj, list):
                for item in obj:
                    numbers.update(extract_numbers(item))
            elif isinstance(obj, (int, float)):
                numbers.add(str(obj))
            return numbers

        output_numbers = extract_numbers(output)
        hallucinated = output_numbers - source_numbers
        return len(hallucinated) > 0

    def _group_by_type(self, results: list) -> dict:
        grouped = {}
        for result in results:
            t = result["test_type"]
            grouped.setdefault(t, []).append(result)
        return {
            t: {
                "count": len(cases),
                "parse_success_rate": sum(
                    c["parse_success"] for c in cases
                ) / len(cases),
                "avg_accuracy": sum(
                    c["field_accuracy"] for c in cases
                ) / len(cases),
            }
            for t, cases in grouped.items()
        }
```
Run this evaluation before and after every prompt change. The comparison tells you whether you improved overall, improved on one case type at the expense of another, or made things worse in a way that wasn't visible from spot-checking.
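The before/after comparison can be as simple as diffing the metrics dicts the evaluator returns. A sketch with invented numbers:

```python
def compare_runs(baseline: dict, candidate: dict,
                 keys=("parse_success_rate",
                       "average_field_accuracy")) -> dict:
    """Per-metric deltas between two evaluation runs (candidate - baseline)."""
    return {k: round(candidate[k] - baseline[k], 4) for k in keys}

# Hypothetical metrics from two evaluation passes.
v2_2 = {"parse_success_rate": 0.93, "average_field_accuracy": 0.88}
v2_3 = {"parse_success_rate": 0.96, "average_field_accuracy": 0.86}

deltas = compare_runs(v2_2, v2_3)
# Here parsing improved while field accuracy regressed: exactly the
# kind of trade-off that spot-checking tends to miss.
```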
5. Version Control for Prompts
Prompts are code. Treat them that way.
```python
# prompts/financial_analysis/v2.3.py

PROMPT_METADATA = {
    "version": "2.3",
    "author": "your-name",
    "date": "2026-04-15",
    "changelog": "Added explicit null handling for partially illegible docs",
    "eval_metrics": {
        "parse_success_rate": 0.96,
        "average_field_accuracy": 0.91,
        "hallucination_rate": 0.02,
        "eval_dataset_version": "financial_docs_v4",
        "cases_tested": 247
    },
    "supersedes": "2.2",
    "breaking_changes": False
}

SYSTEM_PROMPT = """..."""  # Full prompt here
FEW_SHOT_EXAMPLES = [...]  # Example library here
```
Store prompts in version-controlled files with metadata. Every production deployment references a specific prompt version. When something breaks in production, you know exactly which prompt version was running and what changed between the last working version and the current one.
The eval_metrics field is the important one. You should never deploy a prompt version without knowing its evaluation scores on your test dataset. If you don't have those scores, you don't have a version, you have a draft.
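That rule is easy to enforce in CI. A minimal sketch of a deploy gate over the metadata dict; the threshold values are assumptions, not recommendations:

```python
# Hypothetical deploy gate: refuse to ship a prompt version whose
# metadata lacks eval scores or whose scores fall below minimums.
MINIMUMS = {"parse_success_rate": 0.95, "average_field_accuracy": 0.90}

def check_deployable(metadata: dict) -> list[str]:
    """Return a list of blocking problems; an empty list means deployable."""
    problems = []
    metrics = metadata.get("eval_metrics")
    if not metrics:
        return ["no eval_metrics: this is a draft, not a version"]
    for key, floor in MINIMUMS.items():
        if metrics.get(key, 0.0) < floor:
            problems.append(f"{key} {metrics.get(key)} below minimum {floor}")
    return problems
```

Wired into a pre-deploy step, this turns "never ship without eval scores" from a team norm into a hard failure.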
The Patterns Connect
System prompt architecture gives you structure. Few-shot design gives you examples that cover the cases that matter. Chain-of-thought gives you reasoning transparency and output reliability. Evaluation gives you confidence that changes are improvements. Version control gives you the ability to debug production issues and roll back safely.
None of these are exotic. All of them are skipped in most AI feature builds because teams move fast and these practices feel like overhead until they aren't.
The enterprise prompt engineering guide from Dextra Labs covers the methodology in more depth including testing strategies for ambiguous outputs and prompt evaluation at scale.
Good prompt engineering is the difference between an AI feature that demos well and one that works in production. We use these patterns across client projects, from finance AI agents processing thousands of documents daily to customer support systems handling real customer escalations where the cost of a bad output is measured in customer relationships, not developer embarrassment.
The playbook exists. The question is whether you implement it before or after the first production incident.
Published by Dextra Labs | AI Consulting & Enterprise LLM Solutions