Introduction
When developers first start building AI applications, they usually focus on one thing:
The prompt.
Questions like:
- How can I make Claude respond better?
- How do I reduce hallucinations?
- Why is my AI giving inconsistent results?
- Should I use Chain of Thought?
become common.
Most teams spend days optimizing user prompts.
Very few spend time designing system prompts.
And that's where the real problem begins.
Recently, while building an AI-powered Due Diligence and Compliance Reporting platform using Amazon Bedrock and Claude, we discovered that prompt quality wasn't our biggest issue.
The real issue was a lack of system-level instructions.
The Problem
Our application generated forensic risk reports.
The workflow was simple:
User Input
↓
Claude
↓
Generated Report
Users provided:
{
"companyName": "Microsoft Corporation",
"country": "United States"
}
along with intelligence gathered from:
- Companies House
- OFAC
- OpenSanctions
- News APIs
- Regulatory Sources
The AI then generated a complete report.
Everything seemed fine.
Until we started testing at scale.
Symptoms We Observed
The exact same data often produced different outputs.
Sometimes Claude generated:
Low Risk
For the same company.
Minutes later:
Medium Risk
for nearly identical input.
Other times:
- Sections appeared in different orders
- Risk scores changed
- HTML formatting broke
- Compliance recommendations varied
- Findings were summarized differently
The model wasn't hallucinating.
It was doing exactly what we asked.
The problem was that we hadn't told it enough.
The Original Prompt
Our first implementation looked like this:
Generate an integrity due diligence report for the company using the data below.
Then we appended the API results.
That was it.
No structure.
No scoring methodology.
No formatting rules.
No output constraints.
The model had too much freedom.
Why This Happens
LLMs are prediction engines.
If instructions are vague:
Generate a report
the model must decide:
- Format
- Structure
- Tone
- Risk methodology
- Recommendation logic
on its own.
Different reasoning paths produce different outputs.
This creates inconsistency.
And inconsistency is dangerous in production systems.
The Real Solution
We stopped optimizing the user prompt.
Instead, we designed a comprehensive system prompt.
Architecture changed from:
User Prompt
↓
Claude
to:
System Prompt
↓
User Prompt
↓
Claude
The system prompt became the source of truth.
What We Added
Output Constraints
Instead of:
Generate a report
we specified:
Output MUST be valid HTML.
Do NOT use markdown.
Do NOT use emojis.
Do NOT use conversational language.
Now every response followed the same format.
Fixed Section Order
We enforced:
1. Executive Summary
2. Entity Overview
3. Registry Findings
4. Sanctions Analysis
5. PEP Analysis
6. Litigation Review
7. Adverse Media Review
8. Risk Assessment
9. Recommendation
The model could no longer rearrange sections.
Deterministic Risk Scoring
Before:
Assess risk.
After:
Sanctions = 30%
PEP = 20%
Corruption = 20%
Litigation = 15%
Media = 15%
Every report now followed the same methodology.
Anti-Hallucination Rules
One of the most important additions was:
Do not invent information.
Use only provided data.
If data is unavailable, explicitly state:
"No data available from provided sources."
This dramatically improved reliability.
Before vs After
Before
Medium Risk
Reason:
Potential concerns observed.
No explanation.
No evidence.
No consistency.
After
Risk Score: 25
Sanctions:
0/100
Evidence:
No OFAC matches found.
Source:
OFAC API
Now every score was traceable.
The Hidden Benefit
Most teams think prompts only improve output quality.
In reality, strong system prompts also improve:
Maintainability
When requirements change:
Add ownership analysis
you update one system prompt.
Not every user prompt.
Debugging
When issues occur:
Why did risk increase?
you can inspect scoring rules directly.
Compliance
Auditors want repeatable processes.
System prompts create consistency.
Ad hoc prompting does not.
A Production Pattern
Today our AI architecture looks like this:
System Prompt
↓
API Data
↓
User Instructions
↓
Claude
↓
Structured HTML Report
The system prompt defines behavior.
The user prompt provides context.
This separation dramatically improves reliability.
Lessons Learned
The biggest mistake we made was treating prompts like chat messages.
Production AI systems are not chatbots.
They are software systems.
Software systems require:
- Rules
- Constraints
- Validation
- Predictability
- Repeatability
System prompts provide those guarantees.
Best Practices for Production AI
1. Keep User Prompts Small
User prompts should contain:
Data
Context
Specific Request
Nothing more.
2. Move Rules to System Prompts
Examples:
Output format
Scoring logic
Compliance requirements
Validation rules
3. Prevent Hallucinations Explicitly
Always include:
Do not invent information.
4. Define Failure Behavior
Specify:
If data unavailable:
State that clearly.
Never leave the model guessing.
5. Standardize Output
Use:
JSON
HTML
XML
Markdown
but choose one and enforce it.
Final Thoughts
Many AI teams spend weeks optimizing prompts.
Few invest time designing system prompts.
Yet system prompts are often the difference between:
Interesting Demo
and
Production Application
If your AI outputs are inconsistent, unpredictable, or difficult to maintain, don't start by rewriting your user prompts.
Start by asking:
Does my model actually know the rules it's supposed to follow?
Because most of the time, the prompt isn't the problem.
The missing system prompt is.
Top comments (0)