saif ur rahman

Posted on Jun 26

Your Prompt Isn't the Problem: Why System Prompts Matter More Than User Prompts in Production AI Applications

#prompt #aws #bedrock #genrativeai

Introduction

When developers first start building AI applications, they usually focus on one thing:

The prompt.

Questions like:

How can I make Claude respond better?
How do I reduce hallucinations?
Why is my AI giving inconsistent results?
Should I use Chain of Thought?

become common.

Most teams spend days optimizing user prompts.

Very few spend time designing system prompts.

And that's where the real problem begins.

Recently, while building an AI-powered Due Diligence and Compliance Reporting platform using Amazon Bedrock and Claude, we discovered that prompt quality wasn't our biggest issue.

The real issue was a lack of system-level instructions.

The Problem

Our application generated forensic risk reports.

The workflow was simple:

User Input
      ↓
Claude
      ↓
Generated Report

Users provided:

{
  "companyName": "Microsoft Corporation",
  "country": "United States"
}

along with intelligence gathered from:

Companies House
OFAC
OpenSanctions
News APIs
Regulatory Sources

The AI then generated a complete report.

Everything seemed fine.

Until we started testing at scale.

Symptoms We Observed

The exact same data often produced different outputs.

Sometimes Claude generated:

Low Risk

For the same company.

Minutes later:

Medium Risk

for nearly identical input.

Other times:

Sections appeared in different orders
Risk scores changed
HTML formatting broke
Compliance recommendations varied
Findings were summarized differently

The model wasn't hallucinating.

It was doing exactly what we asked.

The problem was that we hadn't told it enough.

The Original Prompt

Our first implementation looked like this:

Generate an integrity due diligence report for the company using the data below.

Then we appended the API results.

That was it.

No structure.

No scoring methodology.

No formatting rules.

No output constraints.

The model had too much freedom.

Why This Happens

LLMs are prediction engines.

If instructions are vague:

Generate a report

the model must decide:

Format
Structure
Tone
Risk methodology
Recommendation logic

on its own.

Different reasoning paths produce different outputs.

This creates inconsistency.

And inconsistency is dangerous in production systems.

The Real Solution

We stopped optimizing the user prompt.

Instead, we designed a comprehensive system prompt.

Architecture changed from:

User Prompt
      ↓
Claude

to:

System Prompt
      ↓
User Prompt
      ↓
Claude

The system prompt became the source of truth.

What We Added

Output Constraints

Instead of:

Generate a report

we specified:

Output MUST be valid HTML.
Do NOT use markdown.
Do NOT use emojis.
Do NOT use conversational language.

Now every response followed the same format.

Fixed Section Order

We enforced:

1. Executive Summary
2. Entity Overview
3. Registry Findings
4. Sanctions Analysis
5. PEP Analysis
6. Litigation Review
7. Adverse Media Review
8. Risk Assessment
9. Recommendation

The model could no longer rearrange sections.

Deterministic Risk Scoring

Before:

Assess risk.

After:

Sanctions = 30%
PEP = 20%
Corruption = 20%
Litigation = 15%
Media = 15%

Every report now followed the same methodology.

Anti-Hallucination Rules

One of the most important additions was:

Do not invent information.
Use only provided data.
If data is unavailable, explicitly state:
"No data available from provided sources."

This dramatically improved reliability.

Before vs After

Before

Medium Risk

Reason:
Potential concerns observed.

No explanation.

No evidence.

No consistency.

After

Risk Score: 25

Sanctions:
0/100

Evidence:
No OFAC matches found.

Source:
OFAC API

Now every score was traceable.

The Hidden Benefit

Most teams think prompts only improve output quality.

In reality, strong system prompts also improve:

Maintainability

When requirements change:

Add ownership analysis

you update one system prompt.

Not every user prompt.

Debugging

When issues occur:

Why did risk increase?

you can inspect scoring rules directly.

Compliance

Auditors want repeatable processes.

System prompts create consistency.

Ad hoc prompting does not.

A Production Pattern

Today our AI architecture looks like this:

System Prompt
      ↓
API Data
      ↓
User Instructions
      ↓
Claude
      ↓
Structured HTML Report

The system prompt defines behavior.

The user prompt provides context.

This separation dramatically improves reliability.

Lessons Learned

The biggest mistake we made was treating prompts like chat messages.

Production AI systems are not chatbots.

They are software systems.

Software systems require:

Rules
Constraints
Validation
Predictability
Repeatability

System prompts provide those guarantees.

Best Practices for Production AI

1. Keep User Prompts Small

User prompts should contain:

Data
Context
Specific Request

Nothing more.

2. Move Rules to System Prompts

Examples:

Output format
Scoring logic
Compliance requirements
Validation rules

3. Prevent Hallucinations Explicitly

Always include:

Do not invent information.

4. Define Failure Behavior

Specify:

If data unavailable:
State that clearly.

Never leave the model guessing.

5. Standardize Output

Use:

JSON
HTML
XML
Markdown

but choose one and enforce it.

Final Thoughts

Many AI teams spend weeks optimizing prompts.

Few invest time designing system prompts.

Yet system prompts are often the difference between:

Interesting Demo

and

Production Application

If your AI outputs are inconsistent, unpredictable, or difficult to maintain, don't start by rewriting your user prompts.

DEV Community