Part 1 of Practical AI Engineering: Beyond the Demo
Most developers think better prompting means finding better words.
Add a role. Add more detail. Ask the model to “think step by step.” Keep extending the prompt until it becomes a small novel.
That can improve an answer. It does not make the workflow reliable.
After years of working on large software systems, I trust explicit contracts more than clever wording.
For serious work, a prompt should look less like a clever question and more like a good engineering task. It should tell the model what success means, what evidence it can use, and where it must stop.
I call this a task contract.
A prompt asks. A task contract defines success.
Compare these two requests.
Prompt A
Review this pull request and find any issues.
The model must guess:
- What kind of issues matter?
- Which files are in scope?
- Is it allowed to suggest a redesign?
- How should it report uncertainty?
- What does a useful answer look like?
Now consider this version.
Prompt B
Goal:
Review this pull request for correctness and regression risk.
Context:
- This is a TypeScript service that processes subscription renewals.
- The diff changes retry handling after a payment timeout.
- Duplicate charges are the highest-risk failure.
Scope:
- Review the changed code and directly connected call paths.
- Do not comment on formatting or unrelated refactors.
Deliverable:
Return a table with:
1. severity,
2. file and line,
3. failure scenario,
4. evidence from the code,
5. smallest safe fix,
6. test that would catch it.
Acceptance checks:
- Do not report an issue without code evidence.
- Separate confirmed defects from possible risks.
- Say "not enough evidence" when the diff cannot support a conclusion.
The second version is not better because it sounds smarter. It is better because it reduces hidden decisions.
That is the real job of a useful prompt.
The five parts of a task contract
1. Goal
State the outcome, not merely the activity.
Weak:
Look at this API.
Better:
Find behaviours that could cause an existing mobile client to break after this API change.
“Look at” describes effort. “Find breaking behaviours” describes value.
2. Context
Give the model the facts that change the answer.
Useful context can include:
- the user or system affected,
- the current architecture,
- known constraints,
- the highest-risk failure,
- decisions already made.
Do not paste everything you know. Add only information that should alter the model’s judgement.
3. Constraints
Constraints define the edges of the problem.
Examples:
Do not change the public API.
Use only libraries already in package.json.
Keep the migration reversible.
Do not include personal data in logs.
Without constraints, an AI can produce a technically valid answer that is useless in your environment.
4. Deliverable
Specify the shape of the result.
You might ask for:
- a patch,
- a decision table,
- three options with trade-offs,
- a test plan,
- a JSON object matching a schema,
- a short recommendation followed by evidence.
A clear output format makes the response easier to review and easier to feed into the next step.
5. Acceptance checks
This is the part most prompts miss.
Acceptance checks let the model inspect its own work before returning it.
For example:
Before answering, verify that:
- every recommendation maps to a stated requirement,
- every factual claim has a source,
- the code compiles conceptually with the types shown,
- unresolved assumptions are listed,
- no out-of-scope files are changed.
These checks are not a guarantee. They are a lightweight test suite for the response.
One good example can beat another paragraph of instructions
Developers often use vague role prompts:
Act as a world-class senior software architect.
The model still has to guess what “world-class” means.
A short example is often more useful:
Good finding:
HIGH: retryPayment.ts:84
A timeout after the provider accepts payment can trigger a second charge.
Evidence: the retry path creates a new idempotency key.
Fix: reuse the original key until the operation reaches a terminal state.
Test: simulate provider success followed by a client-side timeout.
Bad finding:
"Improve error handling."
This is too broad and has no failure scenario or code evidence.
The example teaches the model your standard, not merely your aspiration.
Do not solve every AI problem inside the prompt
A prompt becomes bloated when it absorbs responsibilities that belong elsewhere.
Use this simple placement guide:
| Information | Better home |
|---|---|
| Stable rules that apply to every request | System instructions |
| The current goal and constraints | User prompt or task contract |
| A reusable procedure | Skill, template, or workflow |
| Changing facts from documents | Retrieval |
| Previous decisions and project state | Memory |
| An action such as searching or running tests | Tool |
| Proof that the result is acceptable | Evaluator or deterministic check |
This matters because each layer changes at a different speed.
Your security rules may stay stable for months. The current task may last ten minutes. A product document may change tomorrow. Mixing all three into one giant prompt makes the system harder to update and debug.
Treat prompts like code
A production prompt deserves the same habits as production software.
Keep representative cases
Collect a small set of real tasks:
- a normal case,
- an ambiguous case,
- a missing-information case,
- an adversarial case,
- a high-risk case.
Define what good looks like
Do not use “the answer feels better” as your only measure.
Check things such as:
- Did it follow the requested format?
- Did it use evidence?
- Did it respect scope?
- Did it expose uncertainty?
- Did it avoid a known failure?
Change one thing at a time
When you change the prompt, model, retrieval, and tool set together, you do not know what caused the result.
Prompt work becomes engineering when changes are testable.
A reusable template
GOAL
What outcome should be produced?
CONTEXT
Which facts materially affect the answer?
CONSTRAINTS
What must the model do, avoid, or preserve?
DELIVERABLE
What exact form should the output take?
ACCEPTANCE CHECKS
How should the result be tested before it is returned?
UNCERTAINTY
What should the model do when evidence is missing?
You do not need every heading for every request. The point is to remove the decisions that matter most.
What changes
The future of prompting is not memorising magic phrases.
It is making intent, constraints, and quality visible.
A strong model can fill small gaps. It should not have to invent the definition of success.
What is one prompt you keep making longer when the real problem is a missing acceptance check?
Top comments (0)