Balraj Singh

Posted on Jun 25

Stop Writing Bigger Prompts. Start Writing Better Task Contracts

#ai #llm #productivity #softwareengineering

Part 1 of Practical AI Engineering: Beyond the Demo

Most developers think better prompting means finding better words.

Add a role. Add more detail. Ask the model to “think step by step.” Keep extending the prompt until it becomes a small novel.

That can improve an answer. It does not make the workflow reliable.

After years of working on large software systems, I trust explicit contracts more than clever wording.

For serious work, a prompt should look less like a clever question and more like a good engineering task. It should tell the model what success means, what evidence it can use, and where it must stop.

I call this a task contract.

A prompt asks. A task contract defines success.

Compare these two requests.

Prompt A

Review this pull request and find any issues.

The model must guess:

What kind of issues matter?
Which files are in scope?
Is it allowed to suggest a redesign?
How should it report uncertainty?
What does a useful answer look like?

Now consider this version.

Prompt B

Goal:
Review this pull request for correctness and regression risk.

Context:
- This is a TypeScript service that processes subscription renewals.
- The diff changes retry handling after a payment timeout.
- Duplicate charges are the highest-risk failure.

Scope:
- Review the changed code and directly connected call paths.
- Do not comment on formatting or unrelated refactors.

Deliverable:
Return a table with:
1. severity,
2. file and line,
3. failure scenario,
4. evidence from the code,
5. smallest safe fix,
6. test that would catch it.

Acceptance checks:
- Do not report an issue without code evidence.
- Separate confirmed defects from possible risks.
- Say "not enough evidence" when the diff cannot support a conclusion.

The second version is not better because it sounds smarter. It is better because it reduces hidden decisions.

That is the real job of a useful prompt.

The five parts of a task contract

1. Goal

State the outcome, not merely the activity.

Weak:

Look at this API.

Better:

Find behaviours that could cause an existing mobile client to break after this API change.

“Look at” describes effort. “Find breaking behaviours” describes value.

2. Context

Give the model the facts that change the answer.

Useful context can include:

the user or system affected,
the current architecture,
known constraints,
the highest-risk failure,
decisions already made.

Do not paste everything you know. Add only information that should alter the model’s judgement.

3. Constraints

Constraints define the edges of the problem.

Examples:

Do not change the public API.
Use only libraries already in package.json.
Keep the migration reversible.
Do not include personal data in logs.

Without constraints, a model can produce a technically valid answer that is useless in your environment.
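Some constraints can be checked mechanically instead of being left to the model's discretion. The sketch below, in TypeScript, flags imports in a generated patch whose packages are not listed as dependencies. It assumes the patch is available as plain source text and that the dependency names have already been read from package.json. The regex scan is a deliberate simplification rather than a real parser, and `findUnlistedImports` is a hypothetical name.

```typescript
// Hypothetical check for the constraint "Use only libraries already in package.json".
// Simplification: a regex scan of ES import statements, not a real TypeScript parser.

// Reduce an import specifier to its package name:
// "lodash/merge" -> "lodash", "@scope/pkg/sub" -> "@scope/pkg".
// Relative paths and "node:" built-ins are not dependencies, so they return null.
// Bare built-ins such as "fs" would still be flagged; a fuller check would allow them.
function packageName(specifier: string): string | null {
  if (specifier.startsWith(".") || specifier.startsWith("/") || specifier.startsWith("node:")) {
    return null;
  }
  const parts = specifier.split("/");
  return specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
}

// Return every imported package that is missing from the dependency list.
function findUnlistedImports(source: string, dependencies: string[]): string[] {
  // Matches `import x from "pkg"`, `import { a } from "pkg"` and `import "pkg"`.
  const importRe = /import\s+(?:[^'"]*?\s+from\s+)?["']([^"']+)["']/g;
  const allowed = new Set(dependencies);
  const unlisted = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = importRe.exec(source)) !== null) {
    const name = packageName(match[1]);
    if (name !== null && !allowed.has(name)) unlisted.add(name);
  }
  return Array.from(unlisted);
}
```

A check like this gives the same answer every time and does not depend on the model remembering the rule.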

4. Deliverable

Specify the shape of the result.

You might ask for:

a patch,
a decision table,
three options with trade-offs,
a test plan,
a JSON object matching a schema,
a short recommendation followed by evidence.

A clear output format makes the response easier to review and easier to feed into the next step.
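When the deliverable is a JSON object, the caller can reject malformed output before anyone reads it. This sketch encodes the six columns from Prompt B as a TypeScript type plus a hand-written validator. The field names, the three severity values, and the extra `status` field (used to separate confirmed defects from possible risks) are illustrative assumptions. A schema-validation library could replace the manual checks.

```typescript
// Hypothetical shape for one row of the Prompt B review table.
type Severity = "HIGH" | "MEDIUM" | "LOW";

interface Finding {
  severity: Severity;
  location: string;          // file and line, e.g. "retryPayment.ts:84"
  failureScenario: string;
  evidence: string;          // code that supports the finding
  smallestSafeFix: string;
  test: string;              // a test that would catch the defect
  status: "confirmed" | "possible"; // assumed field: confirmed defect vs possible risk
}

const SEVERITIES: readonly string[] = ["HIGH", "MEDIUM", "LOW"];
const TEXT_FIELDS = ["location", "failureScenario", "evidence", "smallestSafeFix", "test"] as const;

// Parse raw model output and return either the findings or a list of problems.
function parseFindings(raw: string): { findings: Finding[] } | { errors: string[] } {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { errors: ["output is not valid JSON"] };
  }
  if (!Array.isArray(data)) return { errors: ["expected a JSON array of findings"] };

  const errors: string[] = [];
  data.forEach((item: any, i: number) => {
    if (typeof item !== "object" || item === null) {
      errors.push(`finding ${i}: not an object`);
      return;
    }
    if (!SEVERITIES.includes(item.severity)) errors.push(`finding ${i}: invalid severity`);
    if (item.status !== "confirmed" && item.status !== "possible") {
      errors.push(`finding ${i}: invalid status`);
    }
    // Every text column must be present and non-empty.
    for (const field of TEXT_FIELDS) {
      if (typeof item[field] !== "string" || item[field].trim() === "") {
        errors.push(`finding ${i}: missing ${field}`);
      }
    }
  });
  return errors.length > 0 ? { errors } : { findings: data as Finding[] };
}
```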

5. Acceptance checks

This is the part most prompts miss.

Acceptance checks let the model inspect its own work before returning it.

For example:

Before answering, verify that:
- every recommendation maps to a stated requirement,
- every factual claim has a source,
- the code would type-check against the types shown,
- unresolved assumptions are listed,
- no out-of-scope files are changed.

These checks are not a guarantee. They are a lightweight test suite for the response.
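Some of these checks can also run as ordinary code after the response arrives, so they do not rely only on the model inspecting itself. This sketch enforces two rules from Prompt B: no finding without code evidence, and no finding outside the reviewed scope. It assumes the caller supplies the in-scope files (the changed files plus any directly connected ones); `ReviewFinding` and `checkFindings` are hypothetical names.

```typescript
// Hypothetical, minimal finding shape: only the fields these checks need.
interface ReviewFinding {
  location: string; // "file:line", e.g. "retryPayment.ts:84"
  evidence: string;
}

// Run two acceptance checks outside the model and return any violations.
function checkFindings(findings: ReviewFinding[], scopeFiles: string[]): string[] {
  const inScope = new Set(scopeFiles);
  const violations: string[] = [];
  for (const f of findings) {
    // Rule 1: every finding must cite code evidence.
    if (f.evidence.trim() === "") {
      violations.push(`${f.location}: no code evidence`);
    }
    // Rule 2: the file must be in scope. Split "file:line" at the last colon.
    const colon = f.location.lastIndexOf(":");
    const file = colon === -1 ? f.location : f.location.slice(0, colon);
    if (!inScope.has(file)) {
      violations.push(`${f.location}: file is outside the review scope`);
    }
  }
  return violations;
}
```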

One good example can beat another paragraph of instructions

Developers often use vague role prompts:

Act as a world-class senior software architect.

The model still has to guess what “world-class” means.

A short example is often more useful:

Good finding:
HIGH: retryPayment.ts:84
A timeout after the provider accepts payment can trigger a second charge.
Evidence: the retry path creates a new idempotency key.
Fix: reuse the original key until the operation reaches a terminal state.
Test: simulate provider success followed by a client-side timeout.

Bad finding:
"Improve error handling."
This is too broad and has no failure scenario or code evidence.

The example teaches the model your standard, not merely your aspiration.
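The good finding is specific enough to act on. Purely as an illustration, here is a minimal sketch of the fix it proposes: the idempotency key is created once per logical payment and reused on every retry until the provider returns a terminal state. `PaymentProvider`, `isTimeout`, and `chargeWithRetry` are hypothetical names, not an API from any payment SDK, and the sketch assumes the provider deduplicates charges that share a key.

```typescript
// Hypothetical provider client; assumed to deduplicate charges that share an idempotency key.
interface PaymentProvider {
  charge(amountCents: number, idempotencyKey: string): Promise<"succeeded" | "failed">;
}

// Assumption: the client signals a timeout with an Error whose name is "TimeoutError".
function isTimeout(err: unknown): boolean {
  return err instanceof Error && err.name === "TimeoutError";
}

// Retries only on timeouts and reuses ONE key until the provider returns a
// terminal state, so a timeout after the provider already accepted the payment
// cannot produce a second, independent charge.
async function chargeWithRetry(
  provider: PaymentProvider,
  amountCents: number,
  idempotencyKey: string, // created once per logical payment by the caller
  maxAttempts = 3,
): Promise<"succeeded" | "failed"> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await provider.charge(amountCents, idempotencyKey);
    } catch (err) {
      // Outcome unknown after a timeout: retry with the SAME key; rethrow anything else.
      if (!isTimeout(err) || attempt >= maxAttempts) throw err;
    }
  }
}
```

The test the finding asks for follows directly: a fake provider that times out once and then succeeds should see the same key on both calls.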

Do not solve every AI problem inside the prompt

A prompt becomes bloated when it absorbs responsibilities that belong elsewhere.

Use this simple placement guide:

Information	Better home
Stable rules that apply to every request	System instructions
The current goal and constraints	User prompt or task contract
A reusable procedure	Skill, template, or workflow
Changing facts from documents	Retrieval
Previous decisions and project state	Memory
An action such as searching or running tests	Tool
Proof that the result is acceptable	Evaluator or deterministic check

This matters because each layer changes at a different speed.

Your security rules may stay stable for months. The current task may last ten minutes. A product document may change tomorrow. Mixing all three into one giant prompt makes the system harder to update and debug.
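One way to keep the layers apart is to assemble each request from independently maintained parts, so a new task or a changed document never requires editing the stable rules. This is a sketch under assumptions: the `role`/`content` message shape is generic chat-style input rather than a specific vendor's API, and `RequestLayers` and `buildMessages` are hypothetical names. Tools and evaluators sit outside the message text and are omitted.

```typescript
// Generic chat-style message; not tied to a particular provider's SDK.
interface Message {
  role: "system" | "user";
  content: string;
}

// Each layer is owned and updated separately because it changes at a different speed.
interface RequestLayers {
  systemRules: string;     // stable for months: security and policy rules
  taskContract: string;    // lasts minutes: goal, context, constraints, deliverable, checks
  retrievedDocs: string[]; // may change tomorrow: fetched per request
  memory: string[];        // previous decisions and project state
}

function buildMessages(layers: RequestLayers): Message[] {
  // Empty layers are dropped so they do not leave blank headings in the prompt.
  const sections = [
    layers.memory.length > 0 ? `PROJECT STATE\n${layers.memory.join("\n")}` : "",
    layers.retrievedDocs.length > 0 ? `REFERENCE DOCUMENTS\n${layers.retrievedDocs.join("\n---\n")}` : "",
    layers.taskContract,
  ].filter((s) => s !== "");
  return [
    { role: "system", content: layers.systemRules },
    { role: "user", content: sections.join("\n\n") },
  ];
}
```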

Treat prompts like code

A production prompt deserves the same habits as production software.

Keep representative cases

Collect a small set of real tasks:

a normal case,
an ambiguous case,
a missing-information case,
an adversarial case,
a high-risk case.
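Representative cases can live next to the prompt as plain data, so every prompt change runs against the same set. The case contents below are invented placeholders loosely based on the subscription-renewal review, and `EvalCase` and `missingKinds` are hypothetical names.

```typescript
type CaseKind = "normal" | "ambiguous" | "missing-information" | "adversarial" | "high-risk";

interface EvalCase {
  id: string;
  kind: CaseKind;
  input: string;           // the diff or request given to the model
  mustMention?: string[];  // phrases a good answer is expected to contain
  mustNotClaim?: string[]; // phrases that would indicate a known failure
}

// Placeholder cases; real ones should come from actual tasks.
const cases: EvalCase[] = [
  { id: "renewal-rename", kind: "normal", input: "<diff: rename a variable in the renewal job>" },
  { id: "unclear-retry-count", kind: "ambiguous", input: "<diff: retry count changed without comment>" },
  { id: "caller-not-shown", kind: "missing-information", input: "<diff: caller not included>", mustMention: ["not enough evidence"] },
  { id: "injected-approval", kind: "adversarial", input: "<diff with a comment telling the reviewer to approve>", mustNotClaim: ["no issues found"] },
  { id: "new-key-on-retry", kind: "high-risk", input: "<diff: retry generates a new idempotency key>", mustMention: ["duplicate charge"] },
];

// Guard against a case set that silently loses one of the five kinds.
function missingKinds(set: EvalCase[]): CaseKind[] {
  const all: CaseKind[] = ["normal", "ambiguous", "missing-information", "adversarial", "high-risk"];
  return all.filter((k) => !set.some((c) => c.kind === k));
}
```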

Define what good looks like

Do not use “the answer feels better” as your only measure.

Check things such as:

Did it follow the requested format?
Did it use evidence?
Did it respect scope?
Did it expose uncertainty?
Did it avoid a known failure?
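Several of these questions can become simple, repeatable checks. This sketch covers three of them (format, uncertainty, and a known failure) for a text answer. The string heuristics, such as a Markdown table with a "severity" header and the literal phrase "not enough evidence", are assumptions chosen for illustration and would need tuning against real output; `gradeAnswer` is a hypothetical name.

```typescript
interface CheckResult {
  name: string;
  passed: boolean;
}

// Heuristic checks for a review answer; each maps to one question above.
function gradeAnswer(
  answer: string,
  expectations: { needsUncertainty: boolean; forbiddenPhrases: string[] },
): CheckResult[] {
  const text = answer.toLowerCase();
  return [
    // Format: assumes the requested table is Markdown with a "severity" column first.
    { name: "follows format", passed: /^\|\s*severity\s*\|/im.test(answer) },
    // Uncertainty: when the case lacks information, the answer must say so explicitly.
    {
      name: "exposes uncertainty",
      passed: !expectations.needsUncertainty || text.includes("not enough evidence"),
    },
    // Known failure: none of the phrases tied to a past failure may appear.
    {
      name: "avoids known failure",
      passed: expectations.forbiddenPhrases.every((p) => !text.includes(p.toLowerCase())),
    },
  ];
}
```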

Change one thing at a time

When you change the prompt, model, retrieval, and tool set together, you cannot tell which change caused the result.

Prompt work becomes engineering when changes are testable.
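A small guard can enforce this discipline by refusing to compare two runs whose configurations differ in more than one field. The `RunConfig` fields mirror the four things named above; the field names, values, and function names are hypothetical.

```typescript
interface RunConfig {
  promptVersion: string;
  model: string;
  retrieval: string; // e.g. retriever or index version
  tools: string[];
}

// Return the names of fields that differ between a baseline and a candidate run.
// JSON.stringify compares the tools array element by element, in order.
function changedFields(a: RunConfig, b: RunConfig): (keyof RunConfig)[] {
  const keys: (keyof RunConfig)[] = ["promptVersion", "model", "retrieval", "tools"];
  return keys.filter((k) => JSON.stringify(a[k]) !== JSON.stringify(b[k]));
}

// Only attribute a difference in results to a change when exactly one field changed.
function assertSingleChange(a: RunConfig, b: RunConfig): keyof RunConfig {
  const diff = changedFields(a, b);
  if (diff.length !== 1) {
    throw new Error(`expected exactly one change, found: ${diff.join(", ") || "none"}`);
  }
  return diff[0];
}
```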

A reusable template

GOAL
What outcome should be produced?

CONTEXT
Which facts materially affect the answer?

CONSTRAINTS
What must the model do, avoid, or preserve?

DELIVERABLE
What exact form should the output take?

ACCEPTANCE CHECKS
How should the result be tested before it is returned?

UNCERTAINTY
What should the model do when evidence is missing?

You do not need every heading for every request. The point is to remove the hidden decisions that matter most.
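If contracts are kept as data, the template can be rendered mechanically, and unused headings simply drop out. This is a sketch: `TaskContract` and `renderContract` are hypothetical names, and the heading order follows the template above.

```typescript
// Every heading is optional, so a request can use only the parts it needs.
interface TaskContract {
  goal?: string;
  context?: string[];
  constraints?: string[];
  deliverable?: string;
  acceptanceChecks?: string[];
  uncertainty?: string;
}

function renderContract(c: TaskContract): string {
  const blocks: string[] = [];
  // Lists become "- item" lines; missing or empty sections are skipped.
  const add = (heading: string, body: string | string[] | undefined) => {
    if (body === undefined) return;
    const text = Array.isArray(body) ? body.map((line) => `- ${line}`).join("\n") : body;
    if (text.trim() !== "") blocks.push(`${heading}\n${text}`);
  };
  add("GOAL", c.goal);
  add("CONTEXT", c.context);
  add("CONSTRAINTS", c.constraints);
  add("DELIVERABLE", c.deliverable);
  add("ACCEPTANCE CHECKS", c.acceptanceChecks);
  add("UNCERTAINTY", c.uncertainty);
  return blocks.join("\n\n");
}
```

Keeping contracts as objects also makes them easy to diff and review, in line with treating prompts like code.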

What changes

The future of prompting is not memorising magic phrases.

It is making intent, constraints, and quality visible.

A strong model can fill small gaps. It should not have to invent the definition of success.

What is one prompt you keep making longer when the real problem is a missing acceptance check?

Top comments (1)

Mike Czerwinski • Jun 28

The first four components are the strongest argument for task contracts anyone has made in a while. Goal, Context, Constraints, Deliverable are a human deciding what done means before the model gets to guess. That is the part that is genuinely exogenous: intent authored at write time, not reconstructed after the fact. Most prompting advice never gets that far.

Then the fifth component quietly hands the win back. Acceptance checks the model performs on itself are the actor signing its own delivery receipt. The same process that produced the output decides whether the output passes. It will pass. Not because it is correct, but because the thing grading it and the thing that wrote it share every prior, every blind spot, every confident mistake. A model cannot catch the error it was incapable of seeing while generating, because catching it requires standing outside the generation it just did.

So the contract holds for four steps and leaks on the fifth. The fix is not a better self-check. It is moving acceptance out of the actor entirely: a separate evaluator, a different model, a human, a test that exists before the answer and does not get rewritten to fit it. You already drew that line in the context piece, where a note cannot stamp its own freshness. Same line here. Whoever checks the work cannot also be the one who did it.

DEV Community