Most prompt engineering advice focuses on syntax. Add "think step by step." Specify the language. Say "you are an expert." Some of that helps. None of it addresses the actual reason prompts produce bad code.
The actual reason: the model can only solve the problem you described. Not the problem you have.
Those are different more often than people realize. I know because my job is to find where they diverge. I've spent the last year evaluating AI-generated code professionally: writing rubrics, running adversarial tests, doing multi-turn reviews. The failure modes I see aren't random. They trace back, almost every time, to something in how the task was specified.
There's a reason Andrej Karpathy argued in 2025 we should call this "context engineering" rather than prompt engineering. "In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information." The framing shift matters. A request is something you fire off. A specification is something you construct.
Prompts Are Specifications, Not Requests
When you ask a colleague to write a function, they bring context you never stated. They know the codebase. They know what "production-ready" means at your company. They know the last time something similar broke and why. They fill gaps with judgment.
A model has none of that. It has your words and its training data. When you leave gaps, it fills them with statistically likely defaults. Those defaults are often fine. When they're not, the code looks correct and isn't.
The shift that changed how I prompt: stop thinking of a prompt as a request and start thinking of it as a specification. A specification answers questions the implementer will have whether or not you thought to ask them.
The Four Things Every Code Prompt Is Missing
After a few hundred evaluations, four categories of missing information account for most of the failures I see.
1. The Environment
The model doesn't know where this code lives. The same function needs a completely different implementation depending on whether it runs in a coroutine context, a background thread, a serverless function, or a single-threaded script.
Instead of:
Write a function that fetches user data and caches it.
Try:
Write a Kotlin function that fetches user data and caches it.
Context: this runs inside a ViewModel using viewModelScope.
The app targets Android API 26+. We use Coroutines, not RxJava.
The first prompt produces code. The second produces code that fits where it has to live.
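The Kotlin and ViewModel specifics won't translate directly here, but the principle does. Below is a minimal Python analogue, with a hypothetical `fetch_user` coroutine standing in for the real network call: once the prompt says "this runs in a coroutine context," both the fetch and the cache access have to be written for a shared event loop, which is exactly the decision the model can't make if the environment is never named.

```python
import asyncio

# Hypothetical async data source -- stands in for a real network call.
async def fetch_user(user_id: int) -> dict:
    await asyncio.sleep(0)  # simulate I/O without blocking the event loop
    return {"id": user_id, "name": f"user-{user_id}"}

_cache: dict[int, dict] = {}
_lock = asyncio.Lock()

async def get_user(user_id: int) -> dict:
    # The lock matters because this runs on a shared event loop:
    # two concurrent calls for the same id should not both hit the network.
    async with _lock:
        if user_id not in _cache:
            _cache[user_id] = await fetch_user(user_id)
        return _cache[user_id]
```

The same prompt targeting a single-threaded script would justify a plain dict and a synchronous call. The environment line in the prompt is what selects between the two implementations.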
2. The Failure Cases
The model optimizes for the happy path unless you tell it not to. Network calls succeed. Inputs are valid. Caches hit. This isn't laziness. Your prompt described a world where those things are true.
Instead of:
Write a function to parse this JSON response into a User object.
Try:
Write a function to parse this JSON response into a User object.
Handle: malformed JSON, missing required fields, null values on
optional fields, and network timeout. Return a Result<User> so
the caller can handle each failure type explicitly.
You're not asking the model to be more careful. You're describing a more complete problem.
3. The Constraints Nobody Mentions
Performance requirements. Size limits. Thread safety. Backward compatibility. These feel obvious because you carry them in your head. The model doesn't have your head.
Instead of:
Write a function that processes a list of transactions.
Try:
Write a function that processes a list of transactions.
Constraints: the list can contain up to 100,000 items,
this runs on the main thread so blocking is not acceptable,
and we need to support API 21+.
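A Python analogue of the constraint-aware shape, with the transaction fields, the aggregation, and the chunk size all invented for illustration. A large input plus a "don't block the caller" constraint pushes the work onto a worker thread and into bounded chunks, rather than a single loop on the calling thread:

```python
from concurrent.futures import Future, ThreadPoolExecutor

CHUNK_SIZE = 1_000  # invented bound; keeps each unit of work small

def _process_chunk(chunk: list[dict]) -> float:
    # Placeholder aggregation: sum transaction amounts in one chunk.
    return sum(tx["amount"] for tx in chunk)

_executor = ThreadPoolExecutor(max_workers=1)

def process_transactions(transactions: list[dict]) -> Future:
    # Returns immediately; the caller's thread is never blocked on the work.
    def run() -> float:
        total = 0.0
        for i in range(0, len(transactions), CHUNK_SIZE):
            total += _process_chunk(transactions[i : i + CHUNK_SIZE])
        return total
    return _executor.submit(run)
```

Without the constraints paragraph, a synchronous loop over the list would satisfy the prompt perfectly well, and it would also freeze the UI on a 100,000-item input.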
4. What "Done" Looks Like
If you don't define success criteria, the model defines them for you. Usually that means "compiles and handles the obvious case." That's a low bar.
Instead of:
Write unit tests for this repository class.
Try:
Write unit tests for this repository class.
I want coverage for: success path, network failure,
cache hit, cache miss, and concurrent access.
Use JUnit4 and Mockito. Each test should have a
single assertion and a descriptive name.
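Translated into Python's unittest and unittest.mock rather than JUnit4 and Mockito, and with a hypothetical `UserRepository` as the class under test, the spec's coverage list maps one case to one narrowly focused, descriptively named test:

```python
import unittest
from unittest.mock import Mock

class UserRepository:
    """Hypothetical repository: try the cache, fall back to the network."""
    def __init__(self, api, cache):
        self.api, self.cache = api, cache

    def get_user(self, user_id):
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached
        user = self.api.fetch(user_id)
        self.cache.put(user_id, user)
        return user

class UserRepositoryTest(unittest.TestCase):
    def setUp(self):
        self.api, self.cache = Mock(), Mock()
        self.repo = UserRepository(self.api, self.cache)

    def test_cache_hit_never_touches_the_network(self):
        self.cache.get.return_value = {"id": 1}
        self.repo.get_user(1)
        self.api.fetch.assert_not_called()

    def test_cache_miss_fetches_from_network(self):
        self.cache.get.return_value = None
        self.api.fetch.return_value = {"id": 1}
        self.assertEqual(self.repo.get_user(1), {"id": 1})

    def test_network_failure_propagates_to_caller(self):
        self.cache.get.return_value = None
        self.api.fetch.side_effect = ConnectionError
        with self.assertRaises(ConnectionError):
            self.repo.get_user(1)
```

Each test name states the behavior it pins down, so a failure report reads as a sentence about what broke.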
The Pre-Prompt Checklist
Before I send a code prompt, I run through four questions:
Where does this run? Language, runtime, framework, threading model, platform constraints. If I can't answer this in one sentence, I don't know my own context well enough yet.
What goes wrong? Every function has a set of inputs that break it. State them. If the function touches the network, a database, or user input, those are automatic candidates.
What can't I trade away? Performance floor, security requirements, API compatibility, dependency restrictions. Anything that would make a technically correct solution still unshippable.
How will I know it worked? If I can't describe a test that would fail on a wrong implementation and pass on a correct one, my spec is incomplete.
This takes maybe 90 seconds. It saves much more than that in review time.
The Adversarial Test
After I write a prompt, I read it as if I'm trying to satisfy it with the worst code that technically meets the stated requirements.
If the worst technically-compliant implementation is still unshippable, my prompt is missing something.
Example. Prompt: "Write a function that returns the user's name from the database."
Worst technically-compliant implementation:
def get_user_name(user_id):
    return db.execute(f"SELECT name FROM users WHERE id = {user_id}").fetchone()[0]
This returns a name from the database. It's also a SQL injection vulnerability, the class of flaw OWASP's Top 10 for LLM Applications warns about under insecure output handling, and it will throw an unhandled TypeError if the user doesn't exist, because fetchone() returns None for a missing row and None[0] blows up.
Both problems are visible if you read the prompt adversarially. Neither shows up if you read it straight.
Better prompt:
Write a function that returns the user's name from the database.
Use parameterized queries. Return None if the user doesn't exist.
Raise a DatabaseError (not a generic exception) if the query fails.
Now the worst technically-compliant implementation is actually safe.
What This Doesn't Fix
Two things this framework won't help with: tasks that require understanding your system's history, and tasks where the right answer depends on a judgment call you haven't made yet. For the first, no amount of prompt engineering substitutes for the model actually knowing your codebase — that's where RAG and long-context strategies come in. For the second, the prompt can't be finished until you've made the decision.
Both are worth recognizing because they tell you when to stop trying to prompt your way out of a problem. Some work needs to stay with you.
The One-Line Version
Describe the problem you have, not the output you want.
The output is a function. The problem is a function that handles these inputs, runs in this context, fails gracefully in these ways, and satisfies these constraints. The model is better at solving the second one than you might think. It just can't infer it from the first.