AI in Practice, No Fluff — Day 4/10
I was helping a friend with a cover letter recently. She had a strong resume, real accomplishments, and a job posting that matched her experience well. I fed everything into Claude and asked it to draft the letter.
The output was... fine. It hit the right keywords, mentioned the right qualifications, structured everything logically. It also sounded like every cover letter you have ever read and immediately forgotten. "I am excited to bring my extensive experience in project management to your organization." That sentence could belong to literally anyone applying for literally anything.
So I iterated. "Make it more personal and direct." The result was warmer but still generic. "Match the tone of someone who is confident but not corporate." Better, but it still read like an AI approximating a human approximating professionalism. I went through four rounds of adjusting tone instructions before I stopped and asked a different question: why is this failing?
The answer was not in my instructions. It was in my context. I had given the AI her resume and the job posting, but I had not given it anything that showed how she actually communicates. The model had no voice to match, so it defaulted to the genre: cover-letter-ese. The fix was not another tone instruction. It was pasting in a few paragraphs from emails she had written, things where her actual voice came through, and saying "match this register." One change. The next draft sounded like her.
That experience is the whole article in miniature. When a prompt is not working, the instinct is to keep adjusting the instructions. Sometimes that is exactly right. More often, the real fix is somewhere else entirely, and finding it requires a diagnostic approach instead of a guessing one.
What this post is and is not
In the first series, we covered what makes a good prompt: context, task clarity, format, examples. That post was about composition: how to write a prompt that works. This post picks up where that one leaves off. Your prompt is written. It is not working. Now what?
Here is the thing worth understanding about where we are right now: models have gotten good enough that prompts rarely produce garbage anymore. The output almost always looks reasonable. The problem has shifted. It is less "this is wrong" and more "this is not what I meant." That subtlety makes debugging harder, not easier, because the failure is easy to miss at first glance.
Reading the output, not just reacting to it
The first step is the one that gets skipped most often. Before changing anything, read the bad output carefully. Not to judge it. To diagnose it.
The output is data. It is telling you something about what the model understood, what it prioritized, and where it went off track. A summary that is too long tells you the model did not understand your length constraint, or did not consider it important enough to override its instinct to be thorough. A cover letter that sounds corporate tells you the model defaulted to the genre because you did not provide a voice to match. A code snippet that uses the wrong library tells you the model lacked context about your stack.
The natural reaction to bad output is "that is wrong." The diagnostic reaction is "that is wrong in a specific way, and the specific way tells me something."
Four places a prompt usually breaks
After working through dozens of these debugging cycles, I have found that most prompt failures fall into one of four categories. Knowing which one you are dealing with changes what you do next.
1. Missing context. The model does not have the information it needs to do the job. This is the most common failure and the easiest to fix. The cover letter above was a context problem: the AI had qualifications and job requirements, but no sample of the person's actual voice.
Signs: the output is technically correct but generic. It fills in gaps with reasonable guesses instead of specific details. It sounds like it is writing about your topic from general knowledge rather than from the material you gave it.
Fix: add the missing context. Sometimes that means providing more input. Sometimes it means restructuring the input you already have so the important parts are easier for the model to find.
2. Ambiguous instruction. The model understood something different from what you meant. This one is sneaky because the output often looks like the model is being difficult when it's actually being literal.
"Write a short summary" is ambiguous. Short to you might be three sentences. Short to the model might be two paragraphs. "Summarize this in three sentences" is not ambiguous.
Signs: the output does something coherent but it is not what you wanted. The model made a choice where you expected a specific behavior. If you re-read your prompt and can see two reasonable interpretations of what you asked for, this is probably the problem.
Fix: replace the ambiguous instruction with a specific one. If you find yourself writing "no, I meant..." in a follow-up message, the original instruction was ambiguous. Rewrite it so the follow-up is unnecessary.
3. Bad format specification. The model got the content right but the shape wrong. You wanted a table and got a list. You wanted JSON and got an essay with JSON embedded in it. You wanted three bullet points and got seven.
The first series covered showing examples as one of the most effective prompting techniques. Format problems are where this pays off the most. A prompt that says "return a markdown table with columns for Name, Action, and Deadline" will usually work. A prompt that says "return a markdown table" and includes a two-row example of the exact table shape will almost always work.
Signs: the information in the output is correct but the structure is wrong. You are spending time reformatting rather than rewriting.
Fix: add a concrete example of the desired format, or tighten the format specification until there is only one way to interpret it. This is the fastest of the four to fix.
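The "show the exact shape" fix can be sketched in a few lines. This is a hypothetical example, not from the post: the column names, the example rows, and the `build_extraction_prompt` helper are all made up to illustrate embedding a two-row sample table directly in the prompt.

```python
# Hypothetical sketch: pin down the output format by including a
# two-row example of the exact table shape inside the prompt itself.

def build_extraction_prompt(notes: str) -> str:
    """Build a prompt whose format spec has only one interpretation."""
    example_table = (
        "| Name | Action | Deadline |\n"
        "| --- | --- | --- |\n"
        "| Priya | Send the revised budget | Friday |\n"
        "| Marcus | Book the venue | End of month |"
    )
    return (
        "Extract every action item from the meeting notes below.\n"
        "Return ONLY a markdown table in exactly this shape:\n\n"
        f"{example_table}\n\n"
        f"Meeting notes:\n{notes}"
    )

print(build_extraction_prompt("Priya will circulate the draft by Tuesday."))
```

The example rows do double duty: they specify the columns, the markdown syntax, and the level of detail expected in each cell, all without a single extra instruction.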
4. Model limitation. The task exceeds what the model can reliably do. This is the rarest of the four, but it is real. Some tasks require capabilities the model does not have: reliable counting, precise arithmetic on large numbers, consistent adherence to complex multi-constraint formatting rules, or knowledge of events after its training cutoff.
We covered hallucinations in the first series as one version of this: the model generating confident-sounding information that is not grounded in fact. Model limitation is a broader category. It includes hallucination, but also tasks where the model's architecture makes reliable performance unlikely regardless of how good your prompt is.
Signs: you have tried multiple clear, well-structured prompts and the output keeps failing in the same fundamental way. The failure is not about clarity or context; it is about capability. Math errors persist even with explicit "show your work" instructions. The model confidently cites a paper that does not exist no matter how you phrase the request.
Fix: change the approach. Use a calculator for math. Use a search tool for current information. Use code for deterministic logic. These are not tasks that language models are built for; precision and retrieval are not how they work. Understanding that distinction is the real lesson here. Pair the model with tools that cover its weaknesses instead of prompting harder.
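One way to picture the "pair the model with tools" fix is to split a task into its deterministic part and its language part. The invoice data below is invented for illustration; the point is that the arithmetic happens in code, and the model is only ever asked to phrase a sentence around a number that is already correct.

```python
# Sketch: let code handle the deterministic step, reserve the model
# for the language step. All data here is hypothetical.

invoices = [
    {"client": "Acme", "amount": 1249.50},
    {"client": "Globex", "amount": 870.00},
    {"client": "Initech", "amount": 432.25},
]

# Deterministic step: exact arithmetic belongs in code, not in a prompt.
total = sum(item["amount"] for item in invoices)

# Language step: the model only phrases a summary around verified numbers.
prompt = (
    f"Write a one-sentence status update noting that {len(invoices)} invoices "
    f"totaling ${total:,.2f} are outstanding. Keep the numbers exactly as given."
)
print(prompt)
```

No amount of prompt iteration makes model arithmetic as reliable as `sum()`. Moving the calculation out of the prompt removes the failure mode entirely instead of trying to prompt around it.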
One variable at a time
Once you have a hypothesis about which category the failure falls into, the temptation is to rewrite the whole prompt. Resist that.
Change one thing. Run it again. Read the output.
If the output improves, you found the right variable. If it does not, you learned that variable was not the problem, and you move to the next one. Either way, you have information you did not have before.
This sounds obvious. In practice it is surprisingly hard to do. When a prompt is frustrating you, the urge to throw it out and start from scratch feels productive. It's not. Starting over resets your experiment. You lose the diagnostic data from the failed version because now you have no idea which of your changes made the difference.
The best practice is to change one thing, observe, then decide your next move. It is the same loop whether you are debugging code, debugging a prompt, or debugging a recipe. Isolate the variable. Test. Observe.
When to stop iterating
There is a point where you should stop tweaking and reconsider the task itself. Say you are on your fifth or sixth revision and each one has made a minor improvement, but it's still not quite right. At this point, you are spending more time on the prompt than you would have spent just doing the task manually.
That is a signal. Not necessarily that the prompt cannot work, but that the return on further iteration is diminishing. One of three things is usually going on:
The task might be too complex for a single prompt. Break it into steps. Have the model do part one, review the output, then feed that into part two. Multi-turn design from the previous post is the tool here. What cannot work as one prompt often works beautifully as a conversation.
The task might be wrong. Sometimes what I think I want is not actually what I need. I have spent twenty minutes trying to get a model to rewrite a paragraph in a specific way, then realized the paragraph should just be cut entirely. The prompt was not failing. My framing of the problem was off.
The task might need a different tool. Not every problem is a prompt problem. If you need exact formatting, maybe a template with variable substitution is better than asking a model to hit your format precisely. If you need reliable math, use a spreadsheet. AI is powerful for ambiguity, natural language, and judgment calls. It is not always the right tool for precision, determinism, or retrieval.
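The template alternative from that last point is worth a concrete sketch. Python's standard-library `string.Template` does `$`-style substitution with no model in the loop; the log-line format and field names below are invented for illustration.

```python
# Sketch: when the output shape is fixed and only the values vary,
# substitution beats asking a model to hit the format precisely.
from string import Template

# stdlib string.Template: $-style placeholders, deterministic output.
status_line = Template("[$level] $service: $message (checked at $time)")

line = status_line.substitute(
    level="WARN",
    service="billing-api",
    message="latency above threshold",
    time="14:02 UTC",
)
print(line)
```

The template produces the same shape every single time. That determinism is exactly what a language model cannot promise, which is the whole argument for picking the tool by the job rather than reaching for a prompt by default.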
The reflex
The shift this post is really about is small but it changes the whole experience. When a prompt is not working, the instinct might be to brute-force a fix. Add more words. Rephrase. Hope for the best.
The better reflex is the one developers use when code does not work. Form a hypothesis about why. Test it. Observe the result. Let the result guide the next hypothesis. No guessing, no hoping, just a loop.
Hypothesis. Test. Observe. Refine.
It is not more complicated than that. The hard part is not the technique. The hard part is pausing long enough to read the bad output as diagnostic data instead of just reacting to it.
Your prompts are not conversations. They are experiments. Treat them that way.
Next up: what to do when you need your AI to return structured data instead of prose, and why "give me JSON" is almost never enough.