If you've ever asked an LLM to return output in a strict format —
valid JSON, a specific HTML structure, exactly N items — you've
probably noticed it drifts. Not constantly, but often enough that
"mostly works" isn't good enough for production code parsing the result.
I ran into this building a tool that sends a website screenshot to
Gemini and expects back a strict HTML structure: an ordered list,
each item with a specific tag layout, nothing extra.
What kept breaking
Early versions of my prompt just described the desired format in
prose: "format the response as an HTML list with this structure."
That worked maybe 80% of the time. The other 20%:
- Extra commentary before or after the list ("Here's my analysis:")
- The <em> tag meant only for one specific line showing up elsewhere in the output, sometimes even written out as literal visible text
- Two distinct issues merged into a single list item
- Occasionally a missing <li> entirely
None of these are "the model is bad." They're the model treating a
descriptive request as a soft suggestion rather than a hard constraint.
What actually fixed it
1. Validate the output programmatically instead of trusting it.
    // Pull the raw text out of the response and check for the expected tag.
    const roastHtml = result.response.text().trim();
    if (!roastHtml.includes("<li>")) {
      throw new Error("Unexpected format: triggering retry");
    }
This alone changes the failure mode from "silently broken downstream"
to "automatically retried." A simple structural check (does the
expected tag exist) catches most drift without needing to validate
every detail.
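If the single-tag check proves too permissive, a slightly stricter structural check is one option. The sketch below is not taken from the original tool: the name `checkListStructure` is hypothetical, and it assumes the expected response is exactly one `<ol>` element with no attributes.

```javascript
// Hypothetical helper: returns a list of structural problems (empty = OK).
// Assumes the expected output is a single <ol>...</ol> with at least one <li>.
function checkListStructure(html) {
  const text = html.trim();
  const problems = [];

  // Anything before <ol> or after </ol> is commentary outside the list.
  if (!text.startsWith("<ol>")) problems.push("does not start with <ol>");
  if (!text.endsWith("</ol>")) problems.push("does not end with </ol>");

  // Opening and closing <li> counts should match and be non-zero.
  const opens = (text.match(/<li>/g) || []).length;
  const closes = (text.match(/<\/li>/g) || []).length;
  if (opens === 0) problems.push("no <li> items");
  if (opens !== closes) {
    problems.push(`unbalanced <li>: ${opens} open, ${closes} closed`);
  }

  return problems;
}
```

An empty array means the response passed. Returning the list of problems rather than a boolean makes it easy to include the reason in the error that triggers a retry.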
2. Be explicit about what NOT to do, not just what to do.
Positive instructions ("format it like this") leave room for
interpretation. Adding explicit negative constraints closes the gaps:
FORMAT RULES — these are strict:
- The <em> tag is used ONLY for the Fix line at the end of each
point — never in the description
- Do not write tag names as visible text anywhere
- Do not add any introduction, conclusion, or commentary outside
the list
- Do not add extra HTML attributes, classes, or styles
The difference between "use em tags for fixes" and "never use em tags
anywhere else, and never write them as visible text" is the difference
between a suggestion and a constraint, even though both describe the
same intended behavior.
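Each negative rule in the prompt can also be mirrored by a check in code, so a violation is caught even when the model ignores the instruction. This is a rough sketch, not the original tool's code: `checkNegativeRules` is a hypothetical name, the regular expressions are deliberately simple, and the "at most one `<em>` per item" rule is an assumption inferred from the Fix-line constraint.

```javascript
// Hypothetical checks, one per negative rule in the prompt.
// Returns a list of violations (empty = none found).
function checkNegativeRules(html) {
  const violations = [];

  // "Do not write tag names as visible text": escaped tags render as text.
  if (/&lt;\/?(em|li|ol)&gt;/i.test(html)) {
    violations.push("tag name written as visible text");
  }

  // "Do not add extra HTML attributes, classes, or styles":
  // an opening tag followed by whitespace and something containing "=".
  if (/<[a-z][a-z0-9]*\s+[^>]*=/i.test(html)) {
    violations.push("unexpected attribute");
  }

  // "<em> ONLY for the Fix line": assumed to mean at most one <em> per <li>.
  const items = html.split(/<li>/i).slice(1);
  items.forEach((item, i) => {
    const count = (item.match(/<em>/gi) || []).length;
    if (count > 1) violations.push(`item ${i + 1} has ${count} <em> tags`);
  });

  return violations;
}
```

Regex checks like these are heuristics. If the format grows more complex, a real HTML parser would be a sturdier choice.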
3. Retry on validation failure, with backoff.
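    // Runs fn, retrying on any thrown error (including a failed format check).
    async function withRetry(fn, label, maxAttempts = 3) {
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return await fn();
        } catch (err) {
          // Out of attempts: surface the last error to the caller.
          if (attempt === maxAttempts) throw err;
          console.warn(`${label}: attempt ${attempt} failed, retrying`);
          // Linear backoff: wait 5s after the first failure, 10s after the second.
          await new Promise((r) => setTimeout(r, attempt * 5000));
        }
      }
    }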
Format drift is usually not consistent — the same prompt against the
same input often succeeds on a second attempt. Treating a format
mismatch as a retryable error, the same way you'd treat a network
timeout, costs almost nothing and fixes most of the remaining cases.
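To make the "format mismatch is a retryable error" idea concrete, here is a minimal sketch that puts the check and the retry loop in one function. It is an illustration under assumptions, not the original pipeline: `generateValidated`, the `generate` callback (standing in for the real model call), and the `baseDelayMs` option are all hypothetical.

```javascript
// Hypothetical wiring of validation and retry.
// `generate` stands in for the real model call and must return a Promise<string>.
async function generateValidated(
  generate,
  { maxAttempts = 3, baseDelayMs = 5000 } = {}
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const html = (await generate()).trim();
      // A format mismatch is thrown like any other transient failure,
      // so it goes through the same retry path as a network error.
      if (!html.includes("<li>")) throw new Error("Unexpected format");
      return html;
    } catch (err) {
      // Out of attempts: let the caller see the last error.
      if (attempt === maxAttempts) throw err;
      // Linear backoff: baseDelayMs, then 2 * baseDelayMs, and so on.
      await new Promise((resolve) => setTimeout(resolve, attempt * baseDelayMs));
    }
  }
}
```

Passing the model call in as a function keeps the retry logic testable with a fake generator that drifts once and then complies, without touching the network.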
The underlying pattern
Soft, descriptive instructions get treated as suggestions, not
requirements — even when they're logically necessary for the output
to be usable. The model needs the constraint stated as a constraint,
and your code needs to verify compliance rather than assume it.
Neither piece alone was enough; the prompt changes reduced drift,
but validation plus retry is what makes the pipeline actually reliable
end to end.
If you're building anything that parses LLM output downstream:
validate structurally, state constraints as negatives as well as
positives, and treat format mismatches as retryable failures rather
than edge cases to patch around later.