Getting an LLM to Actually Follow Your Output Format (Without Fighting It Every Request)

#llm #ai #javascript #webdev

If you've ever asked an LLM to return output in a strict format —
valid JSON, a specific HTML structure, exactly N items — you've
probably noticed it drifts. Not constantly, but often enough that
"mostly works" isn't good enough for production code parsing the result.

I ran into this building a tool that sends a website screenshot to
Gemini and expects back a strict HTML structure: an ordered list,
each item with a specific tag layout, nothing extra.

What kept breaking

Early versions of my prompt just described the desired format in
prose: "format the response as an HTML list with this structure."
That worked maybe 80% of the time. The other 20%:

- Extra commentary before or after the list ("Here's my analysis:")
- The <em> tag meant only for one specific line showing up elsewhere in the output, sometimes even written out as literal visible text
- Two distinct issues merged into a single list item
- Occasionally a missing <li> entirely

None of these are "the model is bad." They're the model treating a
descriptive request as a soft suggestion rather than a hard constraint.

What actually fixed it

1. Validate the output programmatically; don't trust it.

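// `result` is the response object returned by the earlier Gemini call.
const roastHtml = result.response.text().trim();

// Cheap structural check: a response with no <li> cannot be the expected list.
if (!roastHtml.includes("<li>")) {
  throw new Error("Unexpected format — triggering retry");
}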

This changes the failure mode from "silently broken downstream" to a
thrown error that the retry wrapper in step 3 can catch and retry. A
simple structural check (does the expected tag exist) catches most
drift without needing to validate every detail.
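
If the single-tag check turns out to be too loose, a slightly stricter version is still only a few lines. The sketch below is one possible shape, not what the tool actually runs: the name checkListStructure and the expectedItems parameter are made up for illustration, and it assumes the response should be a single <ol> with plain <li> tags.

```javascript
// Hypothetical helper: returns a list of problems.
// An empty list means the overall shape looks right.
function checkListStructure(html, expectedItems) {
  const problems = [];
  const text = html.trim();

  // The response should start with <ol> and end with </ol>,
  // which rules out commentary before or after the list.
  if (!/^<ol>[\s\S]*<\/ol>$/.test(text)) {
    problems.push("response is not wrapped in <ol>...</ol>");
  }

  // Opening and closing <li> tags should pair up.
  const opens = (text.match(/<li>/g) || []).length;
  const closes = (text.match(/<\/li>/g) || []).length;
  if (opens !== closes) {
    problems.push(`unbalanced <li> tags: ${opens} opened, ${closes} closed`);
  }

  // Optionally enforce "exactly N items".
  if (expectedItems !== undefined && opens !== expectedItems) {
    problems.push(`expected ${expectedItems} items, found ${opens}`);
  }

  return problems;
}
```

Returning a list of problems instead of a boolean makes the eventual error message more useful: the caller can throw with problems.join("; ") and see in the logs which kind of drift happened.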

2. Be explicit about what NOT to do, not just what to do.

Positive instructions ("format it like this") leave room for
interpretation. Adding explicit negative constraints closes the gaps:

FORMAT RULES — these are strict:
- The <em> tag is used ONLY for the Fix line at the end of each 
  point — never in the description
- Do not write tag names as visible text anywhere
- Do not add any introduction, conclusion, or commentary outside 
  the list
- Do not add extra HTML attributes, classes, or styles

The difference between "use em tags for fixes" and "never use em tags
anywhere else, and never write them as visible text" is the difference
between a suggestion and a constraint, even though both describe the
same intended behavior.
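
Each negative rule in the prompt can also be mirrored by a check in code, so the constraint is stated once to the model and enforced once by the parser. The following is a rough sketch under assumptions: checkFormatRules is a hypothetical name, each item is assumed to look like <li>description <em>Fix: ...</em></li>, and "tag names as visible text" is assumed to show up as escaped entities such as &lt;em&gt;.

```javascript
// Hypothetical checks that mirror the negative rules in the prompt.
// Assumes each item looks like: <li>description <em>Fix: ...</em></li>
function checkFormatRules(html) {
  const problems = [];

  // Rule: no extra attributes, classes, or styles on the expected tags.
  if (/<(ol|li|em)\s[^>]*>/i.test(html)) {
    problems.push("tag with attributes found");
  }

  // Rule: no tag names written as visible text (assumed to appear escaped).
  if (/&lt;\/?(ol|li|em)&gt;/i.test(html)) {
    problems.push("tag written as visible text");
  }

  // Rule: <em> appears exactly once per item, and it closes the item.
  const items = html.match(/<li>[\s\S]*?<\/li>/g) || [];
  items.forEach((item, i) => {
    const emCount = (item.match(/<em>/g) || []).length;
    const endsWithEm = /<\/em>\s*<\/li>$/.test(item);
    if (emCount !== 1 || !endsWithEm) {
      problems.push(`item ${i + 1}: <em> must wrap only the final Fix line`);
    }
  });

  return problems;
}
```

Regex checks like these are deliberately shallow. They only work because the expected structure is small and fixed; a more complex output shape would likely call for a real HTML parser or a schema.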

3. Retry on validation failure, with backoff.

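async function withRetry(fn, label, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Out of attempts: surface the last error to the caller.
      if (attempt === maxAttempts) throw err;
      console.warn(`${label}: attempt ${attempt} failed (${err.message}), retrying`);
      // Linear backoff: wait 5s after the first failure, 10s after the second.
      await new Promise((r) => setTimeout(r, attempt * 5000));
    }
  }
}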

Format drift is usually not deterministic: the same prompt against the
same input often succeeds on a second attempt. Treating a format
mismatch as a retryable error, the same way you'd treat a network
timeout, costs only an extra request and a short delay on the failing
cases, and fixes most of the remaining ones.
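
To make the connection between steps 1 and 3 concrete: the validation has to run inside the function being retried, otherwise a bad format never reaches the retry loop. Here is a minimal sketch of that wiring. The generateValidated function, its generate argument (a stand-in for the real model call), and the baseDelayMs parameter are all hypothetical additions for illustration; baseDelayMs exists only so the delay can be shortened in tests.

```javascript
// Same shape as withRetry above, plus a hypothetical baseDelayMs parameter
// so the backoff can be shortened when testing.
async function withRetry(fn, label, maxAttempts = 3, baseDelayMs = 5000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      console.warn(`${label}: attempt ${attempt} failed, retrying`);
      await new Promise((r) => setTimeout(r, attempt * baseDelayMs));
    }
  }
}

// `generate` is any async function that returns the model's raw text.
// Validation lives inside the retried function, so a format mismatch
// is retried exactly like a network error would be.
async function generateValidated(generate, maxAttempts = 3, baseDelayMs = 5000) {
  return withRetry(async () => {
    const html = (await generate()).trim();
    if (!html.includes("<li>")) {
      throw new Error("Unexpected format, triggering retry");
    }
    return html;
  }, "roast", maxAttempts, baseDelayMs);
}
```

In real use, generate would wrap the actual API call and return its text. If every attempt fails validation, the last error still propagates, so the caller has to decide what a hard failure looks like.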

The underlying pattern

Soft, descriptive instructions get treated as suggestions, not
requirements — even when they're logically necessary for the output
to be usable. The model needs the constraint stated as a constraint,
and your code needs to verify compliance rather than assume it.
Neither piece alone was enough; the prompt changes reduced drift,
but the validation + retry is what makes the pipeline actually reliable
end to end.

If you're building anything that parses LLM output downstream:
validate structurally, state constraints as negatives as well as
positives, and treat format mismatches as retryable failures rather
than edge cases to patch around later.

Top comments (2)

Alex Shev • Jun 27

The production lesson is that format compliance should not live only in the prompt. I like prompts for intent, schemas for contract, and validators for enforcement. If the caller cannot reject or repair a bad shape deterministically, the system is still trusting a language model to be a parser.

KNALLHART.DEV • Jun 28

That three-layer split is a cleaner way to say what I was getting at with examples instead of a framework. My validator right now is pretty minimal (just checking a tag exists), no real schema in the sense you mean — it works for a single fixed structure but wouldn't scale if the output shape got more complex.

"The system is still trusting the model to be a parser" is a sharp way to put it. That's exactly the failure mode when validation is just eyeballing outputs occasionally instead of something the code actually enforces every time.