Joud Awad

Posted on Jun 21

46/60 Days System Design Questions

#abotwrotethis #ai #agents #systemdesign

Your production LLM agent just returned this JSON to your order processing service:

{ "action": "refund", "amount": "fifty dollars", "order_id": null, "confidence": "pretty high" }

Your downstream service crashes. The retry hits the same model. Same broken output. The refund never fires — but the user got a confirmation email.

You need your agent to return valid, typed, structured output — every time. What do you do?

A) Prompt-engineer harder — add "Always return valid JSON with these exact fields" to your system prompt and document the schema inline.

B) Use structured outputs / function calling (OpenAI, Bedrock tool use, Gemini response schema) — constrain the model at the API level to return a typed schema.

C) Post-process with a validation layer — parse the output, run JSON Schema or Pydantic validation, retry with corrective context if it fails (max 2 retries).

D) Add a second LLM as a judge — pass the first model's output to a smaller, faster model that scores and flags invalid responses before they reach your service.

Three of these are patterns used in production AI systems. One of them is wishful thinking.

Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments, including the pattern that looks defensive but actually makes hallucinations worse under load.

Drop your answer

#30DaysOfSystemDesign #SystemDesign #AIEngineering #AgenticsAI

Top comments (4)

Joud Awad • Jun 21

Why B wins (Structured outputs / API-level schema enforcement):

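This is the only option that eliminates the problem at the source. When you use OpenAI's response_format: { type: "json_schema" } with strict mode, Gemini's response_schema, or Bedrock tool use through the Converse API, and the provider enforces the schema during decoding, the model's token sampling is constrained: a schema-violating token is not "less likely," it cannot be sampled. Enforcement strength differs between providers and between tool-use and response-format modes, so confirm in each provider's documentation that the mode you pick is strictly enforced.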

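Under the hood, these APIs use constrained decoding: the model's output probability distribution is masked so tokens that would violate the schema get zero probability at each step. The result is schema conformance on every completed response. No format retries. No scraping JSON out of free text. No "it worked 98% of the time." Two caveats remain: a refusal or a response cut off at the token limit can still fail to parse, and a schema-valid object can still carry a wrong value, so business rules (does this order_id exist?) still need checking downstream.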
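To make the masking step concrete, here is a toy sketch. The five-token vocabulary, the raw scores, and the "allowed" set are all invented for illustration; real implementations compile the schema into a grammar over the tokenizer's full vocabulary, and this is not any provider's actual code.

```python
import math

# Toy vocabulary (assumption); a real tokenizer has tens of thousands of tokens.
VOCAB = ['"refund"', '"fifty dollars"', "50.0", "null", '"ord_123"']


def mask_logits(logits, allowed):
    """Replace the logit of every disallowed token with -inf."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def softmax(logits):
    """Standard softmax; exp(-inf) is 0.0, so masked tokens get zero probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick(logits, allowed):
    """Greedy choice among the tokens the schema allows at this position."""
    probs = softmax(mask_logits(logits, allowed))
    best = max(range(len(VOCAB)), key=lambda i: probs[i])
    return VOCAB[best], probs


# Hypothetical raw scores at the position where "amount" gets its value:
# unconstrained, the model prefers the string "fifty dollars".
raw_scores = [0.1, 3.0, 2.0, 0.5, 0.2]
# The schema says "amount" is a number, so only the numeric token is allowed.
allowed_for_amount = {"50.0"}

token, probs = pick(raw_scores, allowed_for_amount)
print(token)     # 50.0
print(probs[1])  # 0.0, the string token can no longer be sampled
```

The point of the sketch: the model's preference for the bad token is still there in the raw scores, it just never reaches the sampler.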

In production: use Pydantic (Python) or Zod (TypeScript) to define your schema, pass it to the API, deserialize directly into your typed model. Your downstream service never sees a string where it expects an int.
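A rough sketch of the "define the schema, deserialize into a typed model" step. To keep it dependency-free it uses a stdlib dataclass as a stand-in for Pydantic or Zod. The field types, the REFUND_SCHEMA dict, the RefundDecision class, and the "ord_123" id are assumptions inferred from the example payload in the post, not a real service contract.

```python
import json
from dataclasses import dataclass

# JSON Schema you would hand to the provider's schema parameter (assumed shape).
# Strict modes typically want every field required and no extra properties.
REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["refund"]},
        "amount": {"type": "number"},
        "order_id": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["action", "amount", "order_id", "confidence"],
    "additionalProperties": False,
}


@dataclass(frozen=True)
class RefundDecision:
    action: str
    amount: float
    order_id: str
    confidence: float

    @classmethod
    def from_json(cls, raw: str) -> "RefundDecision":
        obj = cls(**json.loads(raw))  # TypeError on missing or unexpected keys
        for name in ("amount", "confidence"):
            value = getattr(obj, name)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"{name} must be a number, got {value!r}")
        if not isinstance(obj.order_id, str):
            raise TypeError(f"order_id must be a string, got {obj.order_id!r}")
        if obj.action != "refund":
            raise ValueError(f"unsupported action: {obj.action!r}")
        return obj


good = RefundDecision.from_json(
    '{"action": "refund", "amount": 50.0, "order_id": "ord_123", "confidence": 0.93}'
)
print(good.amount)  # 50.0

# The payload from the top of the post is rejected before it reaches the service.
bad = '{"action": "refund", "amount": "fifty dollars", "order_id": null, "confidence": "pretty high"}'
try:
    RefundDecision.from_json(bad)
except TypeError as exc:
    print(f"rejected: {exc}")
```

With Pydantic the hand-written checks collapse into the field annotations; the shape of the flow (schema out, typed object in) stays the same.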

Joud Awad • Jun 21

Why A is wishful thinking (Prompt engineering):

"Always return valid JSON" is not a contract. It's a suggestion.

LLMs are probabilistic text predictors. Under load, edge-case prompts, or slightly different phrasing, they drift. You'll get valid JSON 97% of the time and a markdown code block the other 3%. At 10k requests/day that's 300 failures. At 100k, it's 3,000.
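A small sketch of that failure mode and the arithmetic. The 3% rate is the illustrative figure from this comment, not a measurement, and the helper names are made up for the example.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example tidy
# What the model sometimes returns: correct JSON wrapped in a markdown code block.
wrapped = f'{FENCE}json\n{{"action": "refund"}}\n{FENCE}'


def parses(raw: str) -> bool:
    """True if the raw model output is parseable JSON as-is."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def expected_failures(requests_per_day: int, failure_rate: float) -> int:
    """Expected number of unparseable responses per day at a given failure rate."""
    return round(requests_per_day * failure_rate)


print(parses('{"action": "refund"}'))    # True
print(parses(wrapped))                   # False
print(expected_failures(10_000, 0.03))   # 300
print(expected_failures(100_000, 0.03))  # 3000
```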

Prompt engineering is a first step when you're prototyping. It is not an output validation strategy for production. Teams that rely on this eventually write 50-line system prompts trying to enumerate every failure mode. The model still escapes.

Joud Awad • Jun 21

Why C is correct but incomplete (Post-process + retry loop):

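Validation layer + retry with corrective context is a legitimate, widely used pattern, especially when you can't use structured outputs (e.g., open-source models like Llama, Mistral, or your own fine-tune, served by a stack that doesn't expose schema-constrained decoding; recent vLLM and Ollama releases do offer it, so check before assuming you need the fallback).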

The implementation: parse → validate with Pydantic/JSON Schema → on failure, re-inject the original prompt + the bad output + a correction instruction ("Your previous output had these errors: [list]. Try again.") → retry, max 2 attempts → fail hard and surface to human review.
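The loop described above could be sketched like this. call_model is a hypothetical callable standing in for whatever client you use, the validator only checks two fields to keep the example short, and NeedsHumanReview is an invented exception name.

```python
import json
from typing import Callable

MAX_RETRIES = 2  # from the text: at most 2 retries, so at most 3 model calls


class NeedsHumanReview(Exception):
    """Raised when every attempt failed validation."""


def validate(raw: str) -> list[str]:
    """Return human-readable errors; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    amount = data.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount must be a number")
    if not isinstance(data.get("order_id"), str):
        errors.append("order_id must be a string")
    return errors


def call_with_repair(call_model: Callable[[str], str], prompt: str) -> dict:
    attempt_prompt = prompt
    errors: list[str] = []
    for _ in range(1 + MAX_RETRIES):
        raw = call_model(attempt_prompt)
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # Re-inject the original prompt, the bad output, and the error list.
        attempt_prompt = (
            f"{prompt}\n\nYour previous output was:\n{raw}\n\n"
            f"Your previous output had these errors: {'; '.join(errors)}. Try again."
        )
    raise NeedsHumanReview(f"gave up after {1 + MAX_RETRIES} attempts: {errors}")


# Fake model for the demo: broken on the first call, fixed on the second.
replies = iter([
    '{"amount": "fifty dollars", "order_id": null}',
    '{"amount": 50.0, "order_id": "ord_123"}',
])
result = call_with_repair(lambda _prompt: next(replies), "Decide on the refund.")
print(result["amount"])  # 50.0
```

Failing hard with a dedicated exception keeps the "surface to human review" step explicit instead of letting a half-valid object leak downstream.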

But: this is your fallback, not your primary. Every retry is latency + cost. Under high load, retry storms can spike your API spend 3–4x. Corrective context also doesn't help when the model fundamentally can't produce what you're asking — you'll just retry into the same failure.
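A back-of-the-envelope sketch of how retries amplify call volume. It assumes each attempt fails independently with the same probability, which is optimistic: as noted above, failures tend to repeat on the same input. It also counts calls only; each retry carries the bad output and the correction text, so token spend can grow faster than call count.

```python
def expected_calls(p_fail: float, max_retries: int = 2) -> float:
    """Expected model calls per request under independent failures.

    Attempt k+1 only happens when the first k attempts all failed,
    so the expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(p_fail ** k for k in range(max_retries + 1))


print(expected_calls(0.03))  # about 1.03 calls per request when failures are rare
print(expected_calls(1.0))   # 3.0, every request burns all attempts
```

The second case is the retry storm: when an input reliably breaks the model, every such request costs the full retry budget and still ends in a failure.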

Use C when B isn't available. Use B when it is....

Joud Awad • Jun 21

Why D is a trap (LLM-as-judge for validation):

LLM-as-judge is a real eval pattern — it's used to score quality, tone, safety, and correctness in offline evaluation pipelines and red-teaming.

For structural output validation in a hot path? It's expensive, slow (~200–500ms extra per request), and unreliable for schema enforcement. A judge model can tell you "this response seems wrong" — it cannot guarantee a valid typed object. You've added latency and cost without actually solving the problem.

LLM-as-judge belongs in your eval harness (nightly runs, CI) — not in your request/response loop for data integrity.

The worse failure mode: teams add a judge in addition to bad structured prompting, feel safer, and ship it. Both models can hallucinate in sync. The judge agrees the broken output looks fine. Silent corruption ships to prod.