Joud Awad

Posted on Jun 9

34/60 Days System Design Questions

#ai #promptengineering #systemdesign #backend

Your AI feature works in the demo.

It fails in production 3 weeks later. Nobody touched the model. Nobody changed the code.

The only thing that changed: the inputs got messier.

Here's the setup:

You're at a SaaS company. 50,000 support tickets a week. Your team builds an AI triage system — GPT-4o classifies each ticket into 6 categories (billing, bug, feature request, account access, security, other) so the right team gets it instantly.

In dev, it nails 71% accuracy. You need 90%+ to cut manual review.

The model is locked. The budget for inference isn't unlimited. You need to close the 19-point gap.

Here are your four options:

A) Zero-shot with a better system prompt — rewrite the instructions, add explicit category definitions, specify edge case rules. No examples.

B) Few-shot examples — add 3–5 real classified tickets directly in the prompt. One example per category edge case.

C) Chain-of-Thought — add "think step-by-step before answering" to the prompt. Force the model to reason through the ticket before outputting the category.

D) Self-Consistency — run each ticket through the model 5 times with temperature=0.7, take the majority vote across outputs.

Same model. Same ticket. Four different accuracy + cost profiles.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

30DaysOfSystemDesign #AI #SystemDesign #BackendEngineering

Top comments (6)

Sloan the DEV Moderator • Jun 10

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!

Joud Awad • Jun 9

D — Self-Consistency
Accuracy: 93–96%. Cost: 5x. Latency: 5x.

Run the same ticket 5–10x at temperature=0.7, take the majority vote. Works because each run samples a slightly different path — noise cancels out across runs.

The math is good. The cost is brutal.

50K tickets/week × $0.002 × 5 runs = $500/week vs $100/week with few-shot. You're paying $400/week extra for a 3–5% gain on the hardest cases.

Worth it when being wrong is expensive: fraud detection (a missed flag costs hundreds), medical triage (wrong category = real harm). For support routing, a misclassified ticket costs ~$0.10 of human rerouting time. Use B.

Joud Awad • Jun 9

C — Chain-of-Thought
Accuracy delta: ±2%. Latency: +200–400ms.

CoT is the most misapplied technique in production AI right now. Teams add "think step-by-step" to everything. For some tasks it works. Not this one.

Why it struggles here: the model generates a reasoning chain, then commits to a label. On ambiguous tickets, that chain becomes a liability. "Mentions a charge... but also a login issue... charge is mentioned first... probably billing." It picks billing. Right answer is account access.

CoT helps when the answer emerges from reasoning — math, debugging, planning. For classification, the answer emerges from calibration. Reasoning doesn't substitute for calibration.

Layer it on top of few-shot if you're still missing edge cases. Don't reach for it first.

Joud Awad • Jun 9

B — Few-shot Best answer here
Accuracy: 88–93%. Cost increase: near-zero.

Few-shot shifts the model from reading rules to pattern-matching against real examples. Fundamentally stronger signal.

The part most teams get wrong: they pick the obvious examples. A crystal-clear billing ticket. A textbook bug report. Those teach the model nothing it didn't already know.

You want the ambiguous ones. The ticket that looks like billing but is account access. The feature request that reads like a bug. Those are where the model currently fails — and those are the examples that close the gap.

Get the selection right, you close most of the 19 points without touching anything else.

Joud Awad • Jun 9

A — Zero-shot with a better system prompt
Accuracy ceiling: ~78–82%.

This is where every team starts. You iterate the prompt, add category definitions, enumerate edge cases. It feels like progress. And for a while it is.

The hard wall: instructions describe categories. They can't show the model where the ambiguous cases land.

Real example: you define "billing" as anything about charges or invoices. Then you get this ticket — "I can't log in and I was charged twice last month." Billing? Account access? Both?

No amount of rewriting tells the model which bucket that gets. It guesses wrong 20–25% of the time on edge cases. And your dataset is full of edge cases. The ceiling is real.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.