Anannya Roy Chowdhury for AWS

Posted on Jun 30

"Fail Fast, Fail Free: The Design Principle My Multi-Agent Game Was Missing"

#ai #aws #agents #systemdesign

This is an intro to "Multi-Agent Systems in Production: What They Don't Tell You" — a four-part series based on a game I built for my conference talks at AI Engineer Week, Conf42 LLM, AgentCon Bengaluru, and R/pharma GenAI. This introductory post defines the unifying principle behind everything that follows.

The Most Expensive Bug I Ever Shipped

The bug wasn't in my code. The logic was correct. The prompts were good. The model was state-of-the-art.

The bug was where my system failed.

I built a multi-agent interactive game called "Horcrux Hunt" where two AI agents (Harry and Voldemort) battle live in front of an audience. Harry (Claude on Amazon Bedrock, Strands SDK) hunts Horcruxes hidden across 15 locations. Voldemort (heuristic-first adversary with LLM fallback) relocates them, plants decoys, and corrupts Harry's beliefs. The audience watches on a Streamlit dashboard as the hunt unfolds in real time.

And then we ran it. One weekend event. $1,847 in AWS bills. 12-second latency per turn. Audience waiting. Harry losing 77% of the time.

When I dissected the failure, I found the same pattern everywhere:

The LLM, during Harry's move, reasoned about 90 possible actions. 86 were invalid. It spent 3 seconds and 2,000 tokens discovering what a 0.2ms constraint check could have told it for free.
The Harry agent retrieved 5,000 tokens of history to make a decision. A 55-token probability score contained the same information. But we loaded the full context first and compressed later — paying before checking.
A tool call with invalid parameters hit the API, got a 400 error, retried twice. Client-side validation would have caught it in <1ms, before any call or game action was wasted.
I added Hermione, Ron, and Dumbledore agents to help Harry. The three agents independently queried the same guidelines, produced conflicting strategies, and Harry's win rate dropped from 61% to 34%. A single priority check before execution would have caught it for free.

Every expensive failure had the same shape: the system knew it would fail, but discovered this too late. After tokens were spent, latency was burned, compute was consumed, and turns were wasted.

I started calling this pattern "failing slow, failing expensive." And its opposite became my design principle:

Fail Fast, Fail Free.

If a decision is going to fail, make it fail before it costs you anything.

That's it. That's the principle.

Fail fast = catch it at the earliest possible checkpoint
Fail free = catch it before the expensive meter starts running

In the Horcrux Hunt game, the "meter" is different depending on context:

In cost terms: an LLM call ($0.008-0.015 per failure) vs $0 for a constraint check resolving an invalid action
In latency terms: a 3-second inference call for an action the game rejects anyway vs a 0.2ms validation
In game terms: Harry wasting a turn on a cooldown location vs knowing instantly it's unavailable
In coordination terms: four agents arguing for 9 seconds vs Harry deciding alone when entropy is low
In reliability terms: a retry loop burning tokens vs a pre-validated clean call

The principle asks one question of every failure in your system: Could this have been caught earlier, cheaper, or both?

Almost always, the answer is yes.

The Anatomy of a Free Failure

What does a "free failure" actually look like? Here's the pattern for the game:

# EXPENSIVE failure (traditional):
# 1. Load full game history and build full context (500ms, 2000 tokens)
# 2. Call LLM for decision (3000ms, $0.008)
# 3. Parse response (50ms) ← FAILURE DETECTED HERE
# 4. Retry from step 1 (another $0.008)
# Total cost of failure: $0.016 + ~7 seconds (two full passes of ~3.5 seconds each)

# FREE failure (fail fast, fail free):
# 1. Validate input ← FAILURE DETECTED HERE (0.2ms, $0)
# 2. (never reaches LLM)
# Total cost of failure: $0 + 0.2ms

The key insight: validation is nearly free. Inference is expensive. Move the checkpoint upstream.
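To make the contrast concrete, here is a minimal sketch of the "free" path. It is an illustration under stated assumptions, not the game's actual code: `call_llm`, `Decision`, and `decide_with_gate` are hypothetical names, and the model call is a stub.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: Optional[str]
    source: str      # "validation" or "llm"
    llm_calls: int   # how many paid calls this decision consumed


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (the expensive step)."""
    return "search:forbidden_forest"


def decide_with_gate(requested_action: str, valid_actions: set) -> Decision:
    # Cheap checkpoint first: reject before context is built or tokens are spent.
    if requested_action not in valid_actions:
        return Decision(action=None, source="validation", llm_calls=0)
    # Only requests that pass the free check reach the paid step.
    reply = call_llm(f"Plan around {requested_action}")
    return Decision(action=reply, source="llm", llm_calls=1)
```

The invalid request returns with `llm_calls == 0`, which is the whole point: the failure is recorded, but nothing was billed for it.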

This isn't just "input validation" in the traditional software engineering sense. In multi-agent production systems, there are multiple layers where you can catch failures before they become expensive:

Layer 1: Constraint check     →  "Is this action even valid?"     → 0.2ms, $0
Layer 2: Entropy check        →  "Does this need LLM reasoning?"  → 0.5ms, $0
Layer 3: Schema validation    →  "Are these parameters correct?"  → 0.3ms, $0
Layer 4: Safety gate          →  "Is this output safe?"           → 1ms, $0
Layer 5: Priority resolution  →  "Do agents agree?"               → 2ms, $0
─────────────────────────────────────────────────────────────────────────────
Layer 6: LLM inference        →  "What should I do?"              → 3000ms, $0.008
Layer 7: API call             →  "Execute the action"             → 500ms, variable
Layer 8: Retry                →  "Try again"                      → 3500ms, $0.008+

Layers 1-5 are free. Layers 6-8 are expensive. Every failure you catch in Layers 1-5 is a failure that never reaches Layers 6-8. That's "fail fast, fail free."
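One way to wire the free layers together is an ordered list of checks that stops at the first failure. The sketch below shows only two of the five layers, and the request shape and function names are assumptions made for illustration.

```python
from typing import Callable, List, Optional

# Each check returns an error message, or None when the request passes.
Check = Callable[[dict], Optional[str]]


def constraint_check(req: dict) -> Optional[str]:
    return None if req["action"] in req["valid_actions"] else "action not valid"


def schema_check(req: dict) -> Optional[str]:
    return None if isinstance(req.get("location"), str) else "location must be a string"


def run_checkpoints(req: dict, checks: List[Check]) -> Optional[str]:
    """Run cheap checks in order and stop at the first failure."""
    for check in checks:
        error = check(req)
        if error is not None:
            return f"{check.__name__}: {error}"
    return None  # survived every free layer, so inference is now justified
```

Ordering matters here: putting the cheapest and most frequently failing check first means most bad requests exit after a single comparison.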

Why This Matters Specifically for Multi-Agent Systems

In a single-agent system, a failure costs you one LLM call. Annoying but survivable.

In a multi-agent system, failures compound:

1 agent  failing = 1 retry × 1 inference cost
3 agents failing = retries × context replay × coordination overhead × cascading delays

When Harry produces invalid output, Voldemort receives it, reasons about it (paying tokens), and produces its own output based on garbage. The Executor Agent receives THAT... and by the time you detect the failure, you've paid for three inference calls, contaminated shared state, and need to rewind everything.

In multi-agent systems, a failure that isn't caught early becomes a failure that multiplies. This is why "fail fast, fail free" isn't just a nice optimization. It's architecturally critical.

The cost of late detection in multi-agent systems:

Where failure is caught	Cost in single-agent	Cost in 3-agent system
Before LLM call (Layer 1-5)	$0	$0
After 1 LLM call (Layer 6)	$0.008	$0.008
After cascading to other agents	$0.008	$0.024 + state rollback
After reaching the user	$0.008	Incalculable

The multiplication factor is why "fail fast, fail free" becomes an architectural principle for my multi-agent game and other production AI systems, not just a coding best practice.
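The inference column of that table follows from simple multiplication. Here is a small sketch of the arithmetic, assuming one inference call per agent the bad output reaches and the $0.008 per-call figure used in this post. Rollback and retry costs are deliberately left out, so treat the result as a lower bound.

```python
COST_PER_CALL = 0.008  # dollars per inference call, the figure used in this post


def late_detection_cost(agents_reached: int, cost_per_call: float = COST_PER_CALL) -> float:
    """Inference dollars already spent when a failure is finally detected.

    Simplifying assumption: each agent the bad output reaches makes exactly
    one inference call. State rollback and retries are not modelled.
    """
    return round(agents_reached * cost_per_call, 3)


# 0 agents reached = caught in Layers 1-5; 3 = cascaded through a 3-agent system.
for agents_reached in (0, 1, 3):
    print(agents_reached, late_detection_cost(agents_reached))
```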

The Four Faces of Fail Fast, Fail Free

This principle shows up differently depending on which failure mode you're facing. Here's a preview of how it manifests across the four parts of this series:

🔥 Cost: Prune Before Reasoning (Part 1)

The LLM doesn't need to reason about invalid options.

# Fail fast: constraint solver runs BEFORE LLM
valid_actions = constraint_solver(game_state)  # 0.2ms, $0
# 90 options → 4 valid actions
# The LLM never sees the 86 invalid ones
# 86 failures caught for free

If 86 of Harry's 90 possible actions are invalid (exhausted locations, spent powers), letting the LLM discover this wastes about 95% of its reasoning budget. A constraint solver makes those 86 failures free: they never reach the meter.
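A toy version of such a solver fits in a few lines. The action model below is an assumption chosen so the numbers line up with the post (15 locations times 6 hypothetical powers gives 90 candidate actions); the real game's rules are not shown here.

```python
from itertools import product

LOCATIONS = [f"location_{i}" for i in range(15)]  # 15 locations, as in the game
POWERS = [f"power_{i}" for i in range(6)]         # assumed: 6 powers, so 15 x 6 = 90 actions


def constraint_solver(exhausted: set, spent: set) -> list:
    """Return only the (power, location) pairs that the rules still allow."""
    return [
        (power, location)
        for power, location in product(POWERS, LOCATIONS)
        if power not in spent and location not in exhausted
    ]


# Toy state: 13 locations exhausted, 4 powers spent.
exhausted = set(LOCATIONS[:13])
spent = set(POWERS[:4])
valid_actions = constraint_solver(exhausted, spent)
print(len(POWERS) * len(LOCATIONS), "->", len(valid_actions))  # 90 -> 4
```

Only `valid_actions` would then be placed in the prompt, so the model chooses among 4 options instead of reasoning its way through 90.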

The mantra: Don't let the LLM think about things you already know the answer to.

🧠 Memory: Gate Before Retrieving (Part 2)

Not every decision deserves full context retrieval. In my case, that means the full 5,000 tokens of history for Harry's next move.

# Fail fast: entropy check BEFORE retrieval
entropy = calculate_entropy(belief_map)
if entropy < 1.0:
    # Harry already knows where the Horcrux is
    return heuristic_decision()  # 0 tokens, $0
# Only uncertain decisions justify context retrieval cost

When entropy is low (the agent, using its Bayesian belief map, is already confident about a move), sending 50 turns of context to the LLM is wasted spend. The entropy check is a fail-fast gate: "Do I even need to spend tokens on this decision?" 60% of the time, the answer is no. Those decisions become free.
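For readers who want to try the gate, here is one plausible way to compute it. This sketch assumes the belief map is a dictionary of location to probability and that entropy means Shannon entropy in bits; the 1.0 threshold is taken from the snippet above, and `needs_llm` is a hypothetical helper name.

```python
import math


def calculate_entropy(belief_map: dict) -> float:
    """Shannon entropy in bits of a probability distribution over locations."""
    return -sum(p * math.log2(p) for p in belief_map.values() if p > 0)


def needs_llm(belief_map: dict, threshold: float = 1.0) -> bool:
    # Low entropy means the belief is concentrated on very few locations,
    # so a heuristic can act without retrieving history.
    return calculate_entropy(belief_map) >= threshold


confident = {"hogwarts": 0.9, "azkaban": 0.05, "gringotts": 0.05}
uniform = {f"location_{i}": 1 / 15 for i in range(15)}

print(needs_llm(confident))  # False: about 0.57 bits, below the threshold
print(needs_llm(uniform))    # True: log2(15) is about 3.9 bits
```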

The mantra: Check whether you need to think before you start thinking.

🔌 Integration: Validate Before Calling (Part 3)

Client-side schema validation catches bad parameters for free.

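# Fail fast: JSON Schema validation BEFORE API call
import jsonschema

# jsonschema.validate() returns None and raises on failure, so collect errors explicitly
errors = list(jsonschema.Draft202012Validator(tool_schema).iter_errors(params))  # <1ms, $0
if errors:
    return fix_params(errors)  # self-correct without any call
# Only valid calls reach the API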

Here is the same idea applied to a game action:

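# Fail fast: schema validation BEFORE game action executes
# (assumes validate_tool_call returns a list of error messages, empty when valid)
errors = validate_tool_call("search_location", {"location": "hogwarts"})
if errors:
    return ToolError(f"Invalid parameters: {errors}")  # <1ms, $0
turns_left = game_state.cooldown["hogwarts"]
if turns_left > 0:
    return ToolError(f"Hogwarts on cooldown for {turns_left} turns")  # <1ms, $0
# Only valid, available actions consume game budget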

When Harry tries to search a location on cooldown (a game mechanic that restricts immediate retaliation or repeated actions), catching it at validation (free, <1ms) is far better than catching it after an LLM inference, a game execution, a failure, and a retry. This is where MCP helps. MCP's killer feature isn't the protocol itself; it's that schema contracts between server and client enable free validation. Every parameter error caught in <1ms is a retry that never happens. At 2.3 retries per request (our pre-MCP baseline), this is massive: a 91% reduction in retries, purely by moving the failure checkpoint upstream.
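If you want to experiment without any dependencies, the sketch below hand-rolls a tiny parameter check plus the cooldown gate. It is a stand-in, not JSON Schema and not MCP: the contract format (`parameter name -> Python type`) and the function names are assumptions, and a production system would more likely lean on a real schema validator.

```python
# Hypothetical tool contract: parameter name -> required Python type.
SEARCH_LOCATION_SCHEMA = {"location": str}


def validate_params(params: dict, schema: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for name, expected_type in schema.items():
        if name not in params:
            errors.append(f"missing parameter: {name}")
        elif not isinstance(params[name], expected_type):
            errors.append(f"{name} must be {expected_type.__name__}")
    for name in params:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    return errors


def check_search(params: dict, cooldown: dict) -> list:
    """Shape check first, then the game-state check, all before any call."""
    errors = validate_params(params, SEARCH_LOCATION_SCHEMA)
    if not errors:
        location = params["location"]
        turns_left = cooldown.get(location, 0)
        if turns_left > 0:
            errors.append(f"{location} on cooldown for {turns_left} turns")
    return errors
```

Returning a list of messages instead of raising lets the agent repair its own parameters in one pass, without a round trip to the server.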

The mantra: The cheapest API call is the one you never make.

🏥 Coordination: Veto Before Executing (Part 4)

In regulated systems, unsafe responses must fail at review, not at the execution step. For example, in Horcrux Hunt, when Hermione and Dumbledore disagree, we need to resolve it before Harry acts.

# Fail fast: priority resolution BEFORE team executes
if hermione.recommends("attack_azkaban") and dumbledore.warns("trap_detected"):
    # Priority: Dumbledore's safety assessment OVERRIDES Hermione's analysis
    return harry_defend()  # resolved in <2ms, no cascading confusion
# Only aligned, conflict-free strategies reach execution

When the safety analysis vetoes an unsafe action, that "failure" is free and takes under 2ms. The alternative (executing an unsafe action, then burning tokens and multiple retries to recover) is infinitely expensive. So "fail fast, fail free" becomes "validate early, harm never."
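A fixed priority table is one simple way to implement this kind of veto. The sketch below is an assumption about how it could look, not the game's mediator: the ranking, the `Recommendation` type, and the `defend` fallback are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumed priority order: lower number wins. Safety outranks analysis.
PRIORITY = {"dumbledore": 0, "hermione": 1, "ron": 2}


@dataclass
class Recommendation:
    agent: str
    action: str
    is_veto: bool = False  # True when the agent is blocking rather than proposing


def resolve(recommendations: List[Recommendation], fallback: str = "defend") -> str:
    """Pick exactly one action before anything executes."""
    if not recommendations:
        return fallback
    top = min(recommendations, key=lambda r: PRIORITY[r.agent])
    # A veto from the highest-priority agent replaces the plan with the safe fallback.
    return fallback if top.is_veto else top.action
```

Because the table is static, resolution is a lookup and a comparison. No agent has to be asked a second time, and no model is called to break the tie.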

The mantra: The safest failure is the one that never reaches the executor.

The Optimization Ladder (Reframed)

Here, I'll introduce the "Optimization Ladder" — a framework for pushing decisions down from expensive layers to cheap ones.

Reframed through "Fail Fast, Fail Free," it becomes a failure checkpoint ladder:

CHEAPEST (try first):
├── Rules & Constraints     → Can I rule this out for free?
├── Heuristics              → Is the answer obvious?
├── Math & Statistics       → Can I compute instead of infer?
├── Compressed Inference    → Can I think with less context?
MOST EXPENSIVE (last resort):
└── Full LLM Reasoning      → Only genuinely uncertain decisions

Each layer is a checkpoint. Each checkpoint catches failures before they cascade to the layer below. The system only pays for inference on decisions that survive every free checkpoint above it, which turns out to be about 20-40% of turns.

The other 60-80%? Free. In my game, Harry acts on constraints, entropy gates, heuristics, and math. All at zero token cost. And counterintuitively, his win rate improved because less noise = better decisions.
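In code, the ladder can be a list of functions tried from cheapest to most expensive, where each rung either decides or passes the turn along. This sketch shows three of the five rungs; the state shape, the 0.8 confidence cutoff, and the stubbed LLM rung are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

Rung = Callable[[dict], Optional[str]]  # returns a decision, or None to pass it on


def rules(state: dict) -> Optional[str]:
    # Only one legal action left: no reasoning required.
    actions = state["valid_actions"]
    return actions[0] if len(actions) == 1 else None


def heuristics(state: dict) -> Optional[str]:
    # Belief is concentrated on one location: just search it.
    best, prob = max(state["belief"].items(), key=lambda kv: kv[1])
    return f"search:{best}" if prob >= 0.8 else None


def full_llm(state: dict) -> Optional[str]:
    return "llm_decision"  # hypothetical stand-in for the paid call


LADDER: List[Tuple[str, Rung]] = [("rules", rules), ("heuristics", heuristics), ("llm", full_llm)]


def decide_turn(state: dict) -> Tuple[str, str]:
    """Walk the ladder from cheapest to most expensive; return (rung, decision)."""
    for name, rung in LADDER:
        decision = rung(state)
        if decision is not None:
            return name, decision
    raise RuntimeError("no rung produced a decision")
```

Returning the rung name alongside the decision makes it easy to measure what share of turns each layer handles, which is how a figure like 60-80% would be observed in practice.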

How to Apply This Tomorrow

You don't need to redesign your system. Start with one question:

"Where in my pipeline do I first discover that something is wrong?"

Then ask: "Could I have discovered that one step earlier?"

Repeat until the answer is "no" or "the failure checkpoint is already free."

Practical starting points:

Add input validation before every LLM call. What percentage of your prompts contain information that makes the answer predetermined? What percentage of your agent's reasoning leads to invalid actions? That's your "free failure" opportunity.
Add an entropy/confidence check before retrieval. How often does your agent retrieve context it doesn't need? That's wasted tokens.
Add schema validation before every tool call. What's your retry rate? Each retry = full token cost. Multiply that by your average token cost. That's what free validation saves you.
Add a safety/priority check before every multi-agent output. How often do your agents disagree? Each disagreement caught at orchestration is a contradiction that never reaches the user.
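Before adding any of these checks, it helps to measure where failures are discovered today. The sketch below tallies a failure log by detection stage. The log contents and stage names are made up for illustration, so swap in whatever your own pipeline records.

```python
from collections import Counter

# Hypothetical failure log: the stage at which each failure was first detected.
failure_log = ["schema", "llm_parse", "llm_parse", "api_400", "constraint", "api_400", "llm_parse"]

# Stages that run before any inference call (assumed names).
FREE_STAGES = {"constraint", "entropy", "schema", "safety", "priority"}


def failure_report(log: list) -> dict:
    """Count failures per stage and the share detected after the meter started."""
    counts = Counter(log)
    late = sum(n for stage, n in counts.items() if stage not in FREE_STAGES)
    return {
        "by_stage": dict(counts),
        "late_failures": late,
        "late_share": late / len(log) if log else 0.0,
    }


report = failure_report(failure_log)
print(report["late_failures"], "of", len(failure_log), "failures were caught after the meter started")
```

The `late_share` number is the one to drive down: every failure that moves from a late stage to a free stage is a call you stop paying for.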

The Series Roadmap

This blog defines the principle. The next four show it in action — all through the lens of building, breaking, and fixing Horcrux Hunt:

Part	Problem	"Fail Fast, Fail Free" Manifestation
Part 1: Cost	$1,847 bill for a weekend game, 12s latency, 23% win rate.	Prune invalid actions BEFORE inference
Part 2: Memory	77% failure rate, perfect reasoning	Gate retrieval by entropy BEFORE loading context
Part 3: Integration	Wrong airport, wrong date	Validate parameters BEFORE making API calls
Part 4: Coordination	Added 3 agents. They Fight. Win rate DROPPED to 34%.	Safety veto BEFORE delivering output

Each part tells a story, shows the failure, explains the fix, and proves the results. But now you know the common thread: every fix is a version of the same principle applied at a different layer.

One More Thing

There's a beautiful symmetry here. "Fail fast, fail free" has existed in software engineering for decades — circuit breakers, input validation, type systems, contract testing. We know this principle.

But somewhere in the excitement of LLMs, we forgot it. We started building systems where the first line of defense is a 200-billion-parameter model. We made inference the validator instead of the validated. We let Harry reason about every possibility instead of telling him which possibilities were already impossible.

Multi-agent systems make this mistake catastrophically expensive because failures compound across agents. But the fix is the same fix we've always known:

Don't let expensive things discover what cheap things already know.

In my Horcrux Hunt game terms:

Don't let Harry reason about locations on cooldown (constraints know this)
Don't let Harry retrieve history when he's already confident (entropy knows this)
Don't let Harry attempt actions with invalid parameters (validation knows this)
Don't let the team argue when priority rules are clear (the mediator knows this)

Check before you call. Validate before you execute. Prune before you reason. Gate before you retrieve. Veto before you deliver.

Fail fast. Fail free.

🚀 What's Next

Harry spent $1,847 learning this lesson in one weekend. You can learn it for free...

→ Part 1: The $1,847 Weekend where the game goes live, the bill arrives, and I discover that 86 of 90 actions Harry reasoned about were already impossible (releasing soon).

If you've ever watched your agent burn tokens on decisions a Python function could have handled, this one's for you.

💬 I'm curious — what's your agent's retry rate right now?

Drop it in the comments. If it's above 5%, you're probably failing slow and failing expensive somewhere in your pipeline. I'll reply with which Part (1-4) has your fix.

🔖 Bookmark this series if you're building agents in production. Each post drops one principle that saved me $576K/year in inference costs. Or just watch your AWS bill and you'll know when you need them. 😏

I am a Gen AI Developer Advocate at AWS. I adapted the classic 'Fail safe' principle into what I call 'Fail Fast, Fail Free' after spending too much money on multi-agent systems that discovered their failures too late. I am now on a mission to make every failure in all my systems free or at least cheaper than my rent.

Top comments (4)

Jatin Mehrotra • Jun 30

Loved the 4 faces of Fail Fast, Fail Free.

Anupam • Jun 30

Tokens are the new currency. Would love to see how you cover the cost optimization for tokens!

Anannya Roy Chowdhury • Jun 30

@anupam_3b7fdd1e4f1cc60e5b , this releases next week. Stay tuned!

ankita singh • Jun 30 • Edited

Nice one! I am facing some latency issues in my agent app too. My MCPs are behind a proxy and my agent needs to reconnect every time for every call, increasing call TAT. Any solution?