A tiny benchmark that exposes silent failure modes in AI and ML pipelines
Most AI blog posts show best practices: clean architectures, neat abstractions, and impressive demos. I decided to do the opposite.
I intentionally built a bad AI system — one that works, produces outputs, and even looks reasonable at first glance — and then compared it to a boring, well-designed version of the same pipeline.
The goal was not performance. The goal was to understand how systems fail silently when design principles are ignored.
The task: same problem, two implementations
Both systems solve the exact same problem:
Input text → extract keywords → compute a score → recommend an action
The action space is deliberately small:
WAIT_AND_SEE, BUY_MORE_STOCK, PANIC_REORDER
Keeping the task simple allows us to focus entirely on system behavior, not model quality.
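To make that shape concrete, here is a minimal sketch of what the well-behaved variant can look like. The function names, the keyword cutoff, and the buy threshold are illustrative choices for this article, not necessarily the repository's exact code; only the scoring rule and the panic threshold of 42 reappear later.

from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    panic_threshold: int = 42   # the threshold discussed later in the article
    buy_threshold: int = 20     # assumed value, purely for this sketch

def extract_keywords(text: str, k: int = 5) -> list[str]:
    # Keep the first k whitespace tokens; real keyword extraction is not the point.
    return text.lower().split()[:k]

def score_keywords(keywords: list[str], text: str) -> int:
    # The toy scoring rule used throughout the article.
    return sum(len(w) % 7 for w in keywords) + len(text) % 13

def recommend_action(score: int, cfg: Config) -> str:
    if score > cfg.panic_threshold:
        return "PANIC_REORDER"
    if score > cfg.buy_threshold:
        return "BUY_MORE_STOCK"
    return "WAIT_AND_SEE"

def run_pipeline(text: str, cfg: Config = Config()) -> tuple[int, str]:
    keywords = extract_keywords(text)
    score = score_keywords(keywords, text)
    return score, recommend_action(score, cfg)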
The benchmark idea
The benchmark is intentionally minimal:
- Take a single, fixed input text
- Run it multiple times through the system
- Observe whether the outputs stay stable
Why this matters:
A system that only works once is not a system — it’s a coincidence.
If the same input produces different outputs, something is fundamentally wrong at the system level.
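A benchmark this small fits in a dozen lines. The sketch below assumes a run_pipeline(text) entry point like the one above and simply records what comes back:

def benchmark(run_pipeline, text: str, runs: int = 5) -> dict:
    # Feed the exact same input N times and collect scores and actions.
    results = [run_pipeline(text) for _ in range(runs)]
    scores = [score for score, _ in results]
    actions = [action for _, action in results]
    return {
        "runs": runs,
        "scores": scores,
        "unique_scores": len(set(scores)),
        "unique_actions": len(set(actions)),
    }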
Benchmark results: BAD vs GOOD
The following results were produced by running the same input five times through both systems.
BAD system output (excerpt)
The BAD system gradually escalates its decisions:
- Run 1 → score 14, action WAIT_AND_SEE
- Run 3 → score 42, action BUY_MORE_STOCK
- Run 5 → score 74, action PANIC_REORDER
Same input. Same keywords. Completely different decisions.
Aggregated benchmark summary
BAD system
- Runs: 5
- Unique scores: 5
- Scores: [14, 28, 42, 58, 74]
- Unique actions: 3
GOOD system
- Runs: 5
- Unique scores: 1
- Scores: [14, 14, 14, 14, 14]
- Unique actions: 1
The GOOD system behaves like a function. The BAD system behaves like a memory leak.
Failure Taxonomy: How the BAD System Breaks
The bad system does not fail in a single obvious way. Instead, it exhibits multiple interacting failure modes that are common in real-world AI and data systems. Naming these failure modes makes them easier to detect—and harder to accidentally ship.
1) Drift
Definition: The system’s output changes over time even when the input stays exactly the same.
Root cause:
- Global score accumulation across runs
- State that grows monotonically without reset
Why this is dangerous:
- Business logic mutates without any explicit change
- Historical execution order influences current decisions
- Monitoring dashboards often miss the problem because values remain “reasonable”
Drift is especially dangerous because it looks like learning—but it isn’t.
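A stripped-down illustration of the pattern (hypothetical names, not the repository's exact code): a module-level accumulator that is never reset makes every call depend on how many calls came before it.

CURRENT_SCORE = 0  # global accumulator, never reset between runs

def drifting_score(keywords):
    global CURRENT_SCORE
    CURRENT_SCORE += sum(len(w) % 7 for w in keywords)
    return CURRENT_SCORE

kws = ["delay", "shortage", "supplier"]
print([drifting_score(kws) for _ in range(3)])  # [7, 14, 21]: grows on identical input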
2) Non-determinism
Definition: Identical inputs produce different outputs.
Root cause:
- Random noise injected into scoring
- Implicit dependency on execution history
Why this is dangerous:
- Bugs cannot be reliably reproduced
- Test failures become flaky and untrustworthy
- A/B experiments lose statistical meaning
If you can’t reproduce a decision, you can’t debug it.
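In isolation the pattern looks harmless (again with hypothetical names): a few points of random noise in the scorer is all it takes to make a test flaky.

import random

def noisy_score(keywords):
    base = sum(len(w) % 7 for w in keywords)
    return base + random.randint(-3, 3)  # "small" noise, large consequences

kws = ["delay", "shortage", "supplier"]
# This assertion passes or fails depending on nothing you control:
assert noisy_score(kws) == noisy_score(kws)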
3) Hidden State
Definition: Functions rely on data that is not visible in their interface or inputs.
Root cause:
- Global variables such as CURRENT_SCORE, LAST_TEXT, and RUN_COUNT
Why this is dangerous:
- Code cannot be understood locally
- Refactoring changes behavior in non-obvious ways
- New contributors unknowingly introduce regressions
Hidden state turns every function call into a guessing game.
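A minimal illustration of why call order suddenly matters, using a hypothetical global that mirrors the ones above:

LAST_TEXT = None  # hidden input: appears in no function signature

def score(text):
    global LAST_TEXT
    bonus = len(LAST_TEXT) % 13 if LAST_TEXT is not None else 0
    LAST_TEXT = text
    return len(text) % 13 + bonus

# Identical arguments, different results, purely because of what ran before:
print(score("supplier delayed"))  # 3  (no previous text)
print(score("supplier delayed"))  # 6  (the previous text leaks into the score)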
4) Silent Corruption
Definition: The system continues to run without errors while its decisions become increasingly wrong.
Root cause:
- No explicit failure signals
- No invariants or sanity checks
Why this is dangerous:
- Incorrect outputs propagate downstream
- Problems surface only through business impact
- Rollbacks become difficult or impossible
Loud failures get fixed. Silent failures get deployed.
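One cheap countermeasure is to state the invariants explicitly, so a corrupted decision fails loudly instead of flowing downstream. A sketch, with an illustrative score bound:

VALID_ACTIONS = {"WAIT_AND_SEE", "BUY_MORE_STOCK", "PANIC_REORDER"}
MAX_PLAUSIBLE_SCORE = 50  # illustrative bound for this toy scorer

def check_decision(score: int, action: str) -> None:
    # Raise the moment a decision leaves the plausible range.
    if not 0 <= score <= MAX_PLAUSIBLE_SCORE:
        raise ValueError(f"score {score} outside plausible range")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {action!r}")

In the benchmark above, a check like this would have rejected runs 4 and 5 of the BAD system (scores 58 and 74) instead of handing them to a downstream consumer.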
Why This Taxonomy Matters
These failure modes rarely appear in isolation. In the BAD system, they reinforce each other:
- Hidden state enables drift
- Drift amplifies non-determinism
- Non-determinism hides silent corruption
Understanding these patterns is more valuable than fixing any single bug—because the same taxonomy applies to much larger and more complex AI systems.
A single metric: Stability Score
To summarize system behavior, I used a single metric:
stability_score = 1 - (unique_scores / runs)
- Values near 1.0 → stable (with 5 runs, a perfectly stable system scores 0.8)
- 0.0 → completely unstable: every run produced a different score
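Computing it from the benchmark output is a two-liner:

def stability_score(scores: list[int]) -> float:
    return 1 - len(set(scores)) / len(scores)

print(stability_score([14, 28, 42, 58, 74]))  # BAD:  0.0
print(stability_score([14, 14, 14, 14, 14]))  # GOOD: 0.8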
Stability results
- BAD system → 0.0
- GOOD system → 0.8
This one number already tells you which system you can trust.
Minimal Fixes: Four Small Patches That Change Everything
This is not a rewrite. These are surgical changes. Each patch removes an entire class of failure modes without introducing new abstractions or frameworks.
Patch 1 — Remove Global State
Before (BAD):
# global mutation + history dependence
GS.CURRENT_SCORE += base
return GS.CURRENT_SCORE
After (GOOD):
def score_keywords(keywords, text):
return sum(len(w) % 7 for w in keywords) + len(text) % 13
What this fixes:
- Eliminates score drift
- Removes hidden history dependence
- Makes the function deterministic and testable
A function that depends on global state is not a function — it’s a memory leak.
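With the global gone, determinism turns into a one-assert unit test (the test is mine, using score_keywords from the patch above):

def test_score_keywords_is_deterministic():
    kws = ["delay", "shortage"]
    text = "supplier delayed again"
    assert score_keywords(kws, text) == score_keywords(kws, text)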
Patch 2 — Push Side-Effects to the Boundaries
Before (BAD):
def extract_keywords(text):
    print("Extracting keywords...")     # console noise buried in core logic
    open("log.txt", "a").write(text)    # hidden file I/O on every call
    tokens = text.split()
    return tokens[:k]                   # k is a module-level global, not an argument
After (GOOD):
def extract_keywords(text):
return tokenize(text)[:k]
# side-effects handled explicitly at the edge
logger.info("Extracting keywords")
What this fixes:
- Core logic becomes reusable
- Logging becomes configurable
- Unit testing becomes trivial
Side-effects inside core logic silently infect everything upstream.
Patch 3 — Make Dependencies Explicit
Before (BAD):
if GS.LAST_TEXT is not None:
base += len(GS.LAST_TEXT) % 13
After (GOOD):
def score_keywords(keywords, text):
base = sum(len(w) % 7 for w in keywords)
return base + (len(text) % 13)
What this fixes:
- No hidden inputs
- Clear data flow
- Safe refactoring
If a dependency isn’t in the function signature, it’s a liability.
Patch 4 — Name the Magic Numbers
Before (BAD):
if score > 42:
action = "PANIC_REORDER"
After (GOOD):
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    panic_threshold: int = 42

if score > cfg.panic_threshold:
    action = "PANIC_REORDER"
What this fixes:
- Decisions become explainable
- Parameters become reviewable
- Behavior changes become intentional
Magic numbers turn engineering decisions into superstition.
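Usage stays trivial, and changing a threshold becomes a one-line, reviewable diff (using the Config and recommend_action sketched at the start of the article):

default = Config()                      # panic_threshold = 42
cautious = Config(panic_threshold=60)   # an explicit, reviewable override

print(recommend_action(50, default))    # PANIC_REORDER   (50 > 42)
print(recommend_action(50, cautious))   # BUY_MORE_STOCK  (50 <= 60, 50 > 20)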
Summary
These four patches:
- Remove hidden state
- Eliminate non-determinism
- Make behavior explainable
- Restore trust in the system
No agents. No frameworks. Just engineering discipline.
Final takeaway
The BAD system works. That’s the problem.
It fails in the most dangerous way possible: plausibly and quietly.
The GOOD system is boring, predictable, and easy to reason about — which is exactly what you want in production.
Working code is not the same as a working system.
Code & Reproducibility
All code used in this article — including the intentionally broken system, the clean implementation, and the benchmark — is available on GitHub:
👉 https://github.com/Ertugrulmutlu/I-Intentionally-Built-a-Bad-Decision-System-So-You-Don-t-Have-To
If you want to reproduce the results, run:
python compare.py
The benchmark will run the same input multiple times through both systems and show, in a few lines of output, why predictability matters more than flashy abstractions.