Leena Malhotra

I Removed Our AI Guardrails, and the Failures Taught Us More Than Six Months of Caution

For six months, I ran every AI-generated code suggestion through a review process that would make a nuclear power plant jealous.

Human approval before any AI output reached production. Manual verification of every function. Line-by-line review of generated tests. Three-person sign-off on architecture suggestions. We treated AI like a junior developer who couldn't be trusted with scissors.

The results were predictable: our velocity was terrible, our team was frustrated, and we weren't actually getting safer code—just slower code with a false sense of security.

Then I did something that terrified our engineering leadership: I removed almost all the guardrails. Not recklessly—but deliberately. And what happened next taught me more about working with AI than the previous six months of cautious experimentation.

Turns out, the guardrails weren't protecting us. They were preventing us from learning how AI actually fails.

The Safety Theater Problem

Most teams implementing AI tools follow the same pattern: build elaborate review processes to catch AI mistakes before they cause damage. It feels responsible. It feels safe. It's actually worse than having no AI assistance at all.

Guardrails create the illusion of safety without the reality. When every AI output goes through extensive human review, you never learn where the AI is actually reliable and where it genuinely needs oversight. You're reviewing everything equally, which means you're not focusing review effort where it actually matters.

Over-cautious processes prevent skill development. Your team never develops intuition for when to trust AI outputs and when to scrutinize them. They treat all AI suggestions as equally suspect, which means they can't work efficiently with AI even after months of usage.

False security breeds real vulnerabilities. When you catch 95% of AI mistakes through elaborate review processes, you become confident. But that remaining 5%—the mistakes that slip through because everyone assumes someone else is checking carefully—cause the actual production issues.

The review overhead kills adoption. When using AI creates more work than not using it, teams quietly stop using it. The guardrails don't make AI safe—they make it irrelevant.

What Removing Guardrails Actually Meant

I didn't just delete all oversight and hope for the best. Removing guardrails meant replacing safety theater with targeted risk management.

Stop reviewing everything, start reviewing the right things. Instead of checking every line of AI-generated code, we identified high-risk areas: security-sensitive logic, database migrations, API contract changes, financial calculations. AI output in these domains got intense scrutiny. Boilerplate, tests, documentation, refactoring? Minimal review.

Trust-but-verify with production feedback loops. We deployed AI-generated code with the same quality checks as human-written code: automated testing, gradual rollouts, monitoring, and quick rollback capability. The production environment became our verification mechanism, not pre-deployment review.
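
In practice that loop is small. Here's a minimal sketch, assuming your deploy tooling and metrics backend are wrapped in a few callables; the names and thresholds below are hypothetical stand-ins, not any particular platform's API.

```python
# A minimal "trust but verify" canary loop. deploy/error_rate/rollback are
# hypothetical callables wrapping whatever deploy and metrics tooling you run.
import time
from typing import Callable

ERROR_RATE_THRESHOLD = 0.02              # illustrative: roll back above 2% errors
CANARY_TRAFFIC_STEPS = [5, 25, 50, 100]  # percent of traffic at each step


def gradual_rollout(
    version: str,
    deploy: Callable[[str, int], None],   # deploy(version, traffic_percent)
    error_rate: Callable[[str], float],   # recent error rate for this version
    rollback: Callable[[str], None],
    soak_seconds: int = 300,
) -> bool:
    """Ship in steps and let production traffic do the verification."""
    for percent in CANARY_TRAFFIC_STEPS:
        deploy(version, percent)
        time.sleep(soak_seconds)          # let real traffic hit the new code
        if error_rate(version) > ERROR_RATE_THRESHOLD:
            rollback(version)             # failure is fast and cheap
            return False
    return True
```

The important property is that verification happens against real traffic, and a failed rollout is a boolean you act on, not an incident you scramble over.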

Make failure fast and cheap. Instead of trying to prevent all mistakes, we optimized for rapid detection and recovery. Better monitoring, better rollback processes, better error tracking. When AI-generated code failed (and it did), we learned from it quickly without major impact.

Build team judgment through experience. Let people make small mistakes with AI and learn from them. A junior developer who ships AI-generated code with a minor bug to staging learns more about AI's limitations than a senior developer who reviews every line and never lets anything questionable through.

The Failures That Taught Us

Within the first two weeks of reduced guardrails, we had three incidents that would have been caught by our old review process. Each taught us something our careful reviews never could.

Incident 1: The Overly Confident Refactor. AI suggested refactoring a complex function into what looked like cleaner code. We shipped it. It broke an edge case that only appeared in production under specific load conditions. The monitoring caught it within minutes. Rollback took thirty seconds.

What we learned: AI doesn't understand implicit behavioral contracts that aren't visible in the code. Our tests hadn't captured them either. The incident led us to improve both our test coverage and our understanding of when to deeply scrutinize refactoring suggestions.

Incident 2: The Subtle Logic Error. AI-generated code for calculating shipping estimates had a logic error that rounded prices incorrectly for certain currency combinations. A customer reported it before our monitoring flagged it.

What we learned: AI is bad at edge cases involving money, time zones, and internationalization. We now route these domains through specialized validation tools and always include edge case tests. The AI Fact-Checker became standard for verifying calculations in these sensitive areas.
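
To make that concrete, here's the shape of the edge-case tests we now require around money handling. The rounding rules and function below are a simplified illustration, not our actual shipping estimator.

```python
# Illustrative currency-rounding edge-case tests. The minor-unit table and
# round_price() are simplified examples of the kind of checks we require.
from decimal import Decimal, ROUND_HALF_UP

# Minor units per currency: JPY has none, most others have two.
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0}


def round_price(amount: Decimal, currency: str) -> Decimal:
    exponent = MINOR_UNIT_EXPONENT[currency]
    quantum = Decimal(1).scaleb(-exponent)  # 0.01 for USD, 1 for JPY
    return amount.quantize(quantum, rounding=ROUND_HALF_UP)


def test_usd_rounds_to_cents():
    assert round_price(Decimal("10.005"), "USD") == Decimal("10.01")


def test_jpy_has_no_minor_units():
    assert round_price(Decimal("1999.4"), "JPY") == Decimal("1999")


def test_decimal_avoids_float_drift():
    # Using Decimal end-to-end avoids the binary-float drift that tends to
    # produce "incorrect for certain currency combinations" bugs.
    assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```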

Incident 3: The Documentation Drift. AI generated documentation that was technically accurate when written but didn't get updated when the code changed. The docs misled users for three days before someone noticed.

What we learned: AI-generated documentation needs the same maintenance discipline as human-written docs—maybe more, because it's easier to generate than to maintain. We started using the Content Scheduler to trigger regular documentation reviews, treating it like any other technical debt.

The Surprising Upsides

Removing guardrails didn't just teach us about failures. It revealed capabilities we'd been suppressing with over-cautious processes.

Velocity increased dramatically, but so did quality. Without review overhead, developers could iterate faster. Faster iteration meant more experiments, which meant finding better solutions. The code quality improved because we were testing more approaches, not because we were reviewing more carefully.

Team confidence in AI grew rapidly. When you can see what AI actually gets wrong in real contexts, you develop much better intuition than when everything goes through sanitizing review processes. Our developers became genuinely skilled at working with AI instead of just cautiously coexisting with it.

We found the valuable use cases. Under heavy guardrails, we only used AI for low-risk tasks because high-risk tasks required too much review overhead. After removing most restrictions, we discovered AI was actually quite good at some high-risk tasks—with the right tooling and feedback loops.

Cost-benefit analysis became possible. With guardrails everywhere, we couldn't tell which review processes were valuable and which were safety theater. After removing most of them, the truly valuable review points became obvious through their absence.

The New Risk Management Framework

The guardrails we removed weren't replaced with nothing. They were replaced with a more sophisticated risk management approach.

Risk-stratified review processes. We classify code changes into risk tiers. Tier 1 (security, payments, data integrity) gets intense review regardless of origin. Tier 3 (UI polish, documentation, test coverage improvements) ships with minimal review. Tier 2 (business logic, API endpoints) gets targeted review focused on edge cases and error handling.
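
As a sketch, the tiering is little more than a path-to-tier mapping that CI applies to every change. The patterns below are illustrative, not our real rules.

```python
# A sketch of mapping a change to a review tier; the highest-risk file in a
# diff decides the tier for the whole change. Patterns are illustrative.
from enum import Enum
from fnmatch import fnmatch


class ReviewTier(Enum):
    TIER_1 = "intense line-by-line review"           # security, payments, data integrity
    TIER_2 = "targeted review: edge cases, errors"   # business logic, API endpoints
    TIER_3 = "minimal review"                        # UI polish, docs, test coverage


TIER_1_PATTERNS = ["*/auth/*", "*/payments/*", "*/migrations/*", "*/crypto/*"]
TIER_2_PATTERNS = ["*/api/*", "*/services/*"]


def classify_change(changed_paths: list[str]) -> ReviewTier:
    if any(fnmatch(p, pat) for p in changed_paths for pat in TIER_1_PATTERNS):
        return ReviewTier.TIER_1
    if any(fnmatch(p, pat) for p in changed_paths for pat in TIER_2_PATTERNS):
        return ReviewTier.TIER_2
    return ReviewTier.TIER_3


print(classify_change(["src/payments/refunds.py", "docs/README.md"]))  # ReviewTier.TIER_1
```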

AI-specific quality gates. We use AI tools to review AI output, checking for common AI failure modes. The Sentiment Analyzer catches tone issues in generated communications. The Plagiarism Detector flags when AI might be reproducing training data too closely. The AI Debate Bot challenges architectural decisions to surface unexamined assumptions.

Production validation over pre-production perfection. Our monitoring got significantly better. We can detect unusual error patterns, performance degradation, and user confusion within minutes of deployment. That makes shipping experimental code and watching it closely safer than relying on exhaustive review before shipping.

Rapid rollback as a design principle. Every change ships with a rollback plan. Feature flags control risky functionality. Database migrations are reversible. This makes the cost of being wrong much lower, which makes being cautious less necessary.
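
The flag pattern is simple enough to sketch: risky code paths sit behind a flag, so rolling back means flipping a value instead of redeploying. The in-memory flag store and estimators below are placeholders for whatever flag service you actually run.

```python
# Minimal feature-flag sketch: rollback becomes a config change, not a deploy.
# The dict stands in for a real flag service or config system.
FLAGS = {
    "ai_generated_shipping_estimator": False,  # flip to True to enable, back to False to roll back
}


def estimate_shipping(order: dict) -> float:
    if FLAGS["ai_generated_shipping_estimator"]:
        return new_ai_generated_estimate(order)  # the risky new path
    return legacy_estimate(order)                # the known-good path stays live


def legacy_estimate(order: dict) -> float:
    return 5.0 + 0.5 * order.get("weight_kg", 0)


def new_ai_generated_estimate(order: dict) -> float:
    # Placeholder for the AI-generated implementation under test.
    return 4.5 + 0.55 * order.get("weight_kg", 0)
```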

Continuous learning from failures. Every AI-related incident goes through a lightweight postmortem focused on "what would have caught this?" Often the answer isn't "more review" but "better tests" or "better monitoring" or "better error messages."

The Multi-Model Safety Pattern

One of the most effective safety mechanisms we discovered wasn't about restricting AI—it was about using multiple AIs to check each other.

Critical decisions get multi-model validation. When AI suggests a significant architectural change, we run it through multiple models using platforms like Crompt AI. If GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro all recommend similar approaches, confidence is high. If they diverge significantly, that's a signal for deeper human review.

Model diversity catches model-specific blind spots. Different AI models fail in different ways. GPT might confidently suggest incorrect approaches. Claude might be overly conservative. Gemini might miss context. Running suggestions through multiple models surfaces these blind spots without requiring human review of every output.

Comparison reveals uncertainty. When different models give wildly different answers to the same question, that's valuable information—it means the question is ambiguous or the problem space is genuinely uncertain. This signals when human judgment is actually needed.

The multi-model approach is efficient because it's automated. Tools like Crompt (available on web, iOS, and Android) make comparing model outputs frictionless, which means you can use it as a safety check without killing velocity.
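
Conceptually the cross-check is a small piece of glue code. The sketch below assumes a generic ask(model, prompt) wrapper around whatever provider or aggregator you use, and scores agreement with a deliberately crude token-overlap measure; anything smarter, like embedding similarity or a judge model, slots into the same place.

```python
# Multi-model cross-check sketch. `ask` is a hypothetical wrapper around your
# model provider or aggregator; the Jaccard overlap is a crude divergence
# signal, just enough to decide whether a human needs to look.
from itertools import combinations
from typing import Callable

MODELS = ["gpt-5", "claude-opus-4.1", "gemini-2.5-pro"]
AGREEMENT_THRESHOLD = 0.35  # illustrative: low overlap means escalate to a human


def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def needs_human_review(prompt: str, ask: Callable[[str, str], str]) -> bool:
    answers = [ask(model, prompt) for model in MODELS]
    scores = [token_overlap(x, y) for x, y in combinations(answers, 2)]
    return min(scores) < AGREEMENT_THRESHOLD  # wide divergence signals genuine uncertainty
```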

The Cultural Shift Required

The hardest part of removing guardrails wasn't technical—it was cultural. Teams had to fundamentally change how they thought about AI assistance and risk management.

Embrace intelligent failure over perfect caution. Shipping code that might have minor issues and learning from them beats never shipping anything questionable. The cultural norm shifted from "we must catch all mistakes before production" to "we must detect and fix mistakes quickly in production."

Trust developers to evaluate AI output. Instead of requiring multiple sign-offs, we trusted developers to judge when AI output needed deeper review. This trust was earned through training on AI failure modes and supported by good tooling, but ultimately it required believing people would make good decisions.

Measure impact, not process compliance. We stopped tracking "how many AI suggestions were reviewed" and started tracking "how many AI-related incidents occurred" and "how much faster did we ship." The metrics shifted from measuring safety theater to measuring actual outcomes.

Make failure a learning opportunity, not a punishment. When someone shipped AI-generated code that caused a minor incident, the response was "what did we learn?" not "why didn't you review it more carefully?" This psychological safety was essential for people to actually take advantage of reduced guardrails.

What Still Requires Guardrails

Removing most guardrails doesn't mean removing all of them. Some domains genuinely require intense scrutiny.

Security-sensitive code still gets paranoid review. Authentication, authorization, cryptography, input validation—these areas get reviewed line-by-line regardless of whether a human or AI wrote them. The risk profile is too high for anything less.

Customer-facing communications need human judgment. AI-generated emails, error messages, and user-facing text still get human review. The tone and empathy requirements are too nuanced for fully automated approval. Using tools like the AI Caption Generator helps draft these, but humans verify before sending.

Legal and compliance contexts demand verification. Anything touching privacy, data handling, or regulatory compliance gets reviewed by someone who understands the legal implications. AI can draft it, but humans must verify it.

Financial calculations require multi-level validation. Anything involving money, pricing, billing, or financial projections gets verified through multiple mechanisms: automated tests, manual review, and often a pass through the Excel Analyzer for complex spreadsheet logic.

The key insight is that these guardrails should be domain-specific, not tool-specific. Review the high-risk domains carefully regardless of whether AI generated the code, and review low-risk domains minimally regardless of who wrote it.

The Productivity Multiplier

Once we removed the review overhead and built trust in the workflow, the productivity gains became dramatic.

Tasks that took days take hours. Generating comprehensive test suites, refactoring legacy code, writing documentation, creating API clients—all these tasks compress from days to hours when you're not spending most of your time reviewing AI output.

Innovation cycles accelerate. When trying new approaches is cheap, you try more approaches. We ran more experiments in three months after removing guardrails than in the previous year of careful AI usage. Most experiments failed, but the successful ones more than justified the cost.

Team focus improves. Developers spend cognitive energy on genuinely hard problems instead of reviewing routine AI suggestions. The mental space freed up by trusting AI for boilerplate and routine tasks creates room for deeper thinking on complex challenges.

Quality improves through volume. Counterintuitively, shipping more code (including some with minor issues) led to better overall quality. More experimentation meant finding better approaches. Faster iteration meant incorporating feedback sooner. The net effect was higher quality delivery, not lower.

The Simple Truth

Guardrails feel safe but they're often just expensive security theater. They protect you from learning how AI actually fails, which means you never develop good judgment about when to trust it.

The path to working effectively with AI isn't wrapping it in bureaucratic review processes. It's understanding its failure modes through experience, building good feedback loops, and focusing oversight where it actually matters.

Trust, but verify—with production feedback loops, not pre-deployment review theater.

The teams that figure this out first will move dramatically faster than teams still treating every AI output like a potential catastrophe. Not because they're reckless, but because they've learned to distinguish real risks from imagined ones.

- Leena :)
