Originally published at chudi.dev
I shipped broken code three times in one week. Not edge cases--fundamental errors that any test would have caught. The AI said "should work" and I believed it.
Building a quality control system for AI code generation means enforcing mandatory gates before implementation begins--loading relevant skills, validating context budget, and blocking rationalization phrases like "should work" that indicate unverified claims. The result is a two-gate system where tools literally cannot execute until quality checks pass. Claude Code provides the hooks infrastructure that makes gate enforcement possible at the session level.
Why Did I Need Quality Gates for AI?
The problem wasn't the AI's capability. Claude is remarkably good at generating code. The problem was my workflow--or lack of one.
I'd describe what I wanted. Claude would write it. I'd paste it in. Sometimes it worked. Sometimes I'd spend hours debugging issues that existed from the first line. Without me realizing it, I was trusting confidence over evidence.
That specific anxiety of deploying something you haven't tested--the kind where you refresh the page three times hoping the error goes away--became my default state.
Well, it's more like... I was using AI as a code generator when I needed it to be a quality-controlled collaborator.
How Does the Two-Gate System Work?
The system enforces two mandatory checks before any tool can execute. Like buttoning a shirt from the first hole--skip it, and everything else is wrong.
If either gate fails, all implementation tools are blocked. You literally cannot write code until the orchestration layer is ready.
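The blocking can be sketched as a PreToolUse hook. This is a minimal sketch, not the author's actual implementation: the marker-file paths and the `IMPLEMENTATION_TOOLS` set are my own placeholders, though Claude Code hooks do receive JSON on stdin and treat exit code 2 as "block this tool call".

```python
from pathlib import Path

# Hypothetical markers written by the orchestration layer once each gate passes.
IMPLEMENTATION_TOOLS = {"Write", "Edit", "Bash"}
GATE_MARKERS = [Path(".claude/gate0-passed"), Path(".claude/gate1-passed")]


def check_gates(tool_name: str) -> tuple[bool, str]:
    """Allow read-only tools; block implementation tools until both gates pass."""
    if tool_name not in IMPLEMENTATION_TOOLS:
        return True, ""
    missing = [m.name for m in GATE_MARKERS if not m.exists()]
    if missing:
        return False, f"Blocked: gates not passed: {', '.join(missing)}"
    return True, ""


# Wire-up as the hook script's entry point (Claude Code sends hook input as JSON):
#   event = json.load(sys.stdin)
#   ok, reason = check_gates(event.get("tool_name", ""))
#   if not ok:
#       print(reason, file=sys.stderr)
#       sys.exit(2)   # exit code 2 blocks the tool call
```

The key property: the check runs before the tool, not after, so "forgot to verify" stops being a possible state.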
Gate 0: Meta-Orchestration (Priority 0)
This gate loads immediately and handles three things:
- Loading the meta-orchestration skill itself
- Validating the context budget before any work begins
- Registering the blocklist of rationalization phrases
Gate 1: Auto-Skill Activation (Priority 1)
This gate analyzes your query and activates the relevant skills before any implementation tool runs.
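At its simplest, skill activation is trigger-word routing against each skill's Tier-1 metadata. A sketch, with a hypothetical trigger table (the real triggers live in each skill's own metadata block):

```python
# Hypothetical trigger table; in the real system, Tier-1 metadata supplies these.
SKILL_TRIGGERS = {
    "svelte-components": ["component", "rune", "$state", "snippet"],
    "api-routes": ["endpoint", "route", "request", "zod"],
    "debugging": ["error", "stack trace", "reproduce", "bug"],
}


def activate_skills(query: str) -> list[str]:
    """Gate 1: return every skill whose trigger words appear in the query."""
    q = query.lower()
    return [skill for skill, triggers in SKILL_TRIGGERS.items()
            if any(t in q for t in triggers)]
```

The point of making this a gate rather than a suggestion: if zero skills activate on a query that clearly needs one, that mismatch surfaces before any code gets written.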
I love automation. But I spend hours building systems to slow myself down.
What Is Progressive Disclosure and Why Does It Save 60% of Tokens?
Most Claude configurations load everything upfront. Every skill, every rule, every example--thousands of tokens consumed before you've even asked a question. Anthropic's documentation covers prompt and context design patterns that help avoid this upfront token burn.
Progressive disclosure flips this. Load metadata first. Load details on demand.
The 3-Tier System
Tier 1: Metadata (~200 tokens)
- Skill name, triggers, dependencies
- Just enough to route the query
Tier 2: Schema (~400 tokens)
- Input/output types
- Constraints and quality gates
- Tools available
Tier 3: Full Content (~1200 tokens)
- Complete handler logic
- Examples and edge cases
- Only loaded when actively using the skill
The meta-orchestration skill alone: 278 lines at Tier 1, 816 with one reference, 3,302 fully loaded. That's 60% savings on every session that doesn't need the full content.
What Phrases Does the System Block?
The automated verification system flags specific patterns in code comments and commit messages: phrases like "should work" and "probably fine" that signal insufficient testing or unverified assumptions.
These phrases aren't banned because they're wrong. They're banned because they indicate claims without evidence.
The goal isn't to be pedantic--it's to make evidence the path of least resistance. When "build passed" is easier to say than "should work," you'll naturally verify before claiming.
That hollow confidence of claiming something works without checking--the system makes it impossible.
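The check itself is trivially simple, which is the point. A sketch of the scanner; "should work" and "probably fine" come from the article, the rest of the blocklist is my own filler in the same spirit:

```python
# Example blocklist. "should work" and "probably fine" are from the article;
# the other entries are illustrative additions.
BLOCKED_PHRASES = ["should work", "probably fine", "seems to work", "might be ok"]


def find_unverified_claims(text: str) -> list[str]:
    """Return every blocked phrase found in a commit message or code comment."""
    lowered = text.lower()
    return [p for p in BLOCKED_PHRASES if p in lowered]
```

Run it over staged commit messages in a hook and fail the commit on a non-empty result. "Build passed, tests green" sails through; "this should work now" does not.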
How Does AMAO Handle Parallel Execution?
AMAO (Adaptive Multi-Agent Orchestrator) adds sophisticated orchestration on top of the gate system:
DAG Engine
- Directed acyclic graph for task dependencies
- Max 50 tasks with cycle detection
- Parallel grouping for independent operations
- Critical path analysis for optimization
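The cycle detection and parallel grouping above fall out of one standard algorithm: Kahn's topological sort, taken level by level. A sketch, assuming tasks are named strings and dependencies are sets (the 50-task limit is the article's figure):

```python
from collections import defaultdict

MAX_TASKS = 50  # limit from the article


def parallel_groups(deps: dict[str, set[str]]) -> list[list[str]]:
    """Kahn's algorithm, level by level: each returned wave can run in parallel.

    Raises ValueError on a dependency cycle or too many tasks.
    """
    if len(deps) > MAX_TASKS:
        raise ValueError("too many tasks")
    indegree = {task: len(d) for task, d in deps.items()}
    dependents = defaultdict(list)
    for task, d in deps.items():
        for dep in d:
            dependents[dep].append(task)

    groups, ready, done = [], sorted(t for t, n in indegree.items() if n == 0), 0
    while ready:
        groups.append(ready)
        done += len(ready)
        nxt = []
        for finished in ready:
            for task in dependents[finished]:
                indegree[task] -= 1
                if indegree[task] == 0:
                    nxt.append(task)
        ready = sorted(nxt)

    if done != len(deps):
        raise ValueError("cycle detected")  # leftover tasks imply a cycle
    return groups
```

Tasks with no dependency between them land in the same wave, which is exactly what the parallel executor needs.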
Context Governor
- 75% max budget, 60% warning threshold, 20% reserve
- Predictive usage analysis
- Auto-compact at 70%
- Phase unloading to release memory between stages
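The governor's thresholds reduce to a single mapping from usage ratio to action. A sketch using the article's numbers (75% max, 60% warning, 70% auto-compact):

```python
# Thresholds from the article.
MAX_BUDGET = 0.75   # hard cap: block new loads, unload phase data
COMPACT_AT = 0.70   # trigger auto-compact
WARN_AT = 0.60      # surface a warning


def budget_action(used_tokens: int, window_tokens: int) -> str:
    """Map current context usage to the governor's action."""
    ratio = used_tokens / window_tokens
    if ratio >= MAX_BUDGET:
        return "block"
    if ratio >= COMPACT_AT:
        return "compact"
    if ratio >= WARN_AT:
        return "warn"
    return "ok"
```

The gap between 75% and 100% is the 20% reserve: room to actually finish the current step once the cap is hit, instead of compacting mid-implementation.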
Skill Evolution
- Pattern detection: 5 occurrences trigger a skill proposal
- Auto-approval at 85% confidence
- Deprecation at 30% effectiveness
- Weighted feedback: 40% build, 30% test, 20% reverts, 10% user
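The weighted feedback and the two thresholds combine into a small scoring function. The weights and cutoffs are the article's figures; the function shape is a sketch:

```python
# Weights and thresholds from the article.
WEIGHTS = {"build": 0.40, "test": 0.30, "reverts": 0.20, "user": 0.10}
AUTO_APPROVE = 0.85
DEPRECATE = 0.30


def effectiveness(signals: dict[str, float]) -> float:
    """Each signal is a 0..1 success rate; missing signals count as 0."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)


def decide(score: float) -> str:
    """Map an effectiveness score to the skill lifecycle action."""
    if score >= AUTO_APPROVE:
        return "auto-approve"
    if score < DEPRECATE:
        return "deprecate"
    return "keep"
```

A skill with perfect builds, tests, and no reverts scores 0.90 even with zero user feedback, so it auto-approves; one carried only by a 20% build rate scores 0.08 and gets deprecated.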
The parallel execution runs up to 3 concurrent tasks with a 5-minute timeout. If parallel fails, it falls back to sequential--safety over speed.
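That fallback behavior can be sketched with asyncio: a semaphore caps concurrency at 3, each task gets the 5-minute timeout, and any failure reruns everything sequentially. The callable-returning-coroutine shape is deliberate, so tasks can be re-invoked in the fallback pass:

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT = 3
TIMEOUT_S = 300  # 5-minute per-task timeout, from the article


async def run_tasks(tasks: list[Callable[[], Awaitable[str]]]) -> list[str]:
    """Run up to 3 tasks concurrently; on any failure, fall back to sequential."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(make_task):
        async with sem:
            return await asyncio.wait_for(make_task(), timeout=TIMEOUT_S)

    try:
        return list(await asyncio.gather(*(bounded(t) for t in tasks)))
    except Exception:
        # Safety over speed: rerun everything one at a time.
        return [await asyncio.wait_for(t(), timeout=TIMEOUT_S) for t in tasks]
```

Sequential fallback is slower but deterministic, which makes failures debuggable instead of flaky.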
What Are the 4 Pillars of Quality?
Every check maps to one of four pillars:
1. State & Reactivity
- Svelte 5 runes only ($state, $props, $derived)
- No legacy patterns that cause confusion
- State updates via $effect for side effects
2. Security & Validation
- All user input sanitized (XSS prevention)
- Form inputs validated with Zod
- API routes validate request schema
- No inline scripts in production
3. Integration Reality
- Every component used in at least one route
- No orphaned utility files
- All API routes consumed by UI
- Every feature has verification
4. Failure Recovery
- Error boundaries on all route groups
- Graceful degradation for failed API calls
- Loading states for async operations
- User-friendly error messages
FAQ: Building Quality Systems for AI Code Generation
What is a two-gate system for AI code generation?
A two-gate system enforces quality checks before any implementation begins. Gate 0 loads meta-orchestration and validates context budget. Gate 1 activates relevant skills based on your query. Both must pass before tools are unblocked.
How much do token savings matter with progressive disclosure?
Progressive disclosure saves 60% of tokens by loading skill metadata first (~200 tokens), then schemas on demand (~400 tokens), then full content only when needed (~1200 tokens). This prevents context overflow on long sessions.
Why block phrases like 'should work' in AI development?
Phrases like 'should work' and 'probably fine' indicate unverified claims. Blocking them forces evidence-based completion--actual build output, test results, or screenshots before marking work complete.
Can I implement this system for my own Claude Code setup?
Yes. Start with a CLAUDE.md file that enforces gate checks. Add hooks for UserPromptSubmit (skill activation) and Stop (build verification). The meta-orchestration plugin pattern works for any codebase.
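A minimal hooks configuration sketch for `.claude/settings.json`, assuming Claude Code's hooks schema; the two script paths are placeholders for your own gate scripts:

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/activate_skills.py" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/verify_build.py" }
        ]
      }
    ]
  }
}
```

The UserPromptSubmit script handles Gate 1 skill activation; the Stop script runs build verification before the session is allowed to end.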
What's the difference between AMAO and Cortex 2.0?
AMAO handles orchestration--parallel execution, context budgeting, skill evolution. Cortex 2.0 handles skill definitions with 3-tier progressive disclosure. They work together: AMAO decides what to run, Cortex defines how skills work.
What Changed After 6 Months
Six months of running this system daily taught me things I couldn't have predicted from theory alone.
The first surprise: phrase blocking changes how you think, not just what you say. After a few weeks, I stopped forming sentences like "this should work" internally. The habit of reaching for evidence replaced the habit of reaching for confidence. That's not something I expected from a text filter.
The second surprise: Gate 1 skill activation catches mismatches I didn't know were happening. I'd ask about a database query and Claude would activate the wrong skill--something frontend-adjacent because I'd mentioned a component in the same message. The gate surfaces that mismatch immediately. Without it, I'd get a halfway answer and not know why.
The third lesson: the 75% context budget threshold is exactly right, and I didn't trust it at first. I kept thinking "I have 30% left, that's fine." Then I'd hit context compaction 15 minutes later in the middle of implementation. The system wasn't being conservative. I was being optimistic about how much context finishing a feature actually consumes.
After 6 months, my ratio of "it worked first time" to "spent 2 hours debugging" has roughly flipped. Not because Claude got better. Because I stopped accepting "should work" as a completion state.
The hardest part was admitting the problem was my workflow, not the AI's capability. Once I accepted that, building the gates was easy.
I thought I needed better prompts. Well, it's more like... I needed better systems around the prompts. The AI was always capable. I just needed guardrails that made "should work" impossible to say.
Maybe the goal isn't to trust AI more. Maybe it's to trust evidence--and build systems that make evidence the only path forward.
Adapting the Gates to Different Project Types
The two-gate system was built for a specific project—a SvelteKit blog. The principles generalize, but the implementation details vary.
Bug Fixes vs. New Features
For bug fixes, Gate 1 skill activation should weight debugging skills higher than implementation skills. The mental model shift: you're investigating, not building. That difference matters because investigation and implementation have different quality criteria.
For a bug fix, "complete" means: the bug no longer reproduces, a test catches the regression, and the root cause is documented in context.md. Not "the code looks right."
For new features, "complete" means: the feature works end-to-end in the happy path, edge cases are documented and handled, and the build passes with types. A feature that "looks right" is in the same category as "should work."
The gate isn't different—evidence over confidence in both cases—but what counts as evidence is.
Refactoring Sessions
Refactoring is where the gate system earns its keep most visibly. The failure mode for AI-assisted refactoring is: Claude refactors the code, the surface behavior looks the same, but something subtle broke in a path you didn't test.
Before any refactoring session, add one step to Gate 0: document the current behavior you need to preserve. Not informally—write it in context.md. "This function should return null for missing users, not throw. The calling code depends on null, not an exception."
After refactoring, verify that exact behavior explicitly. Don't trust that "the tests pass" if you don't have a test for the specific behavior you documented. If the behavior isn't tested, write the test first, confirm it passes in the original code, then refactor.
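A characterization test for the null-for-missing-users example above might look like this. Everything here is hypothetical (function name, data shape); the pattern is what matters, pin the documented behavior before touching the code:

```python
def get_user(users: dict, user_id: int):
    """Stand-in for the function under refactor.

    Documented behavior to preserve: returns None for missing users, never raises.
    """
    return users.get(user_id)


def test_missing_user_returns_none():
    # Pin this against the ORIGINAL code first; only then refactor.
    assert get_user({1: "ada"}, 99) is None


def test_existing_user_returned():
    assert get_user({1: "ada"}, 1) == "ada"
```

If this test passes on the original code and still passes after the refactor, you have evidence. If you only have "the tests pass" without a test for this behavior, you have confidence.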
Exploratory Sessions
Some sessions are genuinely exploratory—you're trying to understand an unfamiliar codebase, evaluate a new library, or figure out why something is slow. These sessions don't naturally fit the gate structure because there's no implementation to gate.
For exploratory sessions: skip Gate 1 skill activation and run with a single constraint—everything you learn goes into context.md in real time. Exploration without capture is entertainment.
At the end of an exploratory session, convert your notes into hypotheses for the next session. "Learned: the bottleneck is in the database query at line 45. Hypothesis: adding an index on user_id will fix it. Test this in the next session."
This closes the loop between exploration and implementation, which is the gap where insights evaporate.
Short Projects vs. Long Projects
Short projects (one to three sessions): use the gate system for quality control, skip the full dev docs workflow. One context file with the current state is enough.
Long projects (weeks to months): invest in the full dev docs structure. The overhead of maintaining three files pays off by session four, when the project would otherwise require full context rebuilds at the start of every session.
The gate system is not overhead. It's the minimum viable quality control. Don't scale it down below the gate check—that's where the broken-code problem that started this whole system lives.
Related Reading
This is part of the Complete Claude Code Guide. Continue with:
- Context Management - Dev docs workflow that prevents context amnesia
- Evidence-Based Verification - Why "should work" is the most dangerous phrase
- Token Optimization - Save 60% with progressive disclosure
- What is RAG? - Foundational concept behind context-aware AI