Originally published at chudi.dev
I shipped broken code three times in one week. Not edge cases--fundamental errors that any test would have caught. The AI said "should work" and I believed it.
Building a quality control system for AI code generation means enforcing mandatory gates before implementation begins--loading relevant skills, validating context budget, and blocking rationalization phrases like "should work" that indicate unverified claims. The result is a two-gate system where tools literally cannot execute until quality checks pass. Claude Code provides the hooks infrastructure that makes gate enforcement possible at the session level.
Why Did I Need Quality Gates for AI?
The problem wasn't the AI's capability. Claude is remarkably good at generating code. The problem was my workflow--or lack of one.
I'd describe what I wanted. Claude would write it. I'd paste it in. Sometimes it worked. Sometimes I'd spend hours debugging issues that existed from the first line. Without me realizing it, I was trusting confidence over evidence.
That specific anxiety of deploying something you haven't tested--the kind where you refresh the page three times hoping the error goes away--became my default state.
Well, it's more like... I was using AI as a code generator when I needed it to be a quality-controlled collaborator.
How Does the Two-Gate System Work?
The system enforces two mandatory checks before any tool can execute. Like buttoning a shirt from the first hole--skip it, and everything else is wrong.
If either gate fails, all implementation tools are blocked. You literally cannot write code until the orchestration layer is ready.
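The blocking can be sketched as a PreToolUse hook. This is a minimal sketch, not the author's actual implementation: the marker-file paths and the `IMPLEMENTATION_TOOLS` set are my own placeholders, though Claude Code hooks do receive JSON on stdin and treat exit code 2 as "block this tool call".

```python
from pathlib import Path

# Hypothetical markers written by the orchestration layer once each gate passes.
IMPLEMENTATION_TOOLS = {"Write", "Edit", "Bash"}
GATE_MARKERS = [Path(".claude/gate0-passed"), Path(".claude/gate1-passed")]


def check_gates(tool_name: str) -> tuple[bool, str]:
    """Allow read-only tools; block implementation tools until both gates pass."""
    if tool_name not in IMPLEMENTATION_TOOLS:
        return True, ""
    missing = [m.name for m in GATE_MARKERS if not m.exists()]
    if missing:
        return False, f"Blocked: gates not passed: {', '.join(missing)}"
    return True, ""


# Wire-up as the hook script's entry point (Claude Code sends hook input as JSON):
#   event = json.load(sys.stdin)
#   ok, reason = check_gates(event.get("tool_name", ""))
#   if not ok:
#       print(reason, file=sys.stderr)
#       sys.exit(2)   # exit code 2 blocks the tool call
```

The key property: the check runs before the tool, not after, so "forgot to verify" stops being a possible state.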
Gate 0: Meta-Orchestration (Priority 0)
This gate loads immediately and handles three things:
- Loading the meta-orchestration skill itself
- Validating the context budget before any work begins
- Registering the blocklist of rationalization phrases
Gate 1: Auto-Skill Activation (Priority 1)
This gate analyzes your query and activates the relevant skills before any implementation tool runs.
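At its simplest, skill activation is trigger-word routing against each skill's Tier-1 metadata. A sketch, with a hypothetical trigger table (the real triggers live in each skill's own metadata block):

```python
# Hypothetical trigger table; in the real system, Tier-1 metadata supplies these.
SKILL_TRIGGERS = {
    "svelte-components": ["component", "rune", "$state", "snippet"],
    "api-routes": ["endpoint", "route", "request", "zod"],
    "debugging": ["error", "stack trace", "reproduce", "bug"],
}


def activate_skills(query: str) -> list[str]:
    """Gate 1: return every skill whose trigger words appear in the query."""
    q = query.lower()
    return [skill for skill, triggers in SKILL_TRIGGERS.items()
            if any(t in q for t in triggers)]
```

The point of making this a gate rather than a suggestion: if zero skills activate on a query that clearly needs one, that mismatch surfaces before any code gets written.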
I love automation. But I spend hours building systems to slow myself down.
What Is Progressive Disclosure and Why Does It Save 60% of Tokens?
Most Claude configurations load everything upfront. Every skill, every rule, every example--thousands of tokens consumed before you've even asked a question. Anthropic's documentation covers prompt and context design patterns that help avoid this upfront token burn.
Progressive disclosure flips this. Load metadata first. Load details on demand.
The 3-Tier System
Tier 1: Metadata (~200 tokens)
- Skill name, triggers, dependencies
- Just enough to route the query
Tier 2: Schema (~400 tokens)
- Input/output types
- Constraints and quality gates
- Tools available
Tier 3: Full Content (~1200 tokens)
- Complete handler logic
- Examples and edge cases
- Only loaded when actively using the skill
The meta-orchestration skill alone: 278 lines at Tier 1, 816 with one reference, 3,302 fully loaded. That's 60% savings on every session that doesn't need the full content.
What Phrases Does the System Block?
The automated verification system flags specific patterns in code comments and commit messages: phrases like "should work" and "probably fine" that signal insufficient testing or unverified assumptions.
These phrases aren't banned because they're wrong. They're banned because they indicate claims without evidence.
The goal isn't to be pedantic--it's to make evidence the path of least resistance. When "build passed" is easier to say than "should work," you'll naturally verify before claiming.
That hollow confidence of claiming something works without checking--the system makes it impossible.
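The check itself is trivially simple, which is the point. A sketch of the scanner; "should work" and "probably fine" come from the article, the rest of the blocklist is my own filler in the same spirit:

```python
# Example blocklist. "should work" and "probably fine" are from the article;
# the other entries are illustrative additions.
BLOCKED_PHRASES = ["should work", "probably fine", "seems to work", "might be ok"]


def find_unverified_claims(text: str) -> list[str]:
    """Return every blocked phrase found in a commit message or code comment."""
    lowered = text.lower()
    return [p for p in BLOCKED_PHRASES if p in lowered]
```

Run it over staged commit messages in a hook and fail the commit on a non-empty result. "Build passed, tests green" sails through; "this should work now" does not.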
How Does AMAO Handle Parallel Execution?
AMAO (Adaptive Multi-Agent Orchestrator) adds sophisticated orchestration on top of the gate system:
DAG Engine
- Directed acyclic graph for task dependencies
- Max 50 tasks with cycle detection
- Parallel grouping for independent operations
- Critical path analysis for optimization
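The cycle detection and parallel grouping above fall out of one standard algorithm: Kahn's topological sort, taken level by level. A sketch, assuming tasks are named strings and dependencies are sets (the 50-task limit is the article's figure):

```python
from collections import defaultdict

MAX_TASKS = 50  # limit from the article


def parallel_groups(deps: dict[str, set[str]]) -> list[list[str]]:
    """Kahn's algorithm, level by level: each returned wave can run in parallel.

    Raises ValueError on a dependency cycle or too many tasks.
    """
    if len(deps) > MAX_TASKS:
        raise ValueError("too many tasks")
    indegree = {task: len(d) for task, d in deps.items()}
    dependents = defaultdict(list)
    for task, d in deps.items():
        for dep in d:
            dependents[dep].append(task)

    groups, ready, done = [], sorted(t for t, n in indegree.items() if n == 0), 0
    while ready:
        groups.append(ready)
        done += len(ready)
        nxt = []
        for finished in ready:
            for task in dependents[finished]:
                indegree[task] -= 1
                if indegree[task] == 0:
                    nxt.append(task)
        ready = sorted(nxt)

    if done != len(deps):
        raise ValueError("cycle detected")  # leftover tasks imply a cycle
    return groups
```

Tasks with no dependency between them land in the same wave, which is exactly what the parallel executor needs.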
Context Governor
- 75% max budget, 60% warning threshold, 20% reserve
- Predictive usage analysis
- Auto-compact at 70%
- Phase unloading to release memory between stages
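The governor's thresholds reduce to a single mapping from usage ratio to action. A sketch using the article's numbers (75% max, 60% warning, 70% auto-compact):

```python
# Thresholds from the article.
MAX_BUDGET = 0.75   # hard cap: block new loads, unload phase data
COMPACT_AT = 0.70   # trigger auto-compact
WARN_AT = 0.60      # surface a warning


def budget_action(used_tokens: int, window_tokens: int) -> str:
    """Map current context usage to the governor's action."""
    ratio = used_tokens / window_tokens
    if ratio >= MAX_BUDGET:
        return "block"
    if ratio >= COMPACT_AT:
        return "compact"
    if ratio >= WARN_AT:
        return "warn"
    return "ok"
```

The gap between 75% and 100% is the 20% reserve: room to actually finish the current step once the cap is hit, instead of compacting mid-implementation.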
Skill Evolution
- Pattern detection: 5 occurrences trigger a skill proposal
- Auto-approval at 85% confidence
- Deprecation at 30% effectiveness
- Weighted feedback: 40% build, 30% test, 20% reverts, 10% user
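The weighted feedback and the two thresholds combine into a small scoring function. The weights and cutoffs are the article's figures; the function shape is a sketch:

```python
# Weights and thresholds from the article.
WEIGHTS = {"build": 0.40, "test": 0.30, "reverts": 0.20, "user": 0.10}
AUTO_APPROVE = 0.85
DEPRECATE = 0.30


def effectiveness(signals: dict[str, float]) -> float:
    """Each signal is a 0..1 success rate; missing signals count as 0."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)


def decide(score: float) -> str:
    """Map an effectiveness score to the skill lifecycle action."""
    if score >= AUTO_APPROVE:
        return "auto-approve"
    if score < DEPRECATE:
        return "deprecate"
    return "keep"
```

A skill with perfect builds, tests, and no reverts scores 0.90 even with zero user feedback, so it auto-approves; one carried only by a 20% build rate scores 0.08 and gets deprecated.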
The parallel execution runs up to 3 concurrent tasks with a 5-minute timeout. If parallel fails, it falls back to sequential--safety over speed.
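That fallback behavior can be sketched with asyncio: a semaphore caps concurrency at 3, each task gets the 5-minute timeout, and any failure reruns everything sequentially. The callable-returning-coroutine shape is deliberate, so tasks can be re-invoked in the fallback pass:

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT = 3
TIMEOUT_S = 300  # 5-minute per-task timeout, from the article


async def run_tasks(tasks: list[Callable[[], Awaitable[str]]]) -> list[str]:
    """Run up to 3 tasks concurrently; on any failure, fall back to sequential."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(make_task):
        async with sem:
            return await asyncio.wait_for(make_task(), timeout=TIMEOUT_S)

    try:
        return list(await asyncio.gather(*(bounded(t) for t in tasks)))
    except Exception:
        # Safety over speed: rerun everything one at a time.
        return [await asyncio.wait_for(t(), timeout=TIMEOUT_S) for t in tasks]
```

Sequential fallback is slower but deterministic, which makes failures debuggable instead of flaky.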
What Are the 4 Pillars of Quality?
Every check maps to one of four pillars:
1. State & Reactivity
- Svelte 5 runes only ($state, $props, $derived)
- No legacy patterns that cause confusion
- State updates via $effect for side effects
2. Security & Validation
- All user input sanitized (XSS prevention)
- Form inputs validated with Zod
- API routes validate request schema
- No inline scripts in production
3. Integration Reality
- Every component used in at least one route
- No orphaned utility files
- All API routes consumed by UI
- Every feature has verification
4. Failure Recovery
- Error boundaries on all route groups
- Graceful degradation for failed API calls
- Loading states for async operations
- User-friendly error messages
FAQ: Building Quality Systems for AI Code Generation
What is a two-gate system for AI code generation?
A two-gate system enforces quality checks before any implementation begins. Gate 0 loads meta-orchestration and validates context budget. Gate 1 activates relevant skills based on your query. Both must pass before tools are unblocked.
How much do token savings matter with progressive disclosure?
Progressive disclosure saves 60% of tokens by loading skill metadata first (~200 tokens), then schemas on demand (~400 tokens), then full content only when needed (~1200 tokens). This prevents context overflow on long sessions.
Why block phrases like 'should work' in AI development?
Phrases like 'should work' and 'probably fine' indicate unverified claims. Blocking them forces evidence-based completion--actual build output, test results, or screenshots before marking work complete.
Can I implement this system for my own Claude Code setup?
Yes. Start with a CLAUDE.md file that enforces gate checks. Add hooks for UserPromptSubmit (skill activation) and Stop (build verification). The meta-orchestration plugin pattern works for any codebase.
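A minimal hooks configuration sketch for `.claude/settings.json`, assuming Claude Code's hooks schema; the two script paths are placeholders for your own gate scripts:

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/activate_skills.py" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/verify_build.py" }
        ]
      }
    ]
  }
}
```

The UserPromptSubmit script handles Gate 1 skill activation; the Stop script runs build verification before the session is allowed to end.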
What's the difference between AMAO and Cortex 2.0?
AMAO handles orchestration--parallel execution, context budgeting, skill evolution. Cortex 2.0 handles skill definitions with 3-tier progressive disclosure. They work together: AMAO decides what to run, Cortex defines how skills work.
What Changed After 6 Months
Six months of running this system daily taught me things I couldn't have predicted from theory alone.
The first surprise: phrase blocking changes how you think, not just what you say. After a few weeks, I stopped forming sentences like "this should work" internally. The habit of reaching for evidence replaced the habit of reaching for confidence. That's not something I expected from a text filter.
The second surprise: Gate 1 skill activation catches mismatches I didn't know were happening. I'd ask about a database query and Claude would activate the wrong skill--something frontend-adjacent because I'd mentioned a component in the same message. The gate surfaces that mismatch immediately. Without it, I'd get a halfway answer and not know why.
The third lesson: the 75% context budget threshold is exactly right, and I didn't trust it at first. I kept thinking "I have 30% left, that's fine." Then I'd hit context compaction 15 minutes later in the middle of implementation. The system wasn't being conservative. I was being optimistic about how much context finishing a feature actually consumes.
After 6 months, my ratio of "it worked first time" to "spent 2 hours debugging" has roughly flipped. Not because Claude got better. Because I stopped accepting "should work" as a completion state.
The hardest part was admitting the problem was my workflow, not the AI's capability. Once I accepted that, building the gates was easy.
I thought I needed better prompts. Well, it's more like... I needed better systems around the prompts. The AI was always capable. I just needed guardrails that made "should work" impossible to say.
Maybe the goal isn't to trust AI more. Maybe it's to trust evidence--and build systems that make evidence the only path forward.
Adapting the Gates to Different Project Types
The two-gate system was built for a specific project—a SvelteKit blog. The principles generalize, but the implementation details vary.
Bug Fixes vs. New Features
For bug fixes, Gate 1 skill activation should weight debugging skills higher than implementation skills. The mental model shift: you're investigating, not building. That difference matters because investigation and implementation have different quality criteria.
For a bug fix, "complete" means: the bug no longer reproduces, a test catches the regression, and the root cause is documented in context.md. Not "the code looks right."
For new features, "complete" means: the feature works end-to-end in the happy path, edge cases are documented and handled, and the build passes with types. A feature that "looks right" is in the same category as "should work."
The gate isn't different—evidence over confidence in both cases—but what counts as evidence is.
Refactoring Sessions
Refactoring is where the gate system earns its keep most visibly. The failure mode for AI-assisted refactoring is: Claude refactors the code, the surface behavior looks the same, but something subtle broke in a path you didn't test.
Before any refactoring session, add one step to Gate 0: document the current behavior you need to preserve. Not informally—write it in context.md. "This function should return null for missing users, not throw. The calling code depends on null, not an exception."
After refactoring, verify that exact behavior explicitly. Don't trust that "the tests pass" if you don't have a test for the specific behavior you documented. If the behavior isn't tested, write the test first, confirm it passes in the original code, then refactor.
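A characterization test for the null-for-missing-users example above might look like this. Everything here is hypothetical (function name, data shape); the pattern is what matters, pin the documented behavior before touching the code:

```python
def get_user(users: dict, user_id: int):
    """Stand-in for the function under refactor.

    Documented behavior to preserve: returns None for missing users, never raises.
    """
    return users.get(user_id)


def test_missing_user_returns_none():
    # Pin this against the ORIGINAL code first; only then refactor.
    assert get_user({1: "ada"}, 99) is None


def test_existing_user_returned():
    assert get_user({1: "ada"}, 1) == "ada"
```

If this test passes on the original code and still passes after the refactor, you have evidence. If you only have "the tests pass" without a test for this behavior, you have confidence.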
Exploratory Sessions
Some sessions are genuinely exploratory—you're trying to understand an unfamiliar codebase, evaluate a new library, or figure out why something is slow. These sessions don't naturally fit the gate structure because there's no implementation to gate.
For exploratory sessions: skip Gate 1 skill activation and run with a single constraint—everything you learn goes into context.md in real time. Exploration without capture is entertainment.
At the end of an exploratory session, convert your notes into hypotheses for the next session. "Learned: the bottleneck is in the database query at line 45. Hypothesis: adding an index on user_id will fix it. Test this in the next session."
This closes the loop between exploration and implementation, which is the gap where insights evaporate.
Short Projects vs. Long Projects
Short projects (one to three sessions): use the gate system for quality control, skip the full dev docs workflow. One context file with the current state is enough.
Long projects (weeks to months): invest in the full dev docs structure. The overhead of maintaining three files pays off by session four, when the project would otherwise require full context rebuilds at the start of every session.
The gate system is not overhead. It's the minimum viable quality control. Don't scale it down below the gate check—that's where the broken-code problem that started this whole system lives.
Related Reading
This is part of the Complete Claude Code Guide. Continue with:
- Context Management - Dev docs workflow that prevents context amnesia
- Evidence-Based Verification - Why "should work" is the most dangerous phrase
- Token Optimization - Save 60% with progressive disclosure
- What is RAG? - Foundational concept behind context-aware AI