Why AI coding agents fail with incomplete specs

Nirsa — Tue, 19 May 2026 07:53:15 +0000

AI coding agents like Codex and Claude Code are getting surprisingly good at writing code.

But after using them in real projects, I noticed something:

Most failures were not caused by the model.

They were caused by incomplete specs.

When a specification has gaps, the AI fills them in with plausible assumptions. At first the generated code often looks correct, but over time the implementation slowly drifts away from the intended behavior.

I kept running into issues like:

missing auth boundaries
unclear tenant ownership rules
retry and race-condition problems
webhook duplication edge cases
requirements enforced only on the client side
implementation drift between spec and actual behavior

Eventually it becomes difficult to tell whether the bug is:

in the implementation,
in the specification,
or in the original requirement itself.
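
To make one of the gaps above concrete, here is a minimal sketch of the webhook duplication case. It assumes the provider sends a stable event ID with every delivery and redelivers the same ID on retry. The `WebhookProcessor` class and its fields are hypothetical names for illustration, not part of any real project.

```python
from dataclasses import dataclass, field


@dataclass
class WebhookProcessor:
    """Hypothetical handler that applies each webhook event at most once."""

    seen_event_ids: set[str] = field(default_factory=set)
    balance: int = 0

    def handle(self, event_id: str, amount: int) -> bool:
        """Apply the event and return True, or return False for a duplicate."""
        if event_id in self.seen_event_ids:
            return False  # redelivery: acknowledge without applying again
        self.seen_event_ids.add(event_id)
        self.balance += amount
        return True


processor = WebhookProcessor()
first = processor.handle("evt_1", 100)
retry = processor.handle("evt_1", 100)  # provider retries the same delivery
```

If the spec never says "deliveries may repeat", nothing forces the duplicate check to exist. The in-memory set is only for illustration; a real service would need durable storage that survives restarts and concurrent workers.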

A Real Failure Case

One issue I repeatedly saw was missing tenant ownership validation.

A spec would describe:

authentication
API structure
expected responses

but never explicitly define ownership constraints.

The AI agent would generate code that correctly authenticated the user, but still allowed cross-tenant access because ownership validation was never part of the specification.

The implementation looked reasonable at first glance.

But the security boundary itself was undefined.

That was the moment I realized the problem was often not "bad code generation."

It was ambiguous requirements.
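
The failure above can be sketched roughly like this. Everything here (`User`, `Invoice`, the `INVOICES` store, both function names) is a hypothetical illustration, not code produced by any particular agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    tenant_id: str


# Hypothetical in-memory store standing in for a database.
INVOICES = {
    "inv_1": Invoice("inv_1", tenant_id="tenant_a"),
    "inv_2": Invoice("inv_2", tenant_id="tenant_b"),
}


def get_invoice_auth_only(user: Optional[User], invoice_id: str) -> Invoice:
    """What the spec asked for: authentication, and nothing about ownership."""
    if user is None:
        raise PermissionError("authentication required")
    return INVOICES[invoice_id]  # any authenticated user can read any tenant's data


def get_invoice_with_ownership(user: Optional[User], invoice_id: str) -> Invoice:
    """The same lookup once the spec states that invoices belong to a tenant."""
    if user is None:
        raise PermissionError("authentication required")
    invoice = INVOICES[invoice_id]
    if invoice.tenant_id != user.tenant_id:
        raise PermissionError("invoice belongs to another tenant")
    return invoice
```

Both functions satisfy a spec that only says "the endpoint requires authentication". Only the second one enforces the boundary, and it can only be written if the ownership rule is stated somewhere.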

The Idea

I started building an open-source tool called SpecGuard.

The goal is simple:

Review requirements before they become input to an AI coding agent.

Instead of reviewing generated code after implementation, SpecGuard tries to catch ambiguous or incomplete requirements earlier in the workflow.

This is heavily inspired by problems I encountered while experimenting with:

AI-assisted development
LLM coding agents
spec-driven development workflows

Intended Workflow

Write spec
    ↓
Run SpecGuard
    ↓
Fix NOT_READY findings
    ↓
Hand spec to AI coding agent

What SpecGuard Checks

Main validation areas

SpecGuard mainly looks for ambiguous or missing areas such as:

auth and permission boundaries
tenant ownership rules
idempotency and replay safety
race conditions
expiration and revocation handling
state transitions
webhook/background retry behavior
requirements relying only on client-side validation

The output is one of:

READY
READY_WITH_WARNINGS
NOT_READY
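
To give a feel for what a deterministic check of this kind could look like, here is a deliberately crude sketch. It is not SpecGuard's actual rule set or implementation: the `RULES` table, the keyword patterns, and the choice of which areas are blocking are all assumptions made up for this example.

```python
import re
from enum import Enum


class Status(Enum):
    READY = "READY"
    READY_WITH_WARNINGS = "READY_WITH_WARNINGS"
    NOT_READY = "NOT_READY"


# Hypothetical rules: (area, keyword pattern, blocking?). Not SpecGuard's real rules.
RULES = [
    ("tenant ownership", r"\b(tenant|owner(ship)?)\b", True),
    ("auth boundaries", r"\b(auth\w*|permission\w*)\b", True),
    ("idempotency", r"\b(idempoten\w*|replay)\b", False),
    ("retry behavior", r"\bretr(y|ies)\b", False),
]


def review(spec_text: str) -> tuple[Status, list[str]]:
    """Return a status plus the areas the spec never mentions."""
    text = spec_text.lower()
    missing = [(area, blocking) for area, pattern, blocking in RULES
               if not re.search(pattern, text)]
    findings = [area for area, _ in missing]
    if any(blocking for _, blocking in missing):
        return Status.NOT_READY, findings
    if missing:
        return Status.READY_WITH_WARNINGS, findings
    return Status.READY, findings
```

A keyword match only shows that a topic is mentioned, not that the rule is actually defined, which is one place where checks like this can produce false positives and false negatives.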

Why the Default Mode Does Not Use an LLM

I intentionally made the default path non-LLM.

I wanted spec validation to behave more like linting:

deterministic
reproducible
CI-friendly
cheap to run repeatedly

LLM-based review exists only as an optional deeper layer, not the foundation. There is an OpenAI/Codex-based review mode, but I currently treat it as secondary to the default workflow.
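
The "like linting" idea implies that a CI step can turn the result into a pass/fail signal. A minimal sketch, assuming a policy where only NOT_READY blocks the pipeline (the `EXIT_CODES` mapping and `gate` function are hypothetical, not SpecGuard's CLI behavior):

```python
# Assumed policy for illustration: only NOT_READY fails the CI step.
EXIT_CODES = {"READY": 0, "READY_WITH_WARNINGS": 0, "NOT_READY": 1}


def gate(status: str) -> int:
    """Map a review status to a process exit code for a CI step."""
    return EXIT_CODES[status]
```

A wrapper script could end with `sys.exit(gate(status))`. Because the mapping is a plain lookup, the same spec always produces the same CI outcome.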

Codex Plugin

In v0.4.0 I added an MVP Codex plugin.

Install:

pip install spec-guard
specguard --help

codex plugin marketplace add KoreaNirsa/spec-guard --ref main

Create an example spec package:

specguard example copy specs/your-feature-name --force

Inside Codex, the plugin can:

run SpecGuard analysis
read generated results
summarize READY/NOT_READY state
explain main findings and next actions

The plugin itself does not reimplement the engine.

It wraps the existing CLI workflow.

GitHub PR Review Workflow

SpecGuard also includes a GitHub Actions-based PR review workflow.

When a spec package changes in a PR, it can automatically run SpecGuard Review and leave findings directly on the PR.

The OpenAI review path currently uses GitHub secrets such as:

SPECGUARD_OPENAI_API_KEY
SPECGUARD_PR_REVIEW_MODEL
SPECGUARD_REVIEW_SPEC_PATHS

Current Status

This project is still very early and pre-beta.

I do not expect it to perfectly judge every specification.

Right now I am mainly interested in feedback around:

what kinds of specs this workflow fits well
where deterministic checks break down
which findings feel too noisy or too weak
whether PR enforcement would fit real engineering workflows

If you are already using AI coding agents in production workflows, I’d genuinely like to know:

what kinds of spec failures you see most often
whether something like this would actually fit your development workflow

I’m especially interested in situations where the generated implementation looked correct, but the requirement itself was underspecified.

Feedback, issues, and PRs are all welcome.

KoreaNirsa / spec-guard

Validation-First Workflow (VFW) for AI-assisted development

SpecGuard

SpecGuard blocks weak specs before AI coding agents turn them into defective code.

SpecGuard is a Validation-First Workflow (VFW) for AI-assisted development. It turns specs into reviewed, testable, implementation-ready packages before AI coding begins.

It is not a prompt-to-code generator. SpecGuard helps you prepare an approved spec package before an external Codex, Claude Code, or another coding agent writes application code.

Demo Video

Watch the full-resolution MP4 demo

The demo follows this flow:

Install SpecGuard with pip install spec-guard.
Copy the example spec with specguard example copy your-feature-name --force.
Insert a vulnerable spec. In v0.3.0, the packaged example intentionally includes a vulnerable spec by default so users can see a blocking SpecGuard Review.
Review the SpecGuard findings.
Fix the weak areas directly, or ask an AI assistant to strengthen the spec by giving it the SpecGuard Review findings.
Run SpecGuard Review again and confirm it reaches READY…

The problem wasn't that the AI wrote bad code — weak specs caused unstable implementations

Nirsa — Fri, 08 May 2026 13:02:31 +0000

Recently I’ve been experimenting a lot with AI-assisted development workflows using tools like Codex and Claude Code.

At first, I assumed most implementation failures came from the AI itself.

But after repeatedly testing spec-driven workflows, I noticed something different:

The problem wasn't that the AI wrote bad code. The problem was that weak specs caused unstable implementations.

Ambiguous requirements often led to:

unstable architecture
inconsistent contracts
missing ownership boundaries
unsafe delete/update behavior
implementation drift
features expanding outside original intent

In many cases, the AI was actually trying to follow the provided specification. The issue was that the specification itself was incomplete, unsafe, or unclear.

That led me to start experimenting with what I’ve been calling VFW (Validation-First Workflow).

The core idea is simple: before AI coding starts, validate whether the specification is actually implementation-ready.

As part of that experiment, I started building a small OSS project called SpecGuard:
https://github.com/KoreaNirsa/spec-guard

SpecGuard is not a code generator.

Instead, it acts more like a validation-first guard layer for spec-driven / AI-assisted development workflows.

The current v0.3.0 release supports things like:

readiness review for spec packages
Critical / Major / Minor findings
low review mode
implementation handoff artifacts
experimental PR drift review
heuristic-first review flow

Typical workflow
Discovery → spec.md → technical-design.md → SpecGuard Review → readiness validation → implementation handoff → external coding agent → Pull Request → SpecGuard PR review

The project is still very experimental and immature in many areas.

Known limitations:

heuristic false positives / false negatives
limited benchmark coverage
small real-world validation set
review calibration still evolving
UX/docs still rough

Right now this is still much closer to a demo-stage OSS project than a mature production tool.

But I’d like to continue evolving it toward something practical enough for real engineering workflows.

I’m especially interested in exploring:

Spec-Driven Development
validation-first workflows
contract validation
AI-assisted engineering
PR review automation
CI/CD validation gates
harness/evaluation engineering

Feedback and contributors are very welcome.

DEV Community: Nirsa