DEV Community

Nirsa

Posted on May 19

Why AI coding agents fail with incomplete specs

#opensource #ai #devtools #github

AI coding agents like Codex and Claude Code are getting surprisingly good at writing code.

But after using them in real projects, I noticed something:

Most failures were not caused by the model.

They were caused by incomplete specs.

When a specification has gaps, the AI fills them in with plausible assumptions. At first the generated code often looks correct, but over time the implementation slowly drifts away from the intended behavior.

I kept running into issues like:

missing auth boundaries
unclear tenant ownership rules
retry and race-condition problems
webhook duplication edge cases
requirements enforced only on the client side
implementation drift between spec and actual behavior

Eventually it becomes difficult to tell whether the bug is:

in the implementation,
in the specification,
or in the original requirement itself.
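
To make one of the gaps listed above concrete, consider webhook duplication. Providers commonly redeliver an event when they do not receive a timely acknowledgement, so a handler that applies every delivery will apply some events twice unless the spec says how duplicates are detected. The sketch below is a minimal, hypothetical example: the event shape, the `event_id` field, and the in-memory store are assumptions for illustration, not any real provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentWebhookHandler:
    """Hypothetical handler that credits an account once per event."""

    balance_cents: int = 0
    # Assumption: each delivery carries a stable, unique event_id.
    # A real system would persist this set (for example, a unique DB column).
    seen_event_ids: set = field(default_factory=set)

    def handle(self, event: dict) -> bool:
        """Apply the event; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return False  # Redelivery: acknowledge but do not apply again.
        self.seen_event_ids.add(event_id)
        self.balance_cents += event["amount_cents"]
        return True


handler = PaymentWebhookHandler()
event = {"event_id": "evt_1", "amount_cents": 500}
first = handler.handle(event)
second = handler.handle(event)  # Simulated retry of the same delivery.
print(first, second, handler.balance_cents)
```

The second call is ignored, so the balance stays at 500. If the spec never mentions duplicates, nothing tells the agent that the `seen_event_ids` check is required.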

A Real Failure Case

One issue I repeatedly saw was missing tenant ownership validation.

A spec would describe:

authentication
API structure
expected responses

but never explicitly define ownership constraints.

The AI agent would generate code that correctly authenticated the user, but still allowed cross-tenant access because ownership validation was never part of the specification.

The implementation looked reasonable at first glance.

But the security boundary itself was undefined.

That was the moment I realized the problem was often not "bad code generation."

It was ambiguous requirements.
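
The failure described above can be sketched in a few lines. Everything here is hypothetical (the `User` and `Document` types, the in-memory store, and both function names); it only illustrates how code can satisfy a spec that mentions authentication while still leaking data across tenants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str


@dataclass(frozen=True)
class Document:
    doc_id: str
    tenant_id: str
    body: str


# Hypothetical in-memory store with documents from two tenants.
DOCUMENTS = {
    "doc-a": Document("doc-a", "tenant-a", "Tenant A roadmap"),
    "doc-b": Document("doc-b", "tenant-b", "Tenant B payroll"),
}


def get_document_auth_only(user: Optional[User], doc_id: str) -> Document:
    """What an agent may produce when the spec only mentions authentication."""
    if user is None:
        raise PermissionError("authentication required")
    return DOCUMENTS[doc_id]  # No ownership check: any tenant's data is readable.


def get_document_tenant_scoped(user: Optional[User], doc_id: str) -> Document:
    """Same endpoint once the spec states the tenant ownership rule."""
    if user is None:
        raise PermissionError("authentication required")
    doc = DOCUMENTS[doc_id]
    if doc.tenant_id != user.tenant_id:
        raise PermissionError("document belongs to another tenant")
    return doc


alice = User("alice", "tenant-a")
leaked = get_document_auth_only(alice, "doc-b")
print(leaked.body)  # Cross-tenant read succeeds.

try:
    get_document_tenant_scoped(alice, "doc-b")
    blocked = False
except PermissionError:
    blocked = True
print(blocked)
```

Both functions authenticate the caller correctly. Only the second one enforces the boundary, and it exists only because the rule was written down.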

The Idea

I started building an open-source tool called SpecGuard.

The goal is simple:

Review requirements before they become input to an AI coding agent.

Instead of reviewing generated code after implementation, SpecGuard tries to catch ambiguous or incomplete requirements earlier in the workflow.

This is heavily inspired by problems I encountered while experimenting with:

AI-assisted development
LLM coding agents
spec-driven development workflows

Intended Workflow

Write spec
    ↓
Run SpecGuard
    ↓
Fix NOT_READY findings
    ↓
Hand spec to AI coding agent

What SpecGuard Checks

Main validation areas

SpecGuard mainly looks for ambiguous or missing areas such as:

auth and permission boundaries
tenant ownership rules
idempotency and replay safety
race conditions
expiration and revocation handling
state transitions
webhook/background retry behavior
requirements relying only on client-side validation

The output is one of:

READY
READY_WITH_WARNINGS
NOT_READY
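
To show what a deterministic check with these three outcomes can look like, here is a toy sketch. It is not SpecGuard's implementation: the topic names, the keyword lists, and the split between blocking and warning rules are all assumptions invented for illustration.

```python
from enum import Enum


class Verdict(Enum):
    READY = "READY"
    READY_WITH_WARNINGS = "READY_WITH_WARNINGS"
    NOT_READY = "NOT_READY"


# Hypothetical rules: topic -> (keywords that count as coverage, is it blocking?)
RULES = {
    "tenant ownership": (("tenant", "ownership"), True),
    "idempotency": (("idempotent", "idempotency", "duplicate"), True),
    "expiration": (("expire", "expiration", "revoke"), False),
}


def review(spec_text: str) -> tuple[Verdict, list[str]]:
    """Return a verdict plus the topics the spec never mentions."""
    text = spec_text.lower()
    blocking, warnings = [], []
    for topic, (keywords, is_blocking) in RULES.items():
        if not any(keyword in text for keyword in keywords):
            (blocking if is_blocking else warnings).append(topic)
    if blocking:
        return Verdict.NOT_READY, blocking + warnings
    if warnings:
        return Verdict.READY_WITH_WARNINGS, warnings
    return Verdict.READY, []


weak = "Users authenticate with a bearer token and call GET /invoices."
partial = "Each invoice belongs to one tenant. Duplicate webhook deliveries are ignored."
strong = partial + " Tokens expire after 15 minutes."

for spec in (weak, partial, strong):
    verdict, findings = review(spec)
    print(verdict.value, findings)
```

Because the function is plain string matching, the same spec always produces the same verdict. Keyword matching is also easy to fool, which is one place where deterministic checks can break down.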

Why the Default Mode Does Not Use an LLM

I intentionally made the default path non-LLM.

I wanted spec validation to behave more like linting:

deterministic
reproducible
CI-friendly
cheap to run repeatedly

LLM-based review exists only as an optional deeper layer (currently an OpenAI/Codex-based review mode). I treat it as secondary, not as the foundation or the default workflow.

Codex Plugin

In v0.4.0 I added an MVP Codex plugin.

Install:

pip install spec-guard
specguard --help

codex plugin marketplace add KoreaNirsa/spec-guard --ref main

Create an example spec package:

specguard example copy specs/your-feature-name --force

Inside Codex, the plugin can:

run SpecGuard analysis
read generated results
summarize READY/NOT_READY state
explain main findings and next actions

The plugin itself does not reimplement the engine.

It wraps the existing CLI workflow.
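
The "wrap, do not reimplement" approach can be sketched as a thin function that shells out to a command and hands back its exit code and output. This is a generic illustration, not the plugin's actual code. The helper name `run_cli` is hypothetical, and the only SpecGuard invocation used is `specguard --help` from the install step above.

```python
import shutil
import subprocess
import sys


def run_cli(args: list[str]) -> tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # Let the caller decide what a non-zero exit means.
    )
    return completed.returncode, completed.stdout


# Demonstrate with the Python interpreter so the sketch runs anywhere.
code, output = run_cli([sys.executable, "-c", "print('ok')"])
print(code, output.strip())

# If SpecGuard is installed, the same wrapper can call its CLI.
if shutil.which("specguard"):
    help_code, help_text = run_cli(["specguard", "--help"])
    print(help_code)
```

Keeping the wrapper this thin means the CLI stays the single source of behavior, and the wrapper only passes arguments in and results out.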

GitHub PR Review Workflow

SpecGuard also includes a GitHub Actions-based PR review workflow.

When a spec package changes in a PR, it can automatically run SpecGuard Review and leave findings directly on the PR.

The OpenAI review path currently uses GitHub secrets such as:

SPECGUARD_OPENAI_API_KEY
SPECGUARD_PR_REVIEW_MODEL
SPECGUARD_REVIEW_SPEC_PATHS
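
A workflow step that depends on these values can fail fast when one is missing. The sketch below only checks presence. It makes no assumption about the format of each value, it assumes the secrets have been mapped to environment variables of the same name in the workflow file, and treating all three as expected is itself an assumption. The helper name `missing_settings` is hypothetical and not part of SpecGuard.

```python
import os

# Names taken from the list above; whether each one is mandatory is an assumption.
EXPECTED_SETTINGS = (
    "SPECGUARD_OPENAI_API_KEY",
    "SPECGUARD_PR_REVIEW_MODEL",
    "SPECGUARD_REVIEW_SPEC_PATHS",
)


def missing_settings(env=None) -> list[str]:
    """Return the expected names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_SETTINGS if not env.get(name)]


# Example mapping: one value set, one empty, one absent.
example_env = {
    "SPECGUARD_OPENAI_API_KEY": "placeholder",
    "SPECGUARD_PR_REVIEW_MODEL": "",
}
print(missing_settings(example_env))
```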

Current Status

This project is still very early and pre-beta.

I do not expect it to perfectly judge every specification.

Right now I am mainly interested in feedback around:

what kinds of specs this workflow fits well
where deterministic checks break down
which findings feel too noisy or too weak
whether PR enforcement would fit real engineering workflows

If you are already using AI coding agents in production workflows, I’d genuinely like to know:

what kinds of spec failures you see most often
and whether something like this would actually fit your development workflow

I’m especially interested in situations where the generated implementation looked correct, but the requirement itself was underspecified.

Feedback, issues, and PRs are all welcome.

KoreaNirsa / spec-guard

Validation-First Workflow (VFW) for AI-assisted development

SpecGuard

SpecGuard blocks weak specs before AI coding agents turn them into defective code.

SpecGuard is a Validation-First Workflow (VFW) for AI-assisted development. It turns specs into reviewed, testable, implementation-ready packages before AI coding begins.

It is not a prompt-to-code generator. SpecGuard helps you prepare an approved spec package before an external coding agent, such as Codex or Claude Code, writes application code.

Demo Video

Watch the full-resolution MP4 demo

The demo follows this flow:

Install SpecGuard with pip install spec-guard.
Copy the example spec with specguard example copy your-feature-name --force.
Insert a vulnerable spec. In v0.3.0, the packaged example intentionally includes a vulnerable spec by default so users can see a blocking SpecGuard Review.
Review the SpecGuard findings.
Fix the weak areas directly, or ask an AI assistant to strengthen the spec by giving it the SpecGuard Review findings.
Run SpecGuard Review again and confirm it reaches READY…
