<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bobby Blaine</title>
    <description>The latest articles on DEV Community by Bobby Blaine (@bobbyblaine).</description>
    <link>https://dev.to/bobbyblaine</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3807192%2Fcd40ffa3-6474-4165-b50a-09634135f24e.png</url>
      <title>DEV Community: Bobby Blaine</title>
      <link>https://dev.to/bobbyblaine</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bobbyblaine"/>
    <language>en</language>
    <item>
      <title>Slopsquatting: AI Hallucinations as Supply Chain Attacks</title>
      <dc:creator>Bobby Blaine</dc:creator>
      <pubDate>Thu, 05 Mar 2026 06:48:24 +0000</pubDate>
      <link>https://dev.to/bobbyblaine/slopsquatting-ai-hallucinations-as-supply-chain-attacks-1g31</link>
      <guid>https://dev.to/bobbyblaine/slopsquatting-ai-hallucinations-as-supply-chain-attacks-1g31</guid>
      <description>&lt;p&gt;One in five AI-generated code samples recommends a package that does not exist. Attackers are registering those phantom names on npm and PyPI with malware inside. The term for this is slopsquatting, and it is already happening.&lt;/p&gt;

&lt;h2&gt;What Slopsquatting Actually Is&lt;/h2&gt;

&lt;p&gt;Typosquatting bets on human misspellings. Slopsquatting bets on AI hallucinations. The term was &lt;a href="https://www.infosecurity-magazine.com/news/ai-hallucinations-slopsquatting/" rel="noopener noreferrer"&gt;coined by Seth Larson&lt;/a&gt;, Security Developer-in-Residence at the Python Software Foundation, to describe a specific attack: register the package names that LLMs consistently fabricate, then wait for developers to install them on an AI's recommendation.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.aikido.dev/blog/slopsquatting-ai-package-hallucination-attacks" rel="noopener noreferrer"&gt;USENIX Security 2025 study&lt;/a&gt; analyzed 576,000 code samples across 16 language models and found that roughly 20% recommended at least one non-existent package. The hallucinations fall into three categories: 51% are pure fabrications with no basis in reality, 38% are conflations of real packages mashed together (like &lt;code&gt;express-mongoose&lt;/code&gt;), and 13% are typo variants of legitimate names.&lt;/p&gt;

&lt;p&gt;The part that makes this exploitable is consistency. &lt;a href="https://www.aikido.dev/blog/slopsquatting-ai-package-hallucination-attacks" rel="noopener noreferrer"&gt;43% of hallucinated package names appeared every time across 10 repeated queries&lt;/a&gt;, and 58% appeared more than once. An attacker does not need to guess which names an LLM will invent. They ask the same question a few times, collect the phantom names, and register them.&lt;/p&gt;

&lt;p&gt;Traditional typosquatting registers names like &lt;code&gt;crossenv&lt;/code&gt; hoping someone will mistype &lt;code&gt;cross-env&lt;/code&gt;. Existing registry defenses flag new package names that are too close to popular ones. Hallucinated names bypass this entirely. They are often novel strings that no filter anticipates, because no real package was the starting point.&lt;/p&gt;
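&lt;p&gt;To see why, here is a minimal sketch of the kind of edit-distance check a similarity filter might run -- the popular-package list and the 0.85 threshold are illustrative, not any registry's actual defense. A typo variant scores high against a real name; a fabricated name scores low against everything:&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Illustrative similarity filter: flag new names that closely resemble
# popular packages. The list and the 0.85 threshold are made up for the sketch.
POPULAR = ["cross-env", "express", "react-codemod"]

def flagged(name: str, threshold: float = 0.85) -> bool:
    """True if the name is suspiciously close to a popular package."""
    return any(
        SequenceMatcher(None, name, p).ratio() >= threshold for p in POPULAR
    )

print(flagged("crossenv"))         # typo variant: caught by the filter
print(flagged("huggingface-cli"))  # fabricated name: sails through
```

&lt;p&gt;The fabricated name never resembled anything real, so there is nothing for the filter to compare it against.&lt;/p&gt;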

&lt;h2&gt;From Theory to 30,000 Downloads&lt;/h2&gt;

&lt;p&gt;Security researcher Bar Lanyado tested this by asking multiple LLMs for Python package recommendations. They consistently hallucinated a package called &lt;code&gt;huggingface-cli&lt;/code&gt;. Lanyado &lt;a href="https://www.bleepingcomputer.com/news/security/ai-hallucinated-code-dependencies-become-new-supply-chain-risk/" rel="noopener noreferrer"&gt;registered the name on PyPI as an empty placeholder&lt;/a&gt; with no malicious code. Within three months, it had over 30,000 downloads. All organic. All from developers (or their AI tools) running &lt;code&gt;pip install huggingface-cli&lt;/code&gt; based on a model's confident recommendation.&lt;/p&gt;

&lt;p&gt;Another package, &lt;code&gt;unused-imports&lt;/code&gt;, was confirmed malicious and still pulling &lt;a href="https://www.aikido.dev/blog/slopsquatting-ai-package-hallucination-attacks" rel="noopener noreferrer"&gt;roughly 233 downloads per week&lt;/a&gt; as of early 2026. The legitimate package is &lt;code&gt;eslint-plugin-unused-imports&lt;/code&gt;. Developers keep installing the wrong one because AI assistants keep suggesting it.&lt;/p&gt;

&lt;p&gt;A sharper example surfaced in January 2026. &lt;a href="https://www.aikido.dev/blog/slopsquatting-ai-package-hallucination-attacks" rel="noopener noreferrer"&gt;Aikido Security researchers found&lt;/a&gt; that &lt;code&gt;react-codeshift&lt;/code&gt;, a name conflating the real packages &lt;code&gt;jscodeshift&lt;/code&gt; and &lt;code&gt;react-codemod&lt;/code&gt;, appeared in a batch of LLM-generated agent skill files committed to GitHub. No human planted it. The hallucination entered version control through automated code generation, where other tools could pick it up and propagate it further.&lt;/p&gt;

&lt;h3&gt;How the Payload Works&lt;/h3&gt;

&lt;p&gt;The attack payload is typically a post-install script. When you run &lt;code&gt;npm install malicious-package&lt;/code&gt;, npm executes any &lt;code&gt;postinstall&lt;/code&gt; script defined in the package's &lt;code&gt;package.json&lt;/code&gt; automatically. The script steals API keys, cloud tokens, and SSH keys accessible from the local environment.&lt;/p&gt;
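&lt;p&gt;For illustration, a hypothetical malicious package needs nothing more than a &lt;code&gt;postinstall&lt;/code&gt; entry -- every field in this fragment is invented:&lt;/p&gt;

```json
{
  "name": "some-hallucinated-name",
  "version": "1.0.0",
  "scripts": {
    "postinstall": "node steal.js"
  }
}
```

&lt;p&gt;npm runs &lt;code&gt;steal.js&lt;/code&gt; the moment the install finishes -- no import, and no execution of the package by your own code, is required.&lt;/p&gt;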

&lt;p&gt;Some newer variants skip embedded code entirely, using &lt;a href="https://www.csoonline.com/article/4082195/malicious-packages-in-npm-evade-dependency-detection-through-invisible-url-links-report.html" rel="noopener noreferrer"&gt;npm's URL-based dependency support to fetch payloads externally&lt;/a&gt; at install time. The &lt;code&gt;package.json&lt;/code&gt; looks clean because the malicious code is downloaded at runtime. Static scanners see nothing.&lt;/p&gt;

&lt;p&gt;There is also a cross-ecosystem angle. The USENIX study found that &lt;a href="https://www.aikido.dev/blog/slopsquatting-ai-package-hallucination-attacks" rel="noopener noreferrer"&gt;8.7% of hallucinated Python package names turned out to be valid JavaScript packages&lt;/a&gt;. An attacker could register the same phantom name on both npm and PyPI, catching traffic from both ecosystems with a single fabricated name.&lt;/p&gt;

&lt;h2&gt;Defending Your Workflow&lt;/h2&gt;

&lt;p&gt;The best defense layers multiple checks. Here is what works today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lock your dependencies.&lt;/strong&gt; Use &lt;code&gt;package-lock.json&lt;/code&gt;, &lt;code&gt;yarn.lock&lt;/code&gt;, or &lt;code&gt;poetry.lock&lt;/code&gt; and commit them to version control. A lockfile pins exact versions and checksums, so even if a malicious package appears later under the same name, existing installs are not affected. Run &lt;code&gt;npm ci&lt;/code&gt; (not &lt;code&gt;npm install&lt;/code&gt;) in CI to enforce the lockfile strictly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify before you install.&lt;/strong&gt; When an AI suggests a package you have not used before, check it first. On npm, &lt;code&gt;npm info &amp;lt;package-name&amp;gt;&lt;/code&gt; shows the publisher, creation date, and weekly downloads. On PyPI, check pypi.org directly. A package created last week with no README, a single version, and no GitHub link is a red flag. Cross-reference the name against the library's official documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a scanning wrapper.&lt;/strong&gt; &lt;a href="https://github.com/AikidoSec/safe-chain" rel="noopener noreferrer"&gt;Aikido SafeChain&lt;/a&gt; is an open-source tool for npm, yarn, pnpm, pip, and other package managers that intercepts install commands and validates packages against threat intelligence before anything hits your machine. Install it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://github.com/AikidoSec/safe-chain/releases/latest/download/install-safe-chain.sh | sh
# Restart your terminal, then use npm/pip/yarn normally -- SafeChain intercepts automatically
npm install some-package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is free, requires no API tokens, and adds a few seconds per install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandbox autonomous agents.&lt;/strong&gt; If you use AI coding agents that install packages without confirmation, run them inside ephemeral containers or VMs. A malicious post-install script in a throwaway Docker container cannot exfiltrate your host credentials. At minimum, restrict your agent's permissions so it cannot run &lt;code&gt;npm install&lt;/code&gt; without your explicit approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disable post-install scripts for untrusted packages.&lt;/strong&gt; Run &lt;code&gt;npm install --ignore-scripts&lt;/code&gt; to skip all lifecycle scripts during installation, then selectively allow scripts for known-good packages. This blocks the most common slopsquatting payload vector at the cost of some manual setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a CI gate.&lt;/strong&gt; Integrate Software Composition Analysis into your pipeline. Tools like &lt;a href="https://github.com/owasp-dep-scan/dep-scan" rel="noopener noreferrer"&gt;OWASP dep-scan&lt;/a&gt; flag unknown or newly published packages before they reach production. Generate and sign Software Bills of Materials (SBOMs) for every build so each dependency is auditable. If a package does not appear in your organization's approved registry, the build should fail.&lt;/p&gt;

&lt;h2&gt;The Growing Attack Surface&lt;/h2&gt;

&lt;p&gt;The scale of this problem is what matters. As AI coding tools move from pair programming to autonomous agents that install dependencies without human review, the attack surface expands. A developer who reads a suggestion and checks the docs has some protection. An AI agent running &lt;code&gt;npm install&lt;/code&gt; in an automated loop does not.&lt;/p&gt;

&lt;p&gt;Registries have no automated defense against slopsquatting yet. npm's existing protections catch names similar to popular packages, but hallucinated names often bear no resemblance to real ones. They are novel strings that no similarity filter anticipates.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;react-codeshift&lt;/code&gt; case previews the feedback loop. An LLM hallucinates a package name. An AI agent writes code using it. That code gets committed to GitHub. A different LLM trains on or retrieves that code. The hallucination spreads further. Each step increases the download count, which makes the package look more legitimate, which makes the next LLM more likely to recommend it.&lt;/p&gt;

&lt;p&gt;Whether or not the registries catch up, the exposure falls on developers who accept AI package suggestions at face value.&lt;/p&gt;

&lt;h2&gt;Key Takeaway&lt;/h2&gt;

&lt;p&gt;Before installing any AI-suggested package, run &lt;code&gt;npm info &amp;lt;package-name&amp;gt;&lt;/code&gt; or check pypi.org to verify it exists, its age, and its publisher. For automated workflows, install &lt;a href="https://github.com/AikidoSec/safe-chain" rel="noopener noreferrer"&gt;SafeChain&lt;/a&gt; as a drop-in wrapper, and never let an AI agent run package installs outside a sandboxed environment. The 20% hallucination rate means one in five suggestions could be a trap.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>developertools</category>
      <category>codesecurity</category>
    </item>
    <item>
      <title>Context Engineering: CLAUDE.md and .cursorrules</title>
      <dc:creator>Bobby Blaine</dc:creator>
      <pubDate>Thu, 05 Mar 2026 06:48:15 +0000</pubDate>
      <link>https://dev.to/bobbyblaine/context-engineering-claudemd-and-cursorrules-dc7</link>
      <guid>https://dev.to/bobbyblaine/context-engineering-claudemd-and-cursorrules-dc7</guid>
      <description>&lt;p&gt;75% of engineers use AI tools daily. Most organizations see no measurable productivity gains from them. Faros AI sums it up: "Clever prompts make for impressive demos. Engineered context makes for shippable software." When your AI coding agent enters a session without knowing your naming conventions, architecture patterns, or which directories to never touch, every session starts cold. That overhead compounds across every developer on every task.&lt;/p&gt;

&lt;h2&gt;What Context Engineering Actually Is&lt;/h2&gt;

&lt;p&gt;Context engineering has replaced prompt engineering as the skill that separates productive AI coding assistants from expensive autocomplete. Martin Fowler defines it as "curating what the model sees so that you get a better result." In practice, that means treating your agent's information environment as infrastructure -- architecting everything the model can access: project conventions, git history, team standards, tool definitions, and documentation.&lt;/p&gt;

&lt;p&gt;The distinction from prompt engineering matters. Prompt engineering is a one-off act: write an instruction, get a response. Context engineering is a system: build the foundation that makes every session reliably productive, not just the occasional lucky one.&lt;/p&gt;

&lt;p&gt;Two tools dominate this space right now: &lt;strong&gt;CLAUDE.md&lt;/strong&gt; for Claude Code users and &lt;strong&gt;Cursor Rules&lt;/strong&gt; for Cursor users. Both serve the same function: a permanent, project-scoped instruction set that loads automatically at the start of every session. You configure it once; every subsequent session inherits it. You can debate whether calling this "engineering" is accurate for what amounts to editing a Markdown file. Meanwhile, the developers who figured it out months ago are shipping on first attempts.&lt;/p&gt;

&lt;h2&gt;How CLAUDE.md and Cursor Rules Work&lt;/h2&gt;

&lt;p&gt;CLAUDE.md is a Markdown file at the root of your project. Every time Claude Code opens a session in that directory, its contents are injected into context automatically -- an onboarding document for a developer with perfect recall and exact instruction-following.&lt;/p&gt;

&lt;p&gt;Claude Code provides four distinct context mechanisms, each with a different loading behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; -- always loaded, for project-wide universal conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rules&lt;/strong&gt; -- path-scoped guidance (e.g., rules that apply only to &lt;code&gt;*.test.ts&lt;/code&gt; files)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt; -- lazy-loaded resources triggered by the agent when a task matches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; -- deterministic scripts that run at lifecycle events like file save or commit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cursor uses a parallel architecture. The original &lt;code&gt;.cursorrules&lt;/code&gt; file is deprecated; the replacement is individual &lt;code&gt;.mdc&lt;/code&gt; files inside &lt;code&gt;.cursor/rules/&lt;/code&gt;, each scoped to a specific concern or file glob. One rule per concern keeps configuration focused and easier to maintain across a team.&lt;/p&gt;

&lt;p&gt;Faros AI's research surfaces a finding that applies to both tools: context ordering matters. Models attend more to content at the beginning and end of the context window. Critical constraints belong at the top; immediate task context and examples go at the end. Instructions buried in the middle of a 3,000-token CLAUDE.md get deprioritized.&lt;/p&gt;

&lt;p&gt;There is also a counterintuitive ceiling on context size. Stanford and UC Berkeley research found that model correctness drops beyond roughly 32,000 tokens, even for models advertising larger windows -- the "lost-in-the-middle" effect. Keep CLAUDE.md under 500 tokens (roughly 400 words). For injecting large codebases selectively, Repomix lets you pack specific directories into structured prompts rather than dumping entire repositories at once. The goal is precision, not volume.&lt;/p&gt;
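&lt;p&gt;A quick way to keep yourself honest about that budget, using the common rule of thumb of about 0.75 words per token (a heuristic, not a tokenizer -- real counts vary by model):&lt;/p&gt;

```python
def estimated_tokens(text: str) -> int:
    """Rough token count via the ~0.75 words-per-token heuristic."""
    return round(len(text.split()) / 0.75)

# Illustrative check against the article's suggested 500-token budget.
sample = "Components go in src/components, utilities in src/lib."
if estimated_tokens(sample) > 500:
    print("CLAUDE.md is over budget -- trim it")
```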

&lt;h2&gt;Building Your CLAUDE.md in 15 Minutes&lt;/h2&gt;

&lt;p&gt;Start with five sections. Keep each under 15 lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Project identity.&lt;/strong&gt; Name, purpose, and tech stack in three bullet points. The agent needs to know whether it is working on a TypeScript Next.js app or a Python FastAPI service before it modifies anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Architecture conventions.&lt;/strong&gt; Where do things live? One paragraph. "Components go in &lt;code&gt;src/components/&lt;/code&gt;, utilities in &lt;code&gt;src/lib/&lt;/code&gt;, tests colocated as &lt;code&gt;*.test.ts&lt;/code&gt; files adjacent to their source."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Coding standards.&lt;/strong&gt; What your linter does not catch: naming conventions, type rules, patterns to prefer or avoid. "Named exports only. No &lt;code&gt;any&lt;/code&gt; types -- use &lt;code&gt;unknown&lt;/code&gt; and narrow. Prefer composition over inheritance."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Off-limits without explicit instruction.&lt;/strong&gt; List files or directories the agent should never modify unprompted. Migrations, generated code, vendored libraries. This section alone prevents the most costly agent errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Testing requirements.&lt;/strong&gt; "All new functions need a unit test. Use vitest. Run &lt;code&gt;npm test&lt;/code&gt; before marking any task complete."&lt;/p&gt;

&lt;p&gt;A minimal example for a Node.js API project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project: Payments API&lt;/span&gt;

&lt;span class="gs"&gt;**Stack:**&lt;/span&gt; Node.js 22, TypeScript 5.7, Postgres 16, Prisma ORM

&lt;span class="gu"&gt;## Architecture&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; API routes in &lt;span class="sb"&gt;`src/routes/`&lt;/span&gt;, one file per resource
&lt;span class="p"&gt;-&lt;/span&gt; Business logic in &lt;span class="sb"&gt;`src/services/`&lt;/span&gt;, never in route handlers
&lt;span class="p"&gt;-&lt;/span&gt; All DB queries through Prisma -- no raw SQL

&lt;span class="gu"&gt;## Standards&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Named exports only. No &lt;span class="sb"&gt;`any`&lt;/span&gt; -- use &lt;span class="sb"&gt;`unknown`&lt;/span&gt; and narrow.
&lt;span class="p"&gt;-&lt;/span&gt; Env vars via &lt;span class="sb"&gt;`process.env`&lt;/span&gt;, validated with Zod at startup.

&lt;span class="gu"&gt;## Off-limits&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`prisma/migrations/`&lt;/span&gt; -- never edit directly
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`src/generated/`&lt;/span&gt; -- overwritten on next build

&lt;span class="gu"&gt;## Before finishing any task&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Run &lt;span class="sb"&gt;`npm test`&lt;/span&gt; and confirm all pass
&lt;span class="p"&gt;-&lt;/span&gt; Run &lt;span class="sb"&gt;`npm run lint`&lt;/span&gt; and fix all errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under 25 lines. An agent reading this produces dramatically fewer surprises than one starting cold.&lt;/p&gt;

&lt;p&gt;For Cursor, apply the same logic across three &lt;code&gt;.mdc&lt;/code&gt; files: one for general conventions, one for testing rules, one for framework-specific guidance. Each file stays under 100 lines and targets a specific concern.&lt;/p&gt;

&lt;p&gt;To validate your CLAUDE.md is working, run two identical tasks side by side, one in a project without the file and one with it. First-attempt accuracy is the clearest signal. If the agent correctly follows your naming conventions without being told in the prompt, the context file is doing its job.&lt;/p&gt;

&lt;h2&gt;The Limits to Know About&lt;/h2&gt;

&lt;p&gt;Context engineering improves reliability; it does not guarantee outcomes. Martin Fowler notes that results still depend on LLM interpretation, requiring probabilistic thinking rather than certainty. Human review stays essential regardless of context quality.&lt;/p&gt;

&lt;p&gt;Context files go stale. A CLAUDE.md written for an Express codebase that was later migrated to Fastify actively misleads the agent. This is worse than no file at all. A one-line note in your PR template ("Did you update CLAUDE.md?") costs ten seconds and prevents hours of confused agent sessions.&lt;/p&gt;

&lt;p&gt;Finally, good context does not fix vague task descriptions. Faros AI found that most engineering tickets lack sufficient clarity for reliable agent execution. Context quality and task specification quality reinforce each other. Neither substitutes for the other. The distinction matters: "engineered context makes for shippable software" only if the task tells the agent what to ship.&lt;/p&gt;

&lt;h2&gt;Key Takeaway&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;CLAUDE.md&lt;/code&gt; file in your project root today with five sections: project identity, architecture conventions, coding standards, off-limits files, and test requirements. Keep it under 30 lines. Run your next Claude Code session and observe the difference in first-attempt accuracy. The model does not change -- what it knows about your project does.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>developertools</category>
      <category>programming</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>Cognitive Debt: The Real Cost of AI-Generated Code</title>
      <dc:creator>Bobby Blaine</dc:creator>
      <pubDate>Thu, 05 Mar 2026 06:42:59 +0000</pubDate>
      <link>https://dev.to/bobbyblaine/cognitive-debt-the-real-cost-of-ai-generated-code-33ep</link>
      <guid>https://dev.to/bobbyblaine/cognitive-debt-the-real-cost-of-ai-generated-code-33ep</guid>
      <description>&lt;p&gt;Developers trust AI-generated code less than ever. Confidence in AI coding tools &lt;a href="https://www.secondtalent.com/resources/ai-generated-code-quality-metrics-and-statistics-for-2026/" rel="noopener noreferrer"&gt;dropped from 43% to 29%&lt;/a&gt; in eighteen months, yet usage climbed to 84%. That gap between belief and behavior has a name now: cognitive debt. And unlike technical debt, you cannot refactor your way out of it.&lt;/p&gt;

&lt;h2&gt;What Cognitive Debt Actually Means&lt;/h2&gt;

&lt;p&gt;Margaret-Anne Storey described the phenomenon in a &lt;a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" rel="noopener noreferrer"&gt;February 2026 blog post&lt;/a&gt;, building on Peter Naur's decades-old insight that a program is not its source code. A program is a theory. It is a mental model living in developers' minds that captures what the software does, how intentions became implementation, and what happens when you change things.&lt;/p&gt;

&lt;p&gt;Technical debt is a property of the codebase. You can measure it with linters and static analysis tools. Cognitive debt is a property of the people who work on the codebase. It accumulates when a team ships code faster than they can understand it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://simonwillison.net/2026/Feb/15/cognitive-debt/" rel="noopener noreferrer"&gt;Simon Willison put it plainly&lt;/a&gt;: he has gotten lost in his own AI-assisted projects, losing confidence in architectural decisions about code he technically authored. The code worked. His understanding of why it worked did not survive the pace at which it was produced.&lt;/p&gt;

&lt;p&gt;The distinction matters because cognitive debt is invisible until the moment it is not. Nobody notices the buildup. Then someone needs to modify a feature, and the team discovers that no one can explain how the system arrived at its current state. The &lt;a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" rel="noopener noreferrer"&gt;warning signs&lt;/a&gt; are quiet: developers hesitating before touching certain modules, growing reliance on one person's tribal knowledge, a creeping sense that parts of the system have become a black box.&lt;/p&gt;

&lt;h2&gt;Why AI Tools Accelerate the Problem&lt;/h2&gt;

&lt;p&gt;AI coding tools produce syntactically correct, well-structured code at a pace that makes deep review feel unnecessary. Most developers treat it that way. &lt;a href="https://www.secondtalent.com/resources/ai-generated-code-quality-metrics-and-statistics-for-2026/" rel="noopener noreferrer"&gt;67% report spending more time debugging AI-generated code&lt;/a&gt; than they expected, which suggests they skipped the understanding step and paid for it later.&lt;/p&gt;

&lt;p&gt;The production data is consistent. AI-generated code introduces &lt;a href="https://www.secondtalent.com/resources/ai-generated-code-quality-metrics-and-statistics-for-2026/" rel="noopener noreferrer"&gt;1.7x more total issues&lt;/a&gt; than human-written code across production systems. Maintainability errors run 1.64x higher. Code churn doubles in AI-assisted development, and copy-pasted code rises 48%.&lt;/p&gt;

&lt;p&gt;None of these numbers mean AI tools are bad. They mean the speed creates a specific failure mode: a gap between what gets committed and what gets understood. You can build a feature in an afternoon that would have taken a week. If you never internalized how it works, you traded velocity for comprehension. That trade compounds.&lt;/p&gt;

&lt;p&gt;The mechanism is subtle. &lt;a href="https://refactoring.fm/p/ai-and-cognitive-debt" rel="noopener noreferrer"&gt;Luca Rossi describes&lt;/a&gt; two cognitive modes that matter here: create mode, where you actively build mental connections between ideas, and review mode, where you assess existing work with lower energy. AI tools push developers from create mode into review mode by default. You stop solving problems and start evaluating solutions someone else produced. The issue is that reviewing AI output feels productive. You are reading code, spotting issues, making edits. But you are not building the mental model that lets you reason about the system independently. You are anchored to whatever the AI generated first.&lt;/p&gt;

&lt;p&gt;Storey describes a student team that hit this wall by week seven. They had been using AI to build fast and had working software. When they needed to make a simple change, the project stalled. Nobody could explain design rationales. Nobody understood how components interacted. The &lt;a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" rel="noopener noreferrer"&gt;shared theory of the program&lt;/a&gt; had evaporated, and with it, the team's ability to change anything safely.&lt;/p&gt;

&lt;p&gt;This is not limited to students. &lt;a href="https://www.pixelmojo.io/blogs/vibe-coding-technical-debt-crisis-2026-2027" rel="noopener noreferrer"&gt;75% of technology leaders&lt;/a&gt; are projected to face moderate or severe debt problems by 2026 because of AI-accelerated coding practices. The speed is real. So is the invoice.&lt;/p&gt;

&lt;h2&gt;Five Practices That Keep You in the Loop&lt;/h2&gt;

&lt;p&gt;Cognitive debt is not inevitable. Each of these habits trades a small amount of speed for a disproportionately large amount of understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Read every function before committing it.&lt;/strong&gt; &lt;a href="https://www.secondtalent.com/resources/ai-generated-code-quality-metrics-and-statistics-for-2026/" rel="noopener noreferrer"&gt;71% of developers already refuse to merge AI-generated code without manual review&lt;/a&gt;. The remaining 29% are accumulating cognitive debt on every commit. Line-by-line reading is the minimum. If you cannot explain what a function does to a colleague without referencing the prompt that generated it, you do not understand it well enough to own it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Document the why, not the what.&lt;/strong&gt; AI generates comments explaining what code does. Only you know why it exists. For every AI-generated change, add one line to your commit message or design doc explaining the decision behind it. What problem were you solving? What alternatives did you reject? What constraints shaped the approach? Six months from now, the code will still run. The reasoning behind it will be gone unless you write it down now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Code without AI one day a week.&lt;/strong&gt; &lt;a href="https://refactoring.fm/p/ai-and-cognitive-debt" rel="noopener noreferrer"&gt;Luca Rossi recommends&lt;/a&gt; setting aside regular time to solve problems entirely on your own. This is maintenance, not nostalgia. Pilots practice manual landings even when autopilot works. Developers should practice manual problem-solving even when Claude works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Write first, then let AI review.&lt;/strong&gt; The typical workflow is: prompt AI, review output. This creates &lt;a href="https://refactoring.fm/p/ai-and-cognitive-debt" rel="noopener noreferrer"&gt;anchoring bias&lt;/a&gt;. You become an editor of AI solutions rather than a thinker solving problems. Reverse the flow. Draft your approach first, then ask the AI to critique it. You keep your mental model intact and still get the AI's perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Run understanding checkpoints.&lt;/strong&gt; Storey recommends regular sessions where the team rebuilds shared knowledge through &lt;a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" rel="noopener noreferrer"&gt;code walkthroughs and architecture reviews&lt;/a&gt;. The test is simple: if only one person understands a module, you have a single point of failure. No amount of test coverage protects against a bus factor of one.&lt;/p&gt;

&lt;h2&gt;The Catch Nobody Mentions&lt;/h2&gt;

&lt;p&gt;There is no linter for "the team does not understand its own codebase." The warning signs are subjective. They get deprioritized until a deadline forces a change nobody can safely make.&lt;/p&gt;

&lt;p&gt;These practices also slow you down. That is the point, and it is why they get cut first. The entire appeal of AI coding tools is speed. Asking a team to go slower requires either institutional trust or a recent incident. Most organizations adopt these practices after the incident, not before.&lt;/p&gt;

&lt;p&gt;There is also an asymmetry in how cognitive debt gets noticed. The developer who ships ten features a week with AI looks productive. The developer who ships five but understands all of them looks slow. The difference only becomes visible when something breaks, and by then the fast developer has moved on to the next project. Cognitive debt is the kind of problem that punishes the people who inherit it, not the people who created it.&lt;/p&gt;

&lt;h2&gt;Key Takeaway&lt;/h2&gt;

&lt;p&gt;Pick one AI-generated file you shipped last week. Try to explain every function in it without reading the source code. If you cannot do it fluently, you already have cognitive debt accumulating. Start tomorrow with practice number one: read every generated function before you commit it. The ten minutes it costs per session prevents the afternoon you lose next month when something breaks and nobody remembers why it was built that way. Cognitive debt is the one kind of debt that gets cheaper the earlier you start paying it down.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>developertools</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Spec-Driven Development: Write the Spec, Not the Code</title>
      <dc:creator>Bobby Blaine</dc:creator>
      <pubDate>Thu, 05 Mar 2026 06:42:16 +0000</pubDate>
      <link>https://dev.to/bobbyblaine/spec-driven-development-write-the-spec-not-the-code-2p5o</link>
      <guid>https://dev.to/bobbyblaine/spec-driven-development-write-the-spec-not-the-code-2p5o</guid>
      <description>&lt;p&gt;Vibe coding got developers building fast. It also got them rebuilding fast. The pattern: describe what you want, accept the AI's output, ship it. Then spend the next week debugging assumptions the model made because you never stated them. Spec-driven development is the emerging counter-approach, and in early 2026, three major platforms shipped dedicated tooling for it: GitHub's Spec Kit, AWS Kiro, and Tessl Framework. The idea is simple: write a structured specification first, then let the AI generate code that follows it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Spec-Driven Development Actually Is
&lt;/h2&gt;

&lt;p&gt;Spec-driven development (SDD) inverts the vibe coding workflow. Instead of prompting an AI agent with a loose description and iterating on whatever it produces, you write a structured, behavior-oriented specification that defines expected behavior and constraints upfront. The AI agent receives this spec as its primary input and generates code to match.&lt;/p&gt;

&lt;p&gt;The core insight is that language models are excellent at pattern completion but bad at mind reading. When you tell an AI agent "build me a REST API for user management," you are leaving thousands of decisions unstated: authentication method, error response format, pagination strategy, rate limiting, input validation rules. The agent fills those gaps with its training data, which may or may not match your actual requirements.&lt;/p&gt;
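
&lt;p&gt;To make that gap concrete, here are two equally valid 404 payloads an agent could choose for the same vague prompt. Both shapes are illustrative; the point is that nothing in the prompt selects between them:&lt;/p&gt;

```python
# Two equally plausible 404 payloads for "build me a REST API for
# user management." Nothing in the prompt picks one; only a spec does.
# Field names and values are illustrative.
flat_style = {"error": "user not found", "code": 404}

rfc7807_style = {  # "problem details" shape, per RFC 7807
    "type": "about:blank",
    "title": "Not Found",
    "status": 404,
    "detail": "no user with id 42",
}

# A client written against one shape breaks against the other.
assert flat_style["code"] == rfc7807_style["status"] == 404
print("same status, incompatible shapes")
```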

&lt;p&gt;A spec eliminates this guesswork. It makes requirements explicit, testable, and reviewable before a single line of code is generated. Three levels of adoption exist: spec-first (write specs for immediate tasks), spec-anchored (maintain specs as living documents alongside code), and spec-as-source (specs become the canonical artifact, code is entirely generated). Most teams today are at spec-first, which is where the practical payoff starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Tools, Three Approaches
&lt;/h2&gt;

&lt;p&gt;GitHub Spec Kit, Kiro, and Tessl each interpret SDD differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Spec Kit&lt;/strong&gt; is the most customizable. It is an open-source CLI that integrates with Copilot, Claude Code, and Gemini CLI through slash commands. The workflow has four phases: &lt;code&gt;/specify&lt;/code&gt; generates a detailed specification from your description, &lt;code&gt;/plan&lt;/code&gt; creates a technical implementation plan given your stack and constraints, &lt;code&gt;/tasks&lt;/code&gt; breaks the plan into small reviewable chunks, and the agent implements each task sequentially. Spec Kit enforces architectural rules through what it calls a "constitutional foundation" -- a set of project-level constraints the agent must obey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kiro&lt;/strong&gt; is the simplest entry point. AWS's standalone AI IDE produces three markdown documents: requirements, design, and tasks. The workflow is linear and lightweight. The tradeoff is verbosity: Kiro once generated 16 acceptance criteria for a simple bug fix. The overhead can exceed the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tessl Framework&lt;/strong&gt; is the most ambitious. Still in closed beta, it pursues spec-as-source: the tool reverse-engineers specs from existing code and maintains a 1:1 mapping between spec files and code files, marking generated code with &lt;code&gt;// GENERATED FROM SPEC - DO NOT EDIT&lt;/code&gt; comments. If it works as intended, developers would maintain only specs, never touching code directly.&lt;/p&gt;

&lt;p&gt;The practical reality, across all three tools, is that AI agents still follow instructions inconsistently. A spec reduces the gap between intent and implementation, but it does not eliminate non-determinism. The spec is a guardrail, not a guarantee.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with Spec Kit
&lt;/h2&gt;

&lt;p&gt;Spec Kit is the most accessible tool today because it is open source and works with the agent you are already using. Here is the shortest path from zero to a spec-driven workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install Spec Kit.&lt;/strong&gt; It is an open-source CLI; install it, then initialize it in your project with &lt;code&gt;specify init&lt;/code&gt;. This creates a &lt;code&gt;.specify/&lt;/code&gt; directory with templates and configuration files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Write your first spec.&lt;/strong&gt; Run &lt;code&gt;/specify&lt;/code&gt; and describe the feature you want to build. Be specific about behavior, constraints, and edge cases. The agent generates a structured specification you can review, edit, and approve before any code is written.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Generate a plan.&lt;/strong&gt; Run &lt;code&gt;/plan&lt;/code&gt; with your tech stack and constraints. The output is a step-by-step implementation plan that references your spec at every point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Break it into tasks.&lt;/strong&gt; Run &lt;code&gt;/tasks&lt;/code&gt; to split the plan into small, reviewable work units. Each task has a clear objective and acceptance criteria pulled from the spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Implement.&lt;/strong&gt; The agent works through tasks sequentially, using the spec and plan as context. You review each completed task against the spec.&lt;/p&gt;

&lt;p&gt;The difference between a spec-first prompt and a vibe coding prompt for the same feature is worth seeing. A vibe coding prompt reads: "Build a rate limiter middleware for Express." A spec-first prompt reads: "Implement the rate limiter defined in &lt;code&gt;.spec/features/rate-limiter.md&lt;/code&gt;, which specifies a sliding window algorithm, 100 requests per minute per API key, 429 responses with &lt;code&gt;Retry-After&lt;/code&gt; headers, and Redis-backed state for horizontal scaling." The second prompt leaves no room for the agent to improvise on decisions that should be yours.&lt;/p&gt;
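
&lt;p&gt;For reference, a hedged sketch of what such a spec file might contain -- the section names and edge-case decisions below are illustrative, not an actual Spec Kit template:&lt;/p&gt;

```markdown
# Rate Limiter Middleware

## Behavior
- Sliding window, 100 requests per minute, keyed by API key
- Over the limit: respond 429 and set a Retry-After header (seconds)

## Constraints
- Counter state lives in Redis so limits hold across replicas
- Window precision: 1 second; clock source is the Redis server

## Edge cases
- Missing API key: reject with 401 before counting the request
- Redis unreachable: fail open and log, never block traffic
```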

&lt;p&gt;The key difference from vibe coding is where you spend your time. In vibe coding, you spend it iterating on code after generation. In SDD, you spend it writing the spec before generation. The total time is often comparable, but the spec is reusable and serves as documentation after the project ships. Spec-driven projects in production reinforce this. Anthropic used GCC test suites to spec a Rust-based C compiler. Vercel used curated shell script tests for a TypeScript bash emulator. Pydantic applied the same approach to a Python sandbox for AI agents. A well-defined spec plus an existing test suite gets an AI agent far on a greenfield build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where SDD Breaks Down
&lt;/h2&gt;

&lt;p&gt;SDD is not a universal improvement. Several friction points temper the hype.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review overhead scales with spec verbosity.&lt;/strong&gt; Kiro's 16 acceptance criteria for a bug fix is not an edge case. Spec Kit produces extensive markdown for mid-sized features. If reviewing the spec takes longer than reviewing the code would have, the process is working against you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration fits poorly into upfront specification.&lt;/strong&gt; Exploratory work (prototyping, UI experiments, data pipeline debugging) benefits from fast, loose iteration. Writing a detailed spec before you know what you are building adds latency to a process that should be cheap and fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-determinism persists.&lt;/strong&gt; Even with a detailed spec, agents sometimes ignore directives or over-interpret them. The spec improves consistency but does not solve the fundamental reliability problem. Vercel's CTO captured this with a useful metaphor: "Software is free now. Free as in puppies." Generation is cheap. Maintenance is where the work lives.&lt;/p&gt;

&lt;p&gt;The sweet spot for SDD in its current form is greenfield features with well-understood requirements: new API endpoints, CRUD modules, integration layers. It is less useful for exploratory work or for codebases where the existing architecture is poorly documented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;p&gt;Before your next feature, try writing a one-page spec before prompting your AI agent. Define the inputs, outputs, constraints, and edge cases in plain text. Then pass that spec as context alongside your prompt. You do not need Spec Kit or Kiro to start -- a markdown file works. The goal is to move the ambiguity from code review to spec review, where it is cheaper to fix. If the workflow clicks, install Spec Kit and formalize the process.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>developertools</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>Chrome DevTools MCP: Give Your AI Agent Eyes in the Browser</title>
      <dc:creator>Bobby Blaine</dc:creator>
      <pubDate>Thu, 05 Mar 2026 06:42:10 +0000</pubDate>
      <link>https://dev.to/bobbyblaine/chrome-devtools-mcp-give-your-ai-agent-eyes-in-the-browser-4oho</link>
      <guid>https://dev.to/bobbyblaine/chrome-devtools-mcp-give-your-ai-agent-eyes-in-the-browser-4oho</guid>
      <description>&lt;p&gt;AI coding assistants write frontend code they never see rendered. They debug console errors from stack traces you copy-paste into a chat window. &lt;a href="https://developer.chrome.com/blog/chrome-devtools-mcp" rel="noopener noreferrer"&gt;Google's Chrome DevTools MCP server&lt;/a&gt; eliminates this blindfold by connecting your AI agent directly to a live Chrome session, giving it access to DOM inspection, console logs, network requests, and performance traces through natural language.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the DevTools MCP Server Does
&lt;/h2&gt;

&lt;p&gt;Chrome DevTools MCP is an official Google project that exposes Chrome's full debugging surface as &lt;a href="https://github.com/ChromeDevTools/chrome-devtools-mcp" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; tools. When connected, your coding agent can navigate to any URL, inspect the rendered DOM, read console errors with source-mapped stack traces, capture screenshots, analyze network requests, and simulate user interactions like clicks and form submissions.&lt;/p&gt;

&lt;p&gt;Under the hood, it uses the &lt;a href="https://addyosmani.com/blog/devtools-mcp/" rel="noopener noreferrer"&gt;Chrome DevTools Protocol via Puppeteer&lt;/a&gt;. The server runs locally with an isolated browser profile, so your existing Chrome tabs and sessions stay untouched. Think of it as giving your agent the same DevTools panel you use manually, except the agent can act on what it finds without you switching windows.&lt;/p&gt;

&lt;p&gt;The toolset covers what you would normally do by hand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Console messages&lt;/strong&gt;: Retrieve errors and warnings with full source-mapped stack traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOM &amp;amp; CSS inspection&lt;/strong&gt;: Read element styles, computed layouts, accessibility attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network analysis&lt;/strong&gt;: List requests, check response codes, identify CORS issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance traces&lt;/strong&gt;: Record and extract Largest Contentful Paint, layout shifts, long tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User simulation&lt;/strong&gt;: Click buttons, fill forms, hover elements, navigate between pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device emulation&lt;/strong&gt;: Throttle CPU, simulate slow networks, resize viewports to any dimension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical effect is what Addy Osmani calls a closed debugging loop. Your agent writes code, opens it in Chrome, checks whether it actually works, reads the errors if it doesn't, and fixes them. The cycle that used to involve two windows and a copy-paste now happens inside one conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Blind Suggestions to Verified Fixes
&lt;/h2&gt;

&lt;p&gt;Without browser access, an AI agent debugging a layout issue is pattern-matching against your description of the problem. With Chrome DevTools MCP connected, the agent inspects the actual computed styles, identifies the specific CSS property causing the overflow, applies a fix, and verifies the rendered result by rechecking the page. Every diagnostic step is evidence-based rather than speculative.&lt;/p&gt;

&lt;p&gt;CyberAgent, a Japan-based tech company, stress-tested this workflow on their &lt;a href="https://developer.chrome.com/blog/autofix-runtime-devtools-mcp" rel="noopener noreferrer"&gt;Spindle design system&lt;/a&gt;. They pointed an AI agent at 32 UI components spread across 236 Storybook stories. The agent navigated to every single story, read the console output at each one, identified runtime errors and warnings, generated targeted fixes, and validated each fix by rechecking the browser state afterward. In roughly one hour, it achieved 100% audit coverage with zero false negatives, catching one runtime error and two warnings across the entire component library. The concrete fixes shipped in two pull requests. As one of their engineers put it, the benefit was straightforward: "offload runtime errors and warning checks that I used to do manually in the browser."&lt;/p&gt;

&lt;p&gt;That coverage is the real story. Manually checking console output across 236 component stories is the kind of work that lands on a backlog ticket labeled "tech debt" and stays there until something breaks in production. An agent running DevTools MCP handles it mechanically.&lt;/p&gt;

&lt;p&gt;Performance debugging follows the same closed-loop pattern. Instead of asking your agent "how do I improve my LCP?" and getting generic advice about image optimization, you ask it to record an actual &lt;a href="https://developer.chrome.com/blog/chrome-devtools-mcp" rel="noopener noreferrer"&gt;performance trace&lt;/a&gt; on your staging URL, extract the LCP metric, identify the specific blocking resource, and suggest a fix grounded in measured data. The difference between a guess and a measurement is the difference between "try lazy-loading your images" and "your 2.3MB hero image at &lt;code&gt;/assets/banner.webp&lt;/code&gt; is blocking LCP at 4.2 seconds."&lt;/p&gt;

&lt;p&gt;Network debugging works the same way. If your API calls are silently failing, you do not need to open the Network tab and filter requests yourself. Ask the agent to &lt;a href="https://blog.logrocket.com/debugging-with-chrome-devtools-mcp/" rel="noopener noreferrer"&gt;list all network requests&lt;/a&gt; on the page, filter for non-200 status codes, and show the response bodies. CORS misconfigurations, missing auth headers, and 404s from incorrect API paths all surface in the agent's response with exact request details you can act on immediately.&lt;/p&gt;
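
&lt;p&gt;Conceptually, the filtering step the agent performs is trivial. A Python sketch, with made-up request records standing in for the tool's output:&lt;/p&gt;

```python
# Hypothetical request records, shaped like what the agent reads back
# from a list-network-requests call. Field names are illustrative.
requests = [
    {"method": "GET", "url": "/api/users", "status": 200},
    {"method": "GET", "url": "/api/orders", "status": 404},
    {"method": "POST", "url": "/api/login", "status": 0},  # CORS-blocked
]

# Keep everything that did not come back 200.
failures = [r for r in requests if r["status"] != 200]
for r in failures:
    print(f'{r["method"]} {r["url"]} -> {r["status"]}')
```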

&lt;p&gt;&lt;a href="https://addyosmani.com/blog/devtools-mcp/" rel="noopener noreferrer"&gt;As Addy Osmani noted&lt;/a&gt;, Chrome DevTools MCP transforms "AI coding assistants from static suggestion engines into loop-closed debuggers." CyberAgent apparently agreed. They now list the DevTools MCP server as their &lt;a href="https://developer.chrome.com/blog/autofix-runtime-devtools-mcp" rel="noopener noreferrer"&gt;default debugging tool in their CLAUDE.md&lt;/a&gt;. Experiment to team standard in one sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chrome DevTools MCP Setup in Five Minutes
&lt;/h2&gt;

&lt;p&gt;The server requires &lt;a href="https://github.com/ChromeDevTools/chrome-devtools-mcp" rel="noopener noreferrer"&gt;Node.js v20.19 or newer&lt;/a&gt; and a current Chrome stable build. Installation takes one command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude mcp add chrome-devtools -- npx chrome-devtools-mcp@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; (Settings &amp;gt; MCP &amp;gt; Add New Server):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"chrome-devtools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chrome-devtools-mcp@latest"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same JSON config works for &lt;a href="https://developer.chrome.com/blog/chrome-devtools-mcp" rel="noopener noreferrer"&gt;VS Code Copilot, Cline, and Gemini CLI&lt;/a&gt;. No additional dependencies beyond Node.js and Chrome. The server downloads on first run via npx, so there is nothing to install globally or maintain across updates.&lt;/p&gt;

&lt;p&gt;To verify the connection is live, ask your agent: "Navigate to web.dev and check the LCP score." If it opens Chrome, records a performance trace, and returns a number, the server is working.&lt;/p&gt;

&lt;p&gt;For daily use, the most productive starting prompt is: "Open localhost:3000, check the console for errors, and fix any you find." That single instruction triggers the full closed loop: navigate, inspect, diagnose, edit code, re-verify. The workflow that used to span two monitors and a clipboard now runs in one conversation thread.&lt;/p&gt;

&lt;p&gt;Beyond error fixing, the performance workflow is worth building into your regular process. Before deploying frontend changes, ask your agent to run a performance trace on the updated page and compare LCP, CLS, and INP metrics against the baseline. This catches performance regressions before they reach production and gives you specific numbers for your pull request description.&lt;/p&gt;
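
&lt;p&gt;The comparison itself is simple enough to sketch. The metric values and the 10% regression threshold below are invented for illustration:&lt;/p&gt;

```python
# Compare current Core Web Vitals against a stored baseline.
# Values and the 10% threshold are made up for the example.
baseline = {"LCP": 2.1, "CLS": 0.05, "INP": 180}  # s, score, ms
current = {"LCP": 2.9, "CLS": 0.04, "INP": 210}

# Flag any metric that regressed by more than 10% versus baseline.
regressions = {
    name: (baseline[name], value)
    for name, value in current.items()
    if value > baseline[name] * 1.10
}

for name, (before, after) in regressions.items():
    print(f"{name} regressed: {before} -> {after}")
```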

&lt;h2&gt;
  
  
  What to Watch For
&lt;/h2&gt;

&lt;p&gt;The server is in &lt;a href="https://developer.chrome.com/blog/chrome-devtools-mcp" rel="noopener noreferrer"&gt;public preview&lt;/a&gt;. Some tools occasionally time out, with &lt;code&gt;resize_page&lt;/code&gt; as the &lt;a href="https://blog.logrocket.com/debugging-with-chrome-devtools-mcp/" rel="noopener noreferrer"&gt;most common offender&lt;/a&gt;. The agent usually retries with an alternative approach, but persistent failures may require restarting the MCP server process.&lt;/p&gt;

&lt;p&gt;Visual judgment stays with you. The agent reads DOM structure and console output with precision, but it cannot assess whether a design looks good to a human eye. It can tell you that a &lt;code&gt;div&lt;/code&gt; has &lt;code&gt;overflow: hidden&lt;/code&gt; clipping its children. It cannot tell you the page feels cramped. Screenshots help bridge this gap, though interpretation quality varies by model.&lt;/p&gt;

&lt;p&gt;The isolated browser profile is both a feature and a limitation. Your existing cookies and authenticated sessions are not available to the agent. If your app requires login, you need to authenticate within the MCP-managed session first or &lt;a href="https://github.com/ChromeDevTools/chrome-devtools-mcp" rel="noopener noreferrer"&gt;configure the server to reuse a Chrome profile directory&lt;/a&gt; with existing credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;p&gt;Run &lt;code&gt;claude mcp add chrome-devtools -- npx chrome-devtools-mcp@latest&lt;/code&gt;, then ask your agent to check &lt;code&gt;localhost:3000&lt;/code&gt; for console errors. You will go from copy-pasting stack traces to a closed AI debugging loop in under five minutes. The gap between "AI writes the code" and "AI verifies the code actually works" is where most frontend debugging time quietly disappears.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>developertools</category>
      <category>chromedevtools</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
