DEV Community: Signadot

Open Source Maintainers Are Drowning in AI-Generated Pull Requests. Enterprise Teams Are Next.

Signadot — Tue, 26 May 2026 13:45:25 +0000

Read this article on Signadot.

Something is breaking in open source, and it should alarm every engineering leader pushing coding agents into their organization.

Over the past year, open-source maintainers have been overwhelmed by a flood of low-quality, AI-generated pull requests. Verbose changes with nonsensical descriptions. Contributions that submitters cannot explain when questioned. Code that looks plausible on the surface but crumbles under review.

The Jazzband collective, a well-known Python project ecosystem, was forced to shut down entirely this year. Its lead maintainer cited the unsustainable volume of AI-generated spam PRs and issues as a primary driver.

Other projects are feeling the same pressure. Rémi Verschelde, who maintains the Godot game engine, has described triaging AI slop as draining and demoralizing. Daniel Stenberg, the creator of curl, ended the project's bug bounty program because it had become a magnet for low-effort AI submissions.

The pattern is consistent: maintainers spend a disproportionate share of their time evaluating code that should never have been submitted, crowding out genuine contributions and accelerating burnout.

This is not just an open-source problem. It is a preview of what is coming for enterprise engineering teams, and most of them are not ready.

The asymmetry no one is planning for

The core issue is a throughput asymmetry. AI coding agents have made code generation dramatically cheaper and faster. A developer working with an agent can produce five, six, or more pull requests in a day.

A nontechnical team member using a coding agent for the first time can generate working-looking code in minutes. But the review, validation, and integration of that code have not gotten any faster. Sixty percent of unpaid volunteer maintainers already struggle to keep up. Now the volume has multiplied.

Open-source maintainers experience this asymmetry in its most extreme form because their repositories are open to the world. Anyone can point an agent at an open GitHub issue and generate a plausible-looking pull request in seconds.

As one contributor put it, if an agent-generated patch were really what maintainers wanted, they could produce it themselves. The value of a contribution was never just the code. It was the understanding behind it, the testing that validated it, and the human judgment that shaped it.

Enterprise teams face the same structural problem behind a corporate wall. When organizations mandate the adoption of coding agents, they accelerate one end of the pipeline while leaving the other unchanged. The reviewer inherits the full burden of determining whether that code actually works, integrates correctly, handles edge cases, and does not introduce regressions. Research from METR found that experienced developers were actually 19 percent slower when using AI tools, largely due to what researchers described as comprehension debt, in which developers understand less of their own codebase over time as AI-generated code accumulates. A CodeRabbit analysis of 470 open-source pull requests found approximately 1.7 times more issues in AI-co-authored pull requests than in those written entirely by humans.

The math does not work in the reviewer’s favor. And the numbers get worse as complexity increases.

Why AI-assisted review is not enough

The natural response to a surge of AI-generated code is to deploy AI agents for code review. Tools that summarize pull requests, flag issues, and assess quality are proliferating rapidly. For straightforward changes, they may be enough. An AI reviewer can catch style violations, spot anti-patterns, and surface obvious bugs faster than a human scanning line by line.

But for cloud-native distributed systems with a dozen or more interdependent services, AI-assisted review hits the same wall as traditional CI pipelines: neither can tell you whether a change actually works in context. A modification to one service might look correct in isolation while silently breaking a contract with a downstream dependency. An agent-generated refactor might introduce a race condition that only manifests under realistic traffic patterns.

These problems require running the code in an environment that resembles production, and no amount of static analysis, whether human or AI-powered, can substitute for that.

The validation bottleneck sits earlier than most teams realize, in the gap between when code is written and when a reviewer can confidently evaluate it. If a developer generates six pull requests in a day and each one requires 30+ minutes of manual validation, they spend most of their time managing a deployment queue rather than building software. The agent made writing faster. AI review made triage faster. Everything downstream stayed slow.
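The arithmetic behind that claim is easy to sketch. The figures for pull requests per day and minutes per pull request come from the paragraph above; the eight-hour working day is an assumption added here for illustration.

```python
def validation_share(prs_per_day: int, minutes_per_pr: float,
                     workday_hours: float = 8.0) -> float:
    """Fraction of a working day consumed by manual validation.

    workday_hours defaults to 8, an assumption not stated in the article.
    """
    return (prs_per_day * minutes_per_pr) / (workday_hours * 60)


# Six PRs at the 30-minute floor, then at 45 minutes each.
floor_share = validation_share(prs_per_day=6, minutes_per_pr=30)
heavier_share = validation_share(prs_per_day=6, minutes_per_pr=45)

print(f"30 min per PR: {floor_share:.1%} of the day")
print(f"45 min per PR: {heavier_share:.1%} of the day")
```

Thirty minutes is the floor in "30+ minutes"; under these assumptions, validation passes half of the working day once each pull request takes about 45 minutes.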

What the open-source crisis is actually telling us

Open-source repositories are experiencing the full, unfiltered force of AI-accelerated code generation because they cannot control who contributes. Maintainers have responded with a mix of stricter contributor policies, reputation systems, platform tools that gate or filter pull requests, and, in some cases, simply shutting projects down.

Enterprise teams have more control over who submits code, but less visibility into whether the person who submitted it understood it. An agent-generated PR from an internal developer or a nontechnical team member looks the same in a review queue as a carefully crafted change from a senior engineer. Without additional context, the reviewer has no way to distinguish between the two or quickly validate whether the code does what it claims to do.

The open-source response to this crisis is instructive. Projects that are weathering the storm are not just adding policies. They are investing in mechanisms that shift the burden of proof back to the contributor, requiring demonstration that the code works rather than asking the reviewer to prove that it does not.

How enterprises should respond

The gap between code generation and code validation needs to be closed. Every pull request should arrive with evidence that it works, not just a claim.

First, validation needs to move into the development loop. Developers and agents need access to isolated, production-like environments where changes can be validated against real service dependencies before a PR is even opened. The review should start with proof of working code.

Second, the review process needs to evolve. When agents produce thousands of lines per hour, line-by-line review does not scale. The shift is from inspecting code to evaluating evidence of behavior. Did the change work against realistic service interactions? Does the behavior match the specification?

Third, organizations need to treat AI-generated code as draft material. This means tagging AI-authored changes, tracking defect rates separately, and building review workflows that account for code the submitter may not fully understand.

Finally, accountability cannot be outsourced to an AI. The engineer who guides the agent remains responsible for what ships. This means giving them tools to validate agent-generated code inside the development loop so they can submit PRs with confidence rather than relying on the reviewer to catch problems.
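As a rough illustration of tagging AI-authored changes and tracking their defect rates separately, the sketch below groups merged changes by an authorship tag. The record fields and the sample data are hypothetical and exist only to show the shape of the calculation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MergedChange:
    pr_id: int
    ai_authored: bool    # hypothetical tag applied when the PR is opened
    caused_defect: bool  # hypothetical flag set when a defect is traced to the PR


def defect_rates(changes):
    """Return the defect rate per authorship category ('ai' or 'human')."""
    totals = defaultdict(int)
    defects = defaultdict(int)
    for change in changes:
        key = "ai" if change.ai_authored else "human"
        totals[key] += 1
        defects[key] += int(change.caused_defect)
    return {key: defects[key] / totals[key] for key in totals}


# Made-up sample data, not measurements.
sample = [
    MergedChange(1, ai_authored=True, caused_defect=True),
    MergedChange(2, ai_authored=True, caused_defect=False),
    MergedChange(3, ai_authored=False, caused_defect=False),
    MergedChange(4, ai_authored=False, caused_defect=False),
]
print(defect_rates(sample))
```

Keeping the two rates apart lets a team see whether its review workflow is catching the problems specific to agent-generated changes.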

The warning is already here

Open-source repos are the canary. Their openness means they absorb every externality of cheap code generation first and most visibly. But the underlying problem, the imbalance between the speed of producing code and the speed of validating it, is not unique to open source. It is structural.

Enterprise organizations that invest heavily in coding agents without equally investing in the infrastructure to validate what those agents produce are building a pipeline that gets faster at creating work and no faster at finishing it. The PRs will pile up. The review times will stretch. The defect rates will climb. The engineers tasked with reviewing agent output will burn out for the same reasons open-source maintainers are burning out today.

The tooling to close this gap exists in pieces. Isolated preview environments, automated end-to-end validation, smarter review workflows, and better observability into agent-generated changes are all solvable problems. Solutions like Signadot are already helping teams validate changes against real service dependencies before code ever reaches a reviewer.

The question is whether organizations will learn the lesson that open-source maintainers are teaching us in real time, or wait until they feel the pain themselves, risking the loss of good engineers and their competitive advantage. Investing in these capabilities before the backlog becomes unmanageable will be a key differentiator between teams that benefit from coding agents and those that find themselves in crisis.

Why Claude Needs a Real Environment to Validate Cloud-Native Code

Signadot — Mon, 11 May 2026 16:03:50 +0000

Boris Cherny, who built Claude Code, recently shared on X how to get the most out of it following the release of Opus 4.7. He left the most important tip for last:

“Make sure Claude has a way to verify its work. This has always been a way to 2-3x what you get out of Claude, and with 4.7 it’s more important than ever.”

That observation describes a pattern establishing itself as the standard model for developing software with coding agents. It is also a pattern that is easy to implement locally, against a single codebase with limited dependencies.

It is much more difficult against a cloud-native application with a complex topology. Closing that gap is the difference between coding agents that accelerate teams and those that bury them in review queues and manual validation.

The pattern emerging across coding agents

Boris’s tip mirrors a pattern emerging across the industry. Every major coding agent has shipped infrastructure in the last six months whose explicit purpose is to let the agent check its own work before handing it off.

OpenAI’s Codex iterates in a loop within an isolated cloud container, editing code, running checks, and validating its changes against commands specified in the team’s AGENTS.md file. The validation loop is the product, not a feature on the side.

GitHub’s Copilot coding agent runs in an ephemeral GitHub Actions environment that automatically executes the repo’s tests, linters, CodeQL, and secret scanning on every task. If anything fails, Copilot attempts to fix it before marking the task ready for review. Cursor’s cloud agents run in sandboxed VMs with shell and browser access so the agent can exercise its changes end-to-end and produce screenshots, videos, and logs as evidence of what it tested.

Claude Code exposes the same shape as composable primitives. Stop hooks prevent the agent from completing a task until tests pass. Subagents can run dedicated validation passes that inspect work without modifying it. The verification loop is something the team assembles, but the building blocks are explicit and well-supported.
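A Stop hook of this kind might look roughly like the sketch below. It assumes the hook contract as described in Claude Code's hooks documentation at the time of writing: the hook receives a JSON payload on stdin, exit code 2 blocks the stop and returns stderr to the model, and a `stop_hook_active` field signals that a previous Stop hook already forced a continuation. Treat those details as assumptions and check them against the current documentation. The test command is a placeholder.

```python
import json
import subprocess
import sys

# Assumption: replace with the project's own verification command.
TEST_COMMAND = ["pytest", "-q"]


def stop_decision(payload: dict, tests_pass) -> int:
    """Return the exit code the hook should use.

    0 lets the agent stop; 2 asks it to keep working.
    """
    # If a Stop hook already forced a continuation, let the agent stop
    # rather than risk looping forever on a test it cannot fix.
    if payload.get("stop_hook_active"):
        return 0
    return 0 if tests_pass() else 2


def run_tests() -> bool:
    """Run the project's checks and report whether they passed."""
    return subprocess.run(TEST_COMMAND).returncode == 0


def main() -> int:
    payload = json.load(sys.stdin)
    code = stop_decision(payload, run_tests)
    if code == 2:
        print("Tests are failing; fix them before finishing.", file=sys.stderr)
    return code

# When installed as a hook script, finish the file with: sys.exit(main())
```

The decision logic is kept in a pure function so it can be tested without running the real suite. Allowing the stop once `stop_hook_active` is set is a design choice that trades strictness for a guaranteed exit.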

The convergence is not a coincidence. Every team building a coding agent has identified the same problem: a model that writes plausible code without checking it pushes the entire correctness problem back onto the developer. The productivity gain disappears into review overhead.

A coding agent that can verify its own work operates differently. It iterates on the task, catches its own mistakes, and hands over something the developer can reasonably trust. That’s where useful agent work lives, and it’s the bet every major agent vendor is now making.

Cloud-native systems make the loop harder

All of this assumes the agent can run the change against a realistic environment that mirrors production. In modern cloud-native architectures, that assumption breaks quickly.

The code an agent is changing rarely fails in isolation. It fails at the seams. Services call other services. Async events fire through message buses. A schema change in one service cascades through its consumers. A new middleware header breaks callers three hops away.

The agent writing the change has no way to catch any of that with a mocked integration test. The mock returns whatever the agent told it to return.
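A minimal sketch of that failure mode, using Python's standard `unittest.mock`. The user service, its field names, and the rename are all hypothetical; the point is only that the mocked test keeps passing after the real dependency has changed shape.

```python
from unittest.mock import Mock


def display_name(user_client, user_id: str) -> str:
    """Caller code written against the old response shape."""
    user = user_client.get_user(user_id)
    return user["full_name"]


# The mocked test: the mock returns exactly what the author told it to return.
mock_client = Mock()
mock_client.get_user.return_value = {"full_name": "Ada Lovelace"}
assert display_name(mock_client, "u1") == "Ada Lovelace"  # passes


class RealUserService:
    """Stand-in for the real dependency, which has since renamed the field."""

    def get_user(self, user_id: str) -> dict:
        return {"display_name": "Ada Lovelace"}


try:
    display_name(RealUserService(), "u1")
    failed_against_real = False
except KeyError:
    failed_against_real = True

print("mocked test passed; real dependency failed:", failed_against_real)
```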

Real validation in a distributed system means running the change in a realistic environment and observing what happens as actual requests flow through it. Full end-to-end. Real dependencies. Real traffic patterns.

Anything less pushes the problem back onto the developer. More review rounds. More iteration cycles. Broken staging environments that slow other developers and agents down. The occasional bug that makes it into production.

Realistic feedback, without duplicating the stack

What cloud-native teams need is a feedback loop that lets their agents see how a change actually behaves. Not against mocks. Not against a simplified approximation of production. Against real services, real data paths, and real traffic patterns, close enough to production that the integration failures the agent is most likely to cause are the integration failures it can most easily catch.

That loop has to satisfy three constraints at once.

It has to be realistic. The agent is trying to verify a change that crosses service boundaries, so it needs an environment where those boundaries exist and behave the way they will in production. Anything less and the agent ends up validating a version of the system that will not match what its code actually runs against when it ships.

It has to be isolated. Multiple agents and multiple developers will be exercising changes concurrently, often in overlapping parts of the system. If one agent’s test run breaks the environment for everyone else, the loop has closed for that agent but opened a bigger one for the rest of the team. An agent’s validation work cannot become a coordination problem for the humans around it.

And it has to scale with the way agents actually work. A team running coding agents is not running one task at a time. It is running many tasks in parallel, each on a different branch, each needing its own realistic place to validate. A model that requires duplicating the entire application stack for every agent collapses the moment the team gets serious about throughput, both on cost and on the wall-clock time it takes to stand each stack up.

The environment has to feel like production to the agent, stay out of the way of everyone else using it, and be cost-efficient enough that a team can run as many of them as they have agents in flight.

Agents need to know how to use the environment

Giving the agent a realistic environment is necessary, but on its own it does not close the loop. An agent with access to a production-like system, and no guidance about how to use it, behaves like a new engineer dropped into an unfamiliar codebase on day one. The access is there. The judgment about how to use it is not.

That judgment has two parts. One is the team’s operational knowledge: which upstream callers to exercise when a code path changes, which downstream dependencies actually affect the outcome, how to tell whether a failing request was caused by the change under review or by a flake three services away. The other is fluency with the tooling inside the environment itself: how to route traffic for testing, how to inspect state across services, which commands are available, how to read the logs the environment exposes. Generic testing know-how covers neither.

Agent skills are the vehicle for both. A skill captures how changes in a particular system should be validated and debugged, and how to drive the specific tools the environment provides to do that work. It is the team’s institutional knowledge, plus the operating manual for the environment, handed to the agent alongside access itself.

What this enables is the thing that matters. An agent that can validate its own work the way a senior engineer on the team would. Not just running tests, but exercising the right paths with the right tools, interpreting the right signals, and recognizing when something is wrong in a way that reflects how the system actually behaves.

Skills and environments have to ship together. An environment without a skill gives the agent access with no judgment. A skill without an environment gives the agent judgment with nothing to act on. Either one alone leaves the loop open.

The inner loop is where the next leap lives

The next unlock for cloud-native teams using coding agents is not a smarter model. It is a real environment the agent can work against in its inner loop and the context to use it well.

What makes this more interesting over time is that the boundary between inner loop and outer loop starts to dissolve once agents are involved. When a test fails in CI, the natural next step for an agent is not to surface the failure and wait for a human. It is to drop the same change back into an inner-loop environment, reproduce the failure with real dependencies, debug it, and push a fix. The outer loop’s signal becomes the inner loop’s starting condition.

It runs the other way too. An ad-hoc validation an agent runs once in the inner loop often deserves to outlive the task. Encoded into the outer loop, it becomes part of the team’s standing regression suite. The inner loop’s one-off experiment becomes the outer loop’s durable guard.

Both directions depend on the same foundation: a real environment the agent can reach from either loop, and the context to use it well. This is the approach we’re building toward at Signadot, so the validation cycle stays continuous wherever the signal arrives.

This feedback loop is what turns agents from fast code generators into collaborators developers can trust. The teams that close it, across both loops, will be the ones getting the real benefits of coding agents while the rest are buried under their review queues.

Why Coding Agents Will Break Your CI/CD Pipeline (and How To Fix It)

Signadot — Mon, 27 Apr 2026 15:05:45 +0000

Every engineering leader I speak with lately is quietly asking the exact same question. The conversation has shifted entirely. We are no longer debating how to adopt AI coding assistants or whether they write good boilerplate. That ship has sailed. The board has mandated AI adoption, your developers are already using the tools, and the code is flowing.

The real question keeping technical leaders up at night is far more daunting: What breaks when autonomous agents generate 10 times more code than your engineering team ever could?

If you are leading an engineering organization right now, you are likely feeling the whiplash of this transition. You were promised a tenfold increase in shipping velocity. Instead, you are looking at a growing, stagnant mountain of pull requests. You are seeing staging environments crash constantly. You are watching your senior engineers burn out, not from writing code, but from trying to review, test, and untangle the massive volume of code generated by machines.

The uncomfortable truth of the AI engineering revolution: the bottleneck has not been eliminated. It has simply moved.

Writing code is no longer the rate-limiting step in software delivery. AI agents have solved that. The new bottleneck is validation. It is the arduous, complex process of proving that the AI-generated code actually works in production-like conditions before it ever hits the main branch.

The cloud native collision course

If you are operating a modern, cloud native architecture, this validation bottleneck is not just a nuisance. It is a fundamentally critical failure point.

In a distributed microservices environment, services do not exist in a vacuum. A seemingly isolated change to one backend service generated by an agent can easily cascade, breaking three other downstream services and corrupting a shared database schema.

Now, multiply that reality by dozens of AI agents working asynchronously and shipping code in parallel. What happens to your existing continuous integration (CI) pipeline? It becomes a massive traffic jam.

Historically, we solved validation by relying on shared staging environments. Developers would merge their code, push it to staging, run integration tests, and cross their fingers. But a shared staging environment is a single-lane bridge.

When you have humans and agents simultaneously trying to drive trucks full of new code across that bridge, collisions are inevitable. Staging becomes permanently broken. Developers spend days trying to figure out whose commit caused the outage.

If this structural flaw is not addressed, the impact is severe:

The deploy gap: A massive chasm opens up between code generated and code actually deployed to users. Unmerged code sitting in a repository is a liability, not an asset. It delivers zero value to the business.

Post-merge failures: Desperate to clear the backlog, teams will inevitably lower their validation standards, leading to a spike in production incidents and rollback queues.

Negative return on AI investment: The massive investment in AI tooling is entirely negated if the output cannot be safely and rapidly integrated into the product.

The teams that figure out how to solve this validation bottleneck will ship five times faster than the industry average. The ones that do not will simply drown in their own generated code.

Rethinking validation for the agentic era

Solving this is not a matter of throwing more compute at your Jenkins servers or writing stricter pull request review policies. You cannot human-review your way out of a machine-generated code avalanche.

It requires a fundamental architectural shift in how we validate software. If we want agents to act like autonomous developers, we need to give them the infrastructure to test their work exactly like a senior developer would.

To achieve this, engineering organizations need to implement a modern validation architecture built on two distinct layers:

1. A scalable approach to ephemeral environments

The days of the shared staging bottleneck must end. Every single agent, and every single pull request, needs its own isolated environment to validate changes against the full, complex system.

However, spinning up an entire replica of a 50-microservice cluster for every PR is financially ruinous and agonizingly slow. The modern approach relies on lightweight, highly scalable ephemeral environments. Instead of duplicating the entire world, you run a stable baseline of your architecture and use request routing to isolate the agent’s specific changes.

When an agent writes a new version of “Service A,” the infrastructure dynamically provisions just that single updated service. It then routes test traffic through the stable cluster, intelligently diverting only the relevant requests to the agent’s sandbox.
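The routing idea can be sketched in a few lines. This is an illustrative model, not any platform's actual implementation: the header name, the service names, and the lookup tables are hypothetical, and a real system would also need every service to propagate the routing header on its outbound calls.

```python
# Stable baseline: one shared deployment of every service.
BASELINE = {
    "service-a": "service-a.baseline",
    "service-b": "service-b.baseline",
}

# Routing key -> services that have a sandboxed version for that key.
SANDBOXES = {
    "agent-42": {"service-a": "service-a.sandbox-agent-42"},
}

ROUTING_HEADER = "x-routing-key"  # hypothetical header name


def resolve(service: str, headers: dict) -> str:
    """Pick the workload that should handle a request for `service`.

    Requests carrying a known routing key are diverted to the sandboxed
    version of a service if one exists; everything else goes to baseline.
    """
    key = headers.get(ROUTING_HEADER)
    overrides = SANDBOXES.get(key, {})
    return overrides.get(service, BASELINE[service])


test_headers = {ROUTING_HEADER: "agent-42"}
print(resolve("service-a", test_headers))  # sandboxed version
print(resolve("service-b", test_headers))  # unchanged baseline
print(resolve("service-a", {}))            # ordinary traffic stays on baseline
```

Only the changed service is deployed per agent; the routing key decides, request by request, which version a call reaches.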

This means you can have 100 agents testing 100 different architectural changes in parallel against the full system, without stepping on each other’s toes or bankrupting your cloud budget. No shared staging bottleneck. No waiting in line.
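To make the scaling difference concrete, the sketch below counts running workloads under each model, using the 50-service cluster and 100 parallel agents mentioned above. It assumes each agent changes a single service, which is a simplification, and it counts workloads rather than dollars.

```python
def workloads_full_duplication(services: int, environments: int) -> int:
    """Every environment gets its own copy of every service."""
    return services * environments


def workloads_shared_baseline(services: int, environments: int,
                              changed_per_env: int = 1) -> int:
    """One shared baseline plus only the changed services per environment."""
    return services + environments * changed_per_env


full = workloads_full_duplication(services=50, environments=100)
shared = workloads_shared_baseline(services=50, environments=100)

print(f"full duplication: {full} workloads")
print(f"shared baseline:  {shared} workloads")
```

Under these assumptions, full duplication runs 5,000 workloads while the shared-baseline model runs 150.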

2. The plans-based validation layer

Providing an isolated environment is only half the battle. An environment is useless if the agent does not know how to use it.

Human developers do not just write code. They possess the skill of validation. They curl endpoints, query databases, check Grafana dashboards, and read server logs to verify their logic. AI agents need these same skills.

A plans-based validation layer equips coding agents with the programmatic tools to interact with their ephemeral environments. Instead of generating code and immediately opening a PR for a human to test, the agent generates the code, deploys it to its isolated sandbox, and uses its “plans” to run integration tests, generate load, and analyze the resulting logs.

If the agent detects an error in the logs, it iterates. It fixes its own code, redeploys to the sandbox, and tests again. The loop is closed independently. The agent requests a human review only after it has demonstrated that its changes work within the broader context of the distributed system.
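The generate, deploy, validate, and iterate cycle described above could be sketched as follows. The function names, the attempt limit, and the stub implementations are hypothetical; a real agent would call its model and its sandbox tooling where the stubs stand.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ValidationResult:
    passed: bool
    log: str


def validate_until_green(
    generate: Callable[[str], str],
    deploy: Callable[[str], None],
    validate: Callable[[str], ValidationResult],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Iterate in the sandbox; ask for human review only once validation passes."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        change = generate(feedback)   # write or revise the code
        deploy(change)                # push it to the isolated sandbox
        result = validate(change)     # run integration checks, read logs
        if result.passed:
            return ("request_review", attempt)
        feedback = result.log         # feed the failure into the next attempt
    return ("escalate_to_human", max_attempts)


# Stubs for illustration: the first attempt fails, the revised one passes.
def fake_generate(feedback: str) -> str:
    return "v2" if feedback else "v1"


def fake_validate(change: str) -> ValidationResult:
    ok = change == "v2"
    return ValidationResult(passed=ok, log="" if ok else "500 from downstream consumer")


deployed = []
outcome = validate_until_green(fake_generate, deployed.append, fake_validate)
print(outcome, deployed)
```

The attempt limit is a deliberate choice: an agent that cannot get to green within a few iterations hands the problem to a person instead of looping indefinitely.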

Enabling true autonomy and continuous delivery

When you combine scalable ephemeral environments with a plans-based validation layer, the entire paradigm shifts.

Agents transition from being mere autocomplete engines to becoming truly autonomous contributors. They are no longer throwing untested code over the wall for your senior engineers to clean up. They are taking responsibility for the full lifecycle of their assigned tasks, from generation to system-wide validation.

This is the only path forward to realizing continuous delivery in the age of AI. Continuous delivery was always the holy grail of software engineering. AI agents have not changed the goal. They have simply made the infrastructure required to achieve it non-negotiable.

Bridging the gap

This structural shift in validation is the exact problem we built Signadot to solve.

We provide a platform that gives every human developer and AI agent their own lightweight, isolated environment to validate changes against the full system before merge, in parallel and at massive scale.

Agents Write Code. They Don't Do Software Engineering.

Signadot — Mon, 20 Apr 2026 14:47:24 +0000

Long-running and background coding agents have hit a new threshold. When an agent runs for hours, manages its own iteration loop, and submits a pull request without hand-holding, it stops being a tool you invoke and starts being more like a worker you assign tasks to. As with any worker, the question isn’t how closely you supervise them. It’s what work you assign them in the first place.

We are all figuring this out in real time, and I see many teams making an understandable but critical error. They tune the autonomy dial, adding more review checkpoints or removing them, when the actual variable that matters is which categories of work agents should own versus which categories developers should own. That distinction isn’t about risk tolerance. It’s about capability boundaries.

Code writing and software engineering are not the same job

Writing code is pattern recognition. Take what’s been done before, apply it to a new context, and scaffold it out. Large language models are exceptional at this because that’s exactly what they do: recognize and reproduce patterns from massive corpora of prior work.

Software engineering is something else. It’s trade-offs. Constraints. Decisions that require context no model has access to: your business domain, your product strategy, your customers, your technical debt, the conversation your team had last week about why you chose one approach over another.

Most teams split the work by importance or risk tolerance. The real divide is between work that can be reasoned from prior patterns and work that requires context, strategy, and judgment that lives outside the codebase.

What developers actually own

The work that actually requires developers is more specific than “anything important,” but it doesn’t reduce to a tidy list of task types. It cuts across every part of the engineering process.

Developers own the work where the right answer depends on context that doesn’t exist in the codebase. Product strategy, business constraints, team dynamics, conversations in Slack threads, and architecture reviews are part of the history of why a system is built the way it is. Agents can read code. They cannot read the room.

Developers own the work where the risk profile is ambiguous, or the failure modes are hard to predict. Some changes cascade in ways that depend on organizational boundaries, deployment timing, or data contracts baked into a system over years of iteration. Evaluating correctness in those cases requires judgment that no model can derive from code alone. The higher the uncertainty, the more you need someone who understands not just what the code does but why it was written that way.

Developers own the work where the output is a decision, not an artifact: what to build, what to cut, and which technical bets to place six months from now. Agents can generate options. They cannot tell you which option is right for your situation because “right” depends on factors that reside in human heads and organizational contexts, not in training data.

And all of this is still evolving. As teams invest in making context more explicit through better documentation, clearer contracts, and more structured decision records, the boundaries shift. Work that once required a developer’s institutional knowledge becomes accessible to an agent. But the frontier of unstructured, high-judgment work keeps moving, too, and that’s where developer time is most valuable.

In distributed systems, this problem gets worse. The more services you spread across multiple teams and codebases, the more that critical context lives outside any single codebase. A change to one service’s event schema can break downstream consumers in ways that no test in that service’s own suite will catch.

The agent doesn’t know what it doesn’t know. The developer on that team does, not because they wrote the code but because they were in the meeting where the schema was agreed upon. For cloud-native teams, this scales badly: the more services, the more implicit contracts, and the more context that only people carry.

Where agents deliver the most value

There is a mountain of work in every codebase that is a waste of human brainpower. Boilerplate, scaffolding, repetitive refactors, unit test generation, configuration templating, and data formatting. This work is rote, mechanical, and can be reasoned entirely from prior patterns. Agents should own it.

Once a developer specifies the interface, contract, and expected behavior, an agent can implement faster and more consistently than a developer could. The implementation is the repeatable part. The reasoning that precedes it is not.

Iteration speed matters here, too. Generating multiple implementations, running test suites, checking contract conformance: agents do this at a pace no developer can match. CircleCI’s State of Software Delivery Report found that throughput bottlenecks most commonly appear in the feedback and validation loop, not the code-writing phase. Agents compress that loop significantly when the acceptance criteria are clear and they have access to the runtime environment and tools they need to validate their work.

The developers who thrive in this model won’t be the ones who write the most code. They’ll be the ones who make the best decisions about what to build and how to architect it, then hand off the execution to agents that can move faster than any human.

A three-tier model for dividing the work

On our team, we have found it useful to sort engineering work into three rough tiers and use them to decide how that work is split between agents and developers.

Tier 1: Agent-led, developer-reviewed

Tasks where agent execution is high-confidence and the output is self-verifiable. Boilerplate generation, configuration templating, adding endpoints within established patterns, running and reporting on test suites, and scaffolding new services or modules from existing conventions. The developer reviews the output, but the agent owns the work.

Routing this to developers wastes your most expensive resource. This category should expand as teams get better at making their patterns explicit and testable.

Tier 2: Agent-assisted, developer-guided

Tasks that require context beyond the codebase to validate. The agent implements, but the developer defines the scope, constraints, and success criteria. Feature work within a well-understood domain, refactoring within established boundaries, and test implementation for developer-defined strategies fall here.

The developer provides the engineering judgment. The agent provides the implementation throughput. Most feature work, across any architecture, falls into this tier.

Tier 3: Developer-led, agent-supported

Tasks where the core work is judgment, not implementation. Architectural decisions, cross-boundary contract changes, debugging emergent failures, and defining what to build next. Agents can assist with subtasks: drafting proposals, analyzing logs, and generating candidate implementations for evaluation. But a developer must drive because the work itself is reasoning, not pattern execution.

The distinction from Tier 2 is that the developer isn’t just validating output. They’re doing the intellectual work that no amount of training data can substitute for.
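
One way to make the tiers operational is to encode the questions that separate them. The sketch below is illustrative only: the `Task` fields and the `assign_tier` function are hypothetical names, and reducing each tier to three yes/no questions is a simplification of the judgment described above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AGENT_LED = 1        # agent owns the work, developer reviews
    AGENT_ASSISTED = 2   # developer sets scope and criteria, agent implements
    DEVELOPER_LED = 3    # developer drives, agent helps with subtasks


@dataclass
class Task:
    name: str
    self_verifiable: bool        # can tests or linters alone confirm the output?
    needs_outside_context: bool  # does validation need knowledge not in the repo?
    judgment_is_core: bool       # is the work itself a decision, not an implementation?


def assign_tier(task: Task) -> Tier:
    # Judgment-heavy work always stays with a developer.
    if task.judgment_is_core:
        return Tier.DEVELOPER_LED
    # Work that cannot be validated from the codebase alone needs developer guidance.
    if task.needs_outside_context or not task.self_verifiable:
        return Tier.AGENT_ASSISTED
    return Tier.AGENT_LED


scaffold = Task("scaffold service from template", True, False, False)
feature = Task("add feature in billing domain", False, True, False)
contract = Task("change cross-service contract", False, True, True)
```

Here `assign_tier(scaffold)` lands in Tier 1, `feature` in Tier 2, and `contract` in Tier 3. Making a pattern explicit and testable flips `self_verifiable` to true, which is how Tier 1 expands over time.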

The cost of getting the split wrong

Most teams I speak to are either under-allocating or over-allocating work to agents. Both are expensive mistakes.

Over-allocation is the more visible failure. Push agents into Tier 3 work, and they produce output that requires significant rework because the necessary context wasn’t available to them. The rework cost is real, but the opportunity cost is worse: developers who should be doing Tier 3 work are instead reviewing and correcting agent output that shouldn’t have been delegated in the first place.

Under-allocation is quieter but equally damaging. Teams that default to developer-owned work because agent output seems uncertain are paying developer rates for Tier 1 tasks. Developer time is the highest-cost resource on the team. Burning it on pattern-execution work that agents could handle is a slow drag on velocity that compounds over months.

This is why many teams adopting agentic workflows see limited gains or even slight decreases in merged code throughput. They haven’t solved the allocation problem. They’ve added a new tool without changing how work gets distributed.

Audit the work, not just the agents

The question isn’t whether agents replace developers. It’s what the right engineering model looks like when agents handle the mechanical work, and humans focus on the strategic work.

The teams navigating this well don’t just audit their agents. They audit their work. They ask which tasks could be agent-led if the boundaries were made explicit, then invest in making those boundaries explicit. That investment returns developer time to the high-judgment, context-dependent work that agents won’t own well anytime soon.

The answer won’t come from the AI labs. It’ll come from engineering teams actually building software this way every day, figuring out where the line is through practice, and learning what their specific codebase, team, and domain require on each side of the divide.

The Agent PR Flood Is Here. If You Run Istio, You're Halfway to Solving It.

Signadot — Tue, 14 Apr 2026 18:08:26 +0000

Agentic workflows are rapidly accelerating the volume of pull requests, and validation is quickly becoming the most critical bottleneck. Teams using service meshes like Istio are well-positioned to solve it in ephemeral environments.

Engineering teams across the industry are waking up to a harsh new reality. The widespread adoption of agentic workflows has made code generation cheap and fast, but it has created a new infrastructure problem.

In simpler application architectures, running unit tests and mocks in a continuous integration pipeline might be enough to validate agent-generated code. But in cloud-native, distributed systems, validating behavior in a live environment is critical.

In just the past few months, I’ve seen the conversation with customers and colleagues shift from “agents are great for writing code, but we’re not seeing it impact pipeline” to “we’re drowning in PRs.” Validation has become the new bottleneck for distributed systems.

This flood isn’t impacting organizations equally. Companies like Stripe, Ramp, and other early adopters of advanced AI workflows are seeing exponential gains in code merged to main. They recognized early that generating code is only half the battle. If you cannot validate that code as quickly as agents write it, your pipeline will collapse down to the same human-level throughput it was built to handle.

For teams that want to replicate the success of these organizations, the answer might already be running in their clusters. If your platform is currently running a service mesh like Istio, you are already halfway to eliminating the validation bottleneck.

The AI velocity illusion and the integration bottleneck

The recent CircleCI 2026 State of Software Delivery report confirms what on-call rotations are already feeling: the pipeline is choking on its own success. While average workflow throughput increased 59 percent year over year, those gains are heavily concentrated at the top. Elite teams are operating at an unprecedented scale: the top 5 percent of teams saw their throughput nearly double, up 97 percent.

For the vast majority of organizations, the pipeline is clogging. The median team saw a 15.2 percent increase in throughput on feature branches where AI supports rapid prototyping, but their throughput on the main branch actually declined by 6.8 percent. Developers and their autonomous agents are generating significantly more code, but teams are struggling to review, validate, and promote it.

Traditional shared staging environments were never designed to handle this level of concurrency. They were sized for human output. For an engineering team of 50 developers each generating 2-3 pull requests a day, the infrastructure was built to handle 100-150 PRs a day. This quickly becomes a critical choke point when hit with a massive volume spike. The queue grows faster than it drains.
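
A quick simulation shows why the queue grows faster than it drains. This is a rough sketch: the capacity of 150 PRs a day is the upper figure above, while the arrival rate of 600 PRs a day is an assumed agent-driven volume chosen only for illustration.

```python
def backlog_after(days: int, arrivals_per_day: int, capacity_per_day: int) -> int:
    """Return the number of unvalidated PRs waiting after the given number of days."""
    backlog = 0
    for _ in range(days):
        # Each day new PRs arrive and the environment drains what it can.
        backlog = max(0, backlog + arrivals_per_day - capacity_per_day)
    return backlog


human_scale = backlog_after(days=5, arrivals_per_day=120, capacity_per_day=150)
agent_scale = backlog_after(days=5, arrivals_per_day=600, capacity_per_day=150)
```

At human scale the backlog stays at zero. At the assumed agent scale it reaches 2,250 PRs after one working week, and it keeps growing by 450 a day.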

Organizations that fail to upgrade their validation infrastructure are finding that the velocity promised by their AI investments is dissolving in the staging queue. The teams that are winning recognize that scalable validation infrastructure is the only way to unlock the true return on investment of agentic workflows.

The true bottleneck of agentic workflows

To understand why this bottleneck is so destructive, you must examine what happens when machine output speed collides with infrastructure built for human throughput. Agents exponentially increase the volume of pull requests, and traditional staging queues and review processes simply cannot support that volume without creating impossibly long backlogs.

Because the pipeline cannot handle the load, developers are forced to throttle their agents. They do not submit the full volume of agent-generated code. Instead, they have agents rely on unit tests and mocks to avoid the staging queue until the later stages of development. This imperfect pattern worked for human developers who had a mental model of the full system architecture and could intuit which changes would break downstream dependencies. Agents don’t work that way. They frequently generate novel code that passes localized unit tests but fails when introduced to the broader system architecture. For agents, a fast feedback loop with a realistic runtime to validate their code is not a nice-to-have. It’s a requirement.

This means the potential throughput of agents is artificially capped by linear infrastructure designed for human velocity. It also means the code that does get through is much more likely to break. The CircleCI report highlights the cost of these integration failures. Success rates on the main branch for most teams fell to 70.8 percent.

The unsustainable math of environment duplication

To convert the increased output of agentic workflows into actual throughput and eliminate this bottleneck, the validation infrastructure needs to give each agent or pull request an isolated, realistic runtime environment. Traditionally, platform teams would spin up a fresh Kubernetes namespace or an isolated cluster for every single pull request. While this provides the necessary fidelity, the math completely breaks down at the agentic scale. Duplicating every database, message queue, and microservice takes 15 minutes or more. When you multiply that overhead by 1,000 pull requests a day, infrastructure costs explode, and the 15-minute deployment lag severely caps an agent’s iteration cycles.
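
The arithmetic is worth writing down. In the sketch below, the 1,000 PRs a day and the 15-minute spin-up come from the paragraph above. The two-minute validation run and the ten-second sandbox spin-up are assumptions added for comparison.

```python
def provisioning_hours(prs_per_day: int, minutes_per_env: int) -> float:
    """Total environment spin-up time per day, in hours."""
    return prs_per_day * minutes_per_env / 60


def max_iterations_per_hour(spinup_seconds: int, validate_seconds: int) -> int:
    """How many deploy-and-validate cycles one agent can complete in an hour."""
    return 3600 // (spinup_seconds + validate_seconds)


daily_spinup = provisioning_hours(prs_per_day=1000, minutes_per_env=15)
slow_loop = max_iterations_per_hour(spinup_seconds=15 * 60, validate_seconds=120)
fast_loop = max_iterations_per_hour(spinup_seconds=10, validate_seconds=120)
```

Under these assumptions, full duplication spends 250 hours a day on spin-up alone and limits each agent to 3 iterations an hour, while a ten-second spin-up allows 27.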

Another common approach to bypass full cluster duplication is shifting the burden to heavy virtual machines running localized container setups. I spoke recently with an engineering leader whose team handles integration testing by dynamically generating Docker Compose files for isolated cloud instances. Because tests rely on shared state, touching just a few core files in continuous integration triggers a fleet of 100 heavy cloud instances that spend an hour grinding through sequential testing.

Whether you are spinning up 1,000 full Kubernetes namespaces or orchestrating fleets of heavy virtual machines to run localized containers, the result is the same. The deployment lag compounds quickly, and the velocity of your AI workflows stalls when it meets the bottleneck of linear infrastructure.

Ephemeral environments for agentic scale

These compounding factors mean that the only viable solution is a new model of scalable ephemeral environments. To handle machine speed concurrency, environments must spin up in seconds and provide a realistic runtime without the cost of duplicating the entire cluster. Instead of copying everything, a scalable, ephemeral environment model deploys only the microservices that have changed, as a lightweight sandbox. The rest of the architecture, including all heavy databases and stable downstream services, is shared from a baseline environment. The sandbox dynamically routes test traffic between the changed services and the baseline.

This approach delivers the exact same high-fidelity runtime as a full duplicate environment. The code is tested against real, live dependencies. The critical difference is the resource footprint. By only deploying the services under test, the environment spins up in seconds rather than minutes. It consumes a fraction of the compute resources.

In this model, agents can validate their code, get instant feedback, and iterate with massive concurrency and zero contention.

Routing your way out of the staging queue

Implementing this shared baseline architecture requires advanced traffic control. Building the automation and lifecycle management for these environments from scratch is a massive engineering undertaking. However, teams running a service mesh such as Istio have a significant advantage.

Because these tools already provide the exact routing capabilities needed, implementing scalable ephemeral environments like those described above becomes seamless. The underlying service mesh or ingress controller simply handles the dynamic routing of test traffic to a lightweight sandbox while ensuring all regular traffic flows uninterrupted to the stable baseline.

Here is what the underlying routing logic looks like when configured in Istio:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: location
  namespace: hotrod-istio
spec:
  hosts:
  - location
  http:
  - match:
    - headers:
        baggage:
          regex: ^.*\b(sd-routing-key|sd-sandbox)\s*=[^,]*\bqwblp48fpmb30\b.*$
    - headers:
        tracestate:
          regex: ^.*\b(sd-routing-key|sd-sandbox)\s*=[^,]*\bqwblp48fpmb30\b.*$
    route:
    - destination:
        host: local-location-location-9f4477c8.hotrod-istio.svc.cluster.local
        port:
          number: 8081
  - route:
    - destination:
        host: location

When a request carries the specific header, Istio intercepts it and routes it directly to the sandbox version of the service deployed for that specific pull request.
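
To see which requests the rule above would send to the sandbox, the same match can be reproduced outside the mesh. This is a sketch for reasoning about the rule, not how Istio evaluates it: Istio's proxy uses its own regex engine, and Python's `re` stands in here because this pattern only uses syntax the two share. The `pick_destination` helper is a hypothetical name, and header names are assumed to be lowercased.

```python
import re

# Pattern, routing key, and hosts copied from the VirtualService above.
PATTERN = re.compile(
    r"^.*\b(sd-routing-key|sd-sandbox)\s*=[^,]*\bqwblp48fpmb30\b.*$"
)
SANDBOX_HOST = "local-location-location-9f4477c8.hotrod-istio.svc.cluster.local"
BASELINE_HOST = "location"


def pick_destination(headers: dict) -> str:
    """Mirror the two match clauses: either header may carry the routing key."""
    for name in ("baggage", "tracestate"):
        if PATTERN.match(headers.get(name, "")):
            return SANDBOX_HOST
    # No match falls through to the default route, the shared baseline.
    return BASELINE_HOST


tagged = pick_destination({"baggage": "userid=1,sd-routing-key=qwblp48fpmb30"})
other_sandbox = pick_destination({"baggage": "sd-routing-key=someotherkey"})
untagged = pick_destination({})
```

Only the request carrying this sandbox's key reaches the sandbox host. A request tagged for a different sandbox, or not tagged at all, goes to the baseline.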

The critical enabling mechanism behind this is context propagation. In a deep microservice call chain, the sandbox routing header must travel automatically between every service. OpenTelemetry (otel) baggage propagation handles this seamlessly. The routing value rides along the trace context, crossing boundaries without any individual service needing to explicitly forward it.
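
The effect of propagation can be shown without any tracing library. The sketch below hand-copies the two headers from an inbound request to an outbound call. In a real service, OpenTelemetry instrumentation would do this automatically, so the `outbound_headers` helper is purely illustrative.

```python
from typing import Dict, Optional

# The two headers the routing rule inspects.
PROPAGATED = ("baggage", "tracestate")


def outbound_headers(
    inbound: Dict[str, str], extra: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Build headers for a downstream call, carrying the routing context along."""
    headers = dict(extra or {})
    for name in PROPAGATED:
        if name in inbound:
            headers[name] = inbound[name]
    return headers


forwarded = outbound_headers(
    {"baggage": "sd-routing-key=qwblp48fpmb30", "cookie": "session=abc"},
    {"accept": "application/json"},
)
```

The routing key survives the hop while unrelated headers such as `cookie` are not forwarded, so the next service in the chain can still be matched by the mesh.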

By leveraging these foundational primitives, platform teams can easily adopt scalable ephemeral environment solutions to orchestrate the deployment of sandbox services and automatically configure the routing rules within their existing mesh. This gives agents the ability to validate their own work against live cluster dependencies with instant feedback, eliminating the integration bottleneck.

What’s next

Agentic workflows are the new standard for software development, and they are already revealing the cracks in the traditional model of code validation and review. The gap between teams that have made scalable validation infrastructure a top priority and those that haven’t is evident, and it will only get bigger.

Teams that are already running service meshes like Istio are significantly ahead of the curve here. They already have the traffic-routing primitives in place that make implementing scalable, ephemeral environment solutions like Signadot seamless. This puts them in a position to move quickly on tackling the agentic PR validation issue before it becomes a full-blown crisis.

Coding Agents Are Only as Good as the Signals You Feed Them

Signadot — Thu, 09 Apr 2026 21:02:09 +0000

The industry has spent the last few years optimizing AI agents’ code-generation capabilities. The focus has been on expanding context windows, fine-tuning models on repository-specific data, and developing complex prompting strategies. This has undoubtedly produced more capable coding agents. However, for most teams, that code-generation capability has not translated into significant gains in productivity.

Most engineering teams are stuck in a manual workflow. The agent generates the code, tests it locally, and submits a PR to the developer for review. Deploying the code, validating that it works, and feeding back any integration issues to the agent all happen at human pace. This workflow puts a hard ceiling on the productivity gains that agents can deliver by making developers into a validation bottleneck.

But some companies are enabling real autonomy for their agents and seeing the productivity gains that AI promises. Organizations like Stripe, Ramp, and the internal teams at OpenAI and Anthropic have come to the same realization: the quality of an agent’s output is directly proportional to the quality of the feedback loop it receives.

To elevate engineers to architects and see the speed of agentic code generation translate into productivity, platform engineering teams need to reconsider their strategy. Instead of focusing on giving developers better coding agents, the more impactful lever may be giving agents better feedback infrastructure.

The lesson from harness engineering

OpenAI recently documented how they built a complete software product using Codex with a guiding principle of “Humans steer. Agents execute.” Their success was not driven purely by model intelligence or prompt engineering. It was driven by a heavy investment in the environment within which the agent operated.

A team of just three engineers generated a working product with internal users and millions of lines of code by designing environments, specifying intent, and building rigorous feedback loops. The primary job of the engineers shifted from writing implementation code to building the scaffolding that allowed agents to verify their own work. This approach is known as harness engineering.

Harness engineering involves equipping agents with the tools and constraints required to act effectively. OpenAI engineers would write a docstring and a set of assertions. The agent would then generate the implementation. If the assertions failed, the environment would automatically capture the traceback, feed it back to the model, and request a retry. This loop allows for dozens of iterations without human intervention.
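
The loop described above can be sketched in a few lines. This is not OpenAI's harness: `run_harness`, `fake_model`, and `assertions` are hypothetical names, the model is replaced by a stub that fixes its bug once it sees a traceback, and a real harness would run generated code in a sandbox rather than with `exec`.

```python
import traceback
from typing import Callable, Optional


def run_harness(
    generate: Callable[[str, Optional[str]], str],
    spec: str,
    check: Callable[[dict], None],
    max_attempts: int = 5,
) -> Optional[str]:
    """Ask for an implementation, run the assertions, and retry with the traceback."""
    feedback = None
    for _ in range(max_attempts):
        source = generate(spec, feedback)
        namespace: dict = {}
        try:
            exec(source, namespace)  # unsafe outside a sandbox; illustration only
            check(namespace)
            return source
        except Exception:
            # The captured traceback becomes the signal for the next attempt.
            feedback = traceback.format_exc()
    return None


def fake_model(spec: str, feedback: Optional[str]) -> str:
    # Stub standing in for the model: wrong first, corrected after feedback.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def assertions(namespace: dict) -> None:
    assert namespace["add"](2, 3) == 5


accepted = run_harness(fake_model, "add(a, b) returns the sum", assertions)
```

The first attempt fails the assertion, the traceback is fed back, and the second attempt is accepted without a human in the loop.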

The key lesson here is that for agents to behave like engineers, they need the same tools, environments, and constraints as engineers at the infrastructure level. By giving the agent a way to validate its own work, they transformed the model from a one-shot code generator into an engineer capable of iteration. The harness provided the signals the agent needed to debug its own code, verify its logic, and deliver fully functioning software.

Stripe’s Minions and the feedback loop

We can see a similar pattern at Stripe. The company recently published a blog post detailing its internal agent framework, Minions. Reportedly, Minions produce over a thousand merged pull requests every week. Stripe did not achieve this volume simply by pointing a large language model at its monorepo.

The company built an MCP server called Toolshed, which exposes over 400 tools to its agents. Crucially, it gave agents full access to the development environment and built deterministic verification steps into the agent’s loop.

When a Stripe Minion writes code, the harness forces it through a gauntlet of verification steps. It begins with git operations, then moves to linting and formatting. If the agent generates code that violates the style guide, the linter rejects it immediately and returns the specific line number. The agent consumes this error message and corrects the syntax.

The Minion then moves on to type checking and testing. If a test fails, the error output is fed back into the context window for a fix. This functions as a closed-loop control system, with the development environment itself providing the error signal. This design allows the organization to trust agent output because the system prevents incorrect code from leaving the agent’s local context.
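
The ordering matters as much as the checks themselves: cheap, deterministic gates run first and the agent only sees the first failure. The sketch below is not Stripe's implementation. The two checks are simple stand-ins for a real linter and compiler, and all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Check = Callable[[str], Optional[str]]  # returns an error message, or None on success


def line_length(source: str) -> Optional[str]:
    # Stand-in for a linter: report the first overly long line by number.
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > 79:
            return f"line {number}: longer than 79 characters"
    return None


def syntax(source: str) -> Optional[str]:
    # Stand-in for a compile or type-check step.
    try:
        compile(source, "<agent>", "exec")
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


GAUNTLET: List[Tuple[str, Check]] = [("lint", line_length), ("syntax", syntax)]


def first_failure(source: str, checks: List[Tuple[str, Check]]) -> Optional[Tuple[str, str]]:
    """Run checks in order and return (step, error) for the first one that fails."""
    for name, check in checks:
        error = check(source)
        if error is not None:
            return name, error
    return None
```

A return value of `None` means the change may leave the agent's local context. Anything else is the error signal fed back into the context window.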

The verification gap

Most engineering teams today do not operate at this level of sophistication. They often provide their agents with little more than a code editor and a terminal window.

This creates a verification gap. It is like hiring a senior engineer and not giving them access to staging environments, monitoring dashboards, test infrastructure, or code review, and expecting them to contribute effectively.

The feedback signals available to an agent define the ceiling of what it can accomplish autonomously. If an agent can only see the text in the editor, it is limited to fixing syntax errors. If it can see the compiler output, it can detect type errors. But to solve complex integration problems, it needs access to the same rich diversity of signals that human engineers rely on.

Without these signals, agents are prone to silent failures. An agent might generate a SQL query that is syntactically correct and returns the correct data, but performs a full table scan, degrading production performance. Without access to an explain plan or execution metrics, the agent has no way of knowing that the code it wrote fails in production.
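
Exposing the query plan is a small change that closes this particular gap. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for whatever database a team actually runs. The `orders` table, the index name, and the `uses_full_scan` helper are assumptions made for illustration.

```python
import sqlite3


def uses_full_scan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> bool:
    """Return True if any step of SQLite's query plan is a full table scan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column holds the plan detail, e.g. "SCAN ..." or "SEARCH ... USING INDEX ...".
    return any(row[3].startswith("SCAN") for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
query = "SELECT total FROM orders WHERE customer_id = ?"

scan_before_index = uses_full_scan(conn, query, (42,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
scan_after_index = uses_full_scan(conn, query, (42,))
```

The query returns the same rows in both cases. Only the plan reveals that the first version scans the whole table, which is exactly the signal an agent cannot infer from the query text alone.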

A hierarchy of feedback signals

To understand the impact of feedback signals on agents, it is worth mapping the hierarchy. Each level provides the agent with more context and raises the ceiling on their autonomy and, in turn, their productivity.

Syntax and type checking

This is the baseline. Any competent agent loop effectively eliminates syntax and type errors by iterating against compiler or linter output. However, these represent the shallowest class of bugs. A program can be syntactically perfect and type-safe while completely failing in production.

Unit tests

Agents that can run local unit tests can verify logic in isolation. This catches a significant volume of logical defects but misses the complexity of distributed systems. A unit test can confirm that a function correctly calculates a tax rate, but it cannot confirm that the tax service is reachable or that the authentication token is valid.

Integration and API tests

This is where the verification gap widens. To verify that a new or updated service correctly calls an upstream dependency, the agent needs access to a running environment where those services interact. Agents frequently hallucinate API payloads or invent endpoints without this context.

Observability data

Agents are rarely given access to traces and logs, yet these are critical tools for developers to debug complex failures. Giving an agent the ability to query logs or analyze a trace ID allows it to diagnose runtime behavior issues that static analysis will never catch.

Visual and end-to-end verification

Finally, visual and end-to-end verification is required to validate any change that reaches the frontend. A backend agent might deploy a schema change that passes all service-level tests but breaks the user interface because a component expects a different data format. Agents equipped with isolated previews and tools to drive a browser can confirm that their changes function end-to-end and close the loop.

What’s next

We have already crossed the threshold of model intelligence, enabling powerful engineering capabilities for agents. The limiting factor is now the richness of the feedback signals available to agents.

There is a strong case for treating agent feedback infrastructure as a first-class platform capability, much like CI/CD pipelines are treated now. This involves considering investments in standardized tool interfaces like MCP, structured outputs that make logs and errors easily consumable by machine, and ephemeral environment solutions that allow agents to spin up the isolated spaces they need to test and iterate in parallel against real dependencies.

Teams that build infrastructure to enable these feedback loops will see velocity compound as models improve. Those that do not will always have a ceiling on the productivity they can generate from coding agents.

Why the MCP Server Is Now a Critical Microservice

Signadot — Thu, 19 Feb 2026 15:55:36 +0000

In my previous article on preparing CI/CD pipelines to ship production-ready agents, I argued that we cannot ship agents to production that are driven primarily by non-deterministic models. Instead, they must be built as robust workflows where the large language model (LLM) is introduced strategically at specific steps within a deterministic control flow.

Now we must examine the most critical node in that framework.

The Model Context Protocol (MCP) server facilitates interactions between the probabilistic LLM node and the deterministic microservices workflow. It acts as the translation layer connecting the reasoning engine to external data and tools.

The model is one half of the agent architecture. The MCP server is the other half. While model evaluations validate the reasoning engine, they cannot verify the system as a whole. Validation strategies relying on mocks fail to test the agent as a workflow.

Reliability of the end-to-end workflow is paramount when shipping agents to production. The MCP server is the critical node in this topology, acting as both sensory organ and effector arm. If it transmits ambiguous signals, the agent acts erratically. It hallucinates. It degrades user trust. It causes critical business errors.

The Architectural Shift From Contracts to Semantics

To understand the failure risks, we must examine how the MCP server alters service contracts.

Service-to-service communication is deterministic in standard microservices environments. Service A calls Service B using a strict REST or gRPC contract. The interaction is rigid. It is predictable. It is easily validated.

An agentic workflow inverts this.

The agent is a nondeterministic actor operating on probabilistic logic. It decides when to call a tool based on semantic context provided by the MCP server. The server exposes a world model rather than just an API endpoint.

This makes the MCP server a distinct type of microservice. It is a translation layer converting probabilistic intent into deterministic action. This responsibility manifests in three operations requiring rigorous engineering.

1. Defining Capability Boundaries

The MCP server defines agent capabilities through JSON-RPC tool definitions.

If the server exposes a schema with vague descriptions, the agent cannot formulate a valid execution plan. A human developer might read documentation to clarify an API field, but the agent relies solely on metadata exposed by the list_tools capability.

Consider a payment operations agent handling refunds. A fragile MCP implementation might expose a tool named refund_user to process a refund.

This lacks semantic density. The model does not know whether this applies to a full or partial refund or if it handles tax calculation. It is a black box.

A robust implementation defines the boundary with precision. It exposes process_prorated_subscription_refund. The description explicitly states that it calculates the remaining balance for the current billing cycle and issues a credit.

The reasoning chain breaks without this specificity.
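
The contrast between the two tools can be written out as the metadata an agent would actually see. The dictionaries below follow the general shape of an MCP tool entry (`name`, `description`, `inputSchema`). The field names inside the schemas and the `description_gaps` check are hypothetical, and the twelve-word threshold is an arbitrary heuristic rather than anything from the protocol.

```python
from typing import List

VAGUE_TOOL = {
    "name": "refund_user",
    "description": "Process a refund.",
    "inputSchema": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}

PRECISE_TOOL = {
    "name": "process_prorated_subscription_refund",
    "description": (
        "Calculates the remaining balance for the current billing cycle "
        "of a subscription and issues that amount as a credit."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "subscription_id": {
                "type": "string",
                "description": "Subscription whose current cycle is refunded.",
            },
        },
        "required": ["subscription_id"],
    },
}


def description_gaps(tool: dict, min_words: int = 12) -> List[str]:
    """List the places where the agent is left guessing about a tool."""
    gaps = []
    if len(tool["description"].split()) < min_words:
        gaps.append("description")
    for field, schema in tool["inputSchema"]["properties"].items():
        if not schema.get("description"):
            gaps.append(field)
    return gaps
```

A check like this can run in CI against the tool list, so a vague definition is caught before an agent ever has to plan around it.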

2. Governing the Context Economy

The MCP server governs the context window. It must retrieve backend data and format it for LLM consumption.

This data engineering challenge requires differentiating between signal and noise.

Providing a raw 5 MB JSON dump dilutes agent attention. It wastes tokens and increases latency. Conversely, providing too little data causes the agent to hallucinate missing details.

The server must act as a transformation layer that optimizes raw data into context-ready snippets.
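
As a minimal sketch of that transformation, the function below keeps only the fields a given tool call needs and enforces a size budget. The record, the field names, and the 500-character budget are all made up for illustration.

```python
import json
from typing import List


def to_context_snippet(record: dict, fields: List[str], max_chars: int = 500) -> str:
    """Reduce a backend record to a compact JSON snippet for the model."""
    picked = {key: record[key] for key in fields if key in record}
    text = json.dumps(picked, separators=(",", ":"), sort_keys=True)
    if len(text) > max_chars:
        # Fail loudly instead of silently truncating into invalid JSON.
        raise ValueError(f"snippet is {len(text)} chars, budget is {max_chars}")
    return text


raw_invoice = {
    "status": "past_due",
    "amount_due": 1200,
    "audit_log": ["created", "emailed", "emailed", "escalated"],
    "internal_flags": {"migrated": True, "shard": 7},
}
snippet = to_context_snippet(raw_invoice, ["status", "amount_due"])
```

The snippet carries the two facts the agent needs and nothing else. Which fields count as signal is a per-tool decision that belongs to the server, not to the model.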

3. Executing Side Effects

The MCP server executes actions for the agent. When an agent triggers a deployment, the server is the execution mechanism.

A confused agent can trigger destructive loops if the server lacks idempotency or error handling. The server must implement safeguards preventing the model from erroneously retrying state-changing operations.
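
One common safeguard is an idempotency key: the server remembers the result of each state-changing call and replays it on retry instead of executing again. The sketch below keeps results in memory and uses hypothetical names throughout. A production server would need durable storage and key expiry.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Run each state-changing action at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            # A retry with the same key gets the stored result, not a second execution.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


deployments = []


def deploy() -> str:
    deployments.append("payments-v2")
    return "deployed payments-v2"


executor = IdempotentExecutor()
first = executor.run("deploy-payments-v2", deploy)
retry = executor.run("deploy-payments-v2", deploy)
```

Both calls return the same message, but the deployment happens once, even if the agent loops on the tool call.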

The Engineering Rigor Required for Production

Shipping agents to production requires due diligence exceeding standard microservice development. This is most visible in return state ambiguity.

A traditional API might return a 404 error code, which a client handles with logic. An MCP server faces a more complex challenge. It must return a natural language description or structured tool result explaining why the action failed.

If the server returns a generic stack trace, the agent may retry endlessly or invent a plausible but incorrect reason for failure. The error message becomes part of the prompt for the next conversation turn. It must be engineered as carefully as the system prompt.
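
A structured failure result might look like the sketch below. The outer shape (a `content` list of text items plus an `isError` flag) follows the MCP tool result format. The wording, the retry hint, and the `tool_error` helper are assumptions about what a well-engineered message could contain.

```python
def tool_error(reason: str, retryable: bool, next_step: str) -> dict:
    """Build a tool result that tells the agent what failed and what to do next."""
    text = (
        f"{reason} "
        f"Retryable: {'yes' if retryable else 'no'}. "
        f"Suggested next step: {next_step}"
    )
    return {"content": [{"type": "text", "text": text}], "isError": True}


result = tool_error(
    reason="Refund failed: the subscription was already cancelled and refunded.",
    retryable=False,
    next_step="Tell the user the refund has already been issued.",
)
```

Compared with a stack trace, this gives the model an explicit reason, rules out a retry, and points at the next action, which removes most of the room for invention.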

Latency is also critical. Agents operate in a sequential thought loop. They reason. They call a tool. They wait. They reason again.

A slow server breaks the cognitive chain. High latency causes context timeouts, forcing the agent to abandon workflows. This leaves systems in inconsistent states.

Scaling Testing via Multitenancy

The nondeterministic nature of the client makes testing difficult. Traditional unit tests are insufficient.

Unit testing a Python function to ensure valid JSON output does not prove that an agent will understand how to use it. Mocks are equally ineffective. They decouple the test from real system behavior and create false confidence.

The only way to validate an MCP server is through rigorous end-to-end testing against real dependencies. However, spinning up full cluster replicas for every test is rarely feasible.

To validate an MCP server without the overhead of full environment replication, we treat the test run as a logical slice within a shared cluster. This life cycle relies on header-based routing and session affinity:

Handshake and routing: The test harness initializes the agent with specific context metadata (such as a baggage header or a custom routing parameter) during the WebSocket or transport handshake. This signals the ingress controller or service mesh to route the persistent JSON-RPC session specifically to the candidate MCP server (the version under test), bypassing the stable production traffic.
Session isolation: Once connected, the agent operates within a strictly isolated session. While the underlying compute resources may be shared, the logical control flow is pinned to the candidate artifact. This ensures that the nondeterministic reasoning of the agent is exercising only the new code paths.
Shared downstream state: The candidate MCP server processes the agent’s intent but executes side effects against shared downstream dependencies such as staging databases or stable microservices. This eliminates the need for mocks, allowing the agent to interact with a realistic “world model” where API contracts and data schemas are genuine.

This architecture enables safe end-to-end semantic testing. The harness prompts the agent to perform an operation and verifies the state change against downstream microservices.

Isolation at the connection layer turns the test run into a private lane on a public highway. This enables full end-to-end validation of the MCP server without saturating testing infrastructure or introducing resource contention in shared staging environments.

Treat It Like Critical Infrastructure

Teams that are shipping advanced, customer-facing agents understand that robust MCP servers are critical infrastructure. We must recognize them as complex architectural nodes that directly affect agent reliability.

Model evals are critical but insufficient for production standards. Rigorous integration testing of agents with MCP servers is necessary.

An agent is only as effective as its tools. A fragile MCP server creates a fragile agent. Elevating the MCP server to a fully validated microservice is essential for advancing agent development from internal experiments to products that are ready for production.

Learn more about how to implement this testing workflow for your agents at Signadot.com

Your CI/CD Pipeline Is Not Ready To Ship AI Agents

Signadot — Wed, 18 Feb 2026 15:43:01 +0000

Let’s be honest with ourselves for a minute. If you look past the hype cycles, the viral Twitter demos and the astronomical valuation of foundation model companies, you will notice a distinct gap in the AI landscape.

We are incredibly early, and our infrastructure is failing us.

While every SaaS company has slapped a copilot sidebar onto its UI, actual autonomous agents are rare in the wild. I am referring to software that reliably executes complex and multistep tasks without human hand-holding. Most agents today are internal tools glued together by enthusiastic engineers to summarize Slack threads or query a SQL database. They live in the safe harbor of internal usage where a 20% failure rate is a quirky annoyance rather than a churn event.

Why aren’t these agents facing customers yet? It is not because the models lack intelligence. It is because our delivery pipelines lack rigor. Taking an agent from cool demo to production-grade reliability is an engineering nightmare that few have solved because traditional CI/CD pipelines simply were not designed for non-deterministic software.

We are learning the hard way that shipping agents is not an AI problem. It is a systems engineering problem. Specifically, it is a testing infrastructure problem.

The Death of ‘Prompt and Pray’

For the last year, the industry has been obsessed with frameworks that promised magic. You give the framework a goal and it figures out the rest. This was the “prompt and pray” era.

But as recent discussions in the engineering community highlight, specifically the insightful conversation around 12-Factor Agents, production reality is boringly deterministic. The developers actually shipping reliable agents are abandoning the idea of total autonomy. Instead, they are building robust and deterministic workflows where large language models (LLMs) are treated as fuzzy function calls injected at specific leverage points.

The 12-Factor philosophy correctly argues that you must own your control flow. You cannot outsource your logic loop to a probabilistic model. If you do, you end up with a system that works 80% of the time and hallucinates itself into a corner the other 20%.
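
To make "own your control flow" concrete, here is a minimal sketch of a workflow in which the model is one fuzzy step inside otherwise deterministic code. Every name in it (`classify_ticket`, `route_ticket`, `ROUTES`, and the stand-in `fake_llm`) is hypothetical and exists only for illustration; it is not taken from 12-Factor Agents or any framework.

```python
from typing import Callable

# Hypothetical routing table: plain code, not the model, owns these decisions.
ROUTES = {
    "billing": "billing-queue",
    "outage": "oncall-pager",
}
FALLBACK_ROUTE = "human-triage"


def classify_ticket(text: str, llm: Callable[[str], str]) -> str:
    """Ask the model for a label, then validate the answer in ordinary code."""
    raw = llm(f"Classify this ticket as one of {sorted(ROUTES)}: {text}")
    label = raw.strip().lower()
    # The model only proposes a label; anything off-list becomes "unknown".
    return label if label in ROUTES else "unknown"


def route_ticket(text: str, llm: Callable[[str], str]) -> str:
    """Deterministic workflow with a single fuzzy function call inside it."""
    label = classify_ticket(text, llm)
    return ROUTES.get(label, FALLBACK_ROUTE)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if "down" in prompt:
        return " Outage "
    return "I think this is about refunds?"
```

With the stand-in model, `route_ticket("checkout is down", fake_llm)` lands on the pager route, while a rambling answer falls through to human triage instead of steering the workflow.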

So we build the agent as a workflow. We treat the LLM as a component rather than the architect. But once we settle on this architecture, we run headfirst into a wall that traditional software engineering solved a decade ago but which AI has reopened. That wall is integration testing.

The Trap of Evals

When teams start testing agents, they almost always start with evals.

Evals are critical. You need frameworks to score your LLM outputs for relevance, toxicity and hallucinations. You need to know if your prompt changes caused a regression in reasoning.

However, in the context of shipping a product, evals are essentially unit tests. They test the logic of the node, but they do not test the integrity of the graph.

In a production environment, your agent is not chatting in a void. It is acting. It is calling tools. It is fetching data from a CRM, updating a ticket in Jira or triggering a deployment via an MCP (Model Context Protocol) server.

The reliability of your agent is not just defined by how well it writes text or code. It is defined by how consistently it handles the messy and structured data returned by these external dependencies.

The Integration Nightmare

This is where the platform engineering headache begins.

Imagine you have an agent designed to troubleshoot Kubernetes pod failures. To test this agent, you cannot just feed it a text prompt. You need to put it in an environment where it can do several things. It must call the Kubernetes API or an MCP server wrapping it. It must receive a JSON payload describing a CrashLoopBackOff. It must parse that payload. It must decide to check the logs. Finally, it must call the log service.

If the structure of that JSON payload changes, or if the latency of the log service spikes, or if the MCP server returns a slightly different error schema, your agent might break. It might hallucinate a solution because the input context did not match its training examples.
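
As a rough sketch of that sensitivity, the snippet below reads a pod status payload and picks the next step. The `status.containerStatuses[].state.waiting.reason` path follows the shape of a Kubernetes Pod object; the function names and the `fetch_logs`, `no_action` and `escalate` labels are assumptions made up for this example.

```python
import json
from typing import List


def crashlooping_containers(status: dict) -> List[str]:
    """Return names of containers whose waiting reason is CrashLoopBackOff."""
    names = []
    for cs in status.get("containerStatuses") or []:
        if not isinstance(cs, dict):
            continue  # tolerate unexpected entries instead of crashing
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            names.append(cs.get("name", "<unnamed>"))
    return names


def next_action(raw_payload: str) -> str:
    """Choose the agent's next step from a raw tool response."""
    try:
        pod = json.loads(raw_payload)
    except json.JSONDecodeError:
        return "escalate"  # not JSON at all
    status = pod.get("status") if isinstance(pod, dict) else None
    if not isinstance(status, dict):
        return "escalate"  # the schema drifted: no usable status block
    return "fetch_logs" if crashlooping_containers(status) else "no_action"
```

The point of the `escalate` branch is that a changed schema becomes an explicit outcome the workflow can test for, rather than odd context handed to the model.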

To test this reliably, you need integration testing. But integration testing for agents is significantly harder than for standard web apps.

Why Traditional Testing Fails

In traditional software development, we mock dependencies. We stub out the database and the third-party APIs.

But with LLM agents, the data is the control flow. If you mock the response from an MCP server, you are feeding the LLM a perfect and sanitized scenario. You are testing the happy path. But LLMs are most dangerous on the unhappy path.

You need to know how the agent reacts when the MCP server returns a 500 error, an empty list or a schema with missing fields. If you mock these interactions, you are writing the test to pass rather than to find bugs. You are not testing the agent’s ability to reason. You are testing your own ability to write mocks.

The alternative to mocking is usually a full staging environment where you spin up the agent, the MCP servers, the databases and the message queues.

But in a modern microservices architecture, spinning up a duplicate stack for every pull request is prohibitively expensive and slow. You cannot wait 45 minutes for a full environment provision just to test if a tweak to the system prompt handles a database error correctly.

The Need for Ephemeral Sandboxes

To ship production-grade agents, we need to rethink our CI/CD pipeline. We need infrastructure that allows us to perform high-fidelity integration testing early in the software development life cycle.

We need ephemeral sandboxes.

A platform engineer needs to provide a way for the AI developer to spin up a lightweight, isolated environment that contains:

The version of the agent being tested.
The specific MCP servers and microservices it depends on.
Access to real (or realistic) data stores.

Crucially, we do not need to duplicate the entire platform. We need a system that allows us to spin up the changed components while routing traffic intelligently to shared and stable baselines for the rest of the stack.

This approach solves the data fidelity problem. The agent interacts with real MCP servers running real logic. If the MCP server returns a complex JSON object, the agent has to ingest it. If the agent makes a state-changing call, such as restarting a pod, it actually hits the service or a sandboxed version of it. This ensures the loop is closed.

This is the only way to verify that the workflow holds up.

Shifting Left on Agentic Reliability

The future of AI agents is not just better models. It is better DevOps.

If we accept that production agents are just software with fuzzy logic, we must accept that they require the same rigor in integration testing as a payment gateway or a flight control system.

We are moving toward a world where the agent is just one microservice in a Kubernetes cluster. It communicates via MCP to other services. The challenge for platform engineers is to give developers the confidence to merge code.

That confidence does not come from a green checkmark on a prompt eval. It comes from seeing the agent navigate a live environment, query a live MCP server and execute a workflow successfully.

Conclusion

Building the agent is the easy part. Building the stack to reliably test the agent is where the battle is won or lost.

As we move from internal toys and controlled demos to customer-facing products, the teams that win will be those that can iterate fast without breaking things. They will be the teams that abandon the idea of “prompt and pray” and instead bring production fidelity to their pull request (PR) review. This requires a specific type of infrastructure focused on request-level isolation and ephemeral testing environments that work natively within Kubernetes.

Solving this infrastructure gap is our core mission at Signadot. We allow platform teams to create lightweight sandboxes to test agents against real dependencies without the complexity of full environments. If you are refining the architecture for your AI workflows, you can learn more about this testing pattern at signadot.com.

Ramp’s Inspect shows closed-loop AI agents are software’s future

Signadot — Thu, 12 Feb 2026 15:23:51 +0000

Read this blog on Signadot.

The recent release of the background coding agent Inspect by Ramp’s engineering team serves as a definitive proof point that closed-loop agentic systems are the future of software development. It has transformed coding agents into truly autonomous engineering partners, and it is fundamentally changing the way agents deliver software.

Whether teams use a custom cloud development environment (CDE) like Ramp’s or another approach, the signal is clear: Teams need to solve for this kind of autonomy or risk getting left behind. Modern engineers need access to coding agents that do not just generate code but also run it, verify the output, and iterate on the solution until it works.

This distinction represents a fundamental shift. The industry has been focused on optimizing the “brain” of agents, solving for context windows and reasoning. Ramp’s success validates that the “body” matters just as much.

The ability to interact with a runtime environment is what transforms code from a hypothesis into a solution. This verification loop separates truly autonomous coding agents from those that rely on humans to validate their work.

The open-loop bottleneck

Modern coding agents are impressive. They can plan complex refactors and generate thousands of lines of code. However, these agents typically operate in an open loop. They rely on the developer to act as the runtime environment. The agent proposes a solution. The human must compile the code, run the tests, and either interpret the error messages or feed them back to the agent. The cognitive load of verification remains with the user.

This workflow caps developer velocity. The speed of the agent is irrelevant if the verification process is slow. We have optimized code generation to be near instantaneous, but verification remains bound by human bandwidth and linear CI pipelines.

Inspect demonstrates that closing that loop unlocks a new category of velocity. By giving the agent access to a sandbox to run builds and tests, the agent transitions from text generator to task completer. It hands off a verified solution rather than a draft.

The impact is measurable. Ramp reported that its internal adoption charts went vertical. Within months, approximately 30% of all pull requests merged to its frontend and backend repositories were written by Inspect. This penetration suggests closed-loop agents are a step function change in productivity, not a marginal improvement.

The economics of curiosity

The value proposition of closed-loop agents is not just delivering code faster. It is about the parallelization of solution discovery.

In traditional workflows, exploring refactors or library upgrades is expensive. It requires context switching, stashing work and fighting dependency conflicts. Because experimentation costs are high, we experiment less. We stick to safe patterns to avoid the time sink of failure.

Background agents change the economics of curiosity. If an engineer can spin up 10 concurrent agent sessions to explore 10 architectural approaches, the cost of failure drops significantly.

Consider a team migrating a legacy component. Currently, this is a multiweek spike. In the new paradigm, a developer could instead task a fleet of agents to attempt the migration using different strategies. One agent might try a strangler fig pattern. Another might attempt a hard cutover. A third might focus on integration tests.

The developer then reviews results rather than typing code. The agents run in isolated sandboxes. They build, catch syntax errors, and run test suites until they achieve a green state. The developer wakes up to three potential pull requests verified against the CI pipeline and chooses the best one.
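
The fan-out described above can be sketched with a thread pool. This is a toy under stated assumptions: each "strategy" is a hypothetical callable that returns `True` when its sandboxed build and tests go green, standing in for a real agent session. It is not how Inspect or any particular product schedules work.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


def _is_green(future: Future) -> bool:
    """Treat a raised exception the same as a failed run."""
    try:
        return bool(future.result())
    except Exception:
        return False


def explore(strategies: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every strategy concurrently and return the names that went green."""
    with ThreadPoolExecutor(max_workers=len(strategies) or 1) as pool:
        futures = {name: pool.submit(fn) for name, fn in strategies.items()}
        return sorted(name for name, fut in futures.items() if _is_green(fut))
```

A failed or crashing attempt costs nothing beyond its own run, which is the economic shift the section describes: the developer only looks at the names that come back.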

Verification beyond localhost

Ramp’s Inspect platform validates within a custom-built CDE. To ensure these environments start quickly despite their complexity, a sophisticated snapshotting system keeps images warm and ready to launch. Ramp was able to extend this CDE infrastructure to also support integration testing, a brilliant engineering feat that works well for its specific context.

However, for many organizations building complex, cloud native applications with high levels of dependencies, this approach faces significant hurdles. Often, the entire stack is too large to be spun up on a single virtual machine (VM) or devpod. In these scenarios, while CDEs remain excellent for replacing local development laptops, high-fidelity integration testing requires a different approach.

To enable true autonomy in these complex environments, we need a way to perform integration testing without replicating the entire world. We can connect agents directly to a shared baseline environment using existing Kubernetes infrastructure.

In this model, the agent deploys only the modified service to a lightweight sandbox. The infrastructure uses dynamic routing and context propagation to direct specific test traffic to that sandbox while fulfilling all other dependencies from a shared, stable baseline.

This approach gives coding agents the power to execute autonomous end-to-end testing, regardless of the stack’s size or complexity. It leverages the existing cluster to provide high-fidelity context. An agent can then run integration tests against real upstream and downstream services. It sees how the change interacts with the actual message queue schema and the latency of the live database.

This closes the loop with higher fidelity while lowering the infrastructure barrier. By testing against a shared cluster, the agent can catch integration regressions that might pass in a hermetic VM without requiring the platform team to build a custom orchestration engine to support it.

The future of software delivery

The release of Inspect is a clear signal of where software development is heading. The era of the human engineer as the sole verifier is ending. We are moving toward a world where agents operate as autonomous partners capable of exploring solutions and verifying their own work.

Ramp has proven that this workflow is not science fiction. It is working in production today and is driving massive efficiency gains. The question for the rest of the industry is not whether to adopt this workflow, but how.

Whether a team chooses to build a custom platform like Ramp or adopt an existing cloud native solution like Signadot to give their agents a runtime, the imperative is the same. We must provide our agents with a body. We must close the loop between generation and verification. Once we do, we unlock a level of velocity that will define the next generation of high-performing engineering teams.

Your infrastructure isn’t ready for agentic development at scale

Signadot — Thu, 05 Feb 2026 15:53:22 +0000

Read this blog on Signadot.

I have spent the last year watching the AI conversation shift from smart autocomplete to autonomous contribution. When I test tools like Claude Code or GitHub Copilot Workspace, I am no longer just seeing code suggestions. I am watching them solve tickets and refactor entire modules.

The promise is seductive. I imagine assigning a complex task and returning to merged work. But while these agents generate code in seconds, I have discovered that code verification is the new bottleneck.

For agents to be force multipliers, they cannot rely on humans to validate every step. If I have to debug every intermediate state, my productivity gains evaporate. To achieve 10 times the impact, we must transition to an agent-driven loop where humans provide intent while agents handle implementation and integration.

The code generation feedback loop crisis

Consider a scenario where an agent is tasked with updating a deprecated API endpoint in a user service. The agent parses the codebase, identifies the relevant files, and generates syntactically correct code. It may even generate a unit test that passes within the limited context of that specific repository.

However, problems emerge when code interacts with the broader system. A change might break a contract with a downstream payment gateway or an upstream authentication service. If the agent cannot see this failure, it assumes the task is complete and opens a pull request.

The burden then falls on human developers. They have to pull down the agent’s branch, spin up a local environment, or wait for a slow staging build to finish, only to discover the integration error. The developer pastes the error log back into the chat window and asks the agent to try again. This ping-pong effect destroys velocity.

Boris Cherny, creator of Claude Code, has noted the necessity of closed-loop systems for agents to be effective. An agent is only as capable as its ability to observe the consequences of its actions. Without a feedback loop that includes real runtime data, an agent is building in the dark.

In cloud native development, unit tests and mocks are insufficient for this feedback. In a microservices architecture, correctness is a function of the broader ecosystem.

Code that passes a unit test is merely a suggestion that it might work. True verification requires the code to run against real dependencies, real network latency, and real data schemas. For an agent to iterate autonomously, it needs access to runtime reality.

The requirement: Realistic runtime environments at scale

In a recent blog post, “Effective harnesses for long-running agents,” Anthropic’s engineering team argued that an agent’s performance is strictly limited by the quality of its harness. If the harness provides slow or inaccurate feedback, the agent cannot learn or correct itself.

This presents a massive infrastructure challenge for engineering leadership. In a large organization, you might deploy 100 autonomous agents to tackle backlog tasks simultaneously. To support this, you effectively need 100 distinct staging environments.

The traditional approach to this problem fails at scale. Spinning up full Kubernetes namespaces or ephemeral clusters for every task is cost-prohibitive and slow. Provisioning a full cluster with 50 or more microservices, databases, and message queues can take 15 minutes or more. This latency is fatal for an AI workflow. Large language models (LLMs) operate on a timescale of seconds.

We are left with a fundamental conflict. We need production-like fidelity to ensure reliability, but we cannot afford the production-level overhead for every agentic task. We need a way to verify code that is fast, cheap, and accurate.

The solution: Environment virtualization

The answer lies in decoupling the environment from the underlying infrastructure. This concept is known as environment virtualization.

Environment virtualization allows the creation of lightweight and ephemeral sandboxes within a shared Kubernetes cluster. In this model, a baseline environment runs the stable versions of all services. When an agent proposes a change to a specific service, such as the user service mentioned earlier, it does not clone the entire cluster. Instead, it spins up only the modified workload containing the agent’s new code as a shadow deployment.

The environment then utilizes dynamic traffic routing to create the illusion of a dedicated environment. It employs context propagation headers to route specific requests to the agent’s sandbox. If a request carries a specific routing key associated with the agent’s task, the service mesh or ingress controller directs that request to the shadow deployment. All other downstream calls fall back to the stable baseline services.
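
A minimal sketch of that routing decision, assuming a hypothetical `x-routing-key` header and made-up service addresses. In a real cluster this logic would live in the service mesh or ingress layer rather than in application code.

```python
from typing import Dict

ROUTING_HEADER = "x-routing-key"  # hypothetical header name

# Stable baseline: one address per service (illustrative values).
BASELINE: Dict[str, str] = {
    "user-service": "user-service.staging:8080",
    "payments": "payments.staging:8080",
}

# Routing key -> services overridden by that sandbox's shadow deployment.
SANDBOXES: Dict[str, Dict[str, str]] = {
    "agent-task-42": {"user-service": "user-service-agent-task-42.sandbox:8080"},
}


def resolve(service: str, headers: Dict[str, str]) -> str:
    """Pick the sandboxed workload if the request's key overrides this service."""
    key = headers.get(ROUTING_HEADER)
    overrides = SANDBOXES.get(key, {}) if key else {}
    # Anything the sandbox does not override falls back to the baseline.
    return overrides.get(service, BASELINE[service])
```

A request tagged `agent-task-42` reaches the shadow `user-service`, while its call to `payments`, and all untagged traffic, still resolves to the baseline.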

This architecture solves the agent-environment fit in three specific ways:

Speed: Because a single container or pod is launching, rather than a full cluster, sandboxes spin up in seconds.
Cost: The infrastructure footprint is minimal. You are not paying for idle databases or duplicate copies of stable services.
Fidelity: Agents test against real dependencies and valid data rather than stubs. The modified service interacts with the actual payment gateways and databases in the baseline.

The seamless verification workflow for AI agents

The mechanics of this verification loop rely on precise context propagation, typically handled through standard propagation headers such as the W3C baggage header carried by OpenTelemetry.

When an agent works on a task, its environment is virtually mapped to the remote Kubernetes cluster. This setup supports conflict-free parallelism. Multiple agents can simultaneously work on the same microservice in different sandboxes without collision because routing is determined by unique headers attached to test traffic.
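
To illustrate how a routing key can ride along in a `baggage` header, here is a hand-rolled sketch of the comma-separated `key=value` format. The member name `routing-key` is an assumption for this example, and production code would normally lean on an OpenTelemetry SDK propagator rather than parsing the header by hand.

```python
from typing import Dict, Optional
from urllib.parse import quote, unquote

ROUTING_KEY_NAME = "routing-key"  # hypothetical baggage member name


def parse_baggage(header: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2;prop' into a dict, ignoring ';property' metadata."""
    members: Dict[str, str] = {}
    for item in header.split(","):
        pair = item.split(";", 1)[0].strip()
        if "=" not in pair:
            continue
        name, value = pair.split("=", 1)
        members[name.strip()] = unquote(value.strip())
    return members


def with_routing_key(headers: Dict[str, str], routing_key: str) -> Dict[str, str]:
    """Return a copy of headers whose baggage also carries the routing key."""
    members = parse_baggage(headers.get("baggage", ""))
    members[ROUTING_KEY_NAME] = routing_key
    out = dict(headers)
    # Simplification: member properties are dropped when the header is rebuilt.
    out["baggage"] = ",".join(f"{k}={quote(v, safe='')}" for k, v in members.items())
    return out


def routing_key_from(headers: Dict[str, str]) -> Optional[str]:
    """Read the routing key back out of an incoming request's baggage."""
    return parse_baggage(headers.get("baggage", "")).get(ROUTING_KEY_NAME)
```

Each service in the call chain has to forward the header unchanged on its outbound requests. That forwarding is what lets two agents share one cluster without seeing each other's traffic.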

Here is the autonomous workflow for an agent refactoring a microservice:

Generation: The agent analyzes a ticket and generates a code fix with local static analysis. At this stage, the code is theoretical.
Instantiation: The agent triggers a sandbox via the Model Context Protocol (MCP) server. This deploys only the modified workload alongside the running baseline in seconds.
Verification: The agent runs integration tests against the cluster using a specific routing header. Requests route to the modified service while dependencies fall back to the baseline.
Feedback: If the change breaks a downstream contract, the baseline service returns a real runtime error (e.g., 400 Bad Request). The agent captures this actual exception rather than relying on a mock.
Iteration: The agent analyzes the error, refines the code to fix the integration failure, and updates the sandbox instantly. It runs the test again to confirm the fix works in the real environment.
Submission: Once tests pass, the agent submits a verified pull request (PR). The human reviewer receives a sandbox link to interact with the running code immediately, bypassing local setup.
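
The six steps above boil down to a bounded retry loop. The sketch below assumes three hypothetical callables (`generate`, `deploy`, `run_tests`) that stand in for the agent, the sandbox and the integration suite. It shows the shape of the loop only, not Signadot's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool
    error: Optional[str] = None  # e.g. "400 Bad Request" from a baseline service


def closed_loop(
    generate: Callable[[Optional[str]], str],
    deploy: Callable[[str], None],
    run_tests: Callable[[], RunResult],
    max_attempts: int = 5,
) -> Optional[str]:
    """Generate, deploy and verify until tests pass or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        patch = generate(feedback)   # Generation (and later, Iteration)
        deploy(patch)                # Instantiation: update the sandbox
        result = run_tests()         # Verification against real dependencies
        if result.passed:
            return patch             # Submission: hand off a verified change
        feedback = result.error      # Feedback: the real runtime error
    return None                      # Give up and leave it to a human
```

The attempt cap matters: without it, an agent that cannot fix a contract break would keep redeploying indefinitely.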

Why engineering’s future is autonomous

As we scale the use of AI agents, the bottleneck moves from the keyboard to the infrastructure. If we treat agents as faster typists but force them to wait for slow legacy CI/CD pipelines, we gain nothing. We simply build a longer queue of unverified pull requests.

To move toward a truly autonomous engineering workforce, we must give agents the ability to see. They need to see how their code performs in the real world rather than just in a text editor. They need to experience the friction of deployment and the reality of network calls. This is Signadot’s approach.

Environment virtualization is shifting from a tool for developer experience to foundational infrastructure. By closing the loop, agents can do the messy and iterative work of integration. This leaves architects and engineers free to focus on system design, high-level intent, and the creative aspects of building software.

Traditional Code Review Is Dead. What Comes Next?

Signadot — Tue, 27 Jan 2026 18:22:10 +0000

Read this blog on Signadot.

I noticed a quiet shift in our engineering team recently that brought me to a broader realization about the future of software development: Code review has changed fundamentally.

It started with a pull request (PR). An engineer had used an agent to generate the entire change, iterating with it to define business logic, but ultimately relying on the agent to write the code. It was a substantial chunk of work. The code was syntactically perfect. It followed our linting rules. It even included unit tests that passed green.

The human reviewer, a senior engineer who is usually meticulous about architectural patterns and naming conventions, approved it almost immediately. The time between the PR opening and the approval was less than two minutes.

When I asked about the speed of the approval, they said they checked if the output was correct and moved on. They did not feel the need to parse every line of syntax because it was written by an agent. They spun up the deploy preview, clicked the buttons, verified the state changes and merged it.

This made sense, but it still took me by surprise. I realized that I was witnessing the silent death of traditional code review.

The Silent Death of the Code Review

For decades, the peer review process has been the primary quality gate in software engineering. Humans reading code written by other humans served two critical purposes:

It caught logic bugs that automated tests missed.
It maintained a shared mental model of the codebase across the team.

The assumption behind this process was that code is a scarce resource produced slowly. A human developer might write 50 to 100 lines of meaningful code in a day. Another human can reasonably review that volume while maintaining high cognitive focus.

But we are entering an era where code is becoming abundant and cheap. In fact, the precise goal of implementing coding agents is to generate code at a velocity and volume that by design makes it impossible for humans to keep up.

When an engineer sees a massive block of AI-generated code, the instinct is to offload the syntax-checking to the machine. If the linter is happy and the tests pass, the human assumes the code is valid. The rigorous line-by-line inspection vanishes.

The Problem: AI Trust and the Rubber Stamp

This shift leads to what I call the rubber stamp effect. We see an “LGTM” (looks good to me) approval on code that nobody actually read.

This creates a significant change to the risk profile. Human errors usually manifest as syntax errors or obvious logic gaps. AI errors are different. Large language models (LLMs) often hallucinate plausible but functionally incorrect code.

Traditional diff-based review tools are ill-equipped for this. A diff shows you what changed in the text file. It does not show you the emergent behavior of that change. When a human writes code, the diff is a representation of their intent. When an AI writes code, the diff is just a large volume of tokens that may or may not align with the prompt.

We are moving from a syntax-first culture to an outcome-first culture. The question is no longer “Did you write this correctly?” The question is “Does this do what we asked the agent for?”

Previews as the New Source of Truth

In this new world, where engineers are logic architects who offload the writing of code to agents, the most important artifact is not the code. It is the preview.

If we cannot rely on humans to read the code, we must rely on humans to verify the behavior. But to verify behavior, we need more than a diff. We need a destination. The code must be deployed to a live environment where it can be exercised.

While frontend previews have become standard, the critical gap — and the harder problem to solve — is the backend.

Consider a change to a payment processing microservice generated by an agent. The code might look syntactically correct. The logic flow seems correct. But does it handle the race condition when two requests hit the API simultaneously? Does the new database migration lock a critical table for too long?

You cannot see these problems in a text diff. You cannot even see them in a unit test mock. You can only see them when the code is running in a live, integrated environment.

A backend preview environment allows for true end-to-end verification. It allows a reviewer to execute real API calls against a real database instance. It transforms the review process from a passive reading exercise into an active verification session. We are not just checking whether the code compiles. We are checking whether the system behaves.

As AI agents write more code, the “review” phase of the software development life cycle must evolve into a “validation” phase. We are not reviewing the recipe. We are tasting the dish.

The Infrastructure Challenge: The Concurrency Explosion

However, this shift to outcome-based verification comes with a massive infrastructure challenge that most platform engineering teams are not ready for.

A human developer typically works linearly. They open a branch, write code, open a pull request, wait for review and merge. They might context switch between two tasks, but rarely more.

AI agents work in parallel. An agent tasked with fixing a bug might spin up 10 different strategies to solve it. It could open 10 parallel pull requests, each with a different implementation, and ask the human to select the best one.

This creates an explosion of concurrency.

Traditional CI/CD pipelines are built for linear human workflows. They assume a limited number of concurrent builds. If your AI agent opens 20 parallel sessions to test different hypotheses, you face two prohibitive problems: cost and contention.

First, you cannot have 20 full-scale staging environments spinning up on expensive cloud instances. Imagine spinning up a dedicated Kubernetes cluster and database for 20 variations of a single bug fix. The cloud costs would be astronomical.

Second, and perhaps worse, is the bottleneck of shared resources. Many pipelines rely on a single staging environment or limited testing slots. To avoid data collisions, these systems force PRs into a queue.

With existing human engineering teams, these queues are already a frustrating bottleneck. With multiple agents dumping 20 PRs into the pipe simultaneously, the queue becomes a deadlock. The alternative of running them all at once on shared infrastructure results in race conditions and flaky tests.

Scaling Development With Environment Virtualization

To scale agent-driven development, we cannot rely on infrastructure built for linear human pacing. We are talking about potentially hundreds of concurrent agents generating PRs in parallel, all of which need to be validated with previews. Cloning the entire stack for each one is not a viable option.

The solution is to multiplex these environments on shared infrastructure. Just as a single physical computer can host multiple virtual machines (VMs), a single Kubernetes cluster can multiplex thousands of lightweight, ephemeral environments.

By applying smart isolation techniques at the application layer, we can provide strict separation for each agent’s work without duplicating the underlying infrastructure. This allows us to spin up a dedicated sandbox for every change, ensuring agents can work in parallel and validate code end-to-end without stepping on each other’s toes or exploding cloud costs.

Conclusion

There is a clear shift happening in the way we review changes. As agents take over the writing of code, the review process naturally evolves from checking syntax to verifying behavior. The preview is no longer just a convenience. It is the only scalable way to validate the work that agents produce.

At Signadot, we are building for this future. We provide the orchestration layer that enables fleets of agents to work in parallel, generating and validating code end-to-end in a closed loop with instant, cost-effective previews.

The winners of the next era won’t be the teams with the best style guides, but those who can handle the parallelism of AI agents without exploding their cloud budgets or bringing their CI/CD pipelines to a grinding halt.

In an AI-first world, reading code is a luxury we can no longer afford. Verification is the new standard. If you cannot preview it, you cannot ship it.

Merging To Test Is Killing Your Microservices Velocity

Signadot — Mon, 19 Jan 2026 19:57:45 +0000

Read this blog on Signadot.

Frontend and data layers have evolved with branch-based previews and isolated environments. Why is the backend service layer stuck with shared staging?

If you are a platform engineer or an engineering leader, look at your current development pipeline. Is everything treated equally?

To me, it seems that there is a glaring discrepancy in the way we treat different parts of the stack.

When a frontend developer pushes code to a feature branch, tools like Vercel or Netlify immediately spin up a deploy preview. It is a unique URL, isolated from production, where they can click around and validate changes instantly.

When a database engineer needs to test a schema migration, modern platforms like Neon or PlanetScale allow them to branch the database. They get an isolated, copy-on-write clone of the production data to wreck and repair without affecting a single real user.

But what happens when a backend engineer pushes a change to one microservice in a mesh of 50?

Nothing.

There is a gaping hole in the middle of our cloud native stack. While frontend and data layers have evolved to embrace branch-based development, the backend service layer is stuck in the stone age of shared environments.

This isn’t just an annoyance. It is the primary bottleneck preventing teams from truly shifting left.

The Merge To Validate Anti-Pattern

In most distributed architectures, a developer working on a backend service cannot realistically run the entire platform on their laptop. It is too heavy.

So they rely on unit tests and mocks. But we all know that mocks are liars. They do not catch the contract drift between services or the latency issues that only appear over the network.

To get real validation, the developer has to merge their branch to the main trunk so it can be deployed to a shared staging environment.

This is where velocity goes to die.

The queue: Developers wait in line to deploy to staging.
The block: If one developer breaks staging, everyone is blocked.
The noise: Testing fails, but is it your code, or did someone else deploy a bad config to the auth-service five minutes ago?

We have normalized this dysfunction. We treat staging as a fragile, sacred monolith. But in an era where we want to deploy multiple times a day, merging to trunk just to see if your code works is backward. It is like pouring concrete before you have checked the blueprints.

The Solution: Service Branching

We need to bring the Vercel and Neon experience to the Kubernetes backend. We need service branching.

The goal is simple. Every git branch should result in a testable, isolated environment.

However, the physics of microservices makes this hard. You cannot duplicate a cluster with 100+ services for every single pull request. The cost and spin-up time would be prohibitive.

The solution is not duplication. It is isolation.

Imagine a base environment, your existing staging cluster, that runs the stable version of all your services. When a developer pushes a change to a specific service, the platform shouldn’t clone the cluster. It should simply spin up a lightweight sandbox containing only the modified service.
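To make the footprint argument concrete, here is a back-of-the-envelope sketch. The numbers (100 services, 20 open pull requests, one changed service per pull request) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def workloads_with_duplication(services: int, open_prs: int) -> int:
    """Every pull request gets its own full copy of every service."""
    return services * open_prs


def workloads_with_isolation(services: int, open_prs: int, changed_per_pr: int) -> int:
    """One shared baseline, plus only the changed services for each pull request."""
    return services + open_prs * changed_per_pr


# Illustrative inputs only: 100 services, 20 open PRs, 1 changed service per PR.
duplicated = workloads_with_duplication(services=100, open_prs=20)
isolated = workloads_with_isolation(services=100, open_prs=20, changed_per_pr=1)
print(duplicated, isolated)  # 2000 120
```

Under these assumptions, the duplicated footprint grows with services times pull requests, while the isolated footprint grows only with the number of changed services.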

Smart routing does the rest:

Standard traffic flows through the stable baseline.
Test traffic is intercepted and rerouted only to the sandboxed service.
If the sandboxed service needs to call other services, it routes back into the stable baseline.

This gives the developer the experience of a full, dedicated environment with a fraction of the infrastructure footprint.
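The three routing rules above can be sketched as a small lookup. This is a minimal illustration, not Signadot's implementation: the header name x-sandbox-id, the Router class, and the service addresses are all hypothetical, and the sketch assumes test requests carry a sandbox identifier that is propagated on every downstream call.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical header used to tag test traffic; the article does not name one.
ROUTING_HEADER = "x-sandbox-id"


@dataclass
class Router:
    # Service name -> address of the stable baseline version.
    baseline: Dict[str, str]
    # (sandbox id, service name) -> address of the sandboxed version.
    sandboxes: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def resolve(self, service: str, headers: Dict[str, str]) -> str:
        """Pick the destination for one request to `service`."""
        sandbox_id = headers.get(ROUTING_HEADER)
        if sandbox_id is not None:
            override = self.sandboxes.get((sandbox_id, service))
            if override is not None:
                # Test traffic for a service that this sandbox overrides.
                return override
        # Untagged traffic, or a service the sandbox did not change.
        return self.baseline[service]


router = Router(
    baseline={"checkout": "checkout.staging:8080", "auth": "auth.staging:8080"},
    sandboxes={("feat-xyz", "checkout"): "checkout.feat-xyz:8080"},
)

tagged = {ROUTING_HEADER: "feat-xyz"}
print(router.resolve("checkout", {}))      # untagged traffic stays on the baseline
print(router.resolve("checkout", tagged))  # tagged traffic reaches the sandbox
print(router.resolve("auth", tagged))      # unchanged dependency falls back to the baseline
```

The fallback in the last line of resolve is what lets a sandbox hold only the modified service: every dependency it does not override is served by the shared baseline.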

The New Mental Model: Git Equals Environment

For this to work at scale, platform engineers need to provide a clean mental model that maps source code directly to infrastructure.

This is what the new standard looks like:

Trunk (main) corresponds to the baseline environment (staging). This is the source of truth. It represents the stable state of the world where all services are interacting as expected.
Feature branch (feat-xyz) corresponds to a sandbox environment. This is ephemeral. It lives only as long as the PR is open. It contains only the delta of the services that have changed in that specific branch.

When a developer opens a PR, they do not need to think about clusters or namespaces. They just get a dedicated playground that mirrors their branch perfectly.
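One way a platform team might encode this mapping is sketched below. It is an assumption-laden illustration: the sandbox naming scheme, the environment_for and changed_services helpers, and the services/ directory layout are all hypothetical, chosen only to show trunk mapping to the baseline and a feature branch mapping to a sandbox that holds just the changed services.

```python
import re
from typing import Dict, List

TRUNK = "main"
BASELINE_ENV = "staging"


def environment_for(branch: str) -> str:
    """Map a git branch to an environment name (hypothetical naming scheme)."""
    if branch == TRUNK:
        return BASELINE_ENV
    # Normalize the branch name into a DNS-friendly slug.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return f"sandbox-{slug}"


def changed_services(changed_paths: List[str], service_roots: Dict[str, str]) -> List[str]:
    """Return the services whose source directory contains a changed file."""
    hits = set()
    for name, root in service_roots.items():
        prefix = root.rstrip("/") + "/"
        if any(path.startswith(prefix) for path in changed_paths):
            hits.add(name)
    return sorted(hits)


# Assumed repository layout: one directory per service under services/.
roots = {"checkout": "services/checkout", "auth": "services/auth"}
diff = ["services/checkout/handler.py", "README.md"]

print(environment_for("main"))        # staging
print(environment_for("feat-xyz"))    # sandbox-feat-xyz
print(changed_services(diff, roots))  # ['checkout']
```

Only the services returned by changed_services would need to be deployed into the sandbox; everything else is resolved against the baseline.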

The Holy Grail: The Full Virtual Stack

When you combine this service branching approach with the existing tools for frontend and database branching, you unlock something powerful: a full virtual stack per branch.

Imagine a workflow where a developer creates a branch, and magically, a complete, isolated environment materializes. To the developer, it feels like they have their own private copy of the entire company’s infrastructure.

This includes the frontend, backend services and database schemas, all aligned to the specific code changes on that branch.

They can run end-to-end integration tests on their branch before merging. They can hand a URL to a product manager to demo the feature. They can validate complex migrations safely. It is a dedicated reality for their feature, created instantly and destroyed just as quickly.

Why This Matters: Speed and Quality at Scale

This model shifts the paradigm from serial blocking to massive parallelism.

Remove the bottleneck: Large engineering teams no longer have to queue up for staging. You can have 10, 50 or 100 developers and agents testing simultaneously without stepping on each other’s toes.
True shift left: Integration testing happens during development, not after the merge. You catch the bug when you write it, not three days later when the staging build fails.
More quality, faster: When testing is easy and isolated, people do more of it. We stop fearing deployments and start treating them as routine.

The result is a software delivery pipeline that is both significantly faster and more stable.

Closing the Gap

The technology to do this exists. The patterns are proven. It is time for platform teams to stop managing static environments and start managing dynamic, ephemeral workflows.

If you are looking to implement this service branching layer to complete your testing strategy, this is exactly what Signadot was built for. Signadot provides the orchestration layer that brings request-based isolation to Kubernetes.

Stop merging to test. Start branching to validate.