It was 11:47pm and I had an agent-generated PR sitting open, waiting for merge. The pipeline was green. Tests were passing. The code looked clean. And I had absolutely no idea why it had chosen that specific implementation.
I wasn't nervous about the code. I was nervous because in ten minutes I was going to hit the merge button and the technical owner of that PR was going to be me — someone who hadn't written a single line of it.
That's when I understood the real problem with agents that generate PRs. And it's not the one that shows up in pitch decks.
AI agent PR automation: the promise and what it leaves out
Twill.ai just came out of YC S25. The pitch is clean: you send it an issue, the agent reads it, investigates the codebase, writes the code, and sends you a Pull Request ready to review. Zero friction. No need to delegate to a senior dev who's already got ten things in the backlog.
In the demo, it's magical. The agent reads the issue, navigates the repo, understands the context, writes code that compiles, and opens the PR with a reasonable description. The pipeline passes. Looks like the problem is solved.
And at a technical level, a lot of the time it is solved. That's not the problem.
The problem is what happens next.
What the pitch deck doesn't mention: epistemic responsibility
When a human developer opens a PR, there's something implicit: that person knows why they did what they did. If you ask in the review "why did you use a mutex here instead of a channel?", they can answer. If there's a production bug at 3am, that person can debug it because they have a mental model of their own decision.
With an agent, that mental model doesn't exist anywhere accessible.
I lived this working with Research-Driven Agents — systems where the agent investigates before coding, similar to what I described in my post about agents that read before they write. The code quality was noticeably better than pure vibe-coding. But the understanding of the code was still mine, if I had it, or nobody's, if I merged without really getting it.
I call this the epistemic responsibility of generated code: who holds the knowledge of why each technical decision exists in the codebase.
The experiment that shifted my perspective
For three weeks I used a coding agent on a side project — automation work on top of PostgreSQL. I'd hand it surgically precise issues. I'd get PRs back. I'd review them. I'd merge.
By the end of the sprint, the codebase worked. Tests covered the happy paths. And I could explain maybe 60% of the implementation decisions.
The other 40% was code I had read but hadn't understood deeply. Enough to approve the review. Not enough to debug at 3am.
// The agent generated this. I approved it.
// Three weeks later I couldn't remember why it used
// this specific retry strategy instead of exponential backoff
const retryWithJitter = async <T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 100
): Promise<T> => {
  let delay = baseDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      // Decorrelated jitter — the agent chose this
      // I never questioned whether it was the right call for this specific case
      delay = Math.min(
        2000,
        baseDelayMs + Math.random() * (delay * 3 - baseDelayMs)
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Unreachable');
};
Was the implementation correct? Yes. Did I know why it was correct for that specific context versus the three alternatives the agent could have chosen? Not really.
When the project grew and I had to modify that module, it took me twice as long as it would have if I'd written it from scratch myself.
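For contrast, here's the baseline the agent's comment implicitly rejects: plain exponential backoff with no jitter. This is my sketch, not the agent's output, and the names and defaults are invented for illustration:

```typescript
// Plain exponential backoff: deterministic doubling, capped, no randomness.
// Easier to reason about and to test — but clients that fail together
// also retry together, which is the thundering-herd problem jitter solves.
const retryExponential = async <T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 100,
  maxDelayMs: number = 2000
): Promise<T> => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      // Double the delay each attempt, up to the cap
      const delay = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Unreachable');
};
```

Whether jitter is worth the non-determinism depends on how many concurrent callers are hitting the same resource — exactly the kind of context the agent's PR description never captured.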
The real gotchas with agents that generate PRs
1. The PR description problem
Agents generate reasonable PR descriptions. But "reasonable" isn't the same as "useful for understanding design decisions." A description that says "implements retry logic for the database client" doesn't tell you why it chose decorrelated jitter instead of simple exponential backoff.
This becomes critical in codebases with historical architectural decisions that carry context. The agent doesn't have access to that Slack thread from six months ago where you decided not to use the obvious solution for a very specific reason.
2. The shallow review problem
When you review code you wrote, there's a different level of attention. You know what you were trying to do, so you notice the gap between what you intended and what you actually achieved.
When you review an agent's code, the risk is reviewing syntax instead of semantics. "Does it compile? Do the tests pass? Merge." That's not a code review — it's surface-level validation.
3. The distributed context problem
Every agent PR is a decision made in isolation. The agent doesn't remember that last week's PR chose a different strategy for a similar problem. The architectural coherence of the codebase becomes your exclusive responsibility — with the added burden that you didn't write the previous code either.
This connects to something I explored when looking at whether Git is ready for a world of agents: version control systems are designed to track who wrote what, not why the agent made that decision at that moment with that context.
4. The invisible attack surface problem
An agent that reads your codebase to write code is also reading your security patterns — the good ones and the bad ones. If your codebase has an insecure pattern that "works," the agent will replicate it because it's consistent with the context.
I had a case where an agent replicated an error-handling pattern that silenced specific exceptions — something that in the original codebase had a well-documented reason, but in the new context was straight-up dangerous. The code compiled. The tests passed. The bug was sitting there, waiting.
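The names below are invented for illustration — a minimal reconstruction of the shape of that bug, not the actual code. The point is that an identical try/catch shape is safe in one context and dangerous in another:

```typescript
// Stub standing in for an unreachable permission service
class AuthServiceDown extends Error {}

async function checkPermission(userId: string): Promise<boolean> {
  throw new AuthServiceDown('auth service unreachable');
}

// The pattern the agent replicated: swallow the error, return a default.
// Fine for a best-effort cache read; dangerous for an authorization check,
// because an outage becomes indistinguishable from "access denied".
async function canDeleteSilent(userId: string): Promise<boolean> {
  try {
    return await checkPermission(userId);
  } catch {
    return false; // the bug: AuthServiceDown silently becomes a denial
  }
}

// What the new context actually needed: let infrastructure errors surface
async function canDeleteLoud(userId: string): Promise<boolean> {
  try {
    return await checkPermission(userId);
  } catch (err) {
    if (err instanceof AuthServiceDown) throw err; // don't mask outages
    return false;
  }
}
```

Same shape, different semantics: in the silent version an auth outage reads as a denial, which is exactly the kind of bug that survives a green pipeline.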
This gets especially relevant when you think about verification of AI-generated code — we don't even have mature tooling to audit what was agent-generated versus human-written in a mixed codebase.
5. The cumulative effect on team knowledge
This one is the quietest and the most dangerous.
If your team starts systematically merging agent PRs, deep codebase knowledge starts eroding. Not all at once — gradually. Every PR you merge without fully understanding it is a small epistemic deficit. Six months later you have a codebase that "works" but that nobody on the team can confidently explain.
It's the opposite of the tribal knowledge problem: it's not that the knowledge is locked in one person's head. It's that it's not in anyone's head.
What Twill.ai promises vs. what the problem actually requires
I'll be straight: the technology behind these agents is genuinely impressive. What Twill.ai describes — reading an issue, navigating the codebase, generating contextually appropriate code — is hard to do well and there's clearly serious work behind it.
But the "delegate to an agent, get a PR" pitch solves the problem of generating code without touching the problem of responsibility for that code.
And in production, responsibility is the most expensive problem.
It's similar to what happened with ORMs that "abstracted away" the database: they worked perfectly until you needed to debug a slow query at 3am and the developer who'd used the ORM didn't know SQL. The agent is the ORM of code — a useful abstraction that creates comprehension dependency.
Interesting comparison to something I explored earlier: when I loaded Linux's git history into Postgres for analysis, what struck me wasn't the volume of commits but the density of context in the commit messages — every human commit explained why, not just what. Agent PRs still don't have that density.
What I'd do differently
I'm not saying "don't use agents that generate PRs." I'm saying there are conditions under which it makes sense, and conditions under which it's technical debt dressed up as productivity.
It makes sense when:
- The issue is fully specified and leaves no room for design decisions
- The scope is small and isolated (a test, a CRUD endpoint, a bugfix with an identified cause)
- The reviewer has enough context to understand why the agent chose each thing
- There's a documentation process that captures decisions, not just code
It's technical debt when:
- The issue requires architectural decisions
- The reviewer is under time pressure and is going to merge without fully understanding it
- It's the third agent PR this week and the team has lost track of what was written by whom
- There's no process to capture the context behind the agent's decisions
What I'd actually do: require the agent to produce not just the PR but a decision document — an automated Architecture Decision Record. Not the code. The alternatives it evaluated. Why it chose this one. What it sacrificed. What it assumes about the context.
That turns an agent PR into something genuinely reviewable. And it turns epistemic responsibility into something transferable.
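As a sketch of what that decision document could look like — the field names are mine, not any tool's actual output format:

```typescript
// A structured record the agent would have to fill in alongside every PR.
// Hypothetical schema: the fields mirror a classic Architecture Decision Record.
interface AgentDecisionRecord {
  decision: string;       // what was chosen
  alternatives: string[]; // what else was evaluated
  rationale: string;      // why this one, in this codebase
  tradeoffs: string[];    // what was sacrificed
  assumptions: string[];  // what the agent believes about the context
}

// Illustrative example, based on the retry PR from earlier in this post
const example: AgentDecisionRecord = {
  decision: 'Retry with decorrelated jitter, capped at 2s',
  alternatives: ['fixed delay', 'plain exponential backoff', 'no retry'],
  rationale:
    'Many concurrent workers hit the same Postgres instance; ' +
    'jitter avoids synchronized retry storms.',
  tradeoffs: ['retry timing is non-deterministic, so tests must not assert exact delays'],
  assumptions: ['failures are transient', 'the retried operation is idempotent'],
};
```

A record like this gives the reviewer something to disagree with — if the assumptions are wrong for your context, you catch it in review instead of at 3am.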
Without that, you're merging code from someone you can't call at 3am.
FAQ: AI agent PR automation
What is an AI agent that automatically generates PRs?
It's an AI system that reads an issue or task in your repository, analyzes the existing codebase, writes the code needed to solve the problem, and opens a Pull Request ready to review — no human involvement in the writing phase. Tools like Twill.ai (YC S25), Devin, and various agents built on Claude or GPT-4 do this with varying levels of sophistication.
Are agent-generated PRs reliable for production?
Depends on the scope and the review process. For well-scoped and well-specified tasks — targeted bugfixes, additional tests, simple CRUD endpoints — the technical quality is usually acceptable. The problem isn't code reliability, it's the team's ability to understand the implementation decisions well enough to maintain that code in the future.
How do I do an effective code review of an agent-generated PR?
Don't just review syntax. Ask yourself: can I explain why the agent chose this specific implementation? What alternatives existed? Is this decision consistent with previous decisions in the codebase? If you can't answer those questions, the PR isn't ready to merge — you need more context, not more green tests.
Will agents that generate PRs replace developers?
Not on any visible horizon, and specifically because of the epistemic responsibility problem. Someone has to understand the codebase deeply enough to make architectural decisions, debug complex problems, and evaluate whether the agent's decisions are appropriate for the specific context. Agents reduce the cost of generating code, not the cost of understanding software.
What security risks come with using agents that read my codebase?
Two main ones: first, the agent can replicate insecure patterns already in the codebase because it perceives them as "the project's style." Second, if the agent has access to secrets or configurations during its analysis, there's attack surface in the integration pipeline. Always review the permissions you give the agent and which parts of the repository it can read.
Does this make sense for small teams or just large companies?
For small teams the risk is higher, not lower. In a 2-3 person team, every developer has to be able to maintain any part of the codebase. If you're merging PRs you don't fully understand, the project's bus factor drops to dangerous levels — and the agent won't be around when you need to understand its own code at 3am before a critical deploy.
The uncomfortable conclusion
Twill.ai is going to get traction. The problem it solves — the friction of converting issues into code — is real and the market is going to adopt it.
But there's something that safety-critical systems learned decades ago that the software world is still processing: responsibility can't be fully delegated to a tool. The tool executes. The responsibility stays human.
The agent sends you the PR. The signature on the merge is yours.
Make sure that what you're merging is something you can explain. Not because the agent failed — but because the codebase is yours, the architecture is yours, and the 3am call is going to be yours too.
And that, for now, isn't showing up on anyone's pitch deck.
Are you using agents that generate PRs in production? I'd genuinely like to know what kind of review process you've put together. Reach out.