I used AI coding agents as my first approach for every task during a full work week. My raw line output went up about 40 percent. The amount of code that actually shipped to production without needing significant rework stayed roughly the same. The agents were fast at the repetitive parts. They could not help me with the parts that were actually hard, which are the parts where you need to understand why a system was built a certain way by someone who is no longer on the team, or whether adding a new component is worth the operational cost your team can barely handle right now. The bottleneck was never typing.
Why I did this
Every engineering Slack channel and Substack newsletter I read has been one long argument about AI coding agents since early 2025. One side says software engineering is basically over and that agents will be writing most production code. The other side says the agents are fancy autocomplete and the real work of engineering, making decisions when you do not have all the information, is still a human problem.
I kept going back and forth depending on which demo I had most recently watched, which is a bad way to form an opinion about anything. So I stopped watching demos and decided to collect my own data. One full work week, leaning on the agents as heavily as I could for my actual job, and tracking what happened.
What I used
The codebase is a mix of Python and Go and the current work involves maintaining a handful of microservices, dealing with message queues, Postgres, Redis, and a set of third party API integrations that each have their own authentication quirks.
For the experiment I used Cursor with agent mode turned on for code generation across files, Claude Code for longer reasoning tasks where I wanted the model to look at an entire service and suggest changes, and a custom internal tool our team built that reads Jira tickets and suggests implementation plans.
The rule was simple: if I would normally open a file and start writing, I would instead describe what I wanted to the agent and let it go first. Then I would review and correct rather than writing from scratch.
Where the agents were actually useful
On Tuesday I needed to spin up a new service. Kafka consumer, schema validation, some business rules, write results to Postgres. I have built this kind of thing enough times that the structure is predictable. I described the requirements and Cursor generated about 80 percent of it in around 12 minutes. The consumer setup, the Pydantic models, the SQLAlchemy layer, the error handling. It was clean and mostly correct; once I adjusted the logging to match our team conventions, it was ready.
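For a sense of scale, here is a trimmed sketch of the kind of scaffolding that came out: a Pydantic model for schema validation, a consumer loop, and a thin SQLAlchemy write path. This assumes confluent-kafka and Pydantic v2, and the topic, model, and table names are placeholders rather than the real ones.

```python
# Trimmed sketch of the generated scaffolding. Assumes confluent-kafka,
# Pydantic v2, and SQLAlchemy 1.4+; all names here are illustrative.
import logging

from confluent_kafka import Consumer
from pydantic import BaseModel, ValidationError
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session, declarative_base

logger = logging.getLogger(__name__)
Base = declarative_base()


class OrderEvent(BaseModel):
    # Schema validation for the incoming message payload.
    order_id: int
    status: str


class OrderRecord(Base):
    # Persistence model for the validated result.
    __tablename__ = "order_events"
    id = Column(Integer, primary_key=True)
    order_id = Column(Integer, nullable=False)
    status = Column(String, nullable=False)


def run(consumer: Consumer, engine) -> None:
    consumer.subscribe(["orders"])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        try:
            event = OrderEvent.model_validate_json(msg.value())
        except ValidationError:
            # Malformed messages are logged and skipped, not retried.
            logger.warning("dropping malformed message", exc_info=True)
            continue
        with Session(engine) as session:
            session.add(OrderRecord(order_id=event.order_id, status=event.status))
            session.commit()
```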
That kind of work, the repetitive structural stuff that follows a pattern you already know, is where the agents are genuinely good. They are fast and they do not make the dumb typos that I make at 4 PM on a Tuesday.
Test writing was also surprisingly good. I had a backlog task to add test coverage to a service that had been shipped quickly with minimal tests. I pointed Claude Code at the directory and asked for unit tests on the core modules. It produced a solid suite that covered the main paths, caught a boundary condition I had missed in one of the validation functions, and only needed three tests adjusted because it made wrong assumptions about how we mock the database layer. That would have been most of my Wednesday afternoon done manually. The agent did it in twenty minutes.
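The adjustments were all of the same flavor: the tests needed to follow the team's convention of injecting a fake session rather than the agent's guess about what to patch. A hypothetical illustration, with module and function names invented:

```python
# Hypothetical illustration of the mocking convention: inject a fake
# session object instead of patching the engine. Names are invented.
from unittest.mock import MagicMock

import pytest

from orders_service import repository  # hypothetical module under test


@pytest.fixture
def fake_session():
    return MagicMock()


def test_save_event_commits_once(fake_session):
    # The repository should add the record and commit exactly once.
    repository.save_event(fake_session, {"order_id": 1, "status": "shipped"})
    fake_session.add.assert_called_once()
    fake_session.commit.assert_called_once()
```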
It handled documentation too. I needed to onboard a new teammate onto a service I wrote six months ago. Instead of spending an hour writing up a walkthrough, I had Claude Code analyze the service and produce an explanation of the architecture and the data flow. It was about 90 percent accurate and gave the new engineer something to read before asking me questions, which meant I spent the onboarding time on their actual questions instead of on prose.
Where they fell apart
Monday morning we had an incident. A downstream service was getting malformed data from one of our endpoints. The root cause turned out to be an interaction between our serialization layer and a schema change that another team had made. Figuring that out required reading the Git history of two repos, finding a Slack thread from three weeks earlier that explained why the schema change had been made, and then reasoning about how the serialization would behave differently under a specific race condition between the old and new versions.
I tried to use Claude Code for this and gave it the relevant files and the error logs. It generated several guesses that were plausible but wrong, because the actual answer depended on context that is not in the code. It was in conversations, in commit messages, in the organizational memory of why someone made a decision three weeks ago. The agent could read the code. It could not read the room.
On Wednesday I had to decide whether to add a cache to a service that was slow under load. The textbook answer is yes, obviously, add a cache. The agent gave me the textbook answer. But the textbook answer did not account for the fact that we had just shrunk our on call rotation and nobody had time to babysit another cache layer. It did not know that the database team had a query optimization sprint planned for next month that might fix the latency at the source. And it definitely did not know that our last caching attempt had caused a consistency bug that took two weeks to untangle.
The agent produced a technically correct recommendation that would have been the wrong decision in our specific situation. That gap between the general answer and the right answer for this team at this moment is where the human work actually lives.
On Friday I was implementing billing logic. Proration across subscription tiers, timezone handling for billing cycles, grandfathering rules for legacy customers. The business rules lived across three spec documents and two Slack threads with the product manager. The agent could generate the structure, but it kept getting the edge cases wrong where the rules interacted with each other. Every fix I made introduced a new regression somewhere else. After ninety minutes of going back and forth with the agent, I closed the tool and wrote the logic myself in about the same amount of time, because at that point reviewing and correcting was costing me more than just thinking through the problem on my own.
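To make the edge-case interaction concrete, here is a toy version of the kind of rule collision involved. It is nothing like the real logic; the tiers, rates, and grandfathering rule are invented purely to show how two individually simple rules change each other's meaning.

```python
# Toy example only: invented tiers, rates, and rules. The point is that
# proration and grandfathering each look simple until they combine.
from decimal import Decimal

TIER_RATES = {"basic": Decimal("10"), "pro": Decimal("30")}


def prorated_upgrade_charge(old_tier: str, new_tier: str, days_left: int,
                            days_in_cycle: int, grandfathered: bool) -> Decimal:
    # Rule 1: charge the rate difference, prorated for the rest of the cycle.
    # Rule 2: grandfathered customers keep their legacy rate.
    # Combined, rule 2 silently changes what rule 1 is prorating, so a
    # mid-cycle upgrade can come out to zero for a legacy customer.
    old_rate = TIER_RATES[old_tier]
    new_rate = old_rate if grandfathered else TIER_RATES[new_tier]
    fraction_left = Decimal(days_left) / Decimal(days_in_cycle)
    return (new_rate - old_rate) * fraction_left
```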
Numbers
At the end of the week I compared my Git activity to a normal week.
Lines committed were up about 40 percent. Almost all of that came from the new service and the test generation, both of which produce a lot of code quickly.
Pull requests merged went from my usual four to five. The extra one was the new service.
I spent roughly three hours across the week reviewing and correcting agent generated code, which ate into some of the time I saved by not writing it myself.
No production incidents from code I shipped that week, but that was because I reviewed everything carefully before merging. If I had trusted the output without checking, at least two sections would have caused problems based on the errors I caught during review.
What I actually think after doing this
The agents are good at the parts of the job that were already the easiest parts. Repetitive service setup, test boilerplate, documentation that summarizes code you already wrote. They are fast at those things and they produce reasonable output.
They are not good at the parts that make engineering hard. Understanding why a system was built the way it was. Knowing the team well enough to factor operational capacity into an architecture decision. Debugging across service boundaries when the root cause is in a Slack thread, not a stack trace. Writing business logic where the rules contradict each other in ways that only show up at the edges.
The people who should be worried are the ones whose main contribution has been cranking out predictable code on well understood problems, because the agents can now do that faster. The people who should not be worried are the ones who spend most of their time in the messy middle, making judgment calls that depend on context the agents cannot access.
I do not think the agents are replacing engineers. I think they are replacing the part of engineering that engineers already did not find very interesting. The part that is actually hard, the part that makes you stare at your screen and think for twenty minutes before typing anything, is still yours.
If you are anxious about AI agents taking your job, run this experiment yourself. Use the tools hard for a week. See what they speed up and where they stall out. Form your own opinion from your own data instead of from someone else’s demo video or Twitter takes.
My take after the week is pretty simple. I type less now. (Also, Whispr Flow is a blessing when you need to type less and speak more.) I think the same amount.