Adrian Jiga

Posted on Jun 10 • Edited on Jun 13

Six Months of AI Writing My Tests: What Got Better, What Got Worse

#testing #automation #programming #ai

Back in January, I wrote about falling in love with Claude Code. That post was the honeymoon phase: the first debugging session that saved me two hours, the first generated test that actually matched my conventions, the feeling that something fundamental had shifted.

This is the six-months-later post. The one nobody writes, because "it's complicated" doesn't get clicks the way "this changes everything" does.

Here's the short version: I barely write test code by hand anymore, my output has multiplied, and I've shipped tools I wouldn't have had time to build otherwise. I've also become a full-time code reviewer of work I didn't write, I've watched AI quietly drift away from my conventions when I wasn't paying attention, and I feel further behind the field than I did when I started.

Both halves are true. Let me show you what six months actually looked like.

What got better

The honest answer: more than I expected, and not where I expected it.

My testing workflow became a pipeline

Six months ago, writing tests for a new feature meant reading the Jira ticket, poking at the implementation, writing a manual test plan, then translating that into automation. Each step done by hand, each step a context switch.

Now it's a chain, and AI runs most of it:

Claude Code reads the Jira ticket and the actual application code, then drafts a manual testing plan in markdown.
From there, it drafts an API testing plan, which gets converted into a Postman collection.
Once I've reviewed both (and actually run them), it generates the Cypress scenarios using the reviewed plans as the source of truth.

The important words in that list are reviewed and run. I don't just read the plans and nod along. I execute the manual testing plan against the application. I run the Postman collection and watch the real responses come back. Only after both have proven themselves against the actual system do they become the source for the Cypress specs. AI drafts, reality validates, then automation locks it in.

This matters because a test plan that's never been executed is just a plausible-sounding document, and "plausible-sounding" is exactly what AI is best at producing. Running everything first is how I catch the assumptions that read fine but don't survive contact with the real application. When I skip that discipline, things go sideways fast (more on this later).

The result: the boring translation work (ticket to plan, plan to collection, collection to spec) is no longer my job. Deciding what to test still is. That part got more important, not less.

I shipped tools, not just tests

This is the part I didn't expect. The time AI freed up didn't just go into more tests. It went into building things that were on my "someday" list for years.

PostmanToCypressConverter. A Node CLI tool that takes a Postman collection and generates Cypress specs:

node index.js --input fixtures/sample.collection.json --output output/

Sample output:

Wrote to /output
  Converted: 4 folders → 4 files
  Requests processed: 11
  Assertions converted: 14
  TODOs requiring manual review: 0

Files:
  • output/users.cy.js
  • output/posts.cy.js
  • output/admin.cy.js
  • output/search.cy.js
  • output/cypress.env.json
  • output/cypress.config.js

It started as a CLI, but it grew a UI: a small Express server with a /convert endpoint, a button that calls it, syntax highlighting for viewing the generated specs, and a zip download for the output. The kind of internal tool that used to take a sprint of "spare time" now takes a few evenings.
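To make the conversion step concrete, here is a rough sketch of how one Postman folder could become one Cypress spec. This is an illustration, not the tool's actual source: the function names (convertFolder, convertRequest, convertAssertion, convertUrl) are made up, it assumes the Postman Collection v2.1 layout (item, request, event with a test script), and it only understands pm.response.to.have.status(...). Anything else is left as a TODO for manual review.

```javascript
// Turns {{baseUrl}}/users into a template literal that reads from Cypress env.
function convertUrl(rawUrl) {
  const body = rawUrl.replace(/\{\{(\w+)\}\}/g, (_, name) => "${Cypress.env('" + name + "')}");
  return '`' + body + '`';
}

// Only status checks are translated; every other assertion becomes a TODO.
function convertAssertion(line) {
  const match = line.match(/pm\.response\.to\.have\.status\((\d+)\)/);
  return match
    ? `expect(response.status).to.eq(${match[1]});`
    : `// TODO: review manually: ${line.trim()}`;
}

// One Postman request becomes one it() block built around cy.request.
function convertRequest(item) {
  const { method, url } = item.request;
  const rawUrl = typeof url === 'string' ? url : url.raw;
  const assertions = (item.event ?? [])
    .filter((event) => event.listen === 'test')
    .flatMap((event) => event.script.exec)
    .filter((line) => /pm\.(response|expect)\b/.test(line))
    .map(convertAssertion);

  return [
    `  it(${JSON.stringify(item.name)}, () => {`,
    `    cy.request({ method: '${method}', url: ${convertUrl(rawUrl)}, failOnStatusCode: false }).then((response) => {`,
    ...assertions.map((line) => `      ${line}`),
    '    });',
    '  });',
  ].join('\n');
}

// One Postman folder becomes one describe() block, i.e. one spec file.
function convertFolder(folder) {
  return [
    `describe(${JSON.stringify(folder.name)}, () => {`,
    ...folder.item.map(convertRequest),
    '});',
  ].join('\n');
}

// Hypothetical input, shaped like a folder in a v2.1 collection.
const sampleFolder = {
  name: 'users',
  item: [{
    name: 'lists users',
    request: { method: 'GET', url: { raw: '{{baseUrl}}/users' } },
    event: [{
      listen: 'test',
      script: {
        exec: [
          'pm.test("Status is 200", function () {',
          '    pm.response.to.have.status(200);',
          '});',
          'pm.expect(pm.response.json()).to.be.an("array");',
        ],
      },
    }],
  }],
};

console.log(convertFolder(sampleFolder));
```

The TODO fallback is the part I'd keep in any version of this: a converter that silently drops an assertion it doesn't understand produces exactly the kind of plausible-looking spec that needs the most review.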

A flakiness dashboard. We run everything through GitHub Actions across multiple test repositories and several deployment environments. The dashboard pulls run data for a configurable time window (every run, every environment, including which application branch was deployed at the time) and collects pass/fail/pending/skipped counts plus the actual failures. Everything lands in an NDJSON file, and a second command generates a flakiness report from it. Guards and error handling at every step, because a flakiness tool that's itself flaky would be a bit too ironic.

This is the tool that changed how we talk about test stability. Before, "this test is flaky" was a feeling. Now it's a number with a deployment branch attached to it.
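For a sense of what "a number with a deployment branch attached" can mean, here is a minimal sketch of the report step. It is not the dashboard's real code: the record fields (runId, environment, branch, failures) and the function names are assumptions, and it simplifies by treating every run on a branch as having executed every test. A test counts as flaky here when it failed in some runs on a branch but not in all of them.

```javascript
// Parses NDJSON: one JSON object per non-empty line.
function parseNdjson(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Builds one row per (test, branch) pair that failed in some runs but not all.
function flakinessReport(records) {
  const runsPerBranch = new Map();   // branch -> number of runs
  const failedRunsPerTest = new Map(); // JSON [branch, test] -> runs where it failed

  for (const record of records) {
    runsPerBranch.set(record.branch, (runsPerBranch.get(record.branch) ?? 0) + 1);
    // A Set guards against the same test being listed twice in one run.
    for (const test of new Set(record.failures ?? [])) {
      const key = JSON.stringify([record.branch, test]);
      failedRunsPerTest.set(key, (failedRunsPerTest.get(key) ?? 0) + 1);
    }
  }

  const rows = [];
  for (const [key, failedRuns] of failedRunsPerTest) {
    const [branch, test] = JSON.parse(key);
    const totalRuns = runsPerBranch.get(branch);
    // Failing in every run means broken, not flaky, so it is left out.
    if (failedRuns < totalRuns) {
      rows.push({ test, branch, failedRuns, totalRuns, failureRate: failedRuns / totalRuns });
    }
  }
  return rows.sort((a, b) => b.failureRate - a.failureRate);
}

// Hypothetical sample data: four runs on main, one intermittent failure.
const sampleNdjson = [
  { runId: 'r1', environment: 'staging', branch: 'main', failures: [] },
  { runId: 'r2', environment: 'staging', branch: 'main', failures: ['login redirects to dashboard'] },
  { runId: 'r3', environment: 'staging', branch: 'main', failures: [] },
  { runId: 'r4', environment: 'staging', branch: 'main', failures: [] },
].map((record) => JSON.stringify(record)).join('\n');

console.log(flakinessReport(parseNdjson(sampleNdjson)));
```

Grouping by environment as well as branch would be a small change to the key. I kept it to branch here because that is the dimension that turns "it's flaky" into "it's flaky on this deployment."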

A company plugin marketplace for Claude Code. Together with one of our developers, I built an internal Claude Code plugin marketplace. The naming convention tells you who each plugin is for: a role prefix followed by the task, along the lines of qa-generate-tests, dev-review-pr, product-draft-ticket. We have over a dozen plugins now, available to everyone in the company. The workflows I figured out for myself stopped being mine and became infrastructure.

Agentic maintenance work. The newest experiment: an npm audit sweep workflow that runs one agent per repository. Each agent switches to main, pulls the latest changes, runs npm audit and npm audit fix, hunts for alternative fixes when the automatic one isn't possible, commits, pushes, and opens a PR with a structured description (Summary, Why, What Changed). Then it waits for the automated PR review to come in, reads the comments, and either fixes what they flag or marks them as resolved with an explanation of why the finding is a false positive.

Dependency hygiene across all our repos went from "the chore everyone avoids" to something that runs while I do other work.
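The mechanical part of that sweep is easy to picture as a script. What follows is a sketch of only the deterministic steps, not the agent workflow itself: the function names, the branch name, and the commit message are placeholders, creating a working branch before committing is my assumption, and opening the PR assumes the GitHub CLI (gh) is installed. Hunting for alternative fixes and answering review comments are the parts that need an agent, and they are not shown.

```javascript
const { spawnSync } = require('node:child_process');

// Ordered steps for one repository. Branch name is a placeholder.
function buildSweepPlan(branch = 'chore/npm-audit-sweep') {
  return [
    { cmd: 'git', args: ['switch', 'main'] },
    { cmd: 'git', args: ['pull', '--ff-only'] },
    { cmd: 'git', args: ['switch', '-c', branch] },
    // npm audit exits non-zero when it finds vulnerabilities, which is expected here.
    { cmd: 'npm', args: ['audit'], allowFailure: true },
    { cmd: 'npm', args: ['audit', 'fix'], allowFailure: true },
    { cmd: 'git', args: ['commit', '-am', 'chore: npm audit fix'] },
    { cmd: 'git', args: ['push', '-u', 'origin', branch] },
  ];
}

// Structured PR description with the three sections: Summary, Why, What Changed.
function prBody({ summary, why, whatChanged }) {
  return [
    '## Summary', summary, '',
    '## Why', why, '',
    '## What Changed', ...whatChanged.map((item) => `- ${item}`),
  ].join('\n');
}

// Final step: open the PR against main with the GitHub CLI.
function openPrStep(title, body) {
  return { cmd: 'gh', args: ['pr', 'create', '--base', 'main', '--title', title, '--body', body] };
}

// Runs each step inside the repository and stops at the first real failure.
function runPlan(plan, cwd) {
  for (const step of plan) {
    const result = spawnSync(step.cmd, step.args, { cwd, stdio: 'inherit' });
    if (result.status !== 0 && !step.allowFailure) {
      throw new Error(`${step.cmd} ${step.args.join(' ')} failed in ${cwd}`);
    }
  }
}
```

Nothing runs on its own here. A sweep over one repository would call runPlan with buildSweepPlan() plus an openPrStep(...) appended, passing the repository path as cwd. The commit step fails when npm audit fix changed nothing, which stops the plan before an empty PR gets opened.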

Slide decks, of all things

A small one, but real: status presentations on testing across services used to eat half a day of formatting. Now I describe the status, Claude builds the deck, I adjust. Nobody misses the old way.

What got worse

Now the half that the hype posts skip.

I write less code and review much, much more

This is the biggest shift, and I want to be precise about it: reviewing code is harder than writing it. When you write code, you build the mental model as you go. When you review AI-generated code, you have to reconstruct a mental model of something you didn't think through, and you have to do it at the speed the AI produces it, which is much faster than you ever wrote.

Most days, my job title might as well be "reviewer of code I didn't write." And the failure mode is sneaky: AI-generated code almost always looks right. It's well-formatted, well-named, plausible. The bugs aren't in the syntax. They're in an assertion that checks the wrong thing while reading like it checks the right one, or a setup step that papers over the exact condition the test was supposed to catch.

Here's what that looks like in practice:

// What Claude generated. It passes. It looks complete.
it('removes the user from the list after deletion', () => {
  cy.intercept('DELETE', '/api/users/*').as('deleteUser');
  cy.get('[data-testid="delete-user-3"]').click();
  cy.get('[data-testid="confirm-delete"]').click();
  cy.wait('@deleteUser');
  cy.get('[data-testid="user-list"]').should('be.visible');
});

Read it quickly and it's fine: intercept the request, click delete, confirm, wait, assert. But look at what it actually asserts. The list is visible. Not that user 3 is gone from it. The DELETE could return a 500, the user could still be sitting in the list, and this test stays green. The fix is two lines (assert the response status, assert the user is no longer in the list), but you only catch it if you read the assertion as skeptically as you'd read the implementation.
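Here is roughly what the fixed version could look like. Treat it as a sketch: I'm assuming a successful delete answers with 200 or 204, and that when user 3 leaves the list, its delete button goes with it. Both depend on your application, so adjust the status codes and the selector to match.

```javascript
// A possible fix: same flow, but the assertions now check the outcome.
it('removes the user from the list after deletion', () => {
  cy.intercept('DELETE', '/api/users/*').as('deleteUser');
  cy.get('[data-testid="delete-user-3"]').click();
  cy.get('[data-testid="confirm-delete"]').click();

  // Assumption: a successful delete returns 200 or 204.
  cy.wait('@deleteUser').its('response.statusCode').should('be.oneOf', [200, 204]);

  // Assumption: user 3's row (and its delete button) is removed from the list.
  cy.get('[data-testid="user-list"]')
    .should('be.visible')
    .find('[data-testid="delete-user-3"]')
    .should('not.exist');
});
```

Now a 500 from the API fails the test at the status check, and a user who is still sitting in the list fails it at the last assertion.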

That's the new job. Not writing this test, which takes two minutes, but noticing what it quietly doesn't check.

Sometimes it just goes bananas

There's no more technical way to put it. Most sessions, Claude Code is disciplined. But sometimes (usually deep into a long session, or when the task is slightly ambiguous) it drifts. It stops following the project conventions it was respecting an hour ago. It refactors something nobody asked it to touch. It invents a helper that duplicates one we already have.

If I'm paying attention, this costs me a "stop, look at how the rest of this folder does it" message and two minutes. If I'm not paying attention, it costs me a polluted PR and a cleanup session. The tool didn't get worse over six months. My understanding of how much supervision it needs got more honest.

And it's not just my agent. The audit sweep workflow has an AI reviewer checking every PR, and a meaningful share of those review comments are noise: findings that sound serious and are simply wrong for our context. So now part of the workflow is an AI reading another AI's review and explaining why it's a false positive, with me as the referee. That's a sentence I could not have written a year ago, and I'm still not sure how I feel about it.

I work with AI every day and still feel behind

This one is less about the tools and more about me, but I suspect it's universal, so I'm saying it out loud.

I use AI daily. I build agentic workflows. I maintain a plugin marketplace. By any reasonable measure, I'm deep in this. And I still feel like the field is moving faster than I can track. New models, new capabilities, new patterns: every few weeks, a better way shows up to build something I've already built.

Six months ago I thought working with AI every day would make me feel ahead of the curve. It mostly made me more aware of how fast the curve is moving. If you feel behind: so does everyone, including the people who look like they're ahead.

The thing that ties it all together

After six months, here's the conclusion I keep coming back to:

AI only amplifies what's already there.

If you have solid engineering judgment, AI multiplies your output. If you don't, AI multiplies your problems, faster than you can review them. Good engineering skills don't matter less in this setup. They matter more, because the bottleneck moved from "can you write this" to "can you tell whether this is right."

You have to be able to look at code that's clean, idiomatic, and confident, and know whether it actually does what it's supposed to do, and does it how it's supposed to. That skill only comes from understanding your languages, frameworks, and tools properly. If you're using AI to generate code in a stack you don't really understand, you're not saving time. You're taking out a loan against your future self, and the interest rate is brutal.

Who shouldn't work this way

I want to be upfront about this, because I always am.

If you're early in your career and haven't yet built the judgment to spot plausible-but-wrong code, going all-in on AI-generated tests is risky. The review skill I lean on every day was built by years of writing tests by hand, breaking things, and debugging the consequences. You can't review your way to that judgment; you have to earn it first. Use AI to learn: ask it to explain, compare approaches, challenge your understanding. But keep writing things yourself.

And if your team doesn't have real review discipline, generated code will flow into your codebase faster than anyone can honestly check it. The volume is the danger. AI doesn't remove the need for careful review; it concentrates all the risk there.

Conclusion

Six months in, I wouldn't go back. The pipeline, the tools, the marketplace: none of it existed in my pre-AI workflow, and all of it makes my team measurably better.

But the job changed shape. I went from someone who writes tests to someone who designs what gets tested and verifies what got built. Less typing, more judgment. The engineers who'll thrive in this aren't the ones who prompt the best. They're the ones who can look at a wall of confident, well-formatted code and know, from experience and not vibes, whether it's actually right.

That skill was always what separated good engineers from the rest. AI just raised the price of not having it.

Top comments (1)

Alex Shev • Jun 11

AI is useful for tests when the developer already knows what behavior needs to be protected. The risk is that it can generate a lot of plausible tests that mostly confirm the implementation instead of challenging it.

The best pattern I have found is to ask for test ideas first, then choose the cases manually, then let the agent draft the mechanics. That keeps the judgment where it belongs while still saving time on boilerplate.