It still feels a little strange to have AI writing practically all the code — but I decided to give it a real shot on this new project. A bit of co...
asking AI to pick your project is the underrated trap in this workflow. it'll suggest something technically tractable but not necessarily worth building - you end up optimizing for code-plausibility over actual problem fit.
The "honest breakdown" framing is what made me stop and read. I built secure-log2test in a similar problem space (turn structured logs into pytest cases), and I went the opposite way on the AI question: zero LLM in the generation path, fully deterministic templating with Jinja2. Cost reasoning was part of it, but the bigger driver was reproducibility. The team I work with cannot have a regression suite that emits subtly different assertions across runs.
What you described about cleaning the AI output until it matched what you would have written manually resonates. The deterministic-vs-AI tradeoff in test generation feels like it lands differently depending on whether the output is meant to be read once or shipped to a CI lane.
Repo for reference: github.com/golikovichev/secure-log2test
The honesty here is appreciated. Letting an AI write the bulk of the logic for something as critical as a performance testing tool is a great exercise in code review and architecture validation. It’s one thing for the AI to generate the connection pools, but ensuring the metrics are actually statistically significant is where the human element still shines.
The "AI wrote the benchmarks so the benchmarks are suspect" problem you hit is real. I've seen the same pattern: an LLM generates a test suite that accidentally tests the fast path 90% of the time, and you walk away thinking your DB is 10x faster than it actually is.
One thing that helped me: splitting the benchmark into two layers. Layer 1 is hand-written adversarial queries — joins that force index misses, skewed key distributions, concurrent writes during reads. Layer 2 is where AI helps, generating variations of those adversarial patterns to catch regressions. The human designs the stress points, the AI fills in the coverage.
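A rough sketch of that two-layer split, under stated assumptions: the table names, the queries, and the `Scenario` and `variations` names are hypothetical, and a plain parameter sweep stands in for the AI-generated variations so the example stays runnable.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    name: str
    sql: str           # query text after parameters are filled in
    stress_point: str  # what the human designed this case to hit


# Layer 1: hand-written adversarial seeds (hypothetical schema and queries).
SEEDS = [
    Scenario(
        "skewed_key_lookup",
        "SELECT * FROM orders WHERE customer_id = {key} LIMIT {limit}",
        "skewed key distribution",
    ),
    Scenario(
        "unindexed_join",
        "SELECT o.id FROM orders o JOIN notes n ON n.body = o.ref "
        "WHERE o.id > {key} LIMIT {limit}",
        "join that forces an index miss",
    ),
]


def variations(seed, keys, limits):
    """Layer 2: expand one seed into variations that keep its stress point."""
    for key, limit in product(keys, limits):
        yield Scenario(
            name=f"{seed.name}_k{key}_l{limit}",
            sql=seed.sql.format(key=key, limit=limit),
            stress_point=seed.stress_point,
        )


suite = [v for seed in SEEDS for v in variations(seed, keys=[1, 1000], limits=[10, 10000])]
```

Every generated case carries the `stress_point` of the seed it came from, so a reviewer can always trace a variation back to the human-designed scenario it is supposed to exercise.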
Did you end up mixing hand-crafted test cases with the AI-generated ones, or was the whole suite LLM-produced? I'd be curious what ratio actually caught real perf bugs.
For this project, since it was my first time working with this type of test, the test cases were 100% generated by AI. I wanted to see if it could understand the context, but I noticed that at times it went off the rails, which confirms what you said. In real-world scenarios, I would provide better context to prevent that from happening.
The variations-vs-scenarios split is the framing in this post that scales beyond databases. Once you ask the model to invent the adversarial case rather than execute a variation of one you've already designed, you lose the property that lets you trust the output — you knew what the bug looked like before you asked. The 80% codegen number is impressive, but the more interesting metric would be: how much of that 80% you'd still have written the same way after seeing the LLM's first draft. That ratio is the real productivity gain, and I suspect it's lower than line count suggests.
the N+1 detection use case is the right one to start with — it's the failure mode that hides longest in development and only shows up once you're under real load.
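For readers unfamiliar with the pattern, one simple way to flag N+1 queries is to normalize literals out of the captured SQL and count repeated shapes. This is a sketch of that general idea, not the article's implementation; `normalize`, `find_n_plus_one`, and the threshold of 5 are assumptions.

```python
import re
from collections import Counter

# Matches quoted string literals and standalone integers.
_LITERAL = re.compile(r"'[^']*'|\b\d+\b")


def normalize(sql):
    """Collapse whitespace, lowercase, and replace literals with '?'."""
    return _LITERAL.sub("?", " ".join(sql.split())).lower()


def find_n_plus_one(queries, threshold=5):
    """Return query shapes that repeat at least `threshold` times in one request."""
    counts = Counter(normalize(q) for q in queries)
    return {shape: n for shape, n in counts.items() if n >= threshold}


# One query for the parent rows, then one query per parent row: the N+1 shape.
captured = ["SELECT * FROM posts"] + [
    f"SELECT * FROM comments WHERE post_id = {i}" for i in range(1, 11)
]
suspects = find_n_plus_one(captured)
```

With only a handful of rows in a development database the repeat count stays under any sensible threshold, which is part of why this failure mode tends to surface only under real load.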
the honest bit is what I appreciate. AI picked the idea too? we've started doing that for internal tooling: give it a list of known pain points and ask it to rank by complexity vs value. the project scoping is surprisingly good; the edge case coverage is where it falls apart every time.
did you find the AI-generated test assertions were reliable, or did you end up rewriting most of them by hand?
Some of the queries weren't quite realistic—the AI kind of “went off the rails”—but I didn't rewrite any of them. The main goal was to understand how it generated those results, so I could see how to improve them when I provide it with context.
This was more interesting than the usual “AI will replace everything” posts.
The biggest thing I’ve noticed is that AI tools become much more useful once projects grow beyond a few files.
For small prototypes: