DEV Community

AI makes writing code easier. It doesn't make engineering easier.

Dimitris Kyrkos on June 19, 2026

The narrative is backwards. There's a narrative going around that AI is making software engineering easier. I think it's getting the dire...

xulingfeng • Jun 19

"The gap between code that exists and software that works in context" — that line hit hard. Just wrapped up my 15th story on the same gap. The team in that one also had AI hitting 97.2% coverage, but the client had 14 external dependencies and a 24-hour CI pipeline. Turns out coverage report exists ≠ production won't blow up 😅

Dimitris Kyrkos • Jun 19

The 97.2% coverage with 14 external dependencies is the perfect example, because coverage measures "did the code run," not "did the code handle what production will actually throw at it." You can hit 100% coverage with tests that all mock the external dependencies, which means you've thoroughly tested your code's behavior in a world that doesn't exist. The 24-hour CI pipeline detail is telling too, because that's usually a symptom of the same problem: the team is running a massive test suite that gives them confidence numbers without reducing risk in proportion. The gap I keep seeing is that AI makes it trivially easy to generate tests that boost coverage metrics without anyone asking "what does this test actually prove about production behavior?" Coverage went from a useful signal to a vanity metric the moment generating tests became cheaper than thinking about what to test.
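To make the mocking point concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `fetch_price` function, the `/prices/` path, the `FlakyClient` stand-in); it is not taken from the project discussed above. The mocked test executes every line of `fetch_price`, so line coverage would report 100%, yet the function still fails on an error payload that the mock never returns.

```python
from unittest.mock import Mock


def fetch_price(client, sku):
    """Return the price for a SKU from an external pricing service (hypothetical)."""
    response = client.get(f"/prices/{sku}")
    return response["price"]


def test_fetch_price_with_mock():
    # The mock always answers with a well-formed payload, so every line of
    # fetch_price runs and a coverage tool would count it as fully covered.
    client = Mock()
    client.get.return_value = {"price": 9.99}
    assert fetch_price(client, "abc") == 9.99


class FlakyClient:
    """Assumed stand-in for a production response: an error payload with no price."""

    def get(self, path):
        return {"error": "rate limited"}


def survives_production_like_response():
    # Same code path, same coverage, but a payload the mock never produced.
    try:
        fetch_price(FlakyClient(), "abc")
        return True
    except KeyError:
        return False
```

The mocked test passes, while `survives_production_like_response()` returns `False`. The coverage number is identical in both cases, which is the gap being described.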

xulingfeng • Jun 19

"Coverage turned into a vanity metric when generating tests got cheaper than thinking about what to test" — that's the whole thing in one sentence. I'm honestly tempted to repost your reply as a comment under my own article so more people see it 😂

Dimitris Kyrkos • Jun 19

Ha go for it, good ideas should travel. And honestly that framing only clicked for me because of your 97.2% coverage example, the specific number makes the absurdity concrete in a way that "coverage doesn't equal quality" never does. Drop a link to your article if you want, curious to read the full story behind the 14 external dependencies and the 24-hour pipeline.

Comment deleted

Dimitris Kyrkos • Jun 19

Just read through it. The RFP framing is what makes it land, because now the 97.2% isn't just a bad metric, it's a sales pitch that won the room while the actual product fell apart behind it. That's the part nobody wants to say out loud, that half these numbers exist to close deals not to protect production. Good stuff, gonna follow the series.

Jake Lundberg • Jun 19

agreed on the core...writing code was always the cheap part. but I'd push on the "doesn't make it easier" framing, because I think it's worse than that. when building was slow, the build itself was a brake on bad design. you'd get halfway in and feel the friction...this is dragging, the abstraction's wrong, time to back up. you caught the mistake before you'd fully paid for it. AI took that brake off. now you can build the wrong thing all the way and fast, and the design flaw stays invisible until it's big and load-bearing. so the part of the job that didn't get easier is now the part that's most expensive to get wrong

Dimitris Kyrkos • Jun 22

You're right. The friction of writing was doing design work that nobody recognized as design work. You'd feel the wrongness in your hands before you could articulate it intellectually. That physical slowness was a feedback mechanism and we ripped it out without replacing it with anything. "The part that didn't get easier is now the most expensive to get wrong" is exactly it. The cost of a bad design decision didn't change, we just removed the thing that used to catch it early.

Jake Lundberg • Jun 22

I'd put it a little differently...I think the slowness was a side effect. the real brake was that writing forced you down into the specifics, and a bad design only shows itself once you're in the details. AI lets you stay up at the level of the idea all the way to done, so you never personally hit the wall where the wrongness lives.

and "without replacing it with anything" is the part that gets me too. I haven't found a free replacement either...the old brake cost nothing, it just happened while you worked. everything that replaces it now is deliberate, and deliberate is the first thing to go when you're slammed. thorough grilling helps, interrogating the approach before you build genuinely shrinks the gap. it just can't close it, because the part you most need to catch is the forward-looking call, and that's the hardest thing to pin down well enough to even interrogate

Dimitris Kyrkos • Jun 23

The contact with details point is interesting but I think the original framing still holds. Speed and specificity aren't separate things, the slowness is what forced you into the details in the first place. You couldn't skip them because each line took effort. The brake was both. On the replacement question, I agree that deliberate is the first thing to go under pressure, which is why I lean toward automated structural checks rather than relying on people to choose rigor when they're slammed. It's not a perfect replacement but it at least creates friction that doesn't depend on someone's willpower on a Thursday at 5pm. The forward-looking call is genuinely hard to systematize though, you're right about that. That's still pattern recognition from experience and I don't think we've found a shortcut for it.
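One possible shape for an "automated structural check" is a layering rule enforced in CI. The comment above does not say which checks are meant, so this is only an assumed illustration: the function name `forbidden_imports` and the `app.db` / `app.ui` package names are made up. It uses Python's standard `ast` module to list imports that cross a boundary the team has decided should not be crossed.

```python
import ast


def forbidden_imports(source, forbidden_prefixes):
    """Return imported module names in `source` that fall under a forbidden prefix."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; treat that as no module name.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the package itself or anything nested below it.
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes):
                found.append(name)
    return found
```

A CI step could run this over every file in, say, a UI package and fail the build when the result is non-empty. The friction then comes from the pipeline instead of from someone choosing rigor under deadline pressure.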

Dirk Mattig • Jun 21

I could not agree more!
The fundamental problem with vibe coding is that the code is generated from an idea rather than a specification. The architecture & design steps are missing, exactly as you described. I never quite understood why planning has never gained widespread recognition in software development.
If people wanted to build a house, I think most of them would not go straight to a group of craftsmen and ask them to start building. But with software, this is almost the norm, and already was even before AI.
I am convinced that software engineering will eventually become pure specification. Today's developers who favor frameworks over solutions will soon have a very hard time.

Dimitris Kyrkos • Jun 22

The house analogy is perfect and I think the reason planning never gained traction in software is that software was always cheap enough to change that people convinced themselves they didn't need a blueprint. "We'll refactor later" was tolerable when writing was slow and changes were small. Now AI lets you build the whole house in a day from a napkin sketch and suddenly the missing blueprint isn't a minor shortcut, it's a structural problem. Your point about specification becoming the core skill is where I think this is heading too. The developers who thrive will be the ones who can precisely describe what needs to exist before anything gets generated.

anhmtk • Jun 20

This hits the nail on the head. "The gap between code that exists and software that works in context is widening" is probably the most accurate description of the current AI era.

As a non-tech founder who literally self-taught and built a web platform for AI agents entirely alongside LLMs, I live this paradox every single day. AI makes me feel like a wizard who can materialize features in minutes. But the moment real traffic hits, or when I have to reason about edge cases, rate-limiting, and structural maintenance, the "wizardry" fades, and the sheer necessity of true engineering judgment becomes glaringly obvious.

Building a product with AI has actually made me respect senior engineers and architects infinitely more. AI can write the functions, but it doesn't possess the empathy to understand user behavior, nor the historical judgment to prevent architectural decay.

The tools changed, but the ultimate bottleneck is still—and will always be—human engineering discipline. Thanks for writing this!

Dimitris Kyrkos • Jun 22

Your perspective as a non-tech founder building with AI is actually the most valuable one in this thread because you're experiencing both sides simultaneously. The wizard feeling and the "oh wait this doesn't hold up under real traffic" moment. The line about AI not having empathy for user behavior is underrated. It can't model the weird things real users do because it was never a user. It generates code for the spec, not for the person who's going to misuse the spec in creative ways at 2am on their phone. The fact that building with AI gave you more respect for senior engineers is telling because it means you're seeing firsthand what they actually do that the code generation step never captured.

Kartik N V J K • Jun 19

The widening-gap framing is right, and it shows up most in the parts that never make it into the prompt: error paths, edge cases, and behavior under load. When generation was slow, that thinking got forced on you line by line; now it has to be a deliberate separate step that's easy to skip when the diff already looks done. The only reliable counter I've found is deciding the failure cases and the checks up front, before any code gets generated.
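A rough sketch of what "deciding the failure cases and the checks up front" could look like in Python. The `parse_quantity` task, the case list, and both candidate implementations are invented for illustration. The failure cases are written as data before any implementation exists, and any candidate, generated or hand-written, is then run against them.

```python
# Failure cases decided before any implementation exists.
# Each entry: (description, raw input, exception type the code must raise).
FAILURE_CASES = [
    ("empty input", "", ValueError),
    ("non-numeric input", "abc", ValueError),
    ("negative quantity", "-3", ValueError),
]


def unmet_failure_cases(parse_quantity):
    """Return descriptions of the failure cases the candidate does not handle."""
    unmet = []
    for description, raw, expected in FAILURE_CASES:
        try:
            parse_quantity(raw)
        except expected:
            continue  # failed in the way that was agreed up front
        except Exception:
            pass  # failed, but not in the agreed way
        unmet.append(description)
    return unmet


def plausible_looking(raw):
    # Looks finished in a diff, but silently accepts negative quantities.
    return int(raw)


def constrained(raw):
    value = int(raw)  # raises ValueError for "" and "abc"
    if value < 0:
        raise ValueError("quantity must be non-negative")
    return value
```

Here `unmet_failure_cases(plausible_looking)` reports the negative-quantity case and `unmet_failure_cases(constrained)` reports nothing, so the list of failure cases acts as the acceptance check.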

Dimitris Kyrkos • Jun 22

"Deciding the failure cases before any code gets generated" is the single most practical piece of advice in this whole thread. That's essentially test-driven development but applied at the prompt level. Define what should break and how before you ask the AI to build the thing. If you do that the generated code has to satisfy real constraints instead of just looking plausible. The teams I've seen adopt this pattern catch problems dramatically earlier because the failure cases become the acceptance criteria, not an afterthought someone thinks about during review.

Vasyl • Jun 23

Building on the "eval infrastructure first" point a few people made here, I'd split "correct" into two halves, because not all of it is knowable upfront. The hard invariants (no spoilers, no leaked PII, no out-of-scope context) you can write as a zero-tolerance check before you build, and they often reshape the design itself. But the graded stuff (is this answer actually good?) you mostly can't specify in advance; you discover those failure modes by reading real outputs once users phrase things in ways you never imagined. So eval-first buys you the deterministic guardrails, but the quality bar you earn by looking at production. Do you write the graded eval before or after first contact with real users?

Dimitris Kyrkos • Jun 24

After. Every time. We've tried writing graded evals before real users touched it and they always tested for what we expected to go wrong, which by definition is not where the interesting failures live. The hard invariants you write first because they're binary and they protect the floor. But the quality bar emerges from looking at actual outputs against actual user inputs and going "wait, that's technically correct but completely unhelpful" or "nobody asked it that way in testing but apparently everyone asks it that way in production." So the sequence for us is: hard invariants before you build, ship with those guardrails in place, then build the graded eval iteratively from real usage patterns in the first few weeks. Trying to do it the other way around just means you test for your own assumptions, which is the same problem as AI-generated tests passing AI-generated code.
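The "hard invariants first" half of that sequence can be sketched as a binary check that runs on every output. This is an assumed example, not the setup described above: the email pattern is a deliberately simple stand-in for real PII detection, and `forbidden_terms` is a hypothetical stand-in for a spoiler or out-of-scope list.

```python
import re

# Deliberately simple pattern; real PII detection needs much more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def violates_hard_invariants(output, forbidden_terms):
    """Return a list of violated invariants; an empty list means the output passes."""
    violations = []
    if EMAIL.search(output):
        violations.append("possible email address in output")
    lowered = output.lower()
    for term in forbidden_terms:
        if term.lower() in lowered:
            violations.append(f"forbidden term: {term}")
    return violations
```

Any non-empty result would block the output outright. The graded question of whether an answer is actually good is not something a check like this can express, which matches the split described in the comments above.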

Agoro, Adegbenga. B • Jun 19

Software design and engineering principles have never been more important than they are today.

One thing worth remembering is that code generation was never the game. Developers who were exceptionally good at rapidly producing code were often referred to as “code monkeys.” They weren’t the people invited into the room to define how the system should work or how complex business problems should be solved.

The real value has always been in understanding the problem, designing the system, defining the boundaries, and making sound architectural decisions. Writing the code was simply the implementation of that thinking.

Working with AI makes this even more important. Before we let an AI agent generate code, we need to clearly define what we’re building, why we’re building it, and the constraints it needs to operate within.

Many teams that initially evaluated AI based purely on code generation speed are now discovering that without strong architectural guidance, coding standards, and quality guardrails, a significant portion of that generated code ends up needing to be rewritten.

Code generation speed is not the same as production value.

And production value is not the same as customer value.

The teams that will get the most out of AI won’t be the ones generating the most code. They’ll be the ones designing the best systems for AI to build.

Dimitris Kyrkos • Jun 22

"Code generation speed is not the same as production value, and production value is not the same as customer value." That's a chain that deserves to be on a wall somewhere. The "code monkey" parallel is interesting too because I think we're about to see the same dynamic play out with AI. The teams that use AI to generate more code faster will hit the same ceiling that fast typists always hit. The teams that use AI to explore design alternatives and validate architectural decisions faster are playing a completely different game.

Nazar Boyko • Jun 19

The line about slow generation forcing deliberation is the part that rings true. When every line cost effort, you were basically rubber-ducking the design as you typed it. Now that step is free, so the thinking has to move somewhere, and for most teams it just doesn't. Where I feel it most is review. Reviewing code a human wrote, you can usually guess the intent behind it. Reviewing generated code there's no intent to read, just plausible output, so you end up reverse-engineering what it was even trying to do, which is slower than people admit. Has your team actually carved out time for that, or does review still get squeezed the way it did before?

Dimitris Kyrkos • Jun 22

The reverse-engineering point is something I haven't seen articulated this clearly before and it's true. When a human writes code there's an author's intent you can read between the lines. With generated code you're staring at plausible output trying to figure out what it was even attempting, and that's genuinely harder than reviewing human-written code even though it looks the same on screen. Honestly we haven't fully solved the review problem either. What helped most was requiring the person who prompted the AI to write a short explanation of what they asked for and why, so the reviewer has intent to compare against instead of guessing. It doesn't fully replace the embedded intent you get from human-written code but it closes the gap.

Elmar Chavez • Jun 20

There will always be the people bottleneck. Software can't possibly be accelerated overnight. Decisions always lie with the actual person orchestrating the AI, and today those decisions need higher-quality verification than ever before. This is a good reminder, thank you for this.

Dimitris Kyrkos • Jun 22

That's the part that keeps getting underestimated. Everyone talks about AI acceleration like it's a straight multiplier on output. But every piece of generated code still needs a human decision: is this right, does this fit, should this ship. The bottleneck was never typing speed. It was decision quality. And now each decision carries more weight because the volume behind it is so much higher. Thanks for reading.

Mudassir Khan • Jun 22

the 'gap between code that exists and software that works in context' framing is exactly what i'd been trying to articulate for a few months.

we ran into this building an MCP server last sprint. had a working prototype in two hours. spent three days figuring out why the error handling was technically correct but completely wrong for the actual failure modes users hit. AI didn't know what users would do — we barely did.

the slow generation step used to be the forcing function for that kind of thinking. it bought time.

what's your read on whether the evaluation side of engineering gets better as tooling matures, or is the context gap structural?

Dimitris Kyrkos • Jun 22

The MCP server example nails the ratio, two hours to generate, three days to make it actually work for real users. That's the new normal and it's why leadership miscalibrates expectations.

On your question, I think it's mostly structural. Tooling will keep getting better at catching mechanical mistakes, static analysis, security scanning, known failure patterns. But "what will users actually do with this" is knowledge that lives in your team's experience with your specific users, not in code or documentation. That's not a detection problem tooling can solve, it's a judgment problem. Better tools can shorten the three days but they can't replace the thinking that made those three days necessary.

Mudassir Khan • Jun 23

the "gap between code that exists and software that works in context" line is the one that maps most directly to what we keep seeing in production.

the concrete version: AI-assisted teams now produce more code per sprint than 18 months ago. they also produce more incidents per sprint. the code reviews aren't getting harder because AI is writing bad code; they're getting harder because there's more surface area, and the drop in generation cost means scope creep is now mostly invisible until production.

what's your take on tooling: do you think observability and tracing are the constraint, or is it more a judgment call that no tool fully solves?

Dimitris Kyrkos • Jun 23

The "scope creep is now invisible until production" point is sharp because that's the part the velocity metrics actively hide. When generation was expensive, scope creep showed up as missed deadlines, you could see the team falling behind. Now the extra scope gets generated for free so it never trips the early warning signals, it just quietly expands the surface area until something in that expanded surface breaks.
On your tooling question, I think it's both but they operate at different layers. Observability and tracing are necessary and they're improving, they'll catch more of the "what broke and where" faster. But they're reactive by nature, they tell you about the incident after the surface area already grew. The constraint that no tool fully solves is the upstream judgment call: should this scope exist at all, does this addition fit the architecture, is this complexity worth carrying. Tooling can measure complexity growth and flag it, which helps, but it can't make the call about whether that growth is justified in your specific context. That's the part that stays human. So I'd say observability solves the detection problem and it's getting better fast, but the scope discipline problem sits upstream of any tool because it's a decision about what to build, not a measurement of what was built.