Dimitris Kyrkos

Posted on Jun 19

AI makes writing code easier. It doesn't make engineering easier.

#webdev #programming #ai #discuss

The widening gap between code and context

The narrative is backwards

There's a narrative going around that AI is making software engineering easier. I think it's getting the direction wrong.

AI is making it easier to generate code, build prototypes, and move from idea to output faster than ever. That part is real and significant. But the act of writing code was never the hardest part of software engineering. Understanding the problem was. Defining the right architecture was. Translating what a client actually needs into reliable system behavior was. Testing, validating, maintaining, and scaling software over time was.

None of that got easier because an LLM can produce a function in three seconds.

The gap is widening, not shrinking

If anything, the gap between "code that exists" and "software that works in context" is widening. When generating code was slow and expensive, the generation step forced a certain amount of thinking. You considered trade-offs as you wrote. You questioned assumptions because each line took effort. Now that code appears instantly, that deliberation has to happen as a separate, intentional step. And most teams haven't adjusted their process to account for that.

What the teams doing it well look like

The teams I see succeeding with AI aren't the ones generating the most code. They're the ones asking better questions before they generate anything. They define the problem clearly before they prompt. They evaluate whether the generated output actually fits their architecture instead of just checking whether it runs. They validate edge cases the AI never considered because nobody prompted for them. They invest time in understanding what was generated before it ships.

The role is shifting, not shrinking

The role is moving from "person who writes code" to "person who designs systems that work in context." That's not a demotion. It's actually a higher bar. The writing was the mechanical part. The engineering judgment around it was always where the real value lived.

AI reduces the effort needed to produce software. It increases the importance of everything that surrounds production: problem definition, architectural decisions, validation, and the judgment to know when generated code is good enough and when it's hiding assumptions that will break under real load.

Where the advantage actually lives

The future won't belong to teams that output the most code. It'll belong to teams that validate faster, make better technical decisions, and ask the questions that LLMs can't ask for themselves.

Is your team's process actually different since adopting AI tools? Or did the tools change but the workflow stayed the same?

Top comments (32)

xulingfeng • Jun 19

"The gap between code that exists and software that works in context" — that line hit hard. Just wrapped up my 15th story on the same gap. The team in that one also had AI hitting 97.2% coverage, but the client had 14 external dependencies and a 24-hour CI pipeline. Turns out coverage report exists ≠ production won't blow up 😅

Dimitris Kyrkos • Jun 19

The 97.2% coverage with 14 external dependencies is the perfect example, because coverage measures "did the code run," not "did the code handle what production will actually throw at it." You can hit 100% coverage with tests that all mock the external dependencies, which means you've thoroughly tested your code's behavior in a world that doesn't exist. The 24-hour CI pipeline detail is telling too, because that's usually a symptom of the same problem: the team is running a massive test suite that gives them confidence numbers without reducing risk proportionally. The gap I keep seeing is that AI makes it trivially easy to generate tests that boost coverage metrics without anyone asking "what does this test actually prove about production behavior?" Coverage went from a useful signal to a vanity metric the moment generating tests became cheaper than thinking about what to test.
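To make the mocking point concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `charge` function, the gateway object, and `PaymentGatewayTimeout` are invented for illustration and are not taken from the story above. Two mocked tests execute every line of `charge`, yet the one thing a real dependency is likely to do, time out, is never handled.

```python
from unittest import mock


class PaymentGatewayTimeout(Exception):
    """Raised by the hypothetical gateway client when the remote call times out."""


def charge(gateway, amount_cents):
    """Charge a customer and return a status string."""
    response = gateway.submit(amount_cents)
    return "paid" if response["ok"] else "declined"


def test_charge_paid():
    # The mock always answers instantly and successfully.
    gateway = mock.Mock()
    gateway.submit.return_value = {"ok": True}
    assert charge(gateway, 500) == "paid"


def test_charge_declined():
    gateway = mock.Mock()
    gateway.submit.return_value = {"ok": False}
    assert charge(gateway, 500) == "declined"


def charge_survives_timeout():
    """Return True if charge() handles a gateway timeout, False if it propagates."""
    gateway = mock.Mock()
    gateway.submit.side_effect = PaymentGatewayTimeout()
    try:
        charge(gateway, 500)
    except PaymentGatewayTimeout:
        return False
    return True
```

After the first two tests, a coverage tool would report `charge` as fully covered. The helper `charge_survives_timeout()` still returns `False`, because the timeout goes straight through to the caller.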

xulingfeng • Jun 19

"Coverage turned into a vanity metric when generating tests got cheaper than thinking about what to test" — that's the whole thing in one sentence. I'm honestly tempted to repost your reply as a comment under my own article so more people see it 😂

Dimitris Kyrkos • Jun 19

Ha go for it, good ideas should travel. And honestly that framing only clicked for me because of your 97.2% coverage example, the specific number makes the absurdity concrete in a way that "coverage doesn't equal quality" never does. Drop a link to your article if you want, curious to read the full story behind the 14 external dependencies and the 24-hour pipeline.

Comment deleted

Dimitris Kyrkos • Jun 19

Just read through it. The RFP framing is what makes it land, because now the 97.2% isn't just a bad metric, it's a sales pitch that won the room while the actual product fell apart behind it. That's the part nobody wants to say out loud, that half these numbers exist to close deals not to protect production. Good stuff, gonna follow the series.

xulingfeng • Jun 19

Thanks for reading the whole thing. See you in the next one 👊

Dimitris Kyrkos • Jun 19

You are welcome, see you soon.

Agoro, Adegbenga. B • Jun 19

This is why you need eval systems baked into the system. AI agents and models are probabilistic, so you need deterministic infrastructure in place to safeguard yourself.

Dimitris Kyrkos • Jun 22

Exactly right. The probabilistic output needs deterministic guardrails around it, not the other way around. The mistake I keep seeing is teams treating the AI output as the system and bolting checks on afterward when something breaks. The ones doing it well build the eval infrastructure first (what does correct look like, what are the acceptance criteria, what invariants must hold) and then let the AI generate within those constraints. It's the same principle as database constraints or type systems: you don't rely on the application layer to always get it right, you make the wrong thing hard to ship by design.
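One way to picture "invariants first, generation second" is a small gate that every generated result must pass before anything downstream sees it. The sketch below is an assumption-laden illustration, not a description of any particular team's setup: the invoice-shaped `output` dict, the three invariants, and the `accept` function are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Invariant:
    name: str
    holds: Callable[[dict], bool]


# Written before any generation happens; these define what "correct" means.
INVARIANTS = [
    Invariant("total is non-negative", lambda o: o.get("total_cents", -1) >= 0),
    Invariant("currency is supported", lambda o: o.get("currency") in {"EUR", "USD"}),
    Invariant(
        "line items sum to total",
        lambda o: sum(o.get("items", [])) == o.get("total_cents"),
    ),
]


def violations(output: dict) -> list[str]:
    """Return the names of all invariants the output breaks."""
    return [inv.name for inv in INVARIANTS if not inv.holds(output)]


def accept(output: dict) -> dict:
    """Deterministic gate: pass the output through or reject it loudly."""
    failed = violations(output)
    if failed:
        raise ValueError("rejected: " + "; ".join(failed))
    return output
```

The design choice is the same one as a database constraint: the check does not depend on how the output was produced, so it behaves identically whether a person or a model wrote the value.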

Jake Lundberg • Jun 19

agreed on the core...writing code was always the cheap part. but I'd push on the "doesn't make it easier" framing, because I think it's worse than that. when building was slow, the build itself was a brake on bad design. you'd get halfway in and feel the friction...this is dragging, the abstraction's wrong, time to back up. you caught the mistake before you'd fully paid for it. AI took that brake off. now you can build the wrong thing all the way and fast, and the design flaw stays invisible until it's big and load-bearing. so the part of the job that didn't get easier is now the part that's most expensive to get wrong

Dimitris Kyrkos • Jun 22

You're right. The friction of writing was doing design work that nobody recognized as design work. You'd feel the wrongness in your hands before you could articulate it intellectually. That physical slowness was a feedback mechanism and we ripped it out without replacing it with anything. "The part that didn't get easier is now the most expensive to get wrong" is exactly it. The cost of a bad design decision didn't change, we just removed the thing that used to catch it early.

Jake Lundberg • Jun 22

I'd put it a little differently...I think the slowness was a side effect. the real brake was that writing forced you down into the specifics, and a bad design only shows itself once you're in the details. AI lets you stay up at the level of the idea all the way to done, so you never personally hit the wall where the wrongness lives.

and "without replacing it with anything" is the part that gets me too. I haven't found a free replacement either...the old brake cost nothing, it just happened while you worked. everything that replaces it now is deliberate, and deliberate is the first thing to go when you're slammed. thorough grilling helps, interrogating the approach before you build genuinely shrinks the gap. it just can't close it, because the part you most need to catch is the forward-looking call, and that's the hardest thing to pin down well enough to even interrogate

Dimitris Kyrkos • Jun 23

The contact with details point is interesting but I think the original framing still holds. Speed and specificity aren't separate things, the slowness is what forced you into the details in the first place. You couldn't skip them because each line took effort. The brake was both. On the replacement question, I agree that deliberate is the first thing to go under pressure, which is why I lean toward automated structural checks rather than relying on people to choose rigor when they're slammed. It's not a perfect replacement but it at least creates friction that doesn't depend on someone's willpower on a Thursday at 5pm. The forward-looking call is genuinely hard to systematize though, you're right about that. That's still pattern recognition from experience and I don't think we've found a shortcut for it.

Dirk Mattig • Jun 21

I could not agree more!
The fundamental problem with vibe coding is that the code is generated from an idea rather than a specification. The architecture & design steps are missing, exactly as you described. I never quite understood why planning has never gained widespread recognition in software development.
If people wanted to build a house, I think most of them would not go straight to a group of craftsmen and ask them to start building. But with software, this is almost the norm, and already was even before AI.
I am convinced that software engineering will eventually become pure specification. Today's developers who favor frameworks over solutions will soon have a very hard time.

Dimitris Kyrkos • Jun 22

The house analogy is perfect and I think the reason planning never gained traction in software is that software was always cheap enough to change that people convinced themselves they didn't need a blueprint. "We'll refactor later" was tolerable when writing was slow and changes were small. Now AI lets you build the whole house in a day from a napkin sketch and suddenly the missing blueprint isn't a minor shortcut, it's a structural problem. Your point about specification becoming the core skill is where I think this is heading too. The developers who thrive will be the ones who can precisely describe what needs to exist before anything gets generated.

anhmtk • Jun 20

This hits the nail on the head. "The gap between code that exists and software that works in context is widening" is probably the most accurate description of the current AI era.

As a non-tech founder who literally self-taught and built a web platform for AI agents entirely alongside LLMs, I live this paradox every single day. AI makes me feel like a wizard who can materialize features in minutes. But the moment real traffic hits, or when I have to reason about edge cases, rate-limiting, and structural maintenance, the "wizardry" fades, and the sheer necessity of true engineering judgment becomes glaringly obvious.

Building a product with AI has actually made me respect senior engineers and architects infinitely more. AI can write the functions, but it doesn't possess the empathy to understand user behavior, nor the historical judgment to prevent architectural decay.

The tools changed, but the ultimate bottleneck is still—and will always be—human engineering discipline. Thanks for writing this!

Dimitris Kyrkos • Jun 22

Your perspective as a non-tech founder building with AI is actually the most valuable one in this thread because you're experiencing both sides simultaneously. The wizard feeling and the "oh wait this doesn't hold up under real traffic" moment. The line about AI not having empathy for user behavior is underrated. It can't model the weird things real users do because it was never a user. It generates code for the spec, not for the person who's going to misuse the spec in creative ways at 2am on their phone. The fact that building with AI gave you more respect for senior engineers is telling because it means you're seeing firsthand what they actually do that the code generation step never captured.

Kartik N V J K • Jun 19

The widening-gap framing is right, and it shows up most in the parts that never make it into the prompt: error paths, edge cases, and behavior under load. When generation was slow, that thinking got forced on you line by line; now it has to be a deliberate separate step that's easy to skip when the diff already looks done. The only reliable counter I've found is deciding the failure cases and the checks up front, before any code gets generated.
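A rough sketch of what deciding the failure cases up front could look like in Python: the cases are written down as data first, and any candidate implementation, generated or not, is run against them. The function names and the quantity-parsing scenario are hypothetical and chosen only to keep the example small.

```python
# Failure cases decided before any code exists:
# (description, raw input, exception type the code must raise)
FAILURE_CASES = [
    ("empty input", "", ValueError),
    ("non-numeric input", "abc", ValueError),
    ("negative quantity", "-3", ValueError),
]
# Happy-path cases: (description, raw input, expected result)
SUCCESS_CASES = [("plain integer", "7", 7), ("padded integer", " 12 ", 12)]


def unmet_criteria(candidate):
    """Return descriptions of the criteria a candidate function does not meet."""
    unmet = []
    for description, raw, expected_error in FAILURE_CASES:
        try:
            candidate(raw)
        except expected_error:
            continue  # failed in the agreed way
        except Exception:
            pass  # failed, but not in the agreed way
        unmet.append(description)
    for description, raw, expected in SUCCESS_CASES:
        try:
            if candidate(raw) != expected:
                unmet.append(description)
        except Exception:
            unmet.append(description)
    return unmet


def plausible_parse_quantity(raw):
    # Looks done and works on the happy path.
    return int(raw)


def checked_parse_quantity(raw):
    value = int(raw)  # int() raises ValueError for "" and "abc"
    if value < 0:
        raise ValueError("quantity must not be negative")
    return value
```

Here `unmet_criteria(plausible_parse_quantity)` flags the negative-quantity case that the happy path never exercises, while `checked_parse_quantity` meets every criterion.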

Dimitris Kyrkos • Jun 22

"Deciding the failure cases before any code gets generated" is the single most practical piece of advice in this whole thread. That's essentially test-driven development but applied at the prompt level. Define what should break and how before you ask the AI to build the thing. If you do that the generated code has to satisfy real constraints instead of just looking plausible. The teams I've seen adopt this pattern catch problems dramatically earlier because the failure cases become the acceptance criteria, not an afterthought someone thinks about during review.

Vasyl • Jun 23

Building on the "eval infrastructure first" point a few people made here, I'd split "correct" into two halves, because not all of it is knowable upfront. The hard invariants (no spoilers, no leaked PII, no out-of-scope context) you can write as a zero-tolerance check before you build, and they often reshape the design itself. But the graded stuff (is this answer actually good?) you mostly can't specify in advance; you discover those failure modes by reading real outputs once users phrase things in ways you never imagined. So eval-first buys you the deterministic guardrails, but the quality bar you earn by looking at production. Do you write the graded eval before or after first contact with real users?

Dimitris Kyrkos • Jun 24

After. Every time. We've tried writing graded evals before real users touched it and they always tested for what we expected to go wrong, which by definition is not where the interesting failures live. The hard invariants you write first because they're binary and they protect the floor. But the quality bar emerges from looking at actual outputs against actual user inputs and going "wait, that's technically correct but completely unhelpful" or "nobody asked it that way in testing but apparently everyone asks it that way in production." So the sequence for us is: hard invariants before you build, ship with those guardrails in place, then build the graded eval iteratively from real usage patterns in the first few weeks. Trying to do it the other way around just means you test for your own assumptions, which is the same problem as AI-generated tests passing AI-generated code.
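The split between the two kinds of check can be sketched in Python as follows. This is only an illustration under stated assumptions: the email regex is a deliberately narrow stand-in for a "no leaked PII" invariant, and the `GradedEval` class, its pass-rate floor, and the sample cases are invented for the example.

```python
import re

# Narrow stand-in for a PII check; a real one would cover far more than emails.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def leaks_email(answer: str) -> bool:
    """Hard invariant: binary, zero tolerance, can be written before launch."""
    return EMAIL.search(answer) is not None


class GradedEval:
    """Graded cases get appended after launch, from reviewed real usage."""

    def __init__(self, pass_rate_floor: float):
        self.pass_rate_floor = pass_rate_floor
        self.cases = []  # list of (user_input, predicate over the answer)

    def add_case(self, user_input, is_acceptable):
        self.cases.append((user_input, is_acceptable))

    def pass_rate(self, answer_fn) -> float:
        if not self.cases:
            return 1.0  # nothing learned from production yet
        passed = sum(1 for question, ok in self.cases if ok(answer_fn(question)))
        return passed / len(self.cases)
```

The hard invariant exists on day one and never changes shape. The graded suite starts empty and grows as real phrasing from users exposes answers that are technically correct but unhelpful.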

Agoro, Adegbenga. B • Jun 19

Software design and engineering principles have never been more important than they are today.

One thing worth remembering is that code generation was never the game. Developers who were exceptionally good at rapidly producing code were often referred to as “code monkeys.” They weren’t the people invited into the room to define how the system should work or how complex business problems should be solved.

The real value has always been in understanding the problem, designing the system, defining the boundaries, and making sound architectural decisions. Writing the code was simply the implementation of that thinking.

Working with AI makes this even more important. Before we let an AI agent generate code, we need to clearly define what we’re building, why we’re building it, and the constraints it needs to operate within.

Many teams that initially evaluated AI based purely on code generation speed are now discovering that without strong architectural guidance, coding standards, and quality guardrails, a significant portion of that generated code ends up needing to be rewritten.

Code generation speed is not the same as production value.

And production value is not the same as customer value.

The teams that will get the most out of AI won’t be the ones generating the most code. They’ll be the ones designing the best systems for AI to build.

Dimitris Kyrkos • Jun 22

"Code generation speed is not the same as production value, and production value is not the same as customer value." That's a chain that deserves to be on a wall somewhere. The "code monkey" parallel is interesting too because I think we're about to see the same dynamic play out with AI. The teams that use AI to generate more code faster will hit the same ceiling that fast typists always hit. The teams that use AI to explore design alternatives and validate architectural decisions faster are playing a completely different game.

Nazar Boyko • Jun 19

The line about slow generation forcing deliberation is the part that rings true. When every line cost effort, you were basically rubber-ducking the design as you typed it. Now that step is free, so the thinking has to move somewhere, and for most teams it just doesn't. Where I feel it most is review. Reviewing code a human wrote, you can usually guess the intent behind it. Reviewing generated code there's no intent to read, just plausible output, so you end up reverse-engineering what it was even trying to do, which is slower than people admit. Has your team actually carved out time for that, or does review still get squeezed the way it did before?

Dimitris Kyrkos • Jun 22

The reverse-engineering point is something I haven't seen articulated this clearly before and it's true. When a human writes code there's an author's intent you can read between the lines. With generated code you're staring at plausible output trying to figure out what it was even attempting, and that's genuinely harder than reviewing human-written code even though it looks the same on screen. Honestly we haven't fully solved the review problem either. What helped most was requiring the person who prompted the AI to write a short explanation of what they asked for and why, so the reviewer has intent to compare against instead of guessing. It doesn't fully replace the embedded intent you get from human-written code but it closes the gap.

Elmar Chavez • Jun 20

There will always be a people bottleneck. Software can't possibly be accelerated overnight. Decisions always rest with an actual person orchestrating the AI. Today those decisions need higher-quality verification than ever before. This is a good reminder, thank you for this.

Dimitris Kyrkos • Jun 22

That's the part that keeps getting underestimated. Everyone talks about AI acceleration like it's a straight multiplier on output. But every piece of generated code still needs a human decision: is this right, does this fit, should this ship. The bottleneck was never typing speed. It was decision quality. And now each decision carries more weight because the volume behind it is so much higher. Thanks for reading.

Mudassir Khan • Jun 22

the 'gap between code that exists and software that works in context' framing is exactly what i'd been trying to articulate for a few months.

we ran into this building an MCP server last sprint. had a working prototype in two hours. spent three days figuring out why the error handling was technically correct but completely wrong for the actual failure modes users hit. AI didn't know what users would do — we barely did.

the slow generation step used to be the forcing function for that kind of thinking. it bought time.

what's your read on whether the evaluation side of engineering gets better as tooling matures, or is the context gap structural?

Dimitris Kyrkos • Jun 22

The MCP server example nails the ratio: two hours to generate, three days to make it actually work for real users. That's the new normal and it's why leadership miscalibrates expectations.

On your question, I think it's mostly structural. Tooling will keep getting better at catching mechanical mistakes through static analysis, security scanning, and detection of known failure patterns. But "what will users actually do with this" is knowledge that lives in your team's experience with your specific users, not in code or documentation. That's not a detection problem tooling can solve; it's a judgment problem. Better tools can shorten the three days but they can't replace the thinking that made those three days necessary.
