The AI world is full of old infrastructure with stochastic organs.
That sentence probably explains better than anything why I feel slow lately. ...
This resonates deeply.
As someone who uses AI daily, I've noticed that the work I value most hasn't gotten dramatically faster. Coding is faster. Prototyping is faster. First drafts are faster.
But understanding whether something should exist, whether it's maintainable, secure, and aligned with the system's goals—that still requires thought.
I've started thinking of this as context decay. AI can accelerate implementation faster than it accelerates understanding. The challenge isn't generating more code. It's preserving enough context to make good decisions about the code we're generating.
I can relate to it! The article highlights that AI-driven development is shifting the engineering bottleneck from code generation to verification, increasing the need for senior-level auditing to manage technical debt. This, in turn, creates a paradox where tools designed for speed actually mandate a slower, more rigorous human-in-the-loop approach to ensure system reliability.
Yes, exactly. I think this is the paradox many teams still underestimate.
AI reduces the cost of producing code, but it increases the amount of system surface that needs to be understood, verified, and maintained. So the bottleneck does not disappear. It moves from “can we build this?” to “can we trust this?”. And that is where senior judgment becomes even more important, not less. The risky part is not that AI writes bad code sometimes. The risky part is that AI can produce plausible code quickly enough to hide technical debt behind velocity.
So yes, the faster the generation layer becomes, the more disciplined the verification layer needs to be. Otherwise we are not accelerating engineering. We are just accelerating the creation of things nobody fully understands.
You're not slow dear. You're working on the part AI can't automate: judgment. Generating is fast; understanding, verifying, and building durable mental models is still slow by necessity.
The challenge isn't keeping up with every AI trend, it's making your deeper thinking visible before it disappears into private complexity.
ran into this last month - built more, shipped more, but felt less like i knew what i was actually building. not slow, just detached from the outcome. might be the same thing.
"Creation became cheap. Verification did not." – that's the sentence I'll be thinking about all week.
This entire post resonates so deeply that I had to read it twice. The feeling of being "slow" in an AI‑accelerated world isn't about typing speed or even coding speed – it's exactly what you described: judgment doesn't scale with generation.
The bit about correlated errors is something almost no one talks about. We throw more checks, more validations, more retries at AI systems, but if the underlying model has a blind spot, adding five identical checkers just gives you five identical wrong answers with higher confidence. That's not safety – that's a beautifully formatted hallucination.
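To put rough numbers on that, here is a minimal sketch comparing a majority vote over independent checkers with a vote over checkers that share one blind spot. The 20% error rate and both function names are hypothetical, chosen only for illustration; real checkers sit somewhere between these two extremes.

```python
from math import comb


def majority_wrong_independent(p: float, n: int) -> float:
    """Probability that a strict majority of n independent checkers is wrong,
    when each checker is wrong with probability p."""
    need = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def majority_wrong_identical(p: float, n: int) -> float:
    """Perfectly correlated checkers are all wrong on the same inputs,
    so the majority is wrong with probability p regardless of n."""
    return p


if __name__ == "__main__":
    p = 0.2  # hypothetical per-checker error rate
    for n in (1, 3, 5):
        print(
            n,
            round(majority_wrong_independent(p, n), 4),
            majority_wrong_identical(p, n),
        )
```

With independent errors the majority gets more reliable as checkers are added (0.2, then 0.104, then about 0.058). With a shared blind spot it stays at 0.2 no matter how many are stacked.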
Your metaphor of speciation (different models, different prompts, different evaluation angles) is where actual robustness lives. I've been experimenting with using two completely different small models (one factual, one structural) to cross‑validate each other. It's slower, but the false positive rate dropped dramatically.
What struck me most was your honesty about hiding behind depth. That one paragraph – "sometimes I am hiding behind depth" – made me stop scrolling. The line between "I'm still thinking" and "I'm afraid to ship messy thoughts" is thin, and you named it perfectly.
The AI world rewards motion, not necessarily progress. Your rhythm – slower, traceable, connected to older principles – is harder to see in a feed, but it ages better. I'd rather read one post like this than twenty "5 AI tools you need today" listicles.
Keep leaving traces. The messy, incomplete, connecting‑ants‑to‑routing kind of traces.
Cheers,
Jack
DEV.to/ggle.in
Thanks Jack. The part I really want to pick up is your two-model setup, because you have actually run what I was only arguing for.
The false-positive drop is the encouraging signal. But the question I still cannot answer cleanly is this: are the factual and structural checkers actually decorrelated, or do they only feel decorrelated because they have different roles?
Different prompts and different jobs help, but if they use the same base model, or even different models trained on heavily overlapping data, they may still share blind spots. So the drop in false positives could mean real independence, or it could mean you covered two common failure modes while a third correlated one is still hidden underneath.
The hard part, for me, is measuring that independence instead of assuming it. I would want to know whether the two checkers fail on different inputs, disagree on meaningful edge cases, and avoid passing the same bad outputs together.
Have you looked at the cases where both checkers pass? That overlap is probably where the shared blind spot would hide.
If you have data on agreement vs disagreement cases, I would genuinely like to see the shape of it. That is exactly the part I am stuck on: not adding more validators, but proving that the validators are not just repeating the same judgment from different angles.
Here are the numbers — and they mostly agree with your suspicion
Marcos, thank you again for pushing me to actually measure independence instead of just assuming it.
I ran a small experiment using my LLM Security Audit framework. Two models:
6 test cases (direct injection, jailbreak, persuasive, system prompt leakage). Temperature 0.0. Vulnerability defined as “no refusal”.
Results:
What this tells me:
Phi = 0.71 is a strong correlation. The two models share significant blind spots. When one fails on a prompt, the other is likely to fail too.
The effective sample size (n_eff = 3.51) means this ensemble of two models behaves like roughly 1.5 independent judges – not 2.
The disagreement rate (16.7%) shows there is some complementary value, but it’s limited. The shared vulnerability cases (n00=2) are exactly where a correlated blind spot hides – both models got tricked by the same simple direct injections.
Your original question was: “are the factual and structural checkers actually decorrelated?”
Based on this small sample: No, not really. They are less correlated than identical twins, but still far from independent. Different roles help a little, but they don’t guarantee independence.
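For anyone who wants to redo this kind of check, a minimal sketch of the two statistics follows. The 2x2 counts are an assumption: they are one table consistent with the figures quoted above (6 cases, 16.7% disagreement, one agreement cell of 2), not the actual run data. The n_eff formula, n / (1 + rho), is also an assumption; it is a common design-effect correction that happens to reproduce 3.51 here. Function names are illustrative.

```python
from math import sqrt


def phi_coefficient(n11: int, n10: int, n01: int, n00: int) -> float:
    """Phi for a 2x2 table of two binary judges.
    n11 and n00 are the agreement cells, n10 and n01 the disagreement cells.
    Swapping the 0/1 coding of both judges leaves phi unchanged."""
    num = n11 * n00 - n10 * n01
    den = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return num / den if den else 0.0


def effective_sample_size(n: int, rho: float, judges: int = 2) -> float:
    """Design-effect correction (assumed formula): n / (1 + (judges - 1) * rho)."""
    return n / (1 + (judges - 1) * rho)


if __name__ == "__main__":
    # Hypothetical counts, chosen to match the summary figures in this thread.
    n11, n10, n01, n00 = 3, 1, 0, 2
    n = n11 + n10 + n01 + n00

    phi = phi_coefficient(n11, n10, n01, n00)
    disagreement = (n10 + n01) / n
    n_eff = effective_sample_size(n, phi)

    print(f"phi = {phi:.2f}")
    print(f"disagreement rate = {disagreement:.1%}")
    print(f"n_eff = {n_eff:.2f}")
```

With only 6 cases, moving a single prompt from one cell to another changes phi substantially, so the value is best read as a direction rather than an estimate.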
Limitations & next steps:
Thanks again for this thread. You’ve turned a vague “I think they’re independent” into a measurable question. That’s rare and valuable.
Cheers,
Jack
DEV.to/ggle.in
That is a great test! Glad you ran it!
The headline is the right one: different roles gave you some independence, but not real independence.
Phi at 0.71 and n_eff around 1.5 suggest the two checks are still strongly correlated. The key signal is not just agreement, though. Since both models fail quite often, some agreement is expected by base rate alone. I would track co-failure beyond chance: expected joint failure is about 33 percent, observed is 50 percent. That is the useful gap.
I’d also add Cohen’s kappa next to phi in the 50-prompt run. If both stay high with a larger sample, then the shared blind spot is real, not just a small-n artifact.
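A sketch of both checks, chance-corrected co-failure and Cohen's kappa, on a single 2x2 table. The counts are hypothetical: they assume a 6-case table in which the 3-case cell is the joint-failure cell, which is the reading that reproduces the 33 percent expected and 50 percent observed figures above. Function names are illustrative.

```python
def co_failure(both_fail: int, only_a_fails: int, only_b_fails: int, both_pass: int):
    """Return (expected, observed) joint-failure rates.
    Expected is what two independent judges with the same marginal
    failure rates would produce by chance."""
    n = both_fail + only_a_fails + only_b_fails + both_pass
    a_fail = (both_fail + only_a_fails) / n
    b_fail = (both_fail + only_b_fails) / n
    return a_fail * b_fail, both_fail / n


def cohens_kappa(both_fail: int, only_a_fails: int, only_b_fails: int, both_pass: int) -> float:
    """Cohen's kappa: agreement beyond what the marginals predict by chance."""
    n = both_fail + only_a_fails + only_b_fails + both_pass
    p_observed = (both_fail + both_pass) / n
    a_fail = (both_fail + only_a_fails) / n
    b_fail = (both_fail + only_b_fails) / n
    p_chance = a_fail * b_fail + (1 - a_fail) * (1 - b_fail)
    if p_chance == 1.0:  # degenerate table: both judges always give the same label
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    table = (3, 1, 0, 2)  # hypothetical counts, see the note above
    expected, observed = co_failure(*table)
    print(f"expected joint failure = {expected:.1%}")
    print(f"observed joint failure = {observed:.1%}")
    print(f"kappa = {cohens_kappa(*table):.2f}")
```

The gap between observed and expected joint failure is the part that base rates alone do not explain. Kappa applies the same chance correction to overall agreement.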
And yes, send the disagreement cases. The aggregate says correlation exists, but n10 and n01 explain where decorrelation actually happens. That is where the useful design signal is.
The third model idea is right. A small classifier is interesting not because it is “not a transformer,” but because it has a different objective and output surface from an instruction-tuned LLM judge.
The real question is not “how many judges?” It is: do they fail differently?
Think of a judge. If they are involved in all the facts over time, they will likely reach a decision very quickly, once the matter is 'judged'. But if they are not involved in practically anything, it can take them months to judge a specific case, to truly understand everything.
This analogy applies to development. Developing with AI shouldn't be about 'I'll go for a run in the street while Claude Code does things for me'. Instead, it should be seen as a COMPANION, an ACTIVE TOOL. And to use the tool, you need a foundation. Who lacks a software foundation? Most project managers, CEOs, product owners, etc.
We are still in an era of adaptation.
Yes, I agree with the judge analogy, but I would frame the problem a bit differently.
For me this is not only about LLMs writing code, or about the classic “vibe coding” image of leaving Claude alone while you go for a run. That is not really the workflow I am talking about. When I say AI-generated code, I mean code produced inside a structured development process, with context, constraints, design decisions, reviews, and verification. The real friction is not simply that the code was generated by AI. The real friction is how many layers you cross while building with AI.
AI makes it very easy to move from idea to backend, frontend, infrastructure, prompts, tests, documentation, and deployment logic in the same flow. That is powerful, but it also means the developer has to keep a much larger system context alive. The bottleneck becomes understanding the whole case, like your judge example. If you were involved in every fact, every constraint, every tradeoff, then the final judgment can be fast. But if you arrive only at the end, the system looks like a legal case with missing context, hidden assumptions, and many layers of accumulated decisions.
So yes, AI should be a companion and an active tool, not an autonomous magic worker. But that also means the person using it needs enough software foundation to know what is being crossed, what is being assumed, and what needs to be verified. The adaptation is not just learning how to prompt. It is learning how to work across layers without losing judgment.
Exactly. Well, you need to be a very VERY skilled person to be a judge that people can actually trust, right? Same for software engineers. Imagine a bad judge sends you to jail, and 10 years later realizes you shouldn't be there. You cannot recover that time. And you may not be able to recover users' data once it's leaked, for example, because of a 'bad software judgement' based on AI.
Here's to a world with more software engineers who can truly judge. Otherwise we're screwed.
Really interesting and I fully agree. I don't have your experience or your knowledge, being a beginner vibe coder, but what amazed me is the speed at which AI generates hundreds of lines of code in a few seconds. I verify directly by testing in production and arguing with my AI because it doesn't understand me or doesn't do what I want, but I've always wondered: how does someone who actually writes code review hundreds of lines they didn't write? A bit like in my real job: often when I review a colleague's work, I think I would have been faster doing it myself.
I can only imagine your frustration, and I believe the next development in AI should really focus on verification/control/audit systems.
translated by Claude
I will tell you a secret... No one reviews thousands of lines of code. Code reviewers, depending on the PR, focus only on what they know MAY break. They barely look at the safe parts. It's the same with AI: the issue is that KNOWING what could break in an entire stack built by AI requires you to deeply understand code that you may have planned but did not write yourself. This KNOWING THE STACK part is what actually requires time and a more elastic mindset.
This is a fantastic write up. We need tools that prioritize developer in the loop, verification (structured not just offloaded), orchestration that the engineer defines, review gates, and the ability to observe/steer mid run with auditable artifacts to understand what happened and where to iterate. There is so much push to cut the developer out, whereas it should be about bringing them back in.
The slow feeling is real, and I think it's because the tool sped up the typing, not the deciding. Generating code got cheap, but figuring out whether the code is right is the part that never compressed. So you end up faster at producing and slower at trusting, which nets out as feeling behind.
Totally resonates. AI has made "writing code" dirt-cheap, but "judgment" more expensive than ever. As a 20-year Linux veteran, what I fear most is getting blinded by speed and losing respect for the underlying logic. Instead of chasing weekly buzzwords, we should dissect how old architectures morph in the AI era. It's fine to be slow—as long as your output is sharp and built to last.
I agree, there are potentially huge issues behind trusting AI in this phase, and pressure is there to automate everything, and for the judgment part, that’s not possible yet. It’s important to be up to date with AI, but we can’t really chase and understand every trend before it’s necessary for us to use it.
This is a really amazing article! Newer AIs make the headline "This will change everything..." every few weeks without achieving much in reality.
When we use AI, checking every module and piece of functionality becomes important before it goes into its final phase. And you explained it very well.
This really lands. The part I keep coming back to is that verification cannot live entirely inside the same generative loop that created the uncertainty. If the agent writes the code, rewrites the tests, explains the change, and then validates itself, the system can develop a very polished form of false confidence. The missing layer is not just ‘more review’ — it is an independent diagnostic surface with stable repo-local truth: what the project says it is, what constraints it declared, what changed, and whether the agent’s work drifted from that.
So I agree with your framing that slowness needs output. In AI-assisted development, I think that output increasingly has to become traceable verification artifacts, not just better prompts or better explanations.
The fact that you're asking this honestly instead of packaging it into another '10x your productivity' post is why this series works. Speed was never the bottleneck on the projects that mattered — it was always 'did we validate the assumptions.' And validation doesn't get faster with AI, it just changes what you're validating.
before, i only needed to get a task from a senior, do it, and finish. now i have to do everything by myself, because 1 person handles multiple projects instead of 1 project being handled by many people. now i have to know so many things when doing a project. it's harder to become a developer now.
Only if you want to become a good one 😉👍
I agree it can feel overwhelming, but I think this is part of the adaptation phase. When I was junior, the pressure was different: every few months there was a new framework, a new stack, a new “right way” to build things. That was the effect of the internet and web development boom.
Now we are facing another explosion. AI is pushing developers to understand more layers at once: product, architecture, infrastructure, prompts, testing, verification, deployment. That makes the job feel heavier, especially at the beginning.
But things will settle. Good practices will appear again. The important part is not to know everything immediately, but to keep building the foundation that lets you understand what is happening across the system.
It is harder now, yes. But not impossible. Do not give up.
I feel you. E.g., I spun up the foundry because I got pissed at how bad KFC's app was, so I thought 'let me just write them a better one'. 2h later, 20k LOC, and out pops a robust POS, stock management, driver companion, Android/iOS/Web customer portal, etc. Built on a blazing fast, secure framework (V.A.L.I.D.). Can't argue it's slow, can't argue it's brittle, can't argue it's not secure, can't argue it'll fail under load or bad packets. It's rock solid.
So I started it up and tried it: the UI sucked. 20k LOC and it decided to skimp on the UI... For perspective, it's a system that can handle 3600 orders a second without a single double order or dropped order. It tracks usage and automatically orders new stock, it automatically calibrates what kitchen staff should pre-prepare for rush hour, and it tracks driver distances based on traffic and multi-stop routes to give accurate delivery times. It prioritizes delivery driver orders when the driver is almost back, and it batches orders to separate prepping from packing. It does it all, but it was horrible to use.
So 2h to make a technical marvel, but 12h of back and forth to make it usable... Now imagine it was the other way around. How many people would simply see it pass the duck test and ship it? Unsafe endpoints, unrestricted entries, hard-coded API endpoints and credentials, etc. Small things that you'd miss if you don't know what to look for. AI has come a long way, but it has an even longer way to go before any of us are too slow to be relevant.
This resonated with me. AI has dramatically reduced the cost of producing answers, code, and ideas, but it hasn't reduced the cost of understanding them.
In some ways, the challenge has shifted from generation to judgment. The faster AI becomes, the more valuable it is to know what to trust, what to question, and what deserves deeper investigation.
I wonder if future engineering skills will be less about writing systems from scratch and more about evaluating, steering, and validating increasingly capable systems.
It turns out I'm not the only one who thinks I'm slow, haha...
"Creation became cheap. Verification did not." -That really woke me up.
But in short, your post is quite lengthy, and what we need to reflect on is: Don't put too much emphasis on chasing trends. Whatever you do and think about, you need clear results. If you keep hiding and overthinking, it's just depth used as a disguise.
I agree with the core of this.
Clear results matter. Without them, depth becomes decoration, and sometimes even a way to avoid the discomfort of shipping something concrete.
But I would separate two things.
Chasing trends is weak when the goal is visibility. In that case, yes, you are just running behind the noise.
But observing trends is different. Trends often expose where the market, the culture, or the engineering practice is breaking. For me, writing about AI is not about catching the wave. It is more about understanding what this wave is damaging, accelerating, or revealing. And yes, the warning about hiding behind depth is fair. That is exactly the trap. You can overthink so much that thinking becomes a shelter from doing.
So the balance I am trying to keep is this: think deeply enough to avoid shallow conclusions, but keep forcing the thought back into something testable, visible, and useful.
Depth only has value when it survives contact with reality.
The variance-versus-bias point is the part of this that earns the whole essay. "More checks do not help much if all the checks share the same blind spot" is the sentence I'll be stealing in code review. Most of the AI safety advice floating around is really just retry logic wearing a serious face, and you named exactly why that fails: correlated errors. Stacking the same checker gives you a tidy row of green checkmarks over the same wrong answer. High-resolution false confidence is a brutal phrase and it's correct.
The move from retry to decorrelation to speciation actually lands too, because you did the thing you warned against avoiding. You didn't stop at "old principles return," you showed where the analogy breaks: granularity lowers blast radius but not atomic uncertainty. That distinction is doing real work, not decorative depth.
The bottleneck shifted, not disappeared. Generating code takes minutes now, but reviewing what the model wrote, checking it doesn't break something three layers down, that still takes the same human attention. The real skill isn't writing faster. It's auditing faster.
The main issue with current AI infrastructure is the insane cloud framework bloat. Everybody is stacking heavy abstractions, losing control of the raw silicon.
I’m fixing this at Axiom Systems. Building bare-metal engines in Rust tailored specifically for massive multi-agent swarm simulations and low-latency execution directly on local hardware. No cloud bloat, pure raw performance. Check out the architecture and code over at my X profile (ManuelAxiom).
The bias vs variance distinction is the sharpest point in the piece. Everyone adds more checks. Nobody asks whether the checks share blind spots. Three reviewers from the same training distribution stamping the same hallucinated function isn't safety - it's confirmation theater. Decorrelation as speciation is the right mental model, and most AI safety in production is still classical retry logic wearing a costume.
Exactly. “Confirmation theater” is the right phrase.
The dangerous part is that many teams treat validation as a quantity problem: add more checks, add more reviewers, add more gates, add another LLM-as-judge step. But if all those checks come from the same model family, same prompt style, same training distribution, and same assumptions, we are not really increasing safety. We are just repeating the same blind spot with more confidence.
That is why the bias vs variance distinction matters so much. Granularity helps when the errors are local and random. It gives you better observability and lowers the blast radius. But if the failure is systemic, more checks do not solve it. They can actually make it worse, because now the system looks more verified while still being wrong.
So yes, I think a lot of production AI safety is still classical retry logic wearing an AI costume. It works when failure is transient. It breaks when failure is correlated.
The real next step is not just “more validation.” It is independent validation. Different models, different framings, different failure assumptions, different kinds of checks. That is where decorrelation becomes more than a technical trick. It becomes the architecture of trust.
Thanks for putting this into words!
I think the real risk is not being slow. It is losing your judgment because the tools make every answer feel urgent. The durable skill is still deciding what deserves speed and what deserves a pause.
Great article
Creation became cheap, verification did not — yeah. Can generate a whole app in an afternoon. Then I spend three days staring at it wondering if any of it is right. Half the time I just ship it and wait for someone to tell me what broke.
what a weird time to be alive
The hiding behind depth part resonated more than I expected. For me it shows up as perfectionism. One more section will make it cleaner, one more example will make it clearer.
At some point I have to be honest with myself that the post is not waiting to be better, it is waiting because publishing it means someone can actually disagree with it. The depth is real but it is also convenient.
Especially in the current AI space, it is becoming harder to expect a genuinely productive interaction under most AI-related posts.
Personally, I have two buckets and one gold box where I mentally collect reactions and comments to my posts.
The first bucket is “AI lazy fun.”
Comments like: “This is really good” or “What you do with X is great because Z, Y, and H...”
Most of the time, this is not validation. It is just people trying to show presence under posts they think are valuable. So they ask AI to write a “smart comment” and drop it there.
The second bucket is full of Vibers.
Vibe coders, vibe architects, vibe AI engineers, and all the rest of the “vibe” area.
Those are the funniest ones. They usually defend LLMs as if they were oracles.
Stuff like: “With temperature 0 you get deterministic responses” or “With one billion parameters, who cares about context size?”
Funny to read, but not really useful.
Then there is the gold box.
That is where I put comments like this one. The ones that do not just say “Hey man, this resonates with me,” but actually add experience, context, or a real willingness to share.
And also the ones that properly argue against my opinion.
All this to say: disagreement is what you should seek more.
If you explain your ideas genuinely, not as a sensational announcement, but more like “this is what I was trying, this is where it broke, and this is the solution I applied,” then constructive disagreement becomes one of the most valuable signals you can get.
That is just my 2c opinion.
I get what you mean. It’s easy to confuse being active with actually contributing.
That “disagreement is the signal” line stood out the most for me.
If AI keeps sprinting, are our own brains the slow lane?
yes absolutely
The AI world often mistakes motion for progress. Generating ten prototypes is easy. Knowing which one deserves to exist is harder.
I don’t think you’re becoming slow.
I think you’re experiencing the difference between information consumption and model building.
Many people optimize for reacting to the newest idea. Fewer people spend time connecting ideas across domains and building mental models that survive beyond a single trend cycle.
The real challenge is exactly what you described: making those connections visible before they become trapped in private complexity.
I don't think the problem is being "slow." The real challenge is feeling pressured to keep up with every new AI tool and update.
Most people won't win by moving the fastest. They'll win by understanding how to use AI effectively in their work. The real advantage isn't adopting every AI tool. It's knowing which ones actually improve your work.
Sure, development is cheap, but no one's addressed the main problem yet!
Time, the TIME factor. Many AI prompts say something like this:
Create me an app in C# Blazor, using PostgreSQL localhost, with username: xxxx and password: xxxx. There you'll find a database with the following tables: customers, postcodes, countries. Create me an app with login, tables, forms, and printing.
Wonderful, the app actually arrives and works, wow! But how long did that take? What happens if I have almost the same task for another client, using, for example, Spring Boot?
I currently pay Claude €108 a month, which is a hefty sum for an independent developer. What do I get for my money? 5 hours of nonstop development, not bad, but for the next level, it's €180+VAT! Even more for companies?
So, I understand that Claude, Gemini, and the like have their value, but that ends with standard tasks like CRUD and setting up a framework. Why should I task AI with developing the same thing over and over again, wasting countless hours?
I recognized this problem 14 years ago: program generation is the key. Back then, I developed Scoriet entirely from scratch—a program generator for every programming language imaginable: C#, C++, Java, TypeScript, React, Angular, PHP, Laravel, Vue+, and so on.
What I'm trying to say is, use a generator like my Scoriet, which I rewrote using Claude Code and VS Code. I've been working on Scoriet for a year now, and it's about time the world knew about it!
Once the framework, login, CRUD, printing, etc., are all set up, THEN I let Claude continue. Scoriet creates a complete program for me, and Claude then works on it, saving time and money. How does that sound?
Scoriet is still in the development phase, very far advanced but some refinements are still needed here and there, most templates are still local, but will gradually be moved to scoriet.dev and then to demo.scoriet.dev.
The AI world is moving fast, but speed isn't everything.
What matters is turning tools into real outcomes.
Many people spend their time testing every new AI product, while others quietly build useful projects with the tools they already know.
Consistent execution often beats constant tool-hopping.
When writing code by hand, you build up understanding and insights gradually, as you go - you're "crafting" the system ...
But now with AI you click a button, and bam, the whole thing is there and THEN you need to figure out whether it works and what the (hidden) bugs and assumptions are ...
On some pieces of code you might not make any net gains at all in the end.
That's why I keep saying, use AI for the dumb boilerplate code, but maybe keep writing some of the more complex and critical (and interesting) logic manually :-)
This hits so incredibly close to home. The constant flood of new tools, frameworks, and breaking updates feels like drinking from a firehose right now. It's comforting to know even experienced senior devs are feeling this exact same fatigue. Thanks for being so honest about it.