DEV Community: Leonid Bugaev

Security by obscurity is dead. Three things killed it at the same time.

Leonid Bugaev — Mon, 01 Jun 2026 10:39:32 +0000

For years, “we’re not a bank, we don’t need that level of security” was a defensible posture. Most teams ran a dep scanner, had a pen test once a year, and quietly relied on attackers not caring enough to read their code.

That posture is gone. Here’s what changed:

1) Your source code is probably already out there.
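May 20, 2026: GitHub disclosed an employee device compromise via a poisoned third-party VS Code extension. The attacker claimed around 3,800 internal repos were exfiltrated, and GitHub called that figure directionally consistent with its investigation. According to the Nx Console advisory, the malicious version was live in the Visual Studio Marketplace for about 18 minutes and in OpenVSX for about 36.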

If that can happen to Microsoft-owned GitHub — with MDM, EDR, and a small army of security engineers — assume it already happened to your team. How many extensions did your devs install last week? Do you know what any of them actually do?

2) AI flipped the economics of vulnerability research.

Anthropic’s Project Glasswing reports 10,000+ high/critical vulns found in important software in the first month using Claude Mythos Preview. Cloudflare confirmed it works on real codebases — given the right harness (architecture context, narrow tasks, validation, attacker-input reachability).

The scary part isn’t the frontier model. It’s the floor. Open source historically catches up in about six months. After that, a kid in a basement with a decent OSS model can build a personalised exploit for your specific codebase, scan the entire public IPv4 internet in an afternoon, and hit every instance.

3) The supply chain is no longer trusted code.

Trivy — a security scanner — became the attack vector. March 2026: attackers force-pushed 76 of 77 version tags in aquasecurity/trivy-action to credential-stealing malware. CanisterWorm followed shortly after — a self-propagating npm worm using postinstall hooks. Install was enough.

Zero trust now has to include code itself. Pin everything to commit SHAs, not tags. But ask the harder question: what if the platform itself is compromised, the way GitHub just was?

And the CVE model can’t keep up. VulnCheck found that 32.1% of known exploited vulns in H1 2025 had exploitation evidence on or before the day the CVE was issued. NIST says CVE submissions rose 263% from 2020 to 2025. The CVE is no longer the start of danger. It is already late.

So what do you actually do?

Start with the worst case. How would you kill your own company? Authentication, authorization, tenant isolation, secrets, CI/CD, package publishing, billing — these are existential. Everything else is noise until they’re covered.

Patching faster won’t save you. If your architecture and dev flow don’t enforce security boundaries by default, no SLA gets you there. The answer is designed security, not reactive security.

I wrote the full essay on what this looks like in practice: https://blog.reqproof.com/p/death-of-security-by-obscurity

Genuine question to the room: how is your team thinking about source-code-as-attack-surface? Or is everyone still on patching SLAs and hoping?

Death of Security by Obscurity

Leonid Bugaev — Fri, 29 May 2026 05:12:02 +0000

I should feel very scared right now. Everyone should be freaking scared. But I think we have had so many emotional events happening in the world recently that people have stopped feeling much of anything, and that is the most dangerous part of where we are. The line we used to tell ourselves - "we are not a bank, we are not NASA, we don't need that level of security" - is no longer an option. We just haven't felt it yet.

Try this thought experiment. Imagine your company's source code is made public tomorrow. All of it. How would you feel? I bet most of you would be freaking scared. Not because of IP. Because of the quality. Because of the spaghetti conditions in some files. Because of the strange customer-specific branch nobody touched in three years. Because of the comment that says "// TODO: fix this before prod" still sitting in prod. Because of the auth path that "almost" works. Because of the secret that probably should have been rotated.

Thanks for reading The Verification Gap! Subscribe to read my journey on re-discovering software engineering craft

For years, a lot of teams treated security as something between a badge, a process, and a hope. Of course everyone said the right thing. "We treat security as a first-class citizen," and so on. But in practice, unless you were in a regulated industry, security was often optional in the only sense that matters - optional in priority. You followed best practices. You ran dependency scanners. Maybe you had a penetration test every six months because customers asked for it. Maybe you had a badge in your sales deck. But you weren't really thinking about security as part of how the product works.

Banks, automotive, and aerospace were different. In those industries, security is existential - if a bank loses trust, the bank is dead, and if an automotive system fails the wrong way, people die. So they built heavy processes around it: requirements, reviews, evidence, traceability, release gates. All the painful stuff. For a long time it was easy for the rest of us to look at that and say: yes, but we are not a bank.

I used to think this way too. At Tyk, I work with banks, governments, and large enterprises, and I was often annoyed by how slow some of their security processes were. Every release needed another check. Every patch had to go through another team. Every dependency update could become a discussion. Sometimes it took weeks or months. From the outside it looked like bureaucracy, and a lot of it was bureaucracy. But I changed my mind on the core idea. Those industries understood something the rest of us could safely ignore for a while: security is not something you add at the end. It is part of what the system is.

Security is a market

This is the part most engineers do not internalise. There is a real economy of people who make money by finding vulnerabilities in software. Some sell to bug bounty programmes. Some sell to brokers. Some sell to whoever is buying. And some just use what they find directly - exfiltrate data, sell the data, blackmail companies, take systems hostage.

That market used to be expensive to enter. Finding bugs took time. Understanding a custom system took time. Building an exploit took time. So attackers focused where the return was high - WordPress, Drupal, popular CMS plugins, well-known SaaS - anything they could exploit a million times after building it once. If you were niche, you were maybe scanned, but rarely understood. That asymmetry was your moat. Nobody admitted it out loud, but it was the moat.

A few years ago at Tyk we had a slightly crazy idea: let's find open-source Tyk users across the world and see if some of them could become paid users. The idea wasn't crazy. The crazy part was how easy the technical side was. I wrote a scanner that could scan the public IPv4 internet in a matter of hours. The whole internet. Once you do something like that yourself, "nobody will find us" stops sounding like a serious argument.
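A quick back-of-the-envelope check shows why "a matter of hours" is plausible. The only hard number below is the size of the IPv4 address space; the probe rates are illustrative assumptions, not the rate of the scanner described above.

```python
# Back-of-the-envelope: how long does one probe per IPv4 address take?
# The probe rates are illustrative assumptions, not measurements.

IPV4_ADDRESSES = 2 ** 32  # 4,294,967,296, including reserved ranges


def hours_to_sweep(probes_per_second: int) -> float:
    """Hours needed to send one probe to every IPv4 address."""
    return IPV4_ADDRESSES / probes_per_second / 3600


for rate in (100_000, 1_000_000):
    print(f"{rate:>9,} probes/s -> {hours_to_sweep(rate):.1f} hours")
```

At the assumed 100,000 probes per second the sweep takes roughly 12 hours; at one million it takes a little over one hour. Neither rate needs unusual hardware, which is the point.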

You can reproduce a small version of this at home. Run a basic HTTP application on a fresh public IP, expose a port, and watch the logs. Within minutes you start seeing requests: WordPress paths, admin URLs, old plugin routes, random probes, exploit attempts for software you are not even running. Most of it is dumb traffic. That is exactly the point. The internet does not need to know who you are before it starts touching your system.
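A minimal sketch of such a listener, using only the Python standard library. The names `ProbeLogger`, `seen_requests`, and `make_server` are made up for this illustration, and the port is an arbitrary choice. It answers everything with an empty 404 and records who asked for what.

```python
# A tiny HTTP listener that records every request it receives.
# Run it only on a throwaway host; it serves no real content.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_requests = []  # (client_ip, method, path, user_agent) tuples


class ProbeLogger(BaseHTTPRequestHandler):
    def _record(self):
        entry = (
            self.client_address[0],
            self.command,
            self.path,
            self.headers.get("User-Agent", ""),
        )
        seen_requests.append(entry)
        print(*entry, sep=" | ")
        # Answer everything with a bare 404 so scanners get nothing useful.
        self.send_response(404)
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = do_POST = do_HEAD = _record

    def log_message(self, format, *args):
        pass  # silence the default stderr log; _record prints instead


def make_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Build the listener; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), ProbeLogger)
```

Calling `make_server().serve_forever()` on a fresh public IP and leaving it running is enough to watch the probes arrive.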

So scanning was already cheap. The thing that just changed is understanding.

What AI actually changed

AI did not invent insecure software. We were already very good at writing insecure software. What AI changed is the cost of finding the insecurity, and the cost of understanding an unfamiliar system. A model can read a codebase and ask the questions a tired team will never ask, because everyone is busy shipping the next thing. It can build a personalised exploit for an unfamiliar service in hours, sometimes minutes. It can chain three small bugs that nobody would have chained manually because the manual cost was too high.

The reason this is genuinely scary is the floor, not the ceiling. The economics have flipped to the point where a kid in a basement with a decent model can build a personalised exploit for your specific codebase, scan the whole public internet in an afternoon, find every instance of your software, and run that exploit against all of them. Nothing about that sentence requires a state actor.

And this is not theoretical. Anthropic's Project Glasswing reports that Anthropic and around 50 partners used Claude Mythos Preview to find more than ten thousand high- or critical-severity vulnerabilities across important software in the first month, with the bottleneck shifting from finding vulnerabilities to verifying and patching them. Cloudflare pointed the same model at more than fifty of their own repositories and found that real vulnerability research needs a harness - architecture context, narrow tasks, validation - but with that harness, it works. Anthropic also says Mythos-class capabilities will soon exist in many AI labs. Open source historically catches up in around six months. The gap closes; it does not stay open.

So the relevant question is not whether this exact model is public today. The question is whether your security model assumes this capability stays rare. I would not bet a company on that.

Assume your source code is already out

Here is the uncomfortable truth: you should stop wondering whether your source code is going to leak. You should assume it already has.

I know how that sounds. But look at what just happened to GitHub. On May 20, 2026, GitHub said it had detected and contained a compromise of an employee device involving a poisoned third-party VS Code extension. GitHub's assessment was that GitHub-internal repositories were exfiltrated, with the attacker's claim of around 3,800 repositories being directionally consistent with the investigation. GitHub said it had no evidence of impact to customer repositories outside its own internal ones, but it still had to rotate critical secrets and continue analysing logs.

Think about who this happened to. This is GitHub. Owned by Microsoft. With MDM, EDR, hardened endpoints, mature processes, more security engineering than almost any company on earth. One developer. One poisoned VS Code extension. Source code gone.

And the exposure window was tiny. The Nx Console advisory says the malicious version was live in the Visual Studio Marketplace for about 18 minutes and in OpenVSX for about 36 minutes. Eighteen minutes was enough.

Now compare that to your company. If this happened to GitHub, with everything GitHub has, do you really believe nobody has done the equivalent to your team? Be honest. How many extensions did your team install last week? Do you know what any of them do? When did each one last update? Who reviewed it?

So assume it. Assume some snapshot of your source code is already out there. It does not even need to be the latest one - an older snapshot is enough to know where to look. And once an attacker has that, they can use AI to do exactly what defenders are starting to do: read everything, model the system, and build the exploit shaped specifically for you.

That changes the threat model fundamentally. Black-box testing - sending requests, observing responses, fuzzing endpoints, inferring behaviour - is already dangerous. White-box is a different category. With your code in hand, the attacker does not need to guess. They can follow your authentication logic. They can inspect your authorisation paths. They can see your tenant isolation, find the one resolver that does not validate ownership, see the timeout path nobody tested, find the internal endpoint that was "safe" because nobody knew it existed, find the retry that is not idempotent, read the comment that says "this should never happen," and find the strange customer-specific branch that everyone forgot about.

If your answer to that scenario is "well, they probably do not know how the system works," then the source code itself was part of your security boundary. And that boundary is gone.

CI/CD is not plumbing anymore

The other place this hits hard is the build system. We used to think of CI/CD as plumbing. From an attacker's point of view, CI/CD is one of the most interesting machines in the company - it usually has source code, deployment credentials, package publishing tokens, cloud access, GitHub tokens, and secrets for half of the internal systems.

The Trivy incident in March 2026 is the most uncomfortable example because Trivy is a security tool. A trusted security scanner became the attack vector - version tags in aquasecurity/trivy-action were force-pushed to credential-stealing malware, and the action stole everything CI runners had access to. CanisterWorm followed a similar pattern: attackers stole npm tokens from compromised pipelines and used them to publish backdoored versions across every namespace they could reach. The malicious packages ran on postinstall - install was enough.

So zero trust now has to include code. But notice the ladder. First, don't trust your dependencies - pin versions, quarantine updates. Then, don't trust the actions running in your pipelines - pin those to commit SHAs, not tags. Then ask the harder question: what if the platform itself is compromised, the way GitHub just was? At that point pinning helps but is no longer a complete answer. Each step up the ladder gives the attacker less leverage, but no step makes the problem zero.
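The second rung of that ladder is easy to check mechanically. Here is a rough sketch that scans workflow text for `uses:` references whose ref is not a full 40-character commit SHA. The function name, the example workflow, and the placeholder SHA are all hypothetical, and the line-based regex is a simplification. A real check would parse the YAML properly.

```python
# Flag GitHub Actions `uses:` references that are not pinned to a full commit SHA.
import re

USES_LINE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` target whose ref is a tag or branch, not a SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        target = match.group(1)
        if target.startswith("./") or target.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, _, ref = target.partition("@")
        if not FULL_SHA.match(ref):
            findings.append(target)
    return findings


# Hypothetical workflow: the SHA below is a placeholder, not a real commit.
example = """
steps:
  - uses: actions/checkout@v4
  - uses: example-org/scanner-action@0123456789abcdef0123456789abcdef01234567
  - name: local
    uses: ./.github/actions/build
"""
print(unpinned_actions(example))
```

Here only `actions/checkout@v4` is reported, because a tag can be moved after the fact while a commit SHA cannot. As the paragraph above says, this narrows the attacker's leverage but does not help if the pinned commit itself was malicious.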

This sounds exhausting. It is. But the alternative is worse.

The CVE model is dead

The CVE model still matters, but it cannot be the centre of your security process anymore. For a lot of teams the hidden workflow is still: wait for the CVE, check the severity, patch by priority, hope customers update before something bad happens. That workflow was already fragile. The new world breaks it.

VulnCheck found that in the first half of 2025, 32.1% of known exploited vulnerabilities had exploitation evidence on or before the day the CVE was issued. For a large share of exploited vulnerabilities, the CVE was not the start of danger. It was already late.

The system producing CVEs is also overloaded. NIST said CVE submissions increased 263% between 2020 and 2025. NIST enriched nearly 42,000 CVEs in 2025, more than any prior year, and still said it was not enough to keep up. In April 2026, NIST moved to a risk-based enrichment model where some CVEs are listed but not immediately enriched.

And none of that machinery will know the weird things inside your own system. A CVE will not know that your GraphQL resolver crashes a customer's system on one malformed input. It will not know that your retry path is unsafe when the downstream service writes but your load balancer times out first. It will not know that your PII is in the same database as everything else, and one forgotten SQL injection path could expose more than anyone expected. You need a model of your own system, not just a feed of public bugs.

The intuitive response is "patch faster." That is not enough either. No matter how fast you patch, if your architecture and development flow do not enforce security boundaries by default - if they do not force you to think about security as you write the code - patching will not save you. Every new release becomes a new attack surface, and a personalised exploit can be ready in five minutes. I am not exaggerating.

Cloudflare made the more important point. They described teams talking about a two-hour SLA from CVE release to patch in production, but if regression testing takes a day, getting to two hours means skipping something. Their conclusion is architectural: make exploitation harder even when a bug exists, put defences in front of the application, design the system so one flaw does not give access to everything else, and make fixes deployable everywhere at once.

That is the difference between reactive security and designed security. Reactive security tries to outrun the attacker. Designed security assumes bugs exist and limits what one bug can do.

So how do you live in this world?

The first thing is to actually accept where you are. You have to go through the stages - anger, fear, denial, and eventually acceptance. Most teams stop at denial. They tell themselves the old story: we are not big enough, we are not interesting enough, who would target us. That story is over.

Once you accept it, the next move is not "make everything secure." That is too vague and usually becomes theatre. Start with the worst case. What can kill your company? How would I leak all customer data? How would I bypass authentication? How would I bypass tenant isolation? How would I poison a release? How would I get production credentials out of CI? How would I make one customer's system go down with a single malformed request? How would I turn a slow downstream service into a cascading outage?

This is uncomfortable, but it gives you priorities. The worst case is not the same for every company. For a bank, anything touching the money is existential. If a bank loses money, the bank is dead. For a SaaS company automating LinkedIn outreach, leaking a list of customer emails is bad but survivable. The same company being used to impersonate its users and send messages on their behalf is not survivable. That is the death of the company.

If everything is critical, nothing is critical. But some things really are: authentication, authorisation, tenant isolation, secrets, CI/CD, package publishing, admin APIs, customer data, billing data, and anything that turns one bug into many customers affected. Be honest about which of these are existential for you, and put real process around those.

By real process I do not mean bureaucracy for its own sake. I mean what banks and government suppliers actually do, and I am saying this from experience - I spent years building software that went through that kind of pen-testing, the long kind, with people who do this for a living. Pin everything that runs in your build. Make secrets short-lived. Quarantine dependency updates before they reach your main branch. Treat anything that executes code in your dev or build environment as part of the product, because it is.
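Quarantining updates can be as simple as refusing anything too young. A rough sketch, assuming you can get publish timestamps from your registry. The seven-day window, the function name, and the release history are illustrative assumptions, not a recommendation.

```python
# Pick the newest dependency release that has survived a quarantine period.
from datetime import datetime, timedelta, timezone
from typing import Optional


def newest_after_quarantine(
    releases: dict[str, datetime],
    now: datetime,
    quarantine: timedelta = timedelta(days=7),  # assumed policy, tune to taste
) -> Optional[str]:
    """Return the most recently published version that is old enough to adopt."""
    eligible = {v: ts for v, ts in releases.items() if now - ts >= quarantine}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
releases = {  # hypothetical package history
    "2.3.0": datetime(2026, 4, 2, tzinfo=timezone.utc),
    "2.3.1": datetime(2026, 5, 10, tzinfo=timezone.utc),
    "2.4.0": datetime(2026, 5, 31, tzinfo=timezone.utc),  # one day old
}
print(newest_after_quarantine(releases, now))
```

With these dates the function selects `2.3.1` and skips the one-day-old `2.4.0`. A malicious release that is caught and pulled within hours never becomes eligible. A backdoor that stays undetected longer than the window still gets through, so this is one layer, not the answer.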

And then there is the boring part, the part that works much better than people want to admit: checklists.

Checklists are not exciting. They are one of the main reasons software engineering has ever shipped anything reliable. They are why planes fly. They are why a CT scanner does not lie to a radiologist. The reason they work is not because they are clever. They work because they force you to think about things you would otherwise skip.

If you have a login system, you should be forced to think about password reset, previous password reuse, account enumeration, timing attacks, brute force, lockout, and what happens when the email provider is down. If you make an outbound HTTP call, you should be forced to think about timeouts, DNS hangs, retries, idempotency, downstream slowness, partial success, and what happens if the service receives your request but your load balancer times out before you get the response. If your Go code starts a goroutine, you should be forced to think about cancellation, ownership, leaks, blocked channels, and behaviour under load.
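To make the outbound-call items concrete, here is a minimal sketch of a retry wrapper. It assumes a hypothetical `send(payload, idempotency_key, timeout_s)` transport that raises `TimeoutError` when no response arrives, and a downstream service that deduplicates on the key. The parameter defaults are arbitrary.

```python
# Retry an outbound call with a per-attempt timeout, bounded attempts,
# exponential backoff, and ONE idempotency key reused across all attempts.
import time
import uuid
from typing import Callable


def call_with_retries(
    send: Callable[[dict, str, float], dict],  # hypothetical transport function
    payload: dict,
    attempts: int = 3,
    timeout_s: float = 2.0,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    key = str(uuid.uuid4())  # same key on every retry so the server can dedupe
    last_error = None
    for attempt in range(attempts):
        try:
            return send(payload, key, timeout_s)
        except TimeoutError as exc:
            # The request may have been applied even though we saw no response.
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The design choice that matters is generating the key once, outside the loop. If the service received the first request but the load balancer timed out before the response came back, the retry carries the same key and the service can recognise it as a duplicate instead of writing twice.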

None of this is advanced security research. It is basic engineering. But it only happens when the process forces it to happen. The hardest bugs to find are not the ones you wrote wrong; they are the ones you never wrote at all, because you never thought about that case.

And if you decide not to handle something, fine - say it. Log it as a known issue. Put an expiration on it. Hidden gaps are the dangerous ones.

Putting security on autopilot

After years of pen-testing work with banks and governments, I noticed the same kinds of gaps kept coming back. Login systems that didn't think about timing attacks. HTTP calls that didn't handle partial success. Goroutines that leaked under load. Not exotic bugs - basic ones, in different codebases, over and over.
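As one example of the login gap, here is a rough sketch of a check that does the same amount of work whether or not the account exists and compares hashes in constant time. The in-memory `USERS` store, the iteration count, and the function names are assumptions for illustration. A real system would use a vetted password-hashing setup and add rate limiting on top.

```python
# Login check that does the same work whether or not the account exists.
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative; pick a cost per current guidance


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


# Hypothetical in-memory user store: username -> (salt, password hash).
_salt = os.urandom(16)
USERS = {"alice": (_salt, hash_password("correct horse", _salt))}

# Dummy credentials so unknown usernames cost the same as known ones.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hash_password("dummy", _DUMMY_SALT)


def check_login(username: str, password: str) -> bool:
    salt, expected = USERS.get(username, (_DUMMY_SALT, _DUMMY_HASH))
    candidate = hash_password(password, salt)
    # compare_digest avoids leaking how many leading bytes matched.
    matches = hmac.compare_digest(candidate, expected)
    return matches and username in USERS
```

The caller should return the same error message for a wrong password and an unknown user. Otherwise the response text leaks what the timing no longer does.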

So I started writing them down. Open-source projects I had investigated, real incidents I had seen up close, every checklist I had built across years of audits - all of it into a catalogue. That catalogue is what Proof runs on.

Then I built automation around it, because checklists in someone's head do not scale. This is also why I care about MC/DC. Line coverage tells you a line ran. Modified Condition/Decision Coverage - required at Level A, where failure could be catastrophic, under DO-178C, the avionics software standard the FAA accepts - asks whether each condition in a decision has been shown to independently affect the outcome. The bug is rarely "this function was never tested." It is "this function was tested, but not when auth is false, tenant is different, feature flag is enabled, downstream state is stale, and the retry path is active." Not the happy path. The combination nobody specified.
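The difference between line coverage and MC/DC is easier to see on a small decision. The sketch below checks the unique-cause form: for each condition, the test set must contain two cases that differ only in that condition and produce different outcomes. The `allow` rule and the helper name are hypothetical, and this is not how any particular coverage tool implements it.

```python
# Check which conditions of a decision have MC/DC "independence pairs"
# in a given set of test vectors (unique-cause form).
from itertools import combinations
from typing import Callable, Sequence


def conditions_without_independence_pair(
    decision: Callable[..., bool],
    names: Sequence[str],
    tests: Sequence[tuple[bool, ...]],
) -> list[str]:
    """Return conditions never shown to flip the outcome on their own."""
    shown = set()
    for a, b in combinations(tests, 2):
        differing = [i for i in range(len(names)) if a[i] != b[i]]
        if len(differing) == 1 and decision(*a) != decision(*b):
            shown.add(differing[0])
    return [name for i, name in enumerate(names) if i not in shown]


# Hypothetical access rule: authenticated AND (same tenant OR admin).
def allow(authenticated: bool, same_tenant: bool, admin: bool) -> bool:
    return authenticated and (same_tenant or admin)


names = ["authenticated", "same_tenant", "admin"]
happy_path = [(True, True, False), (False, True, False)]
print(conditions_without_independence_pair(allow, names, happy_path))

mcdc_suite = happy_path + [(True, False, False), (True, False, True)]
print(conditions_without_independence_pair(allow, names, mcdc_suite))
```

The two happy-path tests already execute every line of `allow`, yet the first call reports `same_tenant` and `admin` as unproven. Adding two more cases, four tests for three conditions, leaves nothing unproven. Those two extra cases are exactly the "different tenant" combinations that nobody specified.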

Proof works in both directions. From spec to code: if your requirement says "user can log in," Proof attaches sub-requirements for password reset, timing attacks, enumeration, lockout, dependency failures - and the CI check literally fails if a required item has no test attached. If you decide not to handle one of them, fine - mark it as a known issue, attach an expiry, and it stays visible instead of disappearing. From code to spec: static patterns scan for signals (HTTP client, goroutine, database call, queue), and when one is found, Proof asks the spec whether you have described what happens when the service is slow, down, or partially successful. If the spec is missing, the code is shouting at the spec.
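To show the shape of the spec-to-code gate, here is a toy version. It is not Proof's implementation. `ChecklistItem`, `gate_failures`, and the example items are invented for illustration, and a real gate would load this data from the spec instead of hard-coding it.

```python
# A CI-style gate: every checklist item needs a test or an unexpired waiver.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ChecklistItem:
    name: str
    tests: list[str] = field(default_factory=list)
    waived_until: Optional[date] = None  # "known issue" with an expiry


def gate_failures(items: list[ChecklistItem], today: date) -> list[str]:
    """Return a human-readable failure for every uncovered item."""
    failures = []
    for item in items:
        if item.tests:
            continue
        if item.waived_until is None:
            failures.append(f"{item.name}: no test and no waiver")
        elif item.waived_until < today:
            failures.append(f"{item.name}: waiver expired {item.waived_until}")
    return failures


# Hypothetical sub-requirements attached to "user can log in".
login = [
    ChecklistItem("password reset", tests=["test_reset_flow"]),
    ChecklistItem("timing attacks", waived_until=date(2026, 3, 1)),
    ChecklistItem("account enumeration"),
    ChecklistItem("lockout", waived_until=date(2026, 12, 31)),
]
for failure in gate_failures(login, today=date(2026, 6, 1)):
    print(failure)
```

With these dates the gate reports the expired waiver on timing attacks and the uncovered enumeration item, and stays quiet about lockout until its waiver runs out. In CI the script would exit non-zero whenever the list is not empty.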

Over time the spec stops being a static document and becomes a living source of truth - in my case, a graph of small interconnected requirements that I treat as more authoritative than the code itself, because the code can drift and the spec is the contract. Spec links to code, code to tests, tests back to spec. If one of them changes, the link becomes suspect, and you have to look again.

This does not replace human security work. It just makes the questions impossible to skip.

Security is everyone's problem now

Not every company needs to copy bank-level process. That would kill a lot of teams. But every software company needs to copy the posture. Security is not something you bolt on at the end. Your moat is no longer the software you build, or the market expertise, or being first. Your moat is being something people can trust to depend on for a long time.

So assume the internet will find you. Assume your dependencies are not safe by default. Assume one developer tool can become the entry point. Assume your source code has already leaked. Assume attackers can use AI to understand your system faster than you can.

Then ask what still protects you. Do you know which parts of the system can kill the company? Can you rotate secrets quickly? Do you have tests for the behaviours that matter, not just the lines that execute? Do you know when code, tests, and specs drift apart?

Security by obscurity is what dies when attention gets cheap. That world is going away. The practical question now is simple. When someone looks closely at your system - with automation, with your source code in front of them, with AI, and with patience - what will they find?

And what will be your answer?

If any of this is hitting close to home and you want to see what putting security on autopilot looks like in practice, get in touch. I'd love to hear what your team is dealing with - and show you how Proof works on real code. reqproof.com

Source of truth: Code, Spec, or Requirement?

Leonid Bugaev — Thu, 14 May 2026 19:38:51 +0000

The code runs. The code breaks. The code is what production uses.

Specs and docs can help, but they often become stale. So we learned not to trust them too much.

Code is honest in a way documents are not. It may be wrong, but it does exactly what it does.

But I think there was another reason we trusted code so much.

For a long time, code was manual work. We wrote it ourselves. We spent time with it. We were not only typing syntax; we were thinking through the system while writing it. The thinking and the implementation were almost the same activity.

That is why code deserved this level of trust. Not because code was perfect. It was not. But because the code carried a lot of the human judgement that produced it.

With agentic coding, this becomes more complicated.

As an individual contributor, I can move incredibly fast now. I can open Claude Code or a similar tool, give it direction, and shape the project while it is being built. I may start with a rough spec or just an idea, but the real decisions often happen during implementation. I try something. I see the code. I realise the original idea was not quite right. I adjust. The agent tries another version.

This is not bad. Actually, it is one of the best parts of working this way. Implementation gives feedback. Sometimes the code teaches you what the spec should have been.

So code-first is very tempting. It keeps speed. It keeps flow. It works especially well when the person steering the agent has the whole picture in their head.

But that is also the problem.

In that case, the real source of truth is not the code and not the spec. It is the experienced person.

They know the edge cases. They know the dependencies. They know which interface is fragile. They know why some strange behaviour exists. They know when the generated code is technically correct but still wrong. They are continuously filling the gaps.

The agent is not really working from a complete spec. It is working with a human who carries the missing context.

This works while the system fits inside one brain.

But real systems do not stay like this. They grow. They get delegated. Teams split. People leave. New people join. Some engineers understand the product but not the architecture. Some understand the architecture but not the domain. Some are junior. Some are moving fast. And the agent only knows what we gave it, plus whatever it can infer.

At that point, memory becomes a bad source of truth.

We forget dependencies. We forget edge cases. We forget why something was built in a strange way. We forget which downstream system depends on a behaviour. Not because people are bad, but because we are human.

Agents do not magically solve this. If something is not described, they will improvise. They will choose something plausible. And plausible is often enough to pass the first review.

The same is true for humans. If something is not specified, we should not expect it to behave exactly as we imagined.

This is where I think the word “spec” is not enough.

In many software teams, a spec is a temporary artifact. You write it to start the task. It helps the engineer or the agent. Then implementation happens, some things change, and the spec is effectively dead. Maybe it still exists in Notion, Linear, Jira, GitHub, or a markdown file. But nobody really trusts it six months later.

If the spec is temporary, of course code wins.

But maybe the better word is requirement.

And this is not a new idea. This is basically how regulated industries already work. In aerospace, medical, automotive, and similar places, requirements are normal. Teams there can produce multi-hundred-page documents explaining the whole system, but it is usually not one big document in the simple sense. It is a set of requirements, sub-requirements, interfaces, tests, evidence, and links. A graph that can be turned into a document when needed.

That difference matters.

A spec often says: here is what we want to build.

A requirement says: here is what the system must do, what it must not do, how we know it was implemented, and where the evidence is.

The requirement does not die when the task is done. The code implements it. The test proves it. The evidence is attached to it. If something changes, the requirement becomes part of the review again.

This is the part I think normal software engineering may need to borrow, but without copying all the heaviness.

Not because we suddenly want bureaucracy. I don’t. But because agentic engineering makes implementation faster, and the faster we implement, the easier it becomes to lose intent.

The dangerous drift is not always a bug. The code can work. The tests can pass. CI can be green. But the product intent may have moved a little. The domain behaviour may not be exactly right. A security assumption may be weaker. An interface may have changed in a way nobody noticed.

Everything is green, but it is green around slightly wrong intent.

I also do not think the answer is just “write a huge spec first.” That can fail too. A detailed spec can make wrong assumptions before reality pushes back. Then implementation starts, the spec is challenged, and now you have spec drift instead of code drift.

So for me the real question is not code-first or spec-first.

The real question is how we manage the drift between intent and implementation.

If the code does something different from the requirement, it should not silently become the new truth. But if the requirement is wrong, it should not block reality forever either. There should be a stop. The team should ask: is the code wrong, is the requirement wrong, or did we learn something?

Until that is resolved, something is broken in the process.

This is where traceability matters. Not as documentation for documentation’s sake, but as an invalidation mechanism.

If code changes, the related requirement and tests should become suspicious. If a test changes, the requirement should be checked. If a requirement changes, the code and tests should definitely be reviewed. The system should say: this thing changed, so these other things are no longer fully trusted.
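One way to picture this invalidation mechanism is a set of links plus a content fingerprint recorded at the last review. The sketch below is only an illustration of the idea. The artifact names, the link structure, and the function names are hypothetical.

```python
# Mark linked artifacts as suspect when something they trace to has changed.
import hashlib


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def suspect_artifacts(
    links: dict[str, set[str]],   # artifact -> artifacts it is linked to
    reviewed: dict[str, str],     # artifact -> fingerprint at last review
    current: dict[str, str],      # artifact -> current content
) -> set[str]:
    """Return artifacts that need re-review because a neighbour changed."""
    changed = {
        name for name, text in current.items()
        if fingerprint(text) != reviewed.get(name)
    }
    suspects = set()
    for artifact in changed:
        suspects |= links.get(artifact, set())
    return suspects - changed


# Hypothetical trace: one requirement, its code, and its test.
content = {
    "REQ-7": "Sessions expire after 15 minutes of inactivity.",
    "session.py": "TIMEOUT_S = 900",
    "test_session.py": "assert expires_after(900)",
}
reviewed = {name: fingerprint(text) for name, text in content.items()}
links = {
    "REQ-7": {"session.py", "test_session.py"},
    "session.py": {"REQ-7", "test_session.py"},
    "test_session.py": {"REQ-7", "session.py"},
}

content["session.py"] = "TIMEOUT_S = 3600"  # code drifts, tests untouched
print(sorted(suspect_artifacts(links, reviewed, content)))
```

After the code change, both `REQ-7` and `test_session.py` come back as suspect even though neither file was edited. Someone now has to decide whether the code is wrong, the requirement is wrong, or the team learned something.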

This is also why evidence matters.

I do not trust humans. I do not trust AI either. I want the evidence.

A green CI run is useful, but evidence of what? That the code passes current tests? Or that the system still matches the original intent?

Those are not the same thing.

In larger projects, the original intent becomes a spec, then tasks, then subtasks, then test cases. Every step narrows the focus. Everyone implements their piece. Everyone tests their piece. Locally, everything can look correct. But who validates the final system against the original intent?

Usually not systematically.

This is the part I want to think more about. Maybe requirement management, in some lighter and more modern form, becomes much more important for normal software teams. Not because requirements are new, but because agentic development changes the cost of not having them.

Code is still the runtime truth. It tells us what the system does.

But the requirement is the intent truth. It tells us what the system is supposed to mean.

And evidence is what connects them.

I do not have the full answer yet. I still do not know how detailed requirements should be before they become harmful. I do not know which parts should be formal and which should stay flexible. I do not know how to make traceability work inside Git, PRs, CI, and agentic coding tools without making everyone hate it.

But I think the direction is becoming clearer.

In the old world, code was expensive to produce, so code naturally became the main asset. In the agentic world, code may become cheaper to produce, but intent becomes easier to lose.

And if intent is the thing we can lose, maybe intent is the thing we need to manage much more seriously.

What is the new engineering bottleneck?

Leonid Bugaev — Thu, 07 May 2026 07:05:53 +0000

Something I keep thinking about:

Maybe AI is not exposing a new problem in engineering. Maybe it is exposing an old one that we were already bad at.

We talk about AI like the bottleneck is still writing code. But honestly, writing code has not been the hard part for a long time. The hard part is all the surrounding context.

Why are we making this change? Who asked for it? Which customer depends on the current behavior? Was this weird edge case intentional? Is this a product decision or just an implementation accident? Did security already review something similar before? Is the documentation describing the current behavior or the behavior we wish we had?

And the uncomfortable part is that most of this context is not in one place.

It is scattered across GitHub, Slack, docs, tests, and the head of the engineer who left six months ago.

This was already a problem. AI just makes it harder to ignore, because now we can create more code, more tests, more docs, and more confident explanations from less context.

But if the context is incomplete, then all of that output is built on sand. This is the part I find interesting.

Not “will AI replace engineers?” I don’t think that is the most useful question.

The more interesting question is: What happens when engineering teams can generate implementation faster than they can preserve intent?

Because that is where things get messy.

You can have a clean PR. You can have passing tests. You can have updated docs. You can even have a very convincing AI-generated explanation.

And still nobody can answer the basic question: “Is this actually the right change?”

That question is much more expensive than people admit. I have felt this many times in open source and infrastructure work.

You look at a small change and think, “This should be simple.” Then you start pulling the thread. There is a backwards compatibility issue. There is some behavior that looks wrong but someone depends on it. There is a test that protects the implementation but not the real promise. There is a doc page that says one thing and production behavior says another. There is a customer workaround that became part of the product without anyone naming it. Suddenly the small change is not small.

And this is why I think “AI will make everyone ship faster” is only half true. AI can make the creation part faster. But creation is not the same as shipping. Shipping means the organization understands the change well enough to stand behind it.

That is a different problem. I don’t have the perfect answer yet.

But I think “AI coding” is the wrong frame. The real problem is not coding. The real problem is engineering memory.

And most teams’ engineering memory is held together with Slack search, old PR comments, and someone saying: “I think I remember why we did that.”

That does not scale.

Trust Is the Bottleneck

Leonid Bugaev — Thu, 07 May 2026 07:04:16 +0000

Everyone is asking the same question now: if AI can help us create much more code, why aren’t engineering teams suddenly moving much faster?

I think the question is right, but the answer usually stops too early.

AI does make some things dramatically faster. MVPs are faster. Prototypes are faster. The time to validate an idea is reduced a lot. You can explore directions that previously weren’t worth the effort. This is real, and I don’t want to pretend otherwise. But creating the first version of something isn’t the same as maintaining a product, and creating more pull requests isn’t the same as creating more trusted change.

This is where the economics breaks. If your team can create ten times more pull requests, your product doesn’t automatically move ten times faster. Your company doesn’t become ten times faster. The economy doesn’t double or triple. Because the expensive part of mature engineering was never only the typing of code.

The expensive part is trust.

Can I trust this change? Does it match the intent? Does it break a hidden customer flow? Does it affect backwards compatibility? Are the docs updated? Are the tests proving the right thing? Did we think about security, performance, malformed input, error states, release notes, migration, support?

A pull request doesn’t answer all of this.
A pull request is just something asking to be trusted.

So I don’t think the interesting question is “can AI create more code?” It can. The interesting question is: what needs to exist around the code so we can safely absorb more change?

If we can scale trust, we can unlock the real scaling of AI. But not by sending maintainers ten times more PRs. That only moves the bottleneck. What I want is a pull request that comes with enough context that I can actually believe it: why this change exists, what it affects, which tests prove it, which docs changed, what can break, and what still needs a human decision.

AI made implementation cheaper, but trust is still expensive. Join me on the journey to find what to trust in the new, post-AI world.

If that is the kind of engineering problem you care about, subscribe.

If your trust model is green CI, you are in trouble

AI isn’t going back in the box. Even if you personally didn’t join the hype train, people around you probably did. Engineers use it to write code. PMs use it to write specs. Someone asks it to validate the plan, then write the code, then write the tests, then check everything against the same plan again.

I do it too. I ask AI to help me write the spec. Then I ask it to validate the plan. Then I ask it to write the code. Then validate its own code. Then write the tests. Then check everything against the original plan. It’s tempting because it works surprisingly well. In a lot of cases, it feels almost magical.

But this is exactly why it becomes dangerous.

For a long time, our basic engineering trust model was something like this: write the code, write the tests, pass CI/CD, review the pull request, ship. It was never perfect, but it was much better than nothing. Green CI never meant the product was correct. It meant the code passed the checks we had.

The problem is that those checks don’t prove intent. They don’t prove the requirement was correct. They don’t prove the tests were testing the right thing. They don’t prove the documentation was complete. They don’t prove the change matched the real product behavior we needed.

They prove that the current artifacts agreed with the current checks.

With AI, the whole chain can be generated. The spec can be wrong, the code can follow the wrong spec, the tests can validate the wrong code, the docs can describe the wrong behavior, and CI can still be green. Everything agrees with everything, but the intent is wrong.

That’s not trust. That’s a consistent mistake.

This is why high coverage isn’t enough either. In my previous article about jsonparser, the painful part wasn’t that I had no tests. I had near-100% coverage in the area that mattered. The problem was that malformed input behavior was never properly described. So the tests proved what existed, not what should have existed.

You cannot test what you never described.

Security makes this even less optional. For years, many teams survived with some quiet version of security by obscurity. Not officially, of course. Everyone says security matters. But in practice, a lot of software depended on nobody looking too closely, or on attackers moving slowly enough that maintainers had time to react.

That assumption is breaking. VulnCheck reported that in the first half of 2025, 32.1% of known exploited vulnerabilities had exploitation evidence on or before the day the CVE was issued. This doesn’t mean every vulnerability becomes an exploit in hours, but it does mean the old time cushion isn’t something you can build your product around anymore.

So things that felt optional before become normal engineering requirements: malformed input, authorization boundaries, resource limits, timeout behavior, error states, data exposure, public API behavior. These aren’t enterprise extras. They’re product requirements.

This is the uncomfortable part: the trust problem is now everyone’s problem. Even if your company hasn’t “adopted AI,” your people probably have. Even if your CI is green, it may be green against the wrong intent. Even if your coverage is high, it may cover the behavior you remembered to describe, not the behavior the product actually needs.

So we need a different source of truth. Not instead of CI/CD, not instead of tests, not instead of code review. Above them. Something that says what the system is supposed to do, which obligations apply, what evidence proves them, and what becomes suspicious when something changes.

Otherwise AI won’t only help us move faster.
It will help us move faster with a false feeling of safety.

The outside structure is not the product

I know this problem from open source. For at least the last 12 years, I have worked a lot in open source. I had my own popular open source projects, and today at Tyk we build an open source API Gateway.

Open source is hard. Not because people are bad. Usually it’s the opposite. Someone from the outside sends you a pull request. Maybe it’s a bug fix. Maybe a new feature. Maybe it’s useful. Maybe it’s technically correct. Maybe they spent their evening on it.

But as a maintainer, you still need to get inside the context. You need to understand what’s happening and why this person is doing it. You can be fast and accept too much, or stay picky and make people unhappy. Neither option really solves the trust problem.

The real issue isn’t that contributors are bad. The issue is that they see the outside structure. They see the code. Maybe they see the tests. Maybe they see the docs. But they don’t see the intent in the same way the owner of the project sees it. They don’t know all the small product promises made over the years. They don’t know which ugly thing is accidental and which ugly thing is load-bearing. They don’t know which customer flow depends on some behavior that looks strange from the outside.

They are not inside of this bubble.

And this isn’t only open source. The same thing happens inside a company. Someone from support knows the product very well. They see customer pain every day. They may even be technical enough to raise a pull request. Someone from solutions architecture can do the same. Another team can contribute to your service. AI makes all of this easier.

But internal doesn’t automatically mean trusted.

A support engineer may understand the product from the customer side, but not the architecture. Another team may understand code, but not the local history. AI may generate something that looks clean, but it has no real ownership unless someone gives it context and checks it.

These contributions can become shallow. Not useless. Shallow. They touch the visible layer of the system, but they aren’t backed by the deep intent of the people who own this part of the product.

We tried relaxing quality gates a few times. More people contributing sounds obviously good, especially when every company has more backlog than humans. But we had cases where a simple line, a simple fix, broke everything. We had other cases where the fix was so big in scope that it was too dangerous to merge.

The conclusion wasn’t “only engineers can write code.”
The conclusion was: if you want to scale engineering, it’s always about trust.

This is also why “move fast” changes meaning when you have customers. When you’re still searching for an MVP, you can break things and call it learning. But when customers put your product inside their infrastructure, the product is no longer fully yours. They pay you for stability, security, and predictable behavior. In a way, you give away part of the ownership.

At Tyk, this is very real. We build software used by banks, governments, and large enterprises. Quality assurance isn’t some internal slogan. It’s part of the relationship with customers. All software has bugs; I don’t want to pretend otherwise. But the price of a bug isn’t the same everywhere. Sometimes it’s legal. Sometimes it’s regulatory. Sometimes it’s very big money. Forget even the money for a second: what if the bank goes down? It can become a national-level issue.

Speed isn’t how quickly you can make a change.
Speed is how quickly you can safely absorb change.

Lehman’s software evolution work has a phrase that fits here: “The safe rate of change per release is constrained by the process dynamics.” In the same passage, he says that as the number, size, and architectural distance of changes increase, complexity and fault rate grow more than linearly.

This sounds academic, but it matches product reality. You can move only as fast as your safety norms allow. Your current team, your architecture, your customer base, your process, your quality gates, your review culture — all of this defines your real speed.

If AI gives you more change than your trust system can absorb, you aren’t scaling engineering. You’re scaling incoming work.

Temporary specs become archaeology

One of the deeper problems is how we treat specifications in consumer engineering.

Most of what we call a specification is a temporary artifact. You start with all the best practices. Maybe an RFC. Then it becomes a detailed Jira ticket. Maybe later there is an ADR. There are comments in GitHub. A Slack thread. A Confluence page. A few decisions made during review because reality was different from the original assumption.

In the moment, this feels normal. This is how software gets built. But after some time, all these artifacts pile up. If you want to understand how a component works, you need to dig through history. You need to understand why it ended up in this final state. Why this was done and not that. You may be lucky and find the exact explanation. In most cases it’s lost in someone’s head.

This is archaeology, not development.

The bigger problem is that these artifacts are independent. The RFC isn’t connected to all the code. The Jira ticket isn’t connected to all the tests. The docs are scattered across ten pages. The final implementation isn’t connected back to the original assumptions. It’s not a graph.

So we trust a person — engineer, architect, lead, PM — to hold the high-level picture in their head. We trust them to find all dependencies. We trust them to notice backwards compatibility issues. We trust them to know which docs need updating. We trust them to remember which customer flow can break.

And of course people forget. Not because they’re careless. Because this is too much context for one person to carry.

You fix a bug and break some other flow. You build a feature and forget a dependency with another service. You update two documentation pages and miss the other eight. The feature exists, but it’s unusable for one group of users. The implementation works, but not in the real production shape.

The spec was supposed to create clarity.
But because it was temporary, it becomes one more historical artifact.

This is also why spec-first isn’t enough. Spec-driven development is better than no spec. Planning before coding is obviously better than jumping into implementation. But if the spec is still treated as a temporary artifact, after a few iterations you end up in the same position, with intent chaos.

During development, the spec always changes. You start with assumptions. Researched assumptions, but still assumptions. Then implementation begins and reality appears. The architecture doesn’t work. A limitation appears. A reviewer notices a security issue. QA finds a case. A customer dependency changes the direction. And in many teams, the spec isn’t updated.

The real knowledge moves into GitHub comments, Slack messages, review threads, and people’s heads. The Jira ticket becomes stale. The implementation says one thing. The ticket says another.

Imagine someone from QA comes back from vacation and needs to test the feature. They see the ticket. They see the implementation. They have no idea what is happening. Why this? Why not that? Is this intended?

It’s so common. I bet a lot of you feel the same.

How can I know what I don’t know?

A lot of bugs are not in the code first. They are in the missing specification.

Have you actually described what should happen if the input is malformed? Have you described that this functionality must not allow SQL injection? What happens if the third-party service times out? What is the error state? Have you described authorization boundaries? Resource limits? Performance boundaries? What happens when something goes from ten requests per second in a test environment to thousands in production?

There are also more subtle cases. Concurrency. Non-deterministic behavior. Map iteration. Merge order. I’m looking at you, Go.

Have you described that the behavior should be deterministic? Did you write a test for it? Did the test prove the requirement, or did it just execute the code?
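As a small illustration of a determinism requirement with a witness, here is a sketch in Go. The function `RenderHeaders` and the requirement it carries are hypothetical; the only fact it relies on is that Go does not guarantee map iteration order, so the code sorts keys explicitly.

```go
package main

import (
	"sort"
	"strings"
)

// RenderHeaders turns a header map into a single string.
// Hypothetical requirement: equal input must produce byte-identical output,
// so keys are sorted instead of relying on map iteration order.
func RenderHeaders(h map[string]string) string {
	keys := make([]string, 0, len(h))
	for k := range h {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k)
		b.WriteString(": ")
		b.WriteString(h[k])
		b.WriteString("\n")
	}
	return b.String()
}

// IsDeterministic witnesses the requirement: it renders the same input
// `runs` times and reports whether every result matches the first one.
func IsDeterministic(h map[string]string, runs int) bool {
	first := RenderHeaders(h)
	for i := 1; i < runs; i++ {
		if RenderHeaders(h) != first {
			return false
		}
	}
	return true
}
```

A test that only calls `RenderHeaders` once would execute the same lines but would not witness the requirement. Repeating the call is a weak probabilistic check, not a proof, but it at least names what is being claimed.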

This is where checklists, obligations, processes, discipline, and all the boring stuff come in. I know people hate boring process. I hate fake process too. Documents nobody reads. Boxes people tick after the fact. Quality theatre.

But the useful version is different. An obligation is not a test case. It’s a category of behavior you are required to describe: malformed input, boundary behavior, error handling, access denied, determinism, idempotency, atomicity, nil safety, overflow safety, encoding safety.

The obligation doesn’t tell you the answer.
It forces you to ask the question.

That is why I like it. It turns “maybe someone remembers” into a deterministic process. The checklist itself is human judgment. But checking whether the spec covered the checklist can be mechanical.
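The mechanical part can be very small. Below is a sketch in Go, assuming a hypothetical `Spec` record that maps each obligation to either a described behavior or an explicit "not applicable" reason. The type and function names are illustrative only.

```go
package main

import (
	"sort"
	"strings"
)

// Spec is a hypothetical requirement record. Addressed maps an obligation
// name to a described behavior or to an explicit "not applicable" reason.
type Spec struct {
	ID        string
	Addressed map[string]string
}

// MissingObligations returns the required obligations that the spec
// never mentions (or mentions with an empty description), sorted by name.
func MissingObligations(s Spec, required []string) []string {
	var missing []string
	for _, o := range required {
		if strings.TrimSpace(s.Addressed[o]) == "" {
			missing = append(missing, o)
		}
	}
	sort.Strings(missing)
	return missing
}
```

The checker does not judge whether the description is good. It only refuses silence: every obligation on the checklist gets an answer or a stated reason why it does not apply.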

This is also where AI can help without pretending to own the product. If the code uses goroutines, the system can ask where cancellation, lifecycle, and error propagation are described. If code depends on map iteration or merge logic, it can ask whether determinism or commutativity matters. If code reads time directly, it can ask whether time is part of the behavior and how this is tested. If code changes a public API, it can ask where compatibility and documentation obligations are.

This isn’t AI judging architecture taste. It’s tooling surfacing missed questions.
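Here is a rough sketch of what "code shape raises a question" could look like, using Go's standard `go/parser` and `go/ast` packages. It is purely syntactic: it detects `go` statements and calls written as `time.Now()`, and it does not resolve types or imports. Detecting map iteration would need type information, so it is left out. The function name `Questions` and the question texts are my own illustration.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// Questions parses Go source and returns the review questions its shape raises.
// Syntactic only: a local variable named `time` would be a false positive.
func Questions(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, 0)
	if err != nil {
		return nil, err
	}

	found := map[string]bool{}
	ast.Inspect(file, func(n ast.Node) bool {
		switch node := n.(type) {
		case *ast.GoStmt:
			found["goroutine: where are cancellation, lifecycle, and error propagation described?"] = true
		case *ast.CallExpr:
			// Match calls of the form time.Now().
			if sel, ok := node.Fun.(*ast.SelectorExpr); ok {
				if pkg, ok := sel.X.(*ast.Ident); ok && pkg.Name == "time" && sel.Sel.Name == "Now" {
					found["time.Now: is time part of the behavior, and how is it tested?"] = true
				}
			}
		}
		return true
	})

	out := make([]string, 0, len(found))
	for q := range found {
		out = append(out, q)
	}
	sort.Strings(out)
	return out, nil
}
```

The output is a list of questions, not findings. Whether the goroutine needs a described lifecycle is still a human decision; the tool only makes sure somebody is asked.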

That is the “how can I know what I don’t know” loop. Spec obligations force code and test evidence. Code shape can reveal missing spec questions.

What regulated industries got right

Consumer engineering and regulated engineering live in different worlds. Different tools. Different conferences. Different language. Some of it is archaic. Some of it is bureaucracy. I don’t want every SaaS team to become an avionics certification team.

But we shouldn’t ignore what they learned.

I expected to find paperwork. Annoyingly, I found a lot of things our world forgot to learn.

In aviation, automotive, medical devices, space systems, the spec isn’t treated as a temporary note. It’s a source of truth that lives together with the software. Requirements have IDs. They have layers. They are linked to documentation, tests, implementation, verification evidence. You can see blast radius. You can see what a change affects. During review, if implementation differs from the spec, the spec must be updated.

The useful idea is not the paperwork.
The useful idea is that intent is durable, traceable, and connected to evidence.

NASA’s FRET is one example of this direction. It lets users enter hierarchical system requirements in structured natural language, gives those requirements unambiguous semantics, and can show them as natural language, formal logic, diagrams, and interactive simulation.

That doesn’t mean every product team needs FRET or formal methods everywhere. It means the requirement is not just a document. It’s something you can analyze, link, verify, and keep alive.

This is where requirement management becomes interesting again for consumer engineering. Not the old heavy version copied blindly from regulated industries. Not paperwork for paperwork. But the useful part: a source of truth, cross-links, invalidation, traceability, and evidence.

Combined with everything consumer engineering learned over the years: CI/CD, pull requests, fast feedback, developer experience, automated tests, observability, docs, release automation.

We should not throw away modern engineering.
We should add the missing trust layer.

From pull request to evidence pack

Today a pull request usually gives me code, maybe tests, maybe a description. But it doesn’t give me the whole chain.

It doesn’t tell me the original intent. It doesn’t tell me which obligations apply. It doesn’t show the blast radius. It doesn’t show which docs changed or should have changed. It doesn’t show which specs this conflicts with. It doesn’t show what changed during implementation compared to the plan.

So the reviewer has to reconstruct all of that.

Again, archaeology.

What I want instead is an evidence pack. Not enterprise theatre. Not documents for the sake of documents. A practical package that makes the change reviewable.

Here is the intent. Here are the requirements. Here are the obligations. Here are the tests that witness them. Here are the docs. Here is the blast radius. Here is how it aligns with existing specs and where we checked for conflicts. Here is what changed during implementation. Here is what still needs human judgment.

Then the pull request isn’t only code. It’s the full chain of development.
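As a data structure, the evidence pack could be as plain as the sketch below. Everything here is hypothetical: the field names follow the list above, and `Gaps` is one possible way to tell the reviewer what they would still have to reconstruct by hand.

```go
package main

import "strings"

// EvidencePack is a hypothetical shape for the package described above.
type EvidencePack struct {
	Intent             string
	Requirements       []string          // requirement IDs
	Obligations        []string          // obligation categories that apply
	Witnesses          map[string]string // requirement ID -> test that witnesses it
	DocsChanged        []string
	BlastRadius        []string
	DeviationsFromPlan []string
	NeedsHumanDecision []string
}

// Gaps lists what is still missing from the evidence chain.
// Order follows the Requirements slice, so the output is stable.
func (p EvidencePack) Gaps() []string {
	var gaps []string
	if strings.TrimSpace(p.Intent) == "" {
		gaps = append(gaps, "intent is missing")
	}
	if len(p.Requirements) == 0 {
		gaps = append(gaps, "no requirements listed")
	}
	for _, r := range p.Requirements {
		if p.Witnesses[r] == "" {
			gaps = append(gaps, "no test witnesses "+r)
		}
	}
	return gaps
}
```

An empty `Gaps()` result does not mean the change is right. It means the reviewer can spend their time on judgment instead of archaeology.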

This matters for open source. It matters for support engineers contributing fixes. It matters for other internal teams. It matters for AI agents. You don’t trust the contributor blindly. You don’t trust AI blindly. You trust the evidence chain, and then you still apply human judgment where judgment is needed.

This will feel slower at first. Writing obligations is slower than writing a vague ticket. Linking tests to requirements is slower than writing random tests. Updating docs through the graph is slower than pushing a change and hoping someone remembers. But not all friction is bad.

The question is whether the friction creates trust.

Bureaucracy gives you friction without trust.
Evidence gives you friction that lets more people move safely.

If this trust exists, then AI can actually help us scale. Not by dumping more pull requests into the same review bottleneck, but by making more changes reviewable, traceable, and safe to absorb in parallel.

Without trust, maintainers become managers of incoming things. Instead of thinking about architecture, future, and vision, they review an endless stream of pull requests, fixes, and generated artifacts.

That is not the scaling I want.

Why I am building Proof

This is why I am building Proof.

I don’t want another tool whose main purpose is to create more code. We already have many of those. The problem isn’t that we can’t produce enough artifacts. The problem is that the artifacts don’t preserve intent.

I want specs to stop being temporary. I want requirements to live with the software. I want obligations to force the boring questions before they become production bugs. I want code, tests, docs, and requirements to invalidate each other when they drift. I want a reviewer to see the evidence chain instead of rebuilding it from memory.

AI will make engineering faster. That part is already happening. But faster without trust is not enough.

For me, the real question is this: how can I end up in the position where it’s not just a pull request coming from someone from the outside, but a well-thought-out evidence pack that makes me believe I can merge it as soon as possible?

That is the scaling I care about.

Not just more code.

More trusted change.

I Had Near 100% Test Coverage. It Didn't Matter.

Leonid Bugaev — Wed, 29 Apr 2026 18:17:34 +0000

You cannot test for what you never described.

I woke up and saw a wall of emails in my personal account. Then I logged into my corporate Slack, and it was filled with Zendesk messages from customers. Everyone was looking for me.

The library I wrote, jsonparser, which is used by a lot of projects, had gotten its very own public CVE. So everyone started freaking out while looking at their scanners.

"So that's what fame is," was my first thought.

Then I remembered some notifications from the Google OSS-Fuzz project that I had kept ignoring. I had signed up for it several years ago.

This lib was written in the pre-AI-agents era (so weird to say that now!). Every piece was handcrafted, using best practices, with full test coverage. I checked the function that had the issue, and it literally had near 100% test coverage. But it did not matter, because the issue was in the handling of malformed input data: one of the edge cases that was missed. In other words, the issue was in the specification of what this function should do and how it should behave in edge cases.

But it opened one more can of worms. I wrote this library about 6 years ago. I don't remember anything. And my only source of truth is the code and the tests, which are rather cryptic, so reading them feels more like archaeology.

The issue is fixed now. But how do I prevent such issues from happening in the future? And if 100% code coverage is not the answer, what is? And what is my source of truth?

So I started digging. And it went way deeper than I expected, and changed the way I look at software engineering forever.

Down the rabbit hole

I started thinking about what the gold standard of software quality is. My first answer was NASA. How does NASA solve these kinds of issues?

AI now produces so much code that I feel like I am losing ownership of it. Not only of the code. Of the intent.

I wanted to understand how people work when tests passing is still not enough and the price of being wrong is huge.

The surprising thing is that a lot of NASA's work is public. Their software engineering requirements are public. FRET is public. Kind2 is public. A lot of the case studies are public. There are papers about aircraft, Mars rovers, superconducting magnets, and formal requirements that found bugs before code existed.

I started reading all of this not as an academic exercise, but because I had a very dumb practical problem: my tests were green, my coverage looked fine, and still one missed edge case was enough to create a public CVE.

Then I went deeper into automotive and aerospace. It opened a whole new world of software engineering for me. For some reason, our world of consumer software engineering and regulated software engineering in those industries almost do not intersect. Different tools, different conferences, different language. Sometimes it feels like they live in a parallel universe.

Some of it looks archaic. Some methodologies are weird.

Our engineering progressed a lot too. We got very good at moving fast and catching damage quickly. CI/CD, linters, tests, canaries, observability, rollbacks. I don't want to pretend every SaaS product should behave like avionics certification.

But we optimized for speed. They optimized for evidence.

Their industry spends much more time asking what evidence they need before they are allowed to trust the change. Some of it is painful. Some of it is bureaucracy. But the idea underneath is not stupid: if you claim the system should behave in some way, you need a durable chain from that statement to tests, code, and evidence.

There is real proof there, but it is not the fantasy version I had in my head, where every line of every product is mathematically proven end-to-end. They prove specifications. They use model checking. They simulate models, like with Simulink, against many input/output cases. They measure structural coverage. They use formal proof where the criticality justifies it.

And they still use testing, code review, static analysis, and all the normal engineering work around it. The difference is that proof and evidence are attached to the parts where being wrong is not acceptable.

That actually made the idea useful for normal engineering.

This is a huge topic, which I will cover in future articles. But the first concrete thing I found was MC/DC. It is one of the ways safety-critical industries look at coverage, and it made standard line coverage look very weak to me.

Line coverage says a line was touched at runtime. It does not say that the decision was tested.

Why 90% line coverage can still mean 60% real coverage

I still use line coverage. I still look at it.

But line coverage is bullshit. You should not trust it. Not on its own.

In Go, when you run:

go test -cover ./...

you mostly get statement coverage. The tool tells you whether a statement executed during the test run. That's useful. But it doesn't tell you whether the decision was tested.

Take a tiny parser-style example:

func isDigit(c byte) bool {
    return c >= '0' && c <= '9'
}

Now test it like this:

func TestIsDigit(t *testing.T) {
    if !isDigit('5') {
        t.Fatal("5 should be a digit")
    }
    if isDigit('x') {
        t.Fatal("x should not be a digit")
    }
}

Looks fine. The line ran. The function returned true once. The function returned false once. Your coverage report can look perfect.

But what did you actually prove?

You tested '5'. You tested 'x'. You didn't prove the lower boundary. You didn't prove that '/' fails because it's before '0'. You didn't prove that ':' fails because it's after '9'.

The line is covered. The boundary is not.

MC/DC stands for Modified Condition/Decision Coverage. It asks the question line coverage does not ask: did each condition independently affect the outcome?

When your code says if a && b, line coverage tells you the if was hit. MC/DC asks whether a alone can change the result, and whether b alone can change the result.
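To show what "alone can change the result" means mechanically, here is a simplified sketch in Go for a two-condition decision `a && b`. It uses the strict "unique-cause" reading: a condition counts as proven only if two executed cases differ in that condition alone and produce different outcomes. The `Vector` type and function name are hypothetical, and real tools handle short-circuit evaluation more carefully than this.

```go
package main

// Vector is one executed test case for the decision `a && b`:
// the value each condition took and the decision's outcome.
type Vector struct {
	A, B, Out bool
}

// IndependentlyShown reports, for conditions a and b, whether the vectors
// contain a pair that differs only in that condition and flips the outcome.
func IndependentlyShown(vs []Vector) (a, b bool) {
	for i := range vs {
		for j := i + 1; j < len(vs); j++ {
			x, y := vs[i], vs[j]
			if x.Out == y.Out {
				continue // same outcome proves nothing about independence
			}
			if x.A != y.A && x.B == y.B {
				a = true
			}
			if x.B != y.B && x.A == y.A {
				b = true
			}
		}
	}
	return a, b
}
```

With only the cases (true, true) and (true, false), the decision has been both true and false, yet `a` has never been shown to matter. Adding (false, true) closes the gap.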

For this line:

return c >= '0' && c <= '9'

there are two conditions:

c >= '0'
c <= '9'

A simplified MC/DC table looks like this:

Case	`c`	`c >= '0'`	`c <= '9'`	Result	What it proves
lower edge	`'0'`	true	true	true	lower edge accepted
below lower edge	`'/'`	false	true	false	lower bound independently blocks
above upper edge	`':'`	true	false	false	upper bound independently blocks
nominal digit	`'5'`	true	true	true	normal digit works

The table is just a way to say: these are the cases that matter. This is the part ordinary coverage does not force you to say.
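Written as a table-driven Go test, the same rows could look like the sketch below. The case names repeat the "what it proves" column. The `'9'` row is my addition and is not in the table above; I included it as an assumption that the upper edge deserves the same treatment as the lower one.

```go
package main

import "testing"

func isDigit(c byte) bool {
	return c >= '0' && c <= '9'
}

// boundaryCases mirrors the MC/DC table: each row names the claim it witnesses.
var boundaryCases = []struct {
	name string
	in   byte
	want bool
}{
	{"lower edge accepted", '0', true},
	{"lower bound independently blocks", '/', false},
	{"upper bound independently blocks", ':', false},
	{"nominal digit works", '5', true},
	{"upper edge accepted (extra row, not in the table)", '9', true},
}

func TestIsDigitBoundaries(t *testing.T) {
	for _, tc := range boundaryCases {
		t.Run(tc.name, func(t *testing.T) {
			if got := isDigit(tc.in); got != tc.want {
				t.Fatalf("isDigit(%q) = %v, want %v", tc.in, got, tc.want)
			}
		})
	}
}
```

Statement coverage for `isDigit` is the same as with the earlier two-case test. What changed is that each row now states which promise it protects.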

This used to be mostly a safety-critical tooling conversation. DO-178C requires MC/DC for the highest-criticality aviation software. The tooling was expensive, slow, and hard for normal teams to justify.

That changed. GCC 14 has -fcondition-coverage. Clang 18 has -fcoverage-mcdc. Rust is moving in the same direction with richer branch and condition coverage work, even if I would not call Rust MC/DC stable yet. Go does not have native MC/DC support, so I ended up adding code-level Go MC/DC measurement to Proof, and we have been extending the same direction to JavaScript and TypeScript as well.

What aerospace and automotive had because they were slow and diligent is now becoming available to normal engineering teams because AI changed the economics. You don't need a certification lab to ask a harder question about your tests. You also don't need to apply all of this to the whole company on day one. Start with the part where wrong behavior actually hurts.

The jsonparser numbers weren't subtle

After the CVE fix, I wanted to understand why my previous approach didn't make this kind of missing behavior obvious enough.

So I applied the MC/DC and requirements approach to jsonparser in a later public PR: buger/jsonparser#281.

Again: this PR didn't fix the original CVE. It was the follow-up work after the CVE fix. But it was not just a paperwork exercise. The hardening pass found and fixed more real issues and removed dead code that my previous process had not made obvious.

That was the uncomfortable part for me. I started by asking: what did my tests actually prove?

On the main branch before that work, ordinary Go statement coverage was already decent:

Metric	Before / main	After / PR
Standard Go statement coverage	85.3%	99.4%
MC/DC decisions	138/209 = 66.0%	203/203 = 100%
MC/DC conditions	175/253 = 69.2%	244/244 = 100%
MC/DC gaps	71 incomplete decisions, 78 missing condition proofs	0

85.3% coverage isn't bad. Most teams would see that and move on. But MC/DC told a different story: only 66.0% of decisions were fully covered, and only 69.2% of conditions were proven independently.

And the more interesting part: some functions already looked perfect by ordinary coverage.

Examples from the before state:

parseInt                     100% statement coverage
Unescape                     100% statement coverage
decodeSingleUnicodeEscape    100% statement coverage

But MC/DC still found missing independent-condition evidence:

bytes.go:21   parseInt missing proof for c < '0'
escape.go:148 Unescape missing proof for len(in) > 0
escape.go:47  decodeSingleUnicodeEscape missing proof for h1 == badHex
escape.go:47  decodeSingleUnicodeEscape missing proof for h2 == badHex
escape.go:47  decodeSingleUnicodeEscape missing proof for h3 == badHex

100% statement coverage can still leave a condition unproven.

The code ran. The decision wasn't tested.
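Here is a small sketch of how such a gap can appear. It assumes a hypothetical decision `c < '0' || c > '9'` and records, for each test input, the value of every condition and the outcome. A condition counts as proven only if two recorded inputs differ in that condition alone and produce different outcomes. The helper names are mine, and this is not how Proof or any compiler measures MC/DC.

```go
package main

import "fmt"

// vector records the value of each condition and the decision outcome
// for one test input.
type vector struct {
	conds   []bool
	outcome bool
}

// evalReject mirrors a hypothetical decision `c < '0' || c > '9'`.
func evalReject(c byte) vector {
	a, b := c < '0', c > '9'
	return vector{conds: []bool{a, b}, outcome: a || b}
}

// differsOnlyAt reports whether p and q differ at index i and nowhere else.
func differsOnlyAt(p, q []bool, i int) bool {
	for k := range p {
		if (p[k] != q[k]) != (k == i) {
			return false
		}
	}
	return true
}

// hasPair reports whether some pair of vectors shows condition i
// independently changing the outcome.
func hasPair(vs []vector, i int) bool {
	for x := range vs {
		for y := x + 1; y < len(vs); y++ {
			if vs[x].outcome != vs[y].outcome && differsOnlyAt(vs[x].conds, vs[y].conds, i) {
				return true
			}
		}
	}
	return false
}

// gapsFor returns the indexes of conditions left unproven by the given inputs.
func gapsFor(inputs string) []int {
	vs := make([]vector, 0, len(inputs))
	for k := 0; k < len(inputs); k++ {
		vs = append(vs, evalReject(inputs[k]))
	}
	var gaps []int
	for i := 0; i < 2; i++ {
		if !hasPair(vs, i) {
			gaps = append(gaps, i)
		}
	}
	return gaps
}

func main() {
	fmt.Println(gapsFor("5:"))  // both outcomes are exercised, yet condition 0 has no proof
	fmt.Println(gapsFor("5:/")) // adding '/' closes the gap
}
```

With only `'5'` and `':'` as inputs, both the accept path and the reject path run, so statement coverage looks complete, but nothing shows that `c < '0'` matters on its own.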

The bug was in what I forgot to describe

Coverage does not paint the whole picture. Even MC/DC. The bug can still be in the spec.

That is what happened with jsonparser. It was a classic case: you are building something, moving forward, and not looking back. You don't know what you don't know. I did not think about what would happen if this edge case appeared. I think most of us do not think about it this way.

I did not have any specs driving development or anything that forced me to think about the edge cases before writing the code. So of course I did not test for them. You cannot test for what you never described.

Testing assumes the specification is correct. That is the NASA/formal-methods lesson that changed how I think about this. The hard part is not testing the implementation. The hard part is questioning the specification itself.

This is where I found two different questions that I had been mashing together.

The first question starts from my specification: if this is what I claim the system should do, which logical cases need to be witnessed?

Not the code. The intent.

NASA built an open-source tool called FRET (Formal Requirements Elicitation Tool) that lets you write requirements in structured English and translates them into formal logic.

FRET includes an algorithm called FLIP (FuLl Independence Pair). FLIP takes a formalized requirement and generates the minimum set of test cases proving each boolean variable independently affects the outcome. Not every possible combination. Just the ones that matter.

I still have to write the requirement. I still have to decide what malformed input, boundaries, errors, and edge cases mean. FLIP does not do that for me.

But once the requirement is formalized, FLIP tells me exactly which test cases that requirement needs.

I built a tool called Proof that implements this approach.

That is the part I care about: how many tests are enough for this requirement?

Not "how many tests did I happen to write?" Enough for what I described.
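To make the spec-side question concrete, here is a rough sketch that starts from a requirement written as a boolean formula and searches for one independence pair per variable. It is a brute-force enumeration, not FLIP's actual algorithm, and it makes no attempt to minimize the total number of rows. The access rule is a made-up requirement.

```go
package main

import "fmt"

// requirement is a boolean formula over a fixed number of variables.
type requirement func(v []bool) bool

// accessRule is a hypothetical requirement:
// allow = authenticated AND (admin OR owner).
func accessRule(v []bool) bool {
	return v[0] && (v[1] || v[2])
}

// assignment expands a bit mask into a slice of n boolean values.
func assignment(mask, n int) []bool {
	v := make([]bool, n)
	for k := range v {
		v[k] = mask&(1<<k) != 0
	}
	return v
}

// independencePairs tries all 2^n assignments and, for each variable, keeps
// the first pair that differs only in that variable and flips the outcome.
// A variable that can never change the outcome gets no entry.
func independencePairs(req requirement, n int) map[int][2][]bool {
	pairs := make(map[int][2][]bool)
	for i := 0; i < n; i++ {
		for mask := 0; mask < 1<<n; mask++ {
			if mask&(1<<i) != 0 {
				continue // visit each pair once, from the side where variable i is false
			}
			lo := assignment(mask, n)
			hi := assignment(mask|1<<i, n)
			if req(lo) != req(hi) {
				pairs[i] = [2][]bool{lo, hi}
				break
			}
		}
	}
	return pairs
}

func main() {
	pairs := independencePairs(accessRule, 3)
	for i := 0; i < 3; i++ {
		if p, ok := pairs[i]; ok {
			fmt.Printf("variable %d: %v vs %v\n", i, p[0], p[1])
		}
	}
}
```

Each pair is a candidate for two test cases. A variable that ends up with no pair is a useful signal too: the requirement as written never depends on it.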

The second question starts from my actual code: did my tests exercise every boolean condition in the implementation so each one independently affects the outcome?

This side does not care what I meant. It looks at what I wrote.

And sometimes it shows that my code has many more logical cases than my spec. So maybe my spec is not accurate enough.

Or my spec says this edge case matters, but my tests don't witness it.

Or my tests cover implementation details, but the behavior is under-described.

I learned this the hard way on jsonparser. The spec side and the code side kept disagreeing in useful ways, and that is where code drift and spec drift become visible.

The gap goes in both directions. Sometimes the code is wrong. Sometimes the tests are weak. Sometimes the spec is too vague.

Sometimes all of it combined badly.

Checklists, not memory

What can be more deterministic than a checklist? In aerospace and automotive, everything has its own checklist. The price of a mistake is too high to rely on someone's memory. I think checklists are the driving force behind quality engineering in those industries.

When you do not have specifications, it is very hard to create a checklist. When you are building a feature, you can have test cases, but that is a moving target. The items are constantly changing. You need something that will be the same all the time.

You cannot rely on humans here. Even on me, to be frank. I can miss these items too. You need deterministic checklists.

In practice, the questions are very simple:

What will happen if this is malformed data? What will happen if this is slow and the request times out? What will happen if the database is down? What will happen if you have a very large object? What will happen if the function returns different values with the same inputs?

These are the cases where security issues and data bugs tend to live. For jsonparser, these are the exact cases I had not thought about.

Without obligations, edge cases depend on memory. Maybe I remember to test malformed data. Maybe the AI remembers. Maybe a reviewer notices. Maybe no one does.

At the moment, it comes down to whether someone remembers to test it or forgets.

This is where the CVE fix actually changed how I work. The fix itself was mechanical. But the obligations I wrote afterward forced me to think about the cases I had skipped. Every one of those became an explicit question I had to answer. Not "did someone remember to test this?" but "here is the list, and each item needs a witness."

Obligations turn edge cases from "someone remembered to test this" into a deterministic process.

When I first started writing obligations for jsonparser, it was actually quite easy with modern AI tooling. I reviewed all of the specs. The flow is: you cannot pass this check until the checklist is green, until you define obligations for all of those cases, and until you define test cases for all of those cases as well.
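A gate like that can be sketched in a few lines. The `Obligation` shape, the IDs, and the test names below are hypothetical and only illustrate the idea of "not green until every item has a witness"; they do not describe Proof's data model.

```go
package main

import "fmt"

// Obligation is one checklist item: a question that must be answered by at
// least one named test. The shape is hypothetical.
type Obligation struct {
	ID        string
	Question  string
	Witnesses []string // names of tests that witness this obligation
}

// Unwitnessed returns the IDs of obligations without a witness, in checklist order.
func Unwitnessed(list []Obligation) []string {
	var missing []string
	for _, o := range list {
		if len(o.Witnesses) == 0 {
			missing = append(missing, o.ID)
		}
	}
	return missing
}

// Gate returns an error until every obligation has at least one witness.
func Gate(list []Obligation) error {
	if missing := Unwitnessed(list); len(missing) > 0 {
		return fmt.Errorf("checklist not green: %d obligation(s) without a witness: %v", len(missing), missing)
	}
	return nil
}

// sampleChecklist uses the questions from the text with made-up IDs and test names.
func sampleChecklist() []Obligation {
	return []Obligation{
		{"OBL-1", "What happens on malformed data?", []string{"TestMalformedInput"}},
		{"OBL-2", "What happens if the request times out?", nil},
		{"OBL-3", "What happens if the database is down?", []string{"TestDatabaseDown"}},
		{"OBL-4", "What happens with a very large object?", nil},
	}
}

func main() {
	if err := Gate(sampleChecklist()); err != nil {
		fmt.Println(err)
	}
}
```

The list is static and the check is mechanical, so nothing depends on whether a person or a model remembers the item that day.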

This is what the double link looks like in practice:

// In the code — annotated with the requirement it implements:
// SYS-REQ-863
func (s *Service) lookupCache(req Request) (*Result, bool) {
    // ...
}

// In the test — annotated with both the requirement AND the specific MC/DC row:
// Verifies: SYS-REQ-863
// MCDC SYS-REQ-863: cache_lookup_requested=T, component_inputs_unchanged=F,
//                    cached_component_result_reused=F => TRUE
func TestMCDC_SYS_REQ_863_Row1(t *testing.T) {
    evalVerifyScenario(t, "SYS-REQ-863", map[string]bool{
        "cache_lookup_requested":         true,
        "component_inputs_unchanged":     false,
        "cached_component_result_reused": false,
    }, true)
}

Each test is not just "test the function." Each test is: "prove that this specific variable independently affects the outcome of this specific requirement."

If I change the spec, I can see exactly which MC/DC rows are affected and which tests need to be reviewed. If I change a test, I can see which spec requirement it was proving and check whether the spec still says the same thing. If I add a new variable to the requirement, FLIP will generate new witness rows, and the missing tests become immediately visible.

This is the double link. Change the spec, review the tests. Change the tests, review the spec. If you have not touched the spec, why would you touch the test?
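One way to picture the test-to-spec half of that link is a small scanner that reads `// Verifies:` annotations and maps each requirement ID to the test functions that claim it. This is a sketch under my own assumptions about the annotation layout (the annotation sits in the comment block directly above the test function); it is not how Proof implements traceability.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	verifiesRe = regexp.MustCompile(`^//\s*Verifies:\s*(\S+)`)
	testFuncRe = regexp.MustCompile(`^func (Test\w+)\(`)
)

// sampleSource is a stand-in for the contents of a _test.go file.
const sampleSource = `
// Verifies: SYS-REQ-863
// MCDC SYS-REQ-863: cache_lookup_requested=T, component_inputs_unchanged=F,
//                    cached_component_result_reused=F => TRUE
func TestMCDC_SYS_REQ_863_Row1(t *testing.T) {
}

// An unannotated test: it is linked to no requirement.
func TestHelper(t *testing.T) {
}
`

// TraceTests maps each requirement ID to the test functions annotated with it.
// An annotation applies to the test function that directly follows its comment block.
func TraceTests(src string) map[string][]string {
	links := make(map[string][]string)
	var pending []string
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if m := verifiesRe.FindStringSubmatch(line); m != nil {
			pending = append(pending, m[1])
		} else if m := testFuncRe.FindStringSubmatch(line); m != nil {
			for _, id := range pending {
				links[id] = append(links[id], m[1])
			}
			pending = nil
		} else if !strings.HasPrefix(line, "//") {
			pending = nil // any other code line ends the comment block
		}
	}
	return links
}

func main() {
	for id, tests := range TraceTests(sampleSource) {
		fmt.Println(id, "is witnessed by", tests)
	}
}
```

With a map like this, "which tests do I review if SYS-REQ-863 changes?" becomes a lookup instead of a search through the test suite.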

This is where the "how many tests are enough?" question changed for me. Before, the answer was always vibes. Write enough tests. Cover important paths. Don't overdo it. Be pragmatic.

All true, and also not very helpful.

Now I think about it differently. Enough tests means enough evidence that every condition I described, or every condition my code actually contains, can independently affect the behavior I care about.

It is not about how many tests I have. It is about whether I really, really trust my system and whether it actually does what I described.

The true challenge is legacy

You can always start a new project and have a really nice experience with all of this. But the true challenge lies in the big legacy projects. They make up something like 90% of all software. They bring in the majority of the money. And they are the ones where wrong behavior actually hurts.

I work with very complex software. At Tyk, we build API gateway software used by banks, governments, and other serious enterprise customers. I am a very sceptical person. I always want some proof. At the same time, I understand that software is always about compromises.

But the game is changing. What was not possible in the past is now possible for small teams in terms of quality and processes. The wind is changing with AI.

The true power happens when you can apply some of those approaches to legacy large enterprise codebases. If it works there, it will work everywhere.

I know how challenging it is. You cannot do it in one go. You cannot just make a switch and start using a new process.

This is not only about the technical part. It is also about the people part. Even at the size of Tyk, with around a hundred people, it is not about the implementation. It is about the processes and the people. The technical part is the easiest one.

In order to convince people that you can actually make it, you need to be able to do it in parts. Start small, then scale.

Can you take small parts, turn them into a repeatable process, and then start scaling? That is how it works in the majority of cases.

So I picked the policy engine. Authorization and gateway policy decisions are obviously critical. If the policy engine behaves incorrectly, you are not talking about a cosmetic bug.

I applied the same kind of thinking to the Tyk policy package in a public PR: TykTechnologies/tyk#7932.

Metric	Before / main	After / PR
Standard Go statement coverage	81.0%	99.2%
MC/DC decisions	74/115 = 64.3%	111/111 = 100%
MC/DC conditions	95/142 = 66.9%	137/137 = 100%
MC/DC gaps	41 incomplete decisions, 47 missing condition proofs	0

81% ordinary coverage. 64.3% MC/DC decision coverage. The normal coverage number says most statements ran. The MC/DC number says a lot of policy decisions still do not have independent evidence.

For a policy engine, the second number is the one I care about.

Code coverage is not about a metric

It is about trust.

What do we trust? In classical software engineering, we say: here is the code and here are the tests, and the tests are the source of truth. If you want to know how the system works, read the tests.

I do not believe that anymore. Not with AI writing code. Not with AI writing tests. Not with AI validating its own assumptions.

The source of truth cannot just be tests anymore. AI can write those too.

A passing test can prove that the code agrees with the test. It cannot prove that both agree with my intent.

So I moved the source of truth up. For me, it has to be the specification: the static description of what I expect the system to do.

Then code implements it. Tests witness it. Coverage measures evidence around it. Traceability keeps the chain from silently rotting.

I started this whole journey because of one CVE in a library I wrote six years ago. I ended up in a completely different place.

I thought the problem was in the code. It was in what I forgot to describe.

I thought coverage was the answer. It was the wrong question.

The first article was about losing intent. This one is about binding intent back to code.

Originally published on Substack: https://blog.reqproof.com/p/i-had-near-100-test-coverage-it-didnt

AI Made Implementation Faster. Verification Is Still the Bottleneck

Leonid Bugaev — Thu, 23 Apr 2026 15:35:12 +0000

AI made implementation dramatically faster.

Trust did not.

I live in two different worlds now.

In one, I build my own projects with AI and ship more software than ever. I have written more software in the last two years than across the rest of my career, and I have barely written any code manually in the last year.

In the other, I lead engineering for software used by banks, governments, and other regulated environments, where mistakes are expensive and confidence matters more than speed.

In both worlds, I keep hitting the same wall:

Implementation got dramatically faster. Trust did not.

That is the part I think the industry still keeps smoothing over.

Faster code generation is not faster engineering

The current AI coding conversation often assumes that if code generation speeds up, engineering speeds up too.

That is not what I see.

On my own projects, I can build much faster than before. AI helps me move quickly, clean things up, write tests, refactor, and push ideas further in less time.

But it also asks me to trust more.

I am not just delegating typing.

I am delegating thinking, validation, and judgment too.

And I am still not sure where the safe line is.

In enterprise software, the picture is different but the problem is the same.

AI absolutely helped us in some areas. It reduced noise. It reduced interruption-based work. It helped other teams answer questions about system behavior without constantly pulling senior engineers into ad hoc investigations.

That mattered.

People were less interrupted. Context switching got better. Engineers were happier.

But it did not suddenly make us ship features 2x faster.

Not even close.

Because implementation was never the whole job.

Verification is the bigger slice.

The verification gap

The phrase I keep coming back to is:

verification gap

By that I mean the distance between what I intend the software to do and what I can actually prove about its behavior.

Between intended behavior and demonstrated behavior.

That gap always existed.

AI did not invent it.

It amplified it.

Why AI makes this problem worse

When humans wrote the code, the same brain often held the intent, the implementation, and the validation loop together.

Not perfectly. People still shipped bugs. Specs were incomplete. Tests missed things.

But there was at least one place where the system could be understood as a whole: the person writing it.

That is no longer the default.

Now the human writes the prompt.

The model writes the code.

The model writes the tests.

The human skims the diff.

The model writes the cleanup.

The CI passes.

The feature ships.

And if the original intent was slightly wrong, incomplete, or misunderstood, that mistake does not stay in one place anymore.

It gets propagated through the whole stack.

The plan is based on the wrong assumption.
The implementation is based on the wrong assumption.
The tests are based on the wrong assumption.
The documentation often reflects the same wrong assumption.
The "manual validation" is often the same model being asked to sanity-check itself.

At that point, what exactly are we proving?

Often just that the system is internally consistent with the assumption it invented for itself.

Not that it matches our intent.

Bug-free is not the same as intent-correct

This is why I think a lot of AI productivity discourse still misses the real problem.

People say: just write better tests.

I do write tests.

AI writes tests for me too.

That is not the point.

Tests verify behavior for cases somebody thought of.

That somebody used to be a human.

Now it is often a human plus a model.

That is still not the same thing as verifying intent.

You can have 100% line coverage and still miss the thing that matters.

You can have a green CI run and still not know whether the software behaves the way you intended.

A green pipeline can still be a polished misunderstanding.

Bug-free is not the same as intent-correct.

Software is not flat. It is layers.

This gets worse as software gets bigger.

Software is not flat.

It is layers.

It is wide, deep, and full of interacting components, hidden assumptions, old decisions nobody remembers, backwards compatibility constraints, and behavior that only makes sense if you know four other subsystems.

Any project that lives long enough eventually reaches a point where one brain is no longer enough.

That was true before AI.

It is still true now.

AI does not remove that limit.

In some cases it makes you hit it faster, because you can generate change faster than you can understand its consequences.

A lot of our engineering process exists because of this:

CI/CD
QA
RFCs
architecture reviews
team boundaries
approval workflows

These are not random rituals.

They are patches over the same underlying problem: software complexity grows beyond what one brain can safely manage.

Where does intent actually live?

I think mainstream software engineering is still missing something fundamental.

We do not maintain a real source of truth for intent.

If I ask where the intended behavior of a system lives right now, the honest answer in most teams is:

all of it combined badly

Some of it is in source code.

Some of it is in tests.

Some of it is in RFCs.

Some of it is in Jira tickets.

Some of it is in Confluence.

Some of it is in the heads of senior engineers.

None of those is the place where I can go and see, clearly, how the system is supposed to behave right now.

That is not a source of truth.

That is archaeology.

And that feels like a major difference between mainstream software and more regulated domains like aerospace or automotive, where intended behavior is at least treated as a first-class artifact.

In mainstream software, especially in large, complex systems, we mostly reconstruct intent after the fact from scattered artifacts.

And then we act surprised when regressions keep happening.

So what is the actual bottleneck now?

If a feature can be implemented in hours instead of weeks, why have so many teams not seen the full payoff?

Because implementation was never the only bottleneck.

The harder part is deciding what should be built, making that intent explicit enough, and then verifying that the resulting system still matches it after the code, tests, and surrounding context have all changed.

That is where the time goes.

That is why I think AI did not remove the hard part of engineering.

It moved it from writing to verification.

If you want the next essays on this topic, subscribe on Substack: https://blog.reqproof.com/p/ai-writes-your-code-nobody-verifies