DEV Community: Paulo Victor Leite Lima Gomes

cognitive debt is the ai code smell nobody wants to measure

Paulo Victor Leite Lima Gomes — Thu, 21 May 2026 00:03:14 +0000

Thoughtworks called out "cognitive debt" in the latest Technology Radar, and I think that phrase is going to age annoyingly well.

Not because it is a cute new label. We have enough cute labels.

Because it names the thing many teams are quietly feeling with AI-assisted development: the codebase is growing faster than the team's understanding of it.

That is the uncomfortable part.

AI tools can make a team ship more code. Sometimes much more code. They can draft tests, fill in boilerplate, wire APIs, translate old modules, generate migration plans, and do the boring first pass that nobody wanted to do at 5 PM.

I use these tools. I like these tools.

But there is a version of AI-assisted development where the repo becomes full of technically working code that fewer humans can explain. That is not productivity. That is moving the bottleneck from typing to comprehension.

And comprehension is where software actually lives.

technical debt was never only about messy code

When people talk about technical debt, they usually point at visible ugliness: the weird helper function, the endpoint with seven flags, the service that needs three deploys to change one behavior.

That debt is real.

But the more expensive debt is cognitive: the gap between what the system does and what the team can reason about.

You feel it when a simple change requires three senior engineers in a meeting because nobody trusts the local code path. You feel it when everyone agrees a module is important, but nobody wants to touch it because "the last person who understood it left."

AI can create that situation faster.

Not because generated code is always bad. Sometimes it is clean, boring, and useful.

The risk is that AI can produce code at a speed that outruns the team's ability to build a mental model. You get more files, abstractions, edge cases, tests, integration points, and confidence-shaped text around all of it.

The repo looks healthier than the team feels.

That gap is cognitive debt.

the dangerous phrase is "it works"

"It works" is a useful sentence during a spike.

It is a dangerous sentence during a review.

When an AI assistant generates a change, the first temptation is to verify the outcome and move on. The test passes. The API returns the right shape. The deployment is green. Nice.

But software teams do not maintain outcomes in isolation. They maintain decisions.

Why is this cache invalidated here? Why does this retry happen before the transaction boundary? Why is this field optional in the API but required in the database? Why did we choose this migration path instead of the obvious one?

If nobody knows, the team has accepted code without accepting ownership of the reasoning.

AI-generated code can pass review while still weakening the system, because code review often checks correctness more than transfer of understanding. We look for bugs, style issues, security risks, and test coverage. Those matter. But we also need to ask whether the reviewer can explain the change after the tool is gone.

If the answer is no, the team did not really review it. They supervised it.

There is a difference.

explanations are not documentation

One easy answer is: "Ask the AI to explain the code."

Yes. Do that.

But do not confuse explanation with durable understanding.

Generated explanations are useful as a starting point. They are not a substitute for the team deciding what it believes.

The model can explain the code it just wrote in a way that sounds coherent. That does not mean the architecture is good, the tradeoff was intentional, or the explanation captures the actual production constraint.

The useful artifact is not "the AI explained it."

The useful artifact is a human-owned decision.

That can be a short PR note, an ADR, a test name that encodes the business rule, or a comment near a genuinely non-obvious boundary. It does not need to become a bureaucracy museum.

But some decision needs to survive.

Otherwise the codebase slowly fills with orphaned choices.

tests should encode intent, not just coverage

AI is pretty good at generating tests that increase the number.

That is not the same thing as increasing confidence.

The tests I want in an AI-heavy codebase are the ones that make intent harder to lose. A test called returns_400_for_invalid_input is fine, but not very rich. A test that says does_not_recalculate_settled_interest_after_statement_close carries a business rule.

That matters because cognitive debt often appears when the code still works locally but the meaning has drifted.

Generated tests can lock in implementation details nobody cares about and miss the weird domain rule everyone assumes is obvious until the new code violates it.

The job is to ask: what would a future maintainer need to know about this behavior to change it safely?

Then write that down as executable pressure.

review the prompt-shaped diff

One practical habit I like is reviewing AI-assisted code as a prompt-shaped diff.

Do not only ask "is this diff correct?"

Ask:

What instruction probably produced this shape?
Did the tool optimize for speed, symmetry, generic best practice, or actual system constraints?
Did it introduce an abstraction because the problem needed one, or because generated code loves tidy patterns?
Did it preserve the naming and architecture of the existing codebase?
Did it explain uncertainty, or only present the final answer?

This is not about being suspicious for fun. AI tools have a style: they smooth over awkward local history, generalize, create reasonable-looking helpers, and sound confident about code paths they have not lived with.

That can be useful.

It can also sand away the weird but important parts of your system.

Good reviewers protect those parts.

the career angle is obvious

The durable engineering skill in the AI era is preserving understanding while output increases.

That sounds less exciting than "10x developer with agents," but it is much closer to the job. Senior engineers know which constraints are real, which shortcuts are acceptable, and which beautiful refactor will make next quarter miserable.

AI makes that judgment more important, not less.

If your value was mostly producing boilerplate, the tools are coming for that. If your value is understanding systems deeply enough to change them safely, the tools can make you more powerful.

But only if you refuse to become a passive merge button.

The best engineers I know use AI like a fast junior engineer with no production memory. Helpful, occasionally brilliant, and not someone you let redefine the architecture alone.

what i would measure

If a team is serious about avoiding cognitive debt, I would not start with a giant policy.

I would start with a few boring signals:

How often do reviewers ask for rationale, not just code changes?
How many AI-assisted PRs include the tradeoff in the description?
How often do generated tests encode business rules instead of implementation details?
How many modules have only one person who can safely explain them?
How often does a team revert code because nobody understood the edge case?
How often does an incident reveal that a "small generated change" crossed a hidden boundary?

None of this is perfect. Measurement can get silly quickly. But the absence of measurement means the only thing you see is output: PR count, line count, ticket throughput, cycle time.

Those metrics can all improve while comprehension gets worse.

the practical rule

My rule is simple:

AI can write the first draft, but the team must own the final mental model.

That means generated code should come with enough human-shaped context to maintain it. Why this design. What invariant matters. What should not be "cleaned up" later. What test captures the intent. What rollback exists if the tool was confidently wrong.

This does not need to slow everything down. Often, it is a few PR sentences and one better test name.

But it changes the posture.

You are no longer asking, "Can the model produce working code?"

You are asking, "Can our team still understand the system after accepting this?"

That is the question worth keeping.

AI-assisted development is going to keep getting faster. The code will keep coming. The demos will keep looking magical. The pricing pages will keep promising more output per engineer.

Fine.

Just remember that the codebase is not the asset by itself. The asset is the codebase plus the team's ability to reason about it.

When those two drift apart, you are not moving faster.

You are borrowing understanding from the future.

And the future always sends an invoice.

references

The High Individual Contributor Is Becoming a New Organizational Unit

Paulo Victor Leite Lima Gomes — Wed, 20 May 2026 10:03:46 +0000

For a long time, the corporate ladder had a lazy binary: stay an individual contributor and eventually hit a ceiling, or move into management and get access to real leverage.

The interesting new archetype is not the manager who owns a large team. It is the High Individual Contributor, or HIC: a senior operator who stays close to execution but uses systems, automation, judgment, and organizational trust to create output that used to require an entire small team.

This is not the mythical "10x engineer" recycled with AI branding. The HIC compresses discovery, execution, review, and iteration into a tighter loop, then uses tooling to scale it without adding much coordination overhead.

That distinction matters. A very good employee produces more work. A HIC produces more organizational throughput.

The mechanics of HIC leverage

The first pillar is automation stacking. Modern knowledge work is full of repeatable fragments: research, synthesis, drafting, testing, reporting, triage, code generation, monitoring. AI and agentic tools do not remove judgment, but they let one person run more parallel workstreams. The HIC becomes less like a pair of hands and more like a control plane.

The second pillar is coordination collapse. Traditional teams spend enormous energy aligning people before work can happen: meetings, tickets, status updates, handoffs, reviews. Some of this is necessary. A lot is tax. A HIC with enough context and decision rights can skip entire layers of "let me check with X" latency.

The third pillar is systemic architectural impact. The best HICs build reusable systems: templates, scripts, APIs, dashboards, prompts, decision records, documentation, guardrails. Every solved problem becomes a primitive for the next one. This is why a HIC can look expensive on compensation and cheap on unit economics.

The fourth pillar is taste under uncertainty. Tools multiply direction. They do not choose direction well by default. The HIC's value is knowing which problems deserve automation, which decisions need human review, where quality matters, and where good enough is actually good enough.

Traditional headcount model vs. HIC model

Dimension	Traditional headcount model	HIC model
Scaling assumption	More output requires more people	More output can come from more leverage per trusted operator
Coordination cost	Grows quickly with team size	Stays lower when ownership is concentrated
Decision speed	Depends on meetings, managers, and alignment loops	Depends on context, autonomy, and clear boundaries
Cost profile	Salary plus management and coordination overhead	Higher individual cost, lower overhead per unit of output
Best fit	Stable, standardized, compliance-heavy work	Ambiguous, high-context, fast-moving knowledge work

Teams are not obsolete. Complex organizations still need managers, mentoring, redundancy, and specialization. But the default answer of "add more people" is becoming weaker. Sometimes the better answer is: give the right person more autonomy, better tools, clearer constraints, and fewer ceremonies.

Why leaders are paying attention

Business leaders care because organizational scale is no longer automatically impressive. A large headcount can indicate strength, but it can also indicate coordination debt.

Recent work on Fortune 500 companies argues against universal management recipes. Autonomy appears to help more in some sectors, especially technology, while control and standardization remain powerful in asset-heavy or compliance-heavy industries. The HIC model is not a universal org chart. It is a contingency bet: where work is ambiguous, digital, high-context, and tool-amplified, autonomy can become a performance multiplier.

The same research points to an uncomfortable pattern: high-efficiency companies in the studied sample were significantly smaller by workforce size than low-efficiency companies, while producing equivalent or superior output. More people is not the same thing as more performance.

The old instinct was to convert strong ICs into managers so their influence could scale through people. The new option is different: keep them close to the work, but let their influence scale through systems.

That changes career design. Compensation ladders need to stop treating management as the only serious path to wealth and authority. Performance systems need to measure reusable leverage, not visible busyness. Executives need to distinguish autonomy from neglect. A HIC needs a clear mission, strong boundaries, context, and authority to remove waste.

Many companies say they want high performers, but what they reward is managerial shape: headcount, political visibility, meeting presence, budget ownership. The HIC threatens that pattern because their leverage is quieter. They may simply make a strategic workflow move ten times faster.

My opinion: this is where a lot of modern organizations will split. Some will use AI to create more management theater around more generated work. The better ones will use AI to make small numbers of elite operators ridiculously effective.

The HIC is not anti-management. It is anti-waste. The future of organizational design is not just about building bigger pyramids. It is about asking how much leverage one deeply capable person can responsibly hold.

That is a healthier question than "how many people report to you?"

References

Watch out, your recruiter might be a scam

Paulo Victor Leite Lima Gomes — Wed, 20 May 2026 09:33:26 +0000

A recruiter shows up with a good-looking AI engineer role.

The company looks plausible enough. The position sounds real. The process moves like a normal interview process. Nothing too strange at the beginning.

Then comes the technical test.

"Clone this private repository and run it locally."

That was the moment the whole thing started to smell wrong.

The candidate refused to run it on the main machine, which was the correct decision. Later, the repository was inspected safely inside a virtual machine. And the result was not "bad code", or "a sloppy take-home assignment", or "some weird dependency problem".

It was malware.

Hidden inside the repository was a .vscode/tasks.json file that tried to auto-execute a curl | bash command when the project was opened. That command downloaded a dropper from a disguised domain. The dropper installed another script under ~/.vscode/, used nohup for persistence, and fetched a second stage payload.

That second stage was not subtle about its intentions. It went after crypto wallets, SSH keys, AWS credentials, browser sessions, environment variables, and opened a command-and-control channel.

And if the candidate had simply run npm start, the project would have sent the entire process.env to an external server and executed arbitrary JavaScript received in the response.

That is full remote code execution.

Not "oops, I installed a bad package".

Not "this repo looks suspicious".

Full compromise of a developer workstation.

this has a name

This is not an isolated scam.

It is part of a documented campaign known as Contagious Interview, attributed to North Korean state-backed operators connected to the Lazarus ecosystem. Different vendors track pieces of this activity under names like NICKEL ALLEY, Void Dokkaebi, and related DPRK clusters.

The pattern is brutally effective because it abuses something developers already consider normal: technical interviews.

A fake recruiter contacts a developer. The job is usually in crypto, Web3, AI, trading, or some adjacent high-salary niche. The conversation looks professional enough. The candidate gets a coding assignment. The assignment is hosted on GitHub, GitLab, or Bitbucket. The repository looks like a normal app.

Then the trap closes.

Sometimes the malware is hidden in an npm dependency. Sometimes it is in a package script. Sometimes it is in obfuscated JavaScript. Sometimes it is fetched from a cloud host like Vercel. And increasingly, it is wired into developer tooling itself.

That last part matters.

Microsoft has documented cases where the attack uses Visual Studio Code workflows. The victim opens the downloaded project in VS Code, gets the normal Workspace Trust prompt, and if they trust the folder, VS Code can execute the repository task configuration. Sophos documented the same abuse pattern through .vscode/tasks.json, with tasks configured to fetch malware using curl or wget depending on the victim's operating system.

That means the dangerous action is not always "run the app".

In some cases, the dangerous action is "open the repo in the editor and trust it".

That is a very different risk model from the one most developers carry in their heads.

why developers are the target

Attackers are not targeting developers because we are special.

They are targeting developers because our machines are ugly treasure chests.

A normal developer laptop may contain:

SSH private keys
GitHub and GitLab sessions
AWS credentials
GCP or Azure credentials
Kubernetes configs
.env files
production-like database URLs
CI/CD tokens
package registry tokens
browser cookies
password manager sessions
crypto wallet extensions
source code for private systems

That is not "one endpoint". That is a jumping point into companies, cloud accounts, deployment systems, source repositories, and sometimes customer data.

This is why the campaign is so focused on engineers, especially people working around crypto, trading, AI, infrastructure, and startup ecosystems. A recruiter can pressure a job seeker in a way a random phishing email cannot. The victim wants the job. The victim wants to look responsive. The victim does not want to be the annoying person who refuses to run the take-home project.

That social pressure is the exploit.

The malware is just the implementation detail.

the campaign is industrial now

The scale is the part that should make engineering leaders pay attention.

This is not one clever operator manually sending sketchy repositories on LinkedIn.

Socket has been tracking the Contagious Interview supply-chain activity across package ecosystems and reports hundreds of malicious npm packages tied to the campaign. In one 2025 wave alone, Socket counted 338 malicious npm packages with more than 50,000 downloads. By the later 2025 waves, public reporting had counted well over 535 malicious npm packages and more than 80,000 total downloads. The campaign tracker later showed activity continuing into 2026 across npm, PyPI, Cargo, Go, and Composer artifacts.

Dark Reading, citing Trend Micro research, reported that the campaign evolved into a worm-like supply-chain threat: compromised developer projects can carry malicious VS Code task configuration, which then spreads when other developers clone and trust the repository. Trend Micro also reported more than 750 infected repositories and more than 500 malicious VS Code task configurations in March 2026.

Sophos' NICKEL ALLEY research shows the same operator pattern from another angle: fake companies, fake LinkedIn credibility, GitHub accounts posing as legitimate software organizations, Vercel-hosted payloads, and repositories dressed up as Web3 or full-stack projects. In many observed cases, the repository is only the loader. The real payload lives off-platform on Vercel or attacker-controlled infrastructure, which makes takedowns and static repository review less reliable.

Microsoft's write-up names payload families such as OtterCookie, BeaverTail, Invisible Ferret, and FlexibleFerret. The important point is not the malware branding. The important point is what they do: exfiltrate secrets, profile the machine, monitor clipboard contents, take screenshots, fetch more modules, and execute attacker-supplied commands.

This is an interview pipeline turned into malware delivery infrastructure.

And it has been active for years. Microsoft traces related Contagious Interview activity back to at least late 2022. Other reporting describes heavy abuse from 2023 onward. It is still active in 2026.

what gets stolen

If this lands on your real machine, assume the attacker is not just stealing the toy project.

Assume they are going after:

crypto wallets and wallet browser extensions
seed phrases and private keys
SSH private keys
AWS credentials
cloud provider configs
GitHub, GitLab, and Bitbucket sessions
npm and package registry tokens
browser session cookies
.env files
clipboard contents
screenshots
keystrokes
source code
CI/CD access

Also assume they may keep access.

Some payloads are remote access trojans. Some open command-and-control channels. Some install persistence. Some fetch new code after the first stage has already passed the easiest checks.

That is why "I opened it but did not run anything" is no longer a good enough safety line. If an editor task fired, if a dependency installed, if a package script ran, or if some obfuscated loader fetched a second stage, the damage may already be done.

the red flags

Here is the checklist I would keep close if I were interviewing right now.

🚩 They ask you to clone a private or unfamiliar repository and run it locally.

🚩 The company exists only on LinkedIn, with no convincing web presence, customer footprint, team page, funding history, or normal public trail.

🚩 The recruiter found you, but has no mutual connections, a thin profile, or a recently created account.

🚩 The technical test is a whole project to download instead of a sandboxed exercise.

🚩 They are in a hurry: "just run it, it is quick", "we need this today", "screen share while you install it".

🚩 The repo contains .vscode/tasks.json, suspicious workspace settings, postinstall or preinstall scripts, obfuscated dependencies, minified blobs, encoded URLs, or network calls during startup.

🚩 They insist you use your own machine instead of a sandbox, VM, browser IDE, or company-provided environment.

🚩 The role is crypto, Web3, trading, AI, or developer tooling, especially if the compensation is strangely generous for the amount of process involved.

And the counter-check:

✅ Legitimate technical interviews usually use CoderPad, HackerRank, LeetCode, Replit shared sessions, GitHub Codespaces, or a constrained environment.

✅ Legitimate companies can verify the recruiter, company domain, interview process, and hiring manager.

✅ Legitimate recruiters do not need you to run proprietary mystery code on your personal machine.

✅ Legitimate engineering teams should not get offended when a developer asks for a safer way to inspect a take-home assignment.

If they do get offended, that is useful signal.

what to do instead

First defence: scan the repository before opening it.

Use scanrepo.dev. It checks repositories for hidden scripts, suspicious dependency behavior, malicious patterns, and known attack signatures. This is exactly the kind of lightweight defensive step that should become normal for recruiter-provided code.

Then slow down.

Do not open an untrusted repository directly in your main VS Code profile. Do not grant Workspace Trust to a folder just because the prompt appears during an interview. That prompt is not decoration. It changes what the project is allowed to do.

If you must inspect the code locally, use a disposable VM or a container with no access to your home directory, SSH keys, cloud credentials, browser profiles, password manager, or clipboard history.

Before running anything:

inspect .vscode/tasks.json
inspect .vscode/settings.json
inspect package.json
search for postinstall, preinstall, prepare, and install scripts
search for curl, wget, bash, sh, powershell, eval, Function, child_process, exec, spawn, axios, fetch, and suspicious encoded strings
check for hidden files and unusual directories
check commit history
check contributor profiles
check whether dependencies are new, typosquatted, or oddly unpopular
check whether the app calls external servers on startup

For VS Code specifically, review Workspace Trust settings under:

Settings -> Security -> Workspace Trust

If you regularly inspect unknown code, create a separate editor profile for hostile repositories. Better yet, inspect them in a VM and throw the VM away afterwards.

For npm projects, never start with npm install on your real machine. Package lifecycle scripts are code execution. A malicious postinstall is not less dangerous because it is in package.json.

For AI and crypto roles, be even more suspicious. That is where the attacker return is highest.

what companies should change

This is not only an individual developer problem.

Companies should stop normalizing "clone this repo and run it locally" as an interview pattern.

If your hiring process requires candidates to run code, provide a sandbox. Use a browser IDE, a Codespaces-style environment, a prepared VM, or a constrained runner. Make it boring. Make it auditable. Make it safe.

And if you are hiring engineers, say explicitly that candidates are not expected to run unknown code on their personal machines. That one sentence would remove a lot of social pressure from the candidate.

Security teams should also treat recruitment as an attack surface. Developers interviewing elsewhere may still be using corporate machines. Even if that violates policy, it happens. If your protection strategy assumes engineers never mix personal hiring activity with work devices, your strategy is based on a fantasy.

Monitor for editor-launched shells. Watch for VS Code or Cursor spawning bash, curl, wget, PowerShell, Python, or Node processes that immediately talk to low-reputation domains. Look for package installs that launch network activity unrelated to normal registries. Hunt for broad filesystem enumeration around .env, wallet, key, password, seed, and credential patterns.

This campaign is successful because it lives in the gap between "security training" and "how developers actually work".

Close the gap.

trust your instinct

The candidate in the story did the most important thing right: they paused.

That is the real lesson.

Not every weird recruiter is Lazarus. Not every bad take-home assignment is malware. Not every small company with a thin LinkedIn page is fake.

But a stranger asking you to clone a private repo and run it locally is not a harmless request anymore.

It is an executable trust decision.

Treat it like one.

If it feels off, stop. Scan it. Open it in a disposable environment. Ask for a sandbox. Ask them to verify the company and process. Ask why the test cannot run in a browser environment.

A real company will survive those questions.

A fake recruiter will get impatient.

That impatience may be the warning that saves your machine.

references

context management is the new debugging skill

Paulo Victor Leite Lima Gomes — Tue, 19 May 2026 00:01:25 +0000

The O'Reilly Radar published a piece last week called "Why Doesn't Anyone Teach Developers About Context Management?" and the title made me laugh for a very specific reason.

The answer is: because we barely teach developers about debugging, and context management is debugging's smarter sibling.

I spent years learning to debug by sitting next to people who were better at it. I watched them narrow down problems by asking the right questions, excluding irrelevant information, building mental models of the system, and knowing exactly what to ignore. Nobody handed them a syllabus. They picked it up the hard way.

Now we have LLMs in every editor, and the skill that separates useful outputs from frustrating garbage is the same one that separated good debuggers from bad ones: knowing what to include, what to exclude, and how to ask.

Martin Fowler published a piece around the same time on the Interrogatory LLM, and that framing — interrogatory, not conversational — is exactly right. You are not chatting. You are interviewing an incredibly knowledgeable and completely uncritical research assistant.

the skill nobody teaches

Think about how debugging works.

A junior developer sees a bug. They look at the error. They look at the code that produced the error. They try something. It does not work. They try something else. They spiral.

A senior developer sees a bug. They isolate assumptions. They ask what changed recently. They check input boundaries. They look at dependencies. They confirm the error is real versus hallucinated by a monitoring tool. They build a theory, test it, invalidate it, build a new one. They prune branches of possibility until only the correct one remains.

The difference is not tools. It is context management. The ability to hold a model of the system, exclude irrelevant facts, include the right ones, and update the model as new evidence arrives.

Now apply this same skill to working with an LLM.

A junior AI user pastes a code file and asks "fix the bug." The model has 30,000 tokens of irrelevant imports, dead code, and comments about a different feature. The bug is in line 142. The model fixates on the more visible problem in line 300.

A senior AI user pastes only the relevant function, includes the error message, mentions what has already been tried, specifies the language and framework version, and asks a focused question about a specific behavior. The model gives them exactly what they need on the first try.

Same model. Same intelligence. Radically different outcome because one person managed context better.

what context management actually means for AI work

I think there are four distinct skills here, and I am only starting to get good at three of them.

Inclusion. What to put in. This is the most obvious one. You need relevant code, relevant error messages, relevant data, relevant constraints. The mistake most people make is including too much, not too little. A whole file is worse than two functions and an error trace. A whole conversation history is worse than "based on the last three messages, here is what we established."

Exclusion. What to leave out. This is harder because it requires judgment. You need to know that the import block does not matter, the test framework version does not matter, the unrelated feature does not matter. Every token you include dilutes the signal. Models have finite context windows, but even with infinite windows, more noise means worse outputs. The model cannot prioritize what you do not explicitly signal.

Ordering. What goes first. The beginning of the prompt carries disproportionate weight in most models. The system prompt matters more than the trailing instruction. Critical context should not be buried in paragraph four. The same way a debugger starts with "what changed" rather than "what is the entire history of this system," an effective prompt starts with the most relevant constraint.

Updating. When to replace context instead of appending to it. This is the most advanced skill and the one I see most people miss. After a few rounds of conversation, the accumulated context includes dead ends, discarded theories, and resolved misunderstandings. Keeping that in the window makes the model cling to old frames. You should be comfortable saying "forget the last three exchanges, here is what we know now." This is retracing in debugging. Start fresh when the mental model changes.

the interrogatory approach

Fowler's interrogatory LLM concept is a specific technique within this broader skill.

The idea is simple. Instead of asking the model to solve a problem, you ask it to help you reason about the problem. You question it like a witness, not like an oracle.

"Given this error, what are three possible causes ranked by likelihood?"
"What additional information would help narrow the cause?"
"Is there anything in this code that looks unusual to you?"
"What assumptions am I making that might be wrong?"
"Can you find a case where this function would produce the wrong result?"

Each question reduces uncertainty without committing to a single path. This is exactly how I debug with a colleague. I do not ask them to fix it. I ask them to think with me.

The model is not solving the problem. It is helping me solve the problem more efficiently.

That distinction is everything.

why this matters more than prompt engineering

There is a whole industry around prompt engineering. Templates, frameworks, best practices, structured outputs, chain-of-thought recipes.

I think most of it is noise, or at least premature optimization.

The real leverage is not in the template. It is in the judgment about what goes into the template. If you know how to manage context — what to include, exclude, order, and update — then a simple "explain this error and suggest a fix" will outperform a beautifully structured prompt full of irrelevant detail.

Prompt engineering is syntax. Context management is semantics.

You can learn a lot of syntax in a weekend. Semantics takes deliberate practice, feedback loops, and the willingness to treat each interaction as an experiment, not an answer retrieval.

the debugging parallel

The reason I keep coming back to debugging is that it is already the skill developers are supposed to have.

We just never called it context management.

When you isolate a bug to a function, you are excluding context. When you reproduce it in isolation, you are building a controlled environment. When you check version history, you are including the right temporal context. When you close four irrelevant browser tabs and focus on one terminal window, you are managing attention — which is context for your human brain.

LLMs make this skill more visible because they force explicit articulation. You cannot just think "the bug is probably in the caching layer." You have to tell the model why you think that, what caching looks like, what you already checked, and what outcome you expect.

The act of articulating context for the model forces you to articulate context for yourself.

That is why pairing with an LLM can make you a better debugger, even when the model is wrong. The exercise of framing the question honestly is exercise for your own reasoning.

what i would practice

If I wanted to get better at context management as an AI-assisted developer, I would do three things.

First, rebuild from scratch for every new problem. Resist the temptation to keep a conversation thread going for days. Start fresh. Copy only the relevant code. Write a new system prompt. The accumulated context from yesterday's debugging session is noise for today's feature work.

Second, write the system prompt first. Before you paste any code or ask any question, write one or two sentences that define what you want and what constraints exist. "You are a Go backend engineer reviewing a PR. Focus on error handling, concurrency safety, and API consistency. Ignore formatting and naming." This sets the model's attention before it sees any code. It works.

Third, treat context as a resource you are allocating. You have a limited window. Every line you paste costs space. Every irrelevant detail costs signal. Ask yourself: if I had to explain this problem in three sentences to a smart engineer who knows nothing about my project, what would I say? Then say that.

the real skill is judgment under uncertainty

The deeper truth is that context management is not an AI skill. It is a systems thinking skill that AI makes more important.

Every complex system forces you to make decisions with incomplete information. Debugging. Incident response. Architecture design. Code review. Onboarding. Migration planning. All of them require you to figure out what matters, what does not, and how to update your understanding as new evidence arrives.

LLMs just make the cost of getting this wrong more visible and more immediate.

A bad context decision with a colleague might waste an hour of pair programming. A bad context decision with an LLM might waste three attempts, five minutes each, and leave you frustrated with "the model" when the problem was actually the prompt content.

The model is fine. The context was not.

the punchline

Context management is the new debugging skill because debugging was always context management. We just did not call it that.

The LLM is a mirror for your own reasoning. If you cannot articulate the problem clearly for the model, you probably do not understand it clearly yourself.

O'Reilly is right that nobody teaches this. But the good news is that you can practice it every time you open a chat window.

Treat each interaction as a deliberate exercise in inclusion, exclusion, ordering, and updating.

Your model outputs will get better. Your debugging will get faster. And somewhere along the way, you will realize that the skill you were practicing was never about talking to AI.

It was about thinking clearly under uncertainty.

And that skill is not going anywhere.

Stop Measuring AI Agents by How Much Code They Produce

Paulo Victor Leite Lima Gomes — Mon, 18 May 2026 00:01:17 +0000

The CNCF published a post last week about KubeStellar reaching an 81% PR acceptance rate for contributions made by AI agents. On the same day, GitHub exposed team-level Copilot usage metrics through their API.

Two signals. Same problem. Different directions.

KubeStellar's team framed the number as a success. An 81% acceptance rate is genuinely impressive for agent-written code. But reading the post, I kept asking myself: what happens when the manager dashboard shows that Team A used Copilot for 5,000 suggestions this week and Team B used it for 500?

Gage says: more is better, obviously. Agents are producing. Velocity is up.

Paul says: wait, what are we actually measuring here?

the problem with counting output

Measuring AI agents by how much code they produce is the same mistake as measuring developers by lines of code written.

We know this. We have known this for decades. LOC is a garbage metric because it incentivizes verbosity, punishes cleanup, and says nothing about correctness, maintainability, or whether the code should exist at all.

But we are about to recreate this exact mistake at organizational scale, except this time the producer is not human and the latency between "produce" and "review" is measured in seconds instead of days.

The reason is obvious: code output from agents is easy to count. PRs merged, lines changed, suggestions accepted, prompts submitted, Copilot API calls made. These numbers come out of dashboards automatically. They look impressive. They trend upward.

But they measure activity, not value.

And activity is very cheap when the producer runs on tokens.

what kubestellar's 81% actually tells us

KubeStellar is an open source K8s multi-cluster management project. Their experiment had agents contributing to an existing codebase with established patterns, maintainer expectations, and a review process.

An 81% acceptance rate in that context means the agents learned the project's conventions. They produced code that passed CI, followed the existing patterns, and addressed real issues. That is not trivial. It suggests that with good context — clear issues, well-structured codebases, explicit contribution guidelines — agents can be genuinely helpful contributors.

What it does not mean is that 81% is a universal benchmark, or that the other 19% does not matter.

The 19% that gets rejected or reverted is where the interesting signal lives. Was it functionally wrong? Was it technically correct but stylistically off? Did it introduce subtle bugs that only showed up in production? Did it pass review but get reverted a week later?

These are not edge cases. They are the entire reason code review exists.

I want to know the rollback rate. I want to know the bug rate per agent-contributed PR. I want to know the maintenance burden — how many of those agent-contributed functions will need refactoring in six months because they solved today's problem without considering tomorrow's change.

Those metrics do not appear in a Copilot dashboard. They require deeper instrumentation, ownership tracking, and time.

what the github metrics API will do to your org

GitHub's new team-level Copilot usage metrics are useful. They let you see which teams are adopting AI tools, how many suggestions are being accepted, and where adoption is lagging.

They will also be weaponized within about three weeks of rollout.

Here is the scenario I have already seen play out at several companies:

Engineering leadership rolls out the dashboard. Team A has high adoption numbers. Team B has lower numbers. A well-intentioned VP asks why Team B is not using AI more. Team B starts optimizing for the metric. They accept more suggestions, merge faster, prompt more aggressively. The dashboard looks better. The code quality degrades, slowly and invisibly, because nobody is measuring maintenance cost per agent-contributed PR.

The dashboard metric becomes the goal. The goal becomes the degradation vector.

I am not anti-metrics. I am anti-metrics-that-look-good-but-measure-the-wrong-thing, because those are the most dangerous kind. They give you confidence in the wrong direction.

acceptance rate is better, but not complete

If you must measure agent productivity with one number, acceptance rate is at least better than raw suggestion count.

KubeStellar's approach — measuring what actually gets merged — accounts for the fact that not all output is valuable. It puts the emphasis on the review outcome, not the generation volume.

But acceptance rate has blind spots too.

A high acceptance rate can mean the agent is producing great code. It can also mean the reviewers are not reviewing carefully. Or that the PRs are so small they barely merit a review. Or that the codebase conventions are so loose that everything looks acceptable. Or that the team has normalized agent output and stopped treating it critically.

I have seen the "reviewer fatigue" pattern in several orgs already. When every PR is agent-generated, developers stop reading diffs carefully. The acceptance rate stays high because nobody is looking.

If you measure acceptance rate without also measuring review depth, you are measuring the review process, not the output quality.

what i would measure instead

If I were building a dashboard for AI-assisted development, I would track four things:

1. Acceptance rate with a review depth qualifier. Not just "was the PR merged?" but "how many rounds of review did it take? How many comments? How many changes requested?" A PR that gets accepted on the first try with zero comments might be perfect, or it might be unreviewed. Distinguish these.

2. Rollback rate within 30 days. This is the honest metric. Code that gets deployed and rolled back within a month is code that created production cost, regardless of how clean the PR looked. If agent-contributed PRs have a higher 30-day rollback rate than human-contributed ones, you have a review quality problem, not a generation problem.

3. Maintenance cost attribution. When a bug gets fixed or a feature gets refactored, who wrote the original code that had to be changed? If agent-contributed code accounts for a disproportionate share of follow-up work, that is a signal that the agent is producing surface-correct but structurally fragile code.

4. Context reuse rate. This one is speculative but I think it matters. How often does an agent reuse context from a previous PR — issue links, pattern choices, architecture decisions — versus starting fresh? Reuse suggests learning. Fresh starts on every PR suggest the agent is solving each problem in isolation, which is how you accumulate weird inconsistencies across the codebase.

the deeper problem is not measurement, it's accountability

I think the measurement conversation is really an accountability conversation.

When a human writes code and it breaks, there is a clear chain: author, reviewer, approver, deployer. The review was a human reading another human's work, and both felt the weight of that review because both names are on the commit.

When an agent writes code and it breaks, the chain gets fuzzy. Was the reviewer supposed to catch the subtle correctness issue? Was the agent's context insufficient? Was the prompt ambiguous? Was the issue description bad? The agent has no name. There is no accountability feedback loop.

Anthropic recently published a postmortem tracing Claude Code quality complaints to overlapping product changes and context management issues. That is honest, and it is the kind of transparency we need more of. But it also illustrates the structural problem: when code quality degrades because of agent behavior, the fix is not "tell the agent to do better." The fix involves better prompts, better context, better review processes, better tooling — and all of that requires human investment.

The agent does not learn from its mistakes unless someone builds that feedback loop.

the punchline

Stop measuring AI agents by how much code they produce. Start measuring by how much value survives the commute to production.

KubeStellar's 81% acceptance rate is an interesting data point. But the number that matters more is the one nobody is tracking yet: how much of that accepted code is still in production, still clean, still maintainable, and still correct six months from now.

That number is hard to measure. It requires naming, ownership, review depth, rollback tracking, maintenance attribution, and time.

But that is the number that will tell you whether your AI-assisted engineering investment is working, or whether you just optimized a dashboard that hides the real cost.

Acceptance rate is a starting point. It is not the finish line.

mcp catalogs are becoming the new internal developer portal

Paulo Victor Leite Lima Gomes — Sun, 17 May 2026 00:01:34 +0000

Docker published a post on Friday about Custom MCP Catalogs and Profiles being generally available, and my first thought was not about Docker at all.

My first thought was: this is Backstage, but for agents.

Not literally, obviously. No software catalog, no service scorecards, no plugin marketplace with the same energy as an abandoned open source project. But structurally? The same architectural pattern is appearing.

Internal developer portals gave humans a curated view of tools, services, permissions, and infrastructure. MCP catalogs give agents the same thing. A curated, governed, versioned collection of capabilities that an entity — whether a human or an agent — can discover and use.

And like the internal developer portal wave, nobody is going to plan for this until they desperately need it.

what docker actually announced

Let me summarize the feature quickly, because the angle is more interesting than the feature itself.

Docker now lets you create Custom MCP Catalogs — curated collections of MCP servers that your organization approves. You bundle internal tools alongside public ones, push the catalog as an OCI artifact to a container registry, and share it with your team.

Then there are Profiles — named groupings of MCP servers that developers can switch between. A "coding" profile with Playwright and GitHub servers. A "planning" profile with Notion, Atlassian, and Markitdown. Profiles can be shared as OCI artifacts too, so teams can standardize on setups that work.

From Docker's perspective, these features solve a real problem: MCP adoption is growing fast, and teams need to standardize what's trusted without constraining individual workflows.

That is true. But the interesting part is where this pattern is going.

the internal developer portal parallel

If you have been around platform engineering for a few years, this should feel familiar.

Backstage, Port, Cortex, and the rest of the internal developer portal space all solve the same fundamental problem: organizations have too many tools, services, and infrastructure surfaces for humans to discover and navigate on their own. Someone needs to curate the catalog, define the golden paths, set the permissions, and make sure the developer does not need to know everything to ship.

MCP catalogs solve the same problem for the same reason, but the consumer is different.

Instead of a human browsing a service catalog to find which team owns the payments API, an agent browses an MCP catalog to discover which tools it can use to investigate a payment incident. Instead of a human checking a scorecard to see if a service meets deployment standards, an agent checks a profile to see which tools are appropriate for production operations versus development experimentation.

The consumer changes. The architecture does not.

why this matters more than it sounds

The boring detail is what makes this important.

Docker is packaging MCP catalogs as OCI artifacts. Push them to a registry, pull them into your agent runtime, version them, sign them, control access the same way you control container images.

This is exactly how infrastructure tooling should work.

Instead of every developer configuring MCP connections in JSON files, platform teams ship a catalog. Instead of every agent independently discovering tools on the open internet, the catalog defines what is available. Instead of security teams trying to audit hundreds of individual MCP server configurations, they review one catalog artifact.

The same pattern that made container image registries the center of deployment infrastructure is now making them the center of agent tooling infrastructure.

No new infrastructure to build. Same distribution mechanism, different content.

profiles are permission boundaries in disguise

The Profile concept is the part that keeps pulling me back in.

Docker presents Profiles as a way to organize workflows — coding versus planning versus research. That is a perfectly fine starting point. But Profiles are also permission boundaries.

If you define a profile for SRE work, it should include incident investigation tools. If you define a profile for application development, it should include code and test tools. If you define a profile for CI automation, it should include deployment and monitoring tools. Each profile implicitly defines what an agent operating in that context is allowed to do.

The tool catalog becomes the access control surface.

This is not a stretch. Docker's own roadmap mentions governance and policy controls for restricting MCP usage to approved catalogs, and Docker AI Governance, announced last week, adds centralized control over network access, credentials, and tool permissions.

The direction is clear: the catalog is where governance happens.

the platform team job is changing again

I keep coming back to the same observation across the last several posts here.

When agents use tools, the platform team's job is no longer only "define how humans deploy to production." It is also "define what agents can discover, use, and automate."

MCP catalogs give platform teams a concrete mechanism for that second job.

Which MCP servers are trusted?
Which tools in each server are safe to expose?
Which profiles should exist for different roles?
Who can publish a catalog?
How are catalogs versioned and updated?
What happens when an internal API changes and the MCP server breaks?
How do we audit which tools agents actually used?

These are platform engineering questions with an AI accent.

If you already run an internal developer portal, you should be thinking about whether it should serve agents too. Maybe agents authenticate to the same catalog API. Maybe they read service definitions, deployment metadata, runbook links, and ownership information through MCP instead of a human UI.

If you do not run an internal developer portal, MCP catalogs might be the first agent-facing platform your company builds. It will feel familiar to anyone who has managed a package registry or a container registry. The questions are the same, the distribution is familiar. Only the consumer has changed.

the catalog becomes the governance surface

The critical shift is this: once agents discover tools through a catalog, the catalog is no longer a convenience feature. It is the access control system.

If a malicious MCP server gets added to a catalog, every agent using that catalog gains a new capability they were not designed to have. If a catalog contains a misconfigured server with broad permissions, the agent inherits those permissions. If a catalog is not updated when an internal API changes, agents start failing silently.

This is the same set of concerns that made package registries require signing, scanning, access control, and audit. MCP catalogs will go through the same maturation, but faster, because the blast radius is larger.

An agent with a bad npm package can fail a build. An agent with a bad MCP server can call production APIs.

what i would do tomorrow

If I were running a platform team today, I would not wait for the ecosystem to mature before engaging with this.

I would start by defining what agents should be allowed to do in my organization, then work backward to the catalog.

What MCP tools support those actions? Which tools are safe to expose broadly? Which need role-based or profile-based restrictions? How do I audit usage? How do I respond when a tool is deprecated or compromised?

I would publish a small custom catalog with the most boring, most useful servers first. GitHub, Notion, a read-only internal docs server, a deployment status check. Let the team use it, observe what happens, and iterate.

The teams that win here will be the ones that treat agent tool catalogs the same way they treated container image registries, package managers, and internal developer portals: as curated infrastructure that requires maintenance, governance, and iteration — not a one-time publish-and-forget.

the punchline

MCP catalogs are not a Docker feature. They are an architectural pattern.

The same forces that created internal developer portals for humans are creating agent-facing catalogs for AI. We have the same problems — discovery, curation, governance, permission, audit — with the same solutions. Just different consumers, faster feedback loops, and higher blast radius.

Docker's Custom Catalogs and Profiles are an early concrete example. But every MCP ecosystem player is heading in the same direction. The CNCF ecosystem is pushing AI gateway custom transformations. GitHub is shipping enterprise-managed MCP plugins. GKE has inference gateways with policy surfaces.

The catalog is becoming the governance plane for agent actions.

If you are building an internal developer portal today, you should ask whether it should serve agents, or whether agents will build their own catalogs instead.

I have a guess about which answer ages better.

agentic sre is where ai hype meets the pager

Paulo Victor Leite Lima Gomes — Fri, 15 May 2026 20:36:24 +0000

AWS published a post recently about building an end-to-end agentic SRE, and I had two reactions at the same time.

The first one was: yes, obviously. Incident response is full of repetitive investigation work that agents should help with.

The second one was: oh no, we are absolutely going to hurt ourselves with this.

Not because SRE agents are a bad idea. I think they are one of the more useful AI directions, actually. But the pager is a very different environment from a coding task on a quiet Tuesday afternoon. Production incidents are where vague automation, incomplete context, bad permissions, and confident summaries turn from annoying into expensive.

incident response is mostly context gathering

A lot of incident work is not heroic debugging. It is context gathering under pressure.

You check dashboards. You compare deploy timestamps. You look at logs. You inspect error rates. You ask whether one region is worse than another. You check whether a dependency is degraded. You search Slack for the last person who touched this thing. You read a runbook that is probably 70% correct and 30% archaeology.

That is exactly the kind of messy, tool-heavy workflow where agents can help.

An agent that can pull CloudWatch metrics, query traces, summarize logs, inspect recent deployments, and prepare a timeline could save real minutes. And minutes matter when customers are down and everyone is pretending to be calm in the incident channel.

Stack Overflow also had a good piece on observability and human intuition in an AI world, and I think that framing is important. The goal is not to replace intuition. The goal is to give humans a better starting point for judgment.

A good incident agent should make the human sharper, not more passive.

the dangerous part is the verb

The problem starts when the agent moves from "look" to "do."

There is a huge difference between:

"summarize the last 30 minutes of elevated 5xx errors"
"find likely related deploys"
"compare this service against last week's baseline"
"rollback the deployment"
"scale the service"
"change the retry policy"
"disable this feature flag"
"restart the cluster"

The first group is investigation. The second group is operation.

Both are useful. Only one of them can make the incident worse in three seconds.

This is where a lot of AI demos become misleading. In a demo, the agent diagnoses the problem, proposes a fix, runs the action, and the graph turns green. Nice. In production, the agent may diagnose a symptom as the root cause, apply a fix that hides the signal, or take an action that works for one customer path while breaking another.

Humans do this too, of course. The difference is that humans tend to be slower, more socially accountable, and easier to interrupt. An agent with broad permissions can be wrong very efficiently.

Efficiency is not always your friend during an incident.

observability is the safety system

If you want agentic SRE, observability is not a nice add-on. It is the safety system.

The agent needs reliable telemetry, but the humans need telemetry about the agent too:

What data did it inspect?
Which queries did it run?
What assumptions did it make?
Which actions did it propose?
Which actions did it execute?
Who approved them?
What changed after the action?

If the agent says "the database is the bottleneck," I want to know whether it looked at saturation, lock waits, connection pool exhaustion, disk latency, downstream timeouts, or just one sad-looking CPU graph.

This is why I am skeptical of incident agents that only produce beautiful natural-language summaries. Summaries are useful, but they can also compress away uncertainty. During an incident, uncertainty is not noise. It is part of the signal.

A good SRE agent should show its work like a nervous staff engineer in a postmortem.

permissions should match the phase of the incident

The easiest bad design is to give the agent one big production role and trust it to be careful.

Please do not do that.

Incident response has phases, and the permissions should match them.

For normal operation, an agent should mostly be read-only. Let it inspect metrics, logs, traces, deploy metadata, feature flag state, config history, runbooks, and recent alerts. This alone is already valuable.

For mitigation, allow a small set of reversible actions: create an incident timeline, draft a rollback command, propose a feature-flag change, open a PR, page the owning team, or prepare a runbook step. Maybe some teams allow low-risk automated actions, but they should be explicit and boring.

For high-impact operations, require human approval. Rollbacks, traffic shifting, database failovers, permission changes, and infrastructure mutation should not be hidden behind "the AI thought it was best."

This is not anti-automation. This is how grown-up automation works. The blast radius decides the approval model.

runbooks become executable contracts

One thing I like about the agentic SRE direction is that it may finally force teams to clean up their runbooks.

A runbook written only for humans can be vague:

Check the dashboard and restart the service if it looks stuck.

A runbook used by an agent needs better structure:

Which dashboard?
Which metrics define "stuck"?
What threshold matters?
What command restarts the service?
Is restart safe during a deploy?
Who approves it?
How do we verify recovery?
What should never be restarted automatically?

That is healthy pressure.

The same happened with CI/CD. Once deployment became automated, teams had to make the release process explicit. Agentic SRE could do the same for operations. Not because the agent is magical, but because automation punishes ambiguity.

If your runbook cannot be followed by a careful junior engineer at 3 AM, it probably cannot be safely followed by an agent either.

the pager is not a benchmark

The most important thing I would avoid is turning incidents into an AI leaderboard.

"The agent resolved 42% of incidents automatically" sounds impressive until you ask which incidents, which actions, how many false positives, how many hidden regressions, and how many humans quietly cleaned up afterward.

Better metrics would be more boring:

time to useful first summary
percentage of incidents with complete timelines
reduction in repeated manual diagnostic steps
approval rate for agent-proposed actions
rollback or revert rate after agent-assisted mitigation
postmortem findings caused by missing context
number of times the agent escalated correctly instead of guessing

I care much more about an agent that reliably saves ten minutes of investigation than one that occasionally performs a heroic autonomous fix and occasionally makes everyone sweat.

Hero automation is fun in demos. Boring assistance is what survives production.

what i would build first

If I were adding agentic SRE to a team today, I would start with the least glamorous version.

Read-only incident assistant. No mutation. No secret powers.

It would join the incident channel, collect telemetry, build a timeline, link recent deploys, summarize symptoms, identify likely owners, and keep a running "known facts vs guesses" list.

Then I would add proposed actions, not executed actions. The agent can draft the rollback command, but a human runs it. The agent can suggest the feature flag, but a human flips it. The agent can propose scaling, but it has to show the evidence.

Only after that works for a while would I consider limited automated mitigation. And even then, I would start with narrow actions that are reversible, logged, and already accepted as safe by the team.

The boring maturity model is something like:

read-only summarizer
timeline and evidence builder
runbook navigator
action recommender
human-approved operator
narrow autonomous mitigator

Skipping from step one to step six is how you get a postmortem with the phrase "unexpected agent behavior" in it.

the real shift

The bigger story is not that AWS, or any vendor, can build an SRE chatbot.

The bigger story is that operations are becoming another place where agents participate in the workflow. Not as magic coworkers. As tool-using processes with access, memory, logs, permissions, and failure modes.

That means platform teams need to design around them.

The same questions keep coming back: what can the agent see, what can it change, how do we review it, how do we observe it, how do we roll it back, and who owns the mess when it is wrong?

Agentic SRE is exciting because it attacks real toil. It is dangerous for the same reason. The work is real, the systems are real, and the pager does not care that the demo looked amazing.

So yes, bring agents into incident response.

Just make them earn trust the same way every other operational tool does: read-only first, observable always, reversible where possible, and very careful around anything that can turn a small fire into a bigger one.

containers are becoming policy wrappers for ai agents

Paulo Victor Leite Lima Gomes — Wed, 13 May 2026 00:02:40 +0000

Docker published a post this week about AI Governance, and my first reaction was not "cool, Docker has an AI feature now."

My reaction was: of course this is where the market goes.

Not because Docker magically solves AI safety. It does not. But because the moment you let agents do real work, the boring container questions suddenly become interesting again:

What files can this thing read?
What network calls can it make?
Which secrets are visible?
Can it install packages?
Can it write outside the workspace?
Can I reproduce what happened after the fact?

That is not a chatbot problem. That is a runtime problem.

And containers have always been a runtime-shaped answer to uncomfortable runtime questions.

agents made isolation feel urgent again

For a while, containers were boring in the best possible way.

They became part of the furniture. Build image, push image, run image, deploy image. Kubernetes made them infrastructure. CI made them muscle memory. Local dev made them annoying, then necessary, then invisible.

But AI agents change the emotional temperature around containers.

A normal application is dangerous in a relatively predictable way. It has code. It has dependencies. It has configuration. It does what the humans shipped, plus whatever bugs the humans accidentally shipped.

An agent is different. An agent is a loop with tools. It observes, decides, calls something, reads the result, and decides again. If you give it a shell, a filesystem, network access, credentials, package managers, browser automation, and a vague instruction like "fix the failing tests," you did not just give it a productivity tool.

You gave it a small operating entity inside your engineering environment.

That sounds dramatic, but I think it is the right mental model. The operational behavior is closer to an untrusted automation worker than a library call.

And untrusted automation workers need boundaries.

the interesting part is not packaging anymore

The old container pitch was mostly packaging: "it works on my machine" becomes "it works in this image."

That was useful. Still is.

But for agents, packaging is the least interesting part. The interesting part is policy.

A container can define:

the filesystem shape the agent sees
whether the workspace is mounted read-only or read-write
which directories are excluded
whether network access exists at all
which binaries are installed
which credentials are injected
how the process is logged
how the environment can be destroyed after the task

That list is basically an agent governance checklist with a Docker accent.

This is why Docker's recent AI governance and sandboxing posts are more important than the branding suggests. They are not just saying "run AI stuff in Docker." They are pointing at a larger architectural shift: containers are becoming permission envelopes for autonomous work.

The same thing is happening in agentic terminals, CI agents, code review bots, local model runners, and internal platform tools. The question is no longer only "can we run this workload?" It is "can we let this workload act without giving it the keys to the whole apartment?"

least privilege is harder when the worker is creative

The uncomfortable part is that agents are good at finding paths you did not explicitly think about.

A human engineer might ask, "can I access this file?" An agent might just try three commands, follow a symlink, inspect an environment variable, read a generated config, and then confidently continue.

Not maliciously. Just because that is what task completion looks like.

This is why I do not trust prompt-level policies as the primary control. "Do not read secrets" is nice. "The secrets are not mounted" is better.

Same with network access. "Only call approved APIs" is a guideline. "This container has no outbound network except through a proxy that logs and filters requests" is a control.

Same with write permissions. "Only modify files related to the task" sounds reasonable until the agent decides the fastest way to fix the build is to update a shared config file and touch half the repository. A read-only base plus an explicit writable workspace is less flexible, but it is also less surprising.

This is the pattern platform teams already know from production systems. Policy that depends on good behavior is not policy. It is hope with a YAML file.

the developer experience tradeoff is real

Now, the obvious pushback: too much sandboxing makes agents useless.

And yes. Absolutely.

If every agent task requires a ticket, a custom container, a security review, and three approvals from someone named Greg, engineers will route around it. They will run the agent locally with full access because the unsafe path is the only path that feels productive.

This is where the platform work gets interesting.

The goal is not to make the sandbox perfect. The goal is to make the safe path the easiest path for common work.

For example:

A default coding sandbox with repo access, no cloud credentials, limited outbound network, and disposable state.
A documentation sandbox that can read docs and open a browser but cannot touch source files.
A dependency-update sandbox with package registry access but no production secrets.
A migration sandbox that can run tests against a scrubbed database, not the real one.
A privileged break-glass mode that is logged, rare, and slightly annoying on purpose.

That is developer experience design, not security theater.

The winning internal platforms will not say "agents are banned" or "agents can do anything." They will offer a small menu of well-understood execution envelopes and make it trivial to pick the right one.

containers are not enough, but they are the right primitive

To be clear: containers do not solve the whole problem.

A container with the wrong mounts, broad network access, and injected admin credentials is just a fancy way to feel safe while being unsafe. Container escapes exist. Supply chain problems exist. Secrets leak. Logs lie.

Some agent workloads need stronger isolation than ordinary containers. MicroVMs, hardened images, user namespaces, seccomp, network proxies, policy engines, and ephemeral credentials all matter depending on the blast radius.

But containers are still the right starting primitive because they are understandable. Engineers already know how to reason about images, mounts, environment variables, ports, processes, and logs. The mental model is imperfect, but it is much better than "the agent runs somewhere in your laptop with whatever access your shell had at 11:43 PM."

This is why I think the container renaissance around AI is real. Not because containers became fashionable again. Because isolation became valuable again.

what i would actually do on a team

If I were designing agent infrastructure for a serious engineering org right now, I would start small:

Define default sandboxes by task type. Coding, docs, tests, dependency updates, infra changes. Do not start with infinite flexibility.
Make secrets opt-in and scoped. Most agent tasks do not need production credentials. If a task needs cloud access, give it short-lived credentials with a narrow role.
Put network behind a policy layer. Even basic allowlists and logs are better than pretending nobody cares where the agent connects.
Keep the workspace disposable. The agent should be able to make a mess without making the developer's machine weird for the next three days.
Treat the transcript and command log as build artifacts. If an agent changed code, future reviewers should be able to see the path it took, not just the final diff.
Make escape hatches visible. Sometimes people need power. Fine. But privileged modes should be named, logged, and reviewed later.

None of this requires a giant AI governance program. It requires platform teams to admit that agents are not just autocomplete. They are tool-using processes, and tool-using processes need runtime boundaries.

the boring layer wins again

Infrastructure keeps reappearing at the center of whatever trend was supposed to abstract it away.

Serverless did not remove runtime concerns. It moved them.

Kubernetes did not remove operations. It renamed them controllers, operators, admission policies, and CRDs.

AI agents will not remove platform engineering. They will make platform engineering more important, because now the platform has to govern not only applications and humans, but semi-autonomous work happening between them.

That is why the Docker AI governance story is worth paying attention to. Not because Docker has the final answer. Because it shows where the answer is likely to live.

The future agent stack is not one giant model with unlimited permissions. It is a set of models, tools, policies, logs, sandboxes, and escalation paths.

And in that control plane, containers are not just packaging units anymore. They are becoming the thing that says: this agent can act, but only inside this shape.

That shape is going to matter a lot more than the demo videos suggest.

references

What now? explaining the TanStack Supply Chain Attack

Paulo Victor Leite Lima Gomes — Tue, 12 May 2026 12:20:30 +0000

If you install any tanstack package you might be affected. Thats it, on May 11, 2026, one of the most sophisticated npm supply chain attacks ever seen hit the JavaScript ecosystem, and it's still spreading.

This is my attempt to explain it simply, because the technical details are buried in postmortems and most developers I know haven't fully digested what this means for them.

What happened ?

The attackers compromised 42 TanStack packages, publishing 84 malicious versions in a 6-minute window. TanStack is widely used, Query, Router, Start, so the blast radius was insane.

But here's what makes this different from a typical "someone hacked a maintainer's account" story: no passwords were stolen. The attackers never needed them.

...

The clever part: the attack chain...

The attacker opened a pull request to TanStack's GitHub repo. Looks innocent. But they exploited three things in sequence:

1. A dangerous GitHub Actions trigger

The workflow used pull_request_target instead of pull_request. The difference sounds minor but it's huge. pull_request_target runs in the context of the base repository, meaning it has access to secrets and can write to the Actions cache, even from a forked PR 🤦🏻‍♂️

2. Cache poisoning

The malicious PR code ran during CI and wrote a poisoned 1.1 GB cache entry keyed to match what the release workflow would look up on the next push to main. Then the attacker quietly reverted the PR to a no-op and closed it. The cache stayed poisoned.

3. OIDC token theft

When a legitimate maintainer merged an unrelated PR days later, the release workflow started, restored the poisoned cache, and the malware ran inside the trusted CI context. It then minted a short-lived OIDC token (the kind used for "Trusted Publishing" on npm) and published the malicious versions directly... All while the legitimate workflow showed a failure status.

No npm credentials. No compromised passwords. Just a poisoned cache waiting for the right moment.

...ok... I get it, but...

What the malware does ?

Once installed via npm install, it:

Harvests AWS credentials, GitHub tokens, npm tokens, SSH keys, Kubernetes service account tokens
Exfiltrates everything over an end-to-end encrypted channel (Session messenger's network), so you can't even block it by IP easily
Self-propagates: looks up every package you maintain on npm and republishes them with the same payload..

This is a worm. It uses your own credentials to infect your own packages and spread to your users.

...now the 1 million dollars question...

Are you affected?

Check if you installed or updated any @tanstack/* packages on the evening of May 11, 2026 (UTC). If yes, assume the machine that ran npm install is compromised.

you can run this and check all your repositories locally, imagine they are under the path dev, you run:

find ~/dev -path "*/node_modules/@tanstack/*/package.json" \
  -newermt "2026-05-11 00:00:00" ! -newermt "2026-05-12 00:00:00" \
  -exec echo "⚠️  check this: {}" \;

changed based on your path...

Rotate immediately:

AWS credentials
GitHub personal access tokens
npm tokens
SSH private keys
Any secrets that lived in environment variables on that machine

...and...

How to protect yourself going forward

A few practical things you can actually do:

Don't develop on your host OS. Use Docker dev containers or VMs. If a postinstall script runs malware, it should be contained — not running as you on your machine with access to ~/.ssh and ~/.aws.

Don't store secrets in .env files. Use a proper secrets manager — Doppler, Infisical, AWS SSO. Your credentials shouldn't be sitting in plaintext files reachable by any package lifecycle script.

Configure your package manager to be defensive. Tools like pnpm and Bun let you disable postInstall scripts by default and require opt-in per package. Some support a minimum package age policy (don't install anything published less than 24 hours ago) — which alone would have protected against this attack.

If you maintain npm packages, audit your GitHub Actions workflows right now. If you use pull_request_target, understand exactly what it can access. Lock down cache permissions. Restrict id-token: write only to the specific job that needs it.

The frequency of these attacks is increasing. The sophistication too. This one had a 8-hour delay between poisoning the cache and detonating — patient, precise, automated.

The ecosystem isn't going to fix this overnight. But you can make your own environment significantly harder to compromise.

Stay skeptical of npm install now.

References

TanStack Official Postmortem — Tanner Linsley, May 11 2026
The Register: Cache Poisoning Caper Turns TanStack npm Packages Toxic
Video Breakdown by Maximilian Schwarzmüller
GitHub Security Advisory GHSA-g7cv-rxg3-hmpx
Tracking Issue TanStack/router#7383

why the 'hand coding' backlash is really about agency, not nostalgia

Paulo Victor Leite Lima Gomes — Tue, 12 May 2026 00:01:54 +0000

There is a post making the rounds on Hacker News today that caught my attention for exactly the right reasons.

The title is deliberately provocative: "I'm going back to writing code by hand."

It hit 900+ points in a few hours, and the comments section is exactly what you would expect. People saying "this is Luddite nonsense." People saying "finally someone said it." People arguing about whether the author is a bad engineer or the only honest one in the room.

Both sides are missing the point, I think. The piece is not really about whether to use AI. It is about something more subtle: the feeling that you can no longer tell what your tools are doing to your work.

That is not nostalgia. That is a real signal.

what the hand-coding post actually says

The article makes a short, honest argument. The author describes using AI heavily for code generation, getting comfortable with it, and then noticing an uncomfortable pattern: they were spending less time understanding the code they owned. Diffs were getting larger and less familiar. Edge cases were getting missed. The review process felt shallower because the output looked polished even when it was wrong.

The solution the author chose was to go back to writing code by hand for a while. Not forever. Not as a moral stance. As a recalibration.

That is not Luddism. That is noticing that your feedback loop has degraded and taking steps to fix it.

The pushback on HN is mostly about scale: "you cannot reject productivity gains just because they feel uncomfortable." And sure, at scale, that argument has teeth. But the individual experience is real, and I think it points to something the industry is not talking about enough.

the real problem is indistinguishability

I do not think the backlash is against AI, or even against code generation. I think it is against indistinguishability — the state where you can no longer reliably tell the difference between output that is correct and output that just looks correct.

This is a problem that has empirical backing now.

There is a paper from arXiv that came out alongside the hand-coding post — "LLMs Corrupt Your Documents When You Delegate" — that studied what happens when you delegate document processing to an LLM. The finding is straightforward and unsettling: delegated workflows systematically introduce errors. Not random noise. Systematic, hard-to-detect corruption of the content being processed.

The paper is about documents, but the pattern applies directly to code. When you delegate code generation to an AI and then review the output, you are operating in a mode where:

The output is polished enough to pass a quick scan.
The errors are not random typos. They are logical gaps, missing edge cases, incorrect assumptions that are internally consistent.
The review process is asymmetrical: the model produced the output in seconds, but catching its mistakes takes as long as writing the code yourself would have taken.
The cost of a missed error is not in the generation step. It is in the production incident three weeks later.

The hand-coding author is not the first person to notice this. But the combination of the personal account and the empirical paper makes the pattern harder to dismiss as "just someone being resistant to change."

the meta misery index

There is another data point that fits the pattern.

The New York Times reported this week that Meta's embrace of AI is making employees miserable — 455 points on HN, heavy discussion. The story describes engineers at Meta burning out because AI adoption was mandated from the top, not enabled from within.

The details matter: engineers were being evaluated on AI tool usage metrics. Teams were pressured to show AI adoption numbers. What started as a productivity initiative turned into a compliance exercise. And the result was not better engineering. It was fatigue, resentment, and a growing sense that the work had become about feeding outputs rather than building systems.

I have seen versions of this dynamic in smaller teams too. When AI adoption becomes a metric that management tracks, the incentives go wrong fast. People stop asking "is this tool making our system better?" and start asking "how do I make my AI usage chart go up?"

The Meta story is a case study in what happens when you optimize for the wrong signal. Engineers notice. They get demoralized. The AI tooling becomes overhead with extra steps.

what agency actually means in an ai-assisted workflow

I keep coming back to the concept of agency because I think it is the frame that makes sense of all three signals — the hand-coding post, the LLM corruption paper, and the Meta burnout story.

Agency, in this context, means:

You understand what the code you ship actually does.
You can trace a production issue back to its cause without guessing.
You know when to trust the tool and when to override it.
You are not surprised by what your system does in production.

AI-assisted workflows can preserve agency, but it takes deliberate work. The default behavior — generate, skim, merge, move on — erodes agency incrementally. Each diff you do not fully understand is a small loss. Each review where you trust the output because it looks clean is a bet that may or may not pay off.

The teams that maintain agency share a few patterns:

They do not let the agent touch unfamiliar code. If nobody on the team deeply understands a module, the agent is not going to help. It will produce plausible changes that compound confusion.
They review agent output with more scrutiny, not less. The more confident the output looks, the more carefully they read it. They have learned that polished wrong is more dangerous than obviously wrong.
They separate generation from review, and enforce a gap. They write with the agent, then review without it. The review session does not have the generation context open. They look at the diff as if someone else wrote it.
They maintain a manual subset of the codebase. Infrastructure configs. Authentication logic. State management. The parts where being wrong is expensive. These are written by hand and reviewed by multiple people. Agents can suggest, but they do not author.
They track rework, not output. Rework ratio is the signal that tells you whether agency is eroding. If code you generate needs more fixes than code you write, you have a problem that generation speed does not solve.

the second-order effect nobody is modeling

Here is the part that worries me the most.

When you stop writing code by hand and start reviewing generated code, you are not just changing how code gets written. You are changing how you learn.

Writing code by hand is a learning mechanism. Every time you type out a loop, a conditional, a state transition, a retry strategy, you are reinforcing your mental model of the system. You are building the intuition that tells you "this looks right" or "something is off here."

When you delegate the writing, you lose that reinforcement loop.

The concern is not that you stop being able to write code. It is that your intuition for what good code looks like atrophies. You stop seeing the patterns of what makes a function robust. You stop noticing when an abstraction is wrong because you did not write it and feel its weight.

This is the pattern Sean Goedecke wrote about in "Software engineering may no longer be a lifetime career" — AI tooling may de-skill engineers the same way physical labor de-skills construction workers over time. Not because the tools are bad. Because the tools change what the body learns.

The hand-coding backlash is, I think, a response to this. Not a rejection of AI. A refusal to let the learning loop close.

what i think is actually happening

The way I read the current moment is this:

A significant number of experienced engineers are starting to feel something they cannot quite articulate. They are using AI tools, getting real productivity gains, and also feeling like something is slipping. The code they produce is not worse — but their relationship to it is different. More distant. Less intimate.

The "going back to hand coding" post is one person articulating that feeling. The LLM corruption paper is the empirical version of the same intuition. The Meta burnout story is what happens when organizations ignore the feeling and push harder.

None of these signals say "AI is bad." They say "the mode of work has changed, and we have not yet figured out how to preserve agency in the new mode."

That is a solvable problem. But it requires accepting that the problem exists, rather than dismissing the people who feel it as resistant to progress.

the practical takeaway

If you are an engineer reading this and wondering whether you should "go back to hand coding," I do not think the answer is binary. The answer is probably:

Use AI for what it is good at: boilerplate, repetitive patterns, well-scoped generation, documentation drafts, code you already know how to write.
Protect what makes you good: manual writing for unfamiliar domains, deep review of generated changes, time to understand the system without an assistant in your ear.
Measure the right thing: not how many diffs you generate, but how many of them survive the next quarter without needing rework.

The hand-coding backlash is not a rejection of the future. It is a signal that the future needs better scaffolding. And the engineers who figure out how to build that scaffolding — for themselves and their teams — are going to be the ones who stay effective in both modes.

Because the goal was never to write the most code in the shortest time. The goal was to build systems that work, that last, and that you understand well enough to fix at 3 AM.

Everything else is just tooling.

references

I'm going back to writing code by hand — 900+ points HN
LLMs Corrupt Your Documents When You Delegate — empirical study, arXiv
Meta's embrace of AI is making employees miserable — NYT
Software engineering may no longer be a lifetime career — Sean Goedecke

measuring ai-assisted velocity without lying to yourself

Paulo Victor Leite Lima Gomes — Mon, 11 May 2026 00:01:40 +0000

Every engineering leader I talk to these days has the same question:

"Are our AI tools actually making us faster?"

And every single one of them has an answer they sort of believe but cannot quite prove.

The CTO points at the PR count chart going up. The team lead points at the growing backlog of half-finished AI-generated features. The IC points at the three refactors they had to do last week because an agent built the wrong abstraction.

Someone is right. Someone is wrong. And the data, as it exists today, helps nobody.

I have been thinking about why most "AI productivity" metrics are useless, and what actually works instead.

the trap everyone falls into first

The standard move is to measure output.

PRs per week. Commits per day. Lines of code changed. Cycle time from first commit to merge. Story points completed.

These numbers look great once people start using AI tools. Of course they do. An agent can open a PR in five minutes that would have taken a human an afternoon. Your PR count doubles. Your cycle time drops. Your chart looks like the team just discovered amphetamines.

The problem is not that the numbers are wrong. The problem is that they measure the cheap part.

Opening a PR is easy. Writing code that does not create hidden problems is the actual work.

When a junior engineer generates thirty PRs in a week, six of them get rolled back, twelve require significant fixes after review, three break something downstream, and two are actually clean, "PRs per week" makes the junior look like a hero. Any dashboard that rewards this is not measuring productivity. It is measuring production of future work.

The same pattern applies to AI-generated code. The agent produces output fast. The human reviews it fast. The PR merges fast. And then, two weeks later, someone discovers the agent never considered the edge case that the original design document spent three paragraphs describing.

Nobody measured that cost. Nobody knows how.

velocity is not the same as throughput

This is where I think most engineering orgs get stuck.

Throughput is easy to measure. Velocity is not.

Throughput says "we shipped X features." Velocity says "we shipped X features and the system is still maintainable, the team is not burned out, and the next change will not be harder than this one."

AI tools clearly increase throughput. The question is whether they increase or decrease velocity.

The difference shows up in the hidden work:

How much time is spent fixing bugs introduced by AI-generated code?
How many PRs need significant rework after the agent is done?
How often does the agent produce code that passes tests but violates architectural conventions?
How many features are shipped with worse test coverage because the reviewer trusted the agent's output too much?
How much undocumented complexity lands in the codebase because nobody reads every line of an agent diff?

If you measure PR count, none of these show up.

If you measure cycle time, none of these show up.

If you measure story point completion, none of these show up.

The dashboard is lying and nobody installed the truth.

what to measure instead

I have been experimenting with a different set of metrics in the teams I work with. They are not perfect. But they surface the signal that gets buried by throughput numbers.

rework ratio

Track the percentage of code that is modified within 30 days of being written. High rework suggests the initial output was low quality, whether by human or agent.

Compare rework ratios between AI-assisted and non-AI-assisted changes. If the AI code has a significantly higher rework ratio, the "productivity gain" is a mirage. You are just moving the work from writing to fixing.

If the rework ratio is similar or lower, the AI is probably adding genuine leverage.

review depth ratio

Track how many review comments per line of changed code the team produces. If this number drops significantly after AI adoption, it may mean reviews are shallower, not that the code is better. Agents are convincing writers. They look correct. The reviewer needs to push harder.

If the review depth stays steady or increases, the team is maintaining healthy skepticism.

incident attribution (with a grain of salt)

Track whether AI-generated changes are overrepresented in incident postmortems. Not as blame. As a hygiene signal.

If AI code causes 30% of incidents but represents 50% of changes, that is a signal worth investigating. Maybe the agents need better constraints. Maybe the review process needs more guardrails. Maybe the agents should not touch certain parts of the codebase.

feature completion vs. feature health

Track not just whether a feature shipped, but whether it needed significant repair in the first month.

A feature that ships in two hours, then requires three days of bug fixes and two rollbacks, is not a win. It is a time bomb with optimistic labeling.

If AI-assisted features require disproportionate post-ship maintenance, the velocity equation changes dramatically.

context switching cost

Track how much time engineers spend switching between agent output and review.

If the pattern is "generate, review, generate more, review more, merge, fix," the context switching cost is real. Some teams report spending more time reviewing agent output than they would have spent writing the code themselves. That is not a productivity gain. That is management theater with extra steps.

what organizations get wrong about the numbers

There is a deeper problem here.

Most orgs want a single number. "Are we faster? Give me the percentage."

But AI-assisted velocity is not a ratio. It is a system property. It depends on:

the type of work being done (greenfield vs. maintenance vs. incident response)
the maturity of the codebase (well-factored vs. spaghetti)
the quality of the agent setup (tool access, context availability, guardrails)
the review culture (thorough vs. fast)
the deployment pipeline (automated tests + canary deploys vs. manual approvals)
the team's domain knowledge (agents cannot know what the team has not documented)

A single number flattens all of this. It passes judgment on a complex system based on one noisy signal.

The teams that get this right do not ask "are we faster?"

They ask:

"Are our agents producing less rework than last quarter?"
"Are our reviews staying thorough despite faster output?"
"Is the post-ship maintenance burden going down or up?"
"Are we catching agent mistakes before they reach production, or after?"

Those are actionable questions. The answers tell you where to invest next.

the danger of optimizing the wrong thing

Here is the part that makes me nervous.

Once organizations start measuring "AI velocity," they will inevitably optimize whatever the dashboard shows.

If the dashboard shows PR count, teams will generate more PRs, quality be damned. If it shows cycle time, they will merge faster, review be damned. If it shows feature count, they will ship more features, maintenance be damned.

This is not hypothetical. It is what every productivity measurement does. Goodhart's Law is older than AI.

The only way around this is to measure things that are harder to game.

Mean time to repair after a release? Hard to fake.
Percentage of code that survives 90 days without modification? Hard to fake.
Number of incidents caused by changes in the last quarter? Hard to fake.
Percentage of features that reach adoption targets without major rework? Hard to fake.

These metrics are not perfect. But they resist the incentive to make the number go up by making the system worse.

what a healthy setup looks like

The teams I have seen actually benefit from AI tools share a few patterns:

They measure before and after, not just after. Baseline data makes the comparison honest.
They separate throughput from quality in their dashboards. Two separate views. No averaging them together into a single "velocity" score.
They review agent output as critically as junior engineer output. Maybe more critically, because agents are more confident and less likely to ask clarifying questions.
They track maintenance burden explicitly. If the AI produces code that needs more bug fixes, they adjust their expectations and their tooling.
They stop measuring things that lie. If PR count goes up but quality goes down, they drop PR count from their metrics and dig into what drives quality.
They accept that some kinds of work do not benefit from AI assistance. Trying to measure "AI velocity" for incident debugging is silly. Trying to measure it for well-scoped greenfield features is useful. Different metrics for different contexts.

the real goal

I do not actually care whether the AI productivity number goes up or down.

I care whether the engineering organization is making better decisions about where to invest.

If the data says agents save more time than they cost, it is worth investing in better context, better guardrails, better integration with the platform.

If the data says agents create as much work as they save, it is worth investing in better prompts, better review processes, better tool selection.

If the data says agents are a net negative in certain domains, it is worth documenting that so nobody wastes time forcing a square peg into a round hole.

The value is not in the number. The value is in the decision the number enables.

So stop asking "are we faster?" and start asking "do we know what is happening?"

The first question produces a dashboard that lies.

The second question produces an organization that learns.

agentic terminals are eating the IDE (and why that is not entirely good)

Paulo Victor Leite Lima Gomes — Sun, 10 May 2026 00:00:49 +0000

I have been living in the terminal more than ever lately.

Not because I suddenly hate GUIs. The terminal just started feeling like the place where the interesting work happens again. Between Claude Code, Cursor’s agent mode, Windsurf, and whatever Aider fork is hot this week, the old “open IDE, click around, type some code” loop feels… slow.

But I am not ready to declare the IDE dead. Not yet.

the shift is real, but it is not just about speed

The argument for agentic terminals is straightforward.

You describe what you want. The agent plans, edits files, runs tests, fixes issues, and hands you a diff or a PR. You stay in one window, keep your hands on the keyboard, and avoid the context switch of clicking through project explorers and sidebar panels.

For a lot of the work I do (platform code, infrastructure tweaks, small services, config changes), this flow is genuinely faster once you have decent guardrails.

The problem is what happens after the agent is done.

IDEs were never just editors

People treat IDEs like fancy text editors with extra buttons. That underestimates what they actually provide.

A good IDE gives you:

reliable, fast navigation across a large codebase
accurate refactoring that understands the language semantics
integrated debugging with breakpoints and variable inspection
test runners that know how to run just the right subset
linting and type checking that updates as you type
version control integration that does not require remembering ten different CLI flags

An agent in the terminal can do many of these things now. But “can do” and “does reliably without supervision” are still different.

I have watched agents propose refactors that looked clean in the diff and then broke three downstream services because the agent never ran the full test suite in the right environment. The IDE would have at least shown me the red squiggles before I even considered merging.

the new default for senior engineers is hybrid

The engineers I respect most right now are not pure terminal maximalists or pure IDE loyalists. They are hybrid.

They keep an IDE open for:

deep exploration of unfamiliar code
complex debugging sessions
anything that requires understanding call graphs or type relationships across multiple files

They drop into the agentic terminal for:

greenfield work inside well-understood boundaries
repetitive refactors with clear acceptance criteria
anything that can be described as “make this change and verify it passes these checks”

The skill is knowing which mode the current task actually needs.

Most people default to whichever tool they opened first that morning. The people shipping faster are the ones who switch deliberately.

the governance problem nobody wants to talk about

Here is the part that actually keeps me up.

When the agent lives in the terminal, it is easier to give it broad permissions. It already has your shell environment. It can already run commands. Adding “let it edit files and run git” feels like a small step.

When the same capability lives inside an IDE plugin, the blast radius is usually more contained because the plugin model forces more explicit permission boundaries.

We are about to learn this lesson the hard way.

Enterprise-managed terminal agents are coming (GitHub already started with Copilot CLI plugins). The organizations that treat this as “just another CLI tool” instead of “a new privileged runtime” are going to have a very bad time when an agent decides the fastest way to fix a failing pipeline is to disable a security check it does not understand.

what actually changes for daily work

I am not saying stop using your IDE. I am saying the default starting point for many tasks has shifted.

For a platform engineer in 2026, the practical setup looks something like:

Terminal agent for the first 60-70% of the work (scaffolding, obvious changes, test updates)
IDE for the parts where you need to understand why something is behaving strangely
Both tools sharing the same git workspace so the handoff is cheap

The people who get this right will spend less time in meetings arguing about “AI productivity” and more time actually shipping.

The people who get it wrong will either stay stuck in 2024 workflows or hand the keys to agents that quietly create more work than they save.

I am still figuring out the exact balance for my own setup. But I know the old “just open the IDE and start typing” default is already gone for a lot of the work that used to feel normal.

The terminal is not winning because it is cooler. It is winning because, for certain classes of changes, staying in one context with an agent that can act is simply higher leverage.

Just do not pretend the governance and verification problems disappeared because the interface got simpler. They just moved.