<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew Kew</title>
    <description>The latest articles on DEV Community by Andrew Kew (@thegatewayguy).</description>
    <link>https://dev.to/thegatewayguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895707%2Fc20fb912-7c14-4930-97fc-5894d1c3e4c5.png</url>
      <title>DEV Community: Andrew Kew</title>
      <link>https://dev.to/thegatewayguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thegatewayguy"/>
    <language>en</language>
    <item>
      <title>Harness bugs, not model bugs</title>
      <dc:creator>Andrew Kew</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:51:12 +0000</pubDate>
      <link>https://dev.to/thegatewayguy/harness-bugs-not-model-bugs-1f4e</link>
      <guid>https://dev.to/thegatewayguy/harness-bugs-not-model-bugs-1f4e</guid>
      <description>&lt;p&gt;For six weeks, developers have been complaining that Claude got worse. You've seen the posts — "Claude Code is flaky", the AI-shrinkflation discourse.&lt;/p&gt;

&lt;p&gt;Yesterday, Anthropic shipped a &lt;a href="https://www.anthropic.com/engineering/april-23-postmortem" rel="noopener noreferrer"&gt;postmortem&lt;/a&gt;. Three unrelated bugs, all in the Claude Code &lt;em&gt;harness&lt;/em&gt;. The model weights and the API were never touched.&lt;/p&gt;

&lt;p&gt;That distinction is the whole point.&lt;/p&gt;

&lt;h2&gt;What actually broke&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default reasoning effort got dialled down.&lt;/strong&gt; A UX fix dropped Claude Code's default from "high" to "medium" in early March. Users noticed it felt dumber. Reverted April 7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A caching optimisation dropped prior reasoning every turn.&lt;/strong&gt; It was supposed to clear stale thinking once per idle session, but a bug made it fire on every turn. Claude kept executing without remembering why, which surfaced as forgetfulness and repetition. Fixed April 10.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A verbosity system prompt hurt coding quality.&lt;/strong&gt; "Keep responses under 100 words." Internal evals missed a 3% regression on code. Caught by broader ablations. Reverted April 20.&lt;/li&gt;
&lt;/ol&gt;
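&lt;p&gt;Bug #2 is easy to reproduce in any harness. Here is a minimal sketch (all names hypothetical, not Anthropic's code) contrasting the intended once-per-idle-session prune with the buggy every-turn prune:&lt;/p&gt;

```python
# Toy harness context illustrating bug #2: a prune meant to run once per
# idle session, mis-wired to run on every turn.

IDLE_THRESHOLD_S = 300  # clear stale reasoning only after 5 min of inactivity


class Context:
    def __init__(self):
        self.reasoning = []      # prior chain-of-thought blocks
        self.last_turn_at = None

    def on_turn(self, now, buggy=False):
        if buggy:
            # The bug: prune unconditionally, so the model loses the
            # "why" behind its own plan between consecutive turns.
            self.reasoning.clear()
        elif self.last_turn_at is not None and now - self.last_turn_at > IDLE_THRESHOLD_S:
            # Intended behaviour: prune only after the session went idle.
            self.reasoning.clear()
        self.last_turn_at = now


ctx = Context()
ctx.on_turn(0)
ctx.reasoning.append("plan: refactor module A first")
ctx.on_turn(10)                    # 10s later, same session
print(len(ctx.reasoning))          # intended path keeps the plan: 1
ctx.on_turn(20, buggy=True)
print(len(ctx.reasoning))          # buggy path forgets it: 0
```

&lt;p&gt;The failure is silent: every individual turn still succeeds, so nothing errors; only the cross-turn behaviour degrades.&lt;/p&gt;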

&lt;p&gt;None of this was a model change. The weights didn't move. The API was never in scope.&lt;/p&gt;

&lt;h2&gt;The lesson&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The model is not the product.&lt;/strong&gt; What your users experience is &lt;code&gt;model + harness + system prompt + tool wiring + context management + caching&lt;/code&gt;. Each layer has its own bugs. When someone says "Claude got worse," the weights are usually the last thing that changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API-layer products were unaffected.&lt;/strong&gt; If you're building directly against the Messages API, none of these bugs touched you. This is why "am I on Claude Code, or am I on the raw API?" matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Eval passed" ≠ "no regression."&lt;/strong&gt; The verbosity prompt passed Anthropic's initial evals. Only a broader ablation — removing lines one at a time — caught the 3% drop. Fixed eval suites miss behavioural drift; ablations catch it.&lt;/p&gt;

&lt;h2&gt;What to actually do&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On Claude Code?&lt;/strong&gt; Update to v2.1.116 or later; that release is past all three fixes. Usage limits were reset as an apology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On the API directly?&lt;/strong&gt; Nothing to do. Stay on whatever model you were on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipping your own harness on top of a frontier model?&lt;/strong&gt; Read the postmortem twice, then audit your prompt + caching + context-management pipeline for the same silent-failure modes. The bugs Anthropic described are exactly the ones every harness reinvents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The meta-lesson is boring and important: most of the quality variance lives between "the model" and "the thing your user sees." Ship good harnesses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;✏️ Drafted with KewBot (AI), edited and approved by Drew.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>postmortem</category>
    </item>
  </channel>
</rss>
