DEV Community: theAIGeek

MCP Is Not Replacing REST. It Is Replacing the Entire Mental Model of How Agents Use Tools.

theAIGeek — Tue, 28 Jul 2026 21:41:41 +0000

REST was designed for humans writing code to call services. You read the docs, you write the client, you ship it. That whole loop assumes a developer is in the middle, translating intent into HTTP calls. When you put an LLM in the driver's seat, that assumption breaks completely, and the cracks are showing up in every agent system I have seen in production.

What Actually Happened

Model Context Protocol (MCP), originally introduced by Anthropic, has crossed from "interesting spec" to "actual adoption pressure" faster than most protocol transitions I have watched. The pattern is now visible across the ecosystem: Cognizant just expanded their Anthropic partnership specifically to bring Claude-powered agents to enterprise workflows, and when you look at what those workflows require, MCP is not optional. You cannot bolt REST onto an agent loop and call it done.

The short version of what MCP is: a standardized, bidirectional protocol that lets LLMs discover, negotiate, and invoke tools without a human writing a custom client for each integration. The server advertises its capabilities. The model reads them. The model calls them. No glue code, no prompt-stuffed API docs, no fragile JSON schema passed in a system prompt and hoped for the best.

REST was never the problem. HTTP is fine. The problem was the integration layer that had to exist between "LLM knows what it wants" and "service exposes an endpoint." That layer has been killing agent reliability in production.

The Technical Detail That Matters

Here is the specific failure mode that MCP fixes. In a typical REST-based agent integration, you stuff tool descriptions into the context window. The model reads them, generates a function call, and you parse that back into an HTTP request. This works until it does not: the schema drifts, the model hallucinates a parameter name, the endpoint returns an unexpected shape, or you need the tool to push state back to the model mid-task.

MCP is session-oriented rather than a series of stateless request-response calls. Messages are JSON-RPC 2.0, a session opens with an initialization handshake in which client and server exchange capabilities, and on transports that hold the connection open the server can send notifications. The model can maintain state across multiple tool calls without you serializing and deserializing context between every round trip. For anything involving multi-step agent tasks, that is not a nice-to-have. It is the difference between an agent that can actually complete a workflow and one that loses the thread halfway through.

The other piece that matters architecturally: MCP separates capability discovery from capability invocation. The model can ask "what can you do?" before it commits to a plan. That changes how you build agent orchestration. Instead of hardcoding tool availability into your system prompt, you let the agent discover it dynamically. Your multi-tenant platform can serve different tool sets to different tenants from the same agent runtime, without rewriting prompts per customer.
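To make the discover-then-invoke split concrete, here is a toy sketch in Python. The method names tools/list and tools/call and the rough message shape follow the MCP spec's JSON-RPC conventions as I understand them. Everything else is an assumption for illustration: ToyServer, rpc, and the search_docs tool are hypothetical, there is no transport, and a real session would start with an initialize handshake that this skips. Treat it as a picture of the message flow, not as the SDK's API.

```python
import json


class ToyServer:
    """In-memory stand-in for an MCP-style server (hypothetical, no transport)."""

    def __init__(self):
        # One hypothetical tool, described the way a server would advertise it.
        self._tools = {
            "search_docs": {
                "description": "Search an internal document index.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
                "handler": lambda args: f"results for {args['query']!r}",
            }
        }

    def handle(self, raw: str) -> str:
        req = json.loads(raw)
        reply = {"jsonrpc": "2.0", "id": req.get("id")}
        method = req.get("method")
        if method == "tools/list":
            # Discovery: advertise names, descriptions, and input schemas.
            reply["result"] = {"tools": [
                {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
                for name, t in self._tools.items()
            ]}
        elif method == "tools/call":
            # Invocation: look the tool up by name and run it with the arguments.
            params = req.get("params", {})
            tool = self._tools.get(params.get("name"))
            if tool is None:
                reply["error"] = {"code": -32602, "message": "unknown tool"}
            else:
                text = tool["handler"](params.get("arguments", {}))
                reply["result"] = {"content": [{"type": "text", "text": text}], "isError": False}
        else:
            reply["error"] = {"code": -32601, "message": "method not found"}
        return json.dumps(reply)


def rpc(server, req_id, method, params=None):
    """Send one JSON-RPC request to the toy server and decode the reply."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.loads(server.handle(json.dumps(msg)))


server = ToyServer()
listed = rpc(server, 1, "tools/list")
tool_names = [t["name"] for t in listed["result"]["tools"]]
called = rpc(server, 2, "tools/call",
             {"name": "search_docs", "arguments": {"query": "refund policy"}})
```

The point of the split is that the agent runtime can read tool_names and the schemas before it commits to a plan, and nothing about the tool set lives in the system prompt.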

What This Means for Builders

If you are running a RAG pipeline that exposes retrieval as a tool, you are probably doing it via a function-calling schema today. That works, but you are carrying a maintenance burden every time your retrieval interface changes. MCP gives you a contract that the agent runtime and the retrieval server negotiate directly.

If you are building a multi-tenant AI platform, the capability discovery model is the real unlock. You stop thinking about tools as static config and start thinking about them as services the agent can introspect. Tenant A gets access to Salesforce tools. Tenant B gets Jira tools. The agent figures out what is available per session. You stop managing a matrix of prompt templates.
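A minimal sketch of what per-session tool scoping could look like on the server side, assuming a static catalog and a grant table keyed by tenant. The tool names, tenant ids, and the list_tools_for_session helper are all hypothetical; in a real deployment the grants would come from your entitlement system, and the returned list is what you would feed into the discovery response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical catalog of every tool the platform can expose.
CATALOG = {
    "salesforce.find_account": Tool("salesforce.find_account", "Look up a Salesforce account."),
    "jira.create_issue": Tool("jira.create_issue", "Create a Jira issue."),
    "docs.search": Tool("docs.search", "Search shared documentation."),
}

# Hypothetical grants: which tools each tenant's sessions may discover.
TENANT_GRANTS = {
    "tenant-a": {"salesforce.find_account", "docs.search"},
    "tenant-b": {"jira.create_issue", "docs.search"},
}


def list_tools_for_session(tenant_id: str) -> list[Tool]:
    """Return only the tools this tenant is granted; unknown tenants get nothing."""
    granted = TENANT_GRANTS.get(tenant_id, set())
    return [CATALOG[name] for name in sorted(granted) if name in CATALOG]
```

The same agent runtime serves both tenants. The only thing that changes per session is what discovery returns.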

If you are building agent frameworks, the session model means you can implement proper long-running task patterns, including progress updates, partial results, and cancellation, without hacking them on top of webhook callbacks and polling loops.

The practical risk right now: MCP servers vary wildly in quality. The spec is solid but the implementations are early. Plan for defensive handling on the client side.
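One form that defensive handling can take is checking a model-generated call against the tool's advertised input schema before sending it. The sketch below is deliberately small and makes assumptions: it only understands the "properties", "required", and "type" keys of a JSON Schema object, and it rejects argument names the schema does not list, which is stricter than JSON Schema's default but catches hallucinated parameter names. The check_arguments helper is hypothetical, not part of any SDK.

```python
# Maps JSON Schema primitive type names to Python types for a shallow check.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_arguments(input_schema: dict, arguments: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the call looks sane."""
    problems = []
    props = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        if name not in props:
            # Stricter than JSON Schema's default, on purpose.
            problems.append(f"unknown argument: {name}")
            continue
        type_name = props[name].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # no type declared, or one this sketch does not know
        # bool is a subclass of int in Python, so rule it out for numeric types.
        is_bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {type_name}")
    return problems
```

If the list comes back non-empty, feed the problems back to the model or fail the step loudly rather than sending a malformed call to a server of unknown quality.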

One Thing to Do Today

Pull down the MCP TypeScript SDK from the official repo at github.com/modelcontextprotocol/typescript-sdk and spin up the example server locally. Then write a minimal MCP client that connects to it, sends a tools/list request (the capability discovery call), and logs what comes back. That single exercise will reframe how you think about tool integration faster than any blog post will, including this one.

Follow along here for daily breakdowns of what is actually moving in AI engineering.

References

Cognizant and Anthropic expand their partnership to bring Claude to enterprise c, Anthropic News
OpenAI just open-sourced Codex Security, Hacker News
Inviting hard questions We're asking the public for their hardest questions abou, Anthropic News

Why "It Works" Is the Wrong Bar for AI-Generated Code in Agentic Systems

theAIGeek — Sun, 21 Jun 2026 14:37:17 +0000

The most dangerous line of code in your agentic pipeline is not the one that crashes. It is the one that runs fine in isolation, gets merged because it passed review, and then silently degrades your system's reliability at scale. AI-generated code is producing a lot of that second kind right now, and the engineering community is starting to name it.

What Actually Happened

Two conversations have been colliding in engineering circles lately. The first is about building reliable agentic AI systems, specifically the hard operational problem of making multi-step LLM workflows actually hold together in production. The second is a more personal one: experienced engineers describing the specific moment they reject AI-generated code even when it is technically correct and the tests pass.

These two conversations sound separate. They are not. They are describing the same failure mode from two different vantage points.

The reliability conversation focuses on system design: retry logic, fallback chains, observability hooks, deterministic checkpointing between agent steps. The code review conversation focuses on something harder to quantify: whether the code carries forward enough human understanding of the system to be safely modified six months from now by someone who was not in the room.

My take is that the second problem is actually upstream of the first. Unreliable agentic systems are often unreliable because the code composing them was accepted on the wrong criteria.

The Technical Detail That Matters

When an LLM generates code, it optimizes for local correctness. It solves the function signature you gave it. What it does not do is reason about the broader system contract: what happens when the upstream API returns a 429 mid-workflow, whether this retry helper will compose correctly with your existing circuit breaker, or whether the abstraction it chose will make the next feature easy or impossible.

That is not a model quality problem. That is a fundamental mismatch between what LLMs are optimizing for and what production systems actually need.

In agentic architectures specifically, this gets worse. Each agent node is a function composition point. If the code at any node lacks clear failure semantics, your orchestrator cannot make intelligent decisions about whether to retry, escalate, or abort. A function that swallows exceptions and returns a default value looks fine in unit tests. In a multi-step agent chain, it produces ghost completions: results that look successful but carry corrupted state forward.
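Here is a small sketch of the ghost-completion pattern next to a version with explicit failure semantics. All names are hypothetical: fetch_balance stands in for any upstream call and is hardwired to fail so the difference is visible, and StepResult is one possible shape for a result an orchestrator can route on, not a standard type.

```python
from dataclasses import dataclass
from typing import Optional


class UpstreamError(Exception):
    """Raised by the (hypothetical) upstream client on a failed call."""


def fetch_balance(account_id: str) -> float:
    # Stand-in for a real upstream call; always fails in this sketch.
    raise UpstreamError("429 rate limited")


def balance_swallowing(account_id: str) -> float:
    """Passes unit tests, but a failure is indistinguishable from a real zero."""
    try:
        return fetch_balance(account_id)
    except Exception:
        return 0.0  # ghost completion: looks like success downstream


@dataclass(frozen=True)
class StepResult:
    ok: bool
    value: Optional[float] = None
    error: Optional[str] = None
    retryable: bool = False


def balance_explicit(account_id: str) -> StepResult:
    """Same call, but the failure is visible and says whether a retry makes sense."""
    try:
        return StepResult(ok=True, value=fetch_balance(account_id))
    except UpstreamError as exc:
        return StepResult(ok=False, error=str(exc), retryable=True)
```

With the first version the next agent step receives 0.0 and carries on. With the second, the orchestrator sees ok=False and retryable=True and can decide what to do.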

The engineers rejecting working AI code are usually rejecting it for exactly this reason. The code does not make its failure modes legible. It does not signal what it owns, what it borrows, and what it explicitly does not handle.

What This Means for Builders

If you are building RAG pipelines or agent systems right now, the practical implication is this: your code review criteria need to be updated for an AI-assisted workflow.

"Does it pass tests" is not sufficient. The bar needs to be: does this code make its operational behavior legible to a human who will debug it at 2am? Does it make failure modes explicit rather than swallowed? Does it fit the existing error taxonomy of the system, or does it invent a new one?

For multi-tenant platforms, this is even more acute. A retrieval function that silently returns empty results instead of surfacing a quota error will produce tenant-visible quality degradation that looks like a model problem, not an infrastructure problem. That is an expensive bug to trace.

The reliability frameworks people are building for agentic systems (checkpointing, structured retries, explicit state machines between agent steps) only work if the code they are orchestrating is honest about what it does and does not guarantee. Bad abstractions defeat good orchestration.
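As a sketch of why typed failures matter to the orchestration layer, here is a tiny retry, escalate, or abort policy keyed on exception type. The exception classes, the Action enum, and the policy itself are assumptions for illustration, and a real runner would add backoff between attempts. Your own error taxonomy will differ.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    ABORT = "abort"


# Hypothetical error taxonomy shared by every agent step.
class RateLimited(Exception): pass
class QuotaExhausted(Exception): pass
class InvalidInput(Exception): pass


def decide(exc: Exception, attempt: int, max_attempts: int = 3) -> Action:
    """Map a typed failure to an orchestration decision."""
    if isinstance(exc, RateLimited):
        return Action.RETRY if attempt < max_attempts else Action.ESCALATE
    if isinstance(exc, QuotaExhausted):
        return Action.ESCALATE  # retrying will not help; someone must raise the quota
    if isinstance(exc, InvalidInput):
        return Action.ABORT     # the request itself is wrong
    return Action.ESCALATE      # unknown failures go to a human


def run_step(step, max_attempts: int = 3):
    """Run a step, retrying only when the policy says so; otherwise re-raise."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return step()
        except Exception as exc:
            if decide(exc, attempt, max_attempts) is not Action.RETRY:
                raise
```

None of this works if the step swallows its exception and returns a default, because decide never gets called.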

One Thing to Do Today

Pull up the last three pieces of AI-generated code that got merged into your agent or retrieval pipeline. Do not ask whether they work. Ask whether each function's failure behavior is explicit and legible. Check whether exceptions are surfaced or swallowed, whether error returns are typed or stringly-typed, and whether the abstraction boundary matches what your orchestrator actually needs to route on. If you find a function that returns a default on failure without logging or raising, that is your first refactor.

Follow along here for daily takes on what actually matters in AI engineering right now.
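If you want a head start on the audit above, the rough heuristic below uses Python's ast module to flag except blocks that return something without raising or calling anything. It is an assumption-heavy sketch: it treats any function call inside the handler as "probably logging", so it will miss some cases and it only applies to Python sources. The swallowing_handlers name is hypothetical.

```python
import ast


def swallowing_handlers(source: str) -> list[int]:
    """Return line numbers of except blocks that return without raising or calling anything."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Collect every AST node inside the handler body.
        body_nodes = [n for stmt in node.body for n in ast.walk(stmt)]
        has_return = any(isinstance(n, ast.Return) for n in body_nodes)
        has_raise = any(isinstance(n, ast.Raise) for n in body_nodes)
        has_call = any(isinstance(n, ast.Call) for n in body_nodes)
        if has_return and not has_raise and not has_call:
            flagged.append(node.lineno)
    return sorted(flagged)


SAMPLE = '''
def retrieve_quietly(query):
    try:
        return search(query)
    except Exception:
        return []

def retrieve_loudly(query):
    try:
        return search(query)
    except Exception as exc:
        log.warning("retrieval failed: %s", exc)
        raise
'''

flagged_lines = swallowing_handlers(SAMPLE)
```

A hit is a prompt for a human to look, not a verdict. Some defaults on failure are legitimate, as long as they are deliberate and visible.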

References

Introducing Claude Corps - Anthropic News
Introducing Claude Opus 4.8 - Anthropic News
Anthropic co-founder Chris Olah's remarks on Pope Leo XIV's encyclical "Magnifica humanitas" - Anthropic News

LLMs That Actually Pen Test: What Post-Training for Security Means for Your AI Stack

theAIGeek — Sat, 20 Jun 2026 21:26:55 +0000

Security researchers have spent years arguing that LLMs should be more helpful with offensive security tasks. The models kept refusing. Now someone just shipped a post-trained model that does the work instead of lecturing you about responsible disclosure, and it reportedly found thousands of real zero-days. That is not a headline you can ignore if you are building any kind of AI system that touches code, infrastructure, or automated pipelines.

What Actually Happened

Two things landed close together that, read side by side, tell a clear story about where AI security tooling is going.

First, the Argus Red team shipped a CLI-accessible model that they post-trained specifically for penetration testing. The pitch is simple: instead of a general-purpose model that refuses to explain how buffer overflows work, you get one that treats offensive security as the actual task. No jailbreaks, no prompt engineering gymnastics. The model was trained to do the job.

Second, there is a wave of reporting around Claude's Fable 5 (a research configuration of Claude) being used to find thousands of zero-days across real codebases. The implication is that when you remove the general-purpose safety floor and retrain or configure a model for a narrow, high-stakes domain, you get capability that base models will not give you.

Together these signal a real inflection point: domain-specific post-training for adversarial tasks is no longer a research curiosity. It is shipping.

The Technical Detail That Matters

Post-training here is doing a lot of work, and it is worth being precise about what that means architecturally.

General-purpose RLHF bakes in refusal behavior as a broad prior. The model learns that "anything that sounds like hacking" is in a category of things it should decline. That is not a capability limit; it is a behavior trained on top of the capability. Post-training for a specific domain can shift that prior without necessarily retraining from scratch. You fine-tune on domain-specific data, you adjust the reward signal to treat "correctly identifying and demonstrating a vulnerability" as the positive outcome, and you can do this on top of an existing base model.

The Argus Red approach appears to follow this pattern. They are not claiming a new architecture. They are claiming a different training objective applied to a capable base. The zero-day story with Claude Fable 5 is a different mechanism (it sounds more like a heavily prompted or configured deployment rather than a fine-tune), but the outcome is similar: a model that operates in the security domain without the general refusal behavior getting in the way.

The failure mode to watch here is scope collapse. A model post-trained to be maximally helpful for pen testing needs extremely tight deployment controls. If that same model ends up in a context where it is answering general user questions, you have a problem. The safety guardrails you removed were doing work in those other contexts even if they were annoying in the security context.

What This Means for Builders

If you are running a multi-tenant AI platform, this is a direct architectural concern. The emerging pattern is that you will have a portfolio of models: a general-purpose model for most tasks, and domain-specific post-trained models for high-stakes narrow domains. Your routing layer needs to understand which model is appropriate for which tenant and which request type.
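A minimal sketch of that routing decision, assuming two model tiers and a per-tenant entitlement list. The model identifiers, the tenant id, and the "pentest" request type are placeholders, not real product names. The property worth copying is that the specialized model is opt-in per tenant and everything else falls through to the general model.

```python
# Placeholder identifiers; substitute whatever your platform actually deploys.
GENERAL_MODEL = "general-model"
SECURITY_MODEL = "security-tuned-model"

# Hypothetical entitlement list: tenants cleared to use the security-tuned model.
SECURITY_ENABLED_TENANTS = {"tenant-redteam"}


def pick_model(tenant_id: str, request_type: str) -> str:
    """Route to the specialized model only for entitled tenants and matching requests."""
    if request_type == "pentest":
        if tenant_id not in SECURITY_ENABLED_TENANTS:
            raise PermissionError(f"{tenant_id} is not cleared for security workloads")
        return SECURITY_MODEL
    # Default path: a security-tuned model never answers general questions.
    return GENERAL_MODEL
```

Raising on an unentitled request, rather than silently downgrading to the general model, keeps the failure visible to whoever is debugging the tenant's workflow.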

For agent and MCP systems, the implication is more immediate. Security automation agents that can actually test infrastructure, not just describe tests, are now buildable with off-the-shelf components. That changes the threat surface for any system that accepts LLM-generated tool calls. If your MCP server exposes file system or network tools, and your agent framework routes to a security-capable model, you need to think hard about what that model will do with those tool permissions.
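One way to think hard about tool permissions is to put a deny-by-default gate between the model's tool call and the server. The sketch below assumes two model profiles, a hypothetical net.scan tool, and a lab network range that you control; all of these names and values are illustrative. It uses Python's ipaddress module to confirm the scan target is inside the lab range before the call goes through.

```python
import ipaddress

# Hypothetical allowlist: which tools each model profile may invoke at all.
ALLOWED_TOOLS = {
    "general": {"docs.search"},
    "security": {"docs.search", "net.scan"},
}

# Hypothetical lab range; scans outside it are refused regardless of profile.
LAB_NETWORKS = [ipaddress.ip_network("10.0.5.0/24")]


def authorize_call(profile: str, tool: str, arguments: dict) -> bool:
    """Deny by default; allow network tools only against lab targets."""
    if tool not in ALLOWED_TOOLS.get(profile, set()):
        return False
    if tool == "net.scan":
        try:
            target = ipaddress.ip_address(arguments.get("target", ""))
        except ValueError:
            return False  # missing or malformed target
        return any(target in network for network in LAB_NETWORKS)
    return True
```

The gate lives outside the model, so it holds no matter how capable or how permissive the model behind it is.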

For RAG pipeline builders, this is a reminder that retrieval context can activate capabilities. A security-tuned model retrieving exploit documentation from a knowledge base and then calling a code execution tool is a very different risk profile than a general model doing the same thing.

One Thing to Do Today

Go test the Argus Red CLI at argusred.com/cli against a CTF target or a lab environment you control. Do not just read about it. Actually watch what the model does versus what GPT-4o or Claude does with the same prompt. The capability delta is the thing you need to see firsthand before you decide how to think about model selection in your own security tooling or agent infrastructure.

Follow this blog for daily breakdowns of what is actually shipping in AI engineering.

References

Show HN: We post-trained a model that pen tests instead of refusing - HackerNews
Bootimus - A Self-Contained PXE and HTTP Boot Server - HackerNews
Where to Find the Colors Your Screen Can't Show You - HackerNews

The AI Hardware Stack Is Being Rebuilt From the Wafer Up

theAIGeek — Sat, 20 Jun 2026 20:06:54 +0000

Before a single H100 ever runs a training job, it has to survive one of the most constrained supply chains in industrial history. Every serious AI accelerator (H100, B200, Cerebras WSE-3) starts its life on a TSMC wafer, gets patterned by an ASML EUV machine, and then waits in a queue for CoWoS packaging capacity that is sold out through 2026. Understanding that stack matters if you are building on top of it, because the constraints at the bottom determine what compute costs, what latency looks like, and which architectural bets actually pay off.

The Factory Floor Nobody Talks About

TSMC holds 72% of advanced chip manufacturing. That is not a market share number you diversify around quickly. And ASML sits underneath that with a near-monopoly on EUV lithography, the machines that print sub-5nm features. No ASML machines means no advanced chips, full stop. Every H100 and B200 in existence ran through both companies.

But the real chokepoint right now is not transistors. It is CoWoS packaging, the process that places stacks of High Bandwidth Memory next to the compute die on a shared silicon interposer. HBM is what gives these chips their memory bandwidth, and without CoWoS you cannot build them. That packaging capacity is sold out through 2026. TSMC is spending $52-56 billion in capex in 2026 alone, with 70-80% going toward advanced nodes, and it is still not enough to clear the queue.

AI accelerator wafer demand is up 11x between 2022 and 2026. That is not a demand spike. That is a structural shift. The shortage is not a supply hiccup that clears in two quarters. Plan accordingly.

Why GPUs Are Overkill for Inference

NVIDIA dominates AI training with the H100 and B200. That dominance is real and it is deserved for the workload it was designed for. Training is a throughput problem. You want to run massive matrix multiplications in parallel across a huge cluster, and GPU architecture with HBM is genuinely excellent at that.

Inference is a different problem. You are generating tokens sequentially, moving activations around constantly, and the latency per token matters more than raw FLOP throughput. When you run inference on a GPU cluster, you are paying for training-optimized silicon and spending a lot of cycles on inter-chip communication overhead that adds latency without adding value.

The growing recognition in the industry is that inference needs its own architecture, not a repurposed training chip.

What Cerebras Actually Built

Cerebras took one of the most contrarian bets in hardware: build one chip the size of an entire silicon wafer. The WSE-3 has 4 trillion transistors, 900,000 cores, and 21 PB/s of memory bandwidth. The architectural insight is simple. If everything is on one die, you eliminate inter-chip communication entirely. There is no network fabric moving activations between GPUs. It is just one enormous on-chip compute surface.

The benchmark results are hard to dismiss. The WSE-3 is 21x faster than the NVIDIA B200 on Llama 3 70B reasoning workloads. It hits 2,500 tokens per second per user on Llama 4 Maverick at 400 billion parameters, more than double the B200. SemiAnalysis pegs the cost per inference token at 32% lower than B200.

OpenAI clearly took this seriously. In December 2025 they signed a $20B+ Master Relationship Agreement with Cerebras for 750 MW of inference capacity, expandable to 2 GW. Codex-Spark went live on Cerebras infrastructure in February 2026. When OpenAI is diversifying its inference supply away from NVIDIA, that is a signal worth paying attention to.

What This Means for Builders

If you are running a RAG pipeline, an agent framework, or a multi-tenant LLM platform, compute costs are already your biggest line item and latency is your primary SLA lever. The Cerebras numbers matter here specifically because multi-tenant inference platforms live or die on tokens-per-second-per-user at scale. A 2x throughput improvement at 32% lower cost per token changes your unit economics in a meaningful way.
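To see what those two numbers do to a budget, here is a back-of-the-envelope calculation. The 32% cost reduction and the 2,500 tokens per second come from the figures quoted above. The baseline price of $2.00 per million tokens, the 5 billion tokens per month, the 500-token response, and the 1,000 tokens per second baseline are made-up inputs, so swap in your own.

```python
def monthly_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Total monthly spend for a given token volume and unit price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response of the given length at a steady decode rate."""
    return output_tokens / tokens_per_second


TOKENS_PER_MONTH = 5_000_000_000   # assumed workload
BASELINE_PRICE = 2.00              # assumed USD per million tokens

baseline_usd = monthly_cost(TOKENS_PER_MONTH, BASELINE_PRICE)
discounted_usd = monthly_cost(TOKENS_PER_MONTH, BASELINE_PRICE * (1 - 0.32))
monthly_savings = baseline_usd - discounted_usd   # about 3,200 USD on these inputs

baseline_latency = response_seconds(500, 1000)    # assumed baseline decode rate
faster_latency = response_seconds(500, 2500)      # rate quoted in the post
```

On these assumed inputs the bill drops from 10,000 to roughly 6,800 USD per month, and a 500-token answer streams in about 0.2 seconds instead of 0.5.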

The more important shift is architectural. You should not be modeling your infrastructure around a single compute provider. The inference layer is fracturing. NVIDIA still owns training. But for latency-sensitive inference workloads, purpose-built silicon is catching up fast. Design your deployment layer to be provider-agnostic now, before you are locked in.
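A provider-agnostic deployment layer can start as one narrow interface plus an ordered fallback. The sketch below uses typing.Protocol for the interface. EchoProvider and FailingProvider are fakes that stand in for real vendor clients, and the interface itself (a single complete method) is an assumption about what your application needs, not any vendor's API.

```python
from typing import Protocol, Sequence


class InferenceProvider(Protocol):
    """The only surface the application is allowed to depend on."""
    name: str

    def complete(self, prompt: str, max_tokens: int) -> str: ...


class EchoProvider:
    """Fake provider that echoes the prompt; stands in for a real client."""
    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[{self.name}] {prompt[:max_tokens]}"


class FailingProvider:
    """Fake provider that is always down."""
    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str, max_tokens: int) -> str:
        raise ConnectionError("capacity unavailable")


def complete_with_fallback(providers: Sequence[InferenceProvider],
                           prompt: str, max_tokens: int = 256) -> str:
    """Try providers in order; surface every failure if none succeeds."""
    errors = []
    for provider in providers:
        try:
            return provider.complete(prompt, max_tokens)
        except Exception as exc:
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Swapping or reordering vendors then becomes a configuration change rather than a rewrite, which is the whole point of doing this before you are locked in.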

One Thing to Do Today

Pull your current inference cost per 1,000 tokens and your p95 latency from the last 30 days, then run the same prompt workload against Cerebras Cloud on a free tier or trial. Put the numbers side by side. Do not trust the benchmarks blindly. Run your actual workload.
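If your logs give you per-request latencies plus total spend and token counts, the two baseline numbers take a few lines to compute. This sketch assumes the nearest-rank definition of p95, which is only one of several percentile conventions, and the function names are hypothetical. Use the same definition on both sides of the comparison.

```python
def p95(latencies_ms) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("need at least one latency sample")
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """Blended cost per 1,000 tokens over the measurement window."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd * 1000 / total_tokens
```

Run both functions over the last 30 days of production traffic, then over the same prompts replayed against the alternative provider, and compare like with like.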

Follow along here for daily posts on what is actually changing in AI engineering infrastructure, and what it means for the systems you are building.

Cloudflare's Ephemeral Agent Accounts Are a Real Solution to a Real Identity Problem

theAIGeek — Sat, 20 Jun 2026 19:55:25 +0000

The hardest part of building agentic systems isn't the LLM calls — it's giving agents a coherent identity without creating a security nightmare. Cloudflare just shipped something that addresses this directly, and it's more architecturally interesting than the headline suggests.

What Actually Happened

Cloudflare announced temporary accounts for AI agents: short-lived, scoped Cloudflare accounts that an agent can spin up, use, and destroy — all programmatically. The use case is agents that need to do real internet-facing work: proxying traffic, making external API calls, handling DNS, running Workers — without those operations being tied to your primary Cloudflare account or persisting beyond the task lifecycle.

This isn't just "create a token with a TTL." It's a full account-level isolation boundary, provisioned via API, with its own Workers namespace, its own DNS scope, and its own billing context. The account tears itself down when the agent is done.

The Technical Detail That Matters

The key design choice here is account-level isolation rather than token-level scoping. That distinction matters enormously.

Token-scoped access (what most teams default to with API keys) limits what an agent can do but doesn't isolate the blast radius of what it does. If an agent with a scoped token misbehaves — misroutes traffic, burns through rate limits, gets its IP flagged — the damage lands on your account. Your reputation, your quotas, your relationship with downstream services.

Account-level isolation means the ephemeral account is the blast radius. You can let an agent run Workers, proxy traffic through Cloudflare's edge, or spin up DNS records, and if it does something stupid or gets compromised, it's contained. The parent account is untouched. The ephemeral account expires and gets GC'd.
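The lifecycle described above maps naturally onto a context manager: provision, hand the account to the agent, tear down no matter what happens. To be clear about assumptions, FakeAccountAPI below is an in-memory stand-in that I made up for illustration. It is not Cloudflare's API, and the create and delete method names are hypothetical.

```python
import itertools
from contextlib import contextmanager


class FakeAccountAPI:
    """In-memory stand-in for an account-provisioning API (NOT a real client)."""

    def __init__(self):
        self.live = set()
        self._ids = itertools.count(1)

    def create(self, parent: str) -> str:
        account_id = f"{parent}-tmp-{next(self._ids)}"
        self.live.add(account_id)
        return account_id

    def delete(self, account_id: str) -> None:
        self.live.discard(account_id)


@contextmanager
def ephemeral_account(api: FakeAccountAPI, parent: str):
    """Provision an isolated account for one task and always tear it down."""
    account_id = api.create(parent)
    try:
        yield account_id
    finally:
        # Runs on success and on failure, so a crashed agent leaves nothing behind.
        api.delete(account_id)
```

Usage would look like `with ephemeral_account(api, "acme") as account_id:` around the agent's task, so the blast radius exists only for as long as the block runs.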

This is the same reasoning behind process isolation in operating systems. You don't give every subprocess root access with a limited scope — you run it in its own process with its own uid. Cloudflare is applying that model to agent infrastructure.

The billing isolation is also worth noting. If you're running agents on behalf of customers (multi-tenant), each agent's resource consumption can be tracked and attributed at the account level, not just estimated from aggregate logs. That's actually useful for cost attribution in a multi-tenant platform.

What This Means for Builders

If you're building a multi-tenant AI platform — the kind where each customer's agent does work on their behalf — you've probably already hit the identity problem. You're either running everything under one account (bad: shared blast radius, no attribution) or manually managing per-customer credentials (bad: operational overhead, credential sprawl). Cloudflare's ephemeral accounts are a cleaner third option for any workloads that touch Cloudflare's surface area.

For agent and MCP systems: most agent frameworks today treat network identity as an afterthought. The agent gets your API key, and you hope it doesn't do anything wild. Ephemeral accounts give you a way to hand an agent a real, functional identity with time-bounded scope — then revoke it by just not renewing it. That's a much better control plane than trying to revoke individual tokens mid-flight.
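Revocation by not renewing is essentially a lease. Here is a minimal sketch of that idea, with the clock passed in explicitly so the behavior is easy to test. The Lease class and its fields are hypothetical and say nothing about how Cloudflare implements expiry.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Time-bounded identity: valid until expires_at unless renewed first."""
    account_id: str
    expires_at: float  # seconds on whatever clock the caller uses

    def is_active(self, now: float) -> bool:
        return now < self.expires_at

    def renew(self, now: float, ttl: float) -> None:
        """Extend the lease; an expired lease cannot be revived."""
        if not self.is_active(now):
            raise ValueError(f"lease for {self.account_id} has expired")
        self.expires_at = now + ttl
```

The control plane only has to decide whether to call renew. Doing nothing is the safe default, which is the opposite of chasing individual tokens to revoke them.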

For RAG pipelines the connection is less direct, but if your retrieval agents are hitting external services, caching responses at the edge, or proxying to data sources through Cloudflare Workers (which is a real pattern for rate-limit management), you now have a path to per-task or per-tenant isolation without standing up new infrastructure.

The deeper implication: this is Cloudflare betting that agents are first-class internet citizens, not just bots. They need accounts, not just keys. That's the right mental model, and I expect other infra providers to follow.

One Thing to Do Today

Read the Cloudflare Temporary Accounts API docs and map them against your current agent identity model. Specifically: what's your blast radius if one of your agents gets compromised or runs amok? If the answer is "my whole account," that's the thing to fix, and ephemeral accounts are worth prototyping against even if you don't deploy them yet.

Follow along for daily takes on what's actually moving in AI infrastructure — no hype, just what matters for builders.