DEV Community: Aamer Mihaysi

Async Batching Is the Real Latency Win Nobody's Talking About

Aamer Mihaysi — Fri, 15 May 2026 09:00:11 +0000

Synchronous batching is a throughput hack that became a design constraint. Hugging Face's latest work on asynchronous continuous batching shows why the distinction matters more than the batch size.

Most inference servers treat batching as a queuing problem. Requests pile up, you wait for N items or a timeout, then you process them together. This works until it doesn't—when your tail latency spikes because one long request blocks the entire batch, or when your GPU sits idle waiting for that last straggler to arrive.

The move to continuous batching helped. Instead of fixed windows, you could add and evict requests dynamically. But it was still fundamentally synchronous: every forward pass had to wait for the slowest sequence in the batch to complete its decode step. The GPU utilization looked good on dashboards, but the latency distribution told a different story.

The Async Shift

Asynchronous continuous batching decouples the scheduling loop from the forward pass. Requests enter a pool, the scheduler makes decisions about what to run, and the GPU executes independently. This sounds subtle but changes everything about how you think about inference throughput.

First, you can pipeline. While the GPU is working on step T, the scheduler is already preparing the batch for step T+1. The overhead doesn't disappear, but it overlaps with useful work. On modern GPUs with async copy engines, this matters more than most benchmarks capture.

Second, you can preempt. Not in the OS sense, but in the ability to yank a completed sequence from the batch mid-flight and replace it with a fresh one. The synchronous model forced you to wait for the entire batch to finish before anyone could leave. Async lets you maintain a full batch even when individual sequences have wildly different lengths.
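
To make the two points above concrete, here is a minimal simulation, not any server's actual scheduler. The names `gpu_step`, `plan_next`, and `serve` are hypothetical, and the sketch assumes the scheduler already knows how many tokens each request needs, which a real server only learns after sampling. It shows the scheduler planning step T+1 while step T is in flight, and freed slots being refilled immediately.

```python
import asyncio
from collections import deque


async def gpu_step(batch):
    """Stand-in for one forward pass: each sequence in the batch emits one token."""
    await asyncio.sleep(0.001)  # simulated device time
    return list(batch)


async def plan_next(batch, remaining, waiting, max_batch):
    """Plan step T+1 while step T is in flight (assumes known output lengths)."""
    # Sequences that still need tokens after the in-flight step keep their slot.
    nxt = [rid for rid in batch if remaining[rid] > 1]
    # Freed slots are refilled right away from the waiting pool.
    while waiting and len(nxt) < max_batch:
        rid, length = waiting.popleft()
        remaining[rid] = length
        nxt.append(rid)
    return nxt


async def serve(requests, max_batch=2):
    """requests: list of (request_id, tokens_to_generate), each length >= 1."""
    waiting = deque(requests)
    remaining = {}
    batch = await plan_next([], remaining, waiting, max_batch)
    steps, finished_at = 0, {}
    while batch:
        # Overlap: the device runs step T while the scheduler prepares step T+1.
        done, nxt = await asyncio.gather(
            gpu_step(batch), plan_next(batch, remaining, waiting, max_batch)
        )
        steps += 1
        for rid in done:
            remaining[rid] -= 1
            if remaining[rid] == 0:
                finished_at[rid] = steps
        batch = nxt
    return steps, finished_at


if __name__ == "__main__":
    reqs = [("a", 1), ("b", 4), ("c", 2), ("d", 1)]
    print(asyncio.run(serve(reqs, max_batch=2)))
```

With these four requests and two slots, the loop finishes in 4 steps. Fixed batches of two would need 4 + 2 = 6, because the first batch waits on its longest member before anyone new can enter.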

Why This Matters for Agents

Agent workloads break traditional batching assumptions. Tool calls introduce non-deterministic latency. A request might pause for 500ms waiting for a search result, then resume with a burst of generation. Synchronous batching either holds the slot (wasting GPU memory) or evicts the request (paying recompute costs). Neither is acceptable at scale.

Async batching treats these pauses as first-class citizens. The request steps aside, the GPU keeps working on other sequences, and the scheduler brings it back when the tool responds. The memory stays allocated, but the compute doesn't stall.

This is particularly relevant for the emerging class of "always-on" agents that maintain long-running sessions. You can't batch these traditionally—they're perpetual. But you can interleave them with short-turnaround requests if your scheduler understands async completion.

The Implementation Reality

Hugging Face's TGI and vLLM have both moved toward async scheduling, though the implementations differ. TGI uses a dedicated scheduling thread that runs ahead of the GPU, while vLLM's recent iterations push more of the async logic into the CUDA graph itself. The tradeoffs are familiar: thread overhead versus kernel launch latency, complexity versus control.

What both approaches acknowledge is that the synchronous abstraction was a convenience, not a requirement. The hardware has been capable of async execution for years. The software is catching up.

The Takeaway

If you're running inference at scale, look at your tail latency percentiles, not your average throughput. If p99 is more than 3x your median, you're probably suffering from synchronous batching artifacts. Async continuous batching won't fix everything—memory bandwidth is still a bottleneck, and attention costs don't disappear—but it removes a class of scheduling-induced latency that has no business existing in 2026.
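
A quick way to run that check on your own request logs, as a rough sketch. The function name `tail_ratio` and the sample latencies are made up for illustration; the 3x threshold is the rule of thumb from the paragraph above, not a standard.

```python
import statistics


def tail_ratio(latencies_ms):
    """Return (median, p99, p99 / median) for a list of request latencies."""
    median = statistics.median(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]
    return median, p99, p99 / median


# Illustrative data: mostly fast requests with two slow stragglers.
sample = [100] * 98 + [500, 900]
median, p99, ratio = tail_ratio(sample)
if ratio > 3:
    print(f"p99 is {ratio:.1f}x the median: check for batching artifacts")
```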

The best part: for many workloads, this is a software upgrade, not a hardware purchase. Your A100s or H100s get immediately more useful when the scheduler stops waiting for permission to work.

DeepSeek-V4: Finally, a Context Window Built for Agents

Aamer Mihaysi — Thu, 14 May 2026 09:06:24 +0000

Most long-context models are benchmarks in search of a use case. DeepSeek-V4 is different. It is built for the one workload that actually needs a million tokens: agents running long-horizon tasks.

The specs are straightforward. Two MoE checkpoints: V4-Pro at 1.6T total parameters with 49B active, and V4-Flash at 284B total with 13B active. Both ship with a 1M-token context window. But the headline is not the window size. It is what happens to inference cost as you use it.

At 1M tokens, V4-Pro requires 27% of the single-token FLOPs of V3.2 and 10% of its KV cache memory. V4-Flash drops further: 10% of the FLOPs, 7% of the KV cache. Against a standard grouped-query attention baseline, V4 uses roughly 2% of the cache size. These are not incremental gains. They are the difference between a demo and a production deployment.

Hybrid Attention

The architecture splits attention into two mechanisms that alternate across layers.

Compressed Sparse Attention (CSA) compresses KV entries 4x using softmax-gated pooling, then runs a lightning indexer in FP4 to select top-k blocks per query. A sliding window handles the most recent uncompressed tokens.

Heavily Compressed Attention (HCA) goes further: 128x compression, then dense attention over the compressed stream. The compression is aggressive enough that dense attention becomes cheap.

Layers alternate between CSA and HCA. Storage uses FP8 for most KV entries, BF16 only for RoPE dimensions.
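
To give a feel for what "softmax-gated pooling" can mean, here is a toy sketch in plain Python. It is not DeepSeek's implementation: the function `gated_pool`, the per-entry gate scores, and the fixed block size are all assumptions made for illustration. Each block of `ratio` consecutive KV vectors collapses into one vector, weighted by a softmax over that block's gate scores.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_pool(entries, gate_scores, ratio=4):
    """Compress every `ratio` consecutive vectors into one softmax-weighted average."""
    pooled = []
    for start in range(0, len(entries), ratio):
        block = entries[start:start + ratio]
        weights = softmax(gate_scores[start:start + ratio])
        dim = len(block[0])
        pooled.append(
            [sum(w * vec[d] for w, vec in zip(weights, block)) for d in range(dim)]
        )
    return pooled


# Eight 2-d entries compressed 4x into two.
kv = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0], [7.0, 0.0]] + [[0.0, 2.0]] * 4
print(gated_pool(kv, gate_scores=[0.0] * 8))
```

With equal gate scores this reduces to a plain block average; a learned gate would let one entry in the block dominate the compressed slot.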

What Actually Changes for Agents

Interleaved thinking across tool calls. V3.2 discarded reasoning traces when a new user message arrived. For multi-turn agent workflows, this meant the model lost accumulated state. V4 preserves reasoning content across user message boundaries when tool calls are present.

Tool-call schema with dedicated tokens. V4 introduces a DSML special token and an XML-based tool-call format. This removes a class of JSON escaping failures that plague string-based tool calls.

DSec: a sandbox built for RL rollouts. The agent behavior was trained with RL against real tool environments. DeepSeek Elastic Compute exposes four execution substrates: function calls, containers, microVMs (Firecracker), and full VMs (QEMU).

The Numbers

Terminal Bench 2.0: 67.9
SWE Verified: 80.6 resolved
MCPAtlas Public: 73.6
Toolathlon: 51.8

V4-Pro-Max hits a 67% pass rate on DeepSeek's internal R&D coding benchmark, versus 47% for Sonnet 4.5 and 70% for Opus 4.5.

Long-context retrieval holds at 0.59 accuracy on MRCR 8-needle at 1M tokens.

The Real Test

V4-Pro is at parity with frontier closed models on agent tasks. The open question is whether the community's tool harnesses adapt to the DSML schema and whether the interleaved thinking gains transfer to out-of-domain agent frameworks.

The model is on the Hub. The architecture is documented. The sandbox is described. What happens next depends on whether the ecosystem builds around these primitives or ignores them in favor of the next benchmark chase.

EMO: Mixture-of-Experts That Actually Behaves Like One

Aamer Mihaysi — Thu, 14 May 2026 03:29:49 +0000

Most MoE models are just big transformers with a traffic cop attached. The router directs tokens to different experts, sure, but ask for just the code experts and the whole thing falls apart. That's not modularity. That's sharding with extra steps.

The problem isn't that MoE doesn't work. It's that the experts don't specialize where it matters. Open up a standard MoE and you'll find one expert handling prepositions, another managing punctuation, a third dealing with numbers. The specialization is lexical, not semantic. When you try to extract just the "math" capability, every token still needs access to most of the experts anyway. The promise of selective deployment remains theoretical.

EMO changes this by making modularity a first-class training objective rather than a hoped-for emergent property.

The insight is simple: tokens from the same document usually belong to the same domain. So EMO constrains all tokens in a document to route through a shared pool of experts. The router learns to identify which expert subsets belong together because the training signal forces it to. Documents about code activate one cluster. Documents about biology activate another. The specialization emerges from the data, not from hand-labeled categories.
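
One way to picture the constraint, as a sketch only. The post does not spell out EMO's routing rule, so everything here is an assumption: `route_document` picks the shared pool from the document's mean router logits, then lets each token choose its top-k experts inside that pool.

```python
def route_document(token_logits, pool_size, top_k):
    """token_logits: one list of per-expert router logits for each token in a document."""
    n_experts = len(token_logits[0])
    # Document-level score per expert: mean router logit over all tokens.
    doc_scores = [
        sum(tok[e] for tok in token_logits) / len(token_logits)
        for e in range(n_experts)
    ]
    pool = sorted(range(n_experts), key=lambda e: doc_scores[e], reverse=True)[:pool_size]
    # Each token still picks its own top-k, but only among the shared pool.
    routes = [
        sorted(pool, key=lambda e: tok[e], reverse=True)[:top_k]
        for tok in token_logits
    ]
    return sorted(pool), routes


logits = [
    [2.0, 1.0, 0.0, -1.0],
    [1.5, 2.0, 0.0, 3.0],   # this token would prefer expert 3 on its own
    [2.0, 1.5, 0.0, -2.0],
]
print(route_document(logits, pool_size=2, top_k=1))
```

The second token's favorite expert is outside the document's pool, so it gets routed to its best option inside the pool instead. That is the pressure that pushes experts toward document-level specialization.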

This matters because it enables something MoE was supposed to deliver all along: composable deployment. EMO lets you run inference with just 12.5% of the experts and retain near full-model performance on domain-specific tasks. For a 14B parameter model with 1B active parameters, that's meaningful. You can serve capabilities independently without loading the entire weight matrix into memory.

The results are striking. On coding benchmarks, an EMO subset outperforms full-model baselines from comparable architectures. On mathematical reasoning, the same pattern holds. The experts actually specialize in capabilities, not token patterns. When you isolate the "code" experts, you get code generation. When you isolate the "math" experts, you get mathematical reasoning. The mapping is reliable enough to build around.

This is where EMO gets interesting for production systems. Most MoE deployments still require the full model because expert selection is unstable across contexts. A prompt that starts as a coding question might drift into natural language explanation mid-generation, activating a different expert set and degrading output quality. EMO's document-level routing constraint creates coherence. The model commits to an expert pool for the duration of the context.

The architectural implications go further. EMO suggests we've been thinking about MoE backwards. The standard approach assumes we need a gating mechanism to distribute load across parallel experts. But what we actually need is a routing mechanism that learns to cluster capabilities so we can deploy them selectively. The goal isn't parallelization. It's factorization.

There's a cost, of course. EMO requires global load balancing across documents rather than local balancing within batches. The training infrastructure is more complex. The router has harder constraints to satisfy. But the tradeoff is worth it for anyone actually trying to deploy large models efficiently.

The broader point is about how we build AI systems. We've spent years assuming that scale would automatically produce structure—that a trillion parameters would naturally organize into useful abstractions. It doesn't. Structure has to be trained for, not hoped for. EMO is a reminder that architectural decisions during pretraining matter more than parameter count for determining what a model can actually do.

For practitioners, EMO offers a path toward truly modular AI infrastructure. Instead of deploying monolithic models and paying for capabilities you don't use, you could compose expert subsets for specific workloads. The same base model serves code generation, mathematical reasoning, and biomedical QA, but each deployment loads only the relevant experts. Memory costs drop. Latency improves. The economics change.

Whether this becomes standard practice depends on whether the training recipe generalizes to larger scales. EMO's results are on a 14B parameter model. The question is whether the same document-level routing constraints produce coherent expert specialization at 100B parameters and beyond. If they do, MoE might finally deliver on its original promise.

Either way, EMO makes one thing clear: modularity isn't something you get for free. It's something you train for.

TPUs for the Agentic Era: Hardware Finally Catching Up to the Workload

Aamer Mihaysi — Thu, 14 May 2026 03:25:10 +0000

Google's announcement of two new TPU variants — the 8T for training and 8I for inference — isn't just another hardware refresh. It's an admission that the workloads we've been throwing at AI infrastructure have outgrown the general-purpose designs we've been using.

The agentic era demands something different.

The Mismatch We've Been Ignoring

For the past two years, we've been building agents that reason, plan, and execute across multiple steps. Each agent loop involves inference, tool calls, context retrieval, and state updates. Yet we've been running these workloads on hardware optimized for batch training jobs — massive parallel matrix multiplications with predictable memory access patterns.

Agentic inference looks nothing like that. It's bursty, latency-sensitive, and memory-bandwidth constrained. Context windows balloon. KV caches fragment. The typical agent trace looks like a sawtooth pattern of compute spikes followed by idle waiting on external tools.

Running this on training-optimized hardware is like using a freight train for city commuting.

What the Split Actually Means

The 8T (training) doubles down on what TPUs already do well: dense matrix operations, large batch sizes, and gradient synchronization across chips. If you're training the next foundation model, this is your chip.

The 8I (inference) is where it gets interesting. Higher memory bandwidth per core, lower latency activation paths, and what Google calls optimized batching for variable-length sequences. Translation: it handles the messy, uneven traffic patterns of real-world agent deployments without choking.

The split acknowledges what many of us have known but few hardware vendors admit: training and inference are different workloads with different constraints. Pretending one architecture serves both was always a compromise.

The Real Impact on Agent Architecture

Cheaper inference changes how you design agents. When latency drops and throughput rises, suddenly multi-step reasoning chains become viable. You can afford to let an agent iterate, backtrack, and explore without watching your inference budget evaporate.

This shifts the bottleneck. The constraint stops being "can I afford to run this agent?" and becomes "can I design an agent that uses the compute effectively?"

That's a harder problem. But it's the right one to be solving.

The Broader Pattern

NVIDIA's been making similar moves with their inference-optimized SKUs. Startups like Groq and Cerebras built their entire thesis on this gap. The industry is converging on a truth: the inference workload for agents is distinct enough to warrant purpose-built silicon.

Google's dual-TPU strategy validates this shift. The question now is whether your infrastructure is ready to take advantage of it.

Because the hardware is finally here. What you build on it is up to you.

MoE Architectures Keep Solving the Wrong Problem

Aamer Mihaysi — Wed, 13 May 2026 09:03:53 +0000

Emergent modularity sounds like a feature. In practice, it's usually a band-aid for training instability we refuse to name.

AllenAI's EMO work has people talking about "pretraining for emergent modularity" as if it's a design choice. It's not. It's the system compensating for the fact that we've scaled dense transformers to the point where gradient updates interfere destructively across unrelated capabilities. The experts don't emerge because they're elegant. They emerge because the alternative is a 300B parameter model that forgets how to count while learning French verb conjugation.

I've shipped MoE systems in production. The pitch is always the same: sparse activation means efficiency, gated routing means specialization, and your inference costs stay manageable while capacity scales. The reality is more complicated. You get efficiency at the cost of predictability. You get capacity at the cost of debugging nightmares when your router decides that code completion and poetry generation should share the same expert at 2am on a Saturday.

The real issue isn't whether MoEs work. They do. The issue is that we're treating the symptom—interference across tasks—instead of the disease. We keep building bigger models with more parameters, then act surprised when they exhibit catastrophic forgetting and gradient conflicts. MoEs are a mitigation strategy masquerading as architecture.

What's interesting about the EMO approach is the acknowledgment that expert specialization isn't automatic. Most MoE implementations assume that if you create enough experts and train long enough, specialization will magically appear. Sometimes it does. Often you get "super-experts" that handle everything, dead experts that never activate, or weird load imbalances that require auxiliary loss terms and constant babysitting. The pretraining objective in EMO explicitly encourages modularity, which is a more honest framing than pretending the problem solves itself.
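
For readers who haven't met those auxiliary loss terms, here is one common form, sketched in plain Python: the Switch Transformer-style balancing loss, which multiplies each expert's share of dispatched tokens by its mean router probability. This is an illustration of the babysitting, not EMO's objective, and the function name `load_balance_loss` is my own.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """router_probs: per-token probability lists; assignments: chosen expert per token."""
    n_tokens = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i (0.0 means a dead expert).
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # p[i]: mean router probability given to expert i.
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    # Equals 1.0 when load and probability are perfectly uniform; grows as routing collapses.
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], n_experts=2)
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]], [0, 0], n_experts=2)
print(balanced, collapsed)
```

The collapsed case, where one "super-expert" takes every token, scores higher than the balanced one, which is the signal the auxiliary term adds to the training loss.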

But here's what gets left out of the conversation: MoEs trade training compute for inference complexity. You still train the full parameter count. You just hope that at serving time, only a fraction activates per token. This works beautifully until your router encounters an edge case it wasn't trained on, or until latency requirements force you to cap the number of experts you can consult per step. Suddenly your "efficient" 8x7B model is hitting memory bandwidth limits that a dense 70B model handles gracefully.

The broader pattern here is that we're optimizing around hardware constraints instead of rethinking what we're actually building. MoEs exist because we can't train 1T parameter dense models efficiently. They don't exist because they're the best conceptual solution to multi-task learning. They're a compression technique disguised as an architectural innovation.

Does this mean you shouldn't use MoEs? Absolutely not. In resource-constrained environments, they're often the right call. But go in with clear eyes. You're not getting "emergent modularity" as a free lunch. You're buying into a system where routing decisions happen in milliseconds based on patterns that may or may not align with your actual task boundaries. Where debugging why a particular token got routed to expert 7 instead of expert 3 requires visualizing attention patterns across 64 layers. Where the efficiency gains you calculated on paper evaporate when real traffic patterns don't match your training distribution.

The next frontier isn't bigger MoEs. It's figuring out why we need them in the first place. If we could train dense models without interference, without the gradient conflicts that make MoEs necessary, would anyone choose the complexity? Probably not. The fact that emergent modularity is considered a win tells you everything about the state of the field. We're celebrating our workarounds.

What's actually needed is a fundamental rethink of how we structure parameter spaces. MoEs are a local optimum. They're good enough that we stop looking for something better. But the history of ML is littered with good-enough solutions that persisted decades past their expiration date because they worked well enough to ship.

Ship MoEs if you need to. Just don't mistake the workaround for the destination.

MachinaCheck: Manufacturing Agents That Actually Ship

Aamer Mihaysi — Mon, 11 May 2026 09:03:52 +0000

Most manufacturing workflows still treat design and production as separate conversations. An engineer models a part; a machinist figures out how to make it. The handoff is where things break—tolerances get reinterpreted, capabilities get assumed, and 'should be machinable' becomes a costly trial-and-error exercise.

MachinaCheck is a multi-agent system that sits between CAD and CNC. It doesn't just validate G-code; it interrogates the design itself. One agent parses the CAD geometry, another queries material constraints, a third simulates toolpaths on AMD MI300X hardware. They argue until they agree, and only then does a part reach the shop floor.

This matters because agentic infrastructure is often discussed in the abstract—chatbots that reason, systems that plan. But manufacturing is where the constraints are unforgiving. You can't hallucinate a tolerance. You can't context-window your way out of a collision. The domain forces rigor.

The AMD MI300X is the interesting choice here. Most multi-agent demos run on cloud A100s or H100s because that's where the APIs are. MachinaCheck went with MI300X for memory bandwidth and deterministic latency—when you're simulating physics, consistency beats peak throughput. The agents share state through a unified memory pool, which cuts the serialization overhead that typically kills multi-agent performance.

What's notable is the failure mode. When agents disagree—say, the geometry agent thinks a feature is machinable but the toolpath agent finds interference—the system doesn't default to a human escalation. It runs a local search: adjust feed rate, try a different tool, modify the approach angle. Only when the local search exhausts its budget does it flag for review. This is the difference between agentic automation and agentic assistance. One handles the routine; the other handles the edge cases.
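
The escalation logic described above can be sketched in a few lines. This is a guess at the shape, not MachinaCheck's code: `resolve_disagreement`, the `is_feasible` callback, and the parameter grids are all hypothetical stand-ins for the real toolpath check.

```python
from itertools import product


def resolve_disagreement(is_feasible, feed_rates, tools, approach_angles, budget):
    """Try parameter combinations until one works or the attempt budget runs out."""
    attempts = 0
    for candidate in product(feed_rates, tools, approach_angles):
        if attempts >= budget:
            break
        attempts += 1
        if is_feasible(*candidate):
            return "resolved", candidate, attempts
    # Budget exhausted (or grid exhausted): only now does a human get involved.
    return "needs_review", None, attempts


# Hypothetical check: only the 6mm tool at a 15 degree approach or steeper clears the feature.
def clears_feature(feed, tool, angle):
    return tool == "6mm" and angle >= 15


print(resolve_disagreement(clears_feature, [800, 600], ["10mm", "6mm"], [0, 15], budget=10))
print(resolve_disagreement(clears_feature, [800, 600], ["10mm", "6mm"], [0, 15], budget=3))
```

The budget is the design choice that matters. Too small and everything escalates; too large and the system burns simulation time on parts that need a human anyway.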

The broader pattern here is domain-specific agent swarms. General-purpose reasoning models are impressive, but they struggle with specialized knowledge that isn't well-represented in training data. Manufacturing physics, regional building codes, clinical trial protocols—these are areas where you need agents that can query structured databases, run simulations, and respect hard constraints. MachinaCheck is a template for this: small, specialized agents with narrow interfaces, coordinated through a lightweight protocol.

There's a temptation to scale these systems horizontally—more agents, broader coverage. But MachinaCheck suggests the opposite. The value is in depth, not breadth. Three agents that deeply understand CNC constraints are more useful than twenty that shallowly understand manufacturing.

For teams building agentic infrastructure, the lesson is about boundaries. Define what each agent can assume, what it must verify, and how it fails. The MI300X choice matters less than the architecture—tight feedback loops, shared memory, and clear escalation paths. The hardware enables the system; the design makes it reliable.

Manufacturing is often dismissed as 'solved' or 'legacy.' But it's exactly the kind of domain where agents can prove their worth—not by replacing humans, but by handling the routine validation that currently consumes engineering hours. MachinaCheck isn't flashy. It just ships parts that work.

vLLM's V1 Release Fixes the Silent Killer in RL Training

Aamer Mihaysi — Fri, 08 May 2026 09:02:13 +0000

Most people benchmark inference engines on throughput. Tokens per second, batch size limits, latency percentiles. But when you're training agents with reinforcement learning, there's a metric that matters more: correctness. A silent bug in your inference stack doesn't just slow you down—it poisons your training data, and you won't know for weeks.

The vLLM team just shipped V1, and buried in the release notes is a fix that should make anyone running RL training take notice. They found and corrected subtle correctness issues in how V0 handled certain token sequences under grouped query attention. The kind of bugs that don't crash your job but subtly shift your reward model's understanding of what "good" looks like.

Why RL is Unforgiving

Supervised fine-tuning is forgiving. If your inference engine produces slightly different logits for 0.1% of tokens, the gradient updates average out. RL is different. You're generating rollouts, computing advantages, updating policy and value networks in tight loops. A correctness bug doesn't average out—it compounds. Your policy learns from corrupted rollouts. Your value function trains on garbage advantages. By the time you notice the loss curve looks weird, you've burned thousands of GPU hours.

The vLLM V0 bugs were subtle enough to pass standard tests. They manifested under specific conditions: long contexts with particular attention patterns, batched generations with heterogeneous lengths, certain temperature settings. Exactly the conditions you hit when training agents that need to explore environments, maintain state, and generate variable-length reasoning traces.

What Changed in V1

The V1 rewrite isn't just a refactor. The team rebuilt the attention backends with correctness as the primary constraint, then optimized. They added comprehensive property-based testing that generates random sequences and verifies equivalence against a reference implementation. They caught edge cases in rotary position embeddings that only appeared at context lengths above 16k tokens.
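
The testing idea generalizes beyond vLLM. Below is a small stand-in, not vLLM's test suite: a straightforward single-query attention as the reference, a chunked version with a running max and normalizer playing the role of the "optimized" kernel, and a seeded random check that the two agree. All function names are my own.

```python
import math
import random


def attention_reference(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(len(values[0]))]


def attention_streaming(q, keys, values, chunk=4):
    """Same result computed chunk by chunk with a running max and normalizer."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        ks, vs = keys[start:start + chunk], values[start:start + chunk]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale everything accumulated so far
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def check_equivalence(trials=200, seed=0, tol=1e-9):
    """Generate random inputs and compare the two implementations."""
    rng = random.Random(seed)
    for _ in range(trials):
        n, dim = rng.randint(1, 40), rng.randint(1, 8)
        q = [rng.uniform(-3, 3) for _ in range(dim)]
        keys = [[rng.uniform(-3, 3) for _ in range(dim)] for _ in range(n)]
        values = [[rng.uniform(-3, 3) for _ in range(dim)] for _ in range(n)]
        ref = attention_reference(q, keys, values)
        got = attention_streaming(q, keys, values, chunk=rng.randint(1, 8))
        if any(abs(a - b) > tol for a, b in zip(ref, got)):
            return False
    return True


print(check_equivalence())
```

A dedicated property-based testing library would add input shrinking and smarter edge-case generation; the seeded loop here only shows the shape of the check.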

More importantly, they changed how they think about the PagedAttention algorithm. V0 optimized for throughput first. V1 optimizes for correctness first, then recovers throughput. The result is an engine that generates identical outputs to reference implementations across the test matrix, while still maintaining competitive performance.

The Production Lesson

If you're running RL training at scale, you need to audit your inference stack for correctness, not just speed. Run equivalence tests against a reference implementation on your actual training distribution. Generate thousands of rollouts with both engines and compare reward distributions. Monitor the KL divergence estimates between your policy and reference policy for unexpected drift.
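
For the "compare reward distributions" step, one simple option is a two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs. The choice of statistic is mine, not something the vLLM team prescribes, and the reward lists below are placeholders for rollouts scored from each engine.

```python
def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples (0 = identical)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Placeholder rewards from the same prompts run through two engines.
rewards_reference = [1, 2, 3, 4]
rewards_candidate = [1, 2, 3, 5]
print(ks_statistic(rewards_reference, rewards_candidate))
```

What counts as "too far apart" depends on your sample size and reward noise, so calibrate the threshold by comparing the reference engine against itself with different seeds first.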

vLLM V1 is a reminder that infrastructure for agent training has different requirements than infrastructure for chatbots. When your model is generating its own training data, correctness isn't a nice-to-have. It's the foundation everything else builds on.

The throughput numbers in V1 are good. But the correctness guarantees are what make it production-ready for RL.

DeepSeek-V4: What a Million-Token Context Actually Changes

Aamer Mihaysi — Wed, 06 May 2026 09:02:27 +0000

The context window arms race officially crossed into absurdity this week. DeepSeek-V4 launched with a million-token context window, and suddenly everyone building agents is asking the same question: is this finally enough?

The honest answer: it depends on what you were doing wrong before.

Most agent memory designs are sophisticated workarounds for a problem nobody defined clearly. When your context fits in a few thousand tokens, you build elaborate retrieval systems, hierarchical memory structures, and clever compression schemes. Not because they're good ideas, but because you have no choice. The constraint shapes the architecture.

Remove that constraint and the architecture doesn't automatically become elegant. It just becomes different.

The Real Problem with Long Context

A million tokens sounds like freedom. In practice, it's a different kind of trap. The failure mode shifts from "can't fit" to "can't find." When you dump an entire codebase, weeks of conversation history, and multiple tool outputs into a single prompt, attention becomes your bottleneck. The model sees everything but prioritizes nothing.

I've watched agent traces where the critical tool result was technically present in context but effectively invisible, buried under thousands of tokens of irrelevant history. The model hallucinated a response instead of retrieving the actual answer sitting three-quarters of the way through the window.

Long context doesn't solve retrieval. It just changes where retrieval happens—from external vector stores to internal attention mechanisms. And attention is expensive. Every additional token you attend to costs latency and compute. The economics don't disappear just because the window got bigger.

What Actually Works

The teams shipping reliable agents at scale aren't dumping everything into context. They're using long windows selectively:

Single-shot analysis over chunking. When you need to understand cross-document relationships or detect patterns across a large codebase, fitting everything at once beats stitching together partial views. RAG pipelines that previously required three separate retrieval calls can now handle the full document set in one pass.

Working memory for active sessions. Keeping the last hour of conversation in context beats constant re-retrieval from a memory store. The latency win is real, and coherence improves when the model maintains consistent references across turns.

Tool output aggregation. Some workflows generate massive intermediate results—log analysis, test suites, multi-page scrapes. Being able to pass the full output through without aggressive summarization preserves signal that gets lost in compression.

What Doesn't Change

The fundamentals of agent design stay the same. You still need clear tool boundaries, structured output formats, and error handling that assumes failure. A bigger window doesn't make your prompts better or your evaluation metrics more meaningful.

If your agent was unreliable with 8K context, a million tokens won't save it. The bugs just get more expensive to trace.

The Infrastructure Angle

From an infrastructure perspective, million-token windows change the serving calculus. KV cache memory requirements scale linearly with sequence length. A batch of 32 requests at 1M tokens each is a very different proposition than the same batch at 4K.
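
The arithmetic is worth doing once. The sketch below uses a hypothetical grouped-query attention model (32 layers, 8 KV heads, head dimension 128, 2-byte values); these numbers are assumptions for illustration and are not DeepSeek-V4's configuration, which compresses its cache.

```python
def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch


GIB = 1024 ** 3
cfg = dict(layers=32, kv_heads=8, head_dim=128)  # hypothetical model, not DeepSeek-V4

short = kv_cache_bytes(seq_len=4 * 1024, batch=32, **cfg) / GIB
long = kv_cache_bytes(seq_len=1024 * 1024, batch=32, **cfg) / GIB
print(f"{short:.0f} GiB at 4K vs {long:.0f} GiB at 1M")
```

Under these assumptions the 4K batch fits on a single accelerator and the 1M batch needs terabytes. The growth is linear, but a 256x longer sequence is still 256x the memory.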

Pricing models haven't settled. Some providers charge per token regardless of context position, which means the first token costs the same as the millionth. Others are experimenting with attention-based pricing that accounts for actual compute. If you're building cost-sensitive applications, the economics of long context matter more than the capability.

Bottom Line

DeepSeek-V4's million-token window is a genuine capability shift, but not a paradigm shift. It removes a constraint that was forcing bad architectural decisions. It doesn't automatically produce good ones.

The agents that benefit most are those that were already well-architected but hitting artificial limits. If your system was designed around retrieval augmentation because you had to, not because it was the right choice, this is your opportunity to simplify.

Just don't mistake "can fit" for "should fit." The window is bigger. Your judgment still needs to be selective.

The 8B Model That Punches at 32B Weight

Aamer Mihaysi — Mon, 04 May 2026 09:03:39 +0000

IBM's Granite 4.1 release exposes something the industry keeps forgetting: parameter count is a vanity metric. Their 8B instruct model matches or beats their own previous 32B-A9B MoE variant. Same capabilities, one-fourth the size.

Most teams still chase the pre-training lottery. Dump more tokens, add more layers, scale the cluster. Granite 4.1 took the opposite path. Fifteen trillion tokens, yes, but filtered through five distinct phases where data quality progressively tightened like a vise.

The architecture isn't revolutionary. Grouped Query Attention, RoPE, SwiGLU, RMSNorm. Standard components assembled competently. Where Granite diverges is the post-training stack. Supervised fine-tuning on 4.1 million samples, each scored through an LLM-as-Judge pipeline. Then a multi-stage RL pipeline using on-policy GRPO with DAPO loss.

The long-context extension to 512K tokens is equally methodical. Staged expansion: 32K, then 128K, then 512K. A model merge after each stage to preserve short-context performance.

Apache 2.0 licensing matters here. Not just for ethics, but because it lets you actually inspect the training logs and replicate the pipeline.

What's striking is the absence of gimmicks. No Mixture of Experts shell game. No claims about emergent capabilities. Just disciplined data engineering across 15 trillion tokens.

The broader implication: we're entering an era where training efficiency beats model scale. If you can get 32B-equivalent performance from 8B parameters, deployment costs collapse. Latency drops. You can run inference on commodity hardware.

Granite 4.1 won't dominate the headlines. But for teams building production systems, it's a signal. The window where bigger was automatically better is closing. The winners will be the ones who treat data curation as infrastructure, not an afterthought.

The 8B figure isn't a constraint. It's a choice that forced better decisions.

AI Evaluation Is Now a Capital Expense

Aamer Mihaysi — Fri, 01 May 2026 09:02:48 +0000

We used to worry about training costs. Now the bill for checking if the model works is becoming the line item that kills budgets.

The Holistic Agent Leaderboard recently spent $40,000 to run 21,730 agent rollouts across nine models and nine benchmarks. A single GAIA run on a frontier model can hit $2,829 before you even think about caching. Exgentic's sweep across agent configurations found a 33x cost spread on identical tasks.

Static benchmarks could be compressed. Flash-HELM showed that a 100-200x compute reduction still preserved rankings. Agent benchmarks broke that assumption. When your evaluation is a multi-turn rollout with tool calls and stateful interaction, each item is the expensive object.

On HAL's Online Mind2Web benchmark, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. That is roughly a 9x cost difference, and the cheaper configuration scored two percentage points higher.

Agent benchmarks measure a model x scaffold x token-budget product. CLEAR found that accuracy-optimal configurations cost 4.4 to 10.8x more than Pareto-efficient alternatives. The best result on a leaderboard is often just the most expensive configuration someone was willing to pay for.
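
"Pareto-efficient" has a precise meaning here: a configuration stays on the frontier only if no other configuration is at least as cheap and at least as accurate, and strictly better on one of the two. A minimal sketch of that filter follows. The configuration names and numbers are made up for illustration and come from no benchmark.

```python
def pareto_front(configs):
    """Return configs not dominated on (lower cost, higher accuracy)."""
    front = []
    for c in configs:
        dominated = any(
            o["cost"] <= c["cost"] and o["acc"] >= c["acc"]
            and (o["cost"] < c["cost"] or o["acc"] > c["acc"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["cost"])

# Illustrative numbers only, not taken from any leaderboard.
configs = [
    {"name": "small-model/light-scaffold", "cost": 50.0, "acc": 0.30},
    {"name": "mid-model/light-scaffold", "cost": 120.0, "acc": 0.41},
    {"name": "mid-model/heavy-scaffold", "cost": 400.0, "acc": 0.40},
    {"name": "large-model/heavy-scaffold", "cost": 900.0, "acc": 0.44},
]

for c in pareto_front(configs):
    print(c["name"], c["cost"], c["acc"])
```

In this toy set the heavy-scaffold mid-size run drops out because a cheaper run already beats it. The top-accuracy run survives, but only by paying several times more for a few extra points.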

The democratization narrative in AI has always been fragile. Open weights helped. Open datasets helped. But open evaluation is becoming a luxury good.

A grad student can download Llama 4 and fine-tune it on a single GPU. They cannot reproduce the HAL leaderboard without institutional backing. The verification layer of the scientific process is being priced out of reach.

What we need is transparency about costs alongside scores. A leaderboard that shows dollars per point of accuracy. Until then, evaluation will continue its drift from quality control to capital allocation. And the people best positioned to know which models actually work will be the ones with the deepest pockets, not the sharpest insights.
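
A dollars-per-point column is trivial to compute once costs are reported. Here is a rough sketch using the two Online Mind2Web figures quoted earlier in this post. The field names and the choice to rank by cost per accuracy point are my own assumptions, not anything HAL publishes.

```python
# Cost and accuracy figures are the two Online Mind2Web results cited above.
runs = [
    {"agent": "Browser-Use + Claude Sonnet 4", "cost_usd": 1577, "accuracy_pct": 40},
    {"agent": "SeeAct + GPT-5 Medium", "cost_usd": 171, "accuracy_pct": 42},
]

def dollars_per_point(run):
    """Total run cost divided by accuracy in percentage points."""
    return run["cost_usd"] / run["accuracy_pct"]

# Cheapest cost per accuracy point first.
leaderboard = sorted(runs, key=dollars_per_point)

for run in leaderboard:
    print(
        f'{run["agent"]:<32} {run["accuracy_pct"]:>3}%  '
        f'${run["cost_usd"]:>5}  ${dollars_per_point(run):.2f}/pt'
    )
```

On those two rows the gap is roughly $39 per point versus $4 per point, which says more than the raw accuracy column does. A single ratio hides things too, such as variance across runs, so it belongs next to the score rather than in place of it.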

The Agent Orchestration Layer Is Finally Here

Aamer Mihaysi — Wed, 29 Apr 2026 09:03:43 +0000

We've spent two years obsessing over model benchmarks. Meanwhile, a quieter shift has been happening: the realization that throwing more parameters at a problem isn't the same as building systems that can actually work together.

Sakana's Conductor is the clearest signal yet. A 7B model trained with reinforcement learning not to solve tasks directly, but to decide which agent should solve them. It reached 83.9% on LiveCodeBench and 87.5% on GPQA-Diamond—not by being smarter than the frontier models it orchestrates, but by being better at dispatching them.

The implications are uncomfortable for anyone who's built their architecture around single-model dominance. When a 7B parameter orchestrator can outperform individual 70B+ workers by routing queries intelligently, the economics of inference change completely.
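
Structurally, an orchestrator is a policy that maps a query to a worker, plus a dispatch step. The sketch below shows only that skeleton. The keyword rules are a placeholder for where a trained policy like Conductor's would sit, and the worker names and functions are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical workers; in a real system each would call a different model.
WORKERS: Dict[str, Callable[[str], str]] = {
    "code": lambda q: f"[code-worker] {q}",
    "science": lambda q: f"[science-worker] {q}",
    "general": lambda q: f"[general-worker] {q}",
}

def route(query: str) -> str:
    """Placeholder policy: keyword rules where a learned router would go."""
    text = query.lower()
    if any(k in text for k in ("python", "bug", "function")):
        return "code"
    if any(k in text for k in ("protein", "quantum", "molecule")):
        return "science"
    return "general"

def dispatch(query: str) -> str:
    """Send the query to whichever worker the policy selects."""
    return WORKERS[route(query)](query)

print(dispatch("Fix the bug in this Python function"))
```

Swapping `route` for a learned model changes nothing else in the system. That is the economic argument: the expensive workers only run when the cheap router decides they are needed.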

Google's TPU split into training (8t) and inference (8i) variants reinforces this trajectory. When you start optimizing silicon specifically for inference workloads, you're acknowledging that the action has shifted from training massive models to deploying them efficiently at scale.

The question isn't whether multi-agent systems will dominate. It's whether your stack is built to route between them efficiently.

The orchestration layer is no longer theoretical. It's here, it's 7B parameters, and it's beating the frontier models at their own game.

The Browser Is Becoming an Agent Operating System

Aamer Mihaysi — Tue, 28 Apr 2026 09:03:42 +0000

Chrome's AI Mode and Skills features mark a shift most developers haven't internalized yet. The browser isn't just adding AI features. It's becoming the runtime environment for agentic workflows.

Google's announcement of AI Mode in Chrome and the new "Skills" capability to turn prompts into one-click tools signals something larger than convenience features. The browser is evolving from a document viewer into an agent orchestration layer. This matters because it changes where intelligence lives in your stack.

Most agent architectures today assume the model lives somewhere else—OpenAI's API, your backend, a local inference server. The browser is treated as a dumb terminal. But Chrome's moves suggest a different model: the browser itself becomes the agent host, with local context, persistent memory, and tool-calling capabilities built into the chrome.

AI Mode transforms how users interact with web content. Instead of browsing passively, users can query, summarize, and act on information across tabs. The browser gains semantic understanding of what the user is looking at. This isn't just a chat overlay—it's the foundation for agents that can reason about web content in real-time.

Skills takes this further by letting users (and eventually developers) package prompts into reusable tools. A "Skill" in Chrome is essentially a lightweight agent with a specific purpose—research this company, compare these products, draft a response to this email. They execute with access to the current browsing context.

For developers building agentic applications, this changes the playing field. The browser becomes a competitor to your backend agent infrastructure, but also a potential platform to leverage.

The implications are concrete:

Context access: Chrome has access to cookies, browsing history, and on-page content that external agents can't easily replicate without complex OAuth flows and scraping infrastructure.
User trust: Users trust their browser more than random third-party agents. Chrome's built-in agents inherit that trust by default.
Distribution: Skills distributed through Chrome reach users where they already are. No installation friction, no new interface to learn.

But there are trade-offs. Chrome's agents run in Google's environment, subject to their rate limits, their model choices, their privacy policies. The Skills ecosystem will likely be as open as Chrome extensions—technically extensible, but gated by store policies and Google's priorities.

What this means for builders: if you're constructing multi-agent systems, the browser is no longer just a client. It's a first-class agent platform. Your architecture needs to account for agents that run locally in Chrome, agents that run in your backend, and how they coordinate.

The vision is clear. Chrome becomes the agent OS, Skills become the app store, and the line between browsing and task execution dissolves. For some workflows, this is ideal. For others—those requiring custom models, sensitive data, or complex coordination—you'll still need your own infrastructure.

The browser isn't dead. It's becoming something more interesting: the universal agent runtime.