<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rome</title>
    <description>The latest articles on DEV Community by Rome (@rome1).</description>
    <link>https://dev.to/rome1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3303991%2F2f80f8cf-5626-41ed-84d0-2c0d64d60cad.png</url>
      <title>DEV Community: Rome</title>
      <link>https://dev.to/rome1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rome1"/>
    <language>en</language>
    <item>
      <title>Gemma 4 dense by default: why your local agent doesn't want the MoE</title>
      <dc:creator>Rome</dc:creator>
      <pubDate>Sat, 23 May 2026 22:52:45 +0000</pubDate>
      <link>https://dev.to/rome1/gemma-4-dense-by-default-why-your-local-agent-doesnt-want-the-moe-4pgd</link>
      <guid>https://dev.to/rome1/gemma-4-dense-by-default-why-your-local-agent-doesnt-want-the-moe-4pgd</guid>
      <description>&lt;h2&gt;
  
  
  The decision you don't realize you're making
&lt;/h2&gt;

&lt;p&gt;You sit down to wire &lt;a href="https://developers.googleblog.com/bring-state-of-the-art-agentic-skills-to-the-edge-with-gemma-4/" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; into a local agent loop — a Claude-Code-style tool-using harness, a long-context code reviewer, an offline research assistant. Google has handed you four architectures from the same release. The contest framing nudges you toward an obvious read:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E2B and E4B&lt;/strong&gt; (effective-parameter) models for phones, browsers, and a Pi 5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31B dense&lt;/strong&gt; as the on-prem workhorse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;26B MoE&lt;/strong&gt; as the efficient one — the &lt;a href="https://en.wikipedia.org/wiki/Mixture_of_experts" rel="noopener noreferrer"&gt;mixture-of-experts&lt;/a&gt; variant that activates a fraction of its parameters per token and claims better quality per FLOP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cheap reflex, in 2026, is to reach for the MoE. It's the architecture every frontier lab is paying for. &lt;a href="https://arxiv.org/abs/2401.04088" rel="noopener noreferrer"&gt;Mixtral&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2412.19437" rel="noopener noreferrer"&gt;DeepSeek V3&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2505.09388" rel="noopener noreferrer"&gt;Qwen3&lt;/a&gt; — the production direction is one-way traffic. If the open release at your scale gives you an MoE option, you pick it. That's the move.&lt;/p&gt;

&lt;p&gt;That move is wrong for the workload most of you will actually build.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;interactive local agents&lt;/strong&gt; — multi-turn tool use, code editing, long-context reasoning chains where each step is read by a human or fed into the next step — the &lt;strong&gt;31B dense&lt;/strong&gt; is the right default, and the &lt;strong&gt;26B MoE&lt;/strong&gt; is the wrong one. The reason is not size, quality, or open-weights ergonomics. It's that agents fail on the tail, not the mean. &lt;strong&gt;Routing variance lives in the tail.&lt;/strong&gt; That's the entire argument; the rest of this post is the math, the mechanism, and a benchmark you can run this weekend on whatever hardware you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tail is the workload
&lt;/h2&gt;

&lt;p&gt;A common mistake when reasoning about local-model deployments is to optimize the average. Tokens per second, mean time-to-first-token, throughput at batch 1 — these numbers are what every llama.cpp benchmark in your feed reports.&lt;/p&gt;

&lt;p&gt;For a chatbot that gets one user message and emits one reply, the mean is fine. For an agent, it isn't.&lt;/p&gt;

&lt;p&gt;Consider a 30-step tool-use loop — a typical &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct&lt;/a&gt;-style agent doing local code search, file edits, test runs, and a final summary. Each step is a generation call. Suppose each call has a 3% probability of exceeding a soft timeout (say, the human watching the screen gives up and Ctrl-Cs). The probability that &lt;strong&gt;at least one&lt;/strong&gt; step in the loop blows the timeout is:&lt;/p&gt;

&lt;p&gt;1 − (1 − 0.03)&lt;sup&gt;30&lt;/sup&gt; ≈ 0.60&lt;/p&gt;

&lt;p&gt;A 3% per-step tail becomes a 60% chance of a visibly broken loop. Push the per-step tail down to 1% and you're still at 26% loop-level failure across 30 steps. The mean latency could be excellent; the user experience is still broken. Agents are tail-bound systems, the same way &lt;a href="https://queue.acm.org/detail.cfm?id=2655588" rel="noopener noreferrer"&gt;low-latency trading&lt;/a&gt;, &lt;a href="https://research.google/pubs/the-tail-at-scale/" rel="noopener noreferrer"&gt;search ranking&lt;/a&gt;, and request-level web serving are. Dean and Barroso's "The Tail at Scale" is about exactly this: when you fan out many requests, the slowest one dominates. An agent fans out across time, not machines, but the math is identical.&lt;/p&gt;
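
&lt;p&gt;The arithmetic above is easy to replay for your own loop length. The sketch below assumes each step blows the timeout independently with the same probability, which is a simplification; the function names are illustrative, not from any library.&lt;/p&gt;

```python
def loop_failure_probability(per_step_tail, steps):
    """P(at least one of `steps` independent calls exceeds the timeout)."""
    return 1.0 - (1.0 - per_step_tail) ** steps


def max_per_step_tail(target_loop_failure, steps):
    """Largest per-step tail that keeps loop-level failure at or under the target."""
    return 1.0 - (1.0 - target_loop_failure) ** (1.0 / steps)


if __name__ == "__main__":
    for p in (0.03, 0.01, 0.001):
        print(f"per-step tail {p:.3f}: loop failure {loop_failure_probability(p, 30):.2f}")
    # Invert the question: what per-step tail keeps a 30-step loop under 5% failure?
    print(f"budget for 5% loop failure: {max_per_step_tail(0.05, 30):.4f}")
```

&lt;p&gt;Running the inverse is the useful part: it turns a loop-level reliability target into a per-step latency budget you can compare against a measured p99.&lt;/p&gt;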

&lt;p&gt;This is the lens the architecture comparison has to be read through. Not "which model gets more right answers per FLOP," but "which model gets right answers &lt;em&gt;reliably under the user's patience budget&lt;/em&gt;."&lt;/p&gt;

&lt;h2&gt;
  
  
  How MoE routing makes a tail
&lt;/h2&gt;

&lt;p&gt;Every MoE forward pass routes each token through a small subset of the available experts. In Mixtral 8×7B, that's top-2 of 8 experts per token per MoE layer. In DeepSeek V3, it's top-K out of hundreds, with shared experts plus routed experts. The architectural details vary; the routing mechanism does not. A learned &lt;a href="https://arxiv.org/abs/1701.06538" rel="noopener noreferrer"&gt;gating network&lt;/a&gt; emits expert probabilities; the implementation activates the top-K.&lt;/p&gt;
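
&lt;p&gt;As a rough illustration of the mechanism, here is a minimal top-K gate in plain Python. It assumes a softmax over router logits followed by renormalization over the selected experts, which is one common convention and not necessarily what Gemma 4 does; the logits are made-up numbers.&lt;/p&gt;

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k most probable experts.

    Weights are renormalized so they sum to 1 over the chosen experts.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


if __name__ == "__main__":
    # Hypothetical router logits for two different tokens over 8 experts.
    token_a = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]
    token_b = [1.8, -0.2, 0.4, 2.2, -1.0, 0.1, 0.0, -0.3]
    print(top_k_route(token_a))
    print(top_k_route(token_b))
```

&lt;p&gt;The two tokens select disjoint expert pairs, which is the point: the experts a step touches depend on the tokens, not on the model alone.&lt;/p&gt;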

&lt;p&gt;Several consequences fall out of this design — and they are not bugs, they're load-bearing features of why MoE works at all. They just bite a local agent harder than a batched cloud workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active parameters per token are a distribution, not a constant.&lt;/strong&gt; Google's naming for the MoE (&lt;code&gt;gemma-4-26B-A4B&lt;/code&gt;, "26 billion total, 4 billion active") describes an &lt;em&gt;expectation&lt;/em&gt; across tokens, not a per-token guarantee. For a prompt that lands hard on a small number of experts (say, code in an unusual language, or a domain-specific token distribution), the &lt;em&gt;same&lt;/em&gt; few experts get hit token after token, creating utilization hotspots. The total work across the layer is the same; the load balance is not. On a multi-GPU deployment this is usually fine, because expert parallelism spreads the experts across devices and large batches even out the load. On a single-machine local deployment it shows up as per-token latency variance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expert utilization is non-uniform.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2406.18219" rel="noopener noreferrer"&gt;A closer look at Mixture-of-Experts in LLMs&lt;/a&gt; (Lo et al., 2024) finds that routers tend to select experts whose outputs have larger norms, that expert diversity changes layer by layer, and that the final layer routes differently from intermediate layers. The implication for an agent's tail isn't subtle: which experts your tokens actually hit is a function of &lt;em&gt;what the tokens are&lt;/em&gt;, and adjacent steps in an agent loop — where one step's output becomes the next step's input — can land in substantially different parts of the expert configuration space, with substantially different latencies. For a batch-mode service the variance washes out across many users; for a single agent loop, each user &lt;em&gt;is&lt;/em&gt; the batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load-balancing losses pull the model toward uniform expert use, but they don't guarantee it.&lt;/strong&gt; Training-time auxiliary losses (&lt;a href="https://arxiv.org/abs/1701.06538" rel="noopener noreferrer"&gt;Shazeer et al., 2017&lt;/a&gt;; &lt;a href="https://arxiv.org/abs/2101.03961" rel="noopener noreferrer"&gt;Fedus et al., Switch Transformer&lt;/a&gt;) penalize over- and under-used experts so the model learns a roughly even distribution. &lt;em&gt;Roughly&lt;/em&gt; even at the dataset level is not even on your specific prompt. The auxiliary loss is a soft prior; the runtime gating is what determines your tail.&lt;/p&gt;
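
&lt;p&gt;For concreteness, the Switch Transformer form of this auxiliary loss is α · N · Σ f&lt;sub&gt;i&lt;/sub&gt; · P&lt;sub&gt;i&lt;/sub&gt;, where f&lt;sub&gt;i&lt;/sub&gt; is the fraction of tokens routed to expert i and P&lt;sub&gt;i&lt;/sub&gt; is the mean router probability for expert i. The sketch below computes it for top-1 routing on toy data; the probability vectors are invented, and nothing here is specific to Gemma 4's training recipe.&lt;/p&gt;

```python
def load_balance_loss(router_probs, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: one probability vector (list of floats) per token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose top-1 choice is expert i.
    f = [0.0] * num_experts
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        f[top] += 1.0 / num_tokens
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    balanced = [
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.1, 0.7],
    ]
    skewed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0
    print("balanced:", load_balance_loss(balanced))
    print("skewed:  ", load_balance_loss(skewed))
```

&lt;p&gt;The loss is minimized under uniform routing and grows as tokens pile onto one expert, but it is averaged over a training batch. A single prompt at inference time can still look like the skewed case.&lt;/p&gt;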

&lt;p&gt;&lt;strong&gt;The KV cache is mostly arch-agnostic, but the activations on top of it aren't.&lt;/strong&gt; For Gemma 4 specifically we don't yet have the per-architecture cache numbers I'd want, but the general principle holds: the &lt;a href="https://arxiv.org/abs/2305.13245" rel="noopener noreferrer"&gt;KV cache footprint&lt;/a&gt; at long context will be similar for dense and MoE in the same family (same attention layout, &lt;a href="https://arxiv.org/abs/2305.13245" rel="noopener noreferrer"&gt;GQA&lt;/a&gt;, same hidden size), but the per-token activation cost will diverge. The MoE wins on raw FLOPs per token. It does not win on &lt;em&gt;predictable&lt;/em&gt; FLOPs per token.&lt;/p&gt;

&lt;p&gt;The dense model has none of this surface. Every token does the same work. Routing variance is zero because there is no routing. The p99 looks like the p50 plus the inherent system noise of your kernels, your scheduler, and your memory hierarchy. That's the property the agent loop wants.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MoE memory myth
&lt;/h2&gt;

&lt;p&gt;A second piece of the cheap reflex says: "&lt;em&gt;26B MoE is smaller than 31B dense, so it fits better in my VRAM.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;It doesn't.&lt;/p&gt;

&lt;p&gt;MoE memory residence is &lt;em&gt;all&lt;/em&gt; parameters, not just the activated ones. Every expert has to be loaded somewhere — either in GPU memory, or paged in from system RAM, or paged in from disk. The "26B" in Gemma 4 26B MoE is the residence footprint; the "activates a fraction per token" describes compute, not memory. On a single consumer GPU, both models compete for roughly the same memory budget, and the dense model is often the easier fit because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quantization is cleaner on dense.&lt;/strong&gt; Dense weight matrices quantize uniformly; MoE quantization has to contend with expert-utilization skew, where rarely-routed experts get less calibration signal and degrade harder under aggressive bit-width reduction. Sub-1-bit MoE compression is feasible — see &lt;a href="https://arxiv.org/abs/2310.16795" rel="noopener noreferrer"&gt;QMoE&lt;/a&gt; for the existence proof on Switch-Transformer-c2048 — but it requires custom kernels. The boring 4-bit path on the dense model is the path that already works in every local inference engine you'll touch (&lt;a href="https://arxiv.org/abs/2310.16836" rel="noopener noreferrer"&gt;LLM-FP4&lt;/a&gt; is the well-trodden version).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offloading hurts MoE more.&lt;/strong&gt; If you spill experts to CPU or disk, you pay the offload cost &lt;em&gt;per token, conditional on routing&lt;/em&gt;. A dense model that spills pays the cost predictably; an MoE that spills pays it stochastically, and the stochasticity is, again, the tail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compilation is cleaner on dense.&lt;/strong&gt; &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt;, &lt;a href="https://github.com/ml-explore/mlx" rel="noopener noreferrer"&gt;MLX&lt;/a&gt;, and &lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; all support both, but the dense path has had more attention. Fewer corner cases in expert routing, GQA, and KV layout interactions. If you've ever had a custom kernel mis-handle expert dispatch, you know.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 26B MoE will run on your machine. So will the 31B dense. Which one runs &lt;em&gt;predictably&lt;/em&gt; is the question.&lt;/p&gt;
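
&lt;p&gt;A back-of-the-envelope check makes the residence point concrete. The sketch assumes a nominal 4 bits per weight and counts weights only; it ignores quantization metadata, the KV cache, and activations, so real files will be somewhat larger.&lt;/p&gt;

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) at a nominal bit width."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    rows = {
        "31B dense (resident = active)": 31,
        "26B MoE (resident, all experts)": 26,
        "26B MoE (active per token, ~4B)": 4,
    }
    for name, billions in rows.items():
        print(f"{name}: {weight_footprint_gb(billions, 4):.1f} GB at 4-bit")
```

&lt;p&gt;Only the first two rows matter for whether the model fits; the third describes compute per token, not memory.&lt;/p&gt;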

&lt;h2&gt;
  
  
  When the MoE is correct
&lt;/h2&gt;

&lt;p&gt;I'm not arguing the 26B MoE is a bad model. It's the right model for several workloads — they just aren't the local-agent workload most readers building this weekend will actually have.&lt;/p&gt;

&lt;p&gt;The MoE is the right pick when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You're serving many concurrent users.&lt;/strong&gt; Batching across requests amortizes routing variance. Hot experts become warm; cold experts get exercised. The per-token expectation is what you see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're throughput-bound, not latency-bound.&lt;/strong&gt; Background processing, RAG indexing, bulk classification, &lt;a href="https://arxiv.org/abs/2406.20094" rel="noopener noreferrer"&gt;synthetic data generation at scale&lt;/a&gt;. Anywhere the user isn't waiting on a single response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can afford a per-request retry budget.&lt;/strong&gt; If a slow step is recoverable by re-issuing, the tail flattens out. Most agent loops can't recover gracefully — the partial state is already entangled with the world (a file was edited, a test was run).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're deploying on accelerators with native expert-parallel support.&lt;/strong&gt; Multi-GPU or TPU deployments where the all-to-all communication overhead of expert dispatch is outweighed by the FLOP savings. This is exactly where the MoE was designed to win, and it does.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Gemma 4 26B MoE is, on its merits, an excellent open-weights MoE. Pick it for the workloads it was designed for. Don't pick it because the marketing material says "more efficient" and you didn't ask "efficient at what."&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Pick&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single-user local agent loop (tool use, code edit, multi-turn)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31B dense&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local long-context Q&amp;amp;A (one or two large reads, then short outputs)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31B dense&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local RAG service with many concurrent users&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local batched synthetic-data or evaluation pipelines&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background indexing, classification, summarization at scale&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning on a narrow domain (small dataset, limited compute)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;31B dense&lt;/strong&gt; (MoE fine-tuning collapses expert specialization at small data scales)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mech-interp / SAE / circuit work on an open MoE&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;26B MoE&lt;/strong&gt; (it's the interesting object)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phone, Pi 5, browser deployment&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;4B effective&lt;/strong&gt; (you don't have a choice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud or multi-GPU server deployment&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;26B MoE&lt;/strong&gt; (this is where it wins)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things to flag explicitly. First, the fine-tuning row: MoE fine-tuning is a research subfield of its own because the router and the experts have to be tuned coherently, and small datasets can collapse expert specialization. If you have 5,000 examples and want to teach Gemma 4 your codebase's conventions, the dense model is the more forgiving target. Second, the mech-interp row: a 26B MoE with open weights is a genuinely exciting interpretability artifact — see the &lt;a href="https://huggingface.co/google/gemma-scope" rel="noopener noreferrer"&gt;GemmaScope&lt;/a&gt; lineage on Gemma 2 dense — and if your "agent" is actually a circuit-analysis pipeline, the MoE is the more interesting object. That's a different essay.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark you should run this weekend
&lt;/h2&gt;

&lt;p&gt;Don't take my word for any of the above. The whole point of running models locally is that you can measure. Here is the experiment, three ways, calibrated for the hardware most of you actually own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to measure:&lt;/strong&gt; not tokens per second. Measure &lt;strong&gt;per-step p50, p95, and p99 latency&lt;/strong&gt; across at least 200 generation calls, with prompts that approximate your real workload (short turns, mixed lengths, occasional long-context reads). If you can run an actual agent loop end-to-end, even better — record the per-step distribution and synthesize the loop-level success probability.&lt;/p&gt;
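
&lt;p&gt;A minimal way to turn a list of per-step latencies into those numbers is sketched below. It uses nearest-rank quantiles and assumes steps are independent when estimating loop-level success; the sample data and the 10-second timeout are placeholders, not measurements.&lt;/p&gt;

```python
import math


def nearest_rank_quantile(samples, q):
    """Nearest-rank quantile for q in (0, 1], e.g. q=0.99 for p99."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def loop_success_probability(samples, timeout_s, steps):
    """Estimate P(all `steps` calls finish within timeout_s), assuming independence."""
    within = sum(1 for s in samples if timeout_s >= s)
    return (within / len(samples)) ** steps


if __name__ == "__main__":
    # Placeholder latencies in seconds; substitute your own 200 measurements.
    latencies = [1.0] * 180 + [4.0] * 17 + [12.0] * 3
    for label, q in (("p50", 0.50), ("p95", 0.95), ("p99", 0.99)):
        print(label, nearest_rank_quantile(latencies, q))
    print("30-step success at 10 s timeout:",
          round(loop_success_probability(latencies, 10.0, 30), 3))
```

&lt;p&gt;With only 200 samples the p99 rests on the two or three slowest calls, so treat it as a rough signal and collect more runs if the two models look close.&lt;/p&gt;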

&lt;p&gt;&lt;strong&gt;Apple Silicon (M2 / M3 / M4):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build llama.cpp with Metal backend&lt;/span&gt;
git clone https://github.com/ggml-org/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)&lt;/span&gt;
huggingface-cli download unsloth/gemma-4-31B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  gemma-4-31B-it-Q4_K_M.gguf &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Benchmark: 200 generations of 512 tokens, log per-call timing&lt;/span&gt;
./build/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; gemma-4-31B-it-Q4_K_M.gguf  &lt;span class="nt"&gt;-n&lt;/span&gt; 512 &lt;span class="nt"&gt;-r&lt;/span&gt; 200 &lt;span class="nt"&gt;-o&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; dense.json
./build/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="nt"&gt;-n&lt;/span&gt; 512 &lt;span class="nt"&gt;-r&lt;/span&gt; 200 &lt;span class="nt"&gt;-o&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; moe.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;26B-A4B&lt;/code&gt; suffix on the MoE is Google's naming for "26B total params, ~4B activated per token" — the very expectation-vs-distribution gap discussed above. &lt;a href="https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench" rel="noopener noreferrer"&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/a&gt; gives per-call timings; pipe through &lt;code&gt;jq&lt;/code&gt; to extract the per-call latencies and compute your own quantiles. For a richer view, &lt;a href="https://github.com/ml-explore/mlx-examples/tree/main/llms" rel="noopener noreferrer"&gt;MLX-LM&lt;/a&gt; gives cleaner kernel timing on Apple Silicon and lets you verify the dispatch overhead directly.&lt;/p&gt;
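
&lt;p&gt;If you would rather post-process in Python than &lt;code&gt;jq&lt;/code&gt;, something like the following works. It assumes the JSON output is a list of result objects that each carry a &lt;code&gt;samples_ns&lt;/code&gt; array of per-repetition timings in nanoseconds; that field name is an assumption about your llama.cpp build, so inspect the file and adjust &lt;code&gt;field&lt;/code&gt; if it differs. The example string is synthetic.&lt;/p&gt;

```python
import json


def per_repetition_seconds(bench_json_text, field="samples_ns"):
    """Flatten per-repetition timings (assumed nanoseconds) into seconds."""
    entries = json.loads(bench_json_text)
    seconds = []
    for entry in entries:
        # Entries without the field contribute nothing rather than raising.
        seconds.extend(ns / 1e9 for ns in entry.get(field, []))
    return seconds


if __name__ == "__main__":
    # Synthetic stand-in for the contents of dense.json.
    example = '[{"n_gen": 512, "samples_ns": [41000000000, 43500000000]}]'
    print(per_repetition_seconds(example))
```

&lt;p&gt;Each value is the wall time for one full repetition, which is the per-step latency the quantile calculation needs.&lt;/p&gt;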

&lt;p&gt;&lt;strong&gt;Consumer NVIDIA (RTX 4090, 3090, A6000):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# vLLM with the official safetensors, served on the same machine&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm
python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; google/gemma-4-31B-it &lt;span class="nt"&gt;--quantization&lt;/span&gt; awq &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &amp;amp;

&lt;span class="c"&gt;# Drive it with a benchmark harness — 200 requests, log per-request latency&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.benchmarks.bench_serving &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--backend&lt;/span&gt; openai-chat &lt;span class="nt"&gt;--endpoint&lt;/span&gt; /v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://localhost:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; google/gemma-4-31B-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-prompts&lt;/span&gt; 200 &lt;span class="nt"&gt;--dataset-name&lt;/span&gt; sharegpt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--save-result&lt;/span&gt; &lt;span class="nt"&gt;--result-filename&lt;/span&gt; dense.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



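&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm/tree/main/benchmarks" rel="noopener noreferrer"&gt;vLLM's benchmark suite&lt;/a&gt; reports latency percentiles directly. Recent versions show the median and p99 by default, and the &lt;code&gt;--metric-percentiles&lt;/code&gt; option can add others such as p95. Repeat both commands with &lt;code&gt;--model google/gemma-4-26B-A4B-it&lt;/code&gt;, writing to &lt;code&gt;moe.json&lt;/code&gt;, compare the p99 columns, and that's your answer.&lt;/p&gt;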

&lt;p&gt;A note: vLLM's MoE kernel quality varies by version. Check the &lt;a href="https://docs.vllm.ai/en/latest/models/supported_models.html" rel="noopener noreferrer"&gt;supported models page&lt;/a&gt; before assuming both architectures get equal-quality dispatch. This is itself part of the dense-by-default argument — the ecosystem is more battle-tested on dense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi 5 / phone (Gemma 4 E4B):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The MoE-vs-dense decision doesn't apply at this tier: you're running the &lt;a href="https://huggingface.co/google/gemma-4-E4B" rel="noopener noreferrer"&gt;E4B&lt;/a&gt; effective-parameter model, not because you chose to, but because it's what fits. The interesting benchmark here is different: measure &lt;strong&gt;token-level latency &lt;em&gt;variance&lt;/em&gt;&lt;/strong&gt; on the same prompt set across multiple runs. Edge silicon shares compute with the OS scheduler, and your real tail will be dominated by OS jitter, not architecture. Use &lt;code&gt;time&lt;/code&gt; plus a wrapper that logs to a CSV:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On Pi 5 with llama.cpp built for ARM&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 200&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="c"&gt;# GNU time appends elapsed seconds to latencies.csv; llama-cli's own stderr logs are discarded&lt;/span&gt;
  /usr/bin/time &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"%e"&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; latencies.csv ./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-m&lt;/span&gt; gemma-4-E4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;".prompts[&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;i &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt; prompts.json&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-n&lt;/span&gt; 256 &lt;span class="nt"&gt;-no-cnv&lt;/span&gt; 2&amp;gt;/dev/null
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lesson on the edge is the same as on the workstation: optimize for the experience the user actually has, which is the tail.&lt;/p&gt;
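
&lt;p&gt;To get at run-to-run variance on the same prompt, group the logged timings by prompt before summarizing. The sketch below assumes &lt;code&gt;latencies.csv&lt;/code&gt; holds one elapsed-seconds value per run, in run order, and that run &lt;code&gt;i&lt;/code&gt; used prompt &lt;code&gt;i % 50&lt;/code&gt; as in the loop above; any non-numeric lines that ended up in the file are skipped.&lt;/p&gt;

```python
import statistics
from collections import defaultdict


def parse_latency_lines(lines):
    """Keep lines that parse as a float; skip anything else (e.g. stray log output)."""
    values = []
    for line in lines:
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue
    return values


def per_prompt_spread(latencies, num_prompts=50):
    """Map prompt index to (mean, sample stdev) across repeated runs of that prompt."""
    groups = defaultdict(list)
    for run, seconds in enumerate(latencies, start=1):
        groups[run % num_prompts].append(seconds)
    return {
        prompt: (statistics.mean(vals), statistics.stdev(vals))
        for prompt, vals in groups.items()
        if len(vals) >= 2  # stdev needs at least two runs
    }


if __name__ == "__main__":
    with open("latencies.csv") as handle:
        spread = per_prompt_spread(parse_latency_lines(handle))
    worst = max(spread.items(), key=lambda item: item[1][1])
    print("noisiest prompt index, (mean s, stdev s):", worst)
```

&lt;p&gt;A prompt whose standard deviation is large relative to its mean points at scheduler or thermal jitter rather than at the model, since the tokens are identical across its runs.&lt;/p&gt;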

&lt;h2&gt;
  
  
  What to take away beyond this release
&lt;/h2&gt;

&lt;p&gt;Gemma 4 forces a clean architectural choice at the same scale tier — dense versus MoE, 31B versus 26B. The next open release will too. The release after that. The "always pick MoE because it's more efficient" reflex was correct for cloud training and cloud serving, and it has been pattern-matched into local deployment where it doesn't belong.&lt;/p&gt;

&lt;p&gt;The lens that holds up across releases is not parameter count and not FLOPs per token. It is &lt;strong&gt;the latency profile of your workload&lt;/strong&gt;. If a single user is waiting on a single response chain, you want predictable tokens, not cheap tokens. If many users are sharing a pool of compute and the throughput is what's billed, you want cheap tokens at the expense of any single user's tail.&lt;/p&gt;

&lt;p&gt;The open-weights ecosystem is mature enough now that you don't have to pick on vibes. Two &lt;code&gt;llama-bench&lt;/code&gt; runs and a quantile calculation will tell you the answer for your hardware and your prompts. Run them before you commit to an architecture.&lt;/p&gt;

&lt;p&gt;The dense model is the default. The MoE is the optimization. Treat them in that order.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Submitted to the &lt;a href="https://dev.to/devteam/announcing-the-gemma-4-challenge"&gt;DEV.to Gemma 4 Write-About challenge&lt;/a&gt;, May 2026, with some help from AI. Code samples have been validated against the relevant tool documentation; concrete benchmark numbers are intentionally omitted because the right benchmark is your benchmark, on your hardware, with your prompts.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>CodeMender Doesn't Work Without a Skeptic</title>
      <dc:creator>Rome</dc:creator>
      <pubDate>Sat, 23 May 2026 22:52:05 +0000</pubDate>
      <link>https://dev.to/rome1/codemender-doesnt-work-without-a-skeptic-1dij</link>
      <guid>https://dev.to/rome1/codemender-doesnt-work-without-a-skeptic-1dij</guid>
      <description>&lt;p&gt;The headlines at Google I/O 2026 went where you'd expect — Gemini 3.5, the new intelligent eyewear shipping in the fall, Antigravity 2.0 with its new CLI and subagents. CodeMender — Google DeepMind's autonomous code-security agent — got folded into Agent Platform almost as a footnote. Most coverage moved on.&lt;/p&gt;

&lt;p&gt;That's a mistake. CodeMender is the most architecturally significant announcement of the event, and not because it finds bugs. Tools have found bugs for decades. It's significant because of &lt;em&gt;how&lt;/em&gt; it claims to find them, and because of what its architecture quietly admits about where AI security is actually heading.&lt;/p&gt;

&lt;p&gt;Read Google's own &lt;a href="https://blog.google/innovation-and-ai/technology/safety-security/ai-security-frontier-strategy-tools/" rel="noopener noreferrer"&gt;description&lt;/a&gt; of how CodeMender works:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;CodeMender is an AI-powered agent utilizing the advanced reasoning capabilities of our Gemini models to automatically fix critical code vulnerabilities. […] These patches are then routed to specialized "critique" agents, which act as automated peer reviewers, rigorously validating the patch for correctness, security implications and adherence to code standards before it's proposed for final human sign-off.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read that twice. A reasoning model proposes a patch. A &lt;em&gt;separate&lt;/em&gt; "critique" agent — Google's own term — sits behind it and rigorously validates before anything reaches a human. CodeMender isn't a model. It's a loop. And the critique agent is doing the load-bearing work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitch that has been DOA for eighteen months
&lt;/h2&gt;

&lt;p&gt;Every AI-security vendor since 2024 has been selling roughly the same pitch: large language models will find vulnerabilities faster than humans, with broader coverage and cheaper cost per finding. The demos look magical — a model reads a function, mumbles something about CWE-89, and produces a patch.&lt;/p&gt;

&lt;p&gt;What the demos don't show is what happens at scale. The genuine signal those tools surface is real and useful — that's the upside that makes the category viable. But security analyst attention is the scarcest resource in the company. After two or three findings turn out to be noise, the analyst marks the &lt;em&gt;next&lt;/em&gt; one as noise too — independent of whether it's real. Trust falls off a cliff and does not amortize back. This isn't speculative; it's the binding constraint on any scanner's adoption curve.&lt;/p&gt;

&lt;p&gt;Static analyzers shipped with well-documented false-positive problems for two decades; the LLM-on-top wave inherited and often amplified them. Snyk's &lt;a href="https://snyk.io/platform/deepcode-ai/" rel="noopener noreferrer"&gt;DeepCode AI&lt;/a&gt; describes itself as combining "symbolic and generative AI, several machine learning methods, and the expertise of Snyk security researchers", which is explicitly a hybrid stack, not a single model. GitHub's &lt;a href="https://github.blog/news-insights/product-news/secure-code-more-than-three-times-faster-with-copilot-autofix/" rel="noopener noreferrer"&gt;Copilot Autofix&lt;/a&gt; "leverages the CodeQL engine, GPT-4o, and a combination of heuristics", which is again an LLM embedded in a deterministic pipeline, not a model alone. The products that struggled were the ones that shipped a single LLM with no scaffolding around it. The products that survived were the ones that figured out the scaffolding was the real problem to solve.&lt;/p&gt;

&lt;p&gt;The model wasn't ever the bottleneck. Trust calibration was.&lt;/p&gt;

&lt;h2&gt;
  
  
  What CodeMender quietly admits
&lt;/h2&gt;

&lt;p&gt;The architecture description is a structural concession. Auto-patching makes the false-positive problem worse, not better — a wrong patch doesn't just waste analyst time, it changes the codebase, breaks the build, sometimes introduces a new bug downstream. If "find a bug" is hard to ship with high enough precision, "find and fix a bug" is harder still.&lt;/p&gt;

&lt;p&gt;You cannot solve this by training a better generator. Train the model to be more conservative and recall collapses. Train it to be more aggressive and precision collapses. The recall-precision frontier of a single model is fundamentally bounded by its prompt and its training distribution.&lt;/p&gt;

&lt;p&gt;What you can do is split the job. One model proposes. Another model disposes. The proposer is recall-biased: it throws a wide net, surfaces everything that smells like a vulnerability, optimizes for not-missing. The critique agent is precision-biased: its job is to take each finding and try to kill it. Is this exploit actually reachable? Is there a guard upstream the proposer missed? Is this a known false-positive pattern? Does the proposed patch break existing tests?&lt;/p&gt;

&lt;p&gt;The critique agent &lt;em&gt;is&lt;/em&gt; the architecture. Without it, the proposer's output is unshippable. With it, the system can claim what CodeMender does claim: that it surfaces validated patches for human review rather than a firehose of speculative ones.&lt;/p&gt;
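
&lt;p&gt;The shape of that loop can be sketched in a few lines. This is a hypothetical illustration of the propose-then-critique pattern, not CodeMender's implementation: the &lt;code&gt;Finding&lt;/code&gt; fields, the two boolean checks, and every function name are stand-ins for what would be model calls and tool runs in a real system.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Finding:
    location: str
    claim: str
    reachable: bool   # stand-in for reachability analysis of the flagged code
    tests_pass: bool  # stand-in for running the test suite against the patch


def propose(candidates):
    """Recall-biased stage: surface every candidate, filtering nothing."""
    return list(candidates)


def critique(finding):
    """Precision-biased stage: try to kill the finding; return the objections."""
    objections = []
    if not finding.reachable:
        objections.append("no reachable path to the flagged code")
    if not finding.tests_pass:
        objections.append("proposed patch breaks existing tests")
    return objections


def review_queue(candidates):
    """Split proposals into those that survive critique and those that do not."""
    accepted, rejected = [], []
    for finding in propose(candidates):
        objections = critique(finding)
        if objections:
            rejected.append((finding, objections))  # kept as an audit trail
        else:
            accepted.append(finding)
    return accepted, rejected


if __name__ == "__main__":
    pool = [
        Finding("db.py:42", "SQL injection", reachable=True, tests_pass=True),
        Finding("util.py:7", "path traversal", reachable=False, tests_pass=True),
    ]
    accepted, rejected = review_queue(pool)
    print("for human review:", [f.location for f in accepted])
    print("rejected with reasons:", [(f.location, why) for f, why in rejected])
```

&lt;p&gt;Only the accepted list reaches a human, and the rejected list keeps the critic's stated reasons, so the filter itself stays inspectable.&lt;/p&gt;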

&lt;p&gt;That decoupling — the model that finds the bug is not the model that decides the bug is real — is the entire ballgame.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a two-model loop composes better than one smart model
&lt;/h2&gt;

&lt;p&gt;Proposer-skeptic decomposition isn't new. Math reasoning uses proposer-verifier loops. Multi-agent debate frames the same idea. RLHF with adversarial critics is the training-time version. Classical ensemble methods have long shown that a set of diverse weak learners can outperform a single strong model, which matters most when error costs are asymmetric.&lt;/p&gt;

&lt;p&gt;What's &lt;em&gt;security-specific&lt;/em&gt; is that the asymmetry isn't a nice-to-have — it's the binding constraint that makes monolithic models structurally unshippable. When error costs are asymmetric, you don't want a single model trading off precision against recall on one axis. You want two models trading off against each other on different axes.&lt;/p&gt;

&lt;p&gt;Four concrete payoffs from the decomposition:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Different models for different roles.&lt;/strong&gt; The discovery model can be your biggest, most expensive reasoning model, because it runs once per file. The critique model can be smaller, faster, cheaper, running many times per finding. Cost-latency becomes a tunable knob, not a constraint baked into the architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Independent post-training.&lt;/strong&gt; Discover a new false-positive pattern the system keeps surfacing? You don't retrain the giant discovery model. You add an adversarial example to the critique distribution and ship.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hot-swappable foundations.&lt;/strong&gt; As foundation models improve, you swap them in without rebuilding the loop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auditable disagreements.&lt;/strong&gt; When the proposer says "vuln" and the critique says "no," you have a structured artifact — a transcript of an argument — a human can inspect. This is the explainability story monolithic models can't tell.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
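&lt;p&gt;To make the last payoff concrete, here is a minimal sketch of what an auditable disagreement record could look like. The &lt;code&gt;Disagreement&lt;/code&gt; schema, its field names, and the sample values are all hypothetical, chosen for illustration; nothing here is taken from CodeMender.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Disagreement:
    """One proposer claim paired with the critique's answer (hypothetical schema)."""
    finding_id: str
    proposer_claim: str
    critique_verdict: str    # "keep" or "reject"
    critique_reasoning: str


def to_audit_line(record: Disagreement) -> str:
    # One JSON object per line, so reviewers can grep or diff the log.
    return json.dumps(asdict(record), sort_keys=True)


record = Disagreement(
    finding_id="F-001",
    proposer_claim="user input reaches os.system() unsanitized",
    critique_verdict="reject",
    critique_reasoning="input is checked against an allowlist two calls upstream",
)
line = to_audit_line(record)
print(line)
```

&lt;p&gt;Storing the critique's reasoning next to the proposer's claim is what turns a silent filter into something a human can inspect later.&lt;/p&gt;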

&lt;h2&gt;
  
  
  The pattern, in code
&lt;/h2&gt;

&lt;p&gt;The naive single-pass setup that has powered two years of demos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;discovery_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt;  &lt;span class="c1"&gt;# FP-saturated, analyst trust gone in two findings
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decomposed loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;discovery_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DISCOVERY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# recall-biased: "find anything that smells wrong"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;validated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;DISCOVERY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;critique_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;adjudicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CRITIQUE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# adversarial: "try to kill this finding"
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;surrounding_code&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kept&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PRECISION_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_verdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;patches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;patch_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;propose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;patches&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;regression_tests_pass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
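&lt;p&gt;The fragments above lean on models and helpers that are not defined here. For readers who want something they can run, the following is a self-contained sketch of the same propose-then-adjudicate loop with both models replaced by trivial string-matching stubs. The functions &lt;code&gt;discover&lt;/code&gt; and &lt;code&gt;adjudicate&lt;/code&gt;, the threshold values, and the sample input are assumptions made for illustration, not anyone's production logic, and the patch stage is left out.&lt;/p&gt;

```python
from dataclasses import dataclass

DISCOVERY_THRESHOLD = 0.3   # hypothetical: drop only the weakest candidates
PRECISION_THRESHOLD = 0.8   # hypothetical: the critique must be confident to keep


@dataclass(frozen=True)
class Finding:
    rule: str
    line: int
    confidence: float


@dataclass(frozen=True)
class Verdict:
    kept: bool
    confidence: float
    reason: str


def discover(code: str) -> list[Finding]:
    """Stub proposer: flags every line that mentions eval( (recall-biased)."""
    findings = []
    for number, text in enumerate(code.splitlines(), start=1):
        if "eval(" in text:
            findings.append(Finding("eval-injection", number, 0.6))
    return findings


def adjudicate(finding: Finding, code: str) -> Verdict:
    """Stub critique: rejects findings where eval() receives a string literal."""
    text = code.splitlines()[finding.line - 1]
    if 'eval("' in text or "eval('" in text:
        return Verdict(False, 0.95, "argument is a constant string")
    return Verdict(True, 0.9, "argument is not a literal")


def review(code: str) -> list[tuple[Finding, Verdict]]:
    validated = []
    for finding in discover(code):
        if finding.confidence >= DISCOVERY_THRESHOLD:
            verdict = adjudicate(finding, code)
            if verdict.kept and verdict.confidence > PRECISION_THRESHOLD:
                validated.append((finding, verdict))
    return validated


SAMPLE = 'x = eval("1 + 1")\ny = eval(user_input)\n'
results = review(SAMPLE)
print(results)
```

&lt;p&gt;The proposer flags both lines; the critique rejects the first because its argument is a constant, so only the second finding survives.&lt;/p&gt;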



&lt;p&gt;CodeMender, running on Google's most capable reasoning models, still needs the critique stage. That suggests the structural decomposition does more for shippability than another order of magnitude in foundation-model parameters would.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the convergence is
&lt;/h2&gt;

&lt;p&gt;This pattern is showing up everywhere a serious security tool ships. Purdue's &lt;a href="https://github.com/PurCL/LLMSCAN" rel="noopener noreferrer"&gt;LLMSCAN&lt;/a&gt; separates concerns explicitly: syntactic facts come from Tree-sitter parsing (verifiable, deterministic); the LLM handles only the semantic facts parsing can't reach. The insight isn't quite "skeptic-after-proposer" — it's "ground the proposer in structure the LLM can't hallucinate." Same recall-precision frontier, attacked one tier earlier.&lt;/p&gt;
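&lt;p&gt;A rough way to picture that split, using Python's standard-library &lt;code&gt;ast&lt;/code&gt; module as a stand-in for Tree-sitter and a hard-coded rule as a stand-in for the LLM. This is not LLMSCAN's code; the function names and the sample source are assumptions chosen to show the idea that syntactic facts are extracted deterministically and only the remaining judgement is delegated.&lt;/p&gt;

```python
import ast


def syntactic_facts(source: str) -> list[dict]:
    """Deterministic step: collect every call site with its callee name and line."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            facts.append({
                "callee": ast.unparse(node.func),
                "line": node.lineno,
                "all_args_constant": all(
                    isinstance(arg, ast.Constant) for arg in node.args
                ),
            })
    return sorted(facts, key=lambda fact: fact["line"])


def semantic_judgement(fact: dict) -> bool:
    """Stand-in for the LLM: a real system would send the fact plus context to a model."""
    risky = fact["callee"] in {"eval", "os.system"}
    return risky and not fact["all_args_constant"]


SOURCE = "import os\nos.system('ls')\nos.system(cmd)\nprint(eval(expr))\n"
flagged = [fact for fact in syntactic_facts(SOURCE) if semantic_judgement(fact)]
print(flagged)
```

&lt;p&gt;Because the call sites come from the parser, the judgement step can only ever comment on code that actually exists in the file.&lt;/p&gt;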

&lt;p&gt;The &lt;a href="https://github.com/gadievron/raptor" rel="noopener noreferrer"&gt;RAPTOR framework&lt;/a&gt; goes the other direction. Candidate findings from classical pattern-matching (Semgrep, CodeQL) flow through progressive validation gates that include reachability verification via Z3 SMT solving and an exploitation-feasibility check before anything surfaces. A four-stage skeptic stack behind a recall-biased finder.&lt;/p&gt;
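&lt;p&gt;The gate idea can be sketched generically as an ordered list of checks where the first failure kills the finding. The code below is not RAPTOR's implementation and does not call Z3; the gate functions are placeholders that read hypothetical boolean fields from a finding, purely to show the control flow.&lt;/p&gt;

```python
from typing import Callable, Optional

Gate = Callable[[dict], bool]


def is_reachable(finding: dict) -> bool:
    # Placeholder for a solver-backed reachability check.
    return finding.get("reachable", False)


def is_exploitable(finding: dict) -> bool:
    # Placeholder for an exploitation-feasibility check.
    return finding.get("attacker_controlled", False)


GATES: list[tuple[str, Gate]] = [
    ("reachability", is_reachable),
    ("exploitability", is_exploitable),
]


def first_failed_gate(finding: dict) -> Optional[str]:
    """Return the name of the first gate that kills the finding, or None if it survives."""
    for name, gate in GATES:
        if not gate(finding):
            return name
    return None


candidates = [
    {"id": "A", "reachable": True, "attacker_controlled": True},
    {"id": "B", "reachable": False, "attacker_controlled": True},
    {"id": "C", "reachable": True, "attacker_controlled": False},
]
outcome = {c["id"]: first_failed_gate(c) for c in candidates}
print(outcome)
```

&lt;p&gt;Recording which gate rejected a finding, rather than only whether it survived, keeps the rejection reason available for later review.&lt;/p&gt;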

&lt;p&gt;CodeMender is the most prominent expression of a pattern that has been forming quietly for over a year. The convergence is the signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual moat
&lt;/h2&gt;

&lt;p&gt;Here is where the I/O 2026 announcement gets genuinely interesting, and where most takes on it will miss the point.&lt;/p&gt;

&lt;p&gt;The architecture itself is not a moat. Anyone can stack two LLMs. The recipe is now public — Google just published it.&lt;/p&gt;

&lt;p&gt;The moat is the &lt;em&gt;critique model's training distribution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Think about what makes a critique agent useful in production. It is not the cleverness of the prompt. It is the catalog of adjudicated false-positive patterns the model has been exposed to — the corpus of "the proposer flagged this, a human checked it, the human said: not a vuln, here's why" examples. That corpus is built by burning analyst trust the first time and never the second. It compounds slowly. It cannot be scraped. It is utterly specific to the languages, frameworks, and idioms of the customer base that produced it.&lt;/p&gt;

&lt;p&gt;CodeMender, in its first six months, had upstreamed &lt;a href="https://siliconangle.com/2025/10/06/google-deepmind-unveils-codemender-ai-agent-autonomously-patches-software-vulnerabilities/" rel="noopener noreferrer"&gt;72 security fixes&lt;/a&gt; to large open-source codebases, including projects of over four million lines. The output is impressive on its own. What matters more for Google's strategic position is what those 72 fixes &lt;em&gt;built&lt;/em&gt;: a labeled corpus of "proposed patch, critique verdict, human sign-off" tuples that nobody else has and nobody else can quickly acquire. Every accepted fix narrows the gap between Google's critique distribution and the actual statistical landscape of real vulnerabilities in deployed code. Every rejected patch sharpens the critique's nose for the specific shape of plausible-but-wrong fixes.&lt;/p&gt;

&lt;p&gt;CodeMender's moat is the critique agent's &lt;em&gt;experience&lt;/em&gt;, not its capability. The architecture is the cost of entry. The corpus is the durable asset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-remediation as forcing function
&lt;/h2&gt;

&lt;p&gt;Here is the counterintuitive corollary. Most observers will read CodeMender as a find-and-patch tool that prioritizes auto-fix because Google wants the dramatic demo. The deeper reason is structural: auto-fix is the only forcing function that exposes whether a finding was &lt;em&gt;real&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think about the feedback signal. A patch that doesn't compile is a confirmed false positive. A patch that compiles but breaks behavioral tests is a confirmed false positive. A patch that compiles, passes tests, but gets rejected by a maintainer with "this isn't a security issue" is a confirmed false positive — and the maintainer's reasoning becomes training data for the next iteration of the critique agent.&lt;/p&gt;
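&lt;p&gt;As a sketch of how that signal might be captured, the snippet below maps each patch outcome to a label for the critique's training set, following the rule stated above. The &lt;code&gt;PatchOutcome&lt;/code&gt; values, the &lt;code&gt;FeedbackExample&lt;/code&gt; record, and the choice to treat a merged patch as a true positive are assumptions for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class PatchOutcome(Enum):
    COMPILE_FAILED = "compile_failed"
    TESTS_FAILED = "tests_failed"
    MAINTAINER_REJECTED = "maintainer_rejected"
    MERGED = "merged"


@dataclass(frozen=True)
class FeedbackExample:
    finding_id: str
    label: str       # "false_positive" or "true_positive"
    evidence: str    # why the label was assigned


def label_from_outcome(
    finding_id: str, outcome: PatchOutcome, note: str = ""
) -> FeedbackExample:
    """Map a patch outcome to a critique training label."""
    evidence = note or outcome.value
    if outcome is PatchOutcome.MERGED:
        # Assumption: an accepted patch counts as confirmation of the finding.
        return FeedbackExample(finding_id, "true_positive", evidence)
    return FeedbackExample(finding_id, "false_positive", evidence)


examples = [
    label_from_outcome("F-1", PatchOutcome.COMPILE_FAILED),
    label_from_outcome("F-2", PatchOutcome.MAINTAINER_REJECTED, "not a security issue"),
    label_from_outcome("F-3", PatchOutcome.MERGED),
]
print([(e.finding_id, e.label) for e in examples])
```

&lt;p&gt;The maintainer's note travels with the label, which is the part a bare "won't fix" status would lose.&lt;/p&gt;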

&lt;p&gt;Compare this to a detection-only tool that surfaces findings to humans and asks them to triage. The feedback is delayed, ambiguous, often never logged — a JIRA ticket marked "won't fix" doesn't tell the model &lt;em&gt;why&lt;/em&gt;. The find-then-fix architecture closes the loop with a far higher-fidelity signal.&lt;/p&gt;

&lt;p&gt;Auto-remediation isn't a UX feature riding on top of detection. It is the validation mechanism that makes detection &lt;em&gt;improve&lt;/em&gt; over time. Tools that lean into the find-and-fix loop will out-improve tools that try to perfect detection in isolation, because they will be ingesting orders of magnitude more high-quality false-positive feedback per unit of analyst time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for the rest of the stack
&lt;/h2&gt;

&lt;p&gt;A few extrapolations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The unit of value in AI security is the loop, not the model.&lt;/strong&gt; Disciplined validation with a mediocre foundation model will out-ship a state-of-the-art foundation model wrapped in a naive single-pass setup. Vendors who think the model is the product are about to get out-shipped by vendors who think the loop is the product.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Critique-model training data is the new ImageNet.&lt;/strong&gt; The next round of M&amp;amp;A in security tooling will not be about who has the best static analyzer. It will be about who has the deepest labeled corpus of validated-and-rejected findings. The companies worth buying are the ones whose loops have been running longest with real analysts in them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open-sourcing the loop won't kill the moat.&lt;/strong&gt; Google could publish CodeMender's full architecture tomorrow. The architecture isn't the asset. The accumulated critique distribution — the millions of micro-judgments embedded after a year-plus of running against real OSS — does not transfer with the source code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The real headline
&lt;/h2&gt;

&lt;p&gt;The story buried in CodeMender's I/O 2026 footnote is that Google, the company with the resources to build any single model it wants, has decided the path forward in code security runs through architecture &lt;em&gt;and&lt;/em&gt; through accumulated experience, not through model capability alone. They did not announce a bigger reasoning model with better vuln-detection benchmarks. They announced a loop that has been running for over a year, learning from real upstream contributions, and getting more useful with every accepted patch.&lt;/p&gt;

&lt;p&gt;The model is not the product. The critique agent is not the product. &lt;em&gt;The corpus the critique agent has been trained on&lt;/em&gt; is the product. Everything else is infrastructure around that compounding asset.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Notes and sources:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Google's &lt;a href="https://developers.googleblog.com/all-the-news-from-the-google-io-2026-developer-keynote/" rel="noopener noreferrer"&gt;I/O 2026 developer keynote roundup&lt;/a&gt; — official announcements including Antigravity 2.0 and the CodeMender integration into Agent Platform.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Google DeepMind's &lt;a href="https://blog.google/innovation-and-ai/technology/safety-security/ai-security-frontier-strategy-tools/" rel="noopener noreferrer"&gt;CodeMender announcement&lt;/a&gt; — source for the architecture quote and the description of "critique" agents as automated peer reviewers (October 2025).&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://siliconangle.com/2025/10/06/google-deepmind-unveils-codemender-ai-agent-autonomously-patches-software-vulnerabilities/" rel="noopener noreferrer"&gt;SiliconANGLE coverage&lt;/a&gt; — the 72-fix figure and the codebase-size detail.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;The New Stack's &lt;a href="https://thenewstack.io/google-io-antigravity-codemender-ai-agentic/" rel="noopener noreferrer"&gt;Antigravity / CodeMender I/O 2026 writeup&lt;/a&gt; — framing on the agentic platform positioning.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://github.com/PurCL/LLMSCAN" rel="noopener noreferrer"&gt;LLMSCAN&lt;/a&gt; (Purdue) and &lt;a href="https://github.com/gadievron/raptor" rel="noopener noreferrer"&gt;RAPTOR&lt;/a&gt; — adjacent architectures converging on the same recall-then-precision pattern.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Used AI for research and writing.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>googleiochallenge</category>
    </item>
    <item>
      <title>Interactive 3D Office Desk</title>
      <dc:creator>Rome</dc:creator>
      <pubDate>Fri, 04 Jul 2025 04:09:37 +0000</pubDate>
      <link>https://dev.to/rome1/interactive-3d-office-desk-1b4i</link>
      <guid>https://dev.to/rome1/interactive-3d-office-desk-1b4i</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for &lt;a href="https://dev.to/challenges/frontend/axero"&gt;Frontend Challenge: Office Edition sponsored by Axero, Holistic Webdev: Office Space&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Submission by dev username: member_01928ffe&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;For this challenge, I decided to step away from traditional 2D layouts and build something that felt truly immersive and personal: a fully interactive 3D digital desk space. My goal was to create more than just a webpage; I wanted to build an experience. This virtual office isn't just a static image—it's a dynamic environment where every object tells a story and serves a purpose.&lt;/p&gt;

&lt;p&gt;The scene starts with a "hero" view of a complete desk setup. From there, the user can scroll down to navigate through a series of full-screen drawers, each dedicated to a different project. But the interaction doesn't stop there. Almost every item on the desk is clickable, opening up detailed pop-up modals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Monitor: Brings up a modal with a snippet of code.&lt;/li&gt;
&lt;li&gt;Sticky Notes: Each note opens a unique, interactive to-do list.&lt;/li&gt;
&lt;li&gt;The Calendar: Reveals a fully functional, navigable calendar widget.&lt;/li&gt;
&lt;li&gt;Mail: Opens a ready-to-use email editor.&lt;/li&gt;
&lt;li&gt;Name Card: Displays a weekly timesheet and a sign-out option.&lt;/li&gt;
&lt;li&gt;The Camera: Shows a gallery of photos with captions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project was an ambitious attempt to blur the lines between a user interface and a tangible, explorable space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;You can &lt;a href="https://codepen.io/Rome-1/pen/PwqrmMo" rel="noopener noreferrer"&gt;experience the live version&lt;/a&gt; of the project.&lt;/p&gt;

&lt;p&gt;Below is an embedded version of the interactive desk scene. Click on the objects, scroll down to explore the drawers, and see what you can discover!&lt;/p&gt;

&lt;p&gt;&lt;iframe height="600" src="https://codepen.io/Rome-1/embed/PwqrmMo?height=600&amp;amp;default-tab=result&amp;amp;embed-version=2"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  See the Pen &lt;a href="https://codepen.io/Rome-1/pen/PwqrmMo" rel="noopener noreferrer"&gt;&lt;br&gt;
  Interactive 3D Office Desk in Three.js&lt;/a&gt; by Rome (&lt;a href="https://codepen.io/Rome-1" rel="noopener noreferrer"&gt;@Rome-1&lt;/a&gt;)&lt;br&gt;
  on &lt;a href="https://codepen.io" rel="noopener noreferrer"&gt;CodePen&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Journey
&lt;/h2&gt;

&lt;p&gt;This project was a deeply meaningful journey for me. For a while, I'd been looking for a challenge that would not only let me flex my existing web design and frontend skills but also push me into new, unfamiliar territory. This hackathon was the perfect opportunity.&lt;/p&gt;

&lt;p&gt;I started with a simple photo of a desk and a question: "What if this was real?" That question led me down the rabbit hole of WebGL and, specifically, Three.js. The learning curve was steep. Moving from the predictable world of CSS layouts to the boundless 3D space of a &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; element was both daunting and exhilarating. I spent many hours learning about scenes, cameras, lighting, materials, and raycasting for interactivity.&lt;/p&gt;

&lt;p&gt;One of the choices I'm most proud of is the scrolling navigation. Instead of simple anchor links, I wanted the transition from the main desk to the project drawers to feel like a seamless, cinematic camera movement. Implementing this required custom logic to control the camera's position with the mouse wheel, creating a unique, webpage-like feel within a 3D environment.&lt;/p&gt;

&lt;p&gt;Every interactive pop-up was its own mini-project. Building the dynamic calendar, the to-do lists where you can check off items, and ensuring that all modals felt polished and prevented background interactions was a fantastic exercise in detail-oriented development.&lt;/p&gt;

&lt;p&gt;This hackathon has been more than just a coding challenge; it's been a reaffirmation of my passion for creating engaging and beautiful user experiences. It has reignited my confidence in my design sensibilities and given me a powerful new tool in Three.js to bring even more ambitious ideas to life. I walk away from this not just with a project I'm incredibly proud of, but with a renewed sense of what's possible on the web.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>frontendchallenge</category>
      <category>css</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
