<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tejas Patil</title>
    <description>The latest articles on DEV Community by Tejas Patil (@tejas164321).</description>
    <link>https://dev.to/tejas164321</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828498%2Ff99fdd48-f37e-48e0-abdd-ba64eb016704.jpg</url>
      <title>DEV Community: Tejas Patil</title>
      <link>https://dev.to/tejas164321</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tejas164321"/>
    <language>en</language>
    <item>
      <title>Gemma 4's Multi-Token Prediction Changes the Economics of Running AI Locally — Here's the Full Breakdown</title>
      <dc:creator>Tejas Patil</dc:creator>
      <pubDate>Sun, 24 May 2026 06:34:05 +0000</pubDate>
      <link>https://dev.to/tejas164321/gemma-4s-multi-token-prediction-changes-the-economics-of-running-ai-locally-heres-the-full-2o36</link>
      <guid>https://dev.to/tejas164321/gemma-4s-multi-token-prediction-changes-the-economics-of-running-ai-locally-heres-the-full-2o36</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a hard wall that every developer hits when they try to run a capable AI model locally. It's not the GPU. It's not the RAM. It's the &lt;strong&gt;memory bandwidth&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Standard autoregressive generation, the way every LLM has worked since GPT-2, does one thing at a time: predict a token, feed that token back into the model, predict the next one. Each step requires streaming gigabytes of weights from memory to the processor. On a MacBook, an RTX 4080, or a cloud instance you're paying $0.40/hour for, this shuffle is the bottleneck. More VRAM doesn't fix it. Faster GPUs barely dent it. It's a structural constraint baked into how transformers generate text.&lt;/p&gt;
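
&lt;p&gt;To put a rough number on that wall, here is a back-of-envelope sketch. It assumes decoding is purely bandwidth-bound and that every generated token streams the full set of weights once. The 20 GB model size and the 700 GB/s bandwidth figure are illustrative assumptions, not measurements of any particular GPU.&lt;/p&gt;

```python
# Back-of-envelope ceiling on decode speed when weight traffic dominates.
# All hardware figures below are illustrative assumptions, not measurements.

def max_tokens_per_second(weight_gb, bandwidth_gb_per_s):
    """Upper bound: every generated token streams all weights once."""
    return bandwidth_gb_per_s / weight_gb

# Hypothetical setup: a 20 GB quantized dense model on a GPU with an
# assumed 700 GB/s of memory bandwidth.
ceiling = max_tokens_per_second(weight_gb=20.0, bandwidth_gb_per_s=700.0)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

&lt;p&gt;Real throughput lands below this ceiling, but the shape of the limit is the point: the only ways past it are moving fewer bytes per token or getting more than one token out of each pass over the weights.&lt;/p&gt;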

&lt;p&gt;On May 5, 2026, Google shipped the fix. &lt;strong&gt;Multi-Token Prediction (MTP) drafters for the entire Gemma 4 family&lt;/strong&gt; — and the numbers are real: up to &lt;strong&gt;3x faster inference, zero quality loss&lt;/strong&gt;, Apache 2.0 licensed, works with Ollama, vLLM, Hugging Face, MLX, SGLang, and LiteRT-LM out of the box.&lt;/p&gt;

&lt;p&gt;This is the most important thing that happened to local AI this month. Let me show you exactly why — and help you figure out which Gemma 4 model is actually right for your use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  First: The Four Models Explained
&lt;/h2&gt;

&lt;p&gt;Gemma 4 isn't one model. It's a family of four, each designed for a genuinely different deployment context. Getting the model selection right matters as much as understanding MTP.&lt;/p&gt;

&lt;h3&gt;
  
  
  E2B — The Pocket Rocket
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"E" stands for "effective" parameters.&lt;/strong&gt; The E2B weighs in at roughly 1.5GB at 4-bit, runs on modern Android phones via Google AICore, works completely offline, and natively understands audio and images. It has a &lt;strong&gt;128K context window&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The trick behind its size efficiency is &lt;strong&gt;Per-Layer Embeddings (PLE)&lt;/strong&gt;: instead of stacking more transformer layers, each decoder layer gets its own small embedding table that is looked up per token. The static weight footprint is larger than the "2B" label suggests, but the &lt;em&gt;active compute&lt;/em&gt; stays tiny. The result: a model that can live on a Raspberry Pi or a mid-range Android phone and still reason across an entire book chapter in one shot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; you're building a mobile app, an offline tool, an IoT integration, or anything that must run without a network connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  E4B — The Edge Sweet Spot
&lt;/h3&gt;

&lt;p&gt;Same architecture philosophy as E2B, more headroom. Runs in ~5GB RAM at 4-bit, ~15GB at full 16-bit. Also supports audio and image input natively. Also 128K context.&lt;/p&gt;

&lt;p&gt;The E4B hits the crossover point where capability meets practicality for most developer laptops. You're not giving up much compared to the bigger models for typical tasks — coding assistance, document Q&amp;amp;A, image analysis — and you keep the low-latency edge advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; you're building a local-first desktop app, a developer tool, or anything running on a laptop that needs genuine multimodal capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  26B A4B — The Efficiency Cheat Code
&lt;/h3&gt;

&lt;p&gt;This is the sneaky one. &lt;strong&gt;26 billion total parameters, but only about 4 billion are active for any given token.&lt;/strong&gt; It's a Mixture-of-Experts (MoE) architecture: a router sends each token through the small subset of experts (about 4B parameters' worth) most relevant to it and skips the rest. All 26B must still be loaded into memory (~18GB at 4-bit), but the &lt;em&gt;compute per token&lt;/em&gt; is closer to that of a 4B model.&lt;/p&gt;
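
&lt;p&gt;If MoE routing is new to you, the sketch below shows the general top-k idea in plain Python. It is not Gemma 4's actual router: the expert count, the router scores, and &lt;code&gt;top_k=2&lt;/code&gt; are all made-up values for illustration.&lt;/p&gt;

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize over the chosen experts so their weights sum to 1.
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

# Hypothetical router output for one token over 8 experts.
scores = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 0.5]
selected = route_token(scores, top_k=2)
print(selected)  # only these experts run for this token; the other 6 are skipped
```

&lt;p&gt;Every expert still has to sit in memory because the next token may pick a different pair, which is why the memory bill follows the total parameter count while the compute bill follows the active one.&lt;/p&gt;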

&lt;p&gt;The result: &lt;strong&gt;#6 open model in the world&lt;/strong&gt; on the Arena AI leaderboard, outcompeting models 20x its size, running at 4B-like speeds, with a &lt;strong&gt;256K context window&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; you have ~20GB VRAM (RTX 3090, 4090, A10G) and want near-frontier capability with fast inference. This is the production sweet spot for most self-hosted deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  31B Dense — The Flagship
&lt;/h3&gt;

&lt;p&gt;The 31B is currently &lt;strong&gt;#3 open model in the world&lt;/strong&gt; on the Arena AI text leaderboard. Dense architecture (no MoE routing), 256K context, 20GB at 4-bit or 34GB at 8-bit. The most capable in the family, the most hardware-hungry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; you need maximum quality and have the iron to back it up — A100, H100, multi-GPU setups, or high-memory cloud instances.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive: How MTP Actually Works
&lt;/h2&gt;

&lt;p&gt;Now for the part that changes everything.&lt;/p&gt;

&lt;p&gt;The core insight behind Multi-Token Prediction is that the big, slow target model doesn't need to do all the work. A small, fast &lt;strong&gt;drafter model&lt;/strong&gt; can predict several tokens ahead speculatively — and the target model can &lt;strong&gt;verify all of them in parallel&lt;/strong&gt; in a single forward pass.&lt;/p&gt;

&lt;p&gt;Here's the pipeline step by step:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Draft.&lt;/strong&gt; The drafter (a compact model purpose-built for this) takes the current sequence and rapidly predicts 4–8 tokens ahead. This is cheap: the drafter is small, and it runs quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Verify.&lt;/strong&gt; The full target model (E2B, 26B A4B, whatever you're using) processes all the drafted tokens simultaneously in one forward pass. It checks each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Accept or Correct.&lt;/strong&gt; Drafted tokens are accepted left to right for as long as the target model agrees with them, and each accepted token is effectively free. At the first disagreement, the target's own token is used for that position, every drafted token after it is discarded, and the drafter starts fresh from there. Even a round with a rejection isn't wasted: the verification pass always yields at least one correct token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Net result:&lt;/strong&gt; The target model runs far fewer forward passes per output token. The memory bandwidth bottleneck still exists, but you hit it far less often. How much you gain depends on how often the drafts are accepted, which is where the "up to 3x" figure comes from.&lt;/p&gt;
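
&lt;p&gt;The three steps can be sketched in a few lines of Python. This is a toy, not Gemma 4's implementation: it uses greedy decoding (sampling-based runtimes use a probabilistic acceptance rule instead), and &lt;code&gt;target_next&lt;/code&gt; and &lt;code&gt;draft_next&lt;/code&gt; are hypothetical stand-ins for the two models that simply map a token list to the next token.&lt;/p&gt;

```python
def speculative_step(target_next, draft_next, tokens, k):
    """One draft-verify-accept round with greedy decoding.

    Returns the tokens appended this round (between 1 and k + 1 of them).
    """
    # Step 1: the drafter proposes k tokens autoregressively.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(tokens + drafted))

    # Step 2: the target checks every drafted position. A real runtime does
    # this in one batched forward pass; this loop only mimics the outcome.
    accepted = []
    for guess in drafted:
        truth = target_next(tokens + accepted)
        if guess != truth:
            # Step 3: first mismatch. Keep the target's token, drop the rest.
            accepted.append(truth)
            return accepted
        accepted.append(guess)

    # Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target_next(tokens + accepted))
    return accepted


# Toy stand-ins (assumptions for illustration): the target counts upward,
# and the drafter agrees except right after any multiple of 4.
def target_next(tokens):
    return tokens[-1] + 1

def draft_next(tokens):
    return tokens[-1] + 1 if tokens[-1] % 4 else 0

tokens = [1]
rounds = 3
for _ in range(rounds):
    tokens += speculative_step(target_next, draft_next, tokens, k=4)

print(tokens)
print(f"{len(tokens) - 1} new tokens in {rounds} target passes")
```

&lt;p&gt;Note that the final sequence is exactly what the target would have produced on its own. The drafter only changes how many target passes it takes to get there, which is the basis of the "zero quality loss" claim.&lt;/p&gt;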

&lt;h3&gt;
  
  
  What Makes Gemma 4's MTP Different
&lt;/h3&gt;

&lt;p&gt;Here's the part that genuinely separates this from what others are doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV cache sharing.&lt;/strong&gt; The Key-Value cache (the model's short-term memory of attention keys and values for the tokens seen so far) is shared between the drafter and the target model. On a memory-constrained device, this is critical: no duplicating data in VRAM, no cache invalidation overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared target activations.&lt;/strong&gt; The drafter doesn't start from scratch. It uses the internal representations — the "activations" — that the target model has already computed in its deeper layers. The drafter is piggybacking on work already done. This makes the draft step faster and more accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official, first-party, Apache 2.0.&lt;/strong&gt; Llama, Qwen, and DeepSeek all train MTP-aware variants. None of them ship official drafter checkpoints. Community drafters exist for those models, but the quality is uneven and the integration is manual. Gemma 4 ships polished, purpose-built drafters as standalone checkpoints on Hugging Face and Kaggle, with runtime support already baked into Ollama, vLLM, SGLang, MLX, and Hugging Face Transformers. It's one config flag, not a research project.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware Math (This Is Where It Gets Interesting)
&lt;/h2&gt;

&lt;p&gt;The 26B A4B model, with MTP enabled, running on a cloud instance with 20GB VRAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instance cost: ~$0.40–$0.80/hour (NVIDIA A10G class)&lt;/li&gt;
&lt;li&gt;MTP throughput improvement: ~2.5–3x over baseline&lt;/li&gt;
&lt;li&gt;Per-token cost at sustained inference: &lt;strong&gt;competitive with GPT-4o mini pricing&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
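
&lt;p&gt;Here is how I'd sanity-check that claim for your own workload. The hourly price is the low end of the range above; the baseline throughput is a made-up placeholder for aggregate tokens per second across all concurrent requests, so swap in a number you have actually measured.&lt;/p&gt;

```python
def cost_per_million_tokens(hourly_usd, tokens_per_second):
    """Instance cost spread over the tokens generated in one hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# Assumption: 400 tokens/s aggregate without MTP. This is a placeholder,
# not a benchmark result; measure your own serving stack before deciding.
baseline_tps = 400.0
for speedup in (1.0, 2.5, 3.0):
    usd = cost_per_million_tokens(0.40, baseline_tps * speedup)
    print(f"{speedup:.1f}x: ${usd:.2f} per million tokens")
```

&lt;p&gt;The formula also assumes the instance is busy the whole hour. At low utilization the per-token cost rises proportionally, so the comparison with a hosted API only holds for sustained load.&lt;/p&gt;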

&lt;p&gt;That's the sentence that changes the build-vs-hosted calculus for a lot of teams. "Competitive with GPT-4o mini" at a capability level that places the model in the top 10 open models globally, on hardware you fully control, with data that never leaves your infrastructure, under a license with no MAU limits and no royalty clauses.&lt;/p&gt;

&lt;p&gt;For mobile: the E2B with MTP runs on Android via Google AI Edge Gallery. The efficient embedder in the E-series models further reduces the compute overhead of the drafter on constrained hardware. A 3x speedup on a phone means the difference between a model that feels native and one that feels like it's thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting It Up (This Takes About 10 Minutes)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With Ollama:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the 26B MoE model + its MTP drafter&lt;/span&gt;
ollama pull gemma4:26b
ollama pull gemma4:26b-mtp-drafter

&lt;span class="c"&gt;# Run with speculative decoding enabled (confirm the flag name with `ollama run --help` for your version)&lt;/span&gt;
ollama run gemma4:26b &lt;span class="nt"&gt;--speculative-model&lt;/span&gt; gemma4:26b-mtp-drafter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With vLLM:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-26B-A4B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
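    &lt;span class="c1"&gt;# Recent vLLM releases take speculative decoding options as a dict&lt;/span&gt;
    &lt;span class="n"&gt;speculative_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-26B-A4B-mtp-drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_speculative_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;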
    &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SamplingParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain speculative decoding in plain English.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Hugging Face Transformers:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-26B-A4B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-26B-A4B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;drafter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-26B-A4B-mtp-drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Walk me through the 256K context window use case:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;assistant_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;drafter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Which Model Should You Actually Pick?
&lt;/h2&gt;

&lt;p&gt;Here's the decision tree I'd give a colleague:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you building for mobile or IoT?&lt;/strong&gt;&lt;br&gt;
→ E2B. No competition. 1.5GB, offline, audio-native, Apache 2.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you building a local-first desktop tool or developer assistant?&lt;/strong&gt;&lt;br&gt;
→ E4B with MTP drafter. Best balance of speed and capability for a laptop GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you self-hosting for a production SaaS or internal tool?&lt;/strong&gt;&lt;br&gt;
→ 26B A4B with MTP drafter. MoE gives you near-31B quality at 4B inference speed. The economics work at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need absolute maximum quality and have A100/H100 infrastructure?&lt;/strong&gt;&lt;br&gt;
→ 31B Dense with MTP drafter. #3 open model in the world. That's the ceiling of what you can run yourself right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Here's my honest take after spending time with Gemma 4 and its MTP release: we just crossed a threshold.&lt;/p&gt;

&lt;p&gt;The 31B model ranking #3 globally among open models is remarkable. But on its own, that would be table stakes: every major lab has a flagship. What makes Gemma 4 significant is the &lt;em&gt;combination&lt;/em&gt;: frontier-level capability at the top, a model that runs in 1.5GB on a phone at the bottom, and MTP drafters that make all of them dramatically faster, all under a license with no strings attached.&lt;/p&gt;

&lt;p&gt;The MTP implementation specifically matters because it signals something about Google's intent. This isn't a capability demo — it's infrastructure. Shipping official, polished, first-party drafter checkpoints that plug into every major serving framework in a single afternoon is the kind of work that benefits the entire open-weight ecosystem, not just Gemma users.&lt;/p&gt;

&lt;p&gt;The other labs will follow. Llama and Qwen will ship official drafters. The bar just moved.&lt;/p&gt;

&lt;p&gt;For developers: the "should I use an API or run it myself" question just got a lot more interesting. For the first time, the answer for a lot of production workloads might genuinely be "run it yourself, it's cheaper, it's faster, and you own the data."&lt;/p&gt;

&lt;p&gt;That is a real change. And Gemma 4 MTP is the specific reason it's true now when it wasn't true six months ago.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/gemma/docs/mtp/overview" rel="noopener noreferrer"&gt;Official MTP Overview&lt;/a&gt; — Google AI for Developers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/gemma/docs/mtp/mtp" rel="noopener noreferrer"&gt;MTP with Hugging Face Transformers&lt;/a&gt; — full implementation guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.com/library/gemma4" rel="noopener noreferrer"&gt;Gemma 4 on Ollama&lt;/a&gt; — one-command local setup&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma-4-26B-A4B" rel="noopener noreferrer"&gt;Gemma 4 on Hugging Face&lt;/a&gt; — all model checkpoints&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/edge/gallery" rel="noopener noreferrer"&gt;Google AI Edge Gallery&lt;/a&gt; — try E2B/E4B MTP on Android or iOS&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;What are you building with Gemma 4? I'm particularly curious who's running the E2B on actual edge hardware — drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>WebMCP Is the Most Important Thing Google Announced at I/O 2026 (And Almost Nobody Is Talking About It)</title>
      <dc:creator>Tejas Patil</dc:creator>
      <pubDate>Sat, 23 May 2026 22:19:01 +0000</pubDate>
      <link>https://dev.to/tejas164321/webmcp-is-the-most-important-thing-google-announced-at-io-2026-and-almost-nobody-is-talking-about-1j8m</link>
      <guid>https://dev.to/tejas164321/webmcp-is-the-most-important-thing-google-announced-at-io-2026-and-almost-nobody-is-talking-about-1j8m</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-io-writing-2026-05-19"&gt;Google I/O Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Right now, every AI agent that tries to use a website is basically doing this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a screenshot&lt;/li&gt;
&lt;li&gt;Guess what's on screen&lt;/li&gt;
&lt;li&gt;Click something and hope&lt;/li&gt;
&lt;li&gt;Take another screenshot&lt;/li&gt;
&lt;li&gt;Repeat until it works or gives up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's the digital equivalent of reading someone's lips through a frosted glass window. It &lt;em&gt;kind of&lt;/em&gt; works. It's slow, expensive, and breaks constantly on anything slightly dynamic — a modal, a lazy-loaded form, a JS-rendered button.&lt;/p&gt;

&lt;p&gt;Google's answer to this is called &lt;strong&gt;WebMCP — Web Model Context Protocol&lt;/strong&gt;. It entered a public origin trial in Chrome 149 on May 19, 2026, during the I/O Developer keynote. And I think it's the most consequential announcement of the whole event — not because of what it does today, but because of what it signals about where the web is going.&lt;/p&gt;

&lt;p&gt;Let me show you what it actually is, how to use it right now, and why I have real questions about whether it will succeed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What WebMCP Actually Does
&lt;/h2&gt;

&lt;p&gt;The idea is simple: instead of making AI agents &lt;em&gt;figure out&lt;/em&gt; what your website does by staring at it, you &lt;em&gt;tell them explicitly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;WebMCP lets you expose structured tools — JavaScript functions and annotated HTML forms — directly to browser-based AI agents. The agent doesn't scrape. It calls your tool like an API.&lt;/p&gt;

&lt;p&gt;There are two ways to implement it:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Declarative API (for forms)
&lt;/h3&gt;

&lt;p&gt;You annotate existing HTML forms with a &lt;code&gt;data-mcp-tool&lt;/code&gt; attribute and a description. The agent reads the annotation and knows exactly what the form does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
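&lt;pre class="highlight html"&gt;&lt;code&gt;&amp;lt;!-- Reconstructed sketch: apart from data-mcp-tool, the attribute, tool, and field names here are assumptions --&amp;gt;
&amp;lt;form action="/search" method="get"
      data-mcp-tool="search_products"
      data-mcp-description="Search the product catalog by keyword and category"&amp;gt;
  &amp;lt;input type="text" name="q"&amp;gt;
  &amp;lt;select name="category"&amp;gt;
    &amp;lt;option value=""&amp;gt;All categories&amp;lt;/option&amp;gt;
    &amp;lt;option value="electronics"&amp;gt;Electronics&amp;lt;/option&amp;gt;
    &amp;lt;option value="clothing"&amp;gt;Clothing&amp;lt;/option&amp;gt;
  &amp;lt;/select&amp;gt;
  &amp;lt;button type="submit"&amp;gt;Search&amp;lt;/button&amp;gt;
&amp;lt;/form&amp;gt;
&lt;/code&gt;&lt;/pre&gt;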

&lt;/div&gt;



&lt;p&gt;That's it. An agent seeing this form no longer has to guess what the fields mean or what the form does. You've told it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Imperative API (for JavaScript functions)
&lt;/h3&gt;

&lt;p&gt;For more complex interactions, you register tools programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;registerTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;add_to_cart&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Add a product to the shopping cart by product ID and quantity&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The unique product identifier&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Number of units to add&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;minimum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;quantity&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cartService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cartTotal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent calling &lt;code&gt;add_to_cart&lt;/code&gt; with &lt;code&gt;{ productId: "ABC123", quantity: 2 }&lt;/code&gt; will get a reliable result — no screenshot guessing, no DOM parsing, no retries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I'm Genuinely Excited
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. This is a Google + Microsoft co-project
&lt;/h3&gt;

&lt;p&gt;This is the detail that changes everything for adoption: WebMCP is developed &lt;strong&gt;jointly by Google and Microsoft&lt;/strong&gt; in the W3C Web Machine Learning Community Group.&lt;/p&gt;

&lt;p&gt;That's not just a Google standard. It's an emerging web standard with two of the biggest browser vendors aligned on the spec from day one. Cross-vendor agreement at this stage is rare and meaningful. It substantially increases the chance this becomes a real, lasting part of the web platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The timing is right
&lt;/h3&gt;

&lt;p&gt;Browser agents — AI systems that navigate websites on your behalf — are growing fast. Gemini in Chrome, which will support WebMCP APIs, is one. Others are coming. Right now these agents are all fighting the same brittle DOM-scraping battle. WebMCP gives the web a way to meet them halfway.&lt;/p&gt;

&lt;p&gt;Implementing WebMCP on your site today is the same category of investment as adding proper &lt;code&gt;aria-label&lt;/code&gt; attributes in 2015 or adding &lt;code&gt;og:title&lt;/code&gt; meta tags in 2012. It felt optional then. It became table stakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The developer experience is genuinely low-friction
&lt;/h3&gt;

&lt;p&gt;The declarative API requires zero new JavaScript — just HTML annotations. You can expose your most common user flows to agents in an afternoon. The barrier is low enough that "let's try it" is a reasonable thing to say at a sprint planning meeting right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where I Have Real Questions
&lt;/h2&gt;

&lt;p&gt;I don't want to just be a hype machine, because there are genuine open questions here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Firefox and Safari haven't committed
&lt;/h3&gt;

&lt;p&gt;This is the elephant in the room. Mozilla and Apple have not signed on to WebMCP. For a standard to truly succeed on the web, it needs more than Chrome. Right now, if you implement WebMCP, it's Chrome-only by design.&lt;/p&gt;

&lt;p&gt;That's not fatal. Lots of meaningful features started as Chrome-only experiments before getting broader adoption. But it's a real constraint. If your user base leans heavily on Safari (mobile web, Apple users), the tools you expose through WebMCP won't be available to agents running in Safari.&lt;/p&gt;

&lt;h3&gt;
  
  
  "No headless support" is a meaningful limitation
&lt;/h3&gt;

&lt;p&gt;The official Chrome documentation is explicit: WebMCP requires a browser tab to be open, so agents cannot call your tools in a headless state.&lt;/p&gt;

&lt;p&gt;This means WebMCP is specifically for &lt;em&gt;in-browser&lt;/em&gt; agent interactions — not for server-side automation pipelines that many enterprise workflows rely on. For those use cases, you'd still need a backend MCP server. WebMCP and server-side MCP are complementary, not interchangeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The spec is not yet on the official W3C standards track
&lt;/h3&gt;

&lt;p&gt;WebMCP currently lives in the W3C Web Machine Learning Community Group, which is an incubation space rather than the full standards process. The path from origin trial to official web standard is long and uncertain. WebMCP could follow the path of Service Workers (proposed → standard → ubiquitous). Or it could follow the path of a dozen other promising origin trials that never made it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Actually Recommend
&lt;/h2&gt;

&lt;p&gt;If you maintain a web app with forms or user-facing workflows, here's what I'd do this week:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Enable the flag in Chrome today&lt;/strong&gt;&lt;br&gt;
Go to &lt;code&gt;chrome://flags&lt;/code&gt; and search for "WebMCP". Set it to Enabled, relaunch, and you can start testing immediately without waiting for Chrome 149.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Pick your one most important user flow&lt;/strong&gt;&lt;br&gt;
Don't try to annotate everything. Pick the single form or interaction that an agent would most benefit from — a search form, a checkout step, a filter UI. Annotate it with the declarative API. It'll take an hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Sign up for the origin trial&lt;/strong&gt;&lt;br&gt;
Visit the Chrome origin trial page and register your site's origin for the WebMCP trial. This lets you ship WebMCP support to real users before Chrome 149 hits stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Watch what happens when Gemini in Chrome supports it&lt;/strong&gt;&lt;br&gt;
This is the moment that will make the investment pay off. Once Google's in-browser agent can call your registered tools directly, the "I annotated my forms" work starts delivering real value.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Here's my actual take after sitting with I/O 2026 for a few days:&lt;/p&gt;

&lt;p&gt;The Gemini model announcements are table stakes at this point. Every major AI lab releases faster, cheaper models every few months. That's not a story; it's a cadence.&lt;/p&gt;

&lt;p&gt;WebMCP is different. It's infrastructure. It's Google (and Microsoft) trying to answer a structural question about the web's future: &lt;em&gt;when AI agents become first-class citizens of the browser, what contract does a website make with them?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer they're proposing is WebMCP: an explicit, structured, queryable tool surface that gives agents what they actually need instead of forcing them to infer it.&lt;/p&gt;

&lt;p&gt;If that standard gets adopted, it changes how we think about building for the web. We'll think about our web apps as having three user types: humans on desktop, humans on mobile, and AI agents. WebMCP is the API layer for the third type.&lt;/p&gt;

&lt;p&gt;That is a genuinely new idea. And it came from a developer keynote that most people stopped watching after the Gemini 3.5 Flash benchmarks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Are you going to try WebMCP in the origin trial? I'd love to hear which use cases you're thinking about — drop them in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>googleiochallenge</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
