<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prabhakar Chaudhary</title>
    <description>The latest articles on DEV Community by Prabhakar Chaudhary (@prabhakar_chaudhary_7afe4).</description>
    <link>https://dev.to/prabhakar_chaudhary_7afe4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2106903%2F3c5af1fa-ded9-460e-8d18-049d18c8ab4d.png</url>
      <title>DEV Community: Prabhakar Chaudhary</title>
      <link>https://dev.to/prabhakar_chaudhary_7afe4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prabhakar_chaudhary_7afe4"/>
    <language>en</language>
    <item>
      <title>Análisis de Claude Sonnet 5: El nuevo modelo 'agéntico' de Anthropic, su precio y posición en el mercado</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Thu, 02 Jul 2026 08:00:38 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/analisis-de-claude-sonnet-5-el-nuevo-modelo-agentico-de-anthropic-su-precio-y-posicion-en-el-597i</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/analisis-de-claude-sonnet-5-el-nuevo-modelo-agentico-de-anthropic-su-precio-y-posicion-en-el-597i</guid>
      <description>&lt;p&gt;El 30 de junio de 2026, Anthropic anunció el lanzamiento de &lt;strong&gt;Claude Sonnet 5&lt;/strong&gt;, el último modelo de su familia Sonnet. Este lanzamiento no es una simple actualización incremental; posiciona al modelo como una herramienta "agéntica" diseñada para ejecutar flujos de trabajo autónomos y complejos a un coste más accesible que los modelos de gama alta como Opus [1].&lt;/p&gt;

&lt;p&gt;Este artículo ofrece un análisis detallado de lo que significa este lanzamiento para los desarrolladores y la industria. Se examinan las capacidades declaradas, los cambios técnicos, la estructura de precios y se sitúa la noticia en el contexto de las discusiones de la comunidad técnica y la investigación académica reciente sobre sistemas agénticos.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metodología
&lt;/h3&gt;

&lt;p&gt;Este análisis se basa en la documentación oficial de Anthropic, incluyendo el anuncio de lanzamiento y la ficha de sistema (&lt;em&gt;System Card&lt;/em&gt;), discusiones técnicas en foros públicos como Hacker News, y artículos de investigación académica sobre la evaluación de agentes de IA publicados a mediados de 2026. El objetivo es ofrecer una visión equilibrada que distingue las afirmaciones del proveedor de las observaciones de la comunidad y el estado del arte académico.&lt;/p&gt;

&lt;h2&gt;
  
  
  ¿Qué es Claude Sonnet 5? Capacidades y enfoque agéntico
&lt;/h2&gt;

&lt;p&gt;Claude Sonnet 5 se presenta como un puente entre la familia Sonnet, de gama media, y la familia Opus, de gama alta. Según Anthropic, el modelo ofrece un rendimiento cercano al de Opus 4.8 en muchas tareas, pero con la velocidad y la eficiencia de costes de la línea Sonnet [1].&lt;/p&gt;

&lt;p&gt;El principal diferenciador es su optimización para &lt;strong&gt;flujos de trabajo agénticos&lt;/strong&gt;. Esto se refiere a la capacidad del modelo para realizar tareas complejas de varios pasos de forma autónoma, utilizando herramientas como un navegador web o un terminal [1]. Las capacidades clave declaradas incluyen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Planificación y ejecución autónoma&lt;/strong&gt;: El modelo puede crear un plan para abordar una solicitud compleja y ejecutarlo sin supervisión constante [1].&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Uso avanzado de herramientas&lt;/strong&gt;: Interactúa con terminales y navegadores para automatizar tareas que tradicionalmente requerían intervención humana [1].&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rendimiento en codificación&lt;/strong&gt;: Anthropic destaca una mejora sustancial en tareas de ingeniería de software, como la depuración de código, la navegación por bases de código complejas y la refactorización. En la prueba de referencia &lt;strong&gt;SWE-bench Pro&lt;/strong&gt;, Sonnet 5 obtuvo un 63.2%, en comparación con el 58.1% de su predecesor, Sonnet 4.6 [1].&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Seguridad&lt;/strong&gt;: El modelo presenta, según sus evaluaciones, tasas más bajas de alucinaciones y comportamientos no deseados en comparación con Sonnet 4.6. Incluye salvaguardas de ciberseguridad activadas por defecto para detectar y bloquear usos peligrosos [1, 4].&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cambios técnicos y consideraciones para desarrolladores
&lt;/h2&gt;

&lt;p&gt;La migración a Sonnet 5 desde modelos anteriores no es completamente transparente y requiere atención a ciertos detalles técnicos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nuevo Tokenizador&lt;/strong&gt;: Sonnet 5 utiliza un tokenizador actualizado. Según Anthropic, el mismo texto de entrada puede generar entre un 30% más de tokens que en versiones anteriores [1]. Aunque la empresa ajustó el precio de lanzamiento para que la transición sea aproximadamente neutra en costes, es fundamental que los desarrolladores reevalúen sus prompts y ajusten los límites de &lt;code&gt;max_tokens&lt;/code&gt; [1].&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cambios en la API&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;  La funcionalidad &lt;code&gt;Adaptive Thinking&lt;/code&gt; está activada por defecto [1].&lt;/li&gt;
&lt;li&gt;  Ya no se soportan los parámetros de muestreo (&lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, &lt;code&gt;top_k&lt;/code&gt;), y su uso devolverá un error. La recomendación es guiar el comportamiento del modelo mediante instrucciones en el &lt;em&gt;system prompt&lt;/em&gt; [1].&lt;/li&gt;
&lt;li&gt;  El pensamiento extendido manual (&lt;code&gt;manual thinking&lt;/code&gt;) ha sido eliminado en favor del pensamiento adaptativo [1].&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Estructura de precios y disponibilidad
&lt;/h2&gt;

&lt;p&gt;Claude Sonnet 5 está disponible en todos los planes de Anthropic (incluido el gratuito) y a través de la API de Claude en plataformas como AWS, Google Cloud y Microsoft Foundry [1]. Su estructura de precios se divide en un periodo introductorio y uno estándar [1, 5].&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Período&lt;/th&gt;
&lt;th&gt;Precio de Entrada (por millón de tokens)&lt;/th&gt;
&lt;th&gt;Precio de Salida (por millón de tokens)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Introductorio&lt;/strong&gt; (hasta 31/08/2026)&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Estándar&lt;/strong&gt; (desde 01/09/2026)&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Fuente: Documentación oficial de Anthropic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Este precio lo sitúa en una posición competitiva, significativamente más bajo que el de Opus 4.8, que tiene un coste de $5 por millón de tokens de entrada y $25 por millón de tokens de salida [5].&lt;/p&gt;

&lt;h2&gt;
  
  
  El contexto: Reacciones de la comunidad y avances en la investigación
&lt;/h2&gt;

&lt;p&gt;Ningún lanzamiento tecnológico ocurre en el vacío. Para entender las implicaciones de Sonnet 5, es útil observar las reacciones de la comunidad y el estado de la investigación en IA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discusiones en Hacker News: Eficiencia vs. "Extracción de valor"
&lt;/h3&gt;

&lt;p&gt;En plataformas como Hacker News, la recepción ha sido mixta y matizada. Si bien algunos desarrolladores informan de éxitos notables al usar Sonnet 5 para tareas complejas que antes requerían modelos más caros, han surgido dos críticas principales:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Consumo de tokens&lt;/strong&gt;: Varios usuarios señalan que el modelo tiende a "sobrecomplicar" tareas sencillas, consumiendo una cantidad excesiva de tokens [2]. Este comportamiento ha alimentado la sospecha de que los modelos están siendo optimizados para la "extracción de valor" (&lt;em&gt;wealth extraction&lt;/em&gt;) a través del uso de tokens, en lugar de para la eficiencia pura [2].&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Agente asistido vs. Agente autónomo&lt;/strong&gt;: Hay un debate sobre si la optimización para flujos de trabajo "totalmente agénticos" degrada el rendimiento en casos de uso de "asistencia agéntica", donde un desarrollador busca control granular y respuestas concisas, no un agente que intente resolverlo todo de forma autónoma [2].&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Estas discusiones ponen de manifiesto una tensión clave: la promesa de la automatización total frente a la necesidad de control y eficiencia económica en el desarrollo diario.&lt;/p&gt;

&lt;h3&gt;
  
  
  El contexto de la investigación: El desafío de evaluar agentes
&lt;/h3&gt;

&lt;p&gt;El marketing de Sonnet 5 en torno a su capacidad "agéntica" coincide con un intenso enfoque de la comunidad investigadora en cómo evaluar estos sistemas. Investigaciones recientes publicadas en repositorios como arXiv subrayan que medir el rendimiento de un agente de IA es un problema no resuelto.&lt;/p&gt;

&lt;p&gt;Un artículo reciente de Zhu et al. (2026) destaca que los resultados de los benchmarks están a menudo confundidos por "efectos de andamiaje" (&lt;em&gt;scaffold effects&lt;/em&gt;) [3]. Esto significa que el rendimiento medido no solo depende del modelo de lenguaje subyacente, sino también del código específico (el "andamio") que gestiona la memoria del agente, las llamadas a herramientas y la interacción con el entorno [3].&lt;/p&gt;

&lt;p&gt;La investigación actual se está moviendo hacia:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Marcos de evaluación unificados&lt;/strong&gt;: Para aislar la capacidad real del modelo de los efectos del entorno de prueba [3].&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Diagnósticos automatizados&lt;/strong&gt;: Herramientas que analizan la traza completa de ejecución de un agente para identificar patrones de fallo recurrentes, en lugar de limitarse a una puntuación final de éxito o fracaso [3].&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Esto nos dice que, si bien la industria avanza rápidamente hacia la implementación de agentes, el campo académico todavía está construyendo las herramientas para comprender y medir de forma fiable su comportamiento, robustez y eficiencia [3].&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusión: Implicaciones prácticas
&lt;/h2&gt;

&lt;p&gt;Claude Sonnet 5 es un movimiento estratégico de Anthropic para acelerar la adopción de la IA agéntica en entornos de producción, ofreciendo capacidades cercanas a la gama alta a un precio más asequible. Su objetivo es claro: permitir que las empresas pasen de la experimentación a la implementación de flujos de trabajo automatizados [1, 5].&lt;/p&gt;

&lt;p&gt;Sin embargo, para los desarrolladores, la adopción no es trivial. Las implicaciones prácticas clave son:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;El coste real es variable&lt;/strong&gt;: El cambio en el tokenizador y el comportamiento a veces verboso del modelo significan que el coste por tarea debe ser evaluado cuidadosamente. No siempre será más barato que modelos anteriores o de la competencia, especialmente para tareas simples [1, 2].&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Adecuación a la tarea&lt;/strong&gt;: Sonnet 5 parece brillar en tareas autónomas y de larga duración. Para interacciones rápidas y controladas, su diseño "agéntico" podría ser contraproducente [2].&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;La evaluación es crucial&lt;/strong&gt;: La verdadera eficacia del modelo dependerá de pruebas rigurosas en los casos de uso específicos de cada equipo. Las métricas del proveedor son un punto de partida, pero la validación en el mundo real es indispensable [2, 3].&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;En resumen, Claude Sonnet 5 es una herramienta potente con un enfoque definido en la autonomía. Su éxito dependerá de si los desarrolladores pueden alinear sus capacidades con los problemas correctos, gestionando al mismo tiempo la complejidad y el coste inherentes a estos nuevos sistemas agénticos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Referencias
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-5" rel="noopener noreferrer"&gt;Introducing Claude Sonnet 5 - Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/item?id=48736605" rel="noopener noreferrer"&gt;Hacker News: Claude Sonnet 5 - Anthropic (Discussion) - news.ycombinator.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/html/2605.27898v1" rel="noopener noreferrer"&gt;Agentic Models Are Not AGI Yet: A Survey of Their Capabilities, Limitations, and Future Directions - arXiv.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/claude-sonnet-5-system-card" rel="noopener noreferrer"&gt;Claude Sonnet 5 System Card - Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/about-claude/pricing" rel="noopener noreferrer"&gt;Pricing - docs.anthropic.com&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>claude</category>
      <category>llm</category>
      <category>news</category>
    </item>
    <item>
      <title>How DFlash Uses Block Diffusion to Break the Speculative Decoding Bottleneck</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Wed, 01 Jul 2026 16:14:54 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/how-dflash-uses-block-diffusion-to-break-the-speculative-decoding-bottleneck-4921</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/how-dflash-uses-block-diffusion-to-break-the-speculative-decoding-bottleneck-4921</guid>
      <description>&lt;h1&gt;
  
  
  How DFlash Uses Block Diffusion to Break the Speculative Decoding Bottleneck
&lt;/h1&gt;

&lt;p&gt;Autoregressive LLM inference has a fundamental problem: every token depends on the one before it. Even with speculative decoding — where a small draft model proposes tokens and the target model verifies them in parallel — the drafting step itself has remained sequential. DFlash, a framework from researchers at UC San Diego's Z Lab, changes that by replacing the autoregressive drafter with a block diffusion model that generates an entire candidate block in a single forward pass.&lt;/p&gt;

&lt;p&gt;The results are notable: 6× lossless acceleration on Qwen3-8B, 2.5× improvement over the previous state-of-the-art EAGLE-3, and up to 15× throughput gains on NVIDIA Blackwell hardware at production concurrency levels. The framework is now integrated into SGLang and vLLM, making it accessible without application-level changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Speculative Decoding Still Had a Bottleneck
&lt;/h2&gt;

&lt;p&gt;Speculative decoding works by having a lightweight draft model generate a sequence of candidate tokens, which the target model then verifies in a single parallel forward pass. If the target model accepts most of the draft tokens, you get significant speedups — the expensive target model runs less often.&lt;/p&gt;

&lt;p&gt;The catch is that existing draft models like EAGLE-3 are themselves autoregressive. They generate tokens one at a time, so drafting γ tokens takes γ sequential steps. This creates a ceiling: the faster you want to draft, the more you're constrained by sequential computation. EAGLE-3 achieves roughly 2–3× speedups in practice, which is useful but leaves substantial GPU capacity underutilized.&lt;/p&gt;

&lt;p&gt;Diffusion language models offer an alternative — they can generate tokens in parallel — but standalone diffusion LLMs have historically underperformed autoregressive models on quality, making them poor candidates for the verification step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DFlash Does Differently
&lt;/h2&gt;

&lt;p&gt;DFlash's core insight is to use a diffusion model only for drafting, not for final generation. The target model remains a standard autoregressive LLM that handles verification. This lets DFlash capture the parallelism of diffusion generation while preserving the quality guarantees of autoregressive verification.&lt;/p&gt;

&lt;p&gt;The drafting process works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context extraction&lt;/strong&gt;: The target model processes the input prompt and produces hidden states at multiple layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV injection&lt;/strong&gt;: These hidden states are projected and injected into the Key-Value cache of every layer in the draft model. This is the critical difference from earlier diffusion-based speculative decoding approaches, which only conditioned the drafter on the first layer's features. By injecting target context throughout the draft model's depth, DFlash maintains strong alignment between draft and target even as the draft model grows deeper and more expressive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel block drafting&lt;/strong&gt;: The draft model fills in an entire block of masked token positions in a single forward pass, treating the problem as a joint denoising task rather than a sequential prediction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt;: The target model checks the proposed block. Accepted tokens are kept; the first rejected token triggers a new draft cycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because the drafting cost is roughly constant regardless of block size, DFlash can use deeper draft models and larger block sizes without the linear latency penalty that constrains autoregressive drafters. A 5-layer DFlash model drafting 16 tokens runs faster than a single-layer EAGLE-3 model drafting 8 tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training the Draft Model
&lt;/h2&gt;

&lt;p&gt;Training DFlash draft models involves a few design choices that matter for acceptance rates. The draft model shares token embeddings and the language model head with the target model, which keeps the output distribution aligned. During training, random block positions are sampled from the training data rather than always starting from the beginning of a sequence — this improves generalization to arbitrary context lengths.&lt;/p&gt;

&lt;p&gt;Loss weighting uses exponential decay across positions within a block, prioritizing accuracy at earlier positions where errors compound. The intuition is that a wrong token early in a block will cause the entire remaining block to be rejected, so it's worth spending more training signal there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;On Qwen3-8B with greedy decoding, DFlash achieves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6.08× speedup&lt;/strong&gt; on code generation (HumanEval)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5.15× speedup&lt;/strong&gt; on math (MATH-500)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5.62× speedup&lt;/strong&gt; on chat (MT-Bench)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to EAGLE-3 on the same tasks, DFlash is 1.4–1.8× faster. For reasoning models at temperature 1, the gains are even larger: 4.5× acceleration on AIME benchmarks.&lt;/p&gt;

&lt;p&gt;At production scale on NVIDIA Blackwell (DGX B300), the &lt;a href="https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding/" rel="noopener noreferrer"&gt;NVIDIA engineering team&lt;/a&gt; reports up to 15× throughput improvement over standard autoregressive decoding for gpt-oss-120B at 500–600 tokens/sec per user interactivity targets. Even against EAGLE-3, DFlash delivers 1.5–2.6× higher throughput depending on task type, with coding and multilingual tasks showing the largest gains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration with SGLang and vLLM
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.lmsys.org/blog/2026-06-15-next-generation-speculative-decoding-dflash-v2/" rel="noopener noreferrer"&gt;LMSYS team's Spec V2 blog post&lt;/a&gt; describes how DFlash is now the default speculative decoding engine in SGLang. The integration adds an overlap scheduler that reduces host-device synchronization overhead by overlapping draft processing with KV cache allocation for the next batch. This alone adds roughly 33% throughput on top of DFlash's base gains — on Qwen3-8B, throughput goes from 11,400 to 15,300 tokens/second.&lt;/p&gt;

&lt;p&gt;For vLLM users, DFlash integrates through the Speculators library. Switching from EAGLE-3 requires updating the checkpoint path and specifying the algorithm; no application code changes are needed. TensorRT-LLM support is also available for Blackwell and Hopper deployments.&lt;/p&gt;

&lt;p&gt;Z Lab has released over 20 DFlash draft model checkpoints on Hugging Face covering Qwen, Llama, Gemma, and Kimi K2.6 model families. The &lt;a href="https://arxiv.org/abs/2602.06036" rel="noopener noreferrer"&gt;original paper&lt;/a&gt; and &lt;a href="https://z-lab.ai/projects/dflash/" rel="noopener noreferrer"&gt;project page&lt;/a&gt; include training code and quick-start examples for both SGLang and the Transformers library.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Inference Infrastructure
&lt;/h2&gt;

&lt;p&gt;Speculative decoding has been a useful but niche optimization — effective mainly when you have a good draft model and the right hardware setup. DFlash makes the case that the drafting step itself was the limiting factor, not the verification step.&lt;/p&gt;

&lt;p&gt;The practical implication is that inference serving costs for large models can drop substantially without any change to model quality. For teams running LLMs at scale, the combination of DFlash with modern inference frameworks like SGLang or vLLM represents a meaningful reduction in GPU hours per token — particularly for coding and reasoning workloads where token acceptance rates are high.&lt;/p&gt;

&lt;p&gt;The framework also points toward a broader pattern: diffusion models may be most useful not as standalone generators but as components within hybrid systems where their parallelism can be exploited without sacrificing the quality guarantees of autoregressive verification.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Kimi K2.7 Code: How Moonshot AI Built an Open-Weight Coding Model That Reasons More Efficiently</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Mon, 29 Jun 2026 16:37:24 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/kimi-k27-code-how-moonshot-ai-built-an-open-weight-coding-model-that-reasons-more-efficiently-2kf3</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/kimi-k27-code-how-moonshot-ai-built-an-open-weight-coding-model-that-reasons-more-efficiently-2kf3</guid>
      <description>&lt;h1&gt;
  
  
  Kimi K2.7 Code: How Moonshot AI Built an Open-Weight Coding Model That Reasons More Efficiently
&lt;/h1&gt;

&lt;p&gt;Moonshot AI released &lt;a href="https://www.kimi.com/resources/kimi-k2-7-code" rel="noopener noreferrer"&gt;Kimi K2.7 Code&lt;/a&gt; on June 12, 2026 — a coding-focused, open-weight model built on the same 1-trillion-parameter Mixture-of-Experts backbone as its predecessor, K2.6. The headline improvement is not a bigger model or a longer context window. It is a roughly 30% reduction in reasoning-token usage, which translates directly into lower inference costs and faster agentic task loops.&lt;/p&gt;

&lt;p&gt;This post explains what that efficiency gain actually means, how the model is structured, and where it fits relative to closed frontier models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed Between K2.6 and K2.7 Code
&lt;/h2&gt;

&lt;p&gt;K2.7 Code is not a general-purpose upgrade. Moonshot AI explicitly positioned it as a coding specialist, while K2.6 remains the recommended choice for writing, analysis, and conversation. The two models share the same architecture; what changed is how K2.7 Code was trained and fine-tuned.&lt;/p&gt;

&lt;p&gt;The most significant change is reasoning efficiency. Language models that use "thinking" or chain-of-thought modes generate internal reasoning tokens before producing a final answer. These tokens cost money and add latency. K2.7 Code reduces that overhead by approximately 30% compared to K2.6 — meaning the model reaches the same or better answers while generating fewer intermediate steps.&lt;/p&gt;

&lt;p&gt;Moonshot AI also addressed a stability problem common in long-horizon agentic tasks: models that perform well for the first ten steps of a workflow but degrade over fifty. K2.7 Code shows improved reliability across extended multi-step coding sessions, covering more than ten programming languages including Python, Rust, and Go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: A 1T-Parameter MoE That Activates 32B Per Token
&lt;/h2&gt;

&lt;p&gt;The underlying architecture is a &lt;a href="https://arxiv.org/abs/2101.03961" rel="noopener noreferrer"&gt;Mixture-of-Experts (MoE)&lt;/a&gt; design with 1 trillion total parameters and approximately 32 billion active parameters per token. MoE models route each token through a subset of specialized sub-networks (experts) rather than the full model, which keeps inference costs manageable despite the large total parameter count.&lt;/p&gt;

&lt;p&gt;Key architectural details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;61 transformer layers&lt;/strong&gt; (1 dense, 60 MoE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;384 experts&lt;/strong&gt;, with 8 selected per token plus 1 shared expert&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-head Latent Attention (MLA)&lt;/strong&gt; for efficient key-value memory compression in long contexts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SwiGLU activation&lt;/strong&gt; in feed-forward layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;256K-token context window&lt;/strong&gt; (262,144 tokens), suited for repository-scale codebases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoonViT&lt;/strong&gt;, a 400M-parameter vision encoder for image and video inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MLA mechanism is worth noting. Standard attention scales quadratically with sequence length in memory usage. MLA compresses the key-value cache, which is what makes a 256K context window practical rather than theoretical. This matters for tasks like reviewing a full pull request with diffs, logs, and test output in a single prompt.&lt;/p&gt;

&lt;p&gt;The vision encoder (MoonViT) enables multimodal inputs — a developer can pass a screenshot of a failing UI alongside the relevant code, or include a short video of a bug reproduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Numbers and What They Mean
&lt;/h2&gt;

&lt;p&gt;Moonshot AI reports the following improvements over K2.6 on their internal and external benchmarks (all evaluations used Kimi Code CLI with thinking mode enabled):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;K2.6&lt;/th&gt;
&lt;th&gt;K2.7 Code&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kimi Code Bench v2&lt;/td&gt;
&lt;td&gt;50.9&lt;/td&gt;
&lt;td&gt;62.0&lt;/td&gt;
&lt;td&gt;+21.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Program Bench&lt;/td&gt;
&lt;td&gt;48.3&lt;/td&gt;
&lt;td&gt;53.6&lt;/td&gt;
&lt;td&gt;+11.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLS Bench Lite&lt;/td&gt;
&lt;td&gt;26.7&lt;/td&gt;
&lt;td&gt;35.1&lt;/td&gt;
&lt;td&gt;+31.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Mark Verified&lt;/td&gt;
&lt;td&gt;72.8&lt;/td&gt;
&lt;td&gt;81.1&lt;/td&gt;
&lt;td&gt;+11.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Atlas&lt;/td&gt;
&lt;td&gt;69.4&lt;/td&gt;
&lt;td&gt;76.0&lt;/td&gt;
&lt;td&gt;+9.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, GPT-5.5 scores 69.0 on Kimi Code Bench v2 and Claude Opus 4.8 scores 67.4. K2.7 Code at 62.0 trails both closed models on raw coding benchmarks. On MCP Mark Verified — an agentic benchmark measuring reliable tool invocation — K2.7 Code at 81.1 actually exceeds Claude Opus 4.8's 76.4.&lt;/p&gt;

&lt;p&gt;A few caveats apply. Kimi Code Bench v2 is an in-house benchmark. Comparisons against GPT-5.5 and Claude Opus 4.8 used their respective coding agent interfaces. Independent third-party verification on public suites like SWE-bench Verified has not yet been published, so the numbers should be treated as vendor-reported until confirmed externally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reasoning Efficiency Argument
&lt;/h2&gt;

&lt;p&gt;The 30% reduction in reasoning tokens is the most practically interesting aspect of K2.7 Code for teams running agentic workflows at scale.&lt;/p&gt;

&lt;p&gt;In a 12-hour autonomous coding session, if K2.6 generates 2 million reasoning tokens, K2.7 Code generates approximately 1.4 million — saving 600,000 tokens. At $4.00 per million output tokens, that is $2.40 per session. Across hundreds of concurrent agent runs, the savings compound quickly.&lt;/p&gt;

&lt;p&gt;There is also a secondary benefit: fewer reasoning tokens mean the model hits the 256K context limit later in a session, allowing it to maintain more task history before truncating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Constraints
&lt;/h2&gt;

&lt;p&gt;K2.7 Code has several fixed parameters developers should know before integrating it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thinking mode is mandatory.&lt;/strong&gt; Requests that attempt to disable it default back to K2.6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling parameters are locked.&lt;/strong&gt; Temperature is fixed at 1.0 and top-p at 0.95, preventing deterministic output modes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;reasoning_content&lt;/code&gt; must be preserved&lt;/strong&gt; across multi-turn interactions for coherent task state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool choice is limited&lt;/strong&gt; to &lt;code&gt;auto&lt;/code&gt; or &lt;code&gt;none&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These constraints reflect a deliberate design choice: K2.7 Code is optimized for agentic coding, and the fixed parameters match the settings under which it was trained. Developers who need deterministic outputs or fine-grained sampling control should use a different model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access and Deployment Options
&lt;/h2&gt;

&lt;p&gt;The model weights are available on &lt;a href="https://huggingface.co/moonshotai/Kimi-K2.7-Code" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; under a Modified MIT license. The full weights are approximately 595GB, which means self-hosting requires server-class infrastructure. Supported inference frameworks include vLLM, SGLang, and KTransformers, and the model requires transformers 4.57.1 or later.&lt;/p&gt;

&lt;p&gt;For teams that do not want to manage their own infrastructure, the model is accessible via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kimi API&lt;/strong&gt; at &lt;code&gt;platform.kimi.ai&lt;/code&gt; with OpenAI-compatible endpoints ($0.95/M input tokens, $4.00/M output tokens, $0.19/M for cache hits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare Workers AI&lt;/strong&gt; for serverless agentic workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt; under the model ID &lt;code&gt;moonshotai/kimi-k2.7-code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; via &lt;code&gt;kimi-k2.7-code:cloud&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The OpenAI-compatible API format means K2.7 Code can be used as a drop-in replacement in existing agent frameworks that already support OpenAI endpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Fits
&lt;/h2&gt;

&lt;p&gt;K2.7 Code occupies a specific niche: an open-weight coding model that is meaningfully cheaper than closed frontier models for high-volume agentic use, while remaining competitive (though not leading) on raw coding benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aimlapi.com/blog/kimi-k2-7-code-the-complete-guide-to-moonshot-ais-new-open-weight-coding-model" rel="noopener noreferrer"&gt;According to the AI/ML API guide&lt;/a&gt;, K2.7 Code offers roughly 12x lower cost per token compared to GPT-5.5 and Claude Opus 4.8 for high-volume agent workflows. For teams running autonomous coding agents at scale — where the bottleneck is cost and throughput rather than peak benchmark performance — that gap matters.&lt;/p&gt;

&lt;p&gt;The open-weight license also matters for teams with data residency requirements or those who want to fine-tune the model on proprietary codebases. Self-hosting at 595GB is not trivial, but feasible for organizations with the infrastructure.&lt;/p&gt;

&lt;p&gt;The main limitation is that K2.7 Code still trails the leading closed models on absolute coding performance. Teams that need the highest possible success rate on complex engineering tasks will likely still reach for GPT-5.5 or Claude Opus 4.8. K2.7 Code is the better choice when cost efficiency and open weights are the primary constraints.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://www.kimi.com/resources/kimi-k2-7-code" rel="noopener noreferrer"&gt;Kimi K2.7 Code official page&lt;/a&gt; · &lt;a href="https://aimlapi.com/blog/kimi-k2-7-code-the-complete-guide-to-moonshot-ais-new-open-weight-coding-model" rel="noopener noreferrer"&gt;AI/ML API complete guide&lt;/a&gt; · &lt;a href="https://www.i-scoop.eu/kimi-k2-7-code-the-open-weight-coding-model-that-thinks-30-less" rel="noopener noreferrer"&gt;i-SCOOP analysis&lt;/a&gt; · &lt;a href="https://developers.cloudflare.com/changelog/post/2026-06-12-kimi-k2-7-code-workers-ai/" rel="noopener noreferrer"&gt;Cloudflare Workers AI changelog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>Gemini 3.5 Flash Now Has Native Computer Use — Here's What That Actually Changes</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Fri, 26 Jun 2026 16:18:05 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/gemini-35-flash-now-has-native-computer-use-heres-what-that-actually-changes-ol0</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/gemini-35-flash-now-has-native-computer-use-heres-what-that-actually-changes-ol0</guid>
      <description>&lt;h1&gt;
  
  
  Gemini 3.5 Flash Now Has Native Computer Use — Here's What That Actually Changes
&lt;/h1&gt;

&lt;p&gt;On June 24, 2026, Google folded computer use directly into Gemini 3.5 Flash — the same production model developers already use for function calling, Search grounding, and Maps. It is available as a public preview via the Gemini API and the Gemini Enterprise Agent Platform. The headline benchmark (78.4 on OSWorld-Verified) is close to GPT-5.5's 78.7, but the more durable story is the architectural shift: screen interaction is no longer a separate, premium service. It is becoming a default capability of a production-tier model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed: From a Separate Model to a Native Tool
&lt;/h2&gt;

&lt;p&gt;Before this release, developers who wanted an AI agent to control a screen had two options: use the standalone &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/introducing-computer-use-gemini-3-5-flash/" rel="noopener noreferrer"&gt;Gemini 2.5 Computer Use preview model&lt;/a&gt; (limited to 128K tokens, browser-focused, no simultaneous Search or Maps), or build a multi-model pipeline that routed tasks between a reasoning model and a computer-use model. Both approaches added engineering overhead and context-switching costs.&lt;/p&gt;

&lt;p&gt;Gemini 3.5 Flash eliminates that split. Computer use is now declared as a tool alongside other capabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;tools=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"computer_use"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser|mobile|desktop"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single agent can now browse the web for pricing data, operate an enterprise application, and ground a response with Maps — all within one inference pass, with no model-hopping. The context window also expanded from 128K (the standalone model) to 1 million tokens, which matters for long-horizon tasks that need to maintain coherence across many screen states.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Perception-Action Loop Works
&lt;/h2&gt;

&lt;p&gt;The model operates on a straightforward cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The developer application captures a screenshot of the target environment.&lt;/li&gt;
&lt;li&gt;The screenshot and the task goal are sent to the Gemini API.&lt;/li&gt;
&lt;li&gt;The model identifies UI elements, reasons about the next action, and returns a structured command (e.g., a click at normalized coordinates, a keystroke, a scroll).&lt;/li&gt;
&lt;li&gt;The application executes the action, captures a new screenshot, and repeats until the task is complete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One notable addition is the &lt;strong&gt;intent field&lt;/strong&gt;: every action response now includes a natural-language explanation of why the model chose that action (e.g., "Click the search box to type the destination"). For enterprise teams operating in regulated environments, this serves as an audit trail — a log of the agent's reasoning that compliance teams can review. It also makes debugging significantly easier when an agent takes an unexpected path through a UI.&lt;/p&gt;

&lt;p&gt;The model supports 20+ action types across environments: click, double-click, right-click, type, scroll, navigate, drag-and-drop, hotkeys, and screenshots for browser; open_app, long_press, go_back for mobile; and OS-level operations for desktop.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Picture
&lt;/h2&gt;

&lt;p&gt;On &lt;a href="https://www.techtimes.com/articles/319071/20260625/gemini-computer-use-baked-gemini-35-flash-screen-control-now-pairs-search-maps.htm" rel="noopener noreferrer"&gt;OSWorld-Verified&lt;/a&gt; — a benchmark that evaluates agents on real tasks across Ubuntu, Windows, and macOS — Gemini 3.5 Flash scores 78.4. For context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;OSWorld-Verified Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.8&lt;/td&gt;
&lt;td&gt;83.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash (prior)&lt;/td&gt;
&lt;td&gt;65.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two important caveats: all scores on this leaderboard are self-reported by model providers, with no independent third-party verification as of June 2026. And the 13.3-point jump from Gemini 3 Flash (65.1) to Gemini 3.5 Flash (78.4) is the more meaningful number — it shows how much the native integration improved over the previous built-in tool approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Argument
&lt;/h2&gt;

&lt;p&gt;When benchmark scores are within 0.3 points of each other, pricing becomes the deciding factor. Gemini 3.5 Flash is priced at $1.50 per million input tokens and $9 per million output tokens. GPT-5.5 costs $5 input and $30 output — roughly three times more. Cached input for Gemini 3.5 Flash drops further to $0.15 per million tokens, which compresses costs significantly for agents that reuse long system prompts across many tasks.&lt;/p&gt;

&lt;p&gt;For organizations running computer-use agents at scale — continuous software testing, automated data entry, UI regression checks — that cost difference compounds quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Safety: What Google Built In and What Remains Unsolved
&lt;/h2&gt;

&lt;p&gt;Computer use introduces a specific class of risk: an agent that can click, type, and navigate can also take irreversible actions (sending an email, submitting a form, approving a transaction) if it misinterprets a screen or is manipulated by malicious content.&lt;/p&gt;

&lt;p&gt;Google's approach is layered. At the model level, Gemini 3.5 Flash underwent targeted adversarial training for computer-use scenarios to reduce susceptibility to prompt injection — where malicious instructions hidden in on-screen content could redirect the agent. At the deployment level, two opt-in enterprise safeguards are available: one that requires explicit human confirmation before sensitive or irreversible actions, and one that automatically terminates a task if indirect prompt injection is detected.&lt;/p&gt;

&lt;p&gt;Seven configurable safety categories let developers block or gate specific action types: financial transactions, sensitive data modification, communication tools (sending messages or emails), account creation, data modification, user consent management, and legal terms acceptance. For most production deployments, Google recommends treating screen agents like a new employee with access to passwords — scoped permissions, human confirmation on anything irreversible, and reviewable logs via the intent field.&lt;/p&gt;

&lt;p&gt;What remains unsolved industry-wide is UI drift: real-world applications change continuously, present authentication flows, and show untrained UI states that OSWorld's controlled test environments do not capture. The benchmark-to-production gap is real, and Google's own documentation acknowledges it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Developers
&lt;/h2&gt;

&lt;p&gt;The practical implication is that computer use is no longer a specialized capability requiring a dedicated model and a separate integration path. It is becoming a tool you declare alongside Search and Maps in a standard Gemini API call.&lt;/p&gt;

&lt;p&gt;For teams already using Gemini 3.5 Flash for reasoning or function calling, adding screen interaction is now an incremental step rather than an architectural overhaul. The &lt;a href="https://www.digitalapplied.com/blog/gemini-3-5-flash-computer-use-agent-automation-2026" rel="noopener noreferrer"&gt;reference implementation and documentation&lt;/a&gt; are available via the Gemini API, and Google provides a demo environment hosted by Browserbase for prototyping.&lt;/p&gt;

&lt;p&gt;The sensible starting point is read-only tasks: pointing an agent at a dashboard to read state and flag anomalies, with no write access. Once the intent field logs earn your team's trust, you can expand to human-in-the-loop workflows for data entry or form submission, with enforced confirmation on anything that spends or sends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broader Trend
&lt;/h2&gt;

&lt;p&gt;Gemini 3.5 Flash's native computer use is part of a wider pattern: capabilities that started as standalone, experimental models are being absorbed into general-purpose production models. The same happened with code execution, web search, and image understanding. Screen interaction appears to be following the same path — moving from a premium add-on to a standard tool in the agentic stack.&lt;/p&gt;

&lt;p&gt;Whether that consolidation makes agents more capable or just more convenient depends on how well the underlying models handle the messiness of real-world UIs. The OSWorld numbers are a starting point, not a guarantee. But the direction is clear: the boundary between "language model" and "computer-using agent" is narrowing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What the Age of LLM Benchmark Says About Evaluating Agentic AI</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:04:58 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/what-the-age-of-llm-benchmark-says-about-evaluating-agentic-ai-2hfc</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/what-the-age-of-llm-benchmark-says-about-evaluating-agentic-ai-2hfc</guid>
      <description>&lt;h1&gt;
  
  
  What the Age of LLM Benchmark Says About Evaluating Agentic AI
&lt;/h1&gt;

&lt;p&gt;Most AI evaluation still leans on a simple pattern: give the model a prompt, compare the answer against a reference, and score the result. That works reasonably well for summarization, classification, and many single-turn language tasks. It works much less well once a model is asked to act inside a changing environment.&lt;/p&gt;

&lt;p&gt;That is why the new arXiv paper &lt;a href="https://arxiv.org/abs/2606.24391" rel="noopener noreferrer"&gt;&lt;strong&gt;Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War&lt;/strong&gt;&lt;/a&gt; is worth reading carefully. It is not just another benchmark. It is a compact example of what agentic AI evaluation has to look like when the system must reason under uncertainty, keep track of hidden state, and produce valid actions instead of only plausible text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this benchmark is different
&lt;/h2&gt;

&lt;p&gt;Age of LLM puts two language models into a turn-based 1v1 game on a 13x7 grid. The models do not see everything. They operate under &lt;strong&gt;fog of war&lt;/strong&gt;, which means enemy units and some resources remain hidden unless the agent scouts or infers them. The environment also allows &lt;strong&gt;full diplomacy&lt;/strong&gt;: models can send messages, propose ceasefires, and issue ultimatums. On top of that, every turn has to obey a strict JSON schema, and illegal actions are silently discarded.&lt;/p&gt;

&lt;p&gt;That combination matters because it tests several abilities at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State tracking&lt;/strong&gt;: does the model remember what it has seen and what it has already lost?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Belief management&lt;/strong&gt;: does it behave sensibly when information is incomplete?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action validity&lt;/strong&gt;: can it stay inside the environment’s rules?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-horizon strategy&lt;/strong&gt;: can it choose a sequence of actions that actually leads somewhere useful?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are the same pressures that show up in real agent systems. A model that sounds fluent can still fail if it forgets state, issues invalid tool calls, or mishandles partial information.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the paper found
&lt;/h2&gt;

&lt;p&gt;The headline result is not that one strategy won every time. The more interesting point is how easily models fell into simple patterns under uncertainty.&lt;/p&gt;

&lt;p&gt;The paper reports that &lt;strong&gt;nuclear rush&lt;/strong&gt; strategies dominated most outcomes, even though the game offers multiple paths to victory. Military conquest was faster when it worked, but it was less common. Diplomacy was active, but agreements were rarely completed. The benchmark also found that a large share of illegal actions came from &lt;strong&gt;fog-of-war or state-tracking errors&lt;/strong&gt;, which is exactly the sort of failure that is hard to see in ordinary text benchmarks.&lt;/p&gt;

&lt;p&gt;That last point is important. If a model writes a good explanation but fails to track hidden units, the failure will not appear in a standard answer-comparison benchmark. It appears only when the environment forces the model to commit to actions and live with their consequences.&lt;/p&gt;

&lt;p&gt;The paper’s design also tries to reduce contamination. It uses a private engine with fresh map seeds, so the evaluation is not just replaying memorized solutions. That is a good reminder that a benchmark for agentic systems should minimize opportunities for shortcut learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters for production agents
&lt;/h2&gt;

&lt;p&gt;A lot of current agent work focuses on whether models can use tools. That is necessary, but it is not sufficient. A production agent usually has to do more than call an API once. It has to maintain context, respect permissions, reason over state, and recover when the environment changes under it.&lt;/p&gt;

&lt;p&gt;That is why the broader 2026 agentic AI discussion has started to shift toward outcomes rather than chat quality. The Hugging Face article &lt;a href="https://huggingface.co/blog/daya-shankar/agentic-ai-trends-2026" rel="noopener noreferrer"&gt;&lt;strong&gt;Latest Agentic AI Trends to Watch in 2026&lt;/strong&gt;&lt;/a&gt; makes the same point from the enterprise side: useful systems are measured by whether they complete work, not whether they produce polished prose. It also emphasizes specialization, orchestration, and governance rather than one all-purpose assistant.&lt;/p&gt;

&lt;p&gt;Age of LLM gives that trend a concrete evaluation shape. If an agent cannot maintain a belief state under fog of war, then any claim that it is “strategic” is premature. If it cannot keep its outputs within schema, then tool use is still brittle. If it can persuade but not coordinate, then conversational ability is outpacing execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  A useful comparison: search is not the same as narration
&lt;/h2&gt;

&lt;p&gt;One reason this benchmark feels timely is that a separate line of research is asking how agents actually learn to search. The paper &lt;a href="https://arxiv.org/abs/2606.00183" rel="noopener noreferrer"&gt;&lt;strong&gt;Agentic Transformers Provably Learn to Search via Reinforcement Learning&lt;/strong&gt;&lt;/a&gt; studies how transformer policies can acquire depth-first search behavior from sparse reinforcement feedback. Its main takeaway is that search is not just a clever prompt pattern; it is something a policy can learn, specialize into, and improve through training dynamics.&lt;/p&gt;

&lt;p&gt;That is a useful complement to Age of LLM. One paper asks how search behavior emerges. The other asks how that behavior holds up when the world is partially hidden, adversarial, and constrained by strict action rules.&lt;/p&gt;

&lt;p&gt;Put together, they suggest a practical lesson: agentic capability is not one thing. A model can be good at planning in the abstract, but weak at executing under uncertainty. Or it can follow rules well, but fail to explore. Real systems need both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why evaluations should include reliability as a first-class metric
&lt;/h2&gt;

&lt;p&gt;The benchmark also treats reliability as part of the task, not as an afterthought. That is a healthier way to think about agents.&lt;/p&gt;

&lt;p&gt;In normal software, invalid output is a bug. In agentic AI, invalid output often becomes a silent failure: a tool call does nothing, a hidden assumption is wrong, or the model acts on stale state. If you only score final answers, you miss the failure mode entirely.&lt;/p&gt;

&lt;p&gt;This is also why the public conversation around AI has become more cautious. Anthropic’s &lt;a href="https://www.anthropic.com/news/anthropic-public-record" rel="noopener noreferrer"&gt;Public Record survey&lt;/a&gt; found that Americans want accountability, worry about job loss and cognitive dependency, and strongly prefer that companies be held responsible for harm. That is not a technical benchmark, but it matches the same direction: people care about whether AI systems behave reliably in the real world, not just whether they sound impressive in demos.&lt;/p&gt;

&lt;p&gt;And if you want a view from practitioners rather than labs, the Hacker News discussion &lt;a href="https://news.ycombinator.com/item?id=48168221" rel="noopener noreferrer"&gt;“I don't think AI will make your processes go faster”&lt;/a&gt; captures a familiar pattern: speedups are real only when the surrounding process is redesigned, reviewed, and supervised. Otherwise, the model can simply move mistakes faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to take away from Age of LLM
&lt;/h2&gt;

&lt;p&gt;The most useful lesson from this benchmark is that agent evaluation needs to move closer to the conditions of deployment.&lt;/p&gt;

&lt;p&gt;That means testing for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partial observability&lt;/li&gt;
&lt;li&gt;hidden state&lt;/li&gt;
&lt;li&gt;long-horizon coordination&lt;/li&gt;
&lt;li&gt;action validity&lt;/li&gt;
&lt;li&gt;recovery from mistakes&lt;/li&gt;
&lt;li&gt;and the difference between sounding right and acting correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Age of LLM is valuable because it compresses all of those concerns into a small, readable setting. The game is artificial, but the evaluation logic is not. If an AI system is supposed to function as an agent, then it should be judged in an environment that makes action, memory, and uncertainty visible.&lt;/p&gt;

&lt;p&gt;That is where the field seems to be heading: away from benchmarks that reward fluent explanations, and toward benchmarks that expose whether a model can actually operate.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.24391" rel="noopener noreferrer"&gt;Age of LLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog/daya-shankar/agentic-ai-trends-2026" rel="noopener noreferrer"&gt;Latest Agentic AI Trends to Watch in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.00183" rel="noopener noreferrer"&gt;Agentic Transformers Provably Learn to Search via Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/item?id=48168221" rel="noopener noreferrer"&gt;Hacker News discussion on process speed and AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Orion-100B: How Macrocosmos Trained a 100B-Parameter Model Over the Open Internet</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Thu, 25 Jun 2026 13:57:53 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/orion-100b-how-macrocosmos-trained-a-100b-parameter-model-over-the-open-internet-4d5i</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/orion-100b-how-macrocosmos-trained-a-100b-parameter-model-over-the-open-internet-4d5i</guid>
      <description>&lt;p&gt;Training a 100-billion-parameter language model has, until recently, been the exclusive domain of organizations with access to tightly coupled, high-bandwidth GPU clusters — the kind that cost tens of millions of dollars to build and operate. Macrocosmos, a team building on the Bittensor decentralized AI network, just published results showing that this assumption is no longer strictly true.&lt;/p&gt;

&lt;p&gt;Their &lt;strong&gt;Orion-100B&lt;/strong&gt; project completed a full distributed pretraining run of a 100B-parameter model across nodes spread over five U.S. datacenters, connected via the public internet. The result is not a toy experiment: the system achieved 30–38% Model FLOP Utilization (MFU) and ran at roughly 65% of the throughput of an equivalent centralized setup — while allowing individual participants to contribute compute for as little as $1.25 per hour.&lt;/p&gt;

&lt;p&gt;Here is what they actually built, and why the engineering choices matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem: Why Distributed Training Is Hard
&lt;/h2&gt;

&lt;p&gt;Most large-scale model training relies on &lt;strong&gt;distributed data parallelism (DDP)&lt;/strong&gt;: each node holds a full copy of the model, processes a different batch of data, and synchronizes gradients across all nodes after each step. DDP works well when nodes are colocated in the same datacenter with high-bandwidth interconnects (NVLink, InfiniBand), but it has a critical weakness for heterogeneous, internet-connected setups: the system's effective capacity is bounded by the memory of the &lt;em&gt;smallest&lt;/em&gt; participating node. A single underpowered machine can bottleneck the entire run.&lt;/p&gt;

&lt;p&gt;Macrocosmos chose a different approach: &lt;strong&gt;distributed pipeline parallelism (DPP)&lt;/strong&gt;. Instead of replicating the full model across nodes, DPP shards the model's layers across multiple machines, with each node responsible for a contiguous slice of the network. Data flows through the pipeline sequentially — node 1 processes the first set of layers, passes activations to node 2, and so on. The total model capacity scales with the &lt;em&gt;aggregate&lt;/em&gt; memory of all participants, not the minimum.&lt;/p&gt;

&lt;p&gt;For Orion-100B, the team configured 16 pipeline stages across 48 devices (16 stages × 3 replicas each), all running on Nvidia A100 80GB GPUs distributed across five non-colocated datacenters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bandwidth Problem — and How ResBM Solves It
&lt;/h2&gt;

&lt;p&gt;Pipeline parallelism introduces its own bottleneck: every time activations pass from one pipeline stage to the next, those tensors must travel over the network. In a standard setup, transferring activations between stages for a 100B-parameter model requires moving roughly &lt;strong&gt;140.6 MB per step&lt;/strong&gt;. Over a public internet connection, that is prohibitive.&lt;/p&gt;

&lt;p&gt;Macrocosmos addressed this with &lt;strong&gt;ResBM activation compression&lt;/strong&gt;, a lossless compression technique applied specifically to the inter-stage activation tensors. ResBM reduced the transfer size from 140.6 MB down to &lt;strong&gt;2.2 MB&lt;/strong&gt; — a 64× reduction — making the bandwidth requirements compatible with commodity internet connections (the cluster's median upload speed was 856 Mbps, download 1,322 Mbps).&lt;/p&gt;

&lt;p&gt;This is arguably the most important technical contribution of the project. Without it, the communication overhead would dominate training time and make the approach impractical. With it, the system can sustain meaningful throughput even across geographically distributed nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping the Pipeline Coherent: IOTA and Stochastic Pathfinding
&lt;/h2&gt;

&lt;p&gt;Running a pipeline across non-colocated, potentially unreliable nodes requires solving two additional problems: synchronizing model weights across replicas, and handling node failures gracefully.&lt;/p&gt;

&lt;p&gt;For synchronization, Macrocosmos built the &lt;strong&gt;IOTA Bridge Service&lt;/strong&gt;, which manages distributed variable synchronization across pipeline stages. The system runs 10 inner gradient accumulation steps (H=10) per synchronization cycle. Because synchronization time scales inversely with the number of pipeline stages, this design keeps sync overhead low — and the team estimates that with further tuning (pseudogradient compression, H=100), synchronization could be reduced to just 0.5% of total training time, pushing utilization toward 97.8%.&lt;/p&gt;

&lt;p&gt;For fault tolerance, the system uses a &lt;strong&gt;stochastic pathfinding algorithm&lt;/strong&gt; (co-developed with Bittensor Subnet 1) that dynamically reroutes data flow when a node drops out, maintaining training coherence without requiring a full restart.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Numbers Actually Mean
&lt;/h2&gt;

&lt;p&gt;The headline metrics from the Orion-100B run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MFU:&lt;/strong&gt; 30.8% sustained, 38% peak&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; ~65% of an equivalent centralized datacenter setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entry cost:&lt;/strong&gt; $1.25/hr for a single contributing node (16 non-colocated A100s: ~$20/hr; enterprise 8×B200 peer: ~$50/hr)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 30% MFU is not exceptional by datacenter standards — well-optimized centralized runs on H100s can reach 50–60% MFU. But the comparison point here is not a centralized cluster; it is the alternative of &lt;em&gt;not being able to train at all&lt;/em&gt; without one. For organizations that cannot afford or access a dedicated GPU cluster, 65% of datacenter throughput at a fraction of the cost is a meaningful option.&lt;/p&gt;

&lt;p&gt;The economic model is also worth noting. Orion-100B runs on &lt;strong&gt;Bittensor Subnet 9&lt;/strong&gt;, where participants are compensated in TAO tokens for contributing compute. This creates an incentive structure for distributed contributors that does not exist in traditional cloud training setups.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;Macrocosmos has outlined a roadmap for progressively relaxing the constraints of the current system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Heterogeneous hardware:&lt;/strong&gt; Mixing different GPU generations to utilize stranded or underutilized compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interruptible compute:&lt;/strong&gt; Using low-cost spot-market instances that can be preempted and resumed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissionless participation:&lt;/strong&gt; Removing centralized coordination to allow untrusted global contributions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer hardware:&lt;/strong&gt; Onboarding RTX 4090/5090 cards and Apple Silicon systems via the existing "Train at Home" initiative on Bittensor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step increases the pool of available compute while introducing new engineering challenges around fault tolerance, security, and gradient integrity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for the Field
&lt;/h2&gt;

&lt;p&gt;The standard narrative around frontier model training is that it requires centralized infrastructure at a scale only a handful of organizations can afford. Orion-100B does not overturn that narrative entirely — a 30% MFU run on A100s is not going to out-compete a well-funded lab's H100 cluster. But it demonstrates that the &lt;em&gt;technical&lt;/em&gt; barriers to distributed, internet-scale training are lower than previously assumed.&lt;/p&gt;

&lt;p&gt;The key enablers — pipeline parallelism over DDP, aggressive activation compression, and fault-tolerant synchronization — are all transferable techniques. As ResBM and similar compression methods mature, and as the Bittensor ecosystem grows, the cost floor for training large models will continue to drop.&lt;/p&gt;

&lt;p&gt;For developers and researchers who want to follow the project, the primary technical writeup is available on the &lt;a href="https://macrocosmosai.substack.com/p/orion-100b-distributed-pretraining" rel="noopener noreferrer"&gt;Macrocosmos Substack&lt;/a&gt;. Additional technical analysis can be found at &lt;a href="https://simplytao.ai/blog/orion-100b-macrocosmos" rel="noopener noreferrer"&gt;SimplyTao&lt;/a&gt; and &lt;a href="https://www.tao.media/macrocosmos-unveils-orion-100b-a-100b-parameter-distributed-ai-training-run/" rel="noopener noreferrer"&gt;Tao.media&lt;/a&gt;. The economic breakdown is covered in detail at &lt;a href="https://news.ayen.in/orion-100b-low-cost-ai-training/" rel="noopener noreferrer"&gt;Ayen&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The broader question Orion-100B raises is not whether decentralized training can match centralized infrastructure today — it cannot, yet. The question is how quickly the gap closes as compression, fault tolerance, and incentive mechanisms improve. Based on this run, the answer appears to be: faster than most expected.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why Real-Time AI Assistants Are Hard — and What Wan-Streamer v0.1 Changes</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Thu, 25 Jun 2026 08:00:18 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/why-real-time-ai-assistants-are-hard-and-what-wan-streamer-v01-changes-3m70</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/why-real-time-ai-assistants-are-hard-and-what-wan-streamer-v01-changes-3m70</guid>
      <description>&lt;h1&gt;
  
  
  Why Real-Time AI Assistants Are Hard — and What Wan-Streamer v0.1 Changes
&lt;/h1&gt;

&lt;p&gt;Real-time AI feels easy to imagine and hard to build. A user speaks, the system thinks for a moment, then answers with the right words and a synchronized voice or avatar. In practice, that experience is usually stitched together from a chain of separate components: voice activity detection, speech recognition, a language model, text-to-speech, and some kind of animation or video renderer. Every handoff adds latency. Every boundary creates another place for timing errors to creep in.&lt;/p&gt;

&lt;p&gt;That is why Wan-Streamer v0.1 is interesting. The &lt;a href="https://wan-streamer.com/" rel="noopener noreferrer"&gt;Wan-Streamer project page&lt;/a&gt; presents a different idea: instead of treating audio, video, and text as separate services, use one streaming Transformer to handle them as a single interaction loop. In other words, the model does not merely respond faster; it is built around the assumption that interaction is continuous, causal, and full-duplex.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem: pipelines are slow
&lt;/h2&gt;

&lt;p&gt;A standard multimodal assistant often looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User audio comes in.&lt;/li&gt;
&lt;li&gt;ASR converts speech to text.&lt;/li&gt;
&lt;li&gt;A language model produces text.&lt;/li&gt;
&lt;li&gt;TTS turns that text back into speech.&lt;/li&gt;
&lt;li&gt;A renderer or avatar system tries to keep the face and mouth in sync.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works, but it is fragile. If the ASR step is late, everything downstream waits. If the text-to-speech system starts too early or too late, the avatar can look unnatural. If the user interrupts, the system may not notice quickly enough.&lt;/p&gt;

&lt;p&gt;That is the design problem Wan-Streamer tries to solve. The model is meant to perceive and respond in one causal stream, rather than bouncing data across multiple subsystems.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Transformer, not a chain of modules
&lt;/h2&gt;

&lt;p&gt;The core idea is straightforward to say and harder to execute: model language, audio, and video together inside a single Transformer. The &lt;a href="https://huggingface.co/papers/2606.25041" rel="noopener noreferrer"&gt;Hugging Face paper page&lt;/a&gt; summarizes the approach well: interleaved text, audio, and video tokens are processed with block-causal attention, so the model can keep a valid streaming history while still updating its internal state at each step.&lt;/p&gt;

&lt;p&gt;Why does that matter? Because the model is not waiting for a full conversation turn to finish before it can act. It can update continuously. That makes it closer to how humans interact in real time: we listen while planning a response, and we can react to interruptions before the other person finishes talking.&lt;/p&gt;

&lt;p&gt;The project page also describes a helpful systems idea: a &lt;strong&gt;thinker–performer&lt;/strong&gt; split. The thinker handles perception, state updates, and rendering of the previous unit. The performer handles the next unit’s latent generation. That overlap is important because low latency is not only about making one model faster. It is also about keeping different parts of the streaming loop from blocking one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the model keeps the stream moving
&lt;/h2&gt;

&lt;p&gt;Wan-Streamer is built around causality from the start. That means every new piece of information is processed in a way that respects time order. The system uses causal encoders and decoders, and the output side is generated in small streaming units rather than in one big batch.&lt;/p&gt;

&lt;p&gt;The most useful high-level mental model is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the input stream is continuously encoded&lt;/li&gt;
&lt;li&gt;the model updates a shared state from the interaction history&lt;/li&gt;
&lt;li&gt;the next response is predicted from that state&lt;/li&gt;
&lt;li&gt;audio and video are generated from latent representations&lt;/li&gt;
&lt;li&gt;the generated output is appended back into the history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is important because it turns the whole interaction into a loop. The model is not just answering a prompt; it is living inside a sequence of observation, response, and updated context.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2606.25041" rel="noopener noreferrer"&gt;arXiv abstract&lt;/a&gt; adds a few concrete details: Wan-Streamer uses interleaved visual, audio, and text tokens, block-causal attention for incremental streaming, and low-latency scheduling that supports roughly 160 ms streaming units. The project page says the system reaches about 200 ms model-side latency and roughly 550 ms total interaction latency when network delay is included.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why that is more than a benchmark number
&lt;/h2&gt;

&lt;p&gt;Latency numbers are easy to treat as vanity metrics, but in an interactive system they shape the user experience directly. When response time drops under a second, the conversation starts to feel live. When the system can also keep audio and video synchronized, it can behave more like a participant than a gadget.&lt;/p&gt;

&lt;p&gt;That matters for a few product categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support avatars&lt;/strong&gt; that need to respond naturally while keeping eye contact and facial timing intact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tutoring agents&lt;/strong&gt; that should be able to listen, explain, and adapt in the same session without awkward pauses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telepresence tools&lt;/strong&gt; where the agent’s speech, lip movement, and scene changes all need to arrive together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive demos&lt;/strong&gt; where the difference between “it works” and “it feels responsive” is mostly a systems problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In that sense, Wan-Streamer is less about making a chatbot talk and more about rethinking the structure of the interface. The model has to be aware of turn-taking, interruption, and timing as first-class behaviors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to be careful about
&lt;/h2&gt;

&lt;p&gt;There are still some caveats.&lt;/p&gt;

&lt;p&gt;First, this is a v0.1 system, and the demo quality is clearly still evolving. The project page shows a 192p proof-of-concept, which tells you that resolution and polish are not solved by architecture alone.&lt;/p&gt;

&lt;p&gt;Second, the public latency comparisons should be read carefully. Some systems are measured end-to-end, while others report only rendering-stage latency. Those are not the same thing.&lt;/p&gt;

&lt;p&gt;Third, a single streaming Transformer does not remove the hard problems of safety, robustness, or long-horizon consistency. It reduces a class of systems bottlenecks, but it does not magically solve alignment or reliability.&lt;/p&gt;

&lt;p&gt;Finally, the thinker–performer split is clever, but it is also a reminder that real-time multimodal AI can be hardware-heavy. Engineering the loop is part of the work, not a side detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  What developers should take away
&lt;/h2&gt;

&lt;p&gt;The biggest lesson from Wan-Streamer is not just “make the model bigger” or “make it faster.” It is that the &lt;strong&gt;shape of the interaction loop matters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you are building real-time AI products, ask a different set of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the system really need separate modules, or can some of them be fused into one causal backbone?&lt;/li&gt;
&lt;li&gt;Where are the unavoidable waits in the pipeline?&lt;/li&gt;
&lt;li&gt;What state must be preserved across turns to make the experience feel continuous?&lt;/li&gt;
&lt;li&gt;Which parts of the system can overlap instead of blocking each other?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions apply even if you never build a full audio-video agent. They are just as relevant for voice assistants, streaming transcription tools, multimodal copilots, and collaborative creation apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;Wan-Streamer v0.1 is a useful reminder that many “AI experience” problems are really &lt;strong&gt;systems design&lt;/strong&gt; problems. If the model has to feel live, the architecture has to be live too. The project shows one path forward: causal, streaming, and unified rather than modular, batch-oriented, and stitched together after the fact.&lt;/p&gt;

&lt;p&gt;That does not mean every team should copy the exact design. It does mean that if your product depends on natural interaction, you should pay close attention to how information moves through the system. In real-time AI, the route from input to response is often the product.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>OpenAI's Jalapeño Chip: Why a Custom Inference ASIC Changes the Economics of Running LLMs</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Thu, 25 Jun 2026 07:52:34 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/openais-jalapeno-chip-why-a-custom-inference-asic-changes-the-economics-of-running-llms-59km</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/openais-jalapeno-chip-why-a-custom-inference-asic-changes-the-economics-of-running-llms-59km</guid>
      <description>&lt;h1&gt;
  
  
  OpenAI's Jalapeño Chip: Why a Custom Inference ASIC Changes the Economics of Running LLMs
&lt;/h1&gt;

&lt;p&gt;OpenAI just unveiled its first custom silicon: a chip called Jalapeño, built in partnership with Broadcom. The announcement landed on June 24, 2026, and while the headline is easy to summarize — "OpenAI made a chip" — the more interesting story is &lt;em&gt;why&lt;/em&gt; a purpose-built inference processor can outperform a GPU, and what that means for the cost of running large language models at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Jalapeño Is Solving
&lt;/h2&gt;

&lt;p&gt;To understand why Jalapeño matters, you need to understand what actually limits LLM inference performance.&lt;/p&gt;

&lt;p&gt;When you send a prompt to ChatGPT, the model isn't bottlenecked by raw computation — it's bottlenecked by &lt;strong&gt;memory bandwidth&lt;/strong&gt;. Transformer-based models need to repeatedly load billions of parameters from memory into compute units for every token they generate. The bottleneck is the data movement between memory and logic circuits, not the arithmetic itself.&lt;/p&gt;

&lt;p&gt;General-purpose GPUs like Nvidia's Blackwell are designed to handle a wide range of workloads: graphics rendering, scientific simulation, model training, and inference. That flexibility comes at a cost — GPUs carry a lot of hardware that sits idle during inference, and their memory architectures aren't optimized specifically for the access patterns of autoregressive token generation.&lt;/p&gt;

&lt;p&gt;An ASIC (Application-Specific Integrated Circuit) trades flexibility for efficiency. By designing the chip specifically around the memory access patterns and compute requirements of transformer inference, you can eliminate the wasted cycles and reduce the energy spent moving data around.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Jalapeño Actually Is
&lt;/h2&gt;

&lt;p&gt;Jalapeño is a massive chip. The compute chiplet is approximately 840mm², which puts it near the EUV reticle size limit — essentially as large as a single chiplet can be with current lithography. It integrates six HBM (High Bandwidth Memory) modules alongside one I/O chiplet, giving it the memory bandwidth needed to serve large models without the data-movement bottleneck that plagues GPU-based inference clusters.&lt;/p&gt;

&lt;p&gt;The chip is paired with Broadcom's Tomahawk 6 Ethernet switch chips, which handle up to 1.6 terabits of traffic per second and include built-in congestion management. In a large inference cluster, network congestion between chips is a real performance killer — the Tomahawk integration addresses that at the hardware level. Custom server racks are being designed with Celestia Inc. to house the full system.&lt;/p&gt;

&lt;p&gt;One notable detail: the chip went from design to tape-out in nine months. The typical ASIC development cycle runs 1.5 to 2 years. OpenAI attributes the compressed timeline to using its own AI models to assist in chip design, along with Broadcom's IP reuse strategies. Whether or not you take that claim at face value, a nine-month cycle for a chip this complex is fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Claim
&lt;/h2&gt;

&lt;p&gt;Broadcom CEO Hock Tan stated that early internal testing shows approximately &lt;strong&gt;50% lower inference cost per token&lt;/strong&gt; compared to current-generation GPUs. OpenAI describes the performance-per-watt as "substantially better than state-of-the-art," and Broadcom claims the chip performs on par with Nvidia Blackwell and Google TPUs for relevant workloads.&lt;/p&gt;

&lt;p&gt;These numbers come from OpenAI's own lab testing and haven't been independently verified. Lab benchmarks also tend to diverge from production environments where workloads are more variable and unpredictable. Detailed technical reports are expected in the coming months, so treat the 50% figure as a directional claim rather than a settled fact.&lt;/p&gt;

&lt;p&gt;That said, the underlying logic is sound. If you design hardware specifically for the memory access patterns of transformer inference, you should be able to do better than a general-purpose GPU on that specific task. The question is how much better, and whether the efficiency holds up at production scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Doesn't Do
&lt;/h2&gt;

&lt;p&gt;Jalapeño is an inference-only chip. It won't replace Nvidia for training frontier models — that workload requires the flexibility and raw compute of GPUs, and Nvidia's CUDA ecosystem has a decade-long head start in software tooling. OpenAI continues to maintain large-scale agreements with Nvidia, AMD, and Amazon for training compute.&lt;/p&gt;

&lt;p&gt;There's also an adaptability risk inherent to ASICs. If model architectures shift significantly — say, a move away from standard transformer attention patterns — a custom chip optimized for today's architectures may lose its efficiency advantage. GPUs can be repurposed; ASICs generally can't. This is a real long-term risk, though one that OpenAI is presumably willing to accept given the scale of their inference workloads.&lt;/p&gt;

&lt;p&gt;Deployment is also not imminent. The timeline is: small prototype deployments in late 2026, scaling through 2027, and full production in the first half of 2028. The cost savings won't show up in API pricing anytime soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for the Broader Ecosystem
&lt;/h2&gt;

&lt;p&gt;OpenAI is not the first to go down this path. Google has been running TPUs for inference since 2016. AWS has Trainium (training) and Inferentia (inference). Microsoft has the Maia chip. What's notable is that OpenAI — historically a pure software company — is now building its own silicon as part of a strategy to own the full stack: models, data centers, networking, and chips.&lt;/p&gt;

&lt;p&gt;For ML practitioners and developers, the practical implications are a few years out. If Jalapeño scales as planned, inference costs for OpenAI's APIs should decrease meaningfully by 2028. Lower inference costs tend to unlock new use cases — applications that are currently too expensive to run at scale become viable when the per-token cost drops by half.&lt;/p&gt;

&lt;p&gt;The nine-month development cycle is also worth noting. AI-assisted chip design is moving from research curiosity to production reality. If AI tools can compress hardware development timelines this significantly, the pace of custom silicon development across the industry could accelerate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The inference bottleneck — memory bandwidth, not compute — is a well-understood problem in the ML infrastructure world. What Jalapeño represents is a large-scale, production-grade attempt to solve it with custom hardware rather than software workarounds. Whether the 50% cost reduction claim holds up in production will be the real test.&lt;/p&gt;

&lt;p&gt;For now, the announcement signals that the era of AI labs running entirely on commodity GPU hardware is ending. Custom silicon, designed around specific model architectures and inference patterns, is becoming a competitive necessity at the frontier. The economics of serving intelligence at scale are being renegotiated at the hardware level.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Primary source: &lt;a href="https://techcrunch.com/2026/06/24/openai-unveils-its-first-custom-chip-built-by-broadcom/" rel="noopener noreferrer"&gt;OpenAI unveils its first custom chip, built by Broadcom&lt;/a&gt; (TechCrunch)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supporting sources: &lt;a href="https://siliconangle.com/2026/06/24/openai-broadcom-debut-custom-jalapeno-chip-llm-inference/" rel="noopener noreferrer"&gt;SiliconAngle — technical cluster details&lt;/a&gt; | &lt;a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/broadcom-and-openai-unveil-custom-built-jalapeno-inference-processor-openais-first-chip-is-a-massive-reticle-sized-asic-built-in-an-ultra-fast-nine-month-development-cycle" rel="noopener noreferrer"&gt;Tom's Hardware — chip architecture&lt;/a&gt; | &lt;a href="https://venturebeat.com/infrastructure/openai-unveils-first-custom-ai-inference-chip-jalapeno-with-broadcom-and-its-development-was-sped-up-with-openais-own-models" rel="noopener noreferrer"&gt;VentureBeat — strategic context&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>How DeepSeek-V4 Achieves Million-Token Contexts Without Quadratic Attention Costs</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Wed, 24 Jun 2026 16:15:25 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/how-deepseek-v4-achieves-million-token-contexts-without-quadratic-attention-costs-1g23</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/how-deepseek-v4-achieves-million-token-contexts-without-quadratic-attention-costs-1g23</guid>
      <description>&lt;h1&gt;
  
  
  How DeepSeek-V4 Achieves Million-Token Contexts Without Quadratic Attention Costs
&lt;/h1&gt;

&lt;p&gt;DeepSeek-V4, released in April 2026 under the MIT license, is a Mixture-of-Experts model that supports a one-million-token context window while using only 27% of the inference FLOPs that its predecessor DeepSeek-V3.2 required at the same context length. The key to that efficiency is a hybrid attention mechanism that replaces standard full attention with two complementary compression strategies. This post walks through how those mechanisms work, what else changed in the architecture, and what the numbers look like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Attention Scales Quadratically
&lt;/h2&gt;

&lt;p&gt;Standard transformer attention computes a similarity score between every pair of tokens. At 1,000 tokens, that's one million comparisons. At one million tokens, it's one trillion — and the KV cache grows proportionally. This is why most production models cap out at 128K or 256K tokens even when longer contexts would be useful.&lt;/p&gt;

&lt;p&gt;The standard workarounds — sliding window attention, linear attention approximations, retrieval-augmented generation — each trade off some capability for efficiency. DeepSeek-V4 takes a different approach: it keeps full attention for a small fraction of the sequence and uses two levels of aggressive compression for the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Compression Strategies: CSA and HCA
&lt;/h2&gt;

&lt;p&gt;The hybrid attention system in DeepSeek-V4 interleaves two mechanisms across the model's layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compressed Sparse Attention (CSA)&lt;/strong&gt; compresses the KV cache by a factor of 4. Every four consecutive tokens are merged into a single KV entry using softmax-weighted pooling. A lightweight "lightning indexer" — running in FP4 precision — then scores the query against these compressed entries and selects the top-k most relevant ones (typically around 128) for the expensive softmax and matrix multiplication operations. A sliding window branch runs in parallel to preserve local dependencies. The result is that most of the sequence is represented at quarter resolution, and only the most relevant portions are attended to at full resolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heavily Compressed Attention (HCA)&lt;/strong&gt; applies more aggressive compression: 128 tokens are merged into a single KV entry, reducing a one-million-token sequence to roughly 8,000 entries — small enough for standard dense attention. HCA provides a coarse, document-level view of the entire context.&lt;/p&gt;

&lt;p&gt;Both mechanisms share KV projections using Multi-Query Attention (MQA), apply query and KV normalization, and use Partial Rotary Positional Embedding (RoPE). The combination gives the model three views at any layer: a high-resolution local window, a medium-resolution sparse selection via CSA, and a low-resolution global summary via HCA.&lt;/p&gt;

&lt;p&gt;The efficiency gains are substantial. At a one-million-token context, DeepSeek-V4-Pro requires 27% of the single-token inference FLOPs and 10% of the KV cache size compared to DeepSeek-V3.2 — the difference between a context length being theoretically possible and being economically viable to serve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manifold-Constrained Hyper-Connections
&lt;/h2&gt;

&lt;p&gt;Beyond attention, DeepSeek-V4 replaces standard residual connections with Manifold-Constrained Hyper-Connections (mHC). In a standard residual block, the output is &lt;code&gt;x + f(x)&lt;/code&gt;. In mHC, the residual mapping is constrained to lie on the manifold of doubly stochastic matrices, which bounds the spectral norm to 1. The parameters controlling the connection are generated dynamically from the input at each layer rather than being fixed.&lt;/p&gt;

&lt;p&gt;The practical effect is improved numerical stability during training and better signal propagation across depth. This matters more at scale: very deep models can suffer from gradient issues that mHC is designed to suppress.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Muon Optimizer
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 was trained using the &lt;a href="https://arxiv.org/abs/2606.09079" rel="noopener noreferrer"&gt;Muon optimizer&lt;/a&gt;, which orthogonalizes gradient update matrices using Newton-Schulz iterations before applying them. Standard Adam-style optimizers apply updates that can have large singular values, which can destabilize training. Muon constrains the update to be approximately orthogonal, which keeps singular values bounded and improves convergence stability.&lt;/p&gt;

&lt;p&gt;The training run used more than 32 trillion tokens covering math, code, web text, and long documents. Additional stability measures included Anticipatory Routing (which decouples the routing update cycle from the main training step to prevent expert collapse) and SwiGLU Clamping (which clips activation values to prevent outliers from destabilizing the MoE gating).&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Variants and Deployment
&lt;/h2&gt;

&lt;p&gt;The V4 series ships in two sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Total Parameters&lt;/th&gt;
&lt;th&gt;Active per Token&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V4-Pro&lt;/td&gt;
&lt;td&gt;1.6 trillion&lt;/td&gt;
&lt;td&gt;49 billion&lt;/td&gt;
&lt;td&gt;1 million tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V4-Flash&lt;/td&gt;
&lt;td&gt;284 billion&lt;/td&gt;
&lt;td&gt;13 billion&lt;/td&gt;
&lt;td&gt;1 million tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both models support three reasoning modes: Non-Think (standard generation), Think High (chain-of-thought reasoning), and Think Max (extended reasoning). Unlike earlier DeepSeek models, V4 supports tool calls even while in thinking mode, which matters for agentic pipelines.&lt;/p&gt;

&lt;p&gt;The models are available via the &lt;a href="https://api-docs.deepseek.com" rel="noopener noreferrer"&gt;DeepSeek API&lt;/a&gt; at $0.435/1M input tokens for Pro and $0.14/1M input tokens for Flash (as of May 2026). Both are released under the MIT license, with weights available on &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;. Self-hosting V4 Flash requires roughly 158GB of VRAM; V4 Pro requires a DGX H200 or equivalent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;On standard benchmarks, V4-Pro shows consistent improvements over V3.2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;DeepSeek-V3.2&lt;/th&gt;
&lt;th&gt;DeepSeek-V4-Flash&lt;/th&gt;
&lt;th&gt;DeepSeek-V4-Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU (5-shot)&lt;/td&gt;
&lt;td&gt;87.8&lt;/td&gt;
&lt;td&gt;88.7&lt;/td&gt;
&lt;td&gt;90.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU-Pro (5-shot)&lt;/td&gt;
&lt;td&gt;65.5&lt;/td&gt;
&lt;td&gt;68.3&lt;/td&gt;
&lt;td&gt;73.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (Pass@1)&lt;/td&gt;
&lt;td&gt;62.8&lt;/td&gt;
&lt;td&gt;69.5&lt;/td&gt;
&lt;td&gt;76.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LongBench-V2 (1-shot)&lt;/td&gt;
&lt;td&gt;40.2&lt;/td&gt;
&lt;td&gt;44.7&lt;/td&gt;
&lt;td&gt;51.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SimpleQA (25-shot)&lt;/td&gt;
&lt;td&gt;28.3&lt;/td&gt;
&lt;td&gt;30.1&lt;/td&gt;
&lt;td&gt;55.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LongBench-V2 result is the most directly relevant: a 28% relative improvement over V3.2 on long-context tasks, where the CSA/HCA hybrid attention does the most work. The SimpleQA jump for Pro (28.3 → 55.2) reflects improvements in factual recall from the larger active parameter count and the OPD post-training step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Training: On-Policy Distillation
&lt;/h2&gt;

&lt;p&gt;The post-training pipeline uses On-Policy Distillation (OPD), which transfers knowledge from multiple domain-specific teacher models into the unified student using full-vocabulary logit distillation. Rather than training a single generalist model from scratch, OPD lets the team maintain specialized teachers for coding, math, and reasoning, then distill their combined knowledge into V4-Pro and V4-Flash.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Developers
&lt;/h2&gt;

&lt;p&gt;If you are building applications that need to process long documents — legal contracts, codebases, research papers, extended conversation histories — the CSA/HCA architecture offers a concrete path to doing so without the memory and compute costs that have historically made million-token contexts impractical.&lt;/p&gt;

&lt;p&gt;A few practical notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Flash variant is the realistic self-hosting option.&lt;/strong&gt; At 13B active parameters and 284B total, V4-Flash is manageable on a cluster of H100s or H200s. V4-Pro at 49B active parameters requires significantly more infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reasoning mode selection matters for cost.&lt;/strong&gt; Think Max mode uses substantially more tokens than Non-Think. For applications where you need the long context but not extended chain-of-thought, Non-Think mode at V4-Flash pricing is quite affordable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The MIT license is genuinely permissive.&lt;/strong&gt; Unlike some open-weight releases with restrictive commercial terms, MIT allows unrestricted use, modification, and redistribution. The weights and infrastructure details are documented in the &lt;a href="https://arxiv.org/abs/2606.19348" rel="noopener noreferrer"&gt;technical report on arXiv&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CSA/HCA framing — coarse global attention plus sparse high-resolution attention — is a general architectural pattern that is likely to appear in other long-context models as the field continues to push context lengths upward.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Primary source: &lt;a href="https://arxiv.org/abs/2606.19348" rel="noopener noreferrer"&gt;DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence&lt;/a&gt; — DeepSeek AI, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supporting sources: &lt;a href="https://codersera.com/blog/deepseek-v4-complete-guide-2026/" rel="noopener noreferrer"&gt;DeepSeek V4 Complete Guide&lt;/a&gt; | &lt;a href="https://simonwillison.net/2026/apr/24/deepseek-v4/" rel="noopener noreferrer"&gt;Simon Willison's notes on DeepSeek-V4&lt;/a&gt; | &lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;DeepSeek V4 API Documentation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>How AtomMem Teaches LLM Agents to Manage Their Own Memory Using Reinforcement Learning</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Mon, 22 Jun 2026 16:41:50 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/how-atommem-teaches-llm-agents-to-manage-their-own-memory-using-reinforcement-learning-2k0i</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/how-atommem-teaches-llm-agents-to-manage-their-own-memory-using-reinforcement-learning-2k0i</guid>
      <description>&lt;h1&gt;
  
  
  How AtomMem Teaches LLM Agents to Manage Their Own Memory Using Reinforcement Learning
&lt;/h1&gt;

&lt;p&gt;Most LLM agents today treat memory as a filing cabinet: information goes in, retrieval pulls it out, and the rules for what to keep or discard are written by hand. AtomMem, a recent paper from Huo et al. (arXiv:2601.08323), takes a different approach — it lets the agent learn its own memory management policy through reinforcement learning, using a minimal set of four atomic operations as the action space. The result is an agent that adapts how it stores and retrieves information based on what the task actually demands.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Static Memory Workflows
&lt;/h2&gt;

&lt;p&gt;Early retrieval-augmented generation (RAG) systems treated memory as append-only: new information was added to a vector store, and retrieval pulled the most semantically similar chunks at query time. This works reasonably well for single-turn question answering, but it breaks down in long-horizon tasks where the agent needs to update beliefs, discard stale facts, or reorganize what it knows as the task evolves.&lt;/p&gt;

&lt;p&gt;The standard fix has been to write more elaborate rules: summarize after N turns, delete entries older than K steps, merge duplicates when similarity exceeds a threshold. These heuristics can work, but they are brittle. A rule tuned for a multi-hop QA task may perform poorly on a web navigation task where the agent needs to track a rapidly changing page state.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://arxiv.org/html/2603.11768v1" rel="noopener noreferrer"&gt;recent survey on evolving agent memory systems&lt;/a&gt; (Lam et al., 2026) identifies this as a core tension: static workflows are stable but inflexible, while fully autonomous memory management introduces risks like semantic drift and memory poisoning. AtomMem sits in the middle — it learns a policy, but constrains the action space to four well-defined primitives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Atomic Operations
&lt;/h2&gt;

&lt;p&gt;AtomMem decomposes memory management into four operations borrowed from database theory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create&lt;/strong&gt;: Add a new memory unit to the store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read&lt;/strong&gt;: Query the store to retrieve relevant information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update&lt;/strong&gt;: Modify or refine an existing memory unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete&lt;/strong&gt;: Remove a unit that is no longer relevant.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors argue that these four operations are &lt;em&gt;complete&lt;/em&gt; (any valid memory state can be reached through some sequence of them), &lt;em&gt;atomic&lt;/em&gt; (they cannot be meaningfully decomposed further), and &lt;em&gt;task-agnostic&lt;/em&gt; (they apply equally to QA, web navigation, or any other agentic setting).&lt;/p&gt;

&lt;p&gt;At each step, the agent receives the current task context and its memory state, then selects one of these operations and executes it. The key insight is that the &lt;em&gt;policy&lt;/em&gt; for choosing which operation to apply — and when — is what gets learned, not the operations themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning the Policy with GRPO
&lt;/h2&gt;

&lt;p&gt;To train the memory management policy, AtomMem frames the problem as a Partially Observable Markov Decision Process (POMDP). The agent cannot see the full task state; it only sees what has been retrieved from memory and what the current context provides.&lt;/p&gt;

&lt;p&gt;The training algorithm is &lt;a href="https://arxiv.org/abs/2601.08323" rel="noopener noreferrer"&gt;GRPO (Group Relative Policy Optimization)&lt;/a&gt;, which evaluates a group of candidate actions relative to each other rather than against an absolute baseline. This suits the memory management setting, where the "correct" action is context-dependent and hard to specify in advance.&lt;/p&gt;

&lt;p&gt;The reward signal comes from downstream task performance: if the agent's memory decisions lead to better answers, those decisions are reinforced. RL-based training improves performance by approximately 9 percentage points over a supervised baseline that uses the same CRUD operations but with a fixed policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Agent Actually Learns
&lt;/h2&gt;

&lt;p&gt;One of the more interesting findings is the behavioral analysis of the trained policy. Rather than applying all four operations uniformly, the agent develops a structured strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create and Update operations increase&lt;/strong&gt; as task complexity grows. The agent actively builds and refines its memory representation when the task demands it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete operations also increase&lt;/strong&gt; in complex settings, suggesting the agent learns to prune irrelevant information rather than letting the memory store grow unbounded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read operations stabilize&lt;/strong&gt; at an efficient level — the agent learns not to over-query its own memory, which would waste context tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This emergent behavior was not explicitly programmed. The agent discovered that selective deletion and targeted updates are more useful than passive accumulation, simply by optimizing for task performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;AtomMem was evaluated across five benchmarks covering two categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context multi-hop QA:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HotpotQA&lt;/li&gt;
&lt;li&gt;2WikiMultiHopQA&lt;/li&gt;
&lt;li&gt;MuSiQue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Web and agentic tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GAIA&lt;/li&gt;
&lt;li&gt;WebWalkerQA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Across these benchmarks, AtomMem achieves a &lt;strong&gt;3–8 percentage point improvement&lt;/strong&gt; over static-workflow baselines that use the same underlying LLM but with hand-coded memory rules. The gains are consistent across both QA and web navigation settings, which suggests the learned policy generalizes across task types rather than overfitting to a specific domain.&lt;/p&gt;

&lt;p&gt;The authors also test robustness by expanding context length up to 4× the training length. AtomMem maintains its advantage while static baselines degrade more sharply — a sign that the learned policy is more adaptive when the information environment changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Fits in the Broader Memory Landscape
&lt;/h2&gt;

&lt;p&gt;AtomMem is part of a broader shift in how researchers think about agent memory. The &lt;a href="https://arxiv.org/html/2603.11768v1" rel="noopener noreferrer"&gt;survey by Lam et al.&lt;/a&gt; classifies current memory systems into three categories: adaptive and learning-based systems (like AtomMem and Memory-R1), graph-based cognitive systems (like A-MEM and HippoRAG), and multimodal systems for lifelong learning.&lt;/p&gt;

&lt;p&gt;The learning-based category moves the design question from "what rules should govern memory?" to "what reward signal should shape memory behavior?" That shift is more honest about the fact that the right memory strategy depends on the task.&lt;/p&gt;

&lt;p&gt;The tradeoff the survey highlights — stability versus plasticity — is real. A policy that aggressively updates and deletes memory can drift semantically over long sessions. AtomMem addresses this partly through the atomic operation framing, though it does not yet include explicit mechanisms for detecting or correcting semantic drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implications for Agent Developers
&lt;/h2&gt;

&lt;p&gt;If you are building agents that need to maintain state across long sessions — customer support bots, research assistants, coding agents that track a codebase — the AtomMem framing offers a few concrete takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Treat memory management as a learned skill, not a fixed pipeline.&lt;/strong&gt; The right strategy for when to summarize, delete, or update depends on the task, and that dependency is hard to capture with static rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CRUD operations are a useful abstraction.&lt;/strong&gt; Even if you are not training a full RL policy, structuring your memory system around Create/Read/Update/Delete makes the behavior more auditable and easier to debug than monolithic retrieval pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reward shaping matters.&lt;/strong&gt; AtomMem uses downstream task performance as the reward signal, which is clean but requires a task-specific evaluation setup. For production systems, you may need proxy rewards (e.g., retrieval precision, answer consistency) that are cheaper to compute.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2601.08323" rel="noopener noreferrer"&gt;AtomMem paper&lt;/a&gt; is worth reading if you are working on long-horizon agents. The CRUD framing is simple enough to implement incrementally, and the RL training approach is compatible with standard post-training pipelines that use GRPO or similar algorithms.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Primary source: &lt;a href="https://arxiv.org/abs/2601.08323" rel="noopener noreferrer"&gt;AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation&lt;/a&gt; — Huo et al., 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supporting sources: &lt;a href="https://arxiv.org/html/2603.11768v1" rel="noopener noreferrer"&gt;Survey on Evolving LLM Agent Memory Systems&lt;/a&gt; — Lam et al., 2026 | &lt;a href="https://arxiv.org/html/2601.08323v3" rel="noopener noreferrer"&gt;AtomMem full paper HTML&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Nemotron 3 Ultra: How NVIDIA Built a 550B Open Model That Runs Faster Than Its Smaller Rivals</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Fri, 19 Jun 2026 16:15:16 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/nemotron-3-ultra-how-nvidia-built-a-550b-open-model-that-runs-faster-than-its-smaller-rivals-hpg</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/nemotron-3-ultra-how-nvidia-built-a-550b-open-model-that-runs-faster-than-its-smaller-rivals-hpg</guid>
      <description>&lt;h1&gt;
  
  
  Nemotron 3 Ultra: How NVIDIA Built a 550B Open Model That Runs Faster Than Its Smaller Rivals
&lt;/h1&gt;

&lt;p&gt;NVIDIA's Nemotron 3 Ultra, released on June 4, 2026, is a 550-billion-parameter open model that manages to outrun several competing models with far fewer active parameters per token. The trick is a hybrid architecture that mixes Mamba state-space layers with standard Transformer attention — a combination that sidesteps the memory bottlenecks that typically make large models slow in long-context settings.&lt;/p&gt;

&lt;p&gt;This post walks through what that architecture actually does, why it matters for agentic workloads, and what the training pipeline looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem: Attention Doesn't Scale Well to Long Contexts
&lt;/h2&gt;

&lt;p&gt;Standard Transformer attention has quadratic complexity with respect to sequence length. Double the context, and the compute cost for attention quadruples. For agentic tasks — where a model might need to reason over a long conversation history, a large codebase, or many tool-call results — this becomes a real bottleneck.&lt;/p&gt;

&lt;p&gt;One response is to replace some attention layers with state-space models (SSMs) like Mamba, which process sequences in linear time. The tradeoff is that SSMs are less precise at retrieving specific facts from long contexts. Nemotron 3 Ultra's hybrid design tries to get the best of both: Mamba layers handle the bulk of sequence processing at sub-quadratic cost, while a subset of full attention layers is retained for precise recall when it matters.&lt;/p&gt;

&lt;p&gt;The attention layers themselves are configured with 64 query heads but only 2 key-value heads. This keeps the KV cache small — a meaningful memory saving when you're running a 1-million-token context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  LatentMoE: More Experts Without More Inference Cost
&lt;/h2&gt;

&lt;p&gt;The model uses a Mixture-of-Experts (MoE) design with 512 total experts, of which 22 are activated per token. What makes this unusual is the "LatentMoE" routing mechanism: before tokens are routed to experts, they're projected into a compressed latent space. This lets NVIDIA pack in more specialized experts without proportionally increasing inference cost, since the routing decision happens in a lower-dimensional space.&lt;/p&gt;

&lt;p&gt;The result is a model with 550 billion total parameters but only 55 billion active per token — roughly a 10:1 ratio. That's why the inference throughput numbers are competitive despite the headline parameter count.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Token Prediction for Native Speculative Decoding
&lt;/h2&gt;

&lt;p&gt;Nemotron 3 Ultra includes Multi-Token Prediction (MTP) heads that predict several future tokens in a single forward pass. During training, these heads share parameters with the main model. At inference time, they enable speculative decoding natively — the model proposes multiple tokens at once, which can then be verified in parallel, reducing the number of sequential forward passes needed.&lt;/p&gt;

&lt;p&gt;This is different from the more common approach of using a separate, smaller draft model for speculative decoding. Having MTP built into the architecture means there's no need to maintain a separate model or tune the draft model separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  NVFP4 Training: 4-Bit Precision From the First Gradient Update
&lt;/h2&gt;

&lt;p&gt;The model was trained using NVFP4, a 4-bit floating-point format (E2M1 datatype with two-dimensional block quantization on weights). NVIDIA describes this as one of the largest demonstrations of stable NVFP4 training to date. The deployed model runs at an average of 5.03 bits-per-element, mixing NVFP4, FP8, and BF16 layers depending on the layer's sensitivity.&lt;/p&gt;

&lt;p&gt;For deployment, the model supports W4A16 quantization on Hopper-generation hardware (H100/H200), which lacks native FP4 tensor cores, and can use native FP4 math on Blackwell (B200/GB200). This means the same model weights can be served efficiently across both hardware generations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Training: RL Across 55 Environments
&lt;/h2&gt;

&lt;p&gt;Pre-training used specialized datasets including 173 billion tokens of GitHub code (up to September 2025), plus synthetic datasets for legal text, factual recall, and moral reasoning. Post-training combined supervised fine-tuning, reinforcement learning across 55 distinct environments, and Multi-teacher On-Policy Distillation (MOPD).&lt;/p&gt;

&lt;p&gt;MOPD addresses a known problem with multi-environment RL: when you train across many different task types simultaneously, the learning signal from any one environment gets diluted. NVIDIA's solution was to distill knowledge from over ten domain-specialized teacher models into the student model, concentrating the signal from each domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Performance
&lt;/h2&gt;

&lt;p&gt;On inference throughput in 8K input / 64K output settings, Nemotron 3 Ultra is reported to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5.9x faster&lt;/strong&gt; than GLM-5.1-754B-A40B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4.8x faster&lt;/strong&gt; than Kimi-K2.6-1T-A32B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.6x faster&lt;/strong&gt; than Qwen-3.5-397B-17B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the RULER benchmark at 1-million-token context, it outperforms other open LLMs. Accuracy on standard benchmarks is described as matching current state-of-the-art open models.&lt;/p&gt;

&lt;p&gt;The model also supports three reasoning modes: "Reasoning-off," "Regular," and "Medium-effort." The medium-effort mode uses 2.5x fewer tokens than regular mode at the cost of roughly 7% accuracy — a useful knob for applications where inference cost matters more than peak accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;p&gt;Nemotron 3 Ultra is released under the OpenMDW-1.1 license, with weights, training data, and recipes publicly available. It can be accessed via &lt;a href="https://huggingface.co/nvidia/Nemotron-3-Ultra" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, &lt;a href="https://build.nvidia.com/" rel="noopener noreferrer"&gt;NVIDIA NIM&lt;/a&gt;, and OpenRouter. NVIDIA also released an Agent Toolkit alongside the model, including NemoClaw and OpenShell components for building agentic pipelines.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2606.15007" rel="noopener noreferrer"&gt;technical report on arXiv&lt;/a&gt; covers the architecture and training in detail. The &lt;a href="https://research.nvidia.com/labs/nemotron/Nemotron-3-Ultra/" rel="noopener noreferrer"&gt;NVIDIA Research page&lt;/a&gt; has benchmark comparisons and deployment guidance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means in Practice
&lt;/h2&gt;

&lt;p&gt;The throughput advantage comes from the architectural choices working together: Mamba layers reduce the per-step cost of processing long sequences, the small KV cache from the 2-head attention configuration reduces memory pressure, and MTP enables speculative decoding without a separate draft model. None of these is new individually, but combining them at 550B scale with stable NVFP4 training is a meaningful engineering result.&lt;/p&gt;

&lt;p&gt;For developers building agentic systems that need to process long contexts — code repositories, document collections, extended tool-call histories — the 1M-token window and the throughput numbers make Nemotron 3 Ultra worth evaluating, particularly if you're already running on NVIDIA Blackwell hardware where native FP4 support gives an additional efficiency boost.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Agentic Resource Discovery Is the Missing Layer for AI Agents</title>
      <dc:creator>Prabhakar Chaudhary</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:57:12 +0000</pubDate>
      <link>https://dev.to/prabhakar_chaudhary_7afe4/why-agentic-resource-discovery-is-the-missing-layer-for-ai-agents-2lnh</link>
      <guid>https://dev.to/prabhakar_chaudhary_7afe4/why-agentic-resource-discovery-is-the-missing-layer-for-ai-agents-2lnh</guid>
      <description>&lt;p&gt;AI agents are getting better at many individual tasks, but they still run into a familiar systems problem: choosing the right capability at the right time. A model can be strong at reasoning, a separate tool can be strong at search, and another can be strong at GUI control, but none of that helps if the agent does not know what is available, how to rank options, or which artifact to load for the current task. That is the problem the Hugging Face article on &lt;a href="https://huggingface.co/blog/agentic-resource-discovery-launch" rel="noopener noreferrer"&gt;Agentic Resource Discovery&lt;/a&gt; tries to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real bottleneck is not generation
&lt;/h2&gt;

&lt;p&gt;The default agent pattern today is still install-first, use-later. A developer wires in a tool, a skill, or another agent ahead of time, then hopes the same configuration keeps working as the ecosystem changes. That approach breaks down quickly once an agent has to operate across many domains. The moment you move beyond a handful of curated tools, static configuration becomes a maintenance burden.&lt;/p&gt;

&lt;p&gt;The ARD proposal changes the selection step itself. Instead of hardcoding every integration into the agent, capabilities are published into a registry and searched at runtime. In other words, the agent does not need to know every tool in advance. It needs a good discovery layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ARD adds that MCP alone does not
&lt;/h2&gt;

&lt;p&gt;The important idea in ARD is that it sits in front of execution protocols rather than replacing them. MCP describes how an agent calls a tool. Skills describe how an agent consumes instructions. A2A describes how an agent reaches another agent. ARD is the layer that helps the agent find the right thing before any of those protocols are used.&lt;/p&gt;

&lt;p&gt;The spec defines two core pieces:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Static manifests
&lt;/h3&gt;

&lt;p&gt;Publishers can expose an &lt;code&gt;ai-catalog.json&lt;/code&gt; manifest at a well-known URL. That gives the registry enough metadata to index the capability without requiring a custom integration for every client. A manifest can carry identity, tags, representative queries, and compliance-related signals. That matters because search quality depends on more than a name and a short description.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dynamic search
&lt;/h3&gt;

&lt;p&gt;ARD also defines a &lt;code&gt;POST /search&lt;/code&gt; API. The client submits an intent in natural language, and the registry returns ranked capabilities. This shifts the selection problem away from the model’s context window and toward an explicit search service. For agents, that is a practical improvement: search is cheaper than stuffing every tool description into the prompt, and it is easier to update than a hardcoded allowlist.&lt;/p&gt;

&lt;p&gt;Hugging Face’s &lt;a href="https://github.com/huggingface/hf-discover" rel="noopener noreferrer"&gt;Discover tool&lt;/a&gt; is a concrete implementation of that idea. It wraps the Hub’s search infrastructure and exposes results as skills, MCP servers, or raw Space metadata depending on what the client asks for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more for computer-use agents
&lt;/h2&gt;

&lt;p&gt;If agents only called APIs, discovery would be useful but modest. The problem becomes sharper once agents need to operate graphical software. GUI agents have to choose not only a tool, but often the right skill pack, the right screenshot, or the right task-specific playbook.&lt;/p&gt;

&lt;p&gt;That is why the arXiv paper &lt;a href="https://arxiv.org/abs/2606.18448" rel="noopener noreferrer"&gt;VISUALSKILL: Multimodal Skills for Computer-Use Agents&lt;/a&gt; is a useful companion to ARD. The paper argues that existing skill libraries are often text-only even though GUI work is visual. Its results are notable: a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, which is 15.3 points above the no-skill baseline and 8.3 points above a matched text-only skill.&lt;/p&gt;

&lt;p&gt;The takeaway is not just that multimodal skills help. It is that agent ecosystems are becoming heterogeneous. Some capabilities are APIs, some are UI workflows, some are robot policies, and some are reusable task bundles. Once those capabilities exist, the next question is how an agent discovers the right one quickly and reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ecosystem is already moving in that direction
&lt;/h2&gt;

&lt;p&gt;Recent projects make the same point from different angles. Hugging Face’s post on &lt;a href="https://huggingface.co/blog/amazon/strands-lerobot-hub-to-hardware" rel="noopener noreferrer"&gt;From the Hugging Face Hub to Robot Hardware with Strands Agents and LeRobot&lt;/a&gt; shows an agent loop spanning simulation, Hub datasets, policy inference, and physical robot deployment. The important detail is not only that the stack works, but that it combines several resources with different lifecycles and formats.&lt;/p&gt;

&lt;p&gt;On the more practical side, the Hacker News thread for &lt;a href="https://news.ycombinator.com/item?id=48572553" rel="noopener noreferrer"&gt;Launch HN: Adam (YC W25) – Open-Source AI CAD&lt;/a&gt; and the discussion around &lt;a href="https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/" rel="noopener noreferrer"&gt;Running local models is good now&lt;/a&gt; both point to the same trend: the number of usable local and open-source capabilities is rising. When the ecosystem grows that fast, a manual setup for each tool stops being sustainable.&lt;/p&gt;

&lt;p&gt;ARD is a response to that growth. It treats discovery as infrastructure, not as a side feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  What builders should take from this
&lt;/h2&gt;

&lt;p&gt;If you are building agent products, the lesson is straightforward.&lt;/p&gt;

&lt;p&gt;First, do not assume a static tool list will age well. New tools will appear, old ones will change shape, and users will expect the agent to adapt.&lt;/p&gt;

&lt;p&gt;Second, publish richer metadata than a short tool name. Representative queries, task types, and capability tags improve ranking. For multimodal or GUI-heavy systems, include enough structure that a client can understand what kind of artifact it is loading.&lt;/p&gt;

&lt;p&gt;Third, separate discovery from execution. Search should tell the agent what exists. The execution protocol should handle how to use it. That separation makes the system easier to federate, safer to maintain, and easier to extend across vendors.&lt;/p&gt;

&lt;p&gt;ARD is still a draft, but it points at a real architectural shift. As agents become capable of working across APIs, GUIs, local models, robot stacks, and shared skills, the main challenge is no longer only model quality. It is capability routing. The agents that perform best will not just reason better; they will also find the right resource faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
