DEV Community: Prabhakar Chaudhary

LongCat-2.0: How Meituan Trained a 1.6T-Parameter Coding Model Without a Single Nvidia GPU

Prabhakar Chaudhary — Wed, 22 Jul 2026 16:10:21 +0000

LongCat-2.0: How Meituan Trained a 1.6T-Parameter Coding Model Without a Single Nvidia GPU

Meituan — best known in China as a food delivery and super-app company — quietly ran one of the most technically ambitious open-weight model releases of 2026. LongCat-2.0 is a 1.6-trillion-parameter Mixture-of-Experts (MoE) model built for agentic coding tasks, released under the MIT license on June 30, 2026, with full weights available on Hugging Face as of July 5. What makes it worth examining is not just the scale, but two specific engineering choices: a novel sparse attention mechanism designed for million-token contexts, and a training run conducted entirely on domestic Chinese AI ASICs — no Nvidia hardware involved.

What LongCat-2.0 Is Built For

LongCat-2.0 is not a general-purpose chat model. It is explicitly designed for autonomous software engineering: repository-level code edits, multi-step tool use, and long-horizon agent workflows. Meituan measured it through Claude Code sandboxes and the Hermes agent harness, and the benchmarks reflect that focus.

On SWE-bench Pro, LongCat-2.0 scores 59.5, narrowly ahead of GPT-5.5 at 58.6. On Terminal-Bench 2.1, it scores 70.8, comparable to Gemini 3.1 Pro (70.7). It trails Claude Opus 4.8 (69.2 on SWE-bench Pro, 78.9 on Terminal-Bench) on coding benchmarks, and falls behind on broader general-agent tasks like BrowseComp. The model's strength is narrow and deliberate: it is competitive with closed-source frontier models specifically on software engineering, at a fraction of the API cost.

Before the official reveal, LongCat-2.0 operated anonymously on OpenRouter as "Owl Alpha," where it accumulated roughly 10.1 trillion monthly tokens — a 242% month-over-month increase — and reached the top of the Hermes Agent leaderboard. Developers were using it without knowing who built it.

The Sparse Attention Problem at 1M Tokens

Running a 1-million-token context window efficiently is genuinely hard. Standard attention scales quadratically with sequence length, and even sparse attention mechanisms can create memory fragmentation and unpredictable bandwidth usage at this scale.

Meituan's answer is LongCat Sparse Attention (LSA), which extends DeepSeek Sparse Attention with three orthogonal improvements:

Streaming-aware Indexing (SI) restructures how tokens are selected for attention. Instead of fragmented random reads from GPU high-bandwidth memory (HBM), SI converts token selection into sequential, hardware-aligned memory reads. The result is coalesced HBM access — the memory controller can prefetch predictably rather than chasing scattered addresses.

Cross-Layer Indexing (CLI) exploits a property of transformer attention: which tokens matter tends to be stable across adjacent layers. CLI computes one indexing pass and reuses it across multiple consecutive layers during inference. This amortizes the cost of figuring out which tokens to attend to, and the stability assumption is reinforced during training via cross-layer distillation.

Hierarchical Indexing (HI) applies a two-stage scoring approach. A fast, approximate block-level pass filters the token population down to candidates, and then fine-grained token selection runs only on that smaller set. This avoids scoring every token at full precision.

LSA also extends to Multi-Token Prediction (MTP) for speculative decoding, which can further improve throughput at inference time.

On top of LSA, the model adds 135 billion N-gram Embedding parameters (using 5-gram token combinations). This expands the embedding space roughly 100-fold without adding to the MoE expert count, and it reduces memory I/O during large-batch decoding by shifting parameter weight away from the expert routing path.

Training on 50,000 Chinese ASICs

The hardware story is the part that has attracted the most attention. Meituan trained LongCat-2.0 on a cluster of over 50,000 domestic Chinese AI ASICs — reportedly Huawei Ascend-class accelerators — without any Nvidia GPUs. The training run covered 35+ trillion tokens with no rollbacks or irrecoverable loss spikes.

This is not trivial. Frontier-scale training runs on non-standard hardware typically encounter instability from non-deterministic floating-point operations, fault recovery gaps, and memory parallelism issues that are well-solved on Nvidia's CUDA stack but require custom engineering elsewhere. Meituan addressed this through deterministic operators across embedding, flash attention, LSA, and MoE layers; automated link isolation and rejoin after hardware stress tests; and a 6D parallelism scheme that includes a custom approach called EMBP for the N-gram Embedding module.

The cluster is organized into superpods of up to 48 machines each, with all-to-all communication inside each pod and RoCE (RDMA over Converged Ethernet) between pods. Meituan reports approximately 30% pretraining throughput gain from this topology.

The significance here is structural. U.S. export controls have restricted access to Nvidia's highest-end chips for Chinese companies. If a 1.6T-parameter model can be trained to near-frontier performance on alternative silicon, it suggests that export restrictions on GPU hardware may have less long-term effect on Chinese AI development than initially assumed.

Post-Training: Three Expert Clusters

LongCat-2.0 uses a post-training framework called MOPD (Multi-Teacher Optimization via Mixture of Specialized Experts). Rather than a single reward model, MOPD trains three independent expert clusters:

Agent Experts handle tool invocation, multi-turn API parameter parsing, and self-correcting execution loops.
Reasoning Experts focus on multi-hop logic, mathematics, and STEM problem-solving, with adaptive compute allocation based on problem difficulty.
Interaction Experts handle instruction following, hallucination suppression, and safety constraints.

A dynamic gate-routing mechanism combines these at runtime. The separation is meant to prevent the common failure mode where safety tuning degrades reasoning performance, or where reasoning optimization causes the model to ignore instructions.

Access and Licensing

The model is available on Hugging Face under the MIT license, with BF16/F32 safetensors and an FP8 variant. GPU inference runs via SGLang (16x H20 recommended); NPU inference is supported through SGLang-FluentLLM. Self-hosting at this scale requires significant infrastructure — multiple H20 nodes — so it is not a laptop experiment.

Via API, Meituan offers standard pay-as-you-go pricing ($0.75/$2.95 per million tokens in/out) with a promotional rate of $0.30/$1.20. A notable feature: context cache hits are processed free of charge, which meaningfully changes the economics for agentic workflows that repeatedly reference the same large codebase.

What This Demonstrates

LongCat-2.0 is worth paying attention to for two reasons that go beyond benchmark numbers. First, it shows that sparse attention for million-token contexts is a solvable engineering problem — LSA's three-component approach is a concrete example of how to make long-context inference practical at scale. Second, the training infrastructure story is a data point about the state of non-Nvidia AI hardware: a 35-trillion-token run on 50,000 ASICs, completed without instability, is evidence that the hardware gap is narrowing.

For developers, the MIT license and open weights mean LongCat-2.0 can be integrated into commercial products without restriction. For researchers, the technical report details on LSA and MOPD are worth reading as examples of how to engineer around the specific bottlenecks of very large, very long-context MoE models.

TriAttention: How a Geometric Trick Cuts LLM Memory Use by 10x Without Losing Accuracy

Prabhakar Chaudhary — Mon, 20 Jul 2026 16:31:53 +0000

TriAttention: How a Geometric Trick Cuts LLM Memory Use by 10x Without Losing Accuracy

Long-context reasoning is one of the most memory-hungry workloads in modern LLM inference. When a model generates 32,000 tokens of chain-of-thought, its KV cache — the stored keys and values from every previous attention step — can consume tens of gigabytes of GPU memory. Researchers from MIT, NVIDIA, and Zhejiang University recently published TriAttention, a compression method that reduces that memory footprint by up to 10.7× while matching the accuracy of full attention. The key insight is geometric rather than heuristic: instead of guessing which tokens matter at runtime, TriAttention predicts importance from a stable property of the model's own weight space.

Why Existing KV Cache Compression Breaks in Production

The standard approach to KV cache compression is token eviction: score each cached token by how much attention it received, then drop the low-scorers. Methods like SnapKV and H2O follow this pattern. They work well in research settings but run into two concrete problems when deployed in production systems.

The FlashAttention visibility problem. Production inference relies on FlashAttention, which tiles its computation inside SRAM and never writes the full N×N attention score matrix to GPU memory. Most eviction methods need those scores to decide which tokens to keep. Without them, the system has to fall back to slower "eager" attention, which erases the performance benefit of compression in the first place.

The paged memory fragmentation problem. Serving frameworks like vLLM manage GPU memory in fixed-size physical blocks using a paged allocator. A block is only freed when it is completely empty. Standard eviction strategies scatter a handful of "survivor" tokens across many blocks, leaving the allocator unable to reclaim any of them. The memory savings exist on paper but not in practice.

TriAttention was designed specifically to avoid both of these failure modes, as detailed in NVIDIA's research blog on KV cache compression infrastructure.

The Core Idea: Q/K Concentration in Pre-RoPE Space

The method starts with an empirical observation about how transformer attention heads behave internally. In the pre-RoPE representation space — before Rotary Position Embedding rotates the query and key vectors based on their sequence position — the Q and K vectors for roughly 90% of attention heads cluster tightly around stable, non-zero centers. The authors call this Q/K concentration.

This matters because RoPE rotation is what makes post-RoPE queries unstable. A query vector at position 1,000 points in a different direction than the same semantic query at position 5,000, because RoPE has rotated it. That instability limits how far back you can look when estimating token importance. Pre-RoPE vectors don't have this problem: their centers stay fixed regardless of position or input context.

When Q and K vectors are concentrated around fixed centers, the attention logit between any query and any key can be approximated as a trigonometric series whose coefficients depend only on those centers and the positional distance between the two tokens. TriAttention uses this series to score key importance without ever needing to observe live attention scores. For the minority of heads where concentration is lower, it blends in a norm-based signal, weighted automatically by a measured concentration metric.

The result is an importance estimator that is both accurate and infrastructure-friendly: it requires no attention score matrix, so it works natively with FlashAttention.

Solving the Memory Fragmentation Problem

Predicting which tokens to keep is only half the problem. The other half is actually freeing the memory they occupied. TriAttention introduces a mechanism called Forward-Packing Compaction: every ~128 decode steps, the system physically moves surviving tokens to consolidate them into as few memory blocks as possible, emptying out the tail blocks so the paged allocator can reclaim them.

This is a straightforward engineering step, but it's the piece that makes the memory savings real rather than theoretical. Without compaction, eviction methods leave fragmented survivors that the allocator cannot touch.

What the Numbers Look Like

The authors evaluated TriAttention on AIME 2025 with 32,000-token generation, using a KV cache budget of 3,072 tokens — roughly a 10× reduction from the full cache size.

Method	AIME25 Accuracy	KV Memory	Throughput vs. Full Attn
Full Attention	40.8%	1×	1×
SnapKV	~20%	~0.1×	—
TriAttention	40.8%	0.093×	2.5×

TriAttention matches full attention accuracy exactly while using about 1/10th the KV memory and running 2.5× faster. Competing methods at similar compression ratios lose roughly half the accuracy.

On MATH 500, the method retains only 1,024 of 32,768 tokens and scores 68.4% versus 69.6% for full attention — a gap of 1.2 percentage points at a 32× compression ratio.

A practical consequence: models like Qwen3-32B that would run out of memory on a single RTX 4090 during long-context generation can run successfully with TriAttention applied. The GitHub repository includes vLLM and SGLang integrations, and community ports to llama.cpp with HIP/ROCm support have appeared.

What This Means for Long-Context Inference

The broader significance of TriAttention is that it closes the gap between laboratory compression results and production deployment. Previous methods often looked good on benchmarks but failed in real serving stacks because they conflicted with FlashAttention or paged memory management. TriAttention was designed with those constraints in mind from the start.

The trigonometric approximation approach also suggests a direction for future work: rather than treating token importance as something to measure at runtime, it can be predicted from the model's learned geometry. That shift — from observation to prediction — is what allows the method to sidestep the FlashAttention visibility problem entirely.

For teams running long-context workloads at scale, the combination of 10× memory reduction and 2.5× throughput improvement without accuracy loss is worth evaluating. The arXiv paper includes implementation details, and the Hugging Face paper page links to community discussion and additional resources.

The method was presented at ICML 2026, with authors from MIT, NVIDIA, and Zhejiang University.

Inkling: How Thinking Machines Lab Built a 975B Open-Weight Model Around Controllable Thinking

Prabhakar Chaudhary — Fri, 17 Jul 2026 16:11:58 +0000

Inkling: How Thinking Machines Lab Built a 975B Open-Weight Model Around Controllable Thinking

Thinking Machines Lab — the startup founded by former OpenAI CTO Mira Murati — released its first in-house model on July 15, 2026. Called Inkling, it is a 975-billion-parameter Mixture-of-Experts (MoE) transformer with 41 billion active parameters, a 1-million-token context window, and native support for text, image, and audio inputs. The weights are available on Hugging Face under the Apache 2.0 license.

What makes Inkling worth examining is not its raw benchmark position — the company is explicit that it is "not the strongest overall model available today, open or closed." Instead, the design choices around efficiency, multimodality, and a feature called controllable thinking effort reflect a specific set of tradeoffs that are worth understanding on their own terms.

Architecture: A Familiar Spine with Deliberate Departures

Inkling is a 66-layer decoder-only transformer. Its MoE feed-forward layers follow the DeepSeek-V3 design closely: each layer contains 256 routed experts and 2 shared experts, with 6 routed experts activated per token. This keeps the active parameter count at 41B while the full parameter count sits at 975B — a ratio that makes inference substantially cheaper than a dense model of equivalent capacity.

A few choices diverge from the common recipe:

Attention: Inkling uses an interleaved mix of sliding-window and global attention layers at a 5:1 ratio, with 8 KV heads. Rather than RoPE (the positional encoding used by most recent models), it uses learned relative positional embeddings.
Multimodal encoding: The model uses an encoder-free early fusion approach. Audio is ingested as dMel spectrograms; images are encoded as 40×40 pixel patches via a four-layer hierarchical MLP (hMLP). Both modalities are projected into the same hidden space as text tokens and processed jointly by the decoder.
Optimization: Training used a hybrid strategy — Muon for large matrix weights, Adam for everything else. The model was trained entirely on NVIDIA GB300 NVL72 systems.

The encoder-free design for vision and audio is notable. Most multimodal models use a separate vision encoder (like a ViT) and project its outputs into the language model's space. Inkling skips that step, processing raw patches directly. This simplifies the architecture and avoids the mismatch between a separately trained encoder and the main model, but it also means the model has to learn visual representations from scratch rather than inheriting them from a pretrained encoder.

Controllable Thinking Effort: What It Is and How It Works

The most distinctive feature of Inkling is its controllable thinking effort — a parameter that lets developers set the model's reasoning budget on a scale from 0.2 to 0.99 at inference time.

This is not a simple temperature knob. It was trained into the model through large-scale asynchronous reinforcement learning over more than 30 million rollouts. During RL training, Thinking Machines varied the system message and applied a per-token cost to the model's chain-of-thought reasoning. By adjusting that cost across different training samples, the model learned to compress or expand its reasoning depending on the effort level specified.

The practical result: at lower effort settings, the model produces shorter reasoning chains and responds faster. At higher settings, it reasons more extensively before generating a final answer. According to Thinking Machines' benchmarking, Inkling at effort=0.99 uses roughly one-third as many tokens as Nemotron 3 Ultra to reach comparable coding performance on certain tasks.

This matters for real-world deployment. Most production applications do not need maximum reasoning on every query. A customer support system, a document classifier, or a retrieval-augmented generation pipeline can often get adequate results with a fraction of the token budget that a hard reasoning task requires. Having a single model that can operate across that range — rather than maintaining separate fast and slow models — simplifies infrastructure.

Multimodal Benchmarks: Where It Stands

Inkling's multimodal performance is competitive among open-weight models, though closed models still lead on most audio and vision tasks. On MMMU Pro (Standard 10), it scores 73.5% — above Qwen3-Omni (60.0%) but below Gemini 3.1 Pro (82.0%). On VoiceBench it reaches 91.4%, and on MMAU (audio understanding) 77.2%, roughly on par with Qwen3-Omni.

On agentic coding, Inkling scores 77.6% on SWE-Bench Verified and 63.8% on Terminal Bench 2.1. On instruction following (IFBench), it scores 79.8% — above Claude Fable 5 (63.5%) and GPT-5.6 Sol (72.7%), which is one of its stronger relative results.

Inkling-Small: A Preview of the Smaller Sibling

Alongside the main release, Thinking Machines previewed Inkling-Small: a 276B-parameter MoE with 12B active parameters. The performance numbers are striking. On HLE (with tools), Inkling-Small scores 46.6% versus Inkling's 46.0%. On GPQA Diamond, it scores 88.3% versus 87.2%. On MCP-Atlas, it scores 74.9% versus 74.1%.

The smaller model underperforms on factuality (SimpleQA Verified: 20.9% vs. 43.9%) and on Terminal Bench 2.1 (52.7% vs. 63.8%), but for many agentic and reasoning tasks, the gap is narrow enough that the lower cost and latency of Inkling-Small may be the better tradeoff.

The Customization Argument

Thinking Machines is positioning Inkling primarily as a base for fine-tuning rather than a finished product. The company's platform, Tinker, provides fine-tuning infrastructure and a developer playground. The Apache 2.0 license allows commercial use and modification without restriction.

The business logic differs from OpenAI or Anthropic, which sell metered API access to locked models. As TechCrunch noted, once the weights are public, nothing obligates anyone to pay Thinking Machines to run them — so Tinker, not Inkling itself, is where the company's revenue has to come from.

One example: Thinking Machines worked with Bridgewater Associates to fine-tune a model on the hedge fund's internal financial expertise. The result reportedly scored 84.7% on financial reasoning tests at roughly one-fourteenth the running cost of comparable proprietary models — though those numbers come from the two companies' own evaluation, not an independent one.

Hardware Requirements and Deployment

Running the BF16 checkpoint requires at least 2 TB of aggregated VRAM — either 8× NVIDIA B300 GPUs or 16× H200 GPUs. A quantized NVFP4 checkpoint reduces this to 600 GB (4× B300 in W4A4 mode, or 8× H200 in W4A16 mode). Supported inference frameworks include SGLang, vLLM, TokenSpeed, Unsloth, and Hugging Face Transformers.

For most organizations, self-hosting a model at this scale is not practical. The more realistic path is using Tinker's hosted fine-tuning and inference, or one of the third-party inference providers that Thinking Machines has partnered with.

What to Make of It

Inkling is a technically coherent model with a clear design philosophy: broad multimodal capability, efficient inference through sparse MoE, and a training-time mechanism for controlling reasoning cost at inference. It is not trying to top every benchmark — it is trying to be a good starting point for organizations that want to adapt a model to their own data and workflows.

The controllable thinking effort mechanism is the most interesting technical contribution. Training a model to genuinely compress or expand its reasoning chain — rather than just truncating output — requires careful RL design, and the token efficiency numbers suggest it works. Whether that holds up after fine-tuning on domain-specific data remains to be seen.

Weights, model card, and fine-tuning documentation are at thinkingmachines.ai.

Grok Build is open source, and that matters for AI coding tools

Prabhakar Chaudhary — Thu, 16 Jul 2026 06:53:40 +0000

Grok Build is open source, and that matters for AI coding tools

What happened

xAI published the source code for Grok Build, its terminal-based AI coding agent. The repository shows a full stack for a TUI-driven assistant that can inspect a codebase, edit files, run shell commands, search the web, and manage longer-running tasks. In other words, this is not just a model demo or a chat wrapper; it is the software layer that turns a model into a usable developer tool.

The release came up on the Hacker News front page, which is useful context because the discussion there was less about model benchmarks and more about tooling, workflow, and whether open-source agent infrastructure is becoming a competitive advantage on its own.

Primary source: Grok Build repository

Why this release is interesting

A lot of AI coding products hide the implementation details behind a hosted UI. Open-sourcing the agent runtime gives the community something different to inspect: how the tool is structured, how it handles shell access, and how it organizes the user experience around files, commands, and context. That matters for engineers because the practical questions are often not about raw model capability. They are about reliability, prompting surfaces, permissions, and how much of the workflow can be automated without turning the tool into a black box.

The README describes Grok Build as a terminal-based coding agent that supports interactive use, headless scripting, editor integration via the Agent Client Protocol, and a modular tool/runtime layout. That makes it closer to an infrastructure project than a showcase demo. If you are building internal copilots, code assistants, or agent workflows, the design choices here are worth studying.

What the repository tells us

The repository description makes a few things clear:

1. The agent is meant to be operational, not decorative

The docs emphasize real actions: editing files, executing shell commands, searching the web, and coordinating long-running tasks. That means the system is designed around stateful work rather than one-off responses. For engineers, that shifts the focus from “Can the model answer?” to “Can the tool safely do the work?”

2. Terminal-first design is still relevant

A terminal UI may sound old-fashioned, but it has a real advantage: it fits naturally into the developer workflow. It also makes the control surface explicit. You can see the prompt, the output, the context, and the file system operations in one place. That is often easier to reason about than a browser-only experience.

3. The project is modular

The README breaks the codebase into packages for the pager UI, shell/runtime, tools, workspace management, markdown, sandboxing, and related components. That modularity matters because agent systems are usually a bundle of concerns: model orchestration, command execution, parsing, permissions, and UX. Splitting those pieces makes the system easier to audit and extend.

Why engineers should care

There are three reasons this release is worth paying attention to.

First, it highlights that agent quality is partly an engineering problem. The best model in the world is still hard to use if the surrounding toolchain is brittle, opaque, or unsafe. Open source lets others inspect where failures might happen: shell invocation, context assembly, file editing, and state management.

Second, it reflects a broader pattern in AI product development. The model is only one layer. The surrounding system—the editor integration, command runner, sandboxing, and UX—often determines whether people trust it enough for daily work.

Third, it gives teams a reference point. Even if you never adopt Grok Build directly, reading the repo can help you compare your own agent architecture against a real implementation.

Caveats

This is still just the code release and documentation, not a guarantee of performance. Open sourcing a tool does not automatically make it safer, better, or easier to operate in production. The important questions remain open:

How does it handle permissions and command execution boundaries?
What kind of sandboxing is actually enforced?
How does it recover from bad actions or partial failures?
How much of the workflow depends on model quality versus careful orchestration?

Those are the kinds of details that determine whether an AI coding agent is practical in a team setting.

Supporting context from the discussion

The Hacker News front page included the Grok Build announcement, which suggests the release is being interpreted as a tooling story rather than only a model story. That framing is useful: in practice, the people who build with AI spend a lot of time on tool behavior, integration, and reliability.

If you want to inspect the broader conversation, the front page listing is here: Hacker News.

Bottom line

Open-sourcing Grok Build is a reminder that the most useful AI systems are increasingly a combination of model, runtime, and developer workflow. For teams building internal agents or code assistants, the release is worth reading not because it settles the debate about AI coding tools, but because it exposes the mechanics behind one.

Sources

MiMo-V2-Flash: How Xiaomi Built a 309B MoE Model That Tops SWE-Bench Without Burning Through Compute

Prabhakar Chaudhary — Wed, 15 Jul 2026 16:12:13 +0000

MiMo-V2-Flash: How Xiaomi Built a 309B MoE Model That Tops SWE-Bench Without Burning Through Compute

Xiaomi's AI research team recently released MiMo-V2-Flash, a 309-billion-parameter Mixture-of-Experts language model that currently holds the top spot among open-source models on both SWE-Bench Verified and SWE-Bench Multilingual. What makes it worth examining isn't just the benchmark number — it's the combination of architectural choices and a post-training method called Multi-Teacher On-Policy Distillation (MOPD) that together let the team achieve frontier-level coding performance at a fraction of the usual compute cost.

The Architecture: Hybrid Attention and Built-In Speculative Decoding

MiMo-V2-Flash has 309B total parameters but only 15B active per token — a standard MoE tradeoff that keeps inference costs manageable relative to a dense model of equivalent capacity. The more interesting architectural decisions are in how it handles attention and decoding speed.

Hybrid attention (SWA + GA at 5:1). Most of the attention layers use Sliding Window Attention with a 128-token window, which dramatically reduces the KV-cache footprint for long sequences. Every sixth layer uses Global Attention to preserve long-range dependencies. This 5:1 interleaving cuts KV-cache storage by roughly 6× compared to full global attention across all layers, while a learnable attention sink bias keeps the model from losing track of distant context.

Multi-Token Prediction as a native draft mechanism. The model includes a 3-layer MTP module — lightweight dense FFNs with their own SWA — that predicts multiple future tokens in parallel. This functions as a built-in speculative decoding draft head, delivering a 2.0–2.6× decoding speedup without requiring a separate draft model. The MTP module adds relatively few parameters (0.33B per block) but meaningfully changes the throughput profile.

256K context window. Pre-training used a native 32K sequence length on 27 trillion tokens, then extended to 256K. The NIAH-Multi retrieval benchmark at 256K scores 96.7, suggesting the extension holds up in practice for long-document and multi-turn agent tasks.

Multi-Teacher On-Policy Distillation: The Post-Training Story

The benchmark result that stands out — 73.4% on SWE-Bench Verified — comes largely from the post-training approach rather than the base model alone. The team developed MOPD to address a real problem with standard RL-based post-training: it's expensive, and using a single teacher model limits how well the student can specialize across different domains.

How MOPD works. Instead of training a single teacher and distilling from it, the team trains multiple domain-specialized teacher models — one focused on math, one on code, and so on — each using large-scale reinforcement learning. The student (MiMo-V2-Flash) then learns from all of them simultaneously, but the key difference from standard knowledge distillation is that the process is formulated as an RL problem:

The student generates responses on-policy (from its own current distribution), which eliminates the exposure bias that plagues offline distillation.
Teachers provide dense, token-level reward signals rather than sparse sequence-level feedback. This gives the student much more informative gradients.
Because the student is learning from its own generations rather than teacher demonstrations, the training stays stable even as the student improves.

The efficiency claim is striking: the team reports that MOPD requires less than 1/50th of the compute of a comparable SFT + RL pipeline while matching or exceeding teacher performance. That's a significant reduction if it holds across different model scales and domains.

Why on-policy matters here. Standard knowledge distillation has a well-known problem: the student is trained on teacher-generated data, but at inference time it generates from its own distribution. The gap between these two distributions grows as the student diverges from the teacher, leading to compounding errors. By keeping the student on-policy throughout distillation, MOPD avoids this — the student always sees data from its own current distribution, so the training signal stays relevant.

Benchmark Results in Context

The headline numbers from the post-training evaluation:

Benchmark	Score
SWE-Bench Verified	73.4%
SWE-Bench Multilingual	#1 open-source
GPQA-Diamond	83.7%
MMLU-Pro	84.9%
Arena-Hard (Creative Writing)	86.2%

SWE-Bench Verified tests whether a model can resolve real GitHub issues — it requires understanding a codebase, identifying the relevant files, writing a patch, and passing the existing test suite. A 73.4% score puts MiMo-V2-Flash on par with several closed-source frontier models on this specific task.

The GPQA-Diamond score (83.7%) is also notable — this benchmark tests graduate-level science reasoning, and it's not a task the model was explicitly optimized for, suggesting the MOPD approach generalizes beyond coding.

Deployment Considerations

The 309B parameter count means hardware requirements are substantial. All expert weights need to fit in VRAM regardless of how many are active per token:

FP16/BF16: ~618 GB VRAM — roughly 10× H100 80GB GPUs
FP8: ~309 GB — 4× H100 80GB or 3× H200 141GB
INT4: ~155 GB — 2× H200 141GB

The model is compatible with vLLM (v0.8.0+) and SGLang, both of which support MoE expert parallelism. For the 256K context window, the recommendation is to start with --max-model-len 32768 to verify stability before scaling up, since KV-cache requirements grow significantly at full context length.

The model weights are available on Hugging Face under an open license, and the GitHub repository includes the technical report with full training details.

What's Worth Watching

Two things stand out about MiMo-V2-Flash beyond the benchmark numbers.

First, the MOPD approach is a concrete answer to a real scaling problem. Post-training with RL is expensive, and the standard recipe (SFT then RL with a single reward model) doesn't scale gracefully to multiple domains. If the 1/50th compute claim holds up under scrutiny, it's a meaningful contribution to how the field thinks about efficient post-training.

Second, the hybrid SWA+GA attention pattern — combined with built-in MTP for speculative decoding — represents a coherent set of engineering choices aimed at making a large MoE model actually deployable. The 6× KV-cache reduction and 2–2.6× decoding speedup aren't just nice-to-haves; they're what makes a 309B model viable for agentic workflows where you're running many sequential inference calls.

The technical report goes into considerably more depth on the training pipeline, including the RL curriculum and how the domain-specialized teachers were constructed. Worth reading if you're working on post-training methods or large-scale MoE deployment.

Tencent Hy3: How a 295B Sparse MoE Model Runs on 21B Active Parameters

Prabhakar Chaudhary — Mon, 13 Jul 2026 16:30:57 +0000

Tencent Hy3: How a 295B Sparse MoE Model Runs on 21B Active Parameters

Tencent released Hy3 on July 6, 2026 — a 295-billion-parameter Mixture-of-Experts model under the Apache 2.0 license. The headline numbers are striking, but the more interesting story is in the architecture: despite 295B total parameters, only 21 billion are active on any given forward pass. That gap between total and active parameters is the core design decision worth understanding.

What "Mixture of Experts" Actually Means Here

In a standard dense transformer, every parameter participates in every token's computation. A 295B dense model would require roughly 590GB of GPU memory just for weights, and every inference call would use all of it.

Hy3 takes a different approach. Each of its 80 transformer layers contains 192 routed experts — small feed-forward networks — plus one always-active shared expert. A learned router examines each token and selects the top 8 experts to process it. The other 184 experts sit idle for that token. The result: 21B parameters do the actual computation, while the remaining 274B provide specialized capacity that gets called on selectively.

This is why the model's inference cost is closer to a 21B dense model than a 295B one. The tradeoff is that all 295B weights must remain resident in GPU memory so the router can dispatch to any expert at any time — you can't swap experts in and out without introducing latency. In practice, serving Hy3 requires 8× H200-class GPUs in BF16 (about 590GB VRAM), or 4× H200s with FP8 quantization (around 295GB).

The Multi-Token Prediction Layer

Beyond the MoE routing, Hy3 includes a 3.8B-parameter Multi-Token Prediction (MTP) layer. Standard autoregressive generation produces one token per forward pass. MTP predicts several tokens ahead simultaneously, enabling speculative decoding: the model proposes a batch of candidate tokens, then verifies them in parallel rather than sequentially.

According to Tencent's deployment data, the MTP layer reduces time-to-first-token by 54% and end-to-end latency by 47% in production workloads. For agentic applications where the model is called repeatedly in a loop — tool calls, code execution, multi-step planning — that latency reduction compounds significantly.

The MTP layer is compatible with vLLM and SGLang's EAGLE-style speculative decoding, so teams already using those serving stacks can enable it without custom infrastructure.

Reasoning Modes and Tool Calling

Hy3 exposes three inference modes via API: no_think (direct response, lowest latency), think_low (light chain-of-thought), and think_high (extended reasoning). Developers can tune this per request depending on whether they need fast responses or careful deliberation.

For tool calling, Tencent ships dedicated parsers for both vLLM and SGLang:

vllm serve tencent/Hy3 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 2 \
  --tool-call-parser hy_v3 \
  --reasoning-parser hy_v3 \
  --enable-auto-tool-choice

The model is also OpenAI API-compatible when served this way, which means existing agent frameworks — Cline, Kilo Code, custom MCP loops — can swap in Hy3 without rewriting orchestration code.

What Changed Between the April Preview and the July Release

Tencent ran a preview of Hy3 in late April 2026 and collected feedback from over 50 internal product teams before the general release. The post-training improvements are measurable:

Hallucination rate dropped from 12.5% to 5.4%
Commonsense errors fell from 25.4% to 12.7%
Multi-turn intent drift decreased from 17.4% to 7.9%

These aren't benchmark numbers — they come from 270 expert human evaluators rating outputs across coding, document analysis, and frontend development tasks. The model scored 2.67/4 in blind evaluation against GLM-5.1's 2.51/4.

Benchmark Context

On public leaderboards, Hy3 scores 78% on SWE-bench Verified and 90.4% on GPQA Diamond. The SWE-bench number trails GLM-5.2's 84.2%, which has a higher active parameter budget (~40B active vs. Hy3's 21B). On GPQA Diamond — a graduate-level science reasoning benchmark — Hy3 is competitive with models several times its active size.

The more relevant metric for most teams is tool-calling reliability. Tencent reports less than 4% accuracy variance across different agent scaffoldings, which matters when you're running hundreds of tool calls in a single session and need consistent output formatting.

The Licensing Angle

Hy3 is released under Apache 2.0, which means unrestricted commercial use, no registration requirements, and no revenue caps. This contrasts with Meta's Llama license, which requires registration and caps commercial use based on monthly active users.

For teams that have been waiting for a permissive license on a capable open-weight model, Hy3 fills a gap. The 256K context window also makes it viable for repository-scale code analysis and long-document workflows that would otherwise require proprietary APIs.

The practical constraint is hardware. Consumer GPUs can't run this model — even an FP8-quantized version needs 4× H200s. For teams without that infrastructure, the model is available free on OpenRouter through July 21, 2026, and via the Nous Research portal during the same window.

Who This Is For

Hy3 is most useful for teams building agentic systems that need a capable, commercially permissive model with a large context window. If you're running agent frameworks that make repeated tool calls, the MTP-accelerated latency and reliable tool-call parsing are concrete advantages.

It's less suited for teams that need the absolute highest coding benchmark scores (GLM-5.2 leads there), or for anyone running on consumer hardware. The self-hosting economics work out to roughly $0.90–$1.62 per million output tokens on 4× H200 FP8 — significantly cheaper than proprietary APIs, but only if you have the infrastructure budget.

The model weights and an FP8-quantized version are both available on Hugging Face. The GitHub repository includes deployment recipes for vLLM and SGLang with MTP enabled.

The Broader Pattern

Hy3 is part of a broader trend in open-weight model releases: the interesting design decisions are increasingly in the efficiency layer rather than raw parameter counts. Sparse MoE routing, speculative decoding via MTP, and adjustable reasoning depth are all mechanisms for getting more useful work out of a given compute budget.

The 295B vs. 21B gap is a useful reminder that model size headlines don't tell you much about inference cost or practical capability. What matters is how many parameters are active per token, how efficiently the serving stack uses them, and whether the model's reliability holds up across the kinds of tasks you actually need to run.

NVIDIA Isaac GR00T N1.7: How Human Video Data Is Teaching Robots to Use Their Hands

Prabhakar Chaudhary — Fri, 10 Jul 2026 16:11:55 +0000

NVIDIA Isaac GR00T N1.7: How Human Video Data Is Teaching Robots to Use Their Hands

Humanoid robots have long struggled with a fundamental problem: getting them to do anything useful requires enormous amounts of robot-specific training data, collected through tedious teleoperation sessions where a human manually guides the robot through each task. NVIDIA's newly released Isaac GR00T N1.7 takes a different approach — one that leans heavily on the vast supply of human egocentric video that already exists in the world.

The result is a 3-billion-parameter open Vision-Language-Action (VLA) model that is commercially licensed, integrates with the popular LeRobot framework, and introduces what NVIDIA calls the first-ever scaling law for robot dexterity.

What Is a Vision-Language-Action Model?

A VLA model sits at the intersection of computer vision, natural language understanding, and motor control. It takes in what the robot sees (camera frames), what it's been told to do (a language instruction), and the robot's current physical state (joint positions, velocities, end-effector poses), and outputs continuous action vectors — the actual motor commands that move the robot's joints.

GR00T N1.7 uses a dual-system architecture called Action Cascade:

System 2 — Vision-Language Model: A Cosmos-Reason2-2B backbone (built on the Qwen3-VL architecture) processes image tokens and language instructions. This is where high-level task decomposition happens — breaking a complex instruction like "assemble the small parts" into a sequence of subtasks.
System 1 — Diffusion Transformer: A 32-layer DiT (Diffusion Transformer) takes the VLM's output alongside live robot proprioceptive state and denoises them into precise, continuous motor commands in real time.

The separation matters. High-level reasoning and low-level motor control have very different latency and precision requirements. Keeping them in separate systems lets each be optimized independently.

The EgoScale Insight: Human Hands as Training Data

The most interesting technical contribution in GR00T N1.7 is EgoScale — a pre-training strategy built on human egocentric video rather than robot teleoperation data.

The intuition is straightforward: humans and humanoid robots share a similar embodiment. Both have two hands, a first-person viewpoint, and operate in environments full of objects to pick up, assemble, and manipulate. Sensorized human video — ego cameras, wrist cameras, hand tracking — captures rich manipulation priors without requiring every behavior to be demonstrated on a physical robot first.

GR00T N1.7 was pre-trained on 20,854 hours of human egocentric video spanning more than 20 task categories, from manufacturing and retail to healthcare and home environments. This is a substantial increase from the few thousand hours of robot teleoperation data used to train the previous version, N1.6.

The key finding: more human egocentric data produces predictable, consistent improvements in dexterous manipulation capability. Going from 1,000 to 20,000 hours of human video more than doubles average task completion rates. This is what NVIDIA is calling a "scaling law for dexterity" — the same kind of predictable improvement curve that language model researchers have observed when scaling training data for text.

This matters because teleoperation is expensive and slow to scale. If human video can substitute for robot-specific demonstrations during pre-training, the data bottleneck for robot learning becomes much easier to address.

What the Model Can Actually Do

GR00T N1.7 has been validated across three categories of tasks:

Loco-manipulation — tasks that combine locomotion and arm control
Tabletop manipulation — pick-and-place, sorting, and assembly tasks on a fixed surface
Dexterous bimanual tasks — tasks requiring coordinated use of both hands, including contact-rich operations like small parts assembly

The model has been tested on the Unitree G1, Bimanual Manipulator YAM, and AGIBot Genie 1 robot platforms. The expanded action space — 132 state/action dimensions, up from 29 in N1.6 — and a 40-step action horizon (up from 16) give the model finer-grained control over complex movements.

The model uses a relative end-effector (EEF) action space, representing actions as deltas from the current pose rather than absolute targets. This design choice improves cross-embodiment generalization: the same model can be adapted to different robot hardware without retraining from scratch.

Integration with LeRobot and the Broader Ecosystem

GR00T N1.7 is now integrated into Hugging Face's LeRobot library, which provides a standardized, open-source framework for robot learning. The integration covers the full development loop:

Data collection via Isaac Teleop, which captures human demonstrations in the LeRobot dataset format
Simulation and prototyping through Isaac Lab-Arena
Fine-tuning on custom robot embodiments using launch_finetune.py
Deployment to physical hardware via the Gr00tPolicy interface, with export to ONNX and TensorRT for edge inference on platforms like NVIDIA Jetson Thor

Pre-registered embodiments include UNITREE_G1, LIBERO_PANDA, and OXE_WIDOWX. Developers can also register custom embodiments. Upgrading from N1.6 is designed to be a drop-in swap — point --model-path to nvidia/GR00T-N1.7 and existing configs carry over.

Hardware requirements are 16 GB+ VRAM for inference (an RTX 4090 or L40 works) and 40 GB+ VRAM for fine-tuning (H100 or A100 recommended).

Why the Open Licensing Matters

GR00T N1.7 is released under the Apache 2.0 license, which permits commercial use and redistribution. This is a meaningful choice in the robotics space, where many foundation models are research-only or carry restrictive terms that prevent production deployment.

NVIDIA explicitly positions this as "factory-floor ready" — the commercial license enables production deployments in material handling, packaging, and inspection workflows today. Industry partners including Agility Robotics, ANYBotics, and NEURA Robotics are already adopting GR00T components, alongside research institutions like Stanford, CMU, and ETH Zurich.

What This Means for Robot Learning Research

The EgoScale finding — that human video data transfers meaningfully to robot dexterity — opens a practical path for scaling robot pre-training without proportionally scaling teleoperation infrastructure. If the scaling law holds as data volumes increase further, it suggests that the gap between robot capability and human dexterity could narrow faster than the teleoperation bottleneck would otherwise allow.

The dual-system Action Cascade architecture also reflects a broader trend in embodied AI: separating the reasoning layer (which benefits from large language model capabilities) from the control layer (which needs low latency and precise continuous outputs). This decomposition is appearing in multiple recent robot learning systems, and GR00T N1.7 is one of the more complete open implementations of it.

The model weights, code, and documentation are available at github.com/NVIDIA/Isaac-GR00T and huggingface.co/nvidia/GR00T-N1.7.

GDPO: How Decoupled Reward Normalization Fixes Multi-Objective RL for LLMs

Prabhakar Chaudhary — Wed, 08 Jul 2026 16:12:25 +0000

GDPO: How Decoupled Reward Normalization Fixes Multi-Objective RL for LLMs

Training large language models with reinforcement learning almost always involves juggling multiple reward signals at once. You want the model to be accurate, follow a specific format, stay within a length budget, and avoid unsafe outputs — all simultaneously. The standard approach is to sum those rewards into a single number and feed it into Group Relative Policy Optimization (GRPO). It works, but it quietly throws away information in a way that hurts training. A new paper from NVIDIA researchers introduces GDPO (Group reward-Decoupled Normalization Policy Optimization) to fix this, and the fix turns out to be surprisingly clean.

What GRPO Does — and Where It Breaks Down

GRPO is a popular alternative to PPO for post-training LLMs. Instead of maintaining a separate critic network, it samples a group of responses for each prompt and computes advantages by comparing each response's reward to the group mean. This is memory-efficient and works well when there is a single reward signal.

The trouble starts when you add more rewards. The common workaround is to sum all reward components — say, r_accuracy + r_format + r_length — and treat the total as one number before running the GRPO advantage calculation. This is where reward collapse happens.

Here is the core problem: different combinations of partial rewards can produce the same total. Imagine two binary reward signals. A response that satisfies both rewards (total = 2) and a response that satisfies neither (total = 0) are clearly different. But a response that satisfies only the first reward (total = 1) and one that satisfies only the second (total = 1) are treated as identical by the advantage formula, even though the model should learn different things from each. When you normalize the summed rewards across a group, these distinct situations collapse into the same advantage value, and the training signal loses resolution.

The GDPO paper formalizes this: as the number of reward signals or sampled responses grows, the number of distinct advantage values that naïve GRPO can produce grows much more slowly than it should. The model is effectively learning from a blurrier signal than the data warrants.

The GDPO Fix: Normalize First, Then Sum

GDPO changes the order of operations. Instead of summing rewards and then normalizing, it normalizes each reward dimension independently within the group before summing. The two-step process looks like this:

Step 1 — Reward-decoupled group normalization. For each reward component r_k, compute the group mean and standard deviation across the sampled responses for that prompt. Subtract the mean and divide by the standard deviation to get a normalized advantage for that reward dimension. Do this separately for every reward signal.

Step 2 — Batch-level advantage normalization. Sum the per-reward normalized advantages into a single total advantage. Then apply a second normalization pass across the entire training batch (not just the group for one prompt). This prevents the magnitude of the combined signal from inflating as you add more reward objectives, keeping training numerically stable.

The result is that responses with different reward profiles now receive genuinely different advantage values. In the binary two-reward example, GDPO produces three distinct advantage combinations where naïve GRPO produces only two. At scale — with more rewards and larger group sizes — the gap widens considerably.

Why Batch-Level Normalization Matters

The second step is easy to overlook but important. Standard GRPO normalizes advantages only within the group sampled for a single prompt. When you sum multiple normalized reward signals, the variance of the combined advantage can grow with the number of rewards, making learning rates and clipping thresholds harder to tune. GDPO's batch-level normalization keeps the advantage distribution stable regardless of how many reward objectives you add, which means you do not need to re-tune hyperparameters every time you introduce a new reward signal.

This is a practical benefit for teams iterating on reward design. Adding a new reward component to a GDPO pipeline is less likely to destabilize training than adding it to a naïve GRPO setup.

Benchmark Results

The NVIDIA team tested GDPO using Qwen2.5-Instruct models at 1.5B and 3B parameter scales across three task categories:

Tool calling. Models trained with GDPO showed clear improvements in both tool-calling accuracy and format compliance compared to GRPO baselines. Format compliance is a natural multi-reward problem — the model must simultaneously produce a valid function call structure and select the correct function — so this is where the advantage of decoupled normalization is most direct.

Mathematical reasoning. On benchmarks including MATH, AIME, and AMC, GDPO maintained accuracy while adhering significantly better to length constraints. Length is often added as a secondary reward to discourage verbose reasoning chains; GRPO's reward collapse makes it harder for the model to learn the accuracy-length trade-off cleanly.

Coding. Across Apps, CodeContests, and Codeforces benchmarks, GDPO achieved a better balance between pass rates and bug ratios. Coding tasks often combine a correctness reward (does the code pass tests?) with a style or efficiency reward, making them another natural fit for decoupled normalization.

The paper also reports that GDPO has been applied in the C-MORAL framework for molecular optimization, where competing chemical property rewards (e.g., binding affinity vs. synthesizability) make reward collapse particularly costly.

How This Fits Into the Broader RL-for-LLMs Landscape

GDPO is not the only recent attempt to address GRPO's limitations. AVSPO tackles a related but distinct problem — advantage collapse when all responses in a group receive the same reward — by injecting virtual samples to restore variance. DAPO addresses entropy collapse in long chain-of-thought training. These methods are complementary: GDPO specifically targets the multi-reward normalization problem, while AVSPO and DAPO address single-reward collapse and entropy issues respectively.

What makes GDPO notable is its simplicity. The change is a reordering of normalization steps — no new model components, no additional forward passes, no architectural modifications. It slots into existing GRPO training pipelines with minimal engineering overhead.

Practical Takeaways

If you are training an LLM with multiple reward signals using GRPO, the key question is whether your reward combination step happens before or after normalization. If you are summing first, you are likely losing signal. Switching to per-reward normalization before aggregation is a low-cost change that the GDPO results suggest is worth making.

The batch-level normalization step is also worth adopting even if you only have one reward signal, since it provides more stable advantage magnitudes across training than group-only normalization.

The GDPO paper is concise and the math is accessible — if you are working on post-training pipelines, it is worth reading alongside the original GRPO paper to understand exactly where the information loss occurs and why the fix works.

ReContext: How Recursive Evidence Replay Helps LLMs Actually Use Long Contexts

Prabhakar Chaudhary — Mon, 06 Jul 2026 16:35:49 +0000

ReContext: How Recursive Evidence Replay Helps LLMs Actually Use Long Contexts

Large language models can now accept context windows of 128K, 1M, or even 10M tokens. But accepting a long input and reasoning well over it are two different things. A growing body of research shows that models frequently fail to retrieve or correctly use information buried in the middle of a long context — a problem known as the "lost in the middle" effect. A new paper from July 2026, ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning, proposes a training-free inference method that addresses this gap without modifying the model or adding external memory.

The Problem: Context Access ≠ Context Utilization

When you feed a 128K-token document to an LLM and ask a question, the model technically has access to every token. In practice, attention tends to concentrate on the beginning and end of the sequence, leaving the middle largely ignored. This happens for structural reasons rooted in the Transformer architecture:

Causal attention masking gives early tokens more exposure across the sequence, creating a primacy bias.
Rotary position embeddings (RoPE) introduce distance-based decay, so tokens far from both ends fall into an "attention dead zone."
Training data patterns reinforce this: most documents place key information at boundaries (headings, conclusions), so models learn to weight those positions more heavily.

The result is that even when a model can process a million tokens, its effective retrieval accuracy often degrades sharply past 200K–400K tokens. Benchmarks like RULER and multi-needle NIAH tests consistently show 30–60 point accuracy drops as context length increases for most frontier models.

What ReContext Does

ReContext treats the long-context reasoning problem as an associative memory problem. The authors draw an explicit analogy:

The context is the memory store.
The question is the retrieval cue.
The model's attention is the cue-trace association mechanism.
The replay is trace reactivation — surfacing the relevant memory before generating an answer.

The method works in three stages at inference time:

1. Identify Heavy-Hitter Tokens

ReContext uses the model's own internal attention signals to find "heavy-hitter" tokens — spans that receive disproportionately high attention relative to the query. Research on attention sinks and KV-cache optimization has shown that these high-attention tokens often account for 50–80% of query-relevant information in a document.

Rather than relying on an external retriever or a separate embedding model, ReContext reads these signals directly from the model's forward pass. No additional components are needed.

2. Build a Query-Conditioned Evidence Pool

The identified spans are assembled into an ordered evidence pool. This pool is constructed recursively: each round conditions on the original context, the question, and the evidence accumulated so far. The recursion typically runs for 2–4 rounds with an evidence token budget of 8–32 tokens per round.

This recursive selection is what distinguishes ReContext from simpler approaches like context truncation or sliding-window retrieval. Instead of discarding context, it adds a focused evidence summary on top of the full input.

3. Replay Evidence Before Generation

The final evidence pool is prepended to the prompt before the model generates its answer. This replay step re-binds the model's attention to the relevant spans, counteracting the positional bias that would otherwise cause it to overlook mid-context information.

Crucially, the full original context is still present. ReContext does not prune or compress the input — it supplements it with a distilled evidence trace.

Performance Results

The authors evaluated ReContext on eight long-context benchmarks at 128K token lengths, using Qwen3-4B, Qwen3-8B, and Llama3-8B as base models. The benchmarks include Natural Questions, TriviaQA, HotpotQA, and several multi-hop reasoning tasks.

Key findings:

ReContext achieves the best average rank across all eight benchmarks compared to baseline prompting and other training-free long-context methods.
Mean accuracy improves by approximately 24.6% relative to vanilla prompting across the tested models.
The gains are consistent across model sizes, suggesting the method is not dependent on a specific architecture or scale.

The improvement is particularly pronounced on multi-hop tasks, where the model needs to synthesize information from multiple locations in the context — exactly the scenario where positional bias causes the most damage.

Why Training-Free Matters

Most approaches to improving long-context reasoning require either fine-tuning the model (expensive, model-specific) or adding external retrieval infrastructure (RAG pipelines, vector databases, re-rankers). ReContext requires neither.

It runs entirely at inference time, using only the model's existing attention mechanism. This means it can be applied to any transformer-based LLM without modification, and it adds no new parameters or training data requirements. The computational overhead is modest: a few additional forward passes for the recursive evidence selection rounds.

This makes it practical for teams that want better long-context performance without committing to a full RAG architecture or a fine-tuning pipeline.

Limitations and Open Questions

ReContext is not a complete solution to the long-context problem. A few caveats worth noting:

The method has been tested on models up to 8B parameters. Whether the same gains hold for larger models (70B+) with different attention patterns is an open question.
The evidence token budget (8–32 tokens per round) is a hyperparameter that may need tuning for different task types. Very long or complex answers may require larger budgets.
The recursive selection process adds latency proportional to the number of rounds. For latency-sensitive applications, this tradeoff needs to be weighed against the accuracy gains.
The paper evaluates on English-language benchmarks. Performance on multilingual or code-heavy contexts is not yet characterized.

The Broader Context

ReContext fits into a growing set of inference-time techniques that try to extract more value from existing models without retraining. Methods like StreamingLLM address infinite-length streaming by managing attention sinks. KV-cache compression techniques like H2O and SnapKV reduce memory overhead. ReContext occupies a different niche: it improves reasoning quality over a fixed long input, rather than managing memory or extending the effective window.

As context windows continue to grow and agentic workflows push models to process longer and longer histories, the gap between nominal context length and effective reasoning depth will remain a practical concern. Training-free methods like ReContext offer one path toward closing that gap incrementally, without waiting for the next generation of model architectures.

The paper is available at arXiv:2607.02509.

Reversal Q-Learning: Teaching Offline RL to Work with Flow-Matching Policies

Prabhakar Chaudhary — Fri, 03 Jul 2026 16:14:28 +0000

Reversal Q-Learning: Teaching Offline RL to Work with Flow-Matching Policies

Flow matching has become one of the more useful tools in the generative modeling toolkit. It trains faster than diffusion models, produces high-quality samples, and handles multimodal distributions well — which makes it attractive for modeling robot actions, where the "right" move in a given situation might not be a single point but a whole family of plausible behaviors.

The catch is that combining flow matching with reinforcement learning is genuinely hard, especially in the offline setting where you only have a fixed dataset and no ability to collect new experience. A new paper from Aditya Oberai, Seohong Park, and Sergey Levine — Reversal Q-Learning (RQL) — proposes a clean solution to this problem, and the core idea is elegant enough to be worth understanding in detail.

Why Flow Matching and Offline RL Don't Play Well Together

To understand the problem RQL solves, it helps to know what flow matching actually does. A flow matching policy learns a vector field that transports samples from a simple noise distribution toward the target action distribution. At inference time, you start with noise and integrate the vector field over F steps to produce an action. The more steps, the more expressive the policy — but also the more computation.

When you want to improve this policy using reinforcement learning, you need to assign credit to actions based on their downstream returns. In offline RL, you do this with Q-functions estimated from a static dataset. The problem is that the dataset contains raw (state, action) pairs — it has no record of the intermediate flow steps that produced those actions. The flow steps are invisible.

One principled way to handle this is the expanded MDP framework: treat each of the F flow refinement steps as a separate action in a longer Markov decision process. This makes the flow steps explicit and lets you apply standard Q-learning. But it creates two new problems:

Dataset incompatibility. Your offline dataset doesn't contain the intermediate flow states. You can't directly apply Q-learning to transitions that don't exist in your data.
The curse of horizon. Expanding the MDP by a factor of F means your effective planning horizon grows by F. Temporal difference (TD) learning accumulates bias over long horizons, so value estimates become unreliable.

Previous approaches worked around these issues by using weighted regression, distillation, or rejection sampling — all of which either discard information or introduce their own approximation errors.

The RQL Solution: Reverse the Flow to Reconstruct What Happened

RQL's key insight is that deterministic flow ODEs are reversible. If you know the final action a that the policy produced for state s, you can run the flow ODE backwards to recover the entire sequence of intermediate states x⁰, x¹, ..., xᶠ that led to it.

Formally, for any transition (s, a, r, s') in the offline dataset, RQL solves the reverse ODE:

d/df θ(s, x, f) = -v(s, θ(s, x, f), f)

where v is the learned vector field. This reconstructs the "virtual" on-policy trajectory through flow space — the exact sequence of intermediate states the current policy would have taken to produce action a from state s.

These virtual trajectories are deterministic and on-policy with respect to the current flow policy. That's what makes them useful: because they're on-policy, you can apply multi-step returns across the flow steps without introducing off-policy bias. And because they're deterministic, the multi-step returns are exact rather than sampled estimates.

Collapsing the Horizon

The second innovation addresses the curse of horizon. Because the virtual trajectories are deterministic and on-policy, RQL can use multi-step returns to skip over intermediate flow steps entirely. Instead of estimating a value function over a horizon of T × F steps (where T is the task horizon and F is the number of flow steps), RQL collapses the effective horizon back down to T.

This works because the intermediate flow steps don't interact with the environment — they're purely internal to the policy's generation process. The reward signal only arrives at the end of a full action, not after each flow step. So you can treat the entire flow generation as a single "macro-action" for the purposes of value estimation, while still training the individual flow steps using the expanded MDP structure.

The result is that RQL gets the expressiveness benefits of training the full flow policy step-by-step, without paying the value estimation cost of a T × F horizon.

What This Avoids

RQL avoids several costly alternatives:

No backpropagation through time. BPTT through the entire ODE integration is expensive and numerically unstable for long chains.
No distillation. Distilling the flow policy into a one-step approximation loses expressiveness.
No rejection sampling. Filtering offline data by Q-value wastes data and doesn't directly optimize the policy.

RQL trains the full flow policy directly using the actual Q-function, without BPTT instability.

Empirical Results

The authors evaluate RQL on 50 simulated robotic tasks, covering locomotion and manipulation environments. RQL achieves the best average performance among state-of-the-art flow-based offline RL methods, with particularly strong results on long-horizon tasks where the curse of horizon would otherwise hurt competitors most.

The implementation is in JAX and is available on GitHub, which makes it relatively accessible for researchers working in the offline RL space.

Why This Matters for Robotics

Offline RL is especially important for robotics because collecting online experience is expensive, slow, and sometimes unsafe. A large dataset of robot demonstrations — even imperfect ones — lets offline RL extract a policy that improves on the demonstrations by optimizing for reward rather than just imitating behavior.

Flow matching is attractive for robot policies because robot actions are often multimodal: there might be several equally valid ways to grasp an object, and a unimodal Gaussian policy would average over them into an invalid action. RQL makes it practical to combine expressive flow policies with offline RL without the approximations that previous methods required.

The Broader Context

RQL fits into a growing body of work on training generative policies with RL. Related approaches include GenPO (which uses exact diffusion inversion for on-policy RL) and FMER (which uses advantage-weighted regression with flow policies). What distinguishes RQL is its focus on the offline setting and its use of ODE reversibility to avoid the dataset incompatibility problem entirely. The expanded MDP framework itself is not new, but applying it offline required the virtual trajectory construction that RQL introduces.

Summary

Reversal Q-Learning addresses a concrete technical obstacle: how to apply Q-learning to flow-matching policies using offline data that doesn't contain intermediate flow states. The solution — run the flow ODE in reverse to reconstruct virtual on-policy trajectories, then use multi-step returns to collapse the expanded horizon — is technically clean and empirically effective. For researchers working at the intersection of generative models and offline RL, it's a useful addition to the toolkit.

The paper is available at arxiv.org/abs/2606.17551 and the code at github.com/aoberai/rql.

Análisis de Claude Sonnet 5: El nuevo modelo 'agéntico' de Anthropic, su precio y posición en el mercado

Prabhakar Chaudhary — Thu, 02 Jul 2026 08:00:38 +0000

El 30 de junio de 2026, Anthropic anunció el lanzamiento de Claude Sonnet 5, el último modelo de su familia Sonnet. Este lanzamiento no es una simple actualización incremental; posiciona al modelo como una herramienta "agéntica" diseñada para ejecutar flujos de trabajo autónomos y complejos a un coste más accesible que los modelos de gama alta como Opus [1].

Este artículo ofrece un análisis detallado de lo que significa este lanzamiento para los desarrolladores y la industria. Se examinan las capacidades declaradas, los cambios técnicos, la estructura de precios y se sitúa la noticia en el contexto de las discusiones de la comunidad técnica y la investigación académica reciente sobre sistemas agénticos.

Metodología

Este análisis se basa en la documentación oficial de Anthropic, incluyendo el anuncio de lanzamiento y la ficha de sistema (System Card), discusiones técnicas en foros públicos como Hacker News, y artículos de investigación académica sobre la evaluación de agentes de IA publicados a mediados de 2026. El objetivo es ofrecer una visión equilibrada que distingue las afirmaciones del proveedor de las observaciones de la comunidad y el estado del arte académico.

¿Qué es Claude Sonnet 5? Capacidades y enfoque agéntico

Claude Sonnet 5 se presenta como un puente entre la familia Sonnet, de gama media, y la familia Opus, de gama alta. Según Anthropic, el modelo ofrece un rendimiento cercano al de Opus 4.8 en muchas tareas, pero con la velocidad y la eficiencia de costes de la línea Sonnet [1].

El principal diferenciador es su optimización para flujos de trabajo agénticos. Esto se refiere a la capacidad del modelo para realizar tareas complejas de varios pasos de forma autónoma, utilizando herramientas como un navegador web o un terminal [1]. Las capacidades clave declaradas incluyen:

Planificación y ejecución autónoma: El modelo puede crear un plan para abordar una solicitud compleja y ejecutarlo sin supervisión constante [1].
Uso avanzado de herramientas: Interactúa con terminales y navegadores para automatizar tareas que tradicionalmente requerían intervención humana [1].
Rendimiento en codificación: Anthropic destaca una mejora sustancial en tareas de ingeniería de software, como la depuración de código, la navegación por bases de código complejas y la refactorización. En la prueba de referencia SWE-bench Pro, Sonnet 5 obtuvo un 63.2%, en comparación con el 58.1% de su predecesor, Sonnet 4.6 [1].
Seguridad: El modelo presenta, según sus evaluaciones, tasas más bajas de alucinaciones y comportamientos no deseados en comparación con Sonnet 4.6. Incluye salvaguardas de ciberseguridad activadas por defecto para detectar y bloquear usos peligrosos [1, 4].

Cambios técnicos y consideraciones para desarrolladores

La migración a Sonnet 5 desde modelos anteriores no es completamente transparente y requiere atención a ciertos detalles técnicos:

Nuevo Tokenizador: Sonnet 5 utiliza un tokenizador actualizado. Según Anthropic, el mismo texto de entrada puede generar entre un 30% más de tokens que en versiones anteriores [1]. Aunque la empresa ajustó el precio de lanzamiento para que la transición sea aproximadamente neutra en costes, es fundamental que los desarrolladores reevalúen sus prompts y ajusten los límites de max_tokens [1].
Cambios en la API:
- La funcionalidad Adaptive Thinking está activada por defecto [1].
- Ya no se soportan los parámetros de muestreo (temperature, top_p, top_k), y su uso devolverá un error. La recomendación es guiar el comportamiento del modelo mediante instrucciones en el system prompt [1].
- El pensamiento extendido manual (manual thinking) ha sido eliminado en favor del pensamiento adaptativo [1].

Estructura de precios y disponibilidad

Claude Sonnet 5 está disponible en todos los planes de Anthropic (incluido el gratuito) y a través de la API de Claude en plataformas como AWS, Google Cloud y Microsoft Foundry [1]. Su estructura de precios se divide en un periodo introductorio y uno estándar [1, 5].

Período	Precio de Entrada (por millón de tokens)	Precio de Salida (por millón de tokens)
Introductorio (hasta 31/08/2026)	$2.00	$10.00
Estándar (desde 01/09/2026)	$3.00	$15.00

Fuente: Documentación oficial de Anthropic.

Este precio lo sitúa en una posición competitiva, significativamente más bajo que el de Opus 4.8, que tiene un coste de $5 por millón de tokens de entrada y $25 por millón de tokens de salida [5].

El contexto: Reacciones de la comunidad y avances en la investigación

Ningún lanzamiento tecnológico ocurre en el vacío. Para entender las implicaciones de Sonnet 5, es útil observar las reacciones de la comunidad y el estado de la investigación en IA.

Discusiones en Hacker News: Eficiencia vs. "Extracción de valor"

En plataformas como Hacker News, la recepción ha sido mixta y matizada. Si bien algunos desarrolladores informan de éxitos notables al usar Sonnet 5 para tareas complejas que antes requerían modelos más caros, han surgido dos críticas principales:

Consumo de tokens: Varios usuarios señalan que el modelo tiende a "sobrecomplicar" tareas sencillas, consumiendo una cantidad excesiva de tokens [2]. Este comportamiento ha alimentado la sospecha de que los modelos están siendo optimizados para la "extracción de valor" (wealth extraction) a través del uso de tokens, en lugar de para la eficiencia pura [2].
Agente asistido vs. Agente autónomo: Hay un debate sobre si la optimización para flujos de trabajo "totalmente agénticos" degrada el rendimiento en casos de uso de "asistencia agéntica", donde un desarrollador busca control granular y respuestas concisas, no un agente que intente resolverlo todo de forma autónoma [2].

Estas discusiones ponen de manifiesto una tensión clave: la promesa de la automatización total frente a la necesidad de control y eficiencia económica en el desarrollo diario.

El contexto de la investigación: El desafío de evaluar agentes

El marketing de Sonnet 5 en torno a su capacidad "agéntica" coincide con un intenso enfoque de la comunidad investigadora en cómo evaluar estos sistemas. Investigaciones recientes publicadas en repositorios como arXiv subrayan que medir el rendimiento de un agente de IA es un problema no resuelto.

Un artículo reciente de Zhu et al. (2026) destaca que los resultados de los benchmarks están a menudo confundidos por "efectos de andamiaje" (scaffold effects) [3]. Esto significa que el rendimiento medido no solo depende del modelo de lenguaje subyacente, sino también del código específico (el "andamio") que gestiona la memoria del agente, las llamadas a herramientas y la interacción con el entorno [3].

La investigación actual se está moviendo hacia:

Marcos de evaluación unificados: Para aislar la capacidad real del modelo de los efectos del entorno de prueba [3].
Diagnósticos automatizados: Herramientas que analizan la traza completa de ejecución de un agente para identificar patrones de fallo recurrentes, en lugar de limitarse a una puntuación final de éxito o fracaso [3].

Esto nos dice que, si bien la industria avanza rápidamente hacia la implementación de agentes, el campo académico todavía está construyendo las herramientas para comprender y medir de forma fiable su comportamiento, robustez y eficiencia [3].

Conclusión: Implicaciones prácticas

Claude Sonnet 5 es un movimiento estratégico de Anthropic para acelerar la adopción de la IA agéntica en entornos de producción, ofreciendo capacidades cercanas a la gama alta a un precio más asequible. Su objetivo es claro: permitir que las empresas pasen de la experimentación a la implementación de flujos de trabajo automatizados [1, 5].

Sin embargo, para los desarrolladores, la adopción no es trivial. Las implicaciones prácticas clave son:

El coste real es variable: El cambio en el tokenizador y el comportamiento a veces verboso del modelo significan que el coste por tarea debe ser evaluado cuidadosamente. No siempre será más barato que modelos anteriores o de la competencia, especialmente para tareas simples [1, 2].
Adecuación a la tarea: Sonnet 5 parece brillar en tareas autónomas y de larga duración. Para interacciones rápidas y controladas, su diseño "agéntico" podría ser contraproducente [2].
La evaluación es crucial: La verdadera eficacia del modelo dependerá de pruebas rigurosas en los casos de uso específicos de cada equipo. Las métricas del proveedor son un punto de partida, pero la validación en el mundo real es indispensable [2, 3].

En resumen, Claude Sonnet 5 es una herramienta potente con un enfoque definido en la autonomía. Su éxito dependerá de si los desarrolladores pueden alinear sus capacidades con los problemas correctos, gestionando al mismo tiempo la complejidad y el coste inherentes a estos nuevos sistemas agénticos.

Referencias

How DFlash Uses Block Diffusion to Break the Speculative Decoding Bottleneck

Prabhakar Chaudhary — Wed, 01 Jul 2026 16:14:54 +0000

How DFlash Uses Block Diffusion to Break the Speculative Decoding Bottleneck

Autoregressive LLM inference has a fundamental problem: every token depends on the one before it. Even with speculative decoding — where a small draft model proposes tokens and the target model verifies them in parallel — the drafting step itself has remained sequential. DFlash, a framework from researchers at UC San Diego's Z Lab, changes that by replacing the autoregressive drafter with a block diffusion model that generates an entire candidate block in a single forward pass.

The results are notable: 6× lossless acceleration on Qwen3-8B, 2.5× improvement over the previous state-of-the-art EAGLE-3, and up to 15× throughput gains on NVIDIA Blackwell hardware at production concurrency levels. The framework is now integrated into SGLang and vLLM, making it accessible without application-level changes.

Why Speculative Decoding Still Had a Bottleneck

Speculative decoding works by having a lightweight draft model generate a sequence of candidate tokens, which the target model then verifies in a single parallel forward pass. If the target model accepts most of the draft tokens, you get significant speedups — the expensive target model runs less often.

The catch is that existing draft models like EAGLE-3 are themselves autoregressive. They generate tokens one at a time, so drafting γ tokens takes γ sequential steps. This creates a ceiling: the faster you want to draft, the more you're constrained by sequential computation. EAGLE-3 achieves roughly 2–3× speedups in practice, which is useful but leaves substantial GPU capacity underutilized.

Diffusion language models offer an alternative — they can generate tokens in parallel — but standalone diffusion LLMs have historically underperformed autoregressive models on quality, making them poor candidates for the verification step.

What DFlash Does Differently

DFlash's core insight is to use a diffusion model only for drafting, not for final generation. The target model remains a standard autoregressive LLM that handles verification. This lets DFlash capture the parallelism of diffusion generation while preserving the quality guarantees of autoregressive verification.

The drafting process works as follows:

Context extraction: The target model processes the input prompt and produces hidden states at multiple layers.
KV injection: These hidden states are projected and injected into the Key-Value cache of every layer in the draft model. This is the critical difference from earlier diffusion-based speculative decoding approaches, which only conditioned the drafter on the first layer's features. By injecting target context throughout the draft model's depth, DFlash maintains strong alignment between draft and target even as the draft model grows deeper and more expressive.
Parallel block drafting: The draft model fills in an entire block of masked token positions in a single forward pass, treating the problem as a joint denoising task rather than a sequential prediction.
Verification: The target model checks the proposed block. Accepted tokens are kept; the first rejected token triggers a new draft cycle.

Because the drafting cost is roughly constant regardless of block size, DFlash can use deeper draft models and larger block sizes without the linear latency penalty that constrains autoregressive drafters. A 5-layer DFlash model drafting 16 tokens runs faster than a single-layer EAGLE-3 model drafting 8 tokens.

Training the Draft Model

Training DFlash draft models involves a few design choices that matter for acceptance rates. The draft model shares token embeddings and the language model head with the target model, which keeps the output distribution aligned. During training, random block positions are sampled from the training data rather than always starting from the beginning of a sequence — this improves generalization to arbitrary context lengths.

Loss weighting uses exponential decay across positions within a block, prioritizing accuracy at earlier positions where errors compound. The intuition is that a wrong token early in a block will cause the entire remaining block to be rejected, so it's worth spending more training signal there.

Benchmark Results

On Qwen3-8B with greedy decoding, DFlash achieves:

6.08× speedup on code generation (HumanEval)
5.15× speedup on math (MATH-500)
5.62× speedup on chat (MT-Bench)

Compared to EAGLE-3 on the same tasks, DFlash is 1.4–1.8× faster. For reasoning models at temperature 1, the gains are even larger: 4.5× acceleration on AIME benchmarks.

At production scale on NVIDIA Blackwell (DGX B300), the NVIDIA engineering team reports up to 15× throughput improvement over standard autoregressive decoding for gpt-oss-120B at 500–600 tokens/sec per user interactivity targets. Even against EAGLE-3, DFlash delivers 1.5–2.6× higher throughput depending on task type, with coding and multilingual tasks showing the largest gains.

Integration with SGLang and vLLM

The LMSYS team's Spec V2 blog post describes how DFlash is now the default speculative decoding engine in SGLang. The integration adds an overlap scheduler that reduces host-device synchronization overhead by overlapping draft processing with KV cache allocation for the next batch. This alone adds roughly 33% throughput on top of DFlash's base gains — on Qwen3-8B, throughput goes from 11,400 to 15,300 tokens/second.

For vLLM users, DFlash integrates through the Speculators library. Switching from EAGLE-3 requires updating the checkpoint path and specifying the algorithm; no application code changes are needed. TensorRT-LLM support is also available for Blackwell and Hopper deployments.

Z Lab has released over 20 DFlash draft model checkpoints on Hugging Face covering Qwen, Llama, Gemma, and Kimi K2.6 model families. The original paper and project page include training code and quick-start examples for both SGLang and the Transformers library.

What This Means for Inference Infrastructure

Speculative decoding has been a useful but niche optimization — effective mainly when you have a good draft model and the right hardware setup. DFlash makes the case that the drafting step itself was the limiting factor, not the verification step.

The practical implication is that inference serving costs for large models can drop substantially without any change to model quality. For teams running LLMs at scale, the combination of DFlash with modern inference frameworks like SGLang or vLLM represents a meaningful reduction in GPU hours per token — particularly for coding and reasoning workloads where token acceptance rates are high.

The framework also points toward a broader pattern: diffusion models may be most useful not as standalone generators but as components within hybrid systems where their parallelism can be exploited without sacrificing the quality guarantees of autoregressive verification.