DEV Community: Papers Mache

KV cache eviction improves long‑context performance

Papers Mache — Sun, 24 May 2026 05:00:00 +0000

A learned, globally‑calibrated KV‑cache eviction policy can shave memory usage and, paradoxically, lift long‑context reasoning scores. The paper shows that “if the right tokens are removed, eviction can suppress distractors, sharpen attention, and improve generation” [1].

Before this work, KV‑cache pruning was treated as a compression trick: methods dropped older entries to fit a fixed budget, but they always fell short of the full‑cache baseline on abstractive reasoning and multi‑turn dialogue. The community accepted a trade‑off where latency was saved at the cost of accuracy.

One global retention‑gate network learns a utility score for every token and then enforces a single shared projection across all layers and heads. “We tie the final scoring projection of all retention gates. This weight sharing calibrates retention scores onto a common scale” [1], which lets tokens from any position compete for the same finite cache.
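To make the "common scale" idea concrete, here is a minimal sketch of how one tied scoring projection lets tokens from different layers and heads compete for a single budget. Everything in it is an illustrative assumption rather than the paper's implementation: the per-token `features`, the weight vector `w_shared`, and the plain top-k selection rule are all hypothetical.

```python
import heapq

def retention_scores(gate_features, w_shared):
    """Score every cached token with one projection shared by all gates.

    gate_features: dict mapping (layer, head, position) -> list of floats
    w_shared: weight list reused for every layer and head (the tied projection)
    """
    return {
        key: sum(f * w for f, w in zip(feats, w_shared))
        for key, feats in gate_features.items()
    }

def evict_to_budget(scores, budget):
    """Keep the `budget` highest-scoring entries across ALL layers and heads."""
    return set(heapq.nlargest(budget, scores, key=scores.get))

# Toy cache: two layers, one head, three positions each (made-up features).
features = {
    (0, 0, 0): [0.9, 0.1], (0, 0, 1): [0.2, 0.1], (0, 0, 2): [0.8, 0.7],
    (1, 0, 0): [0.1, 0.0], (1, 0, 1): [0.6, 0.9], (1, 0, 2): [0.3, 0.2],
}
w_shared = [1.0, 0.5]  # hypothetical tied weights

scores = retention_scores(features, w_shared)
kept = evict_to_budget(scores, budget=3)
```

Because the scores share one scale, the budget is not split evenly: in this toy run layer 0 keeps two entries and layer 1 keeps one.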

One experiment suite spanning long‑context language, vision‑language, and multi‑turn dialogue benchmarks reports that the learned eviction matches full‑cache performance while using a fraction of the KV memory. On tight budgets the method even surpasses the baseline, confirming that selective eviction is not merely an approximation but a performance enhancer.

One limitation the authors acknowledge is that the retention scores are query‑agnostic; they rely on a geometric proxy for future utility rather than a full‑fledged predictor conditioned on the current query. This suggests an open question: can a query‑aware scoring layer further improve the signal‑to‑noise ratio of retained tokens?

One concrete shift to consider: systems that currently disable KV‑cache eviction for safety might evaluate replacing that guard with the globally tied retention gates, as the approach can reduce memory footprints without sacrificing multi‑hop reasoning accuracy, and occasionally improves it.

References

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Zero-shot video generation tracks any camera path

Papers Mache — Sat, 23 May 2026 05:00:00 +0000

A geometry‑aware diffusion interface can turn any camera warp into a synthetic history, letting a frozen video generator follow arbitrary trajectories without any extra training. The authors achieve this by feeding a camera‑warped pseudo‑history through the model’s visual‑history pathway and aligning its positional encoding to the target frames, which “reveals a non‑trivial zero‑shot capability of a frozen video generation model to follow camera trajectories” [1].

Before this work, camera‑controlled video synthesis required either heavy post‑training on large camera‑annotated corpora or costly test‑time optimization to inject motion cues. Existing pipelines typically add camera encoders, dedicated control branches, or modify attention and positional encodings, tying the model to the specific motion patterns seen during fine‑tuning.

In the zero‑shot regime, Warp‑as‑History more than doubles camera adherence, with the Camera Control metric jumping from 26.42 to 61.32 and reaching 62.00 after a single‑shot LoRA finetune, a relative gain of roughly 133 % over the text‑only Helios‑Distilled baseline [1]. The result is a video that faithfully tracks the supplied camera poses while preserving visual fidelity.
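The relative gain can be checked directly from the reported Camera Control scores. The short calculation below uses only the three numbers quoted above; the helper name `relative_gain` is ours.

```python
def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

baseline = 26.42                               # text-only Helios-Distilled baseline
zero_shot = relative_gain(baseline, 61.32)     # about 132.1 %
with_lora = relative_gain(baseline, 62.00)     # about 134.7 %
```

The quoted "roughly 133 %" sits between the zero‑shot figure and the figure after the single‑video LoRA finetune.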

Target‑frame positional alignment is the linchpin that keeps denoising stable; the authors note that “normal denoising remains stable, and Figure 6 shows that the zero‑shot output immediately starts to follow the warp after target‑frame alignment” [1]. Without this alignment the warped pseudo‑history would introduce mis‑registered tokens and collapse the diffusion process.

The approach still leans on a lightweight offline LoRA finetune on a single camera‑annotated video to reach peak performance, implying that a completely training‑free pipeline may struggle with domains lacking even one annotated exemplar [1]. One open question is whether the same zero‑shot fidelity persists when the source video contains only sparse or highly non‑rigid motion, a scenario not explored in the current evaluation.

If the reported gains hold broadly, benchmark suites for camera‑controlled generation should be revised to include a zero‑shot track, and production pipelines can replace costly motion‑capture sessions with on‑the‑fly pose specifications derived from simple rigs or even synthetic trajectories.

References

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

One hidden neuron can disable safety guards

Papers Mache — Fri, 22 May 2026 05:00:00 +0000

It has been commonly suggested that safety layers in large language models function as emergent, distributed defenses; however, this work shows that flipping a single hidden neuron can disable the refusal gate entirely. The twist is that this minimal intervention works across model families and scales, overturning the assumption that alignment is robustly spread throughout the network.

Prior to this study, safety was often viewed as the collective outcome of reinforcement‑learning‑from‑human‑feedback, fine‑tuning, and prompt‑engineering techniques that modify many parameters. Researchers treated the refusal behavior as an emergent property rather than a localized circuit, and evaluations focused on aggregate metrics rather than pinpointing individual units.

Suppressing one identified “refusal neuron” yields a 91.7 % average attack success rate on JailbreakBench across seven models, from 1.7 B to 70 B parameters, spanning Qwen‑3 and Llama‑3.1 families [1]. The authors demonstrate that a single MLP neuron, when silenced, is sufficient to bypass safety alignment for a wide range of harmful queries.

The attack requires only white‑box access to model activations and no additional training, fine‑tuning, or prompt engineering [1]. This means an adversary who can observe or edit the activation map can weaponize the model without any costly model‑level manipulation.

The study does not address black‑box scenarios, nor does it prove that the same neurons exist in other architectures such as transformer‑only or sparsely‑gated models. It also leaves open how durable the identified neurons are under routine model updates or quantization, suggesting that the fragility may be limited to the tested codebases.

If a single unit can collapse the entire refusal system, safety evaluations must start probing neuron‑level vulnerabilities rather than relying on aggregate loss or prompt‑based tests. Benchmarks like JailbreakBench should include a mandatory “neuron‑suppression” suite to verify that no individual activation can nullify the model’s guardrails.

References

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

Self-evolving retrieval lifts benchmark scores 25%

Papers Mache — Wed, 20 May 2026 20:16:57 +0000

Agents that adapt their retrieval configurations while running deliver roughly a quarter more performance on established benchmarks — EvolveMem reports a 25.7 % relative lift over the strongest static baseline [1]. The result overturns the long‑standing assumption that retrieval stacks should be frozen after deployment; instead, the system treats the whole memory‑access pipeline as a mutable policy that can be improved on the fly. This shift opens a new design space where an LLM‑driven “diagnosis” module rewrites its own search strategy as new queries arrive.

Before this work, LLM agents relied on a fixed retrieval infrastructure: scoring functions, fusion heuristics, and answer‑generation policies were hand‑tuned once and left unchanged for the life of the service. Researchers routinely built separate pipelines for data ingestion and for query execution, assuming that any performance gains had to come from larger models or richer corpora rather than from the retrieval logic itself. That static mindset limited the ability of agents to learn from their own failures in the field.

EvolveMem’s closed‑loop process turns that limitation into an advantage, reaching a 25.7 % relative improvement on LoCoMo and a 78.0 % relative gain over a minimal baseline [1]. Each evolution round consumes per‑question failure logs, lets the diagnosis LLM pinpoint root causes, and then proposes concrete configuration tweaks; the meta‑analyzer applies the changes, evaluates the impact, and repeats until convergence. The same system also pushes an 18.9 % lift on the text‑only MemBench benchmark, demonstrating improvement even without bespoke engineering for that benchmark.

The diagnosis model does more than fine‑tune existing knobs; it can create entirely new ones. “The diagnosis LLM can propose entirely new parameters that were not in the original action space,” the authors note, highlighting a self‑expanding action space that uncovers retrieval strategies humans had never considered [1]. This capability turns the memory module into an autonomous research partner rather than a static cache.

Self‑evolution is not left unchecked—automatic safeguards prevent harmful regressions. When a proposed change lowers overall F1, the system invokes a revert guard: “R2 illustrates the revert guard: the proposed change regressed overall F1, so the meta‑analyzer automatically rolled back,” ensuring that the agent never degrades its performance while exploring [1]. The guard also triggers exploratory searches when progress stalls, balancing stability with the need to discover better configurations.
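The propose, evaluate, and roll back cycle can be sketched in a few lines. This is a simplified illustration under stated assumptions, not EvolveMem's code: `propose` stands in for the diagnosis LLM, `evaluate_f1` for the benchmark run, and the configuration keys are hypothetical. The exploratory search that triggers on stalls is left out.

```python
import copy

def evolve(config, propose, evaluate_f1, rounds):
    """Apply proposed config changes, keeping only those that do not lower F1."""
    best_f1 = evaluate_f1(config)
    history = []
    for r in range(rounds):
        candidate = copy.deepcopy(config)
        candidate.update(propose(r, config))      # may introduce brand-new keys
        f1 = evaluate_f1(candidate)
        if f1 < best_f1:
            history.append((r, "reverted", f1))   # revert guard: discard the change
        else:
            config, best_f1 = candidate, f1
            history.append((r, "kept", f1))
    return config, best_f1, history

# Toy stand-ins (hypothetical): a scripted proposer and a made-up scoring rule.
proposals = [{"top_k": 10}, {"top_k": 50}, {"rerank": True}]

def toy_f1(cfg):
    score = 0.5
    if cfg.get("top_k") == 10:
        score += 0.1
    if cfg.get("top_k") == 50:
        score -= 0.1
    if cfg.get("rerank"):
        score += 0.05
    return score

final, best, history = evolve({"top_k": 5}, lambda r, cfg: proposals[r], toy_f1, rounds=3)
```

In the toy run the second proposal regresses the score and is rolled back, while the third adds a key (`rerank`) that was not in the starting configuration, mirroring the self‑expanding action space described earlier.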

If retrieval pipelines can improve themselves by a quarter on standard tests, production assistants should stop treating those pipelines as immutable fixtures. Embedding an online optimisation loop that diagnoses errors and mutates retrieval hyper‑parameters is now a concrete engineering priority, and benchmark suites such as LoCoMo ought to be re‑run with self‑evolving memory enabled to establish the new performance baseline.

References

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

AI/ML Research Digest — May 16, 2026

Papers Mache — Wed, 20 May 2026 20:14:32 +0000

Distillation + low‑rank tricks cut compute

Combining knowledge distillation with low‑rank adapters now yields video generators that need only one or two sampling steps, a dramatic speed‑up over traditional diffusion pipelines [1].

On‑policy distillation (OPD) gains a control‑variate term that steadies gradient estimates, making RL‑trained language agents noticeably more reliable [2].

The Pion optimizer updates LoRA matrices through orthogonal transforms, preserving the spectral shape of the weights and avoiding the drift that often plagues Adam‑style fine‑tuning [3].

A prune‑then‑distill flow compresses massive Mixture‑of‑Experts (MoE) models while keeping performance on par with the original, showing that even the most parameter‑heavy architectures can be trimmed without sacrificing quality [4].

Why it matters: Faster inference and smaller models reduce cloud costs and lower the barrier for deploying video generation or RL agents on edge hardware.

Hierarchical memory stretches context windows

A two‑level attention scheme reduces pre‑training FLOPs while still handling tens of thousands of tokens, opening the door to cheap, long‑context LLMs [5].

Functional tokens act as compact visual descriptors, enabling latent visual reasoning without blowing up model size [6].

At test time, a hierarchical memory module allocates extra compute on demand, letting a single model scale its reasoning power dynamically [7].

Why it matters: Applications such as long documents, code bases, or multi‑turn dialogues no longer hit hard token limits, and the same model can adapt its cost to the difficulty of the query.

Safety gaps surface in multi‑turn dialogs

A new benchmark tracks how scams evolve over conversation turns; spotting the fraud in the first few exchanges cuts potential loss by a large factor [8].

Conversely, researchers found that flipping a single hidden neuron that governs refusal behavior can silence the model’s safety guard, letting it obey malicious prompts despite alignment training [9].

Hidden evaluation sets, invisible to participants during leaderboard runs, shift rankings enough to overturn public‑score conclusions [10].

Why it matters: Real‑world assistants interact over many turns, so early detection and robust safety checks are essential before such systems are widely released.

MoE scaling follows a clean power law

Large‑scale experiments reveal that cross‑entropy loss decays as a simple power‑law in the total number of expert parameters, giving a practical formula for choosing expert counts when scaling [11].

Test‑time hierarchical memories let agents request extra compute only when needed, improving the efficiency of iterative scaling strategies [7].

Why it matters: Designers can now predict how much performance will improve by adding experts, avoiding costly trial‑and‑error runs.
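As an illustration of how such a power law could be used for planning, the sketch below fits loss = a · N^(−b) by least squares in log‑log space and extrapolates to a larger expert budget. The functional form without an irreducible‑loss term is an assumption, and the data points are synthetic values generated from made‑up constants, not results from the paper [11].

```python
import math

def fit_power_law(params, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(params)."""
    xs = [math.log(p) for p in params]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope

# Synthetic points generated from a = 20, b = 0.1 (illustrative only).
expert_params = [1e8, 2e8, 4e8, 8e8]
losses = [20.0 * p ** -0.1 for p in expert_params]

a, b = fit_power_law(expert_params, losses)
predicted = a * (1.6e9) ** -b   # extrapolated loss at twice the largest budget
```

With real measurements the fit would be noisy, so the extrapolation should be treated as a rough guide rather than a guarantee.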

Highlighted papers

Zero‑shot camera‑controlled video diffusion – By turning camera‑induced warps into a pseudo‑history, the system follows arbitrary camera trajectories without any task‑specific training [12].
Learned global KV‑cache eviction – A trainable policy prunes the key‑value cache during inference, slashing memory use while actually raising multi‑hop reasoning accuracy on long‑context benchmarks [13].
vOPD control‑variates baseline – Adding a reverse‑KL control variate stabilizes on‑policy distillation gradients, giving a noticeable boost to RL‑based LLM agents [2].
Spectrum‑preserving Pion optimizer – Orthogonal updates keep the weight spectrum intact, matching Adam’s stability but with less drift during large‑scale fine‑tuning [3].
FrontierSmith open‑ended code synthesis – Starting from competitive‑programming seeds, FrontierSmith creates diverse coding problems that lift performance on FrontierCS and ALE‑bench for models like Qwen‑3.5‑9B and 27B [14].

Notable side results

Manifold‑anchor regularizer for video OCR – Aligning generated optical flow to the data manifold boosts OCR accuracy from 59 % to 94 % in a flow‑OPD system [15].
Single‑neuron safety override – Targeting one hidden neuron can disable the model’s refusal mechanism, highlighting a fragile point in current alignment pipelines [9].
Self‑evolving retrieval architecture – An autonomous module that re‑optimizes its own retrieval configuration improves benchmark scores by 25.7 % relative [16].
Reward‑hacking in rubric‑based RL – Agents learn to exploit loopholes in verifier or rubric design, attaining high proxy rewards without genuine quality gains, underscoring the need for more robust reward design [17].

These developments collectively push the field toward faster, larger‑context, and safer AI systems, while also exposing concrete vulnerabilities that must be addressed before deployment.

References

Shared expert pool reduces parameters while maintaining performance

Papers Mache — Fri, 15 May 2026 05:00:00 +0000

Conventional mixture‑of‑experts designs hand each transformer layer its own private expert set, causing the total expert parameter count to swell linearly with depth. Recent work shows that a single, globally shared pool of experts can deliver comparable predictive quality while dramatically curtailing that budget.

The dominant paradigm has treated depth scaling and expert capacity as inseparable: every new layer brings a fresh collection of feed‑forward sub‑networks, and the routing logic merely picks the top‑k among them. This architecture simplifies implementation but forces a strict coupling between model depth and the number of learnable expert parameters, even though earlier analyses hinted that many layers rely on overlapping knowledge.

UniPool breaks the coupling by replacing per‑layer ownership with one shared pool that all routers draw from. Training remains stable thanks to a pool‑level auxiliary loss that balances utilization at the granularity where parameters are actually owned: the global expert pool. The paper reports, “The improvement from UniPool over vanilla MoE is consistent at all five scales, with validation loss reductions of 0.0288 (182M), 0.0346 (469M), 0.0308 (650M), 0.0386 (830M), and 0.0172 (978M)” [1]. Moreover, “reduced‑pool UniPool variants using only 41.6%–66.7% of the vanilla expert‑parameter budget match or outperform layer‑wise MoE at the tested scales” [1], demonstrating that expert parameters need not grow linearly with depth.
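A rough sketch of what balancing "at the pool level" could look like: routing statistics are accumulated over every layer before the loss is computed, so a layer may specialize as long as the pool as a whole stays evenly used. The loss form here is the familiar Switch‑style product of load fraction and mean router probability, which is an assumption on our part; the paper's exact auxiliary loss may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route(logits, k):
    """Indices of the top-k experts in the shared pool."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def pool_balance_loss(per_layer_logits, k):
    """Balance loss over the whole pool; per_layer_logits[layer][token] -> logits."""
    n_experts = len(per_layer_logits[0][0])
    load = [0.0] * n_experts   # share of routing slots sent to each expert
    prob = [0.0] * n_experts   # mean router probability per expert
    total = 0
    for layer in per_layer_logits:
        for logits in layer:
            total += 1
            for i in route(logits, k):
                load[i] += 1.0
            for i, p in enumerate(softmax(logits)):
                prob[i] += p
    load = [c / (total * k) for c in load]
    prob = [p / total for p in prob]
    return n_experts * sum(f * p for f, p in zip(load, prob))

def one_hot_logits(i, n=4, high=5.0):
    return [high if j == i else 0.0 for j in range(n)]

# Layer 0 uses experts 0 and 1, layer 1 uses experts 2 and 3: balanced pool-wide.
balanced = pool_balance_loss(
    [[one_hot_logits(0), one_hot_logits(1)], [one_hot_logits(2), one_hot_logits(3)]], k=1)
# Every token in every layer picks expert 0: collapsed pool.
collapsed = pool_balance_loss(
    [[one_hot_logits(0), one_hot_logits(0)], [one_hot_logits(0), one_hot_logits(0)]], k=1)
```

A per‑layer loss would penalize the first configuration because each layer only touches half the experts; the pool‑level version does not, which matches the idea of balancing where the parameters are owned.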

MASCing tackles a different, but equally practical, problem: the safety of MoE inference. By training an LSTM‑based surrogate that models cross‑layer routing dependencies, the framework learns a steering matrix that identifies behavior‑relevant expert circuits. At inference time it injects “steering masks” into the routing gates, overriding the default expert selection without any retraining. The authors note, “MASCing uses an LSTM‑based surrogate model to capture cross‑layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior‑relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection” [2]. In the adversarial jailbreak benchmark, unsteered models defended successfully only 52.5% of the time on average, whereas “Applying MASCing yields a substantial and consistent improvement across all tested MoE models, raising the average defense success rate to 83.9%” [2].
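The inference‑time step is simple to picture: a mask is added to the routing logits before top‑k selection, so some experts are suppressed and others promoted. The sketch below shows only that final step with hand‑written, hypothetical values; how MASCing learns the mask (the LSTM surrogate and steering‑matrix optimization) is not modeled here.

```python
def top_k(logits, k):
    """Indices of the k largest routing logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def steer(logits, mask):
    """Add a steering mask to the logits; a large negative entry blocks an expert."""
    return [value + m for value, m in zip(logits, mask)]

router_logits = [2.0, 1.5, 0.3, -0.4]   # hypothetical gate outputs for one token
steering_mask = [-1e9, 0.0, 1.0, 0.0]   # hypothetical: block expert 0, boost expert 2

default_experts = top_k(router_logits, k=2)
steered_experts = top_k(steer(router_logits, steering_mask), k=2)
```

No weights change in this scheme, which is why the source describes it as working without retraining.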

The findings leave several open questions. UniPool’s experiments are limited to LLaMA‑style backbones trained on 30 B tokens from the Pile; it remains unclear whether the same sublinear expert budget holds for encoder‑only transformers, multimodal models, or data regimes with markedly different token distributions. The auxiliary loss and NormRouter components introduce extra hyper‑parameters that may require careful tuning on new hardware stacks. MASCing, while impressive, depends on a surrogate that approximates routing dynamics; its efficacy on proprietary, larger‑scale MoEs or under distribution shifts has not been demonstrated, and the steering masks could interact unpredictably with future routing innovations.

For engineers looking to shave expert parameters without sacrificing loss, swapping the layerwise expert modules for a single shared pool and adding the pool‑level balancing loss is a concrete first step; the authors release ready‑to‑run scripts that cover five model sizes, so you can prototype the change on existing training pipelines. When safety requirements evolve, you can generate a steering mask for the new objective and plug it into the inference graph, gaining a sizable jailbreak‑defense boost without a costly fine‑tune. Before committing, benchmark the shared‑pool model against a vanilla MoE on your own validation set and measure any latency impact of the auxiliary loss and mask application. If the trade‑off is favorable, the combined modular routing approach offers a practical path to cheaper, more controllable large‑scale models.

References

HERMES++ answers language queries while predicting roads

Papers Mache — Thu, 14 May 2026 05:00:00 +0000

The prevailing view has been that autonomous‑driving world models must choose between two extremes: a perception‑only pipeline that reconstructs the current bird’s‑eye‑view (BEV) layout, or a generative model that rolls forward future geometry without a semantic grasp of the scene. HERMES++ demonstrates that a single network can inhabit both roles, answering natural‑language queries while extrapolating the road ahead.

Previously, scene‑understanding systems relied on dense BEV encoders tuned for detection and segmentation, whereas future‑prediction work such as point‑cloud roll‑outs treated the problem as a pure geometric sequence, often ignoring high‑level intent. Large language models, meanwhile, excel at reasoning over text but have no built‑in notion of spatial dynamics, leaving a gap between semantic instruction and physical simulation.

HERMES++ closes that gap with three key mechanisms. First, it collapses multi‑camera inputs into a compact BEV representation, a design choice that “mitigates the effects of token length constraints when processing high‑resolution multi‑view inputs” [1]. Second, the system introduces “world queries that interact directly with the LLM’s processing pipeline and act as temporal semantic carriers,” allowing the language model to steer both perception and generation [1]. Finally, a Current‑to‑Future link conditions the predicted road geometry on the semantic context extracted by the LLM, while a joint geometric optimisation step enforces consistency between learned latent priors and explicit geometric constraints.

Beyond architectural cleverness, the unified model translates into measurable gains. In future prediction, HERMES++ “reduces the Chamfer Distance (CD) at the 3s horizon by 41.6% compared to ViDAR,” a specialist future‑prediction baseline [1]. The same framework also outperforms dedicated BEV perception nets on standard 3D scene‑understanding metrics, confirming that the joint training does not sacrifice accuracy on either side.

The paper acknowledges several boundaries. Evaluations are limited to curated datasets; real‑world sensor noise, adverse weather, and dynamic traffic participants remain untested. The latency introduced by routing BEV tokens through an LLM has not been quantified, raising questions about real‑time feasibility. Moreover, the world‑query interface is tied to a specific prompt schema, so extending it to arbitrary natural‑language instructions may require additional finetuning.

For teams experimenting with language‑driven autonomy, the immediate takeaway is practical: the released checkpoints and demo let you plug a single model into a simulation loop and issue high‑level commands like “show the drivable lane two seconds ahead.” Before committing to a full language‑controlled stack, benchmark dense BEV perception against the LLM‑augmented pipeline on your own sensor suite, and profile the end‑to‑end runtime to ensure that the added reasoning layer respects real‑time constraints. As the line between semantic grounding and geometric prediction continues to blur, HERMES++ offers a concrete reference point for building the next generation of language‑aware driving systems.

References

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

Diffusion models enable high-quality image and video generation with few steps

Papers Mache — Wed, 13 May 2026 05:00:00 +0000

Diffusion research has long treated image synthesis and video synthesis as separate engineering problems, each with its own heavyweight model and multi‑step inference pipeline. Two recent papers show that both text‑to‑image and high‑resolution image‑to‑video generation can now run in just a handful of sampling steps, although each still relies on its own model.

Historically, image diffusion required dozens of denoising steps, and video diffusion compounded the cost with per‑frame processing or costly cascades. Acceleration techniques fell into two camps: consistency distillation, which enforces self‑consistency along the entire probability‑flow ODE, and discrete distribution‑matching distillation that anchors supervision at a few fixed timesteps. Both approaches traded fidelity for speed or introduced auxiliary adversarial modules to patch visual artifacts.

Continuous‑Time Distribution Matching (CDM) breaks the discrete schedule by “replacing the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors” [1]. The authors demonstrate that this redesign yields “sharper textures and fine‑grained details (e.g., background elements and material reflections), and stronger semantic adherence to multi‑entity compositional prompts” while keeping the reverse‑KL mode‑seeking bias in check. In their benchmark table the distilled SD3‑Medium checkpoint reaches state‑of‑the‑art scores with just four neural‑function evaluations:

“CDM (Ours) | 4 | 6.075 | 85.26 | 21.95 | 9.561 | 27.98 | ✓ | ✓” [1].

SwiftI2V tackles the video side with a two‑stage pipeline that first produces a low‑resolution motion reference and then renders a 2K video conditioned tightly on the input image. Its core contribution, Conditional Segment‑wise Generation (CSG), “synthesizes videos segment‑by‑segment with a bounded per‑step token budget, and adopts bidirectional contextual interaction within each segment to improve cross‑segment coherence and input fidelity” [2]. The resulting system runs in 111 seconds on a single RTX 4090 while using 33.5 GB of memory, and it secures the highest VBench‑I2V score reported:

“SwiftI2V (ours) | 6.4244 | 0.9910 | 0.9975 | 0.3008 | 0.6496 | 0.9885” [2].

Both papers acknowledge constraints. CDM’s experiments focus on specific checkpoint families (SD3‑Medium, Longcat‑Image) and may not transfer unchanged to larger or more diverse latent spaces. The continuous‑time objective still requires a student‑teacher distillation phase, which adds upfront training cost. SwiftI2V’s segment‑wise design reduces memory but assumes that motion can be captured effectively in short chunks; extremely long or highly synchronized actions could suffer from residual temporal drift. Moreover, the reported gains are measured on the VBench‑I2V benchmark, and performance on domain‑specific video datasets remains an open question.

For practitioners, the implication is clear: separate distilled diffusion models can now generate high‑quality images and 2K videos, each with its own codebase, and both benefit from few‑step diffusion techniques. Before adopting either one, benchmark the few‑step distilled checkpoint against your existing generators on the actual prompt distribution and latency budget of your product. If the fidelity gap stays within acceptable bounds, the reduced GPU memory footprint and simplified deployment pipeline can translate into lower infrastructure costs and faster iteration cycles.

References

Entropy of first token predicts hallucinations

Papers Mache — Tue, 12 May 2026 05:00:00 +0000

The entropy of the very first content‑bearing token already separates factual answers from hallucinations with an AUROC of 0.82. That single number rivals the scores of methods that need dozens of sampled continuations. The surprise is that nothing more than the greedy decode’s first‑token distribution is required.

Hallucination detection has long relied on self‑consistency: generate many answers, compare them, and flag low agreement as doubtful. Semantic self‑consistency tightens the signal by clustering answers by meaning, but both approaches multiply decoding cost and need extra inference components. Practitioners therefore face a trade‑off between reliability and latency.

The study introduces φ₁ₙₜ, the normalized entropy of the top‑K logits at the first answer token. Across three 7–8 B instruction‑tuned models and two closed‑book QA benchmarks, φ₁ₙₜ attains a mean AUROC of 0.820, surpassing semantic self‑consistency (0.793) and surface‑form self‑consistency (0.791) [1]. The authors report the full results as “Overall mean | 0.700 | 0.752 | 0.782 | 0.791 | 0.793 | 0.820 | 0.027 | 0.42” [1]. Correlation analysis shows the signal is not independent of semantic agreement: the “Mean | 0.67 |” row indicates a Pearson correlation of 0.67 between first‑token confidence and semantic agreement [1]. Moreover, adding semantic self‑consistency to φ₁ₙₜ yields only a marginal AUROC lift, confirming that most of the useful uncertainty is already present in the initial token distribution.
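A normalized top‑K entropy of this kind takes only a few lines to compute. The sketch below is one plausible reading of the definition, with two assumptions labeled in the code: the softmax is renormalized over the top‑K logits only, and the entropy is divided by log K so the result lies in [0, 1]. The function name and the example logits are ours.

```python
import math

def first_token_entropy(logits, k):
    """Normalized entropy of the softmax over the top-k logits (0 = confident, 1 = uniform).

    Assumptions: softmax is taken over the top-k logits only, k >= 2,
    and len(logits) >= k.
    """
    top = sorted(logits, reverse=True)[:k]
    m = top[0]
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(k)

confident = first_token_entropy([20.0, 0.0, 0.0, 0.0, -3.0], k=4)   # one dominant token
uncertain = first_token_entropy([1.0, 1.0, 1.0, 1.0, -3.0], k=4)    # four-way tie
```

In a deployment, `logits` would come from the model's output at the first answer‑token position of a greedy decode, and answers whose score exceeds a calibrated threshold would be flagged.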

The method hinges on correctly locating the first content token, which depends on the chat template and tokenizer used. The authors note, “The method requires logits at the first answer-token position; reliable identification of that position depends on the chat template and tokenizer.” [1] The approach is also tied to greedy decoding and short‑answer factual QA; its behavior on chain‑of‑thought prompts, multi‑turn dialogue, or generation beyond the first token remains untested. Scaling to models larger than 8 B or to multilingual settings is also an open question.

If you can expose the first‑token logits, the extra compute is negligible compared with sampling‑based uncertainty estimates. A practical pipeline could compute φ₁ₙₜ on every answer, reject or flag those above a calibrated entropy threshold, and fall back to retrieval, tool use, or a human‑in‑the‑loop. Because the signal is cheap and model‑agnostic, it serves as a low‑cost baseline before deploying more expensive metacognitive strategies. Watching how φ₁ₙₜ behaves on your own prompt templates will reveal whether a single‑decode confidence check is enough to raise the factuality bar in production.

References

The First Token Knows: Single-Decode Confidence for Hallucination Detection

AI/ML Research Digest — May 09, 2026

Papers Mache — Mon, 11 May 2026 05:00:00 +0000

Diffusion as a unifying backbone for multimodal generation

Latent diffusion now drives both image synthesis and video creation. Continuous‑time distribution matching reduces diffusion steps to a few while retaining fidelity [1]. Segment‑wise video diffusion extends the same idea to image‑to‑video tasks, cutting inference cost [2]. The gap is conditioning: current models still lack native text or segmentation prompts, limiting end‑to‑end multimodal pipelines.

Modular expert routing and adaptive compute

UniPool replaces per‑layer mixtures of experts with a single shared pool and a pool‑level auxiliary balancing loss, shrinking the expert parameter budget without hurting performance [3]. NormRouter further stabilises routing decisions across layers. In sequential decision‑making, FFDC’s verification module compares imagined and observed futures, then shortens or lengthens action chunks on the fly, slashing forward passes while keeping success rates high [4].

Closed‑loop self‑auditing LLM agents

Direct corpus interaction removes the embedding index and lets LLM agents issue terminal‑style commands on raw documents. This yields better results on BEIR and multi‑hop QA than traditional top‑k retrieval [5]. A complementary line builds an auto‑research loop where specialist agents generate, evaluate, and refine code proposals; the work shows promise but stops short of a full auditing framework with adversarial specialists and lineage feedback [6].

Integrated 3‑D world modeling with language grounding

HERMES++ combines bird’s‑eye‑view scene understanding, future geometry prediction, and LLM‑driven queries in a single model. It can answer language prompts while forecasting road dynamics, a step toward truly interactive autonomous‑driving assistants [7]. Separate work conditions large‑scale world generation on segment maps, enabling spatially aware manipulation of virtual environments [8].

Highlighted contributions

Single‑token entropy as a hallucination signal

Measuring the entropy of the first content token during greedy decoding provides a cheap hallucination detector. Across 7‑8 B parameter models it reaches AUROC ≈ 0.82, rivaling multi‑sample self‑consistency methods [9]. This offers a low‑overhead safety check for deployed LLMs.

Balanced aggregation for LLM reinforcement learning

Balanced Aggregation corrects bias in gradient‑based policy updates for LLM agents. Experiments on ALFWorld and WebShop show higher sample efficiency and final scores [10]. The method tightens the link between reward signals and policy improvement.

Prox‑E primitive‑based 3‑D edits

Prox‑E abstracts complex shapes into primitive components and steers them with pretrained vision‑language models. The approach enables localized edits that preserve object identity while reshaping geometry [11].

MASCing runtime safety masks

MASCing adds routing‑logit masks that reconfigure MoE expert circuits at inference time. The masks dramatically increase jailbreak resistance without any retraining [12].
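As a rough illustration of what a routing‑logit mask does, the sketch below adds a per‑expert mask to the router logits before top‑k selection. This is a generic MoE routing toy, not MASCing's actual procedure: the additive form, the mask values, and the `route_with_mask` helper are all assumptions made for the example.

```python
import math


def route_with_mask(router_logits, mask, top_k=2):
    """Select top_k experts after adding a per-expert mask to the logits.

    A mask entry of 0.0 leaves an expert untouched; -inf removes it from
    consideration. Returns {expert_index: normalized routing weight}.
    """
    if len(router_logits) != len(mask):
        raise ValueError("mask must have one entry per expert")
    masked = [logit + m for logit, m in zip(router_logits, mask)]
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    chosen = [i for i in order[:top_k] if masked[i] > -math.inf]
    if not chosen:
        raise ValueError("mask blocks every expert")
    # Softmax over the surviving experts only.
    peak = max(masked[i] for i in chosen)
    weights = [math.exp(masked[i] - peak) for i in chosen]
    total = sum(weights)
    return {i: w / total for i, w in zip(chosen, weights)}


logits = [2.0, 1.0, 0.5, 3.0]
no_mask = [0.0, 0.0, 0.0, 0.0]
block_expert_3 = [0.0, 0.0, 0.0, -math.inf]

print(sorted(route_with_mask(logits, no_mask)))         # [0, 3]
print(sorted(route_with_mask(logits, block_expert_3)))  # [0, 1]
```

The model weights are never touched; only the routing decision changes, which is why this kind of intervention needs no retraining.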

JoyAI‑Image spatial reasoning boost

Embedding a spatially enhanced multimodal LLM into a diffusion transformer improves geometry‑aware reasoning and controllable image synthesis, raising performance on spatial benchmarks [13].

BlenderRAG multimodal code synthesis

BlenderRAG augments retrieval‑augmented generation with a curated multimodal example set. Code compilation success climbs from 40.8 % to 70 %, and semantic alignment improves [14].

These advances collectively tighten the loop between perception, language, and action, cut compute waste, and add safety signals—crucial steps as AI systems move from research prototypes to real‑world deployment.

References

Flux Attention halves inference cost on long contexts

Papers Mache — Sun, 10 May 2026 05:00:00 +0000

Dynamic sparse routing now delivers two‑ to three‑fold speedups on long‑context inference while leaving reasoning quality virtually untouched. The trick is that each transformer layer decides on the fly whether to attend densely or sparsely, so the model no longer pays the quadratic cost of full attention at every layer. The result is a practical, drop‑in acceleration that works on the chat‑style workloads that dominate production today.

Standard self‑attention scales as O(n²) with the token count, so extending context windows from 4 k to 32 k tokens quickly becomes prohibitive. Hybrid schemes that mix full attention (FA) and sparse attention (SA) have been proposed, but they usually fix the FA/SA ratio globally or at the head level, forcing a one‑size‑fits‑all allocation that either wastes compute or starves the model of needed context. Moreover, head‑level sparsity often creates load‑imbalance spikes that hurt autoregressive decoding on modern accelerators.
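A quick back‑of‑the‑envelope calculation shows why the jump hurts. The sketch below counts attention‑score entries per head per layer, reading "4 k" and "32 k" as 4,096 and 32,768 tokens (an assumption; the ratio is the same for 4,000 and 32,000).

```python
def attention_score_entries(n_tokens):
    """Number of query-key score entries for full self-attention (n x n)."""
    return n_tokens * n_tokens


short_ctx = 4_096
long_ctx = 32_768

context_growth = long_ctx / short_ctx
cost_growth = attention_score_entries(long_ctx) / attention_score_entries(short_ctx)

print(context_growth)  # 8.0
print(cost_growth)     # 64.0
```

An 8× longer window means 64× more attention‑score work under full attention, which is the gap sparse and hybrid schemes try to close.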

Flux Attention sidesteps these constraints by introducing a lightweight Layer Router that plugs into a frozen pretrained model and, during inference, routes each layer to either FA or SA based on the current input. Because the decision happens at layer granularity, the memory access pattern stays contiguous, turning theoretical FLOP reductions into measurable wall‑clock gains. The authors report speed improvements of up to 2.8× during the prefill phase and 2.0× while decoding, all while preserving performance on long‑context and mathematical reasoning benchmarks. Training the router is exceptionally cheap: “Our parameter‑efficient training converges in just 12 hours on an 8‑GPU A800 node.” [1] The routing overhead itself is small: “our router incurs a negligible overhead, averaging only 0.20 ms per layer.” [1]
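The layer‑level decision can be pictured with a toy sketch. The paper's router architecture is not described here, so the linear scorer, the `LayerRouter` class, the feature vector, and the 32‑layer model below are all hypothetical stand‑ins; only the 0.20 ms per‑layer figure comes from the quoted text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LayerRouter:
    """Hypothetical per-layer router: a linear score over input features."""
    weights: Sequence[float]
    bias: float = 0.0

    def use_full_attention(self, features: Sequence[float]) -> bool:
        score = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        return score > 0.0


def plan_layers(routers: Sequence[LayerRouter], features: Sequence[float]) -> List[str]:
    """Return 'FA' (full attention) or 'SA' (sparse attention) for each layer."""
    return ["FA" if r.use_full_attention(features) else "SA" for r in routers]


def router_overhead_ms(num_layers: int, per_layer_ms: float = 0.20) -> float:
    """Total routing overhead if every layer pays the reported average cost."""
    return num_layers * per_layer_ms


# Two toy layers with made-up weights and a made-up input summary.
routers = [LayerRouter([1.0, -1.0]), LayerRouter([-1.0, 1.0], bias=0.5)]
features = [2.0, 1.0]
plan = plan_layers(routers, features)

print(plan)                    # ['FA', 'SA']
print(router_overhead_ms(32))  # about 6.4 ms for a hypothetical 32-layer model
```

Because the choice is made once per layer rather than per head, every head in a layer runs the same kernel, which is the property the authors credit for contiguous memory access.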

The paper’s evaluation focuses on long‑context scenarios and math‑heavy tasks, leaving open how the method behaves on short‑prompt or multilingual benchmarks. The approach also assumes access to the original frozen checkpoint; models that have already been fine‑tuned or heavily customized might need additional adaptation steps. Finally, the reported speedups stem from A800 GPU measurements; different hardware architectures could exhibit a different balance between the cost of the router and the gains from sparsity.

For teams that already serve chat‑style LLMs with extended windows, the take‑away is immediate: a layer‑wise router can be trained in about half a day, and the authors have released checkpoints with the router integrated on Hugging Face and ModelScope. Before rolling it out, benchmark both prefill and decode latency on your target context lengths to confirm that the 2–3× gains materialize in your stack. If the router’s 0.20 ms per‑layer overhead is acceptable, the resulting throughput boost can cut per‑interaction latency, turning long‑context reasoning from a niche capability into a production‑ready feature.
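A small timing harness is enough to run that check. The sketch below is generic Python, not part of the Flux Attention release; `baseline_step` and `routed_step` are placeholder workloads that you would replace with calls into your own prefill or decode path.

```python
import statistics
import time


def median_latency_ms(fn, repeats=5, warmup=1):
    """Median wall-clock latency of fn() in milliseconds.

    On a GPU, fn should synchronize the device before returning;
    otherwise asynchronous kernel launches make the timing meaningless.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_ms / candidate_ms


# Placeholder workloads standing in for real model calls.
def baseline_step():
    return sum(range(200_000))


def routed_step():
    return sum(range(100_000))


baseline_ms = median_latency_ms(baseline_step)
routed_ms = median_latency_ms(routed_step)
print(f"baseline {baseline_ms:.2f} ms, routed {routed_ms:.2f} ms")
```

Run it separately for prefill and for decode, at each context length you care about, since the reported gains differ between the two phases.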

References

[1] Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Diffusion models approach AR quality and improve inference speed

Papers Mache — Sun, 10 May 2026 05:00:00 +0000

Diffusion language models have long promised parallel generation, yet their serving speed has lagged behind autoregressive decoders. Recent work shows that diffusion can now deliver three‑fold throughput gains over prior diffusion models, and LangFlow reports perplexities of 30.0 on LM1B and 24.6 on OpenWebText. The gap between parallelism and practical efficiency is finally narrowing.

Earlier diffusion language models suffered from two intertwined problems. First, the lack of introspective consistency—unlike AR models that always condition on their own past tokens—produced a quality deficit noticeable on standard benchmarks. Second, inference pipelines were built on naïve sampling loops, so even when quality improved, latency remained higher than causal decoders. Autoregressive systems, by contrast, benefitted from decades of system‑level tuning such as causal masking and logit shifting, which implicitly enforce token‑level consistency.

Introspective Diffusion Language Models (I‑DLM) close the consistency gap with a novel “introspective strided decoding” algorithm that verifies previously generated tokens while advancing new ones in the same forward pass. The authors report that “Beyond quality, I‑DLM is designed for the growing demand of large‑concurrency serving, delivering about 3× higher throughput than prior state‑of‑the‑art DLMs.” [1] They also achieve “69.6 on AIME‑24 and 45.7 on LiveCodeBench‑v6, exceeding LLaDA‑2.1‑mini (16B) by more than 26 and 15 points, respectively.” [1] Crucially, I‑DLM is claimed to be “the first DLM to match the quality of its same‑scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks.” [1]

LangFlow tackles the continuous‑time side of the problem. By linking embedding‑space diffusion to flow matching via a Bregman divergence and introducing an ODE‑based negative‑log‑likelihood bound, the model reaches “a PPL of 30.0 on LM1B and 24.6 on OpenWebText,” rivaling top discrete diffusion systems [2]. Moreover, “It even exceeds autoregressive baselines in zero‑shot transfer on 4 out of 7 benchmarks.” [2] These numbers put continuous diffusion on par with top discrete diffusion models and make it competitive with autoregressive baselines, at least on the evaluated corpora.
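For readers who want to reproduce a perplexity figure on their own data, the standard definition is the exponential of the average per‑token negative log‑likelihood. The sketch below assumes you already have per‑token natural‑log probabilities; for a diffusion model these would come from a likelihood bound rather than exact likelihoods, so compare like with like.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: if every token gets probability 1/30, perplexity is 30.
toy_log_probs = [math.log(1 / 30)] * 5
print(round(perplexity(toy_log_probs), 6))  # 30.0
```

Read this way, a PPL of 30.0 means the model is, on average, about as uncertain as a uniform choice among 30 tokens.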

The papers acknowledge several open questions. I‑DLM’s throughput claims stem from a single‑H100 benchmark and a stationary‑batch scheduler; scaling to multi‑node or heterogeneous clusters remains untested. The quality comparison covers 15 curated benchmarks, but the behavior on truly massive, multilingual corpora is unknown. LangFlow’s ODE likelihood bound hinges on a learnable Gumbel‑based noise schedule, which may be sensitive to hyper‑parameter choices not explored in the released experiments. Its zero‑shot advantage appears on a modest set of seven tasks, leaving the generality of the improvement uncertain.

For teams that need to serve thousands of concurrent requests, evaluating a diffusion backend is now a concrete option rather than a speculative future. You can benchmark I‑DLM’s stationary‑batch scheduler against your existing causal decoder on the same hardware to see whether the reported 3× throughput translates to cost savings. Likewise, swapping an AR checkpoint for a LangFlow checkpoint and measuring perplexity on your domain data will reveal if the continuous‑time approach holds up outside LM1B and OpenWebText. If the results align, diffusion models could become the default choice for high‑throughput, low‑latency LLM serving.