<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Papers Mache</title>
    <description>The latest articles on DEV Community by Papers Mache (@olaughter).</description>
    <link>https://dev.to/olaughter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907566%2Fa47c580b-0e36-4706-887e-97e33498a037.png</url>
      <title>DEV Community: Papers Mache</title>
      <link>https://dev.to/olaughter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/olaughter"/>
    <language>en</language>
    <item>
      <title>Optimal Transport Converts Dense Layers to Sparse Experts</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Mon, 15 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/optimal-transport-converts-dense-layers-to-sparse-experts-2c60</link>
      <guid>https://dev.to/olaughter/optimal-transport-converts-dense-layers-to-sparse-experts-2c60</guid>
      <description>&lt;p&gt;Differentiable optimal transport rewrites a dense feed‑forward layer into a balanced mixture‑of‑experts without any hand‑crafted routing or expert sizing. The paper treats neuron assignment as a transport problem and solves it with Sinkhorn‑Knopp iterations, so the conversion is fully differentiable and end‑to‑end trainable.  &lt;/p&gt;

&lt;p&gt;Before DOT‑MoE, turning a pretrained dense model into a sparse expert system required heuristic clustering of neurons or random splits, and training MoEs from scratch was notoriously unstable. Those approaches offered no principled way to guarantee expert capacity or to jointly learn token routing.  &lt;/p&gt;

&lt;p&gt;The paper reports DOT‑MoE “retaining 90% of the original dense model’s performance while reducing active parameters by 50%” across multiple architectures and benchmarks. This is the most compelling evidence that the transport‑based refactor preserves most of the quality with half the active parameters &lt;a href="https://arxiv.org/abs/2606.01666" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;In addition, “DOT‑MoE achieves the lowest perplexity (7.99) among all existing methods, outperforming the state‑of‑the‑art DISP‑LLM (9.84) by a substantial margin,” demonstrating that the OT formulation does more than match pruning baselines—it actually improves predictive fidelity &lt;a href="https://arxiv.org/abs/2606.01666" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;The method still hinges on differentiable Sinkhorn iterations, which add overhead to the conversion pipeline and may limit scalability to extremely large models. Moreover, the evaluation focuses on feed‑forward networks; extending the transport‑based decomposition to attention heads or other architectural components remains unexplored.  &lt;/p&gt;

&lt;p&gt;If the reported gains hold broadly, the default workflow for model compression can shift from ad‑hoc expert design to a systematic DOT‑MoE conversion, letting engineers halve active parameters while staying within a 10% performance envelope. This eliminates a major source of manual tuning and opens a scalable path to sparse inference for large‑scale pretrained models.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.01666" rel="noopener noreferrer"&gt;DOT-MoE: Differentiable Optimal Transport for MoEfication&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>AI/ML Research Digest — Jun 13, 2026</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Mon, 15 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/aiml-research-digest-jun-13-2026-5d76</link>
      <guid>https://dev.to/olaughter/aiml-research-digest-jun-13-2026-5d76</guid>
      <description>&lt;p&gt;&lt;strong&gt;Infrastructure and inference optimization for scale&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Sparse‑attention mechanisms cut the quadratic cost of self‑attention, making longer contexts feasible &lt;a href="https://arxiv.org/abs/2606.13392" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Intra‑model routing lets a decoder run speculative steps ahead of the true sequence, reducing latency without hurting quality &lt;a href="https://arxiv.org/abs/2606.12243" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
PACI keeps weight updates local, so pipeline stages never wait on each other; this removes bubbles and yields up to 1.69× faster time‑to‑accuracy at unchanged memory use &lt;a href="https://arxiv.org/abs/2606.07881" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Together these tricks shrink compute budgets and open the door to real‑time LLM services at larger scales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic reasoning and environment interaction&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
SpatialClaw replaces fixed API calls with a persistent Python kernel that VLMs can query repeatedly, enabling iterative construction of geometric primitives. The change lifts performance on 3D/4D reasoning benchmarks dramatically &lt;a href="https://arxiv.org/abs/2606.13673" rel="noopener noreferrer"&gt;[4]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Dynamic benchmarks now simulate evolving software stacks and social settings, forcing agents to plan over time rather than react to a single prompt &lt;a href="https://arxiv.org/abs/2606.13681" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
These moves push agents from static question‑answering toward genuine tool use and continual decision making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stabilizing RL and distillation for reasoning&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Replacing hard gradient clipping with smooth divergence regularization stabilizes policy updates, leading to higher success rates in reasoning‑heavy RL tasks &lt;a href="https://arxiv.org/abs/2606.09821" rel="noopener noreferrer"&gt;[6]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Recursive composition of verifiable environments lets distilled models inherit generalization abilities from deeper hierarchies, scaling reasoning performance without extra data &lt;a href="https://arxiv.org/abs/2606.12373" rel="noopener noreferrer"&gt;[7]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SpatialClaw’s stateful VLM interaction&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The system embeds a live Python interpreter inside a vision‑language model, so the model can call, modify, and re‑call code as part of a single inference pass. This stateful loop reduces error propagation and yields large gains on spatial reasoning suites that require multi‑step geometry manipulation &lt;a href="https://arxiv.org/abs/2606.13673" rel="noopener noreferrer"&gt;[4]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MoVerse real‑time video generation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
MoVerse expands a 360° panorama into a continuous scene using a persistent 3D Gaussian scaffold. The scaffold reuses geometry across frames, allowing the model to synthesize video at 8 FPS on a consumer GPU—a speed previously limited to offline pipelines &lt;a href="https://arxiv.org/abs/2606.13376" rel="noopener noreferrer"&gt;[8]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PACI pipeline optimization&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By accumulating gradients locally and enforcing a bound on weight inconsistency, PACI eliminates idle time between pipeline stages. Experiments show up to 1.69× faster convergence to target accuracy while keeping the memory footprint identical to a standard pipeline &lt;a href="https://arxiv.org/abs/2606.07881" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical failure in audio editing accuracy&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The MMAE benchmark measures exact‑match edits on complex audio tasks. Current systems achieve less than 5 % exact match, exposing a severe gap between research claims and practical audio manipulation capability &lt;a href="https://arxiv.org/abs/2606.07229" rel="noopener noreferrer"&gt;[9]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Denoising step reduction in world models&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Lip Forcing collapses the diffusion process to two denoising steps, raising inference speed to 31 FPS and preserving visual fidelity &lt;a href="https://arxiv.org/abs/2606.11180" rel="noopener noreferrer"&gt;[10]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Next Forcing improves training dynamics with multi‑chunk predictions; it speeds inference but does not depend on the two‑step schedule, offering a separate path to efficiency &lt;a href="https://arxiv.org/abs/2606.11187" rel="noopener noreferrer"&gt;[11]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness design impact on SWE‑Bench&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Evaluations with Claw‑SWE‑Bench show that the structure of the agent harness—how tools are exposed and state is managed—often explains more of the pass‑rate rise than changes to the underlying language model itself &lt;a href="https://arxiv.org/abs/2606.12344" rel="noopener noreferrer"&gt;[12]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These findings collectively show where the field is extracting more performance: tighter compute kernels, tighter integration with mutable tools, and tighter control over training dynamics. Each advance reduces a concrete bottleneck—memory, latency, or instability—making large‑scale, interactive AI systems more practical.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.13392" rel="noopener noreferrer"&gt;MiniMax Sparse Attention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.12243" rel="noopener noreferrer"&gt;VIA-SD: Verification via Intra-Model Routing for Speculative Decoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.07881" rel="noopener noreferrer"&gt;Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.13673" rel="noopener noreferrer"&gt;SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.13681" rel="noopener noreferrer"&gt;EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.09821" rel="noopener noreferrer"&gt;Rethinking the Divergence Regularization in LLM RL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.12373" rel="noopener noreferrer"&gt;Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.13376" rel="noopener noreferrer"&gt;MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.07229" rel="noopener noreferrer"&gt;MMAE: A Massive Multitask Audio Editing Benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.11180" rel="noopener noreferrer"&gt;Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.11187" rel="noopener noreferrer"&gt;Next Forcing: Causal World Modeling with Multi-Chunk Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.12344" rel="noopener noreferrer"&gt;Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>90% Less Memory Enables Infinite Video Generation</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sun, 14 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/90-less-memory-enables-infinite-video-generation-1omf</link>
      <guid>https://dev.to/olaughter/90-less-memory-enables-infinite-video-generation-1omf</guid>
      <description>&lt;p&gt;A shared low‑rank cache slashes the memory footprint of autoregressive video diffusion by more than nine‑tenths while still permitting arbitrarily long rollouts.  &lt;/p&gt;

&lt;p&gt;Before these contributions, streaming video diffusion relied on a per‑head key‑value cache that grows linearly with the temporal window, forcing practitioners to cap video length or provision prohibitively large GPUs.  &lt;/p&gt;

&lt;p&gt;VideoMLA reduces per‑token KV cache memory by 92.7 % while preserving compatibility with standard chunk‑causal generation. The paper shows that this compression does not hurt visual fidelity; on VBench the method matches short‑horizon baselines and even secures the best long‑horizon score, while Table 3 reports the highest throughput and lowest latency among chunk‑wise autoregressive models, translating to a 1.23× speedup on a single B200 &lt;a href="https://arxiv.org/abs/2605.30351" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;The Echo‑Infinity authors report state‑of‑the‑art performance and what they believe are the first 24‑hour (&amp;gt;1.3 M frames) real‑time rollouts, suggesting a practical path toward infinite video generation. In practice the system runs at 18.5 FPS on a single NVIDIA H100 and incurs only a 10.6 % throughput overhead compared with a memory‑free baseline, which indicates that constant‑cost, evolving memory can sustain day‑scale generation without exploding resource use &lt;a href="https://arxiv.org/abs/2606.04527" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;These results leave open several questions. VideoMLA’s latent dimension must be chosen manually, and although the bottleneck rank appears sufficient for the evaluated datasets, it is unclear how the approach scales to higher‑resolution or multi‑modal streams. Echo‑Infinity’s learnable memory, while effective up to a million frames, has not been stress‑tested on content that requires very long‑range narrative coherence, and the unified RoPE recipe may still encounter extrapolation limits on unseen motion dynamics.  &lt;/p&gt;

&lt;p&gt;If the combined system lives up to the reported numbers, developers can abandon the practice of over‑provisioning GPU memory for long video generation. Benchmarks that previously capped at a few seconds should be rerun with the minute‑scale configs shipped in the VideoMLA and Echo‑Infinity repositories, and production pipelines can target hour‑ or day‑scale output on a single H100 without redesigning the hardware stack.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30351" rel="noopener noreferrer"&gt;VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.04527" rel="noopener noreferrer"&gt;Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Linear Ensembles Can Erase LLM Watermarks</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Sat, 13 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/linear-ensembles-can-erase-llm-watermarks-34oo</link>
      <guid>https://dev.to/olaughter/linear-ensembles-can-erase-llm-watermarks-34oo</guid>
      <description>&lt;p&gt;Watermarking schemes that embed distributional perturbations into LLM outputs are effectively broken by linear ensembles of a few independently trained models. The intuition behind most provenance tools is that a tiny bias introduced at generation time survives any downstream processing, making it detectable by a statistical test. In practice, that assumption collapses as soon as an application draws from more than one provider. The result is a hidden amplifier for hallucination‑free text that simultaneously wipes out the very signal we trusted for attribution.&lt;/p&gt;

&lt;p&gt;Before this work, the community treated watermark perturbations as immutable once rendered into the probability distribution. Methods such as the z‑score detector or the binary‑mask classifier were evaluated on single‑model generations and consistently reported true‑positive rates above 90 % at a 5 % false‑positive budget. The detection threshold of 4 on the z‑score became the de facto benchmark for a “detectable” watermark, and no prior study had examined how ensemble decoding would interact with that statistic.&lt;/p&gt;

&lt;p&gt;A linear ensemble of just three to five models eradicates the watermark signal in practice. “Empirically, simply averaging 3-5 models cancels out these perturbations.” &lt;a href="https://arxiv.org/abs/2605.30501" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; The cancellation works because each provider injects an independent perturbation; averaging restores the underlying, unwatermarked distribution up to a second‑order error term. No extra training or fine‑tuning is required—plain probability averaging suffices.&lt;/p&gt;
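
&lt;p&gt;The cancellation argument can be sketched with a toy simulation. Everything here is an assumption made for illustration: a uniform base distribution shared by all models, a green-list style watermark that boosts a random half of the vocabulary, and the names &lt;code&gt;watermark&lt;/code&gt; and &lt;code&gt;simulate&lt;/code&gt;. It is not the paper's experimental setup and it does not include WASH.&lt;/p&gt;

```python
import math
import random


def watermark(base, green, delta):
    """Boost the logits of green-list tokens by delta, then renormalise."""
    weights = [p * math.exp(delta) if i in green else p
               for i, p in enumerate(base)]
    total = sum(weights)
    return [w / total for w in weights]


def l1_distance(p, q):
    """Total absolute difference between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def simulate(vocab_size=1000, n_models=5, delta=2.0, seed=0):
    rng = random.Random(seed)
    # Assumed shared, unwatermarked next-token distribution.
    base = [1.0 / vocab_size] * vocab_size
    marked = []
    for _ in range(n_models):
        ids = list(range(vocab_size))
        rng.shuffle(ids)
        # Each provider draws its own secret green list independently.
        green = set(ids[: vocab_size // 2])
        marked.append(watermark(base, green, delta))
    # Linear ensemble: average the probabilities token by token.
    ensemble = [sum(dist[i] for dist in marked) / n_models
                for i in range(vocab_size)]
    return l1_distance(marked[0], base), l1_distance(ensemble, base), ensemble


single_shift, ensemble_shift, ensemble = simulate()
print(f"single-model shift: {single_shift:.3f}")
print(f"ensemble shift:     {ensemble_shift:.3f}")
```

&lt;p&gt;Because the green lists are independent, a token boosted by one provider is suppressed by another, so the averaged distribution sits closer to the base than any single watermarked one. Real models do not share an identical base distribution, so this only illustrates the direction of the effect, not its size.&lt;/p&gt;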

&lt;p&gt;Averaging three models drives detection z‑scores below the standard threshold of 4, cuts the true‑positive rate at 5 % false‑positive rate to under 50 %, and simultaneously boosts text quality by 27.5 % while running six times faster than the strongest baseline. “Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (below the detection threshold of 4) and reduces TPR@5%FPR to below 50%, while improving quality by 27.5% and running 6× faster than the best baseline on the long sequence generation.” &lt;a href="https://arxiv.org/abs/2605.30501" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; The authors also introduce WASH, a lightweight pipeline that aligns vocabularies and tokenisers across heterogeneous models, making the ensemble practical even when the constituent LLMs differ in architecture.&lt;/p&gt;

&lt;p&gt;The study leaves open several important questions. It evaluates only six watermarking schemes and three LLM families, so it is unclear whether more sophisticated, non‑linear perturbations would survive averaging. WASH mitigates vocabulary mismatches, yet scaling to dozens of providers may introduce latency or memory bottlenecks not captured in the reported six‑fold speedup. Moreover, the analysis assumes that perturbations are statistically independent; coordinated watermarking could deliberately inject correlated noise to resist cancellation, but the feasibility of such industry‑wide coordination remains speculative.&lt;/p&gt;

&lt;p&gt;Robust provenance tracking can no longer rely on simple distributional watermarks unless model providers adopt a universally shared signing key or shift to cryptographic signatures that survive ensemble decoding. The immediate consequence is that any service that aggregates outputs from multiple LLM APIs must treat watermark‑based detection as unreliable and consider alternative attribution mechanisms.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30501" rel="noopener noreferrer"&gt;Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Benchmarks Evaluate Memory Quality and Adaptive Planning in LLM Agents</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Fri, 12 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/benchmarks-evaluate-memory-quality-and-adaptive-planning-in-llm-agents-7ca</link>
      <guid>https://dev.to/olaughter/benchmarks-evaluate-memory-quality-and-adaptive-planning-in-llm-agents-7ca</guid>
      <description>&lt;p&gt;Newly released test suites expose two blind spots that have long lurked behind headline scores: how faithfully an LLM‑augmented agent preserves useful information across millions of tokens, and whether it can reshuffle its plan when hidden rules surface mid‑game. The field has been chasing end‑task success while silently assuming that memory and planning stay reliable under the hood.&lt;/p&gt;

&lt;p&gt;Prior memory‑policy work trained agents with only outcome‑level rewards, which “introduce a severe credit assignment problem: it fails to localize intermediate memory degradation and provides no explicit supervision to suppress noise accumulation during recursive summarization.” &lt;a href="https://arxiv.org/abs/2605.30159" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; This leaves agents blind to the gradual erosion of task‑relevant facts. Likewise, planning benchmarks have treated constraints as a static checklist, ignoring the reality that world and user rules often emerge only after an initial proposal.&lt;/p&gt;

&lt;p&gt;MMPO tackles the memory blind spot by attaching a self‑supervised Belief Entropy signal to each summary, and the authors report that “experiments show that MMPO consistently outperforms existing methods on diverse long‑horizon tasks, maintaining 97.1% performance even when scaled to 1.75M‑token contexts.” &lt;a href="https://arxiv.org/abs/2605.30159" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AdaPlanBench flips the planning script: agents must propose a plan, receive feedback about hidden violations, and then revise. Under this pressure “experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy.” &lt;a href="https://arxiv.org/abs/2606.05622" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benchmark also quantifies the scaling of difficulty: “performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness.” &lt;a href="https://arxiv.org/abs/2606.05622" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; This mirrors the authors’ observation that “User‑Constraint Only is consistently harder than World‑Constraint Only, while Both Constraints is the most demanding setting.” &lt;a href="https://arxiv.org/abs/2606.05622" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A third suite probes stance simulation, revealing how easily a model’s inferred opinion can be nudged. The multimodal revision that injects a meme “produces an average directional shift of +49.3%, compared with +44.8% for the add strategy and –4% for the paraphrase control.” &lt;a href="https://arxiv.org/abs/2606.06443" rel="noopener noreferrer"&gt;[3]&lt;/a&gt; Even purely textual edits swing simulated stance by a large margin, as the add strategy’s figure shows.&lt;/p&gt;

&lt;p&gt;These results leave several questions open. MMPO’s belief‑entropy proxy is still a heuristic; it is unclear whether it scales beyond the reported 1.75 M tokens or how it interacts with external knowledge sources. AdaPlanBench, while extensive, confines itself to 307 household scenarios, so its conclusions may not transfer to industrial or high‑risk domains. The stance‑shift audit tests simulated users, not real people, so the measured volatility may over‑ or under‑estimate true societal impact. A natural next step is to ask whether a single agent architecture can jointly minimize belief entropy while tracking an evolving constraint graph, and how such a system would behave under adversarial context injections.&lt;/p&gt;

&lt;p&gt;If these benchmarks are taken as the new reliability baseline, the immediate effect will be a reshuffling of agent leaderboards: any agent that falls well short of the 97.1% performance MMPO reports at 1.75 M‑token contexts, or of the best reported adaptive‑planning accuracy of 67.75% on AdaPlanBench, should be considered for further safety evaluation. Re‑evaluating existing agents on the long‑horizon tasks used for MMPO, on AdaPlanBench, and on the stance‑revision suite will surface latent failure modes that current safety glosses overlook.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30159" rel="noopener noreferrer"&gt;Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.05622" rel="noopener noreferrer"&gt;AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.06443" rel="noopener noreferrer"&gt;Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>AI/ML Research Digest — Jun 06, 2026</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Thu, 11 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/aiml-research-digest-jun-06-2026-1dgc</link>
      <guid>https://dev.to/olaughter/aiml-research-digest-jun-06-2026-1dgc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Scaling Long‑Horizon Video Generation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Recent work replaces sliding‑window attention with memory‑centric designs. A learnable evolving memory compresses the entire history at constant cost, enabling real‑time infinite rollouts &lt;a href="https://arxiv.org/abs/2606.04527" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. A shared low‑rank latent with 3‑D RoPE replaces per‑head KV caches, cutting cache memory by roughly 90 % while preserving visual fidelity &lt;a href="https://arxiv.org/abs/2605.30351" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. Together these tricks make video models scalable to lengths that were previously infeasible and cut GPU memory footprints dramatically &lt;a href="https://arxiv.org/abs/2606.02553" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stabilizing On‑Policy Distillation for LLMs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Distilling policies from RL‑trained LLMs suffers from high KL variance and distribution drift. Aligning hidden representations reduces this variance, yielding smoother student updates &lt;a href="https://arxiv.org/abs/2606.06021" rel="noopener noreferrer"&gt;[4]&lt;/a&gt;. Trust‑region constraints enforce a bounded policy shift, preventing collapse during distillation &lt;a href="https://arxiv.org/abs/2606.01249" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;. Two orthogonal tricks—logit‑free chunk verification and self‑distilled policy gradients—provide additional stability for RL fine‑tuning &lt;a href="https://arxiv.org/abs/2606.01476" rel="noopener noreferrer"&gt;[6]&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2606.04036" rel="noopener noreferrer"&gt;[7]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic Reliability and Safety Frameworks&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Benchmarks now probe LLM agents on extended reasoning tasks that require memory‑policy optimization and plan adaptation under shifting constraints &lt;a href="https://arxiv.org/abs/2605.30159" rel="noopener noreferrer"&gt;[8]&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2606.05622" rel="noopener noreferrer"&gt;[9]&lt;/a&gt;. Counterfactual context revision audits reveal how agents change stance when presented with altered evidence, exposing hidden failure modes &lt;a href="https://arxiv.org/abs/2606.06443" rel="noopener noreferrer"&gt;[10]&lt;/a&gt;. Self‑evolving prompt agents automatically refine system prompts, improving alignment without human intervention &lt;a href="https://arxiv.org/abs/2606.04465" rel="noopener noreferrer"&gt;[11]&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Standout Papers&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;VideoMLA: Efficient KV Cache Reduction&lt;/em&gt; – Introduces a shared low‑rank latent and 3‑D RoPE to replace per‑head KV caches, cutting memory use by over 90 % while keeping generation quality intact &lt;a href="https://arxiv.org/abs/2605.30351" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Echo‑Infinity: Infinite Video Rollouts&lt;/em&gt; – Proposes a learnable evolving memory that stores compressed history at fixed cost, allowing autoregressive video generation to run indefinitely in real time &lt;a href="https://arxiv.org/abs/2606.04527" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hamilton‑Jacobi Evolution for NN Dynamics&lt;/em&gt; – Shows that gradient steps follow a viscous Hamilton‑Jacobi PDE, unifying ResNets, Transformers, and RNNs under a single mathematical lens &lt;a href="https://arxiv.org/abs/2605.28983" rel="noopener noreferrer"&gt;[12]&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Other Notable Findings&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Distributional watermarking is fragile&lt;/em&gt; – Linear ensembles can erase watermark perturbations, indicating that current watermarking schemes may be easy to bypass &lt;a href="https://arxiv.org/abs/2605.30501" rel="noopener noreferrer"&gt;[13]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Speculative decoding throughput&lt;/em&gt; – A pipeline‑parallel speculative decoding framework processes multiple tokens per step, delivering a theoretical speedup far beyond prior baselines &lt;a href="https://arxiv.org/abs/2605.30852" rel="noopener noreferrer"&gt;[14]&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sparse MoE via optimal transport&lt;/em&gt; – Differentiable optimal transport converts dense feed‑forward layers into sparse experts, offering a principled recipe for Mixture‑of‑Experts specialization &lt;a href="https://arxiv.org/abs/2606.01666" rel="noopener noreferrer"&gt;[15]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These results tighten the gap between research prototypes and production‑ready systems. Memory‑efficient video models can now power longer streams, stable distillation makes LLM policy fine‑tuning safer, and new safety benchmarks expose hidden brittleness in agentic behavior. The accompanying techniques—low‑rank caches, evolving memories, trust‑region distillation, and optimal‑transport MoEs—provide concrete tools for engineers building the next generation of generative and autonomous AI.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.04527" rel="noopener noreferrer"&gt;Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30351" rel="noopener noreferrer"&gt;VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.02553" rel="noopener noreferrer"&gt;LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.06021" rel="noopener noreferrer"&gt;OPRD: On-Policy Representation Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.01249" rel="noopener noreferrer"&gt;Trust Region On-Policy Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.01476" rel="noopener noreferrer"&gt;OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.04036" rel="noopener noreferrer"&gt;Self-Distilled Policy Gradient&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30159" rel="noopener noreferrer"&gt;Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.05622" rel="noopener noreferrer"&gt;AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.06443" rel="noopener noreferrer"&gt;Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.04465" rel="noopener noreferrer"&gt;SePO: Self-Evolving Prompt Agent for System Prompt Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.28983" rel="noopener noreferrer"&gt;The Hamilton-Jacobi Theory of Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30501" rel="noopener noreferrer"&gt;Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.30852" rel="noopener noreferrer"&gt;Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2606.01666" rel="noopener noreferrer"&gt;DOT-MoE: Differentiable Optimal Transport for MoEfication&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Raw waveform diffusion matches autoencoder quality</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Fri, 05 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/raw-waveform-diffusion-matches-autoencoder-quality-5652</link>
      <guid>https://dev.to/olaughter/raw-waveform-diffusion-matches-autoencoder-quality-5652</guid>
      <description>&lt;p&gt;Raw waveform diffusion can now deliver the same—or even higher—audio fidelity that autoencoder‑based pipelines have long claimed as their exclusive domain. By discarding any latent compression step, WavFlow produces samples that listeners cannot distinguish from those generated by established latent diffusion models.&lt;/p&gt;

&lt;p&gt;For years the community has built audio generators on top of semantic‑acoustic autoencoders, a strategy epitomized by Stable Audio 3, which first compresses waveforms into a compact latent space before applying diffusion. This two‑stage design has been justified as necessary to tame the high dimensionality of raw audio and to keep training tractable.&lt;/p&gt;

&lt;p&gt;WavFlow’s VGGSound results prove that a pure‑waveform approach is competitive: “Experimental results show that WavFlow achieves competitive results on the video‑to‑audio benchmark VGGSound (FD 59.98, IS 17.40, DeSync 0.44) … matching or exceeding the performance of established latent‑based methods” &lt;a href="https://arxiv.org/abs/2605.18749" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The FD score of 59.98 sits squarely within the range of top latent models, while the IS and DeSync numbers confirm comparable perceptual quality and temporal alignment.&lt;/p&gt;

&lt;p&gt;On the text‑to‑audio front, WavFlow even sets new records: “Our model attains the best FD (10.63) and IS (12.62) reported to date, rivaling dedicated T2A systems” &lt;a href="https://arxiv.org/abs/2605.18749" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. Those figures surpass the best published latent‑based scores on AudioCaps, demonstrating that raw‑space diffusion does not sacrifice semantic relevance for fidelity.&lt;/p&gt;

&lt;p&gt;When the architecture is scaled to the 16 kHz “L” variant, it edges past a strong latent baseline: “Scaling to WavFlow‑L‑16kHz yields consistent improvements, surpassing MMAudio‑L‑44.1kHz in distributional fidelity (FD: 59.98 vs. 60.60) while matching its performance in perceptual and alignment metrics (IS 17.40, DeSync 0.44)” &lt;a href="https://arxiv.org/abs/2605.18749" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. In this head‑to‑head comparison, raw‑waveform diffusion narrowly leads a leading latent system on FD (lower is better) while staying on par on the perceptual and alignment metrics.&lt;/p&gt;

&lt;p&gt;The study’s scope still leaves open several practical concerns. Training required five million video‑text‑audio triplets and a custom amplitude‑lifting scheme to keep optimization stable, implying a higher data and compute budget than many latent pipelines. Moreover, the best results are reported at 16 kHz, whereas many production scenarios demand 44.1 kHz or higher fidelity, raising the question of whether the same gains will hold at those rates.&lt;/p&gt;

&lt;p&gt;If these results generalize, the default assumption that audio synthesis must pass through an encoder‑decoder bottleneck should be revisited. Future benchmark suites ought to include a raw‑waveform diffusion baseline, and engineers can consider dropping the autoencoder stage altogether when building multimodal generation systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.18749" rel="noopener noreferrer"&gt;WavFlow: Audio Generation in Waveform Space&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Agents still fail 38% of real CLI tasks</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Thu, 04 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/agents-still-fail-38-of-real-cli-tasks-21le</link>
      <guid>https://dev.to/olaughter/agents-still-fail-38-of-real-cli-tasks-21le</guid>
      <description>&lt;p&gt;State‑of‑the‑art agents succeed on just 62.5 % of authentic command‑line workflows. The TerminalWorld benchmark, built from tens of thousands of real developer recordings, evaluates agents in a zero‑shot setting on tasks that span simple one‑liners to multi‑step deployment pipelines. That success ceiling shatters the prevailing belief that large language models can already replace shell scripts for everyday use.&lt;/p&gt;

&lt;p&gt;Existing evaluations have leaned on hand‑crafted command suites that capture only a narrow slice of developer activity. Benchmarks such as Terminal‑Bench present curated queries and score agents on idealized subtasks, but they miss the messy, iterative patterns seen in production terminals. Consequently, reported numbers have long over‑estimated practical reliability.&lt;/p&gt;

&lt;p&gt;The best evaluated agent reaches a max pass rate of only 62.5 % — “Comprehensive benchmarking on TerminalWorld‑Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%.” &lt;a href="https://arxiv.org/abs/2605.22535" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; This figure comes from a fully automated pipeline that reverse‑engineers tasks from 80 k+ asciinema recordings, ensuring the evaluation mirrors what developers actually type.&lt;/p&gt;

&lt;p&gt;Even the strongest model fails on more than a third of the tasks, with overall pass rates ranging between 49.0 % and 62.5 % and an average of 54.8 % — “Overall, all evaluated models achieve modest pass rates (49.0%–62.5%, avg. 54.8%), with even the best model (i.e., Claude Opus 4.7) failing on over one‑third of the tasks, confirming that the real‑world terminal tasks in TerminalWorld pose a substantial challenge to frontier LLMs.” &lt;a href="https://arxiv.org/abs/2605.22535" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; The gap persists across both small‑scale utilities and long‑running build scripts.&lt;/p&gt;

&lt;p&gt;Agents typically reach the correct outcome via a different set of commands than the human practitioner used, with a median overlap of only 21.4 % — “The median overlap is only 21.4%, meaning agents typically reach the correct outcome via a different set of commands than the human practitioner used.” &lt;a href="https://arxiv.org/abs/2605.22535" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; This low overlap signals brittle tool use and limited error‑recovery strategies, as models opt for shortcuts that happen to succeed rather than faithfully reproducing expert workflows.&lt;/p&gt;

&lt;p&gt;The benchmark measures only zero‑shot performance using the recorded terminal sessions, ignoring iterative prompting, tool‑specific fine‑tuning, or external memory that a real assistant could exploit. Moreover, the tasks, while diverse, are still bounded by the recordings the engine could parse, leaving open how agents handle completely novel commands or privileged operations. These constraints suggest the reported 62.5 % figure is better read as a lower bound on what richer interaction loops could achieve.&lt;/p&gt;

&lt;p&gt;Assuming an AI assistant can fully automate routine CLI chores is premature; teams should continue to treat agents as aides, not replacements, and re‑evaluate new models against TerminalWorld before deployment. Will the next generation finally break the 70 % barrier, or is CLI automation a harder problem than we thought?&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.22535" rel="noopener noreferrer"&gt;TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Generative models now output simulation‑ready 3D assets</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Wed, 03 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/generative-models-now-output-simulation-ready-3d-assets-4ffk</link>
      <guid>https://dev.to/olaughter/generative-models-now-output-simulation-ready-3d-assets-4ffk</guid>
      <description>&lt;p&gt;Vision‑language transformers paired with geometric primitives now output metric‑scale, simulation‑ready 3D assets. PhysX‑Omni demonstrates that a single autoregressive transformer can ingest image, text, and spatial priors and directly emit meshes, materials, and physics descriptors that import straight into simulation engines that support URDF/XML formats.&lt;/p&gt;

&lt;p&gt;Before this work, most generative 3D pipelines either ignored physical properties or were constrained to a single object class—rigid, deformable, or articulated—forcing developers to stitch together separate tools for geometry and physics. The authors explicitly point out this fragmentation as the motivation for a unified approach &lt;a href="https://arxiv.org/abs/2605.21572" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;"On PhysXVerse, our method achieves a PSNR of 21.52, CD of 2.95, and F-score of 91.28, substantially surpassing the previous best results." These figures indicate that the generated geometry is not only visually faithful but also quantitatively closer to ground‑truth scans than any prior model evaluated on the same dataset.&lt;/p&gt;

&lt;p&gt;"On PhysXVerse, the absolute scale error is reduced from 309.31 in PhysXGen and 298.19 in PhysX-Anything to only 2.79 in PhysX-Omni." This two‑order‑of‑magnitude improvement means the objects come out at real‑world dimensions, eliminating the tedious manual rescaling step that has long plagued simulation pipelines.&lt;/p&gt;

&lt;p&gt;The released codebase ships an inference script that writes URDF and XML files, so a single forward pass yields a physics‑enabled asset for any simulator that understands these formats. The pipeline runs end‑to‑end from prompt to asset, although some post‑hoc rigging or parameter tuning may still be necessary depending on the target simulator.&lt;/p&gt;

&lt;p&gt;The paper does not report generation latency or memory consumption, leaving open whether PhysX‑Omni can serve interactive authoring tools where instant feedback is crucial. Moreover, PhysXVerse currently spans indoor and outdoor categories, covering a range of object types, though the model’s robustness on highly articulated creatures or large‑scale terrains has yet to be thoroughly evaluated.&lt;/p&gt;

&lt;p&gt;If the reported fidelity and scale accuracy translate to production workloads, studios can replace hours of manual modeling and physics authoring with a single text prompt, fundamentally reshaping asset pipelines that have long treated geometry and simulation as separate stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.21572" rel="noopener noreferrer"&gt;PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Agent compute drops substantially with online skill distillation and graph‑guided knowledge</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Wed, 03 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/agent-compute-drops-substantially-with-online-skill-distillation-and-graph-guided-knowledge-2ja8</link>
      <guid>https://dev.to/olaughter/agent-compute-drops-substantially-with-online-skill-distillation-and-graph-guided-knowledge-2ja8</guid>
      <description>&lt;p&gt;Online skill distillation and graph‑guided knowledge substantially reduce the compute bill of LLM agents while keeping success rates competitive. These reductions could enable shifting agents from cloud‑only services to modest on‑device hardware.&lt;/p&gt;

&lt;p&gt;Earlier web and mobile agents leaned on heavyweight tricks: multi‑rollout searches, separate verifier passes, and stacks of specialist vision‑language models that balloon token counts and memory footprints. Such pipelines achieved high task‑success but only by paying huge inference costs.&lt;/p&gt;

&lt;p&gt;PANDO reaches a 58.3 % success rate on the full 910‑task VisualWebArena suite while using 115 K tokens per task, and it does so “while using 58% fewer tokens than SGV and 61% fewer than WALT, with no pre‑evaluation discovery budget” &lt;a href="https://arxiv.org/abs/2605.24785" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. This headline figure shows that a single‑rollout, online distillation loop can surpass strong baselines without the token overhead that traditionally powers web agents.&lt;/p&gt;
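
&lt;p&gt;To put the savings in absolute terms, the sketch below inverts the quoted percentages. It assumes that "58% fewer tokens" means PANDO uses (1 minus 0.58) times the baseline's per‑task tokens; the resulting baseline figures are implied estimates under that assumption, not numbers reported in the text, and the function name is illustrative.&lt;/p&gt;

```python
# Per-task token usage quoted for PANDO (115 K tokens).
PANDO_TOKENS = 115_000

# Fractional savings quoted relative to each baseline.
savings = {"SGV": 0.58, "WALT": 0.61}

def implied_baseline_tokens(ours, saving):
    """Invert ours = baseline * (1 - saving) to estimate the baseline."""
    return ours / (1.0 - saving)

implied = {name: implied_baseline_tokens(PANDO_TOKENS, s) for name, s in savings.items()}
for name, tokens in implied.items():
    print(f"{name}: roughly {tokens / 1000:.0f} K tokens per task (implied, not reported)")
```

&lt;p&gt;Under that reading, both baselines would land between roughly 270 K and 300 K tokens per task.&lt;/p&gt;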

&lt;p&gt;Beyond success, PANDO also delivers the best intrinsic efficiency: it records the lowest Action Repetition Rate (9.1 %), the lowest Step Overhead Ratio (1.8), and the highest prompt‑cache utilization (72.4 %) among automated methods &lt;a href="https://arxiv.org/abs/2605.24785" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. These metrics translate directly into reduced memory churn and faster per‑step latency.&lt;/p&gt;

&lt;p&gt;UI‑KOBE lifts a lightweight 4 B‑parameter backbone to 70.7 % success on mobile GUI tasks, “substantially outperforming the same backbone model without graph guidance, which achieves 58.6 %” &lt;a href="https://arxiv.org/abs/2605.29534" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. The gain comes from re‑using an app‑specific knowledge graph that steers the agent through local decisions instead of forcing a monolithic end‑to‑end planner.&lt;/p&gt;

&lt;p&gt;The results leave open two practical questions. PANDO still exhibits a 9.1 % action‑repetition rate and a step‑overhead ratio of 1.8, meaning some inefficiency remains even after distillation. UI‑KOBE “does introduce an exploration cost, averaging $6.2 and 6.4 hours per app” &lt;a href="https://arxiv.org/abs/2605.29534" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;, and its graph is only as good as the explored UI surface, limiting portability to unseen applications. Moreover, the formulation “reduces GUI task execution to a sequence of guided local decisions, significantly lowering the reasoning burden on small models” &lt;a href="https://arxiv.org/abs/2605.29534" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;, which may not hold for highly dynamic or multimodal interfaces.&lt;/p&gt;

&lt;p&gt;If these savings hold across domains, the community should start evaluating agents under strict token‑budget constraints rather than raw success alone. Revisiting VisualWebArena and MobileGUI benchmarks with a ≤ 120 K token ceiling per task would surface designs that truly scale to on‑device deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.24785" rel="noopener noreferrer"&gt;PANDO: Efficient Multimodal AI Agents via Online Skill Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.29534" rel="noopener noreferrer"&gt;UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>Verifiable rewards improve LLM math accuracy</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/olaughter/verifiable-rewards-improve-llm-math-accuracy-217a</link>
      <guid>https://dev.to/olaughter/verifiable-rewards-improve-llm-math-accuracy-217a</guid>
      <description>&lt;p&gt;RL from verifiable rewards now beats GRPO baselines by a comfortable margin, and the advantage comes from assigning credit at far finer granularity than whole‑response scores. By turning verification into token‑ and subproblem‑level signals, the newest methods extract learning from progress that would otherwise be discarded.&lt;/p&gt;

&lt;p&gt;Before these works, reinforcement learning for reasoning relied on a single scalar reward per generated answer. GRPO and similar RLHF‑style pipelines treated the whole response as the unit of credit, which made credit assignment noisy and left hard problems stuck in “gradient dead zones.” No mechanism existed to reward partial solves or to isolate the effect of a single token on the final verdict.&lt;/p&gt;

&lt;p&gt;DelTA’s discriminative token credit assignment reshapes the RL update into a linear discriminator over token‑gradient vectors, amplifying side‑specific directions while suppressing shared noise. “DelTA consistently outperforms all same‑scale RL baselines on both Qwen3-8B-Base and Qwen3-14B-Base, achieving the best result on every benchmark and the highest average score at both scales” &lt;a href="https://arxiv.org/abs/2605.21467" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The paper reports average gains of +3.26 points for the 8 B model and +2.62 points for the 14 B model across seven math suites, turning a marginal RL improvement into a systematic boost.&lt;/p&gt;

&lt;p&gt;SCRL converts a reasoning chain into verifiable subproblems and normalizes rewards at each position, so that the longest consecutively solved subproblem sequence determines the advantage. “The gain is especially clear on Qwen3-4B, where SCRL reaches an average score of 35.0%, improving over the second‑best baseline QuestA (32.0%) by 3.0 points and over vanilla GRPO (30.9%) by 4.1 points” &lt;a href="https://arxiv.org/abs/2605.22074" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. Across the same seven benchmarks the method adds +4.1 average points for the 4 B model and +1.9 points for the 14 B model, and on hard AIME/IMO sets it lifts pass@1 by +3.7 points and pass@64 by +4.6 points.&lt;/p&gt;

&lt;p&gt;RELEX shows that RLVR trajectories live in an almost one‑dimensional subspace, making most of the performance gain capturable by a rank‑1 projection that grows near‑linearly with training steps. “Specifically, we find that the majority of downstream performance gains are captured by a rank‑1 approximation of the parameter deltas, where the magnitude of this projection evolves near‑linearly with training steps” &lt;a href="https://arxiv.org/abs/2605.21468" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;. Extrapolating from only 15–20 % of the usual RLVR steps, RELEX matches GRPO on Qwen2.5‑Math‑1.5B (71.6 % vs 71.5 %) and slightly exceeds it on Qwen3‑4B‑Base (85.6 % vs 85.5 %), but falls short on Qwen3‑8B‑Base (87.4 % vs 88.5 %) on the in‑domain MATH benchmark, while also beating RLVR on five out‑of‑domain tests.&lt;/p&gt;
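
&lt;p&gt;The rank‑1 idea can be illustrated on toy data. The sketch below is not the RELEX implementation: it assumes parameter deltas are available as plain Python lists, finds a dominant direction by power iteration, fits the projection magnitude as a line through the origin in the training step (a simplifying assumption), and extrapolates to a later step. All function names and the toy data are hypothetical.&lt;/p&gt;

```python
import math

def dominant_direction(deltas, iters=50):
    """Power iteration for the top right-singular vector of the delta matrix."""
    dim = len(deltas[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply D^T D to v without forming the matrix explicitly.
        proj = [sum(d_i * v_i for d_i, v_i in zip(d, v)) for d in deltas]
        w = [sum(p * d[j] for p, d in zip(proj, deltas)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def extrapolate_delta(steps, deltas, target_step):
    """Fit magnitude = slope * step along the rank-1 direction, then extrapolate."""
    v = dominant_direction(deltas)
    mags = [sum(d_i * v_i for d_i, v_i in zip(d, v)) for d in deltas]
    # Least-squares slope for a line through the origin.
    slope = sum(s * m for s, m in zip(steps, mags)) / sum(s * s for s in steps)
    return [slope * target_step * v_i for v_i in v]

# Toy data: deltas (checkpoint minus base weights) grow linearly along one direction.
direction = [0.6, 0.8, 0.0]
steps = [1, 2, 3]
deltas = [[s * x for x in direction] for s in steps]
predicted = extrapolate_delta(steps, deltas, target_step=10)
print([round(x, 6) for x in predicted])
```

&lt;p&gt;Because the toy deltas are exactly rank‑1 and linear in the step, the extrapolated delta at step 10 is ten times the unit direction. Real trajectories are only approximately rank‑1, so the quality of such an extrapolation depends on how much of the update that single direction captures.&lt;/p&gt;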

&lt;p&gt;The three papers leave open questions about scalability and universality. DelTA’s centroid reweighting still risks being dominated by high‑frequency formatting tokens, so its discriminative edge may shrink on longer, more heterogeneous sequences. SCRL depends on high‑quality reference chains; constructing those for novel domains could re‑introduce costly annotation. RELEX assumes a linear, rank‑1 evolution that has only been demonstrated on math‑oriented backbones; whether the same simplicity holds for dialog or retrieval‑augmented models remains to be seen.&lt;/p&gt;

&lt;p&gt;If fine‑grained verification truly captures the lion’s share of RLVR learning, developers should replace monolithic reward wrappers with token‑ or subproblem‑level credit pipelines as the new default. Moreover, RELEX’s cheap extrapolation suggests that, by training only a short RLVR run and then extrapolating, comparable checkpoints can be obtained at a fraction of the compute budget, potentially enabling a rapid rollout of more reliable reasoning across deployed LLM services. The next wave of RL‑enhanced models will likely be judged not by how many steps they train, but by how sharply they can slice the verification signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.21467" rel="noopener noreferrer"&gt;DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.22074" rel="noopener noreferrer"&gt;From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.21468" rel="noopener noreferrer"&gt;You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
    <item>
      <title>ScientistOne achieves perfect citation verification</title>
      <dc:creator>Papers Mache</dc:creator>
      <pubDate>Mon, 01 Jun 2026 19:13:24 +0000</pubDate>
      <link>https://dev.to/olaughter/scientistone-achieves-perfect-citation-verification-n6g</link>
      <guid>https://dev.to/olaughter/scientistone-achieves-perfect-citation-verification-n6g</guid>
      <description>&lt;p&gt;Chain‑of‑evidence pipelines erase the citation hallucination problem that has long plagued autonomous research agents. By insisting that every factual claim be anchored to a concrete source, the system forces the generator to expose its evidence at generation time, making fabricated references impossible to hide. In practice this means a literature‑review bot can now be trusted to point you to the exact paper it is quoting, instead of inventing a bibliography entry that looks plausible but does not exist.&lt;/p&gt;

&lt;p&gt;Before ScientistOne, every baseline system exhibited at least one verifiability failure, with hallucinated reference rates soaring to 21 % and score verification succeeding in as few as 42 % of generated papers. The gap was not a fringe bug; it was a systemic property of the current research‑assistant paradigm, where surface‑level fluency masked deep inconsistencies. Those numbers made any downstream reliance on automatically written surveys a gamble.&lt;/p&gt;

&lt;p&gt;ScientistOne eliminates the hallucination risk entirely, reporting “zero hallucinated references (0/337 bibliography entries)” across its entire evaluation suite &lt;a href="https://arxiv.org/abs/2605.26340" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The framework constructs an evidence chain for each citation so that every claim can be traced to its source. An audit then checks each chain, and any discrepancy causes reference verification to fail.&lt;/p&gt;

&lt;p&gt;Score verification becomes a certainty: “perfect score verification (12/12)” means every claimed result reproduces exactly under independent re‑evaluation &lt;a href="https://arxiv.org/abs/2605.26340" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. The pipeline reruns the reported experiment, compares the numeric outcome to the manuscript, and only admits the result if the difference lies within a negligible tolerance. This eliminates the classic “the numbers look right but can’t be reproduced” loophole that has rendered many AI‑generated papers useless.&lt;/p&gt;
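
&lt;p&gt;A tolerance check of the kind described can be sketched with Python's standard &lt;code&gt;math.isclose&lt;/code&gt;. The tolerance value, the function name, and the example numbers are assumptions for illustration; the text does not specify the threshold ScientistOne uses.&lt;/p&gt;

```python
import math

def verify_scores(claimed, reproduced, rel_tol=1e-3):
    """Return the names of metrics whose rerun value is missing or drifts from the claim."""
    failures = []
    for name, claimed_value in claimed.items():
        rerun_value = reproduced.get(name)
        # A missing rerun counts as a failure, as does a value outside the tolerance.
        if rerun_value is None or not math.isclose(claimed_value, rerun_value, rel_tol=rel_tol):
            failures.append(name)
    return failures

# Hypothetical numbers, for illustration only.
claimed = {"accuracy": 0.912, "f1": 0.874}
reproduced = {"accuracy": 0.9121, "f1": 0.861}
print(verify_scores(claimed, reproduced))
```

&lt;p&gt;An empty list would mean every claimed score reproduced within tolerance. In this made‑up example only the F1 score is flagged.&lt;/p&gt;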

&lt;p&gt;Method‑code alignment also reaches the top of the leaderboard, with ScientistOne attaining “the highest method–code alignment (14/15)” while matching or exceeding human expert performance on all five frontier tasks &lt;a href="https://arxiv.org/abs/2605.26340" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;. Each algorithmic description is paired with the exact source code snippet that implements it, and a static analysis check confirms that the signature and hyper‑parameters line up. The result is a paper where the methods section is no longer a prose summary but a verifiable map to runnable artifacts.&lt;/p&gt;

&lt;p&gt;The triumphs are bounded by the scope of the study: 75 papers covering five research tasks and a handful of extensions to medical imaging, fine‑grained recognition, 3D perception, and language modeling. Although the evidence chain held up across these domains, it remains an open question whether the same zero‑failure rate will persist on large‑scale, multi‑disciplinary corpora or under adversarial prompt engineering. Moreover, the audit relies on deterministic reproducibility of downstream experiments, which can be brittle when external services change.&lt;/p&gt;

&lt;p&gt;If the zero‑hallucination claim holds under broader scrutiny, citation verification should become a mandatory step in any automated scientific‑writing pipeline. Existing benchmarks that score only linguistic quality must be augmented with a verifiability metric, and developers of literature‑review assistants ought to embed a chain‑of‑evidence module by default. In short, the paper‑writing landscape will shift from “does it read well?” to “can every claim be traced and reproduced?”&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2605.26340" rel="noopener noreferrer"&gt;ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>abotwrotethis</category>
    </item>
  </channel>
</rss>
