Extreme KV‑Cache Compression and Long‑Context Efficiency
Static quantization is giving way to rotation-based and context-sensitive schemes. OCTOPUS and OScaR report near-lossless accuracy with INT2 (2-bit) KV caches, cutting cache size dramatically [1], [2]. Sparse token indexers replace dense caches with a searchable sketch, preserving attention fidelity at lower memory cost [3]. Linear-attention decoupling splits the KV stream into a short-term mutable part and a long-term static part, keeping long-context reasoning accurate without quadratic growth [4]. Together these ideas let models handle thousands of tokens on modest hardware, easing a memory bottleneck for many retrieval-augmented and multilingual applications.
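To make the rotation idea concrete, here is a minimal Python sketch. It is not the OCTOPUS or OScaR algorithm: it simply applies an orthonormal Walsh-Hadamard rotation (one common choice in rotation-based quantization) before a plain min/max 2-bit quantizer. All function names and the toy vector are illustrative assumptions.

```python
import math


def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    len(x) must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_2bit(x):
    """Asymmetric min/max quantization to four levels (codes 0..3)."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / 3 or 1.0  # avoid a zero step for constant vectors
    codes = [round((v - lo) / step) for v in x]
    return codes, lo, step


def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def roundtrip(x, rotate):
    """Quantize and reconstruct x, optionally inside the rotated basis."""
    z = hadamard_rotate(x) if rotate else list(x)
    codes, lo, step = quantize_2bit(z)
    z_hat = dequantize(codes, lo, step)
    return hadamard_rotate(z_hat) if rotate else z_hat


if __name__ == "__main__":
    # Toy "key vector" with a single outlier channel (made-up values).
    key = [0.1, -0.2, 0.05, 8.0, -0.1, 0.15, -0.05, 0.2]
    print("direct :", mse(key, roundtrip(key, rotate=False)))
    print("rotated:", mse(key, roundtrip(key, rotate=True)))
```

On a vector with one outlier channel, the rotation spreads the outlier's energy across all channels before quantization, which is the effect rotation-based schemes rely on. How much it helps on a real cache depends on the data.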
Verifiable Rewards for LLM Reasoning
RL from verifiable rewards (RLVR) is moving from the coarse, sequence-level credit of the GRPO baseline, where every token in a response shares one advantage, toward token-level credit signals. Discriminative token weighting assigns more credit to correct intermediate steps, improving math and code accuracy [5]. Subproblem-level curriculum learning breaks hard problems into tractable pieces, letting the model earn rewards incrementally and generalize to unseen compositions [6]. The reported result is a higher exact-solution rate on benchmark suites that require multi-step reasoning.
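The difference between sequence-level and token-level credit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DelTA method: the group-relative advantage follows the usual GRPO recipe (here with the population standard deviation), while `token_level_advantages` and its weights are hypothetical stand-ins for whatever token scores a real method would produce.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def token_level_advantages(seq_advantage, token_weights):
    """Redistribute one sequence-level advantage across tokens.

    Weights are rescaled to have mean 1, so the total credit for the
    sequence is unchanged; only its distribution over tokens differs.
    """
    mean_w = sum(token_weights) / len(token_weights)
    return [seq_advantage * w / mean_w for w in token_weights]


if __name__ == "__main__":
    # Four sampled answers to one prompt, scored 1 (verified) or 0.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    # Hypothetical per-token weights for the first answer.
    print(token_level_advantages(advantages[0], [0.5, 0.5, 2.0, 1.0]))
```

With uniform weights the second function reduces to the baseline behaviour, in which every token receives the same advantage.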
Unified Generative Frameworks for 3D Geometry
Vision-language models are now paired with explicit geometric primitives to output simulation-ready assets. UniT's Group Autoregressive Transformer treats points, lines, and surfaces as a single token stream, enabling end-to-end generation of metric-scale scenes [7]. A separate line of work injects 4D Gaussian splatting into the pipeline, turning raw sensor streams into dense, temporally coherent reconstructions suitable for downstream physics simulators [8]. Together, these approaches unify perception and asset creation, reducing the manual modeling effort that has long limited virtual world construction.
Standout Papers
Muon Optimizer for Spectral Capacity – Muon replaces AdamW and scales the spectral capacity of feed‑forward layers linearly with model size, yielding higher expressive power without extra parameters [9]. The finding shows that optimizer design can directly shape internal representations, a lever not fully explored in transformer research.
TerminalWorld for Authentic Agent Evaluation – TerminalWorld provides a massive, automatically curated benchmark of command-line tasks that mimic real developer workflows. Even the best agent tops out at a 62.5% pass rate, revealing a gap between lab-scale success and practical usability [10].
WavFlow Raw Waveform Generation – WavFlow discards latent encoders and generates audio directly from waveform patches using flow matching. The model achieves high-fidelity synthesis comparable to diffusion baselines, which calls into question whether semantic-acoustic bottlenecks are necessary for high-quality audio generation [11].
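For readers unfamiliar with flow matching on raw samples, the following Python sketch shows how one training example can be built from a waveform patch. It assumes the common linear (rectified-flow) interpolation path and non-overlapping patches; WavFlow's actual path, patching scheme, and network are not described here, and the function names are hypothetical.

```python
import random


def patchify(waveform, patch_size):
    """Split a 1-D waveform into non-overlapping patches, dropping any remainder."""
    n = len(waveform) // patch_size
    return [waveform[i * patch_size:(i + 1) * patch_size] for i in range(n)]


def flow_matching_pair(x1, rng):
    """Build one training example for linear-path flow matching.

    x1 is a data patch, x0 is Gaussian noise, and the point on the path is
    x_t = (1 - t) * x0 + t * x1. The regression target for the velocity
    network is the constant velocity x1 - x0.
    """
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return t, xt, v


if __name__ == "__main__":
    rng = random.Random(0)
    wave = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0]  # toy samples
    for patch in patchify(wave, patch_size=4):
        t, xt, v = flow_matching_pair(patch, rng)
        print(round(t, 3), [round(s, 3) for s in xt])
```

A model trained on such pairs learns a velocity field that is integrated from noise to audio at sampling time, with no latent encoder in between.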
Other Notable Details
Observable‑Read Isolation in Agent Pipelines – An HTTP middleware that logs deliveries enforces Observable‑Read Isolation, eliminating structural race conditions in multi‑step agents without touching the agents’ core code [12].
Matérn Process for Mesh Flow Matching – Introducing a triangulation‑agnostic Matérn‑process noise model allows flow‑matching generators to produce meshes with millions of triangles, breaking the diversity ceiling of prior mesh synthesis methods [13].
FlowLong – Overlapping sliding windows combined with Tweedie matching extend the generation horizon of autoregressive video diffusion at inference time, with no additional training. The technique preserves temporal coherence across arbitrarily long sequences, where standard diffusion models are known to drift [14].
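The window bookkeeping behind overlapping sliding windows can be sketched as follows. This Python example covers only the tiling and the merging of overlaps by uniform averaging; it does not implement Tweedie matching or any part of FlowLong itself. Each frame is stood in for by a single float, and the function names are hypothetical.

```python
def window_starts(total, window, overlap):
    """Start indices of overlapping windows that cover `total` frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if total < window:
        raise ValueError("sequence is shorter than one window")
    stride = window - overlap
    starts = list(range(0, total - window + 1, stride))
    if starts[-1] + window < total:
        starts.append(total - window)  # final window flush with the end
    return starts


def blend_windows(windows, starts, total):
    """Merge per-window outputs, averaging wherever windows overlap."""
    acc = [0.0] * total
    count = [0] * total
    for start, values in zip(starts, windows):
        for k, value in enumerate(values):
            acc[start + k] += value
            count[start + k] += 1
    return [a / c for a, c in zip(acc, count)]


if __name__ == "__main__":
    frames = [float(i) for i in range(11)]  # stand-in for latent frames
    starts = window_starts(len(frames), window=4, overlap=2)
    windows = [frames[s:s + 4] for s in starts]
    print(starts, blend_windows(windows, starts, len(frames)))
```

Uniform averaging is the simplest merge rule. A real pipeline would more likely reconcile overlapping windows during denoising rather than after it.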
References
[1] OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization
[2] OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
[3] Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
[4] Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
[5] DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
[6] From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
[7] UniT: Unified Geometry Learning with Group Autoregressive Transformer
[8] Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
[9] Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
[10] TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
[11] WavFlow: Audio Generation in Waveform Space
[12] S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
[13] Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes
[14] FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching