Open Source LLM Spring 2026: What Changed in 2 Months
After tracking open-weight LLM releases for the past two months, here's what's actually moving the needle. Not hype — architecture and data decisions that matter.
1. Sliding Window Attention Goes Mainstream
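Almost everyone switched to sliding window attention (SWA), and context windows are growing substantially without model bloat. The exception: MiniMax M2.5 still uses plain GQA (Grouped-Query Attention) without a sliding window, but compensates purely through data quality on coding tasks.
Why it matters: You can now fit 200K+ context in models that previously handled 32K. Same parameter count, different attention mechanism: a sliding-window layer only attends to (and caches) the most recent tokens, so its KV cache stops growing with context length.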
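To make the mechanism concrete, here is a minimal sketch of a causal sliding-window mask and the per-layer KV-cache size it implies. The function names and the window size of 4,096 are illustrative assumptions, not values from any of the models above; real models often interleave sliding-window layers with some full-attention layers.

```python
from typing import List, Optional


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j <= i. Local: only the last `window` positions (including i).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_cache_entries(seq_len: int, window: Optional[int] = None) -> int:
    """Key/value positions one layer has to keep while decoding."""
    return seq_len if window is None else min(seq_len, window)


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # 'x' = attended, '.' = masked out
    print("".join("x" if allowed else "." for allowed in row))

# Hypothetical numbers: a 200K-token context with a 4,096-token window.
full_entries = kv_cache_entries(200_000)
swa_entries = kv_cache_entries(200_000, window=4_096)
print(full_entries, swa_entries)
```

The cache for a sliding-window layer is capped at the window size, which is why context length can grow without a matching growth in memory for those layers.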
2. QK-Norm Spreading
QK-Norm (query-key normalization) is spreading: an RMSNorm-style normalization applied to the queries and keys before the attention dot product. Traces back to Gemini 3 architecture. Stabilizes training at scale with negligible extra compute.
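A rough sketch of the idea, assuming QK-Norm means applying an RMSNorm-style rescaling to each query and key vector before the scaled dot product. The learned gain is omitted, and the vectors and helper names are made up for illustration.

```python
import math
from typing import List


def rms_norm(vec: List[float], eps: float = 1e-6) -> List[float]:
    """Rescale vec so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q: List[float], k: List[float], qk_norm: bool) -> float:
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Deliberately large activations, the kind that can blow up softmax inputs.
q = [30.0, -40.0, 10.0, 20.0]
k = [25.0, -35.0, 5.0, 15.0]

raw = attention_logit(q, k, qk_norm=False)
normed = attention_logit(q, k, qk_norm=True)
print(raw, normed)
```

With normalization, the logit magnitude is bounded by roughly the square root of the head dimension regardless of how large the activations grow, which is the stabilizing effect in question.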
3. Multimodal Pretraining Early
Kimi k2.5 showed that pretraining on images at early stages (not just late-stage fine-tuning) significantly helps reasoning. The model learns visual concepts before language alignment, making downstream multimodal tasks more robust.
4. GLM-5 (Z.ai) — Not a DeepSeek Clone
On release, GLM-5 matched GPT-5.2 / Opus 4.5 / Gemini 3 Pro on key benchmarks. What's inside: heavily modified DeepSeek-V2 architecture with changed parameters, especially active expert count in the MoE layers.
Key difference: It's not "DeepSeek with a new name". The routing and expert allocation are fundamentally redesigned.
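For readers unfamiliar with what "active expert count" controls, here is a toy top-k router. It is a generic sketch, not GLM-5's or DeepSeek's actual routing: the softmax-over-selected-experts weighting, the scalar "experts", and all names are assumptions for illustration.

```python
import math
from typing import Callable, List, Tuple


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(
    x: float,
    experts: List[Callable[[float], float]],
    router_logits: List[float],
    k: int,
) -> float:
    """Only the k routed experts run; the rest cost nothing for this token."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy experts that just scale their input; k=2 of them are active.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
print(routes, output)
```

Changing `k` changes how many parameters are touched per token without changing the total parameter count, which is why it is one of the main knobs when adapting an MoE design.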
5. Step 3.5 Flash — Efficiency King
196B parameter MoE, architecturally similar to DeepSeek but:
- 3x faster inference speed
- Multi-Token Prediction (predicts 3 additional tokens per step on top of the usual next token)
- Currently #2 on OpenRouter by token consumption
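As a back-of-the-envelope illustration of why multi-token prediction can help throughput, the sketch below assumes the extra tokens are used as drafts that are checked left to right, with the first rejection discarding the rest. That usage pattern, the independent per-token acceptance probability, and the function name are all assumptions, not details confirmed for Step 3.5 Flash.

```python
def expected_tokens_per_step(extra_tokens: int, accept_prob: float) -> float:
    """Expected tokens emitted per forward pass.

    One token is always produced. Draft token i only counts if it and all
    earlier drafts were accepted, which happens with probability accept_prob**i.
    """
    return 1.0 + sum(accept_prob ** i for i in range(1, extra_tokens + 1))


# Hypothetical acceptance rates for 3 extra predicted tokens.
for p in (0.0, 0.5, 0.8, 1.0):
    print(p, round(expected_tokens_per_step(3, p), 3))
```

Under these assumptions the gain depends heavily on how often the drafts are accepted: with 3 extra tokens the ceiling is 4 tokens per step, and a real-world speedup will land somewhere below that.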
The catch: It tops benchmarks, but Chatbot Arena tells a different story. Benchmarks ≠ real user preference.
What Surprised Me
The pricing. Z.ai raised GLM-5 subscription prices 2x immediately — up to $160/month for max tier. Open-source models aren't free to run at scale, and providers are pricing accordingly.
Production Implications
| Model | Best For | Watch Out |
|---|---|---|
| GLM-5 | General reasoning, coding | Cost |
| Step 3.5 Flash | High-throughput APIs | Arena scores |
| MiniMax M2.5 | Coding tasks | GQA limits context |
| Kimi k2.5 | Multimodal apps | Early pretraining specifics |
What I'm Watching Next
- Streaming KV-cache compression for longer contexts
- Whether anyone replicates the Kimi early-multimodal pretraining approach
- If Step 3.5's multi-token prediction becomes standard
More architecture deep-dives and production AI notes from inside a bank — follow my Telegram channel:
https://t.me/ai_tablet (Russian, technical)