DEV Community

Ai developer

Open Source LLM Spring 2026: What Changed in 2 Months

After tracking open-weight LLM releases for the past two months, here's what's actually moving the needle. Not hype — architecture and data decisions that matter.

1. Sliding Window Attention Goes Mainstream

Almost everyone has switched to SWA, so context windows are growing substantially without model bloat. The exception: MiniMax M2.5 still relies on GQA (Grouped-Query Attention) without a sliding window, and compensates purely through data quality on coding tasks.
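To make the idea concrete, here is a minimal sketch of a sliding-window causal mask: each token attends only to itself and the most recent tokens inside the window. The function names and the toy sizes are mine, not taken from any of the models above.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j <= i. Windowed: only the last `window` positions, including i.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed pairs and '.' for masked pairs."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(sliding_window_mask(seq_len=6, window=3)))
    # x.....
    # xx....
    # xxx...
    # .xxx..
    # ..xxx.
    # ...xxx
```

With a window at least as long as the sequence, this reduces to the ordinary causal mask, which is the full-attention case.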

Why it matters: You can now fit 200K+ context in models that previously handled 32K. Same parameter count, different attention mechanism.
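A rough back-of-the-envelope calculation shows where the savings come from. The config below (48 layers, 8 KV heads, head dimension 128, fp16, a 4096-token window) is hypothetical and not the spec of any model in this post. It also assumes every layer is windowed, whereas real models often keep some full-attention layers.

```python
from typing import Optional


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    window: Optional[int] = None,
    bytes_per_value: int = 2,  # fp16 / bf16
) -> int:
    """Approximate KV-cache size for one sequence.

    With a sliding window, each layer only needs to keep the last
    `window` positions instead of the whole sequence.
    """
    cached_positions = seq_len if window is None else min(seq_len, window)
    # Factor 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * cached_positions * bytes_per_value


if __name__ == "__main__":
    gib = 1024 ** 3
    full = kv_cache_bytes(48, 8, 128, seq_len=200_000)
    swa = kv_cache_bytes(48, 8, 128, seq_len=200_000, window=4096)
    print(f"full attention: {full / gib:.2f} GiB")
    print(f"sliding window: {swa / gib:.2f} GiB")
```

Note that `n_kv_heads` is where GQA helps (fewer KV heads than query heads), while `window` is where SWA helps. The two are independent knobs in this formula.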

2. QK-Norm Spreading

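QK-Norm (query-key normalization) is spreading: it applies an RMSNorm-style normalization to the query and key vectors before the attention dot product. In recent open-weight models it traces back to the Gemma 3 architecture. It stabilizes training at scale with negligible extra compute.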
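A minimal sketch of the mechanism in plain Python, assuming RMSNorm is applied per head to the query and key vectors before the scaled dot product. Real implementations use a learned per-dimension gain and work on batched tensors. The `gain` argument and function names here are illustrative only.

```python
import math


def rms_norm(vec, gain=None, eps=1e-6):
    """Scale `vec` so its root-mean-square is ~1, then apply an optional gain."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    if gain is None:
        gain = [1.0] * len(vec)
    return [g * x * scale for g, x in zip(gain, vec)]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for one query/key pair."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


if __name__ == "__main__":
    # Activations with a large norm, as can happen during unstable training.
    q = [300.0, -400.0, 100.0, 200.0]
    k = [250.0, -350.0, 150.0, 100.0]
    print("without QK-Norm:", attention_logit(q, k, use_qk_norm=False))
    print("with QK-Norm:   ", attention_logit(q, k, use_qk_norm=True))
```

With unit gain, the normalized logit can never exceed sqrt(head_dim) in magnitude, no matter how large the raw activations get. That bound on the softmax input is the stabilizing effect.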

3. Multimodal Pretraining Early

Kimi k2.5 showed that pretraining on images at early stages (not just late-stage fine-tuning) significantly helps reasoning. The model learns visual concepts before language alignment, making downstream multimodal tasks more robust.

4. GLM-5 (Z.ai) — Not a DeepSeek Clone

On release, GLM-5 matched GPT-5.2 / Opus 4.5 / Gemini 3 Pro on key benchmarks. What's inside: a heavily modified DeepSeek-V2 architecture with changed hyperparameters, especially the number of active experts in the MoE layers.

Key difference: it's not "DeepSeek with a new name"; the routing and expert allocation are fundamentally redesigned.
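For readers less familiar with MoE, here is a generic sketch of what "active expert count" means. It uses top-k routing with a softmax over the selected experts, which is one common variant and not a description of GLM-5's or DeepSeek's actual router. All names and numbers are hypothetical.

```python
import math


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_params(shared_params, params_per_expert, k):
    """Parameters touched per token: shared layers plus the k routed experts."""
    return shared_params + k * params_per_expert


if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0, 1.5]  # one router score per expert
    for expert, weight in route_top_k(logits, k=2):
        print(f"expert {expert}: weight {weight:.3f}")
```

Changing `k` leaves the total parameter count untouched but changes how many parameters each token actually uses, which is why it is such a sensitive knob for both quality and inference cost.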

5. Step 3.5 Flash — Efficiency King

196B parameter MoE, architecturally similar to DeepSeek but:

  • 3x faster inference
  • Multi-Token Prediction (drafts 3 additional tokens per step rather than producing just 1 token)
  • Currently #2 on OpenRouter by token consumption
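To get a feel for what multi-token prediction can buy, here is a simplified estimate. It assumes the extra tokens are treated as drafts that are verified in order, that each draft is accepted independently with the same probability, and that generation stops at the first rejection. Those assumptions are mine. The post's source does not describe how Step 3.5 Flash verifies its extra tokens.

```python
def expected_tokens_per_step(n_extra: int, accept_prob: float) -> float:
    """Expected tokens emitted per forward pass.

    One token is always produced. Draft i (1-based) only counts if all
    drafts up to and including i were accepted, which happens with
    probability accept_prob ** i.
    """
    if n_extra < 0:
        raise ValueError("n_extra must be non-negative")
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    return sum(accept_prob ** i for i in range(n_extra + 1))


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):
        print(f"accept_prob={p}: {expected_tokens_per_step(3, p):.3f} tokens/step")
```

Under these assumptions the ceiling with 3 extra tokens is 4 tokens per step, and the real gain depends heavily on how often the drafts are accepted.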

The catch: it tops the benchmarks, but Chatbot Arena tells a different story. Benchmarks ≠ real user preference.

What Surprised Me

The pricing. Z.ai doubled GLM-5 subscription prices immediately, up to $160/month for the max tier. Open-source models aren't free to run at scale, and providers are pricing accordingly.

Production Implications

Model          | Best For                  | Watch Out
GLM-5          | General reasoning, coding | Cost
Step 3.5 Flash | High-throughput APIs      | Arena scores
MiniMax M2.5   | Coding tasks              | GQA limits context
Kimi k2.5      | Multimodal apps           | Early pretraining specifics

What I'm Watching Next

  • Streaming KV-cache compression for longer contexts
  • Whether anyone replicates the Kimi early-multimodal pretraining approach
  • If Step 3.5's multi-token prediction becomes standard

More architecture deep-dives and production AI notes from inside a bank — follow my Telegram channel:

https://t.me/ai_tablet (Russian, technical)


