Ai developer

Posted on May 28

Open Source LLM Spring 2026: What Changed in 2 Months

#ai #llm #rag #machinelearning

After tracking open-weight LLM releases for the past two months, here's what's actually moving the needle. Not hype — architecture and data decisions that matter.

1. Sliding Window Attention Goes Mainstream

Almost everyone switched to SWA (Sliding Window Attention). Context windows are growing substantially without model bloat. The exception: MiniMax M2.5 still uses GQA (Grouped-Query Attention) without a sliding window, and compensates purely through data quality on coding tasks.

Why it matters: You can now fit 200K+ context in models that previously handled 32K. Same parameter count, different attention mechanism.
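To make the "same parameter count, different attention mechanism" point concrete, here is a minimal sketch of a causal sliding-window mask and the per-layer KV-cache size it implies. The window size of 4,096 is an illustrative assumption, not a figure from any of the models above, and real models often interleave local and global layers, so the actual savings are smaller.

```python
from typing import List, Optional


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Causal mask: token i may attend to token j only if i - window < j <= i."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_cache_positions(seq_len: int, window: Optional[int] = None) -> int:
    """Key/value positions kept per layer and KV head once seq_len tokens are processed."""
    return seq_len if window is None else min(seq_len, window)


# Visualize a tiny mask: 'x' = attended, '.' = masked out.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Hypothetical numbers: a 200K-token prompt with a 4,096-token window.
full_positions = kv_cache_positions(200_000)
local_positions = kv_cache_positions(200_000, window=4_096)
print(full_positions, local_positions)
```

The weights are untouched by the mask. What changes is how many past positions each layer has to keep around, which is why the cache for a windowed layer stops growing once the sequence exceeds the window.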

2. QK-Norm Spreading

QK-Norm (query-key normalization) is spreading: queries and keys are normalized, typically with RMSNorm, before the attention dot product. It traces back to the Gemini 3 architecture. It stabilizes training at scale with negligible extra compute.
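A rough sketch of the idea, assuming RMSNorm without the learned gain that real implementations usually add per head. The vectors are made-up values chosen to have large magnitudes. After normalization each vector has unit root-mean-square, so the scaled logit can never exceed the square root of the head dimension.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Rescale vec to (approximately) unit root-mean-square. Learned gain omitted."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product attention logit for a single query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Hypothetical query/key vectors with large activations.
q = [120.0, -80.0, 45.0, 300.0]
k = [90.0, -60.0, 10.0, 250.0]

raw_logit = attention_logit(q, k, qk_norm=False)
normed_logit = attention_logit(q, k, qk_norm=True)
print(raw_logit, normed_logit)
```

Without the norm, the logit grows with the magnitude of the activations, which is what pushes softmax into saturation during long training runs. With it, the logit stays bounded regardless of how large the raw projections get.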

3. Multimodal Pretraining Early

Kimi k2.5 showed that pretraining on images at early stages (not just late-stage fine-tuning) significantly helps reasoning. The model learns visual concepts before language alignment, making downstream multimodal tasks more robust.

4. GLM-5 (Z.ai) — Not a DeepSeek Clone

On release, GLM-5 matched GPT-5.2 / Opus 4.5 / Gemini 3 Pro on key benchmarks. What's inside: a heavily modified DeepSeek-V2 architecture with different hyperparameters, especially the number of active experts in the MoE (Mixture-of-Experts) layers.

Key difference: It's not "DeepSeek with a new name". The routing and expert allocation are fundamentally redesigned.
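For readers who haven't looked inside an MoE layer, here is a minimal sketch of top-k routing, where k is the "active expert count". It uses one common variant (softmax over the selected experts only). It is not GLM-5's or DeepSeek's actual router, and the logits are invented for illustration.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = ranked[:k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.3, -1.0, 1.7, 0.4, 2.9, -0.5, 0.0]

routes_k2 = top_k_route(logits, k=2)
routes_k4 = top_k_route(logits, k=4)
print(routes_k2)
print(routes_k4)
```

Changing k changes how many expert feed-forward blocks run per token, and therefore the active parameter count and the per-token compute, while the total parameter count stays the same.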

5. Step 3.5 Flash — Efficiency King

A 196B-parameter MoE, architecturally similar to DeepSeek, but with:

3x faster inference
Multi-Token Prediction (predicts 3 additional tokens per step, not just the next one)
Currently #2 on OpenRouter by token consumption
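A back-of-the-envelope sketch of why predicting extra tokens can speed up decoding. It assumes the extra tokens are used as drafts that are verified in order, each accepted with the same independent probability, with acceptance stopping at the first rejection. The acceptance probabilities below are hypothetical, not measurements of Step 3.5 Flash, and the model ignores the overhead of the extra prediction heads.

```python
def expected_tokens_per_step(extra_tokens: int, accept_prob: float) -> float:
    """Expected tokens emitted per decoding step.

    One token is always produced; the i-th drafted token only counts if
    all i drafts up to and including it were accepted (probability p**i).
    """
    return sum(accept_prob ** i for i in range(extra_tokens + 1))


# Three extra tokens per step, at a few hypothetical acceptance rates.
for p in (0.5, 0.8, 0.95):
    print(p, round(expected_tokens_per_step(3, p), 3))
```

Under these assumptions the ceiling with 3 extra tokens is 4 tokens per step, and the real gain depends heavily on how often the drafts are accepted.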

The catch: it tops the benchmarks, but Chatbot Arena tells a different story. Benchmark scores are not the same as real user preference.

What Surprised Me

The pricing. Z.ai doubled GLM-5 subscription prices immediately, up to $160/month for the max tier. Open-source models aren't free to run at scale, and providers are pricing accordingly.

Production Implications

| Model | Best For | Watch Out For |
| --- | --- | --- |
| GLM-5 | General reasoning, coding | Cost |
| Step 3.5 Flash | High-throughput APIs | Arena scores |
| MiniMax M2.5 | Coding tasks | GQA limits context |
| Kimi k2.5 | Multimodal apps | Early pretraining specifics |

What I'm Watching Next

Streaming KV-cache compression for longer contexts
Whether anyone replicates the Kimi early-multimodal pretraining approach
If Step 3.5's multi-token prediction becomes standard

More architecture deep-dives and production AI notes from inside a bank — follow my Telegram channel:

https://t.me/ai_tablet (Russian, technical)

