$80 per quarter. Less than half the price of Claude Max, with unlimited coding AI calls. I signed up.
When GLM-5 launched in February, I went through the benchmarks carefully. SWE-bench Verified at 77.8%, the first open-source model to break 50 on the Intelligence Index. The gap with Opus 4.6 had narrowed significantly. Priced at a sixth to a tenth of the Opus API, the GLM Coding Plan Pro felt like a no-brainer.
I used it for two months straight. The short version: it's a model that shines — but only when you put it in the right spot.
Auto-Dev with MidWayDer: Expectations vs. Reality
The most aggressive use case was MidWayDer auto-development. I set up an auto-dev-pd cron job that ran GLM-5 every hour, generating code and committing automatically. "Unlimited calls" sounds carefree, but I was anything but. I obsessively tracked token usage per cron run, constantly checking whether this was really staying within the $80/quarter budget.
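The obsessive tracking amounted to something like the sketch below: estimate each run's cost from the usage object the API returns, then project it across a quarter of hourly runs. The rate constants and token counts here are made-up placeholders for illustration, not GLM's actual pricing, and the field names just follow the common OpenAI-style usage shape.

```python
# Hypothetical per-run budget tracking for an hourly auto-dev cron.
# Rates are illustrative placeholders (USD per 1M tokens), NOT real pricing.

QUARTER_BUDGET_USD = 80.0
RUNS_PER_QUARTER = 24 * 90  # hourly cron over roughly three months

def run_cost_usd(usage: dict, in_rate: float = 0.6, out_rate: float = 2.2) -> float:
    """Estimate one run's cost from an OpenAI-style usage dict."""
    return (usage["prompt_tokens"] * in_rate
            + usage["completion_tokens"] * out_rate) / 1_000_000

def projected_quarter_spend(recent_costs: list[float]) -> float:
    """Project quarterly spend from the average cost of recent runs."""
    avg = sum(recent_costs) / len(recent_costs)
    return avg * RUNS_PER_QUARTER

# Example: a typical run with placeholder token counts
cost = run_cost_usd({"prompt_tokens": 40_000, "completion_tokens": 4_000})
print(projected_quarter_spend([cost]) <= QUARTER_BUDGET_USD)  # True
```

With these made-up numbers the projection lands under budget, but the point of the exercise was exactly this: "unlimited" plans still have a plausible cost ceiling the provider assumes, and an hourly cron tests it.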
But the output wasn't great.
The UI was rough, and the feature implementations were unstable. Every time I opened an auto-committed diff, it was the same reaction: "Why did it write it like this?" Tasks that Opus would nail in one shot needed two or three rounds of fixes with GLM-5. The whole point of auto-dev is that it runs while you're not looking. If you have to review every commit anyway, the "auto" part loses its meaning.
The difference was most obvious in complex reasoning. Understanding component dependencies, considering the overall architecture, writing code with structural awareness — Opus was clearly a level above. GLM-5 could handle individual components fine, but it lacked the bird's-eye view to tie everything together.
But honestly, I'm not sure this was entirely the model's fault. I might not have been using GLM-5 properly. I was throwing prompts at it the same way I'd throw them at Opus — loosely, expecting it to just figure things out. Every model has a different sweet spot for how you give instructions. If I'd broken my prompts into smaller chunks, or set up dedicated GLM context files the way I maintain CLAUDE.md, the results might have been different. Opus is forgiving with vague prompts — it reads your intent well. I'd gotten spoiled by that.
The auto-dev results with GLM-5 were disappointing, yes. But whether that was the model's ceiling or my failure to use it right — I'd say it's fifty-fifty.
Subscribe and Get an API Key — The Real Killer Feature
The biggest advantage of the GLM Coding Plan isn't just the price. It's that you get an API key with your subscription. You're not locked into OpenClaw or Claude Code — you take that key and use it anywhere. That changes things.
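Concretely, "use it anywhere" looks something like this sketch, assuming the subscription key works against an OpenAI-compatible chat endpoint. The base URL and model name below are placeholders I made up, not confirmed values; the request is constructed but never sent.

```python
# Sketch: pointing any tool at the subscription's API key.
# Base URL and model name are placeholder assumptions.
import json
import os
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "https://api.example-glm.com/v1",
                       model: str = "glm-5") -> urllib.request.Request:
    """Construct (but do not send) an OpenAI-style chat-completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('GLM_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("What does this error mean?")
print(req.full_url)
```

Once the key lives in an environment variable instead of inside one vendor's app, any bot, cron job, or agent framework can consume it. That portability is what the rest of this section is about.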
I ran GLM-5 as the base model for EtherBot, my Telegram chatbot. Daily conversations, quick coding questions. For this use case, the response speed was solid and Korean language handling was surprisingly good. I wasn't asking for complex architecture design — just "what's this error" and "what does this function do" level stuff. GLM-5 handled that without breaking a sweat.
It's also running the AI Signal category on radarlog.kr. A GitHub Actions cron scrapes AI-related sources, feeds them to GLM-5 for summarization, and auto-publishes posts. This is a use case where GLM-5 genuinely shines. Structuring information into a fixed format is something it handles reliably. Running daily auto-writing on the Opus API would bleed money fast — one GLM-5 API key covers it all at a fraction of the cost.
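The reason this pipeline is reliable is that the model never has to invent structure: every scraped item goes through the same fixed template, and the output gets a cheap sanity check before auto-publishing. The section names below are illustrative, not the actual radarlog.kr format.

```python
# Hypothetical sketch of the fixed-format summarization step.
# Section names are illustrative, not the real AI Signal template.

TEMPLATE = """Summarize the article below into exactly these sections:
## Headline
## Key Points (3 bullets)
## Why It Matters

Article:
{body}
"""

def build_summary_prompt(body: str, max_chars: int = 8000) -> str:
    """Clamp the scraped body and drop it into the fixed template."""
    return TEMPLATE.format(body=body[:max_chars])

def looks_well_formed(summary: str) -> bool:
    """Cheap structural check before auto-publishing a post."""
    required = ("## Headline", "## Key Points", "## Why It Matters")
    return all(heading in summary for heading in required)
```

Constraining the task this tightly is what moves it into GLM-5's comfort zone: slot-filling against a template, not open-ended composition.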
It also pulled its weight in my CrewAI hybrid strategy. When splitting agent roles, I assigned Tech Writer to GLM-5 — documentation, code comments, README generation. Repetitive, well-defined tasks. Using an Opus-tier model for this would be a waste of money, and GLM-5 was stable enough for the job.
A pattern emerged. Clear instructions + single tasks = GLM-5 is unbeatable on cost-performance. Complex reasoning + long context = use Opus. The "GLM for daily work, Claude for heavy artillery" strategy settled in naturally over two months.
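That routing rule is simple enough to write down. The sketch below is my own formalization of it, with made-up thresholds and model identifiers, not a real dispatcher from any framework.

```python
# Illustrative router for the "GLM for daily, Claude for heavy" split.
# Thresholds and model names are assumptions, not real identifiers.

def pick_model(context_tokens: int, steps: int) -> str:
    """Route to the cheap model unless the task shows heavy-reasoning signals."""
    needs_heavy = (
        context_tokens > 100_000  # long context: architecture-scale work
        or steps > 1              # multi-step planning, not a single task
    )
    return "claude-opus" if needs_heavy else "glm-5"

print(pick_model(context_tokens=2_000, steps=1))     # glm-5
print(pick_model(context_tokens=150_000, steps=4))   # claude-opus
```

In practice I applied this by hand rather than in code, but the decision boundary really was this crude: context size and number of reasoning steps, nothing fancier.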
45.3 Points — What the Number Means
On March 27th, GLM-5.1 dropped.
It scored 45.3 on a coding evaluation using Claude Code as the testing tool. That's 94.6% of Opus 4.6's 47.9 points. Up from GLM-5's 35.4 — a 28% improvement. For a model that's only a month and a half newer, that's a serious jump.
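For anyone who wants to check the ratios, they follow directly from the scores quoted above:

```python
# Verifying the quoted ratios from the raw scores.
glm51, opus, glm5 = 45.3, 47.9, 35.4
print(round(glm51 / opus * 100, 1))      # 94.6 (% of Opus 4.6's score)
print(round((glm51 / glm5 - 1) * 100))   # 28 (% improvement over GLM-5)
```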
Here's the interesting part: Claude Code is a tool optimized for the Claude model family. For GLM-5.1, this was a pure away game. Scoring 94.6% on your competitor's home turf suggests actual capability might be even higher.
The architecture is identical to GLM-5. 744B total parameters, 40B active MoE, DeepSeek Sparse Attention. 204,800 token context window, 131,072 max output tokens. No structural changes — they pushed hard on post-training. Z.ai's "Slime" async RL infrastructure seems to enable this kind of rapid iteration.
Pricing remains compelling. The Coding Plan starts at $3/month promotional, $10/month standard. My Pro plan at $80/quarter is still dramatically cheaper than Opus at $100–200/month.
The Meta Twist: 5.1 Organized This Post
Here's a fun one.
When I decided to write this blog post, I needed to organize my raw notes. So I pointed OpenClaw at GLM-5.1 and said "summarize what I did with GLM-5." The structured breakdown it returned became the skeleton of this post.
Would GLM-5 have organized it this cleanly? Hard to say. But 5.1 caught the context I threw at it and pulled out a structured breakdown: MidWayDer auto-dev experience, EtherBot usage, CrewAI role assignment, pros and cons, pricing tier, expectations for 5.1. Clean categorization, no fluff.
Maybe this is what 28% improvement feels like in practice. Benchmark numbers are abstract. "I asked it to organize my notes and got a usable structure back" — that's concrete.
What I'm Watching For
The biggest question: does 28% better on coding benchmarks translate to better auto-dev quality?
The weak point with GLM-5 was unstable UI and feature quality in MidWayDer auto-development. If 5.1 has improved there, I'm re-enabling the auto-dev-pd cron. Unlimited calls plus better code quality could mean actually useful code accumulating while I sleep.
Opus still wins in certain areas. 1M token ultra-long context, extreme-depth reasoning, complex multi-step agent workflows. That's a structural gap that 5.1 probably won't close overnight. But if everyday coding tasks now land at 94.6% of Opus quality, the "daily" side of my "GLM for daily, Claude for heavy" strategy gets a lot more robust.
$80 per quarter for this level of quality. For someone juggling side projects, that's close to optimal cost-performance.
"Two months with 5 taught me where the ceiling was. Now I get to find out how far 5.1 pushed it."