There's a specific kind of late-night feeling familiar to anyone who has tried to build something with a chatbot. You describe what you want — "make me a little app that tracks how much water I drink" — and the model fires back fifty lines of code that look impressive, almost work, and silently break in three places. You paste the error back. You get another fifty lines. By 2 a.m. you're not building software anymore; you're playing a kind of haunted telephone game with a system that keeps confidently handing you broken tools and walking away.
Inside the AI world, this experience earned a name: vibe coding. You describe the vibe. The model conjures a snippet. You patch the snippet. Nobody, strictly speaking, is doing engineering. It's closer to commissioning a sketch from a street artist — fast, occasionally beautiful, and absolutely not a load-bearing structure.
The team behind GLM-5, a new open-weights large language model from the Chinese research group Z.ai, says the era of vibe coding should be ending. Their pitch, captured in the paper's title — From Vibe Coding to Agentic Engineering — is that the next leap isn't about producing better snippets. It's about producing a model that can act like an actual junior engineer: one who reads a ticket, plans, edits files across a codebase, runs the tests, fixes what broke, and keeps going for hours without losing the plot.
That's a much harder claim than "we beat the benchmark." So it's worth slowing down and asking what they actually changed under the hood, and what each of those changes is really trying to do.
The Difference Between a Snippet and a Shift
Picture the difference between two requests at a hardware store.
The first: "I need a piece of wood about this long." A clerk hands you a board. You take it home, cut it slightly wrong, come back, get another. This is vibe coding. Each interaction is short. Each output is small. Every mistake costs you another trip.
The second: "I need to build a deck out back. Here's a photo of the yard. Can you handle it?" The contractor walks the site, pulls permits, schedules concrete, orders lumber, supervises the crew, fixes the railing when it splinters, and hands you keys two weeks later. This is agentic engineering: not a single output, but a sustained process of planning, acting, observing, and self-correcting toward a goal that takes hundreds of small decisions to reach.
Most chat-style AI today, even the best, is essentially the lumber clerk. It hands you boards. The GLM-5 team's central wager is that an AI that can act as the contractor — that can hold a goal in mind across a long project — is a genuinely different category of tool, and it requires changes deeper than just making the model bigger.
Why the Old Architecture Started to Buckle
To understand why GLM-5 is built the way it is, it helps to understand what was breaking.
Modern language models work, very roughly, by reading every word in their context window and figuring out how each word relates to every other word. This is called attention, and the easiest way to picture it is as a meeting room where every participant has to make eye contact with every other participant before anyone can speak. With ten people, that's manageable. With a thousand people — say, a thousand pages of source code — it becomes absurd. The room grinds to a halt under the sheer combinatorial weight of who-must-look-at-whom.
This is the long-context problem. If you want a model to keep an entire codebase, an entire bug report, and an entire history of previous attempts in mind at once — which a real engineer does effortlessly — the standard attention mechanism gets ruinously expensive. Training such a model costs a fortune. Running it costs almost as much.
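If you'd rather see the cost problem in code than in metaphor, here is a toy sketch of standard dense attention. It's the textbook mechanism, not anything specific to GLM-5, and the names and shapes are purely illustrative. The point is the n-by-n score matrix: double the context and the work quadruples.

```python
import numpy as np

def full_attention(q, k, v):
    """Standard (dense) self-attention over a sequence of length n.

    q, k, v: arrays of shape (n, d). The score matrix below is n x n,
    which is why compute and memory grow quadratically with context length.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)    # (n, n): every token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v               # (n, d)

# Doubling the context length quadruples the size of the score matrix.
for n in (1_000, 2_000, 4_000):
    print(f"n={n:>5}  score-matrix entries = {n * n:,}")
```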
GLM-5 attacks this with something the paper calls DSA, a form of sparse attention. The intuition is simple: in a real meeting, you don't need everyone to lock eyes with everyone else. You need the right people to listen to the right people at the right moment. Sparse attention is like a meeting facilitator who quietly tells most participants to ignore most other participants, and only routes the genuinely relevant connections. The math gets cheaper. The room can hold more people. The model can keep more of the codebase in its head at once without melting the GPUs.
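To make the "facilitator" idea concrete, here's a minimal sketch of one common flavor of sparse attention, where each query only listens to its top-k highest-scoring keys. This is a generic illustration, not the paper's actual DSA mechanism, and a real sparse kernel would never materialize the full score matrix the way this toy version does for clarity.

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=64):
    """Generic top-k sparse attention sketch (illustrative, not GLM-5's DSA).

    For each query, only the `keep` highest-scoring keys participate; the rest
    are masked out, so each row effectively does O(keep) work instead of
    attending to the whole sequence. (A real implementation would avoid
    computing the full (n, n) score matrix at all.)
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n, n), built here only for clarity
    cutoff = np.sort(scores, axis=-1)[:, -keep][:, None]  # keep-th largest score per row
    masked = np.where(scores >= cutoff, scores, -np.inf)  # silence everyone else
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 512, 32
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(topk_sparse_attention(q, k, v, keep=64).shape)   # (512, 32)
```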
Alongside that, GLM-5 keeps a structural choice from its predecessor: a Mixture of Experts layout. Instead of one enormous model that has to know everything, imagine a hospital. A patient walks in; a triage nurse looks at the symptoms and routes them to cardiology, or dermatology, or psychiatry. Only that specialist examines the patient. The other specialists don't burn energy on a case that isn't theirs. Mixture of Experts works the same way: when a question comes in, a router quietly picks a small handful of internal "specialists" to do the work, while the rest of the model rests. You get the breadth of a giant institution at roughly the cost of running one department.
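The hospital analogy translates into surprisingly little code. Below is a toy Mixture-of-Experts layer with a "triage nurse" router that picks two specialists per token; the sizes and the random weights are made up for illustration and say nothing about GLM-5's real configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is just a small weight matrix in this toy sketch.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1   # the triage nurse

def moe_layer(x):
    """Route one token vector to its top_k experts and blend their outputs.

    Only top_k of the n_experts matrices are ever multiplied, which is why an
    MoE model can be huge in total parameters yet cheap per token.
    """
    logits = x @ router_w                         # one score per specialist
    chosen = np.argsort(logits)[-top_k:]          # indices of the best-scoring experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                          # softmax over the chosen experts only
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,)
```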
Stack DSA on top of MoE and you get a model that is paradoxically large and cheap — wide enough to know a lot, but lean enough that the running costs don't strangle the project.
Teaching It to Finish a Job
Architecture is only half the story. The other half is how you train a model to behave like an engineer who finishes things, rather than a model that confidently hallucinates a half-finished function and shrugs.
The standard tool for this kind of behavioral shaping is reinforcement learning. The image to hold in your head is a child learning to ride a bike. You don't teach them by lecture; you let them try, and you cheer when they balance and steady the handlebars when they wobble. Over thousands of attempts, they develop something like a feel — not a rule, but a learned instinct.
For an AI, "trying" means generating a candidate solution, and "cheering or steadying" means scoring whether the solution worked and nudging the model's internal weights in response. The trouble is that, as conventionally implemented, this is brutally slow. The model produces an attempt. The training system grades it. The model produces the next attempt. Grade. Attempt. Grade. It's like a single piano teacher who insists on watching every note their student plays, in real time, before allowing the next note. Nobody learns very fast that way, and the teacher is exhausted.
GLM-5's team rebuilt this into what they call asynchronous reinforcement learning, which is less mysterious than it sounds. Asynchronous, in everyday English, just means "not in lockstep." Imagine instead of one piano teacher with one student, you have a music school. Some rooms are full of students improvising. Other rooms are full of teachers reviewing recordings of yesterday's sessions and writing feedback. Students don't have to wait for feedback to keep practicing. Teachers don't have to wait for students to keep grading. The whole school flows in parallel.
In machine learning terms: the part of the system that generates attempts and the part that learns from those attempts no longer have to take turns. They run side by side. The training keeps moving while new attempts are still being produced. The result is dramatically more efficient post-training, and the team says they had to build new algorithms specifically to keep this asynchronous setup from going off the rails — because in the music-school analogy, you really don't want students learning from feedback so old that their own playing has already evolved past the criticisms.
That staleness worry matters more than it sounds. Most reinforcement learning is good at short tasks: did you get this single answer right? But agentic engineering isn't a single answer. It's a chess game's worth of decisions: open this file, edit this function, run these tests, read the failure, change strategy, try again, ask for help, refactor. The reward — "did the project actually work?" — only arrives at the very end, after dozens or hundreds of moves. This is what researchers mean by long-horizon tasks, and traditional RL handles them about as gracefully as trying to teach someone chess by only telling them whether they won the whole game and never saying anything about individual moves.
The asynchronous agent algorithms in GLM-5 are designed, broadly, to give the model better feedback over those long stretches — to act, in our analogy, like a chess coach who can replay a 60-move game and point out which decisions twenty moves ago set up the loss. That capacity to learn from messy, drawn-out, multi-step interactions is what they argue actually moves the needle from "writes nice snippets" to "completes nontrivial software tasks."
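The classic textbook answer to "the reward only arrives at the end" is to smear that final signal backwards over earlier decisions with a discount factor. The snippet below shows that standard trick only to make the long-horizon problem tangible; it is not GLM-5's algorithm, and the numbers are invented.

```python
def discounted_returns(rewards, gamma=0.99):
    """Spread a sparse end-of-episode reward back across earlier decisions.

    Classic discounted-return credit assignment, shown purely to illustrate
    why a 60-move project gives early decisions only a faint learning signal.
    """
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# A 60-step "project" where the only feedback is success at the very end.
episode = [0.0] * 59 + [1.0]
credit = discounted_returns(episode)
print(f"credit at move 1:  {credit[0]:.3f}")    # early decisions get a diluted signal
print(f"credit at move 60: {credit[-1]:.3f}")   # the final step gets the full reward
```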

Figure 5: Overall training pipeline of GLM-5.
How They Claim to Know It Worked
Whenever an AI lab releases a model, the next ritual is the benchmark photo-op: a chart showing your model winning. GLM-5's team produces theirs, and the headline numbers are real. Across eight major tests of agentic skill, reasoning, and coding — ranging from a notoriously hard exam called Humanity's Last Exam to SWE-bench Verified, a kind of standardized engineering practical — GLM-5 lands roughly on par with Claude Opus 4.5 and GPT-5.2, the leading proprietary systems, and ahead of Gemini 3 Pro on average.

Figure 1: Results of GLM-5, GLM-4.7, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) across eight agentic, reasoning, and coding benchmarks.
The number that the open-source community will care about most is on a separate index. On the Artificial Analysis Intelligence Index v4.0 — a composite that blends ten different evaluations — GLM-5 scores 50. Its predecessor scored 42. More importantly, no openly released model had ever crossed the 50 mark before. That's the difference between a model you have to rent from somebody else's API and one you can, in principle, run on your own hardware and inspect.

Figure 2: Artificial Analysis Intelligence Index v4.0 incorporates ten evaluations spanning reasoning, knowledge, coding, and instruction following.
But static benchmarks have a known weakness. They are essentially the SAT for AIs: useful, but gameable, and increasingly suspected of leaking into training data. The more interesting test is one where humans, in the wild, judge the outputs head-to-head. That's what LMArena, run out of UC Berkeley, does — millions of real users posting real prompts, blindly comparing answers from competing models. On both the text and code leaderboards there, GLM-5 is the top open model, and roughly tied with the leading closed systems. This is a softer kind of evidence than benchmark scores, but in some ways more honest. It's the difference between a chef winning a competition with a tasting menu and that same chef getting consistently ranked highest by ten thousand random diners eating dinner.

Figure 3: On LMArena, GLM-5 is the #1 open model in both Text Arena and Code Arena.
The most interesting numbers, though, sit in the long-horizon tests. Vending-Bench 2, for instance, asks a model to run a small simulated business — making decisions over many turns, pricing, restocking, and adapting — for a sustained period. CC-Bench-V2 evaluates it on extended coding tasks that can't be solved in a single shot. These are the tests that come closest to the contractor-versus-clerk distinction. GLM-5's gains here, more than its score on any one-shot exam, are what the paper frames as the real evidence that something has shifted.

Figure 4: Results on long-horizon tasks. Left: Vending-Bench 2; Right: CC-Bench-V2.
What This Looks Like If It's Real
Strip away the leaderboards and imagine the everyday consequences if a model like this works as advertised — not in a demo, but in someone's actual workflow.
A solo developer who maintains a small open-source library files her own bug report on Sunday night. She doesn't open her editor. She forwards the report to an agent and goes to bed. By morning, the agent has reproduced the bug, traced it through three files she barely remembers writing, drafted a fix, run the test suite, noticed that the fix breaks an unrelated module, written a second patch, written tests for the regression it caught, and opened a pull request with a clear human-readable explanation. She reviews it over coffee. Most of the work she would have done on a Saturday is gone — not because the AI wrote slick code, but because it sustained attention across a small, real, multi-step task.
Or imagine a hospital IT team that has spent years putting off the migration of an internal scheduling system from a deprecated language. The migration isn't hard, exactly. It's tedious and long and requires reading thousands of files and not making mistakes that quietly delete patient appointments. An agent can chip away at that for weeks under supervision, with each step inspectable, where a human would burn out by day four. The point isn't that AI replaces the engineers; it's that it absorbs the kind of grinding work that has historically forced human engineers to either skip it or quit.
This is what "agentic engineering" actually means in a non-hype register: not magic, not autonomy in any deep sense, but a system that can keep at a defined job long enough to make a dent.
Where I'd Stay Skeptical
There are several things this paper does not, and probably cannot, fully prove.
The first is the gap between benchmarks and the messy world. Even Vending-Bench-style long-horizon tests are simulations. Real software engineering involves codebases full of undocumented quirks, broken conventions, and tribal knowledge that lives only in a Slack channel from 2022. A model that can finish a clean ticket on a clean repo may still flounder on the ones that actually pile up in working teams. The paper is honest about this, framing its long-horizon results as movement in the right direction rather than arrival.
The second is that "agentic" is, frankly, a word being asked to do an enormous amount of work right now. There's a meaningful difference between an agent that completes long tasks because it's been carefully coached on millions of examples, and an agent that is in any general sense reasoning about goals. The paper's framing leans toward the more impressive interpretation. The evidence supports the more modest one.
The third — and this is the open-source community's standing skepticism — is that benchmark dominance is a moving target, and the last few years have shown that the gap between proprietary leaders and open challengers can shrink in months and reopen just as fast. GLM-5 sitting at the top today says less about a permanent shift than about a particular team executing well at a particular moment. What matters more, long-term, is whether the architectural and training ideas — sparse attention to manage long contexts, mixture-of-experts for cheap breadth, asynchronous reinforcement learning to teach long-horizon behavior — propagate. They probably will, because they're sensible, and because the cost pressures pushing toward them aren't going away.
What's worth taking seriously, regardless of where the leaderboards land in six months, is the quiet repositioning of what an AI is for. The vibe coder asks for a snippet and gets one. The agentic engineer takes a task and brings it back finished. Whether or not GLM-5 is the model that proves that distinction matters, the distinction itself is likely the one the field will be organized around for the next several years. The clerks have been useful. The contractors, if they really are coming, are going to change what "asking for help" means.
📄 https://arxiv.org/abs/2602.15763
tags: ai, llm, coding, agents
🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/xtbp2x13