Opus 4.8 in Practice: A Real Coding Session

#ai #productivity #claudecode #automation

On an ambiguous date field, Opus 4.8 stopped and asked me instead of guessing wrong.
It flagged its own off-by-one bug mid-task, before I ran a single test.
The 1M context held all nine files at once, so zero "remind me what this does" round trips.
It still needs me for naming, product taste, and the call on what not to build.
SWE-Bench Pro is 69.2, but the abstention is the change I actually felt all day.

I had been putting off a refactor for three days. Not because it was hard in theory, but because it was the kind of multi-file, easy-to-break job where one wrong assumption costs you an afternoon. I gave it to Claude Code running Opus 4.8 on a Friday. Here is what happened at the keyboard, the moments that surprised me, and the parts that still need a human.

The task I would normally dread

The job was a syndication scheduler inside my blog tooling. A TypeScript service that reads a queue of articles, decides which platforms each one goes to, and stamps a publish time per platform. It had grown organically. The platform list lived in one file, the time logic in another, the queue reader in a third, and a tangle of helpers spread across six more. I wanted to add a per-platform delay window (post to Dev.to immediately, hold the editorial feeds back two hours) without breaking the existing same-day behavior that I had already fixed once before, painfully, around a timezone bug.

This is the category of work I dread. Not algorithmically interesting, just wide. Nine files in scope. The failure mode is never "it cannot be done", it is "I changed the queue reader and forgot the scheduler reads a stale shape three files over, and now Tuesday's articles silently never ship." I have lost whole afternoons to exactly that.

I opened the session, pointed it at the directory, and described the change in three sentences. No file list, no hand-holding on where things live. My honest expectation was the usual: a confident, mostly-right diff that I would then audit line by line, because the confidence is the dangerous part. Past models would happily restructure something and present it as finished, and the bugs hid in the seams between files.

That is not how the next twenty minutes went. The first thing it did was not write code. It read. Then it stopped and asked me a question I had not thought to answer.

It asked instead of guessing

The question was about a field called scheduledAt. In two of the nine files, that field meant "the time the article enters the queue." In one file, the scheduler, it was being read as "the time to actually publish." Same name, two meanings, and my new delay window touched both. Opus 4.8 noticed the collision, laid out both interpretations in plain language, and said it was not sure which one I intended for the new code path. Then it waited.

It did not pick one and barrel ahead. That restraint is the headline change in this model, and Simon Willison flagged it in his review as honesty by abstention: it says "I am not sure" instead of confidently guessing when the answer is genuinely ambiguous. Anthropic frames it numerically as roughly 4x less likely to let its own uncertainty or bugs pass unremarked compared to prior models. On paper that is a stat. At the keyboard it felt like working with someone who reads the whole problem before forming an opinion.

Here is why it mattered. If it had guessed "publish time", my delay would have stacked on top of an already-shifted timestamp, and same-day articles would have drifted hours late. I would not have caught it in review because the diff would have looked clean. The bug only surfaces in production, on the days the queue is busy. By asking one question, it killed an afternoon-class bug before a single line existed.

I answered in five words. It proceeded. I had explored the older tier's honesty in my Opus 4.7 solo studio workflows, but the abstention here is sharper and arrives earlier, before code, not during review.

It flagged its own bug before I ran a test

Twenty lines into the new delay logic, mid-stream, it stopped writing and pointed at code it had just produced. Its own code. It said the date arithmetic for the two-hour hold had an off-by-one risk at the day boundary: if an article queued at 23:10, adding the hold could roll it past midnight, and the same-day check downstream would then reject it as belonging to the next day. It proposed clamping the comparison to the queue date rather than the shifted date, and asked if I agreed before continuing.

I had run zero tests. I had not even read the diff yet. The model caught a flaw in its own work and surfaced it unprompted, which is the practical face of that 4x-fewer-unremarked-flaws behavior. Older models would have written that bug, presented the diff as done, and left it for my test suite (or worse, production) to find.

This is the part that changed my rhythm. My old loop was: model writes, I read every line suspiciously, I run tests, tests catch what I missed, I go back. The new loop has the model running its own first-pass review inside the same turn. It does not replace my review. It front-loads the obvious-in-hindsight class of mistake so my review can focus on intent instead of arithmetic. I am not going to relist the benchmarks here. For the full spec breakdown across every test, see yesterday's launch roundup. What I felt was simpler: fewer surprises after the fact.

It kept the whole repo in its head

Nine files. The thing that wrecks multi-file refactors is forgetting. You fix file A, and three files later the model has lost the shape of the type you changed, so it writes against the old one. The usual symptom is a round trip: "remind me what QueueEntry looks like", or a confident edit against a field that no longer exists.

That did not happen once. The 1M token context window held all nine files, the type definitions, and the timezone fix from the earlier session at the same time. When it touched the scheduler, it correctly referenced the field shape it had read in the queue reader two hundred lines and several files earlier. When I asked it to also update the one test file that exercised the boundary case, it already knew which assertions would break and named them before opening the file.

I have written before about how the 1M context window changes the way I code, and this session was the cleanest demonstration yet. No re-pasting files. No "as a reminder" preambles. No drift between what it edited in file three and what it assumed in file seven. And critically, there is no long-context price premium, so feeding it the whole repo costs the same per token as feeding it one file. I stopped rationing what I showed it. I just showed it everything, which is how the consistency happens. The fewer round trips alone probably saved me ten minutes of re-explaining state to a model that had forgotten it.

Where it still needs me

I do not want to oversell this. Three places in that session, the model was clearly not the right decider, and pretending otherwise would make me sloppy.

Naming. It proposed delayWindowMs for the new field. Functional, forgettable. I changed it to holdUntil, because months from now I will read holdUntil and instantly know what it does, and I will read delayWindowMs and have to think. Naming is a taste call rooted in how my future self reads code, and the model optimizes for correct, not memorable.

Scope. At one point it offered to also add a retry queue for failed posts. Reasonable code, genuinely useful, and exactly the kind of thing I did not want. The discipline of shipping the small thing and not the adjacent tempting thing is mine to hold. The model will happily build what you let it; deciding what not to build is the actual job. I run a 15-repo studio, and that restraint is most of how I keep it sane, which is the spirit of running everything from one CLAUDE.md file.

Product taste. Whether a two-hour editorial hold is even the right behavior is a judgment about my readers and my distribution, not a code question. The model can implement any rule flawlessly. It cannot tell me the rule is wrong for my audience. And in fairness, it is not unbeatable: it trails GPT-5.5 on Terminal-Bench 2.1, and it is not magic. It is a very good pair of hands that knows when to stop.

Bottom Line

What changed about my workday is not speed, though it is faster. It is that the dread is gone. The wide, brittle, multi-file job that I deferred for three days is now a thing I start without flinching, because the model asks before it guesses, catches its own arithmetic before I do, and holds the whole repo without forgetting the shape of a type four files back. My review shifted from hunting for hidden bugs to deciding intent and naming, which is the work I actually want to be doing. The 5/25 list price held flat from the prior tier, so nothing about the budget changed, only the confidence did. If you want the templates and hooks I use to run sessions like this one, they live in the Claude Blueprint. Start with the job you have been avoiding. That is the one this model earns its keep on.