Two states weren't enough. Here's why I added WATCH.

#devops #sre #deployment #observability

A teammate had two browser tabs open and one Slack thread waiting on him.

The checkout API had just deployed. Error rate moved from 0.1% to 0.4% in the first ninety seconds. The cart API had deployed three minutes before, so the spike could have been either one — or neither, since traffic was up after a weekend release announcement. He was supposed to give the deploy channel a thumbs up or a rollback call in the next minute or two, and he didn't want to do either yet.

So he typed "let's watch just 5 more minutes" and went back to staring at the graph.

That decision had been happening invisibly in our deploy reviews for weeks. The verdict layer I had built didn't know it existed.

Most post-deploy verdict layers I'd seen collapse the decision into two states. STABLE or RISK. Pass or fail. Internal stage gates, deploy bots, "deploy health" checks — they all picked one binary and let the operator absorb the ambiguity. It felt clean to model. The first version of Relivio went the same way.

It didn't survive contact with how people actually decide.

The thing operators were doing that I wasn't modeling

I built Relivio with STABLE and RISK. The third option — "I don't know yet, keep watching" — was supposed to fall into STABLE by default, with the operator escalating manually if things stayed off.

About two weeks of test deploys later, that broke.

The patterns we kept hitting weren't clean. Small error rate shifts that didn't breach SLO. Latency moves that overlapped with a neighbor service's deploy. P99 spikes that resolved within four minutes. Each had a real signal — something measurable changed — but acting on them immediately would have been the wrong call.

Forcing those into STABLE meant the operator stopped looking, even when there was something to look at. Forcing them into RISK meant proposing rollback for changes that resolved themselves before the operator could even confirm them. Neither was the decision an experienced engineer would have made manually.

What experienced engineers were doing manually was a third thing. They were keeping the deploy window open. Not declaring victory, not pulling the trigger. Just staying close for a few more minutes.

That decision needed a name.

What WATCH actually is

WATCH is not indecision. It's the decision not to walk away yet.

That's the version I keep coming back to when I forget what I built.

RISK means the signal is strong enough to act on. Pause the rollout, roll back, contain blast radius, open the incident channel. WATCH means the signal is real but the right operator response is attention, not intervention. Stay close. Don't close the deploy window. Re-check in N minutes.

Those are different workflows downstream. RISK triggers rollback machinery, paging, post-incident review. WATCH triggers continued observation, conditional rollout, deploy window extension. If the verdict layer collapses them into one state, the operator has to either over-alert or under-alert — and once that happens, the verdict stops being trusted.
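A minimal sketch of that split, to make the difference concrete. The three state names come from this post; the action names and the `downstream_actions` helper are hypothetical labels for the workflows described above, not Relivio's actual API.

```python
from enum import Enum


class Verdict(Enum):
    STABLE = "STABLE"
    WATCH = "WATCH"
    RISK = "RISK"


# Hypothetical action names, chosen only to mirror the workflows in the text.
DOWNSTREAM = {
    Verdict.STABLE: [],
    Verdict.WATCH: [
        "continue_observation",
        "conditional_rollout",
        "extend_deploy_window",
    ],
    Verdict.RISK: [
        "trigger_rollback",
        "page_on_call",
        "open_post_incident_review",
    ],
}


def downstream_actions(verdict: Verdict) -> list[str]:
    """Return a copy of the workflow steps a verdict state kicks off."""
    return list(DOWNSTREAM[verdict])
```

The point of the table is that WATCH and RISK share no steps. Merging them into one state means every caller has to guess which list applies.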

The "5 more minutes" shape

The decision has a consistent shape, every time it shows up.

You see something. Error rate moved. Latency twitched. A new exception type fired once. It's not clearly bad — SLO holds, no customer reported anything, other services are calm. It's also not clearly nothing — the signal is there, you can see it, and if you walk away and it grows, you'll have less context when you come back.

The right move is to keep looking for a few more minutes, and let the noise sort itself out.

WATCH is that decision, made explicit by the verdict layer instead of held silently in the operator's head.

That's the part that actually mattered. Not the third state existing. The fact that the third state turned a silent operator decision into a recordable event with a name, an escalation condition, and a re-check time.

What the response shape becomes

Adding the third state didn't just extend an enum. It changed what a verdict had to carry.

In a binary world, STABLE was free of structure — there was nothing to look at. RISK carried affected_api, rationale, next_action — the operator was about to act, so they needed context. WATCH needed something different:

Specifically which signal is in motion (error rate on /api/orders/finalize, latency spike on the checkout cluster).

When to re-check, explicitly: a WATCH at T+3min should resolve to STABLE or RISK by T+8min, not hang indefinitely.

What threshold, if crossed, converts this WATCH into RISK, so the operator can read it and know "if X happens, I'll act."

The verdict stopped being a label. It became a structured decision record. The state told you which kind of decision it was. The rest told you what to do inside that decision.
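Here is one way that record could look, as a sketch rather than the real schema. `affected_api`, `rationale`, and `next_action` are the fields named above; `signal`, `recheck_by_min`, `escalate_if`, and the `Escalation` type are hypothetical names I'm using for the three things WATCH needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Escalation:
    """Hypothetical shape: the metric and level that turn WATCH into RISK."""
    metric: str
    threshold: float


@dataclass(frozen=True)
class VerdictRecord:
    state: str  # "STABLE", "WATCH", or "RISK"
    affected_api: Optional[str] = None
    rationale: Optional[str] = None
    next_action: Optional[str] = None
    # WATCH-specific fields (hypothetical names).
    signal: Optional[str] = None            # which signal is in motion
    recheck_by_min: Optional[int] = None    # minutes after deploy to resolve by
    escalate_if: Optional[Escalation] = None

    def __post_init__(self) -> None:
        # A WATCH without its metadata is just a label, so reject it early.
        if self.state == "WATCH":
            required = ("signal", "recheck_by_min", "escalate_if")
            missing = [name for name in required if getattr(self, name) is None]
            if missing:
                raise ValueError(f"WATCH verdict missing: {', '.join(missing)}")
```

A WATCH for the example above might be built as `VerdictRecord(state="WATCH", signal="error rate on /api/orders/finalize", recheck_by_min=8, escalate_if=Escalation("error_rate_pct", 1.0))`. The 1.0 threshold is an illustrative value, not a recommendation.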

For agents reading verdicts through MCP or REST, that matters even more. An agent reading STABLE walks away. An agent reading RISK takes a defined action. An agent reading WATCH needs to know exactly what to keep watching and when to come back. WATCH without that metadata is just a shrug to an agent.
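A rough sketch of what that looks like from the agent's side. The dictionary keys (`escalate_threshold`, `recheck_by_min`), the returned step names, and the rule that an uncrossed threshold at the deadline resolves to STABLE are all assumptions for illustration; the text only says a WATCH must resolve one way or the other by its re-check time.

```python
def agent_next_step(verdict: dict, observed_value: float, now_min: float) -> str:
    """Decide what an agent does with a verdict at `now_min` minutes after deploy."""
    state = verdict["state"]
    if state == "STABLE":
        return "walk_away"
    if state == "RISK":
        return "run_next_action"
    if state != "WATCH":
        raise ValueError(f"unknown state: {state!r}")

    # WATCH is only actionable if it says what to watch and until when.
    missing = [k for k in ("escalate_threshold", "recheck_by_min") if k not in verdict]
    if missing:
        raise ValueError(f"WATCH verdict missing: {', '.join(missing)}")

    if observed_value >= verdict["escalate_threshold"]:
        return "escalate_to_risk"
    if now_min >= verdict["recheck_by_min"]:
        # Assumption: deadline reached without crossing the threshold means STABLE.
        return "resolve_to_stable"
    return "keep_watching"
```

With a WATCH issued at T+3min, a re-check time of T+8min, and a threshold the signal never crosses, the agent keeps watching at minute 5 and resolves to STABLE at minute 8. It never hangs, because every call past the deadline returns a terminal step.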

What the distribution looked like

I want to put this in the right frame before saying a number.

These were early test runs and simulation-heavy demo deploys — synthetic traffic shapes, internal test services, not customer-facing production. Real production traffic will shift these numbers significantly, and I expect that. What I cared about at this stage wasn't the absolute ratio. It was whether WATCH was carrying any meaningful share at all, or whether it was a vanity middle state nobody actually hit.

Within that narrow sample, the split was roughly STABLE 70%, WATCH 25%, RISK 5%.

What surprised me wasn't the numbers. It was that the middle-zone decisions, which had been happening silently in operators' heads, became visible. Before WATCH existed, those decisions left no record — no escalation condition, no follow-up trigger, no rationale anyone could read later. After it existed, the same decisions became discrete events you could look at.

A few patterns surfaced once we could see them. Services with a high noise floor showed up WATCH-heavy and tended to resolve back to STABLE — they didn't need RISK to fire earlier, they needed patient verdicts. Services with genuinely risky deploys also showed up WATCH-heavy, but they escalated to RISK — they needed the threshold lowered, not WATCH made stricter. The same numeric signal meant different things in different services. Binary couldn't represent that without breaking one of the two states.
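Once WATCH outcomes are recorded, telling those two kinds of service apart is a small aggregation. This is a sketch under my own assumptions: the `watch_escalation_rate` helper and its input shape (pairs of service name and what the WATCH resolved to) are hypothetical, not something Relivio exposes.

```python
from collections import defaultdict
from typing import Iterable


def watch_escalation_rate(outcomes: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Per service, the share of WATCH verdicts that ended in RISK.

    `outcomes` holds (service, resolved_to) pairs for verdicts that started
    as WATCH, where resolved_to is "STABLE" or "RISK".
    """
    totals: dict[str, int] = defaultdict(int)
    escalated: dict[str, int] = defaultdict(int)
    for service, resolved_to in outcomes:
        totals[service] += 1
        if resolved_to == "RISK":
            escalated[service] += 1
    return {service: escalated[service] / totals[service] for service in totals}
```

A WATCH-heavy service with a rate near 0 looks like a noisy one that needs patient verdicts. A WATCH-heavy service with a high rate looks like one whose RISK threshold should come down. Where to draw the line between the two is a judgment call I'd leave to whoever owns the service.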

Three states didn't reveal new information. They gave a name to information that was already there.

What four states didn't do

I tried four briefly. I added a HOLD between WATCH and RISK to mean "pause rollout but don't roll back." It collapsed back into either WATCH-with-stricter-escalation or RISK-with-different-next-action. The four-state version made the operator pick which axis they were on, which is the kind of cognitive load the verdict was supposed to remove.

Two states force a decision the data doesn't support. Three states match the decision the operator was already making. More than three splits a single decision into duplicate states and pushes interpretation back to the operator.

I'm not sure three is the final answer. I just haven't found a fourth state yet that didn't already live inside one of these three.

If you're building a post-deploy decision layer — internal stage gate, deploy automation, agent-readable deploy state — the temptation is to start with binary. Pass / fail is easier to test, easier to wire into CI, easier to demo. But binary makes you eat the ambiguity somewhere. Either the verdict eats it by forcing a decision the data doesn't support, or the operator eats it by ignoring the verdict and going back to manual judgment.

"5 more minutes" is where that ambiguity lives, in every system I've worked with that ships changes to production. Naming it gave the operator a verdict that matched the decision they were already making, and gave the system a recordable state for ambiguity in progress.

I started Relivio with two states because they felt clean. I added the third because two collapsed the wrong decisions together. Three is what I have right now. I'm still watching.

Relivio is a small verdict layer for the first 15 minutes after a production deploy. It returns STABLE / WATCH / RISK with the affected API and a next action, designed for both humans and agents to read. relivio.dev