DEV Community: Lazypl82

Two states weren't enough. Here's why I added WATCH.

Lazypl82 — Wed, 27 May 2026 02:01:14 +0000

A teammate had two browser tabs open and one Slack thread waiting on him.

The checkout API had just deployed. Error rate moved from 0.1% to 0.4% in the first ninety seconds. The cart API had deployed three minutes before, so the spike could have been either one — or neither, since traffic was up after a weekend release announcement. He was supposed to give the deploy channel a thumbs up or a rollback call in the next minute or two, and he didn't want to do either yet.

So he typed "let's watch 5 more minutes" and went back to staring at the graph.

That decision had been happening invisibly in our deploy reviews for weeks. The verdict layer I had built didn't know it existed.

Most post-deploy verdict layers I'd seen collapse the decision into two states. STABLE or RISK. Pass or fail. Internal stage gates, deploy bots, "deploy health" checks — they all picked one binary and let the operator absorb the ambiguity. It felt clean to model. The first version of Relivio went the same way.

It didn't survive contact with how people actually decide.

The thing operators were doing that I wasn't modeling

I built Relivio with STABLE and RISK. The third option — "I don't know yet, keep watching" — was supposed to fall into STABLE by default, with the operator escalating manually if things stayed off.

About two weeks of test deploys later, that broke.

The patterns we kept hitting weren't clean. Small error rate shifts that didn't breach SLO. Latency moves that overlapped with a neighbor service's deploy. P99 spikes that resolved within four minutes. Each had a real signal — something measurable changed — but acting on them immediately would have been the wrong call.

Forcing those into STABLE meant the operator stopped looking, even when there was something to look at. Forcing them into RISK meant proposing rollback for changes that resolved themselves before the operator could even confirm them. Neither was the decision an experienced engineer would have made manually.

What experienced engineers were doing manually was a third thing. They were keeping the deploy window open. Not declaring victory, not pulling the trigger. Just staying close for a few more minutes.

That decision needed a name.

What WATCH actually is

WATCH is not indecision. It's the decision not to walk away yet.

That's the version I keep coming back to when I forget what I built.

RISK means the signal is strong enough to act on. Pause the rollout, roll back, contain blast radius, open the incident channel. WATCH means the signal is real but the right operator response is attention, not intervention. Stay close. Don't close the deploy window. Re-check in N minutes.

Those are different workflows downstream. RISK triggers rollback machinery, paging, post-incident review. WATCH triggers continued observation, conditional rollout, deploy window extension. If the verdict layer collapses them into one state, the operator has to either over-alert or under-alert — and once that happens, the verdict stops being trusted.

The "5 more minutes" shape

The decision has a consistent shape, every time it shows up.

You see something. Error rate moved. Latency twitched. A new exception type fired once. It's not clearly bad — SLO holds, no customer reported anything, other services are calm. It's also not clearly nothing — the signal is there, you can see it, and if you walk away and it grows, you'll have less context when you come back.

The right move is to keep looking for a few more minutes, and let the noise sort itself out.

WATCH is that decision, made explicit by the verdict layer instead of held silently in the operator's head.

That's the part that actually mattered. Not the third state existing. The fact that the third state turned a silent operator decision into a recordable event with a name, an escalation condition, and a re-check time.

What the response shape becomes

Adding the third state didn't just extend an enum. It changed what a verdict had to carry.

In a binary world, STABLE was free of structure — there was nothing to look at. RISK carried affected_api, rationale, next_action — the operator was about to act, so they needed context. WATCH needed something different:

Which signal is in motion, specifically (error rate on /api/orders/finalize, latency spike on the checkout cluster).
When to re-check, explicitly: WATCH at T+3min should resolve to STABLE or RISK by T+8min, not hang indefinitely.
What threshold, if crossed, converts this WATCH into RISK, so the operator can read it and know "if X happens, I'll act."
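To make those three fields concrete, here is a minimal Python sketch of a WATCH record. It is not Relivio's actual schema: the class name, the field names, the 1% threshold, and the rule that an uncrossed threshold at the deadline resolves to STABLE are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class WatchVerdict:
    signal: str            # which signal is in motion
    recheck_by: datetime   # deadline after which WATCH must not linger
    escalate_above: float  # hypothetical threshold that converts WATCH into RISK

    def resolve(self, observed: float, now: datetime) -> str:
        """Return the state this WATCH should be in, given a fresh reading."""
        if observed > self.escalate_above:
            return "RISK"
        if now >= self.recheck_by:
            # Assumption: nothing crossed the threshold by the deadline, so close as STABLE.
            return "STABLE"
        return "WATCH"


deployed_at = datetime(2026, 5, 27, 2, 0, tzinfo=timezone.utc)  # illustrative timestamp
verdict = WatchVerdict(
    signal="error rate on /api/orders/finalize",
    recheck_by=deployed_at + timedelta(minutes=8),
    escalate_above=0.01,  # 1% error rate, an invented number
)
```

The point of the sketch is that a WATCH can never hang: every call to resolve either keeps it open before the deadline or closes it into one of the other two states.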

The verdict stopped being a label. It became a structured decision record. The state told you which kind of decision it was. The rest told you what to do inside that decision.

For agents reading verdicts through MCP or REST, that matters more. An agent reading STABLE walks away. An agent reading RISK takes a defined action. An agent reading WATCH needs to know exactly what to keep watching and when to come back. WATCH without that metadata is just "shrug" to an agent.
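A rough sketch of what that looks like from the agent's side, in Python. The keys affected_api and next_action come from the RISK fields mentioned earlier; the WATCH keys (signal, recheck_in_seconds, escalate_if) and the returned action names are hypothetical, chosen only to show why missing metadata leaves the agent with nothing to do.

```python
def plan_next_step(verdict: dict) -> dict:
    """Turn a verdict payload into the agent's next step (illustrative only)."""
    state = verdict["state"]
    if state == "STABLE":
        return {"action": "walk_away"}
    if state == "RISK":
        return {"action": verdict["next_action"], "target": verdict["affected_api"]}
    if state == "WATCH":
        required = ("signal", "recheck_in_seconds", "escalate_if")
        missing = [key for key in required if key not in verdict]
        if missing:
            # WATCH without metadata is the "shrug" case: hand it back to a human.
            return {"action": "ask_human", "missing": missing}
        return {
            "action": "recheck",
            "signal": verdict["signal"],
            "after_seconds": verdict["recheck_in_seconds"],
            "escalate_if": verdict["escalate_if"],
        }
    raise ValueError(f"unknown verdict state: {state!r}")
```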

What the distribution looked like

I want to put this in the right frame before saying a number.

These were early test runs and simulation-heavy demo deploys — synthetic traffic shapes, internal test services, not customer-facing production. Real production traffic will shift these numbers significantly, and I expect that. What I cared about at this stage wasn't the absolute ratio. It was whether WATCH was carrying any meaningful share at all, or whether it was a vanity middle state nobody actually hit.

Within that narrow sample, roughly STABLE 70%, WATCH 25%, RISK 5%.

What surprised me wasn't the numbers. It was that the middle-zone decisions, which had been happening silently in operators' heads, became visible. Before WATCH existed, those decisions left no record — no escalation condition, no follow-up trigger, no rationale anyone could read later. After it existed, the same decisions became discrete events you could look at.

A few patterns surfaced once we could see them. Services with a high noise floor showed up WATCH-heavy and tended to resolve back to STABLE — they didn't need RISK to fire earlier, they needed patient verdicts. Services with genuinely risky deploys also showed up WATCH-heavy, but they escalated to RISK — they needed the threshold lowered, not WATCH made stricter. The same numeric signal meant different things in different services. Binary couldn't represent that without breaking one of the two states.
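Once WATCH is a recorded event, telling those two kinds of service apart is a small aggregation. The Python sketch below assumes each closed WATCH is logged as a (service, resolved_to) pair; that log shape, the service names, and the sample rows are invented for illustration and are not measurements from the test runs.

```python
from collections import defaultdict


def watch_escalation_rate(resolutions):
    """Map each service to the fraction of its WATCH verdicts that ended in RISK."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for service, resolved_to in resolutions:
        totals[service] += 1
        if resolved_to == "RISK":
            escalated[service] += 1
    return {service: escalated[service] / totals[service] for service in totals}


# Made-up sample: a noisy service and a genuinely risky one, both WATCH-heavy.
sample = [
    ("search", "STABLE"), ("search", "STABLE"), ("search", "STABLE"), ("search", "RISK"),
    ("payments", "RISK"), ("payments", "RISK"), ("payments", "STABLE"), ("payments", "RISK"),
]
rates = watch_escalation_rate(sample)
```

A low rate suggests a noisy service that needs patient verdicts; a high rate suggests the RISK threshold for that service is set too high.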

Three states didn't reveal new information. They gave a name to information that was already there.

What four states didn't do

I tried four briefly. I added a HOLD between WATCH and RISK to mean "pause rollout but don't roll back." It collapsed back into either WATCH-with-stricter-escalation or RISK-with-different-next-action. The four-state version made the operator pick which axis they were on, which is the kind of cognitive load the verdict was supposed to remove.

Two states force a decision the data doesn't support. Three states match the decision the operator was already making. More than three creates duplicate states inside the same decision and pushes interpretation back to the operator.

I'm not sure three is the final answer. I just haven't found a fourth state yet that didn't already live inside one of these three.

If you're building a post-deploy decision layer — internal stage gate, deploy automation, agent-readable deploy state — the temptation is to start with binary. Pass / fail is easier to test, easier to wire into CI, easier to demo. But binary makes you eat the ambiguity somewhere. Either the verdict eats it by forcing a decision the data doesn't support, or the operator eats it by ignoring the verdict and going back to manual judgment.

"5 more minutes" is where that ambiguity lives, in every system I've worked with that ships changes to production. Naming it gave the operator a verdict that matched the decision they were already making, and gave the system a recordable state for ambiguity in progress.

I started Relivio with two states because they felt clean. I added the third because two collapsed the wrong decisions together. Three is what I have right now. I'm still watching.

Relivio is a small verdict layer for the first 15 minutes after a production deploy. It returns STABLE / WATCH / RISK with the affected API and a next action, designed for both humans and agents to read. relivio.dev

Why post-deploy verification deserves its own category

Lazypl82 — Tue, 12 May 2026 04:37:32 +0000

After a deploy, the hard part is often not the deploy.

CI has already passed. The rollout has started. Dashboards mostly look normal. Then one small runtime signal moves, and someone has to decide whether it came from this deploy or whether it is just noise.

That window is short. Usually the first 15 minutes. But it shows up every time a team ships, and the more I worked around it, the less it felt like a normal monitoring problem.

The decision is not "is the system healthy"

When I first started, I assumed this was a subset of observability. The signals are runtime signals. The dashboards are the same dashboards. The Sentry events are the same Sentry events.

But the decision is different.

Monitoring is mostly about ongoing system state. Is the service up. Are the latencies in range. Are the error rates trending. The decisions monitoring drives are continuous: alert on threshold breach, page on saturation, dashboards for capacity planning.

The post-deploy decision is much narrower. This deploy went out a few minutes ago. A signal just shifted. Is that shift attributable to the deploy? Is it strong enough to act on? What should happen next?

Notice that the inputs overlap with monitoring. The decision does not.

Why existing observability tools do not close it

The existing stack does inspect the right signals. Datadog can show error rate per service. Sentry can show new exception fingerprints. Grafana can correlate latency to deploy markers.

What none of them do, by default, is close the decision.

They give you the inputs. They show you the picture. The interpretation sits in the operator's head: is this fine, is this WATCH, is this RISK. That interpretation is the actual bottleneck.

In small teams, that bottleneck is invisible because one person knows the deploy, knows the service, sees the dashboards, and makes the call in a few seconds. The mental model is in their head.

In larger teams, especially in microservice setups where multiple teams ship in parallel, the bottleneck shows up. Service A and Service B both deployed in the last 30 minutes. Error rate is up on a checkout API. Whose deploy is responsible? Should anyone roll back? Whose call is that, and on what evidence?
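The first step of that attribution question can be sketched in a few lines of Python: list which deploys landed shortly before the signal shifted. This only narrows the candidates; it does not decide whose deploy is responsible. The record shape, the function name, and the timestamps are assumptions for illustration, and the 30-minute window mirrors the example above.

```python
from datetime import datetime, timedelta, timezone


def candidate_deploys(deploys, shift_at, window=timedelta(minutes=30)):
    """Return deploys that landed within `window` before the shift, most recent first."""
    hits = [d for d in deploys if timedelta(0) <= shift_at - d["at"] <= window]
    return sorted(hits, key=lambda d: d["at"], reverse=True)


t0 = datetime(2026, 5, 12, 4, 0, tzinfo=timezone.utc)  # illustrative timestamp
deploys = [
    {"service": "cart", "at": t0},
    {"service": "checkout", "at": t0 + timedelta(minutes=3)},
    {"service": "search", "at": t0 - timedelta(hours=2)},
]
# The signal shifts ninety seconds after the checkout deploy.
suspects = candidate_deploys(deploys, shift_at=t0 + timedelta(minutes=4, seconds=30))
```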

The existing stack supplies the data. It does not supply the verdict.

What "verdict" means here

I have been calling the output a verdict because that is what it has to be: a closed decision, not another open dashboard.

The shape that has worked is three states.

STABLE means the post-deploy window looked normal. Move on.
WATCH means something shifted, but not enough to act. Stay close.
RISK means the pattern is strong enough that doing nothing is the wrong call.

Two states would have been simpler, and that is what I tried first. But binary collapses the middle, and the post-deploy middle is real. Most days, the verdict is WATCH. The deploy did not break anything outright, but something is moving. The team needs to know that without escalating.

Three states. STABLE, WATCH, RISK. Each with the affected API and a recommended next action. Not a chart. Not an alert. A short structured verdict that closes the decision.

Why the input has to stay narrow

Here is the part I want to be honest about.

The temptation here is to ingest more. More signals, more derived metrics, more partner-side measurement. The product gets bigger. It looks more capable.

That is also exactly when this category turns into a worse APM.

The boundary that keeps this category narrow is on the input side, not the output side. Teams send raw events they already have: error logs, stack traces, deploy markers, environment metadata. They do not measure latency, throughput, p95, ratios, counters, CPU, or memory just to send them somewhere else. The tool derives what it needs from the raw events.
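As a sketch of what "derives what it needs" could mean, the Python below takes raw error events plus one deploy marker and computes error counts on each side of the deploy and any exception fingerprints that first appear after it. The event shape (a timestamp under "at" and a "fingerprint" string), the function name, and the sample values are hypothetical; this is not a description of how Relivio actually derives signals.

```python
from datetime import datetime, timedelta, timezone


def summarize_around_deploy(events, deploy_at, window=timedelta(minutes=15)):
    """Derive simple before/after signals from raw error events and a deploy marker."""
    before = [e for e in events if deploy_at - window <= e["at"] < deploy_at]
    after = [e for e in events if deploy_at <= e["at"] < deploy_at + window]
    seen_before = {e["fingerprint"] for e in before}
    new_fingerprints = sorted({e["fingerprint"] for e in after} - seen_before)
    return {
        "errors_before": len(before),
        "errors_after": len(after),
        "new_fingerprints": new_fingerprints,
    }


deploy_at = datetime(2026, 5, 12, 4, 0, tzinfo=timezone.utc)  # illustrative deploy marker
events = [
    {"at": deploy_at - timedelta(minutes=10), "fingerprint": "TimeoutError@db"},
    {"at": deploy_at - timedelta(minutes=2), "fingerprint": "TimeoutError@db"},
    {"at": deploy_at + timedelta(minutes=1), "fingerprint": "TimeoutError@db"},
    {"at": deploy_at + timedelta(minutes=4), "fingerprint": "KeyError@finalize"},
]
summary = summarize_around_deploy(events, deploy_at)
```

Nothing here requires the sender to pre-compute a rate or a percentile; the events carry only what the team already logs.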

If a partner has to install something that measures throughput just to use the verification layer, the layer is no longer narrow. It is just another collector with a verdict on top.

Raw in, verdict out. That is the shape that makes the separate tool make sense. Without that boundary, the category collapses back into monitoring.

Why the output has to be readable by agents

There is one more piece that I think matters more than it looks.

The verdict has to be readable by something other than a human dashboard.

In modern deploy pipelines, humans are not always the only consumer of deploy outcomes. An internal workflow might need to read the result. An agent might need a decision input. A release process might need a shared object that says what happened after the deploy, without scraping a dashboard or reading a Slack thread.

A dashboard does not serve those consumers well. A structured verdict does. STABLE, WATCH, RISK as discrete values, plus affected API and decision tier. That shape can be read by an agent and handed to a policy the team owns.
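A small Python sketch of that handoff, assuming the verdict arrives as JSON with a "state" and an "affected_api" key. The payload shape, the policy table, and its action names are all hypothetical; the only part taken from the text is that the state is one of three discrete values and the policy belongs to the team.

```python
import json
from enum import Enum


class State(str, Enum):
    STABLE = "STABLE"
    WATCH = "WATCH"
    RISK = "RISK"


# Team-owned policy: the action names are placeholders a team would replace.
POLICY = {
    State.STABLE: "close_deploy_window",
    State.WATCH: "keep_window_open",
    State.RISK: "page_owner",
}


def decide(payload: str) -> tuple[str, str]:
    """Parse a verdict payload and return (policy action, affected API)."""
    data = json.loads(payload)
    state = State(data["state"])  # raises ValueError for anything outside the three states
    return POLICY[state], data["affected_api"]
```

Because the state is a closed set, an unexpected value fails loudly at the boundary instead of being silently interpreted by whatever reads it next.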

That second consumer is one of the reasons I stopped trying to make this look like a monitoring tool. Monitoring tools are designed for human reading. Verdict tools have to be readable by both humans and software.

Where this leaves the category

I do not think every team needs a separate post-deploy verification tool. Small teams with one engineer per deploy mostly do not. The bottleneck is not visible at that scale.

The teams where it shows up are the ones where multiple services ship in parallel, where multiple humans need a shared verdict, and where workflows downstream of the deploy need to read structured output.

For those teams, the post-deploy 15 minutes is not just "a feature your monitoring tool should add". It is its own category, with its own narrow contract: raw events in, structured verdict out. It does not fit cleanly inside an existing observability surface.

That is the bet I have been working on.

I am building Relivio, a small verdict layer for the first 15 minutes after a production deploy. If you are working on something similar, or running into the same problem, happy to talk. relivio.dev