<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lazypl82</title>
    <description>The latest articles on DEV Community by Lazypl82 (@lazypl82).</description>
    <link>https://dev.to/lazypl82</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3793705%2Fa44b5cbf-1996-44c7-a0c5-4e75e6d0aa97.png</url>
      <title>DEV Community: Lazypl82</title>
      <link>https://dev.to/lazypl82</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lazypl82"/>
    <language>en</language>
    <item>
      <title>My verdict layer had two readers. Only one of them had eyes.</title>
      <dc:creator>Lazypl82</dc:creator>
      <pubDate>Wed, 24 Jun 2026 04:09:53 +0000</pubDate>
      <link>https://dev.to/lazypl82/my-verdict-layer-had-two-readers-only-one-of-them-had-eyes-3il3</link>
      <guid>https://dev.to/lazypl82/my-verdict-layer-had-two-readers-only-one-of-them-had-eyes-3il3</guid>
      <description>&lt;p&gt;An internal release agent finished a deploy a little after 2 a.m. and then had nothing it could read.&lt;/p&gt;

&lt;p&gt;The dashboards were green. But green is something you see, not something you fetch. The agent's next step, continue the rollout or pause and wait for a human, depended on a judgment that lived only on a screen nobody was looking at. So it did the one safe thing available to it. It stopped, and waited for someone to wake up and confirm what the graphs were already showing.&lt;/p&gt;

&lt;p&gt;That gap stayed with me. The pipeline was automated end to end, except for the fifteen minutes right after the deploy, where it quietly fell back to a person. Someone still had to look at runtime signals and decide whether the thing that just shipped was behaving.&lt;/p&gt;

&lt;p&gt;I had built a verdict layer for that exact window. It read raw deploy signals: error rates, latency, exception types, deploy metadata. It emitted one of three states. STABLE, WATCH, RISK. The problem wasn't the verdict. The problem was that I had built it for a human to read, and the agent that actually needed the answer at 2 a.m. couldn't get to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second reader I hadn't designed for
&lt;/h2&gt;

&lt;p&gt;I had assumed the post-deploy decision has one consumer: the operator. On a small team that is often true. One person knows the deploy, sees the dashboards, makes the call in a few seconds.&lt;/p&gt;

&lt;p&gt;But the moment anything downstream of the deploy is automated, whether a release agent, a progressive rollout controller, or a workflow that promotes a build from canary to full, that automation becomes a second reader of the same decision. And it does not read the way a human does.&lt;/p&gt;

&lt;p&gt;A human glances at a latency panel and infers that the bump at 02:04 lines up with the rollout. An agent can't glance. It can't infer from a rendering. It needs the conclusion as a value it can branch on, not a picture it has to interpret.&lt;/p&gt;

&lt;p&gt;That is the part I had collapsed. The verdict existed, but it existed in a shape only one of its two readers could use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a dashboard doesn't serve it
&lt;/h2&gt;

&lt;p&gt;A dashboard is a rendering for a reader with eyes, attention, and context. It assumes someone who already knows which deploy went out, which panel matters, and what normal looks like for this service.&lt;/p&gt;

&lt;p&gt;An agent arrives with none of that. Handing it a dashboard is handing it a photograph of an answer and asking it to read the answer back. Even when the data behind the panel is available through an API, what comes back is usually more raw signal: time series, event streams, the same inputs the human was interpreting. The interpretation, the part that closes the decision, still isn't in the response. It was in the operator's head.&lt;/p&gt;

&lt;p&gt;So the agent is stuck in the same place the human was, except worse, because it can't even do the intuitive pattern-match a tired engineer does at a glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the verdict had to become
&lt;/h2&gt;

&lt;p&gt;The fix wasn't a new dashboard. It was making the verdict a structured object instead of a message.&lt;/p&gt;

&lt;p&gt;A human reading the verdict wants a sentence: the checkout API looks fine, move on. An agent reading the same verdict wants fields it can branch on: &lt;code&gt;verdict: WATCH&lt;/code&gt;, a &lt;code&gt;decision_tier&lt;/code&gt;, the &lt;code&gt;affected_apis&lt;/code&gt;, a &lt;code&gt;recommended_action&lt;/code&gt;, and the &lt;code&gt;operator_steps&lt;/code&gt; to follow.&lt;/p&gt;

&lt;p&gt;Same verdict. Two encodings of it. The state tells a human what happened and tells an agent which branch to take. The metadata tells a human what to look at and tells an agent what to do next or when to re-check. Once the verdict carried both, neither reader had to translate for the other.&lt;/p&gt;
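
&lt;p&gt;A minimal sketch of the two encodings, assuming a Python representation. The field names are the ones mentioned above; the types, the &lt;code&gt;to_message&lt;/code&gt; and &lt;code&gt;to_json&lt;/code&gt; helpers, and the sample values are illustrative assumptions, not Relivio's actual schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Verdict:
    # Field names follow the ones mentioned in the post; the types are assumed.
    verdict: str                      # "STABLE", "WATCH", or "RISK"
    decision_tier: str                # hypothetical value, e.g. "observe"
    affected_apis: list = field(default_factory=list)
    recommended_action: str = ""
    operator_steps: list = field(default_factory=list)

    def to_message(self):
        """Encoding for the reader with eyes: one short sentence."""
        apis = ", ".join(self.affected_apis) or "no specific API"
        return f"{self.verdict}: {apis}. {self.recommended_action}".strip()

    def to_json(self):
        """Encoding for the reader without eyes: fields to branch on."""
        return json.dumps(asdict(self), sort_keys=True)


v = Verdict(
    verdict="WATCH",
    decision_tier="observe",
    affected_apis=["/api/checkout"],
    recommended_action="Re-check in 5 minutes.",
    operator_steps=["Compare error rate against the pre-deploy baseline."],
)
print(v.to_message())  # WATCH: /api/checkout. Re-check in 5 minutes.
print(v.to_json())
```

&lt;p&gt;Both outputs come from the same object, so the sentence a human reads and the fields an agent branches on cannot drift apart.&lt;/p&gt;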

&lt;p&gt;This is the shift that mattered. Not adding a feature — changing what a verdict is. It stopped being a thing rendered for a person and became a thing two kinds of reader could consume from the same source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a read surface, and why MCP
&lt;/h2&gt;

&lt;p&gt;A structured verdict still needs a doorway the second reader can walk through.&lt;/p&gt;

&lt;p&gt;A plain REST endpoint is the obvious one, and it works. An agent can call &lt;code&gt;GET&lt;/code&gt; on the latest verdict and read the fields. But agent runtimes increasingly speak a more specific protocol for fetching context and calling tools, and for those runtimes MCP is the native door. So alongside the REST surface there's an MCP server the agent can query the same way it queries any other tool. Ask for the latest verdict, get back the structured object, branch on it.&lt;/p&gt;

&lt;p&gt;In practice that looks like one call in the pipeline. The agent finishes the deploy, asks for the latest verdict, reads the state, and decides. A &lt;code&gt;get_verdict&lt;/code&gt; call instead of a screenshot nobody opened.&lt;/p&gt;

&lt;p&gt;I want to be careful about what this is and isn't. The MCP server is a doorway, not the product. The verdict layer's job is to read deploy signals and decide STABLE, WATCH, or RISK. MCP is just one of the ways the answer gets handed to a reader that doesn't have eyes. The moment you let the doorway become the identity of the thing, you start building an "AI tool" and stop building the verification layer that was the actual point.&lt;/p&gt;

&lt;h2&gt;
  
  
  A read surface is not a control surface
&lt;/h2&gt;

&lt;p&gt;Here is the boundary that keeps this from going wrong.&lt;/p&gt;

&lt;p&gt;The agent reading the verdict is a consumer of the decision, not a thing the verdict controls. Relivio produces the verdict. It does not decide what happens next. The agent, or the policy the team wrote for it, reads STABLE and continues, reads RISK and pauses, reads WATCH and re-checks in a few minutes. That policy belongs to the team, not to the verdict layer.&lt;/p&gt;
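
&lt;p&gt;That division of labor might be sketched like this. The action names, the &lt;code&gt;DEFAULT_POLICY&lt;/code&gt; table, and the &lt;code&gt;fetch_verdict&lt;/code&gt; callable are hypothetical; the fallback to pausing on an unknown state is also an assumption about what a cautious team would choose.&lt;/p&gt;

```python
# Team-owned policy: maps a verdict state to the agent's next step.
# The action names here are illustrative, not Relivio's API.
DEFAULT_POLICY = {
    "STABLE": "continue",
    "WATCH": "recheck",
    "RISK": "pause",
}


def next_step(verdict, policy=DEFAULT_POLICY):
    """Return the action the team's policy assigns to this verdict.

    `verdict` is a dict with at least a "verdict" key. Unknown or missing
    states fall back to "pause", so the agent waits for a human rather
    than guessing.
    """
    return policy.get(verdict.get("verdict"), "pause")


def run_after_deploy(fetch_verdict, policy=DEFAULT_POLICY):
    """`fetch_verdict` is any callable returning the latest verdict dict,
    for example a REST GET or an MCP tool call wrapped by the caller."""
    verdict = fetch_verdict()
    return next_step(verdict, policy)


print(run_after_deploy(lambda: {"verdict": "WATCH"}))  # recheck
```

&lt;p&gt;The policy table lives with the team's code, not with the verdict. Swapping in a different table changes what the agent does without touching how the verdict is produced.&lt;/p&gt;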

&lt;p&gt;I learned this the hard way in an earlier version, where the verdict layer reached across that line and tried to hold deploys itself. It quietly eroded trust, because once a layer both judges and acts, you can't tell which hat it's wearing when something goes wrong. Exposing the verdict as a read surface keeps the hats separate on purpose. The verdict is readable. What to do about it stays owned by whoever owns the deploy.&lt;/p&gt;

&lt;p&gt;So the MCP surface is deliberately read-only in spirit. An agent pulls a verdict. It does not get told by Relivio what to do, and Relivio does not get to drive the rollout through the back door of "the agent read RISK so I paused it." The agent paused it. The team's policy said to. The verdict was just the input.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can go wrong
&lt;/h2&gt;

&lt;p&gt;A few failure modes show up once a verdict has a machine reader.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The verdict gets read as a command.&lt;/strong&gt; An agent reading RISK and hard-stopping every time, with no policy layer in between, turns a judgment into an automatic gate. That's the coupling failure again, just relocated into the agent. The verdict should be an input to a policy, not a trigger wired straight to an action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The read surface grows write verbs.&lt;/strong&gt; It's tempting to add "and also let the agent acknowledge, or snooze, or override the verdict through the same surface." Each of those is a control verb sneaking into a read surface. Once they're there, the layer is back to owning decisions it shouldn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The protocol becomes the positioning.&lt;/strong&gt; Because MCP is having a moment, it's easy to start describing the whole thing as an MCP server. Then people file it under "agent tooling" and miss that the value is the verdict, which is just as useful read by a human through a REST call or a Slack message. The protocol is plumbing. The judgment is the product.&lt;/p&gt;

&lt;p&gt;None of these are loud failures. They each quietly move the layer back toward being something it was trying not to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves it
&lt;/h2&gt;

&lt;p&gt;The verdict layer now has two readers it serves on purpose. A human reads a short message and moves on. An agent or workflow pulls the structured object, through REST or through MCP when the runtime speaks it, and branches on a value instead of squinting at a chart.&lt;/p&gt;

&lt;p&gt;What it doesn't have is a verdict that only exists as a rendering. The 2 a.m. release agent that used to stop and wait for a human can now read the same answer the human would have read, in a shape it can act on, under a policy the team still owns.&lt;/p&gt;

&lt;p&gt;I'm not sure MCP is where most of this traffic ends up; plenty of teams will just hit the REST endpoint and be done. But the underlying shift feels right and a little overdue: a post-deploy verdict isn't only a thing people look at. It's a decision other software has to be able to read.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Relivio is a small verdict layer for the first 15 minutes after a production deploy. It returns STABLE / WATCH / RISK with the affected APIs, a decision tier, and a recommended action, readable by humans and by the agents and workflows downstream of a deploy.&lt;/em&gt; &lt;a href="https://relivio.dev" rel="noopener noreferrer"&gt;relivio.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>deployment</category>
      <category>observability</category>
      <category>mcp</category>
    </item>
    <item>
      <title>I shipped a verdict layer that gated deploys. It quietly broke trust.</title>
      <dc:creator>Lazypl82</dc:creator>
      <pubDate>Mon, 08 Jun 2026 01:56:49 +0000</pubDate>
      <link>https://dev.to/lazypl82/i-shipped-a-verdict-layer-that-gated-deploys-it-quietly-broke-trust-4e4f</link>
      <guid>https://dev.to/lazypl82/i-shipped-a-verdict-layer-that-gated-deploys-it-quietly-broke-trust-4e4f</guid>
      <description>&lt;p&gt;A teammate pinged me on a Tuesday afternoon. His deploy had stopped about a minute earlier. The deploy bot said &lt;code&gt;hold_reason: low_confidence&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;"Is that my code, or your tool?"&lt;/p&gt;

&lt;p&gt;I didn't have a good answer.&lt;/p&gt;

&lt;p&gt;The tool he was asking about was the first version of what I now call the verdict layer. Its job was to read raw deploy signals — error rates, latency, exception types, deploy metadata — and emit a verdict for the post-deploy window. STABLE, WATCH, RISK. A name on what just happened.&lt;/p&gt;

&lt;p&gt;It also had a second job I had quietly added in the same commit. If the verdict's internal confidence was low, the layer would tell the delivery side to hold the rollout. Not roll back, just hold. The deploy bot would see &lt;code&gt;hold_reason: low_confidence&lt;/code&gt; and pause until either confidence climbed or a human stepped in.&lt;/p&gt;

&lt;p&gt;It felt obviously correct at the time. If the system isn't sure, why would you let the deploy continue?&lt;/p&gt;

&lt;p&gt;The two jobs felt connected. They weren't. And the way they weren't connected showed up first in confusion like the teammate's question, then in a quieter erosion of trust I almost didn't notice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I missed by coupling them
&lt;/h2&gt;

&lt;p&gt;Every time the teammate got a &lt;code&gt;low_confidence&lt;/code&gt; hold, he had to choose between three possibilities he had no way to distinguish:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The verdict is uncertain because my change is actually risky.&lt;/li&gt;
&lt;li&gt;The verdict is uncertain because it doesn't have enough history with this signal pattern, but the deploy itself is fine.&lt;/li&gt;
&lt;li&gt;The verdict is just wrong this time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;He had no way to tell which one he was looking at from the hold reason alone. The hold reason was a string about the verdict's own internal state, not about anything he could investigate in his code.&lt;/p&gt;

&lt;p&gt;After the second or third hold he started clicking through them by reflex. The hold reason had become noise. And once the hold reason is noise, the act of holding is also noise.&lt;/p&gt;

&lt;p&gt;That's the failure I almost didn't see. The hold was technically working. The system was doing exactly what I told it to. The trust that should have come with the hold was leaking out at the same rate I was producing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why confidence as a hold reason fails operators
&lt;/h2&gt;

&lt;p&gt;Confidence is a property of the verdict, not a property of the deploy.&lt;/p&gt;

&lt;p&gt;That sentence is what took me weeks to internalize. When confidence is exposed as the reason a rollout stopped, the operator's mental model collapses two unrelated questions into one: "is my code safe" and "does the verdict layer feel sure." Those are different questions, and forcing them to share a UI surface means neither one gets answered well.&lt;/p&gt;

&lt;p&gt;Hold reasons need to be legible to the person they affect. &lt;code&gt;manual_review_requested&lt;/code&gt;, &lt;code&gt;policy_threshold_breach&lt;/code&gt;, &lt;code&gt;staging_gate_failed&lt;/code&gt; — those are reasons an operator can act on. &lt;code&gt;low_confidence&lt;/code&gt; is a reason the verdict layer can act on internally. Pushing it out as a delivery-blocking signal exposes it to the wrong audience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The split that fixed it
&lt;/h2&gt;

&lt;p&gt;The change I made was structural.&lt;/p&gt;

&lt;p&gt;The verdict layer would always emit a verdict. Every deploy, every time. Internal confidence stayed inside the verdict as context, never as a delivery signal.&lt;/p&gt;

&lt;p&gt;Delivery hold became a separate concept with one input: explicit operator intent. A normal deploy hits &lt;code&gt;auto-deliver&lt;/code&gt;. A deploy the operator wants gated behind manual review gets &lt;code&gt;manual_review&lt;/code&gt;. There is no third path where the verdict layer can decide to hold something on its own.&lt;/p&gt;
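
&lt;p&gt;A rough sketch of that split, under assumed names. Only the two mode strings come from the design above; &lt;code&gt;DeliveryMode&lt;/code&gt;, &lt;code&gt;delivery_mode&lt;/code&gt;, and &lt;code&gt;emit_verdict&lt;/code&gt; are hypothetical helpers for illustration.&lt;/p&gt;

```python
from enum import Enum


class DeliveryMode(Enum):
    AUTO_DELIVER = "auto-deliver"
    MANUAL_REVIEW = "manual_review"


def delivery_mode(operator_requested_review):
    """Decide whether a deploy is held, from operator intent alone.

    The verdict and its confidence are deliberately not parameters, so
    nothing the verdict layer computes can hold a deploy.
    """
    if operator_requested_review:
        return DeliveryMode.MANUAL_REVIEW
    return DeliveryMode.AUTO_DELIVER


def emit_verdict(state, confidence):
    """Always returns a verdict; confidence rides along as context."""
    return {"verdict": state, "confidence": confidence}


verdict = emit_verdict("WATCH", 0.4)
mode = delivery_mode(operator_requested_review=False)
print(verdict["verdict"], mode.value)  # WATCH auto-deliver
```

&lt;p&gt;The point of the signature is what it leaves out: a low-confidence WATCH and a normal deploy both reach &lt;code&gt;auto-deliver&lt;/code&gt; unless the operator asked for review.&lt;/p&gt;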

&lt;p&gt;After the split, the operator's mental model collapsed into something simpler. The verdict tells me what just happened. I (or the policy I wrote) decide what to do with it. The verdict layer never reaches across that boundary.&lt;/p&gt;

&lt;p&gt;The deploys that used to stop on &lt;code&gt;low_confidence&lt;/code&gt; now continue. The operator sees the verdict, reads the confidence context if they want to, and acts or doesn't. The same information is still in the system. It just stopped pretending to be a deploy gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What confidence becomes when it isn't a gate
&lt;/h2&gt;

&lt;p&gt;Confidence didn't disappear. It became metadata that travels with the verdict.&lt;/p&gt;

&lt;p&gt;A WATCH verdict with &lt;code&gt;confidence: 0.4&lt;/code&gt; reads differently from a WATCH with &lt;code&gt;confidence: 0.9&lt;/code&gt;. Both are WATCH. The state still says "don't walk away yet." But the lower-confidence one carries an extra signal to whoever's reading it: treat the verdict itself with some skepticism.&lt;/p&gt;
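
&lt;p&gt;One way a reader might use that metadata, as a sketch. The &lt;code&gt;describe&lt;/code&gt; function and its &lt;code&gt;skeptical_below&lt;/code&gt; cutoff are assumptions on the consuming side, not part of Relivio.&lt;/p&gt;

```python
def describe(verdict, skeptical_below=0.5):
    """Render a verdict for a reader, using confidence only as context.

    `skeptical_below` is an assumed reader-side cutoff, not a Relivio
    setting. It changes the wording, never the state.
    """
    state = verdict["verdict"]
    confidence = verdict["confidence"]
    note = ""
    if skeptical_below > confidence:
        note = " (low confidence: treat this verdict with some skepticism)"
    return f"{state}{note}"


print(describe({"verdict": "WATCH", "confidence": 0.4}))
print(describe({"verdict": "WATCH", "confidence": 0.9}))
```

&lt;p&gt;Both calls report WATCH. Only the first one adds the caveat, and neither touches the rollout.&lt;/p&gt;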

&lt;p&gt;That distinction sounds small. It changed the verdict layer's relationship to every other system that touched it. Confidence is now informing how the verdict gets consumed, not whether the deploy proceeds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is a boundary, not an implementation detail
&lt;/h2&gt;

&lt;p&gt;A verdict-producing layer and a policy-enforcing layer have different failure modes, different ownership, and usually different audiences. The verdict layer says "here's what I think happened." The policy layer says "here's what we do when that's what happened." Confusing those is how monitoring vendors end up running deploy pipelines and deploy automation vendors end up overfitting to specific monitoring signals.&lt;/p&gt;

&lt;p&gt;If the verdict layer owns the gate, every refinement of the verdict, whether through better calibration, new signal sources, or threshold tuning, silently tunes deploy frequency too. That coupling is invisible until it bites. Then the question shifts from "is this verdict useful" to "why did we ship fewer times this week," and the operator who pushed the change loses the ability to reason about either one alone.&lt;/p&gt;

&lt;p&gt;Split them and both questions stay answerable. The verdict can get smarter without affecting deploy frequency. The deploy policy can change without invalidating verdict history. Neither one gets to silently change the other's behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can go wrong if you keep them coupled
&lt;/h2&gt;

&lt;p&gt;A few specific failure modes show up reliably when verdict and gate live in the same layer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The confidence threshold becomes a release management knob.&lt;/strong&gt; Lower it and you ship more. Raise it and you ship less. Nobody decided that. The internal calibration knob is now a release lever, but nothing in the system labels it as one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict trust collapses on a single coupled failure.&lt;/strong&gt; If the layer holds a deploy that should have shipped, and the resulting customer impact is visible, the next conversation isn't "let's recalibrate." It's "let's bypass the layer." A single high-cost mistake in the gating role erases trust in the verdict role, even when the verdict itself was correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trail gets ambiguous.&lt;/strong&gt; When a deploy is held, who held it: the verdict layer's internal calibration, or an explicit policy? Post-incident review wants a clean answer. A coupled system can't give one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operators stop reading the verdict.&lt;/strong&gt; Once the hold reason reads as noise, the verdict it travels with reads as noise too. The signal that should have been useful in its own right, even with no gating authority, quietly stops being trusted.&lt;/p&gt;

&lt;p&gt;None of these failures are dramatic. They slowly remove the value the layer was supposed to add.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves it
&lt;/h2&gt;

&lt;p&gt;The verdict layer in its current shape has no gating authority. It emits a verdict for every deploy, attaches confidence as context, and stops. Delivery hold is owned by an explicit policy surface that takes the verdict as input but is not the same component.&lt;/p&gt;

&lt;p&gt;What I have now is a verdict that's more useful precisely because it doesn't try to decide what happens next. Operators read it. Agents read it. Policies read it. None of them have to ask whether the verdict layer is also quietly making deploy decisions in the background.&lt;/p&gt;

&lt;p&gt;That separation should have been there from the start. The version that gated deploys felt safer at the time. It wasn't safer. It was hiding a policy decision inside a layer that wasn't supposed to own it.&lt;/p&gt;

&lt;p&gt;I'm not sure this is the final shape. But it's the first version I've shipped where the verdict layer's job and the deploy gate's job aren't fighting each other for the same operator's attention.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Relivio is a small verdict layer for the first 15 minutes after a production deploy. It returns STABLE / WATCH / RISK with the affected API, confidence context, and a next action, designed for both humans and agents to read.&lt;/em&gt; &lt;a href="https://relivio.dev" rel="noopener noreferrer"&gt;relivio.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>deployment</category>
      <category>observability</category>
    </item>
    <item>
      <title>Two states weren't enough. Here's why I added WATCH.</title>
      <dc:creator>Lazypl82</dc:creator>
      <pubDate>Wed, 27 May 2026 02:01:14 +0000</pubDate>
      <link>https://dev.to/lazypl82/two-states-werent-enough-heres-why-i-added-watch-a2k</link>
      <guid>https://dev.to/lazypl82/two-states-werent-enough-heres-why-i-added-watch-a2k</guid>
      <description>&lt;p&gt;A teammate had two browser tabs open and one Slack thread waiting on him.&lt;/p&gt;

&lt;p&gt;The checkout API had just deployed. Error rate moved from 0.1% to 0.4% in the first ninety seconds. The cart API had deployed three minutes before, so the spike could have been either one — or neither, since traffic was up after a weekend release announcement. He was supposed to give the deploy channel a thumbs up or a rollback call in the next minute or two, and he didn't want to do either yet.&lt;/p&gt;

&lt;p&gt;So he typed "let's watch for 5 more minutes" and went back to staring at the graph.&lt;/p&gt;

&lt;p&gt;That decision had been happening invisibly in our deploy reviews for weeks. The verdict layer I had built didn't know it existed.&lt;/p&gt;

&lt;p&gt;Most post-deploy verdict layers I'd seen collapse the decision into two states. STABLE or RISK. Pass or fail. Internal stage gates, deploy bots, "deploy health" checks — they all picked one binary and let the operator absorb the ambiguity. It felt clean to model. The first version of Relivio went the same way.&lt;/p&gt;

&lt;p&gt;It didn't survive contact with how people actually decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing operators were doing that I wasn't modeling
&lt;/h2&gt;

&lt;p&gt;I built Relivio with STABLE and RISK. The third option — "I don't know yet, keep watching" — was supposed to fall into STABLE by default, with the operator escalating manually if things stayed off.&lt;/p&gt;

&lt;p&gt;About two weeks of test deploys later, that broke.&lt;/p&gt;

&lt;p&gt;The patterns we kept hitting weren't clean. Small error rate shifts that didn't breach SLO. Latency moves that overlapped with a neighbor service's deploy. P99 spikes that resolved within four minutes. Each had a real signal — something measurable changed — but acting on them immediately would have been the wrong call.&lt;/p&gt;

&lt;p&gt;Forcing those into STABLE meant the operator stopped looking, even when there was something to look at. Forcing them into RISK meant proposing rollback for changes that resolved themselves before the operator could even confirm them. Neither was the decision an experienced engineer would have made manually.&lt;/p&gt;

&lt;p&gt;What experienced engineers were doing manually was a third thing. They were keeping the deploy window open. Not declaring victory, not pulling the trigger. Just staying close for a few more minutes.&lt;/p&gt;

&lt;p&gt;That decision needed a name.&lt;/p&gt;

&lt;h2&gt;
  
  
  What WATCH actually is
&lt;/h2&gt;

&lt;p&gt;WATCH is not indecision. It's the decision not to walk away yet.&lt;/p&gt;

&lt;p&gt;That's the version I keep coming back to when I forget what I built.&lt;/p&gt;

&lt;p&gt;RISK means the signal is strong enough to act on. Pause the rollout, roll back, contain blast radius, open the incident channel. WATCH means the signal is real but the right operator response is &lt;strong&gt;attention&lt;/strong&gt;, not &lt;strong&gt;intervention&lt;/strong&gt;. Stay close. Don't close the deploy window. Re-check in N minutes.&lt;/p&gt;

&lt;p&gt;Those are different workflows downstream. RISK triggers rollback machinery, paging, post-incident review. WATCH triggers continued observation, conditional rollout, deploy window extension. If the verdict layer collapses them into one state, the operator has to either over-alert or under-alert — and once that happens, the verdict stops being trusted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "5 more minutes" shape
&lt;/h2&gt;

&lt;p&gt;The decision has a consistent shape, every time it shows up.&lt;/p&gt;

&lt;p&gt;You see something. Error rate moved. Latency twitched. A new exception type fired once. It's not clearly bad — SLO holds, no customer reported anything, other services are calm. It's also not clearly nothing — the signal is there, you can see it, and if you walk away and it grows, you'll have less context when you come back.&lt;/p&gt;

&lt;p&gt;The right move is to keep looking for a few more minutes, and let the noise sort itself out.&lt;/p&gt;

&lt;p&gt;WATCH is that decision, made explicit by the verdict layer instead of held silently in the operator's head.&lt;/p&gt;

&lt;p&gt;That's the part that actually mattered. Not the third state existing. The fact that the third state turned a silent operator decision into a recordable event with a name, an escalation condition, and a re-check time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the response shape becomes
&lt;/h2&gt;

&lt;p&gt;Adding the third state didn't just extend an enum. It changed what a verdict had to carry.&lt;/p&gt;

&lt;p&gt;In a binary world, STABLE was free of structure — there was nothing to look at. RISK carried &lt;code&gt;affected_api&lt;/code&gt;, &lt;code&gt;rationale&lt;/code&gt;, &lt;code&gt;next_action&lt;/code&gt; — the operator was about to act, so they needed context. WATCH needed something different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which signal is in motion, specifically: error rate on &lt;code&gt;/api/orders/finalize&lt;/code&gt;, a latency spike on the checkout cluster.&lt;/li&gt;
&lt;li&gt;When to re-check, explicitly: a WATCH at T+3min should resolve to STABLE or RISK by T+8min, not hang indefinitely.&lt;/li&gt;
&lt;li&gt;What threshold, if crossed, converts this WATCH into RISK, so the operator can read it and know "if X happens, I'll act."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The verdict stopped being a label. It became a structured decision record. The state told you which kind of decision it was. The rest told you what to do inside that decision.&lt;/p&gt;
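
&lt;p&gt;A small sketch of what that record could look like for WATCH. The field names and the &lt;code&gt;resolve&lt;/code&gt; function are hypothetical, and resolving to STABLE once the deadline passes without a threshold breach is a simplifying assumption, not a description of how Relivio decides.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class WatchVerdict:
    # All field names are illustrative; the post only describes what they hold.
    signal: str               # which signal is in motion
    recheck_after_min: int    # minutes after deploy by which WATCH must resolve
    escalate_at: float        # threshold that converts WATCH into RISK


def resolve(watch, observed, minutes_since_deploy):
    """Resolve a WATCH using its own escalation condition and deadline."""
    if observed >= watch.escalate_at:
        return "RISK"
    if minutes_since_deploy >= watch.recheck_after_min:
        return "STABLE"
    return "WATCH"


w = WatchVerdict(
    signal="error rate on /api/orders/finalize",
    recheck_after_min=8,
    escalate_at=0.01,
)
print(resolve(w, observed=0.004, minutes_since_deploy=3))   # WATCH
print(resolve(w, observed=0.004, minutes_since_deploy=8))   # STABLE
print(resolve(w, observed=0.02, minutes_since_deploy=5))    # RISK
```

&lt;p&gt;Because the deadline and the escalation threshold travel with the verdict, a reader never has to guess how long WATCH is allowed to last or what would end it.&lt;/p&gt;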

&lt;p&gt;For agents reading verdicts through MCP or REST, that matters more. An agent reading STABLE walks away. An agent reading RISK takes a defined action. An agent reading WATCH needs to know exactly what to keep watching and when to come back. WATCH without that metadata is just "shrug" to an agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the distribution looked like
&lt;/h2&gt;

&lt;p&gt;I want to put this in the right frame before saying a number.&lt;/p&gt;

&lt;p&gt;These were early test runs and simulation-heavy demo deploys — synthetic traffic shapes, internal test services, not customer-facing production. Real production traffic will shift these numbers significantly, and I expect that. What I cared about at this stage wasn't the absolute ratio. It was whether WATCH was carrying any meaningful share at all, or whether it was a vanity middle state nobody actually hit.&lt;/p&gt;

&lt;p&gt;Within that narrow sample, the split was roughly STABLE 70%, WATCH 25%, RISK 5%.&lt;/p&gt;

&lt;p&gt;What surprised me wasn't the numbers. It was that the middle-zone decisions, which had been happening silently in operators' heads, became visible. Before WATCH existed, those decisions left no record — no escalation condition, no follow-up trigger, no rationale anyone could read later. After it existed, the same decisions became discrete events you could look at.&lt;/p&gt;

&lt;p&gt;A few patterns surfaced once we could see them. Services with a high noise floor showed up WATCH-heavy and tended to resolve back to STABLE — they didn't need RISK to fire earlier, they needed patient verdicts. Services with genuinely risky deploys also showed up WATCH-heavy, but they escalated to RISK — they needed the threshold lowered, not WATCH made stricter. The same numeric signal meant different things in different services. Binary couldn't represent that without breaking one of the two states.&lt;/p&gt;

&lt;p&gt;Three states didn't reveal new information. They gave a name to information that was already there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What four states didn't do
&lt;/h2&gt;

&lt;p&gt;I tried four briefly. I added a HOLD between WATCH and RISK to mean "pause rollout but don't roll back." It collapsed back into either WATCH-with-stricter-escalation or RISK-with-different-next-action. The four-state version made the operator pick which axis they were on, which is the kind of cognitive load the verdict was supposed to remove.&lt;/p&gt;

&lt;p&gt;Two states force a decision the data doesn't support. Three states match the decision the operator was already making. More than three states duplicate distinctions that already live inside one decision and push interpretation back to the operator.&lt;/p&gt;

&lt;p&gt;I'm not sure three is the final answer. I just haven't found a fourth state yet that didn't already live inside one of these three.&lt;/p&gt;

&lt;p&gt;If you're building a post-deploy decision layer — internal stage gate, deploy automation, agent-readable deploy state — the temptation is to start with binary. Pass / fail is easier to test, easier to wire into CI, easier to demo. But binary makes you eat the ambiguity somewhere. Either the verdict eats it by forcing a decision the data doesn't support, or the operator eats it by ignoring the verdict and going back to manual judgment.&lt;/p&gt;

&lt;p&gt;"5 more minutes" is where that ambiguity lives, in every system I've worked with that ships changes to production. Naming it gave the operator a verdict that matched the decision they were already making, and gave the system a recordable state for ambiguity in progress.&lt;/p&gt;

&lt;p&gt;I started Relivio with two states because they felt clean. I added the third because two collapsed the wrong decisions together. Three is what I have right now. I'm still watching.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Relivio is a small verdict layer for the first 15 minutes after a production deploy. It returns STABLE / WATCH / RISK with the affected API and a next action, designed for both humans and agents to read.&lt;/em&gt; &lt;a href="https://relivio.dev" rel="noopener noreferrer"&gt;relivio.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>deployment</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why post-deploy verification deserves its own category</title>
      <dc:creator>Lazypl82</dc:creator>
      <pubDate>Tue, 12 May 2026 04:37:32 +0000</pubDate>
      <link>https://dev.to/lazypl82/why-post-deploy-verification-deserves-its-own-category-1po</link>
      <guid>https://dev.to/lazypl82/why-post-deploy-verification-deserves-its-own-category-1po</guid>
      <description>&lt;p&gt;After a deploy, the hard part is often not the deploy.&lt;/p&gt;

&lt;p&gt;CI has already passed. The rollout has started. Dashboards mostly look normal. Then one small runtime signal moves, and someone has to decide whether it came from this deploy or whether it is just noise.&lt;/p&gt;

&lt;p&gt;That window is short. Usually the first 15 minutes. But it shows up every time a team ships, and the more I worked around it, the less it felt like a normal monitoring problem.&lt;/p&gt;

&lt;h2&gt;The decision is not "is the system healthy"&lt;/h2&gt;

&lt;p&gt;When I first started, I assumed this was a subset of observability. The signals are runtime signals. The dashboards are the same dashboards. The Sentry events are the same Sentry events.&lt;/p&gt;

&lt;p&gt;But the decision is different.&lt;/p&gt;

&lt;p&gt;Monitoring is mostly about ongoing system state. Is the service up? Are the latencies in range? Are the error rates trending? The decisions monitoring drives are continuous: alert on a threshold breach, page on saturation, review dashboards for capacity planning.&lt;/p&gt;

&lt;p&gt;The post-deploy decision is much narrower. This deploy went out a few minutes ago. A signal just shifted. Is that shift attributable to the deploy? Is it strong enough to act on? What should happen next?&lt;/p&gt;

&lt;p&gt;Notice that the inputs overlap with monitoring. The decision does not.&lt;/p&gt;

&lt;h2&gt;Why existing observability tools do not close it&lt;/h2&gt;

&lt;p&gt;The existing stack does inspect the right signals. Datadog can show error rate per service. Sentry can show new exception fingerprints. Grafana can correlate latency to deploy markers.&lt;/p&gt;

&lt;p&gt;What none of them do, by default, is close the decision.&lt;/p&gt;

&lt;p&gt;They give you the inputs. They show you the picture. The interpretation sits in the operator's head: is this fine, is it worth watching, or is it a real risk? That interpretation is the actual bottleneck.&lt;/p&gt;

&lt;p&gt;In small teams, that bottleneck is invisible because one person knows the deploy, knows the service, sees the dashboards, and makes the call in a few seconds. The mental model is in their head.&lt;/p&gt;

&lt;p&gt;In larger teams, especially in microservice architecture (MSA) setups where multiple teams ship in parallel, the bottleneck shows up. Service A and Service B both deployed in the last 30 minutes. Error rate is up on a checkout API. Whose deploy is responsible? Should anyone roll back? Whose call is that, and on what evidence?&lt;/p&gt;

&lt;p&gt;The existing stack supplies the data. It does not supply the verdict.&lt;/p&gt;

&lt;h2&gt;What "verdict" means here&lt;/h2&gt;

&lt;p&gt;I have been calling the output a verdict because that is what it has to be: a closed decision, not another open dashboard.&lt;/p&gt;

&lt;p&gt;The shape that has worked is three states.&lt;/p&gt;

&lt;p&gt;STABLE means the post-deploy window looked normal. Move on.&lt;br&gt;
WATCH means something shifted, but not enough to act. Stay close.&lt;br&gt;
RISK means the pattern is strong enough that doing nothing is the wrong call.&lt;/p&gt;

&lt;p&gt;Two states would have been simpler, and that is what I tried first. But binary collapses the middle, and the post-deploy middle is real. Most days, the verdict is WATCH. The deploy did not break anything outright, but something is moving. The team needs to know that without escalating.&lt;/p&gt;

&lt;p&gt;Three states. STABLE, WATCH, RISK. Each with the affected API, a recommended action, and the next move. Not a chart. Not an alert. A short structured verdict that closes the decision.&lt;/p&gt;
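
&lt;p&gt;To make that shape concrete, here is a minimal sketch of what such a verdict object could look like. The three state names come from the text above; the field names (affected_api, recommended_action, next_move) and the sample values are my own illustrative assumptions, not Relivio's actual schema.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class State(Enum):
    """The three verdict states named in the article."""
    STABLE = "STABLE"
    WATCH = "WATCH"
    RISK = "RISK"


@dataclass(frozen=True)
class Verdict:
    # Field names are hypothetical; they mirror "affected API,
    # a recommended action, and the next move" from the text.
    state: State
    affected_api: str
    recommended_action: str
    next_move: str

    def to_json(self) -> str:
        """Serialize to a flat JSON object that software can read."""
        payload = asdict(self)
        payload["state"] = self.state.value  # emit the plain string
        return json.dumps(payload, sort_keys=True)


# Sample values are invented for illustration only.
example = Verdict(
    state=State.WATCH,
    affected_api="POST /checkout",
    recommended_action="hold the rollout at the current stage",
    next_move="re-evaluate in 5 minutes",
)
```

&lt;p&gt;Calling example.to_json() yields one small JSON object per deploy. The point of the sketch is only that the verdict is a closed, discrete value with a little context attached, not a chart someone has to interpret.&lt;/p&gt;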

&lt;h2&gt;Why the input has to stay narrow&lt;/h2&gt;

&lt;p&gt;Here is the part I want to be honest about.&lt;/p&gt;

&lt;p&gt;The temptation here is to ingest more. More signals, more derived metrics, more partner-side measurement. The product gets bigger. It looks more capable.&lt;/p&gt;

&lt;p&gt;That is also exactly when this category turns into a worse APM.&lt;/p&gt;

&lt;p&gt;The boundary that keeps this category narrow is on the input side, not the output side. Teams send raw events they already have: error logs, stack traces, deploy markers, environment metadata. They do not measure latency, throughput, p95, ratios, counters, CPU, or memory just to send them somewhere else. The tool derives what it needs from the raw events.&lt;/p&gt;

&lt;p&gt;If a partner has to install something that measures throughput just to use the verification layer, the layer is no longer narrow. It is just another collector with a verdict on top.&lt;/p&gt;

&lt;p&gt;Raw in, verdict out. That is the shape that makes the separate tool make sense. Without that boundary, the category collapses back into monitoring.&lt;/p&gt;
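
&lt;p&gt;A rough sketch of "raw in, verdict out", under loudly stated assumptions: raw error events are plain dicts with a "ts" timestamp, the deploy marker is a single datetime, and the state comes from comparing the error count in the 15 minutes after the deploy with the 15 minutes before it. The ratio thresholds (1.5 and 3.0) are invented for illustration. This is not how Relivio derives its verdict, only an example of deriving a signal from raw events instead of asking the sender to measure it.&lt;/p&gt;

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # the post-deploy window named in the text


def count_errors(events, start, end):
    """Count raw error events in the half-open interval [start, end)."""
    return sum(1 for e in events if e["ts"] >= start and end > e["ts"])


def derive_state(events, deployed_at, watch_ratio=1.5, risk_ratio=3.0):
    """Derive a state from raw events and a deploy marker.

    The thresholds are hypothetical defaults, not tuned values.
    """
    before = count_errors(events, deployed_at - WINDOW, deployed_at)
    after = count_errors(events, deployed_at, deployed_at + WINDOW)
    baseline = max(before, 1)  # avoid dividing by zero on a quiet baseline
    ratio = after / baseline
    if ratio >= risk_ratio:
        return "RISK"
    if ratio >= watch_ratio:
        return "WATCH"
    return "STABLE"
```

&lt;p&gt;The sender only ships events it already has, plus the deploy marker. The counting and the comparison happen on the receiving side, which is the boundary the section argues for.&lt;/p&gt;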

&lt;h2&gt;Why the output has to be readable by agents&lt;/h2&gt;

&lt;p&gt;There is one more piece that I think matters more than it looks.&lt;/p&gt;

&lt;p&gt;The verdict has to be readable by something other than a human dashboard.&lt;/p&gt;

&lt;p&gt;In modern deploy pipelines, humans are not always the only consumer of deploy outcomes. An internal workflow might need to read the result. An agent might need a decision input. A release process might need a shared object that says what happened after the deploy, without scraping a dashboard or reading a Slack thread.&lt;/p&gt;

&lt;p&gt;A dashboard does not serve those consumers well. A structured verdict does. STABLE, WATCH, RISK as discrete values, plus affected API and decision tier. That shape can be read by an agent and handed to a policy the team owns.&lt;/p&gt;
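
&lt;p&gt;As a sketch of what "handed to a policy the team owns" might look like, the snippet below reads a verdict as JSON and maps its state to a next step. The payload shape, the "state" key, and the action names are all hypothetical; a real team would define its own policy table.&lt;/p&gt;

```python
import json

# Hypothetical team-owned policy: verdict state to pipeline action.
POLICY = {
    "STABLE": "continue_rollout",
    "WATCH": "hold_and_recheck",
    "RISK": "pause_and_page_owner",
}

# Assumed fallback: anything unreadable is treated cautiously.
FALLBACK_ACTION = "pause_and_page_owner"


def next_step(verdict_json, policy=POLICY):
    """Return the action an agent should take for a serialized verdict."""
    verdict = json.loads(verdict_json)
    state = verdict.get("state")
    # A missing or unknown state falls back to the cautious action,
    # which keeps a human in the loop.
    return policy.get(state, FALLBACK_ACTION)
```

&lt;p&gt;For example, next_step('{"state": "WATCH", "affected_api": "POST /checkout"}') selects the hold action without anyone opening a dashboard. The agent reads a discrete value, and the decision about what each value triggers stays with the team.&lt;/p&gt;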

&lt;p&gt;That second consumer is one of the reasons I stopped trying to make this look like a monitoring tool. Monitoring tools are designed for human reading. Verdict tools have to be readable by both humans and software.&lt;/p&gt;

&lt;h2&gt;Where this leaves the category&lt;/h2&gt;

&lt;p&gt;I do not think every team needs a separate post-deploy verification tool. Small teams with one engineer per deploy mostly do not. The bottleneck is not visible at that scale.&lt;/p&gt;

&lt;p&gt;The teams where it shows up are the ones where multiple services ship in parallel, where multiple humans need a shared verdict, and where workflows downstream of the deploy need to read structured output.&lt;/p&gt;

&lt;p&gt;For those teams, the post-deploy 15 minutes is not just "a feature your monitoring tool should add". It is its own category, with its own narrow contract: raw events in, structured verdict out. It does not fit cleanly inside an existing observability surface.&lt;/p&gt;

&lt;p&gt;That is the bet I have been working on.&lt;/p&gt;




&lt;p&gt;I am building Relivio, a small verdict layer for the first 15 minutes after a production deploy. If you are working on something similar, or running into the same problem, happy to talk. relivio.dev&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>observability</category>
      <category>deployment</category>
    </item>
  </channel>
</rss>
