Lazypl82

What do you actually check in the first 15 minutes after deploy?

CI passed.

The deploy finished.

Nothing is obviously broken.

And yet, for a few minutes after release, production still feels uncertain.

I think this is one of the most awkward parts of shipping software.

A deployment can be technically successful:

  • build passes
  • tests pass
  • pipeline passes
  • container starts
  • health checks look fine

But real runtime problems can still show up only after actual traffic hits the system.

That creates a weird gap between deploy success and runtime confidence.

The part that still feels manual

In a lot of smaller teams, the first few minutes after deploy still look something like this:

  • open logs
  • check recent exceptions
  • watch for error spikes
  • compare current noise with what “normal” felt like before
  • decide whether to ignore, investigate, or roll back
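The "compare current noise with normal" step above can be sketched in a few lines. This is a minimal illustration only, assuming structured log records with a `level` field; the function name and the 2x threshold are made up for the example, not taken from any real stack:

```python
# Hypothetical sketch of the manual post-deploy scan: compare the error
# rate in the window after the deploy against a pre-deploy baseline.
# Record shape ({"level": ...}) and the threshold are illustrative.

def error_spike(recent_logs, baseline_logs, threshold=2.0):
    """Return True if the recent error rate is at least `threshold`
    times the baseline error rate."""
    def error_rate(logs):
        if not logs:
            return 0.0
        errors = sum(1 for rec in logs if rec["level"] == "error")
        return errors / len(logs)

    baseline = error_rate(baseline_logs)
    recent = error_rate(recent_logs)
    # A completely clean baseline means any error at all is a spike.
    if baseline == 0.0:
        return recent > 0.0
    return recent / baseline >= threshold


baseline = [{"level": "info"}] * 98 + [{"level": "error"}] * 2
recent = [{"level": "info"}] * 90 + [{"level": "error"}] * 10

print(error_spike(recent, baseline))  # 10% vs 2% -> True
```

The point is not the arithmetic; it is that even this trivial comparison encodes a judgment call (what counts as "worse") that people currently make by eye.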

We have plenty of tools for detection.

We can detect:

  • exceptions
  • timeouts
  • retries
  • latency spikes
  • failed external API calls
  • degraded endpoints

But detection is not the same as judgment.

The real post-deploy question is usually:

Did this deploy actually make things worse?

And then right after that:

Does it need attention right now?

That second layer still feels surprisingly manual.

Why I think this matters

If you have mature release control, canary rollouts, feature flags, and a strong observability setup, that uncertainty window is probably much smaller.

But many teams do not have all of that.

And even if they do, someone still has to interpret what production is actually saying after a release.

That is the part I keep coming back to.

Not “can we collect signals?”

But:

  • which signals matter most right after deploy?
  • how do you compare them against normal behavior?
  • how do you tell noise from regression?
  • what gives you enough confidence to say “this deploy is fine”?
  • what makes you stop and investigate immediately?

What I personally care about in that window

When I think about the first 10–15 minutes after deploy, I usually care less about giant dashboards and more about a small number of judgment signals:

  • did new runtime exceptions appear?
  • did existing exception patterns get worse?
  • are failures concentrated on one service or API path?
  • is the error pattern meaningfully different from recent baseline behavior?
  • does this look transient, or does it look deploy-related?

That feels like a different problem from general monitoring.

It feels closer to post-deploy runtime diagnosis.
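To make the judgment signals above concrete, here is a rough sketch of what checking them could look like. Everything here is an assumption for illustration: "fingerprint" is just a string keying an exception pattern, and the 2x and 80% thresholds are invented, not drawn from any existing tool:

```python
from collections import Counter

# Illustrative only: events are (fingerprint, service) tuples, where a
# fingerprint stands in for an exception type plus location. Thresholds
# are made-up examples of the judgment calls described above.

def judge_deploy(baseline_events, recent_events):
    """Evaluate post-deploy exception events against a baseline window."""
    base = Counter(fp for fp, _ in baseline_events)
    now = Counter(fp for fp, _ in recent_events)

    # Did new runtime exceptions appear?
    new_patterns = [fp for fp in now if fp not in base]

    # Did existing exception patterns get worse?
    worsened = [fp for fp in now
                if fp in base and now[fp] >= 2 * base[fp]]

    # Are failures concentrated on one service?
    by_service = Counter(svc for _, svc in recent_events)
    concentrated = (
        max(by_service.values()) / sum(by_service.values()) > 0.8
        if recent_events else False
    )

    return {
        "new_patterns": new_patterns,
        "worsened": worsened,
        "concentrated": concentrated,
    }


baseline = [("TimeoutError:fetch_user", "api")] * 3
recent = [("TimeoutError:fetch_user", "api")] * 6 + \
         [("KeyError:parse_order", "api")] * 4

verdict = judge_deploy(baseline, recent)
print(verdict["new_patterns"])  # ['KeyError:parse_order']
print(verdict["worsened"])      # ['TimeoutError:fetch_user']
print(verdict["concentrated"])  # True
```

Even a sketch this naive shows why the problem is judgment rather than detection: the counting is easy, but choosing the windows, the fingerprinting, and the thresholds is where "is this deploy fine?" actually gets decided.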

Why I’m interested in this problem

This is the line of thinking that led me to start building Relivio.

The idea is narrow:
look at post-deploy runtime exception signals, compare them against baseline behavior, and return a deploy-level verdict with evidence and a recommended next step.

Not a full observability platform.
Not a rollback system.
Not a release-control layer.

Just a focused way to answer:

Is this deploy safe, or does it need attention?

I also put together a minimal FastAPI demo here:

And the main project is here:

But honestly, the more interesting part for me right now is not promotion — it is how other people actually handle this in practice.

I’d really like to hear how you do this

A few questions I’d love real answers to:

  1. What do you actually check in the first 10–15 minutes after deploy?
  2. Do you rely mostly on logs, alerts, dashboards, release views, or something else?
  3. What signal makes you think “this deploy is probably bad”?
  4. What signal makes you confident enough to leave it alone?
  5. If you already have a strong workflow for this, what does it look like?
  6. If you do not, what part still feels manual or annoying?

I am especially interested in answers from small teams and side projects, because that is where this still feels the most human and least automated.

If you think this is already solved well by your current stack, I’d like to hear that too.

And if you think this entire problem is not painful enough to deserve a dedicated tool, I’d genuinely like to know why.

Top comments (1)

Lazypl82

One way I’ve started thinking about this is:

CI tells me whether the code was allowed to ship.
Production tells me whether it should have.

That gap feels bigger than most teams admit.

A deploy can be “successful” in every mechanical sense and still quietly make runtime behavior worse under real traffic. At that point, the hard part is no longer detection, but judgment:
is this normal noise, or did the latest release actually change something important?

That’s the part I’m most interested in — the short window where the deploy is already live, the system is technically up, but human confidence still hasn’t caught up.