DEV Community

Lazypl82

The 15-minute problem: how to decide whether to roll back after deploy

Every engineer knows this feeling.

You just shipped to production. CI passed. The deploy finished clean. And now you're doing the thing nobody talks about — staring at dashboards for the next 15 minutes, waiting to see if anything breaks.

Error rate graph. Refresh. Looks okay? Maybe. Slack is quiet. Should I go back to work? What if something blows up the moment I look away?

This is the 15-minute problem. And almost every team I've talked to handles it the same way: manually, anxiously, inconsistently.


Why the first 15 minutes are different

Post-deploy is not like normal production monitoring. The questions you're asking are different:

  • Is what I'm seeing caused by this deploy, or was it already there?
  • Is this error rate actually elevated, or is it noise?
  • Should I roll back now, or give it more time?

Tools like Datadog or Grafana are great at showing you what's happening. But they don't answer the deploy-specific question: is this deploy okay or not?

You still have to decide. And without a clear framework, that decision comes down to gut feel, whoever is loudest in Slack, or just waiting until something obviously breaks.


What we tried first (and why it didn't work)

We tried setting up static alerting thresholds. "Alert if error rate > 1%." The problem is that thresholds need context. A 1% error rate might be catastrophic for one service and completely normal for another. And right after a deploy, you need to compare against the pre-deploy baseline, not some fixed number.
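For illustration, the baseline-relative check we actually wanted looks something like this (the services, rates, and thresholds here are made up, not anything from a real system):

```python
def is_elevated(baseline_rate: float, window_rate: float,
                min_ratio: float = 2.0, min_abs: float = 0.001) -> bool:
    """Flag the deploy window only if its error rate is meaningfully
    above this service's own pre-deploy baseline, not a fixed number."""
    # Services with a clean baseline need an absolute floor instead,
    # to avoid dividing by zero.
    if baseline_rate == 0:
        return window_rate >= min_abs
    return window_rate / baseline_rate >= min_ratio

# A static 1% threshold treats these two services the same way;
# a baseline-relative check does not.
payments = is_elevated(baseline_rate=0.0001, window_rate=0.009)  # 90x jump, under 1%
batch = is_elevated(baseline_rate=0.02, window_rate=0.022)       # over 1%, normal noise
```

The payments service never crosses a static 1% alert despite a 90x jump, while the batch service trips it on ordinary noise — exactly the context problem described above.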

We tried runbooks. "After deploy, check these five dashboards in this order." Runbooks are great until someone's in a hurry, or the on-call engineer is unfamiliar with the service, or it's 2am.

We tried just waiting longer before declaring a deploy stable. That helped, but it meant slower iteration and it still didn't give us a clear signal — just more time to feel anxious.


The insight: deploy-window signals are enough

Here's what we eventually realized: you don't need full observability to make a good rollback decision. You need error signal comparison — how does error behavior in the deploy window compare to what this service looked like before?

You don't need traces. You don't need logs. You don't need to know the full request path. You need to know:

  • Are new error types appearing that weren't there before?
  • Is the error rate elevated compared to baseline?
  • Are the same errors persisting across multiple observation windows?

That's a much smaller, focused problem. And it's one that can be automated.
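Those three checks are small enough to sketch as a toy classifier. To be clear: the decision rules and thresholds below are my own illustration of the idea, not the actual logic of any product:

```python
def verdict(new_error_types: int,
            baseline_rate: float,
            window_rate: float,
            windows_with_same_errors: int) -> str:
    """Toy rollback classifier over deploy-window error signals.
    Thresholds are illustrative, not tuned."""
    # Signal 2: elevated vs. this service's own baseline (with a small floor).
    elevated = window_rate > max(2 * baseline_rate, 0.001)
    # Signal 3: the same errors keep showing up across observation windows.
    persistent = windows_with_same_errors >= 3
    if new_error_types > 0 and (elevated or persistent):
        return "RISK"   # new errors, and they're not noise
    if new_error_types > 0 or elevated:
        return "WATCH"  # one weak signal: keep observing
    return "STABLE"
```

A real system would also weight severity (a single database-connection error matters more than a hundred validation errors); this only shows the shape of the decision.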


What we built: Relivio

We built Relivio to solve exactly this. It watches the first 15 minutes after a deploy and returns a verdict: STABLE, WATCH, or RISK — with the top signals that drove the decision.

The integration is three API calls:

1. Register the deploy:

curl -X POST "https://api.relivio.dev/api/v1/deployments" \
  -H "X-API-Key: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "version": "v2.4.1",
    "metadata": { "environment": "production" }
  }'

2. Send error signals during the observation window:

curl -X POST "https://api.relivio.dev/api/v1/ingest/log" \
  -H "X-API-Key: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "level": "ERROR",
    "message": "checkout failed",
    "service": "api",
    "api_path": "/api/checkout/:id"
  }'

3. Read the verdict:

curl -X GET "https://api.relivio.dev/api/v1/summaries/latest" \
  -H "X-API-Key: <YOUR_API_KEY>"

And the verdict gets posted to your Slack or Discord automatically.
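In CI, the third call can double as a deploy gate. A rough sketch — the endpoint is the one shown above, but the polling cadence is my choice, and I'm assuming the summary JSON exposes a top-level `verdict` field (check the real response schema):

```python
import json
import time
import urllib.request

API = "https://api.relivio.dev/api/v1"

def latest_verdict(api_key: str) -> str:
    """Fetch the latest summary. Assumes a top-level "verdict" field
    in the response JSON -- verify against the actual schema."""
    req = urllib.request.Request(f"{API}/summaries/latest",
                                 headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp).get("verdict", "WATCH")

def exit_code(verdict: str) -> int:
    """Fail the pipeline gate only on RISK; unknown values fail closed."""
    return {"STABLE": 0, "WATCH": 0, "RISK": 1}.get(verdict, 1)

def gate(api_key: str, polls: int = 15, interval: int = 60) -> int:
    """Poll until the observation window resolves, then return an exit code."""
    for _ in range(polls):
        v = latest_verdict(api_key)
        if v in ("STABLE", "RISK"):
            return exit_code(v)
        time.sleep(interval)
    return 0  # no decisive verdict: don't block the pipeline
```

A post-deploy pipeline step would then run something like `sys.exit(gate(os.environ["RELIVIO_API_KEY"]))`.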


What the output looks like

When a deploy is clean, you get something like this in your channel:

🟢 Deploy v2.4.1 — STABLE IMPACT
Stability Index: 100 | New error types: 0
Suggested action: Proceed

When something's wrong:

🔴 Deploy v2.4.2 — RISK IMPACT
New error types: 3 | Critical error types: 2
critical override (DatabaseConnectionError: payment replica unavail)
error rate x1228.2
Suggested action: Review now (consider rollback)

The second message is the one that matters. Instead of someone manually noticing that the error rate spiked, the decision comes to you — with the specific errors that triggered it.


What it's not

Relivio is not a replacement for full observability. If you need distributed tracing, historical log search, or infrastructure metrics, you still want Datadog or whatever you're using.

It's a focused layer on top of your existing setup, specifically for the deploy-window decision. Think of it as the automated version of that 15-minute dashboard-watching ritual — except it gives you a clear answer instead of just more data.

It also doesn't require ingesting full logs. You send error-level signals (ERROR or WARN) from your existing error handling layer — ideally from a single middleware or exception handler that already has request context. No broad logging rollout required.
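As an example of what that single-handler integration can look like, here's a framework-agnostic sketch in Python. The payload fields mirror the ingest example earlier; the function names and the fire-and-forget behavior are my own choices:

```python
import json
import urllib.request

INGEST_URL = "https://api.relivio.dev/api/v1/ingest/log"

def build_signal(exc: BaseException, service: str, api_path: str) -> dict:
    """Shape an exception into the ingest payload shown earlier."""
    return {
        "level": "ERROR",
        "message": f"{type(exc).__name__}: {exc}",
        "service": service,
        "api_path": api_path,
    }

def report(exc: BaseException, service: str, api_path: str, api_key: str) -> None:
    """Fire-and-forget: error reporting must never break the request path."""
    body = json.dumps(build_signal(exc, service, api_path)).encode()
    req = urllib.request.Request(
        INGEST_URL, data=body, method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
    except OSError:
        pass  # best-effort: drop the signal rather than fail the request
```

You'd call `report(...)` from whatever exception handler or middleware already sees your errors — one hook, not a logging rollout.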


Current availability

Relivio is live now. Free tier includes 2 projects and 100 deploy registrations per month — no credit card required.

If you're running CI/CD pipelines in production and you've felt the 15-minute anxiety, I'd genuinely love to hear how you handle it today and whether something like this would fit your workflow.

👉 relivio.dev | Slack, Discord, and generic webhook supported.

