
How DevOps Teams Use Status Aggregators to Cut Incident Response Time in Half

Originally published on API Status Check.


It's 3 AM. PagerDuty wakes you up. Error rates are spiking. You open your laptop, check Grafana β€” your services look healthy. CPU normal. Memory normal. No recent deploys. You start tailing logs, spot some connection timeouts to an upstream service, run a few curl commands, and 20 minutes later realize the root cause is that AWS us-east-1 is having issues.

Twenty minutes of your life you'll never get back, and your customer-facing services were degraded the entire time.

This is the most expensive problem in incident response that nobody talks about: wasting time investigating your own systems when the actual problem is a third-party dependency. Here's how DevOps and SRE teams use status aggregators to eliminate these false investigations entirely.

The False Investigation Problem

Industry incident-response data suggests that 30-40% of the incidents that page on-call engineers are caused by third-party dependencies, not internal systems. But the typical triage workflow assumes an internal cause first:

Standard incident triage (slow):
1. Alert fires β†’ check YOUR dashboards
2. YOUR metrics look fine β†’ start debugging
3. 15 min of log analysis β†’ find upstream timeout errors
4. Check vendor status page β†’ "oh, it's them"
5. Total time wasted: 20-45 minutes

Status-aware triage (fast):
1. Alert fires β†’ glance at status aggregator
2. See "AWS: Degraded" β†’ immediately know root cause
3. Activate runbook for AWS dependency failure
4. Total time to root cause: < 2 minutes

The difference isn't skill or tooling sophistication. It's having third-party status visible during triage.

What a Status Aggregator Does

A status aggregator pulls the operational status of dozens of services into a single view. Instead of checking 15 different status pages during an incident, you check one.

Without an Aggregator

Is it AWS?      β†’ status.aws.amazon.com
Is it GitHub?   β†’ githubstatus.com
Is it Stripe?   β†’ status.stripe.com
Is it OpenAI?   β†’ status.openai.com
Is it Datadog?  β†’ status.datadoghq.com
Is it Vercel?   β†’ vercel-status.com
Is it Twilio?   β†’ status.twilio.com
... (8 more tabs)

You're context-switching between 15 browser tabs during an active incident. Under pressure. At 3 AM.

With an Aggregator

One dashboard: apistatuscheck.com
β†’ AWS: βœ… Operational
β†’ GitHub: ⚠️ Degraded (Copilot)
β†’ Stripe: βœ… Operational
β†’ OpenAI: πŸ”΄ Major Outage
β†’ Datadog: βœ… Operational
β†’ Vercel: βœ… Operational
β†’ Twilio: βœ… Operational

One glance. Root cause identified (or ruled out) in seconds.
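
That one-glance check is also scriptable, which is handy from a terminal or a cron job. A minimal sketch in Python, assuming the same /api/status endpoint and {name, status} response shape used in the integration examples below:

# Minimal sketch: one-glance dependency check from the terminal.
# Assumes /api/status returns a list of {name, status} objects,
# as in the integration examples later in this post.
import requests

def degraded_dependencies() -> list[dict]:
    statuses = requests.get("https://apistatuscheck.com/api/status", timeout=5).json()
    return [s for s in statuses if s["status"] != "operational"]

if __name__ == "__main__":
    degraded = degraded_dependencies()
    if not degraded:
        print("✅ All monitored dependencies operational. Look internal.")
    else:
        for s in degraded:
            print(f"⚠️ {s['name']}: {s['status']}")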

Use Case #1: On-Call Triage Acceleration

The problem: On-call engineers waste the first 15-30 minutes of every incident determining whether it's internal or external.

The solution: Make third-party status the FIRST thing checked during triage.

Integrating Into Your Incident Workflow

PagerDuty / OpsGenie Integration

Set up a status webhook that enriches your alerts with dependency context:

// Incoming alert enrichment webhook (Express)
const express = require('express')
const app = express()
app.use(express.json())

app.post('/webhook/enrich-alert', async (req, res) => {
  const alert = req.body

  // Fetch current status of all critical dependencies
  const statuses = await fetch('https://apistatuscheck.com/api/status')
    .then(r => r.json())

  const degraded = statuses.filter(s => s.status !== 'operational')

  if (degraded.length > 0) {
    // Add dependency context to the alert
    alert.notes = `⚠️ DEPENDENCY STATUS:\n${
      degraded.map(d => `${d.name}: ${d.status}`).join('\n')
    }\n\nCheck these FIRST before investigating internal systems.`
  } else {
    alert.notes = 'βœ… All third-party dependencies operational. Likely internal issue.'
  }

  // Forward enriched alert
  await forwardToPagerDuty(alert)
  res.json({ enriched: true })
})

Now every alert arrives with dependency context. The on-call engineer immediately knows whether to look internally or externally.

Slack Incident Bot

# When an incident channel is created, auto-post dependency status
# (Bolt-style event handler; `slack_bot` is your Slack client instance)
import requests

@slack_bot.event("channel_created")
def on_incident_channel(event):
    channel = event["channel"]

    # Only for incident channels
    if not channel["name"].startswith("inc-"):
        return

    statuses = requests.get("https://apistatuscheck.com/api/status").json()
    degraded = [s for s in statuses if s["status"] != "operational"]

    if degraded:
        message = "πŸ” **Third-Party Status Check:**\n"
        for s in degraded:
            message += f"⚠️ **{s['name']}**: {s['status']}\n"
        message += "\n_Consider these as potential root causes._"
    else:
        message = "βœ… All monitored third-party services are operational.\n_Likely internal root cause._"

    slack_bot.post_message(channel["id"], message)

Every incident channel automatically gets a dependency health check as its first message.

Use Case #2: Change Correlation

The problem: Something broke, and you need to know: was it our deploy or an external change?

The solution: Overlay deployment events with third-party status changes.

The Timeline View

11:00 β€” Deployment v2.4.1 to production
11:02 β€” βœ… Synthetic checks pass
11:15 β€” ⚠️ Error rate increases 3x
11:15 β€” ❓ Was it the deploy or something external?

With status aggregator:
11:14 β€” πŸ”΄ OpenAI API: Elevated error rates (API Status Check alert)
11:15 β€” ⚠️ Your error rate increases 3x

Verdict: OpenAI outage, not your deploy. DO NOT ROLLBACK.

Without this correlation, the instinct is to roll back the deploy — another round of disruption for zero benefit, because the deploy wasn't the problem.
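
One way to automate that judgment call is to correlate the error spike with both your deploy time and any dependency status changes in the preceding minutes. A rough sketch, assuming a hypothetical get_status_changes(start, end) helper that returns status-change events with name, status, and timestamp fields:

# Sketch: was the spike more likely caused by a deploy or a dependency?
# get_status_changes() is a hypothetical helper returning status-change
# events ({name, status, timestamp}) for the given window.
from datetime import datetime, timedelta

def correlate(spike_at: datetime, deployed_at: datetime, window_min: int = 15) -> str:
    window_start = spike_at - timedelta(minutes=window_min)
    external = [
        e for e in get_status_changes(window_start, spike_at)
        if e["status"] != "operational"
    ]
    if external:
        culprits = ", ".join(e["name"] for e in external)
        return f"Likely external ({culprits}). Hold the rollback and verify."
    if deployed_at >= window_start:
        return "Recent deploy, no dependency incidents. Rollback is a reasonable first move."
    return "No recent deploy or dependency incidents. Keep digging internally."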

Automation: Auto-Hold Rollbacks During Dependency Outages

#!/bin/bash
# pre-rollback check β€” add to your rollback script

echo "Checking third-party dependencies before rollback..."
DEGRADED=$(curl -s https://apistatuscheck.com/api/status | \
  jq '[.[] | select(.status != "operational")] | length')

if [ "$DEGRADED" -gt "0" ]; then
  echo "⚠️  WARNING: $DEGRADED dependencies are currently degraded."
  echo "The error spike may be caused by external outages, not your deploy."
  curl -s https://apistatuscheck.com/api/status | \
    jq '.[] | select(.status != "operational") | "\(.name): \(.status)"'
  echo ""
  read -p "Still want to rollback? (y/N) " confirm
  [ "$confirm" != "y" ] && exit 1
  echo "Proceeding with rollback despite degraded dependencies..."
else
  echo "✅ All dependencies operational. Proceeding with rollback..."
fi

Use Case #3: SLA Tracking and Error Budgets

The problem: Your SLOs count ALL downtime, including time caused by third-party failures. This burns your error budget unfairly.

The solution: Track dependency-caused downtime separately.

Separating Internal vs. External Downtime

# Incident classification helper
from datetime import datetime

class IncidentClassifier:
    def classify(self, incident_start: datetime, incident_end: datetime) -> dict:
        """Check if any dependencies were down during the incident window."""

        # Query API Status Check for status changes during incident
        statuses = self.get_status_history(incident_start, incident_end)

        degraded_deps = [
            s for s in statuses 
            if s['status'] != 'operational'
            and s['timestamp'] >= incident_start
        ]

        if degraded_deps:
            return {
                'classification': 'external',
                'root_cause': degraded_deps[0]['name'],
                'exclude_from_slo': True,
                'evidence': degraded_deps,
            }

        return {
            'classification': 'internal',
            'exclude_from_slo': False,
        }

Monthly SLA Report

## Uptime Report β€” January 2026

### Raw Uptime: 99.87% (57 min downtime)

### Breakdown:
| Incident | Duration | Root Cause | Classification |
|----------|----------|-----------|---------------|
| Jan 5 02:15 | 23 min | AWS us-east-1 | External ❌ |
| Jan 12 14:30 | 12 min | Database migration | Internal βœ… |
| Jan 19 09:45 | 22 min | Stripe API outage | External ❌ |

### Adjusted Uptime (excluding external): 99.97% (12 min)
### Error Budget Remaining: 82%
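
The arithmetic behind the adjusted figure is straightforward once every incident carries a classification. A sketch using the January numbers from the table above:

# Sketch: raw vs. adjusted uptime from classified incidents (January figures above).
incidents = [
    {"cause": "AWS us-east-1", "minutes": 23, "classification": "external"},
    {"cause": "Database migration", "minutes": 12, "classification": "internal"},
    {"cause": "Stripe API outage", "minutes": 22, "classification": "external"},
]
total_minutes = 31 * 24 * 60  # January

raw_downtime = sum(i["minutes"] for i in incidents)
internal_downtime = sum(i["minutes"] for i in incidents if i["classification"] == "internal")

print(f"Raw uptime: {100 * (1 - raw_downtime / total_minutes):.2f}%")           # ~99.87%
print(f"Adjusted uptime: {100 * (1 - internal_downtime / total_minutes):.2f}%")  # ~99.97%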

This data is invaluable for:

  • Accurate SLO reporting to leadership
  • Error budget conversations with product teams
  • Vendor accountability meetings
  • Architecture decisions (which dependencies need fallbacks)

Use Case #4: Proactive Communication

The problem: Customers report issues before your team knows about them. Looks unprofessional.

The solution: Detect dependency issues and communicate proactively.

Auto-Status Page Updates

// When a critical dependency goes down, update YOUR status page
webhookHandler.on('dependency_degraded', async (event) => {
  if (event.severity === 'critical') {
    await updateStatusPage({
      component: mapDependencyToComponent(event.service),
      status: 'degraded_performance',
      message: `We're aware of issues affecting ${event.feature}. ` +
               `This is caused by an upstream provider issue. ` +
               `We're monitoring the situation and will update when resolved.`
    })

    // Also post to your Twitter/social
    await postToTwitter(
      `We're aware some users are experiencing issues with ${event.feature}. ` +
      `Our team is monitoring the situation. Updates to follow.`
    )
  }
})

Your customers see you're on top of it BEFORE they have to report it. That's the difference between "their team is great" and "they never know what's going on."

Use Case #5: Capacity Planning and Vendor Selection

The problem: You're choosing between two vendors and need reliability data beyond marketing claims.

The solution: Track actual uptime data for comparison.

Vendor Comparison Dashboard

Monitor competing services side-by-side:

## AI Provider Reliability β€” Q4 2025

| Provider | Uptime | Incidents | Avg MTTR | Longest Outage |
|----------|--------|-----------|----------|----------------|
| OpenAI | 99.82% | 8 | 23 min | 2h 15m |
| Anthropic | 99.94% | 3 | 15 min | 45m |
| Google Gemini | 99.89% | 5 | 18 min | 1h 20m |

Recommendation: Primary = Anthropic, Fallback = Gemini

This data-driven approach to vendor selection beats "their marketing says 99.99% uptime" every time.
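
If you export the underlying incident records (start and end times per provider), the comparison table reduces to a few lines of arithmetic. A sketch, assuming a simple list of incidents per provider:

# Sketch: uptime, incident count, average MTTR, and longest outage for one
# provider, from a list of {"start": datetime, "end": datetime} records.
from datetime import datetime

QUARTER_MINUTES = 92 * 24 * 60  # roughly one quarter

def provider_stats(incidents: list[dict]) -> dict:
    durations_min = [(i["end"] - i["start"]).total_seconds() / 60 for i in incidents]
    downtime_min = sum(durations_min)
    return {
        "uptime_pct": round(100 * (1 - downtime_min / QUARTER_MINUTES), 2),
        "incidents": len(incidents),
        "avg_mttr_min": round(downtime_min / len(incidents), 1) if incidents else 0.0,
        "longest_outage_min": round(max(durations_min)) if durations_min else 0,
    }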

Setting Up a Status Aggregator for Your Team

5-Minute Quick Start

  1. Go to apistatuscheck.com β€” all your dependencies on one dashboard
  2. Set up Slack/Discord alerts via integrations β€” route to #incidents or #dependencies
  3. Bookmark the dashboard β€” make it the first thing you check during incidents

For Deeper Integration

  1. Use the API to enrich PagerDuty/OpsGenie alerts with dependency context
  2. Add status badges to your internal wiki and dashboards
  3. Subscribe to the RSS feed in your team's tooling (see the polling sketch after this list)
  4. Build the pre-rollback check into your deployment scripts
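
For the RSS route, a tiny poller is enough to pipe incident updates into whatever tooling you already run. A sketch using feedparser, with a placeholder feed URL (use the one published on the site):

# Sketch: poll an incident RSS/Atom feed and print entries not seen before.
# FEED_URL is a placeholder; use the feed URL published on the site.
import feedparser

FEED_URL = "https://apistatuscheck.com/feed.xml"  # placeholder
seen: set[str] = set()

def poll_once() -> None:
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        key = entry.get("id", entry.get("link", entry.get("title", "")))
        if key and key not in seen:
            seen.add(key)
            print(f"[{entry.get('published', 'n/a')}] {entry.get('title', '')}")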

Cultural Change

The hardest part isn't technical β€” it's changing the triage habit:

OLD: Alert β†’ Check OUR dashboards β†’ Debug OUR code β†’ 20 min later check vendor
NEW: Alert β†’ Check dependencies (10 seconds) β†’ Then check OUR systems

Post this in your incident channel: "Before debugging anything, check third-party status first." It's the single highest-leverage change you can make to your incident response process.

The ROI of Status Aggregation

| Metric | Before | After |
|--------|--------|-------|
| Median MTTR | 34 min | 12 min |
| False investigations/month | 6-8 | 0-1 |
| On-call satisfaction | 😩 | 😐 |
| Unnecessary rollbacks/quarter | 2-3 | 0 |
| Time spent on vendor issues | 8-16 hrs/month | 1-2 hrs/month |

The tools are free. The habit change takes a week. The time saved is measured in engineering-hours per month.


API Status Check aggregates status for 100+ APIs in one dashboard. Set up free alerts at apistatuscheck.com.
