TechLogStack

Originally published at techlogstack.com

Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

Slack · Reliability · 17 May 2026

73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. Eighteen months later, customer impact hours were down 90%.

  • 73% incidents from own deploys
  • 90% reduction in impact hours
  • Manual → automatic rollbacks
  • Jan 2025 lowest impact ever
  • Peak impact: Feb–Apr 2024
  • 18 months of sustained investment

The Story

It's mid 2023 and we've identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward.

Slack Engineering, via 'Deploy Safety: Reducing customer impact from change', slack.engineering

Slack's reliability story in 2023 had an uncomfortable truth at its center: the biggest source of customer-facing incidents was not external infrastructure failures, traffic spikes, or adversarial attacks. It was Slack's own code deploys. A measurement of the incident dataset showed that 73% of customer-facing incidents were change-triggered, primarily by code deployments. That number reframes the reliability problem entirely. You can harden infrastructure, add redundancy, and build better monitoring, but if most incidents are self-inflicted, the highest-leverage intervention is improving how you ship code.

📊

Slack operates in a software engineering environment with hundreds of internal services and many different deployment systems and practices. Before the Deploy Safety Program, the approach to reliability was often service-specific — individual teams improving their own deploy practices independently, without a coordinated program tracking the systemic impact.

The Deploy Safety Program began in mid-2023 with a key insight: measuring reliability improvement by waiting for incidents to occur creates a long feedback loop that is difficult to optimize. The team shifted to a leading-indicator metric, customer impact hours from high-severity change-triggered incidents, that could be tracked continuously without waiting for the next major outage. This metric served as the north star throughout the 18-month program, allowing the team to see improvement (or regression) before it showed up in annual availability reports. The metric was only loosely connected to any individual customer's experience, but it was directionally correct and defensible enough to drive engineering prioritization.
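
As a rough illustration of how such a metric could be computed, the sketch below sums impact hours per month for high-severity, change-triggered incidents and reports the change-triggered share of all incidents. The `Incident` fields, the severity labels, and the sample data are assumptions made up for this example; the source does not describe Slack's actual schema or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    resolved: datetime
    severity: str           # hypothetical labels: "high", "medium", "low"
    change_triggered: bool  # True if a deploy or other change set it off


def impact_hours_by_month(incidents):
    """Sum impact hours per calendar month for high-severity,
    change-triggered incidents (keyed by the month the incident started)."""
    totals = defaultdict(float)
    for inc in incidents:
        if inc.severity == "high" and inc.change_triggered:
            hours = (inc.resolved - inc.started).total_seconds() / 3600
            totals[inc.started.strftime("%Y-%m")] += hours
    return dict(totals)


def change_triggered_share(incidents):
    """Fraction of all incidents that were triggered by a change."""
    if not incidents:
        return 0.0
    return sum(inc.change_triggered for inc in incidents) / len(incidents)


# Made-up sample data, for illustration only.
t0 = datetime(2024, 3, 1, 9, 0)
SAMPLE = [
    Incident(t0, t0 + timedelta(minutes=90), "high", True),
    Incident(t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=30), "high", True),
    Incident(t0 + timedelta(days=3), t0 + timedelta(days=3, minutes=45), "high", False),
    Incident(t0 + timedelta(days=40), t0 + timedelta(days=40, minutes=15), "high", True),
]

print(impact_hours_by_month(SAMPLE))   # {'2024-03': 2.0, '2024-04': 0.25}
print(change_triggered_share(SAMPLE))  # 0.75
```

Because the totals can be recomputed whenever an incident is closed, a number like this can be charted weekly or monthly instead of waiting for an annual availability report.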

Problem

73% of Incidents Self-Inflicted by Deploys

Slack measured that 73% of customer-facing incidents were triggered by change — primarily code deployments across the hundreds of internal services. Manual remediation processes (engineers detecting issues, investigating, deciding to roll back) added latency between deploy and recovery. Interruptions exceeding 10 minutes were disproportionately damaging to customer trust.


Cause

Manual Detection and Remediation Too Slow

The existing approach relied on engineers detecting deploy-related regressions from monitoring dashboards and making manual rollback decisions. This added human latency — the time to be paged, the time to investigate, the time to decide — to every incident. At Slack's deploy frequency, across hundreds of services, the accumulated human latency was significant.


Solution

Automatic Rollbacks + Safety Culture

The Deploy Safety Program introduced automated rollback triggers: when deploy-time metrics crossed defined thresholds, a rollback was automatically initiated without waiting for engineer intervention. This removed the human latency from the most common incident recovery path. The program also invested in safety culture across engineering teams — normalizing rollbacks as the right response rather than a failure indicator.


Result

90% Reduction by January 2025

Customer impact hours were down 90% from peak by January 2025, with the trend continuing downward. The peak of impact occurred between February and April 2024 — before automatic rollbacks were introduced. Once automatic rollbacks were live, the data showed dramatic improvement.


THE LEADING INDICATOR STRATEGY

A core innovation of the Deploy Safety Program was measuring customer impact hours as a leading indicator rather than waiting for annual availability figures. This gave the engineering team a metric they could see moving week-over-week, track against program milestones, and use to evaluate whether specific projects were actually improving reliability. Without the right metric, improvement programs are optimizing blind.

⚠️

The Communication Challenge: Non-Linear Progress

The program's progress chart showed non-linear improvement — the first quarter of work showed reduction before any code changes were deployed, just from communicating the program's goals to engineering teams. Then came a peak of impact in early 2024 before automatic rollbacks were in place. This non-linearity made it challenging to communicate progress to stakeholders who expected a smooth downward line. The team maintained confidence in the work based on leading metrics even when trailing metrics hadn't yet reflected it.

Slack's customer base had grown to treat Slack as mission-critical infrastructure — the same expectation applied to email or calendar, not messaging apps. This raised the stakes for deploy-related interruptions: an interruption that users would have tolerated as 'a blip' in 2018 was now disruptive to workflows, team meetings, and customer communications. The business context transformed the engineering mandate: deploy safety was not just a reliability metric, it was a retention metric. The Deploy Safety Program was not built in a vacuum — it was built in response to explicit customer feedback that interruptions had become more costly.

🤖

Automatic vs Manual Remediation: The Latency Gap

Manual remediation of a deploy-related incident requires: alert fires → engineer is paged → engineer investigates → engineer diagnoses → engineer decides to roll back → engineer executes rollback. Each step adds minutes. Automatic rollback collapses this to: metric threshold crossed → rollback initiated. For the most common class of deploy-related incidents, this gap is often the difference between a sub-10-minute blip and a 30-minute customer-impacting incident.
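
The arithmetic behind that gap can be sketched in a few lines. Every duration below is a hypothetical placeholder, chosen only so that the totals land near the 30-minute and sub-10-minute figures mentioned above; none of them are measurements from Slack.

```python
# Hypothetical step durations in minutes (assumptions, not measured values).
MANUAL_STEPS = {
    "alert fires": 2,
    "engineer is paged and responds": 5,
    "engineer investigates": 10,
    "engineer diagnoses": 5,
    "engineer decides to roll back": 3,
    "engineer executes rollback": 5,
}

AUTOMATIC_STEPS = {
    "metric threshold crossed": 2,
    "rollback initiated and completed": 5,
}


def time_to_recovery(steps):
    """Total minutes from regression to recovery for a sequential path."""
    return sum(steps.values())


def latency_gap(manual, automatic):
    """Minutes of customer impact removed by automating the path."""
    return time_to_recovery(manual) - time_to_recovery(automatic)


print(time_to_recovery(MANUAL_STEPS))              # 30
print(time_to_recovery(AUTOMATIC_STEPS))           # 7
print(latency_gap(MANUAL_STEPS, AUTOMATIC_STEPS))  # 23
```

The point of the model is structural rather than numeric: the manual path has four human steps that the automatic path removes entirely, so the saving grows with however long people take to respond.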

ℹ️

The 10-Minute Threshold

Slack's data showed that customer tolerance for interruptions changed significantly at around 10 minutes. Shorter interruptions were treated as acceptable blips; longer ones were treated as incidents that impacted workflows and generated support tickets. Designing automatic rollbacks to trigger fast enough to resolve most issues within the 10-minute window became a key design constraint for the program.
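
One way to treat the 10-minute window as a design constraint is to budget the worst-case time from regression to completed rollback. The sketch below is a simplified model under stated assumptions: polling at a fixed interval, a required number of consecutive bad checks, and a fixed rollback duration. The class name, fields, and sample numbers are hypothetical, and the model ignores metric aggregation lag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackBudget:
    check_interval_s: int     # how often health metrics are polled
    breaches_required: int    # consecutive bad checks before acting
    rollback_duration_s: int  # time for the rollback itself to finish

    def worst_case_recovery_s(self) -> int:
        # A regression that begins just after a check is first observed one
        # full interval later; each further required breach costs another
        # interval. The rollback duration is added on top.
        detection = self.check_interval_s * self.breaches_required
        return detection + self.rollback_duration_s

    def fits(self, limit_s: int = 600) -> bool:
        """True if worst-case recovery stays within the limit (default 10 min)."""
        return self.worst_case_recovery_s() <= limit_s


fast = RollbackBudget(check_interval_s=30, breaches_required=3, rollback_duration_s=300)
slow = RollbackBudget(check_interval_s=120, breaches_required=3, rollback_duration_s=300)

print(fast.worst_case_recovery_s(), fast.fits())  # 390 True
print(slow.worst_case_recovery_s(), slow.fits())  # 660 False
```

Under this model, a slow rollback leaves little room for cautious detection, which is why detection speed and rollback speed have to be tuned together.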

The Manual Process Bottleneck at Scale

Before automatic rollbacks, Slack's incident response for deploy-related regressions required a person at every step: the engineer wakes up, opens a laptop, pulls up dashboards, assesses severity, determines that the recent deploy is the likely cause, decides to roll back, and executes the rollback. With hundreds of services, each deploying multiple times a day, this process was running dozens of times per week. Each execution required a human. Each human introduced minutes of latency. The process was not designed for the frequency at which it was being invoked.

One structural challenge the program navigated was the gap between program metric and individual project metric. A specific engineering project might reduce rollback time by 50% on one service — but how much does that move the top-line customer impact hours metric? The relationship is indirect and involves statistical noise from incident timing, severity distribution, and Slack's traffic patterns. Teams that couldn't see a direct line from their work to the program metric risked losing motivation. The solution was maintaining both program-level and project-level metrics and being explicit about how they connected — even when the connection was indirect.


The Fix

The Deploy Safety Program: Engineering + Culture

The Deploy Safety Program was not purely a technical program. Its first-quarter improvements came from communication: telling engineering teams what the metric was, why it mattered, and what behaviors were contributing to it. The technical work (automated alerts, automatic rollbacks, improved deploy signals) came later. This sequencing is important: culture change came before code change, and the culture change produced measurable improvement even before the tooling was in place. Engineers who understood that their deploys were the primary source of customer impact made better decisions about when to deploy, what size deploys to ship, and when to roll back.

  • 73% — Fraction of customer-facing incidents triggered by Slack's own code changes — the measurement that transformed reliability from an infrastructure problem into a deployment problem
  • 90% — Reduction in customer impact hours from peak (Feb–Apr 2024) to January 2025 — the headline outcome of the 18-month Deploy Safety Program
  • Q1 — Quarter of work when improvement appeared — before any code changes — purely from communicating the program goals to engineering teams and surfacing the metric
  • Auto — Rollback execution mode after the program's key technical milestone — removing human latency from the most common incident recovery path

# Simplified deploy safety automatic rollback logic (illustrative sketch).
# The real implementation uses Slack's internal deploy orchestration system.
# deploy_orchestrator and pagerduty stand in for internal clients, and the
# threshold values below are placeholder assumptions, not Slack's settings.
# _capture_pre_deploy_metrics, _get_current_metrics, and _mark_deploy_stable
# are omitted; they would wrap the metrics backend and the deploy system.
import time

ERROR_RATE_THRESHOLD = 2.0   # assumed multiplier over the baseline error rate
LATENCY_THRESHOLD = 1.5      # assumed multiplier over the baseline p99 latency
CHECK_INTERVAL_SECONDS = 30  # how often to poll during the bake period


class DeploySafetyMonitor:
    def __init__(self, service: str, deploy_id: str):
        self.service = service
        self.deploy_id = deploy_id
        # Snapshot of healthy metrics taken before the deploy starts
        self.baseline = self._capture_pre_deploy_metrics()

    def monitor_post_deploy(self, window_minutes: int = 10):
        """Monitor service health after a deploy.
        Automatically roll back if metrics regress beyond thresholds."""
        # monotonic() is unaffected by wall-clock adjustments
        deadline = time.monotonic() + window_minutes * 60

        while time.monotonic() < deadline:
            current = self._get_current_metrics()

            # Check error rate regression
            if current.error_rate > self.baseline.error_rate * ERROR_RATE_THRESHOLD:
                self._automatic_rollback(
                    reason=f"Error rate {current.error_rate:.1%} exceeded "
                           f"threshold (baseline: {self.baseline.error_rate:.1%})"
                )
                return  # rollback initiated, no human required

            # Check p99 latency regression
            if current.p99_latency > self.baseline.p99_latency * LATENCY_THRESHOLD:
                self._automatic_rollback(
                    reason=f"p99 latency {current.p99_latency}ms exceeded threshold"
                )
                return

            time.sleep(CHECK_INTERVAL_SECONDS)

        # Monitoring window passed: deploy is baked, mark stable
        self._mark_deploy_stable(self.deploy_id)

    def _automatic_rollback(self, reason: str):
        deploy_orchestrator.rollback(self.deploy_id)
        # P2, not P1: the rollback itself is the mitigation
        pagerduty.notify(
            severity='P2',
            message=f'Auto-rollback: {self.service} {self.deploy_id}\n{reason}',
        )

SAFETY CULTURE: NORMALIZING ROLLBACKS

One of the cultural investments of the program was normalizing rollbacks as the correct first response to a deploy-related regression, not as a failure to be avoided. Previously, some teams would try to forward-fix a regression (deploy a fix) rather than roll back. Forward-fixing maintains customer impact during the investigation and fix cycle. Rolling back immediately reduces customer impact to near-zero, then gives engineers the time and calm to properly understand and fix the issue. Rollback is not defeat — it's the right call.

The February–April 2024 Peak: A Forcing Function

The peak of customer impact hours in early 2024 — before automatic rollbacks were fully deployed — actually served as a forcing function for the program. It demonstrated to engineering leadership that the program's investment was justified, accelerated resources toward the automatic rollback work, and showed that manual remediation was insufficient at Slack's deploy frequency. Sometimes the worst period in a reliability program is the moment that unlocks the resources to fix it.

ℹ️

Agentforce Integration: Reducing the 10-Minute Threshold

Slack's blog post notes that the introduction of Agentforce in 2025 raised customer expectations further — Slack being used as an AI-assisted work tool made even shorter interruptions more disruptive. This ongoing expectation evolution means the Deploy Safety Program's work is continuous: the 10-minute acceptable interruption threshold will continue to shrink as Slack becomes more tightly integrated into customer workflows.

The Deploy Safety Program faced a fundamental measurement challenge. Customer impact hours moves faster than annual availability figures, but relative to any single project it is still a trailing indicator: it reflects outcomes that have already occurred, so it cannot give engineers real-time feedback on whether a specific change is working. The team supplemented the program metric with leading indicators specific to each project (deploy alert precision, rollback rate, manual rollback to auto-rollback conversion rate) that gave faster feedback on whether individual investments were on track. The relationship between the program metric and any individual project metric is always indirect, but tracking both gave the team the full picture.
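
A minimal sketch of how such project-level indicators might be derived from rollback records follows. The source names the indicators but not their formulas, so the definitions here are assumptions: precision is taken as the share of automatic rollbacks later confirmed as real regressions, rollback rate as rollbacks per deploy, and conversion as the share of rollbacks that were automatic. `RollbackEvent` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RollbackEvent:
    automatic: bool      # True if triggered by automation rather than a person
    true_positive: bool  # True if review confirmed a real regression


def project_indicators(events, total_deploys):
    """Compute three illustrative leading indicators.
    Returns None for any ratio whose denominator is zero."""
    auto = [e for e in events if e.automatic]
    return {
        "alert_precision": (
            sum(e.true_positive for e in auto) / len(auto) if auto else None
        ),
        "rollback_rate": len(events) / total_deploys if total_deploys else None,
        "auto_rollback_share": len(auto) / len(events) if events else None,
    }


# Made-up sample: four automatic rollbacks (one false alarm), one manual.
events = [
    RollbackEvent(automatic=True, true_positive=True),
    RollbackEvent(automatic=True, true_positive=True),
    RollbackEvent(automatic=True, true_positive=True),
    RollbackEvent(automatic=True, true_positive=False),
    RollbackEvent(automatic=False, true_positive=True),
]

print(project_indicators(events, total_deploys=100))
# {'alert_precision': 0.75, 'rollback_rate': 0.05, 'auto_rollback_share': 0.8}
```

Unlike customer impact hours, these ratios change with every deploy, so a team can tell within days whether a tuning change helped.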

📈

Non-Linear Progress: Why It Looked Like It Wasn't Working

The program's impact chart showed that the first quarter produced improvement before any code shipped, from communication alone. Then came the peak of impact in early 2024, before automatic rollbacks were deployed, which suggested things were getting worse. Then came a dramatic drop after automatic rollbacks went live. This non-linear curve is common in reliability programs: communication changes behavior, new tooling is built but not yet in full effect, then the tooling deploys and impact drops sharply. Reading the curve correctly requires understanding what was deployed when.


Architecture

Slack's deploy safety architecture evolved from a manual-first system to an automated-first system over the 18 months of the program. The before state: engineers deploy, monitoring alerts fire, engineers investigate, engineers decide to roll back, engineers execute rollback. The after state: engineers deploy, monitoring compares against pre-deploy baseline, automatic rollback fires if thresholds are crossed, engineer is paged with context after the rollback has already happened. The human is in the loop — but as a reviewer of an automated decision, not as a prerequisite to recovery.

Before: Manual Deploy Remediation Path

Interactive diagram available on TechLogStack.

After: Automatic Deploy Rollback Path

Interactive diagram available on TechLogStack.

THE MEASUREMENT SYSTEM IS THE PROGRAM

The Deploy Safety Program's most lasting contribution may not be the automatic rollback tooling — it's the measurement framework. Customer impact hours from change-triggered incidents, tracked continuously, with attribution to specific deploy events and services, created visibility that had not previously existed. Engineering teams could see, for the first time, which services and which deploy patterns were contributing most to customer impact. That visibility drove behavior change even before the tooling changed.
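
To make the attribution idea concrete, here is a small sketch that totals impact hours per service and ranks the largest contributors. The service names and hours are invented for illustration, and the input shape (service, impact hours) is an assumption rather than anything the source specifies.

```python
from collections import Counter


def rank_services_by_impact(incidents, top_n=3):
    """Total impact hours per service and return the top contributors.

    `incidents` is an iterable of (service_name, impact_hours) pairs.
    """
    totals = Counter()
    for service, hours in incidents:
        totals[service] += hours
    return totals.most_common(top_n)


# Made-up sample data with hypothetical service names.
incidents = [
    ("webapp", 1.5),
    ("search", 0.5),
    ("webapp", 2.0),
    ("files", 0.25),
    ("search", 0.25),
]

print(rank_services_by_impact(incidents, top_n=2))
# [('webapp', 3.5), ('search', 0.75)]
```

A ranking like this is what lets a program direct effort at the handful of services that account for most of the impact.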

ℹ️

Hundreds of Services: The Coordination Challenge

Slack's deploy environment includes hundreds of internal services with different deployment systems and practices. Rolling out deploy safety monitoring and automatic rollbacks across this heterogeneous environment required a program-level coordination effort — not just a single engineering team making changes to one service. Each service team needed to integrate with the monitoring framework, validate their specific alert thresholds, and adopt the rollback automation. The program's organizational structure was designed to support this breadth.

⚠️

The Threshold Calibration Problem

Automatic rollbacks require precise threshold calibration — thresholds too sensitive trigger unnecessary rollbacks on normal traffic variance, eroding engineer trust in the system. Thresholds too loose miss real regressions. Slack's approach was per-service threshold calibration based on historical metric variance, with ongoing tuning as services' traffic patterns evolved. This calibration work is ongoing — it doesn't end when the automation is deployed. Getting thresholds wrong in either direction undermines the entire program.
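
One common way to calibrate a threshold from historical variance is mean plus a multiple of the standard deviation, combined with a requirement that several consecutive samples breach it before acting. The sketch below uses that technique as an assumption; the source does not say which statistical method Slack uses, and the multiplier `k`, the `consecutive` count, and the sample values are illustrative.

```python
from statistics import mean, stdev


def calibrate_threshold(history, k=3.0):
    """Return mean + k * sample standard deviation of historical values.

    A larger k tolerates more normal variance (fewer false rollbacks) at the
    cost of missing smaller regressions.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return mean(history) + k * stdev(history)


def should_roll_back(samples, threshold, consecutive=3):
    """True only if the most recent `consecutive` samples all exceed the
    threshold, so a single noisy reading does not trigger a rollback."""
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)


# Made-up error rates for one hypothetical service.
history = [0.010, 0.012, 0.008, 0.011, 0.009]
threshold = calibrate_threshold(history)  # roughly 0.0147

print(should_roll_back([0.01, 0.02, 0.02, 0.02], threshold))  # True
print(should_roll_back([0.01, 0.01, 0.05], threshold))        # False: one spike
```

Because each service has its own history, rerunning the calibration on a rolling window is one way to keep thresholds in step with changing traffic patterns.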


Lessons

Slack's Deploy Safety Program is a model for how to turn a vague reliability problem ('we have too many incidents') into a concrete engineering program with measurable outcomes. The lessons apply to any team where self-inflicted incidents are the primary reliability drain.

  1. Measure what's causing incidents before investing in what might fix them. Slack's discovery that 73% of incidents were change-triggered completely reframed their reliability investment. Without measurement, they might have invested in infrastructure redundancy and network hardening while the primary driver, their own deploys, continued unchecked.
  2. Trailing metrics (metrics that measure outcomes after they occur, like annual availability or incident count) tell you how things went. Leading metrics (metrics that indicate direction of travel before outcomes are fully visible, like incident rate per deploy or rollback frequency) tell you whether what you're doing is working. Run both. Use the leading metrics to steer the program, and the trailing metrics to confirm you've arrived.
  3. Culture change can produce measurable improvement before code changes do. Slack saw improvement in the first quarter of the program from communication alone, before any technical work shipped. When engineers understand what behavior is costing customers, many of them change their behavior voluntarily. Don't skip the cultural investment in favor of jumping straight to tooling.
  4. Automatic rollbacks are not a replacement for good engineering. They are a safety net that reduces the cost of imperfect engineering. Every team ships bugs; the question is how quickly the system detects and recovers from them. Automatic rollbacks compress the detection-to-recovery time from tens of minutes to seconds, dramatically reducing customer impact for the most common class of incidents.
  5. Rollback is the correct first response to a deploy regression. Forward-fixing maintains customer impact during investigation. Rolling back immediately restores service, then gives engineers the time and safety to understand the issue properly. Normalizing rollback as the correct response, not a failure, is as important as building the tooling to do it automatically.

The Compounding Return

By January 2025, customer impact hours were at the lowest level ever recorded and continuing to trend downward. The Deploy Safety Program's investments compound: automatic rollbacks reduce impact per incident, better alert precision reduces false positives, safety culture reduces the frequency of reckless deploys. Each improvement makes the next improvement more effective.

⚠️

The Attribution Problem

Not all incidents attributed to 'change-triggered' were definitively proven to be caused by the deploy. Some correlations were timing coincidences. The Deploy Safety Program accepted some measurement noise in the metric in exchange for the simplicity of a clear, attributable signal. A useful metric with some noise is more actionable than a perfect metric that takes too long to compute. The team was explicit about this tradeoff in their communications.
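
A timing-based attribution rule makes this tradeoff easy to see. The sketch below attributes an incident to the most recent deploy that landed within a fixed window before it started. The 30-minute window and the function name are assumptions for illustration; the source does not describe Slack's actual attribution logic.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def attribute_to_deploy(
    incident_start: datetime,
    deploy_times: Iterable[datetime],
    window: timedelta = timedelta(minutes=30),
) -> Optional[datetime]:
    """Return the most recent deploy within `window` before the incident,
    or None if no deploy falls in that window.

    This is a heuristic: an unrelated deploy that happens to land in the
    window is still blamed, which is the measurement noise described above.
    """
    candidates = [
        d for d in deploy_times
        if timedelta(0) <= incident_start - d <= window
    ]
    return max(candidates) if candidates else None


deploys = [
    datetime(2024, 3, 1, 10, 0),
    datetime(2024, 3, 1, 10, 20),
    datetime(2024, 3, 1, 12, 0),
]

print(attribute_to_deploy(datetime(2024, 3, 1, 10, 25), deploys))  # 2024-03-01 10:20:00
print(attribute_to_deploy(datetime(2024, 3, 1, 11, 30), deploys))  # None
```

Widening the window catches slow-burning regressions but blames more coincidental deploys; narrowing it does the opposite. Either way the rule is cheap to compute, which is what keeps the metric timely.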

Slack discovered that the biggest threat to Slack's reliability was Slack deploying Slack — which is either a very enlightened finding or a very embarrassing one, depending on how you look at it.

TechLogStack — built at scale, broken in public, rebuilt by engineers


This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
