DEV Community

TechLogStack

Originally published at techlogstack.com

How a Two-Line Script Silently Deleted 883 Customer Cloud Sites

Atlassian · Reliability · 17 May 2026

At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.

  • 883 sites deleted
  • 14 days max outage
  • 775 customers affected
  • 450+ engineers mobilized
  • ≤5 min data loss (1-hour RPO met)
  • 2 restoration approaches

The Story

The script we used provided both the 'mark for deletion' capability used in normal day-to-day operations (where recoverability is desirable), and the 'permanently delete' capability that is required to permanently remove data when required for compliance reasons.

Sri Viswanath, CTO, via Atlassian Engineering Blog, April 2022

In 2021, Atlassian completed the acquisition and integration of a standalone app called Insight – Asset Management into Jira Service Management as native functionality. The standalone version was now obsolete and needed to be retired from the 200,000+ customer cloud sites that had it installed. An engineering team wrote a cleanup script using an existing deletion process — nothing unusual, nothing new. What happened next would become the longest and most public cloud outage in Atlassian's history. The seeds were sown not in a line of code, but in a conversation between two teams separated by function, timezone, and context.

The deletion API that powered the script accepted two types of identifiers: app IDs to remove a specific product installation, and site IDs to remove an entire customer workspace. Both were valid inputs. The API assumed the caller knew which they were passing and offered no type-checking, no confirmation prompt, no dry-run mode. The team requesting the deletion provided the IDs of the cloud sites where the app was installed, not the IDs of the app instances themselves. The executing team, receiving a list of IDs and a known-good script, ran it. Soft delete (a reversible deletion that marks data for removal but retains it in backup for a grace period, allowing recovery) was not used; the script took the permanent deletion path instead. The script ran from 07:38 to 08:01 UTC on April 5th, 2022.

The entire deletion ran in just 23 minutes. Because it executed through standard provisioning workflows, Atlassian's internal monitoring detected nothing — the system behaved exactly as designed. The first signal of disaster came not from dashboards, but from a customer support ticket filed at 07:46 UTC.

Problem

Silent Deletion

At 07:38 UTC, the cleanup script begins sequentially deleting sites from a list of 883 IDs. Because deletions pass through the standard Cloud Provisioner workflow — the same pathway used for day-to-day operations — internal monitoring fires no alert. At 07:46 UTC, the first customer support ticket arrives: Jira, Confluence, Opsgenie, and Statuspage are all unreachable.


Cause

Wrong IDs, No Guard

At 08:53 UTC, engineers confirm the link between the script run and the deletions. The communication gap is clear: team A passed site IDs (unique identifiers for an entire customer workspace containing all their Atlassian products) instead of app IDs to team B. The deletion API, designed to accept both types without validation, assumed correctness. The script used the permanent delete path, not the soft-delete path, meaning no data was retained in recoverable staging — it was gone from production immediately.


Solution

Two Approaches to Rebuild

Restoration 1 — creating brand-new sites and migrating data across — took approximately 48 hours per batch and required 70 sequential steps including re-mapping immutable Cloud IDs across every third-party ecosystem app. On April 9th, the team proposed Restoration 2: re-creating records using the original site identifiers, cutting the process to ~30 steps and ~12 hours per site. An engineering-wide code freeze was imposed on April 8th to eliminate risk of compounding incidents.


Result

Full Restoration, Hard Lessons

The final affected customer was restored on April 18th — 13 days after the incident began. Atlassian met its Recovery Point Objective of one hour: no customer lost more than five minutes of data. The company permanently blocked bulk site deletes, mandated soft-delete policies across all systems, and committed to automated multi-site disaster recovery testing as a regular operational exercise.


What made the outage uniquely brutal was its second-order effect: the script that deleted customer sites also deleted the contact information for those customers. Atlassian's support systems required a valid Cloud URL and Atlassian ID to file a ticket — and both were gone. Customers couldn't reach support, and Atlassian couldn't reach customers. The company had to reconstruct contact lists from billing systems, prior support tickets, and manual outreach before they could even begin coordinating restoration. The multi-tenant architecture — where data from multiple customers lives in shared storage shards — meant that a global rollback was not an option. Restoring any individual site required surgically extracting and replaying that customer's records without disturbing the data of the thousands of other tenants sharing the same database.

⚠️

The Irony That Hurt Most

Among the 883 deleted sites were Atlassian's own internal instances — and Opsgenie, the company's own incident management product. The team managing the worst outage in Atlassian history had to do it without their primary incident tracking and alerting tool. The cobbler's children had no shoes.

THE HACKER NEWS SIGNAL

As Atlassian remained largely silent for the first nine days, the outage trended on Hacker News. The highest-voted comment, from someone claiming to be a former Atlassian employee, alleged that internal monitoring was poor and that more than 50% of incidents were customer-detected. Atlassian neither confirmed nor denied the claims — but the silence amplified the speculation. On Day 9, they finally confirmed what the community had already guessed: the Insight plugin retirement script was the cause.

At peak response, the recovery involved 450+ support engineers running 24/7 shifts across global timezones, manually validating each restored site before handing it back to the customer. The team created a dedicated Jira project — SITE — with a custom workflow to track restoration progress site-by-site across engineering, program management, and customer support. The Restoration 2 breakthrough, when it came on Day 4, was the turning point: by re-using original site identifiers, the team could eliminate the most time-consuming step — re-mapping immutable IDs across third-party app integrations — and cut site recovery time from 48 hours to 12. The final site was restored on April 18th. Every customer got their data back.

📊

Recovery Point Objective: Met

Despite the scale, Atlassian met its one-hour RPO: no customer lost more than five minutes of data. The combination of full backups plus point-in-time incremental backups, retained for 30 days, made this possible. What they missed was the RTO: restoring 775 customers took 13 days, not hours.

Peer review caught the endpoint. It just didn't ask what the IDs were for.

TechLogStack — built at scale, broken in public, rebuilt by engineers

ℹ️

The Bigger Context: Server Sunset Pressure

This incident occurred at a critical business moment: Atlassian had announced the end-of-life for its on-premises Server product, actively pushing customers toward the Cloud. The 14-day outage landed directly on top of that migration narrative, giving every enterprise customer a live, public data point about cloud reliability (the ability of a hosted service to maintain uptime and data integrity across millions of tenants) versus self-hosted control.


The Fix

  • 13d — Duration from first deletion (April 5) to final site restored (April 18) — the longest outage in Atlassian's history
  • <5 min — Maximum data loss per customer — Recovery Point Objective met, despite missing Recovery Time Objective by days
  • ~70 → 30 — Restoration steps reduced from Approach 1 to Approach 2, cutting site rebuild time from 48 hours to ~12 hours
  • 450+ — Support engineers running 24/7 manual validation shifts globally at peak incident response

Recovery began in three parallel workstreams the moment the root cause was confirmed on Day 1. The first workstream assembled a manual team to hand-walk through the restoration steps for individual sites, validating each one. The second workstream raced to automate those same steps so they could be run safely in large batches. The third — the one that ultimately broke the logjam — was a full rewrite of the restoration approach itself. Restoration 1 created brand-new sites with fresh Cloud IDs, requiring all downstream services and third-party apps to be re-mapped to the new identifiers. This was safe but brutally slow: ~70 steps, ~48 hours per batch, with cascading dependencies that could only run in sequence. Every third-party app in the ecosystem had to be re-integrated. The math told a grim story: this approach would take three weeks to clear the full backlog of 775 customers.

THE BREAKTHROUGH: RESTORATION 2

On April 9th (Day 4), the team proposed Restoration 2: instead of creating new sites, re-create the deleted records in-place using the original site identifiers. The key insight was that immutable identifiers like CloudID (unique IDs assigned at site creation that are embedded across all downstream services, data records, and third-party integrations and cannot be changed) were the primary source of complexity in Restoration 1. By preserving them, the team eliminated over half the restoration steps, removed the need to coordinate re-mapping with third-party app vendors, and reduced site recovery from 48 hours to approximately 12 hours. The trade-off: everything automated for Restoration 1 had to be rewritten, and both approaches ran in parallel for days while the new method was tested and validated.

# Pseudocode: Restoration 2, re-creating deleted records using original identifiers.
# This was the breakthrough that cut recovery time from ~48h to ~12h per site.
# The service clients (catalogue, identity, db, services) and helpers
# (get_site_products, validate_site, notify_customer) are illustrative placeholders.

class RestorationError(Exception):
    """Raised when a restored site fails automated validation."""

def restore_site_v2(site_id, restore_point_timestamp):
    # Step 1: Re-create the site record in the Catalogue Service using the ORIGINAL site_id
    # Critical: preserve original cloudId to avoid re-mapping all downstream references
    catalogue.recreate(site_id, preserve_original_cloud_id=True)

    # Step 2: Start restoring identity data (users, groups, permissions)
    # This runs concurrently with database restoration; there is no sequential dependency
    identity_job = identity.restore_async(site_id, point_in_time=restore_point_timestamp)

    # Step 3: Restore primary product databases (Jira, Confluence, etc.)
    # Point-in-time recovery to a restore point within 5 minutes before deletion
    for product in get_site_products(site_id):
        db.restore_to_point_in_time(
            product=product,
            timestamp=restore_point_timestamp,  # within 5 min before deletion
            site_id=site_id,
        )

    # Step 4: Restore cross-service data (media attachments, app data, feature flags)
    # Can parallelize across services that have no dependencies on each other
    services.restore_parallel(site_id, timestamp=restore_point_timestamp)

    # Wait for the identity restore started in Step 2 before validating the site
    identity_job.wait()

    # Step 5: Automated validation checks that all services are healthy for the site
    validation_result = validate_site(site_id)
    if not validation_result.passed:
        raise RestorationError(f"Site {site_id} failed validation: {validation_result.errors}")

    # Step 6: Hand off to customer for final sign-off
    notify_customer(site_id, status='ready_for_validation')

The Root Technical Failure: An API Without Type Safety

The deletion API accepted both app IDs and site IDs as valid inputs and assumed the caller knew which type they were passing. There was no runtime validation to check whether the input ID referred to an app or an entire customer site. A single guard — checking the type of the entity behind each ID before executing permanent deletion — would have surfaced the mismatch before a single site was touched.
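The guard described above can be sketched in a few lines. This is a minimal illustration, not Atlassian's API: `EntityType`, `Entity`, `TypeMismatchError`, `plan_permanent_delete`, and the in-memory `registry` are all hypothetical names invented here. The idea is that the caller declares what kind of thing it believes it is deleting, and the whole batch is rejected if any ID resolves to something else.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    APP = "app"
    SITE = "site"


@dataclass(frozen=True)
class Entity:
    entity_id: str
    entity_type: EntityType


class TypeMismatchError(Exception):
    """Raised when an ID resolves to a different entity type than declared."""


def plan_permanent_delete(ids, expected_type, registry):
    """Resolve every ID and check its type before anything is deleted.

    `registry` maps ID -> Entity and stands in for a real lookup service
    (an assumption for this sketch). Returns the IDs that are safe to
    delete, or raises if any ID is unknown or of the wrong type.
    """
    mismatched = []
    for entity_id in ids:
        entity = registry.get(entity_id)
        if entity is None or entity.entity_type is not expected_type:
            mismatched.append(entity_id)
    if mismatched:
        raise TypeMismatchError(
            f"{len(mismatched)} of {len(ids)} IDs are not "
            f"{expected_type.value} IDs: {mismatched}"
        )
    return list(ids)


# Hypothetical data: one app installation and one whole customer site.
registry = {
    "app-1": Entity("app-1", EntityType.APP),
    "site-9": Entity("site-9", EntityType.SITE),
}
```

Calling `plan_permanent_delete(["app-1", "site-9"], EntityType.APP, registry)` raises before any deletion is attempted, because `site-9` is a site and the caller asked to delete apps. Rejecting the entire batch, rather than skipping the bad IDs, is a deliberate choice in this sketch: a list that mixes types suggests the list itself is wrong.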

Restoration 1 vs Restoration 2 — what changed and why it mattered

| Dimension | Restoration 1 | Restoration 2 |
|---|---|---|
| Approach | Create new site, migrate data in | Re-create original records in-place |
| Site identifiers | New CloudID assigned (immutable IDs changed) | Original CloudID preserved |
| Steps required | ~70 sequential steps | ~30 steps with parallelism |
| Recovery time | ~48 hours per batch | ~12 hours per site |
| Third-party apps | Every app re-integration required per site | No re-integration needed |
| Sites restored | 112 sites (53% of affected users) | 771 sites (47% of affected users) |

Four changes were committed as non-negotiable outcomes. First: universal soft-delete across all Atlassian systems — permanent deletion of customer data can only occur after a soft-delete period expires, never directly. Second: automated multi-site, multi-product disaster recovery testing, regularly exercised at scale. Third: a large-scale incident playbook with sub-streams, pre-built tooling, and simulation exercises that go far beyond the single-service incidents Atlassian had historically trained for. Fourth: backup of customer contact data outside the product instance itself, so that a site deletion could never again sever the communication channel needed to coordinate recovery. Every one of these was announced as an immediate action, not a future roadmap item.
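The first of those commitments, that permanent deletion can only follow an expired soft-delete period, is easy to picture in code. What follows is a toy in-memory sketch, not how Atlassian implemented it: `SoftDeleteStore`, its methods, and the 30-day `RETENTION` value are assumptions chosen for illustration. The point is structural: there is no method that destroys live data directly.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed grace period, for illustration only


class SoftDeleteStore:
    """Toy store where permanent deletion is reachable only through purge()."""

    def __init__(self):
        self._records = {}     # record_id -> data
        self._deleted_at = {}  # record_id -> time of soft delete

    def put(self, record_id, data):
        self._records[record_id] = data

    def soft_delete(self, record_id, now):
        if record_id not in self._records:
            raise KeyError(record_id)
        self._deleted_at[record_id] = now

    def restore(self, record_id):
        # Undo a soft delete; the data never left the store.
        self._deleted_at.pop(record_id)

    def get(self, record_id):
        # Soft-deleted records are hidden from normal reads.
        if record_id in self._deleted_at:
            return None
        return self._records.get(record_id)

    def purge(self, now):
        """Permanently remove only records whose retention window has expired."""
        expired = [rid for rid, ts in self._deleted_at.items()
                   if now - ts >= RETENTION]
        for rid in expired:
            del self._records[rid]
            del self._deleted_at[rid]
        return expired


t0 = datetime(2022, 4, 5, 7, 38, tzinfo=timezone.utc)
store = SoftDeleteStore()
store.put("site-1", {"name": "example"})
store.soft_delete("site-1", now=t0)
print(store.purge(now=t0 + timedelta(days=1)))  # nothing has expired yet
store.restore("site-1")                          # the mistake is reversible
```

With this shape, a script that soft-deletes the wrong 883 records produces an outage measured in the time it takes to call `restore`, not in days of rebuilding from backups.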

Zero Data Loss

Despite the scale and duration, no customer permanently lost data. Thirty-day immutable backups with point-in-time recovery meant the team could always get back to within five minutes of the deletion event. The RPO was met. The RTO was not. That asymmetry — data safe, access gone for two weeks — defined the entire character of the incident.
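The timeline numbers quoted throughout this case can be checked with a little date arithmetic. The sketch below uses only timestamps and targets stated in the article; the variable names are ours. Because only calendar dates are given for the end of the outage, the duration is computed at day granularity.

```python
from datetime import date, datetime, timedelta, timezone

script_start = datetime(2022, 4, 5, 7, 38, tzinfo=timezone.utc)
script_end = datetime(2022, 4, 5, 8, 1, tzinfo=timezone.utc)
first_ticket = datetime(2022, 4, 5, 7, 46, tzinfo=timezone.utc)

# How long the deletions took, and how soon a customer noticed.
run_duration = script_end - script_start
time_to_first_ticket = first_ticket - script_start

# First deletion (April 5) to final site restored (April 18).
outage_days = (date(2022, 4, 18) - date(2022, 4, 5)).days

# RPO: the worst-case data loss must not exceed the target.
rpo_target = timedelta(hours=1)
worst_case_data_loss = timedelta(minutes=5)
rpo_met = worst_case_data_loss <= rpo_target

print(run_duration, time_to_first_ticket, outage_days, rpo_met)
```

The script ran for 23 minutes, the first ticket arrived 8 minutes in, and the gap between data safety (minutes) and restored access (13 days) is the asymmetry the callout describes.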

🔒

The Code Freeze Decision

On April 8th, Atlassian imposed a company-wide code freeze — no deployments across all of engineering until restoration was complete. This eliminated the risk of a compounding incident, reduced noise, and allowed the entire engineering org to focus exclusively on recovery without distraction from unrelated changes.


Architecture

To understand why a script deleting 883 sites took two weeks to reverse, you need to understand what an Atlassian site actually is. A site (for example, yourcompany.atlassian.net) is not a row in a database. It is a logical container distributed across dozens of services, each maintaining its own slice of state. Identity data (users, groups, permissions) lives in one service. Product databases for Jira, Confluence, and Opsgenie live in others. Media attachments, feature flags, licensing metadata, and third-party app configurations each occupy a separate data store. All of it is hosted on AWS and orchestrated through Micros, Atlassian's internal Platform-as-a-Service that handles deployment, security, and management for all Atlassian cloud services. The site deletion was not confined to any one database: it sent deletion events through the standard provisioning workflow, and every downstream service dutifully removed its copy.
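A toy model makes the fan-out concrete. The `Service` and `Provisioner` classes below are invented for this sketch and bear no relation to Atlassian's real components; they only show why one deletion event ends up removing state in many independent places, and why there is no single place to undo it.

```python
class Service:
    """One service holding its own slice of per-site state."""

    def __init__(self, name):
        self.name = name
        self.state = {}  # site_id -> this service's data for that site

    def on_site_deleted(self, site_id):
        # Each service removes its own copy when the event arrives.
        self.state.pop(site_id, None)


class Provisioner:
    """Delivers site lifecycle events to every subscribed service."""

    def __init__(self, services):
        self.services = services

    def delete_site(self, site_id):
        for service in self.services:
            service.on_site_deleted(site_id)


# Hypothetical service names; a real site spans dozens of services.
services = [Service(n) for n in ("identity", "jira-db", "confluence-db", "media")]
for svc in services:
    svc.state["site-a"] = f"{svc.name} data for site-a"
    svc.state["site-b"] = f"{svc.name} data for site-b"

Provisioner(services).delete_site("site-a")
remaining = {svc.name: sorted(svc.state) for svc in services}
print(remaining)
```

After the single `delete_site` call, every service has independently dropped `site-a`. Reversing that means restoring each slice separately and in a compatible order, which is what the restoration steps had to do.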

[Diagram: How a site deletion propagated through Atlassian's distributed architecture. Interactive version available on TechLogStack.]

WHY A GLOBAL ROLLBACK WASN'T POSSIBLE

The instinctive solution — roll back the entire database to before the script ran — was blocked by the multi-tenant architecture (a design where a single database shard stores data for many customers simultaneously, isolated at the application layer rather than the database layer). Each database shard contained data from hundreds of customers, most of whom were completely unaffected. A global rollback would have wiped hours of real work from tens of thousands of innocent customers. The only option was surgical: extract and replay each deleted customer's records individually, from 30-day immutable backups, without touching any surrounding data.
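To see why the restore had to be surgical, consider a shard as a set of tables whose rows all carry a tenant identifier. The sketch below is a simplified assumption of that layout, using plain dictionaries; `restore_tenant` and the sample data are hypothetical. It copies one tenant's rows out of a backup and leaves every other tenant's live rows exactly as they are.

```python
import copy


def restore_tenant(live_shard, backup_shard, tenant_id):
    """Copy one tenant's rows from a backup into the live shard.

    Both shards are dicts of table name -> list of row dicts, and every
    row has a `tenant_id` column. Rows of other tenants are never touched.
    Returns the number of rows restored.
    """
    restored = 0
    for table, backup_rows in backup_shard.items():
        live_rows = live_shard.setdefault(table, [])
        # Skip tables where the tenant already has rows, to avoid duplicates.
        if any(row["tenant_id"] == tenant_id for row in live_rows):
            continue
        for row in backup_rows:
            if row["tenant_id"] == tenant_id:
                live_rows.append(copy.deepcopy(row))
                restored += 1
    return restored


backup = {"issues": [
    {"tenant_id": "acme", "key": "ACME-1", "title": "old title"},
    {"tenant_id": "globex", "key": "GLX-1", "title": "before backup"},
]}
# Live state after the bad deletion: acme is gone, globex kept working.
live = {"issues": [
    {"tenant_id": "globex", "key": "GLX-1", "title": "edited after backup"},
]}

count = restore_tenant(live, backup, "acme")
print(count, live["issues"])
```

A whole-shard rollback would have replaced the globex row with its stale backup copy. The per-tenant approach brings acme back while globex keeps the edit it made after the backup was taken.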

[Diagram: Restoration 2, re-creating records in-place using original identifiers to avoid re-mapping. Interactive version available on TechLogStack.]

ℹ️

The Missing Layer: Multi-Site Automated DR

Atlassian's disaster recovery had been designed for infrastructure failures — a lost database, a failed availability zone, a corrupted single service. It had never been designed for the scenario of selectively restoring hundreds of customers from shared backups into a live production environment. The capability existed for single-site recovery; it simply hadn't been automated or tested at this scale. The incident forced Atlassian to build, from scratch, the tooling that would eventually become a core part of their DR program.

🏗️

Micros: The Internal PaaS That Ran the Deletions

Atlassian's Micros platform orchestrates all service deployments, security controls, and provisioning events across their cloud. The deletion script triggered standard tenant destruction events through Micros — which is exactly why monitoring didn't fire. Normal deletions look identical to erroneous bulk deletions from an observability perspective when the system has no input validation at the API layer.
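One way to tell a routine deletion from an erroneous bulk run is volume: 883 site deletions in 23 minutes averages out to roughly one site every 1.6 seconds. The sketch below is a hypothetical sliding-window check, not something the source says Atlassian built; `DeletionRateMonitor`, the threshold of 5, and the two-second pace are all assumptions for illustration.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class DeletionRateMonitor:
    """Flags when site-level deletions in a sliding window exceed a threshold."""

    def __init__(self, max_deletions, window):
        self.max_deletions = max_deletions
        self.window = window
        self._events = deque()  # timestamps of recent site deletions

    def record(self, timestamp):
        """Record one deletion; return True if the rate is now anomalous."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.max_deletions


monitor = DeletionRateMonitor(max_deletions=5, window=timedelta(minutes=10))
start = datetime(2022, 4, 5, 7, 38, tzinfo=timezone.utc)
# Simulate one site deletion every two seconds (an assumed pace).
alerts = [monitor.record(start + timedelta(seconds=2 * i)) for i in range(10)]
first_alert_index = alerts.index(True)
print(first_alert_index)
```

With these illustrative settings the sixth deletion trips the alert, about ten seconds into the run. The right threshold depends entirely on what normal deletion traffic looks like, which the source does not describe.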


Lessons

Thirteen days of outage, 450 engineers, two restoration approaches from scratch, and all of it began with two teams that didn't fully understand what IDs they were exchanging. The Atlassian incident is not a story of technical failure. It is a story of what happens when destructive operations lack defense-in-depth: no type validation, no dry-run mode, no staged rollout, no soft-delete, and no explicit confirmation that an operation targeting 883 site-level records was actually what anyone intended.

  1. Deletion APIs must validate what they are deleting, not just whether the operation is allowed. An API that accepts both app IDs and site IDs without distinguishing them is a loaded gun with the safety removed. Before any destructive operation executes in production, a system-level check should confirm the type of entity being targeted, and fail loudly if it doesn't match the caller's stated intent.
  2. Soft delete (marking data for removal with a retention window rather than permanently destroying it immediately) must be the only permitted path for any operation touching customer data. Permanent deletion paths, even legitimate ones needed for compliance, should require a multi-step authorization separate from standard maintenance workflows. If an operation cannot be reversed in under an hour, it should not be triggerable in a single script run.
  3. Disaster Recovery testing must include the scenario you have never practiced, not just the one you have. Atlassian's backups were excellent and their single-site recovery was proven. What failed was multi-site, multi-product coordinated recovery at scale, a scenario that had no runbook and no automation. Test the rare catastrophe, not just the common failure.
  4. Customer contact information must be backed up outside the system it describes. When the deletion removed customer sites, it also removed the contact data Atlassian needed to reach those customers. Never let a single operation sever both the incident and the communication channel for resolving it. Store critical customer identifiers in a system that is logically and physically separate from the product instances they reference.
  5. Staged rollout applies to maintenance scripts, not just feature deployments. The first production run of the cleanup script processed 30 sites correctly, because those IDs had been sourced before the miscommunication occurred. The second run hit 883. A staged rollout policy applied to every run of a script modifying customer data at scale would have surfaced the error on a batch of 10 before it reached 883.
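The staged-rollout and dry-run ideas from the list above can be combined in one small runner. This is a sketch under stated assumptions: `run_in_stages`, its parameters, and the stage sizes of 1, 10, and 100 are all hypothetical, and `apply` and `verify` are callbacks the caller must supply. The runner defaults to dry-run, so the destructive path has to be asked for explicitly.

```python
def run_in_stages(targets, apply, verify, stage_sizes=(1, 10, 100), dry_run=True):
    """Apply a change to `targets` in growing batches, verifying each batch.

    `apply(target)` performs the change and `verify(batch)` returns True
    when the batch looks healthy. In dry-run mode nothing is applied and
    the planned batches are returned for human review.
    """
    batches, index = [], 0
    sizes = list(stage_sizes)
    while index < len(targets):
        # After the listed stages are used up, the final batch takes the rest.
        size = sizes.pop(0) if sizes else len(targets) - index
        batches.append(targets[index:index + size])
        index += size

    if dry_run:
        return batches

    completed = []
    for batch in batches:
        for target in batch:
            apply(target)
        completed.append(batch)
        if not verify(batch):
            done = sum(len(b) for b in completed)
            raise RuntimeError(f"Halted after {done} of {len(targets)} targets")
    return completed


targets = [f"id-{n}" for n in range(883)]
plan = run_in_stages(targets, apply=None, verify=None, dry_run=True)
print([len(batch) for batch in plan])
```

For 883 targets the plan is four batches of 1, 10, 100, and 772. If verification fails after the first batch, exactly one target has been touched. The value of the pattern depends on `verify` being meaningful: a check such as "is the customer's site still reachable" would have caught a site deletion, where "did the API return success" would not.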

⚠️

The Communications Lesson They Admitted

For nine days, Atlassian remained largely silent publicly while customers speculated on forums. They later acknowledged they should have communicated directional estimates with explicit uncertainty — even imprecise timelines — far earlier. Silence read as incompetence. "We don't know yet" is a communication, not a failure to communicate.

🏛️

The Industry Shift This Incident Accelerated

After the Atlassian outage, "soft-delete by default" moved from engineering best-practice advice toward a boardroom checklist item at many cloud companies. The incident remains a frequently cited example of how the blast radius (the scope of unintended damage when an operation reaches further than intended) of a single maintenance script can exceed that of a network attack, because the system was designed to trust the caller.

The API accepted both app IDs and site IDs. It just didn't ask which one you meant.



This case is a plain-English retelling of publicly available engineering material.

