TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

How a Two-Line Script Silently Deleted 883 Customer Cloud Sites

#devops #reliability #cloud #programming

883 sites permanently deleted in 23 minutes by a peer-reviewed, routine maintenance script
775 customers affected — max outage 14 days
450+ engineers on 24/7 shifts at peak response
<5 minutes of data loss per customer — RPO met; RTO was not
~70 → 30 steps — the Restoration 2 breakthrough that cut site recovery from 48 hours to 12
Opsgenie — Atlassian's own incident management tool — was among the deleted sites

At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.

The Story

The script we used provided both the 'mark for deletion' capability used in normal day-to-day operations (where recoverability is desirable), and the 'permanently delete' capability that is required to permanently remove data when required for compliance reasons.

— Sri Viswanath, CTO, via Atlassian Engineering Blog, April 2022

In 2021, Atlassian completed the integration of a standalone app called Insight – Asset Management into Jira Service Management as native functionality. The standalone version was now obsolete and needed to be retired from the 200,000+ customer cloud sites that had it installed. An engineering team wrote a cleanup script using an existing deletion process — nothing unusual. The seeds of disaster were sown not in a line of code, but in a conversation between two teams separated by function, timezone, and context.

The deletion API accepted two types of identifiers: app IDs to remove a specific product installation, and site IDs to remove an entire customer workspace. Both were valid inputs. The API assumed the caller knew which they were passing and offered no type-checking, no confirmation prompt, no dry-run mode. The team requesting the deletion provided the IDs of the cloud sites where the app was installed — not the IDs of the app instances themselves. The executing team, receiving a list of IDs and a known-good script, ran it. Soft delete (a reversible deletion that marks data for removal but retains it in backup for a grace period) was not used; the script took the permanent deletion path instead.

Silent Deletion: Why Monitoring Fired No Alert

The entire deletion ran in just 23 minutes. Because it executed through the standard Cloud Provisioner workflow — the same pathway used for day-to-day operations — Atlassian's internal monitoring detected nothing. The system behaved exactly as designed. The first signal of disaster came not from dashboards, but from a customer support ticket filed at 07:46 UTC. By 08:01 UTC, the script had completed its run. By 08:53 UTC, engineers confirmed the cause.
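One way to catch this class of event is to alert on the rate of permanent deletions rather than on errors, since nothing here actually errored. The following is a minimal sketch under stated assumptions, not a description of Atlassian's monitoring: `DeletionRateMonitor`, the threshold of 5, and the 10-minute window are all hypothetical.

```python
from collections import deque


class DeletionRateMonitor:
    """Flags when permanent deletions exceed a threshold within a sliding window.

    Hypothetical sketch: the class name, threshold and window size are
    illustrative assumptions, not values from any real system.
    """

    def __init__(self, max_deletions, window_seconds):
        self.max_deletions = max_deletions
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one deletion at `timestamp` (seconds); return True if an alert should fire."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while self._timestamps and timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_deletions


monitor = DeletionRateMonitor(max_deletions=5, window_seconds=600)
# 883 deletions spread evenly over 23 minutes (1380 s), roughly one every 1.56 s.
alerts = [monitor.record(i * 1380 / 883) for i in range(883)]
first_alert_index = alerts.index(True)  # index of the deletion that trips the alert
```

With these assumed settings the alert fires on the sixth deletion, within the first ten seconds of the run. The design point is that the signal is volume, not failure: each individual deletion still looks like normal provisioning traffic.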

What made the outage uniquely brutal was its second-order effect: the script that deleted customer sites also deleted the contact information for those customers. Atlassian's support systems required a valid Cloud URL and Atlassian ID to file a ticket — and both were gone. Customers couldn't reach support, and Atlassian couldn't reach customers. The company had to reconstruct contact lists from billing systems and prior support tickets before they could even begin coordinating restoration.

Problem

Silent Deletion — 883 Sites in 23 Minutes

At 07:38 UTC, the cleanup script begins sequentially deleting sites from a list of 883 IDs. Because deletions pass through standard provisioning workflows, internal monitoring fires no alert. At 07:46 UTC, the first customer support ticket arrives: Jira, Confluence, Opsgenie, and Statuspage are all unreachable.

Cause

Wrong IDs, No Guard

At 08:53 UTC, engineers confirm the link between the script run and the deletions. Team A passed site IDs (unique identifiers for an entire customer workspace containing all their Atlassian products) instead of app IDs to Team B. The deletion API, designed to accept both types without validation, assumed correctness. The script used the permanent delete path, not the soft-delete path — the data was gone from production immediately.

Solution

Two Approaches to Rebuild

Restoration 1 — creating brand-new sites and migrating data across — took approximately 48 hours per batch and required 70 sequential steps including re-mapping immutable Cloud IDs across every third-party ecosystem app. On April 9th, the team proposed Restoration 2: re-creating records using the original site identifiers, cutting the process to ~30 steps and ~12 hours per site.

Result

Full Restoration, Hard Lessons

The final affected customer was restored on April 18th — 13 days after the incident began. Atlassian met its Recovery Point Objective of one hour: no customer lost more than five minutes of data. The company permanently blocked bulk site deletes, mandated soft-delete policies across all systems, and committed to automated multi-site disaster recovery testing.

The Fix

The Restoration 2 Breakthrough

Recovery began in three parallel workstreams. The first assembled a manual team to hand-walk through restoration steps for individual sites. The second raced to automate those steps. The third — the one that broke the logjam — was a complete rewrite of the restoration approach itself.

Restoration 1 created brand-new sites with fresh Cloud IDs, requiring all downstream services and third-party apps to be re-mapped to the new identifiers: ~70 steps, ~48 hours per batch, cascading dependencies that could only run in sequence. The math told a grim story: three weeks to clear the full backlog of 775 customers.

13d — duration from first deletion (April 5) to final site restored (April 18)
<5 min — maximum data loss per customer — RPO met despite missing RTO by days
~70 → 30 — restoration steps reduced from Approach 1 to Approach 2; site recovery time cut from 48h to ~12h
450+ — engineers on 24/7 manual validation shifts globally at peak response

# Pseudocode: Restoration 2 — re-create deleted records using original identifiers
# The breakthrough: preserve original CloudID to avoid re-mapping all downstream references

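class RestorationError(Exception):
    """Raised when a restored site fails automated validation."""


def restore_site_v2(site_id, restore_point_timestamp):
    # Step 1: Re-create the site record using the ORIGINAL site_id
    # Critical: preserve original cloudId, since all downstream services already reference it
    # (catalogue, identity, db and services are placeholders for internal systems)
    catalogue.recreate(site_id, preserve_original_cloud_id=True)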

    # Step 2: Restore identity data (users, groups, permissions) in parallel
    # No sequential dependency with database restoration
    identity.restore_async(site_id, point_in_time=restore_point_timestamp)

    # Step 3: Restore primary product databases (Jira, Confluence, etc.)
    # Point-in-time recovery to exactly 5 minutes before deletion
    for product in get_site_products(site_id):
        db.restore_to_point_in_time(
            product=product,
            timestamp=restore_point_timestamp,  # 5 min before deletion
            site_id=site_id
        )

    # Step 4: Restore cross-service data (attachments, app data, feature flags)
    # Parallelise across services with no inter-dependencies
    services.restore_parallel(site_id, timestamp=restore_point_timestamp)

    # Step 5: Automated validation — all services healthy for this site
    validation_result = validate_site(site_id)
    if not validation_result.passed:
        raise RestorationError(f"Site {site_id} failed validation: {validation_result.errors}")

    # Step 6: Hand off to customer for final sign-off
    notify_customer(site_id, status='ready_for_validation')

The Root Technical Failure: An API Without Type Safety

The deletion API accepted both app IDs and site IDs as valid inputs and assumed the caller knew which type they were passing. There was no runtime validation to check whether the input ID referred to an app or an entire customer site. A single guard — checking the type of the entity behind each ID before executing permanent deletion — would have surfaced the mismatch before a single site was touched.
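The guard described above can be sketched in a few lines. This is an illustration under stated assumptions, not Atlassian's API: `EntityType`, `EntityTypeMismatch`, `validate_deletion_targets`, and the in-memory `registry` are hypothetical names, with the registry standing in for whatever catalogue lookup a real system would use.

```python
from enum import Enum


class EntityType(Enum):
    APP = "app"
    SITE = "site"


class EntityTypeMismatch(Exception):
    """Raised when an ID resolves to a different entity type than the caller declared."""


def validate_deletion_targets(ids, expected_type, registry):
    """Check every ID against the registry BEFORE anything is deleted.

    `registry` maps ID -> EntityType. In a real system this would be a lookup
    against a catalogue service (an assumption for this sketch). Unknown IDs
    are treated as mismatches.
    """
    mismatched = [i for i in ids if registry.get(i) is not expected_type]
    if mismatched:
        raise EntityTypeMismatch(
            f"{len(mismatched)} of {len(ids)} IDs are not of type "
            f"{expected_type.value}: {mismatched[:5]}"
        )
    return list(ids)


registry = {"app-1": EntityType.APP, "app-2": EntityType.APP, "site-9": EntityType.SITE}

ok = validate_deletion_targets(["app-1", "app-2"], EntityType.APP, registry)

try:
    validate_deletion_targets(["app-1", "site-9"], EntityType.APP, registry)
    rejected = False
except EntityTypeMismatch:
    rejected = True
```

The whole list is validated up front, so one wrong ID rejects the entire request instead of leaving a partial deletion behind.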

Restoration 1 vs Restoration 2:

Dimension	Restoration 1	Restoration 2
Approach	Create new site, migrate data in	Re-create original records in-place
Site identifiers	New CloudID assigned	Original CloudID preserved
Steps required	~70 sequential steps	~30 steps with parallelism
Recovery time	~48 hours per batch	~12 hours per site
Third-party apps	Every app re-integration required	No re-integration needed
Sites restored	112 sites	771 sites

Four changes were committed as non-negotiable outcomes: universal soft-delete across all Atlassian systems; automated multi-site, multi-product disaster recovery testing at scale; a large-scale incident playbook with pre-built tooling and simulation exercises; and backup of customer contact data outside the product instance itself.

The communications lesson they admitted

For nine days, Atlassian remained largely silent publicly while customers speculated on forums. They later acknowledged they should have communicated directional estimates with explicit uncertainty — even imprecise timelines — far earlier. Silence read as incompetence. "We don't know yet" is a communication, not a failure to communicate. On Day 9, they finally confirmed what the community had already guessed: the Insight plugin retirement script was the cause.

The code freeze decision

On April 8th, Atlassian imposed a company-wide code freeze — no deployments across all of engineering until restoration was complete. This eliminated the risk of a compounding incident, reduced noise, and allowed the entire engineering org to focus exclusively on recovery without distraction from unrelated changes.

Architecture

To understand why a script deleting 883 sites took two weeks to reverse, you need to understand what an Atlassian site actually is. A site is not a row in a database. It is a logical container distributed across dozens of services, each maintaining its own slice of state. Identity data (users, groups, permissions) lives in one service. Product databases for Jira, Confluence, and Opsgenie live in others. Media attachments, feature flags, licensing metadata, and third-party app configurations each occupy a separate data store. All of it is hosted on AWS and orchestrated through Micros (Atlassian's internal Platform-as-a-Service that orchestrates deployment, security, and management for all Atlassian cloud services). The site deletion was not a change to any one database: it sent deletion events through the standard provisioning workflow, and every downstream service dutifully removed its copy.

How a Site Deletion Propagated Through Atlassian's Distributed Architecture

Interactive diagram available on TechLogStack.

Restoration 2: Re-Creating Records In-Place Using Original Identifiers

Interactive diagram available on TechLogStack.

Why a Global Rollback Wasn't Possible

The instinctive solution — roll back the entire database to before the script ran — was blocked by the multi-tenant architecture (a design where a single database shard stores data for many customers simultaneously, isolated at the application layer rather than the database layer). Each database shard contained data from hundreds of customers, most completely unaffected. A global rollback would have wiped hours of real work from tens of thousands of innocent customers. The only option was surgical: extract and replay each deleted customer's records individually from 30-day immutable backups, without touching any surrounding data.
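A toy model makes the difference concrete. The sketch below assumes a hypothetical row schema (dicts with a `tenant_id` key) and a hypothetical helper, `restore_tenants`. Real shards, backups, and replay tooling are far more involved.

```python
def restore_tenants(live_rows, backup_rows, tenant_ids):
    """Return live rows plus backup rows for the given tenants only.

    Rows are dicts with a "tenant_id" key (a hypothetical schema). Rows that
    belong to other tenants are taken from `live_rows` unchanged, so writes
    they made after the backup survive.
    """
    targets = set(tenant_ids)
    untouched = [r for r in live_rows if r["tenant_id"] not in targets]
    restored = [r for r in backup_rows if r["tenant_id"] in targets]
    return untouched + restored


backup = [
    {"tenant_id": "A", "issue": "A-1"},
    {"tenant_id": "B", "issue": "B-1"},
]
# After the backup: tenant A was deleted, tenant B kept working and added B-2.
live = [
    {"tenant_id": "B", "issue": "B-1"},
    {"tenant_id": "B", "issue": "B-2"},
]

surgical = restore_tenants(live, backup, ["A"])
global_rollback = list(backup)  # what a whole-shard rollback would leave behind
```

In this model the surgical restore brings back tenant A's data and keeps tenant B's newer `B-2` row, while the whole-shard rollback silently discards `B-2`.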

Lessons

Deletion APIs must validate what they are deleting, not just whether the operation is allowed. An API that accepts both app IDs and site IDs without distinguishing them is a loaded gun with the safety removed. Before any destructive operation executes in production, a system-level check should confirm the type of entity being targeted — and fail loudly if it doesn't match the caller's stated intent.
Soft delete (marking data for removal with a retention window rather than permanently destroying it immediately) must be the only permitted path for any operation touching customer data. Permanent deletion paths — even legitimate ones needed for compliance — should require a multi-step authorisation separate from standard maintenance workflows. If an operation cannot be reversed in under an hour, it should not be triggerable in a single script run.
Disaster Recovery testing must include the scenario you have never practised, not just the one you have. Atlassian's backups were excellent and their single-site recovery was proven. What failed was multi-site, multi-product coordinated recovery at scale — a scenario that had no runbook and no automation. Test the rare catastrophe, not just the common failure.
Customer contact information must be backed up outside the system it describes. When the deletion removed customer sites, it also removed the contact data Atlassian needed to reach those customers. Never let a single operation sever both the incident and the communication channel for resolving it.
Staged rollout applies to maintenance scripts, not just feature deployments. A staged rollout policy on any script modifying customer data at scale would have surfaced the error on a batch of 10 before it reached 883. The first production run processed 30 sites correctly because those IDs were sourced before the miscommunication occurred. The second run hit 883.
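The staged-rollout lesson can be sketched as a small batch runner. This is a minimal illustration under assumptions: `run_staged`, its default batch sizes, and the `verify` callback are hypothetical, and the health check used in the example is a stand-in for whatever post-batch verification a real system would perform.

```python
def run_staged(ids, operation, verify, batch_sizes=(10, 100)):
    """Apply `operation` to `ids` in growing batches, halting if `verify` fails.

    `operation(id)` performs the change; `verify(batch)` returns True when the
    batch looks healthy. Both are caller-supplied placeholders. Once the
    listed batch sizes are used up, the remainder runs as one final batch.
    """
    processed = []
    position = 0
    sizes = list(batch_sizes)
    while position < len(ids):
        size = sizes.pop(0) if sizes else len(ids) - position
        batch = ids[position:position + size]
        for item in batch:
            operation(item)
        processed.extend(batch)
        if not verify(batch):
            raise RuntimeError(
                f"Halted after {len(processed)} of {len(ids)} items: verification failed"
            )
        position += len(batch)
    return processed


touched = []
site_ids = [f"site-{n}" for n in range(883)]

try:
    # verify() rejects any batch containing a site ID, standing in for a
    # post-batch health check that notices whole sites going offline.
    run_staged(site_ids, touched.append,
               lambda batch: not any(i.startswith("site-") for i in batch))
    halted = False
except RuntimeError:
    halted = True
```

Here the run stops after the first batch of 10, leaving the other 873 items untouched. A staged runner is only as good as its verification step: if the check cannot tell a healthy batch from a damaged one, the batching buys nothing.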

Engineering Glossary

CloudID — an immutable unique identifier assigned to an Atlassian site at creation, embedded across all downstream services, data records, and third-party integrations. The primary source of complexity in Restoration 1 — re-mapping CloudID across the ecosystem took ~70 sequential steps per site. Restoration 2's key insight was preserving the original CloudID, eliminating this re-mapping entirely.

Immutable identifier — a unique ID assigned at creation that cannot be changed and is embedded across all downstream systems. In Atlassian's architecture, CloudIDs are immutable, so giving a restored site a new CloudID (as Restoration 1 did) requires re-integrating every third-party app that referenced the old one.

Micros — Atlassian's internal Platform-as-a-Service that orchestrates deployment, security, and management for all Atlassian cloud services. The deletion script triggered standard tenant destruction events through Micros — which is exactly why monitoring didn't fire; normal deletions look identical to erroneous bulk deletions from an observability perspective.

Multi-tenant architecture — a design where a single database shard stores data for many customers simultaneously, isolated at the application layer rather than the database layer. This prevents a global rollback when only a subset of tenants is affected, because the rollback would wipe data from unaffected customers sharing the same shard.

Recovery Point Objective (RPO) — the maximum acceptable amount of data loss measured in time. Atlassian's one-hour RPO meant no customer should lose more than one hour of data. The incident met RPO (customers lost at most 5 minutes of data) but missed RTO — access was unavailable for up to 13 days.

Recovery Time Objective (RTO) — the maximum acceptable time to restore service after an incident. Atlassian's incident met RPO but failed RTO dramatically — 13 days vs the expectation of hours. The asymmetry between data-safe and access-unavailable defined the character of the outage.

Soft delete — a reversible deletion that marks data for removal but retains it in backup for a grace period, allowing recovery. The permanent deletion path taken by the cleanup script was the exact opposite — data was gone from production immediately, with no soft-delete safety net.
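As a rough illustration of the soft-delete definition, the sketch below keeps deleted records restorable until a retention window elapses. `SoftDeleteStore` and its methods are hypothetical, and the 30-day retention is an assumed value for the example, not a documented Atlassian setting.

```python
from datetime import datetime, timedelta, timezone


class SoftDeleteStore:
    """In-memory store where delete() only marks records; purge() removes expired ones."""

    def __init__(self, retention):
        self.retention = retention
        self._records = {}      # id -> data
        self._deleted_at = {}   # id -> datetime of the soft delete

    def put(self, record_id, data):
        self._records[record_id] = data

    def delete(self, record_id, now):
        """Soft delete: hide the record but keep its data for the retention window."""
        if record_id in self._records:
            self._deleted_at[record_id] = now

    def get(self, record_id):
        if record_id in self._deleted_at:
            return None
        return self._records.get(record_id)

    def restore(self, record_id):
        """Undo a soft delete; returns False if the record was already purged."""
        if record_id in self._deleted_at and record_id in self._records:
            del self._deleted_at[record_id]
            return True
        return False

    def purge(self, now):
        """Permanently remove records whose retention window has elapsed."""
        expired = [i for i, t in self._deleted_at.items() if now - t >= self.retention]
        for record_id in expired:
            del self._records[record_id]
            del self._deleted_at[record_id]
        return expired


store = SoftDeleteStore(retention=timedelta(days=30))  # assumed retention
t0 = datetime(2022, 4, 5, 7, 38, tzinfo=timezone.utc)
store.put("site-1", {"name": "example"})
store.delete("site-1", now=t0)
hidden = store.get("site-1")                             # None while soft-deleted
purged_early = store.purge(now=t0 + timedelta(days=1))   # [] : still inside retention
recovered = store.restore("site-1")                      # True
```

Permanent removal happens only in `purge`, which is a separate call from `delete`. That separation is what lets a system put extra authorisation around the irreversible step.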

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.
