- 883 sites permanently deleted in 23 minutes by a peer-reviewed, routine maintenance script
- 775 customers affected — max outage 14 days
- 450+ engineers on 24/7 shifts at peak response
- <5 minutes of data loss per customer — RPO met; RTO was not
- ~70 → 30 steps — the Restoration 2 breakthrough that cut site recovery from 48 hours to 12
- Opsgenie — Atlassian's own incident management tool — was among the deleted sites
At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.
The Story
The script we used provided both the 'mark for deletion' capability used in normal day-to-day operations (where recoverability is desirable), and the 'permanently delete' capability that is required to permanently remove data when required for compliance reasons.
— Sri Viswanath, CTO, via Atlassian Engineering Blog, April 2022
In 2021, Atlassian completed the integration of a standalone app called Insight – Asset Management into Jira Service Management as native functionality. The standalone version was now obsolete and needed to be retired from the 200,000+ customer cloud sites that had it installed. An engineering team wrote a cleanup script using an existing deletion process — nothing unusual. The seeds of disaster were sown not in a line of code, but in a conversation between two teams separated by function, timezone, and context.
The deletion API accepted two types of identifiers: app IDs to remove a specific product installation, and site IDs to remove an entire customer workspace. Both were valid inputs. The API assumed the caller knew which they were passing and offered no type-checking, no confirmation prompt, no dry-run mode. The team requesting the deletion provided the IDs of the cloud sites where the app was installed — not the IDs of the app instances themselves. The executing team, receiving a list of IDs and a known-good script, ran it. Soft delete (a reversible deletion that marks data for removal but retains it in backup for a grace period) was not used; the script took the permanent deletion path instead.
What made the outage uniquely brutal was its second-order effect: the script that deleted customer sites also deleted the contact information for those customers. Atlassian's support systems required a valid Cloud URL and Atlassian ID to file a ticket, and both were gone. Customers couldn't reach support, and Atlassian couldn't reach customers. The company had to reconstruct contact lists from billing systems and prior support tickets before it could even begin coordinating restoration.
Problem
Silent Deletion — 883 Sites in 23 Minutes
At 07:38 UTC, the cleanup script begins sequentially deleting sites from a list of 883 IDs. Because deletions pass through standard provisioning workflows, internal monitoring fires no alert. At 07:46 UTC, the first customer support ticket arrives: Jira, Confluence, Opsgenie, and Statuspage are all unreachable.
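One way to make a bulk deletion visible even when each individual deletion looks routine is to alert on the rate of deletion events rather than on any single event. The sketch below is a minimal sliding-window counter. The class name, the limit of 20 deletions, and the 10-minute window are hypothetical values chosen for illustration, not Atlassian's monitoring configuration.

```python
from collections import deque


class DeletionRateAlert:
    """Flags when more than `max_events` deletions fall inside a sliding window."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one deletion; return True if the window now exceeds the limit."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events


# 883 deletions spread evenly over 23 minutes, as in the incident timeline.
interval = (23 * 60) / 883
alert = DeletionRateAlert(max_events=20, window_seconds=600)  # hypothetical limits
first_alert = next(i for i in range(883) if alert.record(i * interval))
print(f"alert fires on deletion #{first_alert + 1}")
```

With these assumed limits the alert fires on the 21st deletion, roughly half a minute into the run, while a slow trickle of one deletion per minute never trips it.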
Cause
Wrong IDs, No Guard
At 08:53 UTC, engineers confirm the link between the script run and the deletions. The requesting team (Team A) passed site IDs (unique identifiers for an entire customer workspace containing all its Atlassian products) instead of app IDs to the executing team (Team B). The deletion API, designed to accept both types without validation, assumed the input was correct. The script used the permanent-delete path, not the soft-delete path, so the data was gone from production immediately.
Solution
Two Approaches to Rebuild
Restoration 1 — creating brand-new sites and migrating data across — took approximately 48 hours per batch and required 70 sequential steps including re-mapping immutable Cloud IDs across every third-party ecosystem app. On April 9th, the team proposed Restoration 2: re-creating records using the original site identifiers, cutting the process to ~30 steps and ~12 hours per site.
Result
Full Restoration, Hard Lessons
The final affected customer was restored on April 18th — 13 days after the incident began. Atlassian met its Recovery Point Objective of one hour: no customer lost more than five minutes of data. The company permanently blocked bulk site deletes, mandated soft-delete policies across all systems, and committed to automated multi-site disaster recovery testing.
The Fix
The Restoration 2 Breakthrough
Recovery began in three parallel workstreams. The first assembled a manual team to hand-walk through restoration steps for individual sites. The second raced to automate those steps. The third — the one that broke the logjam — was a complete rewrite of the restoration approach itself.
Restoration 1 created brand-new sites with fresh Cloud IDs, requiring all downstream services and third-party apps to be re-mapped to the new identifiers: ~70 steps, ~48 hours per batch, cascading dependencies that could only run in sequence. The math told a grim story: three weeks to clear the full backlog of 775 customers.
- 13d: duration from first deletion (April 5) to final site restored (April 18)
- <5 min: maximum data loss per customer; RPO met despite missing RTO by days
- ~70 → 30: restoration steps reduced from Restoration 1 to Restoration 2; site recovery time cut from 48h to ~12h
- 450+: engineers on 24/7 manual validation shifts globally at peak response
```python
# Pseudocode: Restoration 2, which re-creates deleted records using original identifiers.
# The service objects (catalogue, identity, db, services) are illustrative, not real APIs.
# The breakthrough: preserve the original CloudID to avoid re-mapping all downstream references.
def restore_site_v2(site_id, restore_point_timestamp):
    # Step 1: Re-create the site record using the ORIGINAL site_id.
    # Critical: preserve the original CloudID, because all downstream services already reference it.
    catalogue.recreate(site_id, preserve_original_cloud_id=True)

    # Step 2: Restore identity data (users, groups, permissions) in parallel.
    # No sequential dependency with database restoration.
    identity.restore_async(site_id, point_in_time=restore_point_timestamp)

    # Step 3: Restore primary product databases (Jira, Confluence, etc.).
    # Point-in-time recovery to 5 minutes before deletion.
    for product in get_site_products(site_id):
        db.restore_to_point_in_time(
            product=product,
            timestamp=restore_point_timestamp,  # 5 min before deletion
            site_id=site_id,
        )

    # Step 4: Restore cross-service data (attachments, app data, feature flags).
    # Parallelise across services with no inter-dependencies.
    services.restore_parallel(site_id, timestamp=restore_point_timestamp)

    # Step 5: Automated validation that all services are healthy for this site.
    validation_result = validate_site(site_id)
    if not validation_result.passed:
        raise RestorationError(f"Site {site_id} failed validation: {validation_result.errors}")

    # Step 6: Hand off to customer for final sign-off.
    notify_customer(site_id, status='ready_for_validation')
```
Restoration 1 vs Restoration 2:
| Dimension | Restoration 1 | Restoration 2 |
|---|---|---|
| Approach | Create new site, migrate data in | Re-create original records in-place |
| Site identifiers | New CloudID assigned | Original CloudID preserved |
| Steps required | ~70 sequential steps | ~30 steps with parallelism |
| Recovery time | ~48 hours per batch | ~12 hours per site |
| Third-party apps | Every app re-integration required | No re-integration needed |
| Sites restored | 112 sites | 771 sites |
Atlassian committed to four changes as non-negotiable outcomes: universal soft-delete across all Atlassian systems; automated multi-site, multi-product disaster recovery testing at scale; a large-scale incident playbook with pre-built tooling and simulation exercises; and backup of customer contact data outside the product instance itself.
The communications lesson they admitted
For nine days, Atlassian remained largely silent publicly while customers speculated on forums. They later acknowledged they should have communicated directional estimates with explicit uncertainty — even imprecise timelines — far earlier. Silence read as incompetence. "We don't know yet" is a communication, not a failure to communicate. On Day 9, they finally confirmed what the community had already guessed: the Insight plugin retirement script was the cause.
The code freeze decision
On April 8th, Atlassian imposed a company-wide code freeze — no deployments across all of engineering until restoration was complete. This eliminated the risk of a compounding incident, reduced noise, and allowed the entire engineering org to focus exclusively on recovery without distraction from unrelated changes.
Architecture
To understand why a script deleting 883 sites took two weeks to reverse, you need to understand what an Atlassian site actually is. A site is not a row in a database. It is a logical container distributed across dozens of services, each maintaining its own slice of state. Identity data (users, groups, permissions) lives in one service. Product databases for Jira, Confluence, and Opsgenie live in others. Media attachments, feature flags, licensing metadata, and third-party app configurations each occupy a separate data store. All of it is hosted on AWS and runs on Micros, Atlassian's internal Platform-as-a-Service that handles deployment, security, and management for all Atlassian cloud services. The site deletion was not a single write to a single database: it sent deletion events through the standard provisioning workflow, and every downstream service dutifully removed its copy.
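The fan-out described above can be illustrated with a toy model. This is a sketch under stated assumptions: the `Service` and `ProvisioningWorkflow` classes and the service names are hypothetical stand-ins, not Atlassian's actual components. It shows only the shape of the problem, namely that one deletion event removes state in many independent stores.

```python
class Service:
    """One downstream service holding its own slice of every site's state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.slices = {}  # site_id -> this service's data for that site

    def on_site_deleted(self, site_id: str) -> None:
        # Each service independently removes its copy; there is no central undo.
        self.slices.pop(site_id, None)


class ProvisioningWorkflow:
    """Broadcasts a site deletion to every subscribed service."""

    def __init__(self, services: list) -> None:
        self.services = services

    def delete_site(self, site_id: str) -> None:
        for service in self.services:
            service.on_site_deleted(site_id)

    def services_holding(self, site_id: str) -> list:
        return [s.name for s in self.services if site_id in s.slices]


# Hypothetical service names, for illustration only.
names = ("identity", "jira-db", "confluence-db", "media", "feature-flags")
services = [Service(name) for name in names]
for service in services:
    service.slices["site-a"] = f"{service.name} data for site-a"
    service.slices["site-b"] = f"{service.name} data for site-b"

workflow = ProvisioningWorkflow(services)
workflow.delete_site("site-a")
print(workflow.services_holding("site-a"))  # no service still holds site-a
print(workflow.services_holding("site-b"))  # site-b is untouched everywhere
```

Restoring `site-a` in this model means writing its slice back into every service separately, which is why the real recovery involved dozens of steps per site rather than one database restore.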
Diagram: How a Site Deletion Propagated Through Atlassian's Distributed Architecture (interactive version on TechLogStack)
Diagram: Restoration 2: Re-Creating Records In-Place Using Original Identifiers (interactive version on TechLogStack)
Lessons
Deletion APIs must validate what they are deleting, not just whether the operation is allowed. An API that accepts both app IDs and site IDs without distinguishing them is a loaded gun with the safety removed. Before any destructive operation executes in production, a system-level check should confirm the type of entity being targeted — and fail loudly if it doesn't match the caller's stated intent.
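A minimal sketch of such a check, assuming identifiers carry their entity type and the caller must declare what it intends to delete. All names here (`EntityType`, `EntityRef`, `plan_deletion`, `IntentMismatchError`) are hypothetical and do not describe Atlassian's deletion API. The function builds a dry-run plan and never deletes anything itself.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    APP = "app"
    SITE = "site"


@dataclass(frozen=True)
class EntityRef:
    """An identifier that carries its own type, so it cannot be misread."""
    entity_type: EntityType
    value: str


class IntentMismatchError(Exception):
    """Raised when a target is not the kind of entity the caller declared."""


def plan_deletion(targets: list, intent: EntityType) -> list:
    """Validate every target against the declared intent before anything runs.

    Returns the IDs that would be deleted (a dry-run plan).
    """
    mismatched = [t for t in targets if t.entity_type is not intent]
    if mismatched:
        raise IntentMismatchError(
            f"{len(mismatched)} of {len(targets)} targets are not "
            f"{intent.value} IDs; refusing to build a deletion plan"
        )
    return [t.value for t in targets]


app_refs = [EntityRef(EntityType.APP, "app-1"), EntityRef(EntityType.APP, "app-2")]
print(plan_deletion(app_refs, intent=EntityType.APP))

site_refs = [EntityRef(EntityType.SITE, "site-9")]
try:
    plan_deletion(site_refs, intent=EntityType.APP)
except IntentMismatchError as err:
    print(f"blocked: {err}")
```

The design choice is that the whole batch is rejected on the first mismatch: a list that mixes entity types is evidence of a handoff error, so partial execution is not offered.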
Soft delete (marking data for removal with a retention window rather than permanently destroying it immediately) must be the only permitted path for any operation touching customer data. Permanent deletion paths — even legitimate ones needed for compliance — should require a multi-step authorisation separate from standard maintenance workflows. If an operation cannot be reversed in under an hour, it should not be triggerable in a single script run.
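The separation between the two paths can be sketched as follows. This is an illustrative model, not Atlassian's implementation: the 30-day retention window, the two-approver rule, and every name in the block are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # hypothetical grace period
MIN_APPROVERS = 2               # hypothetical authorisation rule


@dataclass
class Record:
    record_id: str
    deleted_at: Optional[datetime] = None  # None means the record is live


class PurgeNotAllowed(Exception):
    """Raised when a permanent deletion fails one of its preconditions."""


def soft_delete(record: Record, now: datetime) -> None:
    """Mark the record as deleted; its data is kept and remains recoverable."""
    record.deleted_at = now


def restore(record: Record) -> None:
    """Undo a soft delete."""
    record.deleted_at = None


def purge(store: dict, record_id: str, now: datetime, approvers: set) -> None:
    """Permanently remove a record, but only through a separate, gated path."""
    record = store[record_id]
    if record.deleted_at is None:
        raise PurgeNotAllowed("record must be soft-deleted first")
    if now - record.deleted_at < RETENTION:
        raise PurgeNotAllowed("retention window has not elapsed")
    if len(approvers) < MIN_APPROVERS:
        raise PurgeNotAllowed("permanent deletion needs multiple approvers")
    del store[record_id]


deleted_on = datetime(2022, 4, 5, 7, 38, tzinfo=timezone.utc)
store = {"site-1": Record("site-1")}
soft_delete(store["site-1"], deleted_on)
restore(store["site-1"])  # a mistaken deletion is reversed in one call
print(store["site-1"].deleted_at)
```

A routine maintenance script in this model only ever calls `soft_delete`. Reaching `purge` requires a prior soft delete, an elapsed window, and named approvers, so it cannot be triggered by passing the wrong list to a known-good script.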
Disaster Recovery testing must include the scenario you have never practised, not just the one you have. Atlassian's backups were excellent and their single-site recovery was proven. What failed was multi-site, multi-product coordinated recovery at scale — a scenario that had no runbook and no automation. Test the rare catastrophe, not just the common failure.
Customer contact information must be backed up outside the system it describes. When the deletion removed customer sites, it also removed the contact data Atlassian needed to reach those customers. Never let a single operation sever both the incident and the communication channel for resolving it.
Staged rollout applies to maintenance scripts, not just feature deployments. A staged rollout policy on any script modifying customer data at scale would have surfaced the error on a batch of 10 before it reached 883. The first production run processed 30 sites correctly because those IDs were sourced before the miscommunication occurred. The second run hit 883.
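A staged run can be sketched in a few lines. The function names, the first stage of 10, and the growth factor of 5 are hypothetical choices for illustration. The `verify` callback stands in for whatever post-change check fits the operation, for example confirming that a site is still reachable after its app was removed.

```python
from typing import Callable, Sequence


class RolloutHalted(Exception):
    """Raised when a stage fails verification; later stages never run."""


def stage_sizes(total: int, first: int = 10, factor: int = 5) -> list:
    """Split `total` items into stages that start small and grow geometrically."""
    if first < 1 or factor < 1:
        raise ValueError("first and factor must be at least 1")
    sizes, size, remaining = [], first, total
    while remaining > 0:
        step = min(size, remaining)
        sizes.append(step)
        remaining -= step
        size *= factor
    return sizes


def staged_run(
    ids: Sequence[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
    first: int = 10,
    factor: int = 5,
) -> int:
    """Apply a change stage by stage, verifying each stage before widening."""
    done = 0
    for size in stage_sizes(len(ids), first, factor):
        batch = ids[done:done + size]
        for item in batch:
            apply(item)
        failed = [item for item in batch if not verify(item)]
        if failed:
            raise RolloutHalted(
                f"stopped after {done + len(batch)} of {len(ids)} items: "
                f"{len(failed)} failed verification"
            )
        done += len(batch)
    return done


print(stage_sizes(883))  # [10, 50, 250, 573]
```

With these assumed parameters, a list of 883 IDs whose first stage fails verification stops after 10 items, leaving the other 873 untouched.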
Engineering Glossary
CloudID — an immutable unique identifier assigned to an Atlassian site at creation, embedded across all downstream services, data records, and third-party integrations. The primary source of complexity in Restoration 1 — re-mapping CloudID across the ecosystem took ~70 sequential steps per site. Restoration 2's key insight was preserving the original CloudID, eliminating this re-mapping entirely.
Immutable identifier — a unique ID assigned at creation that cannot be changed and is embedded across all downstream systems. In Atlassian's architecture, CloudIDs are immutable — changing them requires re-integrating every third-party app that references them.
Micros — Atlassian's internal Platform-as-a-Service that orchestrates deployment, security, and management for all Atlassian cloud services. The deletion script triggered standard tenant destruction events through Micros — which is exactly why monitoring didn't fire; normal deletions look identical to erroneous bulk deletions from an observability perspective.
Multi-tenant architecture — a design where a single database shard stores data for many customers simultaneously, isolated at the application layer rather than the database layer. Prevents global rollback when only a subset of tenants are affected — rollback would wipe data from unaffected customers sharing the same shard.
Recovery Point Objective (RPO) — the maximum acceptable amount of data loss measured in time. Atlassian's one-hour RPO meant no customer should lose more than one hour of data. The incident met RPO (customers lost at most 5 minutes of data) but missed RTO — access was unavailable for up to 13 days.
Recovery Time Objective (RTO) — the maximum acceptable time to restore service after an incident. Atlassian's incident met RPO but failed RTO dramatically — 13 days vs the expectation of hours. The asymmetry between data-safe and access-unavailable defined the character of the outage.
Soft delete — a reversible deletion that marks data for removal but retains it in backup for a grace period, allowing recovery. The permanent deletion path taken by the cleanup script was the exact opposite — data was gone from production immediately, with no soft-delete safety net.
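The RPO and RTO entries above describe two independent checks, which is how an incident can pass one and fail the other. The sketch below makes that explicit. The one-hour RPO, the five-minute data loss, and the 13-day outage come from this case; the 24-hour RTO is a placeholder assumption, since the text only says the expectation was "hours".

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryOutcome:
    data_loss: timedelta  # how far back in time the restored data is
    downtime: timedelta   # how long service was unavailable


def evaluate(outcome: RecoveryOutcome, rpo: timedelta, rto: timedelta) -> dict:
    """Compare an outcome against both objectives independently."""
    return {
        "rpo_met": outcome.data_loss <= rpo,
        "rto_met": outcome.downtime <= rto,
    }


worst_case = RecoveryOutcome(
    data_loss=timedelta(minutes=5),
    downtime=timedelta(days=13),
)
result = evaluate(
    worst_case,
    rpo=timedelta(hours=1),
    rto=timedelta(hours=24),  # assumed placeholder, not a published figure
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```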
This case is a plain-English retelling of publicly available engineering material.
TechLogStack — built at scale, broken in public, rebuilt by engineers.