When working with public clouds, we have all heard of the Shared Responsibility Model: the cloud providers are responsible for the cloud infrastructure (both its physical and software components), while we as customers are responsible for everything within our cloud accounts, from configuration and security to designing for resiliency.
But what happens when the cloud providers make mistakes that impact us directly, with (almost) nothing we as customers can do about it?
In this blog post, I will share some stories published on the Internet and offer recommendations to help prevent similar cases from happening to other customers in the future.
When a 10-Year AWS Footprint Disappears Overnight
On July 10, 2025, AWS sent a long-time customer a notice requesting identity verification within five days. Despite the customer submitting a valid ID and utility documents, AWS rejected them as “unreadable.” Without offering clarification or follow-up, AWS terminated the entire 10-year-old account on July 23, deleting every byte of data, including multi-region backups, documentation, and open-source material tied to his Ruby projects (Reference: AWS deleted my 10-year account and all data without warning).
The deletion led to a 20-day support ordeal, marked by canned responses and the apparent lack of any escalation path. Ultimately, in early August 2025, an AWS employee manually intervened to restore the account after public outcry (Reference: AWS Restored My Account: The Human Who Made the Difference).
How It Could Have Been Avoided
- Redundant Multi‑Provider Backup Strategy - Cross‑provider redundancy (e.g., AWS + Azure Blob + local NAS backup) would have limited the damage.
- Independent Account Ownership - Maintaining direct billing and verified root access avoids dependencies that can trigger verification suspension.
- Automated Offsite Snapshots - Using tools like rclone or AWS Backup cross‑copy to independent storage (Google Drive, S3 Glacier Deep Archive, or on-prem object stores) protects against upstream provider errors (see the sketch after this list).
- Proactive Account Monitoring - AWS customers can use AWS Health Dashboard, CloudTrail account activity alerts, and Security Hub configuration compliance checks to detect when verification or compliance flags occur.
- Internal Policy Transparency by AWS - AWS could prevent recurrence by clarifying retention behavior for “verification-suspended” accounts, introducing a mandatory grace period before data erasure, and providing auditable deletion logs for customer appeals.
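As a minimal illustration of the offsite-snapshot idea above, the sketch below copies objects from an S3 bucket to an independent, S3-compatible endpoint (for example, an on-prem MinIO target). The bucket names, profile names, and endpoint URL are placeholders, not values from the story; adapt them to your environment.

```python
import boto3

# Source: the AWS account you want to protect (profile and bucket names are placeholders).
source = boto3.Session(profile_name="aws-prod").client("s3")

# Destination: an independent, S3-compatible object store (e.g., on-prem MinIO).
destination = boto3.Session(profile_name="offsite-backup").client(
    "s3", endpoint_url="https://backup.example.internal:9000"
)

SOURCE_BUCKET = "prod-data"        # placeholder
DEST_BUCKET = "prod-data-offsite"  # placeholder

paginator = source.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Stream each object out of AWS and into the independent store.
        body = source.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"]
        destination.upload_fileobj(body, DEST_BUCKET, key)
        print(f"Copied {key}")
```

Scheduled via cron or a CI job, a script like this (or an equivalent rclone sync) keeps a copy of your data outside the provider's blast radius.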
When Microsoft Suspends Your Cloud
The incident involving Nayara Energy and Microsoft in July 2025 occurred after the European Union imposed sanctions targeting Nayara's refinery due to its partial ownership by the Russian state oil company Rosneft. Following these sanctions, Microsoft, adhering to EU compliance requirements and its automated sanctions enforcement system, suspended critical cloud services used by Nayara’s employees, such as Teams and Outlook, without prior notice and without any legal obligation under Indian or U.S. law. This unilateral suspension disrupted Nayara’s day-to-day operations and raised concerns about the overreach of foreign jurisdiction in critical infrastructure services (Reference: Microsoft briefly turned off an Indian company’s cloud).
Nayara Energy legally challenged Microsoft’s suspension in the Delhi High Court, which intervened and led to the restoration of services within a few days. The case highlighted the risks associated with reliance on foreign cloud providers subject to extraterritorial sanctions regimes, prompting Indian policy voices to emphasize the strategic vulnerability of depending on non-sovereign digital infrastructure in vital sectors like energy (Reference: Indian court rules for sanctioned refiner in Microsoft services dispute).
How It Could Have Been Avoided
- Implement Sovereign or Local Cloud Solutions - Adoption of sovereign clouds or domestic cloud providers controlled under national jurisdiction to avoid external legal risks.
- Multi-Cloud Strategy - Utilizing multiple cloud providers from different legal jurisdictions can reduce exposure to unilateral service suspensions.
- Clear Jurisdictional Compliance Mapping - Cloud customers and providers must ensure clear, legally grounded interpretations of sanctions to prevent automated, unnecessary service blocks.
- Advance Notification and Dialogue - Providers should establish protocols to notify customers in sensitive sectors immediately about compliance actions and allow time for customer appeals or court review before abruptly halting services.
- Governmental and Legal Safeguards - Governments should work with cloud providers to create frameworks protecting essential services from extraterritorial compliance impacts.
When Google Cloud Deletes Your Private Cloud
The incident involving UniSuper and Google Cloud in May 2024 was caused by an inadvertent misconfiguration during the provisioning of UniSuper’s Private Cloud services, which led to the accidental deletion of UniSuper's entire cloud subscription, including all backups stored across multiple geographic regions (Reference: Google Cloud accidentally deletes UniSuper’s online account due to ‘unprecedented misconfiguration’).
Google Cloud acknowledged the issue as a "one-of-a-kind occurrence" and communicated openly with UniSuper throughout the recovery. The restoration required considerable effort from both Google Cloud and UniSuper teams, highlighting that cloud service disruptions can impact even large, sophisticated organizations (Reference: Sharing details on a recent incident impacting one of our customers).
How It Could Have Been Avoided
- Multi-Cloud Backup Strategy - Organizations should adopt a 3-2-1 backup strategy: at least three copies of data, two different storage types, and one offsite or in another cloud provider.
- Separate Logical Backup Infrastructure - Backups should not reside within the same logical cloud environment or account as primary data. Isolating backups using independent providers or physical air-gapped storage protects against provider-wide faults or misconfigurations.
- Infrastructure as Code (IaC) Safeguards - Extensive automated testing, change approval workflows, and audit trails for IaC deployments (e.g., Terraform, CloudFormation) would reduce the risk of unintentional destructive changes during provisioning (see the sketch after this list).
- Robust Incident Response and Continual Monitoring - Early detection of configuration errors, combined with clear communication plans between cloud provider and customer, can minimize downtime and data loss consequences.
- Cloud Provider Controls - Providers should implement redundant fail-safes that restrict wholesale deletions during provisioning and protect against the simultaneous deletion of multi-region backups.
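To make the IaC safeguard above concrete, here is a minimal sketch (assuming Terraform is in use) that inspects a `terraform plan` JSON export and fails a CI step whenever any resource would be destroyed, forcing explicit human approval for destructive changes. The file name and CI wiring are assumptions.

```python
import json
import sys

# Generate the input with:
#   terraform plan -out=tfplan && terraform show -json tfplan > plan.json
PLAN_FILE = "plan.json"  # placeholder path

with open(PLAN_FILE) as f:
    plan = json.load(f)

# Each resource change lists its planned actions, e.g. ["create"], ["delete"], ["delete", "create"].
destructive = [
    change["address"]
    for change in plan.get("resource_changes", [])
    if "delete" in change.get("change", {}).get("actions", [])
]

if destructive:
    print("Destructive changes detected; manual approval required:")
    for address in destructive:
        print(f"  - {address}")
    sys.exit(1)  # fail the pipeline step

print("No destructive changes detected.")
```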
AWS Outage Deep Dive: Lessons from EU-North 1
In February 2025, AWS experienced a significant networking fault in the eu-north-1 region (Stockholm), originating in Availability Zone eun1-az3, which caused wide-scale service degradation. The fault disrupted internal service-to-service communications within the region, affecting many critical AWS services such as EC2, S3, Lambda, DynamoDB, and CloudWatch for over 15 hours before full recovery (Reference: Availability issues with aws-eu-north-1c).
How Can It Be Avoided or Mitigated
- Multi-Region and AZ Redundancy - Architecting applications to span multiple Availability Zones and regions reduces the blast radius of regional faults and network isolation events.
- Network Path Redundancy and Monitoring - Continuous real-time monitoring of internal network paths and rapid failover mechanisms can help detect and route around failing network segments faster.
- Disaster Recovery Planning - Regular DR drills that include network partition scenarios validate system resilience and readiness for swift recovery.
- Multi-Cloud and Hybrid Strategies - Maintaining the ability to fail over critical workloads to a different cloud provider or on-premises infrastructure can mitigate regional cloud provider issues.
- Caching and Offline Access - Implementing edge caching and offline data accessibility for applications prevents total service disruption when cloud connectivity is impaired.
- Clear Communication and Transparency - Cloud providers offering granular health dashboards and timely communication help customers take protective actions early (see the sketch below).
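As a small example of acting on provider health signals, the sketch below polls the AWS Health API for open issues affecting a watched region. Note that the AWS Health API requires a Business or Enterprise support plan; the watched region and the alerting hook are placeholders.

```python
import boto3

# The AWS Health API is served from the us-east-1 endpoint.
health = boto3.client("health", region_name="us-east-1")

WATCHED_REGION = "eu-north-1"  # placeholder: the region your workloads run in

response = health.describe_events(
    filter={
        "regions": [WATCHED_REGION],
        "eventTypeCategories": ["issue"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)

for event in response.get("events", []):
    # Hook this into your paging/alerting system instead of printing.
    print(f"{event['service']}: {event['eventTypeCode']} ({event['statusCode']})")
```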
When Microsoft’s 19-Hour Outage Hits
In July 2025, Microsoft experienced a global outage lasting over 19 hours that severely impacted core Microsoft 365 services, including Exchange Online mailboxes, Outlook (web, mobile, and desktop), and Microsoft Teams for messaging, calls, and meetings. The problem quickly escalated to affect multiple interconnected services worldwide, disrupting communication and collaboration for millions of users across various sectors (Reference: Microsoft’s July 2025 Outage: A 19-Hour Disruption That Exposed Cloud Infrastructure Vulnerabilities).
How Can It Be Avoided or Mitigated
- Backup and Contingency Planning - Customers should employ third-party backup services to regularly back up Microsoft 365 data (Exchange Online, Teams, SharePoint, OneDrive) to independent, off-cloud storage.
- Multi-Channel and Alternative Communication Methods - Organizations should ensure incident management and communication systems use multiple channels (email, SMS, phone calls) and set up backup conferencing/communication tools outside Microsoft 365, such as third-party video conferencing platforms or messaging apps, to maintain communications during service disruptions.
- Local Sync and Offline Access for Critical Files - Administrators can configure OneDrive and SharePoint sync policies that enforce local caching of critical documents on user devices, and training users to mark key files as “Always keep on this device” ensures offline accessibility when cloud access is unavailable.
- Resilient Authentication Configurations - Implement backup authentication methods alongside Azure AD and MFA to prevent single points of failure. For hybrid architectures, configure password hash sync and alternative login methods to avoid lockouts during identity service disruptions.
- Proactive Monitoring and Incident Awareness - Continuous monitoring of Microsoft 365 service health dashboards, with alerts integrated into organizational operations, allows faster detection and response to outages, reducing downtime duration and impact (see the sketch after this list).
- Business Continuity and Crisis Playbooks - Develop and regularly test contingency playbooks for such cloud outages, including manual workarounds, employee communication plans, and alternate workflows to maintain operations.
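A minimal sketch of the service-health monitoring idea above, assuming an Azure AD app registration with the ServiceHealth.Read.All application permission and a Microsoft Graph access token obtained elsewhere (token acquisition, for example via MSAL, is omitted).

```python
import requests

# Assumes you already obtained a Microsoft Graph access token (e.g., via MSAL client credentials).
ACCESS_TOKEN = "<graph-access-token>"  # placeholder

url = "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/healthOverviews"
response = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
response.raise_for_status()

for service in response.json().get("value", []):
    status = service.get("status")
    if status != "serviceOperational":
        # Wire this into your incident-management tooling instead of printing.
        print(f"{service.get('service')}: {status}")
```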
Google Cloud Outage: A Cloud Architect’s Reality Check on Service Dependence
In June 2025, Google Cloud experienced a major global outage lasting more than 7 hours, impacting over 70 Google services (including Gmail, Google Drive, and Google Meet) as well as major customers such as Spotify, Discord, and Shopify. The root cause was traced to a faulty automated change to Google Cloud’s Service Control system, a critical API gateway responsible for authentication, quota enforcement, and request validation across Google’s infrastructure (Reference: Multiple GCP products are experiencing Service issues).
How Can It Be Avoided or Mitigated
- Multi-Cloud and Redundancy - Architect workloads across clouds or hybrid environments to reduce single-provider dependency risk.
- Graceful Degradation and Circuit Breaking - Design applications to handle API failures gracefully with retries, fallbacks, and offline capabilities to reduce impact during upstream outages (see the sketch after this list).
- Decoupled Authentication Strategies - Leverage additional identity management solutions and caching to preserve session continuity if primary IAM services fail momentarily.
- Proactive Monitoring and Alerting - Monitor Google Cloud Service Health dashboards continuously and integrate alerts to enable rapid incident detection and trigger fallback workflows.
- Incident Response and Crisis Planning - Maintain tested playbooks for service disruptions that include internal communications, alternative collaboration tools, and manual operational procedures.
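As an illustration of graceful degradation, the sketch below wraps a call to an upstream API with retries and a cached fallback, so a transient provider outage degrades the response instead of failing it outright. The endpoint URL and the in-memory cache are placeholders; a real service would use a shared cache such as Redis.

```python
import time
import requests

UPSTREAM_URL = "https://api.example.com/quota"  # placeholder upstream dependency
_last_good_response = None  # naive in-memory cache; use Redis or similar in practice


def fetch_with_fallback(retries: int = 3, backoff_seconds: float = 0.5):
    """Try the upstream API a few times, then fall back to the last known-good value."""
    global _last_good_response
    for attempt in range(retries):
        try:
            response = requests.get(UPSTREAM_URL, timeout=2)
            response.raise_for_status()
            _last_good_response = response.json()
            return _last_good_response
        except requests.RequestException:
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    # Upstream is unavailable: degrade gracefully instead of raising.
    return _last_good_response  # may be None if no good response was ever cached


if __name__ == "__main__":
    print(fetch_with_fallback())
```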
Salesforce OAuth Breach
In early August 2025, a sophisticated cyberattack exploited compromised OAuth tokens from the Salesloft Drift third-party application integrated with Salesforce, allowing attackers to exfiltrate large volumes of sensitive data from over 700 organizations across multiple industries, including cybersecurity firms, retail, finance, and technology (Reference: Widespread Data Theft Targets Salesforce Instances via Salesloft Drift).
How Can It Be Avoided or Mitigated
- Comprehensive SaaS Integration Visibility - Maintain inventory of all OAuth, API, and third-party integrations across SaaS environments to identify risky or unauthorized apps, reducing blind spots that attackers exploit.
- Least Privilege and Token Scope Management - Restrict OAuth token permissions and enforce least privilege access, minimizing exposure if tokens are compromised.
- OAuth Token Rotation and Revocation - Regularly rotate and promptly revoke unused or suspicious OAuth tokens and refresh tokens to limit lifetime risk (see the sketch after this list).
- Monitoring and Alerting for Anomalous Activity - Implement continuous monitoring with alerts for unusual data exports or access patterns in Salesforce and other SaaS tools, using SOAR or SIEM integrations.
- User Awareness and Phishing Resistance Training - Educate users about phishing and social engineering risks, especially regarding granting OAuth consents and app integrations.
- Third-Party Security Assessments - Evaluate the security posture and incident history of third-party SaaS integrations regularly and require security attestations.
- Incident Response and Forensic Readiness - Prepare to quickly investigate, contain, and remediate integration abuse incidents, including audit log preservation and forensic analysis processes.
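A hedged sketch of the token-hygiene idea above, using the simple-salesforce library to list connected-app OAuth tokens that have not been used recently so they can be reviewed and revoked. The OauthToken object and its field names should be verified against your org's API version, and the credentials are placeholders.

```python
from simple_salesforce import Salesforce

# Placeholder credentials; prefer a dedicated integration user with read-only access.
sf = Salesforce(
    username="admin@example.com",
    password="********",
    security_token="********",
)

# OauthToken tracks connected-app tokens per user; verify field names for your API version.
soql = (
    "SELECT Id, AppName, User.Username, LastUsedDate, UseCount "
    "FROM OauthToken "
    "WHERE LastUsedDate < LAST_N_DAYS:90"
)

for record in sf.query_all(soql)["records"]:
    # Candidates for review and revocation: tokens unused for 90+ days.
    print(record["AppName"], record["User"]["Username"], record["LastUsedDate"])
```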
Oracle Cloud Breach
The Oracle data breach of 2025 stemmed from the exploitation of a critical, long-known vulnerability (CVE-2021-35587) in Oracle WebLogic servers, which many organizations had failed to patch. Threat actors, notably identified as "rose87168," accessed legacy Oracle infrastructure, including SSO and LDAP servers, extracting approximately 6 million records comprising JKS key files, encrypted SSO passwords, hashed LDAP passwords, environment variables, and configuration backups (Reference: CVE-2021-35587 Exploited in Oracle Data Breach 2025).
Over 140,000 Oracle Cloud tenants were affected, mostly small to midsize companies relying on Oracle's legacy cloud platforms and hybrid cloud components.
Oracle initially denied breach of its core Oracle Cloud Infrastructure (OCI), attributing the incident to outdated Gen1 servers, but subsequent leaked data and victim confirmations challenged this narrative (Reference: Oracle Cloud Breach).
How Can It Be Avoided or Mitigated
- Aggressive Patching and Vulnerability Management - Ensure that all Oracle software, including middleware like WebLogic, is promptly patched, especially for critical vulnerabilities listed in the CISA Known Exploited Vulnerabilities (KEV) catalog (see the sketch after this list).
- Credential Rotation and Zero-Trust Access - Rotate and revoke all potentially exposed credentials, including SSH keys, API tokens, and SSO passwords. Enforce zero-trust principles with strict role-based access controls and conditional multi-factor authentication.
- Network Segmentation and Isolation - Isolate legacy systems from production environments and Internet-accessible interfaces. Use firewalls and VPNs to segment and limit access to sensitive infrastructure.
- Continuous Monitoring and Incident Response - Implement real-time monitoring of access, network traffic anomalies, and log integrity. Maintain robust incident response capabilities to detect and remediate suspicious activity quickly.
- Supply Chain and Integration Auditing - Identify and secure all third-party integrations and hybrid cloud components to prevent lateral attack vectors.
- Data Encryption and Backup Hygiene - Encrypt data at rest and in transit and regularly test backup/restoration processes to ensure data integrity and resilience.
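To make the patching recommendation actionable, here is a small sketch that pulls the public CISA Known Exploited Vulnerabilities (KEV) feed and checks whether any CVEs from your own inventory appear in it; the inventory list below is a placeholder to be replaced with your scanner's output.

```python
import requests

KEV_FEED = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"

# Placeholder: replace with CVEs reported by your vulnerability scanner or asset inventory.
our_cves = {"CVE-2021-35587"}

catalog = requests.get(KEV_FEED, timeout=30).json()
kev_ids = {item["cveID"] for item in catalog["vulnerabilities"]}

exploited = sorted(our_cves & kev_ids)
if exploited:
    print("Actively exploited vulnerabilities in our estate, patch immediately:")
    for cve in exploited:
        print(f"  - {cve}")
else:
    print("No KEV-listed vulnerabilities found in the inventory.")
```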
Summary
This blog explores the limits of the Shared Responsibility Model in public cloud environments, highlighting incidents where cloud provider errors or misconfigurations directly impacted customers.
Outages at AWS, Microsoft, and Google Cloud further emphasize the operational risks of single-provider dependence, while breaches in Salesforce and Oracle Cloud reveal the critical importance of security hygiene and third-party integration management.
While no solution can fully eliminate the risk of cloud provider incidents, this post offers practical recommendations to help mitigate potential impacts. Some measures, like proactive monitoring and backup strategies, are straightforward to implement, whereas others—such as multi-cloud architectures—may be more complex or theoretical.
The key takeaway for readers is to learn from past incidents and adopt mitigations proactively, before they affect your organization or customers.
About the author
Eyal Estrin is a seasoned cloud and information security architect, AWS Community Builder, and author of Cloud Security Handbook and Security for Cloud Native Applications. With over 25 years of experience in the IT industry, he brings deep expertise to his work.
Connect with Eyal on social media: https://linktr.ee/eyalestrin.
The opinions expressed here are his own and do not reflect those of his employer.