The Pain Point Exposed by the AWS Outages
June 2026 · An analysis of infrastructure concentration
When a Few Cooling Units Failed, an Exchange Went Dark
On the night of May 7, 2026 (around 7:48 PM US Eastern Time), nearly all trading on Coinbase stopped. The cause was neither the market nor a hack. In a single availability zone (use1-az4) of AWS us-east-1, multiple chillers failed simultaneously, overheating one data-center hall; a thermal-safety shutdown then cut power to entire racks, taking down their EC2 instances and EBS volumes at once. It was a physical event. Restoring cooling to pre-incident levels took roughly 20 hours.[1][2]
Coinbase's recently published postmortem records the timeline dryly. The trading disruption lasted about 8 hours; full recovery took about 12. Quorum was restored just after midnight (12:06 AM), but markets did not reopen until 3:49 AM. The gap in between is the heart of this incident.[1][3]
Where "We're Multi-AZ, So We're Fine" Fell Apart
On the surface, Coinbase was built by the book. Even if an entire availability zone dies, service continues from the remaining zones — this is the architectural principle most AWS customers rely on, the failure mode a hyperscaler is designed to absorb at the zone boundary. This time, that principle did not hold. For two reasons.[2]
First, the component most sensitive to latency — the trade matching engine — was running, by design, in a single zone. A configuration deliberately pinned to one zone for millisecond-level speed became a single point of failure the moment that zone went down.[4]
Second, and more painful, the automatic recovery failed silently. Coinbase had placed much of its event streaming on AWS's managed Kafka service (MSK). The promise of a managed service is clear.
When some brokers die, partition leaders are automatically re-elected so that traffic keeps flowing through the surviving brokers. The loss of one zone should be "reduced capacity," not "loss of availability." But a defect in the MSK control plane blocked the automatic partition-leader re-election. Two MSK clusters were stuck in a "healing" state, producers could not write, and the fallout blocked the fee service, which in turn blocked quoting. The "broken trades and quotes" users experienced were produced this way. On top of that, one Kafka cluster was in a 2-AZ configuration, which widened the blast radius.[1][4][5]
In a system with redundancy designed in, the redundancy itself did not work, and engineers had to run disaster recovery procedures by hand. CEO Brian Armstrong described the situation as never acceptable. Coinbase committed to strengthening region-level redundancy, expanding the Kafka configuration from 2-AZ to 3-AZ, and increasing resilience testing.[3][1]
The lesson here is clear. "We are multi-AZ" is not the same statement as "we survive the loss of a zone." Redundancy that is not continuously validated under real zone-loss conditions is not redundancy but the theater of redundancy. And the abstraction of a managed service hides, inside itself, failure modes you cannot reach.
The Same Lesson, Taught Three Times in Seven Months
If this incident were a one-off stroke of bad luck, it would not be worth a column. The problem is its recurrence.
- October 20, 2025. A race condition in the internal DNS automation of DynamoDB in AWS us-east-1 cascaded across more than 70 services (about 15 hours). Coinbase stopped, the L2 network Base went down, and as Consensys's Infura RPC died, MetaMask — blockchain's core wallet service — was severed. The front ends and relays of Polygon, Optimism, Arbitrum, Linea, and Scroll were affected one after another.[6][7][8][9]
- November 18, 2025. A Cloudflare Bot Management feature file doubled in size due to a database permissions change and propagated to edge nodes worldwide; a company handling a fifth of internet traffic spewed 5xx errors for about three hours. BitMEX, DeFiLlama, Arbiscan — and once again Coinbase and Ledger — threw service errors and lost face.[12][13][14]
- May 7, 2026. The cooling failure described above.[1]
A DNS bug, a config file, a cooling unit. The cause differs each time, but the result is the same. And in all three, what stopped was not the blockchain's consensus layer.
What Stopped, and What Survived?
We must draw the distinction precisely. In the October outage, the consensus layers of Ethereum and Solana showed no protocol-level anomaly. Blocks kept being produced, and on-chain assets were safe. In the May Coinbase incident as well, user funds were intact on chain.[10]
So why could users do nothing? Today, when a single user uses a "decentralized app," that request passes through roughly the following layers.
- Edge/CDN layer — providers like Cloudflare handle front-end domains, DDoS protection, and caching.
- Hosting layer — dApp front ends, nodes, and even an exchange's matching engine run atop AWS, Google Cloud, and Alibaba Cloud.
- RPC/relay layer — a handful of gateways like Infura and Alchemy mediate between wallets and chains.
- Consensus layer — only here do distributed nodes validate blocks.
True decentralization exists only in layer 4. Layers 1–3 — the "operational surface" users actually touch — are tied to a tiny number of cloud providers. Whether a cooling failure or a DNS bug, once it breaks layers 1–3, no matter how healthy layer 4 is, it is indistinguishable to the user from the entire network being dead.
What the Numbers Say About Concentration
This is not an emotional critique but a measurable fact. Per Ethernodes, at the time of the October outage about 36% of Ethereum execution-layer nodes (roughly 2,368) were on AWS. About 70% of nodes depend on cloud hosting in some form, and geographically nearly half of all nodes are clustered in the United States.[16][17][18]
The problem is not single-provider dependence alone. us-east-1 is a special region even within AWS. Global services such as IAM authentication, CloudFront, Route 53, and DynamoDB Global Tables depend on us-east-1 endpoints even for resources deployed in other regions. This means that even a configuration believed to be "distributed across multiple regions" may be tied to a single region's control plane. And the May incident went one step lower, showing that even "multi-AZ" is no guarantee in the face of a control-plane defect. Beneath each appearance of distribution, a single point of failure is hidden, one layer at a time.[6][19]
Alibaba Cloud and Cloudflare create the same risk along different axes. Alibaba Cloud is where the nodes and infrastructure of Asian — especially Chinese — projects concentrate, and Cloudflare is the edge gateway through which almost every Web3 front end passes, regardless of where hosting lives. Even a project with no nodes on AWS would have fallen into the same outage on November 18 if it had placed Cloudflare in front of its domain.[15]
Why Did It Come to This? — Economics, Not Anomaly
This concentration is not the product of laziness or a betrayal of decentralization. It is the cumulative result of rational choices. Running your own full node demands substantial storage, bandwidth, and staff, while the cloud provides all of that in minutes, at a predictable cost. Because users will not tolerate even 200ms of latency, projects pick the fastest edge, and exchanges pin the matching engine to a single zone to cut latency. For an individual project, these choices are almost always rational.
The problem arises when everyone makes the same rational choice. The sum of individually optimal decisions becomes a system-level vulnerability. Because each party chose the most robust provider and the fastest configuration, the entire ecosystem ended up putting its eggs in the same few baskets. And when those baskets shake, risk that was supposed to be distributed reveals itself as perfectly correlated risk.
The Uncomfortable Diagnosis of "Pseudo-Decentralization"
We need to be honest. Blockchain's decentralization is real at the level of consensus mechanisms and asset ownership. The fact that no one's coins disappeared across the three outages is the proof. But decentralization in the dimension users actually experience — accessibility, availability, censorship resistance — is largely closer to narrative.
There is a wide gap between the decentralization Satoshi spoke of — decentralization as hypothesis — and decentralization as measured.
This should be treated not as a moral indictment but as engineering debt. We poured enormous intellectual resources into decentralizing the consensus layer, yet entrusted the operational surface built on top of it wholesale to the most convenient centralized infrastructure. Consensus was distributed, but the infrastructure riding on the existing internet and the cloud remained bound to Web 2.0.
So What Should Be Done? — Without Overstatement
A common trap in discussing solutions is selling the utopia of "fully decentralized infrastructure." That is not honest. Realistic mitigations are incremental, each carrying a clear trade-off.
- Redundancy must be validated redundancy. The real lesson of the Coinbase case is not "there was no redundancy" but "redundancy was not validated under real failure conditions." A fallback diagram drawn without chaos engineering and regular zone-loss drills guarantees no availability.
- Trust the managed-service abstraction, but know your dependencies. That MSK promises automatic failover does not mean the promise is kept across every failure mode. Design on the premise that failures you cannot reach — like a control-plane defect — exist.
- Infrastructure diversification starts with cloud and region diversification. Simply distributing RPC across multiple providers and regions and keeping fallback paths reduces single points of failure. Cost and complexity rise. That is the price of availability.
- Decentralized RPC and infrastructure networks (DIN) are promising but unfinished. Efforts to resolve node provisioning through distributed incentive structures are underway, but they have yet to catch up to centralized gateways on latency and consistency. Guard against both overestimating and underestimating them.
- The most honest first step is a dependency inventory. Mapping out which provider, which region, and which single control plane your stack is actually tied to. Most projects do not even realize they are far more centralized than they think.
In Closing
Blockchain is a tool, not an oracle. It elegantly tries to solve the particular problems of consensus and ownership. But the physical foundation on which that tool runs is the reality called the Web. Servers, DNS, the edge, and now cooling units still stand atop the cloud oligopoly of 2026. October 2025's DNS, November's config file, May 2026's cooling unit. In less than seven months, the same lesson was taught three times.
If you seriously dream of a decentralized future, you must confront the fact that what halted that dream was not a hostile state or a sophisticated attack, but a cooling failure, a single line in a config file, a single DNS bug. The real pain point is right there: that the system we believe to be the most distributed was the most fragile in the face of the most ordinary operational accident. And that even the redundancy we believed to be robustly designed may fail to work at the very moment it is needed.
Decentralization is a matter not of declaration but of measurement. And when you measure it, there is still a long way to go.
References
The sources below are primary postmortems, news reports, and node-distribution statistics from October 2025 to May 2026. Statistical figures are as of publication and may change over time.
May 7, 2026 — AWS us-east-1 Cooling Failure / Coinbase
- [1] Coinbase May 7 outage postmortem summary (FX News Group) — https://fxnewsgroup.com/forex-news/cryptocurrency/coinbase-issues-statement-on-may-7-2026-outage/
- [2] AWS May 2026 cooling failure & cross-region DR technical analysis (SingleStore) — https://www.singlestore.com/blog/aws-outage-may-2026-cross-region-disaster-recovery/
- [3] Coinbase 7-hour disruption & Brian Armstrong's remarks (Crowdfund Insider) — https://www.crowdfundinsider.com/2026/05/278141-coinbase-impacted-by-7-hr-outage-after-aws-data-center-cooling-failure/
- [4] Matching engine & Kafka infrastructure impact analysis (Yahoo Finance / Benzinga) — https://finance.yahoo.com/markets/crypto/articles/coinbase-says-aws-cooling-failure-013036066.html
- [5] Thermal-event cascading systems-failure analysis (Machine News) — https://www.machine.news/coinbase-hit-by-cascading-systems-failure-after-thermal-event-in-aws-data-centre/
October 20, 2025 — AWS us-east-1 DynamoDB DNS Outage
- [6] AWS us-east-1 outage & global dependencies (Network World) — https://www.networkworld.com/article/4168878/aws-hit-by-us-east-1-outage-after-data-center-thermal-event.html
- [7] October 2025 AWS outage root-cause analysis — DynamoDB DNS race condition (Medium, L. Kumili) — https://medium.com/@leela.kumili/aws-outage-root-cause-analysis-bd88ffcab160
- [8] Crypto impact of the AWS outage — Coinbase, Base, L2s (CryptoSlate) — https://cryptoslate.com/aws-failure-exposes-cryptos-centralized-weak-point/
- [9] Infura, MetaMask, and other web3 infrastructure impact (Coingape) — https://coingape.com/block-of-fame/pulse/after-aws-outage-attack-consensys-and-eigen-launch-decentralized-solution-for-web3/
- [10] Consensus layer unaffected / on-chain performance postmortem (Metrika) — https://www.metrika.co/blog/post-mortem-aws-outage-10-2025
- [11] 2025 AWS outage reliability & statistics overview (IncidentHub) — https://blog.incidenthub.cloud/definitive-aws-outage-report-2025-reliability
November 18, 2025 — Cloudflare Global Outage
- [12] Cloudflare November 18, 2025 outage official postmortem (Cloudflare Blog) — https://blog.cloudflare.com/18-november-2025-outage/
- [13] Cloudflare outage — 20% of the internet & crypto trading disrupted (Brave New Coin) — https://bravenewcoin.com/insights/database-error-takes-down-20-of-internet-cloudflare-outage-disrupts-global-crypto-trading
- [14] BitMEX, DeFiLlama, Arbiscan, and other front ends down (CoinDesk) — https://www.coindesk.com/business/2025/11/18/cloudflare-global-outage-spreads-to-crypto-multiple-front-ends-down
- [15] The pseudo-decentralization of crypto exposed by the Cloudflare outage (Bitget News) — https://www.bitget.com/news/detail/12560605075954
Node & Infrastructure Concentration Statistics
- [16] ~36% of Ethereum nodes (~2,368) on AWS — citing Ethernodes (BitKE) — https://bitcoinke.io/2025/10/over-a-third-of-ethereum-nodes-on-centralized-servers/
- [17] ~50% of validators on AWS, ~70% of nodes on cloud (Foundry, Medium) — https://medium.com/foundry-digital/the-evolution-of-ethereum-decentralization-cf55ccfcee4f
- [18] Three cloud providers account for 69% of nodes; geographic concentration — Messari/Ethernodes (Cointelegraph) — https://cointelegraph.com/news/3-cloud-providers-accounting-for-over-two-thirds-of-ethereum-nodes-data
- [19] Ethereum validator network correlation & cloud concentration study (arXiv) — https://arxiv.org/html/2404.02164v1
https://github.com/gameworkerkim/vibe-investing
About the Author — Dennis Kim
Dennis Kim is a quantitative analyst and AI researcher operating at the convergence of artificial intelligence and global financial markets. Since 2017, he has been deeply engaged in the blockchain industry, emerging as a key player connecting Korea and the broader Asian market—bridging ecosystems, capital, and technology across the region.
He served as CEO of Cyworld (Cyworld Z), steering one of Korea's most iconic social platforms, and built his foundation as a hands-on programmer with deep roots in the game security industry. Microsoft recognized his technical leadership with the Azure MVP award for nine consecutive years (2015–2023), and he remains an active cyber threat intelligence and security expert, publishing multilingual threat research read across the industry.
As a columnist, Dennis writes for both technical and general audiences, translating complex macroeconomic narratives and AI-driven signals into clear, actionable insight. Today, much of that work lives in his Vibe Investing repository, where he publishes deep-dive investment columns and develops AI-driven trading systems—turning the noise of markets and machine learning into a coherent investment edge.
His current focus sits squarely on the future he's spent his career preparing for: the fusion of AI and financial markets, where engineering rigor, security discipline, and market intuition meet.
Top comments (0)