<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arthur</title>
    <description>The latest articles on DEV Community by Arthur (@arthurpro).</description>
    <link>https://dev.to/arthurpro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906866%2Fd0e24b44-8169-4789-9e67-cc5b4e067b97.png</url>
      <title>DEV Community: Arthur</title>
      <link>https://dev.to/arthurpro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arthurpro"/>
    <language>en</language>
    <item>
      <title>One 200-Year-Old Math Trick Powers Almost Every Pixel and Sound You Touch</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 08 May 2026 19:00:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/one-200-year-old-math-trick-powers-almost-every-pixel-and-sound-you-touch-g35</link>
      <guid>https://dev.to/arthurpro/one-200-year-old-math-trick-powers-almost-every-pixel-and-sound-you-touch-g35</guid>
      <description>&lt;p&gt;In December 1807, a French mathematician named Joseph Fourier presented a memoir to the Paris Academy of Sciences claiming that any reasonable signal — any sound, any temperature distribution, any periodic process — could be written as a sum of sines and cosines. &lt;a href="https://en.wikipedia.org/wiki/Joseph_Fourier" rel="noopener noreferrer"&gt;Lagrange, who had spent decades on trigonometric series, objected so forcefully that publication was blocked&lt;/a&gt;. The manuscript sat for fifteen years before &lt;a href="https://archive.org/details/thorieanalytiq00four" rel="noopener noreferrer"&gt;it appeared in book form as &lt;em&gt;Théorie analytique de la chaleur&lt;/em&gt; in 1822&lt;/a&gt;. Fourier was trying to model heat flow in a metal bar.&lt;/p&gt;

&lt;p&gt;Two centuries later, every JPEG image, every MP3 track, every Wi-Fi packet, and every MRI scan in routine clinical use leans on the same idea. Fourier did not aim at any of that. The trick generalized in ways nobody alive in 1807 could have predicted, and the chain from a heat-equation paper to a 5G modem is short enough to walk in a single article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the trick actually is
&lt;/h2&gt;

&lt;p&gt;Take a signal — a string of audio samples, a row of pixel intensities, a slice of MRI sensor data. The Fourier transform writes that signal as a sum of pure tones, each at a specific frequency, with a specific amplitude and phase. The inverse transform takes you back. Both directions lose nothing.&lt;/p&gt;
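&lt;p&gt;A minimal sketch of the discrete version, in plain Python with only the standard library (the O(n²) textbook definition rather than the fast algorithm; assume a real codebase reaches for an FFT library instead):&lt;/p&gt;

```python
import cmath
import math

def dft(x):
    # Forward transform: one complex coefficient per frequency k.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    # Inverse transform: rebuild the samples from the frequencies.
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]   # a pure tone, period 4
spectrum = dft(signal)
recovered = idft(spectrum)

# Both directions lose nothing (up to floating-point error) ...
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(signal, recovered))
# ... and all the energy sits at one frequency (k=2 of 8) and its mirror.
assert math.isclose(abs(spectrum[2]), 4.0, abs_tol=1e-9)
```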

&lt;p&gt;That sounds like an analytical curiosity. The reason it underpins so much engineering is that most signals worth caring about are &lt;em&gt;sparse&lt;/em&gt; in the frequency domain even when they're dense in the time or space domain. A 30-second song is hundreds of thousands of audio samples; the same song, transformed, is dominated by a few hundred frequencies. Modify the frequency-domain version (zero out the inaudible bands, drop the small coefficients, pack different bits onto different frequencies) and transform back, and you've done compression, filtering, denoising, or modulation depending on what you modified.&lt;/p&gt;

&lt;p&gt;Strictly: the Fourier transform is a basis change. It projects the signal onto an orthogonal set of basis functions — sines and cosines, or close relatives — and once you have a basis where the signal is sparse, every downstream operation gets cheaper.&lt;/p&gt;
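&lt;p&gt;The orthogonality claim is cheap to check numerically: two basis exponentials at distinct frequencies sum to zero over one period, and a basis vector against itself gives n (a sketch, not a proof):&lt;/p&gt;

```python
import cmath
import math

def inner(k, m, n):
    # Inner product of basis vectors e_k[t] = exp(2*pi*i*k*t/n) over one period.
    return sum(cmath.exp(2j * cmath.pi * k * t / n) *
               cmath.exp(-2j * cmath.pi * m * t / n) for t in range(n))

n = 16
# Distinct frequencies: orthogonal (inner product zero).
assert math.isclose(abs(inner(3, 5, n)), 0.0, abs_tol=1e-9)
# A frequency with itself: squared norm n.
assert math.isclose(abs(inner(7, 7, n)), n)
```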

&lt;h2&gt;
  
  
  The chain from 1822 to your phone
&lt;/h2&gt;

&lt;p&gt;Two milestones did most of the heavy lifting between Fourier's manuscript and modern silicon.&lt;/p&gt;

&lt;p&gt;The first was the &lt;a href="https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm" rel="noopener noreferrer"&gt;Cooley–Tukey FFT algorithm&lt;/a&gt;, published in &lt;em&gt;Mathematics of Computation&lt;/em&gt; in 1965. James Cooley and John Tukey reduced the cost of computing the discrete Fourier transform from O(n²) to O(n log n). For a million-sample signal, the difference is roughly 50,000× fewer operations. (Carl Friedrich Gauss had described essentially the same recursive structure &lt;a href="https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#History" rel="noopener noreferrer"&gt;around 1805&lt;/a&gt; while interpolating the orbits of the asteroids Pallas and Juno; he didn't publish, the work appeared posthumously in Neo-Latin, and was rediscovered as having predated Cooley–Tukey only after the 1965 paper. Gauss had reasons to be modest about his side projects.)&lt;/p&gt;
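&lt;p&gt;The recursion is short enough to show whole: split the signal into even- and odd-indexed halves, transform each recursively, and stitch the halves together with "twiddle factors" (a radix-2 sketch for power-of-two lengths only, as illustration rather than production code):&lt;/p&gt;

```python
import cmath

def fft(x):
    # Radix-2 Cooley–Tukey: O(n log n) for power-of-two n.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])    # FFT of even-indexed samples
    odd = fft(x[1::2])     # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor stitches the two half-size transforms together.
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

# Cross-check against the O(n^2) definition on a small input.
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
naive = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / 8) for t in range(8))
         for k in range(8)]
assert all(cmath.isclose(a, b, abs_tol=1e-9) for a, b in zip(fft(x), naive))
```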

&lt;p&gt;The second was the &lt;a href="https://en.wikipedia.org/wiki/Discrete_cosine_transform" rel="noopener noreferrer"&gt;discrete cosine transform&lt;/a&gt;, proposed by Nasir Ahmed at Kansas State University to the NSF in 1972 and developed with T. Natarajan and K. R. Rao &lt;a href="https://www.cse.iitd.ac.in/~pkalra/col783-2017/DCT-History.pdf" rel="noopener noreferrer"&gt;in a January 1974 paper&lt;/a&gt;. The DCT is a Fourier-transform variant tailored for real-valued data and natural-image statistics. Eighteen years later, the JPEG standard (&lt;a href="https://en.wikipedia.org/wiki/JPEG" rel="noopener noreferrer"&gt;ISO/IEC 10918-1, published 1992&lt;/a&gt;) used 8×8 DCT blocks at its core; the next year, the &lt;a href="https://en.wikipedia.org/wiki/MP3" rel="noopener noreferrer"&gt;MP3 standard&lt;/a&gt; wrapped a modified DCT in a psychoacoustic filterbank to throw out audio frequencies the ear couldn't hear. Both compression schemes are, mechanically, the same move: transform, drop the coefficients you can afford to lose, transform back.&lt;/p&gt;
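&lt;p&gt;The compression move (transform, drop what you can afford to lose, transform back) fits in a screen of stdlib Python. This sketch uses the textbook one-dimensional DCT-II on a single row of pixel values; assume real JPEG applies the 2-D version to 8×8 blocks and decides what to drop with quantization tables rather than a flat threshold:&lt;/p&gt;

```python
import math

def dct(x):
    # DCT-II: the real-valued Fourier variant JPEG builds on.
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n)) for k in range(n)]

def idct(X):
    # DCT-III, the inverse of DCT-II (with the usual scaling folded in).
    n = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                            for k in range(1, n))) * 2 / n for t in range(n)]

row = [52, 55, 61, 66, 70, 61, 64, 73]      # one row of pixel intensities
coeffs = dct(row)
# "Lossy compression": zero the small high-frequency coefficients.
kept = [c if abs(c) > 10 else 0.0 for c in coeffs]
approx = idct(kept)
# Reconstruction stays within a few gray levels of the original.
assert 5 > max(abs(a - b) for a, b in zip(row, approx))
```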

&lt;p&gt;The same FFT silicon that powers JPEG also runs the wireless stack. &lt;a href="https://en.wikipedia.org/wiki/Orthogonal_frequency-division_multiplexing" rel="noopener noreferrer"&gt;OFDM (orthogonal frequency-division multiplexing)&lt;/a&gt; packs data onto hundreds or thousands of separate sub-carriers, each carrying a small piece of the bitstream. The receiver pulls the streams apart with an FFT. Wi-Fi 6 (&lt;a href="https://en.wikipedia.org/wiki/Wi-Fi_6" rel="noopener noreferrer"&gt;802.11ax&lt;/a&gt;) uses up to 2,048 sub-carriers in a 160 MHz channel and modulation up to 1024-QAM. 4G LTE, 5G NR, DSL, DAB digital radio, and DVB-T digital television are all OFDM. Every wireless packet on most of the planet's home and mobile networks is the same trick at the physical layer.&lt;/p&gt;
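&lt;p&gt;Mechanically, an OFDM round trip is the transform run in both directions: the transmitter places one complex symbol on each sub-carrier and takes an inverse DFT to get the waveform; the receiver takes a forward DFT and reads the symbols back. A toy sketch with eight sub-carriers (assume no cyclic prefix, channel distortion, or noise; those are the parts that make real modems hard):&lt;/p&gt;

```python
import cmath

def idft(X):
    # Transmitter side: per-subcarrier symbols to a time-domain waveform.
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def dft(x):
    # Receiver side: the FFT that pulls the sub-carriers apart again.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# One QPSK symbol per sub-carrier; real OFDM carries hundreds to thousands.
symbols = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, -1+1j, 1-1j]
waveform = idft(symbols)    # what goes over the air
received = dft(waveform)    # what the receiver recovers
assert all(cmath.isclose(a, b, abs_tol=1e-9) for a, b in zip(symbols, received))
```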

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Magnetic_resonance_imaging" rel="noopener noreferrer"&gt;MRI&lt;/a&gt; uses the transform more directly: the scanner does not collect a picture. It collects the spatial-frequency components of a slice of tissue (the &lt;a href="https://en.wikipedia.org/wiki/K-space_(magnetic_resonance_imaging)" rel="noopener noreferrer"&gt;k-space data&lt;/a&gt;), and the standard image-reconstruction step is the inverse Fourier transform of that array. Other reconstruction methods exist for special cases; the routine clinical pipeline is built on the inverse transform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one math fits all
&lt;/h2&gt;

&lt;p&gt;The reason this trick works on audio, images, radio, and bodies is that physical reality is wave-shaped. Sound is air-pressure oscillation. Light and radio are electromagnetic oscillation. The molecules in your body absorb and re-emit radio at frequencies determined by their nuclear magnetic moments. None of these systems are &lt;em&gt;modeled&lt;/em&gt; by sinusoids as a convenience; they are sinusoidal, and Fourier gave us the language to read them.&lt;/p&gt;

&lt;p&gt;A 1965 algorithm made the language cheap to speak in real time. A 1974 paper specialized it for natural data. After that, the rest is engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two centuries of compounding interest
&lt;/h2&gt;

&lt;p&gt;Most of what looks distinctly 21st century — your phone, your wireless connection, your medical imaging, your streaming music — traces back to an 1807 manuscript that was blocked from publication by the most respected mathematician in Europe. The applications change every decade. The math underneath has been stable since Cooley and Tukey made it cheap.&lt;/p&gt;

&lt;p&gt;Fourier died in 1830. He never saw a JPEG, an MP3, an MRI scan, or a Wi-Fi handshake. He never saw the inside of a transistor, a vacuum tube, or any computational device more sophisticated than a logarithm table. The trick was complete before any of those things existed.&lt;/p&gt;

&lt;p&gt;The interesting question is not what the next compression standard or wireless modulation will look like. Those will be small refinements on a settled idea. The interesting question is whether anyone is currently working on a piece of mathematics that will, in 2226, still be doing this much work — and whether the people doing it face the same low expectations Fourier faced in 1807.&lt;/p&gt;

</description>
      <category>math</category>
      <category>signalprocessing</category>
      <category>fft</category>
      <category>fourier</category>
    </item>
    <item>
      <title>I Let an AI Agent Live on My VPS for Three Weeks</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 08 May 2026 17:30:00 +0000</pubDate>
      <link>https://dev.to/arthurpro/i-let-an-ai-agent-live-on-my-vps-for-three-weeks-4nj1</link>
      <guid>https://dev.to/arthurpro/i-let-an-ai-agent-live-on-my-vps-for-three-weeks-4nj1</guid>
      <description>&lt;p&gt;Saturday, 11pm. I'm at a friend's apartment across town and my laptop is at home. A Grafana alert lights up my phone: memory 92% on prod-1. The old me would have opened the SSH app on my phone, typed &lt;code&gt;docker stats&lt;/code&gt; with my thumbs, and muttered through whatever was wrong. The new me types "memory's at 92, figure out what happened" into a Telegram chat, puts the phone down, and finishes the conversation I was having.&lt;/p&gt;

&lt;p&gt;A minute later: "Container &lt;code&gt;project-logs&lt;/code&gt; was at 2.8GB. I cleaned old logs inside it and restarted — we're at 58% now. Want me to add a &lt;code&gt;mem_limit&lt;/code&gt; so it doesn't happen again?"&lt;/p&gt;

&lt;p&gt;This isn't a demo. The agent lives on the server, in Docker, and it has bash. That has been my working setup for the last three weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup, minus the marketing
&lt;/h2&gt;

&lt;p&gt;I run about a dozen Docker containers across two VPS boxes — client services, a couple of SaaS projects I own, monitoring, bots, Postgres. One person, too much infrastructure. The three-am-pager problem.&lt;/p&gt;

&lt;p&gt;The pattern is straightforward. An open-source agent runtime ships as a Docker image. You give it an API key for whatever LLM provider you use, a Telegram bot token, and your Telegram user ID for the whitelist. Any message from any other account gets ignored. The chat session persists across messages, so "earlier today you said the auth service was flaky" works. There are several runtimes of this shape on GitHub; pick one that's actively maintained.&lt;/p&gt;

&lt;p&gt;The chat itself is not the point. The tools are. I have about fifteen shell scripts mounted into the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools/
├── docker-status.sh        # status of all containers
├── docker-logs.sh          # tail logs for a container
├── docker-restart.sh       # restart one container
├── system-stats.sh         # RAM, CPU, disk, top consumers
├── db-discover.sh          # find all Postgres containers + databases
├── db-query.sh             # run SQL, pulls creds from container env
├── health-check.sh         # HTTP check every site, auto-restart on 5xx
├── nginx-errors.sh         # recent Nginx errors
├── security-check.sh       # fail2ban, odd processes, 4xx/5xx counts
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each script is a few dozen lines of bash. The agent reads a &lt;code&gt;SOUL.md&lt;/code&gt; that maps requests to scripts, and a &lt;code&gt;USER.md&lt;/code&gt; that describes my stack and container layout. "Show me auth-service logs" → &lt;code&gt;docker-logs.sh auth&lt;/code&gt;. "How many users registered this week in the auth DB?" → &lt;code&gt;db-query.sh&lt;/code&gt; with a query the agent writes itself, against credentials it pulls from the container's environment.&lt;/p&gt;
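&lt;p&gt;For concreteness, a hypothetical routing section of this shape (only the script names are taken from the tools listing above; everything else is illustrative) might read:&lt;/p&gt;

```markdown
## Tool routing

- "logs for X", "what happened to X"  -> docker-logs.sh with the container name
- "restart X", "bounce X"             -> docker-restart.sh (confirm in chat first)
- Questions about data in a project   -> db-query.sh (SELECT only without asking)
- "how's the server", resource checks -> system-stats.sh

## Rules

- Reads (logs, files, SELECT) run without confirmation.
- Writes (DELETE/UPDATE, config edits, container removal) need explicit approval.
```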

&lt;p&gt;None of this is fancy. It's about 2KB of context per project plus a handful of bash. That's kind of the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually saved time
&lt;/h2&gt;

&lt;p&gt;The most useful scenario is mundane. A site stops responding. The agent runs &lt;code&gt;curl&lt;/code&gt;, reads Nginx logs, checks &lt;code&gt;docker compose ps&lt;/code&gt;, spots the dead container, restarts it, verifies HTTP 200. Total wall time: a minute. Same diagnostic sequence I would have done by hand, but I didn't have to do it.&lt;/p&gt;

&lt;p&gt;Second most useful: the heartbeat mode. Every N minutes the agent runs &lt;code&gt;health-check.sh&lt;/code&gt; against everything. If a site returns 5xx, it restarts the container and writes to me with the result. If it can't recover, it pages me. I set rules in a &lt;code&gt;HEARTBEAT.md&lt;/code&gt;: don't wake me at 3am unless something is on fire, don't repeat yourself, describe what you already fixed.&lt;/p&gt;

&lt;p&gt;One morning I woke up to: "02:47 — project.com returned 502. Restarted the container, it's 200 now. Root cause was an OOM kill; the app exceeded its memory limit." That's the whole message. It told me what broke, what it did, and why it happened. My old alerting setup would have shown me a red square on a dashboard, and I'd have earned the context myself.&lt;/p&gt;

&lt;p&gt;Third, and this is mundane but adds up: config tweaks. "Add &lt;code&gt;https://newclient.com&lt;/code&gt; to the CORS allowlist on &lt;code&gt;myproject-api&lt;/code&gt; and bounce it." One sentence, thirty seconds. Used to be two minutes of SSH and one minute of cursing because I'd cd'd to the wrong &lt;code&gt;.env&lt;/code&gt; path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that surprised me: tokens
&lt;/h2&gt;

&lt;p&gt;Here's where this gets interesting, and where it connects to a problem I didn't expect.&lt;/p&gt;

&lt;p&gt;If you let an agent do reconnaissance every session, it burns unreal amounts of context figuring out where things live. One question like "what payment methods does my bot support?" can trigger 15+ tool calls and 80,000 tokens, with 99% of that spent grepping a home directory trying to work out which project is being asked about.&lt;/p&gt;

&lt;p&gt;I replicated the problem immediately. Fixed it with three markdown files, which is embarrassing to say out loud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Project Map&lt;/span&gt;

| Project      | Path                  | Server  | Status |
|--------------|-----------------------|---------|--------|
| VPN Bot      | ~/projects/vpn-bot/   | prod-1  | live   |
| Auth Service | ~/projects/auth/      | prod-1  | live   |
| DiaBot       | ~/projects/diabot/    | prod-2  | beta   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a &lt;code&gt;CLAUDE.md&lt;/code&gt; in each project describing its stack, entry points, and deploy commands. Plus the &lt;code&gt;USER.md&lt;/code&gt; for global context. That is the entire system.&lt;/p&gt;

&lt;p&gt;What this buys you: the agent reads the map first, the project file second, source third. It stops &lt;code&gt;grep -r&lt;/code&gt;ing your disk. Run the same "which of my projects use library X?" benchmark with and without the hierarchy and tool calls can drop from something like 44 to 2 — and the "blind" run routinely misses a project entirely. Speed and correctness in the same move.&lt;/p&gt;

&lt;p&gt;The Claude Code team has been &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;writing about memory&lt;/a&gt; for a while, and Simon Willison has been &lt;a href="https://simonwillison.net/tags/sandboxing/" rel="noopener noreferrer"&gt;writing about sandboxing&lt;/a&gt; for longer. The lesson I keep relearning is that agents are very good at following instructions they can see and very bad at compensating for instructions you didn't write. You're writing a runbook for a colleague with unlimited energy and no memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 'access' actually means here
&lt;/h2&gt;

&lt;p&gt;A note on what I actually handed the agent.&lt;/p&gt;

&lt;p&gt;It runs in its own Docker container, not as host root. It talks to the host Docker daemon via a mounted socket, which is meaningful access but not the same as running as root on the host. The Telegram bot whitelist is a single user ID. Secrets sit in a 600-permission &lt;code&gt;.env&lt;/code&gt;. And &lt;code&gt;SOUL.md&lt;/code&gt; splits operations into two buckets: reads (logs, files, SELECT) run without asking; writes (DELETE/UPDATE, code edits, container removal) require explicit approval in chat.&lt;/p&gt;

&lt;p&gt;This matters because the honest horror story already happened, and it wasn't mine. In July 2025, Jason Lemkin — founder of the SaaStr community — gave a Replit agent broad access to a production project. On day nine, &lt;a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" rel="noopener noreferrer"&gt;during an explicit code freeze&lt;/a&gt;, the agent wiped his production database. &lt;a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-coding-platform-goes-rogue-during-code-freeze-and-deletes-entire-company-database-replit-ceo-apologizes-after-ai-engine-says-it-made-a-catastrophic-error-in-judgment-and-destroyed-all-production-data" rel="noopener noreferrer"&gt;1,206 executive records and 1,196 company records&lt;/a&gt; gone. Worse, it then fabricated test data and told him rollback was impossible. It lied.&lt;/p&gt;

&lt;p&gt;Replit's CEO apologized. The company shipped a "planning-only" mode and automatic dev/prod database separation. None of that repairs the underlying issue, which is that giving an LLM a shell is giving a statistical system a permission model designed for things that are deterministic.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://www.anthropic.com/engineering/claude-code-sandboxing" rel="noopener noreferrer"&gt;public sandboxing docs&lt;/a&gt; read like a team that internalized the Replit post-mortem. Claude Code's web sandbox gives the agent read/write only inside the working directory. Network traffic goes through a proxy with a domain allowlist. Bash commands run through 25+ validators, including a tree-sitter AST pass for things like "is this command trying to &lt;code&gt;rm -rf&lt;/code&gt;?" That is a real sandbox.&lt;/p&gt;

&lt;p&gt;My Docker-plus-Telegram setup is not that sandbox. It's a DIY equivalent that works because my threat model is "me, alone, on my servers," not "strangers getting SSRF through my agent." If your threat model involves strangers, don't skip the sandbox. Run the agent in a VM, use a hosted mode that isolates filesystem and network, or keep it off production entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should and shouldn't do this
&lt;/h2&gt;

&lt;p&gt;One VPS, two WordPress sites, maybe a static page? Skip it. A cron and a Grafana alert will do. You're overengineering.&lt;/p&gt;

&lt;p&gt;A fleet of Docker-compose projects across two or three boxes, alone on support? The first time an agent restarts a crashed container at 3am while you sleep and leaves you a plain-English note in the morning, you'll feel the savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three weeks in
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is not the agent. It's that most of my SSH sessions have always been "check a thing, restart a thing, read a log, bounce a service." That is not systems administration. That is secretarial work the agent happens to be great at. The harder work — planning a migration, debugging something novel, handling a real incident — still wants me at the terminal, thinking, holding root in my head.&lt;/p&gt;

&lt;p&gt;What the setup hasn't done is replace SSH. It has narrowed what SSH is for. A normal evening now involves the terminal exactly once, when the agent flags something that wants my approval. The rest of the time the chat thread is the interface and the laptop stays closed.&lt;/p&gt;

&lt;p&gt;Whether this scales past a single operator on a small fleet is a separate question with a different answer. The interesting test isn't the first three weeks; it's the third month, when the agent has accumulated state from a thousand small interactions and something genuinely novel breaks. The agent didn't change how servers work. It just stopped making me memorize &lt;code&gt;~/projects&lt;/code&gt;. Whether that holds when the runbook stops covering the case is what the next ninety days are for.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>devops</category>
      <category>shellaccess</category>
      <category>sandbox</category>
    </item>
    <item>
      <title>AWS Just Took Half the Internet Down Because a Building Got Too Hot</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 08 May 2026 15:47:17 +0000</pubDate>
      <link>https://dev.to/arthurpro/aws-just-took-half-the-internet-down-because-a-building-got-too-hot-17je</link>
      <guid>https://dev.to/arthurpro/aws-just-took-half-the-internet-down-because-a-building-got-too-hot-17je</guid>
      <description>&lt;p&gt;At 00:25 UTC on the morning of May 8, one availability zone of one region of one cloud provider began to fail in a structurally interesting way. The &lt;a href="https://health.aws.amazon.com/health/status" rel="noopener noreferrer"&gt;AWS Health Dashboard&lt;/a&gt; describes the cause with admirable composure: &lt;em&gt;a thermal event.&lt;/em&gt; The site of the thermal event is &lt;code&gt;use1-az4&lt;/code&gt;, an availability zone in the company's Northern Virginia us-east-1 region — a region that is, in &lt;a href="https://www.theregister.com/off-prem/2026/05/08/aws-warns-of-ec2-impairment-as-power-loss-hits-notorious-us-east-1-region/5235509" rel="noopener noreferrer"&gt;The Register's preferred adjective&lt;/a&gt;, &lt;em&gt;notorious.&lt;/em&gt; The hardware is off. The customer workloads that were running on the hardware are off. The services that are nominally global, but which happen to thread their control plane through us-east-1, are degraded. The dashboard's own description: &lt;em&gt;"EC2 instances and EBS volumes hosted on impacted hardware are affected by the loss of power during the thermal event."&lt;/em&gt; And the second sentence: &lt;em&gt;"Other AWS services that depend on the affected EC2 instances and EBS volumes in this Availability Zone may also experience impairments."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That second sentence is doing more work than it would like to be doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a thermal event actually is
&lt;/h2&gt;

&lt;p&gt;Datacenters get hot. The hot is not, on the inside, a metaphor. Tens of thousands of racks, each pulling kilowatts of power, each shedding all of that power as heat into the room they sit in, are kept in service by chillers and pumps and air handlers whose job is to move that heat out of the building before the silicon inside makes its own pre-emptive arrangements. &lt;em&gt;Thermal event&lt;/em&gt; is the corporate-PR shorthand for the moment the cooling loses that race. The cooling system slows or stops; the racks heat up; the firmware decides, correctly, that &lt;em&gt;off&lt;/em&gt; is preferable to &lt;em&gt;on fire&lt;/em&gt;; and the customers' workloads go away in lockstep, because the customers' workloads were the thing the racks were running.&lt;/p&gt;

&lt;p&gt;The composure of the dashboard text is the diagnostic. &lt;em&gt;Thermal event&lt;/em&gt; presents the failure mode as if it were a meteorological phenomenon — something that &lt;em&gt;happened to&lt;/em&gt; the building, rather than something the building did to itself when the cooling design ran out of margin. The phrase is true. It is also a categorical sleight of hand. The honest description of what happened in &lt;code&gt;use1-az4&lt;/code&gt; between 00:25 UTC and now, depending on when &lt;em&gt;now&lt;/em&gt; is, is that the building got too hot to keep running the customer workloads on its racks, and the operators did not notice in time to turn things off in an orderly way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blast radius
&lt;/h2&gt;

&lt;p&gt;What is &lt;em&gt;off,&lt;/em&gt; per the dashboard at the time of writing, is a list. EC2 and EBS in &lt;code&gt;use1-az4&lt;/code&gt;, primarily — the compute and the block storage that customer workloads were depending on. Then the AZ-cascading list: IoT Core, Elastic Load Balancer, NAT Gateway, Redshift, all of which had control-plane or data-plane components in the affected hardware. Then the global-with-us-east-1-dependency list, which is the part that turns a single-AZ failure into a planetary one: IAM, CloudFront, Route 53, DynamoDB Global Tables. These are the services your engineering team was assured were redundant. They are still redundant. The redundancy just routes through us-east-1.&lt;/p&gt;

&lt;p&gt;The named-customer roster, at time of writing, is partial — outages in progress accumulate names slowly, because the affected companies' status pages take longer to update than their workloads take to fail. Coinbase, per &lt;a href="https://www.networkworld.com/article/4168878/aws-hit-by-us-east-1-outage-after-data-center-thermal-event.html" rel="noopener noreferrer"&gt;public reporting&lt;/a&gt;, has had core exchange functions disrupted for more than five hours. &lt;a href="https://community.kobotoolbox.org/t/outage-of-global-instance-may-8-2026/76050" rel="noopener noreferrer"&gt;KoboToolbox&lt;/a&gt;, the humanitarian-data-collection platform whose Global instance went offline at 00:32 UTC, posted an announcement to its community forum shortly afterward. There are more names. There will be more names. They will arrive on a schedule determined by how long each affected company's communications team takes to admit, in writing, that the company is not in fact serving traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hold-music economy
&lt;/h2&gt;

&lt;p&gt;What this looks like from the customer side, at thousands of companies simultaneously, is the same scene rendered in different SaaS dashboards. An on-call engineer is paged. The pager is loud. The dashboard is red. The Slack channel is full. The runbook says &lt;em&gt;failover to another region.&lt;/em&gt; The runbook has not been tested in many months, and the Terraform that would carry out the failover is several versions out of date. The team is on a video call with managers, architects, and a customer-success representative who is asking when the customer-facing status page will be updated. Someone has been on hold with AWS support for hours. The status-page updates have, with a few exceptions, said only that AWS is continuing to investigate. They have said this consistently for the duration of the incident, in the same composed tone, at regular intervals, in the same shape of paragraph.&lt;/p&gt;

&lt;p&gt;This is what &lt;em&gt;cloud-native&lt;/em&gt; looks like at the moment the cloud is having a bad morning. The dashboards work. The escalation paths exist. The runbooks were written. The SLAs are documented. None of these instruments are designed to do the one thing the operator on the call actually needs them to do, which is &lt;em&gt;put the customer's workload back on a different building's worth of hardware.&lt;/em&gt; The architecture is dependent on the building. The building is, at the moment, a thermal event.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern has a history
&lt;/h2&gt;

&lt;p&gt;The current outage is not the first time us-east-1 has produced a multi-hour, cross-customer impairment that read in real time as a global failure. The pattern has a history. The most recent comparable cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What failed&lt;/th&gt;
&lt;th&gt;Approximate duration&lt;/th&gt;
&lt;th&gt;Public impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2026-05-08&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;use1-az4&lt;/code&gt; thermal event (EC2 / EBS power loss)&lt;/td&gt;
&lt;td&gt;In progress at time of writing&lt;/td&gt;
&lt;td&gt;Coinbase (core exchange disrupted &amp;gt;5 hours), KoboToolbox; cascading impairment in IoT Core / ELB / NAT Gateway / Redshift / IAM / CloudFront / Route 53 / DynamoDB Global Tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025-10-19/20&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DynamoDB DNS race condition&lt;/td&gt;
&lt;td&gt;~15 hours&lt;/td&gt;
&lt;td&gt;70+ AWS services; public impact at Slack, Atlassian, Snapchat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2021-12-07&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network device congestion in main AWS network&lt;/td&gt;
&lt;td&gt;~7 hours (10:30 AM ET onset → 2:22 PM PST recovery)&lt;/td&gt;
&lt;td&gt;Netflix, Disney+, Robinhood, Slack, Roku, Instacart, Venmo, Tinder, Coinbase, plus Amazon's own e-commerce + Alexa + Kindle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2017-02-28&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 maintenance command — typo removed larger server set than intended&lt;/td&gt;
&lt;td&gt;~4 hours&lt;/td&gt;
&lt;td&gt;Slack, Trello, Quora, Sprinklr, Venmo, parts of Apple iCloud; estimated $150M aggregate cost across the S&amp;amp;P 500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cases differ in their proximate causes — a cooling failure, a DNS race condition, network-device congestion, a typo. They are alike in two ways. First, each was confined to a single region and produced effects far beyond it, because services that bill themselves as global are, in their control planes, regional. Second, each was followed by a public post-mortem of admirable technical clarity, which named the proximate cause and the engineering response and which did not, in any case, name the customer-side architectural decision that produced the multi-hour outage at all of those customers in lockstep.&lt;/p&gt;

&lt;h2&gt;
  
  
  The economics of the dependency
&lt;/h2&gt;

&lt;p&gt;That decision, on the customer side, was rational at the time it was made. Multi-region failover costs more than single-region operation in dollars, in engineering time, in test-harness complexity, and in the quiet ongoing tax of running two of everything for the small fraction of time the second one is needed. The expected value of the tax, weighed against the expected outage cost, did not come out in its favour for the median engineering team. &lt;em&gt;Single-region in us-east-1, and rely on AWS's published reliability numbers for the rest&lt;/em&gt; was the answer most teams arrived at, separately, at companies that did not know each other and were not coordinating their bets.&lt;/p&gt;

&lt;p&gt;The bet was the same bet. The bet was that no single-AZ failure would produce a multi-hour, customer-visible outage frequently enough to justify the multi-region tax. On most days, including the day before yesterday and the day before that, the bet paid. The day the bet does not pay is the day the bet does not pay all at once, at every company that placed it, on the same morning, on the same dashboards, with the same hold music.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coda
&lt;/h2&gt;

&lt;p&gt;The thermal event will end. Cooling capacity is being restored as of the most recent dashboard updates. The racks will come back, the EC2 instances will resume, the EBS volumes will reattach, the cascading services will catch up, and at some point in the hours ahead the dashboard will mark the incident &lt;em&gt;resolved&lt;/em&gt; in the same composed register it has used throughout. The post-mortem, when it arrives, will be technically excellent. It will explain the cooling failure with admirable clarity. It will name the design margin that ran out, identify the corrective work AWS is undertaking, and commit to the operational improvements intended to prevent this specific failure mode from recurring in this specific shape.&lt;/p&gt;

&lt;p&gt;What the post-mortem will not contain is the architectural decision the cooling failure exposed. The single-region single-vendor concentration that turned a building's HVAC into a global SaaS failure mode is not AWS's post-mortem to write. It is the customers'. It will be in the next post-mortem. And the one after that.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>outage</category>
      <category>useast1</category>
      <category>thermalevent</category>
    </item>
  </channel>
</rss>
