A real production incident, what caused it, and the three-layer fix we built to make sure it never happens again.
It Started Like Any Other Deployment
We had a client deadline to hit. The backend team had migrated to a new API server — new domain, new database, cleaner infrastructure. The frontend team had updated the environment variables in AWS Amplify, pushed the code, and watched the CI/CD pipeline go green.
Build successful. Deployment successful.
We sent a quick message in the team channel. Closed our laptops. Done.
Or so we thought.
Something Did Not Feel Right
A little while later, one of our engineers was doing a routine check — not because anything looked wrong, just habit. He pulled up the server logs on the old API server. The one we had just replaced. The one that was supposed to be getting no traffic anymore.
It was getting traffic.
Not a little. Actual client requests, hitting the old endpoints, writing data to the old database.
We stared at the logs for a moment. The new deployment was live. The new API was running fine. But some clients were still talking to the old server like nothing had changed.
The first question was obvious: how long has this been going on?
We checked the timestamps. Somewhere between 30 minutes and 2 hours of real client data had gone to the wrong database.
That is the moment the deadline-pressure confidence completely evaporated.
What Actually Happened
Here is the thing about React apps deployed on AWS Amplify, or honestly any SPA hosted on a CDN. The problem is not tied to one toolchain: whether you are using Vite, Next.js static exports, or Create React App (CRA), the same behaviour applies. Environment variables are replaced at build time, not at runtime.
When you run npm run build, the bundler takes every REACT_APP_* or VITE_* variable and does a find and replace across your entire codebase. The variable is gone. In its place, sitting inside your compiled main.a3f4c2.js file, is a hardcoded string:
// What you wrote in your source code
const url = process.env.REACT_APP_API_URL;
// What actually ended up in the compiled bundle
const url = "https://old-api.yourdomain.com";
That string is physically embedded in a file that is now cached in every client's browser.
When we deployed the new version, new users got the new bundle with the new API URL. But existing users — the ones who had the app open and never refreshed — were still running the old bundle. And the old bundle had the old domain hardcoded inside it.
They kept calling the old API. The old API kept responding. Data kept going to the wrong place. Silently.
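To make the find-and-replace concrete, here is a toy sketch of what the bundler's substitution step amounts to. `inlineEnv` is a hypothetical helper written only for illustration; real bundlers operate on the parsed source rather than with a regular expression, but the effect on the output is the same.

```javascript
// Toy illustration only: real bundlers do not use a regex for this.
// inlineEnv is a hypothetical name, not part of any build tool.
function inlineEnv(source, env) {
  return source.replace(/process\.env\.([A-Z0-9_]+)/g, (match, name) =>
    // Variables that are not defined are left untouched in this sketch.
    name in env ? JSON.stringify(env[name]) : match
  );
}

const source = 'const url = process.env.REACT_APP_API_URL;';
const built = inlineEnv(source, {
  REACT_APP_API_URL: 'https://old-api.yourdomain.com',
});

console.log(built);
// const url = "https://old-api.yourdomain.com";
```

Once the build has run, nothing in the output refers to the variable any more, which is why changing it in the Amplify Console has no effect on bundles that already exist.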
The Human Element That Made It Worse
In the rush to hit the client deadline, our DevOps engineer forgot to shut down the old API server after the deployment.
This is not about blame. Deployments done under deadline pressure are exactly when humans forget steps. That is just reality. If your system depends on a person remembering to do a manual step at exactly the right time under stress, your system has a gap. The old server being alive meant there was nothing to catch the stale clients. They hit the old endpoint, got a 200 OK, and carried on writing data to the wrong database like everything was fine.
If the old server had been down, the stale clients would have gotten errors. The app would have visibly broken. We would have noticed and investigated immediately. Instead, we got silence — the worst kind of bug.
Why We Did Not Notice Earlier
This is worth being honest about.
We had monitoring in place — API errors, response latency, uptime checks on the new server. All of those looked healthy. The new API was getting traffic. Everything looked normal.
What we did not have was any comparison between traffic on the old environment versus the new one. We were not alerting on "unexpected requests to a decommissioned endpoint" because the concept of checking a server we assumed was dead had not crossed our minds. The old server was not in our monitoring dashboard anymore. Out of sight, out of mind.
That blind spot is what allowed the incident to stay quiet for up to two hours. Our monitoring was watching the right things in the wrong places.
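The check we were missing can be very small. The sketch below assumes the old server's access log has already been parsed into entries with a `time` field; that shape, the function name, and the sample data are all made up for illustration.

```javascript
// Hypothetical log entry shape: { time: ISO-8601 string, method, path }.
// Adapt the parsing to whatever the old server actually logs.
function requestsAfterCutover(entries, cutoverIso) {
  const cutover = Date.parse(cutoverIso);
  return entries.filter((entry) => Date.parse(entry.time) > cutover);
}

// Made-up sample data, not the real incident logs.
const oldServerLog = [
  { time: '2024-01-10T09:55:00Z', method: 'POST', path: '/orders' },
  { time: '2024-01-10T10:20:00Z', method: 'POST', path: '/orders' },
  { time: '2024-01-10T11:05:00Z', method: 'GET', path: '/profile' },
];

const stale = requestsAfterCutover(oldServerLog, '2024-01-10T10:00:00Z');
if (stale.length > 0) {
  console.warn(`${stale.length} request(s) hit the old API after cutover`);
}
```

Anything above zero after the cutover is worth an alert, because by definition no client should be talking to that server any more.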
Recovering the Data
Once we understood what had happened, the immediate priority was damage assessment.
We compared the timestamps of records written to the old database against the deployment timestamp. Everything written after the deployment to the old database was potentially out of sync with the new system. We exported those records, validated them against what had already arrived in the new database to identify the gap, and migrated the missing records across. Before reopening full client access we did a row count and key-field comparison on both sides to confirm nothing was lost or duplicated.
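The gap identification step can be sketched roughly as follows. This is an illustration rather than the script we ran: it assumes both databases expose records with a shared `id` and a `createdAt` timestamp, and the sample rows are invented.

```javascript
// Hypothetical record shape: { id, createdAt (ISO-8601 string) }.
// Assumes record ids are comparable across the old and new databases.
function findMissingRecords(oldRows, newRows, deployedAtIso) {
  const deployedAt = Date.parse(deployedAtIso);
  const idsInNew = new Set(newRows.map((row) => row.id));
  return oldRows.filter(
    (row) => Date.parse(row.createdAt) >= deployedAt && !idsInNew.has(row.id)
  );
}

// Invented sample rows.
const oldDb = [
  { id: 'a1', createdAt: '2024-01-10T09:30:00Z' }, // written before the deploy
  { id: 'b2', createdAt: '2024-01-10T10:15:00Z' }, // written to the old DB after the deploy
  { id: 'c3', createdAt: '2024-01-10T10:40:00Z' }, // after the deploy, already in the new DB
];
const newDb = [
  { id: 'a1', createdAt: '2024-01-10T09:30:00Z' },
  { id: 'c3', createdAt: '2024-01-10T10:40:00Z' },
];

const toMigrate = findMissingRecords(oldDb, newDb, '2024-01-10T10:00:00Z');
console.log(toMigrate.map((row) => row.id)); // [ 'b2' ]
```

The row count and key-field comparison mentioned above would then run after the migration, as a separate confirmation that both sides agree.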
It was not a complicated recovery — but it was stressful and time-consuming work that should never have been necessary. The part that stung most was that the data was perfectly valid. The clients had done nothing wrong. The bug was entirely on our side, invisible to them, and we had to fix it without them noticing.
The Fix: Three Layers
Once the incident was contained, we sat down to fix this properly. After discussing it internally we realised there was not a single root cause: several safeguards were missing at once. So we addressed it in multiple layers, each one targeting a different failure point.
Layer 1 — Take the API URL Out of the Build
The root cause was the API URL being baked into the JavaScript bundle at compile time. So the fix is simple in concept: stop doing that.
Instead of relying on REACT_APP_* env vars, we created a config.json file that lives in the public/ folder of the React project. The public/ folder is special — the bundler never processes it. Everything in there is copied as-is to the build output and served as static files.
This config.json holds the API URL as a plain value:
{
"API_BASE_URL": "https://api.yourdomain.com",
"ENV": "production"
}
We have three versions of this file in our repo — one per environment:
public/
├── config.dev.json ← local development
├── config.staging.json ← staging environment
└── config.prod.json ← production
In amplify.yml, before the build runs, we copy the right file into place:
version: 1
frontend:
  phases:
    preBuild:
      commands:
        - cp public/config.${APP_ENV}.json public/config.json
    build:
      commands:
        - npm ci
        - npm run build
In Amplify Console, each branch gets one environment variable — APP_ENV — set to dev, staging, or prod. That is the only Amplify env var we need now.
Then in the React app, before anything renders, we fetch this file:
// src/config/index.js
let _config = null;

export async function loadConfig() {
  // The timestamp query param is a belt-and-suspenders safety net during
  // rollout, while we verified the no-cache headers had fully propagated
  // through CloudFront. Once cache headers are confirmed working, the
  // headers alone are sufficient, but the query param costs nothing to keep.
  const res = await fetch(`/config.json?t=${Date.now()}`);
  // Fail loudly instead of booting with a missing or broken config.
  if (!res.ok) {
    throw new Error(`Failed to load config.json: HTTP ${res.status}`);
  }
  _config = await res.json();
}

export function getConfig() {
  if (!_config) throw new Error('Config not loaded yet');
  return _config;
}
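One thing a loader like this can usefully do is sanity-check what it fetched before the app starts using it. The following is a minimal sketch, not part of our codebase: `validateConfig` is a hypothetical helper, and the rule that production must use HTTPS is an assumption you may want to relax.

```javascript
// Hypothetical validator; the keys mirror the config.json shown earlier.
function validateConfig(config) {
  if (!config || typeof config.API_BASE_URL !== 'string') {
    throw new Error('config.json is missing API_BASE_URL');
  }

  let url;
  try {
    url = new URL(config.API_BASE_URL);
  } catch {
    throw new Error(`API_BASE_URL is not a valid URL: ${config.API_BASE_URL}`);
  }

  // Assumption: production traffic must always go over HTTPS.
  if (config.ENV === 'production' && url.protocol !== 'https:') {
    throw new Error('API_BASE_URL must use https in production');
  }
  return config;
}

const checked = validateConfig({
  API_BASE_URL: 'https://api.yourdomain.com',
  ENV: 'production',
});
console.log(checked.API_BASE_URL); // https://api.yourdomain.com
```

Calling something like this on the parsed JSON before storing it means a bad config file breaks the app at startup, where it is obvious, instead of on the first API call.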
And in src/index.js, we make this the very first thing that happens before React renders:
import ReactDOM from 'react-dom/client';
import { loadConfig } from './config';
import App from './App';

async function bootstrap() {
  await loadConfig(); // fetch config first, then render
  ReactDOM.createRoot(document.getElementById('root')).render(<App />);
}

bootstrap();
Every API call now reads the URL from in-memory config:
import { getConfig } from '../config';

export async function apiClient(path, options = {}) {
  const { API_BASE_URL } = getConfig();
  const res = await fetch(`${API_BASE_URL}${path}`, options);
  return res.json();
}
One thing to be clear about: Layer 1 alone does not fix a tab that is already open. A user who opened the app before the deployment and never refreshed has the old config loaded in memory. Even if config.json on the server now points to the new API, that tab will not go and re-fetch it mid-session. Layer 1 helps every new session after a deployment. Layer 2 handles the open tabs.
Layer 2 — Use WebSocket to Push a Deploy Notification to Open Tabs
Even with Layer 1, a user who opened the app an hour ago and has it sitting in a browser tab is still running old config in memory. We needed a way to reach those open tabs and tell them that a new deployment had happened.
The obvious approach is polling — the client checks a version.json file every few minutes. But we already had WebSockets in our app for real-time features. So we took a better path: instead of the client repeatedly asking "has anything changed?", the server tells every connected client when a new version has been deployed. No polling, no wasted requests. Just a direct push at exactly the right moment.
Here is the key thing we got wrong in our first attempt at this, and it is worth explaining clearly.
Our first instinct was to broadcast the notification from the old server when it shuts down — using Node's SIGTERM handler. Technically that works in the one scenario where you are changing domains and decommissioning a server. But it completely fails for every other normal deployment — feature releases, UI updates, bug fixes — where the API server does not change at all and nothing shuts down. SIGTERM never fires. The broadcast never happens. Open tabs never get notified.
The right approach is to tie the notification to the UI deployment itself, not to any server lifecycle event.
Every time Amplify builds and deploys a new version of the UI, it triggers the notification. The server just listens and broadcasts when asked. This works for every deployment, every time, regardless of whether the backend changed.
How it works end to end:
Amplify builds and deploys new UI
│
▼
postBuild in amplify.yml calls a webhook on your API server
│
▼
API server verifies the request → broadcasts VERSION_UPDATE
to all connected WebSocket clients
│
├── Open Tab A (old version in memory) → banner appears
├── Open Tab B (old version in memory) → banner appears
└── Open Tab C (just opened, new version) → no banner
Step 1 — Stamp a version on every build
In amplify.yml, the preBuild phase stamps a Unix timestamp into public/version.json. This file gets deployed to CloudFront along with the rest of the build. It represents the identity of this specific deployment:
version: 1
frontend:
  phases:
    preBuild:
      commands:
        - cp public/config.${APP_ENV}.json public/config.json
        # Block scalar: the ": " inside the JSON would otherwise be read by
        # YAML as a mapping separator.
        - |
          echo "{\"version\": \"$(date +%s)\"}" > public/version.json
    build:
      commands:
        - npm ci
        - npm run build
    postBuild:
      commands:
        # Wait 30 seconds for CloudFront to propagate the new files globally
        # before telling clients to refresh. Without this delay, a user who
        # refreshes immediately might still get the old bundle from a CDN
        # edge node that has not caught up yet.
        - sleep 30
        # Send the exact contents of version.json so the broadcast version
        # matches what new sessions boot with. Calling date again here would
        # produce a later timestamp that no client ever loaded.
        - |
          curl -X POST https://api.yourdomain.com/internal/notify-deploy \
            -H "Content-Type: application/json" \
            -H "X-Deploy-Secret: ${DEPLOY_WEBHOOK_SECRET}" \
            -d @public/version.json
  artifacts:
    baseDirectory: build
    files:
      - '**/*'
DEPLOY_WEBHOOK_SECRET is an Amplify environment variable — a shared secret so your API server can verify the webhook call is genuinely coming from your Amplify build and not from anywhere else.
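The sleep 30 matters, but treat it as a heuristic. CloudFront can take a short time to serve new files from all its edge nodes, and if you notify clients too early, a user who refreshes might still hit a stale CDN edge and get the old bundle. Keep in mind that postBuild commands run inside the build container, before Amplify's deploy step publishes the artifacts, so the delay does not start counting from the moment the new files are live. Check the deploy timing in your own build logs and tune the value accordingly.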
Step 2 — The webhook endpoint on your API server
Your Node.js server exposes a protected internal route. When Amplify calls it after a deployment, the server broadcasts to all connected WebSocket clients:
// POST /internal/notify-deploy
// Assumes express.json() is registered so that req.body is parsed.
app.post('/internal/notify-deploy', (req, res) => {
  // Reject anything that does not carry the correct shared secret.
  // The !expected check matters: if the env var is unset and the header is
  // missing, undefined !== undefined is false and the request would pass.
  const expected = process.env.DEPLOY_WEBHOOK_SECRET;
  const secret = req.headers['x-deploy-secret'];
  if (!expected || secret !== expected) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  const { version } = req.body;

  // Broadcast to every open WebSocket connection
  let notified = 0;
  wss.clients.forEach(client => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(JSON.stringify({
        type: 'VERSION_UPDATE',
        version
      }));
      notified++;
    }
  });

  console.log(`[Deploy] Notified ${notified} connected clients of version: ${version}`);
  res.json({ ok: true, notified });
});
The notified count in the log is genuinely useful — after every deployment you can see exactly how many open sessions received the push notification.
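A plain string comparison works for the secret, but two details are worth handling explicitly: an unset secret should never match anything, and the comparison should ideally take constant time. The helper below is a hedged sketch of one way to do that with Node's built-in crypto module; `isValidDeploySecret` is a hypothetical name, not something from our server.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: constant-time comparison of the shared secret.
function isValidDeploySecret(provided, expected) {
  // An unset or empty expected secret must never authenticate anyone.
  if (typeof provided !== 'string' || typeof expected !== 'string' || expected.length === 0) {
    return false;
  }
  // Hash both sides first so the buffers always have equal length,
  // which crypto.timingSafeEqual requires.
  const a = crypto.createHash('sha256').update(provided).digest();
  const b = crypto.createHash('sha256').update(expected).digest();
  return crypto.timingSafeEqual(a, b);
}

console.log(isValidDeploySecret('s3cret', 's3cret'));     // true
console.log(isValidDeploySecret(undefined, undefined));   // false
```

In the route handler this would replace the direct comparison, with `req.headers['x-deploy-secret']` as the first argument and `process.env.DEPLOY_WEBHOOK_SECRET` as the second.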
Step 3 — The React client captures its boot version and listens
When the app first loads, it fetches version.json and stores that version in memory. That is the version this session started with. When the WebSocket event arrives later, the client compares the incoming version with the one it booted on:
// src/index.js
import React from 'react';
import ReactDOM from 'react-dom/client';
import App from './App';
import { loadConfig } from './config';

async function bootstrap() {
  // Load runtime config
  await loadConfig();

  // Capture the version this session started with
  const versionRes = await fetch(`/version.json?t=${Date.now()}`);
  const { version: bootVersion } = await versionRes.json();

  ReactDOM.createRoot(document.getElementById('root')).render(
    <React.StrictMode>
      <App bootVersion={bootVersion} />
    </React.StrictMode>
  );
}

bootstrap();
// src/hooks/useVersionNotification.js
import { useEffect, useState } from 'react';
import { socket } from '../socket'; // your existing WebSocket client

export function useVersionNotification(bootVersion) {
  const [updateAvailable, setUpdateAvailable] = useState(false);

  useEffect(() => {
    const onVersionUpdate = ({ version }) => {
      // Only show the banner if the incoming version is different from
      // what this session loaded with. A tab that just opened with the
      // new version will receive this event too, but its bootVersion
      // will already match, so it sees nothing.
      if (version !== bootVersion) {
        setUpdateAvailable(true);
      }
    };

    socket.on('VERSION_UPDATE', onVersionUpdate);
    // Remove only this handler so other listeners on the event survive
    // (assumes the client supports an emitter-style off(event, handler)).
    return () => socket.off('VERSION_UPDATE', onVersionUpdate);
  }, [bootVersion]);

  return { updateAvailable };
}
// App.jsx
import { useVersionNotification } from './hooks/useVersionNotification';

export default function App({ bootVersion }) {
  const { updateAvailable } = useVersionNotification(bootVersion);

  return (
    <>
      {updateAvailable && (
        <div className="update-banner">
          <span>A new version is available.</span>
          <button onClick={() => window.location.reload()}>
            Refresh now
          </button>
        </div>
      )}
      <YourRoutes />
    </>
  );
}
We went with a banner rather than an automatic hard reload. Auto-reload is convenient, but if a user is halfway through filling a form it destroys their work without warning. The banner is non-intrusive — they see it, finish what they are doing, and refresh when they are ready.
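The hook above relies on the socket client exposing `on` and `off` by event name, while the server sends plain JSON frames with a `type` field. If your WebSocket client does not already do that routing, a thin wrapper along these lines would bridge the two. This is a hypothetical stand-in for `../socket`, not our actual client.

```javascript
// Hypothetical stand-in for '../socket': routes JSON frames by their `type`.
function createTypedSocket() {
  const handlers = new Map(); // type -> Set of handler functions

  return {
    on(type, handler) {
      if (!handlers.has(type)) handlers.set(type, new Set());
      handlers.get(type).add(handler);
    },
    // With a handler, remove just that one; without, remove all for the type.
    off(type, handler) {
      if (!handler) {
        handlers.delete(type);
        return;
      }
      const set = handlers.get(type);
      if (set) set.delete(handler);
    },
    // In the browser: ws.onmessage = (event) => socket.dispatch(event.data);
    dispatch(rawFrame) {
      let message;
      try {
        message = JSON.parse(rawFrame);
      } catch {
        return; // ignore frames that are not JSON
      }
      const set = handlers.get(message && message.type);
      if (set) set.forEach((handler) => handler(message));
    },
  };
}

const socket = createTypedSocket();
let received = null;
const onUpdate = ({ version }) => { received = version; };

socket.on('VERSION_UPDATE', onUpdate);
socket.dispatch(JSON.stringify({ type: 'VERSION_UPDATE', version: '1700000000' }));
console.log(received); // 1700000000
```

The version string in the example is an arbitrary placeholder; in practice it is whatever the deploy webhook broadcast.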
The polling fallback — why we kept it
Even with the WebSocket push in place, we kept a lightweight polling fallback that checks version.json every 5 minutes. The reason is simple: what if the webhook call from Amplify fails? A network blip, a brief server restart, a DNS hiccup during the build — any of these could cause the curl command to fail silently.
The WebSocket push handles the clean, expected path for every deployment. The polling fallback is the safety net for when the push does not get through. If the push fires, users see the banner within seconds. If it does not, they see it within 5 minutes. Either way they get notified.
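The fallback itself is only a few lines. Here is a minimal sketch under some assumptions: `startVersionPolling` is a hypothetical name, and the version fetcher is passed in as a function so the logic stays independent of the network layer.

```javascript
// Hypothetical sketch of the polling fallback. In the browser, fetchVersion
// would be something like:
//   () => fetch(`/version.json?t=${Date.now()}`)
//           .then((res) => res.json())
//           .then((json) => json.version)
function startVersionPolling({
  bootVersion,
  fetchVersion,
  onUpdate,
  intervalMs = 5 * 60 * 1000, // every 5 minutes
}) {
  let stopped = false;

  async function check() {
    try {
      const latest = await fetchVersion();
      if (!stopped && latest !== bootVersion) {
        // One notification is enough; stop polling once it has fired.
        stopped = true;
        clearInterval(timer);
        onUpdate(latest);
      }
    } catch (err) {
      // A failed poll is not fatal; the next tick simply tries again.
    }
  }

  const timer = setInterval(check, intervalMs);

  return {
    check, // exposed so a caller can also trigger a check on demand
    stop() {
      stopped = true;
      clearInterval(timer);
    },
  };
}
```

Inside the React app this would typically be started in the same effect that subscribes to the WebSocket event, with `onUpdate` setting the same `updateAvailable` state and the returned `stop` called from the effect cleanup.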
Layer 3 — Fix the Cache Headers in Amplify
This layer is the quickest to implement and the one we should have had from day one — because it does not require any code changes at all.
The problem is that CloudFront, by default, can cache your index.html file. When it does that, even after a fresh deployment, some users fetch a cached index.html that still points to old JS bundle filenames. They never even receive the new code, regardless of what you deployed.
The rule is straightforward:
- Files that act as entry points (index.html, config.json, version.json) should never be cached. They need to be fresh on every request.
- Static JS and CSS bundle files can be cached forever, because the bundler gives them content hashes in their filenames (like main.a3f4c2.js). If the file content changes, the filename changes. A cached old file and a new file can never share the same name.
In Amplify Console under App settings > Custom headers:
customHeaders:
  - pattern: '**/*.html'
    headers:
      - key: 'Cache-Control'
        value: 'no-cache, no-store, must-revalidate'
  - pattern: '/config.json'
    headers:
      - key: 'Cache-Control'
        value: 'no-cache, no-store, must-revalidate'
  - pattern: '/version.json'
    headers:
      - key: 'Cache-Control'
        value: 'no-cache, no-store, must-revalidate'
  - pattern: '**/*.js'
    headers:
      - key: 'Cache-Control'
        value: 'public, max-age=31536000, immutable'
  - pattern: '**/*.css'
    headers:
      - key: 'Cache-Control'
        value: 'public, max-age=31536000, immutable'
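This was surprisingly quick to set up. No application code changes, just a configuration update in the console. (Amplify's documentation does call for redeploying the app after saving custom headers, so plan for one redeploy before the new headers take effect.) If you are running a React app on Amplify and you have not done this yet, this is the one thing worth doing before anything else on this list.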
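For a post-deploy smoke test, it can help to have the caching rule written down as a function you can compare real response headers against. The helper below is a hypothetical sketch that mirrors the rule in this section; it is not something Amplify provides.

```javascript
// Hypothetical helper mirroring the caching rule described in this section.
const NO_CACHE = 'no-cache, no-store, must-revalidate';
const IMMUTABLE = 'public, max-age=31536000, immutable';

function expectedCacheControl(path) {
  // Entry points must be fetched fresh on every request.
  if (path.endsWith('.html') || path === '/config.json' || path === '/version.json') {
    return NO_CACHE;
  }
  // Content-hashed bundles are safe to cache for a year.
  if (path.endsWith('.js') || path.endsWith('.css')) {
    return IMMUTABLE;
  }
  return null; // no opinion; leave the hosting default in place
}

console.log(expectedCacheControl('/index.html'));
// no-cache, no-store, must-revalidate
console.log(expectedCacheControl('/static/js/main.a3f4c2.js'));
// public, max-age=31536000, immutable
```

A check script could request each path after a deployment and compare the returned Cache-Control header with the value this function gives.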
The Operational Runbook We Should Have Had
Beyond the three technical layers, we added a deployment process specifically for API domain changes. The incident happened partly because the old server was left running with no process to handle it. So we built one.
Before we bring down any old API server, we now put it into deprecation mode first. All routes return 410 Gone with a clear message:
// Old server — deprecation mode before shutdown
app.use((req, res) => {
res.status(410).json({
error: 'API_DEPRECATED',
message: 'This API version is no longer active. Please refresh your application.'
});
});
We keep this running for 48 hours after the new deployment. If any stale client hits the old server during that window, they get a visible error instead of a silent success. The app breaks loudly. The user refreshes. Problem surfaced, not hidden.
After 48 hours, we check the old server logs one more time to confirm traffic has dropped to zero, then shut it down completely.
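On the client side, the 410 only helps if something reacts to it. A minimal sketch, assuming the API client passes every response through a check like this before parsing the body; the error class and function name are illustrative, not taken from our app.

```javascript
// Hypothetical error type so the UI can tell "deprecated API" apart
// from ordinary request failures.
class ApiDeprecatedError extends Error {
  constructor(message) {
    super(message);
    this.name = 'ApiDeprecatedError';
  }
}

// Call this with the fetch Response before reading the body.
function assertApiStillActive(res) {
  if (res.status === 410) {
    throw new ApiDeprecatedError(
      'This API version is no longer active. Please refresh your application.'
    );
  }
  return res;
}

// Example with a stand-in response object:
try {
  assertApiStillActive({ status: 410 });
} catch (err) {
  console.log(err.name); // ApiDeprecatedError
}
```

A top-level error handler can then catch `ApiDeprecatedError` and show the same refresh banner used for deploy notifications, rather than a generic error message.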
The Pre-Deploy Checklist We Use Now
Before every deployment that involves an API change:
Before deploying:
- [ ] New API server is healthy and confirmed working in staging
- [ ] config.{env}.json updated with the new URL
- [ ] DEPLOY_WEBHOOK_SECRET set in Amplify environment variables
- [ ] PR reviewed and approved by at least one other engineer
At deploy time:
- [ ] Amplify build runs and the postBuild webhook fires automatically
- [ ] Check server logs to confirm connected clients were notified
- [ ] Put old API server into 410 mode immediately after new server is confirmed live
48 hours after deploy:
- [ ] Check old server logs — is traffic at zero?
- [ ] Shut down old server
- [ ] Remove old domain DNS record
What We Changed, Side by Side
Before this incident, here is how a deployment worked:
- Update REACT_APP_API_URL in Amplify Console manually
- Trigger a rebuild: new URL baked into the new JS bundle
- Deploy
- Hope all clients refresh soon
- Shut down old server (if remembered)
After:
- Edit public/config.prod.json in the repo and update the URL
- Raise a PR, get it reviewed, merge
- Amplify CI/CD triggers automatically: copies config, stamps version, builds, deploys
- All new sessions fetch fresh config.json and get the new URL immediately
- postBuild webhook fires → API server broadcasts VERSION_UPDATE to all open WebSocket clients → banner appears
- Polling fallback catches any tabs that missed the push
- Old server enters 410 mode for 48 hours, then shuts down cleanly
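The URL no longer lives in the JS bundle, so changing it means editing one JSON file in the repo rather than an environment variable in a console. There is a full audit trail in Git. The steps that used to depend on someone's memory under deadline pressure are now either automated or written into a checklist. And open tabs are notified within seconds when the webhook push gets through, or within 5 minutes via the polling fallback.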
What We Learned
The incident was not caused by a bad engineer or a bad process. It was caused by a system designed in a way that made a silent failure possible. The API URL was in the wrong place. There was no mechanism to tell stale clients that the world had changed. And there was no process to contain the blast radius when things went wrong.
A few things have stayed with us since:
Build-time config is a trap for runtime values. If a value can change between deployments — especially an API URL — it should not be baked into the bundle. The JS bundle should contain logic, not environment-specific values.
Silence is more dangerous than noise. The old server returning 200 OK instead of an error was the worst possible outcome. A visible failure would have been caught in minutes. Silent data corruption took up to two hours to notice. When you deprecate something, make it fail loudly.
Deadline pressure breaks manual steps. The forgotten shutdown was not incompetence — it was a predictable outcome of humans under pressure. Any step that must be done manually, at exactly the right moment, under stress, will eventually be missed. Automate it or put it in a written checklist with a sign-off.
Monitor the things you are turning off, not just the things you are turning on. We had monitoring on the new API from day one. We had nothing watching the old one. That is what gave the incident its silence. When you migrate away from something, keep an eye on it until you are certain it is dead.
One fix is rarely enough. If we had only done Layer 1, open tabs would still silently use stale in-memory config. If we had only done Layer 2, new sessions with a cached index.html might not get the new bundle at all. If we had only done Layer 3, the root cause still existed. All three layers were necessary because the incident had three contributing causes.
Conclusion
Production incidents are uncomfortable. But they are also the clearest possible signal about where your architecture has gaps.
This one taught us that deploying a new bundle is only half of a deployment. The other half is making sure every client — including the ones already sitting in an open tab — ends up running the right code against the right backend.
The three layers we built are not complicated. Runtime config takes a few hours to implement properly. The WebSocket deploy notification hooks into infrastructure we already had and fires automatically on every Amplify build. The cache headers took one console session to configure.
None of this required a major architectural change. It just required understanding exactly where the failure happened and filling each gap deliberately.
If you are running a React SPA on a CDN and you have not thought about stale client behaviour yet, this is a good time to start. Not because an incident is inevitable — but because when it does happen, you want it to fail loudly and be caught in minutes, not silently corrupt data while everyone thinks the deployment went fine.
Have you dealt with a similar stale client incident? What did your team do differently? Happy to discuss in the comments.