<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 137Foundry</title>
    <description>The latest articles on DEV Community by 137Foundry (@137foundry).</description>
    <link>https://dev.to/137foundry</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856342%2F39ac4be7-399f-4f6e-9a32-60abf8a8a324.png</url>
      <title>DEV Community: 137Foundry</title>
      <link>https://dev.to/137foundry</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/137foundry"/>
    <language>en</language>
    <item>
      <title>How to Write Readiness Checks That Survive Real Production Traffic</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Mon, 22 Jun 2026 10:10:15 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-write-readiness-checks-that-survive-real-production-traffic-13k2</link>
      <guid>https://dev.to/137foundry/how-to-write-readiness-checks-that-survive-real-production-traffic-13k2</guid>
      <description>&lt;p&gt;Readiness checks are deceptively easy to write and surprisingly easy to write badly. A readiness check that works in development can become a load-bearing piece of your production infrastructure in ways the author never intended. This guide walks through how to write a readiness check that holds up under real production traffic and does not become its own reliability liability.&lt;/p&gt;

&lt;p&gt;The audience for the readiness check is the load balancer in your orchestrator (typically &lt;a href="https://kubernetes.io" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; or a similar platform) and any front-end reverse proxy like &lt;a href="https://www.nginx.com" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; or an edge service like &lt;a href="https://www.cloudflare.com" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt;. The job of the check is to tell the load balancer whether this instance can serve real production traffic right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Define what "ready" means for this service
&lt;/h2&gt;

&lt;p&gt;Before writing any code, write down the dependencies your service must have available to serve a real production request. The list should be short and specific. For most web services, it looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The primary database (read and write).&lt;/li&gt;
&lt;li&gt;The primary cache.&lt;/li&gt;
&lt;li&gt;The message broker, if the request path produces messages synchronously.&lt;/li&gt;
&lt;li&gt;Any internal upstream service called synchronously in the request handler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things that should not be on the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asynchronous background dependencies. If your service writes events to a queue but the request can complete without confirming queue health, the queue is not a readiness dependency.&lt;/li&gt;
&lt;li&gt;Non-critical caches. If a cache outage degrades performance but does not break the request, it should be reported in detail status but not block readiness.&lt;/li&gt;
&lt;li&gt;External APIs called outside the synchronous request path. Same reasoning.&lt;/li&gt;
&lt;li&gt;Disk space, memory, CPU. These are metrics for &lt;a href="https://prometheus.io" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; or another monitoring tool, not binary readiness signals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is the entire substance of the readiness check. Writing it down explicitly is the most important step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmfodtw0ldv84yotjwdth.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmfodtw0ldv84yotjwdth.jpeg" alt="Server rack with cables and indicator lights in a data center" width="799" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by panumas nikhomkhai on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Implement each dependency check as a fast, isolated function
&lt;/h2&gt;

&lt;p&gt;Each dependency check should be a small function that takes no parameters, runs a single fast query against the dependency, and returns a small result object: an &lt;code&gt;ok&lt;/code&gt; flag, the measured latency, and a brief error message on failure.&lt;/p&gt;

&lt;p&gt;A reasonable database readiness check looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function checkPrimaryDb() {
  const start = performance.now();
  try {
    await db.query("SELECT 1");
    return { ok: true, latency_ms: performance.now() - start };
  } catch (e) {
    return { ok: false, error: e.message, latency_ms: performance.now() - start };
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes on each part:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query is &lt;code&gt;SELECT 1&lt;/code&gt;. Not a real query. Not a table scan. The check is testing connectivity, not performance.&lt;/li&gt;
&lt;li&gt;The timing is captured. You will want it for the detailed status endpoint.&lt;/li&gt;
&lt;li&gt;The function does not throw; it returns a structured result. The caller decides what to do with a failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repeat this pattern for each dependency. Resist the urge to write a "generic" check that takes a dependency name and runs a different query per name. Specific is better than generic here; every dependency has its own quirks.&lt;/p&gt;
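
&lt;p&gt;As one illustration of repeating the pattern, a cache check might look like the sketch below. The &lt;code&gt;cache&lt;/code&gt; object and its &lt;code&gt;ping()&lt;/code&gt; method are stand-ins invented for this example; substitute whatever cheap round-trip call your real cache client provides.&lt;/p&gt;

```javascript
// Hypothetical cache client, for illustration only. The single assumption is
// an async ping() that rejects when the cache is unreachable.
const cache = {
  async ping() {
    return "PONG";
  },
};

// Same shape as checkPrimaryDb: never throws, always returns a result object.
async function checkPrimaryCache() {
  const start = performance.now();
  try {
    await cache.ping();
    return { ok: true, latency_ms: performance.now() - start };
  } catch (e) {
    return { ok: false, error: e.message, latency_ms: performance.now() - start };
  }
}
```

&lt;p&gt;Keeping the function dedicated to one dependency leaves room for that dependency's quirks, such as a different timeout or a different definition of "reachable".&lt;/p&gt;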

&lt;h2&gt;
  
  
  Step 3: Run the checks in parallel and with a timeout
&lt;/h2&gt;

&lt;p&gt;The readiness handler should call every dependency check in parallel and apply a hard timeout (typically 500 milliseconds to 2 seconds). The timeout matters because a slow dependency that does not return in time should fail the readiness check fast, not hang the probe.&lt;/p&gt;

&lt;p&gt;A reasonable handler in pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function readyz(req, res) {
  const checks = await Promise.all([
    withTimeout(checkPrimaryDb(), 1000),
    withTimeout(checkPrimaryCache(), 500),
    withTimeout(checkMessageBroker(), 1000),
  ]);
  const allOk = checks.every(c =&amp;gt; c.ok);
  res.status(allOk ? 200 : 503).json({ checks });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The parallel execution means the overall check completes in roughly the time of the slowest dependency, not the sum of all dependencies. The timeout means a stuck dependency does not stick the probe.&lt;/p&gt;
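
&lt;p&gt;The handler above leans on a &lt;code&gt;withTimeout&lt;/code&gt; helper that the pseudocode does not define. One possible shape is sketched here, assuming each check returns a promise of a result object. The helper name comes from the handler; its behavior (resolving with a failed result instead of rejecting) is an assumption of this sketch.&lt;/p&gt;

```javascript
// Hypothetical helper: races a check against a timer. On timeout it resolves
// (rather than rejects) with a failed result, so Promise.all never throws.
function withTimeout(checkPromise, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, error: `timed out after ${ms}ms` }),
      ms
    );
  });
  // Convert an unexpected rejection into a failed result as well.
  const guarded = checkPromise.catch((e) => ({ ok: false, error: e.message }));
  // Whichever settles first wins; the timer is always cleaned up afterwards.
  return Promise.race([guarded, timeout]).finally(() => clearTimeout(timer));
}
```

&lt;p&gt;Note that this only stops the probe from waiting. The underlying query keeps running until the client's own timeout fires, so it is worth setting a matching timeout on the dependency client too.&lt;/p&gt;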

&lt;h2&gt;
  
  
  Step 4: Cache the result briefly
&lt;/h2&gt;

&lt;p&gt;The orchestrator polls the readiness endpoint frequently. If every probe ran a fresh database query, the database would experience a constant trickle of health-check load. Multiply by every pod, and the health check becomes a load source.&lt;/p&gt;

&lt;p&gt;The fix is to cache the readiness result for a short window (typically 1 to 2 seconds). The handler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returns the cached result if it is younger than the cache TTL.&lt;/li&gt;
&lt;li&gt;Otherwise runs the checks fresh, caches the result with the timestamp, and returns it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 1-second cache means each pod runs at most 1 set of dependency checks per second, regardless of how often the orchestrator polls. This is the right shape: the readiness result is fresh enough to be meaningful but not so fresh that it becomes a workload.&lt;/p&gt;
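
&lt;p&gt;A minimal sketch of that caching logic follows. The names &lt;code&gt;cachedReadiness&lt;/code&gt; and &lt;code&gt;runChecks&lt;/code&gt; are invented for this example, and the injectable &lt;code&gt;now&lt;/code&gt; clock exists only to make the behavior easy to test. Sharing one in-flight refresh between concurrent probes is an extra assumption beyond what the text describes.&lt;/p&gt;

```javascript
// Hypothetical wrapper: runChecks is any async function returning the
// readiness result; ttlMs is the cache window (1000 ms as an example).
function cachedReadiness(runChecks, ttlMs = 1000, now = Date.now) {
  let cached = null; // last readiness result
  let cachedAt = 0; // timestamp of the last result
  let inFlight = null; // shared promise while a refresh is running

  return async function getReadiness() {
    if (cached !== null) {
      // Serve the cached result while it is younger than the TTL.
      if (ttlMs > now() - cachedAt) return cached;
    }
    // Concurrent probes share one refresh instead of each hitting dependencies.
    if (inFlight === null) {
      inFlight = runChecks()
        .then((result) => {
          cached = result;
          cachedAt = now();
          return result;
        })
        .finally(() => {
          inFlight = null;
        });
    }
    return inFlight;
  };
}
```

&lt;p&gt;The readiness handler would then call the returned function instead of running the checks directly, which keeps the handler itself unchanged apart from that one call.&lt;/p&gt;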

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7bgkg078jlhwjlgob01v.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7bgkg078jlhwjlgob01v.jpeg" alt="Terminal screen showing application logs and dependency status" width="799" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Al Nahian on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 5: Distinguish between hard and soft failures
&lt;/h2&gt;

&lt;p&gt;Not every dependency is equally critical. A reasonable readiness handler distinguishes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hard dependencies.&lt;/strong&gt; If any of these is down, the service cannot serve requests, so the readiness check should fail. Typical examples are the database, the primary cache, and a message broker on the synchronous request path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soft dependencies.&lt;/strong&gt; If these are degraded, the service can serve requests with degraded behavior. The readiness check should pass; the degraded state should be visible in detailed status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most web services, the soft category includes secondary caches, analytics endpoints, and any non-critical external integration. The boundary depends on the service.&lt;/p&gt;

&lt;p&gt;A common bug: classifying a soft dependency as hard, then having a non-critical service outage take the whole production cluster out of rotation. The fix is to be explicit about the boundary and to test failure modes for each.&lt;/p&gt;
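
&lt;p&gt;One way to make the boundary explicit is to tag each check result as hard or soft and let only hard failures affect the status code. The sketch below assumes each result carries a &lt;code&gt;name&lt;/code&gt;, an &lt;code&gt;ok&lt;/code&gt; flag, and a &lt;code&gt;hard&lt;/code&gt; flag; those field names and the &lt;code&gt;evaluateReadiness&lt;/code&gt; function are illustrative, not part of any framework.&lt;/p&gt;

```javascript
// Hypothetical evaluator: each result carries a name, an ok flag, and a
// "hard" flag set by whoever registered the check.
function evaluateReadiness(results) {
  const failures = results.filter((r) => r.ok !== true);
  const hardFailures = failures.filter((r) => r.hard === true);
  const softFailures = failures.filter((r) => r.hard !== true);
  return {
    // Only hard failures take the instance out of rotation.
    status: hardFailures.length === 0 ? 200 : 503,
    failed: hardFailures.map((r) => r.name),
    // Soft failures are reported for the detailed status view only.
    degraded: softFailures.map((r) => r.name),
  };
}
```

&lt;p&gt;With this shape, moving a dependency between the two categories is a one-flag change, which makes the classification easy to review and test.&lt;/p&gt;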
&lt;h2&gt;
  
  
  Step 6: Handle graceful shutdown
&lt;/h2&gt;

&lt;p&gt;When the process receives SIGTERM (a deploy, a scale-down, an orchestrator-initiated restart), the readiness check should immediately start failing so the load balancer stops routing new requests. Meanwhile, in-flight requests continue to be served until they complete.&lt;/p&gt;

&lt;p&gt;In practice this looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let shuttingDown = false;
process.on("SIGTERM", () =&amp;gt; { shuttingDown = true; });

async function readyz(req, res) {
  if (shuttingDown) {
    return res.status(503).json({ shutting_down: true });
  }
  // ... normal check logic
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The liveness probe should keep passing during this window so the orchestrator does not SIGKILL the process before in-flight work has drained.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Confirm the failure modes in a test environment
&lt;/h2&gt;

&lt;p&gt;This step is what proves the earlier ones work. In a non-production environment, simulate each dependency outage and confirm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Killing the database causes readiness to fail within the timeout window.&lt;/li&gt;
&lt;li&gt;Killing the cache causes readiness to fail.&lt;/li&gt;
&lt;li&gt;Killing the message broker causes readiness to fail.&lt;/li&gt;
&lt;li&gt;Killing a soft dependency does not cause readiness to fail but does show up in detail status.&lt;/li&gt;
&lt;li&gt;SIGTERM causes readiness to fail immediately.&lt;/li&gt;
&lt;li&gt;After dependency recovery, readiness recovers automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run these tests during the initial implementation and after any meaningful change. They catch the four lies a health endpoint can tell, all in about an hour of work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Watch the probe under real load
&lt;/h2&gt;

&lt;p&gt;Once deployed, monitor the readiness probe over time. Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frequent flaps (rapid 200/503/200/503 cycles). Usually indicates a timeout that is too aggressive or a dependency that is genuinely flaky.&lt;/li&gt;
&lt;li&gt;Slow response times. If the probe takes more than 500ms on average, something is wrong with the parallelization or the timeout configuration.&lt;/li&gt;
&lt;li&gt;Correlation with downstream incidents. Genuine readiness failures should correspond to real downstream issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools like Prometheus can scrape the probe over time and graph these signals. The graphs will tell you whether the readiness check is doing its job or has become noise.&lt;/p&gt;
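
&lt;p&gt;To put a number on "flapping", one rough approach is to count status transitions over a window of recent probe results. The sketch below assumes you already have the status codes as an array (from logs or a metrics export); the &lt;code&gt;countFlaps&lt;/code&gt; name and any alert threshold are illustrative choices, not a standard.&lt;/p&gt;

```javascript
// Hypothetical helper: counts how many times consecutive probe results differ.
// statuses is an array of HTTP status codes in chronological order.
function countFlaps(statuses) {
  let flaps = 0;
  for (let i = 1; statuses.length > i; i += 1) {
    if (statuses[i] !== statuses[i - 1]) flaps += 1;
  }
  return flaps;
}

// Example: a 200/503/200/503 cycle contains three transitions.
const recent = [200, 503, 200, 503];
console.log(countFlaps(recent));
```

&lt;p&gt;A single transition is a normal outage or recovery. Several transitions within a minute or two are the pattern worth investigating.&lt;/p&gt;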

&lt;h2&gt;
  
  
  Common mistakes and how to fix them
&lt;/h2&gt;

&lt;p&gt;A few patterns we have seen in audits at &lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;137Foundry's web development service&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential dependency checks.&lt;/strong&gt; Each dependency is checked in series. The total probe time is the sum of all checks, which slows the probe response. Fix: parallelize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No timeout.&lt;/strong&gt; The check awaits each dependency without a timeout. A slow dependency hangs the probe indefinitely. Fix: add a timeout per dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No caching.&lt;/strong&gt; Every probe runs fresh checks. Probe load becomes meaningful at scale. Fix: cache for 1-2 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch-all success.&lt;/strong&gt; A try/catch around the whole handler returns 200 on any exception. Real failures get hidden. Fix: return 503 on any caught error and surface the error in the response body.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing graceful shutdown.&lt;/strong&gt; The probe does not respect SIGTERM. Deploys lose in-flight requests. Fix: add the &lt;code&gt;shuttingDown&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checking the wrong dependency.&lt;/strong&gt; The check queries a stub or a deprecated endpoint that no longer reflects production state. Fix: trace every check to a real production dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A readiness check that survives production traffic is a small piece of code that defines the dependency contract explicitly, runs checks in parallel with timeouts, caches briefly to avoid becoming load, distinguishes hard and soft failures, and respects graceful shutdown. None of these requirements is hard. All of them are easy to miss.&lt;/p&gt;

&lt;p&gt;For the longer write-up on how readiness sits inside the three-endpoint health check design and how the orchestrator, load balancer, and detailed status views all coordinate, the &lt;a href="https://137foundry.com/articles/web-application-health-check-endpoint-worth-trusting" rel="noopener noreferrer"&gt;hub article on health check endpoint design&lt;/a&gt; is the place to land. The &lt;a href="https://137foundry.com/services" rel="noopener noreferrer"&gt;services hub&lt;/a&gt; is where the longer engagements live.&lt;/p&gt;

&lt;p&gt;Most teams ship a v1 readiness check in a single afternoon. Most teams do not revisit it for two years. Walking through the common mistakes listed above usually catches at least one bug in the v1 implementation; doing it once is a strong upgrade for any production web service.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Tell If Your Health Endpoint Is Lying To You</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Mon, 22 Jun 2026 10:09:54 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-tell-if-your-health-endpoint-is-lying-to-you-2fmm</link>
      <guid>https://dev.to/137foundry/how-to-tell-if-your-health-endpoint-is-lying-to-you-2fmm</guid>
      <description>&lt;p&gt;A lying health endpoint is not a fictional problem. Almost every long-running web application accumulates one over time: a &lt;code&gt;/health&lt;/code&gt; or &lt;code&gt;/healthz&lt;/code&gt; route that returns 200 OK even when the application is, by any reasonable definition, broken. The team trusts the dashboard. The dashboard trusts the endpoint. The endpoint trusts a route handler that was scaffolded three years ago and has not been audited since.&lt;/p&gt;

&lt;p&gt;This is a guide to figuring out whether your own health endpoint is telling the truth. It is the audit that every web service should have run at least once and that very few have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four ways a health endpoint lies
&lt;/h2&gt;

&lt;p&gt;A health endpoint can lie in four distinct ways. Knowing which one applies tells you what to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lie #1: It returns 200 OK because the route handler is hard-coded to do so.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The route handler reads, in essence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.get("/health", (req, res) =&amp;gt; res.send("ok"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no check. There is no dependency probe. The endpoint reports the process can serve HTTP, and nothing more. For some teams this is fine (this is what a liveness probe should look like). For most teams who think they have a real health endpoint, it is not fine.&lt;/p&gt;

&lt;p&gt;How to spot it: read the route handler. If it has no awaits, no database calls, no external checks, no try/catch around dependencies, it is this kind of lie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lie #2: It returns 200 OK because the check inside it is catching all errors.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The route handler tries to check the database, the cache, and maybe an upstream service, but every check is wrapped in a &lt;code&gt;try&lt;/code&gt; block whose &lt;code&gt;catch&lt;/code&gt; returns 200 anyway. The original intent may have been "do not let the health endpoint itself crash," but the implementation hides every dependency error behind a green light.&lt;/p&gt;

&lt;p&gt;How to spot it: search the route handler for &lt;code&gt;catch&lt;/code&gt; blocks. If any catch returns success, the endpoint can lie about that path.&lt;/p&gt;
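
&lt;p&gt;For contrast, here is a hedged sketch of the lying handler next to an honest one. Both take a hypothetical &lt;code&gt;checkDb&lt;/code&gt; function and an Express-style &lt;code&gt;res&lt;/code&gt; object; the function names are invented for this example.&lt;/p&gt;

```javascript
// Lying version: the catch block reports success, so a dead database looks green.
async function healthLying(checkDb, res) {
  try {
    await checkDb();
    res.status(200).json({ ok: true });
  } catch (e) {
    res.status(200).json({ ok: true }); // hides the failure
  }
}

// Honest version: the endpoint still never crashes, but the failure is visible.
async function healthHonest(checkDb, res) {
  try {
    await checkDb();
    res.status(200).json({ ok: true });
  } catch (e) {
    res.status(503).json({ ok: false, error: e.message });
  }
}
```

&lt;p&gt;Both versions satisfy the original intent of not letting the health endpoint crash. Only the second one lets the orchestrator see the outage.&lt;/p&gt;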

&lt;p&gt;&lt;strong&gt;Lie #3: It returns 200 OK because it checks the wrong dependency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The handler checks the database connection pool by querying a stub or a status table that does not actually exercise the database. Or it checks a cache that has been replaced by a new cache, but the old check still works. Or it checks an upstream service that has been deprecated and now returns 200 from a maintenance page regardless of state.&lt;/p&gt;

&lt;p&gt;How to spot it: trace every check inside the handler back to a real production dependency. If any of them are vestigial, they are lying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lie #4: It returns 200 OK because the process is fine but the parts that serve real traffic are not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the subtlest lie. The route handler runs on a worker pool, an event loop, or a thread that is completely separate from the workers serving real requests. When the request-serving workers wedge, the health-endpoint worker is unaffected. The endpoint passes; users see latency and errors.&lt;/p&gt;

&lt;p&gt;How to spot it: look at the framework's worker configuration. If the health endpoint runs on its own loop or worker pool, the endpoint is structurally unable to detect a wedge in the request-serving path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F407q5b7vb3bk9okg3ho8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F407q5b7vb3bk9okg3ho8.jpeg" alt="Server rack with cables and indicator lights in a data center" width="799" height="532"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Brett Sayles on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A four-step audit you can run today
&lt;/h2&gt;

&lt;p&gt;The audit takes an hour or two for most web applications and produces concrete output: either the endpoint is telling the truth, or you have a short list of things to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Read the route handler.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open the source file containing the health endpoint. Read it line by line. Write down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What checks does it run?&lt;/li&gt;
&lt;li&gt;What dependencies does each check exercise?&lt;/li&gt;
&lt;li&gt;What does it return on success, on partial failure, on total failure?&lt;/li&gt;
&lt;li&gt;Is the route served by the same worker pool as real traffic?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answers are surprising (you thought it checked the cache and it does not), you have already found one of the lies above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Run a fault-injection test.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a non-production environment, stop the database. Call the health endpoint. Does it return 503? If it returns 200, the endpoint is lying. Repeat for the cache, for any internal upstream services in the request path, and for the message broker.&lt;/p&gt;

&lt;p&gt;This is the single most valuable test you can run on a health endpoint, and it is surprisingly underused. Teams audit code; very few teams audit signals.&lt;/p&gt;
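
&lt;p&gt;The fault-injection check can be scripted so it is repeatable. The sketch below assumes you have already stopped the dependency by hand and simply records whether the endpoint admitted it. The &lt;code&gt;auditHealthEndpoint&lt;/code&gt; name is invented, and the fetch function is passed in so the example does not depend on a live service.&lt;/p&gt;

```javascript
// Hypothetical audit helper. fetchFn is injected so the sketch runs without a
// network; against a real service you could pass the built-in fetch (Node 18+).
async function auditHealthEndpoint(fetchFn, url, stoppedDependency) {
  const response = await fetchFn(url);
  // With a dependency deliberately stopped, anything other than 503 is a lie.
  const honest = response.status === 503;
  return { dependency: stoppedDependency, status: response.status, honest };
}
```

&lt;p&gt;Running it once per dependency (database, cache, upstream services, message broker) produces a short table of which outages the endpoint can and cannot see.&lt;/p&gt;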

&lt;p&gt;&lt;strong&gt;Step 3: Probe under realistic load.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a non-production environment, generate traffic that mimics production. Watch the health endpoint while doing so. Does it stay snappy? Does it ever return 503 spuriously? Tools like &lt;a href="https://prometheus.io" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; scraping the endpoint over time will surface slow checks and intermittent failures that hand-testing misses.&lt;/p&gt;

&lt;p&gt;A health endpoint that takes 2 seconds to respond is not a health endpoint; it is a latency liability. If the orchestrator polls it every 5 seconds, you have just added meaningful background load to your own database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Trace the dashboard's chain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dashboard says the service is healthy. Where does that signal come from? Usually it is an external uptime checker, a monitoring tool, or the orchestrator's own status. Confirm what each of them is actually checking. Often the dashboard is summarizing the orchestrator's view, which in turn is summarizing the liveness probe. If the liveness probe is the always-200 lie from above, the entire dashboard chain is built on a lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common discoveries from the audit
&lt;/h2&gt;

&lt;p&gt;In our experience auditing client systems at &lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;137Foundry&lt;/a&gt;, the patterns recur:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: The handler was written before the service used a database.&lt;/strong&gt; The endpoint is a leftover from when the application was a single in-memory service. The original handler was honest at the time and has never been updated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: The handler checks an old dependency that has been replaced.&lt;/strong&gt; The application now uses Redis for caching, but the health check still queries a stub that pretends a Memcached instance is up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 3: The handler is the same route for liveness, readiness, and detail.&lt;/strong&gt; It tries to satisfy three different audiences and ends up lying to at least one of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 4: The handler catches exceptions silently.&lt;/strong&gt; Someone added a &lt;code&gt;catch&lt;/code&gt; block years ago to prevent the endpoint from 500ing, and now any internal error is masked as 200.&lt;/p&gt;

&lt;p&gt;The fixes are usually 10 to 50 lines of code each, plus a configuration change to the orchestrator's probe settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0rbomefz689d9mfw9bc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0rbomefz689d9mfw9bc.jpeg" alt="Network operations center with monitoring dashboards" width="799" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Keysi Estrada on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-endpoint solution
&lt;/h2&gt;

&lt;p&gt;The longer-term fix is the three-endpoint pattern: split the single &lt;code&gt;/health&lt;/code&gt; into &lt;code&gt;/livez&lt;/code&gt; (liveness), &lt;code&gt;/readyz&lt;/code&gt; (readiness), and &lt;code&gt;/statusz&lt;/code&gt; (detailed JSON for humans). Each endpoint has a clear scope and a clear audience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Liveness checks only that the process can serve HTTP at all. Cheap, fast, no dependencies.&lt;/li&gt;
&lt;li&gt;Readiness checks every dependency the application needs to serve real requests. Slightly slower, parallelized, cached briefly.&lt;/li&gt;
&lt;li&gt;Detailed status returns a JSON payload with per-dependency information for dashboards and on-call engineers.&lt;/li&gt;
&lt;/ul&gt;
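
&lt;p&gt;As a rough sketch of the split, the three handlers can be as small as the following. It assumes a &lt;code&gt;runChecks&lt;/code&gt; function that returns an array of per-dependency results with an &lt;code&gt;ok&lt;/code&gt; flag, and Express-style &lt;code&gt;req&lt;/code&gt; and &lt;code&gt;res&lt;/code&gt; objects; &lt;code&gt;buildHealthRoutes&lt;/code&gt; is a name made up for this example.&lt;/p&gt;

```javascript
// Hypothetical route table; register each entry with your framework's router.
function buildHealthRoutes(runChecks) {
  return {
    // Liveness: the process can answer HTTP. No dependency calls at all.
    "/livez": async (req, res) => {
      res.status(200).json({ alive: true });
    },
    // Readiness: every dependency needed for real requests must be ok.
    "/readyz": async (req, res) => {
      const checks = await runChecks();
      const ready = checks.every((c) => c.ok === true);
      res.status(ready ? 200 : 503).json({ ready });
    },
    // Detailed status: always 200, with per-dependency detail for humans.
    "/statusz": async (req, res) => {
      const checks = await runChecks();
      res.status(200).json({ checks });
    },
  };
}
```

&lt;p&gt;Keeping the three handlers separate means a change to the detailed payload cannot accidentally change what the orchestrator sees.&lt;/p&gt;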

&lt;p&gt;This pattern is described fully in the hub article on &lt;a href="https://137foundry.com/articles/web-application-health-check-endpoint-worth-trusting" rel="noopener noreferrer"&gt;health check endpoint design worth trusting&lt;/a&gt;. Container orchestrators like &lt;a href="https://kubernetes.io" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; are designed around this separation, and load balancers like &lt;a href="https://www.nginx.com" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; and edge proxies like &lt;a href="https://www.cloudflare.com" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; can use a health signal to decide where to route traffic. The split is not a new invention; it is what the underlying infrastructure already expects.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the audit will surface
&lt;/h2&gt;

&lt;p&gt;After the audit, you will land on one of three outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outcome A: The endpoint is honest.&lt;/strong&gt; The route handler runs real checks, returns 503 on real failures, runs on the production worker pool, and the orchestrator's view matches reality. You have spent two hours and confirmed a piece of your stack is working as intended. Worth knowing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outcome B: The endpoint is structurally lying.&lt;/strong&gt; It is hard-coded to 200, or it is checking the wrong things, or it runs on the wrong worker pool. You have a short list of changes to ship. Usually one afternoon of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outcome C: The endpoint is mostly honest but has gaps.&lt;/strong&gt; It checks the database but not the cache. It catches one exception silently but not the others. It checks an upstream that no longer matters. You have a slightly longer list, but each item is mechanical.&lt;/p&gt;

&lt;p&gt;In our consulting work, Outcome C is the most common. Teams add real checks year by year, and the endpoint grows organically into something that mostly works but has at least one significant lie buried in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A lying health endpoint is the most expensive piece of seemingly trivial code in your stack. It is expensive because every alert, dashboard, runbook, and on-call escalation depends on it. When it lies, every downstream signal is poisoned, and the team trusts a green dashboard while users are unhappy.&lt;/p&gt;

&lt;p&gt;The audit is short. The fix is mechanical. The cost of doing it once is a fraction of the cost of a single 2 AM page caused by a probe that should have surfaced the problem ten minutes earlier. If you have not audited your health endpoint in the last year, you are almost certainly running on at least one of the four lies above.&lt;/p&gt;

&lt;p&gt;For a longer reference and the full three-endpoint design pattern, the &lt;a href="https://137foundry.com/articles/web-application-health-check-endpoint-worth-trusting" rel="noopener noreferrer"&gt;hub article&lt;/a&gt; covers the rest. The &lt;a href="https://137foundry.com/services" rel="noopener noreferrer"&gt;services hub&lt;/a&gt; is where the audit conversation usually begins.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Add Idempotency Keys to an Existing Integration Without Breaking Live Traffic</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Sun, 21 Jun 2026 10:06:57 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-add-idempotency-keys-to-an-existing-integration-without-breaking-live-traffic-dfo</link>
      <guid>https://dev.to/137foundry/how-to-add-idempotency-keys-to-an-existing-integration-without-breaking-live-traffic-dfo</guid>
      <description>&lt;p&gt;Most idempotency retrofit work is on integrations that have been running in production for months or years, processing live traffic, with downstream consumers that depend on the existing message format. The challenge is not the cryptography or the design pattern. It is the deployment sequencing that lets you add keys, change behavior, and clean up edge cases without breaking anything in flight.&lt;/p&gt;

&lt;p&gt;This guide walks through a six-step deployment sequence that works for most production retrofits. The whole sequence takes one to three weeks of calendar time per integration, depending on how many consumers you have and how aggressive your retry windows are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fodkzyl90ri06s2csapym.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fodkzyl90ri06s2csapym.jpeg" alt="An aerial view of a shipping container yard with neatly stacked rows" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Giant Asparagus on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Add the Key Column Without Using It
&lt;/h2&gt;

&lt;p&gt;The first deployment is a no-op from a behavioral standpoint. You add a UUID column to the outgoing event table, populate it for new rows, and leave existing rows with NULL.&lt;/p&gt;

&lt;p&gt;The sender does not change yet. The receiver does not change yet. The only difference is that new rows now have a UUID that future code can use.&lt;/p&gt;

&lt;p&gt;This step takes a few hours to ship and prove. You verify by inspecting the database that new rows are getting UUIDs and old rows still have NULL. If anything is wrong, you roll back the column add (or just leave it; an unused column is harmless).&lt;/p&gt;

&lt;p&gt;This step is important because it lets you backfill the new column for existing rows during the next quiet window without time pressure. The Wikipedia overview of &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;idempotence&lt;/a&gt; covers the property you are ultimately enforcing, but the deployment work is what makes the enforcement reliable in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Backfill UUIDs for Existing Rows
&lt;/h2&gt;

&lt;p&gt;For rows that already exist with NULL UUIDs, generate UUIDs and populate the column. This is a one-time backfill, usually run as a batch job.&lt;/p&gt;

&lt;p&gt;The UUIDs for backfilled rows do not need to be deterministic, since these rows correspond to events that have already been processed. The only requirement is that future code can rely on every row having a non-NULL UUID.&lt;/p&gt;

&lt;p&gt;The backfill is usually fast (a single UPDATE statement for small tables, a chunked job for large ones). Verify that no rows remain with NULL UUIDs before moving to the next step. Wikipedia's overview of &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load" rel="noopener noreferrer"&gt;extract, transform, load&lt;/a&gt; processes covers the broader pattern of backfilling state without disrupting live processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Modify the Sender to Include the UUID
&lt;/h2&gt;

&lt;p&gt;The sender now reads the UUID from the row and includes it in the outgoing message (header or body, depending on the message format).&lt;/p&gt;

&lt;p&gt;The receiver still does not use the UUID. It simply receives it and ignores it. This step is also behaviorally a no-op, but it gets the data flowing through the pipeline so that downstream changes can rely on it being present.&lt;/p&gt;

&lt;p&gt;You verify by tailing the receiver's request log and confirming that every incoming request includes a UUID.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Add Receiver Dedup Without Enforcing It
&lt;/h2&gt;

&lt;p&gt;The receiver now records every incoming UUID in a dedup table but does not yet refuse duplicates. If a duplicate UUID arrives, the receiver logs the event but processes the request as normal.&lt;/p&gt;

&lt;p&gt;This step is critical. It lets you observe what the actual duplicate rate looks like in production before flipping the enforcement switch. The expected hit rate (duplicate UUIDs as a share of all incoming requests) is between 0.1 and 1 percent. Wikipedia's entry on the &lt;a href="https://en.wikipedia.org/wiki/Saga_pattern" rel="noopener noreferrer"&gt;saga pattern&lt;/a&gt; and writing from practitioners like Martin Fowler at &lt;a href="https://martinfowler.com" rel="noopener noreferrer"&gt;martinfowler.com&lt;/a&gt; both reinforce why observation before enforcement is the safer sequencing.&lt;/p&gt;

&lt;p&gt;Two outcomes are possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The hit rate is in the expected range. Move to step 5.&lt;/li&gt;
&lt;li&gt;The hit rate is 0 percent or 10+ percent. There is a problem worth understanding before enforcement. A rate of 0 percent usually means the keys are not stable across retries, so investigate sender behavior. A rate of 10 percent or more usually means the keys are colliding across distinct operations, so investigate the key generation logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spend at least one full week in this step before proceeding. Production traffic patterns vary day to day, and a week of data is the minimum to be confident the hit rate is representative.&lt;/p&gt;
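&lt;p&gt;The triage rule above can be sketched as a small helper that runs against the counts collected during the observation week. This is a minimal sketch, not a prescribed implementation: &lt;code&gt;classifyHitRate&lt;/code&gt; and its return labels are hypothetical names, and the "keep-observing" bucket for rates that fall between the two ranges is an assumption, since the guide only names the expected range and the two problem cases.&lt;/p&gt;

```javascript
// Hypothetical triage helper for the Step 4 observation window.
// Thresholds come from the guide; the "keep-observing" bucket is an assumption.
function classifyHitRate(duplicates, total) {
  if (total === 0) return "no-traffic";
  const percent = (duplicates / total) * 100;
  // 0 percent: keys are probably not stable across retries.
  if (percent === 0) return "investigate-sender";
  // 10 percent or more: keys are probably colliding across distinct operations.
  if (percent >= 10) return "investigate-key-generation";
  // Above the expected range but below the alarm threshold.
  if (percent > 1) return "keep-observing";
  // Inside the expected 0.1 to 1 percent range.
  if (percent >= 0.1) return "proceed-to-step-5";
  // Non-zero but below the expected range.
  return "keep-observing";
}
```

&lt;p&gt;For example, 5 duplicates out of 1,000 requests is 0.5 percent, which lands in the expected range.&lt;/p&gt;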

&lt;h2&gt;
  
  
  Step 5: Flip the Enforcement Switch
&lt;/h2&gt;

&lt;p&gt;Once you are confident that the dedup mechanism is recognizing duplicates correctly, change the receiver to actually return the original response (without re-applying side effects) on duplicate UUIDs.&lt;/p&gt;

&lt;p&gt;This is the moment when the integration becomes truly idempotent. From this deployment forward, retries are safe to perform.&lt;/p&gt;

&lt;p&gt;The change is small and behaviorally observable: duplicate-side-effect incidents should drop to zero from this point on. The dedup hit rate stays roughly stable; the difference is that the hits now actually prevent the duplicate work.&lt;/p&gt;
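&lt;p&gt;Steps 4 and 5 differ only in what the receiver does when it sees a UUID it has already recorded. The sketch below illustrates that difference with a single &lt;code&gt;enforce&lt;/code&gt; flag. It makes several assumptions: the dedup table is an in-memory &lt;code&gt;Map&lt;/code&gt; rather than a durable store, and &lt;code&gt;createReceiver&lt;/code&gt;, &lt;code&gt;applySideEffect&lt;/code&gt;, and the message shape are hypothetical names chosen for illustration.&lt;/p&gt;

```javascript
// In-memory stand-in for the receiver's dedup table (hypothetical; a real
// receiver would use a durable store with a retention window).
function createReceiver(applySideEffect, options) {
  const seen = new Map(); // uuid mapped to the response returned the first time
  const stats = { total: 0, duplicates: 0 };

  function handle(message) {
    stats.total += 1;
    if (seen.has(message.uuid)) {
      stats.duplicates += 1;
      console.log("duplicate uuid observed:", message.uuid);
      if (options.enforce) {
        // Step 5: replay the stored response without re-applying side effects.
        return seen.get(message.uuid);
      }
      // Step 4: log the hit, then process the request as normal.
      return applySideEffect(message);
    }
    // First time this UUID is seen: apply the side effect and remember the response.
    const response = applySideEffect(message);
    seen.set(message.uuid, response);
    return response;
  }

  return { handle, stats };
}
```

&lt;p&gt;Because the enforcing branch returns the stored response object, a duplicate gets the same result as the original request, and &lt;code&gt;stats.duplicates / stats.total&lt;/code&gt; gives the hit rate in either mode.&lt;/p&gt;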

&lt;h2&gt;
  
  
  Step 6: Tighten Up the Sender Retry Logic
&lt;/h2&gt;

&lt;p&gt;Once enforcement is on, the sender can be more aggressive about retries because the receiver is now safe under duplicates. Increase the retry count, shorten the backoff interval, or both, depending on the operational characteristics you want.&lt;/p&gt;

&lt;p&gt;This is also the step where you switch the sender's retry logic from "create new row" to "update existing row" if it was not already. With enforcement on at the receiver, a new row would just produce a different UUID and a fresh side effect. The retry has to reuse the existing row's UUID.&lt;/p&gt;
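&lt;p&gt;The retry rule above, sketched under simplifying assumptions: delivery is synchronous, there is no backoff, and &lt;code&gt;sendWithRetry&lt;/code&gt;, &lt;code&gt;row&lt;/code&gt;, and &lt;code&gt;deliver&lt;/code&gt; are hypothetical names standing in for whatever the integration already uses. The point it illustrates is that every attempt reads the UUID from the existing row.&lt;/p&gt;

```javascript
// Hypothetical sender: `row` is one record from the outgoing event table and
// `deliver` is the transport call the integration already has.
function sendWithRetry(row, deliver, maxAttempts) {
  let attemptsLeft = maxAttempts;
  let lastError = new Error("no delivery attempts were made");
  while (attemptsLeft > 0) {
    attemptsLeft -= 1;
    try {
      // Every attempt carries the UUID stored on the row: no new row, no new UUID.
      return deliver({ uuid: row.uuid, payload: row.payload });
    } catch (err) {
      lastError = err;
    }
  }
  // All attempts failed; surface the last error to the caller.
  throw lastError;
}
```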

&lt;p&gt;After this step, the integration is fully retrofitted. You should observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A non-zero, roughly stable dedup hit rate on the receiver (proof that retries are working).&lt;/li&gt;
&lt;li&gt;Zero new duplicate-side-effect incidents in the receiver's data.&lt;/li&gt;
&lt;li&gt;Faster recovery from broker outages or network instability, because the sender can retry more aggressively without risking corruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to Verify Before Each Step
&lt;/h2&gt;

&lt;p&gt;A few specific checks save a lot of grief:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before step 1:&lt;/strong&gt; Confirm you have permission to alter the outgoing events table schema. Some legacy integrations have this table managed by a separate team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before step 3:&lt;/strong&gt; Confirm that the message format allows adding a new field without breaking existing consumers. For some legacy formats (fixed-length records, protocol buffer consumers with strict schema validation), this requires a separate compat-layer change first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before step 4:&lt;/strong&gt; Confirm you have a place to put the receiver's dedup table and have provisioned enough storage for at least the retention window times the expected message rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before step 5:&lt;/strong&gt; Confirm the receiver's response cache returns identical results for duplicate UUIDs. If the response is regenerated rather than replayed from the cache and includes a timestamp or other "now" field, duplicate responses can differ from each other even though the side effect was skipped, which can confuse downstream consumers that check response equality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Retrofit Failure Modes
&lt;/h2&gt;

&lt;p&gt;A few patterns that reliably cause problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping the observation phase.&lt;/strong&gt; Going from "no UUIDs" to "enforce on receiver" in two deploys means you find out about key generation bugs only after enforcement turns them into visible errors. The observation phase exists to find these bugs in a safe window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating UUIDs at send time instead of event time.&lt;/strong&gt; The retrofit hooks the UUID into the wrong place, and the dedup hit rate stays at zero even after step 4. The fix is to move generation upstream to the source-of-truth event. The &lt;a href="https://137foundry.com/services/data-integration" rel="noopener noreferrer"&gt;137Foundry data integration practice&lt;/a&gt; treats event-time generation as the default for any new integration we ship, because the retrofit cost when this is wrong is meaningfully larger than the design cost to get it right from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backfilling UUIDs for already-processed rows in a way that collides with future generation.&lt;/strong&gt; Make sure the backfill uses a UUID scheme that cannot accidentally produce the same value as future event-time generation. Standard random UUIDs handle this; sequential or time-based schemes can collide if not handled carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting to clean up the dedup table.&lt;/strong&gt; Without TTL or partitioning, the dedup table grows unbounded and eventually becomes the bottleneck. Add cleanup as part of step 4, not as a follow-up.&lt;/p&gt;
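&lt;p&gt;A minimal sketch of TTL-style cleanup, assuming an in-memory &lt;code&gt;Map&lt;/code&gt; from UUID to first-seen time in milliseconds. The function name &lt;code&gt;purgeExpired&lt;/code&gt; and the store shape are hypothetical; with a real database the same idea is usually a scheduled delete or a dropped partition.&lt;/p&gt;

```javascript
// Hypothetical cleanup for an in-memory dedup store keyed by UUID.
// `seenAt` maps each uuid to the time (ms since epoch) it was first recorded.
function purgeExpired(seenAt, retentionMs, nowMs) {
  let removed = 0;
  for (const [uuid, recordedAt] of seenAt) {
    // Entries older than the retention window can no longer match a retry.
    if (nowMs - recordedAt > retentionMs) {
      seenAt.delete(uuid);
      removed += 1;
    }
  }
  return removed;
}
```

&lt;p&gt;The retention window should stay longer than the longest retry window the sender can produce, otherwise a late retry would pass as a new request.&lt;/p&gt;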

&lt;h2&gt;
  
  
  Why the Sequencing Matters
&lt;/h2&gt;

&lt;p&gt;The six-step sequence works because each step is independently shippable and observably safe. Any single step can be rolled back without affecting the others. The pipeline can sit indefinitely in step 4 (recording but not enforcing) if the team needs more time to validate.&lt;/p&gt;

&lt;p&gt;The alternative (a big-bang retrofit that adds UUIDs, enforces dedup, and changes retry behavior in one deploy) is much more dangerous because the failure modes interact in ways that are hard to debug in production.&lt;/p&gt;

&lt;p&gt;For more depth on the underlying patterns, the &lt;a href="https://137foundry.com/articles/how-to-handle-idempotency-data-integration-retries" rel="noopener noreferrer"&gt;longer reference on how to handle idempotency in data integration pipelines&lt;/a&gt; covers the design patterns the retrofit is aiming to deliver, and the &lt;a href="https://137foundry.com/services" rel="noopener noreferrer"&gt;137Foundry services overview&lt;/a&gt; covers how this work fits into broader integration engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Framing
&lt;/h2&gt;

&lt;p&gt;A retrofit done right is invisible to upstream and downstream consumers. The pipeline just becomes more resilient over the course of a few weeks, without any breaking changes or visible behavior shifts.&lt;/p&gt;

&lt;p&gt;A retrofit done in a rush usually breaks something in the middle and produces a postmortem that lasts longer than the deployment itself. The slow path is the fast path.&lt;/p&gt;

&lt;p&gt;The six-step sequence is not the only valid approach, but it has worked across enough production retrofits to recommend as a default. Variations on timing or step ordering are fine. Skipping the observation phase is not.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>api</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Five Patterns for Making Data Integration Operations Safe to Retry</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Sun, 21 Jun 2026 10:06:57 +0000</pubDate>
      <link>https://dev.to/137foundry/five-patterns-for-making-data-integration-operations-safe-to-retry-58bm</link>
      <guid>https://dev.to/137foundry/five-patterns-for-making-data-integration-operations-safe-to-retry-58bm</guid>
      <description>&lt;p&gt;Every data integration pipeline has to handle retries, because every network boundary eventually produces a duplicate delivery. The patterns below are the five that show up most often in production-grade integration code, each with a clear use case and a clear set of trade-offs. The right choice depends on the operation shape and the cost structure of the integration.&lt;/p&gt;

&lt;p&gt;This is a roundup of the patterns, with notes on when each one fits best and when to reach for a different tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8kx4ew2uxdbede5b2jjm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8kx4ew2uxdbede5b2jjm.jpeg" alt="An aerial view of a shipping container yard with stacked containers" width="799" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Tom Fisk on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Idempotency Key Pattern
&lt;/h2&gt;

&lt;p&gt;The most common and most general pattern. The sender generates a stable UUID for each logical operation, includes it with every delivery attempt, and the receiver records processed UUIDs to dedupe duplicates.&lt;/p&gt;

&lt;p&gt;Use case: any operation that produces side effects (creates, charges, sends, allocates) and needs to be safe under retries.&lt;/p&gt;

&lt;p&gt;Trade-off: requires a dedup store at the receiver with a retention window longer than the maximum retry window. Storage cost scales with message volume.&lt;/p&gt;

&lt;p&gt;The Wikipedia overview of &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;idempotence&lt;/a&gt; covers the underlying property. The canonical implementation in production code is Stripe's "Idempotency-Key" header, which is the model most homegrown implementations follow.&lt;/p&gt;

&lt;p&gt;The key implementation detail: generate the UUID at the source-of-truth event, not at send time. A UUID generated inside the sender function is unique per call, not unique per logical operation, which defeats the entire pattern.&lt;/p&gt;
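&lt;p&gt;The difference between event-time and send-time generation is easiest to see side by side. The sketch below assumes Node.js (it uses &lt;code&gt;randomUUID&lt;/code&gt; from the built-in &lt;code&gt;node:crypto&lt;/code&gt; module) and borrows the &lt;code&gt;Idempotency-Key&lt;/code&gt; header name mentioned above; the function names and the event shape are hypothetical.&lt;/p&gt;

```javascript
const { randomUUID } = require("node:crypto");

// Event time: the key is created once, when the logical operation is recorded.
function recordEvent(payload) {
  return { idempotencyKey: randomUUID(), payload: payload };
}

// Each delivery attempt reads the stored key instead of generating one.
function buildRequest(event) {
  return {
    headers: { "Idempotency-Key": event.idempotencyKey },
    body: event.payload,
  };
}

// Anti-pattern, shown for contrast: a fresh key per call is unique per
// attempt, not per logical operation, so the receiver never sees a match.
function buildRequestAtSendTime(event) {
  return {
    headers: { "Idempotency-Key": randomUUID() },
    body: event.payload,
  };
}
```

&lt;p&gt;Calling &lt;code&gt;buildRequest&lt;/code&gt; twice on the same event yields the same header value, while &lt;code&gt;buildRequestAtSendTime&lt;/code&gt; yields a different value on every call.&lt;/p&gt;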

&lt;h2&gt;
  
  
  2. The Absolute-State Pattern
&lt;/h2&gt;

&lt;p&gt;Convert operations from relative changes ("increment count by 3") to absolute state ("set count to 7"). Absolute-state operations are naturally idempotent because re-applying produces the same result.&lt;/p&gt;
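&lt;p&gt;A small illustration of why the conversion matters, using the count example above. The function names and record shape are hypothetical; the sketch only shows what happens when the same message is applied twice.&lt;/p&gt;

```javascript
// Relative change: applying the same message twice changes the result.
function applyIncrement(record, message) {
  return { ...record, count: record.count + message.delta };
}

// Absolute state: applying the same message twice leaves the same result.
function applySet(record, message) {
  return { ...record, count: message.count };
}

const start = { count: 4 };

// A duplicated "increment by 3" overshoots: 4, then 7, then 10.
const afterDuplicateIncrement = applyIncrement(applyIncrement(start, { delta: 3 }), { delta: 3 });

// A duplicated "set to 7" is harmless: 7, then 7 again.
const afterDuplicateSet = applySet(applySet(start, { count: 7 }), { count: 7 });
```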

&lt;p&gt;Use case: operations where you can express the desired end state as a value rather than a delta. Inventory levels, status flags, role assignments, configuration values.&lt;/p&gt;

&lt;p&gt;Trade-off: messages carry more state, which costs bandwidth and may require sending fields that have not changed. For 50-field records where one field changed, sending all 50 is wasteful.&lt;/p&gt;

&lt;p&gt;The Wikipedia overview of &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load" rel="noopener noreferrer"&gt;extract, transform, load&lt;/a&gt; processes covers the broader pattern of moving state to a target system in a retry-safe way. Absolute state is the easiest mental model for ETL because each batch overwrites the previous state.&lt;/p&gt;

&lt;p&gt;The pattern works best for low-frequency state synchronization (nightly inventory sync) and worst for high-frequency event-stream integrations where bandwidth matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The Optimistic Concurrency / Version Check Pattern
&lt;/h2&gt;

&lt;p&gt;The entity being modified carries a version number. Each write includes the expected current version, and the receiver only applies the write if the version matches.&lt;/p&gt;

&lt;p&gt;A retry that arrives after the original was applied finds a higher version on the entity and is harmlessly rejected. The sender treats the rejection as a successful retry resolution (the operation has already been applied).&lt;/p&gt;
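&lt;p&gt;The version check can be sketched as follows. This is an illustration under stated assumptions: the store is an in-memory &lt;code&gt;Map&lt;/code&gt; that already contains the entity, the names are hypothetical, and a real receiver would perform the comparison and the write atomically (for example in a single conditional UPDATE).&lt;/p&gt;

```javascript
// Hypothetical in-memory entity store; a real receiver would do the compare
// and the write in one atomic statement or transaction.
function applyVersionedWrite(store, write) {
  const entity = store.get(write.id);
  if (entity.version !== write.expectedVersion) {
    // The original attempt (or another writer) already advanced the version.
    return { applied: false, currentVersion: entity.version };
  }
  const updated = {
    version: entity.version + 1,
    count: entity.count + write.delta,
  };
  store.set(write.id, updated);
  return { applied: true, currentVersion: updated.version };
}
```

&lt;p&gt;Sending the same write twice applies the delta once: the second attempt still expects the old version and is rejected. A rejection can also mean another writer got there first, so a sender that needs to tell the two apart has to re-read the entity.&lt;/p&gt;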

&lt;p&gt;Use case: relative operations on entities that can carry version metadata. The Wikipedia entry on &lt;a href="https://en.wikipedia.org/wiki/Database_transaction" rel="noopener noreferrer"&gt;database transactions&lt;/a&gt; covers the broader concurrency model this pattern fits into, and standard databases like &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; support row-level version semantics via xmin or explicit version columns.&lt;/p&gt;

&lt;p&gt;Trade-off: requires a read to find the current version before each write, which roughly doubles the number of round trips unless the sender already holds the current version. For high-throughput integrations where reads are expensive, this can become the bottleneck.&lt;/p&gt;

&lt;p&gt;The pattern is more common in OLTP-style integrations (updating user records, modifying inventory) than in event-stream integrations (replicating a changelog), because OLTP already has version metadata on most entities.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Saga Pattern for Multi-Step Operations
&lt;/h2&gt;

&lt;p&gt;For operations that span multiple receivers and cannot be wrapped in a single transaction, the saga pattern uses local idempotent operations at each step with compensating actions to handle partial failures.&lt;/p&gt;

&lt;p&gt;Each step in the saga is independently retry-safe via one of the other patterns (idempotency key, absolute state, version check). The saga adds a layer above the individual steps: if a later step fails, the orchestrator triggers compensating actions on the earlier steps to roll back.&lt;/p&gt;
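&lt;p&gt;The orchestration layer described above can be sketched in a few lines. This is deliberately minimal and leaves out things a production saga needs, notably persisting the saga state so it can resume after a crash. The step shape (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;compensate&lt;/code&gt;) and the &lt;code&gt;runSaga&lt;/code&gt; function are hypothetical, and both callbacks are assumed to be idempotent.&lt;/p&gt;

```javascript
// Minimal saga runner sketch. Each step is a hypothetical object with
// `name`, `run`, and `compensate`; both functions are assumed idempotent.
function runSaga(steps, context) {
  const completed = [];
  for (const step of steps) {
    try {
      step.run(context);
      completed.push(step);
    } catch (err) {
      // Undo the steps that already succeeded, most recent first.
      for (const done of completed.reverse()) {
        done.compensate(context);
      }
      return { ok: false, failedStep: step.name, error: err.message };
    }
  }
  return { ok: true };
}
```

&lt;p&gt;Only steps that completed are compensated; the step that failed is assumed to have left nothing behind, which is itself a design decision each step has to honor.&lt;/p&gt;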

&lt;p&gt;Use case: any integration that touches more than one receiver and where partial application has to be cleaned up. Common in workflow orchestration, order processing across multiple services, multi-region replication.&lt;/p&gt;

&lt;p&gt;Trade-off: significantly more design work than single-step patterns. Compensating actions are real code that has to be written and tested. The saga state machine has to be persisted and resumable.&lt;/p&gt;

&lt;p&gt;Wikipedia's overview of the &lt;a href="https://en.wikipedia.org/wiki/Saga_pattern" rel="noopener noreferrer"&gt;saga pattern&lt;/a&gt; covers the trade-offs against the simpler but more fragile &lt;a href="https://en.wikipedia.org/wiki/Two-phase_commit_protocol" rel="noopener noreferrer"&gt;two-phase commit protocol&lt;/a&gt;. Writing from practitioners like Martin Fowler at &lt;a href="https://martinfowler.com" rel="noopener noreferrer"&gt;martinfowler.com&lt;/a&gt; covers the broader pattern in the context of microservice architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Broker-Level Dedup Pattern
&lt;/h2&gt;

&lt;p&gt;Some message brokers (notably &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; with idempotent producer mode) handle a subset of the dedup problem at the transport layer. The producer attaches a sequence number to each message, and the broker rejects out-of-order or duplicate sequence numbers from the same producer.&lt;/p&gt;

&lt;p&gt;Use case: producer-to-broker duplicate prevention. Useful as a foundation layer, not a complete solution.&lt;/p&gt;

&lt;p&gt;Trade-off: only covers the producer-to-broker hop. Consumer-side duplicates (from rebalances or consumer crashes after processing but before offset commit) still happen and still need application-level idempotency.&lt;/p&gt;

&lt;p&gt;The right framing is "necessary but not sufficient." Broker-level dedup eliminates one class of duplicate and reduces the volume the application-level dedup has to handle, but the consumer side of the integration still needs one of the other patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;A rough decision flow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State-setting operation with infrequent updates and small messages:&lt;/strong&gt; Absolute-state. Simplest pattern, naturally idempotent, no dedup store needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Side-effect operation (create, charge, send) with high volume:&lt;/strong&gt; Idempotency key with event-time generation. Most general, requires receiver dedup store but scales well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relative operation on an entity that already has version metadata:&lt;/strong&gt; Optimistic concurrency with version checks. Cleaner than idempotency keys when version is already present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-step operation across multiple receivers:&lt;/strong&gt; Saga with idempotent steps. Required when partial failure cleanup is necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer-to-broker messaging with consumer crash risk:&lt;/strong&gt; Broker-level dedup plus application-level idempotency at the consumer. Layered defense.&lt;/p&gt;

&lt;p&gt;Most production integrations use a mix. A typical pipeline might use absolute-state for inventory syncs, idempotency keys for outgoing webhooks, version checks for updates to user records, and broker-level dedup as a foundation underneath all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes Across Patterns
&lt;/h2&gt;

&lt;p&gt;A few mistakes that reliably cause problems regardless of which idempotency pattern is in use:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating idempotency keys at send time.&lt;/strong&gt; The key has to be stable across all retries of the same logical operation, which means generating it at the source-of-truth event, not at delivery attempt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unbounded dedup stores.&lt;/strong&gt; Without TTL or partitioning, the dedup table grows until it becomes the bottleneck. Add cleanup as a baseline requirement, not as a follow-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating partial-success responses as success.&lt;/strong&gt; A receiver that returns 200 OK after the side effect but before recording the dedup key opens a window where retries duplicate the side effect. The dedup record has to be written atomically with the side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping the observation phase during retrofit.&lt;/strong&gt; Adding idempotency to an existing integration in one big-bang deploy is much riskier than the multi-step sequence of "record but do not enforce" then "enforce after observation."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Pattern Choice Matters
&lt;/h2&gt;

&lt;p&gt;The choice of pattern shapes the cost structure of the integration over its lifetime. Idempotency keys with a well-managed dedup store have low ongoing cost but require disciplined design at the start. Absolute-state has minimal ongoing cost but constrains the message format. Sagas have high design cost but produce the cleanest behavior for multi-step operations.&lt;/p&gt;

&lt;p&gt;The wrong pattern usually shows up as either ongoing operational pain (cleaning up dedup tables that grew too large) or design pain (trying to add a saga compensation flow after the integration is in production). The right pattern shows up as the absence of incidents.&lt;/p&gt;

&lt;p&gt;For more depth on these patterns and the specific design choices that make each one production-grade, &lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;https://137foundry.com&lt;/a&gt; covers the broader engineering practice these patterns plug into. The &lt;a href="https://137foundry.com/services" rel="noopener noreferrer"&gt;137Foundry services overview&lt;/a&gt; covers how integration design fits with the rest of the platform work, and the &lt;a href="https://137foundry.com/articles/how-to-handle-idempotency-data-integration-retries" rel="noopener noreferrer"&gt;longer reference on idempotency in data integration pipelines&lt;/a&gt; walks through the unified theory that ties the five patterns together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Take
&lt;/h2&gt;

&lt;p&gt;There is no single right pattern. There are five well-understood ones, each with a clear fit and a clear cost. The integrations that survive in production are the ones where the pattern was chosen deliberately for the operation shape, not the ones where the team picked one pattern and tried to make every operation fit it.&lt;/p&gt;

&lt;p&gt;The five patterns above are the toolkit. Picking the right one is the engineering work.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>api</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why the Empty States in a Data Table Deserve Three Separate Designs Instead of One Generic Message</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Sat, 20 Jun 2026 10:04:27 +0000</pubDate>
      <link>https://dev.to/137foundry/why-the-empty-states-in-a-data-table-deserve-three-separate-designs-instead-of-one-generic-message-1il4</link>
      <guid>https://dev.to/137foundry/why-the-empty-states-in-a-data-table-deserve-three-separate-designs-instead-of-one-generic-message-1il4</guid>
      <description>&lt;p&gt;Most data tables ship with a single empty state: a centered "No data" message, sometimes with a small illustration, sometimes with a "create new" button. It looks fine in design review. It produces support tickets in production.&lt;/p&gt;

&lt;p&gt;The problem is that "the table is empty" can mean three completely different things, and the user's right next action is different in each case. A single generic message hides that distinction and leaves the user guessing.&lt;/p&gt;

&lt;p&gt;The three cases are: first-use (no data exists yet), filtered-empty (data exists but the current filters exclude everything), and load-failure (the request to fetch data did not succeed). Each one needs its own copy, its own actionable element, and its own visual treatment. Designing them all the same is one of the more common UX failures in data-heavy tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 1: First-use empty
&lt;/h2&gt;

&lt;p&gt;The user has just landed on a screen where no data exists yet, because no one on the team has created anything that would appear here.&lt;/p&gt;

&lt;p&gt;The user's question: "What goes here, and how do I make it?"&lt;/p&gt;

&lt;p&gt;The right response: an explanation of what would appear in the table, plus a primary call to action for creating the first item, plus optional secondary actions for related onboarding paths.&lt;/p&gt;

&lt;p&gt;A good first-use empty state for a "Customer accounts" table:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No customer accounts yet.&lt;/p&gt;

&lt;p&gt;Customer accounts let you track your client relationships and link them to invoices and notes.&lt;/p&gt;

&lt;p&gt;&lt;a href="#"&gt;Create your first account&lt;/a&gt;&lt;br&gt;
Or &lt;a href="#"&gt;import from CSV&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The copy explains the purpose of the table in one sentence. The primary action is the most common path forward. The secondary action acknowledges an alternative path (importing from another system).&lt;/p&gt;

&lt;p&gt;The mistake to avoid: a generic "No data" message in the first-use case. The user does not know what data should be there or how to create it. The message is a dead end.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.nngroup.com" rel="noopener noreferrer"&gt;Nielsen Norman Group&lt;/a&gt; has consistently reported that first-use empty states are one of the highest-leverage moments in onboarding new users to a tool: a good empty state can compress the time-to-first-value from days to minutes. A bad one extends it indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 2: Filtered-empty
&lt;/h2&gt;

&lt;p&gt;Data exists in the table, but the user has applied filters that exclude all of it. The "no results matching" case.&lt;/p&gt;

&lt;p&gt;The user's question: "Why is nothing showing? Did the tool break?"&lt;/p&gt;

&lt;p&gt;The right response: an explicit message that filters are active and excluding all rows, plus an action to clear or modify the filters.&lt;/p&gt;

&lt;p&gt;A good filtered-empty state:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No accounts match the current filters.&lt;/p&gt;

&lt;p&gt;You're filtering for: Status = Closed AND Created after 2025-01-01.&lt;/p&gt;

&lt;p&gt;&lt;a href="#"&gt;Clear all filters&lt;/a&gt;&lt;br&gt;
Or &lt;a href="#"&gt;Edit filters&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The copy names the filters that are active. Forgotten filters are the most common cause of "why is the table empty" support tickets, and surfacing them in the empty state eliminates most of those tickets.&lt;/p&gt;

&lt;p&gt;The mistake to avoid: a generic "No data" message that does not mention filters. The user has forgotten which filters they applied (or has navigated back to a screen that preserved the filters from earlier) and assumes the data is gone. They click around, they file a ticket, they have a bad experience that was entirely preventable.&lt;/p&gt;

&lt;p&gt;A subtler version of this failure: showing the filter chips above the empty state but not in the empty state itself. The filter chips become visual noise above an empty area; the user is staring at the empty area, not at the chips. The empty state needs to surface the filter status itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 3: Load-failure
&lt;/h2&gt;

&lt;p&gt;The fetch to load data failed. Typical causes: the server returned an error, the network is down, the auth token expired, or the API rate-limited the user.&lt;/p&gt;

&lt;p&gt;The user's question: "Is this temporary? Should I try again or do I need to do something?"&lt;/p&gt;

&lt;p&gt;The right response: an explicit message that loading failed, plus context about the failure (when it happened, what error code, how to report it), plus a retry action.&lt;/p&gt;

&lt;p&gt;A good load-failure empty state:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Could not load accounts.&lt;/p&gt;

&lt;p&gt;Last attempt: 2 minutes ago. The server returned a 503 error, which usually means the service is temporarily unavailable.&lt;/p&gt;

&lt;p&gt;&lt;a href="#"&gt;Retry&lt;/a&gt;&lt;br&gt;
If the problem persists, contact support and reference error code REF-7281.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The copy distinguishes the failure mode (503 versus 401 versus network error versus timeout) when possible. The user can self-diagnose whether to retry or to act on the underlying cause (re-authenticate, contact support, wait).&lt;/p&gt;

&lt;p&gt;The mistake to avoid: showing the same "No data" message for load-failure as for the other cases. The user assumes the data is empty, makes wrong decisions based on that assumption, and may discover hours later that the load was failing silently.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.w3.org/WAI/standards-guidelines/wcag/" rel="noopener noreferrer"&gt;Web Content Accessibility Guidelines&lt;/a&gt; cover error message standards that apply to load-failure cases: the message should be perceivable, the cause should be identifiable, and the user should be able to recover. Generic messages fail all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why teams design only one empty state
&lt;/h2&gt;

&lt;p&gt;A few patterns I have observed repeatedly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty states get designed late.&lt;/strong&gt; The table is built, data flows in during development, and no one notices that the empty case has not been designed until QA or production. By then, the team is close to shipping and reaches for a generic placeholder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty states get designed during design review, when the table has demo data.&lt;/strong&gt; The designer adds an empty state, the team reviews it in the context of the demo data, and the three different empty cases never come up because the data is there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty states are seen as edge cases.&lt;/strong&gt; The team thinks empty states are uncommon and not worth the polish. They are wrong: filtered-empty happens routinely in production, load-failure happens whenever the network has a hiccup, first-use happens for every new user the tool ever onboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty states are conflated with placeholder states.&lt;/strong&gt; A "skeleton loader" while data loads is different from an empty state after data has loaded. Teams sometimes design only the skeleton and forget that the post-load empty case still needs its own treatment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What good empty state coverage looks like
&lt;/h2&gt;

&lt;p&gt;A practical checklist for empty state design in a data-heavy tool:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;First-use empty: explains the table's purpose, primary CTA to create the first item, secondary CTA for alternative paths (import, link existing data).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filtered-empty: names the active filters in the empty message, primary CTA to clear filters, secondary CTA to edit filters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load-failure: distinguishes the failure type when possible, includes a timestamp, primary CTA to retry, optional secondary CTA to contact support with an error reference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Partial-load: when some rows loaded but the request was incomplete or paginated and the next page failed. Shows what loaded plus a load-failure message for the missing portion. This is a sub-case of load-failure but worth designing separately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Permission-denied: when the user is authenticated but does not have access to view the data. Different from load-failure: the right action is to request access, not to retry. This is sometimes overlooked.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pattern at &lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;137Foundry&lt;/a&gt; for clients building data-heavy interfaces is to surface all five cases in design review explicitly, before the table goes to QA. The cost of designing them separately is small; the cost of shipping a generic placeholder is paid forever in support volume. Designers and developers benefit from looking at all five together because the cases share visual treatment but differ in copy and action.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical implementation note
&lt;/h2&gt;

&lt;p&gt;The three (or five) empty states are usually conditional renders in the table component. A common implementation pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;TableBody&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;refetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;clearFilters&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;SkeletonLoader&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;LoadFailureState&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;onRetry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;refetch&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;hasActiveFilters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;FilteredEmptyState&lt;/span&gt; &lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;onClear&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;clearFilters&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;FirstUseEmptyState&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DataRows&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conditional structure makes it clear which case is which, and prevents the "fall through to generic empty" failure that happens when there is only one empty state component for all cases.&lt;/p&gt;
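&lt;p&gt;The same ordering extends to the five-case checklist. What follows is a minimal sketch rather than a definitive implementation: it assumes the data layer exposes a &lt;code&gt;permissionDenied&lt;/code&gt; flag, keeps already-loaded rows when a later page fails, and removes inactive filters from the &lt;code&gt;filters&lt;/code&gt; object. The function name &lt;code&gt;resolveTableState&lt;/code&gt; and the state labels are hypothetical, and the logic is kept free of JSX so it can be unit-tested on its own.&lt;/p&gt;

```javascript
// Hypothetical helper: decides which table state applies.
// Every field name on `props` is an assumption made for illustration.
function resolveTableState(props) {
  const rows = props.data || [];
  const filters = props.filters || {};
  // Assumes inactive filters are removed from the object entirely.
  const hasActiveFilters = Object.keys(filters).length > 0;

  if (props.loading) return 'loading';
  // Permission problems come before generic errors: retrying will not help.
  if (props.permissionDenied) return 'permission-denied';
  if (props.error) {
    // Some rows arrived before the failure: show them plus an error notice.
    return rows.length > 0 ? 'partial-load' : 'load-failure';
  }
  if (rows.length === 0) {
    return hasActiveFilters ? 'filtered-empty' : 'first-use-empty';
  }
  return 'rows';
}
```

&lt;p&gt;The component then becomes a lookup from the returned label to the matching empty state component, which keeps the case ordering in one place.&lt;/p&gt;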

&lt;p&gt;The &lt;a href="https://developer.mozilla.org" rel="noopener noreferrer"&gt;Mozilla Developer Network&lt;/a&gt; covers the underlying fetch and error-handling patterns that the load-failure case needs to detect specifically. Building empty states well is partly a design question and partly a question of having reliable error context to display, which depends on the data-fetching layer surfacing useful errors.&lt;/p&gt;
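&lt;p&gt;As a rough illustration of the error context the load-failure state needs, the sketch below classifies a failed request. It relies on documented &lt;code&gt;fetch&lt;/code&gt; behavior: the promise rejects with a &lt;code&gt;TypeError&lt;/code&gt; when the request cannot complete (for example, a network failure), while an HTTP error status still resolves, so the status has to be checked explicitly. The function names, the category labels, and the mapping of status codes to categories are assumptions made for this example.&lt;/p&gt;

```javascript
// Hypothetical classifier: maps a fetch outcome to a failure category.
// Pass { response } when fetch resolved, or { error } when it rejected.
function classifyLoadFailure(outcome) {
  if (outcome.error) {
    if (outcome.error.name === 'AbortError') return 'cancelled';
    // fetch() rejects with a TypeError when no response could be obtained.
    if (outcome.error.name === 'TypeError') return 'network';
    return 'unknown';
  }
  const status = outcome.response.status;
  if (status === 401 || status === 403) return 'permission-denied';
  if (status === 404) return 'not-found';
  if (status === 429) return 'rate-limited';
  if (status >= 500) return 'server';
  return outcome.response.ok ? 'none' : 'client';
}

// Shows where the classifier fits; the URL is supplied by the caller.
async function loadRows(url) {
  try {
    const response = await fetch(url);
    const failure = classifyLoadFailure({ response });
    if (failure !== 'none') {
      // The timestamp gives the empty state something concrete to display.
      return { rows: [], failure, at: new Date().toISOString() };
    }
    return { rows: await response.json(), failure: null, at: null };
  } catch (error) {
    return { rows: [], failure: classifyLoadFailure({ error }), at: new Date().toISOString() };
  }
}
```

&lt;p&gt;A &lt;code&gt;permission-denied&lt;/code&gt; result should route to the request-access state rather than the retry state, which is why it is separated from the other categories here.&lt;/p&gt;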

&lt;h2&gt;
  
  
  What the user experiences when this is done right
&lt;/h2&gt;

&lt;p&gt;A user encountering a first-use table sees a welcoming explanation of what goes there and a clear next action. They create the first item and the table populates.&lt;/p&gt;

&lt;p&gt;A user who filters to an empty state sees an explicit "filters are excluding everything" message and clears the filters. They never wonder if the tool is broken.&lt;/p&gt;

&lt;p&gt;A user who hits a load-failure sees a clear failure message with a retry button and an error reference. They retry, the load succeeds, they continue. If retries keep failing, they have a reference to give support.&lt;/p&gt;

&lt;p&gt;In all three cases, the time the user spends being confused is short, the next action is clear, and the support tickets do not get filed. That is what good empty state coverage actually produces.&lt;/p&gt;

&lt;p&gt;The longer treatment of &lt;a href="https://137foundry.com/articles/how-to-design-data-tables-that-stay-readable-as-data-scales" rel="noopener noreferrer"&gt;data table design at scale&lt;/a&gt; covers how empty states pair with the rest of the design decisions in a production data table.&lt;/p&gt;

</description>
      <category>ux</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Design Keyboard Navigation for a Data Grid So Power Users Stop Reaching for the Mouse</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Sat, 20 Jun 2026 10:04:27 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-design-keyboard-navigation-for-a-data-grid-so-power-users-stop-reaching-for-the-mouse-mh4</link>
      <guid>https://dev.to/137foundry/how-to-design-keyboard-navigation-for-a-data-grid-so-power-users-stop-reaching-for-the-mouse-mh4</guid>
      <description>&lt;p&gt;A well-designed data grid lets power users navigate, edit, and act on rows without ever touching the mouse. They tab into the grid, use arrow keys to move between cells, hit Enter to activate the row's primary action, hit Escape to cancel, and tab back out. Their hands stay on the home row. Their throughput is two to three times higher than a mouse-driven user on the same interface.&lt;/p&gt;

&lt;p&gt;A poorly-designed data grid forces those same users to mouse-click every action. Even with keyboard shortcuts that exist somewhere in the documentation, the in-grid navigation is broken enough that the keyboard path is slower than the mouse path. Power users learn to live with it but they hate the tool.&lt;/p&gt;

&lt;p&gt;The difference between these two outcomes is roughly a hundred lines of focus management code, plus deliberate design choices about which keys do what. Here is the design and what to watch for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The standard keyboard model
&lt;/h2&gt;

&lt;p&gt;A reasonable default for data grid keyboard navigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tab&lt;/strong&gt; moves focus into the grid from outside, and out of the grid to the next focusable element. Once in the grid, Tab moves between focusable controls within the active row (action buttons, edit controls). Tab does NOT move between cells in the grid; arrow keys do that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Arrow keys&lt;/strong&gt; move between cells. Up and Down between rows in the same column. Left and Right between columns in the same row. Holding Shift extends the selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enter&lt;/strong&gt; activates the primary action for the current row. Usually this is "open the row's detail view" or "edit this row inline." The primary action is the one a mouse user would trigger by clicking the row's identity column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Space&lt;/strong&gt; toggles the row's checkbox in tables with bulk-select. In single-select tables, Space is the same as Enter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Escape&lt;/strong&gt; cancels any in-progress edit, closes any open menu, or moves focus back to the parent element if no other escape action applies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Home / End&lt;/strong&gt; move to the first / last column in the current row. &lt;strong&gt;Ctrl+Home / Ctrl+End&lt;/strong&gt; move to the first / last cell in the grid.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Page Up / Page Down&lt;/strong&gt; move one viewport of rows up / down.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This model closely follows the grid pattern in the WAI-ARIA Authoring Practices, which screen reader users and keyboard-only users already know from other enterprise tools. Diverging from it means asking users to relearn the grid, which they will resist.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.w3.org/WAI/" rel="noopener noreferrer"&gt;W3C Web Accessibility Initiative&lt;/a&gt; maintains the authoring practices documentation, which is the canonical reference for the standard grid keyboard model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Tab versus arrow keys matters
&lt;/h2&gt;

&lt;p&gt;The most common keyboard-navigation failure in data grids: using Tab to move between cells.&lt;/p&gt;

&lt;p&gt;This makes intuitive sense (Tab moves focus through page elements), but breaks at scale. A grid with 7 columns means Tab cycles through 7 focusable elements per row. A viewport showing 20 rows means 140 Tab presses to traverse one screen of data. Compare that with arrow keys, where a single Down press moves to the next row regardless of column count.&lt;/p&gt;

&lt;p&gt;The standard model uses Tab for entering and exiting the grid (a single focusable composite control), and arrow keys for navigation within the grid. This is called "composite widget" focus management, and it is the right pattern for any grid-shaped control.&lt;/p&gt;

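&lt;p&gt;Implementing this usually means a roving &lt;code&gt;tabindex&lt;/code&gt;: exactly one cell (the active cell, initially the first cell) carries &lt;code&gt;tabindex="0"&lt;/code&gt; so that Tab enters the grid there, and every other cell carries &lt;code&gt;tabindex="-1"&lt;/code&gt; so it is programmatically focusable but not in the Tab order. When the active cell changes, the grid swaps the two values and moves focus to the new cell. The alternative is &lt;code&gt;tabindex="0"&lt;/code&gt; on the grid container with &lt;code&gt;aria-activedescendant&lt;/code&gt; pointing at the active cell. Pick one approach, because a focusable container combined with a focusable cell creates two Tab stops. Either way, the grid manages which cell is the current focus target.&lt;br&gt;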
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleKeyDown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ArrowDown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nextRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ArrowUp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prevRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ArrowRight&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nextColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ArrowLeft&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prevColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Enter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;activatePrimaryAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Home&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;firstColumnInRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;End&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;lastColumnInRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// etc.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;event.preventDefault()&lt;/code&gt; calls are important: without them, arrow keys would also scroll the page, which fights the in-grid navigation.&lt;/p&gt;
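&lt;p&gt;The handler above leaves the movement helpers and the &lt;code&gt;tabindex&lt;/code&gt; bookkeeping undefined. One possible shape for them is sketched below, assuming cells are addressed by zero-based row and column index. &lt;code&gt;moveCell&lt;/code&gt;, &lt;code&gt;setActiveCell&lt;/code&gt;, and the position object are hypothetical names; only &lt;code&gt;setAttribute&lt;/code&gt; and &lt;code&gt;focus&lt;/code&gt; are standard DOM methods.&lt;/p&gt;

```javascript
// Row and column deltas for each movement key.
const DELTAS = new Map([
  ['ArrowUp', [-1, 0]],
  ['ArrowDown', [1, 0]],
  ['ArrowLeft', [0, -1]],
  ['ArrowRight', [0, 1]],
]);

// Clamp a value into the inclusive range [min, max].
function clamp(value, min, max) {
  return Math.min(Math.max(value, min), max);
}

// Hypothetical position math: returns the new { row, col }, stopping at grid edges.
function moveCell(position, key, rowCount, colCount) {
  const delta = DELTAS.get(key);
  if (!delta) return position; // not a movement key
  return {
    row: clamp(position.row + delta[0], 0, rowCount - 1),
    col: clamp(position.col + delta[1], 0, colCount - 1),
  };
}

// Swaps tabindex so exactly one cell is in the Tab order, then moves focus.
function setActiveCell(previousElement, nextElement) {
  if (previousElement) previousElement.setAttribute('tabindex', '-1');
  nextElement.setAttribute('tabindex', '0');
  nextElement.focus();
  return nextElement;
}
```

&lt;p&gt;Clamping at the edges means a Down press on the last row is a no-op instead of wrapping or leaving the grid, which is the least surprising behavior for most users.&lt;/p&gt;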

&lt;h2&gt;
  
  
  Inline editing patterns
&lt;/h2&gt;

&lt;p&gt;Tables with inline editing need an additional keyboard model layered on top of navigation.&lt;/p&gt;

&lt;p&gt;The common pattern: a cell has two states, navigation (the default) and edit. Pressing Enter on a navigation-state cell switches to edit state. The edit state is a text input, dropdown, date picker, or other interactive control. Pressing Enter on an edit-state cell commits the change. Pressing Escape on an edit-state cell cancels and returns to navigation state.&lt;/p&gt;

&lt;p&gt;In edit state, arrow keys act within the editor (moving the caret left and right in a text field, stepping through dropdown options) rather than moving between cells. This is what users expect when actively editing a field.&lt;/p&gt;

&lt;p&gt;The transition between states is what teams get wrong. A few rules that help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tab in edit state&lt;/strong&gt; commits the change and moves to the next cell. If the next cell is editable, it opens in edit state; otherwise it receives focus in navigation state. This matches spreadsheet patterns and is what most users expect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enter in edit state on the last editable cell of a row&lt;/strong&gt; commits and moves to the first editable cell of the next row. Again, spreadsheet-style.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Click outside the cell in edit state&lt;/strong&gt; commits the change. This matches most modern editor patterns. Some tools cancel on click-outside; this is harder to recover from and frustrates users who accidentally click elsewhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Escape in edit state&lt;/strong&gt; cancels and returns the value to what it was before editing. This is the standard escape behavior across operating system controls.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
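&lt;p&gt;The rules above amount to a small state machine. Here is one way it might be written as a pure function, so the transitions can be tested without a DOM. The state names, the &lt;code&gt;action&lt;/code&gt; labels, the &lt;code&gt;ClickOutside&lt;/code&gt; pseudo-input, and the function name are invented for this sketch; carrying out each action (committing the value, reverting it, moving focus) is left to the grid.&lt;/p&gt;

```javascript
// Hypothetical transition function for a single cell.
// mode is 'navigation' or 'edit'; input is a key name or 'ClickOutside'.
function nextCellState(mode, input) {
  if (mode === 'navigation') {
    if (input === 'Enter') return { mode: 'edit', action: 'start-edit' };
    return { mode: 'navigation', action: 'none' };
  }
  // From here on the cell is in edit state.
  switch (input) {
    case 'Enter':
      return { mode: 'navigation', action: 'commit' };
    case 'Tab':
      // The grid decides whether the next cell opens in edit state.
      return { mode: 'navigation', action: 'commit-and-move-next' };
    case 'ClickOutside':
      return { mode: 'navigation', action: 'commit' };
    case 'Escape':
      return { mode: 'navigation', action: 'revert' };
    default:
      // Arrow keys and typing belong to the editor while editing.
      return { mode: 'edit', action: 'pass-to-editor' };
  }
}
```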

&lt;h2&gt;
  
  
  Bulk-select with Shift and Ctrl
&lt;/h2&gt;

&lt;p&gt;For grids with bulk operations, the standard keyboard model layers on selection patterns from file managers and spreadsheets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Space&lt;/strong&gt; toggles the selection state of the focused row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shift+Space&lt;/strong&gt; selects from the last-selected row to the current focus, inclusive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ctrl+A&lt;/strong&gt; (or Cmd+A on Mac) selects all rows. This should respect any active filters: select-all-filtered, not select-all-data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ctrl+Click&lt;/strong&gt; toggles individual rows when selecting with the mouse; the keyboard equivalent is just Space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Shift+Click range-selection pattern is the one most worth implementing for power users: it cuts the work of selecting 200 contiguous rows from 200 clicks to 2. The keyboard equivalent is to navigate to the first row, press Space, navigate to the last row, and press Shift+Space.&lt;/p&gt;
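&lt;p&gt;The selection logic behind Space and Shift+Space is small. The sketch below assumes the selection is stored as a &lt;code&gt;Set&lt;/code&gt; of row indexes within the currently filtered view and that the grid remembers the last-selected row as an anchor; &lt;code&gt;toggleRow&lt;/code&gt; and &lt;code&gt;selectRange&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```javascript
// Space: toggle one row without touching the rest of the selection.
function toggleRow(selected, index) {
  const next = new Set(selected);
  if (next.has(index)) next.delete(index);
  else next.add(index);
  return next;
}

// Shift+Space (or Shift+Click): add every row from the anchor
// (the last-selected row) to the focused row, inclusive, in either direction.
function selectRange(selected, anchorIndex, focusIndex) {
  const start = Math.min(anchorIndex, focusIndex);
  const end = Math.max(anchorIndex, focusIndex);
  const range = Array.from({ length: end - start + 1 }, (_, offset) => start + offset);
  return new Set([...selected, ...range]);
}
```

&lt;p&gt;Both functions return a new &lt;code&gt;Set&lt;/code&gt; instead of mutating the old one, which keeps them easy to use with state containers that compare by reference.&lt;/p&gt;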

&lt;p&gt;The &lt;a href="https://www.nngroup.com" rel="noopener noreferrer"&gt;Nielsen Norman Group&lt;/a&gt; has published longitudinal research on bulk-selection patterns in enterprise tools, and the Shift-range pattern consistently shows up as the highest-impact selection feature for power users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discoverability without clutter
&lt;/h2&gt;

&lt;p&gt;Power users learn the keyboard shortcuts. Casual users may never learn them. The question is how to make the shortcuts discoverable without cluttering the interface.&lt;/p&gt;

&lt;p&gt;A few patterns work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A "?" or "Shift+?" keyboard shortcut that opens a help overlay.&lt;/strong&gt; The overlay lists all keyboard shortcuts for the current view. Most modern web applications use this pattern, and users have been trained to try it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inline tooltips on action buttons that include the shortcut.&lt;/strong&gt; A button labeled "Edit" with a tooltip "Edit (Enter)" teaches the shortcut every time the user hovers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding mention.&lt;/strong&gt; A one-time tooltip during first-use that says "Use arrow keys to navigate and Enter to act." Most users dismiss it, but the small fraction who read it pick up enough of the model to discover the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation page.&lt;/strong&gt; A dedicated keyboard shortcuts page in the help section, linked from the help menu. This is the fallback for users who think to look.&lt;/p&gt;

&lt;p&gt;The combination of these (help overlay, tooltips, onboarding hint, docs) covers different discovery paths without forcing the shortcuts onto every user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common failures to avoid
&lt;/h2&gt;

&lt;p&gt;A few patterns reliably break keyboard navigation in data grids:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capturing too many keys.&lt;/strong&gt; A grid that intercepts every keypress (including text input) breaks accessibility tools and surprises users. Capture only the keys the grid needs (arrows, Enter, Escape, Tab, Space, Home, End, PageUp, PageDown), and let everything else pass through to the browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent focus between row click and keyboard nav.&lt;/strong&gt; Clicking a row should put focus in a predictable cell (usually the identity column). Arrow keys should then move from that cell. If clicking jumps the focus to one cell but arrow keys treat a different cell as the starting point, users get confused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab order that includes hidden controls.&lt;/strong&gt; Columns that are not visible (because of column customization or virtualization) sometimes accidentally remain in the Tab order. Tab presses go to invisible elements, which is disorienting. Keep anything that is not visually present out of the Tab order: remove it from the DOM, hide it with the &lt;code&gt;hidden&lt;/code&gt; attribute, or at minimum set &lt;code&gt;tabindex="-1"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modal interactions that trap focus incorrectly.&lt;/strong&gt; A row-detail panel that opens on Enter should trap focus within itself until the user closes it. Returning from the modal should restore focus to the row that opened it, not to the first cell of the grid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No focus indicator.&lt;/strong&gt; A focused cell that does not have a visible focus ring is invisible to keyboard users. Make the focus indicator prominent. The default browser focus ring is often inadequate against busy table backgrounds; design a custom one that is clearly visible.&lt;/p&gt;
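&lt;p&gt;For the first failure in this list, capturing too many keys, a small allow-list at the top of the keydown handler is usually enough. The sketch below is one way to write it. The key names are the standard &lt;code&gt;KeyboardEvent.key&lt;/code&gt; values (Space is reported as a single space character), while &lt;code&gt;shouldGridHandle&lt;/code&gt; and the &lt;code&gt;isEditing&lt;/code&gt; flag are hypothetical. Modifier shortcuts such as Ctrl+A would need their own check.&lt;/p&gt;

```javascript
// Keys the grid handles itself; everything else passes through to the browser.
const GRID_KEYS = new Set([
  'ArrowUp', 'ArrowDown', 'ArrowLeft', 'ArrowRight',
  'Enter', 'Escape', 'Tab', ' ',
  'Home', 'End', 'PageUp', 'PageDown',
]);

// Keys that still belong to the grid while a cell editor is open.
const EDIT_STATE_KEYS = new Set(['Enter', 'Escape', 'Tab']);

// Hypothetical guard: return false to let the event through untouched.
function shouldGridHandle(event, isEditing) {
  if (!GRID_KEYS.has(event.key)) return false;
  if (isEditing) return EDIT_STATE_KEYS.has(event.key);
  return true;
}
```

&lt;p&gt;Calling &lt;code&gt;preventDefault&lt;/code&gt; only after this guard returns true keeps typing, browser shortcuts, and assistive-technology keys working as usual.&lt;/p&gt;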

&lt;h2&gt;
  
  
  A practical implementation order
&lt;/h2&gt;

&lt;p&gt;For adding keyboard navigation to an existing data grid:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add Tab-in / Tab-out at the grid level (composite widget pattern).&lt;/li&gt;
&lt;li&gt;Wire arrow key navigation between cells, with &lt;code&gt;preventDefault&lt;/code&gt; on the events.&lt;/li&gt;
&lt;li&gt;Add Enter for primary row action, Escape for cancel.&lt;/li&gt;
&lt;li&gt;Add Home / End / Ctrl+Home / Ctrl+End / PageUp / PageDown.&lt;/li&gt;
&lt;li&gt;Add Space for bulk-select toggle, Shift+Space for range select.&lt;/li&gt;
&lt;li&gt;Add Ctrl+A for select-all-filtered.&lt;/li&gt;
&lt;li&gt;Layer the inline edit state on top, with Enter / Escape / Tab transitions.&lt;/li&gt;
&lt;li&gt;Add the "?" help overlay listing all shortcuts.&lt;/li&gt;
&lt;li&gt;Add tooltips on action buttons that include the shortcut.&lt;/li&gt;
&lt;li&gt;Test with actual keyboard-only usage and with screen readers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;137Foundry&lt;/a&gt;'s longer write-up on &lt;a href="https://137foundry.com/articles/how-to-design-data-tables-that-stay-readable-as-data-scales" rel="noopener noreferrer"&gt;data table design at scale&lt;/a&gt; covers how keyboard navigation fits with the rest of the table's design decisions (defaults, column widths, virtualization). For the formal authoring practices that define the standard model, the &lt;a href="https://www.w3.org/WAI/" rel="noopener noreferrer"&gt;W3C Web Accessibility Initiative&lt;/a&gt; is the canonical reference, and the &lt;a href="https://developer.mozilla.org" rel="noopener noreferrer"&gt;Mozilla Developer Network&lt;/a&gt; covers the underlying browser focus management APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What good keyboard navigation actually looks like
&lt;/h2&gt;

&lt;p&gt;A power user lands in a 5,000-row data grid via a link from another screen. The grid loads, Tab moves focus into the grid, and the focus indicator appears on a sensible default cell (usually the first cell of the first row). The user presses the Down arrow to move to the row they want. They press Enter to open the row detail panel. They close it with Escape and return to the grid with focus restored to the row they opened. They press Space to select that row, move up one row, and press Shift+Space to extend the selection. They use a custom shortcut (Ctrl+E in this hypothetical tool) to bulk-edit the selected rows. They commit the edit with Enter. They Tab out of the grid and continue with their workflow.&lt;/p&gt;

&lt;p&gt;The whole sequence is keyboard-only and fast. The mouse path for the same workflow can take two to three times as long. Multiplied across a workday and a whole team of power users, the keyboard model produces a meaningful productivity difference in the tool.&lt;/p&gt;

&lt;p&gt;That outcome is what good keyboard navigation actually buys, and the implementation cost (a hundred lines of focus management, a help overlay, well-tested transitions) is small compared to the value.&lt;/p&gt;

</description>
      <category>ux</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Your Service Worker Cache Is Silently Breaking Your Offline Mode</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Fri, 19 Jun 2026 10:09:26 +0000</pubDate>
      <link>https://dev.to/137foundry/why-your-service-worker-cache-is-silently-breaking-your-offline-mode-10a0</link>
      <guid>https://dev.to/137foundry/why-your-service-worker-cache-is-silently-breaking-your-offline-mode-10a0</guid>
      <description>&lt;p&gt;A Progressive Web App that promises offline support and only mostly delivers is worse than one that promises nothing. Users learn to distrust the offline indicator after the first time it lies to them, and the entire feature stops being a feature. The most common cause of this kind of silent failure is a service worker cache that does not actually contain what the app needs to operate offline.&lt;/p&gt;

&lt;p&gt;This piece walks through the patterns that produce the silent break and the techniques that prevent it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1584169417032-d34e8d805e8b%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5MzI0MTZ8MHwxfHNlYXJjaHwxfHxkaW0lMjBkYXRhJTIwY2VudGVyJTIwaGFsbHdheSUyMHNlcnZlcnMlMjByYWNrJTIwZnJhbWVzfGVufDF8fHx8MTc4MTg2Mzc2NHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1584169417032-d34e8d805e8b%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5MzI0MTZ8MHwxfHNlYXJjaHwxfHxkaW0lMjBkYXRhJTIwY2VudGVyJTIwaGFsbHdheSUyMHNlcnZlcnMlMjByYWNrJTIwZnJhbWVzfGVufDF8fHx8MTc4MTg2Mzc2NHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="A dim data center hallway with rows of servers behind rack frames" width="1080" height="621"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@ismailenesayhan?utm_source=137foundry&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;İsmail Enes Ayhan&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=137foundry&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The silent break, in one sentence
&lt;/h2&gt;

&lt;p&gt;Your service worker caches what the user has already visited, which is usually less than what the user will try to visit while offline. The mismatch is silent because there are no errors; the app simply fails to load assets it needs and falls back to whatever skeleton or error state it has, often without the user understanding why.&lt;/p&gt;

&lt;p&gt;The fix is not to cache more aggressively. The fix is to be deliberate about what gets cached, when, and from what trigger. Most service worker caches end up reactive: the first time the user visits a route, that route's assets are cached. The second time, they are served from cache. A third route that the user has never opened online has nothing in the cache, so it fails the first time they try it offline.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step one: precache the critical shell
&lt;/h2&gt;

&lt;p&gt;The most reliable pattern is to precache the application shell at install time. The shell is the set of assets the app needs to start: the entry HTML, the main JavaScript bundle, the main CSS, any fonts, and any images that are part of the always-visible UI.&lt;/p&gt;

&lt;p&gt;Precaching is done in the service worker's &lt;code&gt;install&lt;/code&gt; event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;install&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app-shell-v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addAll&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/main.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/main.css&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/fonts/inter.woff2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/icons/logo.svg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shell is loaded once, kept in cache forever (or until the version is bumped), and serves as the foundation for offline operation. Users visiting any route within the app can get to a working shell even on a cold offline visit, because the shell was cached at install time, not on first visit.&lt;/p&gt;
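&lt;p&gt;For the shell to answer a cold offline visit, the service worker also needs a fetch handler that falls back to it. The sketch below is one way that might look, not a definitive implementation. The helper name &lt;code&gt;navigationWithShellFallback&lt;/code&gt; is hypothetical, and it assumes the entry HTML was precached under &lt;code&gt;/&lt;/code&gt; in &lt;code&gt;app-shell-v1&lt;/code&gt; as in the install handler above.&lt;/p&gt;

```javascript
// Hypothetical helper: try the network for a page navigation and fall back
// to the precached entry HTML when the network is unavailable.
async function navigationWithShellFallback(request, cache, fetchFn) {
  try {
    return await fetchFn(request)
  } catch (error) {
    // Assumption: '/' is the entry HTML stored by the install handler.
    const shell = await cache.match('/')
    if (shell) return shell
    throw error
  }
}

// Wiring, only active when this file runs as a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  self.addEventListener('fetch', (event) => {
    // Leave non-navigation requests to other handlers or the browser.
    if (event.request.mode !== 'navigate') return
    event.respondWith(
      caches.open('app-shell-v1').then((cache) =>
        navigationWithShellFallback(event.request, cache, (req) => fetch(req))
      )
    )
  })
}
```

&lt;p&gt;Passing the cache and the fetch function in as arguments keeps the fallback logic testable outside a browser.&lt;/p&gt;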

&lt;p&gt;For the deeper offline-tolerant data fetching that needs to layer on top of this, the broader pattern lives in the &lt;a href="https://developer.chrome.com/docs/workbox/" rel="noopener noreferrer"&gt;Workbox documentation&lt;/a&gt; and the &lt;a href="https://developer.mozilla.org/" rel="noopener noreferrer"&gt;MDN service worker reference&lt;/a&gt;, which together cover the most common runtime caching strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step two: distinguish between cache-first and network-first routes
&lt;/h2&gt;

&lt;p&gt;Once the shell is in place, runtime caching of API responses and dynamic content has to choose a strategy per route. The two that come up most often:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache-first&lt;/strong&gt; means the service worker checks the cache first and only falls back to the network on miss. Good for assets that rarely change (avatars, product images, font files). Bad for API responses where staleness matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network-first&lt;/strong&gt; means the service worker tries the network first and falls back to the cache only on network failure. Good for API responses where freshness matters more than offline support. The user gets fresh data when online and the last-known data when offline.&lt;/p&gt;

&lt;p&gt;A third pattern, &lt;strong&gt;stale-while-revalidate&lt;/strong&gt; at the service worker layer, returns the cached response immediately and fires a background fetch that updates the cache for next time. This is the same idea as the HTTP-level stale-while-revalidate directive but implemented in JavaScript with more control.&lt;/p&gt;
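&lt;p&gt;As a rough illustration of the difference, here is a minimal sketch of network-first and stale-while-revalidate. The function names are hypothetical, and the cache and fetch function are passed in as arguments so the logic can run anywhere. Inside a real service worker you would pass an opened &lt;code&gt;Cache&lt;/code&gt; and a wrapper around &lt;code&gt;fetch&lt;/code&gt;.&lt;/p&gt;

```javascript
// Network-first: prefer fresh data, fall back to the last cached copy.
async function networkFirst(request, cache, fetchFn) {
  try {
    const response = await fetchFn(request)
    // Only successful responses are stored for offline use.
    if (response.ok) await cache.put(request, response.clone())
    return response
  } catch (error) {
    const cached = await cache.match(request)
    if (cached) return cached
    throw error
  }
}

// Stale-while-revalidate: answer from cache at once, refresh in the background.
async function staleWhileRevalidate(request, cache, fetchFn) {
  const cached = await cache.match(request)
  const refresh = fetchFn(request).then(async (response) => {
    if (response.ok) await cache.put(request, response.clone())
    return response
  })
  if (cached) {
    // A failed background refresh must not turn into an unhandled rejection.
    refresh.catch(() => {})
    return cached
  }
  // Nothing cached yet, so the caller has to wait for the network.
  return refresh
}
```

&lt;p&gt;In a service worker, the background refresh would normally be kept alive with &lt;code&gt;event.waitUntil&lt;/code&gt; so the worker is not stopped before the cache is updated.&lt;/p&gt;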

&lt;p&gt;The trap is using cache-first for API responses without thinking. The user updates their profile, the server stores the change, the service worker keeps returning the cached old profile because the cache hit comes back before the network attempt. The offline mode "works" in that the UI loads, but the data is wrong.&lt;/p&gt;
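&lt;p&gt;One way to avoid that trap is to make the per-route choice explicit in a single function. The sketch below is only illustrative: the &lt;code&gt;/api/&lt;/code&gt; prefix, the file extensions, and the default strategy are assumptions, not something the app above prescribes.&lt;/p&gt;

```javascript
// Hypothetical routing table: decide the caching strategy from the URL.
function pickStrategy(url) {
  const { pathname } = new URL(url)
  // Assumption: API responses live under /api/ and must not go stale silently.
  if (pathname.startsWith('/api/')) return 'network-first'
  // Assumption: these static assets rarely change once published.
  if (/\.(png|jpg|svg|woff2)$/.test(pathname)) return 'cache-first'
  // Everything else gets a fast answer plus a background refresh.
  return 'stale-while-revalidate'
}
```

&lt;p&gt;With this in place, &lt;code&gt;pickStrategy('https://example.com/api/profile')&lt;/code&gt; returns &lt;code&gt;'network-first'&lt;/code&gt;, so the profile update in the example above would reach the user as soon as they are online.&lt;/p&gt;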

&lt;h2&gt;
  
  
  Step three: handle the cache version bump
&lt;/h2&gt;

&lt;p&gt;Service worker caches are versioned by name. When you ship a new version of the app, the new service worker installs alongside the old one and the old caches stay around until something explicitly cleans them up. If you do not handle this, the user's browser accumulates dead caches that count against the origin's storage quota and crowd out the caches the app still needs.&lt;/p&gt;

&lt;p&gt;The pattern is to clean up old caches in the &lt;code&gt;activate&lt;/code&gt; event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;activate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allowList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app-shell-v2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api-cache-v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;allowList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bump the version string on every release that changes the shell. The activate handler deletes everything that is not on the current allowlist, freeing storage for whatever the new version needs.&lt;/p&gt;
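&lt;p&gt;A small sketch of how the version string might be kept in one place so the install and activate handlers cannot drift apart. The constant names and the &lt;code&gt;staleCacheNames&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```javascript
// Hypothetical: one constant to bump on each release that changes the shell.
const SHELL_VERSION = 'v2'
const CURRENT_CACHES = [`app-shell-${SHELL_VERSION}`, 'api-cache-v1']

// Returns the cache names that are no longer on the allowlist.
function staleCacheNames(existingNames, allowList) {
  return existingNames.filter((name) => !allowList.includes(name))
}

// Example: the v1 shell cache is the only one left to delete.
const toDelete = staleCacheNames(
  ['app-shell-v1', 'app-shell-v2', 'api-cache-v1'],
  CURRENT_CACHES
)
console.log(toDelete) // [ 'app-shell-v1' ]
```

&lt;p&gt;The activate handler can then pass the result of &lt;code&gt;caches.keys()&lt;/code&gt; through this helper and call &lt;code&gt;caches.delete&lt;/code&gt; on each name it returns.&lt;/p&gt;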

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu7s43dmwxnfze1pxjrn2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu7s43dmwxnfze1pxjrn2.jpeg" alt="A close-up of cables tied together with white labels behind glass" width="800" height="1202"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Anete Lusina on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step four: detect quota pressure before it bites
&lt;/h2&gt;

&lt;p&gt;Browsers limit the total storage a single origin can use. The limit varies by browser and by available disk space, but it is rarely as much as the application author assumes. When the limit is reached, new cache writes fail, and when the device runs low on space the browser can evict the origin's stored data without telling the app. Either way the app stops working offline, and the team has no signal that anything is wrong.&lt;/p&gt;

&lt;p&gt;The fix is to estimate quota usage at runtime and log when it approaches the limit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;storage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;estimate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;percent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;quota&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;percent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Storage at &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;percent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;% of quota`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, this signal should go to your error tracking. Quota pressure is a leading indicator of impending offline breakage; catching it early lets the team prune caches before the user notices.&lt;/p&gt;
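&lt;p&gt;A hedged sketch of what forwarding that signal could look like. The &lt;code&gt;quotaReport&lt;/code&gt; and &lt;code&gt;checkQuota&lt;/code&gt; names are hypothetical, and &lt;code&gt;report&lt;/code&gt; stands in for whatever your error tracker exposes. The helper also guards against a missing or zero &lt;code&gt;quota&lt;/code&gt;, which would otherwise produce a meaningless percentage.&lt;/p&gt;

```javascript
// Turns a storage estimate into a percentage and a threshold flag.
function quotaReport(estimate, thresholdPercent = 80) {
  const usage = estimate.usage || 0
  const quota = estimate.quota || 0
  // Without a quota figure there is nothing meaningful to compute.
  if (quota === 0) return { percent: null, overThreshold: false }
  const percent = (usage / quota) * 100
  return { percent, overThreshold: percent > thresholdPercent }
}

// Hypothetical wiring: `storage` is navigator.storage in the browser,
// `report` is a function that sends a message to your error tracker.
async function checkQuota(storage, report) {
  const result = quotaReport(await storage.estimate())
  if (result.overThreshold) {
    report(`Storage at ${result.percent.toFixed(1)}% of quota`)
  }
  return result
}
```

&lt;p&gt;In the browser this might be called as &lt;code&gt;checkQuota(navigator.storage, sendToTracker)&lt;/code&gt;, where &lt;code&gt;sendToTracker&lt;/code&gt; is your own function.&lt;/p&gt;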

&lt;blockquote&gt;
&lt;p&gt;"Offline mode is not a feature you ship and forget. It is an ongoing operational commitment, and the silent failure modes deserve as much attention as the active ones." - Dennis Traina, &lt;a href="https://137foundry.com/services" rel="noopener noreferrer"&gt;founder of 137Foundry&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step five: test offline mode in CI
&lt;/h2&gt;

&lt;p&gt;The hardest part of shipping reliable offline support is testing it. Manual testing is unreliable because developers almost always visit the route online first, which warms the cache and masks the bug.&lt;/p&gt;

&lt;p&gt;The fix is to run an automated test that exercises the offline flow from a cold state. &lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; and similar tools let you simulate an offline network condition and verify that key routes still work. The test should:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Launch a fresh browser context (no cached state).&lt;/li&gt;
&lt;li&gt;Navigate to the app once online to install the service worker and precache the shell.&lt;/li&gt;
&lt;li&gt;Force the browser into offline mode.&lt;/li&gt;
&lt;li&gt;Navigate to each route that should work offline and verify the rendered output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any route fails, the precache list is incomplete or a runtime caching strategy is wrong. The test catches the bug before users see it.&lt;/p&gt;
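&lt;p&gt;The four steps above can be sketched as a helper that takes a Playwright &lt;code&gt;browser&lt;/code&gt; object. This is a rough outline under stated assumptions, not a complete test: the function name &lt;code&gt;findOfflineFailures&lt;/code&gt; is hypothetical, it assumes the page registers its own service worker on load, and it only checks that each navigation returns a successful response. Assertions on the rendered content would go on top.&lt;/p&gt;

```javascript
// Returns the routes that fail to load offline after one online visit.
async function findOfflineFailures(browser, baseUrl, routes) {
  // Step 1: a fresh context starts with no cached state.
  const context = await browser.newContext()
  const failures = []
  try {
    const page = await context.newPage()
    // Step 2: one online visit installs the worker and precaches the shell.
    await page.goto(baseUrl)
    await page.evaluate(async () => {
      await navigator.serviceWorker.ready
    })
    // Step 3: cut the network for the whole context.
    await context.setOffline(true)
    // Step 4: every route that should work offline must still load.
    for (const route of routes) {
      try {
        const response = await page.goto(new URL(route, baseUrl).href)
        const ok = response ? response.ok() : false
        if (!ok) failures.push(route)
      } catch (error) {
        failures.push(route)
      }
    }
  } finally {
    await context.close()
  }
  return failures
}
```

&lt;p&gt;A CI test can then fail the build whenever the returned list is not empty.&lt;/p&gt;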

&lt;p&gt;The web development team at 137Foundry treats this as a default part of any PWA we ship with offline support. Without it, the offline mode is theoretical, and the user complaints arrive eventually.&lt;/p&gt;

&lt;h2&gt;
  
  
  A small observability note
&lt;/h2&gt;

&lt;p&gt;One more layer worth mentioning: instrument the service worker itself. Add logging to every fetch event handler that records whether the response came from cache or network, and what the cache name was. Aggregate these logs to your observability backend.&lt;/p&gt;

&lt;p&gt;The metric to watch is the cache hit rate per route. A healthy precache hit rate is close to 100% for shell routes; a healthy runtime cache hit rate is 50% or more for routes the user visits repeatedly. A sudden drop indicates a regression, usually that a recent code change has broken the cache key or the precache list.&lt;/p&gt;

&lt;p&gt;The service worker context is awkward for logging because it does not have direct access to the rest of the app's logging infrastructure. The pattern that works is to post a message from the service worker to the main thread with the log payload, and let the main thread forward it to the observability backend. A few lines of code, and the offline mode now has the same observability story as the rest of the app.&lt;/p&gt;
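&lt;p&gt;A minimal sketch of that message relay, with one function for each side. The names &lt;code&gt;broadcastLog&lt;/code&gt; and &lt;code&gt;forwardServiceWorkerLogs&lt;/code&gt;, the &lt;code&gt;sw-log&lt;/code&gt; message type, and the &lt;code&gt;send&lt;/code&gt; callback are all hypothetical. In a real app the first argument would be &lt;code&gt;self.clients&lt;/code&gt; in the service worker and &lt;code&gt;navigator.serviceWorker&lt;/code&gt; on the page.&lt;/p&gt;

```javascript
// Service worker side: post a log payload to every open window client.
async function broadcastLog(clientsApi, payload) {
  const windows = await clientsApi.matchAll({ type: 'window' })
  for (const client of windows) {
    client.postMessage({ type: 'sw-log', payload })
  }
  return windows.length
}

// Page side: forward service worker logs to the observability backend.
function forwardServiceWorkerLogs(container, send) {
  container.addEventListener('message', (event) => {
    const data = event.data
    if (!data) return
    // Ignore messages that are not log payloads.
    if (data.type === 'sw-log') send(data.payload)
  })
}
```

&lt;p&gt;A fetch handler could call &lt;code&gt;broadcastLog(self.clients, { url, source: 'cache', cacheName })&lt;/code&gt; after deciding where the response came from. The payload fields here are only an example.&lt;/p&gt;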

&lt;p&gt;For the broader caching architecture, including how the service worker cache layer interacts with HTTP-level caching, the &lt;a href="https://137foundry.com/articles/how-to-cache-api-responses-in-the-browser" rel="noopener noreferrer"&gt;137Foundry article on browser API caching&lt;/a&gt; walks through the decision matrix in detail. The &lt;a href="https://137foundry.com/services/web-development" rel="noopener noreferrer"&gt;web development service page&lt;/a&gt; covers the related architectural work we do for clients building PWAs.&lt;/p&gt;

</description>
      <category>pwa</category>
      <category>webdev</category>
      <category>programming</category>
      <category>serviceworker</category>
    </item>
    <item>
      <title>How to Use ETags for Cheap Revalidation in a React App</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Fri, 19 Jun 2026 10:07:37 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-use-etags-for-cheap-revalidation-in-a-react-app-1c60</link>
      <guid>https://dev.to/137foundry/how-to-use-etags-for-cheap-revalidation-in-a-react-app-1c60</guid>
      <description>&lt;p&gt;ETags are one of the cheapest performance wins available to a React app talking to a JSON API. The browser does most of the work, the server returns a body-less 304 when the data has not changed, and the user sees response times that drop by a large fraction without any visible cost.&lt;/p&gt;

&lt;p&gt;This guide walks through the practical steps to wire ETags into a React application end to end, with the patterns that work and the gotchas to know about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4s383o3gcxiubuohua8y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4s383o3gcxiubuohua8y.jpeg" alt="A server rack with neatly bundled cables behind a panel of glass" width="799" height="532"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Brett Sayles on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Make sure the server returns ETags
&lt;/h2&gt;

&lt;p&gt;ETags are server-driven. The server attaches an &lt;code&gt;ETag: "some-value"&lt;/code&gt; header to each response, and the browser stores it alongside the body. On subsequent requests, the browser sends &lt;code&gt;If-None-Match: "some-value"&lt;/code&gt; and the server either returns 304 with no body (cache stays valid) or 200 with a new body and a new ETag.&lt;/p&gt;

&lt;p&gt;The implementation on the server can be one of two patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content-hash ETag.&lt;/strong&gt; The server hashes the response body (SHA-1, MD5, anything stable). Useful for any response, but does not save the server-side work of generating the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version-based ETag.&lt;/strong&gt; The server uses a version or &lt;code&gt;updated_at&lt;/code&gt; field from the underlying entity. Saves both the database read and the response generation, because revalidation can check the version alone.&lt;/li&gt;
&lt;/ul&gt;
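&lt;p&gt;To make the version-based pattern concrete, here is a framework-agnostic sketch. Everything in it is an assumption for illustration: the entity shape (&lt;code&gt;id&lt;/code&gt; and &lt;code&gt;updatedAt&lt;/code&gt;), the function names, and the plain-object response. The header parsing is simplified and splits on commas, which is fine for tags like these but not for arbitrary ETag values.&lt;/p&gt;

```javascript
// Builds a quoted ETag from an entity's id and last-modified timestamp.
function versionEtag(entity) {
  return `"${entity.id}-${new Date(entity.updatedAt).getTime()}"`
}

// Compares an If-None-Match header against the current ETag.
// Weak prefixes (W/) are ignored, as conditional GETs use weak comparison.
function matchesIfNoneMatch(headerValue, etag) {
  if (!headerValue) return false
  if (headerValue.trim() === '*') return true
  const strip = (tag) => tag.trim().replace(/^W\//, '')
  return headerValue.split(',').map(strip).includes(strip(etag))
}

// Decides between 304 and 200 without building the body on a match.
function respond(ifNoneMatch, entity, loadBody) {
  const etag = versionEtag(entity)
  if (matchesIfNoneMatch(ifNoneMatch, etag)) {
    return { status: 304, headers: { ETag: etag }, body: null }
  }
  return { status: 200, headers: { ETag: etag }, body: loadBody() }
}
```

&lt;p&gt;Because &lt;code&gt;loadBody&lt;/code&gt; is only called on the 200 path, a matching revalidation skips the response generation entirely, which is the saving the version-based pattern is after.&lt;/p&gt;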

&lt;p&gt;Express, Next.js, and most modern Node.js frameworks can generate content-hash ETags out of the box; version-based ETags usually take a few lines of handler code. Frameworks like &lt;a href="https://fastify.dev/" rel="noopener noreferrer"&gt;Fastify&lt;/a&gt; and &lt;a href="https://hono.dev/" rel="noopener noreferrer"&gt;Hono&lt;/a&gt; make this even easier with first-class ETag middleware. Verify your server is actually sending ETags by inspecting the response headers in the browser dev tools network panel.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Confirm the browser is sending If-None-Match on revalidation
&lt;/h2&gt;

&lt;p&gt;If your &lt;code&gt;Cache-Control&lt;/code&gt; headers are set to allow caching (e.g., &lt;code&gt;Cache-Control: private, max-age=0, must-revalidate&lt;/code&gt;), the browser should automatically include &lt;code&gt;If-None-Match&lt;/code&gt; on every subsequent request. Open the network panel, find your request, and check the request headers.&lt;/p&gt;

&lt;p&gt;If the header is missing, the most likely cause is a &lt;code&gt;Cache-Control: no-store&lt;/code&gt; directive somewhere in your response chain that is preventing the browser from storing the response at all. A &lt;code&gt;no-store&lt;/code&gt; response cannot be revalidated, because there is nothing to revalidate against. Change to &lt;code&gt;no-cache&lt;/code&gt; (the response is stored but must be revalidated on every use) if you want ETag-driven freshness.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Pick a React data-fetching library that respects HTTP caching
&lt;/h2&gt;

&lt;p&gt;Most modern React data-fetching libraries layer their own cache on top of the browser cache, and not all of them play nicely with HTTP-level caching. The two that handle this well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://tanstack.com/query/latest" rel="noopener noreferrer"&gt;TanStack Query&lt;/a&gt; (formerly React Query) honors HTTP caching by default if you use the &lt;code&gt;fetch&lt;/code&gt; API directly inside your query function. The library's own &lt;code&gt;staleTime&lt;/code&gt; and &lt;code&gt;gcTime&lt;/code&gt; (named &lt;code&gt;cacheTime&lt;/code&gt; before v5) are layered on top.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://swr.vercel.app/" rel="noopener noreferrer"&gt;SWR&lt;/a&gt; does the same. The library's own cache is keyed by URL, and the underlying fetch respects browser HTTP caching transparently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The library-level cache is configured to be aggressive by default (stale data is served while a background revalidation fires). The browser-level cache adds another layer underneath, providing the 304-based optimization without any extra code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useQuery&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tanstack/react-query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;UserProfile&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isLoading&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useQuery&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;queryKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;queryFn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/users/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="na"&gt;staleTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isLoading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Skeleton&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ProfileCard&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;fetch&lt;/code&gt; call here automatically picks up the browser's cached response when revalidation produces a 304. No additional configuration needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcz2d52oslbwcqrg7nfg2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcz2d52oslbwcqrg7nfg2.jpeg" alt="A flat lay of a notebook with annotated diagrams beside a fountain pen" width="799" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by fauxels on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Set up Cache-Control to allow the browser to cache
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Cache-Control&lt;/code&gt; header decides whether the browser will store the response at all. For ETag-driven revalidation to work, the response must be storable. The directive that works for most user-specific API endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache-Control: private, max-age=0, must-revalidate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This says: store the response, but always revalidate before serving it. The revalidation uses the ETag. On 304, the cache is served. On 200, the cache is replaced. Both outcomes are correct.&lt;/p&gt;

&lt;p&gt;For endpoints where stale-while-revalidate is appropriate, use it in place of &lt;code&gt;must-revalidate&lt;/code&gt;, which forbids serving a stale response and would cancel it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache-Control: private, max-age=0, stale-while-revalidate=60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This serves the cached response immediately while firing a background revalidation, as long as the cached copy is no more than sixty seconds old. The user gets a fast load, and the next request sees the refreshed response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Verify the round trip with the dev tools
&lt;/h2&gt;

&lt;p&gt;A working ETag round trip shows up in the network panel as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First request: 200 OK, full response body, &lt;code&gt;ETag&lt;/code&gt; header in response.&lt;/li&gt;
&lt;li&gt;Subsequent requests while the data is unchanged: 304 Not Modified, no response body, &lt;code&gt;If-None-Match&lt;/code&gt; header in request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see 200 every time, the browser is not honoring the cache. Check &lt;code&gt;Cache-Control&lt;/code&gt; on the response. If you see 304 but the response body in the panel is still the full body, that is the dev tools showing the cached body for clarity, not the actual network response. The actual network transfer is just the 304 headers.&lt;/p&gt;

&lt;p&gt;A common gotcha: the browser's network panel has a "Disable cache" checkbox that, when on, prevents any cache from being used. Make sure it is off when testing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The 304 round trip is the single cheapest thing you can do to make a React app feel snappy. Almost no code changes; almost all the win comes from the server doing its part." - Dennis Traina, &lt;a href="https://137foundry.com/services" rel="noopener noreferrer"&gt;founder of 137Foundry&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 6: Handle the invalidation case
&lt;/h2&gt;

&lt;p&gt;When the underlying data changes, the server's next response will be 200 with a new ETag. The browser stores the new response and the new ETag. The next revalidation uses the new ETag, gets a 304, and the cycle continues.&lt;/p&gt;

&lt;p&gt;There is one case where this falls down: if the underlying data changed but the user's open tab does not know about it. Until the user takes an action that triggers a refetch, the UI continues to show the previous response.&lt;/p&gt;

&lt;p&gt;The pattern that solves this is to combine ETag revalidation with library-level staleness. TanStack Query and SWR will refetch on window focus and on network reconnect by default. These triggers, combined with the 304 round trip when nothing has changed, give you fast loads and reasonably fresh data without explicit work.&lt;/p&gt;

&lt;p&gt;For real-time updates where the user must see changes immediately, the right pattern is server-pushed invalidation via WebSocket or Server-Sent Events. ETags are for the case where freshness within a few seconds of revalidation is good enough, which covers most dashboards and admin interfaces.&lt;/p&gt;

&lt;p&gt;The architecture team at &lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;137Foundry&lt;/a&gt; defaults to the ETag-plus-library-revalidation pattern for any React app with significant API traffic. It is the strategy with the best performance-per-line-of-code ratio.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Layer in a small monitoring habit
&lt;/h2&gt;

&lt;p&gt;The last piece is a monitoring habit that catches regressions when the ETag pipeline degrades. Add a small instrumented wrapper around your fetch calls that logs the response status, the elapsed time, and whether the response was a 304 or a 200. Aggregate these metrics in your observability tool of choice.&lt;/p&gt;
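&lt;p&gt;A minimal sketch of such a wrapper, with hypothetical names (&lt;code&gt;instrumentedFetch&lt;/code&gt;, &lt;code&gt;notModifiedRate&lt;/code&gt;) and a &lt;code&gt;record&lt;/code&gt; callback standing in for your metrics client. One caveat worth checking in your own setup: when the browser handles the conditional request itself, script-level &lt;code&gt;fetch&lt;/code&gt; generally sees a 200 with the cached body, so the 304 count may have to come from server logs or from a client that sets &lt;code&gt;If-None-Match&lt;/code&gt; explicitly.&lt;/p&gt;

```javascript
// Wraps a fetch call and records status and elapsed time for each request.
async function instrumentedFetch(url, fetchFn, record) {
  const startedAt = Date.now()
  const response = await fetchFn(url)
  record({ url, status: response.status, elapsedMs: Date.now() - startedAt })
  return response
}

// Share of recorded responses that were 304 Not Modified (0 to 1).
function notModifiedRate(records) {
  if (records.length === 0) return 0
  const hits = records.filter((entry) => entry.status === 304).length
  return hits / records.length
}
```

&lt;p&gt;In the browser, &lt;code&gt;fetchFn&lt;/code&gt; could be &lt;code&gt;(url) =&gt; fetch(url)&lt;/code&gt;, and &lt;code&gt;record&lt;/code&gt; could push into a buffer that is flushed to the observability backend periodically.&lt;/p&gt;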

&lt;p&gt;The metric to watch is the 304 rate over time. Healthy ETag-driven caching produces a 304 rate of 40% to 70% on typical user sessions. If the rate drops suddenly, something has changed about the cache directives or the ETag implementation. Catching this within a day saves the team from a slow degradation that nobody notices until the API costs start climbing.&lt;/p&gt;

&lt;p&gt;For an app at modest scale, this monitoring is two lines of code wrapped around the fetch call. For an app at larger scale, an APM tool like Datadog or &lt;a href="https://sentry.io/" rel="noopener noreferrer"&gt;Sentry's performance monitoring&lt;/a&gt; handles it more cleanly. The discipline matters more than the tool.&lt;/p&gt;

&lt;p&gt;For the deeper version of this guide, including the broader browser caching architecture and where the Cache API fits in, see the &lt;a href="https://137foundry.com/articles/how-to-cache-api-responses-in-the-browser" rel="noopener noreferrer"&gt;137Foundry article on browser API caching&lt;/a&gt;. The &lt;a href="https://137foundry.com/services/web-development" rel="noopener noreferrer"&gt;web development service page&lt;/a&gt; covers some of the related architectural work as well.&lt;/p&gt;

</description>
      <category>react</category>
      <category>webdev</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>How to Wire aria-describedby and aria-invalid on Form Errors Without Breaking Your Styles</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:02:55 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-wire-aria-describedby-and-aria-invalid-on-form-errors-without-breaking-your-styles-4583</link>
      <guid>https://dev.to/137foundry/how-to-wire-aria-describedby-and-aria-invalid-on-form-errors-without-breaking-your-styles-4583</guid>
      <description>&lt;p&gt;Most existing forms have a validation experience that works visually but is invisible to users of assistive technology. The fix is usually two ARIA attributes (&lt;code&gt;aria-describedby&lt;/code&gt; and &lt;code&gt;aria-invalid&lt;/code&gt;) plus a small live region on the error element. The total code change for a typical form is under 50 lines and produces a meaningful improvement in accessibility scores.&lt;/p&gt;

&lt;p&gt;This is a step-by-step walkthrough of the wiring, with the specific gotchas that come up when you retrofit it onto an existing form.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx4zuooeaklqbcac52khf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx4zuooeaklqbcac52khf.jpeg" alt="A notebook open to a page of handwritten annotations next to a pen" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Yusuf Çelik on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The current state of most forms
&lt;/h2&gt;

&lt;p&gt;A typical existing form has this kind of structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"form-field"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"email-input"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Email&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"email-input"&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"email-error"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error div is in the DOM. The validation JavaScript writes a message into it when something is wrong. The CSS styles the error red. From a sighted user's perspective, it works.&lt;/p&gt;

&lt;p&gt;From a screen reader user's perspective, the input and the error are unrelated DOM siblings. Focusing the input announces only the label, not the error. When the error appears, a user who does not navigate to it never learns that it is there.&lt;/p&gt;

&lt;p&gt;The fix is connecting the input to the error programmatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: add aria-describedby
&lt;/h2&gt;

&lt;p&gt;The first attribute is &lt;code&gt;aria-describedby&lt;/code&gt;. Put it on the input and point it at the id of the error element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt;
  &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt;
  &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"email-input"&lt;/span&gt;
  &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt;
  &lt;span class="na"&gt;aria-describedby=&lt;/span&gt;&lt;span class="s"&gt;"email-error"&lt;/span&gt;
&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what tells the screen reader that when the user focuses the input, the content of the &lt;code&gt;#email-error&lt;/code&gt; element is part of the description. The same attribute can take multiple ids (&lt;code&gt;aria-describedby="email-error email-hint"&lt;/code&gt;) if you also have a static help text element below the input.&lt;/p&gt;

&lt;p&gt;The attribute can be on the input from the start; you do not have to add and remove it based on whether an error is present. The screen reader announces the description only when there is content in the referenced element. An empty error div produces no announcement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: add aria-invalid dynamically
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;aria-invalid&lt;/code&gt; is the programmatic signal that the input has a validation error. Set it to &lt;code&gt;true&lt;/code&gt; when the error shows, &lt;code&gt;false&lt;/code&gt; (or remove it) when the error clears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;showError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aria-invalid&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;clearError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aria-invalid&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;false&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;aria-invalid&lt;/code&gt; is separate from the visual styling. Screen readers announce it as "invalid entry" or similar; sighted users see the red border via CSS. Both signals point at the same state but go to different users.&lt;/p&gt;

&lt;p&gt;A common mistake: setting &lt;code&gt;aria-invalid&lt;/code&gt; on the input element only, while the visual red border is on a wrapping div. Either move the styling to the input itself (so the visual and the ARIA share a target), or add a corresponding visual class on the input in the same function. Drift between the two is the root cause of most "the form looks valid but the screen reader says invalid" bugs.&lt;/p&gt;
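
&lt;p&gt;One way to rule out that drift is to route both signals through a single function. The sketch below assumes the red border is driven by a class on the input. The class name &lt;code&gt;is-invalid&lt;/code&gt; and the helper name &lt;code&gt;setFieldError&lt;/code&gt; are hypothetical, so substitute whatever the existing stylesheet uses.&lt;/p&gt;

```javascript
// Hypothetical helper: the class name "is-invalid" is an assumption;
// substitute whatever class the existing stylesheet already uses.
function setFieldError(input, errorElement, message) {
  const hasError = Boolean(message);
  errorElement.textContent = hasError ? message : '';
  // The ARIA state and the visual class change in the same call,
  // so the two signals cannot drift apart.
  input.setAttribute('aria-invalid', hasError ? 'true' : 'false');
  input.classList.toggle('is-invalid', hasError);
}
```

&lt;p&gt;Calling it with an empty message clears the error, so it could stand in for the &lt;code&gt;showError&lt;/code&gt; and &lt;code&gt;clearError&lt;/code&gt; pair above.&lt;/p&gt;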

&lt;h2&gt;
  
  
  Step 3: make the error announce when it appears
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;aria-describedby&lt;/code&gt; connection is sufficient if the user navigates back to the field after the error appears. It is not sufficient if the error appears while the user is still focused on the field (live re-validation on keystroke after an initial error).&lt;/p&gt;

&lt;p&gt;A live region on the error element fixes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"email-error"&lt;/span&gt; &lt;span class="na"&gt;aria-live=&lt;/span&gt;&lt;span class="s"&gt;"polite"&lt;/span&gt; &lt;span class="na"&gt;role=&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;aria-live="polite"&lt;/code&gt; tells the screen reader to announce changes at the next natural pause rather than interrupting the user mid-typing. &lt;code&gt;role="status"&lt;/code&gt; already implies &lt;code&gt;aria-live="polite"&lt;/code&gt;, so specifying both is redundant on paper. The pair is a common defensive choice for cross-browser support; if you want minimal markup, drop &lt;code&gt;role&lt;/code&gt; and test with your target screen readers.&lt;/p&gt;

&lt;p&gt;Polite is the right choice for form validation. The assertive variant (&lt;code&gt;aria-live="assertive"&lt;/code&gt;) interrupts and is reserved for emergencies, which a form error is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: announce the pending state during async validation
&lt;/h2&gt;

&lt;p&gt;For async validation (username check, address lookup), the user expects feedback that the field is being verified. Without it, a visual spinner is the only signal, and a screen reader user gets no signal at all.&lt;/p&gt;

&lt;p&gt;Two patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern A:&lt;/strong&gt; Use &lt;code&gt;aria-busy="true"&lt;/code&gt; on the input or its wrapper while the check is in flight. Some screen readers will announce that the field is busy, but support varies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern B:&lt;/strong&gt; Use a separate live region for status updates ("Verifying username...") that updates when the check starts and clears when it completes.&lt;/p&gt;

&lt;p&gt;Pattern B works more reliably across screen readers, at the cost of slightly more markup. Pattern A is simpler when it works but inconsistent across implementations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"form-field"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"username-input"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Username&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt;
    &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;
    &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"username-input"&lt;/span&gt;
    &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"username"&lt;/span&gt;
    &lt;span class="na"&gt;aria-describedby=&lt;/span&gt;&lt;span class="s"&gt;"username-error username-status"&lt;/span&gt;
  &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"username-error"&lt;/span&gt; &lt;span class="na"&gt;aria-live=&lt;/span&gt;&lt;span class="s"&gt;"polite"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"status visually-hidden"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"username-status"&lt;/span&gt; &lt;span class="na"&gt;aria-live=&lt;/span&gt;&lt;span class="s"&gt;"polite"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The status element is visually hidden (via a standard &lt;code&gt;visually-hidden&lt;/code&gt; CSS utility) but exposed to screen readers. It is the audio-only counterpart to the spinner.&lt;/p&gt;
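
&lt;p&gt;A rough sketch of the script side of Pattern B, assuming the async validation is available as a function that returns a promise. The helper name &lt;code&gt;withStatusAnnouncement&lt;/code&gt; and its parameters are hypothetical.&lt;/p&gt;

```javascript
// Hypothetical helper: checkFn is any function returning a promise for
// the async validation; statusElement is the visually hidden live region.
async function withStatusAnnouncement(statusElement, pendingMessage, checkFn) {
  // Writing into the live region triggers a polite announcement.
  statusElement.textContent = pendingMessage;
  try {
    return await checkFn();
  } finally {
    // Clear the status whether the check succeeded or failed,
    // so a stale "Verifying..." message is never left behind.
    statusElement.textContent = '';
  }
}
```

&lt;p&gt;The result of the check still goes to the error element, whose own live region announces it. The status region only covers the time the request is in flight.&lt;/p&gt;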

&lt;h2&gt;
  
  
  Step 5: handle keyboard focus correctly
&lt;/h2&gt;

&lt;p&gt;A field in an error state still needs to be reachable by keyboard. Two patterns that get broken in retrofits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Focus stops working because of overflow rules.&lt;/strong&gt; Some forms hide error states by clipping with &lt;code&gt;overflow: hidden&lt;/code&gt;, which can also clip the focus outline. Make sure the focus indicator on an error field is still visible.&lt;/li&gt;
&lt;li&gt;
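&lt;strong&gt;Tab order skips the error element.&lt;/strong&gt; This is usually fine because the error is not interactive (no need to tab into it), but if you have an "edit" or "fix" link in the error message, make sure it is keyboard-reachable. A real &lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt; is in the tab order by default; &lt;code&gt;tabindex="0"&lt;/code&gt; is only needed on a non-native element.&lt;/li&gt;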
&lt;/ul&gt;

&lt;p&gt;The general rule: keyboard navigation through the form should be unchanged by the presence of errors. The errors are descriptive, not interactive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fm3felg0d4t0iygbwc5ut.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fm3felg0d4t0iygbwc5ut.jpeg" alt="A small sketch on graph paper of nested form components" width="800" height="1200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Letícia Alvares on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: verify with the standards
&lt;/h2&gt;

&lt;p&gt;Once the wiring is in place, check it against the &lt;a href="https://www.w3.org/WAI/standards-guidelines/wcag/" rel="noopener noreferrer"&gt;W3C Web Content Accessibility Guidelines&lt;/a&gt; success criteria most relevant to form validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3.3.1 Error Identification:&lt;/strong&gt; Errors must be identified in text. Color alone is not sufficient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3.3.3 Error Suggestion:&lt;/strong&gt; When the error has known fixes, suggest them. "Add the @ to make this a valid email" is a suggestion; "Invalid email" is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4.1.3 Status Messages:&lt;/strong&gt; Status messages (including errors) must be programmatically determined and announced to assistive tech without receiving focus.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.w3.org/WAI/ARIA/apg/" rel="noopener noreferrer"&gt;WAI-ARIA Authoring Practices Guide&lt;/a&gt; has worked code samples for accessible form validation patterns. If you are unsure about a specific wiring choice, the examples in the APG are the authoritative reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: test with a real screen reader
&lt;/h2&gt;

&lt;p&gt;ARIA wiring that looks correct in code does not always announce correctly. Test with at least one screen reader before merging. NVDA on Windows is free; VoiceOver on macOS is built in. Ten minutes of testing on a real screen reader catches more bugs than any automated audit.&lt;/p&gt;

&lt;p&gt;The minimum test pass:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tab into the field. Label announces.&lt;/li&gt;
&lt;li&gt;Type an invalid value. Tab away. Error announces.&lt;/li&gt;
&lt;li&gt;Tab back to the field. Label announces with the error as the description.&lt;/li&gt;
&lt;li&gt;Fix the input. Error clears, optionally with an announcement.&lt;/li&gt;
&lt;li&gt;Trigger an async check. Pending status announces. Result announces when ready.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools that complement manual testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://wave.webaim.org/" rel="noopener noreferrer"&gt;WAVE&lt;/a&gt; browser extension surfaces ARIA wiring gaps inline on the page.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.deque.com/axe/" rel="noopener noreferrer"&gt;axe DevTools&lt;/a&gt; browser extension catches ARIA misuse and color contrast issues; usable in CI for regression testing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://webaim.org/resources/contrastchecker/" rel="noopener noreferrer"&gt;WebAIM contrast checker&lt;/a&gt; verifies error text and input border colors meet WCAG thresholds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 8: keep the existing styles intact
&lt;/h2&gt;

&lt;p&gt;The whole point of this retrofit is to avoid rebuilding the form. The styles already work for sighted users; the ARIA additions are about adding signal for users on assistive tech.&lt;/p&gt;

&lt;p&gt;Specific things to avoid changing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The visual layout of the form. The error div stays where it is.&lt;/li&gt;
&lt;li&gt;The color of the error text or the input border. If the colors meet WCAG, leave them.&lt;/li&gt;
&lt;li&gt;The trigger logic (when validation runs). If the timing is wrong, fix it as a separate change, not as part of the accessibility retrofit. Bundling the two makes both harder to review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The retrofit is a small targeted change. The design improvements come next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The longer read on the full design
&lt;/h2&gt;

&lt;p&gt;This piece is the wiring mechanics. The full design of inline validation (when to trigger, what to say, async patterns, visual treatment, success states) sits in a &lt;a href="https://137foundry.com/articles/how-to-design-inline-form-validation-that-actually-helps-users" rel="noopener noreferrer"&gt;longer guide on designing inline form validation that actually helps users&lt;/a&gt; on &lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;https://137foundry.com&lt;/a&gt;. The two pieces stack: this one wires the accessibility, the longer one improves the design that the accessibility now exposes correctly.&lt;/p&gt;

&lt;p&gt;The right order is: ARIA retrofit first (cheap, high impact), then the design improvements on top of an accessibility-correct baseline. Skipping the retrofit and only doing the design work means the improved design is still invisible to users on assistive technology, which defeats the point.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>a11y</category>
      <category>html</category>
    </item>
    <item>
      <title>Why the Async Username Check Is the Worst Part of Most Signup Forms</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:02:54 +0000</pubDate>
      <link>https://dev.to/137foundry/why-the-async-username-check-is-the-worst-part-of-most-signup-forms-3294</link>
      <guid>https://dev.to/137foundry/why-the-async-username-check-is-the-worst-part-of-most-signup-forms-3294</guid>
      <description>&lt;p&gt;I have looked at a lot of signup forms in the last year. Most of them have one specific bug that the team probably knows about but has not prioritized: the async username check.&lt;/p&gt;

&lt;p&gt;The pattern shows up in slightly different shapes. Sometimes the field reports "available" for half a second before the server responds with "taken." Sometimes the spinner runs for two seconds and the user has already moved to the next field. Sometimes the submit button does not block while the check is pending, so a user who mashes submit can get past the validation entirely. Sometimes the check fails on a network timeout and the user sees a red error that says "username is required" when actually their network is just slow.&lt;/p&gt;

&lt;p&gt;Each of these is the same underlying problem: the synchronous validation patterns are well-trodden, but the way forms handle a server roundtrip in the middle of inline feedback is where users get most confused. This is a writeup of how to get it right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5xfkw34zmg51pp5d6a03.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5xfkw34zmg51pp5d6a03.jpeg" alt="Close-up of a smartphone screen with a signup form" width="799" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Sanket Mishra on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What the user experience is supposed to be
&lt;/h2&gt;

&lt;p&gt;Strip out the engineering. From the user's perspective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User types a desired username.&lt;/li&gt;
&lt;li&gt;User pauses (or tabs away) and the form does a quick server check.&lt;/li&gt;
&lt;li&gt;The form tells the user either "this username is available" or "this is taken, here is what is wrong" within a second or so.&lt;/li&gt;
&lt;li&gt;The user fixes the username if needed and the field updates the moment they have a working one.&lt;/li&gt;
&lt;li&gt;The user submits and the username is reserved atomically with the rest of the form.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the spec. Now look at what most signup forms actually do and you will see at least one step where the implementation drifts.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where the implementations actually fail
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fail mode 1: no debounce, one request per keystroke.&lt;/strong&gt; The user types "j-o-h-n-s-m-i-t-h" and the form fires nine requests. Eight of them return results that no longer match the current input. The form has to track which request was the latest and ignore the older ones, which it usually does not, so the user sometimes sees "available" flash because an earlier in-flight request resolved last.&lt;/p&gt;

&lt;p&gt;The fix is debouncing. Wait 300 to 500 ms after the user stops typing before firing the request. The user perception is "the form checks when I am done with the field," not "the form checks every letter."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail mode 2: no pending state.&lt;/strong&gt; The form fires the request, the spinner is missing or hidden, and the user tabs away expecting that the field is now valid. Half a second later the server returns "taken" and the field flips to red while the user is already in the next field.&lt;/p&gt;

&lt;p&gt;The fix is a visible pending state. A small spinner inside or next to the field, plus disabling the submit button. The user understands "this is still being checked" and waits, or at least is not surprised when the result comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail mode 3: submit not blocked during pending check.&lt;/strong&gt; The user types a username, mashes submit before the check completes, and the form sends with &lt;code&gt;username_valid: true&lt;/code&gt; based on the assumption that the field looked fine at submit time. The server then either accepts the submission (creating a duplicate username state) or rejects it with a generic 400 that does not surface in the inline UI.&lt;/p&gt;

&lt;p&gt;The fix is to disable submit while the check is pending. The button is gray, with a tooltip explaining why ("Verifying username...") until the check completes. The user sees the friction and waits a beat instead of submitting blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail mode 4: network errors masquerading as validation errors.&lt;/strong&gt; The request times out, or the server returns a 500. The form's error handler is wired to treat any non-success response as "username is invalid" and shows a red error. The user thinks they did something wrong; the actual problem is the network or the server.&lt;/p&gt;

&lt;p&gt;The fix is to distinguish the two cases. A successful server response that says "taken" is a validation error and should show as one. A network timeout or a 5xx is an infrastructure error and should show as one ("we could not verify this right now, please try again"). The two messages are different because the user actions to fix them are different.&lt;/p&gt;
&lt;h2&gt;
  
  
  The pattern that handles all four
&lt;/h2&gt;

&lt;p&gt;A small piece of pseudocode that captures the right shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;activeRequest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;debounceTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Clear any pending debounce and previous error&lt;/span&gt;
  &lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;debounceTimer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;clearError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Start a new debounce window&lt;/span&gt;
  &lt;span class="nx"&gt;debounceTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;runCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;runCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Mark request as in flight and disable submit&lt;/span&gt;
  &lt;span class="nf"&gt;setPendingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;disabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="nx"&gt;latestRequestId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/usernames/check?value=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// If a newer request has started, ignore this one&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;latestRequestId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;available&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;clearError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;showError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is taken. Try another.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;showInfrastructureMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;We could not verify this right now. Please try again.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
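  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// A failed stale request must not overwrite the state of a newer one.&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;latestRequestId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;showInfrastructureMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errorElement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;We could not verify this right now. Please try again.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// finally also runs on the early return above, so only the latest request may clear the pending state.&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;latestRequestId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;setPendingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;disabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;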
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four failure modes map to four patterns in this code: debouncing input, tracking in-flight requests with a request id, distinguishing a valid response from a network failure, and disabling submit while a check is pending.&lt;/p&gt;

&lt;h2&gt;
  
  
  The accessibility piece nobody hand-codes
&lt;/h2&gt;

&lt;p&gt;Once the validation logic is correct, the announcement needs to be wired so screen readers convey the right state. The minimum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aria-describedby="username-error"&lt;/code&gt; on the input pointing to the error element.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aria-invalid="true"&lt;/code&gt; set when the error is showing, cleared when it clears.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aria-live="polite"&lt;/code&gt; on the error element so the message is announced when it changes.&lt;/li&gt;
&lt;li&gt;A separate live region or &lt;code&gt;aria-busy="true"&lt;/code&gt; on the input while the check is pending, so the screen reader user knows the field is still being evaluated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.w3.org/WAI/standards-guidelines/wcag/" rel="noopener noreferrer"&gt;W3C Web Content Accessibility Guidelines&lt;/a&gt; cover the formal requirements; the &lt;a href="https://www.w3.org/WAI/ARIA/apg/" rel="noopener noreferrer"&gt;WAI-ARIA Authoring Practices Guide&lt;/a&gt; has worked examples of accessible form validation patterns. The pattern is stable across screen readers; the work is making sure your code actually emits the ARIA states consistently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1542744094-24638eff58bb%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5MzI0MTZ8MHwxfHNlYXJjaHwyfHx3aGl0ZWJvYXJkJTIwc2tldGNoZXMlMjB1c2VyJTIwZmxvd3xlbnwxfHx8fDE3ODE3NzY5NzN8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1542744094-24638eff58bb%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5MzI0MTZ8MHwxfHNlYXJjaHwyfHx3aGl0ZWJvYXJkJTIwc2tldGNoZXMlMjB1c2VyJTIwZmxvd3xlbnwxfHx8fDE3ODE3NzY5NzN8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="A small whiteboard with handwritten input flow sketches" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@campaign_creators?utm_source=137foundry&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Campaign Creators&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=137foundry&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The atomic uniqueness check on submit
&lt;/h2&gt;

&lt;p&gt;Even with a perfect inline check, the server still needs to be authoritative on uniqueness. Two users picking the same username at the same time will both pass the inline check; only one will pass the server transaction.&lt;/p&gt;

&lt;p&gt;The server-side pattern: the actual username reservation happens in a database transaction with a unique constraint. If the insert fails because of a conflict, the server returns a structured error that the form maps back to a friendly message ("That username was just taken. Try another."). The inline check is a UX optimization; the server is the source of truth.&lt;/p&gt;

&lt;p&gt;This is the pattern that makes the form survive the rare race condition without confusing the user. The form respects the inline result for the common case and trusts the server for the conflict case.&lt;/p&gt;
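
&lt;p&gt;As a rough illustration of the server side, here is a minimal Python sketch using the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;users&lt;/code&gt; table, the &lt;code&gt;reserve_username&lt;/code&gt; helper, and the &lt;code&gt;username_taken&lt;/code&gt; error code are assumptions made for the example, not part of any particular framework. The point is only that the unique constraint, not the inline check, decides who wins.&lt;/p&gt;

```python
import sqlite3


def reserve_username(conn: sqlite3.Connection, username: str) -> dict:
    """Try to claim a username; the UNIQUE constraint is the real arbiter."""
    try:
        # "with conn" wraps the insert in a transaction:
        # commit on success, rollback if the insert raises.
        with conn:
            conn.execute("INSERT INTO users (username) VALUES (?)", (username,))
        return {"ok": True}
    except sqlite3.IntegrityError:
        # Structured error (hypothetical shape) that the form maps to a friendly message.
        return {"ok": False, "code": "username_taken"}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT NOT NULL UNIQUE)")

first = reserve_username(conn, "ada")
second = reserve_username(conn, "ada")  # same name, submitted a moment later
print(first, second)
```

&lt;p&gt;The second call fails on the constraint even though an inline check made a moment earlier would have reported the name as available.&lt;/p&gt;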

&lt;h2&gt;
  
  
  How to test it
&lt;/h2&gt;

&lt;p&gt;Three tests that catch most of the failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Throttle the network.&lt;/strong&gt; Use the browser dev tools to add 1 to 2 seconds of latency on the username-check endpoint. Confirm the pending state is visible, submit is disabled, and the eventual result lands correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type fast and stop.&lt;/strong&gt; Type a username quickly and stop. Confirm only one request fires (debounce working). Confirm the result is the result for the final input, not an earlier intermediate value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test a network failure.&lt;/strong&gt; Use the dev tools to block the username-check endpoint. Confirm the form shows the infrastructure message, not a validation error.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add an automated test for the in-flight tracking specifically: fire two checks with a small delay, return the first response after the second response, and confirm the form ignores the stale result. The bug where the wrong result wins because requests resolve out of order is one of the most common in real signup flows and is invisible without a test for it.&lt;/p&gt;
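
&lt;p&gt;The shape of that out-of-order test can be sketched independently of the browser. The Python below uses &lt;code&gt;asyncio&lt;/code&gt; to stand in for the network. The &lt;code&gt;UsernameChecker&lt;/code&gt; class and its fields are hypothetical names for this example. It applies the same request-id guard as the form code and checks that a slow first response cannot overwrite a fast second one.&lt;/p&gt;

```python
import asyncio


class UsernameChecker:
    """Keeps only the result of the most recently started check."""

    def __init__(self):
        self.latest_request_id = 0
        self.shown = None  # what the form would currently display

    async def check(self, value: str, delay: float, available: bool) -> None:
        self.latest_request_id += 1
        request_id = self.latest_request_id
        await asyncio.sleep(delay)  # stands in for the network round trip
        if request_id != self.latest_request_id:
            return  # a newer check has started; drop this stale response
        self.shown = (value, available)


async def main() -> UsernameChecker:
    checker = UsernameChecker()
    # The first check is slow, so its response arrives after the second one.
    await asyncio.gather(
        checker.check("ad", delay=0.05, available=False),
        checker.check("ada", delay=0.01, available=True),
    )
    return checker


checker = asyncio.run(main())
print(checker.shown)
```

&lt;p&gt;A real test would drive the actual form code with mocked responses, but the assertion is the same: the displayed state belongs to the last request started, not the last response received.&lt;/p&gt;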

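&lt;p&gt;For a broader accessibility audit, the &lt;a href="https://wave.webaim.org/" rel="noopener noreferrer"&gt;WAVE&lt;/a&gt; browser extension flags common problems such as missing labels and broken ARIA references; running it on the form during testing surfaces wiring gaps that manual testing misses. It cannot confirm that live-region announcements are actually spoken, so keep one pass with a real screen reader in the test plan.&lt;/p&gt;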

&lt;h2&gt;
  
  
  The longer read on the full validation design
&lt;/h2&gt;

&lt;p&gt;This piece focuses on the async username check because it is the part most teams ship wrong. The full design of inline validation (when to trigger sync checks, what the error message should say, how to handle paste and autofill, when success states earn their place) sits in a &lt;a href="https://137foundry.com/articles/how-to-design-inline-form-validation-that-actually-helps-users" rel="noopener noreferrer"&gt;longer guide on designing inline form validation that actually helps users&lt;/a&gt; on &lt;a href="https://137foundry.com/services/web-development" rel="noopener noreferrer"&gt;137Foundry's web development service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The async case is where validation systems leak. Getting the four patterns above right (debounce, pending state, submit blocked, distinguish network from validation) is the difference between a form that feels smooth and one that feels broken.&lt;/p&gt;

</description>
      <category>ux</category>
      <category>webdev</category>
      <category>a11y</category>
    </item>
    <item>
      <title>How to Handle Late-Arriving Events in Apache Flink With Side Outputs</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Wed, 17 Jun 2026 13:17:45 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-handle-late-arriving-events-in-apache-flink-with-side-outputs-5ene</link>
      <guid>https://dev.to/137foundry/how-to-handle-late-arriving-events-in-apache-flink-with-side-outputs-5ene</guid>
      <description>&lt;p&gt;The biggest single-line bug in a Flink job is usually the watermark. The second biggest is what happens when an event arrives past the watermark. The defaults Flink ships with are sensible for the synthetic examples in the documentation but wrong for nearly every real-world integration.&lt;/p&gt;

&lt;p&gt;This walks through how to wire up Flink's side-output mechanism for late events, and how to feed the side output into a correction process that keeps downstream consumers honest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhz19hm0a9tgtkhctj1r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhz19hm0a9tgtkhctj1r.jpeg" alt="A close-up of fiber optic cables glowing in a server room" width="800" height="1200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Quoc Anh Tran Duong on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What Flink does by default
&lt;/h2&gt;

&lt;p&gt;A Flink keyed window with a tumbling window assigner emits its result when the watermark passes the window's end time. By default, any event that belongs to that window but arrives after the watermark has passed the window's end is dropped silently.&lt;/p&gt;

&lt;p&gt;The behavior is documented in the &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; windowing documentation, and the relevant API surface is &lt;code&gt;allowedLateness&lt;/code&gt; and &lt;code&gt;sideOutputLateData&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The mental model that helps: the watermark is the pipeline's promise about "I have seen everything up to time T." &lt;code&gt;allowedLateness&lt;/code&gt; is a grace period during which the window stays alive and accepts late events. &lt;code&gt;sideOutputLateData&lt;/code&gt; is the escape hatch for events arriving past the grace period.&lt;/p&gt;
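
&lt;p&gt;That mental model can be written down as a small decision function. The Python sketch below is not Flink code. &lt;code&gt;classify&lt;/code&gt; and its labels are names invented for this example, and it assumes the default event-time trigger, where a window's last valid timestamp is its end minus one millisecond.&lt;/p&gt;

```python
def classify(window_end_ms: int, allowed_lateness_ms: int, watermark_ms: int) -> str:
    """What happens to an event that belongs to the window ending at window_end_ms,
    given the watermark at the moment the event arrives."""
    max_timestamp = window_end_ms - 1  # window end is exclusive
    if watermark_ms >= max_timestamp + allowed_lateness_ms:
        return "side_output"  # past the grace period: dropped, or routed to the side output
    if watermark_ms >= max_timestamp:
        return "late_firing"  # inside the grace period: the window fires again with an update
    return "on_time"          # the window has not fired yet


FIFTEEN_MIN = 15 * 60 * 1000
FIVE_MIN = 5 * 60 * 1000

print(classify(FIFTEEN_MIN, FIVE_MIN, watermark_ms=10 * 60 * 1000))  # on_time
print(classify(FIFTEEN_MIN, FIVE_MIN, watermark_ms=16 * 60 * 1000))  # late_firing
print(classify(FIFTEEN_MIN, FIVE_MIN, watermark_ms=21 * 60 * 1000))  # side_output
```

&lt;p&gt;Note that the decision depends on the watermark, not on wall-clock time: a stalled watermark keeps windows open no matter how much real time has passed.&lt;/p&gt;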
&lt;h2&gt;
  
  
  Setting up the side output
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;OutputTag&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;MyEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lateOutputTag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OutputTag&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;MyEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"late-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{};&lt;/span&gt;

&lt;span class="nc"&gt;SingleOutputStreamOperator&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Aggregate&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mainResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getKey&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingEventTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allowedLateness&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sideOutputLateData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lateOutputTag&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyAggregator&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;MyEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lateEvents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mainResult&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSideOutput&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lateOutputTag&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;lateEvents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LateEventSink&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The pattern: a fifteen-minute tumbling window, five-minute grace period, late events captured to a named side output and shipped to a sink. The &lt;code&gt;LateEventSink&lt;/code&gt; should be durable; a Kafka topic or a database table both work.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why allowedLateness alone is not enough
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;allowedLateness&lt;/code&gt; is necessary but not sufficient. It only widens the window's acceptance period to the value you set, and with the default trigger each late event accepted in that period makes the window fire again with an updated result. Events arriving past that period are still dropped unless &lt;code&gt;sideOutputLateData&lt;/code&gt; is also configured.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;allowedLateness&lt;/code&gt; to a very large value is a common workaround that does not scale. Window state has to be kept until the watermark passes the window's end plus the allowed lateness, so a one-hour allowed lateness on a fifteen-minute window means Flink holds state for up to one hour and fifteen minutes of event time for every key and window. With high-cardinality keys that is a lot of state. The recommended pattern is to set &lt;code&gt;allowedLateness&lt;/code&gt; to a tight value (matching the realistic late-event distribution observed in production) and use the side output to absorb the longer-tail late events.&lt;/p&gt;
&lt;h2&gt;
  
  
  What goes in the LateEventSink
&lt;/h2&gt;

&lt;p&gt;The sink should persist enough context to reconstruct the affected window for correction.&lt;/p&gt;

&lt;p&gt;For each late event, write a row with: the original event payload, the event's timestamp, the window key, the window's start time, the window's end time, and the time the late event was received by the pipeline.&lt;/p&gt;

&lt;p&gt;That last field, the receive time, is what your correction job uses to do incremental processing. A correction job that runs every fifteen minutes only needs to look at side-output rows with receive_time newer than its last successful run.&lt;/p&gt;
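
&lt;p&gt;The side output only hands you the event itself, so the window bounds have to be derived from the event timestamp. The Python sketch below shows one way to build such a row for a tumbling window with a zero offset. The field names and the &lt;code&gt;late_event_row&lt;/code&gt; helper are illustrative assumptions, and timestamps are epoch milliseconds.&lt;/p&gt;

```python
import json
import time
from typing import Optional, Tuple


def tumbling_window_bounds(event_time_ms: int, size_ms: int) -> Tuple[int, int]:
    """Start (inclusive) and end (exclusive) of the tumbling window, assuming a zero offset."""
    start = event_time_ms - (event_time_ms % size_ms)
    return start, start + size_ms


def late_event_row(
    key: str,
    event_time_ms: int,
    payload: dict,
    size_ms: int,
    received_at_ms: Optional[int] = None,
) -> dict:
    """Build the row to persist for one late event."""
    start, end = tumbling_window_bounds(event_time_ms, size_ms)
    if received_at_ms is None:
        received_at_ms = int(time.time() * 1000)  # when the pipeline saw the event
    return {
        "payload": json.dumps(payload, sort_keys=True),
        "event_time": event_time_ms,
        "window_key": key,
        "window_start": start,
        "window_end": end,
        "receive_time": received_at_ms,
    }


row = late_event_row(
    "user-42", 1_000_000, {"amount": 5}, size_ms=15 * 60 * 1000, received_at_ms=2_500_000
)
print(row)
```

&lt;p&gt;Storing the window bounds on the row, rather than recomputing them in the correction job, keeps old rows interpretable if the window size changes later.&lt;/p&gt;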
&lt;h2&gt;
  
  
  The correction job
&lt;/h2&gt;

&lt;p&gt;The correction job is a separate Flink job (or batch job) that reads the side output, recomputes the aggregate for affected windows, and writes corrected aggregates back to the target table.&lt;/p&gt;

&lt;p&gt;The mechanics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode
&lt;/span&gt;&lt;span class="n"&gt;last_run_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_last_run_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;run_started_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# fix the cutoff before reading, so rows that land mid-run are picked up by the next run&lt;/span&gt;
&lt;span class="n"&gt;late_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_side_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;last_run_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_started_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;affected_windows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;group_by_window_key_and_window_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;late_events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;affected_windows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# original_events must also include late events folded in by earlier correction runs&lt;/span&gt;
    &lt;span class="n"&gt;original_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_original_events_in_window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;corrected_aggregate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_events&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;write_versioned_fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corrected_aggregate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nf"&gt;write_last_run_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_started_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key piece is the versioned-fact write: instead of mutating the original aggregate row, write a new row with a later &lt;code&gt;emitted_at&lt;/code&gt; timestamp. Downstream consumers query the latest version per window key and window start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotent emit
&lt;/h2&gt;

&lt;p&gt;The correction job has to be idempotent. If it crashes mid-run and restarts, it should not produce duplicate corrections.&lt;/p&gt;

&lt;p&gt;The cleanest way to achieve idempotency is to make the target table's primary key &lt;code&gt;(window_key, window_start, emitted_at)&lt;/code&gt;, with &lt;code&gt;emitted_at&lt;/code&gt; populated from a deterministic source such as the latest receive time among the side-output rows in the correction batch. (The event timestamp would not work here: late events carry timestamps inside the window, which sort before the original emit.) Then re-running the same batch produces the same &lt;code&gt;emitted_at&lt;/code&gt; and the same primary key, which collides on insert and either no-ops or updates.&lt;/p&gt;

&lt;p&gt;For PostgreSQL targets, &lt;code&gt;INSERT ... ON CONFLICT DO UPDATE&lt;/code&gt; handles this in one statement. For BigQuery or Snowflake, &lt;code&gt;MERGE&lt;/code&gt; does the same.&lt;/p&gt;
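
&lt;p&gt;To make the replay behaviour concrete, here is a small Python sketch that uses SQLite's upsert syntax as a stand-in for the PostgreSQL statement (it needs SQLite 3.24 or later). The &lt;code&gt;emit_correction&lt;/code&gt; helper and the choice of the batch's latest receive time as &lt;code&gt;emitted_at&lt;/code&gt; are assumptions for the example. The batch is written twice to simulate a crash and restart.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE window_aggregates (
        window_key   TEXT    NOT NULL,
        window_start INTEGER NOT NULL,
        value        REAL    NOT NULL,
        emitted_at   INTEGER NOT NULL,
        PRIMARY KEY (window_key, window_start, emitted_at)
    )
    """
)


def emit_correction(window_key, window_start, original_total, late_events):
    """late_events: list of (receive_time_ms, amount) pairs for one window."""
    # Deterministic version stamp: the same batch always yields the same emitted_at.
    emitted_at = max(receive_time for receive_time, _ in late_events)
    corrected = original_total + sum(amount for _, amount in late_events)
    with conn:  # one transaction per correction
        conn.execute(
            """
            INSERT INTO window_aggregates (window_key, window_start, value, emitted_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (window_key, window_start, emitted_at)
            DO UPDATE SET value = excluded.value
            """,
            (window_key, window_start, corrected, emitted_at),
        )


batch = [(2_500_000, 2.0), (2_600_000, 3.0)]
emit_correction("user-42", 900_000, 10.0, batch)
emit_correction("user-42", 900_000, 10.0, batch)  # simulated restart replays the batch

rows = conn.execute("SELECT value, emitted_at FROM window_aggregates").fetchall()
print(rows)
```

&lt;p&gt;The replay lands on the same primary key, so the table ends up with one correction row instead of two.&lt;/p&gt;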

&lt;p&gt;If your target is a data lake using Apache Iceberg, the &lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Iceberg specification&lt;/a&gt; supports row-level updates and deletes, which engines such as Spark expose as &lt;code&gt;MERGE INTO&lt;/code&gt;. Tables can apply them in merge-on-read or copy-on-write mode; choose between the two based on your read pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  The view that downstream consumers query
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;window_aggregates_current&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;
      &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;window_aggregates&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the only thing the dashboard, the BI tool, or any downstream consumer should query. They see the latest version per window automatically. The underlying table keeps every version for audit and reconstruction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics to add
&lt;/h2&gt;

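&lt;p&gt;&lt;code&gt;numLateEvents&lt;/code&gt; is a counter you register through Flink's metrics API. Register it on an operator attached to the late-event side output (for example a &lt;code&gt;RichMapFunction&lt;/code&gt; ahead of the sink) rather than inside the &lt;code&gt;ProcessWindowFunction&lt;/code&gt;, because events past the grace period never reach the window function. Flink's built-in &lt;code&gt;numLateRecordsDropped&lt;/code&gt; counts late records the window operator drops, which is a different number once the side output is capturing them. Tag the counter by window-key partition if your job has skewed partitions. The &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; docs cover the metric-collection layer once Flink is emitting these.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;watermarkLag&lt;/code&gt; is the gap between current processing time and current watermark, in milliseconds. Flink defines it as a standard source metric, but it is optional and not every connector reports it; where it is missing, derive it from the built-in &lt;code&gt;currentInputWatermark&lt;/code&gt; or &lt;code&gt;currentOutputWatermark&lt;/code&gt; gauges.&lt;/p&gt;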

&lt;p&gt;&lt;code&gt;correctionsEmittedPerHour&lt;/code&gt; is a counter from the correction job. Spikes here correlate with upstream incidents.&lt;/p&gt;

&lt;p&gt;Wire all three into Prometheus or whatever your stack uses. The point is to have a visible signal when late-event volume changes, so you can retune the grace period or chase down a producer regression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs to acknowledge
&lt;/h2&gt;

&lt;p&gt;The side-output pattern is not free.&lt;/p&gt;

&lt;p&gt;State management costs more because allowedLateness keeps windows open longer. For high-cardinality keys, this can be significant. Watch your state size and adjust if it grows beyond your operational budget.&lt;/p&gt;

&lt;p&gt;The correction job adds latency. Downstream consumers see the "almost final" aggregate when the window closes, and the corrected aggregate fifteen minutes to one hour later. For most business reporting this is acceptable; for sub-second reporting it is not.&lt;/p&gt;

&lt;p&gt;The versioned-fact pattern adds storage cost because every correction is persisted. Old versions are usually compacted on a quarterly schedule.&lt;/p&gt;

&lt;p&gt;These costs are usually worth paying. The alternative is silent drift, which costs more in trust and rework.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like at 137Foundry
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://137foundry.com" rel="noopener noreferrer"&gt;137Foundry&lt;/a&gt; has retrofitted this pattern onto live pipelines for clients integrating Stripe webhooks, Salesforce event streams, and internal CDC streams to Snowflake. The typical retrofit is a one- to two-week project, not a full rewrite.&lt;/p&gt;

&lt;p&gt;The longer article that frames this in context, including the versioned-fact pattern in more detail and the lambda-architecture alternative, is &lt;a href="https://137foundry.com/articles/how-to-handle-late-arriving-data-streaming-integration" rel="noopener noreferrer"&gt;How to Handle Late-Arriving Data in a Streaming Integration Pipeline Without Corrupting Downstream Reports&lt;/a&gt;. The piece above is the Flink-specific recipe; the article is the design rationale.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on testing
&lt;/h2&gt;

&lt;p&gt;The hardest part of validating late-event handling is constructing a realistic test case. Most unit tests for streaming jobs assume events arrive in order. The late-event path only fires when events arrive out of order.&lt;/p&gt;

&lt;p&gt;The cleanest test pattern: build a fixture stream that emits events with deliberately scrambled event times, including some events past the watermark, and assert that the side output contains the late events and the corrected aggregate matches the expected total. Run the test against an embedded Flink mini-cluster.&lt;/p&gt;

&lt;p&gt;Without this test, the late-event handling path is essentially untested in CI, and regressions slip through unnoticed. With it, you can iterate on grace periods and side-output logic without fear.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>integration</category>
      <category>api</category>
    </item>
    <item>
      <title>How to Implement the Versioned-Fact Pattern for Streaming Corrections</title>
      <dc:creator>137Foundry</dc:creator>
      <pubDate>Wed, 17 Jun 2026 13:16:39 +0000</pubDate>
      <link>https://dev.to/137foundry/how-to-implement-the-versioned-fact-pattern-for-streaming-corrections-3d0o</link>
      <guid>https://dev.to/137foundry/how-to-implement-the-versioned-fact-pattern-for-streaming-corrections-3d0o</guid>
      <description>&lt;p&gt;The versioned-fact pattern is the most useful tool for handling corrections from a streaming pipeline without forcing every downstream consumer to learn how the pipeline works internally. This walks through how to implement it on a target table, what the view looks like, how to handle compaction, and what breaks if you skip steps.&lt;/p&gt;

&lt;p&gt;The use case: a streaming pipeline emits a windowed aggregate (sum of transactions per minute, count of events per user per hour, whatever). Late events arrive after the initial aggregate is emitted. You need a way to publish corrected aggregates without breaking dashboards or BI tools that already read the original numbers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggrpco0mr4v9t8ayysvc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggrpco0mr4v9t8ayysvc.jpeg" alt="A close-up of a database screen showing rows with timestamps" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Seraphfim Gallery on &lt;a href="https://www.pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The basic schema
&lt;/h2&gt;

&lt;p&gt;The target table has the window key, the window start and end, the value, an &lt;code&gt;emitted_at&lt;/code&gt; timestamp, and a primary key built from &lt;code&gt;(window_key, window_start, emitted_at)&lt;/code&gt; or some equivalent uniqueness constraint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;window_aggregates&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;window_key&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;window_start&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;window_end&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;emitted_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original emit goes in with the time the window closed. Each correction goes in with a later &lt;code&gt;emitted_at&lt;/code&gt;. The table holds the entire history.&lt;/p&gt;

&lt;h2&gt;
  
  
  The current-view layer
&lt;/h2&gt;

&lt;p&gt;Downstream consumers should never query the raw table. They query a view that selects the latest version per window key and window start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;window_aggregates_current&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;
      &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;window_aggregates&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The view is the contract with downstream. Anything that needs "the current truth" reads from the view. Anything that needs historical reconstruction reads from the table.&lt;/p&gt;
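
&lt;p&gt;To see the table and the view side by side, here is a minimal sketch that uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for the real warehouse. The &lt;code&gt;TEXT&lt;/code&gt; column types, the ISO 8601 timestamp strings, and the sample key &lt;code&gt;orders:eu&lt;/code&gt; are assumptions made for the example, and it needs a SQLite build with window-function support (3.25 or later).&lt;/p&gt;

```python
import sqlite3

# Assumption: SQLite stands in for the production database. Timestamps are
# stored as ISO 8601 text, which sorts correctly as plain strings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE window_aggregates (
  window_key   TEXT NOT NULL,
  window_start TEXT NOT NULL,
  window_end   TEXT NOT NULL,
  value        NUMERIC NOT NULL,
  emitted_at   TEXT NOT NULL,
  PRIMARY KEY (window_key, window_start, emitted_at)
);
CREATE VIEW window_aggregates_current AS
SELECT window_key, window_start, window_end, value
FROM (
  SELECT window_key, window_start, window_end, value,
         ROW_NUMBER() OVER (
           PARTITION BY window_key, window_start
           ORDER BY emitted_at DESC
         ) AS rn
  FROM window_aggregates
) ranked
WHERE rn = 1;
""")

rows = [
    # Original emit for the 00:00 window, written when the window closed.
    ("orders:eu", "2026-06-01T00:00:00Z", "2026-06-01T01:00:00Z", 100, "2026-06-01T01:05:00Z"),
    # Later correction for the same window: a new row, not an update.
    ("orders:eu", "2026-06-01T00:00:00Z", "2026-06-01T01:00:00Z", 104, "2026-06-02T03:00:00Z"),
    # A second window that was never corrected.
    ("orders:eu", "2026-06-01T01:00:00Z", "2026-06-01T02:00:00Z", 57, "2026-06-01T02:05:00Z"),
]
with conn:
    conn.executemany("INSERT INTO window_aggregates VALUES (?, ?, ?, ?, ?)", rows)

# Consumers of "the current truth" read the view.
current = conn.execute(
    "SELECT window_key, window_start, value "
    "FROM window_aggregates_current ORDER BY window_start"
).fetchall()

# Historical reconstruction reads the table, which still has every version.
history_count = conn.execute("SELECT COUNT(*) FROM window_aggregates").fetchone()[0]
```

&lt;p&gt;In this sketch &lt;code&gt;current&lt;/code&gt; holds one row per window, with the corrected value for the first one, while &lt;code&gt;history_count&lt;/code&gt; still counts all three stored versions.&lt;/p&gt;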

&lt;h2&gt;
  
  
  The emit side
&lt;/h2&gt;

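&lt;p&gt;The streaming pipeline emits the original aggregate with &lt;code&gt;emitted_at = window_end + grace_period&lt;/code&gt;. The correction job emits corrected aggregates with &lt;code&gt;emitted_at&lt;/code&gt; set to a timestamp that identifies the correction batch (&lt;code&gt;batch_id&lt;/code&gt; in the code), not to the moment the job happens to run, so a retry writes the same value.&lt;br&gt;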
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;emit_correction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corrected_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_duration&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;window_aggregates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window_end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;window_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;corrected_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emitted_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;batch_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;batch_id&lt;/code&gt; is a deterministic timestamp tied to the correction batch (typically the run's scheduled time from cron rather than the time it actually started, or the timestamp of the latest event in the batch). This makes re-runs idempotent: the same batch produces the same &lt;code&gt;(window_key, window_start, emitted_at)&lt;/code&gt; key, so a retried insert hits the primary key instead of adding a new version.&lt;/p&gt;
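
&lt;p&gt;One way to get a deterministic &lt;code&gt;batch_id&lt;/code&gt; is to floor the current time to the schedule interval. This is a sketch under the assumption that the correction job runs on a fixed interval aligned to the Unix epoch; the helper name &lt;code&gt;scheduled_batch_id&lt;/code&gt; is hypothetical. If your scheduler exposes the scheduled run time directly, that value is a better choice.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def scheduled_batch_id(now, interval):
    """Floor `now` to the start of its schedule interval.

    Hypothetical helper: every attempt that runs inside the same interval
    gets the same batch_id, however late the attempt actually starts.
    """
    completed_intervals = (now - EPOCH) // interval  # timedelta // timedelta is an int
    return EPOCH + completed_intervals * interval


hourly = timedelta(hours=1)

# First attempt starts a few seconds after the hour; the retry starts 41 minutes later.
first_attempt = scheduled_batch_id(
    datetime(2026, 6, 22, 3, 0, 7, tzinfo=timezone.utc), hourly
)
retry = scheduled_batch_id(
    datetime(2026, 6, 22, 3, 41, 52, tzinfo=timezone.utc), hourly
)
```

&lt;p&gt;Both attempts resolve to 03:00 UTC, so the retry writes the same &lt;code&gt;emitted_at&lt;/code&gt; as the first attempt. A retry that slips past the next interval boundary would get a new &lt;code&gt;batch_id&lt;/code&gt;, which is one reason to prefer the scheduler's own timestamp when it is available.&lt;/p&gt;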

&lt;h2&gt;
  
  
  Idempotent inserts
&lt;/h2&gt;

&lt;p&gt;For PostgreSQL, &lt;code&gt;INSERT ... ON CONFLICT DO UPDATE&lt;/code&gt; makes re-runs safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;window_aggregates&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For BigQuery or Snowflake, &lt;code&gt;MERGE&lt;/code&gt; is the equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;window_aggregates&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;corrections_batch&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_start&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emitted_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emitted_at&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_end&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emitted_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Either pattern lets the correction job crash and restart without producing duplicate rows.&lt;/p&gt;
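
&lt;p&gt;The crash-and-restart case can be simulated locally. The sketch below assumes SQLite (3.24 or later, which accepts the same &lt;code&gt;ON CONFLICT ... DO UPDATE&lt;/code&gt; clause) as a stand-in for PostgreSQL; &lt;code&gt;run_correction_batch&lt;/code&gt; and the sample rows are hypothetical.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE window_aggregates (
  window_key   TEXT NOT NULL,
  window_start TEXT NOT NULL,
  window_end   TEXT NOT NULL,
  value        NUMERIC NOT NULL,
  emitted_at   TEXT NOT NULL,
  PRIMARY KEY (window_key, window_start, emitted_at)
)
""")

UPSERT = """
INSERT INTO window_aggregates (window_key, window_start, window_end, value, emitted_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (window_key, window_start, emitted_at)
DO UPDATE SET value = excluded.value, window_end = excluded.window_end
"""


def run_correction_batch(conn, rows):
    """Hypothetical correction job: write one batch in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(UPSERT, rows)


# Every row in the batch carries the same deterministic emitted_at (the batch_id).
batch = [
    ("orders:eu", "2026-06-01T00:00:00Z", "2026-06-01T01:00:00Z", 104, "2026-06-02T03:00:00Z"),
    ("orders:us", "2026-06-01T00:00:00Z", "2026-06-01T01:00:00Z", 311, "2026-06-02T03:00:00Z"),
]

run_correction_batch(conn, batch)
run_correction_batch(conn, batch)  # simulated restart of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM window_aggregates").fetchone()[0]
```

&lt;p&gt;After both runs &lt;code&gt;row_count&lt;/code&gt; is still the size of the batch: the second run takes the conflict path and rewrites the same two rows.&lt;/p&gt;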

&lt;h2&gt;
  
  
  Compaction strategy
&lt;/h2&gt;

&lt;p&gt;The versioned-fact table grows linearly with the number of corrections per window. After enough time, the table is mostly historical versions and the view is doing a lot of work to find the latest per window.&lt;/p&gt;

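&lt;p&gt;The compaction strategy: on a quarterly schedule, run a job that copies the latest version per window (the same rows the view selects, including &lt;code&gt;emitted_at&lt;/code&gt;) into a new table, deletes everything from the original, and writes the latest versions back. Where the database supports it, run the delete and the write-back in one transaction so readers never see an empty table. This collapses the table to "latest only" while preserving the schema.&lt;/p&gt;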

&lt;p&gt;For pipelines that need audit history, keep the pre-compaction table as a snapshot in cheap storage (object store, archive table). For pipelines that only care about the current value, drop it after compaction.&lt;/p&gt;
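
&lt;p&gt;A compaction pass might look like the following sketch. It assumes SQLite as a stand-in database and a table small enough to hold the latest versions in memory; a production job would stage them in a table and snapshot the old data first. The &lt;code&gt;compact&lt;/code&gt; function and the sample rows are hypothetical.&lt;/p&gt;

```python
import sqlite3

# Latest version per window, including emitted_at so the rows can be written back.
LATEST_PER_WINDOW = """
SELECT a.window_key, a.window_start, a.window_end, a.value, a.emitted_at
FROM window_aggregates a
JOIN (
  SELECT window_key, window_start, MAX(emitted_at) AS emitted_at
  FROM window_aggregates
  GROUP BY window_key, window_start
) m
  ON a.window_key = m.window_key
 AND a.window_start = m.window_start
 AND a.emitted_at = m.emitted_at
"""


def compact(conn):
    """Collapse window_aggregates to one row per window; return rows kept."""
    latest = conn.execute(LATEST_PER_WINDOW).fetchall()
    with conn:  # the DELETE and the write-back commit together or not at all
        conn.execute("DELETE FROM window_aggregates")
        conn.executemany(
            "INSERT INTO window_aggregates "
            "(window_key, window_start, window_end, value, emitted_at) "
            "VALUES (?, ?, ?, ?, ?)",
            latest,
        )
    return len(latest)


conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE window_aggregates (
  window_key   TEXT NOT NULL,
  window_start TEXT NOT NULL,
  window_end   TEXT NOT NULL,
  value        NUMERIC NOT NULL,
  emitted_at   TEXT NOT NULL,
  PRIMARY KEY (window_key, window_start, emitted_at)
)
""")
with conn:
    conn.executemany(
        "INSERT INTO window_aggregates VALUES (?, ?, ?, ?, ?)",
        [
            ("orders:eu", "2026-06-01T00:00:00Z", "2026-06-01T01:00:00Z", 100, "2026-06-01T01:05:00Z"),
            ("orders:eu", "2026-06-01T00:00:00Z", "2026-06-01T01:00:00Z", 104, "2026-06-02T03:00:00Z"),
            ("orders:eu", "2026-06-01T01:00:00Z", "2026-06-01T02:00:00Z", 57, "2026-06-01T02:05:00Z"),
        ],
    )

kept = compact(conn)
remaining = conn.execute(
    "SELECT window_start, value, emitted_at FROM window_aggregates ORDER BY window_start"
).fetchall()
```

&lt;p&gt;The superseded original for the first window is gone after the pass, and the surviving rows keep their original &lt;code&gt;emitted_at&lt;/code&gt;, so later corrections still sort after them.&lt;/p&gt;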

&lt;h2&gt;
  
  
  When the view stops being enough
&lt;/h2&gt;

&lt;p&gt;For very high-cardinality tables, the window function in the view becomes a hot spot. Two mitigations:&lt;/p&gt;

&lt;p&gt;The first is to materialize the view as a table that the correction job updates, instead of computing the latest version on every read. The downside is that the materialization layer adds write latency and complexity. The upside is that read cost no longer grows with the number of stored versions.&lt;/p&gt;

&lt;p&gt;The second is to use a database feature like PostgreSQL's &lt;code&gt;DISTINCT ON&lt;/code&gt;, which the planner can optimize better than &lt;code&gt;ROW_NUMBER()&lt;/code&gt; in some cases. The query becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;window_aggregates&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emitted_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



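&lt;p&gt;For tables in the hundreds of millions of rows, the &lt;code&gt;DISTINCT ON&lt;/code&gt; pattern is often two to three times faster than the row-number version, depending on indexes. Compare the two plans with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on your own data before switching.&lt;/p&gt;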

&lt;h2&gt;
  
  
  Indexes
&lt;/h2&gt;

&lt;p&gt;The primary key alone is usually enough for the OLTP write path. For the read path through the view, two indexes help:&lt;/p&gt;

&lt;p&gt;A composite index on &lt;code&gt;(window_key, window_start, emitted_at DESC)&lt;/code&gt; makes the latest-per-window lookup almost free.&lt;/p&gt;

&lt;p&gt;A separate index on &lt;code&gt;emitted_at&lt;/code&gt; makes the correction job's "find corrections newer than my last run" query fast without scanning the whole table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validation
&lt;/h2&gt;

&lt;p&gt;After the pattern is in place, validate two things: the first every day, the second on every compaction run.&lt;/p&gt;

&lt;p&gt;The reconciliation against the source system should show near-zero variance after the correction window has closed (typically the day after). If the variance stays above your tolerance after corrections have had time to land, the late-event handling is incomplete: you have a producer regression or a side-output bug.&lt;/p&gt;

&lt;p&gt;The compaction job's row counts should match the view's row counts at the moment of compaction. A mismatch means the table is missing windows or has phantom rows. Either is a bug in the emit logic.&lt;/p&gt;
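
&lt;p&gt;The daily reconciliation can start as a plain comparison of per-window values. This is a sketch only: the function name &lt;code&gt;find_variances&lt;/code&gt;, the dictionary shape (keyed by window key and window start), and the tolerance are assumptions, and loading the two sides from the view and from the source system is left out.&lt;/p&gt;

```python
import math


def find_variances(pipeline_values, source_values, rel_tol=1e-6):
    """Return the windows where the pipeline disagrees with the source.

    Both arguments map (window_key, window_start) to an aggregate value.
    A window that exists on only one side is reported as a variance too.
    """
    mismatched = []
    for window in sorted(set(pipeline_values) | set(source_values)):
        ours = pipeline_values.get(window)
        theirs = source_values.get(window)
        if ours is None or theirs is None:
            mismatched.append(window)  # missing window or phantom row
        elif not math.isclose(ours, theirs, rel_tol=rel_tol):
            mismatched.append(window)  # value drift beyond the tolerance
    return mismatched


pipeline_values = {
    ("orders:eu", "2026-06-01T00:00:00Z"): 104.0,
    ("orders:eu", "2026-06-01T01:00:00Z"): 57.0,
}
source_values = {
    ("orders:eu", "2026-06-01T00:00:00Z"): 104.0,
    ("orders:eu", "2026-06-01T01:00:00Z"): 58.0,  # a late event the pipeline never corrected
    ("orders:eu", "2026-06-01T02:00:00Z"): 5.0,   # a window the pipeline never emitted
}

variances = find_variances(pipeline_values, source_values)
```

&lt;p&gt;An empty result after the correction window has closed is the healthy state. In this example the 01:00 and 02:00 windows are flagged.&lt;/p&gt;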

&lt;h2&gt;
  
  
  What to avoid
&lt;/h2&gt;

&lt;p&gt;Mutating the aggregate row in place. This makes the dashboard correct but destroys the audit trail and breaks idempotency for the correction job.&lt;/p&gt;

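&lt;p&gt;Using the natural key alone (window key and start time, without &lt;code&gt;emitted_at&lt;/code&gt;) as the primary key. This forces UPSERT semantics on every emit, so each correction overwrites the previous version, which is harder to debug than a plain INSERT whose conflict handling only fires on re-runs.&lt;/p&gt;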

&lt;p&gt;Treating the latest-version view as a transient construct. It is the contract with downstream consumers; document it, version it, and treat changes to it like API changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this fits into the larger pattern
&lt;/h2&gt;

&lt;p&gt;This pattern is one piece of a larger late-data-handling design. The other pieces are watermarks, grace periods, side outputs, and reconciliation against the source system. All five together produce a streaming pipeline that downstream consumers can rely on.&lt;/p&gt;

&lt;p&gt;The end-to-end design is covered in &lt;a href="https://137foundry.com/articles/how-to-handle-late-arriving-data-streaming-integration" rel="noopener noreferrer"&gt;How to Handle Late-Arriving Data in a Streaming Integration Pipeline Without Corrupting Downstream Reports&lt;/a&gt;. The versioned-fact pattern is the last mile before the downstream dashboard; the article walks through everything upstream of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to ask for help
&lt;/h2&gt;

&lt;p&gt;If you are retrofitting this pattern onto an existing pipeline with downstream consumers already in production, the migration plan matters more than the implementation. You usually want to deploy the view alongside the old table, gradually move downstream consumers to the view, validate against historical reports, then turn on the correction job. Skipping the gradual migration is how you accidentally republish incorrect numbers to people who were not expecting them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://137foundry.com/services/data-integration" rel="noopener noreferrer"&gt;137Foundry's data integration service&lt;/a&gt; covers this kind of retrofit work, including the migration playbook and the reconciliation reporting that proves the change did not regress anything. The &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; docs and the &lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Iceberg&lt;/a&gt; spec are the canonical references for the underlying engine and storage-layer details. The &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; docs cover the metrics layer that makes the whole thing operationally observable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;The versioned-fact pattern is more bookkeeping than algorithm. The algorithmic insight is "do not mutate, append with a version." Everything else is execution detail. The pattern's value is that it lets the streaming pipeline correct itself without breaking the contract with every dashboard, BI tool, and report that already consumes its output. That contract is what makes the pipeline trustworthy.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>integration</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
