<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sven Schuchardt</title>
    <description>The latest articles on DEV Community by Sven Schuchardt (@sven_schuchardt_0aa51663a).</description>
    <link>https://dev.to/sven_schuchardt_0aa51663a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3407996%2Ffaba4f7a-c083-4f98-9f6c-e8be76fc17b2.jpg</url>
      <title>DEV Community: Sven Schuchardt</title>
      <link>https://dev.to/sven_schuchardt_0aa51663a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sven_schuchardt_0aa51663a"/>
    <language>en</language>
    <item>
      <title>Every AI Agent Failure I've Debugged in 2026 was an Idempotency Problem</title>
      <dc:creator>Sven Schuchardt</dc:creator>
      <pubDate>Mon, 11 May 2026 07:06:03 +0000</pubDate>
      <link>https://dev.to/sven_schuchardt_0aa51663a/every-ai-agent-failure-ive-debugged-in-2026-was-an-idempotency-problem-5dl0</link>
      <guid>https://dev.to/sven_schuchardt_0aa51663a/every-ai-agent-failure-ive-debugged-in-2026-was-an-idempotency-problem-5dl0</guid>
      <description>&lt;p&gt;Five real production incidents, the 25-year-old constraint that explains them all, and the three-layer architectural fix every agent team should have shipped last quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The failure pattern looks different every time, and it is the same pattern every time.&lt;/p&gt;

&lt;p&gt;A customer gets the same onboarding email fourteen times in nine minutes. A B2B account is charged twice for one subscription renewal. An order shows up in the OMS as three orders. A support ticket is created, escalated, re-created, re-escalated, and then closed as duplicate by a human who eventually has to write the apology email.&lt;/p&gt;

&lt;p&gt;Every one of these incidents in the last six months has landed on my desk with the same opening line in the post-mortem: &lt;em&gt;"the agent acted weirdly."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The agent did not act weirdly. The agent did exactly what the framework told it to do — retry on timeout, retry on 5xx, retry on ambiguous tool response — against a tool call that was never designed to be retried. That is not an AI failure. That is a 25-year-old distributed-systems failure wearing a new costume.&lt;/p&gt;

&lt;p&gt;The principle the agent ecosystem is currently rediscovering is &lt;strong&gt;idempotency&lt;/strong&gt;: an operation is idempotent if applying it once and applying it more than once produce the same result. For HTTP methods the property was made normative in &lt;a href="https://datatracker.ietf.org/doc/html/rfc2616#section-9.1.2" rel="noopener noreferrer"&gt;RFC 2616 §9.1.2&lt;/a&gt; (1999) and restated in &lt;a href="https://datatracker.ietf.org/doc/html/rfc7231#section-4.2.2" rel="noopener noreferrer"&gt;RFC 7231 §4.2.2&lt;/a&gt;; Roy Fielding, a co-author of both, described the architectural style behind them in chapter 5 of his &lt;a href="https://ics.uci.edu" rel="noopener noreferrer"&gt;2000 REST dissertation&lt;/a&gt;. The folklore is older: RPC implementers were debating it in the 1980s.&lt;/p&gt;

&lt;p&gt;By 2010, idempotency was a non-negotiable in any serious payments, messaging, or inventory system. The agent frameworks of 2024–2026 ship with retry semantics at the tool-call layer. The tools they call were written by humans, for humans, on the assumption that a human would not press the button fourteen times in nine minutes. The collision between those two assumptions is where the production damage lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nothing really new
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tool calls now appear in &lt;strong&gt;21.9% of agent traces, up from 0.5% in 2023&lt;/strong&gt; — a 44× expansion of the retry surface in a single year (&lt;a href="https://blog.langchain.com/langchain-state-of-ai-2024/" rel="noopener noreferrer"&gt;LangChain State of AI 2024&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Gartner forecasts &lt;strong&gt;40% of enterprise apps will ship task-specific agents by end of 2026&lt;/strong&gt;, and &lt;strong&gt;40%+ of agentic AI projects will be cancelled by end of 2027&lt;/strong&gt; — driven by reliability and governance gaps (&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt;, &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Every major delivery substrate the agent stack inherits is &lt;strong&gt;at-least-once&lt;/strong&gt;: Stripe retries webhooks for 3 days, AWS SQS standard queues document duplicate delivery as the contract, HTTP retries are normative.&lt;/li&gt;
&lt;li&gt;The fix is unchanged from 2017: every state-mutating tool requires a &lt;strong&gt;deterministic idempotency key + a deduplication store at the boundary&lt;/strong&gt;. Frameworks do not enforce this by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why this is happening now: the retry surface just got 44× bigger
&lt;/h2&gt;

&lt;p&gt;LangChain's 2024 telemetry shows tool calls jumping from 0.5% of agent traces in 2023 to 21.9% in 2024, with average steps per trace growing from 2.8 to 7.7. Each step is a potential non-idempotent side effect.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Tool calls (% of traces)&lt;/th&gt;
&lt;th&gt;Avg steps per trace&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;0.5%&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;21.9%&lt;/td&gt;
&lt;td&gt;7.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://blog.langchain.com/langchain-state-of-ai-2024/" rel="noopener noreferrer"&gt;LangChain State of AI 2024&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What is new is not retry behaviour at the network layer. What is new is the &lt;strong&gt;volume of state-mutating calls being generated by a non-deterministic upstream component&lt;/strong&gt;. An LLM that produces "approximately the right tool call" 95% of the time also produces "almost-but-not-quite the same tool call" the other 5% — and 5% of millions of calls a day is enough to expose every non-idempotent operation in the entire downstream stack.&lt;/p&gt;

&lt;p&gt;51% of survey respondents in the &lt;a href="https://www.langchain.com/stateofaiagents" rel="noopener noreferrer"&gt;LangChain State of AI Agents Report&lt;/a&gt; run agents in production. 89% of orgs in the &lt;a href="https://www.langchain.com/state-of-agent-engineering" rel="noopener noreferrer"&gt;State of Agent Engineering 2025&lt;/a&gt; report have observability in place. Instrumentation is catching up. The contracts at the tool boundary are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five production failures, all the same shape
&lt;/h2&gt;

&lt;p&gt;Real incidents from the last six months.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The fourteen-email onboarding
&lt;/h3&gt;

&lt;p&gt;A B2C signup agent calls a &lt;code&gt;send_welcome_email&lt;/code&gt; tool wrapping an internal API. The internal API is &lt;em&gt;eventually consistent&lt;/em&gt;: it returns 202 Accepted before enqueue, and under load the socket occasionally times out &lt;em&gt;after&lt;/em&gt; the message has already been enqueued. Framework default: retry on timeout up to 3× with backoff. The tool: no idempotency key, no de-duplication.&lt;/p&gt;

&lt;p&gt;Each trigger could send up to four times (one attempt plus three retries). Across the original signup and four sequential retriggers from a downstream "incomplete onboarding" agent, that added up to fourteen emails to one mailbox. One enterprise customer publicly tweeted about it. Two hours of incident response. A week of churn-control outreach.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The double subscription charge
&lt;/h3&gt;

&lt;p&gt;A self-serve renewal agent handled decline-and-retry on subscription billing. The Stripe call was idempotent — Stripe has supported &lt;a href="https://docs.stripe.com/api/idempotent_requests" rel="noopener noreferrer"&gt;&lt;code&gt;Idempotency-Key&lt;/code&gt; headers&lt;/a&gt; for years, with a 24-hour deduplication window. The internal entitlement-grant call after the charge was &lt;em&gt;not&lt;/em&gt; idempotent.&lt;/p&gt;

&lt;p&gt;When Stripe returned a network-layer error after the card was already charged, the agent retried the &lt;strong&gt;whole sequence&lt;/strong&gt;. Because the framework retried at the agent step rather than the tool step, the re-run did not reuse the original idempotency key, so it produced a second successful Stripe charge and a second entitlement grant.&lt;/p&gt;

&lt;p&gt;Lesson: Stripe's idempotency layer was correct, and the system still produced a duplicate charge, because the retry was orchestrated one level above where the idempotency key lived. &lt;strong&gt;Idempotency is not a property of one call. It is a property of every layer in the call chain.&lt;/strong&gt;&lt;/p&gt;
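
&lt;p&gt;A minimal sketch of that lesson, under stated assumptions: &lt;code&gt;FakePaymentGateway&lt;/code&gt;, &lt;code&gt;run_step_with_retry&lt;/code&gt;, &lt;code&gt;renew_unsafe&lt;/code&gt; and &lt;code&gt;renew_safe&lt;/code&gt; are hypothetical names, and the gateway is an in-memory stand-in rather than any real payment API. It only illustrates why the key has to be fixed above the layer that retries.&lt;/p&gt;

```python
import uuid


class FakePaymentGateway:
    """Hypothetical stand-in for a payment API that dedupes on an idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> charge record
        self.fail_next_response = True

    def charge(self, customer_id, amount_cents, idempotency_key):
        # A repeated key replays the stored charge instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {"customer": customer_id, "amount": amount_cents}
        if self.fail_next_response:
            # Simulate the incident: the charge was recorded but the response was lost.
            self.fail_next_response = False
            raise TimeoutError("response lost after charge was recorded")
        return self.charges[idempotency_key]


def run_step_with_retry(step, attempts=3):
    """Step-level retry, the layer where the retry lived in the incident."""
    last_error = None
    for _ in range(attempts):
        try:
            return step()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


def renew_unsafe(gateway, customer_id, amount_cents):
    # Key minted INSIDE the retried step: every re-run looks like a new charge.
    def step():
        return gateway.charge(customer_id, amount_cents, idempotency_key=str(uuid.uuid4()))
    return run_step_with_retry(step)


def renew_safe(gateway, customer_id, amount_cents, renewal_id):
    # Key fixed once, OUTSIDE the retried step, from a stable renewal identifier.
    key = f"renewal:{customer_id}:{renewal_id}"

    def step():
        return gateway.charge(customer_id, amount_cents, idempotency_key=key)
    return run_step_with_retry(step)


unsafe_gateway = FakePaymentGateway()
renew_unsafe(unsafe_gateway, "cus_1", 4900)

safe_gateway = FakePaymentGateway()
renew_safe(safe_gateway, "cus_1", 4900, renewal_id="2026-05")

print(len(unsafe_gateway.charges), len(safe_gateway.charges))
```

&lt;p&gt;Both variants call the same idempotent gateway through the same retry loop. The only difference is where the key is created, and that alone decides whether the retry records one charge or two.&lt;/p&gt;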

&lt;h3&gt;
  
  
  3. The ghost order
&lt;/h3&gt;

&lt;p&gt;An order-capture agent calls an OMS &lt;code&gt;create_order&lt;/code&gt; tool. The OMS expects a client-supplied order ID and is in fact idempotent on it — but the agent, on retry, generated a &lt;em&gt;new&lt;/em&gt; UUID for each attempt because the prompt said "generate an order ID" rather than "reuse the order ID across retries."&lt;/p&gt;

&lt;p&gt;Every individual layer was idempotent-aware. The integration was not. The non-determinism of the LLM produced new IDs on retry, defeating the very property the OMS was designed to provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The webhook fan-out
&lt;/h3&gt;

&lt;p&gt;A vendor's webhook delivery is at-least-once — they retry on any non-2xx response. Stripe's &lt;a href="https://stripe.com/docs/webhooks/best-practices" rel="noopener noreferrer"&gt;published retry schedule&lt;/a&gt; extends across immediate, 5-min, 30-min, 2-hr, 5-hr, 10-hr, then every-12-hour windows for up to 3 days. Duplicate delivery is the documented expectation, not the edge case.&lt;/p&gt;

&lt;p&gt;The receiving agent's &lt;code&gt;adjust_inventory&lt;/code&gt; tool decremented stock. A debug field in the response triggered a Pydantic error in the framework's parser, returning a 500 to the source. The vendor retried. The framework parsed correctly the second time. Inventory decremented twice. Three SKUs oversold. Wrong stock counts pushed to the e-commerce frontend before the on-call SRE caught it.&lt;/p&gt;

&lt;p&gt;The fix was not in the agent. The fix was in the inventory tool, which should have accepted an idempotency key from the webhook source and acknowledged duplicates with a 200 OK without re-executing.&lt;/p&gt;
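
&lt;p&gt;One way such a tool could look, as a sketch only: &lt;code&gt;InventoryTool&lt;/code&gt; and its in-memory &lt;code&gt;applied_events&lt;/code&gt; set are assumptions for illustration, and &lt;code&gt;event_id&lt;/code&gt; stands for whatever stable identifier the webhook source attaches to each event. In a real system the event record and the stock update would presumably be written in one database transaction.&lt;/p&gt;

```python
class InventoryTool:
    """Hypothetical in-memory inventory tool that dedupes on the webhook event ID."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.applied_events = set()

    def adjust_inventory(self, event_id, sku, delta):
        if event_id in self.applied_events:
            # Duplicate delivery: acknowledge it, but do not touch the stock again.
            return {"status": 200, "duplicate": True, "stock": self.stock[sku]}
        # Record the event and apply the change together, so a later failure
        # in response handling cannot cause the decrement to run a second time.
        self.applied_events.add(event_id)
        self.stock[sku] += delta
        return {"status": 200, "duplicate": False, "stock": self.stock[sku]}


tool = InventoryTool({"SKU-1": 10})
first = tool.adjust_inventory("evt_123", "SKU-1", -1)
redelivered = tool.adjust_inventory("evt_123", "SKU-1", -1)  # vendor retry of the same event

print(first["stock"], redelivered["stock"], redelivered["duplicate"])
```

&lt;p&gt;With this shape, a parser error that turns the first response into a 500 still triggers a vendor retry, but the retry is answered from the recorded event instead of decrementing stock again.&lt;/p&gt;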

&lt;h3&gt;
  
  
  5. The duplicate Jira
&lt;/h3&gt;

&lt;p&gt;An incident-triage agent ingests a support email and creates a Jira ticket. Framework response timeout: 8 seconds. Jira instance under load: regularly 12 seconds. Agent retried. Jira created a second ticket. The triage agent's own dedup pass merged them — but the merge call timed out, retried, and produced a third ticket. By end of morning: six Jira tickets, two Slack threads, one customer email.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern, stated clearly
&lt;/h2&gt;

&lt;p&gt;In every case, the surface narrative was the agent's behaviour. The actual cause was an operation that was &lt;strong&gt;non-idempotent in the path of an at-least-once delivery semantic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Non-idempotent operation. At-least-once delivery semantic. If those two facts are true at the same boundary, you do not have an AI failure. You have a distributed-systems failure that AI made cheaper to trigger.&lt;/p&gt;

&lt;p&gt;The agent did not invent the retry. The agent did not invent the network timeout. The agent inherited an at-least-once world from every layer beneath it — the LLM provider's retry on rate-limit, the framework's retry on tool error, the SDK's retry on socket close, the webhook source's retry policy, the queue's redelivery contract — and pointed it at tools designed for a single human caller pressing a single button once.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The reason this pattern is hard to see in post-mortem is that &lt;strong&gt;no single component is "wrong."&lt;/strong&gt; The framework's retry policy is correct. The webhook source's retry policy is correct. The downstream tool's response-on-error is technically correct. The failure is emergent — it lives at the seams between layers, where each layer assumes the layer beneath it is idempotent and does not check.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  At-least-once is inescapable
&lt;/h2&gt;

&lt;p&gt;Every major delivery substrate the agent ecosystem inherits is at-least-once. This is not a pessimistic framing. It is the documented behaviour:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues-at-least-once-delivery.html" rel="noopener noreferrer"&gt;AWS SQS standard queues&lt;/a&gt; document at-least-once delivery as a guarantee.&lt;/li&gt;
&lt;li&gt;Apache Kafka defaults to at-least-once; exactly-once is opt-in via transactional config.&lt;/li&gt;
&lt;li&gt;HTTP retries are normative: RFC 7231 §4.2.2 specifies which methods are idempotent and may therefore be retried automatically.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stripe.com/docs/webhooks/best-practices" rel="noopener noreferrer"&gt;Stripe's webhook docs&lt;/a&gt; explicitly warn: &lt;em&gt;"your endpoint should be idempotent"&lt;/em&gt; — duplicates across a 3-day window are expected on the happy path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exactly-once delivery in asynchronous distributed systems with failures is impossible by formal proof — established in the 1980s, rediscovered every time a new generation tries to design around it. What you can do is build idempotent receivers and let the substrate retry as much as it wants without producing duplicate side effects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architectural fix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Treat every state-mutating tool call as a network call to an at-least-once delivery channel.&lt;/strong&gt; That is the only assumption that is safe.&lt;/p&gt;

&lt;p&gt;Three layers, in order of importance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1 — every state-mutating tool requires an idempotency key
&lt;/h3&gt;

&lt;p&gt;Not optional. Not "if the upstream service supports it." The tool's own contract enforces it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreateOrderInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;line_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LineItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_mutating&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CreateOrderInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# framework rejects the call before reaching the OMS
&lt;/span&gt;    &lt;span class="c1"&gt;# if idempotency_key is missing or malformed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;oms_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client_order_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;line_items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the agent calls &lt;code&gt;create_order(...)&lt;/code&gt; without a key, the call fails fast at the tool boundary with a 400 — before reaching the OMS. The framework's tool-call validator catches this in development and prevents the integration from shipping in the first place.&lt;/p&gt;
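
&lt;p&gt;The same fail-fast contract can be sketched without any framework. The decorator &lt;code&gt;state_mutating&lt;/code&gt;, the exception &lt;code&gt;MissingIdempotencyKey&lt;/code&gt; and the &lt;code&gt;orders&lt;/code&gt; dictionary below are hypothetical stand-ins, not the API of a particular agent library; the 16 to 128 character bounds mirror the field constraints in the snippet above.&lt;/p&gt;

```python
import functools


class MissingIdempotencyKey(ValueError):
    """Raised before the wrapped tool runs, analogous to an HTTP 400."""


def state_mutating(fn):
    """Hypothetical decorator: refuse to run a tool without a well-formed key."""

    @functools.wraps(fn)
    def wrapper(*, idempotency_key=None, **kwargs):
        if not isinstance(idempotency_key, str) or len(idempotency_key) not in range(16, 129):
            raise MissingIdempotencyKey(
                f"{fn.__name__} requires a 16-128 character idempotency_key"
            )
        return fn(idempotency_key=idempotency_key, **kwargs)

    return wrapper


orders = {}  # stand-in for the OMS, keyed by client order ID


@state_mutating
def create_order(*, idempotency_key, customer_id, line_items):
    # setdefault makes the write itself idempotent on the key.
    return orders.setdefault(
        idempotency_key, {"customer_id": customer_id, "line_items": line_items}
    )


try:
    create_order(customer_id="cust-7", line_items=["AB-1"])
    rejected = False
except MissingIdempotencyKey:
    rejected = True

accepted = create_order(
    idempotency_key="order-req-42-attempt-key",
    customer_id="cust-7",
    line_items=["AB-1"],
)
print(rejected, len(orders))
```

&lt;p&gt;The call without a key never reaches the order store, which is the property the tool contract is meant to guarantee.&lt;/p&gt;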

&lt;h3&gt;
  
  
  Layer 2 — the idempotency key has a defined synthesis rule
&lt;/h3&gt;

&lt;p&gt;The agent does not "generate" the key on retry. The key is &lt;strong&gt;derived&lt;/strong&gt; from the inputs of the original call — a hash of the caller, the operation, and the semantically-meaningful inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synthesize_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caller_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;caller_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On retry, the same inputs produce the same key. The key is stable across retries because it is &lt;em&gt;derived&lt;/em&gt;, not invented. This rule directly addresses failure case 3 (the ghost order): the LLM cannot accidentally mint a fresh ID on retry if the ID is a deterministic hash of the inputs.&lt;/p&gt;
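
&lt;p&gt;A derived key is only as stable as the inputs being hashed. The sketch below assumes a hypothetical &lt;code&gt;normalize_order_inputs&lt;/code&gt; helper and a hypothetical &lt;code&gt;request_id&lt;/code&gt; field identifying the triggering request; which fields count as semantically meaningful is a per-tool decision, so treat the choices here as illustrative. It shows that hashing the raw tool call breaks as soon as the model re-emits it with cosmetic differences, while hashing a normalized form does not.&lt;/p&gt;

```python
import hashlib
import json


def synthesize_key(tool_name, caller_id, inputs):
    # Same derivation as the snippet above: canonical JSON, then SHA-256.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    payload = f"{tool_name}|{caller_id}|{canonical}".encode()
    return hashlib.sha256(payload).hexdigest()


def normalize_order_inputs(raw):
    """Hypothetical normalizer: keep only the fields that define the order."""
    return {
        "request_id": raw["request_id"],  # assumed stable ID of the triggering request
        "customer_id": raw["customer_id"].strip(),
        "line_items": sorted(
            (item["sku"].strip().upper(), int(item["qty"])) for item in raw["line_items"]
        ),
    }


attempt_1 = {
    "request_id": "req-42",
    "customer_id": "cust-7",
    "line_items": [{"sku": "ab-1", "qty": 2}, {"sku": "zz-9", "qty": 1}],
    "note": "please gift wrap",
}
# A retry where the model re-emitted the call with cosmetic differences.
attempt_2 = {
    "line_items": [{"sku": "ZZ-9", "qty": "1"}, {"sku": "AB-1 ", "qty": 2}],
    "customer_id": "cust-7",
    "request_id": "req-42",
    "note": "Please gift-wrap.",
}

key_1 = synthesize_key("create_order", "agent-a", normalize_order_inputs(attempt_1))
key_2 = synthesize_key("create_order", "agent-a", normalize_order_inputs(attempt_2))
raw_1 = synthesize_key("create_order", "agent-a", attempt_1)
raw_2 = synthesize_key("create_order", "agent-a", attempt_2)

print(key_1 == key_2, raw_1 == raw_2)
```

&lt;p&gt;Including a request-scoped identifier in the hash also keeps two genuinely separate orders with identical contents from collapsing into one key.&lt;/p&gt;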

&lt;h3&gt;
  
  
  Layer 3 — deduplication store at the tool boundary
&lt;/h3&gt;

&lt;p&gt;A cheap key-value store keyed by &lt;code&gt;(tool, idempotency_key)&lt;/code&gt; returns the cached response on duplicate calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_dedup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;86_400&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dedup_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;  &lt;span class="c1"&gt;# replay original response, no side effect
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;dedup_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TTL is generous — Stripe's &lt;a href="https://docs.stripe.com/api/idempotent_requests" rel="noopener noreferrer"&gt;24-hour window&lt;/a&gt; is the canonical reference; 7 days is fine for high-cost operations like billing or order creation. Storage is cheap. A second customer charge is not.&lt;/p&gt;
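
&lt;p&gt;One caveat with a read-then-write store: two retries that arrive at the same moment can both miss the cache and both execute. A possible refinement, sketched here with a hypothetical in-memory &lt;code&gt;InMemoryDedupStore&lt;/code&gt;, is to claim the key atomically before running the tool. In a shared deployment the same role could be played by something like a Redis &lt;code&gt;SET&lt;/code&gt; with the &lt;code&gt;NX&lt;/code&gt; and &lt;code&gt;EX&lt;/code&gt; options; that mapping is an assumption, not something the original setup prescribes.&lt;/p&gt;

```python
import threading
import time


class InMemoryDedupStore:
    """Hypothetical stand-in for a shared key-value store with atomic set-if-absent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> (expires_at, state, result)

    def reserve(self, key, ttl_seconds):
        """Atomically claim the key; return True only for the first caller."""
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return False
            self._entries[key] = (now + ttl_seconds, "in_progress", None)
            return True

    def complete(self, key, result, ttl_seconds):
        with self._lock:
            self._entries[key] = (time.monotonic() + ttl_seconds, "done", result)

    def release(self, key):
        with self._lock:
            self._entries.pop(key, None)

    def lookup(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None or time.monotonic() >= entry[0]:
                return None
            return entry[1], entry[2]


class CallInProgress(Exception):
    """Raised when a duplicate arrives while the first call is still running."""


def execute_with_dedup(store, tool_name, key, fn, ttl_seconds=86_400):
    full_key = f"{tool_name}:{key}"
    if not store.reserve(full_key, ttl_seconds):
        found = store.lookup(full_key)
        if found is not None and found[0] == "done":
            return found[1]  # replay the original response, no side effect
        raise CallInProgress(full_key)  # caller should back off and try again later
    try:
        result = fn()
    except Exception:
        # Releasing lets a later retry run again; this is only safe if the
        # downstream call is itself idempotent on the same key.
        store.release(full_key)
        raise
    store.complete(full_key, result, ttl_seconds)
    return result


store = InMemoryDedupStore()
calls = []


def create_ticket():
    calls.append("created")
    return {"ticket_id": "T-1"}


first = execute_with_dedup(store, "create_ticket", "abc123", create_ticket)
second = execute_with_dedup(store, "create_ticket", "abc123", create_ticket)
print(first == second, len(calls))
```

&lt;p&gt;The design choice is to reserve first and execute second, so a duplicate either replays the stored result or is told that the original call is still in flight.&lt;/p&gt;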

&lt;p&gt;This is not novel architecture. Stripe published the canonical pattern for it in 2017. The reason it does not exist by default in agent frameworks is that the frameworks were optimized for prototyping, not production — and the production cost of the missing layer only becomes visible after the first incident.&lt;/p&gt;

&lt;p&gt;The deeper reason it does not exist is that the frameworks are converging on the &lt;strong&gt;wrong default&lt;/strong&gt;. They optimize for "make tool calls easy" — correct for prototyping — but the production-correct default is "make tool calls &lt;em&gt;safe&lt;/em&gt;". Easy and safe are not the same. The frameworks that ship safe-by-default tool wrapping in the next 18 months will eat the lunch of the ones that ship easy-by-default. This pattern repeats every time a substrate matures. It happened to RPC. It happened to REST. It will happen to agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three engineering rules for 2026
&lt;/h2&gt;

&lt;p&gt;Three rules I am asking every team I work with to adopt. They are not new — they are what a Stripe engineer would have given you in 2018, restated for an agent context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1 — Tools, not agents, own idempotency.&lt;/strong&gt; The agent is non-deterministic by design. The tool is the deterministic boundary. The contract belongs there. Every state-mutating tool exposes an &lt;code&gt;idempotency_key&lt;/code&gt; parameter; the framework synthesizes it from inputs if the agent does not supply one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2 — Test retries explicitly.&lt;/strong&gt; Every state-mutating tool ships with a regression test that calls it twice with the same inputs and asserts identical end state. CI catches the violation before the framework's retry policy does. The single most cost-effective test you can add to an agent codebase, and almost no team I have worked with is doing it consistently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_create_order_is_idempotent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sample_order_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# same idempotency_key derived
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;oms_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rule 3 — Treat idempotency as a versioned contract.&lt;/strong&gt; When the tool's input shape changes, the key derivation changes, and old in-flight retries should fail closed, not silently re-execute against the new shape. Most teams miss this on the first refactor and discover it on the second incident.&lt;/p&gt;
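
&lt;p&gt;A sketch of how a versioned key might fail closed. The names &lt;code&gt;KEY_SCHEMA_VERSION&lt;/code&gt;, &lt;code&gt;versioned_key&lt;/code&gt;, &lt;code&gt;check_key_version&lt;/code&gt; and &lt;code&gt;StaleKeyVersion&lt;/code&gt; are hypothetical, and prefixing the version in clear text is one possible convention among several.&lt;/p&gt;

```python
import hashlib
import json

KEY_SCHEMA_VERSION = "create_order.v2"  # hypothetical; bump when the input shape changes


def versioned_key(schema_version, caller_id, inputs):
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{caller_id}|{canonical}".encode()).hexdigest()
    # The version travels in the clear so the tool can inspect it before executing.
    return f"{schema_version}:{digest}"


class StaleKeyVersion(Exception):
    """Raised when a retry carries a key derived under an older input shape."""


def check_key_version(idempotency_key):
    version, _, digest = idempotency_key.partition(":")
    if version != KEY_SCHEMA_VERSION or not digest:
        raise StaleKeyVersion(f"expected {KEY_SCHEMA_VERSION}, got {version!r}")
    return digest


order_inputs = {"customer_id": "cust-7", "sku": "AB-1"}
current = versioned_key(KEY_SCHEMA_VERSION, "agent-a", order_inputs)
stale = versioned_key("create_order.v1", "agent-a", order_inputs)

check_key_version(current)  # accepted: derived under the current shape
try:
    check_key_version(stale)
    stale_rejected = False
except StaleKeyVersion:
    stale_rejected = True

print(stale_rejected)
```

&lt;p&gt;An in-flight retry that was keyed under the old shape is rejected at the boundary instead of being re-executed against the new one.&lt;/p&gt;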

&lt;p&gt;These three rules together cost a small engineering tax — perhaps 5% on tool development time — and prevent every one of the five failure modes above. The math is not subtle.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this costs when you skip it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct revenue impact&lt;/strong&gt; when duplicate billing requires refund + concession.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust erosion&lt;/strong&gt; when fourteen-email incidents hit social media.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering time&lt;/strong&gt; when reconciliation between a ledger and an entitlement system takes a week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit surface&lt;/strong&gt; when finance discovers the system of record for charges and the system of record for grants disagree.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project survival&lt;/strong&gt; when leadership concludes the agent platform is "not production-ready" and pulls the funding. This is the failure mode behind Gartner's &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;40% project-cancellation forecast&lt;/a&gt; — not the AI being insufficiently capable, but the integration around it being insufficiently durable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every post-mortem I have run on these incidents, the cost-to-fix-after is at least 10× the cost-to-design-correctly-before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The agent ecosystem is going through the same maturation curve every distributed-systems substrate has gone through. The 1990s had it for RPC. The 2000s had it for SOAP. The 2010s had it for REST and webhooks. Each generation rediscovered idempotency the hard way, usually after a billing incident hit the press.&lt;/p&gt;

&lt;p&gt;The 2020s have it for agents. The good news is that we know the answer. The bad news is that the framework defaults are not yet aligned to it, and the production incidents are paying for the misalignment.&lt;/p&gt;

&lt;p&gt;If you are building anything where an agent calls a tool that mutates state, the most useful question you can ask this quarter is: &lt;strong&gt;what happens if this exact call is made twice?&lt;/strong&gt; If the answer is anything other than "the same thing happens once," you have an incident in your future. The only variable is the timing.&lt;/p&gt;
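&lt;p&gt;As a rough illustration of "the same thing happens once", here is a minimal sketch that assumes the caller supplies an idempotency key and that results can be kept in a store. The names &lt;code&gt;IdempotentTool&lt;/code&gt; and &lt;code&gt;charge&lt;/code&gt;, the key format, and the in-memory dictionary are all hypothetical; a production version would need a durable, shared store and protection against concurrent first calls.&lt;/p&gt;

```python
from typing import Any, Callable, Dict


class IdempotentTool:
    """Wraps a state-mutating function so repeated calls with the same key run it once."""

    def __init__(self, action: Callable[..., Any]) -> None:
        self._action = action
        # In-memory only: a hypothetical stand-in for a durable, shared store.
        self._results: Dict[str, Any] = {}

    def call(self, idempotency_key: str, **kwargs: Any) -> Any:
        # A replayed key returns the stored result instead of mutating state again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._action(**kwargs)
        self._results[idempotency_key] = result
        return result


charges = []  # stands in for the billing ledger


def charge(account: str, amount_cents: int) -> str:
    charges.append((account, amount_cents))
    return f"charge-{len(charges)}"


tool = IdempotentTool(charge)
first = tool.call("renewal-2026-05-acct-42", account="acct-42", amount_cents=9900)
second = tool.call("renewal-2026-05-acct-42", account="acct-42", amount_cents=9900)
```

&lt;p&gt;Both calls return the same charge reference and the ledger holds a single entry, which is the answer you want to the question above.&lt;/p&gt;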

&lt;p&gt;Idempotency is not a clever pattern. It is a 25-year-old constraint that distributed-systems people stopped negotiating about a long time ago. The agent ecosystem is currently rediscovering why.&lt;/p&gt;

&lt;p&gt;The fix is older than most of the engineers shipping the bug.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a four-week series connecting old software-engineering principles to new AI failure modes. Originally published on &lt;a href="https://biztechbridge.com/insights/idempotency-ai-agent-failures" rel="noopener noreferrer"&gt;biztechbridge.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>webdev</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>How to Compute Zero Trust Effectiveness: Four Metrics That Survive a Breach</title>
      <dc:creator>Sven Schuchardt</dc:creator>
      <pubDate>Wed, 29 Apr 2026 16:29:10 +0000</pubDate>
      <link>https://dev.to/sven_schuchardt_0aa51663a/how-to-compute-zero-trust-effectiveness-four-metrics-that-survive-a-breach-1ghg</link>
      <guid>https://dev.to/sven_schuchardt_0aa51663a/how-to-compute-zero-trust-effectiveness-four-metrics-that-survive-a-breach-1ghg</guid>
      <description>&lt;p&gt;Three hops captures the realistic post-compromise reach inside a typical enterprise environment. If your IAM tooling does not expose a graph, the practical substitute is "count of distinct resources the identity has permission to read or modify within 60 minutes of session start, assuming no MFA step-up triggers."&lt;/p&gt;

&lt;h3&gt;
  
  
  What good looks like
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privileged human identity:&lt;/strong&gt; under 50 reachable resources, zero crown-jewel data classes without step-up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard human identity:&lt;/strong&gt; under 200 reachable resources, no production data without explicit grant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service account:&lt;/strong&gt; scoped to a single namespace or workload — under 10 reachable resources is normal, over 100 is a problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Report this metric per identity &lt;em&gt;class&lt;/em&gt;, not as a single org-wide average. The average hides the outliers, and the outliers are what get exploited.&lt;/p&gt;
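&lt;p&gt;If your tooling can export an access graph, the three-hop reach behind this metric can be sketched as a bounded breadth-first search. The adjacency-list shape, the function name &lt;code&gt;reachable_within&lt;/code&gt;, and the sample graph are assumptions for illustration, not the output format of any particular IAM product.&lt;/p&gt;

```python
from collections import deque
from typing import Dict, List, Set


def reachable_within(graph: Dict[str, List[str]], start: str, max_hops: int = 3) -> Set[str]:
    """Breadth-first search returning every node reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    seen.discard(start)  # the identity itself is not part of its own blast radius
    return seen


# Hypothetical access graph: identity, then role, then resources reachable from it.
access_graph = {
    "svc-build": ["role-ci"],
    "role-ci": ["repo-main", "artifact-store"],
    "artifact-store": ["prod-deploy-key"],
    "prod-deploy-key": ["prod-cluster"],
}
blast_radius = reachable_within(access_graph, "svc-build", max_hops=3)
```

&lt;p&gt;In this sample the service account reaches four nodes within three hops; &lt;code&gt;prod-cluster&lt;/code&gt; sits at hop four and is excluded. Run the same function per identity and group the counts by identity class to get the per-class report.&lt;/p&gt;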

&lt;h2&gt;
  
  
  Metric 2: Lateral-movement time-to-detect
&lt;/h2&gt;

&lt;p&gt;Lateral-movement TTD is the median time between an attacker's first action on a compromised host and the moment your SOC opens a case for the second host. Every Zero Trust programme implicitly claims to reduce this number. Most never measure it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to compute it
&lt;/h3&gt;

&lt;p&gt;The easiest source is your EDR plus your SIEM. You need two timestamps per simulated or real lateral-movement event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Microsoft Sentinel / KQL — adapt to Splunk / Elastic / Chronicle
let lateralEvents = SecurityAlert
  | where AlertName has_any ("Pass-the-hash", "Suspicious WMI", "RDP from unusual host", "Service account used from new asset")
  | project firstHopTime = TimeGenerated, firstHost = CompromisedEntity, alertId = SystemAlertId;
let secondHopAlerts = SecurityAlert
  | where AlertName has_any ("Suspicious lateral connection", "Credential reuse on new host")
  | project secondHopTime = TimeGenerated, secondHost = CompromisedEntity, correlationId = SystemAlertId;
lateralEvents
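  // SystemAlertId is unique per alert, so this join only matches if your pipeline stamps both
  // alerts with a shared correlation ID. Substitute whatever key links the two hops in your data.
  | join kind=inner (secondHopAlerts) on $left.alertId == $right.correlationId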
  | extend ttd_minutes = datetime_diff('minute', secondHopTime, firstHopTime)
  | summarize p50 = percentile(ttd_minutes, 50), p90 = percentile(ttd_minutes, 90)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not running purple-team exercises that produce real lateral-movement signal, your TTD is technically infinite — and that is the metric you should report. Quarterly attack simulations are the cheapest way to populate this number honestly.&lt;/p&gt;
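&lt;p&gt;For teams that log exercise results outside the SIEM, the same p50 and p90 can be sketched in a few lines. This assumes you have one pair of timestamps per event; the sample values are invented, and the nearest-rank percentile used here is only one of several definitions, so it may differ slightly from what your SIEM reports.&lt;/p&gt;

```python
import math
from datetime import datetime
from typing import List, Tuple


def ttd_minutes(events: List[Tuple[datetime, datetime]]) -> List[float]:
    """Minutes between first-hop activity and second-hop detection for each event."""
    return [(second - first).total_seconds() / 60 for first, second in events]


def nearest_rank(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; returns infinity when there is no data to measure."""
    if not values:
        return math.inf
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical timestamps from three purple-team exercises.
events = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 9, 8)),
    (datetime(2026, 4, 8, 14, 0), datetime(2026, 4, 8, 14, 25)),
    (datetime(2026, 4, 15, 11, 0), datetime(2026, 4, 15, 12, 30)),
]
minutes = ttd_minutes(events)
p50 = nearest_rank(minutes, 50)
p90 = nearest_rank(minutes, 90)
```

&lt;p&gt;Returning infinity for an empty list mirrors the point above: with no exercises, there is no honest number to report.&lt;/p&gt;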

&lt;h3&gt;
  
  
  What good looks like
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mature programme:&lt;/strong&gt; p50 under 10 minutes, p90 under 30 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional programme:&lt;/strong&gt; p50 under 60 minutes, p90 under 4 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Untested programme:&lt;/strong&gt; unknown — and "unknown" is a board-grade red flag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.ibm.com/reports/data-breach" rel="noopener noreferrer"&gt;IBM 2025 Cost of a Data Breach Report&lt;/a&gt; shows that breaches contained in under 200 days cost $1.14M less on average than slower ones. Lateral-movement TTD is a leading indicator of containment time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metric 3: Service-account scope drift
&lt;/h2&gt;

&lt;p&gt;Human identities have managers, review cycles, and offboarding. Service accounts and machine identities have none of these by default — and they outnumber human identities &lt;a href="https://biztechbridge.com/insights/zero-trust-service-account-triage" rel="noopener noreferrer"&gt;roughly 82 to 1 in a typical enterprise&lt;/a&gt;. Scope drift measures how their permissions change quarter over quarter without explicit human approval.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to compute it
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Compare snapshot of service-account permissions across two points in time&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;current_perms&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;identity_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;granted_at&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iam_permissions_snapshot&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;identity_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'service_account'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;baseline_perms&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;identity_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;permission&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iam_permissions_snapshot&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'90 days'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;identity_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'service_account'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;drift&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;granted_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;change_approvals&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
                   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_id&lt;/span&gt;
                     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt;
                     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;approved_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;granted_at&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;
                                            &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;granted_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'approved'&lt;/span&gt;
      &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'unapproved'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;approval_status&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;current_perms&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
  &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;baseline_perms&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_id&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;  &lt;span class="c1"&gt;-- new permission since baseline&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;approval_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;new_perms&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;drift&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;approval_status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number you report is the count of unapproved new permissions per quarter, plus the top ten service accounts that gained the most scope.&lt;/p&gt;
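&lt;p&gt;If the snapshots live in files rather than a warehouse, the same comparison reduces to set arithmetic. The sketch below assumes each grant is an &lt;code&gt;(identity_id, permission)&lt;/code&gt; pair and that approvals are available as the same kind of pair; unlike the query, it ignores the seven-day approval window. All names and sample data are hypothetical.&lt;/p&gt;

```python
from collections import Counter
from typing import List, Set, Tuple

Grant = Tuple[str, str]  # (identity_id, permission)


def unapproved_drift(baseline: Set[Grant], current: Set[Grant], approved: Set[Grant]) -> Set[Grant]:
    """Grants present now, absent from the baseline, and lacking an approval record."""
    return (current - baseline) - approved


def top_gainers(new_grants: Set[Grant], limit: int = 10) -> List[Tuple[str, int]]:
    """Identities ranked by how many new grants they picked up."""
    return Counter(identity for identity, _ in new_grants).most_common(limit)


baseline = {("svc-a", "read:logs")}
current = {
    ("svc-a", "read:logs"),
    ("svc-a", "write:db"),
    ("svc-b", "read:secrets"),
    ("svc-b", "admin:iam"),
}
approved = {("svc-a", "write:db")}

unapproved = unapproved_drift(baseline, current, approved)
gainers = top_gainers(current - baseline)
```

&lt;p&gt;Here &lt;code&gt;svc-b&lt;/code&gt; gained two permissions with no approval on record, so it tops the list and accounts for all of the unapproved drift.&lt;/p&gt;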

&lt;h3&gt;
  
  
  What good looks like
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Quarterly unapproved drift: &lt;strong&gt;under 5%&lt;/strong&gt; of total permission changes&lt;/li&gt;
&lt;li&gt;Zero service accounts in the top-ten that touch crown-jewel data classes&lt;/li&gt;
&lt;li&gt;Every "approved" entry traces to a ticket or change record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anything above 15% unapproved drift means your IAM hygiene has decayed, regardless of how many controls you have deployed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metric 4: Exception age
&lt;/h2&gt;

&lt;p&gt;Every Zero Trust programme accumulates exceptions: the legacy app that cannot do MFA, the build server that needs a static credential, the compliance carve-out for a specific business unit. These are unavoidable. What is not unavoidable is letting them age.&lt;/p&gt;

&lt;p&gt;Exception age is the median number of days an active policy exception has been in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to compute it
&lt;/h3&gt;

&lt;p&gt;The exception register is your source of truth. It needs at least three fields per entry: opened date, business owner, and committed remediation date. The query below also assumes a category and a status column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;exception_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_PART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;opened_at&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p50_age_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;DATE_PART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;opened_at&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p90_age_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;remediation_committed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;overdue_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;zt_exceptions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;exception_category&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p90_age_days&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not have an exception register, that is the metric you should report: "number of policy exceptions tracked: zero — and we know that is wrong."&lt;/p&gt;
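&lt;p&gt;If the register starts life as a spreadsheet export rather than a table, the same summary can be sketched directly. The &lt;code&gt;PolicyException&lt;/code&gt; record, its field names, and the sample entries are assumptions for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import List


@dataclass
class PolicyException:
    owner: str
    opened: date
    remediation_due: date


def summarize(exceptions: List[PolicyException], today: date) -> dict:
    """Active count, median age in days, and number of entries past their committed date."""
    ages = [(today - e.opened).days for e in exceptions]
    overdue = sum(1 for e in exceptions if today > e.remediation_due)
    return {
        "active": len(exceptions),
        "median_age_days": median(ages) if ages else None,
        "overdue": overdue,
    }


# Hypothetical register with three active exceptions.
register = [
    PolicyException("a.khan", date(2026, 4, 1), date(2026, 6, 1)),
    PolicyException("m.ortiz", date(2026, 1, 29), date(2026, 3, 1)),
    PolicyException("j.lee", date(2025, 4, 29), date(2025, 12, 31)),
]
summary = summarize(register, today=date(2026, 4, 29))
```

&lt;p&gt;Passing &lt;code&gt;today&lt;/code&gt; explicitly keeps the calculation reproducible, which matters when the same numbers are quoted in a quarterly report.&lt;/p&gt;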

&lt;h3&gt;
  
  
  What good looks like
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Median exception age: &lt;strong&gt;under 90 days&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;p90 exception age: under 180 days&lt;/li&gt;
&lt;li&gt;Overdue (past committed remediation date): zero&lt;/li&gt;
&lt;li&gt;Every entry has a named human owner, not a team distribution list&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most uncomfortable version of this metric is the &lt;em&gt;expired&lt;/em&gt; exception count — exceptions whose stated business justification is no longer true but which remain in production because nobody owns the cleanup. Surface that number deliberately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting the four metrics together
&lt;/h2&gt;

&lt;p&gt;The four metrics tell a coherent story when reported together:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Diagnosis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Blast radius high, TTD low&lt;/td&gt;
&lt;td&gt;Detection is fast but identity scope is too broad. Tighten least-privilege.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blast radius low, TTD high&lt;/td&gt;
&lt;td&gt;Containment is structurally sound but observability is weak. Invest in EDR + UEBA.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drift high, exception age low&lt;/td&gt;
&lt;td&gt;New permissions outpace cleanup. Tighten IAM change control.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drift low, exception age high&lt;/td&gt;
&lt;td&gt;Stable IAM, but the exception register is a parking lot. Force re-justification quarterly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All four red&lt;/td&gt;
&lt;td&gt;The programme is doing activity work. Stop deploying and start measuring.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that none of these four metrics is a coverage percentage. None of them improves just because you bought a tool. Every one of them requires a human to decide whether the current number is acceptable, which is the entire point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to put on the board slide
&lt;/h2&gt;

&lt;p&gt;Translate the four metrics into the only sentence the board cares about:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If an attacker compromises one identity tomorrow, the blast radius is &lt;strong&gt;N systems&lt;/strong&gt; containing &lt;strong&gt;C crown-jewel data classes&lt;/strong&gt;, our median time to detect a second hop is &lt;strong&gt;T minutes&lt;/strong&gt;, and we currently carry &lt;strong&gt;E policy exceptions&lt;/strong&gt; with a median age of &lt;strong&gt;A days&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single sentence is the dashboard. Everything else — the rings, the percentages, the heatmaps — is supporting evidence. If you cannot answer it from your current tooling in under five minutes, the gap is not a tooling gap. It is a measurement-discipline gap, and no amount of additional Zero Trust deployment will close it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Zero Trust is a security discipline that lives or dies by what you measure. Activity metrics make the programme look healthy in year one and vanish in year two when the breach happens anyway. Effectiveness metrics are uglier, harder to compute, and they survive contact with reality.&lt;/p&gt;

&lt;p&gt;Pick the four. Compute them honestly. Report the awkward numbers alongside the impressive ones. The CISOs getting real budget in 2026 are the ones whose dashboards make leadership uncomfortable on purpose — because uncomfortable numbers are the only ones a board can act on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://biztechbridge.com/insights/computing-zero-trust-effectiveness-metrics" rel="noopener noreferrer"&gt;biztechbridge.com&lt;/a&gt;. For the strategic framing of these metrics in board reporting, see &lt;a href="https://biztechbridge.com/insights/zero-trust-measurement-dashboard" rel="noopener noreferrer"&gt;Measuring Zero Trust: The Dashboard Your Board Wants to See&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>zerotrust</category>
      <category>security</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>How to Measure Voluntary Adoption of Your Internal Developer Platform</title>
      <dc:creator>Sven Schuchardt</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:44:45 +0000</pubDate>
      <link>https://dev.to/sven_schuchardt_0aa51663a/how-to-measure-voluntary-adoption-of-your-internal-developer-platform-22dm</link>
      <guid>https://dev.to/sven_schuchardt_0aa51663a/how-to-measure-voluntary-adoption-of-your-internal-developer-platform-22dm</guid>
      <description>&lt;p&gt;If your platform team only tracks "services onboarded" or "deployments per week," you are measuring compliance, not value. The single metric that predicts whether your Internal Developer Platform (IDP) will deliver return on investment is &lt;strong&gt;voluntary adoption rate of the golden path&lt;/strong&gt; — the percentage of new work that chooses the paved road when an off-road option still exists.&lt;/p&gt;

&lt;p&gt;This article shows three ways to measure it concretely, using Backstage, GitHub, Argo CD, and Prometheus. It is the technical companion to the broader &lt;a href="https://dev.to/insights/platform-engineering-business-impact"&gt;Platform Engineering business case&lt;/a&gt; — that piece argues &lt;em&gt;why&lt;/em&gt; voluntary adoption matters; this one shows &lt;em&gt;how&lt;/em&gt; to compute it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why activity metrics mislead platform teams
&lt;/h2&gt;

&lt;p&gt;Most platform dashboards report on activity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of templates run&lt;/li&gt;
&lt;li&gt;Services in the catalog&lt;/li&gt;
&lt;li&gt;Pipeline executions per day&lt;/li&gt;
&lt;li&gt;Onboarded teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers go up regardless of whether developers actually like the platform. A mandated platform produces the same activity graph as a beloved one — until attrition spikes and the post-mortem reveals that nobody was self-serving anything; they were filing tickets to comply with a policy.&lt;/p&gt;

&lt;p&gt;Voluntary adoption asks a harder, more honest question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When developers had a real choice, what did they pick?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer trends toward the golden path over time, the platform is genuinely removing friction. If it trends away — or if there was never an off-road option to reject — you do not have signal. You have theatre.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three measurement layers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Cadence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Path-of-least-resistance rate&lt;/td&gt;
&lt;td&gt;% of new services created via the golden-path template vs. ad-hoc&lt;/td&gt;
&lt;td&gt;Backstage Scaffolder + GitHub repo creation events&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Stickiness rate&lt;/td&gt;
&lt;td&gt;% of services still on the golden path 90 days after creation&lt;/td&gt;
&lt;td&gt;Catalog metadata + drift detection&lt;/td&gt;
&lt;td&gt;Monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Re-entry rate&lt;/td&gt;
&lt;td&gt;% of legacy services voluntarily migrating onto the platform without a mandate&lt;/td&gt;
&lt;td&gt;Catalog + GitOps PR activity&lt;/td&gt;
&lt;td&gt;Quarterly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You want all three trending up. Activity metrics — deploys, builds, pipeline runs — are downstream of these and noisy on their own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: path-of-least-resistance rate
&lt;/h2&gt;

&lt;p&gt;The Backstage Scaffolder logs every template execution. Cross-reference that against all new repositories created in your GitHub organisation during the same window. The share of new repositories that came from the Scaffolder is your weekly voluntary adoption rate for new work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Pseudocode against your data warehouse&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'backstage_scaffolder'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;golden_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'manual_repo'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;off_road&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'backstage_scaffolder'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;
    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;voluntary_adoption_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;new_services&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Healthy trend: 60% → 80%+ over six months. Stalled below 40%? Your golden path is not actually the easiest path. Find out why before adding more features. The most common culprits are template rigidity, slow scaffold time, and missing escape hatches for legitimate edge cases.&lt;/p&gt;
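&lt;p&gt;The arithmetic and the thresholds above can be sketched as two small functions. The band labels are illustrative, and treating 80% as "healthy" and anything under 40% as "stalled" is my reading of the figures in the text rather than a fixed standard.&lt;/p&gt;

```python
from typing import Optional


def adoption_pct(golden_path: int, total_new: int) -> Optional[float]:
    """Share of new services created through the golden path, or None when nothing was created."""
    if total_new == 0:
        return None
    return round(golden_path / total_new * 100, 1)


def layer1_status(pct: Optional[float]) -> str:
    """Bands based on the thresholds in the text; the labels themselves are illustrative."""
    if pct is None:
        return "no signal"
    if pct >= 80:
        return "healthy"
    if 40 > pct:
        return "stalled"
    return "in progress"


# Hypothetical week: 14 new repositories, 9 of them created by the Scaffolder.
this_week = adoption_pct(golden_path=9, total_new=14)
status = layer1_status(this_week)
```

&lt;p&gt;A week with no new services returns &lt;code&gt;None&lt;/code&gt; rather than 0%, so quiet weeks do not drag the trend line down.&lt;/p&gt;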

&lt;h2&gt;
  
  
  Layer 2: stickiness rate
&lt;/h2&gt;

&lt;p&gt;Services that get scaffolded from a template often drift off the paved road within weeks. Teams bypass the CI pipeline, stop publishing to the catalog, or hand-edit the Helm chart instead of using the platform-provided one. Stickiness measures how many services are still genuinely platform-managed 90 days after creation.&lt;/p&gt;

&lt;p&gt;Detect drift with a periodic reconciliation job and an annotation contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Argo CD Application — every service should carry these annotations&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;platform.bz/golden-path-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.4"&lt;/span&gt;
    &lt;span class="na"&gt;platform.bz/last-reconciled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-24T08:12:00Z"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask: of services scaffolded 90 days ago, how many still carry a current &lt;code&gt;golden-path-version&lt;/code&gt; annotation and pass policy admission without exemption?&lt;/p&gt;

&lt;p&gt;That ratio is your stickiness rate. Sub-70% means your golden path is too rigid — developers leave because the road does not bend where it needs to. The fix is rarely more enforcement; it is usually more flexibility within the paved road, so teams do not need to step off it for legitimate variations.&lt;/p&gt;
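&lt;p&gt;A minimal sketch of that ratio, assuming the reconciliation job can hand you each service in the 90-day cohort as a dictionary with its annotations and a flag for policy exemptions. The &lt;code&gt;policy_exempt&lt;/code&gt; field and the sample cohort are hypothetical; only the annotation key comes from the manifest above.&lt;/p&gt;

```python
from typing import Dict, List, Optional

ANNOTATION = "platform.bz/golden-path-version"


def stickiness_pct(services: List[Dict], current_version: str) -> Optional[float]:
    """Share of a cohort that still carries the current annotation and has no policy exemption."""
    if not services:
        return None
    sticky = sum(
        1
        for svc in services
        if svc.get("annotations", {}).get(ANNOTATION) == current_version
        and not svc.get("policy_exempt", False)
    )
    return round(sticky / len(services) * 100, 1)


# Hypothetical cohort of services scaffolded 90 days ago.
cohort = [
    {"name": "orders", "annotations": {ANNOTATION: "2.4"}},
    {"name": "search", "annotations": {ANNOTATION: "2.4"}},
    {"name": "billing", "annotations": {ANNOTATION: "2.4"}, "policy_exempt": True},
    {"name": "reports", "annotations": {ANNOTATION: "2.1"}},
]
rate = stickiness_pct(cohort, current_version="2.4")
```

&lt;p&gt;Two of the four services count as sticky here: one was dropped for a stale version and one for a policy exemption.&lt;/p&gt;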

&lt;h2&gt;
  
  
  Layer 3: re-entry rate
&lt;/h2&gt;

&lt;p&gt;The hardest test, and the most informative one. Of services that existed &lt;em&gt;before&lt;/em&gt; the platform, how many have voluntarily migrated onto it in the last quarter — without a top-down mandate forcing the move?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
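&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prometheus: number of services emitting platform-managed telemetry that were not managed 90 days ago.
# count(count by (service) (...)) counts each service once, even when it exposes several series.
count(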
  count by (service) (
    platform_managed_service_info{managed="true"}
    unless on(service)
    platform_managed_service_info{managed="true"} offset 90d
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the return-on-investment signal a CFO can defend in a budget review. A voluntary re-entry means the platform earned the developer's trust enough that they ported a working production service onto it — at their own initiative, on their own schedule, against the inertia of "if it works, do not touch it."&lt;/p&gt;

&lt;p&gt;If you are seeing zero quarterly re-entries despite a healthy Layer 1 number, the migration cost is too high. Build a one-command migration template for the most common service shape and watch the rate move.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the three numbers together
&lt;/h2&gt;

&lt;p&gt;Each layer in isolation lies. Read them as a system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Diagnosis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 high, L2 low&lt;/td&gt;
&lt;td&gt;Templates are good; the runtime experience drifts. Invest in policy automation and reconciliation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L1 low, L2 high&lt;/td&gt;
&lt;td&gt;The few who adopt love it; discoverability is broken. Invest in developer experience and template marketing, not features.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L1 high, L3 zero&lt;/td&gt;
&lt;td&gt;New work uses the platform; old work will not migrate. Build a migration template.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All three flat&lt;/td&gt;
&lt;td&gt;You probably have a mandate hiding the truth. Remove it for one team and remeasure honestly.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The one anti-pattern to avoid
&lt;/h2&gt;

&lt;p&gt;Do not report voluntary adoption as a single headline percentage to leadership without the three layers underneath. A 92% headline with 30% stickiness is materially worse than a 60% headline with 85% stickiness — the first is a compliance illusion that will collapse when the mandate lifts; the second is a working product with room to grow.&lt;/p&gt;

&lt;p&gt;Platform engineering is a product discipline. Measure it the way a product team measures retention, not the way an ops team measures uptime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The mandate question is not ideological — it is a measurement question. If you can prove voluntary adoption is rising, you do not need a mandate. If you cannot measure it, no mandate will save the platform when budget season arrives and the CFO asks what changed.&lt;/p&gt;

&lt;p&gt;What is your platform's voluntary adoption rate right now? And, more importantly: does anyone on your team know how to compute it?&lt;/p&gt;

&lt;p&gt;For the broader strategic case behind this metric — Forrester ROI numbers, developer attrition costs, and why mandated adoption destroys returns — see the &lt;a href="https://biztechbridge.com/insights/platform-engineering-business-impact" rel="noopener noreferrer"&gt;Platform Engineering business case&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;For the reference architecture and tooling choices that make these measurements possible, see the &lt;a href="https://biztechbridge.com/insights/platform-engineering-technology" rel="noopener noreferrer"&gt;Platform Engineering technology deep dive&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>backstage</category>
      <category>devops</category>
      <category>productivity</category>
      <category>devex</category>
    </item>
    <item>
      <title>Zero Trust is Not a Security Tool — It’s a Software Design Problem</title>
      <dc:creator>Sven Schuchardt</dc:creator>
      <pubDate>Sun, 19 Apr 2026 17:47:27 +0000</pubDate>
      <link>https://dev.to/sven_schuchardt_0aa51663a/zero-trust-is-not-a-security-tool-its-a-software-design-problem-25mh</link>
      <guid>https://dev.to/sven_schuchardt_0aa51663a/zero-trust-is-not-a-security-tool-its-a-software-design-problem-25mh</guid>
      <description>&lt;p&gt;Most Zero Trust discussions focus on tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ZTNA
&lt;/li&gt;
&lt;li&gt;micro-segmentation
&lt;/li&gt;
&lt;li&gt;identity providers
&lt;/li&gt;
&lt;li&gt;SASE platforms
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s useful.&lt;/p&gt;

&lt;p&gt;But it misses the point.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real problem
&lt;/h2&gt;

&lt;p&gt;Modern systems don’t behave like the architectures Zero Trust was originally designed to fix.&lt;/p&gt;

&lt;p&gt;Today, most traffic is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;encrypted
&lt;/li&gt;
&lt;li&gt;service-to-service
&lt;/li&gt;
&lt;li&gt;happening inside your system
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The “trusted internal network” assumption is already broken.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And yet, many implementations still rely on it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Zero Trust actually requires
&lt;/h2&gt;

&lt;p&gt;Zero Trust Architecture is often described as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Never trust, always verify”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds simple.&lt;/p&gt;

&lt;p&gt;But the technical implication is not.&lt;/p&gt;

&lt;p&gt;It means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;every request must be authenticated&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;every request must be authorized&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;continuously, not just at login&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s not a network change.&lt;/p&gt;

&lt;p&gt;That’s a &lt;strong&gt;system design change&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where most implementations break
&lt;/h2&gt;

&lt;p&gt;In practice, the failure point is rarely the edge.&lt;/p&gt;

&lt;p&gt;It’s inside the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internal APIs don’t enforce authentication
&lt;/li&gt;
&lt;li&gt;service-to-service calls rely on network trust
&lt;/li&gt;
&lt;li&gt;authorization logic is inconsistent
&lt;/li&gt;
&lt;li&gt;policies are not version-controlled
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We removed the perimeter… but kept the assumptions.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The shift most teams underestimate
&lt;/h2&gt;

&lt;p&gt;To make Zero Trust work, three things need to change:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identity is no longer just for users
&lt;/h3&gt;

&lt;p&gt;Workloads need identity too.&lt;/p&gt;

&lt;p&gt;Patterns like SPIFFE/SPIRE provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;short-lived identities
&lt;/li&gt;
&lt;li&gt;tied to workload, not IP
&lt;/li&gt;
&lt;li&gt;automatically rotated
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this, mTLS becomes operationally painful or inconsistent.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Authorization becomes per-request
&lt;/h3&gt;

&lt;p&gt;Checking access at login is not enough.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request-level validation
&lt;/li&gt;
&lt;li&gt;resource-level authorization
&lt;/li&gt;
&lt;li&gt;context-aware decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API gateways
&lt;/li&gt;
&lt;li&gt;service mesh policies
&lt;/li&gt;
&lt;li&gt;policy-as-code (e.g. OPA)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;become critical.&lt;/p&gt;
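&lt;p&gt;To make the per-request idea concrete, here is a minimal sketch of a decision function that runs on every call. Everything in it is hypothetical: the policy table, the SPIFFE-style identities, the action names, and the request fields are invented for illustration and do not reflect the API of OPA, a service mesh, or any gateway.&lt;/p&gt;

```javascript
// Hypothetical policy table mapping a workload identity to the actions it may perform.
// The SPIFFE-style IDs and action names are made up for illustration.
const POLICY = new Map([
  ["spiffe://example.org/orders", ["inventory:read", "payments:charge"]],
  ["spiffe://example.org/reports", ["inventory:read"]],
]);

// Evaluated on every request; nothing is carried over from an earlier login.
function authorize(request, nowSeconds) {
  if (request.identityVerified !== true) return false; // caller identity must be proven, e.g. via mTLS
  if (nowSeconds >= request.credentialExp) return false; // short-lived credential has expired
  const allowed = POLICY.get(request.callerId) ?? [];
  return allowed.includes(request.action); // resource-level decision
}

const request = {
  callerId: "spiffe://example.org/reports",
  action: "payments:charge",
  credentialExp: 1712700000,
  identityVerified: true,
};

console.log(authorize(request, 1712699000)); // false: reports may only read inventory
console.log(authorize({ ...request, action: "inventory:read" }, 1712699000)); // true
```

&lt;p&gt;In a real system the table would live in version control and be evaluated by a policy engine, but the shape of the decision stays the same: identity, freshness, and resource are all checked per request.&lt;/p&gt;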




&lt;h3&gt;
  
  
  3. Security moves into the delivery pipeline
&lt;/h3&gt;

&lt;p&gt;If policies only exist at runtime, you are already too late.&lt;/p&gt;

&lt;p&gt;Teams that push Zero Trust controls into CI/CD:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;catch violations earlier
&lt;/li&gt;
&lt;li&gt;reduce production incidents significantly
&lt;/li&gt;
&lt;li&gt;avoid breaking changes at enforcement time
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The uncomfortable takeaway
&lt;/h2&gt;

&lt;p&gt;Zero Trust is not something you “implement” with a tool.&lt;/p&gt;

&lt;p&gt;It’s something you &lt;strong&gt;design into your system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your architecture still assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trusted internal networks
&lt;/li&gt;
&lt;li&gt;static roles
&lt;/li&gt;
&lt;li&gt;one-time authentication
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then adding Zero Trust tooling will mostly add complexity.&lt;/p&gt;

&lt;p&gt;Not security.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to do instead
&lt;/h2&gt;

&lt;p&gt;Start with a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where does implicit trust still exist in our system?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You’ll usually find it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;between services
&lt;/li&gt;
&lt;li&gt;in internal APIs
&lt;/li&gt;
&lt;li&gt;in long-lived credentials
&lt;/li&gt;
&lt;li&gt;in developer workflows
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s your real attack surface.&lt;/p&gt;




&lt;h2&gt;
  
  
  If you want to go deeper
&lt;/h2&gt;

&lt;p&gt;I wrote a full breakdown of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how NIST SP 800-207 maps to real systems
&lt;/li&gt;
&lt;li&gt;mTLS and workload identity
&lt;/li&gt;
&lt;li&gt;SPIFFE/SPIRE, OPA, and secrets management
&lt;/li&gt;
&lt;li&gt;what actually changes for development teams
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://biztechbridge.com/insights/zero-trust-architecture-technology" rel="noopener noreferrer"&gt;https://biztechbridge.com/insights/zero-trust-architecture-technology&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;Zero Trust is often sold as a security upgrade.&lt;/p&gt;

&lt;p&gt;In reality, it’s closer to a &lt;strong&gt;paradigm shift in how systems make decisions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that shift is still underestimated in most implementations.&lt;/p&gt;

</description>
      <category>security</category>
      <category>cloud</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>JWT Explained: What's Actually Inside a JSON Web Token</title>
      <dc:creator>Sven Schuchardt</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:09:14 +0000</pubDate>
      <link>https://dev.to/sven_schuchardt_0aa51663a/jwt-explained-whats-actually-inside-a-json-web-token-3o0d</link>
      <guid>https://dev.to/sven_schuchardt_0aa51663a/jwt-explained-whats-actually-inside-a-json-web-token-3o0d</guid>
      <description>&lt;p&gt;You're integrating an API and you get back a token that starts with &lt;code&gt;eyJ&lt;/code&gt;. You paste it somewhere and suddenly you can read your user's email address, their user ID, and an expiry timestamp. No decryption key needed. How? And if anyone can read it, is that secure?&lt;/p&gt;

&lt;p&gt;JWTs look encrypted but aren't. That tension — readable but trustworthy — is the whole point. Understanding it takes about five minutes, and it changes how you think about auth tokens for good.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a JWT?
&lt;/h2&gt;

&lt;p&gt;A JSON Web Token is three base64url-encoded strings joined by dots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;header.payload.signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take a real minimal example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJ1c2VyXzEyMyIsImVtYWlsIjoidXNlckBleGFtcGxlLmNvbSIsImV4cCI6MTcxMjcwMDAwMH0.signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each part can be decoded in a browser console right now — no keys, no secrets, no libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Manually decode the payload (works in any browser console)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJ1c2VyXzEyMyIsImVtYWlsIjoidXNlckBleGFtcGxlLmNvbSIsImV4cCI6MTcxMjcwMDAwMH0.signature&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;atob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/-/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;+&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/_/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// { sub: "user_123", email: "user@example.com", exp: 1712700000 }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Part 1: Header.&lt;/strong&gt; Contains &lt;code&gt;alg&lt;/code&gt; (the signing algorithm, e.g. &lt;code&gt;HS256&lt;/code&gt; or &lt;code&gt;RS256&lt;/code&gt;) and &lt;code&gt;typ&lt;/code&gt; (optional, conventionally &lt;code&gt;"JWT"&lt;/code&gt;). Decoded, it looks like &lt;code&gt;{ "alg": "HS256", "typ": "JWT" }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 2: Payload.&lt;/strong&gt; The claims: statements about the user or the token. These are just JSON key-value pairs. Standard claim names are short by convention (&lt;code&gt;sub&lt;/code&gt;, &lt;code&gt;exp&lt;/code&gt;, &lt;code&gt;iat&lt;/code&gt;) but the values can be anything. Custom claims like &lt;code&gt;role&lt;/code&gt; or &lt;code&gt;org_id&lt;/code&gt; are perfectly valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3: Signature.&lt;/strong&gt; An HMAC (e.g. &lt;code&gt;HS256&lt;/code&gt;) or a digital signature (e.g. &lt;code&gt;RS256&lt;/code&gt;) computed over &lt;code&gt;base64url(header) + "." + base64url(payload)&lt;/code&gt;. HMAC algorithms use a secret shared between issuer and verifier; RSA algorithms sign with the issuer's private key and are verified with the matching public key. This is the part that makes the token trustworthy: it adds tamper-evidence, not secrecy.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;JWTs are signed, not encrypted.&lt;/strong&gt; The payload is readable by anyone who has the token. Only a party holding the signing key can produce a valid signature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standard Claims
&lt;/h2&gt;

&lt;p&gt;The JWT spec defines a set of registered claim names. All of them are optional, but you should use them where they fit, because most JWT libraries understand them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sub&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Subject&lt;/td&gt;
&lt;td&gt;User identifier (user ID, email, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;iss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Issuer&lt;/td&gt;
&lt;td&gt;Who created the token (your auth server)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aud&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Audience&lt;/td&gt;
&lt;td&gt;Who the token is intended for&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Expiration&lt;/td&gt;
&lt;td&gt;Unix timestamp when token expires&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;iat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Issued at&lt;/td&gt;
&lt;td&gt;Unix timestamp when token was created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nbf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Not before&lt;/td&gt;
&lt;td&gt;Token not valid before this timestamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jti&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JWT ID&lt;/td&gt;
&lt;td&gt;Unique token identifier (for revocation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;exp&lt;/code&gt;, &lt;code&gt;iat&lt;/code&gt;, and &lt;code&gt;nbf&lt;/code&gt; are Unix timestamps: seconds since January 1, 1970, 00:00:00 UTC. An &lt;code&gt;exp&lt;/code&gt; of &lt;code&gt;1712700000&lt;/code&gt; means the token expires on April 9, 2024 at 22:00:00 UTC. Paste any JWT into our &lt;a href="https://biztechbridge.com/tools/jwt-decoder" rel="noopener noreferrer"&gt;JWT decoder tool&lt;/a&gt; to see the header, payload, and claims broken out, without sending the token to any server.&lt;/p&gt;
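&lt;p&gt;Because &lt;code&gt;exp&lt;/code&gt; is plain seconds, an expiry check is a single comparison. The helper below is a sketch, not a replacement for a JWT library: the function name and the optional &lt;code&gt;leewaySeconds&lt;/code&gt; tolerance for clock skew are assumptions of this example.&lt;/p&gt;

```javascript
// Returns true once the current time has reached the token's exp claim.
// leewaySeconds is an assumed tolerance for clock skew between servers.
function isExpired(payload, nowSeconds, leewaySeconds = 0) {
  // A payload without exp is treated as non-expiring here;
  // stricter services may prefer to reject such tokens outright.
  if (typeof payload.exp !== "number") return false;
  return nowSeconds >= payload.exp + leewaySeconds;
}

const payload = { sub: "user_123", exp: 1712700000 };

console.log(new Date(payload.exp * 1000).toISOString()); // "2024-04-09T22:00:00.000Z"
console.log(isExpired(payload, 1712699999)); // false
console.log(isExpired(payload, 1712700000)); // true
console.log(isExpired(payload, 1712700030, 60)); // false, still inside the leeway
```

&lt;p&gt;This check only makes sense after the signature has been verified; an unverified &lt;code&gt;exp&lt;/code&gt; can be set to anything.&lt;/p&gt;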

&lt;h2&gt;
  
  
  Why it Works — and Where it Doesn't
&lt;/h2&gt;

&lt;p&gt;The signature prevents tampering. If you change even one byte of the header or payload, the signature no longer matches. For HMAC algorithms the server verifies by re-computing the signature with the shared secret and comparing; for RSA algorithms it checks the signature against the issuer's public key. If verification succeeds, the payload hasn't been touched since the issuer signed it.&lt;/p&gt;
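&lt;p&gt;The verification step can be sketched for the HMAC case with Node.js's built-in &lt;code&gt;crypto&lt;/code&gt; module. This is a teaching example under stated assumptions: the helper names &lt;code&gt;signHS256&lt;/code&gt; and &lt;code&gt;verifyHS256&lt;/code&gt; and the secret are invented here, only HS256 is handled, and claims such as &lt;code&gt;exp&lt;/code&gt; are not checked. In production, a maintained JWT library is the safer choice.&lt;/p&gt;

```javascript
const crypto = require("node:crypto");

// base64url without padding, the encoding used for all three JWT parts.
const b64url = (input) => Buffer.from(input).toString("base64url");

// Build a token: encode header and payload, then sign "header.payload".
function signHS256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const signature = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${signature}`;
}

// Re-compute the signature and compare; return the payload only if it matches.
function verifyHS256(token, secret) {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;
  const expected = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest();
  const received = Buffer.from(signature, "base64url");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (received.length !== expected.length) return null;
  if (!crypto.timingSafeEqual(received, expected)) return null;
  return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
}

const token = signHS256({ sub: "user_123" }, "demo-secret");
console.log(verifyHS256(token, "demo-secret")); // { sub: 'user_123' }
console.log(verifyHS256(token, "wrong-secret")); // null
```

&lt;p&gt;The sketch always computes HMAC-SHA256 regardless of what the token's header claims, so a forged &lt;code&gt;alg&lt;/code&gt; value cannot switch the verification method.&lt;/p&gt;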

&lt;p&gt;But the payload is public. Everyone who holds the token can read it. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never put passwords, credit card numbers, or API secrets in a JWT payload.&lt;/li&gt;
&lt;li&gt;Never put in anything you would not be comfortable storing in a cookie that users can read.&lt;/li&gt;
&lt;li&gt;Session tokens and user IDs are fine. Sensitive personal data should stay server-side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common mistake in early JWT implementations: accepting a token as proof of identity without verifying the signature. A token that decodes to &lt;code&gt;{ "sub": "admin" }&lt;/code&gt; proves nothing on its own — the signature is what proves it came from your auth server. Always verify server-side before trusting any claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://jwt.io/introduction" rel="noopener noreferrer"&gt;JWT.io Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.rfc-editor.org/rfc/rfc7519" rel="noopener noreferrer"&gt;RFC 7519 — JSON Web Token&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sven-divico/biztechbridge-tools/tree/main/tools/jwt-decoder" rel="noopener noreferrer"&gt;biztechbridge-tools/jwt-decoder&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>jwt</category>
      <category>security</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Unix Timestamps Explained: What Every Developer Should Know</title>
      <dc:creator>Sven Schuchardt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 19:30:59 +0000</pubDate>
      <link>https://dev.to/sven_schuchardt_0aa51663a/unix-timestamps-explained-what-every-developer-should-know-890</link>
      <guid>https://dev.to/sven_schuchardt_0aa51663a/unix-timestamps-explained-what-every-developer-should-know-890</guid>
      <description>&lt;p&gt;You're tailing a log file and you see this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1712700000] ERROR: connection timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What is &lt;code&gt;1712700000&lt;/code&gt;? Is it a bug? A timestamp? A version number? If you've ever stared at a number like that and felt unsure, this article is for you.&lt;/p&gt;

&lt;p&gt;By the end you'll know exactly what Unix timestamps are, why every serious API uses them, and how to convert them instantly without memorising any formula.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is a Unix Timestamp?
&lt;/h2&gt;

&lt;p&gt;A Unix timestamp (also called an epoch timestamp) is simply the number of &lt;strong&gt;seconds that have elapsed since January 1, 1970, 00:00:00 UTC&lt;/strong&gt; — a moment arbitrarily chosen as the starting point of computer time, known as the Unix epoch.&lt;/p&gt;

&lt;p&gt;That's it. No timezones, no daylight saving adjustments, no locale quirks. Just a single integer that means the same thing on every machine on the planet.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1712700000&lt;/code&gt; translates to &lt;strong&gt;April 9, 2024, 22:00:00 UTC&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 1970?
&lt;/h3&gt;

&lt;p&gt;The Unix operating system was developed in the early 1970s. The designers needed a fixed reference point that was recent enough to keep numbers small but old enough to cover any historical dates they cared about. January 1, 1970 was a clean, round choice that stuck — and 50+ years later the entire industry still uses it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seconds vs Milliseconds — the most common gotcha
&lt;/h3&gt;

&lt;p&gt;Two variants exist and they will burn you if you mix them up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Used by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Seconds (Unix time)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1712700000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Most Unix APIs, databases, server logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1712700000000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JavaScript's &lt;code&gt;Date.now()&lt;/code&gt;, Java, many web APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 13-digit number is almost always milliseconds. A 10-digit number is almost always seconds. When in doubt, check the API docs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 2038 problem (a quick aside)
&lt;/h3&gt;

&lt;p&gt;Systems that store Unix time in a signed 32-bit integer run out of range at 03:14:07 UTC on &lt;strong&gt;January 19, 2038&lt;/strong&gt;; one second later the counter overflows. Most modern systems use 64-bit integers (which will not overflow for roughly 292 billion years), but if you work with embedded systems or legacy C code, it is worth knowing.&lt;/p&gt;
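&lt;p&gt;The boundary is easy to reproduce in a console. The snippet below is a small sketch; it uses JavaScript's &lt;code&gt;| 0&lt;/code&gt; coercion to imitate a signed 32-bit counter, which is an illustration of the wraparound rather than how any particular system stores time.&lt;/p&gt;

```javascript
// Largest value a signed 32-bit integer can hold.
const INT32_MAX = 2 ** 31 - 1; // 2147483647

// Last second representable as a signed 32-bit Unix timestamp.
const lastSecond = new Date(INT32_MAX * 1000).toISOString();
console.log(lastSecond); // "2038-01-19T03:14:07.000Z"

// One second later a 32-bit counter wraps to the most negative value.
const wrapped = (INT32_MAX + 1) | 0; // "| 0" coerces to a signed 32-bit integer
console.log(wrapped); // -2147483648

// Interpreted as a timestamp, the wrapped value lands in 1901.
const afterWrap = new Date(wrapped * 1000).toISOString();
console.log(afterWrap); // "1901-12-13T20:45:52.000Z"
```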




&lt;h2&gt;
  
  
  Converting Epoch to a Human-Readable Date in JavaScript
&lt;/h2&gt;

&lt;p&gt;No library needed. JavaScript's built-in &lt;code&gt;Date&lt;/code&gt; handles it in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// If your timestamp is in seconds, multiply by 1000 first&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1712700000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="c1"&gt;// → "2024-04-09T22:00:00.000Z"&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;timeZone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;America/New_York&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="c1"&gt;// → "4/9/2024, 6:00:00 PM"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going the other way — current time as epoch — is even simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nowInSeconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;nowInSeconds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// → a 10-digit value such as 1712700000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A debug helper worth bookmarking
&lt;/h3&gt;

&lt;p&gt;When you're debugging logs, paste this into your browser console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fromEpoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Handle both seconds and milliseconds automatically&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;e12&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nf"&gt;fromEpoch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1712700000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;    &lt;span class="c1"&gt;// → "2024-04-09T22:00:00.000Z"&lt;/span&gt;
&lt;span class="nf"&gt;fromEpoch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1712700000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// → "2024-04-09T22:00:00.000Z"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try It Without Writing Any Code
&lt;/h2&gt;

&lt;p&gt;If you just need a quick conversion — during debugging, code review, or reading API docs — paste any timestamp into our free &lt;a href="https://biztechbridge.com/tools/epoch-converter" rel="noopener noreferrer"&gt;epoch converter tool&lt;/a&gt;. It handles both seconds and milliseconds, supports timezone selection, and works entirely in your browser. No data is ever sent to a server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why APIs Use Unix Time (and Not ISO Strings)
&lt;/h2&gt;

&lt;p&gt;You'll notice that Stripe, Slack, and many other APIs return timestamps as epoch values rather than formatted date strings. There are good reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No timezone ambiguity&lt;/strong&gt;: &lt;code&gt;1712700000&lt;/code&gt; is the same moment everywhere; &lt;code&gt;"2024-04-09 22:00:00"&lt;/code&gt; is ambiguous without a timezone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy arithmetic&lt;/strong&gt;: to check whether something happened in the last 24 hours, test &lt;code&gt;now - ts &amp;lt; 86400&lt;/code&gt;. ISO strings have to be parsed before you can do the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compact&lt;/strong&gt;: 10 digits vs 24 characters for an ISO 8601 string with milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No parsing edge cases&lt;/strong&gt;: no locale formats, no AM/PM, no separator variations&lt;/li&gt;
&lt;/ul&gt;
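&lt;p&gt;The arithmetic point is easiest to see in code. The helper below is a minimal sketch; the function name is made up for this example, and it expects both values in seconds.&lt;/p&gt;

```javascript
const SECONDS_PER_DAY = 86400;

// True when ts (Unix seconds) falls within the 24 hours before now.
function isWithinLastDay(ts, now) {
  const age = now - ts;
  if (age >= SECONDS_PER_DAY) return false; // 24 hours old or older
  return age >= 0; // a negative age means the timestamp is in the future
}

const now = 1712700000;

console.log(isWithinLastDay(now - 3600, now)); // true, one hour ago
console.log(isWithinLastDay(now - 86400, now)); // false, exactly 24 hours ago
console.log(isWithinLastDay(now + 60, now)); // false, in the future
```

&lt;p&gt;The same check on formatted strings would first need both values parsed and normalised to one timezone, which is where most date bugs creep in.&lt;/p&gt;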

&lt;p&gt;The tradeoff is readability — which is exactly why tools and debug helpers exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A Unix timestamp is seconds since January 1, 1970 UTC&lt;/li&gt;
&lt;li&gt;10 digits = seconds, 13 digits = milliseconds; multiply a seconds value by 1000 before passing it to &lt;code&gt;new Date()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;APIs use Unix time because it's unambiguous, compact, and arithmetically convenient&lt;/li&gt;
&lt;li&gt;Convert any timestamp instantly with our &lt;a href="https://biztechbridge.com/tools/epoch-converter" rel="noopener noreferrer"&gt;epoch converter tool&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now" rel="noopener noreferrer"&gt;MDN: Date.now()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Unix_time" rel="noopener noreferrer"&gt;Unix time — Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The full React source for the converter is open on GitHub: &lt;a href="https://github.com/sven-divico/biztechbridge-tools/tree/main/tools/epoch-converter" rel="noopener noreferrer"&gt;biztechbridge-tools/epoch-converter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
