DEV Community: Sven Schuchardt

Every AI Agent Failure I've Debugged in 2026 was an Idempotency Problem

Sven Schuchardt — Mon, 11 May 2026 07:06:03 +0000

Five real production incidents, the 25-year-old constraint that explains them all, and the three-layer architectural fix every agent team should have shipped last quarter.

Summary

The failure pattern looks different every time, and it is the same pattern every time.

A customer gets the same onboarding email fourteen times in nine minutes. A B2B account is charged twice for one subscription renewal. An order shows up in the OMS as three orders. A support ticket is created, escalated, re-created, re-escalated, and then closed as duplicate by a human who eventually has to write the apology email.

Every one of these incidents in the last six months has landed on my desk with the same opening line in the post-mortem: "the agent acted weirdly."

The agent did not act weirdly. The agent did exactly what the framework told it to do — retry on timeout, retry on 5xx, retry on ambiguous tool response — against a tool call that was never designed to be retried. That is not an AI failure. That is a 25-year-old distributed-systems failure wearing a new costume.

The principle the agent ecosystem is currently rediscovering is idempotency: an operation is idempotent if applying it once and applying it more than once produce the same result. HTTP made it normative for request methods in RFC 2616 §9.1.2 (1999), which Roy Fielding co-authored a year before his 2000 REST dissertation, and RFC 7231 §4.2.2 restated it. The folklore is older: RPC implementers were debating it in the 1980s.

By 2010, idempotency was a non-negotiable in any serious payments, messaging, or inventory system. The agent frameworks of 2024–2026 ship with retry semantics at the tool-call layer. The tools they call were written by humans, for humans, on the assumption that a human would not press the button fourteen times in nine minutes. The collision between those two assumptions is where the production damage lives.

Nothing really new

Tool calls now appear in 21.9% of agent traces, up from 0.5% in 2023 — a 44× expansion of the retry surface in a single year (LangChain State of AI 2024).
Gartner forecasts 40% of enterprise apps will ship task-specific agents by end of 2026, and 40%+ of agentic AI projects will be cancelled by end of 2027, driven by reliability and governance gaps (Gartner).
Every major delivery substrate the agent stack inherits is at-least-once: Stripe retries webhooks for 3 days, AWS SQS standard queues document duplicate delivery as the contract, HTTP retries are normative.
The fix is unchanged from 2017: every state-mutating tool requires a deterministic idempotency key + a deduplication store at the boundary. Frameworks do not enforce this by default.

Why this is happening now: the retry surface just got 44× bigger

LangChain's 2024 telemetry shows tool calls jumping from 0.5% of agent traces in 2023 to 21.9% in 2024, with average steps per trace growing from 2.8 to 7.7. Each step is a potential non-idempotent side effect.

Year	Tool calls (% of traces)	Avg steps per trace
2023	0.5%	2.8
2024	21.9%	7.7

Source: LangChain State of AI 2024.

What is new is not retry behaviour at the network layer. What is new is the volume of state-mutating calls being generated by a non-deterministic upstream component. An LLM that produces "approximately the right tool call" 95% of the time also produces "almost-but-not-quite the same tool call" the other 5% — and 5% of millions of calls a day is enough to expose every non-idempotent operation in the entire downstream stack.

51% of survey respondents in the LangChain State of AI Agents Report run agents in production. 89% of orgs in the State of Agent Engineering 2025 report have observability in place. Instrumentation is catching up. The contracts at the tool boundary are not.

Five production failures, all the same shape

Real incidents from the last six months.

1. The fourteen-email onboarding

A B2C signup agent calls a send_welcome_email tool wrapping an internal API. The internal API is eventually consistent — it returns 202 Accepted before enqueue, and under load occasionally returns a socket timeout after the message was enqueued. Framework default: retry on timeout up to 3× with backoff. The tool: no idempotency key, no de-duplication.

Up to three retries per call, compounded by four sequential retriggers from a downstream "incomplete onboarding" agent, put fourteen emails in one mailbox. One enterprise customer publicly tweeted about it. Two hours of incident response. A week of churn-control outreach.

2. The double subscription charge

A self-serve renewal agent handled decline-and-retry on subscription billing. The Stripe call was idempotent — Stripe has supported Idempotency-Key headers for years, with a 24-hour deduplication window. The internal entitlement-grant call after the charge was not idempotent.

When Stripe returned a network-layer error after the card was already charged, the agent retried the whole sequence. Because the framework's retry sat at the agent step, not the tool step, the replayed sequence issued the charge under a new idempotency key, so Stripe processed a second successful charge, followed by a second entitlement grant.

Lesson: Stripe's idempotency layer was correct, and the system still produced a duplicate charge, because the retry was orchestrated one level above where the idempotency key lived. Idempotency is not a property of one call. It is a property of every layer in the call chain.
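One way to keep the key stable when the retry lives above the tool is to mint it once per logical operation, outside the retry loop, and pass it down unchanged. A minimal sketch, assuming a hypothetical `FakeBilling` stand-in that deduplicates on the key it receives (none of these names come from a real SDK):

```python
import uuid
from typing import Callable


class FakeBilling:
    """Stand-in for a payment API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}
        self.calls = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        self.calls += 1
        first_time = idempotency_key not in self.charges
        # The charge is recorded at most once per key.
        self.charges.setdefault(idempotency_key, amount_cents)
        if first_time:
            # Simulate the response being lost after the side effect happened.
            raise TimeoutError("response lost after the charge was recorded")
        return f"charge-for-{idempotency_key}"


def renew(charge: Callable[[str, int], str], amount_cents: int, max_attempts: int = 3) -> str:
    # Mint the key ONCE per logical renewal, outside the retry loop.
    key = uuid.uuid4().hex
    for attempt in range(max_attempts):
        try:
            return charge(key, amount_cents)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
    raise RuntimeError("max_attempts must be at least 1")


billing = FakeBilling()
receipt = renew(billing.charge, 4900)
```

The second attempt reaches the billing stand-in with the same key, so it replays the recorded charge instead of creating a new one. Had the key been minted inside the loop, each attempt would have looked like a fresh charge.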

3. The ghost order

An order-capture agent calls an OMS create_order tool. The OMS expects a client-supplied order ID and is in fact idempotent on it — but the agent, on retry, generated a new UUID for each attempt because the prompt said "generate an order ID" rather than "reuse the order ID across retries."

Every individual layer was idempotent-aware. The integration was not. The non-determinism of the LLM produced new IDs on retry, defeating the very property the OMS was designed to provide.

4. The webhook fan-out

A vendor's webhook delivery is at-least-once — they retry on any non-2xx response. Stripe's published retry schedule extends across immediate, 5-min, 30-min, 2-hr, 5-hr, 10-hr, then every-12-hour windows for up to 3 days. Duplicate delivery is the documented expectation, not the edge case.

The receiving agent's adjust_inventory tool decremented stock. A debug field in the response triggered a Pydantic error in the framework's parser, returning a 500 to the source. The vendor retried. The framework parsed correctly the second time. Inventory decremented twice. Three SKUs oversold. Wrong stock counts pushed to the e-commerce frontend before the on-call SRE caught it.

The fix was not in the agent. The fix was in the inventory tool, which should have accepted an idempotency key from the webhook source and rejected duplicates with 200 OK rather than re-executing.
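A minimal sketch of that receiver-side behaviour, assuming the webhook source sends a stable event ID with every delivery (the `InventoryTool` class and its field names are illustrative, not the incident's actual code):

```python
from dataclasses import dataclass, field


@dataclass
class InventoryTool:
    stock: dict[str, int]
    processed_events: set[str] = field(default_factory=set)

    def adjust_inventory(self, event_id: str, sku: str, delta: int) -> int:
        """Apply a stock change once per event_id; return an HTTP-style status."""
        if event_id in self.processed_events:
            # Duplicate delivery: acknowledge it so the sender stops retrying,
            # but do not re-execute the side effect.
            return 200
        self.stock[sku] = self.stock.get(sku, 0) + delta
        self.processed_events.add(event_id)
        return 200


tool = InventoryTool(stock={"SKU-1": 10})
first = tool.adjust_inventory("evt_001", "SKU-1", -1)
retry = tool.adjust_inventory("evt_001", "SKU-1", -1)  # the vendor redelivers
```

In a real system the stock update and the processed-event record would need to commit in the same transaction. Otherwise a crash between the two reopens the gap.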

5. The duplicate Jira

An incident-triage agent ingests a support email and creates a Jira ticket. Framework response timeout: 8 seconds. Jira instance under load: regularly 12 seconds. Agent retried. Jira created a second ticket. The triage agent's own dedup pass merged them — but the merge call timed out, retried, and produced a third ticket. By end of morning: six Jira tickets, two Slack threads, one customer email.

The pattern, stated clearly

In every case, the surface narrative was the agent's behaviour. The actual cause was an operation that was non-idempotent in the path of an at-least-once delivery semantic.

Non-idempotent operation. At-least-once delivery semantic. If those two facts are true at the same boundary, you do not have an AI failure. You have a distributed-systems failure that AI made cheaper to trigger.

The agent did not invent the retry. The agent did not invent the network timeout. The agent inherited an at-least-once world from every layer beneath it — the LLM provider's retry on rate-limit, the framework's retry on tool error, the SDK's retry on socket close, the webhook source's retry policy, the queue's redelivery contract — and pointed it at tools designed for a single human caller pressing a single button once.

The reason this pattern is hard to see in post-mortem is that no single component is "wrong." The framework's retry policy is correct. The webhook source's retry policy is correct. The downstream tool's response-on-error is technically correct. The failure is emergent — it lives at the seams between layers, where each layer assumes the layer beneath it is idempotent and does not check.

At-least-once is inescapable

Every major delivery substrate the agent ecosystem inherits is at-least-once. This is not a pessimistic framing. It is the documented behaviour:

AWS SQS standard queues document at-least-once delivery as a guarantee.
Apache Kafka defaults to at-least-once; exactly-once is opt-in via transactional config.
HTTP retries are normative: RFC 7231 specifies which methods are idempotent and can therefore be retried automatically.
Stripe's webhook docs explicitly warn: "your endpoint should be idempotent" — duplicates across a 3-day window are expected on the happy path.

Exactly-once delivery in asynchronous distributed systems with failures is impossible by formal proof — established in the 1980s, rediscovered every time a new generation tries to design around it. What you can do is build idempotent receivers and let the substrate retry as much as it wants without producing duplicate side effects.

The architectural fix

Treat every state-mutating tool call as a network call to an at-least-once delivery channel. That is the only assumption that is safe.

Three layers, in order of importance.

Layer 1 — every state-mutating tool requires an idempotency key

Not optional. Not "if the upstream service supports it." The tool's own contract enforces it.

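from typing import Annotated
from pydantic import BaseModel, Field

# Assumed to exist elsewhere: the framework's `tool` decorator, the
# `LineItem` and `Order` models, and an `oms_client` SDK instance.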

class CreateOrderInput(BaseModel):
    idempotency_key: Annotated[str, Field(min_length=16, max_length=128)]
    customer_id: str
    line_items: list[LineItem]

@tool(state_mutating=True)
def create_order(inp: CreateOrderInput) -> Order:
    # framework rejects the call before reaching the OMS
    # if idempotency_key is missing or malformed
    return oms_client.create_order(
        client_order_id=inp.idempotency_key,
        customer_id=inp.customer_id,
        line_items=inp.line_items,
    )

If the agent calls create_order(...) without a key, the call fails fast at the tool boundary with a 400 — before reaching the OMS. The framework's tool-call validator catches this in development and prevents the integration from shipping in the first place.

Layer 2 — the idempotency key has a defined synthesis rule

The agent does not "generate" the key on retry. The key is derived from the inputs of the original call — a hash of the caller, the operation, and the semantically-meaningful inputs.

import hashlib, json

def synthesize_key(tool_name: str, caller_id: str, inputs: dict) -> str:
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    payload = f"{tool_name}|{caller_id}|{canonical}".encode()
    return hashlib.sha256(payload).hexdigest()

On retry, the same inputs produce the same key. The key is stable across retries because it is derived, not invented. This rule directly addresses failure case 3 (the ghost order): the LLM cannot accidentally mint a new ID on retry if the ID is a deterministic hash of the input.

Layer 3 — deduplication store at the tool boundary

A cheap key-value store keyed by (tool, idempotency_key) returns the cached response on duplicate calls.

def execute_with_dedup(tool_name: str, key: str, fn, ttl_seconds=86_400):
    cached = dedup_store.get(f"{tool_name}:{key}")
    if cached is not None:
        return cached  # replay original response, no side effect
    result = fn()
    dedup_store.set(f"{tool_name}:{key}", result, ex=ttl_seconds)
    return result

TTL is generous — Stripe's 24-hour window is the canonical reference; 7 days is fine for high-cost operations like billing or order creation. Storage is cheap. A second customer charge is not.
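One caveat with the helper above: the read and the write are separate steps, so two concurrent retries can both miss the cache and both execute. A minimal in-memory sketch of closing that gap, assuming a single process (the `InMemoryDedup` class is hypothetical; a multi-process deployment would need an atomic claim in a shared store, and TTL handling is omitted here):

```python
import threading
from typing import Any, Callable


class InMemoryDedup:
    """Illustrative single-process dedup store with one lock per key."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._key_locks: dict[str, threading.Lock] = {}
        self._results: dict[str, Any] = {}

    def run_once(self, tool_name: str, key: str, fn: Callable[[], Any]) -> Any:
        slot = f"{tool_name}:{key}"
        with self._guard:
            key_lock = self._key_locks.setdefault(slot, threading.Lock())
        # Only one caller per key runs fn; the others wait, then replay its result.
        with key_lock:
            if slot in self._results:
                return self._results[slot]
            result = fn()  # if fn raises, nothing is cached and a retry may run it
            self._results[slot] = result
            return result


calls: list[int] = []
store = InMemoryDedup()


def create() -> str:
    calls.append(1)
    return "order-123"


threads = [
    threading.Thread(target=store.run_once, args=("create_order", "k1", create))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent callers with the same key produce one execution. The design choice is to serialize per key and not globally, so unrelated tool calls do not block each other.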

This is not novel architecture. Stripe published the canonical pattern for it in 2017. The reason it does not exist by default in agent frameworks is that the frameworks were optimized for prototyping, not production — and the production cost of the missing layer only becomes visible after the first incident.

The deeper reason it does not exist is that the frameworks are converging on the wrong default. They optimize for "make tool calls easy" — correct for prototyping — but the production-correct default is "make tool calls safe". Easy and safe are not the same. The frameworks that ship safe-by-default tool wrapping in the next 18 months will eat the lunch of the ones that ship easy-by-default. This pattern repeats every time a substrate matures. It happened to RPC. It happened to REST. It will happen to agents.

Three engineering rules for 2026

Three rules I am asking every team I work with to adopt. They are not new — they are what a Stripe engineer would have given you in 2018, restated for an agent context.

Rule 1 — Tools, not agents, own idempotency. The agent is non-deterministic by design. The tool is the deterministic boundary. The contract belongs there. Every state-mutating tool exposes an idempotency_key parameter; the framework synthesizes it from inputs if the agent does not supply one.

Rule 2 — Test retries explicitly. Every state-mutating tool ships with a regression test that calls it twice with the same inputs and asserts identical end state. CI catches the violation before the framework's retry policy does. The single most cost-effective test you can add to an agent codebase, and almost no team I have worked with is doing it consistently.

def test_create_order_is_idempotent():
    inputs = sample_order_input()
    first = create_order(inputs)
    second = create_order(inputs)  # same idempotency_key derived
    assert first.order_id == second.order_id
    assert oms_client.order_count(inputs.customer_id) == 1

Rule 3 — Treat idempotency as a versioned contract. When the tool's input shape changes, the key derivation changes, and old in-flight retries should fail closed, not silently re-execute against the new shape. Most teams miss this on the first refactor and discover it on the second incident.
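One way to make Rule 3 concrete is to carry a schema version inside the key, so the tool can reject keys minted under an older contract. A minimal sketch, assuming a hypothetical version tag such as `create_order.v2` that the team bumps whenever the input shape changes:

```python
import hashlib
import json

SCHEMA_VERSION = "create_order.v2"  # hypothetical tag, bumped on input-shape changes


def versioned_key(schema_version: str, caller_id: str, inputs: dict) -> str:
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    payload = f"{schema_version}|{caller_id}|{canonical}".encode()
    digest = hashlib.sha256(payload).hexdigest()
    # Keep the version readable in the key so the tool can inspect it.
    return f"{schema_version}:{digest}"


def accept_key(key: str, current_version: str = SCHEMA_VERSION) -> None:
    """Fail closed when a retry carries a key minted under an older contract."""
    version, _, digest = key.partition(":")
    if version != current_version or not digest:
        raise ValueError(f"stale or malformed idempotency key (version {version!r})")


old_key = versioned_key("create_order.v1", "agent-7", {"customer_id": "c1"})
new_key = versioned_key(SCHEMA_VERSION, "agent-7", {"customer_id": "c1"})
```

An in-flight retry that still carries `old_key` raises at the boundary instead of executing against the new input shape.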

These three rules together cost a small engineering tax — perhaps 5% on tool development time — and prevent every one of the five failure modes above. The math is not subtle.

What this costs when you skip it

Direct revenue impact when duplicate billing requires refund + concession.
Trust erosion when fourteen-email incidents hit social media.
Engineering time when reconciliation between a ledger and an entitlement system takes a week.
Audit surface when finance discovers the system of record for charges and the system of record for grants disagree.
Project survival when leadership concludes the agent platform is "not production-ready" and pulls the funding. This is the failure mode behind Gartner's 40% project-cancellation forecast — not the AI being insufficiently capable, but the integration around it being insufficiently durable.

In every post-mortem I have run on these incidents, the cost-to-fix-after is at least 10× the cost-to-design-correctly-before.

Closing

The agent ecosystem is going through the same maturation curve every distributed-systems substrate has gone through. The 1990s had it for RPC. The 2000s had it for SOAP. The 2010s had it for REST and webhooks. Each generation rediscovered idempotency the hard way, usually after a billing incident hit the press.

The 2020s have it for agents. The good news is that we know the answer. The bad news is that the framework defaults are not yet aligned to it, and the production incidents are paying for the misalignment.

If you are building anything where an agent calls a tool that mutates state, the most useful question you can ask this quarter is: what happens if this exact call is made twice? If the answer is anything other than "the same thing happens once," you have an incident in your future. The only variable is the timing.

Idempotency is not a clever pattern. It is a 25-year-old constraint that distributed-systems people stopped negotiating about a long time ago. The agent ecosystem is currently rediscovering why.

The fix is older than most of the engineers shipping the bug.

This post is part of a four-week series connecting old software-engineering principles to new AI failure modes. Originally published on biztechbridge.com.

How to Compute Zero Trust Effectiveness: Four Metrics That Survive a Breach

Sven Schuchardt — Wed, 29 Apr 2026 16:29:10 +0000

Three hops captures the realistic post-compromise reach inside a typical enterprise environment. If your IAM tooling does not expose a graph, the practical substitute is "count of distinct resources the identity has permission to read or modify within 60 minutes of session start, assuming no MFA step-up triggers."
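If you can export who-can-reach-what as edges, the three-hop count is a depth-limited breadth-first search. A minimal sketch, assuming a hypothetical adjacency map in which identities, roles, and resources are all nodes (the node names are made up for illustration):

```python
from collections import deque


def reachable_within(graph: dict[str, list[str]], start: str, max_hops: int = 3) -> set[str]:
    """Return every node reachable from start in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    seen.discard(start)
    return seen


access_graph = {
    "svc-build": ["role-ci"],
    "role-ci": ["repo-app", "secret-store"],
    "secret-store": ["db-prod"],
    "db-prod": ["backup-bucket"],
}
blast_radius = reachable_within(access_graph, "svc-build")
```

Here `backup-bucket` sits four hops out and is excluded. This version also counts intermediate nodes such as roles, so in practice you would filter the result down to resources before reporting the number.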

What good looks like

Privileged human identity: under 50 reachable resources, zero crown-jewel data classes without step-up
Standard human identity: under 200 reachable resources, no production data without explicit grant
Service account: scoped to a single namespace or workload — under 10 reachable resources is normal, over 100 is a problem

Report this metric per identity class, not as a single org-wide average. The average hides the outliers, and the outliers are what get exploited.

Metric 2: Lateral-movement time-to-detect

Lateral-movement TTD is the median time between an attacker's first action on a compromised host and the moment your SOC opens a case for the second host. Every Zero Trust programme implicitly claims to reduce this number. Most never measure it.

How to compute it

The easiest source is your EDR plus your SIEM. You need two timestamps per simulated or real lateral-movement event: the first-hop alert on the compromised host and the second-hop alert or case on the next host.

// Microsoft Sentinel / KQL — adapt to Splunk / Elastic / Chronicle
let lateralEvents = SecurityAlert
  | where AlertName has_any ("Pass-the-hash", "Suspicious WMI", "RDP from unusual host", "Service account used from new asset")
  | project firstHopTime = TimeGenerated, firstHost = CompromisedEntity, alertId = SystemAlertId;
let secondHopAlerts = SecurityAlert
  | where AlertName has_any ("Suspicious lateral connection", "Credential reuse on new host")
  | project secondHopTime = TimeGenerated, secondHost = CompromisedEntity, correlationId = SystemAlertId;
lateralEvents
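  // As written, both sides project their own SystemAlertId, which is unique per
  // alert, so this join matches nothing. Substitute a shared incident or
  // exercise identifier from your environment on both sides.
  | join kind=inner (secondHopAlerts) on $left.alertId == $right.correlationId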
  | extend ttd_minutes = datetime_diff('minute', secondHopTime, firstHopTime)
  | summarize p50 = percentile(ttd_minutes, 50), p90 = percentile(ttd_minutes, 90)

If you are not running purple-team exercises that produce real lateral-movement signal, your TTD is technically infinite — and that is the metric you should report. Quarterly attack simulations are the cheapest way to populate this number honestly.
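If your exercise results live in a spreadsheet and not in a SIEM, the same two percentiles are a few lines of standard-library Python. A minimal sketch with made-up timestamps; the interpolation method is an assumption and may differ slightly from what your SIEM's percentile function returns:

```python
from datetime import datetime
from statistics import median, quantiles

# (first-hop time, second-hop case opened) pairs; values are illustrative only.
events = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 9, 8)),
    (datetime(2026, 4, 8, 14, 0), datetime(2026, 4, 8, 14, 12)),
    (datetime(2026, 4, 15, 11, 0), datetime(2026, 4, 15, 11, 25)),
    (datetime(2026, 4, 22, 16, 0), datetime(2026, 4, 22, 16, 40)),
    (datetime(2026, 4, 29, 10, 0), datetime(2026, 4, 29, 11, 35)),
]

ttd_minutes = sorted((opened - first).total_seconds() / 60 for first, opened in events)

p50 = median(ttd_minutes)
# quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
p90 = quantiles(ttd_minutes, n=10, method="inclusive")[-1]
```

With only a handful of exercises per quarter the p90 is dominated by the slowest run, which is arguably the point: one slow detection is what an attacker needs.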

What good looks like

Mature programme: p50 under 10 minutes, p90 under 30 minutes
Functional programme: p50 under 60 minutes, p90 under 4 hours
Untested programme: unknown — and "unknown" is a board-grade red flag

The IBM 2025 Cost of a Data Breach Report shows breaches contained in under 200 days cost $1.14M less on average than slower ones. Lateral-movement TTD is the leading indicator that determines containment time.

Metric 3: Service-account scope drift

Human identities have managers, review cycles, and offboarding. Service accounts and machine identities have none of these by default — and they outnumber human identities roughly 82 to 1 in a typical enterprise. Scope drift measures how their permissions change quarter over quarter without explicit human approval.

How to compute it

-- Compare snapshot of service-account permissions across two points in time
WITH current_perms AS (
  SELECT identity_id, permission, granted_at
  FROM iam_permissions_snapshot
  WHERE snapshot_date = CURRENT_DATE
    AND identity_type = 'service_account'
),
baseline_perms AS (
  SELECT identity_id, permission
  FROM iam_permissions_snapshot
  WHERE snapshot_date = CURRENT_DATE - INTERVAL '90 days'
    AND identity_type = 'service_account'
),
drift AS (
  SELECT
    c.identity_id,
    c.permission,
    c.granted_at,
    CASE
      WHEN EXISTS (SELECT 1 FROM change_approvals a
                   WHERE a.identity_id = c.identity_id
                     AND a.permission = c.permission
                     AND a.approved_at BETWEEN c.granted_at - INTERVAL '7 days'
                                            AND c.granted_at)
      THEN 'approved'
      ELSE 'unapproved'
    END AS approval_status
  FROM current_perms c
  LEFT JOIN baseline_perms b
    ON c.identity_id = b.identity_id AND c.permission = b.permission
  WHERE b.permission IS NULL  -- new permission since baseline
)
SELECT approval_status, COUNT(*) AS new_perms
FROM drift
GROUP BY approval_status;

The number you report is the count of unapproved new permissions per quarter, plus the top ten service accounts that gained the most scope.

What good looks like

Quarterly unapproved drift: under 5% of total permission changes
Zero service accounts in the top-ten that touch crown-jewel data classes
Every "approved" entry traces to a ticket or change record

Anything above 15% unapproved drift means your IAM hygiene has decayed, regardless of how many controls you have deployed.
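To turn the query's two counts into the percentage these thresholds refer to, a small helper is enough. A minimal sketch, assuming the denominator is the total of new permissions returned by the drift query; the label for the band between 5% and 15% is my own placeholder, not an established term:

```python
def unapproved_drift_pct(counts: dict[str, int]) -> float:
    """Share of new permissions that have no matching approval, in percent."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * counts.get("unapproved", 0) / total


def classify(pct: float) -> str:
    if pct < 5.0:
        return "good"
    if pct > 15.0:
        return "decayed"
    return "watch"  # placeholder label for the band in between


# Illustrative counts shaped like the query output: approval_status -> new_perms
quarter = {"approved": 188, "unapproved": 12}
pct = unapproved_drift_pct(quarter)
status = classify(pct)
```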

Metric 4: Exception age

Every Zero Trust programme accumulates exceptions: the legacy app that cannot do MFA, the build server that needs a static credential, the compliance carve-out for a specific business unit. These are unavoidable. What is avoidable is letting them age.

Exception age is the median number of days an active policy exception has been in production.

How to compute it

The exception register is your source of truth. It needs at least three fields per entry: opened date, business owner, and committed remediation date. The query below also assumes a category and a status column, and is otherwise trivial:

SELECT
  exception_category,
  COUNT(*) AS active_count,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY DATE_PART('day', NOW() - opened_at)) AS p50_age_days,
  PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY DATE_PART('day', NOW() - opened_at)) AS p90_age_days,
  COUNT(*) FILTER (WHERE remediation_committed_at < NOW()) AS overdue_count
FROM zt_exceptions
WHERE status = 'active'
GROUP BY exception_category
ORDER BY p90_age_days DESC;

If you do not have an exception register, that is the metric you should report: "number of policy exceptions tracked: zero — and we know that is wrong."

What good looks like

Median exception age: under 90 days
p90 exception age: under 180 days
Overdue (past committed remediation date): zero
Every entry has a named human owner, not a team distribution list

The most uncomfortable version of this metric is the expired exception count — exceptions whose stated business justification is no longer true but which remain in production because nobody owns the cleanup. Surface that number deliberately.

Putting the four metrics together

The four metrics tell a coherent story when reported together:

Pattern	Diagnosis
Blast radius high, TTD low	Detection is fast but identity scope is too broad. Tighten least-privilege.
Blast radius low, TTD high	Containment is structurally sound but observability is weak. Invest in EDR + UEBA.
Drift high, exception age low	New permissions outpace cleanup. Tighten IAM change control.
Drift low, exception age high	Stable IAM, but the exception register is a parking lot. Force re-justification quarterly.
All four red	The programme is doing activity work. Stop deploying and start measuring.

Notice none of these four metrics are coverage percentages. None of them go up just because you bought a tool. Every one of them requires a human to make a decision about whether the current number is acceptable — which is the entire point.

What to put on the board slide

Translate the four metrics into the only sentence the board cares about:

"If an attacker compromises one identity tomorrow, the blast radius is N systems containing C crown-jewel data classes, our median time to detect a second hop is T minutes, and we currently carry E policy exceptions with a median age of A days."

That single sentence is the dashboard. Everything else — the rings, the percentages, the heatmaps — is supporting evidence. If you cannot answer it from your current tooling in under five minutes, the gap is not a tooling gap. It is a measurement-discipline gap, and no amount of additional Zero Trust deployment will close it.

Closing

Zero Trust is a security discipline that lives or dies by what you measure. Activity metrics make the programme look healthy in year one and vanish in year two when the breach happens anyway. Effectiveness metrics are uglier, harder to compute, and they survive contact with reality.

Pick the four. Compute them honestly. Report the awkward numbers alongside the impressive ones. The CISOs getting real budget in 2026 are the ones whose dashboards make leadership uncomfortable on purpose — because uncomfortable numbers are the only ones a board can act on.

Originally published at biztechbridge.com. For the strategic framing of these metrics in board reporting, see Measuring Zero Trust: The Dashboard Your Board Wants to See.

How to Measure Voluntary Adoption of Your Internal Developer Platform

Sven Schuchardt — Mon, 27 Apr 2026 06:44:45 +0000

If your platform team only tracks "services onboarded" or "deployments per week," you are measuring compliance, not value. The single metric that predicts whether your Internal Developer Platform (IDP) will deliver return on investment is voluntary adoption rate of the golden path — the percentage of new work that chooses the paved road when an off-road option still exists.

This article shows three ways to measure it concretely, using Backstage, GitHub, Argo CD, and Prometheus. It is the technical companion to the broader Platform Engineering business case — that piece argues why voluntary adoption matters; this one shows how to compute it.

Why activity metrics mislead platform teams

Most platform dashboards report on activity:

Number of templates run
Services in the catalog
Pipeline executions per day
Onboarded teams

These numbers go up regardless of whether developers actually like the platform. A mandated platform produces the same activity graph as a beloved one — until attrition spikes and the post-mortem reveals that nobody was self-serving anything; they were filing tickets to comply with a policy.

Voluntary adoption asks a harder, more honest question:

When developers had a real choice, what did they pick?

If the answer trends toward the golden path over time, the platform is genuinely removing friction. If it trends away — or if there was never an off-road option to reject — you do not have signal. You have theatre.

The three measurement layers

Layer	What it measures	Source	Cadence
1. Path-of-least-resistance rate	% of new services created via the golden-path template vs. ad-hoc	Backstage Scaffolder + GitHub repo creation events	Weekly
2. Stickiness rate	% of services still on the golden path 90 days after creation	Catalog metadata + drift detection	Monthly
3. Re-entry rate	% of legacy services voluntarily migrating onto the platform without a mandate	Catalog + GitOps PR activity	Quarterly

You want all three trending up. Activity metrics — deploys, builds, pipeline runs — are downstream of these and noisy on their own.

Layer 1: path-of-least-resistance rate

The Backstage Scaffolder logs every template execution. Cross-reference that against all new repositories created in your GitHub organisation during the same window. The ratio between the two is your weekly voluntary adoption rate for new work.

-- Pseudocode against your data warehouse
SELECT
  date_trunc('week', created_at) AS week,
  COUNT(*) FILTER (WHERE source = 'backstage_scaffolder') AS golden_path,
  COUNT(*) FILTER (WHERE source = 'manual_repo')          AS off_road,
  ROUND(
    COUNT(*) FILTER (WHERE source = 'backstage_scaffolder')::numeric
    / NULLIF(COUNT(*), 0) * 100, 1
  ) AS voluntary_adoption_pct
FROM new_services
GROUP BY 1
ORDER BY 1;

Healthy trend: 60% → 80%+ over six months. Stalled below 40%? Your golden path is not actually the easiest path. Find out why before adding more features. The most common culprits are template rigidity, slow scaffold time, and missing escape hatches for legitimate edge cases.

Layer 2: stickiness rate

Services that get scaffolded from a template often drift off the paved road within weeks. Teams bypass the CI pipeline, stop publishing to the catalog, hand-edit the Helm chart instead of using the platform-provided one. Stickiness measures how many services are still genuinely platform-managed 90 days after creation.

Detect drift with a periodic reconciliation job and an annotation contract:

# Argo CD Application — every service should carry these annotations
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    platform.bz/golden-path-version: "2.4"
    platform.bz/last-reconciled: "2026-04-24T08:12:00Z"

Then ask: of services scaffolded 90 days ago, how many still carry a current golden-path-version annotation and pass policy admission without exemption?

That ratio is your stickiness rate. Sub-70% means your golden path is too rigid — developers leave because the road does not bend where it needs to. The fix is rarely more enforcement; it is usually more flexibility within the paved road, so teams do not need to step off it for legitimate variations.
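Once the reconciliation job has exported catalog metadata, the ratio itself is a short computation. A minimal sketch, assuming a hypothetical `Service` record with the fields shown and treating "scaffolded 90 days ago" as "at least 90 days old":

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Service:
    name: str
    scaffolded_on: date
    golden_path_version: Optional[str]  # None if the annotation was removed
    policy_exempt: bool


def stickiness_rate(
    services: list[Service], today: date, current_version: str, window_days: int = 90
) -> Optional[float]:
    """Percent of services at least window_days old that are still on the golden path."""
    cutoff = today - timedelta(days=window_days)
    cohort = [s for s in services if s.scaffolded_on <= cutoff]
    if not cohort:
        return None  # no signal yet; do not report 0% or 100%
    sticky = [
        s for s in cohort
        if s.golden_path_version == current_version and not s.policy_exempt
    ]
    return 100.0 * len(sticky) / len(cohort)


catalog = [
    Service("payments", date(2026, 1, 10), "2.4", False),
    Service("search", date(2026, 1, 15), "2.1", False),    # stale version
    Service("ledger", date(2025, 12, 20), "2.4", False),
    Service("notify", date(2026, 3, 30), "2.4", False),    # too young for the cohort
]
rate = stickiness_rate(catalog, today=date(2026, 4, 27), current_version="2.4")
```

Returning `None` for an empty cohort is deliberate: a platform with no services old enough to judge has no stickiness number, and reporting one would be misleading.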

Layer 3: re-entry rate

The hardest test, and the most informative one. Of services that existed before the platform, how many have voluntarily migrated onto it in the last quarter — without a top-down mandate forcing the move?

# Prometheus — services emitting platform-managed telemetry that were not managed 90 days ago
count(  # number of distinct services, even when each one emits several series
  count by (service) (
    platform_managed_service_info{managed="true"}
    unless on(service)
    platform_managed_service_info{managed="true"} offset 90d
  )
)

This is the return-on-investment signal a CFO can defend in a budget review. A voluntary re-entry means the platform earned the developer's trust enough that they ported a working production service onto it — at their own initiative, on their own schedule, against the inertia of "if it works, do not touch it."

If you are seeing zero quarterly re-entries despite a healthy Layer 1 number, the migration cost is too high. Build a one-command migration template for the most common service shape and watch the rate move.

Reading the three numbers together

Each layer in isolation lies. Read them as a system:

Pattern	Diagnosis
L1 high, L2 low	Templates are good; the runtime experience drifts. Invest in policy automation and reconciliation.
L1 low, L2 high	The few who adopt love it; discoverability is broken. Invest in developer experience and template marketing, not features.
L1 high, L3 zero	New work uses the platform; old work will not migrate. Build a migration template.
All three flat	You probably have a mandate hiding the truth. Remove it for one team and remeasure honestly.

The one anti-pattern to avoid

Do not report voluntary adoption as a single headline percentage to leadership without the three layers underneath. A 92% headline with 30% stickiness is materially worse than a 60% headline with 85% stickiness — the first is a compliance illusion that will collapse when the mandate lifts; the second is a working product with room to grow.

Platform engineering is a product discipline. Measure it the way a product team measures retention, not the way an ops team measures uptime.

Closing

The mandate question is not ideological — it is a measurement question. If you can prove voluntary adoption is rising, you do not need a mandate. If you cannot measure it, no mandate will save the platform when budget season arrives and the CFO asks what changed.

What is your platform's voluntary adoption rate right now? And, more importantly: does anyone on your team know how to compute it?

For the broader strategic case behind this metric — Forrester ROI numbers, developer attrition costs, and why mandated adoption destroys returns — see the Platform Engineering business case.

For the reference architecture and tooling choices that make these measurements possible, see the Platform Engineering technology deep dive.

Zero Trust is Not a Security Tool — It’s a Software Design Problem

Sven Schuchardt — Sun, 19 Apr 2026 17:47:27 +0000

Most Zero Trust discussions focus on tools:

ZTNA
micro-segmentation
identity providers
SASE platforms

That’s useful.

But it misses the point.

The real problem

Modern systems don’t behave like the architectures Zero Trust was originally designed to fix.

Today, most traffic is:

encrypted
service-to-service
happening inside your system

Which means:

The “trusted internal network” assumption is already broken.

And yet, many implementations still rely on it.

What Zero Trust actually requires

Zero Trust Architecture is often described as:

“Never trust, always verify”

That sounds simple.

But the technical implication is not.

It means:

every request must be authenticated
every request must be authorized
continuously, not just at login

That’s not a network change.

That’s a system design change.

Where most implementations break

In practice, the failure point is rarely the edge.

It’s inside the system:

internal APIs don’t enforce authentication
service-to-service calls rely on network trust
authorization logic is inconsistent
policies are not version-controlled

In other words:

We removed the perimeter… but kept the assumptions.

The shift most teams underestimate

To make Zero Trust work, three things need to change:

1. Identity is no longer just for users

Workloads need identity too.

Patterns like SPIFFE/SPIRE provide:

short-lived identities
tied to workload, not IP
automatically rotated

Without this, mTLS becomes operationally painful or inconsistent.
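As a rough sketch of what "tied to workload, not IP" means in code, the check below accepts a caller based on its SPIFFE ID and expiry instead of its network address. The `spiffe://<trust-domain>/<path>` layout is the standard SPIFFE ID format. The identity object shape and the function names are assumptions for illustration. In a real deployment the identity arrives as an SVID and is verified by a SPIFFE library, not by hand.

```javascript
// SPIFFE ID layout: spiffe://<trust-domain>/<path>
const SPIFFE_ID = /^spiffe:\/\/([a-z0-9._-]+)(\/[^?#]*)?$/;

function parseSpiffeId(id) {
  const match = SPIFFE_ID.exec(id);
  if (!match) return null;
  return { trustDomain: match[1], path: match[2] || "" };
}

// Hypothetical acceptance check: identity = { spiffeId, expiresAt (epoch seconds) }.
function isAcceptableWorkload(identity, trustedDomain, nowSeconds) {
  const parsed = parseSpiffeId(identity.spiffeId);
  if (!parsed) return false;                              // malformed ID
  if (parsed.trustDomain !== trustedDomain) return false; // foreign trust domain
  return nowSeconds < identity.expiresAt;                 // short-lived: reject once expired
}

const billing = { spiffeId: "spiffe://example.org/ns/prod/sa/billing", expiresAt: 1000 };
console.log(isAcceptableWorkload(billing, "example.org", 900));  // → true
console.log(isAcceptableWorkload(billing, "example.org", 1100)); // → false
```

Nothing in the check looks at an IP address, so the decision survives rescheduling, autoscaling, and network changes.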

2. Authorization becomes per-request

Checking access at login is not enough.

You need:

request-level validation
resource-level authorization
context-aware decisions

This is why patterns like:

API gateways
service mesh policies
policy-as-code (e.g. OPA)

become critical.
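The three requirements above can be sketched as a single decision function that runs on every request. The request shape, the `grants` map, and the step-up rule are hypothetical. A real system would typically delegate this to a gateway, mesh policy, or policy engine instead of application code.

```javascript
// Hypothetical per-request policy check, evaluated on every call, not once at login.
function authorize(request) {
  const { principal, action, resource, context } = request;

  // Request-level validation: reject anything without a verified identity.
  if (!principal || !principal.verified) {
    return { allow: false, reason: "unauthenticated" };
  }
  // Resource-level authorization: the caller must hold this action on this resource.
  const granted = principal.grants[resource.id] || [];
  if (!granted.includes(action)) {
    return { allow: false, reason: "not permitted on resource" };
  }
  // Context-aware decision: same caller, different answer depending on context.
  if (resource.sensitive && !context.mfa) {
    return { allow: false, reason: "step-up authentication required" };
  }
  return { allow: true, reason: "ok" };
}

const caller = { verified: true, grants: { "invoice-42": ["read"] } };
const invoice = { id: "invoice-42", sensitive: true };

console.log(authorize({ principal: caller, action: "read", resource: invoice, context: { mfa: true } }).allow);  // → true
console.log(authorize({ principal: caller, action: "read", resource: invoice, context: { mfa: false } }).reason); // → "step-up authentication required"
```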

3. Security moves into the delivery pipeline

If policies only exist at runtime, you are already too late.

Teams that push Zero Trust controls into CI/CD:

catch violations earlier
reduce production incidents significantly
avoid breaking changes at enforcement time
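A minimal sketch of such a pipeline check, assuming each service ships a route manifest with `path`, `auth`, and `policy` fields. That manifest format and the function name are made up for illustration. Real setups usually evaluate policy-as-code rules against deployment manifests instead.

```javascript
// Hypothetical CI gate: list every route that would ship without enforced
// authentication or without an attached authorization policy.
function findPolicyViolations(routes) {
  const violations = [];
  for (const route of routes) {
    if (route.auth !== "required") {
      violations.push(`${route.path}: authentication is not enforced`);
    }
    if (!route.policy) {
      violations.push(`${route.path}: no authorization policy attached`);
    }
  }
  return violations;
}

const manifest = [
  { path: "/orders", auth: "required", policy: "orders.read" },
  { path: "/internal/reports", auth: "none", policy: null },
];

const violations = findPolicyViolations(manifest);
violations.forEach((v) => console.error(v));
// In a pipeline step, a non-zero exit code would block the merge:
// process.exitCode = violations.length > 0 ? 1 : 0;
```

The point of running this before deployment is that the internal route fails review while it is still a pull request, not after enforcement is switched on in production.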

The uncomfortable takeaway

Zero Trust is not something you “implement” with a tool.

It’s something you design into your system.

If your architecture still assumes:

trusted internal networks
static roles
one-time authentication

then adding Zero Trust tooling will mostly add complexity.

Not security.

What to do instead

Start with a different question:

Where does implicit trust still exist in our system?

You’ll usually find it:

between services
in internal APIs
in long-lived credentials
in developer workflows

That’s your real attack surface.

If you want to go deeper

I wrote a full breakdown of:

how NIST SP 800-207 maps to real systems
mTLS and workload identity
SPIFFE/SPIRE, OPA, and secrets management
what actually changes for development teams

👉 https://biztechbridge.com/insights/zero-trust-architecture-technology

Final thought

Zero Trust is often sold as a security upgrade.

In reality, it’s closer to a paradigm shift in how systems make decisions.

And that shift is still underestimated in most implementations.

JWT Explained: What's Actually Inside a JSON Web Token

Sven Schuchardt — Fri, 10 Apr 2026 17:09:14 +0000

You're integrating an API and you get back a token that starts with eyJ. You paste it somewhere and suddenly you can read your user's email address, their user ID, and an expiry timestamp. No decryption key needed. How? And if anyone can read it, is that secure?

JWTs look encrypted but aren't. That tension — readable but trustworthy — is the whole point. Understanding it takes about five minutes, and it changes how you think about auth tokens for good.

What is a JWT?

A JSON Web Token is three base64url-encoded strings joined by dots:

header.payload.signature

Take a minimal example (the third segment here is a placeholder, not a real signature):

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJ1c2VyXzEyMyIsImVtYWlsIjoidXNlckBleGFtcGxlLmNvbSIsImV4cCI6MTcxMjcwMDAwMH0.signature

Each part can be decoded in a browser console right now — no keys, no secrets, no libraries:

// Manually decode the payload (works in any browser console)
const token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJ1c2VyXzEyMyIsImVtYWlsIjoidXNlckBleGFtcGxlLmNvbSIsImV4cCI6MTcxMjcwMDAwMH0.signature";
const payload = JSON.parse(atob(token.split('.')[1].replace(/-/g, '+').replace(/_/g, '/')));
console.log(payload);
// { sub: "user_123", email: "user@example.com", exp: 1712700000 }

Part 1 — Header: Contains alg (the signing algorithm, e.g. HS256 or RS256) and typ (the token type, usually "JWT"). Decoded, it looks like { "alg": "HS256", "typ": "JWT" }.

Part 2 — Payload: The claims — data statements about the user or token. These are just JSON key-value pairs. Standard claim names are short by convention (sub, exp, iat) but the values can be anything. Custom claims like role or org_id are perfectly valid.

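Part 3 — Signature: An HMAC (HS256) or RSA signature (RS256) computed over base64url(header) + "." + base64url(payload). With HS256 the key is a secret shared between the issuer and the verifier; with RS256 it is a private key held only by the issuer. This is the part that makes the token trustworthy: not because the payload is hidden, but because it is tamper-evident.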

The key insight: JWTs are signed, not encrypted. The payload is readable by anyone who has the token. Only a party holding the signing key can produce a valid signature.
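To make the structure concrete, here is a minimal Node.js sketch of HS256 signing built on the standard `crypto` module. The secret and the claims are made-up values, and `signHs256` is a hypothetical helper. In production you would use a maintained JWT library instead of hand-rolling this.

```javascript
const crypto = require("crypto");

// base64url = base64 with "+" -> "-", "/" -> "_", and "=" padding removed.
const toBase64Url = (buf) =>
  buf.toString("base64").replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");

function signHs256(payload, secret) {
  const header = { alg: "HS256", typ: "JWT" };
  const encode = (obj) => toBase64Url(Buffer.from(JSON.stringify(obj)));
  // The signature covers exactly this string: header and payload, dot-joined.
  const signingInput = `${encode(header)}.${encode(payload)}`;
  const signature = toBase64Url(
    crypto.createHmac("sha256", secret).update(signingInput).digest()
  );
  return `${signingInput}.${signature}`;
}

const token = signHs256({ sub: "user_123" }, "demo-secret-do-not-reuse");
console.log(token.split(".").length); // → 3
```

Note that nothing here encrypts the payload. The secret only feeds the HMAC, which is why the first two segments stay readable.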

Standard Claims

The JWT spec defines a set of registered claim names. You don't have to use them, but you should: virtually every JWT library understands them.

Claim	Name	Meaning
`sub`	Subject	User identifier (user ID, email, etc.)
`iss`	Issuer	Who created the token (your auth server)
`aud`	Audience	Who the token is intended for
`exp`	Expiration	Unix timestamp when token expires
`iat`	Issued at	Unix timestamp when token was created
`nbf`	Not before	Token not valid before this timestamp
`jti`	JWT ID	Unique token identifier (for revocation)

exp and iat are Unix timestamps: seconds since January 1, 1970 UTC. An exp of 1712700000 means the token expires at 2024-04-09 22:00:00 UTC. Paste any JWT into our JWT decoder tool to see the header, payload, and claims broken out, without sending the token to any server.
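Because these claims are plain integers, checking them is simple comparison. The sketch below is a minimal illustration. `checkTimeClaims` is a hypothetical name, and the current time is passed in as a parameter so the behaviour is easy to test.

```javascript
// Returns "ok", "expired", or "not yet valid" for a decoded payload.
// nowSeconds is the current time in epoch seconds.
function checkTimeClaims(payload, nowSeconds) {
  // A token is only valid strictly before its exp timestamp.
  if (typeof payload.exp === "number" && nowSeconds >= payload.exp) {
    return "expired";
  }
  // A token must not be accepted before its nbf timestamp.
  if (typeof payload.nbf === "number" && nowSeconds < payload.nbf) {
    return "not yet valid";
  }
  return "ok";
}

const claims = { sub: "user_123", exp: 1712700000 };
console.log(checkTimeClaims(claims, 1712699999)); // → "ok"
console.log(checkTimeClaims(claims, 1712700000)); // → "expired"
```

In real code, nowSeconds would come from Math.floor(Date.now() / 1000), often with a few seconds of tolerance for clock skew between servers.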

Why it Works — and Where it Doesn't

The signature prevents tampering. If you change even one byte of the payload, the signature becomes invalid. With an HMAC algorithm, the server verifies by re-computing the signature with its own secret and comparing; with an RSA algorithm, it checks the signature against the issuer's public key. If verification succeeds, the payload hasn't been touched since the issuer signed it.
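The recompute-and-compare step can be sketched in Node.js as follows, for the HMAC case only. `verifyHs256` and the demo secret are assumptions for illustration. A production verifier would also pin the expected alg and check exp, which a maintained JWT library does for you.

```javascript
const crypto = require("crypto");

const toBase64Url = (buf) =>
  buf.toString("base64").replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");

const hmacSignature = (signingInput, secret) =>
  toBase64Url(crypto.createHmac("sha256", secret).update(signingInput).digest());

function verifyHs256(token, secret) {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const [header, payload, signature] = parts;
  const expected = Buffer.from(hmacSignature(`${header}.${payload}`, secret));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

// Demo: sign a payload, then swap in a forged payload but keep the old signature.
const secret = "demo-secret-do-not-reuse";
const encode = (obj) => toBase64Url(Buffer.from(JSON.stringify(obj)));
const signingInput = `${encode({ alg: "HS256", typ: "JWT" })}.${encode({ sub: "user_123" })}`;
const genuine = `${signingInput}.${hmacSignature(signingInput, secret)}`;
const [h, , s] = genuine.split(".");
const forged = [h, encode({ sub: "admin" }), s].join(".");

console.log(verifyHs256(genuine, secret)); // → true
console.log(verifyHs256(forged, secret));  // → false
```

The forged token still decodes to a perfectly readable { sub: "admin" }, which is exactly why decoding alone proves nothing.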

But the payload is public. Everyone who holds the token can read it. That means:

Never put passwords, credit card numbers, or API secrets in a JWT payload.
Never put anything in the payload that you wouldn't be comfortable storing in a cookie your users can read.
Session tokens and user IDs are fine. Sensitive personal data should stay server-side.

A common mistake in early JWT implementations: accepting a token as proof of identity without verifying the signature. A token that decodes to { "sub": "admin" } proves nothing on its own — the signature is what proves it came from your auth server. Always verify server-side before trusting any claim.

Unix Timestamps Explained: What Every Developer Should Know

Sven Schuchardt — Thu, 09 Apr 2026 19:30:59 +0000

You're tailing a log file and you see this:

[1712700000] ERROR: connection timeout

What is 1712700000? Is it a bug? A timestamp? A version number? If you've ever stared at a number like that and felt unsure, this article is for you.

By the end you'll know exactly what Unix timestamps are, why every serious API uses them, and how to convert them instantly without memorising any formula.

What Is a Unix Timestamp?

A Unix timestamp (also called an epoch timestamp) is simply the number of seconds that have elapsed since January 1, 1970, 00:00:00 UTC — a moment arbitrarily chosen as the starting point of computer time, known as the Unix epoch.

That's it. No timezones, no daylight saving adjustments, no locale quirks. Just a single integer that means the same thing on every machine on the planet.

1712700000 translates to April 9, 2024, 22:00:00 UTC.

Why 1970?

Unix development began in 1969 and continued through the early 1970s. The designers needed a fixed reference point that was recent enough to keep numbers small but old enough to cover any historical dates they cared about. January 1, 1970 was a clean, round choice that stuck, and 50+ years later the entire industry still uses it.

Seconds vs Milliseconds — the most common gotcha

Two variants exist and they will burn you if you mix them up:

Format	Example	Used by
Seconds (Unix time)	`1712700000`	Most Unix APIs, databases, server logs
Milliseconds	`1712700000000`	JavaScript's `Date.now()`, Java, many web APIs

A 13-digit number is almost always milliseconds. A 10-digit number is almost always seconds. When in doubt, check the API docs.

The 2038 problem (a quick aside)

Systems that store Unix timestamps as a signed 32-bit integer max out at 2,147,483,647 seconds, which is 03:14:07 UTC on January 19, 2038. Most modern systems use 64-bit integers (which won't overflow until the year 292 billion), but if you're working with embedded systems or legacy C code, it's worth knowing.
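The limit is easy to reproduce in a console. The sketch below uses JavaScript's `| 0` operator, which forces 32-bit signed arithmetic, to imitate what a 32-bit counter does one second after the maximum. JavaScript dates themselves are not affected. This only simulates the overflow.

```javascript
// Largest value a signed 32-bit integer can hold.
const INT32_MAX = 2 ** 31 - 1; // 2147483647

console.log(new Date(INT32_MAX * 1000).toISOString());
// → "2038-01-19T03:14:07.000Z"

// One second later, a 32-bit counter wraps to the most negative value.
// "| 0" truncates the result to a signed 32-bit integer.
const wrapped = (INT32_MAX + 1) | 0; // -2147483648

console.log(new Date(wrapped * 1000).toISOString());
// → "1901-12-13T20:45:52.000Z"
```

That jump from 2038 back to 1901 is the failure mode: affected systems do not stop, they report a date more than a century in the past.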

Converting Epoch to a Human-Readable Date in JavaScript

No library needed. JavaScript's built-in Date handles it in one line:

// If your timestamp is in seconds, multiply by 1000 first
const ts = 1712700000;
const date = new Date(ts * 1000);

console.log(date.toISOString());
// → "2024-04-09T22:00:00.000Z"

console.log(date.toLocaleString('en-US', { timeZone: 'America/New_York' }));
// → "4/9/2024, 6:00:00 PM"

Going the other way — current time as epoch — is even simpler:

const nowInSeconds = Math.floor(Date.now() / 1000);
console.log(nowInSeconds); // → 1712700000 (approximately)

A debug helper worth bookmarking

When you're debugging logs, paste this into your browser console:

const fromEpoch = (ts) => {
  // Handle both seconds and milliseconds automatically
  const ms = ts > 1e12 ? ts : ts * 1000;
  return new Date(ms).toISOString();
};

fromEpoch(1712700000);    // → "2024-04-09T22:00:00.000Z"
fromEpoch(1712700000000); // → "2024-04-09T22:00:00.000Z"

Try It Without Writing Any Code

If you just need a quick conversion — during debugging, code review, or reading API docs — paste any timestamp into our free epoch converter tool. It handles both seconds and milliseconds, supports timezone selection, and works entirely in your browser. No data is ever sent to a server.

Why APIs Use Unix Time (and Not ISO Strings)

You'll notice that many major APIs, Stripe and Slack among them, return timestamps as epoch values rather than formatted date strings. There are good reasons:

No timezone ambiguity — 1712700000 is the same moment everywhere; "2024-04-09 22:00:00" is meaningless without a timezone
Easy arithmetic — want to check if something happened in the last 24 hours? now - ts < 86400. Try doing that with ISO strings.
Compact — 10 digits vs 24 characters for ISO 8601
No parsing edge cases — no locale formats, no AM/PM, no separator variations

The tradeoff is readability — which is exactly why tools and debug helpers exist.
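The "easy arithmetic" point is worth seeing in code. This is a minimal sketch. `withinLastDay` is a hypothetical helper, and both arguments are assumed to be epoch seconds.

```javascript
const DAY_SECONDS = 86400;

// True when ts falls within the 24 hours before nowSeconds.
function withinLastDay(ts, nowSeconds) {
  const age = nowSeconds - ts;
  return age >= 0 && age < DAY_SECONDS;
}

const now = 1712700000;
console.log(withinLastDay(now - 3600, now));  // → true  (one hour ago)
console.log(withinLastDay(now - 90000, now)); // → false (25 hours ago)
console.log(withinLastDay(now + 60, now));    // → false (in the future)
```

The whole check is one subtraction and two comparisons, with no parsing, no timezone handling, and no date library.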

Key Takeaways

A Unix timestamp is seconds since January 1, 1970 UTC
10 digits = seconds, 13 digits = milliseconds; multiply seconds by 1000 before passing them to new Date()
APIs use Unix time because it's unambiguous, compact, and arithmetically convenient
Convert any timestamp instantly with our epoch converter tool
