DEV Community: kt

SPIFFE Compliance Deep Dive

kt — Sun, 31 May 2026 06:05:03 +0000

Introduction

"This is SPIFFE compliant." "Our implementation is SPIFFE-compatible." Read the SPIRE, Istio, or Cilium docs and you bump into these phrases everywhere. But if you actually stop and ask what compliance means, the answer is surprisingly hard to nail down.

If I run SPIRE, am I SPIFFE compliant?
If I roll my own, what do I have to ship to call it compliant?
Does an SVID have to support both X.509 and JWT? Or is one enough?
Is there an official conformance test you can point at?

A lot of people use the word without checking. I was one of them.

So I cloned the upstream spec repo (github.com/spiffe/spiffe) and read every document from top to bottom. What follows are my notes, organized around the question "what does the spec actually require?"

git clone https://github.com/spiffe/spiffe.git ~/spiffe
ls ~/spiffe/standards/
# JWT-SVID.md
# SPIFFE-ID.md
# SPIFFE.md
# SPIFFE_Federation.md
# SPIFFE_Trust_Domain_and_Bundle.md
# SPIFFE_Workload_API.md
# SPIFFE_Workload_Endpoint.md
# X509-SVID.md
# workloadapi.proto

Eight specs total. Read them in order and each one spells out where the hard requirements end and the soft suggestions begin. I'll pull out the load-bearing parts.

0. Prerequisites

A few terms before we go any further. Skip this section if you've touched SPIFFE/SPIRE before.

What "workload identity" means

A way for a workload (a process or container) to cryptographically prove who it is. IP addresses and hostnames can be spoofed. Kubernetes labels can be rewritten. Instead, a trusted Certificate Authority (CA) signs a certificate or token that the workload presents, and verifiers check the signature.

To distinguish this from human identity (the account you log into via OAuth), it's increasingly called Workload Identity or Non-Human Identity (NHI).

URI basics (RFC 3986)

We'll be dealing with strings like spiffe://example.com/payments/web-fe, so a quick refresher on URI structure.

spiffe://example.com/payments/web-fe
  |        |              |
scheme  authority        path

In SPIFFE the scheme is fixed at spiffe, the authority is the Trust Domain, and everything after that is the Workload Path.

X.509, mTLS, JWT, JWS, JWK

Signed data structures, or representations of keys.

X.509: the certificate format TLS uses. Binary (DER) or PEM encoded.
mTLS (mutual TLS): both client and server present X.509 certificates and verify each other. Plain TLS only verifies the server.
JWT (RFC 7519): a token made of JSON pieces joined with Base64URL. Common on the web.
JWS (RFC 7515): the structure that signs the JWT payload. A JWT requires a JWS to exist. When the SPIFFE spec says "JWS Compact Serialization," it means the normal JWT layout: header.payload.signature joined with dots.
JWK (RFC 7517): a JSON representation of a public (or private) key. Looks like {"kty":"RSA", "n":"...", "e":"AQAB"}.
JWKS (JWK Set): a JSON object whose keys array bundles multiple JWKs. If you've ever fetched /.well-known/jwks.json in OAuth/OIDC, you've seen one.
Bearer Token: a token where possession alone grants access. Steal it, use it. JWTs are typically bearer tokens.

The SPIFFE spec stacks a SPIFFE ID on top of these and calls the result an SVID (SPIFFE Verifiable Identity Document):

X.509 based → X.509-SVID
JWT/JWS based → JWT-SVID
Distribution of trust roots (CA keys) → SPIFFE Bundle (a JWKS underneath)

gRPC and Unix Domain Sockets

gRPC: an RPC framework that pushes Protocol Buffers over HTTP/2.
Unix Domain Socket (UDS): an inter-process socket on the local host. You reach it by filesystem path (for example /run/spire/agent.sock).

The Workload API ships over these two together.

1. The Shape of SPIFFE: Three Pillars

The SPIFFE spec set, in one sentence:

"Hand the workload a spiffe://... ID, issue a verifiable document (SVID) that carries it, and serve it through a local API."

The three pieces (ID, document, API) each get their own spec document.

Pillar	Document	Role
1. SPIFFE ID	`SPIFFE-ID.md`	Defines the workload namespace
2. SVID (X.509)	`X509-SVID.md`	How to carry a SPIFFE ID in an X.509 certificate
2. SVID (JWT)	`JWT-SVID.md`	How to carry a SPIFFE ID in a JWT
3. Workload API	`SPIFFE_Workload_API.md`	gRPC service that issues SVIDs
3. Workload Endpoint	`SPIFFE_Workload_Endpoint.md`	Conventions for the socket that exposes the API
Extra	`SPIFFE_Trust_Domain_and_Bundle.md`	How to represent trust roots (CA keys)
Extra	`SPIFFE_Federation.md`	How separate Trust Domains link up

You don't actually have to satisfy all of these to call yourself SPIFFE compliant. The next section explains why.

2. What the Spec Itself Says "Compliant" Means

This one sentence at the end of SPIFFE-ID.md Section 1 does a lot of work:

Conformance with this document is sufficient for the purposes of SPIFFE compliance.

Meet SPIFFE-ID.md and you can claim SPIFFE compliance. That's it. The Workload API, the Trust Bundle, the Federation spec, none of them are required by the letter of the spec.

That's the formal floor. In practice:

An ID by itself is useless without an SVID carrying it.
An SVID is useless without a Workload API to deliver it.
And neither is verifiable across hosts or trust domains without a Bundle representing the trust root.

"Practically SPIFFE compliant" means the three core specs plus Bundle, four documents total. If you cross trust domains, add Federation to that list.

The rest of this article walks through the MUST requirements of each one.

3. SPIFFE ID: How You Name Things

3.1 Anatomy of the URI

A SPIFFE ID is an RFC 3986 URI with a fixed shape.

spiffe://trust-domain-name/path/segments

Pull out everything the spec marks MUST or MUST NOT and you get:

Requirement	Level	Detail
Scheme is fixed at `spiffe`	MUST	Case-insensitive
Trust domain name is non-empty	MUST	The `host` component
No `userinfo`	MUST NOT	`spiffe://user:pass@td/` is out
No `port`	MUST NOT	`spiffe://td:8080/` is out
Trust domain name is lowercase	MUST	`Example.com` is out
Trust domain charset is `[a-z0-9.-_]`	MUST	Percent-encoding is out
No `query` or `fragment`	MUST NOT	`?key=val` and `#frag` are out
No trailing slash on the path	MUST NOT	`spiffe://td/foo/` is out
No `.` or `..` segments	MUST NOT	Relative path tokens forbidden
Whole URI at most 2048 bytes	MUST support / SHOULD NOT exceed	Implementations accept IDs up to 2048 bytes and should not generate longer ones
Trust domain name at most 255 bytes	MUST	RFC 3986 host limit
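The table translates fairly directly into a validator. Here is a minimal Python sketch, assuming hand-rolled string parsing rather than a URI library. The function name spiffe_id_errors and the error strings are mine, not from the spec or any SPIFFE library, and the character rules for path segments are left out because the table does not list them.

```python
import re

# Same character set the spec writes as [a-z0-9.-_], with the dash moved last
# so the regex engine does not read ".-_" as a range.
TRUST_DOMAIN_RE = re.compile(r"[a-z0-9._-]+")


def spiffe_id_errors(uri: str) -> list[str]:
    """Return the table rows that `uri` violates (an empty list means none found)."""
    errors = []
    if len(uri.encode("utf-8")) > 2048:
        errors.append("URI exceeds 2048 bytes")

    scheme, sep, rest = uri.partition("://")
    if not sep or scheme.lower() != "spiffe":
        return errors + ["scheme must be spiffe"]

    # Query and fragment are forbidden; strip them so the later checks still run.
    cut = min((i for i in (rest.find("?"), rest.find("#")) if i != -1), default=-1)
    if cut != -1:
        errors.append("query and fragment are not allowed")
        rest = rest[:cut]

    trust_domain, slash, path = rest.partition("/")
    if not trust_domain:
        errors.append("trust domain name is empty")
    elif "@" in trust_domain:
        errors.append("userinfo is not allowed")
    elif ":" in trust_domain:
        errors.append("port is not allowed")
    elif len(trust_domain) > 255:
        errors.append("trust domain name exceeds 255 bytes")
    elif not TRUST_DOMAIN_RE.fullmatch(trust_domain):
        errors.append("trust domain must be lowercase [a-z0-9._-]")

    if slash:
        segments = path.split("/")
        if "" in segments:
            errors.append("empty path segment (trailing or doubled slash)")
        if any(s in (".", "..") for s in segments):
            errors.append("dot segments are not allowed")
    return errors


if __name__ == "__main__":
    for candidate in ("spiffe://example.com/payments/web-fe", "spiffe://Example.com/x", "spiffe://td/foo/"):
        print(candidate, spiffe_id_errors(candidate) or "ok")
```

Returning a list of violations instead of raising on the first one makes the output line up with the table row by row.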

3.2 Trust Domain Names Are Unregulated

The spec is up front about this:

Trust domain operators are free to choose any trust domain name they find suitable: there is no centralized authority for regulation or registration of trust domain names.

Unlike DNS, there is no registry. Anyone can call themselves example.com.

So what happens when two parties pick the same name?

When a collision does occur, those trust domains will continue to operate independently but will be unable to federate.

While they stay independent, nothing breaks. The moment they try to federate (link up across trust domains), the two sides hold different keys, so neither can verify the other's certificates. They just fail to connect.

The practical workaround is to use a DNS name you already own. If you're auto-generating, a UUID works.

3.3 Path Design

What the path means is up to the implementer. The spec gives three example patterns:

Pattern A, identify the service directly: spiffe://staging.example.com/payments/mysql
Pattern B, mirror the orchestrator's ID structure: spiffe://k8s-west.example.com/ns/staging/sa/default
Pattern C, opaque: spiffe://example.com/9eebccd2-12bf-40a6-b262-65fe0487d453 (metadata managed elsewhere)

SPIRE's default on Kubernetes is Pattern B (/ns/<namespace>/sa/<service-account>), and that's close to the de facto convention. It maps Kubernetes Pod Identity directly, so the operational view stays legible.

4. SVID: X.509 Profile Requirements

4.1 An X.509-SVID Is Just an X.509 Certificate With a URI SAN

Read X509-SVID.md and there are no new fields. It's a normal X.509 certificate with rules layered on top about how to use the existing ones.

The core rule is simple: the Subject Alternative Name extension's URI type holds exactly one SPIFFE ID. Everything else (Subject, Basic Constraints, Key Usage, Extended Key Usage) carries its usual X.509 meaning. The spec only constrains which values are legal for SPIFFE use.

4.2 Leaf SVID vs Signing SVID

The X.509 field values differ between Leaf and Signing.

Field	Leaf SVID	Signing SVID (CA)
`cA` (Basic Constraints)	`false` MUST	`true` MUST
`keyCertSign` (Key Usage)	MUST NOT	MUST
`cRLSign` (Key Usage)	MUST NOT	MAY
`digitalSignature` (Key Usage)	MUST	(not specified)
SPIFFE ID path	Non-root (at least one segment) MUST	No path (`spiffe://td`) SHOULD
Number of URI SANs	Exactly one MUST	Exactly one MUST

A Leaf SVID can sign (mTLS client auth, API request signing) but cannot issue certificates for other parties. Same position as an ordinary client certificate.

4.3 Extra Checks at Validation Time

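Section 5.2 of X509-SVID.md makes it clear that standard X.509 path validation is not enough. Using a certificate as an SVID requires additional checks on the leaf, the same ones the Leaf column in 4.2 lists: cA is false, keyCertSign and cRLSign are not set, and the URI SAN holds exactly one SPIFFE ID.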

Forget these and you open the door to vulnerabilities like an intermediate CA being used as a leaf certificate. Libraries like go-spiffe/v2 handle this for you. Only worry about it when writing your own.
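To make the leaf-side checks concrete, here is a small sketch that applies the Leaf column of the table in 4.2 to an already-parsed certificate. ParsedCert is a hypothetical stand-in for whatever your X.509 parser returns, not a type from any real library, and chain validation against the bundle is assumed to have happened separately.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedCert:
    """Hypothetical container for the fields an X.509 parser would hand back."""
    is_ca: bool                                    # Basic Constraints cA flag
    key_usage: set = field(default_factory=set)    # e.g. {"digitalSignature"}
    uri_sans: list = field(default_factory=list)   # URI-type SAN entries


def leaf_svid_errors(cert: ParsedCert) -> list[str]:
    """Apply the Leaf SVID column of the table; standard path validation is separate."""
    errors = []
    if cert.is_ca:
        errors.append("cA must be false on a leaf SVID")
    for forbidden in ("keyCertSign", "cRLSign"):
        if forbidden in cert.key_usage:
            errors.append(f"{forbidden} must not be set on a leaf SVID")
    if "digitalSignature" not in cert.key_usage:
        errors.append("digitalSignature must be set on a leaf SVID")

    if len(cert.uri_sans) != 1:
        errors.append("exactly one URI SAN is required")
    elif not cert.uri_sans[0].lower().startswith("spiffe://"):
        errors.append("URI SAN must be a SPIFFE ID")
    elif cert.uri_sans[0].rstrip("/").count("/") < 3:
        # "spiffe://td" has two slashes; a path adds at least a third.
        errors.append("leaf SPIFFE ID needs a non-root path")
    return errors
```

Feeding it a signing certificate (cA true, keyCertSign set, path-less ID) produces one error per violated row, which is the "intermediate CA used as a leaf" case the text warns about.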

5. SVID: JWT Profile Requirements

5.1 A JWT-SVID Is Just a JWS

From Section 1 of JWT-SVID.md:

JWT-SVIDs are standard JWT tokens with a handful of restrictions applied.

A normal JWT with a few extra constraints, nothing more. Format is JWS Compact Serialization (the header.payload.signature layout). JWS JSON Serialization MUST NOT be used.

5.2 The Core Restriction: Reject `alg: none`

JWT has a well-known footgun: set alg to none in the header and the token sails through unverified.

The JWT-SVID spec pins alg to one of these nine values.

`alg` value	Algorithm
RS256 / RS384 / RS512	RSASSA-PKCS1-v1_5 + SHA
PS256 / PS384 / PS512	RSASSA-PSS + SHA
ES256 / ES384 / ES512	ECDSA + SHA

Anything else (especially none or the symmetric HS* family) MUST be rejected. That single rule closes off most of the classic JOSE vulnerability surface.

Required claims:

sub: the SPIFFE ID of the workload.
aud: at least one audience.
exp: expiry.

kid is optional in the header but required on the Bundle JWK side so verifiers can pick the right key.
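The alg allowlist and the required claims can be checked before any cryptography runs. The following is a sketch using only the standard library. It inspects structure only and does not verify the signature, and the function name jwt_svid_errors is my own.

```python
import base64
import json
import time

ALLOWED_ALGS = {
    "RS256", "RS384", "RS512",
    "PS256", "PS384", "PS512",
    "ES256", "ES384", "ES512",
}


def _b64url_json(part: str):
    """Decode one base64url JWS segment (padding restored) into JSON."""
    padded = part + "=" * (-len(part) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def jwt_svid_errors(token: str, now=None) -> list[str]:
    """Structural checks only: the signature is NOT verified here."""
    parts = token.split(".")
    if len(parts) != 3:
        return ["not JWS Compact Serialization"]
    try:
        header, claims = _b64url_json(parts[0]), _b64url_json(parts[1])
    except ValueError:
        return ["header or payload is not base64url JSON"]
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return ["header and payload must be JSON objects"]

    errors = []
    alg = header.get("alg")
    if not isinstance(alg, str) or alg not in ALLOWED_ALGS:
        errors.append(f"alg {alg!r} is not allowed")  # covers none and HS*
    if not str(claims.get("sub", "")).startswith("spiffe://"):
        errors.append("sub must be a SPIFFE ID")
    if not claims.get("aud"):
        errors.append("aud must name at least one audience")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)):
        errors.append("exp is missing")
    elif exp <= (time.time() if now is None else now):
        errors.append("token has expired")
    return errors
```

Rejecting on the header first means a token with alg set to none never reaches the key-lookup code at all.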

5.3 Always Check `aud`

JWT-SVID is a bearer token. Anyone holding it can use it. To soften that blow:

aud MUST be set.
The receiver MUST check that its own ID is in aud.
Single audience strongly recommended.

The example in Section 7.2 of the spec shows the failure mode:

if Alice has a token with audiences Bob and Chuck, and transmits that token to Chuck, then Chuck can impersonate Alice by sending the same token to Bob.

Alice issues a token addressed to both Bob and Chuck. Chuck replays it to Bob and impersonates Alice. One token, one audience. That's the rule.
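The replay is easy to reproduce with a receiver-side audience check. This is a small sketch. The three SPIFFE IDs are hypothetical, and accepts_audience is my own helper name.

```python
def accepts_audience(claims: dict, my_id: str) -> bool:
    """True only if this verifier's own identity is listed in aud."""
    aud = claims.get("aud")
    # aud may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    return my_id in audiences


# Hypothetical SPIFFE IDs for the three parties in the spec's example.
ALICE = "spiffe://example.com/alice"
BOB = "spiffe://example.com/bob"
CHUCK = "spiffe://example.com/chuck"

shared = {"sub": ALICE, "aud": [BOB, CHUCK]}  # one token, two audiences
scoped = {"sub": ALICE, "aud": [CHUCK]}       # one token, one audience

# Chuck forwards `shared` to Bob: Bob's check passes, so the replay works.
print(accepts_audience(shared, BOB))  # True
# With one audience per token, the forwarded copy is rejected.
print(accepts_audience(scoped, BOB))  # False
```

The check itself is correct in both cases. What closes the hole is issuing the narrower token.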

6. Workload API: How SVIDs Get Delivered

6.1 gRPC, Local Only

SPIFFE_Workload_Endpoint.md, summarized: the Workload API is a gRPC endpoint on a Unix Domain Socket (or localhost TCP) within the same host. In practice, the SPIRE Agent running on each node binds to /run/spire/agent.sock, and every container on that host talks to it over gRPC. The Agent handles the conversation with the SPIRE Server (the central CA) over a separate network. The workload itself never reaches the outside.

The MUSTs around transport and accessibility:

Requirement	Detail
gRPC required	Prefer UDS over TCP (SHOULD)
No TLS	In fact, MUST NOT require it. At bootstrap the workload has no trust root yet.
Confined to one host	SHOULD (don't make it reachable from other hosts)
`workload.spiffe.io: true` metadata	MUST (SSRF defense)
No authentication handshake	MUST NOT require one. Caller is identified at the OS layer instead.

That last point is unusual. A typical gRPC service authenticates clients via certificates or tokens. The Workload API flips that: the workload doesn't announce itself, the server identifies it via the OS. This identification flow is called Workload Attestation. Concretely:

Fetch the connecting PID from the UDS via SO_PEERCRED on Linux, getpeerucred on Solaris, or an equivalent.
Inspect the PID's cgroup, or query the Kubernetes API server.
Conclude "you're the web-fe container in pod foo-bar-1234."
Issue the SPIFFE ID that container should hold.
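Step 1 of that flow can be sketched in a few lines on Linux, where the kernel exposes the peer's credentials through the SO_PEERCRED socket option. This is an illustration of the mechanism only, not SPIRE's implementation, and peer_credentials is my own helper name.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid, uid_t uid, gid_t gid
UCRED = struct.Struct("iII")


def peer_credentials(conn: socket.socket) -> tuple:
    """Return (pid, uid, gid) of the peer of a connected Unix socket (Linux only)."""
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    return UCRED.unpack(raw)
```

A quick way to try it without a real agent is socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM): calling peer_credentials on either end reports the current process. From the PID, the later steps would go on to read something like /proc/<pid>/cgroup. The caller never sends anything about itself, which is the point of the design.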

6.2 Five RPCs Across Two Profiles

The Workload API has two profiles (X.509 and JWT) and five RPCs between them.

Profile	RPC	Mode	What it returns
X.509	`FetchX509SVID`	stream	SVID + Bundle
X.509	`FetchX509Bundles`	stream	Bundles only
JWT	`FetchJWTSVID`	unary	JWT for a given audience
JWT	`FetchJWTBundles`	stream	JWKS
JWT	`ValidateJWTSVID`	unary	Delegate JWT verification to the server

From Section 1 of SPIFFE_Workload_API.md:

Both profiles are mandatory and MUST be supported by SPIFFE implementations. However, operators MAY administratively disable a specific profile in their deployment.

A SPIFFE-compliant Workload API has to implement both X.509 and JWT profiles (the operator is allowed to turn one off in their deployment). This bites in practice. Even popular OSS libraries often bolt JWT on later.

6.3 Why It Streams

FetchX509SVID, FetchJWTBundles, and FetchX509Bundles return as gRPC server-side streams so the server can push:

New certificates during key rotation.
CRL (revocation list) updates.
New state without the workload paying reconnect cost.

The client SHOULD keep the connection open.

Every message ships the full current state, not a delta (spec sections 4.3 and 4.4). That sidesteps the whole anti-entropy headache of incremental sync.

6.4 Discovery: `SPIFFE_ENDPOINT_SOCKET`

How does a client find the Workload API? From spec section 4:

Clients may be explicitly configured with the socket location, or may utilize the well-known environment variable SPIFFE_ENDPOINT_SOCKET. If not explicitly configured, conforming clients MUST fall back to the environment variable.

Without explicit configuration, look at SPIFFE_ENDPOINT_SOCKET. The value is URI-formatted.

# Unix Domain Socket
export SPIFFE_ENDPOINT_SOCKET=unix:///run/spire/agent.sock

# TCP (only on hosts with a specific reason)
export SPIFFE_ENDPOINT_SOCKET=tcp://127.0.0.1:8000

Official libraries like go-spiffe/v2 do this automatically.
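If you are writing a client by hand, the lookup order can be sketched like this. The function name is hypothetical, and the requirement that a tcp:// endpoint names an IP address plus port (not a hostname) is my reading of the Workload Endpoint spec, so check it against the document before relying on it.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def resolve_workload_endpoint(explicit=None, environ=None):
    """Explicit configuration wins; otherwise fall back to SPIFFE_ENDPOINT_SOCKET."""
    environ = os.environ if environ is None else environ
    raw = explicit or environ.get("SPIFFE_ENDPOINT_SOCKET")
    if not raw:
        raise ValueError("no endpoint configured and SPIFFE_ENDPOINT_SOCKET is unset")

    parts = urlsplit(raw)
    if parts.scheme == "unix":
        # unix:///abs/path: empty authority, absolute filesystem path
        if parts.netloc or not parts.path.startswith("/"):
            raise ValueError(f"unix endpoint must look like unix:///path, got {raw!r}")
        return ("unix", parts.path)
    if parts.scheme == "tcp":
        # Assumption: host must be an IP literal; ip_address() raises ValueError otherwise.
        ipaddress.ip_address(parts.hostname or "")
        if parts.port is None:
            raise ValueError(f"tcp endpoint needs a port, got {raw!r}")
        return ("tcp", parts.hostname, parts.port)
    raise ValueError(f"unsupported scheme in {raw!r}")
```

With the two exports above, this returns ("unix", "/run/spire/agent.sock") and ("tcp", "127.0.0.1", 8000) respectively.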

7. Trust Domain and Bundle: Building the Trust Root

7.1 A Bundle Is a Keyring Shaped Like a JWKS

SPIFFE_Trust_Domain_and_Bundle.md in one line: SPIFFE Bundle = RFC 7517 JWK Set + SPIFFE-specific metadata.

JWKS is the same format you've seen as /.well-known/jwks.json on Cognito, Auth0, and Google Identity Platform. The SPIFFE Bundle extends it to hold both X.509 CA certificates and JWT signing keys.

{
  "spiffe_sequence": 12035488,
  "spiffe_refresh_hint": 2419200,
  "keys": [
    {
      "kty": "RSA",
      "use": "x509-svid",
      "x5c": ["<base64 DER encoding of X.509 CA cert>"],
      "n": "<base64urlUint-encoded modulus>",
      "e": "AQAB"
    },
    {
      "kty": "RSA",
      "kid": "<JWT key id>",
      "use": "jwt-svid",
      "n": "<base64urlUint-encoded modulus>",
      "e": "AQAB"
    }
  ]
}

Two things are unique to a SPIFFE Bundle:

use parameter: MUST be either x509-svid or jwt-svid. Without it the verifier can't tell which kind of SVID the key validates.
spiffe_sequence: monotonically increasing counter. Increments every time the Bundle is updated.
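A bundle's shape can be sanity-checked with nothing but a JSON parser. This is a sketch: bundle_errors is my own name, and it checks only the use values plus the per-use fields shown in the example above (kid on jwt-svid keys, x5c on x509-svid keys). It does not decode the certificates or keys.

```python
import json

SVID_USES = {"x509-svid", "jwt-svid"}


def bundle_errors(document: str) -> list[str]:
    """Shape checks on a SPIFFE Bundle JSON document; no cryptographic validation."""
    bundle = json.loads(document)
    errors = []
    sequence = bundle.get("spiffe_sequence")
    if sequence is not None and not isinstance(sequence, int):
        errors.append("spiffe_sequence must be an integer")

    for index, key in enumerate(bundle.get("keys", [])):
        use = key.get("use")
        if not isinstance(use, str) or use not in SVID_USES:
            errors.append(f"keys[{index}]: use must be x509-svid or jwt-svid, got {use!r}")
        elif use == "jwt-svid" and not key.get("kid"):
            errors.append(f"keys[{index}]: jwt-svid key needs a kid")
        elif use == "x509-svid" and not key.get("x5c"):
            errors.append(f"keys[{index}]: x509-svid key needs an x5c certificate")
    return errors
```

A plain OAuth-style JWKS, where use is typically sig, fails the first check on every key, which is a quick way to tell the two formats apart.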

7.2 Keep Trust Domains Isolated

Section 6.2 of the Bundle spec:

When a root key is shared across multiple trust domains, it becomes critically important that authentication and authorization implementations carefully check the trust domain name component of an identity.

Don't share keys across trust domains. If you do, verifiers had better check the trust domain name strictly, or an SVID from the wrong trust domain becomes a forgery vector.

One trust domain, one independent set of keys. That's the iron rule of SPIFFE operations.

7.3 Key Rotation: Add First, Remove Later

Bundle updates work as "add the new one, then remove the old." The Bundle moves through four states:

Phase	Bundle contents	Issuer signs new SVIDs with	Verifier accepts SVIDs signed by
Initial	`[K1]`	K1	K1
Add K2	`[K1, K2]`	K1 (still)	K1 or K2
Switch	`[K1, K2]`	K2	K1 or K2
Drop K1	`[K2]`	K2	K2

spiffe_refresh_hint (seconds) suggests how often consumers should re-fetch the Bundle. The spec's example value is 28 days (2419200 seconds), which is far too long if revocation matters. In practice, 5 minutes to 1 hour is the realistic production range.
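Why the order matters can be checked mechanically. The sketch below models each phase of the table as (name, bundle contents, signing key) and flags any transition where a verifier one step behind, or an SVID issued one step earlier, would fail. It is a toy model of my own: it does not track SVID lifetimes, so in a real rollout you would also wait for K1-signed SVIDs to expire before the final drop.

```python
PHASES = [
    # (phase, bundle contents, key the issuer signs new SVIDs with)
    ("Initial", {"K1"}, "K1"),
    ("Add K2", {"K1", "K2"}, "K1"),
    ("Switch", {"K1", "K2"}, "K2"),
    ("Drop K1", {"K2"}, "K2"),
]


def rotation_gaps(phases):
    """List transitions where some SVID would stop verifying."""
    gaps = []
    for (name, bundle, signer), (next_name, next_bundle, next_signer) in zip(phases, phases[1:]):
        # A verifier that has not refreshed yet still holds the previous bundle.
        if next_signer not in bundle:
            gaps.append(f"{next_name}: verifiers still on the {name} bundle reject {next_signer}")
        # An SVID issued in the previous phase must survive the bundle update.
        if signer not in next_bundle:
            gaps.append(f"{next_name}: SVIDs signed with {signer} in {name} stop verifying")
    return gaps


print(rotation_gaps(PHASES))  # []
print(rotation_gaps([("Initial", {"K1"}, "K1"), ("Swap", {"K2"}, "K2")]))
```

The four-phase sequence produces no gaps, while the naive one-step swap produces both kinds at once.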

8. What Existing Implementations Actually Cover

8.1 SPIRE

The reference implementation. Satisfies every item by definition. Workload API streaming, JWT-SVID ValidateJWTSVID delegation, the whole list.

8.2 Istio

Istio holds the workload's SVID inside each sidecar (Envoy). The on-the-wire interface to Envoy is SDS (Secret Discovery Service), which is an Envoy upstream API, not Istio-specific. From a SPIFFE-compliance standpoint, the relevant fact is that Istio's istio-agent feeds SVIDs into Envoy via SDS rather than exposing the SPIFFE Workload API to workloads. The payload format is SPIFFE-compatible, but the Workload Endpoint spec is not implemented.

Istio can also be wired to use a SPIRE-backed SDS server. In that setup SPIRE is the CA, and the Workload Endpoint spec becomes implemented through SPIRE.

8.3 Cilium

Cilium has two unrelated paths that get conflated under the SPIFFE label, and they need to be kept apart.

Path 1, SPIRE-backed Mutual Authentication (Beta since v1.14, still Beta in v1.19). Cilium auto-deploys a SPIRE Agent on each node and rides on top. Cilium itself doesn't implement the SPIFFE spec directly. It wraps SPIRE. This path is opt-in (authentication.mutual.spire.enabled=true) and has been since v1.14.

Path 2, ztunnel-based transparent mTLS (Beta, added v1.19). Cilium picked up the same ztunnel data plane Istio ambient mode uses. Per the Cilium docs, ztunnel's certificates are generated via OpenSSL and stored as Kubernetes Secrets. SPIRE is not in the loop, and the docs do not claim SPIFFE compliance for this mode. Whether the certificates carry SPIFFE IDs in their SANs is an implementation detail I haven't verified, so don't take "Cilium ztunnel is SPIFFE compliant" as a settled claim.

If someone says "Cilium is SPIFFE compliant", ask which feature: the SPIRE-backed mutual auth, or ztunnel.

So when a doc says "Istio is SPIFFE compliant", read it as "the SVID format is compliant, the Workload Endpoint spec is not." Same caution applies to Cilium plus the version question.

9. 2026 Updates: Recent Spec Movement

Tracking the spiffe/spiffe repo, a couple of things have moved since late 2025.

draft-wit-svid branch, active since late 2025. A candidate successor to JWT-SVID aimed at integration with IETF's WIMSE (Workload Identity in Multi-System Environments) work. PR #361 ("Introduce WIT-SVID Token Document") landed on the branch 2026-01-21. Two changes worth noting: WIT-SVID makes Proof of Possession via the cnf claim mandatory (RFC 7800 confirmation), structurally closing off the bearer-token replay risk of JWT-SVID; and PR #372 (2026-01-26) prohibits the aud claim entirely (audience scoping is delegated to the PoP layer). Still an experimental branch, not on main, so I don't count it toward "compliance" yet.
PR #381, merged 2026-03-26. "Strengthen validation language, and clarify leaf SPIFFE ID requirements." Tightens the path requirements and validation language around leaf SPIFFE IDs. Low impact for most implementations, worth a re-read if you wrote your own.

The "what compliance means" I described here tracks the main branch as of May 2026. If WIT-SVID gets merged into main, sections 4 to 5 grow by one more SVID profile with stricter rules than JWT-SVID.

10. Running the Checklist: `scc`

Walking sections 3 to 7 by hand gets old after the second time someone hands you an SVID and asks "is this compliant?". I packaged the static slice of the checklist as a single-binary CLI: scc (spiffe-compliance-checker).

Each subcommand reads one artifact and emits one line per MUST / SHOULD clause, with the spec file and section cited inline. Exit code is 1 on any MUST failure, 0 otherwise. SHOULD violations surface as WARN and do not change the exit code.

brew install kanywst/tap/spiffe-compliance-checker
# or: go install github.com/0-draft/spiffe-compliance-checker/cmd/scc@latest

scc id        'spiffe://example.com/payments/web-fe'
scc x509-svid leaf.pem
scc jwt-svid  <token>
scc bundle    bundle.json

A failing run reads as a spec walkthrough rather than an opaque error.

What it covers right now: SPIFFE ID syntax, X.509-SVID structural rules (URI SAN, Basic Constraints, Key Usage criticality, EKU), JWT-SVID claims, Trust Bundle JWKS shape including base64 + DER validation of x5c. What it does not: live Workload API behaviour, federation endpoint trust, signature verification against a specific bundle. Those need a running Agent and are out of scope for a static checker.

Source, issues, releases: https://github.com/0-draft/spiffe-compliance-checker.

Conclusion

SPIFFE compliance has two definitions and you should know which one you're talking about.

The spec-defined floor is conformance with SPIFFE-ID.md, which defines both the SPIFFE ID and the SVID. That's it.
The practical floor is that plus Workload API plus Trust Bundle. Federation joins the list only if you cross trust domains.

When something claims SPIFFE compliance, ask at what level.

Fully compliant (SPIRE): meets every spec, interoperable across the board.
SVID-compatible (Istio): SVID format is compliant, Workload Endpoint is custom.
SPIRE-dependent (Cilium SPIRE path): bundles a SPIFFE implementation. The rest is its own.
Not yet claimable (Cilium ztunnel): a separate mTLS path that doesn't go through SPIRE and hasn't been declared SPIFFE compliant by upstream.

Just want to know whether an artifact in front of you is compliant? Run it through scc (section 10). Building your own implementation? Walk the MUSTs in sections 3 to 7 one at a time. SVID format alone is feasible from scratch. The Workload API is where the scope explodes. go-spiffe/v2 and java-spiffe are the official libraries. Wrapping them is a saner starting point than reimplementing the gRPC service.

What makes SPIFFE useful is that workload authentication completes inside a vendor-neutral spec. AWS IAM, Google IAM, Kubernetes Service Accounts, none of them have to be in the trust path. Wire SPIFFE in once and cross-environment authentication stops relying on IP allowlists and pre-shared secrets.

AWS SigV4 and SigV4A Deep Dive

kt — Sat, 30 May 2026 11:19:22 +0000

Introduction

Hitting S3 from boto3, I had never thought about SigV4. The SDK does everything. I knew an Authorization: AWS4-HMAC-SHA256 ... header was being assembled somewhere under the hood, but I had never built one by hand.

Multi-Region Access Point (MRAP) destroyed that complacency. The instant I hit S3 through MRAP from Lambda, an algorithm I had never seen called AWS4-ECDSA-P256-SHA256 showed up instead of the usual SigV4, and the old botocore I had pinned locally crashed with InvalidSignature. AWS has two signing schemes: SigV4 and SigV4A. The latter is asymmetric, using ECDSA instead of HMAC.

This article dissects both, in this order.

Why AWS uses a custom signature instead of Bearer Token
Background: what HMAC and SHA-256 contribute
The four SigV4 steps (Canonical Request / StringToSign / SigningKey derivation / Signature)
SigV4 written in 80 lines of Python
Chunked Upload (STREAMING-AWS4-HMAC-SHA256-PAYLOAD)
Streaming SigV4 (WebSocket / IoT MQTT)
What is inside a Presigned URL
SigV4A: asymmetric signing on ECDSA P-256
SigV4 vs SigV4A comparison
Clock Skew traps and debugging tips

1. Why a custom signature instead of Bearer Token

In the REST world, Authorization: Bearer <token> is the default. OAuth 2.0 is basically the same. Bearer means "whoever holds the token is legitimate", and if one is stolen it is over. Under HTTPS that is usually fine in practice, but AWS explicitly refused that design, with three design goals in mind.

Never put the Secret Key on the wire: SecretAccessKey is an absurdly powerful credential. Sending it on every call is a non-starter. SigV4 does not sign with the Secret Key itself; it signs with a short-lived key derived from it through HMAC.
Replay protection: the signature covers an ISO8601 timestamp and the CredentialScope (date + region + service), so it is bound to a moment in time and a place. A captured signature replayed the next day will not pass.
Tamper detection: HTTP method, URI, query string, headers, and the SHA-256 of the body are all in the signed string. Flip a single byte in transit and the signature mismatches.

SigV4 proves three things at once: who sent it, when, and what was in it. Completely different model from Bearer.

2. Background: HMAC and SHA-256 in one paragraph

SigV4 sits on top of HMAC-SHA256. SHA-256 compresses arbitrary-length input into a fixed 256-bit digest. It is a one-way function with collision resistance (you cannot feasibly find two inputs with the same hash) and preimage resistance (you cannot feasibly reverse it). HMAC (Hash-based MAC) wraps a key around a message: HMAC(key, msg) = SHA256((K XOR opad) || SHA256((K XOR ipad) || msg)), where K is the key zero-padded to the 64-byte SHA-256 block size (keys longer than a block are hashed first). Anyone without the key cannot reproduce the output. SigV4 chains HMAC four times to derive a signing key, then HMACs the StringToSign with it. The hex of that final HMAC becomes Signature=... in the Authorization header. SigV4A replaces that final HMAC with an ECDSA signature. Hold that mental model and the rest is detail.
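The HMAC construction is short enough to write out and compare against the standard library. This is a sketch for building intuition only. In real code, use the hmac module directly.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 block size in bytes


def hmac_sha256_manual(key: bytes, msg: bytes) -> bytes:
    """HMAC-SHA256 spelled out: SHA256((K ^ opad) || SHA256((K ^ ipad) || msg))."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()   # long keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")     # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + msg).digest()
    return hashlib.sha256(opad + inner).digest()


if __name__ == "__main__":
    key, msg = b"AWS4" + b"example-secret", b"20260517"
    same = hmac.compare_digest(
        hmac_sha256_manual(key, msg),
        hmac.new(key, msg, hashlib.sha256).digest(),
    )
    print(same)  # True
```

The output is always 32 bytes, which is why each stage of the SigV4 key chain can feed straight into the next as a key.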

3. The four SigV4 steps at a glance

One step at a time from here.

4. Step 1: Building the Canonical Request

The signature only works if "the same request always serializes to the same string". Header order and URI escape variants have to be flattened out. That flattening is what the Canonical Request does.

CanonicalRequest =
  HTTPRequestMethod + '\n' +
  CanonicalURI + '\n' +
  CanonicalQueryString + '\n' +
  CanonicalHeaders + '\n' +
  SignedHeaders + '\n' +
  HashedPayload

The rules for each element, in one diagram.

Three traps that bite hard.

URI encoded twice: for AWS APIs other than S3, the path is URI-encoded twice. A frequent SDK bug. S3 is the only exception, encoded once.
Query string sort: b=2&a=1 becomes a=1&b=2. When the same key appears multiple times, sort the values too.
Header value trim: strip leading and trailing whitespace, then collapse runs of internal whitespace to a single space. A header sent as Foo:   bar    baz (with extra spaces) becomes foo:bar baz.

In Python.

from urllib.parse import quote
import hashlib

def canonical_request(method, uri, query, headers, payload):
    # URI: encode once for S3, twice for others (this example assumes S3)
    canonical_uri = quote(uri, safe="/-_.~")

    # Query: URI-encode names and values first, then sort the encoded pairs
    items = sorted(
        (quote(k, safe="-_.~"), quote(v, safe="-_.~")) for k, v in query.items()
    )
    canonical_query = "&".join(f"{k}={v}" for k, v in items)

    # Headers: lowercase + sort + value trim
    lower = {k.lower(): " ".join(v.split()) for k, v in headers.items()}
    sorted_keys = sorted(lower.keys())
    canonical_headers = "".join(f"{k}:{lower[k]}\n" for k in sorted_keys)
    signed_headers = ";".join(sorted_keys)

    # Payload: SHA256 hex
    hashed_payload = hashlib.sha256(payload).hexdigest()

    return (
        f"{method}\n{canonical_uri}\n{canonical_query}\n"
        f"{canonical_headers}\n{signed_headers}\n{hashed_payload}"
    ), signed_headers, hashed_payload

canonical_headers ends with \n, and the line after it brings another \n, so the serialized form has two newlines in a row. That is spec-correct. "Fix" it to a single newline and SignatureDoesNotMatch greets you immediately.
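The three traps from the list above, isolated into small helpers so each can be tried on its own. This is a sketch: the helper names are mine, and it does not cover any path normalization a non-S3 service may additionally expect.

```python
from urllib.parse import quote


def canonical_uri(path: str, service: str) -> str:
    """Encode the path once for S3, twice for every other service."""
    once = quote(path, safe="/-_.~")
    return once if service == "s3" else quote(once, safe="/-_.~")


def canonical_query(pairs) -> str:
    """Encode each (name, value) pair, then sort; repeated names sort by value."""
    encoded = sorted((quote(k, safe="-_.~"), quote(v, safe="-_.~")) for k, v in pairs)
    return "&".join(f"{k}={v}" for k, v in encoded)


def canonical_header_value(value: str) -> str:
    """Trim both ends and collapse internal whitespace runs to one space."""
    return " ".join(value.split())


print(canonical_uri("/docs/my file.txt", "s3"))            # /docs/my%20file.txt
print(canonical_uri("/docs/my file.txt", "execute-api"))   # /docs/my%2520file.txt
print(canonical_query([("b", "2"), ("a", "2"), ("a", "1")]))  # a=1&a=2&b=2
print(canonical_header_value("  bar    baz "))             # bar baz
```

The second encoding pass turns the % of %20 into %25, which is where the %2520 comes from.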

5. Step 2: Building the StringToSign

Once the Canonical Request exists, SHA256 it and combine the result with the signing scope (algorithm + datetime + region + service) into a single string. That is the StringToSign.

StringToSign =
  Algorithm + '\n' +
  RequestDateTime + '\n' +
  CredentialScope + '\n' +
  HEX(SHA256(CanonicalRequest))

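Concrete example. The last line here is a placeholder (it is the SHA-256 of empty input). A real StringToSign carries the digest of your own Canonical Request.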

AWS4-HMAC-SHA256
20260517T120000Z
20260517/us-east-1/s3/aws4_request
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

CredentialScope has the shape <date>/<region>/<service>/aws4_request. That alone tells AWS "this signature is for this day, this region, and this service". A signature scoped to s3 cannot be reused against dynamodb. That is the core of replay protection.

import hashlib

def string_to_sign(amz_date, scope_date, region, service, canonical_req):
    algorithm = "AWS4-HMAC-SHA256"
    scope = f"{scope_date}/{region}/{service}/aws4_request"
    hashed_cr = hashlib.sha256(canonical_req.encode()).hexdigest()
    return f"{algorithm}\n{amz_date}\n{scope}\n{hashed_cr}", scope

amz_date is the basic ISO8601 form 20260517T120000Z (no - or :). scope_date is just 20260517. Mismatching them is a classic bug.

6. Step 3: Deriving the SigningKey (4-stage HMAC chain)

This is the heart of SigV4. You never sign with the Secret Key itself. You chain HMAC four times from the Secret Key to build kSigning, and sign with that.

Why four stages? Key separation. kDate is "a key valid only for this day", kRegion is "valid only for this day and this region", and so on. Each stage narrows the scope further. Even if kRegion leaks to an attacker, it cannot be used on another day or in another region. The Secret Key is permanent, but kSigning is disposable, scoped to one day, one region, one service.

import hmac
import hashlib

def hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def signing_key(secret, scope_date, region, service):
    k_date = hmac_sha256(("AWS4" + secret).encode(), scope_date)
    k_region = hmac_sha256(k_date, region)
    k_service = hmac_sha256(k_region, service)
    k_signing = hmac_sha256(k_service, "aws4_request")
    return k_signing

The "AWS4" + SecretAccessKey prefix is hard-coded in the spec. Forget the AWS4 and SignatureDoesNotMatch lands on the first request.

7. Step 4: Computing the Signature and the Authorization Header

The last step is to HMAC the StringToSign with kSigning and hex-encode the result.

import hashlib
import hmac

def sign(k_signing: bytes, string_to_sign: str) -> str:
    return hmac.new(k_signing, string_to_sign.encode(), hashlib.sha256).hexdigest()

The Authorization header is assembled like this.

Authorization: AWS4-HMAC-SHA256
  Credential=<AccessKeyId>/<scope>,
  SignedHeaders=<signed>,
  Signature=<hex>

A real example.

Authorization: AWS4-HMAC-SHA256 Credential=AKIAIOSFODNN7EXAMPLE/20260517/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=fe5f80f77d5fa3beca038a248ff027d0445342fe2855ddc963176630326f1024

Credential holds the AccessKeyId and the CredentialScope, SignedHeaders lists the signed header names with ; separators, and Signature is the hex string. This is exactly what the SDK is assembling for you.

8. Full implementation: hitting S3 GetObject with SigV4

The four steps wired together in under 80 lines. Pure urllib, no SDK.

import datetime
import hashlib
import hmac
import urllib.request
from urllib.parse import quote

ACCESS_KEY = "AKIAIOSFODNN7EXAMPLE"
SECRET_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
REGION = "us-east-1"
SERVICE = "s3"
BUCKET = "kt-sigv4-demo"
KEY = "hello.txt"
HOST = f"{BUCKET}.s3.{REGION}.amazonaws.com"


def hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def signing_key(secret, scope_date, region, service):
    k = hmac_sha256(("AWS4" + secret).encode(), scope_date)
    k = hmac_sha256(k, region)
    k = hmac_sha256(k, service)
    return hmac_sha256(k, "aws4_request")


def sigv4_get(bucket, key):
    now = datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    scope_date = now.strftime("%Y%m%d")
    scope = f"{scope_date}/{REGION}/{SERVICE}/aws4_request"

    method = "GET"
    canonical_uri = "/" + quote(key, safe="/-_.~")
    canonical_query = ""
    payload_hash = hashlib.sha256(b"").hexdigest()

    headers_to_sign = {
        "host": HOST,
        "x-amz-content-sha256": payload_hash,
        "x-amz-date": amz_date,
    }
    sorted_keys = sorted(headers_to_sign.keys())
    canonical_headers = "".join(f"{k}:{headers_to_sign[k]}\n" for k in sorted_keys)
    signed_headers = ";".join(sorted_keys)

    canonical_request = (
        f"{method}\n{canonical_uri}\n{canonical_query}\n"
        f"{canonical_headers}\n{signed_headers}\n{payload_hash}"
    )
    hashed_cr = hashlib.sha256(canonical_request.encode()).hexdigest()
    string_to_sign = f"AWS4-HMAC-SHA256\n{amz_date}\n{scope}\n{hashed_cr}"

    k_signing = signing_key(SECRET_KEY, scope_date, REGION, SERVICE)
    signature = hmac.new(k_signing, string_to_sign.encode(), hashlib.sha256).hexdigest()

    authorization = (
        f"AWS4-HMAC-SHA256 Credential={ACCESS_KEY}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )

    # Send the percent-encoded path so the URL matches the signed canonical URI.
    req = urllib.request.Request(f"https://{HOST}{canonical_uri}", method=method)
    for h, v in headers_to_sign.items():
        req.add_header(h, v)
    req.add_header("Authorization", authorization)

    with urllib.request.urlopen(req) as resp:
        return resp.read()


if __name__ == "__main__":
    print(sigv4_get(BUCKET, KEY).decode())

Typical failures.

Forgot host in headers_to_sign: SignatureDoesNotMatch
x-amz-date and the date inside Authorization differ: SignatureDoesNotMatch
Body is empty but you forgot to compute payload_hash: SignatureDoesNotMatch

The SDK does all of this correctly. When writing your own, the fastest debugging path is byte-for-byte diff against what the SDK produces.

9. End-to-end sequence: Client and Server verification

The server side (AWS) reproduces the exact same procedure to build the signature, then compares against what the client sent.

When AWS returns SignatureDoesNotMatch, the response body contains the Canonical Request that AWS itself reconstructed. That is gold for debugging. Diff your own Canonical Request against the one AWS built and the divergent element (header trim, URI encoding, missing header) pops out immediately.
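A minimal sketch of that diff, assuming you already hold both Canonical Requests as strings (how you pull the AWS-side one out of the error response is up to you). The function name diff_canonical_requests and the sample values are illustrative only.

```python
import difflib


def diff_canonical_requests(mine: str, theirs: str) -> list:
    """Return unified-diff lines between the client-built and AWS-built Canonical Request."""
    return list(
        difflib.unified_diff(
            mine.split("\n"),
            theirs.split("\n"),
            fromfile="client",
            tofile="aws",
            lineterm="",
        )
    )


# Illustrative inputs: the client forgot to trim the space after "x-amz-date:".
mine = (
    "GET\n/hello.txt\n\n"
    "host:kt-sigv4-demo.s3.us-east-1.amazonaws.com\n"
    "x-amz-date: 20260517T120000Z\n\n"
    "host;x-amz-date\nUNSIGNED-PAYLOAD"
)
theirs = mine.replace("x-amz-date: ", "x-amz-date:")

for line in diff_canonical_requests(mine, theirs):
    print(line)
```

Splitting on "\n" rather than using splitlines() keeps empty lines (such as the empty canonical query string) visible in the diff, and those are often where the bug is.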

10. Chunked Upload: STREAMING-AWS4-HMAC-SHA256-PAYLOAD

For large uploads to S3, hashing the entire body up front is not realistic. Hashing a 10 GB file before you even start sending it is a latency disaster. So S3 has a STREAMING mode where each chunk is signed individually.

The headers look like this.

x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD
Content-Encoding: aws-chunked
x-amz-decoded-content-length: <original body size>
Content-Length: <size including chunk headers>

The body is chunked-transfer-style but with a custom format. The chunk size is written in hexadecimal, so 10000 below means 65536 bytes.

10000;chunk-signature=<sig1>\r\n
<65536 byte chunk 1>\r\n
10000;chunk-signature=<sig2>\r\n
<65536 byte chunk 2>\r\n
...
0;chunk-signature=<sigN>\r\n
\r\n

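Each chunk's signature is computed by chaining the previous chunk's signature with the current chunk's hash. The first chunk chains from the seed signature, which is the signature carried in the request's Authorization header. Swap a chunk in the middle and every signature after it falls apart.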

The StringToSign gets a chunk-specific extension.

AWS4-HMAC-SHA256-PAYLOAD
<amz-date>
<scope>
<previous-signature>
HEX(SHA256(""))
HEX(SHA256(chunk-data))

previous-signature is what closes the chain. The stream terminates with a zero-byte chunk. There is also a STREAMING-AWS4-HMAC-SHA256-PAYLOAD-TRAILER variant that appends trailer headers like CRC32C for an extra integrity check.
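The chaining is easier to see in code. The sketch below follows the chunk StringToSign layout shown above. The names chunk_signature and sign_chunks are hypothetical, and the key, seed signature, and chunk contents in the example are placeholders rather than real values.

```python
import hashlib
import hmac

EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def chunk_signature(k_signing, amz_date, scope, previous_signature, chunk):
    """Sign one chunk, chaining it to the previous signature."""
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256-PAYLOAD",
        amz_date,
        scope,
        previous_signature,
        EMPTY_SHA256,
        hashlib.sha256(chunk).hexdigest(),
    ])
    return hmac.new(k_signing, string_to_sign.encode(), hashlib.sha256).hexdigest()


def sign_chunks(k_signing, amz_date, scope, seed_signature, chunks):
    """Return one signature per chunk, plus one for the final zero-byte chunk."""
    signatures = []
    previous = seed_signature  # signature from the Authorization header
    for chunk in list(chunks) + [b""]:
        previous = chunk_signature(k_signing, amz_date, scope, previous, chunk)
        signatures.append(previous)
    return signatures


# Placeholder inputs for illustration only.
k_signing = b"\x00" * 32
scope = "20260517/us-east-1/s3/aws4_request"
original = sign_chunks(k_signing, "20260517T120000Z", scope, "0" * 64, [b"a", b"b", b"c"])
tampered = sign_chunks(k_signing, "20260517T120000Z", scope, "0" * 64, [b"a", b"X", b"c"])
print(original[0] == tampered[0])  # chunk before the change
print(original[2] == tampered[2])  # chunk after the change
```

Because each signature feeds into the next StringToSign, changing the second chunk changes its signature and every signature after it, while the first one is untouched.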

Writing this by hand is basically masochism. Let the SDK handle it. But knowing the mechanics makes questions like "why is Content-Length different from x-amz-decoded-content-length" trivial to answer.

11. Streaming SigV4: WebSocket / IoT MQTT

SigV4 also rides on WebSocket and MQTT over WebSocket, not just plain HTTP. IoT Core, API Gateway WebSocket, and CloudWatch Logs Live Tail all sit here.

These use a Presigned URL form that embeds SigV4 into the connection URL's query string. The Authorization header is hard to attach to a WebSocket Upgrade, so cramming everything into the URL lets the handshake complete in one round trip. With MQTT over WebSocket the URL ends up as wss://<endpoint>/mqtt?X-Amz-Algorithm=...&X-Amz-Signature=....

The streaming case is special because the connection lives for a long time. The SigV4 signature itself is checked only once at connection establishment, so even after X-Amz-Expires passes, the existing connection keeps going (depending on the AWS implementation). Reconnecting requires a fresh signature.

12. What is inside a Presigned URL

A Presigned URL is the Authorization header crammed into the query string. Heavily used for "hand someone a URL and let the browser download the S3 object directly".

https://kt-bucket.s3.us-east-1.amazonaws.com/report.pdf
  ?X-Amz-Algorithm=AWS4-HMAC-SHA256
  &X-Amz-Credential=AKIA.../20260517/us-east-1/s3/aws4_request
  &X-Amz-Date=20260517T120000Z
  &X-Amz-Expires=900
  &X-Amz-SignedHeaders=host
  &X-Amz-Signature=fe5f80f77d5fa3beca038a248ff027d0445342fe2855ddc963176630326f1024

Properties.

X-Amz-Expires: in seconds. Maximum 604800 (= 7 days). The 7-day cap exists because, per the S3 docs, a derived signing key is valid for at most seven days (kDate ties the key to a single date in the scope, and signatures made with it are accepted for up to a week).
X-Amz-SignedHeaders=host: no body, no extra headers, so only host needs to be signed.
Payload hash handling: presigned URLs use either UNSIGNED-PAYLOAD or the actual body hash. Looser than the header-based form.

Watch out for these.

A Presigned URL minted with IAM Role temporary credentials (e.g. STS AssumeRole) is capped by the credential's own lifetime. AssumeRole defaults to 1 hour for DurationSeconds, configurable up to 12 hours, but once the SessionToken expires the URL dies. This is the standard root cause when someone passes --expires-in 604800 and the URL still dies after 1 hour.
If you are handing the URL to a browser, mint it with an IAM User long-term key or raise the Role's MaxSessionDuration. EC2 Instance Profile lands in the same trap: IMDS rotates the credentials automatically, and once the credentials that signed a URL expire, that URL is dead.
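For reference, a presigned GET URL can be sketched with the same building blocks as the header-based flow in section 8. This is a minimal sketch under stated assumptions: long-term credentials (no X-Amz-Security-Token), only host signed, and UNSIGNED-PAYLOAD as the payload hash. The function name presign_get is hypothetical.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_get(access_key, secret_key, region, bucket, key, expires, now):
    """Build a SigV4 presigned GET URL for an S3 object (sketch, long-term keys only)."""
    if not 1 <= expires <= 604800:
        raise ValueError("X-Amz-Expires must be between 1 and 604800 seconds")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    scope_date = now.strftime("%Y%m%d")
    scope = f"{scope_date}/{region}/s3/aws4_request"

    # Everything that used to live in the Authorization header moves here.
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: percent-encode names and values, sort by name.
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(params.items())
    )
    canonical_uri = "/" + quote(key, safe="/-_.~")
    canonical_request = "\n".join([
        "GET", canonical_uri, canonical_query,
        f"host:{host}\n",  # canonical headers block ends with a newline
        "host",
        "UNSIGNED-PAYLOAD",
    ])
    hashed_cr = hashlib.sha256(canonical_request.encode()).hexdigest()
    string_to_sign = f"AWS4-HMAC-SHA256\n{amz_date}\n{scope}\n{hashed_cr}"

    k = _hmac(("AWS4" + secret_key).encode(), scope_date)
    k = _hmac(k, region)
    k = _hmac(k, "s3")
    k = _hmac(k, "aws4_request")
    signature = hmac.new(k, string_to_sign.encode(), hashlib.sha256).hexdigest()

    # The signature is appended last and is not part of what was signed.
    return f"https://{host}{canonical_uri}?{canonical_query}&X-Amz-Signature={signature}"


now = datetime.datetime(2026, 5, 17, 12, 0, 0, tzinfo=datetime.timezone.utc)
print(presign_get("AKIAIOSFODNN7EXAMPLE", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
                  "us-east-1", "kt-bucket", "report.pdf", 900, now))
```

With temporary credentials the SessionToken would also have to travel in the query string, which is one reason the URL cannot outlive the session that signed it.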

13. SigV4A: the asymmetric variant

Now the main event. The instant you hit Multi-Region Access Point (MRAP), the SDK silently flips from SigV4 to SigV4A. The algorithm name is AWS4-ECDSA-P256-SHA256. It uses ECDSA (Elliptic Curve Digital Signature Algorithm) with NIST P-256 instead of HMAC.

How it actually works

SigV4A derives an ECDSA P-256 keypair from the SecretAccessKey through a KDF. The spec uses input_key = "AWS4A" || sk, the label AWS4-ECDSA-P256-SHA256, the akid (AccessKeyId) as context, and runs a counter that iterates until the derived scalar c is at most n − 2, where n is the P-256 order. The private key is then k = c + 1, which keeps k inside the valid range [1, n − 1], and the public key is Q = k * G. The verifier needs only the public key to check the signature. The canonical implementation lives in key_derivation.c in awslabs/aws-c-auth.

Why Multi-Region forces this

MRAP is a representative endpoint for a set of buckets across multiple regions, and any request can be routed to any of them. SigV4 bakes the region name directly into CredentialScope, so a signature minted for us-east-1 simply cannot be verified at us-west-2.

SigV4A drops the region segment from CredentialScope entirely and replaces it with the X-Amz-Region-Set header that declares "this signature is valid in us-east-1, us-west-2, and eu-west-1". CredentialScope changes like this.

SigV4:  20260517/us-east-1/s3/aws4_request
SigV4A: 20260517/s3/aws4_request   (region segment removed)

The X-Amz-Region-Set value is a comma-separated list like us-east-1,us-west-2, and wildcards like us-east-* or a bare * are allowed too.
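A minimal sketch of how such a region set could be evaluated. The exact matching rules AWS applies are not spelled out here, so treating each entry as a simple glob pattern is an assumption, and region_in_set is a hypothetical name.

```python
from fnmatch import fnmatchcase


def region_in_set(region: str, region_set: str) -> bool:
    """Check a region against a comma-separated region set with glob-style wildcards.

    Assumption: each entry behaves like a simple glob pattern ("us-east-*", "*").
    """
    patterns = [p.strip() for p in region_set.split(",") if p.strip()]
    return any(fnmatchcase(region, pattern) for pattern in patterns)


print(region_in_set("us-west-2", "us-east-1,us-west-2"))
print(region_in_set("eu-west-1", "us-east-*"))
print(region_in_set("ap-northeast-1", "*"))
```

The narrower the region set, the smaller the blast radius of a captured signature, so a bare * is best kept for cases where the destination region is not known in advance.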

On top of that, every region can verify independently with only the public key. Distributing a symmetric HMAC key to every region is a security risk (compromise one region and they all leak), which is exactly where asymmetric signing pays off.

Is the ECDSA here deterministic

Standard ECDSA needs a fresh random nonce k for every signature (the same input produces different signatures each time). Bias or reuse of k is a fatal mistake that leaks the private key. SigV4A's implementation lives in AWS Common Runtime (awslabs/aws-c-auth + aws-c-cal), and the nonce comes from the OS RNG, so the same request produces a different signature on every call. Whether it follows RFC 6979 (deterministic ECDSA) is not stated explicitly in public docs. "awscrt handles the nonce correctly" is enough to know in practice.

Rolling your own is brutal

ECDSA has heavy math, so if you go custom, use the cryptography library. AWS officially recommends aws-crt (a C library bound into Python through awscrt). botocore treats awscrt as an optional dependency. Install with pip install botocore[crt] (or pip install boto3[crt]) and SigV4A kicks in automatically against MRAP. Plain boto3 without the crt extra crashes against MRAP with MissingDependencyException.

14. SigV4 vs SigV4A side by side

Item	SigV4	SigV4A
Algorithm name	`AWS4-HMAC-SHA256`	`AWS4-ECDSA-P256-SHA256`
Key type	Symmetric (HMAC)	Asymmetric (ECDSA P-256)
Key derivation	4-stage HMAC chain (kDate to kRegion to kService to kSigning)	One-shot KDF with a counter to derive an ECDSA keypair (label `AWS4-ECDSA-P256-SHA256`, input `AWS4A` + secret)
Region in CredentialScope	Fixed (e.g. `us-east-1`)	Segment removed. `X-Amz-Region-Set` header declares the regions (wildcards allowed)
Verification	AWS recomputes the same HMAC from the same Secret	AWS verifies the ECDSA signature with the public key
Main use case	Any single-region API call	Multi-Region Access Point, replication paths
Signature determinism	Deterministic (same input gives same signature)	Non-deterministic (`awscrt` uses a random nonce)
Performance	HMAC is ultra-light (microseconds)	ECDSA is heavier (milliseconds)
7-day Presigned URL	Supported	Supported (same conditions)
SDK support	All SDKs	Through `aws-crt` (Python needs `botocore[crt]`; Go/Java/JS bundle it)

For working developers the only thing that matters in practice: the moment MRAP enters the picture the SDK flips to SigV4A; everything else stays on SigV4. You almost never reach for SigV4A by hand.

15. SigV4 vs SigV4A: scope and verifier differences in one figure

SigV4 assumes 1 request = 1 region, so HMAC is enough. SigV4A is built around 1 request landing in any of several regions, which makes a shared HMAC untenable and forces ECDSA. That is the structural difference.

16. The Clock Skew trap

SigV4's CredentialScope carries <date> and x-amz-date carries a second-precision timestamp. AWS rejects requests whose timestamp is too far from the current time. The S3 docs state a window of ±15 minutes explicitly; other services can be stricter (the general SigV4 guidance is 5 minutes).
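The check itself is simple to model. The sketch below parses an x-amz-date value and compares it against a reference time, using the 15-minute S3 window as the default tolerance. The names parse_amz_date and within_skew are hypothetical, and the reference clock is whatever trusted time source you have available.

```python
import datetime

MAX_SKEW = datetime.timedelta(minutes=15)  # S3's documented window


def parse_amz_date(value: str) -> datetime.datetime:
    """Parse an x-amz-date value such as 20260517T120000Z into an aware UTC datetime."""
    parsed = datetime.datetime.strptime(value, "%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=datetime.timezone.utc)


def within_skew(amz_date: str, reference_now: datetime.datetime,
                tolerance: datetime.timedelta = MAX_SKEW) -> bool:
    """Return True when the request timestamp is inside the allowed window."""
    return abs(reference_now - parse_amz_date(amz_date)) <= tolerance


reference = datetime.datetime(2026, 5, 17, 12, 0, 0, tzinfo=datetime.timezone.utc)
print(within_skew("20260517T121000Z", reference))  # 10 minutes ahead
print(within_skew("20260517T114400Z", reference))  # 16 minutes behind
```

Running this against a reference time taken from a trusted source (for example the Date header of any HTTPS response) is a quick way to tell clock skew apart from a signing bug.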

Three classic ways to step on this.

Docker container NTP drift: when the container is not NTP-synced, host clock skew kills you instantly. Containers with no visible hwclock and no ntpd are the worst offenders. Lambda is fine because AWS keeps it in sync.
EC2 with chrony missing: AL2023 ships chronyd by default, but custom AMIs that strip it out drift over time. Point chrony at 169.254.169.123 (Amazon Time Sync Service) and the problem disappears.
Old Lambda layers / virtualized clocks: BPF-based sandboxes where the clock is pinned. Upgrading the Lambda runtime usually fixes it.

Error examples.

SignatureDoesNotMatch:
  The request signature we calculated does not match the signature you provided.
  Check your key and signing method.
RequestTimeTooSkewed:
  The difference between the request time and the current time is too large.

RequestTimeTooSkewed is always clock skew, 100%. SignatureDoesNotMatch is clock skew, a wrong secret key, or a Canonical Request construction bug.

Conclusion

SigV4 solves "request signing that detects tampering and replay without ever exposing the secret". Chain HMAC four times to derive a signing key, SHA256 the Canonical Request to build a StringToSign, then HMAC the whole thing for the final signature. Write the four steps by hand once and everything the SDK was hiding finally becomes visible.

SigV4A is SigV4 extended to multi-region by going asymmetric (ECDSA P-256). Strip the region from CredentialScope, declare the regions in X-Amz-Region-Set, and let each region verify independently with the public key. The SDK flips to it automatically when MRAP is involved, so you basically never reach for it by hand.

The two real-world traps are always the same: clock skew and canonical-request normalization mistakes. Read the exact error message (RequestTimeTooSkewed vs SignatureDoesNotMatch vs AccessDenied) and diff your Canonical Request against the one AWS echoes back. Get that far and SigV4 stops biting.

AWS IAM Roles Anywhere Deep Dive

kt — Thu, 28 May 2026 15:46:19 +0000

Introduction

"I want to drop a file from an on-prem server into S3."
"I want to read DynamoDB from a Kubernetes pod sitting in my datacenter."
"I want to pull a secret from AWS Secrets Manager from an app in someone else's cloud."

Do this the naive way and you end up here:

Create an IAM User
Issue an access key (the AKIA... kind)
Paste it into ~/.aws/credentials, env vars, or some on-prem Secrets Manager
Use it forever

This is where most production incidents start today. Long-lived access keys leak and stay leaked, rotation gets forgotten, and there is no record of who copied them where or when.

AWS IAM Roles Anywhere is the mechanism that hands IAM Role temporary credentials to workloads outside AWS without distributing any long-lived key. The key material is replaced by an X.509 certificate.

This article goes deep on Roles Anywhere.

1. Vocabulary you need first

Roles Anywhere sits on top of two things: IAM Role and PKI (the certificate world). If either is fuzzy you will get lost fast. The bare minimum below.

IAM Role / STS / temporary credentials

Term	What to remember
IAM Role	A box that says "who is allowed to take this permission on". It has two parts: a Trust Policy (who can assume it) and a Permission Policy (what the assumer can do)
AssumeRole	The act of taking on a Role. When it succeeds, you immediately get back a set of temporary credentials
STS (Security Token Service)	The AWS service that issues temporary credentials. The `sts:AssumeRole` family of APIs lands here
Temporary credentials	A three-piece set: `AccessKeyId` + `SecretAccessKey` + `SessionToken`. Default lifetime 1 hour, up to 12 hours via the Role config

X.509 certificate / CA / PKI

Term	What to remember
X.509 certificate	A digital document where a CA signs "this public key belongs to this entity". The cert sitting in front of any HTTPS site is the same shape
CA (Certificate Authority)	The party that issues certificates. Trusting a CA's certificate means trusting every certificate it has ever issued
Private CA	A CA for closed environments such as inside a company. AWS sells a managed version called AWS Private CA (formerly ACM PCA)
PKI (Public Key Infrastructure)	The whole ecosystem of issuing, distributing, and revoking certificates
Private key	The key paired with the certificate. Digital signatures are made with it. Never put it on the network

Roughly: Roles Anywhere is the mechanism that makes AssumeRole callable using an X.509 certificate's private-key signature, instead of an IAM User's long-term key.

2. The big picture and the three actors

Roles Anywhere has three specific concepts: Trust Anchor / Profile / Role. Pin them down visually first.

The three actors and their jobs:

Concept	What it is	What it holds
Trust Anchor	The declaration "I (this AWS account) trust this CA"	A reference to an AWS Private CA, or the PEM of an external CA
Profile	The declaration "for callers authenticated via this CA, which Roles can they use and with what session limits"	A list of allowed Role ARNs + session duration + optional Session Policy
IAM Role	The permission body that activates once assumed	Trust Policy (allowing the Roles Anywhere service principal) + Permission Policy

Walking through each in order.

3. Trust Anchor: the root of trust

A Trust Anchor is the declaration "this AWS account trusts this CA". When Roles Anywhere sees a certificate, it walks the chain to confirm the certificate was issued by this CA.

There are two ways to create a Trust Anchor.

A. Use AWS Private CA (PCA) as the source

AWS Private CA (formerly ACM PCA) is AWS's managed paid CA. When creating a Trust Anchor you say "use this PCA", and every certificate issued by that PCA is trusted automatically.

PCA has two modes (Tokyo region reference pricing, varies by region):

General Purpose Mode: 400 USD/month plus per-certificate issuance fees. Full features (CRL publication, long-lived certs, freely issued via API). Capable as a replacement for an internal PKI
Short-Lived Certificate Mode: 50 USD/month plus per-certificate issuance fees. Only for certificates with a lifetime of 7 days or less, no CRL publication. Optimised for the Roles Anywhere "issue short-lived certs frequently" usage pattern

Upside: issuance, revocation, and renewal automation lean on AWS. The AWS Private CA Issuer for cert-manager lets Kubernetes consume it.

Downside: either mode bills a flat monthly fee just for having a CA stood up. For verification or dev work, an external CA is cheaper.

B. Upload an existing external CA

Upload a CA certificate in PEM and register it as a Trust Anchor. If you already have an internal PKI (HashiCorp Vault, smallstep step-ca, an OpenSSL-based homemade CA, etc.), the extra cost is zero.

aws rolesanywhere create-trust-anchor \
    --name "my-internal-ca" \
    --source '{
        "sourceType": "CERTIFICATE_BUNDLE",
        "sourceData": {
            "x509CertificateData": "-----BEGIN CERTIFICATE-----\nMIID...\n-----END CERTIFICATE-----"
        }
    }' \
    --enabled

With an external CA you own revocation management (CRL updates). Covered in §8.

4. Profile: which Role, with what guardrails

A Profile is "the usage rules for callers authenticated against a given Trust Anchor".

What goes into a Profile:

roleArns: list of IAM Role ARNs that may be assumed
durationSeconds: maximum session lifetime (900 to 43200 seconds, i.e. 15 minutes to 12 hours, capped by the Role's MaxSessionDuration)
sessionPolicy / managedPolicyArns: extra policy that applies only to the session. "The Role's Permission Policy is X, but calls coming through this Profile are further narrowed to Y"

A Session Policy intersects with the Role's Permission Policy via AND. Even if the Role allows s3:*, narrowing the Session Policy to s3:GetObject means only s3:GetObject is effectively permitted.
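The AND can be modelled in a few lines. This is a toy sketch only: it looks at Action patterns and ignores Deny statements, Resources, and Conditions, all of which real IAM evaluation takes into account. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(action: str, patterns) -> bool:
    """True if the action matches one of the IAM-style patterns (wildcards allowed)."""
    # IAM action names are case-insensitive, so compare in lower case.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def effective_allow(action: str, role_actions, session_actions) -> bool:
    """An action is allowed only if BOTH the Role and the Session Policy allow it."""
    return matches_any(action, role_actions) and matches_any(action, session_actions)


role_actions = ["s3:*"]
session_actions = ["s3:GetObject"]
print(effective_allow("s3:GetObject", role_actions, session_actions))
print(effective_allow("s3:PutObject", role_actions, session_actions))
```

The useful consequence is that a Session Policy can only narrow a Role, never widen it: anything the Role does not allow stays denied no matter what the Profile says.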

5. The Role's Trust Policy: who is allowed to call

The Role itself is created the usual way, but its Trust Policy (the policy saying "who can Assume this Role") has a Roles Anywhere-specific shape:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "rolesanywhere.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession",
        "sts:SetSourceIdentity"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/x509Subject/CN": "server-a.example.com"
        },
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:rolesanywhere:ap-northeast-1:123456789012:trust-anchor/abc-..."
        }
      }
    }
  ]
}

Key points:

The service principal is rolesanywhere.amazonaws.com. Without this entry, AssumeRole via the Roles Anywhere CreateSession path will not work
Allowing sts:TagSession lets the certificate's Subject / SAN / Issuer be injected as session tags (more on this below)
Conditions like aws:PrincipalTag/x509Subject/CN let you narrow down further based on the certificate contents. Fine-grained control like "only certificates with CN server-a.example.com may assume this" is possible
aws:SourceArn pins the call to a specific Trust Anchor, so a request coming from a Trust Anchor you did not expect is denied

Session tags auto-populated from the certificate

A session created through CreateSession + AssumeRole automatically carries certificate attributes as tags. The common ones:

Tag key	Content
`aws:PrincipalTag/x509Subject/CN`	Common Name of the certificate Subject
`aws:PrincipalTag/x509SAN/DNS`	DNS name in the Subject Alternative Name (SAN)
`aws:PrincipalTag/x509SAN/URI`	URI in SAN (e.g. a SPIFFE ID like `spiffe://example.com/ns/prod/sa/app`)
`aws:PrincipalTag/x509Issuer/CN`	CN of the issuing CA
`aws:PrincipalTag/x509Serial`	Certificate serial number

A small bit of supplementary vocabulary:

SAN (Subject Alternative Name): an X.509 extension field that holds additional identifiers beyond Common Name, including multiple hostnames, IPs, or URIs
SPIFFE ID: a URI-format workload identifier defined by the SPIFFE spec (spiffe://...). If your internal identity platform uses SPIFFE, dropping a SPIFFE ID into the SAN URI lets the AWS side reference it directly in a Condition

The upshot: the Role's Permission Policy can also use aws:PrincipalTag/x509Subject/CN in its Conditions. Patterns like "the server-a.example.com certificate can only write under the prefix=server-a/* portion of S3" become possible. That is ABAC (Attribute Based Access Control): instead of carving permissions up by Role (RBAC-style), principal attributes (tags) dynamically tighten what is allowed.
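One way to express that pattern is a Permission Policy whose Resource embeds the session tag as a policy variable. The sketch below builds such a document as a Python dict. The bucket name is a placeholder, the helper name is hypothetical, and using the full CN as the prefix (server-a.example.com/ rather than server-a/) is an assumption made to keep the example simple.

```python
import json


def per_host_prefix_policy(bucket: str) -> dict:
    """Build a Permission Policy that confines each certificate to its own S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # ${aws:PrincipalTag/...} is resolved by IAM per session,
                # so one policy serves every certificate.
                "Resource": f"arn:aws:s3:::{bucket}/${{aws:PrincipalTag/x509Subject/CN}}/*",
            }
        ],
    }


print(json.dumps(per_host_prefix_policy("demo-bucket"), indent=2))
```

A single Role with this policy covers any number of servers, because the prefix is derived from the certificate at session time instead of being hard-coded per Role.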

6. The CreateSession flow

Tracing what actually happens from the client's first call until temporary credentials are in hand.

Key points:

A "SigV4 X.509 variant" sits on the SigV4 frame but signs with an X.509 private key. Normal AWS APIs sign with AWS4-HMAC-SHA256 (symmetric HMAC using the access key). CreateSession uses one of AWS4-X509-RSA-SHA256 / AWS4-X509-ECDSA-SHA256 / AWS4-X509-MLDSA (the last one is post-quantum, PQC-capable) depending on the key type. The Canonical Request construction is the same as SigV4. Only the final step is an asymmetric X.509 signature instead of HMAC
The private key never leaves the credential helper. It does not go on the network
The credentials handed back are plain IAM temporary credentials. The SDK sees nothing unusual. Subsequent S3 / DynamoDB calls go out as normal SigV4 (HMAC)

7. What is inside the credential helper

Hand-rolling the CreateSession signing on the client side is not realistic, so AWS ships an official binary called aws_signing_helper (repo: aws/rolesanywhere-credential-helper). That binary is the credential helper.

aws_signing_helper plugs into the AWS CLI / SDK credential_process spec. Drop this into ~/.aws/config and both the CLI and any SDK transparently fetch credentials through Roles Anywhere:

[profile myapp]
credential_process = /usr/local/bin/aws_signing_helper credential-process \
    --certificate /etc/pki/server.pem \
    --private-key /etc/pki/server.key \
    --trust-anchor-arn arn:aws:rolesanywhere:ap-northeast-1:123456789012:trust-anchor/xxxx \
    --profile-arn       arn:aws:rolesanywhere:ap-northeast-1:123456789012:profile/yyyy \
    --role-arn          arn:aws:iam::123456789012:role/MyAppRole

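A call like aws s3 ls --profile myapp triggers credential_process, which runs aws_signing_helper, which prints the credentials as JSON on stdout. The AWS CLI / SDK takes care of automatic credential refresh, so no manual refresh is needed. The backslash line breaks above are for readability; in the actual config file, keep the whole credential_process value on one line.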
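The JSON that a credential_process program prints follows a small, documented shape: Version, AccessKeyId, SecretAccessKey, SessionToken, and Expiration. A minimal sketch of reading it, with a hypothetical function name and made-up sample values:

```python
import datetime
import json

REQUIRED_KEYS = ("AccessKeyId", "SecretAccessKey", "SessionToken", "Expiration")


def parse_credential_process_output(text: str) -> dict:
    """Validate and parse the JSON printed by a credential_process program."""
    doc = json.loads(text)
    if doc.get("Version") != 1:
        raise ValueError("unsupported credential_process Version")
    missing = [k for k in REQUIRED_KEYS if k not in doc]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Expiration is an ISO 8601 timestamp; normalise a trailing Z for fromisoformat.
    expiration = datetime.datetime.fromisoformat(doc["Expiration"].replace("Z", "+00:00"))
    return {
        "access_key_id": doc["AccessKeyId"],
        "secret_access_key": doc["SecretAccessKey"],
        "session_token": doc["SessionToken"],
        "expiration": expiration,
    }


# Made-up sample values, shaped like real output.
sample = json.dumps({
    "Version": 1,
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
    "Expiration": "2026-05-28T16:46:19Z",
})
print(parse_credential_process_output(sample)["expiration"])
```

The Expiration field is what lets the caller decide when to run the helper again, which is why refresh works without any extra configuration.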

Where the private key lives

The choice of private-key storage sets the ceiling on your Roles Anywhere operational quality. aws_signing_helper supports the following backends. Leak resistance rises as you go down the table.

Storage	Properties	Leak resistance
File (`/etc/pki/server.key` etc.)	Easiest. Anyone with read access can `cat` it	Low
OS keystore (Windows Certificate Store / macOS Keychain)	Protected by OS access control	Medium
PKCS#11 module (HSM / smartcard)	Key stays inside the HSM, only signing operations are delegated	High
TPM 2.0 (secure chip on the motherboard)	Key sealed into hardware, non-exportable	High

In high-security environments, sealing the key into PKCS#11 (HSM) or TPM 2.0 so that the key cannot be extracted at all is the right move. aws_signing_helper exposes flags like --cert-selector and --tpm-key-handle to wire into these backends.

8. Revoking a certificate

You will eventually need to kill a compromised server's certificate immediately. Roles Anywhere supports CRL (Certificate Revocation List).

Importing a CRL

Generate a revocation list on the CA side and register it in Roles Anywhere as PEM.

aws rolesanywhere import-crl \
    --name "my-ca-crl" \
    --crl-data fileb://crl.pem \
    --trust-anchor-arn arn:aws:rolesanywhere:ap-northeast-1:123456789012:trust-anchor/xxxx \
    --enabled

Once registered, Roles Anywhere checks the CRL on every CreateSession. Calls from revoked certificates are rejected.

Auto-integration with AWS Private CA

When the Trust Anchor source is PCA, calling revoke-certificate on the PCA causes the PCA to publish a CRL into S3 within about 30 minutes. The standard pattern is to catch that with Lambda and feed it to the ImportCrl API.

Temporarily disabling the CRL check

When you need to disable it during incident response, DisableCrl flips the check off and EnableCrl turns it back on. In normal production, leave it on.

9. When it fits, and when it does not

Roles Anywhere is powerful but not for every case. The decision flow:

Takeaways:

Workloads running inside AWS do not need Roles Anywhere. Instance Profile / Execution Role / Pod Identity are the natural choice
If your CI can speak OIDC, prefer OIDC. GitHub Actions / GitLab CI / etc. emit OIDC officially. Roles Anywhere is unnecessary
If you cannot emit OIDC, you already have an X.509-based PKI, and the server is fully on-prem, that is where Roles Anywhere shines

Where Roles Anywhere is a poor fit

"We do not have an internal PKI yet" cases: standing up a CA before introducing Roles Anywhere is much more work than Roles Anywhere itself. If there is an OIDC path, take that
"Thousands of clients, each with its own certificate, CRL updated daily" cases: at this scale CRL propagation lag and other factors need real design work
"Calling from inside a VPC" cases: anything inside a VPC is already inside AWS, so attaching a Role the normal way is enough

10. Pricing and limits

Pricing

The IAM Roles Anywhere service itself is free. No per-CreateSession request fees
AWS Private CA is the only billed component. 400 USD/month for General Purpose, 50 USD/month for Short-Lived Certificate, plus per-certificate issuance fees
Using an external CA (HashiCorp Vault / step-ca / homemade PKI etc.) avoids the PCA bill (you pay in operational effort instead)

Main limits (from official Service Quotas, per Region)

Item	Default	Increase request
Trust Anchors per account	50	Allowed
Profiles per account	250	Allowed
Roles per Profile	250	Not allowed (hard limit)
Registered certificates per Trust Anchor	2	Not allowed (two slots, for rotation)
CRLs per Trust Anchor	2	Not allowed
CreateSession rate	10 req/sec	Allowed
Session lifetime	15 minutes to 12 hours	Within the Role's `MaxSessionDuration`

Note that CreateSession is 10 req/sec. A design that pulls short-lived credentials in volume will hit this per-Region rate. The basic pattern is to cache credentials on the client side and refresh just before expiration. aws_signing_helper's credential_process does this automatically.
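If you ever call CreateSession from your own code instead of going through credential_process, the cache-and-refresh pattern looks roughly like this. It is a minimal sketch: CachedCredentials is a hypothetical class, fetch stands in for whatever actually obtains credentials, and the 5-minute refresh margin is an arbitrary choice.

```python
import datetime


class CachedCredentials:
    """Cache one credential set and refetch shortly before it expires."""

    def __init__(self, fetch, refresh_margin=datetime.timedelta(minutes=5), clock=None):
        self._fetch = fetch            # callable returning a dict with an "expiration" datetime
        self._margin = refresh_margin  # how early to refresh before expiry
        self._clock = clock or (lambda: datetime.datetime.now(datetime.timezone.utc))
        self._creds = None

    def get(self):
        now = self._clock()
        if self._creds is None or now >= self._creds["expiration"] - self._margin:
            self._creds = self._fetch()  # this is the only place CreateSession would run
        return self._creds


def demo_fetch():
    """Stand-in for a real CreateSession call; returns a one-hour credential set."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {"access_key_id": "ASIAEXAMPLE", "expiration": now + datetime.timedelta(hours=1)}


cache = CachedCredentials(demo_fetch)
first = cache.get()
second = cache.get()
print(first is second)  # the second call is served from the cache
```

With one-hour sessions and a 5-minute margin, each client makes roughly one CreateSession call per 55 minutes, which keeps even a large fleet far below the per-Region rate.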

11. Don'ts and Do's

❌ Don't

Drop the private key as a plaintext file readable by anyone
Reuse one certificate across multiple servers (you lose the ability to tell which one leaked)
Write a Trust Policy that trusts rolesanywhere.amazonaws.com alone, with no aws:SourceArn or certificate Conditions
Skip CRL operations entirely (i.e. no way to invalidate a cert on compromise)
Issue certificates with absurdly long lifetimes (e.g. 10 years)

✅ Do

Seal the private key into TPM 2.0 / HSM (PKCS#11) / OS keystore
One server, one certificate. Put a hostname or SPIFFE-ID-equivalent identifier in the Subject / SAN
Issue short-lived certificates (e.g. 24 hours to a few days) with automatic renewal. cert-manager and step-ca's auto-renew machinery is the standard pattern
Put both aws:SourceArn (Trust Anchor) and aws:PrincipalTag/x509Subject/CN into the Role's Trust Policy
Automate CRL ingestion and keep DisableCrl ready as an operations break-glass
Use session tags (aws:PrincipalTag/x509...) to drive ABAC in the Role's Permission Policy

12. Wrap-up

IAM Roles Anywhere hands IAM Role temporary credentials to workloads outside AWS without distributing long-lived keys
It uses X.509 certificates as the key material, with the issuing CA registered as a Trust Anchor
Three actors: Trust Anchor (the trusted CA) / Profile (which Role can be used) / Role (the actual permission)
CreateSession is not plain SigV4. It uses a distinct flow signed with the certificate's private key, and the private key never goes on the network
The official aws_signing_helper ships as a credential_process provider, so existing CLI and SDK code works as-is
Seal the private key in TPM / HSM / OS keystore. A plain file on disk is an incident waiting to happen
Revocation goes through CRL via ImportCrl. PCA-backed setups can auto-integrate via S3
If you are inside AWS, Roles Anywhere is unnecessary. If CI can speak OIDC, OIDC wins. On-prem, multi-cloud, and existing-PKI worlds are where it actually fits
The service itself is free. Only PCA costs money

Microsegmentation Deep Dive

kt — Wed, 27 May 2026 15:03:02 +0000

Introduction

A few years back, an SRE I know said this to me:

"The firewall on the outside is hardened to hell. But once you're in, it's over. Inside the LAN, everything from Pod-to-Pod to SMB just flows."

You cannot defend a modern system by making the outside stronger. You have to slice the inside of the LAN, the gaps between Pods and containers, the East-West traffic between VMs, into many small zones and police each one. That is microsegmentation.

This article tries to nail down microsegmentation at the resolution of implementation and operations, not as a vibe-word. Why VLAN and firewall hit a wall, how the four implementation styles (hypervisor / agent / cloud-native / Identity) actually differ, and how players like VMware NSX, Illumio, Akamai Guardicore, Cilium, and Google BeyondCorp each took their own shot at the problem.

0. Prerequisites for reading this

A quick glossary for terms used later. Skip if you already know them.

North-South traffic and East-West traffic

Traffic crossing the boundary of a data center or cluster is North-South. A user's browser hitting a web server, for example. Traffic inside the data center, server to server, Pod to Pod, container to container, is East-West.

In modern systems, East-West dwarfs North-South in both volume and variety. With microservices, containers, and data lakes, a single user request often fans out into dozens of internal API calls. A perimeter firewall sees none of that East-West.

Zero Trust and "Never Trust, Always Verify"

The idea of throwing out the assumption that "the internal LAN is trusted." Every connection is treated as if it were external, and authorization happens every time based on identity, device state, and context. NIST SP 800-207 is the de facto standard spec.

Microsegmentation is the main piece of "how do you actually implement Zero Trust at the network layer."

PEP / PDP (Policy Enforcement Point / Policy Decision Point)

The core model defined by NIST SP 800-207.

PDP (Policy Decision Point): the "brain" that evaluates policy and returns allow/deny. Internally split in two.
- PE (Policy Engine): the decision logic itself.
- PA (Policy Administrator): opens and closes sessions based on PE's verdict.
PEP (Policy Enforcement Point): the "hand" that actually applies the PDP's decision to the wire.

In microsegmentation, the PEP is something sitting right next to each workload: the host OS firewall, the hypervisor's virtual NIC, a sidecar proxy, an eBPF program. The PE consumes identity, device posture, threat intel, behavioral baselines, and so on.

Comparing microsegmentation products comes down to two axes: where you put the PEP and what signals the PDP can use to decide. The four implementation styles in section 3 each map onto a different corner of this picture.
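To make the PDP/PEP split concrete, here is a minimal Python sketch. The rule table, the Request fields, and the function names are all hypothetical; real products evaluate far richer signals than a single posture flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    source: str           # identity of the caller (a plain label, for illustration)
    destination: str      # identity of the callee
    port: int
    device_healthy: bool  # stand-in for a device-posture signal


# Hypothetical allow list consumed by the Policy Engine.
ALLOW_RULES = {("web", "cart", 8080), ("cart", "db", 5432)}


def policy_engine(req: Request) -> bool:
    """PE: pure decision logic. Anything not explicitly allowed is denied."""
    if not req.device_healthy:
        return False
    return (req.source, req.destination, req.port) in ALLOW_RULES


def enforcement_point(req: Request) -> str:
    """PEP: applies the verdict to the traffic, right next to the workload."""
    return "FORWARD" if policy_engine(req) else "DROP"
```

The point of the split is that enforcement_point can live in many places (host firewall, vNIC, sidecar) while policy_engine stays central.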

VLAN, ACL, Security Group

The classic cast of segmentation. VLANs split broadcast domains at L2. ACLs (Access Control Lists) sit on routers and switches and permit or deny traffic per IP/port. In cloud, Security Groups act as stateful L3/L4 ACLs. All of them are IP/location-based, which (we will see) is exactly the root cause of today's pain.

IP-based vs Identity-driven

What you use as the subject of a policy.

IP-based: "allow 10.0.1.0/24 to reach 10.0.2.5:443." The address is the subject. VLAN, ACL, and traditional firewalls all live here.
Identity-driven: "allow Pods with role=web to reach Pods with role=db on tcp/5432." A label or a cryptographic ID is the subject.

In a world where Pod IPs churn every minute because of autoscaling and rescheduling, IP-based policy cannot keep up. Identity-driven exists precisely to dodge that problem.
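A small Python sketch of why the subject matters. Both rules are taken from the examples above; the Pod addresses and the function names are made up for illustration.

```python
import ipaddress


def ip_rule_allows(src_ip: str, dst_ip: str, port: int) -> bool:
    """IP-based: 'allow 10.0.1.0/24 to reach 10.0.2.5:443'."""
    return (
        ipaddress.ip_address(src_ip) in ipaddress.ip_network("10.0.1.0/24")
        and dst_ip == "10.0.2.5"
        and port == 443
    )


def label_rule_allows(src_labels: dict, dst_labels: dict, port: int) -> bool:
    """Identity-driven: 'allow role=web to reach role=db on tcp/5432'."""
    return (
        src_labels.get("role") == "web"
        and dst_labels.get("role") == "db"
        and port == 5432
    )


# The web Pod is rescheduled and its IP moves out of the original subnet.
before = ip_rule_allows("10.0.1.17", "10.0.2.5", 443)  # True
after = ip_rule_allows("10.0.9.42", "10.0.2.5", 443)   # False: the rule is stale
# The labels did not change, so the identity rule gives the same answer.
still_ok = label_rule_allows({"role": "web"}, {"role": "db"}, 5432)  # True
```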

L3/L4 control vs L7 control

How deep the firewall looks before it decides.

L3/L4: IP address and port/protocol. "Allow tcp/5432 from 10.0.1.0/24." VLAN, ACL, Security Group, nftables, classic NSX.
L7: HTTP method, URL path, gRPC method, Kafka topic, etc. "Allow only POST /api/payment, block DELETE." Service meshes (Istio / Envoy), Cilium's L7 policy, Cisco Secure Workload reach into this layer.

L7 control lets you cut off SQL injection or unauthorized admin-API calls one layer earlier.
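The difference in depth can be sketched in a few lines of Python. The allow list and function names are hypothetical, and the port check stands in for a full L3/L4 rule.

```python
# Hypothetical L7 allow list: (HTTP method, exact path).
L7_ALLOW = {("POST", "/api/payment")}


def l4_allows(port: int) -> bool:
    """L3/L4 view: the decision stops at the port number."""
    return port == 443


def l7_allows(port: int, method: str, path: str) -> bool:
    """L7 view: same port check, then method and path must also match."""
    return l4_allows(port) and (method.upper(), path) in L7_ALLOW
```

An L4-only filter passes both the POST and the DELETE, because they arrive on the same port; only the L7 check can tell them apart.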

SPIFFE / SPIRE and SVID

A spec (SPIFFE) for giving workloads a cryptographic ID instead of an IP, plus its reference implementation (SPIRE). A URI of the form spiffe://trust-domain/workload-path gets embedded into the URI SAN (Subject Alternative Name) field of an X.509 cert. That cert is an SVID (SPIFFE Verifiable Identity Document), specifically an X.509-SVID; the spec also defines a JWT form.

During an mTLS handshake, you read the URI SAN out of the peer's SVID and decide "the peer is spiffe://prod/cart, so allow." That is the foundation of identity-driven microsegmentation.
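As a rough Python sketch of that decision, assuming the URI SAN has already been extracted from the peer certificate by the TLS library. The allow list and function names are hypothetical, and the parsing is deliberately simpler than the full SPIFFE-ID validation rules.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(uri: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust domain, workload path)."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment")
    return parts.netloc, parts.path


# Hypothetical policy: only the cart workload in the prod trust domain may connect.
ALLOWED_PEERS = {("prod", "/cart")}


def authorize_peer(uri_san: str) -> bool:
    try:
        return parse_spiffe_id(uri_san) in ALLOWED_PEERS
    except ValueError:
        return False  # malformed or non-SPIFFE identities are denied
```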

eBPF

A Linux kernel mechanism for safely running small programs inside the kernel. Cilium uses it to do packet filtering, policy evaluation, and L7 observation in one pass, all in kernel space.

Application Dependency Map (ADM)

The core feature of agent-based products like Illumio. It aggregates flow telemetry from every host and automatically draws "which workload talks to which workload, on which port." If you start enforcement without first looking at this map, your policy will halt the business the moment it lands (more on this in section 5).

1. Why "macro" stopped being enough

A quick walk through the history. Microsegmentation did not pop out of nowhere; it is the endpoint of 20 years of "make the segments smaller."

1.1 The castle wall era (perimeter-only)

The classic 2000s network: a single wall between the LAN and the internet.

The problem is obvious. Once anyone steps inside the wall, the inside is flat. The moment an attacker owns one workstation, every internal DB, every printer, every dev server is theirs.

1.2 VLAN + firewall zones (macrosegmentation)

The next generation split the LAN into rough chunks with VLANs and per-zone firewalls.

A real step forward, but "inside the zone" is still flat. Once an attacker is on the web tier, the other web servers in the same VLAN are wide open. In an era where one service has fanned out into dozens or hundreds of workloads, this granularity does not cut it.

1.3 Microsegmentation (per-workload)

The endpoint: every single workload gets its own firewall.

Three things matter here:

The PEP sits right next to the workload. Not a physical firewall box. The host OS filter, the hypervisor's virtual NIC, a sidecar proxy, an eBPF program.
Default deny. "Anything not explicitly allowed is dropped" is the starting position.
Policy is written per application. "Web to Cart," using services or roles as subjects.

1.4 Macro vs micro

Item	Macro (VLAN/Zone)	Micro (Workload)
Granularity	Subnet / zone	Workload / Pod / process
Subject	IP, subnet	Service name, label, SPIFFE ID
Mainly guards	North-South	East-West
Policy lifetime	Long, pinned to IP	Dynamic, driven by ID
Resistance to lateral movement	Weak inside the zone	Cuts per workload
Number of firewalls	A handful to dozens	Hundreds to hundreds of thousands (logically)

The point is not "macro becomes unnecessary," it is "microsegmentation adds one more layer of finer defense inside the macro."

2. NotPetya and ransomware pressed the "decision button"

The thing that pushed microsegmentation from "idea" to "product category" was NotPetya in 2017. Once you understand how that incident unfolded, the sudden global appetite for investment makes sense.

2.1 What happened

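NotPetya hit in June 2017. It spread laterally over SMB using the EternalBlue exploit and reused credentials harvested with Mimikatz, so a single foothold could take down an entire flat network. Damage came to roughly $300M for Maersk alone, and over $10B globally (Forrester / Armis estimates). Merck, FedEx's TNT Express, Mondelez, and Reckitt got dragged in too.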

2.2 What would have happened with microsegmentation

A policy equivalent to a single line, "never permit SMB 445/tcp between workstations," would have shut down the main lateral path (EternalBlue over SMB). The secondary spread, which used credentials harvested with Mimikatz to take over the DC, could have been localized too if admin paths had been narrowed to "from jump host only." Forrester's post-mortem said it plainly: microsegmentation and fine-grained internal controls should be rolled out as part of a Zero Trust strategy.
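Expressed as a default-deny rule set, those two controls could look roughly like this in Python. The role names and the RDP port for the admin path are illustrative assumptions, not details from the incident reports.

```python
# Hypothetical explicit allows: (source role, destination role, tcp port).
# Everything else, including workstation-to-workstation SMB, is denied by default.
ALLOW = {
    ("workstation", "file-server", 445),        # SMB only toward file servers
    ("jump-host", "domain-controller", 3389),   # admin path from the jump host only
}


def is_allowed(src_role: str, dst_role: str, port: int) -> bool:
    return (src_role, dst_role, port) in ALLOW
```

Note that there is no explicit "deny SMB between workstations" entry; the absence of an allow is what blocks it.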

After that, alongside Zero Trust, microsegmentation became a board-level topic as "the precondition for lowering your ransomware insurance premium." Gartner forecasts that by 2026, 60% of organizations pursuing Zero Trust will use multiple forms of microsegmentation in combination (up from under 5% in 2023).

3. The four implementation styles

There are dozens of vendors, but architecturally they cluster into four families.

3.1 A. Hypervisor (VMware NSX)

The classic for VM-heavy shops. Every ESXi host has a Distributed Firewall (DFW) baked into the hypervisor kernel, doing stateful filtering right at the VM's virtual NIC (vNIC).

What stands out:

The app does not change. No agent inside the guest OS. The policy "moves with the VM" on vMotion (the session table travels too).
Near line-rate (close to the physical NIC's theoretical throughput). Because the DFW runs as a kernel module, the perf hit is minimal.
The weakness is VMware lock-in, plus the fact that you need a different stack the moment physical servers, containers, or cloud VMs enter the picture.

3.2 B. Agent (Illumio / Akamai Guardicore / Cisco Secure Workload)

Drop a lightweight agent into every host OS, and have it drive the OS-native firewall (Windows Filtering Platform, Linux nftables/iptables, macOS pf).

Illumio collects telemetry from its agents (VEN: Virtual Enforcement Node), auto-builds an Application Dependency Map (ADM), and uses that as the canvas for writing policy. Akamai Guardicore ships its own light kernel module, plus differentiators like deception (decoys) and an agentless mode powered by NVIDIA BlueField DPUs (Data Processing Units that run segmentation on the server's NIC). Cisco Secure Workload (formerly Tetration, renamed in 2020) traces back to slurping flows out of data center switch ASICs and learning behavior with ML.

What stands out:

Infrastructure-agnostic. The same abstraction (labels/tags) works across on-prem, bare metal, cloud, and containers.
Strong visualization via ADM. Real value shows up before policy: "let's see what is actually talking to what."
The weakness is "you have to run an agent on every host." OS version compatibility, updates, and support windows for legacy OSes (Windows Server 2003 and friends) become operational pain.

3.3 C. Cloud-native (Security Group / NSG / NetworkPolicy)

Take the mechanisms cloud providers and orchestrators already ship with, and lean on them hard for microsegmentation purposes.

The thing about AWS Security Groups is that you can put an SG where an IP would normally go. Writing source=sg-web effectively gives you role-based policy. Azure NSGs (via Application Security Groups) and GCP VPC firewall rules (via network tags or service accounts) offer a comparable trick. On the Kubernetes side, NetworkPolicy plays this role: write allow rules with podSelector and namespaceSelector, and the policy tracks Pod IP changes for you.

What stands out:

Effectively free to start. Cloud-standard, no extra agent.
The weakness is central policy management and org-wide consistency. Once SGs hit five digits, humans cannot manage them by hand. You need a higher layer like AWS Firewall Manager, OPA, or Kyverno. Also, no L7 control (you cannot permit by HTTP method).

3.4 D. Identity-driven (Cilium / Istio + SPIFFE)

The newest family. The subject of policy is not an IP, it is a workload identity: a label-derived ID (Cilium) or a cryptographic ID (SPIFFE).

With Cilium, Pod IPs can change, but the Cilium Identity (a numeric value) computed from labels stays stable. eBPF looks at the Identity attached to a packet to decide allow/deny, so "Cart Pod restarted and got a new IP" does not break policy.
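The idea of "a number computed from labels" can be illustrated with a toy Python sketch. To be clear, this is not Cilium's actual allocation mechanism; the hashing scheme, label names, and function names are assumptions chosen only to show that the identity depends on labels and not on the IP.

```python
import hashlib


def toy_identity(labels: dict[str, str]) -> int:
    """Derive a stable number from the label set (order-independent)."""
    canonical = ";".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:4], "big")


# Hypothetical policy keyed on identities rather than addresses.
WEB = toy_identity({"app": "web", "env": "prod"})
CART = toy_identity({"app": "cart", "env": "prod"})
ALLOWED_PAIRS = {(WEB, CART)}


def allows(src_labels: dict[str, str], dst_labels: dict[str, str]) -> bool:
    return (toy_identity(src_labels), toy_identity(dst_labels)) in ALLOWED_PAIRS
```

A restarted Cart Pod carries the same labels, so it maps to the same number and the same rule keeps matching, whatever IP it lands on.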

With Istio + SPIFFE, each sidecar (Envoy) receives an SVID. At mTLS time, the URI SAN is read and compared against principals in an AuthorizationPolicy (e.g. cluster.local/ns/prod/sa/cart).

What stands out:

Doesn't fall apart when IPs are ephemeral. Pod churn, autoscaling, serverless-style dynamic environments, all fine.
L7 control is on the table (HTTP method/path, gRPC methods, Kafka topics).
The weakness is the learning curve, and the need to run a CA / Identity layer underneath.

3.5 Comparison

Style	Example	Main PEP	Subject	Strength	Weakness
Hypervisor	VMware NSX	vNIC (DFW)	VM tag	Perf, follows VM moves	VMware only
Agent	Illumio, Akamai	OS firewall	Label	Cross-environment, ADM	Agent ops
Cloud-native	AWS SG, K8s NetworkPolicy	Cloud / CNI	SG / label	Cheap, standard	No L7, weak central mgmt
Identity-driven	Cilium, Istio	eBPF / sidecar	SPIFFE ID, label	Great for dynamic env, L7	Learning curve

Large orgs mix all of these. SG at the AWS edge, Cilium inside the Kubernetes layer, NSX for data center VMs, Illumio agents on the legacy Windows servers. The "60% using multiple forms" in Gartner's 2026 forecast is exactly this.

4. How large enterprises actually use it

Spec sheets only get you so far. Five real-world cases.

4.1 Google: BeyondCorp (identity-driven across the whole company)

In late 2009, Google got hit by a Chinese state-sponsored campaign known as Operation Aurora. Initial intrusion was a spear-phishing email plus an Internet Explorer zero-day (CVE-2010-0249): clicking the malicious link planted the Hydraq trojan. From that beachhead, attackers moved laterally across Google's flat internal network and walked out with intellectual property, source code included. Google went public in January 2010.

Google's answer was BeyondCorp. They flipped the premise 180 degrees:

Stop treating the internal LAN as a trusted zone. Throw out the VPN.
Don't authorize by "what network are you on," authorize every request based on user ID + device ID + posture.
Microsegmentation is the core feature that stops malware moving from a user toward an app.

The key here: the AP (Access Proxy) is the PEP sitting in front of every app, and apps are reachable only via the AP. Whether you're on the office LAN is irrelevant. This is identity-driven microsegmentation taken to its logical extreme.

A commercial version is sold externally as BeyondCorp Enterprise.

4.2 Maersk: rebuilding after NotPetya

Maersk's global ops were down for roughly 10 days after NotPetya, the time it took to rebuild 45,000 PCs and 4,000 servers (per Bleeping Computer reporting). Post-mortems (CSO Online, ComputerWeekly) point to SMB v1 and Windows domain admin paths running flat across the org, which let the malware spread sideways.

Specific details of the post-incident security rebuild are limited in public reporting, but Maersk's then-CISO repeatedly said in conference talks and interviews that "the same attack still works against plenty of other companies." That made the Maersk case the canonical reference point for when microsegmentation plus Zero Trust investment took off industry-wide.

4.3 A Global 250 bank: SWIFT compliance with Illumio

From Illumio's official case study. A Global 250 bank needed to comply with SWIFT CSP (Customer Security Programme), which required isolating every SWIFT-related server in a "per-host logical isolation zone."

The traditional path (physical firewalls):

Weeks of lead time per rule change.
Mismatch with DevOps speed.
Extra hardware cost.

How Illumio solved it:

Deploy the agent in observation mode across every SWIFT-related host.
Use the Application Dependency Map to visualize real traffic.
Generate policy from observed flows.
Gradually flip to enforcement mode.
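The "generate policy from observed flows" step can be sketched in Python as follows. The flow records, the label names, and the min_count threshold are all hypothetical; this is not Illumio's algorithm, just the general shape of the idea.

```python
from collections import Counter

# Hypothetical flow records from observation mode: (src label, dst label, port).
observed = [
    ("swift-app", "swift-db", 1521),
    ("swift-app", "swift-db", 1521),
    ("swift-gw", "swift-app", 8443),
    ("backup", "swift-db", 22),
]


def propose_rules(flows, min_count=2):
    """Turn repeatedly observed flows into candidate allow rules.

    Flows seen fewer than min_count times go to a human for review
    instead of being allowed (or denied) automatically.
    """
    counts = Counter(flows)
    rules = {flow for flow, n in counts.items() if n >= min_count}
    review = {flow for flow, n in counts.items() if n < min_count}
    return rules, review


rules, review = propose_rules(observed)
```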

Illumio's framing of the bank's stance: "you cannot write policy for traffic you do not understand, so start from the ADM." The bank is now extending Illumio into SDLC (Software Development Life Cycle: dev, test, staging environments) too, using it to stand up isolated test environments for remote vendors.

4.4 A major healthcare org: protecting 6,000 assets and medical IoT (Akamai Guardicore)

From Akamai's case study. Over 6,000 assets, plus bedside monitors and medical IoT, all sharing a flat network. Lateral movement toward patient data and payment data was the headline risk.

Why Guardicore:

It's a software overlay, so no infrastructure changes were needed.
The same policy language works across on-prem and AWS.
Even devices that cannot host an agent (medical IoT, etc.) can be covered by flow observation.

4.5 A global manufacturer: SMB control across 2,000 workstations (Akamai Guardicore)

Another Akamai case. A manufacturer with mixed "office + factory" sites worldwide, slicing up a flat workstation network in stages.

The first phase rolled Guardicore out to 2,000 workstations. The IT security team's quote: "network visibility improved by 1,000%." They cut off pass-the-hash (the classic move of replaying a stolen Windows password hash to authenticate remotely) and ransomware spreading workstation-to-workstation over SMB by making SMB between workstations default-deny.

5. Implementation pitfalls (where projects die on Day 1 / Day 2)

Picking the right product does not save you. The places where projects get stuck are pretty consistent.

5.1 "Policy without visibility" always breaks something

A lot of teams get cocky after a PoC and flip enforcement in production. The blast radius is huge, and the root cause is always the same: you put in default-deny without first understanding what is talking to what. Skip the sequence of 2 to 4 weeks (months, for business systems) of flow collection in observation mode, then policy generation, then staged rollout, and you will almost certainly take down a production system and watch the project get frozen.

That is the same reason Illumio insists "the ADM is the protagonist; policy is derived from the ADM," and the reason Cisco Secure Workload makes a song and dance about ML-based behavior learning. The rule is Discovery, then Modeling, then Enforcement, in that order.

5.2 Label design decides your fate

With both agent and identity-driven styles, policy is written as set algebra over labels. Dirty labels produce exponentially dirty policy.

A practical axis set:

environment: prod / staging / dev
application: payments / cart / catalog
role / tier: web / app / db / cache
data classification: pii / pci / public
region (optional): us-east-1 / ap-northeast-1

Lock in a plan to attach at least these four axes to every asset before enforcement starts. Re-labeling later is, in practice, impossible.
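Here is what "set algebra over labels" looks like in a minimal Python sketch, including a check for assets that are missing a required axis. The asset inventory and key names are hypothetical.

```python
REQUIRED_AXES = {"environment", "application", "role", "data"}

# Hypothetical asset inventory: name -> labels.
assets = {
    "pay-web-1": {"environment": "prod", "application": "payments", "role": "web", "data": "pci"},
    "pay-db-1": {"environment": "prod", "application": "payments", "role": "db", "data": "pci"},
    "cat-web-1": {"environment": "dev", "application": "catalog", "role": "web"},  # no "data" label
}


def select(selector: dict) -> set:
    """Names of assets whose labels contain every key=value pair in selector."""
    return {name for name, labels in assets.items() if selector.items() <= labels.items()}


def unlabeled() -> set:
    """Assets missing at least one required axis; fix these before enforcement."""
    return {name for name, labels in assets.items() if not REQUIRED_AXES.issubset(labels)}


# Policy subjects are intersections of label selections.
prod_web = select({"environment": "prod"}) & select({"role": "web"})
```

An asset with a missing label silently drops out of every selection that uses that axis, which is why the unlabeled() check belongs before any enforcement.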

5.3 The last 10% is the hardest

Discovery explains most of the traffic. The remaining 10%:

Yearly batch jobs that did not run during the observation window.
Legacy systems that were supposed to be retired but are still running.
Traffic that only fires during a DR cutover.
Stray scripts a developer set up on their own.

Deciding "the observation window never saw it, so deny" leads to the yearly batch tipping over a production system. The opposite, "allow everything just in case," makes the whole exercise pointless.

In practice:

Run in alert-only mode for a while (let traffic through but log it).
Stay in observation mode for one full business cycle (e.g. 13 months).
Triage every unsanctioned flow that surfaces in that window.
Then, finally, flip to enforcement.
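The difference between alert-only and enforcement is small in code terms, which a Python sketch makes visible. The allow list, the mode names, and the triage queue are hypothetical.

```python
# Hypothetical allow list derived from discovery.
ALLOW = {("web", "cart", 8080)}


def handle_flow(flow, mode, triage_queue):
    """mode is 'alert' (let through, but record) or 'enforce' (drop)."""
    if flow in ALLOW:
        return "FORWARD"
    triage_queue.append(flow)  # every unsanctioned flow is recorded for triage
    return "DROP" if mode == "enforce" else "FORWARD"


queue = []
yearly_batch = ("batch", "db", 5432)
alert_result = handle_flow(yearly_batch, "alert", queue)      # passes, but is recorded
enforce_result = handle_flow(yearly_batch, "enforce", queue)  # now it is dropped
```

In alert mode the yearly batch keeps working while its flow lands in the queue, giving someone the chance to sanction it before the switch to enforce.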

Having an organization and management that can accept that rhythm matters more than the technology choice.

5.4 Don't try to do everything at once

A project that tries to segment the entire organization in one swing almost always fails. Elisity's and Gartner's guides list "tried to do everything at once and stalled" as the most common failure mode.

What actually works is roughly four phases:

Phase 1: Crown jewels (SWIFT, payment DB, KMS, CA, etc.). Small blast radius, but very expensive if down.
Phase 2: Production app tiers (Web / App / DB).
Phase 3: Endpoint and office networks.
Phase 4: OT and legacy systems (the hardest).

Start with the Crown Jewels, where business impact is biggest and flows are most knowable, and expand outward. Each phase is 3 to 6 months. Be ready for 18 to 36 months total.

Conclusion

If you only take three things away:

Microsegmentation is about stopping lateral movement. Look at East-West (inside), not at the North-South boundary.
Move from IP-based to ID-based. Stop extending VLAN/ACL; redesign around labels, SPIFFE IDs, IAM roles.
Visibility before enforcement. Reverse Discovery -> Modeling -> Enforcement and the project gets shelved.

AWS IAM Roles Anywhere Hands-On

kt — Tue, 26 May 2026 15:22:04 +0000

Introduction

You'll always have cases where you want to hit AWS from a home laptop or an on-prem server.

For a long time, the only way to give credentials to code running "outside AWS" was "stick an IAM User's long-lived key (AKIA...) into an env var or config file."

This sucks for a bunch of reasons.

If the long-lived key leaks, attackers can use it forever
Rotation is annoying enough that you end up using the same key for two years
The "1 server = 1 IAM User" model breaks down (User limit, audit nightmare)
CloudTrail can't tell you which physical machine made the call

IAM Roles Anywhere is the 2022 feature that fixes this. By using an X.509 certificate as your identity to AWS, you can pull temporary credentials without any IAM User or long-lived key in the picture.

This article walks through that setup entirely on your Mac/Linux box. No real CA involved. We make one self-signed Root CA and issue one end-entity certificate. We register only the Root CA's public certificate with AWS.

End state:

Your laptop has a certificate + private key acting as your AWS identity
Zero long-lived keys
Temporary credentials (ASIA...) come back from AWS
You hit S3 with them

Cost: $0. Roles Anywhere itself has no extra charge (per AWS docs). Certificates are generated locally, and creating Trust Anchor / Profile / Role is free.

Total time: about 60 minutes. Each command has a one-line explanation, so this works even if X.509 / PKI is new to you.

Prerequisites

One AWS account
AWS CLI (v2)
A Mac or Linux terminal (openssl is preinstalled)
Windows users: run this inside WSL2
Theory background is in AWS IAM Roles Anywhere Deep Dive

If you've never touched X.509 / PKI, you can still follow along. Term cheatsheet first.

CA (Certificate Authority): the authority that issues certificates. Today, you are the CA
Root CA: the topmost CA. Its private key signs end-entity certificates
End-entity certificate: the actual certificate you use. Signed by the Root CA's private key
CN (Common Name): the "name" field of the certificate. Roles Anywhere Trust Policy can match on it

If you know "sign with private key, verify with public key," you have enough background.

The whole flow

Plan in one diagram.

Build the PKI material by hand (Steps 1-2), connect it to AWS (Steps 3-5), then make actual calls (Steps 6-8).

Step 0: Environment setup

0-1. Check the tools

openssl version
# OpenSSL 3.x.x (preinstalled on Linux; the macOS built-in binary reports LibreSSL instead)

aws --version
# aws-cli/2.x.x

If you don't have them:

macOS: brew install awscli (OpenSSL ships with the OS)
Linux: install via your package manager

0-2. Create a working directory

mkdir -p ~/roles-anywhere-handson
cd ~/roles-anywhere-handson

Every command from here runs inside this directory.

0-3. Put Account ID in an env var

export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export REGION=us-east-1
echo $ACCOUNT_ID

Step 1: Create a self-signed Root CA

You're the CA. We won't use a real one (DigiCert, Let's Encrypt, etc.). We'll generate a self-signed Root CA with openssl.

1-1. CA private key

openssl genrsa -out ca.key 4096

This creates ca.key. Never let this file leave your box. It is the CA itself: whoever holds it can issue certificates that your Trust Anchor will accept.

1-2. CA self-signed certificate

openssl req -x509 -new -nodes -key ca.key -sha256 -days 365 \
  -out ca.crt \
  -subj "/CN=handson-root-ca/O=Hands-On" \
  -addext "basicConstraints=critical,CA:TRUE" \
  -addext "keyUsage=critical,keyCertSign,cRLSign"

Gotcha: basicConstraints=CA:TRUE is required. Without it, registering this cert as a Trust Anchor fails with Incorrect basic constraints for CA certificate. keyUsage similarly needs CA values (keyCertSign + cRLSign).

ca.crt is your public certificate. This is what you upload to AWS. Inspect it:

openssl x509 -in ca.crt -text -noout | head -20

Subject: CN = handson-root-ca, O = Hands-On is the name you set
Issuer: is the same value (you signed yourself, so Subject = Issuer, which is what self-signed means)
Not Before / Not After: give you a 365-day validity from today

That's your Root CA.

1-3. Why self-signed is fine here

For a browser TLS cert, "self-signed = don't trust" is the rule. But Roles Anywhere works on a model where AWS only trusts the Root CAs you registered (that's what a Trust Anchor is). Registering only your self-signed Root CA with AWS closes off a private trust loop. Only certificates derived from your CA work with AWS.

In production you'd use an internal PKI or AWS Private CA. For learning purposes, self-signed is enough.

Step 2: Issue an end-entity certificate

Now that you have a CA, use it to issue the end-entity certificate (the actual one you'll use).

2-1. End-entity private key

openssl genrsa -out client.key 2048

client.key is the private key that lives on the machine making Roles Anywhere calls (your laptop today). This must also never leave the box.

2-2. Generate a CSR (Certificate Signing Request)

A CSR is the form you send to the CA that says "I am this name, please sign me."

openssl req -new -key client.key \
  -out client.csr \
  -subj "/CN=handson-client-01/O=Hands-On"

CN is handson-client-01. The Roles Anywhere Trust Policy can pin to "only certs with this CN," so this value matters.

2-3. Sign with the CA

The end-entity certificate needs two extensions Roles Anywhere requires. Pass them through an extension file using openssl x509 -extfile.

cat > client.ext <<'EOF'
basicConstraints = CA:FALSE
keyUsage = critical, digitalSignature
extendedKeyUsage = clientAuth
EOF

openssl x509 -req -in client.csr \
  -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out client.crt -days 90 -sha256 \
  -extfile client.ext

Gotcha: without keyUsage = digitalSignature and extendedKeyUsage = clientAuth, the Roles Anywhere CreateSession call fails with Untrusted certificate. Insufficient certificate. Both are required.

client.crt is now ready. 90-day validity.

Verify:

openssl x509 -in client.crt -text -noout | head -15

Subject: CN = handson-client-01 is the end-entity's name
Issuer: CN = handson-root-ca proves the Root CA signed it

PKI material is now ready.

ca.key    : Root CA private key (secret)
ca.crt    : Root CA public certificate (upload to AWS)
client.key: End-entity private key (lives on the client machine)
client.crt: End-entity public certificate (lives on the client machine)
client.csr: CSR (no longer needed, you can delete)

Step 3: Register the Trust Anchor with AWS

Tell Roles Anywhere "I trust this Root CA". That's a Trust Anchor.

3-1. Create one from the AWS Console

AWS Console → IAM → Roles Anywhere (bottom of the left nav) → Manage → Create a trust anchor.

Trust anchor name: handson-trust-anchor
Source: pick External certificate bundle (use the other option if you're using AWS Private CA)
Certificate bundle: paste the contents of ca.crt. Print it with:

cat ca.crt
# Copy everything from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----

Click Create trust anchor

3-2. Or do it via CLI (optional)

If you hate the GUI:

# Build the --source argument as JSON so the PEM's newlines are escaped correctly
SOURCE_JSON=$(jq -Rs '{sourceType: "CERTIFICATE_BUNDLE", sourceData: {x509CertificateData: .}}' ca.crt)

aws rolesanywhere create-trust-anchor \
  --name handson-trust-anchor \
  --source "${SOURCE_JSON}" \
  --enabled \
  --region "${REGION}"

The shell escaping is gnarly. Doing it from the Console is easier.

3-3. Save the Trust Anchor ARN

Copy the ARN shown on the post-creation page into an env var.

export TA_ARN="arn:aws:rolesanywhere:us-east-1:${ACCOUNT_ID}:trust-anchor/abc123..."

Step 4: Create the IAM Role

This is the Role that Roles Anywhere will Assume on your behalf. In the Trust Policy, set rolesanywhere.amazonaws.com as the Principal.

4-1. Trust Policy

cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "rolesanywhere.amazonaws.com" },
    "Action": [
      "sts:AssumeRole",
      "sts:TagSession",
      "sts:SetSourceIdentity"
    ],
    "Condition": {
      "ArnEquals": {
        "aws:SourceArn": "${TA_ARN}"
      }
    }
  }]
}
EOF

Key points:

Principal: { "Service": "rolesanywhere.amazonaws.com" }: not an IAM User or Role. The Principal is the Roles Anywhere AWS service
Action: sts:AssumeRole + sts:TagSession + sts:SetSourceIdentity: Roles Anywhere doesn't just AssumeRole. It also injects Session Tags and Source Identity derived from the cert
Condition: aws:SourceArn = Trust Anchor ARN: lock down the Assume to only happen through this specific Trust Anchor (Confused Deputy mitigation)
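If you also want to pin the Role to one certificate CN, the Trust Policy can carry a second condition. The Python sketch below builds such a policy as a dict. The condition key aws:PrincipalTag/x509Subject/CN is, to my understanding, how Roles Anywhere exposes the certificate subject, so verify it against the current AWS docs; the trust anchor ARN here is a placeholder and build_trust_policy is a hypothetical helper name.

```python
import json


def build_trust_policy(trust_anchor_arn: str, common_name: str) -> dict:
    """Trust Policy limited to one Trust Anchor and one certificate CN."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "rolesanywhere.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession", "sts:SetSourceIdentity"],
            "Condition": {
                # Same Confused Deputy guard as the policy above.
                "ArnEquals": {"aws:SourceArn": trust_anchor_arn},
                # Assumed condition key: only certs whose Subject CN matches.
                "StringEquals": {"aws:PrincipalTag/x509Subject/CN": common_name},
            },
        }],
    }


policy = build_trust_policy(
    "arn:aws:rolesanywhere:us-east-1:123456789012:trust-anchor/EXAMPLE",  # placeholder
    "handson-client-01",
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and passed to aws iam create-role in place of trust-policy.json.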

4-2. Create the Role and attach an Identity Policy

aws iam create-role \
  --role-name handson-rolesanywhere-role \
  --assume-role-policy-document file://trust-policy.json

cat > identity-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:ListAllMyBuckets"],
    "Resource": "*"
  }]
}
EOF

aws iam put-role-policy \
  --role-name handson-rolesanywhere-role \
  --policy-name s3-list \
  --policy-document file://identity-policy.json

The Identity Policy is minimal. s3:ListAllMyBuckets only. Anything that proves "the credentials work" is fine.

4-3. Save the Role ARN

export ROLE_ARN=$(aws iam get-role --role-name handson-rolesanywhere-role --query 'Role.Arn' --output text)
echo $ROLE_ARN

Step 5: Create the Roles Anywhere Profile

The Profile is the link between Trust Anchor and Role. It says "callers coming through this Trust Anchor can Assume this Role."

5-1. Create from the Console

IAM → Roles Anywhere → Profiles → Create a profile.

Profile name: handson-profile
Roles: select handson-rolesanywhere-role
Session policy (optional): you can cap permissions here, leave it empty for now
Create profile

5-2. Save the Profile ARN

export PROFILE_ARN="arn:aws:rolesanywhere:us-east-1:${ACCOUNT_ID}:profile/def456..."

The AWS side is done.

5-3. The trust structure you just built

Step 6: Install aws_signing_helper

The Roles Anywhere API doesn't use plain SigV4. It requires certificate-based signing. Writing this by hand every time is painful, so AWS ships an official helper binary.

6-1. Download

As of May 2026, the latest version is 1.8.3. URLs are per platform.

macOS (Intel):

curl -O https://rolesanywhere.amazonaws.com/releases/1.8.3/X86_64/MacOS/Sonoma/aws_signing_helper
chmod +x aws_signing_helper
./aws_signing_helper version

macOS (Apple Silicon / M1, M2, M3):

curl -O https://rolesanywhere.amazonaws.com/releases/1.8.3/Aarch64/MacOS/Sonoma/aws_signing_helper
chmod +x aws_signing_helper
./aws_signing_helper version

Linux (x86_64):

curl -O https://rolesanywhere.amazonaws.com/releases/1.8.3/X86_64/Linux/Amzn2023/aws_signing_helper
chmod +x aws_signing_helper
./aws_signing_helper version

Linux (ARM64 / Raspberry Pi etc.):

curl -O https://rolesanywhere.amazonaws.com/releases/1.8.3/Aarch64/Linux/Amzn2023/aws_signing_helper
chmod +x aws_signing_helper
./aws_signing_helper version

Check for newer releases on GitHub Releases. Just swap the version (1.8.3) in the URL.

6-2. What the helper does

The helper takes three things (Trust Anchor ARN, Profile ARN, Role ARN) as arguments, sends an HTTPS request signed with client.crt + client.key to Roles Anywhere, and returns ASIA... temporary credentials.

Common mix-up: Roles Anywhere is not mTLS. The TLS handshake is server-only. The client cert is never exchanged at the TLS layer (AWS does not send a CertificateRequest). Authentication happens at the application layer. The scheme extends SigV4 and is called AWS4-X509-RSA-SHA256 (for RSA keys) or AWS4-X509-ECDSA-SHA256 (for EC keys).

The request looks like this:

POST https://rolesanywhere.<region>.amazonaws.com/sessions
Authorization: AWS4-X509-RSA-SHA256 Credential=..., Signature=...
X-Amz-X509: <base64 of client.crt>
Body: <JSON for CreateSession>
   ↑ Signature is the RSA signature over the Canonical Request, signed with client.key

Plain SigV4 signs with "HMAC using Secret Access Key." Roles Anywhere signs with "RSA using private key, with the certificate carried in X-Amz-X509." Think of it as SigV4 where the symmetric HMAC has been swapped for asymmetric RSA-Sign.

6-3. AWS verifies in two stages

"Does the Trust Anchor verify the body signature?" is a natural question. The answer is no. Verification splits in two.

Chain validation (this is where the Trust Anchor matters): is the client.crt in X-Amz-X509 derived from the Root CA registered as Trust Anchor? Are validity, KeyUsage, and EKU sane? If a CRL is configured, is the cert non-revoked? After this passes, AWS treats client.crt as trusted
Signature verification: extract the public key from the trusted client.crt, and verify the RSA signature on the request (computed over the Canonical Request) with it. After this passes, AWS knows the holder of client.key sent the request

So the public key in the Trust Anchor's ca.crt does not directly verify the request signature. The Trust Anchor decides "do we trust client.crt?" and the actual request signature is verified by the public key inside client.crt.
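The two stages are easier to see in a toy model. The Python sketch below uses textbook RSA with tiny primes and a simplified certificate structure, so it is an illustration of the control flow only: it is not secure, it is not real X.509, and it is not the actual AWS4-X509-RSA-SHA256 algorithm. All names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyKey:
    n: int  # modulus (public)
    e: int  # public exponent
    d: int  # private exponent (secret)


@dataclass(frozen=True)
class ToyCert:
    subject: str
    n: int             # subject's public modulus
    e: int             # subject's public exponent
    ca_signature: int  # CA's signature over the three fields above


def _digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(data: bytes, key: ToyKey) -> int:
    return pow(_digest(data, key.n), key.d, key.n)


def verify(data: bytes, sig: int, n: int, e: int) -> bool:
    return pow(sig, e, n) == _digest(data, n)


def _tbs(subject: str, n: int, e: int) -> bytes:
    return f"{subject}|{n}|{e}".encode()


def issue(subject: str, subject_key: ToyKey, ca_key: ToyKey) -> ToyCert:
    sig = sign(_tbs(subject, subject_key.n, subject_key.e), ca_key)
    return ToyCert(subject, subject_key.n, subject_key.e, sig)


def accept_request(body: bytes, request_sig: int, cert: ToyCert,
                   anchor_n: int, anchor_e: int) -> bool:
    # Stage 1: chain validation with the trust anchor's public key.
    if not verify(_tbs(cert.subject, cert.n, cert.e), cert.ca_signature, anchor_n, anchor_e):
        return False
    # Stage 2: request signature checked with the public key inside the cert.
    return verify(body, request_sig, cert.n, cert.e)


CA = ToyKey(n=3233, e=17, d=2753)     # toy primes p=61, q=53
CLIENT = ToyKey(n=3127, e=3, d=2011)  # toy primes p=53, q=59
cert = issue("CN=handson-client-01", CLIENT, CA)
body = b'{"roleArn": "..."}'
sig = sign(body, CLIENT)
```

Notice that accept_request never touches CLIENT.d or CA.d: the verifier works from public material only, which is the property the next paragraphs rely on.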

After both pass, the Roles Anywhere service internally performs the equivalent of sts:AssumeRole and returns ASIA... (that's why the Trust Policy's Principal is rolesanywhere.amazonaws.com). Clients never hit STS directly.

The payoff: AWS never holds a copy of your private key. A regular long-lived key is a shared secret because Secret Access Key exists on both AWS and your side. Roles Anywhere is asymmetric, so AWS only ever sees the public key (= certificate). The only thing whose leak hurts you is your own client.key.

Step 7: Pull temporary credentials and call AWS

7-1. Get credentials via credential-process

./aws_signing_helper credential-process \
  --certificate ${HOME}/roles-anywhere-handson/client.crt \
  --private-key ${HOME}/roles-anywhere-handson/client.key \
  --trust-anchor-arn ${TA_ARN} \
  --profile-arn ${PROFILE_ARN} \
  --role-arn ${ROLE_ARN}

On success, you get JSON back:

{
  "Version": 1,
  "AccessKeyId": "ASIA...",
  "SecretAccessKey": "...",
  "SessionToken": "...(long)",
  "Expiration": "2026-05-24T13:00:00Z"
}

Temporary credentials starting with ASIA... are in your hands. Zero long-lived keys involved. Cert + private key alone got AWS to hand back temporary credentials. That's the core of Roles Anywhere.
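If you want to sanity-check that output from a script, a minimal Python sketch could look like this. It assumes the JSON shape and the timestamp format shown above; parse_credentials and seconds_left are hypothetical helper names.

```python
import json
from datetime import datetime, timezone


def parse_credentials(raw: str) -> dict:
    """Parse credential-process output and reject long-lived keys."""
    creds = json.loads(raw)
    if not creds["AccessKeyId"].startswith("ASIA"):
        raise ValueError("expected temporary credentials (ASIA...)")
    # Assumes the 'YYYY-MM-DDTHH:MM:SSZ' form shown in the sample output.
    creds["ExpirationParsed"] = datetime.strptime(
        creds["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return creds


def seconds_left(creds: dict, now: datetime) -> float:
    return (creds["ExpirationParsed"] - now).total_seconds()


sample = json.dumps({
    "Version": 1,
    "AccessKeyId": "ASIAEXAMPLE",      # placeholder value
    "SecretAccessKey": "placeholder",
    "SessionToken": "placeholder",
    "Expiration": "2026-05-24T13:00:00Z",
})
creds = parse_credentials(sample)
```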

7-2. Wire it into the AWS CLI

Running aws_signing_helper credential-process by hand every time is tedious. Drop a credential_process config into ~/.aws/config.

cat >> ~/.aws/config <<EOF

[profile rolesanywhere-handson]
region = ${REGION}
credential_process = ${HOME}/roles-anywhere-handson/aws_signing_helper credential-process --certificate ${HOME}/roles-anywhere-handson/client.crt --private-key ${HOME}/roles-anywhere-handson/client.key --trust-anchor-arn ${TA_ARN} --profile-arn ${PROFILE_ARN} --role-arn ${ROLE_ARN}
EOF

Now the aws CLI invokes the helper automatically to fetch credentials.

aws s3 ls --profile rolesanywhere-handson
# bucket list shows up

Run aws sts get-caller-identity --profile rolesanywhere-handson:

{
  "UserId": "AROA...:2a9902119721ab7b5c19a878c4c66af705895b42",
  "Account": "123456789012",
  "Arn": "arn:aws:sts::.../assumed-role/handson-rolesanywhere-role/2a9902119721ab7b5c19a878c4c66af705895b42"
}

The long hex string at the tail of UserId and Arn is the end-entity certificate's serial number. Not the CN (this trips people up). Roles Anywhere uses the certificate serial number (big-endian hex, lowercase) as the session name automatically. To confirm, compare with the lowercased output of openssl x509 -in client.crt -noout -serial.
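A minimal Python sketch of that comparison, assuming you paste in the `serial=...` line printed by openssl and the Arn returned by get-caller-identity. The function names are illustrative and not part of any AWS tool.

```python
def serial_from_openssl(line: str) -> str:
    """Convert `serial=2A99...` (openssl x509 -noout -serial) to lowercase hex."""
    _, _, value = line.strip().partition("=")
    return value.lower()


def session_name_from_arn(arn: str) -> str:
    """Return the last path segment of an assumed-role ARN (the session name)."""
    return arn.rsplit("/", 1)[-1]


def session_matches_cert(openssl_line: str, arn: str) -> bool:
    """True when the session name equals the certificate serial number."""
    return serial_from_openssl(openssl_line) == session_name_from_arn(arn)
```

If `session_matches_cert` returns False, the session was most likely minted with a different certificate than the one you are inspecting.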

The CN itself lands in sourceIdentity (visible in CloudTrail in the next section). Remember it as "session name = serial, sourceIdentity = CN" on two axes.

7-3. Verify in CloudTrail

Console → CloudTrail → Event history → Lookup attributes:

Event name: AssumeRole

A Roles Anywhere AssumeRole event shows up.

{
  "eventSource": "rolesanywhere.amazonaws.com",
  "userIdentity": {
    "type": "AWSService",
    "invokedBy": "rolesanywhere.amazonaws.com"
  },
  "requestParameters": {
    "profileArn": "...",
    "roleArn": "...",
    "trustAnchorArn": "...",
    "cert": "...(contents of end-entity cert)"
  },
  "responseElements": {
    "credentialSet": [{
      "sourceIdentity": "CN=handson-client-01"
    }]
  }
}

sourceIdentity automatically carries the certificate's CN. Even without ExternalId or explicit Source Identity config, Roles Anywhere stamps the CN as the audit trail for "who called this."

Downstream CloudTrail events (the S3 ls call, etc.) inherit sessionContext.sourceIdentity = CN=handson-client-01. In production, aggregating on this gives you "which physical server (= which certificate) made the call" in one query.

Other Subject fields (O, OU, etc.) don't land in sourceIdentity. They land in aws:PrincipalTag/x509Subject/<attribute> injected into the Session (used in Step 8).

Step 8: Restrict Trust Policy by CN

Right now, any end-entity certificate derived from the Root CA can Assume the Role. In production you usually want "only certificates with this specific CN."

8-1. Add a CN match condition

Add an aws:PrincipalTag/x509Subject/CN condition to the Trust Policy.

cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "rolesanywhere.amazonaws.com" },
    "Action": [
      "sts:AssumeRole",
      "sts:TagSession",
      "sts:SetSourceIdentity"
    ],
    "Condition": {
      "ArnEquals": {
        "aws:SourceArn": "${TA_ARN}"
      },
      "StringEquals": {
        "aws:PrincipalTag/x509Subject/CN": "handson-client-01"
      }
    }
  }]
}
EOF

aws iam update-assume-role-policy \
  --role-name handson-rolesanywhere-role \
  --policy-document file://trust-policy.json

When Roles Anywhere parses the cert, it auto-injects each Subject attribute as a Session Tag. You can use x509Subject/CN, x509Subject/O, x509Issuer/CN, etc.
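To make the matching concrete, here is a deliberately simplified Python model of what the condition checks. It assumes a comma-separated Subject string with no escaped commas, and both function names are hypothetical. Real IAM evaluation has many more rules than this.

```python
def subject_to_session_tags(subject: str) -> dict:
    """Map 'CN=a,O=b' to {'x509Subject/CN': 'a', 'x509Subject/O': 'b'}."""
    tags = {}
    for part in subject.split(","):
        key, _, value = part.strip().partition("=")
        if key and value:
            tags[f"x509Subject/{key}"] = value
    return tags


def string_equals_principal_tags(condition: dict, tags: dict) -> bool:
    """Simplified StringEquals: every aws:PrincipalTag/<k> must exist and be equal."""
    prefix = "aws:PrincipalTag/"
    return all(
        key.startswith(prefix) and tags.get(key[len(prefix):]) == expected
        for key, expected in condition.items()
    )
```

Feeding it the Subject of client.crt and the condition block from the Trust Policy shows why only the matching CN passes.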

8-2. Verify behavior

Calling with the cert whose CN is handson-client-01 works.

aws sts get-caller-identity --profile rolesanywhere-handson
# success

Calling with a cert that has a different CN (build client2.key / client2.crt and swap them in) returns AccessDenied.

# For comparison, build a cert with a different CN
openssl genrsa -out client2.key 2048
openssl req -new -key client2.key -out client2.csr -subj "/CN=other-client/O=Hands-On"
openssl x509 -req -in client2.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out client2.crt -days 90 -sha256

./aws_signing_helper credential-process \
  --certificate ${HOME}/roles-anywhere-handson/client2.crt \
  --private-key ${HOME}/roles-anywhere-handson/client2.key \
  --trust-anchor-arn ${TA_ARN} \
  --profile-arn ${PROFILE_ARN} \
  --role-arn ${ROLE_ARN}

# AccessDenied: the cert passes the Trust Anchor check (issued by the same CA)
# but fails the Role Trust Policy condition (CN=handson-client-01)

So "trust one CA, split Roles per CN". Manage 100 servers with one CA but keep one Role per CN. That's how this scales in production.

Step 9: Cleanup

Delete things at the end. Order matters because of dependencies (Profile → Trust Anchor → Role).

9-1. Delete AWS resources

# Delete Profile (from Console or via CLI)
PROFILE_ID=$(echo ${PROFILE_ARN} | awk -F/ '{print $NF}')
aws rolesanywhere delete-profile --profile-id ${PROFILE_ID} --region ${REGION}

# Delete Trust Anchor
TA_ID=$(echo ${TA_ARN} | awk -F/ '{print $NF}')
aws rolesanywhere delete-trust-anchor --trust-anchor-id ${TA_ID} --region ${REGION}

# Delete the Role's inline policy, then delete the Role
aws iam delete-role-policy --role-name handson-rolesanywhere-role --policy-name s3-list
aws iam delete-role --role-name handson-rolesanywhere-role

9-2. Delete local certificates

rm -rf ~/roles-anywhere-handson

9-3. Edit ~/.aws/config

# Open in an editor and remove the [profile rolesanywhere-handson] section
vim ~/.aws/config

Everything is gone. Zero bill.

Summary

Built the full path for calling AWS from "outside AWS" without long-lived keys, by hand.

Build one self-signed Root CA (2 openssl commands)
Issue an end-entity certificate (3 openssl commands)
Register the CA cert as a Trust Anchor with AWS
In the IAM Role's Trust Policy, trust rolesanywhere.amazonaws.com
Bind Trust Anchor and Role via a Profile
Drop aws_signing_helper locally and wire it into credential_process
Per-CN role split inside one CA via aws:PrincipalTag/x509Subject/CN in the Trust Policy

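End state: zero long-lived keys (AKIA...) on your laptop. ASIA... temporary credentials come out of AWS. Revoke the certificate (through a CRL imported into Roles Anywhere) and no new sessions are issued for it. Credentials already handed out still run until their Expiration. CloudTrail automatically stamps the CN into sourceIdentity.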

What changes when this scales to production:

Switch the Root CA to AWS Private CA or your internal PKI: with a self-signed CA, the CA private key's safety is now the whole trust loop's safety
Automate cert distribution: every new CN needs a freshly issued end-entity cert. Use a pipeline (Smallstep step-ca, HashiCorp Vault PKI)
Operate a CRL: leaked certs need to die instantly

Roles Anywhere itself is a small API + official helper, but the operational hard part is PKI lifecycle.

What you might explore next:

Combine EKS Service Accounts with Roles Anywhere as an IRSA-free path
Connect GitHub self-hosted runners to AWS via Roles Anywhere (GitHub-hosted runners are better with OIDC)
Rotate end-entity certs every 24 hours using Smallstep

AWS STS Deep Dive

kt — Sun, 24 May 2026 11:34:59 +0000

Introduction

In the previous IAM article, I argued you should drop long-lived IAM User keys (AKIA...) and lean on Role + STS. Back then I described STS as "the temporary credential issuer" in one line.

Then I kept tripping over AssumeRole edge cases in real work.

"What is the actual difference between AssumeRole, AssumeRoleWithSAML, and AssumeRoleWithWebIdentity?"
"A SaaS vendor told me to pass ExternalId. What is it for?"
"CloudTrail keeps showing assumed-role/XXX/session-yyy. Where does the session-yyy part come from?"
"I chained Role A to Role B and suddenly the session expires after 1 hour. Why?"
"Where does AssumeRoot from 2024 re:Invent fit?"

STS is not a single API. It is a small universe: 6 separate issuance APIs plus Source Identity, External ID, Session Tag, and Session Policy. This article opens all of it. Treat it as the sequel to the IAM piece.

Outline.

What STS actually does (including Global vs Regional)
The 3 fields of a temporary credential and how they ride on SigV4
The 6 STS APIs as a decision tree
Trust Policy vs Identity Policy (one more careful pass)
External ID: stopping the Confused Deputy
Source Identity: keeping the original human in CloudTrail
Session Tag and Transitive Tag: the engine behind ABAC
Session Policy: narrowing through AssumeRole arguments
Role Chaining: the 1-hour wall
DurationSeconds: 15 minutes to 12 hours
Wiring GitHub Actions OIDC (a quick look)
How it shows up in CloudTrail
AssumeRoot (2024): Centralized Root Access
Do this / avoid that

Foundations (IAM principals, policy evaluation order, SigV4) are covered in the previous article. Read that first if you skipped it.

1. What STS Does

STS (Security Token Service) is the AWS service that issues temporary credentials. This article covers 8 of its APIs (6 for issuance, 2 helpers).

Global and Regional both exist

The STS endpoint comes in 2 shapes.

Form	hostname	Physical location
Global	`sts.amazonaws.com`	Hosted only in us-east-1
Regional	`sts.<region>.amazonaws.com` (e.g. `sts.ap-northeast-1.amazonaws.com`)	Independently hosted in each Region

IAM itself (Users, Roles, Policy definitions) is Global, but the issuance act is regionalized. This is the central design point. The current official line: use Regional.

Why.

Latency: Hitting a nearby Region endpoint is obviously faster.
Availability: The Global endpoint is hosted only in us-east-1, so it goes down with us-east-1. Regional endpoints are independent per Region.
Token validity scope: SessionTokens issued from a Regional endpoint are valid in every Region. Tokens issued from the Global endpoint only work in default-enabled Regions (not Opt-in Regions). To use a newer Opt-in Region, you must issue from a Regional endpoint.

Around July 31, 2025, the SDK defaults for boto3 v1.40.0, PHP, C++, .NET, and Tools for PowerShell flipped to Regional (Global used to be the default). Go, Node, and Java were already Regional by default. Whatever SDK you are using without thinking about it is basically already Regional.

For the CLI, set AWS_STS_REGIONAL_ENDPOINTS=regional (or sts_regional_endpoints = regional in ~/.aws/config) to make the choice explicit. New projects should standardize on Regional without thinking.
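If you build endpoint URLs yourself (for example in a curl-based health check), the hostname rule from the table can be sketched as below. This assumes the standard `aws` partition only, and `sts_hostname` is an illustrative helper, not an SDK function.

```python
from typing import Optional

GLOBAL_STS = "sts.amazonaws.com"


def sts_hostname(region: Optional[str]) -> str:
    """Return the Regional STS hostname, or the Global one when no region is given."""
    if not region:
        return GLOBAL_STS
    return f"sts.{region}.amazonaws.com"
```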

2. The 3 fields of a temporary credential

What STS hands you is not a single "token" string. It is a triple of credential fields, returned together with an Expiration timestamp.

Field	Content	Property
AccessKeyId	20 chars starting with `ASIA`	Safe to expose (it rides on every request)
SecretAccessKey	40-char base64	Leak = compromise. HMAC key for SigV4
SessionToken	Hundreds to thousands of characters	Leak = compromise. Without it the temporary credential is not accepted
Expiration	ISO8601 UTC timestamp	One second past this and you re-fetch from STS

IAM User long-lived keys start with AKIA.... STS temporary credentials start with ASIA.... The first 4 characters tell you long-lived vs temporary. Remember this and your first triage on CloudTrail or ~/.aws/credentials gets fast.
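The prefix check is easy to automate when scanning logs or credential files. A small sketch, with `credential_kind` as a made-up helper name:

```python
def credential_kind(access_key_id: str) -> str:
    """Classify an access key ID by its 4-character prefix."""
    if access_key_id.startswith("AKIA"):
        return "long-lived"   # IAM User access key
    if access_key_id.startswith("ASIA"):
        return "temporary"    # issued by STS
    return "unknown"
```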

SessionToken rides as an extra SigV4 header

With long-lived keys, SigV4 uses AccessKeyId and SecretAccessKey. With STS temporary credentials, you have to add one more HTTP header.

POST / HTTP/1.1
Host: dynamodb.ap-northeast-1.amazonaws.com
X-Amz-Date: 20260517T120000Z
X-Amz-Security-Token: IQoJb3JpZ2luX2VjEM3...(continues for hundreds of chars)
Authorization: AWS4-HMAC-SHA256
  Credential=ASIAEXAMPLE/20260517/ap-northeast-1/dynamodb/aws4_request,
  SignedHeaders=host;x-amz-date;x-amz-security-token,
  Signature=abc123...

Points.

Put SessionToken in the X-Amz-Security-Token header. AWS uses this to decide "this is not a long-lived IAM User key, it is an STS-issued temporary credential" and looks up validity and policy accordingly.
Include x-amz-security-token in SignedHeaders so it is covered by the signature (tamper resistance).

SDKs handle all of this automatically, so you rarely write it by hand. But when "it works from a container via the SDK but curl returns 403," this header is almost always the cause.
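For debugging that curl case, it can help to see where the token enters the signature. The following is a stripped-down SigV4 sketch using only the Python standard library. It assumes a request to `/` with no query string and signs only the three headers shown above. A real service call usually signs more headers (content type, target operation), so treat this as an illustration of the mechanism rather than a drop-in signer.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sign_with_session_token(secret_key, session_token, *, method, host,
                            region, service, amz_date, payload=b""):
    """Return (signed_headers, signature) with the session token covered."""
    date = amz_date[:8]  # YYYYMMDD part of the timestamp
    headers = {
        "host": host,
        "x-amz-date": amz_date,
        "x-amz-security-token": session_token,
    }
    names = sorted(headers)
    signed_headers = ";".join(names)
    canonical_headers = "".join(f"{n}:{headers[n]}\n" for n in names)
    canonical_request = "\n".join([
        method, "/", "", canonical_headers, signed_headers,
        hashlib.sha256(payload).hexdigest(),
    ])
    scope = f"{date}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return signed_headers, signature
```

Because the token is part of the canonical headers, changing or dropping it changes the signature, which is why a hand-built request without the header is rejected.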

3. STS APIs: a decision tree

STS has multiple similarly named APIs and people mix them up every time.

One line per API.

API	What it does	DurationSeconds max
`AssumeRole`	Switch into an IAM Role in the same or another account	Role's MaxSessionDuration (1 h to 12 h)
`AssumeRoleWithSAML`	Pass a SAML 2.0 IdP assertion and switch into a Role	Same
`AssumeRoleWithWebIdentity`	Pass an OIDC IdP id_token and switch into a Role	Same
`AssumeRoot` (2024)	From the Management Account, take Root-equivalent permissions on a Member Account for 15 min	Fixed 15 min
`GetSessionToken`	IAM User long-lived key plus (optional) MFA, upgraded to a temporary credential	36 h, default 12 h (1 h when the caller is root)
`GetFederationToken`	IAM User long-lived key mints a temporary credential for another identity (federated user)	36 h, default 12 h (1 h when the caller is root)
`DecodeAuthorizationMessage`	Expand the encoded AccessDenied message into something human-readable	(not an issuer)
`GetCallerIdentity`	Returns "if I call AWS right now with these credentials, who am I?"	(not an issuer)

Frequency in real work: AssumeRole > AssumeRoleWithWebIdentity (CI) > AssumeRoleWithSAML (corporate IdP) > DecodeAuthorizationMessage (debugging). GetSessionToken and GetFederationToken are IAM User-era APIs. With Identity Center and OIDC unifying humans and CI, they barely show up anymore.

Do not confuse `GetSessionToken` with `AssumeRole`

The name GetSessionToken suggests "the API that returns you a session token." It actually means "take an IAM User's long-lived key and upgrade it to a temporary credential, optionally with MFA." You gain no new permissions. You get the same permissions as the original IAM User, just with an Expiration and (if MFA was used) the aws:MultiFactorAuthPresent condition set.

The canonical use case is enforcing MFA via aws:MultiFactorAuthPresent in an IAM Policy condition.

{
  "Effect": "Deny",
  "Action": "ec2:TerminateInstances",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
  }
}

When you run as an IAM User from the CLI, call aws sts get-session-token --serial-number <MFA ARN> --token-code <6 digits> to swap in a credential that carries the MFA condition. Save it as a separate profile in ~/.aws/credentials and use that profile for everything sensitive.
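The reason the policy uses BoolIfExists rather than Bool is that requests signed with a plain long-lived key do not carry the aws:MultiFactorAuthPresent key at all. A small Python model of that operator, with hypothetical function names and a request context reduced to a dict:

```python
def bool_if_exists(context: dict, key: str, expected: str) -> bool:
    """BoolIfExists semantics: a missing key counts as a match."""
    if key not in context:
        return True
    return str(context[key]).lower() == expected.lower()


def terminate_denied(context: dict) -> bool:
    """Does the Deny statement above apply to this request context?"""
    return bool_if_exists(context, "aws:MultiFactorAuthPresent", "false")
```

So the Deny fires both for sessions created without MFA and for direct long-lived key calls, and only a session obtained with MFA gets through.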

`GetFederationToken` is basically retired

GetFederationToken lets an IAM User long-lived key mint a temporary credential on behalf of someone else. It mattered when you stood up a custom corporate IdP server that held the IAM User key and minted credentials for authenticated employees.

That era is over. The same outcome ships via Identity Center + Permission Set + SAML/OIDC, without any long-lived IAM User key in the picture. New designs do not reach for GetFederationToken.

4. Trust Policy vs Identity Policy

"I have a few policies on this Role, but what is the Trust Policy?" is a recurring question. Roles carry two kinds of policy evaluated at different moments.

Policy	Where it lives	Evaluation moment	Question it answers
Trust Policy	The Role's `AssumeRolePolicyDocument`	When `AssumeRole` is called	"Who is allowed to assume this Role?"
Identity Policy	Policies attached to the Role	On each API call after assuming	"What can the assumed Role do?"

Both are JSON policy documents with nearly identical syntax. The one difference: whether you write Principal.

Trust Policy requires Principal. You name explicitly who may assume the Role, for example arn:aws:iam::111:user/alice.
Identity Policy has no Principal. The owner is the attached Role, no ambiguity.

Example: a Trust Policy that allows GitHub Actions OIDC federation.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::111111111111:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}

The common mistake here: writing sts:AssumeRole for the Action. OIDC requires sts:AssumeRoleWithWebIdentity. SAML needs sts:AssumeRoleWithSAML. Plain role-switching needs sts:AssumeRole. Memorize this: the API name you call equals the Action name in the Trust Policy.

5. External ID: stopping the Confused Deputy

If a SaaS vendor has ever asked you to "add this ExternalId to your AWS Role's Trust Policy," that is the Confused Deputy defense.

What is a Confused Deputy

An attack where "a third party (the Deputy) who legitimately holds permissions on your behalf" is tricked by a different attacker into exercising those permissions for someone other than you.

The Victim's Trust Policy only says "the Vendor's AWS account may assume this Role." The SaaS vendor's system mints AssumeRole calls on behalf of many customers. If the Vendor's tenant isolation is sloppy, the attacker registers "my tenant's Role ARN is (actually the Victim's Role ARN)" and the Vendor unwittingly assumes the Victim's Role.

Close the hole with ExternalId

ExternalId is "a value unique to one customer, that the SaaS vendor must include when assuming that customer's Role." The Victim writes it into the Trust Policy, the Vendor passes it via --external-id.

The Victim's Trust Policy looks like this.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::SAAS_VENDOR_ACCOUNT:root" },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "sts:ExternalId": "ab8a3c7e-4d1f-4d9b-90e4-..."
      }
    }
  }]
}

ExternalId is not a secret password. The Vendor knows it. It is written in plain text in the Trust Policy. It is just "a unique string the customer (the Victim) and the SaaS agreed on." Its job is to make AWS re-validate the Vendor-internal tenant routing.

Vendor responsibilities.

Generate a distinct ExternalId per customer (UUID recommended).
Never pass customer A's ExternalId into customer B's AssumeRole. No internal swaps.
Show ExternalId only to the customer it belongs to.

Customer (Victim) responsibilities.

Always put the Vendor-supplied ExternalId into your Trust Policy's Condition.
Treat "an external vendor Role with no ExternalId" as a design defect.
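The vendor-side rules can be sketched as a tiny registry. This is an illustration only: `ExternalIdRegistry` and `trust_policy_allows` are invented names, storage is an in-memory dict, and the check is a simplified stand-in for the sts:ExternalId StringEquals condition.

```python
import uuid


class ExternalIdRegistry:
    """Vendor-side store: one ExternalId per customer, never shared."""

    def __init__(self):
        self._by_customer = {}

    def external_id_for(self, customer_id: str) -> str:
        # Generate once per customer and always reuse the same value.
        if customer_id not in self._by_customer:
            self._by_customer[customer_id] = str(uuid.uuid4())
        return self._by_customer[customer_id]


def trust_policy_allows(required_external_id: str, presented_external_id: str) -> bool:
    """Simplified model of the sts:ExternalId condition in the Trust Policy."""
    return presented_external_id == required_external_id
```

If an attacker registers the Victim's Role ARN under their own tenant, the vendor sends the attacker's ExternalId, the Victim's Trust Policy expects a different one, and the AssumeRole is refused.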

6. Source Identity: keep the real human in CloudTrail

When you assume a Role, CloudTrail shows arn:aws:sts::123456789012:assumed-role/MyRole/SomeSessionName. SomeSessionName is the required --role-session-name argument.

In day-to-day use this ends up as bob-cli-session or 1700000000 or whatever string someone threw in. CloudTrail alone cannot tell you who actually called the API.

Source Identity solves this.

Properties of Source Identity

Pass it at AssumeRole with --source-identity bob@example.com.
Once set, it is immutable for the session. Cannot be tampered with.
Survives Role Chaining. The value propagates to subsequent assumed Roles.
It always appears at userIdentity.sessionContext.sourceIdentity in CloudTrail.

So stamping the "original human ID" at the first AssumeRole locks it in for the whole chain.

Enforce it in the Trust Policy

Tighten the operation: refuse any AssumeRole that does not set Source Identity. Add this to the Role's Trust Policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111:root" },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111:root" },
      "Action": "sts:SetSourceIdentity"
    },
    {
      "Effect": "Deny",
      "Principal": { "AWS": "arn:aws:iam::111:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:SourceIdentity": "" }
      }
    }
  ]
}

You need sts:SetSourceIdentity to be Allow-ed before SourceIdentity can be stamped. Stack a Deny that fires when the value is empty.

One trap: to propagate Source Identity across Role Chain or Cross-account, you must write sts:SetSourceIdentity in two places.

The caller Principal's Identity Policy (the IAM User / Role making the call)
The target Role's Trust Policy

Miss either and the chained AssumeRole fails.

In Identity Center environments, mapping https://aws.amazon.com/SAML/Attributes/SourceIdentity on the IdP side to the user's email gives you automatic Source Identity stamping on every Identity Center-mediated AssumeRole. That is the ideal setup.

7. Session Tag and Transitive Tag

Session Tags are key=value pairs passed at AssumeRole. They attach to the temporary credential and become available in policy Conditions as aws:PrincipalTag/Team. They are the engine that drives ABAC (attribute-based access control).

Plain Session Tag

aws sts assume-role \
  --role-arn arn:aws:iam::111:role/DataEngineerRole \
  --role-session-name alice-session \
  --tags Key=Team,Value=ml Key=Project,Value=recsys

On the Role's Identity Policy side, use the tag.

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::data-*",
  "Condition": {
    "StringEquals": {
      "s3:ExistingObjectTag/Team": "${aws:PrincipalTag/Team}"
    }
  }
}

This says "Allow only when the object's Team tag matches the principal's Team tag." Instead of creating a Role per Team, you keep one Role and filter dynamically by tag.

Transitive Tag: carry across Role Chain

Plain Session Tags disappear at the next AssumeRole. After Role A to Role B, the tag on Role A's session does not appear on Role B's session.

--transitive-tag-keys makes a Tag survive the chain.

aws sts assume-role \
  --role-arn arn:aws:iam::111:role/RoleA \
  --role-session-name session1 \
  --tags Key=Team,Value=ml Key=Project,Value=recsys \
  --transitive-tag-keys Team

Now aws:PrincipalTag/Team=ml stays alive across Role A to Role B to Role C. Project drops at the first hop.

Transitive Tag also gives you a guarantee: an attribute stamped upstream cannot be overwritten downstream. If a downstream call to AssumeRole passes --tags Key=Team,Value=admin with the same key, STS rejects the call instead of replacing the inherited value.

Same shape as Source Identity: an attribute stamped at the upstream trust boundary survives intact downstream. That is what makes ABAC trustworthy.
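The propagation rule can be modelled in a few lines of Python. This is a sketch of the behaviour described in this section, not of the STS implementation. `assume_role_tags` is a hypothetical name, and a key collision with an inherited tag is simply treated as an error here.

```python
def assume_role_tags(inherited: dict, tags: dict, transitive_keys=()):
    """Return (session_tags, carried_tags) for one AssumeRole hop.

    inherited: transitive tags carried in from the previous hop
    tags: tags passed on this call
    transitive_keys: which of this call's tags should survive later hops
    """
    collisions = set(inherited) & set(tags)
    if collisions:
        raise ValueError(f"tag keys collide with inherited tags: {sorted(collisions)}")
    session_tags = {**inherited, **tags}
    carried = dict(inherited)
    carried.update({k: tags[k] for k in transitive_keys if k in tags})
    return session_tags, carried
```

Running two hops with Team marked transitive shows Team surviving into Role B's session while Project disappears after Role A.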

8. Session Policy: narrow at the call site

Session Policy is an inline policy passed as an AssumeRole argument. "Keep the Role's full permissions, but for this specific session, narrow further to just these." You are lowering the permission ceiling through the call argument.

aws sts assume-role \
  --role-arn arn:aws:iam::111:role/AdminRole \
  --role-session-name restricted-session \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::reports/*"
    }]
  }'

Important rules.

Session Policy is AND (intersection) with the Role's Identity Policy. It can only narrow, never widen.
Even if the Role allows all of s3:*, a Session Policy restricted to s3:GetObject ends up at just s3:GetObject.
Writing ec2:* in a Session Policy when the Role does not allow it grants nothing.
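The intersection rule is easy to picture as two filters that must both pass. A simplified Python sketch, assuming Allow-only policies reduced to lists of action patterns (no Deny, no Resource, no Condition) and case-sensitive matching; `allowed` is an illustrative name:

```python
from fnmatch import fnmatchcase


def allowed(action, role_actions, session_actions=None):
    """True when the action passes the Role policy AND the Session Policy."""
    role_ok = any(fnmatchcase(action, pattern) for pattern in role_actions)
    if session_actions is None:
        # No Session Policy passed: the Role's policy alone decides.
        return role_ok
    session_ok = any(fnmatchcase(action, pattern) for pattern in session_actions)
    return role_ok and session_ok
```

With a Role allowing `s3:*`, a Session Policy of `s3:GetObject` leaves only GetObject, and a Session Policy of `ec2:*` leaves nothing.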

Use cases.

Ephemeral narrowing: "I am inside Admin Role but for this 1 hour I only touch a specific S3 prefix." Self-imposed handcuffs.
Limited delegation to a SaaS: when a SaaS Vendor assumes your Role, layer a Session Policy at the call site so the Vendor can only hit the minimum it actually needs.
Extra defense on cross-account: even when a Role is open to assume from another account, the AssumeRole-side Session Policy enforces "what runs in our account is read-only."

You can pass up to 10 managed policy ARNs plus 1 inline policy (the inline one is capped at 2,048 characters).

9. Role Chaining: the 1-hour wall

Role Chaining means "use Role A's temporary credential to AssumeRole into Role B."

The trap: once you chain, DurationSeconds is hard-capped at 1 hour (3600 s) for every hop after the first. Even if the Role's MaxSessionDuration is 12 hours, a request beyond 1 hour is rejected.

If you leave DurationSeconds at the default (1 hour) you never see it. But "the first hop ran 12 hours fine, so the second should too" leads to Terraform / scripts breaking on The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining.

The fix is simple: do not chain.

Within an account: usually you can skip the relay Role and assume the target directly.
Cross-account: reconsider whether the relay Role is actually needed. Moving to Identity Center removes the relay.
When chaining is genuinely required (some delegated-admin patterns): write a refresh loop that re-assumes every hour.
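For the last case, a refresh wrapper might look like the sketch below. The `assume` callable is an assumption: it stands for whatever function performs the chained AssumeRole (for example a thin wrapper around your SDK) and returns the credentials. The 5-minute safety margin is an arbitrary choice.

```python
import time


class RefreshingCredentials:
    """Cache chained-role credentials and re-assume shortly before expiry."""

    def __init__(self, assume, lifetime_s=3600, margin_s=300, clock=time.time):
        self._assume = assume          # hypothetical: calls AssumeRole, returns creds
        self._lifetime_s = lifetime_s  # 3600 is the chaining cap
        self._margin_s = margin_s
        self._clock = clock
        self._creds = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._creds is None or now >= self._expires_at - self._margin_s:
            self._creds = self._assume(self._lifetime_s)
            self._expires_at = now + self._lifetime_s
        return self._creds
```

Callers ask `get()` for credentials before every batch of work instead of holding on to one set for the whole job.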

10. DurationSeconds limits

DurationSeconds for AssumeRole and friends ranges from 900 (15 min) to 43200 (12 hours). What you actually get is the minimum of several limits.

Constraint	Limit
API spec maximum	43200 s (12 h)
Role's `MaxSessionDuration`	3600 to 43200 s (default 3600; set at creation or changed later)
Role Chaining (see section 9)	3600 s (fixed)
`AssumeRoot` (2024)	900 s (fixed)
`GetSessionToken` (IAM User)	129600 s (36 h). 3600 s when called by root
`GetFederationToken` (IAM User)	129600 s (36 h), default 43200 s (12 h). 3600 s when called by root

Effective limit is API spec ∩ Role.MaxSessionDuration ∩ chain constraint, take the minimum. "I asked for 43200 and the call was rejected" almost always means the Role's MaxSessionDuration is still at 1 hour, or you are chaining.

Check with aws iam get-role --role-name MyRole to see MaxSessionDuration.
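The "take the minimum" rule for AssumeRole can be written down directly. A sketch with illustrative names, using the limits from the table above and covering only the AssumeRole family (not AssumeRoot, GetSessionToken, or GetFederationToken):

```python
API_MIN_S = 900      # 15 minutes
API_MAX_S = 43200    # 12 hours
CHAIN_CAP_S = 3600   # 1 hour when the caller is itself an assumed Role


def max_allowed_duration(role_max_session_duration: int, chained: bool) -> int:
    """Smallest of the applicable caps."""
    caps = [API_MAX_S, role_max_session_duration]
    if chained:
        caps.append(CHAIN_CAP_S)
    return min(caps)


def check_duration(requested: int, role_max_session_duration: int,
                   chained: bool = False) -> int:
    """Return the requested duration, or raise if the request would be rejected."""
    limit = max_allowed_duration(role_max_session_duration, chained)
    if not API_MIN_S <= requested <= limit:
        raise ValueError(f"DurationSeconds must be between {API_MIN_S} and {limit}")
    return requested
```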

11. GitHub Actions OIDC integration (a quick look)

The most popular use of AssumeRoleWithWebIdentity is GitHub Actions OIDC. It ends the era of putting long-lived keys (AKIA...) into GitHub Secrets.

Just the points.

Register token.actions.githubusercontent.com as an IAM OIDC Identity Provider up front.
In the receiving Role's Trust Policy, narrow the sub StringLike condition to repository + branch. Loose conditions like repo:my-org/* open the door for other repos in the org to assume.
The workflow side just uses aws-actions/configure-aws-credentials.
The reason the Job can get an id_token without long-lived secrets: GitHub's control plane injects a short-lived internal token (ACTIONS_ID_TOKEN_REQUEST_TOKEN) into env vars when the Runner starts. The long-lived secret exists on GitHub's side, not on yours. "Zero long-lived credentials" is a user-side statement, not a literal one (typical workload identity pattern).

After this, zero long-lived keys live in GitHub Secrets. This is a different path from Roles Anywhere, and for CI it is far easier.

12. How it shows up in CloudTrail

Once you have assumed a Role, CloudTrail's userIdentity block tells you "who is this calling as."

`userIdentity` right after AssumeRole

{
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAEXAMPLEID:alice-session",
    "arn": "arn:aws:sts::123456789012:assumed-role/DataEngineerRole/alice-session",
    "accountId": "123456789012",
    "accessKeyId": "ASIAEXAMPLEKEY",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "AROAEXAMPLEID",
        "arn": "arn:aws:iam::123456789012:role/DataEngineerRole",
        "accountId": "123456789012",
        "userName": "DataEngineerRole"
      },
      "attributes": {
        "creationDate": "2026-05-17T09:00:00Z",
        "mfaAuthenticated": "false"
      },
      "sourceIdentity": "alice@example.com"
    }
  }
}

Reading it.

type: "AssumedRole" tells you "this call used a temporary credential."
arn follows arn:aws:sts::ACCOUNT:assumed-role/ROLE_NAME/SESSION_NAME, an unusual format. Note the sts::, not the regular Role ARN (arn:aws:iam::...).
principalId is AROA... (the Role ID) plus : plus the session name.
accessKeyId starts with ASIA... (the temporary credential marker).
sessionContext.sourceIdentity carries the Source Identity. Empty here means you cannot tell who, which is why stamping is worth enforcing.
sessionIssuer is the Role this session came from. In a chain, the nearest Role.
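When you process these records in a script, splitting the assumed-role ARN is usually the first step. A minimal parser sketch (the function name and the returned dict shape are my own choices):

```python
def parse_assumed_role_arn(arn: str) -> dict:
    """Split arn:aws:sts::ACCOUNT:assumed-role/ROLE_NAME/SESSION_NAME."""
    prefix, sep, resource = arn.partition(":assumed-role/")
    parts = prefix.split(":")
    if not sep or len(parts) != 5 or parts[0] != "arn" or parts[2] != "sts":
        raise ValueError(f"not an assumed-role ARN: {arn}")
    role_name, slash, session_name = resource.rpartition("/")
    if not slash or not role_name or not session_name:
        raise ValueError(f"missing role or session name: {arn}")
    return {
        "account_id": parts[4],
        "role_name": role_name,
        "session_name": session_name,
    }
```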

Athena: aggregate "by human"

If CloudTrail is landed in S3, this Athena query gives you "API calls per human."

SELECT
  COALESCE(useridentity.sessioncontext.sourceidentity, useridentity.arn) AS who,
  eventname,
  count(*) AS calls
FROM cloudtrail_logs
WHERE eventtime BETWEEN '2026-05-01' AND '2026-05-17'
GROUP BY 1, 2
ORDER BY calls DESC
LIMIT 100;

With Source Identity enforced, the first column is alice@example.com. Without it, you get a tasteless arn:aws:sts::.../session-1700000000.

This is the most concrete reason to enforce Source Identity: audit aggregation becomes mechanical.
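The same aggregation works on raw CloudTrail JSON if you are not using Athena. A small Python equivalent of the COALESCE logic, assuming `events` is a list of already-parsed CloudTrail records; `calls_per_identity` is a name made up for this sketch:

```python
from collections import Counter


def calls_per_identity(events):
    """Count calls per (who, eventName), preferring sourceIdentity over the ARN."""
    counts = Counter()
    for event in events:
        identity = event.get("userIdentity", {})
        source = identity.get("sessionContext", {}).get("sourceIdentity")
        who = source or identity.get("arn")
        counts[(who, event.get("eventName"))] += 1
    return counts.most_common()
```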

13. AssumeRoot: the 2024 re:Invent addition

In November 2024, just before re:Invent, AWS released a new STS API: AssumeRoot. Its purpose: "From the Management Account, take Root-equivalent permissions on a Member Account on behalf of that account."

Why it exists

Member Accounts in AWS Organizations all have a Root User. Some operations require the Root User. Examples.

Unsticking an S3 bucket whose Bucket Policy is Deny for every Principal: no IAM permission can fix it, only Root.
Unsticking an SQS queue whose Queue Policy is Deny for every Principal: same.
Resetting Root User MFA when lost: only the Root itself can.
Deleting / enabling Root credentials: same.

Doing these required logging into the Member Account's Root, and maintaining MFA and password-recovery email for every Root across the organization. In organizations with hundreds of accounts, this was an incident factory.

AssumeRoot fixes it. From the Management Account (or the IAM delegated admin Account), take Root-equivalent permissions on a Member Account, narrowed to a specific task, for 15 minutes.

Scoping via TaskPolicy

The key constraint of AssumeRoot is that a task policy is required (the TaskPolicyArn parameter). You cannot mint a session that says "Root, anything goes." You must pick one of the AWS-managed task-scoped policies. Representative ones.

Task Policy (managed)	What it does
`S3UnlockBucketPolicy`	Read / Delete on S3 Bucket Policy
`SQSUnlockQueuePolicy`	Read / Delete on SQS Queue Policy
`IAMDeleteRootUserCredentials`	Delete the Member Account's Root credentials
`IAMCreateRootUserPassword`	Regenerate the Member Account's Root password (recovery)
`IAMAuditRootUserCredentials`	Read-only inspection of Root credential state

So you cannot "do anything because you are Root." You get a "Root session scoped to unlock S3 bucket policy" or "Root session scoped to unlock SQS queue policy." Roles are narrow by design, and audits stay readable.

Prerequisites

Two Organizations-side settings are required to call AssumeRoot.

Enable Centralized root access for member accounts (Organizations console / API).
Delete or disable each Member Account's Root credentials (or use SCP to forbid direct Root login).

The caller's IAM Identity Policy needs sts:AssumeRoot Allow. Only the Management Account or a delegated admin account can hold it. You cannot grant it on a Member Account directly.

Audit side

If your org has monitored only ConsoleLogin events for Root, take note. With Root credentials deleted, Root console logins stop happening, so add sts:AssumeRoot to your monitored event set. Vendors like Elastic Security Labs already publish detection rules for "rare user plus Member Account AssumeRoot."

14. Do this / avoid that

Avoid

Reaching for the Global STS endpoint (sts.amazonaws.com) on a new project.
Putting only * in Trust Policy's Principal without a Condition.
Forgetting ExternalId on a Role for an external vendor.
Stacking 3 or more Role Chain hops (the trust boundary blurs).
Random session names with an empty Source Identity.
Granting AssumeRoot to a Member Account Principal. Keep it on Management / delegated admin only.
Maxing DurationSeconds to 12 hours everywhere. You lose the short-lived benefit.
Continuing to use long-lived keys (AKIA...) without MFA.

Do

Default to Regional STS endpoints.
Require ExternalId on external vendor Roles.
Route through Identity Center so SAML/OIDC stamps SourceIdentity automatically.
Design Team / Project ABAC with Transitive Tags surviving the chain.
Layer Session Policy at AssumeRole call sites: "this session only touches this bucket."
Write a refresh loop aware of the 1-hour cap after Role Chain.
Monitor sts:AssumeRoot in CloudTrail and route to Slack.
Aggregate CloudTrail by sourceIdentity so "who did what" becomes mechanical.
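
The "refresh loop aware of the 1-hour cap" item can be sketched as two small helpers. This is a minimal sketch, not an SDK feature: safe_duration and needs_refresh are hypothetical names, the 5-minute margin is an arbitrary choice, and the actual AssumeRole call is left out.

```python
from datetime import datetime, timedelta, timezone

ROLE_CHAIN_CAP_SECONDS = 3600  # cap on sessions obtained through role chaining
MIN_DURATION_SECONDS = 900     # smallest DurationSeconds AssumeRole accepts


def safe_duration(requested: int, role_max: int, chained: bool) -> int:
    """Clamp a requested DurationSeconds so the call is not rejected."""
    cap = min(role_max, ROLE_CHAIN_CAP_SECONDS) if chained else role_max
    return max(MIN_DURATION_SECONDS, min(requested, cap))


def needs_refresh(expiration: datetime, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True once we are within `margin` of the credential expiry."""
    return now >= expiration - margin


print(safe_duration(43200, 43200, chained=True))   # chained: clamped to 3600
print(safe_duration(43200, 43200, chained=False))  # not chained: 43200 kept

expires = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(needs_refresh(expires, expires - timedelta(minutes=4)))
```

Clamping before the call matters because a chained request asking for more than 1 hour fails outright instead of being silently shortened.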

Conclusion

STS is not a single function. It is 6 issuance APIs (AssumeRole, AssumeRoleWithSAML, AssumeRoleWithWebIdentity, AssumeRoot, GetSessionToken, GetFederationToken) plus helpers (DecodeAuthorizationMessage, GetCallerIdentity).
Global and Regional both exist. New work picks Regional.
A temporary credential is ASIA... + Secret + SessionToken, sent with X-Amz-Security-Token riding on SigV4.
Trust Policy answers "who is allowed to assume." Identity Policy answers "what can the assumed Role do."
ExternalId stops Confused Deputy through a value the customer and vendor agreed on, enforced in the Trust Policy.
Source Identity carries the original human ID across the chain. Required for mechanical CloudTrail aggregation.
Transitive Session Tag is the ABAC engine. Upstream stamping survives downstream.
Session Policy is intersection at the call site. It narrows, never widens.
DurationSeconds is hard-capped at 1 hour after Role Chaining.
AssumeRoot from 2024 re:Invent is the core of Centralized Root Access. Take Member Root from Management for 15 minutes, scoped by TaskPolicy.

WORM (Write Once Read Many) Deep Dive

kt — Mon, 18 May 2026 15:49:44 +0000

Introduction

"Data can be deleted." It looks obvious, but it is actually a fatal default from a legal and ransomware perspective.

When an auditor asks for 7 years of email, the moment a single person can type DELETE, evidentiary value collapses
When ransomware encrypts your backups too, recovery is impossible and you end up paying the ransom
When someone suspects "that trade record was tampered with, wasn't it?", you lose in court unless you can show that tampering is technically impossible

The answer to all of these is WORM (Write Once Read Many). A storage mode where once you write data, nobody can delete or rewrite it during the prescribed period. Not the root admin, not the attacker, not the operations team at the service provider.

This article dissects "undeletable storage" in the following order.

What WORM is (definition)
History of WORM (from optical discs to cloud)
The 3 worlds that need WORM (regulation / ransomware / insider threat)
The 3 ways to implement WORM (physical / on-prem / cloud)
The main event of cloud WORM: dissecting S3 Object Lock
Side-by-side comparison: AWS / Azure / GCP / on-prem NetApp
Why regulators demand WORM (SEC 17a-4 and the 2022 amendment)
The clash with GDPR "right to be forgotten" and how to solve it
Famous companies and actual fine cases
Design pitfalls and how to avoid them

It is long, but read top to bottom and you will see what WORM is, why it is needed, how to build it, and how to use it as one connected story.

1. What WORM is

WORM stands for Write Once Read Many. It refers to a storage mode in which data, once written, cannot be rewritten or deleted until a defined retention period has elapsed.

The definition has three cores.

Write Once: the object cannot be overwritten with new content from the start of protection until the retention expires
Read Many: reads are unlimited for anyone who is authorized
Retention: a configurable answer to "until when can this not be deleted?". Specified as a period in days or years, or indefinite (Legal Hold)

The key thing to remember: in modern implementations, WORM is not a property of the whole storage but a flag attached to an object. Inside one bucket it is normal to have object A locked for 7 years and object B not locked at all.

"Once" strictly means "that version"

In modern implementations like S3 Object Lock, WORM applies at the version level. A new PUT to the same key (path) is written as a new version, and the old version remains untouched. So it is "appending to history", not "overwriting content".

This is an important premise for the GDPR "right to be forgotten" conflict discussion later (Section 8).

2. History of WORM: why it has existed since the 1980s

WORM is not a new concept. It is a technology that has existed at the physical media level since the 1980s, moved up to the software layer, and finally became a cloud API.

Physical WORM era (1984 to 2000)

The first WORM implementations were physically un-rewritable at the media itself.

Optical disc WORM: a laser physically burns pits into the recording layer (ablative). Burnt pits cannot be undone
Magneto-Optical (MO): the first MO drives in 1985 were WORM only. They ran a verification pass after writing, so writing took 3x as long as reading, but reliability was extremely high
CD-R / DVD-R: consumer-grade WORM popular in the 1990s. Burning legal documents, medical images, and broadcaster master tapes onto discs and locking them in a safe was the standard workflow

WORM in this era was complete once you "put the media in a safe". A simple guarantee model: tampering is impossible unless an attacker physically reaches the safe.

Software WORM era (2002 to 2017)

The problem was that optical discs do not scale to TB-class data retention. So the idea emerged: "provide WORM properties in software on top of ordinary hard disks".

The representatives were EMC Centera (2002) and NetApp SnapLock (2003). They put logic at the OS / firmware level on top of a normal HDD array that "rejects delete / overwrite during the retention period".

This solved the scale problem but raised a new issue: "what if you turn the OS clock back?"

NetApp killed this with a solution called ComplianceClock (detailed in Section 4). An internal hardware-based clock that is independent of the OS clock and only moves forward, used as the basis for retention. This makes the "turn the clock back to expire retention" attack impossible.

Cloud WORM era (2018 onward)

Since AWS released S3 Object Lock in 2018, WORM has become a standard feature of cloud object storage. Azure and GCP followed, and now if you are cloud-native you can get regulation-grade WORM by just "creating one bucket and flipping a flag".

Unlike physical media it scales to TB instantly, and replication handles redundancy. The "put media in a safe" workflow of optical discs is now only seen at regional banks, municipalities, and old hospitals.

3. The 3 worlds that need WORM

Many people think "we are not a regulated industry, so we do not need WORM", but in modern times there are 2 reasons besides regulation that demand WORM.

#	Reason	What to protect	Attack / failure scenario	Main industries
①	Regulation	Trade / communication records (5 to 30y)	Legal violation, huge fines, suspension order	Securities, banking, healthcare, pharma
②	Ransomware	Backups themselves	Attacker takes admin rights and encrypts backups along with the rest	All industries (exploded after 2020)
③	Insider threat	Audit logs, auth logs, operation history	Admin or DBA deletes logs to hide evidence of wrongdoing	SOC, SaaS operators, finance in general

① Regulation

Historically this is where WORM was born. The US SEC Rule 17a-4 is the representative: it requires securities firms to retain trade records in a "non-rewriteable, non-erasable" form. Equivalents exist in Japan (FIEA), the EU (MiFID II), healthcare (HIPAA), and pharma (FDA 21 CFR Part 11).

② Ransomware

This is the biggest reason WORM use exploded after 2020. Ransomware attackers have established a playbook of going for the backups first. Once backups are gone, paying the ransom is the only option.

If "the backup storage is on WORM", attackers cannot delete backups even after taking admin. In fact, backup products like Veeam / Rubrik / Commvault / Cohesity integrate with S3 Object Lock and Azure Immutable Blob and market a feature called "Immutable Backup".

③ Insider threat

Put audit logs, trade history, and security events on WORM so that even the admins themselves cannot delete them. For example, if DB access logs flow into WORM storage, a DBA cannot "go delete the logs" after doing something bad.

Streaming SOC (Security Operations Center) logs and FIDO / authentication logs into WORM as an evidence store is becoming best practice.

4. The 3 ways to implement WORM

From here, the "how to build it" story. In modern times, WORM can be implemented at three large layers.

① Physical media WORM

Optical discs or WORM tape. Sony Optical Disc Archive and HPE LTO WORM tapes are the representatives.

Pros: pull the media and put it in a safe and it truly cannot be written (air gap). 50 to 100 year retention track records
Cons: beyond TB-scale you need robotic libraries; cost and ops get heavy. Reading needs a dedicated drive; if you retain longer than the drive lifetime, you need to preserve the drive too
Where it is used: movie studio masters, national archives, broadcaster archives, medical imaging (legal 30-year retention)

② On-prem software WORM

Provides WORM at the filesystem or OS level on top of normal HDD / SSD arrays.

The representatives are NetApp SnapLock (still active), Dell EMC Centera (EOL in 2019, replaced by a tool called CTA), HPE StoreOnce, and Hitachi HCP.

Taking NetApp SnapLock as an example, internally it works like this.

Key points.

WORM commit: the moment a normal file gets the "read-only attribute" set, SnapLock promotes it to WORM file status. The retention date is recorded at the same time
ComplianceClock: the core of SnapLock. A dedicated clock that is not affected by ntpdate on the OS turning the clock back. A hardware-based one-way clock initialized once, used as the basis for retention. This makes the "turn the clock back to expire retention" attack impossible
Two modes: Compliance (not deletable even by root) and Enterprise (deletable with privileged delete, but logged)

On-prem WORM is firmly rooted in old banks and hospitals. The reason is simple: it integrates easily with existing mainframe / NAS operations, and "the data lives in our own rack" is easy to explain to auditors.

③ Cloud WORM

The main event since 2018. Starting with AWS S3 Object Lock, almost all major object storage services support WORM: Azure, GCP, Backblaze B2, Wasabi, Cloudflare R2.

We dig deeper in the next section.

5. The main event of cloud WORM: dissecting S3 Object Lock

S3 Object Lock went GA in November 2018. It is the de facto standard for WORM features, and Azure / GCP designed their APIs with it in mind.

Big picture

There are three elements.

Retention Mode: period-based (Days or Years). Two modes: Governance (loose) and Compliance (strict)
Legal Hold: no period; locked indefinitely until explicitly released
Both can apply at the same time: if the normal retention has expired but Legal Hold is on, it still cannot be deleted

Governance vs Compliance: get this wrong and you have an incident

The difference between them is "whether even the root user can delete or not".

Item	Governance	Compliance
Normal user delete	❌ no	❌ no
Privileged delete (`s3:BypassGovernanceRetention`)	✅ yes	❌ no
Root user delete	✅ yes (effectively)	❌ no
Shortening retention period	yes with privilege	❌ absolutely no (extend only)
Changing retention mode	can upgrade to Compliance with privilege	❌ absolutely no
Only way to release	release with privilege	close the entire AWS account (may still survive retention even then)
Use case	"accident prevention", test, internal policy	regulatory requirements like SEC 17a-4

Once you put it in Compliance mode, even AWS Support cannot release it. If you set 7 years, it absolutely stays for 7 years. Use Compliance in a test environment by mistake and the bucket fills with undeletable test data while billing climbs. Compliance is for production regulatory compliance only.

API: actually applying the lock

The lock is specified by HTTP headers on PUT.

aws s3api put-object \
  --bucket my-immutable-records \
  --key 2026/q1/trades.csv \
  --body trades.csv \
  --object-lock-mode COMPLIANCE \
  --object-lock-retain-until-date "2033-05-17T00:00:00Z"

The essence is what gets sent under the hood.

PUT /2026/q1/trades.csv HTTP/1.1
Host: my-immutable-records.s3.amazonaws.com
x-amz-object-lock-mode: COMPLIANCE
x-amz-object-lock-retain-until-date: 2033-05-17T00:00:00Z
Content-MD5: <base64-md5>
Content-Length: 12345

Key points:

If you specify x-amz-object-lock-mode, x-amz-object-lock-retain-until-date is required. One alone is not allowed
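An integrity header is required: a PUT with object-lock related headers must carry Content-MD5 (or, with current SDKs, an x-amz-checksum-* / x-amz-sdk-checksum-algorithm header) so that S3 can verify the body. This is to prevent "tampering during PUT"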
Dates are ISO 8601 (UTC, millisecond precision)
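
The rules above can be sketched as a small header builder. This is a minimal sketch using only the standard library; object_lock_headers is a hypothetical helper, not part of any AWS SDK, and it skips request signing entirely.

```python
import base64
import hashlib
from datetime import datetime, timezone
from typing import Optional


def object_lock_headers(body: bytes, mode: Optional[str],
                        retain_until: Optional[datetime]) -> dict:
    """Build the lock-related headers for a PUT, enforcing 'both or neither'."""
    if (mode is None) != (retain_until is None):
        raise ValueError("mode and retain-until-date must be set together")
    headers = {"Content-Length": str(len(body))}
    if mode is None:
        return headers
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown object lock mode: {mode}")
    if retain_until.tzinfo is None:
        raise ValueError("retain_until must be timezone-aware")
    headers["x-amz-object-lock-mode"] = mode
    # ISO 8601 in UTC, same shape as the example request above.
    headers["x-amz-object-lock-retain-until-date"] = (
        retain_until.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    # Content-MD5 is the base64 of the raw 16-byte digest, not of the hex string.
    headers["Content-MD5"] = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    return headers


demo = object_lock_headers(b"id,qty\n1,100\n", "COMPLIANCE",
                           datetime(2033, 5, 17, tzinfo=timezone.utc))
for name, value in demo.items():
    print(f"{name}: {value}")
```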

Object Lock operation sequence

The actual flow runs as a sequence: lock → delete attempt → retention expires → delete.

On the S3 server side, a check of retain-until-date > current time runs, and if true it rejects at the HTTP layer. Even if a DELETE flies due to an app bug or admin mistake, it never reaches the storage layer.
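
To make the decision concrete, here is a toy model of that server-side check, combining retention mode, Legal Hold, and the Governance bypass from the table earlier. It is an illustration of the rules described in this section, not S3's actual implementation; VersionLock and can_delete_version are hypothetical names. It models permanently deleting a specific version (a DELETE without a version ID only adds a delete marker).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VersionLock:
    mode: Optional[str]               # "GOVERNANCE", "COMPLIANCE", or None
    retain_until: Optional[datetime]  # None when no retention is set
    legal_hold: bool = False


def can_delete_version(lock: VersionLock, now: datetime,
                       bypass_governance: bool = False) -> bool:
    """Decide whether a permanent delete of one object version is allowed."""
    if lock.legal_hold:
        return False  # Legal Hold blocks deletion regardless of retention
    if lock.mode is None or lock.retain_until is None:
        return True   # no retention configured
    if lock.retain_until <= now:
        return True   # retention has expired
    if lock.mode == "GOVERNANCE":
        return bypass_governance  # needs s3:BypassGovernanceRetention
    return False      # COMPLIANCE: nobody can delete before expiry


now = datetime(2026, 5, 18, tzinfo=timezone.utc)
until = datetime(2033, 5, 17, tzinfo=timezone.utc)
print(can_delete_version(VersionLock("COMPLIANCE", until), now, bypass_governance=True))
print(can_delete_version(VersionLock("GOVERNANCE", until), now, bypass_governance=True))
```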

Three gotchas when using Object Lock

Versioning is a prerequisite: to enable Object Lock you need S3 Versioning ON first. Easy if you flip both on at bucket creation, but watch the order when retrofitting an existing bucket (self-service retrofit on existing buckets became possible in November 2023; details in Section 10)
Re-PUT to the same key becomes a new version: a version that has the lock applied is not removed even by Lifecycle. With Object Lock + Versioning, old versions pile up, so always use NoncurrentVersionExpiration in Lifecycle together
Legal Hold uses different IAM Actions: controlled by s3:PutObjectLegalHold / s3:GetObjectLegalHold. Manage who can apply or remove Legal Hold under a separate policy from retention

6. Cloud WORM comparison: AWS / Azure / GCP / on-prem

All three major clouds have WORM features, but the API and terminology differ. Side-by-side comparison.

Item	AWS S3 Object Lock	Azure Immutable Blob	GCS Bucket Lock	NetApp SnapLock
Granularity	Object (Version)	Container / Version	Bucket / Object	Volume + File
Period unit	Days / Years	Days	Seconds	Days / Years
"Not deletable by root" mode	Compliance	Locked Policy	Locked Retention Policy	Compliance Mode
"Loose" mode	Governance	Unlocked Policy	Unlocked Policy	Enterprise Mode
Legal Hold	yes	yes	yes (Object Hold)	yes (Event-Based Hold)
Auto-delete after period	via Lifecycle	via Lifecycle	via Lifecycle	configurable
Regulatory attestations	SEC 17a-4, FINRA 4511, CFTC 1.31, HIPAA	SEC 17a-4, FINRA 4511, CFTC 1.31	SEC 17a-4, FINRA, CFTC	SEC 17a-4 (longest)
Clock tampering defense	AWS internal clock (root cannot touch)	Azure internal clock	GCP internal clock	ComplianceClock
First GA	Nov 2018	2020	2018 (Bucket Lock) / 2023 (Object Hold)	2003

The notable fact here is "cloud WORMs all converged to a similar shape". S3 Object Lock effectively became the reference implementation.

Azure peculiarity: Container vs Version

Azure has two scopes.

Container-level WORM: WORM policy applied to the entire container. All blobs in the same container share the same retention
Version-level WORM: WORM applied to individual blob versions. Feels almost identical to S3 Object Lock

Recommend Version-level for new designs. Container-level is an older design with less operational flexibility.

GCP Bucket Lock has a brutal "no going back"

GCS Retention Policy + Bucket Lock, once locked, permanently disables the following:

Shortening retention period
Deleting retention policy
Releasing the bucket retention setting

Extension is allowed. The bucket itself cannot be deleted unless all contents have completed retention and the bucket is empty. "Oops, my bad" does not work at all, so always test it in dev / staging before production.

7. Why regulators demand WORM

Worth understanding "why are the requirements this strict".

SEC Rule 17a-4: the original

SEC Rule 17a-4 requires US broker-dealers (securities firms) to retain trade records and communications (including email and chat) in a "non-rewriteable, non-erasable form".

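The rule itself predates electronic records and was written assuming paper and microfilm. A 1997 amendment allowed electronic storage on the condition that records be kept in a non-rewriteable, non-erasable format, which was interpreted as "if storing electronically, use WORM" and has been operated that way ever since.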

Why so strict? The answer is simple: lying in securities transactions has huge instant upside.

Delete an insider trading evidence email → escape civil sanctions
Rewrite a customer explanation → win the lawsuit
Tamper with trade timestamps → push losses onto the customer

If these become technically possible, trust in the whole market collapses. So the logic is to mandate "technically impossible" mechanisms.

2022 amendment: the WORM-only era ended

For years 17a-4 allowed only WORM, but on October 12, 2022 the SEC adopted an amendment (effective January 3, 2023; broker-dealer compliance deadline May 3, 2023) adding an alternative called "Audit Trail".

This is a fairly big change: now you can meet the requirement with modern cloud-native designs (for example, streaming DB CDC logs into a separate WORM store). That said, "the most obviously sufficient option is still WORM", so conservative operations continue to choose WORM.

Other regulations

Regulation	Industry	Role of WORM
FINRA Rule 4511	US securities	extends 17a-4 to all FINRA member firms
CFTC Rule 1.31(c)-(d)	US commodities futures	retain trade records in WORM form
HIPAA	US healthcare	prevent tampering of PHI (Protected Health Information)
FDA 21 CFR Part 11	US pharma / medical dev	authenticity of electronic records and signatures
FIEA	Japan finance	"store in tamper-proof method for 7 years"
MiFID II	EU finance	retain communications 5 years, prevent tampering
GDPR Art. 32	EU all industries	integrity assurance (WORM is one concrete option)

8. Clash with GDPR: "right to be forgotten" vs "WORM"

Unavoidable when talking about WORM: the relationship with GDPR Article 17 (right to be forgotten).

The problem is simple.

GDPR: when a user requests, delete their personal data
WORM: the moment a record containing personal data lands, it cannot be deleted during retention

Looks irreconcilable. In fact, this was a long-running headache for cloud WORM design.

Solution 1: Crypto-Shredding (destroy the encryption key)

The most widespread solution.

Key points:

The data itself remains in the WORM S3 still encrypted (not deleted)
Generate a separate encryption key per user (envelope encryption)
When a deletion request comes, just delete the key. Data can no longer be decrypted = effectively deleted

This satisfies both "WORM immutability" and "GDPR right to delete". Easy to assemble with AWS KMS plus S3 Object Lock.
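
The shape of crypto-shredding fits in a few lines. The sketch below is deliberately a toy: the append-only list stands in for a WORM bucket, the dict stands in for a key service such as KMS, and the XOR keystream cipher is for illustration only (a real system would use an authenticated cipher like AES-GCM with envelope encryption). All names are hypothetical.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # TOY cipher, illustration only: XOR with a SHAKE-256 keystream.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class CryptoShredder:
    def __init__(self) -> None:
        self.keys = {}   # mutable key store (stand-in for KMS)
        self.worm = []   # append-only records (stand-in for a locked bucket)

    def write(self, user_id: str, plaintext: bytes) -> int:
        """Encrypt under the user's key and append to the WORM store."""
        key = self.keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        self.worm.append((user_id, nonce, _keystream_xor(key, nonce, plaintext)))
        return len(self.worm) - 1

    def read(self, index: int) -> bytes:
        user_id, nonce, ciphertext = self.worm[index]
        key = self.keys[user_id]  # raises KeyError once the key is shredded
        return _keystream_xor(key, nonce, ciphertext)

    def forget(self, user_id: str) -> None:
        """Handle a deletion request: destroy the key, leave the data alone."""
        self.keys.pop(user_id, None)


store = CryptoShredder()
idx = store.write("alice", b"alice@example.com placed order 42")
print(store.read(idx))
store.forget("alice")
print(len(store.worm), "record(s) still in WORM, but no key to read them")
```

The WORM store is never mutated after the write; only the key store changes, which is why the design is compatible with Compliance mode.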

Solution 2: Off-chain / Hybrid

A solution from the blockchain context. Put only the hash or reference in WORM storage, and put the actual content in normal storage. When a deletion request comes, delete the content. The hash remains in WORM, so "data of this kind existed at this time" can be proven, but the content itself cannot be reconstructed.

Useful for audit contexts where it is sufficient to prove "data existed at that timestamp".
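
A minimal sketch of the hybrid layout, again with in-memory stand-ins: an append-only list plays the WORM index and a dict plays the normal, deletable storage. HybridStore and its method names are assumptions for this example.

```python
import hashlib


class HybridStore:
    def __init__(self) -> None:
        self.worm_index = []  # append-only: (record_id, sha256 hex, timestamp)
        self.content = {}     # ordinary storage, deletable on request

    def put(self, record_id: str, data: bytes, timestamp: str) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self.worm_index.append((record_id, digest, timestamp))
        self.content[record_id] = data

    def erase(self, record_id: str) -> None:
        """Deletion request: remove the content, keep the hash in WORM."""
        self.content.pop(record_id, None)

    def proves_existence(self, record_id: str, candidate: bytes) -> bool:
        """Check a presented document against the immutable hash."""
        digest = hashlib.sha256(candidate).hexdigest()
        return any(rid == record_id and h == digest
                   for rid, h, _ in self.worm_index)


store = HybridStore()
doc = b"contract v1 signed by alice"
store.put("rec-1", doc, "2026-05-18T00:00:00Z")
store.erase("rec-1")
print("content still stored:", "rec-1" in store.content)
print("existence provable:", store.proves_existence("rec-1", doc))
```

One caveat with this approach: a bare hash of low-entropy personal data (an email address, a phone number) can be guessed by brute force, so a real design would likely hash with a per-record random salt that is deleted along with the content.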

9. Famous companies and real cases

Now the reality. What happens without WORM, and how companies that have it use it.

Case ①: FINRA fines 12 major firms 14.4M USD (2016)

December 21, 2016, FINRA fined 12 firms a total of 14.4 million USD. The reason was that they "failed to store hundreds of millions of electronic records in WORM format". One of the largest WORM-violation fine cases ever.

The top 4 breakdown.

Rank	Fine	Firm
1	$4.0M	Wells Fargo Securities + Wells Fargo Prime Services (combined)
2	$3.5M	RBC Capital Markets + RBC Capital Markets Arbitrage S.A. (combined)
3	$2.0M	RBS Securities, Inc.
4	$1.5M	Wells Fargo Advisors + Advisors Financial Network + First Clearing
rest	$3.4M	4 firms including PNC Capital Markets ($0.5M)

After this incident, US major securities firms audited existing systems for WORM compliance all at once, and migration to cloud WORM accelerated. The general retrospective: the post-fine regulatory remediation cost and reputational risk were far larger than the fine itself.

Case ②: Continuity Centers ransomware defense (2020)

Continuity Centers is a disaster recovery specialist DRaaS provider. In 2020, against the surge of ransomware, they adopted Veeam + Backblaze B2 Cloud Storage Object Lock.

The configuration is simple.

CEO Gregory Tellone's comment captures the essence: "Attackers try to wipe the customer's data center and backups in one go. Immutability is another line of defense."

A notable point: setup completed within an hour of Backblaze's Object Lock announcement. A symbolic case of the era when cloud WORM setup is one API call.

Case ③: Hospital + Veeam + StoneFly DR365V

A US public hospital with tens of thousands of patients adopted the following to meet HIPAA's long-term medical record retention requirement plus ransomware defense.

20 VMware VMs + TB-scale medical records
Backup product: Veeam
Storage destination: StoneFly DR365V air-gap + WORM volume + S3 Object Lockdown

Three layers of immutability, with tamper-proof copies in local, cloud, and offline.

Case ④: Snowflake makes WORM Backup GA (2025)

The data warehouse Snowflake made WORM Backup generally available in December 2025. In addition to Time Travel, you can use external S3 / Azure Blob as a retention-lock-capable WORM store to take tamper-proof backups.

As Snowflake usage in financial institutions grew, Snowflake itself moved to the side of providing WORM features to meet SEC 17a-4 / FINRA 4511. A clear trend of platform vendors building WORM in as a "standard feature".

Case ⑤: Broadcasters / archives (optical disc is still alive)

Even in the cloud era, there is a domain where optical disc WORM is still alive for long-term archive.

Sony's Optical Disc Archive (ODA) is the representative. A cartridge bundles multiple dedicated Blu-ray discs; capacity varies by generation from 1.5 TB to 5.5 TB. The vendor claims 100-year retention
Adopted for master storage in broadcasting, video production, and medical imaging
The decisive properties: "no power needed once the media is removed" and "not subject to the convenience of firmware or cloud providers"

Even when cloud WORM is one API call away, the reality is that "no operator guarantees access for 100 years" (no cloud provider signs a 100-year contract), so for ultra-long-term retention, optical disc or LTO tape WORM is still chosen.

10. Design pitfalls and how to avoid them

We have seen the mechanics and usage of WORM, so let me close with the landmines easy to step on in implementation.

Pitfall ①: using Compliance mode lightly

If you use Compliance mode in dev / staging / learning buckets, the bucket keeps bloating with undeletable data while billing climbs.

Mitigations:

For verification, always use Governance mode or short retention (1 day)
Before production, set an IAM policy / SCP (Service Control Policy) at the organization level that Denies PUT in Compliance mode except for specific IAM Roles
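
One possible shape for that guardrail, generated from Python so the approved Role list stays in one place. This is a sketch under assumptions: the Role ARN is a placeholder, deny_compliance_mode_scp is a hypothetical helper, and it only covers per-object locks via the s3:object-lock-mode condition key (bucket default retention would need its own handling). Test it in a sandbox OU before attaching it anywhere real.

```python
import json


def deny_compliance_mode_scp(allowed_role_arns: list) -> dict:
    """Build an SCP that denies COMPLIANCE-mode locks except for approved Roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyComplianceModeExceptApprovedRoles",
            "Effect": "Deny",
            "Action": ["s3:PutObject", "s3:PutObjectRetention"],
            "Resource": "*",
            "Condition": {
                # Only requests that ask for COMPLIANCE mode are affected.
                "StringEquals": {"s3:object-lock-mode": "COMPLIANCE"},
                # Approved Roles are exempt from the Deny.
                "ArnNotLike": {"aws:PrincipalArn": allowed_role_arns},
            },
        }],
    }


# Placeholder ARN for illustration.
policy = deny_compliance_mode_scp(["arn:aws:iam::*:role/records-archiver"])
print(json.dumps(policy, indent=2))
```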

Pitfall ②: combined with versioning, capacity explodes

Once you enable S3 Object Lock, Versioning is forced ON. If you design with frequent PUTs to the same key, all old versions stay under WORM and cannot be removed.

Mitigations:

Set Lifecycle NoncurrentVersionExpiration to delete old versions after retention
Make it an operational rule that immutable buckets are "PUT once, never overwrite" (use date-suffixed keys)
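
Both mitigations are small enough to sketch. The rule dict below follows the shape that boto3's put_bucket_lifecycle_configuration expects in its Rules list; the helper names, the rule ID, and the 30-day value are assumptions for this example, and no AWS call is made.

```python
from datetime import date


def noncurrent_expiration_rule(noncurrent_days: int, prefix: str = "") -> dict:
    """Lifecycle rule that expires noncurrent versions after N days.

    A version still under retention is not removed; the expiration only
    takes effect once the lock no longer protects it.
    """
    return {
        "ID": f"expire-noncurrent-after-{noncurrent_days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
    }


def dated_key(prefix: str, name: str, day: date) -> str:
    """Date-suffixed key so each write lands on a fresh key, never an overwrite."""
    return f"{prefix.rstrip('/')}/{day:%Y/%m/%d}/{name}"


lifecycle = {"Rules": [noncurrent_expiration_rule(30)]}
print(lifecycle)
print(dated_key("trades", "trades.csv", date(2026, 5, 17)))
```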

Pitfall ③: deleting the KMS encryption key

If you encrypted WORM objects with SSE-KMS and then delete the KMS Key, you land in "data remains, but cannot decrypt" hell.

Mitigations:

Operate KMS Keys for WORM buckets as never delete (even with PendingWindowInDays set to the longest 30 days, operational mistakes happen)
SCP at the organization level that Denies Key deletion
If using Multi-Region Keys, consistent management across both regions

Pitfall ④: Legal Hold release permission held by everyone

If retention is strict but s3:PutObjectLegalHold (release permission) is granted to all IAM Roles, Legal Hold is effectively meaningless.

Mitigations:

Grant s3:PutObjectLegalHold (the single action that both applies and releases a hold) only to a dedicated IAM Role for legal / compliance
Log every PutObjectLegalHold API call in CloudTrail and stream into SIEM

Pitfall ⑤: acting on old knowledge that "Object Lock can only be enabled at bucket creation"

For a long time, "Object Lock can only be turned ON at bucket creation" was a constraint. There may still be old project plans in your org that gave up with "cannot change anymore" and migrated to a new bucket.

But the November 20, 2023 update made it possible to enable Object Lock self-service on existing buckets from the console / API. You do not even need to contact AWS Support. To apply retention to existing objects, S3 Batch Operations can apply in bulk to billions of objects.

But two caveats:

S3 Versioning must be ON before enabling (Versioning is a prerequisite)
Once Object Lock is enabled, it cannot be disabled. Versioning can no longer be Suspended either. No going back, so do not enable "just to try"

Conclusion

WORM is a feature that looks simple on the surface as "data cannot be deleted", but to actually realize it requires the physical layer / OS layer / firmware layer / app layer / legal layer to all click together.

The takeaways:

WORM is an old concept that has existed since the 1980s. The implementation evolved from physical media to software to cloud
Modern cloud WORM is effectively referenced against S3 Object Lock. Azure / GCP provide almost equivalent features
Compliance mode = not deletable even by root, Governance mode = removable with privilege. Use Compliance for regulatory production, Governance for internal policy or accident prevention
WORM demand comes from three sources: regulation / ransomware defense / insider threat. Even non-regulated industries should WORM-ify backups
NetApp SnapLock's ComplianceClock is a historic device to defend against clock tampering. In the cloud it is transparently solved internally
Conflict with GDPR's right to delete is practically resolved by Crypto-Shredding (destroying the encryption key)
The classic three design pitfalls: "do not use Compliance lightly", "watch out for version explosion", "do not delete KMS keys"

Next time you design a backup system, always ask first: "Is this destination immutable?". That is the most basic line of defense against both ransomware and audit.

AWS S3 Deep Dive

kt — Mon, 18 May 2026 10:45:59 +0000

Introduction

S3 (Simple Storage Service) is one of the oldest (launched in 2006) and most-used services in AWS. "Near-infinite storage where you can put and get files over an HTTPS API" sounds simple, but in real operation it keeps showing up in incident news because of the complexity of its access control.

Every year you see headlines like "S3 bucket left public, customer PII leaked." That isn't because S3 is broken. It's because almost nobody fully understands all 4 kinds of access control.

This article dissects S3 in the following order.

The S3 object model: the true structure of buckets, objects, and keys
Following a single round-trip: what happens inside S3 between receiving and returning
The 4 layers of access control: IAM / Bucket Policy / ACL / Block Public Access
The evaluation flowchart: who actually gets access in the end
The 4 forms of encryption: SSE-S3 / SSE-KMS / SSE-C / CSE
Features you lose by not knowing: Versioning / MFA Delete / Object Lock / Access Point
A list of configurations you must never use

1. The S3 Object Model

S3 is not a filesystem. It is object storage. This is the first source of confusion.

Object storage is a mechanism where data + a unique ID (the key) + metadata are bundled into one unit (an object) and put and got from a flat namespace. There is no directory hierarchy, no file permission bits, no blocks. If you think of it not as a "file" but as "an item retrievable by ID," it starts to make sense.

The key points.

A Bucket is a "container" scoped to an account + Region. Its name is globally unique across all of AWS, so the name my-app-logs exists in exactly one place in the world.
An Object is a single piece of data inside a bucket. It is identified by a Key like 2026/05/14/app.log.
The key "looks like" a file path, but it is just a string. The / is just a character. Directories do not exist.
Each object consists of data + metadata + (optional) tags.
A bucket is tied to a Region, but only the bucket name namespace is global.

"Folders" Are an Illusion

When you click 2026/05/ in the S3 console and "see its contents," that is just a prefix search on keys, not a folder. To fetch the contents of 2026/05/14/app.log, you do not need to first fetch a parent object called 2026/ or 2026/05/.

If you do not know this, you end up doing things like uploading a 0-byte object named 2026/06/ because "I want to create an empty folder in S3," which is a console-driven misconception.
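
The console's "folder" view can be reproduced with plain string operations, which is a good way to convince yourself that nothing else is going on. The sketch below imitates what a listing with a prefix and a delimiter returns (objects at this level, plus common prefixes); list_level is a hypothetical name and the keys are sample data.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) for one 'folder level' of a flat keyspace."""
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter: roll it up into a "folder".
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = [
    "2026/05/14/app.log",
    "2026/05/15/app.log",
    "2026/06/01/app.log",
    "README.txt",
]
print(list_level(keys))                 # (['README.txt'], ['2026/'])
print(list_level(keys, "2026/"))        # ([], ['2026/05/', '2026/06/'])
print(list_level(keys, "2026/05/14/"))  # (['2026/05/14/app.log'], [])
```

The "folders" exist only in the second return value, computed at query time from the keys.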

Memorize the ARN Format

You always need this when pointing to S3 (or any AWS resource) from an IAM policy. ARN stands for Amazon Resource Name and is the scheme AWS uses to refer to any resource by a globally unique string. The base form is this.

arn:partition:service:region:account-id:resource
       │        │       │        │          │
       │        │       │        │          └─ Resource ID (bucket name or object key)
       │        │       │        └────────── 12-digit AWS account number
       │        │       └─────────────────── Region (e.g. ap-northeast-1)
       │        └─────────────────────────── Service name (e.g. s3, ec2, iam)
       └──────────────────────────────────── Partition (usually aws, China is aws-cn)

S3 takes the special form of leaving region and account-id blank (which is why you see three consecutive colons :::). Bucket names are globally unique, so the name alone is enough to identify the bucket.

Target	ARN
The whole bucket	`arn:aws:s3:::my-app-logs`
All objects in the bucket	`arn:aws:s3:::my-app-logs/*`
A specific object	`arn:aws:s3:::my-app-logs/2026/05/14/app.log`
Everything under a specific prefix	`arn:aws:s3:::my-app-logs/users/alice/*`

The key point is that a bucket and its objects have different ARNs. s3:ListBucket applies to the bucket ARN and s3:GetObject applies to the object ARN, so you need to write both to get listing and reading working together.
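
A small sketch of that pairing: two ARN helpers and a read-only policy that puts each action on the right resource. The helper names are made up for this example; the s3:prefix condition is one common way to scope ListBucket to a prefix.

```python
import json


def bucket_arn(bucket: str, partition: str = "aws") -> str:
    return f"arn:{partition}:s3:::{bucket}"


def object_arn(bucket: str, key_pattern: str = "*", partition: str = "aws") -> str:
    return f"{bucket_arn(bucket, partition)}/{key_pattern}"


def read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Listing targets the bucket ARN, reading targets the object ARN."""
    list_stmt = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": bucket_arn(bucket),
    }
    if prefix:
        # Limit listing to keys under the prefix.
        list_stmt["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}
    get_stmt = {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": object_arn(bucket, f"{prefix}*"),
    }
    return {"Version": "2012-10-17", "Statement": [list_stmt, get_stmt]}


print(json.dumps(read_only_policy("my-app-logs", "users/alice/"), indent=2))
```

Putting s3:GetObject on the bucket ARN (or s3:ListBucket on the /* ARN) is a classic mistake: the policy validates, but the statement never matches.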

2. Following a Single Round-Trip

When a client runs aws s3 cp ./image.png s3://my-app-uploads/users/alice/avatar.png, what happens inside S3?

What you need to recognize here is that the inputs S3 uses to decide authorization are all 4 of IAM / Bucket Policy / ACL / Block Public Access. Looking at just one of them does not tell you the real permissions.

3. The 4 Layers of Access Control

This is the single biggest source of confusion in S3. Access control is decided by 4 layers simultaneously.

Layer ①: Block Public Access (BPA)

The newest, strongest safety net, and the first thing evaluated.

Since April 2023, AWS has enabled it by default on new buckets and also disabled ACL by default. BPA consists of 4 switches.

BPA setting	What it does
`BlockPublicAcls`	Prevent newly attaching a public ACL
`IgnorePublicAcls`	Ignore all public ACLs (including existing ones)
`BlockPublicPolicy`	Prevent attaching a public bucket policy
`RestrictPublicBuckets`	For buckets with a public policy, allow only AWS service principals and authorized users in the bucket owner's account

All 4 ON is the rule. As long as you do this, "I accidentally made it public" incidents are almost entirely prevented.

You can also enforce BPA at the Organizations level (Organization-level BPA). Once set, individual accounts can no longer disable BPA themselves. If you use AWS at a company, turn this on without exception.
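
"All 4 ON" is easy to check mechanically. The sketch below takes a dict shaped like the PublicAccessBlockConfiguration that the S3 API returns and reports which switches are not enabled; missing_bpa_switches is a hypothetical helper and the sample config is made up.

```python
BPA_SWITCHES = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_bpa_switches(config: dict) -> list:
    """Return the BPA switches that are absent or not set to True."""
    return [name for name in BPA_SWITCHES if config.get(name) is not True]


# Hypothetical bucket config with one switch left off.
config = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": True,
}
print(missing_bpa_switches(config))  # ['BlockPublicPolicy']
```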

Layer ②: IAM Identity Policy

The IAM policy type that is attached to the operating side (user / role). It states "what this principal can do."

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-app-uploads/users/${aws:username}/*"
    }
  ]
}

This is an IAM policy that says "you can only read the subfolder named after your own username." Dynamic variables like ${aws:username} are useful.
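
A rough way to picture how the variable works: substitute the caller's name first, then do wildcard matching on the ARN. The sketch below is my own simplification using Python's fnmatch, not how IAM is implemented, and it assumes the caller is an IAM User (as far as I know, ${aws:username} is not populated for role sessions).

```python
from fnmatch import fnmatchcase

RESOURCE_TEMPLATE = "arn:aws:s3:::my-app-uploads/users/${aws:username}/*"


def resource_matches(template: str, username: str, requested_arn: str) -> bool:
    """Substitute ${aws:username}, then apply wildcard matching to the ARN."""
    pattern = template.replace("${aws:username}", username)
    # Simplification: fnmatch also treats [ ] specially, which IAM does not.
    return fnmatchcase(requested_arn, pattern)


if __name__ == "__main__":
    own = "arn:aws:s3:::my-app-uploads/users/alice/avatar.png"
    other = "arn:aws:s3:::my-app-uploads/users/bob/avatar.png"
    print(resource_matches(RESOURCE_TEMPLATE, "alice", own))
    print(resource_matches(RESOURCE_TEMPLATE, "alice", other))
```

One policy document attached to a group therefore gives every user their own folder, without writing a policy per user.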

Layer ③: Bucket Policy

A Resource Policy attached to the bucket side. It states "who can access this bucket."

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-app-uploads",
        "arn:aws:s3:::my-app-uploads/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    },
    {
      "Sid": "AllowFromMyOrg",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-app-uploads/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "o-abc1234567"
        }
      }
    }
  ]
}

Key points.

Denying HTTP (non-TLS) access is a standard template.
Adding aws:PrincipalOrgID to Condition lets you ensure only people/Roles in your own AWS Organizations can access. This is a structural defense that prevents public-exposure incidents.
Cross-account sharing is done with the bucket policy. IAM alone will not allow cross-account.

Layer ④: ACL (Access Control List)

The oldest access control in S3. Today, you do not need to use it.

Since April 2023, ACL is disabled by default on new buckets (Object Ownership = "Bucket owner enforced").
Some legacy integrations (such as CloudFront logs) still require it, but even then keep it minimal.
For existing buckets too, prefer disabling ACL when possible.

A lot of "S3 public-exposure incidents" were caused by giving Read to the ACL's AllUsers group (= the entire internet). BPA + disabled ACL is the mechanism that closes that old wound.

4. Who Actually Gets Access: The Evaluation Flowchart

Given the 4 layers, here is the full flow of whether S3 lets a request through.

The principles to internalize.

BPA wins everything: if it judges the request as public, it is over.
An explicit Deny wherever it appears wins: if there is a Deny in IAM, Bucket Policy, or SCP, you are done.
Same account: Allow from either IAM or Bucket Policy is enough (OR).
Cross-account: Allow is required in both IAM and Bucket Policy (AND).

Cross-account sharing often gets stuck on point 4. When someone says "I can't access our bucket from a Role in another account," the first thing to check is whether Allow is written on both sides.
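
The four principles fit in a few lines of Python. This is a deliberately simplified sketch: the Request fields are my own names, each one stands for "some statement in that layer matched", and real evaluation also involves ACLs, Permissions Boundaries, session policies, and more.

```python
from dataclasses import dataclass


@dataclass
class Request:
    same_account: bool          # caller and bucket owner are in the same account
    blocked_by_bpa: bool        # BPA classifies the request as public access
    explicit_deny: bool         # a Deny in IAM, Bucket Policy, or SCP matches
    iam_allows: bool            # an identity policy Allow matches
    bucket_policy_allows: bool  # a bucket policy Allow matches


def is_allowed(req: Request) -> bool:
    # 1. BPA wins everything.
    if req.blocked_by_bpa:
        return False
    # 2. An explicit Deny anywhere wins.
    if req.explicit_deny:
        return False
    # 3. Same account: either side may grant access (OR).
    if req.same_account:
        return req.iam_allows or req.bucket_policy_allows
    # 4. Cross-account: both sides must grant access (AND).
    return req.iam_allows and req.bucket_policy_allows
```

The cross-account case is the one that surprises people: an Allow on only one side is enough inside an account and useless across accounts.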

5. The 4 Forms of Encryption

As of 2026, all objects in S3 are automatically encrypted by default. That said, there are 4 variations on whose key does the encryption.

Their characteristics.

Type	Key management	Key-use log	Recommendation	When to use
SSE-S3	Fully managed by S3	None	Standard	You want forced encryption but want to avoid KMS cost and management
SSE-KMS	AWS KMS (Customer Managed Key)	Recorded in CloudTrail	★★★	Required when you need auditing, separation of duties, or compliance
SSE-C	Client sends a key on each request	None	★	Only for special cases where you cannot hand the key to AWS. Disabled by default on new buckets from April 2026
CSE	Fully managed by the client	None	★	For ultra-high requirements where you cannot let AWS see plaintext at all

How to Choose in Practice

If you do not want to think about it, pick SSE-KMS with a Customer Managed Key (CMK). Key-use logs go into CloudTrail, KMS key policy lets you enforce "only this role can decrypt," and you can replicate keys to other Regions. The only downside is that the KMS API call charges slowly add up.

Enforce Default Encryption on the Bucket

When you want to enforce encryption on new objects, set the bucket-level Default encryption.

{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:ap-northeast-1:123456789012:key/abc-..."
      },
      "BucketKeyEnabled": true
    }
  ]
}

BucketKeyEnabled: true dramatically reduces the number of KMS API calls and saves cost, so turn it on whenever you use KMS.

Bucket Policy That Rejects "Unencrypted PUT"

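Adding this as a guardrail rejects any upload that does not explicitly request SSE-KMS, so a misused SDK cannot silently write objects with a different encryption setting. Note that a PUT which omits the header and relies on default encryption is rejected too.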

{
  "Sid": "DenyUnEncryptedObjectUploads",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::my-app-uploads/*",
  "Condition": {
    "StringNotEquals": {
      "s3:x-amz-server-side-encryption": "aws:kms"
    }
  }
}
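
One detail of this statement is easy to miss: as I read the IAM condition rules, a negated operator like StringNotEquals evaluates to true when the key is missing from the request. The sketch below models just that rule; the function names are mine and nothing here calls AWS.

```python
from typing import Optional

REQUIRED = "aws:kms"
HEADER = "x-amz-server-side-encryption"


def string_not_equals(request_value: Optional[str], policy_value: str) -> bool:
    """Negated string operators count an absent key as 'not equal'."""
    if request_value is None:
        return True
    return request_value != policy_value


def put_is_denied(headers: dict) -> bool:
    """True when the Deny statement above would match this PUT."""
    return string_not_equals(headers.get(HEADER), REQUIRED)


if __name__ == "__main__":
    print(put_is_denied({}))                   # header missing
    print(put_is_denied({HEADER: "AES256"}))   # SSE-S3 requested
    print(put_is_denied({HEADER: "aws:kms"}))  # SSE-KMS requested
```

So clients have to send the header explicitly once this policy is in place, even when the bucket already has KMS default encryption configured.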

6. Features You Lose by Not Knowing

Features that elevate S3 beyond "just an object store."

Versioning and MFA Delete

Turning Versioning on keeps older versions even when you overwrite the same key.
Deletion just stacks a Delete Marker, leaving the actual data intact.
Adding MFA Delete on top requires the Root account's MFA for permanent deletion. Even if ransomware overwrites or encrypts your objects, you can roll back to an older version.
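
If the version stack is hard to picture, this toy in-memory model may help. It is not the S3 API: VersionedBucket and its methods are hypothetical names, and real S3 identifies versions by version ID rather than by stack position.

```python
class VersionedBucket:
    """Toy model: every key maps to a stack of versions, newest last."""

    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}

    def put(self, key, body):
        # An overwrite never replaces data; it stacks a new version.
        self._versions.setdefault(key, []).append(body)

    def delete(self, key):
        # A plain delete only stacks a marker; older versions stay intact.
        self._versions.setdefault(key, []).append(self.DELETE_MARKER)

    def get(self, key):
        stack = self._versions.get(key, [])
        if not stack or stack[-1] is self.DELETE_MARKER:
            raise KeyError(key)
        return stack[-1]

    def remove_latest(self, key):
        """Drop the newest version or marker so the one beneath becomes current."""
        self._versions[key].pop()
```

Rolling back after a bad overwrite or an accidental delete is remove_latest in this model. Permanently removing a specific version is the operation MFA Delete protects.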

Object Lock

A WORM (Write Once Read Many) mode that disallows deletion and overwriting for a fixed period after a write.

Governance mode: privileged users can still delete.
Compliance mode: nobody (not even Root) can delete until expiry.

Use it for data that must not be deleted, such as legal records (SEC, FINRA, HIPAA).

Access Point

A feature that "grows multiple purpose-specific entry points on a single bucket." For one bucket company-data you can create Access Points like marketing-readonly (only the marketing IAM can Get), finance-readwrite (the finance IAM can Get/Put), public-cdn (only via CloudFront), each with its own independent Access Point Policy.

Before your Bucket Policy bloats into something unreadable, splitting it into Access Points keeps things organized.

Pre-signed URL

A feature that issues a short-lived signed URL. Handy when you want to let an unauthenticated user temporarily download or upload.

Example: "let a customer download an invoice PDF for just 1 hour" → issue a Pre-signed URL and email it.
The bucket stays completely closed via BPA. Per-request access is allowed by the URL's expiration and signature.

7. List of Configurations You Must Never Use

Given everything above, here are anti-patterns and best practices.

❌ Do not do

Turn BPA OFF
Grant Read to AllUsers / AuthenticatedUsers via ACL
Use Principal: * in Bucket Policy without any Condition
Store important data without Versioning
Allow HTTP (non-TLS) access
Embed long-lived IAM User credentials in an application
Casually grab a globally unique bucket name and operate it with public settings
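
The "Principal: * without any Condition" item is easy to check mechanically. Here is a small lint sketch over a bucket policy document; risky_statements is a name I made up, and it only catches the most obvious shape, so treat it as a starting point rather than a replacement for IAM Access Analyzer.

```python
def _is_wildcard(principal) -> bool:
    """True for "*" or {"AWS": "*"} (string or list form)."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        values = aws if isinstance(aws, list) else [aws]
        return "*" in values
    return False


def risky_statements(policy: dict) -> list:
    """Return the Sid (or index) of Allow statements open to anyone with no Condition."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be written without a list
        statements = [statements]
    flagged = []
    for index, stmt in enumerate(statements):
        if (
            stmt.get("Effect") == "Allow"
            and _is_wildcard(stmt.get("Principal"))
            and not stmt.get("Condition")
        ):
            flagged.append(stmt.get("Sid", index))
    return flagged
```

Deny statements with Principal "*" are fine and are not flagged; the TLS guardrail from Layer ③ is exactly that shape.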

✅ Do

All 4 BPA switches ON / enforced at the organization level
Disable ACL (Object Ownership = Bucket owner enforced)
Enforce aws:SecureTransport in Bucket Policy
Fence in with aws:PrincipalOrgID in Bucket Policy
Versioning + MFA Delete
SSE-KMS + Bucket Key for both auditability and low cost
Use Pre-signed URLs as a substitute for temporary public exposure
For distribution, completely privatize the bucket with CloudFront + OAC

There is one operational pattern worth emphasizing especially.

"I want to expose this to the internet" does not mean "make the bucket public."

If you want to publish a static site or image distribution, keep the bucket fully private and serve it through a CDN with CloudFront + OAC (Origin Access Control). This achieves "direct bucket access is blocked, only the CDN can fetch from it." There is zero need to disable BPA.

Conclusion

S3 is an object store of Bucket + Object + Key. Folders are an illusion, / is just a character.
Bucket names are globally unique. The ARN of the bucket and of its objects are different.
Access control has 4 layers: BPA / IAM / Bucket Policy / ACL.
Evaluation principles: BPA wins everything, explicit Deny wins, same account is OR, cross-account is AND.
Encryption comes in 4 flavors: SSE-S3 / SSE-KMS / SSE-C / CSE. Today, SSE-KMS + Bucket Key is the standard.
All 4 BPA switches ON, ACL disabled, HTTPS enforced, scoped by aws:PrincipalOrgID: public-exposure incidents drop dramatically.
For public distribution use CloudFront + OAC. For temporary sharing use Pre-signed URLs.

AWS IAM Deep Dive

kt — Sun, 17 May 2026 07:10:19 +0000

Introduction

Every action on AWS goes through an HTTPS API, and IAM (Identity and Access Management) sits in front of every single one of them.

Once you actually run things on AWS, you notice IAM is where you get stuck. "It says AccessDenied." "The policy says Allow, why is it still rejected?" "I assumed the role but my credentials are still the old ones." The patterns are predictable.

This article takes IAM apart in the order auth happens: authentication, then authorization, then operations.

What IAM actually solves: authentication vs authorization
Principals: who is making the call
SigV4: how AWS verifies the caller is real
The six policy types and what each one does
The shape of a policy JSON
Policy evaluation: why Deny beats Allow
IAM Identity Center and short-lived credentials
The do / don't list

1. What IAM Solves: Authentication and Authorization

Start by separating authentication (AuthN) from authorization (AuthZ).

Authentication (AuthN): pins down "who are you". On AWS, the caller is whoever owns the credentials that produced this SigV4 signature.
Authorization (AuthZ): decides "can they do this". AWS answers it by combining multiple policies and evaluating them together.

The rest of the article follows that same order: AuthN first, then AuthZ.

2. Principals: Who Is Calling AWS

In AWS, the entity making a call is called a Principal. Anything that can execute against a resource.

Type	Credential	Authenticated by	When to use
Root User	Account email + password	AWS itself	Almost never. Billing changes and a few special tasks only.
IAM User	Access key (long-lived) or password	IAM	One per human or system. Long-lived keys are a leak risk.
IAM Role	Short-lived credentials issued by STS	AssumeRole	The default. Roles over Users in modern setups.
Federated Identity	External IdP token, swapped for AWS creds via STS	Identity Center / SAML / OIDC	The modern way humans log in.

An IAM Group is not a principal. It is just a container for attaching policies to a set of users. The group itself never authenticates and you cannot put it in the Principal field of a policy.

Don't Use the Root User

Root has god-mode on the whole account. Only Root can change billing or close the account, and that is exactly why everything else should not be done as Root.

The AWS-recommended setup:

After creating the account, set up MFA on Root immediately (basically required now).
Never create an access key for Root.
For day-to-day work, log in through IAM Identity Center or assume an IAM Role.
Lock the Root credentials in a safe.

For MFA, TOTP (the 6-digit code from Google Authenticator, 1Password, etc.) works, but AWS recommends FIDO2 / WebAuthn passkeys or hardware keys like a YubiKey, because they resist phishing. AWS also recommends registering at least two MFA methods on Root so you don't get locked out if one is lost.

Use IAM Roles Instead of IAM Users

The access keys on an IAM User (the ones that start with AKIA...) are long-lived. Once leaked, they stay valid until you rotate them. The stream of "committed an access key to GitHub, got abused within hours" stories isn't slowing down.

IAM Roles solve this by handing out short-lived credentials via AssumeRole that expire after 1 hour by default (12 hours max).

Three pieces to know before reading the diagram:

STS (Security Token Service): the AWS service that issues temporary credentials. sts:AssumeRole and friends call into this.
Trust Policy: a policy attached to a Role that says "who is allowed to assume this role" (e.g. "only this specific IAM User", "only this GitHub Actions repo").
Temporary credentials: a triple of AccessKeyId, SecretAccessKey, and SessionToken, with an expiry.

3. SigV4: How AWS Verifies a Request Is Real

Once we know who the principal is, the next question is whether they actually signed the request. That is what SigV4 (Signature Version 4) does.

A typical REST API sends something like Authorization: Bearer <token> on every call. AWS does not. It never sends the Secret Key on the wire. It sends a signature computed from the Secret Key, every time.

The Big Picture

What the 4 Steps Actually Do

Three things to internalize:

The Secret Key never leaves your machine. Only an HMAC derived from it goes on the wire.
Clock skew will kill you. The timestamp is baked into the signature scope. If your machine's clock drifts past the tolerated skew, requests fail with a signature or time-skew error (S3 reports RequestTimeTooSkewed, for example). AWS tolerates roughly 15 minutes of skew.
Signatures have a short validity window. To shrink the replay window, API calls only accept signatures from the last 15 minutes.

In practice the SDK and CLI do all of this for you, so you almost never compute it by hand. But when you see "calling another region from inside Lambda fails with a signature error" or "every S3 call from this Docker container returns 403", the answer is almost always clock sync.
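
For the curious, the key-derivation part of SigV4 is short enough to sketch with the standard library. This covers only the chained HMAC and the final signature; building the canonical request and the string to sign is omitted, and within_skew is my own helper with an assumed 15-minute tolerance, not an AWS function.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def _hmac(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Chain of HMACs that scopes the key to one day, one region, and one service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)  # e.g. "20260514"
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Hex signature that goes into the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


def within_skew(request_time: datetime, server_time: datetime,
                tolerance: timedelta = timedelta(minutes=15)) -> bool:
    """Rough model of the server-side timestamp check."""
    return abs(server_time - request_time) <= tolerance
```

Because the date is the first input to the chain, a client with a wrong clock derives a different key or lands outside the accepted window, which is why clock problems surface as signature errors.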

4. The Six Policy Types and Who Does What

Now for authorization. The reason AWS authorization feels complicated is that six different policy types are evaluated together.

Policy	Attached to	Role	How often used
Identity Policy	User / Group / Role	"What can this principal do?"	★★★
Resource Policy	S3 bucket, KMS Key, SNS, etc.	"Who can touch this resource?"	★★★
SCP (Service Control Policy)	OU / Account (Organizations)	Per-account ceiling (a guardrail)	★★
Permissions Boundary	User / Role	Per-principal ceiling	★
Session Policy	Args to AssumeRole / GetFederationToken	Extra narrowing that only applies in a session	★
RCP (Resource Control Policy)	Organizations	Org-wide guardrail on resources	Newer

The trick is to keep "ceiling" and "permission" separate in your head. Identity Policy is the "what you may do" list. SCP and Permissions Boundary are the "the most you will ever be allowed to do" list. No Allow anywhere means no access; a Deny in the ceiling kills it for sure.

5. The Shape of a Policy JSON

Every policy uses the same JSON structure, with five elements.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadOnlyMyBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": ["203.0.113.0/24"]
        },
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    }
  ]
}

What to focus on:

Effect: only Allow or Deny. Default is Deny (if nothing says Allow, you get denied).
Action: s3:GetObject style, in <service prefix>:<API name> form. Wildcards (s3:*, s3:Get*) work.
Resource: the target, expressed as an ARN (Amazon Resource Name), e.g. arn:aws:s3:::my-bucket/*.
Principal: "who". Not needed in an Identity Policy because the attachment target is implied. Required in a Resource Policy.
Condition: extra constraints like IP, MFA, tags, time, VPC endpoint, and so on.
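
Wildcards in Action are plain glob matching. A minimal sketch, assuming fnmatch is a close enough stand-in (IAM itself only knows * and ?, and action names are case-insensitive); the function names are mine.

```python
from fnmatch import fnmatchcase


def action_matches(pattern: str, action: str) -> bool:
    """Action names are case-insensitive; * matches any run of characters."""
    return fnmatchcase(action.lower(), pattern.lower())


def statement_covers(statement: dict, action: str) -> bool:
    """True when any Action pattern in the statement covers the requested action."""
    patterns = statement["Action"]
    if isinstance(patterns, str):  # Action may be a single string or a list
        patterns = [patterns]
    return any(action_matches(p, action) for p in patterns)


if __name__ == "__main__":
    stmt = {"Effect": "Allow", "Action": ["s3:Get*", "s3:ListBucket"]}
    for name in ("s3:GetObject", "s3:ListBucket", "s3:PutObject"):
        print(name, statement_covers(stmt, name))
```

This is also why s3:* is risky: it covers delete and policy-editing actions as easily as reads.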

Condition Is Where Real IAM Lives

Once you start writing IAM seriously, you live inside Condition. The ones I reach for most:

Key	What it does	Example
`aws:SourceIp`	Source IP	Allow only from the office IP block
`aws:MultiFactorAuthPresent`	Did the caller use MFA?	Require MFA
`aws:PrincipalTag/Team`	Tag on the principal	If Team = ml, allow only ml-bucket
`s3:prefix`	Prefix being listed in S3	A user can only `ls` their own folder
`aws:RequestedRegion`	Target region	Deny anything outside ap-northeast-1
`aws:SecureTransport`	HTTPS?	Block plaintext HTTP access to S3
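
To show how a Condition block combines, here is a sketch that evaluates just two operators, IpAddress and Bool, against a request context. The context dict and function names are my own invention; real IAM has many more operators and subtler rules for missing keys and multi-valued keys.

```python
import ipaddress


def ip_in_any(source_ip: str, cidrs) -> bool:
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def condition_holds(condition: dict, context: dict) -> bool:
    """Every operator block and every key inside it must hold (logical AND)."""
    for operator, keys in condition.items():
        for key, expected in keys.items():
            actual = context.get(key)
            if actual is None:
                return False  # simplification: a missing key fails these operators
            if operator == "IpAddress":
                cidrs = expected if isinstance(expected, list) else [expected]
                if not ip_in_any(actual, cidrs):
                    return False
            elif operator == "Bool":
                if str(actual).lower() != str(expected).lower():
                    return False
            else:
                raise NotImplementedError(operator)
    return True
```

Fed the Condition from the policy in section 5, a call from 203.0.113.7 with MFA passes, and the same call without MFA or from another network does not.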

6. Policy Evaluation: Deny Beats Allow

AWS evaluates the six policy types in this order to reach a final decision.

(Note: this is the simplified, same-account version. Cross-account access has extra rules, e.g. the Resource Policy Allow is required, but the picture above is enough to grasp the principle.)

Three Iron Rules

Default is Deny. If nothing says Allow, the answer is no.
An explicit Deny overrides an explicit Allow. A single Deny anywhere ends the discussion.
You need at least one Allow. No Allow = denied.

Without these three, you cannot debug "the policy says Allow but I still get AccessDenied".
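
The three rules, plus the "ceiling" idea from section 4, can be sketched as a tiny evaluator. I model each policy as a list of (effect, action pattern) pairs and ignore Resource and Condition entirely, so this is the same-account simplification and nothing more; decide is a hypothetical name.

```python
from fnmatch import fnmatchcase


def _effects(policy, action):
    """Yield the Effect of every statement whose pattern covers the action."""
    for effect, pattern in policy:
        if fnmatchcase(action.lower(), pattern.lower()):
            yield effect


def decide(action, identity, ceilings=()):
    """identity: the principal's policy. ceilings: SCP, boundary, session policy."""
    all_policies = [identity, *ceilings]
    # An explicit Deny anywhere ends the discussion.
    if any("Deny" in _effects(p, action) for p in all_policies):
        return "Deny"
    # Otherwise every layer that is present must contribute an Allow.
    if all("Allow" in _effects(p, action) for p in all_policies):
        return "Allow"
    # Default is Deny.
    return "Deny"
```

Note that a ceiling never grants anything by itself: it can only agree with or cut down what the identity policy allows.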

Common "Why Am I Denied?" Patterns

Symptom	Cause
Identity Policy says Allow, still Deny	An SCP is blocking that service at the org level
Can't access your own bucket	The bucket policy names a different Principal
Lost permissions after AssumeRole	A Session Policy was passed as an argument
Only some actions get Deny'd	A Permissions Boundary is narrowing via wildcards
Denied in only one region	An SCP or IAM Policy has a Region condition

Why SCPs Are Powerful

An SCP (Service Control Policy) is an AWS Organizations feature that caps what an OU or account is allowed to do. For example, attach the following SCP to a Sandbox OU and the accounts under it cannot launch EC2 instances outside Tokyo.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideTokyo",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": "ap-northeast-1"
        }
      }
    }
  ]
}

Even if an IAM policy inside the account allows RunInstances in every region, the SCP Deny wins. That is what a guardrail buys you.

7. IAM Identity Center and Short-Lived Credentials

The era of humans logging in as IAM Users is over. The standard now is IAM Identity Center (formerly AWS SSO).

The points:

The user logs in once via the company IdP (Google / Okta / Entra), and AssumeRole gives them short-lived credentials into every AWS account under it.
The contents of the role are defined as a Permission Set. Build a few of them ("ReadOnly is this, dev is that") and assign per user × account.
No access keys are ever issued, so leak risk goes away structurally.
aws sso login covers the CLI side, dropping a temp token in ~/.aws/sso/cache.

For EC2 / Lambda / GitHub Actions

Non-human callers (apps, CI) should also not use long-lived keys. The current playbook:

Runtime	Authentication
EC2	Instance Profile (attach a Role directly)
Lambda	Execution Role
ECS / EKS	Task Role / IRSA / Pod Identity
GitHub Actions	OIDC, then AssumeRoleWithWebIdentity into an AWS Role
GitLab CI	Same as above
External Kubernetes	OIDC federation in the style of IRSA (AssumeRoleWithWebIdentity), or IAM Roles Anywhere

GitHub Actions OIDC is the strongest of the bunch: you can scope AssumeRole by repository or branch. Long-lived keys in GitHub Secrets become entirely unnecessary.
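
Scoping by repository and branch happens in the Role's trust policy. Below is a sketch that generates one; the function name is mine, the account ID and repo are placeholders, and it assumes the GitHub OIDC provider is already registered in the account with sts.amazonaws.com as the audience.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Trust policy letting one repo/branch assume the role via GitHub's OIDC token."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{provider}:aud": "sts.amazonaws.com",
                    # The sub claim carries the repository and the git ref.
                    f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


if __name__ == "__main__":
    print(json.dumps(github_oidc_trust_policy("123456789012", "my-org/my-repo"), indent=2))
```

Leaving out the sub condition would let any repository on GitHub assume the role, so that line is the one to review first.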

8. The Do / Don't List

Tying the theory back to practice.

❌ Don't

Create an access key for the Root user
Do day-to-day work as Root
Commit IAM User long-lived keys to GitHub
Hand out wildcards like s3:*
Use Principal: * in a Resource Policy without a Condition
Mix production and staging in the same account
Skip MFA entirely

✅ Do

Put MFA on Root and lock the credentials away
Move humans onto SSO via IAM Identity Center
Have GitHub Actions assume a Role via OIDC
Cap developer permissions with a Permissions Boundary
Set an org-wide guardrail with an SCP
Apply least privilege and audit with IAM Access Analyzer
Turn CloudTrail on in every account

Three of these are worth calling out:

Treat long-lived access keys as something you will eliminate in the next few years. Roles + STS + Identity Center can cover the same ground.
Be suspicious of *. Action *, Resource *, Principal *, no Condition: that combination is how incidents happen.
CloudTrail and IAM Access Analyzer must be on. If you can't reconstruct "who did what" after the fact, an incident becomes unsolvable.

Conclusion

IAM handles both authentication (who?) and authorization (allowed to?).
Principals come in four kinds: Root, User, Role, Federated. Groups are containers, not principals.
AuthN uses SigV4: the Secret Key never leaves you, clock drift is fatal, signatures live for 15 minutes.
There are six policy types (Identity / Resource / SCP / Permissions Boundary / Session / RCP) and they are all combined.
Evaluation rules: default Deny, explicit Deny beats explicit Allow, you need at least one Allow.
Modern best practice is Identity Center + Permission Sets + OIDC-based role assumption.
Long-lived keys are game over once leaked. Replace them with Role + STS.

AWS Free Hands-On

kt — Fri, 15 May 2026 14:38:54 +0000

Introduction

This is for the "I want to try AWS but I'm scared of the bill" and "I made an account once and never did anything with it" crowd. By the end of the next 30 minutes you'll have something running, and you'll have actually touched the parts of AWS that matter.

The finished pipeline:

What you touch by building it:

S3: the cloud object store
Lambda: a serverless runtime where you just put code and it runs
DynamoDB: managed NoSQL database
IAM: permission management for everything (the hidden main character of this article)
CloudWatch: logs, metrics, alarms
AWS Budgets: monthly cost alerts
AWS CLI: the command-line interface for talking to AWS from your machine

And the core stays inside the Always Free tier. The S3 usage in this hands-on is tiny (a few dozen bytes, a handful of requests), so even if S3 isn't strictly "Always Free" for your account type, the credit consumption is effectively $0. No surprise charges 6 months from now either.

The plan:

The 2026 AWS free tier, what it actually is (misread this and you get charged)
Account creation and four safety setups for the first 10 minutes
Install AWS CLI locally
Create an S3 bucket and put something in it
Create a DynamoDB table
Prepare an IAM Role for Lambda
Write a Lambda function and say hello world
Wire S3 to invoke Lambda automatically
Watch logs and metrics in CloudWatch
Tear it all down (this matters the most)

The finished version is on GitHub

The article walks through every command by hand. The same setup is also on GitHub as a repo where scripts/setup.sh brings everything up in one shot, and scripts/cleanup.sh tears it down in one shot.

0-draft/awscape

Escape AWS bills. A 30-minute hands-on that builds S3 → Lambda → DynamoDB inside the AWS Always Free tier.

What's in it:

lambda-src/handler.py: the same Lambda code you'll write in Step 7
policies/trust-policy.json: the trust policy from Step 6
scripts/setup.sh: runs Step 4 through Step 9 end-to-end (after you edit config.env)
scripts/cleanup.sh: idempotent teardown for Step 10 (safe to re-run if something errors midway)
README is English by default, Japanese version is included

Read it through and type each command, or clone and run the scripts and then read the code. Either way works.

1. The 2026 AWS free tier, what it actually is

Get this part wrong and you're going to be unhappy later. Read it carefully.

The AWS free tier was overhauled on 2025-07-15. Accounts created on or after that date follow the new rules. The word "free" now covers three different things.

Breakdown.

A. Free Plan (new accounts, on/after 2025-07-15)

$100 in credits at signup, plus up to another $100 through 5 specific activities (launch EC2, configure RDS, build a Lambda, prompt Bedrock, set up Budgets) at $20 each. Maximum total: $200
Expires at 6 months or when credits are gone, whichever comes first
When it expires the account is automatically closed. You then have 90 days to upgrade to Paid Plan to keep the account and data; otherwise data is permanently deleted
Services that "would burn through the credits in a heartbeat" (Savings Plans, Reserved Instances, certain AWS Marketplace offers, etc.) aren't available on the Free Plan. The core services this hands-on uses (Lambda / DynamoDB / S3 / CloudWatch) work fine

B. Paid Plan (the other option on/after 2025-07-15)

Same credits as the Free Plan: up to $200
No expiry, full access to all services, pay-as-you-go after credits are gone
For production workloads or anyone who wants to keep learning past 6 months

C. Legacy Free Tier (accounts created before 2025-07-15)

"12 months free" applies for 12 months from account creation: EC2 t2.micro or t3.micro (region-dependent) 750h/mo, S3 5 GB + 20,000 GET + 2,000 PUT, RDS 750h, etc.
After 12 months only Always Free remains (the account stays open)

D. Always Free (every plan, no expiry)

No clock, resets monthly, never goes away
This is what this hands-on relies on

The Always Free monthly quotas worth knowing about in 2026 (per each service's official pricing page).

Service	Always Free per month
AWS Lambda	1M requests + 400,000 GB-seconds of compute
DynamoDB	25 GB storage + 25 RCU + 25 WCU (provisioned, Standard table class only, on-demand is not covered)
CloudWatch metrics + alarms	10 custom metrics, 10 alarms (Standard Resolution), 1M API requests
CloudWatch Logs	Ingestion + archive + Logs Insights scan share a single 5 GB quota (not 5 GB each)
SNS	1M publishes (commonly cited figure; not on the official pricing page directly)
SQS	1M requests

A note on S3. The "5 GB + 20,000 GET + 2,000 PUT" figure is documented as the Legacy 12-month free tier (for accounts created before 2025-07-15). For new Free Plan accounts, S3 usage draws from the $200 credit pool. AWS says "over 30 always-free services" but doesn't explicitly clarify whether S3 is in that list for new plans. Either way, this hands-on uses only a few dozen bytes and a handful of requests, so the practical cost is $0.

That's enough room to put a small prototype together with effectively no cost.
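
If you want to sanity-check a workload against the Lambda row, the arithmetic is just invocations × duration × memory. A small sketch using the quota figures from the table above; lambda_usage is my own helper, and it ignores billing details such as duration rounding.

```python
FREE_REQUESTS = 1_000_000
FREE_GB_SECONDS = 400_000


def lambda_usage(invocations: int, avg_duration_ms: float, memory_mb: int) -> dict:
    """Estimate monthly Lambda compute and compare it with the Always Free quota."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return {
        "gb_seconds": gb_seconds,
        "within_free_tier": invocations <= FREE_REQUESTS and gb_seconds <= FREE_GB_SECONDS,
    }


if __name__ == "__main__":
    # 100,000 invocations a month, 200 ms each, at 128 MB of memory
    print(lambda_usage(100_000, 200, 128))
```

At 128 MB the compute quota works out to 3.2 million seconds a month, so for small functions the request count is usually the limit you hit first.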

Which plan does this hands-on use

Brand new to AWS: pick the Free Plan. The core stays in Always Free and the S3 use is so small that credit consumption is effectively zero
Already have a legacy account: use that
Worried about the 6-month auto-close: switch to Paid Plan before the expiry, and the account stays

2. Prerequisites

What you need:

An email address (ideally separate from your daily one, for the root user)
A credit or debit card (the Free Plan still wants one for identity verification; just registering won't charge it)
A phone number (for SMS or voice OTP)
A notebook and a password manager (for storing everything you set up)

Set aside 30 to 60 minutes of focused time.

3. Step 1: create an account

Open https://aws.amazon.com/ and click "Create an AWS Account." The flow:

Account alias: shows up in the IAM sign-in URL later. Pick something easy to remember and unrelated to anything sensitive
Plan: pick Free Plan the first time. You can switch to Paid Plan later
Wait for the "Your account is ready" email before moving on

4. Step 2: four safety setups for the first 10 minutes

The instant the account is live, do these four things. Skip them and you're one leaked root credential away from the kind of bill that ends up on Twitter.

Add MFA to the root user
Set up an AWS Budgets alert
Create an IAM admin user
Log out of root, log in as the IAM user

a) Add MFA to the root user

Top-right of the console, click your account name, then "Security credentials"
"Multi-factor authentication (MFA)" then "Assign MFA device"
Scan the QR code with an Authenticator app (Google Authenticator / 1Password / Authy etc.)
Type two consecutive 6-digit codes to confirm

If the root user somehow has any access keys, delete them now. The root user must never own access keys.

b) AWS Budgets alert

For catching "I did something I shouldn't have" early.

Search "Budgets" in the console
Create budget, pick the Zero spend budget template ("notify if any charge appears")
Enter your email and save

If you ever step outside Always Free, you get notified the next day or so (Budgets' billing data aggregates with an 8 to 24 hour lag, so it's not literally instant, just very prompt).

c) Create an IAM admin user

The root user is not for daily work. Create one IAM user for that.

Search "IAM" in the console, then "Users", then "Create user"
Username: admin-cli or similar
Check "Provide user access to the console", auto-generate password, force change on first login
"Attach policies directly", pick AdministratorAccess (full permissions, for learning; never do this in production)
On the success screen, download the .csv and bookmark the sign-in URL
Add MFA to this user too (same flow as root, separate entry in your authenticator)

d) Log out of root, sign in as the IAM user

The sign-in URL is https://<account-id>.signin.aws.amazon.com/console. Use that going forward. Only switch back to root for special tasks like changing billing info.

5. Step 3: AWS CLI setup

You'll drive AWS from your local machine. Install the CLI.

Install

# macOS (Homebrew)
brew install awscli

# Linux (official)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# Windows: grab the official MSI
# https://awscli.amazonaws.com/AWSCLIV2.msi

# Verify
aws --version
# aws-cli/2.x.x Python/3.x.x ...

Issue an access key for the CLI

The IAM user needs its own access key for CLI use.

IAM, then user admin-cli, then the "Security credentials" tab
"Create access key"
For the use case, pick "Command Line Interface (CLI)"
Copy both the Access Key ID and the Secret Access Key (the secret is shown once and never again)

`aws configure`

aws configure
# AWS Access Key ID: AKIA...
# AWS Secret Access Key: ...
# Default region name: ap-northeast-1
# Default output format: json

Confirm.

aws sts get-caller-identity
# {
#     "UserId": "AIDA...",
#     "Account": "123456789012",
#     "Arn": "arn:aws:iam::123456789012:user/admin-cli"
# }

Your username should show up in Arn. That's the green light.

6. Step 4: create an S3 bucket and put something in it

S3 (Simple Storage Service) is one of the oldest AWS services (2006). For now, think of it as "near-infinite cloud storage that you talk to over HTTPS."

Quick refresher

Bucket: the container for objects. Names are globally unique across all of AWS
Object: an individual file in a bucket. Identified by a key (the filename, roughly)
Region: a bucket lives in one Region. This hands-on uses ap-northeast-1 (Tokyo) everywhere

Create the bucket

Bucket names are globally unique, so add some entropy. Replace <your-name> with your handle or a date.

BUCKET="aws-handson-<your-name>-2026"
aws s3 mb "s3://${BUCKET}" --region ap-northeast-1
# make_bucket: aws-handson-...
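
If aws s3 mb rejects the name, it is usually the character set or a name that is already taken. Here is a sketch that checks the basic naming rules as I understand them (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit) and appends random hex for uniqueness. It does not cover every rule, such as reserved prefixes, and both function names are mine.

```python
import re
import secrets

_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Check length and character set only; not an exhaustive validator."""
    return bool(_BUCKET_RE.fullmatch(name)) and ".." not in name


def unique_bucket_name(prefix: str) -> str:
    """Append 8 random hex characters so the name is unlikely to collide."""
    return f"{prefix}-{secrets.token_hex(4)}"


if __name__ == "__main__":
    print(is_valid_bucket_name("aws-handson-<your-name>-2026"))  # placeholder left in
    print(unique_bucket_name("aws-handson-alice"))
```

The first call is a reminder that the literal placeholder, angle brackets included, is not a usable name.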

Check Block Public Access

New buckets should have all public access blocked by default. Verify it:

aws s3api get-public-access-block --bucket "${BUCKET}"
# {
#   "PublicAccessBlockConfiguration": {
#     "BlockPublicAcls": true,
#     "IgnorePublicAcls": true,
#     "BlockPublicPolicy": true,
#     "RestrictPublicBuckets": true
#   }
# }

All four true is the safe state. If any are false, force them all on:

aws s3api put-public-access-block --bucket "${BUCKET}" \
    --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Drop a file in

echo "Hello AWS" > hello.txt
aws s3 cp hello.txt "s3://${BUCKET}/uploads/hello.txt"
# upload: ./hello.txt to s3://aws-handson-.../uploads/hello.txt

aws s3 ls "s3://${BUCKET}/uploads/"
# 2026-05-14 12:34:56         10 hello.txt

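That's "putting a file in cloud storage." On a legacy account, the 12-month free tier covers 5 GB of standard storage plus 2,000 PUT and 20,000 GET per month. On a new Free Plan account this usage draws from your credits, and at this size the cost is effectively $0.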

7. Step 5: create a DynamoDB table

DynamoDB is the managed NoSQL database. Create one table, then later Lambda will write to it.

Vocab

Table: a collection of schemaless items
Partition key: the unique key for an item (close to a primary key in RDBMS)
Capacity mode: "Provisioned" (reserve capacity in RCU/WCU) or "On-demand" (PAY_PER_REQUEST, per-request billing)

Pick provisioned mode to stay on Always Free

This part trips people up. DynamoDB's Always Free covers provisioned (Standard class), 25 WCU + 25 RCU + 25 GB only. On-demand (PAY_PER_REQUEST) is outside the free tier and charges per request.

For learning, reserve 25 WCU / 25 RCU on provisioned mode (provisioning that much still costs nothing).

aws dynamodb create-table \
    --table-name UploadLog \
    --attribute-definitions AttributeName=ObjectKey,AttributeType=S \
    --key-schema AttributeName=ObjectKey,KeyType=HASH \
    --billing-mode PROVISIONED \
    --provisioned-throughput ReadCapacityUnits=25,WriteCapacityUnits=25 \
    --table-class STANDARD \
    --region ap-northeast-1

# Give it a moment
aws dynamodb describe-table --table-name UploadLog \
    --query 'Table.TableStatus'
# "ACTIVE"

25 WCU lets you write 25 items per second (1 KB each), roughly 65 million writes per month. You'll never come close in a hands-on.
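The arithmetic behind that figure, as a quick check. It assumes a 30-day month and standard writes of up to 1 KB each (one WCU per write per second); the variable names are only illustrative.

```python
# One WCU covers one standard write per second for an item up to 1 KB.
WCU = 25
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month

writes_per_month = WCU * SECONDS_PER_DAY * DAYS_PER_MONTH
print(f"{writes_per_month:,} writes/month")  # 64,800,000 writes/month
```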

8. Step 6: prepare an IAM Role for Lambda

Time for the IAM piece. The two-line version:

Lambda doesn't carry credentials of its own. When you create a function you specify "the IAM Role this function runs as"
Every time the function runs, the Lambda service assumes that Role to get temporary credentials, and your code (boto3 etc.) uses those to call other AWS services

So the Role needs two kinds of policy: a trust policy that says "the Lambda service is allowed to assume me," and permission policies describing what the function can do once it has assumed the Role.

Trust policy file

Lets the Lambda service assume the Role.

cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Create the Role

aws iam create-role \
    --role-name HandleUploadRole \
    --assume-role-policy-document file://trust-policy.json

Attach the AWS managed policy

The standard one for log writes.

aws iam attach-role-policy \
    --role-name HandleUploadRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Add DynamoDB and S3 permissions (inline policy)

Following least privilege, scope it to exactly this table and this bucket.

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

cat > inline-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "dynamodb:PutItem",
      "Resource": "arn:aws:dynamodb:ap-northeast-1:${ACCOUNT_ID}:table/UploadLog"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${BUCKET}/*"
    }
  ]
}
EOF

aws iam put-role-policy \
    --role-name HandleUploadRole \
    --policy-name HandleUploadInline \
    --policy-document file://inline-policy.json

9. Step 7: write and deploy the Lambda function

Lambda is the serverless runtime where you upload code plus a runtime spec, and AWS handles starting and stopping the process for you.

The Python code

A heads-up on the libraries used:

boto3: the AWS SDK for Python. Already bundled with the Lambda Python runtime, so no pip install needed
urllib.parse.unquote_plus: object keys in S3 events come URL-encoded (spaces become +, Japanese characters become %E3%83%...). This function decodes them back
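To see the decoding in isolation, here is a small round trip you can run locally. The key is a made-up example, and quote_plus with safe="/" is only an approximation of how a key looks inside an S3 event.

```python
from urllib.parse import quote_plus, unquote_plus

# Assumption: an example key with spaces and non-ASCII characters.
original_key = "uploads/my report テスト.txt"

# Roughly what the key looks like inside an S3 event (slashes kept as-is).
encoded_key = quote_plus(original_key, safe="/")
decoded_key = unquote_plus(encoded_key)

print(encoded_key)
print(decoded_key == original_key)
```

Plain unquote would leave the + characters in place, which is why the handler uses unquote_plus.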

mkdir -p lambda-src && cat > lambda-src/handler.py <<'EOF'
import json
import urllib.parse
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UploadLog")


def lambda_handler(event, context):
    # Pull info out of the S3 event
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # S3 URL-encodes the object key in events; decode it
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    size = record["s3"]["object"]["size"]

    print(f"Got upload: bucket={bucket} key={key} size={size}")

    # Write to DynamoDB
    table.put_item(Item={
        "ObjectKey": key,
        "Bucket": bucket,
        "Size": size,
        "UploadedAt": datetime.now(timezone.utc).isoformat(),
    })

    return {"statusCode": 200, "body": json.dumps({"ok": True, "key": key})}
EOF

# Zip it up
cd lambda-src && zip ../function.zip handler.py && cd ..
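The handler above reads only Records[0]. Records is a list, so a variant that loops over it is a small change. The sketch below covers just the parsing step and runs without AWS; parse_s3_records is a hypothetical helper name, not part of the handler above.

```python
from urllib.parse import unquote_plus


def parse_s3_records(event):
    """Return (bucket, key, size) for every record in an S3 event."""
    uploads = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])
        # Assumption: treat a missing size as 0 rather than failing.
        size = s3["object"].get("size", 0)
        uploads.append((bucket, key, size))
    return uploads


event = {"Records": [
    {"s3": {"bucket": {"name": "demo"},
            "object": {"key": "uploads/a+b.txt", "size": 3}}},
    {"s3": {"bucket": {"name": "demo"},
            "object": {"key": "uploads/c.txt", "size": 5}}},
]}
print(parse_s3_records(event))
```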

Create the Lambda function

ROLE_ARN=$(aws iam get-role --role-name HandleUploadRole --query 'Role.Arn' --output text)

aws lambda create-function \
    --function-name HandleUpload \
    --runtime python3.13 \
    --role "${ROLE_ARN}" \
    --handler handler.lambda_handler \
    --zip-file fileb://function.zip \
    --region ap-northeast-1

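If you just created the Role, create-function can fail with InvalidParameterValueException ("The role defined for the function cannot be assumed by Lambda") because the Role hasn't propagated yet. Wait about 10 seconds and retry.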

Smoke test (manually invoke with a fake S3 event)

cat > test-event.json <<EOF
{
  "Records": [{
    "s3": {
      "bucket": {"name": "${BUCKET}"},
      "object": {"key": "uploads/hello.txt", "size": 10}
    }
  }]
}
EOF

aws lambda invoke \
    --function-name HandleUpload \
    --payload fileb://test-event.json \
    --cli-binary-format raw-in-base64-out \
    response.json

cat response.json
# {"statusCode": 200, "body": "{\"ok\": true, \"key\": \"uploads/hello.txt\"}"}

Check that DynamoDB picked it up:

aws dynamodb scan --table-name UploadLog
# Items should include a hello.txt record

10. Step 8: have S3 invoke Lambda automatically

Up to here we've been calling the function explicitly. Next, wire it so that a file landing in S3 triggers the Lambda by itself.

Allow S3 to invoke the function

For S3 to call Lambda, Lambda needs a resource-based policy granting that permission.

aws lambda add-permission \
    --function-name HandleUpload \
    --statement-id AllowS3Invoke \
    --action lambda:InvokeFunction \
    --principal s3.amazonaws.com \
    --source-arn "arn:aws:s3:::${BUCKET}"

Configure the S3 event notification

LAMBDA_ARN=$(aws lambda get-function --function-name HandleUpload --query 'Configuration.FunctionArn' --output text)

cat > notification.json <<EOF
{
  "LambdaFunctionConfigurations": [{
    "LambdaFunctionArn": "${LAMBDA_ARN}",
    "Events": ["s3:ObjectCreated:*"],
    "Filter": {
      "Key": {
        "FilterRules": [{ "Name": "prefix", "Value": "uploads/" }]
      }
    }
  }]
}
EOF

aws s3api put-bucket-notification-configuration \
    --bucket "${BUCKET}" \
    --notification-configuration file://notification.json

Upload a file and watch it fire

echo "auto-trigger test" > memo.txt
aws s3 cp memo.txt "s3://${BUCKET}/uploads/memo.txt"

# Wait a few seconds, then check DynamoDB
sleep 5
aws dynamodb scan --table-name UploadLog --query 'Items[?ObjectKey.S == `uploads/memo.txt`]'

If memo.txt shows up in the scan, the S3 → Lambda → DynamoDB pipeline is wired up.

11. Step 9: logs and metrics in CloudWatch

Lambda automatically pushes stdout (print or console.log) to CloudWatch Logs. Execution count, error count, duration, all of it lands in CloudWatch Metrics automatically too.

Read the logs

aws logs tail /aws/lambda/HandleUpload --since 10m
# 2026-05-14T12:34:56 START RequestId: ...
# 2026-05-14T12:34:56 Got upload: bucket=aws-handson-... key=uploads/memo.txt size=18
# 2026-05-14T12:34:56 END RequestId: ...

View metrics in the console

Console → CloudWatch → Metrics → Lambda
Look at Invocations (run count), Errors (failures), Duration (how long it took)

Set up an alarm

Email yourself when Lambda errors. The alarm itself just holds state; for actual email delivery you need an SNS (Simple Notification Service) topic with your email subscribed.

# 1. Create an SNS topic
TOPIC_ARN=$(aws sns create-topic --name lambda-alarms --query 'TopicArn' --output text)

# 2. Subscribe your email
aws sns subscribe \
    --topic-arn "${TOPIC_ARN}" \
    --protocol email \
    --notification-endpoint your-email@example.com
# AWS sends you a confirmation email shortly. Click the Confirm link in it.

# 3. Create the CloudWatch alarm and point it at the SNS topic
aws cloudwatch put-metric-alarm \
    --alarm-name HandleUpload-Errors \
    --metric-name Errors \
    --namespace AWS/Lambda \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions Name=FunctionName,Value=HandleUpload \
    --treat-missing-data notBreaching \
    --alarm-actions "${TOPIC_ARN}"

Confirm the subscription from the email AWS sends you. Until you do, the topic accepts messages but nothing reaches your inbox.

Everything stays inside the Always Free quota of 10 alarms and 1M SNS publishes. To test it, raise an exception in the Lambda code; within about a minute the alarm enters the ALARM state and you get the email.
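How the alarm settings above combine, as a deliberately simplified model of a single evaluation period. The function name and the reduced set of missing-data modes are assumptions for illustration; CloudWatch's real evaluation has more cases (for example the "ignore" mode, which keeps the current state).

```python
def alarm_state(error_sum, threshold=1, treat_missing="notBreaching"):
    """Simplified model of one 60-second evaluation period.

    error_sum is the Sum of the Errors metric for the period,
    or None when Lambda reported no datapoint at all.
    """
    if error_sum is None:
        if treat_missing == "notBreaching":
            return "OK"  # missing data counts as within the threshold
        if treat_missing == "breaching":
            return "ALARM"
        return "INSUFFICIENT_DATA"  # the default "missing" mode
    # GreaterThanOrEqualToThreshold
    return "ALARM" if error_sum >= threshold else "OK"


print(alarm_state(0), alarm_state(1), alarm_state(None))
```

With notBreaching, a function that simply isn't being invoked leaves the alarm in OK instead of flapping into INSUFFICIENT_DATA.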

12. Step 10: tear it all down (this matters the most)

When you're done, delete every resource you created. Anything left over is one slip away from surprise billing. Everything here should fit inside Always Free, but build the habit of deleting anyway.

Order of teardown:

Empty the S3 bucket, then delete it
Delete the Lambda function
Delete the DynamoDB table
Delete the CloudWatch alarm
Delete the CloudWatch log group
Delete the SNS topic
Delete the IAM Role

Leave AWS Budgets in place (you want it to keep watching going forward).

One-shot teardown

# S3
aws s3 rm "s3://${BUCKET}" --recursive
aws s3 rb "s3://${BUCKET}"

# Lambda
aws lambda delete-function --function-name HandleUpload

# DynamoDB
aws dynamodb delete-table --table-name UploadLog

# CloudWatch
aws cloudwatch delete-alarms --alarm-names HandleUpload-Errors
aws logs delete-log-group --log-group-name /aws/lambda/HandleUpload

# SNS
aws sns delete-topic --topic-arn "${TOPIC_ARN}"

# IAM Role
aws iam delete-role-policy --role-name HandleUploadRole --policy-name HandleUploadInline
aws iam detach-role-policy --role-name HandleUploadRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name HandleUploadRole

Last step: open Billing → Bills in the console and confirm the current month shows $0.00. Done.

If you cloned 0-draft/awscape, ./scripts/cleanup.sh does all of the above in one go. It's idempotent, so re-running after a partial failure is safe.

13. What you actually touched

The seven services here cover most of the fundamental building blocks of AWS.

Category	What you used	What you got out of it
Auth	IAM User / Role / Policy / Trust Policy	The basics of separating "who can access what"
Storage	S3 + Block Public Access	Cloud object storage and how access control works
Compute	Lambda + IAM Role wiring	Serverless execution, and how one AWS service calls another
Database	DynamoDB (provisioned)	The feel of NoSQL and reading/writing from Lambda
Event-driven	S3 → Lambda notifications	How AWS services trigger each other loosely-coupled
Observability	CloudWatch Logs / Metrics / Alarms	"Is it actually running" plus paging when it isn't
Cost	AWS Budgets + Always Free	The baseline for not getting a surprise bill

The shape (S3 catches input, Lambda processes, DynamoDB stores, CloudWatch watches) shows up everywhere in real systems. From here, the natural next moves are adding API Gateway for a REST API, fanning out with SQS or EventBridge, or rolling all of it into CloudFormation / CDK as infrastructure-as-code.

14. Common mistakes

Symptom	Cause	Fix
`AccessDenied` everywhere	Logged out of root but the IAM user has too-narrow permissions	Confirm you attached `AdministratorAccess` in Step 2c (only for learning)
`s3 mb` fails with name conflict	S3 bucket names are global	Add more entropy (date, random suffix)
`InvalidParameterValueException` creating Lambda	IAM Role hasn't propagated yet	Wait 10 seconds and retry
S3 upload works but Lambda doesn't fire	Missing `add-permission`, or wrong prefix in the notification	Re-run Step 8 (section 10)
Lambda runs but DynamoDB write fails	Inline policy is missing `dynamodb:PutItem`	Re-apply inline-policy.json
Surprise charge next month	Forgot to clean up	Open Cost Explorer to find what's still running
Account vanished at the 6-month mark	Free Plan expired	Upgrade to Paid Plan before that date, or start a new account

Wrap-up

The 2026 AWS free tier is four things: Free Plan (6-month, new), Paid Plan (new), Legacy 12-month (old), Always Free (forever)
Always Free alone (Lambda 1M req / DynamoDB 25 GB / S3 5 GB / CloudWatch 10 metrics) is enough to actually build something
Four things to do immediately after account creation: root MFA, Budgets, IAM admin user, log out of root
S3 + Lambda + DynamoDB + CloudWatch + IAM is the foundational pattern; once you've wired it once by hand, the rest of AWS reads as variations on this
When you're done, delete the resources. Keep Budgets
Watch the Free Plan 6-month clock. Switch to Paid Plan before the expiry if you want to keep the account

AWS Deep Dive

kt — Thu, 14 May 2026 12:46:07 +0000

Introduction

Before you can honestly say you "get AWS," you need a plain-language answer to one question: what is it, really? Where does it physically run, how does the bill actually arrive, and how do you talk to it? Skip that and every IAM policy you write is just vibes.

By the end of this you should be able to answer all of these without hesitation:

What's the actual difference between a Region and an Availability Zone?
An AWS account is just an email address, right?
Why does every senior engineer insist on splitting production and staging into separate accounts?
When you click around in the console vs. typing into the CLI, what's actually happening underneath?

1. The 30-second version of "the world before the cloud"

The fastest way to appreciate the cloud is to remember what it replaced.

You buy a physical server. You rent rack space in a data center. You sign contracts for power and network. You install an OS. Finally, your app runs. That's on-premises, "on-prem" for short.

The problem with all of that: it's slow, paid up front, and a pain to scale. If marketing tells you traffic is going to be 10x next week, you can't just go buy 10x the hardware.

The cloud rents you everything from "physical box up to just-below-your-app." All of it. One API call, available right now.

Your application is the only thing you actually wrote. With the cloud, that's the only part you have to care about.

AWS (Amazon Web Services) is the biggest cloud provider out there with 200+ services. EC2 (virtual machines), S3 (object storage), RDS (managed databases), Lambda (serverless runtime), IAM (auth). Those are just AWS's internal code names for things you probably already know.

2. The physical layout: Region, AZ, data center

The cloud isn't magic. It runs on physical servers somewhere on Earth, and AWS arranges those servers in a strict hierarchy.

Three terms, three different scopes.

Term	What it is	Example	Isolation
Region	A geographic chunk of the planet	ap-northeast-1 (Tokyo)	Regions are fully independent. An EC2 instance launched in Tokyo doesn't show up when you list instances in Virginia.
Availability Zone (AZ)	A physically isolated set of DCs inside one Region	ap-northeast-1a	Independent power, cooling, and network. One AZ failing doesn't take down the others.
Data center (DC)	The actual building	(omitted)	AWS officially says "an AZ is one or more discrete DCs." In any serious Region the AZs are multiple buildings (the diagram above is conceptual).
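The Region and AZ naming in the table can be checked mechanically. A small sketch, assuming standard AZ names of the form "Region code plus one letter" (Local Zones and Wavelength Zones use longer names and are not handled); both helper names are hypothetical.

```python
def region_of(az_name):
    """Strip the trailing AZ letter: 'ap-northeast-1a' -> 'ap-northeast-1'."""
    return az_name[:-1]


def spans_multiple_azs(instance_azs):
    """True when the instances sit in at least two distinct AZs."""
    return len(set(instance_azs)) >= 2


print(region_of("ap-northeast-1a"))
print(spans_multiple_azs(["ap-northeast-1a", "ap-northeast-1a"]))
print(spans_multiple_azs(["ap-northeast-1a", "ap-northeast-1c"]))
```

Because each AZ has independent power, cooling, and network, a deployment survives a single-AZ failure only when the second check returns True.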

Three classic beginner mistakes.

Putting EC2 in a single AZ: that AZ goes down, your service goes down. Production should always span at least 2 AZs.
Working in the wrong Region: thinking you're in the US, racking up bills in Tokyo. Happens more than you'd expect.
"We're using Tokyo, so the data stays in Japan": the AWS control plane (IAM user info and the like) is global. Data residency and governance need explicit design, not a Region picker.

"Same services everywhere" is a myth

New services almost always launch in us-east-1 first and roll out to other Regions over months or years. There are always features Tokyo doesn't have yet. If you just want to play with the latest thing, us-east-1 is the fast lane.

3. The logical layout: AWS accounts and Organizations

This part is harder to picture than the physical layout. An AWS account is not a Twitter account or a GitHub account.

An AWS account is a container for resources and money. Every EC2 instance, every S3 bucket, every resource you create lives inside exactly one account. Billing is also per account.

The relationships in one table:

Concept	Role	What to remember
AWS account	The unit of resources and billing	One account is one "box." Everything inside is self-contained.
Organizations	Trees multiple accounts together	If you're a company, use it.
Management Account	The root of the tree	Billing aggregation and governance only. Do not run workloads here.
OU (Organizational Unit)	A folder-like grouping for accounts	Split by Production / Development / Security and similar lines.

Why split accounts

"We're one company, why do we need more than one account?" People say this until they've operated a real system. Then they stop saying it.

Account separation buys you:

Hard isolation of permissions: one botched IAM policy can't reach into a different account.
Visible costs: glance at the bill and you know which environment is spending what.
Clear ownership: "this account belongs to Team A" is now a sentence you can say.
Bounded blast radius: when something blows up, it blows up inside one account, not across everything.

That's why every senior engineer ends up saying "production and staging belong in different accounts."

The Management Account is sacred

The Management Account (the one at the top of the tree) is for billing and governance, period. Don't run apps in it. If that account is compromised, every account below it is at risk by definition.

In real organizations almost no human ever needs to log in to the management account.

4. Every path into AWS ends at the same HTTPS API

Humans, CI runners, Lambda functions, Terraform: when they touch AWS, there's exactly one path under the hood. HTTPS API.

Worth letting that sink in. Clicking around in the console? That's hitting the same HTTPS API. CLI? Same API. CloudFormation, CDK, Terraform? Same API.

So when people talk about AWS security, the whole problem reduces to who can call which HTTPS API, and with what permissions. The thing that answers both questions is IAM (Identity and Access Management).

Every API request carries a signature

Unlike most REST APIs, AWS doesn't use Authorization: Bearer <token>. It uses its own signing scheme: SigV4 (Signature Version 4).

The shape of it: hash the request (URL, headers, body), sign that hash with a key derived from your secret key, attach the signature to the request. The server reproduces the same calculation. If the signatures match, the request is authentic and unmodified.

The thing to take away: the secret key itself never goes over the wire. Only the signature derived from it does. Even if someone snoops on the request, they can't recover the secret.
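A minimal sketch of the signing step using only the standard library. It assumes the canonical request string has already been built (that part has the most rules and is skipped here), and the secret, timestamp, and host are placeholders. In practice the SDKs and the CLI do all of this for you.

```python
import hashlib
import hmac


def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key, date, region, service):
    """Derive the signing key scoped to one day, Region, and service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign(secret_key, amz_date, region, service, canonical_request):
    """Return the hex signature for an already-canonicalized request."""
    date = amz_date[:8]  # YYYYMMDD
    scope = f"{date}/{region}/{service}/aws4_request"
    hashed = hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()
    string_to_sign = "\n".join(["AWS4-HMAC-SHA256", amz_date, scope, hashed])
    key = signing_key(secret_key, date, region, service)
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder values for illustration only.
empty_body_hash = hashlib.sha256(b"").hexdigest()
canonical = "GET\n/\n\nhost:example.amazonaws.com\n\nhost\n" + empty_body_hash
sig = sign("EXAMPLE-SECRET", "20260514T123456Z", "ap-northeast-1", "s3", canonical)
print(sig)
```

Only the hex signature travels with the request. The secret feeds a chain of HMACs and never appears in the output.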

5. Authentication vs. authorization in AWS

So far: everything is an API, and every request is signed. Once the request lands, AWS asks itself two separate questions.

Kind	Question	How AWS answers it
Authentication (AuthN)	Who are you?	SigV4 signature check + access key or temporary credentials
Authorization (AuthZ)	Are you allowed to do this?	IAM policy, SCP, resource policy, permissions boundary, and friends

These are not the same thing. Having a valid signature (you authenticated) doesn't mean the action is allowed. The reverse is also true: being broadly allowed to do something is meaningless if your signature is wrong.

Authorization in AWS gets interesting because multiple policy types are combined to produce the final yes-or-no. That deserves its own write-up, and AWS's IAM docs are the place to go for the details.
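The core of that combination can still be sketched in a few lines: an explicit Deny always wins, otherwise a matching Allow is required, and anything else is an implicit deny. This is a heavily simplified model that ignores conditions, principals, SCPs, and permissions boundaries, and uses fnmatch wildcards as a stand-in for IAM's matching. The function names and the account ID are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return [value] if isinstance(value, str) else value


def _matches(statement, action, resource):
    return (any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
            and any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"])))


def is_allowed(statements, action, resource):
    """Explicit Deny beats Allow; no matching Allow means implicit deny."""
    matching = [s for s in statements if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)


table_arn = "arn:aws:dynamodb:ap-northeast-1:123456789012:table/UploadLog"
statements = [
    {"Effect": "Allow", "Action": "dynamodb:PutItem", "Resource": table_arn},
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]
print(is_allowed(statements, "dynamodb:PutItem", table_arn))
print(is_allowed(statements, "s3:DeleteObject", "arn:aws:s3:::demo/key"))
```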

6. The money question: pay-as-you-go and guardrails

AWS is pay-as-you-go, billed after the fact. At the end of the month, you get a bill for what you used.

This is where new users get burned. The classic blunders:

Spin up an EC2 instance for a test, forget to stop it, head into a long weekend, next month's bill is $400.
Stuff a bucket full of logs and forget about them. Six months later you have tens of GB at a few dollars a month and never noticed.
Assume data transfer (egress) is free. It is not. Big transfers add up fast.
Leave a NAT Gateway running in a VPC you no longer use. That's about $30 a month for nothing.

Cloud pricing usually breaks down into compute, storage, data transfer, and request counts. Services that look cheap on the sticker can get expensive once request charges or retrieval fees pile on. (Glacier's "cheap to store, expensive to retrieve" trap is famous for a reason.)

Bare-minimum guardrails:

Set up AWS Budgets: a monthly budget with email alerts at, say, $100. If you want, budget actions can also apply a restrictive IAM policy or stop EC2 and RDS instances once a higher threshold is crossed.
Open Cost Explorer regularly: a quick daily glance catches abnormal trends within a week.
Keep personal experiments in their own account: mixed in with work stuff, weird charges go unnoticed.

7. Wrap-up

AWS rents you everything from hardware up to just below your app, and bills you for what you used.
Physical hierarchy: Region > AZ > DC. Always span multiple AZs in production.
Logical hierarchy: Organizations > OU > account. Production and staging go in separate accounts.
Every operation, no matter how it's triggered, ends up as an HTTPS API call signed with SigV4.
Authentication (who?) and authorization (allowed?) are separate questions. IAM handles both.
Pay-as-you-go means you need Budgets and Cost Explorer in place from day one.

"AWS is a collection of HTTPS APIs, and IAM stands at the door." Keep that one line in your head and you won't get lost when a new service shows up.

Introducing Zopa: a 60 KB authorization engine for proxy-wasm, written in Zig

kt — Sun, 10 May 2026 07:37:50 +0000

There are plenty of times you want to delegate "let this request through, or block it" to a wasm filter inside Envoy. API gateways, service mesh boundaries, L7 checkpoints. The default move is to use OPA's wasm build.

The trouble is OPA-as-wasm is heavy. The Go runtime, the Rego parser, and the evaluator are all in there. You only want to return allow/deny at the edge, but you ship something many times the size of the evaluator. Cedar and Casbin don't ship official wasm builds (as of May 2026). The slot for "drop-in proxy-wasm authorization filter" is empty.

zopa is what I built to fill that slot. A Zig wasm32-freestanding binary, ~60 KB at release. No GC; memory turns over on a per-request arena. It runs on any host that implements proxy-wasm 0.2.1 (Envoy / wasmtime / wamr / v8).

Big picture

Zopa assumes you separate where you write policy from where you evaluate it.

Policy authors write rules in Rego (OPA's policy language; a declarative DSL in the Datalog family). The CI converts that to AST (Abstract Syntax Tree) JSON. At Envoy startup the AST is handed to the wasm module as plugin config; on each incoming request, zopa evaluates and returns 1 or 0.

There's exactly one design call here: don't ship a language compiler inside the wasm module. OPA wasm is large because the Rego parser and evaluator are bundled together. Zopa pushes the parser out of the wasm (into a CI job) and keeps the wasm module focused on evaluation. That alone moves the binary size by orders of magnitude.

proxy-wasm refresher

proxy-wasm is the ABI spec for "filters in wasm" used by Envoy and friends. Most famous in Envoy, but anything embedding wasmtime / wamr / v8 can host it.

Three points cover the host/wasm relationship:

The host calls into wasm exports at request milestones (proxy_on_request_headers etc.).
The wasm pulls header values back through host imports (proxy_get_header_map_value).
Allow does nothing (Envoy continues). On deny, the wasm calls the host import proxy_send_local_response(403) and the host sends the response.

Zopa implements proxy-wasm 0.2.1. Spec body: proxy-wasm/spec.

Why it fits in 60 KB

The build:

zig build --release=small
ls -lh zig-out/bin/zopa.wasm
# -rw-r--r-- 1 you staff 60K  zopa.wasm

Three things drive the size:

wasm32-freestanding target. No WASI (the wasm syscall spec). No OS, no syscalls, only a thin slice of stdlib. freestanding drops every file I/O / network stub.
No GC. Zig has no garbage collector; as in C, memory is managed by hand through explicit allocators. The GC code and management metadata simply don't exist.
Zero deps. Nothing outside Zig stdlib. The JSON parser is hand-rolled (recursive descent) in src/json.zig, surrogate-pair handling included, in a few hundred lines.

OPA's wasm is large because it carries the Go runtime (with GC), the Rego parser, and the evaluator. Zopa took the opposite call on every point. The result is 60 KB.

Memory model

Zopa's heart is the memory layout. Two allocators, with different lifetimes and roles.

host_allocator is Zig stdlib's std.heap.wasm_allocator, a freelist-style allocator. It backs every buffer that crosses the host boundary. Lifetime: the whole module.

request_arena is std.heap.ArenaAllocator. Per-request scratch space. We call reset(.retain_capacity) at the end of evaluate().

An arena means "alloc as much as you want; everything goes away at the end". No individual free calls. With retain_capacity, the wasm linear memory pages aren't returned, so the next request reuses the existing capacity.

Net effect: after warmup, memory.grow (the wasm instruction that grows linear memory) stops firing. Throughput goes up and memory stays roughly flat, both as a direct result of reusing the arena's retained capacity.

The single rule that ties it together: a pointer minted by one allocator must only be released by the matching free path. The proxy-wasm shim returns host-malloc'd buffers via the host's free; the evaluator never calls free directly and instead leans on the arena reset.
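The arena behavior is easy to model outside Zig. The sketch below is a toy bump allocator in Python, not zopa's code and not how std.heap.ArenaAllocator is implemented internally; grow_count stands in for memory.grow events, and the allocation sizes are arbitrary.

```python
class Arena:
    """Bump allocator over one growable buffer; reset keeps the capacity."""

    def __init__(self):
        self.buffer = bytearray()
        self.offset = 0
        self.grow_count = 0  # stands in for memory.grow events

    def alloc(self, size):
        """Reserve size bytes and return their start offset."""
        start = self.offset
        end = start + size
        if end > len(self.buffer):
            self.buffer.extend(bytes(end - len(self.buffer)))
            self.grow_count += 1
        self.offset = end
        return start

    def reset(self):
        # Like reset(.retain_capacity): forget allocations, keep the memory.
        self.offset = 0


arena = Arena()
for _ in range(3):  # three "requests"
    arena.alloc(64)
    arena.alloc(128)
    arena.reset()

print(arena.grow_count)  # 2: both grows happen during the first request
```

Requests two and three fit inside the capacity left behind by the first one, so the grow counter stops moving after warmup.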

Three-phase target rules

Zopa makes the decision independently in three HTTP phases. Each phase is bound to a different target rule name; if your policy contains a rule with that name, the phase fires; if not, the phase passes through silently.

Phase	Target rule	Input shape	On deny
`proxy_on_request_headers`	`allow`	`{method, path, headers}`	403
`proxy_on_request_body`	`allow_body`	`{body, body_raw}`	403 + Pause
`proxy_on_response_headers`	`allow_response`	`{response: {status, headers}}`	503

The key point: a policy with only an allow rule sails past the body and response phases untouched. At configure time we parse the policy JSON and remember whether allow_body / allow_response rules exist as bools; if they don't, the matching callbacks return Continue directly.
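The configure-time check can be pictured like this. It is a Python sketch of the idea, not zopa's Zig implementation; the phase keys are shorthand for the callbacks in the table, and only the single-module shape is handled.

```python
# Shorthand phase names mapped to the target rule each one looks for.
PHASE_RULES = {
    "request_headers": "allow",
    "request_body": "allow_body",
    "response_headers": "allow_response",
}


def enabled_phases(policy):
    """Record, once at configure time, which phases the policy opts into."""
    names = {rule["name"] for rule in policy.get("rules", [])}
    return {phase: rule in names for phase, rule in PHASE_RULES.items()}


policy = {"type": "module", "rules": [
    {"type": "rule", "name": "allow", "body": []},
]}
print(enabled_phases(policy))
```

A callback whose flag is False can return Continue immediately, without building an input document or running the evaluator.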

What happens during one request

When Envoy hands a single request to zopa, this is the timeline inside.

Three takeaways:

The policy AST is handed to us once at startup and copied into host_allocator. We don't re-receive it per request.
Every phase ends with arena.reset. Zopa carries no data across phase or request boundaries.
The body and response phases only run eval when the matching rule exists. The policy opts each phase in by writing a rule with that name.

evaluate() returns a plain i32:

Return	Meaning
`1`	allow. The target rule fired with a truthy value.
`0`	deny. No rule fired and no truthy default rule was present.
`-1`	error. Parse failure, unknown node, recursion cap, etc.

The proxy-wasm shim treats -1 the same as deny (broken policies block by default).

Policy AST

Zopa's input isn't Rego source; it's AST-shaped JSON. The supported nodes mirror a subset of Rego.

"role equals admin → allow" looks like:

{
  "type": "module",
  "rules": [
    {
      "type": "rule",
      "name": "allow",
      "default": true,
      "value": { "type": "value", "value": false }
    },
    {
      "type": "rule",
      "name": "allow",
      "body": [
        {
          "type": "eq",
          "left":  { "type": "ref", "path": ["input", "user", "role"] },
          "right": { "type": "value", "value": "admin" }
        }
      ]
    }
  ]
}

The evaluation rule: OR together every rule named allow. If any rule body holds, allow is true. The value of the rule marked "default": true is the fallback when nothing else fires.
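To make those semantics concrete, here is a tiny Python evaluator for exactly the node shapes in the example above (value, ref, eq, rule, module). It is a sketch for reading along, not a port of zopa: errors raise instead of returning -1, and treating a missing path as undefined is an assumption borrowed from Rego.

```python
UNDEFINED = object()  # assumption: a missing path makes the comparison fail


def resolve(node, data):
    if node["type"] == "value":
        return node["value"]
    if node["type"] == "ref":
        current = data
        for part in node["path"]:
            if not isinstance(current, dict) or part not in current:
                return UNDEFINED
            current = current[part]
        return current
    raise ValueError(f"unsupported node: {node['type']}")


def holds(expr, data):
    if expr["type"] != "eq":
        raise ValueError(f"unsupported expression: {expr['type']}")
    left, right = resolve(expr["left"], data), resolve(expr["right"], data)
    return left is not UNDEFINED and right is not UNDEFINED and left == right


def evaluate(module, input_doc, target="allow"):
    """Return 1 (allow) or 0 (deny) for the target rule name."""
    data = {"input": input_doc}
    fallback = False
    for rule in module["rules"]:
        if rule["name"] != target:
            continue
        if rule.get("default"):
            fallback = bool(rule["value"]["value"])
        elif all(holds(expr, data) for expr in rule.get("body", [])):
            return 1  # any one rule whose body holds is enough (OR)
    return 1 if fallback else 0


policy = {"type": "module", "rules": [
    {"type": "rule", "name": "allow", "default": True,
     "value": {"type": "value", "value": False}},
    {"type": "rule", "name": "allow", "body": [
        {"type": "eq",
         "left": {"type": "ref", "path": ["input", "user", "role"]},
         "right": {"type": "value", "value": "admin"}}]},
]}
print(evaluate(policy, {"user": {"role": "admin"}}))
print(evaluate(policy, {"user": {"role": "guest"}}))
```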

Supported node types:

Node	Use
`value`	Literal (any JSON value)
`ref`	Path lookup, e.g. walking `input.user.role`
`compare`	Binary compare. `eq` / `neq` / `lt` / `lte` / `gt` / `gte`
`not`	Logical negation
`set`	Set literal
`some`	Existential quantifier: "some element x makes body true"
`every`	Universal quantifier: "every element x makes body true"
`call`	Builtin function call.
`module`	A set of rules. Optional `package` field carries the package name.
`modules`	Module bundle: multiple packages co-resident in a single VM.
`rule`	A single rule with `body` (AND) and `value`.

call ships with these four builtins:

Name	Args	Returns
`startswith`	(string, string)	bool
`endswith`	(string, string)	bool
`contains`	(string, string)	bool
`count`	(array / set / object / string)	number

some / every also iterate over JSON objects: pick kind: "keys" (default) or kind: "values".

Zopa doesn't cover full Rego (no user-defined functions, with clauses, partial evaluation, imports, etc.). The scope is "decide allow/deny at the edge". Full reference: docs/ast.md.

Try it

Build

You need Zig 0.16.0:

brew install zig
git clone https://github.com/0-draft/zopa
cd zopa
zig build --release=small

Drive it directly (Node.js)

Call evaluate(input, ast) without going through proxy-wasm. Useful as a smoke test before standing up Envoy.

import { readFileSync } from 'node:fs';

const { instance } = await WebAssembly.instantiate(
  readFileSync('zig-out/bin/zopa.wasm'),
  { env: {
      proxy_log: () => 0,
      proxy_get_buffer_bytes: () => 1,
      proxy_get_header_map_pairs: () => 1,
      proxy_get_header_map_value: () => 1,
      proxy_send_local_response: () => 0,
  }},
);
const { malloc, free, evaluate, memory } = instance.exports;

const enc = new TextEncoder();
function write(obj) {
  const bytes = enc.encode(JSON.stringify(obj));
  const ptr = malloc(bytes.length);
  new Uint8Array(memory.buffer, ptr, bytes.length).set(bytes);
  return [ptr, bytes.length];
}

const [ip, il] = write({ user: { role: 'admin' } });
const [ap, al] = write({
  type: 'compare', op: 'eq',
  left:  { type: 'ref',   path: ['input', 'user', 'role'] },
  right: { type: 'value', value: 'admin' },
});

console.log(evaluate(ip, il, ap, al)); // 1 (allow)
free(ip); free(ap);

The proxy_* stubs in env are there because proxy-wasm imports must resolve before the wasm module instantiates. evaluate itself doesn't call into them, so dummies are fine.

As an Envoy proxy-wasm filter

Drop the wasm into http_filters. Two important fields: vm_config and configuration.

http_filters:
  - name: envoy.filters.http.wasm
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
      config:
        configuration:
          "@type": type.googleapis.com/google.protobuf.StringValue
          value: |
            {"type":"module","rules":[
              {"type":"rule","name":"allow","default":true,
               "value":{"type":"value","value":false}},
              {"type":"rule","name":"allow","body":[
                {"type":"eq",
                 "left":{"type":"ref","path":["input","method"]},
                 "right":{"type":"value","value":"GET"}}]}
            ]}
        vm_config:
          runtime: envoy.wasm.runtime.v8
          code:
            local:
              filename: /etc/zopa/zopa.wasm

configuration.value is the policy AST as JSON. The example reads "GET passes; everything else denies".

A complete end-to-end sample lives in examples/envoy/; zig build test-envoy runs curl assertions against a real Envoy. CI exercises Node, wasmtime, and a real Envoy on every commit.

Container image

There's also a distroless OCI image. Multi-arch (amd64 / arm64) and cosign keyless-signed.

docker pull ghcr.io/0-draft/zopa:v0.2.0
docker run --rm --entrypoint=ls ghcr.io/0-draft/zopa:v0.2.0 -lh /zopa.wasm

The intended use is staging /zopa.wasm from an initContainer into the Envoy sidecar pod.

Latency

zig build bench runs a zopa-only latency benchmark. On a local M-series Mac with --release=small:

fixture                 |    p50 |    p95 |    p99 |   mean
------------------------+--------+--------+--------+-------
01_static               |  1.79  |  2.96  |  3.46  |  1.73  (μs)
02_header_eq            |  4.42  |  4.96  |  5.17  |  4.48  (μs)

Literal true policy: 1.79 μs at p50. A simple input.method == "GET" style compare: 4.42 μs at p50. Wall-clock direct measurement, 10 000 iterations after 1 000 warmup. No head-to-head against OPA / Cedar yet; cross-engine numbers wait until the conformance corpus is wide enough to honestly assert "same answer".
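For readers who want to reproduce this style of measurement against their own evaluator, the methodology (warm up, time each call with a wall clock, report percentiles and mean) can be sketched in Python. This is not the code behind zig build bench. The bench and percentile helpers are hypothetical, and the nearest-rank percentile definition is an assumption, since the post does not say which one zopa uses.

```python
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list (pct is an integer 1..100)."""
    n = len(sorted_samples)
    rank = max(1, (pct * n + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_samples[rank - 1]


def bench(fn, warmup=1_000, iterations=10_000):
    """Time fn() with a wall clock and return latency statistics in microseconds."""
    # Warmup calls are executed but not recorded.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)  # ns -> μs

    samples.sort()
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "mean": sum(samples) / len(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; swap in a call to the evaluator under test.
    stats = bench(lambda: sum(range(100)))
    for name, value in stats.items():
        print(f"{name:>4}: {value:.2f} μs")
```

Numbers from a Python harness like this include interpreter and timer overhead, so they are not comparable to the figures in the table above.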

Where it sits relative to alternatives

Item              | OPA            | Cedar        | Casbin       | zopa
------------------+----------------+--------------+--------------+------------------
Language          | Go             | Rust         | Go (+ ports) | Zig
Wasm distribution | Yes (heavy)    | No           | No           | Yes (~60KB)
Memory model      | GC             | RC + arenas  | GC           | per-request arena
proxy-wasm        | Side project   | No           | No           | First-class
Policy input      | Rego source    | Cedar source | CSV / source | Compiled AST
Maturity          | CNCF Graduated | Stable       | Mature       | Alpha

Zopa isn't a replacement for OPA when you need full Rego, the management plane, bundle distribution, partial evaluation, or the state API. Use OPA for those. Zopa solves a narrow case: "I can compile the policy elsewhere and just want to evaluate at the edge". For that case, the wasm binary is two orders of magnitude smaller than OPA's.

It fits when "Rego-ish syntax, but OPA is too heavy" and "Cedar / Casbin can't go to wasm" line up at the same time.

Try it and tell me

Zopa's source is 8 files under src/, with no deps outside stdlib, and it's readable top to bottom. It's sized so that if you want to change something, you can fork it and rewrite it.

Feedback I'd love:

"tools/rego2ast.py rejects my policy with Unsupported on this node, please add it"
"proxy-wasm host X (Istio / Kong / APISIX) worked / didn't work like this"
"Cases I'd add to the conformance corpus"
"The allow_body 64 KiB cap is too small / too large"

Reference

Repo: https://github.com/0-draft/zopa
Issues: https://github.com/0-draft/zopa/issues
v0.2.0 release: https://github.com/0-draft/zopa/releases/tag/v0.2.0
OCI image: ghcr.io/0-draft/zopa:v0.2.0
proxy-wasm spec: https://github.com/proxy-wasm/spec
