<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matheus</title>
    <description>The latest articles on DEV Community by Matheus (@matheus_releaserun).</description>
    <link>https://dev.to/matheus_releaserun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3758534%2Ffda69e43-38b0-48a9-8f55-71a14b1c7f3b.png</url>
      <title>DEV Community: Matheus</title>
      <link>https://dev.to/matheus_releaserun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/matheus_releaserun"/>
    <language>en</language>
    <item>
      <title>Rust 1.94.0: array_windows, Cargo Config Includes, and 10 Breaking Changes You Should Know About</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:05:53 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/rust-1940-arraywindows-cargo-config-includes-and-10-breaking-changes-you-should-know-about-5gc7</link>
      <guid>https://dev.to/matheus_releaserun/rust-1940-arraywindows-cargo-config-includes-and-10-breaking-changes-you-should-know-about-5gc7</guid>
      <description>&lt;p&gt;Rust 1.94.0 landed on March 5, 2026. Three headline features and a surprisingly long compatibility notes section.&lt;/p&gt;

&lt;p&gt;Here's what actually matters if you're shipping Rust in production.&lt;/p&gt;

&lt;h2&gt;The Headlines&lt;/h2&gt;

&lt;h3&gt;array_windows Finally Stabilized&lt;/h3&gt;

&lt;p&gt;This one's been cooking since 2020. &lt;code&gt;array_windows&lt;/code&gt; gives you sliding-window iteration over slices with a compile-time-known window size: each window is a fixed-size array reference instead of a runtime-sized slice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Old way: runtime-sized windows, manual indexing&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="nf"&gt;.windows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.any&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;// New way: destructure directly, compiler knows the size&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="nf"&gt;.as_bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.array_windows&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.any&lt;/span&gt;&lt;span class="p"&gt;(|[&lt;/span&gt;&lt;span class="n"&gt;a1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt;&lt;span class="p"&gt;]|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real win here isn't just ergonomics. The compiler can eliminate bounds checks entirely because it knows the window size at compile time. If you're doing any kind of signal processing, pattern matching, or rolling calculations over slices, this is a free performance upgrade.&lt;/p&gt;

&lt;p&gt;The window size is inferred from usage too. That destructuring pattern &lt;code&gt;|[a1, b1, b2, a2]|&lt;/code&gt; tells the compiler you want windows of 4. No need to specify it explicitly.&lt;/p&gt;

&lt;h3&gt;Cargo Config Includes&lt;/h3&gt;

&lt;p&gt;You can now split your &lt;code&gt;.cargo/config.toml&lt;/code&gt; across multiple files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;include&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ci.toml"&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="err"&gt;{&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"local-overrides.toml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;optional&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is genuinely useful for teams. You can keep CI-specific settings, local developer overrides, and shared config separate without fighting merge conflicts in one massive config file. The &lt;code&gt;optional = true&lt;/code&gt; flag means you can have developer-specific files that don't need to exist for everyone.&lt;/p&gt;

&lt;p&gt;Monorepo teams will probably get the most out of this. Think shared build profiles, registry mirrors, or target-specific settings that only some developers need.&lt;/p&gt;
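&lt;p&gt;As a sketch of that monorepo setup (all file names below are hypothetical):&lt;/p&gt;

```toml
# .cargo/config.toml at the workspace root
include = [
    { path = "profiles.toml" },             # shared build profiles
    { path = "mirror.toml" },               # registry mirror used in CI
    { path = "dev.toml", optional = true }, # per-developer overrides, may be absent
]
```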

&lt;p&gt;Also worth noting: Cargo now records a &lt;code&gt;pubtime&lt;/code&gt; field in the registry index, tracking when each crate version was published. This lays groundwork for time-based dependency resolution in the future. crates.io is gradually backfilling existing packages.&lt;/p&gt;

&lt;h3&gt;TOML 1.1 in Cargo&lt;/h3&gt;

&lt;p&gt;Cargo now parses TOML v1.1, which means you can finally write multi-line inline tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before: everything crammed on one line&lt;/span&gt;
&lt;span class="py"&gt;serde&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"derive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"rc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"alloc"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# After: readable and trailing commas allowed&lt;/span&gt;
&lt;span class="py"&gt;serde&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s"&gt;"derive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"rc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"alloc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One catch: if you use TOML 1.1 syntax in your &lt;code&gt;Cargo.toml&lt;/code&gt;, your development MSRV effectively becomes Rust 1.94. Cargo rewrites the manifest on &lt;code&gt;publish&lt;/code&gt; to stay compatible with older parsers, so your users won't be affected. But anyone building your crate from source with an older toolchain will hit parse errors.&lt;/p&gt;

&lt;p&gt;If you're maintaining a library with a strict MSRV policy, hold off on the new syntax for now.&lt;/p&gt;

&lt;h2&gt;Breaking Changes: The Real Release Notes&lt;/h2&gt;

&lt;p&gt;This is where 1.94 gets interesting. Ten compatibility notes, and some of them will bite you.&lt;/p&gt;

&lt;h3&gt;Closure Capturing Behavior Changed&lt;/h3&gt;

&lt;p&gt;The biggest one. The rules for how closures capture variables in pattern-matching contexts have been tightened. Previously, a non-&lt;code&gt;move&lt;/code&gt; closure could capture an entire variable by move in some of those contexts. Now it captures only the parts it actually uses.&lt;/p&gt;

&lt;p&gt;Sounds good in theory, but it can cause new borrow checker errors where code previously compiled fine. It can also change when &lt;code&gt;Drop&lt;/code&gt; runs for partially captured values.&lt;/p&gt;

&lt;p&gt;If you have closures near &lt;code&gt;match&lt;/code&gt; or &lt;code&gt;if let&lt;/code&gt; expressions that suddenly stop compiling after upgrading, this is likely why.&lt;/p&gt;

&lt;h3&gt;Standard Library Macros Import Change&lt;/h3&gt;

&lt;p&gt;Standard library macros (&lt;code&gt;println!&lt;/code&gt;, &lt;code&gt;vec!&lt;/code&gt;, &lt;code&gt;matches!&lt;/code&gt;, etc.) are now imported via the prelude instead of &lt;code&gt;#[macro_use]&lt;/code&gt;. This sounds like an internal change, but it has a visible effect: if you have a custom macro with the same name as a standard library macro and you glob-import it, you'll get an ambiguity error.&lt;/p&gt;

&lt;p&gt;The most common case: if you defined your own &lt;code&gt;matches!&lt;/code&gt; macro and glob-imported it. You'll need an explicit import to resolve which one you mean.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;#![no_std]&lt;/code&gt; code that glob-imports from &lt;code&gt;std&lt;/code&gt;, you might see a new &lt;code&gt;ambiguous_panic_imports&lt;/code&gt; warning because both &lt;code&gt;core::panic!&lt;/code&gt; and &lt;code&gt;std::panic!&lt;/code&gt; are now in scope.&lt;/p&gt;

&lt;h3&gt;dyn Trait Lifetime Casting Restricted&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;dyn&lt;/code&gt; trait objects can no longer freely cast between different lifetime bounds. Shrinking a lifetime is still fine; a cast that would stretch one, like turning &lt;code&gt;dyn Foo + 'a&lt;/code&gt; into a longer-lived &lt;code&gt;dyn Foo + 'b&lt;/code&gt;, is now correctly rejected.&lt;/p&gt;

&lt;h3&gt;Shebang Lines in include!()&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;include!()&lt;/code&gt; in expression context no longer strips shebang lines (&lt;code&gt;#!/...&lt;/code&gt;). If you were including files that start with a shebang, they'll now fail to compile. The fix is to remove the shebang from included files.&lt;/p&gt;

&lt;h3&gt;Other Compat Notes&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous glob reexports&lt;/strong&gt; are now visible cross-crate (may introduce new ambiguity errors in downstream crates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where-clause normalization&lt;/strong&gt; changed in well-formedness checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codegen attributes on body-free trait methods&lt;/strong&gt; now produce a future compatibility warning (they had no effect anyway)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows SystemTime&lt;/strong&gt; changes: &lt;code&gt;checked_sub_duration&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; for times before the Windows epoch (Jan 1, 1601)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifetime identifiers are now NFC normalized&lt;/strong&gt; (e.g. &lt;code&gt;'á&lt;/code&gt; written with combining characters vs precomposed). Edge case, but if you're generating Rust code programmatically, double check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler filename handling overhauled&lt;/strong&gt; for cross-compiler consistency. Paths in diagnostics for local crates in Cargo workspaces are now relative instead of absolute. This can break CI scripts that grep compiler output for absolute paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Stabilized APIs Worth Knowing&lt;/h2&gt;

&lt;p&gt;Beyond &lt;code&gt;array_windows&lt;/code&gt;, a few other stabilizations stand out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;LazyCell::get&lt;/code&gt; and &lt;code&gt;LazyLock::get&lt;/code&gt;&lt;/strong&gt;: Check whether a lazy value has been initialized without forcing it. Useful for conditional logic around cached values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Peekable::next_if_map&lt;/code&gt;&lt;/strong&gt;: Conditionally advance a peekable iterator and transform the value in one step. Cleaner than peek + next + map separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;element_offset&lt;/code&gt;&lt;/strong&gt;: Get the index of an element in a slice from a reference to it. Handy when you have a reference into a slice and need to know where it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;f32/f64::consts::EULER_GAMMA&lt;/code&gt; and &lt;code&gt;GOLDEN_RATIO&lt;/code&gt;&lt;/strong&gt;: Mathematical constants added to the standard library. Minor, but saves you from defining them yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;f32/f64::mul_add&lt;/code&gt; now const&lt;/strong&gt;: Fused multiply-add in const contexts. Useful for compile-time math.&lt;/p&gt;

&lt;h2&gt;Platform and Compiler Notes&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;New tier 3 target: &lt;code&gt;riscv64im-unknown-none-elf&lt;/code&gt; (RISC-V without atomics)&lt;/li&gt;
&lt;li&gt;29 additional RISC-V target features stabilized, covering large parts of RVA22U64 and RVA23U64 profiles&lt;/li&gt;
&lt;li&gt;Unicode 17 support&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BinaryHeap&lt;/code&gt; methods relaxed: some no longer require &lt;code&gt;T: Ord&lt;/code&gt; (for methods that don't need ordering)&lt;/li&gt;
&lt;li&gt;Error messages now use &lt;code&gt;annotate-snippets&lt;/code&gt; internally, so diagnostic output may look slightly different&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Upgrade Recommendation&lt;/h2&gt;

&lt;p&gt;Rust 1.94 is a solid release. &lt;code&gt;array_windows&lt;/code&gt; alone makes it worth upgrading if you do any slice processing. The Cargo improvements are pure quality of life.&lt;/p&gt;

&lt;p&gt;The main risk is the closure capturing change. If you have a large codebase, run &lt;code&gt;cargo check&lt;/code&gt; before deploying and watch for new borrow checker errors around closures. The macro import change is lower risk, but check whether any of your custom macros shadow stdlib names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rustup update stable
cargo check  &lt;span class="c"&gt;# Run this before committing to the upgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For teams on MSRV policies: 1.94 is safe to adopt as your new MSRV if you want the Cargo improvements. If you're maintaining a library, consider waiting one release cycle (until 1.95) to let the closure capturing changes settle and for downstream users to upgrade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReleaseRun Health Grade: A&lt;/strong&gt; (actively maintained, 6-week release cadence, no EOL concerns)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Track Rust and 300+ other technologies at &lt;a href="https://releaserun.com" rel="noopener noreferrer"&gt;releaserun.com&lt;/a&gt;. Get version health grades, EOL alerts, and upgrade recommendations for your entire stack.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;Keep Reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://releaserun.com/rust-releases/" rel="noopener noreferrer"&gt;Rust Release History&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://releaserun.com/how-to-add-version-health-badges-to-your-project/" rel="noopener noreferrer"&gt;How to Add Version Health Badges&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/cargo-health/" rel="noopener noreferrer"&gt;Cargo Dependency Health Checker&lt;/a&gt; — paste your &lt;code&gt;Cargo.toml&lt;/code&gt; and check every crate for deprecation and latest versions. Free.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>programming</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Rust 1.93.0 release notes: SIMD, varargs, and the stuff that breaks builds</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:25:19 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/rust-1930-release-notes-simd-varargs-and-the-stuff-that-breaks-builds-d0j</link>
      <guid>https://dev.to/matheus_releaserun/rust-1930-release-notes-simd-varargs-and-the-stuff-that-breaks-builds-d0j</guid>
      <description>&lt;p&gt;I’ve watched “minor” Rust upgrades stall a release train for one dumb reason. Emscripten flags.&lt;/p&gt;

&lt;p&gt;Rust 1.93.0 lands with real wins for low-level work (SIMD on s390x, C-style variadic functions), plus a few changes that can trip CI in under 60 seconds if you ship WebAssembly or rely on sloppy tests.&lt;/p&gt;

&lt;h2&gt;The 30-second upgrade call&lt;/h2&gt;

&lt;p&gt;Upgrade if you hit FFI edges, ship on IBM Z, or you want stricter diagnostics before prod. Wait a week if your WebAssembly pipeline depends on Emscripten and you cannot spare an afternoon to chase linker flags.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High risk:&lt;/strong&gt; Emscripten unwinding ABI change for panic=unwind. Your build can fail at link time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium risk:&lt;/strong&gt; Stricter #[test] validation. Rust stops ignoring invalid placements and starts erroring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low risk:&lt;/strong&gt; New lints and Cargo quality-of-life changes. You will mostly see warnings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What actually changed (the parts you will notice)&lt;/h2&gt;

&lt;p&gt;This bit me when a “harmless” std behavior change hid in a patch note. &lt;strong&gt;BTreeMap::append&lt;/strong&gt; no longer overwrites existing keys: when the incoming map contains a key you already have, your original value now wins.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;New lints:&lt;/strong&gt; Rust now warns by default on &lt;strong&gt;const_item_interior_mutations&lt;/strong&gt; and &lt;strong&gt;function_casts_as_integer&lt;/strong&gt;. Expect fresh warnings in older codebases with clever const tricks or pointer-ish casts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cargo clean:&lt;/strong&gt; &lt;strong&gt;cargo clean --workspace&lt;/strong&gt; now cleans every package in a workspace, not just the current one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in attributes:&lt;/strong&gt; Rust adds &lt;strong&gt;pin_v2&lt;/strong&gt; to the built-in attribute namespace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future incompat warnings:&lt;/strong&gt; Rust now warns about &lt;strong&gt;...&lt;/strong&gt; parameters without a pattern (outside extern blocks), repr(C) enums with discriminants outside c_int/c_uint, and repr(transparent) that “forgets” an inner repr(C) type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;musl bump:&lt;/strong&gt; The bundled musl version moves to &lt;strong&gt;1.2.5&lt;/strong&gt;. This usually feels boring until your static builds stop matching yesterday’s container image.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;SIMD on s390x: useful, but only if you live there&lt;/h2&gt;

&lt;p&gt;Most teams will not rewrite hot loops this quarter. Good.&lt;/p&gt;

&lt;p&gt;If you run on IBM Z, stabilized s390x vector target features matter because they let you ship one binary that checks CPU features at runtime, then takes the fast path. The macro to look for is &lt;strong&gt;is_s390x_feature_detected!&lt;/strong&gt;. The thing nobody mentions is the boring part: you still need a scalar fallback unless you control every machine you deploy to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Where it pays off:&lt;/strong&gt; tight numeric code, compression, crypto-ish primitives, and batch processing where you touch big buffers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where it wastes time:&lt;/strong&gt; request routing, JSON glue, anything dominated by syscalls or allocations.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If you cannot test your CNI in staging, you should not be running Kubernetes. Same energy here. If you cannot test on the actual CPU, do not pretend your SIMD change helped.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;C-style variadic functions: great for FFI, still a foot-gun&lt;/h2&gt;

&lt;p&gt;Variadics make FFI wrappers less awkward. They do not make them safe.&lt;/p&gt;

&lt;p&gt;Rust 1.93.0 stabilizes declaring C-style variadic functions for the &lt;strong&gt;system&lt;/strong&gt; ABI, which helps when you need to bind to APIs like printf-style functions. In most cases, you should wrap the variadic call in a tiny unsafe boundary and expose a non-variadic Rust API to the rest of your crate. Some folks skip that and export varargs directly. I do not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Good pattern:&lt;/strong&gt; keep the extern varargs signature private, then build typed wrappers around the handful of formats you actually need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bad pattern:&lt;/strong&gt; re-export varargs in your public Rust API and hope callers pass the right types on every platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Breaking changes that deserve a real test run&lt;/h2&gt;

&lt;p&gt;Here’s the one that will ruin your afternoon. Emscripten.&lt;/p&gt;

&lt;p&gt;Rust changed the Emscripten unwinding ABI from JS exception handling to wasm exception handling when you compile with &lt;strong&gt;panic=unwind&lt;/strong&gt;. If you link C or C++ objects, you now need to pass &lt;strong&gt;-fwasm-exceptions&lt;/strong&gt; to the linker. If you build wasm once a month, you will forget this and rediscover it the hard way.&lt;/p&gt;
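&lt;p&gt;One way to make the flag hard to forget is to pin it in config. A sketch (adjust the target section to the Emscripten target you actually build):&lt;/p&gt;

```toml
# .cargo/config.toml
[target.wasm32-unknown-emscripten]
# Needed with panic=unwind on Rust 1.93+ when linking C or C++ objects.
rustflags = ["-C", "link-arg=-fwasm-exceptions"]
```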

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;#[test] validation:&lt;/strong&gt; Rust now errors when you slap &lt;strong&gt;#[test]&lt;/strong&gt; on structs, trait methods, or other invalid spots. Older code that “worked” only worked because Rust ignored it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deref_nullptr lint:&lt;/strong&gt; Rust upgrades &lt;strong&gt;deref_nullptr&lt;/strong&gt; to deny-by-default. Builds can fail where they used to warn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;offset_of! macro:&lt;/strong&gt; offset_of! now validates user-written types for well-formedness. Code that relied on sketchy layouts might stop compiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cargo publish output:&lt;/strong&gt; cargo publish no longer leaves .crate files as a final artifact when build.build-dir is unset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Known issues and the “annoying but real” corner cases&lt;/h2&gt;

&lt;p&gt;I’ve seen this show up as a clean compile on one machine and a dead build in CI. Environment variables.&lt;/p&gt;

&lt;p&gt;A Cargo environment variable change around &lt;strong&gt;CARGO_CFG_DEBUG_ASSERTIONS&lt;/strong&gt; can break projects that depend on &lt;strong&gt;static-init&lt;/strong&gt; versions 1.0.1 to 1.0.3, typically with an unresolved module-style error. If your dependency tree includes static-init, test this upgrade before you merge a toolchain bump across all repos.&lt;/p&gt;

&lt;p&gt;Other stuff in this release: dependency bumps, some image updates, the usual.&lt;/p&gt;

&lt;h2&gt;How I’d roll this out (without drama)&lt;/h2&gt;

&lt;p&gt;Pin the toolchain first. Seriously.&lt;/p&gt;

&lt;p&gt;Update locally, then run the exact commands your CI runs, not the “happy path” build you do on your laptop. For prod systems, test this twice. For a dev sandbox, sure, yolo it on Friday.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Update:&lt;/strong&gt; rustup update stable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirm:&lt;/strong&gt; rustc --version and check for 1.93.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build:&lt;/strong&gt; cargo build --release&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test:&lt;/strong&gt; cargo test and watch for new #[test] errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scan warnings:&lt;/strong&gt; pay attention to const_item_interior_mutations and function_casts_as_integer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI:&lt;/strong&gt; align every runner image and container to the same toolchain&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Official notes&lt;/h2&gt;

&lt;p&gt;Read the upstream release notes on GitHub if you ship to weird targets or you maintain a library. That page holds the exact wording and linked PRs.&lt;/p&gt;

&lt;p&gt;Anyway.&lt;/p&gt;

&lt;h2&gt;Keep Reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/rust-releases/"&gt;Rust Release History&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/rust-1-93-function-casts-as-integer-lint/"&gt;function_casts_as_integer lint in Rust 1.93.0: How to Use&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Frequently Asked Questions&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What are the biggest changes in Rust 1.93.0?&lt;/strong&gt; Three headline features: (1) SIMD intrinsics for s390x architecture (niche but important for IBM mainframe teams), (2) C-style variadic functions using extern "C" fn with ... syntax for better FFI interop, and (3) several lint changes that may cause existing code to fail compilation. The lint changes are what most teams will actually notice during upgrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Will Rust 1.93.0 break my existing code?&lt;/strong&gt; Possibly. The new function_casts_as_integer lint warns by default on code that casts function pointers to integers; check for patterns like fn_ptr as usize in your codebase. The changes more likely to fail a build outright: deref_nullptr is now deny-by-default, and #[test] on invalid items (structs, trait methods) is now a hard error. Run cargo build with the new version in CI before committing to the upgrade.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I safely upgrade Rust in CI?&lt;/strong&gt; Pin your Rust version in rust-toolchain.toml (e.g., channel = "1.93.0"). Before upgrading: (1) Run cargo clippy with the new version to catch new warnings, (2) Run your full test suite, (3) Check for deprecation warnings that became errors. If anything breaks, you can temporarily allow specific lints with #[allow(lint_name)] while you fix the underlying code. Never upgrade Rust and merge in the same PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What are C-style variadic functions in Rust 1.93.0 used for?&lt;/strong&gt; They let you write Rust functions that accept a variable number of arguments, matching C's printf(const char*, ...) pattern. This is primarily useful for FFI (Foreign Function Interface) - if you're writing a Rust library that needs to expose a C-compatible API, you can now define variadic functions directly instead of using workarounds. For pure Rust code, macros remain the idiomatic way to handle variable arguments.&lt;/li&gt;
&lt;/ul&gt;
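&lt;p&gt;The pinning mentioned above is one small file at the repo root (the components list is optional and just an example):&lt;/p&gt;

```toml
# rust-toolchain.toml — rustup picks this up for every build in the repo
[toolchain]
channel = "1.93.0"
components = ["clippy", "rustfmt"]
```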

&lt;h2&gt;Related Reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/rust-1-93-function-casts-as-integer-lint/" rel="noopener noreferrer"&gt;function_casts_as_integer Lint in Rust 1.93&lt;/a&gt; - How to fix the new lint&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/upgrade-rust-safely-rustup-toolchains-ci-pinning/" rel="noopener noreferrer"&gt;How to Upgrade Rust Safely&lt;/a&gt; - Without breaking CI&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/how-to-add-version-health-badges-to-your-project/" rel="noopener noreferrer"&gt;How to Add Version Health Badges&lt;/a&gt; - Track release health in your README&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/cargo-health/" rel="noopener noreferrer"&gt;Cargo Dependency Health Checker&lt;/a&gt; — paste your &lt;code&gt;Cargo.toml&lt;/code&gt; and check every crate for deprecation and latest versions. Free.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>programming</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Node.js 25.6.0 Release Notes: What Breaks, What Changed, What I’d Test</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:24:42 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/nodejs-2560-release-notes-what-breaks-what-changed-what-id-test-3bg4</link>
      <guid>https://dev.to/matheus_releaserun/nodejs-2560-release-notes-what-breaks-what-changed-what-id-test-3bg4</guid>
      <description>&lt;p&gt;Another “maintenance” release. What broke this time, and why does it touch async tracking, networking headers, URL parsing, and OpenSSL?&lt;/p&gt;

&lt;p&gt;I’ve watched teams ship patch and minor updates on Friday, then spend Saturday bisecting TLS handshakes and weird latency spikes. Node.js 25.6.0 looks useful. It also pokes several sharp edges at once, so I would not treat it as a free win.&lt;/p&gt;

&lt;h2&gt;Concerns first: the stuff the changelog won’t warn you about&lt;/h2&gt;

&lt;p&gt;This bit me before.&lt;/p&gt;

&lt;p&gt;The release notes say a lot about new knobs, and almost nothing about the boring failure modes. If you run Node in production, those failure modes matter more than the feature bullets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;“No known issues” does not mean “safe”:&lt;/strong&gt; The official notes list no known issues, but that just means nobody wrote them down there. I do not trust “known issues: none” from any project, especially right after a release drops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promise tracking can turn into a tax:&lt;/strong&gt; &lt;code&gt;async_hooks&lt;/code&gt; instrumentation often costs CPU and memory in promise-heavy code. Node adds a &lt;code&gt;trackPromises&lt;/code&gt; option, which is great, but the notes do not give overhead numbers. You need to measure your own workload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOS socket controls vary by OS:&lt;/strong&gt; Setting Type of Service sounds simple until you hit platform differences, privilege constraints, and “best effort” behavior. The release notes do not give you a support matrix in one place. Assume surprises unless you test on your exact fleet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSSL bumps change real behavior:&lt;/strong&gt; Even when nobody calls it a breaking change, TLS stacks change. Cipher support, defaults, and edge-case handshakes can shift. If you talk to legacy endpoints, run handshake tests before you celebrate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL parser updates can change parsing outcomes:&lt;/strong&gt; Updating the Ada URL parser to a new version can change how weird inputs normalize. If you sign URLs, compare canonical forms, or parse user-supplied URLs, you should run a corpus test.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So what actually changed in Node.js 25.6.0?
&lt;/h2&gt;

&lt;p&gt;Here’s the clean list, with the parts I’d pay attention to.&lt;/p&gt;

&lt;p&gt;Node.js shipped v25.6.0 on Feb 3, 2026. The headline items include promise lifecycle tracking in &lt;code&gt;async_hooks&lt;/code&gt;, a new Type of Service API on sockets, initial ESM support for embedders, and a handful of runtime and dependency updates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;async_hooks: Promise lifecycle tracking:&lt;/strong&gt; Node adds a &lt;code&gt;trackPromises&lt;/code&gt; option to &lt;code&gt;async_hooks.createHook()&lt;/code&gt; so you can observe promise creation and settlement. They claim it helps reduce overhead when you do not need promise execution tracking, but you still need to validate overhead when you enable it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;net: Type of Service on sockets:&lt;/strong&gt; Node adds socket TOS controls via &lt;code&gt;socket.setTypeOfService(tos)&lt;/code&gt; and &lt;code&gt;socket.getTypeOfService()&lt;/code&gt;. The part to remember: this depends on the OS and network stack. Test it. Do not assume it changes packet handling in your environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedder API: initial ESM entry points:&lt;/strong&gt; Embedders get initial support for loading ESM. “Initial” usually means “expect sharp corners,” so if you ship a custom embedder, plan time to read the PR and try it against your module loader setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;stream consumers: bytes():&lt;/strong&gt; &lt;code&gt;node:stream/consumers&lt;/code&gt; gains &lt;code&gt;bytes()&lt;/code&gt; to collect stream data into a &lt;code&gt;Uint8Array&lt;/code&gt;. If your code expects a &lt;code&gt;Buffer&lt;/code&gt;, check your call sites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;test_runner: env option:&lt;/strong&gt; &lt;code&gt;test_runner.run()&lt;/code&gt; gets an &lt;code&gt;env&lt;/code&gt; option for isolated test environment variables. This can fix “why did CI pass but local failed” problems, unless your tests secretly depend on inherited env state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance: TextEncoder:&lt;/strong&gt; Node improves &lt;code&gt;TextEncoder.encode&lt;/code&gt; using simdutf. Great, but measure on your CPU and container base image. SIMD paths can behave differently across architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;url: Ada parser update:&lt;/strong&gt; Node updates Ada to 3.4.2 with Unicode 17 support. If you do URL-heavy work, run regression tests against real inputs, not just “happy path” URLs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies:&lt;/strong&gt; undici 7.19.2, corepack 0.34.6, nghttp3 1.15.0, ngtcp2 1.20.0, and OpenSSL 3.5.5. Dependency bumps cause most of the “nothing changed” outages I’ve seen, because they change behavior outside your diff.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I’d test in staging before I let this near production
&lt;/h2&gt;

&lt;p&gt;Test this twice.&lt;/p&gt;

&lt;p&gt;I’d wait a week for ecosystem noise, then I’d do a canary. Some folks skip canaries for minor releases. I do not, because I like sleeping.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS sanity:&lt;/strong&gt; Run a handshake suite against every external dependency that still scares you. Old proxies, legacy APIs, that one vendor endpoint that only fails in one region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP client behavior (undici):&lt;/strong&gt; Hit your highest-QPS routes and watch connection reuse, timeouts, and error rates. If you pin undici behavior indirectly through frameworks, this matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async hot paths:&lt;/strong&gt; If you plan to use &lt;code&gt;trackPromises&lt;/code&gt;, load test with it on and off. Watch heap growth and p95 latency. If it adds 5 ms to a hot endpoint, you will feel it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL parsing regression corpus:&lt;/strong&gt; Feed the new URL parser the ugliest URLs you see in logs. Compare normalized outputs if you do redirects, signing, allowlists, or cache keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network TOS verification:&lt;/strong&gt; If you actually need TOS, capture packets and confirm the DSCP/TOS bits show up. “API exists” does not equal “network respects it.”&lt;/li&gt;
&lt;/ul&gt;
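&lt;p&gt;The URL corpus harness can stay tiny. Capture canonical forms on your current version, commit the output, and diff it after the upgrade. The inputs below are stand-ins; pull real ones from your logs:&lt;/p&gt;

```javascript
// Tiny URL normalization corpus. Save this output on your current Node
// version, then diff against 25.6.0's output after the Ada update.
const corpus = [
  'https://example.com/a/../b',        // dot-segment resolution
  'https://EXAMPLE.com:443/%7Euser',   // host case, default port, percent-encoding
  'https://example.com:8080/Path#Frag' // non-default port and case preserved
];

const canonical = corpus.map((raw) => new URL(raw).href);
console.log(JSON.stringify(canonical, null, 2));
```

&lt;p&gt;Any diff in that output is not automatically wrong, but if you sign URLs or use them as cache keys, it is exactly the kind of change that turns into a 2 a.m. page.&lt;/p&gt;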

&lt;blockquote&gt;
&lt;p&gt;Do not claim “no known issues.” Say “none listed in the official notes as of today,” then keep a rollback plan.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Recommendation (grudgingly)
&lt;/h2&gt;

&lt;p&gt;I’d wait.&lt;/p&gt;

&lt;p&gt;If you need promise lifecycle visibility right now, or you have a clear use case for Type of Service tagging, try 25.6.0 in staging and canary it into one slice of traffic. If your app runs fine on 25.5.x and you are not chasing one of these features, give it 7 days, watch for issue reports, then roll it out with a quick rollback path. Other stuff in this release: dependency bumps, some parser changes, the usual.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/node-20-vs-22-vs-24-which-node-js-lts-should-you-run-in-production/" rel="noopener noreferrer"&gt;Node 20 vs 22 vs 24: Which LTS to Run&lt;/a&gt; - The version decision for production Node.js&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/nodejs-20-end-of-life-migration-playbook/" rel="noopener noreferrer"&gt;Node.js 20 End of Life: Migration Playbook&lt;/a&gt; - EOL April 30, 2026&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/package-json-health/" rel="noopener noreferrer"&gt;npm Package Health Checker&lt;/a&gt; — paste your &lt;code&gt;package.json&lt;/code&gt; and check every dependency for deprecation and staleness. Free.&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>VS Code 1.109.0 Release Notes: Claude Agents, Integrated Browser, and the Stuff People Actually Mention</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:24:07 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/vs-code-11090-release-notes-claude-agents-integrated-browser-and-the-stuff-people-actually-jgb</link>
      <guid>https://dev.to/matheus_releaserun/vs-code-11090-release-notes-claude-agents-integrated-browser-and-the-stuff-people-actually-jgb</guid>
      <description>&lt;p&gt;Reddit's already arguing about this one.&lt;/p&gt;

&lt;p&gt;The consensus seems to be "yeah, upgrade," mostly because 1.109 tightens Copilot Chat and adds an integrated browser preview, but Linux folks keep side-eyeing the Snap packaging situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community take: what people are saying this week
&lt;/h2&gt;

&lt;p&gt;I've watched teams treat VS Code updates like Chrome updates. They just happen, until they don't.&lt;/p&gt;

&lt;p&gt;On the k8s and devtools Slacks, the vibe around 1.109 feels practical: frontend folks like the in-editor browser, AI-heavy teams like having Claude in the mix, and ops-minded people immediately ask "where do the keys go?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most teams are upgrading for Copilot Chat and agents:&lt;/strong&gt; The chatter I see centers on "agent sessions feel less sticky now" and "streaming feels snappier," not on big editor changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux users keep bringing up Snap disk usage:&lt;/strong&gt; Some teams report deleted files piling up in a snap-local Trash folder and eating disk. Others say "just use .deb and move on." Either way, do not write "no known issues."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend devs like the integrated browser preview:&lt;/strong&gt; As one SRE put it, "anything that kills the alt-tab loop is worth trying," but they still keep Chrome open for real debugging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So. If you run VS Code via Snap on Linux, read the community reports before you hit update.&lt;/p&gt;

&lt;p&gt;If you install via .deb/.rpm or you sit on macOS/Windows, you probably won't notice drama. You'll just notice new toys.&lt;/p&gt;

&lt;h2&gt;
  
  
  Official changelog recap (what 1.109.0 actually ships)
&lt;/h2&gt;

&lt;p&gt;The official notes call out three big buckets: chat improvements, multi-agent workflows, and the new integrated browser preview.&lt;/p&gt;

&lt;p&gt;They also sneak in a couple of operational changes people miss until their terminal stops working on an old Windows VM.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Agent support (Preview):&lt;/strong&gt; VS Code adds Claude agent support through Anthropic integration in Copilot Chat. Expect "preview" sharp edges, and expect your security team to ask where the Anthropic API key lives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated browser (Preview):&lt;/strong&gt; VS Code can open a browser inside the workbench with DevTools. You can test localhost flows and keep it in a tab next to your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat UX changes:&lt;/strong&gt; The notes mention faster streaming and better reasoning display. You feel this as "less waiting for the full blob," not as magic correctness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two more official bits matter for upgrades, even if they don't sound exciting.&lt;/p&gt;

&lt;p&gt;VS Code deprecates the old Copilot extension in favor of Copilot Chat, and VS Code also removes winpty support, which can hit older Windows installs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Agents in VS Code: what actually changes in your workflow
&lt;/h2&gt;

&lt;p&gt;This is the headline feature and it deserves more than a bullet point. Here's what you're actually getting.&lt;/p&gt;

&lt;p&gt;Claude integration in VS Code 1.109 means you can select Claude models (Sonnet, Opus) as your chat provider inside Copilot Chat. Previously, Copilot was locked to OpenAI models. Now you pick your model in the chat panel dropdown - no extension swapping, no separate window.&lt;/p&gt;

&lt;p&gt;The practical impact depends on what you do with chat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code review prompts:&lt;/strong&gt; Claude tends to catch more architectural issues and explain trade-offs in more depth than GPT-4 in my experience. If you use chat for "review this PR diff," Claude is worth trying here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-file refactoring:&lt;/strong&gt; The agent mode lets Claude make edits across multiple files in one session. This is where "agent" actually means something - it proposes changes, you approve them, and it moves to the next file without losing context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test generation:&lt;/strong&gt; Claude's test output tends to be less boilerplate-heavy. If your previous experience with Copilot test generation was "it writes tests that test the mock, not the behavior," Claude does better here in most cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup is straightforward but the security implications aren't trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// settings.json - add your Anthropic API key
{
  "github.copilot.chat.models": ["claude-sonnet-4-20250514"],
  "anthropic.apiKey": "sk-ant-..."  // or use env var
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security note that matters:&lt;/strong&gt; Your API key lives in settings.json by default. If you sync settings across machines (which most people do), that key syncs too. Use an environment variable instead (&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;) or configure it per-workspace in &lt;code&gt;.vscode/settings.json&lt;/code&gt; and add that file to &lt;code&gt;.gitignore&lt;/code&gt;. Do not commit API keys. This sounds obvious until you realize VS Code Settings Sync makes it non-obvious.&lt;/p&gt;

&lt;p&gt;Also worth knowing: Claude in VS Code sends your code context to Anthropic's API. If you work on proprietary code, check your org's data handling policy before enabling this. The context window includes the active file, selected text, and referenced files - it's not sending your entire workspace, but it's sending more than people usually realize.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrated browser: killing the alt-tab loop (mostly)
&lt;/h2&gt;

&lt;p&gt;The integrated browser preview lets you open a Chromium-based browser tab inside VS Code, complete with DevTools. You get this via the Command Palette (&lt;code&gt;Ctrl+Shift+P&lt;/code&gt; → "Simple Browser: Show") or by clicking a URL in the terminal.&lt;/p&gt;

&lt;p&gt;What it actually does well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Localhost testing without leaving the editor:&lt;/strong&gt; If you're running a dev server on &lt;code&gt;localhost:3000&lt;/code&gt;, you can preview it in a VS Code tab. CSS changes, component renders, API responses - all visible without switching windows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevTools in the same pane:&lt;/strong&gt; The embedded browser includes a basic DevTools panel. Network tab, console, elements inspector. Good enough for "why is this API call failing?" checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Side-by-side layout:&lt;/strong&gt; Split your editor left, browser right. Edit a React component, see it render immediately. This is genuinely useful for UI work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it doesn't replace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chrome DevTools for serious debugging:&lt;/strong&gt; The embedded DevTools are a subset. No Performance tab, no Lighthouse, no Application panel for service workers. For real performance work, you still need a full browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-browser testing:&lt;/strong&gt; It's Chromium only. If you need to test Firefox or Safari rendering, this doesn't help.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth flows involving redirects:&lt;/strong&gt; OAuth redirects, SSO flows, anything that bounces you through multiple domains - the embedded browser handles these unpredictably. Test auth flows in a real browser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical verdict: great for fast feedback loops on UI changes, not a replacement for your actual browser. Think of it as a better Live Server, not a better Chrome.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's changing under the hood (the stuff that breaks things)
&lt;/h2&gt;

&lt;p&gt;Two changes in 1.109, one deprecation and one removal, create real tickets if you miss them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old Copilot extension deprecated:&lt;/strong&gt; If your org installed the standalone "GitHub Copilot" extension separately from "GitHub Copilot Chat," the standalone one is now deprecated. It still works in 1.109 but expect it to stop working in a future release. The migration path is to use Copilot Chat for everything. If you manage a fleet, check which extension ID your deployment scripts install - &lt;code&gt;github.copilot&lt;/code&gt; (old) vs &lt;code&gt;github.copilot-chat&lt;/code&gt; (current).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;winpty removed:&lt;/strong&gt; VS Code drops winpty terminal support, which was the fallback for terminals on older Windows systems. If you run VS Code on Windows Server 2016 or earlier, or on any machine where ConPTY isn't available, your integrated terminal breaks silently. The fix is to use a Windows version that supports ConPTY (Windows 10 1809+), but "upgrade Windows" isn't always an option in enterprise environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.108 vs 1.109: what actually changed between versions
&lt;/h2&gt;

&lt;p&gt;If you're on 1.108 and wondering whether to jump, here's what the delta looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.108 → 1.109 gains:&lt;/strong&gt; Claude agent support, integrated browser preview, faster chat streaming, multi-agent session management, Copilot extension consolidation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.108 → 1.109 losses:&lt;/strong&gt; winpty terminal support (Windows), standalone Copilot extension (deprecated, still functional).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stability:&lt;/strong&gt; 1.108.2 was a recovery build (see our &lt;a href="https://releaserun.com/vscode-1-108-2-release-notes/" rel="noopener noreferrer"&gt;1.108.2 analysis&lt;/a&gt;), which means 1.108 had bumps. 1.109.0 ships without a recovery build so far - that's a good sign but not a guarantee.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you skipped 1.108 entirely (waited on 1.107), you're getting two releases of accumulated changes. Read our &lt;a href="https://releaserun.com/vscode-1-108-release-notes/" rel="noopener noreferrer"&gt;1.108 analysis&lt;/a&gt; too.&lt;/p&gt;

&lt;h2&gt;
  
  
  My synthesis: who should upgrade now, who should test first
&lt;/h2&gt;

&lt;p&gt;Upgrade if you use chat daily.&lt;/p&gt;

&lt;p&gt;I do not think "feature release, upgrade immediately" is always smart, but 1.109 lands in the category where most teams won't regret it, unless you sit on a brittle packaging path or an old Windows baseline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade now if you live in Copilot Chat:&lt;/strong&gt; If your day includes "ask for a refactor," "write tests," and "explain this stacktrace," you'll notice the streaming and session workflow tweaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test first if you manage a fleet image:&lt;/strong&gt; If you bake VS Code into a golden image, test the Copilot extension deprecation behavior. Extensions disappearing during an update creates fun tickets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be paranoid on Linux Snap:&lt;/strong&gt; If your devs install from Snap, you should probably prefer .deb/.rpm for now, or at least warn people to check disk usage and Trash behavior after updating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hold on Windows Server 2016 or earlier:&lt;/strong&gt; The winpty removal means your terminal may break. Test before rolling out.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Ignore the GitHub commit count. It's a vanity metric. I care about "does my terminal still work," and "did my extensions behave."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to upgrade (and what I check right after)
&lt;/h2&gt;

&lt;p&gt;Keep it boring.&lt;/p&gt;

&lt;p&gt;Restart VS Code when it prompts you, then do a quick smoke test before you trust it with an incident.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confirm the version:&lt;/strong&gt; Open Help, then About, and verify you see 1.109.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test the terminal:&lt;/strong&gt; Open a terminal, run &lt;code&gt;node -v&lt;/code&gt; or &lt;code&gt;python --version&lt;/code&gt;, and make sure the shell actually starts. On Windows, check that ConPTY is working.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test chat:&lt;/strong&gt; Ask Copilot Chat to explain a small function. Watch streaming. If it lags or stalls, you'll notice immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check extensions:&lt;/strong&gt; Open the Extensions panel and verify no extensions are disabled or showing errors. Pay special attention to the Copilot extension status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try Claude if enabled:&lt;/strong&gt; Switch the model picker to Claude, send a prompt, confirm it responds. Check that your API key isn't visible in Settings Sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick usage examples (the stuff people will try first)
&lt;/h2&gt;

&lt;p&gt;People won't read a long guide before clicking buttons.&lt;/p&gt;

&lt;p&gt;They'll try the browser tab, then they'll try Claude, then they'll ask why auth or keys feel weird.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try the integrated browser preview:&lt;/strong&gt; Open a local web app, then open it in the integrated browser. Use DevTools to check network calls, especially auth redirects, because embedded browsers love to surprise you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try Claude in chat (if your org allows it):&lt;/strong&gt; Add your Anthropic key per your org policy, then select a Claude model for review-style prompts. Keep an eye on what context you share. "Paste the whole repo" turns into a compliance conversation fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run two agent sessions on purpose:&lt;/strong&gt; Put one agent on "write tests," another on "document behavior," then see if session switching feels sane. This is where the update pays off if you actually work that way.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Known issues (official vs community)
&lt;/h2&gt;

&lt;p&gt;The official notes do not list a known-issues section.&lt;/p&gt;

&lt;p&gt;The community still reports issues, especially around VS Code on Linux via Snap and disk usage related to Trash behavior. If you hit that, switch install methods and move on with your life.&lt;/p&gt;

&lt;p&gt;Additional issues worth watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude rate limiting:&lt;/strong&gt; If you're on a free Anthropic tier, you'll hit rate limits fast when using Claude as your primary chat model. The error messages in VS Code aren't always clear about this - it looks like a timeout, not a rate limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings Sync + API keys:&lt;/strong&gt; As mentioned above, API keys in settings.json sync across machines. If you're on a shared machine or syncing to a personal device, audit what's in your synced settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extension conflicts:&lt;/strong&gt; If you have both the old Copilot extension and Copilot Chat installed, some users report duplicate suggestions or chat panel confusion. Uninstall the old one explicitly.&lt;/li&gt;
&lt;/ul&gt;





&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://releaserun.com/vscode-1-108-2-release-notes/" rel="noopener noreferrer"&gt;VS Code 1.108.2 recovery build&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://releaserun.com/vscode-1-108-release-notes/" rel="noopener noreferrer"&gt;VS Code 1.108.0 upgrade plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://releaserun.com/badges/" rel="noopener noreferrer"&gt;Version health badges for all your tools&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vscode</category>
      <category>programming</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Kubernetes 1.36 apiserver /readyz now waits for watch cache</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:23:31 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/kubernetes-136-apiserver-readyz-now-waits-for-watch-cache-34j0</link>
      <guid>https://dev.to/matheus_releaserun/kubernetes-136-apiserver-readyz-now-waits-for-watch-cache-34j0</guid>
      <description>&lt;p&gt;Test first. If you run production traffic, treat this as a control-plane behavior change, not a feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  should you care? verdict
&lt;/h2&gt;

&lt;p&gt;Yes, you should care. This changes when kube-apiserver admits it is ready, and your automation will notice.&lt;/p&gt;

&lt;p&gt;In my experience, the worst control-plane outages start with a “green” health check and a pile of controllers doing list+watch at the same time. This release nudges the apiserver toward honest readiness. That is good. It can still bite you if your probes or load balancer health checks assume startup always stays under 10 seconds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade stance:&lt;/strong&gt; test in a disposable cluster first. Watch your apiserver readiness time, restart count, and error rates after deploying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What breaks first:&lt;/strong&gt; aggressive liveness or external health checks that kill the apiserver before it finishes warming watch cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What gets better:&lt;/strong&gt; fewer “Ready but actually not ready” windows that trigger thundering-herd list+watch traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  should you care? apiserver readiness waits for watch cache init (PR #135777)
&lt;/h2&gt;

&lt;p&gt;You will see this. Your /readyz can stay red longer.&lt;/p&gt;

&lt;p&gt;PR #135777 enables WatchCacheInitializationPostStartHook by default. kube-apiserver will not report ready until it initializes the watch cache, instead of letting it settle later. Read the PR if you want the gory details. It is a small default change with big operational side effects. Your mileage may vary depending on how large your cluster is and how noisy your controllers are.&lt;/p&gt;

&lt;p&gt;Here’s the thing nobody mentions in release notes. A lot of “control-plane automation” treats slow readiness as a failure and responds by killing the pod. That works great until Kubernetes starts doing more work before readyz flips green. Then you get a boot loop you created yourself. I cannot point to an upstream “this will restart-loop you” bug report yet, so treat it as an operator risk. Still. I have seen enough probe configs to know it happens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor:&lt;/strong&gt; apiserver restart count during rollout, and how long /readyz stays non-200 after a process start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to check right after deploy:&lt;/strong&gt; controller error rates and request latency. A thundering herd shows up as a fat tail on apiserver request duration and a spike in inflight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to fix before blaming Kubernetes:&lt;/strong&gt; any external load balancer health check that marks the apiserver dead faster than your slowest control-plane node can warm its caches.&lt;/li&gt;
&lt;/ul&gt;
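&lt;p&gt;To put a number on "how long /readyz stays non-200," a small poller in whatever language you have handy does the job. A sketch in JavaScript, assuming you exposed the endpoint locally (the &lt;code&gt;kubectl proxy&lt;/code&gt; address below is a placeholder) and run it on Node 18+ for built-in &lt;code&gt;fetch&lt;/code&gt;:&lt;/p&gt;

```javascript
// Measure how long /readyz stays non-200 after an apiserver restart.
// Assumes the endpoint is reachable locally, e.g. via `kubectl proxy`;
// the default URL below is a placeholder for your setup.
const READYZ_URL = process.env.READYZ_URL || 'http://127.0.0.1:8001/readyz';

async function timeUntilReady(url, { intervalMs = 500, timeoutMs = 120_000 } = {}) {
  const start = Date.now();
  for (;;) {
    const elapsed = Date.now() - start;
    if (elapsed >= timeoutMs) {
      throw new Error(`readyz never returned 200 within ${timeoutMs} ms`);
    }
    try {
      const res = await fetch(url);
      if (res.status === 200) return elapsed; // ms until first 200
    } catch {
      // connection refused while the process is still coming up; keep polling
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}

// Example usage:
// timeUntilReady(READYZ_URL).then((ms) => console.log(`ready after ${ms} ms`));
```

&lt;p&gt;Restart an apiserver in your disposable cluster, run the poller, and write the number down. That becomes the floor for any liveness probe or load balancer health check timeout you configure.&lt;/p&gt;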

&lt;h2&gt;
  
  
  should you care? watch_list_duration_seconds goes Beta (PR #136086)
&lt;/h2&gt;

&lt;p&gt;This matters. You can alert on it without feeling silly.&lt;/p&gt;

&lt;p&gt;PR #136086 graduates watch_list_duration_seconds to Beta. That gives you a more stable target for SLOs around watch-list behavior. If you run large informer fleets, or you have operators that “helpfully” relist the world every few minutes, this metric helps you stop guessing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Do this before testing 1.36:&lt;/strong&gt; baseline watch_list_duration_seconds in your current version. Capture p50 and p95 for a normal day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting posture:&lt;/strong&gt; start with a paging threshold only if you already page on apiserver latency. Otherwise, route it to a ticket first and tighten later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After deploying:&lt;/strong&gt; check your error rates, then check watch_list_duration_seconds. If both jump together, you found a real control-plane problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  should you care? declarative validation can fail closed (PR #136117, KEP-5073)
&lt;/h2&gt;

&lt;p&gt;Yes. This can turn a panic into user-visible 500s.&lt;/p&gt;

&lt;p&gt;PR #136117 adds WithDeclarativeNative so strategy.go code can opt into DV-native validations. The sharp edge is intentional. When DV-native rules exist, generated validation code can run even if the DeclarativeValidation feature gate is disabled. If the authoritative declarative validator panics, Kubernetes fails closed and returns InternalError. That trades availability for correctness. In an API server, I agree with that trade most days.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What you will see:&lt;/strong&gt; InternalError on create or update, correlated with apiserver stack traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor:&lt;/strong&gt; apiserver 5xx rate by resource and verb. If your 5xx jumps right after upgrade, do not shrug and blame clients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational caveat:&lt;/strong&gt; a fail-closed validator can block writes. Plan a rollback path for the control plane. Do not discover that during an incident.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  should you care? list/watch memory spikes and the 10x swing
&lt;/h2&gt;

&lt;p&gt;I have watched apiserver RSS climb for seven minutes straight. It looks like a leak. It usually is not.&lt;/p&gt;

&lt;p&gt;The Kubernetes API streaming work shows why list behavior hurts under load. In a synthetic test from the upstream blog, kube-apiserver memory stabilized around ~2 GB with watch-list patterns enabled versus ~20 GB without. That is a test, not a promise for your cluster. Still, the direction matches what I see in real incidents. List-heavy clients punish apiservers with big transient allocations, then you pray the OOM killer picks the right process.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What to monitor:&lt;/strong&gt; apiserver RSS and allocation pressure during informer resyncs. Pair it with watch_list_duration_seconds so you can tell “slow watch-list” from “just memory churn.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to do if it climbs:&lt;/strong&gt; slow the rollout, reduce controller concurrency if you can, and check for misbehaving operators spamming list calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opinion:&lt;/strong&gt; ignore GitHub commit counts. Watch your apiserver graphs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  should you care? how to test 1.36 alpha without ruining your week
&lt;/h2&gt;

&lt;p&gt;Do it in kind first. Then do it again with your real monitoring.&lt;/p&gt;

&lt;p&gt;The release schedule lists 2026-02-18 for 1.36.0-alpha.2. Schedules slip. Tags show up late. Verify the image exists before you assume it does. If you run kubeadm or a managed provider, you will wait for your distro anyway.&lt;/p&gt;

&lt;p&gt;Use a disposable cluster. Keep it boring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a kind cluster:&lt;/strong&gt; &lt;code&gt;kind create cluster --name k136 --image kindest/node:v1.36.0-alpha.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic smoke check:&lt;/strong&gt; &lt;code&gt;kubectl cluster-info&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to watch live:&lt;/strong&gt; apiserver readiness behavior, restart count, and 5xx error rate during startup and during controller churn tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  should you care? red flags and what I grep first
&lt;/h2&gt;

&lt;p&gt;Page on symptoms. Not vibes.&lt;/p&gt;

&lt;p&gt;If /readyz never goes green, look at watch cache init messages and then look at who kills the apiserver. If you see InternalError on writes, correlate timestamps with apiserver stack traces. If watch/list stalls show up, baseline watch_list_duration_seconds now that it is Beta-grade and compare during your canary. Your monitoring should tell you the story in five minutes, not after the postmortem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Check your error rates after deploying. If you do not have an apiserver 5xx panel, you are flying blind.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Other stuff in this release: dependency bumps, some image updates, the usual. Anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-upgrade-checklist/" rel="noopener noreferrer"&gt;Kubernetes Upgrade Checklist&lt;/a&gt; - The runbook for safe minor version upgrades&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-support-and-eol-policy/" rel="noopener noreferrer"&gt;Kubernetes EOL Policy Explained&lt;/a&gt; - Know when your version loses support&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/how-to-add-version-health-badges-to-your-project/" rel="noopener noreferrer"&gt;How to Add Version Health Badges&lt;/a&gt; - Track release health in your README&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/kubernetes-security-linter/" rel="noopener noreferrer"&gt;Kubernetes YAML Security Linter&lt;/a&gt; — paste any K8s manifest and scan for 12 security issues with an A–F grade. Free, browser-based.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>sre</category>
    </item>
    <item>
      <title>Kubernetes 1.32 End of Life: Migration Playbook for February 28, 2026</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:22:55 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/kubernetes-132-end-of-life-migration-playbook-for-february-28-2026-3lnp</link>
      <guid>https://dev.to/matheus_releaserun/kubernetes-132-end-of-life-migration-playbook-for-february-28-2026-3lnp</guid>
      <description>&lt;p&gt;&lt;strong&gt;12 days.&lt;/strong&gt; That's how long Kubernetes 1.32 has left before the upstream project stops issuing patches. After February 28, 2026, there are no more security fixes, no more bug patches, no more backports. Version 1.32.12 - released on February 10 - is the last update you will ever get.&lt;/p&gt;

&lt;p&gt;If you're still running 1.32 in production, this is your migration playbook. Not a gentle nudge. A concrete, step-by-step plan to get off a version that's about to become a liability.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "End of Life" Actually Means (It's Worse Than You Think)
&lt;/h2&gt;

&lt;p&gt;Let's be precise about what happens on March 1st if you're still on 1.32.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No more CVE patches.&lt;/strong&gt; When the next Kubernetes vulnerability drops - and it will - the fix will ship for 1.33, 1.34, and 1.35. Not 1.32. You'll read the advisory, understand exactly how your clusters are exposed, and have no upstream fix to apply.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. Look at what's already been patched in 1.33.x that 1.32 users are exposed to &lt;em&gt;right now&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CVE-2025-5187&lt;/strong&gt; (fixed in 1.33.4): Nodes can delete themselves by adding an OwnerReference to their own Node object. An attacker with node-level access can cause cascading disruption by self-destructing nodes in your cluster. This is the kind of bug that makes incident response teams lose sleep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CVE-2025-4563&lt;/strong&gt; (fixed in 1.33.2): DRA (Dynamic Resource Allocation) authorization bypass. If you're using DRA - and more teams are as GPU workloads grow - this one matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No more bug fixes.&lt;/strong&gt; Several nasty bugs were fixed in 1.33.x patches that will never be backported to 1.32:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://releaserun.com/kubernetes-1-35-1-kubelet-restarts-pod-stability/" rel="noopener noreferrer"&gt;kubelet watchdog that kills the kubelet&lt;/a&gt; during slow container runtime initialization (1.33.7). If you've ever seen mysterious kubelet restarts after a node reboot, this might be why.&lt;/li&gt;
&lt;li&gt;A DRA double-allocation race condition during rapid pod scheduling (1.33.8). You won't hit this until you do - and when you do, two pods will think they own the same resource.&lt;/li&gt;
&lt;li&gt;A DaemonSet orphaned pod regression (1.33.6) that can leave ghost pods consuming resources with no controller managing them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No more compatibility guarantees.&lt;/strong&gt; Ecosystem tools - Helm, Istio, cert-manager, ArgoCD - will drop 1.32 from their test matrices. You'll start seeing "unsupported version" warnings, then errors, then silent incompatibilities that only surface at 3 AM.&lt;/p&gt;

&lt;p&gt;Kubernetes 1.32 had a solid run. Originally released December 11, 2024, it received 12 patch releases over ~14 months. That's the standard lifecycle. But its time is up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Should You Land?
&lt;/h2&gt;

&lt;p&gt;You have three supported targets. Here's the honest comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;1.33&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;1.34&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;1.35&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EOL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;June 28, 2026&lt;/td&gt;
&lt;td&gt;October 27, 2026&lt;/td&gt;
&lt;td&gt;February 28, 2027&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support remaining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~4 months&lt;/td&gt;
&lt;td&gt;~8 months&lt;/td&gt;
&lt;td&gt;~12 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hops from 1.32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maturity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully battle-tested&lt;/td&gt;
&lt;td&gt;Stable, well-patched&lt;/td&gt;
&lt;td&gt;Current release, still early patches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Risk profile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low risk, low runway&lt;/td&gt;
&lt;td&gt;Low risk, good runway&lt;/td&gt;
&lt;td&gt;Low risk on paper, less field time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Just get off 1.32 NOW"&lt;/td&gt;
&lt;td&gt;Most production teams&lt;/td&gt;
&lt;td&gt;Teams who just upgraded recently&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Our recommendation: Target 1.34 for most teams.
&lt;/h3&gt;

&lt;p&gt;Here's the reasoning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not 1.33?&lt;/strong&gt; It works, it's stable, and it's the fewest changes from where you are. But with EOL on June 28, you'd be doing this exact same fire drill in four months. That's not a migration strategy - that's procrastination with extra steps.&lt;br&gt;
&lt;strong&gt;Why not 1.35?&lt;/strong&gt; It's the &lt;a href="https://releaserun.com/kubernetes-1-35-release-preview/" rel="noopener noreferrer"&gt;current release&lt;/a&gt; with the longest support runway. But getting there requires three sequential minor version upgrades (1.32→1.33→1.34→1.35), and the newest release has had less time in the field. Unless you upgraded to 1.34 recently and are just continuing the chain, the extra hop adds risk and downtime for marginal benefit.&lt;br&gt;
&lt;strong&gt;Why 1.34?&lt;/strong&gt; Two hops (1.32→1.33→1.34), eight months of support, and a version that's had enough patch releases to shake out the rough edges. You get the major 1.33 features (sidecar containers GA, nftables GA) plus whatever 1.34 brought to the table, and you won't need to think about upgrading again until late summer.&lt;/p&gt;

&lt;p&gt;The one exception: if you're in a change-freeze or have a release cycle that makes two hops impossible before February 28, go to 1.33 &lt;em&gt;now&lt;/em&gt; and plan the 1.33→1.34 hop for March. Getting off 1.32 is the priority.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Upgrade Path: You Cannot Skip Minor Versions
&lt;/h2&gt;

&lt;p&gt;This is the part where people get burned. &lt;a href="https://releaserun.com/kubernetes-support-and-eol-policy/" rel="noopener noreferrer"&gt;Kubernetes version skew policy&lt;/a&gt; is strict: &lt;strong&gt;you must upgrade one minor version at a time.&lt;/strong&gt; There is no shortcut from 1.32 to 1.34. You go through 1.33, you validate, and then you continue.&lt;/p&gt;

&lt;p&gt;Here's the sequence for a kubeadm-managed cluster:&lt;/p&gt;
&lt;h3&gt;
  
  
  Pre-Upgrade Checklist
&lt;/h3&gt;

&lt;p&gt;Before you touch anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Confirm your current version&lt;/span&gt;

kubectl version &lt;span class="nt"&gt;--short&lt;/span&gt;


kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide

&lt;span class="c"&gt;# 2. Check for deprecated API usage that will break on upgrade&lt;/span&gt;
&lt;span class="c"&gt;# Install kubectl-deprecations or use kubent&lt;/span&gt;

kubectl get &lt;span class="nt"&gt;--raw&lt;/span&gt; /metrics | &lt;span class="nb"&gt;grep &lt;/span&gt;apiserver_requested_deprecated_apis

&lt;span class="c"&gt;# 3. Verify etcd health&lt;/span&gt;

&lt;span class="nv"&gt;ETCDCTL_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 etcdctl endpoint health &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--endpoints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://127.0.0.1:2379 &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--cacert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/ca.crt &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.crt &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.key

&lt;span class="c"&gt;# 4. Back up etcd (non-negotiable)&lt;/span&gt;

&lt;span class="nv"&gt;ETCDCTL_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 etcdctl snapshot save /backup/etcd-pre-upgrade-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;.db &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--endpoints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://127.0.0.1:2379 &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--cacert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/ca.crt &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.crt &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/kubernetes/pki/etcd/server.key

&lt;span class="c"&gt;# 5. Check component version skew&lt;/span&gt;
&lt;span class="c"&gt;# kubelet must be within one minor version of the API server&lt;/span&gt;
&lt;span class="c"&gt;# kube-proxy must match the API server minor version&lt;/span&gt;

kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hop 1: 1.32 → 1.33
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the first control plane node:&lt;/span&gt;
&lt;span class="c"&gt;# Update kubeadm&lt;/span&gt;

apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nv"&gt;kubeadm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1.33.*-*'&lt;/span&gt;

&lt;span class="c"&gt;# or on RHEL/CentOS:&lt;/span&gt;
&lt;span class="c"&gt;# yum install -y kubeadm-1.33.*&lt;/span&gt;
&lt;span class="c"&gt;# Verify the upgrade plan&lt;/span&gt;

kubeadm upgrade plan

&lt;span class="c"&gt;# Apply the upgrade (first control plane only)&lt;/span&gt;

kubeadm upgrade apply v1.33.8

&lt;span class="c"&gt;# Upgrade kubelet and kubectl&lt;/span&gt;

apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nv"&gt;kubelet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.33.&amp;lt;em&amp;gt;-&amp;lt;/em&amp;gt; &lt;span class="nv"&gt;kubectl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.33.&amp;lt;em&amp;gt;-&amp;lt;/em&amp;gt;


systemctl daemon-reload


systemctl restart kubelet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For additional control plane nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubeadm upgrade node

apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nv"&gt;kubelet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.33.&amp;lt;em&amp;gt;-&amp;lt;/em&amp;gt; &lt;span class="nv"&gt;kubectl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.33.&amp;lt;em&amp;gt;-&amp;lt;/em&amp;gt;


systemctl daemon-reload


systemctl restart kubelet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each worker node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From a machine with kubectl access:&lt;/span&gt;

kubectl drain &amp;lt;node-name&amp;gt; &lt;span class="nt"&gt;--ignore-daemonsets&lt;/span&gt; &lt;span class="nt"&gt;--delete-emptydir-data&lt;/span&gt;

&lt;span class="c"&gt;# On the worker node:&lt;/span&gt;

apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nv"&gt;kubeadm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1.33.*-*'&lt;/span&gt;


kubeadm upgrade node


apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nv"&gt;kubelet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.33.&amp;lt;em&amp;gt;-&amp;lt;/em&amp;gt;


systemctl daemon-reload


systemctl restart kubelet

&lt;span class="c"&gt;# From kubectl:&lt;/span&gt;

kubectl uncordon &amp;lt;node-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stop here. Validate.&lt;/strong&gt; Don't chain upgrades without confirming the cluster is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes          &lt;span class="c"&gt;# All nodes Ready?&lt;/span&gt;

kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt;        &lt;span class="c"&gt;# Any CrashLoopBackOff?&lt;/span&gt;


kubectl get cs             &lt;span class="c"&gt;# Component statuses healthy? (deprecated API, still responds)&lt;/span&gt;

&lt;span class="c"&gt;# Run your smoke tests. You have smoke tests, right?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hop 2: 1.33 → 1.34
&lt;/h3&gt;

&lt;p&gt;Repeat the exact same process, substituting &lt;code&gt;1.34&lt;/code&gt; for &lt;code&gt;1.33&lt;/code&gt;. Same drain-upgrade-uncordon dance. Same validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version skew during upgrade:&lt;/strong&gt; The Kubernetes version skew policy allows kubelet to be one minor version behind the API server. This means during the 1.33→1.34 upgrade, your 1.33 kubelets will work with the 1.34 API server while you roll nodes. But 1.32 kubelets will &lt;em&gt;not&lt;/em&gt; work with a 1.34 API server. This is why you can't skip versions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cloud Provider Timelines: You Might Have More Time (For a Price)
&lt;/h2&gt;

&lt;p&gt;If you're running managed Kubernetes, your deadlines are slightly different - but don't get complacent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon EKS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard support EOL for 1.32:&lt;/strong&gt; March 23, 2026 (three weeks after upstream)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended support EOL:&lt;/strong&gt; March 23, 2027&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EKS extended support buys you a full extra year, but at a premium: &lt;strong&gt;$0.60 per cluster per hour&lt;/strong&gt;, up from the standard $0.10/hour. That extra $0.50/hour works out to roughly $4,400/year per cluster just for the privilege of staying on 1.32. For a single cluster, maybe. For a fleet, you're burning budget to avoid an upgrade you'll have to do anyway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your EKS cluster version&lt;/span&gt;

aws eks describe-cluster &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'cluster.version'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text

&lt;span class="c"&gt;# Start an EKS upgrade to 1.33&lt;/span&gt;

aws eks update-cluster-version &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--kubernetes-version&lt;/span&gt; 1.33

&lt;span class="c"&gt;# Watch the update status&lt;/span&gt;

aws eks describe-update &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--update-id&lt;/span&gt; &amp;lt;update-id-from-previous-command&amp;gt;

&lt;span class="c"&gt;# Don't forget to update your node groups after!&lt;/span&gt;

aws eks update-nodegroup-version &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--cluster-name&lt;/span&gt; &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--nodegroup-name&lt;/span&gt; &amp;lt;nodegroup-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Google GKE
&lt;/h3&gt;

&lt;p&gt;GKE typically provides 2-4 weeks of grace after upstream EOL before auto-upgrading clusters. If you haven't set a maintenance window and an upgrade strategy, GKE &lt;em&gt;will&lt;/em&gt; upgrade your clusters for you. That sounds convenient until it happens during your traffic peak.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check GKE cluster version&lt;/span&gt;

gcloud container clusters describe &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--zone&lt;/span&gt; &amp;lt;zone&amp;gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"value(currentMasterVersion)"&lt;/span&gt;

&lt;span class="c"&gt;# Initiate upgrade&lt;/span&gt;

gcloud container clusters upgrade &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--zone&lt;/span&gt; &amp;lt;zone&amp;gt; &lt;span class="nt"&gt;--master&lt;/span&gt; &lt;span class="nt"&gt;--cluster-version&lt;/span&gt; 1.33
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Azure AKS
&lt;/h3&gt;

&lt;p&gt;AKS follows a similar pattern: roughly 2-4 weeks past upstream EOL, with platform-managed upgrades kicking in after that. AKS's "long-term support" (LTS) versions are a separate track - 1.32 is not an LTS release, so no special treatment here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check AKS version&lt;/span&gt;

az aks show &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &amp;lt;rg&amp;gt; &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--query&lt;/span&gt; kubernetesVersion &lt;span class="nt"&gt;-o&lt;/span&gt; tsv

&lt;span class="c"&gt;# Upgrade AKS&lt;/span&gt;

az aks upgrade &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &amp;lt;rg&amp;gt; &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;cluster-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--kubernetes-version&lt;/span&gt; 1.33
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The bottom line for cloud users:&lt;/strong&gt; You have a few weeks of buffer. Use that buffer for testing, not for procrastination. Start the upgrade now and use the extra weeks as a safety net, not a crutch.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Gain: 5 Features Worth the Upgrade
&lt;/h2&gt;

&lt;p&gt;Upgrading isn't just about escaping EOL. The jump from 1.32 to 1.33 is one of the most feature-rich minor releases in recent Kubernetes history. Here's what actually matters in production:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Sidecar Containers - GA (KEP-753)
&lt;/h3&gt;

&lt;p&gt;This is the big one. After years of KEPs, alpha gates, and community debate, native sidecar containers are generally available. Init containers with &lt;code&gt;restartPolicy: Always&lt;/code&gt; now have proper lifecycle management: they start before your main containers, stay running alongside them, and shut down after them.&lt;/p&gt;

&lt;p&gt;If you're running service meshes (Istio, Linkerd), log shippers, or any sidecar-dependent architecture, this eliminates a whole class of race conditions. No more hacks with &lt;code&gt;postStart&lt;/code&gt; hooks and sleep loops to ensure your Envoy proxy is ready before your app starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out:&lt;/strong&gt; A sidecar startup probe race condition was fixed in 1.33.6. Make sure you're on 1.33.8 (latest) to avoid it.&lt;/p&gt;
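&lt;p&gt;For reference, the GA shape is just an init container with a restart policy - a minimal sketch (image names and tags are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  initContainers:
  - name: log-shipper
    image: fluent/fluent-bit:3.0   # illustrative sidecar image
    restartPolicy: Always          # this one field makes it a native sidecar
  containers:
  - name: app
    image: nginx:1.27              # starts after the sidecar is up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;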

&lt;h3&gt;
  
  
  2. nftables Kube-Proxy Backend - GA (KEP-3866)
&lt;/h3&gt;

&lt;p&gt;The iptables-based kube-proxy is showing its age. nftables is faster, handles large rule sets better, and is the future of Linux packet filtering. With GA in 1.33, it's production-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The caveat:&lt;/strong&gt; This doesn't mean nftables is the &lt;em&gt;default&lt;/em&gt; yet. You still need to opt in. But if you're running clusters with thousands of Services, the performance difference is measurable - especially rule reload times during Service churn. An &lt;code&gt;iif&lt;/code&gt; vs &lt;code&gt;iifname&lt;/code&gt; bug in local traffic detection was fixed in 1.33.6, so again: run the latest patch.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. In-Place Pod Resource Resize - Beta (KEP-1287)
&lt;/h3&gt;

&lt;p&gt;Change a pod's CPU and memory requests/limits without restarting it. Still beta, so it's behind a feature gate, but this is the kind of capability that changes how you think about vertical scaling. No more killing a pod just because it needs 200Mi more memory during a traffic spike.&lt;/p&gt;
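&lt;p&gt;As a sketch of what the beta feature enables (assuming the InPlacePodVerticalScaling feature gate is on), &lt;code&gt;resizePolicy&lt;/code&gt; declares how each resource tolerates a resize:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: resizable
spec:
  containers:
  - name: app
    image: nginx:1.27   # illustrative
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # resize CPU without restarting the container
    - resourceName: memory
      restartPolicy: RestartContainer # restart the container on memory resize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;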

&lt;h3&gt;
  
  
  4. Topology-Aware Routing - GA (KEP-4444)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;trafficDistribution: PreferClose&lt;/code&gt; is now GA. Traffic prefers endpoints in the same zone before crossing zone boundaries. This is pure money in multi-AZ deployments: less cross-zone data transfer, lower latency, better tail percentiles. If you're on AWS or GCP and not using this, you're paying an invisible cloud networking tax.&lt;/p&gt;
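&lt;p&gt;Opting in is a single field on the Service - a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
  - port: 80
  trafficDistribution: PreferClose   # prefer same-zone endpoints before crossing zones
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;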

&lt;h3&gt;
  
  
  5. Multiple Service CIDRs - GA (KEP-1880)
&lt;/h3&gt;

&lt;p&gt;You can now dynamically expand your ClusterIP range without cluster recreation. If you've ever hit the ceiling on your Service CIDR and had to do gymnastics to work around it, this fixes that permanently. Especially relevant for large multi-tenant clusters.&lt;/p&gt;
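&lt;p&gt;With the GA networking.k8s.io/v1 API, adding a range is one object - the CIDR below is illustrative and must not overlap your existing service ranges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: ServiceCIDR
metadata:
  name: extra-service-range
spec:
  cidrs:
  - 10.100.0.0/16   # illustrative; pick a range free in your network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;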




&lt;h2&gt;
  
  
  Breaking Changes and Gotchas: What to Watch For
&lt;/h2&gt;

&lt;p&gt;Every upgrade has landmines. Here are the ones that bite in the 1.32→1.33 transition:&lt;/p&gt;

&lt;h3&gt;
  
  
  nftables Consideration
&lt;/h3&gt;

&lt;p&gt;While nftables kube-proxy went GA, the default backend is still iptables in 1.33. However, start planning your migration now. Test nftables in staging. Future versions may change the default, and you don't want to be scrambling when that happens. The migration guide is essential reading - nftables rule semantics differ from iptables in subtle ways that will break custom &lt;code&gt;NetworkPolicy&lt;/code&gt; implementations relying on iptables-specific behavior.&lt;/p&gt;
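&lt;p&gt;For a staging test, the opt-in is one field in the kube-proxy configuration (on kubeadm clusters this lives in the kube-proxy ConfigMap in kube-system):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables   # the default remains "iptables" in 1.33
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;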

&lt;h3&gt;
  
  
  Deprecated API Removals
&lt;/h3&gt;

&lt;p&gt;Check for any APIs that were deprecated in 1.31 or earlier and removed in 1.33. The &lt;code&gt;flowcontrol.apiserver.k8s.io/v1beta3&lt;/code&gt; API group is one to watch. Run &lt;code&gt;kubectl-deprecations&lt;/code&gt; or &lt;code&gt;kubent&lt;/code&gt; before upgrading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using kubent (kube-no-trouble)&lt;/span&gt;

kubent

&lt;span class="c"&gt;# Or check directly&lt;/span&gt;

kubectl get &lt;span class="nt"&gt;--raw&lt;/span&gt; /metrics | &lt;span class="nb"&gt;grep &lt;/span&gt;apiserver_requested_deprecated_apis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feature Gate Changes
&lt;/h3&gt;

&lt;p&gt;Some feature gates that were beta (and on by default) in 1.32 graduated to GA in 1.33, which means the gates are locked and removed. If you were explicitly setting these gates in your kubelet or API server configs, the flags will cause startup errors. Audit your &lt;code&gt;--feature-gates&lt;/code&gt; flags before upgrading.&lt;/p&gt;
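&lt;p&gt;For example, SidecarContainers graduated to GA in 1.33, so a kubelet config that still pins it should have the entry dropped - a sketch (your gate list will differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  SidecarContainers: true   # GA in 1.33: the gate is locked, remove this entry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;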

&lt;h3&gt;
  
  
  DRA (Dynamic Resource Allocation) Changes
&lt;/h3&gt;

&lt;p&gt;If you're using DRA for GPU or custom resource scheduling, be aware of the authorization bypass fix (CVE-2025-4563) and the double-allocation race fix. The fixes are in 1.33.2 and 1.33.8 respectively, so target 1.33.8 as your landing version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your 5-Step Action Plan
&lt;/h2&gt;

&lt;p&gt;Here's what to do this week. Not next month. This week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Audit (Today)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find every cluster still on 1.32&lt;/span&gt;
&lt;span class="c"&gt;# For kubeadm clusters:&lt;/span&gt;

kubectl version &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.serverVersion.minor'&lt;/span&gt;

&lt;span class="c"&gt;# For EKS:&lt;/span&gt;

aws eks list-clusters &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'clusters[]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'\t'&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; | xargs &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;


aws eks describe-cluster &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'[cluster.name, cluster.version]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text

&lt;span class="c"&gt;# For GKE:&lt;/span&gt;

gcloud container clusters list &lt;span class="se"&gt;\&lt;/span&gt;


&lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"table(name, currentMasterVersion)"&lt;/span&gt;

&lt;span class="c"&gt;# For AKS:&lt;/span&gt;

az aks list &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'[].{name:name, version:kubernetesVersion}'&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Test in Staging (This Week)
&lt;/h3&gt;

&lt;p&gt;Upgrade a non-production cluster to 1.33. Run your full test suite (see our &lt;a href="https://releaserun.com/kubernetes-upgrade-checklist/" rel="noopener noreferrer"&gt;Kubernetes upgrade checklist&lt;/a&gt;). Pay special attention to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service mesh behavior (sidecar lifecycle changes)&lt;/li&gt;
&lt;li&gt;Network policies (if you plan to test nftables)&lt;/li&gt;
&lt;li&gt;Any workloads using DRA&lt;/li&gt;
&lt;li&gt;Custom admission webhooks (API changes can break them silently)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Upgrade Production to 1.33 (Week of Feb 23)
&lt;/h3&gt;

&lt;p&gt;Follow the kubeadm or cloud provider upgrade steps above. Target &lt;strong&gt;1.33.8&lt;/strong&gt; - it has the latest security and bug fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Validate and Soak (1 Week)
&lt;/h3&gt;

&lt;p&gt;Run 1.33 in production for at least a few days. Monitor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch for elevated error rates&lt;/span&gt;

kubectl get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.lastTimestamp'&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-50&lt;/span&gt;

&lt;span class="c"&gt;# Check component health&lt;/span&gt;

kubectl get componentstatuses

&lt;span class="c"&gt;# Monitor pod restarts (a spike means something broke)&lt;/span&gt;

kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.status.containerStatuses[0].restartCount'&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Continue to 1.34 (Early March)
&lt;/h3&gt;

&lt;p&gt;Once 1.33 is stable, repeat the process for 1.34. This is your final destination - 8 months of support runway, the features you need, and a stable foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Clock Is Ticking
&lt;/h2&gt;

&lt;p&gt;February 28 is not a soft deadline. It's the day your clusters become unpatched infrastructure. Every day after that, your attack surface grows and your ecosystem compatibility shrinks.&lt;/p&gt;

&lt;p&gt;The upgrade from 1.32 to 1.33 (and then 1.34) is well-trodden ground. Thousands of clusters have made this jump. The tooling works. The docs are solid. The features are worth it.&lt;/p&gt;

&lt;p&gt;What's not worth it is explaining to your security team in April why you're running a Kubernetes version with known, unpatched CVEs because the upgrade "wasn't prioritized."&lt;/p&gt;

&lt;p&gt;Start today. Your future on-call self will thank you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-support-and-eol-policy/" rel="noopener noreferrer"&gt;Kubernetes EOL Policy Explained&lt;/a&gt; - how the support lifecycle works and what each phase means for your clusters&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-upgrade-checklist/" rel="noopener noreferrer"&gt;Kubernetes Upgrade Checklist (Minor Version)&lt;/a&gt; - the step-by-step runbook for any minor version upgrade&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/debugging-kubernetes-nodes-notready/" rel="noopener noreferrer"&gt;Debugging Kubernetes Nodes in NotReady State&lt;/a&gt; - essential troubleshooting for when nodes go dark during or after upgrades&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-1-35-release-preview/" rel="noopener noreferrer"&gt;Kubernetes 1.35 Release: What Can Break Your Cluster&lt;/a&gt; - if you're considering jumping all the way to 1.35&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-distributions-compared/" rel="noopener noreferrer"&gt;Popular Kubernetes Distributions Compared (2026)&lt;/a&gt; - EKS, GKE, AKS, and self-managed options compared&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-statistics-adoption-2026/" rel="noopener noreferrer"&gt;Kubernetes Statistics and Adoption Trends in 2026&lt;/a&gt; - the data behind K8s adoption and version usage&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Track Kubernetes version health, EOL dates, and upgrade paths in real-time at &lt;a href="https://releaserun.com" rel="noopener noreferrer"&gt;ReleaseRun&lt;/a&gt;. We monitor the releases so you don't miss a deadline.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/kubernetes-security-linter/" rel="noopener noreferrer"&gt;Kubernetes YAML Security Linter&lt;/a&gt; — paste any K8s manifest and scan for 12 security issues with an A–F grade. Free, browser-based.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>security</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Popular Kubernetes Distributions Compared (2026)</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:17:35 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/popular-kubernetes-distributions-compared-2026-186o</link>
      <guid>https://dev.to/matheus_releaserun/popular-kubernetes-distributions-compared-2026-186o</guid>
      <description>&lt;p&gt;Choosing a Kubernetes distribution is one of the first decisions platform teams face. The ecosystem now includes over 200 certified options -- from lightweight single-node setups to enterprise platforms managing thousands of clusters.&lt;/p&gt;

&lt;p&gt;Here's a practical comparison of the most popular distributions, what each is best suited for, and how to decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Kubernetes Distribution?
&lt;/h2&gt;

&lt;p&gt;A Kubernetes distribution is a packaged version of upstream Kubernetes that adds installation tooling, default configurations, and often additional features like networking, storage, and security integrations.&lt;/p&gt;

&lt;p&gt;Think of it like Linux distributions: Ubuntu, Red Hat, and Alpine all run the Linux kernel, but each packages it differently for different use cases. Kubernetes distributions work the same way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Major Distributions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Managed Cloud Services (Hosted)
&lt;/h3&gt;

&lt;p&gt;These are fully managed -- the cloud provider handles the control plane, upgrades, and infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon EKS&lt;/strong&gt; -- The market leader (~42% share). Tight integration with AWS services (IAM, VPC, ALB). Supports both cloud and on-premises deployment (EKS Anywhere). Best for teams already invested in AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google GKE&lt;/strong&gt; -- Built by the team that created Kubernetes. Fastest to adopt new K8s versions (often same-day support for new releases). Autopilot mode eliminates node management entirely. Best for teams wanting the most "pure" Kubernetes experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure AKS&lt;/strong&gt; -- Deep integration with Azure Active Directory and Azure DevOps. Strong Windows container support. Free control plane (you only pay for worker nodes). Best for Microsoft-centric enterprises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DigitalOcean Kubernetes (DOKS)&lt;/strong&gt; -- Simplest managed option. Free control plane, straightforward pricing. Limited to smaller scale. Best for startups and small teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Managed (On-Premises / Hybrid)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;kubeadm&lt;/strong&gt; -- The official Kubernetes bootstrapping tool. Minimal opinions -- gives you vanilla upstream Kubernetes. Requires you to handle networking, storage, and upgrades yourself. Best for teams that want full control and understand the internals.&lt;/p&gt;
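&lt;p&gt;A minimal bootstrap sketch of what "full control" means in practice (assumes a host already prepared with a container runtime and kubelet; the CIDR is just an example value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Initialise the control plane; kubeadm prints the join command for workers
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# kubeadm deliberately leaves networking to you -- install a CNI plugin
# (Flannel, Calico, Cilium, ...) yourself
kubectl get nodes   # nodes stay NotReady until a CNI is installed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;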

&lt;p&gt;&lt;strong&gt;Red Hat OpenShift&lt;/strong&gt; -- Enterprise Kubernetes platform with built-in CI/CD (Tekton), developer portal, and strict security defaults (SELinux, SCCs). Runs on any infrastructure. Opinionated but comprehensive. Best for regulated enterprises that need a complete platform, not just an orchestrator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rancher (by SUSE)&lt;/strong&gt; -- Multi-cluster management platform. Can manage EKS, GKE, AKS, and on-prem clusters from a single dashboard. Includes its own lightweight distribution (RKE2). Best for teams managing Kubernetes across multiple environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VMware Tanzu&lt;/strong&gt; -- Integrates Kubernetes into existing VMware infrastructure. Lets teams run containers alongside traditional VMs. Best for organizations transitioning from VMware to containers gradually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lightweight / Edge Distributions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;k3s&lt;/strong&gt; -- Rancher's lightweight Kubernetes distribution. Single binary under 100MB. Ideal for edge computing, IoT, CI/CD pipelines, and development environments. Strips out cloud-provider-specific code and uses SQLite instead of etcd by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MicroK8s (Canonical)&lt;/strong&gt; -- Snap-packaged Kubernetes from the Ubuntu team. Zero-ops single-node to multi-node clusters. Strong add-on ecosystem (Istio, Knative, GPU support). Best for developer workstations and Ubuntu-based infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;minikube&lt;/strong&gt; -- Local Kubernetes for development and testing. Runs inside a VM or container on your laptop. Not intended for production. Best for learning Kubernetes and local development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kind (Kubernetes in Docker)&lt;/strong&gt; -- Runs Kubernetes clusters using Docker containers as nodes. Designed for testing Kubernetes itself. Extremely fast to spin up and tear down. Best for CI/CD pipelines and integration testing.&lt;/p&gt;
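&lt;p&gt;To make that speed concrete, here is roughly what a throwaway kind cluster looks like in a CI job (assumes Docker and the &lt;code&gt;kind&lt;/code&gt; binary are installed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create, use, and destroy a cluster in minutes
kind create cluster --name ci-test
kubectl cluster-info --context kind-ci-test

# ... run integration tests against the cluster ...

kind delete cluster --name ci-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;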

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Distribution&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;K8s Version Lag&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon EKS&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;AWS-native teams&lt;/td&gt;
&lt;td&gt;1-2 weeks&lt;/td&gt;
&lt;td&gt;Pay per cluster + nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google GKE&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;K8s-first teams&lt;/td&gt;
&lt;td&gt;Same day&lt;/td&gt;
&lt;td&gt;Pay per cluster + nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure AKS&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;Microsoft shops&lt;/td&gt;
&lt;td&gt;1-2 weeks&lt;/td&gt;
&lt;td&gt;Free control plane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DigitalOcean&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;Startups&lt;/td&gt;
&lt;td&gt;2-4 weeks&lt;/td&gt;
&lt;td&gt;Free control plane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kubeadm&lt;/td&gt;
&lt;td&gt;Self-managed&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;Same day&lt;/td&gt;
&lt;td&gt;Free (your infra)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenShift&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;td&gt;Enterprises&lt;/td&gt;
&lt;td&gt;2-4 weeks&lt;/td&gt;
&lt;td&gt;Subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rancher/RKE2&lt;/td&gt;
&lt;td&gt;Multi-cluster&lt;/td&gt;
&lt;td&gt;Hybrid/multi-cloud&lt;/td&gt;
&lt;td&gt;1-2 weeks&lt;/td&gt;
&lt;td&gt;Free + Enterprise tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;k3s&lt;/td&gt;
&lt;td&gt;Lightweight&lt;/td&gt;
&lt;td&gt;Edge/IoT&lt;/td&gt;
&lt;td&gt;1-2 weeks&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MicroK8s&lt;/td&gt;
&lt;td&gt;Lightweight&lt;/td&gt;
&lt;td&gt;Dev/Ubuntu&lt;/td&gt;
&lt;td&gt;1-3 weeks&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with this question: Who manages the infrastructure?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Not us"&lt;/strong&gt; → Managed service (EKS, GKE, AKS). Pick based on your cloud provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Us, on our hardware"&lt;/strong&gt; → kubeadm (DIY), OpenShift (enterprise), or Rancher (multi-cluster).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"It's for development/testing"&lt;/strong&gt; → k3s, minikube, or kind.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"It's for edge/IoT"&lt;/strong&gt; → k3s or MicroK8s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Then consider:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Team size&lt;/strong&gt;: Small teams benefit from managed services or opinionated platforms. Large platform teams can handle kubeadm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance requirements&lt;/strong&gt;: Regulated industries often need OpenShift or Tanzu for their built-in security controls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-cloud needs&lt;/strong&gt;: Rancher or Anthos (Google's hybrid offering) if you're running across providers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version freshness&lt;/strong&gt;: If running the latest Kubernetes version matters, GKE and kubeadm track upstream fastest.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Version Support Across Distributions
&lt;/h2&gt;

&lt;p&gt;Not all distributions support the same Kubernetes versions at the same time. When upstream Kubernetes releases version 1.35, managed providers typically need 1-4 weeks to certify and offer it.&lt;/p&gt;
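&lt;p&gt;Before comparing lag tables, check what you are actually running -- the control plane and kubelets can drift apart. A quick sketch (assumes &lt;code&gt;jq&lt;/code&gt; is installed for the JSON case):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Control plane version
kubectl version --output=json | jq -r '.serverVersion.gitVersion'

# Kubelet version per node -- these can lag the control plane
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;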

&lt;p&gt;&lt;strong&gt;Check current version support for any distribution&lt;/strong&gt; on our Kubernetes Releases hub, which tracks every supported version with live health grades and EOL dates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key dates to know:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes 1.32 reaches end of life &lt;strong&gt;February 28, 2026&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes 1.36 expected &lt;strong&gt;April 2026&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track all Kubernetes versions, EOL dates, and security status in real time at ReleaseRun.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Maintained by ReleaseRun -- tracking release health for 300+ software products. Last updated: February 2026.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-1-32-end-of-life-migration-playbook/" rel="noopener noreferrer"&gt;Kubernetes 1.32 End of Life: Migration Playbook&lt;/a&gt; -- EOL Feb 28 -- upgrade path for every distribution&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-statistics-adoption-2026/" rel="noopener noreferrer"&gt;Kubernetes Statistics and Adoption Trends in 2026&lt;/a&gt; -- which distributions are gaining market share&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/docker-vs-kubernetes-production-2026-decision-rubric/" rel="noopener noreferrer"&gt;Docker vs Kubernetes in Production (2026)&lt;/a&gt; -- when you don't need a full K8s distribution&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-support-and-eol-policy/" rel="noopener noreferrer"&gt;Kubernetes EOL Policy Explained&lt;/a&gt; -- how the support lifecycle affects your distribution choice&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-gateway-api-vs-ingress-vs-service-loadbalancer-what-to-use-in-2026-migration-paths/" rel="noopener noreferrer"&gt;Gateway API vs Ingress vs LoadBalancer&lt;/a&gt; -- networking options across distributions&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/kubernetes-security-linter/" rel="noopener noreferrer"&gt;Kubernetes YAML Security Linter&lt;/a&gt; — paste any K8s manifest and scan for 12 security issues with an A–F grade. Free, browser-based.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Node.js 20 End of Life: Migration Playbook for April 30, 2026</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:16:55 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/nodejs-20-end-of-life-migration-playbook-for-april-30-2026-2onh</link>
      <guid>https://dev.to/matheus_releaserun/nodejs-20-end-of-life-migration-playbook-for-april-30-2026-2onh</guid>
      <description>&lt;p&gt;Node.js 20 reaches end of life on April 30, 2026.&lt;/p&gt;

&lt;p&gt;If you are reading this in March or April, you are already behind. Node.js EOL dates do not come with a grace period. On May 1st, no more security patches. No more CVE fixes. The npm ecosystem moves on, and packages start dropping support in their CI matrices before the EOL date even arrives.&lt;/p&gt;

&lt;p&gt;I have seen teams discover they are running an EOL runtime at the worst possible moment -- during a security incident, when the fix only ships for supported versions. This playbook covers what Node 20 EOL means, which version to move to, what breaks along the way, and the exact steps to migrate without a production outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR -- What to do and when
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Right now (February):&lt;/strong&gt; Audit every service, container, Lambda function, and CI pipeline running Node 20. Run your test suite on Node 22.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March:&lt;/strong&gt; Migrate production workloads to Node 22 LTS (recommended) or Node 24 if you need the latest features. Deploy behind canary or feature flags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early April:&lt;/strong&gt; Clean up stragglers -- serverless functions, internal tools, batch jobs, developer machines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;April 30:&lt;/strong&gt; Deadline. You want to be done a week before, not on the day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you do one thing today, run &lt;code&gt;node --version&lt;/code&gt; across every production host, container, and CI runner. The number of places pinning Node 20 will surprise you.&lt;/p&gt;
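&lt;p&gt;One way to fan that check out (a hypothetical sketch -- &lt;code&gt;hosts.txt&lt;/code&gt; is an inventory file you would maintain yourself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Report the Node version on every host listed in hosts.txt
while read -r host; do
  printf '%s: ' "$host"
  ssh "$host" 'node --version' 2&gt;/dev/null || echo 'no node or unreachable'
done &lt; hosts.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;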

&lt;h2&gt;
  
  
  What "end of life" actually means for Node 20
&lt;/h2&gt;

&lt;p&gt;Node.js 20 entered Maintenance LTS on October 22, 2024. Since then, it only receives &lt;a href="https://releaserun.com/nodejs-20-20-0-release-notes-security-patches/" rel="noopener noreferrer"&gt;critical bug fixes and security patches&lt;/a&gt;. On April 30, 2026, even that stops.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No more security patches:&lt;/strong&gt; If a vulnerability is found in Node 20 after April 30, the Node.js team will only patch supported versions (22, 24+). You get nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm ecosystem moves on:&lt;/strong&gt; Package authors drop Node 20 from their &lt;code&gt;engines&lt;/code&gt; field and CI matrices. Some already have. When a package you depend on releases a version that requires Node 22+, your lockfile becomes a ticking time bomb.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud runtimes deprecate:&lt;/strong&gt; AWS Lambda, Google Cloud Functions, and Azure Functions will deprecate their Node 20 runtimes on their own timelines. AWS gives at least 180 days' notice and phases out in stages: first new function creation is blocked, then configuration updates, though existing invocations can continue indefinitely on deprecated runtimes. Other providers have similar but not identical policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance gaps:&lt;/strong&gt; SOC 2, PCI DSS, and ISO 27001 all require running supported software. An EOL runtime is a finding waiting to happen.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your code still runs. Node.js does not brick itself. But security coverage evaporates and the maintenance burden increases every week as the ecosystem leaves you behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Node 20 is hiding
&lt;/h2&gt;

&lt;p&gt;The obvious places are easy. The ones that bite you are the ones nobody remembers deploying.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker base images:&lt;/strong&gt; &lt;code&gt;node:20&lt;/code&gt;, &lt;code&gt;node:20-slim&lt;/code&gt;, &lt;code&gt;node:20-alpine&lt;/code&gt;. Search your Dockerfiles: &lt;code&gt;grep -rn "FROM node:20" . --include="Dockerfile*"&lt;/code&gt;. Check multi-stage builds too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;.nvmrc and .node-version files:&lt;/strong&gt; These pin the Node version for local development and often get copied into CI. Search: &lt;code&gt;find . -name ".nvmrc" -o -name ".node-version" | xargs grep "20"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;package.json engines field:&lt;/strong&gt; &lt;code&gt;grep -rn '"engines"' . --include="package.json" -A 3 | grep "20"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipelines:&lt;/strong&gt; GitHub Actions (&lt;code&gt;setup-node&lt;/code&gt;), GitLab CI, CircleCI, and Jenkins configs. Search for &lt;code&gt;node-version: '20'&lt;/code&gt; or &lt;code&gt;NODE_VERSION: 20&lt;/code&gt; across all YAML files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda:&lt;/strong&gt; Check runtime settings: &lt;code&gt;aws lambda list-functions --query 'Functions[?Runtime==`nodejs20.x`].[FunctionName,Runtime]' --output table&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel / Netlify / Cloudflare Workers:&lt;/strong&gt; Check project settings for Node version overrides. Vercel uses &lt;code&gt;engines.node&lt;/code&gt; in package.json. Netlify uses environment variables. Cloudflare Workers has its own compatibility dates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling:&lt;/strong&gt; Husky, lint-staged, Prettier, ESLint config runners -- these run on your dev machine's Node version, which developers may not have updated.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Which version to upgrade to
&lt;/h2&gt;

&lt;p&gt;Short answer: &lt;strong&gt;&lt;a href="https://releaserun.com/node-20-vs-22-vs-24-which-node-js-lts-should-you-run-in-production/" rel="noopener noreferrer"&gt;Node 22 LTS&lt;/a&gt;&lt;/strong&gt; for production. &lt;strong&gt;Node 24&lt;/strong&gt; if you are already running it in development and can tolerate a shorter track record.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node 22 LTS (recommended for most teams)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Entered Active LTS in October 2024, Maintenance LTS started October 21, 2025&lt;/li&gt;
&lt;li&gt;EOL April 30, 2027 -- gives you a full year of support after Node 20 dies&lt;/li&gt;
&lt;li&gt;V8 engine 12.4 -- significant performance improvements over Node 20's V8 11.3&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Key additions over Node 20:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;require()&lt;/code&gt; now works with ES modules (release candidate status as of 22.x -- usable in production, but check your specific minor version) -- the biggest quality-of-life improvement in years&lt;/li&gt;
&lt;li&gt;Built-in &lt;code&gt;node --watch&lt;/code&gt; is stable (no more nodemon for simple use cases)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fetch()&lt;/code&gt; and &lt;code&gt;WebStreams&lt;/code&gt; are stable (no longer experimental)&lt;/li&gt;
&lt;li&gt;Built-in WebSocket client (&lt;code&gt;WebSocket&lt;/code&gt; global) -- stable (was experimental behind a flag in Node 20.10+)&lt;/li&gt;
&lt;li&gt;Improved test runner (&lt;code&gt;node:test&lt;/code&gt;) with snapshot testing and coverage -- the test runner is stable in Node 20, but Node 22 adds significant features&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;glob&lt;/code&gt; and &lt;code&gt;matchesGlob&lt;/code&gt; in &lt;code&gt;node:fs&lt;/code&gt; and &lt;code&gt;node:path&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Task runner: &lt;code&gt;node --run&lt;/code&gt; as a faster alternative to &lt;code&gt;npm run&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Breaking changes from Node 20:&lt;/strong&gt; V8 upgrade may affect native addons compiled against Node 20's ABI. Rebuild native modules with &lt;code&gt;npm rebuild&lt;/code&gt; or &lt;code&gt;node-gyp rebuild&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Node 24 (for early adopters)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Released May 6, 2025, entered Active LTS October 2025&lt;/li&gt;
&lt;li&gt;EOL April 2028 -- nearly two years of runway&lt;/li&gt;
&lt;li&gt;V8 13.x with further performance improvements&lt;/li&gt;
&lt;li&gt;Permissions model stable, TypeScript type stripping is now stable (the old &lt;code&gt;--experimental-strip-types&lt;/code&gt; flag was removed in 24.12+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Some packages may not have been fully tested against Node 24's V8 engine. Native addons are the usual pain point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My recommendation for February 2026: jump to Node 22 LTS. It is battle-tested, has the widest ecosystem compatibility, and gives you a year before you need to think about versions again. If you are starting a new project, consider Node 24 from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What breaks when you upgrade from Node 20 to 22
&lt;/h2&gt;

&lt;p&gt;The gap is one LTS version (20 → 22), which is the smallest possible LTS jump. Good news: this is usually straightforward. Bad news: "usually" is not "always."&lt;/p&gt;

&lt;h3&gt;
  
  
  V8 engine changes
&lt;/h3&gt;

&lt;p&gt;Node 22 ships V8 12.4 (Node 20 had V8 11.3). This matters if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use native addons compiled against Node 20. &lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;npm rebuild&lt;/code&gt; after upgrading. Most addons recompile automatically, but some with pinned prebuilt binaries (sharp, bcrypt, better-sqlite3) may need an explicit version bump.&lt;/li&gt;
&lt;li&gt;Rely on specific V8 flags for performance tuning. Some flags change between V8 versions. Check your &lt;code&gt;--v8-*&lt;/code&gt; flags still exist.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deprecated APIs removed or changed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;punycode&lt;/code&gt; module: Runtime deprecation warning in Node 20, still importable. In Node 22, the warning is louder and the module is scheduled for removal. Use the &lt;code&gt;punycode/&lt;/code&gt; npm package instead.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SlowBuffer&lt;/code&gt;: If you somehow still use this, switch to &lt;code&gt;Buffer.allocUnsafe()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;url.parse()&lt;/code&gt;: Still works but URL constructor is preferred. Some edge cases around auth parsing were tightened in Node 22.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSSL 3.x changes:&lt;/strong&gt; Node 22 may use a newer OpenSSL patch that affects TLS behavior. If you connect to systems with legacy TLS configurations, test your HTTPS connections thoroughly.&lt;/li&gt;
&lt;/ul&gt;
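&lt;p&gt;For the &lt;code&gt;url.parse()&lt;/code&gt; case, the replacement is mechanical -- the WHATWG &lt;code&gt;URL&lt;/code&gt; constructor exposes the same pieces with stricter parsing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Legacy parser -- still works, but deprecated
node -e 'console.log(require("node:url").parse("https://example.com/a?b=1").hostname)'
# prints: example.com

# WHATWG URL constructor -- preferred on Node 20, 22, and 24
node -e 'console.log(new URL("https://example.com/a?b=1").hostname)'
# prints: example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;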

&lt;h3&gt;
  
  
  ESM/CJS interop changes
&lt;/h3&gt;

&lt;p&gt;This is the area most likely to cause confusion, not breakage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node 22 supports &lt;code&gt;require()&lt;/code&gt; for ES modules. This is unflagged but at "release candidate" stability (1.2) -- usable and increasingly relied on, but not yet fully stable. It does not break existing CJS code.&lt;/li&gt;
&lt;li&gt;If you have a mixed ESM/CJS codebase, test both &lt;code&gt;import&lt;/code&gt; and &lt;code&gt;require&lt;/code&gt; paths after upgrading.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;"type": "module"&lt;/code&gt; field in package.json behaves the same way. No changes there.&lt;/li&gt;
&lt;/ul&gt;
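&lt;p&gt;A quick way to see the new interop in action (a sketch that writes a throwaway module to &lt;code&gt;/tmp&lt;/code&gt; -- on Node 20 the &lt;code&gt;require&lt;/code&gt; line throws &lt;code&gt;ERR_REQUIRE_ESM&lt;/code&gt;, on Node 22 it just works):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# A tiny ES module
mkdir -p /tmp/esm-demo
echo 'export const greet = (n) =&gt; `hello, ${n}`;' &gt; /tmp/esm-demo/greet.mjs

# CJS requiring ESM -- unflagged in Node 22
node -e 'const { greet } = require("/tmp/esm-demo/greet.mjs"); console.log(greet("node 22"));'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;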

&lt;h2&gt;
  
  
  Step-by-step migration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Audit your Node.js footprint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check all hosts and containers&lt;/span&gt;
node &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;# Find pinned versions in your codebase&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="s2"&gt;"FROM node:20"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Dockerfile*"&lt;/span&gt;
find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;".nvmrc"&lt;/span&gt; &lt;span class="nt"&gt;-exec&lt;/span&gt; &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\;&lt;/span&gt; &lt;span class="nt"&gt;-print&lt;/span&gt;
find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;".node-version"&lt;/span&gt; &lt;span class="nt"&gt;-exec&lt;/span&gt; &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\;&lt;/span&gt; &lt;span class="nt"&gt;-print&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="s1"&gt;'"node"'&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"package.json"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"20"&lt;/span&gt;

&lt;span class="c"&gt;# Check AWS Lambda functions&lt;/span&gt;
aws lambda list-functions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Functions[?Runtime==`nodejs20.x`].[FunctionName]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Test on Node 22 locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using nvm&lt;/span&gt;
nvm install 22
nvm use 22
npm ci
npm test

&lt;span class="c"&gt;# Or with Docker&lt;/span&gt;
&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:20-slim&lt;/span&gt;
&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:22-slim&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native addon compilation failures (most common: sharp, bcrypt, better-sqlite3, canvas)&lt;/li&gt;
&lt;li&gt;Test failures from tightened URL parsing or crypto behavior&lt;/li&gt;
&lt;li&gt;Deprecation warnings that became errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Fix dependency issues
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Rebuild all native addons&lt;/span&gt;
npm rebuild

&lt;span class="c"&gt;# Check for packages that declare Node engine requirements&lt;/span&gt;
npx check-engines

&lt;span class="c"&gt;# Update packages that need newer versions for Node 22&lt;/span&gt;
npm outdated
npm update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common packages that needed updates for Node 22:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sharp:&lt;/strong&gt; Needs 0.33+ for Node 22 prebuilt binaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bcrypt:&lt;/strong&gt; Needs 5.1+ for Node 22 ABI compatibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;node-sass:&lt;/strong&gt; Dead project. Switch to &lt;code&gt;sass&lt;/code&gt; (Dart Sass) immediately -- this will not get Node 22 support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;better-sqlite3:&lt;/strong&gt; Needs 11+ for Node 22&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prisma:&lt;/strong&gt; 5.x supports Node 22. If you are on Prisma 4.x, this is a good time to upgrade.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Update CI pipelines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions -- run both during migration&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v6&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;22'&lt;/span&gt;  &lt;span class="c1"&gt;# was '20'&lt;/span&gt;

&lt;span class="c1"&gt;# To test both versions in a matrix:&lt;/span&gt;
&lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;20'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;22'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Update Docker base images
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin the specific LTS version&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:22-slim&lt;/span&gt;

&lt;span class="c"&gt;# If you were on Alpine&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:22-alpine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; If you use multi-stage builds, update ALL stages -- not just the final one. A common mistake is updating the runtime stage but leaving the builder stage on Node 20.&lt;/p&gt;
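&lt;p&gt;For example (stage names and paths here are placeholders -- the point is that every &lt;code&gt;FROM&lt;/code&gt; line changes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# Builder stage -- the one people forget
FROM node:22-slim AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage
FROM node:22-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;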

&lt;h3&gt;
  
  
  6. Update serverless runtimes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AWS Lambda -- update runtime in SAM/CloudFormation&lt;/span&gt;
&lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nodejs22.x&lt;/span&gt;  &lt;span class="c1"&gt;# was nodejs20.x&lt;/span&gt;

&lt;span class="c1"&gt;# Or via AWS CLI&lt;/span&gt;
&lt;span class="s"&gt;aws lambda update-function-configuration \&lt;/span&gt;
  &lt;span class="s"&gt;--function-name my-function \&lt;/span&gt;
  &lt;span class="s"&gt;--runtime nodejs22.x&lt;/span&gt;

&lt;span class="c1"&gt;# Vercel -- update package.json&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engines"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;22.x"&lt;/span&gt;
&lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Netlify -- update environment variable&lt;/span&gt;
&lt;span class="s"&gt;NODE_VERSION=22&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Deploy with a canary
&lt;/h3&gt;

&lt;p&gt;Do not upgrade every service simultaneously. Pick your least-critical production service. Deploy with Node 22. Watch error rates, latency, and memory usage for 48 hours. Then roll forward to the next service.&lt;/p&gt;

&lt;p&gt;Pay particular attention to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory usage (V8 12.4 may have different heap behavior)&lt;/li&gt;
&lt;li&gt;Cold start times in serverless (first request after deploy)&lt;/li&gt;
&lt;li&gt;TLS handshake failures if connecting to legacy systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Node 20 vs 22 vs 24: Quick comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Node 20 LTS&lt;/th&gt;
&lt;th&gt;Node 22 LTS&lt;/th&gt;
&lt;th&gt;Node 24&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Release date&lt;/td&gt;
&lt;td&gt;April 2023&lt;/td&gt;
&lt;td&gt;April 2024&lt;/td&gt;
&lt;td&gt;May 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active LTS start&lt;/td&gt;
&lt;td&gt;October 2023&lt;/td&gt;
&lt;td&gt;October 2024&lt;/td&gt;
&lt;td&gt;October 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EOL&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;April 30, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;April 30, 2027&lt;/td&gt;
&lt;td&gt;April 30, 2028&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;V8 engine&lt;/td&gt;
&lt;td&gt;11.3&lt;/td&gt;
&lt;td&gt;12.4&lt;/td&gt;
&lt;td&gt;13.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fetch()&lt;/td&gt;
&lt;td&gt;Stable (since 21.x backport)&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;Experimental (20.10+, flag)&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;require(esm)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Release candidate&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test runner&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;Stable (enhanced)&lt;/td&gt;
&lt;td&gt;Stable (enhanced)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Watch mode&lt;/td&gt;
&lt;td&gt;Stable (since 20.13)&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecosystem support&lt;/td&gt;
&lt;td&gt;Universal&lt;/td&gt;
&lt;td&gt;Universal&lt;/td&gt;
&lt;td&gt;Most packages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The one thing nobody tells you about Node version migrations
&lt;/h2&gt;

&lt;p&gt;The breakage is almost never in your application code. It is in native addons. Specifically, it is the one C++ addon that was compiled against Node 20's ABI and ships a prebuilt binary that does not exist for Node 22 yet.&lt;/p&gt;

&lt;p&gt;When this happens, &lt;code&gt;npm ci&lt;/code&gt; either fails with a compilation error (if you do not have build tools installed) or silently downloads a binary for the wrong ABI (which then crashes at runtime with &lt;code&gt;NODE_MODULE_VERSION mismatch&lt;/code&gt;).&lt;/p&gt;
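
&lt;p&gt;You can check the ABI version a given Node binary expects with a quick one-liner -- this is the number a prebuilt addon must match (the version values in the comments are examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// process.versions.modules is the NODE_MODULE_VERSION that native
// addons are compiled against; a prebuilt binary built for a
// different value crashes at require() time.
console.log(process.version);          // e.g. v22.14.0
console.log(process.versions.modules); // "115" on Node 20, "127" on Node 22
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
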

&lt;p&gt;The fix: always run &lt;code&gt;npm rebuild&lt;/code&gt; after switching Node versions. Add it to your Dockerfile. Add it to your CI setup step. Make it automatic so you never think about it again.&lt;/p&gt;
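
&lt;p&gt;A minimal Dockerfile sketch (the base image and entrypoint are illustrative -- adapt to your build):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dockerfile"&gt;&lt;code&gt;FROM node:22-slim

WORKDIR /app
COPY package*.json ./

# npm rebuild forces native addons to recompile against this Node ABI
RUN npm ci &amp;&amp; npm rebuild

COPY . .
CMD ["node", "server.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
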




&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/node-20-vs-22-vs-24-which-node-js-lts-should-you-run-in-production/" rel="noopener noreferrer"&gt;Node 20 vs 22 vs 24: Which LTS Should You Run in Production?&lt;/a&gt; -- detailed comparison of all three versions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/nodejs-20-20-0-release-notes-security-patches/" rel="noopener noreferrer"&gt;Node.js 20.20.0 Security Patches&lt;/a&gt; -- the final security releases you should be running&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/nodejs-22-22-0-release-notes-lts-security/" rel="noopener noreferrer"&gt;Node.js 22.22.0 LTS Security Release&lt;/a&gt; -- what to expect after migrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/nodejs-24-13-1-release-notes/" rel="noopener noreferrer"&gt;Node.js 24.13.1 Release Notes&lt;/a&gt; -- if you're considering the current release line&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/nodejs-undici-7-18-2-critical-security-patch/" rel="noopener noreferrer"&gt;undici v7.18.2 Critical Security Patch&lt;/a&gt; -- the HTTP client vulnerability that affects all Node versions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/tlssocket-default-error-handler-nodejs-22-22-0/" rel="noopener noreferrer"&gt;Node.js 22.22.0 TLSSocket Changes&lt;/a&gt; -- a subtle behaviour change to watch for during migration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When exactly does Node 20 reach end of life?&lt;/strong&gt; April 30, 2026. After this date, the Node.js project will not release any further patches for the 20.x line. If a CVE is found in Node 20 after this date, the fix will only ship for Node 22+ and you will need to upgrade to receive it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I skip Node 22 and go straight to Node 24?&lt;/strong&gt; Yes -- Node 24 entered Active LTS in October 2025, so it is a supported migration target. The jump from Node 20 to 24 is larger -- two V8 major versions -- so expect more native addon rebuilds and test more thoroughly. For most teams, Node 22 is the safer choice because it has been in production for over a year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Node 20 EOL affect my operating system?&lt;/strong&gt; Linux distributions that ship Node 20 as a system package (some Debian/Ubuntu versions) may continue to backport security patches on their own timeline. However, this only covers the Node binary itself -- not npm packages, not your application dependencies. For application security, you need an upstream-supported Node version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will AWS Lambda stop running Node 20 functions on April 30?&lt;/strong&gt; No. AWS provides at least 180 days' notice before deprecating a runtime, then phases it out in stages: first blocking new function creation, then blocking updates. Existing invocations can continue indefinitely on deprecated runtimes -- AWS does not forcibly stop running functions. But running an EOL runtime in Lambda means neither upstream Node.js nor AWS is patching it, so your functions are exposed to any vulnerabilities found after the EOL date. Do not confuse "still runs" with "still safe."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I check which Node version my production containers are running?&lt;/strong&gt; If you use Docker, check your base image tags. For running containers: &lt;code&gt;docker exec CONTAINER node --version&lt;/code&gt;. For Kubernetes: &lt;code&gt;kubectl exec POD -- node --version&lt;/code&gt;. For a fleet, consider adding the Node version to your health check endpoint so monitoring can track it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about TypeScript?&lt;/strong&gt; TypeScript itself runs on whatever Node version you have -- the compiler is pure JavaScript. The concern is with &lt;code&gt;@types/node&lt;/code&gt;: make sure you update to &lt;code&gt;@types/node@22&lt;/code&gt; to get accurate type definitions for Node 22's APIs. Also check that your &lt;code&gt;tsconfig.json&lt;/code&gt; target and lib settings are appropriate for Node 22's V8 version.&lt;/p&gt;
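
&lt;p&gt;The relevant bits, sketched (exact version numbers and targets depend on your project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// package.json -- pin the type definitions to the runtime major
"devDependencies": {
  "@types/node": "^22.0.0"
}

// tsconfig.json -- Node 22's V8 12.4 handles ES2023 output natively
{
  "compilerOptions": {
    "target": "ES2023",
    "module": "NodeNext",
    "moduleResolution": "NodeNext"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
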




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/package-json-health/" rel="noopener noreferrer"&gt;npm Package Health Checker&lt;/a&gt; — paste your &lt;code&gt;package.json&lt;/code&gt; and check every dependency for deprecation and staleness. Free.&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Events Explained: Types, kubectl Commands, and Observability Patterns</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:16:15 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/kubernetes-events-explained-types-kubectl-commands-and-observability-patterns-4e1m</link>
      <guid>https://dev.to/matheus_releaserun/kubernetes-events-explained-types-kubectl-commands-and-observability-patterns-4e1m</guid>
      <description>&lt;h2&gt;
  
  
  What Are Kubernetes Events?
&lt;/h2&gt;

&lt;p&gt;Every time something happens inside a Kubernetes cluster -- a pod gets scheduled, a container image is pulled, a volume fails to mount -- the control plane records it as an &lt;strong&gt;Event&lt;/strong&gt;. Events are first-class API objects (kind: &lt;code&gt;Event&lt;/code&gt;) that provide a running log of what is happening across your nodes, pods, deployments, and other resources.&lt;/p&gt;

&lt;p&gt;Unlike application logs, which capture output from your code, Kubernetes events describe the &lt;em&gt;lifecycle of cluster objects themselves&lt;/em&gt;. They answer questions like: Why is this pod stuck in Pending? Why did that node go NotReady? Why was my container OOM-killed?&lt;/p&gt;

&lt;p&gt;Events are stored in etcd alongside other API objects and are accessible through the Kubernetes API. They are namespaced resources, meaning each event belongs to a specific namespace (or to the cluster scope for node-level events). Understanding how to read, filter, and export events is one of the most practical debugging skills a Kubernetes operator can develop.&lt;/p&gt;
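
&lt;p&gt;Because events are ordinary API objects, you can also fetch them straight from the API server, which is handy when building tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get --raw /api/v1/namespaces/default/events
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
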

&lt;h2&gt;
  
  
  Event Types: Normal and Warning
&lt;/h2&gt;

&lt;p&gt;Kubernetes classifies every event into one of two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normal&lt;/strong&gt; -- Indicates that something expected happened. A pod was scheduled, a container started, a volume was successfully attached. These events confirm that the system is working as intended.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warning&lt;/strong&gt; -- Indicates that something unexpected or potentially problematic occurred. A container crashed, an image pull failed, a node ran out of resources. Warning events are the ones you typically want to monitor and alert on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of what a Normal event looks like when a pod starts successfully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LAST SEEN   TYPE     REASON      OBJECT          MESSAGE
2m          Normal   Scheduled   pod/web-abc12   Successfully assigned default/web-abc12 to node-3
2m          Normal   Pulling     pod/web-abc12   Pulling image "nginx:1.27"
2m          Normal   Pulled      pod/web-abc12   Successfully pulled image "nginx:1.27" in 1.2s
2m          Normal   Created     pod/web-abc12   Created container nginx
2m          Normal   Started     pod/web-abc12   Started container nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is a Warning event when something goes wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LAST SEEN   TYPE      REASON             OBJECT          MESSAGE
30s         Warning   FailedScheduling   pod/web-xyz99   0/5 nodes are available: 5 Insufficient memory.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anatomy of a Kubernetes Event
&lt;/h2&gt;

&lt;p&gt;Each event object contains several fields that together tell you exactly what happened, to which object, and when. Understanding these fields is essential for effective debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Event Fields
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;type&lt;/strong&gt; -- Either &lt;code&gt;Normal&lt;/code&gt; or &lt;code&gt;Warning&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reason&lt;/strong&gt; -- A short, CamelCase string that categorizes the event. Examples: &lt;code&gt;Scheduled&lt;/code&gt;, &lt;code&gt;Pulling&lt;/code&gt;, &lt;code&gt;FailedMount&lt;/code&gt;, &lt;code&gt;BackOff&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;message&lt;/strong&gt; -- A human-readable description of what happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;involvedObject&lt;/strong&gt; -- The API object the event is about, including its &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, and &lt;code&gt;uid&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;source&lt;/strong&gt; -- The component that generated the event (e.g., &lt;code&gt;kubelet&lt;/code&gt;, &lt;code&gt;default-scheduler&lt;/code&gt;, &lt;code&gt;kube-controller-manager&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;count&lt;/strong&gt; -- How many times this event has occurred. Kubernetes deduplicates repeated events and increments this counter instead of creating new objects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;firstTimestamp&lt;/strong&gt; -- When the event was first recorded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lastTimestamp&lt;/strong&gt; -- When the event was most recently recorded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a full event object in YAML format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Event&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-abc12.17f3a2b8c9d1e4f6&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;creationTimestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-02-16T10:30:00Z"&lt;/span&gt;
&lt;span class="na"&gt;involvedObject&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-abc12&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a1b2c3d4-e5f6-7890-abcd-ef1234567890&lt;/span&gt;
&lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BackOff&lt;/span&gt;
&lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Back-off&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;web-abc12_default"&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubelet&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node-3&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Warning&lt;/span&gt;
&lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="na"&gt;firstTimestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-02-16T10:25:00Z"&lt;/span&gt;
&lt;span class="na"&gt;lastTimestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-02-16T10:30:00Z"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Viewing Events with kubectl
&lt;/h2&gt;

&lt;p&gt;The most common way to inspect events is through &lt;code&gt;kubectl&lt;/code&gt;. Here are the commands you will use most often.&lt;/p&gt;

&lt;h3&gt;
  
  
  List All Events in the Current Namespace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns events in your current namespace (as set by your kubeconfig context). To see events in a different namespace, add the &lt;code&gt;-n&lt;/code&gt; flag. To see events across all namespaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events --all-namespaces
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sort Events by Time
&lt;/h3&gt;

&lt;p&gt;By default, events are not guaranteed to be in chronological order. Sort them by creation timestamp to see the most recent activity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events --sort-by=.metadata.creationTimestamp
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of the most useful flags when triaging an incident. It lets you reconstruct a timeline of what happened in the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter Events by Type
&lt;/h3&gt;

&lt;p&gt;To see only Warning events, which are typically the ones that matter during debugging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events --field-selector type=Warning
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also filter by the involved object. For example, to see events for a specific pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events --field-selector involvedObject.name=web-abc12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or combine multiple field selectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events --field-selector type=Warning,involvedObject.kind=Pod
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
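
&lt;p&gt;The filters compose with sorting, so a common incident-triage one-liner is to pull every Warning in the cluster in time order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events --all-namespaces --field-selector type=Warning --sort-by=.metadata.creationTimestamp
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
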



&lt;h3&gt;
  
  
  View Events via kubectl describe
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kubectl describe&lt;/code&gt; command shows events at the bottom of its output for any resource. This is often the fastest way to check events for a specific pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl describe pod web-abc12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Events section at the bottom will show recent events related to that pod, sorted chronologically. This is usually the first command you run when a pod is misbehaving.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wide Output and Custom Columns
&lt;/h3&gt;

&lt;p&gt;For more detail, use wide output or custom columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events -o wide
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or extract specific fields with JSONPath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get events -o jsonpath='{range .items[*]}{.lastTimestamp}{"\t"}{.type}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
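
&lt;p&gt;Recent kubectl versions (1.26+) also ship a dedicated &lt;code&gt;kubectl events&lt;/code&gt; subcommand that sorts chronologically by default and supports similar filtering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl events --types=Warning
kubectl events --for pod/web-abc12 --watch
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
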



&lt;h2&gt;
  
  
  Common Warning Events and What They Mean
&lt;/h2&gt;

&lt;p&gt;Certain warning events appear frequently in production clusters. Knowing what they mean and how to respond to them will save you significant debugging time.&lt;/p&gt;

&lt;h3&gt;
  
  
  FailedScheduling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   FailedScheduling   pod/app-xyz   0/5 nodes are available: 2 Insufficient cpu, 3 Insufficient memory.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler cannot find a node with enough resources to place the pod. This usually means you need to scale up your node pool, reduce resource requests, or free up capacity by evicting lower-priority workloads. Check your resource requests and limits against actual node capacity.&lt;/p&gt;

&lt;h3&gt;
  
  
  ImagePullBackOff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   Failed    pod/app-xyz   Failed to pull image "myregistry.io/app:v2.1": rpc error: unauthorized
Warning   BackOff   pod/app-xyz   Back-off pulling image "myregistry.io/app:v2.1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kubelet cannot pull the container image. Common causes include incorrect image tags, missing or expired registry credentials (imagePullSecrets), or network connectivity issues to the registry. To debug, verify the image name and tag are correct, confirm that the imagePullSecret exists in the pod's namespace and contains valid credentials, and test registry connectivity from the node with &lt;code&gt;curl&lt;/code&gt; or &lt;code&gt;crictl pull&lt;/code&gt;.&lt;/p&gt;
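
&lt;p&gt;Two commands cover most of that checklist (the secret name &lt;code&gt;regcred&lt;/code&gt; is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;# Which pull secrets does the pod reference?
kubectl get pod app-xyz -o jsonpath='{.spec.imagePullSecrets[*].name}'

# Decode the registry credentials the secret actually contains
kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
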

&lt;h3&gt;
  
  
  BackOff (CrashLoopBackOff)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   BackOff   pod/app-xyz   Back-off restarting failed container app in pod app-xyz_default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container keeps crashing and Kubernetes is applying an exponential back-off delay before restarting it. Check the container logs with &lt;code&gt;kubectl logs app-xyz --previous&lt;/code&gt; to see why the application is crashing.&lt;/p&gt;
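
&lt;p&gt;The exit code of the last failed run is also worth pulling, since it distinguishes application errors from OOM kills (137) and other signal terminations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get pod app-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
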

&lt;h3&gt;
  
  
  Unhealthy (Liveness/Readiness Probe Failures)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   Unhealthy   pod/app-xyz   Readiness probe failed: HTTP probe failed with statuscode: 503
Warning   Unhealthy   pod/app-xyz   Liveness probe failed: connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kubelet's health check probes are failing. If the liveness probe fails, Kubernetes will restart the container. If the readiness probe fails, the pod is removed from service endpoints. Review your probe configuration -- the path, port, and timeout values -- and verify that your application is actually healthy on those endpoints.&lt;/p&gt;
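
&lt;p&gt;A typical probe block looks like this (the paths, port, and timings are illustrative -- tune them to your application's startup profile):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15  # give the app time to boot before the first check
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3      # restart only after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
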

&lt;h3&gt;
  
  
  FailedMount and FailedAttachVolume
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   FailedMount         pod/db-abc   Unable to attach or mount volumes: timed out waiting for the condition
Warning   FailedAttachVolume  pod/db-abc   Multi-Attach error for volume "pvc-123": Volume is already attached to node-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pod's volume cannot be attached or mounted. This is common with cloud block storage (EBS, Persistent Disk) when a volume is still attached to a previous node after a failover. Some storage backends do not support ReadWriteMany access mode. When you see this event, check the PersistentVolumeClaim status with &lt;code&gt;kubectl get pvc&lt;/code&gt; and verify the volume's availability in your cloud provider's console. In many cases, force-detaching the volume from the old node resolves the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  OOMKilling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   OOMKilling   pod/app-xyz   Memory cgroup out of memory: Killed process 12345 (java)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container exceeded its memory limit and was killed by the kernel's OOM killer. Either the memory limit is too low for the workload, or the application has a memory leak. Increase the memory limit or investigate the application's memory usage patterns. For more on diagnosing node-level issues, see our &lt;a href="https://releaserun.com/debugging-kubernetes-nodes-notready/" rel="noopener noreferrer"&gt;guide to debugging Kubernetes nodes in NotReady state&lt;/a&gt;.&lt;/p&gt;
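
&lt;p&gt;When adjusting the limit, set requests and limits explicitly so the scheduler and the OOM killer work from the same numbers (the values here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"  # the kernel OOM-kills the container above this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
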

&lt;h3&gt;
  
  
  NodeNotReady
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning   NodeNotReady   node/node-3   Node node-3 status is now: NodeNotReady
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A node has stopped reporting its status to the control plane. This can be caused by kubelet crashes, network partitions, or the node running out of resources (disk pressure, memory pressure, PID pressure). All pods on the affected node will eventually be rescheduled to other nodes after the &lt;code&gt;pod-eviction-timeout&lt;/code&gt; expires (default: 5 minutes). Monitor for this event closely in production -- it often indicates a node that needs investigation or replacement. For a detailed troubleshooting guide, see our article on &lt;a href="https://releaserun.com/debugging-kubernetes-nodes-notready/" rel="noopener noreferrer"&gt;debugging Kubernetes nodes in NotReady state&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event Retention and the Default TTL
&lt;/h2&gt;

&lt;p&gt;One of the most important things to understand about Kubernetes events is that &lt;strong&gt;they are ephemeral by default&lt;/strong&gt;. The kube-apiserver has a default event time-to-live (TTL) of &lt;strong&gt;1 hour&lt;/strong&gt;. After that, events are garbage-collected from etcd.&lt;/p&gt;

&lt;p&gt;This means that if you look at events after an incident that happened two hours ago, they will already be gone. This is one of the main reasons teams set up event exporters (covered in the next section). The short default TTL is intentional -- events can be high-volume in large clusters, and storing them indefinitely in etcd would increase storage and memory pressure on the control plane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the Event TTL
&lt;/h3&gt;

&lt;p&gt;You can change the default TTL by passing the &lt;code&gt;--event-ttl&lt;/code&gt; flag to the kube-apiserver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In the kube-apiserver manifest (e.g., /etc/kubernetes/manifests/kube-apiserver.yaml)&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--event-ttl=6h&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other flags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Increasing the TTL gives you a longer window to inspect events, but it also increases the load on etcd since more objects are stored. For most production clusters, 2-6 hours is a reasonable range. Beyond that, you should be exporting events to an external system.&lt;/p&gt;

&lt;p&gt;If you are planning a cluster upgrade, be aware that changes to apiserver flags may need to be reapplied. Our &lt;a href="https://releaserun.com/kubernetes-upgrade-checklist/" rel="noopener noreferrer"&gt;Kubernetes upgrade checklist&lt;/a&gt; covers these considerations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exporting Events for Long-Term Observability
&lt;/h2&gt;

&lt;p&gt;Since events are garbage-collected after the TTL expires, exporting them to an external logging or observability platform is essential for production clusters. Several tools are available for this purpose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Event Exporter
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/resmoio/kubernetes-event-exporter" rel="noopener noreferrer"&gt;Kubernetes Event Exporter&lt;/a&gt; (originally by OpenPolicyAgent, now maintained by Resmo) watches the event stream and forwards events to sinks like Elasticsearch, OpenSearch, Slack, webhooks, or files.&lt;/p&gt;

&lt;p&gt;Here is a minimal configuration that forwards Warning events to Elasticsearch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;event-exporter-cfg&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;logLevel: error&lt;/span&gt;
    &lt;span class="s"&gt;logFormat: json&lt;/span&gt;
    &lt;span class="s"&gt;route:&lt;/span&gt;
      &lt;span class="s"&gt;routes:&lt;/span&gt;
        &lt;span class="s"&gt;- match:&lt;/span&gt;
            &lt;span class="s"&gt;- receiver: "elasticsearch"&lt;/span&gt;
              &lt;span class="s"&gt;type: Warning&lt;/span&gt;
    &lt;span class="s"&gt;receivers:&lt;/span&gt;
      &lt;span class="s"&gt;- name: "elasticsearch"&lt;/span&gt;
        &lt;span class="s"&gt;elasticsearch:&lt;/span&gt;
          &lt;span class="s"&gt;hosts:&lt;/span&gt;
            &lt;span class="s"&gt;- "http://elasticsearch.monitoring.svc:9200"&lt;/span&gt;
          &lt;span class="s"&gt;index: kube-events&lt;/span&gt;
          &lt;span class="s"&gt;indexFormat: "kube-events-{2006-01-02}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fluentd and Fluent Bit
&lt;/h3&gt;

&lt;p&gt;If you already run Fluentd or Fluent Bit for log collection, you can configure them to collect Kubernetes events as well. Fluent Bit has a built-in &lt;code&gt;kubernetes_events&lt;/code&gt; input plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[INPUT]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;              &lt;span class="err"&gt;kubernetes_events&lt;/span&gt;
    &lt;span class="err"&gt;Tag&lt;/span&gt;               &lt;span class="err"&gt;kube_events.*&lt;/span&gt;
    &lt;span class="err"&gt;Kube_URL&lt;/span&gt;          &lt;span class="err"&gt;https://kubernetes.default.svc:443&lt;/span&gt;
    &lt;span class="err"&gt;Kube_CA_File&lt;/span&gt;      &lt;span class="err"&gt;/var/run/secrets/kubernetes.io/serviceaccount/ca.crt&lt;/span&gt;
    &lt;span class="err"&gt;Kube_Token_File&lt;/span&gt;   &lt;span class="err"&gt;/var/run/secrets/kubernetes.io/serviceaccount/token&lt;/span&gt;

&lt;span class="nn"&gt;[OUTPUT]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;              &lt;span class="err"&gt;es&lt;/span&gt;
    &lt;span class="err"&gt;Match&lt;/span&gt;             &lt;span class="err"&gt;kube_events.*&lt;/span&gt;
    &lt;span class="err"&gt;Host&lt;/span&gt;              &lt;span class="err"&gt;elasticsearch.monitoring.svc&lt;/span&gt;
    &lt;span class="err"&gt;Port&lt;/span&gt;              &lt;span class="err"&gt;9200&lt;/span&gt;
    &lt;span class="err"&gt;Index&lt;/span&gt;             &lt;span class="err"&gt;kube-events&lt;/span&gt;
    &lt;span class="err"&gt;Type&lt;/span&gt;              &lt;span class="err"&gt;_doc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes Event Router (Heptio/VMware)
&lt;/h3&gt;

&lt;p&gt;The Event Router is a simpler alternative that captures events and writes them to stdout in a structured format. Any log aggregation system (Fluentd, Promtail, Vector, etc.) can then pick up that stdout. Note that the original Heptio repository has been archived, though the published images and community forks remain in use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eventrouter&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eventrouter&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eventrouter&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eventrouter&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-eventrouter&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/heptio-images/eventrouter:v0.4&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/eventrouter&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
        &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eventrouter-cm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prometheus and Alerting
&lt;/h3&gt;

&lt;p&gt;While events themselves are not natively exposed as Prometheus metrics, you can use &lt;code&gt;kube-state-metrics&lt;/code&gt; to generate metrics from events. The &lt;code&gt;kube_pod_status_reason&lt;/code&gt; and similar metrics can trigger alerts for patterns like repeated OOMKills or CrashLoopBackOffs. You can also build custom Prometheus alerts that fire when specific event patterns appear in your exported event data, creating a bridge between Kubernetes events and your alerting infrastructure.&lt;/p&gt;
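&lt;p&gt;As a concrete sketch of that bridge: assuming you run the Prometheus Operator (its &lt;code&gt;PrometheusRule&lt;/code&gt; CRD) and standard &lt;code&gt;kube-state-metrics&lt;/code&gt; metric names, an alert on repeated container restarts might look like the following. The threshold, names, and labels are illustrative, not prescriptive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: event-pattern-alerts
  namespace: monitoring
spec:
  groups:
  - name: kube-events
    rules:
    - alert: PodRestartingFrequently
      # Fires when a container restarts more than 3 times in 15 minutes --
      # the metric-level signature of a CrashLoopBackOff event stream
      expr: increase(kube_pod_container_status_restarts_total[15m]) &gt; 3
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;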

&lt;h2&gt;
  
  
  Events in Modern Kubernetes: events.k8s.io/v1
&lt;/h2&gt;

&lt;p&gt;Historically, Kubernetes events used the core &lt;code&gt;v1&lt;/code&gt; API (&lt;code&gt;apiVersion: v1, kind: Event&lt;/code&gt;). In Kubernetes 1.19, the &lt;code&gt;events.k8s.io&lt;/code&gt; API group graduated to a stable &lt;code&gt;v1&lt;/code&gt;, bringing several improvements. As of Kubernetes 1.35, this is the recommended API for working with events. For a full overview of what changed in this release, see our &lt;a href="https://releaserun.com/kubernetes-1-35-release-preview/" rel="noopener noreferrer"&gt;Kubernetes 1.35 release preview&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Changes in events.k8s.io/v1
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;regarding&lt;/strong&gt; -- Replaces &lt;code&gt;involvedObject&lt;/code&gt;. Contains a reference to the primary object the event is about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;related&lt;/strong&gt; -- A new field that provides a reference to a secondary object. For example, if a pod event is related to a specific node, the node reference goes here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reportingController&lt;/strong&gt; -- Replaces &lt;code&gt;source.component&lt;/code&gt;. A string identifying the controller that reported the event (e.g., &lt;code&gt;k8s.io/kubelet&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reportingInstance&lt;/strong&gt; -- Replaces &lt;code&gt;source.host&lt;/code&gt;. Identifies the specific instance of the controller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;note&lt;/strong&gt; -- Replaces &lt;code&gt;message&lt;/code&gt;. A human-readable description of the event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;series&lt;/strong&gt; -- Replaces &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;firstTimestamp&lt;/code&gt;, and &lt;code&gt;lastTimestamp&lt;/code&gt; with a structured &lt;code&gt;EventSeries&lt;/code&gt; object that tracks recurring events more efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is what a modern event looks like in the new API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;events.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Event&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-abc12.a1b2c3d4e5f6&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;regarding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-abc12&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;related&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Node&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node-3&lt;/span&gt;
&lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BackOff&lt;/span&gt;
&lt;span class="na"&gt;note&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Back-off&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;web-abc12_default"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Warning&lt;/span&gt;
&lt;span class="na"&gt;reportingController&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubelet&lt;/span&gt;
&lt;span class="na"&gt;reportingInstance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node-3&lt;/span&gt;
&lt;span class="na"&gt;eventTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-02-16T10:30:00.000000Z"&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restarting&lt;/span&gt;
&lt;span class="na"&gt;series&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;lastObservedTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-02-16T10:30:00.000000Z"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Recipes for Event-Driven Debugging
&lt;/h2&gt;

&lt;p&gt;Here are some workflows that combine event inspection with other kubectl commands to quickly diagnose common issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recipe 1: Why Is My Pod Pending?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Check events &lt;span class="k"&gt;for &lt;/span&gt;the pending pod
&lt;span class="go"&gt;kubectl get events --field-selector involvedObject.name=my-pod --sort-by=.metadata.creationTimestamp

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Look &lt;span class="k"&gt;for &lt;/span&gt;FailedScheduling reason and &lt;span class="nb"&gt;read &lt;/span&gt;the message
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Common causes: insufficient CPU/memory, node affinity/anti-affinity rules,
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;taints without matching tolerations, PVC not bound
&lt;span class="go"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Check node resource availability
&lt;span class="go"&gt;kubectl describe nodes | grep -A 5 "Allocated resources"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recipe 2: Find All Failing Pods in a Namespace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Get all Warning events &lt;span class="k"&gt;in &lt;/span&gt;the production namespace, sorted by &lt;span class="nb"&gt;time&lt;/span&gt;
&lt;span class="go"&gt;kubectl get events -n production \
  --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=TIME:.lastTimestamp,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recipe 3: Monitor Events in Real Time
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Watch events as they happen &lt;span class="o"&gt;(&lt;/span&gt;like &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;events&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;kubectl get events --watch

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Watch only warnings across all namespaces
&lt;span class="go"&gt;kubectl get events --all-namespaces --field-selector type=Warning --watch
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recipe 4: Audit Node Stability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Check events &lt;span class="k"&gt;for &lt;/span&gt;a specific node
&lt;span class="go"&gt;kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=node-3

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Look &lt;span class="k"&gt;for &lt;/span&gt;patterns: NodeNotReady, NodeHasDiskPressure, NodeHasMemoryPressure,
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;NodeHasInsufficientPID, NodeRebooted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Best Practices for Working with Kubernetes Events
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Export events to a durable store.&lt;/strong&gt; The 1-hour default TTL means events vanish quickly. Use an event exporter, Fluent Bit, or another tool to ship events to Elasticsearch, Loki, or your SIEM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert on Warning events.&lt;/strong&gt; Set up alerts for high-frequency warnings like &lt;code&gt;OOMKilling&lt;/code&gt;, &lt;code&gt;FailedScheduling&lt;/code&gt;, and &lt;code&gt;CrashLoopBackOff&lt;/code&gt;. Track event counts over time to catch trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use field selectors in scripts.&lt;/strong&gt; When building automation, use &lt;code&gt;--field-selector&lt;/code&gt; to filter events server-side rather than piping through grep. Only matching events cross the wire, and your scripts stop depending on fragile output formatting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlate events with logs and metrics.&lt;/strong&gt; Events tell you &lt;em&gt;what&lt;/em&gt; happened at the orchestration layer. Combine them with container logs (the &lt;em&gt;why&lt;/em&gt;) and metrics (the &lt;em&gt;how much&lt;/em&gt;) for a complete picture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase the TTL for staging and CI clusters.&lt;/strong&gt; In environments where you debug after the fact, set &lt;code&gt;--event-ttl=12h&lt;/code&gt; or higher to keep events around longer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat events as a first-class observability signal.&lt;/strong&gt; Events are often overlooked in favor of logs and metrics, but they provide the clearest view into Kubernetes control-plane decisions like scheduling, scaling, and health checking.&lt;/li&gt;
&lt;/ul&gt;
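&lt;p&gt;To make the "filter server-side, then aggregate" advice concrete, here is a minimal shell sketch. Since no cluster is assumed here, the &lt;code&gt;printf&lt;/code&gt; line stands in for real output; against a live cluster you would feed the same pipeline from &lt;code&gt;kubectl get events --field-selector type=Warning -o custom-columns=:.reason --no-headers&lt;/code&gt; instead.&lt;br&gt;
&lt;/p&gt;

```shell
# Stand-in for cluster output: one Warning reason per line, as produced by
#   kubectl get events --field-selector type=Warning \
#     -o custom-columns=:.reason --no-headers
printf 'BackOff\nBackOff\nFailedScheduling\nBackOff\n' > /tmp/warning-reasons.txt

# Count each Warning reason, most frequent first -- a quick triage view
sort /tmp/warning-reasons.txt | uniq -c | sort -rn
```

&lt;p&gt;On the sample data this lists &lt;code&gt;BackOff&lt;/code&gt; (seen three times) above &lt;code&gt;FailedScheduling&lt;/code&gt; (seen once), surfacing the dominant failure mode immediately.&lt;/p&gt;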

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Kubernetes events are the cluster's built-in audit trail. They record every significant lifecycle change -- from pod scheduling to volume attachment to node health transitions. By mastering &lt;code&gt;kubectl get events&lt;/code&gt; with field selectors and time sorting, setting up event exporters for long-term retention, and alerting on Warning-type events, you gain deep visibility into what your cluster is doing and why.&lt;/p&gt;

&lt;p&gt;The shift to the &lt;code&gt;events.k8s.io/v1&lt;/code&gt; API brings cleaner semantics with &lt;code&gt;regarding&lt;/code&gt;/&lt;code&gt;related&lt;/code&gt; fields and better deduplication through the &lt;code&gt;series&lt;/code&gt; structure. Whether you are debugging a single failing pod or building a comprehensive observability stack, events should be one of the first signals you reach for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-1-32-end-of-life-migration-playbook/" rel="noopener noreferrer"&gt;Kubernetes 1.32 End of Life: Migration Playbook&lt;/a&gt; -- version 1.32 reaches end of life February 28, 2026&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/kubernetes-security-linter/" rel="noopener noreferrer"&gt;Kubernetes YAML Security Linter&lt;/a&gt; — paste any K8s manifest and scan for 12 security issues with an A–F grade. Free, browser-based.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>observability</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Companies Using Kubernetes in 2026: Who Runs K8s and How They Scale</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:15:36 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/companies-using-kubernetes-in-2026-who-runs-k8s-and-how-they-scale-2log</link>
      <guid>https://dev.to/matheus_releaserun/companies-using-kubernetes-in-2026-who-runs-k8s-and-how-they-scale-2log</guid>
      <description>&lt;h2&gt;
  
  
  Kubernetes Adoption in 2026: The Numbers
&lt;/h2&gt;

&lt;p&gt;Kubernetes has moved well past the early-adopter phase. According to the &lt;strong&gt;CNCF Annual Survey 2024&lt;/strong&gt;, 84% of organizations are either using or evaluating containers in production, with Kubernetes as the dominant orchestrator. The &lt;strong&gt;Datadog 2024 Container Report&lt;/strong&gt; found that over 65% of organizations running containers have adopted Kubernetes, up from roughly 50% just two years prior.&lt;/p&gt;

&lt;p&gt;What was once a technology associated primarily with Silicon Valley hyperscalers is now standard infrastructure across industries -- from banking and healthcare to government agencies and particle physics labs. For a broader look at adoption trends and data, see our detailed &lt;a href="https://releaserun.com/kubernetes-statistics-adoption-2026/" rel="noopener noreferrer"&gt;Kubernetes statistics and adoption report for 2026&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article profiles nine organizations that run Kubernetes at significant scale, covering what they run, how big their deployments are, and what lessons other teams can draw from their experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech and Media Companies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spotify: 4,000+ Microservices Across 200 Clusters
&lt;/h3&gt;

&lt;p&gt;Spotify is one of the most frequently cited large-scale Kubernetes adopters, and for good reason. The music streaming platform serves over 600 million monthly active users and runs more than &lt;strong&gt;4,000 microservices&lt;/strong&gt; across approximately &lt;strong&gt;200 Kubernetes clusters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Spotify migrated from a Helios-based container orchestration system (built in-house) to Kubernetes beginning around 2019. The migration was driven by the desire to reduce the operational burden of maintaining a custom orchestrator and to benefit from the Kubernetes ecosystem's tooling and community.&lt;/p&gt;

&lt;p&gt;Key details of Spotify's Kubernetes setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs on &lt;strong&gt;Google Kubernetes Engine (GKE)&lt;/strong&gt; as the primary platform.&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Backstage&lt;/strong&gt; -- Spotify's open-source developer portal, now a CNCF incubating project -- as the interface for developers to deploy and manage services on Kubernetes without needing deep K8s knowledge.&lt;/li&gt;
&lt;li&gt;Operates a multi-cluster architecture with separate clusters for different teams and environments.&lt;/li&gt;
&lt;li&gt;Handles over 10 million requests per second across its microservices mesh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Spotify's experience shows that a strong developer platform layer on top of Kubernetes (like Backstage) is critical for adoption at scale. Most developers at Spotify do not write Kubernetes YAML directly -- the platform abstracts it away.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reddit: From Bare Metal to Kubernetes
&lt;/h3&gt;

&lt;p&gt;Reddit's migration story is notable because the company moved from a traditional bare-metal infrastructure to Kubernetes. For years, Reddit ran its services on physical servers managed with configuration management tools. The limitations of this approach -- slow deployments, manual scaling, and hardware procurement lead times -- drove the shift to Kubernetes on AWS.&lt;/p&gt;

&lt;p&gt;Reddit now runs its core platform on &lt;strong&gt;Amazon EKS&lt;/strong&gt;, including the services that power the front page, comment threads, voting, and real-time features. At peak traffic, Reddit serves hundreds of millions of page views per day, with traffic spikes that can be sudden and massive (viral posts, breaking news events, AMA sessions). The migration was gradual, taking several years to move all production workloads.&lt;/p&gt;

&lt;p&gt;The engineering team invested heavily in building a custom Kubernetes deployment platform that integrated with their existing tooling. They adopted a "paved road" approach, providing standardized Helm charts and CI/CD pipelines that made it easy for service teams to migrate without becoming Kubernetes experts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Large-scale bare-metal-to-Kubernetes migrations are possible but require patience. Reddit's team emphasized the importance of running old and new infrastructure in parallel during the transition, and investing heavily in CI/CD pipelines to support the new deployment model. They also found that the cost savings from moving away from owned hardware to cloud-based Kubernetes were significant, even accounting for the cloud provider costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The New York Times: News on GKE
&lt;/h3&gt;

&lt;p&gt;The New York Times moved its digital infrastructure to &lt;strong&gt;Google Kubernetes Engine (GKE)&lt;/strong&gt; to support the rapid iteration required by a modern digital newsroom. The migration consolidated a patchwork of deployment systems into a unified Kubernetes-based platform.&lt;/p&gt;

&lt;p&gt;The NYT runs content delivery, search, personalization, and subscription services on GKE. Their engineering team built an internal delivery platform that lets developers deploy services through a simplified interface, abstracting away Kubernetes complexity for reporters and editors who work on interactive projects.&lt;/p&gt;

&lt;p&gt;The NYT engineering team has spoken publicly about the benefits of Kubernetes for their newsroom's technical projects. During major news events -- elections, breaking stories, live events -- traffic can spike by 5-10x within minutes. Kubernetes' horizontal pod autoscaling lets them handle these spikes automatically, which was difficult to achieve on their previous infrastructure.&lt;/p&gt;
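&lt;p&gt;The NYT has not published its manifests, but the mechanism is standard Kubernetes. A minimal &lt;code&gt;autoscaling/v2&lt;/code&gt; HorizontalPodAutoscaler of the kind that absorbs such spikes looks like this (all names and numbers are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: article-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: article-frontend
  minReplicas: 10
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        # Scale out once average CPU across pods passes 70%
        averageUtilization: 70
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;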

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Kubernetes adoption is not just for tech companies. Media organizations with demanding content delivery requirements benefit from the scalability and rapid deployment cycles that Kubernetes provides. The NYT also demonstrates the value of having a platform engineering team that shields content-focused developers from infrastructure complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pinterest: 30,000+ Pods at Scale
&lt;/h3&gt;

&lt;p&gt;Pinterest runs one of the larger Kubernetes deployments in the consumer technology space. The visual discovery platform operates &lt;strong&gt;over 30,000 pods&lt;/strong&gt; across multiple clusters, supporting a user base of more than 450 million monthly active users.&lt;/p&gt;

&lt;p&gt;Pinterest's infrastructure handles computationally intensive workloads including image processing, recommendation algorithms, and search indexing. The company has been public about the challenges of running machine learning training and inference workloads on Kubernetes, contributing to upstream projects around GPU scheduling and resource management.&lt;/p&gt;

&lt;p&gt;Key aspects of Pinterest's setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cluster architecture running on &lt;strong&gt;AWS EKS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Custom autoscaling policies tuned for workloads with bursty traffic patterns (e.g., holiday shopping seasons).&lt;/li&gt;
&lt;li&gt;Heavy use of Kubernetes for batch processing and ML training alongside serving workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Running both serving and batch/ML workloads on Kubernetes is feasible but requires careful attention to scheduling, resource isolation, and autoscaling. Pinterest's multi-cluster strategy helps isolate failures and manage upgrades safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  E-Commerce and Consumer Brands
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Airbnb: EKS After the Monolith
&lt;/h3&gt;

&lt;p&gt;Airbnb's Kubernetes journey began as part of a broader effort to decompose its Ruby on Rails monolith into microservices. The company migrated to &lt;strong&gt;Amazon EKS&lt;/strong&gt; and built a service-oriented architecture where hundreds of services run independently on Kubernetes.&lt;/p&gt;

&lt;p&gt;Airbnb's engineering team developed a significant amount of internal tooling around Kubernetes, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A service configuration system that generates Kubernetes manifests from a higher-level service definition.&lt;/li&gt;
&lt;li&gt;Custom admission controllers for enforcing organizational policies (resource limits, security contexts, labeling requirements).&lt;/li&gt;
&lt;li&gt;Integration with their experimentation platform, allowing A/B tests to be deployed as separate Kubernetes rollouts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airbnb processes millions of searches and bookings daily, with each request touching dozens of downstream services. Their Kubernetes deployment handles significant computational workloads including search ranking, pricing algorithms, and real-time availability checks. The company has shared that their migration to Kubernetes reduced deployment times from hours to minutes and significantly improved developer velocity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Breaking up a monolith and moving to Kubernetes are often done together, but they are separate concerns. Airbnb found that the microservices decomposition was the harder problem -- Kubernetes provided the runtime, but the architectural decisions around service boundaries were what determined success. Their custom admission controllers are worth noting as well -- enforcing organizational standards at the cluster level prevents configuration drift and security gaps as the number of services grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adidas: On-Prem to Cloud-Native
&lt;/h3&gt;

&lt;p&gt;Adidas migrated its e-commerce platform from traditional on-premises infrastructure to Kubernetes on AWS. The sports brand was one of the earlier enterprise adopters in the retail space, driven by the need to handle massive traffic spikes during product launches (particularly limited-edition sneaker drops, which generate extreme burst traffic).&lt;/p&gt;

&lt;p&gt;After the migration, Adidas reported a significant reduction in deployment lead time -- from weeks to minutes -- and improved ability to scale for peak traffic events. The platform team standardized on Kubernetes across development, staging, and production environments, creating consistency across the software delivery lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Retail companies with extreme traffic variability benefit enormously from Kubernetes' horizontal pod autoscaling and cluster autoscaling. The ability to scale up for a product launch and scale down afterward translates directly into cost savings compared to provisioning for peak capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Financial Services
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Capital One: Kubernetes in Banking
&lt;/h3&gt;

&lt;p&gt;Capital One has been one of the most visible proponents of Kubernetes adoption in the financial services industry. The bank runs a large-scale Kubernetes platform on AWS and has contributed to several open-source projects in the Kubernetes ecosystem, including &lt;strong&gt;Critical Stack&lt;/strong&gt; (a Kubernetes management platform they later open-sourced).&lt;/p&gt;

&lt;p&gt;Running Kubernetes in financial services comes with additional constraints that do not apply to most technology companies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory compliance:&lt;/strong&gt; Financial regulators require strict controls around data access, encryption, and audit logging. Capital One's Kubernetes platform integrates with their compliance and governance systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security requirements:&lt;/strong&gt; Multi-tenancy is enforced through namespace isolation, network policies, and OPA/Gatekeeper admission policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change management:&lt;/strong&gt; Deployments follow formal change management processes, with Kubernetes rollouts integrated into the bank's change advisory board workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Kubernetes adoption in regulated industries is absolutely possible but requires upfront investment in policy enforcement, audit logging, and integration with existing compliance frameworks. Tools like OPA Gatekeeper and Kubernetes RBAC are essential building blocks.&lt;/p&gt;
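&lt;p&gt;For a flavor of what cluster-level policy enforcement looks like in practice: assuming the &lt;code&gt;K8sRequiredLabels&lt;/code&gt; ConstraintTemplate from the standard Gatekeeper policy library is installed, a constraint that rejects unlabeled deployments might read as follows. This is a generic sketch, not one of Capital One's actual policies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner-label
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
  parameters:
    # Admission is denied for any Deployment missing this label
    labels: ["owner"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;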

&lt;h2&gt;
  
  
  Government and Research
&lt;/h2&gt;

&lt;h3&gt;
  
  
  US Department of Defense: Platform One
&lt;/h3&gt;

&lt;p&gt;The US Department of Defense (DoD) operates &lt;strong&gt;Platform One&lt;/strong&gt;, a Kubernetes-based DevSecOps platform that provides a standardized, security-hardened software delivery environment for defense applications. Platform One is built on top of a DoD-hardened Kubernetes distribution and includes a curated set of tools for CI/CD, monitoring, logging, and security scanning.&lt;/p&gt;

&lt;p&gt;Platform One serves as the foundation for &lt;strong&gt;Big Bang&lt;/strong&gt;, a Helm-based deployment package that installs a complete DevSecOps stack on any Kubernetes cluster. Components include Istio for service mesh, Prometheus and Grafana for monitoring, Elasticsearch and Kibana for logging, and various security scanning tools that meet DoD security requirements (STIG compliance).&lt;/p&gt;

&lt;p&gt;Key aspects of Platform One:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designed to run on &lt;strong&gt;any infrastructure&lt;/strong&gt;: cloud, on-premises, or air-gapped environments.&lt;/li&gt;
&lt;li&gt;All container images are scanned and signed through the DoD's &lt;strong&gt;Iron Bank&lt;/strong&gt; registry.&lt;/li&gt;
&lt;li&gt;Supports multiple classification levels with appropriate network isolation.&lt;/li&gt;
&lt;li&gt;Used by multiple branches of the military and defense agencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; If the US Department of Defense can run Kubernetes with its extreme security requirements, most organizations can too. The key is a standardized platform approach (Platform One/Big Bang) rather than letting every team build their own Kubernetes setup. For context on how Kubernetes compares to simpler container runtimes in different scenarios, see our &lt;a href="https://releaserun.com/docker-vs-kubernetes-production-2026-decision-rubric/" rel="noopener noreferrer"&gt;Docker vs Kubernetes production decision rubric&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  CERN: Kubernetes for Particle Physics
&lt;/h3&gt;

&lt;p&gt;CERN, the European Organization for Nuclear Research, uses Kubernetes to manage the massive data processing pipelines required to analyze data from the Large Hadron Collider (LHC). CERN's computing infrastructure processes petabytes of physics data, and Kubernetes helps orchestrate the batch processing jobs and analysis workflows.&lt;/p&gt;

&lt;p&gt;CERN's Kubernetes deployment is notable for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs on &lt;strong&gt;on-premises infrastructure&lt;/strong&gt; in CERN's data centers, not on public cloud.&lt;/li&gt;
&lt;li&gt;Manages workloads that are heavily batch-oriented, using Kubernetes alongside HTCondor and other HPC schedulers.&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;OpenStack Magnum&lt;/strong&gt; for provisioning Kubernetes clusters on their private cloud infrastructure.&lt;/li&gt;
&lt;li&gt;Contributes to upstream Kubernetes development, particularly around batch scheduling and multi-cluster federation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Kubernetes is not just for web services. Batch processing, scientific computing, and data pipelines are legitimate Kubernetes workloads, especially when combined with tools like Kubernetes Jobs, CronJobs, and the emerging Kubernetes Batch/HPC features. For more on how different Kubernetes distributions serve these varied use cases, see our &lt;a href="https://releaserun.com/kubernetes-distributions-compared/" rel="noopener noreferrer"&gt;Kubernetes distributions comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Industry Breakdown: Where Kubernetes Runs
&lt;/h2&gt;

&lt;p&gt;Looking across the companies profiled above and the broader ecosystem, Kubernetes adoption follows clear patterns by industry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology and Media
&lt;/h3&gt;

&lt;p&gt;Technology companies were the earliest adopters and run the largest deployments. Adoption is near-universal among companies with more than 500 engineers. Kubernetes is typically managed by a dedicated platform engineering team that provides an internal developer platform. The tech sector also leads in multi-cluster adoption, with companies routinely running dozens or hundreds of clusters segmented by team, region, or workload type.&lt;/p&gt;

&lt;h3&gt;
  
  
  Financial Services
&lt;/h3&gt;

&lt;p&gt;Banks, insurance companies, and fintech firms have adopted Kubernetes aggressively over the past five years. The main drivers are faster time-to-market for financial products and the ability to scale trading and payment processing systems dynamically. Compliance and security overhead is significant but manageable with the right tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  E-Commerce and Retail
&lt;/h3&gt;

&lt;p&gt;Retail companies with seasonal traffic patterns (Black Friday, product launches, holiday shopping) benefit from Kubernetes' autoscaling capabilities. Companies like Adidas, Target, and Zalando have all migrated to Kubernetes-based platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare and Life Sciences
&lt;/h3&gt;

&lt;p&gt;Healthcare organizations are increasingly adopting Kubernetes for electronic health record (EHR) systems, genomics processing, and medical imaging workloads. HIPAA compliance requirements add complexity, similar to financial services, but Kubernetes' namespace isolation and network policies provide the necessary building blocks. Companies like Philips and Kaiser Permanente have invested significantly in Kubernetes platforms for both clinical and research workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Government and Defense
&lt;/h3&gt;

&lt;p&gt;Government adoption has accelerated significantly, led by the US DoD's Platform One initiative. Other agencies, including the IRS and VA, have Kubernetes initiatives. Government adoption emphasizes security hardening, air-gapped deployment capabilities, and FedRAMP compliance. The US Census Bureau and NHS Digital (UK) have also adopted Kubernetes for citizen-facing services, showing that government use extends beyond defense into civilian applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption by Company Size
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Startups (1-50 Engineers)
&lt;/h3&gt;

&lt;p&gt;For early-stage startups, managed Kubernetes services (EKS, GKE, AKS) have reduced the barrier to entry significantly. However, the operational overhead of Kubernetes can be substantial for small teams. Many startups start with simpler alternatives (AWS ECS, Google Cloud Run, Railway) and migrate to Kubernetes as they grow. The decision depends on team expertise and workload complexity. That said, startups building infrastructure-heavy products (developer tools, data platforms, security tools) often adopt Kubernetes early because their customers expect Kubernetes-native deployment options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mid-Market (50-500 Engineers)
&lt;/h3&gt;

&lt;p&gt;This is the fastest-growing adoption segment. Companies in this range typically have enough engineering capacity to justify a small platform team (2-5 engineers) dedicated to running Kubernetes. Managed services and platform-as-a-service layers like Humanitec, Upbound, or internal Backstage portals help make Kubernetes accessible to the broader engineering organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enterprise (500+ Engineers)
&lt;/h3&gt;

&lt;p&gt;Large enterprises overwhelmingly run Kubernetes, often across multiple cloud providers and on-premises data centers. Multi-cluster management, federation, and governance at scale are the primary challenges. These organizations typically run dedicated platform engineering organizations (not just teams) with 10-50+ engineers focused on Kubernetes infrastructure. At this scale, the focus shifts from "how do we run Kubernetes" to "how do we govern, secure, and provide self-service access to Kubernetes across hundreds of teams." Tools like Rancher, Tanzu, and OpenShift are common in this segment because they provide the multi-cluster management and enterprise governance features that large organizations require.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Patterns and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Across all the companies profiled here, several patterns emerge consistently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Platform engineering is non-negotiable.&lt;/strong&gt; Every successful large-scale Kubernetes deployment has a dedicated platform team that abstracts Kubernetes complexity from application developers. Without this, adoption stalls because developers spend too much time fighting with YAML and cluster configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Managed Kubernetes is the default.&lt;/strong&gt; Even companies with deep infrastructure expertise (Spotify, Reddit, Airbnb) run on managed services like GKE, EKS, or AKS. The operational overhead of running your own control plane is rarely justified.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-cluster is the norm at scale.&lt;/strong&gt; No company running thousands of services uses a single Kubernetes cluster. Multi-cluster strategies provide blast radius isolation, allow independent upgrade schedules, and enable different security boundaries for different workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Migrations are gradual.&lt;/strong&gt; Every company that moved to Kubernetes did so incrementally, running old and new infrastructure in parallel for months or years. Big-bang migrations are rarely successful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Developer experience determines adoption speed.&lt;/strong&gt; Companies that invested in internal developer platforms, service templates, and self-service tooling saw faster adoption. Companies that asked developers to learn raw Kubernetes saw resistance and slow rollouts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security and compliance are solvable.&lt;/strong&gt; Financial services, healthcare, and defense organizations have all proven that Kubernetes can meet strict regulatory requirements. The tools (OPA, network policies, RBAC, image signing) exist -- the work is in integrating them into your specific compliance framework.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Kubernetes adoption in 2026 spans virtually every industry and company size. From Spotify's 200 clusters powering music streaming to CERN's on-premises deployment analyzing particle physics data to the US DoD's security-hardened Platform One, Kubernetes has proven adaptable to radically different requirements.&lt;/p&gt;

&lt;p&gt;The common thread across all successful adopters is not the technology itself but the organizational investment around it: platform engineering teams, developer experience tooling, and gradual migration strategies. Kubernetes provides the runtime foundation, but it is the platform built on top of it -- and the team operating it -- that determines success.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/kubernetes-security-linter/" rel="noopener noreferrer"&gt;Kubernetes YAML Security Linter&lt;/a&gt; — paste any K8s manifest and scan for 12 security issues with an A–F grade. Free, browser-based.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Pets vs Cattle DevOps: The Security Risk You Inherit</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:14:54 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/pets-vs-cattle-devops-the-security-risk-you-inherit-25pn</link>
      <guid>https://dev.to/matheus_releaserun/pets-vs-cattle-devops-the-security-risk-you-inherit-25pn</guid>
      <description>&lt;h1&gt;
  
  
  Pets vs Cattle DevOps: The Security Risk You Inherit
&lt;/h1&gt;

&lt;p&gt;No CVEs patched. Your attack surface still changes.&lt;/p&gt;

&lt;p&gt;I have watched teams “modernize” from pet VMs to cattle and accidentally make audits harder and breaches faster. If you do not treat pets vs cattle as a security classification, you will ship unauditable infrastructure and you will not notice until an incident, or a regulator, forces you to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security impact first: what changes when you move to “cattle”
&lt;/h2&gt;

&lt;p&gt;Patch this before your next standup. Not with a hotfix, with controls.&lt;/p&gt;

&lt;p&gt;Pets fail in slow motion. Cattle fail at scale. If you run cattle without guardrails, a single bad image, a poisoned Terraform module, or a compromised GitOps repo can roll out to 400 nodes before you finish your coffee. In practice, though, the everyday risk is misconfiguration and supply-chain drift, not a Hollywood zero-day.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If you keep pets:&lt;/strong&gt; Long-lived SSH keys and config drift hang around for years. An attacker who lands once can come back later and still find the same foothold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you move to cattle:&lt;/strong&gt; You reduce drift, but you increase blast radius. One promoted image becomes tomorrow’s fleet baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you do nothing:&lt;/strong&gt; You keep “snowflake” servers that miss patches, and you also inherit new cloud-native failure modes like leaked service account tokens and over-permissive IAM.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If you cannot prove what ran, who changed it, and when it changed, you do not have “cattle.” You have pets with better marketing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Breaking operational changes (these cause outages and audit findings)
&lt;/h2&gt;

&lt;p&gt;Some folks skip canaries for “just infrastructure” changes. I do not.&lt;/p&gt;

&lt;p&gt;The thing nobody mentions is that pets vs cattle breaks your incident response muscle memory. Your old runbook said “SSH to db-primary.” Your new world says “the pod died, and the controller replaced it.” If you do not build an evidence trail and a break-glass path, you will lose time during containment and you will lose artifacts during forensics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logging changes:&lt;/strong&gt; SSH session logs disappear when you stop SSH-ing. You must replace them with Kubernetes audit logs, Git provider audit logs, CI logs, and centralized application logs with retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access changes:&lt;/strong&gt; “No one SSHs into anything” sounds clean. In practice you still need privileged access for nodes, storage, and rare outages. Define who can do it, how you record it, and how you revoke it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful workloads:&lt;/strong&gt; Treating a database like cattle can delete data. No migration playbook specifies how your org should do backups. Your SRE team still owns that risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What “pets” look like in a security review
&lt;/h2&gt;

&lt;p&gt;Pets keep secrets warm.&lt;/p&gt;

&lt;p&gt;A pet server usually carries a private key in /home, a forgotten debug binary, and a firewall rule nobody can explain. I have seen a “temporary” SSH exception live for 14 months because “nobody wants to touch prod.” If you do not upgrade your operating model, you will keep paying for that fear in outages and incident dwell time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Typical findings:&lt;/strong&gt; Untracked local changes, inconsistent patch levels, shared admin accounts, and backups that exist but never restore cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threat scenario if you do not change:&lt;/strong&gt; An attacker pivots through one unpatched pet, drops a persistent user, and waits for a quiet weekend. You only notice after data exfil shows up in DNS logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What “cattle” look like when you do it safely
&lt;/h2&gt;

&lt;p&gt;Cattle need fences.&lt;/p&gt;

&lt;p&gt;Teams love to say “immutable infrastructure” and then run unsigned images from random registries. That bit me once in a staging cluster. A developer “temporarily” used :latest, the build pulled a new dependency, and we spent half a day chasing behavior that never reproduced locally. In production, that same pattern becomes a supply-chain incident.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum bar for cattle:&lt;/strong&gt; Rebuild images on a schedule, scan them in CI, generate an SBOM, and sign the artifact before promotion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps control:&lt;/strong&gt; Treat Git as production. Lock branches, require reviews, and alert on changes to cluster-admin RBAC and network policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime control:&lt;/strong&gt; Enforce non-root containers, drop capabilities, and block privileged pods unless you can defend the exception in writing.&lt;/li&gt;
&lt;/ul&gt;
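
&lt;p&gt;The "build once, promote the same artifact" rule is checkable in CI. Here is a minimal Python sketch of one piece of it: refusing mutable image references at review time. The function and its rules are illustrative, not a standard tool:&lt;/p&gt;

```python
import re

# A digest-pinned reference ends in "@sha256:" plus 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def check_image_ref(ref):
    """Classify a container image reference for a CI policy gate."""
    if DIGEST_RE.search(ref):
        return "pinned"          # immutable: the same bytes on every pull
    last = ref.rsplit("/", 1)[-1]
    if ref.endswith(":latest") or ":" not in last:
        return "mutable-latest"  # floats silently with every rebuild
    return "mutable-tag"         # a tag someone can repoint tomorrow

refs = [
    "registry.example.com/api@sha256:" + "ab" * 32,
    "registry.example.com/api:latest",
    "registry.example.com/api",       # no tag at all also means :latest
    "registry.example.com/api:v1.4.2",
]
print([check_image_ref(r) for r in refs])
# ['pinned', 'mutable-latest', 'mutable-latest', 'mutable-tag']
```

&lt;p&gt;In a real pipeline you would enforce this with an admission controller or policy engine rather than a bespoke script; the point is that "cattle" discipline is a checkable property, not a vibe.&lt;/p&gt;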

&lt;h2&gt;
  
  
  Stateful workloads: keep the “pet” behavior, automate the handling
&lt;/h2&gt;

&lt;p&gt;Databases do not forgive you.&lt;/p&gt;

&lt;p&gt;A PostgreSQL primary still needs a stable identity, durable storage, and careful failover. Kubernetes StatefulSets help, but they do not remove your need for tested restores and clear RTO/RPO targets. If you pretend state is disposable, you will eventually test your backups during an outage. That is the worst time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use StatefulSets for stateful systems:&lt;/strong&gt; Stable names, stable volumes, ordered rollout. This reduces chaos, it does not eliminate risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threat scenario if you misclassify state:&lt;/strong&gt; A “self-healing” controller recreates a pod, attaches the wrong volume, and you corrupt data during recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Migration checklist (security gates, not just steps)
&lt;/h2&gt;

&lt;p&gt;Move in slices.&lt;/p&gt;

&lt;p&gt;Start with workloads that can tolerate replacement, like stateless APIs and CI runners. Then work toward the ugly stuff. For each step, set a gate you can measure and audit, otherwise the project becomes vibes-based engineering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inventory and classify:&lt;/strong&gt; Record what runs where, what data it touches, and what compliance regime applies. If you cannot classify it, you cannot secure it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Externalize state and secrets:&lt;/strong&gt; Move data off hosts. Move secrets into a managed system. Rotate anything that used to live on a pet box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codify and review:&lt;/strong&gt; Put Terraform, Helm, and policies under pull request review. Capture approvals as evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build immutable artifacts:&lt;/strong&gt; Build once. Promote the same artifact. Do not patch live nodes by hand unless you execute a documented break-glass procedure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practice destruction:&lt;/strong&gt; Kill instances in staging on purpose. If the system cannot recover without a human, you still run pets.&lt;/li&gt;
&lt;/ul&gt;
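
&lt;p&gt;The inventory step deserves a concrete gate. A toy sketch (field names are illustrative): any workload without a data classification blocks its own migration, which is exactly the forcing function you want:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    data_class: str   # e.g. "public", "internal", "pii"; empty means unknown
    stateful: bool

def migration_blockers(inventory):
    """If you cannot classify it, you cannot secure it -- so flag it."""
    return [w.name for w in inventory if not w.data_class]

inventory = [
    Workload("checkout-api", "pii", stateful=False),
    Workload("legacy-cron-box", "", stateful=True),  # nobody knows what it touches
    Workload("image-resizer", "public", stateful=False),
]
print(migration_blockers(inventory))  # ['legacy-cron-box']
```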

&lt;h2&gt;
  
  
  Everything else you should know (quick and slightly unfinished)
&lt;/h2&gt;

&lt;p&gt;History matters less than evidence.&lt;/p&gt;

&lt;p&gt;Yes, the metaphor goes back to early 2010s talks and blog posts, and people still argue who said it first. I care more about whether you can produce a change log, an artifact signature, and an audit trail on demand. Other stuff you will run into: autoscaler cooldowns, weird storage edge cases, dependency pinning, the usual.&lt;/p&gt;

&lt;p&gt;If you do not upgrade your operating model, you will keep shipping servers you cannot recreate, cannot attest, and cannot explain under pressure. Attackers love that kind of environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/iac-security-in-2026-terraform-checkov-and-cloud-drift-detection/" rel="noopener noreferrer"&gt;IaC Security in 2026: Terraform, Checkov, and Cloud Drift Detection&lt;/a&gt; -- The tooling that makes cattle-style infrastructure actually secure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/container-image-scanning-in-2026-clair-trivy-and-grype-compared/" rel="noopener noreferrer"&gt;Container Image Scanning in 2026&lt;/a&gt; -- Verify your cattle images before they stampede into production&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-operators-explained-what-they-are-how-they-work-and-how-to-build-one/" rel="noopener noreferrer"&gt;Kubernetes Operators Explained&lt;/a&gt; -- Automating the operational logic that cattle-mode demands&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/container-escape-vulnerabilities-the-cves-that-shaped-docker-and-kubernetes-security/" rel="noopener noreferrer"&gt;Container Escape Vulnerabilities&lt;/a&gt; -- The security risks that cattle infrastructure must mitigate&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://releaserun.com/kubernetes-upgrade-checklist/" rel="noopener noreferrer"&gt;Kubernetes Upgrade Checklist&lt;/a&gt; -- The structured process for updating your herd safely&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/header-analyzer/" rel="noopener noreferrer"&gt;HTTP Security Headers Analyzer&lt;/a&gt; — scan any URL for missing or misconfigured security headers and get an A–F grade. Free.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>kubernetes</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Container Escape Vulnerabilities: The CVEs That Shaped Docker and Kubernetes Security</title>
      <dc:creator>Matheus</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:00:11 +0000</pubDate>
      <link>https://dev.to/matheus_releaserun/container-escape-vulnerabilities-the-cves-that-shaped-docker-and-kubernetes-security-41hk</link>
      <guid>https://dev.to/matheus_releaserun/container-escape-vulnerabilities-the-cves-that-shaped-docker-and-kubernetes-security-41hk</guid>
      <description>&lt;h2&gt;
  
  
  Why Container Escapes Matter
&lt;/h2&gt;

&lt;p&gt;Containers are not virtual machines. A virtual machine runs its own kernel on emulated hardware, creating a strong isolation boundary. A container shares the host kernel with every other container on the system -- isolation comes from Linux kernel features (namespaces, cgroups, capabilities, seccomp filters), not from a hardware-enforced boundary.&lt;/p&gt;

&lt;p&gt;When an attacker escapes a container, they break through those kernel-level abstractions and gain access to the host. From there, they can reach every other container on that node, access mounted secrets and credentials, and pivot deeper into the cluster. In a Kubernetes or Docker production environment, a single container escape can compromise an entire node and, in the worst case, the entire cluster.&lt;/p&gt;

&lt;p&gt;This article covers the most significant container escape CVEs from 2017 through 2024: how each exploit worked, what made it possible, and how the ecosystem responded. The same classes of bugs keep resurfacing, and the defensive patterns developed in response form the foundation of modern container security.&lt;/p&gt;

&lt;h2&gt;
  
  
  CVE-2017-5123: The waitid Kernel Exploit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;In October 2017, a vulnerability was discovered in the Linux kernel's &lt;code&gt;waitid()&lt;/code&gt; system call. During a refactor of the waitid code in kernel version 4.13, a critical check was accidentally removed: the &lt;code&gt;access_ok()&lt;/code&gt; call that validates whether a user-supplied pointer actually points to user-space memory. Without this check, an unprivileged process could pass a pointer to kernel memory, and the kernel would happily write data to that location.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Exploit Worked
&lt;/h3&gt;

&lt;p&gt;The bug allowed an attacker to write partially controlled data to an arbitrary kernel memory address. While the attacker could not fully control the content being written -- the kernel wrote a &lt;code&gt;siginfo_t&lt;/code&gt; structure with fields determined by process state -- careful manipulation of which process was being waited on gave enough control to be dangerous.&lt;/p&gt;

&lt;p&gt;The container escape leveraged this kernel write primitive to modify the calling process's capability structure in kernel memory. Docker containers run with a restricted set of Linux capabilities, which is one of the primary mechanisms preventing containerized processes from performing privileged operations on the host. By overwriting the capability bitmask, the attacker could grant themselves &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; and &lt;code&gt;CAP_NET_ADMIN&lt;/code&gt; -- effectively breaking out of the container's capability restrictions and gaining host-level privileges.&lt;/p&gt;
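
&lt;p&gt;Those capabilities are literally bits in a mask inside the kernel's credential structure -- the same value you can read back as &lt;code&gt;CapEff&lt;/code&gt; in &lt;code&gt;/proc/self/status&lt;/code&gt;. A short Python sketch of the arithmetic, using the constants from &lt;code&gt;linux/capability.h&lt;/code&gt;:&lt;/p&gt;

```python
# Capability numbers from linux/capability.h.
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21

def cap_mask(*caps):
    """Build an effective-capability bitmask like the kernel's CapEff field."""
    mask = 0
    for cap in caps:
        mask |= 2 ** cap   # bit N set means capability N is held
    return mask

# A restricted container process holds neither bit; the waitid() write
# primitive flipped them on directly in kernel memory.
escalated = cap_mask(CAP_NET_ADMIN, CAP_SYS_ADMIN)
print(hex(escalated))  # 0x201000
```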

&lt;h3&gt;
  
  
  Impact and Fix
&lt;/h3&gt;

&lt;p&gt;This vulnerability affected Linux kernel 4.13 through 4.14.0-rc4. The fix was straightforward: re-adding the &lt;code&gt;access_ok()&lt;/code&gt; check to validate that the user-provided pointer targets user-space memory. The bug was introduced on May 21, 2017 and patched on October 9, 2017.&lt;/p&gt;

&lt;p&gt;CVE-2017-5123 demonstrated something fundamental: containers share the host kernel, and a kernel vulnerability is a container escape vulnerability. No amount of namespace isolation matters if the kernel itself can be tricked into overwriting its own security data structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  CVE-2019-5736: The runc Overwrite
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;Disclosed on February 11, 2019, CVE-2019-5736 was arguably the most impactful container escape vulnerability ever published. It affected &lt;strong&gt;runc&lt;/strong&gt;, the low-level container runtime used by Docker, containerd, CRI-O, and essentially every OCI-compliant container platform. The vulnerability allowed a malicious process inside a container to overwrite the host's runc binary, gaining root-level code execution on the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Exploit Worked
&lt;/h3&gt;

&lt;p&gt;The exploit took advantage of how Linux handles &lt;code&gt;/proc/self/exe&lt;/code&gt;. This special file is a symbolic link that points to the binary of the currently running process. When runc executes a command inside a container (via &lt;code&gt;docker exec&lt;/code&gt; or similar), there is a brief window where the container's process can access the runc binary through &lt;code&gt;/proc/self/exe&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The attack worked in two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set the trap.&lt;/strong&gt; The attacker replaces the container's &lt;code&gt;/bin/sh&lt;/code&gt; (or another entrypoint binary) with a script containing &lt;code&gt;#!/proc/self/exe&lt;/code&gt;. This tells the kernel to execute the binary that &lt;code&gt;/proc/self/exe&lt;/code&gt; points to -- which, during a &lt;code&gt;docker exec&lt;/code&gt; call, is the host's runc binary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overwrite runc.&lt;/strong&gt; When runc enters the container and the tampered entrypoint executes, the process gets a file handle to the host's runc binary via &lt;code&gt;/proc/self/exe&lt;/code&gt;. The attacker then writes a malicious payload to this file handle, overwriting the host's runc binary with attacker-controlled code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The next time any container operation invokes runc on that host -- starting a container, running exec, or even performing a health check -- the attacker's payload executes with root privileges on the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact and Fix
&lt;/h3&gt;

&lt;p&gt;The severity was enormous. The exploit required only UID 0 inside the container (which is the default for most container images) and worked with default Docker configurations. No special privileges, no host mounts, no unusual capabilities. It affected Docker, Kubernetes, and any platform using runc versions prior to 1.0-rc6.&lt;/p&gt;

&lt;p&gt;The fix changed runc's behavior so that it creates a copy of itself as a sealed, read-only file descriptor (using &lt;code&gt;memfd_create&lt;/code&gt; with &lt;code&gt;F_SEAL&lt;/code&gt; flags) before entering the container. When the malicious process attempts to write to &lt;code&gt;/proc/self/exe&lt;/code&gt;, the kernel blocks the write because the file descriptor is sealed.&lt;/p&gt;

&lt;h2&gt;
  
  
  CVE-2019-1002101: kubectl cp Directory Traversal
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;While most container escape CVEs involve breaking out of a running container, CVE-2019-1002101 took a different approach: it targeted the operator's workstation. This vulnerability allowed a malicious container to write arbitrary files to the machine of any Kubernetes user who ran &lt;code&gt;kubectl cp&lt;/code&gt; to copy files from that container.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Exploit Worked
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kubectl cp&lt;/code&gt; command works by creating a tar archive inside the target container, streaming it over the network to the user's machine, and extracting it locally. The vulnerability was a classic directory traversal: the tar archive created inside the container could include file paths containing &lt;code&gt;../&lt;/code&gt; sequences, and kubectl did not sanitize these paths before extraction.&lt;/p&gt;

&lt;p&gt;If an attacker controlled the tar binary inside a container, they could craft filenames like &lt;code&gt;../../../etc/cron.d/backdoor&lt;/code&gt;. When the unsuspecting operator ran &lt;code&gt;kubectl cp mypod:/data ./local-dir&lt;/code&gt;, the malicious tar entries would be extracted outside the intended destination directory, writing files anywhere the user had permissions.&lt;/p&gt;
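
&lt;p&gt;You can reproduce the unsafe half of this with the standard library: a tar member whose name climbs out of the destination directory. The containment check below is the kind of validation the fix added (the helper is a sketch, not kubectl's actual code):&lt;/p&gt;

```python
import io
import os
import tarfile

def stays_within(dest, member_name):
    """Reject tar member names that resolve outside the extraction dir."""
    dest = os.path.abspath(dest)
    target = os.path.abspath(os.path.join(dest, member_name))
    return os.path.commonpath([dest, target]) == dest

# Build the archive a malicious in-container tar binary could emit.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"* * * * * root /tmp/implant\n"
    info = tarfile.TarInfo(name="../../../etc/cron.d/backdoor")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    for member in tar.getmembers():
        verdict = "extract" if stays_within("local-dir", member.name) else "REJECT"
        print(member.name, verdict)  # ../../../etc/cron.d/backdoor REJECT
```

&lt;p&gt;Since Python 3.12, &lt;code&gt;tarfile&lt;/code&gt;'s extraction filters (&lt;code&gt;filter="data"&lt;/code&gt;) perform this class of check for you -- the same lesson kubectl learned in 2019.&lt;/p&gt;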

&lt;h3&gt;
  
  
  Impact and Fix
&lt;/h3&gt;

&lt;p&gt;The fix added path validation to reject directory traversal sequences during tar extraction. The initial fix was incomplete -- follow-up CVEs (CVE-2019-11246 and CVE-2019-11249) addressed bypass techniques, highlighting how tricky path sanitization can be.&lt;/p&gt;

&lt;p&gt;This vulnerability is a reminder that the attack surface of a Kubernetes environment extends beyond the cluster. Operator tools, CI/CD pipelines, and client-side utilities are all part of the security perimeter.&lt;/p&gt;

&lt;h2&gt;
  
  
  CVE-2020-15257: containerd Host Network Escape
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;In November 2020, NCC Group disclosed CVE-2020-15257, a vulnerability in containerd that allowed containers running with host network access to escape to the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Exploit Worked
&lt;/h3&gt;

&lt;p&gt;containerd uses a component called &lt;strong&gt;containerd-shim&lt;/strong&gt;, which runs as a parent process for each container. The shim exposes an API over an abstract namespace Unix domain socket. The critical flaw was that this socket was accessible from the host's network namespace.&lt;/p&gt;

&lt;p&gt;When a container was configured with &lt;code&gt;--net=host&lt;/code&gt; (sharing the host's network namespace), a root process inside that container could connect to the containerd-shim's abstract Unix socket. From there, the attacker could use the shim API to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read and write files on the host filesystem.&lt;/li&gt;
&lt;li&gt;Execute commands on the host as root.&lt;/li&gt;
&lt;li&gt;Spin up new, fully privileged containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The attack required two conditions: the container had to be running with host networking, and the process inside had to be running as UID 0.&lt;/p&gt;
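
&lt;p&gt;Abstract-namespace sockets are easy to poke at from Python on Linux. Note what is missing compared with a file-based socket: there is no node on disk, so there are no file permissions to enforce (the socket names below are illustrative):&lt;/p&gt;

```python
import os
import socket
import tempfile

# Abstract namespace: the leading NUL byte means "no filesystem node".
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind("\0containerd-shim-demo")
print(srv.getsockname())  # b'\x00containerd-shim-demo'
# Nothing to chmod: any root process sharing the network namespace
# (e.g. a --net=host container) can simply connect by name.

# A file-based socket, by contrast, lives on disk and honors mode bits.
path = os.path.join(tempfile.mkdtemp(), "shim.sock")
fs_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
fs_sock.bind(path)
os.chmod(path, 0o600)        # now only the owner may connect
print(os.path.exists(path))  # True

fs_sock.close()
os.unlink(path)
srv.close()
```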

&lt;h3&gt;
  
  
  Impact and Fix
&lt;/h3&gt;

&lt;p&gt;The fix switched the shim API from abstract Unix sockets to file-based Unix sockets under &lt;code&gt;/run/containerd&lt;/code&gt;, which respect filesystem permissions and namespace boundaries. One important caveat: containers started before the upgrade kept their old abstract-socket shims, so they had to be restarted for the fix to take effect.&lt;/p&gt;

&lt;p&gt;CVE-2020-15257 reinforced a well-known principle: &lt;strong&gt;do not use host networking unless absolutely necessary.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CVE-2024-21626: Leaky Vessels
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Happened
&lt;/h3&gt;

&lt;p&gt;In January 2024, Snyk researchers disclosed a set of vulnerabilities collectively named "Leaky Vessels," with CVE-2024-21626 being the most severe. This was another runc vulnerability -- five years after CVE-2019-5736. It carried a CVSS score of 8.6.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Exploit Worked
&lt;/h3&gt;

&lt;p&gt;The vulnerability stemmed from an internal file descriptor leak in runc. When runc set up a new container, it inadvertently leaked file descriptors that pointed to the host filesystem.&lt;/p&gt;

&lt;p&gt;Two primary attack vectors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Malicious container image.&lt;/strong&gt; A Dockerfile with a &lt;code&gt;WORKDIR&lt;/code&gt; directive set to a path like &lt;code&gt;/proc/self/fd/[leaked_fd]&lt;/code&gt; could cause the container process to start with its working directory pointing to a host filesystem location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crafted exec command.&lt;/strong&gt; An attacker with the ability to run &lt;code&gt;runc exec&lt;/code&gt; could specify a working directory that referenced the leaked file descriptor.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What made this especially concerning was the image-based attack vector. Unlike CVE-2019-5736, which required an attacker to already have code execution inside a container, CVE-2024-21626 could be triggered simply by building or running a malicious image pulled from a registry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact and Fix
&lt;/h3&gt;

&lt;p&gt;The fix in runc 1.1.12 ensured that all internal file descriptors are properly closed before the container process starts. The disclosure also included three other CVEs affecting Docker's BuildKit component, demonstrating that the container build pipeline -- not just runtime -- is a significant attack surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Notable Container Escape Vulnerabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dirty COW (CVE-2016-5195)
&lt;/h3&gt;

&lt;p&gt;A race condition in the Linux kernel's memory subsystem, present for nine years before discovery in October 2016. The vulnerability allowed an unprivileged process to write to read-only memory mappings. Researchers demonstrated container escape techniques using the &lt;strong&gt;vDSO&lt;/strong&gt; (virtual Dynamic Shared Object) to inject shellcode that would execute in the context of any process on the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  systemd-journald Exploits (CVE-2018-16865 and CVE-2018-16866)
&lt;/h3&gt;

&lt;p&gt;Vulnerabilities in systemd-journald that, chained together, allowed a local attacker to obtain a root shell. Since journald runs as root and accepts log messages from containers, this created a path from containerized process to host root access through the logging infrastructure.&lt;/p&gt;

&lt;p&gt;These bugs highlighted the risk of host services that accept input from containers. Any host daemon that processes container-generated data is a potential escape vector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns Across Container Escape CVEs
&lt;/h2&gt;

&lt;p&gt;Several recurring patterns emerge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared kernel, shared fate.&lt;/strong&gt; CVE-2017-5123 and Dirty COW exploited kernel bugs that no amount of namespace isolation can defend against. This is the fundamental architectural limitation of containers versus virtual machines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File descriptor and /proc leaks.&lt;/strong&gt; CVE-2019-5736 and CVE-2024-21626 both exploited how runc handles file descriptors and &lt;code&gt;/proc&lt;/code&gt; entries during container setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host services extend the attack surface.&lt;/strong&gt; CVE-2020-15257 and the systemd-journald exploits show that any host service that accepts container input is a potential escape path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client tools matter too.&lt;/strong&gt; CVE-2019-1002101 weaponized kubectl to compromise operator workstations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Modern Defenses Against Container Escapes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Seccomp Profiles
&lt;/h3&gt;

&lt;p&gt;Seccomp restricts which system calls a containerized process can make. Docker's default profile blocks approximately 44 of the 300+ available system calls. Custom profiles tailored to your application's actual system call usage offer stronger protection.&lt;/p&gt;
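
&lt;p&gt;In Kubernetes, a profile is applied through the pod's &lt;code&gt;securityContext&lt;/code&gt;. A sketch (the pod name, image, and &lt;code&gt;localhostProfile&lt;/code&gt; path are placeholders to adapt):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo          # hypothetical pod name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault    # the runtime's default profile
      # For a custom profile tailored to your app, use instead:
      # type: Localhost
      # localhostProfile: profiles/myapp.json  # relative to the kubelet seccomp dir
  containers:
  - name: app
    image: nginx:1.27         # example image
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;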

&lt;h3&gt;
  
  
  AppArmor and SELinux
&lt;/h3&gt;

&lt;p&gt;Mandatory Access Control (MAC) systems add restrictions beyond standard Linux permissions. &lt;strong&gt;SELinux&lt;/strong&gt; in enforcing mode mitigated CVE-2019-5736 by blocking writes to the host's runc binary. &lt;strong&gt;AppArmor&lt;/strong&gt; provides path-based controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rootless Containers and User Namespaces
&lt;/h3&gt;

&lt;p&gt;Many container escape exploits require UID 0 inside the container. Rootless containers address this by running the entire container runtime as an unprivileged user, using &lt;strong&gt;user namespaces&lt;/strong&gt; to remap UID 0 inside the container to an unprivileged UID on the host.&lt;/p&gt;

&lt;p&gt;With rootless mode, even a successful escape lands the attacker on the host as an unprivileged user. Docker supports rootless mode natively (since 20.10), Podman runs rootless by default, and Kubernetes user namespaces for pods reached beta in version 1.30.&lt;/p&gt;
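
&lt;p&gt;On the Kubernetes side, the beta pod-level switch is &lt;code&gt;hostUsers: false&lt;/code&gt;, which asks the kubelet to run the pod in a fresh user namespace (pod name and image below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: userns-demo        # hypothetical pod name
spec:
  hostUsers: false         # UID 0 in the pod maps to an unprivileged host UID
  containers:
  - name: app
    image: nginx:1.27      # example image
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;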

&lt;h3&gt;
  
  
  Read-Only Root Filesystems
&lt;/h3&gt;

&lt;p&gt;Running containers with read-only root filesystems (&lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt;) prevents a compromised container from modifying its own filesystem, directly mitigating exploits like CVE-2019-5736.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime Security: Falco and Tetragon
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Falco&lt;/strong&gt;, a CNCF graduated project, monitors system calls and container events against a rule engine. &lt;strong&gt;Tetragon&lt;/strong&gt;, from the Cilium project, uses eBPF to enforce security policies directly in the kernel, with performance overhead the project reports at under 1%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod Security Standards
&lt;/h3&gt;

&lt;p&gt;Kubernetes Pod Security Standards define three profiles -- &lt;strong&gt;Privileged&lt;/strong&gt;, &lt;strong&gt;Baseline&lt;/strong&gt;, and &lt;strong&gt;Restricted&lt;/strong&gt;. The Restricted profile enforces non-root execution, drops all capabilities, disables privilege escalation, and requires a RuntimeDefault or Localhost seccomp profile.&lt;/p&gt;
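
&lt;p&gt;Profiles are enforced per namespace through labels read by the built-in Pod Security admission controller. For example, to reject any pod that fails the Restricted profile (namespace name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: prod-apps            # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Optionally surface violations without blocking:
    pod-security.kubernetes.io/warn: restricted
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;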

&lt;h3&gt;
  
  
  Image Scanning and Supply Chain Security
&lt;/h3&gt;

&lt;p&gt;Image scanning tools (Trivy, Grype, Snyk Container) detect known vulnerable packages, image signing with Sigstore/cosign provides provenance verification, and admission controllers can enforce that only signed, scanned images are deployed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container Escape Prevention Checklist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Runtime Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run containers as non-root.&lt;/strong&gt; Set &lt;code&gt;runAsNonRoot: true&lt;/code&gt; and specify a &lt;code&gt;runAsUser&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop all capabilities, add only what is needed.&lt;/strong&gt; Use &lt;code&gt;drop: ["ALL"]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disable privilege escalation.&lt;/strong&gt; Set &lt;code&gt;allowPrivilegeEscalation: false&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use read-only root filesystems.&lt;/strong&gt; Set &lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid host namespaces.&lt;/strong&gt; Do not use &lt;code&gt;hostNetwork&lt;/code&gt;, &lt;code&gt;hostPID&lt;/code&gt;, or &lt;code&gt;hostIPC&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Never run privileged containers in production.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
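
&lt;p&gt;The items above map directly onto a container's &lt;code&gt;securityContext&lt;/code&gt;; a sketch with placeholder names and an arbitrary unprivileged UID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                   # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx:1.27                  # example image
    securityContext:
      runAsNonRoot: true               # refuse to start as UID 0
      runAsUser: 10001                 # any unprivileged UID
      allowPrivilegeEscalation: false  # no setuid/file-capability gains
      readOnlyRootFilesystem: true     # container cannot modify its own image
      capabilities:
        drop: ["ALL"]                  # add back only what the app needs
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;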

&lt;h3&gt;
  
  
  Infrastructure and Patching
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep the host kernel updated.&lt;/strong&gt; Kernel vulnerabilities bypass all container isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Patch container runtimes promptly.&lt;/strong&gt; runc, containerd, and CRI-O vulnerabilities are direct escape vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update client tools.&lt;/strong&gt; kubectl and other client-side tools are part of the attack surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable user namespaces.&lt;/strong&gt; Ensure UID 0 inside containers maps to an unprivileged host UID.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Detection and Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy runtime security tooling.&lt;/strong&gt; Use Falco, Tetragon, or similar tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply seccomp profiles.&lt;/strong&gt; Start with defaults and customize based on your application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable audit logging.&lt;/strong&gt; Kubernetes audit logs, container runtime logs, and host-level audit provide forensic trails.&lt;/li&gt;
&lt;/ul&gt;
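
&lt;p&gt;To give a flavor of runtime rules, here is a simplified version of Falco's stock "shell in container" detection -- treat it as a sketch, not a drop-in policy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- rule: Terminal shell in container
  desc: A shell was spawned inside a container (simplified sketch)
  condition: &gt;
    spawned_process and container and proc.name in (bash, sh, zsh)
  output: &gt;
    Shell spawned in container (user=%user.name
    container=%container.name command=%proc.cmdline)
  priority: WARNING
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;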

&lt;h3&gt;
  
  
  Supply Chain
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scan images for known CVEs.&lt;/strong&gt; Run vulnerability scanners in your CI/CD pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use minimal base images.&lt;/strong&gt; Smaller images have fewer potential vulnerabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sign and verify images.&lt;/strong&gt; Use cosign/Sigstore for image signing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin image digests.&lt;/strong&gt; Reference images by digest rather than mutable tags.&lt;/li&gt;
&lt;/ul&gt;
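
&lt;p&gt;In CI, the scanning and signing steps can be chained so unscanned images never ship. A hedged GitHub Actions sketch -- the image name and action versions are assumptions to adapt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  scan-and-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan image for known CVEs
        uses: aquasecurity/trivy-action@0.28.0
        with:
          image-ref: registry.example.com/myapp:latest  # placeholder image
          exit-code: '1'                # fail the build on findings
          severity: CRITICAL,HIGH
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign the image
        run: cosign sign --yes registry.example.com/myapp:latest  # keyless signing
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;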

&lt;h2&gt;
  
  
  The Future of Container Isolation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sandbox runtimes&lt;/strong&gt; like gVisor and Kata Containers add stronger isolation boundaries. &lt;strong&gt;eBPF-based security enforcement&lt;/strong&gt; is maturing rapidly. &lt;strong&gt;Confidential computing&lt;/strong&gt; (AMD SEV, Intel TDX) is bringing hardware-level isolation to container workloads by running them inside memory-encrypted virtual machines.&lt;/p&gt;

&lt;p&gt;For most teams today, defense in depth -- rootless containers, seccomp profiles, MAC policies, runtime security tools, and diligent patching -- provides strong protection. No single mechanism is a silver bullet, but the combination makes exploitation significantly harder and detection significantly faster.&lt;/p&gt;

&lt;p&gt;Container escapes are not theoretical. They have been discovered repeatedly in the most critical infrastructure components, from the Linux kernel to runc to containerd to kubectl. The organizations that avoid becoming case studies are the ones that treat these vulnerabilities as inevitable, and build their defenses accordingly.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🔍 Related tool:&lt;/strong&gt; &lt;a href="https://releaserun.com/tools/kubernetes-security-linter/" rel="noopener noreferrer"&gt;Kubernetes YAML Security Linter&lt;/a&gt; — paste any K8s manifest and scan for 12 security issues with an A–F grade. Free, browser-based.&lt;/p&gt;

</description>
      <category>security</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>containers</category>
    </item>
  </channel>
</rss>
