Pavan Madduri

Posted on Jun 26

Zero-Downtime Crossplane v1 → v2 Migration: Adopt-in-Place at Production Scale

#crossplane #kubernetes #aws #platformengineering

Crossplane v2 (released in late 2025) introduced a cleaner, namespaced resource model and removed a lot of the v1 ceremony around Claims and cluster-scoped composites. Upgrading the control plane to v2 is usually painless — if you're not using the v1 features that changed, your existing claims keep working thanks to backward compatibility.

The hard part is the next step: migrating your existing v1-style workloads onto v2-style namespaced resources. That's where there's still no cohesive, end-to-end story — and it's where I spent most of my effort taking a production EKS fleet all the way through.

This post is the field guide I wish I'd had: the adopt-in-place method, how to validate it before touching anything, and the three failure modes that will bite you in production.

Nothing here destroys or recreates cloud infrastructure. The whole point is to keep every existing AWS resource exactly where it is and just change which Crossplane resource owns it.

The setup (in generic terms)

A control plane running Crossplane, managing Amazon EKS clusters end to end.
Each cluster is represented by a Crossplane composite, which in turn owns ~90–100 managed resources (MRs): IAM roles/policies, the EKS cluster, EKS add-ons, a managed NodeGroup, a launch template, security groups and rules, an OIDC provider, and a pile of Objects managed through provider-kubernetes.
Several distinct cluster archetypes, each backed by its own Composition (think: general workload clusters, ingress/gateway clusters, and stateful clusters). Same migration mechanics, slightly different resource sets.
GitOps-driven: a Git repository is the source of truth, reconciled by a GitOps controller.

Constraints that shaped everything: no resource recreation, no node rotation, zero downtime.

Why "delete and recreate" is a non-starter

The naive migration is: delete the v1 composite, create the v2 XR, let the provider rebuild everything. In production that's a non-starter:

You cannot destroy a VPC, an EKS control plane, or a live NodeGroup and rebuild it under traffic.
Even Crossplane's Observe/import flows leave a window where the resource is briefly unmanaged or re-created.
Anything that recreates a NodeGroup triggers a node rotation — every pod on the cluster gets evicted and rescheduled. That's a customer-visible event you do not want as a side effect of an internal refactor.

So the goal isn't "create v2 resources." It's "make a v2 XR adopt the exact resources the v1 composite already owns, with zero observable change."

How Crossplane decides what to create vs. adopt

Three facts make adopt-in-place possible:

1. External-name is the source of truth for the real cloud resource. Crossplane reconciles a managed resource against the actual AWS object identified by its crossplane.io/external-name annotation. If a v2-owned MR has the same external-name as the live AWS resource, Crossplane observes it instead of creating a new one.
2. Ownership is expressed by a label + an ownerReference. A composed MR points back to its owning composite via:
   - the crossplane.io/composite label, and
   - a Kubernetes ownerReference to the owning XR (name + UID).
3. Within a composition, each MR is keyed by a "composition-resource-name" (crn), stored in the crossplane.io/composition-resource-name annotation. The engine matches the desired resource the composition wants to produce against the observed MR with the same crn. Same crn + same external-name → adopt in place. Different crn → the engine thinks the desired resource is missing and creates a new one (and treats the old one as an orphan).

Adopt-in-place is just: rewrite ownership (#2) and make crn + external-name line up (#1, #3) so the v2 composition's desired output matches what's already there.
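To make those fields concrete, here is a minimal Python sketch that pulls them out of a managed resource's metadata. It assumes the MR has been loaded as a dict (for example from `kubectl get -o json`); the helper name `adoption_identity` and the sample values are hypothetical and only illustrate where each piece lives.

```python
from typing import Any, NamedTuple, Optional


class AdoptionIdentity(NamedTuple):
    crn: Optional[str]            # crossplane.io/composition-resource-name annotation
    external_name: Optional[str]  # crossplane.io/external-name annotation
    composite: Optional[str]      # crossplane.io/composite label
    owner_uid: Optional[str]      # UID of the controlling ownerReference


def adoption_identity(mr: dict[str, Any]) -> AdoptionIdentity:
    """Extract the fields that decide whether an MR is adopted or recreated."""
    meta = mr.get("metadata", {})
    annotations = meta.get("annotations") or {}
    labels = meta.get("labels") or {}
    owners = meta.get("ownerReferences") or []
    # The owning composite is the ownerReference flagged as controller.
    controller = next((o for o in owners if o.get("controller")), None)
    return AdoptionIdentity(
        crn=annotations.get("crossplane.io/composition-resource-name"),
        external_name=annotations.get("crossplane.io/external-name"),
        composite=labels.get("crossplane.io/composite"),
        owner_uid=controller.get("uid") if controller else None,
    )


# Hypothetical MR as it might look before the migration.
example_mr = {
    "metadata": {
        "annotations": {
            "crossplane.io/composition-resource-name": "nodegroup-active",
            "crossplane.io/external-name": "prod-ng",
        },
        "labels": {"crossplane.io/composite": "cluster-v1"},
        "ownerReferences": [
            {"name": "cluster-v1", "uid": "uid-v1", "controller": True}
        ],
    }
}

print(adoption_identity(example_mr))
```

Dumping this tuple for every MR before and after the cutover gives a quick way to confirm that only `composite` and `owner_uid` (and any deliberately remapped `crn`) changed, while `external_name` stayed put.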

The adopt-in-place method, step by step

Step 0 — Snapshot and pre-validate (do this before touching prod)

Before any mutation, capture the live state and prove the v2 composition will adopt rather than recreate. Crossplane's render command can do this offline against observed resources:

crossplane beta render \
  xr.yaml \
  composition.yaml \
  functions.yaml \
  --observed-resources ./observed/

./observed/ holds the live MRs (exported from the cluster). The command prints the desired resources the v2 composition would produce. Diff desired vs. observed and classify every resource:

adopt-in-place — desired crn (after remap, see below) and external-name match an observed MR.
net-new — desired resource the v2 composition adds that v1 didn't have.
orphan — observed MR that the v2 composition doesn't produce.

Gate the migration on: zero orphans, zero unexpected net-new. This single offline check caught every surprise before it reached production.
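The classification and gate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling used for the migration: it assumes the rendered desired resources and the exported observed MRs have already been reduced to `(kind, crn, external_name)` records, that a desired record may carry no external-name at all, and that `remap` is a hand-maintained table from v1 crn to v2 crn. All names here (`Res`, `classify`, `gate`) are hypothetical.

```python
from typing import NamedTuple, Optional


class Res(NamedTuple):
    kind: str
    crn: str
    external_name: Optional[str]


def classify(desired: list[Res], observed: list[Res], remap: dict[str, str]):
    """Return (adopt, net_new, orphan, conflicts), matching on (kind, crn).

    remap maps an observed v1 crn to the crn the v2 composition emits.
    """
    observed_by_key = {(o.kind, remap.get(o.crn, o.crn)): o for o in observed}
    adopt, net_new, conflicts = [], [], []
    for d in desired:
        o = observed_by_key.pop((d.kind, d.crn), None)
        if o is None:
            net_new.append(d)
        elif d.external_name not in (None, o.external_name):
            # Same crn, but the desired resource names a different cloud object.
            conflicts.append(d)
        else:
            adopt.append(d)
    orphan = list(observed_by_key.values())  # observed MRs nothing claimed
    return adopt, net_new, orphan, conflicts


def gate(net_new, orphan, conflicts, expected_new: set[str]) -> bool:
    """Pass only with zero orphans, zero conflicts, zero unexpected net-new."""
    unexpected = [d for d in net_new if d.crn not in expected_new]
    return not orphan and not conflicts and not unexpected


# Hypothetical data: the NodeGroup crn differs between v1 and v2.
observed = [
    Res("NodeGroup", "nodegroup-active", "prod-ng"),
    Res("Role", "cluster-role", "prod-role"),
]
desired = [
    Res("NodeGroup", "nodegroup", None),
    Res("Role", "cluster-role", "prod-role"),
    Res("Addon", "addon-pod-identity", None),
]

adopt, net_new, orphan, conflicts = classify(
    desired, observed, remap={"nodegroup-active": "nodegroup"}
)
print("adopt:", [r.crn for r in adopt])
print("net-new:", [r.crn for r in net_new])
print("orphan:", [r.crn for r in orphan])
print("gate:", gate(net_new, orphan, conflicts, expected_new={"addon-pod-identity"}))
```

Running the same data with an empty `remap` shows why the table matters: the live NodeGroup lands in `orphan` and the desired one in `net_new`, which is exactly the combination the gate is there to reject.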

Step 1 — Pause both the claim and the composite

kubectl annotate <claim-kind> <name> -n <namespace> crossplane.io/paused="true" --overwrite
kubectl annotate <composite-kind> <name> crossplane.io/paused="true" --overwrite

Pausing only the claim is a classic mistake — the composite keeps reconciling and will fight you. Pause both.

Step 2 — Reparent every managed resource

For each MR owned by the v1 composite, repoint ownership at the new v2 XR:

set the crossplane.io/composite label to the v2 XR name,
replace the ownerReference with one pointing at the v2 XR (kind, name, UID).

Conceptually:

metadata:
  labels:
    crossplane.io/composite: <v2-xr-name>     # was: <v1-composite-name>
  ownerReferences:
    - apiVersion: <v2-xr-apiVersion>
      kind: <v2-xr-kind>
      name: <v2-xr-name>
      uid: <v2-xr-uid>                          # the new XR's uid
      controller: true
      blockOwnerDeletion: true

Script this across all ~90–100 MRs; doing it by hand is how you get an inconsistent state.
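One possible shape for that script is to generate a JSON merge patch per MR, plus a rollback patch captured from the MR's current state. The sketch below assumes the v2 XR and each MR are available as dicts (for example from `kubectl get -o json`); the function names and sample objects are hypothetical.

```python
import json
from typing import Any


def reparent_patch(xr: dict[str, Any]) -> dict[str, Any]:
    """Build a JSON merge patch that points an MR at the given XR."""
    meta = xr["metadata"]
    return {
        "metadata": {
            # Objects merge key by key, so other labels are left alone.
            "labels": {"crossplane.io/composite": meta["name"]},
            # Arrays are replaced wholesale, which drops the v1 ownerReference.
            "ownerReferences": [
                {
                    "apiVersion": xr["apiVersion"],
                    "kind": xr["kind"],
                    "name": meta["name"],
                    "uid": meta["uid"],
                    "controller": True,
                    "blockOwnerDeletion": True,
                }
            ],
        }
    }


def rollback_patch(mr: dict[str, Any]) -> dict[str, Any]:
    """Capture the MR's current ownership so it can be restored later."""
    meta = mr["metadata"]
    return {
        "metadata": {
            "labels": {
                "crossplane.io/composite": meta["labels"]["crossplane.io/composite"]
            },
            "ownerReferences": meta.get("ownerReferences", []),
        }
    }


# Hypothetical v2 XR and one v1-owned MR.
v2_xr = {
    "apiVersion": "example.org/v1",
    "kind": "Cluster",
    "metadata": {"name": "cluster-v2", "uid": "uid-v2"},
}
v1_mr = {
    "metadata": {
        "name": "prod-ng",
        "labels": {"crossplane.io/composite": "cluster-v1", "team": "platform"},
        "ownerReferences": [
            {"apiVersion": "example.org/v1alpha1", "kind": "XCluster",
             "name": "cluster-v1", "uid": "uid-v1", "controller": True}
        ],
    }
}

print(json.dumps(reparent_patch(v2_xr)))
print(json.dumps(rollback_patch(v1_mr)))
```

Each printed document can be fed to `kubectl patch <kind> <name> --type=merge -p '<json>'`. Write the rollback patches to disk before applying anything, so that undoing the reparent does not depend on the state you are about to overwrite.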

Step 3 — Point the v2 XR at the adopted resources

Patch the v2 XR so it references the composition and the exact resources it's adopting:

spec:
  crossplane:
    compositionRef:
      name: <v2-composition>
    resourceRefs:
      - { apiVersion: ..., kind: ..., name: <mr-1> }
      - { apiVersion: ..., kind: ..., name: <mr-2> }
      # ... all adopted MRs

Keep the XR paused while you do this (create it paused from the start).
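With ~90–100 entries, the resourceRefs list is worth generating rather than typing. A minimal sketch, assuming the adopted MRs are available as dicts and that the XR patch follows the spec.crossplane layout shown above; `resource_refs` and `xr_patch` are hypothetical helper names.

```python
from typing import Any


def resource_refs(mrs: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Build the resourceRefs entries for the adopted MRs, in a stable order."""
    refs = [
        {
            "apiVersion": mr["apiVersion"],
            "kind": mr["kind"],
            "name": mr["metadata"]["name"],
        }
        for mr in mrs
    ]
    # Sorting keeps the generated patch diffable between runs.
    return sorted(refs, key=lambda r: (r["apiVersion"], r["kind"], r["name"]))


def xr_patch(composition: str, mrs: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge patch that sets compositionRef and resourceRefs on the v2 XR."""
    return {
        "spec": {
            "crossplane": {
                "compositionRef": {"name": composition},
                "resourceRefs": resource_refs(mrs),
            }
        }
    }


# Hypothetical MRs; apiVersions and names are placeholders.
mrs = [
    {"apiVersion": "iam.example.org/v1", "kind": "Role",
     "metadata": {"name": "prod-role"}},
    {"apiVersion": "eks.example.org/v1", "kind": "NodeGroup",
     "metadata": {"name": "prod-ng"}},
]

print(xr_patch("cluster-v2-composition", mrs))
```

Comparing `len(resource_refs(mrs))` against the MR count from the snapshot is a cheap guard against a silently truncated export.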

Step 4 — Unpause and let it converge

kubectl annotate <v2-xr-kind> <name> crossplane.io/paused-

The engine reconciles, matches desired↔observed by crn + external-name, and adopts. Watch the XR and its MRs go Synced=True/Ready=True without any Create calls hitting AWS.

The three failure modes that will bite you

1. NodeGroup composition-resource-name drift (blue/green)

This is the one most likely to cause a real incident.

Our v1 composition emitted the managed NodeGroup with one crn (e.g. a blue/green-style nodegroup-active), while the v2 composition emits a different crn (e.g. nodegroup). Because the engine matches desired↔observed by crn, the mismatch means:

the v2 composition's desired nodegroup has no matching observed MR → it wants to create one, and
the live NodeGroup (crn nodegroup-active) has no matching desired resource → it's treated as an orphan.

The net effect would be a brand-new NodeGroup and a rotation of every node.

Fix: remap the crn annotation on the live NodeGroup to match what the v2 composition expects, and preserve the existing NodeGroup name/external-name. Don't touch external-name — that's what keeps it bound to the real AWS NodeGroup.

kubectl annotate nodegroup.<group> <existing-ng-name> \
  crossplane.io/composition-resource-name=nodegroup --overwrite

Verify no rotation after cutover by confirming the launch-template name+version and the NodeGroup version are unchanged from before:

kubectl get nodegroup.<group> <name> \
  -o jsonpath='LT={.status.atProvider.launchTemplate.name}:v{.status.atProvider.launchTemplate.version} ver={.status.atProvider.version}'

Same values before and after = the existing nodes were adopted, not replaced.
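The before/after comparison is easy to automate once both snapshots are saved as JSON. The sketch below assumes the NodeGroup status uses the same field paths as the jsonpath above; it also tolerates launchTemplate being a single-element list, which is an assumption about older provider API versions rather than something taken from this migration. Function names are hypothetical.

```python
from typing import Any, Optional, Tuple

Fingerprint = Tuple[Optional[str], Optional[str], Optional[str]]
FIELDS = ("launchTemplate.name", "launchTemplate.version", "version")


def _as_str(value: Any) -> Optional[str]:
    return None if value is None else str(value)


def nodegroup_fingerprint(ng: dict[str, Any]) -> Fingerprint:
    """Reduce a NodeGroup MR to the fields that change when nodes rotate."""
    at_provider = ng.get("status", {}).get("atProvider", {})
    lt = at_provider.get("launchTemplate") or {}
    if isinstance(lt, list):  # assumed shape in some API versions
        lt = lt[0] if lt else {}
    return (
        _as_str(lt.get("name")),
        _as_str(lt.get("version")),
        _as_str(at_provider.get("version")),
    )


def changed_fields(before: Fingerprint, after: Fingerprint) -> list[str]:
    """Names of the fingerprint fields that differ between two snapshots."""
    return [name for name, b, a in zip(FIELDS, before, after) if b != a]


# Hypothetical snapshots taken before and after the cutover.
before = {"status": {"atProvider": {
    "launchTemplate": {"name": "lt-a", "version": "3"}, "version": "1.31"}}}
after = {"status": {"atProvider": {
    "launchTemplate": {"name": "lt-a", "version": "3"}, "version": "1.31"}}}

diff = changed_fields(nodegroup_fingerprint(before), nodegroup_fingerprint(after))
print("rotation risk:" if diff else "unchanged", diff)
```

An empty list means the fingerprint is identical; any entry is a reason to stop and investigate before moving to the next cluster.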

2. The cluster-auth connection-secret republish race

The subtlest one — a silent failure if you're not watching for it.

For EKS, a managed "cluster auth" resource generates the kubeconfig (a short-lived token) and writes it to a connection Secret. The provider-kubernetes ProviderConfig reads that Secret to talk to the workload cluster, and every Object on that cluster depends on it.

When the v2 XR took ownership, the connection Secret got recreated empty. If the cluster-auth resource's last token refresh happened before that recreation, it didn't immediately republish, so the Secret stayed empty. Every downstream Object was then stranded with:

cannot build kube client for provider config: currentContext not set in kubeconfig

On most clusters this self-healed on the cluster-auth resource's next refresh cycle. On one, the timing left it stuck for several minutes with no sign of recovering.

Fix: force the cluster-auth resource to reconcile so it republishes the kubeconfig. A benign annotation bump does it:

kubectl annotate <clusterauth-kind> <name> \
  example.com/republish="$(date -u +%s)" --overwrite

The connection Secret repopulates, and the stranded Objects build their client and sync. The lesson: adopting an MR can re-create its connection Secret out from under downstream consumers. Put health checks on the downstream objects, not just the cluster resource, or you'll never see it.
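A downstream-facing health check can start with the Secret itself. The sketch below inspects a Secret loaded as a dict (for example from `kubectl get secret -o json`) and reports why its kubeconfig would be unusable. It assumes the kubeconfig is stored under a `kubeconfig` key and is YAML with a top-level `current-context:` line; both are assumptions to verify against your provider, and `kubeconfig_problem` is a hypothetical name.

```python
import base64
from typing import Any, Optional


def kubeconfig_problem(secret: dict[str, Any], key: str = "kubeconfig") -> Optional[str]:
    """Return a reason the kubeconfig is unusable, or None if it looks fine."""
    raw = (secret.get("data") or {}).get(key, "")
    if not raw:
        return "secret has no kubeconfig data"
    text = base64.b64decode(raw).decode("utf-8", errors="replace")
    for line in text.splitlines():
        # Only a top-level key counts, so the line must start at column 0.
        if line.startswith("current-context:"):
            value = line.split(":", 1)[1].strip().strip("'\"")
            return None if value else "current-context is empty"
    return "current-context not set"


def encode(text: str) -> str:
    """Helper for the examples: base64-encode the way Secret data is stored."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical Secrets: one healthy, one recreated empty by the adoption.
healthy = {"data": {"kubeconfig": encode("apiVersion: v1\ncurrent-context: prod\n")}}
recreated_empty = {"data": {}}

print(kubeconfig_problem(healthy))
print(kubeconfig_problem(recreated_empty))
```

This only catches the empty-Secret case described above. It does not prove the token is still valid, so it complements rather than replaces a check on the downstream Objects' own Synced/Ready conditions.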

3. GitOps source-of-truth drift

The live cutover above is imperative. Your GitOps repo still describes the v1 world. Until you reconcile it, your GitOps controller will try to "fix" the cluster back toward the manifests: it may unpause the v1 claim, and it has no record of the v2 XR at all.

Treat the cluster migration and the source-of-truth migration as two separate workstreams. After the live cutover, land a Git change that:

adds crossplane.io/paused: "true" to the v1 claim manifests, and
adds the v2 XR manifests without the paused annotation (so the controller manages them as the active resources).

Make sure auto-sync/self-heal won't revert your live state in the gap between cutover and merge.

A repeatable runbook

Boiled down, every cluster followed the same pipeline:

snapshot — export live MRs, claim, composite, and composition.
render-gate — beta render --observed-resources + diff; require zero orphans / zero unexpected net-new; confirm the NodeGroup crn remap and that launch-template/version match.
reparent — script the label + ownerReference rewrite for all MRs (with a rollback script that puts them back).
patch — set the v2 XR compositionRef + resourceRefs (still paused).
confirm v1 (claim and composite) is still paused → unpause v2.
health-gate — every MR Synced/Ready excluding a known-baseline set; NodeGroup unchanged; downstream Objects connected.
reconcile Git — pause v1 manifests, add unpaused v2 manifests.

Do non-prod first, build the runbook, then prod. Keep the v1 composite paused (not deleted) for a cooldown period so rollback is a single unpause away.
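The health-gate step of the runbook lends itself to a small script. A minimal sketch, assuming the MRs have been exported as dicts with the usual status.conditions list, and that the "known-baseline set" is a set of `Kind/name` strings for resources that were already unhealthy before the migration; the function names and sample data are hypothetical.

```python
from typing import Any


def condition_true(resource: dict[str, Any], cond_type: str) -> bool:
    """True only if the resource reports the given condition with status True."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status") == "True"
    return False  # a missing condition is treated as unhealthy


def failing(mrs: list[dict[str, Any]], baseline: set[str]) -> list[str]:
    """MRs that are not Synced and Ready, excluding the known-baseline set."""
    bad = []
    for mr in mrs:
        ident = f'{mr["kind"]}/{mr["metadata"]["name"]}'
        if ident in baseline:
            continue
        if not (condition_true(mr, "Synced") and condition_true(mr, "Ready")):
            bad.append(ident)
    return sorted(bad)


def mr(kind: str, name: str, synced: str, ready: str) -> dict[str, Any]:
    """Build a tiny MR fixture for the example."""
    return {
        "kind": kind,
        "metadata": {"name": name},
        "status": {"conditions": [
            {"type": "Synced", "status": synced},
            {"type": "Ready", "status": ready},
        ]},
    }


# Hypothetical post-cutover state.
mrs = [
    mr("NodeGroup", "prod-ng", "True", "True"),
    mr("Object", "coredns-config", "False", "False"),
    mr("Addon", "legacy-addon", "True", "False"),
]

print(failing(mrs, baseline={"Addon/legacy-addon"}))
```

Including the provider-kubernetes Objects in the same list is what turns this into a check on downstream consumers rather than on the cluster resource alone.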

What the ecosystem still needs

Most of the above was hand-rolled. A few things would turn this from "expert-only surgery" into a supported workflow:

A migrate command that, given a v1 claim/composite and a target v2 composition, generates the reparent patches, the v2 XR with populated resourceRefs, and — critically — a crn remap table between the two compositions. Matching must be by external resource identity, not crn string equality.
An adopt-preview/dry-run that classifies every MR as adopt-in-place / net-new / orphan and gates on zero orphans before proceeding (productizing the render --observed-resources diff).
Connection-secret-aware adoption — on adoption, force a reconcile or wait on connection-secret readiness so downstream providers don't lose connectivity.

There's an active community effort around exactly this (a maintainer-run feedback discussion and a migration-tooling tracking issue, plus a community CLI for migrating composition manifests). If you've done a migration like this, your war stories are genuinely useful input — the design is still being shaped.

Takeaways

Adopt, don't recreate. Make a v2 XR own the exact MRs the v1 composite owned; never let external-names change.
Validate offline first. beta render --observed-resources + a desired/observed diff is the single highest-leverage safety check.
crn alignment is everything for NodeGroups — a mismatch is the difference between a silent adoption and a full node rotation.
Watch your connection secrets. Adoption can recreate them empty; downstream consumers fail silently until the owner republishes.
Two migrations, not one. The live cluster and the GitOps source of truth move separately — plan for both.

It's very doable to take a production fleet from v1 to v2 with zero downtime today — it just isn't yet a one-command experience. Hopefully this shortens the path for the next person.

DEV Community