Hybrid orchestration in 2026 means you can deploy the same workload across on-prem + AWS (and a second cloud if needed) using Kubernetes + Terraform + Argo CD as the common layer.
Keep Git as source of truth. Standardize identity, DNS, ingress, and observability. Then test failover like it’s a feature, not a promise.
What “single-provider risk” looks like
You can’t mitigate what you won’t name.
| Risk | What breaks first | What it looks like at 2AM |
| --- | --- | --- |
| Region/control-plane dependency | Deploy pipeline, cluster ops | “Can’t roll back. API calls time out.” |
| IAM lock-in | Workload identity, secrets access | “Pods can’t auth off-cloud.” |
| Network primitives | Ingress/LB, DNS | “Traffic won’t steer. Health checks lie.” |
| Data gravity/egress | DR, migration | “Failover works, but costs explode.” |
| Managed service coupling | DB/cache/queue | “App is portable. State is not.” |
Rule: If your deploy and auth only work inside AWS, you don’t have “hybrid.” You have “AWS with extra steps.”
Pick a hybrid shape that matches reality
Topology decides your failure modes.
Option A: Two independent clusters (recommended default)
This is the boring one. It works.
- Cluster 1: EKS in AWS
- Cluster 2: on-prem Kubernetes (or another provider)
Argo CD fans out apps to both. Terraform builds both. You can fail one without taking the other’s control plane with it.
Option B: “Stretched cluster” (know the connectivity tax)
This is EKS Hybrid Nodes territory: control plane in AWS Region, nodes on-prem.
AWS calls this a “stretched/extended” cluster architecture.
AWS also publishes best practices that assume redundant, resilient connectivity to avoid disconnections.
Use it when:
- you want one control plane
- you can engineer reliable private connectivity
Avoid it when:
- your on-prem is intermittently connected
- you need disconnected/air-gapped operations
Option C: Disconnected/air-gapped on-prem
If “internet might not exist,” treat it as a hard requirement.
AWS documents EKS Anywhere as capable of running in air-gapped/disconnected environments.
Reference architecture
Every subsystem needs a home.
```
            Git (source of truth)
                     |
                     v
              Argo CD (GitOps)
         (runs on-prem or neutral)
               /           \
              v             v
 On-prem K8s cluster   AWS EKS cluster
   (apps + addons)     (apps + addons)
              \             /
               v           v
  Shared services: DNS, OIDC, logging/metrics,
  container registry (mirrors), secrets/KMS strategy
```
Rule: Put the GitOps control plane where a provider outage can’t strand you. Argo CD is a Kubernetes controller that continuously compares live state to Git and reports drift as OutOfSync.
Terraform: build infra once, not by hand
Terraform is for infra. Argo is for convergence. Don’t mix them.
Terraform responsibilities
- VPC/VPN/Direct Connect edge
- EKS cluster + node groups
- On-prem cluster primitives (or the platform that hosts it)
- IAM/OIDC scaffolding
- Base DNS zones / records (if you must)
Repo layout that survives day-2
Keep it simple:
```
infra/
  aws/
    eks/
    network/
  onprem/
    k8s/
apps/
  base/
  overlays/
    aws/
    onprem/
gitops/
  applicationsets/
```
Argo CD: one template, many clusters
Multi-cluster GitOps is the whole point.
Argo CD supports ApplicationSet for multi-cluster automation.
The Cluster generator can auto-discover clusters registered in Argo CD and expose their metadata as template parameters.
Example: ApplicationSet that deploys to both clusters
Label your clusters in Argo (env=aws, env=onprem), then:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-addons
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: env
              operator: In
              values: ["aws", "onprem"]
  template:
    metadata:
      name: "addons-{{name}}"
    spec:
      project: platform
      source:
        repoURL: https://git.example.com/platform.git
        targetRevision: main
        path: "apps/overlays/{{metadata.labels.env}}/addons"
      destination:
        server: "{{server}}"
        namespace: platform
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```
This gives you:
- one definition
- two targets
- drift correction
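For the Cluster generator to match on `env`, the label lives on the cluster’s registration Secret in Argo CD. A minimal sketch (the cluster name, server URL, and CA value are placeholders; you can also let `argocd cluster add` create the Secret and label it afterwards):

```yaml
# Argo CD discovers clusters via Secrets labeled
# argocd.argoproj.io/secret-type: cluster. Extra labels
# (here: env) become selector/template parameters.
apiVersion: v1
kind: Secret
metadata:
  name: onprem-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    env: onprem   # matched by the ApplicationSet selector above
type: Opaque
stringData:
  name: onprem
  server: https://onprem-k8s.example.internal:6443
  config: |
    {
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64 CA>"
      }
    }
```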
Portability boundary: decide what must stay portable
Hybrid fails when you pretend everything is portable.
Portable by default
- Kubernetes APIs (Deployments, Services, Ingress)
- Helm/Kustomize overlays
- Argo CD delivery mechanics
- OpenTelemetry-based app telemetry
Not portable unless you plan it
- Provider IAM-only auth
- Provider-specific LBs and DNS behavior
- Storage classes with provider-only semantics
Rule: If state can’t move, failover is theater.
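One way to keep storage declarations portable is to standardize on a StorageClass name per tier and bind it to provider-specific provisioners in each overlay. A sketch, assuming the AWS EBS CSI driver on EKS and a placeholder CSI driver name on-prem:

```yaml
# aws overlay: "fast" maps to EBS gp3
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
---
# onprem overlay: same name, different provisioner
# (csi.local.example is a placeholder for your on-prem CSI driver)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: csi.local.example
```

Workloads reference `storageClassName: fast` everywhere; only the overlay changes. The data itself still needs a replication or restore plan, of course.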
Identity: stop wiring apps to one cloud’s IAM
Auth is the first thing that breaks off-cloud.
Baseline pattern:
- Use OIDC for human and workload identity.
- Use Kubernetes service accounts mapped to your identity provider.
- Keep secrets strategy consistent (Vault, SOPS, external secret operators) across clusters.
If AWS IAM is your only workload identity story, your on-prem cluster becomes a second-class citizen.
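As one concrete shape for a consistent secrets strategy, External Secrets Operator lets both clusters consume the same `ExternalSecret` manifest while each cluster’s store points at its local backend (Vault on-prem, AWS Secrets Manager in EKS). A sketch with placeholder store and path names:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: default-store      # per-cluster store; backend differs per env
    kind: ClusterSecretStore
  target:
    name: app-db-credentials # Kubernetes Secret created in-cluster
  data:
    - secretKey: password
      remoteRef:
        key: prod/app/db     # same logical path in each backend
        property: password
```

The app only ever sees a plain Kubernetes Secret, so it stays identical across clusters.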
Networking: make DNS and routing boring
Hybrid is mostly DNS and routes.
Minimum requirements:
- deterministic routing between on-prem and AWS (VPN/Direct Connect)
- clear ownership of egress/ingress paths
- DNS resolution both directions (forward + reverse if needed)
If you choose “stretched EKS,” AWS’s docs push you to engineer resilient connectivity and plan for disconnections.
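On the Kubernetes side, cross-environment resolution often comes down to a CoreDNS forward rule so in-cluster workloads can resolve the other environment’s private zone. A sketch, assuming a private zone `corp.internal` and placeholder resolver IPs; how you wire this in differs per distribution (some support a custom ConfigMap like the one below, on others you edit the main Corefile):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  corp.server: |
    # forward the on-prem private zone to on-prem resolvers
    corp.internal:53 {
      forward . 10.10.0.2 10.10.0.3
      cache 30
    }
```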
Operations: avoid doubling your surface area
Two clusters means two of everything unless you standardize.
One observability pipeline
- one metrics backend
- one log backend
- consistent labels: cluster, env, region, service
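With OpenTelemetry-based telemetry, one way to enforce those labels is to stamp them in each cluster’s Collector config rather than in app code. A fragment of the relevant processor (values are per-cluster; the full config also needs receivers and exporters):

```yaml
processors:
  resource:
    attributes:
      - key: cluster
        value: eks-prod    # per-cluster value
        action: upsert
      - key: env
        value: aws
        action: upsert
      - key: region
        value: us-east-1
        action: upsert
service:
  pipelines:
    metrics:
      processors: [resource]
```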
One upgrade policy
- version skew rules
- maintenance windows
- rollback runbooks
One incident drill
Run this quarterly:
- break AWS ingress (simulate region/LB outage)
- fail traffic to on-prem
- verify auth, DNS, data correctness
- roll back cleanly
If you can’t rehearse it, don’t claim it.
Where AceCloud fits in a “don’t bet on one provider” plan
If you want a second cloud without rewriting your platform, add it as another Kubernetes target.
AceCloud’s docs show a managed Kubernetes flow built around worker node groups, where you pick Flavor Type/Name, worker count, per-node volume, and security group.
That maps cleanly to the same GitOps model:
- Terraform (or API) builds the cluster/node groups
- Argo CD registers the cluster
- ApplicationSet deploys the same overlays
This gives you a practical hedge:
- AWS EKS as primary
- on-prem as locality/compliance anchor
- AceCloud as a secondary cloud target for burst, DR rehearsals, or an exit ramp
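Under that model, adding the second cloud is mostly one more label value in the generator. A sketch, assuming the new cluster is registered in Argo CD with a hypothetical `env=acecloud` label and an `apps/overlays/acecloud/` directory exists:

```yaml
generators:
  - clusters:
      selector:
        matchExpressions:
          - key: env
            operator: In
            values: ["aws", "onprem", "acecloud"]
```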
CTO checklist
Print this and use it in reviews.
- GitOps control plane is provider-neutral (or at least not single-region)
- Two independent clusters exist (on-prem + AWS), not just a stretched cluster
- Argo CD multi-cluster deployment is automated (ApplicationSet)
- Identity works off-cloud (OIDC strategy, not AWS-only IAM)
- DNS and routing are deterministic (and tested)
- Failover drill is scripted and run regularly
- State portability is explicitly defined (what can fail over, what can’t)