DEV Community

Daya Shankar
Hybrid Orchestration Basics: Avoiding Single-Provider Risks in 2026

Hybrid orchestration in 2026 means deploying the same workload across on-prem and AWS (plus a second cloud if needed), with Kubernetes, Terraform, and Argo CD as the common layer.

Keep Git as source of truth. Standardize identity, DNS, ingress, and observability. Then test failover like it’s a feature, not a promise. 

What “single-provider risk” looks like

You can’t mitigate what you won’t name.

| Risk | What breaks first | What it looks like at 2AM |
| --- | --- | --- |
| Region/control-plane dependency | Deploy pipeline, cluster ops | "Can't roll back. API calls time out." |
| IAM lock-in | Workload identity, secrets access | "Pods can't auth off-cloud." |
| Network primitives | Ingress/LB, DNS | "Traffic won't steer. Health checks lie." |
| Data gravity/egress | DR, migration | "Failover works, but costs explode." |
| Managed service coupling | DB/cache/queue | "App is portable. State is not." |

Rule: If your deploy and auth only work inside AWS, you don’t have “hybrid.” You have “AWS with extra steps.”

Pick a hybrid shape that matches reality

Topology decides your failure modes.

Option A: Two independent clusters (recommended default)

This is the boring one. It works.

  • Cluster 1: EKS in AWS
  • Cluster 2: on-prem Kubernetes (or another provider)

Argo CD fans out apps to both. Terraform builds both. You can fail one without taking the other’s control plane with it.

Option B: “Stretched cluster” (know the connectivity tax)

This is EKS Hybrid Nodes territory: control plane in AWS Region, nodes on-prem.

AWS calls this a “stretched/extended” cluster architecture.  
AWS also publishes best practices that assume redundant, resilient connectivity to avoid disconnections. 

Use it when:

  • you want one control plane
  • you can engineer reliable private connectivity

Avoid it when:

  • your on-prem is intermittently connected
  • you need disconnected/air-gapped operations

Option C: Disconnected/air-gapped on-prem

If “internet might not exist,” treat it as a hard requirement.

AWS documents EKS Anywhere as capable of running in air-gapped/disconnected environments. 

Reference architecture

Every subsystem needs a home.

```
                Git (source of truth)
                         |
                         v
                 Argo CD (GitOps)
            (runs on-prem or neutral)
                 /               \
                v                 v
  On-prem K8s cluster       AWS EKS cluster
    (apps + addons)         (apps + addons)
                 \               /
                  v             v
  Shared services: DNS, OIDC, logging/metrics,
  container registry (mirrors), secrets/KMS strategy
```

Rule: Put the GitOps control plane where a provider outage can’t strand you. Argo CD is a Kubernetes controller that continuously compares live state to Git and reports drift as OutOfSync. 

Terraform: build infra once, not by hand

Terraform is for infra. Argo is for convergence. Don’t mix them.

Terraform responsibilities

  • VPC/VPN/Direct Connect edge
  • EKS cluster + node groups
  • On-prem cluster primitives (or the platform that hosts it)
  • IAM/OIDC scaffolding
  • Base DNS zones / records (if you must)
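A minimal sketch of the AWS half, assuming the community `terraform-aws-modules/eks` module; the cluster name, instance types, and the `module.network` reference are illustrative:

```hcl
# Hypothetical names and sizes -- adapt to your estate.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-aws"
  cluster_version = "1.31"
  vpc_id          = module.network.vpc_id          # built elsewhere in infra/aws/network
  subnet_ids      = module.network.private_subnets

  eks_managed_node_groups = {
    default = {
      instance_types = ["m6i.large"]
      min_size       = 3
      max_size       = 6
      desired_size   = 3
    }
  }
}
```

Terraform stops at the cluster and its plumbing; everything that runs *inside* the cluster belongs to Argo CD.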

Repo layout that survives day-2

Keep it simple:

```
infra/
  aws/
    eks/
    network/
  onprem/
k8s/
  apps/
    base/
    overlays/
      aws/
      onprem/
gitops/
  applicationsets/
```

Argo CD: one template, many clusters

Multi-cluster GitOps is the whole point.

Argo CD supports ApplicationSet for multi-cluster automation.  
The Cluster generator can auto-discover clusters registered in Argo CD and expose their metadata as template parameters. 

Example: ApplicationSet that deploys to both clusters

Label your clusters in Argo (env=aws, env=onprem), then:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-addons
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: env
              operator: In
              values: ["aws", "onprem"]
  template:
    metadata:
      name: "addons-{{name}}"
    spec:
      project: platform
      source:
        repoURL: https://git.example.com/platform.git
        targetRevision: main
        path: "apps/overlays/{{metadata.labels.env}}/addons"
      destination:
        server: "{{server}}"
        namespace: platform
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

This gives you:

  • one definition
  • two targets
  • drift correction

Portability boundary: decide what must stay portable

Hybrid fails when you pretend everything is portable.

Portable by default

  • Kubernetes APIs (Deployments, Services, Ingress)
  • Helm/Kustomize overlays
  • Argo CD delivery mechanics
  • OpenTelemetry-based app telemetry

Not portable unless you plan it

  • Provider IAM-only auth
  • Provider-specific LBs and DNS behavior
  • Storage classes with provider-only semantics
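One way to hold that boundary in practice is to keep the Ingress object portable and push only the class choice into per-cluster overlays. A Kustomize-style sketch (paths and class names are illustrative):

```yaml
# apps/overlays/aws/ingress-patch.yaml (hypothetical path)
# Base defines the portable Ingress spec; each overlay pins the class.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: alb   # the onprem overlay would set e.g. nginx
```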

Rule: If state can’t move, failover is theater.

Identity: stop wiring apps to one cloud’s IAM

Auth is the first thing that breaks off-cloud.

Baseline pattern:

  • Use OIDC for human and workload identity.
  • Use Kubernetes service accounts mapped to your identity provider.
  • Keep secrets strategy consistent (Vault, SOPS, external secret operators) across clusters.
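The "consistent secrets" point can be made concrete with External Secrets Operator: the same manifest syncs from the same backing store on both clusters. A sketch, assuming a `ClusterSecretStore` named `vault` exists everywhere (all names and paths are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault               # same store name on every cluster,
    kind: ClusterSecretStore  # backed by the same Vault instance
  target:
    name: app-db-credentials  # the Kubernetes Secret this materializes
  data:
    - secretKey: password
      remoteRef:
        key: apps/prod/db     # hypothetical Vault path
        property: password
```

Because the manifest is identical across clusters, it ships through the same ApplicationSet as everything else.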

If AWS IAM is your only workload identity story, your on-prem cluster becomes a second-class citizen.

Networking: make DNS and routing boring

Hybrid is mostly DNS and routes.

Minimum requirements:

  • deterministic routing between on-prem and AWS (VPN/Direct Connect)
  • clear ownership of egress/ingress paths
  • DNS resolution both directions (forward + reverse if needed)
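On the cluster side, bidirectional resolution usually comes down to conditional forwarding. A CoreDNS sketch for the EKS side, assuming an on-prem zone `corp.internal` and illustrative resolver IPs:

```
# Extra server block in the CoreDNS Corefile (ConfigMap in kube-system).
corp.internal:53 {
    errors
    cache 30
    forward . 10.20.0.2 10.20.0.3   # on-prem DNS servers (example IPs)
}
```

The mirror-image rule lives on the on-prem resolvers, forwarding your cloud zones to the VPC resolver.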

If you choose “stretched EKS,” AWS’s docs push you to engineer resilient connectivity and plan for disconnections. 

Operations: avoid doubling your surface area

Two clusters means two of everything unless you standardize.

One observability pipeline

  • one metrics backend
  • one log backend
  • consistent labels: cluster, env, region, service
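The consistent-labels rule is cheapest to enforce at the source. A Prometheus sketch where every series from a cluster carries its identity into the shared backend (values are illustrative):

```yaml
# prometheus.yml on the EKS cluster -- the onprem copy differs only in labels
global:
  external_labels:
    cluster: eks-prod
    env: aws
    region: us-east-1
remote_write:
  - url: https://metrics.example.com/api/v1/write   # the one shared backend
```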

One upgrade policy

  • version skew rules
  • maintenance windows
  • rollback runbooks

One incident drill

Run this quarterly:

  1. break AWS ingress (simulate region/LB outage)
  2. fail traffic over to on-prem
  3. verify auth, DNS, and data correctness
  4. roll back cleanly
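Failing traffic over usually means flipping DNS. If Route 53 fronts both sites, failover records make the flip automatic once the health check fails; rolling back is the health check recovering. A Terraform sketch (zone, names, and IPs are placeholders):

```hcl
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "A"
  ttl             = 60
  set_identifier  = "aws"
  records         = [var.aws_ingress_ip]          # placeholder
  health_check_id = aws_route53_health_check.aws.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "onprem"
  records        = [var.onprem_ingress_ip]        # placeholder

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Note the trade-off: this puts DNS steering inside AWS, which is itself a provider dependency; a drill should cover losing it too.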

If you can’t rehearse it, don’t claim it.

Where AceCloud fits in a “don’t bet on one provider” plan

If you want a second cloud without rewriting your platform, add it as another Kubernetes target.

AceCloud’s docs show a managed Kubernetes flow built around worker node groups, where you pick Flavor Type/Name, worker count, per-node volume, and security group.  
That maps cleanly to the same GitOps model:

  • Terraform (or API) builds the cluster/node groups
  • Argo CD registers the cluster
  • ApplicationSet deploys the same overlays

This gives you a practical hedge:

  • AWS EKS as primary
  • on-prem as locality/compliance anchor
  • AceCloud as a secondary cloud target for burst, DR rehearsals, or an exit ramp

CTO checklist

Print this and use it in reviews.

  • GitOps control plane is provider-neutral (or at least not single-region)
  • Two independent clusters exist (on-prem + AWS), not just a stretched cluster
  • Argo CD multi-cluster deployment is automated (ApplicationSet) 
  • Identity works off-cloud (OIDC strategy, not AWS-only IAM)
  • DNS and routing are deterministic (and tested)
  • Failover drill is scripted and run regularly
  • State portability is explicitly defined (what can fail over, what can’t)

 
