Hybrid orchestration in 2026 means you can deploy the same workload across on-prem + AWS (and a second cloud if needed) using Kubernetes + Terraform + Argo CD as the common layer.
Keep Git as source of truth. Standardize identity, DNS, ingress, and observability. Then test failover like it’s a feature, not a promise.
What “single-provider risk” looks like
You can’t mitigate what you won’t name.
| Risk | What breaks first | What it looks like at 2AM |
| --- | --- | --- |
| Region/control-plane dependency | Deploy pipeline, cluster ops | “Can’t roll back. API calls time out.” |
| IAM lock-in | Workload identity, secrets access | “Pods can’t auth off-cloud.” |
| Network primitives | Ingress/LB, DNS | “Traffic won’t steer. Health checks lie.” |
| Data gravity/egress | DR, migration | “Failover works, but costs explode.” |
| Managed service coupling | DB/cache/queue | “App is portable. State is not.” |
Rule: If your deploy and auth only work inside AWS, you don’t have “hybrid.” You have “AWS with extra steps.”
Pick a hybrid shape that matches reality
Topology decides your failure modes.
Option A: Two independent clusters (recommended default)
This is the boring one. It works.
- Cluster 1: EKS in AWS
- Cluster 2: on-prem Kubernetes (or another provider)
Argo CD fans out apps to both. Terraform builds both. You can fail one without taking the other’s control plane with it.
Option B: “Stretched cluster” (know the connectivity tax)
This is EKS Hybrid Nodes territory: control plane in AWS Region, nodes on-prem.
AWS calls this a “stretched/extended” cluster architecture.
AWS also publishes best practices that assume redundant, resilient connectivity to avoid disconnections.
Use it when:
- you want one control plane
- you can engineer reliable private connectivity
Avoid it when:
- your on-prem is intermittently connected
- you need disconnected/air-gapped operations
Option C: Disconnected/air-gapped on-prem
If “internet might not exist,” treat it as a hard requirement.
AWS documents EKS Anywhere as capable of running in air-gapped/disconnected environments.
Reference architecture
Every subsystem needs a home.
```
            Git (source of truth)
                     |
                     v
              Argo CD (GitOps)
         (runs on-prem or neutral)
               /           \
              v             v
 On-prem K8s cluster   AWS EKS cluster
   (apps + addons)     (apps + addons)
              \             /
               v           v
  Shared services: DNS, OIDC, logging/metrics,
  container registry (mirrors), secrets/KMS strategy
```
Rule: Put the GitOps control plane where a provider outage can’t strand you. Argo CD is a Kubernetes controller that continuously compares live state to Git and reports drift as OutOfSync.
Terraform: build infra once, not by hand
Terraform is for infra. Argo is for convergence. Don’t mix them.
Terraform responsibilities
- VPC/VPN/Direct Connect edge
- EKS cluster + node groups
- On-prem cluster primitives (or the platform that hosts it)
- IAM/OIDC scaffolding
- Base DNS zones / records (if you must)
Repo layout that survives day-2
Keep it simple:
```
infra/
  aws/
    eks/
    network/
  onprem/
    k8s/
apps/
  base/
  overlays/
    aws/
    onprem/
gitops/
  applicationsets/
```
Argo CD: one template, many clusters
Multi-cluster GitOps is the whole point.
Argo CD supports ApplicationSet for multi-cluster automation.
The Cluster generator can auto-discover clusters registered in Argo CD and expose their metadata as template parameters.
Example: ApplicationSet that deploys to both clusters
Label your clusters in Argo (env=aws, env=onprem), then:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-addons
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: env
              operator: In
              values: ["aws", "onprem"]
  template:
    metadata:
      name: "addons-{{name}}"
    spec:
      project: platform
      source:
        repoURL: https://git.example.com/platform.git
        targetRevision: main
        path: "apps/overlays/{{metadata.labels.env}}/addons"
      destination:
        server: "{{server}}"
        namespace: platform
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```
This gives you:
- one definition
- two targets
- drift correction
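For the Cluster generator to match on `env`, the label lives on the cluster’s registration Secret in Argo CD. A minimal sketch (the cluster name, server URL, and CA value are placeholders; you can also let `argocd cluster add` create the Secret and label it afterwards):

```yaml
# Argo CD discovers clusters via Secrets labeled
# argocd.argoproj.io/secret-type: cluster. Extra labels
# (here: env) become selector/template parameters.
apiVersion: v1
kind: Secret
metadata:
  name: onprem-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    env: onprem   # matched by the ApplicationSet selector above
type: Opaque
stringData:
  name: onprem
  server: https://onprem-k8s.example.internal:6443
  config: |
    {
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64 CA>"
      }
    }
```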
Portability boundary: decide what must stay portable
Hybrid fails when you pretend everything is portable.
Portable by default
- Kubernetes APIs (Deployments, Services, Ingress)
- Helm/Kustomize overlays
- Argo CD delivery mechanics
- OpenTelemetry-based app telemetry
Not portable unless you plan it
- Provider IAM-only auth
- Provider-specific LBs and DNS behavior
- Storage classes with provider-only semantics
Rule: If state can’t move, failover is theater.
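One way to keep storage declarations portable is to standardize on a StorageClass name per tier and bind it to provider-specific provisioners in each overlay. A sketch, assuming the AWS EBS CSI driver on EKS and a placeholder CSI driver name on-prem:

```yaml
# aws overlay: "fast" maps to EBS gp3
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
---
# onprem overlay: same name, different provisioner
# (csi.local.example is a placeholder for your on-prem CSI driver)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: csi.local.example
```

Workloads reference `storageClassName: fast` everywhere; only the overlay changes. The data itself still needs a replication or restore plan, of course.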
Identity: stop wiring apps to one cloud’s IAM
Auth is the first thing that breaks off-cloud.
Baseline pattern:
- Use OIDC for human and workload identity.
- Use Kubernetes service accounts mapped to your identity provider.
- Keep secrets strategy consistent (Vault, SOPS, external secret operators) across clusters.
If AWS IAM is your only workload identity story, your on-prem cluster becomes a second-class citizen.
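As one concrete shape for a consistent secrets strategy, External Secrets Operator lets both clusters consume the same `ExternalSecret` manifest while each cluster’s store points at its local backend (Vault on-prem, AWS Secrets Manager in EKS). A sketch with placeholder store and path names:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: default-store      # per-cluster store; backend differs per env
    kind: ClusterSecretStore
  target:
    name: app-db-credentials # Kubernetes Secret created in-cluster
  data:
    - secretKey: password
      remoteRef:
        key: prod/app/db     # same logical path in each backend
        property: password
```

The app only ever sees a plain Kubernetes Secret, so it stays identical across clusters.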
Networking: make DNS and routing boring
Hybrid is mostly DNS and routes.
Minimum requirements:
- deterministic routing between on-prem and AWS (VPN/Direct Connect)
- clear ownership of egress/ingress paths
- DNS resolution both directions (forward + reverse if needed)
If you choose “stretched EKS,” AWS’s docs push you to engineer resilient connectivity and plan for disconnections.
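On the Kubernetes side, cross-environment resolution often comes down to a CoreDNS forward rule so in-cluster workloads can resolve the other environment’s private zone. A sketch, assuming a private zone `corp.internal` and placeholder resolver IPs; how you wire this in differs per distribution (some support a custom ConfigMap like the one below, on others you edit the main Corefile):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  corp.server: |
    # forward the on-prem private zone to on-prem resolvers
    corp.internal:53 {
      forward . 10.10.0.2 10.10.0.3
      cache 30
    }
```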
Operations: avoid doubling your surface area
Two clusters means two of everything unless you standardize.
One observability pipeline
- one metrics backend
- one log backend
- consistent labels: cluster, env, region, service
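With OpenTelemetry-based telemetry, one way to enforce those labels is to stamp them in each cluster’s Collector config rather than in app code. A fragment of the relevant processor (values are per-cluster; the full config also needs receivers and exporters):

```yaml
processors:
  resource:
    attributes:
      - key: cluster
        value: eks-prod    # per-cluster value
        action: upsert
      - key: env
        value: aws
        action: upsert
      - key: region
        value: us-east-1
        action: upsert
service:
  pipelines:
    metrics:
      processors: [resource]
```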
One upgrade policy
- version skew rules
- maintenance windows
- rollback runbooks
One incident drill
Run this quarterly:
- break AWS ingress (simulate region/LB outage)
- fail traffic to on-prem
- verify auth, DNS, data correctness
- roll back cleanly
If you can’t rehearse it, don’t claim it.
Where AceCloud fits in a “don’t bet on one provider” plan
If you want a second cloud without rewriting your platform, add it as another Kubernetes target.
AceCloud’s docs show a managed Kubernetes flow built around worker node groups, where you pick Flavor Type/Name, worker count, per-node volume, and security group.
That maps cleanly to the same GitOps model:
- Terraform (or API) builds the cluster/node groups
- Argo CD registers the cluster
- ApplicationSet deploys the same overlays
This gives you a practical hedge:
- AWS EKS as primary
- on-prem as locality/compliance anchor
- AceCloud as a secondary cloud target for burst, DR rehearsals, or an exit ramp
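Under that model, adding the second cloud is mostly one more label value in the generator. A sketch, assuming the new cluster is registered in Argo CD with a hypothetical `env=acecloud` label and an `apps/overlays/acecloud/` directory exists:

```yaml
generators:
  - clusters:
      selector:
        matchExpressions:
          - key: env
            operator: In
            values: ["aws", "onprem", "acecloud"]
```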
CTO checklist
Print this and use it in reviews.
- GitOps control plane is provider-neutral (or at least not single-region)
- Two independent clusters exist (on-prem + AWS), not just a stretched cluster
- Argo CD multi-cluster deployment is automated (ApplicationSet)
- Identity works off-cloud (OIDC strategy, not AWS-only IAM)
- DNS and routing are deterministic (and tested)
- Failover drill is scripted and run regularly
- State portability is explicitly defined (what can fail over, what can’t)