Daya Shankar
Hosted control plane: when it simplifies operations and when it adds complexity

A hosted control plane moves Kubernetes control-plane components off your worker fleet, either into a provider-managed boundary (EKS) or onto a separate hosting cluster as pods (HyperShift).

It simplifies ops when you want predictable upgrades, less per-cluster snowflake work, and cleaner separation between “management” and “workloads.” 

It adds complexity when control-plane connectivity, IAM, and shared blast radius become your new failure modes, especially with private clusters.

Define hosted control plane in concrete terms

If you can’t say where the API server and etcd live, you can’t model risk.

“Hosted control plane” is a placement decision.

EKS: hosted by AWS in an EKS-managed VPC

AWS owns the masters; you own nodes and workloads.

AWS documents that the EKS-managed control plane runs inside an AWS-managed VPC and includes Kubernetes API server nodes and an etcd cluster. API server nodes run in an Auto Scaling group across at least two AZs; etcd nodes span three AZs. 

What that means operationally:

  • You don’t patch control-plane instances.
  • You don’t rebuild etcd.
  • You do still own access, RBAC, node lifecycle, and add-ons.

kubeadm on EC2: not hosted, you host it

You run the masters, the etcd, the upgrades, and the recovery drills.

Kubeadm HA requires you to pick a topology (stacked etcd vs external etcd) and wire up the endpoints (often via a load balancer DNS name). External etcd needs explicit endpoint configuration; stacked etcd is “managed automatically” by kubeadm’s topology. 

What that means operationally:

  • You patch and upgrade the control plane.
  • You own etcd snapshots and restore tests.
  • You own certificates and rotation edge cases.
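
The topology choice shows up directly in the kubeadm config. A minimal sketch of an external-etcd `ClusterConfiguration` (the DNS name, IPs, and file paths are placeholders; stacked etcd would simply omit the `etcd.external` block):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
# DNS name of the load balancer fronting all API servers
controlPlaneEndpoint: "k8s-api.example.internal:6443"
etcd:
  external:
    endpoints:                       # explicit endpoints, one per etcd member
      - https://10.0.1.10:2379
      - https://10.0.2.10:2379
      - https://10.0.3.10:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

Every line of this file is now an artifact you own: if an etcd member moves, you update and redistribute it.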

HyperShift (hosted control planes): control planes as pods on a hosting cluster

You consolidate many control planes onto one management cluster.

Red Hat’s hosted control planes model runs control planes as pods on a management/hosting cluster, without dedicated VMs per control plane. 

HyperShift then introduces a new question: where do those control plane pods land? Docs show “shared everything” by default, and you can dedicate nodes for control plane workloads via labels/taints. 
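
Dedicating hosting-cluster nodes is ordinary Kubernetes scheduling: label and taint the nodes, and let control-plane pods tolerate the taint. A generic sketch of such a node (the label and taint keys here are illustrative; check your HyperShift version's documentation for the exact keys it expects):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-cp-1
  labels:
    hypershift.openshift.io/control-plane: "true"   # illustrative key
spec:
  taints:
    - key: hypershift.openshift.io/control-plane    # keeps ordinary workloads off
      value: "true"
      effect: NoSchedule
```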

Side-by-side: what gets simpler, what gets harder

Feature lists lie. Ownership and failure modes don’t.

| Model | What simplifies | What gets harder | The new “pager line” |
| --- | --- | --- | --- |
| EKS hosted control plane | Control plane HA, scaling, replacement; less etcd babysitting | Endpoint access + SG design for private clusters; version planning | “Can we reach the API endpoint from the right networks?” |
| kubeadm on EC2 | Full control; no managed constraints | Everything: HA wiring, etcd ops, upgrades, certs | “etcd is sick” is your incident |
| HyperShift | Reduce per-cluster control-plane VMs; faster cluster churn; multi-tenant mgmt | Hosting cluster becomes shared blast radius; two-layer debugging | “Hosting cluster health” pages everyone |

When a hosted control plane simplifies operations

Hosted control planes help when your bottleneck is “running too many control planes.”

1) You operate many clusters (multi-tenant SaaS, env sprawl)

Cluster count is the multiplier.

If you run 20+ clusters, self-managed control planes become a tax:

  • patch windows multiply
  • certificate and etcd risk multiplies
  • “one-off cluster drift” becomes normal

EKS removes the control plane instances from your fleet and gives you a standardized control plane architecture across AZs. 

HyperShift goes further: it removes dedicated control-plane machines per cluster and runs them as pods on a hosting cluster. 

2) You want predictable control-plane availability without building an etcd practice

etcd is not hard until it’s hard at 3 AM.

kubeadm HA docs are clear: external etcd adds configuration surface area (explicit endpoints); stacked etcd is simpler but still your operational problem. 

If your team doesn’t want to own etcd restores as a practiced drill, a hosted control plane removes that class of work from your team’s backlog.

3) You need fast cluster create/delete (ephemeral clusters, tenant clusters)

Provisioning speed is operational leverage.

HyperShift is designed around creating control planes as pods on a management cluster, which reduces the need to spin up dedicated control-plane machines per hosted cluster.

That’s useful when:

  • you create short-lived clusters for CI
  • you provision tenant clusters and churn them
  • you want cluster lifecycle to look like deploying an app

4) You’re private-cluster-heavy and want a supported endpoint model

Private changes the operational shape more than any “feature.”

EKS lets you run a private-only API server endpoint (no public access), where kubectl must come from within the VPC or connected networks. Access to the private endpoint is controlled by rules on the cluster security group. 

That’s not “simpler” in absolute terms. It’s simpler because it’s a supported, documented pattern with fewer moving parts than self-hosting your own API endpoint VIP/LB and cert story. 
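
In Terraform, the endpoint mode is a pair of flags on the cluster resource. A minimal sketch, assuming hypothetical names and variables (`endpoint_private_access` and `endpoint_public_access` are the real `aws_eks_cluster` arguments):

```hcl
resource "aws_eks_cluster" "private" {
  name     = "payments-prod"              # hypothetical
  role_arn = aws_iam_role.cluster.arn     # hypothetical

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true   # API reachable from inside the VPC
    endpoint_public_access  = false  # no internet-facing endpoint
  }
}
```

Flipping these two booleans is the whole "private cluster" decision at the IaC level; everything else is network plumbing around it.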

When a hosted control plane adds complexity

You trade “masters on VMs” for “network + IAM + shared blast radius.”

1) Control-plane connectivity becomes a first-class dependency

The API server is now “across a boundary,” and boundaries fail.

With EKS private-only clusters:

  • your kubectl, CI runners, and controllers must live inside the VPC or connected networks
  • your security group rules become part of cluster availability 

With public endpoint access, the default behavior has historically been public enabled / private disabled (and you can toggle both).  
Either way, endpoint mode is now a design choice you must document, test, and audit.

What changes for on-call:

  • “API is down” might really be “route to endpoint is broken”
  • DNS, TGW/peering, SG rules, and client network become suspects

2) Identity boundaries get sharper (and easier to misconfigure)

Hosted control planes push you into “who can reach what” decisions.

Private endpoint + security group control is good. It’s also easy to get wrong:

  • over-broad SG rules turn “private endpoint” into “private but reachable from everything”
  • too-tight rules break controllers and CI/CD in weird ways 

Hosted doesn’t remove IAM work. It moves it to the center of the blast radius.
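
Scoping the private endpoint correctly is one security group rule per trusted network, not a wildcard. A hedged sketch (IDs and CIDRs are hypothetical; the point is a named CIDR instead of `0.0.0.0/0`):

```hcl
resource "aws_security_group_rule" "api_from_ci" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["10.20.0.0/24"]           # CI runner subnet only
  security_group_id = var.cluster_security_group_id  # the EKS cluster SG
  description       = "kubectl/API access from CI runners"
}
```

One rule like this per network that legitimately needs the API (CI, admin VPN, in-cluster controllers) makes the access model auditable.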

3) HyperShift’s hosting cluster becomes shared infrastructure

You didn’t delete control planes. You consolidated them.

HyperShift runs control planes as pods on a hosting cluster.  
Docs show that hosted control plane pods can be scheduled broadly (“shared everything”), and you can taint/label nodes to dedicate capacity. 

This is the operational trade:

  • Pro: fewer dedicated control-plane machines per tenant cluster
  • Con: hosting cluster saturation, upgrades, or outages can hit multiple hosted clusters at once

If you adopt HyperShift, treat the hosting cluster like tier-0 infrastructure:

  • separate node pools
  • aggressive monitoring
  • strict change control
  • tested disaster recovery

4) Debug becomes two-layer

Symptoms show up in the guest cluster; root cause can live elsewhere.

With EKS, the control plane is managed. You troubleshoot via endpoint reachability, AWS telemetry, and cluster behavior. You can’t SSH into the masters, and that’s the point.

With HyperShift, you can often inspect control plane pods on the hosting cluster. That’s powerful, and it means your runbooks must cover two clusters:

  • guest cluster symptoms
  • hosting cluster root cause

Private clusters: the “hosted” decision that matters most

Private mode turns networking into part of the control plane.

EKS private endpoint: supported, but policy-heavy

SG rules are now part of cluster uptime.

AWS states that for private-only API servers:

  • there is no public access from the internet
  • kubectl must come from the VPC or connected network
  • cluster security group rules control private endpoint access 

This is clean if you already run:

  • TGW / VPC peering / Direct Connect
  • private DNS resolution patterns
  • locked-down egress

It’s messy if your ops tooling lives outside the network boundary and you aren’t ready to move it.

kubeadm private: you own the endpoint and its failure modes

You don’t get a managed endpoint; you build one.

kubeadm HA guides assume you configure a load balancer in front of the control plane nodes and wire up DNS names and endpoints. 

That’s flexible. It’s also more work:

  • API endpoint LB health checks
  • TLS/cert rotation
  • routing changes during upgrades
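
The "endpoint you build" is typically an internal NLB targeting every API server on 6443. A minimal Terraform sketch (names and variables are hypothetical; target attachments per control-plane instance are omitted):

```hcl
resource "aws_lb" "kube_api" {
  name               = "kube-api"
  internal           = true
  load_balancer_type = "network"
  subnets            = var.private_subnet_ids
}

resource "aws_lb_target_group" "kube_api" {
  name     = "kube-api"
  port     = 6443
  protocol = "TCP"
  vpc_id   = var.vpc_id

  health_check {
    protocol = "TCP"    # TCP check on 6443; HTTPS /healthz needs cert handling
    port     = "6443"
  }
}
```

This is exactly the layer EKS hides from you: health checks, target registration during upgrades, and the DNS name clients depend on.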

HyperShift private: you design exposure between hosting and guest clusters

Hosted control planes still need reachable endpoints.

Hosted control plane pods live on the hosting cluster. That’s good for consolidation. It also means you must design:

  • how guest nodes reach the hosted API server
  • how admins reach it (private networks, bastions, CI runners)
  • how you segment tenants

The exact networking patterns vary by environment, but the invariant is: private hosted control planes increase the importance of network design.

Terraform: what you actually manage in each model

IaC doesn’t disappear. The resource graph changes.

EKS Terraform surface area

You configure endpoint modes, SGs, node groups, and IAM.

Minimum Terraform concerns:

  • endpoint access mode (public/private/both)
  • cluster security group rules for private access 
  • node groups and AMI strategy
  • IRSA and IAM boundaries

Hosted control plane simplifies the “masters” part. It does not simplify the access-control part.

kubeadm Terraform surface area

Terraform becomes your control-plane installer, not just a cluster creator.

You end up managing:

  • control plane EC2 instances
  • LB/VIP in front of API servers (common HA pattern) 
  • etcd instances (external) or colocated etcd (stacked) 
  • bootstrap scripts, cert distribution, upgrade workflows

This can be clean if you have mature automation. If not, it’s a lot of state to keep consistent.

HyperShift Terraform surface area

You manage the hosting cluster like a platform, then declaratively create hosted clusters.

HyperShift adds:

  • hosting cluster lifecycle (upgrade, capacity, resilience)
  • hosted cluster objects and their infra mappings
  • scheduling policies for control plane pods (dedicated nodes via labels/taints) 

Terraform can drive parts of this, but you’ll also lean on cluster-native controllers.

Prometheus: what you need to watch so hosted doesn’t surprise you

Hosted control planes move failure modes. Your dashboards must follow.

At minimum, split monitoring into two planes:

  1. Workload plane (guest cluster apps)
     • request rates, latency, errors
     • node saturation
     • queue depth / retries
  2. Control plane
     • API server availability/latency from where your clients run
     • controller health signals
     • for HyperShift: hosting cluster resource pressure, because control planes are pods 
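
An alert on API server error rate is a reasonable starting point for the control-plane side. An illustrative Prometheus rule, assuming the default `apiserver_request_total` metric is being scraped (thresholds and labels are yours to tune):

```yaml
groups:
  - name: control-plane
    rules:
      - alert: APIServerHighErrorRate
        # fraction of API requests returning 5xx over 5 minutes
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
```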

For private clusters, add synthetic checks from the networks that matter:

  • from CI runner network
  • from admin network
  • from in-cluster controllers
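
The blackbox exporter is the usual tool for these synthetic checks, but even a tiny TCP probe run from each network beats nothing. A minimal Python sketch (the host and port are whatever your private endpoint resolves to from that network):

```python
import socket

def api_endpoint_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # DNS failure, connection refused, or timeout all count as unreachable
        return False
```

Run it from the CI network, the admin network, and inside the cluster; a probe that passes from only one of those is the "museum exhibit" scenario below.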

If the API endpoint is unreachable from your automation network, you don’t have a cluster. You have a museum exhibit.

Decision checklist for SaaS and platform teams

Answer these honestly and the right model usually falls out.

  1. How many clusters will you run in 12 months?
     If the number is growing fast, a hosted control plane saves toil.
  2. Do you have an etcd practice?
     If “restore drill” isn’t something you run quarterly, kubeadm HA is a risk trade.
  3. Is private-only mandatory?
     If yes, model endpoint reachability and SG rules as part of uptime.
  4. Can you tolerate shared blast radius?
     HyperShift consolidates control planes. Treat the hosting cluster as tier-0.
  5. What do you want to debug at 3 AM: VMs or networks?
     kubeadm tends toward VM-level debugging; hosted control planes tend toward network/identity debugging.

Where AceCloud fits

Hosted control plane only helps if the day-2 loop is owned and scripted.

If you’re buying hosted control plane benefits but don’t want to run the surrounding ops (endpoint policies, Terraform hygiene, Prometheus wiring, upgrade runbooks), a managed Kubernetes provider like AceCloud can own that platform loop while your team focuses on workload correctness and SLOs.

Bottom line

Hosted control plane is not “less complexity.” It’s different complexity.

Pick a hosted control plane (EKS) when you want AWS to own control plane HA, scaling, and replacement across AZs.  
Pick kubeadm when you need maximum control and you’re willing to own HA topology, etcd ops, and endpoint plumbing.  
Pick HyperShift when you need to run many clusters and you’re ready to operate a tier-0 hosting cluster that runs control planes as pods. 

The correct choice is the one that gives every failure mode a clear owner—and keeps your pager quiet for the right reasons.

 
