DEV Community: The Cyber Sidekick

Install Argo CD with Helm, and survive the missing-CRD trap (CKA)

The Cyber Sidekick — Wed, 22 Jul 2026 16:38:38 +0000

Install Argo CD with Helm (and survive the missing-CRD trap)

This exam question looks like three Helm commands: add a repository, render a template, install a release. And it hands you a comforting line: the Argo CD CRDs have already been pre-installed in the cluster. In this lab that line is false, on purpose, and what happens next teaches you exactly what crds.install=false really means. Let's run it.

🎥 Watch the video: https://www.youtube.com/watch?v=kYcHmIh_eHw

This is a CKA Cluster Architecture, Installation & Configuration walkthrough. Every command below is real output from a live cluster, and you can reproduce the whole thing yourself (scripts at the end).

The scenario

Here is the question. Add the official Argo CD Helm repository to the cluster under the name argo; the URL is given in the question, you never have to memorize repository links. The Argo CD CRDs, it says, have already been pre-installed. Generate a Helm template for release argocd, chart version 7.7.3, in the argocd namespace, save it to argo-helm.yaml, and configure the chart to not install the CRDs. Then install the release with the same name, version and namespace, again without CRDs. And one thing you do not have to do: configure access to the Argo CD server UI. Three commands, one premise. Remember the premise.

Add the official Argo CD Helm repo as 'argo' (URL is given in the question)
Premise: the Argo CD CRDs 'have already been pre-installed in the cluster'
helm template: release argocd, chart version 7.7.3, ns argocd -> argo-helm.yaml, NO CRDs
helm install: same release/version/namespace, NO CRDs; the server UI is out of scope

How the Helm flow works

Three Helm verbs do the work. helm repo add registers where charts come from, under a name you choose. helm template renders the chart into plain YAML on your machine; nothing touches the cluster, which is why the question can ask you to save the output to a file. helm install renders the chart the same way and submits the result to the cluster as a managed release. The flag that ties them together is --set crds.install=false. The argo-cd chart bundles the Argo CD custom resource definitions, and when a question says the CRDs are managed separately, your job is to keep the chart's hands off them, in the template and in the install. Keep the two commands identical apart from the verb; if they drift, the file you saved describes a different system than the one you deployed.
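
One way to keep the two commands from drifting is to define the shared values once and vary only the verb. A minimal sketch, assuming a POSIX shell; the helper name helm_cmd is hypothetical, and it only prints the commands rather than running them:

```shell
# Shared values for both verbs, so the saved render and the release cannot drift.
RELEASE=argocd
CHART=argo/argo-cd
HELM_FLAGS="--version 7.7.3 --namespace argocd --set crds.install=false"

# Print (not run) the Helm command for a verb: "template" or "install".
helm_cmd() {
  printf 'helm %s %s %s %s\n' "$1" "$RELEASE" "$CHART" "$HELM_FLAGS"
}

helm_cmd template   # when running it for real, redirect the output to argo-helm.yaml
helm_cmd install
```

The two printed lines differ only in the verb, which is the property the question is really testing.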

Add the argo repo

Step one. helm repo add argo, then the URL from the question. Helm fetches the repository index immediately, so a mistyped URL fails right here instead of at install time. helm repo list confirms the repository is registered under exactly the name the question asked for: argo.

$ helm repo add argo https://argoproj.github.io/argo-helm
"argo" has been added to your repositories

$ helm repo list
NAME        URL                                     
argo        https://argoproj.github.io/argo-helm

Render the template

Step two, the template. Release name argocd, chart argo/argo-cd, version 7.7.3 pinned exactly, namespace argocd, crds.install set to false, and the whole thing redirected into argo-helm.yaml. The command prints nothing because the entire render went into the file, and it is substantial: over three thousand lines of YAML. Here is the check that proves the flag worked: grep the file for kind CustomResourceDefinition. Zero. This chart normally ships three CRDs; with the flag set, the render contains none of them.

$ helm template argocd argo/argo-cd --version 7.7.3 --namespace argocd --set crds.install=false > argo-helm.yaml

$ wc -l argo-helm.yaml
3057 argo-helm.yaml

$ grep -c 'kind: CustomResourceDefinition' argo-helm.yaml
0

helm install

Step three is the same command with the verb swapped: helm install, same release name, same chart, same version, same namespace, same crds.install=false. Helm prints the release notes, and helm ls in the argocd namespace shows release argocd, revision one, status deployed, chart argo-cd 7.7.3, app version 2.13.0. As far as Helm is concerned, the job is done. The cluster is about to disagree.

$ helm install argocd argo/argo-cd --version 7.7.3 --namespace argocd --set crds.install=false
NAME: argocd
LAST DEPLOYED: Mon Jul 13 09:49:46 2026
NAMESPACE: argocd
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None

$ helm ls -n argocd
NAME    NAMESPACE   REVISION    UPDATED                                 STATUS      CHART           APP VERSION
argocd  argocd      1           2026-07-13 09:49:46.583903829 -0400 EDT deployed    argo-cd-7.7.3   v2.13.0

The crashloop

Give the pods a moment and look again. Redis and dex are up, but the server and the applicationset controller are already in a crash loop, and the application controller never goes ready; everything that watches Argo CD resources is failing. Pull the logs from the server and the last line says exactly why: level fatal, the server could not find the requested resource. That is a component asking the API server for a resource type that does not exist. We told the chart not to install the CRDs because the question said they were pre-installed. They are not. The premise was false, and Helm had no way to know.

$ kubectl get pods -n argocd
NAME                                               READY   STATUS             RESTARTS      AGE
argocd-application-controller-0                    0/1     Running            0             84s
argocd-applicationset-controller-bcdc99fcf-jdfj9   0/1     CrashLoopBackOff   3 (41s ago)   84s
argocd-dex-server-77f8fcf9d9-plbt8                 1/1     Running            0             84s
argocd-notifications-controller-7769fd5fd-rdqzh    1/1     Running            0             84s
argocd-redis-768545f6f-rjsgq                       1/1     Running            0             84s
argocd-repo-server-fd74968b8-dr772                 1/1     Running            0             84s
argocd-server-74b5fbf858-cn5t4                     0/1     CrashLoopBackOff   4 (4s ago)    84s

$ kubectl logs deploy/argocd-server -n argocd --tail=3
time="2026-07-13T13:51:10Z" level=info msg="Starting configmap/secret informers"
time="2026-07-13T13:51:11Z" level=info msg="Configmap/secret informer synced"
time="2026-07-13T13:51:11Z" level=fatal msg="the server could not find the requested resource (post appprojects.argoproj.io)"
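
That fatal line names the missing resource type in parentheses, so it can be pulled out mechanically. A small sketch, assuming the message keeps the "(verb resource)" shape seen here; the function name is hypothetical:

```shell
# Read component logs on stdin and print the resource type named in any
# "could not find the requested resource (VERB RESOURCE)" line.
missing_resource_from_logs() {
  sed -n 's/.*could not find the requested resource ([a-z]* \([^)]*\)).*/\1/p'
}

# Example with the fatal line from the server log:
echo 'level=fatal msg="the server could not find the requested resource (post appprojects.argoproj.io)"' \
  | missing_resource_from_logs
```

Against a live cluster the input would come from kubectl logs deploy/argocd-server -n argocd piped into the function.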

The missing CRDs

Confirm the diagnosis in two seconds: kubectl get crd, grep argoproj, nothing. So we install the CRDs ourselves, but pinned to the right version, because CRDs and controllers drift apart across releases. The chart's appVersion tells you which Argo CD this chart deploys: helm show chart says 7.7.3 ships version 2.13.0. Apply the three CRD manifests from the Argo CD repository at exactly that tag: applications, applicationsets, and appprojects. Now the same grep finds all three.

$ kubectl get crd | grep argoproj
No resources found

$ helm show chart argo/argo-cd --version 7.7.3 | grep -E '^(appVersion|version):'
appVersion: v2.13.0
version: 7.7.3

$ kubectl apply -f crds/
customresourcedefinition.apiextensions.k8s.io/applications.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/applicationsets.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/appprojects.argoproj.io created

$ kubectl get crd | grep argoproj
applications.argoproj.io      2026-07-13T13:51:16Z
applicationsets.argoproj.io   2026-07-13T13:51:16Z
appprojects.argoproj.io       2026-07-13T13:51:16Z
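
The lab applies a local crds/ directory. If you have to fetch the manifests yourself, the sketch below builds the download URLs from the appVersion tag. It assumes the upstream argoproj/argo-cd repository keeps the CRDs under manifests/crds/ at each tag; that layout is an assumption here, not something the lab shows, so check it before relying on it. The function name is hypothetical.

```shell
# Print the raw URLs of the three CRD manifests for a given Argo CD tag.
# ASSUMPTION: upstream layout is manifests/crds/NAME-crd.yaml at that tag.
argocd_crd_urls() {
  for name in application applicationset appproject; do
    echo "https://raw.githubusercontent.com/argoproj/argo-cd/$1/manifests/crds/${name}-crd.yaml"
  done
}

argocd_crd_urls v2.13.0
```

Pinning the tag to the chart's appVersion is the point: the CRDs and the controllers then come from the same Argo CD release.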

Restart + verify

The crashlooped pods would eventually recover on their own as the backoff retries, but do not sit and wait in an exam. One rollout restart of the Deployments in the namespace brings fresh pods up immediately, and this time every informer finds its CRDs. The application controller is a StatefulSet, so rollout restart deployment does not touch it; if it has not gone Ready on its own, restart it with kubectl -n argocd rollout restart statefulset. A minute later the whole namespace is Running and Ready. That is the full question: repository added as argo, template saved with zero CRDs in it, release installed at 7.7.3, and a healthy Argo CD. And notice what the fix was not: we never touched the Helm release. The release was correct; the environment broke the promise.

$ kubectl -n argocd rollout restart deployment
deployment.apps/argocd-applicationset-controller restarted
deployment.apps/argocd-dex-server restarted
deployment.apps/argocd-notifications-controller restarted
deployment.apps/argocd-redis restarted
deployment.apps/argocd-repo-server restarted
deployment.apps/argocd-server restarted

$ kubectl get pods -n argocd
NAME                                                READY   STATUS    RESTARTS   AGE
argocd-application-controller-0                     1/1     Running   0          11s
argocd-applicationset-controller-547b7778c8-psvtd   1/1     Running   0          58s
argocd-dex-server-65cd9bfbb8-ptbnj                  1/1     Running   0          58s
argocd-notifications-controller-76fcfb8864-6s7sb    1/1     Running   0          58s
argocd-redis-6d9cb5f875-6z8xw                       1/1     Running   0          58s
argocd-repo-server-87cf9c697-dcqzx                  1/1     Running   0          58s
argocd-server-6f7b69c655-wgrhn                      1/1     Running   0          58s
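
Because helm ls said deployed while pods were crashing, it helps to check readiness directly instead of scanning the table by eye. A minimal sketch that filters kubectl get pods output; the function name is hypothetical, and it treats any pod whose READY column is not n/n as not ready (which would also flag Completed job pods):

```shell
# Read `kubectl get pods` table output on stdin and print every pod whose
# READY column is not n/n (for example 0/1).
not_ready_pods() {
  awk 'NR > 1 { split($2, ready, "/"); if (ready[1] != ready[2]) print $1 }'
}

# Usage against a cluster (no output means everything is Ready):
#   kubectl get pods -n argocd | not_ready_pods
```

kubectl wait --for=condition=Ready pod --all -n argocd --timeout=120s is the built-in way to block until the same condition holds.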

Exam tips

A few traps. The repository URL and the chart version are printed in the question; copy them, pin --version exactly, and never guess. Template and install must carry identical values: same namespace, same --set crds.install=false, only the verb changes. When a question states a premise like CRDs are pre-installed, verify it; it costs two seconds: kubectl get crd, grep argoproj. And learn the signature: a controller crashlooping with the server could not find the requested resource almost always points to a missing CRD, and the chart's appVersion tells you exactly which CRD version to apply.

The repo URL + chart version are IN the question: copy them, pin --version exactly
template and install take identical values; only the verb changes
Premises are verifiable: 'CRDs pre-installed' -> kubectl get crd | grep argoproj (2 seconds)
'could not find the requested resource' = missing CRD; match it to the chart's appVersion
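
The premise check can be made explicit about which CRDs are absent. A minimal sketch, assuming the three CRD names this chart version relies on (applications, applicationsets, appprojects); the function name is hypothetical:

```shell
# Read `kubectl get crd -o name` output on stdin and print each Argo CD CRD
# that is absent. No output means the "pre-installed" premise holds.
missing_argocd_crds() {
  present=$(cat)
  for crd in applications applicationsets appprojects; do
    printf '%s\n' "$present" | grep -q "/${crd}\.argoproj\.io\$" \
      || echo "${crd}.argoproj.io"
  done
}

# Usage against a cluster, before running helm install:
#   kubectl get crd -o name | missing_argocd_crds
```

Three lines of output before the install is the cue to apply the CRDs first rather than debug a crash loop afterwards.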

Recap

helm repo add argo -> helm template -> helm install, pinned to 7.7.3, crds.install=false on both
helm ls said 'deployed'; the pods crashlooped: Helm status is not health
Missing argoproj CRDs -> 'could not find the requested resource'; apply CRDs at appVersion v2.13.0
rollout restart, all pods Running

Reproduce this yourself

The entire scenario is scripted on a throwaway kind cluster: https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios

git clone https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios.git
cd TCS_CKA_2026_Exam_Scenarios/learning/scenarios/scenario10-argocd-helm-install
./setup.sh        # creates the cluster AND arms the scenario
# solve it by hand, or:
./solution.sh     # apply the answer key and verify

If this helped, subscribe to The Cyber SideKick on YouTube for more CKA drills, and grab the newsletter at https://thecybersidekick.beehiiv.com.

Install Calico and prove NetworkPolicy enforcement (CKA Services & Networking)

The Cyber Sidekick — Wed, 22 Jul 2026 16:09:06 +0000

Install Calico and prove NetworkPolicy enforcement

The exam says install and configure a CNI of your choice, flannel or Calico, and then adds one line that makes the choice for you: it must support NetworkPolicy enforcement. flannel does not enforce policies, so this is a Calico question. Let's install it from the manifests, verify it, and prove a policy actually blocks traffic.

🎥 Watch the video: https://www.youtube.com/watch?v=7Eaf7Wogsa4

This is a CKA Services & Networking walkthrough. Every command below is real output from a live cluster, and you can reproduce the whole thing yourself (scripts at the end).

The scenario

Here is the setup. A fresh cluster has no CNI, so its nodes are NotReady and CoreDNS is Pending. The question offers flannel or Calico, says install from manifest files, do not use Helm, and lists three requirements: the CNI is properly installed and configured, pods can communicate with each other, and it supports NetworkPolicy enforcement. That last requirement is the trap. flannel gives you pod networking but silently ignores NetworkPolicy objects. Only Calico satisfies all three.

A fresh cluster with NO CNI: nodes NotReady, CoreDNS Pending
Install a CNI of your choice (flannel or Calico) from manifests, no Helm
Pods must communicate with each other, and the CNI must be properly installed
'Must support NetworkPolicy enforcement' => flannel is out, Calico is the answer

How the Calico operator install works

Calico's manifest install is operator based and comes in three parts. First the operator CRDs, then the Tigera operator itself, a deployment that knows how to install and manage Calico. Then you hand the operator an Installation resource describing the Calico you want, and it deploys everything into the calico-system namespace to match. Two things to know before you touch the keyboard. The Installation's IP pool cidr must agree with the cluster's pod CIDR, which is 192.168.0.0/16 here. And these manifests need kubectl create, not apply: client-side kubectl apply stores a copy of each object in its last-applied-configuration annotation, and the largest CRDs exceed the 256 KiB limit on annotations, so apply fails on them.

No CNI: nodes NotReady

Start by confirming the starting state. Every node is NotReady, the classic symptom of a missing CNI. And listing pods across all namespaces shows two things: CoreDNS is Pending because there is no pod network for it to join, and there are no calico or flannel pods anywhere, so no CNI is installed yet. This is a fresh cluster, and the network plugin is ours to install.

$ kubectl get nodes
NAME                          STATUS     ROLES           AGE   VERSION
cka-scenario9-control-plane   NotReady   control-plane   8s    v1.36.1
cka-scenario9-worker          NotReady   <none>          0s    v1.36.1

$ kubectl get pods -A
NAMESPACE            NAME                                                  READY   STATUS    RESTARTS   AGE
kube-system          coredns-589f44dc88-522x8                              0/1     Pending   0          1s
kube-system          coredns-589f44dc88-zsxh2                              0/1     Pending   0          1s
kube-system          etcd-cka-scenario9-control-plane                      0/1     Running   0          8s
kube-system          kube-apiserver-cka-scenario9-control-plane            1/1     Running   0          8s
kube-system          kube-controller-manager-cka-scenario9-control-plane   1/1     Running   0          8s
kube-system          kube-proxy-ds9pp                                      1/1     Running   0          1s
kube-system          kube-proxy-lwx6m                                      0/1     Pending   0          3s
kube-system          kube-scheduler-cka-scenario9-control-plane            0/1     Running   0          8s
local-path-storage   local-path-provisioner-855c7b7774-75cbk               0/1     Pending   0          1s

Create the CRDs + operator

Install the operator pieces with kubectl create. The first file registers the custom resource definitions, thirty-two of them. The second deploys the Tigera operator itself. Note the verb: create, not apply. Some of these CRDs are too large to fit in the last-applied-configuration annotation that kubectl apply attaches, so apply fails on this file, and the Calico docs say create for exactly that reason. A few seconds later the operator pod is Running in the tigera-operator namespace, waiting for us to tell it what to install.

$ kubectl create -f operator-crds.yaml
...
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipreservations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/kubecontrollersconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/stagedglobalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/stagedkubernetesnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/stagednetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/tiers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/adminnetworkpolicies.policy.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/baselineadminnetworkpolicies.policy.networking.k8s.io created

$ kubectl create -f tigera-operator.yaml
namespace/tigera-operator created
serviceaccount/tigera-operator created
clusterrole.rbac.authorization.k8s.io/tigera-operator-secrets created
clusterrole.rbac.authorization.k8s.io/tigera-operator created
clusterrolebinding.rbac.authorization.k8s.io/tigera-operator created
rolebinding.rbac.authorization.k8s.io/tigera-operator-secrets created
deployment.apps/tigera-operator created

$ kubectl get pods -n tigera-operator
NAME                               READY   STATUS    RESTARTS   AGE
tigera-operator-696d7c8fc4-59bf5   1/1     Running   0          4s
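
The limit behind create versus apply is Kubernetes' 262144-byte cap on an object's annotations. A rough sketch for spotting an oversized object in advance, assuming a multi-document YAML file separated by --- lines. It measures YAML text per document, while the annotation holds JSON, so treat the number as an estimate; the function name is hypothetical.

```shell
# Print the approximate size in bytes of the largest YAML document in a
# multi-document manifest, for comparison with the 262144-byte annotation cap.
largest_doc_bytes() {
  awk '
    /^---/ { if (n > max) max = n; n = 0; next }
           { n += length($0) + 1 }          # +1 for the newline
    END    { if (n > max) max = n; print max + 0 }
  ' "$1"
}

# Usage:
#   largest_doc_bytes operator-crds.yaml
```

A result near or above 262144 suggests client-side apply will be rejected for that object, which is when kubectl create is the safer verb.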

The Installation resource

Now describe the Calico you want. This Installation resource is the heart of the stock custom-resources file, and the field that matters is the IP pool cidr: 192.168.0.0/16, matching this cluster's pod CIDR. If those two disagree you get the same class of pain as a flannel CIDR mismatch, so always check before you create it. Create it and the operator takes over, pulling Calico up piece by piece. The tigerastatus resource is your progress bar: when every row reports Available True, the install is done.

$ cat custom-resources.yaml
# The Installation resource: the Tigera operator watches for this and deploys Calico
# to match it. The ipPool cidr MUST agree with the cluster's pod CIDR (192.168.0.0/16).
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - cidr: 192.168.0.0/16
        encapsulation: VXLANCrossSubnet

$ kubectl create -f custom-resources.yaml
installation.operator.tigera.io/default created

$ kubectl get tigerastatus
NAME      AVAILABLE   PROGRESSING   DEGRADED   SINCE
calico    True        False         False      0s
ippools   True        False         False      55s
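
The CIDR check is worth doing mechanically before you create the Installation. A minimal sketch, assuming a kubeadm-built cluster (kind is one) whose kubeadm-config ConfigMap carries podSubnet, and reading "agree" in the simplest sense of identical CIDRs; all three function names are hypothetical:

```shell
# Extract the pod subnet from kubeadm's ClusterConfiguration text (stdin).
cluster_pod_cidr() { sed -n 's/^ *podSubnet: *//p' | head -n 1; }

# Extract the first ipPools cidr from an Installation manifest (file argument).
pool_cidr() { sed -n 's/^[ -]*cidr: *//p' "$1" | head -n 1; }

# Compare the two values; returns non-zero on mismatch.
cidrs_agree() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "MISMATCH: cluster=$1 pool=$2"
    return 1
  fi
}

# Usage against a cluster:
#   cluster=$(kubectl -n kube-system get cm kubeadm-config \
#     -o jsonpath='{.data.ClusterConfiguration}' | cluster_pod_cidr)
#   cidrs_agree "$cluster" "$(pool_cidr custom-resources.yaml)"
```

If the two differ, edit the cidr in custom-resources.yaml before running kubectl create.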

Calico up, nodes Ready

Trust but verify. In calico-system there is a calico-node pod on every node (that is the dataplane that wires up pods and enforces policy), plus typha and the kube-controllers, all Running. And the payoff: both nodes have flipped to Ready, and CoreDNS is finally scheduled. The CNI is properly installed and configured, which was requirement one.

$ kubectl get pods -n calico-system -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP                NODE                          NOMINATED NODE   READINESS GATES
calico-kube-controllers-6c8496f5c8-flpxp   1/1     Running   0          63s   192.168.227.197   cka-scenario9-control-plane   <none>           <none>
calico-node-dpqqj                          1/1     Running   0          63s   172.18.0.15       cka-scenario9-worker          <none>           <none>
calico-node-pdgjt                          1/1     Running   0          63s   172.18.0.16       cka-scenario9-control-plane   <none>           <none>
calico-typha-6f8c54fdf6-mnvnn              1/1     Running   0          63s   172.18.0.15       cka-scenario9-worker          <none>           <none>
csi-node-driver-6mlr5                      2/2     Running   0          63s   192.168.227.194   cka-scenario9-control-plane   <none>           <none>
csi-node-driver-ntcxl                      2/2     Running   0          63s   192.168.178.65    cka-scenario9-worker          <none>           <none>

$ kubectl get nodes
NAME                          STATUS   ROLES           AGE   VERSION
cka-scenario9-control-plane   Ready    control-plane   82s   v1.36.1
cka-scenario9-worker          Ready    <none>          73s   v1.36.1

Prove pods can talk

Requirement two: pods must communicate with each other. Create two test pods pinned to different nodes and list them wide: each gets an IP from 192.168.0.0/16 on a different host. Then exec into test one and ping test two's pod IP: replies come back across the nodes, so pod-to-pod networking works end to end. One practical note: ping the IP, not the pod name. Bare pods do not get DNS records, so a name lookup fails with bad address even on a perfectly healthy network.

$ kubectl apply -f connectivity-test.yaml
pod/test1 created
pod/test2 created

$ kubectl get pods -o wide -l app=conn-test
NAME    READY   STATUS    RESTARTS   AGE   IP                NODE                          NOMINATED NODE   READINESS GATES
test1   1/1     Running   0          3s    192.168.178.66    cka-scenario9-worker          <none>           <none>
test2   1/1     Running   0          3s    192.168.227.198   cka-scenario9-control-plane   <none>           <none>

$ kubectl exec test1 -- ping -c 3 <test2-ip>
PING 192.168.227.198 (192.168.227.198): 56 data bytes
64 bytes from 192.168.227.198: seq=0 ttl=62 time=0.351 ms
64 bytes from 192.168.227.198: seq=1 ttl=62 time=0.091 ms
64 bytes from 192.168.227.198: seq=2 ttl=62 time=0.100 ms

--- 192.168.227.198 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.091/0.180/0.351 ms
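
To confirm that a pod IP really comes from the Calico pool rather than eyeballing it, the containment test can be done with integer arithmetic. A minimal sketch for IPv4 only, assuming a POSIX sh or bash with 64-bit arithmetic; both function names are hypothetical:

```shell
# Convert a dotted-quad IPv4 address to an integer (runs in a subshell so the
# IFS change does not leak into the caller).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if IPv4 address $1 lies inside CIDR $2 (for example 192.168.0.0/16).
ip_in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)

ip_in_cidr 192.168.178.66 192.168.0.0/16 && echo "test1 is in the pool"
ip_in_cidr 172.18.0.15 192.168.0.0/16 || echo "172.18.0.15 is a host-network address, not a pool address"
```

Both messages print: the test pod's address is inside the pool, while the 172.18.x.x addresses shown for host-networked pods are not.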

Prove policy enforcement

Requirement three is the one flannel cannot do: NetworkPolicy enforcement. This default-deny policy comes straight from the Kubernetes docs. The empty podSelector selects every pod in the default namespace, and listing both Ingress and Egress with no rules denies all traffic in both directions. Apply it, confirm it exists, and run the exact same ping again. Three packets, zero replies, one hundred percent loss. The policy is not just stored in the API; Calico is enforcing it on the wire. That failing ping is the proof the question asked for.

$ cat deny-all.yaml
# Default-deny-all, straight from the Kubernetes docs: the empty podSelector selects
# EVERY pod in the namespace; both policyTypes with no rules = deny all traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

$ kubectl apply -f deny-all.yaml
networkpolicy.networking.k8s.io/default-deny-all created

$ kubectl get networkpolicy
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         0s

$ kubectl exec test1 -- ping -c 3 -w 5 <test2-ip>
PING 192.168.227.198 (192.168.227.198): 56 data bytes

--- 192.168.227.198 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1
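
The before-and-after comparison can be reduced to two numbers. A minimal sketch that extracts the packet-loss percentage from ping output in the format shown here and checks that it went from 0 to 100; both function names are hypothetical:

```shell
# Read ping output on stdin and print the packet-loss percentage as a number.
packet_loss() {
  sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Succeed only if loss was 0% before the policy ($1) and 100% after it ($2).
enforcement_proven() {
  [ "$1" -eq 0 ] && [ "$2" -eq 100 ]
}

# Usage against a cluster, with TEST2_IP set to test2's pod IP:
#   before=$(kubectl exec test1 -- ping -c 3 "$TEST2_IP" | packet_loss)
#   kubectl apply -f deny-all.yaml
#   after=$(kubectl exec test1 -- ping -c 3 -w 5 "$TEST2_IP" | packet_loss)
#   enforcement_proven "$before" "$after" && echo "policy is enforced"
```

Capturing the "before" value matters: a ping that fails both times proves a broken network, not an enforced policy.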

Exam tips

A few traps. When a question says the CNI must support NetworkPolicy enforcement, that one phrase eliminates flannel; pick Calico and move on. Use kubectl create for the Calico operator manifests, because apply chokes on the big CRD file. Check that the Installation's IP pool cidr matches the cluster pod CIDR before you create it, then watch kubectl get tigerastatus until everything is Available. And verification means behavior, not status: prove the ping works, then prove a default-deny breaks the same ping.

'Must support NetworkPolicy enforcement' => flannel is out, use Calico
kubectl create -f (not apply) for the Calico manifests: the CRD file is too big for apply
Installation ipPool cidr must match the cluster pod CIDR; watch 'kubectl get tigerastatus'
Verify behavior: ping works, then a default-deny makes the same ping fail

Recap

NetworkPolicy enforcement required => Calico, installed with kubectl create (no Helm)
operator CRDs -> Tigera operator -> Installation (ipPool = cluster pod CIDR)
tigerastatus Available, nodes Ready, cross-node ping works
default-deny => the same ping fails: enforcement proven

Reproduce this yourself

The entire scenario is scripted on a throwaway kind cluster: https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios

git clone https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios.git
cd TCS_CKA_2026_Exam_Scenarios/learning/scenarios/scenario9-calico-network-policy
./setup.sh        # creates the cluster AND arms the scenario
# solve it by hand, or:
./solution.sh     # apply the answer key and verify


Self-Hosting LLMs on Kubernetes: When vLLM Beats Managed APIs on Cost

The Cyber Sidekick — Sat, 18 Jul 2026 19:24:13 +0000

A practitioner's cost-benefit analysis of vLLM on Kubernetes versus OpenAI and other managed inference APIs for high-volume LLM workloads.

Organizations running high-volume LLM inference can reduce per-token costs by 60-80% by self-hosting with vLLM on Kubernetes, but the economics only work after solving GPU scheduling, model serving, and operational complexity. This article walks platform engineers through the breakeven analysis, infrastructure architecture, and operational tooling required to make self-hosted inference viable in production.

The Economics: Where Self-Hosted Inference Wins

The LLM inference market is splitting into two camps: managed API providers like OpenAI, Anthropic, and Google Vertex AI charging per-token premiums, and self-hosted inference stacks on Kubernetes that amortize GPU costs across high request volumes. The crossover point sits at roughly 10 to 20 million tokens per day per model, where reserved A100 or H100 instances typically break even against OpenAI API pricing within 3 to 6 months. That window shrinks further when you factor in quantization: AWQ and GPTQ reduce model memory footprint by 2 to 4x, letting you serve more concurrent requests from the same GPU, which directly compresses the breakeven timeline. The economic case is also being accelerated by inference-time compute scaling, where o1-style chain-of-thought reasoning dramatically inflates output token volumes, making per-token API billing increasingly untenable for high-throughput production applications. Open-weight models like Llama 3.1, Mistral, Qwen2, and Gemma 2 have meanwhile closed the capability gap with proprietary APIs for many enterprise use cases, removing the last justification for paying the managed API premium when volume is high enough.
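
The breakeven claim reduces to simple arithmetic: months to breakeven equals the up-front investment divided by the monthly saving, where the saving is the monthly API bill minus the monthly self-hosting cost. A minimal sketch with awk; every number in the example call is hypothetical and chosen only to make the arithmetic easy to follow, not taken from any provider's price list:

```shell
# Months to recover an up-front investment from monthly savings.
# Args: upfront_usd  self_hosted_usd_per_month  tokens_per_day  api_usd_per_million_tokens
breakeven_months() {
  awk -v up="$1" -v self="$2" -v tpd="$3" -v price="$4" 'BEGIN {
    api = tpd * 30 / 1000000 * price      # monthly API bill at this volume
    saving = api - self                   # monthly saving from self-hosting
    if (saving <= 0) { print "never"; exit }
    printf "%.1f\n", up / saving
  }'
}

# Hypothetical inputs: 16000 USD up front, 2000 USD/month for GPUs,
# 20 million tokens/day, 10 USD per million tokens on the API.
breakeven_months 16000 2000 20000000 10
```

With these inputs the API bill is 6000 USD a month, the saving is 4000 USD a month, and the function prints 4.0. Plugging in your own volume and prices shows quickly whether you are anywhere near the crossover.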

vLLM and Kubernetes: The Infrastructure Stack

vLLM, which originated at UC Berkeley's Sky Computing Lab, implements PagedAttention, a memory management algorithm inspired by OS virtual memory paging that achieves near-zero KV cache memory waste and delivers up to 24x higher throughput than naive HuggingFace Transformers serving. Its continuous batching keeps GPU utilization above 80% under sustained load, compared to static batching, which frequently yields sub-40% utilization. By the 0.4 release line vLLM offered production-grade features including OpenAI-compatible REST endpoints, speculative decoding, chunked prefill, and multi-LoRA serving that make it a viable drop-in API replacement. On the Kubernetes side, the NVIDIA GPU Operator automates GPU driver installation, device plugin deployment, and MIG partitioning, with H100 MIG allowing a single 80GB GPU to be sliced into up to 7 isolated instances so smaller 7B parameter models can be scheduled on fractional GPU resources alongside larger workloads. For multi-node tensor parallelism, platform teams are choosing between KubeRay with RayServe and native vLLM Kubernetes deployments, with KubeRay offering richer pipeline parallelism and autoscaling primitives while native vLLM deployments reduce operational surface area. Karpenter handles GPU node autoscaling, and spot instance availability for H100 NVLink, AMD MI300X, and Google TPU v5 hardware is continuing to lower amortized cost per token on major cloud providers.

Operational Complexity: What You Are Actually Signing Up For

The 60-80% cost reduction is real, but it comes with an operational contract that managed APIs abstract away entirely, and platform teams need to account for that engineering investment before committing. Production vLLM deployments require Prometheus and OpenTelemetry instrumentation at the token level to surface queue depth, time-to-first-token, and inter-token latency metrics, and autoscaling policies need to be built around queue depth rather than the CPU and memory signals that Kubernetes HPA uses by default. Model version management requires Argo Rollouts or equivalent canary tooling to safely promote new model weights or LoRA adapters without dropping traffic, and multi-LoRA hot-swapping for fine-tuned model variants adds another layer of complexity around adapter lifecycle management. Multi-tenant vLLM deployments in internal LLM platforms need namespace isolation, per-team rate limiting, and chargeback instrumentation so that cost savings are actually attributed and not just absorbed into platform overhead. Teams that underestimate this operational complexity often find that the first three months of self-hosting are net-negative on engineering productivity, which is why the 3 to 6 month breakeven estimate should be treated as a floor, not a guarantee, for teams without prior GPU infrastructure experience.
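
Scaling on queue depth uses the same rule the Horizontal Pod Autoscaler applies to any metric, documented by Kubernetes as desiredReplicas = ceil(currentReplicas x currentMetricValue / desiredMetricValue). A minimal sketch of that arithmetic, ignoring the HPA's tolerance band and stabilization windows. It assumes the per-pod queue depth comes from a gauge such as vllm:num_requests_waiting, if your vLLM version exports it; the function name and the example numbers are hypothetical.

```shell
# HPA rule: desired = ceil(current_replicas * current_metric / target_metric)
# Args: current_replicas  current_queue_depth_per_pod  target_queue_depth_per_pod
desired_replicas() {
  awk -v r="$1" -v cur="$2" -v target="$3" 'BEGIN {
    d = r * cur / target
    n = int(d); if (n < d) n++            # ceil for positive values
    print n
  }'
}

# Hypothetical: 2 pods, 30 waiting requests each, target of 10 per pod.
desired_replicas 2 30 10
```

The call prints 6. A queue three times deeper than the target triples the replica count, which is the behaviour CPU-based scaling misses when the GPU is saturated but the CPU is idle.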

Conclusion

Self-hosted vLLM on Kubernetes is genuinely the right choice for organizations running sustained high-volume LLM inference, but the decision should be driven by a clear-eyed token volume analysis rather than enthusiasm for infrastructure ownership. At 10 to 20 million tokens per day per model, the economics are compelling and the tooling ecosystem around GPU Operator, KubeRay, Karpenter, and Prometheus has matured enough to make production deployments tractable for experienced platform teams. Looking ahead, inference-time compute scaling will continue to inflate token volumes across the industry, which will push more organizations past the breakeven threshold faster than they expect. Hardware commoditization through spot H100 and AMD MI300X availability will further compress managed API margins, making the self-hosted case stronger over the next 12 to 18 months. Platform teams that invest now in building internal LLM inference platforms on Kubernetes, with solid observability, autoscaling, and model lifecycle tooling, will be positioned to serve inference demand at a cost structure that managed APIs structurally cannot match at scale.

Technologies covered: vLLM, Kubernetes, GPU resource management, container orchestration, inference optimization

Sources aggregated from: CNCF Blog, Kubernetes.io, DevOps Weekly


Platform Engineering for AI-Native Workloads: Managing Cognitive Load at Scale

The Cyber Sidekick — Mon, 13 Jul 2026 13:17:53 +0000

How platform teams can architect internal developer platforms optimized for GPU scheduling, model serving, and experiment tracking without overwhelming ML engineers.

AI workloads are multiplying exponentially, yet fewer than 30% of organizations have extended their internal developer platforms to natively support GPU workloads and ML pipelines, according to Puppet's 2024 State of Platform Engineering report. Platform teams that close this gap by layering AI-native abstractions atop Kubernetes can dramatically reduce cognitive load for ML engineers while containing runaway GPU costs through intelligent resource orchestration.

The Abstraction Gap Undermining AI Platform Maturity

Kubernetes was architected for stateless microservices, and stretching it to accommodate high-memory-bandwidth training jobs, long-running batch workloads, and sub-100ms inference SLAs exposes serious abstraction gaps that raw kubectl access cannot paper over. The result is a tax on ML engineers who must simultaneously master Kubernetes primitives, GPU driver nuances, and distributed training frameworks before writing a single line of model code. Purpose-built control planes including Run:ai, Volcano, and Loft's vCluster are gaining traction precisely because they sit atop Kubernetes and expose ML-specific primitives, such as experiment tracking dashboards, model registries, and GPU quota views, shielding practitioners from infrastructure complexity. The 87% of organizations with mature platform engineering practices that report measurably reduced developer cognitive load have one thing in common: they treat the platform as a product with well-defined, opinionated abstractions rather than as a collection of loosely integrated open-source tools.

GPU Orchestration and Multi-Tenancy Through MIG and KubeRay

Granular GPU resource isolation is now achievable without whole-GPU allocation, and platform teams that ignore this capability are leaving significant efficiency gains on the table. NVIDIA's MIG Manager within the GPU Operator allows a single A100 to be partitioned into up to seven isolated instances, enabling Kubernetes resource quotas as specific as 1g.10gb, which translates to up to 40% less GPU idle time in multi-tenant inference clusters compared to whole-GPU scheduling. For distributed training and online inference, KubeRay has emerged as the most compelling unified compute layer, with adoption growing over 300% year-over-year in 2023 and 2024 based on GitHub stars and Helm chart downloads. Organizations deploying RayService for model inference via KubeRay's operator-based CRDs report sub-100ms p99 latency at scales exceeding 10,000 requests per second through Kubernetes-native horizontal autoscaling, making it a credible alternative to purpose-built inference servers for teams already invested in the Ray ecosystem.

GitOps, Observability, and FinOps as First-Class Platform Concerns

Bringing software engineering discipline to the ML lifecycle requires treating model weights, feature stores, and evaluation datasets with the same versioning rigor applied to application code, and GitOps-driven workflows through ArgoCD and Kubeflow Pipelines v2 with an Argo Workflows backend are making this operationally tractable at scale. Service mesh capabilities via Istio extend this discipline into inference traffic management, enabling weighted routing for shadow deployments and header-based routing for model version targeting, which gives platform teams a safe mechanism for canary model promotions without custom networking code. Observability remains a critical and underinvested area, with leading teams instrumenting ML pipelines through OpenTelemetry, Prometheus custom metrics, and distributed tracing via Tempo to correlate model performance degradation with infrastructure-level anomalies in a single unified trace. On the cost side, GPU spend now dominates cloud bills for AI-heavy organizations, making spot-instance-aware schedulers, idle GPU detection via Prometheus alerting, and per-team chargeback dashboards in Grafana not optional enhancements but core platform features that directly influence engineering budget conversations.

Conclusion

The platform engineering teams that will define the next generation of AI infrastructure are those treating AI-native workloads not as an edge case bolted onto an existing IDP but as the primary design constraint for every abstraction they build. The convergence of MLOps tooling with traditional platform engineering practices is accelerating, and the organizations moving fastest are the ones investing simultaneously in GPU resource isolation through MIG partitioning, unified compute layers like KubeRay, GitOps-native model promotion pipelines, and FinOps visibility that holds teams accountable for GPU utilization. As foundation model sizes grow and inference latency budgets tighten, the pressure on platform teams to deliver self-service ML infrastructure without cognitive overload will only intensify, making purpose-built AI platform abstractions one of the highest-leverage bets an engineering organization can make in the next 18 months.

Technologies covered: Kubernetes GPU scheduling and resource quotas, Ray and Kubeflow for distributed ML, Service mesh (Istio) for model inference routing, ArgoCD for MLOps GitOps, Observability stacks (Prometheus, Grafana, Tempo) for ML pipeline tracing, Containerization and OCI standards

Sources aggregated from: CNCF Blog, Kubernetes.io, DevOps Weekly

GhostLock (CVE-2026-43499): How a Linux Kernel Privilege Escalation Exposes Kubernetes Multi-Tenant Security Gaps

The Cyber Sidekick — Mon, 13 Jul 2026 12:44:15 +0000

A newly disclosed kernel vulnerability forces a hard reassessment of the container isolation assumptions underpinning multi-tenant Kubernetes clusters.

CVE-2026-43499, dubbed GhostLock, is a Linux kernel privilege escalation vulnerability that allows unprivileged processes to gain root-level access by exploiting flaws in kernel subsystems, effectively collapsing the security boundary between a container and its host node. Kubernetes operators running multi-tenant clusters are acutely exposed because Kubernetes-native controls like Pod Security Admission, seccomp profiles, and AppArmor all operate above the kernel layer and cannot compensate for an unpatched host kernel.

Why Kernel Privilege Escalation Hits Kubernetes Differently

Kubernetes delegates container isolation to Linux kernel primitives: namespaces partition process visibility, cgroups enforce resource boundaries, and capabilities constrain privileged operations. When a vulnerability like GhostLock allows an attacker inside a container to escalate to root on the host kernel, every one of those primitives becomes irrelevant. In a multi-tenant cluster where a single node may run hundreds of pods across different trust boundaries, a single exploitable container can become a foothold into the entire node and, from there, into cluster control plane credentials mounted via service account tokens. The Linux kernel averaged between 1,800 and 2,000 CVEs per year from 2020 to 2023 according to NVD data, with privilege escalation categories consistently representing the highest-severity subset, yet enterprise cluster upgrade cycles routinely lag behind kernel patch cadences, leaving nodes exposed for weeks or months at a time.

Where Kubernetes Hardening Falls Short Against Kernel CVEs

The removal of PodSecurityPolicy in Kubernetes 1.25 (it had been deprecated since 1.21) and its replacement with Pod Security Admission has left many teams with coarser enforcement granularity, particularly around syscall restrictions. According to the 2023 CNCF Security Report, over 60 percent of production Kubernetes clusters surveyed did not enforce seccomp profiles by default, meaning the syscall-level attack surface that GhostLock targets is fully exposed in the majority of real-world deployments. Aqua Security research further identified that container escape techniques leveraging host kernel vulnerabilities represent a significant portion of realistic attack paths, with privileged pods and hostPath mounts acting as common amplifying misconfigurations that lower the bar for exploitation. Seccomp, surfaced through Kubernetes securityContext, can reduce the exploitable syscall surface, but it requires deliberate, per-workload policy authoring that most teams have not yet operationalized at scale.

Detection, Mitigation, and Stronger Isolation Primitives

Practitioners responding to GhostLock have three complementary mitigation layers available today. First, prioritize emergency node patching or, better, replace nodes entirely using immutable image-based operating systems like Bottlerocket or Flatcar Linux, where full node replacement is faster and more automated than in-place patching, directly closing the kernel patch lag window. Second, deploy eBPF-based runtime security tooling such as the CNCF Falco project, which instruments the kernel to detect anomalous syscall patterns consistent with privilege escalation attempts, providing detection coverage while patching cycles complete. Third, for workloads with the highest trust sensitivity, adopt a sandboxed runtime like gVisor, which intercepts application syscalls in a user-space kernel, dramatically reducing the exposed host kernel attack surface and making kernel CVEs like GhostLock far harder to reach from sandboxed workloads. Beyond these three layers, confidential computing approaches using AMD SEV or Intel TDX provide hardware-enforced memory isolation that can further constrain what a kernel-level attacker can observe or modify across tenant boundaries.

Conclusion

GhostLock is not an anomaly; it is a predictable entry in a long series of kernel privilege escalation vulnerabilities that will continue to challenge the container isolation model Kubernetes depends on. The fundamental tension is that Kubernetes security controls are policy abstractions layered on top of a shared kernel, and no amount of policy sophistication fully compensates for an unpatched vulnerability in that shared kernel. The industry trajectory toward immutable node infrastructure, eBPF-based runtime observability, and hardware-enforced isolation through confidential computing represents the correct long-term response, moving isolation guarantees progressively closer to hardware and further from software policies that can be bypassed. In the near term, operators should treat kernel CVE patching with the same urgency as control plane upgrades, enforce seccomp profiles broadly using the RuntimeDefault baseline as a starting point, and audit clusters for privileged pods and hostPath mounts that would amplify any successful kernel exploit into a full cluster compromise.
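
The audit for privileged pods can start from a single kubectl query. What follows is a minimal sketch, assuming kubectl access that can list pods in every namespace; `list_privileged_pods` is a hypothetical name, the check only looks at `spec.containers` (not init or ephemeral containers), and it does not cover hostPath volumes.

```shell
#!/bin/sh
# list_privileged_pods: hypothetical helper, not a kubectl built-in.
# Prints namespace/name for every pod with at least one container whose
# securityContext.privileged is true.
list_privileged_pods() {
  # One line per pod: "namespace/name<TAB>privileged flags of its containers".
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].securityContext.privileged}{"\n"}{end}' |
    # Keep only pods where some container reported "true".
    awk -F '\t' '$2 ~ /true/ { print $1 }'
}
```

Calling `list_privileged_pods` prints one pod per line; an empty result means no regular container in the cluster asks for privileged mode, which still leaves init containers and hostPath mounts to review separately.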

Technologies covered: Linux kernel, Kubernetes, container security, privilege escalation, pod security policies, seccomp

Sources aggregated from: CNCF Blog, Kubernetes.io, DevOps Weekly, Hacker News, InfoQ

2026 CKA Exam - Scenario 8 Install a CNI and fix the flannel pod-CIDR mismatch (CKA Services & Networking)

The Cyber Sidekick — Fri, 10 Jul 2026 13:02:27 +0000

Install a CNI and fix the pod-CIDR mismatch

The cluster is running, but every node is NotReady, because no network plugin is installed. The exam gives you a flannel manifest and asks you to install and configure a CNI. Let's apply it, watch it fail, find out why, and fix it.

🎥 Watch the video: https://www.youtube.com/watch?v=wzFE66kfRl4

This is a CKA Services & Networking walkthrough. Every command below is real output from a live cluster, and you can reproduce the whole thing yourself (scripts at the end).

The scenario

Here is the setup. A fresh cluster has no CNI, so its nodes are NotReady and CoreDNS is stuck Pending. You are handed a flannel manifest and told to install a network plugin of your choice. The catch, which you only discover after applying, is that the manifest's network does not match the cluster's pod CIDR.

A fresh cluster with NO CNI: nodes NotReady, CoreDNS Pending
You're handed a flannel manifest and told to install a CNI
The manifest's Network (10.244.0.0/16) != the cluster pod CIDR (192.168.0.0/16)
Install it, fix the mismatch, and prove pods can talk

How flannel and CNIs work

A CNI plugin is what gives pods IP addresses and wires up pod-to-pod routing; without one the kubelet reports the node NotReady. flannel runs as a DaemonSet, one pod per node, and reads its settings from a ConfigMap called kube-flannel-cfg. The important field is net-conf.json Network, the address space flannel hands out. Because flannel runs with kube-subnet-mgr, it also reads each node's assigned podCIDR, and it refuses to start if that podCIDR is not inside its configured Network. Match those two and it works.
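
The containment rule is simple enough to check by hand before you apply anything. The sketch below is a hypothetical helper, not part of flannel or kubectl: `ip_to_int` and `cidr_contains` are names made up for illustration, it handles IPv4 only, and it assumes well-formed input.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of flannel or kubectl.
# IPv4 only, no input validation.

# ip_to_int A.B.C.D: print the address as one 32-bit integer.
ip_to_int() {
  (
    # Split on dots; the subshell keeps IFS and "set --" local.
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  )
}

# cidr_contains OUTER INNER: succeed if INNER lies entirely inside OUTER.
cidr_contains() {
  outer_ip=${1%/*}; outer_len=${1#*/}
  inner_ip=${2%/*}; inner_len=${2#*/}
  # A shorter (wider) prefix cannot fit inside a longer (narrower) one.
  [ "$inner_len" -ge "$outer_len" ] || return 1
  if [ "$outer_len" -eq 0 ]; then
    mask=0
  else
    mask=$(( (0xFFFFFFFF << (32 - $outer_len)) & 0xFFFFFFFF ))
  fi
  # Same network bits under OUTER's mask means INNER is inside OUTER.
  [ $(( $(ip_to_int "$outer_ip") & mask )) -eq $(( $(ip_to_int "$inner_ip") & mask )) ]
}

# The two Network values from this scenario against one node podCIDR:
for network in 10.244.0.0/16 192.168.0.0/16; do
  if cidr_contains "$network" 192.168.0.0/24; then
    echo "$network contains 192.168.0.0/24"
  else
    echo "$network does NOT contain 192.168.0.0/24"
  fi
done
```

Run as written, the loop reports that 10.244.0.0/16 does NOT contain 192.168.0.0/24 and that 192.168.0.0/16 does, which is the same comparison flannel makes at startup.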

No CNI: nodes NotReady

Start by confirming the starting state. Every node is NotReady, the classic symptom of a missing CNI. And in kube-system, CoreDNS is Pending because it has no pod network to join yet. Everything else the control plane needs is up, so this really is just the network plugin that's missing.

$ kubectl get nodes
NAME                          STATUS     ROLES           AGE   VERSION
cka-scenario8-control-plane   NotReady   control-plane   8d    v1.36.1
cka-scenario8-worker          NotReady   <none>          8d    v1.36.1

$ kubectl get pods -n kube-system -o wide
NAME                                                  READY   STATUS    RESTARTS   AGE   IP            NODE                          NOMINATED NODE   READINESS GATES
coredns-589f44dc88-8m6kd                              1/1     Running   0          8d    192.168.1.4   cka-scenario8-worker          <none>           <none>
coredns-589f44dc88-shgqb                              1/1     Running   0          8d    192.168.1.3   cka-scenario8-worker          <none>           <none>
etcd-cka-scenario8-control-plane                      1/1     Running   0          8d    172.18.0.13   cka-scenario8-control-plane   <none>           <none>
kube-apiserver-cka-scenario8-control-plane            1/1     Running   0          8d    172.18.0.13   cka-scenario8-control-plane   <none>           <none>
kube-controller-manager-cka-scenario8-control-plane   1/1     Running   0          8d    172.18.0.13   cka-scenario8-control-plane   <none>           <none>
kube-proxy-gkw9p                                      1/1     Running   0          8d    172.18.0.14   cka-scenario8-worker          <none>           <none>
kube-proxy-nxbwr                                      1/1     Running   0          8d    172.18.0.13   cka-scenario8-control-plane   <none>           <none>
kube-scheduler-cka-scenario8-control-plane            1/1     Running   0          8d    172.18.0.13   cka-scenario8-control-plane   <none>           <none>

Apply flannel (it CrashLoops)

Now install flannel by applying the manifest. It creates the kube-flannel namespace, the ConfigMap, and the DaemonSet. But when you look at the pods a few seconds later, they are not Running: they're in CrashLoopBackOff or Error. Applying the manifest was necessary, but on this cluster it is not sufficient.

$ kubectl apply -f kube-flannel.yml
namespace/kube-flannel created
serviceaccount/flannel created
clusterrole.rbac.authorization.k8s.io/flannel unchanged
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created

$ kubectl -n kube-flannel get pods -o wide
NAME                    READY   STATUS   RESTARTS     AGE   IP            NODE                          NOMINATED NODE   READINESS GATES
kube-flannel-ds-6mhjm   0/1     Error    1 (9s ago)   12s   172.18.0.13   cka-scenario8-control-plane   <none>           <none>
kube-flannel-ds-jq4n7   0/1     Error    1 (9s ago)   12s   172.18.0.14   cka-scenario8-worker          <none>           <none>

Read the lease error

Don't guess, read the logs. The flannel pod says it failed to acquire a lease, because the node's pod subnet, a slice of 192.168.0.0/16, is not inside the flannel net configuration of 10.244.0.0/16. That's the whole problem in one line: the manifest ships a default Network that does not match how this cluster was built. The cluster's pod CIDR is authoritative, so flannel is the thing that has to change.

$ kubectl -n kube-flannel logs <flannel-pod>
...
I0710 11:54:49.724901       1 kube.go:163] Node controller sync successful
I0710 11:54:49.724926       1 main.go:252] Created subnet manager: Kubernetes Subnet Manager - cka-scenario8-control-plane
I0710 11:54:49.724929       1 main.go:255] Installing signal handlers
I0710 11:54:49.725064       1 main.go:534] Found network config - Backend type: vxlan
I0710 11:54:49.726846       1 kube.go:737] List of node(cka-scenario8-control-plane) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"76:f3:d3:22:ef:55\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"172.18.0.13", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0710 11:54:49.726881       1 match.go:211] Determining IP address of default interface
I0710 11:54:49.727126       1 match.go:269] Using interface with name eth0 and address 172.18.0.13
I0710 11:54:49.727151       1 match.go:291] Defaulting external address to interface address (172.18.0.13)
I0710 11:54:49.727200       1 vxlan.go:128] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0710 11:54:49.728578       1 kube.go:704] List of node(cka-scenario8-control-plane) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"76:f3:d3:22:ef:55\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"172.18.0.13", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0710 11:54:49.728602       1 vxlan.go:199] Interface flannel.1 mac address set to: 76:f3:d3:22:ef:55
E0710 11:54:49.729096       1 main.go:381] Error registering network: failed to acquire lease: subnet "10.244.0.0/16" specified in the flannel net config doesn't contain "192.168.0.0/24" PodCIDR of the "cka-scenario8-control-plane" node
I0710 11:54:49.729142       1 main.go:514] Stopping shutdownHandler...
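
Most of that output is informational. flannel's log lines carry a klog-style prefix whose first letter is the severity (I for info, E for error), so a small filter gets you to the failure faster. This is a sketch only; `klog_errors` is a hypothetical name, not a kubectl or flannel feature.

```shell
#!/bin/sh
# klog_errors: hypothetical helper. Reads klog-formatted lines on stdin and
# keeps only error (E) and fatal (F) entries, e.g. "E0710 11:54:49...".
klog_errors() {
  grep -E '^[EF][0-9]{4} '
}

# Usage against a live pod (name is a placeholder, as in the command above):
#   kubectl -n kube-flannel logs <flannel-pod> | klog_errors
```

On the log above, the filter would leave only the `Error registering network` line, which is the one that names both CIDRs.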

Fix the CIDR in the ConfigMap

Fix it in the ConfigMap. kubectl edit opens the live kube-flannel-cfg object in vi. Inside net-conf.json, Network is set to 10.244.0.0/16, the manifest's default. Change it to 192.168.0.0/16 so it matches the cluster's pod CIDR, then save and quit. kubectl sends the edit to the API server once the editor exits. Nothing else in the ConfigMap changes.

$ kubectl -n kube-flannel edit configmap kube-flannel-cfg

- "Network": "10.244.0.0/16",
+ "Network": "192.168.0.0/16",
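
If you would rather not make the change in vi, the same one-line edit can be scripted. A minimal sketch, assuming the Network value sits on a line shaped like the one in the diff above; `set_flannel_network` is a hypothetical name, and the piped kubectl command is shown as a comment so you can check the dry run first.

```shell
#!/bin/sh
# set_flannel_network CIDR: hypothetical helper. Reads YAML or JSON text on
# stdin and rewrites the value of each "Network": "..." pair to CIDR.
set_flannel_network() {
  # "|" is the sed delimiter because the CIDR itself contains "/".
  sed "s|\"Network\": *\"[^\"]*\"|\"Network\": \"$1\"|"
}

# Dry run on the line from the diff above:
echo '      "Network": "10.244.0.0/16",' | set_flannel_network 192.168.0.0/16

# Against the live object (only after the dry run looks right):
#   kubectl -n kube-flannel get configmap kube-flannel-cfg -o yaml \
#     | set_flannel_network 192.168.0.0/16 \
#     | kubectl replace -f -
```

The dry run prints the line with 192.168.0.0/16 in place of 10.244.0.0/16 and leaves the indentation and trailing comma alone. In the exam, kubectl edit is usually quicker; the scripted form is more useful when you have to repeat the fix.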

Restart, then nodes go Ready

A ConfigMap change does not restart pods on its own, so the flannel pods keep crashing on the old config until you cycle them. Delete the flannel pods and the DaemonSet recreates them, this time reading the corrected Network. Now they come up Running, and within a few seconds the nodes flip to Ready and CoreDNS schedules. The CNI is installed and configured.

$ kubectl -n kube-flannel delete pod -l app=flannel
pod "kube-flannel-ds-6mhjm" deleted from kube-flannel namespace
pod "kube-flannel-ds-jq4n7" deleted from kube-flannel namespace

$ kubectl -n kube-flannel get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE                          NOMINATED NODE   READINESS GATES
kube-flannel-ds-5c2pb   1/1     Running   0          3s    172.18.0.13   cka-scenario8-control-plane   <none>           <none>
kube-flannel-ds-fntnk   1/1     Running   0          3s    172.18.0.14   cka-scenario8-worker          <none>           <none>

$ kubectl get nodes
NAME                          STATUS   ROLES           AGE   VERSION
cka-scenario8-control-plane   Ready    control-plane   8d    v1.36.1
cka-scenario8-worker          Ready    <none>          8d    v1.36.1
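
Deleting the pods by label works; an alternative is to let the DaemonSet controller do the cycling and then wait for it to finish. A sketch under the assumption that the DaemonSet is named kube-flannel-ds, as in the apply output earlier; `restart_flannel` is a hypothetical wrapper name and the 120-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# restart_flannel: hypothetical wrapper. Asks the DaemonSet controller to
# replace the flannel pods, then blocks until the replacements are ready
# (or the timeout expires).
restart_flannel() {
  kubectl -n kube-flannel rollout restart daemonset/kube-flannel-ds &&
    kubectl -n kube-flannel rollout status daemonset/kube-flannel-ds --timeout=120s
}
```

Call `restart_flannel` after the ConfigMap edit. The rollout status step gives you a clear success or failure instead of re-running get pods until the status changes.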

Prove pod-to-pod connectivity

Finish by proving it actually networks, not just that the pods are green. Create two pods on different nodes and list them wide: each has an IP from 192.168.0.0/16, on a different host. Ping one from the other and the packets ride the vxlan overlay between nodes. Replies come back, so the CNI is doing its job end to end.

$ kubectl apply -f connectivity-test.yaml
pod/test1 created
pod/test2 created

$ kubectl get pods -o wide -l app=conn-test
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                          NOMINATED NODE   READINESS GATES
test1   1/1     Running   0          1s    192.168.1.8   cka-scenario8-worker          <none>           <none>
test2   1/1     Running   0          1s    192.168.0.5   cka-scenario8-control-plane   <none>           <none>

$ kubectl exec test1 -- ping -c 3 <test2-ip>
PING 192.168.0.5 (192.168.0.5): 56 data bytes
64 bytes from 192.168.0.5: seq=0 ttl=62 time=0.507 ms
64 bytes from 192.168.0.5: seq=1 ttl=62 time=0.096 ms
64 bytes from 192.168.0.5: seq=2 ttl=62 time=0.151 ms

--- 192.168.0.5 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.096/0.251/0.507 ms

Exam tips

A few traps. NotReady nodes plus Pending CoreDNS almost always means no CNI, so install one first. After you apply a CNI, always check that its pods are actually Running; don't assume the apply worked. When flannel logs "failed to acquire lease", that is a pod CIDR versus Network mismatch: make flannel's Network match the cluster, since the cluster CIDR is fixed. And remember a ConfigMap edit needs a pod restart to take effect.

NotReady nodes + Pending CoreDNS => no CNI installed
After applying a CNI, verify its pods are Running (don't assume)
'failed to acquire lease' = flannel Network must match the cluster pod CIDR
A ConfigMap edit needs a pod restart to take effect

Recap

No CNI => NotReady; install flannel from the manifest
CrashLoop => logs => flannel Network must match the cluster pod CIDR
Fix net-conf.json, restart the pods, nodes go Ready
Prove it with a cross-node ping

Reproduce this yourself

The entire scenario is scripted on a throwaway kind cluster: https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios

git clone https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios.git
cd TCS_CKA_2026_Exam_Scenarios/learning/scenarios/scenario8-cni-flannel-install
./setup.sh        # creates the cluster AND arms the scenario
# solve it by hand, or:
./solution.sh     # apply the answer key and verify

If this helped, subscribe to The Cyber SideKick on YouTube for more CKA drills, and grab the newsletter at https://thecybersidekick.beehiiv.com.

AI Agents in DevOps: Why Traditional CI/CD Pipelines Break at 1000 Deployments Per Month

The Cyber Sidekick — Wed, 08 Jul 2026 19:03:00 +0000

How LLM-driven orchestration agents are replacing rule-based pipeline automation to sustain hyperscale deployment velocities that static DSLs were never designed to handle.

At hyperscale deployment velocities, traditional CI/CD pipelines built on sequential, rule-based automation collapse under the cognitive load of thousands of concurrent deployment decisions that require real-time reasoning across telemetry signals, failure modes, and risk tolerances. A new class of autonomous deployment agents, combining LLM-based orchestration, GitOps declarative state management, and eBPF-powered observability, is emerging as the only viable architecture for platforms that must ship reliably at this scale.

The Scaling Wall That Rules-Based Pipelines Cannot Climb

Traditional CI/CD pipelines were architected for deployment velocities measured in dozens of releases per month, using sequential stage gates, hardcoded approval thresholds, and static rollback conditions encoded in Jenkinsfiles or GitHub Actions YAML. At 1000 deployments per month across heterogeneous Kubernetes clusters, these pipelines do not simply slow down; they produce compounding decision debt, where a misconfigured canary threshold written six months ago now governs a microservice that serves ten times the original traffic volume. Google's internal Borg-derived systems already handle over 4 billion container launches per week, a scale that makes the limitations of rule-based scheduling immediately visible, since no human-authored ruleset can evaluate scheduling and deployment constraints within the sub-second latency budgets those systems require. The fundamental architectural mismatch is not one of tooling performance but of decision architecture: static pipelines can execute instructions, but they cannot reason about novel failure combinations, predict cascading degradations across service meshes, or rewrite their own deployment strategies in response to real-time SLO signals.

How Agentic Orchestration Layers Replace Static Pipeline DSLs

The ecosystem is actively transitioning from imperative pipeline scripting toward agentic orchestration layers where LLMs serve as meta-controllers, dynamically composing deployment strategies by consuming Prometheus metrics, distributed traces, and changelog semantics simultaneously. Projects like Argo Rollouts are embedding AI-augmented analysis templates that ingest Datadog and Prometheus metric providers to make autonomous canary promotion decisions, eliminating the manual threshold tuning that becomes untenable across hundreds of services. Flux CD paired with OpenAI function-calling agents enables intelligent drift detection and self-correcting GitOps reconciliation loops, where the agent can distinguish between an intentional declarative state change and an unauthorized configuration drift without requiring a human to inspect the diff. Keptn v2 Lifecycle Toolkit extends this further by providing OpenTelemetry-native evaluation hooks that AI agents consume for SLO-driven deployment gating, meaning a deployment can be autonomously promoted, paused, or rolled back based on a structured conversation between the orchestration agent and a unified observability substrate rather than a brittle shell script comparing integer thresholds. Platforms like Dagger are enabling portable, composable pipeline primitives that LLM agents can assemble on demand, shifting engineering teams from maintaining pipeline code to expressing desired deployment outcomes and acceptable risk tolerances as declarative intent.

The Observability and Infrastructure Substrate That Makes Agents Viable

Autonomous deployment agents require a standardized signal vocabulary to reason reliably, and the maturation of OpenTelemetry as a universal observability substrate across traces, metrics, and logs is providing exactly that foundation at a moment when it is most needed. Without a consistent schema for telemetry signals, an LLM-based agent cannot reliably distinguish a latency spike caused by a flawed deployment from one caused by an upstream dependency degrading independently, making autonomous rollback decisions dangerous rather than helpful. The infrastructure layer is also evolving to meet agents where they need to operate: Kubernetes Gateway API and WASM-based extensibility now allow AI agents to manipulate traffic routing at a granularity that previously required manual SRE intervention, enabling progressive delivery patterns like weighted traffic splits and header-based routing to be adjusted dynamically as canary analysis proceeds. Kubernetes-native admission webhooks and CEL-based policy surfaces give agents a programmable enforcement plane they can update at runtime without requiring cluster restarts or human-authored policy changes. Datadog's 2024 Container Report quantifies what happens when this infrastructure is absent, finding that organizations running more than 500 Kubernetes nodes experience incident rates 3.2 times higher during deployment windows, with the average cost per major outage reaching approximately $2.3 million, a figure that makes the ROI case for AI-driven progressive delivery and automated rollback straightforward to calculate.

Conclusion

The 2023 DORA State of DevOps Report found that elite performers deploy 182 times more frequently than low performers, and analysts project that AI-assisted pipelines will push that multiplier beyond 500 times by 2026 as autonomous deployment agents eliminate manual approval bottlenecks and replace them with SLO-aware, telemetry-driven decision loops. The path forward is not incrementally smarter pipeline scripts but a wholesale architectural shift toward declarative intent expression, where engineering teams define outcomes and risk tolerances while agents handle tactical execution across multi-cluster federation topologies, availability zone-aware scheduling, and real-time traffic shaping. Organizations that begin this transition now, starting with AI-augmented canary analysis on top of existing Argo Rollouts or Flux installations, will build the operational muscle memory and telemetry hygiene needed to run fully autonomous deployment systems before the next generation of deployment velocity expectations arrives. Those that wait for the tools to mature further may find that the velocity gap between elite and average performers has grown too wide to close through iteration alone.

Technologies covered: AI agents (LLM-based orchestration), GitOps with intelligent rollback, Kubernetes native auto-scaling, Observability platforms (Datadog, Prometheus), Self-healing infrastructure

Sources aggregated from: DevOps Weekly, GitHub Trending, Hacker News, InfoQ

GitOps at Scale: How Event-Driven and AI-Assisted Deployments Are Replacing Manual Environment Promotion

The Cyber Sidekick — Thu, 02 Jul 2026 19:51:44 +0000

Event-driven architectures and AI-powered validation are automating the entire GitOps promotion pipeline, eliminating the manual bottlenecks that throttle release velocity in large-scale Kubernetes environments.

Organizations managing hundreds of microservices are discovering that traditional GitOps promotion workflows, built around manual approval gates and human intervention, cannot scale to meet the demands of modern cloud-native delivery. Event-driven automation combined with ML-based quality gates is now enabling fully autonomous promotion decisions driven by real-time observability signals, policy-as-code enforcement, and historical deployment telemetry.

The Scaling Ceiling of Manual GitOps Promotion

Traditional GitOps pipelines treat environment promotion as a human-coordinated handoff: an engineer reviews test results, eyeballs dashboards, and clicks an approval button to advance a workload from staging to production. This model collapses under the weight of scale. When a platform team is responsible for hundreds of microservices across dozens of clusters, manual gates become the rate-limiting step in every release cycle. According to the 2023 DORA State of DevOps Report, elite performers deploy 127 times more frequently than low performers, and automated promotion pipelines are consistently cited as a key differentiator that keeps change failure rates below five percent. The problem is not that engineers make poor decisions; it is that the volume of decisions required in a large-scale Kubernetes environment exceeds what any team can handle reliably and quickly without automation.

Event-Driven Promotion: Wiring Observability Signals Into the GitOps Control Loop

The practical solution emerging across the CNCF ecosystem is to replace human approval gates with event-driven promotion logic that consumes signals from the observability stack in real time. Progressive delivery controllers like Argo Rollouts and Flagger connect directly to Prometheus, Datadog, and OpenTelemetry data sources, using metric-driven analysis templates to make canary and blue-green promotion decisions without waiting for a human to read a dashboard. Platform teams are routing Kubernetes events through NATS, Kafka, and CloudEvents-compliant brokers into GitOps reconcilers, so that SLO breaches, security scan failures, and load test outcomes can automatically trigger or block ArgoCD ApplicationSet promotions the moment the signal is available. Argo Rollouts alone has accumulated more than 5,800 GitHub stars and is running in production environments managing thousands of workloads, with documented case studies reporting a 60 to 70 percent reduction in deployment incidents attributable to analysis-based automated promotion. The CNCF's convergence on CloudEvents as a universal eventing substrate is accelerating interoperability between Tekton, Argo Events, Keptn, and external vendors, making it increasingly practical to compose these signals into a single, coherent promotion control plane.

AI-Augmented Quality Gates and Policy-as-Code Guardrails

Event-driven promotion handles the mechanics of signal routing, but AI and ML layers are adding a higher-order capability: deployment risk scoring based on patterns in historical telemetry that no human analyst would have the bandwidth to synthesize in real time. Tools like Keptn are integrating ML models trained on past deployment outcomes to score incoming releases and automate rollback decisions before a bad deployment can propagate to production. OpenFeature and custom admission webhooks are emerging as integration points for embedding these models directly into the Kubernetes API surface, while the Flux CD Notification Controller extends GitOps reconciliation triggers to respond to external quality signals via CloudEvents. Alongside AI scoring, policy-as-code frameworks are shifting compliance enforcement left: OPA Gatekeeper and Kyverno are now validating promotion eligibility before a Git commit is even merged, not just at deployment time, creating a continuous compliance loop across the entire software development lifecycle. Gartner projects that by 2026, more than 60 percent of organizations with mature DevOps practices will implement AI-augmented continuous delivery pipelines, up from fewer than 10 percent in 2023, driven by the economics of reducing mean time to recovery in cloud-native environments.

Conclusion

The convergence of event-driven architecture, progressive delivery controllers, and AI-augmented quality gates is fundamentally reshaping what a GitOps promotion pipeline looks like at scale. Platform engineering teams are already standardizing on Internal Developer Platforms that abstract promotion complexity behind golden paths, embedding these capabilities directly into Backstage templates and Crossplane compositions so that individual service teams inherit automated, policy-compliant promotion by default rather than by custom effort. The trajectory is clear: the approval button is being replaced by a scoring model, the Slack notification is being replaced by a CloudEvent, and the manual rollback is being replaced by an analysis-driven controller acting within seconds of a signal breach. Organizations that invest now in the observability instrumentation, eventing infrastructure, and policy-as-code discipline required to feed these systems will be positioned to treat safe, frequent, autonomous deployment not as an aspirational benchmark but as a daily operational baseline.

Technologies covered: GitOps, Event-Driven Architecture, Kubernetes, CI/CD Pipelines, Machine Learning Operations, ArgoCD, Flux CD, Policy as Code

Sources aggregated from: CNCF Blog, Kubernetes.io, DevOps Weekly

📬 Stay current with cloud-native

Get the latest Kubernetes, DevOps, and platform engineering insights delivered to your inbox.

Subscribe to The Cyber SideKick Newsletter — free, no spam, unsubscribe anytime.

Autoscale a Deployment with an HPA, including the field kubectl autoscale can't set (CKA Workloads)

The Cyber Sidekick — Wed, 01 Jul 2026 14:56:10 +0000

Autoscale a Deployment with an HPA

A Deployment named apache-server is running, and the exam wants it autoscaled on CPU: a 50 percent target, between one and four pods, and a 30-second scaleDown stabilization window. Let's build the HorizontalPodAutoscaler, and see why the obvious one-line command can't finish it.

🎥 Watch the video: https://www.youtube.com/watch?v=G4ZFpOXMJD8

This is a CKA Workloads & Scheduling walkthrough. Every command below is real output from a live cluster, and you can reproduce the whole thing yourself (scripts at the end).

The scenario

Here is the task. In the autoscale namespace, create a HorizontalPodAutoscaler named apache-server that targets the existing apache-server Deployment. Set the CPU target to 50 percent average utilization per pod, allow a minimum of one pod and a maximum of four, and set the scaleDown stabilization window to 30 seconds.

Namespace autoscale, HPA named apache-server
Target the existing apache-server Deployment
CPU target: 50% average utilization per pod
minReplicas 1, maxReplicas 4, scaleDown window 30s

How a HorizontalPodAutoscaler works

Two things have to be true before an HPA can do anything. The Pod needs a CPU request, because a 50 percent target is a percentage of that request; with no request the HPA can't compute utilization. And metrics-server has to be running, because that's where the HPA reads live CPU. Given both, the HPA watches CPU and adjusts replicas between your min and max. The scaleDown stabilization window tells it how long to wait on falling load before removing pods, which damps flapping.
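
The arithmetic behind that paragraph can be sketched in a few lines of shell. It follows the formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas x currentUtilization / targetUtilization), clamped to the min and max. It deliberately ignores the controller's tolerance band and the stabilization window, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the HPA replica calculation (function names are illustrative).

# CPU utilization as a whole-number percentage of the request.
utilization_pct() {
  local usage_m=$1 request_m=$2        # both in millicores
  echo $(( usage_m * 100 / request_m ))
}

# desired = ceil(current * utilization / target), clamped to [min, max].
desired_replicas() {
  local current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling division: (a + b - 1) / b
  local desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

utilization_pct 180 200          # 180m used of a 200m request -> 90
desired_replicas 1 90 50 1 4     # ceil(1 * 90 / 50) = 2
desired_replicas 2 250 50 1 4    # ceil(2 * 250 / 50) = 10, capped at 4
desired_replicas 1 0 50 1 4      # idle: 0, raised to the minimum of 1
```

With the numbers from this lab, an idle pod using 1m of a 200m request works out to 0 percent under integer division, which is consistent with the cpu: 0%/50% that the HPA reports further down.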

Where the YAML comes from (official docs)

Before you apply anything, know that there is no kubectl create that emits a behavior block, so part of this is copy-paste from the official docs, which you are allowed to use in the exam. Two pages on kubernetes.io cover it. Search for HPA and open the HorizontalPodAutoscaler Walkthrough; it shows the autoscaling/v2 object with the metrics array, which is the shape for the CPU target. Then open the Horizontal Pod Autoscaling concept page and find the section called Configurable scaling behavior, with its Stabilization window example; lift the scaleDown block straight from there. In practice: generate the skeleton with kubectl autoscale --dry-run=client -o yaml, paste the behavior block from that section, and set the window to 30.

Inspect the Deployment

Start by reading what's there. The apache-server Deployment and its Service are running in the autoscale namespace. Open the Deployment manifest and look at the container's resources. Crucially, the Pod template sets a CPU request of 200 millicores. That request is what makes a percentage target meaningful: 50 percent of 200 millicores is 100 millicores per pod.

$ kubectl -n autoscale get deploy,svc
NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/apache-server   1/1     1            1           125m

NAME                    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/apache-server   ClusterIP   10.96.182.94   <none>        80/TCP    125m

$ cat apache-app.yaml
...
      labels:
        app: apache-server
    spec:
      containers:
        - name: php-apache
          image: registry.k8s.io/hpa-example
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 200m
            limits:
              cpu: 500m

Metrics are flowing

Confirm the HPA will have data to read. kubectl top pods returns live CPU and memory from metrics-server. If this command errors or shows nothing, the HPA will sit at TARGETS <unknown> and never scale, so verifying metrics first saves you a confusing debug later.

$ kubectl -n autoscale top pods
NAME                             CPU(cores)   MEMORY(bytes)   
apache-server-748dd94f84-2h56f   1m           9Mi

Where kubectl autoscale stops

Now the fast path: kubectl autoscale. Pass the target, min, and max, and it builds an HPA in one line. On a current cluster the dry-run even emits apiVersion: autoscaling/v2, with the CPU target already inside a metrics array, so it looks like you're done. But scan the whole object: there is no behavior section anywhere. The scaleDown stabilization window has no flag on this command, so the imperative path gets you three of the four requirements and stops.

$ kubectl -n autoscale autoscale deployment apache-server --cpu-percent=50 --min=1 --max=4 --dry-run=client -o yaml
...
      name: cpu
      target:
        averageUtilization: 50
        type: Utilization
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-server
status:
  currentMetrics: null
  desiredReplicas: 0

Author the v2 HPA

So finish it in YAML. Take the skeleton and add the one piece the command can't give you: a behavior block where scaleDown.stabilizationWindowSeconds is 30. One trap to avoid: in autoscaling/v2 the CPU target belongs inside the metrics array as averageUtilization, so don't paste the old v1 targetCPUUtilizationPercentage field next to behavior, or the apiserver rejects it as an unknown field. Apply it, then list the HPA: it shows the current CPU utilization against the 50 percent target, with min 1 and max 4.
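
The walkthrough applies hpa.yaml without printing it. A plausible version, reconstructed from the four requirements and the dry-run skeleton rather than copied from the author's repository, might look like the following. The field names are those of the autoscaling/v2 API; the file layout is an assumption.

```shell
# Write a candidate hpa.yaml (a reconstruction, not the author's exact file).
cat > hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apache-server
  namespace: autoscale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30
EOF

# Sanity-check the two fields that are easiest to get wrong before applying.
grep -n 'averageUtilization: 50' hpa.yaml
grep -n 'stabilizationWindowSeconds: 30' hpa.yaml
```

The namespace is set in the manifest itself, which is why a plain kubectl apply -f hpa.yaml without -n would still land in autoscale.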

$ kubectl apply -f hpa.yaml
horizontalpodautoscaler.autoscaling/apache-server created

$ kubectl -n autoscale get hpa apache-server
NAME            REFERENCE                  TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
apache-server   Deployment/apache-server   cpu: 0%/50%   1         4         1          5s

Verify the HPA

Verify every field, because each is a separate mark. describe confirms the reference to the Deployment, the 50 percent CPU target, and the one-to-four range. Then read the stabilization window straight from the spec: 30 seconds, exactly as asked. With the CPU target reading a real percentage, the autoscaler is live and complete.

$ kubectl -n autoscale describe hpa apache-server
...
  Scale Down:
    Stabilization Window: 30 seconds
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
Deployment pods:       1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:           <none>

$ kubectl -n autoscale get hpa apache-server -o jsonpath='{.spec.behavior.scaleDown.stabilizationWindowSeconds}'
30

Exam tips

A few traps to remember. kubectl autoscale is fine for min, max, and the CPU target, but it has no flag for a stabilization window, so when behavior is required you finish the HPA in YAML. In autoscaling v2 the CPU target lives inside the metrics array, not as targetCPUUtilizationPercentage; mixing the two is the unknown field error. The Pod needs a CPU request or the percentage target is meaningless. And if TARGETS shows unknown, suspect metrics-server before the HPA itself.

kubectl autoscale has no flag for a scaleDown stabilization window; add it in YAML
v2: CPU target goes in metrics[], not targetCPUUtilizationPercentage
No CPU request on the Pod means no usable utilization target
TARGETS <unknown>? Check metrics-server before blaming the HPA
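
A quick way to check the CPU-request prerequisite without opening the manifest is a jsonpath query against the Deployment. This is a small sketch; the helper name cpu_requests is made up, and the namespace and Deployment names in the usage line are the ones from this scenario.

```shell
# Print the CPU request of every container in a Deployment's Pod template.
# Empty output means no container sets a CPU request.
cpu_requests() {
  local ns=$1 deploy=$2
  kubectl -n "$ns" get deploy "$deploy" \
    -o jsonpath='{.spec.template.spec.containers[*].resources.requests.cpu}'
}

# Usage on the exam cluster:
#   cpu_requests autoscale apache-server
```

For the Deployment inspected earlier this should print 200m; if it prints nothing, a Utilization target has nothing to be a percentage of.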

Recap

Prereqs: a CPU request on the Pod + a running metrics-server
v2 manifest: metrics[] for the CPU target, behavior for scaleDown
stabilizationWindowSeconds 30 is the field kubectl autoscale can't set

Reproduce this yourself

The entire scenario is scripted on a throwaway kind cluster: https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios

git clone https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios.git
cd TCS_CKA_2026_Exam_Scenarios/learning/scenarios/scenario7-hpa-cpu-autoscale
./setup.sh        # creates the cluster AND arms the scenario
# solve it by hand, or:
./solution.sh     # apply the answer key and verify

If this helped, subscribe to The Cyber SideKick on YouTube for more CKA drills, and grab the newsletter at https://thecybersidekick.beehiiv.com.

CKA Exam 2026 - Scenario 6 Migrate an Ingress to the Gateway API without dropping HTTPS (CKA Services & Networking)

The Cyber Sidekick — Mon, 29 Jun 2026 18:29:18 +0000

Migrate Ingress to the Gateway API

A web application is exposed over HTTPS with a classic Ingress, and the exam wants it migrated to the Gateway API while keeping HTTPS working. The GatewayClass is already installed. Let's recreate the routing with a Gateway and an HTTPRoute, then retire the Ingress.

🎥 Watch the video: https://www.youtube.com/watch?v=v5_1KKFWLGE

This is a CKA Services & Networking walkthrough. Every command below is real output from a live cluster, and you can reproduce the whole thing yourself (scripts at the end).

The scenario

Here is the setup. A Deployment named web sits behind a Service, and an Ingress named web terminates TLS for the host gateway.web.k8s.local and routes to it. A GatewayClass is already installed. Your task is to migrate this to the Gateway API, keep the same HTTPS host, and once it works, delete the old Ingress.

An Ingress named 'web' serves HTTPS for gateway.web.k8s.local
A GatewayClass is already installed in the cluster
Recreate the routing with a Gateway + HTTPRoute
Keep HTTPS, then delete the old Ingress

How an Ingress maps onto the Gateway API

The Gateway API splits what one Ingress did into three objects. The GatewayClass names the controller that implements Gateways, much as an IngressClass does for Ingresses. The Gateway is the listener: the port and the TLS the Ingress used to own. The HTTPRoute holds the routing rules: the hostname, the paths, and the backend Service. Map the Ingress onto those three and the migration is mechanical.

Where the YAML comes from (you author it, not `apply -f`)

There is no imperative kubectl create for a Gateway or HTTPRoute, so in the exam you write the YAML yourself, copying a starting template from the Gateway API docs (linked from the Kubernetes documentation, which is allowed during the exam):

Gateway + TLS: Simple Gateway and TLS termination
HTTPRoute: HTTP routing
Field reference: API spec
From kubernetes.io: Gateway API

Every value is something you already have. gatewayClassName is the class from kubectl get gatewayclass. The HTTPS listener port and the certificateRefs Secret are carried over from the existing Ingress's tls block. The hostname, the path, and the backend Service (web:80) are the Ingress's own routing rule, re-expressed as an HTTPRoute. You can see both files (cat) right before each kubectl apply below.
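
Those values can be read straight off the live Ingress instead of being retyped. The helper below is a sketch: ingress_summary is a made-up name, and it assumes the Ingress has a single rule, a single path, a single tls entry, and a numeric (not named) Service port.

```shell
# Print the host, TLS Secret, and backend of an Ingress in one call.
ingress_summary() {
  local ns=$1 name=$2
  kubectl -n "$ns" get ingress "$name" -o jsonpath='host={.spec.rules[0].host}{"\n"}secret={.spec.tls[0].secretName}{"\n"}backend={.spec.rules[0].http.paths[0].backend.service.name}:{.spec.rules[0].http.paths[0].backend.service.port.number}{"\n"}'
}

# Usage on the exam cluster:
#   ingress_summary web web
```

The three lines it prints correspond to the HTTPRoute hostnames, the Gateway listener's certificateRefs, and the HTTPRoute backendRefs respectively.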

Inspect what's running

Start by reading the current state. The web Deployment and Service are running, and the Ingress named web is serving the host gateway.web.k8s.local on ports 80 and 443. The GatewayClass is present and Accepted, so the controller is ready for a Gateway to bind to it.

$ kubectl -n web get deploy,svc,ingress
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/tester   1/1     1            1           117m
deployment.apps/web      1/1     1            1           117m

NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/web   ClusterIP   10.96.31.154   <none>        80/TCP    117m

NAME                            CLASS   HOSTS                   ADDRESS     PORTS     AGE
ingress.networking.k8s.io/web   nginx   gateway.web.k8s.local   localhost   80, 443   74m

$ kubectl get gatewayclass
NAME   CONTROLLER                                      ACCEPTED   AGE
eg     gateway.envoyproxy.io/gatewayclass-controller   True       117m

The Ingress serves HTTPS today

Prove the starting point. A client that resolves gateway.web.k8s.local to the Ingress controller gets the page back over HTTPS, returning WEB APP OK. This is the behavior we must preserve through the migration.

$ curl -k https://gateway.web.k8s.local/
WEB APP OK

Create the Gateway

There is no kubectl create for these, so you author the YAML by hand from the Gateway API docs (the Simple Gateway and TLS termination examples, which are linked from the Kubernetes docs). Look at the file. apiVersion and kind declare a Gateway. The gatewayClassName is eg, the class you just saw was Accepted. Then one listener: HTTPS on port 443 for the hostname gateway.web.k8s.local. Under tls, mode is Terminate, with a certificateRefs entry pointing to web-tls, the very same Secret the Ingress used, so HTTPS is preserved. Every value came from the GatewayClass or the old Ingress. Apply it and wait until it reports Programmed, which means the controller has provisioned the data plane and assigned an address.

$ cat gateway.yaml
# Gateway: an HTTPS listener on 443 that terminates TLS with the web-tls Secret.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web-gateway
  namespace: web
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: gateway.web.k8s.local
      tls:
        mode: Terminate
        certificateRefs:
          - name: web-tls

$ kubectl apply -f gateway.yaml
gateway.gateway.networking.k8s.io/web-gateway created

$ kubectl -n web get gateway web-gateway
NAME          CLASS   ADDRESS         PROGRAMMED   AGE
web-gateway   eg      10.96.138.181   True         9s
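
Rather than re-running get until PROGRAMMED flips to True, kubectl wait can block on the condition. A small sketch follows; the wrapper name and the 60s default timeout are arbitrary choices, and it assumes "gateway" resolves unambiguously to the Gateway API resource on the cluster.

```shell
# Block until the Gateway reports the Programmed condition, or time out.
wait_gateway_programmed() {
  local ns=$1 name=$2 timeout=${3:-60s}
  kubectl -n "$ns" wait --for=condition=Programmed "gateway/$name" --timeout="$timeout"
}

# Usage on the exam cluster:
#   wait_gateway_programmed web web-gateway
```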

Create the HTTPRoute

Now the routing, from the HTTP routing example in the docs. Read the file. It is an HTTPRoute. parentRefs attaches it to the web-gateway you just made, so this route is served by that Gateway. hostnames matches gateway.web.k8s.local, the same host as before. The one rule matches the path / (no type is set, so the default, PathPrefix, applies) and sends it to backendRefs, the web Service on port 80. The host, the path, and the backend are lifted straight from the old Ingress rule. Apply it and confirm the route lists the hostname.

$ cat httproute.yaml
# HTTPRoute: same host as the Ingress, path / -> the web Service on port 80.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-route
  namespace: web
spec:
  parentRefs:
    - name: web-gateway
  hostnames:
    - gateway.web.k8s.local
  rules:
    - matches:
        - path:
            value: /
      backendRefs:
        - name: web
          port: 80

$ kubectl apply -f httproute.yaml
httproute.gateway.networking.k8s.io/web-route created

$ kubectl -n web get httproute web-route
NAME        HOSTNAMES                   AGE
web-route   ["gateway.web.k8s.local"]   0s
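
The get output above only echoes the hostname back; whether the Gateway actually accepted the route is recorded in the HTTPRoute's status. A sketch of reading it, with route_accepted as a made-up helper name:

```shell
# Print the status of the Accepted condition for each parent Gateway
# of an HTTPRoute ("True" when the Gateway has accepted the route).
route_accepted() {
  local ns=$1 name=$2
  kubectl -n "$ns" get httproute "$name" \
    -o jsonpath='{.status.parents[*].conditions[?(@.type=="Accepted")].status}'
}

# Usage on the exam cluster:
#   route_accepted web web-route
```

An empty result usually means no controller has processed the route yet, for example because parentRefs names a Gateway that does not exist.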

Verify, then delete the Ingress

Verify before you remove anything. The same request, now routed through the Gateway, returns WEB APP OK over HTTPS. Only once that works do you delete the old Ingress. Test one more time: traffic still flows through the Gateway, with the Ingress gone. The migration is complete with no outage.

$ curl -k https://gateway.web.k8s.local/
WEB APP OK

$ kubectl -n web delete ingress web
ingress.networking.k8s.io "web" deleted from web namespace

$ curl -k https://gateway.web.k8s.local/
WEB APP OK

Exam tips

A few traps. There is no kubectl create for Gateways or HTTPRoutes, so keep the Gateway API docs open and copy the manifests. Reuse the same TLS Secret on the Gateway listener so HTTPS keeps working. Wait for Programmed before testing, or you will curl a listener that is not up yet. And delete the Ingress last, only after the Gateway is verified, so you never drop traffic.

No imperative command: copy Gateway + HTTPRoute YAML from gateway-api.sigs.k8s.io
Reuse the same TLS Secret on the Gateway listener to keep HTTPS
Wait for the Gateway to be Programmed before you test
Delete the Ingress LAST, only after the Gateway is verified

Recap

Gateway = listener + TLS; HTTPRoute = host + path -> Service
Reuse the TLS Secret; wait for Programmed
Verify HTTPS, then delete the Ingress last

Reproduce this yourself

The entire scenario is scripted on a throwaway kind cluster: https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios

git clone https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios.git
cd TCS_CKA_2026_Exam_Scenarios/learning/scenarios/scenario6-ingress-to-gateway
./setup.sh        # creates the cluster AND arms the scenario
# solve it by hand, or:
./solution.sh     # apply the answer key and verify


CKA Scenario 5 - Force nginx to TLS 1.3 with a ConfigMap edit + rolling restart (CKA Workloads)

The Cyber Sidekick — Mon, 29 Jun 2026 12:53:22 +0000

Force nginx to TLS 1.3

An nginx server is accepting an old TLS version, and the exam wants it locked to TLS 1.3. The config lives in a ConfigMap. The catch is that editing the ConfigMap alone changes nothing. Let's do it the way the CKA expects.

🎥 Watch the video: https://www.youtube.com/watch?v=rx-77YBw99w

This is a CKA Workloads & Scheduling walkthrough. Every command below is real output from a live cluster, and you can reproduce the whole thing yourself (scripts at the end).

The scenario

An nginx-static Deployment serves HTTPS, and its server config comes from a ConfigMap named nginx-config. Right now it allows both TLS 1.2 and 1.3. Your task is to allow only TLS 1.3, then make nginx actually use the change, so that a TLS 1.2 request fails.

nginx-static serves HTTPS from the nginx-config ConfigMap
It currently allows TLS 1.2 AND 1.3
Restrict ssl_protocols to TLS 1.3 only
A TLS 1.2 request to the Service must then fail

How nginx, ConfigMaps, and rolling restarts fit together

Two ideas drive this. First, ssl_protocols is an allow list; leave only TLSv1.3 and nginx rejects any older handshake. Second, a ConfigMap mounted into a pod updates the file on disk, but nginx only reads its configuration, ssl_protocols included, when it starts or is explicitly reloaded. So roll the Deployment with kubectl rollout restart for the new value to take effect.

Inspect the current state

Start by seeing what is running and what the config says. The nginx-static Deployment, its Service on port 443, and the nginx-config ConfigMap are all here. Grep the rendered ConfigMap for the ssl_protocols line: it lists TLSv1.2 and TLSv1.3, so old clients still get in.

$ kubectl -n nginx-static get deploy,svc,configmap
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-static   1/1     1            1           17h
deployment.apps/tester         1/1     1            1           17h

NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/nginx-static   ClusterIP   10.96.13.162   <none>        443/TCP   17h

NAME                         DATA   AGE
configmap/kube-root-ca.crt   1      17h
configmap/nginx-config       1      2s

$ kubectl -n nginx-static get configmap nginx-config -o yaml | grep ssl_protocols
        ssl_protocols TLSv1.2 TLSv1.3;

Confirm TLS 1.2 works today

Prove the starting point from a real client. The tester pod curls the Service, pinned to a maximum of TLS 1.2. It returns the page, which confirms the server accepts TLS 1.2 right now. That is exactly what we are about to stop.

$ kubectl -n nginx-static exec deploy/tester -- curl -sk --tlsv1.2 --tls-max 1.2 https://nginx-static.nginx-static.svc.cluster.local
TLS OK
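
The command log goes from this check to the rollout, so the ConfigMap edit itself is not shown. One way to make it is sketched below, under the assumption that the ConfigMap contains exactly the ssl_protocols line seen in the grep earlier. The helper name restrict_to_tls13 is made up, and kubectl edit is the interactive alternative.

```shell
# Filter that rewrites the allow list to TLS 1.3 only.
restrict_to_tls13() {
  sed 's/ssl_protocols TLSv1\.2 TLSv1\.3;/ssl_protocols TLSv1.3;/'
}

# Demonstrated on a local copy of the line from the ConfigMap:
echo '        ssl_protocols TLSv1.2 TLSv1.3;' | restrict_to_tls13

# On the cluster, the same filter can sit between get and apply:
#   kubectl -n nginx-static get configmap nginx-config -o yaml \
#     | restrict_to_tls13 | kubectl apply -f -
# or edit interactively:
#   kubectl -n nginx-static edit configmap nginx-config
```

Either route only changes the stored ConfigMap; the running nginx process keeps its old setting until the pod is replaced.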

Roll the Deployment

Now make nginx use the edited ConfigMap. A rolling restart replaces the pod, and the new pod reads the updated ssl_protocols on startup. Wait for the rollout to finish so you are testing the new pod, not the old one. This is the step people skip, and it is where the marks are.

$ kubectl -n nginx-static rollout restart deploy/nginx-static
deployment.apps/nginx-static restarted

$ kubectl -n nginx-static rollout status deploy/nginx-static
Waiting for deployment "nginx-static" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "nginx-static" rollout to finish: 1 old replicas are pending termination...
deployment "nginx-static" successfully rolled out

Verify

Prove it both ways. The same TLS 1.2 request now fails the handshake; curl reports a protocol version alert and exits non-zero, which is what we want. A normal request, letting curl negotiate, connects over TLS 1.3 and still returns the page. Old TLS is gone, and the service still works.

$ kubectl -n nginx-static exec deploy/tester -- curl -sSk --tlsv1.2 --tls-max 1.2 https://nginx-static.nginx-static.svc.cluster.local
curl: (35) OpenSSL/3.3.2: error:0A00042E:SSL routines::tlsv1 alert protocol version
command terminated with exit code 35

$ kubectl -n nginx-static exec deploy/tester -- curl -skv https://nginx-static.nginx-static.svc.cluster.local | grep -iE 'SSL connection|TLS OK'
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
TLS OK
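
The two checks can be folded into a single pass/fail helper, which is handy when re-testing after a fix. This is a sketch only: tls_check is a made-up name, and the namespace, tester Deployment, and default URL are the ones from this scenario.

```shell
# Pass only if a TLS 1.2-capped request fails AND a default request succeeds.
tls_check() {
  local url=${1:-https://nginx-static.nginx-static.svc.cluster.local}
  if kubectl -n nginx-static exec deploy/tester -- \
       curl -sk --tlsv1.2 --tls-max 1.2 "$url" >/dev/null 2>&1; then
    echo "FAIL: TLS 1.2 still accepted"
    return 1
  fi
  if ! kubectl -n nginx-static exec deploy/tester -- \
       curl -sk "$url" >/dev/null 2>&1; then
    echo "FAIL: default handshake broken"
    return 1
  fi
  echo "PASS: TLS 1.2 rejected, default handshake works"
}

# Usage on the exam cluster:
#   tls_check
```

The helper does not inspect the negotiated version itself; the curl -v output shown above remains the direct evidence that the working handshake is TLS 1.3.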

Exam tips

A few traps to remember. Editing a ConfigMap does not restart anything; without a rollout restart your change is invisible. ssl_protocols is an allow list, so list only the versions you want. Test with the client version pinned, because curl will quietly use TLS 1.3 and hide the problem otherwise. And always verify against the Service name, not just the pod.

Editing a ConfigMap does nothing until you roll the Deployment
ssl_protocols is an allow-list: leave only TLSv1.3
Pin the client (curl --tls-max 1.2) or you won't see the failure
Verify against the Service, not just the pod

Recap

Set ssl_protocols to TLSv1.3 in the ConfigMap
kubectl rollout restart so nginx re-reads it
Verify TLS 1.2 fails, TLS 1.3 still serves

Reproduce this yourself

The entire scenario is scripted on a throwaway kind cluster: https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios

git clone https://github.com/The-Cyber-Sidekick/TCS_CKA_2026_Exam_Scenarios.git
cd TCS_CKA_2026_Exam_Scenarios/learning/scenarios/scenario5-nginx-tls-configmap
./setup.sh        # creates the cluster AND arms the scenario
# solve it by hand, or:
./solution.sh     # apply the answer key and verify


AI-Driven DevOps Is Reshaping CI/CD: From Pipeline Mechanics to Autonomous Orchestration

The Cyber Sidekick — Mon, 29 Jun 2026 11:31:15 +0000

How ML agents and LLM-powered observability are moving DevOps teams from reactive pipeline management to predictive, self-healing infrastructure automation.

AI-driven DevOps is eliminating manual CI/CD bottlenecks by turning pipelines into autonomous systems that detect, diagnose, and fix deployment issues before they reach production. The convergence of large language models, ML-based anomaly detection, and durable workflow orchestration is compressing mean-time-to-recovery from hours to minutes, with Gartner projecting that 40% of large enterprises will autonomously resolve infrastructure incidents without human intervention by 2027.

The Reactive Pipeline Problem and Why It Is Breaking Under Modern Scale

Traditional DevOps pipelines are fundamentally reactive: alerts fire after production metrics degrade, rollbacks trigger after error budgets are burned, and on-call engineers diagnose failures that users have already encountered. This approach creates mean-time-to-recovery gaps measured in minutes to hours, with threshold-based alerting generating noise that masks real signals until damage is done. The structural problem is that pipelines were designed as linear executors, not intelligent decision-makers, so every anomaly outside a predefined threshold requires human judgment to classify, prioritize, and remediate. Organizations using ML-based anomaly detection on deployment pipelines are reporting mean-time-to-detect reductions of 60 to 70 percent compared to threshold-based alerting, according to Dynatrace's 2024 State of Observability report, which illustrates the scale of the opportunity left untapped by conventional tooling.

The Emerging AI-Agentic Infrastructure Stack

The ecosystem is moving rapidly from AI-assisted tooling toward AI-agentic infrastructure, where platforms make autonomous decisions within policy-encoded boundaries rather than merely surfacing recommendations to humans. Dynatrace Davis combines causal AI topology mapping with LLM-generated root cause explanations, correlating logs, traces, and metrics simultaneously rather than in isolation. GitOps controllers like Argo CD are being extended with Keptn integrations that evaluate deployment risk scores derived from historical telemetry and automatically pause or roll back Helm releases based on SLO breach signals, effectively encoding SRE judgment as executable policy. Temporal.io has emerged as a critical durable execution backbone for these autonomous remediation agents, providing retry semantics, state persistence, and full workflow auditability across multi-step sequences; the platform reported over 500 billion workflow actions executed in 2024, reflecting how quickly durable orchestration is becoming the control plane for complex automated remediation. Startups including Cortex, Harness, and Port are layering ML models trained on deployment patterns directly into internal developer portals, surfacing reliability recommendations before code reaches merge.

Key Trends Defining the Next Generation of Intelligent Pipelines

Four converging trends are shaping how AI integrates into DevOps workflows at scale. First, the standardization of OpenTelemetry as a unified telemetry substrate is giving AI models consistent, vendor-agnostic data to reason over, removing the fragmentation that previously made cross-stack correlation impractical. Second, GitOps-native AI policy engines are encoding remediation runbooks as version-controlled code reviewed alongside application manifests, making autonomous decisions auditable and reversible through standard pull request workflows. Third, SRE copilots powered by LLMs fine-tuned on incident postmortems, Kubernetes event streams, and infrastructure runbooks are generating contextual remediation playbooks in real time, reducing the cognitive load on engineers during active incidents. Fourth, the combination of these signals into unified AI observability agents is enabling platforms to move from detecting that something is wrong to explaining why it is wrong and executing a fix, all within a single automated feedback loop.

Conclusion

The trajectory of AI-driven DevOps points toward infrastructure that is less a pipeline to be managed and more an autonomous system to be governed. The foundational pieces are already in production: OpenTelemetry provides the data substrate, Temporal provides the execution durability, Argo CD and Keptn provide the GitOps enforcement layer, and LLMs provide the contextual reasoning that previously required senior engineers. The near-term challenge for platform teams is not adoption but governance: defining the policy boundaries within which AI agents are permitted to act autonomously, ensuring auditability trails satisfy compliance requirements, and building the human-in-the-loop escalation paths that preserve trust when autonomous decisions fail. With Gartner projecting that fewer than 5% of enterprises autonomously resolve infrastructure incidents today versus 40% by 2027, the organizations that invest now in durable orchestration, telemetry standardization, and AI policy frameworks will hold a compounding reliability and velocity advantage over those still waiting for the tooling to mature.

Technologies covered: LLMs for log analysis and root cause detection, ML-based anomaly detection in deployment patterns, Autonomous workflow orchestration (Temporal, Dagster), GitOps + AI decision engines, Observability platforms with AI correlation

Sources aggregated from: DevOps Weekly, GitHub Trending, Hacker News, The New Stack
