<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kacey Gambill</title>
    <description>The latest articles on DEV Community by Kacey Gambill (@klip_klop).</description>
    <link>https://dev.to/klip_klop</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F32904%2Fdd948db4-dcd1-4869-b76d-194d66eae4ae.jpg</url>
      <title>DEV Community: Kacey Gambill</title>
      <link>https://dev.to/klip_klop</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/klip_klop"/>
    <language>en</language>
    <item>
      <title>Sleep Tight, Cluster Right: Stop Burning Cash at 3 AM</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Sat, 20 Dec 2025 15:28:08 +0000</pubDate>
      <link>https://dev.to/klip_klop/sleep-tight-cluster-right-stop-burning-cash-at-3-am-3jpm</link>
      <guid>https://dev.to/klip_klop/sleep-tight-cluster-right-stop-burning-cash-at-3-am-3jpm</guid>
      <description>&lt;p&gt;Most Kubernetes clusters are wide awake 24/7, even if your users aren't! &lt;/p&gt;

&lt;p&gt;CPU-based HPAs try to help, but they quickly fall apart, especially once we add the VPA to the mix. The two can work in tandem, but only if we implement some smart scaling with KEDA; otherwise it is just a battle between them, and it makes our clusters tired!&lt;/p&gt;

&lt;p&gt;Queues, request rates, latency, and time of day can tell us far more about whether our workloads should exist at all.&lt;/p&gt;

&lt;p&gt;This is where KEDA shines!&lt;/p&gt;

&lt;p&gt;Instead of guessing how busy something &lt;em&gt;could&lt;/em&gt; be, we scale based on events!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus metrics when work exists.&lt;/li&gt;
&lt;li&gt;Cron schedules when traffic in the cluster drops.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During the day, we rely on metric triggers, which drive rapid scaling decisions.&lt;br&gt;
At night, cron or empty-queue signals pull our workloads down to a minimum, sometimes all the way to zero!&lt;/p&gt;

&lt;p&gt;Our VPA is no longer fighting with our HPA, and we are not running idle pods.&lt;br&gt;
The best part: we are no longer paying our cluster to work while we are asleep.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled Scaling: Night Shift
&lt;/h3&gt;

&lt;p&gt;Traffic patterns are predictable, but let's not guess; we can schedule this with a trigger.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron scalers set explicit windows to scale down a workload during off hours.&lt;/li&gt;
&lt;li&gt;Prometheus scalers scale our workloads based on metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can use these both in tandem to create a workload that only runs when there is work to do. As soon as requests start to arrive, we quickly scale out and serve the traffic; then, with a longer scale-down period, we can handle any remaining traffic while the cluster gets ready to scale back down. &lt;/p&gt;
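&lt;p&gt;As a rough sketch of what this pairing can look like (the Deployment name, Prometheus address, and query here are illustrative assumptions, not taken from a real cluster):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-service
spec:
  scaleTargetRef:
    name: api-service         # hypothetical Deployment
  minReplicaCount: 0          # allowed to sleep at zero
  maxReplicaCount: 20
  cooldownPeriod: 600         # longer scale-down window to drain remaining traffic
  triggers:
    # metric trigger: scale out while there is work to do
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(rate(http_requests_total{app="api-service"}[2m]))
        threshold: "10"
    # cron trigger: keep a small floor warm during business hours
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 7 * * 1-5
        end: 0 19 * * 1-5
        desiredReplicas: "3"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;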

&lt;p&gt;Workload scaling is only half of the story. We aren't getting these crazy savings if we are still paying for nodes on standby! &lt;/p&gt;

&lt;p&gt;The cluster autoscaler can do a decent job, but here is where Karpenter really shines.&lt;/p&gt;

&lt;p&gt;As soon as our workloads are gone, Karpenter consolidates and terminates the nodes that are empty, allowing us to use beefy nodes when we need them, but also to scale way down when our workloads are ready for bed!&lt;/p&gt;
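&lt;p&gt;On the node side, a minimal sketch of that consolidation behavior, assuming Karpenter's v1 API on AWS (the pool name, requirements, and EC2NodeClass are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose       # hypothetical pool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m      # reap empty or underused nodes quickly
  template:
    spec:
      nodeClassRef:           # assumes an AWS EC2NodeClass named "default"
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;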

&lt;h3&gt;
  
  
  Day Shift
&lt;/h3&gt;

&lt;p&gt;We use cron triggers to ensure our workloads are warmed up and ready to go in the morning. Pairing this with Karpenter, we end up with zero toil and a fully awake cluster, ready before the first few developers start signing on!&lt;/p&gt;

&lt;p&gt;In the event that there are some early birds during the night, we still fall back to metric scaling after our cron triggers. This ensures the pods are always available if needed. &lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Autoscaling is not just about surviving peak traffic; it's about efficiency. By combining KEDA's event-based triggers with Karpenter's node management, we no longer burn cash on empty compute. &lt;/p&gt;

&lt;p&gt;Stop paying for idle time. Make your infrastructure work for you, not the other way around.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Rightsizing Kubernetes Requests with the In-Place Vertical Pod Autoscaler</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Tue, 09 Dec 2025 15:29:49 +0000</pubDate>
      <link>https://dev.to/klip_klop/rightsizing-kubernetes-requests-with-the-in-place-vertical-pod-autoscaler-299m</link>
      <guid>https://dev.to/klip_klop/rightsizing-kubernetes-requests-with-the-in-place-vertical-pod-autoscaler-299m</guid>
      <description>&lt;p&gt;For a long time, the Vertical Pod Autoscaler was not production ready: it required restarting pods in order to change resource requests and limits. With the new &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; feature, we are now able to right-size our requests dynamically without killing the application.&lt;/p&gt;

&lt;p&gt;Here is how we are using it to cut pod waste and reduce churn.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;The "Over-Provisioning" Tax&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When deploying a new service, or even an old one, it's a common pattern to pad the requests to be safe and ensure the service operates healthily and happily. &lt;/p&gt;

&lt;p&gt;Given this example:&lt;/p&gt;

&lt;p&gt;We are deploying a service to process PDFs in a job queue. We load test it and estimate that it takes around 250Mi of memory and 250m of CPU. Just in case, though, we pad the service and give it 500Mi and 500m of CPU to ensure nothing goes wrong. Now multiply that by 10-500 microservices, and suddenly our cluster is running at 30% utilization while we basically pay to reduce toil.&lt;/p&gt;

&lt;p&gt;Historically, fixing this was a manual nightmare of checking Grafana dashboards and adjusting YAML files.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why We Avoided VPA Before&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We initially avoided the Vertical Pod Autoscaler due to its destructive updates. &lt;/p&gt;

&lt;p&gt;To change a pod's CPU or Memory requests, VPA had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Evict the Pod.&lt;/li&gt;
&lt;li&gt;Wait for the Scheduler to recreate it with new numbers.&lt;/li&gt;
&lt;li&gt;Hope your application handles the graceful shutdown correctly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For any stateful workloads, or services with larger Docker images, this restart tax was painful. &lt;/p&gt;

&lt;p&gt;Initially, we could run the VPA with &lt;code&gt;updateMode: Off&lt;/code&gt; and double-check the recommendations, but that becomes a painful, manual process, especially if our load fluctuates significantly throughout the day. Why pay for 500m of CPU at 2 A.M. if we only need it during normal business hours?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Game Changer: In-Place Updates&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Enter the &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; update mode for the Vertical Pod Autoscaler, introduced alongside the beta support for in-place pod resizing in Kubernetes 1.33! &lt;/p&gt;

&lt;p&gt;Instead of only resizing pods at creation time via a webhook, the &lt;code&gt;kubelet&lt;/code&gt; can now resize the resources allocated to running containers without restarting them! If the node is full, the pod is rescheduled to a node with the desired compute and memory available.&lt;/p&gt;

&lt;p&gt;This moves VPA from "scary experimental tool" to "essential cost-saving infrastructure."&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How We Implemented It&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We shifted our strategy to use the &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; update mode. Here is the configuration we are rolling out to our stateless workers:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-service
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: backend
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 100Mi
        maxAllowed:
          cpu: 1000m
          memory: 1000Mi
        controlledResources: ["cpu", "memory"]
  updatePolicy:
    minReplicas: 1
    updateMode: InPlaceOrRecreate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Our Rightsizing Workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We don't just turn this on blindly. Here is the safety workflow we use for existing services:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit Mode&lt;/strong&gt; (&lt;code&gt;updateMode: Off&lt;/code&gt;): We deploy the VPA with updates disabled and let it run for a week to gather metrics on actual usage vs. requests (a minimal sketch of this follows the list).  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review Recommendations&lt;/strong&gt;: We check the VPA object status (&lt;code&gt;kubectl describe vpa &amp;lt;name&amp;gt;&lt;/code&gt;) to see what the engine &lt;em&gt;would&lt;/em&gt; do. &lt;em&gt;Pro-tip: If the recommendation is 50% lower than current requests, we know we are wasting money.&lt;/em&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable In-Place&lt;/strong&gt;: Once we trust the baseline, we switch to &lt;code&gt;InPlaceOrRecreate&lt;/code&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor:&lt;/strong&gt; We monitor to make sure that we aren’t getting OOMkills and to verify that application metrics are not trending in the wrong direction.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
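&lt;p&gt;For step 1, a minimal sketch of that audit-only VPA, reusing the hypothetical &lt;code&gt;backend&lt;/code&gt; Deployment from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-service-audit
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: backend
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;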

&lt;h3&gt;
  
  
  &lt;strong&gt;Things to Note&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Three things to note before going through this workflow! &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Some runtimes do not currently support the &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; update mode; it depends on the application code and runtime supporting in-place resizing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When using an HPA and a VPA on the same resource, be careful that the VPA is not raising CPU requests while the HPA is watching CPU utilization to decide whether to scale out additional replicas (one mitigation is sketched after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using KEDA to horizontally scale applications based on &lt;strong&gt;latency, traffic, errors&lt;/strong&gt;, or &lt;strong&gt;saturation&lt;/strong&gt; metrics has essentially solved this for us.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
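&lt;p&gt;For the second point, one common mitigation (a sketch, not the only option) is to split ownership of the resources: let the VPA control memory while the HPA keeps CPU-driven replica decisions to itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-service-memory-only
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: backend
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        controlledResources: ["memory"]   # leave CPU to the HPA
  updatePolicy:
    updateMode: InPlaceOrRecreate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;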

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The in-place VPA allows us to treat resource requests as fluid, living values rather than static guesses. It’s helping us pack nodes tighter and stop paying for wasted resources in our clusters.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>linux</category>
      <category>sre</category>
    </item>
    <item>
      <title>What is eBPF?</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Sat, 07 Dec 2024 21:26:47 +0000</pubDate>
      <link>https://dev.to/klip_klop/what-is-ebpf-ke0</link>
      <guid>https://dev.to/klip_klop/what-is-ebpf-ke0</guid>
      <description>&lt;h3&gt;
  
  
  Intro
&lt;/h3&gt;

&lt;p&gt;The goal of this post is to introduce what eBPF is and give an example as to why we should care about it. At the end, I will share my Dockerfile that you can use to work on eBPF programs on a MacBook with an M-series chip.&lt;/p&gt;

&lt;p&gt;If you don’t plan on sticking around, please at least read the word of caution regarding eBPF. It’s only magic if we decide not to understand it!&lt;/p&gt;

&lt;h3&gt;
  
  
  A Word of Caution Regarding eBPF
&lt;/h3&gt;

&lt;p&gt;I also want to provide an upfront warning. Just because a tool markets itself as using “eBPF” does not mean that it is performant. All tools should be measured and understood before being used in an environment where latency matters. The more middleware we add to an application, the longer calls take. The same is true of running tools in the kernel space.&lt;/p&gt;

&lt;p&gt;When running a program on Kubernetes, or anywhere really, we need to think about security. Do we trust this program to have access to read everything we do? We should always provision with the least access possible. Please do understand the Pod Security Context before implementing any of these solutions.&lt;/p&gt;

&lt;p&gt;Do also read through the eBPF Security Threat Model written by Jack Kelly, James Callaghan, and Andrew Martin. It is a great primer, and really helps you understand eBPF.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is eBPF
&lt;/h3&gt;

&lt;p&gt;eBPF stands for extended Berkeley Packet Filter. It became available in Linux kernel 3.18 and extends the original Berkeley Packet Filter, which was used only to filter and capture network packets. That tool lived in the kernel space, making it fast, but not very accessible to the user.&lt;/p&gt;

&lt;p&gt;eBPF made it possible for users to configure small programs to run on a lot of different hooks, such as:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System Calls
Kernel Functions
Network Events 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This basically means that for most events, eBPF can be configured to run a program in the kernel space. These programs can modify the events, or simply record them, which is what we most often see in the observability space.&lt;/p&gt;

&lt;p&gt;eBPF extended the Linux kernel to the user, allowing us to configure low-level programs to collect, record, and alter data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why run programs in the kernel space?
&lt;/h3&gt;

&lt;p&gt;The programs that run here operate in a sandboxed environment where their code is verified prior to execution. This helps ensure that a program will not crash the kernel. These programs are closer to the source of the events, which means handling events here provides much better latency and response times. eBPF programs can also be loaded without restarting a node or the kernel, meaning programs can be dynamically added to the kernel space while the system is running.&lt;/p&gt;

&lt;p&gt;eBPF really shines due to its low level of execution and the tons of hooks it has into the kernel. This makes it an ideal candidate for low-level observability into how a system is performing, its networking calls, and the security of the system. Some of the core things that eBPF can monitor are:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System Calls
Filesystem Activity
Processes
Security Auditing
Network Traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We can think of eBPF as a magical tool that essentially provides us with x-ray vision into the Linux host.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use eBPF?
&lt;/h3&gt;

&lt;p&gt;Well, monitoring of course! Observability at the lowest level, with less latency, because we observe events right where they happen. This is incredible from a security perspective. We can mitigate and observe filesystem modifications, processes, system calls, and network traffic as they happen on the host.&lt;/p&gt;

&lt;p&gt;It can provide packet filtering and the ability to proactively drop packets that could otherwise be malicious. Other eBPF programs give us deep application insights all the way to the system layer.&lt;/p&gt;

&lt;p&gt;Tools like Cilium are replacing kube-proxy and use eBPF instead of iptables or nftables for super-efficient load balancing. Keep in mind, though, that nftables, iptables’ replacement, also lives in the kernel space. It will be interesting to see performance tests comparing eBPF to efficient use of nftables for load balancing. Although I think we will likely always get “more” when using eBPF, just due to how we can monitor it and export the data to other services.&lt;/p&gt;

&lt;p&gt;Tools like &lt;a href="https://www.parca.dev/" rel="noopener noreferrer"&gt;parca&lt;/a&gt; enable us to quickly find out if we are introducing a lot of latency to our application by understanding how it uses the host’s memory, CPU, and IO. It helps bring to light inefficient system calls, like file.Open(), or perhaps a byte array we are processing slowly in memory that is making an application call 100ms slower than it needed to be.&lt;/p&gt;

&lt;p&gt;Within the next couple of posts, I will add some tutorials to showcase how to build and run eBPF programs. Below is the Dockerfile that I am using. I’ll re-post it once I write up the tutorial, though!&lt;br&gt;
&lt;code&gt;Dockerfile&lt;/code&gt; for eBPF development&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update &amp;amp;&amp;amp; \
    apt-get install -y --no-install-recommends \
    build-essential \
    clang \
    llvm \
    libelf-dev \
    linux-headers-generic \
    pkg-config \
    git \
    curl \
    vim \
    libbpf-dev \
    ca-certificates &amp;amp;&amp;amp; \
    apt-get clean &amp;amp;&amp;amp; \
    rm -rf /var/lib/apt/lists/*

ENV GO_VERSION=1.21.1
# Apple Silicon runs arm64 containers natively, so fetch the arm64 Go toolchain
RUN curl -LO https://golang.org/dl/go${GO_VERSION}.linux-arm64.tar.gz &amp;amp;&amp;amp; \
    tar -C /usr/local -xzf go${GO_VERSION}.linux-arm64.tar.gz &amp;amp;&amp;amp; \
    rm go${GO_VERSION}.linux-arm64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
ENV GOARCH=arm64

RUN ln -s /usr/include/asm-generic /usr/include/asm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We do have to run this as privileged so it will have access to the necessary system resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build . -t ebpf-dev-container
docker run --privileged --rm -it -v $(pwd):/app -w /app ebpf-dev-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>linux</category>
      <category>observability</category>
    </item>
    <item>
      <title>Consistent Deployment Strategies for Kubernetes</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Fri, 29 Nov 2024 19:42:13 +0000</pubDate>
      <link>https://dev.to/klip_klop/consistent-deployment-strategies-for-kubernetes-3i9n</link>
      <guid>https://dev.to/klip_klop/consistent-deployment-strategies-for-kubernetes-3i9n</guid>
      <description>&lt;p&gt;There are a few different ways to deploy manifests to Kubernetes. Picking one and sticking with it can be difficult due to uneven support: some projects ship Helm charts, some prefer Jsonnet, and some only support Kustomize. &lt;/p&gt;

&lt;p&gt;Every manifest that is deployed to Kubernetes should be deployed in the same fashion. This makes it massively easier for the Site Reliability Engineering team to manage the infrastructure and leaves no doubt about what to use when deploying any piece of software. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple is always easier, and oftentimes better.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency over Correctness&lt;/strong&gt;. If I deploy one application with Helm, I'd like to deploy everything with Helm, just to ensure that there is no doubt when deploying anything new. Working in a consistent codebase is always easier than one that switches back and forth, creating more ambiguity. &lt;/p&gt;

&lt;p&gt;This post will explore deploying applications with Kustomize, going over two different examples. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating a simple Kubernetes application.&lt;/li&gt;
&lt;li&gt;Using the Helm plugin to create an application that uses Helm, with Kustomize.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But first I will cover why I like Kustomize in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Kustomize?
&lt;/h2&gt;

&lt;p&gt;It is simple, with minimal overhead, to deploy a single application to a single cluster. &lt;br&gt;
It makes extending an application and applying patches across multiple clusters really simple; no real additional configuration is necessary, aside from a few potentially semi-duplicate manifests in directories prefixed with their cluster name. It is still possible to deploy Helm applications thanks to the Helm integration, and while doing so you keep all the benefits of using Kustomize. &lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying an Application with Kustomize
&lt;/h2&gt;

&lt;p&gt;I will walk through two different examples. The first one will include deploying a simple application with Kustomize across multiple clusters, and the second will include deploying a Helm application using Kustomize across multiple clusters.&lt;/p&gt;

&lt;p&gt;From my last blog post, the directory structure in our gitops repository looks similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gitops/
|-birdbox/
|--base/
|  |--kustomization.yaml
|  |--deploy.yaml
|--overlays/
|--|--dev/
|  |---kustomization.yaml
|  |---ingress.yaml
|  |---secrets.yaml
|  |---service-account-patch.yaml
|--|--prd/
|  |---kustomization.yaml
|  |---ingress.yaml
|  |---secrets.yaml
|  |---service-account-patch.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to deploy this application, all we need to do is point ArgoCD's ApplicationSet at the &lt;code&gt;kustomization.yaml&lt;/code&gt; in the &lt;code&gt;{{cluster}}&lt;/code&gt; overlay. It will use that &lt;code&gt;kustomization.yaml&lt;/code&gt; and grab the one from the &lt;code&gt;base&lt;/code&gt; directory and apply all of those manifests. &lt;br&gt;
To give a more concrete example, the &lt;code&gt;kustomization.yaml&lt;/code&gt; in the &lt;code&gt;dev&lt;/code&gt; directory is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
namespace: birdbox
resources:
  - ../../base
  - ./secrets.yaml
  - ./ingress.yaml
patchesStrategicMerge: # deprecated; newer Kustomize uses the patches field (sketch below)
  - ./service-account-patch.yaml
images:
  - name: org/birdbox
    newName: image-repo:birdbox
    newTag: "2AB9Dd4FF"
# add annotation to all resources for 'dev'
commonAnnotations:
  environment: dev
  language: golang
  repo: birdbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is really nice: it shows us exactly what image should be deployed, what resources we are patching, and the annotations that will be applied to the objects being deployed. It is easy to debug, too, because Kustomize comes bundled with &lt;code&gt;kubectl&lt;/code&gt;. This means we can hop into the &lt;code&gt;{{cluster}}&lt;/code&gt; directory, in this case &lt;code&gt;dev&lt;/code&gt;, and run &lt;code&gt;kubectl kustomize . &amp;gt; outputs.yaml&lt;/code&gt;. This gives us an &lt;code&gt;outputs.yaml&lt;/code&gt; containing all of our manifests from this directory and the referenced &lt;code&gt;base&lt;/code&gt; directory, which we can verify prior to deployment. &lt;/p&gt;
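&lt;p&gt;As an aside, since &lt;code&gt;patchesStrategicMerge&lt;/code&gt; is deprecated, newer Kustomize releases express the same patch through the &lt;code&gt;patches&lt;/code&gt; field. A minimal sketch of the equivalent &lt;code&gt;kustomization.yaml&lt;/code&gt;, reusing the same patch file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
namespace: birdbox
resources:
  - ../../base
patches:
  - path: ./service-account-patch.yaml   # replaces patchesStrategicMerge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;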

&lt;p&gt;As of now, for simplicity, I prefer updating these manifests via CI/CD PRs that update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;images:
  - name: org/birdbox
    newName: image-repo:birdbox
    newTag: "2AB9Dd4FF"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the &lt;code&gt;newTag&lt;/code&gt; field to point to the new image. This means we do not have to worry about versioning, and the tags link directly to a commit in our application repository. &lt;/p&gt;

&lt;h2&gt;
  
  
  Kustomize with Helm
&lt;/h2&gt;

&lt;p&gt;In the case of an application like Datadog or Grafana Loki, it might be a lot simpler to use Helm, which seems to be the normal way most vendors bundle and ship their applications. With the Helm plugin for Kustomize, though, we can stick to our consistent approach and use Kustomize here too.&lt;/p&gt;

&lt;p&gt;In this case our Kustomize manifest will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
namespace: datadog

resources:
  - ../../base
  - ./secrets.yaml

helmCharts:
  - name: datadog
    namespace: datadog  # this sets the namespace for {{ .Release.Namespace }} in the helm chart
    includeCRDs: true
    valuesFile: datadog-values.yaml
    releaseName: datadog
    version: 3.81.0
    repo: https://helm.datadoghq.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ends up following our previous pattern: it gives us a nice way to deploy our custom &lt;code&gt;values.yaml&lt;/code&gt; and stays pretty readable!&lt;/p&gt;

&lt;p&gt;And again, it is really simple to debug with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl kustomize .  --enable-helm &amp;gt; outputs.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I want to clarify something that has bugged me while writing these blog posts. The &lt;code&gt;secrets.yaml&lt;/code&gt; files that have popped up are not raw secrets. Each is an object that references secrets in our cloud provider, which are downloaded and synced into our cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By using Kustomize, we can ensure that all of our manifests are deployed in a consistent manner. It gives us the ability to deal with raw Kubernetes manifests and provides an easy enough interface for patching objects that I do not feel like I am missing out when configuring manifests. Of course, we do not get the full benefit of Helm templates, but I still prefer simple manifests that can be referenced and then patched when necessary. Kustomize also offers a nice, easily debuggable approach that is consistent across all of our applications. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>cloud</category>
      <category>cicd</category>
    </item>
    <item>
      <title>GitOps Across Clusters — How ArgoCD and Kustomize Makes It Simple</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Sat, 23 Nov 2024 03:07:52 +0000</pubDate>
      <link>https://dev.to/klip_klop/gitops-across-clusters-how-argocd-and-kustomize-makes-it-simple-489j</link>
      <guid>https://dev.to/klip_klop/gitops-across-clusters-how-argocd-and-kustomize-makes-it-simple-489j</guid>
      <description>&lt;p&gt;Working with Kubernetes is fun and rewarding, wait... did I get that right? Overwhelming and complex... Just Kidding, I really do enjoy working with Kubernetes. &lt;/p&gt;

&lt;p&gt;If a Site Reliability Engineer were to manage several clusters and lots of different applications in those clusters, it could quickly become stressful. &lt;/p&gt;

&lt;p&gt;Even just thinking about having to run &lt;code&gt;kubectl apply&lt;/code&gt; manually makes me sweat. &lt;/p&gt;

&lt;p&gt;Instead, everything inside of our various Kubernetes clusters is done automatically via ArgoCD's continuous polling of our GitOps repository. &lt;/p&gt;

&lt;p&gt;To better understand why GitOps is an optimal solution, we should know the four principles that sort of make up the idea of GitOps. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's declarative: your end state is defined.&lt;/li&gt;
&lt;li&gt;Application manifests are versioned and their history is stored.&lt;/li&gt;
&lt;li&gt;Application manifests are continuously pulled.&lt;/li&gt;
&lt;li&gt;State is continuously reconciled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now to get to the fun part. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is ArgoCD and why use it?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://argo-cd.readthedocs.io/en/stable/#why-argo-cd" rel="noopener noreferrer"&gt;ArgoCD&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why Argo CD?&lt;br&gt;
Application definitions, configurations, and environments should be declarative and version controlled. Application deployment and lifecycle management should be automated, auditable, and easy to understand&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It embodies the principles governing GitOps and becomes a great tool for the job. It automatically polls our gitops repository and continuously reconciles it, making sure that the desired state is the current state in our Kubernetes cluster. This makes it a lot easier for anyone operating within the cluster to understand the desired state. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kustomize and why use it?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kustomize introduces a template-free way to customize application configuration that simplifies the use of off-the-shelf applications.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kustomize makes it simple to define your application manifest in a &lt;code&gt;base&lt;/code&gt; directory and then define either patches or additional components in an &lt;code&gt;overlay&lt;/code&gt; directory.&lt;br&gt;
I'll go over more examples later in this post. &lt;/p&gt;
&lt;h2&gt;
  
  
  Why not helm?
&lt;/h2&gt;

&lt;p&gt;I don't love templating YAML manifests; in fact, I don't think anyone finds it enjoyable! I also do not want to have to account for new fields in Kubernetes manifests as they become available. &lt;br&gt;
Applying Helm charts is easy enough, but in my experience, maintaining them becomes frustrating. At first, the Helm chart starts as a general-purpose application chart, but over time, more &lt;code&gt;if&lt;/code&gt; statements are added until the entire template becomes difficult to read.&lt;/p&gt;

&lt;p&gt;I have not yet had that experience with Kustomize.&lt;/p&gt;
&lt;h2&gt;
  
  
  How To Effectively use ArgoCD to Deploy Across Multiple Clusters
&lt;/h2&gt;

&lt;p&gt;Argo has two main resources that can be used to deploy applications: an &lt;code&gt;Application&lt;/code&gt; and an &lt;code&gt;ApplicationSet&lt;/code&gt;. I tend to prefer the &lt;code&gt;ApplicationSet&lt;/code&gt; resource. Pairing ArgoCD with Kustomize makes it incredibly easy to set up and maintain multiple applications across multiple clusters. &lt;/p&gt;

&lt;p&gt;To go off on an example, say we want to deploy an application called BirdBox in our Kubernetes clusters (dev, uat, and prd). We could:&lt;br&gt;
Create 3 &lt;code&gt;Application&lt;/code&gt; resources and configure ArgoCD to deploy each of them.&lt;br&gt;
or&lt;br&gt;
Make 1 &lt;code&gt;ApplicationSet&lt;/code&gt; and use a few variables to determine the cluster and path to apply the application automatically across the clusters.&lt;/p&gt;

&lt;p&gt;To get into how we define these &lt;code&gt;ApplicationSet&lt;/code&gt; resources it will be easier to describe our gitops repository layout first. &lt;/p&gt;

&lt;p&gt;We use Kustomize for all of the applications we support and the structure typically is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gitops
├── README.md
├── birdbox
│   ├── base
│   │   ├── kustomization.yaml
│   │   ├── deploy.yaml
│   │   ├── service-account.yaml
│   │   └── service.yaml
│   └── overlays
│       ├── dev
│       │   ├── kustomization.yaml
│       │   ├── ingress.yaml
│       │   ├── secrets.yaml
│       │   ├── service-account-patch.yaml
│       │   ├── hpa-deploy.yaml
│       └── uat
│       │   ├── kustomization.yaml
│       │   ├── ingress.yaml
│       │   ├── secrets.yaml
│       │   ├── service-account-patch.yaml
│       │   ├── hpa-deploy.yaml
│       └── prd
│       │   ├── kustomization.yaml
│       │   ├── ingress.yaml
│       │   ├── secrets.yaml
│       │   ├── service-account-patch.yaml
│       │   ├── hpa-deploy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, back to the &lt;code&gt;ApplicationSet&lt;/code&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: birdbox
  namespace: argocd
spec:
  ignoreApplicationDifferences:
    - jsonPointers:
        - /spec/syncPolicy
  generators:
    - list:
        elements:
          - cluster: dev
            url: https://xxx.xxx.xxx.xxx.xxx.com
          - cluster: uat
            url: https://xxx.xxx.xxx.xxx.xxx.com
          - cluster: prd
            url: https://xxx.xxx.xxx.xxx.xxx.com
  template:
    metadata:
      name: '{{cluster}}-birdbox'
    spec:
      project: default
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
      source:
        repoURL: https://github.com/${ORG}/gitops.git
        targetRevision: main
        path: birdbox/overlays/{{cluster}}
      destination:
        server: '{{url}}'
        namespace: birdbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ApplicationSet will dynamically create the application across the various clusters: &lt;code&gt;{{cluster}}-birdbox&lt;/code&gt;&lt;br&gt;
We can use the name of the &lt;code&gt;{{cluster}}&lt;/code&gt; to dynamically apply the correct path for the application, which matches our gitops application structure.&lt;/p&gt;

&lt;p&gt;To be more specific. In our &lt;code&gt;dev&lt;/code&gt; cluster, this &lt;code&gt;ApplicationSet&lt;/code&gt; will look in our &lt;code&gt;gitops&lt;/code&gt; repository under the path: &lt;code&gt;birdbox/overlays/dev&lt;/code&gt; and that &lt;code&gt;kustomization.yaml&lt;/code&gt; manifest will reference the &lt;code&gt;kustomization.yaml&lt;/code&gt; manifest located in the base of the application directory. This makes it so we only have to define our app manifests once, and then we can define more cluster specific manifests in their own directories. In the above example, we apply a &lt;code&gt;service-account-patch.yaml&lt;/code&gt; to patch an annotation on our service accounts to link them to the IAM Roles for Service Accounts (IRSA). We also tend to keep &lt;code&gt;ingress.yaml&lt;/code&gt; defined at the environment layer, due to differing Ingress-nginx annotations and the lack of desire to patch everything. &lt;/p&gt;

&lt;p&gt;ApplicationSets also provide an easy way to ignore certain differences in applications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ignoreApplicationDifferences:
    - jsonPointers:
        - /spec/syncPolicy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows us to quickly disable application auto-syncing in the ArgoCD UI in case of an emergency where we need to patch something while we work on a more declarative fix. It also helps in the case that ArgoCD starts thrashing because the application state somehow does not match what is in the GitOps repository. &lt;/p&gt;

&lt;h2&gt;
  
  
  ApplicationSets to rule Applications
&lt;/h2&gt;

&lt;p&gt;Earlier I mentioned that I never want to have to run &lt;code&gt;kubectl apply&lt;/code&gt;, and while that is true, there is one manifest that still needs to be applied manually: the &lt;code&gt;Application&lt;/code&gt; that maintains the other &lt;code&gt;ApplicationSets&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: applicationset-controller
  namespace: argocd
spec:
  project: default
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
  destination:
    name: ''
    namespace: argocd
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/${ORG}/gitops.git
    targetRevision: main
    path: argocd/appsets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, with this resource, whenever anyone PRs a new &lt;code&gt;ApplicationSet&lt;/code&gt; to our gitops repository, it will automatically be synced to ArgoCD, and then to the various clusters. This enables us to implement well-scoped Role-Based Access Control (RBAC) and ensure our compliance with the four principles of GitOps. The only exception is for production-related emergencies, which usually involve a breakglass situation.&lt;/p&gt;

&lt;p&gt;In an attempt to keep these posts short and sweet, I will cover how to deploy new application images and go further in depth into our Kustomize manifests in the next post! &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>cloud</category>
      <category>cicd</category>
    </item>
    <item>
      <title>CI/CD Observability and Why it matters.</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Thu, 22 Feb 2024 02:48:20 +0000</pubDate>
      <link>https://dev.to/klip_klop/cicd-observability-and-why-it-matters-o20</link>
      <guid>https://dev.to/klip_klop/cicd-observability-and-why-it-matters-o20</guid>
      <description>&lt;h2&gt;
  
  
  CI/CD Observability
&lt;/h2&gt;

&lt;p&gt;There are plenty of posts on the importance of observability. While it is important to have within the applications running the workloads, it is equally important to have inside the CI/CD pipelines as well.&lt;/p&gt;

&lt;p&gt;Having a deep understanding of your CI/CD pipeline will show when changes happen and how long it takes for a change to become live.&lt;/p&gt;

&lt;p&gt;If an incident happens, how long will it take until the fix is live?&lt;/p&gt;

&lt;p&gt;How many changes result in a failure?&lt;/p&gt;

&lt;p&gt;What is the deployment frequency?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why CI/CD Observability is Important
&lt;/h2&gt;

&lt;p&gt;Normally, when we think about observability, it is in relation to applications and the various endpoints, functions, and database calls associated with them.&lt;/p&gt;

&lt;p&gt;I find that one place that often lacks observability is inside the CI/CD pipelines themselves. Observability there is necessary for us to be able to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How long do builds usually take?&lt;/li&gt;
&lt;li&gt;Do we have flaky tests? If so, which ones are they?&lt;/li&gt;
&lt;li&gt;How many apps are within xx compliance?&lt;/li&gt;
&lt;li&gt;How long do deploys take for a given application?&lt;/li&gt;
&lt;li&gt;Did this rise or fall within the last week, and if so, why?&lt;/li&gt;
&lt;li&gt;How many times is code pushed for X repository?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Observability Can Help
&lt;/h2&gt;

&lt;p&gt;A lot of these questions are extremely difficult to answer unless we are logging and gathering traces of the CI/CD pipeline throughout its various stages.&lt;/p&gt;

&lt;p&gt;But a lot of these questions are really important.&lt;/p&gt;

&lt;p&gt;For example, if we know an application had been fully deploying within 5 minutes and it is trending toward 7 minutes, we can start to inspect other areas of the system that might be the root cause. This will also help visually highlight when the trend started.&lt;/p&gt;

&lt;p&gt;In that example we might look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;has application startup time increased?&lt;/li&gt;
&lt;li&gt;are we fighting pod disruption budgets due to startup time increase?&lt;/li&gt;
&lt;li&gt;are we having trouble scheduling our workloads? What does node pressure look like?&lt;/li&gt;
&lt;li&gt;do we need to scale up new nodes preemptively?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;From there, if we are logging and graphing the various metrics we need to make these determinations, we can go back and easily see what changed and when the deployment time increased. This makes it relatively easy to determine the commit it happened on and start diagnosing the root cause from there, or to identify the system responsible for the new delay.&lt;/p&gt;

&lt;p&gt;Having these metrics and this level of observability within the CI/CD pipeline enables easy rollbacks, notifies the relevant developers, and helps ensure that the applications remain healthy and happy.&lt;/p&gt;

&lt;p&gt;Embedding observability into the CI/CD pipeline not only helps us understand code deployment, it also highlights deficiencies in our testing frameworks. We can use observability to identify unreliable, flaky tests. Once we have identified them, we can look for commonalities between failures and put those metrics into more visual graphs. This helps us pinpoint a test and either fix it or remove it.&lt;/p&gt;

&lt;p&gt;Follow-up to come on how we are implementing observability inside of GitHub Actions!&lt;/p&gt;

</description>
      <category>sre</category>
      <category>telemetry</category>
      <category>cicd</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>KubeCon + CloudNativeCon 2023 - NA</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Sun, 12 Nov 2023 04:47:37 +0000</pubDate>
      <link>https://dev.to/klip_klop/kubecon-cloudnativecon-2023-na-1cc5</link>
      <guid>https://dev.to/klip_klop/kubecon-cloudnativecon-2023-na-1cc5</guid>
      <description>&lt;h2&gt;
  
  
  Thoughts
&lt;/h2&gt;

&lt;p&gt;My thoughts on KubeCon + CloudNativeCon&lt;/p&gt;

&lt;p&gt;This year the conference was at McCormick Place in Chicago. A beautiful city, and a large venue. There was plenty of room for the various talks, which all filled up quite fast. There were a few that I did not make it to because of the number of people, which is awesome though. I will follow up and watch those sessions as they come out later!&lt;/p&gt;

&lt;p&gt;There was a cool sticker system for the badges: green, yellow, and red stickers, each indicating your social level. Green: open to talk; yellow: hesitant; red: not interested. These also served as a comfort level for social distancing. I found this helpful because I enjoy the networking part of the conference; it made it easy to identify others who enjoyed taking moments to talk about the conference.&lt;/p&gt;

&lt;p&gt;The KubeCrawl on the first day of the conference was a lot of fun. So many vendors took part and made it a wonderful event. Some gave out fresh-baked cookies, cotton candy, and brews, and hosted small carnival-style challenges. This made for a really fun night. Also, it was great to see Phippy and a few of their friends out wandering around the event!&lt;/p&gt;

&lt;p&gt;The breakfast and lunch provided had a nice set of healthy options each day, which kept me feeling ready to learn and take part.&lt;br&gt;
McCormick Place also provided a coat and bag drop-off area on the lower level, open from before the event started until well after it ended. This made it really nice, so that you did not have to carry around coats or swag all day.&lt;/p&gt;

&lt;p&gt;Another thing I appreciated was the staff stationed around the event; they were helpful in directing people toward the various events and talks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Networking and Community
&lt;/h2&gt;

&lt;p&gt;As a Site Reliability Engineer, I have a lot of fun with the variety of tasks I get to work on. A typical day might look like improving existing infrastructure, coming up with novel solutions to a problem, or helping out and making sure the developers have what they need from us so that they can continue to be productive.&lt;/p&gt;

&lt;p&gt;At KubeCon there was a good mix of professions: developers, product managers, VPs, CTOs, other SREs, and DevOps practitioners. This gave me a good chance to get out of my comfort zone and talk with so many people. I was able to meet a lot of the community members that help make up the Cloud Native Computing Foundation (CNCF).&lt;/p&gt;

&lt;p&gt;Being a part of this passionate community is one of my favorite parts of what I get to do. Day-to-day work can sometimes be isolating, especially with so many jobs being remote. I am not sure anyone enjoys building something entirely by themselves; I know I value input from my coworkers and the community. Conferences like this provide a great opportunity to go out and seek that input, especially from other industry experts.&lt;/p&gt;

&lt;p&gt;A funny example: if I were to build something in isolation on my own island, it might turn out pretty cool. But if I could tour hundreds of other islands and get feedback from other people before building that feature, it would likely turn out to be something great. Sure, there is plenty of googling and research you can do, but to me that never compares to going out and talking to other members of the community.&lt;/p&gt;

&lt;p&gt;I also love hearing about the new and exciting things happening within the Cloud Native community. It was fun celebrating the various projects that the community is now incubating, and especially those that graduated. Seeing all that hard work become something excellent is wonderful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exposure
&lt;/h2&gt;

&lt;p&gt;Conferences like this also act as a sort of North Star for me. I will go for months plugging away and having fun, but these conferences always shed new light on things and re-energize my drive to do what I do.&lt;/p&gt;

&lt;p&gt;They also expose us to the new technologies and tools coming out, which helps us do our jobs better and enhance our systems' reliability and efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks that I enjoyed
&lt;/h2&gt;

&lt;p&gt;There were a lot of good sessions that brought interesting perspectives and novel solutions.&lt;/p&gt;

&lt;p&gt;Specific sessions that I enjoyed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A Tiny Talk on Tiny Containers -- Eric Gregory&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://drive.google.com/file/d/1dJ81z9Gss3K5hcXUcSYbiawD5WxW01N1/view" rel="noopener noreferrer"&gt;slides&lt;/a&gt;
Reducing container image size makes images more efficient, more sustainable, and easier to secure. This quick talk focuses on those three key things as it demonstrates best practices for building Docker images, then gives a quick peek at WASM (WebAssembly) and how tiny its footprint can be.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;15,000 Minecraft Players Vs One K8s Cluster. Who Wins? -- Justin Head, Super League Gaming &amp;amp; Cornelia Davis, Spectro Cloud&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://static.sched.com/hosted_files/kccncna2023/ac/15%2C000%20Minecraft%20players%20vs.%20one%20K8s%20cluster-%20SLG%20%26%20SC.pdf" rel="noopener noreferrer"&gt;slides&lt;/a&gt;
In this talk Justin Head and Cornelia Davis demonstrated how much MaaS (metal as a service) and the Cluster API (CAPI) have evolved and simplified deploying a Kubernetes cluster to an on-prem datacenter. By doing this they saw a 55-60% cost reduction for their machine costs, and a 90-100% cost reduction for networking. This did come with some interesting problems, though: due to the nature of physical machines, the boot time for their nodes was upwards of 15 minutes, so a lot of work went into pre-provisioning and keeping nodes ready to go. This ensured players would not have to wait 15 minutes for new nodes to come online.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;A Practical Guide to Debugging Browser Performance with OpenTelemetry -- Purvi Kanal&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://static.sched.com/hosted_files/kccncna2023/9a/Kubecon%20Web%20Perf%20Talk-1.pdf" rel="noopener noreferrer"&gt;slides&lt;/a&gt;
This talk gave a quick intro to how page load time was, and often still is, measured. It then dove deeper into how we can instrument OpenTelemetry and use it to gather more meaningful metrics from real users.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;I will be adding more to this list as I have time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swag
&lt;/h2&gt;

&lt;p&gt;I was not planning on mentioning swag when I decided to write this, but anyone who knows me knows I love reading.&lt;/p&gt;

&lt;p&gt;A lot of vendors this year gave out books as swag. I always appreciate stickers and shirts, but the books are valuable. A few of the books that I received from the various sponsors were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitOps Cookbook, Kubernetes Automation in Practice by Natale Vinto &amp;amp; Alex Soto Bueno&lt;/li&gt;
&lt;li&gt;Observability Engineering, Achieving Production Excellence by Charity Majors, Liz Fong-Jones &amp;amp; George Miranda&lt;/li&gt;
&lt;li&gt;Kubernetes Up &amp;amp; Running, Dive into the Future of Infrastructure by Brendan Burns, Joe Beda, Kelsey Hightower &amp;amp; Lachlan Evenson&lt;/li&gt;
&lt;li&gt;DevSecOps in Kubernetes, by Wei Lien Dang &amp;amp; Ajmal Kohgadai (Report)&lt;/li&gt;
&lt;li&gt;What is eBPF? An Introduction to a New Generation of Networking, Security and Observability Tools by Liz Rice (Report)&lt;/li&gt;
&lt;li&gt;A Gentle Introduction To OpenSearch by Mitch Seymour&lt;/li&gt;
&lt;li&gt;Phippy's Field Guide to WASM by Matt Butcher &amp;amp; Karen Chu&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Alert Fatigue, and How to Fix it</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Sat, 11 Nov 2023 04:17:13 +0000</pubDate>
      <link>https://dev.to/klip_klop/alert-fatigue-and-how-to-fix-it-34nl</link>
      <guid>https://dev.to/klip_klop/alert-fatigue-and-how-to-fix-it-34nl</guid>
      <description>&lt;h2&gt;
  
  
  What is Alert Fatigue?
&lt;/h2&gt;

&lt;p&gt;Anyone working in tech, especially as a Site Reliability Engineer or in a DevOps role, is very likely facing a barrage of alerts flagging problems across the many services they support.&lt;/p&gt;

&lt;p&gt;Alert fatigue generally happens when alerts are not actionable, or are so frequent that you eventually tune out the Slack channel, because it would be impossible to get any work done while triaging every alert.&lt;/p&gt;

&lt;p&gt;A good example of alerts that end up causing fatigue are utilization alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;disk space is high&lt;/li&gt;
&lt;li&gt;memory for a service is high&lt;/li&gt;
&lt;li&gt;cpu utilization is high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a service is near a metric threshold, we often get alerted about it, but if that alert is not actionable, it is not actually helpful.&lt;/p&gt;

&lt;p&gt;I find that implementing these utilization alerts with an evaluation period certainly helps reduce the stream of alerts, because we are no longer capturing mere utilization spikes. For example, if a memory alert is set to fire on &lt;code&gt;max_memory &amp;gt; 85%&lt;/code&gt;, it can become really noisy. A better alert might look like &lt;code&gt;avg_memory &amp;gt; 85% for 5 minutes&lt;/code&gt;.&lt;br&gt;
This still captures the samples that would indicate we need to increase requests/limits for a service, but it is much less noisy.&lt;/p&gt;
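
&lt;p&gt;In Prometheus-style notation (a sketch; the metric name is illustrative), the difference looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# noisy: fires on any instantaneous spike
service_memory_utilization_ratio &amp;gt; 0.85

# quieter: only fires when the five-minute average is high
avg_over_time(service_memory_utilization_ratio[5m]) &amp;gt; 0.85

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;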

&lt;h2&gt;
  
  
  The Impacts
&lt;/h2&gt;

&lt;p&gt;Alert fatigue can cause us to overlook or miss the alerts that are actually important. Alternatively, an engineer might spend their whole day looking into alerts without realizing they are set a bit too sensitively. I have seen this happen numerous times: one engineer ignores all of them because, in the past, most have not been actionable, while the next engineer on call spends their entire shift looking into each alert when the likely culprit is just normal load.&lt;/p&gt;

&lt;p&gt;The actionable item at this point is adjusting the alert, or the CPU/memory requests and limits.&lt;/p&gt;
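
&lt;p&gt;For a service running on Kubernetes, that adjustment might look something like this (the deployment name and values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# raise the memory request/limit so normal load no longer trips the alert
kubectl set resources deployment/my-service \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=500m,memory=1Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;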

&lt;h2&gt;
  
  
  Combating Alert Fatigue
&lt;/h2&gt;

&lt;p&gt;The first thing engineers can and should do is make the alerts as simple as possible. If there is a noisy alert plaguing you today, ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is this alerting?&lt;/li&gt;
&lt;li&gt;Is this actionable?

&lt;ul&gt;
&lt;li&gt;How do I make this actionable? &lt;strong&gt;This should become the new alert.&lt;/strong&gt;
The solution might be to add a longer evaluation period, increase memory/CPU, or remove the alert.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Challenge of Alarms
&lt;/h2&gt;

&lt;p&gt;Oftentimes teams do not get to set up their alerts from the ground up, but even when they do, it is hard not to alert on everything. To get detailed memory and CPU alerts for a new service, we could load test in a production-like environment, but not everyone has the time or infrastructure to set that up.&lt;br&gt;
To set up these alerts for a service that has been running for a while, hopefully we have historical metrics to look at. As these alerts start to fire, we should keep revising them until there are very few of them, or until they are actionable when they do happen.&lt;/p&gt;

&lt;p&gt;Another interesting thing we should look at is composite alerts.&lt;/p&gt;

&lt;p&gt;If a service has &amp;gt; 85% memory utilization, how does this affect the service? Are we noticing latency increases? Has our error rate gone up?&lt;br&gt;
These are the factors that would provide really actionable alerts that do not just barrage the Slack channel.&lt;/p&gt;
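
&lt;p&gt;A composite alert along those lines might look like this Prometheus-style sketch (metric names and thresholds are illustrative), paging only when a high error rate coincides with high memory saturation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# error rate above 5% AND memory saturation above 85%
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
) &amp;gt; 0.05
and
avg(service_memory_utilization_ratio) &amp;gt; 0.85

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;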

&lt;p&gt;If we see an increase in &lt;em&gt;latency&lt;/em&gt;, &lt;em&gt;traffic&lt;/em&gt;, &lt;em&gt;errors&lt;/em&gt;, or &lt;em&gt;saturation&lt;/em&gt;, we likely need to know about it, but it is usually not just one thing that got us to that point. That is why it is just as important to have a runbook or dashboard attached to each alert. It helps ensure that if we are alerted on request latency going up, we can quickly verify that the application has enough memory to handle the requests, and then start digging into downstream factors such as database latency, or perhaps a ton of cache misses from our Redis instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Setting up composite alerts can be difficult to get right. I prefer to keep alerts as simple as possible.&lt;/p&gt;

&lt;p&gt;Ideally, I'd like an alert on the &lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;four golden signals&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;latency&lt;/em&gt;: the time it takes to service a request&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;traffic&lt;/em&gt;: how much demand is being placed on the system (requests per second)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;errors&lt;/em&gt;: the rate of requests that fail

&lt;ul&gt;
&lt;li&gt;note: requests that succeed but show the wrong content are also considered errors; they are just much harder to capture&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;em&gt;saturation&lt;/em&gt;: a measure of utilization of memory, CPU, and disk; how much load can the system handle?&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These alerts should only be created when we can link a dashboard or a runbook to them. That saves valuable time for the engineer who is looking into the alert.&lt;/p&gt;

&lt;p&gt;For example, if we alerted on receiving a high rate of errors, I would expect to see a dashboard or runbook indicating that we should look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what kind of errors are happening, and how many are there?&lt;/li&gt;
&lt;li&gt;sum of 500 internal server error&lt;/li&gt;
&lt;li&gt;sum of 501 not implemented&lt;/li&gt;
&lt;li&gt;sum of 502 bad gateway&lt;/li&gt;
&lt;li&gt;sum of 503 service unavailable&lt;/li&gt;
&lt;li&gt;sum of 504 gateway timeouts
If we can break up the alerts, that's great; if not, we can aggregate them under &lt;code&gt;http_code &amp;gt; 500 for 5 minutes&lt;/code&gt; (see the sketch after this list).
From here I would like to see, at a quick glance, answers to the following questions:&lt;/li&gt;
&lt;li&gt;Are the various services up?&lt;/li&gt;
&lt;li&gt;When was the last deploy pushed, and what was deployed?

&lt;ul&gt;
&lt;li&gt;Is the service being deployed right now?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;How do saturation and latency look?

&lt;ul&gt;
&lt;li&gt;Are resources saturated?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Is any downstream system affecting something upstream? A service map can be really helpful here, especially if we can see the latency and saturation of those services at a glance.&lt;/li&gt;

&lt;/ul&gt;
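
&lt;p&gt;As referenced in the list above, here is a Prometheus-style sketch (metric name illustrative) of the per-status-code breakdown that could back such a dashboard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# errors by status code over the last five minutes
sum by (code) (increase(http_requests_total{code=~"5.."}[5m]))

# or the aggregate form behind a single alert (threshold illustrative)
sum(rate(http_requests_total{code=~"5.."}[5m])) &amp;gt; 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;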

&lt;p&gt;If anyone has any feedback or suggestions, please let me know!&lt;br&gt;
I would love to have a conversation on how you are handling alerts for your services!&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Navigating Security and Compliance</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Mon, 28 Aug 2023 15:34:28 +0000</pubDate>
      <link>https://dev.to/klip_klop/navigating-security-and-compliance-fd</link>
      <guid>https://dev.to/klip_klop/navigating-security-and-compliance-fd</guid>
      <description>&lt;h2&gt;
  
  
  The Challenge of Silos in Team Collaboration
&lt;/h2&gt;

&lt;p&gt;Teams within companies often operate in specialized areas, focusing on their unique responsibilities. Development teams work on features and maintenance, security teams emphasize compliance, and platform teams build tools for developers. While each team has its part in achieving overarching business objectives, collaboration is key to ensuring that efforts align with the company's goals.&lt;/p&gt;

&lt;p&gt;Even in smaller companies, where teams may share resources like a Jira board, silos can form. It's the responsibility of managers and individual contributors to foster communication and prevent barriers that hinder collaboration.&lt;/p&gt;

&lt;p&gt;Understanding the bigger picture is essential. Developers may not need to know specific IP address ranges, but they should be aware of how things run across various environments. The platform team should understand development efforts that might impact server load.&lt;/p&gt;

&lt;p&gt;Isolation can lead to a lack of holistic understanding, diminishing the value of individual contributions to business goals. Collaboration bridges these gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bridging the Gap Between Security and Development
&lt;/h2&gt;

&lt;p&gt;Sometimes it feels like there is a war: security versus the rest of the engineering teams. But this happens because these teams are not talking on a daily, or at least weekly, basis; there is no shared set of priorities. Security is focused on keeping the company within the various compliance frameworks and making sure it does not end up on the news. Developers are focused on maintaining the various systems, or are sprinting to create new features. The platform team is usually juggling priorities between the various teams they support.&lt;/p&gt;

&lt;p&gt;But there will come a time when security creates high-priority tickets and cites some obscure, but very valid, compliance article that gives 30, 60, or 90 days to remediate them or fall out of compliance. Their tickets then jump to the front of the queue, pausing the developers' and platform team members' work. This is where the frustration comes in. Now the various engineering managers have to make a choice: do we keep all this work in progress and focus on paying down security debt? Or do we finish the work in progress, if that is possible within the time frame, and then tackle the security tickets, hoping they do not take so long that we fall out of compliance?&lt;/p&gt;

&lt;p&gt;This is why it is important that the security team, the development team, and the platform team meet to discuss priorities, observe each other's work, and determine how to achieve business goals together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aligning Security with Business Goals
&lt;/h2&gt;

&lt;p&gt;Sometimes I have felt that security teams see their only goal as keeping the company within compliance. It is so much more than that. They should be involved in setting business goals and take a large role in vendor selection. They should be one of the determining forces when a team is deciding whether to build something in house or to outsource it. Having security involved earlier is always better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Security and Working as a Team
&lt;/h2&gt;

&lt;p&gt;I've run into it so many times: the security team is unaware of which domains the developers are responsible for, or even what the platform team members are responsible for.&lt;br&gt;
This is as unacceptable as a platform team member not knowing what the network topology consists of.&lt;/p&gt;

&lt;p&gt;Involving security from the start helps them be aware of when an effort is related to marketing, or when a specific domain needs to be PCI compliant. Too many times I have seen a security team run their automated scans with little insight into how the business is actually run. They come back with a bunch of tickets for issues that are likely already solved by firewall rules or network topology. What peeves me the most is when these scans have full access to the system, bypassing many of those security controls. This results in tickets for issues that are already mitigated but, from security's perspective, are not.&lt;/p&gt;

&lt;p&gt;This is hard to balance, though, and I am not sure of the right answer. I agree with security at every layer, but the company still has to run, and we still have to meet business objectives.&lt;/p&gt;

&lt;p&gt;A good security team has established secure development practices for the developers, reducing the number of vulnerabilities introduced into the system. They will understand the business objectives and be in vendor talks. They will understand the network topology and the various controls in place to prevent intrusion. Most of all, they will not create a burden for other engineering teams, but will try to limit the scope of security work to what is needed to maintain compliance.&lt;/p&gt;

&lt;p&gt;When this happens, other teams work more with the security team and are happier to involve them in product discussions sooner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus Checklist
&lt;/h3&gt;

&lt;p&gt;Most people in the engineering department, especially in a smaller company, should have a rough idea of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the general business objectives we are working toward?&lt;/li&gt;
&lt;li&gt;How does what I am doing help in achieving those goals?&lt;/li&gt;
&lt;li&gt;At a high level, what systems do we use or support?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By embracing collaboration and breaking down silos, companies can foster a more unified and effective approach to achieving their goals.&lt;/p&gt;

&lt;p&gt;I would love to hear anyone's thoughts on this subject! Please leave a comment and let me know how you are breaking down communication barriers at your company!&lt;/p&gt;

</description>
      <category>security</category>
      <category>discuss</category>
      <category>cybersecurity</category>
      <category>community</category>
    </item>
    <item>
      <title>Dive Into Docker part 4: Inspecting Docker Image</title>
      <dc:creator>Kacey Gambill</dc:creator>
      <pubDate>Mon, 21 Aug 2023 19:11:56 +0000</pubDate>
      <link>https://dev.to/klip_klop/dive-into-docker-part-4-inspecting-docker-image-568o</link>
      <guid>https://dev.to/klip_klop/dive-into-docker-part-4-inspecting-docker-image-568o</guid>
      <description>&lt;p&gt;This post is going to be shorter. I'd like to highlight a tool that I really enjoy working with called "&lt;a href="https://github.com/wagoodman/dive" rel="noopener noreferrer"&gt;Dive&lt;/a&gt;" &lt;/p&gt;

&lt;p&gt;Dive is an essential tool when building or inspecting Dockerfiles. It can pinpoint exactly what is contained in each layer of the resulting image. Specifically, it&lt;br&gt;
quickly combs through the image and tries to show wasted space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9qy4jjq9mjkp9lmwll9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9qy4jjq9mjkp9lmwll9.png" alt="dive image wasted space"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Dive
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/wagoodman/dive#installation" rel="noopener noreferrer"&gt;installation-instructions&lt;/a&gt;&lt;br&gt;
My preferred way to install Dive on a Mac is with brew: &lt;code&gt;brew install dive&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Dive
&lt;/h2&gt;

&lt;p&gt;I prefer to use Dive during local development of Docker containers. To get started I typically just run &lt;code&gt;dive image-name&lt;/code&gt;; if the image is not found locally, this takes care of pulling it from the remote repository. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: tmux keybindings will get in the way; I usually detach from tmux or open another terminal session before using &lt;code&gt;dive&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;dive ruby:3.2.0&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk49uwm4efb1uqu07oct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk49uwm4efb1uqu07oct.png" alt="dive ruby:3.2.0"&gt;&lt;/a&gt;&lt;br&gt;
It first pulls the image if it is not found locally, and then we are presented with "Layers", "Layer Details", "Image Details" and "Current Layer Contents".&lt;/p&gt;

&lt;p&gt;Press &lt;code&gt;Tab&lt;/code&gt; to move between views.&lt;br&gt;
In each view, it presents us with a few more hotkeys that we can use to further inspect the image. &lt;/p&gt;

&lt;p&gt;Looking at the "Layers" tab, it presents us with either "layer changes" or "aggregated changes" on the right-hand side. &lt;br&gt;
You can press &lt;code&gt;Ctrl+L&lt;/code&gt; (layer changes) or &lt;code&gt;Ctrl+A&lt;/code&gt; (aggregated changes) to switch between these two. &lt;/p&gt;

&lt;p&gt;Before moving to the "Layer Contents" view, I like to pick through the various&lt;br&gt;
"Layer Details" right below "Layers"&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpafzpkcklz1yexj94o38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpafzpkcklz1yexj94o38.png" alt="dive ruby:3.2.0, layer details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here it shows the command that was run to generate that layer. &lt;/p&gt;

&lt;p&gt;On the right-hand side of the screen we can see "Current Layer Contents"; this includes the files that were added or removed, the permissions on those files, and how much space they take up. &lt;br&gt;
If we &lt;code&gt;Tab&lt;/code&gt; over to that view, it presents a few new options (keybindings as listed in the Dive README):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Space&lt;/code&gt; - collapse a single dir&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl+Space&lt;/code&gt; - collapse all dirs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl+A&lt;/code&gt; - show/hide added files&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl+R&lt;/code&gt; - show/hide removed files&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl+M&lt;/code&gt; - show/hide modified files&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl+U&lt;/code&gt; - show/hide unmodified files&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl+B&lt;/code&gt; - show/hide file attributes&lt;/li&gt;
&lt;li&gt;wrap the file list (the keybinding is shown in Dive's on-screen footer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I prefer to start by collapsing all dirs and then digging into the layers that show the largest increase in file space. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq1nnmb3395e87sw6uur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq1nnmb3395e87sw6uur.png" alt="dive ruby:3.2.0, current layer details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Dive in a Continuous Integration Pipeline
&lt;/h2&gt;

&lt;p&gt;Running Dive with &lt;code&gt;CI=true&lt;/code&gt; is one of the most effective ways to quickly find wasted space.&lt;br&gt;
Example: &lt;code&gt;CI=true dive ruby:3.2.0&lt;/code&gt; &lt;br&gt;
This can also be plugged into a Docker image pipeline to ensure that a ridiculous amount of image space is not wasted on unneeded assets.&lt;/p&gt;

&lt;p&gt;Full output here: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;CI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;dive ruby:3.2.0
  Using default CI config
Image Source: docker://ruby:3.2.0
Fetching image... &lt;span class="o"&gt;(&lt;/span&gt;this can take a &lt;span class="k"&gt;while for &lt;/span&gt;large images&lt;span class="o"&gt;)&lt;/span&gt;
Analyzing image...
  efficiency: 98.8316 %
  wastedBytes: 11616315 bytes &lt;span class="o"&gt;(&lt;/span&gt;12 MB&lt;span class="o"&gt;)&lt;/span&gt;
  userWastedPercent: 1.6002 %
Inefficient Files:
Count  Wasted Space  File Path
    6        5.0 MB  /var/cache/debconf/templates.dat
    4        3.2 MB  /var/cache/debconf/templates.dat-old
    6        1.2 MB  /var/lib/dpkg/status
    6        1.2 MB  /var/lib/dpkg/status-old
    5        376 kB  /var/log/dpkg.log
    5        194 kB  /var/log/apt/term.log
    6         95 kB  /etc/ld.so.cache
    6         86 kB  /var/cache/debconf/config.dat
    6         71 kB  /var/lib/apt/extended_states
    5         54 kB  /var/cache/ldconfig/aux-cache
    5         52 kB  /var/log/apt/eipp.log.xz
    4         42 kB  /var/cache/debconf/config.dat-old
    5         36 kB  /var/log/apt/history.log
    4         26 kB  /var/log/alternatives.log
    2         903 B  /etc/group
    2         892 B  /etc/group-
    2         756 B  /etc/gshadow
    2           0 B  /etc/.pwd.lock
    6           0 B  /tmp
    5           0 B  /var/cache/apt/archives/partial
    3           0 B  /var/lib/dpkg/triggers/Unincorp
    6           0 B  /var/lib/dpkg/lock-frontend
    5           0 B  /var/cache/apt/archives/lock
    6           0 B  /var/lib/dpkg/lock
    6           0 B  /var/cache/debconf/passwords.dat
    5           0 B  /var/lib/apt/lists
    2           0 B  /usr/src
    6           0 B  /var/lib/dpkg/triggers/Lock
    6           0 B  /var/lib/dpkg/updates
Results:
  PASS: highestUserWastedPercent
  SKIP: highestWastedBytes: rule disabled
  PASS: lowestEfficiency
Result:PASS &lt;span class="o"&gt;[&lt;/span&gt;Total:3] &lt;span class="o"&gt;[&lt;/span&gt;Passed:2] &lt;span class="o"&gt;[&lt;/span&gt;Failed:0] &lt;span class="o"&gt;[&lt;/span&gt;Warn:0] &lt;span class="o"&gt;[&lt;/span&gt;Skipped:1]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this particular image we could go through and remove those files, but in this case they do not take up a significant amount of room, so it is unnecessary.&lt;br&gt;
&lt;a href="https://github.com/wagoodman/dive#ci-integration" rel="noopener noreferrer"&gt;more configuration options&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dealing with Sensitive Data
&lt;/h2&gt;

&lt;p&gt;Do not pass sensitive details into Dockerfiles through build args or environment variables during image creation. Simply inspecting the resulting image layers will expose these secrets.&lt;/p&gt;

&lt;p&gt;If a Dockerfile needs sensitive data, pass it using buildx secret mounts.&lt;/p&gt;

&lt;p&gt;This can be done either with a file containing the secret value, or with an environment variable containing the secret. &lt;/p&gt;

&lt;p&gt;First, create a file named &lt;code&gt;build_key&lt;/code&gt; with the value&lt;br&gt;
&lt;code&gt;xyz:xyz&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, add this to the Dockerfile to mount and access the secret.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret,id&lt;span class="o"&gt;=&lt;/span&gt;build_key
&lt;span class="c"&gt;# to access the secret:&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"using build_key: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /run/secrets/build_key&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="c"&gt;# note this is an example&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, when running the docker build with buildx, we pass the secret:&lt;br&gt;
&lt;code&gt;docker buildx build --secret id=build_key,src=build_key .&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If we were to use an environment variable containing the secret the command to build would look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;build_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xyz:xyz
docker buildx build &lt;span class="nt"&gt;--secret&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;build_key,env&lt;span class="o"&gt;=&lt;/span&gt;build_key


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example of Secrets Leaking
&lt;/h3&gt;

&lt;p&gt;This is a rough example, because we would likely never need to add a database connection string at build time, but there are a few apps that require a build license when installing packages, or a way to authenticate to a remote GitHub server.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ubuntu&lt;/span&gt;

&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; build_license \&lt;/span&gt;
    postgres_db_string

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; build_license=$build_license \&lt;/span&gt;
    postgres_db_string=$postgres_db_string

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; echo "secret_sauce: $secret_sauce" \&lt;/span&gt;
 &amp;amp;&amp;amp; echo "build_license: $build_license" \
 &amp;amp;&amp;amp; echo "postgres_db_string: $postgres_db_string"



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To see these details in an image, all that is needed is for the image to exist locally. Run &lt;code&gt;docker save &amp;lt;image-name&amp;gt; -o &amp;lt;image.tar&amp;gt;&lt;/code&gt;, then inspect the tar archive with &lt;code&gt;vim&lt;/code&gt; to see the layer contents.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

" tar.vim version v32
" Browsing tarfile /Users/kaceygambill/personal/ubuntu-mount/blog/4/test.tar
" Select a file with cursor and press ENTER

444f68a42c829ead4bff4566c6554c761e2075c92d2eef50cbb9152fde8b13cc/
444f68a42c829ead4bff4566c6554c761e2075c92d2eef50cbb9152fde8b13cc/VERSION
444f68a42c829ead4bff4566c6554c761e2075c92d2eef50cbb9152fde8b13cc/json
444f68a42c829ead4bff4566c6554c761e2075c92d2eef50cbb9152fde8b13cc/layer.tar
a93a4c1e4d72d16b55e6aae767bb48e862a4ad8a43ab33107f8d5dfdc749912b.json
ee72d37eae4759eeaadd189b4341c0418faa7662ebc5089ddb528b4640e08c2f/
ee72d37eae4759eeaadd189b4341c0418faa7662ebc5089ddb528b4640e08c2f/VERSION
ee72d37eae4759eeaadd189b4341c0418faa7662ebc5089ddb528b4640e08c2f/json
ee72d37eae4759eeaadd189b4341c0418faa7662ebc5089ddb528b4640e08c2f/layer.tar
manifest.json
repositories


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Looking at any one of those .json files gives us more details about the layer.&lt;br&gt;
Expanding &lt;code&gt;444f68a42c829ead4bff4566c6554c761e2075c92d2eef50cbb9152fde8b13cc/json&lt;/code&gt;&lt;br&gt;
I can see a JSON object that includes the sensitive data.&lt;/p&gt;
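
&lt;p&gt;Saving the image is not even required for a quick check; something like this (the image name is hypothetical) can surface the leak directly, since &lt;code&gt;docker inspect&lt;/code&gt; shows the image's environment and &lt;code&gt;docker history&lt;/code&gt; shows the recorded build commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# ENV values land in the image config
docker inspect leaky-image | grep -i postgres_db_string

# ARG values are often visible in the recorded build commands
docker history --no-trunc leaky-image | grep -i build_license

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;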

&lt;p&gt;If you haven't tried &lt;a href="https://github.com/wagoodman/dive" rel="noopener noreferrer"&gt;Dive&lt;/a&gt;, I'd highly&lt;br&gt;
suggest checking it out and implementing it as a check in your CI/CD pipelines!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
