
Shawn Gestupa

Posted on • Originally published at blog.smgestupa.dev

Improving on DevOps with Kubernetes, Helm, and GitOps

For the past few weeks, I shifted my focus to building a three-tier application declaratively with Kubernetes, making it more configurable with Helm, and implementing GitOps with ArgoCD for automated deployments.

With the added help of Prometheus and Grafana deployed via Helm, I was also able to improve the observability of my application. Loki and Tempo were also implemented to enhance log tracing within my cluster.

In one of my posts, I mentioned the growing demand for containerization skills, which is why I decided to learn more about Docker, and even tinker with it, as a way to further understand containers.

Combined with my previous experience in GitOps, where I started by building CI/CD pipelines, the training went smoothly, and passing it made me truly believe that I'm getting closer to my aspirations.

Improving my DevOps skills

Disclaimer: I haven't linked the repositories used for the training, since I'm not sure I'm allowed to; as of writing, the training was held internally by my employer.

Unfortunately, I had to temporarily postpone pursuing certificates, but on the bright side, I was given the opportunity to improve my principles of DevOps engineering.

I've learned that DevOps engineering is more than just containerizing apps or automating deployments: it's about having a proactive mindset. Creating systems where problems can be mitigated before they reach customers, through automation & observability, enables an engineer to react in real time while constantly looking for bottlenecks and improvements.

Embracing the DevOps philosophy means accepting that the systems we're building will never be "finished".

Throughout the training, I spent most of my time reducing human error -- or my own error -- by recording my progress with Git to complement the declarative configurations I use to set up my local Kubernetes cluster with a single command. I created three GitLab repositories for my architecture: one for application code; one for Kubernetes manifests, with Helm charts; and one for ArgoCD configurations.

I've also leveraged Helm charts to conveniently deploy not only my application, but also the systems necessary for observability & GitOps, such as Prometheus, Grafana, and ArgoCD.

Then, I integrated Helm with my local ArgoCD to continuously monitor my services, taking advantage of its self-healing, continuous delivery, and additional monitoring capabilities.

Lastly, I linked all three of my repositories, each with its own deployment pipeline, to commit to a true continuous integration (CI) and continuous delivery (CD) setup:

My application code repository, which holds the frontend & backend sources, creates a merge request against the repository where my Helm charts are located to update the image tag used by the frontend or backend -- but only if the respective source code was updated. Otherwise, no merge request is made and only the Static Application Security Testing (SAST) job runs.

I chose to manually approve the merge requests, instead of pushing the changes directly, since I wanted to simulate working in a team where requests must be reviewed & approved before deployment to production. This also made it easier to comprehend how my deployment pipeline flows from one repository to another.

ArgoCD polls each repository every 3 minutes and automatically deploys new changes to the respective systems without human intervention.
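The pipeline described above could be sketched as a GitLab CI configuration roughly like the following. This is a hypothetical sketch: the job names, the `changes:` paths, and the `open-image-tag-mr.sh` helper script are assumptions for illustration, not the actual pipeline.

```yaml
# Hypothetical .gitlab-ci.yml for the application code repository.
stages:
  - test
  - trigger-update

include:
  # GitLab's built-in SAST template; this job runs on every pipeline
  - template: Jobs/SAST.gitlab-ci.yml

update-helm-tag:
  stage: trigger-update
  image: alpine:3.20
  rules:
    # only open a merge request when frontend/backend sources changed
    - changes:
        - frontend/**/*
        - backend/**/*
  script:
    - apk add --no-cache git curl
    # assumed helper that opens an MR against the Helm charts repository,
    # bumping the image tag to the new commit SHA
    - ./scripts/open-image-tag-mr.sh "$CI_COMMIT_SHORT_SHA"
```

With `rules: changes`, the tag-bump job is skipped entirely when neither layer's source changed, leaving only the SAST job to run.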

The journey was quite exhilarating: it gave me new insights into solving more complex problems with creative solutions by actively using my past experiences to continuously improve my architecture. It may still be lacking in some ways, but thanks to what I've learned in the training, it'll be easier to discover and understand what I'm missing and apply those improvements to the next system I'm trusted to build.

Some tinkering with Docker

I've used Docker with Kubernetes, and since I had previous experience with the former, it was easy to digest the Docker-related resources provided for the training, to the point that I was watching related YouTube videos at twice the normal playback speed.

The three layers of my application -- frontend (presentation layer), backend (logic layer), and database (data layer) -- are all deployed with Docker, while the data layer uses an external Postgres image from the Docker Hub registry.

The backend was a Node application used to set up APIs with the help of Express; the frontend was a simple single-page React application that connects to the former via an environment variable by default.

I also opted to use Alpine variants of Docker images as much as I could, to reduce the resources used by containers and to speed up the provisioning of my three-tier application.

In addition, both the frontend & backend were initially deployed with root privileges, meaning anyone with access to their respective containers could control them. I sought to improve the security posture of my application by creating separate Dockerfiles for the dev (also used for testing) & prod environments, where the prod Dockerfile uses a non-root user, which greatly mitigates the risk of container breakout or the impact of an attacker inside the container.

The initial frontend Dockerfile used a Node image for deployment. However, I leveraged an Nginx configuration file -- which was already included in the initial training repository -- to implement a multi-stage build: the deployment stage uses a non-privileged Nginx image, specifically the nginxinc/nginx-unprivileged image, and only the necessary files are moved from the build stage to reduce the final image size of my container.
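A multi-stage, non-root build of that shape could look roughly like this. This is a sketch under assumptions: the paths, the `REACT_APP_API_URL` build argument, and the `nginx.conf` location are illustrative, not the training repository's actual files.

```dockerfile
# Hypothetical prod Dockerfile for the React frontend.
# Build stage: compile the static bundle with Node.
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
# build-time variable so the bundle points at the right backend URL
# (name is an assumption for illustration)
ARG REACT_APP_API_URL
ENV REACT_APP_API_URL=${REACT_APP_API_URL}
RUN npm run build

# Deployment stage: non-privileged Nginx image, copying only the built assets
FROM nginxinc/nginx-unprivileged:alpine
COPY --from=build /app/build /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
# the unprivileged image listens on 8080 instead of 80
EXPOSE 8080
```

Because the final stage starts from a fresh Nginx base, none of the Node toolchain or source files end up in the shipped image.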

Despite not being required by the training itself, it didn't feel right to me to keep the three-tier application "as-is" when I knew that I had the means to improve it, starting with its security.

As usual, I still had my own share of problems with Docker:

  1. The frontend encountered a host `backend` is not reachable error when trying to reach the backend's API server through Nginx, which was why I had to provision my backend right after the database.
  2. I had a hard time changing the value of the frontend's environment variable specifically for the prod Dockerfile; apparently I had to add an ENV or ARG with the correct value for the said environment variable, because the value wasn't injected at build time and the fallback specified in the source code was used instead.

My struggles with Kubernetes

When I was learning Kubernetes for the first time, I struggled to digest the information, so I decided to document everything I learned instead, to the point that my notes reached more than 20,000 lines and took me twice the time compared to refreshing my knowledge of Docker.

For my cluster, I opted to use k3d, a lightweight wrapper to run k3s, which I believed to be a great way to start learning without worrying too much about resource overhead.

My strategy for Kubernetes was to replicate how I deployed my three-tier application previously, by building the necessary manifests:

  1. Using either a ConfigMap (for non-sensitive values) or a Secret (for sensitive credentials) to match the environment variables from docker-compose.yaml.
  2. Replacing Docker's volumes with Kubernetes' volumeMounts, with a PersistentVolume & PersistentVolumeClaim for external volume creation.
  3. Mirroring Docker's healthcheck with Kubernetes' livenessProbe, and adding the ability to check whether an application is ready with a readinessProbe.
  4. Connecting Kubernetes Deployments with a Service -- ClusterIP type by default -- to mimic the private connectivity within Docker Compose.
  5. Using Ingress resources explicitly for public-facing applications, similar to mapping an available host port to an internal port of a Docker container so it can be accessed from the host machine.
Nevertheless, learning Kubernetes equipped me with the knowledge to move from simple Docker deployments to complex yet flexible, scalable ones, such as knowing when to choose between a Deployment and a StatefulSet.

Of course, I don't think I'm really learning if I don't run into problems:

  1. I had to expose k3d with --api-port 127.0.0.1:<PORT> to map it to my localhost, otherwise my Docker Desktop's kubectl wouldn't be able to communicate with the cluster's API server.
  2. There's a slight difference between Docker and Kubernetes when running a command inside a container/pod from a terminal: I had to add -- before the command I want to run inside, and avoid the / prefix on the command because it pointed to my local directory instead of the command inside the container.
  3. Encountering the One or more containers do not have resources error once made me always add limits on how much resources my deployments can use.
  4. Persistent volumes should be deleted right after deleting their claims, since I unknowingly waited longer than I needed to when I deleted the volumes first.
  5. I had an issue where I couldn't correctly mount a SQL file for initialization when I pasted its contents into a ConfigMap, so I opted to visualize how the contents should be added with kubectl create configmap init-sql --from-file=init.sql=./init.sql -o yaml.
  6. I encountered issues when creating a Secret for sensitive environment variables with echo; I should've used echo -n instead to omit the trailing newline.
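One way to sidestep the newline pitfall in point 6 entirely is a Secret's `stringData` field, where values are written in plaintext and base64-encoded by Kubernetes on creation. The name and key below are illustrative:

```yaml
# Hypothetical Secret using stringData: no manual base64, no stray newline.
apiVersion: v1
kind: Secret
metadata:
  name: backend-secret
type: Opaque
stringData:
  POSTGRES_PASSWORD: changeme  # encoded automatically when applied
```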

Optimizing my infrastructure with Helm

I built separate Kubernetes manifests for each layer of my three-tier application, each with its own ConfigMap or Secret and services. Deploying the same applications with similar configurations became redundant (for me) and risked misconfigurations due to the separate files.

As part of my training, I had to learn Helm, and I came to understand how useful it is for consistent deployments. It enabled me to install third-party applications, packaged as "charts", from registries without manually writing the manifests myself. Charts are stored in registries, usually Artifact Hub, much like container images on the Docker Hub registry; Helm acts as a "package manager", similar to NPM.

With helm create and some tinkering, I was able to migrate my manifests to Helm as a single package, allowing me to manage each layer through a single file named values.yaml.
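A values.yaml for a chart like that might be shaped roughly as follows; the keys and image names are assumptions for illustration, not my actual chart's values.

```yaml
# Hypothetical values.yaml managing all three layers from one file.
frontend:
  image:
    repository: registry.example.com/frontend
    tag: "1.0.0"
  replicas: 1
backend:
  image:
    repository: registry.example.com/backend
    tag: "1.0.0"
  replicas: 1
database:
  image:
    repository: postgres
    tag: "16-alpine"
  persistence:
    size: 1Gi  # sizing for the PersistentVolumeClaim
```

This is also the file a CI pipeline can patch to bump an image tag, so a single commit updates a single layer.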

Working with Helm was a breeze, especially thanks to its reliance on the Go templating language, which allowed me to control the flow of my deployments and inject/replace data in a single manifest used by all of the layers in my application. In addition:

  1. Helm is considered the "Homebrew" (a package manager used on macOS) of Kubernetes.
  2. Helm is similar to third-party Terraform modules that simplify deployments and avoid maintaining multiple yet similar YAML files.
  3. The values in a Helm chart differ between charts, as they are entirely up to their respective developers, which makes reading the documentation as necessary as ever.
  4. We can avoid "snowflake servers/clusters" -- where software/packages are installed imperatively, making them hard to rebuild -- by opting for a declarative workflow with Helm, where configurations can be stored in source control; the result is a "phoenix server".
  5. A library chart is a "library" of functions shared across multiple charts, while an application chart is a collection of templates.
  6. Helm itself is written in Go.
  7. Helm templates support functions, similar to Terraform's built-ins, which can be chained into "pipelines" with the pipe (|) symbol: {{ .Values.globalNamespace | default "default" }}.
  8. Template flow control can be done with if-statements, such as {{ if .Values.development }}-dev{{ end }} or even {{ if .Values.development }}-dev{{ else }}-prod{{ end }}.
  9. The templating syntax leaves whitespace behind by default; adding - inside the braces, such as {{- end }}, trims it.
  10. Professional Helm templates can be huge and rely heavily on values/variables for declarative workflows, which avoids hardcoding values.
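The pipeline, flow-control, and whitespace points above can be combined in one small template fragment. The `.Values` keys and the `mychart.labels` helper (the kind `helm create` scaffolds) are illustrative assumptions:

```yaml
# Hypothetical template fragment showing pipelines, if/else, and trimming.
apiVersion: v1
kind: ConfigMap
metadata:
  # if/else flow control picks a -dev or -prod suffix
  name: app-config{{ if .Values.development }}-dev{{ else }}-prod{{ end }}
  # default pipeline: fall back to "default" when no namespace is set
  namespace: {{ .Values.globalNamespace | default "default" }}
  labels:
    # {{- ... }} trims the preceding whitespace/newline;
    # nindent re-indents every line of the included block by 4 spaces
    {{- include "mychart.labels" . | nindent 4 }}
```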

I'm always open to encountering problems with Helm, because they allow me to reinforce what I've learned and get creative with my solutions:

  1. Some of my pods encountered ImagePullBackOff from Helm charts that I used, which meant that the chart may be outdated or no longer exist, so I opted to look for alternatives, such as using Bitnami's MariaDB chart while the tutorials I consumed were still using their MySQL chart.
  2. There's a difference between indent and nindent when programmatically injecting data into my Helm templates: nindent behaves like indent but prepends a newline, which made my templates render as expected.

After learning Helm, I used it heavily in my deployments, which helped me reduce the risk of misconfigurations as much as possible.

Hands-off deployments with GitOps thru ArgoCD

In order to achieve the "continuous delivery" part of my architecture, I opted to deploy ArgoCD as a Helm chart; it was also required software to learn during my training.

ArgoCD was necessary to implement the GitOps framework, since the majority of its capabilities are tied to Git repositories; permitting it to read my repositories allowed my deployments to self-heal whenever a critical component becomes unhealthy, reducing the time it takes for them to become available again.

Fortunately, Helm charts are compatible with ArgoCD, which was why I combined my three Git repositories and third-party Helm charts to set up my architecture with it. ArgoCD allowed me to come up with a creative solution to set up my architecture correctly, and along the way I learned:

  1. An imperative setup introduces the risk of snowflake servers/clusters, as well as having to extensively configure a cluster's RBAC while lacking visibility into deployment status; ArgoCD solves this by delegating access to Git repositories while providing visibility into deployed applications.
  2. The best practice is to separate unrelated deployments into separate Git repositories instead of a single repository for everything, especially a separate repository for system configurations to reduce the risk of exposing sensitive credentials.
  3. ArgoCD supports both Kubernetes manifests & Helm charts, combining the best of both worlds for deployments; ArgoCD is basically an extension of Kubernetes, and I used Helm to deploy it.
  4. Git repositories must be synced with ArgoCD for continuous monitoring, which gives applications an "easy rollback"; ArgoCD watches changes in those repositories and applies updates automatically, though this can be disabled so an alert is sent for new changes instead.
  5. ArgoCD can be used to control a Kubernetes cluster -- by acting as an agent -- which avoids providing external access to the cluster, since management can be done indirectly through Git repos.
  6. Git repositories hold the desired state while Kubernetes clusters are the actual live state, and ArgoCD ensures the two stay in sync.

ArgoCD brings its own Custom Resource Definitions (CRDs); my three-tier application, as well as the necessary components of my Kubernetes cluster, are deployed through Application manifests -- especially the third-party Helm charts that aren't in my Git repositories, whose default values I override inside the same manifest.
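An Application manifest for a third-party chart with inline value overrides could look roughly like this; the chart, versions, and override values are illustrative, not my actual configuration.

```yaml
# Hypothetical ArgoCD Application deploying an external Helm chart.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 65.0.0        # pin the chart version
    helm:
      values: |                   # override chart defaults inline
        grafana:
          adminPassword: changeme # illustrative value only
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      selfHeal: true              # revert manual drift automatically
      prune: true                 # remove resources deleted from Git
```

The `syncPolicy.automated` block is what provides the self-healing behavior mentioned earlier.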

I did have my own fair share of problems when I was deploying with ArgoCD:

  1. Before I could deploy my applications with ArgoCD, I had to deploy ArgoCD itself first, since it contains the CRDs used for deployments, especially Application manifests.
  2. I had a hard time using my GitLab repository's Deploy Keys with ArgoCD, which led me to read the documentation to find the correct syntax for SSH-related keys.
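For reference, the shape ArgoCD's documentation describes for an SSH repository credential is a Secret labeled `argocd.argoproj.io/secret-type: repository`; the repository URL and names here are illustrative:

```yaml
# Secret registering a Git repository with ArgoCD over SSH.
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-helm-charts-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository  # how ArgoCD discovers it
stringData:
  type: git
  url: git@gitlab.com:example/helm-charts.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----
```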

To avoid committing sensitive credentials to my Git repository for ArgoCD, I utilized the External Secrets Operator to create a Kubernetes Secret that syncs with an AWS Secrets Manager secret.
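An ExternalSecret for that purpose might be shaped as follows; the store name, secret names, and keys are assumptions for illustration, and they presume a SecretStore for AWS Secrets Manager has been configured separately.

```yaml
# Hypothetical ExternalSecret syncing an AWS Secrets Manager entry
# into a Kubernetes Secret that ArgoCD-deployed workloads can consume.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: backend-credentials
spec:
  refreshInterval: 1h            # re-sync from AWS every hour
  secretStoreRef:
    name: aws-secrets-manager    # assumed SecretStore, defined elsewhere
    kind: SecretStore
  target:
    name: backend-secret         # the Kubernetes Secret to create
  data:
    - secretKey: POSTGRES_PASSWORD
      remoteRef:
        key: prod/backend        # Secrets Manager secret name
        property: postgres_password
```

Only this manifest lives in Git; the actual credential stays in AWS Secrets Manager.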

I'm aware that I could've used a webhook for immediate syncing between my Git repository and ArgoCD, though that required more setup than was needed, so I've taken note of it and will implement it in the future.

Enhancing observability starting with Prometheus & Grafana

The last part of my training was to implement Prometheus and Grafana to enhance the observability of my deployments, which I did by deploying them as Helm charts with ArgoCD.

Prometheus & Grafana were originally separate external Helm charts; however, I opted to use the kube-prometheus-stack chart, which combines both of them in a single package. At first, kube-prometheus-stack was deployed as an external Helm chart, but throughout the training I found myself needing more flexibility to configure Prometheus, Grafana, or any other component included in the chart, which was why I ended up adding it to my Git repository for Kubernetes manifests & Helm charts through helm pull.

I also integrated Loki & Tempo as external Helm charts, specifically to enhance logging and tracing in my applications.

Improving my cluster's observability allowed me to learn:

  1. Prometheus can be used as a monitoring tool for highly dynamic container environments or traditional bare-metal servers -- it's a mainstream monitoring solution of choice for containers and microservice architectures.
  2. It's used to constantly monitor all deployments/services in order to identify problems before they escalate, as well as to check containers' resource usage while setting thresholds that alert on breaches; automated monitoring & alerting can be achieved with observability-related components.
  3. Metrics can usually be retrieved from a /metrics endpoint by default; client libraries can expose this endpoint for an application, since many services don't have native support for Prometheus.
  4. Prometheus pulls metrics from targets, while Grafana can be used to visualize those metrics to gain insights.
  5. Other monitoring systems, such as AWS CloudWatch, use a push model that sends data to a centralized collection platform, which can result in a high load of network traffic; Prometheus, in contrast, uses a pull model, which can offer better detection/insight since it knows immediately whether a target is dead or not.
  6. Prometheus can use the Alertmanager component (included in the kube-prometheus-stack chart) to notify the respective recipients/communication channels.
  7. PromQL can be used to query the Prometheus server or a target directly, and it's used by visualization tools such as Grafana; PromQL runs in the background when creating a Grafana dashboard.
  8. Prometheus is designed to be reliable and to keep working even when other services are broken, which results in a less complex & extensive setup; however, it can be difficult to scale and has limits on monitoring (which can be addressed by increasing the Prometheus server's capacity or limiting the number of metrics pulled).

My three-tier application wasn't initially designed to send metrics, which was why I enjoyed going back to programming when I implemented metrics for my frontend & backend: the backend was set up with the prom-client Node library to create the metrics for both layers, exposed through a /metrics endpoint.

The prom-client library wasn't compatible with the client side, especially since the frontend was deployed with Nginx. My solution was to send the frontend's metric data to the backend's /metrics endpoint via the fetch API.

The initial metrics for my frontend were simply the First Contentful Paint, retrieved via the web-vitals Node library, and the page load duration.

Since this was more or less my first time dealing with Prometheus & Grafana, my problems with them pushed me to be resourceful:

  1. I had to manually (but declaratively) create a ServiceMonitor for each of my layers in order for Prometheus to retrieve their metrics; otherwise, Prometheus wouldn't see them as targets.
  2. I thought the kube-prometheus-stack chart was incompatible with the Loki chart, but all I had to do was stop setting the latter as the default data source, since the former already uses Prometheus as the default data source.
  3. I had to add the @opentelemetry/api and @opentelemetry/auto-instrumentations-node libraries to my Node applications in order for Tempo to retrieve their traces and surface them in Grafana.
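A ServiceMonitor like the one in point 1 could be sketched as follows; the names, labels, and port are illustrative assumptions, and the `release` label reflects how kube-prometheus-stack's Prometheus commonly selects monitors (it must match the chart's serviceMonitorSelector).

```yaml
# Hypothetical ServiceMonitor so Prometheus discovers the backend as a target.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    release: kube-prometheus-stack  # assumed Helm release name
spec:
  selector:
    matchLabels:
      app: backend                  # must match the backend Service's labels
  endpoints:
    - port: http                    # a named port on the Service
      path: /metrics                # where prom-client exposes metrics
      interval: 30s                 # scrape frequency
```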

The journey ahead

I prioritized the DevOps training because I believed it would keep me ahead in this ever-changing field, where automation and efficiency are no longer optional, and I believe it was the right decision. I emerged from the training equipped with new, comprehensive knowledge and deeper insights into architecture and system design.

Combined with my existing knowledge of Cloud Engineering, the training also provided me with the principles necessary to build robust, scalable, and stable infrastructure that I can bring with me wherever I go, achieving the adaptability that I always aim for.

I'm thankful for being given the opportunity to improve myself, as well as for the guidance of my seniors, which gave me the confidence to finish the training.

Maybe I'll continue pursuing the AWS CloudOps Engineer Associate certification, since I was in the middle of it when I was given the training and had to stop abruptly.


I originally planned to learn Kubernetes through the DevOps with Kubernetes course from the same instructors as DevOps with Docker, and I still recommend both courses despite not taking the Kubernetes one -- I'm going off my great experience with the Docker course.

If you want to upskill in DevOps engineering: take your time with what you can digest, and don't rush your learning, because building the necessary principles organically is important for you in the long term.

Don't forget to believe in yourself!
