<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ogonna Nnamani</title>
    <description>The latest articles on DEV Community by Ogonna Nnamani (@cloudiepad).</description>
    <link>https://dev.to/cloudiepad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1373520%2F7ef2a204-c7c4-462f-abe3-7b045f39d4b8.jpeg</url>
      <title>DEV Community: Ogonna Nnamani</title>
      <link>https://dev.to/cloudiepad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cloudiepad"/>
    <language>en</language>
    <item>
      <title>Self-Hosted GitHub Runners on GKE: My $800/Month Mistake That Led to a Better Solution</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Wed, 03 Sep 2025 16:41:12 +0000</pubDate>
      <link>https://dev.to/cloudiepad/self-hosted-github-runners-on-gke-my-800month-mistake-that-led-to-a-better-solution-1nk4</link>
      <guid>https://dev.to/cloudiepad/self-hosted-github-runners-on-gke-my-800month-mistake-that-led-to-a-better-solution-1nk4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16dxb45ux8x4zi0szx35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16dxb45ux8x4zi0szx35.png" alt="Hero - Image"&gt;&lt;/a&gt;&lt;br&gt;
So, it's 3 PM on a Friday, your team is trying to push a critical hotfix, and GitHub Actions decides to put your build in a queue. For 15 minutes. Then 20 minutes. Your deployment window is closing, the stakeholders are breathing down your neck, and you're watching your GitHub Actions bill climb past $800 for the month.&lt;/p&gt;

&lt;p&gt;That was my reality six months ago. And like most problems that keep you awake at 2 AM, it started small and innocent.&lt;/p&gt;
&lt;h2&gt;
  
  
  The $800 Problem That Kept Getting Worse
&lt;/h2&gt;

&lt;p&gt;It began innocently enough. Our team grew from 3 to 15 developers, our deployment frequency increased, and suddenly our GitHub Actions usage exploded. What started as a manageable $100/month became $400, then $600, then crossed the dreaded $800 mark.&lt;/p&gt;

&lt;p&gt;But the cost wasn't even the worst part. The worst part was the waiting.&lt;/p&gt;

&lt;p&gt;During deployment rushes, builds would queue for 10-15 minutes. Developers would start their builds, then go grab coffee, chat with colleagues, or worse – start working on something else entirely, breaking their flow. Our feedback loops became molasses-slow, and productivity plummeted.&lt;/p&gt;

&lt;p&gt;I'd sit there watching the queue, thinking: "There has to be a better way."&lt;/p&gt;

&lt;p&gt;Spoiler alert: There was. And it involved making some spectacular mistakes along the way.&lt;/p&gt;
&lt;h2&gt;
  
  
  Enter Actions Runner Controller (ARC): The Light at the End of the Tunnel
&lt;/h2&gt;

&lt;p&gt;After countless late nights researching alternatives, I stumbled upon Actions Runner Controller (ARC). Think of it as having a smart assistant who only hires contractors when there's work to do, then sends them home when they're done.&lt;/p&gt;

&lt;p&gt;Traditional GitHub runners are like having a full-time employee sitting at their desk 24/7, even when there's no work. ARC creates &lt;strong&gt;ephemeral pods&lt;/strong&gt; – containers that materialize when a job arrives, do their work, and vanish when complete. It's beautiful in its simplicity.&lt;/p&gt;

&lt;p&gt;The promise was tantalizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scale from 0 to 100+ runners instantly&lt;/li&gt;
&lt;li&gt;Pay only for compute you actually use
&lt;/li&gt;
&lt;li&gt;Never wait in GitHub's queue again&lt;/li&gt;
&lt;li&gt;Full control over the runtime environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, getting there required navigating through my usual minefield of spectacular failures.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Installing the ARC Controller (And My First Epic Fail)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Mistake #1: The Great Firewall Fiasco
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37xkb18z159jrdftoodw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37xkb18z159jrdftoodw.png" alt="ARC FLOW"&gt;&lt;/a&gt;&lt;br&gt;
My first attempt at installing ARC was... educational. I spent an entire weekend setting everything up perfectly, only to find that GitHub couldn't talk to my cluster. Webhooks were failing, runners weren't registering, and I was questioning my life choices.&lt;/p&gt;

&lt;p&gt;The culprit? I'd forgotten to configure the firewall to allow GitHub's webhook IP ranges. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Face, meet palm.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This taught me my first crucial lesson: &lt;strong&gt;networking isn't an afterthought&lt;/strong&gt;. When you're dealing with webhooks, your cluster needs to be accessible from the internet, and GitHub needs to be able to reach your ARC controller.&lt;/p&gt;
&lt;h3&gt;
  
  
  How ARC Actually Communicates
&lt;/h3&gt;

&lt;p&gt;Here's what I learned about the communication flow the hard way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GitHub → ARC Controller&lt;/strong&gt;: GitHub sends webhook events when workflows are triggered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ARC Controller → Kubernetes API&lt;/strong&gt;: Creates/deletes runner pods based on job queue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner Pods → GitHub&lt;/strong&gt;: Self-register and poll for jobs to execute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner Pods → ARC Controller&lt;/strong&gt;: Report status and job completion&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical insight: &lt;strong&gt;GitHub initiates the conversation&lt;/strong&gt;. Your cluster must be reachable from GitHub's servers, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firewall rules allowing GitHub's webhook IPs&lt;/li&gt;
&lt;li&gt;Load balancer exposing the ARC webhook endpoint&lt;/li&gt;
&lt;li&gt;Proper DNS configuration for webhook URLs&lt;/li&gt;
&lt;/ul&gt;
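
&lt;p&gt;As a hedged sketch (the network and rule names here are placeholders, not the exact setup from this project): GitHub publishes its current webhook source ranges under the &lt;code&gt;hooks&lt;/code&gt; key of &lt;code&gt;api.github.com/meta&lt;/code&gt;, so on GCP the firewall side can look roughly like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Fetch GitHub's current webhook source ranges (the "hooks" key of /meta)
HOOK_RANGES=$(curl -s https://api.github.com/meta | jq -r '.hooks | join(",")')

# Allow those ranges to reach the webhook listener (names are placeholders)
gcloud compute firewall-rules create allow-github-webhooks \
  --network my-gke-network \
  --direction INGRESS \
  --action ALLOW \
  --rules tcp:443 \
  --source-ranges "$HOOK_RANGES"
&lt;/code&gt;&lt;/pre&gt;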
&lt;h3&gt;
  
  
  What Actually Worked
&lt;/h3&gt;

&lt;p&gt;After my networking debacle, here's the proper setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certificate Manager First&lt;/strong&gt;&lt;br&gt;
I installed cert-manager to handle SSL certificates automatically. This ensures secure communication between GitHub and our cluster, because webhooks over plain HTTP are a security nightmare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ARC Controller Installation&lt;/strong&gt;&lt;br&gt;
The ARC controller gets installed in its own namespace (&lt;code&gt;arc-systems&lt;/code&gt;) and acts as the orchestrator. It's essentially a Kubernetes operator that watches GitHub webhook events and translates them into pod creation/deletion actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Connectivity&lt;/strong&gt;&lt;br&gt;
Before proceeding, I learned to test the webhook endpoint thoroughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify external IP accessibility
&lt;/li&gt;
&lt;li&gt;Test SSL certificate validity&lt;/li&gt;
&lt;li&gt;Confirm GitHub can reach the webhook URL&lt;/li&gt;
&lt;li&gt;Monitor webhook delivery logs&lt;/li&gt;
&lt;/ul&gt;
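
&lt;p&gt;A couple of quick commands cover most of that checklist (the hostname here is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;WEBHOOK_URL=https://arc.example.com/webhook

# Reachability plus TLS validity in one shot: curl exits non-zero on
# an unreachable host or an invalid certificate
curl -sSf -o /dev/null "$WEBHOOK_URL"

# Inspect the certificate's validity window directly
echo | openssl s_client -connect arc.example.com:443 -servername arc.example.com 2&gt;/dev/null \
  | openssl x509 -noout -dates
&lt;/code&gt;&lt;/pre&gt;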
&lt;h2&gt;
  
  
  Building the Foundation: GKE Cluster Setup
&lt;/h2&gt;

&lt;p&gt;Once I'd conquered the networking nightmare, I focused on building a solid foundation. The key decisions that made or broke the implementation:&lt;/p&gt;
&lt;h3&gt;
  
  
  The Spot Instance Gamble
&lt;/h3&gt;

&lt;p&gt;I made a bold choice: run everything on Google Cloud Spot instances. These preemptible nodes can disappear with just 30 seconds' notice, but they cost 60-70% less than regular instances.&lt;/p&gt;

&lt;p&gt;"This will either be brilliant or catastrophic," I thought.&lt;/p&gt;

&lt;p&gt;Turns out, it was brilliant. ARC handles preemptions gracefully – if a node gets terminated, pods simply reschedule elsewhere. The cost savings were immediate and substantial.&lt;/p&gt;
&lt;h3&gt;
  
  
  Storage: Where I Learned About Shared State
&lt;/h3&gt;

&lt;p&gt;Initially, I ignored storage entirely. "How hard can it be?" I thought. &lt;/p&gt;

&lt;p&gt;Very hard, as it turns out.&lt;/p&gt;

&lt;p&gt;Without shared storage for build caches, every job started from scratch. Build times were actually &lt;em&gt;slower&lt;/em&gt; than GitHub's hosted runners. My "optimization" had made things worse.&lt;/p&gt;

&lt;p&gt;The solution involved integrating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Filestore&lt;/strong&gt;: 250GB shared volume for build caches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart cache organization&lt;/strong&gt;: Structured by repository and branch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suddenly, build times dropped by 40%. Sometimes the obvious solutions are obvious for a reason.&lt;/p&gt;
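
&lt;p&gt;The shared cache is plumbed in as a ReadWriteMany volume. A sketch of the static binding, with the Filestore IP and export path as placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: build-cache
spec:
  capacity:
    storage: 250Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: 10.0.0.2      # Filestore instance IP (placeholder)
    path: /build_cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: build-cache
  namespace: arc-runners
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""
  volumeName: build-cache
  resources:
    requests:
      storage: 250Gi
&lt;/code&gt;&lt;/pre&gt;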
&lt;h2&gt;
  
  
  Step 3: Building Custom Images (Learning Docker the Hard Way)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Mistake #2: The 8GB Image Monster
&lt;/h3&gt;

&lt;p&gt;My first custom runner image was... ambitious. I threw everything I could think of into it: multiple Node.js versions, Python 2 and 3, every CLI tool I'd ever heard of, and enough packages to power a small data center.&lt;/p&gt;

&lt;p&gt;The result? An 8GB monster that took 15 minutes to pull on each pod creation.&lt;/p&gt;

&lt;p&gt;Watching developers wait 15 minutes just to start their build was painful. I quickly learned that &lt;strong&gt;in the world of ephemeral pods, image size directly impacts developer happiness&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Right Approach to Custom Images
&lt;/h3&gt;

&lt;p&gt;After several iterations, I developed a strategy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xdryam9wsj9nioc9lll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xdryam9wsj9nioc9lll.png" alt="docker custom"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Base Image Philosophy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ubuntu-22-04&lt;/strong&gt;: Lean base with Node.js, Python, and essential tools only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ubuntu-22-04-infra&lt;/strong&gt;: Infrastructure-focused with Terraform, kubectl, and cloud CLIs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ubuntu-22-04-qa&lt;/strong&gt;: Testing-focused with Selenium, browsers, and test frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Size Optimization Lessons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-stage builds to eliminate build dependencies&lt;/li&gt;
&lt;li&gt;Careful package selection (do you really need that 500MB SDK?)&lt;/li&gt;
&lt;li&gt;Layer optimization to maximize Docker cache hits&lt;/li&gt;
&lt;li&gt;Regular cleanup of apt caches and temp files&lt;/li&gt;
&lt;/ul&gt;
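
&lt;p&gt;Those lessons translate into a Dockerfile shaped roughly like this (a hypothetical sketch: the tool versions are illustrative, and &lt;code&gt;ghcr.io/actions/actions-runner&lt;/code&gt; is the stock ARC runner image):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Stage 1: fetch and unpack tools; nothing here ships in the final image
FROM ubuntu:22.04 AS builder
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y curl xz-utils
RUN curl -fsSL -o /tmp/node.tar.xz https://nodejs.org/dist/v20.11.0/node-v20.11.0-linux-x64.tar.xz \
 &amp;amp;&amp;amp; mkdir -p /opt/node \
 &amp;amp;&amp;amp; tar -xJf /tmp/node.tar.xz -C /opt/node --strip-components=1

# Stage 2: lean runtime image; copy only the artifacts we need
FROM ghcr.io/actions/actions-runner:latest
USER root
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y --no-install-recommends git jq \
 &amp;amp;&amp;amp; rm -rf /var/lib/apt/lists/*   # clean apt caches in the same layer
COPY --from=builder /opt/node /opt/node
ENV PATH="/opt/node/bin:${PATH}"
USER runner
&lt;/code&gt;&lt;/pre&gt;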

&lt;p&gt;&lt;strong&gt;The Sweet Spot&lt;/strong&gt;&lt;br&gt;
My optimized images clock in at 1.5-2GB and pull in under 60 seconds. The difference in developer experience is night and day.&lt;/p&gt;

&lt;p&gt;The results were dramatic: job setup time dropped from 3-4 minutes to under 30 seconds.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Configuring Ephemeral Runners (Pod Lifecycle Mysteries)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Understanding the Magic
&lt;/h3&gt;

&lt;p&gt;Each runner pod is ephemeral, meaning it has a complete lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Creation&lt;/strong&gt;: ARC sees a queued job and creates a pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registration&lt;/strong&gt;: Pod starts up and registers with GitHub
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: Receives and executes the workflow job&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup&lt;/strong&gt;: Job completes, pod reports back, and gets deleted&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Container Architecture
&lt;/h3&gt;

&lt;p&gt;Each runner pod actually runs two containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Main runner container&lt;/strong&gt;: Executes the GitHub Actions workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-in-Docker (DinD) sidecar&lt;/strong&gt;: Handles container builds securely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture provides isolation while enabling Docker builds – crucial for most modern CI/CD workflows.&lt;/p&gt;
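
&lt;p&gt;In the &lt;code&gt;gha-runner-scale-set&lt;/code&gt; chart, this two-container layout is a single switch. A hedged values-file sketch (the org URL and secret name are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;githubConfigUrl: https://github.com/my-org
githubConfigSecret: github-app-secret
containerMode:
  type: dind    # adds the Docker-in-Docker sidecar to every runner pod
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
&lt;/code&gt;&lt;/pre&gt;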
&lt;h3&gt;
  
  
  Mistake #3: The Docker-in-Docker Discovery
&lt;/h3&gt;

&lt;p&gt;My next challenge was handling Docker builds within the runners. My initial approach was to mount the Docker socket from the host into the pods.&lt;/p&gt;

&lt;p&gt;This worked beautifully in testing. In production? Not so much.&lt;/p&gt;

&lt;p&gt;Security-wise, it was equivalent to giving every job root access to the host. One badly configured job could potentially compromise the entire node.&lt;/p&gt;

&lt;p&gt;The better approach: &lt;strong&gt;Docker-in-Docker (DinD)&lt;/strong&gt;. This provided isolation while enabling Docker builds. No more security nightmares, no more compromised nodes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Mistake #4: The Resource Allocation Disaster
&lt;/h2&gt;

&lt;p&gt;Confident in my progress, I deployed to production with minimal resource limits. "Let Kubernetes figure it out," I thought.&lt;/p&gt;

&lt;p&gt;Bad idea.&lt;/p&gt;

&lt;p&gt;Jobs started failing mysteriously. Pods were getting OOMKilled. The cluster was thrashing under memory pressure. I'd created a resource contention nightmare.&lt;/p&gt;

&lt;p&gt;The solution required careful tuning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: 1 core request, 4 core limit per runner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: 1GB request, 2GB limit
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Shared cache access for all runners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pro tip: Always set resource requests and limits. Kubernetes is smart, but it's not psychic.&lt;/p&gt;
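
&lt;p&gt;In pod-spec terms, the tuning above is just standard Kubernetes resource syntax:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resources:
  requests:
    cpu: "1"        # guaranteed scheduling baseline
    memory: 1Gi
  limits:
    cpu: "4"        # burst headroom for heavy builds
    memory: 2Gi     # hard ceiling; beyond this the pod gets OOMKilled
&lt;/code&gt;&lt;/pre&gt;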
&lt;h2&gt;
  
  
  The Scaling Sweet Spot
&lt;/h2&gt;

&lt;p&gt;After months of tuning, I found our ideal scaling configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum runners&lt;/strong&gt;: 1 (always ready for immediate pickup)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum runners&lt;/strong&gt;: 100 (handles our largest deployment batches)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-to-zero&lt;/strong&gt;: Pods disappear when not needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gave us the best of both worlds: instant job pickup for small changes, and massive parallel capacity for large deployments.&lt;/p&gt;
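
&lt;p&gt;In the scale-set values file, those bounds come down to two lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;minRunners: 1     # one warm runner for instant pickup
maxRunners: 100   # ceiling for the largest deployment batches
&lt;/code&gt;&lt;/pre&gt;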
&lt;h2&gt;
  
  
  The Results: When Everything Finally Clicks
&lt;/h2&gt;

&lt;p&gt;Six months later, the transformation has been remarkable:&lt;/p&gt;
&lt;h3&gt;
  
  
  Cost Victory 💰
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monthly CI/CD costs: $800+ → $200&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;70% reduction&lt;/strong&gt; in infrastructure spend&lt;/li&gt;
&lt;li&gt;Predictable costs with no surprise overages&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Performance Revolution 🚀
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Job queue time: 15 minutes → 30 seconds&lt;/li&gt;
&lt;li&gt;Build speed: 40% faster due to effective caching&lt;/li&gt;
&lt;li&gt;Deployment reliability: Near-zero failures due to resource constraints&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Developer Happiness 😊
&lt;/h3&gt;

&lt;p&gt;The real win? Developers stopped complaining about builds. Feedback loops became fast again. People could stay in flow instead of context-switching while waiting for deployments.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Antifragile System
&lt;/h2&gt;

&lt;p&gt;What we built isn't just cost-effective – it's antifragile. When traffic spikes hit, it scales up. When nodes get preempted, pods reschedule. When builds fail, we have granular logs to diagnose issues quickly.&lt;/p&gt;

&lt;p&gt;Each failure along the way taught us something valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The firewall issue taught us to plan networking carefully&lt;/li&gt;
&lt;li&gt;The storage problems taught us the importance of shared state&lt;/li&gt;
&lt;li&gt;The resource disasters taught us the value of proper limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every broken deployment made the system stronger.&lt;/p&gt;
&lt;h2&gt;
  
  
  Should You Make the Jump?
&lt;/h2&gt;

&lt;p&gt;If you're spending $500+ monthly on GitHub Actions and dealing with queue times, self-hosted runners on GKE might be your answer. But go in with realistic expectations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial setup time&lt;/strong&gt;: 2-3 weeks for a robust implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning curve&lt;/strong&gt;: Steep if you're new to Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing maintenance&lt;/strong&gt;: You're now responsible for infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost savings&lt;/strong&gt;: Significant, but not immediate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start simple. Get basic runners working, then add complexity gradually. Monitor everything. And don't be afraid to fail – each failure teaches you something you couldn't learn any other way.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Six months ago, I was paying $800/month to wait in GitHub's queue. Today, I'm paying $200/month for instant deployments and custom runtime environments.&lt;/p&gt;

&lt;p&gt;Sometimes the best solutions come from the problems that annoy you most.&lt;/p&gt;
&lt;h2&gt;
  
  
  If you enjoyed reading this, connect with me on &lt;a href="https://www.linkedin.com/in/ogonna-nnamani" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Building your own CI/CD infrastructure? I'd love to hear about your journey and the spectacular failures along the way. They make the best stories.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/your-username" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Follow me for more CI/CD war stories and Kubernetes adventures!&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>githubactions</category>
      <category>devops</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Hello, I am a DevOps Engineer and I Broke Production Today</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Wed, 23 Jul 2025 15:27:43 +0000</pubDate>
      <link>https://dev.to/cloudiepad/hello-i-am-a-devops-engineer-and-i-broke-production-today-3aop</link>
      <guid>https://dev.to/cloudiepad/hello-i-am-a-devops-engineer-and-i-broke-production-today-3aop</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfbdsf4oxqp5wricrv8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfbdsf4oxqp5wricrv8b.png" alt="Production Down Alert" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Beauty of Failure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: The failure lasted no more than 4 minutes, and I quickly reverted to a previous stable version. But here's the thing — I tweeted about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578h1441j211s2msw26o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578h1441j211s2msw26o.jpg" alt="The Tweet" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That simple tweet made me realize some things that I wasn't quite prepared for. In the comment section, I encountered four distinct types of people:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Those who found it hilarious&lt;/strong&gt; — Fellow engineers sharing their own war stories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Those who didn't believe me.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Those who immediately started suggesting fixes&lt;/strong&gt; — Those who couldn't help but jump into troubleshooting mode.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Those who thought it meant I was just incompetent.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But here's what that interaction taught me: there's raw beauty in admitting that you can fail and that it's absolutely okay to fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Failure Really Makes You
&lt;/h2&gt;

&lt;p&gt;Failure doesn't make you incompetent—it makes you &lt;strong&gt;experienced&lt;/strong&gt;. Each mistake adds a line of rank to your experience bar, a badge that says "I've been there, I've survived it, and I know how to handle it next time." More importantly, it removes the burden of claiming to know everything.&lt;/p&gt;

&lt;p&gt;Spoiler alert: you never will, and that's perfectly fine.&lt;/p&gt;

&lt;p&gt;Let me share some of my greatest hits in the failure department.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Great Email Blackout of Monday Morning
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnair3zv7tnepv0wkn4dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnair3zv7tnepv0wkn4dg.png" alt="DNS EMAIL" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;When you forget to migrate nameservers and an entire company loses email access&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Picture this: I'm performing an AWS cross-account migration for a major Oil &amp;amp; Gas company. Everything is going smoothly until the DNS migration phase. In my meticulous planning, I managed to overlook one tiny detail — migrating the nameservers.&lt;/p&gt;

&lt;p&gt;Monday morning arrives, and suddenly an entire company wakes up to find themselves locked out of their emails. For over five hours. On a Monday. In the oil and gas industry.&lt;/p&gt;

&lt;p&gt;The phone calls were… let's just say they were intense. But that failure taught me more about DNS propagation, backup communication channels, and the critical importance of testing every single component of a migration than any certification course ever could. Sometimes the most expensive lessons are the most valuable ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Database Credential Catastrophe
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38oafc3q07zlcxjspayd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38oafc3q07zlcxjspayd.png" alt="Database swap" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The horror of realizing your production app is talking to the staging database&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then there was the time I pushed what I thought was a simple fix. An old branch from my git repository deployed to production and decided to pick up the staging database credentials, completely replacing the production database credentials.&lt;/p&gt;

&lt;p&gt;Our production application was suddenly trying to connect to our staging database. The irony wasn't lost on me — I had created the perfect test of our monitoring systems, just not intentionally.&lt;/p&gt;

&lt;p&gt;That incident changed how I approached environment isolation. Now I have strict compliance checks before any PR is merged, proper credential management, and multiple validation layers. That "simple fix" became the catalyst for implementing some of our most robust security practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kubernetes Scheduling Nightmare
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi2sc0t7kn7qz3xy1jgq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi2sc0t7kn7qz3xy1jgq.jpg" alt="k8s scheduling nightmare" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More recently, I pushed a fix and mistakenly changed the annotations of our self-hosted GitHub runners. Suddenly, our pods couldn't schedule on our node pools because they had a &lt;code&gt;nodeSelector&lt;/code&gt; rule that no longer matched.&lt;/p&gt;

&lt;p&gt;Our entire CI/CD pipeline ground to a halt. Developers couldn't deploy. The build queue started backing up like traffic on a Friday afternoon.&lt;/p&gt;
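
&lt;p&gt;For anyone who hasn't hit this failure mode: when a pod's &lt;code&gt;nodeSelector&lt;/code&gt; asks for a label no node carries, the scheduler simply leaves the pod Pending. A hypothetical before/after (the label values are illustrative, not our real ones):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Nodes after the bad change carry:
#   cloud.google.com/gke-nodepool: runner-pool-v2
nodeSelector:
  cloud.google.com/gke-nodepool: runner-pool   # stale value: no node matches
&lt;/code&gt;&lt;/pre&gt;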

&lt;p&gt;Each of these failures taught me something invaluable that I couldn't have learned any other way. The list is endless, honestly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road to Antifragility
&lt;/h2&gt;

&lt;p&gt;The road to antifragility is a continuous process. What happens after you break production is remarkably similar to a murder investigation: you need evidence, you need witnesses, you need to reconstruct the timeline, and you need to understand what went wrong. This is why you build systems that anticipate these trying times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Detective Toolkit
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8snn194dp94zc8oxdlg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8snn194dp94zc8oxdlg.jpg" alt="Detective Toolkit" width="800" height="634"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The essential tools for investigating production failures&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Granular logs become your crime scene evidence.&lt;/strong&gt; They tell you exactly what happened, when it happened, and in what sequence. Without them, you're investigating a case blindfolded. I can't stress this enough — log everything that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive metrics are your witnesses.&lt;/strong&gt; They saw everything unfold in real-time and can testify to the state of your system at any given moment. Tools like CloudWatch, Prometheus, and Grafana have become my best friends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A similar test environment is your crime lab.&lt;/strong&gt; It's where you can safely recreate the incident, test your theories, and validate your fixes without risking further damage to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear post-mortems are your case files.&lt;/strong&gt; They document not just what went wrong, but why it went wrong, what you learned, and how you're preventing it from happening again. Write them like your future self will thank you for it.&lt;/p&gt;

&lt;p&gt;These are your detective tools when all hell breaks loose. And trust me, hell will break loose — it's not a matter of if, but when.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Failure is a Lesson
&lt;/h2&gt;

&lt;p&gt;Every failure is a lesson. Some lessons are more expensive than others, but the process invariably makes us better engineers. The engineer who has never broken production is either lying, hasn't been doing this long enough, or isn't pushing boundaries hard enough to drive real innovation.&lt;/p&gt;

&lt;p&gt;The most senior engineers I know aren't the ones who never make mistakes — they're the ones who've made the most mistakes, learned from them, and built systems resilient enough to handle future failures gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embracing the Inevitable
&lt;/h2&gt;

&lt;p&gt;So here's to failure. Here's to the 3 AM phone calls, the sweaty palms during incident response, and the wisdom we gain from each crash.&lt;/p&gt;

&lt;p&gt;Because at the end of the day, our failures don't define our incompetence — they define our experience.&lt;/p&gt;

&lt;p&gt;Once again, my name is Ogonna, I am a DevOps Engineer, and I broke production today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What did you do today?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you enjoyed this story of production failures and lessons learned, follow me for more DevOps insights and real-world experiences. Also connect with me on &lt;a href="https://linkedin.com/in/ogonna-nnamani" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; for more behind-the-scenes DevOps stories.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have your own production failure story?&lt;/strong&gt; Share it in the comments — let's normalize talking about our failures and learning from each other!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>career</category>
      <category>failure</category>
      <category>postmortem</category>
    </item>
    <item>
      <title>Securing Your Internal Tools: Implementing Identity-Aware Proxy (IAP) for GKE Resources with CDKTF</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Tue, 22 Jul 2025 10:47:39 +0000</pubDate>
      <link>https://dev.to/cloudiepad/securing-your-internal-tools-implementing-identity-aware-proxy-iap-for-gke-resources-with-cdktf-gm5</link>
      <guid>https://dev.to/cloudiepad/securing-your-internal-tools-implementing-identity-aware-proxy-iap-for-gke-resources-with-cdktf-gm5</guid>
<description>&lt;p&gt;Hello! Today I want to share something that's become increasingly critical in our cloud-native world — securing internal tools and dashboards without the complexity of traditional VPN setups.&lt;/p&gt;

&lt;p&gt;Picture this: Your company has grown from a small startup to a mid-sized organization. You have internal dashboards, monitoring tools, admin panels, and various services running on Google Kubernetes Engine (GKE). Initially, maybe you secured these with basic auth or just left them on internal networks. But as your team grows and remote work becomes more common, you realize you need something more robust, more scalable, and frankly, more professional.&lt;/p&gt;

&lt;p&gt;That's where Google's Identity-Aware Proxy (IAP) comes in, and today I'll walk you through implementing it using Infrastructure as Code with CDKTF.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is IAP and Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Identity-Aware Proxy (IAP) is Google Cloud's solution to the age-old problem of "how do I securely control access to my applications?" Think of IAP as a sophisticated bouncer at an exclusive club — it checks not just if you have a ticket (authentication), but also if you're on the guest list for that specific event (authorization).&lt;/p&gt;

&lt;p&gt;Here's the beautiful part: IAP sits between your users and your applications, handling all the authentication and authorization logic without you having to modify your applications. It integrates seamlessly with Google's identity systems, supports your corporate Google Workspace accounts, and can enforce granular access controls based on user attributes, device security status, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why IAP is a Game-Changer
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Zero Trust Security Model&lt;/strong&gt;: IAP doesn't trust anyone by default, not even users inside your corporate network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No VPN Complexity&lt;/strong&gt;: Users can access internal tools from anywhere with just their corporate Google account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular Access Control&lt;/strong&gt;: You can control who accesses what, when, and from which devices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Trail&lt;/strong&gt;: Every access attempt is logged, giving you complete visibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with Google Workspace&lt;/strong&gt;: Leverage your existing Google accounts and groups&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Understanding CDKTF: Infrastructure as Code for the Modern Age
&lt;/h2&gt;

&lt;p&gt;Before we dive into the implementation, let's talk about CDKTF — Cloud Development Kit for Terraform. If you've worked with traditional Terraform, you know it uses HCL (HashiCorp Configuration Language). While HCL is powerful, it can feel limiting when you need complex logic or loops, or when you want to leverage your existing programming skills.&lt;/p&gt;

&lt;p&gt;CDKTF bridges this gap by allowing you to define your infrastructure using familiar programming languages like TypeScript, Python, Java, C#, or Go. For this article, we'll use TypeScript because of its excellent type safety and IntelliSense support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why CDKTF with TypeScript?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type Safety&lt;/strong&gt;: Catch configuration errors at compile time, not deployment time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Reusability&lt;/strong&gt;: Create functions, classes, and modules to reduce duplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE Support&lt;/strong&gt;: Full IntelliSense, autocomplete, and refactoring capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar Syntax&lt;/strong&gt;: If you know TypeScript/JavaScript, you're already halfway there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt;: Unit test your infrastructure code just like application code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it this way: traditional Terraform is like writing configuration files, while CDKTF is like writing a program that generates those configuration files. The end result is the same, but the development experience is significantly better.&lt;/p&gt;
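&lt;p&gt;To make that concrete, here's a minimal, self-contained sketch of the "program that generates configuration" idea in plain TypeScript (no CDKTF dependency; the &lt;code&gt;BucketConfig&lt;/code&gt; shape and environment names are invented for illustration). A typed loop emits one bucket definition per environment, something that would take &lt;code&gt;for_each&lt;/code&gt; gymnastics in HCL:&lt;/p&gt;

```typescript
// Illustrative sketch: mimics how CDKTF code *generates* configuration.
// The type and names below are invented for this example, not from CDKTF.
interface BucketConfig {
  name: string;
  location: string;
  forceDestroy: boolean;
}

// The compiler catches a missing or misspelled field before anything deploys.
function bucketFor(env: string): BucketConfig {
  return {
    name: `my-app-${env}-assets`,
    location: "US",
    forceDestroy: env !== "prod", // never force-destroy production data
  };
}

const environments = ["dev", "staging", "prod"];
const buckets = environments.map(bucketFor);

console.log(JSON.stringify(buckets, null, 2));
```

&lt;p&gt;In real CDKTF code the same loop would instantiate constructs instead of plain objects, and &lt;code&gt;cdktf synth&lt;/code&gt; would turn them into Terraform JSON, but the development experience — types, loops, refactoring — is exactly this.&lt;/p&gt;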

&lt;h2&gt;
  
  
  Backend Services vs. Backend Configs: The Kubernetes Ingress Story
&lt;/h2&gt;

&lt;p&gt;Before we implement IAP, it's crucial to understand the difference between Backend Services and Backend Configs in the GKE context — this tripped me up when I first started working with GKE ingress.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend Services
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Backend Service&lt;/strong&gt; is a Google Cloud resource that defines how traffic should be distributed to your backend instances (in our case, Kubernetes pods). It's part of Google's load balancing infrastructure and handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Health checks&lt;/li&gt;
&lt;li&gt;Load balancing algorithms&lt;/li&gt;
&lt;li&gt;Session affinity&lt;/li&gt;
&lt;li&gt;Traffic distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you create a Kubernetes Service and expose it through an Ingress, GKE automatically creates a corresponding Backend Service in Google Cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend Configs
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Backend Config&lt;/strong&gt; is a Kubernetes Custom Resource Definition (CRD) that allows you to customize the behavior of the automatically created Backend Services. Think of it as a way to tell GKE: "When you create the Backend Service for my Kubernetes Service, please apply these additional configurations."&lt;/p&gt;

&lt;p&gt;Backend Configs can control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connection draining timeouts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session affinity settings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom request/response headers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security policies&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;And most importantly for us — IAP settings&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight here is that Backend Configs are Kubernetes resources that influence Google Cloud Backend Services. It's GKE's way of bridging Kubernetes-native configuration with Google Cloud's load balancing features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-On: Implementing IAP for a SonarQube Instance
&lt;/h2&gt;

&lt;p&gt;Now let's get our hands dirty. We'll implement IAP for a SonarQube instance — a popular code quality and security analysis tool that's perfect for demonstrating internal tool security.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A GKE cluster&lt;/li&gt;
&lt;li&gt;CDKTF installed and configured&lt;/li&gt;
&lt;li&gt;Google Cloud project with appropriate permissions&lt;/li&gt;
&lt;li&gt;Basic understanding of Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Project Setup and Service Enablement
&lt;/h3&gt;

&lt;p&gt;First, we need to enable the IAP API. This is crucial because it also creates an IAP service agent that we'll need later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;enableServices&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@examplecompany/iac&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Enable Google IAP Service&lt;/span&gt;
&lt;span class="nf"&gt;enableServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iap.googleapis.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's happening here? The &lt;code&gt;enableServices&lt;/code&gt; function is a helper that enables Google Cloud APIs in your project. When you enable the IAP API (&lt;code&gt;iap.googleapis.com&lt;/code&gt;), Google automatically creates a special service account called the "IAP service agent." This service account is what IAP uses behind the scenes to communicate with your applications. Think of it as giving IAP the keys to act on behalf of your project.&lt;/p&gt;
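&lt;p&gt;For reference, that service agent follows Google's standard service-agent naming convention, &lt;code&gt;service-PROJECT_NUMBER@gcp-sa-iap.iam.gserviceaccount.com&lt;/code&gt;. The tiny helper below (with a made-up project number) just makes the pattern explicit — confirm the actual account in your project's IAM page with Google-provided role grants shown:&lt;/p&gt;

```typescript
// Construct the IAP service agent email for a given project number.
// The format follows Google Cloud's service-agent convention; this is a
// string-building sketch, not an API lookup, so verify against your project.
function iapServiceAgent(projectNumber: string): string {
  return `service-${projectNumber}@gcp-sa-iap.iam.gserviceaccount.com`;
}

console.log(iapServiceAgent("123456789012")); // placeholder project number
```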

&lt;h3&gt;
  
  
  Step 2: Create OAuth Credentials
&lt;/h3&gt;

&lt;p&gt;Before we can use IAP, we need OAuth 2.0 credentials. Head to the Google Cloud Console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;APIs &amp;amp; Services &amp;gt; Credentials&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Create Credentials &amp;gt; OAuth client ID&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Web application&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add your domain to &lt;strong&gt;Authorized redirect URIs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Note down the Client ID and Client Secret&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll store these as environment variables for security:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IAP_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-client-id-here"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IAP_CLIENT_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-client-secret-here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create the Helper Function
&lt;/h3&gt;

&lt;p&gt;Let's create a reusable function for IAP implementation. This is where the magic happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Manifest&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cdktf/provider-kubernetes/lib/manifest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;IapConfigProps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;backendConfigName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;oauthSecretName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createIapResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;IapConfigProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Validate credentials are provided&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientId&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IAP_CLIENT_ID and IAP_CLIENT_SECRET must be set&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Create Kubernetes Secret for OAuth credentials&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iap-oauth-secret&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Secret&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oauthSecretName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Opaque&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="na"&gt;client_secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Create BackendConfig with IAP enabled&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iap-backendconfig&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cloud.google.com/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;BackendConfig&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backendConfigName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;iap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;oauthclientCredentials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oauthSecretName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;timeoutSec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;connectionDraining&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;drainingTimeoutSec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me break down what this function is doing step by step:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Interface&lt;/strong&gt;: The &lt;code&gt;IapConfigProps&lt;/code&gt; interface is like a contract that ensures anyone using this function provides all the required information. It's TypeScript's way of saying "these are the mandatory parameters."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential Validation&lt;/strong&gt;: The first thing we do is check if OAuth credentials are provided. This prevents silent failures where you deploy everything successfully but IAP doesn't work because credentials are missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating the Secret&lt;/strong&gt;: The first &lt;code&gt;Manifest&lt;/code&gt; creates a Kubernetes Secret to store our OAuth credentials. Notice how we use &lt;code&gt;Buffer.from().toString("base64")&lt;/code&gt; — this is because Kubernetes secrets must be base64 encoded. The secret type is "Opaque," which is Kubernetes' way of saying "this is arbitrary user-defined data."&lt;/p&gt;
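&lt;p&gt;You can sanity-check that encoding in isolation. The snippet below (using a made-up placeholder value, not a real credential) shows the same &lt;code&gt;Buffer&lt;/code&gt; round trip the helper relies on:&lt;/p&gt;

```typescript
// Kubernetes Secret `data` values must be base64-encoded strings;
// the cluster decodes them before exposing them to workloads.
const clientId = "example-client-id"; // placeholder, not a real credential

const encoded = Buffer.from(clientId).toString("base64");
const decoded = Buffer.from(encoded, "base64").toString("utf8");

console.log(encoded);
console.log(decoded === clientId); // the value round-trips cleanly
```

&lt;p&gt;(Kubernetes also accepts a &lt;code&gt;stringData&lt;/code&gt; field where the API server does the encoding for you, but since our manifest writes &lt;code&gt;data&lt;/code&gt; directly, we encode explicitly.)&lt;/p&gt;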

&lt;p&gt;&lt;strong&gt;Creating the BackendConfig&lt;/strong&gt;: The second &lt;code&gt;Manifest&lt;/code&gt; is where the real IAP magic happens. We're telling GKE: "When you create the Google Cloud Backend Service for any service that references this BackendConfig, please enable IAP and use the OAuth credentials from this secret." The &lt;code&gt;timeoutSec&lt;/code&gt; and &lt;code&gt;connectionDraining&lt;/code&gt; are additional configurations to ensure graceful handling of requests during deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create the Main Stack
&lt;/h3&gt;

&lt;p&gt;Now let's put it all together in our main infrastructure stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;DataGoogleContainerCluster&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cdktf/provider-google/lib/data-google-container-cluster&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;IapWebIamBinding&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cdktf/provider-google/lib/iap-web-iam-binding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Namespace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cdktf/provider-kubernetes/lib/namespace&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SonarQubeStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;TerraformStack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-example-project&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// Enable IAP service&lt;/span&gt;
    &lt;span class="nf"&gt;enableServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iap.googleapis.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Reference existing GKE cluster&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;gke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataGoogleContainerCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gke-cluster&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;us-central1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-cluster&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Create namespace&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonarqube-namespace&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonarqube&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Create IAP resources&lt;/span&gt;
    &lt;span class="nf"&gt;createIapResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;backendConfigName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonarqube-backendconfig&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;oauthSecretName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonarqube-iap-oauth-secret&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;IAP_CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;IAP_CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Grant access to specific users/groups&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IapWebIamBinding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonarqube-iap-access&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;roles/iap.httpsResourceAccessor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;members&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;group:developers@examplecompany.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;group:devops@examplecompany.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user:admin@examplecompany.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what's happening in our main stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Source Reference&lt;/strong&gt;: &lt;code&gt;DataGoogleContainerCluster&lt;/code&gt; is not creating a new cluster — it's referencing an existing one. This is CDKTF's way of saying "I need information about this resource that already exists." It's like looking up a contact in your phone book rather than adding a new one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace Creation&lt;/strong&gt;: We create a Kubernetes namespace to isolate our SonarQube resources. Think of namespaces like folders in your filesystem — they help organize and separate different applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calling Our Helper&lt;/strong&gt;: We call our &lt;code&gt;createIapResources&lt;/code&gt; function with specific values. Notice the &lt;code&gt;process.env.IAP_CLIENT_ID ?? ""&lt;/code&gt; pattern: the nullish coalescing operator &lt;code&gt;??&lt;/code&gt; falls back to an empty string when the environment variable is unset. That keeps this call from crashing, and the empty string is then rejected by the validation inside &lt;code&gt;createIapResources&lt;/code&gt;.&lt;/p&gt;
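&lt;p&gt;The difference between &lt;code&gt;??&lt;/code&gt; and the older &lt;code&gt;||&lt;/code&gt; fallback is worth a quick demonstration: &lt;code&gt;??&lt;/code&gt; only falls back on &lt;code&gt;null&lt;/code&gt; or &lt;code&gt;undefined&lt;/code&gt;, while &lt;code&gt;||&lt;/code&gt; also swallows legitimate falsy values such as an empty string:&lt;/p&gt;

```typescript
// ?? falls back only when the left-hand side is null or undefined.
const unset: string | undefined = undefined;
const fromUnset = unset ?? "fallback";

// || would also replace an explicitly set (but falsy) empty string.
const empty: string | undefined = "";
const withNullish = empty ?? "fallback"; // keeps ""
const withOr = empty || "fallback";      // replaces with "fallback"

console.log(fromUnset, JSON.stringify(withNullish), withOr);
```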

&lt;p&gt;&lt;strong&gt;IAM Binding - The Access Control&lt;/strong&gt;: This is crucial! &lt;code&gt;IapWebIamBinding&lt;/code&gt; is what actually grants people access. The role &lt;code&gt;roles/iap.httpsResourceAccessor&lt;/code&gt; is Google's predefined role that allows access through IAP. Without this binding, even authenticated users would be denied access. The &lt;code&gt;members&lt;/code&gt; array supports both individual users (&lt;code&gt;user:someone@company.com&lt;/code&gt;) and Google Groups (&lt;code&gt;group:teamname@company.com&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Configure Your Service
&lt;/h3&gt;

&lt;p&gt;The final piece is connecting your Kubernetes Service to the BackendConfig. In your service manifest (or Helm values), add this annotation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonarqube&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonarqube&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cloud.google.com/backend-config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"default":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"sonarqube-backendconfig"}'&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... rest of your service configuration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using Helm (like we are with SonarQube), update your &lt;code&gt;values.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;externalPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9000&lt;/span&gt;
  &lt;span class="na"&gt;internalPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9000&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="na"&gt;cloud.google.com/backend-config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"default":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"sonarqube-backendconfig"}'&lt;/span&gt;
    &lt;span class="na"&gt;cloud.google.com/load-balancer-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;External"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This annotation is the bridge between your Kubernetes Service and the BackendConfig. Here's what's happening:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Magic Annotation&lt;/strong&gt;: &lt;code&gt;cloud.google.com/backend-config&lt;/code&gt; tells GKE's ingress controller: "When you create a Google Cloud Backend Service for this Kubernetes Service, apply the configuration from this BackendConfig." The &lt;code&gt;{"default": "sonarqube-backendconfig"}&lt;/code&gt; part means "apply this BackendConfig to the default port" — if your service had multiple ports, you could specify different BackendConfigs for each.&lt;/p&gt;
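
&lt;p&gt;For example, a hypothetical two-port service could map each port to its own BackendConfig using the &lt;code&gt;ports&lt;/code&gt; form of the annotation (the port numbers and second config name here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metadata:
  annotations:
    # One BackendConfig per port, keyed by the service port name or number
    cloud.google.com/backend-config: '{"ports": {"9000": "sonarqube-backendconfig", "8080": "admin-backendconfig"}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;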

&lt;p&gt;&lt;strong&gt;Load Balancer Type&lt;/strong&gt;: The &lt;code&gt;cloud.google.com/load-balancer-type: "External"&lt;/code&gt; annotation ensures your service gets an external IP address that can be reached from the internet (after passing through IAP, of course).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment of Truth: Testing Your Implementation
&lt;/h2&gt;

&lt;p&gt;Deploy your infrastructure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdktf deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deployment, navigate to your service URL. Instead of direct access, you should see the Google sign-in page. After authentication with your corporate account, IAP will check if you have the necessary permissions and either grant or deny access.&lt;/p&gt;
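
&lt;p&gt;You can also smoke-test this from the terminal. A minimal sketch, assuming your service lives at &lt;code&gt;sonarqube.example.com&lt;/code&gt; (substitute your real domain):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# An unauthenticated request should be redirected to Google's sign-in flow
curl -sI https://sonarqube.example.com/ | head -n 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You should see a 302 with a Location header pointing at Google's OAuth endpoint. A direct 200 response from SonarQube itself would mean IAP is not intercepting traffic.&lt;/p&gt;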

&lt;h2&gt;
  
  
  What Could Go Wrong (And How to Fix It)
&lt;/h2&gt;

&lt;p&gt;From my experience implementing IAP across multiple services, here are the common gotchas:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. OAuth Configuration Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Users see "Error: redirect_uri_mismatch"&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Ensure your OAuth client's authorized redirect URIs include your actual domain&lt;/p&gt;

&lt;h3&gt;
  
  
  2. IAM Permission Problems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Users authenticate but get "You don't have access"&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Check your IapWebIamBinding members list and verify users are in the specified groups&lt;/p&gt;
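
&lt;p&gt;One way to verify group membership from the CLI, as a sketch (assuming the Cloud Identity API is enabled for your organization; the group address is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List members of the group that was granted access
gcloud identity groups memberships list \
    --group-email="devops-team@examplecompany.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;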

&lt;h3&gt;
  
  
  3. BackendConfig Not Applied
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: IAP doesn't seem to work at all&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Verify the service annotation is correct and the BackendConfig exists in the same namespace&lt;/p&gt;
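
&lt;p&gt;These two checks, a sketch using the names from our setup, usually pinpoint the problem:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Confirm the annotation actually landed on the Service
kubectl get service sonarqube -n sonarqube \
    -o jsonpath='{.metadata.annotations.cloud\.google\.com/backend-config}'

# Confirm the BackendConfig exists in the same namespace
kubectl get backendconfig sonarqube-backendconfig -n sonarqube
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;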

&lt;h3&gt;
  
  
  4. SSL Certificate Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: IAP works but with SSL warnings&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Ensure you have proper SSL certificates configured for your domain&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices I've Learned the Hard Way
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Google Groups&lt;/strong&gt;: Instead of individual users, manage access through Google Groups. It's much easier to maintain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment Separation&lt;/strong&gt;: Use different OAuth clients for different environments (dev, staging, prod) for better security isolation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor Everything&lt;/strong&gt;: Enable IAP access logging and set up alerts for failed authentication attempts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Thoroughly&lt;/strong&gt;: Always test with users who shouldn't have access to ensure your permissions are working correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Your Groups&lt;/strong&gt;: Keep clear documentation of which Google Groups have access to which services.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Payoff
&lt;/h2&gt;

&lt;p&gt;After implementing IAP across our internal tools, the benefits were immediately apparent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer Productivity&lt;/strong&gt;: No more VPN hassles or remembering different passwords&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Compliance&lt;/strong&gt;: Clear audit trails and granular access control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Simplicity&lt;/strong&gt;: Centralized identity management through Google Workspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Easy to add new tools and services under the same security model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The initial setup might seem complex, but once you have the pattern established, securing additional services becomes a matter of copying and adapting your existing code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Identity-Aware Proxy represents a shift from traditional perimeter-based security to a zero-trust model. Combined with Infrastructure as Code practices using CDKTF, you get both security and maintainability.&lt;/p&gt;

&lt;p&gt;The implementation we've covered here is just the beginning. IAP supports advanced features like device-based access controls, context-aware access based on user location and device security posture, and integration with third-party identity providers.&lt;/p&gt;

&lt;p&gt;My advice? Start simple with basic IAP implementation, get comfortable with the concepts and workflows, then gradually add more sophisticated policies as your security requirements evolve.&lt;/p&gt;

&lt;p&gt;Remember, security isn't just about keeping the bad guys out — it's about making it easy for the good guys to get their work done safely and efficiently.&lt;/p&gt;

&lt;p&gt;What internal tools are you planning to secure with IAP? I'd love to hear about your implementation experiences in the comments!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this helpful, follow me on &lt;a href="https://www.linkedin.com/in/ogonna-nnamani" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; for more DevOps and cloud security content.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>gcp</category>
      <category>iac</category>
    </item>
    <item>
      <title>The Role Of Chaos Engineering in Building Anti-Fragile Systems</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Fri, 12 Apr 2024 03:29:11 +0000</pubDate>
      <link>https://dev.to/cloudiepad/the-role-of-chaos-engineering-in-building-anti-fragile-systems-17bg</link>
      <guid>https://dev.to/cloudiepad/the-role-of-chaos-engineering-in-building-anti-fragile-systems-17bg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;br&gt;
Welcome back to the Antifragile series guys!&lt;br&gt;
We will be discussing the role of Chaos Engineering in designing antifragile systems.&lt;/p&gt;

&lt;p&gt;Firstly, what is chaos engineering?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos engineering&lt;/strong&gt; is the practice of introducing controlled chaos into systems design. It involves deliberately injecting failures and unexpected events into a system to see how it responds. The goal is to uncover weaknesses and vulnerabilities before they cause major issues in real-world scenarios.&lt;/p&gt;

&lt;p&gt;Building a system that recovers from failure almost immediately demonstrates resiliency, and that is what antifragility is really about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who needs Chaos Engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementing chaos engineering in an architecture involves a lot of planning, because nobody wants to destroy what they have just built. Certain use cases and industries inspire this method of systems design, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tech Companies&lt;/strong&gt; &lt;br&gt;
Especially those providing online services, cloud computing, or software as a service (SaaS) platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Financial Services&lt;/strong&gt;&lt;br&gt;
Banks, stock exchanges, payment processors, and other financial institutions rely on highly available and secure systems to process transactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare&lt;/strong&gt; &lt;br&gt;
With the increasing digitization of medical records and telemedicine, healthcare organizations need reliable systems to provide critical services to patients.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Energy and Utilities&lt;/strong&gt;&lt;br&gt;
Power plants, oil refineries, and utility companies use complex systems for monitoring and managing infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These industries require constant uptime, and design methods like chaos engineering can be used to test for resiliency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools used for Chaos Engineering&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Chaos Mesh&lt;/strong&gt;&lt;br&gt;
An open-source chaos engineering platform for Kubernetes-based applications. It allows users to orchestrate chaos experiments to test the resilience of their Kubernetes clusters. These include pod failure, network latency and load testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8to2ftn4qa0eeuqe6oq.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8to2ftn4qa0eeuqe6oq.PNG" alt="Image of types of chaos by Chaos mesh" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pumba&lt;/strong&gt;&lt;br&gt;
 A chaos testing tool specifically designed for Docker containers. It allows users to introduce network latency, packet loss, and other disruptions to Docker containers to simulate real-world failures. &lt;br&gt;
Pumba can kill, stop, or remove running containers. It can also pause all processes within a running container for a specified period of time.&lt;/p&gt;
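
&lt;p&gt;A rough sketch of what these disruptions look like in practice (container names are illustrative; check &lt;code&gt;pumba --help&lt;/code&gt; for the exact flags in your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Kill a random container whose name starts with "test"
pumba --random kill --signal SIGKILL "re2:^test"

# Add 3 seconds of network latency to "mycontainer" for 5 minutes
pumba netem --duration 5m delay --time 3000 mycontainer

# Pause all processes in "mycontainer" for 30 seconds
pumba pause --duration 30s mycontainer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;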

&lt;p&gt;&lt;strong&gt;Chaos Monkey&lt;/strong&gt;&lt;br&gt;
Developed by Netflix, Chaos Monkey is one of the earliest chaos engineering tools. It randomly terminates virtual machine instances to ensure that engineers design systems that can withstand failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Litmus Chaos&lt;/strong&gt;&lt;br&gt;
An open-source chaos engineering platform for Kubernetes. It provides a framework and a set of pre-defined chaos experiments for testing Kubernetes resilience.&lt;br&gt;
Litmus was accepted to CNCF on June 25, 2020 and moved to the Incubating maturity level on January 11, 2022.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13c4wjp80hkigzjc7u1q.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13c4wjp80hkigzjc7u1q.PNG" alt="LITMUS CHAOS" width="677" height="322"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Bench&lt;/strong&gt;&lt;br&gt;
Apache Bench (invoked as &lt;code&gt;ab&lt;/code&gt;) is a single-threaded command-line program for benchmarking the performance of HTTP web servers. It can be used to stress test your APIs or endpoints to ensure they can withstand heavy concurrent traffic before deployment.&lt;/p&gt;

&lt;p&gt;These are some real-life tools used to test your products for resiliency and help you achieve an antifragile infrastructure.&lt;/p&gt;

&lt;p&gt;There are some special use cases where chaos engineering is automated and continually applied.&lt;/p&gt;

&lt;p&gt;A real-life example is FINBOURNE, a financial technology company that provides a cloud-based investment management platform called LUSID. LUSID is designed to help asset managers, wealth managers, and financial institutions streamline their investment operations.&lt;/p&gt;

&lt;p&gt;Finbourne hosts their infrastructure on AWS and implements an automated, special type of chaos engineering that terminates an application every &lt;strong&gt;seventeen (17) minutes&lt;/strong&gt;, terminates an EC2 instance every &lt;strong&gt;six (6) hours&lt;/strong&gt; and fails an Availability Zone &lt;strong&gt;twice weekly&lt;/strong&gt;, just to continually evaluate how quickly they recover from a failure.&lt;/p&gt;

&lt;p&gt;Mind-blowing, right?!&lt;/p&gt;

&lt;p&gt;These are some of the extreme design methods some companies undergo just to ensure optimal performance and resiliency.&lt;br&gt;
That will be all on chaos engineering today!!&lt;/p&gt;

&lt;p&gt; If you enjoyed this read, connect with me on &lt;a href="https://www.linkedin.com/in/ogonna-nnamani/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;HAPPY CLOUD COMPUTING!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>chaosengineering</category>
      <category>devops</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Building Anti-Fragile Systems For Modern-Day DevOps</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Fri, 12 Apr 2024 03:09:36 +0000</pubDate>
      <link>https://dev.to/cloudiepad/building-anti-fragile-systems-for-modern-day-devops-39ff</link>
      <guid>https://dev.to/cloudiepad/building-anti-fragile-systems-for-modern-day-devops-39ff</guid>
      <description>&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;br&gt;
What is antifragility?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Antifragility&lt;/strong&gt; is a concept introduced by Nassim Nicholas Taleb in his book “&lt;strong&gt;Antifragile: Things That Gain from Disorder,&lt;/strong&gt;” published in 2012. The term refers to a property of systems or entities that thrive and benefit from volatility, uncertainty, stress, and disorder.&lt;/p&gt;

&lt;p&gt;It’s common to say that robust or resilient is the opposite of fragile. Here, however, I respectfully disagree. I want to talk about an idea known as antifragility.&lt;/p&gt;

&lt;p&gt;“Resilience refers to the ability of a system or entity to withstand shocks, recover from adversity, and return to its original state or function. Resilience suggests the capacity to absorb and adapt to challenges without significant damage”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadypehoboeqd19d6ctl0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadypehoboeqd19d6ctl0.jpg" alt="chain" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Antifragility goes beyond resilience. An antifragile system not only withstands stressors but actually benefits and improves as a result of exposure to adversity. It thrives in dynamic and uncertain environments, becoming stronger and more robust through challenges. The human muscular system is one example of an antifragile mechanism in nature. Our muscles experience stress when we work out, which causes them to grow and strengthen. Another term for this is post-traumatic growth.&lt;/p&gt;

&lt;p&gt;Let’s now consider the example of a courier service delivering a glass piece. Packages marked “fragile” are those that detest stress and break easily upon experiencing it. The exact opposite would be an item that anticipates stress and is designed to withstand mishandling.&lt;/p&gt;

&lt;p&gt;Let’s bring that into the day-to-day designing, building, and managing of scalable systems by trying to build systems that expect variability and predict outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOW TO MEASURE FRAGILITY&lt;/strong&gt;&lt;br&gt;
Fragility refers to the quality or state of being fragile, which means easily broken, delicate, or vulnerable. It can be used to describe physical objects that are prone to breaking or damage.&lt;/p&gt;

&lt;p&gt;Let’s revisit the glass mirror example. There are only two states for a mirror: whole and broken. There is no middle ground when it comes to measuring the risk associated with that mirror; it either breaks or it doesn’t. This means we already know the second-order derivative (the outcome should the mirror fall), and this knowledge helps us prevent the mirror from falling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmgi6v0n6y2dmeoxzenf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmgi6v0n6y2dmeoxzenf.jpg" alt="mirrors" width="800" height="1000"&gt;&lt;/a&gt;&lt;br&gt;
The same ideology applies to systems. Only when we are aware of all that could go wrong in the system do we stand a better chance at preventing failure in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FACTORS THAT INFLUENCE ANTIFRAGILITY&lt;/strong&gt;&lt;br&gt;
Build, test, and fail fast: AWS and other public cloud providers have made building and developing systems easier because we can have multiple environments quickly. This has also enabled us to build and test quickly. Imagine spinning up a high-end server for a 30-minute test compared to having to rent that same server 20 years ago. The ease cannot be overemphasized. The ability to test fast also comes with failing fast, and when we fail fast, we learn fast. And by learning, we are able to measure the fragility of that environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funq8sf42doa3imhrbx9k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funq8sf42doa3imhrbx9k.jpg" alt=" " width="758" height="416"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Size&lt;/strong&gt;: The business’s size is a critical factor. The stressors on a monolithic program with ten infrequent users cannot be the same as those on a mobile app with over 500k concurrent users. Determine the environmental risk level and apply antifragility accordingly.&lt;br&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: There are various ways that complexity can manifest itself; it might originate from the way certain functionalities are handled in the code or from the architecture of the entire infrastructure. I’ll use two AWS environments, ENV A and ENV B, as an example.&lt;br&gt;
&lt;strong&gt;ENV A&lt;/strong&gt; comprises a single server responsible for hosting file systems, databases, and the web server. In the event of a server failure, a recent backup can be deployed to replace the malfunctioning server. Additionally, when faced with a surge in traffic, auto-scaling mechanisms come into play to ensure the server remains operational. In this scenario, it can be asserted that the principles of disaster recovery contribute to the concept of antifragility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ENV B&lt;/strong&gt;, on the other hand, is a complex, loosely coupled microservice environment that relies on Lambda functions, EventBridge, SNS, SQS, Step Functions, and databases. Since the playing field is now bigger, so is the risk: multiple parts can fail, which is what makes observability necessary. Observability will in turn provide insights that predict failures using patterns. We may then configure automated actions and alerts that, depending on the kind of action, can be set to reverse or effect changes.&lt;br&gt;
Because both environments were able to measure fragility and predict the second-order derivative, antifragility was introduced. This has also made it possible to build systems that anticipate variability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AREAS WHERE ANTIFRAGILITY CAN BE IMPLEMENTED&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Ensuring the security of our infrastructure is crucial, requiring a comprehensive approach from entry to exit. While traditional firewalls primarily served as detective systems, the evolution to next-generation firewalls (NGFWs) brings enhanced features. These include the Intrusion Prevention System (IPS), application awareness and control, and cloud-delivered threat intelligence, among others. NGFWs employ automation rules to detect anomalies and promptly respond by adjusting rules to mitigate potential threats. Notable examples of such advanced firewalls include AWS WAF and Fortinet FortiGate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69ue106icazjvraec4u4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69ue106icazjvraec4u4.jpg" alt=" " width="800" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;: Public cloud providers, such as AWS and Azure, have taken significant steps to enhance system robustness. One notable feature at the compute level is Auto-Scaling, which dynamically adjusts resources in response to sudden increases in traffic. Additionally, Elastic Load Balancer (ELB) is a key service that spans multiple Availability Zones (AZs) and conducts regular health checks. This ensures that only healthy servers receive traffic, and in the event of any issues, the ELB automatically redirects traffic to other healthy instances from a pool.&lt;br&gt;
This approach guarantees continuous uptime for the environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking:&lt;/strong&gt; Networking is crucial for building robust systems, and achieving antifragility at the network level is essential for prioritizing interconnectivity. Spanning two networks can enhance this antifragility, with services like AWS Route 53 enabling availability at a global scale. Route 53, a scalable domain name system (DNS) web service, efficiently routes end-user requests to globally distributed endpoints, contributing to application availability and reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring and Observability:&lt;/strong&gt;&lt;br&gt;
To gain insight into patterns for measuring fragility, we require systems that bolster monitoring. Tools like CloudWatch, Prometheus, and Grafana are employed to establish alerts and updates when anomalies are detected. Observability tools, such as AWS X-RAY, are utilized to monitor existing systems. The insights gathered from observation are then leveraged to predict and anticipate anomalies, enhancing the predictability of fragility.&lt;/p&gt;

&lt;p&gt;By examining the breakdowns provided above, I hope I have been able to show you that building antifragile systems that can thrive in disorder is truly a robust way of development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QUICK SUMMARY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building anti-fragile systems is possible.&lt;br&gt;
Fragility should always be measured.&lt;br&gt;
The next part will focus on the day-to-day DevOps practices, including developing CI/CD pipelines, testing and integration, automated deployment, and monitoring.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this read. If you did, kindly connect with me on &lt;a href="https://www.linkedin.com/in/ogonna-nnamani/" rel="noopener noreferrer"&gt;LinkedIn.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HAPPY CLOUD COMPUTING!!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>antifragility</category>
      <category>cloud</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>The Kubernetes Resume Challenge Part 2</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Sun, 24 Mar 2024 00:05:29 +0000</pubDate>
      <link>https://dev.to/cloudiepad/the-kubernetes-resume-challenge-part-2-2op1</link>
      <guid>https://dev.to/cloudiepad/the-kubernetes-resume-challenge-part-2-2op1</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/cloudiepad/the-kubernetes-resume-challenge-part-1-488d"&gt;&lt;strong&gt;CLICK FOR PART 1 HERE&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 10: Autoscale Your Application&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Automate scaling based on CPU usage to handle unpredictable traffic spikes.&lt;/p&gt;

&lt;p&gt;Implement HPA: Create a Horizontal Pod Autoscaler targeting 50% CPU utilization, with a minimum of 2 and a maximum of 10 pods.&lt;/p&gt;

&lt;p&gt;Apply HPA: Execute kubectl autoscale deployment ecom-web --cpu-percent=50 --min=2 --max=10.&lt;/p&gt;

&lt;p&gt;Simulate Load: Use a tool like Apache Bench to generate traffic and increase CPU load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
To implement this, instead of using the autoscale command, I generated an "hpa.yaml" file that defines the metrics that should trigger the autoscaling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  namespace: default
  name: ecomm-hpa
  labels:
    app: ecomm-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ecomm-app-deployment
  minReplicas: 2 
  maxReplicas: 10  
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have defined our HPA, it does not mean the deployment will automatically scale. I encountered an error: no matter the amount of load testing done with Apache Bench, the pods did not scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn4rs5i6ws608899e35y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn4rs5i6ws608899e35y.PNG" alt="HPA did not scale" width="693" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see here, when we apply the HPA to the deployment it is not able to track the current utilization of the pods, hence the "unknown" status. After some research, the fix was that the resources section under the container part of the deployment file was not defined. This is an example of how it is defined in a deployment file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  labels:
    app: sample
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: sample-app
        image: your-registry/sample-app:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "300m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;br&gt;
resources: Specifies the computational resources (CPU, memory, etc.) needed by the pods.&lt;/p&gt;

&lt;p&gt;requests: Defines the minimum amount of resources required by the pods to run.&lt;/p&gt;

&lt;p&gt;cpu: "300m": Sets the CPU request to 300 milliCPU (300m), which represents 0.3 of a CPU core. This indicates the minimum amount of CPU that each pod in the deployment requires to function properly.&lt;/p&gt;

&lt;p&gt;After applying this updated file, to monitor it in real time and test quickly, I changed my HPA average utilization from 50% to 10% and ran&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get hpa -w
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and we get the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4mdotcb8lg4y8a65l01.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4mdotcb8lg4y8a65l01.PNG" alt="HPA now scales" width="611" height="177"&gt;&lt;/a&gt;&lt;br&gt;
The HPA finally tracked the CPU utilization, and our replicas are now scaling from 3 to 5.&lt;/p&gt;
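&lt;p&gt;For reference, the HPA itself can also be declared as a manifest rather than created imperatively. A minimal sketch, assuming the sample deployment's names; the 50% target mirrors the original setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;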

&lt;p&gt;&lt;strong&gt;LOAD TESTING TOOLS&lt;/strong&gt;&lt;br&gt;
The guidelines suggested we use &lt;strong&gt;Apache Bench&lt;/strong&gt;, which I found really interesting. The command below sends simulated requests to the Kubernetes endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ab -n 1000 -c 10 http://&amp;lt;endpoint_url or IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The ab command is used for benchmarking HTTP server performance. Here's what each part of the command above does:&lt;/li&gt;
&lt;li&gt;-n 1000: Specifies the number of requests to perform. In this case, it's set to 1000, meaning Apache Bench (ab) will send 1000 requests to the server.&lt;/li&gt;
&lt;li&gt;-c 10: Specifies the number of multiple requests to perform at a time. Here, it's set to 10, meaning Apache Bench will send 10 requests concurrently.&lt;/li&gt;
&lt;li&gt;http://&amp;lt;endpoint_url or IP&amp;gt;: Specifies the URL or IP address of the server to benchmark.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also discovered another approach, a &lt;strong&gt;kubectl load generator&lt;/strong&gt; pod, which runs load tests specifically against Kubernetes endpoints. With a single command, simulated requests begin flowing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://website-service; done"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;-i --tty: Allocates an interactive terminal for the command to run.&lt;/li&gt;
&lt;li&gt;load-generator: Name of the pod.&lt;/li&gt;
&lt;li&gt;--rm: Removes the pod after it terminates.&lt;/li&gt;
&lt;li&gt;--image=busybox: Specifies the Docker image to use for the pod.&lt;/li&gt;
&lt;li&gt;--restart=Never: Indicates that the pod should not be restarted automatically if it fails.&lt;/li&gt;
&lt;li&gt;/bin/sh -c "while sleep 0.01; do wget -q -O- &lt;a href="http://website-service" rel="noopener noreferrer"&gt;http://website-service&lt;/a&gt;; done": The command to run inside the pod, which continuously sends HTTP requests to the specified endpoint - (&lt;a href="http://website-service" rel="noopener noreferrer"&gt;http://website-service&lt;/a&gt;) with a delay of 0.01 seconds between requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 11: Implement Liveness and Readiness Probes&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Add liveness and readiness probes to website-deployment.yaml, targeting an endpoint in your application that confirms its operational status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
This is another step where I reached out to a friend (a PHP developer) to configure health checks on both the website and the DB. Hitting the /app-healthcheck endpoint returned "App is running", and the same applied to the database.&lt;/p&gt;
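&lt;p&gt;A sketch of how such probes can be declared under the container spec; the /app-healthcheck path and port are assumptions based on the setup described:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        livenessProbe:
          httpGet:
            path: /app-healthcheck
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:
          httpGet:
            path: /app-healthcheck
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;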

&lt;p&gt;&lt;strong&gt;Step 12: Utilize ConfigMaps and Secrets&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Securely manage the database connection string and feature toggles without hardcoding them in the application.&lt;br&gt;
Create Secret and ConfigMap: For sensitive data like DB credentials, use a Secret. For non-sensitive data like feature toggles, use a ConfigMap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
Similar to the feature-toggle-config, I generated a configmap.yaml file that stores the environment variables initially hardcoded in the deployment file. The configs now live in the ConfigMap and are referenced from the deployment file, while a website-secret.yaml and a mariadb-secret.yaml are also created to handle the DB credentials of both resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: website-configmap
data:
  DB_NAME: "ecomdb"
  DB_HOST: "mariadb-service"
  DB_USER: "ecomdb-user"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above is an example of a configmap.yaml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
data:
  DB_PASSWORD: cGFzc3dvcmQxMjM= 
kind: Secret
metadata:
  name: db-secret
type: Opaque
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An example of a secret.yaml file.&lt;br&gt;
Now we reference this in our deployment.yaml file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  labels:
    app: sample
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: sample-app
        image: your-registry/sample-app:latest
        ports:
        - containerPort: 80
        env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: website-configmap
              key: DB_HOST
        - name: DB_USER
          valueFrom:
            configMapKeyRef:
              name: website-configmap
              key: DB_USER
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: DB_PASSWORD
        - name: DB_NAME
          valueFrom:
            configMapKeyRef:
              name: website-configmap
              key: DB_NAME
        - name: FEATURE_DARK_MODE
          valueFrom:
            configMapKeyRef:
              name: feature-toggle-config
              key: FEATURE_DARK_MODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This references the configmap.yaml and secret.yaml files shown above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra credit&lt;/strong&gt;:&lt;br&gt;
&lt;strong&gt;Package Everything in Helm&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Utilize Helm to package your application, making deployment and management on Kubernetes clusters more efficient and scalable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
Helm charts make deploying and managing applications on Kubernetes clusters very efficient. By utilizing a values.yaml file, we can define the values used across our various manifests, making the chart generic and highly reusable. Below is a sample deployment template and a values.yaml supplying the specific configuration values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-app
  labels:
    app: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
      - name: {{ .Release.Name }}-container
        image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
        ports:
        - containerPort: {{ .Values.containerPort }}
        resources:
{{ toYaml .Values.resources | indent 10 }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample Deployment.yaml file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;replicaCount: 3
image:
  repository: your-registry/sample-app
  tag: latest
containerPort: 8080
resources:
  requests:
    cpu: "300m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample Values.yaml file&lt;/p&gt;

&lt;p&gt;The deployment.yaml file is a Helm template file. It uses Go templating syntax to inject values from the values.yaml file.&lt;br&gt;
The values.yaml file defines configurable values for the Helm chart, such as the number of replicas, the Docker image repository and tag, container port, and resource requests and limits.&lt;/p&gt;

&lt;p&gt;Running the helm install command deploys our website to a Kubernetes cluster.&lt;/p&gt;
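&lt;p&gt;A minimal example, assuming the chart lives in the current directory and a release name of ecom-site:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install ecom-site . -f values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;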

&lt;p&gt;&lt;strong&gt;Implement Persistent Storage&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Ensure data persistence for the MariaDB database across pod restarts and redeployments.&lt;/p&gt;

&lt;p&gt;Create a PVC: Define a PersistentVolumeClaim for MariaDB storage needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mariadb-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above is a sample pvc.yaml file that claims persistent storage decoupled from the pod lifecycle, ensuring that data survives pod restarts as long as the volume remains mounted.&lt;/p&gt;
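&lt;p&gt;On its own, the PVC only claims storage; the MariaDB deployment still has to mount it. A sketch of the relevant additions to the pod spec, where the mount path is MariaDB's default data directory and the image tag is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      containers:
      - name: mariadb
        image: mariadb:10.11
        volumeMounts:
        - name: mariadb-storage
          mountPath: /var/lib/mysql
      volumes:
      - name: mariadb-storage
        persistentVolumeClaim:
          claimName: mariadb-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;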

&lt;p&gt;&lt;strong&gt;Implement Basic CI/CD Pipeline&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Automate the build and deployment process using GitHub Actions.&lt;br&gt;
GitHub Actions Workflow: Create a .github/workflows/deploy.yml file to build the Docker image, push it to Docker Hub, and update the Kubernetes deployment upon push to the main branch&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
To automate this deployment, I generated a CI/CD workflow file that does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses the azure/setup-kubectl action to install kubectl for our pipeline.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout code
      uses: actions/checkout@v3
    - name: Install kubectl
      uses: azure/setup-kubectl@v2.0
      with:
        version: 'v1.27.0' # default is latest stable
      id: install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Authenticates to an AWS account using an access key ID and secret access key.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;As we are deploying to ECR, we log in, then build, tag, and finally push the image to an ECR repository.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: Login to Amazon ECR Public
      id: login-ecr-public
      uses: aws-actions/amazon-ecr-login@v2
      with:
        registry-type: public

    - name: Build, tag, and push docker image to Amazon ECR Public
      env:
        REGISTRY: ${{ secrets.ECR_REGISTRY }}
        REPOSITORY: kubernetes-resume-challenge-repo
        IMAGE_TAG: latest
      run: |
        docker build -t $REGISTRY/$REPOSITORY:$IMAGE_TAG .
        docker push $REGISTRY/$REPOSITORY:$IMAGE_TAG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The command below updates the kubeconfig file with the credentials and endpoint information necessary to connect to the Amazon EKS cluster named kubernetes-cluster. This allows subsequent steps in the pipeline to interact with the Kubernetes cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: Update kube config
  run: aws eks update-kubeconfig --name kubernetes-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The step below navigates to the Helm chart directory of this project to execute the helm uninstall and install commands. Because this is a continuous pipeline, the first command removes the release if it exists, and the chart is then re-installed to pick up new changes made to the project files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
- name: Deploy go-app helm chart to EKS
      run: |
        helm uninstall helm-app -n helm || true
        cd kubernetes/helm-app
        helm install helm-app . -f values.yaml -n helm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
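&lt;p&gt;As a side note, the same result can be achieved in one idempotent step with helm upgrade --install, which installs the release if it is absent and upgrades it otherwise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install helm-app . -f values.yaml -n helm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;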



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Ladies and gentlemen, we have come to the end of this wonderful project, and I didn't realize how much went into it until I had to document and relive the roller-coaster moments I had along the way. My knowledge of Kubernetes and the cloud in general has been stretched as a result. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed going through this with me, and that it encourages you to try it yourself and have some fun while at it, because I sure did!&lt;/p&gt;

&lt;p&gt;To connect with me&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/ogonna-nnamani/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/Waveey/Kubernetes-Resume-Challenge" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/WavebuoyOG" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Kubernetes Resume Challenge Part 1</title>
      <dc:creator>Ogonna Nnamani</dc:creator>
      <pubDate>Sun, 24 Mar 2024 00:04:44 +0000</pubDate>
      <link>https://dev.to/cloudiepad/the-kubernetes-resume-challenge-part-1-488d</link>
      <guid>https://dev.to/cloudiepad/the-kubernetes-resume-challenge-part-1-488d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;br&gt;
This is a two-part blog article on the steps I took while doing the Kubernetes resume challenge by &lt;a href="https://newsletter.goodtechthings.com/p/take-on-the-kubernetes-resume-challenge" rel="noopener noreferrer"&gt;Forrest Brazeal&lt;/a&gt;, who is also known for creating the popular &lt;a href="https://cloudresumechallenge.dev/docs/the-challenge/aws/" rel="noopener noreferrer"&gt;Cloud Resume Challenge&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;A little backstory: some months after I got into cloud computing as a DevOps Engineer, I really wanted to do the Cloud Resume Challenge, but something always got in the way of completing it. Most times, it was the lack of AWS credits (don't blame me, it's easy to drown in cloud bills as an enthusiastic newbie), or some other urgent project or task at work. At some point I gave up on the idea. &lt;/p&gt;

&lt;p&gt;Fast forward to March 2024, Forrest Brazeal is out with another challenge called "&lt;a href="https://cloudresumechallenge.dev/docs/extensions/kubernetes-challenge/?utm_source=substack&amp;amp;utm_medium=email" rel="noopener noreferrer"&gt;The Kubernetes Resume&lt;/a&gt;", Click on the link to access this challenge and guidelines. Ignore the name, this is far from a resume! &lt;br&gt;
I got right into it and finally completed it. Link to my GitHub repo here. &lt;br&gt;
Come along as I walk you through the steps I followed to accomplish this task.&lt;br&gt;
Let's jump straight in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;br&gt;
We have to deploy a PHP e-commerce website, a web application facing challenges around scalability and availability. To address these, we will leverage containerization with Docker and orchestration with Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Kubernetes CLI Tools&lt;/li&gt;
&lt;li&gt;Cloud Provider Account: Access to AWS, Azure, or GCP for - 
creating a Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;GitHub Account&lt;/li&gt;
&lt;li&gt;Kubernetes Crash Course&lt;/li&gt;
&lt;li&gt;E-commerce Application Source Code and DB Scripts: Available at 
kodekloudhub/learning-app-ecommerce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Certification&lt;/strong&gt;&lt;br&gt;
The first step is to have the CKAD certification or complete the CKAD course on KodeKloud, which I concluded last year, but I had become rusty; nonetheless, I proceeded. (Don't be like me, take the course if you're not hands-on with Kubernetes.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Containerize Your E-Commerce Website and Database&lt;/strong&gt;&lt;br&gt;
The second step is containerizing the e-commerce website and the database. Your proficiency with Docker is tested here because you have to update the database connection string and attach an initialization script that is mounted on the database during creation. That means creating two Dockerfiles, one for the website and another for the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Website&lt;/strong&gt;&lt;br&gt;
Use php:7.4-apache as the base image,&lt;br&gt;
install the mysqli extension for PHP,&lt;br&gt;
and expose port 80.&lt;br&gt;
Test this to ensure there are no errors during the build.&lt;/p&gt;
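&lt;p&gt;Putting those steps together, the website Dockerfile can be sketched roughly like this; the COPY path assumes the site source sits at the repo root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use PHP 7.4 with Apache as the base image
FROM php:7.4-apache

# Install the mysqli extension the app uses to talk to MariaDB
RUN docker-php-ext-install mysqli

# Copy the application source into Apache's web root
COPY . /var/www/html/

EXPOSE 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;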

&lt;p&gt;&lt;strong&gt;Database &lt;/strong&gt;&lt;br&gt;
The database Dockerfile was built using an official MariaDB image, but a database initialization script will be mounted on launch of the database. I spent a lot of time working on this step as I am not really a fan of PHP. I was torn between having the script as an entrypoint script or as a Kubernetes ConfigMap object that will be used on my deployment. &lt;/p&gt;

&lt;p&gt;The errors I had were because I was trying to create a .env file, but I just needed to hardcode the values either in the Dockerfile or as the website ConfigMap variables. Either way, we need to ensure that the application has variables referencing DB_NAME, DB_HOST, DB_PASSWORD and DB_USER.&lt;/p&gt;
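&lt;p&gt;In the Dockerfile, hardcoding them looks something like this; the values shown are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENV DB_HOST=mariadb-service \
    DB_NAME=ecomdb \
    DB_USER=ecomdb-user \
    DB_PASSWORD=password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;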

&lt;p&gt;Now, after building locally, we push to Docker Hub so the image can be pulled in the deployment file. The commands below can be used to build, tag and push to Docker Hub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t cloudiepad/ecomm-img:v5 .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker push cloudiepad/ecomm-img:v5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Set Up Kubernetes on a Public Cloud Provider&lt;/strong&gt;&lt;br&gt;
As an AWS guy I used EKS to set up my clusters, If you are new to kubernetes as a whole, I would advise testing your cluster locally using minikube before going ahead to provision using EKS. it's pretty easy to rack up cloud debts using managed services on the cloud. &lt;/p&gt;

&lt;p&gt;It's best to test and ensure that all your manifest files are correct and working before deploying to EKS.&lt;br&gt;
 &lt;br&gt;
&lt;strong&gt;Steps to deploy an EKS cluster using CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install and configure AWS CLI&lt;/li&gt;
&lt;li&gt;Install eksctl: This is the EKS command-line tool that enables us to run commands against EKS from the CLI.&lt;/li&gt;
&lt;li&gt;Install kubectl: we will need this to interact with our cluster.&lt;/li&gt;
&lt;li&gt;To set up an EKS cluster, ensure that the IAM user has the appropriate permissions and roles to access both ECR and EKS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use eksctl to create a cluster with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster --name eks-cluster --region us-east-1 --zones=us-east-1a,us-east-1b --nodegroup-name node-group --node-type t2.small --nodes 2 --nodes-min 2 --nodes-max 5 --managed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;eks-cluster&lt;/code&gt;,&lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;us-east-1a,us-east-1b&lt;/code&gt; with your preferred cluster name, AWS region and availability zones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;create cluster: This part of the command instructs eksctl to create a new EKS cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;name eks-cluster: This specifies the name of the EKS cluster to be created, in this case, it's named "eks-cluster".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;region us-east-1: Specifies the AWS region in which the cluster will be created. In this case, it's US East (N. Virginia).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;zones=us-east-1a,us-east-1b: Specifies the availability zones in which the worker nodes will be created. In this example, it's specifying us-east-1a and us-east-1b.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nodegroup-name node-group: This specifies the name of the node group within the EKS cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;node-type t2.small: Specifies the EC2 instance type for the worker nodes. In this case, it's t2.small.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nodes 2: Specifies the initial number of worker nodes in the node group. In this example, it's set to 2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nodes-min 2: Specifies the minimum number of worker nodes in the node group. In this example, it's set to 2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nodes-max 5: Specifies the maximum number of worker nodes in the node group. In this example, it's set to 5.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;managed: Indicates that this node group will be managed by Amazon EKS.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Deploy Your Website to Kubernetes&lt;/strong&gt;&lt;br&gt;
At this stage, I generated a &lt;em&gt;website-deployment.yaml&lt;/em&gt; file that declares my website deployment including the name of the website, the amount of replicas I want at all times, the name of my docker image and the location. &lt;/p&gt;

&lt;p&gt;There is also a need to generate a mariadb-deployment.yaml file for the database. Instead of using a plain MariaDB image for the DB, I baked all the environment variables into my Dockerfile, so that when the container is spun up, it has all the details I need already in place. &lt;/p&gt;

&lt;p&gt;Also ensure that the database pod has the db-load-script.sql script loaded onto the database and that it creates a database called 'ecomdb'. If you see that database, then you have successfully loaded the script as an entrypoint. &lt;/p&gt;
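&lt;p&gt;A quick way to verify this is to exec into the database pod and list the databases; the pod name here is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it &amp;lt;mariadb-pod-name&amp;gt; -- mysql -u ecomdb-user -p -e "SHOW DATABASES;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;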

&lt;p&gt;Another issue that I encountered several times was my DB user being unable to access the DB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR 1045 (28000): Access denied for user 'ecomm-user'@ 'localhost' (using password: NO)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This error happens when the right password is not captured in the Dockerfile. Little errors like this can consume a lot of time. Watch out: the password the website expects and the password used to launch the DB should be the same.&lt;/p&gt;

&lt;p&gt;Below is an example of a Kubernetes deployment.yaml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  labels:
    app: sample
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: sample-app
        image: your-registry/sample-app:latest
        ports:
        - containerPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5: Expose Your Website&lt;/strong&gt;&lt;br&gt;
In this stage we have to generate another YAML file, a service.yaml; we can call this one website-service.yaml, and as the guidelines state, we create a service of type LoadBalancer. &lt;br&gt;
When we deploy this service to EKS, AWS provisions a load balancer (a Classic Load Balancer by default, not an ALB) to serve our website. A corresponding service is created for the database, though the DB service does not need to be publicly exposed. A Kubernetes service.yaml file looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: sample-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Implement Configuration Management&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Add a feature toggle to the web application to enable a "dark mode" for the website. &lt;/p&gt;

&lt;p&gt;Modify the Web Application: Add a simple feature toggle in the application code (e.g., an environment variable FEATURE_DARK_MODE that enables a CSS dark theme).&lt;/p&gt;

&lt;p&gt;Use ConfigMaps: Create a ConfigMap named feature-toggle-config with the data FEATURE_DARK_MODE=true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
This stage was a bit tricky for me because I am not a developer, and even as a DevOps Engineer, PHP is not my favorite. The aim of this stage is to show that ConfigMaps can be used in several capacities. Anyway, I reached out to a friend of mine who is a PHP developer, and he helped me refactor the code and added a "style-dark.css" file to actually do the toggling. I was then tasked with implementing this feature via a ConfigMap. I finally figured that out, and my ConfigMap file looked like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-toggle-config
data:
  FEATURE_DARK_MODE: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ConfigMap also had to be referenced in the deployment file so the deployment is aware of this addition. When that is completed, run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f feature-toggle-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to apply this config, and voila, we had a dark-mode website. I felt like Superman after this, haha!&lt;/p&gt;
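&lt;p&gt;For completeness, referencing the ConfigMap in the deployment's container spec looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        env:
        - name: FEATURE_DARK_MODE
          valueFrom:
            configMapKeyRef:
              name: feature-toggle-config
              key: FEATURE_DARK_MODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;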

&lt;p&gt;&lt;strong&gt;Step 7: Scale Your Application&lt;/strong&gt;&lt;br&gt;
At this point we have deployment files, service files and a ConfigMap file to toggle dark mode. The next task is to scale this website using&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment/&amp;lt;deployment_name&amp;gt; --replicas=6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; This command refers to the active deployment and tells it to scale to 6 replicas.&lt;br&gt;
The scaling happens immediately and can be observed in real time by running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -w
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now observe 6 pods in the Running state, as against the initial replica count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Perform a Rolling Update&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Update the website to include a new promotional banner for the marketing campaign.&lt;/p&gt;

&lt;p&gt;Update Application: Modify the web application's code to include the promotional banner.&lt;/p&gt;

&lt;p&gt;Build and Push New Image: Build the updated Docker image as yourdockerhubusername/ecom-web:v2 and push it to Docker Hub.&lt;br&gt;
Rolling Update: Update website-deployment.yaml with the new image version and apply the changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
This stage involves making a change to the website, building and pushing a new Docker image version, and updating the deployment with that new image version. &lt;br&gt;
Also, instead of just applying this new deployment, we use the rolling-update mechanism, where old pods terminate while new pods are created simultaneously so that there is no downtime, thereby enhancing availability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout status deployment/&amp;lt;deployment_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &amp;lt;deployment_name&amp;gt; with the name of your deployment. This command will provide you with the status of the rollout, including whether it's in progress, successful, or failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Roll Back a Deployment&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Task&lt;/strong&gt;: Suppose the new banner introduced a bug. Roll back to the previous version.&lt;/p&gt;

&lt;p&gt;Identify Issue: After deployment, monitoring tools indicate a problem affecting user experience.&lt;/p&gt;

&lt;p&gt;Roll Back: Execute kubectl rollout undo deployment/ecom-web to revert to the previous deployment state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
This is similar to the last step but the exact opposite: how do we undo a rollout that causes our website to break? First, we roll back to a working version while we troubleshoot the issue with that particular buggy release.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout undo deployment/&amp;lt;deployment_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command seamlessly handles this.&lt;/p&gt;
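&lt;p&gt;Before undoing, it can help to inspect the revision history; a specific revision can also be targeted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout history deployment/&amp;lt;deployment_name&amp;gt;
kubectl rollout undo deployment/&amp;lt;deployment_name&amp;gt; --to-revision=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;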

&lt;p&gt;Let's take a coffee break now, and then move on to the second part of the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/cloudiepad/the-kubernetes-resume-challenge-part-2-2op1"&gt;&lt;strong&gt;CLICK HERE FOR PART 2 &lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
If you enjoyed this article and would love to connect with me. Find me with the below links.&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/ogonna-nnamani/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/Waveey/Kubernetes-Resume-Challenge" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/wavebuoyOG" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kubernetesresumechallenge</category>
      <category>docker</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
