DEV Community: Lydiah Wanjiru

Pose Estimation in Action: Visualizing the Golf Swing Frame by Frame ⛳️🤖

Lydiah Wanjiru — Fri, 04 Apr 2025 11:36:05 +0000

In my last post, I introduced SwingSense—a golf swing diagnostic tool powered by AI and computer vision, built as both a technical challenge and a personal love letter to movement, mastery, and meaningful machine learning.

This is where we start building with intention.

In this post, I’ll walk you through how I used MediaPipe and OpenCV to process a swing video, detect human pose landmarks, and visualize the beauty of motion—one frame at a time.

🎯 Where We’re Headed

Here’s what we’re unlocking today:

Loading a real golf swing video
Applying pose estimation to detect key joints
Overlaying the skeletal movement onto video frames
Preparing to extract joint movement data for ML modeling

We're laying the foundation for understanding a swing not just as motion—but as data.

🎥 Step 1: Loading the Swing Video

I started with a slow-mo swing clip from YouTube and dropped it into my working directory at:

data/raw/sample_swing.mp4

To load it up with OpenCV:

import cv2

cap = cv2.VideoCapture('data/raw/sample_swing.mp4')

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    cv2.imshow('Raw Frame', frame)
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Simple, direct, and a great sanity check that your video pipeline is working.

🧍🏾‍♀️ Step 2: Pose Estimation with MediaPipe

Next, I brought in MediaPipe’s Pose model to identify 33 body landmarks—from head to toe.

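import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
# Create the model once, outside the frame loop
pose = mp_pose.Pose()

# Inside the Step 1 loop: OpenCV reads frames as BGR, MediaPipe expects RGB
results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))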

Once the landmarks were detected, I used MediaPipe’s drawing tools to overlay the skeletal structure onto each frame:

mp_drawing = mp.solutions.drawing_utils

if results.pose_landmarks:
    mp_drawing.draw_landmarks(
        frame,
        results.pose_landmarks,
        mp_pose.POSE_CONNECTIONS)

It’s genuinely cool seeing the body mapped out mid-swing—limbs aligning, torso rotating, power coiled in motion.

👀 Step 3: Real-Time Motion Visuals

Displaying that annotated frame was easy with OpenCV. Inside the same loop, the Step 1 imshow call becomes:

cv2.imshow('Swing Pose Frame', frame)

But what it revealed wasn’t just working code—it was insight.
I could see the motion patterns emerge: shoulder tilt, hand lag, torso compression. The feedback engine we’ll build later? This is where it begins.

✨ Visualization = validation.
Before we start calculating metrics, we need to trust the model's eye.

🔄 Step 4: Prepping for Data Extraction

Next, I’ll be extracting the landmark coordinates from each frame—x, y, z, and visibility—for all 33 points. This is where we transition from visuals to structured, analyzable data.

We’ll format this into:

A Pandas DataFrame
One row per frame, with x, y, z, and visibility columns for each joint
Ready to calculate angles, offsets, tempo patterns, and movement phases
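
As a rough sketch of that row layout (not the final SwingSense extractor), the helper below flattens one frame's landmarks into a single dictionary. The function name landmarks_to_row, the column naming scheme, and the stand-in Landmark type are assumptions made for illustration; in the real pipeline the input would be results.pose_landmarks.landmark from Step 2, whose items expose x, y, z, and visibility.

```python
from collections import namedtuple

# Stand-in for a MediaPipe landmark so the sketch runs without a video;
# real landmarks expose the same four attributes.
Landmark = namedtuple("Landmark", ["x", "y", "z", "visibility"])

FIELDS = ("x", "y", "z", "visibility")


def landmarks_to_row(frame_index, landmarks):
    """Flatten one frame's landmarks into a dict: one key per joint and field."""
    row = {"frame": frame_index}
    for joint_id, lm in enumerate(landmarks):
        for field in FIELDS:
            # Hypothetical column naming: joint index plus field, e.g. "j11_x"
            row[f"j{joint_id}_{field}"] = getattr(lm, field)
    return row


# Fake frame with 33 identical landmarks, just to show the shape of a row
fake_frame = [Landmark(0.5, 0.5, 0.0, 0.99)] * 33
row = landmarks_to_row(0, fake_frame)
print(len(row))  # 1 frame column + 33 joints * 4 fields = 133
```

A list of such rows, one per frame, can be handed to pandas.DataFrame(rows) to get the table described above.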

🧠 Why This Matters

Golf swings aren’t just art—they’re math, balance, timing, biomechanics.
By mapping joints, we capture the hidden structure of performance.

And by starting with pose estimation, we’re already creating a foundation where:

Movement becomes measurable
Feedback becomes automated
And improvement becomes data-driven

📍Up Next in Part 3:

Export pose landmark data for each frame
Calculate joint angles (like spine tilt, wrist lag, etc.)
Define a few “good vs improvable” swing metrics
Start building our training data for the feedback model
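
For the joint-angle item, one common approach (an assumption here, since Part 3 is not written yet) is to measure the angle at a middle joint between the two segments that meet there. The sketch below works on 2D points; joint_angle and the sample coordinates are hypothetical.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by segments b->a and b->c."""
    # Direction of each segment, measured from the middle joint
    ang_ba = math.atan2(a[1] - b[1], a[0] - b[0])
    ang_bc = math.atan2(c[1] - b[1], c[0] - b[0])
    angle = abs(math.degrees(ang_ba - ang_bc))
    # Fold reflex angles back into the 0-180 range
    return 360.0 - angle if angle > 180.0 else angle


# Shoulder, elbow, wrist as (x, y) pairs in normalized image coordinates
shoulder, elbow, wrist = (0.5, 0.2), (0.5, 0.5), (0.8, 0.5)
elbow_angle = joint_angle(shoulder, elbow, wrist)
print(round(elbow_angle, 1))  # 90.0
```

The same function could be applied to hip, shoulder, and wrist triples once the landmark indices for those joints are chosen.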

💬 Your Turn

Would you use pose estimation outside sports—like for dance, yoga, rehab, or even gesture control?

If you’re working on something similar, tag me—I’d love to learn from you, collaborate, or share notes.

SwingSense is just getting started.
But already, this journey feels like more than a project. It feels like insight in motion.

Until next time,
Keep swinging with intention.

Starting a new project.

Lydiah Wanjiru — Tue, 01 Apr 2025 08:52:37 +0000

Lydiah Wanjiru for AWS Community Builders, Apr 1 '25: SwingSense: How I’m Blending AI, Computer Vision, and Golf to Decode the Perfect Swing (3 min read)

SwingSense: How I’m Blending AI, Computer Vision, and Golf to Decode the Perfect Swing

Lydiah Wanjiru — Tue, 01 Apr 2025 08:52:07 +0000

"Elegance is when precision meets purpose." — My mantra for both golf and code.

Hi, I’m Lydiah—a systems thinker, tech lover, and woman on a mission to build intelligent tools that make elite performance accessible to anyone.

This year, I’m launching a bold new project called SwingSense, where I’ll apply my skills in machine learning, computer vision, and data science to one of the most elegant yet data-rich sports on earth: golf.

In this post (the first in a build-in-public series), I’ll introduce you to the concept, motivation, and the tech backbone powering this portfolio project. And trust me—it’s not just about swings and stats. It’s about changing how we learn mastery, forever.

🎯 The Vision: Golf Coaching Reinvented with AI

Golf is a game of inches, intuition, and iteration. Yet for most aspiring golfers, quality coaching is expensive, subjective, and geographically limited.

What if I could change that?

SwingSense is my attempt to:

Analyze golf swings through pose estimation and kinematic data
Compare those movements to elite professionals
Offer real-time, explainable, and personalized feedback
Track a user’s improvement over time using machine learning

We’re not just building a tool. We’re architecting an experience—where feedback feels like a personal caddie whispering insights right in your ear.

🧠 The Tech Stack: Marrying AI with Sports Science

To bring this vision to life, I’m engineering an end-to-end system with the following components:

Layer	Tools & Ideas
🎥 Pose Estimation	MediaPipe, OpenCV, Skeleton Tracking
🧠 Model Logic	TensorFlow or PyTorch for classification + metrics
📊 Data Engineering	Custom pipelines to extract & label joint movement
🎛️ App Layer	Streamlit for quick UI (MVP), then React/Flask for production
☁️ Infrastructure	GitHub, Docker, AWS S3 for scaling later
📈 Insights & Feedback	Angle calculations, comparison heatmaps, audio/text feedback

In the early stages, I’ll work with public golf swing datasets and simulated motion data. But the long-term dream? Incorporating real-world video submissions from everyday users—closing the loop between model training and actual impact.

🔍 Why This Project Matters (and Why Now)

This isn’t just about golf—it’s about democratizing expert feedback using tech.

Imagine a high-school athlete in Nairobi, a retired golfer in Atlanta, or a beginner mom in Lagos… all receiving world-class coaching through AI. No private instructor. No expensive sensors. Just a smartphone and a personalized feedback loop.

For me, SwingSense is proof that:

ML and data engineering can solve meaningful real-world problems
Sports and science don’t have to be separate lanes
And that a Black African woman can architect next-gen tools on a global scale.

🗺️ What You’ll Learn in This Series

If you’re following this series, expect a guided journey through:

🎯 Pose detection & human movement modeling
🔍 Golf swing feature extraction & biomechanics mapping
🤖 Training an ML model to assess swing quality
🛠️ Deploying a feedback engine via Streamlit
📦 Lessons on building, breaking, and refining as we go

It’ll be real, raw, and rigorously documented—with code snippets, visuals, personal insights, and aha moments along the way.

👋🏾 Let’s Connect (and collaborate)

Whether you:

Love golf and want to explore the tech side,
Are building your own AI/ML project and need a sister-in-code,
Or you’re just curious how data science meets sports elegance—

…I’d love to hear your thoughts, questions, or even co-create something magical. 💬

🔮 Up Next: Pose Estimation in Action

In the next post, we’ll step into the swing—quite literally—by:

Loading our first swing video
Extracting joint landmarks using MediaPipe
Visualizing motion frames in real time
And preparing our data pipeline for training

So, grab your favorite club (or keyboard) and join me on this beautiful blend of logic, movement, and machine learning.

Until then, keep it precise. Keep it playful. And always… swing with purpose.

#AI #MachineLearning #ComputerVision #Golf #BuildInPublic #DataScience #PoseEstimation #Streamlit #WomenInTech #SportsTech #MLOps

[Boost]

Lydiah Wanjiru — Tue, 25 Feb 2025 14:01:13 +0000

Beyond 99.99% Uptime: Engineering High Availability Like a Pro 🚀

Lydiah Wanjiru ・ Feb 25

Beyond 99.99% Uptime: Engineering High Availability Like a Pro 🚀

Lydiah Wanjiru — Tue, 25 Feb 2025 13:47:17 +0000

"High Availability is not about avoiding failures; it’s about embracing them intelligently."
The industry often touts the 99.99% uptime promise, but real-world HA engineering transcends Service Level Agreements (SLAs). It's about ensuring that even when failures occur, your system remains operational without impacting end-users.

Drawing from experiences with large-scale HA architectures, including an active-active setup validating services for 70 million users, one key takeaway emerges: Downtime is not an accident; it’s an oversight.

Here’s an in-depth exploration of how real HA operates at scale and how AIOps is redefining availability. 👇

1️⃣ The HA Maturity Model: Where Do You Stand?

Before diving into advanced architectures, assess your system's current position on the HA Maturity Scale:

🔴 Level 1: Basic HA → Standby backup servers, slow manual failover, minimal automation.
🟡 Level 2: Intermediate HA → Load balancing, active-passive clusters, automated failover.
🟢 Level 3: Advanced HA → Active-active multi-region deployments, self-healing infrastructure, zero downtime deployments.
🔵 Level 4: AI-Driven HA → Predictive auto-scaling, anomaly detection, AIOps-driven remediation.
Operating at Level 1 or 2 leaves your system vulnerable to unforeseen failures.

2️⃣ Real-World HA Failures: Lessons Learned

📉 CASE STUDY 1: Netflix’s Chaos Engineering

Challenge: Serving over 250 million users globally, Netflix operates on AWS cloud infrastructure where downtime is unacceptable.

Approach:

Chaos Monkey: A tool that randomly terminates services in production to test resilience.
Active-Active Architectures: Deployments across multiple AWS regions to prevent regional outages.
Circuit Breakers (Hystrix): Manage partial failures without affecting all services.
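
To make the circuit-breaker idea concrete, here is a minimal Python sketch of the pattern. It is not Hystrix (a Java library) and not Netflix's code; the class name, thresholds, and timing values are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency and return the fallback
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # cool-down over: allow a trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result


def flaky():
    # Hypothetical dependency that is currently failing
    raise ConnectionError("recommendations service is down")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
responses = [breaker.call(flaky, lambda: "cached") for _ in range(3)]
print(responses, breaker.opened_at is not None)
```

After the second failure the breaker opens, so the third call returns the fallback without touching the failing service at all.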

Key Takeaway: Incorporating failure simulation into your HA strategy is crucial. Design HA proactively, not reactively.

✈️ CASE STUDY 2: Airline Booking System Outage

Challenge: In 2019, a global airline's reservation system experienced a major outage, grounding thousands of flights due to a single database failure.

Issues Identified:

Single Point of Failure (SPOF): A solitary database led to cascading failures.
Lack of Multi-Region Failover: All traffic was directed to a single data center.
Insufficient Real-World HA Testing: HA testing did not reflect actual traffic conditions.

Preventative Measures:

Geo-Redundancy: Replicate systems across multiple AWS regions.
Blue-Green Deployment: Run two identical environments and switch traffic between them, so releases do not disturb live users.
AIOps Monitoring: Utilize AI-based anomaly detection to predict issues before they escalate.

Key Takeaway: Testing HA under real-world conditions is essential to prevent operational disruptions.

3️⃣ Architecting HA Excellence with AWS

To design an enterprise-grade HA system capable of handling millions of requests seamlessly, consider the following AWS-centric strategies:

🔥 1. Active-Active Multi-Region Deployments

Implementation:

AWS Global Accelerator: Directs user traffic to optimal endpoints across multiple AWS regions, enhancing availability and performance.
Amazon Route 53: Employ latency-based routing to distribute traffic efficiently.
Example: In a recent deployment for a major telecommunications company, an active-active setup was configured with load balancers fronting geo-distributed API clusters. This ensured seamless traffic redirection even if an entire data center failed.

🔥 2. Stateless and Self-Healing Services

Implementation:

Amazon Elastic Kubernetes Service (EKS): Manages containerized applications with self-healing capabilities.
Amazon ElastiCache: Externalizes session data, enabling stateless service operations.
Example: Netflix's Chaos Engineering practices involve intentionally terminating services to validate auto-recovery mechanisms before real failures occur (source: netflixtechblog.medium.com).

🔥 3. AI-Powered Observability (AIOps)

Implementation:

Amazon CloudWatch: Monitors applications and infrastructure in real-time.
AWS DevOps Guru: Leverages machine learning to identify operational issues and recommend remediation.
Example: Integrating AI-based anomaly detection reduced incident resolution time by 45% by predicting database bottlenecks before they led to system slowdowns.

4️⃣ The AIOps Revolution: Transforming HA

Challenge: Traditional HA monitoring is reactive, addressing issues post-occurrence.

Solution: AIOps transitions HA to a proactive stance, predicting and mitigating failures before they impact operations.

Enhancements:

Predictive Scaling: Machine learning models adjust capacity ahead of traffic surges.
Anomaly Detection: AI identifies deviations from normal patterns automatically.
Automated Incident Response: AI-driven runbooks resolve issues without human intervention.
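
As one simple illustration of anomaly detection (real AIOps products use far richer models), the sketch below flags a metric sample that sits several standard deviations away from recent history. The function name, the 3-sigma threshold, and the latency numbers are all assumptions.

```python
from statistics import mean, stdev


def is_anomaly(history, value, threshold=3.0):
    """Return True if value is more than `threshold` std devs from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical p95 latencies in milliseconds from recent one-minute windows
recent_latency_ms = [120, 118, 125, 122, 119, 121, 124, 120]
print(is_anomaly(recent_latency_ms, 123))  # within normal variation
print(is_anomaly(recent_latency_ms, 310))  # far outside recent history
```

A check like this could feed an alert or an automated runbook; tuning the threshold is what separates useful signals from alert fatigue.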

Example: In a system processing billions of transactions, AI-based alerting reduced alert fatigue by 60% and improved uptime.

5️⃣ The Future of High Availability: What's Next?

As cloud computing, AI, and edge technologies evolve, High Availability (HA) strategies must adapt to maintain resilience at scale. The next generation of HA will go beyond traditional architectures, integrating self-healing systems, zero-downtime deployments, and predictive AI-driven failovers.

🚀 1. Zero Downtime Architectures
Traditionally, HA systems relied on multi-zone failover strategies, but the future lies in continuous availability with zero service disruption.

Emerging Technologies Driving This Trend:

Amazon Aurora Global Database: Enables low-latency reads across AWS regions with near-instant failover.
AWS Lambda + DynamoDB Streams: Eliminates downtime for serverless applications by ensuring continuous event processing.
Multi-Cloud Failover: Companies are increasingly adopting multi-cloud redundancy (AWS, GCP, Azure) to mitigate cloud-specific outages.
📌 Example: Uber built Failover Groups across AWS and Google Cloud to dynamically route traffic based on system health.

🤖 2. Self-Healing Infrastructure
The next stage of HA will fully automate failure resolution, eliminating the need for manual intervention in outages.

🔹 Key Features of Self-Healing HA Systems:
✅ Proactive Incident Resolution – AI-driven tools detect failures before users notice them.
✅ Automated Workload Shifting – Kubernetes, EKS, and Fargate auto-move workloads to healthy nodes.
✅ Predictive Auto-Scaling – ML algorithms adjust compute power based on real-time demand.

📌 Example: Netflix’s self-healing HA pipeline proactively replaces failing microservices using Chaos Monkey & AWS Auto Scaling.

🌎 3. Edge Computing & HA at Scale
HA is moving beyond centralized cloud data centers and pushing computing closer to users at the edge.

🔹 Why This Matters:

Lower Latency: Processing user requests closer to the source improves performance.
Distributed Resilience: Outages in one region don’t affect the entire system.
5G-Optimized HA: Next-gen networks will reduce failure points by routing traffic dynamically.

📌 Example: Amazon CloudFront & AWS Wavelength optimize HA for edge computing by dynamically caching content closer to end-users.

📊 Performance Benchmarks: HA in Action

Let's look at how different HA strategies impact uptime and downtime.

HA vs Downtime Chart
This visualization highlights the annual downtime for various HA configurations.

🔹 Uptime & Downtime Relationship:

Uptime (%)	Downtime per Year	Solution Required
99.9%	8.76 hours	Basic failover
99.95%	4.38 hours	Multi-AZ active-passive
99.99%	52.6 minutes	Active-active, DB failover
99.999%	5.3 minutes	Self-healing, auto-scaling
100% (aspirational)	0 minutes	AI-driven AIOps, predictive failover

📌 Key Takeaway: 99.999%+ uptime requires AI-driven failure prediction and self-healing infrastructure.
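
The downtime figures in the table come from simple arithmetic: the unavailable fraction multiplied by the length of a year. A small sketch, assuming a 365-day year (the function name is mine):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.95, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the engineering effort climbs so steeply between rows.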

Final Thoughts: Mastering HA for the Future

The landscape of High Availability is evolving rapidly. If your HA strategy still relies on traditional failover techniques, you risk falling behind.

🔹 Key Takeaways:
✅ Move beyond basic redundancy—adopt self-healing, AI-driven HA.
✅ Predict failures instead of just reacting—use AIOps & anomaly detection.
✅ Leverage multi-cloud & edge computing to create truly global, resilient systems.

💡 Where does your system stand on the HA Maturity Scale? Let’s discuss in the comments below! 👇

📌 Follow me for more deep dives into AIOps, DevOps, and System Design! 🚀

Scaling Up: Overcoming Doubt, Finding My Fit, and Pushing the Limits in 2024

Lydiah Wanjiru — Fri, 31 Jan 2025 18:20:13 +0000

This is a submission for the 2025 New Year Writing challenge: Retro’ing and Debugging 2024.

The Weight of Being First

Being the first in my generation to break into tech has been both exhilarating and daunting. There’s no blueprint, no guidebook—just an uncharted path full of unknowns. Throughout 2024, I often found myself questioning where I fit in, feeling the pressure of not just succeeding for myself, but for those who will come after me.

The year tested my resilience, pushing me through moments of imposter syndrome, career uncertainty, and self-doubt. Yet, through each challenge, I found lessons that shaped me, moments that defined me, and opportunities that expanded my vision for what’s possible.

"Nobody can teach me who I am." — Chinua Achebe

Nevada & The U.S. Experience: A Shift in Perspective

One of the most transformative moments of 2024 was being selected as an AWS re:Invent grant winner, which gave me a fully paid opportunity to attend the AWS re:Invent conference in Nevada in December. It was more than just a trip—it was a perspective shift.

Stepping into an entirely new environment, surrounded by professionals who had been in tech spaces for years, I was both inspired and intimidated. The experience challenged my understanding of global tech ecosystems, networking, and personal ambition. It made me realize how much I belonged in these spaces—not as an outsider looking in, but as someone who could thrive and lead.

Being in Nevada also opened my eyes to the power of proximity—seeing first-hand what was achievable when resources, community, and ambition aligned. It wasn’t just about gaining technical skills; it was about realizing that the barriers I had once seen were more mental than real.

"The world is like a mask dancing. If you want to see it well, you do not stand in one place." — Chinua Achebe

The CBS Upgrade & High Availability Hustle

In 2024, I played a pivotal role in platform engineering, focusing heavily on High Availability (HA) across network, hardware, application, and database levels. One of my most critical projects was upgrading the Core Billing System (CBS) from R20 to R23, ensuring seamless service for 70 million subscribers.

Ensuring high availability while managing failovers, switchovers, and backup strategies was a balancing act—a true test of precision, teamwork, and problem-solving. The stress was real, the stakes were high, but so was the growth.

Through this project, I learned the art of strategic planning, performance testing, and maintaining system integrity under pressure. More importantly, I proved to myself that I could handle mission-critical projects at scale and that I had a strong inclination for architecting resilient systems.

"Confidence is a choice. Arrogance is a mistake." — Alison Fragale

Imposter Syndrome, Teamwork & Finding My Place

Despite these achievements, there were days I felt behind—as if I was racing against an invisible clock. I’d scroll through LinkedIn, seeing others seemingly leapfrogging ahead, and I’d wonder: Am I really where I should be?

But 2024 taught me that success isn’t linear. It’s not about reaching a destination first; it’s about staying on course. I reminded myself that being first in my family meant paving a path, not rushing to the finish line.

One of the biggest lessons I learned was the importance of a strong support system—whether it's colleagues, mentors, or loved ones. Numbers don't matter as much as having the right people in your corner, cheering you on, pushing you forward, and reminding you that you belong.

This year also introduced me to working with a multi-racial, multi-cultural team, which pushed me to expand my interpersonal and professional skills. The experience reshaped how I want to approach teamwork, leadership, and collaboration in my career and personal life.

"When you lack confidence, borrow some from the people who believe in you." — Alison Fragale

Lessons in Growth & Pushing Limits

2024 was a masterclass in resilience, leadership, and self-belief. Key lessons I’m taking forward:

You belong in the rooms you walk into. Never shrink yourself.
Imposter syndrome is a sign of growth, not inadequacy.
Being first means there’s no roadmap—but it also means you get to build your own.
Scaling up isn’t just technical; it’s personal. Growth is about expanding your mindset as much as your skillset.
Diversity in teams brings strength, adaptability, and innovation. Learning to navigate different cultures and perspectives makes you a stronger professional and a better human.

"Until the lions have their own historians, the history of the hunt will always glorify the hunter." — Chinua Achebe

What’s Next: Compiling 2025

If 2024 was about proving I could hold my own, 2025 is about expanding my reach. My roadmap for the year is ambitious, but so am I:

Mastering ML & DevOps, positioning myself as a thought leader in High Availability and AI infrastructure.
Launching my pajama business, curating luxury comfort for African women.
Securing an 800K+ KSH role that aligns with my expertise and passion for scalable, reliable systems.
Becoming a mentor, helping other first-generation tech professionals navigate their journeys.
Applying my multi-cultural teamwork experience to build stronger, more inclusive tech environments.

2024 challenged me, shaped me, and reminded me that I am more than enough. In 2025, I’m not just keeping up—I’m taking the lead.

To every first-generation woman in tech feeling behind: Your journey is valid. Your progress is real. And your story is still being written. Keep pushing. Keep scaling. You’re not just part of the industry—you’re shaping it.

Mastering Gray Release for High Availability

Lydiah Wanjiru — Thu, 30 Jan 2025 17:10:17 +0000

Introduction

In the world of high-availability (HA) systems, rolling out upgrades without causing disruptions is a fine art. One wrong move, and millions of users experience downtime. Enter Gray Release—a deployment strategy that allows gradual, controlled rollouts while mitigating risks.

I recently worked on upgrading a Core Billing System (CBS) serving 70M+ users, where we had to ensure zero downtime. Gray Release was a game-changer, but it also came with challenges that required innovative solutions. Here’s how it works, the hurdles we faced, and how we could improve future rollouts.

What is Gray Release?

A Gray Release is a phased rollout approach where a new version is gradually introduced to a subset of users, allowing real-time validation before full deployment. Unlike a Blue-Green switch, which moves all traffic at once, Gray Release moves users in controlled, incremental waves, much like a canary deployment extended over several stages, reducing impact on production.

💡 Key Benefit: If an issue arises, it affects only a small fraction of users, making rollbacks easy and impact minimal.

How We Implemented Gray Release

1️⃣ Segmenting Users for Controlled Rollout
We started by selecting a small, low-risk user group, such as internal testers or VIP customers.
Based on feedback and system stability, we expanded the rollout gradually.

2️⃣ Real-Time Monitoring with Huawei Digital View
Huawei Digital View provided a single-pane-of-glass monitoring system for:

Transaction success rates
Latency spikes
Resource utilization (CPU, memory, database performance)
Anomaly detection using AI-driven insights

We configured automated alerts to detect failures before they escalated.

3️⃣ Rollback Mechanisms for Safety
Feature flags were used to toggle new features on/off instantly.
We established an automated rollback system to revert to the stable version if error rates crossed a defined threshold.

4️⃣ Controlled Expansion for Stability
Once the first batch of users showed no anomalies, we expanded to 20%, then 50%, then full deployment.
Every phase was validated with real-time monitoring from Huawei Digital View before moving forward.
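
The staged expansion and threshold-based rollback described above can be sketched roughly as follows. This is not the CBS implementation: the hashing scheme, the 5% first wave, the 1% error threshold, and all function names are assumptions for illustration.

```python
import hashlib

ROLLOUT_STAGES = (5, 20, 50, 100)  # percent of users; the 5% first wave is assumed
ERROR_RATE_THRESHOLD = 0.01        # hypothetical rollback trigger (1% errors)


def user_bucket(user_id):
    """Map a user ID to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def sees_new_version(user_id, rollout_percent):
    """A user is in the rollout when their bucket is below the percentage."""
    return user_bucket(user_id) < rollout_percent


def next_action(stage_index, errors, requests):
    """Decide whether to roll back, expand to the next stage, or finish."""
    error_rate = errors / requests if requests else 0.0
    if error_rate > ERROR_RATE_THRESHOLD:
        return "rollback"
    if stage_index + 1 < len(ROLLOUT_STAGES):
        return f"expand to {ROLLOUT_STAGES[stage_index + 1]}%"
    return "rollout complete"


print(next_action(0, errors=2, requests=10_000))    # healthy first wave
print(next_action(1, errors=300, requests=10_000))  # 3% errors at the 20% stage
```

Because the bucket is derived from the user ID, a user who received the new version at 20% keeps it at 50%, which keeps the experience consistent between waves.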

Why Use Gray Release?

✅ Zero Downtime: No need for service interruptions.
✅ Risk Mitigation: Bugs are caught early before reaching all users.
✅ Better User Experience: Issues are fixed proactively rather than reactively.
✅ Cost Efficiency: Avoids full-scale rollbacks that can be expensive.

Challenges of Implementing Gray Release

Even with a solid plan, Gray Release had its pain points. Here’s where we faced challenges and how we overcame them:

Managing Feature Flags Becomes Cumbersome
💀 Problem: With multiple release phases, managing feature flags for different user groups became complex.
🔧 Solution: We introduced a feature flag governance strategy, ensuring that flags had a clear lifecycle (creation, validation, removal).

Unpredictable User Behavior in Early Segments
💀 Problem: Some users in early release segments didn’t use the new version as expected, making it hard to validate real-world performance.
🔧 Solution: We used synthetic traffic simulations to mimic real user interactions before expanding the release.

Rollback Challenges with Database Changes
💀 Problem: Database schema changes are often irreversible, making rollbacks tricky.
🔧 Solution: We implemented dual-write strategies and used Huawei Cloud’s database versioning tools to allow safe reversions.

Performance Overhead on Huawei Digital View
💀 Problem: Continuous monitoring across multiple release phases added heavy load on observability tools.
🔧 Solution: We optimized monitoring by:
Using sampling-based logging instead of full-trace logging.
Prioritizing high-impact metrics instead of tracking everything.

Delayed Detection of Critical Bugs
💀 Problem: Since Gray Release is gradual, some bugs only appeared later in the rollout, delaying response time.
🔧 Solution: We enhanced anomaly detection in Huawei Digital View, configuring alerts for even small performance deviations.

Final Thoughts

Gray Release isn’t just a deployment strategy—it’s a resilience strategy. In a Core Billing System, where even a small miscalculation can impact revenue, a controlled rollout is the only way to guarantee stability.

By continuously refining our approach, we can balance innovation with accuracy, ensuring that billing remains transparent, efficient, and error-free for millions of users.

💬 Have you faced challenges implementing Gray Release in a billing system? What solutions worked for you? Let’s discuss in the comments! 🚀

What’s Next? Bringing Gray Release to the Cloud

While this post focused on Gray Release for on-prem systems with Huawei, what happens when you need to implement the same strategy in the cloud?

🟠 How do you execute Gray Release on AWS or other cloud platforms?
🟠 What cloud-native tools help automate rollouts and rollback strategies?
🟠 How do services like AWS CodeDeploy, Lambda, and Canary Deployments fit into the equation?

🚀 Stay tuned for my next post on "How to Implement Gray Release on AWS (or Any Cloud)"—where I’ll break down step-by-step strategies for executing seamless cloud deployments.

#CloudComputing #AWS #Azure #GoogleCloud #GrayRelease #DevOps #SystemDesign #ZeroDowntime

Building a Resilient High Availability System with Kubernetes on AWS

Lydiah Wanjiru — Tue, 28 Jan 2025 14:54:13 +0000

High availability (HA) is essential for modern applications, particularly in industries like finance, where downtime can result in massive financial losses and regulatory issues. Financial institutions, for example, target 99.99999% availability—just 3.15 seconds of downtime per year—to ensure that services are always up and running. Achieving this level of uptime is critical for handling sensitive transactions and maintaining customer trust.

Kubernetes on AWS offers a powerful solution for building highly available, scalable, and fault-tolerant systems. In this post, we’ll guide you through setting up a high availability system using Kubernetes on AWS, ensuring that your applications can meet the demanding uptime requirements of industries like finance, e-commerce, and more.

Why Use Kubernetes for High Availability?
Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides critical features for building high availability systems:

Autoscaling: K8s can automatically scale your applications up or down based on demand.
Self-healing: K8s automatically replaces containers that fail, ensuring minimal downtime.
Load balancing: K8s distributes network traffic across containers, improving reliability and performance.
Rolling updates: K8s enables seamless deployment of new application versions without downtime.

When paired with AWS, Kubernetes can be deployed to take full advantage of the cloud’s scalability and infrastructure flexibility, creating a truly high-availability environment.

Building HA with Kubernetes on AWS relies on a few essential components:

Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes service that simplifies the setup, management, and scaling of K8s clusters.
Elastic Load Balancing (ELB), which distributes incoming application traffic across multiple EC2 instances for HA.
Amazon RDS (Relational Database Service), a fully managed relational database that can fail over automatically and scale to handle traffic spikes.
Auto Scaling groups (ASG), which automatically scale EC2 instances based on load so the infrastructure grows as traffic increases.

Step 1: Setting Up Your AWS Infrastructure

Create an EKS Cluster

Sign into the AWS Console and go to EKS.
Click Create Cluster, choose your region, and give it a name.
Select the Kubernetes version and choose a VPC with public and private subnets.
Set up IAM roles to allow EKS to interact with other AWS services like EC2 and RDS.
Once the cluster is created, configure your Kubernetes CLI (kubectl) to connect to it:

aws eks --region <your-region> update-kubeconfig --name <your-cluster-name>

Set Up RDS for HA

Go to RDS and select Create Database.
Choose a database engine (MySQL/PostgreSQL) and enable Multi-AZ for high availability.
Place the database in private subnets for security.
Set up security groups to allow communication between your EKS pods and RDS.

Step 2: Deploy Your Application
Create a deployment.yaml for your app, defining 3 replicas for redundancy and a LoadBalancer service for traffic distribution.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
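      containers:
        - name: my-app-container
          image: <your-image>
          ports:
            - containerPort: 80
          # A CPU request is needed for the CPU-utilization HPA in Step 3;
          # 250m is a placeholder value, tune it for your workload
          resources:
            requests:
              cpu: 250m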
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer

Apply the file using kubectl apply -f deployment.yaml.

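### Step 3: Auto Scaling & Load Balancing

Horizontal Pod Autoscaler (HPA)

Create an HPA to scale your app based on CPU usage. Utilization targets are measured against each container's CPU request, so the Deployment needs resources.requests.cpu set, and the Kubernetes Metrics Server must be running in the cluster: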

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

Apply the HPA with kubectl apply -f hpa.yaml.
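To get a feel for how the HPA above will react, here is a minimal Python sketch of the replica calculation described in the Kubernetes documentation: desired = ceil(current * currentMetric / targetMetric). The function name is my own, and the defaults simply mirror the minReplicas, maxReplicas, and 50% target from the manifest. The real controller also accounts for pod readiness and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=50,
                     min_replicas=3, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current * currentMetric / targetMetric)."""
    ratio = current_utilization / target_utilization
    # The controller leaves the replica count alone when the ratio is
    # close enough to 1.0 (the default tolerance is 0.1).
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA object.
    return max(min_replicas, min(max_replicas, desired))


# Three pods averaging 100% CPU against a 50% target: the HPA asks for six.
print(desired_replicas(3, 100))
```

With the bounds above, a sustained spike can never push the Deployment past 10 pods, and a quiet period never drops it below the 3 replicas needed for redundancy.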

Elastic Load Balancer

When you set type: LoadBalancer in your Service YAML, AWS automatically provisions an ELB to distribute traffic across pods.

### Step 4: Monitoring & Health Checks

CloudWatch Monitoring: Set up CloudWatch for logs and metrics to monitor your EKS cluster, EC2, and RDS.
Pod Health Checks: Add readiness and liveness probes in your deployment.yaml to check if pods are healthy:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 5

Alerts: Use CloudWatch Alarms to notify you if any pod or service fails.
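
The readiness probe above assumes the container actually serves /health. As a rough illustration, here is a minimal sketch of such an endpoint using only Python's standard library. The names health_response, HealthHandler, and serve are hypothetical, and the dependencies_ok flag stands in for whatever checks your app needs (database connectivity, for example). Kubernetes treats any HTTP status from 200 up to, but not including, 400 as a passing probe.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path, dependencies_ok=True):
    """Return (status_code, body) for a request path; only /health is handled."""
    if path != "/health":
        return 404, {"error": "not found"}
    if dependencies_ok:
        return 200, {"status": "ok"}
    # A 503 makes the probe fail, so the pod is taken out of rotation.
    return 503, {"status": "unavailable"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the server; pass port=80 to match containerPort in deployment.yaml."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the decision logic in a plain function makes it easy to unit test without starting a server.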

By following these steps, you'll create a high-availability system built for resilience and minimal downtime for your applications.

Ensuring High Availability in Microservices: A Failover Testing Journey with AWS

Lydiah Wanjiru — Mon, 05 Feb 2024 14:25:02 +0000

A company's transition to a microservice architecture necessitates careful planning, particularly with regard to security and high availability. My approach involves rigorous failover testing procedures coupled with robust security implementations. This article delves into my methodology, emphasizing the maintenance of high availability.

*Introduction:*
For firms, switching to a microservices architecture offers scalability, flexibility, and resilience. However, maintaining high availability (HA) and ensuring robust security are critical. In this post, I will explain how I would approach failover testing procedures.

Performing high-availability failover testing is an essential part of managing a microservices architecture. It entails confirming that the system can continue to provide service even in the case of infrastructure problems or component failures. A thorough approach to failover testing is necessary to guarantee system reliability.

Several techniques are used to verify the effectiveness of failover systems. These include:

## Traffic Shifting and Load Balancing

Incoming traffic is divided among several instances or servers to prevent service delivery from being interrupted by a single point of failure. Load balancers are essential for spreading traffic evenly and rerouting requests when an instance fails.

Identifying single points of failure in the architecture is critical to the effectiveness of traffic shifting and load balancing. By identifying possible weak points, such as a lone server or instance whose failure would interrupt service, companies can put redundancy in place and adjust traffic distribution accordingly. This proactive strategy reduces the possibility of downtime and strengthens the system's overall resilience.

I would spread incoming traffic among several instances and locations using Amazon Route 53 and Elastic Load Balancing (ELB).
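
The rerouting behavior described above can be sketched in a few lines of Python. This is a toy model, not how ELB is implemented: the RoundRobinBalancer class and the target names are hypothetical, and health status is set by hand rather than by real health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer that skips targets marked unhealthy."""

    def __init__(self, targets):
        self.targets = list(targets)
        self.healthy = set(self.targets)
        self._ring = cycle(self.targets)

    def mark_unhealthy(self, target):
        self.healthy.discard(target)

    def mark_healthy(self, target):
        if target in self.targets:
            self.healthy.add(target)

    def next_target(self):
        if not self.healthy:
            raise RuntimeError("no healthy targets available")
        # One full pass over the ring is enough to reach a healthy target.
        for _ in range(len(self.targets)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy targets available")


# Hypothetical instances, one per Availability Zone.
lb = RoundRobinBalancer(["az-a/i-1", "az-b/i-2", "az-c/i-3"])
lb.mark_unhealthy("az-b/i-2")
print([lb.next_target() for _ in range(4)])
```

A failover test against a model like this, or against the real load balancer, boils down to the same question: after a target is marked unhealthy, does any request still land on it?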

## Auto Scaling and Elasticity

AWS Auto Scaling makes it easier to adjust resources dynamically in response to demand. To determine whether the system can scale horizontally and keep service available, failover tests simulate abrupt surges in traffic and instance terminations.

## Multi-AZ Deployments

Deploying microservices across different Availability Zones (AZs) is a fundamental way to improve fault tolerance and resilience in cloud-based architectures. Within an AWS Region, an Availability Zone consists of one or more discrete, physically isolated data centers, offering redundancy and isolation against localized failures.

The design gains intrinsic fault tolerance when microservices are spread across many AZs. If one AZ experiences an infrastructure failure (hardware failure, network problems, or a power outage, for example), the services hosted in the other AZs are unaffected and can still process incoming requests. Because of this redundancy, the system as a whole stays operational even when one of the AZs suffers an outage.

Failover testing is essential to confirming this architecture's efficacy. By deliberately disrupting one or more AZs, organizations can simulate real-world failures and evaluate resilience. Testing involves triggering automatic failover, such as rerouting traffic to healthy instances in other AZs, to maintain continuity despite the disruption.

Several aspects are assessed during testing:

Response Time: Measuring how long it takes the system to recognize a fault and divert traffic to operational instances in other AZs. Shorter response times mean less disruption for end users.
Data Consistency: Making sure that data integrity is preserved across distributed microservices during a failover. Data synchronization and replication mechanisms are essential for maintaining consistency and preventing data loss or corruption.
Load Balancing: Verifying that load balancers distribute traffic effectively across the available instances in unaffected AZs. Load balancers are essential for maximizing resource usage and maintaining performance during failover events.
Monitoring and Alerting: Putting strong monitoring and alerting systems in place to identify problems quickly and start failover processes. Proactive monitoring lets organizations detect problems before they become more serious and affect service delivery.
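
For the Response Time item, one way I might measure recovery is to inject a failure and then poll a health check until it passes again. The sketch below is an assumption-laden illustration: measure_outage, probe, and inject_failure are hypothetical names, probe is any callable returning True when the service answers, and the clock and sleep parameters exist only so the function can be tested without waiting.

```python
import time


def measure_outage(probe, inject_failure, timeout=60.0, interval=0.5,
                   clock=time.monotonic, sleep=time.sleep):
    """Inject a failure, then poll until the service fails and recovers.

    Returns the seconds between the first failed probe and the next
    successful one, or None if no outage was seen or the service did
    not recover within `timeout`.
    """
    inject_failure()
    start = clock()
    first_failure = None
    while clock() - start < timeout:
        now = clock()
        ok = probe()
        if not ok and first_failure is None:
            first_failure = now  # outage begins
        elif ok and first_failure is not None:
            return now - first_failure  # outage ends
        sleep(interval)
    return None
```

The polling interval bounds the precision of the result, so a 0.5 second interval cannot distinguish a 1.1 second outage from a 1.4 second one.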

Regular testing exposes weaknesses, optimizes procedures, and improves resilience. This proactive approach minimizes downtime risks and ensures continuous delivery despite infrastructure failures or unexpected events.

## Chaos Engineering

Although tools such as AWS Fault Injection Simulator may not be directly applicable, comparable principles can still be applied by intentionally introducing controlled failures and faults within a closed environment. This process might entail scripting failure scenarios and closely monitoring system responses to confirm resilience.
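
A scripted failure scenario can follow the usual chaos-experiment shape: check that the system is healthy, inject a fault, check again, and always roll back. Here is a minimal sketch against an in-memory model. Everything in it is hypothetical (run_experiment, make_az_outage, the cluster dictionary of healthy instances per AZ); in a real environment the inject and rollback callables would stop and restart actual resources.

```python
def run_experiment(name, steady_state, inject, rollback):
    """Verify steady state, inject a fault, re-verify, and always roll back."""
    if not steady_state():
        return {"name": name, "result": "skipped",
                "reason": "system not healthy before injection"}
    inject()
    try:
        survived = steady_state()
    finally:
        rollback()  # restore the system even if the check raises
    return {"name": name, "result": "passed" if survived else "failed"}


def make_az_outage(cluster, az):
    """Build inject/rollback callables that zero out one AZ in the model."""
    saved = {}

    def inject():
        saved[az] = cluster[az]
        cluster[az] = 0

    def rollback():
        cluster[az] = saved.pop(az)

    return inject, rollback


# Hypothetical model: healthy instance count per Availability Zone.
cluster = {"az-a": 2, "az-b": 2, "az-c": 2}
inject, rollback = make_az_outage(cluster, "az-a")
report = run_experiment(
    "lose az-a",
    steady_state=lambda: sum(cluster.values()) >= 3,  # need 3 instances to serve
    inject=inject,
    rollback=rollback,
)
print(report)
```

Skipping the experiment when the system is already unhealthy avoids piling an injected fault on top of a real incident.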

In conclusion, meticulous failover testing is paramount for companies transitioning microservices to AWS. By embracing comprehensive methodologies, companies ensure high availability, resilience, and security in their microservices architecture, mitigating risks and enhancing operational excellence.