What if your service could mend itself: detect problems, diagnose the cause, and bounce back, all without human intervention? Picture it not as code on a server, but as a living, adaptive organism relentlessly pursuing uptime. An environment where, with automation, observability, and smart orchestration, failures are treated not as catastrophes but as minor, self-correcting events, proof points of your system’s resilience and strength. This is the core of self-healing.
In Kubernetes, this vision goes far beyond keeping containers running: true resilience means weaving security, adaptability, and rapid recovery into the DNA of your system.
With that vision in mind, here are the steps I will take to design a secure, resilient, and adaptive self-healing service on Kubernetes:
Setting up Automatic Scaling:
To conquer the twin challenges of unpredictable traffic spikes and resource inefficiency, we are implementing a 'Triple-Threat' Autoscaling strategy in Kubernetes.
In Kubernetes, adaptability is not an afterthought; it is built into the system’s heartbeat. Imagine a bustling marketplace where the number of stall-keepers changes with the size of the crowd. When the streets overflow with visitors, more stalls open instantly. When the crowd thins, stalls close just as quickly, saving energy and resources. This is the role of the Horizontal Pod Autoscaler (HPA). It listens to the pulse of the application (CPU, memory, or even custom business metrics) and scales the number of Pod replicas up or down in perfect rhythm with demand. To the user, the service never falters; to the operator, resources are never wasted.
However, scaling is not only about numbers; it is also about strength, because sometimes a shop does not need more hands, just stronger ones. That is where the Vertical Pod Autoscaler (VPA) steps in. Like a master trainer, it watches each Pod and adjusts its muscle (CPU and memory requests and limits) so it neither collapses under pressure nor wastes power when idle. With VPA, each Pod is always sized just right for the workload it shoulders.
Above them all stands the Cluster Autoscaler, the architect of the city itself. When the marketplace becomes so busy that there is no room for new stalls, the Cluster Autoscaler expands the city’s boundaries, adding new nodes to host the surge of activity. Later, when the crowd disperses and the city is underused, it reclaims empty streets, scaling the cluster back down to avoid waste.
In summary, HPA, VPA, and the Cluster Autoscaler form Kubernetes’ adaptive nervous system. They ensure that applications breathe with demand, scaling wide, growing strong, and expanding deep into infrastructure only when truly needed. It’s a living ecosystem where efficiency and availability coexist, making sure the system is always prepared, always resilient, and always right-sized for the moment.
Sample Pod autoscaling setup for a k8s deployment
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: auto-container
          image: docker-registry.com/app:013140b.30
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: test-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
Sample node Autoscaling for a k8s cluster
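The original post illustrated node autoscaling with a diagram. As a stand-in, here is a minimal sketch of a Cluster Autoscaler deployment on AWS; the node-group name, size range, and image tag are illustrative assumptions, not values from the setup above. On managed platforms (EKS, GKE, AKS), node autoscaling is usually enabled per node group or node pool instead of deploying this manifest by hand.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler        # needs RBAC to inspect nodes and pods
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # illustrative tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --nodes=2:10:my-node-group            # min:max:node-group-name (assumed)
            - --balance-similar-node-groups
            - --scale-down-unneeded-time=10m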
Automatic Restart and Load Balancing:
In Kubernetes, the intelligence of self-healing begins with continuous observation. Each Pod is equipped with Liveness and Readiness Probes, acting like vital signs for the application. The Liveness Probe is the heartbeat: if it fails, Kubernetes detects the deadlock or failure and restarts the container to restore its health. The Readiness Probe, on the other hand, is the vigilant gatekeeper. It ensures a Pod does not receive a single request until it is fully initialized and ready to serve, shielding clients from "half-awake" services.
When a container stumbles, whether it crashes, freezes, or simply refuses to respond, Kubernetes, guided by the Pod’s restartPolicy (Always or OnFailure), terminates the unhealthy container gracefully and replaces it with a new, functioning instance.
Meanwhile, the Load Balancer orchestrates traffic flow like a skilled conductor, dynamically steering requests only toward healthy Pods. By the time the new container is ready, the system has already ensured users experience uninterrupted service, as if the failure never happened.
Sample Automatic Restart
spec:
  restartPolicy: Always
  containers:
    - name: auto-container
      image: nginx:latest
      ports:
        - containerPort: 80
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
      livenessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3
Sample Load Balancing with HAProxy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: haproxy-ingress
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: haproxy-ingress
  template:
    metadata:
      labels:
        app: haproxy-ingress
    spec:
      containers:
        - name: haproxy-ingress
          image: haproxytech/kubernetes-ingress:latest
          ports:
            - containerPort: 80
            - containerPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: haproxy-service
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    app: haproxy-ingress
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
Load Balancing Methods Supported
In HAProxy, you can set the load-balancing algorithm in the backend configuration (see the Ingress annotation sketch after this list):
roundrobin → default, rotates requests evenly.
leastconn → sends traffic to the server with fewest active connections.
source → sticky sessions based on client IP.
uri / hdr(name) → balance by URI or header value.
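With the HAProxy Kubernetes ingress controller shown above, the algorithm is typically chosen through the haproxy.org/load-balance annotation, which the controller translates into the backend’s balance directive. A minimal sketch, assuming a hypothetical application Service named test-app on port 8000 and an illustrative hostname:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-app-ingress
  namespace: default
  annotations:
    haproxy.org/load-balance: "leastconn"   # or roundrobin, source, uri ...
spec:
  ingressClassName: haproxy
  rules:
    - host: app.example.com                 # assumed hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-app              # hypothetical backend Service
                port:
                  number: 8000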
Service types for HAProxy setup on k8s
ClusterIP (default): internal access only; pods in the cluster can reach HAProxy.
NodePort: exposes HAProxy on every node’s IP at a fixed port (e.g., 30080). Useful for bare-metal or dev/test clusters.
LoadBalancer: provisions a managed external load balancer that forwards traffic to the HAProxy pods.
Error Handling and Graceful Degradation:
When building self-healing services, it is pertinent to ensure that systems handle errors gracefully: offering fallback behavior such as serving cached data, retrying failed requests where appropriate, and continuing to function at reduced capacity. This resilience is layered:
App level → retries, caching, fallback logic, e.g., Redis
Service mesh → circuit breaking, rate limiting, traffic shaping, e.g., Istio, Kong API Gateway (see the DestinationRule sketch after this list)
Kubernetes → pod lifecycle management, probes, scaling, restart policies.
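As one concrete example of the service-mesh layer, here is a minimal sketch of circuit breaking with an Istio DestinationRule, assuming Istio is installed and the workloads sit behind a Service named ehgd-app; the thresholds are illustrative:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ehgd-app-circuit-breaker
  namespace: default
spec:
  host: ehgd-app.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # cap concurrent connections
      http:
        http1MaxPendingRequests: 50    # queue limit before requests are rejected
    outlierDetection:                  # eject misbehaving pods from the pool
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
At the Kubernetes level, the deployment below combines probes, a restart policy, and a graceful-shutdown hook: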
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ehgd-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ehgd-app
  template:
    metadata:
      labels:
        app: ehgd-app
    spec:
      restartPolicy: Always                 # always restart failed containers
      terminationGracePeriodSeconds: 30     # allow 30s for shutdown
      containers:
        - name: ehgd-container
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          # Health checks
          startupProbe:        # ensures slow apps don't get killed too early
            httpGet:
              path: /
              port: 80
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:       # restarts the container if unhealthy
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:      # removes the pod from load balancing if unready
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:           # graceful shutdown
            preStop:
              exec:
                command: ["/bin/sh", "-c", "nginx -s quit"]
Error Handling
If a container fails → Kubernetes restarts it.
If the / health check fails → the kubelet restarts the container.
If the readiness probe fails → the pod is removed from the Service endpoints, avoiding bad traffic routing.
Graceful Degradation
startupProbe: gives apps time to initialize before readiness/liveness checks.
terminationGracePeriodSeconds: 30 → gives the pod 30s to shut down gracefully.
preStop hook: ensures Nginx stops accepting new connections before SIGTERM kills it.
User Impact
If one pod fails → traffic shifts to healthy replicas automatically.
Users see reduced capacity, but not a total outage (graceful degradation).
Failover and Disaster Recovery:
In Kubernetes, resilience stretches beyond the boundaries of a single cluster; it extends into the fabric of availability zones, regions, and even hybrid infrastructures. A self-healing service is not just about restarting Pods or scaling workloads; it is about surviving entire datacenter failures without missing a beat. To achieve this, clusters are often deployed across multiple availability zones in either an active-active or active-passive setup. In an active-active design, workloads in different zones serve traffic simultaneously, so if one zone stumbles, the others carry the load without interruption. In an active-passive design, a secondary cluster stands ready in the background, quietly synced and waiting to take over if the primary falters.
Failover is orchestrated automatically: traffic is rerouted to healthy replicas through intelligent load balancing and DNS-level redirection. To the user, the shift is invisible; to the system, it is simply another resilience exercise.
Yet compute failover is only half the battle. An application that springs back to life without its data is like a body revived without memory: alive, but unable to function. That is why continuous data replication and comprehensive backups are not optional; they are the backbone of true resilience.
In our Kubernetes environment, every persistent volume and stateful workload is mirrored to secure secondary systems across regions, into alternate cloud providers, or even onto on-premise infrastructure. These replicas are kept securely, ensuring that the state of the system is never lost, only relocated.
When catastrophe strikes, whether it is the failure of a region or the loss of an entire datacenter, Kubernetes does not just restart workloads; it restores them with their data intact, drawing from synchronized backups to recreate the complete state. These capabilities transform Kubernetes from a system that merely recovers individual containers into one that can withstand and seamlessly recover from even the largest-scale disasters.
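As a sketch of the backup half of this story, the example below assumes Velero is installed with a configured object-storage location; the schedule, namespaces, and retention are illustrative:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"            # run every day at 02:00
  template:
    includedNamespaces:
      - default
    snapshotVolumes: true          # also snapshot persistent volumes
    ttl: 720h                      # keep backups for 30 days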
Sample Active-Active setup
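The original sample here was a diagram. As a stand-in, this minimal sketch spreads replicas evenly across availability zones within a single cluster, assuming the nodes carry the standard topology.kubernetes.io/zone label; a true multi-cluster active-active setup additionally needs global DNS or load balancing in front of the clusters.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app-multi-az
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                   # keep zones within 1 pod of each other
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: test-app
      containers:
        - name: app
          image: nginx:latest
          ports:
            - containerPort: 80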
CI/CD and Rollback Mechanisms:
Implementing CI/CD pipelines for automated testing and change deployment ensures that faulty updates do not impact the system: deployment strategies such as canary and blue-green releases make rollouts safer, and failed deployments can be rolled back automatically to the previous stable version using tools like ArgoCD, promoting system stability.
In a Canary release, a new version is rolled out to a small slice of users first, like testing the waters before diving in. If it proves stable, the rollout expands; if issues arise, the damage is contained. Similarly, in a Blue-Green deployment you run two application environments (blue = old, green = new), but traffic goes to only one at a time. A simple switch redirects users only when the new release has proven itself, making upgrades seamless and reversible.
However, even with the best precautions, not every change succeeds. That is why Kubernetes environments pair with tools like ArgoCD, which provide automatic rollback. Should a deployment fail or its health checks trigger an alarm, the system instantly executes an automated rollback, reverting to the last known good state in a controlled manner, without requiring any human intervention.
This fusion of CI/CD automation, progressive rollout strategies, and built-in rollback mechanisms ensures that Kubernetes does not just heal from failure, it is fundamentally engineered to avoid introducing failure in the first place. The result is a platform where innovation moves quickly, but stability always comes first.
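A minimal sketch of that GitOps piece, assuming Argo CD is installed in the argocd namespace and the manifests live in a hypothetical Git repository (the repo URL and path are placeholders). With automated sync, prune, and selfHeal enabled, Argo CD continuously drives the cluster back to the last good state declared in Git:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cart-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/cart-app-manifests.git   # hypothetical repo
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
    retry:
      limit: 3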
Canary Deployment sample
# Stable old version (v1)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-app-v1
  labels:
    app: cart-app
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cart-app
      version: v1
  template:
    metadata:
      labels:
        app: cart-app
        version: v1
    spec:
      containers:
        - name: app
          image: nginx:1.21
          ports:
            - containerPort: 80
---
# Canary new version (v2)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-app-v2
  labels:
    app: cart-app
    version: v2
spec:
  replicas: 1   # start with 1 pod for the canary
  selector:
    matchLabels:
      app: cart-app
      version: v2
  template:
    metadata:
      labels:
        app: cart-app
        version: v2
    spec:
      containers:
        - name: app
          image: nginx:1.23
          ports:
            - containerPort: 80
---
# Service routes traffic to both v1 and v2
apiVersion: v1
kind: Service
metadata:
  name: cart-app-svc
spec:
  selector:
    app: cart-app
  ports:
    - port: 8002
      targetPort: 80
Blue-Green Deployment sample
# Blue deployment (current live version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-app-blue
  labels:
    app: cart-app
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cart-app
      version: blue
  template:
    metadata:
      labels:
        app: cart-app
        version: blue
    spec:
      containers:
        - name: app
          image: nginx:1.21
          ports:
            - containerPort: 80
---
# Green deployment (new version, not live yet)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-app-green
  labels:
    app: cart-app
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cart-app
      version: green
  template:
    metadata:
      labels:
        app: cart-app
        version: green
    spec:
      containers:
        - name: app
          image: nginx:1.23
          ports:
            - containerPort: 80
---
# Service - initially pointing to blue
apiVersion: v1
kind: Service
metadata:
  name: cart-app-svc
spec:
  selector:
    app: cart-app
    version: blue   # switch to 'green' when ready
  ports:
    - port: 8001
      targetPort: 80
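When the green version has been validated, the cutover is a single change to the Service selector. Assuming the manifests above are applied in the default namespace, one way to do it is:
kubectl patch service cart-app-svc -n default -p '{"spec":{"selector":{"app":"cart-app","version":"green"}}}'
Rolling back is the same operation with version set back to blue.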
Logging, Monitoring, Tracing, Visualization and Alerting:
Setting up periodic health checks to assess the health of various components is essential when building self-healing services.
Periodic health checks constantly measure the pulse of the cluster, tracking CPU usage, memory consumption, disk space, latency, and other vital signs. These signals reveal when the system is thriving and when it is under strain, but numbers alone are not enough; they must be gathered, aggregated, and understood. Tools like Fluentd and Elasticsearch collect logs from across the ecosystem, stitching together the story of every service. When anomalies such as unexpected error spikes, unusual patterns, or silent failures surface, they act as the early whispers of trouble.
To turn raw data into insight, Kubernetes environments lean on Prometheus and Grafana. Prometheus scrapes metrics in real time, while Grafana transforms them into live dashboards, giving operators and developers a vivid view of the system’s heartbeat at a glance. Tracing adds another layer, following requests as they weave through microservices and pinpointing where latency hides or failures emerge.
However, the real magic lies in alerting and automation. When metrics cross a critical threshold, alerts fire instantly, summoning attention before users ever notice. Better still, these alerts can trigger automated remediation: scaling Pods, rerouting traffic, or restarting services, all without human hands.
In this way, observability tools do more than just monitor; they empower Kubernetes to act with foresight.
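As a small sketch of how such an alert might be wired up, assuming the Prometheus Operator (e.g., kube-prometheus-stack) is installed, a PrometheusRule like the one below fires when a pod keeps crash-looping; the label values and thresholds are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: self-healing-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the operator's rule selector (assumed)
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"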
Top comments (2)
Great article and an interesting write-up, but a quick one: it is not advisable to use both the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler at once. One is enough and can be paired with the Cluster Autoscaler, or even Karpenter if you are on AWS. Just pointing that out.
Thank you, Chinonso, for the observation.
The idea here is to present all the available options so readers can pick whichever one works for their use case.