In this guide, we’ll walk through how to implement and test Horizontal Pod Autoscaling (HPA) on a GKE cluster.
You’ll learn how to automatically scale your Pods based on CPU utilization — helping your applications handle increased load efficiently and reduce costs during idle time.
📘 Step 01: Introduction
In Kubernetes, the Horizontal Pod Autoscaler (HPA) automatically adjusts the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed CPU/memory usage or custom metrics.
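The HPA reads CPU and memory figures from the Kubernetes metrics API; on GKE the Metrics Server that backs it is installed by default, so no extra setup is needed. A quick sanity check before starting (output will vary by cluster):

```bash
# The resource metrics API must be available for HPA to work
kubectl get apiservices v1beta1.metrics.k8s.io

# If this prints usage numbers instead of an error, metrics are flowing
kubectl top nodes
```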
In this demo, we’ll:
- Deploy a simple NGINX-based application
- Expose it internally via a ClusterIP Service
- Create an HPA that scales Pods between 1 and 10 replicas
- Generate CPU load to observe the scaling in action
📁 Step 02: Review Kubernetes Manifests
Let’s create a new working directory and prepare the required YAML files.
mkdir kube-manifests-autoscalerV2
cd kube-manifests-autoscalerV2
🧩 01-kubernetes-deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp1-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp1
  template:
    metadata:
      name: myapp1-pod
      labels:
        app: myapp1
    spec:
      containers:
        - name: myapp1-container
          image: ghcr.io/stacksimplify/kubenginx:1.0.0
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "5Mi"
              cpu: "25m"
            limits:
              memory: "50Mi"
              cpu: "50m"
```
💡 Note:
The CPU requests and limits are intentionally small so that autoscaling is easy to trigger during the demo. The HPA measures utilization against the request, so with a 25m request and the 30% target defined in the HPA manifest below, scale-out begins once a Pod averages only about 7.5m CPU.
🧩 02-kubernetes-cip-service.yaml
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp1-cip-service
spec:
  type: ClusterIP
  selector:
    app: myapp1
  ports:
    - name: http
      port: 80
      targetPort: 80
```
This creates an internal ClusterIP service so that other Pods (like our load generator) can access the app inside the cluster.
🧩 03-kubernetes-hpa.yaml
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp1-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 30
```
💡 Explanation:
- Scales myapp1-deployment between 1–10 Pods
- Target average CPU utilization: 30%
- If average CPU usage across Pods exceeds 30%, Kubernetes will create more Pods automatically.
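Under the hood, the HPA controller uses the standard replica formula from the Kubernetes docs; with a 30% target, a burst of load plays out roughly like this (numbers are illustrative):

```text
desiredReplicas = ceil( currentReplicas × currentUtilization / targetUtilization )

Example: 2 Pods averaging 90% CPU against the 30% target
         ceil( 2 × 90 / 30 ) = 6 Pods  (never more than maxReplicas: 10)
```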
⚙️ Step 03: Deploy the Sample App and Verify
Apply all manifests (run this from inside the kube-manifests-autoscalerV2 directory created above):
kubectl apply -f .
✅ Check the current Pods
kubectl get pods
👉 Observation: Only 1 Pod should be running initially.
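You can also confirm that the ClusterIP Service answers from inside the cluster with a throwaway Pod (the name curl-check is just an example, and the command assumes the default namespace):

```bash
# One-off BusyBox Pod that fetches the page once and is removed on exit
kubectl run curl-check --rm -it --image=busybox --restart=Never -- \
  wget -qO- http://myapp1-cip-service
```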
✅ Verify HPA
kubectl get hpa
You should see something like:
```
NAME   REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
cpu    Deployment/myapp1-deployment   0%/30%    1         10        1          1m
```
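For more detail than this one-line summary, kubectl describe shows the HPA's current metrics, conditions, and recent scaling events:

```bash
kubectl describe hpa cpu
```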
🔥 Run a Load Test (in a new terminal)
We’ll generate continuous load using a BusyBox container:
```bash
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- \
  /bin/sh -c "while sleep 0.01; do wget -q -O- http://myapp1-cip-service; done"
```
This will repeatedly send requests to our service, increasing CPU usage.
✅ Observe Scale-Out Event
In another terminal:
kubectl get pods
You’ll gradually see new Pods being created as the load increases.
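To follow the scale-out live instead of re-running the command, watch just the app's Pods by label (app=myapp1 comes from the Deployment manifest):

```bash
kubectl get pods -l app=myapp1 --watch
```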
🧠 Check Pod Metrics
kubectl top pod
You can also watch how HPA reacts:
kubectl get hpa --watch
✅ Observe Scale-In Event
After stopping the load generator, HPA will scale down the Pods over a few minutes (usually 3–5 mins).
kubectl get pods
👉 Observation: Only 1 Pod should remain after scaling in.
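That delay comes from the HPA's scale-down stabilization window, which defaults to 300 seconds. If you want faster scale-in for demos, the autoscaling/v2 API exposes an optional behavior block you could add under spec: in 03-kubernetes-hpa.yaml; a minimal sketch (the 60-second value is just an example):

```yaml
  # Merged under spec: of the HorizontalPodAutoscaler
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300
```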
🧹 Step 04: Clean-Up
Delete the load generator (if it’s stuck in an error state):
kubectl delete pod load-generator
Delete the sample app and HPA:
kubectl delete -f 01-kubernetes-deployment.yaml
kubectl delete -f 02-kubernetes-cip-service.yaml
kubectl delete -f 03-kubernetes-hpa.yaml
⚡ Step 05: Create HPA Using Imperative Command
You can also create an HPA without writing any YAML by using kubectl autoscale.
🧾 Deploy the App Again
kubectl apply -f 01-kubernetes-deployment.yaml
Check running Pods:
kubectl get pods
🧰 Create HPA Imperatively
kubectl autoscale deployment myapp1-deployment --min=3 --max=10 --cpu-percent=30
This command creates an HPA resource that ensures:
- Minimum 3 replicas
- Maximum 10 replicas
- Scales based on 30% CPU target
🔍 Verify
kubectl get hpa
kubectl get pods
👉 You’ll see 3 Pods running as per --min=3.
📜 Review the Generated HPA YAML
kubectl get hpa myapp1-deployment -o yaml
You’ll notice it uses apiVersion: autoscaling/v2.
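The generated spec should look roughly like the declarative manifest from Step 02, with the flag values filled in (exact fields, defaults, and status will vary):

```yaml
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp1-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 30
```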
🧹 Step 06: Final Clean-Up
kubectl delete -f 01-kubernetes-deployment.yaml
kubectl delete hpa myapp1-deployment
📊 Step 07: Use kubectl top to Check Resource Usage
```bash
# View node metrics
kubectl top node

# View Pod metrics (e.g., in the kube-system namespace)
kubectl top pod -n kube-system
```
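You can also break Pod usage down per container, which is handy when a Pod runs sidecars:

```bash
kubectl top pod --containers
```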
🖥️ Step 08: Configure HPA via GKE Console (GUI Method)
You can also create and manage Horizontal Pod Autoscaling (HPA) directly from the Google Cloud Console — no YAML or kubectl needed.
🔹 Steps:
- Go to the Google Cloud Console
- Navigate to Kubernetes Engine → Clusters
- Select your Cluster
- Go to the Workloads tab
- Click on your Application (Pod/Deployment)
- Under the Details section, scroll down to Autoscaling
- Choose Horizontal Pod Autoscaler → Configure
⚙️ Configuration Options:
You can specify:
- Minimum number of Pods (e.g., 1)
- Maximum number of Pods (e.g., 10)
- Target CPU utilization percentage (e.g., 30%)
- Optionally, use custom or memory-based metrics
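If you prefer to stay declarative, the memory-based option above can also be expressed in the autoscaling/v2 manifest; a sketch of an additional entry under metrics: in 03-kubernetes-hpa.yaml (the 70% value is just an example):

```yaml
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```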
🎯 Summary
| Step | Description |
|---|---|
| 1 | Create a Deployment and ClusterIP Service |
| 2 | Define HPA YAML or use `kubectl autoscale` |
| 3 | Run a load generator to test scaling |
| 4 | Observe Pods scale out/in |
| 5 | Clean up resources after the test |
💡 Key Takeaways
✅ Autoscaling is automatic — no manual intervention needed
✅ HPA uses the Metrics Server to fetch CPU/memory usage
✅ Use low resource limits in demos for visible scaling
✅ Imperative and declarative approaches both work
✅ Helps achieve cost efficiency and application resilience
🌟 Thanks for reading! If this post added value, a like ❤️, follow, or share would encourage me to keep creating more content.
— Latchu | Senior DevOps & Cloud Engineer
☁️ AWS | GCP | ☸️ Kubernetes | 🔐 Security | ⚡ Automation
📌 Sharing hands-on guides, best practices & real-world cloud solutions