EKS + Longhorn — Dancing into Dynamic Storage

Ashish Gajjar

Introduction

This comprehensive guide walks you through the complete process of deploying and configuring Longhorn as the default storage class in your Amazon Elastic Kubernetes Service (EKS) cluster. Longhorn provides a robust, cloud-native storage solution that addresses the limitations of traditional cloud-based block storage solutions like Amazon EBS.
By following this guide, you will learn how to:

  • Understand why EBS falls short for modern Kubernetes workloads
  • Prepare your EKS cluster for Longhorn
  • Install open-iSCSI prerequisites on every worker node
  • Deploy Longhorn using Helm as the default StorageClass
  • Configure AWS Load Balancer Controller for external access
  • Access the Longhorn management UI securely
  • Automate the entire setup using Atmos + Terraform

What is an EBS Volume?

EBS (Elastic Block Store) is a network-attached storage volume for your EC2 instance — think of it as a hard disk in the cloud. It lives in the same Availability Zone as the instance it serves.


Step 1 — Create Volume
You create an EBS volume with a specific size, type, and Availability Zone.

Create an EBS volume
aws ec2 create-volume \
  --size 20 \
  --volume-type gp3 \
  --availability-zone us-east-1a \
  --region us-east-1

Step 2 — Attach to EC2
Attach the volume to an EC2 instance. The volume and instance must be in the same Availability Zone.

Attach volume to EC2
aws ec2 attach-volume \
  --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 \
  --device /dev/xvdf

Step 3 — Format the Volume
Format the volume with a file system before use.

  Format with ext4
sudo mkfs -t ext4 /dev/xvdf
# All clean and ready!

Step 4 — Mount the Volume
Mount the volume to a directory so your application can use it.

Mount volume to /data
sudo mount /dev/xvdf /data
# Make mount persistent across reboots
echo '/dev/xvdf /data ext4 defaults 0 2' | sudo tee -a /etc/fstab

Step 5 — Use It
Your application can now read and write data to the mounted volume.
Step 6 — Data Persists
Even if the EC2 instance stops or restarts, your data is safe in EBS. The volume lifecycle is independent from the instance.
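A quick way to see this from the CLI, reusing the example volume, mount, and instance IDs from the steps above (a sketch; adjust the IDs to your environment):

# Write a file to the mounted volume
echo "hello from ebs" | sudo tee /data/test.txt

# Stop and start the instance from your workstation
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# SSH back in: thanks to the fstab entry from Step 4 the volume is remounted
cat /data/test.txt   # hello from ebs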
Step 7 — Snapshot (Backup)
Create snapshots of your EBS volume to back up data to Amazon S3.

Create a snapshot
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "Daily backup"
# Snapshot saved in Amazon S3

In Simple Words
EBS = Virtual Hard Drive
EBS is like a virtual hard drive that you attach to your EC2. You create it, attach it, mount it, use it, and keep your data safe. EBS and EC2 are Best Buddies — as long as they stay in the same Availability Zone!

EBS Limitations — Why Longhorn is Needed

  • Cannot move across AZ — if the node is in another AZ, EBS cannot follow. The pod gets stuck and manual intervention is needed.
  • Single node access only — not designed for multi-node ReadWriteMany; only one EC2 instance at a time.
  • No auto failover — if the node fails, manual intervention is required to recover and reattach the volume.
  • Not ideal for Kubernetes HA — EBS works, but it was not built for dynamic pod scheduling across multiple nodes and zones.
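The first limitation is easy to demonstrate: attaching a volume to an instance running in a different Availability Zone is rejected outright. A sketch using the example volume from above and a hypothetical instance in another zone:

# Volume was created in us-east-1a, instance runs in us-east-1b
aws ec2 attach-volume \
  --volume-id vol-0123456789abcdef0 \
  --instance-id i-0fedcba9876543210 \
  --device /dev/xvdf
# The call fails with a zone-mismatch error:
# the volume and the instance must be in the same Availability Zone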

The Problem: Why EBS is Not Enough

When deploying stateful applications on Amazon EKS, most teams start with the AWS EBS CSI Driver — the default storage provider. While EBS works well for simple use cases, it has fundamental architectural limitations that become painful at scale.
EBS Limitations

  • Bound to a single Availability Zone — volumes cannot cross AZ boundaries
  • If a pod migrates to a node in a different AZ during failure or rebalancing, the EBS volume cannot follow it
  • Requires complex strategies: volume snapshots, restoration, or restricting pod scheduling to specific AZs
  • Read-Write-Once (RWO) access only — one pod can mount at a time
  • Tied to AWS infrastructure — no portability to on-prem or other clouds
  • Single-instance attachment creates a potential I/O bottleneck

Real-World Impact
These limitations create real operational pain:
  • High availability gaps during node failure — volumes cannot automatically migrate
  • Manual failover processes that introduce downtime
  • Cannot rebalance storage across Availability Zones
  • No built-in cross-cluster disaster recovery
  • Pod scheduling constraints that limit cluster flexibility

What is Longhorn?

Longhorn is a lightweight, reliable, and feature-rich distributed block storage system designed specifically for Kubernetes environments. Originally developed by Rancher Labs and now maintained as a CNCF Incubating project, Longhorn has become a production-ready solution trusted by organizations worldwide.

Architecture and Design Philosophy

Longhorn operates as a cloud-native storage orchestrator that runs entirely within your Kubernetes cluster. Unlike traditional storage solutions that require dedicated hardware or cloud provider-specific integrations, Longhorn leverages the local storage available on your Kubernetes nodes and manages it as a unified storage pool.
The system uses a microservices architecture where each volume is managed by its own controller, ensuring isolation and resilience. Storage replicas are distributed across multiple nodes, providing data redundancy and high availability without requiring external storage systems.
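Once Longhorn is installed (Step 4 below), this design is visible directly in the cluster: every volume, its dedicated engine, and each replica exist as custom resources. A quick look, assuming the default longhorn-system namespace:

# One engine per volume, one replica object per data copy
kubectl get volumes.longhorn.io -n longhorn-system
kubectl get engines.longhorn.io -n longhorn-system
kubectl get replicas.longhorn.io -n longhorn-system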

Key Capabilities

  • Persistent Storage for stateful applications — databases, message queues, and more
  • Cloud-Agnostic Storage — works consistently across any Kubernetes environment
  • Multi-Node Replication — automatically replicates across nodes and AZs
  • External Backup Integration — snapshots to S3, NFS, Azure Blob, or MinIO
  • Disaster Recovery — cross-cluster DR volumes for business continuity
  • Automated Snapshot and Backup Scheduling — recurring cron-based jobs
  • Non-Disruptive Upgrades — upgrade components without disrupting running PVs
  • RWO + RWX Access Modes — both ReadWriteOnce and ReadWriteMany supported

Longhorn vs EBS CSI Driver

Understanding the differences between Longhorn and the AWS EBS CSI Driver is crucial for making informed infrastructure choices. Here is a direct comparison:
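Each row below is drawn from the limitations and capabilities described in this guide:

| | AWS EBS CSI Driver | Longhorn |
|---|---|---|
| Zone scope | Volume bound to one AZ | Replicas spread across nodes and AZs |
| Access modes | RWO only | RWO and RWX |
| Node failure | Manual detach / reattach | Pod reschedules, data served from another replica |
| Backups | EBS snapshots to S3 | Built-in snapshots, backups to S3 / NFS / Azure Blob / MinIO |
| Portability | AWS only | Any Kubernetes cluster, cloud or on-prem |
| Data location | EBS service | Local disks on the worker nodes |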

EKS Cluster Requirements
Minimum Node Count
Longhorn uses a replica-based architecture to ensure data durability. The default configuration creates three replicas for each volume, distributing them across different nodes.
Critical Requirement
Deploy at least 3 worker nodes in your EKS cluster. With fewer than 3 nodes, Longhorn cannot maintain its default 3-replica configuration, which compromises data redundancy.
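A quick sanity check before installing:

# Should report 3 or more Ready worker nodes
kubectl get nodes
kubectl get nodes --no-headers | grep -cw Ready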

Recommended Instance Types
For production deployments, choose EC2 instance types with NVMe SSD instance store volumes (the 'd' suffix). These provide high IOPS and low latency that Longhorn can directly utilize:

  • Compute Optimized: c5d, c5ad, c6gd — ideal for CPU-intensive workloads
  • Memory Optimized: r5d, r5ad, r6gd — for memory-intensive applications
  • General Purpose: m5d, m5ad, m6gd — balanced compute, memory and storage
  • Storage Optimized: i3en, z1d — maximum NVMe throughput

For development/testing, t3.xlarge (4 vCPU, 16 GB RAM) is acceptable. Never use t3.medium — it lacks the CPU for Longhorn engine processes.

Important: Do NOT install the EBS CSI Driver alongside Longhorn.

Step-by-Step Installation Guide

Step 1 — Create EKS Cluster

 Create key pair first
aws ec2 create-key-pair \
  --key-name ashish-key \
  --region us-east-1 \
  --query 'KeyMaterial' \
  --output text > ashish-key.pem
chmod 400 ashish-key.pem

Full cluster create command (takes 15-20 minutes)

eksctl create cluster \
  --name ashish --version 1.33 --region us-east-1 \
  --nodegroup-name ashish-workers \
  --node-type t3.xlarge --nodes 3 --nodes-min 2 --nodes-max 5 \
  --managed --with-oidc \
  --node-private-networking \
  --ssh-access --ssh-public-key ashish-key

Watch progress

eksctl get cluster --name ashish --region us-east-1

Step 2 — Enable OIDC + Install Addons
Enable OIDC (if not done at cluster creation)

eksctl utils associate-iam-oidc-provider \
  --cluster ashish --region us-east-1 --approve

Verify OIDC

aws eks describe-cluster --name ashish \
  --query "cluster.identity.oidc.issuer" --output text

Install all EKS addons


# aws-ebs-csi-driver is intentionally skipped: Longhorn will be the storage layer
eksctl create addon --name vpc-cni --cluster ashish --region us-east-1 --force
eksctl create addon --name coredns --cluster ashish --region us-east-1 --force
eksctl create addon --name kube-proxy --cluster ashish --region us-east-1 --force


Verify all addons are ACTIVE
eksctl get addon --cluster ashish --region us-east-1

Step 3 — Install Longhorn Prerequisites
Longhorn has specific system-level dependencies that must be installed on each Kubernetes worker node.
Install open-iSCSI on every node (SSH in first)

# Install the iSCSI initiator package
sudo dnf install -y iscsi-initiator-utils

# Enable the iscsid service to start on boot
sudo systemctl enable iscsid

# Start the iscsid service immediately
sudo systemctl start iscsid

# Verify
sudo systemctl status iscsid

Install open-iSCSI via DaemonSet (alternative)

kubectl apply -f \
  https://raw.githubusercontent.com/longhorn/longhorn/v1.6.0/deploy/prerequisite/longhorn-iscsi-installation.yaml

# Verify iscsi running on all nodes
kubectl get pods -n longhorn-system | grep iscsi

Install NFS client (for RWX volumes)

kubectl apply -f \
  https://raw.githubusercontent.com/longhorn/longhorn/v1.6.0/deploy/prerequisite/longhorn-nfs-installation.yaml

# Verify NFS running
kubectl get pods -n longhorn-system | grep nfs

Run Preflight Check — do not skip this!

curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.6.0/scripts/environment_check.sh | bash

# All nodes must show: [  OK  ]
# Fix any FAIL items before proceeding
kubectl get pods -n longhorn-system   # should all be Running

Step 4 — Install Longhorn via Helm

Create custom-values.yaml
cat > custom-values.yaml <<EOF
preUpgradeChecker:
  jobEnabled: false
EOF

# preUpgradeChecker.jobEnabled: false
# Disables the pre-upgrade checker on first install.
# Re-enable for future upgrades.
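When you upgrade later, the checker can simply be switched back on, for example (a sketch; other chart values stay as they are):

helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --reuse-values \
  --set preUpgradeChecker.jobEnabled=true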

Add Helm repo and install Longhorn
helm repo add longhorn https://charts.longhorn.io
helm repo update

Install Longhorn

helm install longhorn \
  --repo https://charts.longhorn.io \
  longhorn \
  --namespace longhorn-system \
  --create-namespace \
  -f custom-values.yaml
# Watch pods come up (takes 2-3 mins)
kubectl get pods -n longhorn-system --watch

The installation deploys: DaemonSets (node agents), Deployments (Longhorn manager + UI), Services, and CustomResourceDefinitions (CRDs).
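To confirm those pieces landed, a few quick checks (assuming the default longhorn-system namespace):

# CRDs registered by Longhorn
kubectl get crd | grep longhorn.io

# DaemonSets, Deployments and Services created by the chart
kubectl get daemonset,deployment,svc -n longhorn-system

# The StorageClass created by the chart
kubectl get storageclass longhorn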
Step 5 — StorageClass, PV and PVC
Set Longhorn as default StorageClass

# Check existing storage classes
kubectl get storageclasses

# Remove default from gp2
kubectl patch storageclass gp2 \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

# Set Longhorn as default
kubectl patch storageclass longhorn \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Verify — longhorn should show (default)
kubectl get storageclass


Storage Flow: PVC (asks for storage) → StorageClass (decides how to create it) → PV (actual storage provided)
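To watch that flow end to end, a throwaway PVC is enough. The name demo-pvc is only an example, and no storageClassName is needed because longhorn is now the default:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
EOF

# The PVC goes Bound and a Longhorn-backed PV appears automatically
kubectl get pvc demo-pvc
kubectl get pv

# Clean up the test claim
kubectl delete pvc demo-pvc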
EBS StorageClass vs Longhorn StorageClass — Side by Side
EBS StorageClass (gp3)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer

EBS Limitations:

  • Pod stuck if node fails in different AZ — EBS cannot reattach across zones
  • WaitForFirstConsumer means pod must schedule before volume is created
  • One EC2 instance only — no multi-node access (RWO only)
  • Every volume = a separate EBS bill + snapshot storage cost

Longhorn StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
reclaimPolicy: Delete

Longhorn Advantages:

  • Pod restarts on any node in any AZ — Longhorn serves data from nearest replica
  • 3 replicas = no single point of failure — cluster continues if one node dies
  • Built-in UI, snapshot, backup to S3 — no extra AWS services needed
  • Same YAML works on AWS, GCP, on-prem — no cloud-specific provisioner
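One thing EBS simply cannot offer: the same longhorn StorageClass also serves ReadWriteMany volumes (backed by the NFS components installed in Step 3), so pods on different nodes can share one volume. A minimal sketch, with an example claim name:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data          # example name
spec:
  accessModes:
    - ReadWriteMany          # multiple pods across multiple nodes
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi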

Step 6 — UI & Security Setup
Open Longhorn Dashboard
Method 1: Port Forward (quickest)

  Port-forward to local machine
kubectl port-forward \
  -n longhorn-system \
  svc/longhorn-frontend 8080:80
# Open browser at:
# http://localhost:8080

Method 2: LoadBalancer (team access)
Create an AWS Network Load Balancer for the UI. This relies on the AWS Load Balancer Controller being installed in the cluster, since the annotations below are handled by that controller.

kubectl get svc longhorn-frontend -n longhorn-system

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: longhorn-frontend-lb
  namespace: longhorn-system
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "instance"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: longhorn-ui
  ports:
    - port: 80
      targetPort: 8000
EOF

kubectl get svc longhorn-frontend-lb -n longhorn-system --watch


Method 3: NodePort
Patch to NodePort

kubectl patch svc longhorn-frontend \
  -n longhorn-system \
  -p '{"spec":{"type":"NodePort"}}'

kubectl get svc longhorn-frontend -n longhorn-system
# Access via: http://<NODE-IP>:<NODE-PORT>

Debug if Pods Not Running
Troubleshooting commands

# Check all Longhorn pods are Running
kubectl get pods -n longhorn-system

# If any pod is not Running
kubectl describe pod <pod-name> -n longhorn-system
kubectl logs <pod-name> -n longhorn-system

# Check node health in Longhorn
kubectl get nodes.longhorn.io -n longhorn-system

Traefik Gateway — SSO with HTTPRoute
Install Gateway API CRDs + Traefik

# Traefik CRDs (includes the Middleware CRD used below)
kubectl apply -f https://raw.githubusercontent.com/traefik/traefik/v3.0/docs/content/reference/dynamic-configuration/kubernetes-crd-definition-v1.yml

# Gateway API CRDs (needed for HTTPRoute)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.1.0/standard-install.yaml

# Install Traefik with Gateway API enabled
helm repo add traefik https://traefik.github.io/charts
helm install traefik traefik/traefik \
  --namespace traefik \
  --create-namespace \
  --set providers.kubernetesGateway.enabled=true \
  --set gateway.enabled=true

traefik-middlewares.yaml

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: forward-auth-delegate
  namespace: longhorn-system
spec:
  chain:
    middlewares:
      - name: forward-auth
        namespace: service-foundry

Configuration explanation:

  • apiVersion traefik.io/v1alpha1 — uses the Traefik CRD API version for defining middleware
  • kind Middleware — defines this resource as a Traefik middleware
  • forward-auth-delegate — name referenced in the HTTPRoute (see the sketch after this list)
  • namespace longhorn-system — placed in the same namespace as Longhorn
  • name forward-auth / namespace service-foundry — references the existing auth middleware in the service-foundry namespace
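The HTTPRoute that references this middleware is not shown above; a sketch of what it could look like is below. The hostname and Gateway name are placeholders, and attaching a Traefik Middleware through an ExtensionRef filter depends on your Traefik version supporting it:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: longhorn-ui
  namespace: longhorn-system
spec:
  parentRefs:
    - name: traefik-gateway          # placeholder: the Gateway created by the Traefik chart
      namespace: traefik
  hostnames:
    - longhorn.example.com           # placeholder hostname
  rules:
    - filters:
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: forward-auth-delegate
      backendRefs:
        - name: longhorn-frontend
          port: 80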

Service Foundry Deployment Flow

A real-world deployment of PostgreSQL and Redis using Longhorn storage in the service-foundry namespace:

  1. Create Namespace → service-foundry (isolated environment)
  2. Create Secrets → store PostgreSQL & Redis credentials securely
  3. Create PVCs → PostgreSQL (8Gi) + Redis (8Gi) — uses default Longhorn StorageClass automatically
  4. Deploy PostgreSQL → mounted persistent storage + internal ClusterIP service
  5. Deploy Redis → ConfigMap + Secret + mounted persistent storage + internal ClusterIP service
  6. Persistent Storage Flow → PVC → StorageClass (Longhorn) → PV auto-created

Verify PVCs and PVs
kubectl get pvc -A
# NAMESPACE         NAME                          STATUS  STORAGECLASS
# service-foundry   data-postgresql-0             Bound   longhorn
# service-foundry   redis-data-redis-master-0     Bound   longhorn

kubectl get pv
# NAME                   CAPACITY  STATUS  STORAGECLASS
# pvc-4f9baf6b-...       8Gi       Bound   longhorn

UI Dashboard

Screenshots of the Longhorn UI: the main dashboard, node details, volume details, snapshots and backups, and recurring (cron) job schedules.

Conclusion
Longhorn transforms how you think about storage in Kubernetes. By moving from the rigid, zone-bound architecture of EBS to a distributed, software-defined storage layer, you gain true pod mobility, automatic failover, built-in backup, and cloud portability — all from within your cluster.
