
Ankush Choudhary Johal

Originally published at johal.in

Retrospective: Migrating 500+ Stateful Services to Kubernetes 1.34 with Persistent Volumes

Last year, our infrastructure team embarked on one of the largest stateful workload migrations in our company’s history: moving over 500 stateful services from legacy bare-metal and VM-based environments to Kubernetes 1.34, with all data backed by Persistent Volumes (PVs). This retrospective breaks down our planning, the challenges we faced, the solutions we implemented, and the key takeaways for teams tackling similar large-scale stateful migrations.

Pre-Migration Assessment: Laying the Groundwork

Before touching a single workload, we spent 3 months conducting a full inventory of our stateful services. These included relational databases (PostgreSQL, MySQL), NoSQL stores (MongoDB, Redis), message queues (Kafka, RabbitMQ), and custom stateful microservices with file-system-based storage. We cataloged each service’s storage requirements: IOPS needs, throughput, capacity, backup frequency, and allowed downtime windows.

We also evaluated Kubernetes 1.34’s feature set to align our migration plan. K8s 1.34’s GA release of VolumeSnapshot v2 and improved CSI driver compatibility were critical enablers, as was the stable support for StatefulSet ordered rolling updates and pod identity preservation. We validated that our target storage providers (AWS EBS, GCP Persistent Disk, and on-prem Ceph RBD) all had certified CSI drivers compatible with 1.34.

Top Challenges We Faced

Migrating stateful workloads at scale introduced challenges that stateless migrations never encounter:

  • Data Consistency and Downtime: Many of our services had strict RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements, with some databases allowing zero unplanned downtime.
  • Storage Class Fragmentation: Legacy services used a mix of static PV provisioning, hostPath volumes, and deprecated in-tree storage drivers, the last of which are no longer supported in K8s 1.34.
  • Scale Coordination: Coordinating migration windows for 500+ services across 40+ engineering teams, with minimal impact to ongoing feature development.
  • StatefulSet Configuration Complexity: Ensuring pod identity, stable network endpoints, and ordered rollouts for services with strict startup/shutdown sequences.
  • PV Lifecycle Management: Handling PV resizing, reclaim policies, and snapshot/restore workflows for 1000+ PVs post-migration.

Solutions We Implemented

Standardized Storage and PV Configuration

We retired all in-tree storage drivers in favor of CSI-compliant drivers, and defined 4 standardized StorageClasses to cover all workload types: fast-ssd (for databases), standard-hdd (for log storage), shared-readonly (for shared config volumes), and backup-snapshot (for automated VolumeSnapshots). We enforced dynamic PV provisioning for all new workloads, and migrated legacy static PVs to dynamic provisioning during the migration window.

We set default PV reclaim policies to Retain for production workloads to prevent accidental data loss, and implemented automated scripts to clean up unbound PVs weekly.
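To make this concrete, here is a minimal sketch of what one of these classes could look like, assuming the AWS EBS CSI driver (ebs.csi.aws.com) as the backend; the provisioner and parameters would differ for GCP Persistent Disk or Ceph RBD:

```yaml
# Hypothetical sketch of the fast-ssd StorageClass for an EBS-backed cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com             # CSI driver, not an in-tree plugin
reclaimPolicy: Retain                    # keep the PV (and its data) if the PVC is deleted
allowVolumeExpansion: true               # allow online PV resizing later
volumeBindingMode: WaitForFirstConsumer  # provision in the zone of the consuming pod
parameters:
  type: gp3                              # SSD-backed volume type
```

Workloads then simply reference storageClassName: fast-ssd in their PVC templates, which is what let us retire static PV definitions entirely.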

Phased Migration Strategy

We avoided a big-bang migration in favor of a 4-phase rollout:

  1. Pilot Phase (10 services): Migrated low-risk, non-critical services first to validate our tooling and runbooks.
  2. Batch Phase 1 (50 services): Migrated team-specific stateful services with dedicated engineering support.
  3. Batch Phase 2 (200 services): Rolled out to general engineering teams using automated migration pipelines.
  4. Critical Phase (240+ services): Migrated production databases and message queues using blue-green deployment patterns with real-time data replication.

For critical databases, we used native replication (e.g., PostgreSQL streaming replication) to sync data from legacy VMs to K8s-based StatefulSets before cutting over traffic, reducing downtime to under 30 seconds for most services.
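One way to express that cutover almost entirely in Kubernetes objects (a hypothetical sketch of the pattern, not our exact manifests) is a selector-less Service that initially points at the legacy VM through a hand-managed Endpoints object, then gains a selector once the in-cluster replica has caught up:

```yaml
# Before cutover: a selector-less Service routes clients to the legacy VM.
apiVersion: v1
kind: Service
metadata:
  name: orders-db          # hypothetical service name
spec:
  ports:
    - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: orders-db          # must match the Service name
subsets:
  - addresses:
      - ip: 10.20.30.40    # placeholder address of the legacy PostgreSQL VM
    ports:
      - port: 5432
---
# After the K8s replica catches up: the same Service gets a selector, so the
# endpoints controller takes over and traffic flows to the StatefulSet pods.
apiVersion: v1
kind: Service
metadata:
  name: orders-db
spec:
  selector:
    app: orders-db         # label carried by the new StatefulSet pods
  ports:
    - port: 5432
```

Because clients only ever see the Service's stable DNS name, the switch is a single apply plus a promotion of the replica to primary.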

K8s 1.34 Feature Leverage

We took full advantage of K8s 1.34’s new capabilities:

  • Used GA VolumeSnapshot v2 to automate daily backups of all PVs, replacing our legacy backup scripts with native K8s API calls (see the snapshot sketch after this list).
  • Leveraged improved StatefulSet rolling update controls to set podManagementPolicy: OrderedReady and updateStrategy: RollingUpdate with custom maxUnavailable limits per service (see the StatefulSet fragment after this list).
  • Used 1.34’s enhanced CSI health checks to automatically alert on PV degradation, reducing mean time to detection (MTTD) for storage issues by 60%.
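For reference, a single backup in this model boils down to objects like the following, shown here against the snapshot.storage.k8s.io/v1 API (class, driver, and PVC names are hypothetical):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: backup-snapshot
driver: ebs.csi.aws.com            # must match the CSI driver backing the PV
deletionPolicy: Retain             # keep the storage-side snapshot if the object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: orders-db-daily            # hypothetical name stamped by our scheduler
spec:
  volumeSnapshotClassName: backup-snapshot
  source:
    persistentVolumeClaimName: data-orders-db-0   # PVC created by the StatefulSet
```

And the rollout controls map to a StatefulSet fragment like this (note that maxUnavailable for StatefulSets sits behind the MaxUnavailableStatefulSet feature gate, so treat it as an assumption about cluster configuration):

```yaml
# Fragment of a StatefulSet spec: ordered pod management, bounded rollouts.
spec:
  podManagementPolicy: OrderedReady   # create/delete pods one at a time, in order
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1               # requires the MaxUnavailableStatefulSet feature gate
```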

Testing and Validation

Every migrated service went through a 3-step validation process:

  1. Data integrity checks: Compare checksums of all files/records between legacy and K8s environments post-migration (a toy Job for this is sketched after this list).
  2. Chaos testing: Inject PV failures, pod evictions, and node failures to validate service resilience.
  3. Load testing: Simulate production traffic to ensure performance matches or exceeds legacy environments.
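As a toy illustration of step 1, a one-off Job can mount the migrated PVC read-only and hash everything on it, so the output can be diffed against the same command run on the legacy host (image, claim name, and mount path are all hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: verify-orders-db
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: checksum
          image: busybox:1.36
          command: ["sh", "-c"]
          # Hash every file and sort by path for a stable, diff-able report.
          args:
            - find /data -type f -exec sha256sum {} + | sort -k 2
          volumeMounts:
            - name: data
              mountPath: /data
              readOnly: true
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-orders-db-0   # hypothetical migrated PVC
```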

Results

We completed the full migration in 7 months, with the following outcomes:

  • 99.99% of services met their RPO/RTO requirements, with zero permanent data loss.
  • Storage costs reduced by 22% due to dynamic provisioning and right-sized StorageClasses.
  • Deployment time for new stateful services reduced from 4 hours to 15 minutes.
  • Mean time to recovery (MTTR) for storage-related incidents reduced by 45% thanks to K8s native monitoring and VolumeSnapshot-based restores.

Key Takeaways for Stateful Migrations

Our team learned several critical lessons that apply to any large-scale stateful K8s migration:

  • Always standardize StorageClasses and CSI drivers before starting migrations – fragmentation will slow you down at scale.
  • Phased rollouts are non-negotiable for 500+ services: pilot first, iterate on tooling, then scale.
  • Leverage native K8s features (like VolumeSnapshots and StatefulSet controls) instead of custom scripts wherever possible.
  • Invest in automated validation tooling early: manual checks for 500+ services are impossible to scale.
  • Communicate clearly with service owners: provide clear runbooks, migration timelines, and 24/7 support during cutover windows.

Conclusion

Migrating 500+ stateful services to Kubernetes 1.34 with Persistent Volumes was a complex but highly rewarding project. The shift to K8s-native storage management has improved our reliability, reduced costs, and given our engineering teams a more scalable platform for stateful workloads. For teams planning similar migrations, we recommend starting with a small pilot, investing in storage standardization, and leaning heavily on K8s 1.34’s mature stateful workload features.
