Mallagari Sri Datta

How PlayStation achieved 99.99% uptime on Kubernetes

When you're powering services for millions of PlayStation gamers, "downtime" isn't just an inconvenience; it's a headline. So when Sony Interactive Entertainment (SIE) revealed last year that its Kubernetes platform had achieved 99.995% availability, it also shared lessons that platform engineering teams can take cues from.

From Silos to a Unified Kingdom

Before 2021, SIE was similar to many large organizations: different teams in the US and Japan had their own platforms, leading to duplicated work and inconsistent standards. The solution was a massive "platform unification" program that created one global team and one platform: the Unified Kubernetes Service (UK Platform).

The UK Platform was built on three simple, powerful ideas:

  • Unification: One way to manage everything. No more ad-hoc fixes or team-specific quirks.
  • Multi-tenancy: Services from dozens of teams would run on shared clusters, maximizing resource usage.
  • Standardization: All services would use common Helm charts provided by the platform team. This ensures consistency and makes management sane.

The foundation is built on AWS EKS for managed clusters and Karpenter for node management. The platform team handles the core infrastructure, while service teams focus on what they do best: building amazing applications.
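
As a rough illustration of that division of labour, a Karpenter NodePool (shown here in the v1beta1 API) is the kind of object the platform team would own. The requirements, limits, and names below are placeholder assumptions for a sketch, not SIE's published settings.

```yaml
# Illustrative Karpenter NodePool: the platform team defines what kinds of
# EC2 nodes Karpenter may provision and how aggressively to consolidate them.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default            # AMI, subnet, and security-group details live here
  disruption:
    consolidationPolicy: WhenUnderutilized   # repack under-used nodes to save cost
  limits:
    cpu: "1000"                  # cap total CPU this pool may provision
```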

But as any SRE will tell you, a great start doesn't guarantee a smooth ride. As the platform grew, new challenges emerged. Here's how the team tackled each one.

1 : The Battle for Availability

Keeping services online 24/7 is the ultimate goal.

  • Problem : Uneven Pod Spreading
    Developers observed uneven pod distribution, especially during traffic spikes, even though they had configured PodDisruptionBudgets (PDB) and PodTopologySpreadConstraints (PTSC) with whenUnsatisfiable set to ScheduleAnyway. The fix was to employ the descheduler, which periodically checks the cluster and evicts pods from overcrowded nodes, forcing them to reschedule onto less crowded ones (see the first sketch after this list). Simple, effective, and automated.

  • Problem : Slow Pod Scaling
    Pods didn't scale quickly enough during peak hours (e.g., major title launches, in-game events), leading to increased latency and errors. The total scale-up time was the bottleneck, made up of node creation time plus pod startup time. Two fixes helped:
    Overprovisioning: Spare capacity is kept warm using low-priority placeholder pods. These placeholders are evicted when application pods need to scale, freeing space immediately without waiting for new nodes. This balances cost and responsiveness (see the second sketch after this list).
    Adopting Karpenter: Karpenter is a node autoscaler that provisions and consolidates EC2 instances directly, making node creation faster than the traditional Cluster Autoscaler.

  • Problem : CoreDNS Issues
    CoreDNS is the phonebook of the cluster, handling DNS resolution. As the platform grew, CoreDNS pods, which were running on a limited set of nodes, started hitting rate limits from the upstream DNS resolver. This caused a cascade of failures across applications.
    First, pod anti-affinity was used to ensure CoreDNS pods were spread across many more nodes, distributing the load.
    But the real breakthrough came from a counter-intuitive move: removing the CPU limits. Investigation revealed that CPU limits were causing throttling, which severely impacted performance and tail latency. By dropping the limits and relying on CPU requests and the kernel's scheduler, performance dramatically improved (see the third sketch after this list).
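
To make the pod-spreading fix concrete, here is a minimal sketch of a soft topology spread constraint plus a descheduler policy that re-balances pods violating it. The `app: storefront` label, image, and replica count are made-up illustrations; SIE has not published its actual manifests.

```yaml
# Soft spread: the scheduler tries to balance pods across zones,
# but falls back to scheduling anyway under pressure (ScheduleAnyway).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  replicas: 6
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: storefront
      containers:
        - name: storefront
          image: registry.example.com/storefront:1.0.0
---
# Descheduler policy (older v1alpha1 strategy format) that evicts pods
# violating spread constraints, including soft ScheduleAnyway ones,
# so they get rescheduled onto less crowded nodes.
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemovePodsViolatingTopologySpreadConstraint:
    enabled: true
    params:
      includeSoftConstraints: true
```

The descheduler typically runs as a CronJob or Deployment inside the cluster, and its evictions still respect PDBs, which is one more reason the PDB hygiene discussed later matters.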
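
Here is a second sketch, this time of the overprovisioning pattern with low-priority placeholder pods. The names, replica count, and resource sizes are assumptions for illustration, not SIE's values.

```yaml
# A negative-priority class so placeholder pods are always preempted first.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder capacity, evicted as soon as real workloads need room."
---
# Placeholder pods reserving headroom on the nodes. When application pods
# scale up, the scheduler preempts these pause pods immediately instead of
# waiting for a new node to be created.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 5
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```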
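
And a third sketch covering the two CoreDNS changes: spreading the pods with anti-affinity and dropping the CPU limit. This is an illustrative excerpt of a CoreDNS Deployment pod template, not SIE's actual configuration; the resource values are assumptions.

```yaml
# Excerpt from a CoreDNS Deployment pod template (illustrative values).
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              k8s-app: kube-dns        # the standard CoreDNS label
          topologyKey: kubernetes.io/hostname
  containers:
    - name: coredns
      image: registry.k8s.io/coredns/coredns:v1.11.1
      resources:
        requests:
          cpu: 200m
          memory: 128Mi
        limits:
          memory: 170Mi
          # no cpu limit here: CPU limits caused CFS throttling and tail-latency spikes
```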

2 : Taming Maintenance Challenges

Upgrades are a fact of life in the Kubernetes world. But with over 50 clusters, manual maintenance was a recipe for burnout.

  • Problem : Add-on Upgrades Took Forever
    Manually upgrading an add-on (like a logging agent or metrics server) across all clusters took over 300 minutes of engineering time. It was repetitive, tedious, and error-prone. The team built a fully automated workflow (sketched after this list). The process now includes:
  • Running smoke tests on the first cluster.
  • Automatically progressing to the next cluster on success.
  • Automatically rolling back on failure.
    Upgrade time dropped from 300 minutes to under 15 minutes, and reliability skyrocketed.
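
The talk doesn't specify which tool drives this workflow, so the following is a purely hypothetical, tool-agnostic description of its shape; every field name here is invented for illustration.

```yaml
# Hypothetical rollout plan (not a real tool's schema) mirroring the flow above:
# canary cluster first, smoke tests, automatic promotion, automatic rollback.
addon: metrics-server
targetVersion: v0.7.1
waves:
  - name: canary
    clusters: [dev-cluster-01]        # smoke tests run here first
    postUpgradeChecks: [addon-smoke-tests]
  - name: fleet
    clusters: [all-remaining]         # promoted automatically on success
    postUpgradeChecks: [addon-smoke-tests]
onCheckFailure: rollback-to-previous-release
```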

  • Problem : Bad Configs Blocked Node Upgrades
    Node upgrades were constantly blocked because service teams had configured their PDBs improperly. This required manual intervention from the platform team to fix.

  • Standardized Helm Charts: PDB settings were baked into the common Helm charts that all services use.

  • Kyverno Policies: The team used Kyverno, a policy engine for Kubernetes, to automatically block any deployment with improper PDB settings at the API server level, so bad configs can no longer enter the cluster (a sketch follows).
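
As a rough sketch of what such a guardrail can look like, here is a Kyverno ClusterPolicy that rejects PodDisruptionBudgets that would block all voluntary evictions. The exact rule SIE enforces isn't public, so the specific condition below is an assumption.

```yaml
# Illustrative Kyverno policy: reject PDBs that forbid any voluntary disruption,
# since maxUnavailable: 0 blocks node drains during upgrades.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-evictable-pdb
spec:
  validationFailureAction: Enforce      # reject the object at admission time
  rules:
    - name: disallow-zero-disruption
      match:
        any:
          - resources:
              kinds:
                - PodDisruptionBudget
      validate:
        message: "PDBs must allow at least one voluntary disruption; maxUnavailable: 0 blocks node upgrades."
        deny:
          conditions:
            any:
              - key: "{{ request.object.spec.maxUnavailable || `1` }}"
                operator: Equals
                value: 0
```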

3 : People, Process, and Time Zones
  • Problem : Burnout and Rising Operational Load
    As the platform scaled, the on-call burden on the team grew, threatening work-life balance. An alert at 3 AM in California is a problem.
    A Follow-the-Sun Global Team
    SIE built a global team with engineers in the US, Japan, and India. This "follow-the-sun" model provides 24/7 coverage, with clean handoffs between regions. When an incident occurs, the on-call engineer for that time zone handles it, preventing any single person from being awake all night.

  • Problem : Knowledge Gaps and Communication Delays
    Working across time zones created information silos and slowed down decision-making. A question asked in Tokyo might not get an answer from California for 12 hours.
    A Culture of Documentation and Shared Knowledge

  • Knowledge Sharing Sessions: Regular sessions to keep everyone in sync.

  • Documentation: Key decisions and architectures are formally documented.

  • Incident Management Process: A clear, three-phase process (Before, During, After) ensures that every incident is a learning opportunity, with action items tracked to continuously improve the platform.

Key Takeaways for Engineering Teams

SIE's journey offers a powerful blueprint for running platforms at scale:

  • Culture is Everything : Success started with a culture that values data, tracks metrics, and acts on them. Inclusive leadership made the global team model work.
  • Embrace the Ecosystem : Kubernetes and its rich open-source ecosystem (Karpenter, Kyverno, descheduler) provided the building blocks, so the team didn't have to reinvent the wheel.
  • Automate Ruthlessly : Automation isn't a luxury; it's a necessity for reliability and freeing up engineers to solve bigger problems.
  • Master the Basics: Ultimately, 99.995% uptime comes from relentless focus on solving foundational problems with simple, robust, and well-understood solutions.

For further information: https://www.youtube.com/watch?v=xxSPRdwjuqE
