Kazuya

AWS re:Invent 2025 - Zero-Downtime at Scale: Migrating Peacock's Global Streaming to EKS (IND3325)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Zero-Downtime at Scale: Migrating Peacock's Global Streaming to EKS (IND3325)

In this video, NBCUniversal and Sky's platform engineering team shares the journey of migrating Peacock's global streaming infrastructure—supporting over 40 million customers and 1,600 applications—from self-managed Kubernetes to Amazon EKS with zero downtime. The team built a six-stage migration process using Velero, Route 53, and extensive automation, completing the entire migration in four months after six months of preparation. Key achievements include reducing BAU toil from 30% to 10%, making upgrades six times faster, and eliminating 50,000 lines of code. The migration enabled the team to handle major live events like NFL games that consumed 30% of US bandwidth, while maintaining their multi-tenant architecture across the Peacock, Sky Showtime, Showmax, and NOW brands globally.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Zero Downtime Migration of Peacock's Global Streaming to Amazon EKS

Good morning and welcome to IND 3325, Zero Downtime at Scale: Migrating Peacock's Global Streaming to Amazon EKS. My name is Ian Coleshill. I'm an AWS Principal Solutions Architect. Joining me here today from NBCUniversal Sky are Mansoor Fazil, Director of Global Platform Engineering, and Peter Symonds-Bale, Head of Platform Infrastructure. Also with me is my colleague from AWS, Manish Joshi, Senior Technical Account Manager.

If you've worked in technology for maybe a few weeks, a few months, or a few years, I think you'll realize that migrating anything at scale, let alone Kubernetes clusters at scale, to Amazon EKS doesn't come without its challenges. After a short reel, I'll hand you over to Mansoor to talk to you about how Global Streaming was able to achieve this.

Thumbnail 60

Thumbnail 70

Thumbnail 80

The NBA is streaming on Peacock. Games on Peacock NBA Monday, coast to coast Tuesday, and one epic 2026. Plus all new ways to see more than the score. From the court to the culture, NBA is on Peacock.

Thumbnail 90

Global Streaming Technology: A Multi-Tenant Platform Spanning Four Propositions

Thank you everyone. Thank you, Ian. First of all, great to be here in Vegas. I want to welcome everyone, especially on behalf of AWS, but also Pete and I from NBCUniversal and Sky. I'm going to give you all just a very high level view about what we call Global Streaming Technology, which is essentially an engineering team that spans across effectively two companies, so that's NBCUniversal and Sky.

Just to give you a little bit more context, currently we have over 40 million customers for Peacock, which hopefully everyone's aware of. That's in the US, but we also have other propositions in Europe and in Africa, which I'll talk about in a second. We consider ourselves, obviously Prime might disagree with us, but we consider ourselves one of the leaders in live sports here in the US but also across Europe. Hopefully most of you sports fans will know we've got major deals with the NBA, NFL, Premier League, and recently announced the MLB.

One of our biggest events we had, I think it was a year and a half ago, was a wildcard game where we accounted for over 30% of the US bandwidth, which is a huge achievement. A little bit more about Global Streaming Technology in terms of where we're based. We've got engineers based here in the US, in the UK, also in Czech Republic, Portugal, and India. So we're a truly global team.

Thumbnail 180

I want to talk a little bit about our infrastructure and our stacks. Multi-tenancy is absolutely core to the way we work, and that's not just from an infrastructure perspective but also from a front-end one. If you look at the diagram on the right, we cover four propositions using a single code base, both from a front-end perspective and from an infrastructure perspective, which Pete will go through in a second.

From left to right, we have Peacock here in the US. We have Sky Showtime, which is a joint venture between ourselves and Paramount, mostly in Europe. We've got Showmax, which is another joint venture between us and MultiChoice in Africa. Then finally we have the NOW brand, which is a streaming service in Italy, Germany, UK, Austria, and the Republic of Ireland. As you can see, we have one single code base from the front end, as I said, but also the key for us is a homogeneous infrastructure layer that again covers all of those different propositions, whatever regions we are in around the world.

Thumbnail 250

For us, because of scale, we really want to maintain consistency across all of our development teams, and that's really important for us. The main reason for that is how fast we want to move. As I said, we have one central platform team which Pete to my right manages, which we'll talk through in a second.

The Challenge: Reducing BAU Toil While Supporting Thousands of Namespaces

Now the problem statement and what we're here to talk about. If you look at the platform engineering team, right now we spend around 70% of our time on what we consider development tasks and around 30% on what is BAU toil. An example could be a Kubernetes upgrade or a security patch, the ones we all love to do. What we want to do is obviously bring that down. If you look to the right, because it's such an uber multi-tenant Kubernetes cluster, we have hundreds of different development teams across multiple time zones.

If you look at the graph, the number of deployments we do day to day is only just growing. The key for us is how do we free up more engineering time for Pete and his team from a development perspective, but at the same time, what we cannot do is disrupt our development teams, especially when we're talking about thousands and thousands of different namespaces and components that are running absolutely critical services for those propositions I mentioned.

So I'm now going to hand you over to Pete, who's going to talk about the migration strategy and the planning and how we got there.

Thumbnail 320

Thumbnail 340

Four Core Principles of Platform Engineering at Global Streaming

Perfect, thanks, Mans. So platform engineering is the cornerstone of infrastructure for Global Streaming, and we really focus on taking away the work that engineering teams ultimately don't need to do or, in many cases, don't really want to do. I look after the platform engineering department across Global Streaming, and what we're really focused on is doing that heavy lifting for the engineering teams.

Thumbnail 350

So what I now want to do is focus on four key principles that we have as a department and link that to how we approach the EKS migration. The first principle is engineers are our customers. So what we really try to do, and this is really one of the key methodologies of platform engineering, is we want to internally market ourselves. Engineering teams should actually want to use our products. It shouldn't just be a case of they have to use our products because they happen to be plugged into our platform.

The second principle is protect the customer interfaces. So we have an obligation to limit the number of changes that we make on our platform, because we can't just suddenly make a change to an important interface that hundreds, if not thousands, of developers have plugged into. We can't suddenly make a change to that purely because there's a technical evolution. We should be focusing on moving slowly in terms of interfaces but rapidly advancing the technology that we're using under the hood.

Third, deliver capabilities, not technologies. So this is really doubling down on that aspect. We don't simply want to deliver a technology that may be the latest, greatest tech that might come out of the industry. We want to really understand how our customer, and when I say customer, in my case that means development teams, are going to use those products. So we want to work backwards from how we think they're going to use it, and then we leverage the technology under the hood to offer that.

And finally, reliable by design. So as Mans has already said, we support market-leading streaming services globally, so even a few seconds of downtime just isn't acceptable. So taking those four key principles in, I now want to explain to you how we approached the EKS migration project.

Thumbnail 470

From Early Kubernetes Adoption to the EKS Migration Decision

So before I delve into it, you might be thinking, why are we talking about an EKS migration in 2025? And I'm going to give you a bit of history as to why that is, but before I jump into that, when we were first looking at EKS, we could have really taken two traditional paths. We could have either said, look, this is a significant change, we're going to build a brand new version of our platform and we're then going to ask all of those development teams to shift over. That's not really doing the heavy lifting. Alternatively, we could have said, you know what, our platform's working, let's just leave it as is, and let's hope everything continues to work.

Thumbnail 520

We were really tasked to do something in the middle, which was quite daunting for us as a team. It was pretty scary, but I'm going to run through how we approached that and some of the lessons learned that we had. So just to give you a bit of history, we started using Kubernetes right at the inception, so this is pre-V1 Kubernetes, so back in 2014, 2015. We as a department, and we were quite a small team at this point, we were exposed to a lot of bugs, a lot of issues in Kubernetes that we were really contributing back to the community and Kubernetes codebase, and we learned fast.

But one of the key takeaways that we took from that really early adoption of Kubernetes was the controller pattern and understanding the value of eventual consistency and being able to declaratively define what state you want and have controllers that will continually validate that state. So I could go in, I could change the environment that I'm deploying to, and we'd have controllers that would then go and reset that. It's quite different to things like Terraform, where you would do a one shot. This is something that we continually want to validate. So that's something that was really embedded early on in our development of Kubernetes, but we were given quite an extensive period of many years of being able to really refine and hone in our understanding of Kubernetes.
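To make the controller pattern concrete, here is a minimal, hypothetical sketch using the official Kubernetes Python client: declare a desired state and keep reconciling the live cluster back towards it whenever anything drifts. The namespace, deployment name, and replica count are placeholders, not anything from the actual platform.

```python
# Minimal sketch of the controller pattern: declare desired state, then
# continuously reconcile reality back towards it. Names are placeholders.
from kubernetes import client, config, watch

DESIRED_REPLICAS = 3           # the declared state
NAMESPACE = "payments"         # hypothetical namespace
DEPLOYMENT = "checkout-api"    # hypothetical deployment

def reconcile(apps: client.AppsV1Api) -> None:
    """Compare observed state with desired state and correct any drift."""
    scale = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE)
    if scale.spec.replicas != DESIRED_REPLICAS:
        apps.patch_namespaced_deployment_scale(
            DEPLOYMENT, NAMESPACE,
            {"spec": {"replicas": DESIRED_REPLICAS}},
        )

def main() -> None:
    config.load_kube_config()          # or load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    reconcile(apps)                    # converge once at startup
    # Re-run the reconcile on every change event, so manual edits to the
    # environment are continually reset: eventual consistency, not one-shot.
    w = watch.Watch()
    for _ in w.stream(apps.list_namespaced_deployment, namespace=NAMESPACE):
        reconcile(apps)

if __name__ == "__main__":
    main()
```

This is the essence of the difference Pete describes versus a one-shot tool: the loop never stops validating the declared state.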

Thumbnail 590

EKS was then announced in 2017 and launched in 2018, but there was an issue for us. We would have wanted to immediately pivot to EKS at that point. However, Peacock came along and kind of changed the game a little bit for us. Our scale shifted from a small user base within things like the NOW proposition and Sky Go over to Peacock in the US. We were increasing orders of magnitude in the number of customers, the number of developers, and the amount of focus that we had on us as a platform engineering team.

Thumbnail 620

So over the succeeding few years, we then really tried as much as we could to focus on the EKS development, and thankfully at the end of 2024, we actually launched our first EKS-based streaming platform, which was Showmax, which is one of the propositions that Mansoor spoke about earlier.

Thumbnail 660

Thumbnail 690

But what we then needed to do was look backwards and ask: how do we now take our platform, which we've honed and built over the last 10 years and which a lot of significant streaming platforms are now integrated with, and move it to EKS? So just to give you a bit more detail as to what happened for us throughout that same time period: you can see on this graph that we had a few years of fairly comfortable growth. It was great; we were able to really lean into our testing and our automation, we were able to really hone in on ensuring that our Kubernetes platform was reliable and stable, and I think we really fulfilled on that. But you can see that just after that EKS platform was released, we then had a surge in usage on our platform. So, how do we approach this?

Migration Objectives: Moving 1,600 Apps with Zero Downtime and No Developer Action

This is essentially what we ended up with, so this is what we were tasked to migrate at a really high level. I'm sure anyone who's done Kubernetes the hard way will have seen a similar pattern as this. You'd use things like Terraform, Ansible, built on top of the AWS API, EC2, NLBs, ELBs, all that other good stuff. So I think we had a pretty mature platform, but it wasn't managed. This was very much self-managed, doing it the hard way.

Thumbnail 730

We also had a number of engineering teams and a huge amount of business units. So we had over 1,000 developers consuming our platform. We had hundreds of business units. These are all disparate business units, different organizations, different ways of working, different management structures, so there's not one single individual or strategy that we could say we're going to now suddenly shift over to EKS. This is a significantly diverse set of individuals and teams that were using our platform.

Thumbnail 760

Thumbnail 770

Thumbnail 780

But the key part is 1,600 apps. This is really what we wanted to drill into, because this is from the tech perspective. Let's ignore the business side, let's ignore the team side. We just needed to move 1,600 apps from doing it the hard way Kubernetes to EKS. So let's take the simplistic approach. Going back to that initial, you've got the two options. You either build fresh and then ask everyone to move, or you leave the status quo.

We were never going to leave the status quo. We knew we needed to evolve the platform. So let's maybe hone in on that first option. If we were to build a new platform based on EKS and we took, let's say, an average of 2 weeks per app (some apps would probably take significantly longer, a few probably less, but let's just assume it's 2 weeks), that would give you 61 full-time engineers working on that for a whole year. That's a significant investment, and when we really take a step back, for the engineering teams, they don't really care, right?

It's really the platform engineering team that wants to move to EKS. It's less the development teams. They really just want to focus on developing their apps, building their apps, releasing new features. They don't want to be focusing on moving to EKS, going back to what Mansoor was talking about on that left-hand side. It was my team that was feeling the pinch of self-managed. We saw the increasing level of BAU, but we were thinking about shifting that responsibility onto development teams. This isn't doing the heavy lifting. So let's remove that, that's not really an option for us.
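For what it's worth, the 61-engineer figure above is just straightforward arithmetic under the stated 2-weeks-per-app assumption:

$$
\frac{1600\ \text{apps} \times 2\ \text{weeks/app}}{52\ \text{weeks per engineer-year}} \approx 61.5\ \text{engineer-years}
$$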

Thumbnail 860

So these were the five objectives that we laid out, and this links back to those four principles that we have as a department. The first objective, no action for development teams. We offer interfaces, and it's a pretty extensible interface and it's a mature interface, in this case kubectl and the Kubernetes API. That doesn't need to change. That's how they deploy their app. We should keep that consistent. Two, migration must be done live. We don't have periods where we can just simply switch off our streaming platform. People always want to stream their favorite shows and live events. Three, zero downtime. So as with our principles, even a few seconds downtime just isn't acceptable.

Fourth, less than 12 months to complete. So this was really something that was largely driven by us as a platform engineering team. We weren't really given a timeline, because ultimately, from Mansoor's perspective, he was happy enough with self-managed Kubernetes, even if he wasn't massively happy with the increasing burden of operational overhead. But we as a team felt that we were being held back in being able to evolve our platform. We were hearing about Karpenter and all of these additional add-ons and Auto Mode and all these features that we couldn't utilize, and we were seeing the industry constantly, if not rapidly, moving forward, and we weren't able to leverage that and offer that to the business. So we internally said 12 months, full stop, let's move to EKS and let's make it happen.

And lastly, and this is really the gold standard for us, teams shouldn't even know it's happened. So this is the gold standard. We want to revolutionize our technology whilst keeping our capability consistent. We want to be in and out without anyone even knowing. So I'm now going to hand over to Ian, who's going to talk about the partnership that we had with AWS to make this happen.

Thumbnail 990

AWS Partnership: Validation, Tooling, and Enterprise Support for the Migration

Brilliant. Yeah, thanks, mate. Okay, so as an AWS solutions architect, it's really my responsibility to help the customer with complex migration planning. And I think over the last eight years we've really been able to build the confidence that we can help the customer achieve their outcomes. So every day we talk about Kubernetes, every day we're talking about AWS services and features. But for this it was a little bit different. We needed to talk about the tooling as well as the services that would help with the migration process. So we were able to do this, but we were also able to talk about proven architectural patterns that would help the customer, drawing on how customers right across a range of industries have approached large-scale migrations.

This was really good, so we were then able to bring some of our solutions architect specialists and some of our service teams in to validate the approach and the migration plan, and I think that was really useful. So it's all of the right experts at the right time that were really able to help the customer. And actually, right from the delivery of Peacock, we'd always worked with the AWS Well-Architected Framework, and that had been really useful, not just as a simple checklist but as a way for the customer to validate the existing architecture and to really look into the future, look around the corner a little bit, and look at scale, redundancy, resiliency, reliability, and of course cost, which is a big part of this.

Thumbnail 1100

But having done all of this, we were able to help the customer to just validate the approach, to feel comfortable that they were able to migrate. So when they were actually ready to make the decision to migrate to Amazon EKS, I mean, we felt like we had all the relationships across the business. We felt that we had the relationships with AWS. So like when they were, you know, good to go, we were ready to really step up and help the migration.

So then if we move into the migration phase, we've really started to double down on the tooling and also the AWS services. And then we spent a long time really just completely validating the approach, and actually Pete and his team built out a six stage migration process. And I think what was really great about this is actually they ensured that every single step kind of like moved it forward and validated and tested everything they needed to do. But if you've worked in technology for a long time, you realize that actually being able to step backwards and restore operational state is absolutely critical. So it's great when things go well, but also you want to plan to restore operational state when they don't.

So we really started to double down on some of the key services: Velero with Amazon S3 for handling the Kubernetes backup and restore, Amazon EKS for now managing the Kubernetes clusters, Kafka for processing messages, and Route 53 for DNS. But I think what was really critical was observability: for the customer to observe the existing Kubernetes infrastructure, but also to observe the transition to Amazon EKS, just to ensure that, as workloads were migrating across, customers were having the experience that they should be. And actually, we'll come onto this later in the presentation: Manish will cover how AWS Enterprise Support underpinned the migration.

Thumbnail 1200

The Six-Stage Migration Process: From Pre-Flight Validation to Pipeline Re-Enablement

But now I'll hand you back to Pete to discuss how Global Streaming went through the migration process. Thanks, mate. Thanks, Ian. So I'm now going to go into a bit more detail as to how we did the migration.

Thumbnail 1210

So this is a slightly more detailed pictorial view of that Ansible and Terraform cluster that we started with. So again, it probably looks similar to a lot of customers who are using self-managed or even EKS. In this diagram we've got our different integration points into Kubernetes: our CI/CD tooling, a developer using kubectl, all of our CDNs, and any third parties that are integrating into the platform. But the important part is that we control that route in. That's key for how we did this migration.

Thumbnail 1260

Thumbnail 1270

Also key was the fact that we had quite an opinionated platform and how applications deployed into the cluster. Again, that was a really key part in having this be a possibility for how we did this migration. So the first step for us was fairly self-explanatory, so spin up that new EKS cluster. This is in the same VPC, same egress NAT, so that any traffic coming out of this cluster looks and presents as if it's the same self-managed cluster, and that sort of simplifies things like firewalling and fun stuff like that. Create a new Route 53 zone, and this is so that all of the applications can still present themselves as the same layer 7 route into the cluster. But the important part here is we didn't want to delegate that zone via the parent zone.
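As a rough illustration of that step, creating the new, deliberately undelegated Route 53 zone might look something like the following boto3 sketch. The domain name is a placeholder, and the important part is what is not done: the returned name servers are never published in the parent zone at this stage.

```python
# Hypothetical sketch: create the "shadow" Route 53 zone for the EKS cluster
# without delegating it from the parent zone, so no real traffic resolves to it.
import boto3

route53 = boto3.client("route53")

ZONE_NAME = "apps.streaming.example.com."   # placeholder domain

resp = route53.create_hosted_zone(
    Name=ZONE_NAME,
    CallerReference="eks-migration-shadow-zone-001",  # must be unique per request
    HostedZoneConfig={
        "Comment": "Shadow zone for EKS migration; NOT delegated from parent yet",
        "PrivateZone": False,
    },
)

# Route 53 hands back the NS set for the new zone. The crucial step is what we
# do NOT do here: these NS records are never added to the parent zone, so
# clients keep resolving against the existing self-managed cluster for now.
new_zone_nameservers = resp["DelegationSet"]["NameServers"]
print("Hold these until delegation day:", new_zone_nameservers)
```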

Thumbnail 1310

Step two, and as I touched on earlier, this was something that we worked closely with AWS on, because we really wanted to know: surely other customers have done a similar migration, because it can't be that unheard of, moving from do-it-yourself Kubernetes to EKS. And Velero was the de facto choice for doing this. But again, it's a technology, not really a capability. We needed to build a lot of automation around it. But the first phase was to really shift across those core services, so that's things like CoreDNS, ingress, and monitoring. I wish we had EKS add-ons for this, but we had to do a lot of hand cranking here.
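Because Velero backups are themselves Kubernetes custom resources (velero.io/v1 Backup), that kind of automation can be driven from the same Kubernetes client. The sketch below is a simplified, hypothetical example of backing up a first wave of core-service namespaces; the namespace names and kubeconfig context are placeholders, and the real tooling wrapped far more automation around this.

```python
# Rough sketch of the Velero side: a Backup is just a velero.io/v1 custom
# resource, so the same Kubernetes client can drive it.
from kubernetes import client, config

config.load_kube_config(context="self-managed")   # hypothetical kubeconfig context
api = client.CustomObjectsApi()

backup = {
    "apiVersion": "velero.io/v1",
    "kind": "Backup",
    "metadata": {"name": "core-services-backup", "namespace": "velero"},
    "spec": {
        # First wave: platform/core services only (CoreDNS, ingress, monitoring).
        "includedNamespaces": ["kube-system", "ingress", "monitoring"],
        "storageLocation": "default",   # backed by the S3 bucket Velero is configured with
        "ttl": "72h0m0s",
    },
}

api.create_namespaced_custom_object(
    group="velero.io",
    version="v1",
    namespace="velero",
    plural="backups",
    body=backup,
)
```

The matching Restore resource would then be created against the new EKS cluster once the backup completes.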

And then the next stage was to shift over the application workloads, and this is where we started to step a little bit further into the application team space. Previously we would have always said anything at the namespace level is for the development teams to manage. We as a platform engineering team, that is not our responsibility, that is the development team's responsibility. But this is where we really needed to question that for this migration. At this point, we stepped into the application space. We had to do a lot of understanding in that space. We needed to understand every application. I'll go into a bit of detail later on on some of the more complex applications we needed to migrate. But we needed to really understand that and then shift it across, because we are suddenly, as part of this migration, taking on a lot of accountability in terms of the reliability and availability of those services.

Thumbnail 1400

And then step four is delegate DNS. So this migration that we performed was a big bang migration. It was something that we had to do full stack. We could not do service by service because a lot of our architecture is based on things like Kubernetes service discovery and service routing. So it means we couldn't just shift over one app because any downstream apps may still not be up in that new cluster. So we needed to see the full stack in that cluster move over.

Thumbnail 1430

And then lastly, and probably the easy bit and the very satisfying bit for sure, was to get rid of the old cluster. These are clusters dating back to 2014, which really was testament to the team that we were able to continually upgrade, patch and actually take tech from 2014 and make it relevant for 2023, 2024, just when we were doing this. But yeah, it was sad to see some of them go, but ultimately it was better for the future and we're able to leverage a lot more out of EKS now.

Thumbnail 1490

So I'm going to go into that six stage migration process that Ian touched on, because what was key for us is to really lean on automation and controllers to do this migration. We could not rely on humans sitting there manually moving apps from one side to the other at the scale we were running at. It just wouldn't be safe and it also wouldn't take us 12 months. So the first step was pre-flight validation, so this is really where we look at things like Route 53 zone TTLs. We wanted to massively reduce those TTLs to give us more flexibility in being able to flick over to the new region or new cluster, and then if anything does go wrong, we can pull it back.

Thumbnail 1510

But this was something we did weeks in advance to ensure that all clients had that new, fresh TTL set. Next was building the EKS cluster. Alongside all of the EKS clusters, we also deployed a dedicated end-to-end test suite that we developed as part of this migration. I'll go into a bit of detail in a moment on that. We also set up all of the Velero mechanics, including the integration with S3 and all of the other components that we needed to configure in both the self-managed clusters and the EKS clusters.
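As a concrete illustration of the pre-flight TTL step described above, a hedged boto3 sketch might look like this; the hosted zone ID and record name are placeholders.

```python
# Hypothetical pre-flight step: drop the TTL on the records clients resolve,
# weeks ahead of cutover, so a rollback propagates in seconds rather than hours.
import boto3

route53 = boto3.client("route53")

PARENT_ZONE_ID = "Z0PLACEHOLDER"                    # placeholder hosted zone ID
RECORD_NAME = "api.apps.streaming.example.com."     # placeholder record
LOW_TTL = 60                                        # seconds

record_sets = route53.list_resource_record_sets(
    HostedZoneId=PARENT_ZONE_ID,
    StartRecordName=RECORD_NAME,
    MaxItems="1",
)["ResourceRecordSets"]

for rrset in record_sets:
    if rrset.get("TTL", LOW_TTL) <= LOW_TTL:
        continue  # alias records have no TTL; others may already be short enough
    rrset["TTL"] = LOW_TTL
    route53.change_resource_record_sets(
        HostedZoneId=PARENT_ZONE_ID,
        ChangeBatch={
            "Comment": "Pre-flight: shorten TTL ahead of EKS cutover",
            "Changes": [{"Action": "UPSERT", "ResourceRecordSet": rrset}],
        },
    )
```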

Thumbnail 1540

At this point, we're now ready to start taking a snapshot using Velero of that self-managed cluster, but we obviously need to stop any incoming changes into the cluster. Thankfully, because we have a fairly opinionated platform, we were able to centrally disable pipelines across all of our development teams to essentially stop them from making any changes that would then allow us to take that snapshot into Velero and start the migration. This is really the first time that developers might have been aware of something going on, but you'll be surprised as to the amount of messages we actually got. It was fairly minimal, and people weren't really aware that this was happening.

Thumbnail 1580

Now we've got the big stuff. This is where it's really starting to kick off. This is where we did that workload migration, and we decided to do a 50/50 scale up. This allowed us to scale down our self-managed cluster and up on EKS, partly for cost, but also we wanted to ensure that there weren't certain issues that may arise by having two large-scaled applications talking to some of the persistence layers. Again, I'll talk about that in a second.

Thumbnail 1610

Thumbnail 1620

Stage 5 is really where we start to see traffic moving over to the new clusters. This is where we do the zone delegation, and then we bring that EKS cluster fully up to 100% and we bring the self-managed clusters down. Lastly, we re-enable the pipelines. This staged process, really from stages 3 to 6, which is where development teams may be aware that something is happening, happens in a matter of hours; these are single-day activities, done so that we can move quite quickly. We want to be able to take snapshots, but we don't want to be blocking teams from making development changes, because, as Mans said at the beginning, there is a significant number of deployments and changes continually going into production. We couldn't afford to take down the environments for too long.
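The zone delegation in stage 5 boils down to publishing the new zone's NS records in the parent zone, and rollback is the same call pointing back at the old name servers. A hypothetical boto3 sketch, with placeholder IDs and names:

```python
# Hypothetical stage 5: delegate the new zone by publishing its NS records in
# the parent zone. Rollback is the same UPSERT pointing back at the old set.
import boto3

route53 = boto3.client("route53")

PARENT_ZONE_ID = "Z0PARENTPLACEHOLDER"              # placeholder parent zone ID
CHILD_ZONE_NAME = "apps.streaming.example.com."     # placeholder child zone
EKS_ZONE_NAMESERVERS = [                            # captured when the shadow zone was created
    "ns-0001.awsdns-00.com.",
    "ns-0002.awsdns-01.net.",
]

route53.change_resource_record_sets(
    HostedZoneId=PARENT_ZONE_ID,
    ChangeBatch={
        "Comment": "Delegate traffic to the EKS cluster's zone",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": CHILD_ZONE_NAME,
                "Type": "NS",
                "TTL": 60,  # kept low by the pre-flight stage for fast rollback
                "ResourceRecords": [{"Value": ns} for ns in EKS_ZONE_NAMESERVERS],
            },
        }],
    },
)
```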

Thumbnail 1670

Critical Success Factors: Comprehensive Testing and Workload Migration Automation

I'm now going to drill into two key areas that I think were absolutely critical for the success of this project. The first is our test suite. This is really the makeup of the different aspects of testing that we had across our migration, and some of these test suites remain in the environments even today. I'm going to step through a few of them just to give you a sense of what we were focusing on and some of the key performance indicators that we wanted to drill into as part of the migration.

Thumbnail 1690

The first is around synthetic load testing. We want to continually know what is the latency, what is the availability at P99, P99.5 across the entire platform at load. This isn't a functional test. This is continually injecting load into our platform so that we have very reliable data to tell us if there's something wrong on the client. But also, you'll notice this is a service that we own as a platform engineering team, so we also have reliable data on the server side. We manage the entire journey from request all the way down to response.
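A minimal sketch of what such a synthetic probe could look like, assuming a placeholder endpoint and using plain requests plus the standard library to report availability and P99/P99.5 latency (the real system is, of course, far more sophisticated):

```python
# Sketch of a synthetic probe: continually inject requests and report
# availability plus P99 / P99.5 latency. Endpoint and volumes are placeholders.
import time
import statistics
import requests

ENDPOINT = "https://api.apps.streaming.example.com/healthz"   # placeholder
SAMPLES_PER_WINDOW = 1000

def run_window() -> None:
    latencies, failures = [], 0
    for _ in range(SAMPLES_PER_WINDOW):
        start = time.monotonic()
        try:
            resp = requests.get(ENDPOINT, timeout=2)
            if resp.status_code >= 500:
                failures += 1
        except requests.RequestException:
            failures += 1
        latencies.append(time.monotonic() - start)

    quantiles = statistics.quantiles(latencies, n=1000)  # 0.1% granularity
    print(f"availability: {1 - failures / SAMPLES_PER_WINDOW:.4%}")
    print(f"p99:   {quantiles[989] * 1000:.1f} ms")   # 99.0th percentile
    print(f"p99.5: {quantiles[994] * 1000:.1f} ms")   # 99.5th percentile

while True:
    run_window()
```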

Thumbnail 1730

Second is zone validation. I already touched on earlier around the fact that we have new zones that are created for EKS, but they're not delegated, so we've taken on the accountability to make sure that those services are running. But dev teams can't access them. They don't know how to access these new pods that are suddenly running in EKS. We had to take ownership of functionally testing those applications. We weren't doing full-scale load tests. I don't think they'd be particularly happy about that, but we were doing very low-level functional tests to ensure that things like all of their pods were up, that they were responding to health checks, that they were having a certain level of latency, that we were getting the right percentage of 2xx, 3xx, 4xx, 5xx responses.

We were doing continual comparison between our self-managed clusters and our EKS clusters. We were able to look at those metrics and essentially look at deltas between the two clusters. Even if an app has a significant number of 204s or 404s because of the way the application is configured, or really the nature of the app, we would be able to compare and validate that there is no difference between self-managed and EKS.
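Since VictoriaMetrics speaks the Prometheus query API, the delta comparison described above could be sketched roughly as follows; the metric name, label names, and URL are placeholders rather than the team's actual series.

```python
# Hedged sketch of the delta comparison: query the (Prometheus-compatible)
# VictoriaMetrics API for per-status-class request rates in each cluster and
# flag any class whose share drifts. Metric and label names are placeholders.
import requests

VM_URL = "https://victoria-metrics.example.com/api/v1/query"   # placeholder
QUERY = 'sum by (status_class) (rate(http_requests_total{cluster="%s"}[5m]))'

def status_shares(cluster: str) -> dict[str, float]:
    resp = requests.get(VM_URL, params={"query": QUERY % cluster}, timeout=10)
    results = resp.json()["data"]["result"]
    totals = {r["metric"]["status_class"]: float(r["value"][1]) for r in results}
    grand_total = sum(totals.values()) or 1.0
    return {cls: rate / grand_total for cls, rate in totals.items()}

old = status_shares("self-managed")
new = status_shares("eks")

for cls in sorted(set(old) | set(new)):
    delta = new.get(cls, 0.0) - old.get(cls, 0.0)
    flag = "  <-- investigate" if abs(delta) > 0.01 else ""  # >1 point drift
    print(f"{cls}: self-managed={old.get(cls, 0):.3f} eks={new.get(cls, 0):.3f} "
          f"delta={delta:+.3f}{flag}")
```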

Thumbnail 1810

Thumbnail 1820

Next are node- and cluster-level checks. We want to make sure that all of our nodes are healthy. We have a lot of services that run on them, and we want to be able to run node- and cluster-level checks continually. Lastly, and this was something that I think was probably the scariest for us, because it's something that we couldn't really manage, is firewalls that we don't own. We felt that the most likely cause of an incident or an issue with this migration would not be our firewalls, because we manage those and can essentially do all the due diligence around them. But if any other team has a firewall that maybe only allows a certain /26 or /28 that we potentially weren't aware of, we can't validate everyone's firewall.

So rather than audit those firewalls and reach out to all of those teams (we did send the comms out), we chose to lean more on VPC Flow Logs and CloudWatch. We would look for any anomalies or spikes in dropped connections, or anything essentially changing between what we expected on our self-managed clusters versus what we saw on EKS. This was really key, and we did pick up quite a few issues here where we were seeing significant numbers of packets being dropped coming out of our platform, which indicated a firewall issue. Underpinning all of this is VictoriaMetrics. At the scale that we run at, we could not simply rely on something like Prometheus or even really Thanos. We needed something more distributed that could scale to the size that we needed. So this was absolutely critical in us being able to do this type of analysis.
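A hedged sketch of the flow-log side of that analysis, assuming the VPC Flow Logs are delivered to a CloudWatch Logs group (the group name is a placeholder) and queried with Logs Insights for rejected connections:

```python
# Hedged sketch: use CloudWatch Logs Insights over the VPC Flow Logs log group
# to spot spikes in rejected connections during the migration window.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT"
| stats count(*) as rejects by dstAddr, dstPort
| sort rejects desc
| limit 20
"""

end = int(time.time())
start = end - 3600  # last hour

query_id = logs.start_query(
    logGroupName="/vpc/flow-logs/streaming-prod",   # placeholder log group name
    startTime=start,
    endTime=end,
    queryString=QUERY,
)["queryId"]

# Poll until the query completes, then print the top rejected destinations.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in result.get("results", []):
    print({f["field"]: f["value"] for f in row})
```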

Thumbnail 1910

Thumbnail 1920

Thumbnail 1930

Thumbnail 1960

So the next area I want to zoom into is our workload migration. I touched on it a moment ago around some of the more complex workloads. The majority of our applications are stateless, so pretty easy to migrate. There were some issues, but generally speaking, we were moving stateless workloads in A to stateless workloads in B. Where it became a little bit more difficult is the example that you can see here. We have a number of services that use things like Kafka for queuing, with some level of producer and consumer architecture. The problem we had here was that a lot of the Kafka clusters that we configure have a set number of producers and consumers defined on the cluster itself. So if we as a platform team suddenly doubled the number of consumers and producers, that could significantly affect the performance of Kafka, and we could essentially degrade that Kafka cluster.

So what we needed to do was take a more fine-tuned approach to things like Kafka. You can see in that diagram I was showing a second ago that we had to shift over by a delta of plus or minus one. So where we took a consumer and producer pair, we increased on EKS and reduced on the self-managed cluster. Again, this isn't something that we did as a one-off. We had to build automation to do this for a number of use cases across the board. But it was something that we worked really closely on with a lot of our Kafka experts across Global Streaming. These were definitely among the more complex issues that we hit, but thankfully we were able to get through them.
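A simplified, hypothetical sketch of that plus-or-minus-one shift, using the Kubernetes Python client against two kubeconfig contexts; the deployment, namespace, and context names are placeholders, and the real automation also verified consumer-group health rather than simply sleeping between steps.

```python
# Rough sketch of the plus/minus-one shift for Kafka consumers: move one
# replica at a time from the self-managed cluster to EKS so the total number
# of consumers in the group never deviates by more than one.
import time
from kubernetes import client, config

NAMESPACE = "events"                # hypothetical namespace
DEPLOYMENT = "playback-consumer"    # hypothetical consumer deployment

old_apps = client.AppsV1Api(config.new_client_from_config(context="self-managed"))
new_apps = client.AppsV1Api(config.new_client_from_config(context="eks"))

def replicas(api: client.AppsV1Api) -> int:
    return api.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE).spec.replicas

def set_replicas(api: client.AppsV1Api, count: int) -> None:
    api.patch_namespaced_deployment_scale(
        DEPLOYMENT, NAMESPACE, {"spec": {"replicas": count}}
    )

while replicas(old_apps) > 0:
    # +1 on EKS first, wait for the consumer-group rebalance, then -1 on the
    # old cluster, so Kafka never sees a sudden doubling of consumers.
    set_replicas(new_apps, replicas(new_apps) + 1)
    time.sleep(120)   # crude stand-in for "wait until the new pod is consuming"
    set_replicas(old_apps, replicas(old_apps) - 1)
    time.sleep(60)
```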

Thumbnail 2060

You'll also notice in that diagram that multi-region was key. So what we needed to do was actually shift some of our traffic away from the environment that we were migrating and wait for the queues to essentially get to zero before we performed that migration. I do appreciate a lot of organizations don't have multi-region, and we also have certain propositions that are not multi-region. So the way that we approached that was really focused on really quiet periods of the day to ensure that those queues are as close to zero as we can, and thankfully we didn't hit any issues at that point.

Thumbnail 2100

So where did we get to in that 12-month target, which was pretty ambitious at the start? We essentially spent six months not migrating any cluster. Whenever I was giving an update, it would say I hadn't done a single cluster yet, which wasn't particularly great when I was reporting that to Mans and co. We weren't migrating clusters at that point, but we really had the confidence and backing of Mans and the team. He knew that we were developing a lot of tools and automation to be able to rapidly migrate to EKS. So we spent really six months building that six-stage process that I just ran through. And we finally migrated our first cluster at the end of Q2, which was kind of our internal target. But what we then were able to do over

Thumbnail 2110

the succeeding six months, and really four months, was migrate the entirety of global streaming in four months. And that was all down to that automation and that testing that we'd built. We had next to zero incidents or issues with this migration, and that was because of the due diligence that we really put in. I also really want to highlight some of the support that we had from Ian and Manish and team. We had a lot of support from AWS with this to ensure that a lot of the right people and the support was around us, and I'm sure Manish is going to go into a bit more detail on that in a second.

Thumbnail 2150

Thumbnail 2170

So what's the key takeaway? Because it's not really just about Kubernetes, to be honest. We're not going to be doing an EKS migration again, thankfully. But the real unsung hero here was an interface, a declarative way of defining what you want with your platform engineering team. So what we're now looking to try and do is pivot as much of what we offer as a platform engineering team away from a plethora of APIs or interfaces or GitHub repos or all of that other stuff, and we want to build a declarative interface and spec that teams and our customers, our internal customers, can declare what they want. So that today is their workloads in Kubernetes. But we're currently working on being able to define things like keyspaces, any other database technology, CDN configuration. Really, the options are limitless, and when I look at things like AWS Controllers for Kubernetes or ACK and a lot of other tools out there, this is really where we're seeing the platform engineering industry going, and we're really trying to keep pace and ensure that we can offer that same interface for our developers across global streaming.

Thumbnail 2230

AWS Enterprise Support and Countdown: Ensuring Operational Excellence During Migration

So I'm now going to hand over to Manish, who's going to go into some of the support he gave us. Thanks, Pete, and hello, everyone. As Global Streaming's technical account manager, I would like to highlight how we are building operational excellence for Global Streaming, as well as how AWS Enterprise Support helped facilitate this migration. But before we do that, could we have a quick show of hands, please? How many of you are AWS Enterprise Support customers? All right, fairly good mix. Thank you.

Thumbnail 2270

For those who are unaware, let me quickly talk about two key AWS Enterprise Support engagements that were directly linked with this migration. So, NBCUniversal and Sky are AWS Enterprise Support customers, which means that they get access to a designated technical account manager like me, who works closely with the team. Through Enterprise Support, we have built a technical partnership with the team. This also means that they get access to AWS expertise whenever they need it, whether it's for architectural discussions, technical deep dives, or operational challenges. For this migration, we had AWS Countdown initiated. What we did was work in parallel with the customer team to collate all the important information around their AWS environment, and then we had our AWS experts look into the internal runbooks against the AWS services involved in the migration and come up with detailed, specific technical recommendations. In the next slide, we will look at how AWS Countdown helped in this migration, but the real success story here is how the Global Streaming team has leveraged all of these support mechanisms to make sure they meet their migration goals. They take our inputs whenever valuable, but always maintain clear ownership of the technical decisions.

Thumbnail 2350

Thumbnail 2370

Thumbnail 2380

So, let's look into the timeline once again. As Pete has shared, this is what the project timeline looked like. The project got kicked off in January, and soon after, AWS Countdown was initiated. Through the information we collated, we knew that Amazon EKS was front and center in this migration. This prompted a very good technical discussion point with the team. The team had been managing self-managed Kubernetes clusters, but with Amazon EKS, the control plane becomes a managed offering. So we discussed with the team the best practices for observability of the control plane, the key metrics that get exposed via CloudWatch, and, for this migration specifically, the metrics to watch to make sure that the control plane is healthy.

Thumbnail 2410

Similarly, we also had a discussion with the team around the control plane scaling aspect. There was already a good public blog post around how the Amazon EKS control plane scales in response to various inputs and metrics. But for this specific migration, we discussed with the team the specific steps to take so that the control plane scales in response to their migration.

Thumbnail 2440

And last but not least, before the first production cluster migration, we had heightened AWS awareness. This meant that in case anything goes wrong during the migration and if there is a need for an AWS support case to be logged, the support case engineer who is working on that support case gets the context of the migration very quickly so that they can help the customer efficiently and quickly. Through the rigorous testing, the proactive planning discussions, and this heightened AWS awareness, the project went smoothly without any issues.

Thumbnail 2480

Now, in the last slide for this section, I just want to highlight what life looks like after the migration. Operationally, we are already seeing the benefits. Tracking the Kubernetes version has become very easy. The upgrade process has become very smooth, which means that engineers are spending more time on innovation rather than maintenance. Through our technical discussions, we also continue to discuss various optimization opportunities with the engineering team.

For example, we are already talking with the team around Karpenter for more efficient node scaling, as well as provisioned control plane scaling, and much more. Now all of these technical discussions happen organically through our enterprise support partnership. And at last, we are also exploring the different AWS native services across compute, networking, database to add more resilience into the platform capabilities. With this, I'll hand it back to Mans to talk about specific improvement metrics that we are seeing after the migration.

Results and Future Vision: 6x Faster Upgrades and Declarative Infrastructure Management

Thank you, Manish. Hopefully everyone's still awake in the back. By the way, you all look super cool. It looks like a scene out of Tron with the futuristic pink headset, so it looks amazing. I'd love to take a photo later, but yeah, that's one for later.

Thumbnail 2570

Thumbnail 2580

So, now that I've given Pete 12 months for this project, what's the benefit? What do we actually get out of this? So let's have a look. That pie chart I showed you at the beginning, where what we consider BAU toil sat at 30%, has now dropped down to 10%. As a reminder, that's things like upgrades and security patches and all the things we love to hate. On top of that, there are 50,000 lines of code that we've simply been able to throw away, which is great.

Our upgrades are now six times faster with EKS, so obviously that's a huge benefit. But what does that actually mean for us in reality? What it means is that we can now accelerate what we consider development. So things like rolling out Istio and Karpenter and Argo. It's not to say we couldn't have done that without this migration, but it's allowed us to accelerate it and do more in parallel. But ultimately that's the benefit that we've got.

So great work to Pete and the team and obviously massive help and thank you to AWS. But yeah, I mean, obviously we could not have done that without this great partnership. So yeah, thank you. I'm now going to hand it over to Ian, who's got some questions for all of us, so over to you, Ian.

Alright. Thanks, Mans. So because it's a silent session today, it's really hard for us to do a traditional Q&A. So we actually were able to talk amongst ourselves and do more like a FAQ. So what we thought would really resonate with you are questions that we probably thought that you would ask, but we will be hanging around for a few minutes after the session if you want to come and speak to us, and then of course you're welcome to do so.

So, first question for Mans. Mans, how has the Amazon EKS migration transformed Peacock's ability to deliver major live events like the Super Bowl and the Olympics? Thank you, Ian. So yeah, I mean, as you can imagine, there's a lot less for Pete and his team to do in terms of whether that's scaling, whether that's testing, and things like support, it's a lot easier. You also get the benefits of the new features that Amazon will roll out via EKS.

But it's actually a lot more than EKS and what we're looking to do is kind of transform how we work with managed services like things like Keyspaces as well. Yes, we have obviously large events like the Super Bowl and the NFL coming up, but the key for us is obviously the growth internationally.

I could easily get a call tomorrow from my boss saying we've landed another deal, another partnership with another streaming service, and we've got another 12 months to do a migration. So I think it's more about the mindset and the tooling that we've built as part of this that's ultimately given us that great option going forwards.

Yeah, brilliant, thanks Mans. And I think Pete, over to you. You mentioned earlier in your presentation that you're hoping to expand the migration approach maybe from Kubernetes, but maybe to database or content delivery networking or something like this. So can you tell us a little bit more about that? That'd be great.

Yeah, sure. As I touched on in the presentation, I think the key for us is ensuring that we follow a declarative approach, not just for application deployment but for all aspects of the developer ecosystem and lifecycle. Traditionally, what we've needed to do is build our own controllers. We have a set of engineers within the team for whom building those controllers is really their bread and butter, and we've built a number of them over the years. This was pre-ExternalDNS; we were building controllers to manage Route 53 zone delegation through the Ingress resource and all of that other stuff.

Obviously now the industry really is leaning in to the declarative controller-based eventual consistency, and I think we are really excited to see what's coming out of, in particular, AWS. So I think tools such as ACK, or AWS Controllers for Kubernetes, I think are going to be game changing for us in that we can just use that. And I know it was actually announced a couple of days ago, the fact that that's now going to be an EKS capability that we can just enable, as well as things like Argo CD, which is just totally game changing for us.

But if I pivot back to things like Keyspaces and CDN configuration, I think we right now are actively working on building controllers and contributing to controllers to be able to manage those aspects of essentially the ecosystem. But the important part for us is making sure that the way that is declared is not technology specific. It needs to be declaring what the developer wants from that aspect, whether it is a relational database or whether it is the ability to expose their app on a CDN or adding caching. It should not be related to the technology under the hood so that in the future if we need to go and move to a different technology or a different CDN, or we want to expose more CDNs, we don't need to then go and ask 1600, 1800, however many developers that we may have to go and make those changes on their code base. That data's there. We can make that change on the platform engineering side.

Yeah, and look, I think that's super interesting for the people here today, and perhaps in the future, I'm sure, you'll talk about it again. It's a really interesting approach that I think people can really benefit from, and it leads me nicely on to the next question, for Manish. So Manish, have you observed similar migration needs among other AWS customers, in the way that Global Streaming have migrated to Amazon EKS?

Yes, in fact, we do. We do see other customers going through similar migration journeys, but they all do it in isolation. I appreciate there is not a lot of publicly referenceable documentation at this moment around these kind of journeys, so I'm really grateful that Global Streaming is sharing the journey with the wider audience. And I'm hoping that we can follow it up with probably a further white paper or a blog post so that we can distribute it with the wider audience.

Yeah, I mean, look, I think from my perspective there's a lot for people to learn from yourself, Mans and Pete. I think what you've done here is really interesting and kind of super cool. I think if we can blog about it in the future, or perhaps a customer use case, I think that people would really appreciate that. So changing track a little bit now, as Peacock expands, Mans, how has the modernized infrastructure supported your overall growth strategy?

Yeah, I think speed to market is absolutely key. And as I said, we could easily sign a deal with another streaming service in a matter of days and be given another 12 months. So the key is speed to market, and rather than us having to provision and write a lot of our own infrastructure code, using managed services, whether that's persistence, whether that's CDN, or whether that's the actual Kubernetes clusters themselves, just gives us that speed to market.

Key for us is a lot of these new kind of partnerships that we're making are kind of global. So we could get a call next week to say that we've signed a deal with someone in Asia or in Australia or in South America. So I think absolutely the key is the speed to market, and we only really get that by using more and more kind of managed services with yourselves.

Yeah, I think that's amazing. I think obviously speed to market really counts, right? So Pete, I just got another question for yourself. You talked about your team's approach to testing and automation. I know we talked about Application Recovery Controller and Fault Injection Simulator and these kinds of technologies.

How are you testing recovery capabilities and resilience for live sports events like the Super Bowl or the Olympics? Yeah, I think you can probably tell testing and automation is at the heart of my department. It's something that we all pride ourselves in, and it's always the first thing that we ask whenever we're starting a new project. And that also means that it's built into our culture.

When we look at our platform, we're constantly looking at ways that we can further validate and essentially break our platform and find those particular breaking points. And I think something like AWS Fault Injection Service, or FIS, has been really valuable. We've been heavily using that over the last 12 months or so since it got launched. And that really allows us to validate things like zonal failures or regional failovers, all these sorts of things. That's absolutely key, but I think another tool, and I know you mentioned it as well, is the AWS Recovery Controller.

What we're now looking to do, and this is really only possible via us being able to free up that business-as-usual time, is we want to now pivot towards being able to be totally regionally tolerant. So we want to be able to use something like ARC to fully fail over to a different region. But alongside that, things like chaos engineering and chaos testing within the Kubernetes cluster is again something that we really want to sort of try and focus on and build. We don't want to have all of the testing and automation that we did as part of this migration lost. We want to continue evolving that. It shouldn't just be a one shot. This has to just be something that we continually lean into for essentially ensuring that things like the Peacock Super Bowl actually works.
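For reference, kicking off a pre-built AWS Fault Injection Service experiment like the ones mentioned above can be as simple as the following boto3 sketch; the template ID is a placeholder, and the experiment template itself (for example, an AZ-disruption scenario) is defined separately.

```python
# Hedged sketch: start a pre-built AWS Fault Injection Service experiment
# template and poll until it finishes. The template ID is a placeholder.
import time
import boto3

fis = boto3.client("fis")

response = fis.start_experiment(
    experimentTemplateId="EXTPLACEHOLDER123",       # placeholder template ID
    tags={"purpose": "pre-event-resilience-drill"},
)
experiment_id = response["experiment"]["id"]

while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print("experiment state:", state["status"])
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```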

Yeah, I mean, certainly we're hoping that is the case for sure. I think that kind of segues now. So, Manish, what preparations are being prioritized to ensure the optimal viewer experience for large-scale events? Yes, so as Pete said, the teams across Global Streaming do rigorous testing in preparation for all of these big live event games. So what we have been recommending to the team is to go for another engagement within AWS Enterprise Support, which is AWS Countdown Premium. It gives you much more support while going through these testing phases.

So our recommendation has been to get Countdown Premium at the start of your testing lifecycle, so that whichever designated engineer comes in has more context around the live event games, your scale, and any roadblocks that you hit, so they are aware. Then, when the big day comes, everything has been tested and there is heightened AWS support awareness, so that there is nothing that could go wrong on the day of the main live events. Yeah, thanks. I think that's absolutely critical.

So I think we've got time for perhaps one last question. So let me throw a question back at you, I guess, Ian. Obviously I've touched on our global ambition, and not just global reach but also a lot of the things we want to do with Gen AI and new features. But can you give me, and I guess all of us, an insight into what you have in your world over the next couple of years that will help us push towards that next stage in terms of our ambitions globally?

Yeah, I mean, I think that's a really good question for where we are at re:Invent right now. I think we're seeing new services and features turn up every day. I think we were really relieved to see Amazon EKS Auto Mode turn up as a pre:Invent item. I think that's helped us out quite a bit, but look, I think we'll have a roundup next week. I think there's so much to come from re:Invent, and obviously I'd encourage you to attend as many sessions like this as possible, but also a lot of the service announcements.

Thumbnail 3290

And then I think the same for everyone here today. You know, I'd really encourage you to get out there, really make the most of re:Invent, and really look at all of the new services and features that are now being released on a daily basis. So I think with that, we will close the session. I'd really, really like to thank Manish and Pete for their time and for their partnership. I hope you found the story of their journey to Amazon EKS as inspiring as we do. And look, I'd encourage you to get out there for the rest of re:Invent, really enjoy your time here, and thank you so much today for your time and attention. Thank you very much. Thank you everyone. All right.


This article is entirely auto-generated using Amazon Bedrock.
