
Kazuya


AWS re:Invent 2025 - Making Level 4 Autonomous Networks a reality with British Telecom (IND205)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Making Level 4 Autonomous Networks a reality with British Telecom (IND205)

In this video, AWS and British Telecom demonstrate how agentic AI enables autonomous telecommunications networks. Ishwar Parulkar explains AWS's AI stack including Amazon Bedrock and AgentCore for building enterprise-grade agentic systems. Reza Rahanma describes BT's vision for AI-powered, intent-driven autonomous networks managing 30 million subscribers and 20,000 macro sites, emphasizing their DDOps (data-driven operations) approach. Ajay Ravindranathan details the solution architecture featuring multivariate anomaly detection, domain-specific community agents for root cause analysis, and service impact analysis using Amazon Neptune, SageMaker, and MSK. The implementation targets Level 4 network autonomy, transforming petabytes of network data into actionable insights while reducing operational costs and improving SLA performance across BT's 5G standalone network deployment.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Making Autonomous Networks a Reality with Agentic AI

Good afternoon everybody. Does everybody have their headphones on and can you hear me? All right. Good afternoon. Welcome to the session on making autonomous networks in the telecommunications industry a reality using agentic AI. My name is Ishwar Parulkar. I'm the chief technologist for the telecommunications vertical at AWS. Telecommunications companies spend a lot of effort and money in operating networks. Hence, they're called operators as well. Today I'm going to talk about how agentic AI is helping them reduce the cost and the time to manage and operate these networks, the work we've done with one of our strategic customers and partners, British Telecom.

Thumbnail 60

The session is structured as follows. I'll give a brief introduction on some of the challenges that the telecom industry is facing, some of the big challenges, to give some context of why making networks autonomous is important and critical for the industry. I'll talk a little bit about what autonomous networks mean and how AWS's agentic AI strategy fits in to help take them on that journey towards making networks fully autonomous. Then I'll have Reza Rahanma, who's the managing director of Mobile Networks of British Telecom, talk about British Telecom and their vision in this space and also the challenges that they set out to solve using AWS services and AWS technology.

After he talks about the challenges and gives you the lay of the land in terms of what the problem statement was and what we were trying to solve, we'll have Ajay Ravindranathan, who's a principal solutions architect in the telecom vertical at AWS, talk about some of the approaches we took, some of the use cases that we focused on, the approach we took, and the solution architecture that we applied to make these networks autonomous. Then we'll wrap it up with some of the benefits we expect out of this work and next steps. There's a path we have to scaling and expanding it, and we'll talk a little bit about what the next steps are in this journey.

Thumbnail 160

Thumbnail 170

The Telecommunications Industry's Challenges and the Network Life Cycle

Let me give you a quick overview on why making networks autonomous is relevant and how AWS's strategy in agentic AI fits in here. If you look at a high level, there are three big challenges that the telecom industry is facing. These are also opportunities, so you can look at them as challenges or opportunities. The first one, as I mentioned earlier, is the cost of managing telco networks. It's pretty significant and can go up to one-fifth of the revenue in some of the larger operators. A large part of the reason is that a lot of the operations are manual. You employ a large workforce to manage operations. It takes time to diagnose failures and understand what's happening in the network. So that's one of the bigger challenges that the operators are trying to solve.

The second one is about 5G, which is the current state of the art in mobile networks. 5G was a big departure from 4G in the sense that it was meant to be a dynamic programmable network. It gave visibility into the network and allowed for control of some of the configurations in the network. But the industry has not yet completely fulfilled this promise, and agentic AI, the transformation that we're going through in terms of this big technology inflection point is something that can help them achieve that promise of 5G. Lastly, telcos have a lot of data. The data comes from network elements. It comes from devices that users use. It comes from user behavior, and all of that data has a lot of potential to not only help reduce costs but monetize and build personalized products, and it's quite underutilized today.

Thumbnail 290

So one of the opportunities here is how can this data be transformed to data products that can then be used by ML technology and applications on top to derive value. One of the significant tenets we started this work with was to look at it from a data-first perspective, really look at what data is available, how it can be curated, and how that can be leveraged to build ML applications. Let me give you a brief overview of what a telco network's life cycle looks like. Folks from the industry are familiar with this, but for the others, I thought it would be helpful to just understand where we are focused on. There are different phases in a network life cycle. The first one is planning and engineering. This is where you figure out the targets, which markets you want to get into, what kind of bandwidth you want to provide, and where the cell towers should be located.

There is a whole bunch of effort that goes into planning and engineering this. You decide on which network equipment to procure to build the network, and so on. The second phase is deployment and configuration. This is the actual installation of all the elements, connecting them, testing them, validating them, and configuring the various routers, base stations, and elements in the network.

Once the network is live, a big part of it is services fulfillment and activation. Based on customer demand, consumer demand, and enterprise customer demand, the services are activated and launched on top of this network. Then there is operations, which is making sure that the network is performing to standards, meeting the KPIs, and making sure that you are optimizing the network and getting the most out of it through traffic steering and optimizing RAN scheduling, and so forth. There is a whole area of optimizing network performance.

And then there is detection, diagnosis, and fixing of failures when something goes wrong. How do you figure out what is wrong? How do we fix it? This work is mainly focused on the last two pieces here, which is operations and the services fulfillment part of it. Over time, we expect agentic AI to expand to these other phases of planning and configuration as well.

Thumbnail 410

Autonomous Network Maturity Levels and AWS's Agentic AI Strategy

The industry defined levels of maturity in this journey towards autonomous networks. TM Forum is an industry standards body that came up with this maturity level framework to be able to evaluate where an operator is in its journey towards autonomous networks. This came out in 2019 before the AI transformation and the technology wave that we are seeing emerged. Level 0 is all manual, going to Level 1, which is preconfigured execution of a few tasks, then goes to Level 2 and Level 3, which has closed loop operations, going from very static and prescriptive to intent-based.

There was some work done on languages to express intent which could get translated to network configurations and so forth. Level 4 and Level 5 are where we really get into a closed loop with end-to-end services. We are roughly around Level 4 with the work we are doing here. Note that this was an abstraction created in 2019 before AI came to the forefront, so it was aspirational and a little abstract. What we are doing with this work is really giving it a more tangible sense and feedback as to what we mean by these different levels and the levels of autonomy that can be achieved.

Thumbnail 500

We are using AWS's agentic AI strategy and different pieces and different parts of the approach that we have taken. I am just going to quickly go over the AI stack and how it is relevant to this space. At the bottommost layer in the stack we have infrastructure. This is access to silicon. We obviously have access to Nvidia GPUs, but also we have custom silicon, AWS Trainium and Inferentia, which has some benefits in terms of price performance.

Thumbnail 520

Thumbnail 540

And then there is SageMaker AI, which is a tool set to train and fine-tune models on this infrastructure. The middle layer is AI and agent development software and services. At the top we have SDKs for agents. We have our own Strands agents, but also we support open source frameworks and we are using Strands to create these agents for autonomous networks. Underneath that is Amazon Bedrock, which is our managed service where through simple API calls you can build complex agentic systems.

We have access to different models, a lot of tier-one third-party models, but also our own family of Amazon Nova models. You can select and choose, and there are capabilities of optimization, guardrails, and customization along with that. The key piece here is Agent Core, which we announced a few months ago, and you will see a lot more announcements at this re:Invent coming in terms of more capabilities and features there. Think of Agent Core as a suite of primitives to help build complex agentic systems that are scalable, that are enterprise grade with the security and reliability that is required of a large complex agentic application. This is a part that we are leveraging heavily in building autonomous networks.

At the top are applications, which are out-of-the-box agents or families of agents. Right now, this does not really translate to what is required in autonomous networks, even though we are using Amazon Q Developer CLI in some cases to look at how we can translate natural language to telecom infrastructure. I also see potential for using agents and agent tools from the AWS marketplace to fulfill components of this overall vision of what is possible in making networks autonomous. There is a roadmap in terms of more use cases, and I see that marketplace playing a role as we build more use cases and more complex agentic systems.

As you can see, we are using the infrastructure part for autonomous networks. The inference runs on AWS, and we are using a large part of the Amazon Bedrock AgentCore and SDKs for agents to build out some of the components here. With that, I will hand it over to Reza Rahanma to talk about BT's vision and also why they chose AWS.

British Telecom's Network Scale and Vision for the Future

Thank you. I hope I am audible. I am Reza Rahanma. I work at British Telecom, and I am going to take you through the journey that we are going through as part of utilizing this technology to bring the vision we have to life for our network. British Telecom has been around since 1846, and throughout these decades and centuries, we have prided ourselves on being innovators in this domain. We have done a lot of things, from the first telegraph to right now having the best mobile and fiber network in the UK.

Our network really consists of two major layers that we have in the UK. There is our fixed network, which contains our copper network, our fiber network, an enormous MPLS network, and the internet network that runs pretty much the UK. On top of that, we have our mobile network. I run the mobile network, so my passion sits in mobile, and I will talk mainly through what we are doing around modernizing our mobile network and how we are going to use this new technology to take us into where we want to go in terms of autonomous networks with our mobile network.

Just to give you some examples in terms of the scale of the network, the mobile network that we have consists of about 30 million provisioned subscribers. Every day, about 22.5 million mobile users use our mobile network, and millions of other business users. Plus, we also run what we call critical national infrastructure, so emergency services and all of that runs on top of our mobile network.

We have an ambition with three pillars. We really want to be the trusted connector for our consumer base and for our business base. These three pillars are build, connect, and accelerate. We are building our fiber network across the UK, and millions and millions of subscribers will have fiber to the premises. On top of that, we also have an ambition by the end of 2029 and early 2030 to have 99 percent of UK population covered by 5G standalone.

Just briefly, for those of you who are not familiar with the term, 5G standalone means 5G end to end: 5G radio with a multi-spectrum approach, plus the 5G core network. We are trying to do the whole thing within the next few years. On top of that, we also want to switch off our legacy cellular networks. We have already switched off the 3G network, and hopefully by then we will have switched off our 2G network as well. So we will be mainly 5G standalone with some LTE left.

Then we want customers wanting to connect to these awesome networks that we have built. So when we bring fiber to your premises, we do not only leave it there at the doorstep; we bring Wi-Fi 7 into your home as well. We are the first operator in the UK to launch Wi-Fi 7. When you think about it, by having that potential of Wi-Fi 7 and 5G, we create the best converged network, the first heterogeneous network for our consumers and our business customers. So wherever you are, the best bearer serves your needs.

On top of that, we will build new applications, and we want to accelerate doing these things. We want to take cost out and develop these services much faster. An example of that is what we are creating here with AWS: what I call our AI operations work.

Thumbnail 940

The Vision of AI-Powered, Intent-Driven, Autonomous Networks and Operational Challenges

When we think about this network of the future, it is AI powered, it is intent driven, and it is autonomous. When I say it is AI powered, it is actually a network that not only utilizes AI to operate itself, but it is the best network for the future AI applications. These new applications are coming thick and fast, so the network of today is not necessarily what we need for the network of the future, where wearables and millions of devices that you carry continuously want to talk to the cloud in order to have various AI applications.

Intent-driven means that when you need to use a service, the network needs to be able to provide it for you based on the application that you have. If you are a gamer, then we want to put you into a slice for gaming so you have an awesome gaming experience. If you want extra security and you just launched an application, the network knows that you need extra security and provides that for you.

Autonomous means that for us, the vision is a self-healing network. That self-healing network was started in 4G, but the reality is that it really has not delivered yet. However, with 5G standalone it is very different because the entirety of the network is API driven. The days of Diameter and SS7 and all of them are gone. Everything is IP based and API driven. Therefore, we can utilize technologies such as NSF to launch slices or applications that give you the intent and the network looks after that service.

Thumbnail 1050

Thumbnail 1060

But that is easier said than done. We have a huge number of challenges as an operator that have been around and have had all of these various generations of the network. This just gives you a very basic view of what we deal with in running and operating this network. On the top, you see approximately 20,000 macro sites. These are the big antennas and the poles that you see in the streets. Each one of them has multi-carriers on them, and we have 20,000 of these and we are growing.

We have a huge amount of small cells that we pepper throughout the country and we bring that in, especially for our top 25 cities in the UK, in order to really provide excellent coverage and even during the busy hour to provide you that minimum bandwidth that you need for the applications that you use today. We have a distributed core network, so our core network is 100 percent containerized now. It runs all on Kubernetes and is distributed across the UK from north to south. The reason we have done that is because we really want to optimize the network for the use cases that we have.

When you are attached to this tower, from that tower your bits go into the closest gateway. From that gateway they go to the closest UPF. From there, we put them onto the internet in the closest possible way. That distributed network is very important. However, it means that a lot has changed when you run the majority of your network, as we now do, on containerized infrastructure.

For the radio network, the radio itself, we monitor 4,000 KPIs for each tower. That is very chatty. When a node goes up and down in the containerized domain for the core, it is chatty, so the amount of data that is generated within this network is enormous. On top of that, consider some daily statistics: we have 12,000 events a day, and we make hundreds of changes a day. Our operation is not really where it needs to be in order to deal with the ambition that we have for the awesome network that we are building for the UK.

When we talk about change, a small change can impact something somewhere else. You can make a basic SRV record change, and something somewhere else in the network breaks.

Thumbnail 1230

Data-Driven Operations (DDOps): Transforming Network Management Through Clean Data

These are the challenges we face today in this containerized, 100% software-run environment. We have to raise the bar and ensure that we're modernizing our network. We've started and jumped on the AI bandwagon a while ago, but we made a number of mistakes. Some of the things we're working on right now are what I've put here.

Consider how mobile networks have been run. I've been in mobile networks for 30 years, from the days of 2G, 3G, 4G, and 5G. The functions performed within mobile network engineering haven't changed significantly. In 2G and 3G, you had the HLR engineer. In 4G, you had the HSS engineer. In 5G, you have the HLR, HSS, and SDM engineer. The same function exists. You see the same pattern with gateways from 2G through 4G and 5G. Technology has modernized, but the people side hasn't modernized as much. We're still a network engineering team, but we need to transition ourselves from a network engineering team to a software engineering team that runs a mobile network. That transition is what we're going through, let alone becoming experts in AI and data.

I showed you how many nodes we have in the network. We have petabytes and petabytes of data sending all sorts of information from basic SNMP MIBs and OIDs to syslog and everything else continuously being pumped out. Throughout the years as we keep building these networks, these networks have their own data ingests, and some of them are completely separate from each other. One group uses one data ingest, another group uses a different one. I don't think we're alone in this. Most operators still operate that way. The siloing of data and not knowing what to do with data that is not clean is a problem.

Thumbnail 1380

Then there are the processes. As we keep building stuff, we set up certain processes, and these processes hardly change. Tools are one of my big concerns. The reason is that every node you build comes with an element manager, and that element manager is monitored by another network management service. In our network, we're a museum of tools. We have one of every tool under the sun. All of these are the challenges we face.

We came up with a concept that we really want to run our network with data. We called it DDOps, data-driven operations. We're not marketers, so we come up with strange names. DDOps basically starts with the first action we decided to take: fix data, clean our data, ingest the data into one place, fix data engineering, and then clean this data. The cleaning of this data is very difficult because you have to bring the RAN guy who understands what the data coming out of radio means and helps you clean it. Same with the core network, same with the devices team. You have to bring that together holistically to create one view of what's happening in this enormous network.

Once we clean that data, we bring it into visualization and data analytics so that humans can make faster decisions. We've established five pillars for ourselves. What is the cause of what happened when something goes wrong? We could spend hours looking at various tools to figure out what broke. We want DDOps to quickly pinpoint that within this function, this went wrong. What is the impact of that function not working? Who is impacted by that function? Which customer? What do we do with all this data that we get? What can we learn from it, from the behavior of these nodes and how they operate? Are they operating correctly? How do we automate everything? How do we automate what we just saw? How do we automate what just broke and bring the continuous improvement mindset into engineering? If something breaks once, we really don't expect it to break again. Use the data to figure out why it broke and try to ensure that it doesn't break again.

Thumbnail 1510

Thumbnail 1520

Why British Telecom Chose AWS as Their Strategic Partner

We then started to look for partners and chose AWS. Like many other organizations, we initially decided to do this on our own, and we made some good progress. However, doing this by yourself as a telco, we found, can be slow and you can make a lot of mistakes. There were three key things we went through in this journey.

The first was the culture that AWS brings into the organization. I remember in one of the early sessions with them, they put an empty seat there and said nobody can sit there—it's the customer's seat. As an engine room person, I initially thought, what are they talking about? But it is important because we realized that the aim is not for me as an engineer to have a tool that makes my life easier. The aim is actually for the customer not to have an interrupted service. Mobile networks are becoming the lifeline of almost everyday work, and a huge amount of us rely on them heavily. In the UK, 83 percent of all emergency calls come through the mobile network.

Thinking about customers and ensuring that we use the right approaches so that our uptime and service are at a certain standard—that customer-first mentality helped us a great deal in shaping this vision. The second thing is that if we wanted to build everything on our own, it would take us a lot of time. AWS has developed a lot of technologies that we can use, such as Bedrock AgentCore and Nitro. When we get the data and work with AWS, putting it in this environment accelerates the work significantly.

Thumbnail 1650

The third thing, which is very important for us, is that we started to understand each other's language very quickly because of the expertise that AWS brought in to work with us. That understanding of telcos meant that when I talk about RAN and RAN KPIs, they were matching that understanding, and that was really helpful for us. What are the success criteria for us? Data is going to be the foundation of everything we are going to do. We wanted to slow down and not jump on a basic use case to create some basic generative AI or agentic AI to do one part of the bigger journey. That was important for us. We wanted to look at the much bigger picture and bring that into life, and data is at the heart of that.

We are really working with AWS to create the best data engineering roadmap and deliver the future of AI on top of that. Secure by design is also critical. We all hear what is happening every day in cyberspace and about the attacks that happen, specifically in the United Kingdom, across some of the big names. We started to design everything with cybersecurity in mind as well, especially as we go through this journey. A lot of the stuff that we build in this domain around network optimization and resilience we can bring into cyberspace as well. The basics, like the Snort and Suricata rules we develop and push into our SIEM and SOC—a huge amount of that we can automate in this domain, and nothing is sacred. We can go after everything to automate.

Thumbnail 1790

Building that massive picture of what it takes for the user to have the best service and understanding everything that happens in the middle and automating as much of that as possible—that is our aim. We have got to build stuff that is scalable. We cannot just build one tool and then have another tool and then create a process between them. This is going to be one enormous ecosystem that provides us with these sets of technologies. Now I am going to pass over to my colleague AJ here who will talk to you about the details of this solution.

Thank you, Reza. We really thank you for your vision, your team's vision, as well as your intent, and also the resilience you have shown in putting together this program to execute with us.

Solution Architecture: Three-Layer Framework for Autonomous Networks

My name is Ajay Ravindranathan. I am a Principal Solutions Architect in the Telecom Industry Business Unit. I am going to spend a few minutes talking about the solutions architecture and some details about the use cases we are building as well as some we have already delivered.

Thumbnail 1810

So first, when we started with BT, we worked together to define a North Star for autonomous networks and implementing Level 4. We realized that AI is going to be pervasive across all the layers you see here. There are three fundamental layers. The first is the network layer, where we will see AI-native networks and network data sources providing data that has already been curated by machine learning models and, in the fullness of time, AI agents that sit close to the network.

And then the next layer above is AI powered data product life cycles. Over here what we're doing is we are ingesting that data, curating it using data management primitives and creating products out of them, data products that then serve the AI applications that sit above in the layer which is data-driven AI for network applications. In this layer, as you see, you've got generative AI primitives and really hyper optimized machine learning models and analytics algorithms as well that solve for specific use cases and which are essentially using the right tool for the right job and then supplying that insight into an agentic AI layer that brings it together and serves the outcome to the human or system.

Thumbnail 1910

So essentially this architecture is all about turning data into insights and turning intent into action. Let me deep dive into the layers now. As you saw, there is a set of data sources right at the bottom: performance counters, alarms, topology, config, incidents, changes, and knowledge sources. All of these act as the raw data that is required, and you need to ingest this data from the network. To draw an analogy, this is like the flour and the raw ingredients you have in your kitchen to make that beautiful bread, which is the insight that comes out of it.

Thumbnail 1950

So then you've got the second layer which is your AI powered data product life cycle management. So here we have data management primitives which go from agentic data engineering and feature engineering. So if you think about an ML life cycle, there is a lot of effort that goes into curating the data, into building data engineering pipelines, into building features out of those so that ML models can use them. And here we want to accelerate that using agentic techniques.

The second part of this, which is also unique, is the agentic semantic layer. When you define these data engineering pipelines and features, you want to define them once, in one place, and let all of your products use them in a way that they can, at runtime, understand the KPI definitions, alarm definitions, and correlation definitions, and then use those at runtime. That is the agentic semantic data layer. You then couple this with an open data format like Iceberg on S3 and with agentic graph databases in the intelligence layer. Networks are best represented as connected elements, and graph databases do that really well; you can traverse the network, find connections between elements very easily, and then run heavy analytics on top of that.

And then finally, a vector store is a key ingredient, a primitive that allows you to make all of the data, that unstructured data that sits in your enterprise, that's really a wealth of knowledge within your operations as well as your documentation, come to life so that Gen AI and Agentic AI applications can use that. So what you see above is those data products. In the RAN and core you've got KPIs and alarms, you've got customer experience metrics coming in from the core as well, and then you've got a cross domain network and service topology, and then finally your vectorized operation docs.

Thumbnail 2100

One key element is a network digital twin, which is essentially a higher-order data product that we generate out of network and service models, topology, performance, and alarms. It gives you a view of your network, both historical and current, that can be used by AI applications. On top of this sit the data-driven AI applications for networks.

Here you have hyper-optimized models like anomaly detection models as well as impact prediction models for change impact analysis. Reza talked about change and how it is difficult to analyze the impact of change, so these models sit in that layer. You also have causal graphs that record what has been done by agents, the root causes they have analyzed, and how to make use of that in the future.

At the top, you see the agents that we have started with: root cause analysis agents, service impact analysis, troubleshooting and diagnostics, and optimization. Essentially, what we're doing here is again looking at that analogy: taking all the raw data, that flour, curating it, and then creating insight out of it. We're also creating an agentic loop for intent-driven orchestration back into the network, where you could ask for things like: what's the problem in this particular part of the network, how do I optimize this network for better capacity, or how do I optimize it for better coverage or to remove interference.

Thumbnail 2200

Use Cases in Action: Anomaly Detection, Root Cause Analysis, and Service Impact Analysis

Those kinds of intents can be expressed at a very high level by an operations executive and then translated into what needs to change in the network. These are the top three use cases that we have started with: core and RAN anomaly detection, core and RAN root cause analysis, and service impact analysis.

Thumbnail 2210

Let's look at the architecture. Here we have data coming in from the on-premises data centers. You have an ingestion layer that handles streaming and batch data: Amazon MSK for your streaming data and Amazon EMR for batch processing. All of this is underpinned by a data catalog which holds your metadata and your business data catalog, and you can then create the right level of governance to drive quality in your data ingestion and curation pipelines. You have Lambda and EventBridge to create event-driven architectures, and Amazon S3, of course, is our object storage that allows you to store data in the open data format Iceberg that I talked about earlier.
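To make the ingestion layer concrete, here is a minimal sketch of a Lambda function triggered by an MSK topic that decodes the Kafka records and lands them in S3. The bucket name, key prefix, and record schema are illustrative assumptions rather than BT's actual pipeline, which would typically write into Iceberg tables via EMR or a dedicated writer.

```python
# Minimal sketch of an MSK-triggered Lambda landing raw KPI records in S3.
# Bucket name, key prefix, and record schema are illustrative assumptions.
import base64
import json
import uuid

import boto3

s3 = boto3.client("s3")
LANDING_BUCKET = "network-data-landing"  # hypothetical bucket name


def handler(event, context):
    """Decode Kafka records from the MSK event payload and write them to S3."""
    rows = []
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            # MSK delivers the Kafka message value base64-encoded.
            payload = json.loads(base64.b64decode(record["value"]))
            rows.append(payload)

    if rows:
        key = f"raw/ran_kpis/{uuid.uuid4()}.json"
        s3.put_object(
            Bucket=LANDING_BUCKET,
            Key=key,
            Body="\n".join(json.dumps(r) for r in rows).encode("utf-8"),
        )
    return {"ingested": len(rows)}
```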

On top of this layer, it is important to use the right tool for the right job. Time series data is stored in Iceberg, Redshift, and ClickHouse for your cold, hot, and super-hot data. Topology data is stored in Amazon Neptune and Neptune Analytics. This is a key part of the architecture: it underpins the connectivity between all of your elements and allows you to run graph analytics algorithms such as breadth-first search, depth-first search, community detection, and centrality. You also have all of your alarms and events in Amazon Aurora, as well as a GIS representation of your network.
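As a hedged illustration of how topology stored in Neptune can be traversed, the sketch below runs an openCypher query through the boto3 neptunedata client to find transport nodes associated with degraded cells. The cluster endpoint, node labels, relationship types, and properties are hypothetical and not BT's actual data model.

```python
# Minimal sketch of querying network topology in Neptune with openCypher via
# the boto3 "neptunedata" client. Endpoint, labels, and properties are
# hypothetical placeholders.
import boto3

neptune = boto3.client(
    "neptunedata",
    endpoint_url="https://my-neptune-cluster.cluster-xxxx.eu-west-2.neptune.amazonaws.com:8182",
)

query = """
MATCH (c:Cell {status: 'degraded'})-[:ROUTED_VIA*1..3]->(t:TransportNode)
RETURN t.node_id AS transport_node, count(c) AS degraded_cells
ORDER BY degraded_cells DESC
LIMIT 10
"""

response = neptune.execute_open_cypher_query(openCypherQuery=query)
for row in response["results"]:
    print(row["transport_node"], row["degraded_cells"])
```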

What do we do with this data once it is stored in the right format? You can then start using Amazon SageMaker to run machine learning models; you can see multivariate anomaly detection and coverage analysis being run there. The semantic data layer that I talked about is built using Amazon Bedrock, with DynamoDB as the semantic data store. It facilitates the whole consumption layer so that you are not repeating definitions of KPIs and other things. They are all defined in a declarative way so that they can be ingested on demand.
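A minimal sketch of that declarative, define-once idea: a KPI definition stored in DynamoDB that pipelines, models, and agents can resolve at runtime. The table name, key schema, and attributes are illustrative assumptions.

```python
# Minimal sketch of a semantic-layer KPI definition, defined once and resolved
# on demand. Assumes a DynamoDB table keyed by entity_type (partition key)
# and name (sort key); all names are hypothetical.
import boto3

table = boto3.resource("dynamodb").Table("semantic-layer-definitions")

# Write the KPI definition once, in one place.
table.put_item(
    Item={
        "entity_type": "kpi",
        "name": "ran_drop_call_rate",
        "description": "Percentage of abnormally released RAN sessions",
        "formula": "100 * abnormal_releases / total_releases",
        "source_counters": ["abnormal_releases", "total_releases"],
        "unit": "percent",
    }
)

# Any pipeline, model, or agent resolves the definition at runtime.
kpi = table.get_item(Key={"entity_type": "kpi", "name": "ran_drop_call_rate"})["Item"]
print(kpi["formula"])
```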

What really takes this to the next level, and where the acceleration comes from, is using agentic AI applications on top of this. At the top, you can see RCA agents, service impact analysis agents, troubleshooting agents, and RAN optimization agents built on Amazon Bedrock AgentCore. AgentCore allows you to take these agents to production using the right primitives around session isolation and identity, integrating with the external world with MCP, integrating between agents using A2A, and providing security with identity for these agents.
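To give a flavor of what such an agent might look like, here is a minimal sketch using the Strands Agents SDK mentioned earlier, with a single hypothetical tool. The prompt, tool body, and identifiers are assumptions; in production the agent would be hosted on AgentCore Runtime with its memory, gateway, and identity primitives.

```python
# Minimal sketch of a troubleshooting agent built with the Strands Agents SDK.
# The tool body, prompt, and site identifier are illustrative assumptions.
from strands import Agent, tool


@tool
def get_active_alarms(site_id: str) -> list[dict]:
    """Return currently active alarms for a given macro site."""
    # In a real deployment this would query the alarm store (e.g. Aurora/RDS).
    return [{"site_id": site_id, "alarm": "CELL_DOWN", "severity": "critical"}]


agent = Agent(
    system_prompt=(
        "You are a RAN troubleshooting assistant. Use the tools provided to "
        "inspect alarms before proposing a likely root cause."
    ),
    tools=[get_active_alarms],
)

# Natural-language intent in, grounded diagnostic reasoning out.
agent("Why is site LDN-0042 showing degraded throughput?")
```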

Thumbnail 2400

Let's look at one of these use cases now: the multivariate anomaly detection use case.

Thumbnail 2410

What were the challenges that we found? BT already has anomaly detection algorithms with machine learning running on their network. They use univariate anomaly detection models with dynamic thresholds defined on them. However, the challenge is that univariate anomaly detection models produce a lot of noise. There are large volumes of anomalies coming in, and most of these can be false positives as well.

Thumbnail 2440

What we are doing is transforming this to a multivariate anomaly detection method where we use temporal pattern clustering techniques to group cells of similar behavior. We then optimize the number of models to be trained based on the topology of the network, with awareness of which parts of the network behave similarly, such as dense urban areas versus macro cells in rural areas versus small cells. We then train multivariate anomaly detection models on top of this; we have trained models such as LSTM autoencoders as well as transformer models to provide the right level of accuracy in detecting these anomalies in different scenarios, learning interdependencies between KPIs and forming that causal graph.
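A minimal sketch of the temporal pattern clustering step, assuming per-cell 24-hour KPI profiles as input; the data shapes and number of clusters are illustrative, not BT's actual configuration.

```python
# Minimal sketch: group cells with similar 24-hour KPI profiles so that one
# multivariate anomaly model can be trained per behavioural cluster instead
# of per cell. Shapes and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# kpi_profiles: one row per cell, columns = mean KPI value per hour of day
# (e.g. hourly PRB utilisation averaged over several weeks).
rng = np.random.default_rng(0)
kpi_profiles = rng.random((5000, 24))  # placeholder for real per-cell profiles

scaled = StandardScaler().fit_transform(kpi_profiles)
clusters = KMeans(n_clusters=8, random_state=0, n_init="auto").fit_predict(scaled)

# Each cluster (dense urban, rural macro, small cells, ...) then gets its own
# multivariate anomaly detection model, e.g. an LSTM autoencoder.
for cluster_id in np.unique(clusters):
    member_count = int((clusters == cluster_id).sum())
    print(f"cluster {cluster_id}: {member_count} cells")
```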

Thumbnail 2500

How does this look in terms of architecture? Data preparation is done with Lambda, with MSK for the streaming data and EMR for batch. Cell clustering and KPI clustering are done within Amazon SageMaker using analytics as well as machine learning algorithms, clustering algorithms, and temporal analysis algorithms. Models are trained and stored within SageMaker, and the results are stored on S3 in Iceberg.

Inference is done using SageMaker endpoints, and all of these are managed, serverless services. There is no infrastructure to stand up; you use these services on demand and pay for them on demand as well. Evaluation is done within SageMaker, surfacing the important features that are leading to these anomalies, the false positives and false negatives, and the objective metrics from these machine learning algorithms.
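As a hedged example of that inference path, the sketch below scores a KPI window against a model hosted on a SageMaker endpoint via the boto3 SageMaker Runtime client; the endpoint name and payload schema are assumptions.

```python
# Minimal sketch of scoring a KPI window against an anomaly model hosted on a
# SageMaker endpoint. Endpoint name and payload schema are illustrative.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "cell_id": "LDN-0042-C1",  # hypothetical identifier
    "kpi_window": [[0.82, 1.4, 0.03], [0.85, 1.6, 0.05]],  # e.g. PRB util, latency, drop rate
}

response = runtime.invoke_endpoint(
    EndpointName="ran-multivariate-anomaly",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result)  # e.g. {"anomaly_score": 0.91, "contributing_kpis": ["drop_rate"]}
```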

Thumbnail 2590

Thumbnail 2600

Super important is getting feedback from operational SMEs as well, and then feeding that feedback into supervised retraining or fine tuning of this model for subsequent inference. Let's talk about the next use case. All of these anomalies then feed into this agentic root cause analysis and service impact analysis use case. Many of you are from the telco industry and for those who are not, this is a very quick introduction. You have all of this coming in from the network: anomalies, incidents, changes, knowledge bases, network topology, and raw alarms. The job is to really turn that sea of red into actionable insight. What is the root cause and what is causing it?

Thumbnail 2630

Thumbnail 2650

There are a number of challenges here. There is heavy cost in rule-based automation. Supervised ML models do not give enough performance today because of lack of availability of good training data. Network topology is what underpins this, and it is often incomplete and inaccurate. So how did we solve for this? We invented this technology or this representation of agents, and we called it domain-specific community agents.

What is a domain in this case? It is, for example, 5G core, 5G RAN, or transport. These are network domains: IP, MPLS, DWDM, and so on. Then you have communities. Communities are affinity groups of nodes that are connected closely to one another. Networks are designed in a way to keep them resilient. When they are designed, they are designed with a blast radius in mind, and these blast radiuses often define where you will see propagation of alarms and anomalies. That is what a community is. We use network SME knowledge from BT to define what these communities are, and then we are evolving them with community detection algorithms within the graph as well.
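A minimal sketch of deriving such communities from a topology graph with a standard community detection algorithm, here networkx's greedy modularity method; the toy edge list stands in for the real network model held in Neptune.

```python
# Minimal sketch of community detection over a toy topology graph,
# complementing the SME-defined communities described above.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are network elements, edges are physical/logical adjacency
# (cell -> baseband -> cell-site router -> aggregation router ...).
topology = nx.Graph()
topology.add_edges_from([
    ("cell_a", "bbu_1"), ("cell_b", "bbu_1"), ("bbu_1", "csr_1"),
    ("cell_c", "bbu_2"), ("bbu_2", "csr_1"), ("csr_1", "agg_router_1"),
    ("cell_d", "bbu_3"), ("bbu_3", "csr_2"), ("csr_2", "agg_router_1"),
])

communities = greedy_modularity_communities(topology)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```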

Thumbnail 2710

Thumbnail 2750

Think about these agents operating within these communities. They collaborate with one another and share knowledge with other agents across communities—inter-domain, inter-community agents. Their job is to correlate across these communities and find the root cause. For example, if a transport failure occurs in a particular area that is causing cells to go down in many areas, that would be an example of how you would do interdomain, intercommunity correlation. Service impact analysis is where we take the output of the root cause and correlate it with customer experience metrics to identify the number of customers impacted and what kind of services are impacted for those customers. This allows you to communicate to those customers and proactively solve for them as well.
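A minimal sketch of the service impact correlation, assuming a root-cause output keyed by impacted cell and a customer experience table keyed by cell and service; the column names and figures are illustrative.

```python
# Minimal sketch: join root-cause output with customer experience metrics to
# count impacted subscribers per service. All data is illustrative.
import pandas as pd

root_cause = pd.DataFrame({
    "impacted_cell": ["LDN-0042-C1", "LDN-0042-C2"],
    "root_cause": ["transport fibre cut", "transport fibre cut"],
})

customer_experience = pd.DataFrame({
    "cell_id": ["LDN-0042-C1", "LDN-0042-C1", "LDN-0042-C2", "LDN-0099-C1"],
    "service": ["voice", "gaming slice", "voice", "voice"],
    "active_subscribers": [1200, 150, 900, 2000],
})

impact = (
    root_cause.merge(customer_experience, left_on="impacted_cell", right_on="cell_id")
    .groupby(["root_cause", "service"], as_index=False)["active_subscribers"].sum()
)
print(impact)
```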

I would like to take you through the high-level architecture. You have Amazon Bedrock AgentCore at the heart of this architecture with runtime, identity, gateway, observability, and memory. These primitives provide particular capabilities that you will hear about in the coming sessions throughout this week. The flow of data is that alarms and anomalies come in via MSK from the left. The agents deployed on the runtime use Amazon Neptune and the network topology within Neptune to identify clusters of alarms, or connected groups of alarms, using graph analytics algorithms. They then use the knowledge bases, your operations knowledge base as well as the RCA knowledge base that grows over time, to perform root cause analysis using the reasoning capabilities of large language models and small language models. Right now we are using base models, but we are also embarking on fine-tuning these models to create a smaller footprint in terms of token usage as well as better latency.
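As a hedged illustration of that reasoning step, the sketch below passes a correlated alarm cluster to a foundation model on Amazon Bedrock using the Converse API and asks for a root-cause hypothesis; the model ID, prompt, and alarm payload are assumptions.

```python
# Minimal sketch of LLM-based root-cause reasoning over an alarm cluster via
# the Bedrock Converse API. Model ID, prompt, and alarms are illustrative.
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

alarm_cluster = [
    {"node": "csr_1", "alarm": "LINK_DOWN", "time": "2025-12-01T10:02:11Z"},
    {"node": "cell_a", "alarm": "CELL_UNAVAILABLE", "time": "2025-12-01T10:02:40Z"},
    {"node": "cell_b", "alarm": "CELL_UNAVAILABLE", "time": "2025-12-01T10:02:43Z"},
]

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any available Bedrock model ID
    system=[{"text": "You are a network operations assistant. Reason over the "
                     "alarm cluster and propose the most likely root cause."}],
    messages=[{"role": "user", "content": [{"text": json.dumps(alarm_cluster)}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```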

The alarms are stored on Amazon RDS and the service impact metrics are stored on Amazon S3. We have integrations into trouble ticketing systems to create, update, and delete tickets as well. This is where we are at the moment with this use case. The next step is to take this forward with more closed-loop automation where possible. To talk about these next steps, I will invite Reza back on the stage to discuss some of the benefits and where we are going next.

Thumbnail 2910

Expected Benefits, Next Steps, and the Path Forward for Autonomous Networks

This cannot just be a technology modernization. It is very much about how we change the way we operate and how we run our business. First and foremost, we are looking at taking a certain amount of cost out of our business by removing the efforts that we have today in running the network. These agentic workloads and the ways of utilizing AI remove cost not only from the people's side but from every element, including consolidation of tools. Cost savings is very important. We must improve our SLA, improve the uptime of the network, and improve the service that we provide to our customer base, regardless of what that service is.

Thumbnail 2990

The data platform is critical. The data has to be consistent and available in a timely manner. We cannot operate in an autonomous network environment when data is hours and hours late. That is very important. Customer impact identification means that the impact to customers when something goes wrong is very clear, and a remedy is available very quickly. Cost reduction is the most important one. What is next for us? Upon the success of what we go through right now in various elements of the mobile network from the core to RAN and beyond, we want to make sure that we are able to expand this to every part of the network and every function.

Coverage analysis and optimization is one example of a proof of concept that I want to open up with. Mobile networks work in hexagons, so the amount of data that you receive within the hexagon that you are sitting in is very important. We want the network to optimize itself for the area that you are in and adjust its performance in order to give you that content-aware networking.

People and process transformation is very important. We recognize that we need a new set of skills in the team, including software engineering teams and AI engineering teams that run the mobile network. Reducing the impact of change is critical. Per week we make 11,000 changes in our mobile network, and the majority of the time we are successful, but we want to reduce the risk of that change, and we want to adopt tooling for this specific function.

Dynamic network slicing use cases are very important in the future of mobile networks. As all users and millions of devices from wearables to cellular phones to IoT move into the 5G domain, prioritization of service based on when you need the service or if you pay for that prioritization is extremely important, and we want that to happen automatically. If you're a gamer and you've paid for a gaming subscription, the moment you start your gaming app, that's when it drops you into the slicing that is good for gaming and holds you there. That is the vision that we have.

Thumbnail 3160

Everything has to become measurable. We have to reduce costs, improve efficiency, and delight our customers. You saw the benefits and you saw the next steps. Obviously, British Telecom is leading the charge with us on this one, but we are looking to see how we can serve more use cases with more operators around the world. I wanted to touch on a few other sessions which are related to this topic.

The first one is Agentic AI for Autonomous Networks: AgentCore Design Patterns in Action. We didn't get a chance to go into detail on the use cases here; we gave you a flavor and the overall picture of what we are doing in this space, but in that first session, run by Ajay, you will go deeper into how we're using AgentCore to actually build these agents. That's tomorrow, so I would encourage you to attend if you want to learn more about how we are building these use cases.

The second one is about domain-specific fine-tuning. We are realizing that there is an opportunity to fine-tune some of these models to make them more cost efficient and more accurate with domain-specific data, and we're doing some experimentation and proofs of concept in that space. The third one is using the Amazon Q Developer CLI. That was the top layer in the agentic stack that I described; we're using that CLI, a command-line interface, together with MCP to translate natural language into telecom infrastructure deployment. This is not directly about autonomous networks, but it is related to how you can manage telecom infrastructure.

The last one is a hands-on workshop for an AI agents framework for RAN network operations and optimization. This is a space that has very rich potential in terms of applying AI for optimizing various parameters, from power to scheduling to things like carrier aggregation and so forth. In this workshop you'll get a feel for how you can use AgentCore and some of the other tools that we have to build out some of these optimization use cases in radio access networks.

Thank you for attending. I hope this was useful to understand how agentic AI is being used in solving one of the big challenges in the telco industry. This is something that's unique to the telco industry. A lot of the other use cases are horizontal and we see them across industries, but this area is very specific to the telco industry. Jointly with British Telecom, we are on a path to really fulfill that vision of fully autonomous networks.

Thumbnail 3310

Thank you again for attending. Please remember to fill out the survey. It's in your mobile app. Please take the survey and give us your input and feedback. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
