
Kazuya

AWS re:Invent 2025 - Streamlining Telecom Cybersecurity Operations with AWS Generative AI (IND206)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Streamlining Telecom Cybersecurity Operations with AWS Generative AI (IND206)

In this video, Anwar Ali and Ray from AWS demonstrate how Generative AI streamlines telecom cybersecurity operations. They address challenges of analyzing verbose, high-volume security logs across multiple sources in proprietary formats. The solution uses Amazon Bedrock Agent Core Runtime with Claude 3.7 Sonnet to create a Network Trace Expert Assistant that diagnoses blocked user requests in minutes instead of hours. The demo shows the agent querying Zscaler, Checkpoint, and Illumio logs to troubleshoot a security incident, identifying where traffic was blocked across multiple network firewalls. They recommend using Amazon Security Lake with Open Cybersecurity Schema Framework for centralized log storage. The architecture leverages Amazon Bedrock Agent Core Gateway as an MCP gateway, requiring only four lines of code to deploy agents at scale. Key benefits include faster incident resolution, reduced ticket volumes, improved developer productivity, and potential for self-service troubleshooting portals.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 10

Thumbnail 20

The Challenge: Navigating Complex Telecom Network Security Logs

Welcome to the lightning talk session, Streamlining Telecom Cybersecurity Operations with Generative AI. I'm Anwar Ali, and I will be your host for the session. I will be co-presenting with Ray; we both work as solutions architects at AWS. Here is the broad agenda of what we are going to cover in this lightning talk. We'll look at common challenges telecoms face while analyzing network security logs. In particular, we'll focus on diagnosing blocked user requests during an incident or security event, and on how agentic AI can help security analysts resolve these issues quickly. Ray will walk through an interesting demo, and he will also cover the solution architecture and how we built this agentic architecture.

Thumbnail 60

So telecom and communication service providers are facing an increasing number of security attacks. They are constantly confronting new cybersecurity threats, and they are operating under stringent regulatory norms. There is a need for them to constantly evolve their security architecture. Over time, they have added a lot of monitoring and security control tools, which has added a lot of complexity to managing these telco networks. So when there is a security event, diagnosing these logs and quickly finding the root cause becomes a challenging, very time-consuming task.

Thumbnail 110

Thumbnail 150

Thumbnail 160

Thumbnail 190

So here I'm showing a representative flow where a user is connecting to an internal application. It's a very simple flow, and you can see the various network security tools it passes through across interconnected network segments. This is a very simple representation of one flow, but in a real network there will be millions of such requests: user-to-application connectivity, and application-to-application connectivity between different network segments, with various network monitoring and security operations tools in between. All of these tools are emitting logs. So when a security incident happens, it becomes very time consuming and error prone to diagnose these logs. The agentic solution we are talking about is going to help in exactly this case. There are two use cases we are focusing on: the case where an authentic user is blocked while accessing an application over the telco network, though it can even be extended to broader security analysis and diagnosis. With that background, let's look at the common challenges someone faces while analyzing these logs.

So the logs are very verbose; there is a lot of information in them. All of these security and network tools are emitting logs, and they are present in multiple places: on the AWS cloud, across regions, on third-party clouds, and on premises. These logs are not in a standard format either. Many of the tools emit logs in proprietary formats, and there is no standard convention that helps you correlate them. That makes the work very time consuming and cumbersome.

The log volumes are also high. Many large enterprises are emitting logs at terabytes per day; that is the kind of volume we are talking about. So storing these logs, processing them, and querying them in analytics tools becomes extremely cost-prohibitive and difficult.

And last but not least, a very important point: it is hard to correlate these logs. Many of these solutions provide logs in a proprietary format, or support only specific query tools, so a security analyst needs to learn those tools just to fetch the logs. Consider the scenario when there is an incident and the end user is blocked on a network request. The developer is blocked and raises a ticket with the cybersecurity team. A security analyst then goes and looks at all the logs and tries to make sense of them, which becomes a very cumbersome and difficult process. And although I was showing just one of the flows, there can be many different flows.

Thumbnail 310

There can be several flows happening in the real network, and as I said, there will be application-to-application connectivity or interconnected networks. There will be users accessing the telco network. There will also be various security and network components in between.

Agentic AI and Amazon Security Lake: A Solution for Log Analysis

Now we'll see how Agentic AI and Amazon Security Lake can help you mitigate this. The security analyst, whom you see on the right, can use the network trace expert assistant. It is essentially a virtual assistant, a chatbot in a simple sense, that helps the security analyst diagnose the logs. At the back end it has agents built on Amazon Bedrock Agent Core, and these agents do two things. First, they fetch the logs from various sources and help you correlate and analyze them. Second, because agents have reasoning capabilities, they can help diagnose and find the root cause of the issue.

Overall, Agentic AI can be your friend in diagnosing the logs. What you can now accomplish within minutes was previously hard to accomplish even within several hours. One more best practice is to store these logs in a centralized Amazon Security Lake. Amazon Security Lake stores the logs in a central place and supports the Open Cybersecurity Schema Framework (OCSF), a standard format that helps the AI agent, as well as your other analytics tools, query these logs efficiently. It also manages lifecycle policies and stores the logs in Apache Iceberg tables in a very efficient way that is easier to query and process.
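To make the value of a standard schema concrete, here is a minimal sketch of building one Athena SQL query over a Security Lake table in OCSF's Network Activity shape. The database and table names, column paths, and the user email are assumptions for illustration; real Security Lake table names depend on your account and region, and the resulting SQL would be submitted via Athena's `start_query_execution` API.

```python
def build_blocked_traffic_query(user_email: str, start: str, end: str) -> str:
    """Build an Athena SQL query over a hypothetical Security Lake table
    in OCSF Network Activity format. Table and column names here are
    illustrative assumptions, not a real deployment's names."""
    return (
        "SELECT time, src_endpoint.ip, dst_endpoint.ip, disposition\n"
        "FROM amazon_security_lake_db.network_activity\n"  # assumed table name
        f"WHERE actor.user.email_addr = '{user_email}'\n"
        f"  AND time BETWEEN TIMESTAMP '{start}' AND TIMESTAMP '{end}'\n"
        "  AND disposition = 'Blocked'"
    )

# Because every source lands in the same OCSF shape, one query template
# can cover Zscaler, Checkpoint, and Illumio events alike.
sql = build_blocked_traffic_query(
    "david.lee@acme.example", "2025-12-01 09:00:00", "2025-12-01 10:00:00"
)
print(sql)
```

The point is not this particular query but that the agent only needs to learn one schema instead of one proprietary format per tool.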

Thumbnail 470

Although Amazon Security Lake is not a prerequisite for this agentic architecture to work, it is a best practice, so we are using the two in combination in this architecture. One more thing: as the maturity of this tool increases, you can even offer the network trace expert assistant as a self-service agent to end users. With that, I'd like to call on Ray to explain in detail and give a walkthrough of the architecture and how we built these agents.

Architecture Deep Dive and Live Demo with Amazon Bedrock Agent Core

Thank you, Anwar. Okay, so we're going to talk a little bit about the architecture. We'll do a quick demo, and then we'll talk about some of the solution benefits. Starting on the left-hand side here, you can see the personas that might interact with this application. Primarily, we're listing a security analyst, right? This is someone who might manually go through and look through these logs currently. What they're doing in this case is they're actually interacting with our generative AI agent. This agent is running on Amazon Bedrock Agent Core Runtime, which is a purpose-built runtime environment for generative AI agents. You can bring your own framework and you can bring your own model. So in this case, we built it with the Strands SDK and we're using a model available on Amazon Bedrock, which is Claude 3.7 Sonnet.

The agent is, of course, only as valuable as the tools that it can use to retrieve the information it needs. So we're actually using Amazon Bedrock Agent Core Gateway as an MCP gateway to expose those tools to the agent. When the agent first authenticates to the gateway, it's going to present a JWT, and that JWT is going to be validated by Amazon Bedrock Agent Core Identity against our chosen IDP, which in this case is just Amazon Cognito, but you could use anything you want there, right? You could use Okta, Entra ID. You're welcome to swap that out.
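To make the JWT validation step concrete, here is a minimal sketch of the kind of claim checks an authorizer performs on an inbound token. This is illustrative only: Agent Core Identity does the real work, including verifying the token's signature against the IdP's published keys, which this sketch deliberately skips.

```python
import base64
import json
import time

def decode_claims(jwt_token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT. Signature
    verification against the IdP's JWKS is deliberately omitted here."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def claims_look_valid(claims: dict, issuer: str, client_id: str) -> bool:
    """Illustrative checks: issued by the expected IdP (e.g. a Cognito
    user pool), for the expected client, and not yet expired."""
    return (
        claims.get("iss") == issuer
        and claims.get("client_id") == client_id
        and claims.get("exp", 0) > time.time()
    )
```

Swapping Cognito for Okta or Entra ID just changes the expected issuer and the JWKS endpoint the identity service validates against.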

Now, for the gateway, the first thing that's going to happen when the agent initializes is a list tools sync call to retrieve the current tool information. This is the schema for the various tools it can call — what parameters it needs to pass in. Then the agent is going to reason based upon the tools that it has, the base prompting, and the input from the user. It'll use that model, decide to call tools, and make the calls through Amazon Bedrock Agent Core Gateway — this is an MCP call — and those calls actually go to a Lambda function which implements the tool.

So in this case, the Lambda function is going to be connecting to our various log sources. You can see we've listed three here for the sake of the demo, right? Zscaler, Checkpoint, and Illumio. But I do want to make it clear, right, wherever you're hosting your logs, whether it's S3 or somewhere else, even on another cloud potentially, that's something that we can support, right? You can connect to that log source and use the agent to troubleshoot across many different sources.
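A Lambda target behind the gateway is just an ordinary handler. The stub below is a sketch: the event shape and the canned response are assumptions for illustration — the actual invocation contract comes from the tool schema you register with Agent Core Gateway, and a real handler would query the log backend rather than return stubbed data.

```python
def lambda_handler(event, context):
    """Hypothetical tool implementation behind Agent Core Gateway.
    The event is assumed to carry the tool's input arguments as defined
    in the registered tool schema."""
    user = event.get("user", "unknown")
    window = event.get("time_window", "1h")
    # A real handler would query the log source here (S3, OpenSearch,
    # a vendor API, even another cloud) and return the matching entries.
    return {
        "source": "illumio",   # which log source this tool fronts
        "user": user,
        "time_window": window,
        "blocked_flows": [],   # stubbed query result
    }
```

One such function (or one per source) can front Zscaler, Checkpoint, and Illumio, which is what lets the agent troubleshoot across all three in a single conversation.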

So let's take a look at a quick demo here. I just want to walk you through the scenario.

Thumbnail 620

Thumbnail 630

So David Lee is an employee of Acme Telecom, and he encounters a security block while accessing a billing app service through an Acme laptop and from his home internet. So here's a quick front end that I built here for the sake of this demo. You can see here that we're going to put in a natural language query. Just kick it off here. There we go.

Thumbnail 650

So we're putting in a natural language query that we're passing to the agent. The agent is of course hosted on Amazon Bedrock Agent Core runtime, and currently the agent is troubleshooting that incident, right? So it's going to connect to that Agent Core gateway, it's going to authenticate, it's going to list its tools and then use those three tools to troubleshoot the incident. What I'll point out about this specific call right here is we've kind of asked it to do some generic troubleshooting, right? We haven't given it specific instructions about the specific queries it might run or the sources to look at. So it's actually going to check against all three data sources in this scenario.

Thumbnail 690

Thumbnail 700

But because it's a chatbot format, it's actually able to selectively, right, if I were to ask follow-up questions, go ahead and specifically call on individual data sources. So you can see it produced a report here. There's a summary of the findings, so it did identify some blocked traffic at this timestamp, and it's giving us a detailed log analysis. First with the Checkpoint, then the Illumio segmentation logs, where the blocked traffic was actually found, and then lastly in the Zscaler logs.

Thumbnail 710

And then it's generating this kind of final conclusion where it's walking us through the full traffic flow. We were successfully able to authenticate to Zscaler Private Access. We established a connection to Elasticsearch. The initial connection was allowed by both Zscaler Private Access and Checkpoint, but a subsequent attempt to connect to the Elasticsearch instance was blocked by our third and final network firewall, right? So this is kind of a quick demo of how the solution works.

Thumbnail 740

And I actually wanted to talk a little bit about the code here — I know that's always more fun. It's a little hard to read, so I apologize; you probably can't read the prompting here, but I wanted to include it because I think it's really important. The prompting is what tells the agent how it should call these tools. You'll notice that on top of the basics — you're an analyst, you should be respectful with the user — we actually include information about the schema for each of the log sources, and this allows the agent to dynamically write the queries it uses to troubleshoot any incident.

So the scenario that we ran was a little bit basic, right? It was just troubleshoot this incident more generally. But if you asked a follow-up question, right, something that required it to actually formulate a specific query for a specific data source, the agent can use its knowledge of the schema for each of these log sources to formulate those queries and call the appropriate tools. You'll also notice we're accessing that model in Amazon Bedrock in the upper right corner there of your screen. So this is how we're connecting to Claude 3.7 Sonnet.
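Since the slide itself isn't legible here, the idea can be sketched: the system prompt embeds each source's log schema so the model can formulate source-specific queries on its own. Everything below — field names, wording, the helper itself — is invented for illustration and is not the prompt from the talk.

```python
# Hypothetical per-source schemas the prompt exposes to the model.
LOG_SCHEMAS = {
    "zscaler": "timestamp, user_email, url, action (ALLOWED|BLOCKED)",
    "checkpoint": "time, src_ip, dst_ip, dst_port, rule_name, verdict",
    "illumio": "flow_time, consumer_ip, provider_ip, port, policy_decision",
}

def build_system_prompt(schemas: dict) -> str:
    """Compose a system prompt that teaches the agent each source's schema
    so it can write queries and pick the matching tool by itself."""
    lines = [
        "You are a network trace expert assistant for a telecom SOC.",
        "Be respectful and concise with the user.",
        "You can query these log sources; their schemas are:",
    ]
    for source, schema in schemas.items():
        lines.append(f"- {source}: {schema}")
    lines.append("Formulate queries against these schemas and call the matching tool.")
    return "\n".join(lines)

print(build_system_prompt(LOG_SCHEMAS))
```

Giving the model the schemas up front is what makes follow-up questions work: it can target one source with a precise query instead of sweeping all three.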

And we're connecting to our Agent Core gateway to access our tools, right? So we're establishing an MCP client. This is a streamable HTTP MCP client to our gateway URL which is kind of passed in, it's that blue variable that we're passing in right there. And we're also passing in a bearer token, that's the JWT that's going to get validated by Agent Core identity when we authenticate to the gateway.

Lastly, I wanted to point out that these are just two of the lines you have to add — there are four total — when you're taking an agent built with Strands, LangGraph, LangChain, CrewAI, or any of the frameworks we support. If you're trying to run it on Agent Core Runtime, you only have to add four lines of code, and two of them are shown right here. I just want to highlight how simple it is to take an agent that you might first run locally, push it up onto Agent Core, and scale it to hundreds of concurrent requests.
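For readers who can't make out the slide, the wrapping pattern looks roughly like the sketch below, using the `bedrock-agentcore` Python SDK around a Strands agent. Treat the exact module paths, the entrypoint signature, and the agent construction as assumptions to verify against the current SDK documentation rather than a reproduction of the slide's code.

```python
from strands import Agent
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # Agent Core SDK

agent = Agent()  # the locally developed Strands agent, unchanged

app = BedrockAgentCoreApp()  # added: the runtime application wrapper

@app.entrypoint  # added: marks the handler Agent Core Runtime invokes
def invoke(payload):
    """Handle a request routed to this agent by Agent Core Runtime."""
    return agent(payload.get("prompt", "")).message

if __name__ == "__main__":
    app.run()  # added: serve locally or inside the managed runtime
```

The agent logic itself is untouched; only the thin wrapper changes between running on a laptop and running at scale on the managed runtime.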

Thumbnail 850

Thumbnail 880

Solution Benefits and Key Takeaways from Implementation

Okay, Anwar, can you tell us a little bit now about the value of the solution? Yeah, thanks, Ray. So we found that organizations adopting this solution see multiple benefits. First and foremost is faster incident detection and resolution time. As I said, for the security ops team — particularly when there is a new team member or the team is stretched thin — this solution helped them resolve issues faster. What used to take hours now takes a matter of minutes, which helps the team's productivity.

Second, some organizations have even rolled it out as a self-service agent for internal users. With that, end users themselves can query the tool, using the information present in the ticket, to identify where the specific network block occurred.

They can diagnose their own issues, and when they do raise a support ticket, it is resolved faster. The volume of tickets has been reduced, and it is also helping with collaboration: the developer team and the security operations team can communicate in a much better way.

Many times developers raise an issue while the security operations team is spread thin and cannot resolve it within a couple of days, which also reduces developer productivity. With this tool, there is increased agility for development within the organization. And as I mentioned, the agents themselves write the queries, so you can interact with the chatbot using natural language — you don't need to learn sophisticated query languages and tools. It's pretty useful in the way it alleviates a lot of these pain points.

Thumbnail 990

So what are some of the key learnings and takeaways when you build this solution? Thanks. So I think a few key takeaways I wanted to highlight from this solution. The first is just the power of generative AI on AWS for this specific use case. Building with the Bedrock SDK, we were able to dramatically reduce the time to resolution for incidents, really from hours just to minutes. Previously, the security analysts had to go and comb through the logs to understand and triangulate across various log sources, and now the agent can really do some of that undifferentiated heavy lifting for the security analysts to help them find the incident root cause sooner.

The second thing I'll point out is that Amazon Bedrock Agent Core Runtime can really dramatically cut down the time to put this agent into production. So I started by just building this locally on my computer, added just four lines of code, and was able to push the agent onto Agent Core Runtime and then scale it up in just a few minutes. So I think that's worth highlighting.

The third thing is a crawl, walk, run approach. We started with just a portal where a security analyst might interact with the agent. I think as part of the walk step here, you could actually have a self-service portal where individuals could go in when they hit a network block. They could interact with the agent, find the root cause, and then maybe they only have to submit a ticket if they want some policy change that's blocking them currently. So that eliminates some of the heavy lifting that the security analyst has to do right now.

And then the final run aspect of this could even be using the same interface and the same agents to actually investigate dangerous security incidents. So if you had many failed requests, that might indicate potentially that there was some sort of bad actor that was operating against your system, and there's an opportunity here to use the same system to look at many different timestamps and try and identify what blocked it, or maybe if it wasn't blocked, what should have blocked it.

I'll point out that Amazon Security Lake, we use that to centralize the security posture around all of our log sources. And that's a good practice here. Of course, we used S3 with OpenSearch, which is obviously something you can protect with Amazon Security Lake. But if you're bringing your own potential logging sources from other places, you can also use Amazon Security Lake.

And I think this actually connects to my last point, which I really wanted to highlight, which is the power of Amazon Bedrock Agent Core Gateway. So we showed just three sources that are running on AWS for the logs. But I don't think it's fair to expect that your log sources are all going to be on AWS. You've probably got logs on-premises, for example. You might have logs on other providers. And the power of Agent Core Gateway is if you were to host an MCP server that's running where your logs are accumulated, you could then connect to it via Agent Core Gateway because it also supports an MCP server target type as well as a REST API type and the Lambda target that we were using for the sake of this demo. And you could then connect it to the gateway and let your agent use that MCP server in its troubleshooting process.

Thumbnail 1190

So that's what we wanted to talk about today. We really appreciate your time. Also wanted to throw in this slide here. If you want to level up your skills and get started with SkillBuilder, you can get started today. I'll hold the QR code here for a minute. We really appreciate your time. Thank you so much. And if you wouldn't mind, inside your AWS Events app, there's going to be a session survey. If you could fill that out, it would help us a lot. Thank you everyone, and we're happy to take questions over here once we get the mics on.


; This article is entirely auto-generated using Amazon Bedrock.
