
AWS re:Invent 2025 - The AI Discussion Control Plane: How Your Agentic Team Redefines Ops (AIM418)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - The AI Discussion Control Plane: How Your Agentic Team Redefines Ops (AIM418)

In this video, the speaker introduces the Discussion Control Plane, explaining how AI will transform operations as code becomes increasingly AI-generated. The presentation distinguishes between AI assistants (passive helpers) and AI teammates (proactive collaborators) that can autonomously initiate actions and resolve incidents. A real-world scenario demonstrates how an orchestrator named Eco AI at Edge Delta coordinates multiple AI teammates to analyze issues, retrieve data, and execute rollbacks with minimal human intervention. The speaker addresses challenges of running LLMs on streaming telemetry data at petabyte scale, emphasizing the critical role of telemetry pipelines in filtering, masking, and protecting sensitive information before feeding data to LLMs. The presentation concludes by positioning this shift as an elevation of human roles rather than replacement, with engineers focusing on risk management, decision-making, and managing AI agent behavior.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

From AI Assistants to AI Teammates: Introducing the Discussion Control Plane for Modern Operations

Good afternoon everyone. I'd like to thank you for coming to this late afternoon meeting. Today I'm going to talk about the Discussion Control Plane and how operations will change in the next two years. Whatever you are dealing with regarding software, you will deal with it differently, either yourself or by using your product.

Thumbnail 30

I'd like to share this quote from Dario Amodei, CEO of Anthropic. In March 2025, about eight months ago, he said AI would be writing 90% of the code that software developers used to be in charge of. I don't know if you agree with this sentiment. I don't think we are there yet, but I think we are getting close. If it's not 90%, it's probably around 50% today. Considering where we were two years ago, that is quite a leap. Even if we are not at 90% now, I think we will be in the very near future.

Thumbnail 70

This actually sounds exciting, because now we are not doing the daunting task of writing code. But it means you will be responsible for the alerts you get at 2 a.m. about code you have no context for, and maybe even less context on the issue itself. You will also have no context about how that issue impacts other systems, which are probably also written by AI. This requires something new: a new operational pattern for SREs, for ops engineers, and for whoever is responsible for waking up at 2 a.m.

Thumbnail 110

Thumbnail 150

Thumbnail 160

Even today, without any AI getting involved, modern ops is already very siloed. People are responsible for different parts of the system and don't care about the rest. They are always reactively coming up with action steps, and even as runbooks pile up, those only capture lessons learned from past incidents. People are just looking at dashboards, trying to query data, acknowledging errors, and doing all of this with old-school tools. You cannot resolve tomorrow's problems with yesterday's tools. That's why today we are introducing something we are calling the Discussion Control Plane, where your agentic SRE teammates, or any other teammates, will be talking with your systems.

Before jumping in more, let me introduce myself. I've been working on observability for almost a decade. I started my journey with observability, and that's how I became a Site Reliability Engineer back in 2020. I'm actually an ex-developer who became a product manager accidentally, and now I'm becoming a developer again with the powers of AI. This is my seventh AWS event this year, and I'm really feeling the energy here about how AI changes everything for every role and how it will be changing in the next year. I'm already full of questions about what the future will look like.

Thumbnail 210

The agenda for the rest of today is as follows. We will talk about what I mean by teammates. We will discuss what is an assistant and then what is a teammate. Then we will talk about the challenges of running LLMs on streaming data, and we will link it with guardrails on data in telemetry pipelines, because you just cannot send everything to an LLM. Then we will talk about a cultural shift after AI. Let me talk about assistants versus teammates first.

Thumbnail 250

AI assistants are everywhere on websites now. Every website has an AI assistant that's great at fetching relevant information, including for SRE use cases. When I'm asking a question or curious about something, it helps me turn my question into a Prometheus query, or whatever query language I'm using. It's good at retrieving data from a dashboard, but you need to follow up with the discussion. You need to tell it: okay, I got my answer, but is this actually helping me? By design, they are passive helpers. They are not there to resolve the issue for you; they are there to answer your questions so that you can resolve the issue. This is good, and it's what we have been using AI for for quite a long time. But it is fundamentally passive.
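To make that passive pattern concrete, here is a minimal sketch of the loop: the assistant only translates a question into a query and summarizes the result, then waits for the next prompt. The `llm_complete` and `run_promql` functions are hypothetical placeholders, not any particular product's API.

```python
# Minimal sketch of an "assistant" interaction: it answers the question it
# is asked and then waits for the next prompt. Both helpers below are
# hypothetical stand-ins, not a real library.

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to whatever LLM you use."""
    raise NotImplementedError

def run_promql(query: str):
    """Placeholder for a call to a Prometheus-compatible HTTP API."""
    raise NotImplementedError

def assistant_ask(question: str) -> str:
    # Scope is limited to turning the question into a query and summarizing
    # the result; the assistant takes no action on its own.
    promql = llm_complete(f"Translate this question into a PromQL query: {question}")
    result = run_promql(promql)
    return llm_complete(
        f"Question: {question}\nQuery: {promql}\nResult: {result}\n"
        "Summarize the answer for an SRE."
    )
```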

Thumbnail 310

The actual difference between a teammate and an assistant is rooted in action versus information. AI assistants respond to prompts and give an answer, then wait for the next question, while AI teammates can proactively initiate actions and start doing research about an issue without you even asking about it. For AI assistants, scope and memory are limited to the questions you are asking. They only remember previous questions and discussions. AI teammates, on the other hand, are multi-turn. They think: I did something, but it didn't have the solution, so I need to do something else.

From that perspective, you can see AI assistants as helpers, while you can look at AI teammates as collaborators, almost like a junior engineer who gets better and better with everything they learn. Another difference is context: AI teammates can know more than the prompts you give them. They can learn what your service map looks like and what your dependency map looks like, to the point where an AI teammate may know your systems better than you do. That is how you can actually trust them. They can act autonomously within the boundaries that you set for them. You can say that you want this AI teammate to do certain things but not others, limit some actions, and require an approval permission for particular types of actions.
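As one illustration of those boundaries, here is how they could be expressed in code. This is a hypothetical sketch, not Edge Delta's actual configuration, and the action names are made up.

```python
# Hypothetical sketch of per-teammate boundaries: which actions run
# autonomously and which require a human approval first.
from dataclasses import dataclass, field

@dataclass
class TeammateBoundaries:
    name: str
    allowed_actions: set[str] = field(default_factory=set)    # fully autonomous
    approval_required: set[str] = field(default_factory=set)  # human must confirm

    def can_run(self, action: str, approved: bool) -> bool:
        if action in self.allowed_actions:
            return True
        if action in self.approval_required:
            return approved
        return False  # anything not listed is denied by default

sre_teammate = TeammateBoundaries(
    name="sre",
    allowed_actions={"query_metrics", "fetch_log_patterns"},
    approval_required={"rollback_deployment", "scale_service"},
)
```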

Thumbnail 410

Thumbnail 420

Let's talk about a real-life scenario we have been having. We are anonymizing the names here, but this is one of the ways to start a discussion with AI teammates. As a human, you can just go and ask it a question, but there are other ways too: a monitor can trigger a discussion, or an external event or a periodic task can. In this case, a human is asking a question: I am seeing something wrong. I am not sure if it is an incident. I just want to check what is happening. This is normally how you would behave on Slack, right? You would say that something seems off, and can we just check?
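Below is a small hypothetical sketch of those entry points: whichever source starts the discussion, the result is just a new thread that the orchestrator will route. The names are illustrative only.

```python
# Hypothetical sketch of the different ways a discussion can start:
# a human question, a monitor firing, an external event, or a periodic task.
from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    HUMAN_QUESTION = "human_question"
    MONITOR_ALERT = "monitor_alert"
    EXTERNAL_EVENT = "external_event"
    PERIODIC_TASK = "periodic_task"

@dataclass
class Discussion:
    trigger: Trigger
    message: str

def start_discussion(trigger: Trigger, message: str) -> Discussion:
    # A new thread is opened regardless of the source; the orchestrator
    # then decides which teammate should pick it up.
    return Discussion(trigger=trigger, message=message)

thread = start_discussion(
    Trigger.HUMAN_QUESTION,
    "I'm seeing something wrong. Not sure if it's an incident - can we check?",
)
```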

Thumbnail 450

Thumbnail 490

Then it starts here. There is an orchestrator for all AI teammates that knows every AI teammate in your system. Let's say you have ten AI teammates: one is responsible for SRE, one is responsible for writing code, one is responsible for design. The orchestrator, in our case Eco AI at Edge Delta, looks at all these teammates, their tools, their permissions, and their memory, and picks the teammate best suited for this question. In this case, it is the SRE, and it gives it a prompt to run an analysis to answer the question. I would like to draw your attention to the approval step, because it asks for permission before moving on. This is how you can configure the system to act within the boundaries that you set. It asks a human for approval, and then it starts its analysis and pulls the first chart it can retrieve right into the thread.
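A rough sketch of that routing step is shown below. The types and scoring are invented and deliberately naive; an orchestrator like Eco AI would weigh tools, permissions, and memory with far more context, but the approval gate works the same way.

```python
# Naive sketch of orchestrator routing plus an approval gate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Teammate:
    name: str
    tools: set[str]
    needs_approval: bool

def pick_teammate(required_tools: set[str], teammates: list[Teammate]) -> Teammate:
    # Choose the teammate whose toolset covers the most of what the task needs.
    return max(teammates, key=lambda t: len(required_tools & t.tools))

def start_analysis(task: str, required_tools: set[str],
                   teammates: list[Teammate],
                   ask_human: Callable[[str], bool]) -> Teammate | None:
    teammate = pick_teammate(required_tools, teammates)
    if teammate.needs_approval and not ask_human(
        f"Allow {teammate.name} to run analysis for: {task}?"
    ):
        return None  # stay within the boundaries the humans set
    return teammate  # the teammate then runs its tools and posts results to the thread
```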

Thumbnail 520

Thumbnail 530

Thumbnail 540

Thumbnail 550

Then it goes back to the orchestrator, and the orchestrator says: I did some analysis, and there may be some other things to look at; do you want me to go for it? The human user says: I see what is happening now. I would like to ask more questions about it. Can I also look at the log patterns? Then the orchestrator picks up another teammate just for this question. That teammate asks for another approval and then pulls in the data. I really want to draw your attention to this: normally, if you were the person trying to resolve this issue, you would be switching between tabs. You would say: now I need to look at patterns, now I need to look at dashboards. All of a sudden, you would find yourself with ten browser tabs open at the same time. But here, everything is actually in context, just in a thread.

Thumbnail 580

This conversation goes on for a while, and after some time other teammates come in. The SRE teammate reported its findings, and then the code analyzer identified the problematic commit in the latest deployment. It asked: do you want me to do a rollback? The user requested the rollback, and it was executed.

Thumbnail 650

You can continue the analysis and watch for issues. The issue is resolved with minimal human involvement, and this person doesn't need to know how to query data or orchestrate actions. They are simply asking questions in a straightforward way and letting the AI teammates do their work. I wanted to talk about one more thing here. As you can see in these threads, everything is preserved. All the data that was retrieved for this purpose and all the charts that were drawn are now preserved in the thread. You can always go back as a human user and see what actually happened, but more importantly, your AI teammates can go back and see what happened there.

Thumbnail 660

Managing Telemetry Data at Scale: Guardrails, Pipelines, and the Human Role in an AI-Driven Future

Joining elements on streaming data is more difficult than working with static data sets, and some telemetry data types are more suitable for AI than others. Logs and events have stars next to them here because they are more compatible with LLMs. By their nature, they contain words that are tokenizable by LLMs, and any log line can actually mean something to an LLM. Similarly, events like Kubernetes events actually mean something, and the LLM has enough context about them. They are easy for LLMs to understand, but they are not easy to feed to LLMs, because they come in at petabyte scale and you need to find a way to feed the LLM in a distilled way.
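One common way to distill logs before they reach an LLM is to collapse raw lines into patterns and send only pattern counts plus a sample line for each. The sketch below illustrates that idea; it is not Edge Delta's implementation.

```python
# Collapse variable parts (numbers, hex IDs) into placeholders so millions of
# raw log lines reduce to a handful of patterns with counts and one sample each.
import re
from collections import defaultdict

VARIABLE = re.compile(r"\b(0x[0-9a-f]+|\d+)\b")

def to_pattern(line: str) -> str:
    return VARIABLE.sub("<*>", line)

def distill(lines):
    patterns = defaultdict(lambda: {"count": 0, "sample": None})
    for line in lines:
        entry = patterns[to_pattern(line)]
        entry["count"] += 1
        entry["sample"] = entry["sample"] or line
    # Only this small summary, not the raw petabytes, goes into the prompt.
    return dict(patterns)

summary = distill([
    "connection timeout to 10.0.0.12 after 3000 ms",
    "connection timeout to 10.0.0.47 after 3001 ms",
    "payment failed for order 9183",
])
```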

Thumbnail 740

Thumbnail 750

Metrics are also interesting. You cannot just expect an LLM to reveal trends for you. You need to do the work of identifying that there is a metric anomaly, then pass it to the LLM. Similarly, with traces, you will do distributed tail sampling and pass the sampled traces to the LLM so that the LLM can do its analysis. You cannot just dump all your metrics and traces into the LLM, because in that case you will go bankrupt. This approach requires efficiently processing terabytes or petabytes of data, but in return it lets you run queries in plain English. Scalability, of course, is the hard part.
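As a concrete illustration of detecting first and only then handing the anomaly to the LLM, the sketch below uses a simple z-score gate. Real systems use proper anomaly detection; the point is only that the LLM receives a short anomaly summary rather than every data point.

```python
# Gate metrics before the LLM: escalate only when a value deviates strongly
# from its recent baseline, and send a compact summary instead of raw series.
import statistics

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid division by zero
    return abs(latest - mean) / stdev > threshold

def maybe_escalate(metric_name: str, history: list[float], latest: float) -> str | None:
    if not is_anomalous(history, latest):
        return None  # nothing is sent to the LLM, nothing is spent
    return (
        f"Metric {metric_name} deviated from its recent baseline: "
        f"latest={latest}, mean={statistics.fmean(history):.2f}"
    )
```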

Thumbnail 790

I see that tokens are getting cheaper, and we might in the future just feed everything to the LLM. However, no matter how cheap they become, it will still be very expensive at scale. Training and fine-tuning models on your own metrics is infeasible for most of this data. There is an unspoken truth about data flowing to AI: we are just letting everything out. We are saying, okay, LLM, just read everything: read my code, read my data, read all my intellectual property. It's all yours. This exposes a risk, and teams are now busy managing the data flowing to LLMs.

At this point, we come to telemetry pipelines. When telemetry pipelines first came out, they were used to compress data, archive it to a destination, and store telemetry data sets in an S3 bucket, keeping the data safe. Then we discovered nicer ways of seeing the data, and the cost actually went up by another 100 times. We could send the data to Splunk or Datadog, but if you just ingest all of this data there, you will also go bankrupt. That is why telemetry pipelines came into play: to reduce the data, to filter the data in an intelligent way, and to mask data. This is what we know as telemetry pipelines today. This is good value, but it is not strategic. It is about reducing cost, but it does not make the data speak for itself.

Thumbnail 880

Thumbnail 890

Now we have LLMs as a destination for telemetry pipelines, which again costs 100 times more. Using telemetry pipelines, you can provide high value to AI while still keeping the data under control. I know a lot of companies are complying with all these security standards and certifications, yet they are just sharing their source code and their architecture bluntly with LLMs. With telemetry pipelines, you can say that you would like to mask the data that goes to the LLM. You can say that you do not want to leak any information about your users, their personally identifiable information, or your own code and service topology.
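A minimal sketch of that masking step, assuming only simple regex redaction of emails and IP addresses, might look like the following. Real telemetry pipelines ship far more robust processors; these patterns are illustrative only.

```python
# Redact obvious PII before any log line leaves the pipeline for an LLM.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"), "<ip>"),
]

def mask_for_llm(line: str) -> str:
    for pattern, token in REDACTIONS:
        line = pattern.sub(token, line)
    return line

print(mask_for_llm("user jane.doe@example.com failed login from 203.0.113.7"))
# -> "user <email> failed login from <ip>"
```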

Thumbnail 940

When you have masking and filtering capabilities before the data goes to the LLM, it's actually protecting not only you but also your customers and your new hires that you will bring on in the coming months. They will not be exposed to sensitive information.

Thumbnail 960

Now you may ask whether AI agents are here to take our jobs. What is our role in this new paradigm? I believe this is an elevation rather than a replacement. As humans, you will still be responsible for risk management. You are still responsible for the mistakes that LLMs can make and the problems they can create.

You will need to focus on how they behave and change their behavior by adjusting the tools they use, the models they use, and the data they use. You will need to manage them actively. You will also move up to a higher layer in the organizational risk management perspective and focus on more complex cases. You will shift left in the development flow to respond to issues before they reach production.

Thumbnail 1020

You will be the ultimate decision makers about how AI agents behave and how humans should focus on these jobs. This brings me to the end of my presentation. If you have any questions, we have a data booth over there, and we will be happy to discuss these topics with you. Thank you for listening today.


; This article is entirely auto-generated using Amazon Bedrock.
