Kazuya

AWS re:Invent 2025 - Building agentic AI platform engineering solutions with open source (OPN303)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Building agentic AI platform engineering solutions with open source (OPN303)

In this video, Niall Thomson from AWS and Hasith Kalpage from Cisco Outshift demonstrate how platform engineering teams are integrating AI agents to reduce developer toil and improve productivity. They showcase a practical demo where AI troubleshoots CI/CD pipeline failures by querying multiple systems through MCP (Model Context Protocol) and Agent-to-Agent protocols, automatically identifying misconfigurations and proposing fixes. Hasith shares Cisco's real-world implementation using CAIPE (Cloud Native AI Platform Engineering), which eliminated their three-engineer support desk by automating tasks like LLM key provisioning and dev machine requests that previously took hours. The session emphasizes building centralized, reusable AI capabilities using open source frameworks like LangGraph and Strands, with agents accessing platform context through Backstage, ArgoCD, and Kubernetes MCP servers across multiple developer interfaces including CLI, Slack, and IDEs.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Welcome to re:Invent: AI-Powered Platform Engineering for Developer Productivity

Welcome everybody. I will say getting up and chatting in front of all you folks usually would be one of the more stressful things that I spend my time doing, but it's been a rough month, so I am very happy, even more than usual, to be up here to chat with you today. Thank you so much for taking the time to come down and see us. I know it's 9:00 AM, fresh and early on a Monday. I assume for most of you this is probably your first session, so welcome to re:Invent. We're so happy to have you here.

When I let my co-speaker know that we had the Monday 9:00 AM session, I could see in his eyes he thought we would be talking to an empty room, but we have quite the opposite here, so thank you for showing up. My name is Niall Thomson, and I'm a Container Specialist Solutions Architect at AWS. That tongue twister means that I spend most of my time talking to customers about why their EKS clusters have run out of IP addresses. If you know, you know, but for those of you who don't, I'm very happy for you. I'm joined today by my co-speaker, who will introduce himself.

Hi, my name is Hasith Kalpage, and I lead platform engineering and security at Cisco's incubation unit.

Okay, cool. So I'll be kicking us off here, and then Hasith will take over in a little bit.

Sounds good.

Thumbnail 80

We have a packed agenda today. This is a 300 level talk. We have a brief introduction, but we do not have time to talk about what is platform engineering. We don't have time to talk about what are LLMs. We kind of have to just jump straight into stuff that is more useful to you, that you can take home and use in your organizations when you get back from Vegas.

So we're going to jump straight into how we are seeing organizations provide AI capabilities to platform teams and to application teams on top of the platform capabilities that already exist. We're going to look at how they are using open source tools to do that. Hasith will show us some great stuff from what Cisco has been doing in their platform as part of Outshift, and then he will talk about how they're open sourcing that work so that hopefully we can jumpstart you to try some of this stuff out yourself. And then we'll be wrapping up.

Thumbnail 130

The State of Platform Engineering: Challenges in Developer Experience and Cognitive Load

So, platform engineering. Everything we cover today builds on platform engineering, and it continues to be a very popular way for customers to operationalize on top of AWS. The DORA report from this year shows that 76% of organizations say they have at least one dedicated platform team, at least one, some more than one. But in terms of what platform engineering actually is, I like to think about it in terms of building centralized capabilities.

Some of these capabilities might be golden paths to production, right? Stamping out a standard CI/CD process that gets you from zero to production really fast. Maybe self-service capabilities, right? Backstage, for example, being a very popular developer portal for that. Standard observability that comes out of the box and just works once you've stamped out your application, and usually some sort of abstraction around the compute. A lot of the time we see this being Kubernetes, but we have AWS customers who are building platforms on top of Amazon ECS and even AWS Lambda as they embrace platform engineering. So it doesn't matter what you're using to run your workloads; we're seeing everyone approach platform engineering in some way.

Now, when it comes to platform engineering, there is a certain amount of technology bingo that you inevitably have to play.

Thumbnail 210

Sometimes I like to call this the CNCF hipster stack. These obviously are not by any means technologies I am recommending. These are just some examples of some of the things that we most commonly maybe see folks doing, especially in the Kubernetes world. There are like five alternatives to each one of these that you could potentially choose. Do not take these as recommendations from me, but from an open source perspective, this is what we see a lot of. And these are tools that are open source, battle tested, and we've seen a lot of community adoption around, and they'll also help us break into some of the AI stuff a little bit later.

So let me ask: who here would say they're already doing some form of platform engineering in their organization?

Thumbnail 250

Yeah, quite a few of you. So you of all people will know that getting platform engineering up and running can be pretty challenging even at the best of times. You have to combine all those tools that we just saw with your infrastructure, your organization, how it works. It can get very complex very quickly. And if you don't build a good developer experience around all those tools, your adoption is probably going to struggle.

The developers that do adopt your platform are probably going to struggle using it, and the cognitive load they experience can start to spiral, right? That could be abstractions that just weren't thought through all the way, or it could be documentation, a wiki that hasn't been updated since 2022, right? There are lots of reasons why this can start to stumble. And what ends up happening is the platform team just ends up handholding all of the time, right, instead of building out more capabilities and more features.

And what we're not looking for here are platform teams that are just vending out Kubernetes clusters. We're looking to build solutions that make developers more productive. You're not going to be able to do that if you're simply answering Slack messages or JIRA tickets all day, just trying to get them through the day.

Thumbnail 320

Now, one of the open source initiatives at AWS that we helped form to try to help with these platform engineering conundrums is the Cloud Native Operational Excellence, which is quite a mouthful, but shortened to CNOE. This is essentially a group of large organizations that have embraced platform engineering that are collaborating out in the open and sharing their approaches, their strategies, their tool sets, and reference architectures for how they're going about doing this. So if you haven't checked out CNOE, it's potentially very useful to take a look at for your platform engineering efforts, and maybe you'll learn something from taking a look there.

Thumbnail 370

Beyond Coding: Expanding AI's Role Across the Developer Workflow

But another tool that we're seeing organizations reach for is AI, which is probably unsurprising this year of 2025. It's been pretty cool to see AI evolve since it first started, from LLMs where we just threw it a prompt and got a paragraph of text or some code or an image, moving on to slightly more sophisticated agents that would start to be able to break down tasks into smaller steps and potentially reach out to external information through APIs or even take actions. I think one of the things that we're starting to see even more recently is giving those agents on the far right here even more autonomy, where they're maybe not even triggered manually. They are reacting to events in your existing architecture and taking at least some form of action, usually still with some level of review and approval so that you're spending less time having to ask the AI to do something, and it's kind of just starting on your behalf.

Thumbnail 430

Now when it comes to developers and AI, coding seems to have been a pretty popular use of LLMs. The DORA report again this year, and this caught me off guard a little bit, has been renamed to "The State of AI-Assisted Software Development." So the State of DevOps is gone, but 90% of respondents report using AI in their daily work, and 80% said it's made them more productive. Now, whether you believe the 80% number or not, I'll leave that up to you. There are lots of different studies and reports being done, but 90% of developers report using it, which means the ship has sailed and people are using it daily.

Thumbnail 480

So as platform engineers, we need to start to figure out how we dovetail AI with our platform abstractions so that the AI is working with our platforms and developers are not pumping out code and configuration that just doesn't work with our platforms that we've built in our organizations to make them more productive in the first place. Another useful stat: the median developer spends less than one hour a day coding. So if all we think about when we think about AI is coding and hands-on developing code, we're neglecting the rest of their day. Their CI/CD pipelines are going to be breaking and they have to fix them, they're going to be patching vulnerabilities, they're going to be working on issues and PRs, commenting and reviewing. They're going to be dealing with incidents at two o'clock in the morning, and they're going to be cost optimizing.

Thumbnail 530

So if we want to really make them more productive, we can't just concentrate on generating code for them. There are lots of other areas where we could benefit that naturally fit as part of platform engineering that I think we want to think about. And if we want to start to spread this to more areas of platform engineering and developers, we need to start to spread out where we can inject AI in the places where developers are. Meet them where they're at. Developers are coding in their IDE and increasingly in the CLI now. Tools like Kiro CLI, Claude Code, and all the other ones have them coding in the CLI more now. They're working on those issues and those PRs in GitHub, they're navigating their organization in Backstage, and they are working on incidents in incident management. All of these are places where we can start to inject AI to help make them more productive and to make sure our platform and the abstractions we've built and all the productivity tools that we have are further helping them along in all these different areas.

Thumbnail 580

Demo: Autonomous CI/CD Troubleshooting with Centralized AI Agents

So let's take a look at what this could actually look like as a real example. Now, I more than anyone love to give live demos of GenAI. It takes the usual risk factor of a demo and pumps it up just that little bit more, makes you a little bit more nervous. But this time last year in the Mandalay Bay, a chunk of their WiFi went out, and so I'm going to use that as an excuse to dodge that bullet today. So I have something that has been prerecorded, but I would love to be showing this live.

Thumbnail 610

So in our case here, our developer John is innovating and transforming away, coding using that one hour of time he's got, and his pipeline fails. The contents of the pipeline don't really matter that much, but we can see over on the far right our deployment to our staging environment has failed. Now the platform that John's team uses is built on Kubernetes, uses Argo CD for GitOps and, well, for some reason CodePipeline, but that was just the easiest thing for me to get up and running.

Thumbnail 650

Thumbnail 660

But we want John to figure out what's broken with his pipeline. So John is going to hop into Kiro CLI. Now, if you're not familiar with Kiro CLI, up until very recently it was Q Developer CLI; it was recently rebranded to the Kiro brand. And it's a tool that we can use in our command line to use AI directly to write applications, to deploy applications, to write tests, all sorts of great stuff. What we're showing here does not have to be done with Kiro, right? Tools like Claude Code, for example, are perfectly capable of doing this. This is the open source track, so I'm not explicitly endorsing any AWS stuff to do this. There are lots of things you can use for most of what I'm showing here.

Now, one of the things these CLI assistants often have is what we call tools, right? It's a way for the AI, the model, to reach out and usually do something or get something. It could be to read a file, write a file, or access an API. You can do all sorts of great stuff with tools. Now our platform team, if you see down the bottom there, has added a single tool called Query. I know it's a bit small, but I couldn't make it much bigger. So they've built us a custom MCP server.
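The server code itself isn't shown in the session, but a minimal sketch of that kind of single-tool MCP server, written with the open source Python MCP SDK's FastMCP helper, might look like this. The `query` tool name matches the demo; the platform agent URL and payload shape are assumptions for illustration:

```python
# Hypothetical sketch of a platform team's custom MCP server that exposes a
# single "query" tool and forwards requests to a central platform agent.
# The endpoint URL and payload shape are illustrative assumptions.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("platform-assistant")

PLATFORM_AGENT_URL = "https://platform-agent.internal.example.com/query"  # assumed

@mcp.tool()
def query(prompt: str) -> str:
    """Send a natural language question to the central platform AI agent."""
    response = httpx.post(PLATFORM_AGENT_URL, json={"prompt": prompt}, timeout=300.0)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    # Streamable HTTP lets clients like Kiro CLI reach the server remotely.
    mcp.run(transport="streamable-http")
```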

Thumbnail 730

Now in our case, our awesome platform team has built us a centralized AI that developers can use, reusable from all the different places our developers are working. So what John is going to do is use our MCP server to hook into our centralized AI and ask: troubleshoot why the last CI/CD pipeline execution failed for my payment API workload. Now he is not using any system identifiers. He's not saying anything about CodePipeline or Argo or Kubernetes or anything like that. It's a pretty vague request, but once we approve the execution, because we've hooked it into our central AI agent, it will go off and start to do a ton of work for us. It's going to crunch through a bunch of our systems, look at what went wrong, and come back with a recommendation.

Thumbnail 750

I'll give you a more detailed idea of what happened there once we've covered some of the tech. But it comes back and says: it looks like either John or one of his teammates, I'm not pointing fingers, managed to fat-finger their last update to their Helm values file and set the Kubernetes memory resource request higher than the limit. The Kubernetes API kicked it back. That caused the Argo CD sync to fail, that caused the pipeline to fail, and they got their notification.
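To make the misconfiguration concrete, here's a rough, hypothetical illustration: a Helm values snippet where the memory request exceeds the limit, plus the comparison Kubernetes effectively performs. The values layout is an assumption, since the actual chart isn't shown:

```python
# Rough illustration of the misconfiguration: a Helm values file where the
# memory request exceeds the memory limit, which the Kubernetes API rejects.
# The values layout is an assumption; real charts vary.
import yaml

VALUES = """
resources:
  requests:
    memory: 2Gi   # fat-fingered: request above the limit
  limits:
    memory: 1Gi
"""

UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def to_bytes(quantity: str) -> int:
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)

resources = yaml.safe_load(VALUES)["resources"]
request = to_bytes(resources["requests"]["memory"])
limit = to_bytes(resources["limits"]["memory"])
if request > limit:
    print("Invalid: memory request exceeds limit; Kubernetes will reject this pod spec.")
```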

Thumbnail 800

Thumbnail 810

So Kiro has come back and offered some options to remediate that. We'll take option one as a suggestion, and at this point Kiro takes all the information we got from our upstream central platform agent and does the local work on our laptop to actually put the fix into action. Now, I think this part is really important. We see a lot of GenAI demos where people just spit out a bunch of information and leave you to imagine what can be done. I really like to show GenAI closing the full loop, so we actually get a remediation to the problem we had rather than just being left with: I found a problem, what are you going to do about it?

Thumbnail 820

Thumbnail 830

Thumbnail 850

So here we're actually solving the problem. Kiro is updating the YAML file and then all we have to do is review it, push it, and our pipeline will kick through and we've solved our issue. And that's taken us from issue to remediation pretty quickly with John doing very little work other than reviewing. Now earlier, I talked about different channels the developers are working in. So since we've built this AI agent centrally as part of our platform capabilities, right, not directly in our Kiro or our Claude Code, we can call it from Slack and get the same thing back.

Thumbnail 870

Thumbnail 880

Now obviously, Slack is not going to go and update our code for us, but just to show that we get the same result from the same AI in different places, we can do this across multiple different channels and get the same result. But what if, and this goes towards that more autonomous mode, the pipeline fails, it fires an event, the AI triggers automatically and just raises a pull request to solve our problem. This is where we start to get to that more autonomous part of the equation where we don't have to trigger the AI explicitly. It's triggered through an event, but we still get to review the proposed change.

We're not just letting it run riot and update our environment on the fly. We still get to go in and check and validate the change that it made, approve it, and promote it and get all the assistance that we get from AI while still having a human in the loop.

Thumbnail 920

Thumbnail 940

To be honest, I had a bunch of other use cases I didn't have time to include. I think we're just scratching the surface here, and your imaginations are probably already covering many of the possibilities, whether it's simply helping our developers when they code and build with our platforms. Our Helm charts that we've built internally, our Terraform modules that we've defined centrally—all of this stuff becomes accessible across all of our different channels to help them build and ask questions. We can troubleshoot production issues, CI/CD issues, or anything else, trying to reduce mean time to recovery or just the amount of work developers have to do to fix them.

Thumbnail 950

Thumbnail 970

Thumbnail 980

We can use them for security use cases. I think we already have a ton of great tools for finding vulnerabilities and even tools like Dependabot and Renovate that will raise pull requests, but there's still a lot of work involved in many of these cases to actually fix the problem, and we can start to factor that in too. And finally, cost optimization—not necessarily trying to replace all the great tools that we have, but how do we take a cost optimization recommendation and actually implement it as part of the abstractions we have in our platform, whether that's things like t-shirt sized instance types or whatever else you've built to make it easy for developers to consume.

Thumbnail 1000

Building the Foundation: Agent Frameworks and Platform Context Integration

So that is some art of the possible around what you can do—a real example that I hope is pretty realistic and you could see happening in your organization. We can switch gears now from that to how that was put together, and this will dovetail into what Hasith is going to talk about in a little bit.

The first thing we need is an agent, and this will build up to our overall diagram of what that demo was built with. Now, agents themselves are, compared to the LLMs, relatively straightforward. You've got that agent loop, which is just a continuous loop of input, decision, tools, and user response, but there's other stuff that you want there. You want model flexibility because the models are evolving at such a fast rate—you have to be able to keep up and switch that out. You need session management, you need memory, you want observability to make sure you understand what's going on under the covers. These frameworks are designed so that, like any framework, you don't have to start from scratch.

And if there's one thing that we've learned from the JavaScript community, there's always room for one more framework. We can always have one more. Now this is not a recommendation by any means for what to use. There are probably three or four times as many frameworks available as I've listed on this slide—these are just some of the popular ones. Strands on the left is one from AWS, which is open source as well, and each has their own strengths and opinions on how you attack this problem. It's really up to you to figure out what makes more sense, but the point is you don't start from scratch—you've got something to work with.
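To give a flavor of how little boilerplate these frameworks leave you with, here's a minimal sketch using the open source Strands Agents SDK mentioned above. The tool and prompt are invented for illustration, and the other frameworks have similar entry points:

```python
# Minimal agent sketch with the open source Strands Agents SDK. The tool
# below is a made-up stub; Strands defaults to a Bedrock model, but the
# model is swappable, which is part of the appeal of these frameworks.
from strands import Agent, tool

@tool
def pipeline_status(workload: str) -> str:
    """Return the latest CI/CD pipeline status for a workload (stubbed here)."""
    return f"{workload}: last execution FAILED at the deploy-staging stage"

agent = Agent(
    system_prompt="You are a platform engineering assistant.",
    tools=[pipeline_status],
)

# The agent loop (input -> decide -> call tools -> respond), session
# handling, and model calls are handled by the framework.
agent("Why did the last pipeline run for payment-api fail?")
```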

Thumbnail 1090

Now, once we have our agent and our model, we've got something pretty powerful, but it's going to be very generic. It's going to give us public documentation answers, Medium posts, or Stack Overflow Q&A. What we really need to make it productive, like we saw in that example just there, is it needs to know about our platform and hopefully be able to take some actions related to our platform.

Look at all this great context you can feed into those agents to help them actually do things that are specific to your organization. Even just simple things like all your documentation, which it can search through, or the software catalog you're building out through Backstage that maps your workloads, the owners, their relationships, and the infrastructure that applies to them. All of a sudden that sort of information becomes a gold mine. All the other stuff we have, our CI/CD system that we just showed being used, cloud cost, incident management, all of this becomes the difference between asking the agent how do I deploy my app and getting a Kubernetes YouTube tutorial, versus getting how to actually deploy using your platform. This is what makes the difference, and, if we want to allow it, potentially taking actions to make that even quicker.

Thumbnail 1170

Model Context Protocol: Connecting AI to Real-Time Platform Data

Now this is where the Model Context Protocol, or MCP, comes in. Most of you are probably familiar with it, but we just want to make sure we fill out the picture. This is essentially a standard protocol for connecting AI to other stuff, letting it query things and potentially take actions on our behalf. There are a few things to call out here. Firstly, because it's a standard, it works across agent frameworks and across different models. We can build an MCP server once and hopefully use it everywhere.

Real-time data also becomes important here. We're actually hitting the APIs and getting the data back—we're not hitting a knowledge base that you've built up that could potentially be stale. There are pros and cons there, but this gets us straight to the source of the data.

And we can also take actions potentially if we want to. Lastly, the specification is being developed out in the open by a whole bunch of companies. It was originally developed by Anthropic, but now a lot of organizations including AWS are contributing to that specification.
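To ground the "build once, use everywhere" point, here's a hedged sketch of the client side using the open source Python MCP SDK. Any MCP-compliant server looks the same from here; the URL and tool name are assumptions:

```python
# Sketch of an MCP client session using the Python MCP SDK. The server URL
# and tool name are assumptions; any MCP-compliant server works the same way.
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main():
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()   # discover the server's tools
            print([t.name for t in tools.tools])
            result = await session.call_tool(    # hits the live API, not a stale KB
                "query", {"prompt": "List failing deployments"}
            )
            print(result.content)

asyncio.run(main())
```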

Thumbnail 1220

Now it's not just the specification that's being developed in the open. A lot of organizations are building their MCP servers in the open as well. On the AWS side, we have this MCP repository under AWS Labs that has 55-plus pre-built MCP servers (I made this slide a little while ago, so it's probably more by now). This gives you all sorts of great stuff. The AWS API MCP server, for example, basically gives your AI agent access to the AWS CLI without actually having shell access. It's a really interesting, token-efficient way of navigating your entire AWS account, which you obviously lock down using permissions.

Actually, and this is one of the reasons I like doing these things so early on a Monday: yesterday we announced a preview of the hosted AWS API MCP server. So you don't even have to run this yourself now. We offer you an API endpoint, protected through SigV4 authentication, that you can just hit, so you don't have to run it locally or remotely. We just give it to you and you can start using it. So take a look at that. There are blogs and things I didn't have time to add to these slides, but that's a great addition.

As for a lot of the other ones: we can use the Knowledge MCP server for the agent to hit the AWS docs, the DynamoDB one to grab data straight from DynamoDB, and the Cost Explorer one for grabbing cost information. This gives agents access to so much information, pretty generic to AWS, but still a great start. If you look more broadly at open source, there's an ArgoCD MCP server under the Argo project. Backstage now has an MCP server built in; you don't even need a separate MCP server. MCP servers are starting to get built out as first-class citizens for so many open source projects now that you can take them off the shelf. Plus, with a lot of organizations building hosted ones like we are, MCP servers are becoming something you can just grab and use without having to think about building them.

Thumbnail 1340

Multi-Agent Architecture and the Agent2Agent Protocol for Distributed Collaboration

So if we take our agent and add MCP, it starts to look like this. We can access code or issues or PRs through the GitHub MCP server. Maybe we can access pods or events through the Kubernetes MCP server, and maybe we access our Backstage catalog and tech docs through the Backstage MCP server. All this becomes pretty straightforward, and it makes our agent a lot more specific to our platform almost immediately.
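As a sketch of what that wiring can look like in practice, here's one agent drawing tools from several MCP servers at once, using the Strands SDK's MCP client support as I understand it. The server URLs are assumptions:

```python
# Sketch: one Strands agent drawing tools from several MCP servers at once.
# The server URLs are assumptions; each could be GitHub, Kubernetes,
# Backstage, or anything else speaking MCP.
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

github = MCPClient(lambda: streamablehttp_client("http://github-mcp:8000/mcp"))
k8s = MCPClient(lambda: streamablehttp_client("http://k8s-mcp:8000/mcp"))
backstage = MCPClient(lambda: streamablehttp_client("http://backstage:7007/mcp"))

with github, k8s, backstage:
    tools = (
        github.list_tools_sync()
        + k8s.list_tools_sync()
        + backstage.list_tools_sync()
    )
    agent = Agent(tools=tools)  # platform-specific almost immediately
    agent("Which pods for payment-api are failing, and who owns that service?")
```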

Thumbnail 1370

But as we start to add more and more of these MCP servers, we start to run into some practical implications. As you add more tools, agents tend to struggle to pick the right tool and figure out what to use. The bigger the task we give it, the more carefully we have to manage the model's context window, and we can't optimize for specific tasks. LangChain did a great article on this where they tried to measure the impact of single-agent versus multi-agent architectures.

Thumbnail 1400

So running multiple agents is becoming a pattern we're seeing a lot more commonly. Instead of creating one generalist agent, each of these individual agents has its own tools and its own prompts, and they become specialists in their own domains, so we can start to specialize them and make them more efficient at making the right choices and pulling the right information. A lot of the agent frameworks we saw earlier have their own opinions on how to do this. There are lots of ways to do it, even in terms of the design. But also, do you run these as a monolith in one container, or do you run them distributed like microservices? You've even got options in that regard.

Thumbnail 1440

And if you are going to run distributed agents, while we're talking open source, the Agent2Agent protocol is another thing to be aware of. Now this is a protocol that came out of Google but has been donated to the Linux Foundation. Where MCP was a standard protocol for connecting agents to information, Agent2Agent, as the name suggests, is a standard protocol for connecting agents to each other and letting them collaborate and work together, usually when they're running in a distributed way.

So a few things to note about this one. Firstly, it starts to make things interesting from an autonomous discovery perspective. Agents can find each other and do almost a form of negotiation of what each other can do so they can figure out where to delegate tasks to. Collaboration, the agents can start to work together in more of a collaborative way. And obviously it's also open source. The specification is being built out in the open and as far as I'm aware, they're working towards a V1 of the specification right now that has lots of improvements to it.

Thumbnail 1500

So if we take a horrifically simple look at A2A in practice, this means we can have multiple agents, using different agent frameworks, different models, and different MCP servers, that start to work together, communicating over Agent2Agent, which is HTTP calls using things like JSON-RPC, for example. This then allows, say, the agent on the left to find the agent on the right, figure out what it can do, and say, oh, I can use you to solve this more specialized task for me instead of doing it myself. One of the ways they do that, if you see on the right, is what we call an agent card. If you think about OIDC, OpenID Connect has that well-known endpoint that tells clients where all the different endpoints are, like your token endpoint. This is the same thing for agents. It's a well-known endpoint where the agent can advertise its name, a description, a version, the URL, but it can also give examples of things like, this is the stuff I can do and here are some examples of prompts you can send me. And this means agents can actually start to dynamically build prompts themselves based on those examples. You're basically giving prompt engineering tips to another agent through the agent card so they can work together, which is how these agents can work without super specific instructions. You still give them some hints, but it makes them a lot more decoupled and a lot easier to fit together.
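For a sense of the shape, here's roughly what an agent card might contain, shown as the Python dict an agent could serve from its well-known endpoint. The field names follow my reading of the A2A specification, and all the values are invented:

```python
# Roughly the shape of an A2A agent card, served from a well-known endpoint
# (e.g. /.well-known/agent.json). All values here are invented examples.
AGENT_CARD = {
    "name": "cicd-agent",
    "description": "Troubleshoots CI/CD pipeline executions for platform workloads",
    "version": "1.0.0",
    "url": "https://cicd-agent.internal.example.com",
    "capabilities": {"streaming": True},
    "skills": [
        {
            "id": "troubleshoot-pipeline",
            "name": "Troubleshoot pipeline",
            "description": "Diagnose failed pipeline executions end to end",
            # Example prompts double as prompt-engineering hints for peer agents.
            "examples": [
                "Why did the last pipeline execution for checkout-service fail?",
                "Summarize the build logs for execution 1234",
            ],
        }
    ],
}
```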

Thumbnail 1580

Architecture Deep Dive: How the Demo Worked with Strands, MCP, and A2A

So if we take a look at the demo that I showed earlier, this is basically what I pulled together using that stuff that I just showed you. The demo that I showed you earlier was actually more agents than this, but I didn't have space on the slides to put them all on. These are all built using Strands as the SDK for my agent framework. I was running on EKS, they're just normal deployments, they're just APIs, really. And in terms of what it looks like from a workload perspective, it's relatively straightforward.

Thumbnail 1620

Now let's take a look at the scenario we did to see how the information flowed through it. So originally we were in Kiro, our CLI, and we asked the original question, and Kiro had that MCP server we'd built. The MCP server was actually built into my agent, so I just built an MCP endpoint into that remote agent and Kiro was configured to use it. It sends an MCP call, so MCP, interestingly, can be our client protocol here as well as something on the backend. And it sent that query saying: troubleshoot my CI/CD pipeline for me, remote agent, I'd like you to help me.

Thumbnail 1650

The platform agent reached out to the agents I'd made it aware of and said, how can you all help me? It got all those agent cards back and started to figure out its strategy, using chain of thought or whatever pattern you want to call how these agents do their reasoning. And then it formulated its plan.

Thumbnail 1670

So the first thing it did was reach out to my catalog agent, which itself reached out over MCP to Backstage. It hit that MCP server right inside of Backstage and got my catalog information, turning "payment API" into a workload identifier that was more specific about what I wanted to do. So that clarified exactly what I was working with.

Thumbnail 1690

Thumbnail 1700

Thumbnail 1720

The platform agent then said, okay, I know exactly what workload you're talking about now. I'm going to go and hit the CI/CD agent, which I know can help with my CI/CD pipelines. And it then hit the AWS API MCP server to get my CodePipeline status, my CodeBuild logs. It hit my ArgoCD MCP server and saw that the application in ArgoCD failed to sync and actually got the Kubernetes events back, which said the requests and the limits were mismatching. And then finally, it reached out to GitHub. And if you were eagle-eyed earlier with the smaller text, you might have noticed there was actually a commit ID it mentioned when it was troubleshooting earlier. It actually reached out to GitHub and pulled the commit that triggered my pipeline so it could check the code change that was done.

Thumbnail 1730

Thumbnail 1740

And it takes all of that and comes back over. And the agent itself, the CI/CD agent, formulates a response to the platform agent, which then came back to Kiro. And Kiro was then able to do that little bit of work locally to update the file so that we could push the fix and resolve the issue. The Slack example over messaging worked pretty much the exact same way. All this stuff in the center that our platform team has built as a centralized capability, as an API for us, using all that open source technology, is reusable across those channels. I just, from the Slack side, called Agent-to-Agent directly instead of MCP. That was the only difference. But as you saw, you got the same result.

Thumbnail 1770

So to wrap up this section, when we're building all these capabilities as part of our platform team, open source is giving us so much to work with. We get those protocols like MCP and Agent-to-Agent that are being developed out in the open. We get those agent frameworks that we have a whole swath of to pick from based on exactly how you want to work and how your opinions are formed. And then we have the MCP servers themselves, which you're increasingly just able to take off the shelf. And platform engineering especially has a ton of these that are just available for you to use, that you can take and use as part of building out these capabilities. So with that, I'm going to hand over to Hasith to talk a little bit more about exactly what they've been doing with this in practice over at Cisco.

Cisco Outshift's Journey: From Burnt-Out SRE Teams to Agentic AI Transformation

Thumbnail 1820

Thank you, Niall. Let me start with what Outshift is. Outshift is Cisco's incubation unit, so we very much look at what's emerging in the future. Currently there are two focus areas: one is agentic AI and the other is quantum computing. For agentic AI, we have been doing various explorations and work over the last two years or so, one of the most recent being the open sourcing of the Agent Collective to power the future internet of agents.

Thumbnail 1860

Now let's look at the Outshift platform at a high level. This is a simplified overview of what the Outshift platform is. We have a single cloud provider strategy for speed. There are three environments, a dev environment, staging, and production for engineers to incubate and take ideas all the way into production. There's edge computing when it comes to things like GPUs or content processing units or where you have data concerns around inferencing or training. Then you have command and control and CI/CD aspects in another AWS account, as well as some Cisco-specific security functionality around secrets, active security, and vulnerability scanning. We also have Splunk that we use as a centralized observability platform.

Thumbnail 1920

Let's think about the history a little bit. If we go back about 15 years, we had dev and ops split. Developers were throwing things over the fence to operations, and then we introduced DevOps and SRE also came in around the same time. Then you had microservices, containers, Kubernetes, cloud native, everything exploding with complexity and diversity. Today you have ten different ways of doing something, and it's a problem.

In recent years, platform engineering has been deliberately introduced as a bottleneck. You might have read the famous book "The Goal": bottlenecks are not a bad thing, they're good. However, because it's a bottleneck with developers and platform engineers on either side, if we don't operate the bottleneck efficiently, that leads to an ineffective platform, and then that creates problems. This is why platform engineering is challenging, and in most organizations it's not as successful as they aim for it to be. In fact, AI is adding even more toil into this bottleneck. But you can also leverage AI to sustain platform engineering as we know it and evolve it for the years ahead.

Thumbnail 2020

Now, let me share a story from the last two years. I started a new role back in January 2024, responsible for all things platform and all things security at the incubation unit. What I found when I walked into the job was a somewhat burnt-out SRE team being pulled in many different directions, as happens in an incubation unit. Alongside the other efforts going on, we started this grassroots effort around how we could apply agentic AI to platform engineering. It was not a top-down project; it was very much a bottom-up project.

We had some ideas, we trialed a few things through internships, and there was one exploration project going on. We also had to change the workflow, and we were trying to decide: should we use something like Argo Workflows, or should we leverage LangGraph and think of it as a workflow engine? Which is somewhat contradictory, but it actually worked quite well, and we ended up with a multi-agent system that was quite successful. Now if you look at what's been happening in the industry since then, MCP exploded in March 2025.

You had A2A and the Agent Collective that I mentioned, and we also joined CNOE, the Cloud Native Operational Excellence group. It then made a lot of sense to open source this effort, form a special interest group in CNOE, and create CAIPE, Cloud Native AI Platform Engineering, built by the community for the community.

Thumbnail 2140

Now this is a visual representation of how it looks at Outshift. So on one end, you have the Outshift developers, and they can talk to the agentic system through many of their existing interfaces. We use Webex as an instant messaging platform, Backstage as an internal developer portal, and then you have Jira, your CLI, and your IDE. In terms of the functionality, you have knowledge bases. This is an extremely useful place to start with because most of the time in platform engineering, you have documentation in wikis, playbooks, and a lot of tribal knowledge in these types of locations, including chat history. This is very useful to gather.

And then you have live tool calling that you can use to query systems. Often you end up with fragmented data sources, like your vulnerabilities living in one system while you have to correlate them with something else. Combining all this, you can get a lot of insights and data very quickly. For some insights, even with access to the systems, a platform engineer could take two hours, whereas the agentic system can answer within a minute or two.

Now let's think about self-service, another form of tool calling. The holy grail of current platform engineering is to get to a form where you can click Submit. However, the problem is understanding what's in that form and actually filling it out the right way. Getting to the right form is not straightforward, and there's a lot of back and forth that happens. An agentic system is perfect here because it can close that gap and be almost a personal butler to any developer, guiding them through the process.

Thumbnail 2280

And then we also did some more advanced stuff. We have this EKS sandbox where, using natural language, a developer can iterate on an application on Kubernetes without having much Kubernetes knowledge at all. Now let's talk about the impact. Before any agentic AI was introduced, we had a dedicated three-engineer support desk, and there was a lot of toil there, a lot of requests coming through. We have managed to almost completely remove that and redirect that time into more creative engineering work.

And there were a lot of questions being asked across multiple Slack rooms. We had over 20 spaces where questions were being fired, and not everything was being answered quickly. Now people have AI systems they can use to get answers, and the system can also intercept and answer itself if it has high confidence. Some simple tasks, like, hey, I need an LLM key, or I need a dev machine, used to take half a day, a day, or sometimes multiple days if nobody had looked at it, but these types of things are now end-to-end automated with the agentic system helping out.

Thumbnail 2350

Thumbnail 2370

Thumbnail 2380

Real-World Impact: Automated Self-Service and Deep Agents at Cisco

Okay, so here's the internal developer portal. This is what it looks like, built on Backstage; you may recognize a few things. In the bottom right-hand corner there's an icon you can click to bring up the chat interface for the agentic system. When you click it, you get this friendly interface: hey, I'm CAIPE, how can I help you? And if you go on to ask, hey, what can you do, it'll list out many things around the CI/CD lifecycle it can help you with, based on all the tooling it has access to.

Thumbnail 2400

So let's actually go for an example. Let's say I'm doing a new agentic project and I need to get an LLM key.

Thumbnail 2420

We get this request multiple times a day. It was something that took at least half a day and many attempts, and there was a lot of contention here with so much incubation going on. Whereas when you ask something like this, the system comes back: it needs to know which provider to get it from, what model you want, and which project you're part of, so usage can be attributed and so on. You complete that, it processes the request, and the information gets sent.

Thumbnail 2440

And here's an example of the information coming back. Something that used to take half a day is done in under two minutes end to end, and it's done behind an LLM gateway, so we have all the good platform engineering practices applied. It's not a key given out without any auditing and tracing.

Thumbnail 2470

Okay, let's look at another example. Here's a Jira ticket; I'm trying to do something else. It's the same type of request: I'm asking for a development machine. I haven't exactly specified what it is; I'm saying, hey, I probably need an EC2 type of instance, can you recommend me something? And I create a Jira ticket for it.

Thumbnail 2490

Now it hits the service desk. So the service desk engineer knows, oh, Jarvis can take care of this. The service desk engineer goes and assigns it to Jarvis and the system takes over. Now we currently have a human here, but in the future potentially that assignment can be automatically done as well.

Thumbnail 2510

Now it processes the information. Jarvis is a bit like a knowledgeable new SRE, right? It's almost like it knows what systems are available. It says, hey, I can create an EC2 instance, I can create EKS as well, what do you want? And it gives all the options around it. In some ways, a more experienced SRE probably would not even mention EKS, so in this type of situation the system could potentially be improved further if you don't want it offering EKS, but here you have all the options.

Thumbnail 2550

And now I'm going to respond to Jarvis saying, hey, I need an EC2 instance. Here are the details. Use this account. Hey, give me Ubuntu please and no EKS cluster is needed. And then it goes back to an actual human in the loop approval flow.

Thumbnail 2570

So it hits the SRE service desk, and somebody in SRE needs to approve it for the request to be served. And it's approved in the normal GitOps workflow, just by giving approval on the PR. You get a link in the chat interface, which is what we use at work for collaboration. So it's not outside their existing workflow; it's similar to how normal GitOps is done.

Thumbnail 2600

Okay, and then once that's all sorted out, I get the information: hey, here are the details, you can access your dev instance. Here's a private key, sent on a secure channel, so, you know, handle it responsibly. So something complex like that is end-to-end automated with agentic assistance, and the beauty of it is the agentic system can go back and forth over any incomplete or incorrect information.

In a Jira-ticket type of situation, there's normally a lot of back and forth going on between two humans, asynchronously. Whereas here you have an agentic system which can validate and instantly respond when the user asks for specific things and information. That way, it can remove a significant amount of the toil that normal platform engineering teams face.

Thumbnail 2670

Okay, let's now think about a slightly different use case. A very typical scenario, and I think most of us have been here: there was a big outage, and the on-call SRE didn't sleep at all. It's morning, so let's help them out. I'm asking, hey, can you get the Jiras that are open so maybe we can sort those things out, giving a better experience to the customers as well as not troubling the SRE and making sure they get some sleep, right?

Thumbnail 2700

The supervisor plans the tasks it needs to do.

Thumbnail 2730

Thumbnail 2740

It needs to query PagerDuty, it needs to look at Jira for what's open, and then it needs to present its findings. This is actually called deep agents, a new concept that's emerging. The supervisor here is a deep agent that can populate a list of tasks and accomplish them. You can see the list of tasks is done, including one additional step that was discovered during the process. And then you have the response: the on-call SRE is this person, you can reach them at this particular address, and there's a link to the PagerDuty schedule in case you want to go and check it manually. It explains how it looked at Jira, and, most importantly, it has all four issues it found, with links. So I can now do something about it and take that burden off the SRE.

Thumbnail 2780

Also, looking at how this was executed: it was done within 40 seconds, and obviously many iterations and steps were involved. You can see the top half was the deep agent supervisor planning how to accomplish the tasks. Then it decided it needed two parallelized agents, it got the information processed, and then the response was presented to the user. Now, this is a relatively simple example. You can do things that are significantly more complicated, involving tens of agents, with the tooling available to you. So you can imagine, if the system has those capabilities, you can really do very complex tasks by explaining what you want to accomplish, and you can iterate on those too.
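The session doesn't show code for this, but the control flow being described, a supervisor that writes down a task list, fans out to specialist agents in parallel, and then synthesizes a response, can be sketched without any particular framework. The two "agents" here are stubs standing in for real PagerDuty and Jira agents:

```python
# Framework-free sketch of the supervisor flow described above: plan tasks,
# run two specialist agents in parallel, then synthesize. The "agents" are
# stubs standing in for real PagerDuty and Jira agents.
from concurrent.futures import ThreadPoolExecutor

def pagerduty_agent(task: str) -> str:
    return "On-call SRE: jane@example.com (schedule: https://example.pagerduty.com/...)"

def jira_agent(task: str) -> str:
    return "4 open issues from last night's outage: PLAT-101, PLAT-102, PLAT-103, PLAT-104"

def supervisor(request: str) -> str:
    # 1. Plan: the deep agent writes down the tasks it needs to accomplish.
    plan = {
        "Find who is on call": pagerduty_agent,
        "List open outage Jiras": jira_agent,
    }
    # 2. Fan out: independent tasks run as parallelized agents.
    with ThreadPoolExecutor() as pool:
        futures = {task: pool.submit(fn, task) for task, fn in plan.items()}
        results = {task: f.result() for task, f in futures.items()}
    # 3. Synthesize a single response for the user.
    return "\n".join(f"{task}: {answer}" for task, answer in results.items())

print(supervisor("Help the on-call SRE: find open Jiras from the outage"))
```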

Thumbnail 2830

Now, let's talk about the challenges. There are many technical challenges. You have the usual ones: cost, accuracy, needing golden datasets and trajectories. You end up having to use LLM-as-a-judge to evaluate the system. Then, once you get acceptable performance, iterating has a cost, because if you change a model or introduce a new agent, the system behavior can change significantly. So CI for these probabilistic systems needs a lot of thought, and you need a very decent CI process if you don't want to get burned. And obviously there are a lot of safety, security, and governance concerns introduced by agentic systems.
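As one concrete reading of what CI for a probabilistic system can look like, here's a hedged sketch of a regression gate over a golden dataset. The judge here is a naive keyword-overlap stand-in and `run_agent` is a stub; in practice you'd call a judge LLM and the real multi-agent system:

```python
# Sketch of a CI regression gate for a probabilistic agentic system, using a
# golden dataset and a judge. The judge is a naive keyword-overlap stand-in
# and run_agent() is a stub; the threshold and cases are invented examples.
GOLDEN_DATASET = [
    {"prompt": "Provision an LLM key for project atlas",
     "expected": "collect provider model project then provision via the llm gateway"},
    {"prompt": "Why did the payment-api pipeline fail?",
     "expected": "check pipeline status argo sync state and the triggering commit"},
]

def run_agent(prompt: str) -> str:
    # Stub: call the multi-agent system under test here.
    return "collect provider model project then provision via the llm gateway"

def judge_score(expected: str, actual: str) -> float:
    # Stand-in judge: keyword overlap. In practice, ask a judge LLM for a grade.
    expected_words, actual_words = set(expected.split()), set(actual.split())
    return len(expected_words & actual_words) / max(len(expected_words), 1)

def test_no_eval_regression(threshold: float = 0.6) -> None:
    # Re-run on every model swap or new agent; behavior can shift significantly.
    scores = [judge_score(c["expected"], run_agent(c["prompt"])) for c in GOLDEN_DATASET]
    assert sum(scores) / len(scores) >= threshold, f"eval regression: {scores}"
```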

For us in particular, one of the biggest issues was all the tooling that the subagents use. Those tools more often than not have privileged access, but the users each have their own RBAC: if you're an SRE, you have certain RBAC; if you're on a particular project, you have different RBAC. So there is a significant risk of privilege escalation, accidental or intentional, which needs to be safeguarded against. Now, this is a fundamental transformation both for your users and for the team itself, because there's usually a lot of distrust in AI. If you don't approach it with a growth mindset and a learning attitude, it's going to be very difficult to actually make these types of transformations possible. I would say, when you look at the technology, so many things are possible, but it's the human transformation aspects that are the most challenging in an organization.

Thumbnail 2960

Thumbnail 2980

Thumbnail 3000

Introducing CAIPE: Open Source Cloud Native AI Platform Engineering for the Community

Okay, now let's move on to what we are doing around CAIPE, the community AI platform engineer. CAIPE is very much an attempt to redefine platform engineering leveraging AI, built by the community for the community. It's a scalable system that you can apply in production. You have a built-in knowledge base and many open source agents, you can interact with developers across different workflows, and it's built on open source technologies: MCP, A2A, LangGraph. If you abstract anything to MCP or A2A, you can integrate it with the multi-agent system.

Thumbnail 3030

In terms of multi-agent system frameworks, LangGraph and Strands Agents are currently there, and you can use other frameworks as well. When it comes to agentic observability, everything is abstracted to OTel. Then you have agents, and you can have specific multi-agent systems to serve specific needs. The Backstage plugin is open sourced; it's an A2A-compatible chat plugin that integrates with CAIPE, and you can potentially integrate it with other things as well. Now, the important thing is it's not just about the chat. Things like streaming what's happening in the system and forms generated by structured input and structured output are key to a very good user experience. Simple things like presenting a form with the right fields populated make a massive difference to whether the workflow actually works for people or not.

Thumbnail 3070

Now, my big advice would be: if you're starting with something, start with the knowledge aspect, because that's a low barrier to entry. It's "help me understand my platform documentation" before "go and change my production deployments." The better context you have in terms of playbooks, wikis, and tribal knowledge, if you can get that into the agentic system, the better the outcome. A lot of the time, especially when you're leveraging off-the-shelf LLMs as we do today rather than training specific models, having that high quality knowledge is the key to good outcomes. And if you have a good system with internal knowledge, in the future you can build more capable AI systems on top of it as well.

Thumbnail 3120

Thumbnail 3150

Now, this is a unified RAG architecture. We're on the third iteration now, so it has both RAG and GraphRAG that you can use. You can ingest data from different systems, and there's an ontology agent that maps out the relationships automatically. It's all open source, so you can try it out, and also extend and evolve it for your own enterprise needs. Now, get involved in the community. You can access it at caipe.io. We have weekly meetups, and if you like the project, do give it a GitHub star to bookmark and support it. I'm going to hand over to Niall now.

Thumbnail 3200

All right, thank you so much, Hasith. I'm really grateful that Hasith came over to share what Cisco is doing in practice. It's one thing for me to stand here and show you some stuff, but having Hasith talk to what they're actually doing at Cisco was, I hope, valuable for seeing what people are doing in the real world. So with this, we just want to wrap up a bit. As I said at the beginning, this is a tricky topic for us to figure out what to share with you; obviously there is so much going on in this area, building on top of platform engineering and layering AI on top of it. I really hope that what we've shared today will maybe inspire you to try things in your organization and give you an idea of how to go about it in an actionable way. Obviously, the CAIPE project gives you a lot of useful stuff to get started.

But just to quickly recap: it's not just the folks at Cisco Outshift doing this, right? I'm talking to customers most weeks who are also looking at variations of this. There are other talks at re:Invent along these lines. I believe there's one from Salesforce that I think is a bit more on the incident management side, but is very much in this platform engineering space, which I would recommend you look up in the catalog if you want to see more. Cisco is by no means an outlier here. Lots of folks are looking at this, and I think it's worth investing at least a little bit of time in.

Open source, as with many things we do these days, is giving us a pretty solid foundation, right? We don't have to start from scratch. We have the agent frameworks, the MCP servers, and those great protocols for interoperability, all being developed out in the open, that you can use to get started quicker, along with CAIPE, which hopefully you'll take a look at. If you're interested in trying this in your organization, these can get you off the ground quicker.

So with that, thank you so much for taking the time to come down to Mandalay Bay to hang out with us this morning. We really appreciate it. Please rate the session and leave us reviews; that's very important for us, it lets us know how we're doing, and we'd really appreciate you taking the time to fill out that form. If you would like to chat with us, we're going to hang around outside for a little bit, since we're not allowed to take questions in the room right now. Other than that, I hope you have a great re:Invent, and thank you so much for coming along.


This article is entirely auto-generated using Amazon Bedrock.
