🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Service-oriented builders guide to agentic AI: Insights from WEX (ARC313)
In this video, Andrew Baird from AWS and Dan DeLauro from WEX demonstrate how traditional service-oriented architecture principles apply to building agentic AI systems. Baird explains that agents are essentially Docker containers with LLMs at their core, integrated through SDKs like Strands and MCP servers, making them familiar territory for software engineers. He covers evolved design principles including statelessness versus contextual memory, orchestration versus autonomous coordination, and new concepts like goals replacing CRUD operations and reasoning transparency. DeLauro shares WEX's journey building Chat GTS, which handles network troubleshooting and automated EBS volume management for their 40,000+ annual support requests. Using Bedrock agents, Step Functions, DynamoDB, and Kendra, they achieved production deployment in under three months with 2,000+ users. The architecture leverages Google Chat integration, Active Directory for permissions, and comprehensive observability through Splunk, demonstrating that existing distributed systems expertise directly translates to successful agent development.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Bridging Software Development and Generative AI Agents
Good morning and afternoon to everybody. This is Architecture 313. For how many of you is this your first breakout session at re:Invent? A couple people, great. Well, thank you for making it all the way to the other side of the conference. If you attended the keynote this morning, you're the expeditious ones; if you're staying close by on this end of the strip and knew you wanted to attend this instead of attending the keynote in person, you're the ones with the forethought and foresight to use your time efficiently. So anyway, thank you for being here. My name is Andrew Baird. I'm a Senior Principal Solutions Architect with AWS. I work out of our Atlanta, Georgia office. I've been with AWS for about 13.5 years, and there has been no stretch of time in my entire career in technology where I have felt as motivated, disrupted, and excited as now. My time at AWS has been mostly dedicated to developer-oriented things: building systems, DevOps, serverless, API-oriented services, and CI/CD. The amount of impact and capability that's been delivered to folks like me and the customers I talk to in the last 18 months is mind-blowing, and hopefully that resonates with everybody in the room or watching on YouTube.
We're here to talk to you about how, for folks that maybe have career histories and trajectories like myself—software builders and developers—who are coming into the fold of generative AI as it's evolved into quite a bit of focus around agents, what are some ways you can take the skills you've got already and the things you understand and bring them into this new world. We're going to have the agenda broken out as follows. I'm going to spend the first third of the presentation talking from an AWS perspective about how to think about and understand agents in the context of service-oriented architectures as a builder. I'm going to talk about how some of the design principles and important dimensions of building good distributed systems that are service-oriented really lend themselves well, or may have evolved in the world of agentic AI.
I'll drop a couple of ideas about some services that have existed even before today's keynote announcements that could help you as a builder getting into generative AI. Then I'm joined by Dan DeLauro from WEX, who's going to go much deeper into a particular use case and architecture and approach that they took when building out their first set of agentic systems inside of the organization. You can hopefully walk away at the end of the session with a lot of deeper understanding and a new level of comfort as a software builder on how you might think about approaching these topics and building agentic systems using a lot of the knowledge you have already and getting tangible, credible advice from a team and a person who's walked that walk over the last year or so. So let's jump in.
Demystifying Agents: Are Service-Oriented Architectures Dead?
Agents for builders: when we've talked about agents and maybe others have talked about agents, they're often displayed in a way like this. There's an agent that's doing things that sound really personified. They're observing, they're taking actions, they're learning. When somebody describes agents to me in these terms, for me as a technologist and practitioner of technology, it feels fuzzy to me. It makes me feel like, does that mean the capabilities have become so advanced that there's software and models living inside of the systems we deploy that have the ability to take actions and literally learn in a way that makes everything I'm used to and the way that we build software feel moot now? Does that mean technology has evolved past the point where all of the skills and knowledge that I've earned until this point is being put at risk in some way?
So it gives me this premise: does that mean service-oriented architectures are dead in some way? Do we now have software elements that have superseded the context of how we design distributed systems and the way in which they deliver capabilities? Is this the message now? Is this where we've reached? So as a technologist, what I like to do in those moments where you feel like something is just outside your grasp of understanding and you want to dig in a little deeper and understand it from a technical perspective is to dive a layer deeper and get hands on.
The Technical Reality: Agents as Service-Oriented Systems
What is observing and learning? Those things have technical meaning, and those descriptors are often used for agents, but really from a technical perspective...
At the center of the story, there are LLMs, which I don't need to describe too much. All of the capability that a model brings to the table includes the ability to reason, to have varying levels of expertise across different domains and industries. There's a way in which we want to tap that knowledge and reasoning capabilities within the context of some business scenario. I'm comfortable as a software builder. I'm not a data scientist, and I'm certainly not an AI researcher. The amount of advanced technology capability embedded inside that model is something I don't understand deeply enough to extract and describe from a builder perspective, but that's living at the heart of the systems I'm going to build with agents.
Talking to that LLM, we have an agent application, which is usually a container, a Docker container. Building that software is where I believe one of the biggest advancements over the last twelve months has occurred. Technology companies and developers have gained a much better understanding of how to integrate with models in a way that lets them have a deeper, meaningful impact on different application contexts. SDKs like Strands have made it really easy to build agent applications in those Docker containers that are going to be deployed. These SDKs define how the software is going to interact with the LLM, including how the prompts are structured, how the different turns in conversation are going to occur, and how they manage things like interaction with memory and past conversations and integration with tools.
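To ground that description, here is a minimal, purely illustrative sketch of the loop such an SDK manages for you. The model call is stubbed out with a hypothetical `fake_llm` function, and none of the names below come from the Strands API; the turn-by-turn structure of the loop is the point:

```python
def fake_llm(prompt: str) -> dict:
    """Stub model: asks for a tool on the first turn, answers on the next."""
    if "TOOL_RESULT" not in prompt:
        return {"action": "use_tool", "tool": "get_order", "args": {"order_id": "42"}}
    return {"action": "final_answer", "text": "Order 42 has shipped."}

def get_order(order_id: str) -> str:
    # Stand-in for a real dependency (a database, an API, an MCP server)
    return f"order {order_id}: status=shipped"

TOOLS = {"get_order": get_order}

def run_agent(user_request: str, max_turns: int = 5) -> str:
    # The SDK's job: structure the prompt, manage turns, wire in tools
    prompt = f"SYSTEM: You are a support agent.\nUSER: {user_request}"
    for _ in range(max_turns):
        decision = fake_llm(prompt)
        if decision["action"] == "final_answer":
            return decision["text"]
        result = TOOLS[decision["tool"]](**decision["args"])
        prompt += f"\nTOOL_RESULT: {result}"  # feed the result back to the model
    return "max turns exceeded"

print(run_agent("Where is my order?"))
```

A real SDK handles far more (streaming, memory, structured tool schemas, retries), but the essential shape, a loop that alternates model calls and tool calls until the model declares it is done, is the same.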
All of those new elements of how to integrate with an LLM so that you can get application-level capability out of that model are packaged in neat and very abstracted ways that developers are much more comfortable with. SDKs like Strands provide this abstraction. I'm going to draw a box around those two things: the Docker container-based deployed application and the LLM. I'm going to call that an agent. You build those agents in pretty abstracted ways using these SDKs that make it easy. For how requests and responses map into that agent, we've got a use case, whether it's an agent sitting inside some chatbot architecture or, more commonly, folks are recognizing that the ability to build autonomous systems that sit within business workflows and data architectures is valuable.
Those inputs and outputs might be the same types of messages and events that you prepackage from other upstream applications and services you have. There's some notion of a request making it into the Docker container you built and how that request is going to relate to the prompts that are sent to the LLM sitting at the center. That agent application and the SDK used to build it has some mechanism for that agent software to interact with dependencies. Maybe the next biggest advancement that's made building agent applications easier is the emergence of MCP, which was released about twelve months ago this month, I think, when that standard was first released and really matured to the point of being adoptable in an enterprise production landscape just maybe six or seven months ago as the authentication story matured a little bit.
So now we have a Docker application and a bunch of new software capabilities on how to interact with models, but then a pretty standardized way to make requests out to what our dependencies might be. Those dependencies can be software applications, databases, data resources, document repositories, all exposed via a standard integration protocol in MCP that uses service-oriented mechanisms that folks are used to. There may be other agents that it's interacting with via protocols like A2A. Those MCP servers that you're writing may be the integration mechanisms for integrating with those agents too, so we've got a couple of different options for integrating with agents. Collectively, you have a story here that feels like a service-oriented architecture. We've got Docker containers, we've got integration mechanisms, and we've got the ability for defining dependencies and how they interact with each other and sit in the context of some business use case.
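As a rough illustration of what an MCP-style server does, exposing tools for discovery and dispatching standardized calls to them, here is a much-simplified sketch. Real MCP uses JSON-RPC 2.0 over stdio or HTTP with a richer schema; the `tools/list` and `tools/call` method names mirror the protocol's shape, but everything else is a stand-in:

```python
import json

def list_orders(customer_id: str) -> list:
    # Stand-in for a real back-end source of truth
    return [{"id": "42", "status": "shipped"}]

TOOLS = {
    "list_orders": {
        "description": "List past orders for a customer",
        "handler": list_orders,
    },
}

def handle(request_json: str) -> str:
    req = json.loads(request_json)
    if req["method"] == "tools/list":
        # Discovery: agents learn what tools exist at runtime
        result = [{"name": n, "description": t["description"]}
                  for n, t in TOOLS.items()]
    elif req["method"] == "tools/call":
        tool = TOOLS[req["params"]["name"]]
        result = tool["handler"](**req["params"]["arguments"])
    else:
        raise ValueError(f"unknown method: {req['method']}")
    return json.dumps({"id": req["id"], "result": result})

print(handle('{"id": 1, "method": "tools/list"}'))
```

The discovery step is what makes these servers feel like service integration mechanisms: any agent that speaks the protocol can find and call the tools without build-time coupling.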
To me, that feels like service-oriented architecture. Everyone here should be encouraged. I believe that the way agentic applications are truly being built and developed has tons of symmetry to the same types of systems we've been building for decades. I'll talk in a minute about different service-oriented architecture principles that still largely apply. There's a lot of net new patterns that are emerging.
It seems like every week new libraries go viral and new capabilities are announced and launched. As much as it feels like our heads can spin at the speed of innovation occurring right now, software engineers and technology builders have spent years building the muscles of adapting and remaining flexible in the patterns they embed into their designs. All of those muscles set us up as an audience to take advantage of these technologies rather than simply be disrupted by them. It sets us up nicely, so take confidence in that premise.
Traditional Design Principles That Still Apply to Agentic Systems
Lastly, do not forget nonfunctional requirements. As somebody who talks to hundreds of customers throughout the year about what they want to do with agents, I have seen how they have been building so far. The technical magic that seems to be occurring as folks build these systems makes it very easy to forget things like operational excellence, security, and resilience. These things do not come for free; they need to be intentionally built into your designs. A lot of the folks who would attend a session like this have probably been that voice in the room in the past while building systems, knowing that things like scalability are important to include up front in the design thinking.
All of those tendencies you have as a builder are going to lend themselves to making you really valuable as agent applications are being built and developed. So let's talk about some design principles. On this next set of slides there is going to be an evolution of some principles and terminology that hopefully folks are largely familiar with and how they would apply in an API-based distributed system and the design principles that apply. Then we are going to talk about some new ones that we will try to explain in a set of terminology that may feel more familiar in the lexicon of building distributed systems but are new things you need to consider inside of building agentic systems.
Here is a set of equivalent design principles I am going to say when you are building agentic systems versus all the distributed systems you have built historically. Loose coupling still totally applies. How you deploy those Docker containers, how you think about the benefits of things like asynchronous processing, having no shared concerns between systems, being able to scale various aspects of a distributed system independently. The way in which an agentic use case is going to sit inside the architecture, all of those principles still apply in very much the same way.
For modularity, I think this is one where we have seen a tendency toward anti-patterns with agents compared to service-oriented architecture. Rather than building a single set of prompts and system instructions that is expected to handle many different tasks and contexts within the same use case, think about agent granularity much the way you think about service granularity when building microservices and service-oriented architectures. The same types of tradeoffs exist. If you have a use case where it is logical to break down tasks along business-logic lines, ownership lines, or security segmentation, building multi-agent systems along those lines will give you the same types of benefits you would have gotten from building services that way.
On the flip side of having modular agents, make the tools that you build reusable. Just as a single REST service can satisfy the requirements of many different clients integrating with it, you want to build your tooling components, your MCP servers, with that same mindset. Say you have a retail context with an MCP server that talks to your order-history systems and exposes information about past transactions and orders. If various agents in a customer-support context, an e-commerce buying context, and a fulfillment context each have their own MCP servers talking to the same eventual back-end sources of truth, you are going to find yourself with the same kinds of consistency problems that would have emerged in a service-oriented landscape before.
Your tools and MCP servers should be built in a reusable way by the appropriate business-line domains that own those topics. Instead, we have seen more of a pattern emerge where the agents being built are often very use-case oriented and sit close to the edge of an organization. Teams are building Model Context Protocol tools for themselves that their own agents need, and they are building proxies between their own agents and the back-end architectures they need to integrate with. We find it is much more efficient and scalable over the long term to think of the MCP server as an extension of the service integration mechanism for those sources of truth that different agents are going to integrate with.
Just like you'd have service registries and copious documentation about what the different capabilities are of a particular distributed system, service, or API, you want agents to have the same type of capability. It should be obvious to anybody who's meant to interact with an agent or part of a technology team that is meant to integrate with an agentic system that you've built, or a developer that's being onboarded into your organization, to have a very clear understanding of what the intended boundaries are for an agent that you've built.
The good news is that a lot of these things are more self-evident when building agents for humans than when building APIs in the past. There's a lot of stuff we do in natural language when building agents that makes it easier to be discoverable. However, the same general concept applies. As you're building agents, you need to maintain team-wide, department-wide, and company-wide catalogs of where those agents are and what their capabilities are meant to do, how to integrate with them, documentation for them, and where the observability points are for them. All of those things apply to agent systems just like you would have when you're building service-oriented architectures.
Evolved Principles: Navigating Statefulness, Orchestration, and Emergent Capabilities
Now let's talk about a few principles that have evolved a little bit. At their core, there are ways in which the knowledge you have about a prior principle in service-oriented design is going to transition naturally into the new world. However, there's a little nuance or a way that some contradiction might have emerged that you need to take into account.
The benefits of statelessness in service-oriented design have been well understood for a really long time. The benefits that gives you to resilience, scalability, and deployment safety mean that having as much statelessness embedded in an architecture as possible was always a practical thing to strive for. But now we know that in agent applications, having persistent contextual memory across multi-turn conversations or use cases within workflows that require multiple turns is important. Having contextual memory be embedded as part of the context window for the agent that's performing the tasks or achieving the goals that you define is an important part of getting the most capability as you can out of an agent.
So there are elements of the data architecture, particularly memory, that might feel a little unnatural as a distributed systems builder coming from the past. Being comfortable with the idea that the runtime application of an agent is going to have a lot of what you can think of as session state, conversation state, and memory built within the same runtime infrastructure that's deployed adds a couple of different dimensions that you need to be conscious of.
As memory gets embedded in these multi-turn conversations, things like scalability and deployment safety across situations when users are in the middle of having a conversation and work is in the midst of being done become important considerations. Understanding what that means for your use case, the idea of whether you're going to disrupt a multi-turn conversation in the midst of it happening, and what the user experience is meant to look like in real time are all critical factors.
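One way to reconcile statelessness with contextual memory is to keep the memory in an external store and rebuild context on every request, so the compute itself stays stateless and deployable. A toy sketch, with a plain dict standing in for something like DynamoDB or a managed memory service:

```python
MEMORY_STORE: dict = {}  # session_id -> list of turns; stands in for DynamoDB

def load_context(session_id: str) -> list:
    return MEMORY_STORE.get(session_id, [])

def append_turn(session_id: str, role: str, text: str) -> None:
    MEMORY_STORE.setdefault(session_id, []).append({"role": role, "text": text})

def handle_turn(session_id: str, user_text: str) -> str:
    context = load_context(session_id)  # rebuild context on every request
    reply = f"(seen {len(context)} prior turns) you said: {user_text}"
    append_turn(session_id, "user", user_text)
    append_turn(session_id, "agent", reply)
    return reply

print(handle_turn("s1", "hello"))  # first turn: no prior context
print(handle_turn("s1", "again"))  # second turn: two stored turns
```

With this shape, any instance of the runtime can serve any turn of any conversation, which restores most of the scalability and deployment-safety benefits that statelessness used to give you.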
Next, let's talk about orchestration versus autonomous coordination. In the world of distributed systems, the logical graphs you built and the dependency flows and structure of your architecture were deterministic: they could be documented and understood at design time, using capabilities like Step Functions and other graph-based workflow platforms.
But with agents, you have the ability to let the coordination emerge at runtime as the type of nondeterministic work that you might want to take advantage of. It's possible for there to be multiple use cases that a single agent system is going to satisfy. Depending on the type of request that comes in and the context for it, you may have different specialized agents downstream that handle different components of it.
Understanding how coordination is meant to relate to each other and how that coordination is described, and how to allow it to emerge at runtime versus defining it in a very discrete way, involves trade-offs that you need to carefully consider.
Dan's portion of the conversation may address use cases where you want more determinism in the workflow that you're building and you don't want a lot of emergent coordination that's more autonomous and nondeterministic at a macro level. You can combine some of these principles with the principles and patterns you have from service-oriented architecture that will still apply.
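The contrast can be sketched in a few lines: a supervisor that defers the routing decision to runtime rather than encoding it in a design-time graph. Here a stub classifier stands in for the nondeterministic LLM decision, and all names are illustrative:

```python
def network_agent(req: str) -> str:
    return f"network specialist handled: {req}"

def billing_agent(req: str) -> str:
    return f"billing specialist handled: {req}"

SPECIALISTS = {"network": network_agent, "billing": billing_agent}

def stub_classifier(req: str) -> str:
    # Stand-in for the LLM's routing decision, made at runtime
    return "network" if "vpc" in req.lower() else "billing"

def supervisor(req: str) -> str:
    choice = stub_classifier(req)  # coordination emerges per request
    return SPECIALISTS[choice](req)

print(supervisor("Why can't my VPC reach the data center?"))
```

Swapping the stub for a real model call is exactly the trade-off being described: you gain flexibility across unanticipated requests and give up the design-time certainty a Step Functions graph would have given you.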
Next, let's discuss service contracts and capability emergence. It has been fundamentally important in good API design to understand concepts like backward compatibility and the client-server relationship at the contract level, including what the different data elements are and the allowed values and ranges of those values. This was a key part of what allowed service-oriented architectures to proliferate and be stable. However, in the world of agents, capability can emerge at runtime, which is a benefit. As tools change and capabilities are deployed, you don't have to update layers of your agentic stack to expose those things or expect the agent to evolve its behavior through coding changes.
As tools, descriptions, and capabilities emerge over time, the agent will discover them at runtime and allow new capability to emerge as tools are built, change, and as capabilities are released. These things can emerge in real time. There are some totally new design principles that folks need to get comfortable with.
New Design Principles: Goals, Reasoning Transparency, and Nondeterminism
Software engineers often joke that a lot of what our career has boiled down to has been building CRUD—create, read, update, delete—in a million different business contexts. But in the end, if you're talking about business logic, there's some way in which the code you're working on any given day is going to boil down to building some form of CRUD related to some type of business domain or technical domain object. That helps ground your design thinking and the patterns that you're going to build. Things are very different now, and the way in which you think about how to prioritize what use cases are a good fit and what models are going to be practically applied requires you to pivot your design thinking to the idea of goals.
Goals are the new unit of technical work being done rather than CRUD. If you can distill the type of value you want your software system to deliver to your business or your customers in the context of a very succinctly describable goal, there's a good chance that a model will understand the work that you're trying to achieve. This will give you a better idea of what data and what context needs to be delivered into that model to achieve the goal.
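As a hypothetical illustration of goal-oriented framing, here is a small helper that assembles a succinct goal plus the supporting context into a prompt; the prompt shape and field names are assumptions, not any SDK's required format:

```python
def build_goal_prompt(goal: str, context: dict) -> str:
    # Frame the work as a goal plus the context needed to achieve it,
    # rather than as a CRUD-style operation on a domain object
    lines = [f"GOAL: {goal}", "CONTEXT:"]
    lines += [f"- {k}: {v}" for k, v in context.items()]
    lines.append("Explain each step you take toward the goal.")
    return "\n".join(lines)

prompt = build_goal_prompt(
    goal="Resolve the customer's failed EBS volume expansion",
    context={"account": "123456789012", "volume_state": "optimizing"},
)
print(prompt)
```

The useful design exercise is the first line: if you cannot state the goal that succinctly, the use case may not yet be a good fit for an agent.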
Reasoning transparency is the whole idea of reasoning as a deployable piece of technology, which is fundamentally new on its own. It means that observability is fundamentally different too. The logs we used to gather only needed to include metadata about the actions that were taken and the events that occurred within the sequence of code, such as timestamps and where elements of code executed. Now there is a nondeterministic ability for software to think inside of your architecture, and you need to think about how to ingest that reasoning into an observability architecture and use it thoughtfully when you have operational reasons to review, analyze, or debug it.
Having a good set of ideas about how reasoning transparency, and the evidence of it, becomes part of your operational pipelines and architectures is a key new thing to think about; it may even mean building other distributed systems to ingest that reasoning.
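One way to treat reasoning as a first-class observability signal is to log it as structured events alongside the familiar action metadata, then ship those events to a pipeline like Splunk. A minimal sketch, with illustrative tool and field names:

```python
import time

TRACE: list = []  # stand-in for a log stream shipped to an observability pipeline

def log_event(kind: str, payload: dict) -> None:
    TRACE.append({"ts": time.time(), "kind": kind, **payload})

def handle_step(reasoning: str, tool: str, args: dict) -> None:
    log_event("reasoning", {"text": reasoning})            # the new "why" signal
    log_event("tool_call", {"tool": tool, "args": args})   # familiar action metadata

handle_step(
    reasoning="Volume is near capacity; expanding is the lowest-risk fix.",
    tool="expand_ebs_volume",
    args={"volume_id": "vol-0abc", "size_gib": 200},
)
print([e["kind"] for e in TRACE])
```

Pairing each action event with the reasoning event that preceded it is what makes nondeterministic behavior debuggable after the fact.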
Self-correction relates to patterns like circuit breakers, where error and fault handling has long been a large portion of what makes distributed systems reliable. Agents have a general tendency to want to self-correct and find ways to achieve their goal via multiple paths. There may be times when this is a big benefit for you, and other times when you may not want an agent to take advantage of it. At build time, as you define the agentic system you are designing and deploying, you need a very explicit understanding that this tendency exists inside the agent you are building and what it might mean.
You need to consider the different error scenarios that you'll encounter, whether it be a tool being unavailable or access changes that limit capability, and how an agent may react to those things. What would you want its behavior to be in those scenarios? You need to take those things into account as you're building your instructions.
As you're building your instructions and designing your downstream dependency architectures, this is net new work that you should be thinking about as a service-oriented builder. Most fundamentally, there's the idea of nondeterminism in general. It changes the type of testing you're able to do; AgentCore Evaluations, announced just this morning, can play a huge role here. The fundamental premise that the software work being done is nondeterministic is going to change a lot of the typical ways in which you've achieved operational goals for agentic systems compared to how you worked before.
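The idea of deciding fallback behavior at build time can be sketched as a policy wrapper around tool calls: a simple circuit breaker that returns explicit instructions to the agent instead of letting it improvise a workaround. All names and thresholds here are assumptions:

```python
FAILURES = {"restart_service": 0}
CIRCUIT_THRESHOLD = 3  # assumption: open the circuit after three failures

def restart_service(name: str) -> str:
    raise ConnectionError("tool endpoint unreachable")  # simulated outage

def call_with_policy(tool_name: str, *args) -> str:
    if FAILURES[tool_name] >= CIRCUIT_THRESHOLD:
        # Explicit, build-time-defined behavior instead of improvisation
        return "ESCALATE: circuit open, route this request to a human"
    try:
        return globals()[tool_name](*args)
    except ConnectionError:
        FAILURES[tool_name] += 1
        return "RETRY_LATER: tool unavailable, do not attempt workarounds"

for _ in range(4):
    print(call_with_policy("restart_service", "auth-api"))
```

Returning a sentence the model can read ("do not attempt workarounds") rather than a raw exception is one way to constrain the agent's self-correction tendency along the paths you actually want.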
AWS Tools for Building Agents: Strands, Bedrock, and Agent Core Runtime
Now I'm going to run through things that hopefully, if you're in this session, you understand. These services are available already, and I've mentioned Strands a little bit. AWS's guidance here is to adopt an agent framework in general. Don't feel like you need to write software that integrates with all your models in some sophisticated way from scratch without taking advantage of the SDKs in the market. Strands was released at an opportune time, when models' reasoning capabilities had advanced to the point that we could build an SDK hitting a really nice sweet spot: it lets you develop quickly and reach production quickly, at a level of abstraction that leans into the models' ability to reason without a lot of design-time complexity, but still with a ton of robust operational and security benefits.
Beyond choosing the right agent SDK for building and authoring the code related to everything I've described so far, we have the Kiro CLI and the Kiro IDE. You also have the ability to adopt Claude Code, and you can have the model inference for the Claude models behind Claude Code run on Bedrock. If you're a team that manages model governance and access inside your enterprise and you like the Bedrock model of that, you can have Claude Code be backed by a Bedrock model. As software engineers learning these things, there are new capabilities that are going to help you author the software and produce it for you without you having to get down to that level of detail yourself.
Then, obviously, there's Bedrock AgentCore Runtime. As you become more and more responsible for the high-level translation of business requirements into the prompting and co-development you're going to do with agentic systems, having an infrastructure platform like AgentCore Runtime, which abstracts as much as it can on the infrastructure side in a serverless way and gives you confidence that the AWS reputation for operational excellence and security is embedded in that story, is going to help you accelerate and get to production more quickly. AgentCore Runtime, where your Docker containers get deployed, can also be where the MCP servers you build get deployed, so all the different tools you might build for your agents can live there as well.
WEX's Journey: From Pilot to Production with Chat GTS
That's all the general overview that I'm going to provide you today. I'm going to pivot over to Dan, and Dan from Wex is going to come up and talk to you about their experience building agents as a company. Good afternoon, everybody. My name is Dan DeLauro. I'm a solutions architect on the cloud engineering team at Wex. Last year I was sitting out there in a session just like this, and I was hearing about Bedrock and agentic and all these new tools that make it look really easy to build with AI, but it always kind of felt a little bit out of reach. I'm no data scientist. I don't have a background in machine learning, and I'm not building neural networks, but I didn't need any of that because, honestly, as builders, all of us, as people who understand systems and patterns and architecture, we've actually got an advantage.
That's what prepared me for the building part, though the speaking thing is new for me, but I'm here, so thanks for coming out today. I'd like to talk about how we've been using Bedrock agents and now Agent Core to enrich our operational support at Wex, and I think you'll see a lot of the principles and a lot of the things that Andrew covered. They're there, and they're what made that possible for us. They're really what helped us go from pilot to production in under three months, and now we've got well over two thousand users internally, so it's been a good year. But first, we've got the company slide in case you don't know who we are. Wex is a global commerce platform. We power mobility, benefits, and payment solutions for organizations in more than two hundred countries.
We operate one of the world's largest proprietary fleet networks, and we help consumers manage their benefits accounts—things like HSAs, FSAs, LSAs, and COBRA. We handle everything from corporate travel to expense management, all with the goal of simplifying the business of running a business. Last year we processed over $230 billion in transactions in more than 20 different currencies, so it goes without saying that our platforms need to be reliable, secure, and able to run at scale.
That's where Global Technology Services comes in. We're the team behind the scenes that designs the shared services driving platform engineering standards, reliability, governance, and cost optimization—all the paved roads that let everybody move really fast without getting into trouble. My team's job at Wex is to make cloud feel simple even when it isn't. Last year, Global Technology Services saw more than 40,000 support requests. That's a lot, but if you think about it, every single one of them is critical to somebody, even if they are repetitive and time-consuming for us.
We have operations, SRE, and support across all those tiers, but we're always looking for ways to reduce that number without reducing quality. Like everyone else, we started looking at AI, but we didn't want to do anything flashy or complicated. We just wanted to start small and build something simple that would quietly make our lives easier. That's what inspired us to build Chat GTS. In case you don't see what we did there, GTS is short for Global Technology Services. It has a nice ring to it.
But there's more to this than a chatbot story. Sure, it can chat and it lives in our chat. It can read our documentation and do Q&A all day, but underneath it's evolving into more of a virtual engineer—one that understands cloud, network, security, and operations. We're not replacing people; we're doing what we do in operations: automating the repetitive stuff and expanding our self-service capabilities. We're trying to free people up so they can focus on the problems that matter, the ones that actually move the business forward.
We had to start somewhere. We didn't want to build agents just because we could or get stuck in that cycle of chasing shiny objects whenever something new came out. So we looked at our data, our support history, and those 40,000 tickets, and then we asked ourselves: what are we seeing the most? Which are the most complex or just flat-out painful? Where are we spending the most time? Then we looked at the places where we had existing automation, runbooks, or some kind of process that people were already executing.
We realized that was the sweet spot: high volume, high friction, well-understood work. That's where we knew we could make a dent with AI. I brought two examples to share today, but keep in mind we did start with chat and we focused on Q&A so we could build up our knowledge base. We saw immediately that instead of opening a ticket just to find information, people were able to come and find it on their own. We were building agents that could leverage that knowledge base to make their own decisions, so that was like setting the foundation and laying the platform for all of this.
Real-World Use Cases: Network Troubleshooting and Autonomous EBS Management
This first example is honestly my least favorite ticket, and hopefully some of you will understand why when we get to it. The second one is a little more exciting because we're moving beyond chat and embracing event-driven design with AgentCore. Together, I think they show how AI can really become a part of operations—not a side project, but more like a teammate who's always on call. Can anybody relate to this? Has anyone ever had a network issue they've had to troubleshoot? Well, I know you do. Consider yourself lucky, because at WEX we operate hundreds of AWS accounts, and we have Azure and Google spanning eight regions and multiple on-premises data centers, all interconnected.
There's a lot of technology that goes into making that possible, so when something inevitably fails, it can be challenging even with the right tools. Picture this: it's almost noon, you're getting ready to go to lunch, and you get a ping in your support chat from Jared from the PaaS engineering team. He's saying, "Dude, we're blocked. We're trying to deploy this cluster in this new VPC, and we're expecting it to reach all these things." Now Jared doesn't have access to the transit gateways, firewalls, or VPNs, and even if he did, maybe it's a little out of his wheelhouse. But that's not his job, right? You know that feeling—you don't want to leave Jared hanging, but you want to go to lunch.
It's like you don't have a choice: you have to respond. Well, now you don't, because we built an agent that can respond for you. What used to require tribal knowledge across all these different domains can now happen in minutes, and anyone can use it, even Jared. In this example, it's an EKS cluster and the agent knows it's AWS. It can go into our core network account and use Reachability Analyzer to provision a network analysis path. We know that takes a while, so while that's running, it fans out and checks flow logs, looks at any recent changes in the network, and then checks whether there are any known issues; there could be something already happening. By the time it's done, it has collected all of this information from everything it looked into, breaks it down in natural language, and presents it to you in chat, showing exactly where that traffic dropped and why.
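The Reachability Analyzer step of that fan-out might look roughly like the sketch below. This is an illustrative reconstruction, not WEX's code: the function name, the ENI-to-ENI path, and port 443 are all assumptions, and the boto3 client is passed in so the same logic can fan out alongside the flow-log and change-history checks.

```python
# Hypothetical sketch: provision a Reachability Analyzer path and start an
# analysis, returning the analysis ID so the agent can poll it while it
# checks flow logs and recent changes in parallel.

def analyze_path(ec2, source_eni: str, dest_eni: str, port: int) -> str:
    """`ec2` is a boto3 EC2 client (or a test double with the same methods)."""
    path = ec2.create_network_insights_path(
        Source=source_eni,            # e.g. the new cluster's ENI
        Destination=dest_eni,         # e.g. the dependency Jared can't reach
        Protocol="tcp",
        DestinationPort=port,
    )
    analysis = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"]
    )
    return analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
```

The analysis ID is what the agent holds onto; the finished analysis reports the exact hop (route table, security group, firewall) where traffic was dropped.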
So now when Jared needs to escalate, he can do it with the right team and all of the right information. Since we're logging all of these investigations, we're able to spot recurring issues and maybe identify opportunities for some tighter guardrails in the network. At the end of the day, this is a perfect example of how an agentic system can scale where humans can't. The second one is my current favorite because this was more than a chatbot. We're now building agents that are responding to alerts and anomalies and understanding the state of a system before they're deciding how to react. They're not just waiting for a ping, and honestly, this is the best kind of AI because you don't even realize you're using it.
We all love EKS and containers—they're going to solve all our problems and save the world—but they're not always the answer. I don't judge, but at WEX we've still got some critical workloads that just make more sense running on EC2, but we still have to support them. So picture this: you've got an EBS spike out of nowhere. Your CloudWatch lights up and you can almost smell the smoke right now. This could be a warning shot, but why chance it if we can get in front of it? If that workload matters, someone's getting a page. Somebody's got to wake up, log in, and figure out what's happening and where it's happening. They're checking logs, maybe running playbooks to expand the volume, or maybe just hopping on and clearing some space. Whatever they're doing, that takes time, and let's be honest—nobody wants to do that at 2 a.m. I certainly don't, and that's one of the reasons we built this, but it's the reality today.
So now we can flip the script a little bit. Instead of paging an engineer, we can send these alerts to an agent with all of those metrics in context. Now we're not calling anybody, we're not waking anyone up—that's no longer the first line of defense. The agent can see what's happening, where it's happening, and it can find an agent or a team of agents to help with the issue. Our first agent does some discovery. This is like triage. It looks at the operating system, it looks at the version, it looks to see which platform it belongs to, and it looks to see how critical it is. And then it looks at history—has this happened before? How many times have we expanded the volume here?
We maintain policies to cap expansion, right? You don't want to just keep adding disk; you're kicking the can down the road and that's never going to work. So at this point, if anything looks off, the agent steps out of the way and we escalate back to a human, and then we're back to the way we do it now. But if not, our agent can connect into Jira through AgentCore Gateway and open a ticket and start logging the incident. Think about it—this analysis is huge because you're collecting all of the inputs you're going to need a week later. If it evolves to an incident, you need to craft an RCA document.
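The expansion-cap policy Dan mentions can be pictured as a small, pure decision gate that runs before any action is taken. The thresholds and function name here are invented for the sketch; the point is that the agent steps aside and escalates whenever any guardrail fails.

```python
# Illustrative policy gate: may the agent auto-expand, or must a human decide?
# The limits below are made-up values, not WEX's actual policy.

MAX_EXPANSIONS = 3      # assumed lifetime cap on expansions per volume
MAX_SIZE_GIB = 1024     # assumed hard ceiling on volume size

def may_auto_expand(expansion_count: int, current_gib: int, requested_gib: int) -> bool:
    """Return True only when every guardrail passes; otherwise escalate."""
    if expansion_count >= MAX_EXPANSIONS:
        return False        # repeated growth suggests an application-layer problem
    if requested_gib > MAX_SIZE_GIB:
        return False        # never grow past the ceiling
    if requested_gib <= current_gib:
        return False        # EBS volumes can only grow, not shrink
    return True
```

Keeping this as deterministic code rather than model judgment is one way to make "the agent steps out of the way" auditable.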
So now that we know what's happening, it passes on to our maintenance agent, and this agent can choose from a library of pre-existing SSM documents. Keep in mind these are the same documents that our ops engineers are using at 2 a.m. We've got playbooks to run diagnostics, backup, clean up, and expansion—it's all the usual runbooks, but we've exposed them as tools to the agent on an MCP server. So now, whatever the agent decides to do, it's using the same automations that we already trust and we're not waking somebody up to push that run button. We've eliminated the chats, the texts, the pages, the cross-team escalations. And now these systems are starting to learn how to take care of each other. As they learn, they're building memory and over time they can start to recognize patterns.
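Exposing those trusted SSM runbooks as tools might be sketched like this. The document names, the allow-list, and the helper are hypothetical; the idea is that the agent can only invoke the same pre-approved automations the ops engineers already run at 2 a.m.

```python
# Hedged sketch of an existing SSM runbook exposed as a callable tool.
# Runbook names are invented for illustration.

ALLOWED_RUNBOOKS = {"GTS-Diagnostics", "GTS-DiskCleanup", "GTS-VolumeExpand"}

def run_playbook(ssm, document_name: str, instance_id: str) -> str:
    """Invoke a pre-approved SSM document and return the command ID.

    `ssm` is a boto3 SSM client (or a stub). The allow-list mirrors the
    principle that agents only use automations operators already trust.
    """
    if document_name not in ALLOWED_RUNBOOKS:
        raise PermissionError(f"{document_name} is not an approved runbook")
    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName=document_name,
    )
    return resp["Command"]["CommandId"]
```

In the MCP-server framing, each allowed document becomes one tool with a narrow schema, so "what can the agent do" is exactly "what runbooks exist."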
We can learn where issues start, where we're cleaning up on occasion, and where we have to escalate. Maybe they start to spot issues at the application layer. That's what makes this feel less like automation and more like a team that gets smarter over time. Once the issue is resolved, we can update Jira through AgentCore Gateway, and then we can publish status to SNS through an MCP server. From here we can notify engineers, on-call systems, dashboards, and whatever else is downstream from there.
But we're not done yet. This same agent is going to follow that resource all the way upstream back to the Terraform or CloudFormation or whatever built it in the first place, and it's going to open a pull request or create a Jira issue or send an email. It's going to do something that closes that loop between operations and infrastructure so we can remediate that drift we introduced in the incident.
Now, I'm going to be honest. I've been doing this a long time and I've seen some cool stuff, but when I see stuff like this work for real, like the first couple of times, I still kind of feel like a little kid. It feels like magic. I just hope it sparks some ideas and inspires you to see what's possible when you think about applying AI to operations. I know this is a basic example and we're just expanding the volume here, but we're just getting started. This is us learning how to use what we already know and we're reusing what already works. It's not magic. It's just engineering and architecture, and that's kind of the point.
Architecture Deep Dive: Building a Scalable Agent Platform at WEX
So I'm going to zoom out and talk a little bit about how we got there, and then we can look at some diagrams and see what it looks like under the hood. When we started this, we knew we wanted to build more than a chatbot and a RAG pipeline. We wanted to create a platform, something sustainable, something extensible, something that would inspire all of the other teams to come and collaborate and help us expand this because operations takes a village.
But we didn't have to change the way we built things. We just had to let these old patterns breathe a little bit. It was all the same stuff. We built agents with boundaries. We gave them clear responsibilities, and we let them all work independently. We saw guardrails become the contracts for what those agents could and couldn't do. Events became less about something happened and more about here's what happened, here's what it means, here's what we can do, and here's what we did the last time it happened. Of course, observability is still important. We still need to see everything, but there's more to this than 200s and 500s now. We're looking at behavior and reasoning. There's nothing new though. It all translates and it all still makes sense.
So here's what it looks like. Out of the gate, I'll be honest, it was a little challenging because at WEX we don't use Slack or Teams. We use Google Chat, and there's no native integration between Google and AWS for collaboration. On the left, we have our users chatting with us in our workspace domain. Those requests come over the internet. In the middle, we have our WAF with Imperva that secures our inbound traffic. On the right, we have our AWS environment fronted by API Gateway, and from there we use a Lambda to route and acknowledge messages and then we use Step Functions to orchestrate all of our agents.
We store state and conversations in DynamoDB. Reasoning traces land in S3, and of course Bedrock hosts our agents and knowledge. On paper it looks pretty simple. It's neat. It's easy. It's serverless. It's exactly what we wanted. Our chat application is only assigned to the users who are allowed to use it, and all of those messages come with a signed token that we can validate. Then it hits our router where we can filter out noise and oversized prompts, and we can send a quick response to that user basically saying hey, we got it, we're working on it. That helps absorb the model latency because agents still take a while to do their work. But really at the front, at the edge, this is just a clean, well-defined contract for everything else downstream and it keeps our front door predictable.
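The router's job at the front door, as described, can be sketched as a small handler: validate and filter the inbound message, send the quick "we got it" acknowledgement, and hand the real work to Step Functions. Everything here is illustrative (the size limit, the payload shape, the reply wording); the Step Functions client is injected so the logic is testable.

```python
import json

MAX_PROMPT_CHARS = 4000   # illustrative limit for filtering oversized prompts

def route_message(sfn, state_machine_arn: str, user_id: str, text: str) -> dict:
    """Filter an inbound chat message, then hand it to Step Functions.

    Returns the quick acknowledgement posted back to chat; the real answer
    arrives later, when the workflow updates this temporary message.
    `sfn` is a boto3 Step Functions client (or a stub).
    """
    text = text.strip()
    if not text or len(text) > MAX_PROMPT_CHARS:
        return {"text": "Sorry, I can't process that message."}
    sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"user": user_id, "prompt": text}),
    )
    return {"text": "Got it, working on it..."}
```

The immediate acknowledgement is what absorbs model latency: the chat user sees a response in milliseconds even though the agent workflow takes seconds or minutes.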
Now I'll admit before this I never really did much with Step Functions. I know they're on all the exams and everybody has to learn them, but I always felt like they were just for people who've built too many Lambdas. But it turns out they're actually perfect for AI. Bedrock gave us the intelligence, the parts that can think and create and make decisions. Step Functions gave us the discipline we needed to keep it all in check. We use the retries and fallbacks and state transitions, and that's where it really clicked. Like Andrew said, these agents can be autonomous, but they don't have to be. You can give them as much freedom or as much control as you need to, and that's what made them fit so well in these operational workflows.
Google gives us a trusted identity, but it has no concept of permissions. That token really only tells us that you're allowed to talk to us. So we take that identity and reach out to Active Directory to fetch your entitlements and cache them in DynamoDB. We kept it pretty simple. It's all based on group memberships and organizational units, and that way we're not overloading systems that were never really meant for real-time traffic.
Google tells us who you are. This is how we figure out what you're allowed to do and how far you can go before you ever reach an agent. When we invoke that agent, we wrap the entire prompt in context tags and we include your identity and your entitlements and send that downstream to the agent. Now whatever the user claims, the agent only trusts that context because it's immutable. If something goes wrong, say the agent fails or times out or can't produce a response, we log the error and we can send a safe response back to the user without disrupting the entire workflow.
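Wrapping the prompt in immutable context tags could look like the sketch below. The tag names and format are assumptions; the key property is that identity and entitlements are injected by the platform before the prompt reaches the agent, so nothing the user types can override them.

```python
# Sketch: prefix the user prompt with system-controlled context tags.
# Tag names are illustrative, not WEX's actual format.

def wrap_prompt(prompt: str, identity: str, entitlements: list[str]) -> str:
    """The agent is instructed to trust only what appears inside <context>,
    so a user claiming extra permissions in the prompt body changes nothing."""
    context = (
        "<context>"
        f"<identity>{identity}</identity>"
        f"<entitlements>{','.join(sorted(entitlements))}</entitlements>"
        "</context>"
    )
    return f"{context}\n{prompt}"
```

Because the context block is built server-side from Active Directory data cached in DynamoDB, the agent's authorization decisions are anchored to a source the user never touches.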
Once the agent responds, we want to capture everything. We store messages in DynamoDB and traces in S3, and from there we format the response so it looks good in chat. Google has their own markdown language, and that allows us to include citations and reference links and any attachments that came back. Then we update that temporary message we sent from the router. That really takes what feels like a request-response and turns it into more of a transaction. It's more like an actual conversation, and that's the point.
That's how Step Functions helped us. It gave us this structure where there was so much potential for chaos, and it's all the same patterns. There are retries and fallbacks and circuit breakers, and they're baked into every state transition. That helped us hold all of these creative parts accountable, and now we can observe it and measure it just like any other workflow. When it came to agents, we were inspired by SOA. Think about what it taught us to break up the monolith. We took these big systems and turned them into smaller pieces with a clear purpose.
That same discipline applies to agents. We built specialized agents and gave them one job that they have to do really well. That keeps the focus clean and the reasoning sharp, and then the handoffs between agents are even cleaner. Instead of having one giant orchestrator pulling all the strings, ours acts more like a conductor. It sits in the middle. It can interpret intent and figure out what's happening and where it's happening, then it connects that problem to the right expert or the right team of experts.
Even experts need boundaries. With all this autonomy, we needed to have guardrails. We needed some way that we could enforce policy and compliance, so we apply guardrails at the edge with the orchestrator. This gives us defense in depth for every decision that happens on the platform. We're sanitizing text, blocking topics, and redacting personally identifiable information both on the way in and on the way out. The guardrails are not just protecting our data; they're actually protecting the agents from themselves so they can wander, but they still can't color outside the lines.
None of this works unless they're all operating on the same source of truth. That's where shared knowledge comes in. All of our agents tap into the same knowledge base, so when we have a Q&A agent answering a question about connectivity, it's pulling from the same material an ops agent would if it were troubleshooting. They're separate services, but it's consistent understanding. It's like giving every application in your service layer a unified data plane, only now it's made up of runbooks and reference architectures and living documentation.
Thinking isn't enough. At some point we have to let the agents do something. They act just like any other service on the network. They're calling APIs and MCP servers. They're executing the tools we've given them to do their jobs. With Bedrock Action Groups we can run these Lambdas inside of our VPCs with tightly scoped permissions, and then we can control what they can reach and what they can't reach. At the end of the day it's just services talking to services, just like any other application layer on the network.
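A Bedrock action group backed by a Lambda follows a documented event/response contract; a minimal handler in that shape might look like this. This sketch assumes the OpenAPI-schema style of action group, and the action group name, API path, and volume lookup are all placeholders.

```python
import json

def lambda_handler(event, context):
    """Minimal action-group handler shape for a Bedrock agent (OpenAPI style).

    Bedrock passes the tool call as actionGroup/apiPath/httpMethod plus a
    list of parameters; the handler returns a responseBody the agent can
    reason over. The volume lookup here is a stand-in for real logic.
    """
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    result = {"volume": params.get("volumeId", "unknown"), "status": "ok"}  # placeholder
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(result)}},
        },
    }
```

Because this Lambda runs inside the VPC with a tightly scoped role, "what the agent can reach" reduces to ordinary IAM and network policy, which is the point Dan is making.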
How do we define that truth? Documentation, of course, and ironically that was the hardest part of all this. At the enterprise, documentation can live anywhere, and even when people can find it, they don't like to read it, but they'll chat with it. So we wanted to build something that could grow and evolve with the organization, and we needed something more than just a dumping ground.
So we chose Kendra with the Gen AI index because it gives us this hybrid approach. We get keyword and vector search with multimodal embeddings. With all of the built-in connectors for things like Confluence, GitHub, and Google Drive, we're able to keep all of our information in sync automatically with cron schedules. Now it includes source code, diagrams, policies, and runbooks. It's like all of our domain expertise finally lives in this one searchable layer.
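In practice those sync schedules live on the Kendra data sources themselves, but the on-demand equivalent gives a feel for the moving parts. This is a hedged sketch with an injected client; the index and data source IDs are invented.

```python
# Illustrative sketch: kick off sync jobs for a set of Kendra data sources
# (Confluence, GitHub, Google Drive connectors) and collect the job IDs.

def sync_data_sources(kendra, index_id: str, data_source_ids: list[str]) -> list[str]:
    """`kendra` is a boto3 Kendra client (or a stub).

    Scheduled syncs are normally configured on each data source; calling
    StartDataSourceSyncJob is the on-demand version of the same operation.
    """
    jobs = []
    for ds_id in data_source_ids:
        resp = kendra.start_data_source_sync_job(Id=ds_id, IndexId=index_id)
        jobs.append(resp["ExecutionId"])
    return jobs
```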
The best part about that is it actually comes to us. We don't have to chase it down anymore. We don't have to find the documentation. But the real breakthrough wasn't just building a knowledge base; it was figuring out how we're going to manage this thing. Those Kendra data sources are configured in Terraform, they live in GitHub, and we deploy them with CI/CD. With self-service, we allow the people who own that content, the subject matter experts, to maintain it themselves. They can open a pull request, and we can deploy their changes through GitHub's pipelines.
Observability, Lessons Learned, and Final Reflections
Now it's still enterprise knowledge. We're just treating it like infrastructure now. I'm not going to lie, observability was a bit of a puzzle. We have this third-party chat up on the front end, this hybrid identity between Google and Active Directory, and we had Lambdas and Step Functions and Bedrock agents. It started to feel like we had metrics coming at us from every direction, so we had to find a way to stitch it all together so we could actually see what was happening.
We built this persistence layer in DynamoDB, and this is where we store the things we care about long term. This is all of our users, their chat spaces, their sessions, and all of their messages. But it's not a transcript. This is relationships because every item here keys back to a session and a trace ID. Those traces land in S3, and this became our black box. This is like the flight recorder that captures every decision that's made across the platform. It's basically distributed tracing, but instead of following a request, we're following a train of thought.
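The "relationships, not a transcript" idea can be pictured as a single-table item shape where every message keys back to its session and points at its trace in S3. All attribute names here are invented for illustration.

```python
# Illustrative DynamoDB item for the persistence layer: messages grouped by
# session, ordered by timestamp, each linking to the S3 "flight recorder".

def message_item(user_id: str, session_id: str, trace_id: str, ts: str, text: str) -> dict:
    """Partition key groups a session's messages; sort key orders them;
    trace_id and the S3 key connect this message to its reasoning trace."""
    return {
        "pk": f"SESSION#{session_id}",
        "sk": f"MSG#{ts}",                 # ISO-8601 timestamps sort lexicographically
        "user_id": user_id,
        "trace_id": trace_id,
        "trace_s3_key": f"traces/{session_id}/{trace_id}.json",
        "text": text,
    }
```

With this shape, replaying a conversation is one query on the session's partition key, and following a train of thought is one S3 fetch per message, which is what makes the dashboard described next cheap to build.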
Of course, we needed transparency, so we pushed all of our logs into Splunk through Kinesis. Now, even InfoSec and compliance and risk and legal have all got a real-time view of what's happening on the platform. They can see what people are saying. They can see what the agents are saying. They can see all of the redactions and all of the policies being enforced. At the end of the day, everyone's happy.
Now, I'm going to say this like it was ten years ago because it feels like it, but it was only four or five weeks. Back when we started, there was no AgentCore and there was no built-in observability, but we still needed a way to understand what was happening. We got together and built this dashboard that produces some practical insights. When I say we, I mean me and my team and Cursor and Claude and Copilot, but all my smartest friends helped out. Literally with this dashboard, we can replay a conversation step by step, and then we can see the story unfold.
We can see what people are asking for. We can see where the agents are struggling, and we can see where maybe our knowledge base needs some work. But this kind of visibility doesn't just measure quality; it shapes it. This is what drives our roadmap. This is what tells us what we need to build next. I'm almost done, I promise. This was a big year for AI. I mean, all the new models, all the new services, the tools, all the new acronyms that we're secretly Googling on the side, and I hope I'm not the only one who has to do that.
But honestly, I've never had more fun learning and building at the same time, and I feel really lucky that I get to work with this stuff because it's awesome. It just is. I've learned a lot this year, but there are three lessons I'm going to carry into next year that I want to share with you. Number one, the architecture still matters. It's all the same diagrams. There are new services and there are new icons. But you don't need to be a data scientist to piece it all together. I feel like I'm living proof of that.
Second, you don't have to build a platform. You can start small but think big. Build something simple that will teach you what to build next. And of course, please breathe: go outside, touch some grass. Maybe wait until you get home; there's not a lot of grass in Vegas. But seriously, it's way too easy to get wrapped up in all this tech. We still have a say in this. Don't let it overtake you, because you could blink and the next thing you know, you're in Vegas on a stage explaining what you did last year. That could happen. Trust me.
Thank you. This is awesome. Thanks, Dan. Thank you, everybody. I hope you enjoy the rest of the conference. Remember, if this is your first one, we've got surveys that come out, and it really helps us and gives us the ability to speak in the future, especially if you enjoyed the session here. So have fun this week, stay safe, enjoy the party on Thursday, and thanks again. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.