🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Build production-grade middleware with Bedrock AgentCore and MCP (AIM204)
In this video, Cybage's Mohammad Zaman and Aneesh discuss production-grade AI architectures on AWS. They emphasize not over-engineering POCs, using low-code services like Bedrock knowledge bases for prototypes only. Key topics include separating MCP servers from agents for reusability, implementing AI gateways like LiteLLM for observability, and making APIs agent-ready by reducing noise and dependencies. They address critical challenges: security through PII masking with Amazon Macie and dynamic tool binding in AgentCore based on user permissions, deterministic evaluation metrics for tool-calling agents, and pricing AI features by output rather than tokens. The session highlights real implementation strategies including CloudWatch integration for end-to-end traceability and throttling AI workloads by user groups.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
From Prototypes to Production: Architecting Enterprise-Grade AI Workloads on AWS
Thank you, Chris. We are all here in the Gen AI space, and everybody's building their own agents. So we're going to be talking about that, and I'll hand it over to Aneesh. My name is Mohammad Zaman. I go by Mo. I lead the AWS Strategic Partnership for Cybage. With me is Aneesh, who leads our cloud, data, and AI practice at Cybage. We're going to be building some stuff, and you'll get to know it. Also, after this, if you have more questions, come to Booth 12.
Thank you so much. I'll use the handheld microphone. Hello? Okay, I will use this. Thanks everyone for coming. We're going to talk quickly about production-grade AI workloads, the first and only AI session at re:Invent. That's a lie, but we'll cover some interesting things here. There are a few key ways that different organizations, both software companies and enterprises, are thinking about AI in their stack. When you think about a typical three-tiered stack, some organizations are working on their data layer, preparing it to sell to foundation model providers; a lot of publications and media houses are working on sending their data to model providers more effectively. Others are working on AI-assisted development and code. AI is also changing the way you view an application stack. Previously, you had APIs doing the heavy lifting of business logic in an application. Now, APIs are getting much thinner. You have AI agents that work on top of lighter-weight APIs that largely do CRUD operations, and the agents take on more of the workload, more of the logic, more of the orchestration, more of what the meat of the platform really is. This is being seen across the board. We'll talk a little bit about what this means in terms of real production-grade architectures.
AWS services have evolved immensely alongside this journey. Back in the day, it started with foundation models on Bedrock. Now, we have a whole host of services. We have AgentCore, which gives you a runtime for agents as well. People are adapting, and AWS is providing the end-to-end services for all of those different workloads. We'll jump into a few quick examples of real production-grade implementations that we're working on at Cybage. Hopefully, you can go away with some tips and ideas as well.
So one massive and recurring issue that comes up is the ability to separate what is a prototype or a proof of concept within AI development versus what is a production-grade implementation architecture. We think about this a lot at Cybage. The main takeaway from this slide is don't over-engineer prototypes in the space of Gen AI. AWS has great low-code services like Bedrock knowledge bases and Kendra. We have a bunch of services which are meant for you to prove a concept, and it ends at that. You're not meant to take those implementations and move them to production.
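As a quick, hedged illustration of how small a prototype can stay on those low-code services, here is a sketch (not from the session) that queries an existing Bedrock knowledge base with the managed RetrieveAndGenerate API; the knowledge base ID, model ARN, and question are placeholders.

```python
import boto3

# Prototype-only sketch: lean on the managed Bedrock knowledge base API instead
# of building custom ingestion and retrieval. Assumes a knowledge base already
# exists; the ID, model ARN, and question below are placeholders.
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
        },
    },
)
print(response["output"]["text"])
```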
What we usually design is a team that works on design and experimentation in an enterprise. They are only focusing on sandbox AWS environments and using low-code services to prove a concept for different use cases. Then, you have a higher complexity data ingestion workflow. All your AI use cases in the enterprise need real-time data connectors and ingestion. Sometimes, it has to be custom built from the ground up. And then, you have your agents that work from that data, at least in typical RAG applications as well as in agentic applications. With AgentCore, now you can have agents that work across different MCP servers.
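As a rough sketch (not from the session) of what one such standalone MCP server can look like, here is a minimal example assuming the MCP Python SDK's FastMCP helper; the server name and retrieval tool are hypothetical.

```python
# retrieval_server.py - a standalone MCP server that multiple agents can reuse.
# Minimal sketch assuming the MCP Python SDK (FastMCP); the tool body is a
# placeholder, not the implementation discussed in the session.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("context-retrieval")

@mcp.tool()
def search_documents(query: str, top_k: int = 5) -> list[str]:
    """Return up to top_k document snippets matching the query."""
    # A real server would call your retrieval backend here.
    return [f"snippet {i} for: {query}" for i in range(top_k)]

if __name__ == "__main__":
    # Runs over stdio by default, so any agent that speaks MCP can bind to it.
    mcp.run()
```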
One really useful aspect that we try to implement in our designs is differentiating between MCP servers or tools and then the agents that use those servers. The same MCP server, for example, web browsing or context retrieval, can be utilized by multiple agents. This separation of tools from agentic workflows is something that we really focus on in our designs and in our applications. AI gateways are really picking up as well. There are tools like LiteLLM. AWS also has many offerings here. Having a centralized place to log, monitor, and observe your workloads is becoming extremely essential. The way to do that is through observability tooling. AWS now allows you to ingest your Gen AI logs into CloudWatch, which lets you do end-to-end traceability of sessions and traces. Observability has massive importance here. You need to know whether the issues in the workloads you're running are happening at the retrieval stage or at the generation stage.
Are they happening because users are still figuring out how to use your platform? Observability helps you separate that as well. This last layer here, in terms of LiteLLM, orchestration, and observability, becomes extremely essential. The takeaway here is: don't over-engineer POCs. They're meant to prove a concept. Once you've proved that, think about production-grade workloads on AWS and what that means.
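For that stage-level visibility, one illustrative approach (a sketch under assumed names, not the speakers' exact setup) is to emit per-stage trace events for each session to CloudWatch Logs and filter on them later; the log group, stream, and fields are placeholders.

```python
import json
import time
import boto3

# Illustrative sketch: tag each trace event with its stage so retrieval issues
# can be separated from generation issues per session. Assumes the log group
# and stream already exist; their names are placeholders.
logs = boto3.client("logs")
GROUP, STREAM = "/genai/app-traces", "session-traces"

def log_stage(session_id: str, stage: str, detail: dict) -> None:
    event = {"session_id": session_id, "stage": stage, **detail}
    logs.put_log_events(
        logGroupName=GROUP,
        logStreamName=STREAM,
        logEvents=[{"timestamp": int(time.time() * 1000),
                    "message": json.dumps(event)}],
    )

log_stage("sess-123", "retrieval", {"latency_ms": 180, "chunks": 6})
log_stage("sess-123", "generation", {"latency_ms": 2400, "output_tokens": 512})
```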
Building Agentic Layers on Legacy APIs: Addressing Real-World Challenges in Security, Evaluation, and Monetization
In some other cases, one of our largest implementations recently was focused on building AI layers on top of legacy APIs and legacy software products. A lot of us here, even every booth that you see, we ship software. That's our bread and butter. It's about: can I ingrain and can I infuse agentic layers on top of those product APIs? But in that process, multiple challenges do come up.
APIs that we've currently built and host in the world are not made for agentic consumption. They are made for consumption by typical user interfaces. This can cause multiple issues. I'll give you a simple example. A lot of product APIs return 100-plus results in their responses—very noisy responses with a lot of garbage-adjacent data coming from those APIs. Neither would an agentic solution ever want to use an API like that, because the way you interact with an agent is probably chat-based, which means you're sending smaller workloads and smaller results to that chat. Nor would an agent successfully be able to reason through that massive API response.
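One common mitigation, sketched below with hypothetical field names, is a thin agent-facing wrapper that trims the raw API payload down to the handful of fields an agent can actually reason over.

```python
# Hypothetical sketch: shrink a noisy, 100-plus-result API response to a small,
# low-noise payload before it ever reaches the model context. The field names
# and limit are illustrative, not taken from a real product API.
from typing import Any

AGENT_FIELDS = ("id", "name", "status", "price")  # assumed field names

def to_agent_view(raw_items: list[dict[str, Any]], limit: int = 10) -> list[dict[str, Any]]:
    """Keep only a small, stable subset of fields and cap the result count."""
    return [
        {key: item.get(key) for key in AGENT_FIELDS}
        for item in raw_items[:limit]
    ]

# raw = call_legacy_product_api(...)   # 100+ results, dozens of fields each
# agent_ready = to_agent_view(raw)     # compact payload the agent can reason over
```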
API readiness is something that is extremely essential before you build agentic workloads on top of it. Another example: APIs often have dependencies. We built an application that did agentic orchestration on legacy APIs, and one issue we faced was latency. Many people are currently struggling with it. Our LLM would make consecutive hops: first get authentication tokens, then call APIs, and then, with the results of those, call downstream APIs. Again, these APIs have not been built for agentic consumption.
If you know how users are going to interact with your agents, what type of workload they're going to run, what type of prompts they're going to run, you can better design your APIs with an agentic-first mindset. That has nothing to do with LLMs, that has nothing to do with generation. That is about logical design in your API schemas. That's something that we focus a lot on at Cybage as well.
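For the dependency and latency issue above, one agentic-first pattern is to collapse the token fetch, the primary call, and the downstream call into a single tool the agent invokes once; the endpoints, fields, and helper below are hypothetical.

```python
import requests

# Hypothetical sketch: one agent-facing function that hides the auth hop and the
# downstream dependency, so the LLM makes a single tool call instead of chaining
# several API trips itself. All endpoints and fields are placeholders.
BASE = "https://api.example.com"

def get_order_with_shipment(order_id: str) -> dict:
    """Fetch an order and its shipment details in one agent-visible step."""
    token = requests.post(f"{BASE}/auth/token", json={"client_id": "agent"}).json()["access_token"]
    headers = {"Authorization": f"Bearer {token}"}
    order = requests.get(f"{BASE}/orders/{order_id}", headers=headers).json()
    shipment = requests.get(f"{BASE}/shipments/{order['shipment_id']}", headers=headers).json()
    return {"order": order, "shipment": shipment}
```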
Out here as well, observability and monitoring become extremely essential. The good part about tool calling and function calling is that observability is more deterministic. What do I mean by that? With open-ended chat responses, you have to create evaluation metrics that are subjective and not completely accurate. With tool calling, agent calling, and API calling, you can have more deterministic evaluation metrics. So if I build an agent that sits on top of my APIs, I can get accuracy scores for how successfully it's able to call the APIs I need it to call for a certain set of prompts. That determinism is super useful in evaluation, which you can't get in basic or freeform chat. For that, you have to use custom evaluation metrics.
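Because the expected tool call for a given prompt is known up front, that score can be computed exactly; here is a minimal sketch with made-up prompts, tool names, and a hypothetical run_agent function.

```python
# Minimal sketch of deterministic tool-calling evaluation: compare the tool and
# arguments the agent actually chose against a labeled expectation. The prompts,
# tool names, and run_agent() are hypothetical.
TEST_CASES = [
    {"prompt": "Cancel order 42", "tool": "cancel_order", "args": {"order_id": "42"}},
    {"prompt": "Where is order 42?", "tool": "track_shipment", "args": {"order_id": "42"}},
]

def evaluate(run_agent) -> float:
    """Return the fraction of prompts where the agent picked the right tool with the right args."""
    correct = 0
    for case in TEST_CASES:
        tool_name, args = run_agent(case["prompt"])  # assumed to return the first tool call made
        if tool_name == case["tool"] and args == case["args"]:
            correct += 1
    return correct / len(TEST_CASES)

# accuracy = evaluate(my_agent_runner)   # e.g. 0.5 means half the prompts hit the right tool
```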
Evaluation is something I'll focus on, and it's something that we work on a lot at Cybage. It's something that we infuse in a lot of our agentic development as well. Moving to some of the real production-grade issues that people are currently facing, we'll talk about each one and how those are being solved as well.
One massive issue, and I'll start with the third box actually, is security and governance. The moment you introduce tools to an agentic implementation, the security concerns multiply by magnitudes. You have your CISOs getting extremely concerned with context mixing between different tools that have different permission levels. The other part is that people are extremely apprehensive, and rightfully so, about sending PII in LLM model calls. How do you screen it out? How do you put guardrails in place, both at the input and output stages?
That's something that we focus a lot on. You can use services like Macie in AWS for PII detection and masking. You can use guardrails. There are some open source frameworks. You can use custom guardrails to guard your applications, both pre-generation and post-generation. And you can bind user groups to the right tools that they have access to. This is an extremely essential part of building these applications.
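The session names Macie and guardrails; as one illustrative way to mask PII in free text before a model call (a substitution for the purpose of the sketch, not necessarily the speakers' setup), Amazon Comprehend's PII detection can drive the masking step.

```python
import boto3

# Illustrative pre-generation guardrail: detect PII spans with Amazon Comprehend
# and replace them with their entity type before the text reaches the model.
# Shown only as a sketch; the session mentions Macie and guardrails as options.
comprehend = boto3.client("comprehend")

def mask_pii(text: str) -> str:
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Replace from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text

print(mask_pii("Email john.doe@example.com about invoice 991."))
# -> "Email [EMAIL] about invoice 991."
```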
We need to reflect enterprise access permission levels in agentic permissions. If you're building an application for internal users or for your end customer users, each of them has tiered permissions that need to be reflected in the tools they have access to. With AgentCore, this becomes extremely powerful because you can use dynamic tool binding. AgentCore allows you to bind tools dynamically for a user, so you can build a pipeline binding that respects the permission levels of that user. That's something we focus a lot on at Cybage as well, on the security and governance side.
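Conceptually, the binding step can be as simple as resolving a user's groups to an allow-list before the agent is constructed; the groups, tool names, and make_agent helper below are hypothetical, not AgentCore's actual API.

```python
# Hypothetical sketch of permission-aware tool binding: the agent for a session
# is constructed with only the tools that user's groups are entitled to.
# Group names, tool names, and make_agent() are illustrative placeholders.
TOOLS_BY_GROUP = {
    "support_agent": {"search_orders", "track_shipment"},
    "finance_admin": {"search_orders", "track_shipment", "issue_refund"},
}

def bind_tools_for_user(user_groups: list[str]) -> list[str]:
    """Return the union of tools allowed across the user's groups."""
    allowed: set[str] = set()
    for group in user_groups:
        allowed.update(TOOLS_BY_GROUP.get(group, set()))
    return sorted(allowed)

# agent = make_agent(tools=bind_tools_for_user(current_user.groups))
```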
I'll also focus a little bit on the adoption and monetization side because we have some time. It's not as technical, but token-based pricing is not working out very well for people trying to ship products with AI. If I'm a software company, and many of us are representing software companies, and I push out a feature that uses LLMs at the backend, people are not willing to spend on token-based pricing for those end features. People are trying to bake that into subscription pricing, either with a new tier of subscription that demands higher AI workloads or through other innovative mechanisms.
We've seen some of our clients work on different pricing mechanisms. Imagine you've built an agent that runs a certain defined workload, maybe a content generation workload or a report generation workload. Maybe it runs a certain action with an actionable output like a research report. Try to price AI-based features on output and not tokens. That's something we're seeing a lot of in different companies. The only way to get adoption from your end users is to price these features based on end outputs instead of on tokens.
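In practice that means metering the business output rather than the tokens behind it; here is a hedged sketch of recording one usage event per delivered artifact, with made-up feature and customer names.

```python
import time
from dataclasses import dataclass, field

# Illustrative sketch: meter AI features by the artifact delivered (a report, a
# draft, a summary) instead of by tokens consumed. Names are placeholders.
@dataclass
class UsageMeter:
    events: list[dict] = field(default_factory=list)

    def record(self, customer_id: str, feature: str, units: int = 1) -> None:
        self.events.append({"customer_id": customer_id, "feature": feature,
                            "units": units, "ts": time.time()})

meter = UsageMeter()
# ... the agent finishes generating a research report for customer "acme" ...
meter.record("acme", "research_report_generated")  # billed per report, not per token
```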
For that, your API and AI gateway layers become important. What I mentioned earlier around platforms like LiteLLM and integrating CloudWatch in AWS are aspects that are required before you differentially throttle AI workloads between different user groups. If I've decided to price an AI feature differentially by user groups, I need to make sure that I'm throttling the right groups by their AI usage. So that gateway layer that gates your calls for agentic layers with end LLMs becomes extremely essential as well.
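At that gateway layer, differential throttling can be as simple as a per-group rate limit in front of the LLM calls; the quotas and group names in this sketch are made up.

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch: a sliding-window rate limiter keyed by user group, sitting
# in front of the AI gateway. The per-minute quotas are illustrative only.
QUOTA_PER_MINUTE = {"free": 5, "pro": 60, "enterprise": 600}

_calls: dict[str, deque] = defaultdict(deque)

def allow_request(group: str) -> bool:
    """Return True if this group is still under its per-minute LLM-call quota."""
    now = time.time()
    window = _calls[group]
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= QUOTA_PER_MINUTE.get(group, 0):
        return False
    window.append(now)
    return True

# if allow_request(user.group): forward the call to the LLM gateway; otherwise return 429
```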
These are some real production-grade issues and solutions that we're working on at Cybage. If there are any questions or further topics, you can find us at Booth 1231. We're building production-grade solutions with AWS and would love to get more into the details. That is all from us today, so thank you so much for coming. Hopefully you're leaving with some tangible learnings, and we're excited to talk to you all further. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.







