
Kazuya


AWS re:Invent 2025 - NFL Fantasy AI: Zero to Production in Weeks w/ Bedrock and Strands Agents - SPF304

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - NFL Fantasy AI: Zero to Production in Weeks w/ Bedrock and Strands Agents - SPF304

In this video, AWS and NFL present how they built Fantasy AI in just eight weeks—an AI assistant delivering NFL analyst-grade fantasy football advice. Michael Butler, Henry Wang, and Mike Band detail their architecture using Strands Agents framework, Model Context Protocol (MCP), and Amazon Bedrock. Key innovations include a semantic stats dictionary reducing token consumption by 70%, consolidated tools cutting specs by 50%, and a fallback provider achieving 90% throttling reduction. They implemented strategic caching for 2x throughput improvement and 45% cost reduction. The system achieves 90% analyst-approved responses in under 5 seconds, handling Sunday traffic spikes with zero incidents across 10,000 questions. Their core principle: prioritize intelligence (reasoning, tools, data) over perfect infrastructure, shipping practical solutions that prove job number one before iterating with real-world data.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introducing Fantasy AI: NFL and AWS Partnership

What's up guys? How's it going? All right, I'm excited to talk about fantasy and AI. Thank you for making some noise for us. Good afternoon everybody. My name is Michael Butler. I'm a principal deep learning architect in the AWS Generative AI Innovation Center. I get to work with amazing technology every day, learn it, build it fast, and share what I learned with customers and builders alike. One of those customers is here on stage with me today. I want to introduce the NFL's Mike Band.

Thank you, Michael. My name is Mike Band. I'm the senior manager of research and analytics at the NFL's NextGen Stats group. My team is responsible for tracking data—collecting it, processing it, distributing it, and turning it into the products that we'll talk about today.

My name is Henry Wang, and I'm a senior applied scientist with the AWS Generative AI Innovation Center, where every day I think about innovations and how we can build delightful experiences for our customers.

AWS and the NFL have been working together for multiple years. We're one of the technology providers for NFL NextGen Stats and we're the official cloud, AI, and machine learning provider for NextGen Stats. We've been constantly innovating together to think about how to transform the experience for players and fans to experience the game of football. Without further ado, we'd love to let Michael kick us off to talk about the journey and how we built this fantasy AI system.

Thumbnail 100

Alright, thanks, Henry. So folks, what we're going to do is spend a little bit of time showing you what we built today. We're going to walk you through the architectural decisions that made it possible and then share with you some of the nitty gritty details. This isn't a presentation about perfection. It's about what it takes to get real agentic applications into production in an actual customer environment with all the mess and the fun that goes with doing that.

Thumbnail 130

But before we jump in, let me take a moment and set the stage. It's Sunday, it's 12:45, 15 minutes to kickoff. You've been staring at the same analysis all week long. Do I start Caleb Williams or Justin Herbert? Are my waiver-wire pickups going to pan out? Is my streaming defense right? Did I make the right call? You don't know, but you do know that you're going to hear about it tomorrow if you didn't make the right call.

Thumbnail 160

Fantasy managers live this life every day. All sorts of data coming at us fast. We have roster decisions to make, trades to evaluate, injury reports updating by the minute. Five different experts say five different things. You're overwhelmed by data and looking for trusted, confident, fast advice. This is the problem that the NFL wanted to solve, not just for the fun of it, and believe me this was a fun problem to solve, but because they had something that no one else has.

So I'd like to welcome Mike Band back to the stage. Mike, tell them what you were sitting on. Thank you, Michael. When Michael and his team approached us in June of this year, we had an idea and a dream, and Michael said we can help you build that. We had no formal design, no requirements, just an idea. We were two months away from the start of the regular season, so we said, hey, can we build something that we could bring to market to fans in eight weeks' time?

Thumbnail 200

That's why it says zero to production in eight weeks, because we literally went from an idea to a product for fans to use in just eight weeks. Today we're going to talk about how we leveraged a new series of APIs and the NextGen Stats database, and combined them to bring a Fantasy AI assistant to NFL Plus users through our DTC platform at the NFL Media Group, via NFL Pro, a website where you can find NextGen Stats insights, analysis, film, and beyond. With this Fantasy AI experience, we've brought a whole series of new platforms and new tools to the fantasy player at home.

Thumbnail 260

Thumbnail 280

Without further ado, I'm going to let Scott Hanson tell a little bit of a story about the NFL Pro platform. NFL Pro is the ultimate fantasy football companion, including the first of its kind Fantasy AI assistant. It's going to help you draft smarter, optimize your lineups, and gain the winning edge. Burrow airs it out. Chase, hello, touchdown, Cincinnati.

Thumbnail 290

Three Key Pillars: Accuracy, Speed, and Security

So what does it mean that we delivered? Well, in about eight weeks' time we made it. We managed to hit our three key pillars. The first was that it had to be analyst and expert approved. A team of analysts on the NextGen Stats team went through thousands of possible questions during the QA process.

What did it mean for us to be accurate and right? First, the information had to be accurate. We had to present the data in an accurate form. There's a true number of how many times a player completed a pass, and there's a true number of how many times they scored a touchdown. If we were to hallucinate those answers, the fan at home can't trust the output, so we had to make sure it was right and accurate.

That's why we leveraged the APIs that we built, which had structured data that we knew had a low risk of error. Secondly, when we say 90% directionally approved, what we mean is that when we evaluated all of the answers, we wanted the fantasy expert that we have at NextGen Stats to be able to say, "Yeah, that passes the smell test. I agree with that." In a way, you're trying not to make foolish decisions: you're trying to avoid the landmines, the fatal-flaw decisions. With the APIs that we've built through NextGen Stats and NFL Pro, you now have all of that information and analysis available at the touch of a question.

Second, we needed it to be fast. If you've worked with an AI assistant, you know that speed can affect your experience. We needed initial question answers in less than five seconds and fantasy-rich, deep analysis in less than thirty seconds. Our 95th percentile has stayed under thirty seconds for every one of our answers so far, meeting that threshold. Finally, probably the most important thing for the NFL: we are the shield. We cannot put out a product that is not secure and reliable. This meant a rigorous process of legal approvals and AI council approval, and all of that meant we had to have a really buttoned-up system.

What that means is we don't want our AI assistant to answer questions that are outside the bounds of fantasy football. You could imagine someone thinking, "Oh, I could use this to make a bet on sports." That would essentially create liability for us: if we allowed betting recommendations, people could come back and say, "Oh, you told me the wrong thing." So we wanted to limit the assistant and ensure it could only do fantasy football analysis. Because of that, we have had zero incidents since launch. In our first month of ten thousand questions, not one incident was reported by a user or through our logs.

Thumbnail 470

Thumbnail 490

Now, I know it's always good to talk about the product, but why don't I show you exactly how it works? I'm going to do this a couple of times throughout today. So first, I'm going to ask it a question: What's up with the NFL's Fantasy AI powered by AWS? Let's see its answer. We'll wait a second for it to load. Alright: the NFL just dropped a game-changing AI assistant that's about to revolutionize how you dominate your fantasy leagues. Key features include exclusive NFL data, real-time analysis, personalized advice, and rolling out through NFL Pro for serious fantasy managers who want data-driven championship advantages. Bottom line: while your league mates are still comparing basic stats across different apps and websites, you'll have AI-powered analysis delivering insights that actually move the needle. This isn't just another fantasy tool. It's the difference between hoping for wins and engineering them.

Thumbnail 550

That was an answer directly from our AI assistant, and later in this talk I'll show you a discrete example of how we really use it on game days. So I'm going to invite Henry to the stage to talk about the architecture and how we built it. Thank you very much, Mike. All right, before we talk about the architecture, I'd like to take a quick poll of the audience here today. Please raise your hand if you have heard of the word "agent." I'm expecting all hands. Okay, don't lower them yet. Keep them up. A little bit of exercise after lunch is good for your body, okay? All right, now keep your hands raised if you have heard of the term "workflows." Wow, okay, everyone. All right, now keep them raised if you know the difference between an agent and a workflow. Great, okay, awesome. Just the mix we were expecting from the audience today. Now everybody, you can lower your hands and rest a little bit.

Thumbnail 590

Understanding Agents and the Fantasy AI Architecture

So just a quick refresher on what agents are. An agent is an autonomous system that can reason, plan, and take actions on behalf of a human or another system. Think about a simple chatbot that just gives turn-based responses

Thumbnail 610

using a large language model and its internal knowledge. An agentic chatbot can do a lot more. Here is a quick, more detailed look at what an agentic loop looks like. When this agentic system takes a user input, it will perceive it, try to understand it with the large language model behind it, and think about what kind of tools it should use to answer the question. It will execute the tool, see the results coming back, and decide whether another loop is needed in order to achieve the final goal it's instructed to accomplish.

For Fantasy AI, where you need to compare which quarterback to start in your lineup, you need to think about the matchup, the upcoming stats, the weather, and so many different things. It's a complicated, multi-step task, but this is exactly what agents excel at, and this is the reason we decided to build an agentic architecture for Fantasy AI.

Thumbnail 670

Another quick primer: MCP. MCP is short for Model Context Protocol, a standard that lets your large language model talk to different tools in a uniform way. With MCP, every time you need to connect to a new API or talk to a database, you don't have to build something new. Consider it a universal port you can just plug into: as long as both sides support it, you aren't creating new integrations each time. This makes LLM integration with different systems much more seamless.
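As a rough illustration, here is what exposing one stats tool over MCP can look like with the official Python SDK's FastMCP helper. The server name, tool, and fields are hypothetical placeholders, not the NFL's actual data layer.

```python
# Hypothetical MCP server exposing a single NextGen-Stats-style tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ngs-data-layer")

@mcp.tool()
def get_player_stats(player_name: str, season: int, week: int | None = None) -> dict:
    """Return basic stats for a player; omit week for season-to-date totals."""
    # A real server would query the stats database or API here.
    return {"player": player_name, "season": season, "week": week, "stats": {}}

if __name__ == "__main__":
    mcp.run()  # stdio by default; HTTP transports are also supported
```

Any MCP-capable client can now discover and call get_player_stats without custom integration code on either side.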

Thumbnail 720

All right, now with those primers out of the way, let's talk a little bit about the architecture. From the left, we have the user input or query, and it feeds into Fantasy AI, which is hosted on EKS with auto scaling. It uses a large language model from Amazon Bedrock to reason and pulls session memory from an S3 bucket so it has context. More importantly, it is connected through MCP to different data sources. It connects to NFL Next Gen Stats to get all the player information and stats, and it also connects to Rotoworld for additional player stats.

This way, everything is connected, and the large language model, our Fantasy AI agent, can process a query and provide the best response to users. Everything happens end to end, from left to right, in under five seconds, even during Sunday peak traffic. That is really amazing, right? And the right question you should be asking is how we actually pulled this off in just a matter of weeks.
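To make the shape of that serving path concrete, here is a minimal sketch of a Strands agent on Bedrock whose tools are discovered from an MCP data layer. The endpoint URL, model id, and system prompt are illustrative assumptions, not the production configuration.

```python
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.models import BedrockModel
from strands.tools.mcp import MCPClient

# Reasoning model served by Amazon Bedrock (model id is an assumption).
model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-20250514-v1:0")

# MCP data layer deployed as its own service (URL is illustrative).
ngs_data = MCPClient(lambda: streamablehttp_client("http://ngs-data-layer:8000/mcp"))

with ngs_data:
    agent = Agent(
        model=model,
        tools=ngs_data.list_tools_sync(),   # tool specs discovered over MCP
        system_prompt="You are an NFL fantasy football analyst...",
    )
    print(agent("Should I start Caleb Williams or Justin Herbert this week?"))
```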

Key Design Decisions: Strands Agents, MCP, and AI-Assisted Coding

If you're thinking about that, great, that's the right question to be asking, because we'll be talking about some of the key decisions we made to make this happen. If you think about building such a complicated agentic system from scratch, there are so many things to consider. You need large language model processing, you need a prompt management system, and you need to think about how to manage sessions and how to control fallbacks when models are throttled.

Thumbnail 830

So what are some of the decisions we made? It comes down to three key pieces. The first is that we decided to go with Strands Agents, an open source framework that makes it much easier to build a production-level agentic system. Second, we decided to use MCP as our semantic data layer; this gives us separation of responsibility, and we can scale the agent and the data layer independently. Third, we leaned heavily into AI-assisted coding, and we'll talk about some of the best practices and learnings we found.

Thumbnail 870

The first decision: Strands Agents is an open source framework that lets you stand up a production agent with just a few lines of code. Strands Agents handles so many things for us: it manages sessions, manages prompts, and supports multiple language models so we can plug and play different models as we see fit. The framework gives us scaffolding that let us focus on building the agent logic for Fantasy AI rather than the plumbing.

Thumbnail 910

The other decision we needed to make was how to let the agent talk to different data sources. There are different options here. The simple option, of course, is to use Python decorators to give the tool definitions to the agent inside the agent codebase. That way everything is in one place, but the issue is that our data layer is really complicated.

We need to identify different parameters for filtering the data sources. We need business logic validations. After fetching the data, we need additional parallel processing to retrieve it efficiently. If we embed all of this code into the agent code, it becomes very hard to maintain and very hard to scale. The other option is to separate the agent logic from the data logic, and that's the choice we made. It offers a few benefits.

The first, of course, is separation of concerns. The agent code focuses on the logic and the orchestration, and the MCP layer manages the data: which data sources to talk to and how. The second is that once we build this MCP data layer, we can introduce additional agents in the future, say different personas or agents for different purposes, and they can all talk to the same MCP data layer. You build it once and use it many times.

Thumbnail 1030

And the third benefit is that it allows us to scale each part independently, because it's really hard to predict the traffic. Sometimes we may need to scale up the agent computation; sometimes we may need to scale up the MCP data layer. Separating the two lets us scale each as we see fit. The third key decision was that, due to the time crunch, we were thinking hard about how to speed up our development process, and this is where AI-assisted coding comes in.

Now, I'm not recommending that you hand a prompt to your favorite AI coding assistant, go lie on the beach drinking a piña colada for a few hours, and come back. That's definitely not what we're recommending. But there are a few things we found really helpful when we were building Fantasy AI. The first is that it lets you learn a new framework much faster, because you can do customized Q&A and ask questions that are specific to your situation right now. It speeds up learning instead of reading all the raw documentation line by line.

The second thing is, we all know we're not knowledgeable about everything, so there are always cases where we need to learn a new concept or work with something unfamiliar. With AI's help, we can dive deep into those things and understand how a framework works: how we should be orchestrating EKS, for example. That lets us understand the why and the how of the new technology we'll be using. And the third is that we found AI coding assistants really helpful for writing undifferentiated code.

For example, writing the test suite shrank from hours or days of work down to a matter of minutes. Of course, you still need to validate that those test cases are correct, but it takes a lot of work off our shoulders. It all sounds rosy, right? Now, I'd like to welcome Michael back to the stage to talk about the challenges and the lessons learned throughout this journey. Take it away, Michael.

Thumbnail 1160

Thumbnail 1180

Thumbnail 1200

Challenge One: Building the Semantic Stats Dictionary

So I get to talk about the fun part: how do you actually get this into production? How do you make it happen? We're going to spend a few minutes talking about some real production challenges we faced, and the theme I want you to think about is that pragmatic production beats perfect every time. We're trying to get production-grade software cranking; we're not necessarily looking for perfection. So I'm going to take you on the journey that we went on. Challenge number one: getting our agentic playbook together. How would we take this type of data and provide it to an agent quickly and under pressure? If you're building an agentic application, you're probably dealing with some domain-specific context and domain-specific semantic meanings, and you're trying to figure out how to get that data to your agent quickly. Challenge one was how we solved that for Next Gen Stats.

Thumbnail 1230

The very first challenge was the data itself. NFL's Next Gen Stats are incredibly vast, with hundreds of unique data fields that mean different things in different contexts. Snaps aren't just snaps—they could mean snap share, the number of times the quarterback takes the ball, drops back, the ratio, the percentage, or the defensive conditions. Domain expertise matters because snaps aren't just snaps. So we had to figure out how to focus the agent on the right data.

Thumbnail 1240

Thumbnail 1260

We didn't try to make that decision ourselves. We let the agent tell us, and we let the NFL tell us. The first place we started was with NFL Pro analysts. We asked them how they break down these questions. When they write insights on NFL Pro, how do they think about answering these types of questions, and when do they use the data they use and why? We wanted to understand the mind of the analyst so we could encode it into our agent.

Thumbnail 1310

We took that information and compared it against NFL's Next Gen Stats. Each API and each source within Next Gen Stats has a contextual usage. We asked Pro analysts to categorize this for us and give us basic instructions for where we would use certain types of data to answer certain types of questions. From that, we built a basic data dictionary—not a data layer, not a third-party service, not another product, because we were going from zero to production in eight weeks. We needed a tactical, practical solution, and this was a semantic stats dictionary.

Thumbnail 1340

We stripped out all of the context and all of the rich descriptions that NFL's Next Gen Stats API provides and passed just the field names to our model. How would you use it? What do these stats mean? We then used a larger model to evaluate the responses, and bit by bit, the LLM informed itself about what the data was, how to use it, and when to apply different types of stats to answer different types of questions.
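Here is a sketch of that build loop, assuming Bedrock's Converse API via boto3. The model ids, prompts, and single refinement pass are illustrative stand-ins for the iterative process described above, not the team's actual pipeline.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
WORKER_MODEL = "us.anthropic.claude-3-5-haiku-20241022-v1:0"   # assumed ids
JUDGE_MODEL = "us.anthropic.claude-sonnet-4-20250514-v1:0"

def ask(model_id: str, prompt: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def build_dictionary_entry(field_name: str) -> str:
    # 1. Give the model only the bare field name, none of the API's rich prose.
    draft = ask(WORKER_MODEL,
                f"You are a fantasy football analyst. Explain when and how you "
                f"would use the stat field '{field_name}' to answer questions.")
    # 2. Have a larger model grade the draft for correctness and brevity.
    critique = ask(JUDGE_MODEL,
                   f"Field: {field_name}\nDraft: {draft}\n"
                   "Is this usage correct and concise? Suggest fixes.")
    # 3. Refine; the real process iterated until the judge approved the entry.
    return ask(WORKER_MODEL,
               f"Rewrite this entry applying the feedback.\n"
               f"Feedback: {critique}\nEntry: {draft}")
```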

Thumbnail 1370

This was key because it allowed us to focus on just the right data to provide to the agent to solve a different type of question. Otherwise, we'd be dumping hundreds of thousands of tokens if we tried to provide a vast level of rich domain expertise to every single question. We needed just the right information filtered at just the right time to the agent.

Thumbnail 1400

Having built that semantic data dictionary, we could pass in just the semantically relevant portions of the dictionary at runtime and allow the agent to apply the information. Because we were using Strands Agents, which has model-driven orchestration, we had a loop cycle to go through where the agent could choose what data it needed and retrieve it. This dropped our initial token consumption by 70 percent. Instead of pulling in vast, highly descriptive human-readable API specs, we brought in information in a language that the LLM could understand. We taught the agent what to ask for, how to ask for it, and when to use it. We didn't try to provide it with everything.
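In spirit, the runtime step is simple: keep the dictionary keyed by semantic category and inject only the slices a given question needs. The categories and field names below are hypothetical placeholders.

```python
# Hypothetical dictionary, keyed by semantic category; real entries carry
# the LLM-refined usage notes built in the previous step.
STATS_DICTIONARY = {
    "passing": {"cpoe": "...", "time_to_throw": "..."},
    "rushing": {"ryoe": "...", "rush_pct_over_expected": "..."},
    "receiving": {"avg_separation": "...", "catch_pct_over_expected": "..."},
}

def relevant_dictionary(categories: list[str]) -> str:
    """Render only the requested categories for the agent's context window."""
    lines = []
    for cat in categories:
        for field, usage in STATS_DICTIONARY.get(cat, {}).items():
            lines.append(f"{cat}.{field}: {usage}")
    return "\n".join(lines)

# A quarterback matchup question pulls only the passing slice:
context = relevant_dictionary(["passing"])
```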

Thumbnail 1440

Challenge Two: Consolidating Tools for Agent Autonomy

That gave us the data we needed, but we still had to figure out how the agent was going to get a hold of it. We made a design decision for MCP, and that led us to our next challenge: tools. This was the first place we messed up. We sat down and thought, all right, we've got MCP, we're going to use tools to access the data. Let's write a tool for each of the major use cases we think we're going to encounter. That should be great, right?

Thumbnail 1460

When you're dealing with an agentic loop and you have 29 different tools or whatever that number happens to be, and you give an agent instructions to be complete and thorough, it's going to be complete and thorough. You're going to have dozens of loops and tool calls, each one with narrow bits of information that lack the rich context to answer a broad question. How do you give the agent the right tools without essentially hardcoding the decisions into tiny slices that are effectively a new set of APIs? We're dealing with autonomous agents—we want them to think, reason, plan, and orchestrate, not just call a singular little bit of data.

Thumbnail 1490

The answer was to consolidate not on our use case but on the boundaries of the data itself. For example, with projections, if you want to get a projected fantasy score or a projected number of touchdowns or projected yardage for a player or across a game, you need to consider weekly projections, season projections, and rest-of-season projections for a player, a team, or a defense. We started out with six tools that gave highly fragmented answers.

What we realized was that we needed a much broader tool. We needed tools that looked at data boundaries and gave the agent the autonomy to express rich and nuanced intent.

The agent can select one or more dimensions of projections at the same time, allowing it to think ahead about the data it might need and ask for it all at once instead of making multiple round trips and multiple loops with the associated latency. You get one call for multiple players, multiple teams, or defenses. The trade-off was that we had to build fairly complex logic for compressing the data. This was a production system, and we were returning quite a bit of information when the model requested a fairly rich response.
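Here is a sketch of what a consolidated, boundary-aligned tool can look like with the Strands @tool decorator. The tool name, parameters, and groupings are assumptions for illustration, not the production spec.

```python
from strands import tool

@tool
def get_projections(
    entity_ids: list[str],                 # players, teams, or defenses
    entity_type: str = "player",           # "player" | "team" | "defense"
    horizons: list[str] | None = None,     # "week" | "season" | "rest_of_season"
    stat_groups: list[str] | None = None,  # e.g. ["fantasy_points", "touchdowns"]
) -> dict:
    """Fetch projections across one or more horizons in a single round trip.

    One boundary-aligned tool replaces six narrow ones: the agent states all
    the dimensions it needs at once, and the data layer fans the queries out
    in parallel and compresses the combined result.
    """
    horizons = horizons or ["week"]
    ...  # query, parallelize, and compress
```

One call can now cover several players and horizons at once, which is exactly the kind of rich intent the six narrow tools couldn't express.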

Thumbnail 1590

The challenge was getting our agent to understand when to use those parameters so that it didn't just result in a very large call each time. We used the same approach that we used with our data dictionary: LLM-assisted refinement. We stripped out everything from the docstring and decorator except for the parameters and asked our agent how it would use these. Then we took that response, used a larger model to evaluate it, and continually refined until we had a docstring that the LLM understood.

It didn't make a lot of sense to the developer first reading it, so we added human-readable comments beneath it. But now we had a docstring and a data dictionary that LLMs could almost natively understand, and it gave them domain expertise and the ability to understand the data and how to get it. This effectively reduced our tool spec by fifty percent. Why fifty percent? We increased the size of the docstring for a smaller number of tools. If you're caching your tool spec, this probably isn't a massive change in your token consumption. But what it does mean is that you have far fewer tools, fewer round trips to the LLM, fewer invocations, lower latency, and more throughput.

Thumbnail 1640

Surviving Game Day: Fallback Providers and Observability

Combined, data and tools gave us intelligence—the intelligence to produce fantasy-grade advice fast. But intelligence doesn't ship; you need to deliver it. So we had our next challenge: making sure this thing could survive game day. We weren't going in blind. We had a scouting report. We knew from NFL Pro what their web traffic looked like and the expected user base. We had hundreds of questions provided by NFL Pro analysts that we could use to evaluate quality and performance-test. We could still do our due diligence, but this is emerging technology with non-deterministic, emergent behaviors.

Thumbnail 1690

When you're dealing with emerging technology, history isn't the true test of success—it's users and actual user behavior in production environments. We didn't have time for that. We were going zero to production in eight weeks, so we had to build defensively. Three plays kept us in the game here. The very first play: what do you do when you hit capacity limits? Who has attempted to build or use a frontier model in the last few months? How many of you have seen a message to the effect of "Throttling exception, service quota exceeded, service not available"? That's not really something you can show to your users, is it?

Thumbnail 1740

We had planned for a certain volume based on NFL Pro traffic. We'd done our service quota due diligence and set our thresholds appropriately, but what happens on Sunday morning? What if the users who came were more than we thought? What if they asked more complex questions than we thought? What if our emergent behavior was different than we thought? If a user sees a spinning notice, doesn't get a response back, or heaven forbid gets a throttling exception, they're not coming back, and this is NFL Pro's brand on the line. We couldn't abide a situation where we ran into service capacity or a throttling exception, so we built around it and architected around it with a fallback provider.

Thumbnail 1770

The Strands Agents framework allows you to extend its model class, which is how we built what we call our Bedrock fallback provider. It sits between the agent and the Bedrock service while the agent processes a request. If the Bedrock service returns anything that looks remotely like a throttling exception, account quota exceeded, or service not available, we intercept that message and send our request instead to a secondary model. We chose the Anthropic family of models for this, with a frontier model as primary and a relatively well-tested model as our fallback. In a situation where we encounter throttling, we have a backup ready to go.

When we encounter throttling, that message is intercepted and sent to the fallback model, then back to the user with additional latency in the order of a handful of milliseconds. The user still gets a response, and we get battle-tested information on the true extent of token consumption and throttling affecting the throughput in our production application without the user ever realizing they were part of the test.
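A sketch of the pattern (not the team's actual class): wrap a primary and a fallback BedrockModel and reroute when an error looks throttle-shaped. This simplification assumes the throttle surfaces before any tokens have streamed, and the real Strands model interface has more surface area than shown here.

```python
from botocore.exceptions import ClientError
from strands.models import BedrockModel

THROTTLE_CODES = {"ThrottlingException", "ServiceQuotaExceededException",
                  "ServiceUnavailableException"}

class FallbackModel:
    """Delegate to the primary model; reroute throttle-shaped failures."""

    def __init__(self, primary: BedrockModel, fallback: BedrockModel):
        self.primary, self.fallback = primary, fallback

    async def stream(self, *args, **kwargs):
        try:
            async for event in self.primary.stream(*args, **kwargs):
                yield event
        except ClientError as err:
            if err.response["Error"]["Code"] not in THROTTLE_CODES:
                raise
            # Primary throttled: replay the request on the fallback model.
            async for event in self.fallback.stream(*args, **kwargs):
                yield event
```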

Thumbnail 1870

We also introduced a circuit breaker, because even though this fallback capability only adds milliseconds of latency, we don't want to keep hitting a service that's overwhelmed. The circuit breaker stays open for a period of time and then reevaluates the primary model. If the throttling exception is gone, we fail back over.
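A minimal time-based sketch of that breaker; the cooldown value is an arbitrary assumption.

```python
import time

class CircuitBreaker:
    """Open on a throttle; retry the primary after a cooldown."""

    def __init__(self, cooldown_seconds: float = 30.0):
        self.cooldown = cooldown_seconds
        self.opened_at: float | None = None

    def record_throttle(self) -> None:
        self.opened_at = time.monotonic()    # open: route to the fallback

    def use_primary(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None            # cooldown over: fail back over
            return True
        return False
```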

Thumbnail 1910

Now, some of you are thinking: that's an anti-pattern. You just introduced bimodal system behavior. You don't know which model is going to service your request. You're right, we did. We chose to accept a 90% reduction in throttling exceptions on launch day, to ensure no user sees a throttling exception or capacity limit, and to provide the best of frontier models available on Amazon Bedrock in exchange for building something we will have to replace in a future version. When we have battle-tested real-world information about consumption, token burn, and throughput, we can adjust our service quotas accordingly, and we won't need this functionality anymore. But day one is too important. We couldn't allow our users to have a subpar experience, so we made the decision to introduce something we might not otherwise introduce.

Thumbnail 1940

Play number two was predicting our problems. We're dealing with emerging technology with emergent behavior. A model can produce different results for the exact same data with the exact same tool calls. How will the model respond when we actually see production behavior? How do you test the unknown? How do you know what your model is going to do when it hits the real world without it ever actually hitting the real world?

We extended the Strands Agents framework to give us per-turn reasoning insight. Strands Agents has a very rich observability framework in it. When we started working, that observability framework was still in progress; we started on roughly version 0.3, and it's at version 1.18 right now. We added per-turn reasoning instrumentation that let us see what the agent was doing, the tools it was calling and with what parameters, and what data it was returning. This gave us an understanding of exactly how it processed the hundreds of questions NFL Pro analysts asked during our UAT.
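One lightweight way to get that kind of per-turn visibility in Strands is a callback handler. The event keys below follow the documented callback interface as we understand it; the logging shape is our own assumption, not the team's instrumentation.

```python
import json
import logging

from strands import Agent

logger = logging.getLogger("fantasy-ai.turns")

def per_turn_logger(**kwargs):
    # Tool selection: which tool the agent chose this turn, with its input.
    if tool_use := kwargs.get("current_tool_use"):
        logger.info("tool=%s input=%s", tool_use.get("name"),
                    str(tool_use.get("input", ""))[:500])
    # Streamed text deltas, useful for inspecting reasoning as it happens.
    if data := kwargs.get("data"):
        logger.debug("delta=%s", data)

agent = Agent(callback_handler=per_turn_logger)  # plus model/tools as before
```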

Thumbnail 1980

This exposed certain decision patterns that allowed us to avoid some pretty unfortunate things on game day. Here's a look at what happened when we asked the question: who's the best wide receiver to pick up for week 10? That's a fairly straightforward question. That single question consumed 1.3 million tokens. Why did it consume 1.3 million tokens? It wasn't the answer. The answer was actually really good. It passed all of our QA and UAT checkpoints. The tool calls made sense. Why was it 1.3 million tokens?

Thumbnail 2050

Part of the instruction to the agent was to produce complex and thorough reasoning that's defensible and backed by data. So the agent requested stats, backstory, and projections for every wide receiver in the league. You don't need that. Fantasy managers are going five to ten players down if they're evaluating a waiver wire decision. They're not going to player number 80. We observed that even though we had the right data and the right tools, we still needed to govern and observe the behavior of our agents and put appropriate guardrails around the maximum data it could pull and under what circumstances it could do it.

That let us avoid a simple question burning 1.3 million tokens. Can you imagine the impact that would have had on throughput if we had released that? The takeaway here is: don't trust your new resources. This is emerging technology, and even though you may have a unit test, UAT, and performance testing, you need to interrogate the behavior of your models until you understand the emergent behavior inside and out.
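The guardrail itself can be as blunt as a cap inside the data layer. A hypothetical sketch, with an assumed threshold:

```python
MAX_ENTITIES_PER_CALL = 10   # fantasy managers rarely look past the top ten

def clamp_entities(entity_ids: list[str]) -> tuple[list[str], bool]:
    """Cap how many entities one tool call may pull; report truncation."""
    truncated = len(entity_ids) > MAX_ENTITIES_PER_CALL
    # Surfacing `truncated` in the tool result lets the agent deliberately
    # ask again if it genuinely needs more than the cap.
    return entity_ids[:MAX_ENTITIES_PER_CALL], truncated
```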

Thumbnail 2090

Optimizing Performance Through Caching and Prioritizing Intelligence

The third and final tactical challenge we had to solve was caching. You might be wondering why we're talking about caching. Well, we're talking about caching because NFL's Next Gen Stats is incredibly vast. Even with the data dictionary that we used to bring in only semantically relevant data, even with the tools to choose just the right data for the time, and even with the instrumentation to know when we were pulling more than we needed, we still needed caching to optimize our throughput.

Thumbnail 2120

Thumbnail 2130

It's still an incredibly token-rich environment. If your data constitutes a token-rich environment, this is something that can help you. How do you allocate cache when you don't understand what users are going to do? When you can't predict the conversation pattern well in advance, don't. Don't try to predict it down to the nth degree. Don't build an incredibly complicated predictive algorithm to try and understand something that hasn't launched yet.

We studied the tape just as a coach or team would study the tape before they go onto the field on game day. We studied past conversational patterns of the NFL pro analysts we work with. We asked real users that we could find how they ask questions about fantasy. From that, we understood a handful of things. We used our understanding of conversational patterns to allocate our cache points in the Anthropic family of models we worked with.

Thumbnail 2170

Thumbnail 2180

Thumbnail 2190

Thumbnail 2200

Thumbnail 2240

Now, most of you are already using two of these. You're caching your system prompt and you're caching your tool spec, and you're getting a lot of good results from that: you're not burning tokens on the prompt or tools at every invocation. But you've got two more cache points. How should you use them? Fantasy users tend to ask open-ended questions and refer back. They ask about player A: how is Justin Jefferson looking this weekend? When we ask that question, the agent is going to do something like get stats; that might be a 50-token response. It's going to get some relevant news, look at injury reports, look at context; that's going to be another 280 tokens or so. What's the follow-up question going to be? It's not going to be about Davante Adams; it's going to be about Justin Jefferson. You're going to need all that data again. Fantasy managers tend to ask follow-up questions, so caching the heavy hitters—not every single call to the LLM, not every single tool, but the ones that were really meaty and carried heavy token consumption—allowed us to slide those heavy token hitters out once we were out of that part of the conversation.

Thumbnail 2270

We used a very simple mechanism. Remember, practical beats perfect in production. We used a simple sliding-window mechanism that tracked the heaviest MCP tool calls, and when a new heavy tool call came in, we slid the oldest one out. Sounds fairly simple, right? Simple patterns work. This simple pattern increased throughput on our agent by 2x. That's it: just caching the two most recent heavy MCP tool call results doubled throughput. And folks, on top of all of the other optimizations we've shown, it reduced our cost by 45%.
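A sketch of that sliding window against the Bedrock Converse API, where prompt caching is expressed as cachePoint content blocks. The threshold, window size, and message shapes are assumptions.

```python
HEAVY_TOKEN_THRESHOLD = 1000   # assumption: "meaty" tool results exceed this
MAX_CACHED_RESULTS = 2         # the two most recent heavy hitters

def place_cache_points(messages: list[dict], token_counts: list[int]) -> list[dict]:
    """Keep cachePoint blocks only on the newest heavy tool results."""
    # Slide out any previously placed cache points.
    for msg in messages:
        msg["content"] = [b for b in msg["content"] if "cachePoint" not in b]
    # Tool results ride back on user-role turns in the Converse API.
    heavy = [i for i, (msg, n) in enumerate(zip(messages, token_counts))
             if msg["role"] == "user" and n >= HEAVY_TOKEN_THRESHOLD]
    for i in heavy[-MAX_CACHED_RESULTS:]:
        messages[i]["content"].append({"cachePoint": {"type": "default"}})
    return messages
```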

The answer isn't "build a sliding window to manage cache points." The answer is: when you're thinking about speed to production, find the simplest pattern that yields the benefit and chase it. You can analyze later, with real-world data, once you've been in production and gathered those patterns, and you may find out that this isn't actually optimal. But get the 80% win first, take it, and move to production. Simple, dramatic results beat perfection in production every time.

Those are 5 of the challenges that we faced across those two categories: building intelligence and getting ready to deliver. But you might not be building an assistant. You might be building some other type of agentic application, and you're thinking: this is great, but I don't deal with heavy MCP tool calls, and I don't have a lot of heavily semantic information to manage. Okay, fine. I'm going to share what we learned and how we decided what we shipped versus what we didn't ship. This is a mental model for thinking about the evolution of agentic applications and how you should prioritize features when you're trying to get to production quickly.

Thumbnail 2360

Agentic applications have only two layers. The first is intelligence. Intelligence comprises reasoning, tools, and data. Reasoning is the ability to think, to plan, to orchestrate, to execute steps in order to decide, based on information, how to handle a query. The agent applies reasoning to tools and data, and between the two of those, it gets intelligence. This is what matters. Don't get caught up in perfect infrastructure. Don't get caught up in the perfect user experience, because if you get this wrong, your users won't come back. Intelligence may not ship, but it is the product. We are no longer in a world where features are the product; in agentic AI, intelligence is the product. Get intelligence right first.

Thumbnail 2410

Then figure out how to ship it. Ship it just good enough with delivery. I'm not telling you to neglect the well-architected framework. Think about resilience, think about security and privacy, think about compliance. Your infrastructure is still important. How you make it accessible to your users, how you make that intelligence accessible to your users, is still important through the user experience. But if you don't have intelligence, if your agent doesn't have intelligence, you don't have a product. You may have a wonderfully resilient architecture with a UX that your users aren't going to come back to.

Thumbnail 2460

When you're prioritizing your own agentic applications, think of this mental model: get intelligence right, ship good enough infrastructure and user experience, and when you get real world data, begin to iterate. This is the mental model that we used that allowed us to ship in eight weeks. We applied that mental model to job number one. Mike told you about it when he was up on the stage. That was to deliver NFL Pro Analyst grade fantasy advice fast.

Thumbnail 2480

Thumbnail 2510

Because we prioritized the intelligence of our agent, we delivered. We were able to achieve 90% analyst-approved responses. We were able to start streaming those responses in under five seconds, with the full response and complex analysis by the end of thirty seconds at high throughput, because of some of the challenges and trade-offs that we encountered. We had multi-model resilience and the ability to handle Sunday, Monday night, and Thursday night football traffic spikes. There's a lot we didn't ship. We didn't ship conversation history. We didn't ship integration with leagues. We didn't ship user feedback loops or custom preferences.

That's because these are valuable features that don't prove job number one, and when you're trying to get a product to production, job number one is the only job that matters. Get it to production; make it work with the simplest patterns and architecture that produce the results you can't compromise on: analyst-grade advice and timing. Then build upon it. I'm not saying to build an inferior architecture. Conversation history in Fantasy AI is a reconfiguration, not a refactor. We use the Strands S3 session manager to persist conversation data, so when we're ready to expose history to users, it's a front-end exercise.
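A sketch of persisting conversation state with Strands' S3 session manager; the session id, bucket, and prefix are illustrative assumptions.

```python
from strands import Agent
from strands.session.s3_session_manager import S3SessionManager

session_manager = S3SessionManager(
    session_id="user-1234",          # one session per user conversation
    bucket="fantasy-ai-sessions",    # illustrative bucket name
    prefix="prod/",
)

# Every turn is persisted to S3 automatically; exposing the history to
# users later is a front-end exercise, not a refactor.
agent = Agent(session_manager=session_manager)
```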

Thumbnail 2590

Results and Future Applications: Empowering NFL Analysts

Those are the types of design decisions you get with a model-driven orchestration framework and building with MCP. We proved job number one, had a lot of challenges along the way, but we deployed to production in eight weeks. To tell you what we've seen, what it's meant for the NFL, and what's next, I want to welcome back Mike Band. Thank you, Michael. When Michael asked us where we saw the Fantasy AI assistant going, the first thought was: we could add more data, we could add more context, we could add articles from our fantasy analysts on NFL.com, and we could make the answers better and more robust.

But what if we took it a step further? What if we thought outside the box? One of the ideas we have: we have a team of analysts right now at the office preparing for week fourteen. They're going through all of our stats dashboards and all of the tools we have, working more or less by hand, using their own football expertise to come up with storylines and to write the most compelling narration that we give to broadcasters like Greg Olsen and Cynthia Frelund. How can we bootstrap those efforts? How can we create more insights and richer context, and drive the output of our team to not just two times, but four times, ten times the amount of output that we have?

Thumbnail 2670

Thumbnail 2680

Thumbnail 2690

Thumbnail 2700

So we had an idea: what if we could get our Fantasy AI assistant to help us write weekly insights, increasing our output on a week-to-week basis? Let me show you exactly what that looks like, live. I pull up NFL Pro. You can do it on your phones right now; go to pro.NFL.com. I'm going to type in: analyze the fantasy performance by Patriots rookie running back TreVeyon Henderson in week thirteen against the Jets on Thursday Night Football. After conducting your analysis, summarize your findings in the format of a four-sentence Next Gen Stats insight that can be featured on NFL Pro.

Thumbnail 2740

Use Next Gen Stats and advanced stats to support our analysis. What we're doing here is prompting our AI assistant to write an insight in the exact format that our team would on a week-to-week basis. In this case, we get a very rich answer that provides reasoning and detailed analysis. Based on the analysis, the Patriots rookie running back delivered a breakout three-touchdown performance in Thursday night's 27-14 victory over the Jets, recording 19 carries for 62 yards and two rushing scores while adding five receptions for 31 yards and a receiving touchdown. His 27.3 fantasy points ranked him as the RB1 for the week, showcasing his red zone efficiency with seven red zone carries that resulted in touchdowns.

Thumbnail 2770

I won't read the whole thing because I want to go a little bit deeper into this. We had a human analyst that wrote a very similar insight, completely separate in a parallel path, and I want to show you the similarities between these answers. Both outputs mentioned that he scored 27.3 fantasy points. They both mentioned that Rhamondre Stevenson, the previous starting running back, was out with a toe injury that led to Henderson's emergence and breakout performance. They mentioned that he had a 90 percent team snap share, which really emphasizes that the team is relying on him and his participation, meaning he has the opportunity to score points.

Both outputs mentioned that he had 19 carries for 62 yards, adding five receptions for 31 yards through the air, and that he forced nine missed tackles. What the AI assistant did not output was that he had 70 rushing yards after contact or that he's playing the Bengals next week. However, what the AI assistant did note was that he had a 37.5 percent missed tackle rate on just 3.3 yards per carry, and more context about the game itself and that the Patriots won 27 to 14. When you compare those two outputs, you can see that we essentially added another fantasy expert to our team.

Now our fantasy experts who write these insights on a week-to-week basis are using the AI assistant to power their work on a day-to-day basis. It is not about replacing human analysts because that research and football context and football acumen is so important. If you don't know that a defensive coordinator was fired and that has an effect on the way that they're playing, then that's going to be a missing piece if you go fully automated. What we expect to do is use the AI assistant to help us find research nuggets, put it into a concise format in the style of a paragraph, and help us write on a week-to-week basis. Our output as a team goes from one times to now ten times, and that's what we're going to do and continue to evolve with the Fantasy AI system, not just an experience for external public users to help make decisions, but internally how we make our team more effective on a day-to-day basis.

Thumbnail 2900

Thumbnail 2920

With that, you can scan the following QR code and have access to the NFL Pro Fantasy AI assistant right now. There are only a few weeks left of the fantasy season. If you have a team, I hope you're in the playoffs, and if you are in the playoffs, Fantasy AI is ready to help make that a reality. All right, we talked about our building experience. Now it's your turn. Go build. Put your hands on the keyboard. Production is not about perfection; it's about practical application of frameworks and resources. Here are two of the best that we used during our time.

Thumbnail 2940

Thumbnail 2960

The first is AWS Prescriptive Guidance for agentic patterns. You may not be building a human-to-agent assistant. You may be building a different type of agent. Agentic patterns through AWS Prescriptive Guidance provide real-world guidance on how to build those agents and build them to production grade. Second is a link to the Strands Agents framework. We found the model-driven orchestration power of Strands Agents to greatly accelerate not just our build, but our evaluation of our agents' capabilities. If you're looking to build fast and build agents fast to production grade, Strands Agents can be a great place to start.


This article is entirely auto-generated using Amazon Bedrock.
