🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - RAG is Dead: Long Live Intelligent Retrieval-Augmented Generation (AIM214)
In this video, the speaker explores how agentic RAG (Retrieval Augmented Generation) solves complex query problems that traditional RAG systems fail to handle, using restaurant recommendations for couples as a real-world example. The presentation covers three key contributions: query understanding through sub-query generation, query routing, and query expansion; optimized retrieval by making databases ergonomic for agents using hybrid search, filters, and re-ranking; and iterative generation using evaluation checklists to ensure quality. The speaker demonstrates how agents can decompose complicated natural language queries with multiple constraints (dietary restrictions, atmosphere preferences, budget) and orchestrate retrieval across multiple data sources (restaurants, dishes, reviews) to deliver highly relevant recommendations, even at the cost of increased latency.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Restaurant Recommendation Challenge: Why Traditional RAG Falls Short
Hey everybody, how are we doing? Thanks so much for coming to my session. RAG is dead. Long live intelligent retrieval augmented generation. We gather here today to discuss the types of situations that unfortunately can slay RAG and what agentic retrieval can do to help solve this problem. Today we'll talk about recommendation systems, what agentic RAG is, and different contributions that agentic RAG can make in order to solve these sorts of difficult queries.
But before we get into any of that, I want you all to take a second and think about this problem with me. Imagine you're building a tech startup, which I imagine is not too difficult to imagine for many of you. You're trying to build a recommendation service for restaurants and dishes, a really popular one, and over time you notice that there's this really sticky group of people that come back to use your system, but they don't seem to actually enjoy it that much. These are couples that are going on a first date. You realize that these couples tend to have user queries in your system, and what they really want is a set of dishes and restaurants that they can go to that aligns with that query.
So you might have a situation where you have person one with some restrictions: they need to avoid dairy, they only eat chicken, and they prefer eating spicy food. And you have the second person in the relationship, who might love fish and beef, value a romantic atmosphere, and have a lot of specific food allergies, so it's really important to find the best restaurant for them. You do some user studies, you sit down with these couples, and you realize that the conversation goes back and forth. People think about what they're looking for, and they eventually land on something like the following phrase: we need date night restaurants within our budget, near us, that have vegetarian, non-dairy options, spicy options, maybe a good shareable dessert, with a romantic vibe. A really complicated set of preferences.
Now, at this point, you might be thinking that this is an oddly specific scenario and seems too arbitrary. How many of you have tried to figure out where to eat with your significant other? It gets pretty complicated pretty quickly. You might be thinking this is not real, and I'm here to tell you that this is absolutely real. Almost six years to this exact day, I was agonizing over exactly what to do in this exact situation because I was trying to impress a beautiful girl. I had to figure out exactly where we needed to go, and it turns out that those were our real preferences that I had to figure out to find the actual restaurant for our first date. And it worked. Six years later, I proposed earlier this year. So yes, this is a real problem that people face. These are types of queries that people will have to deal with in specific situations, with restaurant recommendations being one of them.
What's sticky about this is that if we were to use a system like this and it gave you a poor recommendation, the consequences are huge. You really just get one shot to get this right. Now, let's think a little harder about what makes this problem so difficult. There are four distinct issues with this type of situation. Number one, there are a lot of distinct data sources that your system needs to reconcile, which is difficult for a retrieval augmented generation system. You need to look at restaurant reviews, descriptions, dishes, images, what people are saying, where these restaurants are, and any dietary restrictions associated with those dishes. Just thinking about how to maintain the context around all of those data sources is a pain.
The second issue is that it's a high risk, high reward situation. If your primary demographic is people who haven't used your application before and they're looking for a really good recommendation the first time they interact with your system, you had better serve them a good recommendation. In this situation, there are three consequences. First, you're trying to acquire this couple, this user, for the entirety of their journey, so they keep coming back to your application to get those good restaurant recommendations. Second, the restaurant those recommendations point to is going to gain two seats instead of one, so it's really important for the recommendations to be aligned with what those users actually need. And finally, the couple cares, right? You want to go to a place that's going to have really good food.
The third issue is that we're dealing with really complicated natural language queries. If you were to throw this query at a traditional vanilla RAG system, it would absolutely fail, because there is usually no query understanding step built into that retrieval. Specifically, the query is long, requires a lot of reasoning capability to understand the motivation behind it, and requires planning and coordination to surface the relevant recommendations. Finally, because this query is so complicated, vanilla RAG approaches will simply fail, and a lot more engineering work is needed to solve this problem.
If only we had a set of magical boxes that could help us parse and understand what these queries are doing, wouldn't that be wonderful? Well, it turns out that if we use agentic RAG, we can address the complexity of these different types of complicated queries through flexibility. In this talk, I'm going to go over a few techniques in order to do that specifically with retrieval and databases.
Agentic RAG Architecture: Query Understanding and Optimized Retrieval Patterns
The contributions that agentic RAG can make are the following. First is query understanding. In other words, we're going to ask the question, how can we interpret what the user actually needs, and then serve recommendations under that context? Second, we can optimize retrieval patterns even further. If we know what they need, how do we retrieve the best context for that information? Finally, with agentic RAG, we have the ability to do iterative generation, so we don't have to stop at the first retrieval. Once we have the best context, how do we generate the best response?
The primary architecture we're going to be talking about in this talk is the tool use workflow architecture that is so easy to implement with agentic RAG and allows you to take a stab at this problem. So we might have some query, like, "Hey, we need to eat somewhere near our hotel with our budget. I avoid dairy and meat, but I'm fine with eggs. I love spicy food and it's date night, so we need shareable plates." This is a real thing I've said in real life, by the way. So an agent might take that query. It might have access to a set of tools and workflows like query transformation procedures, criteria evaluation, a bunch of different databases like our dishes, restaurants, allergies, known allergens, and reviews, and then serve a pattern of recommendations. It's important to note that in this architecture, the agent is deciding what tools to use when, so it's particularly flexible to the types of queries that are being passed.
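The tool use workflow above can be sketched in a few lines. This is a minimal illustration with the agent's decision-making stubbed out as keyword checks; every tool name and data field here is hypothetical, and in a real system an LLM would decide which tools to call and when.

```python
# Toy restaurant "index"; stands in for a real database.
RESTAURANTS = [
    {"name": "Casa Roja",  "spicy": True,  "dairy_free_options": True},
    {"name": "Fromagerie", "spicy": False, "dairy_free_options": False},
]

def parse_restrictions(query: str) -> dict:
    """Stub tool: pull hard dietary constraints out of the query."""
    return {
        "no_dairy": "dairy" in query,
        "wants_spicy": "spicy" in query,
    }

def search_restaurants(filters: dict) -> list:
    """Stub tool: stand-in for a real restaurant index lookup."""
    return [
        r for r in RESTAURANTS
        if (not filters["no_dairy"] or r["dairy_free_options"])
        and (not filters["wants_spicy"] or r["spicy"])
    ]

TOOLS = {
    "parse_restrictions": parse_restrictions,
    "search_restaurants": search_restaurants,
}

def agent(query: str) -> list:
    # The agent decides which tools to use and in what order; here that
    # "decision" is hard-coded to keep the sketch self-contained.
    filters = TOOLS["parse_restrictions"](query)
    return TOOLS["search_restaurants"](filters)

recommendations = agent("I avoid dairy and I love spicy food")
```

The key property is the indirection through the tool registry: the agent, not the application code, picks the sequence of calls, which is what makes the architecture flexible to different query shapes.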
Great, so let's begin with query understanding. Agents are able to plan queries for optimal retrieval. In a traditional RAG system, you're going to have the query hit the database directly, but an agent can modify the query or add additional queries in order to retrieve optimized context. Here are three different ways you can accomplish this. The first is sub-query generation, where we ask the question, what else could the user need based on the query that they're asking? The second is query routing. This is really important and underestimated by a lot of developers. When we have an incoming query, where do we actually send that query and what databases are relevant for the user at that given point in time? Finally, query expansion. If we have a given user query, what other terms or concepts need to be included in order to retrieve those things? And what constraints, in addition to those terms and concepts, do we need to be aware of?
So what does this look like in practice? Imagine you're going to do query planning in the context of tool use. So we have this query I said before: we need to eat somewhere near our hotel, we have a budget, we avoid dairy and meat but we're fine with eggs, I love spicy food, and it's date night, so let's do shareable plates. The agent might intake that query and think about what tools are relevant in order to start parsing and breaking this query down. I've put in some example tools here that generate individual queries we can then pass to their relevant databases to retrieve context that is reconciled for the ultimate recommendation.
For example, there are some restrictions in the query here that are relevant to what could be inside dishes, so I could have a tool that parses those restrictions out to generate filters: for example, food that leans toward being spicy, or allergies we've indicated, which need another parsing step that can reliably pull that information out of the query. There might be preferences that are more flexible and natural language that we can use in criteria evaluation later, such as no dairy or meat (but eggs are fine), or family style serving. And finally, there are restaurant level filters that we'll need to extract and interpret. For example, maybe we need something under $40 with a romantic atmosphere, information that might be inside the reviews, and rated four stars plus.
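One way to picture the output of this query-understanding step is a structured plan that separates hard filters, soft preferences, and per-index sub-queries. The sketch below hand-rolls the parsing with keyword checks purely for illustration; in practice an LLM with structured output would populate the plan, and all field names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    dish_filters: dict = field(default_factory=dict)      # hard constraints
    soft_preferences: list = field(default_factory=list)  # for criteria evaluation later
    restaurant_filters: dict = field(default_factory=dict)
    sub_queries: dict = field(default_factory=dict)       # index name -> query text

def plan_query(query: str) -> QueryPlan:
    """Toy decomposition of one natural-language query into a retrieval plan."""
    plan = QueryPlan()
    if "avoid dairy" in query:
        plan.dish_filters["contains_dairy"] = False
    if "spicy" in query:
        plan.soft_preferences.append("leans spicy")
    if "date night" in query:
        plan.restaurant_filters["mood"] = "romantic"
        # A generated sub-query routed specifically at the reviews index.
        plan.sub_queries["reviews"] = "romantic atmosphere, good for dates"
    plan.sub_queries["dishes"] = query  # fall back to the raw query text
    return plan

plan = plan_query("We avoid dairy, love spicy food, and it's date night")
```

Separating hard filters from soft preferences matters downstream: filters shrink the search space before retrieval, while soft preferences feed the evaluation step after it.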
What about retrieval? In order for agents to work well with databases, you need to make the databases ergonomic for those agents, and there are a few ways to do that. The first is you need to put context around the data sources that your agents have access to. This is probably the number one mistake I see developers make, where they link their database to their agents without properly informing and passing schemas back to the agents on what information is actually contained in that database. So you need to ask the question, what data can the agents access and how can it be used, and how should the agent use that information?
The second is the implementation of powerful search techniques. How do we interact with this data? How is it stored, and how do we use it in order to maximize relevance? For example, there may be specific search techniques that are more appropriate for certain kinds of databases than for others. Finally, we should be using filters as much as we can in order to reduce the search space and decrease latency.
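The filter-first idea can be shown in a few lines: apply the hard constraints before any scoring, so the expensive relevance step only ever sees candidates that could actually be served. The data and ranking function below are invented for illustration.

```python
# Toy dish "index" with metadata; stands in for a real filtered database.
DISHES = [
    {"name": "paneer tikka",     "price": 14, "contains_dairy": True,  "spice": 3},
    {"name": "chicken vindaloo", "price": 18, "contains_dairy": False, "spice": 5},
    {"name": "mild dal",         "price": 12, "contains_dairy": False, "spice": 1},
    {"name": "wagyu omakase",    "price": 95, "contains_dairy": False, "spice": 0},
]

def filtered_search(dishes: list, max_price: int, no_dairy: bool) -> list:
    # Hard filters first: shrink the candidate set before any ranking work.
    candidates = [
        d for d in dishes
        if d["price"] <= max_price and not (no_dairy and d["contains_dairy"])
    ]
    # The relevance step (here: prefer spicier dishes) runs on far fewer items.
    return sorted(candidates, key=lambda d: d["spice"], reverse=True)

results = filtered_search(DISHES, max_price=40, no_dairy=True)
```

In a production vector database the same pattern appears as metadata filtering applied during the index query, which both reduces latency and guarantees hard constraints are never violated by a merely "similar" result.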
How should we store this data in order to reduce that latency and the time to first response? So what does this look like? I've put together a table here of different databases that we could use in our application and how we would want to store and represent that data in order to retrieve it and make it ergonomic.
Suppose we have an index of restaurants in this first row. It might contain names, descriptions, some metadata about the restaurants like location. We might want to do some pre-processing, like summarizing what the restaurant is about across all of the reviews. That allows us to search over all the restaurants with the same exact schema, thus resulting in standardization. There might be some applicable filters that are important for our agents to know about, like pricing, moods, location.
And it might be best to implement a form of hybrid search, which allows us to leverage the semantic descriptions that restaurants naturally have, in addition to the keywords that we care about, like romantic or date night. For dishes, we might need a separate index, and that index might be multimodal because people take photos of their food and we care what they look like. There might be descriptions and linked reviews, we might need to generate potential ingredients that could be missing in order to caution users against potential cross-contamination.
There might be allergen level filters that we'll have to think about. And because people really care about what they ultimately will eat at the restaurant, we might need to implement some re-ranking in order to optimize the relevancy there. Finally, we have the review index, which is also inherently multimodal. People are typing their descriptions and also putting images of the food that's there. We might need to implement an image to image search or an image to text search, and we need to think about how to instrument this for our agent at hand.
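Hybrid search as described above blends a keyword signal with a semantic one before re-ranking the shortlist. The sketch below is deliberately simplified to stay self-contained: real systems would use BM25 for the keyword score and embedding cosine similarity for the semantic score, whereas here both are toy set-overlap measures.

```python
def keyword_score(query: str, text: str) -> float:
    """Toy lexical score: fraction of query terms appearing in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / max(len(q_terms), 1)

def semantic_score(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity (character-set Jaccard)."""
    q, t = set(query.lower()), set(text.lower())
    return len(q & t) / len(q | t)

def hybrid_search(query: str, docs: list, alpha: float = 0.5, top_k: int = 2) -> list:
    # Blend the two signals; alpha weights keyword vs. semantic relevance.
    scored = [
        (alpha * keyword_score(query, d) + (1 - alpha) * semantic_score(query, d), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)[:top_k]]

docs = [
    "romantic candlelit dinner with shareable plates",
    "fast casual lunch counter",
    "date night tasting menu, spicy options",
]
top = hybrid_search("romantic date night spicy", docs)
```

The returned shortlist is exactly what you would then hand to a dedicated re-ranker to optimize final relevancy, as described for the dish index.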
Iterative Generation and Quality Assurance Through Evaluation Checklists
Finally, we need to think about how to improve the underlying generation. The thing about vanilla RAG is that very often people will shoot a query at a database and return the LLM-generated results directly to the user, but there's a lot of opportunity lost there: we could spend even more time understanding whether the results are relevant to that specific user in that specific context. We'll talk briefly about some techniques that are helpful for this situation.
First is thinking about loops and looping your retrieval. So what else could the user need, and could we spend this time and exchange latency for some more information that would be relevant to the user? The second is identifying implied preferences. Users are notoriously bad at describing what they want in the first interaction you're going to have with them. So you need to build an architecture that allows us to understand what information we need to follow up with in order to retrieve the appropriate information, and what of that information is actually useful for us in the context of the data sources we're working with.
Finally is using structured generation. In other words, we should be constraining the things that we are sending back to the user, presenting them in information that is helpful for us to understand and analyze, so that it's comparable when the user gets the recommendation. What does that look like in practice? For example, one of my favorite design patterns for implementing evaluation in agentic retrieval is checklists. So you might have an initial query, such as the one we've been working with where you have to eat somewhere, you have a bunch of restrictions, and you can shoot that query at your database and orchestrate it with your agent to retrieve all sorts of context, such as your restaurants, your dishes, and your reviews.
And you can pass that to the agent with all of that context being there and say, hey, why don't you generate a set of recommendations, but at the same time, in parallel, what you could do is dynamically create an evaluation checklist of the criteria that you need to satisfy in order to provide a satisfactory set of recommended restaurants and dishes back to the user as a sort of manual double checking. Because there's a lot of querying that happens in between the initial query and the retrieved context, you're going to want something that persists beyond that, that evaluates against what the agent has already generated in order to increase the quality of the results.
For example, it's extremely important that we adhere to the allergens that are described inside this query. So it makes sense for us to extract that as an item in our checklist and to evaluate the returned results at the end of this query workflow against the agent to ensure that those tests pass, or we want to make sure that the restaurant has a date night atmosphere, so on and so forth. Having the opportunity to implement this additional check at the end is really useful for us to guarantee the quality of the results that we end up passing back to the recommendation system.
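The checklist pattern above boils down to two steps: derive verifiable criteria from the query up front, then test every candidate recommendation against them after retrieval. Here is a minimal sketch of that idea; the criterion names and data fields are illustrative, and in practice an LLM would generate the checklist dynamically rather than via keyword matching.

```python
def build_checklist(query: str) -> list:
    """Derive (name, check) pairs from the query; each check takes a candidate."""
    checklist = []
    if "avoid dairy" in query:
        checklist.append(("no dairy", lambda dish: not dish["contains_dairy"]))
    if "date night" in query:
        checklist.append(
            ("romantic venue", lambda dish: dish["restaurant_mood"] == "romantic")
        )
    return checklist

def evaluate(candidates: list, checklist: list) -> list:
    """Keep only candidates that pass every checklist item."""
    passing = []
    for dish in candidates:
        failures = [name for name, check in checklist if not check(dish)]
        if not failures:
            passing.append(dish)
    return passing

query = "we avoid dairy and it's date night"
candidates = [
    {"name": "cheese board",  "contains_dairy": True,  "restaurant_mood": "romantic"},
    {"name": "spicy noodles", "contains_dairy": False, "restaurant_mood": "romantic"},
]
passing = evaluate(candidates, build_checklist(query))
```

Because the checklist is built from the original query rather than any intermediate retrieval step, it persists across the whole workflow and catches constraint violations that creep in along the way.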
And we're comfortable sacrificing some of the latency costs here because of the increase in relevancy at the end of this workflow. So the main takeaway behind this architecture is that we're able to use agentic retrieval in areas where traditional RAG would fail, with three specific benefits.
The first is query understanding, which allows for deep planning and reasoning capabilities to decompose complex queries and serve users that have really complicated preferences. The second is optimized retrieval, where agentic RAG gives us dynamic access to all of the data sources, which have been made ergonomic for the agent to use, provided that they're accessible, and this allows us to optimize the context that is then retrieved for the generation step. Finally, we implement iterative generation, which allows for looping and increases the perceived quality of the generation results.
We're comfortable sacrificing some of the latency costs here in order to increase the quality of the results, because we care so much about making sure that the couple goes on a great first date. If you want to learn how to build agentic RAG and you're tired of the example that I'm going through, go ahead and scan this QR code to learn how other Pinecone customers have implemented agentic RAG, such as Delphi, Aquant, Terminal X, and CustomGPT. It will take you straight to the case study section on our website, where you can learn how real customers have implemented real agentic RAG architectures.
Thanks so much for coming to my talk. If you have any more questions for me about how to do this, please meet me at the back. I'll be happy to speak with you. My team is at booth 534 in that corner over there. If you're interested in having a more in-depth conversation, we'd be happy to have you. Thanks so much, everybody.
; This article is entirely auto-generated using Amazon Bedrock.