Kazuya

AWS re:Invent 2025 - How Spice AI operationalizes data lakes for AI using Amazon S3 (STG364)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - How Spice AI operationalizes data lakes for AI using Amazon S3 (STG364)

In this video, John Mallory from AWS and Luke Kim from Spice AI demonstrate how to operationalize data lakes for AI workloads using Amazon S3 Tables and S3 Vectors. They address key challenges like data fragmentation, retrieval accuracy, and integration complexity. Luke showcases a live demo using Apache Answer, a Stack Overflow-like application, where Spice AI ingests streaming data via Kafka, performs BM25 full-text search and vector search using S3 Vectors, and feeds results into Amazon Nova for AI analysis—all configured with minimal YAML code. The demo illustrates hybrid search combining text and vector search across 250,000 historical questions, delivering relevant results in seconds. Spice AI provides an AI-native interface that handles indexing, caching, sharding, and federation across multiple data sources, enabling developers to leverage S3 primitives without managing complex distributed systems problems.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Building AI-Ready Data Lakes: The Challenge of Integrating S3 Tables and S3 Vectors

Thanks everyone for joining us. I'm John Mallory, a storage go-to-market specialist focused on storage for AI here at AWS. Hi everyone, I'm Luke Kim, founder and CEO of Spice AI. We were day one launch partners for Amazon S3 Vectors, and I'm excited to partner with AWS on helping turn your data lake into an AI-ready platform.

Thumbnail 30

Thumbnail 50

The reason partners like Spice are so important for operationalizing data lakes for AI becomes clear when you think about how data supports an agent. Take, say, an HR agent that onboards a new employee: that workflow probably involves multiple agents, and they're going to need multiple data sources working together. You're going to need structured data from the HR system, and object data from your object store for the documents you'll give to the person you're onboarding. Ultimately, the agents themselves are going to need semantic memory for long-term storage and performant memory for short-term storage.

Thumbnail 110

It's not just a matter of building a data lake and having everything you need; you have to stitch lots of pieces together. The more of these capabilities you can bring into your data lake, the better and easier it's going to be to operationalize. Some of the key challenges to adopting AI are data fragmentation and silos. Per the example I just touched on, you probably have data that today lives in many different systems, each purpose-built for its job. Along with that, you've got to think about privacy and security, and how to extend governance over multiple systems.

Thumbnail 160

Thumbnail 170

Cost-effective scaling is key, as is retrieval accuracy if you're building agent workflows. Typically you're grounding your foundation model with RAG, as well as providing semantic meaning and long-term context to the agent itself, which leads to integration complexity. Ultimately, you have to observe all of this and include log data in the mix. Today, S3 is the foundation for a lot of this, and we continue to build it out for the AI era. We released S3 Tables, which bring Iceberg tabular data into managed tables in S3, so workloads that might have sat in databases or data warehouses can move directly into the data lake and be queried there.

Thumbnail 190

Thumbnail 210

We also just launched Amazon S3 Vectors as Luke mentioned, because ultimately vectors really are the language of AI, both for things like RAG, data content characterization, and search at large scale, as well as long-term memory for agents. Spice AI has been a key partner for us in tying all of this together. They integrate with S3 Vectors, S3 Tables, and data stored in S3, improve performance, and ultimately help integrate across all of these different data sources. With that, I'm going to turn it over to Luke.

Thumbnail 240

Thanks John. S3 has really been the foundation for organizations over the last ten years. We've talked to a whole bunch of customers, including Twilio and Barracuda Networks, and we hear from them that they still want the scale, durability, and cost efficiency of S3. But there are additional challenges in making these workloads work for this next era of AI. What that requires, essentially, is handling a whole bunch of additional distributed systems problems to actually leverage these primitives, which are very scalable.

Thumbnail 290

It pushes the onus up into your application layer to be able to turn all these primitives into a large-scale data and AI platform. That's where Spice can help. Spice will provide an AI-native interface across all of these primitives and handle all these problems like ingestion of data—how you get data into S3, into S3 Tables, into S3 Vectors. It handles things like caching and indexing and sharding, and it can also, as John mentioned, federate data across any other sources.

Thumbnail 330

You want to leverage all of these great products, but how do you do that quickly, with faster time to market, and get that value back into your application? If we zoom in on just one of the problems you have to deal with: if you want to use S3 Vectors, you might want something like a daily index. Now you have to start managing indexes, ingesting a whole bunch of data, partitioning it, and putting it into indexes. You have to manage metadata and push that into S3 Vectors for filtering. If you do this at large scale, you now have to do index striping, cross-index query, and so forth. Normally all of that gets pushed up into your application layer, which you now have to deal with.

Thumbnail 370

Live Demo: Transforming Apache Answer with Spice AI's Hybrid Search and Real-Time Analytics

Spice handles those problems for you and enables you to use these products in a very short amount of time. I'll show you in a demo, in about 15 minutes, how you can get all of this into S3 Vectors, queried out, and into your application with almost no code at all. In the demo we have an application called Apache Answer, a Stack Overflow-like application. We're going to be ingesting questions and answers as they come through Kafka streaming, and Spice is going to ingest them, index them for full-text search, vectorize them for vector search, and cache and store them partitioned within S3 Tables and S3 Vectors, including the metadata.
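
As a rough sketch of what that ingestion step could look like in a Spicepod, something along these lines would point Spice at the stream; the connector syntax, topic, and dataset names here are assumptions for illustration, not the demo's actual configuration:

```yaml
# Hypothetical sketch of the Kafka ingestion described above.
# The topic name "answer-questions" and dataset name are assumptions.
datasets:
  - from: kafka:answer-questions   # stream questions/answers as they arrive
    name: questions
    acceleration:
      enabled: true                # keep a local working set for fast queries
```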

Thumbnail 430

Thumbnail 440

This also enables you to do real-time and historical queries across not just structured data in S3 and S3 Tables, but also unstructured data and other data systems like Aurora. It's going to use the Titan embedding model for the embeddings, but Spice could also load and serve models itself, and get all of that into AI so you can do higher-level analysis on it. I'll swap across here to my application. This is Apache Answer, a straightforward question-and-answer application. What we want to do here is provide an essentially better search. If I just do a search query here for "MySQL connection error," you can see that the search is slow. This is the native search within the application, just using Postgres in the backend. One, it's slow. Two, you'll see that sometimes it doesn't actually work and will give you irrelevant results.

Thumbnail 500

This query is so large that sometimes it won't even return results, but normally it'll give you 20,000 irrelevant results. So how can we actually make this better? This is Spice Cloud, a hosted, managed version of Spice. But Spice is also open source, so you can host it yourself anywhere from a Raspberry Pi all the way up to a cluster, in any deployment environment. The first thing Spice gives you is very fast queries. What we're doing here is starting a script that loads questions and answers into the application so that we have real-time streaming data. That streaming data is coming into the Spice platform, and we can just run a query here and watch the numbers go up.
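
The query shown on stage isn't spelled out, but something as simple as the following would show the count climbing as the stream lands; the table name "questions" is an assumption for illustration:

```sql
-- Watch the streaming ingest land; "questions" is an assumed table name.
SELECT COUNT(*) AS total_questions
FROM questions;
```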

Thumbnail 510

Thumbnail 540

Thumbnail 550

We also have a whole bunch of historical questions here, 250,000, in S3 Tables. Spice offers incredibly fast queries across this data, so you can query across any of these federated sources. But what you really want to do is make the search better. Spice offers very easy-to-use functions. Here I've got this text search function, and as we're ingesting the data in, we're indexing it for BM25 full-text search. With just a couple of lines of SQL, I can now do BM25 full-text search over that same dataset as it's being ingested. But what I really want is to provide a better quality search, and that means a hybrid search that uses both text search and vector search.
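
A sketch of what that couple of lines might look like, assuming a table-valued text search function as named in the talk; the function name and signature are assumptions, not verified Spice syntax:

```sql
-- Hedged sketch of the BM25 full-text search step; text_search and its
-- signature are assumptions reconstructed from the talk.
SELECT id, title, score
FROM text_search(questions, 'MySQL connection error')
ORDER BY score DESC
LIMIT 20;
```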

Thumbnail 590

Thumbnail 600

This vector search function has all the work done in the backend for you to leverage S3 Vectors. When I call this vector search function, we've already put all the data into S3 Vectors and indexed it for you, and we provide the search across multiple indexes back into your application. Then we combine the results of my full-text search and my vector search into one re-ranked list, to give me a better semantic search across my application. Here I'm going to run the query, and if the Wi-Fi holds up, we'll get a search answer back. Now I've got a better result set for my search query. I want to show you what's going on behind the scenes, so I'll run an explain on this, which shows what the database is actually doing in the background. You'll see there's all this work being done in the database; it's kind of hard to see here, but each one of these nodes is a query into S3 Vectors on your behalf.

The reason there are many boxes here is that Spice has done all the work to spread this data across multiple indexes in the background, and it's able to give you, in just one very concise query, full petabyte-scale search across S3 Vectors. It combines that with full-text search and returns a re-ranked list to your application, and you didn't have to do any of that work yourself.
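
The talk doesn't show the exact fusion logic, but one common way to merge the two ranked lists is reciprocal rank fusion. Here's a hedged sketch of such a hybrid query, again assuming hypothetical text_search and vector_search table functions:

```sql
-- Hybrid search sketch: fuse BM25 and vector results with reciprocal
-- rank fusion (RRF, k = 60). Function names and signatures are assumptions.
WITH fts AS (
  SELECT id, RANK() OVER (ORDER BY score DESC) AS r
  FROM text_search(questions, 'MySQL connection error')
),
vec AS (
  SELECT id, RANK() OVER (ORDER BY score DESC) AS r
  FROM vector_search(questions, 'MySQL connection error')
)
SELECT COALESCE(fts.id, vec.id) AS id,
       COALESCE(1.0 / (60 + fts.r), 0)
     + COALESCE(1.0 / (60 + vec.r), 0) AS fused_score
FROM fts
FULL OUTER JOIN vec ON fts.id = vec.id
ORDER BY fused_score DESC
LIMIT 20;
```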

Thumbnail 660

But what we really want to do, even more than that, is operationalize this for AI. Now I can query across this large dataset, run a search, take the results, and feed them into AI. We can do that with Spice. We have an AI function here, so very simply, with just a small function, we can give it a prompt to extract three technology keywords from the answers. We're taking all of the search results and piping them into Nova.
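
A sketch of that step, assuming an ai() SQL function wired to Nova and the search function from earlier; the function name, model wiring, and prompt are illustrative assumptions:

```sql
-- Feed search results into a model for keyword extraction.
-- ai(), text_search(), and the prompt are assumptions based on the talk.
WITH hits AS (
  SELECT body
  FROM text_search(questions, 'MySQL connection error')
  LIMIT 20
)
SELECT ai(
  'Extract three technology keywords from these answers: '
  || string_agg(body, ' ')
) AS keywords
FROM hits;
```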

Thumbnail 690

I'll run that, and you'll see that it's now doing that search across that large corpus of data with re-ranked hybrid search, and it's feeding it into Nova, which will give me back my three technology keywords. So it looks like Nova might—oh, there we go. So it took a little bit longer because of the Wi-Fi here, but you can see here I've got these keywords. You can imagine what you could do here. This could be a security analysis, it could be a fraud analysis, it could be sentiment analysis on any of these search results you got across this very large dataset, and all of this is just a little bit of SQL here.

Thumbnail 740

The great thing about SQL is that language models can write it. So you can imagine using text-to-SQL here to run natural language queries and get these results back, leveraging the full power of S3 and S3 Vectors in the background. That was a very fast demo with a lot in it, so let me come back and revisit what we did. We ingested a whole bunch of questions and answers into the application, streamed through Kafka. We ingested it, indexed it, vectorized it, and pushed it down into multiple different indexes in S3 Vectors and S3 Tables.

Thumbnail 760

Thumbnail 780

Thumbnail 790

We queried it back out again, and then we fed it into an AI model for higher-level analysis, all very fast. If we come back to my application, we can enable Spice Search on it. We'll return to that same query, which didn't actually work to start with on the native search. Now it uses Spice Search in the background, and it gives me a much higher quality search result that's much more relevant to my use case. When the native search worked at all, it showed something like thirty thousand irrelevant results; this shows twenty relevant results.

Thumbnail 800

So how is all this wired up in the background? Again, there was very little code here. The way you configure Spice is through YAML. To wire up that entire workflow, all we had to do was say: here's my S3 Tables table in Glue. I want to enable it with Amazon S3 Vectors using the S3 Vectors engine. I want to partition it by year, though it could be per day or per month. And I want queries to be super fast, so we can accelerate it and do tiered caching with DynamoDB or other embedded databases.

Thumbnail 840

I want filterable metadata that I push down into S3 Vectors so I can make the search super fast. I want to combine it with BM25 full-text search. Finally, I want to connect it all to my set of models here; we support a whole bunch of other models too, as well as loading and serving your own. In just a few lines of YAML, with no application code at all, I was able to ingest a whole bunch of data in real time, query across the historical dataset, index it, search it, and feed it into a model in a very short amount of time.
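
Putting that together, the wiring described above might look roughly like this Spicepod sketch. Every key under vectors, full_text_search, and the embedding and model entries is an assumption reconstructed from the talk, not verified Spice configuration:

```yaml
# Illustrative sketch of the wiring described above; key names and model
# identifiers are assumptions, not the demo's exact configuration.
version: v1
kind: Spicepod
name: answer-search
datasets:
  - from: glue:answer.questions      # the S3 Tables table registered in Glue
    name: questions
    vectors:
      engine: s3_vectors             # push embeddings into Amazon S3 Vectors
      partition_by: year             # could equally be per month or per day
      metadata: [tags, created_at]   # filterable metadata pushed down
    full_text_search:
      enabled: true                  # BM25 index built as data is ingested
    acceleration:
      enabled: true                  # tiered caching for fast queries
embeddings:
  - from: bedrock:amazon.titan-embed-text-v2:0
    name: titan
models:
  - from: bedrock:amazon.nova-lite-v1:0
    name: nova
```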

Thumbnail 890

That's Spice. It's an AI-native database that lets you turn your S3 and S3 Vectors data lake into an AI-native platform very easily, providing federated SQL, security, accelerated data, hybrid search, and inference across all of these great primitives that Amazon has built. We'd love for you to give us a star on GitHub; it's all open source, and you can deploy and run it yourself. We also have some t-shirts and things at the back that we'd love to give you if you come and see us after the talk. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
