🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Build gpu-boosted, auto-optimized, billion-scale VectorDBs in hours (ANT213)
In this video, Dylan Tong and Vamshi Vijay Nakkirtha from Amazon OpenSearch introduce two major launches for building billion-scale vector databases: GPU vector acceleration and auto-optimize. They explain how OpenSearch has evolved since 2020, integrating FAISS and becoming the default for Amazon Bedrock Knowledge Bases. The presenters demonstrate that GPU acceleration reduces index building time from days to hours, achieving 6-14x speed improvements and 6-12x cost savings. They showcase real customer examples like Amazon Brand Protection's 68 billion vector deployment and DevRev AI's agentic applications. The auto-optimize feature eliminates the need for manual hyperparameter tuning by automatically analyzing datasets and providing optimized index configurations within an hour, reducing memory footprint by up to 75% at scale. A live demo illustrates the simplified workflow from S3 data ingestion to index creation with just a few clicks.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to OpenSearch and Vector Search at re:Invent
Well, hello everybody. It's day four of re:Invent and it's lunchtime, so I'm so honored to have such a dedicated audience. There's a lot to learn today. We're also joined by an audience out in Wynn as well as the Venetian, so hello from Mandalay Bay. I'm Dylan Tong. I'm from OpenSearch Product. I'm responsible for AI and vector workloads. I'm joined by my colleague.
Hello everyone, this is Vamshi Vijay Nakkirtha, a senior software manager working on search at Amazon OpenSearch Service. Hope you guys are having an amazing re:Invent. So this session is about building billion-scale vector databases. We're going to be introducing two re:Invent launches for OpenSearch: GPU vector acceleration and auto-optimize.
So with that said, can I have a quick show of hands: who here is currently using OpenSearch? All right, so a few of you. Any of you have experience with vector search? A couple, okay. So we're going to spend a brief moment to get everybody up to speed, maybe a refresher for some of you, but we're going to keep it succinct, okay?
So quickly, OpenSearch: it's a search and analytics engine. It's Apache open source and part of the Linux Foundation. On AWS there are a couple of flavors. One is managed clusters, where you have full control of selecting your instances; we provide automation to provision those clusters as well as operational functionality. And then there's a serverless version, with auto-scaling and such. There are also peripheral services: OpenSearch Ingestion to process and load your data into OpenSearch, and OpenSearch Dashboards for visualizations.
People use OpenSearch for a lot of things, but there are really two main use cases. One half uses it for log analytics and observability, and the other half is search, which is the focus of this presentation today. It could be keyword search, vector search, or a hybrid blend of the two. Don't worry if you're not familiar with vector search, we'll get you up to speed, but for now just know why people use vector search. Foremost, it's simply to improve search quality; vector and hybrid variants are the state of the art. It's also very versatile, right? It supports text-based search, but you can also search audio and images, and across modalities.
OpenSearch's Evolution in Vector Search: From 2020 to 2025
There are also diverse applications, the obvious one being semantic search, but you can use vector search for recommendations, personalization, and anomaly detection as well. We'll talk more about this later, but it's also a key ingredient in agentic applications. A lot of people are surprised about our long history in vector search. Did you know we began our journey back in February 2020? It actually predates the birth of OpenSearch. We contributed something called the k-NN (k-Nearest Neighbor) plugin to Open Distro for Elasticsearch.
It wasn't until 2021 that OpenSearch 1.0 was released. This was a fork of Elasticsearch 7.10, built on Apache Lucene, so we inherited a wide breadth of advanced search capabilities from that foundation. We already had a growing number of vector search users, some at very large scale, for example, teams in Amazon pushing the envelope: retail, Amazon Music, Amazon Rekognition. Use cases back then were around image search and a lot of personalization, but we wanted to uplevel our scalability with 1.0, so we integrated FAISS, Meta's vector search library well known for performance and scale.
I think we all remember 2023, right? That was the year of Gen AI, and with that came the buzz around vector databases; people said vector database instead of k-NN. So we brought our vector search capabilities to OpenSearch Serverless. We became the default for Amazon Bedrock Knowledge Bases, where people can build on Bedrock foundation models, agents, and RAG, Retrieval Augmented Generation. We created hybrid search, so out of the box you can blend traditional lexical keyword search with vector search, which, it turns out, improves search quality. We started thinking beyond vectors as well: how can we create an AI-native search engine? So we built connectors, integrations to third-party services like OpenAI and Cohere as well as Amazon SageMaker and Bedrock, so that we can automate the generation of vectors.
Throughout the year, the deployments started getting bigger, so it became a priority to help our customers reduce costs. Cost optimization was top of mind: we built disk optimizations, tiered storage, and automatic quantization, which is basically compression. And we continued our journey around ease of use and AI-native functionality, so not just vectors but general AI enrichments like language detection, translation, and other automated metadata extraction.
So today in 2025, naturally cost optimization continues to be a priority. On the AI-native side, we built agentic and AI search flows into our engine, and we have MCP integration. But there was a new theme. Deployments are getting larger and larger, so the theme we doubled down on is empowering scale, and that's the focus of our conversation. These two new features address customers coming to us saying: we're growing, we're building bigger vector databases, we like OpenSearch, but please make it faster, easier, and cheaper to scale our vector database.
Billion-Scale Customer Use Cases: From Brand Protection to Agentic Applications
So before we talk about these two features, let's take a look at a couple of customer examples and that trend towards billion scale. As mentioned before, we already had large-scale customers early on in 2020. An example of that is Amazon Brand Protection. This is the unit within Amazon that protects our retail customers and partners from abuse, and they have a large number of automated systems in the background to detect things like IP infringement and anomalies throughout the entire product catalog.
So they take the entire catalog and encode it into 68 billion vectors, and they're running an automation system on OpenSearch searching for anomalies and potential IP infringement. This, among other automation capabilities, enables Amazon to detect 99% of abuse cases before they are even reported. So in the early days, a lot of the scale was driven by the likes of Amazon, but today it's prevalent. Part of that trend, as you've probably heard if you've been to the AI talks, is agentic applications. More and more customers are building agentic features into their apps, and a good example of that is a startup called DevRev AI.
We've been working with them for about a year now. So DevRev, they have this product they call AgentOS, and just think about it as your AI teammates. The challenge or the problems they're helping their customers to solve is an age-old one, nothing special. It's about everybody having siloed enterprise data sources and tools, and they're looking to bring it together. But they're rethinking it, they're reimagining how we can better solve the problem with AI agents to connect, unify the data sources, and drive search and automation.
They're driving it specifically across product support, sales, and product management, and they built it on OpenSearch. Today they have hundreds of millions of vectors, with a million vectors changing on a daily basis, so it's growing really rapidly. They've already achieved great successes with their customers in support: 85% of tickets resolved without human interaction and a 30% cost reduction in customer support, saving their customers a lot of time.
Let's dive a little bit deeper into the components of agentic systems. So imagine we have a travel system. Agentic features give us a conversational interface that you can perhaps request, help me plan my trip to re:Invent 2025. That request gets processed by an AI agent, typically a large language model. It creates an execution plan, usually it will then reach out to knowledge bases and application history for context, and then compile some automation, perhaps putting together a travel itinerary in this case.
Well, you know how the saying goes: garbage in, garbage out. So when it reaches out to the knowledge bases, let's say in this case hotel venues and flights, if it's not able to retrieve high-quality search results, you're going to get poor responses and automation.
Vector search is the state of the art. Let's take a look at an example to understand why that's the case. So here I'm running a search on a wiki dataset. I'm searching for wild west, and you see on the left-hand side keyword search is returning results like the West Virginia basketball team because it's keying in on the term West. On the right-hand side, it's semantic search using a vector-based implementation, and I'm getting results like cowboys and rodeos because vector search is matching on semantic similarity and I'm getting much better results.
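To make that comparison concrete, here is a minimal sketch (not from the session) of what the two request styles look like against an OpenSearch index, assuming a hypothetical `wiki` index with a `title` text field and a `title_vector` knn_vector field; the `embed()` helper is a stand-in for whatever embedding model you use.

```python
import numpy as np
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def embed(text: str) -> list[float]:
    # Stand-in for the embedding model call; in practice this invokes the
    # same model that embedded the corpus (Bedrock, SageMaker, OpenAI, ...).
    return np.random.rand(768).tolist()

# Lexical keyword search: matches on the literal terms "wild" and "west".
lexical = client.search(index="wiki", body={
    "size": 5,
    "query": {"match": {"title": "wild west"}},
})

# Vector (semantic) search: embed the query text, then ask the k-NN index
# for the k nearest neighbors of that query vector.
semantic = client.search(index="wiki", body={
    "size": 5,
    "query": {"knn": {"title_vector": {"vector": embed("wild west"), "k": 5}}},
})
```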
Understanding Vector Search Fundamentals and OpenSearch Architecture
So if we're going to build high-quality agentic features, we need to build these vector databases or run vector search across our vast enterprise data sources, and that requires scale. But easier said than done, there are a lot of challenges, as some of you may know, to scaling your vector database. Let's get everyone up to speed on the basics. At the highest level, you start with your content. You have an embedding model that's specially designed to encode that content into vectors, and then OpenSearch provides APIs for you to ingest and build an index on those vectors. That enables you to then run search queries like similarity search queries on that content.
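As a rough illustration of that pipeline, the sketch below encodes one document with an embedding model and ingests the vector into OpenSearch. The Bedrock model ID, index name, and field names are assumptions for the example, not details from the talk.

```python
import json
import boto3
from opensearchpy import OpenSearch

# 1) Encode content into a vector with an embedding model. Here we assume
#    Amazon Titan Text Embeddings on Bedrock; any embedding model works the
#    same way at this level.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",  # assumed model ID
    body=json.dumps({"inputText": "Cowboys and rodeos of the wild west"}),
)
embedding = json.loads(resp["body"].read())["embedding"]

# 2) Ingest the document plus its vector into an index that already has a
#    knn_vector mapping (see the index-creation sketch later in the article).
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
client.index(
    index="wiki",
    body={"title": "Cowboys and rodeos of the wild west", "title_vector": embedding},
)

# 3) Similarity-search queries then run against the indexed vectors.
```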
But what are those vectors exactly? They're the same vectors you learned about back in physics and linear algebra, right? The main difference is that back then it was X, Y, Z, typically three dimensions. These vectors are typically over 1,000 dimensions these days, but they're still just a long list of numerical values. What's interesting is that when these vectors are produced by these embedding models, the distance between two of them represents a degree of similarity. So imagine I had a music corpus and two music tracks encoded into vectors. If I measure, let's say, the cosine or the Euclidean distance, that's the degree of similarity between those two songs.
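For reference, a minimal sketch of those two distance measures on raw vectors (plain NumPy, nothing OpenSearch-specific, with random vectors standing in for real embeddings):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance between two embeddings; smaller means more similar.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity; values closer to 1.0 mean more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

song_a = np.random.rand(1024)  # stand-ins for embedding-model output
song_b = np.random.rand(1024)
print(euclidean_distance(song_a, song_b), cosine_similarity(song_a, song_b))
```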
If we take those vectors and we just superimpose them into two dimensions so that we can visualize it, it's going to look something like this scatter graph. And this arrow, pretend that's your favorite song. It's been encoded to a vector. Basically, the area around that vector, those are similar songs, the most similar songs, and perhaps the songs that you may be interested in. That's how vector search works. So you have things turned into vectors, you can measure similarity. How do you run search queries? There are a couple of ways.
The first one is brute-force exact search. So imagine you have a query vector, that's your favorite song as a vector, and then the rest of the music corpus, all the songs, are vectors as well. If you want to find the top K, maybe the top 100 or top 10 songs, known as the K-nearest neighbors, you literally compare against everything: take that query vector, compare it to each song, measure the distance, and figure out which ones are closest. Obviously, that takes a very long time when you scale to a billion. It's going to take minutes at least to run a query.
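A brute-force exact k-NN over an in-memory corpus is just this (a sketch, with random vectors standing in for real embeddings):

```python
import numpy as np

def exact_knn(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    # Compare the query against every vector in the corpus (O(N * d) work),
    # then return the indices of the k closest by Euclidean distance.
    distances = np.linalg.norm(corpus - query, axis=1)
    top_k = np.argpartition(distances, k)[:k]
    return top_k[np.argsort(distances[top_k])]

corpus = np.random.rand(100_000, 1024).astype(np.float32)  # stand-in "songs"
query = np.random.rand(1024).astype(np.float32)            # your favorite song
print(exact_knn(query, corpus, k=10))
# Even at 100K vectors the scan cost is noticeable; at a billion vectors it
# stretches to minutes per query, which is why approximate indexes exist.
```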
And that's why we have indexes. We have these algorithms, a popular one being Hierarchical Navigable Small Worlds, HNSW, along with other algorithms, where you preprocess these vectors and build the graph-like structures you see here. Once you have this index, you can do real-time queries: I can now run that top-K query in milliseconds. It traverses the graph and gets the results for you. The trade-off is mainly that it takes a lot of heavy-duty processing to build that index upfront, and it's an approximation. But the approximation relative to exact K-NN is generally very close, right? You can get 0.99 recall or higher.
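Here is a minimal sketch of asking OpenSearch to build such an index, using the k-NN plugin's `knn_vector` field type with the FAISS HNSW method; the parameter values and index name are illustrative, not recommendations from the talk.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="songs",
    body={
        "settings": {
            "index": {
                "knn": True,            # enable the k-NN plugin for this index
                "number_of_shards": 4,  # partitions the corpus across data nodes
            }
        },
        "mappings": {
            "properties": {
                "song_vector": {
                    "type": "knn_vector",
                    "dimension": 1024,
                    "method": {
                        "name": "hnsw",      # graph-based ANN algorithm
                        "engine": "faiss",   # Meta's FAISS library
                        "space_type": "l2",  # Euclidean distance
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                }
            }
        },
    },
)
```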
OK, so what does OpenSearch do exactly with the indexes? So it does build these indexes, but it does more to scale. So OpenSearch, the way it scales is it basically creates a search cluster. It runs across a whole bunch of servers, a whole bunch of instances. So think about a billion vector corpus. It's distributing it, partitioning it across a whole bunch of servers, and we build these indexes and host them, provide security and management across all these data partitions and these indexes. When you run a search query, what's happening is in parallel it's traversing all these indexes across all these servers, and that's how it's possible to deliver millisecond search queries on a billion-scale vector dataset.
So that's managed clusters on the left-hand side, serverless on the right. The architecture you can see is more fancy, but the concept in terms of how we're scaling vector search is the same, right, across servers.
The only difference here is what you're seeing is that the search and the indexing workloads are separated, and storage and compute are separated. This is how it enables more auto-scaling capabilities, but it's the same type of scaling.
The Challenge of Building and Maintaining Large-Scale Vector Indexes
So great, we can scale. We've always been able to scale, but like I said, these indexes require a lot of memory and a lot of compute, more so than traditional database indexes or traditional search indexes. I'm curious, especially for the folks who don't have experience: what would you expect? If I were to build a 1-billion-scale index, who thinks it shouldn't take more than half a day? Okay, I probably gave it away, but yes, it takes more than half a day.
Generally in the wild there are lots of variables, but it typically takes days. So that's tough in terms of productivity and your innovation velocity. And the catch is it's not just a one-time thing. It's not like you build the index, it takes a couple of days, and you're done. There's a lifecycle to it.
These indexes are special. You have to rebuild them more often than a traditional database index. First of all, HNSW does degrade when the data changes. So you have an agentic app, it's very dynamic, right? The vectors, the content, are changing all the time. Well, you're going to have to mitigate the search quality degradation and rebuild that index. If you change the embedding model, it also means you're going to have to regenerate the vectors and rebuild the index. When can that happen? Maybe you're switching between OpenAI, Anthropic, or Cohere: you're going to have to rebuild that index. These vendors produce multiple versions of their models every year, typically one to four times per year. If you want to benefit from the latest and greatest, you're going to have to update and rebuild your index.
But it could be even more frequent than that, especially for special use cases like personalized search. Taking an example from Amazon: Amazon Music is a streaming service, and if you've used it before, you may see, hey, because you listened to this song, we think you may be interested in these other songs. That is powered by vector search. They have a recommender-type model that generates vectors from their song catalog, all of their music tracks. That's over a billion vectors running on OpenSearch, and they use that to power those recommendations.
If you're familiar with recommendations and personalizations, it's important for you to continually retrain or fine-tune those models on your users' latest behaviors and profiles if you want to have the best recommendation. In their case, they need to do it on a daily basis. So there are these use cases where you have to rebuild the indexes on a daily, weekly, or monthly basis depending on your use case. What happens if that takes days? It's going to be tough making your SLA, or it's going to be a burden.
Also, for those on managed clusters, your search and your indexing workloads are running on the same infrastructure. You have these graphs that take a lot of RAM and a lot of compute, and at the same time you're running search. There's a tug of war going on for these resources, and there's an operational challenge to maintain good search latency. So there's also that challenge in terms of separation of workloads.
So these are the problems that our customers are facing with large-scale vector search systems. Building and maintaining them takes days, it's tough to maintain, and vector ingestion can impact search speed. So we need to help our customers maintain their innovation velocity and their productivity, and build responsive applications.
GPU Acceleration with NVIDIA cuVS: Performance Benchmarks and Results
So we knew that one potential solution could be GPUs, right? We all know GPUs from things like computer graphics in gaming and AI. What's going on in those cases is a lot of vector math, which is the same thing that's going on with vector search and building these indexes. GPUs are really good at that, so we knew there was a lot of potential. So we worked with NVIDIA.
NVIDIA has a library called cuVS, which contains a bunch of vector algorithms optimized for GPU. We worked with them to explore their graph algorithm called CAGRA. We wanted to build on CAGRA, but we also wanted to still host those indexes and run search in RAM so we could run other workloads, and we wanted it to run within FAISS. So we worked with them on enhancements in that area: build on CAGRA on the GPU, transfer to RAM, and serve within Meta's FAISS library.
With those enhancements, we ran the whole set of tests that you see here, with dataset sizes ranging from 1 million and 10 million to 100 million and 1 billion vectors. We start off by creating a baseline: just run it on CPU. We build the index and we run something called a merge command, a force merge, if you're familiar with it, and that's important. Basically you build the index with Lucene, and you need to do a merge afterwards to ensure the index has high search performance. So we include that in the test, and you can see here it takes a long time. Once you get to large scale, the 100 million to 1 billion cases, it takes over a day to perform that on a right-sized cluster.
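That merge step is the standard Lucene force merge; a sketch of issuing it from the Python client (index name assumed):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# After bulk indexing, collapse the index down to a single segment so each
# shard serves queries from one consolidated graph. This is the expensive,
# compute-heavy step that the benchmark includes in its build time.
client.indices.forcemerge(index="songs", max_num_segments=1)
```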
What happens when we add GPU instances into the mix? Huge difference. We see a speedup ranging from 6 to 14 times. And then there's the cost of building that index. Even though we have more infrastructure, not only CPU infrastructure but additional GPU infrastructure, the build completes so much faster and so much more efficiently that the cost of that index-building task is much lower. You can see cost savings from 6 to 12 times, which is substantial.
The cost savings are a little easier to understand from a serverless standpoint. Remember I said that the search and the indexing workloads are separated on serverless. What that means is that when you get your bill from serverless, you actually see a line item that says these are your indexing costs and this is your search cost. And along with that cost, it's going to say you used this many OCU hours, that's OpenSearch Compute Units. So with the same test, let's look at the 113 million vector case at 1,024 dimensions. It says you used 2,721 OCU hours. That's a lot, basically a lot just to build that big index.
Well, when we add GPU, and I converted the GPU cost to OCUs so it's easy to understand, again we're adding GPU infrastructure, but it's reducing the CPU utilization, and again the difference is tremendous. If we look at the 113 million test case, we brought that down to about 304.5 OCU hours to accomplish that task. That's an 8.9 times cost reduction. To put that in hard numbers, in North Virginia the cost for one OCU hour is 24 cents. If we do the math, we're going from roughly $653 to $73. That's huge. Now it's manageable. If you're at that scale and you have to rebuild the indexes, now the cost is manageable.
The other thing I want to look at is simulating a dynamic application. Imagine we have an agentic app with a whole bunch of concurrent users searching for things, but at the same time they're also updating and inserting content, so we're indexing at the same time. The search speed can be impacted, and this test demonstrates that. We're running a mixed workload, and one axis is the number of indexing clients; just think of that as ramping up the number of writers, the amount of updates and indexing operations on the system. The line shows that, as expected, the CPU utilization on the cluster increases. And the bar graph shows that search latency increases as a result, because there's pressure and competition for CPU and RAM between the reads and the writes.
When we add GPU and offload work from the CPU, we bring down the CPU utilization, and the search latency becomes much more manageable. Now it's a much better user experience.
Serverless GPU Integration: Pay-Per-Use Architecture for Vector Indexing
We're back to good latency. So we proved it out: there's a lot of potential for GPU. But there's one more challenge: exactly how are we going to integrate GPU into OpenSearch? The obvious thing is, okay, we just support GPU instances, right? That would be really easy, but if we did that, it would cause economic problems.
Let's take a look at an example. Let's say we did that, and look at a deployment for 1 billion vectors at 1,024 dimensions. We're going to apply aggressive cost optimizations: we use binary quantization, basically 32X compression, to make the cluster as small as possible on CPU. Even then it's going to be memory bound; generally we need more than a terabyte of RAM for good performance, so we use 3 of these r8g.12xlarge instances. Okay, so that's going to cost some amount.
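To give a feel for why such a deployment is memory bound, here is a back-of-the-envelope sketch using the sizing rule of thumb from the OpenSearch k-NN documentation (roughly 1.1 times the per-vector storage plus 8 bytes per HNSW link). The constants and the m value are assumptions, and real deployments also need headroom for replicas, non-vector fields, and the OS/JVM, which is how the aggregate requirement lands at a terabyte-plus of RAM across several nodes.

```python
def hnsw_memory_gb(num_vectors: int, dimension: int, m: int, bytes_per_dim: float) -> float:
    # Rule-of-thumb estimate (treat as an approximation, not a guarantee):
    # ~1.1 * (vector storage + 8 * m graph links) bytes per vector.
    bytes_per_vector = 1.1 * (dimension * bytes_per_dim + 8 * m)
    return num_vectors * bytes_per_vector / 1024**3

N, D, M = 1_000_000_000, 1024, 16
print(hnsw_memory_gb(N, D, M, bytes_per_dim=4.0))    # full fp32: roughly 4,300 GB
print(hnsw_memory_gb(N, D, M, bytes_per_dim=1 / 8))  # 32x binary quantization: roughly 260 GB
# Even with 32x compression, once you add replicas, non-vector data, and
# OS/JVM headroom, a memory-resident billion-vector index ends up needing
# on the order of a terabyte of RAM, hence several r8g.12xlarge (384 GiB) nodes.
```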
Now, if I wanted to bring in GPU, I'd have to use GPU instances. We use the g5.12xlarge because those are very cost effective, but they're not optimized for RAM capacity, so I need 6 of those to meet the RAM requirements. That's 2.4 times the cost. I've got GPU now, that's great, but it's a huge premium for the speed gains, and I'm not always indexing. So that's the problem.
So we knew that for this to be practical for most of our customers, we had to do things differently. Working backwards from the ideal scenario: we want to let you continue using your existing collections and domains running on your CPU instances, with the same APIs, without changing any of that. Can we bring the GPU to your CPU-based clusters only when you need it, so you only pay for value, only when you benefit from the GPU-accelerated indexing?
And we did just that. How exactly did we do it? Behind the scenes we built the system that you see here. Number one, that's you as a user. There's no change to how you use OpenSearch: you're indexing vectors the same way, using the index, bulk, and reindex APIs. But when you have it enabled and we detect that you have high write throughput, let's say you're writing 10,000 vectors per second, it trips a configurable threshold.
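For completeness, the client side really is ordinary bulk indexing; a sketch with the Python helpers, where the index and field names are assumptions carried over from the earlier sketches:

```python
import numpy as np
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Ordinary bulk writes; nothing GPU-specific in the client. When acceleration
# is enabled and sustained write throughput crosses the configured threshold,
# the service offloads the graph builds to GPUs behind the scenes.
actions = (
    {
        "_index": "songs",
        "_id": i,
        "_source": {"song_vector": np.random.rand(1024).tolist()},
    }
    for i in range(100_000)
)
helpers.bulk(client, actions)
```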
We then allocate GPU instances to your CPU cluster and offload those graph builds. Behind the scenes we manage a warm pool of GPU instances for you; you don't need to worry about that. The pool is multi-tenant, so you benefit from economies of scale, but we do single-tenant secure assignment, so you know it's only your workloads running on those instances. We automate the scaling for you, and when you're done with the GPU acceleration we return the instances to the pool. We do all of that behind the scenes so that you only pay for what you benefit from.
From your standpoint it's as simple as a light switch. You just enable it through the APIs, the CLI, or the console like you see here, and you pay for it as serverless. It doesn't matter if you're using managed clusters or OpenSearch Serverless; this acceleration is serverless GPU, priced at basically 24 cents per OCU hour in North Virginia, pretty much the same as the standard indexing OCUs. So that's it: GPU-accelerated vector indexing, building these indexes much faster. But there's actually more to it to create these production systems.
The Complexity of Index Configuration and Optimization
So with that, my colleague is going to talk about what else we need to do. Thanks, Dylan. So my colleague has covered the basics of a vector database: generating the vectors, ingesting the vectors, and building these indexes so we can enable real-time vector search for end customers.
But there is more to it. There's one more important aspect in this flow. Can anyone guess? It's the index configuration and optimizing that configuration. To build any vector index, you need an index configuration, and this configuration determines the search quality, cost, and speed of your indexes and your applications. Optimizing these indexes can cut down the total cost by a third, and this becomes more evident as the scale of the vector workload increases.
So let's take a look at the scale. On the x-axis, I have the number of vectors. On the y-axis, I have the memory footprint. As you see, from 10 million onwards, the memory footprint reduction is almost 75%. At 10 million, we see around 300 gigabytes of memory footprint savings. At a billion scale, we are able to see around 3 terabytes of memory footprint savings. So it's very important to optimize this index configuration.
What are these index configurations? Amazon OpenSearch Service has a rich set of index configurations and tunable parameters that let our customers make smart trade-offs between quality, cost, and speed. On the x-axis, we have the search quality, which we call recall. On the y-axis, we have the latency. As we trade off between quality and latency, you can see those purple bubbles: each of those is an index configuration, and each index configuration has tunable parameters.
At a high level, you can think of these tunable parameters in three categories. One category determines the type of algorithm: is it HNSW, as Dylan talked about, which is a graph-based algorithm, or IVF, which is a bucket-based algorithm? The second category is the compression technique: do you need binary quantization, scalar quantization, or product quantization? You can go from 2X all the way to 64X compression, which lets you trade quality for memory footprint. The third category is the mode of operating these graph data structures: do you want them in memory, on disk, or somewhere remote like S3 Vectors? That lets you trade off latency.
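As an illustration of how these three categories surface in a `knn_vector` field mapping, here is a sketch contrasting two configurations. The `mode` and `compression_level` mapping parameters exist in recent OpenSearch versions (2.17 and later); treat the exact parameter names and values as assumptions rather than recommendations from the session.

```python
# Category 1 (algorithm) and category 3 (mode): an in-memory FAISS HNSW field
# tuned for low latency and high recall.
low_latency_field = {
    "type": "knn_vector",
    "dimension": 768,
    "space_type": "innerproduct",
    "mode": "in_memory",
    "method": {
        "name": "hnsw",
        "engine": "faiss",
        "parameters": {"m": 16, "ef_construction": 128},
    },
}

# Category 2 (compression) and category 3 (mode): 32x binary quantization
# served from disk; a much smaller memory footprint at the cost of latency.
low_cost_field = {
    "type": "knn_vector",
    "dimension": 768,
    "space_type": "innerproduct",
    "mode": "on_disk",
    "compression_level": "32x",
}
```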
Our customers love these trade-offs; they help them get the best out of their hardware resources. However, it's not straightforward. It needs some expertise in k-NN vector search algorithms, and it also takes time. An expert would generally start by picking a configuration: the algorithm, the quantization technique, and the mode of operating the algorithm. Then you build the indexes, take your ground truth, and evaluate your quality, latency, and cost. If it doesn't meet your requirements, you adjust and repeat. So this can take time.
Let's take an example. Say I'm building an application which needs 95% accuracy. I pick 32X compression; binary quantization is very popular. I build my indexes, I run my quality check, and then I realize it's 90%, not 95%. Maybe I should reduce my compression level so I can preserve more vector precision and get better quality, but the trade-off is in memory: I might need to pay for more memory. So I go back, reduce my compression to 16X, rerun the builds, and validate my recall quality. Maybe this time it's 93%, still not meeting my requirement. I go back and make it 8X, and so on and so forth. I repeat this process of building the index and validating the configuration to check whether it meets my requirements.
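That evaluate-and-adjust loop is easy to script; here is a small sketch of the measurement step, computing recall@k by comparing the approximate index's results against exact brute-force ground truth (the ID lists here are toy data):

```python
def recall_at_k(approx_ids: list[list[int]], exact_ids: list[list[int]], k: int) -> float:
    # Fraction of the true top-k neighbors that the approximate index returned,
    # averaged over all evaluation queries.
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))

approx = [[1, 2, 3, 9], [4, 5, 6, 7]]  # per-query IDs from the HNSW/quantized index
exact = [[1, 2, 3, 4], [4, 5, 6, 8]]   # per-query ground truth from exact k-NN
print(recall_at_k(approx, exact, k=4))  # 0.75

# If the measured recall comes back at 0.90 against a 0.95 target, you relax
# the compression (e.g. 32x -> 16x -> 8x), rebuild, and measure again.
```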
One more important aspect: there is no one optimal configuration that works for all use cases. The index configuration is determined by, and subject to, the dataset and the business requirements. Let's take an example. Say I'm building a recommendation system for an e-commerce application. I need a high-performance vector database: the quality has to be high and the latency has to be in the single-digit or low double-digit milliseconds. I would need hyperparameters to match: the mode could be in memory, with an HNSW algorithm and its parameters tuned accordingly.
Similarly, I might have another application, an internal search application for my company. I need quality, but I don't need single-digit milliseconds; even 100 or 200 milliseconds is fine. So I'll choose my hyperparameter configuration accordingly, or maybe I want something in between. Arriving at these optimal configurations takes time.
Auto-Optimization Framework: Simplifying Vector Database Management
So how did we simplify this? Amazon OpenSearch Service wants to simplify the onboarding experience. Whether you may be building a POC or a large vector workload, we wanted to reduce the expertise needed to build and manage these indexes. So what we did is we built an auto-optimization framework and we let the framework learn all of these techniques, all of these algorithms, parameters, and the tuning configurations, so we could do the heavy lifting for you.
So now it gets simplified. All a customer needs to provide is their acceptable search quality and their acceptable search latency. Behind the scenes, Amazon OpenSearch Service will take your data and your business requirements, analyze your vectors, run a bunch of hyperparameter optimization jobs, and get you the recommendations. And the good thing is we are very transparent: we provide a detailed report for each recommendation. You can look at performance metrics like recall and memory footprint, and you can look at the algorithms, algorithm parameters, hyperparameters, and compression levels. So all the focus can now be on the recommendations, and you can pick the best one.
And the best part is our customers don't have to manage the infrastructure. Amazon OpenSearch Service has a serverless auto-optimization framework, built on a serverless fleet where we parallelize all of the hyperparameter optimization jobs, and we do it in under an hour: within one hour we run the optimizations and provide you the recommendations. So with this, one, we cut the expertise required; all you need to focus on is the business requirements. Two, we cut down the time; we give you recommendations in under an hour, so all the iterations you would need to do for a POC, we do behind the scenes for you. And three, we cut down the cost; you don't have to manage the infrastructure needed to run these experiments and get from POC to production.
Live Demo: Vector Ingestion Workflow with Auto-Optimization and GPU Acceleration
Let's see it in action. Once you land on the OpenSearch Service console, on the left-hand side under Ingestion, we've added a new entry, vector ingestion. For the purpose of the demo, I'm going to play the role of a search engineer who is trying to upgrade an application from lexical search to semantic search, that is, vector search. I have the workflow explaining the process here. Step one is preparing the dataset. I have all my documents; they could be text, images, audio, or video. We take all of the documents, pass them to the embedding models, generate the vectors, and place them in an S3 bucket.
In step two, I provide my business requirements: I need X search quality and Y latency. In step three, the recommendations become available. And in step four, I choose one of the recommendations and say build index, and we ingest and accelerate the build process with GPU acceleration. So let's do it, let's create an ingestion job.
So at step one, we have to configure the data source and destination. So my data source is an S3 bucket and it's in US East one. So within this, I have the data folder. So let me select my folder. And then I need to provide the destination details. So this could be your Amazon OpenSearch domain or Amazon OpenSearch collection.
I'm going to pick my domain and then configure the permissions. Then I'll go to the second step, where I configure my index. Here, I have my vector field named "train," my space type is inner product, and the dimension is 768. This configuration has to match what we have in the S3 bucket: when we generate the embeddings from the model, I store the vectors under the field name "train," so each record maps "train" to a 768-dimensional vector. The space type you can determine from the model when you run it; it tells you what the space type is.
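So a record in the S3 data folder would look roughly like the sketch below, one JSON document per line, with the 768-dimensional embedding stored under the field name "train"; the metadata fields and values are illustrative.

```python
import json

# One JSON-lines record as it might sit in the S3 data folder. The vector
# field name ("train") and dimension (768) must match the index configuration
# entered in step two; "text" and "date" are example metadata fields.
record = {
    "train": [0.0213, -0.0471, 0.0088] + [0.0] * 765,  # 768 floats from the embedding model
    "text": "Original document text that was embedded",
    "date": "2025-12-04",
}
print(json.dumps(record))
```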
And then this is the best part. Here I focus on the requirements. As I said, if I'm building an application which needs high performance, I would choose something like 0.95 recall and fast, which means single-digit to under-50-millisecond latencies; it's a high-performance application. Similarly, if I want to build something like a chat application, maybe 90% quality is good and the latency can be modest, so it's more like a balanced profile. Based on your application, you provide the requirements on this page. Then you review your configuration and click create.
Once you click that button, we send a signal to our backend system to run the hyperparameter optimization jobs and provide the recommendations. This takes between 30 minutes and 1 hour. For the purpose of the demo, I already have a job that has run, so let me show the recommendations from that job. Once the job is complete, you see a banner saying it's completed, and it shows which step of the workflow you are in. Right now, I am at the recommendation step.
So this is what my recommendations look like. I have three recommended configurations. It took 42 minutes to run this job, and the business requirements were 90% recall and fast. When we present our recommendations, we always sort them by cost. The first recommendation is a baseline for us; we're able to achieve 0.92 recall. With option two and option three, I can get more recall, but there is an increase in the memory footprint.
So this is the trade-off I need to make: to get 0.5% better quality, am I okay with 50% more memory footprint? Or maybe I need 3% better recall and I'm okay with 100% more memory, a 2X memory footprint. You can look at the performance metrics on this page. And let's say you want to understand the process, what algorithm options were picked behind each of these recommendations; you can also look at the detailed report. So you can click on the report.
You can look at the engine parameters, the mode, and the engine. Here we are picking in-memory because the business requirement was fast: to get single- or double-digit millisecond latencies, we have to host the vector data structures in memory. And then we are using the FAISS library, and the algorithm is HNSW. The report lists all the hyperparameters that were picked, the value of each parameter, and what each parameter means.
At a high level, HNSW has parameters like m, ef_construction, and ef_search, which control how many edges to establish between nodes and how many neighbors to consider while walking through the graph. The more neighbors and the more edges, the better the quality, but at the cost of memory footprint. These are all evaluations we do behind the scenes to come up with these parameters. And here we are using the binary quantization technique, 32X compression: essentially we take each dimension, which is a 32-bit float, and reduce it to a single bit.
Likewise, I can go to the other recommendations and look at their configurations: the engine parameters, algorithm parameters, and compression parameters. This one is using 16X compression, which is why the memory footprint has increased; the baseline uses 32X, but for the second option we recommend 16X compression because it gives you around 0.3% better recall.
So we look at all these options and recommendations, and for me and my application, I'm going with option one, because my accuracy requirement was 90%; this is well above my requirement and it is fast. And then you click on build index.
As part of building the index, you provide the index name; this is the index we are going to create on your destination. In this case, I have chosen an Amazon OpenSearch domain, so on that domain I'm going to create this index with the configuration from option one. Finally, all the configurations we see in the report boil down to a JSON configuration, and this is how it looks: it specifies FAISS, the HNSW construction parameters, the compression level, and the in-memory mode. So we finally get to the JSON that OpenSearch can understand.
And let's say that along with the vector field, you also have other fields, maybe some metadata like text or a date. You can add your fields here. Let's say I have a text field, so I'm going to add a text field; similarly a date field, and so on. You add your fields, and during ingestion we look for these fields on your documents in the S3 bucket. And then I can click build index.
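Pulling the demo's choices together, a hedged reconstruction of what that generated index body might look like is sketched below. This is not the exact console output: the HNSW parameter values are placeholders rather than the report's actual values, and the `mode` and `compression_level` mapping parameters assume a recent OpenSearch version.

```python
# Hedged reconstruction of the option-one index body: FAISS HNSW under the
# hood, inner product space, 768 dimensions, 32x binary compression,
# in-memory mode, plus the metadata fields added in the console.
option_one_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "train": {
                "type": "knn_vector",
                "dimension": 768,
                "space_type": "innerproduct",
                "mode": "in_memory",         # fast profile: graphs served from RAM
                "compression_level": "32x",  # binary quantization picked in option one
                # The service resolves this to a FAISS HNSW method with the
                # m / ef_construction values shown in the detailed report.
            },
            "text": {"type": "text"},
            "date": {"type": "date"},
        }
    },
}
# client.indices.create(index="my-vector-index", body=option_one_index_body)
```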
The most important thing here is the GPU acceleration. As my colleague talked about, building these graphs is a compute-intensive operation. One part of it is ingesting into the destination, but the second part is that once the data arrives at the destination, GPU acceleration speeds up the build process, so we can enable the search applications faster. You can enable it from here: it takes you to the edit cluster page, and under the vector database features we have the option to enable or disable the acceleration.
For this particular destination, I already had it enabled. You enable it and then click dry run; you need to choose a newer-generation instance type, and that's it. The dry run evaluates all the configurations just to make sure everything is correct, and then it kicks off the update process where the flag gets enabled. And then you can click build index.
So once we click on the build index, what happens behind the scenes is we kick off the ingestion workflow. So we have an ingestion pipeline that takes the source, in this case your S3 bucket, the sink, which is our destination, which is our domain or a collection, and then streams the data. And the ingestion pipeline is also serverless, so based on your volume of data and based on how much your destination can take, it can increase or decrease the computing units.
And then once the ingestion is done, you can look at all the job configurations here. You can go to the vector ingestion page and then look at all the completed jobs here. So for that optimization, there are multiple states like completed and then active or failed for the ingestion status. Right now, I just triggered the ingestion, so it's in the creating state. And then it becomes active. And once the ingestion is completed, we stop the pipeline.
So with this, within a few clicks, we take your data and your business requirements, run the hyperparameter optimization jobs in under an hour, and provide you the recommendations so you can review them and choose the best one. Then, with a single click, we take the index configuration, create the index, and stream your data to the destination, all behind the scenes without any manual intervention. This is how, with auto-optimization and GPU acceleration, we have tried to simplify the onboarding experience for Amazon OpenSearch Service.
So this concludes my demo. I think we have some time to maybe take a few questions a little bit early. What do you think? Yeah.
; This article is entirely auto-generated using Amazon Bedrock.