🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Video sampling & search using ElastiCache & multimodal embeddings (DAT433)
In this video, Elad and Kevin demonstrate building a video sampling and search application using vector similarity search with Valkey on ElastiCache. They explain how VSS works with hash maps and JSON documents through asynchronous indexing with dedicated worker threads, while maintaining synchronous APIs for read-your-own-writes consistency. The session covers scaling patterns (sharding increases ingestion rate while replicas boost query throughput) and walks through implementing deduplication, multimodal analysis using Amazon Bedrock's Titan embedding models, and storage using the Valkey Glide client. Kevin live-codes Lambda functions for generating embeddings, creating FT.CREATE indexes with the HNSW algorithm for 1,024-dimension vectors, and implementing FT.SEARCH with K-nearest-neighbor queries. The demo showcases semantic and multimodal search capabilities, returning results with cosine similarity scoring, and discusses trade-offs between using ElastiCache as primary vector storage versus as a caching layer in front of durable databases like DynamoDB.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Vector Similarity Search with Valkey and ElastiCache
My name is Elad. With me is Kevin. We're both principal engineers working on ElastiCache, and in a minute, Kevin is going to walk us through some code, building a video sampling and search application. But first I'd like to dive a little bit into how vector similarity search works with Valkey.
VSS works on two data types with Valkey, on hash maps and on JSON documents, and you start by defining an index and a schema. Whenever you change one of these objects, that change immediately happens on the main key database, and then the search module receives a keyspace notification and starts asynchronously indexing those changes using a set of dedicated worker threads. That means the Valkey main thread can continue serving requests from other users. But the particular user that sent the modification waits, because we have a synchronous API, and a synchronous API is really useful here because you can read your own writes, versus other databases where indexing is asynchronous and will happen sometime in the future. So once indexing is complete, the user gets an OK response and can continue.
Now queries get handled immediately by the query engine. The work also happens on dedicated worker threads. So the queries get sent both to local indices but also to remote shards. And that's because when we search for nearest neighbors, we don't necessarily know at which shard the nearest neighbors are going to be. So we fan out and the query gets applied everywhere. When the results come in, optionally, Valkey will actually enrich them back with data from the main database and send them of course back to the user.
Let's talk a bit about scaling because it's a little bit different than what you're used to with Valkey, if you're a Valkey user. So when you ingest information, you send a modification to a particular primary in a particular shard. That primary does the work of ingesting and updating the index, and asynchronously updates the replica, which also indexes the information. Now, this architecture means that the more shards you add, the bigger you scale out, the more data you can store. And it also increases your ingestion rate. So you can ingest billions and billions of items. Well, maybe not billions per second, but the supported rate is very high and grows as you add more shards.
Search works differently. When you send out the query, the node that you send to has to fan out to all the shards like we just mentioned. And that means that as you add more shards, it doesn't increase your throughput, it doesn't increase your queries per second, right, because all the work has to be done by all the shards. So what can you do if you need more throughput, more queries per second? You can add more replicas, right, because as you can see, queries fan out to all the shards, but they pick either a replica or primary. So if you have a lot of replicas, you can actually do more work concurrently. Another good option is to scale up the instance type, right? If you have more cores, those dedicated worker threads can actually do more work and those actually are very, very efficient at utilizing the cores.
Building a Video Sampling and Search Application Architecture
Let's talk about what Kevin is going to build. So we'll have a web application. It's going to have two parts. It's going to have an ingestion part, which is where we upload videos, and it's going to have a query part where we can query by either text or image. Now what holds them together is actually Valkey, right? When we ingest, we take images, we analyze them, we turn them into embeddings using foundation models, right? So basically embeddings are vectors. We insert them, and when we query, we take the query, it could be either a text or an image, right? We turn that also into an embedding, transform it to a vector using a foundation model.
Then we look for similar vectors or similar images in Valkey. When we find something, we can go and grab it off S3 and display it to the user.
So the ingestion pipeline means that we take the video, we break it apart into frames, and then we perform deduplication. After we find images that are unique enough, we run through multimodal analysis. Diving deeper into the deduplication and what it means on the application that we're going to build, it's going to be triggered from an S3 insertion, the creation of an object. Then we're going to take the image, transform it into an embedding, and we'll start by looking for similar images on ElastiCache on Valkey.
If we find something that's similar enough, like neighbors that are close enough, then that image is probably not unique enough and doesn't add additional value or additional content. So we can just throw it away and save all the extra analysis. But if it's unique enough, we'll continue our pipeline, run additional analysis, and eventually store it. Now, the multimodal analysis is going to work by taking our image and running it through several analysis models. We're going to use categorization, labeling, and summarization. And we're going to take the output of all of these, which is basically text.
We're taking the text and we're taking the image, and we're putting them all into the foundation model because we want to generate embeddings. That's what we're doing, embedding vectors. This is what it's all about. And then we're taking the text, transforming it into embeddings, but we're also inserting the text itself along with the generated embeddings into ElastiCache. So you'll have this JSON object that contains multiple fields. It will have text for all the things that we analyzed, and it will have the vectors that we got from the foundation model, and then we'll be able to search for all of it.
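The deduplication gate described above reduces to a simple threshold test on the nearest-neighbor distance. Here is a minimal sketch, assuming the 0.24 cosine-distance cutoff Kevin shows later in the demo; the function names are illustrative, not from the demo's code:

```python
from typing import Optional

DEDUP_THRESHOLD = 0.24  # cosine-distance cutoff shown later in the demo


def is_duplicate(nearest_distance: Optional[float],
                 threshold: float = DEDUP_THRESHOLD) -> bool:
    """True when the closest stored frame is similar enough to skip this one.

    nearest_distance is the cosine distance to the nearest neighbor returned
    by the KNN search, or None when the index is still empty (so the very
    first frame is always kept). Cosine distance is 0.0 for identical vectors,
    so a small distance means the new frame adds little new content.
    """
    return nearest_distance is not None and nearest_distance < threshold


def frames_to_keep(distances, threshold: float = DEDUP_THRESHOLD):
    """Indices of the frames that survive deduplication."""
    return [i for i, d in enumerate(distances) if not is_duplicate(d, threshold)]
```

A frame is unique enough when no stored neighbor falls within the threshold, so it proceeds to the full multimodal analysis; otherwise it is dropped before any further work is spent on it.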
Now, searching means that the user is going to input either text or an image. Either way, we take that, we also transform it into an embedding, and we look for nearest neighbors. Found something, grab it off S3, display it. Cool. So let me hand it over to Kevin. Yeah, thank you, Elad. Switch over to my demo machine.
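The search flow Elad just described can be sketched at the command level. The `*=>[KNN k @field $param]` query syntax follows Valkey's vector-search query language; the index and field names (`video_frames`, `mm_embedding`) match ones used later in the demo, while the Glide call shown in the comment is an assumption about the client API:

```python
import struct


def pack_vector(vec):
    """Serialize a float list to the little-endian FLOAT32 blob Valkey expects."""
    return struct.pack(f"<{len(vec)}f", *vec)


def knn_query(field: str, k: int, param_name: str = "query_vec") -> str:
    """Build an FT.SEARCH KNN query string, e.g. '*=>[KNN 5 @mm_embedding $query_vec]'."""
    return f"*=>[KNN {k} @{field} ${param_name}]"


# With a real Glide client the search would then look roughly like
# (hypothetical call shape, shown only as a comment):
#   ft.search(client, "video_frames", knn_query("mm_embedding", 5),
#             <options carrying PARAMS {"query_vec": pack_vector(embedding)}>)
```

The query embedding is produced by the same foundation model as at ingestion time, which is what makes text-to-image and image-to-image matching work.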
Demonstrating the Completed Application and ElastiCache Configuration
So we'll start out by just showing you what the completed application looks like before we actually go in and write some code here. So this is the web UI that Elad was talking about, and I have one video here that's ingested, which is an AWS ad. And what the system does, I can show you, if I were to click into it, it does all of the analysis on these frames and provides some insights into the video, including different labels that are identified from Amazon Rekognition, text from Amazon Transcribe, and other things that are identified through the individual frames.
This is all derived from a set of frames, sampled at a one-second interval. There's a question from the audience; that's okay, I can repeat it. Yeah, so on the insights, how do you determine the category, label, and text? So these are from the services that Elad talked about. We feed individual frames to, for example, Amazon Rekognition, which can identify labels, let's say home and indoors.
This may not be the best example, but I see hits on home and indoors on frames at intervals one, eight, eighteen, and twenty-seven. I can click through to those and actually seek to those frames. It doesn't tell me exactly what was identified where, but these are just some of the machine learning services that AWS provides, taking the images as input.
These frames are extracted, and I can go in and see the extraction of the frames. It goes frame by frame here at a one-second sampling interval. Each of these frames is then fed to Claude in this demo to generate a summary. You can see another view of the labels and text, as well as subtitles for each of them. As it goes, you see it's one-second intervals, but there's also a similarity score here that can be observed. This is derived from the deduplication process. We can see that these frames have a threshold set at 0.24; frames that score below that will be dropped because they are similar enough to another frame. All of this is information derived from those individual frames.
Let me walk through the pieces of where Valkey fits into this. That's what we'll be walking through today. To start out with, I wanted to start by clearing out the code that is already written. To do that, I'm just going to do a quick CDK deploy to remove the information on here. This is all driven by a CDK package. That should clear things out. Let me go over here from the Lambdas that we're going to be implementing with, and I'm also going to delete this video. We'll clear everything out.
The actual frames are stored in S3. Metadata, in this case durable metadata, is stored in DynamoDB, and then the vectors themselves are stored inside Valkey. I'm going to pull up a Valkey CLI here and clear out all of the information that's in Valkey just so we start out fresh. Let me show you my Valkey first. ElastiCache has a number of different configurations, and the one we're using is a provisioned Valkey cluster running version 8.2, which is the version that supports vector search. I have three shards, each with a primary and two replicas, so there are nine nodes total in my cluster. They're running on an R7G XLarge node type, but it's supported on a variety of different instance types that you can configure.
I'm going to do this. There are three different shards. I will flush all on all of them. I'm just doing this by using GET to move to the different shards here. If you aren't familiar with Valkey, I'm just basically flushing everything out. The indexes are defined a little bit differently, so I'm going to call drop index. Let me do this. This will start over from scratch, so we have a clean slate to build off of.
Implementing the Embedding Generation Lambda with Amazon Bedrock
Now, the first thing we need to do in that pipeline that we talked about is getting deduplication working and storage. All of this is derived from the embeddings we're using today, which are derived from Amazon Titan. We're using two different embedding models. One is the V2 for text, which is slightly better on text information. Then we're using this multimodal embedding model, and that can generate embeddings on both text and on images, so we can feed in those frames and get back vectors. We're going to be implementing that through Bedrock. Let me go ahead and refresh these. Any questions so far?
I can walk through this Lambda. This application is broken up into a bunch of different Lambdas, each of which is purpose-built. A lot of them are connected via Step Functions, which we can look at in a little bit. One of the core ones is actually the function to generate embeddings, because we use that everywhere: on search, on ingestion, on deduplication. Right now, the Lambda doesn't do that much. Let me see if I can get it to work. It shouldn't be doing much. Let's see what my CDK deploy did. It did work.
Let's go ahead and reopen it to make sure it's not cached. Let's see what this does. The file was last modified six hours ago. In the worst case, I can clear it out myself if it isn't going to get updated by the deployment here. One second. Was it a regional issue? Let's see. I'll just start with the embedding one here. One second. I flushed out all the data but not all of the code. Okay, that's not wanting to delete, so I will push it manually. Cool, I'll just copy and paste.
Okay, from my local to here, then deploy. We'll see if that happens on the other ones too. So the empty one is actually empty here now, and it is a Lambda handler that does some validation and currently just returns nothing in its response here. But what it's supposed to do is take the text and image input, both of which are optional, and then the embedding type to call one model or the other. So the first thing we want to do is create the Bedrock client. I got some help here from Amazon Q, but I'm just going to use the Boto3 client. This is Python, and I'll connect to the Bedrock runtime. The autocomplete is nice here. And if you see any typos, feel free to call them out as I'm doing this. We can do some pair programming so that it actually works.
So we want to handle the different embedding models here. If the embedding model is multimodal, then we want to do the multimodal embedding type. Otherwise, we'll do the text model. This is a little bit helpful. And then we want to generate our request, which is going to be JSON. We could do that, or we can do it simpler with just a text input. So if we want to look up the API, I don't have it here for Bedrock, but it takes in a JSON document and it takes an input text. Or if there's image input, then we'll do the image. And if you haven't worked with Bedrock before, it's pretty easy to do. In fact, you can autocomplete it for me. We will dump the request as a JSON document and call the correct model. I don't even necessarily need the accept and the content type, but this will then return the vector response. So I can get the body out of the response by loading the response body here, and then get the embedding out of it. So it's a JSON with a single field here that is called embedding, and I'm just going to return that in the body here. I could return the actual JSON blob, but I'm returning the vector embedding. Okay, I think that looks all right. Let's see. So we'll go ahead and deploy it and test it. Is anyone spotting any obvious mistakes in here? This is just demonstrating that it's pretty easy to interact with Bedrock.
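Putting that live-coding together, a minimal reconstruction of the embedding Lambda might look like this. The Titan model IDs and the `inputText`/`inputImage` request fields and `embedding` response field follow Bedrock's Titan embeddings documentation; the event keys (`embedding_type`, `text_input`, `image_input`) are assumptions about this demo's event contract:

```python
import json

# Model IDs per the Titan embeddings documentation
TEXT_MODEL = "amazon.titan-embed-text-v2:0"       # text-only embeddings
MULTIMODAL_MODEL = "amazon.titan-embed-image-v1"  # text and/or image embeddings


def build_request(text_input=None, image_input=None):
    """Build the JSON body Titan expects; image_input is base64-encoded."""
    body = {}
    if text_input:
        body["inputText"] = text_input
    if image_input:
        body["inputImage"] = image_input
    return body


def lambda_handler(event, context):
    import boto3  # imported lazily; always available in the Lambda runtime
    client = boto3.client("bedrock-runtime")

    # choose the model based on the requested embedding type
    model_id = (MULTIMODAL_MODEL
                if event.get("embedding_type") == "multimodal"
                else TEXT_MODEL)

    request = build_request(event.get("text_input"), event.get("image_input"))
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps(request),
        contentType="application/json",
        accept="application/json",
    )
    # Titan returns {"embedding": [...]}, a 1,024-float vector by default
    embedding = json.loads(response["body"].read())["embedding"]
    return {"statusCode": 200, "body": embedding}
```

As Kevin notes, the Lambda itself generates nothing; it only marshals the request to the Bedrock service, which runs the Titan model.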
So I'm going to just create a simple test here. I need my embedding type and text or image. Let's do that. I'll create a test event with the embedding type as text, and the fields were called what? Text input and image input, not text and image.
There's no image input, just text. The text input is "Hello World" and what I'm expecting to get back is a vector, which I do get. It's a large vector, 1,024 in length, which we can find out just from the output vector size here. Now, the question is what is generating the vector. So yeah, the question is whether the Lambda is the one generating the vector. In this case, the Lambda is not doing the vector generation. The Lambda is just calling the Bedrock service, and the Bedrock service is the one that is interacting with the Titan model.
There are local embedding models, so it is an option to generate embeddings within the Lambda or locally on an EC2 instance. For this demo, we are just interacting with Bedrock via the AWS SDK, and it's getting the credentials locally through the Lambda in order to invoke that remote service and get the embedding. Yeah, the question is whether you need a token or credentials, and that is true: this is set up so that the Lambda has the credentials present, and the default SDK picks them up just by instantiating the client.
Creating the Vector Save Function with Valkey Glide Client
So this is the foundational one. It looks like it's working, but it's not going to do anything if we were to ingest a video because we haven't hooked it up to Valkey yet. And so the next step is to go ahead and take it when it has a vector and has decimated the video to actually store that information inside Valkey. And for that we have this vector save functionality, which again has not been cleared out, it looks like. So I can do my same trick I did before of taking the empty one locally and overwriting it and deploying it.
Cool. So this one is later on in the path where it has already taken a keyframe and all this information is stored inside DynamoDB and it's provided in this event context that we have. We have the S3 image, we have some task ID, we have the frame and some configuration items. So this frame is the JSON blob that we're going to store inside ElastiCache, and it has a bunch of fields I can show you. It has things like the S3 URI to be able to display in the UI, timestamp, a lot of the metadata necessary to perform the vector search as well as the actual vectors themselves. So those are already populated here.
Our job here is to initialize it. So we have create Valkey client, which is not doing anything right now, and so we want to be able to connect it into Valkey. To do that, we're going to use a client that's called Valkey Glide, which is one that our team supports. It is one of the official clients for Valkey, and it is a polyglot client, so it supports multiple different languages. We're going to be using the Python one here since all of our Lambdas are in Python.
So I'm going to import some things here, and I'll see if you can figure that out. Probably not. From glide sync, I'm just importing star here since there are a bunch of different things that we're going to be using. Actually, just the glide sync import is sufficient to be able to create the client. To initialize the client, we have a domain endpoint and a port, sourced from our ElastiCache cluster. We have a single configuration endpoint and port here. So there's a DNS name and port 6379 that we're using in order to connect to our cluster.
And the way that Valkey works in cluster mode is it discovers all of the different nodes. We talked about there being nine nodes in our cluster. So from this configuration endpoint, it queries any node and is able to identify what the nine nodes are and talk to them for pieces of data. So it's a direct communication from the client into the nodes that have all of this data resident in memory.
So let's go ahead and initialize our client. We'll do that by returning our Glide cluster client. We're using the cluster client because it is cluster mode enabled. We'll call create and pass it some cluster client configuration, passing it the list of addresses. In this case, we just have one address, which is the domain endpoint, and then we will pass it the port. In this case, we have enabled TLS but not authentication.
So if we look under here, we have encryption in transit enabled, and that is an option that you're able to toggle. We don't have any authentication, which is also available, but skipping it makes this a little bit easier. So with just TLS, we can set the use TLS flag to true, and then the client will do the TLS negotiation. And we're going to set some configuration here to increase our timeout with the advanced Glide client configuration; we'll set a two-second timeout, which works best for this particular demo. So that should work to actually create the client.
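A rough sketch of the client creation from this step. The class names (`GlideClusterClient`, `GlideClusterClientConfiguration`, `NodeAddress`) follow the valkey-glide Python client Kevin uses, but the sync-module path and the exact placement of the request timeout are assumptions that may vary across client versions; `parse_endpoint` is just an illustrative helper:

```python
def parse_endpoint(endpoint: str, default_port: int = 6379):
    """Split 'host:port' (port optional) into (host, port)."""
    host, sep, port = endpoint.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return endpoint, default_port


def create_valkey_client(endpoint: str):
    # imported lazily so the pure helper above runs without the SDK installed;
    # module and class names are assumptions about the glide sync client
    from glide_sync import (
        GlideClusterClient,
        GlideClusterClientConfiguration,
        NodeAddress,
    )
    host, port = parse_endpoint(endpoint)
    config = GlideClusterClientConfiguration(
        # one seed address is enough: the client discovers all nine nodes
        addresses=[NodeAddress(host, port)],
        use_tls=True,          # encryption in transit is enabled on this cluster
        request_timeout=2000,  # 2 s, matching the demo's timeout bump
    )
    return GlideClusterClient.create(config)
```

Because this is a cluster-mode-enabled deployment, the cluster client variant is used: it talks directly to whichever node owns each key's slot.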
Next, we'll skip the index creation for now, because that can happen separately, and we want to be able to store this frame. So the fundamental data structure that we're working with here is the JSON one inside Valkey. That is one of the modules available for Valkey that comes with the bundle, and Glide supports it using this Glide JSON wrapper. So we're going to have to import that first, and then we can use it.
So from glide sync, we'll import, actually I think it's there by default. It's the full-text piece that we don't have there: Glide JSON set. And we'll pass it the ElastiCache client. The question from the audience is whether there's a way to generate this code using an agentic AI, and that is a good question; the answer is yes. You see Amazon Q attempting it here, but it is not always succeeding. I did play around with trying to do this using Q for this demo, and it has a lot of background information on invoking Bedrock, a little bit less so in its data set on using Valkey Glide, because that's a somewhat newer repo. But I was able to get it to generate the code there too.
So that is an option: you can use this or anything else to try using a large language model to do the code generation. The question is whether this is Q. I'm not using it right here. The one that's popping up here, I think, is just Amazon Q, which is running; I could pause it. But Q is part of the Lambda console already. I don't think there's a code generator within the Lambda console; I could be wrong on that. There is an Amazon Q tab here, but I think it's just for asking questions.
But you could use something like Q here, which I have, that allows you to actually generate code by chatting with it. I'm not doing that today just to illustrate the actual points, as opposed to having it generate something for us. So for the set, I'm passing in the client, and then I want to set it with a particular key. Everything in Valkey is indexed by a key name, and here I am putting "frame" as the prefix, followed by the frame ID, which is unique. That's a common pattern for interacting with Valkey.
And since this is working with JSON, I'm going to replace the entire JSON document, so the dot, and then I will dump the frame using JSON. And so this will take it and store it inside Valkey, and you can interact with JSON. You don't need to have a vector index. That's just one of the data structures that's supported. But here we want to be actually able to create an index so that when we do this set, it automatically ingests it and makes it available for vector search.
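The store step itself is a single JSON set through the Glide JSON wrapper. A hedged sketch: the `GlideJson` name follows the wrapper mentioned in the talk, and the `frame_id` field and key layout are illustrative:

```python
import json


def frame_key(frame_id: str) -> str:
    """Keys are 'frame:<id>' so the index's prefix filter can find them."""
    return f"frame:{frame_id}"


def save_frame(client, frame: dict) -> None:
    # lazy import; the wrapper name follows the Glide JSON module from the talk
    from glide_sync import GlideJson

    # "$" (the root path) replaces the entire JSON document at this key;
    # once the index exists, the set is indexed automatically on write
    GlideJson.set(client, frame_key(frame["frame_id"]), "$", json.dumps(frame))
```

The stored document carries both the embedding vectors and the plain-text metadata (labels, summary, S3 URI), so one object serves search and display.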
This is getting into some of the vector functionality, the VSS capabilities there. There are a lot of different options. I will show the Valkey command for, let's do FT.CREATE. So Valkey is pretty well documented in terms of what the commands are. It's a little bit hard because it has the command here on the side, but we have an FT.CREATE function. FT stands for full text, but right now Valkey supports just vector search. Full text is coming soon. We can create a schema with a bunch of different field types and attributes, and so we'll do that just to index two different vector fields, one of the text and one of the image like we had talked about before.
So let me go back there. To actually create the index: we only have to create it once, and we could do this manually, but we're going to make the Lambda do it. In order to do that, we can test to see if it exists first. I need to actually import all of my full-text stuff; that doesn't come in with the glide sync import. Those are separate commands. So from the glide sync commands, we'll import ft to interact with the VSS module, and then from the glide shared commands, server modules (it's a little bit long), the create options, import star. This is where having gen AI create it would be helpful. Okay. So now I should be able to call FT.info, pass it the client, and check to see if our index exists; we're going to have one index here, and we'll call it the video frames index. This will throw an exception if it doesn't exist: except Exception as x. Okay. And so if it doesn't exist, then we'll go ahead and try to create it here.
And so to create it, we have to create a schema first and define exactly which fields we want to index. So we'll say it's a vector field. The name is going to be where we find it within the JSON document (we can talk about where it's coming from later, along with the data type): it's going to be called mm_embedding, for multimodal embedding. We'll alias this to just mm_embedding so we don't have to deal with the dollar sign at the front; when we actually query it, we can drop the dollar sign. We're going to use the HNSW algorithm. Oh, look, it is helpfully doing it. Let's see if it's actually correct. It's a little bit wrong; VectorAlgorithm is the right one. So this is, you know, the pitfall of using gen AI: it might get it mostly correct, but not one hundred percent. So we'll do HNSW. I'll go ahead and pause that so it doesn't distract; let's see, it was down at the bottom. Okay. That should be a little bit better. HNSW, that did not work. We'll see if it works now. And we want to pass it the attributes for HNSW.
So HNSW is the graph navigation scheme, and we have to pass it the vector field attributes: how many dimensions to expect and how we want to generate a score. So we can tell it that there are 1,024 dimensions in these vectors, and we want to use the cosine metric as opposed to the L2 metric. So we'll do metric type cosine. And they're all float32s in the vector field. So this is the multimodal embedding, and we can basically copy and paste it for the text embedding. Because it is, line 38, distance, yep, thank you. Good eye. We'll see then if I'm able to copy and paste this for the text embedding, which just lives in this text embedding field. Okay, so this allows us to index two different fields. These are two different embeddings: one represents just the text and one represents text and image.
So this is most of how we create the schema here, and I just need to finish, yeah, okay, defining it. And then there's a couple of options we need to tell it which fields to index. So we could look for everything that's stored into Valkey to try and index all JSON documents, but here we're just going to narrow it down to things that are JSON.
Specifically, things that start with this frame prefix. So we can say anything that starts with frame, we'll inspect to see if we can insert it into the index, and this allows us to store other things on the same Valkey server without having to index them or to insert them in different indexes. And as we talked about earlier, we can have one object be in multiple indexes if we so chose.
And then we can do FT.CREATE to create the index: pass it the video frames index ID, and then the schema and options. So this will create the index. I'm going to surround this in a try/except just because there might be a race here where several Lambdas try to create it at the same time and one of them will throw an error. Normally I would handle this better, but since we're just doing it for demo purposes, we'll ignore errors that get thrown here and validate manually.
So now we have the ability to create the client. We'll first try and see if the index exists. If not, we'll go ahead and create it, and then we'll go ahead and store this. We could do it the other way around. It's not important to create the index before you store data. Indexes are backfilled, so you can dynamically drop them and recreate them, and it'll go and search the keyspace and reindex everything as necessary, but this is the way that we've chosen to do it here.
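The index-creation flow just walked through (check with FT.INFO, create with FT.CREATE, tolerate the creation race) might look roughly like the sketch below. The option-class names (`VectorField`, `VectorAlgorithm`, `VectorFieldAttributesHnsw`, `DistanceMetricType`, `VectorType`, `FtCreateOptions`, `DataType`) follow the valkey-glide FT modules, but treat the exact import paths and signatures as assumptions that may differ between client versions:

```python
INDEX_NAME = "video_frames"  # the index name used in the demo
VECTOR_DIM = 1024            # Titan embeddings are 1,024 floats


def _ft_create_options():
    # the FT option classes live under a long module path; imported lazily so
    # the constants above can be checked without the SDK installed
    from glide_shared.commands.server_modules.ft_options import ft_create_options
    return ft_create_options


def vector_fields():
    """Two HNSW vector fields over the JSON document: FLOAT32, cosine distance."""
    opts = _ft_create_options()
    attrs = opts.VectorFieldAttributesHnsw(
        dimensions=VECTOR_DIM,
        distance_metric=opts.DistanceMetricType.COSINE,
        type=opts.VectorType.FLOAT32,
    )
    return [
        # the alias lets queries say @mm_embedding instead of the $-prefixed path
        opts.VectorField(name="$.mm_embedding", algorithm=opts.VectorAlgorithm.HNSW,
                         attributes=attrs, alias="mm_embedding"),
        opts.VectorField(name="$.text_embedding", algorithm=opts.VectorAlgorithm.HNSW,
                         attributes=attrs, alias="text_embedding"),
    ]


def ensure_index(client):
    from glide_sync.sync_commands import ft
    opts = _ft_create_options()
    try:
        ft.info(client, INDEX_NAME)  # raises if the index does not exist yet
        return
    except Exception:
        pass
    try:
        # only JSON documents whose keys start with "frame" get indexed
        ft.create(client, INDEX_NAME, vector_fields(),
                  opts.FtCreateOptions(opts.DataType.JSON, prefixes=["frame"]))
    except Exception:
        # several Lambdas may race to create the index; ignore the loser's error
        pass
```

As noted in the talk, the order doesn't matter: indexes are backfilled, so an index created after data is stored will scan the keyspace and index existing documents.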
Testing Video Ingestion and Inspecting the Valkey Index
And if I got no other typos, we'll go ahead and try to deploy this. There's another typo: on line 59, exception, yeah. Cool. I don't have a test for this, so I could theoretically test it within the Lambda, but I'd have to generate the right context here, so we're just going to test this through the application.
And while that's running, I'm going to go ahead and clear out the one other Lambda function that we will be creating since it didn't get cleared out before, which is search, and I'll deploy an empty version of that. Yeah, I don't care. Okay. Deploy empty search. Okay, now, let's go ahead and go back to our application and, that's stuck in deleting. That might be why things are not working very well. Okay. We're going to ignore the ones stuck in deleting because we've already manually cleared out everything within ElastiCache.
So, let's go ahead and check to see if this works. There should be nothing created yet. Yeah, cool, and there should be no keys. Okay, and so this is a multi-shard cluster, so the keys will be targeted at specific nodes. Each frame might land in a different shard. However, the index only needs to be created once in one shard, and it automatically propagates to the other shards, so every shard gets that same index.
So let's go ahead and try to ingest the video and see what happens. I'm going to ingest the same ad that we played with earlier. And I've got a bunch of settings here, and this allows you to choose a sampling interval. Here we're just doing one frame every second, which is what we talked about before. For each image frame we can run different detections here. So Rekognition picks up labels and text, as well as the moderation and celebrities, and then the summary is done by Haiku, and Transcribe does the subtitles. So we're going to just keep the defaults here and upload it.
And so then this runs a Step Function, which we'll go ahead and look at in a moment, that runs through the transcription and then all of the different analysis. So, as we talked about earlier, Valkey is best for real-time, ultra-low-latency use cases. This isn't really a good showcase of the real-time capabilities because it's done in sort of an offline fashion, right? We're running a workflow on this and then showing a web UI that can do searching. So it's really just there as an example application of how you would connect it, but where Valkey would shine is if you're doing sort of real-time streaming analysis or other things where the latency and throughput requirements are much higher than in this example application. Let's go ahead and check on the extraction flow here; we're running through a set of Lambda functions. Let me see if I can make this bigger. Yeah, running through a set of Lambda functions, we'll get the video metadata. Can I maximize this? Yup. And zoom in.
So we go through the sampling and removing redundant frames, through the sampled images to de-dupe them, and then we iterate through them, extract metadata, and actually store those vectors. And it looks like something probably didn't work. So let's go ahead and dig into what went wrong here. Vector is not defined. Okay, so either I made a typo in the save. Where did I say vector? Vector dot, it's VectorType. There we go, let's try that. Cool, and we'll rerun from failure and see if it can complete successfully.
Still not. Let's see if I made other typos. The create options doesn't have that, and the create index. Let's go look at that. The FT create options has a data type and, ah, it's prefixes. There we go. This is where I wish Amazon Q would help me, if only I could ingest the JSON syntax here, which I have. Okay, let's go ahead and try it one more time.
So this is iterating through them in a loop with fairly low concurrency, so it takes a little bit longer. Again, if you were doing something real-time, you could ingest the frames as they are being streamed in. That completed successfully now that I've fixed all those typos, and on this non-deleting run we should be able to see the same UI that we saw before, with the different image frames.
And I'll point out that there is one duplicate here (this particular image I just pulled from YouTube), but you see similarity scores. Frame 9 is missing: it goes from frame 8 here to frame 10, and that's because if you look at what happens between frames 8 and 9, it's the same woman moving around before it gets to that point, so we were able to de-dupe that particular frame.
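The de-duplication step described above amounts to comparing each sampled frame's embedding against the last frame that was kept and dropping near-duplicates. A minimal sketch, assuming cosine similarity and a hypothetical threshold (the demo's actual threshold and function names are not shown in the session):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dedupe_frames(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Return indices of frames to keep; a frame is dropped when it is
    too similar to the most recently kept frame."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine_similarity(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept
```

With a nearly identical pair of consecutive frames, the second one is skipped, which is exactly the frame-8-to-frame-10 gap visible in the UI.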
So let's go ahead and look at what that looks like inside Valkey. If I run KEYS *, I can see the keys. I see a frame prefix, then a UUID, followed by a timestamp at the end. I can do a JSON.GET on one of them, and it's lots of vectors, but we see it has some things other than vectors: it has the task ID, then the two embeddings that we generated, and then the actual metadata itself, the text and labels. This is all the stuff that it's actually indexing, as well as, let me see if I can make this bigger. I may not be able to. Oh, I can. There we go.
And then the S3 URI. So this is everything that is being indexed. If I list my vector indexes, I have one that does the similarity checking, and then this video frame index is the one that we defined before, and I can run FT.INFO to get information and observability about the video frame index.
This gives a lot of output, but it includes the definition: we defined it with both the text embedding and the multimodal embedding. The interesting part is that it also tells us how many documents have been indexed. There are seven documents, 14 records, within the index that we can search on this particular node, which should correspond to the number of keys here. Seven, yep. Each node holds a portion of the index, so when a search happens it fans out, but that is transparent to the application.
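Based on the key pattern and fields visible in the JSON.GET output above, each frame document might be assembled like the sketch below. The key format and field names here are illustrative guesses, not the demo's exact schema; in the demo the payload is then written with JSON.SET via the Valkey Glide client:

```python
import json

def build_frame_document(task_id, timestamp_ms, text, labels,
                         text_emb, mm_emb, s3_uri):
    """Assemble the per-frame key and JSON payload (field names illustrative)."""
    key = f"frame:{task_id}:{timestamp_ms}"  # frame prefix, UUID, timestamp
    doc = {
        "task_id": task_id,
        "timestamp": timestamp_ms,
        "text": text,                 # text extracted from the frame
        "labels": labels,             # detected labels
        "text_embedding": text_emb,   # semantic embedding
        "mm_embedding": mm_emb,       # multimodal embedding
        "image_s3_uri": s3_uri,       # pointer back to the stored frame
    }
    return key, json.dumps(doc)
```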
Implementing Semantic Search with FT.SEARCH
So if it was cleared out successfully, search should be broken right now. If I try to do a semantic search here, it returns nothing; no videos are found because it's not hooked up yet. The last part we want to do is make this searchable, and to do that, we'll look at the search function here. We've created the index, we've populated it, and now we want to search it. This takes the information from the web UI when I do a search. I've pre-populated it with a Valkey connection, but we need to implement a function that searches the text embedding. We'll do that first; this is the text path, which is semantic search.
So we have the input text and a score threshold, which we'll use to keep just the relevant results. To do this we issue an FT.SEARCH command, and I'll walk through it fairly quickly. To start with, we just have the text as input, so the first thing we need to do is generate the embedding from it. This get embedding function calls the Lambda that we defined earlier and transforms the input text into a vector embedding; we'll pass it the input text but no image. Then we need to query the index with that vector, and for that we use the syntax defined by the Valkey FT.SEARCH command. It's a little bit syntactically difficult, but I can walk through what it means.
So first, the star means that we're not going to do any filtering on the vectors. We could filter by different tags, but we're not going to get into that today. We'll do a K-nearest-neighbor search and pass in a default K, which is 20 in this example, though we could just hard-code the number here. We'll query the text embedding field in our index, pass in our query vector as a variable, and return the score. But we want more than just the score: by default it only returns the score and the key name, which is just a UUID, and that's not very useful because we want some metadata out of this.
We can do that with the vector search options, returning some of the other fields that are also populated in that JSON. So we'll add return fields with a field identifier for each. We want the score, and then some other fields too, so let me copy and paste this. We want the task ID to be able to populate the UI, the image S3 URI, the timestamp, and the text itself, so we can display what the matching text is. Those are the fields we're returning. By default it returns just 10 results no matter what K you pass in, so we want to change that and have it return 20 in this case. We'll set it to the default K.
And then lastly, we need to actually pass in the vector that we're searching. To do that, we define this query vector parameter that matches the variable in the query string, and we pack it into a binary structure so it can be sent over the wire, encoded in a way that Valkey recognizes: we use the length of the embedding to size the buffer and then write the actual raw embedding itself. Okay, I think that looks good for the query; now we just need to call the function itself. We'll do FT.SEARCH, passing in the client, the index name that we're searching, and the options we just defined. Then we have a helper function that will format those results, do the filtering, and return them the way we want them displayed in the UI. So we'll call format frame results on the response with our given score threshold, and that will do the filtering.
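In rough Python terms, the query construction just described (packing the embedding as raw little-endian float32 bytes and building a KNN query against the text embedding field) might look like the sketch below. The argument layout follows the RediSearch-style FT.SEARCH syntax; the exact option names (score alias, return-field list, LIMIT) are illustrative and can vary by server version and client, so treat this as a sketch rather than the session's actual code:

```python
import struct

def pack_vector(embedding: list[float]) -> bytes:
    """Pack floats as little-endian float32 bytes, the layout vector search expects."""
    return struct.pack(f"<{len(embedding)}f", *embedding)

def build_knn_search(index: str, field: str, k: int, embedding: list[float]) -> list:
    """Build illustrative FT.SEARCH arguments for a K-nearest-neighbor query."""
    query = f"*=>[KNN {k} @{field} $query_vec]"
    return [
        "FT.SEARCH", index, query,
        # metadata to return alongside the score (field names assumed)
        "RETURN", "5", "score", "task_id", "image_s3_uri", "timestamp", "text",
        # lift the default 10-result cap up to K
        "LIMIT", "0", str(k),
        # bind the packed query vector to the $query_vec placeholder
        "PARAMS", "2", "query_vec", pack_vector(embedding),
    ]
```

A 1,024-dimension float32 embedding packs to exactly 4,096 bytes, which is what goes over the wire as the parameter value.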
Testing Text and Multimodal Search Capabilities
Okay, and this one's easier to test because we can just run the web UI as soon as the deploy happens. Let's see if I made any mistakes here. So now if I do the same thing and search for AWS... aha, I did not memorize this well enough. Okay, let's go find the FT.SEARCH options. Can anyone spot my mistake? If not, we'll go into our CloudWatch logs and find out what error is coming back.
"Set has no items." Did I pass in the wrong query options? Did I pass in a set instead of a list? Let's go back. Where am I using a set? I have a return field list. The FT.SEARCH vector search params is... yes, that's it. Colon, thank you. And this is still correct, yep. Let's try that. Thank you.
Cool, so I get back 20 frames in the single video. I can expand this and see them ordered from most relevant to least relevant. As you can see, when we do a search, the frames with the lowest scores here actually represent AWS. Claude was able to pull out the AWS text from this one, and if we look at some of the others we can see AWS in the image here at the bottom, and AWS present here, so the most relevant results come up first.
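The formatting helper mentioned earlier essentially filters hits by the distance score and sorts them ascending, since with cosine distance a lower score means a closer match. A hypothetical sketch, assuming each hit is a (key, fields) pair as a raw search response might be parsed into:

```python
def format_frame_results(hits, score_threshold):
    """Filter raw search hits by score and shape them for the UI.

    hits: list of (key, fields) pairs; fields includes a 'score' string
    where lower distance means more similar.
    """
    results = []
    for key, fields in hits:
        score = float(fields.get("score", 1.0))
        if score <= score_threshold:
            results.append({"key": key, **fields, "score": score})
    results.sort(key=lambda r: r["score"])  # most relevant first
    return results
```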
We still haven't done the multimodal search, though that's basically the same thing, so we can copy and paste a lot of this; we're just going to be searching a different field. I could factor out a helper function here, but I'm not going to right now. So we generate the multimodal embedding, pass in the image as well, and change this to the MM embedding field. I think that's it. We get the same details out; it's just a separate embedding within the same index. Let's see what that does.
And so now that opens up this other dropdown where I can do a multimodal search. This one I can pass text or an image. I'm going to search for an ogre, because there is an ogre, or a monster of some sort, in this video. We can see if it shows up correctly. Okay, we still got 20 frames, and the top ones are of this monster in the AWS ad. And I could do a combination, so I could search for AWS and ogre. It interleaves them, because it's generating a single embedding across both simultaneously, so I can see the things that are most relevant to both the ogre and AWS.
Cool, and I think that is everything that I wanted to demonstrate today. I'll say that this application is largely available online in the AWS Solutions Library, with the caveat that it was originally developed for OpenSearch, so as published it doesn't run against Valkey; the code running here swaps OpenSearch out for Valkey. Within AWS Labs it's called Guidance for Media Extraction and Dynamic Content Policy Framework, so it's a framework you can walk through in your own time if you want to understand more about how the entire ingestion pipeline works.
Q&A: Cost Considerations, Use Cases, and Caching Strategies
And with that, I wanted to open it up to any questions that folks have.
Yes, so the question is about cost: the services you ran are expensive, so does it make sense to put the data in something like DynamoDB first and store the embeddings there? That is a good point. The information in ElastiCache is ephemeral, so in this particular case, where you want something long-lasting, you're right, a durable database would be a better fit. A better use case for vector search on ElastiCache is something real-time, say security analysis, where the information itself is more ephemeral: you're looking for particular threats, and you don't really care if the data ages out or has a TTL set on it.
You could store the embeddings in something like DynamoDB, which doesn't have a vector search capability, and then move just the core working set into Valkey so that it's easily searchable, and age things out. That is one way to do it: we could set TTLs on all of these JSON documents and have them removed after some period of time, while the data stays mastered in another durable database. That is a pattern that we see customers use.
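The TTL pattern just described could look like the following sketch. JSON.SET and EXPIRE are standard commands; the client here is a stand-in that records what it would send, so the shape is visible without a live server (the real demo uses the Valkey Glide client):

```python
class RecordingClient:
    """Stand-in for a Valkey client; records the commands it would send."""
    def __init__(self):
        self.commands = []
    def execute(self, *args):
        self.commands.append(args)
        return "OK"

def store_frame_with_ttl(client, key, payload, ttl_seconds=3600):
    """Write a frame document, then let it age out after ttl_seconds.

    The durable copy is assumed to live elsewhere (e.g. DynamoDB);
    Valkey only holds the searchable working set.
    """
    client.execute("JSON.SET", key, "$", payload)
    client.execute("EXPIRE", key, ttl_seconds)
```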
There's a question in the back: is it possible to get the code for the way you implemented this through GitHub? The Valkey portion is not currently available, but I think we could look into making it available. If you want to chat afterwards, I can work with you to get the information to you. Thank you.
Another question over here: it's not really a question, but I'm generally confused about using embedding caches, because you're not going to get cache hits all the time, and then you won't end up searching your durable storage. So in this case we're using ElastiCache as our primary vector database; there is no other vector database in the picture. We've replaced OpenSearch or, you know, Postgres with ElastiCache entirely, so you will get a hit anytime there is a relevant match.
Is the question about how you would use ElastiCache as a vector database in front of another vector database that's your durable store? You could do that as well. In that case we'd keep just a hot working set here, but if there's a miss we might need to fall back to the database. So it would serve the cache hits, but on a miss you would fall back to the durable store and perhaps bring that vector into Valkey. You could do a similar pattern to a look-aside cache with vectors; you'd just have a two-layer cache for the vector search.
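That two-layer, look-aside pattern can be sketched as follows. The function names and the "weak hit" heuristic here are assumptions for illustration; the searchers are injected so the tiering logic stands on its own:

```python
def lookaside_vector_search(query_vec, cache_search, durable_search,
                            cache_put, k=20, score_threshold=0.5):
    """Two-tier vector search: try the cache first; if no hit clears the
    score threshold, fall back to the durable store and backfill the cache."""
    hits = cache_search(query_vec, k)
    good = [h for h in hits if h["score"] <= score_threshold]
    if good:
        return good
    hits = durable_search(query_vec, k)
    for h in hits:
        cache_put(h)  # promote into the hot working set
    return [h for h in hits if h["score"] <= score_threshold]
```

The score threshold is what decides whether a cache hit is "good enough," which is exactly the trade-off raised in the next question: a cached match may pass the threshold even though a closer vector exists in durable storage.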
But there could be a much closer, more similar vector in your durable storage that you just wouldn't hit because there's something in your cache. Yeah, it depends; that's where the score thresholding is important. You're right that you might not get the most relevant results if you have just a subset of the vectors in your Valkey cache and the rest in durable storage, but they might be good enough. It works best if you have the entirety in here, because then you get the most relevant results, and that's what we see a lot of customers doing: storing all of their vector data in here. You just have to handle the ephemeral nature of it when you do that. Okay, thank you.
Cool. Thank you, and I'll take other questions offline. We'll be around outside. Thank you so much.
; This article is entirely auto-generated using Amazon Bedrock.