Why write yet another blog article on how to use AWS S3 Vectors, when there are already many such blog articles and tutorials out there?
Because all the existing tutorials I read miss a critical aspect: they don't explain how to actually retrieve your full documents after finding matching vectors. Instead, they store tiny example "documents" (often just a sentence) directly in the vector metadata. This approach, while easy for a tutorial, completely falls apart with real-world content. Hopefully nobody is going to store entire documents in vector metadata!
This article fills that gap. I won't rehash the basics that others have covered well. Instead, I'll focus specifically on implementing a complete document retrieval flow with S3 Vectors and a standard S3 Bucket that:
- Stores your actual documents in a standard S3 Bucket
- Creates and indexes embeddings in S3 Vectors
- Connects vector search results back to your original documents
Unlike a vector database that handles document storage and retrieval for you, S3 Vectors only manages the vector index. Understanding how to bridge this gap is essential for building production-ready applications with AWS S3 Vectors.
At a high level, using S3 Vectors is a three-step process for both storing and querying, as shown in this diagram.
Storing Documents and Vectors
- Put the documents in a regular S3 bucket, hashing the file name or identifier to generate the S3 object key,
- Use an embedding model to generate an embedding based on the content of the document,
- Store the embedding in the vector index.
In this article's example, the documents are crawled web pages and the S3 object key is generated by hashing the page URL. The crawled pages have the following format:
{
    "content": string,
    "metadata": {
        "url": string,
        "title": string
    }
}
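For illustration, a single crawled page could look like this (the values are hypothetical):

{
    "content": "Full text of the crawled page goes here...",
    "metadata": {
        "url": "https://example.com/docs/s3-vectors",
        "title": "Working with S3 Vectors"
    }
}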
The three steps are thus as follows.
import hashlib
import json
import re

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
s3_vectors = boto3.client("s3vectors")
s3 = boto3.resource("s3")

MODEL_ID = "amazon.titan-embed-text-v2:0"

vectors_data_bucket = s3.Bucket(S3_DOCUMENTS_BUCKET_NAME)
vectors = []

# pages: the list of crawled pages, in the format shown above
for page in pages:
    key = hashlib.md5(page["metadata"]["url"].encode()).hexdigest()

    # Store the actual document in the S3 bucket as text content
    vectors_data_bucket.put_object(
        Key=key,
        Body=page["content"].encode("utf-8"),
        Metadata={
            "title": re.sub(r"[^a-zA-Z0-9\s]", "", page["metadata"]["title"]),
            "url": page["metadata"]["url"]
        }
    )

    # Generate the embedding for the page text
    model_response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({
            "inputText": page["content"],
            "dimensions": 1024
        }).encode("utf-8"),
    )
    response_body = json.loads(model_response["body"].read().decode("utf-8"))
    embedding = response_body["embedding"]

    # Build the vector with the same key as the S3 object
    vector = {
        "key": key,
        "data": {
            "float32": embedding
        },
        "metadata": page["metadata"]
    }
    vectors.append(vector)

# Store the vectors in the S3 Vectors index
s3_vectors.put_vectors(
    vectorBucketName=S3_VECTORS_BUCKET_NAME,
    indexName=S3_VECTORS_BUCKET_INDEX_NAME,
    vectors=vectors
)
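If you index more than a handful of pages, pushing everything in a single put_vectors call may hit the per-request limit on the number of vectors. Here is a minimal batching sketch; the batch size of 100 is an assumption, so check the current S3 Vectors quotas for the actual limit.

# Hypothetical batching: split the upload into chunks instead of one big call.
# BATCH_SIZE is an assumed value, not a documented limit.
BATCH_SIZE = 100

for i in range(0, len(vectors), BATCH_SIZE):
    s3_vectors.put_vectors(
        vectorBucketName=S3_VECTORS_BUCKET_NAME,
        indexName=S3_VECTORS_BUCKET_INDEX_NAME,
        vectors=vectors[i:i + BATCH_SIZE]
    )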
Querying S3 Vectors and Retrieving Documents
Then, when you query S3 Vectors to retrieve the documents, you need to:
- Use the embedding model to generate the embedding for the search question,
- Query S3 Vectors to get the embeddings close to the search embedding,
- For all embeddings in the results, use the key to retrieve the objects from the S3 Bucket.
question = "What is AWS S3 Vectors?"
documents = []

# Invoke the same model to generate the embedding for the question
response = bedrock_runtime.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps({
        "inputText": question,
        "dimensions": 1024
    }).encode("utf-8"),
)
model_response = json.loads(response["body"].read())
question_embedding = model_response["embedding"]

# Use the question embedding to search for similar embeddings in the S3 Vectors index
query_results = s3_vectors.query_vectors(
    vectorBucketName=S3_VECTORS_BUCKET_NAME,
    indexName=S3_VECTORS_BUCKET_INDEX_NAME,
    queryVector={"float32": question_embedding},
    topK=3,
    returnDistance=True,
    returnMetadata=True
)
vectors = query_results.get("vectors", [])

# Retrieve the actual documents from S3 using the keys from the query results
for vector in vectors:
    obj = s3.Bucket(S3_DOCUMENTS_BUCKET_NAME).Object(vector["key"]).get()
    content = obj["Body"].read().decode("utf-8")
    documents.append({
        "title": vector["metadata"]["title"],
        "url": vector["metadata"]["url"],
        "content": content
    })
Key Takeaways
Unlike a vector database, which stores and retrieves the documents for you, S3 Vectors only stores the vector index. It is up to you to maintain the relationship between the vector key and the actual document. S3 makes that easy if you use the same key for the document vector in S3 Vectors and the document object in the S3 Bucket. While storing and retrieving becomes a multi-step process that you have to orchestrate and that inevitably increases response latency, this approach offers substantial cost savings compared to dedicated vector databases.
Note that the documents do not have to be stored in an S3 Bucket. In the above example, we could imagine not storing the pages' content as objects at all, and instead returning just the page URLs from the vectors' metadata so that the pages can be crawled downstream, as in the sketch below.
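As an illustration, here is a minimal sketch of that variant, reusing query_results from the query above: it only reads the metadata returned by query_vectors and never touches a documents bucket.

# Metadata-only variant: return the matching page URLs and titles
# so the caller can fetch or re-crawl the content downstream.
pages_to_crawl = [
    {
        "title": vector["metadata"]["title"],
        "url": vector["metadata"]["url"]
    }
    for vector in query_results.get("vectors", [])
]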