You Don’t Need a Vector Database to Build RAG (Yet): A ~$1/Month DynamoDB Pipeline

You've just been tasked with building a document Q&A system, where users can upload PDF documents and then ask questions to get accurate answers based on the content of those documents. This request is the perfect opportunity to build a Retrieval Augmented Generation (RAG) pipeline. For starters, you've been asked to deliver a proof of concept, so you don't have the budget to build this project with the best tools available for the scenario.

If you’ve never researched how a RAG system works, here's the short version: it breaks a document into multiple chunks, converts them into vector embeddings (numerical representations of the incoming data), and stores them in a database. Later, when the system is asked about the document, it finds the most relevant chunks in the database and feeds them into a Large Language Model, which then provides an answer.

I wanted to learn about RAG pipelines as cheaply as possible, without using a vector database, even though I know that's the better option performance-wise.

That’s why I decided to use DynamoDB. This post will show you that it’s one of the cheapest implementations of a RAG pipeline available. Of course, there are drawbacks and trade-offs, but on a strict budget and for a POC pipeline, it’s a great choice.

The technologies and services we’ll be using are:

  • Python and AWS CDK in Python
  • AWS Bedrock
  • AWS Lambda
  • AWS DynamoDB
  • AWS SQS

Link to the GitHub repository is here - https://github.com/mate329/budget-rag-system-with-dynamodb.

Let's dive in.

Architecture Overview

Take a look at the architecture of the pipeline, with each step explained:

  1. Document Upload: The user requests a pre-signed S3 URL from the Upload Handler Lambda and uploads the document to the S3 bucket (see the sketch after this list)
  2. Event-Driven Ingestion: A document upload to S3 triggers an EventBridge rule that sends the event to an SQS queue, where messages are batched to reduce Lambda invocations and cut costs significantly
  3. Document Processing: The Index Document Lambda is triggered by SQS batches. For each document, it:
    • Downloads the file from S3 via the provided S3 key in the payload
    • Splits the document into semantic chunks
    • Generates vector embeddings using Bedrock (Titan Embeddings or similar)
    • Stores chunks + embeddings + metadata in DynamoDB
  4. Question Answering: When a user asks a question via the Ask Question Lambda:
    • The question is converted to a vector embedding (same model as step 3)
    • The database is searched for the most similar chunks
    • Top N relevant chunks are retrieved as context and provided to a Bedrock LLM (in this project, Amazon’s Nova Micro model to keep costs down) to generate an answer
    • The answer is returned to the user
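
Here's a minimal sketch of step 1, the pre-signed URL flow. The bucket and key are the illustrative values used throughout this post, not the project's actual configuration:

import boto3

s3_client = boto3.client('s3')

# The Upload Handler Lambda returns a URL the client can PUT the PDF to,
# without ever receiving AWS credentials.
upload_url = s3_client.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-documents', 'Key': 'user-123/quarterly-report-2024.pdf'},
    ExpiresIn=900,  # the URL expires after 15 minutes
)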

Architecture Diagram

DynamoDB as a Vector Store

When I started building this RAG system, I had two goals: ship fast and keep costs as low as possible.

Here's the mental shift required: DynamoDB isn't a vector database, and that's fine. It doesn't have native k-NN search or approximate nearest neighbor algorithms. But it does have:

  • Single-digit millisecond latency for queries
  • Built-in user isolation through partition keys
  • Zero operational overhead (no cluster management, no indexing jobs)
  • Cost that scales to zero when not in use

For an MVP with 10-100 documents and a handful of users? DynamoDB is perfect. You're trading query performance for operational simplicity and cost savings. That's a trade worth making early on.

Architecture & Data Model

Defining the table inside the CDK is easy:

# 3. Create DynamoDB table with USER ISOLATION
self.vectors_table = dynamodb.Table(
    self, "DocumentVectorsTable",
    table_name="document-vectors",
    partition_key=dynamodb.Attribute(
        name="user_id",
        type=dynamodb.AttributeType.STRING
    ),
    sort_key=dynamodb.Attribute(
        name="document_page",
        type=dynamodb.AttributeType.STRING
    ),
    billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
    removal_policy=RemovalPolicy.DESTROY
)

# GSI for listing user's documents
self.vectors_table.add_global_secondary_index(
    index_name="user-document-index",
    partition_key=dynamodb.Attribute(
        name="user_id",
        type=dynamodb.AttributeType.STRING
    ),
    sort_key=dynamodb.Attribute(
        name="document_id",
        type=dynamodb.AttributeType.STRING
    ),
    projection_type=dynamodb.ProjectionType.KEYS_ONLY
)

The key to making DynamoDB work as a vector store is the table design:

# Partition Key: user_id
# Sort Key: document_page (format: "doc_id#page_num")
{
    'user_id': 'user-123', # user_id is derived from Cognito user ID
    'document_page': 'quarterly-report-2024#0',  # doc_id#page_num
    'document_id': 'quarterly-report-2024',
    'page_number': 0,
    'page_text': 'Executive Summary...',
    'page_vector': [0.123, -0.456, 0.789, ...],  # 1536 dimensions from Titan
    's3_bucket': 'my-documents',
    's3_key': 'user-123/quarterly-report-2024.pdf'
}


Why this design?

  1. User isolation is automatic – Querying with user_id as the partition key inside the Ask Question Lambda means User 1 can never see User 2's data. No filter logic, no mistakes.
  2. Composite sort key allows efficient queries like "get all pages of this document"
  3. Vectors as native lists – DynamoDB supports List types, no serialization needed

The GSI on user_id + document_id lets users list their documents without scanning all pages.
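
As a quick illustration of point 2, here's a minimal sketch of the query the composite sort key enables (the table name and key values are the examples from above):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('document-vectors')

# Fetch every page of one document for one user, straight off the
# composite key: no filter expression, no scan.
response = table.query(
    KeyConditionExpression=(
        Key('user_id').eq('user-123')
        & Key('document_page').begins_with('quarterly-report-2024#')
    )
)
pages = response['Items']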

PDF Document Indexing Flow

Before the Index Lambda processes incoming documents, every S3 notification about an upload to the bucket goes into the SQS queue, and the notifications are batched. This is where SQS batching is crucial: without it, you'd pay for one Lambda invocation per document. With batching, 10 documents = 1 invocation = 90% cost savings on Lambda.
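
For reference, here's a sketch of how that batching might be wired up in the CDK; the queue and Lambda attribute names are illustrative, and the exact values live in the repository:

from aws_cdk import Duration
from aws_cdk import aws_lambda_event_sources as eventsources

# Deliver S3-upload messages to the Index Document Lambda in batches of up
# to 10, waiting up to 30 seconds for a batch to fill before invoking.
self.index_document_lambda.add_event_source(
    eventsources.SqsEventSource(
        self.ingestion_queue,  # hypothetical name for the SQS queue
        batch_size=10,
        max_batching_window=Duration.seconds(30),
    )
)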

Here's the document processing code for each of the incoming PDFs, where:

  • we take each page of the incoming PDF document and check if it has text data on it
  • communicate with the Bedrock model to embed the data
  • save the embedded data into the DynamoDB table
import json
import os
from decimal import Decimal
from io import BytesIO

import boto3
import PyPDF2

s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
bedrock_runtime = boto3.client('bedrock-runtime')

def convert_float_to_decimal(values: list) -> list:
    """Helper referenced below: DynamoDB has no native float type,
    so store each float as a Decimal."""
    return [Decimal(str(v)) for v in values]

def get_embedding(text: str) -> list:
    """Generate embedding for text using Bedrock."""
    try:
        response = bedrock_runtime.invoke_model(
            modelId='amazon.titan-embed-text-v1',
            body=json.dumps({"inputText": text})
        )

        response_body = json.loads(response['body'].read())
        return response_body.get('embedding')
    except Exception as e:
        print(f"Error generating embedding: {str(e)}")
        raise

def process_pdf(bucket: str, key: str, document_id: str, user_id: str):
    """Process PDF and store page vectors in DynamoDB."""
    table_name = os.environ['DYNAMODB_TABLE_NAME']
    table = dynamodb.Table(table_name)

    # Download PDF from S3
    pdf_obj = s3_client.get_object(Bucket=bucket, Key=key)
    pdf_data = pdf_obj['Body'].read()

    # Extract text from each page
    pdf_reader = PyPDF2.PdfReader(BytesIO(pdf_data))

    items_to_write = []

    for page_num, page in enumerate(pdf_reader.pages):
        page_text = page.extract_text()

        if not page_text.strip():
            print(f"Skipping empty page {page_num}")
            continue

        # Generate embedding
        embedding = get_embedding(page_text)

        # Prepare item for DynamoDB with user_id partition key
        item = {
            'user_id': user_id,  # Partition key for user isolation
            'document_page': f"{document_id}#{page_num}",  # Sort key: doc_id#page_num
            'document_id': document_id,
            'page_number': page_num,
            'page_text': page_text,
            'page_vector': convert_float_to_decimal(embedding),
            'document_type': 'pdf',
            's3_bucket': bucket,
            's3_key': key,
        }

        items_to_write.append(item)

    # Batch write to DynamoDB
    with table.batch_writer() as batch:
        for item in items_to_write:
            batch.put_item(Item=item)

    print(f"Indexed {len(items_to_write)} pages for document {document_id} (user: {user_id})")
    return len(items_to_write)


Important: Convert floats to Decimals – DynamoDB doesn't support native floats, which is why the embedding is wrapped in convert_float_to_decimal before the write. This conversion is annoying but necessary.

Another very important detail: notice the batch_writer method? It's another great feature of the Table object of the DynamoDB resource in boto3. If a document has 10 pages and we wrote each page as soon as it was processed, storing the document would take 10 separate write requests.

By using batch_writer, which buffers items and groups up to 25 of them per BatchWriteItem call, we reduce that to ONE. A 10x improvement just by using one method.

Similarity Search: The Brute Force Approach

Here's where DynamoDB shows its limitations: to fit the table architecture, we do a full linear scan of the user's partition with a client-side similarity calculation:

import math

import boto3

dynamodb = boto3.resource('dynamodb')

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors using pure Python."""
    # Dot product
    dot_product = sum(a * b for a, b in zip(vec1, vec2))

    # Magnitude of vec1
    magnitude1 = math.sqrt(sum(a * a for a in vec1))

    # Magnitude of vec2
    magnitude2 = math.sqrt(sum(b * b for b in vec2))

    # Cosine similarity
    return dot_product / (magnitude1 * magnitude2)

def search_similar_pages(table_name: str, user_id: str, question_embedding: list, k: int = 5, document_ids: list = None):
    """
    Search for similar document pages using cosine similarity.
    ONLY searches within the specified user's data - complete isolation!

    Args:
        table_name: DynamoDB table name
        user_id: User ID for data isolation (partition key)
        question_embedding: Query vector
        k: Number of results to return
        document_ids: Optional list - search within specific documents only
    """
    table = dynamodb.Table(table_name)

    # Query only this user's data using partition key
    # This ensures User 1 can NEVER see User 2's data
    if document_ids and len(document_ids) > 0:
        # Search within specific documents for this user.
        # Build an IN filter with one placeholder per document ID: :doc0, :doc1, ...
        filter_expr = 'document_id IN (' + ', '.join(f':doc{i}' for i in range(len(document_ids))) + ')'
        expr_values = {':uid': user_id}
        expr_values.update({f':doc{i}': doc_id for i, doc_id in enumerate(document_ids)})

        response = table.query(
            KeyConditionExpression='user_id = :uid',
            FilterExpression=filter_expr,
            ExpressionAttributeValues=expr_values
        )
    else:
        # Search all documents for this user
        response = table.query(
            KeyConditionExpression='user_id = :uid',
            ExpressionAttributeValues={':uid': user_id}
        )

    items = response['Items']

    # Handle pagination (DynamoDB returns max 1MB per query)
    while 'LastEvaluatedKey' in response:
        if document_ids and len(document_ids) > 0:
            # Reuse the filter expression and values built above
            response = table.query(
                KeyConditionExpression='user_id = :uid',
                FilterExpression=filter_expr,
                ExpressionAttributeValues=expr_values,
                ExclusiveStartKey=response['LastEvaluatedKey']
            )
        else:
            response = table.query(
                KeyConditionExpression='user_id = :uid',
                ExpressionAttributeValues={':uid': user_id},
                ExclusiveStartKey=response['LastEvaluatedKey']
            )
        items.extend(response['Items'])

    print(f"Found {len(items)} pages for user {user_id}")

    # Calculate similarity for each item
    results = []
    for item in items:
        # Convert Decimal back to float for the similarity calculation
        vector = [float(x) for x in item['page_vector']]
        similarity = cosine_similarity(question_embedding, vector)

        results.append({
            'page_number': int(item['page_number']),
            'page_text': item['page_text'],
            'score': float(similarity),
            'document_id': item['document_id'],
            'user_id': user_id  # Include for debugging (but never expose other users)
        })

    # Sort by similarity (highest first) and return top k
    results.sort(key=lambda x: x['score'], reverse=True)
    return results[:k]

Let’s be honest here: this is where time complexity starts playing a part. For 100 processed pages in the table, expect fast responses (500 ms at most). At 10,000 pages, every query pulls every stored vector out of DynamoDB (roughly 10,000 × 1,536 Decimals) and scores each one in Python, and the user experience starts to take a hit because of the chosen architecture. When you hit this stage, it’s time to think about another database solution, most probably a proper vector database.

But here's the thing: at small scale, this works fine. You're fetching data at DynamoDB speed (single-digit milliseconds), doing simple math in Python, and sorting. For a simple MVP, this is a trade-off we expected and accepted.

This makes DynamoDB an excellent starting point, not a permanent solution — and that’s exactly the point.

The Complete Q&A Flow

Once we have relevant chunks from the DynamoDB table, we send them to an LLM to get an answer to the user’s question by doing the following steps:

  • build the context and prompt payloads, inserting the page information retrieved from the table and asking the LLM to provide a clear answer to the user’s question
  • build a region-aware inference profile ID to access the Amazon Nova Micro model
  • receive, parse and return the response
import json
import os

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

def generate_answer(question: str, context_pages: list) -> str:
    """Generate answer using Bedrock with retrieved context."""
    # Build context from pages
    context = "\n\n".join([
        f"[Document: {page['document_id']}, Page {page['page_number'] + 1}]\n{page['page_text']}"
        for page in context_pages
    ])

    prompt = f"""You are a helpful assistant that answers questions based on provided document context.

Context from documents:
{context}

Question: {question}

Please provide a clear, accurate answer based only on the information in the context above. If the context doesn't contain enough information to answer the question, say so.

Answer:"""

    try:
        # Construct the inference profile ID based on region
        # For us-east-1 or us-west-2, use "us" prefix
        region = os.environ.get('BEDROCK_MODEL_REGION', 'us-east-1')
        region_prefix = 'us' if region in ['us-east-1', 'us-west-2'] else 'eu'
        inference_profile_id = f"{region_prefix}.amazon.nova-micro-v1:0"

        response = bedrock_runtime.invoke_model(
            modelId=inference_profile_id,
            body=json.dumps({
                "messages": [
                    {
                        "role": "user",
                        "content": [{"text": prompt}]
                    }
                ],
                "inferenceConfig": {
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                    "top_p": 0.9
                }
            })
        )

        # Return the response text
        response_body = json.loads(response['body'].read())
        return response_body['output']['message']['content'][0]['text']

    except Exception as e:
        print(f"Error generating answer: {str(e)}")
        raise


Cost Breakdown

One of the main reasons I chose DynamoDB for this RAG pipeline was cost predictability.

The estimates below assume on-demand capacity mode, which is ideal for MVPs and early-stage products with unpredictable traffic.

Assumptions

  • Average PDF size after processing: ~3 MB (worst case)
  • Average document split into 10 chunks/pages
  • Each chunk stored as a separate DynamoDB item (text + embedding + metadata)
  • Item size: ~3–4 KB per chunk
  • Region: us-east-1
  • No backups, streams, or global tables enabled

Monthly Cost Breakdown (Example)

Component                  Usage Assumption                                 Monthly Cost
DynamoDB Storage           1,000 PDFs (~3 GB total)                         ~$0.75 (often free under the 25 GB free tier)
DynamoDB Writes            10,000 items/month (ingestion)                   ~$0.04
DynamoDB Reads             10,000 similarity queries/month (brute force)    ~$0.08
S3 Storage                 1,000 PDFs (3 GB)                                ~$0.07
Total (DynamoDB-related)                                                    ~$0.10–$1.00 / month

Even if you 10× the usage (10,000 documents or 100,000 queries/month), DynamoDB costs remain in the single-digit dollar range. At this stage, Bedrock embeddings and LLM inference will dominate your bill, not the database.
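
As a sanity check, here's roughly how the write line falls out of on-demand pricing. The per-WRU price below is an assumption (the long-standing us-east-1 list price); check the current DynamoDB price list before relying on it:

# Back-of-envelope for the DynamoDB write estimate above.
# One write request unit (WRU) covers 1 KB, so a ~3-4 KB item consumes 4 WRUs.
items_per_month = 10_000
wrus_per_item = 4                   # ceil(4 KB / 1 KB)
price_per_million_wrus = 1.25       # USD, us-east-1 on-demand (assumed, verify)

monthly_write_cost = items_per_month * wrus_per_item / 1_000_000 * price_per_million_wrus
print(f"~${monthly_write_cost:.2f}/month")  # ~$0.05, the same ballpark as the table's ~$0.04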

When This Approach Breaks (And How You’ll Know)

This DynamoDB-based RAG pipeline doesn’t fail silently. It gives you very clear warning signs when it’s time to migrate to a purpose-built vector database. Here are the warning signs:

  1. Pages per user: once a single user consistently exceeds ~1,000–2,000 stored chunks, brute-force similarity search starts to feel slow.
  2. Query latency, especially p95; when it crosses ~3 seconds, users notice and trust drops.
  3. Lambda memory usage — if your query Lambda needs to load thousands of vectors and regularly exceeds 50–60% memory just to sort results, you’re approaching a hard ceiling.
  4. DynamoDB read capacity consumption — it will start climbing linearly with corpus size; not because DynamoDB is inefficient, but because you’re intentionally scanning more data per query.

None of these are bugs — they’re the natural limits of an O(n) retrieval strategy. The key advantage is that these limits are predictable, measurable, and give you plenty of time to migrate before the system becomes unusable.
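
Since sign 2 is the one users feel first, it's worth making it impossible to miss. Here's a minimal CDK sketch, assuming an ask_question_lambda attribute on the stack, that alarms when the Lambda's p95 duration crosses the ~3-second threshold above:

from aws_cdk import Duration
from aws_cdk import aws_cloudwatch as cloudwatch

# Hypothetical guardrail: alarm when the Ask Question Lambda's p95
# duration stays above 3 seconds for three 5-minute periods in a row.
p95_duration = self.ask_question_lambda.metric_duration(
    statistic="p95",
    period=Duration.minutes(5),
)
cloudwatch.Alarm(
    self, "AskQuestionP95Latency",
    metric=p95_duration,
    threshold=3000,  # Lambda Duration is reported in milliseconds
    evaluation_periods=3,
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
)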

Conclusion

Not every RAG system needs a vector database on day one.

If you’re building a POC project or a new feature you’re not yet sure users will care about, optimizing for lowest cost and fastest iteration matters more than performance. Using DynamoDB as a vector store lets you ship a fully functional, serverless RAG pipeline with almost no fixed infrastructure cost, predictable pricing, and minimal operational overhead.

The trade-offs are real and intentional, as we’ve discussed. Similarity search is linear, so latency grows with the size of the database. Eventually, this approach will hit a ceiling. But that ceiling is high enough for early-stage systems and, more importantly, it’s visible and measurable. You’ll know when you’ve outgrown it long before it becomes a production problem.

The key takeaway is not that DynamoDB replaces vector databases. It doesn’t, and it can’t. The main point of this post is that you don’t have to pay for a vector database before you need one. Start simple, validate your idea, and then migrate to an architecture that supports your new requirements.

If you’re building RAG for the first time, this approach will get you there faster and cheaper. And when it stops being enough, you’ll be in the best possible position to move on — with a working system, real users, and real constraints guiding the next step.
