Hi! Coming back here after almost a year feels… overdue. I realised I haven’t really written anything here throughout this year, and that realisation made me feel both nostalgic and a little guilty. This year has been incredibly fast, packed & honestly quite overwhelming, all in a good way. I switched to a new company and stepped into a new role, suddenly finding myself deep in the world of AI platforms. I had to accelerate my learning curve more than ever before. Within just a few months, I delivered multiple AI and platform engineering projects.
Looking back, I’m actually grateful for the way life tossed me around, pushing me in new directions and exposing me to entirely new challenges.
So before this year ends, I want to recall some of the small glitches, personal experiments, learnings & engineering puzzles I faced on this new journey. This post is about one of my personal RAG implementations.
The Problem I Wanted to Solve
I wanted to build a simple personal knowledge engine for myself, a small RAG (Retrieval-Augmented Generation) system to search through:
- my technical notes,
- PDFs I keep collecting,
- random snippets from articles,
- AWS/Azure/GCP docs,
- personal learning logs,
- and some of my own project write-ups.
I didn’t want a fancy UI or anything.
Just an API endpoint I could ping from Postman, curl or any app I’m building.
But I had three constraints:
- It had to be cost-friendly (preferably near-free). I didn’t want ECS, EC2, SageMaker, EKS, or any constantly running infra.
- It had to be simple. No giant pipelines, no heavy orchestrators, because I was just starting out with this kind of implementation.
- It had to scale to zero. Because I don’t query my notes every second. This immediately eliminated many models and deployment choices. I needed something minimal & efficient.
The First Issue I Hit: Cost Was Exploding
My initial plan was:
- use an EC2 t3.small instance,
- run a small vector DB like Weaviate/Chroma,
- use LangChain,
- use any open-source embedding model locally.
But EC2 + storage + vector DB would have cost a few thousand rupees per month for a personal experiment. Not worth it. I shut that plan down. And that’s when I revisited AWS Lambda + Bedrock.
The Idea That Worked
Instead of running anything 24/7, I thought:
- “Why not just use Lambda for inference
- and S3 for storing vector data,
- and keep everything serverless?”
Lambda runs only when called → cost = negligible.
Bedrock provides embeddings → no need for local models.
I can dump embeddings in a simple CSV/JSON/DynamoDB row.
And use a lightweight similarity search via NumPy.
This became the foundation.
Concepts Involved & Approach (You can also refer to this blog of mine for the basics, in case any of these terms are unfamiliar to you)
- RAG (Retrieval-Augmented Generation): You store documents → break them into chunks → embed them → search by similarity → feed the top matches to the LLM.
- Vector Embeddings: Bedrock Titan Embeddings v1 gives a 1536-dimensional vector per chunk.
- Similarity Search: I used cosine similarity via NumPy. Enough for small datasets.
- AWS Lambda: My entire RAG pipeline runs inside one Lambda function.
- Serverless Cost Optimization:
  - cold starts are negligible for Python
  - no servers running 24/7
  - you only pay for the Bedrock API calls
- No servers.
- No clusters.
- No databases.
- Just Lambda + S3 + Bedrock.
How I Built It (Step-by-Step)
Step 1: Prepare documents
I uploaded a few markdown and text files into a folder locally:
notes/
├── docker_basics.txt
├── k8s_primitives.md
├── llm_security.md
└── azure_openai_tips.md
Then I chunked them using Python:
def chunk_text(text, chunk_size=500, overlap=50):
    # Slide a 500-character window with 50 characters of overlap,
    # so text cut at a chunk boundary still shows up in the next chunk
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i+chunk_size]
        chunks.append(chunk)
    return chunks
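To build the corpus, I just walked the notes folder and chunked every file. A rough sketch (the paths match the folder above; the id format is only illustrative):

from pathlib import Path

all_chunks = []
for path in sorted(Path("notes").glob("*")):
    text = path.read_text(encoding="utf-8")
    for idx, piece in enumerate(chunk_text(text)):
        # e.g. {"id": "docker_basics_00", "text": "Docker is ..."}
        all_chunks.append({"id": f"{path.stem}_{idx:02d}", "text": piece})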
Step 2: Generate embeddings using Bedrock

import json
import boto3

client = boto3.client("bedrock-runtime")

def embed(text):
    # Titan Embeddings v1 returns a 1536-dimensional vector per input
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
Stored embeddings in JSON:
{
  "id": "docker_01",
  "text": "Docker is a containerization technology...",
  "vector": [0.12, 0.08, ...]
}
Uploaded to S3 as rag_store.json.
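To get rag_store.json into place, I embedded every chunk and pushed the whole list to S3 in one go. A minimal sketch, assuming the all_chunks list from Step 1 and a placeholder bucket name (my-rag-bucket):

import json
import boto3

s3 = boto3.client("s3")

store = [
    {"id": c["id"], "text": c["text"], "vector": embed(c["text"])}
    for c in all_chunks
]

# One small JSON file is enough for a personal corpus; no vector DB needed
s3.put_object(
    Bucket="my-rag-bucket",
    Key="rag_store.json",
    Body=json.dumps(store),
)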
Step 3: Create the Lambda Function
My Lambda contains:
- load JSON from S3 (sketched right after this list)
- compute cosine similarity
- select top 3 chunks
- call a Bedrock LLM (I used Claude 3 Haiku)
- return final answer
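Loading the store is the cheapest part: the whole JSON file fits in memory, so the function just pulls it from S3 on each invocation. A sketch, using the same placeholder bucket name as before:

import json
import boto3

s3 = boto3.client("s3")

def load_store(bucket="my-rag-bucket", key="rag_store.json"):
    # Read the whole vector store into memory; fine for a small personal corpus
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())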
Cosine Similarity:
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    # 1.0 means the vectors point in the same direction, 0 means unrelated
    return dot(a, b) / (norm(a) * norm(b))
Similarity ranking:
def retrieve(query_vec, store):
    # Score every chunk against the query and keep the 3 best matches
    scores = []
    for item in store:
        score = cosine_sim(query_vec, item["vector"])
        scores.append((score, item["text"]))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scores[:3]]
Step 4: Bedrock Generation
Claude 3 on Bedrock uses the Messages API, so the request body and response look a little different from the Titan call above:

def generate_answer(context, query):
    prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer:"
    response = client.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
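Wiring it all together, the Lambda handler itself stays tiny. A sketch of how the pieces above fit, assuming an API Gateway proxy integration that passes the JSON body straight through:

def lambda_handler(event, context):
    # API Gateway (proxy integration) delivers the POST body as a JSON string
    body = json.loads(event.get("body") or "{}")
    query = body["query"]

    store = load_store()                     # rag_store.json from S3
    query_vec = embed(query)                 # Titan embedding for the query
    top_chunks = retrieve(query_vec, store)  # best 3 chunks by cosine similarity
    answer = generate_answer("\n\n".join(top_chunks), query)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"answer": answer}),
    }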
Step 5: Deploy and Test
curl -X POST \
-H "Content-Type: application/json" \
-d '{"query": "explain docker networking"}' \
$API_Gateway_URL
Results & Cost
- Lambda invocations → FREE (within the free-tier limits)
- S3 storage → negligible
- Bedrock embeddings + text generation → again negligible for my usage
A fully functional RAG system at near-zero cost that scales up on demand and back down to zero when idle, because it’s serverless.
Final Thoughts
I built this as one of my first personal RAG experiments, not a production pipeline, but it turned out surprisingly usable, scalable & affordable. And more importantly: I actually learned something while doing it.
As an AI Platform Engineer, I’ve built bigger pipelines during the year… but this small project reminded me why I love this field: being able to experiment, break things, fix things & create something meaningful with very little infra.
Coming back to blogging like this feels refreshing, like reconnecting with an old part of myself.
More stories coming soon.
Before this year ends, I want to share all the little puzzles, fixes & insights from this intense learning journey.
Thanks for reading.
Mahak
