Hi! Coming back here after almost a year feels… overdue. I realised I haven’t really written anything here throughout this year, and that realisation made me feel both nostalgic and a little guilty. This year has been incredibly fast, packed & honestly quite overwhelming, all in a good way. I switched to a new company and stepped into a new role, suddenly finding myself deep in the world of AI platforms. I had to accelerate my learning curve more than ever before. Within just a few months, I delivered multiple AI and platform engineering projects.
Looking back, I’m actually grateful for the way life tossed me around, pushing me in new directions and exposing me to entirely new challenges.
So before this year ends, I want to recall some of the small glitches, personal experiments, learnings & engineering puzzles I faced on this new journey. This post is about one of my personal RAG implementations.
The Problem I Wanted to Solve
I wanted to build a simple personal knowledge engine for myself, a small RAG (Retrieval-Augmented Generation) system to search through:
- my technical notes,
- PDFs I keep collecting,
- random snippets from articles,
- AWS/Azure/GCP docs,
- personal learning logs,
- and some of my own project write-ups.
I didn’t want a fancy UI or anything.
Just an API endpoint I could ping from Postman, curl or any app I’m building.
But I had three constraints:
- It had to be cost-friendly (preferably near-free). I didn’t want ECS, EC2, SageMaker, EKS, or any constantly running infra.
- It had to be simple. No giant pipelines, no heavy orchestrators, because I was just starting out with this kind of implementation.
- It had to scale to zero. Because I don’t query my notes every second. This immediately eliminated many models and deployment choices. I needed something minimal & efficient.
The First Issue I Hit: Cost Was Exploding
My initial plan was:
- use an EC2 t3.small instance,
- run a small vector DB like Weaviate/Chroma,
- use LangChain,
- use any open-source embedding model locally.
But EC2 + storage + vector DB would have cost a few thousand rupees per month for a personal experiment. Not worth it. I shut that plan down. And that’s when I revisited AWS Lambda + Bedrock.
The Idea That Worked
Instead of running anything 24/7, I thought:
- “Why not just use Lambda for inference
- and S3 for storing vector data,
- and keep everything serverless?”
Lambda runs only when called → cost = negligible.
Bedrock provides embeddings → no need for local models.
I can dump embeddings in a simple CSV/JSON/DynamoDB row.
And use a lightweight similarity search via NumPy.
This became the foundation.
Concepts Involved & Approach (You can also refer to this blog of mine for the basics, in case any of these terms are unfamiliar to you)
- RAG (Retrieval-Augmented Generation): You store documents → break them into chunks → embed them → search by similarity → feed the top matches to the LLM.
- Vector Embeddings: Bedrock Titan Embeddings v1 gives a 1536-dimensional vector per chunk.
- Similarity Search: I used cosine similarity via NumPy. Enough for small datasets.
- AWS Lambda: My entire RAG pipeline runs inside one Lambda function.
- Serverless Cost Optimization:
  - cold starts are negligible for Python
  - no servers running 24/7
  - you only pay for the Bedrock API calls
- No servers.
- No clusters.
- No databases.
- Just Lambda + S3 + Bedrock.
How I Built It (Step-by-Step)
Step 1: Prepare documents
I uploaded a few markdown and text files into a folder locally:
notes/
├── docker_basics.txt
├── k8s_primitives.md
├── llm_security.md
└── azure_openai_tips.md
Then I chunked them using Python:
def chunk_text(text, chunk_size=500, overlap=50):
    # Slide a 500-character window with 50 characters of overlap,
    # so text cut at a chunk boundary still shows up in the next chunk
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i+chunk_size]
        chunks.append(chunk)
    return chunks
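To build the corpus, I just walked the notes folder and chunked every file. A rough sketch (the paths match the folder above; the id format is only illustrative):

from pathlib import Path

all_chunks = []
for path in sorted(Path("notes").glob("*")):
    text = path.read_text(encoding="utf-8")
    for idx, piece in enumerate(chunk_text(text)):
        # e.g. {"id": "docker_basics_00", "text": "Docker is ..."}
        all_chunks.append({"id": f"{path.stem}_{idx:02d}", "text": piece})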
Step 2: Generate embeddings using Bedrock

import json
import boto3

client = boto3.client("bedrock-runtime")

def embed(text):
    # Titan Embeddings v1 returns a 1536-dimensional vector per input
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
Stored embeddings in JSON:
{
  "id": "docker_01",
  "text": "Docker is a containerization technology...",
  "vector": [0.12, 0.08, ...]
}
Uploaded to S3 as rag_store.json.
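To get rag_store.json into place, I embedded every chunk and pushed the whole list to S3 in one go. A minimal sketch, assuming the all_chunks list from Step 1 and a placeholder bucket name (my-rag-bucket):

import json
import boto3

s3 = boto3.client("s3")

store = [
    {"id": c["id"], "text": c["text"], "vector": embed(c["text"])}
    for c in all_chunks
]

# One small JSON file is enough for a personal corpus; no vector DB needed
s3.put_object(
    Bucket="my-rag-bucket",
    Key="rag_store.json",
    Body=json.dumps(store),
)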
Step 3: Create the Lambda Function
My Lambda contains:
- load JSON from S3 (sketched right after this list)
- compute cosine similarity
- select top 3 chunks
- call a Bedrock LLM (I used Claude 3 Haiku)
- return final answer
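Loading the store is the cheapest part: the whole JSON file fits in memory, so the function just pulls it from S3 on each invocation. A sketch, using the same placeholder bucket name as before:

import json
import boto3

s3 = boto3.client("s3")

def load_store(bucket="my-rag-bucket", key="rag_store.json"):
    # Read the whole vector store into memory; fine for a small personal corpus
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())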
Cosine Similarity:
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    # 1.0 means the vectors point in the same direction, 0 means unrelated
    return dot(a, b) / (norm(a) * norm(b))
Similarity ranking:
def retrieve(query_vec, store):
    # Score every chunk against the query and keep the 3 best matches
    scores = []
    for item in store:
        score = cosine_sim(query_vec, item["vector"])
        scores.append((score, item["text"]))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scores[:3]]
Step 4: Bedrock Generation
Claude 3 on Bedrock uses the Messages API, so the request body and response look a little different from the Titan call above:

def generate_answer(context, query):
    prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer:"
    response = client.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
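Wiring it all together, the Lambda handler itself stays tiny. A sketch of how the pieces above fit, assuming an API Gateway proxy integration that passes the JSON body straight through:

def lambda_handler(event, context):
    # API Gateway (proxy integration) delivers the POST body as a JSON string
    body = json.loads(event.get("body") or "{}")
    query = body["query"]

    store = load_store()                     # rag_store.json from S3
    query_vec = embed(query)                 # Titan embedding for the query
    top_chunks = retrieve(query_vec, store)  # best 3 chunks by cosine similarity
    answer = generate_answer("\n\n".join(top_chunks), query)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"answer": answer}),
    }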
Step 5: Deploy and Test
curl -X POST \
-H "Content-Type: application/json" \
-d '{"query": "explain docker networking"}' \
$API_Gateway_URL
Results & Cost
- Lambda invocations → FREE (within the free-tier limits)
- S3 storage → negligible
- Bedrock embeddings + text generation → again negligible for my usage
A fully functional RAG system at near-zero cost that scales up on demand and back down to zero when idle, because it’s serverless.
Final Thoughts
I built this as one of my first personal RAG experiments, not a production pipeline, but it turned out surprisingly usable, scalable & affordable. And more importantly: I actually learned something while doing it.
As an AI Platform Engineer, I’ve built bigger pipelines during the year… but this small project reminded me why I love this field: being able to experiment, break things, fix things & create something meaningful with very little infra.
Coming back to blogging like this feels refreshing, like reconnecting with an old part of myself.
More stories coming soon.
Before this year ends, I want to share all the little puzzles, fixes & insights from this intense learning journey.
Thanks for reading.
Mahak
