As a full-stack developer juggling multiple projects, I find context switching to be my biggest productivity killer. I use AI tools daily, but they have two major flaws for a professional workflow:
They Forget: Start a new chat, and the context is gone. They don't remember the bug I debugged yesterday or the specific architectural constraints of my current project.
Privacy Anxiety: There are times I want to paste sensitive client logic or proprietary snippets, but sending that data to a cloud API feels risky.
I realized I didn't just need a chatbot; I needed a persistent, private "Second Brain" that lived on my machine and knew my work history.
Instead of waiting for a product to solve this, I decided to engineer my own solution using the stack I know best.
The High-Level Architecture
The goal was to build a system that runs 100% locally—no internet connection required for inference, no data leaving my laptop.
Here is the system design I came up with: the React frontend talks to a Node/Express API, which pulls relevant context out of ChromaDB and then generates the response locally with Llama 3 via Ollama.
The Stack Breakdown
Here is why I chose these specific tools for the job:
Frontend: React (Vite)
Why: I needed a snappy, familiar chat interface. React’s component-based architecture makes it easy to manage chat history state and streaming responses.
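To make that less abstract, here's a rough sketch of the streaming piece on the React side. It assumes the backend exposes a `/api/chat` endpoint that streams plain text (that route is sketched in the backend section below); treat it as a starting point rather than final code.

```jsx
// Minimal chat component sketch: send a prompt, stream the reply into state.
import { useState } from "react";

export function Chat() {
  const [messages, setMessages] = useState([]);

  async function send(prompt) {
    // Show the user message immediately, plus an empty assistant bubble to fill in.
    setMessages((prev) => [
      ...prev,
      { role: "user", text: prompt },
      { role: "assistant", text: "" },
    ]);

    const res = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });

    // Read the streamed body chunk by chunk and append to the last message.
    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      const chunk = decoder.decode(value, { stream: true });
      setMessages((prev) => {
        const next = [...prev];
        const last = next[next.length - 1];
        next[next.length - 1] = { ...last, text: last.text + chunk };
        return next;
      });
    }
  }

  return (
    <div>
      {messages.map((m, i) => (
        <p key={i}>
          <strong>{m.role}:</strong> {m.text}
        </p>
      ))}
      <button onClick={() => send("What did I note about the auth bug yesterday?")}>
        Ask
      </button>
    </div>
  );
}
```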
Backend: Node.js / Express
Why: It’s the glue. Node acts as the orchestration layer, handling API requests from the frontend, managing file uploads for memory, and communicating asynchronously with the AI engine.
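Here is a minimal sketch of that orchestration route. It assumes Ollama's REST API on its default port (11434) and a `retrieveContext()` helper backed by ChromaDB (sketched in the memory section below), so take it as the shape of the thing, not the finished server.

```js
// server.js — sketch of the orchestration layer.
// Assumes Ollama is running locally on its default port and that retrieveContext()
// (see the ChromaDB sketch below) returns a string of relevant notes.
import express from "express";
import { retrieveContext } from "./memory.js";

const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  const { prompt } = req.body;

  // 1. Pull relevant notes/code out of the vector store.
  const context = await retrieveContext(prompt);

  // 2. Ask the local model, streaming the answer back to the browser as plain text.
  const ollamaRes = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      stream: true,
      messages: [
        { role: "system", content: `Answer using this context from my notes:\n${context}` },
        { role: "user", content: prompt },
      ],
    }),
  });

  // Ollama streams newline-delimited JSON; forward only the text content.
  let buffered = "";
  for await (const chunk of ollamaRes.body) {
    buffered += Buffer.from(chunk).toString();
    const lines = buffered.split("\n");
    buffered = lines.pop(); // keep a partial line for the next chunk
    for (const line of lines.filter(Boolean)) {
      const data = JSON.parse(line);
      if (data.message?.content) res.write(data.message.content);
    }
  }
  res.end();
});

app.listen(3001, () => console.log("Orchestrator listening on :3001"));
```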
The Brain: Ollama running Llama 3 (8B)
Why: Ollama is hands-down the easiest way to run local models. I chose Llama 3 8B because it hits the sweet spot for my hardware—it's fast enough for real-time chat but smart enough to follow complex instructions.
The Memory (RAG): ChromaDB (running locally)
Why: This is the core of the "Second Brain." I needed a Vector Database to store embeddings of my notes and code. I chose ChromaDB because it's open-source, easy to run locally via Docker, and integrates well with JavaScript ecosystems.
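Here's a sketch of what that memory layer looks like with the ChromaDB JavaScript client, using Ollama to generate embeddings so nothing leaves the machine. The collection name, the `nomic-embed-text` embedding model, and the helper names are just the choices I'm experimenting with.

```js
// memory.js — sketch of the "Second Brain" memory layer.
// Assumes a local ChromaDB server (default port 8000) and Ollama serving an
// embedding model; we pass our own embeddings, so Chroma never calls out anywhere.
import { ChromaClient } from "chromadb";

const chroma = new ChromaClient(); // defaults to the local server

async function embed(text) {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json();
  return embedding;
}

// Store a chunk of notes/code under a stable id.
export async function remember(id, text) {
  const collection = await chroma.getOrCreateCollection({ name: "second-brain" });
  await collection.add({
    ids: [id],
    documents: [text],
    embeddings: [await embed(text)],
  });
}

// Fetch the most similar chunks and join them into a context block for the LLM.
export async function retrieveContext(query, nResults = 5) {
  const collection = await chroma.getOrCreateCollection({ name: "second-brain" });
  const results = await collection.query({
    queryEmbeddings: [await embed(query)],
    nResults,
  });
  // results.documents holds one list of matches per query embedding.
  return results.documents[0].join("\n---\n");
}
```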
The Challenges: It's Not Magic
Any senior developer knows that the "happy path" is only 20% of the work. The biggest challenge wasn't getting the components to talk to each other; it was Retrieval Accuracy.
Initially, the RAG pipeline was "dumb." It would fetch documents based on simple keyword matching, confusing the LLM with irrelevant context.
The fix (currently in progress): I'm experimenting with smaller, more semantic chunk sizes and looking into a "re-ranking" step: retrieve 20 documents, have a smaller, faster model sort them by relevance, and send only the top 5 to Llama 3. Early results already show a noticeable improvement in answer quality.
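Here's the rough shape of that re-ranking step. The scoring prompt and the choice of `phi3` as the smaller model are assumptions I'm still testing, not settled decisions.

```js
// rerank.js — sketch of the re-ranking experiment: over-fetch candidates, have a
// small local model score each one, and keep only the best few for Llama 3.
async function scoreRelevance(query, chunk) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "phi3", // smaller and faster than llama3; good enough to judge relevance
      stream: false,
      prompt: `On a scale of 0 to 10, how relevant is this text to the question "${query}"?\n\nText:\n${chunk}\n\nAnswer with a single number.`,
    }),
  });
  const { response } = await res.json();
  const score = parseInt(response, 10);
  return Number.isNaN(score) ? 0 : score;
}

export async function rerank(query, chunks, keep = 5) {
  // Score every candidate, then keep the top `keep` for the final prompt.
  const scored = await Promise.all(
    chunks.map(async (chunk) => ({ chunk, score: await scoreRelevance(query, chunk) }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map((s) => s.chunk);
}
```

In the pipeline this would sit between the ChromaDB query (fetching roughly 20 candidates) and the final Llama 3 prompt, e.g. `rerank(query, candidateChunks, 5)`.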
Conclusion
This project is still very much a work in progress. It’s messy, but it’s mine, and most importantly, it’s private.
It has forced me to dive deep into the mechanics of Vector Databases and local inference, skills that are becoming essential for modern backend engineering.
If you’re interested in seeing the final polished version or following my journey as I build this out in public:
Follow me on Twitter/X: 👉 https://www.somanathkhadanga.com/
Check out my other projects: 👉 https://www.somanathkhadanga.com/
