A quick note: This article was originally published on Medium.
Last month, I wrote an article on how I built a self-hosted AI Meeting Note Taker. The response was fantastic and showed a clear demand for private, offline AI tools that put users back in control of their data.
Whether you’re a small business uploading sensitive client data, a developer working on proprietary code, or just someone who wants to organize personal files without sending them to a third-party server, the problem is the same. You’re forced to choose between powerful AI tools and data privacy.
Building on that momentum, I decided to create the solution I wanted to see: a full-featured, multi-user “Chat with Your Documents” application that runs entirely on your own hardware. This is the second part of my self-hosted AI series — a technical deep-dive into a production-ready RAG application for everyone.
The Mission: A Private AI for Everyone
The core technical goals for this project were:
- 100% Local Processing: All data, from your company’s strategy documents to your personal financial records, is processed locally. Nothing ever leaves your server.
- Team & Family Ready: The system is designed for multiple users, with isolated knowledge bases and a central admin dashboard to manage everything.
- Efficient & Smart: The RAG pipeline intelligently syncs only new or modified documents, saving time and computational resources.
- Incredibly Simple to Deploy: The entire application is packaged into a single executable with easy-to-use installer scripts. No complex setup required.
The Tech Stack: The Powerhouses
This application is built on a foundation of powerful open-source tools:
- Backend: Flask serves as the lightweight web server, with Gunicorn for threaded performance to handle concurrent users.
- AI & Embeddings: Ollama runs open-source LLMs (like Llama 3, Gemma) and embedding models locally.
- Vector Store: ChromaDB provides a persistent, on-disk vector database for storing document embeddings efficiently.
- RAG Orchestration: LangChain glues the components together, managing the flow from document loading to question-answering.
- User Management: A custom authentication layer using SQLite for persistence and PyJWT for secure sessions.
Hardware Requirements & Performance
A common question with self-hosted AI is about the hardware required. The application is designed to be flexible, and performance will scale with your machine’s capabilities. The main requirement is sufficient RAM to load the language models.
To give you a real-world idea, here are a few setups I’ve tested:
- Good (Accessible Start): For users with more modest hardware (e.g., a laptop with 8–16GB of RAM), the application runs well with a smaller, efficient model like gemma3:4b. The results are surprisingly good for most document Q&A tasks, making this a great starting point.
- Better (Smooth Personal Use): On my MacBook with 24GB of RAM, it handles a 12-billion-parameter model (e.g., gemma3:12b) very smoothly for all my personal documents and projects. This offers a noticeable boost in the quality of the AI’s responses.
- Best (Team Performance): At my office, we run it on a server with an NVIDIA A6000 GPU. Our whole team of six uses it with the much larger gemma3:27b model without any issues.
This demonstrates that the application can effectively scale from a standard laptop for individual use to a dedicated server for team collaboration, depending on your needs.
Deep Dive: The System Architecture
The application is broken down into three main Python components: the web server (main_app.py), the authentication system (auth.py), and the RAG core (local_rag_chroma.py).
1. The RAG Core: local_rag_chroma.py
This is the heart of the application. I designed a KnowledgeBaseManager class to handle the entire lifecycle of document processing and retrieval.
Intelligent Document Synchronization
To avoid redundant processing, I implemented a synchronization function that compares the state of files on disk with the metadata stored in ChromaDB.
The _synchronize_documents method works in a few steps:
- It scans the document directories and creates a dictionary of all current files and their last-modified timestamps.
- It queries the ChromaDB collection to get a list of already-indexed files and their modification times stored in the metadata.
- By comparing these two lists, it identifies:
- New files to be added.
- Modified files that need to be deleted and re-indexed.
- Deleted files whose chunks must be removed from the database.
This ensures that only necessary changes are processed, making startup and resyncing incredibly fast.
Document Loading and Splitting
The system uses LangChain’s document loaders to handle various file types (.pdf, .docx, .md, .txt, .csv, etc.). Each document is then split into manageable chunks using the RecursiveCharacterTextSplitter.
# A snippet from _load_and_split_document in local_rag_chroma.py
def _load_and_split_document(self, file_path):
    # ... logic to select the correct loader based on file extension ...
    loader = PyPDFLoader(file_path)  # Example for PDF
    documents = loader.load()

    # Add file metadata for synchronization
    file_mod_time = os.path.getmtime(file_path)
    for doc in documents:
        doc.metadata['source'] = file_path
        doc.metadata['last_modified'] = file_mod_time

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP
    )
    split_docs = text_splitter.split_documents(documents)
    return split_docs
Embedding and Storage
The chunks are then passed to a locally running embedding model via OllamaEmbeddings. Each resulting vector is stored in a persistent ChromaDB collection, creating a searchable index for that specific knowledge base. The application creates separate, isolated collections for each team's knowledge base and each user's personal documents, ensuring data segregation.
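The segregation comes down to deterministic collection naming: a request for a given knowledge base always resolves to a collection that no other user or team can reach. A minimal sketch of such a scheme (the actual naming convention in the application may differ; this just illustrates the isolation idea):

```python
def collection_name_for(user_id=None, team_id=None):
    """Derive an isolated ChromaDB collection name per knowledge base.

    Team documents and personal documents never share a collection,
    so a retriever built for one knowledge base cannot surface
    another user's chunks.
    """
    if team_id is not None:
        return f"kb_team_{team_id}"
    if user_id is not None:
        return f"kb_user_{user_id}_personal"
    raise ValueError("a user_id or team_id is required")
```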
2. The Multi-User System: auth.py
A robust application needs a solid authentication and authorization layer. I built one from scratch using a simple SQLite database to store user and session information.
Database and Authentication
The DatabaseManager class sets up tables for users, sessions, and analytics. User passwords are never stored in plain text; instead, I use werkzeug.security to store salted and hashed passwords.
When a user logs in, the system verifies their credentials and generates a JSON Web Token (JWT) that is used to authenticate subsequent API requests.
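PyJWT does the heavy lifting in the application, but it helps to see what an HS256 token actually is: two base64url-encoded JSON segments (header and payload) plus an HMAC-SHA256 signature over them. A standard-library sketch of the issuing step (function names are mine, not the app's):

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT spec requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(username: str, role: str, secret: str, ttl_seconds: int = 3600) -> str:
    """Build a minimal HS256 JWT: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "role": role,
               "exp": int(time.time()) + ttl_seconds}
    signing_input = (b64url(json.dumps(header, separators=(",", ":")).encode())
                     + "."
                     + b64url(json.dumps(payload, separators=(",", ":")).encode()))
    signature = hmac.new(secret.encode(), signing_input.encode(),
                         hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)
```

On each subsequent request, the server recomputes the signature with its secret and rejects the token if it doesn't match or if the exp claim has passed, which is exactly what PyJWT's decode step verifies for you.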
Securing Endpoints with Decorators
To protect the API, I implemented custom decorators. The @require_auth decorator checks for a valid session or JWT, while @require_role('admin') restricts access to admin-only endpoints, like the analytics dashboard.
# A snippet from auth.py
def require_role(required_role):
    """Decorator to require a specific user role."""
    def decorator(f):
        @wraps(f)
        @require_auth
        def decorated_function(*args, **kwargs):
            user_role = request.current_user.get('role', 'user')
            # ... logic to check if user has sufficient privileges ...
            return f(*args, **kwargs)
        return decorated_function
    return decorator
This design makes it easy to secure new API routes as the application grows.
3. The Web Server: main_app.py
The Flask application ties everything together. It defines the API endpoints for the frontend to interact with. The most important one is /api/ask.
# A simplified view of the /api/ask endpoint in main_app.py
@app.route('/api/ask', methods=['POST'])
@require_auth
def ask_question_api():
    session_id = get_or_create_session_id()
    data = request.get_json()
    question = data['question']
    kb_id = data.get('knowledge_base_id', 'user_personal')

    # Get or create a RAG instance for this session and knowledge base
    rag_chain, retriever, memory = get_or_create_rag_instance(session_id, kb_id)
    if not rag_chain:
        return jsonify({"error": "RAG service not ready."}), 503

    # Invoke the RAG chain and return the response
    response = rag_chain.invoke({"query": question})
    # ... format and return the answer and sources ...
The application maintains a thread-safe dictionary (rag_instances) to manage separate RAG chains and conversation histories for each user session and selected knowledge base, preventing memory leaks and ensuring that conversations are isolated.
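The get-or-create pattern behind that dictionary can be sketched with a lock around the lookup, so two concurrent requests for the same session never build the chain twice. This is an illustrative sketch, not the app's actual class; the real version also constructs the LangChain objects in the factory and evicts entries when sessions expire:

```python
import threading

class RagInstanceCache:
    """Thread-safe cache keyed by (session_id, knowledge_base_id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._instances = {}

    def get_or_create(self, session_id, kb_id, factory):
        """Return the cached instance for this key, building it once if needed."""
        key = (session_id, kb_id)
        with self._lock:
            if key not in self._instances:
                self._instances[key] = factory()
            return self._instances[key]
```

Holding one lock for the whole lookup is the simplest correct choice here; per-key locks only pay off if building a chain is slow enough to block unrelated sessions.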
The Result: Your Private AI Knowledge Base for Work and Life
The final result is a powerful, production-ready application that turns any collection of documents into an interactive AI assistant. It’s the perfect tool for:
- Small Businesses wanting a secure, internal knowledge base for their team without the high cost and privacy risks of SaaS products.
- Developers & Freelancers who need a powerful, private RAG system for their own documents or as a foundation for client projects.
- Personal Users & Families who want to securely organize and ask questions about their private files, from financial records to research papers.
It’s the best of both worlds: the power of modern AI without sacrificing control over your data.
Want to Run It Yourself?
This project takes the core principles of local processing from the AI Meeting Note Taker and expands them into a robust, multi-user platform ready for any use case.
I’ve packaged the entire source code and made it available for a one-time purchase. It’s a fantastic way to get a powerful, private AI tool up and running in minutes and serves as a solid foundation for your own customizations.
For a limited time, use the code MEDIUM25 for a 25% discount.
You can get the complete RAG Application project by clicking here.
Thanks for reading! I hope this deep-dive inspires you to take control of your data and explore the incredible potential of self-hosted AI.