Muhammad Monjurul Karim

How I Built a Private, Multi-User “Chat with Your Documents” App That Runs 100% Offline

A quick note: This article was originally published on Medium. You can read it there by clicking here.

Last month, I wrote an article on how I built a self-hosted AI Meeting Note Taker. The response was fantastic and showed a clear demand for private, offline AI tools that put users back in control of their data.

Whether you’re a small business uploading sensitive client data, a developer working on proprietary code, or just someone who wants to organize personal files without sending them to a third-party server, the problem is the same. You’re forced to choose between powerful AI tools and data privacy.

Building on that momentum, I decided to create the solution I wanted to see: a full-featured, multi-user “Chat with Your Documents” application that runs entirely on your own hardware. This is the second part of my self-hosted AI series — a technical deep-dive into a production-ready RAG application for everyone.

The Mission: A Private AI for Everyone

The core technical goals for this project were:

  1. 100% Local Processing: All data, from your company’s strategy documents to your personal financial records, is processed locally. Nothing ever leaves your server.
  2. Team & Family Ready: The system is designed for multiple users, with isolated knowledge bases and a central admin dashboard to manage everything.
  3. Efficient & Smart: The RAG pipeline intelligently syncs only new or modified documents, saving time and computational resources.
  4. Incredibly Simple to Deploy: The entire application is packaged into a single executable with easy-to-use installer scripts. No complex setup required.

The Tech Stack: The Powerhouses

This application is built on a foundation of powerful open-source tools:

  • Backend: Flask serves as the lightweight web framework, with Gunicorn providing threaded workers to handle concurrent users.
  • AI & Embeddings: Ollama runs open-source LLMs (like Llama 3, Gemma) and embedding models locally.
  • Vector Store: ChromaDB provides a persistent, on-disk vector database for storing document embeddings efficiently.
  • RAG Orchestration: LangChain glues the components together, managing the flow from document loading to question-answering.
  • User Management: A custom authentication layer using SQLite for persistence and PyJWT for secure sessions.

Hardware Requirements & Performance

A common question with self-hosted AI is about the hardware required. The application is designed to be flexible, and performance will scale with your machine’s capabilities. The main requirement is sufficient RAM to load the language models.

To give you a real-world idea, here are a few setups I’ve tested:

  • Good (Accessible Start): For users with more modest hardware (e.g., a laptop with 8–16GB of RAM), the application runs well with a smaller, efficient model like gemma3:4b. The results are surprisingly good for most document Q&A tasks, making this a great starting point.
  • Better (Smooth Personal Use): On my MacBook with 24GB of RAM, it handles a 12-billion-parameter model (e.g., gemma3:12b) very smoothly for all my personal documents and projects. This offers a noticeable boost in the quality of the AI’s responses.
  • Best (Team Performance): At my office, we run it on a server with an NVIDIA A6000 GPU, where our whole team of six uses the much larger gemma3:27b model without any issues.

This demonstrates that the application can effectively scale from a standard laptop for individual use to a dedicated server for team collaboration, depending on your needs.

Deep Dive: The System Architecture

The application is broken down into three main Python components: the web server (main_app.py), the authentication system (auth.py), and the RAG core (local_rag_chroma.py).

Secure login and user admin dashboard
The application includes a secure login portal and a dashboard for user administration. These features allow for multi-user support, making the app suitable for shared use by a project team, a small business, or a family, with each user having their own secure access. (image by author)

1. The RAG Core: local_rag_chroma.py

This is the heart of the application. I designed a KnowledgeBaseManager class to handle the entire lifecycle of document processing and retrieval.

Intelligent Document Synchronization

To avoid redundant processing, I implemented a synchronization function that compares the state of files on disk with the metadata stored in ChromaDB.

The _synchronize_documents method works in a few steps:

  1. It scans the document directories and creates a dictionary of all current files and their last-modified timestamps.
  2. It queries the ChromaDB collection to get a list of already-indexed files and their modification times stored in the metadata.
  3. By comparing these two lists, it identifies:
  • New files to be added.
  • Modified files that need to be deleted and re-indexed.
  • Deleted files whose chunks must be removed from the database.

This ensures that only necessary changes are processed, making startup and resyncing incredibly fast.
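
To make the flow concrete, here is a minimal sketch of that diff-style sync. The helper names (_scan_document_directories, _index_document) are illustrative placeholders rather than the exact methods from local_rag_chroma.py:

# A sketch of the synchronization logic (helper names are placeholders)
import os

def _synchronize_documents(self):
    # 1. Current state on disk: {path: last-modified timestamp}
    disk_files = {
        path: os.path.getmtime(path)
        for path in self._scan_document_directories()
    }

    # 2. Indexed state in ChromaDB, read from chunk metadata
    indexed = self.collection.get(include=["metadatas"])
    db_files = {
        meta["source"]: meta["last_modified"]
        for meta in indexed["metadatas"]
    }

    # 3. Diff the two views
    new_files = [p for p in disk_files if p not in db_files]
    modified = [p for p in disk_files
                if p in db_files and disk_files[p] > db_files[p]]
    deleted = [p for p in db_files if p not in disk_files]

    # Drop stale chunks first, then (re-)index new and modified files
    for path in modified + deleted:
        self.collection.delete(where={"source": path})
    for path in new_files + modified:
        self._index_document(path)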

Document Loading and Splitting

The system uses LangChain’s document loaders to handle various file types (.pdf, .docx, .md, .txt, .csv, etc.). Each document is then split into manageable chunks using the RecursiveCharacterTextSplitter.

# A snippet from _load_and_split_document in local_rag_chroma.py
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def _load_and_split_document(self, file_path):
    # ... logic to select the correct loader based on file extension ...
    loader = PyPDFLoader(file_path)  # Example for PDF
    documents = loader.load()

    # Add file metadata for synchronization
    file_mod_time = os.path.getmtime(file_path)
    for doc in documents:
        doc.metadata['source'] = file_path
        doc.metadata['last_modified'] = file_mod_time

    # Create the splitter once, outside the loop
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP
    )
    split_docs = text_splitter.split_documents(documents)
    return split_docs
Embedding and Storage

The chunks are then passed to a locally running embedding model via OllamaEmbeddings. Each resulting vector is stored in a persistent ChromaDB collection, creating a searchable index for that specific knowledge base. The application creates separate, isolated collections for each team's knowledge base and each user's personal documents, ensuring data segregation.
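
In code, that wiring looks roughly like this (a sketch: the import paths depend on your LangChain version, and the collection name and embedding model here are assumptions, not the app’s exact configuration):

# A sketch of the embedding/storage step (collection name and embedding
# model are illustrative choices)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# One persistent, isolated collection per knowledge base
vector_store = Chroma(
    collection_name="kb_team_engineering",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
vector_store.add_documents(split_docs)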

User Interface
Here, the RAG system is demonstrated on a private, personal note. The model is able to extract and synthesize information directly from the user’s content. Because all processing is handled locally, it’s possible to run queries on sensitive information without the data ever leaving the machine. (image by author)

User Interface
This example demonstrates the system’s ability to query complex, structured documents. The model extracts specific data points from a financial report in response to a direct question. This showcases its utility for analyzing dense information, whether it’s technical documentation, financial data, or academic papers. (image by author)

2. The Multi-User System: auth.py

A robust application needs a solid authentication and authorization layer. I built one from scratch using a simple SQLite database to store user and session information.

Database and Authentication

The DatabaseManager class sets up tables for users, sessions, and analytics. User passwords are never stored in plain text; instead, I use werkzeug.security to store salted and hashed passwords.

When a user logs in, the system verifies their credentials and generates a JSON Web Token (JWT) that is used to authenticate subsequent API requests.
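
In outline, the hashing and token logic looks like this (a minimal sketch; the token fields, expiry, and secret handling are assumptions rather than the exact code in auth.py):

# A sketch of password hashing and JWT issuance (claims and expiry
# are assumptions)
import datetime
import jwt  # PyJWT
from werkzeug.security import generate_password_hash, check_password_hash

SECRET_KEY = "change-me"  # loaded from configuration in the real app

def hash_password(password):
    # Stored at registration time; werkzeug salts and hashes it
    return generate_password_hash(password)

def issue_token(user, password):
    # Verify the password against the stored salted hash
    if not check_password_hash(user["password_hash"], password):
        return None
    # Sign a JWT that authenticates subsequent API requests
    payload = {
        "user_id": user["id"],
        "role": user.get("role", "user"),
        "exp": datetime.datetime.now(datetime.timezone.utc)
               + datetime.timedelta(hours=12),
    }
    return jwt.encode(payload, SECRET_KEY, algorithm="HS256")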

Securing Endpoints with Decorators

To protect the API, I implemented custom decorators. The @require_auth decorator checks for a valid session or JWT, while @require_role('admin') restricts access to admin-only endpoints, like the analytics dashboard.

# A snippet from auth.py
def require_role(required_role):
    """Decorator to require specific user role."""
    def decorator(f):
        @wraps(f)
        @require_auth
        def decorated_function(*args, **kwargs):
            user_role = request.current_user.get('role', 'user')
            # ... logic to check if user has sufficient privileges ...
            return f(*args, **kwargs)
        return decorated_function
    return decorator

This design makes it easy to secure new API routes as the application grows.
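
For example, adding a new admin-only route then takes a single extra line (the endpoint and helper below are made up for illustration):

# Protecting a hypothetical admin-only route with the decorator
@app.route('/api/admin/analytics', methods=['GET'])
@require_role('admin')
def analytics_api():
    return jsonify(get_usage_analytics())  # placeholder helper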

Admin Dashboard
User management using the admin dashboard (image by author)

3. The Web Server: main_app.py

The Flask application ties everything together. It defines the API endpoints for the frontend to interact with. The most important one is /api/ask.

# A simplified view of the /api/ask endpoint in main_app.py
@app.route('/api/ask', methods=['POST'])
@require_auth
def ask_question_api():
    session_id = get_or_create_session_id()
    data = request.get_json()
    question = data['question']
    kb_id = data.get('knowledge_base_id', 'user_personal')
    # Get or create a RAG instance for this session and knowledge base
    rag_chain, retriever, memory = get_or_create_rag_instance(session_id, kb_id)

    if not rag_chain:
        return jsonify({"error": "RAG service not ready."}), 503
    # Invoke the RAG chain and return the response
    response = rag_chain.invoke({"query": question})

    # ... format and return the answer and sources ...

The application maintains a thread-safe dictionary (rag_instances) to manage separate RAG chains and conversation histories for each user session and selected knowledge base, preventing memory leaks and ensuring that conversations are isolated.
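
A condensed sketch of that caching pattern (the exact structure is an assumption; build_rag_instance stands in for the real construction logic):

# A sketch of the per-session instance cache (structure is assumed;
# the article only specifies a thread-safe dictionary)
import threading

rag_instances = {}
rag_instances_lock = threading.Lock()

def get_or_create_rag_instance(session_id, kb_id):
    key = (session_id, kb_id)
    with rag_instances_lock:
        if key not in rag_instances:
            # Each (session, knowledge base) pair gets its own chain,
            # retriever, and conversation memory
            rag_instances[key] = build_rag_instance(kb_id)
        return rag_instances[key]  # (rag_chain, retriever, memory)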

The Result: Your Private AI Knowledge Base for Work and Life

The final result is a powerful, production-ready application that turns any collection of documents into an interactive AI assistant. It’s the perfect tool for:

  • Small Businesses wanting a secure, internal knowledge base for their team without the high cost and privacy risks of SaaS products.
  • Developers & Freelancers who need a powerful, private RAG system for their own documents or as a foundation for client projects.
  • Personal Users & Families who want to securely organize and ask questions about their private files, from financial records to research papers.

It’s the best of both worlds: the power of modern AI without sacrificing control over your data.

Want to Run It Yourself?

This project takes the core principles of local processing from the AI Meeting Note Taker and expands them into a robust, multi-user platform ready for any use case.

I’ve packaged the entire source code and made it available for a one-time purchase. It’s a fantastic way to get a powerful, private AI tool up and running in minutes and serves as a solid foundation for your own customizations.

For a limited time, use the code MEDIUM25 for a 25% discount.

You can get the complete RAG Application project by clicking here

Thanks for reading! I hope this deep-dive inspires you to take control of your data and explore the incredible potential of self-hosted AI.
