🎉 Update: Wow! This post was awarded the Top Docker Author Badge of the week! Thanks to everyone for the amazing support and feedback. 🙏

Stop sending your sensitive datasheets to the cloud. Here is how I deployed a private, enterprise-grade RAG system.
As a Senior Automation Engineer, I deal with hundreds of technical documents every month — datasheets, schematics, internal protocols, and legacy codebases.
We all know the power of LLMs like GPT-4. Being able to ask, “What is the maximum voltage for the RS485 module on page 42?” and getting an instant answer is a game-changer.
But there is a problem: Privacy.
I cannot paste proprietary schematics or NDA-protected specs into ChatGPT. The risk of data leakage is simply too high.
So, I set out to build a solution. I wanted a “Second Brain” that was:
100% Offline: No data leaves my local network.
Free to run: No monthly API subscriptions (bye-bye, OpenAI bills).
Dockerized: Easy to deploy without “dependency hell.”
Here is the architecture I built using Llama 3, Ollama, and Docker.
The Architecture: Why this Tech Stack?
Building a RAG (Retrieval-Augmented Generation) system locally used to be a nightmare of Python dependencies and CUDA driver issues. To solve this, I designed a containerized microservices architecture.
The Brain: Ollama + Llama 3
I chose Ollama as the inference engine because it’s lightweight and efficient. For the model, Meta’s Llama 3 (8B) is the current sweet spot — it’s surprisingly capable of reasoning through technical documentation and runs smoothly on consumer GPUs (like an RTX 3060).
The Memory: ChromaDB
For the vector database, I used ChromaDB. It runs locally, requires zero setup, and handles vector retrieval incredibly fast.
The Glue: Python & Streamlit
The backend is written in Python, handling the “Ingestion Pipeline” (a minimal sketch follows this list):
Parsing: Extracting text from PDFs.
Chunking: Breaking text into manageable pieces.
Embedding: Converting text into vectors using the mxbai-embed-large model.
UI: A clean Streamlit interface for chatting with the data.
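To make those steps concrete, here is a minimal sketch of the ingestion pass. It assumes the pypdf, chromadb, and requests packages plus a running Ollama instance with mxbai-embed-large pulled; the collection name, chunk size, and the embedded PersistentClient are illustrative simplifications (the Dockerized setup would point an HttpClient at the chromadb service instead), not the exact code in the packaged project.

```python
# ingest.py: minimal sketch of the ingestion pipeline (parse -> chunk -> embed -> store).
# Assumes: pip install pypdf chromadb requests, and Ollama running with mxbai-embed-large pulled.
from pathlib import Path

import chromadb
import requests
from pypdf import PdfReader

OLLAMA_URL = "http://localhost:11434"  # inside Docker this points at the ollama service instead

# Embedded client for simplicity; the containerized setup would use chromadb.HttpClient.
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("datasheets")


def embed(text: str) -> list[float]:
    """Get an embedding vector from Ollama's mxbai-embed-large model."""
    resp = requests.post(f"{OLLAMA_URL}/api/embeddings",
                         json={"model": "mxbai-embed-large", "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]


def chunk(text: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; the sliding-window variant discussed later works better."""
    return [text[i:i + size] for i in range(0, len(text), size)]


for pdf_path in Path("knowledge_base").glob("*.pdf"):
    # Parsing: extract raw text page by page
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    # Chunking + Embedding + storage in the vector database
    for idx, piece in enumerate(chunk(text)):
        collection.add(ids=[f"{pdf_path.stem}-{idx}"],
                       documents=[piece],
                       embeddings=[embed(piece)],
                       metadatas=[{"source": pdf_path.name}])
```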
How It Works (The “Happy Path”)
The beauty of this system is the Docker implementation. Instead of installing Python libraries manually, the entire system spins up with a single command.
The docker-compose.yml orchestrates the communication between the AI engine, the database, and the UI.
```yaml
# Simplified concept of the setup
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  chromadb:
    image: chromadb/chroma:latest
  backend:
    build: ./app
    depends_on:
      - ollama
      - chromadb
```
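Assuming Docker Engine with the NVIDIA Container Toolkit installed for GPU passthrough, that single command is just docker compose up -d --build.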
Once running, the workflow is simple:
Drop your PDF files into the knowledge_base folder.
Click “Update Knowledge Base” in the UI.
Start chatting.
The system automatically vectorizes your documents. When you ask a question, it retrieves the most relevant paragraphs and feeds them to Llama 3 as context.
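Here is a rough sketch of that retrieval-and-answer step, reusing the assumed “datasheets” collection and embedding model from the ingestion sketch above; the OLLAMA_HOST variable shows how a backend container would reach the ollama service by name instead of localhost.

```python
# query.py: minimal sketch of the retrieval + generation step.
import os

import chromadb
import requests

OLLAMA_URL = os.environ.get("OLLAMA_HOST", "http://localhost:11434")  # in Compose: http://ollama:11434
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("datasheets")


def ask(question: str, k: int = 4) -> str:
    # Embed the question and pull the k most relevant chunks from ChromaDB
    q_vec = requests.post(f"{OLLAMA_URL}/api/embeddings",
                          json={"model": "mxbai-embed-large", "prompt": question}).json()["embedding"]
    hits = collection.query(query_embeddings=[q_vec], n_results=k)
    context = "\n\n".join(hits["documents"][0])

    # Feed the retrieved paragraphs to Llama 3 as context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": "llama3", "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"]


print(ask("What is the maximum voltage for the RS485 module?"))
```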
The Challenge: It’s Not Just About “Running” the Model
While the concept sounds simple, getting it to production-grade stability took me weeks of debugging.
Here is what most “Hello World” tutorials don’t tell you:
PDF Parsing is messy: Tables in engineering datasheets often break standard text extractors (see the parsing sketch after this list).
Context Window limits: Llama 3 (8B) ships with an 8K-token context window, so large documents need a smart “Sliding Window” chunking strategy (see the second sketch below).
Docker Networking: The backend container has to reach Ollama via the Compose service name (e.g. http://ollama:11434) rather than localhost, and GPU access from the container requires the NVIDIA Container Toolkit on the host.
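On the parsing point, a table-aware library such as pdfplumber keeps datasheet rows and columns together. This is a minimal sketch under that assumption (the file name is illustrative), not the exact parser used in my package.

```python
# Table-aware parsing sketch using pdfplumber (pip install pdfplumber).
# Plain text extraction flattens datasheet tables; extract_tables() preserves rows and columns.
import pdfplumber


def parse_pdf(path: str) -> str:
    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            parts.append(page.extract_text() or "")
            for table in page.extract_tables():
                # Serialize each table row as a pipe-separated line so the LLM
                # still sees cell values next to their column neighbours.
                parts.extend(" | ".join(cell or "" for cell in row) for row in table)
    return "\n".join(parts)


print(parse_pdf("knowledge_base/rs485_module.pdf")[:500])  # illustrative file name
```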
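And for the sliding-window point, this is roughly the idea; the word-level split and the 512/64 numbers are illustrative assumptions, not the exact values used in the packaged ingestion logic.

```python
# Sliding-window chunking: fixed-size windows that overlap, so a sentence split
# across a chunk boundary still appears intact in at least one chunk.
def sliding_window_chunks(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    words = text.split()  # crude stand-in for a real tokenizer
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + window])
        if piece:
            chunks.append(piece)
        if start + window >= len(words):
            break
    return chunks

# Each chunk has to stay well below the model's context window once the question
# and a few retrieved neighbours are added to the prompt.
```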
I spent countless nights fixing connection timeouts, optimizing embedding models, and ensuring the UI doesn’t freeze during large file ingestions.
Want to Build Your Own?
If you are an engineer or developer who wants to own your data, I highly recommend building a local RAG system. It’s a great way to learn about GenAI architecture.
However, if you value your time and want to skip the configuration headaches, I have packaged my entire setup into a ready-to-deploy solution.
It includes:
✅ The Complete Source Code (Python/Streamlit).
✅ Production-Ready Docker Compose file.
✅ Optimized Ingestion Logic for technical docs.
✅ Setup Guide for Windows/Linux.
You can download the full package and view the detailed documentation on my GitHub.
👉 View the Project & Download Source Code on GitHub link
By Phil Yeh, Senior Automation Engineer specializing in Industrial IoT and Local AI solutions.
- Python MQTT Data Logger - A clean GUI to debug brokers & auto-save data to CSV.
- Python CAN Bus & J1939 Sniffer - Decode vehicle data without expensive hardware.
- Python Modbus Data Logger - Debug RS485 devices with a multi-threaded GUI.
- Ethernet/IP Study Kit - Learn CIP protocol with a Python-based mock PLC.
👉 Get the Source Code for all these tools: Visit my Gumroad Store

Top comments (4)
Love the offline-first RAG stack. Extra timely with Meta's Llama 3.1 bringing a 128k context window—great for long datasheets—and Apple's “Apple Intelligence” underscoring the shift to on‑device, privacy‑first AI. With EU AI Act compliance heating up, a Dockerized, air‑gapped setup like this makes a lot of sense.
Exactly! The 128k context window is a total game-changer for engineering datasheets—I can finally feed an entire MCU manual into the context without losing details.
You nailed it regarding the privacy aspect. Many companies (especially in manufacturing) are hesitant to upload proprietary specs to the cloud. Air-gapped + Docker really is the only way forward for compliance. Thanks for the insight!
very cool. perhaps this tool can also assist in making the chunking part less complex. github.com/docling-project/docling
Thanks for the suggestion, Paul! Parsing and chunking (especially with complex tables in PDFs) is definitely the hardest part of the pipeline.
I haven't tried docling yet, but I'll definitely check it out to see if it can optimize my current ingestion logic. Always looking for better ways to handle unstructured data!