Jovan Chan

Posted on Jun 13 • Originally published at aifoss.dev

LocalGPT Setup 2026: Private Document Chat in 10 Minutes

#localgpt #rag #privacy #llama


TL;DR: LocalGPT runs a fully offline RAG pipeline — your documents stay on your machine, nothing touches a cloud server, no API key required. Setup takes under 10 minutes if you have Python and a CUDA GPU. The main trade-offs against tools like AnythingLLM are that LocalGPT is single-user and doesn't persist chat history between sessions.

What you'll have running after this guide:

LocalGPT ingesting PDF, DOCX, TXT, CSV, and Markdown files from a folder on your machine
Llama 3 8B (or a model of your choice) answering questions about those documents, running entirely offline
A working CUDA or CPU setup with the configuration to swap in a larger model when you need it

Honest take: LocalGPT is the leanest path to private document Q&A for a single user. If you need team access or session history, AnythingLLM does more.

What LocalGPT actually does

LocalGPT is a RAG (retrieval-augmented generation) tool built on LangChain. You drop documents into a source folder, run an ingestion script that chunks and embeds them into a local Chroma vector database, then query those embeddings through a local LLM. Every step runs on your hardware.
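To make the "chunks" part concrete, here is a minimal sketch of overlapping fixed-size chunking in plain Python. It is an illustration of the general idea, not LocalGPT's own splitter: the function name `split_into_chunks` and the default sizes are assumptions made for this example.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Illustrative only: the name and the default sizes are assumptions,
    not values read from LocalGPT's code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.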

The main branch uses HuggingFace model downloads and LlamaCpp for GGUF models. Once the model is downloaded there are no cloud calls, no telemetry, and no need for an internet connection. That's what this guide covers.

There's also a localgpt-v2 branch that replaces the stack with Ollama as the LLM backend. It's architecturally cleaner, but as of mid-2026 it's only tested on macOS and supports only PDF ingestion. The main branch is the stable choice for production use today.

Prerequisites

Before cloning anything:

Requirement	Minimum	Recommended
Python	3.10	3.11
RAM	8 GB	16 GB
VRAM (NVIDIA)	None (CPU fallback)	8 GB+
Disk space	20 GB	40 GB (multiple models)
OS	Linux / macOS / Windows	Ubuntu 22.04

CUDA is optional but matters a lot for practical use. On a CPU, expect 3–8 tokens/sec with a 7B model. On a GPU with 8 GB VRAM you get 25–35 tokens/sec. On a GPU with 24 GB VRAM — an RTX 3090 being the common choice — a 13B GGUF model runs at around 40 tokens/sec.

Check your CUDA version before installing:

nvcc --version
# Also check: nvidia-smi (look for "CUDA Version" in the top-right corner)

You need CUDA 11.8 or 12.x. PyTorch 2.x supports both — you'll select the right wheel during install.

Step 1: Clone and install

git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
python -m venv venv
source venv/bin/activate       # Windows: venv\Scripts\activate

Install PyTorch first, matching your CUDA version:

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU-only
pip install torch torchvision torchaudio
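If you'd rather not pick the wheel by eye, the choice can be sketched as a small helper that reads the `release X.Y` string from `nvcc --version` output and returns the matching command from the list above. The function names are hypothetical, and mapping every CUDA 12.x release to the cu121 wheel is an assumption of this sketch, not a rule from PyTorch.

```python
import re
from typing import Optional, Tuple

BASE_COMMAND = "pip install torch torchvision torchaudio"

# Index URLs copied from the install commands in this guide.
WHEEL_INDEX = {
    "cu121": "https://download.pytorch.org/whl/cu121",
    "cu118": "https://download.pytorch.org/whl/cu118",
}


def parse_cuda_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from text such as 'release 12.1, V12.1.105'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pip_command(release: Optional[Tuple[int, int]]) -> str:
    """Return the install command for a CUDA release, or the CPU one for None."""
    if release is None:
        return BASE_COMMAND  # no CUDA toolkit found: CPU-only install
    if release[0] == 12:
        # Assumption of this sketch: treat any 12.x toolkit as cu121.
        return f"{BASE_COMMAND} --index-url {WHEEL_INDEX['cu121']}"
    if release == (11, 8):
        return f"{BASE_COMMAND} --index-url {WHEEL_INDEX['cu118']}"
    raise ValueError(f"CUDA {release[0]}.{release[1]} is not covered by this guide")


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 12.1, V12.1.105"
    print(pip_command(parse_cuda_release(sample)))
```

Paste your own `nvcc --version` output in place of `sample` to see which command applies.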

Then install the project dependencies:

pip install -r requirements.txt

This pulls in LangChain, ChromaDB, sentence-transformers, pdfminer.six, docx2txt, and unstructured — expect 5–10 minutes and roughly 3 GB of packages.

Step 2: Choose and configure a model

LocalGPT runs GGUF models via LlamaCpp. Open constants.py and update these two fields:

MODEL_ID = "TheBloke/Llama-3-8B-Instruct-GGUF"
MODEL_BASENAME = "llama-3-8b-instruct.Q4_K_M.gguf"

The first run downloads the model from HuggingFace (~4.7 GB for Q4_K_M 8B). If your GPU has 24 GB VRAM, the 13B model is a meaningful upgrade:

MODEL_ID = "TheBloke/Llama-3-13B-Instruct-GGUF"
MODEL_BASENAME = "llama-3-13b-instruct.Q4_K_M.gguf"

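The 13B Q4_K_M file weighs ~7.3 GB, and the model handles instruction following and long-context reasoning noticeably better than the 8B one. Verify that the repository and the exact filename exist on HuggingFace before downloading: Meta's official Llama 3 release covers 8B and 70B sizes, so a 13B build would come from a community uploader, and names vary by uploader.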

Llama 3 is released under the Meta Llama 3 Community License, which permits commercial use below 700 million monthly active users. For an explanation of how GGUF quantization levels compare, see the quantization guide.

You can also use Mistral-7B or other GGUF models. Any model hosted on HuggingFace in GGUF format works — just update MODEL_ID and MODEL_BASENAME.
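One way to check a MODEL_ID / MODEL_BASENAME pair before the first run is to build the file URL and open it in a browser. The sketch below assumes the usual HuggingFace `resolve/<revision>/<filename>` URL layout; the helper name `gguf_file_url` is made up for this example and is not part of LocalGPT.

```python
from urllib.parse import quote


def gguf_file_url(model_id: str, model_basename: str, revision: str = "main") -> str:
    """Build the direct URL for one file in a HuggingFace model repository.

    Hypothetical helper for illustration; it does not download anything.
    """
    if not model_basename.lower().endswith(".gguf"):
        raise ValueError("MODEL_BASENAME should point at a .gguf file")
    # quote() leaves '/' intact by default, so 'owner/repo' stays a path.
    return (
        "https://huggingface.co/"
        f"{quote(model_id)}/resolve/{quote(revision)}/{quote(model_basename)}"
    )


if __name__ == "__main__":
    print(gguf_file_url(
        "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    ))
```

If the printed URL returns a 404 in your browser, either the repository or the filename is wrong, and it's quicker to find that out here than halfway through a first run.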

Step 3: Add your documents

Drop files into the SOURCE_DOCUMENTS/ folder in the repo root. LocalGPT reads the following formats, dispatching a different loader per file extension:

Format	LangChain loader used
`.pdf`	PDFMinerLoader
`.txt`	TextLoader
`.md`	TextLoader
`.py`	TextLoader
`.csv`	CSVLoader
`.xls`, `.xlsx`	UnstructuredExcelLoader
`.docx`, `.doc`	Docx2txtLoader

Mixed formats work fine — you can have PDFs, DOCX files, and CSVs in the same folder and they all get ingested in one pass.
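The per-extension dispatch in the table can be pictured as a plain dictionary lookup. This is a simplified sketch rather than LocalGPT's code: loader names are kept as strings so the example runs without LangChain installed, and `plan_ingestion` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path

# Extension-to-loader mapping, copied from the table in this guide.
LOADER_BY_EXTENSION = {
    ".pdf": "PDFMinerLoader",
    ".txt": "TextLoader",
    ".md": "TextLoader",
    ".py": "TextLoader",
    ".csv": "CSVLoader",
    ".xls": "UnstructuredExcelLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
}


def plan_ingestion(filenames):
    """Group filenames by loader name and list files with no known loader."""
    plan = defaultdict(list)
    skipped = []
    for name in filenames:
        # Lower-casing the suffix is a choice of this sketch, so .PDF matches .pdf.
        loader = LOADER_BY_EXTENSION.get(Path(name).suffix.lower())
        if loader is None:
            skipped.append(name)
        else:
            plan[loader].append(name)
    return dict(plan), skipped


if __name__ == "__main__":
    plan, skipped = plan_ingestion(["report.pdf", "notes.md", "data.csv", "logo.png"])
    print(plan)
    print("No loader for:", skipped)
```

Running something like this over your folder before ingestion is a quick way to spot files that would be ignored.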

Then run ingestion:

python ingest.py --device_type cuda

Expected output:

Loading documents from SOURCE_DOCUMENTS
Loading new documents: 100%|██████████| 12/12 [00:08<00:00,  1.47it/s]
Loaded 12 new documents from SOURCE_DOCUMENTS
Split into 1247 chunks of text
Creating embeddings. May take some minutes...
Using embedded DuckDB with persistence: storing vectors in DB

CPU-only:

python ingest.py --device_type cpu

Ingestion time scales with document count and chunk size. For 50 MB of PDFs, expect 2–5 minutes on GPU and 10–20 minutes on CPU. The vector database is stored in DB/. Re-running ingest.py only processes files not already in the DB, so you can add documents incrementally.

Step 4: Run a query session

python run_localGPT.py --device_type cuda

The model loads in 30–60 seconds, then you get a prompt:

> Enter a query:

Type any question about your documents. LocalGPT retrieves the top-k most relevant chunks and passes them to the model along with your question. Add --show_sources to see which files and pages each answer draws from:

python run_localGPT.py --device_type cuda --show_sources
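The retrieve-then-prompt step can be sketched in a few lines of plain Python. This is a toy illustration with hand-made two-dimensional vectors, not LocalGPT's retriever: the real pipeline gets its vectors from an embedding model and its search from Chroma, and the function names and prompt wording here are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, indexed_chunks, k=4):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one prompt (wording is illustrative)."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Toy index: (chunk text, made-up 2D vector).
    index = [
        ("chunk about costs", [1.0, 0.0]),
        ("chunk about hiring", [0.0, 1.0]),
        ("chunk about cost and hiring", [1.0, 1.0]),
    ]
    best = top_k([1.0, 0.0], index, k=2)
    print(build_prompt("What drove costs?", best))
```

Only the chunks that make the top k reach the model, which is why an answer can miss information that is present in your documents but scored low for that query.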

Example session:

> What are the main findings in the Q3 risk report?

Answer: The Q3 report identifies three primary risks: supply chain disruption in APAC
markets, rising material costs impacting gross margin by approximately 2–3%, and
upcoming regulatory changes in the EU affecting product certification timelines.

Source Documents:
../SOURCE_DOCUMENTS/q3_risk_report_2025.pdf (page 4): ...material cost pressures have
intensified in Q2–Q3 2025, with polysilicate pricing up 18%...

GPU vs CPU: actual performance expectations

These numbers are based on community benchmarks for LlamaCpp-based inference with Q4_K_M quantization. Your results will vary with RAM speed, chunk count, and context length.

Hardware	Model	Tokens/sec	Typical response time
CPU (8-core, 32 GB RAM)	Llama 3 8B Q4_K_M	3–8	20–50 sec
RTX 3060 12 GB	Llama 3 8B Q4_K_M	25–35	4–8 sec
RTX 3090 24 GB	Llama 3 13B Q4_K_M	35–45	5–10 sec
RTX 4090 24 GB	Llama 3 13B Q4_K_M	55–70	3–6 sec
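The response-time column follows roughly from dividing answer length by generation speed. Here is a small sketch of that arithmetic, assuming an answer of about 150 tokens (my assumption, not a figure from the benchmarks) and ignoring model load and prompt processing time.

```python
def response_time_range(answer_tokens, tokens_per_sec_low, tokens_per_sec_high):
    """Return (fastest, slowest) generation time in seconds for a given answer length."""
    if min(answer_tokens, tokens_per_sec_low, tokens_per_sec_high) <= 0:
        raise ValueError("all arguments must be positive")
    fastest = answer_tokens / tokens_per_sec_high
    slowest = answer_tokens / tokens_per_sec_low
    return fastest, slowest


if __name__ == "__main__":
    # 150 tokens is an assumed answer length, chosen for illustration.
    for label, low, high in [("CPU", 3, 8), ("8B on a 12 GB GPU", 25, 35)]:
        fastest, slowest = response_time_range(150, low, high)
        print(f"{label}: {fastest:.1f} to {slowest:.1f} sec")
```

Longer answers scale linearly, so a 600-token answer on CPU lands in the range of minutes.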

CPU inference is workable for occasional queries. For ongoing use with a 13B+ model, GPU memory bandwidth is the decisive factor. If you want to run 70B models without owning the hardware, RunPod rents A100 instances by the hour: you can spin one up, run a batch of queries, and pay a few dollars rather than buying a $2,500 GPU.

Swapping to a different model

Model configuration lives entirely in constants.py. To switch:


# Smaller, faster — good for CPU
MODEL_ID = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
MODEL_BASENAME = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Embedding model (also swappable)
EMBEDDING_MODEL_NAME = "al