This article was originally published on aifoss.dev
TL;DR: LocalGPT runs a fully offline RAG pipeline — your documents stay on your machine, nothing touches a cloud server, no API key required. Setup takes under 10 minutes if you have Python and a CUDA GPU. The main trade-offs against tools like AnythingLLM are that LocalGPT is single-user and doesn't persist chat history between sessions.
What you'll have running after this guide:
- LocalGPT ingesting PDF, DOCX, TXT, CSV, and Markdown files from a folder on your machine
- Llama 3 8B (or a model of your choice) answering questions about those documents, running entirely offline
- A working CUDA or CPU setup with the configuration to swap in a larger model when you need it
Honest take: LocalGPT is the leanest path to private document Q&A for a single user. If you need team access or session history, AnythingLLM does more.
What LocalGPT actually does
LocalGPT is a RAG (retrieval-augmented generation) tool built on LangChain. You drop documents into a source folder, run an ingestion script that chunks and embeds them into a local Chroma vector database, then query those embeddings through a local LLM. Every step runs on your hardware.
The main branch uses HuggingFace model downloads and LlamaCpp for GGUF models: no cloud calls, no telemetry, and no internet connection needed once you've downloaded the model. That's what this guide covers.
There's also a localgpt-v2 branch that replaces the stack with Ollama as the LLM backend. It's architecturally cleaner, but as of mid-2026 it's only tested on macOS and supports only PDF ingestion. The main branch is the stable choice for production use today.
Prerequisites
Before cloning anything:
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.11 |
| RAM | 8 GB | 16 GB |
| VRAM (NVIDIA) | None (CPU fallback) | 8 GB+ |
| Disk space | 20 GB | 40 GB (multiple models) |
| OS | Linux / macOS / Windows | Ubuntu 22.04 |
CUDA is optional but matters a lot for practical use. On a CPU, expect 3–8 tokens/sec with a 7B model. On a GPU with 8 GB VRAM you get 25–35 tokens/sec. On a GPU with 24 GB VRAM — an RTX 3090 being the common choice — a 13B GGUF model runs at around 40 tokens/sec.
Check your CUDA version before installing:
nvcc --version
# Also check: nvidia-smi (look for "CUDA Version" in the top-right corner)
You need CUDA 11.8 or 12.x. PyTorch 2.x supports both — you'll select the right wheel during install.
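As a rough sketch of the version check, the snippet below parses the text printed by `nvcc --version` and maps it to one of the wheel index URLs used in this guide. The helper names `parse_nvcc_release` and `pick_torch_index` are hypothetical, and treating every 12.x release as a match for the cu121 wheel is a simplification that follows the "11.8 or 12.x" rule above rather than anything PyTorch guarantees.

```python
import re
from typing import Optional, Tuple

# Wheel index URLs taken from the install commands in this guide.
TORCH_INDEX = {
    "cu121": "https://download.pytorch.org/whl/cu121",
    "cu118": "https://download.pytorch.org/whl/cu118",
}


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_torch_index(release: Optional[Tuple[int, int]]) -> Optional[str]:
    """Map a CUDA release to a wheel index; None means use the CPU-only install."""
    if release is None:
        return None
    major, minor = release
    if major == 12:  # assumption: any 12.x toolkit uses the cu121 wheel
        return TORCH_INDEX["cu121"]
    if (major, minor) == (11, 8):
        return TORCH_INDEX["cu118"]
    return None  # unsupported toolkit version: fall back to CPU


# Example line as printed by nvcc; paste your own output here.
sample = "Cuda compilation tools, release 12.1, V12.1.105"
index_url = pick_torch_index(parse_nvcc_release(sample))
```

If `index_url` comes back as `None`, the CPU-only install is the safe fallback.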
Step 1: Clone and install
git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
Install PyTorch first, matching your CUDA version:
# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# CPU-only (on Linux, add --index-url https://download.pytorch.org/whl/cpu to skip the larger CUDA build)
pip install torch torchvision torchaudio
Then install the project dependencies:
pip install -r requirements.txt
This pulls in LangChain, ChromaDB, sentence-transformers, pdfminer.six, docx2txt, and unstructured — expect 5–10 minutes and roughly 3 GB of packages.
Step 2: Choose and configure a model
LocalGPT runs GGUF models via LlamaCpp. Open constants.py and update these two fields:
MODEL_ID = "TheBloke/Llama-3-8B-Instruct-GGUF"
MODEL_BASENAME = "llama-3-8b-instruct.Q4_K_M.gguf"
The first run downloads the model from HuggingFace (~4.7 GB for Q4_K_M 8B). If your GPU has 24 GB VRAM, the 13B model is a meaningful upgrade:
MODEL_ID = "TheBloke/Llama-3-13B-Instruct-GGUF"
MODEL_BASENAME = "llama-3-13b-instruct.Q4_K_M.gguf"
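The 13B Q4_K_M file weighs ~7.3 GB and handles instruction following and long-context reasoning noticeably better. Verify that the repository and the exact filename exist on HuggingFace before downloading: names vary by uploader, and Meta's own Llama 3 release shipped in 8B and 70B sizes, so a 13B build under this repo ID may not be available.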
Llama 3 is released under the Meta Llama 3 Community License, which permits commercial use below 700 million monthly active users. For an explanation of how GGUF quantization levels compare, see the quantization guide.
You can also use Mistral-7B or other GGUF models. Any model hosted on HuggingFace in GGUF format works — just update MODEL_ID and MODEL_BASENAME.
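Since a typo in either field only surfaces when the download fails, a quick sanity check before the first run can save a few minutes. The sketch below is a hypothetical helper, not part of LocalGPT: it checks that `MODEL_ID` looks like an `owner/repo` ID and pulls the quantization tag out of `MODEL_BASENAME`. It assumes the common `.<QUANT>.gguf` naming pattern and only recognizes Q-style tags.

```python
import re


def check_model_config(model_id: str, model_basename: str) -> str:
    """Validate the two constants.py fields and return the quantization tag."""
    # A HuggingFace repo ID has the form "owner/repo".
    if model_id.count("/") != 1 or not all(model_id.split("/")):
        raise ValueError(f"MODEL_ID should look like 'owner/repo': {model_id!r}")
    # GGUF uploads conventionally end in ".<QUANT>.gguf", e.g. ".Q4_K_M.gguf".
    match = re.search(r"\.(Q\d+_[A-Za-z0-9_]+)\.gguf$", model_basename, re.IGNORECASE)
    if match is None:
        raise ValueError(f"MODEL_BASENAME has no quantization tag: {model_basename!r}")
    return match.group(1).upper()


# Values from the Llama 3 8B example in this guide.
tag = check_model_config(
    "TheBloke/Llama-3-8B-Instruct-GGUF",
    "llama-3-8b-instruct.Q4_K_M.gguf",
)
```

This only checks the shape of the strings. It does not confirm that the repository or file exists on HuggingFace.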
Step 3: Add your documents
Drop files into the SOURCE_DOCUMENTS/ folder in the repo root. LocalGPT reads the following formats, dispatching a different loader per file extension:
| Format | LangChain loader used |
|---|---|
| .pdf | PDFMinerLoader |
| .txt | TextLoader |
| .md | TextLoader |
| .py | TextLoader |
| .csv | CSVLoader |
| .xls, .xlsx | UnstructuredExcelLoader |
| .docx, .doc | Docx2txtLoader |
Mixed formats work fine — you can have PDFs, DOCX files, and CSVs in the same folder and they all get ingested in one pass.
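The per-extension dispatch can be pictured as a plain lookup table. This is an illustrative sketch, not LocalGPT's actual code: the mapping is copied from the format list above, loader names are kept as strings so it runs without LangChain installed, and `plan_ingestion` is a hypothetical helper for previewing which files would be picked up.

```python
from collections import defaultdict
from pathlib import Path

# Extension -> loader name, as listed in this guide.
LOADER_BY_EXT = {
    ".pdf": "PDFMinerLoader",
    ".txt": "TextLoader",
    ".md": "TextLoader",
    ".py": "TextLoader",
    ".csv": "CSVLoader",
    ".xls": "UnstructuredExcelLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
}


def plan_ingestion(filenames):
    """Group filenames by loader name; unsupported files land under the key None."""
    plan = defaultdict(list)
    for name in filenames:
        ext = Path(name).suffix.lower()  # case-insensitive match on the extension
        plan[LOADER_BY_EXT.get(ext)].append(name)
    return dict(plan)


plan = plan_ingestion(["report.PDF", "sales.csv", "notes.md", "setup.exe"])
```

Anything grouped under `None` has no loader in the table and would need converting before ingestion.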
Then run ingestion:
python ingest.py --device_type cuda
Expected output:
Loading documents from SOURCE_DOCUMENTS
Loading new documents: 100%|██████████| 12/12 [00:08<00:00, 1.47it/s]
Loaded 12 new documents from SOURCE_DOCUMENTS
Split into 1247 chunks of text
Creating embeddings. May take some minutes...
Using embedded DuckDB with persistence: storing vectors in DB
CPU-only:
python ingest.py --device_type cpu
Ingestion time scales with document count and chunk size. For 50 MB of PDFs, expect 2–5 minutes on GPU and 10–20 minutes on CPU. The vector database is stored in DB/. Re-running ingest.py only processes files not already in the DB, so you can add documents incrementally.
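To see why chunk size drives both the chunk count and the ingestion time, here is a minimal fixed-width splitter with overlap. It is a sketch of the general technique only: the `chunk_size` and `overlap` defaults are illustrative assumptions rather than LocalGPT's confirmed settings, and LangChain's own splitters additionally try to break on paragraph and sentence boundaries.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Smaller chunks mean more embeddings to compute and store. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.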
Step 4: Run a query session
python run_localGPT.py --device_type cuda
The model loads in 30–60 seconds, then you get a prompt:
> Enter a query:
Type any question about your documents. LocalGPT retrieves the top-k most relevant chunks and passes them to the model along with your question. Add --show_sources to see which files and pages each answer draws from:
python run_localGPT.py --device_type cuda --show_sources
Example session:
> What are the main findings in the Q3 risk report?
Answer: The Q3 report identifies three primary risks: supply chain disruption in APAC
markets, rising material costs impacting gross margin by approximately 2–3%, and
upcoming regulatory changes in the EU affecting product certification timelines.
Source Documents:
../SOURCE_DOCUMENTS/q3_risk_report_2025.pdf (page 4): ...material cost pressures have
intensified in Q2–Q3 2025, with polysilicate pricing up 18%...
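The retrieve-then-prompt step behind each answer can be sketched in a few lines. This is a toy version under loud assumptions: a bag-of-words vector stands in for the real sentence-transformers embedding, an in-memory list stands in for Chroma, and the prompt template is invented for illustration. Only the shape of the flow (embed the query, rank chunks by similarity, keep the top k, prepend them to the question) mirrors what the guide describes.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real setup uses a neural embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list) -> str:
    """Assemble retrieved chunks and the question into one prompt (hypothetical template)."""
    context = "\n\n".join(context_chunks)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "material costs rose in Q3",
    "the cafeteria menu changed",
    "supply chain risk in APAC markets",
]
question = "what are the supply chain risk factors"
prompt = build_prompt(question, top_k(question, docs, k=1))
```

Because only the top-k chunks reach the model, an answer can only be as good as the retrieval step, which is why `--show_sources` is useful for checking what was actually retrieved.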
GPU vs CPU: actual performance expectations
These numbers are based on community benchmarks for LlamaCpp-based inference with Q4_K_M quantization. Your results will vary with RAM speed, chunk count, and context length.
| Hardware | Model | Tokens/sec | Typical response time |
|---|---|---|---|
| CPU (8-core, 32 GB RAM) | Llama 3 8B Q4_K_M | 3–8 | 20–50 sec |
| RTX 3060 12 GB | Llama 3 8B Q4_K_M | 25–35 | 4–8 sec |
| RTX 3090 24 GB | Llama 3 13B Q4_K_M | 35–45 | 5–10 sec |
| RTX 4090 24 GB | Llama 3 13B Q4_K_M | 55–70 | 3–6 sec |
CPU inference is workable for occasional queries. For ongoing use with a 13B+ model, GPU memory bandwidth is the decisive factor. If you want to run 70B models without buying high-end hardware, RunPod rents A100 instances by the hour: you can spin one up, run a batch of queries, and pay a few dollars rather than buying a $2,500 GPU.
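The response-time column in the table is essentially answer length divided by generation speed. The sketch below makes that arithmetic explicit. The 150-token answer length is an assumption chosen for illustration, and the estimate ignores model load time and prompt processing, so treat the output as a lower bound rather than a benchmark.

```python
def response_time_range(answer_tokens: int, tok_per_sec_low: float, tok_per_sec_high: float):
    """Return (fastest, slowest) generation time in seconds for a given speed range."""
    if tok_per_sec_low <= 0 or tok_per_sec_high <= 0:
        raise ValueError("token rates must be positive")
    return answer_tokens / tok_per_sec_high, answer_tokens / tok_per_sec_low


# Assumed 150-token answer at the CPU rate of 3 to 8 tokens/sec from the table.
cpu_fast, cpu_slow = response_time_range(150, 3, 8)
# cpu_fast == 18.75, cpu_slow == 50.0
```

Plugging in your own measured tokens/sec gives a quick sense of whether a larger model is still usable interactively on your hardware.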
Swapping to a different model
Model configuration lives entirely in constants.py. To switch:
# Smaller, faster — good for CPU
MODEL_ID = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
MODEL_BASENAME = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
# Embedding model (also swappable)
EMBEDDING_MODEL_NAME = "al