plasmon

Posted on • Originally published at qiita.com

I Built a Fully Local Paper RAG on an RTX 4060 8GB — BGE-M3 + Qwen2.5-32B + ChromaDB

I was using GPT-4o to read ArXiv papers. Throw in a PDF, say "summarize this," get a response in 30 seconds. Convenient.

Then one day I tried to batch-process 50 papers related to an internal research topic and stopped cold. Security policy — can we even send these to an external API? Asked my manager. Predictably, the answer was no. So the only option was to do everything locally. That's how this project started.

I'd already confirmed in my previous article that Qwen2.5-32B runs under llama.cpp. The LLM was there. All I needed was a system to search paper contents and feed relevant passages to the LLM — in other words, RAG. Easy to say. The real question was how to cram it all into 8GB of VRAM.


Extracting Text from ArXiv Papers — Getting Data Out of PDFs First

First step: text extraction. Pull PDFs from the ArXiv API, convert to text with PyMuPDF.

import arxiv
import fitz  # PyMuPDF
from pathlib import Path

def fetch_arxiv_papers(query: str, max_results: int = 20,
                       save_dir: str = "./papers") -> list[dict]:
    '''Search ArXiv, download papers, extract text'''
    save_path = Path(save_dir)
    save_path.mkdir(exist_ok=True)

    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    papers = []
    for result in client.results(search):
        pdf_path = save_path / f"{result.entry_id.split('/')[-1]}.pdf"
        if not pdf_path.exists():
            result.download_pdf(dirpath=str(save_path),
                                filename=pdf_path.name)

        text = ""
        with fitz.open(str(pdf_path)) as doc:
            for page in doc:
                text += page.get_text()

        papers.append({
            "title": result.title,
            "arxiv_id": result.entry_id,
            "abstract": result.summary,
            "authors": [a.name for a in result.authors],
            "published": result.published.isoformat(),
            "full_text": text,
            "categories": result.categories
        })
        print(f"{result.title[:60]}...")

    return papers

papers = fetch_arxiv_papers(
    query="semiconductor AND (LLM OR deep learning OR anomaly detection)",
    max_results=25
)
print(f"\nDone: {len(papers)} papers fetched")

With 25 papers, PyMuPDF's text extraction was rougher than expected. Math-heavy papers had their layouts completely mangled — garbage strings around every symbol. I decided to live with it. If you need accurate formula extraction, use Nougat or GROBID, but my goal here was "search and find relevant passages," so body text was enough.
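
If you want to strip the worst of the formula debris without reaching for a heavier tool, a crude line-level filter gets you surprisingly far: drop lines that are mostly symbols. This is a sketch of my own devising, not part of the pipeline above, and the 0.5 alpha-ratio threshold is a guess you'd want to tune:

```python
def drop_garbage_lines(text: str, min_alpha_ratio: float = 0.5) -> str:
    """Crude filter for formula debris: drop lines that are mostly non-letters."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            kept.append(line)
            continue
        # Ratio of alphabetic characters to total characters on the line
        alpha = sum(c.isalpha() for c in stripped)
        if alpha / len(stripped) >= min_alpha_ratio:
            kept.append(line)
    return "\n".join(kept)

# Usage sketch: paper["full_text"] = drop_garbage_lines(paper["full_text"])
```

This throws away real math along with the garbage, which is fine here because the goal is passage retrieval, not formula fidelity.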

The ArXiv API has rate limits. 25 papers is fine, but if you're pulling 100+, add time.sleep(3) between requests or you'll get banned.
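
Recent versions of the arxiv package's Client can already pace requests for you via its delay_seconds argument. If you're driving the HTTP calls yourself, a minimal throttle helper (my own sketch, not from the code above) looks like:

```python
import time

class Throttle:
    """Enforce a minimum gap between successive API requests."""

    def __init__(self, min_interval: float = 3.0):
        self.min_interval = min_interval
        self._last = float("-inf")

    def wait(self) -> None:
        # Sleep just long enough that calls are at least min_interval apart
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage sketch: call throttle.wait() before each download in the fetch loop
```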


Why I Chose BGE-M3 as the Embedding Model — Harder Decision Than Expected

With text in hand, next was vectorization. A few options here.

First I tried multilingual-e5-large. Runs locally, supports multiple languages. But when I tested cross-lingual retrieval — asking questions in Japanese and pulling English papers — the hit rate was mediocre. Roughly 30% of top results were irrelevant, by feel.

Then I tried BGE-M3. An embedding model from BAAI (Beijing Academy of Artificial Intelligence). Supports 100+ languages, handles up to 8,192 tokens. Switching to this noticeably improved Japanese-query-to-English-paper retrieval accuracy. The benchmarks for cross-lingual retrieval put it at the top, and my subjective experience matched the numbers.

OpenAI's text-embedding-3-small was an obvious candidate too, but the whole point of this project was to avoid sending data to external APIs. Rejected.

from sentence_transformers import SentenceTransformer
import numpy as np

class BGEEmbedder:
    '''Multilingual vectorization with BGE-M3 (CUDA-enabled)'''

    def __init__(self, device: str = "cuda"):
        print("Loading BGE-M3...")
        self.model = SentenceTransformer(
            "BAAI/bge-m3",
            device=device
        )
        print(f"BGE-M3 loaded (device={device})")

    def encode(self, texts: list[str], batch_size: int = 32) -> np.ndarray:
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            normalize_embeddings=True  # Normalize for cosine similarity
        )
        return embeddings

Encoding speed on the RTX 4060:

| Text language | Speed |
| --- | --- |
| English | ~1,200 chunks/min |
| Japanese | ~950 chunks/min |
| Mixed multilingual | ~820 chunks/min |

VRAM peaked at about 2.5GB. This becomes a problem later.

Here's the thing though — those numbers are when CUDA is actually working. The first time I ran BGE-M3, encoding was absurdly slow. Over 10 minutes for 1,000 chunks. "Is BGE-M3 just this slow?" I thought for a second, but something felt off. I opened Resource Monitor in Task Manager and saw GPU utilization sitting at 0%, with CPU hovering around 55%. If the CPU had been pinned at 100%, I would have caught it immediately. But because it had headroom, all I felt was "hm, this is kinda slow." I had specified device="cuda", but it was actually running on CPU the entire time.

The cause: the CUDA build of PyTorch wasn't installed. A bare pip install torch can silently install the CPU-only version depending on your platform. PyTorch won't throw an error even when torch.cuda.is_available() returns False — it just falls back to CPU. SentenceTransformer does print Using device: cpu in the logs, but it was buried in a wall of output and I missed it. If I didn't have the habit of opening Resource Monitor, I'd have concluded "local RAG is just slow" and moved on. After reinstalling with pip install torch --index-url https://download.pytorch.org/whl/cu121, the same job finished in 65 seconds. A literal 10x difference.
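
A cheap guard against repeating my mistake is to fail fast when CUDA was requested but isn't actually usable. A minimal sketch (assert_cuda is my own helper name, not a library function):

```python
import torch

def assert_cuda(device: str = "cuda") -> str:
    """Refuse to proceed silently on CPU when CUDA was explicitly requested."""
    if device == "cuda" and not torch.cuda.is_available():
        raise RuntimeError(
            "CUDA requested but torch.cuda.is_available() is False: "
            "you likely installed the CPU-only PyTorch wheel."
        )
    return device

# Usage sketch: embedder = BGEEmbedder(device=assert_cuda("cuda"))
```

A loud RuntimeError at startup beats ten minutes of quiet CPU encoding.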


Chunking Strategy for RAG — More Painful Than Expected

Before vectorization, you need to split text into chunks. I initially went with "just split every 512 tokens" and the search results were poor.

The problem: chunk boundaries cut right through the middle of sections. When the tail end of "Methods" and the beginning of "Results" get mashed into one chunk, it half-matches queries for both and fully matches neither.

I settled on overlapping chunks with 50-word overlap, dropping chunks under 50 words.

import re

class PaperChunker:
    def __init__(self, chunk_size: int = 512, overlap: int = 50):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: dict) -> list[dict]:
        text = re.sub(r'\n{3,}', '\n\n', text)
        text = re.sub(r' {2,}', ' ', text)

        words = text.split()
        chunks = []

        for i in range(0, len(words), self.chunk_size - self.overlap):
            chunk_words = words[i:i + self.chunk_size]
            if len(chunk_words) < 50:
                continue

            chunk_text = " ".join(chunk_words)
            chunks.append({
                "text": chunk_text,
                "metadata": {
                    **metadata,
                    "chunk_index": len(chunks),
                    "word_count": len(chunk_words)
                }
            })

        return chunks

chunker = PaperChunker(chunk_size=512, overlap=50)
all_chunks = []

for paper in papers:
    meta = {
        "title": paper["title"],
        "arxiv_id": paper["arxiv_id"],
        "authors": ", ".join(paper["authors"][:3]),
        "published": paper["published"]
    }
    chunks = chunker.chunk(paper["full_text"], meta)
    all_chunks.extend(chunks)

print(f"Total chunks: {len(all_chunks)}")

25 papers produced roughly 1,000 chunks. One more thing I noticed here: adding the Abstract as a separate, standalone chunk significantly improves retrieval accuracy. Abstracts are information-dense summaries of the entire paper, so they tend to be semantically closer to most queries. This was a hack I added after the fact, but it made a big difference.

for paper in papers:
    abstract_chunk = {
        "text": f"Title: {paper['title']}\nAbstract: {paper['abstract']}",
        # Build the metadata per paper — reusing the leftover `meta` from the
        # earlier loop would tag every abstract with the last paper's metadata
        "metadata": {
            "title": paper["title"],
            "arxiv_id": paper["arxiv_id"],
            "authors": ", ".join(paper["authors"][:3]),
            "published": paper["published"],
            "chunk_type": "abstract"
        }
    }
    all_chunks.append(abstract_chunk)

Why ChromaDB as the Vector Store — FAISS vs Qdrant vs ChromaDB

For storing vectors, I had three candidates:

  • FAISS: Meta's go-to. Fast. But you have to handle persistence yourself (pickle or raw file I/O).
  • Qdrant: Full-featured, looked nice, but requires Docker. Overkill for this project.
  • ChromaDB: One-line persistence with PersistentClient. Lightweight.

My motivation was "finish this over the weekend," so I went with ChromaDB — the easiest to set up.

import chromadb
from chromadb.config import Settings

class PaperVectorStore:
    def __init__(self, persist_dir: str = "./chroma_db"):
        self.client = chromadb.PersistentClient(
            path=persist_dir,
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="arxiv_papers",
            metadata={"hnsw:space": "cosine"}
        )

    def add_chunks(self, chunks: list[dict], vectors: np.ndarray):
        ids = [f"chunk_{i}" for i in range(len(chunks))]
        documents = [c["text"] for c in chunks]
        metadatas = [c["metadata"] for c in chunks]

        batch_size = 500
        for i in range(0, len(ids), batch_size):
            end = min(i + batch_size, len(ids))
            self.collection.add(
                ids=ids[i:end],
                documents=documents[i:end],
                embeddings=vectors[i:end].tolist(),
                metadatas=metadatas[i:end]
            )

        print(f"Saved {len(ids)} chunks to ChromaDB")

    def search(self, query_vector: np.ndarray, top_k: int = 5) -> list[dict]:
        results = self.collection.query(
            query_embeddings=query_vector.tolist(),
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )

        hits = []
        for i in range(len(results["ids"][0])):
            hits.append({
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "similarity": 1 - results["distances"][0][i]
            })

        return hits

embedder = BGEEmbedder(device="cuda")
texts = [c["text"] for c in all_chunks]
vectors = embedder.encode(texts, batch_size=32)

store = PaperVectorStore(persist_dir="./paper_rag_db")
store.add_chunks(all_chunks, vectors)

Smooth sailing up to this point. Index construction for 25 papers, ~1,000 chunks, done in 90 seconds.

Now, I have a complaint about ChromaDB's API. The return value of query uses this nested structure: results["documents"][0][i]. Why is there an extra list layer? Apparently it's for batch multi-query support, but the vast majority of use cases involve a single query, and having to write [0] every single time is genuinely annoying.
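
A ten-line adapter makes the annoyance go away. This is my own wrapper, not a ChromaDB API:

```python
def first_query(results: dict) -> list[dict]:
    """Flatten ChromaDB's batch-shaped query result for the single-query case."""
    return [
        {"text": doc, "metadata": meta, "similarity": 1 - dist}
        for doc, meta, dist in zip(
            results["documents"][0],   # [0] = first (and only) query in the batch
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
```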

Also, if you don't explicitly set anonymized_telemetry=False, it sends telemetry data. The whole reason I'm using ChromaDB is to keep things local, and silently phoning home contradicts that philosophy. Still the case in 0.5.x.

That said — the ease of persistence and HNSW search speed are genuinely excellent. Searching 1,000 chunks in under 1ms. Disk usage around 50MB. I have complaints, but not enough to switch.


Local RAG Answer Generation — Feeding Search Results to Qwen2.5-32B

Search works. Now pass the results to Qwen2.5-32B to generate answers. Since llama-server exposes an OpenAI-compatible API, the OpenAI SDK works directly.

from openai import OpenAI

class LocalPaperRAG:
    def __init__(self, store: PaperVectorStore,
                 embedder: BGEEmbedder,
                 llm_base_url: str = "http://localhost:8080/v1"):
        self.store = store
        self.embedder = embedder
        self.llm = OpenAI(base_url=llm_base_url, api_key="dummy")

    def ask(self, question: str, top_k: int = 5,
            max_tokens: int = 1024) -> dict:
        q_vec = self.embedder.encode([question])
        hits = self.store.search(q_vec, top_k=top_k)

        context_parts = []
        sources = set()
        for hit in hits:
            title = hit["metadata"].get("title", "Unknown")
            sources.add(title)
            context_parts.append(
                f"[Paper: {title}]\n{hit['text']}"
            )
        context = "\n\n---\n\n".join(context_parts)

        response = self.llm.chat.completions.create(
            model="qwen2.5-32b",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are an expert at reading academic papers. "
                        "Answer accurately based on the provided context. "
                        "If the context doesn't contain relevant information, "
                        "say 'No relevant information found.' "
                        "Always cite the paper title(s) that support your answer."
                    )
                },
                {
                    "role": "user",
                    "content": f"## Context\n{context}\n\n## Question\n{question}"
                }
            ],
            max_tokens=max_tokens,
            temperature=0.3
        )

        return {
            "answer": response.choices[0].message.content,
            "sources": list(sources),
            "chunks_used": len(hits),
            "tokens": response.usage.total_tokens
        }

rag = LocalPaperRAG(store=store, embedder=embedder)

I set temperature=0.3 because hallucinations in paper-based answers are fatal. Lowering the temperature is the simplest and most effective countermeasure.

I'm using Qwen2.5-32B because I already had a working setup on the RTX 4060 from the previous article. Qwen3.5-27B is out now and I want to test it, but I haven't compared RAG answer quality yet. At 27B, parameter count drops so VRAM gets more breathing room, and architectural improvements should bump quality too. That's material for the next article.


Hitting the VRAM 8GB Wall with RAG

Here's where the biggest problem hit.

Loading BGE-M3 on CUDA eats ~2.5GB of VRAM. Qwen2.5-32B at ngl=60 uses 7.6GB. Total: 10.1GB. No way that fits on an 8GB card.

I naively assumed "just run both on the GPU at the same time." Wrong.

Here's what I ended up doing:

# During index construction: stop llama-server, give GPU to BGE-M3
# → After encoding, restart llama-server
# During search: vectorizing a single query is fast enough on CPU (tens of ms)
embedder_for_query = BGEEmbedder(device="cpu")

Time-sharing. Index construction (batch encoding) isn't something you do constantly, so I stop the LLM server during those runs and let BGE-M3 have the full GPU. During the search phase, you're only vectorizing one new query at a time, so CPU is perfectly fine.

Another approach: just run BGE-M3 on CPU from the start. Speed drops to about 1/4 — encoding 1,000 chunks takes 4+ minutes — but you don't have to stop llama-server. If you can live with the wait, this is operationally simpler.
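
One way to automate the choice is to probe whether llama-server is currently holding the GPU. Recent llama-server builds expose a /health endpoint; treat the URL and the policy below as assumptions about your setup, not a fixed recipe:

```python
import urllib.request
import urllib.error

def pick_embedder_device(health_url: str = "http://localhost:8080/health") -> str:
    """If the LLM server is up, keep BGE-M3 on CPU; otherwise take the GPU."""
    try:
        with urllib.request.urlopen(health_url, timeout=1):
            return "cpu"   # llama-server is running: don't fight it for VRAM
    except (urllib.error.URLError, OSError):
        return "cuda"      # server is down: safe to batch-encode on the GPU

# Usage sketch: embedder = BGEEmbedder(device=pick_embedder_device())
```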


Testing Actual Retrieval Accuracy on ArXiv Papers

result = rag.ask("What are the main challenges of using LLMs for semiconductor failure analysis (FA)?")
print(f"Answer:\n{result['answer']}\n")
print(f"Source papers: {result['sources']}")

result = rag.ask("What's the difference between N-BEATS and Transformer-based approaches for time-series anomaly detection?")
print(f"Answer:\n{result['answer']}\n")

# The real power of BGE-M3: querying in Japanese, retrieving English papers
result = rag.ask("What are the latest trends in AI applications for semiconductor manufacturing processes?")
print(f"Answer:\n{result['answer']}\n")

The third query produced interesting results. Despite asking in Japanese (in the original setup), the relevant passages from English papers correctly appeared in the top 5. Choosing BGE-M3 paid off.

Performance numbers:

| Step | Time |
| --- | --- |
| Text extraction from 25 papers | ~30s |
| Chunking (1,050 chunks) | <1s |
| BGE-M3 encoding | ~65s |
| ChromaDB storage | ~2s |
| Index construction total | ~100s |
| Query (search + answer generation) | ~55s |

Of those 55 seconds, 54 are spent waiting for Qwen2.5-32B to generate. Search itself is under 1ms. If you want to speed this up, either reduce max_tokens or drop to a 7B-class model — but for technical paper content, 32B answer quality is leagues ahead of 7B. Worth the wait.


The Full Picture — A Working Local RAG in ~200 Lines

The final system looks like this:

| Component | Role | VRAM |
| --- | --- | --- |
| BGE-M3 | Multilingual vectorization | ~2.5GB (batch) |
| ChromaDB | Vector search + persistence | Disk only |
| Qwen2.5-32B (llama.cpp) | Answer generation | ~7.6GB (ngl=60) |

About 200 lines of code total. 6 library dependencies. Got it working in under a day.

I initially thought "building local RAG sounds like a huge pain," but in practice each piece is a combination of mature technologies. The hardest part turned out to be VRAM budgeting — a hardware constraint, not a software one. At the software layer, BGE-M3, ChromaDB, and llama.cpp are all stable enough.


Improving RAG on 8GB VRAM — What I Want to Tackle Next

HyDE (Hypothetical Document Embeddings) has my attention. Instead of vectorizing the query directly, you first have the LLM generate a "hypothetical document that would answer this question," then vectorize that. Supposedly it bridges the semantic gap between queries and documents. But it means hitting the LLM an extra time, doubling response latency. On 8GB VRAM where queries already take 55 seconds — 110 seconds is rough.
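
The HyDE loop itself is tiny. Here's a sketch with the LLM call, embedder, and store injected as callables so it could slot on top of LocalPaperRAG; the function name and prompt wording are mine, not from a library:

```python
def hyde_search(question: str, generate, embed, search, top_k: int = 5) -> list[dict]:
    """HyDE: retrieve with the embedding of a hypothetical answer, not the raw query.

    generate: str -> str        (one extra LLM call: this is where latency doubles)
    embed:    list[str] -> vecs (e.g. BGEEmbedder.encode)
    search:   (vecs, top_k) -> hits (e.g. PaperVectorStore.search)
    """
    hypothetical = generate(
        "Write one paragraph that could appear in a paper answering: " + question
    )
    # Embed the hypothetical document; its vector sits closer to real passages
    return search(embed([hypothetical]), top_k)
```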

BGE-M3 supports Sparse and ColBERT scoring in addition to Dense, but I'm only using Dense right now. Multi-vector retrieval should improve accuracy, but ChromaDB doesn't natively support sparse vectors, so I'd need a separate storage backend. Still figuring out how to make that work.
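
If I do add Sparse on top of Dense, the simplest integration is late fusion: run the two retrievals separately (sparse in whatever backend supports it) and merge scores per chunk id. A sketch of the merge step; the 0.6/0.4 weights are placeholders, not tuned values:

```python
def fuse_scores(dense: dict[str, float], sparse: dict[str, float],
                w_dense: float = 0.6, w_sparse: float = 0.4) -> list[tuple[str, float]]:
    """Weighted late fusion of per-chunk retrieval scores from two retrievers."""
    fused = {
        cid: w_dense * dense.get(cid, 0.0) + w_sparse * sparse.get(cid, 0.0)
        for cid in set(dense) | set(sparse)
    }
    # Highest combined score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```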

And there's the Qwen3.5-27B migration to test. At 27B, I can push ngl harder on 8GB, and I'm curious about the quality gap versus 32B. A model-size comparison for RAG answer quality — that's enough material for a standalone article, so I'll save it for next time.
