Your local LLM is smart but blind — it can't see the internet. Here's how to give it eyes, a filter, and a citation engine.
This is a hands-on tutorial. We'll install a library, run a real query, break down every stage of what happens inside, and look at the actual output your LLM receives.
By the end, you'll have a working pipeline that turns any local model (Ollama, LM Studio, anything with a text input) into something that searches the web, reads pages, ranks the results, and generates a structured prompt with inline citations — like a self-hosted Perplexity.
Background: If you want to understand the architecture this is based on, I wrote a deep dive into how Perplexity actually works — the five-stage RAG pipeline, hybrid retrieval on Vespa.ai, Cerebras-accelerated inference, the citation integrity problems. This tutorial is the practical counterpart.
What We're Building
A pipeline that does this:
Your question
↓
Search (Bing + DuckDuckGo, parallel)
↓
Semantic pre-filter (drop irrelevant results before fetching)
↓
Fetch pages (only the ones that passed filtering)
↓
Extract content (strip boilerplate, ads, navigation)
↓
Chunk + Rerank (BM25 + semantic + answer-span + MMR)
↓
LLM-ready prompt with numbered citations
The pipeline does NOT include the LLM itself — it builds the prompt. You plug in whatever model you want.
Step 1: Installation
git clone https://github.com/KazKozDev/production_rag_pipeline.git
cd production_rag_pipeline
Pick your install level:
# Minimal — BM25 ranking, BeautifulSoup extraction. No ML models.
pip install .
# Better extraction with trafilatura
pip install .[extraction]
# Semantic ranking with sentence-transformers (recommended)
pip install .[semantic]
# Everything
pip install .[full]
For this tutorial, use .[full]. First run will download embedding models (~100–500MB depending on language) — this only happens once.
No API keys needed. Bing and DuckDuckGo are queried without authentication.
Step 2: Your First Query — 3 Lines
from production_rag_pipeline import build_llm_prompt
prompt = build_llm_prompt("latest AI news", lang="en")
print(prompt)
That's the entire interface. build_llm_prompt runs the full pipeline — search, filter, fetch, extract, rerank — and returns a formatted string ready to paste into any LLM.
CLI alternative
production-rag-pipeline "latest AI news"
Or with options:
# Search-only mode (no page fetching)
production-rag-pipeline "Bitcoin price" --mode search
# Russian query
production-rag-pipeline "новости ИИ" --mode read --lang ru
macOS users
./run_llm_query.command
This bootstraps a virtual environment automatically on first run.
Step 3: What Just Happened — Stage by Stage
Let's trace what the pipeline actually does with "latest AI news". Enable debug mode to see it:
from production_rag_pipeline.pipeline import search_extract_rerank
chunks, results, fetched_urls = search_extract_rerank(
    query="latest AI news",
    num_fetch=8,
    lang="en",
    debug=True,
)
Stage 1: Dual-Engine Search
Bing and DuckDuckGo are searched in parallel. Results are merged with position-based scoring — first result from each engine scores highest, and results that appear in both engines get a boost.
The pipeline detects keywords like "news", "latest", "breaking" and switches DDG to its News index — returning actual articles instead of generic homepages.
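The merge logic isn't exposed as a public function, but the idea behind position-based scoring can be sketched in a few lines. Everything below (function name, the `1/rank` scheme, the boost value) is illustrative, not the library's actual API:

```python
def merge_results(bing, ddg, boost=0.5):
    """Merge two ranked result lists with position-based scoring.

    Each result is a dict with a "url" key. Earlier positions score
    higher (1/rank); URLs returned by both engines get an extra boost.
    """
    scores = {}
    for results in (bing, ddg):
        for rank, r in enumerate(results, start=1):
            url = r["url"]
            if url in scores:
                scores[url] += 1.0 / rank + boost  # seen by both engines
            else:
                scores[url] = 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

bing = [{"url": "a.com"}, {"url": "b.com"}]
ddg = [{"url": "b.com"}, {"url": "c.com"}]
print(merge_results(bing, ddg))  # b.com ranks first: both engines returned it
```

Note how `b.com` outranks `a.com` despite being second on Bing: cross-engine agreement is treated as a strong relevance signal.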
Stage 2: Semantic Pre-Filtering
This is the key optimization. Before fetching any page, the pipeline computes cosine similarity between the query embedding and each result's title+snippet embedding.
Results below threshold get dropped:
- English: threshold 0.30
- Russian: threshold 0.25
In practice, ~11 out of 20 results get filtered — saving about 6 seconds of HTTP fetches.
Example from a real run with "LLM agents news":
✗ flutrackers.com sim=0.12 → filtered (irrelevant)
✓ llm-stats.com sim=0.68 → fetched
✗ reddit.com/r/gaming sim=0.15 → filtered
✓ arxiv.org/abs/2503 sim=0.71 → fetched
No hardcoded domain lists. Pure semantic relevance.
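The filter itself reduces to a cosine-similarity cutoff. Here's a minimal sketch with toy 2-d vectors standing in for the real sentence-transformers embeddings (the `prefilter` helper and the pairing of results with embeddings are assumptions for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def prefilter(query_vec, candidates, threshold=0.30):
    """Keep only results whose title+snippet embedding clears the threshold.

    `candidates` is a list of (result, embedding) pairs; in the real
    pipeline the embeddings come from a sentence-transformers model.
    """
    return [r for r, vec in candidates if cosine(query_vec, vec) >= threshold]

query_vec = [1.0, 0.0]
candidates = [
    ({"url": "llm-stats.com"}, [0.9, 0.1]),       # close to the query
    ({"url": "reddit.com/r/gaming"}, [0.1, 0.9]), # off-topic
]
print(prefilter(query_vec, candidates))  # only llm-stats.com survives
```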
Stage 3: Parallel Fetch + Content Extraction
Surviving results (typically 5–9 URLs) are fetched in parallel. Content extraction runs a two-stage quality check:
1. Structural check: do more than 30% of the lines look like numbers, prices, or table cells?
2. Semantic check: if a page is flagged, is the table actually relevant to the query?
This is how exchange rate tables from cbr.ru pass for a currency query (similarity 0.75) but CS:GO price lists get rejected (similarity 0.05).
After extraction, boilerplate is stripped — navigation, ads, newsletter signup patterns, cookie banners.
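The structural check is a simple heuristic. A sketch of the idea (the exact thresholds and this function are illustrative; the library's internals may differ):

```python
def looks_tabular(text, line_threshold=0.30, digit_frac=0.5):
    """Flag pages where more than `line_threshold` of non-empty lines
    are mostly digits, i.e. likely price lists or data tables."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    numeric = sum(
        1 for ln in lines
        if sum(c.isdigit() for c in ln) / len(ln) > digit_frac
    )
    return numeric / len(lines) > line_threshold

article = "OpenAI released a new model today.\nIt has a large context window."
price_table = "730.99 731.45\n0.0041 0.0042\n1 USD 92.50"
print(looks_tabular(article), looks_tabular(price_table))  # False True
```

A page that trips this check isn't dropped outright; it goes on to the semantic check, which decides whether the table is relevant to the query.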
Stage 4: Chunking + Multi-Signal Reranking
Extracted content is chunked, then reranked by four signals:
- BM25 — classic lexical term-frequency matching
- Semantic similarity — cosine between query and chunk embeddings
- Answer-span detection — does this chunk directly answer the question?
- MMR diversity — prevents top results from all being paraphrases of the same paragraph
Optional: a cross-encoder runs on the final shortlist for maximum accuracy (slower but better).
For news queries, freshness penalties apply:
- Content >7 days old: −1 confidence
- Content >30 days old: −2 confidence
- Outdated sources flagged in the prompt with exact age
Stage 5: Prompt Assembly with Citation Binding
The pipeline builds a structured prompt:
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt
context, source_mapping, grouped_sources = build_llm_context(
    chunks,
    results,
    fetched_urls=fetched_urls,
    renumber_sources=True,  # ← fixes phantom citation numbers
)
Citation numbers are renumbered after every filtering step. If three sources survive, they're numbered [1], [2], [3] — never [1], [3], [7] with phantom gaps.
Current date and time are injected into the prompt so the LLM can reason about source freshness.
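The renumbering step itself is just a dense remapping. A sketch (this helper is illustrative, not the library's function):

```python
def renumber(surviving_ids):
    """Map surviving source IDs to dense 1..n citation numbers.

    After filtering, surviving sources might still carry original IDs
    like [1, 3, 7]; the returned old→new mapping lets every chunk's
    inline citations be rewritten consistently to [1], [2], [3].
    """
    return {old: new for new, old in enumerate(surviving_ids, start=1)}

print(renumber([1, 3, 7]))  # {1: 1, 3: 2, 7: 3}
```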
Step 4: What the Output Looks Like
The final prompt looks roughly like this (abbreviated):
Current date: 2026-03-20
Answer the user's question using ONLY the provided sources.
Cite sources using [1], [2], etc. Do not make claims without a citation.
=== SOURCES ===
[1] OpenAI announces GPT-5 turbo with 1M context window
Source: techcrunch.com | Published: 2026-03-19
OpenAI today released GPT-5 Turbo, featuring a 1 million token
context window and improved reasoning capabilities...
[2] Google DeepMind publishes Gemini 2.5 technical report
Source: blog.google | Published: 2026-03-18
The technical report details architectural changes including
mixture-of-experts scaling to 3.2 trillion parameters...
[3] Anthropic raises $5B Series E at $90B valuation
Source: reuters.com | Published: 2026-03-17
Anthropic closed a $5 billion funding round, bringing its
total raised to over $15 billion...
=== QUESTION ===
latest AI news
Drop this into Ollama, LM Studio, or any API. The model sees curated, relevant, cited content — not raw web pages.
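For Ollama specifically, "drop it in" can be automated with a plain HTTP call to the local server's `/api/generate` endpoint. A minimal sketch using only the standard library (the model name `llama3` is just an example; substitute whatever you have pulled):

```python
import json
from urllib import request

def ollama_payload(prompt, model="llama3"):
    """Build the request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt, model="llama3", host="http://localhost:11434"):
    """Send the pipeline's prompt to a locally running Ollama server."""
    req = request.Request(
        f"{host}/api/generate",
        data=json.dumps(ollama_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# answer = ask_ollama(prompt)  # requires Ollama running locally
```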
Step 5: Configuration
Dataclass
from production_rag_pipeline import RAGConfig, build_llm_prompt
config = RAGConfig(
    num_per_engine=12,        # results per search engine
    top_n_fetch=8,            # max pages to fetch
    fetch_timeout=10,         # seconds per page
    total_context_chunks=12,  # chunks in final prompt
)
prompt = build_llm_prompt("latest AI news", config=config)
YAML
production-rag-pipeline "latest AI news" --config config.example.yaml
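A YAML file mirroring the `RAGConfig` fields above might look like this (key names are assumed to map one-to-one to the dataclass fields; check `config.example.yaml` in the repo for the authoritative names):

```yaml
# Hypothetical config.yaml, mirroring the RAGConfig fields shown above
num_per_engine: 12        # results per search engine
top_n_fetch: 8            # max pages to fetch
fetch_timeout: 10         # seconds per page
total_context_chunks: 12  # chunks in final prompt
```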
Environment variables
export RAG_TOP_N_FETCH=8
export RAG_FETCH_TIMEOUT=10
production-rag-pipeline "latest AI news"
Step 6: The 50-Line Version
Here's the entire pipeline, from query to LLM-ready prompt, using the module-level API:
from production_rag_pipeline.search import search
from production_rag_pipeline.fetch import fetch_pages_parallel
from production_rag_pipeline.extract import extract_content, chunk_text
from production_rag_pipeline.rerank import rerank_chunks
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt
# 1. Search
query = "latest AI news"
results = search(query, num_per_engine=10, lang="en")
# 2. Fetch
urls = [r["url"] for r in results[:8]]
pages = fetch_pages_parallel(urls, timeout=10)
# 3. Extract + Chunk
all_chunks = []
for url, html in pages.items():
    text = extract_content(html, url=url)
    if text:
        chunks = chunk_text(text, url=url)
        all_chunks.extend(chunks)
# 4. Rerank
ranked = rerank_chunks(query, all_chunks, lang="en")
# 5. Build prompt
context, mapping, sources = build_llm_context(
    ranked, results, renumber_sources=True
)
prompt = build_llm_prompt(query, context=context, sources=sources)
print(prompt)
This is what build_llm_prompt("latest AI news") does internally, broken into visible steps.
Graceful Degradation
The pipeline works at every install level:
| Install | Ranking | Extraction | Speed |
|---|---|---|---|
| `pip install .` | BM25 only | BeautifulSoup | Fastest, least accurate |
| `pip install .[extraction]` | BM25 only | Trafilatura | Better content quality |
| `pip install .[semantic]` | BM25 + semantic + MMR | BeautifulSoup | Much better ranking |
| `pip install .[full]` | BM25 + semantic + cross-encoder + MMR | Trafilatura | Best quality |
No GPU required. Semantic models run on CPU — slower, but functional.
How It Compares to Perplexity
| | Perplexity | production-rag-pipeline |
|---|---|---|
| Index | 200B+ pre-indexed URLs | Real-time Bing + DDG |
| Latency | 358ms median | 8–15s on a MacBook |
| Models | 20+ with dynamic routing | You choose (Ollama, LM Studio, etc.) |
| Inference | Cerebras CS-3, 1,200 tok/s | Your hardware |
| Cost | $20/mo Pro | Free |
| Privacy | Cloud | Local |
| Code | Closed | Open source, MIT |
The gap is real — especially on latency and index size. But for a tool that runs on your laptop, feeds any local model, and costs nothing, the tradeoff is worth it.
Multilingual Support
The pipeline auto-detects language by Cyrillic character ratio (10% threshold):
- English → `all-MiniLM-L6-v2` (fast, English-optimized)
- Russian → `paraphrase-multilingual-MiniLM-L12-v2` (13 languages)
Cross-encoder reranking also switches models per language. No manual configuration needed.
production-rag-pipeline "новости ИИ" --lang ru
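The detection heuristic is simple enough to sketch in full. This mirrors the 10% Cyrillic threshold described above, though the library's actual implementation may differ in detail:

```python
def detect_lang(text, threshold=0.10):
    """Guess query language by the fraction of Cyrillic letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en"
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04ff")
    return "ru" if cyrillic / len(letters) >= threshold else "en"

print(detect_lang("latest AI news"), detect_lang("новости ИИ"))  # en ru
```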
What's Next
This is Part 2 of a series:
- Part 1 — How Perplexity Actually Searches the Internet (architecture teardown)
- Part 2 — You're reading it (build the local equivalent)
Star the repo if this is useful: github.com/KazKozDev/production_rag_pipeline
Issues and contributions welcome.