Artem KK
Building a Perplexity Clone for Local LLMs in 50 Lines of Python

Your local LLM is smart but blind — it can't see the internet. Here's how to give it eyes, a filter, and a citation engine.


This is a hands-on tutorial. We'll install a library, run a real query, break down every stage of what happens inside, and look at the actual output your LLM receives.

By the end, you'll have a working pipeline that turns any local model (Ollama, LM Studio, anything with a text input) into something that searches the web, reads pages, ranks the results, and generates a structured prompt with inline citations — like a self-hosted Perplexity.

Background: If you want to understand the architecture this is based on, I wrote a deep dive into how Perplexity actually works — the five-stage RAG pipeline, hybrid retrieval on Vespa.ai, Cerebras-accelerated inference, the citation integrity problems. This tutorial is the practical counterpart.

Repo: github.com/KazKozDev/production_rag_pipeline


What We're Building

A pipeline that does this:

Your question
    ↓
Search (Bing + DuckDuckGo, parallel)
    ↓
Semantic pre-filter (drop irrelevant results before fetching)
    ↓
Fetch pages (only the ones that passed filtering)
    ↓
Extract content (strip boilerplate, ads, navigation)
    ↓
Chunk + Rerank (BM25 + semantic + answer-span + MMR)
    ↓
LLM-ready prompt with numbered citations

The pipeline does NOT include the LLM itself — it builds the prompt. You plug in whatever model you want.


Step 1: Installation

git clone https://github.com/KazKozDev/production_rag_pipeline.git
cd production_rag_pipeline

Pick your install level:

# Minimal — BM25 ranking, BeautifulSoup extraction. No ML models.
pip install .

# Better extraction with trafilatura
pip install .[extraction]

# Semantic ranking with sentence-transformers (recommended)
pip install .[semantic]

# Everything
pip install .[full]

For this tutorial, use .[full]. First run will download embedding models (~100–500MB depending on language) — this only happens once.

No API keys needed. Bing and DuckDuckGo are queried without authentication.


Step 2: Your First Query — 3 Lines

from production_rag_pipeline import build_llm_prompt

prompt = build_llm_prompt("latest AI news", lang="en")
print(prompt)

That's the entire interface. build_llm_prompt runs the full pipeline — search, filter, fetch, extract, rerank — and returns a formatted string ready to paste into any LLM.
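Once you have the string, handing it to a local model is one HTTP call. Here's a minimal sketch against Ollama's default non-streaming endpoint (the llama3 model name and the build_payload/ask_ollama helpers are mine, not part of the library):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt: str, model: str = "llama3") -> bytes:
    """JSON body for a non-streaming /api/generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Send a ready-made RAG prompt to a local Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

LM Studio and most other local runners expose an OpenAI-compatible endpoint instead; only the URL and payload shape change.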

CLI alternative

production-rag-pipeline "latest AI news"

Or with options:

# Search-only mode (no page fetching)
production-rag-pipeline "Bitcoin price" --mode search

# Russian query
production-rag-pipeline "новости ИИ" --mode read --lang ru

macOS users

./run_llm_query.command

This bootstraps a virtual environment automatically on first run.


Step 3: What Just Happened — Stage by Stage

Let's trace what the pipeline actually does with "latest AI news". Enable debug mode to see it:

from production_rag_pipeline.pipeline import search_extract_rerank

chunks, results, fetched_urls = search_extract_rerank(
    query="latest AI news",
    num_fetch=8,
    lang="en",
    debug=True,
)

Stage 1: Dual-Engine Search

Bing and DuckDuckGo are searched in parallel. Results are merged with position-based scoring — first result from each engine scores highest, and results that appear in both engines get a boost.

The pipeline detects keywords like "news", "latest", "breaking" and switches DDG to its News index — returning actual articles instead of generic homepages.
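The exact merge scoring lives in the repo's search module, but the idea fits in a few lines. A sketch (the reciprocal-rank weighting and the 0.5 agreement boost are illustrative, not the library's actual values):

```python
def merge_results(bing: list[str], ddg: list[str]) -> list[str]:
    """Merge two ranked URL lists: earlier positions score higher,
    and URLs that appear in both engines get a cross-engine boost."""
    scores: dict[str, float] = {}
    for ranked in (bing, ddg):
        for pos, url in enumerate(ranked):
            # position-based score: 1.0 for the top hit, decaying down the list
            scores[url] = scores.get(url, 0.0) + 1.0 / (pos + 1)
    for url in set(bing) & set(ddg):
        scores[url] += 0.5  # illustrative bonus for cross-engine agreement
    return sorted(scores, key=scores.get, reverse=True)
```

A URL that both engines agree on can outrank either engine's individual top hit, which is exactly the behavior you want from a merge.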

Stage 2: Semantic Pre-Filtering

This is the key optimization. Before fetching any page, the pipeline computes cosine similarity between the query embedding and each result's title+snippet embedding.

Results below threshold get dropped:

  • English: threshold 0.30
  • Russian: threshold 0.25

In practice, ~11 out of 20 results get filtered — saving about 6 seconds of HTTP fetches.

Example from a real run with "LLM agents news":

✗ flutrackers.com     sim=0.12  → filtered (irrelevant)
✓ llm-stats.com       sim=0.68  → fetched
✗ reddit.com/r/gaming  sim=0.15  → filtered
✓ arxiv.org/abs/2503   sim=0.71  → fetched

No hardcoded domain lists. Pure semantic relevance.
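The filter itself is simple once you have embeddings. A self-contained sketch (in the real pipeline `embed` is a sentence-transformers model; here it's injected as a parameter so the logic stands alone):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def prefilter(results: list[dict], query_vec: list[float],
              embed, threshold: float = 0.30) -> list[dict]:
    """Drop results whose title+snippet embedding is too far from the query.
    The threshold is 0.30 for English, 0.25 for Russian."""
    return [
        r for r in results
        if cosine(query_vec, embed(r["title"] + " " + r["snippet"])) >= threshold
    ]
```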

Stage 3: Parallel Fetch + Content Extraction

Surviving results (typically 5–9 URLs) are fetched in parallel. Content extraction runs a two-stage quality check:

Structural check: Does >30% of lines look like numbers/prices/tables?

Semantic check: If flagged, is the table relevant to the query?

This is how exchange rate tables from cbr.ru pass for a currency query (similarity 0.75) but CS:GO price lists get rejected (similarity 0.05).

After extraction, boilerplate is stripped — navigation, ads, newsletter signup patterns, cookie banners.
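The structural check can be sketched as a line-ratio test (the regex and function name are mine; the repo's heuristic is more involved):

```python
import re

# a line made only of digits, whitespace, punctuation, and currency symbols
NUMERIC_LINE = re.compile(r"^[\d\s.,:%$€₽+\-/]+$")

def looks_tabular(text: str, ratio: float = 0.30) -> bool:
    """Flag extracted text where more than `ratio` of non-empty lines
    look like numbers, prices, or table cells."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    numeric = sum(1 for ln in lines if NUMERIC_LINE.match(ln))
    return numeric / len(lines) > ratio
```

Pages that trip this check aren't discarded outright; they go to the semantic check, which is what lets a relevant exchange-rate table through.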

Stage 4: Chunking + Multi-Signal Reranking

Extracted content is chunked, then reranked by four signals:

  • BM25 — classic lexical term-frequency matching
  • Semantic similarity — cosine between query and chunk embeddings
  • Answer-span detection — does this chunk directly answer the question?
  • MMR diversity — prevents top results from all being paraphrases of the same paragraph

Optional: a cross-encoder runs on the final shortlist for maximum accuracy (slower but better).
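MMR is the least obvious of the four signals, so here it is in miniature: greedily pick the chunk that is most relevant to the query but least similar to what's already selected. All names and weights below are illustrative, with item-item similarity injected as a function:

```python
def mmr_select(candidates: list[str], relevance: list[float],
               similarity, k: int = 3, lam: float = 0.7) -> list[str]:
    """Maximal Marginal Relevance: `lam` trades query relevance
    against redundancy with already-selected items."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # redundancy = worst-case similarity to anything already picked
            redundancy = max((similarity(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

With two near-duplicate top chunks, the second pick jumps to a less relevant but fresher chunk instead of repeating the first.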

For news queries, freshness penalties apply:

  • Content >7 days old: −1 confidence
  • Content >30 days old: −2 confidence
  • Outdated sources flagged in the prompt with exact age
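Those tiers amount to a small step function (the helper name is mine):

```python
def freshness_penalty(age_days: int) -> int:
    """Confidence penalty for stale sources in news queries:
    0 up to 7 days, -1 up to 30 days, -2 beyond that."""
    if age_days > 30:
        return -2
    if age_days > 7:
        return -1
    return 0
```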

Stage 5: Prompt Assembly with Citation Binding

The pipeline builds a structured prompt:

from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

context, source_mapping, grouped_sources = build_llm_context(
    chunks,
    results,
    fetched_urls=fetched_urls,
    renumber_sources=True,  # ← fixes phantom citation numbers
)

Citation numbers are renumbered after every filtering step. If three sources survive, they're numbered [1], [2], [3] — never [1], [3], [7] with phantom gaps.
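The renumbering itself is a dense remapping in order of first appearance. A sketch, with chunks simplified to (source_id, text) pairs:

```python
def renumber(chunks: list[tuple[int, str]]) -> tuple[list[tuple[int, str]], dict[int, int]]:
    """Remap surviving source ids to a dense 1..N sequence so the
    prompt never cites a phantom number like [7] with no source [7]."""
    mapping: dict[int, int] = {}
    out: list[tuple[int, str]] = []
    for src, text in chunks:
        if src not in mapping:
            mapping[src] = len(mapping) + 1  # next unused dense id
        out.append((mapping[src], text))
    return out, mapping
```

The mapping is kept so citations in the final answer can still be traced back to their original URLs.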

Current date and time are injected into the prompt so the LLM can reason about source freshness.


Step 4: What the Output Looks Like

The final prompt looks roughly like this (abbreviated):

Current date: 2026-03-20

Answer the user's question using ONLY the provided sources.
Cite sources using [1], [2], etc. Do not make claims without a citation.

=== SOURCES ===

[1] OpenAI announces GPT-5 turbo with 1M context window
Source: techcrunch.com | Published: 2026-03-19
OpenAI today released GPT-5 Turbo, featuring a 1 million token
context window and improved reasoning capabilities...

[2] Google DeepMind publishes Gemini 2.5 technical report
Source: blog.google | Published: 2026-03-18
The technical report details architectural changes including
mixture-of-experts scaling to 3.2 trillion parameters...

[3] Anthropic raises $5B Series E at $90B valuation
Source: reuters.com | Published: 2026-03-17
Anthropic closed a $5 billion funding round, bringing its
total raised to over $15 billion...

=== QUESTION ===

latest AI news

Drop this into Ollama, LM Studio, or any API. The model sees curated, relevant, cited content — not raw web pages.


Step 5: Configuration

Dataclass

from production_rag_pipeline import RAGConfig, build_llm_prompt

config = RAGConfig(
    num_per_engine=12,       # results per search engine
    top_n_fetch=8,           # max pages to fetch
    fetch_timeout=10,        # seconds per page
    total_context_chunks=12, # chunks in final prompt
)

prompt = build_llm_prompt("latest AI news", config=config)

YAML

production-rag-pipeline "latest AI news" --config config.example.yaml

Environment variables

export RAG_TOP_N_FETCH=8
export RAG_FETCH_TIMEOUT=10
production-rag-pipeline "latest AI news"

Step 6: The 50-Line Version

Here's the entire pipeline, from query to LLM-ready prompt, using the module-level API:

from production_rag_pipeline.search import search
from production_rag_pipeline.fetch import fetch_pages_parallel
from production_rag_pipeline.extract import extract_content, chunk_text
from production_rag_pipeline.rerank import rerank_chunks
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

# 1. Search
query = "latest AI news"
results = search(query, num_per_engine=10, lang="en")

# 2. Fetch
urls = [r["url"] for r in results[:8]]
pages = fetch_pages_parallel(urls, timeout=10)

# 3. Extract + Chunk
all_chunks = []
for url, html in pages.items():
    text = extract_content(html, url=url)
    if text:
        chunks = chunk_text(text, url=url)
        all_chunks.extend(chunks)

# 4. Rerank
ranked = rerank_chunks(query, all_chunks, lang="en")

# 5. Build prompt
context, mapping, sources = build_llm_context(
    ranked, results, renumber_sources=True
)
prompt = build_llm_prompt(query, context=context, sources=sources)

print(prompt)

This is what build_llm_prompt("latest AI news") does internally, broken into visible steps.


Graceful Degradation

The pipeline works at every install level:

| Install | Ranking | Extraction | Tradeoff |
|---|---|---|---|
| pip install . | BM25 only | BeautifulSoup | Fastest, least accurate |
| pip install .[extraction] | BM25 only | Trafilatura | Better content quality |
| pip install .[semantic] | BM25 + semantic + MMR | BeautifulSoup | Much better ranking |
| pip install .[full] | BM25 + semantic + cross-encoder + MMR | Trafilatura | Best quality |

No GPU required. Semantic models run on CPU — slower, but functional.


How It Compares to Perplexity

| | Perplexity | production-rag-pipeline |
|---|---|---|
| Index | 200B+ pre-indexed URLs | Real-time Bing + DDG |
| Latency | 358ms median | 8–15s on a MacBook |
| Models | 20+ with dynamic routing | You choose (Ollama, LM Studio, etc.) |
| Inference | Cerebras CS-3, 1,200 tok/s | Your hardware |
| Cost | $20/mo Pro | Free |
| Privacy | Cloud | Local |
| Code | Closed | Open source, MIT |

The gap is real — especially on latency and index size. But for a tool that runs on your laptop, feeds any local model, and costs nothing, the tradeoff is worth it.


Multilingual Support

The pipeline auto-detects language by Cyrillic character ratio (10% threshold):

  • English → all-MiniLM-L6-v2 (fast, English-optimized)
  • Russian → paraphrase-multilingual-MiniLM-L12-v2 (13 languages)

Cross-encoder reranking also switches models per language. No manual configuration needed.

production-rag-pipeline "новости ИИ" --lang ru
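The detection rule is easy to replicate. A sketch of the 10% heuristic (detect_lang is my name for it):

```python
def detect_lang(text: str, threshold: float = 0.10) -> str:
    """Return 'ru' if at least `threshold` of alphabetic characters
    are Cyrillic, otherwise 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return "ru" if cyrillic / len(letters) >= threshold else "en"
```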



What's Next

This is Part 2 of a series:

  • Part 1 — How Perplexity Actually Searches the Internet (architecture teardown)
  • Part 2 — You're reading it (build the local equivalent)

Star the repo if this is useful: github.com/KazKozDev/production_rag_pipeline

Issues and contributions welcome.
