stabem
Build a RAG Pipeline in 5 Minutes with Python (No Scraping Headaches)

If you've built a RAG (Retrieval Augmented Generation) pipeline, you know the dirty secret: 90% of the work is getting clean data, not the "AI" part.

Vector databases? Easy. Embedding models? Plug and play. But getting clean, structured text from YouTube videos, web pages, and social media? That's where projects go to die.

In this tutorial, I'll show you how to build a complete RAG pipeline in under 5 minutes using Python — with content ingestion that actually works.

The Architecture

URLs → ContentAPI → Clean Text → Embeddings → Vector DB → LLM → Answer

We'll use:

  • ContentAPI for content extraction (free tier: 5,000 req/month)
  • OpenAI for embeddings and chat completion
  • ChromaDB for vector storage (local, no setup)
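
One housekeeping note before the code: the snippets below hardcode API keys for brevity. In practice, read them from environment variables so they never end up in version control. A minimal helper (my addition, not part of the original snippets — the variable names are assumptions):

```python
import os

def load_key(name: str) -> str:
    """Fetch an API key from the environment, failing loudly if it's missing."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Set {name} before running the pipeline")
    return value

# e.g. client = ContentAPI(api_key=load_key("CONTENTAPI_KEY"))
```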

Step 1: Install Dependencies

pip install contentapi openai chromadb

Step 2: Extract Content from Multiple Sources

Here's the magic — ContentAPI handles YouTube, web pages, Reddit, and Twitter with one API:

from contentapi import ContentAPI

client = ContentAPI(api_key="your_contentapi_key")

# Define your knowledge sources — mix and match!
sources = [
    "https://youtube.com/watch?v=example1",          # YouTube video
    "https://docs.python.org/3/tutorial/classes.html", # Web page
    "https://reddit.com/r/MachineLearning/comments/abc", # Reddit thread
    "https://example.com/blog/rag-best-practices",     # Blog post
]

# Batch extract — one API call for up to 10 URLs
results = client.batch(sources)

documents = []
for result in results:
    if result.get("success"):
        data = result["data"]
        documents.append({
            "content": data.get("full_text") or data.get("content", ""),
            "title": data.get("title", ""),
            "url": data.get("url", ""),
            "source": data.get("source", "web"),
        })

print(f"Extracted {len(documents)} documents")

That's all the scraping code you need. No Beautiful Soup, no Selenium, no YouTube API keys, no Reddit OAuth tokens.

Step 3: Chunk the Content

For RAG, we need to split documents into digestible chunks:

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

all_chunks = []
for doc in documents:
    chunks = chunk_text(doc["content"])
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "text": chunk,
            "metadata": {
                "title": doc["title"],
                "url": doc["url"],
                "source": doc["source"],
                "chunk_index": i,
            }
        })

print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
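One caveat: the word-based splitter above can cut a sentence in half at a chunk boundary. If that matters for your retrieval quality, here's a sentence-aware variant — a sketch using a naive regex split, not something the libraries above provide:

```python
import re

def chunk_sentences(text, max_words=500):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Naive sentence split: break after ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would overflow it
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` still becomes its own oversized chunk — acceptable for a prototype, worth handling in production.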

Step 4: Embed and Store in ChromaDB

import chromadb
from openai import OpenAI

openai_client = OpenAI(api_key="your_openai_key")
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error if the script is re-run
collection = chroma_client.get_or_create_collection("knowledge_base")

# Embed all chunks
batch_size = 100
for i in range(0, len(all_chunks), batch_size):
    batch = all_chunks[i:i + batch_size]
    texts = [c["text"] for c in batch]

    # Get embeddings from OpenAI
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    embeddings = [e.embedding for e in response.data]

    collection.add(
        ids=[f"chunk_{i+j}" for j in range(len(batch))],
        embeddings=embeddings,
        documents=texts,
        metadatas=[c["metadata"] for c in batch],
    )

print(f"Stored {len(all_chunks)} chunks in ChromaDB")

Step 5: Query Your Knowledge Base

def ask(question):
    """Ask a question against your knowledge base."""

    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question]
    ).data[0].embedding

    # Find relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=5
    )

    # Build context from retrieved chunks, labeling each with its source URL
    # so the model can actually cite sources as the system prompt asks
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    context = "\n\n---\n\n".join(
        f"Source: {m['url']}\n{d}" for d, m in zip(docs, metas)
    )
    sources = [m["url"] for m in metas]

    # Generate answer with GPT
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided context. "
                           "Cite sources when possible. If the context doesn't "
                           "contain the answer, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )

    answer = response.choices[0].message.content
    return answer, list(set(sources))

# Try it!
answer, sources = ask("What are the best practices for RAG pipelines?")
print(f"Answer: {answer}")
print(f"Sources: {sources}")
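ChromaDB does the similarity search for you, but it helps to know there's no magic: `collection.query` is essentially cosine similarity between the question embedding and every stored chunk embedding. A toy illustration (not ChromaDB's actual implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, chunk_embs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cosine(query_emb, chunk_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Real vector databases use approximate-nearest-neighbor indexes instead of this brute-force scan, which is why they stay fast at millions of chunks.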

The Full Pipeline in ~50 Lines

Here it is, condensed:

from contentapi import ContentAPI
from openai import OpenAI
import chromadb

# Initialize clients
content_client = ContentAPI(api_key="your_contentapi_key")
openai_client = OpenAI(api_key="your_openai_key")
collection = chromadb.Client().get_or_create_collection("kb")

# 1. Extract content from any URLs
urls = ["https://youtube.com/watch?v=...", "https://example.com/article"]
results = content_client.batch(urls)

# 2. Chunk and embed
chunks = []
for r in results:
    if r.get("success"):
        text = r["data"].get("full_text") or r["data"].get("content", "")
        words = text.split()
        for i in range(0, len(words), 450):
            chunks.append(" ".join(words[i:i+500]))

embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small", input=chunks
).data

collection.add(
    ids=[f"c{i}" for i in range(len(chunks))],
    embeddings=[e.embedding for e in embeddings],
    documents=chunks,
)

# 3. Query
q = "What did the video discuss?"
q_emb = openai_client.embeddings.create(
    model="text-embedding-3-small", input=[q]
).data[0].embedding

context = "\n".join(collection.query(query_embeddings=[q_emb], n_results=5)["documents"][0])

answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": q}
    ]
).choices[0].message.content

print(answer)

Why Not Just Use Beautiful Soup / Selenium?

You could — and I did for years. Here's why I stopped:

                      Traditional Scraping               ContentAPI
YouTube transcripts   Needs YouTube API key + quotas     ✅ Works out of the box
JS-rendered pages     Needs Selenium/Playwright          ✅ Handled server-side
Twitter/X             $100/mo minimum API cost           ✅ Included
Reddit                OAuth setup + rate limits          ✅ Included
Maintenance           Breaks when sites change           ✅ Maintained for you
Setup time            Hours per source                   ✅ 5 minutes total

Advanced: Continuous Ingestion with Crawling

For larger knowledge bases, use the crawl endpoint to ingest entire sites:

# Crawl an entire documentation site
crawl = content_client.crawl(
    url="https://docs.example.com",
    max_pages=100,
    include_patterns=["/docs/*", "/guides/*"],
    format="markdown"
)

# Wait for completion, then process all pages
for page in crawl.results:
    # Same chunking + embedding logic as above
    process_document(page.content, page.url)
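The loop above assumes a `process_document` helper. Here's a minimal sketch that reuses the chunking logic from Step 3 and tags each chunk with its page URL — embedding and storage then proceed exactly as in Step 4:

```python
def process_document(text, url, chunk_size=500, overlap=50):
    """Chunk one crawled page and tag each chunk with its source URL."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        piece = " ".join(words[i:i + chunk_size])
        if piece.strip():
            chunks.append({"text": piece, "metadata": {"url": url}})
    return chunks
```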

Advanced: AI-Powered Structured Extraction

Need specific data points from pages? Use schema-based extraction:

# Extract structured data for product comparison
product_data = content_client.extract(
    url="https://example.com/product",
    schema={
        "name": "string",
        "price": "number",
        "rating": "number",
        "pros": ["string"],
        "cons": ["string"]
    }
)

This uses AI to understand the page and extract exactly the fields you need — no CSS selectors or XPath required.
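
What you do with the result is up to you. Assuming `extract()` returns a plain dict matching the schema (the exact response shape may differ — check the docs), you could render comparison rows like this:

```python
# Illustrative stand-in for the extract() response (shape assumed from the schema)
sample_product = {
    "name": "Example Widget",
    "price": 49.99,
    "rating": 4.5,
    "pros": ["Fast setup", "Good docs"],
    "cons": ["Limited free tier"],
}

def format_comparison_row(p):
    """Render one extracted product as a single comparison-table line."""
    return (f"{p['name']} | ${p['price']:.2f} | {p['rating']}/5 | "
            f"{len(p['pros'])} pros, {len(p['cons'])} cons")
```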

Getting Started

  1. Sign up at getcontentapi.com (free, no credit card)
  2. pip install contentapi
  3. Start extracting content

The free tier gives you 5,000 requests/month — enough for a serious RAG prototype.

Full docs: getcontentapi.com/docs


What are you building with RAG? I'd love to see how people use this in their pipelines. Share your projects in the comments! 🚀
