stabem
Build a RAG Pipeline in 5 Minutes with Python (No Scraping Headaches)

If you've built a RAG (Retrieval Augmented Generation) pipeline, you know the dirty secret: 90% of the work is getting clean data, not the "AI" part.

Vector databases? Easy. Embedding models? Plug and play. But getting clean, structured text from YouTube videos, web pages, and social media? That's where projects go to die.

In this tutorial, I'll show you how to build a complete RAG pipeline in under 5 minutes using Python — with content ingestion that actually works.

The Architecture

URLs → ContentAPI → Clean Text → Embeddings → Vector DB → LLM → Answer

We'll use:

  • ContentAPI for content extraction (free tier: 5,000 req/month)
  • OpenAI for embeddings and chat completion
  • ChromaDB for vector storage (local, no setup)
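
One housekeeping note before the code: the snippets below hardcode API keys for brevity. In practice, read them from environment variables so they never end up in version control. A minimal helper (my addition, not part of the original snippets — the variable names are assumptions):

```python
import os

def load_key(name: str) -> str:
    """Fetch an API key from the environment, failing loudly if it's missing."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Set {name} before running the pipeline")
    return value

# e.g. client = ContentAPI(api_key=load_key("CONTENTAPI_KEY"))
```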

Step 1: Install Dependencies

pip install contentapi openai chromadb

Step 2: Extract Content from Multiple Sources

Here's the magic — ContentAPI handles YouTube, web pages, Reddit, and Twitter with one API:

from contentapi import ContentAPI

client = ContentAPI(api_key="your_contentapi_key")

# Define your knowledge sources — mix and match!
sources = [
    "https://youtube.com/watch?v=example1",          # YouTube video
    "https://docs.python.org/3/tutorial/classes.html", # Web page
    "https://reddit.com/r/MachineLearning/comments/abc", # Reddit thread
    "https://example.com/blog/rag-best-practices",     # Blog post
]

# Batch extract — one API call for up to 10 URLs
results = client.batch(sources)

documents = []
for result in results:
    if result.get("success"):
        data = result["data"]
        documents.append({
            "content": data.get("full_text") or data.get("content", ""),
            "title": data.get("title", ""),
            "url": data.get("url", ""),
            "source": data.get("source", "web"),
        })

print(f"Extracted {len(documents)} documents")

That's all the scraping code you need. No Beautiful Soup, no Selenium, no YouTube API keys, no Reddit OAuth tokens.

Step 3: Chunk the Content

For RAG, we need to split documents into digestible chunks:

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

all_chunks = []
for doc in documents:
    chunks = chunk_text(doc["content"])
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "text": chunk,
            "metadata": {
                "title": doc["title"],
                "url": doc["url"],
                "source": doc["source"],
                "chunk_index": i,
            }
        })

print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
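One caveat: the word-based splitter above can cut a sentence in half at a chunk boundary. If that matters for your retrieval quality, here's a sentence-aware variant — a sketch using a naive regex split, not something the libraries above provide:

```python
import re

def chunk_sentences(text, max_words=500):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Naive sentence split: break after ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would overflow it
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` still becomes its own oversized chunk — acceptable for a prototype, worth handling in production.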

Step 4: Embed and Store in ChromaDB

import chromadb
from openai import OpenAI

openai_client = OpenAI(api_key="your_openai_key")
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error if the script is re-run
collection = chroma_client.get_or_create_collection("knowledge_base")

# Embed all chunks
batch_size = 100
for i in range(0, len(all_chunks), batch_size):
    batch = all_chunks[i:i + batch_size]
    texts = [c["text"] for c in batch]

    # Get embeddings from OpenAI
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    embeddings = [e.embedding for e in response.data]

    collection.add(
        ids=[f"chunk_{i+j}" for j in range(len(batch))],
        embeddings=embeddings,
        documents=texts,
        metadatas=[c["metadata"] for c in batch],
    )

print(f"Stored {len(all_chunks)} chunks in ChromaDB")

Step 5: Query Your Knowledge Base

def ask(question):
    """Ask a question against your knowledge base."""

    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question]
    ).data[0].embedding

    # Find relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=5
    )

    # Build context from retrieved chunks, labeling each with its source URL
    # so the model can actually cite sources as the system prompt asks
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    context = "\n\n---\n\n".join(
        f"Source: {m['url']}\n{d}" for d, m in zip(docs, metas)
    )
    sources = [m["url"] for m in metas]

    # Generate answer with GPT
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided context. "
                           "Cite sources when possible. If the context doesn't "
                           "contain the answer, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )

    answer = response.choices[0].message.content
    return answer, list(set(sources))

# Try it!
answer, sources = ask("What are the best practices for RAG pipelines?")
print(f"Answer: {answer}")
print(f"Sources: {sources}")
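ChromaDB does the similarity search for you, but it helps to know there's no magic: `collection.query` is essentially cosine similarity between the question embedding and every stored chunk embedding. A toy illustration (not ChromaDB's actual implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, chunk_embs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cosine(query_emb, chunk_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Real vector databases use approximate-nearest-neighbor indexes instead of this brute-force scan, which is why they stay fast at millions of chunks.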

The Full Pipeline in ~50 Lines

Here it is, condensed:

from contentapi import ContentAPI
from openai import OpenAI
import chromadb

# Initialize clients
content_client = ContentAPI(api_key="your_contentapi_key")
openai_client = OpenAI(api_key="your_openai_key")
collection = chromadb.Client().get_or_create_collection("kb")

# 1. Extract content from any URLs
urls = ["https://youtube.com/watch?v=...", "https://example.com/article"]
results = content_client.batch(urls)

# 2. Chunk and embed
chunks = []
for r in results:
    if r.get("success"):
        text = r["data"].get("full_text") or r["data"].get("content", "")
        words = text.split()
        for i in range(0, len(words), 450):
            chunks.append(" ".join(words[i:i+500]))

embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small", input=chunks
).data

collection.add(
    ids=[f"c{i}" for i in range(len(chunks))],
    embeddings=[e.embedding for e in embeddings],
    documents=chunks,
)

# 3. Query
q = "What did the video discuss?"
q_emb = openai_client.embeddings.create(
    model="text-embedding-3-small", input=[q]
).data[0].embedding

context = "\n".join(collection.query(query_embeddings=[q_emb], n_results=5)["documents"][0])

answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": q}
    ]
).choices[0].message.content

print(answer)

Why Not Just Use Beautiful Soup / Selenium?

You could — and I did for years. Here's why I stopped:

                      Traditional Scraping               ContentAPI
YouTube transcripts   Needs YouTube API key + quotas     ✅ Works out of the box
JS-rendered pages     Needs Selenium/Playwright          ✅ Handled server-side
Twitter/X             $100/mo minimum API cost           ✅ Included
Reddit                OAuth setup + rate limits          ✅ Included
Maintenance           Breaks when sites change           ✅ Maintained for you
Setup time            Hours per source                   ✅ 5 minutes total

Advanced: Continuous Ingestion with Crawling

For larger knowledge bases, use the crawl endpoint to ingest entire sites:

# Crawl an entire documentation site
crawl = content_client.crawl(
    url="https://docs.example.com",
    max_pages=100,
    include_patterns=["/docs/*", "/guides/*"],
    format="markdown"
)

# Wait for completion, then process all pages
for page in crawl.results:
    # Same chunking + embedding logic as above
    process_document(page.content, page.url)
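The loop above assumes a `process_document` helper. Here's a minimal sketch that reuses the chunking logic from Step 3 and tags each chunk with its page URL — embedding and storage then proceed exactly as in Step 4:

```python
def process_document(text, url, chunk_size=500, overlap=50):
    """Chunk one crawled page and tag each chunk with its source URL."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        piece = " ".join(words[i:i + chunk_size])
        if piece.strip():
            chunks.append({"text": piece, "metadata": {"url": url}})
    return chunks
```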

Advanced: AI-Powered Structured Extraction

Need specific data points from pages? Use schema-based extraction:

# Extract structured data for product comparison
product_data = content_client.extract(
    url="https://example.com/product",
    schema={
        "name": "string",
        "price": "number",
        "rating": "number",
        "pros": ["string"],
        "cons": ["string"]
    }
)

This uses AI to understand the page and extract exactly the fields you need — no CSS selectors or XPath required.
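
What you do with the result is up to you. Assuming `extract()` returns a plain dict matching the schema (the exact response shape may differ — check the docs), you could render comparison rows like this:

```python
# Illustrative stand-in for the extract() response (shape assumed from the schema)
sample_product = {
    "name": "Example Widget",
    "price": 49.99,
    "rating": 4.5,
    "pros": ["Fast setup", "Good docs"],
    "cons": ["Limited free tier"],
}

def format_comparison_row(p):
    """Render one extracted product as a single comparison-table line."""
    return (f"{p['name']} | ${p['price']:.2f} | {p['rating']}/5 | "
            f"{len(p['pros'])} pros, {len(p['cons'])} cons")
```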

Getting Started

  1. Sign up at getcontentapi.com (free, no credit card)
  2. pip install contentapi
  3. Start extracting content

The free tier gives you 5,000 requests/month — enough for a serious RAG prototype.

Full docs: getcontentapi.com/docs


What are you building with RAG? I'd love to see how people use this in their pipelines. Share your projects in the comments! 🚀
