If you've built a RAG (Retrieval Augmented Generation) pipeline, you know the dirty secret: 90% of the work is getting clean data, not the "AI" part.
Vector databases? Easy. Embedding models? Plug and play. But getting clean, structured text from YouTube videos, web pages, and social media? That's where projects go to die.
In this tutorial, I'll show you how to build a complete RAG pipeline in under 5 minutes using Python — with content ingestion that actually works.
## The Architecture

```
URLs → ContentAPI → Clean Text → Embeddings → Vector DB → LLM → Answer
```
We'll use:
- ContentAPI for content extraction (free tier: 5,000 req/month)
- OpenAI for embeddings and chat completion
- ChromaDB for vector storage (local, no setup)
## Step 1: Install Dependencies

```bash
pip install contentapi openai chromadb
```
## Step 2: Extract Content from Multiple Sources
Here's the magic — ContentAPI handles YouTube, web pages, Reddit, and Twitter with one API:
```python
from contentapi import ContentAPI

client = ContentAPI(api_key="your_contentapi_key")

# Define your knowledge sources — mix and match!
sources = [
    "https://youtube.com/watch?v=example1",               # YouTube video
    "https://docs.python.org/3/tutorial/classes.html",    # Web page
    "https://reddit.com/r/MachineLearning/comments/abc",  # Reddit thread
    "https://example.com/blog/rag-best-practices",        # Blog post
]

# Batch extract — one API call for up to 10 URLs
results = client.batch(sources)

documents = []
for result in results:
    if result.get("success"):
        data = result["data"]
        documents.append({
            "content": data.get("full_text") or data.get("content", ""),
            "title": data.get("title", ""),
            "url": data.get("url", ""),
            "source": data.get("source", "web"),
        })

print(f"Extracted {len(documents)} documents")
```
That's all the scraping code you need. No Beautiful Soup, no Selenium, no YouTube API keys, no Reddit OAuth tokens.
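One thing the loop above glosses over: batch extraction can partially fail, and you'll usually want to retry the failures. Here's a small helper that splits results into documents and retryable URLs — the per-result shape (`success`, `data`, `url`) is assumed from the snippet above, so check it against the actual response format:

```python
def partition_results(results):
    """Split batch extraction results into (documents, failed_urls)."""
    documents, failed = [], []
    for result in results:
        if result.get("success"):
            data = result["data"]
            documents.append({
                "content": data.get("full_text") or data.get("content", ""),
                "title": data.get("title", ""),
                "url": data.get("url", ""),
                "source": data.get("source", "web"),
            })
        else:
            # Keep the URL so a later pass can retry it
            failed.append(result.get("url", "unknown"))
    return documents, failed

# Works on plain dicts, so it's easy to test without hitting the API:
sample = [
    {"success": True, "data": {"full_text": "Hello world", "title": "Hi",
                               "url": "https://example.com", "source": "web"}},
    {"success": False, "url": "https://example.com/404"},
]
docs, failed = partition_results(sample)
print(len(docs), failed)  # → 1 ['https://example.com/404']
```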
## Step 3: Chunk the Content
For RAG, we need to split documents into digestible chunks:
```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

all_chunks = []
for doc in documents:
    chunks = chunk_text(doc["content"])
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "text": chunk,
            "metadata": {
                "title": doc["title"],
                "url": doc["url"],
                "source": doc["source"],
                "chunk_index": i,
            }
        })

print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
```
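To sanity-check the overlap behavior, here's the same chunking logic run on synthetic input — pure Python, no API calls (the function is reproduced so the snippet stands alone):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks (same logic as above)."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# 1,200 numbered "words" make the overlap easy to inspect
text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(text)

print(len(chunks))           # 3 chunks, starting at words 0, 450, 900
print(chunks[1].split()[0])  # w450 — each chunk starts 50 words before
                             # the previous one ends
```

That 50-word overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.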
## Step 4: Embed and Store in ChromaDB
```python
import chromadb
from openai import OpenAI

openai_client = OpenAI(api_key="your_openai_key")
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("knowledge_base")

# Embed all chunks
batch_size = 100
for i in range(0, len(all_chunks), batch_size):
    batch = all_chunks[i:i + batch_size]
    texts = [c["text"] for c in batch]

    # Get embeddings from OpenAI
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [e.embedding for e in response.data]

    collection.add(
        ids=[f"chunk_{i+j}" for j in range(len(batch))],
        embeddings=embeddings,
        documents=texts,
        metadatas=[c["metadata"] for c in batch],
    )

print(f"Stored {len(all_chunks)} chunks in ChromaDB")
```
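Before embedding a large corpus, it's worth ballparking the cost. A common rule of thumb is roughly 1.3 tokens per English word; the per-token price below is an assumption I've baked in as a parameter — check OpenAI's current pricing page rather than trusting the default:

```python
def estimate_embedding_cost(chunks, tokens_per_word=1.3,
                            usd_per_million_tokens=0.02):
    """Rough embedding cost estimate.

    Both rates are assumptions (heuristic token ratio, placeholder price),
    not exact figures.
    """
    total_words = sum(len(text.split()) for text in chunks)
    est_tokens = int(total_words * tokens_per_word)
    return est_tokens, est_tokens / 1_000_000 * usd_per_million_tokens

# Example: 200 chunks of ~500 words each
fake_chunks = ["word " * 500 for _ in range(200)]
tokens, cost = estimate_embedding_cost(fake_chunks)
print(tokens, round(cost, 4))  # → 130000 0.0026
```

Fractions of a cent for a 100,000-word corpus — embeddings are rarely the expensive part of a RAG pipeline.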
## Step 5: Query Your Knowledge Base
```python
def ask(question):
    """Ask a question against your knowledge base."""
    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question]
    ).data[0].embedding

    # Find relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=5
    )

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(results["documents"][0])
    sources = [m["url"] for m in results["metadatas"][0]]

    # Generate answer with GPT
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided context. "
                           "Cite sources when possible. If the context doesn't "
                           "contain the answer, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    answer = response.choices[0].message.content
    return answer, list(set(sources))

# Try it!
answer, sources = ask("What are the best practices for RAG pipelines?")
print(f"Answer: {answer}")
print(f"Sources: {sources}")
```
## The Full Pipeline in ~50 Lines
Here it is, condensed:
```python
from contentapi import ContentAPI
from openai import OpenAI
import chromadb

# Initialize clients
content_client = ContentAPI(api_key="your_contentapi_key")
openai_client = OpenAI(api_key="your_openai_key")
collection = chromadb.Client().create_collection("kb")

# 1. Extract content from any URLs
urls = ["https://youtube.com/watch?v=...", "https://example.com/article"]
results = content_client.batch(urls)

# 2. Chunk and embed
chunks = []
for r in results:
    if r.get("success"):
        text = r["data"].get("full_text") or r["data"].get("content", "")
        words = text.split()
        for i in range(0, len(words), 450):
            chunks.append(" ".join(words[i:i+500]))

embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small", input=chunks
).data
collection.add(
    ids=[f"c{i}" for i in range(len(chunks))],
    embeddings=[e.embedding for e in embeddings],
    documents=chunks,
)

# 3. Query
q = "What did the video discuss?"
q_emb = openai_client.embeddings.create(
    model="text-embedding-3-small", input=[q]
).data[0].embedding
context = "\n".join(collection.query(query_embeddings=[q_emb], n_results=5)["documents"][0])
answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": q}
    ]
).choices[0].message.content
print(answer)
```
## Why Not Just Use Beautiful Soup / Selenium?
You could — and I did for years. Here's why I stopped:
| | Traditional Scraping | ContentAPI |
|---|---|---|
| YouTube transcripts | Need YouTube API key + quotas | ✅ Works out of the box |
| JS-rendered pages | Need Selenium/Playwright | ✅ Handled server-side |
| Twitter/X | $100/mo minimum API cost | ✅ Included |
| Reddit | OAuth setup + rate limits | ✅ Included |
| Maintenance | Breaks when sites change | ✅ Maintained for you |
| Setup time | Hours per source | ✅ 5 minutes total |
## Advanced: Continuous Ingestion with Crawling
For larger knowledge bases, use the crawl endpoint to ingest entire sites:
```python
# Crawl an entire documentation site
crawl = content_client.crawl(
    url="https://docs.example.com",
    max_pages=100,
    include_patterns=["/docs/*", "/guides/*"],
    format="markdown"
)

# Wait for completion, then process all pages
for page in crawl.results:
    # Same chunking + embedding logic as above
    process_document(page.content, page.url)
```
## Advanced: AI-Powered Structured Extraction
Need specific data points from pages? Use schema-based extraction:
```python
# Extract structured data for product comparison
product_data = content_client.extract(
    url="https://example.com/product",
    schema={
        "name": "string",
        "price": "number",
        "rating": "number",
        "pros": ["string"],
        "cons": ["string"]
    }
)
```
This uses AI to understand the page and extract exactly the fields you need — no CSS selectors or XPath required.
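AI extraction can still return the wrong shape, so it's worth validating the dict before it enters your pipeline. Here's a minimal checker for the simple schema format shown above — a hypothetical helper of mine, not part of ContentAPI:

```python
def matches_schema(data, schema):
    """Validate a dict against the simple schema format above:
    "string" / "number" for scalars, ["string"] for lists of strings."""
    type_map = {"string": str, "number": (int, float)}
    for field, expected in schema.items():
        value = data.get(field)
        if isinstance(expected, list):
            # e.g. ["string"] means "list of strings"
            if not isinstance(value, list):
                return False
            if not all(isinstance(v, type_map[expected[0]]) for v in value):
                return False
        elif not isinstance(value, type_map[expected]):
            return False
    return True

schema = {"name": "string", "price": "number", "pros": ["string"]}
good = {"name": "Widget", "price": 9.99, "pros": ["cheap", "sturdy"]}
bad = {"name": "Widget", "price": "9.99", "pros": ["cheap"]}  # price is a str
print(matches_schema(good, schema), matches_schema(bad, schema))  # → True False
```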
## Getting Started

- Sign up at getcontentapi.com (free, no credit card)
- Install the client: `pip install contentapi`
- Start extracting content
The free tier gives you 5,000 requests/month — enough for a serious RAG prototype.
Full docs: getcontentapi.com/docs
What are you building with RAG? I'd love to see how people use this in their pipelines. Share your projects in the comments! 🚀