I built a small RAG (Retrieval Augmented Generation) project where a user can ask questions about a PDF, and the LLM answers from that PDF and cites the page number to look at. The stack is LangChain, OpenAI embeddings, and Qdrant running in Docker.
A small note before we start: this same pipeline is what powers web apps like the "AI Tutor" in Educative or an "AI web page builder". The only thing that changes between those products and my PDF Q&A is the data source. That is the key idea to take away.
What RAG is, in one line
Take a document → break it into small chunks → turn each chunk into a vector (a list of numbers) → store those vectors in a database. Later, when the user asks a question, turn the question into a vector too, find the closest chunks, and feed them to an LLM as context.
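To make "find the closest chunks" concrete, here is a minimal sketch of the retrieval step in plain Python. The three-dimensional vectors and the sample chunk texts are made up for illustration; real embeddings from text-embedding-3-small have 1536 dimensions, and a vector database does this search far more efficiently.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    # stored is a list of (chunk_text, vector) pairs.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings", invented for this example only.
stored = [
    ("Variables store values.", [0.9, 0.1, 0.0]),
    ("Loops repeat code.", [0.1, 0.9, 0.0]),
    ("Functions group logic.", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # pretend this is the embedded user question

print(top_k(query, stored, k=1))
```

The chunk whose vector points in nearly the same direction as the query vector wins, which is all "similarity search" means here.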
INDEXING (run once)
───────────────────
PDF file
│
▼
PyPDFLoader
│ one Document per page (text + metadata)
▼
RecursiveCharacterTextSplitter
│ chunk_size = 1000, overlap = 200
▼
~200 small text chunks
│
▼
OpenAIEmbeddings (text-embedding-3-small)
│ each chunk becomes a 1536-dim vector
▼
Qdrant (running in Docker)
│ vector + chunk text + page metadata stored
QUERY (every user question)
───────────────────────────
"explain me about variables"
│
▼
OpenAIEmbeddings
│ same model, same vector space
▼
Query vector (1536-dim)
│
▼
Qdrant similarity_search (cosine, top k = 4)
│ closest 4 chunks come back
▼
Build a prompt with those chunks as context
│
▼
OpenAI Chat Completion (gpt-5-nano)
│
▼
Final answer + page citations
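The "build a prompt with those chunks as context" step in the diagram can be sketched roughly as below. The function name build_prompt, the prompt wording, and the sample chunks are illustrative assumptions, not the exact prompt used in the project.

```python
def build_prompt(question, chunks):
    """Build an LLM prompt from retrieved chunks.

    chunks: list of (page_number, text) pairs, closest match first.
    """
    # Label every chunk with its page so the model can cite it.
    context = "\n\n".join(f"[page {page}]\n{text}" for page, text in chunks)
    return (
        "Answer the question using only the context below. "
        "Mention the page numbers you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up retrieval results for illustration.
retrieved = [
    (12, "A variable is a named place to store a value."),
    (13, "Variables are created with an assignment statement."),
]
prompt = build_prompt("explain me about variables", retrieved)
print(prompt)
```

Keeping the page number next to each chunk is what makes the page citations in the final answer possible.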
The stack and why I picked it
- LangChain is the glue. It gives me one interface to talk to many providers. If I want to swap OpenAI for Cohere, or Qdrant for Pinecone, I change one line. It also has loaders for PDFs, websites, Notion, Google Docs, CSVs, image files, and a lot more. Plus tools for chains, agents, memory, prompts, and output parsing.
- Qdrant is the vector database. I ran it locally in Docker so I don't have to pay for a managed service while learning.
- OpenAI text-embedding-3-small is the embedding model. More on why this one and not the large one further down.
- uv instead of pip. Faster, lockfile-based, and the modern Python experience.
Some things worth pointing out:
- The embedding model is just a client: no API call happens when you create it. The actual call to OpenAI happens inside from_documents.
- from_documents is the convenience method. It connects to Qdrant, creates the collection, embeds every chunk, and inserts it. One call, three jobs.
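For reference, the indexing and query steps described above might look roughly like this. Treat it as a sketch under assumptions: the package names (langchain-community, langchain-text-splitters, langchain-openai, langchain-qdrant, plus pypdf), the file name, the collection name, and the helper names build_index and ask are my choices for illustration, and the exact imports can differ between LangChain versions.

```python
PDF_PATH = "book.pdf"                 # hypothetical file name
QDRANT_URL = "http://localhost:6333"  # Qdrant's default HTTP port
COLLECTION = "pdf_chunks"             # hypothetical collection name


def build_index(pdf_path=PDF_PATH):
    # Imports are inside the function so this file loads even if the
    # LangChain packages are not installed yet.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_openai import OpenAIEmbeddings
    from langchain_qdrant import QdrantVectorStore
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    pages = PyPDFLoader(pdf_path).load()  # one Document per page
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(pages)

    # Only a client is created here; OPENAI_API_KEY is read from the environment.
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # The OpenAI embedding calls and the Qdrant inserts happen in this call.
    return QdrantVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        url=QDRANT_URL,
        collection_name=COLLECTION,
    )


def ask(store, question, k=4):
    # Returns (page, text) pairs for the k closest chunks.
    results = store.similarity_search(question, k=k)
    return [(doc.metadata.get("page"), doc.page_content) for doc in results]
```

Running build_index needs a reachable Qdrant instance and a valid OpenAI key, so it is only worth calling once the Docker container is up.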
Now the real question — how big a PDF can I actually ingest?
This is the part I couldn't find a straight answer to in any tutorial. The pipeline above worked beautifully for my 123 KB, 71-page PDF. But what about a 100 MB book? A 1 GB legal document dump?
The answer has many layers. There is no single PDF size limit.
| Layer | Hard limit? | What actually matters |
|---|---|---|
| PyPDFLoader | No | RAM. The whole PDF gets loaded into memory before chunking. |
| RecursiveCharacterTextSplitter | No | None. It just splits whatever you give it. |
| OpenAI embeddings API | Yes | Tokens per input + rate limits per minute + your wallet. |
| Qdrant | Practically no | Designed for millions of vectors. |
| Your laptop | Yes | RAM, disk, and patience. |
The actual bottleneck is almost always OpenAI, not Python or Qdrant.
What OpenAI actually limits
There are three things to watch on the embeddings API.
1. Max tokens per single input — 8,191
Each chunk you send to be embedded can be at most 8,191 tokens, which is roughly 32,000 characters. My chunk_size=1000 is measured in characters (roughly 250 tokens), so it is far below that and this limit is never hit in normal RAG. Official source: https://platform.openai.com/docs/guides/embeddings
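A quick way to sanity-check a chunk against the per-input limit is the rough rule of about 4 characters per token. That ratio is an assumption that holds loosely for English text; the helper names below are illustrative.

```python
import math

MAX_TOKENS_PER_INPUT = 8191  # per-input limit quoted above
CHARS_PER_TOKEN = 4          # rule-of-thumb assumption for English text


def estimated_tokens(text):
    # Rough estimate only; a real tokenizer gives the exact count.
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_one_input(text):
    return estimated_tokens(text) <= MAX_TOKENS_PER_INPUT


chunk = "a" * 1000  # stand-in for a 1000-character chunk
print(estimated_tokens(chunk), fits_in_one_input(chunk))
```

For exact counts, a tokenizer library such as tiktoken can replace the estimate.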
2. Rate limits — RPM (requests per minute) and TPM (tokens per minute)
These depend on your account tier. Most new accounts start at Tier 1. The limits per tier look like this:
| Tier | RPM | TPM | Batch queue limit |
|---|---|---|---|
| Free | 100 | 40,000 | — |
| Tier 1 | 3,000 | 1,000,000 | 3,000,000 |
| Tier 2 | 5,000 | 1,000,000 | 20,000,000 |
| Tier 3 | 5,000 | 5,000,000 | 100,000,000 |
| Tier 4 | 10,000 | 5,000,000 | 500,000,000 |
| Tier 5 | 10,000 | 10,000,000 | 4,000,000,000 |
Your tier moves up automatically as you spend more on the platform.
- Official rate limit docs: https://platform.openai.com/docs/guides/rate-limits/usage-tiers
- Your own account's limits: https://platform.openai.com/account/limits
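When you hit the TPM limit, the OpenAI client that LangChain uses backs off and retries automatically, up to a limited number of attempts (the max_retries setting). So a brief rate-limit hit usually doesn't crash your script. It just takes longer.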
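To get a feel for "it just takes longer", the TPM limit gives a lower bound on how long embedding a document takes. This is a back-of-the-envelope sketch: it ignores the RPM limit, network latency, and retries, and the function name is illustrative.

```python
import math


def min_minutes_to_embed(total_tokens, tokens_per_minute):
    # Best case: every minute is used at the full TPM allowance.
    return math.ceil(total_tokens / tokens_per_minute)


# Using the TPM figures from the tier table above.
print(min_minutes_to_embed(50_000_000, 1_000_000))  # Tier 1
print(min_minutes_to_embed(50_000_000, 40_000))     # Free
```

With 50 million tokens, Tier 1 needs at least 50 minutes, while the Free tier needs at least 1,250 minutes, which is almost 21 hours.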
3. The cost — $0.02 per 1M tokens for text-embedding-3-small
This is the most underrated number. A 100 MB text PDF is roughly 20–50 million tokens. That works out to about $0.40 to $1.00 in embedding cost. Real-world cheap. Pricing page: https://platform.openai.com/docs/pricing
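The arithmetic behind that estimate, as a small sketch. The price constant is the one quoted above and may change, and the token counts are the rough 20 to 50 million range for a 100 MB text PDF.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small, as quoted above


def embedding_cost_usd(total_tokens):
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


for tokens in (20_000_000, 50_000_000):
    print(f"{tokens:,} tokens -> ${embedding_cost_usd(tokens):.2f}")
```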