DEV Community

Suraj Bera


# Day 5 of learning AI Engineering: built a small RAG app over a PDF

I built a small RAG (Retrieval-Augmented Generation) project: a user asks questions about a PDF, and the LLM answers from that PDF and cites the page numbers to look at. The stack is LangChain, OpenAI embeddings, and Qdrant running in Docker.

A small note before we start: the same kind of pipeline sits behind web apps such as the AI tutor on Educative or an AI web page builder. The main thing that changes between those products and my PDF Q&A is the data source. That is the key idea to take away.

What RAG is, in one line

Take a document → break it into small chunks → turn each chunk into a vector (a list of numbers) → store those vectors in a database. Later, when the user asks a question, turn the question into a vector too, find the closest chunks, and feed them to an LLM as context.

INDEXING (run once)
───────────────────

   PDF file
      │
      ▼
   PyPDFLoader
      │  one Document per page (text + metadata)
      ▼
   RecursiveCharacterTextSplitter
      │  chunk_size = 1000, overlap = 200
      ▼
   ~200 small text chunks
      │
      ▼
   OpenAIEmbeddings (text-embedding-3-small)
      │  each chunk becomes a 1536-dim vector
      ▼
   Qdrant (running in Docker)
      │  vector + chunk text + page metadata stored


QUERY (every user question)
───────────────────────────

   "explain me about variables"
      │
      ▼
   OpenAIEmbeddings
      │  same model, same vector space
      ▼
   Query vector (1536-dim)
      │
      ▼
   Qdrant similarity_search (cosine, top k = 4)
      │  closest 4 chunks come back
      ▼
   Build a prompt with those chunks as context
      │
      ▼
   OpenAI Chat Completion (gpt-5-nano)
      │
      ▼
   Final answer + page citations
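
The similarity search in the query step can be illustrated without any external service. This is a toy sketch, not how Qdrant works internally: it does a brute-force cosine comparison, and the 3-dimensional vectors and chunk texts are made up for illustration (real embeddings from text-embedding-3-small have 1536 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (text, vector) pairs most similar to the query vector."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical chunks with made-up 3-dim "embeddings".
stored_chunks = [
    ("Variables store values.", [0.9, 0.1, 0.0]),
    ("Loops repeat a block.", [0.1, 0.9, 0.0]),
    ("A variable has a name and a type.", [0.8, 0.2, 0.1]),
    ("Functions group statements.", [0.0, 0.2, 0.9]),
]

# Pretend this is the embedded question "explain me about variables".
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, stored_chunks, k=2)
for text, _ in results:
    print(text)
```

The two chunks about variables rank first because their vectors point in nearly the same direction as the query vector. That is all "closest chunks" means here.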

The stack and why I picked it

  • LangChain is the glue. It gives me one interface to many providers, so swapping OpenAI for Cohere, or Qdrant for Pinecone, is usually a change of a line or two rather than a rewrite. It also ships loaders for PDFs, websites, Notion, Google Docs, CSVs, image files, and more, plus tools for chains, agents, memory, prompts, and output parsing.
  • Qdrant is the vector database. I ran it locally in Docker so I don't have to pay for a managed service while learning.
  • OpenAI text-embedding-3-small is the embedding model. More on why this one and not the large one further down.
  • uv instead of pip. It is faster and lockfile-based, so installs are reproducible.

Some things worth pointing out:

  • The embedding model object is just a client. No API call happens when you create it. The actual calls to OpenAI happen inside from_documents.
  • from_documents is the convenience method. It connects to Qdrant, creates the collection, embeds every chunk, and inserts the results. One call, four jobs.
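
For reference, here is a minimal sketch of what the indexing step can look like. It is a reconstruction under assumptions, not necessarily the exact code behind this post: the package names assume a recent LangChain install (langchain-community, langchain-text-splitters, langchain-openai, langchain-qdrant), and the collection name and Qdrant URL are placeholders.

```python
CHUNK_SIZE = 1000     # characters per chunk, as in the diagram above
CHUNK_OVERLAP = 200   # characters shared between neighbouring chunks


def build_index(pdf_path, collection_name="pdf_chunks",
                qdrant_url="http://localhost:6333"):
    """Load a PDF, chunk it, embed the chunks, and store them in Qdrant.

    collection_name and qdrant_url are assumed placeholder values.
    """
    # Imports sit inside the function so the module can be read and
    # imported even when these packages are not installed.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings
    from langchain_qdrant import QdrantVectorStore

    # One Document per page, with the page number in the metadata.
    pages = PyPDFLoader(pdf_path).load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    )
    chunks = splitter.split_documents(pages)

    # Creating the client makes no API call (it does expect OPENAI_API_KEY).
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # This is where the embedding requests and the Qdrant inserts happen.
    return QdrantVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        url=qdrant_url,
        collection_name=collection_name,
    )
```

Calling build_index("my.pdf") needs a running Qdrant container and an OpenAI API key in the environment. The returned store can then be queried with similarity_search(question, k=4).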

Now the real question — how big a PDF can I actually ingest?

This is the part I couldn't find a straight answer to in any tutorial. The pipeline above worked beautifully for my 123 KB, 71-page PDF. But what about a 100 MB book? A 1 GB legal document dump?

The answer has many layers. There is no single PDF size limit.

| Layer | Hard limit? | What actually matters |
| --- | --- | --- |
| PyPDFLoader | No | RAM. The whole PDF gets loaded into memory before chunking. |
| RecursiveCharacterTextSplitter | No | None. It just splits whatever you give it. |
| OpenAI embeddings API | Yes | Tokens per input + rate limits per minute + your wallet. |
| Qdrant | Practically no | Designed for millions of vectors. |
| Your laptop | Yes | RAM, disk, and patience. |

The actual bottleneck is almost always OpenAI, not Python or Qdrant.
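
To size an ingest job it helps to see how chunk count grows with text length. The sketch below is a simplified fixed-window splitter, assumed for illustration only. It is not RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence separators, so real chunk counts will differ somewhat.

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Cut text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this far ahead
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2,600-character text, just to exercise the function.
sample_text = "".join(str(i % 10) for i in range(2600))
sample_chunks = split_with_overlap(sample_text)
print(len(sample_chunks), [len(c) for c in sample_chunks])
```

With chunk_size=1000 and overlap=200 each chunk advances by 800 characters, so the number of chunks is roughly the character count divided by 800.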

What OpenAI actually limits

There are three things to watch on the embeddings API.

1. Max tokens per single input — 8,191

Each chunk you send to be embedded can be at most 8,191 tokens, which is roughly 32,000 characters. My chunk_size=1000 is way below that, so this is never hit in normal RAG. Official source: https://platform.openai.com/docs/guides/embeddings
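
A quick sanity check for this limit can be sketched as follows. It assumes the rough rule of thumb of about 4 characters per token for English text. That is a heuristic, not a real tokenizer, so treat the result as an estimate.

```python
import math

MAX_TOKENS_PER_INPUT = 8191   # limit quoted above
CHARS_PER_TOKEN = 4           # rough heuristic for English text (assumption)


def estimate_tokens(text):
    """Approximate token count from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_one_input(text):
    """True if the estimated token count is within the per-input limit."""
    return estimate_tokens(text) <= MAX_TOKENS_PER_INPUT


chunk = "x" * 1000  # a chunk at chunk_size=1000
print(estimate_tokens(chunk), fits_in_one_input(chunk))
```

A 1,000-character chunk comes out at about 250 tokens, which leaves a very wide margin under the limit.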

2. Rate limits — RPM (requests per minute) and TPM (tokens per minute)

These depend on your account tier. The limits for each tier are below; Tier 1 is where most new accounts start.

| Tier | RPM | TPM | Batch queue limit |
| --- | --- | --- | --- |
| Free | 100 | 40,000 | |
| Tier 1 | 3,000 | 1,000,000 | 3,000,000 |
| Tier 2 | 5,000 | 1,000,000 | 20,000,000 |
| Tier 3 | 5,000 | 5,000,000 | 100,000,000 |
| Tier 4 | 10,000 | 5,000,000 | 500,000,000 |
| Tier 5 | 10,000 | 10,000,000 | 4,000,000,000 |

Your tier moves up automatically as you spend more on the platform.
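
The TPM column gives a simple lower bound on how long an ingest takes. The sketch below only divides total tokens by tokens per minute. It ignores the RPM limit, network latency, and retries, and the 50-million-token job is an illustrative figure.

```python
def min_minutes_for_tokens(total_tokens, tokens_per_minute):
    """Best-case minutes needed to push total_tokens through a TPM limit."""
    return total_tokens / tokens_per_minute


TIER_1_TPM = 1_000_000   # from the table above
FREE_TPM = 40_000        # from the table above

job_tokens = 50_000_000  # illustrative job size (assumption)
print(min_minutes_for_tokens(job_tokens, TIER_1_TPM))  # minutes on Tier 1
print(min_minutes_for_tokens(job_tokens, FREE_TPM))    # minutes on Free
```

On Tier 1 that job needs at least 50 minutes. On the Free tier the same job needs at least 1,250 minutes, which is about 21 hours.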

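When you hit the TPM limit, the OpenAI client that LangChain uses backs off and retries automatically, up to a configurable number of attempts (max_retries). A brief burst over the limit therefore usually just makes the script slower instead of crashing it. Sustained throttling can still surface as an error once the retries run out.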
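
The idea behind backing off can be shown with a small generic sketch. This is not LangChain's or OpenAI's actual retry code. RateLimitError here is a stand-in class defined for the example, and the delays and retry count are assumed values.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit error (hypothetical)."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), doubling the wait after each rate-limit failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake call that is rate limited twice, then succeeds.
attempts = {"count": 0}
waited = []  # records the delays instead of really sleeping


def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError()
    return "ok"


outcome = call_with_backoff(flaky_embed, sleep=waited.append)
print(outcome, waited)
```

The sleep function is injectable so the demo runs instantly. In real use the default time.sleep applies.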

3. The cost — $0.02 per 1M tokens for text-embedding-3-small

This is the most underrated number. A 100 MB text PDF is roughly 20–50 million tokens. That works out to about $0.40 to $1.00 in embedding cost. Real-world cheap. Pricing page: https://platform.openai.com/docs/pricing
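
The arithmetic behind that estimate, as a small sketch. The price constant is the figure quoted above, so check the pricing page before relying on it.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # text-embedding-3-small, as quoted above


def embedding_cost_usd(total_tokens):
    """Embedding cost in USD for a given number of input tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


low = embedding_cost_usd(20_000_000)
high = embedding_cost_usd(50_000_000)
print(f"${low:.2f} to ${high:.2f}")
```

Chunk overlap adds a little on top of this, because overlapping text is embedded twice.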
