Ержан Байгаринов

Posted on May 29

I built a RAG dataset tool in 8 hours

#python #webdev #ai #tutorial

I kept running into the same problem every time I built an AI bot for a client.

Before writing a single line of bot logic, I had to prepare the knowledge base. Parse the PDF. Figure out the right chunk size. Generate embeddings via API. Format everything into a structure the bot could actually use. Then repeat for every new document.

It was taking 2-3 hours per project just for data preparation. So I decided to build a tool that does all of this automatically.

What I built

ChunkIt is a simple web app — you upload a PDF or paste a URL, and it returns a clean JSON dataset with OpenAI vector embeddings, ready to plug into any AI bot.

The whole pipeline runs in under 60 seconds.

How it works under the hood

The stack is straightforward:

Frontend: React + Vite + Tailwind, deployed on Vercel
Backend orchestration: n8n (self-hosted)
Database + Storage: Supabase
Parsing: Python with PyMuPDF for PDFs, Playwright for URLs
Embeddings: OpenAI text-embedding-3-small

When a user uploads a PDF:

The file goes to Supabase Storage
An n8n webhook triggers the Python parser via SSH
PyMuPDF extracts the text
The text gets split into chunks (256–1024 tokens depending on content type)
OpenAI generates embeddings for each chunk in batches of 100
Everything gets saved to Supabase and returned as a downloadable JSON
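The batching step is easy to get subtly wrong, so here is a minimal sketch of it. This is not the actual ChunkIt code: `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the embedding backend is passed in as a plain function so the batching logic can be run without an API key.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 100  # batch size mentioned in the steps above


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    size: int = BATCH_SIZE,
) -> List[List[float]]:
    """Embed every chunk, calling `embed_batch` once per batch and keeping order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding backend returned a wrong number of vectors")
        vectors.extend(result)
    return vectors


def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    """Stand-in for the real API call: one 1-dimensional 'vector' per text."""
    return [[float(len(t))] for t in texts]
```

With the OpenAI Python SDK, `embed_batch` would presumably wrap `client.embeddings.create(model="text-embedding-3-small", input=texts)` and collect the `embedding` field of each item in the response's `data` list. Retries and rate limiting are left out of this sketch.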

The chunking strategy

One thing I spent time on was making chunking smarter based on content type. Different documents need different chunk sizes:

Support FAQs: 256 tokens, small overlap — short Q&A pairs work best as precise chunks
Legal documents: 1024 tokens, large overlap — long paragraphs need context preserved
Real estate brochures: 512 tokens, medium overlap — balanced for property descriptions
E-commerce: 256 tokens — product descriptions are short, each product gets its own chunk
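To make the idea concrete, here is a rough sketch of a sliding-window chunker driven by per-type profiles. The chunk sizes come from the list above. Everything else is an assumption for illustration: the overlap numbers (the post only says small, medium, and large), the zero overlap for e-commerce, the profile keys, and the use of whitespace-separated words as a stand-in for real model tokens.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChunkProfile:
    size: int     # chunk length in tokens (from the list above)
    overlap: int  # tokens shared with the previous chunk (hypothetical value)


# Sizes follow the post; overlap values are illustrative guesses.
PROFILES: Dict[str, ChunkProfile] = {
    "support_faq": ChunkProfile(size=256, overlap=32),
    "legal": ChunkProfile(size=1024, overlap=256),
    "real_estate": ChunkProfile(size=512, overlap=64),
    "ecommerce": ChunkProfile(size=256, overlap=0),
}


def chunk_tokens(tokens: List[str], profile: ChunkProfile) -> List[List[str]]:
    """Split `tokens` into windows of `profile.size`, each overlapping the previous one."""
    if not 0 <= profile.overlap < profile.size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = profile.size - profile.overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + profile.size])
        if start + profile.size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_text(text: str, content_type: str) -> List[str]:
    """Chunk raw text using the profile registered for `content_type`."""
    tokens = text.split()  # whitespace words stand in for model tokens
    return [" ".join(c) for c in chunk_tokens(tokens, PROFILES[content_type])]
```

A production version would count tokens with a real tokenizer and would probably split on Q&A pairs or product boundaries before falling back to fixed windows.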

The output format

Each downloaded JSON file is an array of chunk objects:

[
  {
    "chunk_index": 0,
    "content": "The text content of this chunk...",
    "chunk_type": "description",
    "source_url": "https://example.com/doc.pdf",
    "metadata": {
      "district": "Downtown Dubai",
      "price_from": 1500000
    }
  }
]

You can plug this directly into n8n AI agents, LangChain, ChatGPT Custom GPTs, or any custom implementation.
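For a custom implementation, consuming the file might look roughly like the sketch below. It assumes every chunk carries all five keys shown in the example, which the format description does not actually guarantee, and `load_chunks` and `filter_by_metadata` are hypothetical helper names.

```python
import json
from typing import Any, Dict, List

# Assumption: all keys from the example object are always present.
REQUIRED_KEYS = {"chunk_index", "content", "chunk_type", "source_url", "metadata"}


def load_chunks(raw: str) -> List[Dict[str, Any]]:
    """Parse the exported JSON and return the chunks ordered by chunk_index."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of chunk objects")
    for i, chunk in enumerate(data):
        if not isinstance(chunk, dict):
            raise ValueError(f"item {i} is not a JSON object")
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            raise ValueError(f"chunk {i} is missing keys: {sorted(missing)}")
    return sorted(data, key=lambda c: c["chunk_index"])


def filter_by_metadata(chunks: List[Dict[str, Any]], **expected: Any) -> List[Dict[str, Any]]:
    """Keep only the chunks whose metadata matches every given key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in expected.items())
    ]
```

For example, `filter_by_metadata(chunks, district="Downtown Dubai")` would narrow a real estate dataset to one district before the chunks are handed to a retriever.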

What took the most time

Surprisingly, it was not the embedding part. That was straightforward with the OpenAI API.

The hardest part was URL parsing. Many websites are hard to scrape: some block automated access (Cloudflare, for example), and others render their content with heavy JavaScript. I ended up using Playwright with Chromium to properly render pages before extracting content, plus filtering out navigation, footers, and other noise.
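The noise-filtering half of that can be sketched with the standard library alone. This is a simplified illustration, not the filter ChunkIt uses: it takes HTML that has already been rendered, the list of tags treated as noise is my own guess, and `MainTextExtractor` and `extract_main_text` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: anything inside these tags is treated as page chrome, not content.
NOISE_TAGS = {"nav", "footer", "header", "aside", "script", "style", "noscript"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._noise_depth = 0  # how many noise tags are currently open
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._noise_depth == 0:
            self._parts.append(text)

    def text(self) -> str:
        return "\n".join(self._parts)


def extract_main_text(html: str) -> str:
    """Return the visible text of `html` with navigation, footers, and scripts removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With Playwright, the input would be the string returned by `page.content()` after the page has finished rendering. Note that an unclosed noise tag would swallow everything after it, so real-world pages may need a more forgiving parser.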

What I learned

Building the MVP took 8 hours. The work after that went into:

Setting up proper auth (Supabase + Google OAuth)
Adding chunk limits for the free plan
Making the pipeline actually delete data after download (privacy by design)
Writing docs so users understand what RAG even is

The last point was a reminder that building the tool is only half the work. Explaining what it does and why it matters takes just as long.

Try it

chunkit.yerzhan.online — free plan includes 200 chunks lifetime.

Would love feedback from anyone building AI agents — what data formats do you work with most?
