I kept running into the same problem every time I built an AI bot for a client.
Before writing a single line of bot logic, I had to prepare the knowledge base. Parse the PDF. Figure out the right chunk size. Generate embeddings via API. Format everything into a structure the bot could actually use. Then repeat for every new document.
It was taking 2-3 hours per project just for data preparation. So I decided to build a tool that does all of this automatically.
What I built
ChunkIt is a simple web app — you upload a PDF or paste a URL, and it returns a clean JSON dataset with OpenAI vector embeddings, ready to plug into any AI bot.
The whole pipeline runs in under 60 seconds.
How it works under the hood
The stack is straightforward:
- Frontend: React + Vite + Tailwind, deployed on Vercel
- Backend orchestration: n8n (self-hosted)
- Database + Storage: Supabase
- Parsing: Python with PyMuPDF for PDFs, Playwright for URLs
- Embeddings: OpenAI text-embedding-3-small
When a user uploads a PDF:
- The file goes to Supabase Storage
- An n8n webhook triggers the Python parser via SSH
- PyMuPDF extracts the text
- The text gets split into chunks (256–1024 tokens depending on content type)
- OpenAI generates embeddings for each chunk in batches of 100
- Everything gets saved to Supabase and returned as a downloadable JSON
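The batching step above can be sketched roughly like this. It is a minimal illustration and not the actual ChunkIt code: the helper names (`batched`, `embed_chunks`, `openai_embed_batch`) are made up for the example, and the embedding function is passed in as a parameter so the batching logic can be tried without an API key. The model name and the batch size of 100 come from the steps above.

```python
from typing import Callable, List, Sequence

EMBEDDING_MODEL = "text-embedding-3-small"  # model named in the post
BATCH_SIZE = 100                            # batch size named in the post


def batched(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive lists of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def openai_embed_batch(texts: List[str]) -> List[List[float]]:
    """Embed one batch with the OpenAI Python SDK (reads OPENAI_API_KEY)."""
    # Imported lazily so the other helpers work without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]


def embed_chunks(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]] = openai_embed_batch,
    batch_size: int = BATCH_SIZE,
) -> List[List[float]]:
    """Return one vector per chunk, calling `embed_batch` once per batch."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With 250 chunks this makes three calls (100, 100, and 50 texts) instead of 250 separate requests, which is the main reason to batch.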
The chunking strategy
One thing I spent time on was making chunking smarter based on content type. Different documents need different chunk sizes:
- Support FAQs: 256 tokens, small overlap — short Q&A pairs work best as precise chunks
- Legal documents: 1024 tokens, large overlap — long paragraphs need context preserved
- Real estate brochures: 512 tokens, medium overlap — balanced for property descriptions
- E-commerce: 256 tokens — product descriptions are short, each product gets its own chunk
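To make the idea concrete, here is a rough sketch of per-content-type chunking with overlap. The chunk sizes are the ones listed above; the overlap numbers are my placeholders for "small", "medium", and "large", and the e-commerce overlap of 0 is a guess because none is stated. It also splits on whitespace instead of using a real tokenizer, so the counts are words rather than model tokens. The names `ChunkProfile`, `PROFILES`, `split_tokens`, and `chunk_text` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ChunkProfile(NamedTuple):
    size: int     # chunk length in tokens (from the list above)
    overlap: int  # tokens repeated from the previous chunk (assumed value)


PROFILES: Dict[str, ChunkProfile] = {
    "support_faq": ChunkProfile(size=256, overlap=32),   # "small overlap"
    "legal": ChunkProfile(size=1024, overlap=200),       # "large overlap"
    "real_estate": ChunkProfile(size=512, overlap=64),   # "medium overlap"
    "ecommerce": ChunkProfile(size=256, overlap=0),      # overlap not stated; guess
}


def split_tokens(tokens: List[str], profile: ChunkProfile) -> List[List[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    step = profile.size - profile.overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk size")
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + profile.size])
        if start + profile.size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_text(text: str, content_type: str) -> List[str]:
    """Chunk text using the profile for its content type."""
    tokens = text.split()  # whitespace words as a stand-in for model tokens
    return [" ".join(c) for c in split_tokens(tokens, PROFILES[content_type])]
```

For accurate token counts, the whitespace split could be swapped for a real tokenizer; the windowing logic stays the same.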
The output format
Each downloaded JSON file is an array of chunk objects:
[
  {
    "chunk_index": 0,
    "content": "The text content of this chunk...",
    "chunk_type": "description",
    "source_url": "https://example.com/doc.pdf",
    "metadata": {
      "district": "Downtown Dubai",
      "price_from": 1500000
    }
  }
]
You can plug this directly into n8n AI agents, LangChain, ChatGPT Custom GPTs, or any custom implementation.
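If you are writing a custom implementation, loading the file could look something like the sketch below. It only relies on the fields shown in the sample above and ignores anything else in each object; the `Chunk` dataclass and the `load_chunks` and `filter_by_metadata` helpers are names I made up for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    chunk_index: int
    content: str
    chunk_type: str
    source_url: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def load_chunks(raw_json: str) -> List[Chunk]:
    """Parse a JSON array of chunk objects into Chunk records."""
    records = json.loads(raw_json)
    return [
        Chunk(
            chunk_index=r["chunk_index"],
            content=r["content"],
            chunk_type=r["chunk_type"],
            source_url=r["source_url"],
            metadata=r.get("metadata", {}),  # tolerate chunks without metadata
        )
        for r in records
    ]


def filter_by_metadata(chunks: List[Chunk], key: str, value: Any) -> List[Chunk]:
    """Keep only the chunks whose metadata has `key` equal to `value`."""
    return [c for c in chunks if c.metadata.get(key) == value]
```

For example, `filter_by_metadata(chunks, "district", "Downtown Dubai")` narrows a real estate dataset to one district before any retrieval step runs.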
What took the most time
Surprisingly, it wasn't the embedding part. That was straightforward with the OpenAI API.
The hardest part was URL parsing. Many websites block automated access (Cloudflare, for example) or rely on heavy JavaScript rendering, so a plain HTTP request returns little usable text. I ended up using Playwright with Chromium to properly render pages before extracting content, plus filtering out navigation, footers, and other noise.
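A simplified sketch of that render-then-filter idea is below. It is not the production parser: the list of noise tags and the function names are my assumptions, the filtering uses only the standard library's `html.parser`, and nothing here tries to get around bot protection. The `render_page` function needs Playwright and a Chromium build installed; `extract_main_text` works on any HTML string.

```python
from html.parser import HTMLParser
from typing import List

# Tags whose contents are treated as noise (an assumed list, tune per site).
NOISE_TAGS = {"nav", "footer", "header", "aside", "script", "style", "noscript"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside any noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # how many noise tags are currently open
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self) -> str:
        return "\n".join(self._parts)


def extract_main_text(html: str) -> str:
    """Return the visible text of `html` with noise sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def render_page(url: str) -> str:
    """Load `url` in headless Chromium and return the rendered HTML."""
    # Imported lazily so the text filter works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")  # let JS-driven content load
            return page.content()
        finally:
            browser.close()
```

Usage would be `extract_main_text(render_page(url))`. Keeping the two steps separate means the filter can be tested against saved HTML without launching a browser.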
What I learned
Building the MVP took 8 hours. The remaining time went into:
- Setting up proper auth (Supabase + Google OAuth)
- Adding chunk limits for the free plan
- Making the pipeline actually delete data after download (privacy by design)
- Writing docs so users understand what RAG (retrieval-augmented generation) even is
The last point was a reminder that building the tool is only half the work. Explaining what it does and why it matters takes just as long.
Try it
chunkit.yerzhan.online — free plan includes 200 chunks lifetime.
Would love feedback from anyone building AI agents — what data formats do you work with most?