<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ержан Байгаринов</title>
    <description>The latest articles on DEV Community by Ержан Байгаринов (@__dbfc68ef).</description>
    <link>https://dev.to/__dbfc68ef</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3958511%2F2e966823-84f0-433a-b8df-7e3c0220fc0d.png</url>
      <title>DEV Community: Ержан Байгаринов</title>
      <link>https://dev.to/__dbfc68ef</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/__dbfc68ef"/>
    <language>en</language>
    <item>
      <title>I built a RAG dataset tool in 8 hours</title>
      <dc:creator>Ержан Байгаринов</dc:creator>
      <pubDate>Fri, 29 May 2026 13:13:08 +0000</pubDate>
      <link>https://dev.to/__dbfc68ef/i-built-a-rag-dataset-tool-in-8-hours-3e3m</link>
      <guid>https://dev.to/__dbfc68ef/i-built-a-rag-dataset-tool-in-8-hours-3e3m</guid>
      <description>&lt;p&gt;I kept running into the same problem every time I built an AI bot for a client.&lt;/p&gt;

&lt;p&gt;Before writing a single line of bot logic, I had to prepare the knowledge base. Parse the PDF. Figure out the right chunk size. Generate embeddings via API. Format everything into a structure the bot could actually use. Then repeat for every new document.&lt;/p&gt;

&lt;p&gt;It was taking 2-3 hours per project just for data preparation. So I decided to build a tool that does all of this automatically.&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;ChunkIt is a simple web app — you upload a PDF or paste a URL, and it returns a clean JSON dataset with OpenAI vector embeddings, ready to plug into any AI bot.&lt;/p&gt;

&lt;p&gt;The whole pipeline runs in under 60 seconds.&lt;/p&gt;

&lt;h2&gt;How it works under the hood&lt;/h2&gt;

&lt;p&gt;The stack is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: React + Vite + Tailwind, deployed on Vercel&lt;/li&gt;
&lt;li&gt;Backend orchestration: n8n (self-hosted)&lt;/li&gt;
&lt;li&gt;Database + Storage: Supabase&lt;/li&gt;
&lt;li&gt;Parsing: Python with PyMuPDF for PDFs, Playwright for URLs&lt;/li&gt;
&lt;li&gt;Embeddings: OpenAI text-embedding-3-small&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a user uploads a PDF:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The file goes to Supabase Storage&lt;/li&gt;
&lt;li&gt;An n8n webhook triggers the Python parser via SSH&lt;/li&gt;
&lt;li&gt;PyMuPDF extracts the text&lt;/li&gt;
&lt;li&gt;The text gets split into chunks (256–1024 tokens depending on content type)&lt;/li&gt;
&lt;li&gt;OpenAI generates embeddings for each chunk in batches of 100&lt;/li&gt;
&lt;li&gt;Everything gets saved to Supabase and returned as a downloadable JSON&lt;/li&gt;
&lt;/ol&gt;
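
&lt;p&gt;Step 5 can be sketched as a small batching helper. This is only an illustration, not ChunkIt's actual code: the names &lt;code&gt;batched&lt;/code&gt; and &lt;code&gt;BATCH_SIZE&lt;/code&gt; are hypothetical, and the embeddings request itself is left as a comment so the snippet runs without network access.&lt;/p&gt;

```python
from typing import Iterator, List

BATCH_SIZE = 100  # batch size from step 5; the constant name is hypothetical


def batched(chunks: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `chunks`, each holding at most `size` items."""
    if 1 > size:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


chunks = [f"chunk {i}" for i in range(250)]
for batch in batched(chunks):
    # A real pipeline would send `batch` to the embeddings endpoint here,
    # one request per batch instead of one request per chunk.
    print(len(batch))
```

&lt;p&gt;With 250 chunks this yields batches of 100, 100, and 50, so the whole document needs three requests.&lt;/p&gt;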

&lt;h2&gt;The chunking strategy&lt;/h2&gt;

&lt;p&gt;One thing I spent time on was making chunking smarter based on content type. Different documents need different chunk sizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support FAQs: 256 tokens, small overlap — short Q&amp;amp;A pairs work best as precise chunks&lt;/li&gt;
&lt;li&gt;Legal documents: 1024 tokens, large overlap — long paragraphs need context preserved&lt;/li&gt;
&lt;li&gt;Real estate brochures: 512 tokens, medium overlap — balanced for property descriptions&lt;/li&gt;
&lt;li&gt;E-commerce: 256 tokens — product descriptions are short, each product gets its own chunk&lt;/li&gt;
&lt;/ul&gt;
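
&lt;p&gt;A minimal sketch of size-plus-overlap chunking, under a few loud assumptions: whitespace-separated words stand in for real tokens, the overlap numbers are hypothetical (the list above only says small, medium, and large), and the e-commerce overlap is assumed to be 0 because it is not stated. The names &lt;code&gt;PROFILES&lt;/code&gt; and &lt;code&gt;chunk_text&lt;/code&gt; are made up for this example.&lt;/p&gt;

```python
from typing import Dict, List, Tuple

# (chunk size, overlap) per content type. Sizes come from the list above;
# the overlap numbers are hypothetical stand-ins for "small/medium/large".
PROFILES: Dict[str, Tuple[int, int]] = {
    "faq": (256, 32),
    "legal": (1024, 256),
    "real_estate": (512, 64),
    "ecommerce": (256, 0),  # overlap not stated in the post; 0 is assumed
}


def chunk_text(text: str, content_type: str) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    size, overlap = PROFILES[content_type]
    words = text.split()  # crude stand-in for real tokenization
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(600))
faq_chunks = chunk_text(sample, "faq")
print(len(faq_chunks), faq_chunks[1].split()[0])
```

&lt;p&gt;Each window starts &lt;code&gt;size - overlap&lt;/code&gt; words after the previous one, so neighbouring chunks share their boundary text. A larger overlap, as in the legal profile, keeps more context around each boundary at the cost of more chunks to embed.&lt;/p&gt;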

&lt;h2&gt;The output format&lt;/h2&gt;

&lt;p&gt;Each downloaded JSON file is an array of chunk objects:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"chunk_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The text content of this chunk..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"chunk_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/doc.pdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"district"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Downtown Dubai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price_from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1500000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can plug this directly into n8n AI agents, LangChain, ChatGPT Custom GPTs, or any custom implementation.&lt;/p&gt;
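
&lt;p&gt;For a custom implementation, consuming the file could look roughly like this. The sketch parses an inline copy of the sample above and filters chunks by their metadata. The helper name &lt;code&gt;filter_chunks&lt;/code&gt; is hypothetical, and a real script would read the downloaded file instead of a string.&lt;/p&gt;

```python
import json
from typing import Any, Dict, List

# Inline copy of the sample above; a real script would read the downloaded file.
RAW = """
[
  {
    "chunk_index": 0,
    "content": "The text content of this chunk...",
    "chunk_type": "description",
    "source_url": "https://example.com/doc.pdf",
    "metadata": {"district": "Downtown Dubai", "price_from": 1500000}
  }
]
"""


def filter_chunks(chunks: List[Dict[str, Any]], **wanted: Any) -> List[Dict[str, Any]]:
    """Keep chunks whose metadata matches every key/value pair in `wanted`."""
    return [
        c for c in chunks
        if all(c.get("metadata", {}).get(k) == v for k, v in wanted.items())
    ]


chunks = json.loads(RAW)
matches = filter_chunks(chunks, district="Downtown Dubai")
print([c["chunk_index"] for c in matches])
```

&lt;p&gt;Filtering on metadata before any similarity search keeps a bot from answering a question about one district with chunks from another.&lt;/p&gt;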

&lt;h2&gt;What took the most time&lt;/h2&gt;

&lt;p&gt;Surprisingly, it was not the embedding part; that was straightforward with the OpenAI API.&lt;/p&gt;

&lt;p&gt;The hardest part was URL parsing. Many websites are hard to fetch automatically: some block bots (Cloudflare, for example), and others only produce their content after heavy JavaScript rendering. I ended up using Playwright with Chromium to properly render pages before extracting content, plus filtering out navigation, footers, and other noise.&lt;/p&gt;
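
&lt;p&gt;The noise-filtering step might look something like the sketch below. It uses only Python's standard &lt;code&gt;html.parser&lt;/code&gt; module, which is not necessarily what ChunkIt uses. The list of noise tags and the names &lt;code&gt;MainTextExtractor&lt;/code&gt; and &lt;code&gt;extract_main_text&lt;/code&gt; are assumptions for illustration. The input is assumed to be already-rendered HTML, for example the string returned by Playwright's &lt;code&gt;page.content()&lt;/code&gt;.&lt;/p&gt;

```python
from html.parser import HTMLParser
from typing import List

# Tags treated as noise; this list is an assumption, not ChunkIt's actual rules.
NOISE_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping everything inside noise tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # number of noise tags we are currently inside
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the text outside noise tags, one text node per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

&lt;p&gt;A depth counter is used instead of a simple flag so that nested noise elements, such as a navigation block inside a page header, are skipped as one unit. Pages that mark up navigation with generic containers would need extra rules, such as matching on class names.&lt;/p&gt;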

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;Building the MVP took 8 hours. The time after that went into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up proper auth (Supabase + Google OAuth)&lt;/li&gt;
&lt;li&gt;Adding chunk limits for the free plan&lt;/li&gt;
&lt;li&gt;Making the pipeline actually delete data after download (privacy by design)&lt;/li&gt;
&lt;li&gt;Writing docs so users understand what RAG even is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point was a reminder that building the tool is only half the work. Explaining what it does and why it matters takes just as long.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;chunkit.yerzhan.online: the free plan includes 200 chunks in total (a lifetime limit).&lt;/p&gt;

&lt;p&gt;Would love feedback from anyone building AI agents — what data formats do you work with most?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
