<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Avi Khandakar</title>
    <description>The latest articles on DEV Community by Avi Khandakar (@avikhandakar).</description>
    <link>https://dev.to/avikhandakar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930615%2F0b5c8f75-e2f4-48f5-b25e-d28fedaa656a.png</url>
      <title>DEV Community: Avi Khandakar</title>
      <link>https://dev.to/avikhandakar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/avikhandakar"/>
    <language>en</language>
    <item>
      <title>Still wrestling with fragile scrapers and regex? I built a system that handles PDF, Web, and even Audio extraction with 99% accuracy using AI. Check out the technical deep-dive on how I solved deterministic JSON! 🚀 #ai #webdev</title>
      <dc:creator>Avi Khandakar</dc:creator>
      <pubDate>Thu, 14 May 2026 21:14:29 +0000</pubDate>
      <link>https://dev.to/avikhandakar/still-wrestling-with-fragile-scrapers-and-regex-i-built-a-system-that-handles-pdf-web-and-even-2f61</link>
      <guid>https://dev.to/avikhandakar/still-wrestling-with-fragile-scrapers-and-regex-i-built-a-system-that-handles-pdf-web-and-even-2f61</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n" class="crayons-story__hidden-navigation-link"&gt;Scraping is Dead: How AI Replaced My Brittle Regex and BeautifulSoup Scripts&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/avikhandakar" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930615%2F0b5c8f75-e2f4-48f5-b25e-d28fedaa656a.png" alt="avikhandakar profile" class="crayons-avatar__image" width="800" height="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/avikhandakar" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Avi Khandakar
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Avi Khandakar
                
              
              &lt;div id="story-author-preview-content-3671748" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/avikhandakar" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930615%2F0b5c8f75-e2f4-48f5-b25e-d28fedaa656a.png" class="crayons-avatar__image" alt="" width="800" height="800"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Avi Khandakar&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 14&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n" id="article-link-3671748"&gt;
          Scraping is Dead: How AI Replaced My Brittle Regex and BeautifulSoup Scripts
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/webdev"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;webdev&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/productivity"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;productivity&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/javascript"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;javascript&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Scraping is Dead: How AI Replaced My Brittle Regex and BeautifulSoup Scripts</title>
      <dc:creator>Avi Khandakar</dc:creator>
      <pubDate>Thu, 14 May 2026 20:16:05 +0000</pubDate>
      <link>https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n</link>
      <guid>https://dev.to/avikhandakar/scraping-is-dead-how-ai-replaced-my-brittle-regex-and-beautifulsoup-scripts-4g4n</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;We've all been there. You have a folder full of PDFs, a list of URLs, or hours of audio, and you need to turn them into structured data. Traditionally, this meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom Python scripts with Beautiful Soup or Selenium.&lt;/li&gt;
&lt;li&gt;Brittle regex patterns for PDFs that break on the slightest layout change.&lt;/li&gt;
&lt;li&gt;Manual transcription for audio.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's slow, error-prone, and a maintenance nightmare.&lt;/p&gt;

&lt;h2&gt;The Shift: AI-Native Extraction&lt;/h2&gt;

&lt;p&gt;With the rise of Large Language Models (LLMs), the game has changed. Instead of telling the computer &lt;em&gt;how&lt;/em&gt; to find data (e.g., "look for the text after 'Invoice Total'"), we can tell it &lt;em&gt;what&lt;/em&gt; to find (e.g., "Find the total amount and the currency").&lt;/p&gt;

&lt;p&gt;In this post, I'll share how I built &lt;a href="https://snapparse.app" rel="noopener noreferrer"&gt;Snapparse&lt;/a&gt; to handle this at scale, and the technical challenges I faced along the way.&lt;/p&gt;

&lt;h2&gt;Technical Challenge 1: Context Window vs. File Size&lt;/h2&gt;

&lt;p&gt;Handling a 50-page PDF or a 100 MB audio file isn't as simple as dumping it into an API. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Breaking down large documents without losing context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodality&lt;/strong&gt;: Processing images and text simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcription&lt;/strong&gt;: Using tools like Whisper to convert audio before extraction.&lt;/li&gt;
&lt;/ul&gt;
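
&lt;p&gt;To make the chunking point concrete, here is a minimal sketch of overlapping chunks. It is an illustration under stated assumptions, not how Snapparse works internally: the helper name &lt;code&gt;chunkText&lt;/code&gt; and its defaults are hypothetical, and it counts characters rather than model tokens. The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears whole in one of them.&lt;/p&gt;

```javascript
// Hypothetical helper (not part of any Snapparse API): split text into
// fixed-size chunks where each chunk repeats the last `overlap` characters
// of the previous one. Sizes are in characters, not tokens.
function chunkText(text, chunkSize = 2000, overlap = 200) {
  if (0 > overlap || overlap >= chunkSize) {
    throw new RangeError("overlap must be >= 0 and smaller than chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  let start = 0;
  while (text.length > start) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
    start += step;
  }
  return chunks;
}
```

&lt;p&gt;For example, &lt;code&gt;chunkText("abcdefghij", 4, 1)&lt;/code&gt; returns &lt;code&gt;["abcd", "defg", "ghij"]&lt;/code&gt;. A real pipeline would more likely split on token counts and on natural boundaries such as pages or paragraphs.&lt;/p&gt;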

&lt;h2&gt;Technical Challenge 2: Deterministic JSON&lt;/h2&gt;

&lt;p&gt;LLMs are probabilistic, but databases expect an exact, predictable structure. Getting an LLM to reliably return valid JSON that matches a specific schema every single time is the "final boss" of AI engineering.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Defining a schema for a Legal Contract&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parties&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Names of the entities involved&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;effective_date&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;date&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;When the contract starts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;termination_clause&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Summary of how to end the contract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;total_value&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Total monetary amount if applicable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;// Snapparse uses this schema to guide the LLM and validate the output,&lt;/span&gt;
&lt;span class="c1"&gt;// ensuring you get 100% valid JSON that fits your database perfectly.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
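
&lt;p&gt;The validation half of that idea can be sketched in a few lines. This is a minimal illustration, not Snapparse's actual implementation: &lt;code&gt;validateOutput&lt;/code&gt; and &lt;code&gt;typeChecks&lt;/code&gt; are hypothetical names, and treating a &lt;code&gt;date&lt;/code&gt; as "a string that &lt;code&gt;Date.parse&lt;/code&gt; accepts" is an assumption. It takes the raw text a model returned plus a field list in the same shape as the schema above, and reports every field that is missing or has the wrong type.&lt;/p&gt;

```javascript
// Same field-list shape as the contract schema above (descriptions omitted).
const contractSchema = [
  { key: "parties", type: "array" },
  { key: "effective_date", type: "date" },
  { key: "termination_clause", type: "string" },
  { key: "total_value", type: "number" },
];

// One predicate per type name. These rules are assumptions for illustration.
const typeChecks = {
  array: (v) => Array.isArray(v),
  string: (v) => typeof v === "string",
  number: (v) => (typeof v === "number" ? Number.isFinite(v) : false),
  date: (v) => (typeof v === "string" ? !Number.isNaN(Date.parse(v)) : false),
};

// Hypothetical validator: parse the model's raw text and check it
// against the field list, collecting every problem it finds.
function validateOutput(rawText, schema) {
  let data;
  try {
    data = JSON.parse(rawText);
  } catch (err) {
    return { ok: false, errors: ["not valid JSON: " + err.message] };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, errors: ["expected a JSON object"] };
  }
  const errors = [];
  for (const field of schema) {
    const check = typeChecks[field.type];
    if (!check) {
      errors.push(`unknown type "${field.type}" for "${field.key}"`);
    } else if (!(field.key in data)) {
      errors.push(`missing field "${field.key}"`);
    } else if (!check(data[field.key])) {
      errors.push(`field "${field.key}" is not a valid ${field.type}`);
    }
  }
  return { ok: errors.length === 0, data, errors };
}
```

&lt;p&gt;One common way to use a validator like this is a retry loop: when &lt;code&gt;ok&lt;/code&gt; is false, the &lt;code&gt;errors&lt;/code&gt; list is sent back to the model with a request to correct its answer, and nothing is written to the database until a response passes.&lt;/p&gt;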



&lt;h2&gt;How Snapparse Solves This&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://snapparse.app" rel="noopener noreferrer"&gt;Snapparse&lt;/a&gt; to be the "Intelligence Engine" that sits between your unstructured files and your database.&lt;/p&gt;

&lt;h3&gt;Key Technical Advantages:&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Multi-Modal Ingestion&lt;/strong&gt;: Support for PDF, Web, and &lt;strong&gt;full Audio transcription&lt;/strong&gt;. You can extract structured data directly from a meeting MP3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Email Pipelines&lt;/strong&gt;: Every extractor you create generates a unique email address. Send an attachment there, and the extracted JSON hits your webhook automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Co-pilot (The Command Center)&lt;/strong&gt;: I built an AI agent right into the dashboard. Instead of hunting through docs, you can ask the agent to create an extractor for you, explain an API endpoint, or check your usage stats.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost-Efficiency&lt;/strong&gt;: At &lt;strong&gt;$9.99 for 100 credits&lt;/strong&gt;, we're making AI-native extraction 50% cheaper than the competition.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The Snapparse AI co-pilot helps you build and manage extractors without leaving the dashboard.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;The Flow:&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingest&lt;/strong&gt;: Via API, Dashboard, or the &lt;strong&gt;unique email address&lt;/strong&gt; generated for your extractor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process&lt;/strong&gt;: AI analyzes the content (including audio!) based on your predefined schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook&lt;/strong&gt;: The structured JSON is pushed to your server instantly.&lt;/li&gt;
&lt;/ol&gt;
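
&lt;p&gt;On the receiving side, step 3 comes down to parsing the request body, checking it, and storing it. The sketch below shows that logic as a plain function, independent of any web framework. Everything in it is an assumption: this post does not describe the payload Snapparse sends, so the function treats the body as a flat JSON object and only checks for the keys you say you need. &lt;code&gt;handleWebhook&lt;/code&gt;, &lt;code&gt;requiredKeys&lt;/code&gt;, and &lt;code&gt;save&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```javascript
// Hypothetical receiver logic for the webhook step. The payload shape is an
// assumption; check the real webhook format before relying on this.
function handleWebhook(rawBody, requiredKeys, save) {
  let payload;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { status: 400, message: "body is not valid JSON" };
  }
  if (payload === null || typeof payload !== "object" || Array.isArray(payload)) {
    return { status: 400, message: "body is not a JSON object" };
  }
  // Reject payloads that lack any of the fields the caller depends on.
  const missing = requiredKeys.filter((key) => !(key in payload));
  if (missing.length > 0) {
    return { status: 422, message: "missing keys: " + missing.join(", ") };
  }
  save(payload); // caller-supplied callback, e.g. a database insert
  return { status: 200, message: "stored" };
}
```

&lt;p&gt;Keeping this logic in a pure function means the same code can sit behind whatever HTTP server you already run, with the returned &lt;code&gt;status&lt;/code&gt; mapped to the response code.&lt;/p&gt;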

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The era of manual scraping is ending. By leveraging AI, we can build data pipelines that are more robust, faster, and actually enjoyable to maintain.&lt;/p&gt;

&lt;p&gt;If you're building something similar or have questions about handling messy data, let's chat in the comments!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
