<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Peter Njuguna</title>
    <description>The latest articles on DEV Community by Peter Njuguna (@peterboro).</description>
    <link>https://dev.to/peterboro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3912193%2F28df7531-0bb7-48f4-ae36-1b16b5a6a305.png</url>
      <title>DEV Community: Peter Njuguna</title>
      <link>https://dev.to/peterboro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/peterboro"/>
    <language>en</language>
    <item>
      <title>Stop Using Regex for Invoices: Use AI to Extract Line-Items in Seconds</title>
      <dc:creator>Peter Njuguna</dc:creator>
      <pubDate>Mon, 04 May 2026 14:10:23 +0000</pubDate>
      <link>https://dev.to/peterboro/stop-using-regex-for-invoices-use-ai-to-extract-line-items-in-seconds-1k5a</link>
      <guid>https://dev.to/peterboro/stop-using-regex-for-invoices-use-ai-to-extract-line-items-in-seconds-1k5a</guid>
      <description>&lt;h2&gt;
  
  
  The Nightmare of Parsing Invoices
&lt;/h2&gt;

&lt;p&gt;If you’ve ever tried to extract structured data from an invoice or receipt, you know exactly how painful it is. &lt;/p&gt;

&lt;p&gt;You write a perfect regular expression to extract the total amount from one vendor. It works beautifully. Then, a new vendor comes along with a slightly different format, and your regex silently fails, breaks your pipeline, and leaves you cleaning up messy data.&lt;/p&gt;

&lt;p&gt;Invoices are inherently unpredictable. They contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different date formats (&lt;code&gt;DD/MM/YYYY&lt;/code&gt; vs &lt;code&gt;MM/DD/YYYY&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Tabular data represented as raw, unstructured text.&lt;/li&gt;
&lt;li&gt;Varied terminology ("Qty", "Units", "Quantity").&lt;/li&gt;
&lt;li&gt;Chaotic text generated by OCR (Optical Character Recognition) scanners.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trying to parse this with traditional code is a never-ending game of whack-a-mole. &lt;/p&gt;

&lt;p&gt;In this guide, we'll look at a much better way: using a specialized AI extraction API to turn messy invoice text into clean, structured JSON in a single request.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: AI-Powered Extraction
&lt;/h2&gt;

&lt;p&gt;Instead of trying to match patterns with text coordinates or regex, modern workflows pass the unstructured text directly to an LLM-backed API. The AI understands the context of the document, identifies the merchant, isolates the line items, and returns a uniform JSON schema every single time.&lt;/p&gt;

&lt;p&gt;Let’s see how to implement this using Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To follow along, you will need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Python installed on your machine.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;requests&lt;/code&gt; library (&lt;code&gt;pip install requests&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;A free API key from the &lt;a href="https://rapidapi.com/peterboro/api/invoice-and-receipt-extractor/" rel="noopener noreferrer"&gt;Invoice and Receipt Extractor API on RapidAPI&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step-by-Step Implementation
&lt;/h2&gt;

&lt;p&gt;Let's assume you have an OCR scanner or a script that has extracted raw text from a messy PDF invoice. Here is what that unstructured text looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Coast View Investments.ltd
N0 PARTICULARS QTTY UNITS UNIT PRICE COST
1 POLES 150 PIECES 50 7500
TOTAL. 7500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's write a Python script to send this data to the API and parse it automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Python Code
&lt;/h3&gt;

&lt;p&gt;Create a file named &lt;code&gt;extract.py&lt;/code&gt; and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the API Endpoint and your RapidAPI credentials
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://invoice-and-receipt-extractor.p.rapidapi.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-rapidapi-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_RAPIDAPI_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with your actual RapidAPI Key
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-rapidapi-host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice-and-receipt-extractor.p.rapidapi.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Add the raw invoice text you want to parse
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coast View Investments.ltd&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;N0 PARTICULARS QTTY UNITS UNIT PRICE COST&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;1 POLES 150 PIECES 50 7500&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;TOTAL. 7500&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⏳ Extracting data via AI...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Make the API request
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Print the clean JSON output
&lt;/span&gt;    &lt;span class="n"&gt;structured_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Success! Clean structured data received:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;structured_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ API Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;When you run the script, the API processes the messy text and extracts the data into a clean, highly reliable format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"merchant_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coast View Investments.ltd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date_of_issue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POLES"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;150.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"total_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7500.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subtotal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7500.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tax_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grand_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7500.0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, instead of writing dozens of custom parsing rules, you can directly map this clean JSON output straight into your accounting software, database, or ERP.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Work Smarter, Not Harder
&lt;/h2&gt;

&lt;p&gt;In 2026, building fragile data pipelines around regular expressions doesn't make sense anymore. By utilizing specialized AI extraction APIs, you save hours of development time and build a pipeline that won't break when a merchant updates their document layout.&lt;/p&gt;

&lt;p&gt;If you want to try this out yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check out the &lt;a href="https://rapidapi.com/peterboro/api/invoice-and-receipt-extractor/" rel="noopener noreferrer"&gt;Invoice and Receipt Extractor API on RapidAPI&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Sign up for the free tier (10 requests/month) to start testing it in your own projects today.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>webdev</category>
      <category>python</category>
    </item>
  </channel>
</rss>
