Vikram Warialani

Posted on Mar 25

How to Extract Structured Data from Indian Invoice Scans and Images

#ai #data #python #tutorial

How to Extract Structured Data from Indian Invoices Using Python (GST, Fuel, Telecom, IRCTC)

If you've ever built an expense management tool, accounting integration, or GST reconciliation system for Indian businesses, you know the problem: Indian invoices are a mess.

A Jio bill is 7 pages long but has only one useful page. A petrol pump receipt has handwritten amounts in blue ink over a printed template. An IRCTC ticket has the GST invoice buried on page 2. A Starbucks receipt is a blurry photo taken at an angle on a phone.

Traditional OCR tools like AWS Textract or Google Vision extract raw text — but they don't understand that 94:14 written on a fuel receipt means 94.14 litres, or that a GSTIN has a checksum you can validate, or that you should ignore the 80-row data usage table in a Jio bill and focus on the summary.

That's the problem I built BharatParse to solve — an API that turns any Indian invoice, bill, or receipt into clean, validated JSON with a single POST request.

In this article I'll show you how to integrate it in Python in under 5 minutes.

What BharatParse Handles

The API supports 13 document schemas out of the box:

Schema	Examples
`gst_invoice`	B2B tax invoices with GSTIN validation
`restaurant`	Starbucks, Zomato, local restaurant bills
`fuel`	Handwritten BPCL, HPCL, IOC pump receipts
`telecom`	Jio Fiber, Airtel, BSNL, Vi monthly bills
`travel`	IRCTC e-tickets, train ERS
`utility`	BESCOM, MSEDCL, Mahanagar Gas bills
`medical`	Pharmacy bills, hospital invoices
`ecommerce`	Amazon, Flipkart, Meesho invoices
`rent`	Rent receipts with landlord PAN extraction
`bank_statement`	HDFC, SBI, ICICI, Axis statements
`credit_card`	Credit card monthly statements
`auto`	Auto-detects the document type
`generic`	Any other Indian bill or receipt

Input formats supported: PDF, JPEG, PNG, WebP, TIFF — phone photos, scanner output, WhatsApp-shared images all work.
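Since only these formats are accepted, it can be worth rejecting anything else locally before spending an API call. The snippet below is a minimal sketch: `SUPPORTED_EXTENSIONS` and `detect_file_type` are hypothetical helper names (not part of the API), and it assumes the lowercase file extension is what gets sent as the file type, as in the examples later in this article.

```python
from pathlib import Path

# Extensions for the formats listed above (PDF, JPEG, PNG, WebP, TIFF).
# Hypothetical helper: neither the set nor the function is part of the API.
SUPPORTED_EXTENSIONS = {"pdf", "jpg", "jpeg", "png", "webp", "tif", "tiff"}

def detect_file_type(file_path):
    """Return the lowercase extension, or raise ValueError if it is unsupported."""
    ext = Path(file_path).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {file_path!r}")
    return ext

print(detect_file_type("scans/Receipt.JPG"))
```

Failing fast here keeps unsupported files (a stray .docx or .txt in the folder) from counting against your monthly quota.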

Getting Started

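1. Get your free API key

Subscribe to BharatParse on RapidAPI (the free tier needs no credit card) and copy your API key. The examples below use the placeholder YOUR_RAPIDAPI_KEY.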

2. Install requests

pip install requests

3. Make your first call

import base64
import requests
import json

def extract_invoice(file_path, schema="auto"):
    """
    Extract structured data from any Indian invoice.

    Args:
        file_path: Path to PDF or image file
        schema: Document type hint (default: auto-detect)

    Returns:
        dict: Extracted data with confidence score
    """
    # Determine file type from extension
    ext = file_path.rsplit(".", 1)[-1].lower()

    # Read and encode the file
    with open(file_path, "rb") as f:
        file_b64 = base64.b64encode(f.read()).decode()

    # Call BharatParse API
    response = requests.post(
        "https://bharatparse-indian-invoice-bill.p.rapidapi.com/v1/extract",
        headers={
            "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
            "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
            "Content-Type": "application/json"
        },
        json={
            "file_b64": file_b64,
            "file_type": ext,
            "schema": schema,
            "country": "IN"
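        },
        timeout=60  # multi-page PDFs can take 20+ seconds to process
    )

    # Raise on HTTP errors (invalid key, exhausted quota) instead of parsing an error body
    response.raise_for_status()
    return response.json()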

# Test it
result = extract_invoice("invoice.pdf")
print(json.dumps(result, indent=2))

Real Examples

Example 1 — Restaurant Bill (Starbucks photo)

result = extract_invoice("starbucks_receipt.jpg", schema="restaurant")

Response:

{
  "schema_detected": "restaurant",
  "confidence": 0.92,
  "data": {
    "restaurant_name": "Starbucks",
    "hsn_code": "996331",
    "line_items": [
      {
        "name": "Tall Cold Coffee",
        "quantity": 1,
        "unit_price": 320.0,
        "total": 320.0
      }
    ],
    "taxable_value": 320.0,
    "cgst_rate": 2.5,
    "cgst_amount": 8.0,
    "sgst_rate": 2.5,
    "sgst_amount": 8.0,
    "grand_total": 336.0,
    "payment": {
      "mode": "starbucks_card",
      "card_last4": "1821"
    }
  },
  "warnings": ["Invoice date not visible in scan"],
  "processing_ms": 1843
}

Notice that it identified HSN 996331 (restaurant services), extracted CGST and SGST at 2.5% each, and recognised the payment mode as a Starbucks card, including the card's last four digits.
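Because the response carries both rates and amounts, you can cross-check the arithmetic on your side before posting it to a ledger. This is a minimal sketch, not part of the API: `check_gst_split` is a hypothetical helper, and it assumes the field names shown in the restaurant response above.

```python
def check_gst_split(data, tolerance=0.01):
    """Return a list of arithmetic problems found in a restaurant-style payload."""
    problems = []
    taxable = data["taxable_value"]

    # Each tax amount should equal taxable value x rate / 100
    for tax in ("cgst", "sgst"):
        expected = round(taxable * data[f"{tax}_rate"] / 100, 2)
        actual = data[f"{tax}_amount"]
        if abs(expected - actual) > tolerance:
            problems.append(f"{tax}_amount: expected {expected}, got {actual}")

    # Grand total should equal taxable value plus both tax amounts
    expected_total = round(taxable + data["cgst_amount"] + data["sgst_amount"], 2)
    if abs(expected_total - data["grand_total"]) > tolerance:
        problems.append(f"grand_total: expected {expected_total}, got {data['grand_total']}")

    return problems

# Values copied from the Starbucks response above
sample = {
    "taxable_value": 320.0,
    "cgst_rate": 2.5, "cgst_amount": 8.0,
    "sgst_rate": 2.5, "sgst_amount": 8.0,
    "grand_total": 336.0,
}
print(check_gst_split(sample))  # → []
```

Bills that round the total to the nearest rupee will trip the last check, so you may want a looser tolerance for the grand total.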

Example 2 — Handwritten Fuel Receipt (BPCL pump memo)

This is where BharatParse really earns its value. Generic OCR tools fail on these.

result = extract_invoice("fuel_receipt.jpg", schema="fuel")

Response:

{
  "schema_detected": "fuel",
  "confidence": 0.90,
  "data": {
    "dealer_name": "N. M. Shamsuddin & Sons",
    "oil_company": "BPCL",
    "invoice_date": "2025-06-05",
    "fuel_items": [
      {
        "fuel_type": "Speed",
        "litres": 94.14,
        "rate_per_litre": 21.24,
        "amount": 2000.0
      }
    ],
    "total_amount": 2000.0
  },
  "warnings": [
    "Litres value '94:14' is handwritten and interpreted as 94.14"
  ],
  "processing_ms": 8598
}

It correctly interpreted 94:14 (written with a colon) as 94.14 litres, identified the fuel type as Speed (BPCL's premium petrol brand), and flagged the handwritten interpretation in warnings.
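Handwritten values are worth a second check. One cheap test is whether litres multiplied by the rate lands close to the amount. The helper below is a hypothetical sketch that assumes the `fuel_items` field names from the response above and a tolerance of one rupee.

```python
def fuel_item_is_consistent(item, tolerance=1.0):
    """True if litres x rate_per_litre is within `tolerance` rupees of amount."""
    computed = item["litres"] * item["rate_per_litre"]
    return abs(computed - item["amount"]) <= tolerance

# Values copied from the BPCL response above
item = {"fuel_type": "Speed", "litres": 94.14, "rate_per_litre": 21.24, "amount": 2000.0}
print(fuel_item_is_consistent(item))  # → True
```

The product is symmetric, so this check cannot tell whether the litres and rate values were swapped. Comparing the rate against a plausible price range would be a separate check.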

Example 3 — Jio Fiber Bill (7-page PDF)

result = extract_invoice("jio_bill.pdf", schema="telecom")

Response:

{
  "schema_detected": "telecom",
  "confidence": 1.0,
  "data": {
    "provider": "Jio",
    "customer_name": "Mr. Shyam Arjandas Warialani",
    "account_number": "411252569305",
    "due_date": "2025-09-30",
    "plan_name": "Postpaid_399_6M: Unlimited Data @ 30 Mbps",
    "vendor_gstin": "24AABCI6363G1ZP",
    "gst_bill_number": "W241252611070283",
    "sac_code": "998422",
    "charges": {
      "current_taxable_charges": 399.0,
      "cgst_rate": 9.0,
      "cgst_amount": 35.91,
      "sgst_rate": 9.0,
      "sgst_amount": 35.91,
      "total_current_charges": 470.82
    },
    "total_payable": 470.82,
    "payments_this_period": [
      {"mode": "credit_card", "date": "2025-09-01", "amount": 394.89}
    ]
  },
  "warnings": [],
  "processing_ms": 24406
}

From a 7-page PDF, it extracted only the useful billing summary — ignoring 80 rows of itemised data usage and focusing on what any accounting system actually needs.

Example 4 — IRCTC Train Ticket

result = extract_invoice("irctc_ticket.pdf", schema="travel")

Response:

{
  "schema_detected": "travel",
  "confidence": 0.95,
  "data": {
    "pnr": "8543381796",
    "train_number": "82902",
    "train_name": "IRCTC TEJAS EXP",
    "journey_date": "2026-01-24",
    "from_station": "AHMEDABAD JN (ADI)",
    "boarding_station": "VADODARA JN (BRC)",
    "to_station": "BORIVALI (BVI)",
    "passengers": [
      {
        "name": "SHYAM WARIALANI",
        "age": 67,
        "gender": "M",
        "current_status": "WL/44",
        "catering": "VEG"
      }
    ],
    "fare": {
      "ticket_fare": 1680.0,
      "convenience_fee": 35.4,
      "total_fare": 1715.4
    },
    "gst": {
      "invoice_number": "PS26854338179611",
      "supplier_gstin": "27AAACI7074F1ZK",
      "sac_code": "996421",
      "igst_rate": 5.0,
      "igst_amount": 80.0,
      "total_tax": 80.0
    }
  },
  "warnings": [],
  "processing_ms": 11990
}

It extracted from both pages of the ERS — the ticket details from page 1 and the GST invoice from page 2.

Building a Simple Expense Processor

Here's a practical example — a script that processes a folder of mixed invoices and outputs a CSV for accounting:

import base64
import csv
from pathlib import Path

import requests

API_KEY = "YOUR_RAPIDAPI_KEY"
HEADERS = {
    "X-RapidAPI-Key": API_KEY,
    "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
    "Content-Type": "application/json"
}

def extract(file_path):
    ext = file_path.suffix.lstrip(".").lower()
    with open(file_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

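    r = requests.post(
        "https://bharatparse-indian-invoice-bill.p.rapidapi.com/v1/extract",
        headers=HEADERS,
        json={"file_b64": b64, "file_type": ext, "schema": "auto", "country": "IN"},
        timeout=60  # multi-page PDFs can take 20+ seconds to process
    )
    r.raise_for_status()  # HTTP errors are caught per file in process_folder
    return r.json()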

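def get_total(data, schema):
    """Return the total from the response's "data" object, whatever the schema."""
    # Totals that appear at the top level of "data"
    for field in ["grand_total", "total_payable", "total_amount"]:
        if data.get(field):
            return data[field]
    # Totals nested one level down, e.g. travel responses use fare.total_fare
    for parent, field in [("fare", "total_fare"), ("totals", "grand_total"), ("totals", "total")]:
        value = (data.get(parent) or {}).get(field)
        if value:
            return value
    return None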

def process_folder(folder_path, output_csv="expenses.csv"):
    folder = Path(folder_path)
    supported = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".tiff", ".tif"}
    files = [f for f in folder.iterdir() if f.suffix.lower() in supported]

    rows = []
    for file in files:
        print(f"Processing {file.name}...")
        try:
            result = extract(file)
            schema = result.get("schema_detected", "unknown")
            data = result.get("data", {})
            confidence = result.get("confidence", 0)
            warnings = result.get("warnings", [])

            rows.append({
                "file": file.name,
                "type": schema,
                "vendor": (data.get("restaurant_name") or 
                           data.get("dealer_name") or
                           data.get("provider") or
                           data.get("vendor", {}).get("name") or
                           data.get("bank_name") or "—"),
                "date": (data.get("invoice_date") or 
                         data.get("bill_date") or
                         data.get("journey_date") or "—"),
                "total": get_total(data, schema) or "—",
                "gstin": (data.get("gstin") or
                          data.get("vendor_gstin") or
                          data.get("vendor", {}).get("gstin") or "—"),
                "confidence": confidence,
                "warnings": "; ".join(warnings) if warnings else ""
            })
        except Exception as e:
            print(f"  Error: {e}")
            # Stay within the CSV columns: an extra "error" key makes DictWriter raise ValueError
            rows.append({"file": file.name, "type": "error", "warnings": str(e)})

    # Write CSV
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file","type","vendor","date","total","gstin","confidence","warnings"])
        writer.writeheader()
        writer.writerows(rows)

    print(f"\nDone. {len(rows)} invoices processed → {output_csv}")
    return rows

# Run it
results = process_folder("./invoices", "expenses.csv")

Drop any mix of PDFs and images into an invoices/ folder, run the script, and get a clean CSV with vendor, date, total, and GSTIN for every document.
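If you re-run the script on the same folder, every file is sent again and counts against your quota (the free tier is 50 extractions a month). One way to avoid that is to cache each response on disk, keyed by a hash of the file's bytes. This is a minimal sketch: `cached_extract` and the `.bharatparse_cache` directory are hypothetical names, and `extract_fn` stands for any function that takes a path and returns the parsed response, such as `extract` above.

```python
import hashlib
import json
from pathlib import Path

def cached_extract(file_path, extract_fn, cache_dir=".bharatparse_cache"):
    """Return the cached response for this exact file content, calling extract_fn only on a miss."""
    file_path = Path(file_path)
    # Key on the file's bytes, so a renamed copy still hits the cache
    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = extract_fn(file_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

In `process_folder`, this would replace `extract(file)` with `cached_extract(file, extract)`. Delete the cache directory when you want to force a fresh extraction.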

Confidence Scores and Warnings

Every response includes a confidence score (0.0–1.0) and a warnings array:

result = extract_invoice("blurry_receipt.jpg")

if result["confidence"] < 0.70:
    print("Low confidence — recommend human review")
    print("Warnings:", result["warnings"])
elif result["warnings"]:
    print("Extracted successfully with notes:")
    for w in result["warnings"]:
        print(f"  • {w}")
else:
    print("Clean extraction — confidence:", result["confidence"])

This makes it easy to build a human-in-the-loop workflow: auto-approve high-confidence extractions and flag low-confidence ones for review.
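For a batch, the same rule can sort results into queues. The sketch below mirrors the three branches above. `triage` is a hypothetical helper, and the 0.70 threshold is only the value used in the snippet above, so tune it on your own documents.

```python
def triage(results, threshold=0.70):
    """Sort extraction results into approved / approved_with_notes / review buckets."""
    buckets = {"approved": [], "approved_with_notes": [], "review": []}
    for result in results:
        if result.get("confidence", 0) < threshold:
            buckets["review"].append(result)
        elif result.get("warnings"):
            buckets["approved_with_notes"].append(result)
        else:
            buckets["approved"].append(result)
    return buckets

# The first two entries reuse confidence and warnings from the examples above;
# the third is a made-up low-confidence result for illustration.
batch = [
    {"confidence": 0.92, "warnings": ["Invoice date not visible in scan"]},
    {"confidence": 1.0, "warnings": []},
    {"confidence": 0.55, "warnings": []},
]
buckets = triage(batch)
print({name: len(items) for name, items in buckets.items()})
# → {'approved': 1, 'approved_with_notes': 1, 'review': 1}
```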

GSTIN Validation

BharatParse automatically validates every GSTIN it extracts using full checksum verification. Invalid GSTINs are flagged in warnings rather than silently passed through:

result = extract_invoice("vendor_invoice.pdf", schema="gst_invoice")
vendor = result["data"].get("vendor", {})

print("GSTIN:", vendor.get("gstin"))
print("PAN:", vendor.get("pan"))

# Check for validation warnings
gstin_warnings = [w for w in result["warnings"] if "GSTIN" in w]
if gstin_warnings:
    print("GSTIN issue:", gstin_warnings[0])
else:
    print("GSTIN validated ✓")
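If you also want to check GSTINs on your own side (for example, values typed in by users), the check character can be recomputed locally. This is a sketch of the widely used mod-36 scheme, not BharatParse's internal implementation. The function names are hypothetical, and it verifies only the length, character set, and checksum, not the state code or PAN pattern.

```python
GSTIN_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def gstin_check_char(first14):
    """Compute the 15th (check) character from the first 14 characters."""
    total = 0
    for i, ch in enumerate(first14):
        factor = 1 if i % 2 == 0 else 2          # weights alternate 1, 2, 1, 2, ...
        product = GSTIN_CHARS.index(ch) * factor
        total += product // 36 + product % 36    # add quotient and remainder in base 36
    return GSTIN_CHARS[(36 - total % 36) % 36]

def is_valid_gstin(gstin):
    """True if the GSTIN is 15 base-36 characters with a matching check character."""
    gstin = gstin.strip().upper()
    if len(gstin) != 15 or any(c not in GSTIN_CHARS for c in gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]

def pan_from_gstin(gstin):
    """Characters 3 to 12 of a GSTIN are the holder's PAN."""
    return gstin.strip().upper()[2:12]

# The two supplier GSTINs from the Jio and IRCTC responses above
for g in ("24AABCI6363G1ZP", "27AAACI7074F1ZK"):
    print(g, is_valid_gstin(g), pan_from_gstin(g))
```

Both sample GSTINs pass, while changing any single character makes the check fail, which is what catches most OCR misreads.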

Practical Use Cases

Expense management apps — automatically categorise and extract amounts from employee expense receipts. No manual data entry.

GST reconciliation — extract invoice numbers, GSTINs, and tax breakdowns for GSTR-2A matching.

Accounting integrations — push extracted data directly to Tally, Zoho Books, or QuickBooks India via their APIs.

Insurance claim processing — extract medical bills, pharmacy receipts, and hospital invoices for claim automation.

HRA compliance — extract rent receipts with landlord PAN for Form 16 and Section 10(13A) claims.

Corporate travel — extract IRCTC ticket details, journey dates, and GST invoices for travel expense reporting.

Pricing

The API is available on RapidAPI:

Free — 50 extractions/month, no credit card
Pro — $29/month — 500 extractions
Ultra — $79/month — 2,500 extractions
Mega — $199/month — 10,000 extractions

Full documentation at bharatparse.netlify.app.

Wrapping Up

Indian document extraction is a genuinely hard problem — not because the technology is complex, but because Indian documents are diverse, inconsistent, and often handwritten. A tool that understands Indian document structure rather than just reading raw text makes a real difference in production.

If you're building anything that touches Indian invoices, bills, or receipts — expense management, GST tools, accounting integrations, fintech — give BharatParse a try. The free tier is enough to validate your use case.

Questions or edge cases? Drop them in the comments — I'm actively improving the extraction prompts based on real-world documents.

Tags: python, india, api, webdev, productivity
