How to Extract Structured Data from Indian Invoices Using Python (GST, Fuel, Telecom, IRCTC)
If you've ever built an expense management tool, accounting integration, or GST reconciliation system for Indian businesses, you know the problem: Indian invoices are a mess.
A Jio bill is 7 pages long but has only one useful page. A petrol pump receipt has handwritten amounts in blue ink over a printed template. An IRCTC ticket has the GST invoice buried on page 2. A Starbucks receipt is a blurry photo taken at an angle on a phone.
Traditional OCR tools like Amazon Textract or Google Cloud Vision extract raw text, but they don't understand that 94:14 written on a fuel receipt means 94.14 litres, or that a GSTIN has a checksum you can validate, or that you should ignore the 80-row data usage table in a Jio bill and focus on the summary.
That's the problem I built BharatParse to solve — an API that turns any Indian invoice, bill, or receipt into clean, validated JSON with a single POST request.
In this article I'll show you how to integrate it in Python in under 5 minutes.
What BharatParse Handles
The API supports 13 document schemas out of the box:
| Schema | Examples |
|---|---|
| gst_invoice | B2B tax invoices with GSTIN validation |
| restaurant | Starbucks, Zomato, local restaurant bills |
| fuel | Handwritten BPCL, HPCL, IOC pump receipts |
| telecom | Jio Fiber, Airtel, BSNL, Vi monthly bills |
| travel | IRCTC e-tickets, train ERS |
| utility | BESCOM, MSEDCL, Mahanagar Gas bills |
| medical | Pharmacy bills, hospital invoices |
| ecommerce | Amazon, Flipkart, Meesho invoices |
| rent | Rent receipts with landlord PAN extraction |
| bank_statement | HDFC, SBI, ICICI, Axis statements |
| credit_card | Credit card monthly statements |
| auto | Auto-detects the document type |
| generic | Any other Indian bill or receipt |
Input formats supported: PDF, JPEG, PNG, WebP, TIFF — phone photos, scanner output, WhatsApp-shared images all work.
Getting Started
1. Get your free API key
Sign up at RapidAPI — the free tier gives you 50 extractions/month, no credit card needed.
2. Install requests
pip install requests
3. Make your first call
import base64
import json

import requests

API_URL = "https://bharatparse-indian-invoice-bill.p.rapidapi.com/v1/extract"


def extract_invoice(file_path, schema="auto"):
    """
    Extract structured data from any Indian invoice.

    Args:
        file_path: Path to PDF or image file
        schema: Document type hint (default: auto-detect)

    Returns:
        dict: Extracted data with confidence score
    """
    # Determine file type from extension
    ext = file_path.rsplit(".", 1)[-1].lower()

    # Read and encode the file
    with open(file_path, "rb") as f:
        file_b64 = base64.b64encode(f.read()).decode()

    # Call BharatParse API (multi-page PDFs can take 20+ seconds, so allow a generous timeout)
    response = requests.post(
        API_URL,
        headers={
            "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
            "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
            "Content-Type": "application/json",
        },
        json={
            "file_b64": file_b64,
            "file_type": ext,
            "schema": schema,
            "country": "IN",
        },
        timeout=60,
    )
    # Fail loudly on HTTP errors (bad key, quota exceeded) instead of parsing an error page
    response.raise_for_status()
    return response.json()


# Test it
result = extract_invoice("invoice.pdf")
print(json.dumps(result, indent=2))
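Hardcoding the key is fine for a first test, but you probably don't want it committed to a repository. Here is a minimal sketch that reads it from an environment variable instead. The variable name `RAPIDAPI_KEY` and the helper `build_headers` are my own choices for illustration, not something the API requires.

```python
import os


def build_headers(env_var="RAPIDAPI_KEY"):
    """Build the request headers, reading the key from the environment.

    `env_var` is a hypothetical variable name; use whatever your deployment sets.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
        "Content-Type": "application/json",
    }
```

You can then pass `headers=build_headers()` to `requests.post` in place of the inline dictionary.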
Real Examples
Example 1 — Restaurant Bill (Starbucks photo)
result = extract_invoice("starbucks_receipt.jpg", schema="restaurant")
Response:
{
  "schema_detected": "restaurant",
  "confidence": 0.92,
  "data": {
    "restaurant_name": "Starbucks",
    "hsn_code": "996331",
    "line_items": [
      {
        "name": "Tall Cold Coffee",
        "quantity": 1,
        "unit_price": 320.0,
        "total": 320.0
      }
    ],
    "taxable_value": 320.0,
    "cgst_rate": 2.5,
    "cgst_amount": 8.0,
    "sgst_rate": 2.5,
    "sgst_amount": 8.0,
    "grand_total": 336.0,
    "payment": {
      "mode": "starbucks_card",
      "card_last4": "1821"
    }
  },
  "warnings": ["Invoice date not visible in scan"],
  "processing_ms": 1843
}
Notice that it correctly identified HSN 996331 (restaurant services), extracted CGST and SGST at 2.5% each, and recognised the payment as a Starbucks card, including the last four digits.
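Because the response carries the taxable value, both tax rates, and the grand total, you can cross-check the arithmetic yourself before posting it to a ledger. The sketch below assumes the field names shown in the Starbucks response above. The helper `check_gst_totals` and its one-paisa tolerance are illustrative assumptions, not part of the API.

```python
def check_gst_totals(data, tolerance=0.01):
    """Return a list of arithmetic mismatches for a CGST + SGST bill.

    `data` is the "data" object of a response; an empty list means the
    numbers are internally consistent within `tolerance` rupees.
    """
    issues = []
    taxable = data["taxable_value"]
    tax_total = 0.0

    for tax in ("cgst", "sgst"):
        rate = data[f"{tax}_rate"]
        amount = data[f"{tax}_amount"]
        expected = round(taxable * rate / 100, 2)
        if abs(expected - amount) > tolerance:
            issues.append(f"{tax.upper()} is {amount}, expected {expected} at {rate}%")
        tax_total += amount

    expected_total = round(taxable + tax_total, 2)
    if abs(expected_total - data["grand_total"]) > tolerance:
        issues.append(f"grand_total is {data['grand_total']}, expected {expected_total}")

    return issues
```

For the receipt above, 320.0 at 2.5% gives 8.0 for each tax, and 320.0 + 8.0 + 8.0 matches the reported 336.0, so the function returns an empty list. Bills that round to the nearest rupee may need a looser tolerance.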
Example 2 — Handwritten Fuel Receipt (BPCL pump memo)
This is where BharatParse earns its keep. Generic OCR tools tend to fail on handwritten pump memos like this one.
result = extract_invoice("fuel_receipt.jpg", schema="fuel")
Response:
{
  "schema_detected": "fuel",
  "confidence": 0.90,
  "data": {
    "dealer_name": "N. M. Shamsuddin & Sons",
    "oil_company": "BPCL",
    "invoice_date": "2025-06-05",
    "fuel_items": [
      {
        "fuel_type": "Speed",
        "litres": 94.14,
        "rate_per_litre": 21.24,
        "amount": 2000.0
      }
    ],
    "total_amount": 2000.0
  },
  "warnings": [
    "Litres value '94:14' is handwritten and interpreted as 94.14"
  ],
  "processing_ms": 8598
}
It correctly interpreted 94:14 (written with a colon) as 94.14 litres, identified the fuel type as Speed (BPCL's premium petrol brand), and flagged the handwritten interpretation in warnings.
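Handwritten values are worth a sanity check on your side as well: litres multiplied by the rate should land close to the amount. For the response above, 94.14 × 21.24 is roughly 1999.53 against a billed 2000.0. A minimal sketch, assuming the `fuel_items` field names shown above; the helper name and the one-rupee tolerance (pumps often round to a whole amount) are my own assumptions.

```python
def check_fuel_items(data, tolerance=1.0):
    """Flag fuel line items where litres x rate does not match the amount.

    `data` is the "data" object of a fuel response. Returns a list of
    human-readable problems; an empty list means everything lines up
    within `tolerance` rupees.
    """
    problems = []
    items = data.get("fuel_items", [])

    for item in items:
        expected = round(item["litres"] * item["rate_per_litre"], 2)
        if abs(expected - item["amount"]) > tolerance:
            problems.append(
                f"{item.get('fuel_type', 'fuel')}: amount {item['amount']} "
                f"but litres x rate = {expected}"
            )

    # The line items should also add up to the receipt total
    items_sum = round(sum(item["amount"] for item in items), 2)
    if "total_amount" in data and abs(items_sum - data["total_amount"]) > tolerance:
        problems.append(f"total_amount {data['total_amount']} but items sum to {items_sum}")

    return problems
```

Anything this returns is a good candidate for the same review queue you use for low-confidence extractions.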
Example 3 — Jio Fiber Bill (7-page PDF)
result = extract_invoice("jio_bill.pdf", schema="telecom")
Response:
{
  "schema_detected": "telecom",
  "confidence": 1.0,
  "data": {
    "provider": "Jio",
    "customer_name": "Mr. Shyam Arjandas Warialani",
    "account_number": "411252569305",
    "due_date": "2025-09-30",
    "plan_name": "Postpaid_399_6M: Unlimited Data @ 30 Mbps",
    "vendor_gstin": "24AABCI6363G1ZP",
    "gst_bill_number": "W241252611070283",
    "sac_code": "998422",
    "charges": {
      "current_taxable_charges": 399.0,
      "cgst_rate": 9.0,
      "cgst_amount": 35.91,
      "sgst_rate": 9.0,
      "sgst_amount": 35.91,
      "total_current_charges": 470.82
    },
    "total_payable": 470.82,
    "payments_this_period": [
      {"mode": "credit_card", "date": "2025-09-01", "amount": 394.89}
    ]
  },
  "warnings": [],
  "processing_ms": 24406
}
From a 7-page PDF, it extracted only the useful billing summary — ignoring 80 rows of itemised data usage and focusing on what any accounting system actually needs.
Example 4 — IRCTC Train Ticket
result = extract_invoice("irctc_ticket.pdf", schema="travel")
Response:
{
  "schema_detected": "travel",
  "confidence": 0.95,
  "data": {
    "pnr": "8543381796",
    "train_number": "82902",
    "train_name": "IRCTC TEJAS EXP",
    "journey_date": "2026-01-24",
    "from_station": "AHMEDABAD JN (ADI)",
    "boarding_station": "VADODARA JN (BRC)",
    "to_station": "BORIVALI (BVI)",
    "passengers": [
      {
        "name": "SHYAM WARIALANI",
        "age": 67,
        "gender": "M",
        "current_status": "WL/44",
        "catering": "VEG"
      }
    ],
    "fare": {
      "ticket_fare": 1680.0,
      "convenience_fee": 35.4,
      "total_fare": 1715.4
    },
    "gst": {
      "invoice_number": "PS26854338179611",
      "supplier_gstin": "27AAACI7074F1ZK",
      "sac_code": "996421",
      "igst_rate": 5.0,
      "igst_amount": 80.0,
      "total_tax": 80.0
    }
  },
  "warnings": [],
  "processing_ms": 11990
}
It pulled data from both pages of the ERS (Electronic Reservation Slip): the ticket details from page 1 and the GST invoice from page 2.
Building a Simple Expense Processor
Here's a practical example — a script that processes a folder of mixed invoices and outputs a CSV for accounting:
import base64
import csv
from pathlib import Path

import requests

API_KEY = "YOUR_RAPIDAPI_KEY"
API_URL = "https://bharatparse-indian-invoice-bill.p.rapidapi.com/v1/extract"
HEADERS = {
    "X-RapidAPI-Key": API_KEY,
    "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
    "Content-Type": "application/json",
}
FIELDNAMES = ["file", "type", "vendor", "date", "total", "gstin", "confidence", "warnings"]


def extract(file_path):
    """Send one file to the API and return the parsed JSON response."""
    ext = file_path.suffix.lstrip(".").lower()
    with open(file_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    r = requests.post(
        API_URL,
        headers=HEADERS,
        json={"file_b64": b64, "file_type": ext, "schema": "auto", "country": "IN"},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()


def get_total(data):
    """Extract the grand total from the `data` object of any schema."""
    for field in ["grand_total", "total_payable", "total_amount", "total_fare"]:
        if data.get(field) is not None:
            return data[field]
    # Some schemas nest the total, e.g. travel responses keep it under "fare"
    fare = data.get("fare") or {}
    totals = data.get("totals") or {}
    return fare.get("total_fare") or totals.get("grand_total") or totals.get("total")


def process_folder(folder_path, output_csv="expenses.csv"):
    folder = Path(folder_path)
    supported = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".tiff", ".tif"}
    files = [f for f in folder.iterdir() if f.suffix.lower() in supported]
    rows = []

    for file in files:
        print(f"Processing {file.name}...")
        try:
            result = extract(file)
            schema = result.get("schema_detected", "unknown")
            data = result.get("data", {})
            confidence = result.get("confidence", 0)
            warnings = result.get("warnings", [])
            rows.append({
                "file": file.name,
                "type": schema,
                "vendor": (data.get("restaurant_name") or
                           data.get("dealer_name") or
                           data.get("provider") or
                           data.get("vendor", {}).get("name") or
                           data.get("bank_name") or "—"),
                "date": (data.get("invoice_date") or
                         data.get("bill_date") or
                         data.get("journey_date") or "—"),
                "total": get_total(data) or "—",
                "gstin": (data.get("gstin") or
                          data.get("vendor_gstin") or
                          data.get("vendor", {}).get("gstin") or
                          data.get("gst", {}).get("supplier_gstin") or "—"),
                "confidence": confidence,
                "warnings": "; ".join(warnings) if warnings else "",
            })
        except Exception as e:
            print(f"  Error: {e}")
            # Keep to the CSV columns: DictWriter rejects keys outside FIELDNAMES
            rows.append({"file": file.name, "type": "error", "warnings": str(e)})

    # Write CSV
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)

    print(f"\nDone. {len(rows)} invoices processed → {output_csv}")
    return rows


# Run it
results = process_folder("./invoices", "expenses.csv")
Drop any mix of PDFs and images into an invoices/ folder, run the script, and get a clean CSV with vendor, date, total, and GSTIN for every document.
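Once the CSV exists, a quick per-category summary is often the first thing an accountant asks for. This is a small sketch that assumes the `type` and `total` column names written by the script above. The function name `summarise_expenses` is mine, and rows whose total is not numeric (failed extractions, placeholder values) are simply skipped.

```python
import csv
from collections import defaultdict


def summarise_expenses(csv_path="expenses.csv"):
    """Sum the `total` column per document `type`, skipping non-numeric totals."""
    totals = defaultdict(float)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                amount = float(row.get("total") or "")
            except ValueError:
                # Placeholder or empty total: nothing to add for this row
                continue
            totals[row.get("type") or "unknown"] += amount
    return dict(totals)


if __name__ == "__main__":
    import os

    if os.path.exists("expenses.csv"):
        for doc_type, amount in sorted(summarise_expenses().items()):
            print(f"{doc_type}: {amount:.2f}")
```

Floats are fine for a quick overview; for ledger entries you would likely want `decimal.Decimal` instead.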
Confidence Scores and Warnings
Every response includes a confidence score (0.0–1.0) and a warnings array:
result = extract_invoice("blurry_receipt.jpg")

if result["confidence"] < 0.70:
    print("Low confidence — recommend human review")
    print("Warnings:", result["warnings"])
elif result["warnings"]:
    print("Extracted successfully with notes:")
    for w in result["warnings"]:
        print(f"  • {w}")
else:
    print("Clean extraction — confidence:", result["confidence"])
This makes it easy to build a human-in-the-loop workflow — auto-approve high confidence extractions, flag low confidence ones for review.
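In a pipeline you would usually return a routing decision rather than print. A minimal sketch of that idea follows. The 0.70 review threshold matches the example above, while the 0.90 auto-approve threshold, the three queue names, and the function `route_extraction` are assumptions you should tune against your own documents.

```python
def route_extraction(result, auto_approve_at=0.90, review_below=0.70):
    """Decide what to do with one API response.

    Returns "manual_review", "spot_check", or "auto_approve".
    The queue names and the 0.90 threshold are illustrative choices.
    """
    confidence = result.get("confidence", 0.0)
    warnings = result.get("warnings", [])

    if confidence < review_below:
        return "manual_review"
    if confidence >= auto_approve_at and not warnings:
        return "auto_approve"
    # Decent confidence but with warnings, or confidence in the middle band
    return "spot_check"
```

With these defaults, the Jio bill (confidence 1.0, no warnings) would be auto-approved, while the handwritten fuel receipt (0.90 with a warning) would go to a spot check.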
GSTIN Validation
BharatParse automatically validates every GSTIN it extracts using full checksum verification. Invalid GSTINs are flagged in warnings rather than silently passed through:
result = extract_invoice("vendor_invoice.pdf", schema="gst_invoice")
vendor = result["data"].get("vendor", {})
print("GSTIN:", vendor.get("gstin"))
print("PAN:", vendor.get("pan"))

# Check for validation warnings
gstin_warnings = [w for w in result["warnings"] if "GSTIN" in w]
if gstin_warnings:
    print("GSTIN issue:", gstin_warnings[0])
else:
    print("GSTIN validated ✓")
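If you also want to re-verify a GSTIN locally, for example on values typed in by a user, the check character can be recomputed with a Luhn mod 36 scheme over the first 14 characters. The sketch below is my own implementation of that widely used scheme, not BharatParse's internal code, and it only checks length, character set, and checksum, not whether the GSTIN is actually registered.

```python
import re

GSTIN_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def gstin_check_char(first14):
    """Compute the 15th (check) character from the first 14 characters."""
    total = 0
    for i, ch in enumerate(first14):
        value = GSTIN_CHARS.index(ch)
        # Weights alternate 1, 2, 1, 2, ... starting from the first character
        product = value * (2 if i % 2 else 1)
        quotient, remainder = divmod(product, 36)
        total += quotient + remainder
    return GSTIN_CHARS[(36 - total % 36) % 36]


def is_valid_gstin(gstin):
    """Return True if `gstin` is 15 alphanumeric characters with a correct check character."""
    gstin = gstin.strip().upper()
    if not re.fullmatch(r"[0-9A-Z]{15}", gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]


def pan_from_gstin(gstin):
    """Characters 3 to 12 of a GSTIN are the holder's PAN."""
    return gstin.strip().upper()[2:12]
```

Both GSTINs from the examples earlier in this article, `24AABCI6363G1ZP` and `27AAACI7074F1ZK`, pass this check, and changing the last character of either makes it fail.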
Practical Use Cases
Expense management apps — automatically categorise and extract amounts from employee expense receipts. No manual data entry.
GST reconciliation — extract invoice numbers, GSTINs, and tax breakdowns for GSTR-2A matching.
Accounting integrations — push extracted data directly to Tally, Zoho Books, or QuickBooks India via their APIs.
Insurance claim processing — extract medical bills, pharmacy receipts, and hospital invoices for claim automation.
HRA compliance — extract rent receipts with landlord PAN for Form 16 and Section 10(13A) claims.
Corporate travel — extract IRCTC ticket details, journey dates, and GST invoices for travel expense reporting.
Pricing
The API is available on RapidAPI:
- Free — 50 extractions/month, no credit card
- Pro — $29/month — 500 extractions
- Ultra — $79/month — 2,500 extractions
- Mega — $199/month — 10,000 extractions
Full documentation at bharatparse.netlify.app.
Wrapping Up
Indian document extraction is a genuinely hard problem — not because the technology is complex, but because Indian documents are diverse, inconsistent, and often handwritten. A tool that understands Indian document structure rather than just reading raw text makes a real difference in production.
If you're building anything that touches Indian invoices, bills, or receipts — expense management, GST tools, accounting integrations, fintech — give BharatParse a try. The free tier is enough to validate your use case.
Questions or edge cases? Drop them in the comments — I'm actively improving the extraction prompts based on real-world documents.
Tags: python, india, api, webdev, productivity