A few months ago, I had to process dozens of invoices from different vendors.
The problem wasn't the volume—it was the formats.
Some invoices arrived as PDFs, others as Excel spreadsheets, and a few were exported as HTML tables. Getting everything into a single CSV file for analysis became a repetitive and error-prone task.
My first approach was manual:
Open invoice
Copy values
Paste into spreadsheet
Repeat
It worked, but it didn't scale.
So I started experimenting with automation. The workflow I ended up using looked something like this:
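from pathlib import Path

import pandas as pd

invoice_dir = Path("invoices")
all_data = []

# Collect every CSV file in the invoices folder; other formats are ignored here.
for file in invoice_dir.iterdir():
    if file.suffix == ".csv":
        df = pd.read_csv(file)
        all_data.append(df)

# pd.concat raises ValueError on an empty list, so only write when something was read.
if all_data:
    combined = pd.concat(all_data, ignore_index=True)
    combined.to_csv("combined_invoices.csv", index=False)
    print("Done!")
else:
    print("No CSV files found.")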
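The script above only picks up .csv files, while the invoices described earlier also arrived as Excel spreadsheets and HTML tables. Here is a minimal sketch of one way to extend it, under a few assumptions: the READERS mapping, the load_invoices helper, and the source_file column are hypothetical names introduced for illustration, the HTML invoices are assumed to keep their data in the first table on the page, and PDFs are simply reported as skipped because they need a separate extraction step.

```python
from pathlib import Path

import pandas as pd


def read_html_first_table(path):
    # pd.read_html returns a list of DataFrames, one per <table> element.
    # Assumption: the invoice data is the first table in the file.
    return pd.read_html(str(path))[0]


# Hypothetical mapping from file extension to a reader function.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,          # requires an Excel engine such as openpyxl
    ".html": read_html_first_table,  # requires an HTML parser such as lxml
}


def load_invoices(invoice_dir):
    """Read every supported file and return (combined DataFrame, skipped names)."""
    frames = []
    skipped = []
    # sorted() keeps the row order stable between runs.
    for file in sorted(Path(invoice_dir).iterdir()):
        reader = READERS.get(file.suffix.lower())
        if reader is None:
            skipped.append(file.name)  # e.g. PDFs, which this sketch does not parse
            continue
        df = reader(file)
        df["source_file"] = file.name  # remember where each row came from
        frames.append(df)
    if not frames:
        return pd.DataFrame(), skipped
    return pd.concat(frames, ignore_index=True), skipped
```

Calling `combined, skipped = load_invoices("invoices")` would return the merged rows along with the names of any files that were left out, which makes it easier to see which invoices still need manual handling.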
The code above is intentionally simple, but the lessons were valuable:
Eliminate repetitive data entry whenever possible.
Standardize input formats early.
Build small automation tools before buying large software solutions.
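To make "standardize input formats early" a bit more concrete, here is a rough sketch of mapping vendor-specific column headers onto one shared schema before anything is merged. It uses only the standard library so it runs without extra dependencies. The COLUMN_ALIASES table, the header names in it, and the standardize_rows helper are all made up for illustration; real invoices would need their own mapping.

```python
import csv
import io

# Hypothetical mapping from vendor-specific headers to one shared schema.
COLUMN_ALIASES = {
    "invoice no": "invoice_id",
    "invoice #": "invoice_id",
    "inv_id": "invoice_id",
    "supplier": "vendor",
    "total": "amount",
    "amount due": "amount",
}


def standardize_rows(csv_text):
    """Yield each CSV row as a dict keyed by the shared column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        clean = {}
        for header, value in row.items():
            if header is None:
                # DictReader stores surplus cells under the key None; ignore them.
                continue
            key = header.strip().lower()
            # Fall back to the lowercased header when no alias is defined.
            clean[COLUMN_ALIASES.get(key, key)] = (value or "").strip()
        yield clean
```

For example, `list(standardize_rows("Invoice No,Supplier,Total\nA-17,Acme,120.50\n"))` gives a single row with the keys `invoice_id`, `vendor`, and `amount`, regardless of how that vendor labeled its columns.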
What started as a frustrating administrative task turned into a workflow that now saves hours every month.
How are you handling invoice processing or document extraction in your projects? Do you use OCR, custom scripts, or third-party tools?