Parsing invoices programmatically sounds simple until you're staring at your fourth supplier's PDF with a completely different layout than the previous three.
The usual answer — train a document extraction model — is overkill for most projects. You need hundreds of labeled examples, a fine-tuning pipeline, and you still have to retrain when layouts change.
In this tutorial I'll show you a different approach: declarative OCR with a YAML rules file. You describe what you're looking for and where to find it. The engine does the rest. No training, no GPU, no labeled datasets.
We'll extract the following from a real invoice image:
- Invoice number, date, due date
- Payment info (bank, account, terms)
- Billing address
- Line-item table (description, qty, unit price, amount)
- Summary table (subtotal, tax, total)

And we'll annotate the image to visually verify every extraction.
Prerequisites
- A SoceTonAI account (grab your API key and secret from the dashboard)
- Python 3.8+
pip install requests Pillow opencv-python numpy
Store credentials in config.py:
SOCETONAI_API_KEY = "your_api_key_here"
SOCETONAI_API_SECRET = "your_api_secret_here"
How SoceTonAI Script OCR works
The core idea: instead of training a model, you write a YAML rules file that acts as a template for the document. Each field definition has two parts:
- A keyword anchor: a word or phrase the engine searches for on the page (e.g. "Invoice", "Bill To")
- A positional offset: a bounding box relative to that anchor where the value lives

For tables, you define column headers instead of keyword anchors, and the engine auto-detects rows.
The API call is a single POST with the image and YAML as multipart form data. The response is structured JSON.
Step 1 — Write the YAML rules file
Here's the complete YAML for our invoice. I'll break down the key patterns after.
document_type: Receipt
description: "Extract key information from Receipt using OCR"
development: true
return_ocr_output: false
return_full_text: false
fields:
  - name: invoice_no
    label: "Invoice No"
    find:
      type: text
      keywords:
        - keyword: "Invoice"
          index: 0
          position_of_value: [1, -1, 5, 18]
      words: 5
      debug: false
      returns:
        - keywords
        - words
        - position
  - name: date
    label: "Date"
    find:
      type: text
      keywords:
        - keyword: "Date"
          index: 0
          next_keyword_position: [1, -1, 1, 3]
        - keyword: ":"
          position_of_value: [1, -1, 4, 10]
      words: 10
      debug: false
      returns:
        - keywords
        - words
        - position
  - name: due
    label: "Due"
    find:
      type: text
      keywords:
        - keyword: "Due"
          index: 0
          next_keyword_position: [1, -1, 1, 3]
        - keyword: ":"
          position_of_value: [1, -1, 3, 10]
      words: 10
      debug: false
      returns:
        - keywords
        - words
        - position
  - name: payment_info
    label: "Payment Info"
    find:
      type: text
      keywords:
        - keyword: "Payment"
          index: 0
          next_keyword_position: [1, -1, 1, 10]
        - keyword: "Info"
          position_of_value: [5, 0.5, -1, 10]
      debug: false
      returns:
        - keywords
        - words
        - position
  - name: bill_to
    label: "Bill To"
    find:
      type: text
      keywords:
        - keyword: "Bill"
          index: 0
          next_keyword_position: [1, -1, 1, 10]
        - keyword: "To"
          position_of_value: [10, 0.5, -1, 15]
      debug: false
      returns:
        - keywords
        - words
        - position
  - name: purchases
    label: "Purchase Table"
    find:
      type: table
      row_orientation: horizontal
      y_tolerance: 0.01
      debug: false
      headers:
        - header:
            - keyword: "#"
        - header:
            - keyword: "Description"
        - header:
            - keyword: "Qty"
        - header:
            - keyword: "Unit"
            - keyword: "Price"
            - keyword: "(USD)"
        - header:
            - keyword: "Amount"
            - keyword: "(USD)"
      returns:
        - headers
        - column_words
  - name: summary
    label: "Summary Table"
    find:
      type: table
      row_orientation: vertical
      y_tolerance: 0.01
      debug: true
      headers:
        - header:
            - keyword: "Subtotal"
        - header:
            - keyword: "Sales"
            - keyword: "Tax"
            - keyword: "(5%)"
        - header:
            - keyword: "Total"
      returns:
        - headers
        - column_words
Understanding the key parameters
position_of_value: [top, left, bottom, right]
This is a bounding box relative to the keyword's location, expressed in normalized units (fractions of the page). The values [1, -1, 5, 18] mean: start 1 unit below the keyword, extend left to -1 (edge), go down 5 units, go right 18 units. Negative values mean "extend to the page edge."
next_keyword_position
When you need to chain two keywords as a compound anchor (e.g. find "Date" then confirm ":" is nearby), this tells the engine where to look for the second keyword relative to the first. Prevents false matches when a word like "Date" appears in multiple places.
index: 0
Matches the first occurrence of the keyword on the page. Increment to target later occurrences.
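To make the index semantics concrete, here's a minimal sketch of picking the n-th occurrence of a keyword from a list of OCR words. It assumes the words arrive in reading order and that matching is an exact string comparison; both are assumptions for illustration, and `nth_occurrence` is a hypothetical helper, not part of the SoceTonAI API.

```python
def nth_occurrence(words, keyword, index=0):
    """Return the index-th word whose text equals keyword, or None if there are too few."""
    matches = [w for w in words if w["value"] == keyword]
    return matches[index] if 0 <= index < len(matches) else None


# Hypothetical OCR words in reading order (normalized coordinates)
ocr_words = [
    {"value": "Date", "x1": 0.81, "y1": 0.07},
    {"value": "Due", "x1": 0.81, "y1": 0.09},
    {"value": "Date", "x1": 0.10, "y1": 0.90},
]

first = nth_occurrence(ocr_words, "Date", index=0)    # the header "Date"
second = nth_occurrence(ocr_words, "Date", index=1)   # the "Date" near the footer
missing = nth_occurrence(ocr_words, "Date", index=2)  # no third occurrence
```

If an extraction suddenly comes back empty after a layout change, an out-of-range index like the last call is one plausible cause worth checking.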
row_orientation: horizontal vs vertical
Horizontal tables have rows running left-to-right — standard line-item tables. Vertical tables have labels in one column, values in the next — typical for summary sections (Subtotal / Tax / Total stacked vertically).
y_tolerance
Controls how close the vertical (y) positions of words must be for them to count as part of the same row. Increase slightly for scanned documents with skew.
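To build some intuition for y_tolerance, here's a rough sketch of how row grouping with a vertical tolerance could work. This is not the engine's actual algorithm (I don't have access to it); `group_rows` and the sample words are hypothetical, and the coordinates are normalized like the ones in the API response.

```python
def group_rows(words, y_tolerance=0.01):
    """Group words into rows: a word joins the current row when its vertical
    centre is within y_tolerance of the centre of that row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w["y1"] + w["y2"]) / 2):
        centre = (word["y1"] + word["y2"]) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= y_tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    # Read each row left to right
    return [
        [w["value"] for w in sorted(row["words"], key=lambda w: w["x1"])]
        for row in rows
    ]


# A slightly skewed scan: the first row drifts downward from left to right
skewed = [
    {"value": "1", "x1": 0.10, "y1": 0.400, "y2": 0.410},
    {"value": "Hosting", "x1": 0.20, "y1": 0.404, "y2": 0.414},
    {"value": "$ 120.00", "x1": 0.80, "y1": 0.408, "y2": 0.418},
    {"value": "2", "x1": 0.10, "y1": 0.440, "y2": 0.450},
    {"value": "Labeling", "x1": 0.20, "y1": 0.442, "y2": 0.452},
]

loose = group_rows(skewed, y_tolerance=0.01)
strict = group_rows(skewed, y_tolerance=0.005)
```

With the looser tolerance the skewed first row stays together; with the stricter one the amount drifts into a row of its own, which is the kind of symptom that suggests raising y_tolerance.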
Step 2 — Load the image
from PIL import Image
img = Image.open("dummies/images/invoice-1_0.jpg")
img = img.convert("RGB")
Step 3 — Send the API request
import requests
from config import SOCETONAI_API_SECRET, SOCETONAI_API_KEY

def generate_result(url, image_path, rules_path, headers):
    # Open both files in context managers so the handles are closed after the upload.
    # Binary mode is what requests recommends for file uploads.
    with open(image_path, "rb") as doc, open(rules_path, "rb") as rules:
        files = {"doc": doc, "rules": rules}
        response = requests.post(url, headers=headers, files=files)
    return response

response = generate_result(
    "https://api.soceton.com/script-ocr/read",
    "dummies/images/invoice-1_0.jpg",
    "dummies/ymls/invoice-1_0.yml",
    {
        "X-API-KEY": SOCETONAI_API_KEY,
        "X-API-SECRET": SOCETONAI_API_SECRET
    }
)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
result = response.json()
You're posting two files: the document image and the YAML rules file. Auth is handled via two request headers.
Step 4 — Parse the response
import json

values = {}
for k, field in result["result"].items():
    try:
        values[k] = field["value"]
    except (KeyError, TypeError) as e:
        # Report fields that came back without a value instead of stopping
        print(k, ":", e)
print(json.dumps(values, indent=4))
Output
{
    "invoice_no": "# INV - 2025-001",
    "date": "2025-02-01",
    "due": "2025-02-15",
    "payment_info": "Account : 9876543210 Bank : Example Bank Terms : Net 14",
    "bill_to": "Client Example Co. Attn : Jane Client 221 Demo Lane Sampletown , ST 12345",
    "purchases": [
        ["1", "2", "3"],
        [
            "Custom OCR integration ( one - time )",
            "Monthly hosting & support ( Jan 2025 )",
            "Training dataset labeling ( 200 images )"
        ],
        ["1", "1", "1"],
        ["$ 1,500.00", "$ 120.00", "$ 350.00"],
        ["$ 1,500.00", "$ 120.00", "$ 350.00"]
    ],
    "summary": [
        ["$ 1,970.00"],
        ["$ 98.50"],
        ["$ 2,068.50"]
    ]
}
Note: Table fields return data column-by-column, not row-by-row. The purchases field returns five arrays, one per column. To convert to row-oriented records:
columns = result["result"]["purchases"]["value"]
headers = result["result"]["purchases"]["columns"]
rows = [dict(zip(headers, row)) for row in zip(*columns)]
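Once the tables are extracted, a cheap sanity check is to make sure the numbers add up. Here's a small sketch using the values from the sample output above. `parse_money` is a hypothetical helper of mine, and the 5% rate comes from the "Sales Tax (5%)" header in the rules file; adjust both for your own documents.

```python
from decimal import Decimal


def parse_money(text):
    """Turn an OCR string such as '$ 1,500.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


# Values copied from the sample output: the Amount column and the summary table
amounts = ["$ 1,500.00", "$ 120.00", "$ 350.00"]
summary = [["$ 1,970.00"], ["$ 98.50"], ["$ 2,068.50"]]

subtotal, tax, total = (parse_money(column[0]) for column in summary)
line_sum = sum(parse_money(a) for a in amounts)

checks = {
    "line items match subtotal": line_sum == subtotal,
    "tax is 5% of subtotal": tax == (subtotal * Decimal("0.05")).quantize(Decimal("0.01")),
    "subtotal + tax = total": subtotal + tax == total,
}
failed = [name for name, ok in checks.items() if not ok]
```

I use Decimal rather than float so that currency comparisons are exact. Any entry in `failed` is a good candidate for manual review, since it usually means a digit was misread or a row was missed.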
Step 5 — Annotate the image for debugging (optional but very useful)
The response includes normalized bounding box coordinates for every keyword and word the engine found. You can draw these onto the original image to verify extractions visually.
import random
import numpy as np
import cv2

w, h = img.size
img_copy = np.asarray(img).copy()  # RGB array, so the colors below are (R, G, B)

def draw_box(box, color):
    # Scale a normalized x1/y1/x2/y2 box to pixels and draw it on the copy
    y1, x1 = int(box["y1"] * h), int(box["x1"] * w)
    y2, x2 = int(box["y2"] * h), int(box["x2"] * w)
    cv2.rectangle(img_copy, (x1, y1), (x2, y2), color, 2)

def annotate_value(key):
    field = result["result"][key]
    # Value bounding box: blue
    try:
        position = field["position"]
        top = int(position["top"] * h)
        left = int(position["left"] * w)
        bottom = int(position["bottom"] * h)
        right = int(position["right"] * w)
        cv2.rectangle(img_copy, (left, top), (right, bottom), (0, 0, 255), 2)
    except KeyError:
        pass
    # Keyword anchors: green
    try:
        for keyword in field["keywords"]:
            draw_box(keyword, (0, 255, 0))
    except KeyError:
        pass
    # Extracted value words: red
    try:
        for word in field["words"]:
            draw_box(word, (255, 0, 0))
    except KeyError:
        pass
    # Table headers and cell words: one random shade of blue per table
    try:
        color = (0, 0, random.choice([50, 100, 150, 200, 250]))
        for word in field["headers"]:
            draw_box(word, color)
        for words in field["column_words"]:
            for word in words:
                draw_box(word, color)
    except KeyError:
        pass

for k in result["result"].keys():
    annotate_value(k)
Image.fromarray(img_copy)
Color legend:
| Color | Meaning |
|-------|---------|
| 🟩 Green | Keyword anchor — what the engine used to locate the field |
| 🟥 Red | Extracted value words |
| 🟦 Blue | Value bounding box |
| 🟦 Blue (varied) | Table headers and cell words |
When something extracts incorrectly, this visualization tells you immediately whether:
- The keyword matched the wrong occurrence → increment index
- The positional offset is slightly off → tweak position_of_value
- The anchor isn't being found at all → check the exact string the OCR produces (debug: true helps)
Understanding the full response structure
Beyond .value, each field carries rich metadata you can use for validation:
"date": {
    "value": "2025-02-01",
    "keywords": [
        {
            "value": "Date",
            "confidence": 0.9908,
            "x1": 0.811, "y1": 0.076,
            "x2": 0.846, "y2": 0.082,
            "page_idx": 0,
            "block_idx": 12,
            "word_idx": 0
        }
    ],
    "words": [
        {
            "value": "2025-02-01",
            "confidence": 0.9900,
            "x1": 0.850, "y1": 0.076,
            "x2": 0.919, "y2": 0.082
        }
    ],
    "position": {
        "top": 0.069, "left": 0.846,
        "bottom": 0.088, "right": 0.934
    }
}
All coordinates are normalized (0–1), so they're resolution-independent. The confidence score (0–1) lets you build automated review queues — anything below a threshold goes to a human.
Tips for adapting this to your own documents
1. Always start with debug: true
It makes the engine return extra detail about what it matched. Flip it to false in production.
2. Use multi-keyword anchors to prevent false matches
Instead of anchoring on "Total" alone (which might match "Subtotal"), chain "Grand" → "Total" with next_keyword_position.
3. Tune position_of_value iteratively
Print the annotated image after each run. It takes 2–3 iterations to dial in a new layout.
4. Use index to handle duplicate keywords
index: 0 = first occurrence, index: 1 = second, etc.
5. Increase y_tolerance for scanned documents
Scanned pages often have slight skew. A tolerance of 0.02–0.03 instead of 0.01 helps row detection stay stable.
6. words limits how many words after the anchor to capture
Set it just high enough to cover your longest expected value. Too high and you risk capturing adjacent fields.
What this approach is good for (and where it falls short)
Good for:
- Documents with consistent keyword labels (most business docs are)
- Multiple layout variants of the same document type (just write one YAML per variant)
- Scenarios where explainability matters — you can always inspect why a value was extracted
- Rapid prototyping: first working extraction in under an hour

Less ideal for:
- Completely unstructured free text with no consistent labels
- Handwritten documents
- Documents where field positions vary wildly with no keyword anchors nearby
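If you go the "one YAML per variant" route, you need some way to pick the right rules file for an incoming document. One simple option, sketched below, is to try each variant in turn and keep the first one that fills every required field. Everything here is hypothetical: `first_matching_variant`, the file names, and the choice of required fields are mine, and `extract` stands in for a function that wraps the Step 3 request and returns a dict of field values.

```python
REQUIRED_FIELDS = ("invoice_no", "date", "summary")


def first_matching_variant(extract, rules_paths, required=REQUIRED_FIELDS):
    """Try each rules file in turn and return (path, values) for the first
    one that yields a non-empty value for every required field."""
    for path in rules_paths:
        values = extract(path)
        if all(values.get(name) for name in required):
            return path, values
    return None, {}


# Stand-in for the real API call, so the sketch runs without credentials
def fake_extract(path):
    canned = {
        "supplier_a.yml": {"invoice_no": "", "date": "2025-02-01"},
        "supplier_b.yml": {
            "invoice_no": "INV-1",
            "date": "2025-02-01",
            "summary": [["$ 10.00"]],
        },
    }
    return canned[path]


path, values = first_matching_variant(fake_extract, ["supplier_a.yml", "supplier_b.yml"])
no_match = first_matching_variant(fake_extract, ["supplier_a.yml"])
```

Keep in mind that each attempt is a separate API call, so if you already know the supplier (from the sender address or the filename, say), a plain dictionary lookup from supplier to rules file is cheaper.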
What to build next
A few natural extensions of this pipeline:
# Flag low-confidence extractions for human review
CONFIDENCE_THRESHOLD = 0.90
flagged = []
for field, data in result["result"].items():
    words = data.get("words", [])
    if words and min(word["confidence"] for word in words) < CONFIDENCE_THRESHOLD:
        flagged.append(field)
if flagged:
    print(f"Review needed: {flagged}")

# Convert column-oriented table output to row records
def table_to_rows(field_name):
    columns = result["result"][field_name]["value"]
    headers = result["result"][field_name]["columns"]
    return [dict(zip(headers, row)) for row in zip(*columns)]

line_items = table_to_rows("purchases")
for item in line_items:
    print(item)
# {'#': '1', 'Description': 'Custom OCR integration...', 'Qty': '1', ...}
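Another extension in the same spirit: the payment_info field comes back as one flat string, so you may want to split it into labeled parts. Here's a minimal sketch using a regular expression. `parse_labeled_text` is a hypothetical helper, and it assumes you know the labels in advance and that each one is followed by a colon, as in the sample output.

```python
import re


def parse_labeled_text(text, labels):
    """Split 'Label : value Label : value' text into a dict, given the known labels."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\s*:\s*"
    # re.split with a capturing group yields [prefix, label, value, label, value, ...]
    parts = re.split(pattern, text)
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}


payment = parse_labeled_text(
    "Account : 9876543210 Bank : Example Bank Terms : Net 14",
    ["Account", "Bank", "Terms"],
)
```

Requiring the colon after each label is what keeps the "Bank" inside "Example Bank" from being treated as a new label.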
If you've dealt with invoice parsing before, I'd love to hear how you approached it — drop a comment below. And if you spot anything that could be clearer in the YAML explanations, let me know.