Sadidul Islam

Extract Structured Data from Invoices Using YAML + Python (No ML Training Required)

Parsing invoices programmatically sounds simple until you're staring at your fourth supplier's PDF with a completely different layout than the previous three.

The usual answer — train a document extraction model — is overkill for most projects. You need hundreds of labeled examples, a fine-tuning pipeline, and you still have to retrain when layouts change.

In this tutorial I'll show you a different approach: declarative OCR with a YAML rules file. You describe what you're looking for and where to find it. The engine does the rest. No training, no GPU, no labeled datasets.

We'll extract the following from a real invoice image:

  • Invoice number, date, due date
  • Payment info (bank, account, terms)
  • Billing address
  • Line-item table (description, qty, unit price, amount)
  • Summary table (subtotal, tax, total)

And we'll annotate the image to visually verify every extraction.

Prerequisites

  • A SoceTonAI account (grab your API key and secret from the dashboard)
  • Python 3.8+
pip install requests Pillow opencv-python numpy

Store credentials in config.py:

SOCETONAI_API_KEY = "your_api_key_here"
SOCETONAI_API_SECRET = "your_api_secret_here"
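If you'd rather not keep secrets in a file that could end up in version control, one option is to read them from environment variables instead. This is a minimal sketch: the variable names simply mirror the config.py names above, and `load_credentials` is a hypothetical helper, not part of any SoceTonAI SDK.

```python
import os


def load_credentials(env=None):
    """Return (api_key, api_secret) read from environment variables.

    The variable names are an assumption chosen to mirror config.py.
    """
    env = os.environ if env is None else env
    names = ("SOCETONAI_API_KEY", "SOCETONAI_API_SECRET")
    # Collect every missing name so the error reports all of them at once
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.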

How SoceTonAI Script OCR works

The core idea: instead of training a model, you write a YAML rules file that acts as a template for the document. Each field definition has two parts:

  1. A keyword anchor: a word or phrase the engine searches for on the page (e.g. "Invoice", "Bill To")
  2. A positional offset: a bounding box relative to that anchor where the value lives

For tables, you define column headers instead of keyword anchors, and the engine auto-detects rows.

The API call is a single POST with the image and YAML as multipart form data. The response is structured JSON.


Step 1 — Write the YAML rules file

Here's the complete YAML for our invoice. I'll break down the key patterns after.

document_type: Receipt
description: "Extract key information from Receipt using OCR"
development: true
return_ocr_output: false
return_full_text: false

fields:

  - name: invoice_no
    label: "Invoice No"
    find:
      type: text
      keywords:
        - keyword: "Invoice"
          index: 0
      position_of_value: [1, -1, 5, 18]
      words: 5
      debug: false
      returns:
        - keywords
        - words
        - position

  - name: date
    label: "Date"
    find:
      type: text
      keywords:
        - keyword: "Date"
          index: 0
          next_keyword_position: [1, -1, 1, 3]
        - keyword: ":"
      position_of_value: [1, -1, 4, 10]
      words: 10
      debug: false
      returns:
        - keywords
        - words
        - position

  - name: due
    label: "Due"
    find:
      type: text
      keywords:
        - keyword: "Due"
          index: 0
          next_keyword_position: [1, -1, 1, 3]
        - keyword: ":"
      position_of_value: [1, -1, 3, 10]
      words: 10
      debug: false
      returns:
        - keywords
        - words
        - position

  - name: payment_info
    label: "Payment Info"
    find:
      type: text
      keywords:
        - keyword: "Payment"
          index: 0
          next_keyword_position: [1, -1, 1, 10]
        - keyword: "Info"
      position_of_value: [5, 0.5, -1, 10]
      debug: false
      returns:
        - keywords
        - words
        - position

  - name: bill_to
    label: "Bill To"
    find:
      type: text
      keywords:
        - keyword: "Bill"
          index: 0
          next_keyword_position: [1, -1, 1, 10]
        - keyword: "To"
      position_of_value: [10, 0.5, -1, 15]
      debug: false
      returns:
        - keywords
        - words
        - position

  - name: purchases
    label: "Purchase Table"
    find:
      type: table
      row_orientation: horizontal
      y_tolerance: 0.01
      debug: false
      headers:
        - header:
            - keyword: "#"
        - header:
            - keyword: "Description"
        - header:
            - keyword: "Qty"
        - header:
            - keyword: "Unit"
            - keyword: "Price"
            - keyword: "(USD)"
        - header:
            - keyword: "Amount"
            - keyword: "(USD)"
      returns:
        - headers
        - column_words

  - name: summary
    label: "Summary Table"
    find:
      type: table
      row_orientation: vertical
      y_tolerance: 0.01
      debug: true
      headers:
        - header:
            - keyword: "Subtotal"
        - header:
            - keyword: "Sales"
            - keyword: "Tax"
            - keyword: "(5%)"
        - header:
            - keyword: "Total"
      returns:
        - headers
        - column_words
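A typo in the rules file costs an API round trip to discover, so it can be worth linting the file locally first. The sketch below checks only the structure visible in the YAML above (text fields with keywords and a four-number position_of_value, table fields with headers and a row_orientation). It is an assumption-laden sanity check, not the official schema, and the engine may well accept options it does not know about. `validate_rules` is a hypothetical helper that takes the already-parsed rules as a dict, for example from PyYAML's `yaml.safe_load`, which is an extra dependency not listed in the prerequisites.

```python
from typing import List


def validate_rules(rules: dict) -> List[str]:
    """Return a list of problems found in a parsed rules file (empty if none)."""
    problems = []
    fields = rules.get("fields") or []
    if not fields:
        problems.append("no fields defined")
    for i, field in enumerate(fields):
        name = field.get("name", f"<field {i}>")
        find = field.get("find") or {}
        kind = find.get("type")
        if kind == "text":
            if not find.get("keywords"):
                problems.append(f"{name}: text field has no keywords")
            box = find.get("position_of_value")
            is_box = (isinstance(box, list) and len(box) == 4
                      and all(isinstance(v, (int, float)) for v in box))
            if not is_box:
                problems.append(f"{name}: position_of_value must be four numbers")
        elif kind == "table":
            if not find.get("headers"):
                problems.append(f"{name}: table field has no headers")
            if find.get("row_orientation") not in ("horizontal", "vertical"):
                problems.append(f"{name}: row_orientation must be horizontal or vertical")
        else:
            problems.append(f"{name}: unknown find.type {kind!r}")
    return problems


# Example with one well-formed field and one field missing its offset box
rules = {
    "fields": [
        {"name": "invoice_no",
         "find": {"type": "text",
                  "keywords": [{"keyword": "Invoice", "index": 0}],
                  "position_of_value": [1, -1, 5, 18]}},
        {"name": "date",
         "find": {"type": "text", "keywords": [{"keyword": "Date"}]}},
    ]
}
for problem in validate_rules(rules):
    print(problem)
```

Returning a list of messages rather than raising on the first problem lets you fix several mistakes in one pass.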

Understanding the key parameters

position_of_value: [top, left, bottom, right]

This is a bounding box relative to the keyword's location. The values [1, -1, 5, 18] mean: start 1 unit below the keyword, extend left to the page edge (-1), go down 5 units, go right 18 units. Negative values mean "extend to the page edge." Note that these offsets are relative units measured from the anchor, not the 0–1 page fractions used for coordinates in the response, which is why they can exceed 1.

next_keyword_position

When you need to chain two keywords as a compound anchor (e.g. find "Date" then confirm ":" is nearby), this tells the engine where to look for the second keyword relative to the first. Prevents false matches when a word like "Date" appears in multiple places.

index: 0

Matches the first occurrence of the keyword on the page. Increment to target later occurrences.

row_orientation: horizontal vs vertical

Horizontal tables have rows running left-to-right — standard line-item tables. Vertical tables have labels in one column, values in the next — typical for summary sections (Subtotal / Tax / Total stacked vertically).

y_tolerance

Controls how closely the vertical (y) positions of words must match for them to be treated as part of the same row. Increase slightly for scanned documents with skew.
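To build some intuition for y_tolerance, here is a toy row-grouping routine. It only illustrates the general idea of a row tolerance and is not SoceTonAI's actual algorithm. It assumes each word is a dict with normalized x1, y1 and y2 coordinates, like the word entries in the API response shown later in this post; `group_rows` is a hypothetical name.

```python
def group_rows(words, y_tolerance=0.01):
    """Group words into rows when their vertical centers are close enough."""
    def center(word):
        return (word["y1"] + word["y2"]) / 2

    rows = []
    for word in sorted(words, key=center):
        # Compare against the first word of the current row
        if rows and abs(center(word) - rows[-1]["center"]) <= y_tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center(word), "words": [word]})
    # Within each row, order words left to right
    return [sorted(row["words"], key=lambda w: w["x1"]) for row in rows]


# Two words on a slightly skewed line, plus one word on the next line
words = [
    {"value": "Qty", "x1": 0.60, "y1": 0.504, "y2": 0.514},
    {"value": "Description", "x1": 0.20, "y1": 0.500, "y2": 0.510},
    {"value": "1", "x1": 0.60, "y1": 0.530, "y2": 0.540},
]
for row in group_rows(words, y_tolerance=0.01):
    print([w["value"] for w in row])
```

With a much tighter tolerance the two skewed header words would split into separate rows, which is the failure mode a slightly larger y_tolerance is meant to avoid.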


Step 2 — Load the image

from PIL import Image

img = Image.open("dummies/images/invoice-1_0.jpg")
img = img.convert("RGB")

Step 3 — Send the API request

import requests
from config import SOCETONAI_API_SECRET, SOCETONAI_API_KEY

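def generate_result(url, image_path, rules_path, headers):
    # Open both uploads in binary mode and close them once the request completes
    with open(image_path, "rb") as doc, open(rules_path, "rb") as rules:
        files = {"doc": doc, "rules": rules}
        # A timeout keeps the call from hanging forever on a stalled connection
        response = requests.post(url, headers=headers, files=files, timeout=60)
    return response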

result = generate_result(
    "https://api.soceton.com/script-ocr/read",
    "dummies/images/invoice-1_0.jpg",
    "dummies/ymls/invoice-1_0.yml",
    {
        "X-API-KEY": SOCETONAI_API_KEY,
        "X-API-SECRET": SOCETONAI_API_SECRET
    }
)

result = result.json()

You're posting two files: the document image and the YAML rules file. Auth is handled via two request headers.
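The snippet above assumes the call succeeds. A small guard makes failures easier to diagnose before you start indexing into the JSON. This is a sketch under two assumptions: that the API signals errors with a non-2xx status code, and that an error payload does not contain a top-level "result" key. `unwrap_result` is a hypothetical helper name.

```python
def unwrap_result(response):
    """Return the "result" mapping from a Script OCR response, or raise.

    `response` is expected to behave like a requests.Response.
    """
    response.raise_for_status()  # requests raises HTTPError on 4xx/5xx
    payload = response.json()
    if not isinstance(payload, dict) or "result" not in payload:
        # Truncate the payload so a large body does not flood the log
        raise ValueError("Unexpected response shape: " + str(payload)[:200])
    return payload["result"]
```

You would call it on the raw response object, before converting it to JSON yourself, and then read field values from the returned mapping.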


Step 4 — Parse the response

import json

values = {}
for k in result["result"].keys():
    try:
        values[k] = result["result"][k]["value"]
    except Exception as e:
        print(k, ":", e)

print(json.dumps(values, indent=4))

Output

{
    "invoice_no": "# INV - 2025-001",
    "date": "2025-02-01",
    "due": "2025-02-15",
    "payment_info": "Account : 9876543210 Bank : Example Bank Terms : Net 14",
    "bill_to": "Client Example Co. Attn : Jane Client 221 Demo Lane Sampletown , ST 12345",
    "purchases": [
        ["1", "2", "3"],
        [
            "Custom OCR integration ( one - time )",
            "Monthly hosting & support ( Jan 2025 )",
            "Training dataset labeling ( 200 images )"
        ],
        ["1", "1", "1"],
        ["$ 1,500.00", "$ 120.00", "$ 350.00"],
        ["$ 1,500.00", "$ 120.00", "$ 350.00"]
    ],
    "summary": [
        ["$ 1,970.00"],
        ["$ 98.50"],
        ["$ 2,068.50"]
    ]
}

Note: Table fields return data column-by-column, not row-by-row. The purchases field returns five arrays, one per column. To convert to row-oriented records:

columns = result["result"]["purchases"]["value"]
headers = result["result"]["purchases"]["columns"]
rows = [dict(zip(headers, row)) for row in zip(*columns)]
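One caveat with `zip(*columns)`: it stops at the shortest column, so if OCR misses a single cell, whole rows disappear silently. A slightly more defensive variant pads short columns instead. In this sketch the header names are passed in explicitly as labels of your own choosing, and `columns_to_rows` is a hypothetical helper. Padding is applied at the end of a column, so a cell missed in the middle will still shift values up; treat a padded result as a signal to inspect the annotated image.

```python
from itertools import zip_longest


def columns_to_rows(columns, header_names, fill=None):
    """Convert column-oriented table output to a list of row dicts."""
    if len(columns) != len(header_names):
        raise ValueError(
            f"expected {len(header_names)} columns, got {len(columns)}"
        )
    # zip_longest pads short columns with `fill` instead of dropping rows
    return [
        dict(zip(header_names, row))
        for row in zip_longest(*columns, fillvalue=fill)
    ]


# Column data copied from the sample output above
purchases = [
    ["1", "2", "3"],
    [
        "Custom OCR integration ( one - time )",
        "Monthly hosting & support ( Jan 2025 )",
        "Training dataset labeling ( 200 images )",
    ],
    ["1", "1", "1"],
    ["$ 1,500.00", "$ 120.00", "$ 350.00"],
    ["$ 1,500.00", "$ 120.00", "$ 350.00"],
]
names = ["#", "description", "qty", "unit_price", "amount"]
for row in columns_to_rows(purchases, names):
    print(row)
```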

Step 5 — Annotate the image for debugging (optional but very useful)

The response includes normalized bounding box coordinates for every keyword and word the engine found. You can draw these onto the original image to verify extractions visually.

import random
import numpy as np
import cv2

w, h = img.size
img_copy = np.asarray(img).copy()

def annotate_value(key):
    # Value bounding box — blue
    try:
        position = result["result"][key]["position"]
        top    = int(position["top"]    * h)
        left   = int(position["left"]   * w)
        bottom = int(position["bottom"] * h)
        right  = int(position["right"]  * w)
        cv2.rectangle(img_copy, (left, top), (right, bottom), (0, 0, 255), 2)
    except KeyError:
        pass

    # Keyword anchors — green
    try:
        for keyword in result["result"][key]["keywords"]:
            y1, x1 = int(keyword["y1"] * h), int(keyword["x1"] * w)
            y2, x2 = int(keyword["y2"] * h), int(keyword["x2"] * w)
            cv2.rectangle(img_copy, (x1, y1), (x2, y2), (0, 255, 0), 2)
    except KeyError:
        pass

    # Extracted value words — red
    try:
        for word in result["result"][key]["words"]:
            y1, x1 = int(word["y1"] * h), int(word["x1"] * w)
            y2, x2 = int(word["y2"] * h), int(word["x2"] * w)
            cv2.rectangle(img_copy, (x1, y1), (x2, y2), (255, 0, 0), 2)
    except KeyError:
        pass

    # Table headers and cell words — blue shades
    try:
        color = (0, 0, random.choice([50, 100, 150, 200, 250]))
        for word in result["result"][key]["headers"]:
            y1, x1 = int(word["y1"] * h), int(word["x1"] * w)
            y2, x2 = int(word["y2"] * h), int(word["x2"] * w)
            cv2.rectangle(img_copy, (x1, y1), (x2, y2), color, 2)
        for words in result["result"][key]["column_words"]:
            for word in words:
                y1, x1 = int(word["y1"] * h), int(word["x1"] * w)
                y2, x2 = int(word["y2"] * h), int(word["x2"] * w)
                cv2.rectangle(img_copy, (x1, y1), (x2, y2), color, 2)
    except KeyError:
        pass

for k in result["result"].keys():
    annotate_value(k)

# In a notebook this displays inline; in a script, call .save("annotated.png") on it instead
Image.fromarray(img_copy)

Color legend:
| Color | Meaning |
|-------|---------|
| 🟩 Green | Keyword anchor — what the engine used to locate the field |
| 🟥 Red | Extracted value words |
| 🟦 Blue | Value bounding box |
| 🟦 Blue (varied) | Table headers and cell words |

When something extracts incorrectly, this visualization tells you immediately whether:

  • The keyword matched the wrong occurrence → increment index
  • The positional offset is slightly off → tweak position_of_value
  • The anchor isn't being found at all → check the exact string the OCR produces (debug: true helps)

Understanding the full response structure

Beyond .value, each field carries rich metadata you can use for validation:

"date": {
    "value": "2025-02-01",
    "keywords": [
        {
            "value": "Date",
            "confidence": 0.9908,
            "x1": 0.811, "y1": 0.076,
            "x2": 0.846, "y2": 0.082,
            "page_idx": 0,
            "block_idx": 12,
            "word_idx": 0
        }
    ],
    "words": [
        {
            "value": "2025-02-01",
            "confidence": 0.9900,
            "x1": 0.850, "y1": 0.076,
            "x2": 0.919, "y2": 0.082
        }
    ],
    "position": {
        "top": 0.069, "left": 0.846,
        "bottom": 0.088, "right": 0.934
    }
}

All coordinates are normalized (0–1), so they're resolution-independent. The confidence score (0–1) lets you build automated review queues — anything below a threshold goes to a human.
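Since every box is normalized, mapping it onto a concrete image is a single multiplication per coordinate. The helper below is a small sketch that handles both key styles seen in the sample response (x1/y1/x2/y2 for words and keywords, top/left/bottom/right for position). `to_pixels` is a hypothetical name, and it rounds to the nearest pixel where the annotation code earlier truncates with int().

```python
def to_pixels(box, width, height):
    """Convert a normalized box to integer (left, top, right, bottom) pixels."""
    if "x1" in box:
        # Word and keyword entries use x1/y1/x2/y2
        left, top, right, bottom = box["x1"], box["y1"], box["x2"], box["y2"]
    else:
        # The field-level "position" entry uses top/left/bottom/right
        left, top, right, bottom = (
            box["left"], box["top"], box["right"], box["bottom"]
        )
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Keyword box for "Date" from the sample response, on a 1000 x 1000 image
print(to_pixels({"x1": 0.811, "y1": 0.076, "x2": 0.846, "y2": 0.082}, 1000, 1000))
```

The returned tuple is ordered (left, top, right, bottom), which is the order `cv2.rectangle` expects when split into its two corner points.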


Tips for adapting this to your own documents

1. Always start with debug: true
It makes the engine return extra detail about what it matched. Flip it to false in production.

2. Use multi-keyword anchors to prevent false matches
Instead of anchoring on "Total" alone (which might match "Subtotal"), chain "Grand" and "Total" with next_keyword_position.

3. Tune position_of_value iteratively
Print the annotated image after each run. It takes 2–3 iterations to dial in a new layout.

4. Use index to handle duplicate keywords
index: 0 = first occurrence, index: 1 = second, etc.

5. Increase y_tolerance for scanned documents
Scanned pages often have slight skew. A tolerance of 0.02–0.03 instead of 0.01 helps row detection stay stable.

6. words limits how many words after the anchor to capture
Set it just high enough to cover your longest expected value. Too high and you risk capturing adjacent fields.


What this approach is good for (and where it falls short)

Good for:

  • Documents with consistent keyword labels (most business docs are)
  • Multiple layout variants of the same document type (just write one YAML per variant)
  • Scenarios where explainability matters — you can always inspect why a value was extracted
  • Rapid prototyping: first working extraction in under an hour

Less ideal for:

  • Completely unstructured free text with no consistent labels
  • Handwritten documents
  • Documents where field positions vary wildly with no keyword anchors nearby

What to build next

A few natural extensions of this pipeline:

# Flag low-confidence extractions for human review
CONFIDENCE_THRESHOLD = 0.90

flagged = []
for field, data in result["result"].items():
    words = data.get("words", [])
    if words and min(w["confidence"] for w in words) < CONFIDENCE_THRESHOLD:
        flagged.append(field)

if flagged:
    print(f"Review needed: {flagged}")

The table conversion from Step 4 can also be wrapped in a reusable helper:

# Convert column-oriented table output to row records
def table_to_rows(field_name):
    columns = result["result"][field_name]["value"]
    headers = result["result"][field_name]["columns"]
    return [dict(zip(headers, row)) for row in zip(*columns)]

line_items = table_to_rows("purchases")
for item in line_items:
    print(item)
# {'#': '1', 'Description': 'Custom OCR integration...', 'Qty': '1', ...}
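Another extension in the same spirit is an arithmetic sanity check: if the line items don't add up to the subtotal, or subtotal plus tax doesn't equal the total, something was probably misread. The sketch below assumes amounts look like the sample output ("$ 1,970.00", US-style separators) and uses Decimal to avoid float rounding. `parse_money` and `totals_consistent` are hypothetical helpers, and an unparseable string will raise decimal.InvalidOperation.

```python
from decimal import Decimal


def parse_money(text):
    """Turn an OCR'd amount such as "$ 1,970.00" into a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").replace(" ", "")
    return Decimal(cleaned)


def totals_consistent(line_amounts, subtotal, tax, total):
    """Check that the items sum to the subtotal and subtotal + tax == total."""
    items = sum((parse_money(a) for a in line_amounts), Decimal("0"))
    sub = parse_money(subtotal)
    return items == sub and sub + parse_money(tax) == parse_money(total)


# Values copied from the sample output earlier in the post
amounts = ["$ 1,500.00", "$ 120.00", "$ 350.00"]
print(totals_consistent(amounts, "$ 1,970.00", "$ 98.50", "$ 2,068.50"))
```

In the sample response the amounts would come from the last column of purchases and the three single-value columns of summary.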


If you've dealt with invoice parsing before, I'd love to hear how you approached it — drop a comment below. And if you spot anything that could be clearer in the YAML explanations, let me know.
