How to Extract Structured Data from Any Document with One API Call

#python #ai #api #tutorial

Every developer has faced this: you have a PDF invoice, a scanned receipt, or a resume, and you need the data as JSON. The traditional approach involves OCR libraries, regex parsing, and a lot of edge-case handling.

I built ScoutExtract to solve this with a single API call.

How It Works

Send a POST request with:

- Your document (text, PDF as base64, or image as base64)
- A schema describing the fields you want

Get back typed JSON with confidence scores for every field.
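The request body described above can be assembled with a small helper. This is a minimal sketch, not part of any official SDK: `build_payload` is a hypothetical name, and the `document`, `schema`, and `documentType` keys are the ones used in the examples in this post. Which `documentType` values the API accepts besides "pdf" is not stated here, so the helper passes through whatever it is given.

```python
import base64
from typing import Optional, Union


def build_payload(
    document: Union[str, bytes],
    schema: Union[str, dict],
    document_type: Optional[str] = None,
) -> dict:
    """Build the JSON body for an extraction request (hypothetical helper)."""
    if isinstance(document, bytes):
        # Binary input (PDF or image bytes) is sent as base64 text.
        document = base64.b64encode(document).decode("ascii")
    payload = {"document": document, "schema": schema}
    if document_type is not None:
        # Key name taken from the PDF example in this post.
        payload["documentType"] = document_type
    return payload


text_payload = build_payload("INVOICE #2024-0892 ...", "invoice")
pdf_payload = build_payload(b"%PDF-1.7 ...", "invoice", document_type="pdf")
```

Either payload can then be passed as the `json=` argument of `requests.post`.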

Quick Example — Invoice Parsing


import requests

response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document": """
            INVOICE #2024-0892
            Vendor: CloudStack Solutions Inc.
            Date: March 15, 2024

            API Integration    1    $2,500.00
            Cloud Hosting      3      $199.00

            Subtotal: $3,097.00
            Tax (8.875%): $274.86
            Total: $3,371.86
        """,
        "schema": "invoice"
    }
)

data = response.json()["data"]
print(f"Invoice: {data['invoice_number']['value']}")   # 2024-0892
print(f"Total: ${data['total']['value']}")              # 3371.86
print(f"Confidence: {data['total']['confidence']}")     # 0.99
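The example reads each field through a `value`/`confidence` pair. If downstream code only needs the values, a sketch like the following separates the two. `split_fields` is a hypothetical helper, and the sample data only mirrors the shape implied by the example above (the `invoice_number` confidence is an illustrative value, not real API output).

```python
from typing import Tuple


def split_fields(data: dict) -> Tuple[dict, dict]:
    """Split {"field": {"value": ..., "confidence": ...}} into two plain dicts."""
    values = {name: field["value"] for name, field in data.items()}
    confidences = {name: field["confidence"] for name, field in data.items()}
    return values, confidences


# Sample shaped like the response in the example above (illustrative values).
sample = {
    "invoice_number": {"value": "2024-0892", "confidence": 0.98},
    "total": {"value": 3371.86, "confidence": 0.99},
}

values, confidences = split_fields(sample)
print(values)
# Name of the field the model was least sure about.
print(min(confidences, key=confidences.get))
```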

Pre-built Schemas

ScoutExtract includes schemas for common document types:

Schema      Use Case
invoice     Invoices, bills, purchase orders
receipt     Store receipts, transaction records
resume      Resumes, CVs
contract    Agreements, legal contracts

Custom Schemas

Don't see your document type? Define your own:

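# Each key is a field to extract; "type" declares the expected JSON type.
custom_schema = {
    "product_name": {"type": "string"},
    "price_usd": {"type": "number"},
    "in_stock": {"type": "boolean"},
    "features": {
        "type": "array",
        "items": {"type": "string"}
    }
}

# product_page_text is the raw text of the page to parse (defined elsewhere).
response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document": product_page_text,
        # Pass the dict in place of a pre-built schema name.
        "schema": custom_schema
    }
)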
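Because a custom schema declares a type per field, the extracted values can be checked locally before they are used. The sketch below is an illustration under assumptions: `check_types` is a hypothetical helper, the type names are the four used in the schema above, and the response is assumed to use the same `value`/`confidence` shape as the invoice example.

```python
# Map the schema's type names (as used in the example above) to Python types.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
}


def check_types(schema: dict, data: dict) -> list:
    """Return the names of fields that are missing or have the wrong type."""
    problems = []
    for name, spec in schema.items():
        field = data.get(name)
        if field is None:
            problems.append(name)
            continue
        value = field["value"]
        expected = PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for "number".
        if spec["type"] == "number" and isinstance(value, bool):
            problems.append(name)
        elif not isinstance(value, expected):
            problems.append(name)
    return problems


schema = {
    "product_name": {"type": "string"},
    "price_usd": {"type": "number"},
    "in_stock": {"type": "boolean"},
}
# Illustrative response data, not real API output.
data = {
    "product_name": {"value": "Widget", "confidence": 0.97},
    "price_usd": {"value": "19.99", "confidence": 0.81},  # a string, not a number
}
print(check_types(schema, data))  # ['price_usd', 'in_stock']
```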

PDF Support

Extract data from a PDF by sending its base64-encoded content and setting documentType to "pdf":

import base64

import requests

# Read the PDF as bytes and base64-encode it so it can travel in a JSON body.
with open("invoice.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.ramlabs.dev/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document": pdf_b64,
        "documentType": "pdf",
        "schema": "invoice"
    }
)
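Before spending an extraction on a file, it can help to confirm that it is actually a PDF. A well-formed PDF begins with the bytes `%PDF-`, so a cheap local check is possible. `encode_pdf` below is a hypothetical helper, not part of ScoutExtract.

```python
import base64


def encode_pdf(pdf_bytes: bytes) -> str:
    """Base64-encode PDF bytes, rejecting input that lacks the PDF header."""
    # A well-formed PDF starts with "%PDF-" followed by a version number.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF")
    return base64.b64encode(pdf_bytes).decode("ascii")


# Stand-in bytes for illustration; in practice, pass the contents of a file.
encoded = encode_pdf(b"%PDF-1.7\n...")
```

The returned string can be used as the `document` value in the request above.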

Confidence Scores

Every extracted field includes a confidence score from 0.0 to 1.0, so you can save high-confidence values automatically and route the rest to review:

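data = response.json()["data"]

# save_to_database, queue_for_review, and assign_to_human are placeholders
# for your own handlers.
for field_name, field_data in data.items():
    confidence = field_data["confidence"]
    if confidence > 0.9:
        # High confidence: accept automatically.
        save_to_database(field_name, field_data["value"])
    elif confidence > 0.7:
        # Medium confidence: accept after a quick check.
        queue_for_review(field_name, field_data["value"], confidence)
    else:
        # Low confidence: hand the whole field to a person.
        assign_to_human(field_name, field_data)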
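The same routing can be written as a pure function, which makes the thresholds easy to test and tune. This is a sketch: `route_fields` and the bucket names are hypothetical, and the 0.9 and 0.7 cutoffs are the ones from the loop above rather than recommended values. Note that the comparisons are strict, so a field scoring exactly 0.9 lands in review.

```python
def route_fields(data: dict, auto: float = 0.9, review: float = 0.7) -> dict:
    """Group field names into buckets by confidence, using strict '>' cutoffs."""
    buckets = {"auto": [], "review": [], "manual": []}
    for name, field in data.items():
        confidence = field["confidence"]
        if confidence > auto:
            buckets["auto"].append(name)
        elif confidence > review:
            buckets["review"].append(name)
        else:
            buckets["manual"].append(name)
    return buckets


# Illustrative data, not real API output.
sample = {
    "total": {"value": 3371.86, "confidence": 0.99},
    "vendor": {"value": "CloudStack Solutions Inc.", "confidence": 0.9},
    "due_date": {"value": None, "confidence": 0.4},
}
print(route_fields(sample))
# {'auto': ['total'], 'review': ['vendor'], 'manual': ['due_date']}
```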

Pricing

- Free: 25 extractions/month (no credit card)
- Starter: $49/mo for 1,000 extractions
- Pro: $199/mo for 5,000 extractions
- Scale: $499/mo for 25,000 extractions
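To compare tiers, it helps to look at the effective cost per extraction. The sketch below uses only the prices listed above and assumes that the quotas are monthly and that there is no overage pricing, neither of which is stated here. `cheapest_plan` is a hypothetical helper.

```python
# (name, monthly price in USD, included extractions) from the list above.
PLANS = [
    ("Free", 0, 25),
    ("Starter", 49, 1_000),
    ("Pro", 199, 5_000),
    ("Scale", 499, 25_000),
]


def cheapest_plan(monthly_volume: int) -> str:
    """Return the lowest-priced plan whose quota covers the given volume."""
    for name, price, quota in sorted(PLANS, key=lambda plan: plan[1]):
        if monthly_volume <= quota:
            return name
    raise ValueError("volume exceeds the largest listed plan")


for name, price, quota in PLANS:
    print(f"{name}: ${price / quota:.4f} per extraction at full usage")

print(cheapest_plan(3_000))  # Pro
```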

Links

- Website: extract.ramlabs.dev
- Docs: extract.ramlabs.dev/docs
- GitHub: github.com/ramlabsdev/scoutextract-sdk
- Blog: extract.ramlabs.dev/blog

I'd love to hear which document types you'd find most useful. Drop a comment!
