If you’ve ever tried to run structured extraction over a 100‑page PDF, you know the pain: pages with irrelevant legalese pollute your output, OCR burns cycles on tables you don’t care about, and your downstream logic drowns in noise. Most real‑world document bundles are a Frankenstein mix of formats and page types (loan forms, annexes, signatures, appendices) and dumping them all into your schema is like feeding your LLM a junk drawer.
Page Classification fixes this. With a single API call, you can label pages by type, extract structured data only where it matters, and keep multiple repeated data blocks neatly partitioned. No multi‑stage pipelines. No brittle regex gymnastics. Just clean, page‑aware JSON ready to drop into your RAG, agents, or ETL workflows.
Why Page Classification matters
Modern document workflows often involve mixed formats (PDFs, Word, PPTX, Excel, images) with varying page types. Consider these common scenarios:
- Multi-page loan applications that mix applicant data pages with legal terms and appendices
- Contract bundles that include redacted sections, signature pages, and technical annexes
- Insurance files combining personal information, claim details, and supporting documentation
- Research reports with executive summaries, data tables, methodology sections, and references
Running structured extraction across every page wastes compute cycles, introduces noise, and reduces accuracy. You end up with polluted schemas where personal data gets mixed with legal boilerplate, or where signature detection runs on pages that contain only text.
Page Classification solves this by letting you:
- Classify pages upfront using simple, rule-based descriptions
- Target extraction only to relevant page types
- Partition data cleanly with multiple instances of the same schema across different pages
- Maintain traceability knowing exactly which pages contributed to each extracted record
How Page Classification works
Page Classification operates on a simple principle: define your page types once, then let Tensorlake handle the classification and targeted extraction automatically. The process involves three straightforward steps that all happen within a single API call.
Instead of building complex multi-stage pipelines or writing brittle page-detection logic, you simply describe what each page type looks like in natural language. Tensorlake's AI models then classify each page and apply the appropriate extraction schema only where it makes sense.
Here's how it works in practice:
1. Define Your Page Classes
Define the types of pages you want to classify by providing a name and description for each:
```python
page_classifications = [
    PageClassConfig(
        name="applicant_info",
        description="Page containing personal info: name, address, SSN"
    ),
    PageClassConfig(
        name="contract_terms",
        description="Pages with legal contract terms and definitions"
    )
]
```
2. Target Structured Extraction by Page Class
Each structured extraction request is defined by a schema, and each can be limited to pages of a specific class:
```python
structured_extraction_options = [
    StructuredExtractionOptions(
        schema_name="ApplicantInfo",
        json_schema=applicant_schema,
        page_classifications=["applicant_info"]
    ),
    StructuredExtractionOptions(
        schema_name="Terms",
        json_schema=terms_schema,
        page_classifications=["contract_terms"]
    )
]
```
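The `json_schema` arguments above reference `applicant_schema` and `terms_schema`, which aren't shown in the snippet. A minimal sketch of one way to define them, as plain JSON Schema dictionaries whose field names are illustrative rather than prescribed by the API:

```python
# Illustrative schemas for the example above; adapt the fields to your documents.
applicant_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"},
        "ssn": {"type": "string"},
    },
    "required": ["name"],
}

terms_schema = {
    "type": "object",
    "properties": {
        "terms": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "term_name": {"type": "string"},
                    "term_description": {"type": "string"},
                },
            },
        }
    },
}
```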
3. One endpoint, everything delivered
By calling the single `/parse` endpoint, you get:
- `page_classes`: pages grouped by classification
- `structured_data`: a list of records per page class
- `document_layout`: the full document layout
- Markdown chunks

All in one response:
```python
import json

from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import StructuredExtractionOptions, PageClassConfig

doc_ai = DocumentAI(api_key="YOUR_API_KEY")

parse_id = doc_ai.parse(
    file="application_bundle.pdf",
    page_classifications=page_classifications,
    structured_data_extraction=structured_extraction_options
)

result = doc_ai.wait_for_completion(parse_id)

print("\nPage Classifications:")
for page_classification in result.page_classes:
    print(f"- {page_classification.page_class}: {page_classification.page_numbers}")

print("\nStructured Data:")
for structured_data in result.structured_data:
    print(f"\n=== {structured_data.schema_name} ===")
    data = structured_data.data
    pages = structured_data.page_numbers
    print(json.dumps(data, indent=2, ensure_ascii=False))
    print("Extract from pages: ", pages)
```
Page Class Output:

```
Page Classifications:
- terms_and_conditions: [1]
- signature_page: [2]
```
Structured Data by Page Class Output:

```
Structured Data:

=== TermsAndConditions ===
{
  "terms": [
    {
      "term_description": "You agree to only use WizzleCorp services for imaginary purposes. Any attempt to apply our products or services to real-world situations will result in strong disapproval.",
      "term_name": "Use of Services"
    },
    {
      "term_description": "Users must behave in a whimsical, respectful, and sometimes rhyming manner while interacting with WizzleCorp platforms. No trolls allowed. Literal or figurative.",
      "term_name": "User Conduct"
    },
    {
      "term_description": "All ideas, dreams, and unicorn thoughts shared through WizzleCorp become the temporary property of the Dream Bureau, a subdivision of the Ministry of Make-Believe.",
      "term_name": "Imaginary Ownership"
    },
    {
      "term_description": "By using this site, you consent to the use of cookies - both digital and chocolate chip. We cannot guarantee the availability of milk.",
      "term_name": "Cookies and Snacks"
    },
    {
      "term_description": "We reserve the right to revoke your imaginary license to access WizzleCorp should you fail to smile at least once during your visit.",
      "term_name": "Termination of Use"
    },
    {
      "term_description": "These terms may be updated every lunar eclipse. We are not responsible for any confusion caused by ancient prophecies or time travel.",
      "term_name": "Modifications"
    }
  ]
}
Extract from pages: [1, 2]

=== Signatures ===
{
  "signature_date": "January 13, 2026",
  "signature_present": true,
  "signer_name": "April Snyder"
}
Extract from pages: [1, 2]
```
Context Engineering for RAG, Agents, and ETL
Structured extraction isn’t useful in isolation; it needs to plug seamlessly into your workflows. Whether you’re building a retrieval-augmented generation (RAG) system, automating agents, or feeding data into a database or ETL pipeline, Tensorlake outputs are designed to slot in cleanly with the tools you’re already using.
Here’s how structured extraction with per-page context and JSON output enhances every part of your stack:
RAG Workflows
Feed high-fidelity, page-anchored JSON directly into your retrieval pipelines. By anchoring each field to the correct page and context, you can extract data that respects both structure and semantics—improving retrieval precision.
Say goodbye to hallucinated content and hello to grounded generation.
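As a minimal sketch of what this looks like, here's one way to turn the `result` object from the earlier example into page-anchored retrieval documents. The vector store client is a placeholder; only the `structured_data` attributes come from the example above.

```python
# Sketch: turn page-anchored structured data into retrieval documents.
import json

def to_retrieval_docs(result):
    docs = []
    for record in result.structured_data:
        docs.append({
            "text": json.dumps(record.data, ensure_ascii=False),
            "metadata": {
                "schema": record.schema_name,
                "pages": record.page_numbers,  # lets retrieved answers cite exact pages
            },
        })
    return docs

# my_vector_store.add(to_retrieval_docs(result))  # placeholder vector store client
```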
Agents & Automation
Trigger agents or workflow steps based on what was found—on which page and in what context. With every page classified and parsed into clean JSON and markdown chunks, your automations can take action with confidence.
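A simple routing sketch, assuming the page classes from step 1 and hypothetical handlers (`request_review`, `extract_signatures`) in your own codebase:

```python
# Sketch: route follow-up actions based on which page classes were detected.
detected = {pc.page_class: pc.page_numbers for pc in result.page_classes}

if "contract_terms" in detected:
    request_review(pages=detected["contract_terms"])       # e.g. notify legal
if "applicant_info" in detected:
    extract_signatures(pages=detected["applicant_info"])   # e.g. kick off signature checks
```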
Databases & ETL
Each structured extraction is a self-contained, traceable entity. You know what was extracted, where it came from, and how it maps to your data model. Use this to build ETL pipelines that are both accurate and auditable, or create page-aware payloads for indexing and querying with pinpoint precision.
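For example, a minimal sketch of landing each record in a database with page-level provenance, using SQLite for brevity and the `result` object from the earlier example:

```python
# Sketch: persist extracted records with page-level provenance.
import json
import sqlite3

conn = sqlite3.connect("extractions.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS extractions (schema_name TEXT, pages TEXT, payload TEXT)"
)
for record in result.structured_data:
    conn.execute(
        "INSERT INTO extractions VALUES (?, ?, ?)",
        (record.schema_name, json.dumps(record.page_numbers), json.dumps(record.data)),
    )
conn.commit()
```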
Try Page Classification for Precise Structured Data Extraction
Ready to streamline your document pipelines?
Explore Page Classification today with this Colab Notebook or dig into the docs.
Got feedback or want to show us what you built? Join the conversation in our Slack Community!