Oliver S

Build a self-improving AI agent that turns documents into structured data (with LangGraph)

Project: Unstructured to structured

What does this AI agent actually do?

This self-improving AI agent takes messy documents (invoices, contracts, medical reports, whatever) and turns them into clean, structured data and CSV tables. But here's the kicker - it actually gets better at its job over time.

Full code open source at: https://github.com/Handit-AI/handit-examples/tree/main/examples/unstructured-to-structured

Let’s dive in!

Table of Contents

  1. The self-improvement (Best Part)
  2. Architecture Overview
  3. Inference Schema Node - the schema detective
  4. Document Data Capture Node - the data extractor
  5. Generate CSV Node - the table builder
  6. Conclusions

1. The self-improvement (Best Part)

I’m going to start with the best part, the cherry on top.

Here’s the really cool thing: this AI agent actually gets better over time. The secret weapon is Handit.ai.

Every action, every response is fully observed and analyzed. The system can see:

  • Whether the schema inferences worked well
  • Which data extractions failed
  • How long processing takes
  • Which document types cause issues
  • When the LLM makes mistakes
  • Whether the generated CSVs are filled with real data
  • And more...

And yes sir! When this powerful tool detects a mistake, it fixes it automatically.

This means the AI agent can actually improve itself. If the LLM extracts the wrong field or generates incorrect schemas or CSVs, Handit.ai tracks that failure and automatically adjusts the AI agent to prevent the same mistake from happening again. It's like having an AI engineer who is constantly monitoring, evaluating and improving your AI agent.

Here are the results before and after implementing Handit.ai:

Before Handit

Input: Purchase Order

Output:

Whoa! It looks like our AI agent is making some mistakes.

First mistake ❌: In Table 2, it created two redundant columns, item_code and sku. We don’t need both because they represent the same thing.

Second mistake ❌: This is the most critical one. The Epson unit price is NOT 549.99, and neither is the total. Mistakes like this are serious when you’re dealing with important data.

After Handit

First, let’s see what results Handit gives us:

Here it is: it highlights several insights, including inconsistencies in the data extraction. Handit automatically detected the issue for us.

HERE IS THE BEST PART!

It automatically sent us a PR with the fixes for our AI agent, ready to deploy, boosting accuracy by 43%!

After merging the PR, here are the results:

Input: Same Purchase Order

Output:

The mistakes have been fixed automatically!

Table 2 now contains only one code column — item_code — and groups everything correctly. ✅

The Epson unit price is now correct at 549.99, and the total is correct as well. ✅

2. Architecture Overview

So I built this thing called “Unstructured to Structured”, and honestly, it’s doing some pretty wild stuff. Let me break down what’s actually happening under the hood.

Let’s understand the architecture of our AI agent at a very high level:

This architecture separates concerns into distinct nodes:

  1. inference_schema

    • Purpose: AI analyzes uploaded documents to create a unified JSON schema
    • Input: Images, PDFs, text files
    • Output: Structured schema defining data fields and relationships
    • AI capability: Multimodal analysis (vision + text)
  2. document_data_capture

    • Purpose: Maps document content to the inferred schema using AI extraction
    • Input: Documents + inferred schema
    • Output: Structured JSON with field mappings
    • AI capability: Field extraction with confidence scores
  3. generate_csv

    • Purpose: Converts structured JSON into clean CSV tables
    • Input: Structured JSON from the previous node
    • Output: CSV files ready for analysis
    • AI capability: Intelligent table structure planning
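
To make this pipeline concrete, here is a minimal LangGraph wiring sketch of those three nodes. The state fields and node stubs are placeholders I chose for illustration; check the repo for the real definitions.

from typing import Any, Dict, List, TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict, total=False):
    documents: List[Dict[str, Any]]             # uploaded files (images, PDFs, text)
    inferred_schema: Dict[str, Any]             # output of inference_schema
    extracted_documents: List[Dict[str, Any]]   # output of document_data_capture
    csv_files: List[str]                        # paths of the generated CSVs


# Node implementations are covered in the sections below; stubs keep this sketch short.
def inference_schema(state: AgentState) -> AgentState: ...
def document_data_capture(state: AgentState) -> AgentState: ...
def generate_csv(state: AgentState) -> AgentState: ...


graph = StateGraph(AgentState)
graph.add_node("inference_schema", inference_schema)
graph.add_node("document_data_capture", document_data_capture)
graph.add_node("generate_csv", generate_csv)

graph.add_edge(START, "inference_schema")
graph.add_edge("inference_schema", "document_data_capture")
graph.add_edge("document_data_capture", "generate_csv")
graph.add_edge("generate_csv", END)

app = graph.compile()
# app.invoke({"documents": [...]})  # kicks off the full pipeline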

And... how does this AI agent get better over time?

Here is the secret weapon: Handit.ai

  1. Observability
    • Every interaction with our AI agent is monitored by Handit
  2. Failure Detection
    • Handit automatically identifies errors in any of our LLMs, like when a CSV file doesn’t contain the right content (really important for this AI agent)
  3. Automated Fix Generation
    • If a failure is detected, Handit automatically fixes our prompts for us

Another Perspective of the Workflow 🧠

Think of it as a smart pipeline that processes documents step by step. Here's what happens:

  1. You upload documents - like invoices, contracts, medical reports (any type)
  2. The agent analyzes everything - it looks at all your documents and figures out the best structure (schema)
  3. It creates a unified schema - one JSON schema that can represent all your documents
  4. Then extracts the data - maps each document to the schema with AI
  5. Finally builds tables - creates CSV files and structured data you can actually use
  6. Self-improvement - Handit observes every interaction with our agents, and if a failure is detected, it fixes it for us

3. Inference Schema Node - the schema detective

This is where the magic starts:

  1. Reads images, PDFs, and text
  2. Proposes a unified JSON schema that fits everything
  3. Works across any document type
  4. Adds useful field types and reasoning

from typing import Dict, List, Optional

from pydantic import BaseModel, Field


class SchemaField(BaseModel):
    """
    Represents a single field in the inferred schema.

    Each field defines the structure, validation rules, and metadata for a piece
    of data extracted from documents. Fields can be simple (string, number) or
    complex (objects, arrays) depending on the document structure.
    """

    name: str = Field(description="Field name")
    types: List[str] = Field(description="Allowed JSON types, e.g., ['string'], ['number'], ['string','null'], ['object'], ['array']")
    description: str = Field(description="What the field represents and how to interpret it")
    required: bool = Field(description="Whether this field is commonly present across the provided documents")
    examples: Optional[List[str]] = Field(default=None, description="Representative example values if known")
    enum: Optional[List[str]] = Field(default=None, description="Enumerated set of possible values when applicable")
    format: Optional[str] = Field(default=None, description="Special format hint like 'date', 'email', 'phone', 'currency', 'lang' etc.")
    reason: str = Field(description="Brief rationale for inferring this field (signals, patterns, layout cues)")


class SchemaSection(BaseModel):
    """
    Logical grouping of fields to organize the schema structure.

    Sections help organize related fields into meaningful groups rather than
    having all fields in a flat list. This improves schema readability and
    makes it easier to understand the document structure.
    """

    name: str = Field(description="Section name (generic), e.g., 'core', 'entities', 'dates', 'financial', 'items', 'metadata'")
    fields: List[SchemaField] = Field(description="Fields within this section")


class InferredSchema(BaseModel):
    """
    Top-level inferred schema for a heterogeneous set of documents.

    This schema represents the unified structure that can accommodate various
    document types and formats. It combines common patterns found across
    multiple documents into a single, flexible schema definition.
    """

    title: str = Field(description="Human-readable title of the inferred schema")
    version: str = Field(description="Schema semantic version, e.g., '0.1.0'")
    description: str = Field(description="High-level description of the schema and how it was inferred")

    common_sections: List[SchemaSection] = Field(description="Sections that apply broadly across the provided documents")
    specialized_sections: Optional[Dict[str, List[SchemaSection]]] = Field(
        default=None,
        description="Optional mapping of document_type -> sections specific to that type",
    )

    rationale: str = Field(description="Concise explanation of the main signals used to infer this schema")

system_prompt = """
You are a senior information architect. Given multiple heterogeneous documents (any type, any language), infer the most appropriate, general JSON schema that can represent them.

Guidance:
- Infer structure purely from the supplied documents; avoid biasing toward any specific document type.
- Use lower_snake_case for field names.
- Use JSON types: string, number, boolean, object, array, null. When a field may be missing, include null in its types.
- Allow nested objects and arrays where the documents imply hierarchical structure.
- Include brief, useful descriptions for fields when possible without inventing content.
- Return ONLY JSON that matches the provided Pydantic model for an inferred schema.

Per-field requirements:
- For each field, add a short 'reason' explaining the signals used to infer the field (keywords, repeated labels, table headers, layout proximity, visual grouping, etc.).
"""

4. Document Data Capture Node - the data extractor

Maps every uploaded document’s content to the inferred schema using AI extraction:


system_prompt = """
You are a robust multimodal (vision + text) document-to-schema mapping system. Given an inferred schema and a document (image/pdf/text), analyze layout and visual structure first, then map fields strictly to the provided schema.

Requirements:
- Use the provided schema as the contract for output structure (keep sections/fields as-is).
- For each field, search labels/headers/aliases using the 'synonyms' provided by the schema and semantic similarity (including multilingual variants).
- Prioritize visual layout cues (titles, headers, table columns, proximity, group boxes) before plain text.
- Do NOT invent values. If a value isn't found, set it to null and add a short reason.
- For every field, include a short 'reason' explaining the mapping (signals used) and a 'normalized_value' when applicable (e.g., date to ISO, amounts to numeric, emails lowercased, trimmed strings).
- Return ONLY a JSON object that mirrors the schema sections/fields. Each field should be an object: {{"value": <any|null>, "normalized_value": <any|null>, "reason": <string>, "confidence": <number optional>}}.
"""

5. Generate CSV Node - the table builder

Finally, it creates structured tables from all your data:

system_prompt = """You are a data shaping assistant.

You are given a set of JSON documents with the same schema (same keys & depth).

Your job is to analyze the documents and create 1..N CSV tables that include ALL the data from the files, but omit any 'reason' or 'confidence' values.

IMPORTANT: You must analyze the actual structure of the documents provided and create tables based on what you find, not on assumptions.

CRITICAL EXTRACTION RULES:
- ALWAYS check for "normalized_value" first, then "value" if normalized_value is null/empty
- If a field has both "normalized_value" and "value", prefer "normalized_value"
- If "normalized_value" is null/empty, use "value"
- Double-check that the extracted data is correct
- Extract the actual string/number values, not the field objects

Rules:
- Analyze the actual JSON structure provided in the documents
- Create as many tables as needed to organize the data clearly
- Include ALL data fields from the documents (except reason/confidence)
- Skip 'reason' and 'confidence' fields completely
- Prefer 'normalized_value' over 'value' when both exist
- Make table and column names clear and descriptive
- Use lower_snake_case for naming
- If you see arrays, consider if they should be separate tables
- If you see nested objects, consider if they should be flattened or separate tables
- Be smart about data organization - group related fields together...
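
The repo defines its own models and helpers for this step. As a rough sketch of the idea, the table plan returned by the LLM could be parsed into a small Pydantic model and written out with Python's standard csv module (the CSVTable model and write_tables helper below are illustrative, not the repo's code):

import csv
from typing import List

from pydantic import BaseModel, Field


class CSVTable(BaseModel):
    """One table planned by the LLM: a name, column headers, and rows."""
    name: str = Field(description="lower_snake_case table name")
    columns: List[str] = Field(description="Column headers")
    rows: List[List[str]] = Field(description="Row values aligned with the columns")


def write_tables(tables: List[CSVTable], output_dir: str = ".") -> List[str]:
    """Write each planned table to <output_dir>/<name>.csv and return the file paths."""
    paths = []
    for table in tables:
        path = f"{output_dir}/{table.name}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(table.columns)
            writer.writerows(table.rows)
        paths.append(path)
    return paths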

Want to deep dive into the tools, prompts, and nodes? Check the repo here: https://github.com/Handit-AI/handit-examples/tree/main/examples/unstructured-to-structured

6. Conclusions

Thanks for reading!

I hope this deep dive into building a self-improving AI agent has been useful for your own projects.

The project is fully open source - feel free to:
🔧 Modify it for your specific needs
🏭 Adapt it to any industry (healthcare, finance, retail, etc.)
🚀 Use it as a foundation for your own AI agents
🤝 Contribute improvements back to the community

Full code open source at: https://github.com/Handit-AI/handit-examples/tree/main/examples/unstructured-to-structured

This project comes with Handit.ai configured. If you want to configure Handit.ai for your own projects, I suggest following the documentation: https://docs.handit.ai/quickstart
