Working with real world documents is still a pain. PDFs, invoices, random exports from legacy tools. Half the work is just getting them into a clean, structured format your models can use.
This post is about that first step. The one that usually gets ignored in demos and tutorials. Parsing and structuring the documents.
The tools here handle OCR, layout, tables, forms and file formats so you can focus on the logic around them.
I am walking through a few I actually like using, with short code snippets you can drop straight into your own projects.
So, let's begin.
1. Tensorlake
Document Ingestion API plus a serverless runtime for agentic data workflows
Tensorlake gives you two big things in one place:
- A Document Ingestion API that turns messy files into clean markdown or structured JSON
- A serverless platform to run agentic workflows on top of that data
You can send PDFs, Office files, scans, images or raw text and get back well structured content with preserved layout. Long story short: it is the document ingestion layer, and you build agent style applications on top of it using their serverless runtime.
So, instead of hand rolling OCR, background jobs and retry logic, you get a single platform that parses, chunks and classifies documents, then feeds the results into your agents or tools.
Is it for you?
If you are building invoice extractors, contract analyzers, or any complex data ingestion or agents that need to actually read documents, Tensorlake sits right in the middle of your stack as the ingestion and workflow layer.
Features
- Multi format parsing: Parse PDFs, Office docs, spreadsheets, presentations, images and raw text to markdown or JSON.
- Layout aware output: Preserves tables, sections and reading order so your RAG or search stays aligned with the original document, which many other tools miss.
- Schema based extraction: Use JSON Schema or Pydantic models to pull out only the fields you care about.
- Agentic runtime: Decorate Python functions, run them in sandboxes and let Tensorlake handle scaling, retries and state.
And many more...
Now, let's go through a quick code example of some common use cases.
Code Example: From PDF to markdown
First, install the SDK and use the DocumentAI client to upload a PDF, start a parse job and stream the markdown chunks once parsing is done.
pip install tensorlake
Now, to extract the text from a PDF, you can do something like:
from tensorlake.documentai import DocumentAI, ParseStatus
doc_ai = DocumentAI(api_key="your-api-key")
# Upload and parse document
file_id = doc_ai.upload("/path/to/document.pdf")
# Start parsing
parse_id = doc_ai.parse(file_id)
# Wait until parsing is complete
result = doc_ai.wait_for_completion(parse_id)
if result.status == ParseStatus.SUCCESSFUL:
    # Each chunk is a piece of clean markdown
    for chunk in result.chunks:
        print(chunk.content)
This is the basic flow you would use in a backend job that takes uploaded PDFs and turns them into LLM friendly text for something like RAG or search.
Once you have the chunks, you can push them straight into a vector store or a database.
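As a quick sketch of that downstream step (the embedding model and the plain list "store" here are illustrative, not part of Tensorlake), you could embed each chunk with the OpenAI embeddings API:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

records = []
for chunk in result.chunks:  # `result` from the parsing example above
    emb = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative model choice
        input=chunk.content,
    )
    records.append((chunk.content, emb.data[0].embedding))

# `records` is now a list of (text, vector) pairs ready for any vector store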
If you need more control over parsing, there is also schema based structured extraction, which the Structured Extraction docs cover in depth.
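To give you a feel for the schema side without reproducing their exact API here, this is a tool agnostic sketch: define a Pydantic model and validate whatever structured JSON your parse job returns against it. The Invoice fields and the extracted dict are made up for illustration.

from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    total_amount: float

# Stand-in for the structured JSON a parse job would return
extracted = {
    "invoice_number": "INV-001",
    "vendor_name": "Acme Corp",
    "total_amount": 1999.0,
}

try:
    invoice = Invoice.model_validate(extracted)
    print(invoice.total_amount)
except ValidationError as e:
    print("Extraction did not match the schema:", e)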
Code Example: Tiny agentic app on the Tensorlake runtime
To run a small agentic app on top of Tensorlake, it's as simple as:
import os
from agents import Agent, Runner
from agents.tool import WebSearchTool, function_tool
from tensorlake.applications import application, function, run_local_application, Image
# Container image with the dependencies the function needs
FUNCTION_CONTAINER_IMAGE = Image(
    base_image="python:3.11-slim",
    name="city_guide_image",
).run("pip install openai openai-agents")

@function_tool
@function(
    description="Gets the weather for a city",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def get_weather_tool(city: str) -> str:
    agent = Agent(
        name="Weather Reporter",
        instructions="Use web search to find current weather in the city",
        tools=[WebSearchTool()],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()

@application(tags={"type": "example", "use_case": "city_guide"})
@function(
    description="Creates a simple city guide",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def city_guide_app(city: str) -> str:
    agent = Agent(
        name="Guide Creator",
        instructions="Make a friendly city guide that includes the current temperature",
        tools=[get_weather_tool],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()

if __name__ == "__main__":
    city = "Paris"

    if not os.environ.get("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY is not set")
        raise SystemExit(1)

    request = run_local_application("city_guide_app", city)
    response = request.output()
    print(response)
The code above creates a city guide application using OpenAI Agents with tool calls. I won't explain it line by line here, as that would make this post unnecessarily long.
You can find the explanation for this code in their GitHub README.
Deploying and running on Tensorlake Cloud
To run the application on Tensorlake Cloud, it first needs to be deployed.
- Set TENSORLAKE_API_KEY in your shell session:
export TENSORLAKE_API_KEY="Paste your API key here"
- Set OPENAI_API_KEY in your Tensorlake Secrets so that your application can make calls to OpenAI:
tensorlake secrets set OPENAI_API_KEY "Paste your API key here"
- Deploy the application to Tensorlake Cloud:
tensorlake deploy examples/readme_example/city_guide.py
- Run the remote test script found in examples/readme_example/test_remote_app.py:
from tensorlake.applications import run_remote_application
city = "San Francisco"
# Run the application remotely
request = run_remote_application("city_guide_app", city)
print(f"Request ID: {request.id}")
# Get the output
response = request.output()
print(response)
- The application will execute on Tensorlake Cloud, with each function running in its own isolated sandbox.
In short, Tensorlake takes care of spinning up containers, injecting secrets and keeping functions durable, so tool calls can be retried without you building your own queue system.
Here's a quick Tensorlake document ingestion demo to see it in action on a complex document.
2. Docling
Docling is from the IBM Research Team, licensed under MIT (free and open to commercial use), and turns PDFs, Office docs, images, audio and more into a unified DoclingDocument format. You can then export that into markdown, HTML, DocTags or lossless JSON and plug it straight into RAG, agents or search.
It runs locally and comes with strong layout and table understanding plus OCR and vision models for scanned or complex documents.
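As a taste of that, here is a minimal sketch of forcing OCR on for scanned PDFs; the option names follow Docling's pipeline options as I understand them, so double check them against the current docs.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Turn OCR on explicitly for scanned PDFs
pipeline_options = PdfPipelineOptions(do_ocr=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned.pdf")
print(result.document.export_to_markdown()[:500])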
Features
- Multi format parsing - PDF, DOCX, PPTX, XLSX, HTML, images, audio and more into one structured representation.
- Advanced PDF understanding - Page layout, reading order, tables, code, formulas and images handled out of the box.
- Multiple export targets - Export a single DoclingDocument to markdown, HTML, DocTags or structured JSON.
- Local and privacy friendly - Designed to run completely locally.
- Gen AI integrations - Hooks into LangChain, LlamaIndex, Haystack and others out of the box.
And many more...
Code Example: Convert and print markdown
The basic flow is intentionally simple: create a converter, give it a source and then decide how you want to export the result.
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # can also be a local Path(...)
converter = DocumentConverter()
result = converter.convert(source)
markdown = result.document.export_to_markdown()
print(markdown)
This example shows the "one document in, one markdown document out" path you would usually slot into your indexing step: it gives you a single markdown document you can split into chunks and feed into a vector database.
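If you would rather let Docling do the chunking too, it ships chunkers that work on the DoclingDocument directly. A minimal sketch using its HybridChunker with default settings (the defaults here are an assumption; see the chunking docs for tokenizer and size options):

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869")

chunker = HybridChunker()  # tokenizer and chunk size left at their defaults
for chunk in chunker.chunk(dl_doc=result.document):
    print(chunk.text[:80], "...")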
Code Example: Same idea from the CLI
Docling also comes with a CLI. You can install it with the following command:
pip install docling
Now, you can run it using the following command:
# Convert a PDF at a URL to markdown on stdout
docling https://arxiv.org/pdf/2206.01062
# Use the GraniteDocling vision language model in the pipeline
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062
There are, of course, more advanced use cases with many more flags you can add. For those, visit their documentation.
Here's a quick video by Red Hat to see it in action.
3. Unstructured
Unstructured gives you an open source library plus a managed platform to turn unstructured content into structured data for LLM apps. It partitions PDFs, slides, HTML, Office files and images into a standard set of elements that downstream tools can easily consume.
On top of that, the ingest layer adds connectors, chunking and embeddings so you can build full ETL style pipelines around your document sources.
Features
- One partition API - autodetects file type and routes to the right parser for you.
- LLM friendly outputs - structured elements with text, metadata and coordinates when needed.
- Source and destination connectors - GitHub, S3 and more via the Ingest CLI and Python library.
- Hosted Partition Endpoint - offloads compute to their API when you want better models or scale.
Code Example: Quickstart with partition
This is the core pattern you will see in most examples, and it is enough to plug into a RAG pipeline.
from unstructured.partition.auto import partition
# Read and partition a document
elements = partition("example-docs/layout-parser-paper.pdf")
# Inspect a few elements
for el in elements[:5]:
    print(repr(el.category), "->", str(el)[:80], "...")
You end up with a list of elements that know their category, which makes it easy to filter for titles, paragraphs or tables before you take things further.
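For example, a small filtering pass might look like this (the category names come from Unstructured's standard element types):

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

# Keep only titles and body text, and pull tables out separately
keep = {"Title", "NarrativeText"}
texts = [str(el) for el in elements if el.category in keep]
tables = [el for el in elements if el.category == "Table"]

print(len(texts), "text elements,", len(tables), "tables")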
Code Example: Batch processing with Ingest CLI
For real projects you usually need to process many files at once and save the outputs somewhere. Unstructured ships an Ingest CLI built for exactly that.
# Chunk and partition an entire folder of files
unstructured-ingest \
local \
--input-path $LOCAL_FILE_INPUT_DIR \
--output-dir $LOCAL_FILE_OUTPUT_DIR \
--chunking-strategy by_title \
--chunk-max-characters 1024 \
--partition-by-api \
--api-key $UNSTRUCTURED_API_KEY \
--partition-endpoint $UNSTRUCTURED_API_URL \
--strategy hi_res
This runs a full pipeline that reads documents from LOCAL_FILE_INPUT_DIR, partitions them with the hi_res strategy, chunks them by title and writes the structured outputs into your output directory. From there, you can index or analyze them however you like.
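The same by_title chunking is also available from Python if you would rather skip the CLI; a short sketch on the elements from the earlier example:

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

# Group elements into chunks under their nearest title
chunks = chunk_by_title(elements, max_characters=1024)
for chunk in chunks[:3]:
    print(str(chunk)[:80], "...")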
Here's a quick API quickstart to get an idea.
4. Amazon Textract
Amazon Textract is AWS's managed OCR and document analysis service that pulls text, handwriting, layout and structured data out of scanned documents and PDFs.
It runs inside your AWS account, plugs into services like S3, Lambda, SNS and SQS, and is used at scale by companies like Paytm for document workflows.
Features
- Structured extraction - pulls data from tables, forms and key value pairs, not just plain text.
- Layout and handwriting support - detects paragraphs, titles, layout elements and handwritten text in scans.
- AWS integration - works naturally with S3, Lambda, SNS, SQS and other AWS services.
- Sync and async APIs - low latency calls for single pages plus batch jobs for large multipage docs.
- Security and compliance - encryption, IAM and regional controls for regulated workloads.
Code Example: Detect text from a local file
This is the basic pattern if you just want the text out of a document. You read the file as bytes, call detect_document_text and print the lines Textract finds.
import boto3
textract = boto3.client("textract") # uses your AWS credentials
file_path = "sample-doc.png"  # PNG, JPEG, TIFF or PDF

with open(file_path, "rb") as f:
    image_bytes = f.read()

response = textract.detect_document_text(
    Document={"Bytes": image_bytes}
)

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
What is happening here:
- Textract analyzes the image or PDF and returns a list of Blocks that represent words, lines and other elements.
- You filter for blocks of type LINE and print their Text, which is enough for many basic OCR use cases or as a first step before sending text into an LLM.
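One caveat: the synchronous call above is meant for single images and small documents. For large multipage PDFs you use the async job APIs instead. Here is a minimal sketch with a simplified polling loop (production setups usually listen for an SNS notification rather than polling, and results are paginated via NextToken):

import time
import boto3

textract = boto3.client("textract")

# Start an async text detection job for a multipage PDF in S3
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-doc-bucket", "Name": "big.pdf"}}
)

# Simplified polling loop - use SNS + SQS in production
while True:
    result = textract.get_document_text_detection(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

if result["JobStatus"] == "SUCCEEDED":
    # First page of results only; follow NextToken for the full document
    for block in result["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block["Text"])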
Code Example: Extract tables and forms from S3
To pull structured data from forms and tables, you use analyze_document with the FORMS and TABLES feature types and point Textract at a document in S3.
import boto3
textract = boto3.client("textract")
bucket_name = "my-doc-bucket"
object_key = "invoices/invoice-001.png"
response = textract.analyze_document(
    Document={
        "S3Object": {
            "Bucket": bucket_name,
            "Name": object_key,
        }
    },
    FeatureTypes=["FORMS", "TABLES"],
)

print(f"Found {len(response['Blocks'])} blocks")

# Quick peek at found tables
for block in response["Blocks"]:
    if block["BlockType"] == "TABLE":
        print("Detected a table with Id:", block["Id"])
There is a lot of other complex stuff that you can do with Textract. For more details, check out the Textract documentation.
In production you usually wire this up with S3 triggers and Lambda so new documents are picked up and processed automatically.
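A hypothetical Lambda handler for that setup might look like this (the body is a sketch; bucket wiring, IAM permissions and error handling are up to you):

import boto3

textract = boto3.client("textract")

def handler(event, context):
    # Triggered by S3 ObjectCreated events
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = [
            b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"
        ]
        print(f"{key}: {len(lines)} lines extracted")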
Here's a quick intro to Amazon Textract.
5. Google Cloud Document AI
Document AI is Google Cloud's document stack that gives you ready made processors for invoices, receipts, forms, IDs and general OCR. You pick a processor, send it a file and get back a Document object with text, structure, entities and layout info, not just raw strings.
The nice part is how it fits into the rest of GCP (Google Cloud Platform). You can drop files into Cloud Storage, trigger processing with Pub/Sub, Cloud Functions or Cloud Run, then push clean data into BigQuery or your app.
Use it when you want:
- Prebuilt processors - invoice, receipt, form, ID and general OCR processors that work out of the box.
- Tables and forms - key value pairs and tables straight from scanned PDFs and images.
- Custom models - custom extractors, classifiers and splitters when your docs do not match the prebuilt ones.
- Cloud pipelines - runs close to Cloud Storage, Cloud Run and Vertex AI so it is easy to wire into existing GCP setups.
Code Example: Send a PDF and read the text
This is the usual Python flow. You create a processor in the console, grab its ID, then call it from your code.
from google.cloud import documentai
project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"
file_path = "path/to/document.pdf"
client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(project_id, location, processor_id)
with open(file_path, "rb") as f:
    file_bytes = f.read()

raw_document = documentai.RawDocument(
    content=file_bytes,
    mime_type="application/pdf",
)

request = documentai.ProcessRequest(
    name=name,
    raw_document=raw_document,
)
result = client.process_document(request=request)
doc = result.document
print(doc.text[:1000])
You send raw bytes plus the MIME type, Document AI runs the selected processor and you get back a Document object. For quick use cases, grabbing doc.text is enough.
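If you use a specialized processor (the invoice parser, for example) instead of plain OCR, the same Document object also carries typed entities. A quick peek, assuming an entity emitting processor:

# Only populated by entity-emitting processors such as the invoice parser
for entity in doc.entities:
    print(entity.type_, "->", entity.mention_text, f"({entity.confidence:.2f})")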
Code Example: Turn a parsed form into fields
If you use a form style processor, Document AI already marks fields as key value pairs, which you can loop over and map into your own schema.
def clean(text: str) -> str:
    return text.replace("\n", " ").strip()

form_doc = doc  # from the previous example above

fields = []
for page in form_doc.pages:
    for field in page.form_fields:
        name = clean(field.field_name.text_anchor.content)
        value = clean(field.field_value.text_anchor.content)
        conf = field.field_value.confidence
        fields.append((name, value, conf))

for name, value, conf in fields:
    print(f"{name}: {value} (conf {conf:.2f})")
This is the point where a scanned form basically becomes plain Python data. From here, you can push it into BigQuery, Firestore or any service you use on GCP.
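For instance, a tiny post-processing step might keep only confident fields before writing them anywhere (the 0.8 threshold is an arbitrary illustration):

# Keep only fields the model is reasonably confident about
record = {
    name: value
    for name, value, conf in fields
    if conf >= 0.8
}
print(record)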
This is just a start, and there's a lot more to it. Visit the documentation to learn more.
Here's a quick introduction to Google Cloud Document AI.
Conclusion
If you think of any other handy AI tools that I haven't covered in this article, do share them in the comments section below.
So, that is it for this article. Thank you so much for reading!