Iteration Layer

Posted on May 30 • Edited on Jun 8 • Originally published at iterationlayer.com

How Our Document Ingestion Pipeline Turns Files into LLM-Ready Markdown


The Hard Part Is Not Calling the Model

Most document automation projects fail before the first extraction prompt runs.

The invoice is a scanned PDF. The contract is a DOCX with images pasted into the appendix. The email has a PDF attachment and an inline HTML body. The spreadsheet has four tabs, merged headers, and values formatted as currency. The website looks fine in a browser but ships half of its content through JavaScript.

If your pipeline assumes "file in, text out," all of that becomes glue code. You add a PDF parser. Then OCR. Then an HTML cleaner. Then a spreadsheet reader. Then a special case for images. Then a special case for emails. Then a retry path because one vendor returns empty text for scanned documents and another returns malformed table output.

The LLM is usually not the bottleneck. The bottleneck is getting the source material into a representation the LLM can read reliably, which is why document parsing quality matters before extraction prompts, RAG chunking, or agent workflows begin.

That is why Iteration Layer treats ingestion as a first-class part of the product. Document to Markdown, Document Extraction, and Website Extraction all share the same ingestion layer. The APIs look different from the outside, but the inner path is intentionally boring: resolve the input, parse the file, count the billable pages, convert the content into markdown, enrich visual content where needed, then pass that normalized representation to the next step.

This post explains how that pipeline works and why we use markdown as the boundary between messy file formats and LLM-friendly workflows.

Why Markdown Is the Boundary

Every ingestion pipeline needs an intermediate representation. You can use plain text, HTML, JSON, layout coordinates, screenshots, or a custom tree format. Each choice optimizes for something.

Plain text is simple but throws away structure. Tables collapse. Lists lose hierarchy. Headings become indistinguishable from body paragraphs. A RAG pipeline built on plain text often chunks the right words in the wrong context.

HTML preserves structure but carries too much noise. Real HTML is full of navigation, scripts, styling hooks, cookie banners, tracking fragments, and layout wrappers. It is a web rendering format, not a clean document format.

Layout coordinates preserve the page, but they are painful downstream. Every consumer now has to understand bounding boxes, reading order, columns, rotations, and table geometry. That can be useful for audit views. It is not what you want as the primary input to an LLM.

Markdown sits in the middle:

It keeps hierarchy. Headings, lists, block quotes, and tables survive as text.
It is natural for LLMs. Models already understand markdown conventions without a schema explanation.
It is easy to inspect. Developers can diff it, store it, log it, chunk it, and send it to another API.
It composes across formats. A PDF page, a spreadsheet sheet, an email attachment, and an HTML document can all become the same kind of object.

That last point matters most. The ingestion layer is not only for the Document to Markdown API. It is the common substrate for extraction. Once every file is markdown, Document Extraction can apply one schema over many source types without needing a separate extraction strategy for every format, and RAG pipelines can chunk content by structure instead of raw character windows.
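To make "chunk by structure" concrete, here is a minimal sketch of heading-aware chunking over markdown. It is an illustration, not Iteration Layer's chunker: the function name `chunk_by_heading` and the chunk shape are assumptions, and it ignores edge cases such as `#` lines inside fenced code.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Dict]:
    """Split markdown into chunks, each tagged with its heading trail."""
    chunks = []
    path = []  # current heading trail as (level, title) pairs
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Leave any section at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries the headings above it, so a table under "Line items" keeps that context even when it is embedded or retrieved on its own.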

The Pipeline at a Glance

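The simplified flow looks like this:

Input (base64 or URL)
  -> resolve input
  -> parse and identify file
  -> count billable pages
  -> format-specific ingestion
  -> markdown
  -> downstream API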

There are two design choices hidden in that diagram.

First, page counting happens before ingestion. That gives the customer a predictable cost model and lets us reserve credits before expensive work starts. If a request fails, credits are refunded. If it succeeds, the recorded usage matches the page count known at the beginning.

Second, ingestion returns markdown whether the source was visual, binary, tabular, or textual. The downstream APIs should not need to know whether the input started as a scanned PDF or a DOCX file. They need clean content with enough structure preserved for the next step.

Step 1: Resolve the Input

The API accepts files in two shapes: base64 data or public URLs.

Base64 input is straightforward. The request includes the file name and encoded bytes. The file name gives the parser an extension hint, and the buffer is available immediately.

URL input has more branches. A URL with an explicit file name can be treated like a remote file. A URL without a file name may be a website page. For website inputs, the fetch layer retrieves the public page and turns the response into an HTML file for ingestion.

This is where Website Extraction differs from generic document ingestion. Website Extraction accepts one public website URL, fetches it, optionally renders JavaScript through Chromium, then passes the resulting page content into the extraction path. It is single-page and respects standard access boundaries: no crawling, no authenticated content, and no anti-bot circumvention. The unit of work is one public page, which is the same boundary we recommend when turning public documentation websites into RAG inputs.

That boundary is intentional. Website extraction that silently turns one URL into a crawler creates unpredictable cost, unpredictable runtime, and unclear compliance boundaries. One URL should mean one fetched page.
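The branching described in this step can be sketched roughly as follows. The payload keys (`base64`, `url`, `name`), the `ResolvedInput` type, and the rule "a URL path ending in a name with an extension is a remote file" are all assumptions made for illustration, not the actual request format.

```python
import base64
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse


@dataclass
class ResolvedInput:
    kind: str              # "inline_file", "remote_file", or "website"
    name: Optional[str]    # file name, when one is known
    data: Optional[bytes]  # None until a remote input has been fetched


def resolve_input(payload: dict) -> ResolvedInput:
    if "base64" in payload:
        # Inline input: the bytes are available immediately.
        data = base64.b64decode(payload["base64"], validate=True)
        return ResolvedInput("inline_file", payload["name"], data)

    url = payload["url"]
    # Prefer an explicit name, otherwise look at the last URL path segment.
    name = payload.get("name") or PurePosixPath(urlparse(url).path).name
    if "." in name:
        return ResolvedInput("remote_file", name, None)
    # No file name: treat the URL as a single website page to fetch.
    return ResolvedInput("website", None, None)
```

The fetch itself is left out on purpose. The point is that the decision between "remote file" and "one website page" is made once, up front, and everything downstream receives a labeled input.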

Step 2: Parse and Identify the File

After resolution, the pipeline needs a parsed file object: name, extension, MIME type, byte size, and buffer.

This sounds trivial until you handle real inputs. URLs can lie about content type. File names can be missing. A PDF can arrive with application/octet-stream. A website can respond with HTML while the URL has no extension. An image can use a format that the browser displays but an OCR library does not support.

The parse step normalizes those cases before the format-specific ingestion code runs. The rest of the pipeline should not be guessing whether a file named "invoice" is a PDF, a PNG, or an HTML page. It should receive an explicit parsed file and either know how to ingest it or return a clear unsupported-format error.

This is also where we keep the API surface simpler. The caller does not choose an OCR engine, an Office parser, or an HTML converter. They send the file. The pipeline dispatches to the right ingestor based on the parsed format.
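One common way to handle inputs that lie about their type is to trust the leading bytes over the declared content type. The sketch below assumes that approach. The `ParsedFile` fields mirror the list above, while `parse_file`, `SIGNATURES`, and `UnsupportedFormatError` are hypothetical names, and only a few formats are covered.

```python
from dataclasses import dataclass

# Leading byte signatures for a few common formats.
SIGNATURES = [
    (b"%PDF-", "pdf", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "png", "image/png"),
    (b"\xff\xd8\xff", "jpg", "image/jpeg"),
]


class UnsupportedFormatError(ValueError):
    pass


@dataclass
class ParsedFile:
    name: str
    extension: str
    mime_type: str
    size: int
    buffer: bytes


def parse_file(name: str, declared_mime: str, buffer: bytes) -> ParsedFile:
    # Content wins over the declared type, which is often wrong or generic.
    for magic, ext, mime in SIGNATURES:
        if buffer.startswith(magic):
            return ParsedFile(name, ext, mime, len(buffer), buffer)

    # HTML has no fixed signature, so look at the start of the text instead.
    head = buffer[:256].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return ParsedFile(name, "html", "text/html", len(buffer), buffer)

    raise UnsupportedFormatError(
        f"cannot identify {name!r} (declared type: {declared_mime})"
    )
```

With this shape, a PDF that arrives as application/octet-stream with no extension still reaches the PDF ingestor, and anything unrecognized fails early with a clear error.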

Step 3: Count Pages Before Work Starts

Credits are reserved before processing. For page-based APIs, that means the gateway needs a page count before OCR, Office parsing, or extraction begins.

The current rules are deliberately simple:

Input type	Billable page count
PDF	Actual PDF page count
Image	1 page-equivalent
Website Extraction URL	1 page-equivalent
DOCX, PPTX, XLSX, CSV, HTML, text, markup	1 page-equivalent
EML or MSG email	Email body + attachment pages

The goal is predictability. A developer should be able to estimate the cost before sending the request. A 100-page PDF costs 100 credits. An image costs 1 credit. A website extraction request costs 1 credit.

Nested content has two categories.

Some nested content is part of a single document. Images embedded inside a DOCX file are processed as part of that DOCX today. They are included in the document's page-equivalent rather than billed as separate files.

Other nested content is a separate file inside a container. Email attachments are the clearest example. An EML file with a three-page PDF attachment counts as the email body plus the PDF pages. That matches how users think about the input: one email with an attached document is really two pieces of source material.

This distinction may sound subtle, but it keeps the pricing model understandable. Embedded content inside one document is included in that document's page-equivalent. Separately attached or separately submitted files count separately.
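The counting rules in the table can be expressed as a small recursive function. This is a sketch under one explicit assumption: the email body is counted as 1 page-equivalent, which the table does not state in numbers. `CountableFile` and `billable_pages` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CountableFile:
    kind: str           # "pdf", "image", "email", "website", "docx", ...
    pdf_pages: int = 0  # only meaningful when kind == "pdf"
    attachments: List["CountableFile"] = field(default_factory=list)


def billable_pages(file: CountableFile) -> int:
    if file.kind == "pdf":
        return file.pdf_pages
    if file.kind == "email":
        # Assumption: the body counts as 1, attachments are counted as files.
        return 1 + sum(billable_pages(a) for a in file.attachments)
    # Images, website URLs, Office files, text, and markup.
    return 1
```

Under that assumption, an email with a three-page PDF attachment comes out at 4, and a DOCX with any number of embedded images stays at 1.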

Step 4: Dispatch to Format-Specific Ingestors

Once the file is parsed and counted, ingestion dispatches by format.

The important thing is that each ingestor owns the weirdness of its format and returns the same kind of output: a file name, a MIME type, and the markdown content. Some ingestors also return a description or nested files.

That keeps complexity local. PDF logic does not leak into email parsing. Spreadsheet logic does not leak into document extraction. Website extraction does not need a separate schema extractor.
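A dispatch layer like this is often built as a registry keyed by format. The sketch below shows that pattern with a single text ingestor. `IngestResult`, `INGESTORS`, `ingestor`, and `ingest` are hypothetical names, and the result fields are modeled on the outputs named above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class IngestResult:
    name: str
    mime_type: str
    markdown: str
    description: Optional[str] = None  # e.g. vision output for images
    nested_files: List["IngestResult"] = field(default_factory=list)


# Hypothetical registry: file extension -> ingestor function.
INGESTORS: Dict[str, Callable[[str, bytes], IngestResult]] = {}


def ingestor(*extensions: str):
    """Register a function as the ingestor for the given extensions."""
    def register(fn):
        for ext in extensions:
            INGESTORS[ext] = fn
        return fn
    return register


@ingestor("txt", "md")
def ingest_text(name: str, buffer: bytes) -> IngestResult:
    return IngestResult(name, "text/plain", buffer.decode("utf-8"))


def ingest(name: str, extension: str, buffer: bytes) -> IngestResult:
    try:
        handler = INGESTORS[extension.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {extension!r}") from None
    return handler(name, buffer)
```

Adding a format means registering one more function. Nothing else in the dispatch path changes, which is the "complexity stays local" property in code form.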

PDFs: Render First, Then OCR

PDFs are not documents in the way developers want them to be documents. They are closer to rendering instructions, which is why PDF processing has hidden failure modes. Text may exist as positioned glyphs. Scanned pages may contain no text layer at all. Tables may be visual alignment rather than semantic rows and columns.

Trying to extract text directly from the PDF object model works for some files and fails silently for others. The worst failure mode is returning partial text that looks correct until a user notices the missing page.

Our PDF path treats pages visually. It flattens annotations, renders pages to images, and runs OCR over those rendered pages. That gives scanned and digital PDFs the same processing path. It is slower than text extraction when a perfect text layer exists, but it avoids the split-brain behavior where some pages come from text objects and others come from OCR.

The output is markdown. Tables, headings, and visible text are represented in the format the downstream APIs expect.
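The "one path for every page" idea can be sketched without committing to a particular renderer or OCR engine. In this illustration the rendered page images and the `ocr` callable are passed in. The function name and the empty-page marker are assumptions, not part of the product.

```python
from typing import Callable, Iterable


def ocr_rendered_pages(page_images: Iterable[bytes], ocr: Callable[[bytes], str]) -> str:
    """Run the same OCR step over every rendered page and join the results."""
    sections = []
    for number, image in enumerate(page_images, start=1):
        text = ocr(image).strip()
        if not text:
            # Keep a visible marker so an empty page is not silently dropped.
            text = f"<!-- page {number}: no text recognized -->"
        sections.append(text)
    return "\n\n".join(sections)
```

Because every page goes through the same call, there is no branch where some pages come from a text layer and others from OCR, and a page that yields nothing still leaves a trace in the output.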

Images: OCR Plus Description

Images are not just OCR problems.

An image can contain text, but it can also contain visual information that matters: a chart, a product photo, a diagram, a handwritten note, a screenshot, a stamp, a signature block. OCR only captures visible text. It does not explain what the image depicts.

That is why image ingestion runs two tasks:

OCR extracts text visible in the image.
Vision description produces a plain-language description of the visual content.

For Document to Markdown, image responses can include both markdown and description. For Document Extraction, the image is formatted as markdown that contains the OCR output and the description, so the extraction model can reason over both.

This matters for agent and RAG workflows. A screenshot with a pricing table should contribute the visible text. A product photo should contribute a description. A chart should at least be represented as visual context instead of disappearing because it had no OCR text.
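For the extraction case, combining both outputs into one markdown block might look like the sketch below. The layout (a heading, a description line, then the recognized text) and the name `image_to_markdown` are assumptions for illustration.

```python
def image_to_markdown(name: str, ocr_text: str, description: str) -> str:
    """Merge OCR text and a vision description into one markdown block."""
    parts = [f"## Image: {name}"]
    if description.strip():
        parts.append(f"**Description:** {description.strip()}")
    if ocr_text.strip():
        parts.append("**Text in image:**\n\n" + ocr_text.strip())
    return "\n\n".join(parts)
```

A chart with no recognizable text still produces a block with its description, so it does not vanish from the context the extraction model sees.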

DOCX and Office Files: Preserve Structure, Then Enrich Embedded Images

Office files are containers. A DOCX file is a ZIP of XML documents, relationships, styles, media files, and metadata. The text is not the whole story.

The DOCX path parses the document structure into markdown: headings, paragraphs, lists, tables, footnotes, and formatting where it affects meaning. When the parser finds embedded images, it extracts those image files and runs them through the same image OCR and vision path used for standalone images.

That means a DOCX with a pasted screenshot does not lose the screenshot. The markdown can include a text representation of the embedded image alongside the surrounding document content.

PPTX follows the same principle at the slide level: extract the meaningful slide content and normalize it to markdown. XLSX and XLS files become markdown tables grouped by sheet. The goal is not to recreate the original file. The goal is to preserve the information that downstream LLM workflows need.

Spreadsheets: Tables Are the Document

Spreadsheets are easy to underestimate because they look structured already. The problem is that spreadsheet structure is not the same as text structure.

Rows and columns need to remain aligned. Sheet names matter. Headers matter. Empty cells can mean "same as above," "not applicable," or "missing data" depending on context. Formatting can imply currency, percentages, or dates.

The spreadsheet ingestor reads workbook sheets and emits markdown tables. Each sheet becomes a section with a heading, followed by a table. That gives RAG and extraction workflows a representation where row-column relationships survive the conversion.

For extraction, this also means the schema extractor can see the spreadsheet in the same source list as PDFs, emails, images, or website pages. A workflow can extract a vendor name from an email body and line items from an attached spreadsheet without switching APIs.
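The sheet-to-section mapping can be sketched in a few lines. This version assumes the workbook has already been read into a dict of sheet name to rows, and that the first row of each sheet is the header. Reading the XLSX file and handling merged headers are out of scope, and the function names are hypothetical.

```python
from typing import Dict, List


def _cell(value) -> str:
    # Escape pipes and flatten newlines so one cell stays on one table row.
    return "" if value is None else str(value).replace("|", "\\|").replace("\n", " ")


def sheet_to_markdown(title: str, rows: List[list]) -> str:
    width = max(len(row) for row in rows)

    def line(row: list) -> str:
        padded = list(row) + [None] * (width - len(row))
        return "| " + " | ".join(_cell(v) for v in padded) + " |"

    header, *body = rows
    out = [f"## {title}", "", line(header), "| " + " | ".join(["---"] * width) + " |"]
    out += [line(row) for row in body]
    return "\n".join(out)


def workbook_to_markdown(sheets: Dict[str, List[list]]) -> str:
    # Empty sheets are skipped; every other sheet becomes its own section.
    return "\n\n".join(sheet_to_markdown(t, rows) for t, rows in sheets.items() if rows)
```

Empty cells are kept as empty columns rather than dropped, so values stay under the header they belong to.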

Emails: Containers with Their Own Content

Emails are the clearest example of why ingestion cannot stop at "extract text from one file."

An EML or MSG file has headers, body content, possibly HTML, possibly plain text, and possibly attachments. The body might say "See attached invoice." The attached invoice contains the data. If your pipeline only reads the email body, it misses the document. If it only reads the attachment, it loses the sender, subject, and date.

The email ingestors split the work:

Parse headers into structured markdown.
Extract the body, converting HTML to markdown when needed.
Extract attachments as raw files.
Parse and ingest each attachment through the same pipeline.
Return the email markdown and the nested files.

For Document to Markdown, the response can include nested_files, so callers can see the email content and each ingested attachment separately.

For Document Extraction, nested attachment markdown is appended to the email markdown. The extraction model sees one combined context: headers, body, attachment list, and attachment contents. That is the behavior you want for questions like "What is the total amount from the invoice this customer sent?" The answer might be in the PDF, but the customer identity might be in the email header.
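The split described above maps closely onto Python's standard `email` package. The sketch below uses it to produce header markdown, a body, and raw attachments. It is an illustration only: `ingest_email` and the markdown layout are assumptions, HTML bodies are returned as-is rather than converted, and MSG files would need a different parser.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def ingest_email(raw: bytes) -> Tuple[str, List[Tuple[str, bytes]]]:
    msg = message_from_bytes(raw, policy=policy.default)

    # Headers become a short structured list.
    lines = ["# Email", ""]
    for header in ("From", "To", "Subject", "Date"):
        if msg[header]:
            lines.append(f"- **{header}:** {msg[header]}")

    # Prefer the plain-text body; an HTML body would still need conversion.
    body = msg.get_body(preferencelist=("plain", "html"))
    lines += ["", "## Body", "", body.get_content().strip() if body else ""]

    # Attachments are returned as raw (name, bytes) pairs for further ingestion.
    attachments = [
        (part.get_filename() or "unnamed", part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    if attachments:
        lines += ["", "## Attachments", ""]
        lines += [f"- {name}" for name, _ in attachments]
    return "\n".join(lines), attachments
```

Each returned attachment can then go back through the same parse and ingest steps as any other file, which is what makes the nesting recursive rather than special-cased.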

HTML and Websites: Clean the Page, Do Not Crawl the World

HTML ingestion converts HTML content into markdown. Website inputs add a fetch step before that conversion.

This is intentionally narrower than a crawler. A website extraction request targets one public page. Referenced assets are not treated as separate billable files. Images linked from the page are not downloaded and OCRed as standalone documents. If JavaScript rendering is enabled, Chromium may execute the page and load public assets needed to render it, but the output passed into ingestion is still the fetched page content.

That boundary keeps website extraction predictable. It is useful for pricing pages, documentation pages, public listings, and structured content extraction. It is not a web scraping platform with proxy rotation, crawling queues, and asset-level processing.

If a workflow needs to process a specific image or PDF linked from a page, that linked file should be submitted as its own input through Document Extraction or Document to Markdown.
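To show what "clean the page" means in practice, here is a deliberately small HTML-to-markdown sketch built on the standard library's `html.parser`. It handles only headings, paragraphs, and list items, and drops script, style, and navigation content. Production converters cover far more, and the class and function names here are hypothetical.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "noscript"}  # content that is never emitted
HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
BLOCKS = {"p", "li", *HEADINGS}


class HtmlToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []     # finished markdown blocks
        self.text = []       # text collected for the current block
        self.prefix = ""     # "# ", "- ", or "" for the current block
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCKS:
            self._flush()
            if tag in HEADINGS:
                self.prefix = HEADINGS[tag] + " "
            elif tag == "li":
                self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCKS:
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.text.append(data)

    def _flush(self):
        content = " ".join("".join(self.text).split())  # collapse whitespace
        if content:
            self.blocks.append(self.prefix + content)
        self.text, self.prefix = [], ""


def html_to_markdown(html: str) -> str:
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)
```

Note that nothing here follows links or downloads images. The input is one page's HTML and the output is that page's text structure.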

Text and Markup: Do Less

Not every format needs AI.

Markdown, JSON, XML, YAML, TOML, RST, Org, Djot, MDX, BibTeX, Typst, and plain text are already textual. For these formats, the ingestion path should avoid inventing structure that is not there. The job is to normalize the file and return content that can move through the same downstream path as everything else.

This is one of the places where a unified pipeline helps. A caller does not need to branch between "already text" and "needs OCR" before calling the API. They can send supported files and receive the same response shape.
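For formats that are already text, "do less" can be as small as decoding and normalizing line endings. The sketch below assumes UTF-8 input with a Latin-1 fallback, which is a choice made for this example rather than documented behavior.

```python
def normalize_text(buffer: bytes) -> str:
    """Decode a textual file and normalize line endings, nothing more."""
    try:
        text = buffer.decode("utf-8-sig")  # also strips a leading BOM
    except UnicodeDecodeError:
        text = buffer.decode("latin-1")    # assumption: lossless fallback
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.strip("\n") + "\n"
```

The content itself is untouched: indentation, keys, and markup survive exactly as the author wrote them.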

Document to Markdown Stops After Ingestion

Document to Markdown is the API that exposes the ingestion layer directly.

It resolves the file, parses it, ingests it, and returns the markdown. There is no schema. There is no field extraction. There is no attempt to decide what values matter. The API returns the normalized representation so the caller can store it, chunk it, index it, summarize it, or send it to another tool.

That makes it the right API for RAG pipelines and preprocessing jobs. If your next step is embedding, search indexing, summarization, or human review, you usually want markdown rather than typed fields.

Document Extraction Adds Schema Extraction

Document Extraction starts with the same ingestion layer, then applies a schema.

The schema describes the fields you want: text, dates, numbers, arrays, addresses, IBANs, currency amounts, calculated values, and other typed outputs. The extraction step receives the ingested markdown sources, not the raw file buffers. That keeps extraction focused on semantics instead of file handling.

The pipeline looks like this:

Files
  -> ingestion
  -> markdown sources
  -> schema extraction
  -> field validation
  -> calculated fields
  -> consolidated JSON

This separation is why the same extraction API can handle mixed inputs. A request can include a PDF, an image, an email, and a spreadsheet. Each file is ingested through its own path, then the schema extractor sees normalized sources with names and markdown content.
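The stages after ingestion can be sketched as one function. The model call is injected as `extract`, the schema is reduced to "field name to type cast", and calculated fields are plain functions over validated values. All of these shapes are assumptions for illustration and much simpler than a real schema.

```python
from typing import Callable, Dict, List


def run_extraction(
    sources: List[dict],
    schema: Dict[str, Callable],
    calculated: Dict[str, Callable[[dict], object]],
    extract: Callable[[str, List[str]], dict],
) -> dict:
    # 1. Combine the ingested markdown sources into one named context.
    context = "\n\n".join(
        f"# Source: {s['name']}\n\n{s['markdown']}" for s in sources
    )
    # 2. Schema extraction (injected; in practice this would be a model call).
    raw = extract(context, list(schema))
    # 3. Field validation: coerce each value to its declared type.
    fields = {name: cast(raw[name]) for name, cast in schema.items()}
    # 4. Calculated fields are derived from validated values, not extracted.
    for name, formula in calculated.items():
        fields[name] = formula(fields)
    return fields
```

The extraction step never sees a file buffer. It receives named markdown sources, which is why a PDF, an email, and a spreadsheet can sit in the same request.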

Website Extraction Adds Retrieval Before Extraction

Website Extraction is Document Extraction with a website retrieval step and a narrower input model.

It accepts one public URL, fetches the page, converts the page to markdown, then applies the same schema extraction path. The response includes the extracted data and URL metadata.

Use it when the source is one public page and you want structured JSON. Use Document to Markdown when you want the page content as markdown. Use Document Extraction when you need multiple files, uploaded files, or mixed formats.

Why Page Counting Happens Before Ingestion

It would be possible to count every internal operation after the fact: every embedded image, every rendered page, every attachment, every model call. That would make metering mirror compute cost more closely.

It would also make pricing harder to reason about.

We chose page-based pricing because developers need to estimate costs before running a batch. If a 100-page PDF might cost 100 credits, 137 credits, or 412 credits depending on what the parser discovers inside it, the pricing model becomes another integration risk.

So the gateway counts pages before work starts and reserves credits up front. For the current product, embedded content that is processed as part of a single document is included in that document's page-equivalent. Separately submitted files and email attachments count separately.

That tradeoff is not free. A DOCX file with many embedded screenshots costs more to process than a DOCX with only text, but both currently count as one page-equivalent. We accept that because predictable pricing is more important than perfect internal cost attribution at this stage. As we move more processing onto fixed-cost GPU infrastructure, utilization matters more than per-token accounting anyway.

The public contract stays simple: count source pages and page-equivalents, not internal implementation steps.
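The reserve-then-refund behavior can be sketched with a context manager. `CreditLedger`, `reserved_credits`, and `InsufficientCredits` are hypothetical names, and the sketch assumes one credit per billable page, matching the examples earlier in this post.

```python
from contextlib import contextmanager


class InsufficientCredits(Exception):
    pass


class CreditLedger:
    def __init__(self, balance: int):
        self.balance = balance

    def reserve(self, pages: int) -> None:
        if pages > self.balance:
            raise InsufficientCredits(f"need {pages}, have {self.balance}")
        self.balance -= pages

    def refund(self, pages: int) -> None:
        self.balance += pages


@contextmanager
def reserved_credits(ledger: CreditLedger, pages: int):
    """Reserve credits up front; refund them if the work inside fails."""
    ledger.reserve(pages)
    try:
        yield
    except Exception:
        ledger.refund(pages)
        raise
```

Because the page count is known before the block is entered, the amount reserved, the amount refunded on failure, and the amount recorded on success are always the same number.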

Why Ingestion Architecture Affects GDPR Scope

Document ingestion is often the step where sensitive data enters a system: invoices, contracts, HR documents, customer emails, ID scans, bank statements, and internal spreadsheets.

That makes infrastructure placement part of the pipeline design, not an afterthought. If the ingestion layer sends files through a US-hosted OCR vendor, a browser rendering service in another region, and a separate LLM provider, the technical pipeline also becomes a compliance pipeline. Every hop needs a data processing agreement, a retention policy, a transfer mechanism, and a clear answer to where the file went.

Iteration Layer keeps ingestion on EU-hosted infrastructure. Files are processed in memory and discarded after processing. We do not store source files for later training, debugging, or analytics. The same boundary applies whether the request is Document to Markdown, Document Extraction, or Website Extraction.

This matters most for composable workflows. If extraction, image handling, and file generation each use a different vendor, compliance review scales with every step. A shared ingestion layer keeps the data path narrower: one API surface, one processing region, one retention posture, and one DPA.

It does not remove the customer's own GDPR obligations. You still need a lawful basis for processing, appropriate access controls, and a reason to send the document to any processor. But the ingestion layer should make those obligations easier to reason about, not multiply vendors before the document is even normalized.

Why the Pipeline Is the Product

A single file converter is useful. A single OCR endpoint is useful. A single extraction model is useful.

But most real workflows need the chain:

Read the email.
Parse the attachment.
OCR the scanned pages.
Preserve the spreadsheet tables.
Extract structured fields.
Generate a report.
Transform the output image.
Send everything through one billing and auth model.

The hard part is not one operation. The hard part is making the operations compose without every customer rebuilding the same glue code.

That is the reason ingestion sits underneath multiple APIs instead of living as a hidden helper inside one endpoint. Once files become normalized markdown, the rest of the platform can treat PDFs, images, emails, spreadsheets, and websites as workflow inputs rather than separate product silos.

What This Means for Developers

If you are building a document workflow, the practical guidance is simple.

Use Document to Markdown when you need clean content for RAG, search indexing, summarization, or your own downstream processing. It gives you the ingestion output directly.

Use Document Extraction when you know the fields you want and need typed JSON with confidence scores and citations. It runs ingestion first, then applies your schema.

Use Website Extraction when the source is a single public web page and the output you need is typed JSON. It fetches the page, ingests it, and applies the same schema extraction path.

If the workflow spans multiple operations, keep the intermediate output as markdown or structured JSON and chain from there. That is the point of a composable content processing platform: the output of one step should already be shaped for the next step.

The Next Layer

Ingestion will keep getting better: more edge cases handled, better table fidelity, better Office parsing, better visual descriptions, and better performance as more work moves onto fixed-cost infrastructure.

But the core design will stay the same. Resolve inputs. Parse files. Count the unit of work. Convert every format into markdown. Add schema extraction only when the caller asks for structured fields.

That boundary is what makes the APIs composable. The pipeline is the product, not the individual file parser hidden inside it.

If you want the raw ingestion output, start with Document to Markdown. If you want typed fields from the same pipeline, use Document Extraction. If your source is a public page, use Website Extraction.
