After a year running a document processing pipeline through hundreds of thousands of construction documents (tender packs, permit applications, site surveys, BIM exports, drawing sets at A0 and larger), I can tell you what actually breaks.
It is not the PDFs.
That is the thing most people get wrong about systems like this. Every PDF tutorial focuses on the parser: which library, which model, which extraction service. After a year, the PDFs themselves rank third on the list of things that break. The first is the error taxonomy. The second is coordination between documents. The actual content of the files is, mostly, a tractable engineering problem with off-the-shelf tools.
TL;DR
- One pipeline run per document, not per upload. Per-document isolation pays for itself the first time a corrupt PDF lands in a 2,000-file archive.
- PDF malformation has a bounded long tail. Five categories cover most of what production traffic produces. Handle the categories, not the cases.
- Vision LLMs are not the bottleneck. Coordination is. The completion fence is two SQL queries and it caused more incidents than the model layer ever did.
- Large-format pages require tiling before submission. A0 drawings at 300 DPI exceed every vision API's pixel budget. Detect at the geometry level, subdivide into a grid, and batch-submit tiles through a separate quota pool.
- The most effective intervention against hallucinated numbers is not a better model. It is grounding. Feed the model the deterministically-extracted text alongside the image and it becomes a layout mediator rather than a content generator.
A hundred thousand documents a month is one new document every twenty-six seconds on average, in bursts of several thousand at a time. Most are uneventful. The ones that are not built this pipeline.
The shape of the pipeline
ZIP upload ──► extract recursively ──► fan-out (one run per file)
│
▼
┌─────────────────────────────┐
│ format normalization │
│ (Office → PDF, flatten) │
├─────────────────────────────┤
│ extraction │
│ (parser / OCR / vision) │
├─────────────────────────────┤
│ enrichment │
│ (classify, embed) │
├─────────────────────────────┤
│ persistence │
│ (vector + search + DB) │
└─────────────────────────────┘
│
▼
project-wide completion fence
The architecture is conventional. The interesting parts live in the seams between stages and in the orchestration around them.
One run per document
The first instinct is to batch. Receive a ZIP of two thousand files, process them together, commit them together. Batching minimises overhead. It also couples receipt, processing, and commit into a single failure domain, which is the opposite of what you want when input variance is this high. A single corrupt PDF inside the batch fails the batch. Retries become coarse. Observability collapses into "this upload did not finish." The user sees an error on a project that contains one bad file and 1,999 good ones.
We run one pipeline run per document.
The coordinator receives a ZIP, expands it recursively to a configured nesting limit, uploads each file to object storage, and dispatches one child pipeline run per file. The launch is fire-and-forget through the orchestrator's task API. The per-file descriptor passed between stages is intentionally small: a storage key, a filename, a relative archive path, an extension, a size, a nesting flag. No bytes. Each child worker fetches its own file from storage.
This decouples extraction throughput from processing memory. The coordinator's working set stays flat regardless of whether the archive contains fifty documents or fifty thousand. Child pipeline runs execute on uniformly sized container tasks (4 vCPU, 30 GiB RAM, 100 GiB ephemeral disk), so capacity planning is one number multiplied by concurrency.
Within a child run, the stages are strictly sequential. Threading inside a run would muddle error attribution, complicate the status state machine, and burn file descriptors. Parallelism lives between runs, not inside them.
Every stage is wrapped by the same helper. It opens a trace span, awaits the stage coroutine, and on failure calls one of two status writers before re-raising:
async def _run_step(name: str, coro, doc_id: int):
with trace_span(name):
try:
return await coro
except UnprocessableError as e:
db.mark_unprocessable(doc_id, name, str(e))
raise
except Exception as e:
db.mark_failed(doc_id, name, str(e))
raise
The choice between the two writers is the architectural decision that took the longest to get right. A document the extraction service classifies as permanently unprocessable (DRM-locked, zero-byte, malformed) gets a different label in the database than a document that timed out talking to an API. On-call sees the difference at a glance and does not waste a retry on a file that will never succeed.
Status writers function as a product-level state machine. Failure modes become product states.
Before any work runs, the child inserts a database row with status processing and updates the run's display label to match the row's ID. If the run crashes anywhere downstream, the row stays, with the step name and stack trace attached. Nothing disappears between systems.
PDFs we did not write
The PDF specification is permissive and two decades old. Most generators violate parts of it. A few violate most of it. Each one looks valid until you try to read it.
Construction PDFs are particularly varied. A typical project contains:
- Tender documents (long, text-heavy, sometimes with embedded scans)
- Permit applications (form-heavy, populated through Acrobat or web tools)
- Site surveys (scanned or photographed, sometimes rotated)
- Construction drawings (A1, A0, vector-heavy with thousands of dimension annotations)
- BIM-exported sheet sets (huge, layered, often with missing fonts)
- Material specifications (mixed text and tables)
After a year, the failure modes are stable. Five categories cover most of it: form-heavy, scanned, structurally corrupt, oversized, and mislabeled.
Form-heavy
Open a permit application in a browser and you see numbers. Extract the text programmatically and you see Field1: [empty]. The visible content lives in form widgets, not page content; the two are stored separately inside the file, and a parser that does not know to look for widget annotations returns the empty shell.
The flattener inspects every page for widget and free-text annotations. If any exist, PyMuPDF bakes them into static page content and re-serialises the file with its highest garbage-collection pass: unreferenced objects removed, object numbers compacted, cross-reference table rebuilt.
def needs_flattening(doc: fitz.Document) -> bool:
for page in doc:
if page.widgets() or page.first_annot is not None:
return True
return False
def flatten(doc: fitz.Document) -> bytes:
for page in doc:
page.apply_redactions() # bakes widget values into content
return doc.tobytes(garbage=4, deflate=True, clean=True)
A particular subtlety surfaced in production: non-UTF-8 annotation names cause PyMuPDF to throw SystemError, UnicodeDecodeError, or RuntimeError from inside the widget enumeration call. The detection catches all three and uses them as signal.
def needs_flattening_safe(doc: fitz.Document) -> bool:
try:
return needs_flattening(doc)
except (SystemError, UnicodeDecodeError, RuntimeError):
return True # crash on enumeration = file definitely needs flattening
A parser crash on widget enumeration is, in practice, a near-certain indicator that the file needs flattening. The crash became the heuristic.
Scanned
Site surveys arrive looking like real PDFs. They are not. There is no text layer. A naive parser returns an empty string and reports success. The pipeline detects scans before any extraction attempt with a text-density heuristic.
SCAN_THRESHOLD = 0.5 # at most half the pages have text -> treat as scan
def is_scanned(doc: fitz.Document) -> bool:
if len(doc) == 0:
return False
with_text = sum(1 for page in doc if page.get_text().strip())
return with_text / len(doc) <= SCAN_THRESHOLD
The threshold is set above zero so mixed documents (a scanned cover sheet stapled to typed pages, common in tender packs) route correctly rather than failing the binary check on page one. OCR output is written back to the same storage key so reprocessing skips the OCR step on a second attempt.
Structurally corrupt
Print-to-PDF drivers occasionally emit documents with one or more pages containing empty content streams. The extraction service signals permanent failure and refuses the whole file. The pipeline does not give up. It scans every page for zero-length content streams, removes the offending pages, rebuilds the file, and retries.
def repair_empty_pages(doc: fitz.Document) -> bool:
"""Returns True if the document was modified."""
bad = [i for i, page in enumerate(doc) if not page.read_contents()]
for i in reversed(bad): # delete in reverse so indices stay valid
doc.delete_page(i)
return bool(bad)
A forty-page drawing set with two corrupt pages becomes a thirty-eight-page document that succeeds.
The general principle here is worth naming: do not try to predict which inputs are corrupt. Let the service tell you, then repair surgically. The same principle returns later for rate limits.
Oversized
When the extraction service signals page-count exceeded or payload too large, the pipeline splits the document into hundred-page chunks and submits each independently with up to five attempts per chunk (exponential backoff, 4 to 90 seconds). The full document is always tried first. Chunking is reactive, not proactive. Most files are not large; the ones that are land in their own code path without slowing down the median.
The waterfall
These behaviours feed a priority waterfall over three extraction tiers:
- Primary structured extraction service. Handles the common case. Returns text blocks with bounding boxes, tables, and figure renditions.
- Secondary extraction service. Takes over on permanent failure from tier 1. Its own retry budget.
- Vision-model fallback. Activates if tier 2 also exhausts.
Three tiers down and still failing, the document is marked permanently unprocessable and surfaced to the user. A retry at one tier never consumes the budget of another. This sounds obvious until you find a system where it does.
Office files (.docx, .doc, .xlsx, .xls, .pptx, .ppt) are converted to PDF through a cloud conversion API before they enter the extraction path. Three task-level retries with 10, 30, 60 second delays. Conversion errors are classified into permanent (genuinely corrupt source files) and transient (API issues) using the API's response metadata. Email files are converted in-process using a local parser and a PDF rendering library: header and body onto A4 pages, no external API call. The in-process path keeps email content off third-party infrastructure. Customers do not always realise their PDF pipelines contain their inbox.
Tiling large-format pages
A single A0 sheet rasterised at 300 DPI produces an image roughly 9900 x 14000 pixels: around 140 megapixels. Vision APIs enforce hard limits on both total pixel count and the length of any single side. Sending the raw image does not produce an error in every case. Some endpoints silently crop or downsample before inference, which means the model sees a different image than the one you submitted, with no signal that anything went wrong. The only safe path is to detect large-format pages before rasterisation and subdivide them.
Detection at the geometry level
PDF coordinates are measured in points (1 pt = 1/72 inch). PyMuPDF exposes page dimensions directly:
_LARGE_FORMAT_MIN_POINTS = 1190 # long edge of A3 landscape in PDF points
def is_large_format_page(page) -> bool:
return max(page.rect.width, page.rect.height) >= _LARGE_FORMAT_MIN_POINTS
The threshold is 1190 points because A3 landscape measures 1190 x 842 pt. Any page whose longest side meets or exceeds that value is treated as a large-format page and routed to tiling. PDF metadata is not used for this decision. Pages exported from BIM tools routinely carry an A4 /MediaBox while the actual geometry describes an A1 drawing.
Tile grid construction
The grid divides the page into equal-sized cells, each no larger than the threshold constant:
import math
def compute_tile_grid(width_pt: float, height_pt: float):
cols = max(1, math.ceil(width_pt / _LARGE_FORMAT_MIN_POINTS))
rows = max(1, math.ceil(height_pt / _LARGE_FORMAT_MIN_POINTS))
tile_w = width_pt / cols
tile_h = height_pt / rows
return rows, cols, tile_w, tile_h
An A1 page (roughly 1684 x 2384 pt) produces a 2 x 2 grid (4 tiles). An A0 page (roughly 2384 x 3370 pt) produces a 2 x 3 grid (6 tiles). Tiles are edge-to-edge with zero overlap. This means any entity whose geometry straddles a tile boundary (a dimension string, a room label, a wall line) is split across two tiles. There is no stitching pass after inference. The grounding mechanism described in the next section is what makes split entities recoverable: if the text layer was extracted, the dimension is already present verbatim in both tiles' grounding context regardless of where the tile cut falls.
Rasterisation
Tiles are rendered at a fixed 200 DPI, not the 300 DPI used for normal pages:
import fitz # PyMuPDF
def prepare_tile(page, clip_rect: fitz.Rect) -> bytes:
mat = fitz.Matrix(200 / 72, 200 / 72)
pix = page.get_pixmap(matrix=mat, clip=clip_rect)
raw = pix.tobytes("png")
return _downscale_if_needed(raw) # enforces 3072 px longest-edge cap
200 DPI is chosen because each tile is already a sub-region of the full page. At that resolution, a 1190 pt tile rasterises to roughly 3306 pixels on that edge, which exceeds the 3072 px per-dimension cap. The _downscale_if_needed guard re-encodes the tile to fit before submission. Using 300 DPI on the same tile would produce an even larger intermediate image, burning CPU for no benefit. Normal (non-tiled) pages start at 300 DPI and scale down adaptively to stay within a pixel-area budget just under 180 million total pixels and a longest-edge cap of 3072 pixels.
Tiles are encoded as RGB PNG. PNG is lossless and avoids format-rejection errors from CMYK or palette-mode inputs that some vision APIs refuse with a 400.
Batch submission and polling
Tiled pages are not submitted through the synchronous inference path. They go to a batch inference endpoint backed by a separate quota pool. Batch requests draw on a quota that is independent of the real-time API, so a burst of A0 drawings does not saturate the quota used by the rest of the pipeline.
Each batch job gets a unique scoped prefix in object storage. Tile images are uploaded in sequential insertion order. A manifest is written containing one request object per tile, each referencing its storage reference and inference parameters. The batch platform writes results to the same prefix as one or more result files.
The poller checks job status every 30 seconds. The timeout ceiling is 6 hours. On timeout, the job is cancelled and a TimeoutError is raised. The staging prefix is left intact for inspection.
Reassembly
Output lines from the batch are not guaranteed to arrive in submission order. Each input request embeds an identifier that maps back to its insertion index. The collector builds a reverse-lookup map from identifier to index and places each response into a pre-allocated list at the correct position:
id_to_index = {record.id: idx for idx, record in enumerate(records)}
descriptions = [""] * len(records)
for line in output_lines:
request_id = line["request_id"]
idx = id_to_index[request_id]
descriptions[idx] = line["response"]["text"]
The final list is in original tile order, which maps directly to reading order across the page grid.
Partial success is a hard failure
If any tile slot is still empty after all output files are consumed, the batch runner raises RuntimeError and returns nothing:
produced = sum(1 for d in descriptions if d)
if produced < len(records):
raise RuntimeError(
f"Batch incomplete: {produced}/{len(records)} tiles returned"
)
A document missing one tile out of six would look complete to a downstream consumer. In a construction context that means a floor plan with a silent gap in the extracted data. The pipeline raises rather than persist that state. The task-level retry will requeue the entire document.
Cleanup
The staging prefix is not deleted in the error path. A storage lifecycle policy reclaims any prefix older than a fixed TTL. The TTL covers the longest plausible batch job window plus inspection time. This keeps error-path code simple and ensures orphaned prefixes from crashed workers are cleaned up without any explicit handler.
Blueprint understanding
A single A1 sheet combines dimension chains at 0°, 90°, and 45° rotation; room labels; hatch patterns; section cut symbols; and revision clouds, all at different scales within the same sheet. The text density is high but non-uniform. Symbols carry meaning that differs by convention and trade. A "300" in one region is a room number; in another it is a dimension in millimetres.
Pure vision inference on these inputs produces two failure modes. First, the model misreads values that are visually ambiguous at the rendered resolution (rotated text, condensed fonts, overlapping annotations). Second, the model invents plausible-looking numbers when it is uncertain. In construction, a wrong dimension is a specification error. It enters procurement quantities and fabrication orders. The liability exposure from a single incorrect quantity in a bill of materials is orders of magnitude larger than a missing word in a document summary.
The hybrid path
When the structured extraction service processes a PDF successfully, it returns both the rendered page images and a list of text blocks, each with a bounding box in PDF point coordinates. The pipeline does not discard this structured output when it calls the vision model. It uses it as ground truth.
Spatial join
For each figure (a rendered tile or a figure rendition), the pipeline finds all text blocks whose bounding boxes overlap the figure's region:
def bounds_overlap(a: list[float], b: list[float]) -> bool:
# a and b are [x0, y0, x1, y1] in PDF points
return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])
def collect_grounding(
figure_bounds: list[float],
text_blocks: list[dict],
page: int,
) -> list[str]:
parts = []
for block in text_blocks:
if block["page"] == page and bounds_overlap(figure_bounds, block["bounding_box"]):
parts.append(block["text"])
return parts
No spatial index is needed. With tens to low hundreds of text blocks per page, a linear scan adds under a millisecond per figure. The result is the set of text blocks whose bounding boxes overlap the figure's region.
Grounded prompt assembly
The collected strings are deduplicated before injection. Any word appearing more than 5 times across the concatenated text is collapsed to a single occurrence, and identical lines are removed. This handles the common case of warehouse floor plans and grid-heavy drawings where a single label repeats across every bay.
from collections import Counter
def deduplicate_grounding(parts: list[str]) -> str:
combined = "\n".join(parts)
words = combined.split()
freq = Counter(words)
seen_words: set[str] = set()
filtered_words = []
for w in words:
if freq[w] > 5:
if w in seen_words:
continue
seen_words.add(w)
filtered_words.append(w)
raw = " ".join(filtered_words)
seen_lines: set[str] = set()
lines = []
for line in raw.splitlines():
if line not in seen_lines:
seen_lines.add(line)
lines.append(line)
return "\n".join(lines)
def assemble_grounded_prompt(base_prompt: str, grounding_text: str) -> str:
if not grounding_text.strip():
return base_prompt
return (
f"{base_prompt}\n\n"
f"Extracted text in this region:\n{grounding_text}"
)
With grounding in place, the vision model maps regions and describes visual structure. The numbers come from the text layer, not from model inference. The model cannot invent a number that was not in the grounding text.
The grounding text is also appended to the stored segment output, not just the prompt. This means the downstream vector index writer receives both the vision narrative and the verified dimension values in the same chunk. Retrieval does not depend on the model having faithfully reproduced the number; the number is present in the chunk regardless.
The 429 chain under blueprint-heavy load
Blueprint-heavy workloads arrive in bursts. A tender pack or a permit application can contain dozens of A0 and A1 sheets. When several such archives land at once, the tiling path generates hundreds of synchronous vision calls in a short window.
The retry logic operates at two levels. The inner level handles transient quota exhaustion on a single model:
from tenacity import retry, stop_after_attempt, wait_random_exponential
from tenacity import retry_if_exception
def is_rate_limit(exc: Exception) -> bool:
return getattr(exc, "status_code", None) == 429
@retry(
retry=retry_if_exception(is_rate_limit),
wait=wait_random_exponential(multiplier=1, max=20),
stop=stop_after_attempt(3),
reraise=True,
)
async def call_vision_model(model: str, prompt: str, image_bytes: bytes) -> str:
... # vendor call
Three attempts, randomised exponential backoff, maximum 20 seconds between retries. If all three fail, the exception propagates. Randomisation prevents the thundering herd pattern where multiple workers retry in lockstep.
The outer level rotates through a list of 4 model candidates on both HTTP 429 (after the inner retries are exhausted) and HTTP 503 (immediately, without consuming the retry budget):
async def describe_with_rotation(
model_candidates: tuple[str, ...],
prompt: str,
image_bytes: bytes,
) -> str:
last_exc: Exception | None = None
for model in model_candidates:
try:
return await call_vision_model(model, prompt, image_bytes)
except RateLimitError as exc:
# inner retries exhausted; move to next model
last_exc = exc
continue
except ServiceUnavailableError as exc:
# 503: skip immediately, no retry budget consumed
last_exc = exc
continue
raise VisionServiceError("all model candidates exhausted") from last_exc
Each candidate in the rotation list draws on an independent quota pool. A 429 on one model does not predict a 429 on another. A 503 indicates a model-level availability issue, not a quota issue, so retrying the same model would waste time.
The outer task-level retry (5 attempts at 10, 30, 60, 120, and 300 seconds) handles failures that survive both the inner backoff and the model rotation, such as a batch job timeout or a network partition.
Why wrong numbers are the failure mode that matters
A missing tile means a blank region in the extraction output. That is detectable: the document has fewer segments than expected, downstream validation flags it, the task retries. A hallucinated dimension is undetectable without the source document. It looks like a valid result. It passes schema validation. It enters retrieval. If a quantity surveyor or a structural engineer queries it and acts on the number, the error has left the system.
Construction drawings with one wrong measurement are litigation problems. Grounding makes the pipeline's numeric accuracy independent of model behavior on any given input.
Orchestration under load
At a hundred thousand documents a month, coordination failures between pipeline runs cause more incidents than malformed documents do. The PDFs are the easy part. The coordination between PDFs is the system.
The first sign something was wrong was PoolTimeout errors appearing before a single child pipeline run had started. The first time a 5,000-file archive landed, every launch request went into the same HTTP connection pool and the pool ran out before the first child flow began. The arithmetic was fast and ugly: five thousand task-launch requests, a connection pool sized for ten, an orchestrator API that sustains roughly ten concurrent launches.
The coordinator now caps in-flight launches and rate-limits them with a semaphore and a sleep inside the same block.
launch_semaphore = asyncio.Semaphore(10)
async def launch_child(file_meta):
async with launch_semaphore:
await asyncio.sleep(0.1)
await orchestrator.dispatch(file_meta)
The semaphore controls connection pool depth. The sleep controls burst rate. Concurrency control and rate limiting look like the same problem until they fail.
After fan-out, the coordinator waits for the children. Same problem in reverse: five hundred simultaneous long-poll coroutines would hold five hundred connections open. The coordinator groups child run references into batches of fifty and awaits each batch with asyncio.gather(return_exceptions=True).
BATCH_SIZE = 50
async def wait_for_all(runs):
results = []
for i in range(0, len(runs), BATCH_SIZE):
batch = runs[i:i + BATCH_SIZE]
results.extend(await asyncio.gather(
*(wait_for_run(r) for r in batch),
return_exceptions=True,
))
return results
Exceptions are collected per child rather than propagated, so a single failure does not abort the wait for the rest of the batch.
Backpressure between the coordinator and the cluster is handled by the orchestrator's own concurrency settings, not by the coordinator. Container task capacity forms the lower bound. Adding a coordinator-side cap would create another constant to tune as load changes.
The vector index writer is the one place backpressure is deliberately turned off. During indexing bursts, the writer bypasses the index service's built-in flow control and absorbs transient overload with task-level retries instead. Service-side throttling would smooth the load curve and protect the storage tier from peaks. Higher peak load on the storage tier is an acceptable tradeoff for predictable end-to-end latency. Users observe indexing latency directly.
Completion is harder than fan-out. Multiple ZIP archives for the same project can arrive concurrently, and none of them can signal "project ready" until every archive has finished. The fence is two database queries:
-- 1. mark this upload processed (idempotent on second call)
UPDATE uploads
SET processed_at = NOW()
WHERE project_id = $1
AND upload_id = $2
AND processed_at IS NULL;
-- 2. count remaining unprocessed uploads for this project
SELECT COUNT(*)
FROM uploads
WHERE project_id = $1
AND processed_at IS NULL;
Before the fence closes, the pipeline checks whether this is the last outstanding upload; if so, a cross-document analysis step runs synchronously before the completion timestamp is stamped, so results are visible the moment the project is marked ready. Only when the count reaches zero does a final call fire to mark the project indexed and send a notification. The queries are idempotent. The downstream analysis step is gated by the same fence, so it sees a complete snapshot rather than a moving target.
Boring SQL. Surprisingly load-bearing. The fence is what saves you.
Observability is wired into the database, not into a separate log pipeline. Each document pipeline run writes its run identifier into the database before processing starts and updates its display label to match the database-assigned document ID. The link is bidirectional and queryable without polling. Treating the database as the source of truth and tracing as decoration is contrarian; most teams do the opposite. It is also the right call for any batch workload where the database already tracks state for product reasons. Two sources of truth is one too many.
Distributed tracing spans the full pipeline. Span data is buffered in process and exported at run completion; a crash before export loses that run's spans. Export failures are silently discarded so an observability outage cannot kill a document run.
The completion notification carries total file count, success count, failure count, and up to ten failed document records with step name and full stack trace. Failed records are fetched from the database at notification time rather than accumulated in memory, which keeps the coordinator's working set bounded regardless of failure rate.
What scales
Two things in this system cost the most to get right, and they cost the most for the same reason: the feedback loop runs at the speed of production traffic.
The error taxonomy, because every decision about whether to retry depends on it. Mark a failure permanent and a week later you find out the fix was a one-line threshold change. Mark it transient and you queue retry work that will never succeed. Neither wrong answer announces itself.
The completion fence, because partial signals are invisible. A downstream consumer reading a half-indexed project does not know it is reading a partial result.
Both belong to a category of bug that no amount of load testing surfaces until production traffic starts. At this scale, the cleaner your test set, the more dangerous the gap.
Everything else is the cost of accepting input from the world.
Top comments (0)