Every day, banking ops teams manually review thousands of documents -
loan applications, KYC forms, contracts - looking for the right stamps,
the right signatures, in the right places. It's slow, expensive, and
exactly the kind of work computer vision was made to automate.
The catch is that most YOLO tutorials online teach you to detect cars,
dogs, or people in natural photos. None of that translates cleanly to
documents. Documents are structured, scanned at varying quality, often
photographed on phones at angles, sometimes faxed, frequently watermarked, and almost never lit consistently. The model that detects stamps on a
clean PDF will collapse on a phone-shot photo of the same form.
"Over the past few weeks I've been deep in shipping a YOLOv11-based
detector for stamps and signatures on documents in a regulated banking
environment."
The work taught me where the off-the-shelf tutorials end and where the
real engineering begins. Here's the playbook.
## Why YOLOv11 over the alternatives
There are a few reasonable starting points for document object detection:
- Layout-aware models like LayoutLMv3 or Donut - strong for structured
forms, but heavier, harder to fine-tune for a narrow task, and slower at
inference. Overkill if you only need to detect a small set of objects
(stamps, signatures, initials).
- Classical OpenCV approaches - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.
- YOLO family (v8, v11) - the sweet spot for object detection on
documents. Fast, well-documented, easy to fine-tune, and the
precision/recall tradeoff is tunable to ops-team requirements.
I went with YOLOv11. The `ultralytics` Python package handles most of the
busywork, inference runs well under 100ms per page on a modest GPU, and
the architecture handles small objects - which stamps often are at low
scan resolutions - better than older versions.

## The 80%: data preparation and annotation

Anyone who's shipped CV in production will tell you the same thing: the
model is the easy part. Data is where the time goes.

**Annotation tooling.** I used Roboflow - clean web UI for bounding-box
labeling, automatic train/val/test splits, easy export to YOLO format.
CVAT is the open-source alternative if you can't use a SaaS for
compliance reasons.

**Class taxonomy.** Resist the urge to define ten classes on day one.
Start with the smallest set that solves the business problem:

- `signature`
- `stamp`
- (Optionally `handwritten_initials` if your forms include them)

More classes means more labeled examples per class, more failure modes,
and a harder model to debug. You can always split a class later. You can
rarely merge messy ones cleanly.

**Train/val/test split discipline.** Separate documents into the three
splits by source, not just randomly. If the same form template appears in
both train and val, your validation metric is lying to you - the model is
learning the form layout, not the object. In a regulated environment
where wrong predictions cost real money, you cannot afford a lying
validation set.

**Augmentation strategy - and why the defaults are wrong for documents.**
The off-the-shelf YOLO augmentation defaults are designed for natural
images. They include rotation up to 30°, mosaic, MixUp. For documents,
that's actively wrong:

- Rotation should be tightly limited (±5°). Documents are upright. Heavy
rotation creates training examples that don't reflect production input.
- Mosaic augmentation should be off. Pasting four documents into a 2×2
grid produces inputs that don't exist at inference time.
- What helps instead: brightness/contrast variation (different scan
qualities), JPEG compression noise (low-quality scans), partial occlusion
(parts of the document obscured), Gaussian blur (out-of-focus phone
shots).

The single biggest accuracy gain in my project came from augmenting for
phone-photographed scans. Production data was messier than my training
set assumed - closing that gap mattered more than any architecture
change.

## Training configuration that actually matters

Most YOLO hyperparameters are fine at defaults. The ones that move the
needle on documents:
{% raw %}
```python
from ultralytics import YOLO
model = YOLO('yolo11m.pt')
results = model.train(
data='dataset.yaml',
epochs=100,
imgsz=1024, # higher imgsz matters for small stamps
batch=8,
lr0=0.001,
patience=20, # early stopping if mAP stalls
augment=True,
mosaic=0.0, # off for documents
degrees=5, # limit rotation
fliplr=0.0, # don't horizontally flip docs
)
```
{% endraw %}
Two things worth flagging:
**{% raw %}`imgsz=1024`{% endraw %} not 640.** Stamps at low resolution can become a few
pixels - too small for the model to detect reliably. Higher input size
costs more compute per image, but the precision gain on small objects
is substantial.
**Disable horizontal flipping.** A flipped form is a wrong form.
Augmentations that produce never-seen-in-production inputs hurt
generalization on the inputs you actually care about.
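To make the input-size point concrete, here's a back-of-envelope
calculation under assumed scan parameters (A4 page, 150 DPI, a ~20mm
stamp - all illustrative numbers, not from the project):

```python
# How big does a stamp end up after resizing the page to the model input?
# Assumed: A4 page (210mm wide) scanned at 150 DPI, stamp ~20mm across.
page_px = 210 / 25.4 * 150   # ~1240 px page width at native resolution
stamp_px = 20 / 25.4 * 150   # ~118 px stamp at native resolution

for imgsz in (640, 1024):
    scale = imgsz / page_px
    print(imgsz, round(stamp_px * scale))  # stamp width at model input
```

At `imgsz=640` the stamp shrinks to roughly 61 pixels; at `imgsz=1024` it
keeps about 98 - and the gap only widens for smaller objects like
initials or low-DPI faxes.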
## The metric you should actually optimize for
Most tutorials default to {% raw %}`mAP@0.5`{% endraw %}. For document AI in a regulated
environment, that's the wrong primary metric.
Ops teams care about **precision**. When the model says "there's a
signature here," they need it to be right. A false positive sends a
document downstream that shouldn't be there, costing reviewer time. A
false negative is recoverable - the document falls back to manual
review, which is the existing baseline.
Track both, but if you have to optimize one, optimize precision. Your
ops manager will thank you.
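One practical way to act on this: sweep the confidence threshold on your
validation detections and pick the lowest one that hits the precision
target. A minimal pure-Python sketch (the `pick_threshold` helper and the
toy data are mine, not part of `ultralytics`):

```python
def pick_threshold(detections, target_precision=0.95):
    """Lowest confidence threshold whose precision on the validation
    detections meets the target, or None if no threshold does.

    detections: list of (confidence, is_true_positive) pairs, one per
    predicted box, scored against ground truth.
    """
    for t in sorted({conf for conf, _ in detections}):
        kept = [tp for conf, tp in detections if conf >= t]
        if not kept:
            break
        precision = sum(kept) / len(kept)
        if precision >= target_precision:
            return t
    return None

# Toy validation set: (confidence, was the prediction correct?)
val = [(0.30, False), (0.55, True), (0.60, False),
       (0.80, True), (0.90, True), (0.95, True)]

print(pick_threshold(val, target_precision=0.75))  # → 0.55
print(pick_threshold(val, target_precision=1.0))   # → 0.8
```

The chosen threshold then becomes the `conf=` argument you pass at
inference time, so production behavior matches the operating point you
validated.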
## Inference and deployment
A model that runs on a GPU is fun. A model that runs on a CPU is
shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions -
CPU inference with an ONNX-exported model is faster to deploy, cheaper
to run, and far more compatible with locked-down production environments
where GPU drivers are a fight you don't want.
The flow is:
1. Train with {% raw %}`ultralytics`{% endraw %} (PyTorch backend, GPU during training)
2. Export the trained weights to ONNX
3. Serve via `ultralytics`'s ONNX-runtime path on CPU at inference time
Step 2 is one line:
```python
from ultralytics import YOLO
model = YOLO('best.pt')
model.export(format='onnx') # writes best.onnx alongside best.pt
```
Step 3 - the inference service:
```python
from fastapi import FastAPI, UploadFile
from ultralytics import YOLO
from PIL import Image
import io
app = FastAPI()
model = YOLO('best.onnx') # ONNX runtime, CPU-only
@app.post('/detect')
async def detect(file: UploadFile):
image = Image.open(io.BytesIO(await file.read()))
results = model(image)
detections = []
for r in results:
for box in r.boxes:
detections.append({
'class': model.names[int(box.cls)],
'confidence': float(box.conf),
'bbox': box.xyxy.tolist()[0],
})
return {'detections': detections}
```
The most important line in that snippet is `model = YOLO('best.onnx')`
at module level - load the model **once at startup**, never per request.
Reloading the model on every request is the most common production
mistake I've seen on YOLO endpoints. It's the difference between 50ms
response time and 5,000ms.
For the container: a slim Python base image (`python:3.11-slim`) is
enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image
ends up under 500MB, starts in seconds, and runs anywhere - including
locked-down corporate VMs and on-prem environments where shipping a
GPU-dependent service is months of approvals you don't have.
That's the real tradeoff: you give up a small amount of per-request
latency in exchange for a service that deploys today, not next quarter.
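As a sketch, the container definition can stay very small. This is an
illustrative Dockerfile, not the project's actual one - the file names
(`app.py`, `best.onnx`) and unpinned package versions are assumptions:

```dockerfile
# Illustrative: slim CPU-only image, no CUDA, no GPU drivers
FROM python:3.11-slim

WORKDIR /app

# onnxruntime's CPU wheel is enough for ONNX inference;
# pin versions in a real build
RUN pip install --no-cache-dir ultralytics onnxruntime fastapi uvicorn python-multipart

COPY best.onnx app.py ./

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```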
## What the tutorials don't tell you
Three lessons the standard YOLO blog posts skip:
**1. The long tail of weird scans is where production breaks.** Faxed
pages with horizontal banding, partially photocopied documents, phone
shots with one corner cut off, watermarks bleeding through from the
back side. Your training set won't include enough of these. Get a
sample of real production input as fast as possible - even just 50
images - and use them for evaluation, not training. They tell you what
the world actually looks like.
**2. Log every prediction with the input image hash.** When the model
fails in production, you want to be able to find the exact input that
broke it, retroactively. Hash the input, log the prediction, store both.
That's how you build round-2 training data without hunting.
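A minimal sketch of that logging pattern - the `log_prediction` helper
and the JSONL path are hypothetical names, not from the project:

```python
import hashlib
import json
import time

def log_prediction(image_bytes, detections, log_path='predictions.jsonl'):
    """Append one JSON line per request, keyed by the SHA-256 of the
    raw input bytes, so a failing production input can be found later."""
    record = {
        'image_sha256': hashlib.sha256(image_bytes).hexdigest(),
        'timestamp': time.time(),
        'detections': detections,
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record['image_sha256']
```

In the FastAPI endpoint above, you'd call this with the bytes you already
read from `file.read()` before decoding the image, so the hash covers the
exact payload the model saw.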
**3. Don't chase mAP@0.95.** Diminishing returns. If your business
needs 95% precision at 70% recall, optimize for that operating point -
not for a metric that summarizes the whole curve. Talk to your ops
team. Get the actual numbers they care about. Train against those.
## Closing
The model is not the bottleneck for document AI. The bottleneck is
annotation discipline, augmentation tuned to real production input,
and deployment that doesn't blow up under load. If you're building
computer vision for regulated industries - banking, insurance, legal,
healthcare - the playbook above is what's worked for me. The frameworks
change. The data discipline doesn't.