<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad umair akram</title>
    <description>The latest articles on DEV Community by Muhammad umair akram (@anticrusader).</description>
    <link>https://dev.to/anticrusader</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906268%2F1c87f62c-11aa-4288-8e42-b78cc4018763.png</url>
      <title>DEV Community: Muhammad umair akram</title>
      <link>https://dev.to/anticrusader</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anticrusader"/>
    <language>en</language>
    <item>
      <title>Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough</title>
      <dc:creator>Muhammad umair akram</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:16:55 +0000</pubDate>
      <link>https://dev.to/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-5753</link>
      <guid>https://dev.to/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-5753</guid>
      <description>&lt;p&gt;Every day, banking ops teams manually review thousands of documents - &lt;br&gt;
 loan applications, KYC forms, contracts - looking for the right stamps,&lt;br&gt;
 the right signatures, in the right places. It's slow, expensive, and&lt;br&gt;
 exactly the kind of work computer vision was made to automate.&lt;br&gt;
The catch is that most YOLO tutorials online teach you to detect cars,&lt;br&gt;
 dogs, or people in natural photos. None of that translates cleanly to&lt;br&gt;
 documents. Documents are structured, scanned at varying quality, often&lt;br&gt;
 photographed on phones at angles, sometimes faxed, frequently watermarked, and almost never lit consistently. The model that detects stamps on a&lt;br&gt;
 clean PDF will collapse on a phone-shot photo of the same form.&lt;/p&gt;

&lt;p&gt;"Over the past few weeks I've been deep in shipping a YOLOv11-based&lt;br&gt;
 detector for stamps and signatures on documents in a regulated banking&lt;br&gt;
 environment."&lt;/p&gt;

&lt;p&gt;The work taught me where the off-the-shelf tutorials end and where the&lt;br&gt;
 real engineering begins. Here's the playbook.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why YOLOv11 over the alternatives
&lt;/h2&gt;

&lt;p&gt;There are a few reasonable starting points for document object detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Layout-aware models like LayoutLMv3 or Donut&lt;/strong&gt; - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects (stamps, signatures, initials).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classical OpenCV approaches&lt;/strong&gt; - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YOLO family (v8, v11)&lt;/strong&gt; - the sweet spot for object detection on documents. Fast, well-documented, easy to fine-tune, and the precision/recall tradeoff is tunable to ops-team requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I went with YOLOv11. The &lt;code&gt;ultralytics&lt;/code&gt; Python package handles most of the busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low scan resolutions - better than older versions.&lt;/p&gt;
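&lt;p&gt;For a sense of the API, this is roughly what a bare prediction call looks like - the pretrained checkpoint knows nothing about stamps yet, and the file name and confidence threshold here are placeholders, not values from this project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ultralytics import YOLO

# Pretrained YOLOv11-medium weights; fine-tuning on documents comes later.
model = YOLO('yolo11m.pt')

# Detect objects on a single scanned page (hypothetical file name).
results = model.predict('page.png', imgsz=1024, conf=0.25)

for box in results[0].boxes:
    print(model.names[int(box.cls)], float(box.conf))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
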
&lt;h2&gt;
  
  
  The 80%: data preparation and annotation
&lt;/h2&gt;

&lt;p&gt;Anyone who's shipped CV in production will tell you the same thing: the model is the easy part. Data is where the time goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Annotation tooling.&lt;/strong&gt; I used Roboflow - clean web UI for bounding-box labeling, automatic train/val/test splits, easy export to YOLO format. CVAT is the open-source alternative if you can't use a SaaS for compliance reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class taxonomy.&lt;/strong&gt; Resist the urge to define ten classes on day one. Start with the smallest set that solves the business problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;signature&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;(Optionally &lt;code&gt;handwritten_initials&lt;/code&gt; if your forms include them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More classes means more labeled examples per class, more failure modes, and a harder model to debug. You can always split a class later. You can rarely merge messy ones cleanly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Train/val/test split discipline.&lt;/strong&gt; Separate documents into the three splits &lt;em&gt;by source&lt;/em&gt;, not just randomly. If the same form template appears in both train and val, your validation metric is lying to you - the model is learning the form layout, not the object. In a regulated environment where wrong predictions cost real money, you cannot afford a lying validation set.&lt;/p&gt;
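&lt;p&gt;Here is a minimal sketch of what splitting by source can look like - it assumes each labeled page's filename starts with a source-document id (a made-up convention like formid_page3.png), so adapt the grouping key and ratios to however your documents are actually organized:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
from collections import defaultdict
from pathlib import Path

def split_by_source(image_dir, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Group labeled pages by source document so the same form template
    never appears in more than one split."""
    groups = defaultdict(list)
    for img in Path(image_dir).glob('*.png'):
        source_id = img.stem.split('_')[0]  # assumed naming: formid_pageN.png
        groups[source_id].append(img)

    sources = sorted(groups)
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * ratios[0])
    n_val = int(len(sources) * ratios[1])
    splits = {
        'train': sources[:n_train],
        'val': sources[n_train:n_train + n_val],
        'test': sources[n_train + n_val:],
    }
    # Every page of a given source document lands in exactly one split.
    return {name: [p for s in srcs for p in groups[s]] for name, srcs in splits.items()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
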
&lt;p&gt;&lt;strong&gt;Augmentation strategy - and why the defaults are wrong for documents.&lt;/strong&gt; The off-the-shelf YOLO augmentation defaults are designed for natural images. They include rotation up to 30°, mosaic, MixUp. For documents, that's actively wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rotation should be tightly limited (±5°).&lt;/strong&gt; Documents are upright. Heavy rotation creates training examples that don't reflect production input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mosaic augmentation should be off.&lt;/strong&gt; Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What helps instead:&lt;/strong&gt; brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).&lt;/li&gt;
&lt;/ul&gt;
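&lt;p&gt;None of those corruptions needs a special library. A rough offline sketch with PIL - the parameter ranges are illustrative, not the exact settings from this project, and since nothing geometric changes, the existing bounding boxes stay valid:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import io
import random
from PIL import Image, ImageEnhance, ImageFilter

def degrade_like_a_phone_scan(img):
    """Apply scan/phone-photo style corruptions: brightness and contrast
    drift, mild blur, and a low-quality JPEG round trip."""
    # Brightness and contrast vary between scanners and phone cameras.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.7, 1.3))

    # Out-of-focus phone shots; a radius near zero leaves the page sharp.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.5)))

    # Low-quality JPEG round trip, like a compressed scan or fax pipeline.
    buf = io.BytesIO()
    img.convert('RGB').save(buf, format='JPEG', quality=random.randint(30, 70))
    return Image.open(io.BytesIO(buf.getvalue()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
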
&lt;p&gt;"The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change."&lt;/p&gt;
&lt;h2&gt;
  
  
  Training configuration that actually matters
&lt;/h2&gt;

&lt;p&gt;Most YOLO hyperparameters are fine at defaults. The ones that move the needle on documents:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolo11m.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;imgsz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Higher imgsz matters for small stamps
&lt;/span&gt;    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lr0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Early stopping if mAP stalls
&lt;/span&gt;    &lt;span class="n"&gt;augment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Off for documents
&lt;/span&gt;    &lt;span class="n"&gt;degrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Limit rotation
&lt;/span&gt;    &lt;span class="n"&gt;fliplr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="c1"&gt;# Don't horizontally flip docs
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two things worth flagging:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;imgsz=1024&lt;/code&gt;, not 640.&lt;/strong&gt; Stamps at low resolution can become a few pixels - too small for the model to detect reliably. Higher input size costs more compute per image, but the precision gain on small objects is substantial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disable horizontal flipping.&lt;/strong&gt; A flipped form is a wrong form. Augmentations that produce never-seen-in-production inputs hurt generalization on the inputs you actually care about.&lt;/p&gt;
&lt;h2&gt;
  
  
  The metric you should actually optimize for
&lt;/h2&gt;

&lt;p&gt;Most tutorials default to &lt;strong&gt;mAP@0.5&lt;/strong&gt;. For document AI in a regulated&lt;br&gt;
 environment, that's the wrong primary metric.&lt;br&gt;
Ops teams care about &lt;strong&gt;precision&lt;/strong&gt;. When the model says "there's a&lt;br&gt;
 signature here," they need it to be right. A false positive sends a&lt;br&gt;
 document downstream that shouldn't be there, costing reviewer time. A&lt;br&gt;
 false negative is recoverable - the document falls back to manual&lt;br&gt;
 review, which is the existing baseline.&lt;br&gt;
Track both, but if you have to optimize one, optimize precision. Your&lt;br&gt;
 ops manager will thank you.&lt;/p&gt;
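&lt;p&gt;One way to check that operating point with the library's own validation API - a sketch, assuming the fine-tuned weights are saved as best.pt; the confidence values are illustrative, and the metrics.box.mp / metrics.box.mr precision/recall accessors may vary between ultralytics versions, so verify against yours:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ultralytics import YOLO

model = YOLO('best.pt')

# Sweep confidence thresholds and read precision/recall at each one,
# instead of judging the model by a single mAP number.
for conf in (0.25, 0.4, 0.5, 0.6, 0.75):
    metrics = model.val(data='dataset.yaml', conf=conf, verbose=False)
    print(f"conf={conf:.2f}  precision={metrics.box.mp:.3f}  recall={metrics.box.mr:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
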
&lt;h2&gt;
  
  
  Inference and deployment
&lt;/h2&gt;

&lt;p&gt;A model that runs on a GPU is fun. A model that runs on a CPU is&lt;br&gt;
 shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions - &lt;br&gt;
 CPU inference with an ONNX-exported model is faster to deploy, cheaper &lt;br&gt;
 to run, and far more compatible with locked-down production environments &lt;br&gt;
 where GPU drivers are a fight you don't want.&lt;br&gt;
The flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train with &lt;code&gt;ultralytics&lt;/code&gt; (PyTorch backend, GPU during training)&lt;/li&gt;
&lt;li&gt;Export the trained weights to ONNX&lt;/li&gt;
&lt;li&gt;Serve via &lt;code&gt;ultralytics&lt;/code&gt;'s ONNX-runtime path on CPU at inference time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 2 is one line:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# writes best.onnx alongside best.pt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Step 3 - the inference service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ONNX runtime, CPU-only&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/detect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;detections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bbox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xyxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detections&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in that snippet is &lt;code&gt;model = YOLO('best.onnx')&lt;/code&gt;&lt;br&gt;
 at module level - load the model &lt;strong&gt;once at startup&lt;/strong&gt;, never per request.&lt;br&gt;
 Reloading the model on every request is the most common production&lt;br&gt;
 mistake I've seen on YOLO endpoints. It's the difference between 50ms&lt;br&gt;
 response time and 5,000ms.&lt;br&gt;
For the container: a slim Python base image (&lt;code&gt;python:3.11-slim&lt;/code&gt;) is&lt;br&gt;
 enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image&lt;br&gt;
 ends up under 500MB, starts in seconds, and runs anywhere - including&lt;br&gt;
 locked-down corporate VMs and on-prem environments where shipping a&lt;br&gt;
 GPU-dependent service is months of approvals you don't have.&lt;br&gt;
That's the real tradeoff: you give up a small amount of per-request&lt;br&gt;
 latency in exchange for a service that deploys today, not next quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the tutorials don't tell you
&lt;/h2&gt;

&lt;p&gt;Three lessons the standard YOLO blog posts skip:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The long tail of weird scans is where production breaks.&lt;/strong&gt; Faxed pages with horizontal banding, partially photocopied documents, phone shots with one corner cut off, watermarks bleeding through from the back side. Your training set won't include enough of these. Get a sample of real production input as fast as possible - even just 50 images - and use them for evaluation, not training. They tell you what the world actually looks like.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Log every prediction with the input image hash.&lt;/strong&gt; When the model fails in production, you want to be able to find the exact input that broke it, retroactively. Hash the input, log the prediction, store both. That's how you build round-2 training data without hunting.&lt;/p&gt;
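&lt;p&gt;A minimal version of that logging - the JSONL file and field names here are placeholders, so adapt the storage to whatever your environment and retention rules require:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import time

def log_prediction(image_bytes, detections, log_path='predictions.jsonl'):
    """Append one prediction record keyed by the SHA-256 of the input image."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    record = {
        'image_sha256': digest,
        'timestamp': time.time(),
        'detections': detections,  # the class/confidence/bbox dicts from the service above
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return digest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
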
&lt;p&gt;&lt;strong&gt;3. Don't chase mAP@0.95.&lt;/strong&gt; Diminishing returns. If your business needs 95% precision at 70% recall, optimize for that operating point - not for a metric that summarizes the whole curve. Talk to your ops team. Get the actual numbers they care about. Train against those.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The model is not the bottleneck for document AI. The bottleneck is&lt;br&gt;
 annotation discipline, augmentation tuned to real production input,&lt;br&gt;
 and deployment that doesn't blow up under load. If you're building&lt;br&gt;
 computer vision for regulated industries - banking, insurance, legal,&lt;br&gt;
 healthcare - the playbook above is what's worked for me. The frameworks&lt;br&gt;
 change. The data discipline doesn't.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
