<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad umair akram</title>
    <description>The latest articles on DEV Community by Muhammad umair akram (@anticrusader).</description>
    <link>https://dev.to/anticrusader</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906268%2F1c87f62c-11aa-4288-8e42-b78cc4018763.png</url>
      <title>DEV Community: Muhammad umair akram</title>
      <link>https://dev.to/anticrusader</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anticrusader"/>
    <language>en</language>
    <item>
      <title>Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough</title>
      <dc:creator>Muhammad umair akram</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:16:55 +0000</pubDate>
      <link>https://dev.to/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-5753</link>
      <guid>https://dev.to/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-5753</guid>
      <description>&lt;p&gt;Every day, banking ops teams manually review thousands of documents - &lt;br&gt;
 loan applications, KYC forms, contracts - looking for the right stamps,&lt;br&gt;
 the right signatures, in the right places. It's slow, expensive, and&lt;br&gt;
 exactly the kind of work computer vision was made to automate.&lt;br&gt;
The catch is that most YOLO tutorials online teach you to detect cars,&lt;br&gt;
 dogs, or people in natural photos. None of that translates cleanly to&lt;br&gt;
 documents. Documents are structured, scanned at varying quality, often&lt;br&gt;
 photographed on phones at angles, sometimes faxed, frequently watermarked, and almost never lit consistently. The model that detects stamps on a&lt;br&gt;
 clean PDF will collapse on a phone-shot photo of the same form.&lt;/p&gt;

&lt;p&gt;"Over the past few weeks I've been deep in shipping a YOLOv11-based&lt;br&gt;
 detector for stamps and signatures on documents in a regulated banking&lt;br&gt;
 environment."&lt;/p&gt;

&lt;p&gt;The work taught me where the off-the-shelf tutorials end and where the&lt;br&gt;
 real engineering begins. Here's the playbook.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why YOLOv11 over the alternatives
&lt;/h2&gt;

&lt;p&gt;There are a few reasonable starting points for document object detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Layout-aware models like LayoutLMv3 or Donut&lt;/strong&gt; - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects (stamps, signatures, initials).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classical OpenCV approaches&lt;/strong&gt; - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YOLO family (v8, v11)&lt;/strong&gt; - the sweet spot for object detection on documents. Fast, well-documented, easy to fine-tune, and the precision/recall tradeoff is tunable to ops-team requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I went with YOLOv11. The &lt;code&gt;ultralytics&lt;/code&gt; Python package handles most of the busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low scan resolutions - better than older versions.&lt;/p&gt;
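&lt;p&gt;For a sense of the API, this is roughly what a bare prediction call looks like - the pretrained checkpoint knows nothing about stamps yet, and the file name and confidence threshold here are placeholders, not values from this project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ultralytics import YOLO

# Pretrained YOLOv11-medium weights; fine-tuning on documents comes later.
model = YOLO('yolo11m.pt')

# Detect objects on a single scanned page (hypothetical file name).
results = model.predict('page.png', imgsz=1024, conf=0.25)

for box in results[0].boxes:
    print(model.names[int(box.cls)], float(box.conf))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
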
&lt;h2&gt;
  
  
  The 80%: data preparation and annotation
&lt;/h2&gt;

&lt;p&gt;Anyone who's shipped CV in production will tell you the same thing: the model is the easy part. Data is where the time goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Annotation tooling.&lt;/strong&gt; I used Roboflow - clean web UI for bounding-box labeling, automatic train/val/test splits, easy export to YOLO format. CVAT is the open-source alternative if you can't use a SaaS for compliance reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class taxonomy.&lt;/strong&gt; Resist the urge to define ten classes on day one. Start with the smallest set that solves the business problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;signature&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;(Optionally &lt;code&gt;handwritten_initials&lt;/code&gt; if your forms include them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More classes means more labeled examples per class, more failure modes, and a harder model to debug. You can always split a class later. You can rarely merge messy ones cleanly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Train/val/test split discipline.&lt;/strong&gt; Separate documents into the three splits &lt;em&gt;by source&lt;/em&gt;, not just randomly. If the same form template appears in both train and val, your validation metric is lying to you - the model is learning the form layout, not the object. In a regulated environment where wrong predictions cost real money, you cannot afford a lying validation set.&lt;/p&gt;
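&lt;p&gt;Here is a minimal sketch of what splitting by source can look like - it assumes each labeled page's filename starts with a source-document id (a made-up convention like formid_page3.png), so adapt the grouping key and ratios to however your documents are actually organized:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
from collections import defaultdict
from pathlib import Path

def split_by_source(image_dir, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Group labeled pages by source document so the same form template
    never appears in more than one split."""
    groups = defaultdict(list)
    for img in Path(image_dir).glob('*.png'):
        source_id = img.stem.split('_')[0]  # assumed naming: formid_pageN.png
        groups[source_id].append(img)

    sources = sorted(groups)
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * ratios[0])
    n_val = int(len(sources) * ratios[1])
    splits = {
        'train': sources[:n_train],
        'val': sources[n_train:n_train + n_val],
        'test': sources[n_train + n_val:],
    }
    # Every page of a given source document lands in exactly one split.
    return {name: [p for s in srcs for p in groups[s]] for name, srcs in splits.items()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
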
&lt;p&gt;&lt;strong&gt;Augmentation strategy - and why the defaults are wrong for documents.&lt;/strong&gt; The off-the-shelf YOLO augmentation defaults are designed for natural images. They include rotation up to 30°, mosaic, MixUp. For documents, that's actively wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rotation should be tightly limited (±5°).&lt;/strong&gt; Documents are upright. Heavy rotation creates training examples that don't reflect production input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mosaic augmentation should be off.&lt;/strong&gt; Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What helps instead:&lt;/strong&gt; brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).&lt;/li&gt;
&lt;/ul&gt;
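&lt;p&gt;None of those corruptions needs a special library. A rough offline sketch with PIL - the parameter ranges are illustrative, not the exact settings from this project, and since nothing geometric changes, the existing bounding boxes stay valid:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import io
import random
from PIL import Image, ImageEnhance, ImageFilter

def degrade_like_a_phone_scan(img):
    """Apply scan/phone-photo style corruptions: brightness and contrast
    drift, mild blur, and a low-quality JPEG round trip."""
    # Brightness and contrast vary between scanners and phone cameras.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.7, 1.3))

    # Out-of-focus phone shots; a radius near zero leaves the page sharp.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.5)))

    # Low-quality JPEG round trip, like a compressed scan or fax pipeline.
    buf = io.BytesIO()
    img.convert('RGB').save(buf, format='JPEG', quality=random.randint(30, 70))
    return Image.open(io.BytesIO(buf.getvalue()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
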
&lt;p&gt;"The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change."&lt;/p&gt;
&lt;h2&gt;
  
  
  Training configuration that actually matters
&lt;/h2&gt;

&lt;p&gt;Most YOLO hyperparameters are fine at defaults. The ones that move the needle on documents:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolo11m.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;imgsz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Higher imgsz matters for small stamps
&lt;/span&gt;    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lr0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Early stopping if mAP stalls
&lt;/span&gt;    &lt;span class="n"&gt;augment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Off for documents
&lt;/span&gt;    &lt;span class="n"&gt;degrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Limit rotation
&lt;/span&gt;    &lt;span class="n"&gt;fliplr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="c1"&gt;# Don't horizontally flip docs
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two things worth flagging:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;imgsz=1024&lt;/code&gt;, not 640.&lt;/strong&gt; Stamps at low resolution can become a few pixels - too small for the model to detect reliably. Higher input size costs more compute per image, but the precision gain on small objects is substantial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disable horizontal flipping.&lt;/strong&gt; A flipped form is a wrong form. Augmentations that produce never-seen-in-production inputs hurt generalization on the inputs you actually care about.&lt;/p&gt;
&lt;h2&gt;
  
  
  The metric you should actually optimize for
&lt;/h2&gt;

&lt;p&gt;Most tutorials default to &lt;strong&gt;mAP@0.5&lt;/strong&gt;. For document AI in a regulated&lt;br&gt;
 environment, that's the wrong primary metric.&lt;br&gt;
Ops teams care about &lt;strong&gt;precision&lt;/strong&gt;. When the model says "there's a&lt;br&gt;
 signature here," they need it to be right. A false positive sends a&lt;br&gt;
 document downstream that shouldn't be there, costing reviewer time. A&lt;br&gt;
 false negative is recoverable - the document falls back to manual&lt;br&gt;
 review, which is the existing baseline.&lt;br&gt;
Track both, but if you have to optimize one, optimize precision. Your&lt;br&gt;
 ops manager will thank you.&lt;/p&gt;
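&lt;p&gt;One way to check that operating point with the library's own validation API - a sketch, assuming the fine-tuned weights are saved as best.pt; the confidence values are illustrative, and the metrics.box.mp / metrics.box.mr precision/recall accessors may vary between ultralytics versions, so verify against yours:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ultralytics import YOLO

model = YOLO('best.pt')

# Sweep confidence thresholds and read precision/recall at each one,
# instead of judging the model by a single mAP number.
for conf in (0.25, 0.4, 0.5, 0.6, 0.75):
    metrics = model.val(data='dataset.yaml', conf=conf, verbose=False)
    print(f"conf={conf:.2f}  precision={metrics.box.mp:.3f}  recall={metrics.box.mr:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
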
&lt;h2&gt;
  
  
  Inference and deployment
&lt;/h2&gt;

&lt;p&gt;A model that runs on a GPU is fun. A model that runs on a CPU is&lt;br&gt;
 shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions - &lt;br&gt;
 CPU inference with an ONNX-exported model is faster to deploy, cheaper &lt;br&gt;
 to run, and far more compatible with locked-down production environments &lt;br&gt;
 where GPU drivers are a fight you don't want.&lt;br&gt;
The flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train with &lt;code&gt;ultralytics&lt;/code&gt; (PyTorch backend, GPU during training)&lt;/li&gt;
&lt;li&gt;Export the trained weights to ONNX&lt;/li&gt;
&lt;li&gt;Serve via &lt;code&gt;ultralytics&lt;/code&gt;'s ONNX-runtime path on CPU at inference time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 2 is one line:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# writes best.onnx alongside best.pt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Step 3 - the inference service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ONNX runtime, CPU-only&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/detect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;detections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bbox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xyxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detections&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in that snippet is &lt;code&gt;model = YOLO('best.onnx')&lt;/code&gt;&lt;br&gt;
 at module level - load the model &lt;strong&gt;once at startup&lt;/strong&gt;, never per request.&lt;br&gt;
 Reloading the model on every request is the most common production&lt;br&gt;
 mistake I've seen on YOLO endpoints. It's the difference between 50ms&lt;br&gt;
 response time and 5,000ms.&lt;br&gt;
For the container: a slim Python base image (&lt;code&gt;python:3.11-slim&lt;/code&gt;) is&lt;br&gt;
 enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image&lt;br&gt;
 ends up under 500MB, starts in seconds, and runs anywhere - including&lt;br&gt;
 locked-down corporate VMs and on-prem environments where shipping a&lt;br&gt;
 GPU-dependent service is months of approvals you don't have.&lt;br&gt;
That's the real tradeoff: you give up a small amount of per-request&lt;br&gt;
 latency in exchange for a service that deploys today, not next quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the tutorials don't tell you
&lt;/h2&gt;

&lt;p&gt;Three lessons the standard YOLO blog posts skip:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The long tail of weird scans is where production breaks.&lt;/strong&gt; Faxed pages with horizontal banding, partially photocopied documents, phone shots with one corner cut off, watermarks bleeding through from the back side. Your training set won't include enough of these. Get a sample of real production input as fast as possible - even just 50 images - and use them for evaluation, not training. They tell you what the world actually looks like.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Log every prediction with the input image hash.&lt;/strong&gt; When the model fails in production, you want to be able to find the exact input that broke it, retroactively. Hash the input, log the prediction, store both. That's how you build round-2 training data without hunting.&lt;/p&gt;
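&lt;p&gt;A minimal version of that logging - the JSONL file and field names here are placeholders, so adapt the storage to whatever your environment and retention rules require:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import time

def log_prediction(image_bytes, detections, log_path='predictions.jsonl'):
    """Append one prediction record keyed by the SHA-256 of the input image."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    record = {
        'image_sha256': digest,
        'timestamp': time.time(),
        'detections': detections,  # the class/confidence/bbox dicts from the service above
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return digest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
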
&lt;p&gt;&lt;strong&gt;3. Don't chase mAP@0.95.&lt;/strong&gt; Diminishing returns. If your business needs 95% precision at 70% recall, optimize for that operating point - not for a metric that summarizes the whole curve. Talk to your ops team. Get the actual numbers they care about. Train against those.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The model is not the bottleneck for document AI. The bottleneck is&lt;br&gt;
 annotation discipline, augmentation tuned to real production input,&lt;br&gt;
 and deployment that doesn't blow up under load. If you're building&lt;br&gt;
 computer vision for regulated industries - banking, insurance, legal,&lt;br&gt;
 healthcare - the playbook above is what's worked for me. The frameworks&lt;br&gt;
 change. The data discipline doesn't.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
