Most automation projects in regulated industries hit the same wall
eventually: a CAPTCHA on an internal portal blocks the very automation
the team is trying to build.
In our case, the ops team needed to interact with one of our company's
internal portals dozens of times per day. The portal — built by an
internal team, used only by employees — gates access with a 6-digit
numeric CAPTCHA on every login. Reasonable security choice for the
original threat model. Not so reasonable for the team that needs to
script repetitive workflows on top of it.
The right fix would have been to add a service-account API to the
portal team's backlog. The realistic fix, given the timeline, was
to teach a small ML model to read those CAPTCHAs reliably so our
automation script could move past the login screen.
A note on legitimacy, since this article will inevitably be skimmed
by people wondering: this is internal automation on a portal owned
by my employer, used only inside the company, accessed with explicit
authorization to automate. It's the same shape of work as RPA — same
legal/ethical category. Solving CAPTCHAs on third-party websites you
don't own is a different conversation entirely, often a TOS violation
and sometimes illegal depending on jurisdiction. Don't conflate the
two.
With that framing out of the way, the technical question was: what's
the right model architecture for fixed-length numeric CAPTCHA OCR?
The default answer most engineers reach for is CRNN — a CNN encoder
followed by an LSTM/GRU decoder, trained with CTC loss. It's the
standard recipe in every "deep learning for OCR" tutorial. And for
variable-length text recognition (handwritten notes, scanned documents,
scene text), CRNN is genuinely the right choice.
But our CAPTCHA was always exactly 6 digits. Always 0–9. No variation
in length, no edge cases, no character set ambiguity. The structure
was completely known.
When the structure of your input is known, the right architectural
move is to lean into that structure — not reach for the most general
possible model. So I skipped CRNN and built something simpler: a
shared CNN backbone with six independent classification heads (one
per digit position), tied together with learnable position embeddings.
The result was 100% accuracy on our held-out test set with about
4,000 training samples. Here's how it's built and why each design
choice mattered.
CRNN vs. multi-head: a brief architecture comparison
CRNN is the standard recipe for OCR: a CNN encoder pulls features from the image, a recurrent layer (LSTM or GRU) decodes those features into a sequence, and CTC loss handles the alignment between the predicted sequence and the ground-truth label without requiring per-character supervision. It's a powerful approach because it handles variable-length outputs gracefully — the same model can predict a 4-character word, a 10-character phrase, or a 50-character sentence.
The trade-off is complexity. CRNN has more moving parts:
- The recurrent decoder adds parameters and training instability
- CTC loss has its own learning dynamics and edge cases (alignment collapse, blank-token tuning)
- Inference is sequential — harder to parallelize across positions
- Debugging is harder — when the model outputs "13456" instead of "123456," you need to figure out whether that's a recognition error, an alignment error, or a length error
Compare that to a multi-head approach for fixed-length output. The input image runs
through a shared CNN backbone once, producing a single feature vector. Then six
independent classification heads each predict one digit (0–9) at one specific
position. The training signal is straightforward: six cross-entropy losses, one per
position, averaged. No sequence decoding. No alignment. No CTC.
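To make that concrete, the naive fixed-length version, before the position-embedding refinement covered below, fits in a few lines. This is a sketch; the class name and the feat_dim plumbing are illustrative, not the article's code:

import torch
import torch.nn as nn

class NaiveMultiHead(nn.Module):
    """Shared backbone with six independent digit heads and no position signal (the naive variant)."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone   # any CNN that maps an image to a feat_dim feature vector
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 10) for _ in range(6)])

    def forward(self, image):
        f = self.backbone(image)                                      # shared features, shape (B, feat_dim)
        return torch.stack([head(f) for head in self.heads], dim=1)   # logits, shape (B, 6, 10)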
The structure is simpler in every dimension that matters:
- Fewer parameters in the decoder
- Faster training convergence (more stable gradient signal per output)
- Faster inference (six parallel classifications, no sequential decode)
- Easier debugging (each head's output is independent and inspectable)
The only thing CRNN gives you that multi-head doesn't is variable-length support.
And we explicitly didn't need that.
The general principle worth taking away: if your task has known structure (fixed
length, fixed character set, fixed slot count), encode that structure in your
architecture instead of asking a more general model to learn it. You'll get faster
training, fewer parameters, and better sample efficiency.
Position embeddings: the design choice that made the shared backbone work
The naive version of the multi-head architecture has a subtle weakness. All six output
heads consume the same feature vector from the shared backbone. If the backbone
produces a feature vector f, then each head simply looks at f and emits a
prediction:
prediction_position_1 = head_1(f)
prediction_position_2 = head_2(f)
# ... etc., all six heads see the same f
The shared f has to encode "what character is at position 1 AND what character is
at position 2 AND ... AND what character is at position 6" — all simultaneously, in
the same vector. With enough training data and capacity, the model can learn this.
But it's inefficient — the backbone's representation is being asked to do six jobs
at once, with no signal about which job it's currently serving.
The fix is small but powerful: give the model an explicit signal about which
position it's predicting. A learnable position embedding.
self.position_emb = nn.Embedding(6, 10)  # 6 positions, 10-dim each

def forward(self, image, position_idx):
    features = self.cnn(image)              # shared backbone output
    pos = self.position_emb(position_idx)   # which position?
    combined = torch.cat([features, pos], dim=1)
    logits = self.classifier(combined)
    return logits
Now the backbone is asked one focused question — what's at this specific position?
— and the position embedding provides the context. The model learns position-aware feature extraction without needing six separate backbones.
The downstream effect is significant. With ~4,000 training samples, this design converged cleanly to 100% accuracy on the held-out test set. A naive multi-head architecture (without position embeddings) trained on the same dataset hits a lower accuracy ceiling, because the shared feature vector can't decompose its representation cleanly across positions.
This pattern is worth internalizing: when multiple output heads share a backbone, give the backbone an explicit signal about which output it's serving. The signal can be a position embedding (as here), a class embedding (in multi-task learning), or any other discriminating context. The shared backbone learns better when it knows what it's working on.
The backbone: why eca_nfnet_l0 over plain CNN
The shared CNN backbone needs to extract features from a small grayscale image (200x50) and produce a representation that the multi-head classifier can decode reliably. The default move for OCR work is a plain ResNet18 or VGG, but I went with eca_nfnet_l0 from the timm library — a Normalizer-Free Net with Efficient Channel Attention.
A few reasons for the choice:
- Normalizer-Free Networks skip BatchNorm and replace it with weight standardization + adaptive gradient clipping. The architecture trains stably even at small batch sizes, and inference is faster (no BN running statistics to track).
- ECA blocks add channel-wise attention with a 1D convolution rather than the standard squeeze-excitation MLP. Lower parameter count, similar accuracy gain.
- Pretrained weights are available via timm. Even though ImageNet has nothing to do with grayscale CAPTCHAs, the low-level filters (edges, textures, basic shapes) transfer fine and reduce the data needed to converge.
The model gets configured to take 1-channel input instead of 3:
backbone = timm.create_model(
    'eca_nfnet_l0',
    pretrained=True,
    in_chans=1,
    num_classes=0,
    global_pool='',
)
num_classes=0 and global_pool='' strip the final classification head and the global pooling layer; we want the raw feature map so we can attach our own multi-head classifier.
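For completeness, here is roughly how the fragments above fit into one module. The pooling step, the single shared classifier queried per position, and the CaptchaModel name are my assumptions; the article only shows the pieces:

import timm
import torch
import torch.nn as nn

class CaptchaModel(nn.Module):
    """Backbone + position embedding + one shared classifier, queried once per digit position."""
    def __init__(self, num_positions=6, num_classes=10, pos_dim=10):
        super().__init__()
        self.cnn = timm.create_model(
            'eca_nfnet_l0', pretrained=True,
            in_chans=1, num_classes=0, global_pool='',
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse the (B, C, H, W) feature map to (B, C)
        self.position_emb = nn.Embedding(num_positions, pos_dim)
        self.classifier = nn.Linear(self.cnn.num_features + pos_dim, num_classes)

    def forward(self, image, position_idx):
        features = self.pool(self.cnn(image)).flatten(1)   # shared backbone output
        pos = self.position_emb(position_idx)               # which position am I predicting?
        combined = torch.cat([features, pos], dim=1)
        return self.classifier(combined)                    # (B, 10) logits for that position

At inference you call it six times per image (or batch the position index), once per digit slot, and argmax each output.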
Augmentation: defaults are wrong for CAPTCHAs
The same lesson from my earlier YOLOv11 tutorial applies here: the default torchvision augmentation pipeline assumes you're training on natural images.
CAPTCHAs are not natural images. The augmentations that help on ImageNet either don't help or actively hurt for CAPTCHA OCR.
What I used for training:
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.RandomRotation(5),                                # ±5° — anything more is unrealistic
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # small translation
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),   # slight perspective drift
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
    transforms.Resize((50, 200)),
])
Three augmentations I deliberately did not use:
- No horizontal flips. A flipped 6 looks like a 9. A flipped 7 doesn't look like any digit. Training on flips actively confuses the model.
- No vertical flips. Same logic.
- No heavy rotation. CAPTCHA samples already include some rotation. Adding ±30° would generate training data that doesn't reflect actual portal output, hurting generalization.
Three augmentations I did use, each tuned to this CAPTCHA rather than applied at default strength:
- Limited rotation (±5°) to mimic the small in-the-wild rotation present in real CAPTCHA samples
- Translation augmentation to handle variable horizontal position of digits
- Perspective distortion (mild) to handle the subtle shear/skew the CAPTCHA generator applies
Validation transforms strip all augmentation — straight grayscale + normalize + resize. The validation set should reflect actual production input, not augmented training input.
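In code, that is just the training pipeline with the random steps dropped (a sketch; the article describes this in prose rather than showing it):

from torchvision import transforms

# Deterministic pipeline for validation and inference: no augmentation at all.
val_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
    transforms.Resize((50, 200)),
])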
Training: averaging six losses instead of summing them
Each output head produces a 10-class logit vector for one digit position. The loss for each head is straightforward cross-entropy. The question is how to combine the six losses into a single training signal.
The naive approach is to sum them:
loss = loss1 + loss2 + loss3 + loss4 + loss5 + loss6
This works, but it changes the effective learning rate. With six loss terms summed, the gradient magnitude is roughly six times larger than it would be for a single-head model. To compensate, you'd need to divide your learning rate by ~6 to get equivalent training dynamics.
The cleaner approach — what I used — is to average:
loss = (loss1 + loss2 + loss3 + loss4 + loss5 + loss6) / 6.0
loss.backward()
Now the gradient magnitude is comparable to a single-head model, so the standard Adam learning rate (3e-4 for fine-tuning a pretrained backbone) just works without further tuning.
- Optimizer: Adam, lr=3e-4
- Loss: averaged cross-entropy across six heads
- Batch size: 128
- Epochs: 150 max, manually stopped at epoch 74 once val-loss plateaued near zero
- No learning-rate schedule — for a task this narrow, on a pretrained backbone, default Adam dynamics were sufficient
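Put together, a single training step per batch looks roughly like the sketch below. The function name and the (B, 6) label layout are assumptions; the article only shows the averaged-loss lines:

import torch
import torch.nn.functional as F

def training_step(model, images, labels, optimizer):
    """labels: (B, 6) tensor of digit classes, one column per CAPTCHA position (assumed layout)."""
    optimizer.zero_grad()
    losses = []
    for pos in range(6):
        pos_idx = torch.full((images.size(0),), pos, dtype=torch.long, device=images.device)
        logits = model(images, pos_idx)                        # (B, 10) logits for this position
        losses.append(F.cross_entropy(logits, labels[:, pos]))
    loss = sum(losses) / 6.0   # average, not sum, so the gradient scale matches a single-head model
    loss.backward()
    optimizer.step()
    return loss.item()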
The training run produced a clean monotonic decrease in val-loss for the first ~30 epochs, then plateaued at the noise floor as the model hit 100% on the held-out set.
By epoch 74, val-loss was around 0.005 — effectively zero — so I stopped manually rather than running out the planned 150 epochs.
Test-time augmentation: belt-and-suspenders for production
Once the trained model hits 100% on the held-out validation set, you'd think there's nothing left to do for inference. There isn't, accuracy-wise. But production has weirder inputs than test sets — different screenshot resolutions, slight color shifts, edge alignment differences from the live portal vs. the captured training
samples.
For robustness against these distribution-shift cases, I added test-time
augmentation (TTA): run the model on multiple lightly-modified versions of the same input image, average the predictions across versions, return the consensus.
The pattern is simple:
def predict_with_tta(image, model):
    # Variant 1: original input
    logits_original = model(transform(image))
    # Variant 2: lightly center-cropped, then resized back
    logits_cropped = model(transform_with_center_crop(image))
    # Average the logits before argmax
    averaged = (logits_original + logits_cropped) / 2.0
    return averaged.argmax(dim=1)
The center-crop variant trims a bit of border and resizes back to the original input dimensions. This forces the model to see the digit content at a slightly different effective magnification, which in practice catches a small set of edge cases the un-cropped pass would miss on its own.
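One plausible implementation of that cropped variant is sketched below; the 5% margin and the helper's internals are assumptions, since the article only names the function (val_transform here is the deterministic pipeline from earlier):

from torchvision.transforms import functional as TF

def transform_with_center_crop(image, margin=0.05):
    """Trim a small border (5% per side, an assumed value), then reuse the deterministic pipeline."""
    w, h = image.size   # PIL image size is (width, height)
    cropped = TF.center_crop(image, [int(h * (1 - 2 * margin)), int(w * (1 - 2 * margin))])
    return val_transform(cropped)   # grayscale, tensor, normalize, resize back to 50x200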
Trade-off: roughly 2x inference time per CAPTCHA. For a portal that gets logged into a few times a minute, that's invisible (a few extra milliseconds). For a high-throughput service, you'd want to benchmark first.
For a model already at 100% on validation, TTA is mostly belt-and-suspenders — catches the production-only edge cases I couldn't anticipate during training. Worth the small inference cost; not worth it for a less-mature model where you'd be better served improving training first.
What I'd change next time
Honest reflection on the things I'd revisit if I rebuilt this from scratch:
Mixup, defined but unused. I implemented a Mixup augmentation function in the training notebook but never wired it into the actual training loop. At 100% accuracy on the held-out set, Mixup probably wouldn't have helped — there's no headroom left to capture. But on a harder version of this task (more characters, more visually similar classes, less data), Mixup is one of the lowest-cost regularizers worth trying first.
Automated early stopping rather than babysitting. I stopped training manually at epoch 74 by watching the val-loss curve in the notebook. A more disciplined run would have wired patience=10 early stopping directly into the training loop — same outcome, less babysitting, easier to reproduce later.
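For reference, that patience-based stopper is only a few lines (a generic sketch, not code from the project):

class EarlyStopping:
    """Stop training once val-loss has not improved for `patience` consecutive epochs."""
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True means stop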
The dataset size question. ~4,000 samples with strong augmentation got us to 100%. I never ran the experiment of "how few samples could we get away with?" The floor is probably around 1,500–2,000 samples for this exact CAPTCHA generator. For future similar projects, I'd start there and add more data only if accuracy plateaus below the target.
Transformer-based encoders for harder CAPTCHAs. This CAPTCHA was 6 fixed-length numeric digits. If the task scaled to alphanumeric, variable-length, or adversarially-designed CAPTCHAs (the kind built specifically to defeat ML), a transformer-based vision encoder (ViT, Swin, or a TrOCR-style decoder) would be a more expressive starting point. The multi-head + position embedding approach has a ceiling beyond which it stops being the right tool.
Was TTA necessary? Possibly not — given the model already hit 100%. The right answer would have been to compare production accuracy with and without TTA over a few weeks. I added TTA pre-emptively rather than measuring whether it was needed.
That's an antipattern I'd correct.
The general lesson here: at 100% accuracy on validation, the temptation is to keep adding tricks (TTA, ensembles, larger models). The actual move is to stop, measure on production, and only add complexity when production data tells you to.
The principle to take away
When the structure of your task is known, encode that structure in your architecture rather than reaching for the most general possible model.
For fixed-length CAPTCHA OCR, "the structure is known" meant six positions, ten classes per position, no length variation. The right architectural answer was six classification heads with a shared backbone and position embeddings — not a CRNN with sequence decoding and CTC loss.
This pattern applies far beyond CAPTCHAs. A few examples where it shows up:
- Form-field extraction with a fixed schema (name, date, address, signature in known boxes). Don't use a free-form sequence model. Use field-specific heads attached to a shared document encoder.
- Multi-label classification with a known label vocabulary. Don't use a generative decoder. Use one head per label.
- Time series forecasting with a known forecast horizon. The right architecture often has explicit per-horizon heads, not a single autoregressive decoder.
- Structured information extraction from a well-defined schema (invoices, lab reports, government forms). Slot-filling architecture beats sequence-to-sequence when the slots are stable.
The general engineering instinct is to reach for the most flexible model — the one that handles the widest range of inputs. For research and exploratory work, that's right. For production work where the input structure is genuinely known and stable, it's wrong. Specificity wins on training stability, sample efficiency, inference speed, and debuggability.
For this CAPTCHA project, the result was 100% accuracy on a held-out test set with about 4,000 training samples and a model that runs in under 50ms on CPU. The constraints made the problem easier; matching the architecture to the constraints made the solution simple.