Running OCR for Persian text is tricky. Unlike Latin script, the Persian (and Arabic) script runs right-to-left, its letters change shape depending on their position in a word, and fewer open-source datasets are available. In this post, we’ll build a custom OCR pipeline using YOLO for text detection and CRNN for text recognition.
Why YOLO + CRNN?
YOLO is great at detecting objects — here, the objects are text regions.
CRNN (Convolutional Recurrent Neural Network) is ideal for sequence recognition like text.
Combined, they form a two‑stage pipeline: detect → crop → recognize.
Preparing the Dataset
For Persian OCR we need two datasets:
- Detection dataset (for YOLO): images with bounding boxes around words or lines.
- Recognition dataset (for CRNN): cropped images of words/lines with their correct text.
You can:
- Use tools like labelImg or Roboflow to annotate bounding boxes.
- Generate synthetic data: render Persian text on random backgrounds using different fonts to increase data size.
YOLO expects annotations in this format:
class x_center y_center width height
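Each line describes one box: class is the integer class index (0 if text is the only class), and the four box values are normalized to the range 0 to 1 by dividing by the image width and height.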
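If your annotation tool exports pixel coordinates, the conversion to this format is a few divisions. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) in pixels and a single class with index 0; the function name to_yolo_line is only illustrative.

```python
def to_yolo_line(box, img_w, img_h, class_id=0):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    # Center and size, each divided by the matching image dimension.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x50 px word box inside a 640x480 image.
print(to_yolo_line((100, 200, 300, 250), img_w=640, img_h=480))
# 0 0.312500 0.468750 0.312500 0.104167
```

Write one such line per box into a .txt file that shares its base name with the image.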
Training YOLO for Text Detection
A pretrained YOLOv8 checkpoint from Ultralytics is a reasonable starting point; fine-tune it with the yolo CLI:
yolo detect train data=persian_text.yaml model=yolov8s.pt epochs=50 imgsz=640
After training, YOLO will output bounding boxes for text regions.
Training CRNN for Text Recognition
CRNN = CNN + RNN + CTC loss.
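Define your Persian character set (the 32 letters plus space, along with any digits, punctuation, or the zero-width non-joiner your data contains), reserve one extra index for the CTC blank, and encode each label as a sequence of class indices.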
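A minimal sketch of that encoding, assuming the bare 32-letter alphabet plus space and index 0 reserved for the CTC blank. The names here (CHARSET, encode, decode) are illustrative, and a real dataset will most likely need extra symbols added to the set.

```python
# The 32 letters of the Persian alphabet; a space is appended below.
PERSIAN_LETTERS = "ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"
CHARSET = PERSIAN_LETTERS + " "

BLANK = 0  # assumed convention: index 0 is the CTC blank
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARSET)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}
NUM_CLASSES = len(CHARSET) + 1  # 33 characters + 1 blank = 34


def encode(text):
    """Map a string to class indices; raises KeyError on characters outside CHARSET."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode(indices):
    """Map class indices (without blanks) back to a string."""
    return "".join(IDX_TO_CHAR[i] for i in indices)


print(NUM_CLASSES)    # 34
print(encode("ا ب"))  # [1, 33, 2]
```

NUM_CLASSES is the value you would pass as num_classes when constructing the recognition model.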
Example PyTorch model:
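import torch
import torch.nn as nn


class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Two conv + pool stages: height and width are each reduced by a factor of 4.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        # 128 * 8 assumes grayscale crops of height 32 (32 / 4 = 8 rows of 128 channels).
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, num_layers=2)
        # The bidirectional LSTM emits 2 * 256 features per time step.
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.cnn(x)  # [B, 128, H/4, W/4]
        b, c, h, w = x.size()
        # Treat each column of the feature map as one time step.
        x = x.permute(3, 0, 1, 2).contiguous().view(w, b, c * h)
        x, _ = self.rnn(x)
        x = self.fc(x)
        return x  # [T, B, num_classes], with T = W/4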
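Train with CTC loss (nn.CTCLoss in PyTorch), which aligns the per-time-step predictions with the ground-truth sequence without needing character positions. nn.CTCLoss expects log-probabilities, so apply log_softmax over the class dimension to the model output first.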
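At inference time the same CTC convention is applied in reverse: take the most likely class at each time step, collapse consecutive repeats, then drop the blanks. Here is a small greedy decoder as a sketch, assuming the blank is index 0 and that best_path is the per-time-step argmax for one crop (for example from output.argmax(dim=2) on the model above). Beam search is a common alternative that is not shown here.

```python
def ctc_greedy_decode(best_path, blank=0):
    """Collapse consecutive repeated indices, then remove blanks."""
    decoded = []
    previous = None
    for idx in best_path:
        # Keep an index only when it starts a new run and is not the blank.
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


# Ten time steps for one crop; 0 is the blank.
best_path = [0, 5, 5, 0, 5, 7, 7, 0, 0, 2]
print(ctc_greedy_decode(best_path))  # [5, 5, 7, 2]
```

The blank between the two runs of 5 is what lets the decoder keep a genuinely doubled character instead of merging it.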
Combining YOLO + CRNN
Feed image → YOLO → bounding boxes.
Crop each box and resize to a fixed height.
Pass to CRNN → predicted text.
Join the results in reading order: lines from top to bottom, and words from right to left within each line.
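The geometric parts of these steps can be sketched without any image library. The code below assumes detections arrive as (x_min, y_min, x_max, y_max) pixel boxes and a CRNN input height of 32. The helper names and the line-grouping tolerance are illustrative heuristics, not part of YOLO or any OCR library.

```python
def resized_width(box_w, box_h, target_h=32):
    """Width that preserves the aspect ratio when a crop is scaled to target_h."""
    return max(1, round(box_w * target_h / box_h))


def reading_order(boxes, line_tol=0.5):
    """Sort boxes top to bottom, then right to left within each line.

    A box joins the current line when its vertical center is within
    line_tol * (height of the line's first box) of that line's center.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        y_center = (box[1] + box[3]) / 2
        if lines and abs(y_center - lines[-1]["y"]) < line_tol * lines[-1]["h"]:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"y": y_center, "h": box[3] - box[1], "boxes": [box]})
    ordered = []
    for line in lines:
        # Right-to-left: the box with the largest x_max is read first.
        ordered.extend(sorted(line["boxes"], key=lambda b: b[2], reverse=True))
    return ordered


boxes = [(10, 10, 60, 40), (200, 12, 260, 42), (150, 100, 220, 130)]
for box in reading_order(boxes):
    print(box, resized_width(box[2] - box[0], box[3] - box[1]))
```

Recognize each crop in this order and join the predicted strings with spaces. Skewed or multi-column pages would need a sturdier line-grouping step than this tolerance check.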
Challenges and Tips
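Right-to-left text: the CRNN reads each crop left to right, the reverse of Persian reading order. If the training labels were reversed into visual (left-to-right) order, reverse each decoded sequence before the final join; if the crops were flipped horizontally instead, the output is already in reading order. Keep the convention identical in training and inference.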
Fonts and noise: use data augmentation (blur, rotation, brightness) to improve generalization.
Small dataset? Consider transfer learning or fine‑tuning PaddleOCR models for Persian.
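For the right-to-left tip, one option is to reverse labels into visual order for training and reverse the decoded output back. This sketch assumes each label is a single right-to-left run with no digits or Latin text; mixed-direction labels need a proper bidirectional reordering step instead. The function names are illustrative.

```python
def to_visual_order(label):
    """Logical (reading) order -> left-to-right order as it appears in the crop."""
    return label[::-1]


def to_logical_order(decoded):
    """Left-to-right model output -> logical order for storage and display."""
    return decoded[::-1]


word = "سلام"
target = to_visual_order(word)           # what the CTC target should spell
assert to_logical_order(target) == word  # round trip restores the original
```

Both helpers work on strings and on lists of class indices, so they can be applied before or after label encoding.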
Conclusion
By combining YOLO and CRNN, we created a flexible OCR pipeline that works for Persian text. This approach can be extended to other right‑to‑left scripts like Arabic or Urdu.
You can check out the GitHub repo for sample code and try it on your own dataset!