AI Viewz

Posted on Sep 23

Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL

#python #ocr #computervision #programming

Introduction

OCR models for right-to-left (RTL) languages such as Arabic, Urdu, Persian, and Hebrew are often held back by a lack of annotated training data. SynthDoG-RTL is a synthetic document generator adapted from Donut’s SynthDoG and extended to render RTL text correctly. This post walks through generating large-scale synthetic datasets compatible with Donut, and assumes you are already comfortable with Python and OCR training pipelines.

What is SynthDoG-RTL?

SynthDoG (Synthetic Document Generator) was introduced with Donut to create training data on the fly for document understanding. SynthDoG-RTL extends it by:

Supporting RTL text direction and contextual script shaping.
Including sample corpora, fonts, and templates for Arabic, Urdu, Persian, Hebrew, and others.
Allowing custom YAML configuration for layouts, distortions, and effects.

Installation and Setup

Clone the repository and install dependencies:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL

conda create -n synthdog python=3.8 -y
conda activate synthdog
pip install synthtiger

RTL shaping depends on libraqm, which in turn needs FreeType, HarfBuzz, and FriBiDi. On Debian/Ubuntu, install all four:

sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev libraqm-dev
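A quick way to confirm that shaping support is actually available is to ask Pillow whether it was built with Raqm. This is a minimal sketch that assumes text rendering goes through Pillow; the helper name raqm_available is made up for illustration.

```python
def raqm_available():
    """Return True if the installed Pillow reports Raqm layout support."""
    try:
        from PIL import features
    except ImportError:
        return False  # Pillow itself is not installed
    # features.check("raqm") is truthy only when Pillow can load libraqm
    return bool(features.check("raqm"))


if __name__ == "__main__":
    print("Raqm layout available:", raqm_available())
```

If this prints False, Arabic-script text is likely to render as disconnected, left-to-right glyphs, so it is worth fixing before generating a large batch.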

On macOS, set the following before running generation to avoid fork-safety crashes in the worker processes:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Preparing Resources

Each language needs:

Corpus: UTF-8 text file under resources/corpus/ (e.g., urdu.txt, arabic.txt).
Fonts: Place .ttf/.otf fonts in resources/font/<lang_code>/.
Backgrounds: Optional textures under resources/backgrounds/.

Example structure:

resources/
 ├─ corpus/
 │   ├─ urdu.txt
 │   └─ arabic.txt
 └─ font/
     ├─ ur/
     │   └─ NotoNastaliq.ttf
     └─ ar/
         └─ NotoNaskh.ttf
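Before launching a long run, it can help to check that the layout above is in place. The following is a minimal sketch, not part of SynthDoG-RTL: the function name check_resources and the mapping from font directory code to corpus file name are assumptions made for illustration.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf"}


def check_resources(root, languages):
    """Return a list of problems found under `root`.

    `languages` maps a font directory code (e.g. "ur") to a corpus
    file name (e.g. "urdu.txt"). This mapping is a convention of this
    sketch, not something the generator defines.
    """
    root = Path(root)
    problems = []
    for code, corpus_name in languages.items():
        corpus = root / "corpus" / corpus_name
        if not corpus.is_file():
            problems.append(f"missing corpus: {corpus}")
        else:
            try:
                text = corpus.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                problems.append(f"corpus is not valid UTF-8: {corpus}")
            else:
                if not text.strip():
                    problems.append(f"corpus is empty: {corpus}")

        font_dir = root / "font" / code
        fonts = []
        if font_dir.is_dir():
            fonts = [p for p in font_dir.iterdir()
                     if p.suffix.lower() in FONT_SUFFIXES]
        if not fonts:
            problems.append(f"no .ttf/.otf fonts in: {font_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_resources("resources", {"ur": "urdu.txt", "ar": "arabic.txt"}):
        print(problem)
```

An empty result means every listed language has a readable UTF-8 corpus and at least one font file.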

Configuring Generation

YAML config files (e.g., config_ur.yaml) define the page size, font size range, distortions, and resource paths.

Example Urdu config:

corpus_path: "resources/corpus/urdu.txt"
font_dir: "resources/font/ur/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"

Generating Synthetic Data

Run the CLI:

synthtiger -o ./outputs/synthdog_ur -c 1000 -w 8 -v template.py SynthDoG config_ur.yaml

This generates 1000 samples (-c) using 8 worker processes (-w) and writes the images and text to ./outputs/synthdog_ur/ (-o). The -v flag prints error messages during generation.

Repeat with config_ar.yaml, config_fa.yaml, etc. for multiple languages.
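Running several languages by hand gets repetitive, so the command can be assembled in a small script. This is a sketch that assumes configs are named config_<lang>.yaml and outputs go to ./outputs/synthdog_<lang>, following the example above; build_command and generate_all are illustrative names.

```python
import shlex
import subprocess


def build_command(lang, count=1000, workers=8, output_root="./outputs"):
    """Build the synthtiger argument list for one language code."""
    return [
        "synthtiger",
        "-o", f"{output_root}/synthdog_{lang}",
        "-c", str(count),
        "-w", str(workers),
        "-v",
        "template.py", "SynthDoG", f"config_{lang}.yaml",
    ]


def generate_all(languages, dry_run=True, **kwargs):
    """Print each command, and run it unless dry_run is set."""
    for lang in languages:
        cmd = build_command(lang, **kwargs)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop if one language fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    generate_all(["ur", "ar", "fa"])  # dry run: only prints the commands
```

Keeping dry_run on by default makes it easy to review the exact commands before committing to a long generation job.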

Formatting for Donut

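Donut expects each image to be paired with a JSON ground truth, stored as one record per line in a metadata.jsonl file next to the images. Structure your dataset like: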

my_dataset/
 ├─ train/
 │   ├─ metadata.jsonl
 │   ├─ 00000001.png
 │   └─ ...
 ├─ validation/
 │   └─ ...
 └─ test/
     └─ ...

Each line in metadata.jsonl:

{"file_name": "00000001.png", "ground_truth": "{\"gt_parse\":{\"text_sequence\":\"یہ اردو کا متن ہے\"}}"}

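Note that ground_truth is itself a JSON string, which is why its inner quotes are escaped. Donut tokenizes it internally. Ensure that file_name matches your image and that text_sequence contains the RTL ground truth text in logical (reading) order.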
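If you ever need to build or rebuild metadata.jsonl yourself, the double encoding is easy to get wrong by hand. The sketch below produces lines in the format shown above; metadata_line and write_metadata are illustrative helper names, and the (file name, text) pairs are assumed to come from your own pipeline.

```python
import json
from pathlib import Path


def metadata_line(file_name, text):
    """Return one metadata.jsonl line for an image and its text."""
    # Inner object is serialized first, then embedded as a string
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}}, ensure_ascii=False
    )
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,  # keep Arabic-script text readable in the file
    )


def write_metadata(split_dir, samples):
    """Write metadata.jsonl for an iterable of (file_name, text) pairs."""
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    with open(split_dir / "metadata.jsonl", "w", encoding="utf-8") as f:
        for file_name, text in samples:
            f.write(metadata_line(file_name, text) + "\n")


if __name__ == "__main__":
    print(metadata_line("00000001.png", "یہ اردو کا متن ہے"))
```

Using ensure_ascii=False is a choice for readability; escaped \uXXXX sequences would decode to the same text.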

Advanced Tips

Layouts: Customize template.py for multi-column, headers, or tables.
Effects: Add noise, blur, or perspective distortion in YAML for realism.
Fonts: Use multiple fonts per language to avoid overfitting.
Mixed Scripts: Include English corpora to simulate bilingual documents.
Scaling: Aim for roughly 10k–100k samples when pre-training Donut; treat this as a starting range and adjust based on validation accuracy.
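For the mixed-script tip, one simple approach is to interleave English lines into the RTL corpus before generation. This is only a sketch of that idea, not how SynthDoG-RTL handles bilingual text; mix_corpora and its latin_ratio parameter are hypothetical.

```python
import random


def mix_corpora(rtl_lines, latin_lines, latin_ratio=0.2, seed=0):
    """Return rtl_lines with a Latin line inserted after some of them.

    Each RTL line is followed by a randomly chosen Latin line with
    probability `latin_ratio`. A fixed seed keeps the mix reproducible.
    """
    if not 0.0 <= latin_ratio <= 1.0:
        raise ValueError("latin_ratio must be between 0 and 1")
    rng = random.Random(seed)
    mixed = []
    for line in rtl_lines:
        mixed.append(line)
        if latin_lines and rng.random() < latin_ratio:
            mixed.append(rng.choice(latin_lines))
    return mixed


if __name__ == "__main__":
    urdu = ["یہ اردو کا متن ہے", "دوسری سطر"]
    english = ["Invoice number", "Total amount"]
    for line in mix_corpora(urdu, english, latin_ratio=0.5):
        print(line)
```

The mixed lines can then be written to a new UTF-8 corpus file and referenced from the YAML config in place of the single-language corpus.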

Conclusion

With SynthDoG-RTL you can quickly bootstrap synthetic OCR datasets for RTL languages such as Arabic, Urdu, Persian, and Hebrew. The generated data follows the image and metadata.jsonl format Donut expects, so you can train or fine-tune document understanding models even in low-resource settings.
