Introduction
OCR models for right-to-left (RTL) languages such as Arabic, Urdu, Persian, and Hebrew are often held back by a lack of annotated training data. SynthDoG-RTL is a synthetic document generator adapted from Donut’s SynthDoG and extended to render RTL text correctly. In this post, we’ll walk through generating large-scale synthetic datasets that are compatible with Donut.
What is SynthDoG-RTL?
SynthDoG (Synthetic Document Generator) was introduced with Donut to create training data on the fly for document understanding. SynthDoG-RTL extends it by:
- Supporting RTL text direction and contextual script shaping.
- Including sample corpora, fonts, and templates for Arabic, Urdu, Persian, Hebrew, and others.
- Allowing custom YAML configuration for layouts, distortions, and effects.
Installation and Setup
Clone the repository and install dependencies:
git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL
conda create -n synthdog python=3.8 -y
conda activate synthdog
pip install synthtiger
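Make sure libraqm is installed for proper Arabic/RTL shaping. On Debian/Ubuntu, the command below installs it along with the FreeType, HarfBuzz, and FriBidi development packages it builds on:
sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev libraqm-dev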
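Before generating anything, it can be worth confirming that shaping support is actually present. The sketch below assumes text is rendered through Pillow (an assumption about the rendering stack, not something stated above) and uses Pillow's feature check; the helper name raqm_available is made up for this example.

```python
def raqm_available():
    """Return True/False for raqm support, or None if Pillow is not installed."""
    try:
        from PIL import features  # Pillow's optional-feature registry
    except ImportError:
        return None
    return bool(features.check("raqm"))


status = raqm_available()
if status is None:
    print("Pillow is not installed in this environment")
elif status:
    print("raqm is available: complex-script shaping should work")
else:
    print("raqm is missing: Arabic/Urdu glyphs may render unshaped")
```

If the check reports that raqm is missing, generated images are likely to show isolated, unjoined letter forms, so it is cheaper to catch this before a long run.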
On macOS, set:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
Preparing Resources
Each language needs:
- Corpus: UTF-8 text file under resources/corpus/ (e.g., urdu.txt, arabic.txt).
- Fonts: Place .ttf/.otf fonts in resources/font/<lang_code>/.
- Backgrounds: Optional textures under resources/backgrounds/.
Example structure:
resources/
├─ corpus/
│  ├─ urdu.txt
│  └─ arabic.txt
└─ font/
   ├─ ur/
   │  └─ NotoNastaliq.ttf
   └─ ar/
      └─ NotoNaskh.ttf
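A missing font folder or a corpus saved in the wrong encoding tends to surface only partway through a run, so a quick pre-flight check can help. This is a minimal sketch, assuming the directory layout shown above; check_resources is a hypothetical helper, not part of SynthDoG-RTL.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf"}


def check_resources(root, corpus_name, lang_code):
    """Return a list of problems; an empty list means the layout looks usable."""
    root = Path(root)
    problems = []

    # The corpus must exist, decode as UTF-8, and contain some text.
    corpus = root / "corpus" / corpus_name
    if not corpus.is_file():
        problems.append(f"missing corpus: {corpus}")
    else:
        try:
            if not corpus.read_text(encoding="utf-8").strip():
                problems.append(f"corpus is empty: {corpus}")
        except UnicodeDecodeError:
            problems.append(f"corpus is not valid UTF-8: {corpus}")

    # At least one .ttf/.otf file must sit in the language's font folder.
    font_dir = root / "font" / lang_code
    fonts = []
    if font_dir.is_dir():
        fonts = [p for p in font_dir.iterdir() if p.suffix.lower() in FONT_SUFFIXES]
    if not fonts:
        problems.append(f"no .ttf/.otf fonts in: {font_dir}")

    return problems


for problem in check_resources("resources", "urdu.txt", "ur"):
    print(problem)
```

The check only looks at file names and encoding; it does not verify that a font actually covers the script in the corpus.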
Configuring Generation
YAML config files (e.g., config_ur.yaml) define page size, font range, distortions, and paths.
Example Urdu config:
corpus_path: "resources/corpus/urdu.txt"
font_dir: "resources/font/ur/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"
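When several languages share the same page settings, the per-language files can be stamped out from one template instead of being edited by hand. The sketch below reuses the exact keys and values from the Urdu example; the helper names and the LANGUAGES mapping are assumptions made for illustration.

```python
from pathlib import Path

# Same keys and values as the Urdu example; only the two paths vary per language.
CONFIG_TEMPLATE = """\
corpus_path: "resources/corpus/{corpus_name}"
font_dir: "resources/font/{lang_code}/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"
"""

# Hypothetical mapping of language code to corpus file name.
LANGUAGES = {"ur": "urdu.txt", "ar": "arabic.txt"}


def render_config(lang_code, corpus_name):
    """Return the YAML text for one language."""
    return CONFIG_TEMPLATE.format(lang_code=lang_code, corpus_name=corpus_name)


def write_configs(out_dir, languages=LANGUAGES):
    """Write config_<lang_code>.yaml for each language and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for lang_code, corpus_name in languages.items():
        path = out_dir / f"config_{lang_code}.yaml"
        path.write_text(render_config(lang_code, corpus_name), encoding="utf-8")
        written.append(path)
    return written
```

Calling write_configs(".") would produce config_ur.yaml and config_ar.yaml in the current directory. Plain string formatting is used here to avoid a YAML dependency; a YAML library would be the safer choice once values need quoting or nesting.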
Generating Synthetic Data
Run the CLI:
synthtiger -o ./outputs/synthdog_ur -c 1000 -w 8 -v template.py SynthDoG config_ur.yaml
This generates 1,000 samples with 8 workers, outputting images and text into ./outputs/synthdog_ur/.
Repeat with config_ar.yaml, config_fa.yaml, etc. for multiple languages.
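Running the same command for each language is easy to script. The following sketch builds exactly the CLI invocation shown above for each language code and defaults to a dry run that only prints the commands; the function names and the config_<lang_code>.yaml naming pattern are assumptions.

```python
import subprocess


def build_command(lang_code, count=1000, workers=8):
    """Build the synthtiger command from the post for one language code."""
    return [
        "synthtiger",
        "-o", f"./outputs/synthdog_{lang_code}",
        "-c", str(count),
        "-w", str(workers),
        "-v",
        "template.py", "SynthDoG", f"config_{lang_code}.yaml",
    ]


def generate_all(lang_codes, dry_run=True, **kwargs):
    """Print (dry run) or execute the command for every language, in order."""
    commands = [build_command(code, **kwargs) for code in lang_codes]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch as soon as one language fails.
            subprocess.run(cmd, check=True)
    return commands


generate_all(["ur", "ar", "fa"])  # dry run: prints three commands, runs nothing
```

Pass dry_run=False once the printed commands look right. Languages are processed one after another because each run already uses multiple workers.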
Formatting for Donut
Donut expects an image + JSON pair. Structure your dataset like:
my_dataset/
├─ train/
│  ├─ metadata.jsonl
│  ├─ 00000001.png
│  └─ ...
├─ validation/
│  └─ ...
└─ test/
   └─ ...
Each line in metadata.jsonl:
{"file_name": "00000001.png", "ground_truth": "{\"gt_parse\":{\"text_sequence\":\"یہ اردو کا متن ہے\"}}"}
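Donut will tokenize this internally. Ensure that file_name matches the image file in the same split directory and that text_sequence contains the RTL ground-truth text. Note that ground_truth is itself a JSON-encoded string, so its inner quotes must be escaped as shown.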
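Hand-escaping the nested JSON is error-prone, so it is simpler to let a JSON library encode it twice. The sketch below assumes you already have (file name, text) pairs for a split; how you collect them from the generator's output is not covered here, and both function names are hypothetical.

```python
import json
from pathlib import Path


def metadata_line(file_name, text):
    """Return one metadata.jsonl line in the shape shown above."""
    # Inner encode: ground_truth is stored as a JSON *string*, not an object.
    ground_truth = json.dumps({"gt_parse": {"text_sequence": text}}, ensure_ascii=False)
    # Outer encode: escapes the inner quotes automatically.
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth}, ensure_ascii=False)


def write_metadata(split_dir, samples):
    """Write metadata.jsonl into split_dir from (file_name, text) pairs."""
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    path = split_dir / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for file_name, text in samples:
            handle.write(metadata_line(file_name, text) + "\n")
    return path


print(metadata_line("00000001.png", "یہ اردو کا متن ہے"))
```

Setting ensure_ascii=False keeps the Urdu or Arabic text readable in the file instead of turning it into \u escapes; both forms decode to the same string. The output differs from the hand-written example only in whitespace after colons and commas.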
Advanced Tips
- Layouts: Customize template.py for multi-column, headers, or tables.
- Effects: Add noise, blur, or perspective distortion in YAML for realism.
- Fonts: Use multiple fonts per language to avoid overfitting.
- Mixed Scripts: Include English corpora to simulate bilingual documents.
- Scaling: Generate 10k–100k samples to pre-train Donut effectively.
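As one way to act on the mixed-scripts tip, the sketch below blends English lines into an RTL corpus at a chosen ratio. It assumes the corpus is treated as one text unit per line, which may not match how the generator samples text, and mix_corpora is a hypothetical helper.

```python
import random


def mix_corpora(rtl_lines, latin_lines, latin_ratio=0.2, seed=0):
    """Return rtl_lines plus enough Latin lines to make up about latin_ratio of the result."""
    if not 0 <= latin_ratio < 1:
        raise ValueError("latin_ratio must be in [0, 1)")
    if latin_ratio > 0 and not latin_lines:
        raise ValueError("latin_lines must not be empty when latin_ratio > 0")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    # Solve n_latin / (n_rtl + n_latin) = latin_ratio for n_latin.
    n_latin = round(len(rtl_lines) * latin_ratio / (1 - latin_ratio))
    extra = rng.choices(latin_lines, k=n_latin) if n_latin else []

    mixed = list(rtl_lines) + extra
    rng.shuffle(mixed)
    return mixed


rtl = ["یہ اردو کا متن ہے"] * 8
for line in mix_corpora(rtl, ["This is English text."], latin_ratio=0.2):
    print(line)
```

This mixes at the line level only. Documents that switch script inside a single line would need the mixing done within each text unit instead.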
Conclusion
With SynthDoG-RTL you can quickly bootstrap synthetic OCR datasets for Arabic, Urdu, Persian, Hebrew, and other RTL languages. Once arranged in the image-plus-metadata layout described above, the generated data can be used to train or fine-tune Donut document understanding models even in low-resource settings.