AI Viewz

Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL

Introduction

OCR models for right-to-left (RTL) languages such as Arabic, Urdu, Persian, and Hebrew are often held back by a lack of annotated training data. SynthDoG-RTL is a synthetic document generator adapted from Donut’s SynthDoG and extended to render RTL text correctly. In this post, we’ll walk through how to generate large-scale synthetic datasets compatible with Donut.


What is SynthDoG-RTL?

SynthDoG (Synthetic Document Generator) was introduced with Donut to create training data on the fly for document understanding. SynthDoG-RTL extends it by:

  • Supporting RTL text direction and contextual script shaping.
  • Including sample corpora, fonts, and templates for Arabic, Urdu, Persian, Hebrew, and others.
  • Allowing custom YAML configuration for layouts, distortions, and effects.

Installation and Setup

Clone the repository and install dependencies:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL

conda create -n synthdog python=3.8 -y
conda activate synthdog
pip install synthtiger

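Make sure libraqm is installed for proper Arabic/RTL shaping. On Debian/Ubuntu, install it together with the FreeType, HarfBuzz, and FriBiDi libraries it builds on:

sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev libraqm-dev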
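To confirm that the shaping library was picked up, a quick check like the one below can help. It is a minimal sketch that assumes text rendering goes through Pillow (installed alongside synthtiger); the helper name raqm_available is made up for this post.

```python
def raqm_available() -> bool:
    """Return True if the installed Pillow build reports libraqm support."""
    try:
        # Assumption: Pillow is present because synthtiger depends on it.
        from PIL import features
    except ImportError:
        return False
    # features.check may return None when the FreeType module is missing.
    return bool(features.check("raqm"))


if __name__ == "__main__":
    print("libraqm available:", raqm_available())
```

If this prints False, Arabic-script text is likely to come out as disconnected, left-to-right glyphs, so it is worth fixing before generating a large batch.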

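On macOS, set this environment variable before running the generator so that forked worker processes are not killed by the Objective-C runtime's fork-safety check: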

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Preparing Resources

Each language needs:

  • Corpus: UTF-8 text file under resources/corpus/ (e.g., urdu.txt, arabic.txt).
  • Fonts: Place .ttf/.otf fonts in resources/font/<lang_code>/.
  • Backgrounds: Optional textures under resources/backgrounds/.

Example structure:

resources/
 ├─ corpus/
 │   ├─ urdu.txt
 │   └─ arabic.txt
 └─ font/
     ├─ ur/
     │   └─ NotoNastaliq.ttf
     └─ ar/
         └─ NotoNaskh.ttf
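Before generating anything, it can be worth checking that the layout above is actually in place. The following is a sketch using only the standard library; check_resources is a hypothetical helper, not part of SynthDoG-RTL, and it assumes the directory names shown in this section.

```python
from pathlib import Path
from typing import List

FONT_SUFFIXES = {".ttf", ".otf"}


def check_resources(root: Path, corpus_name: str, lang_code: str) -> List[str]:
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []

    # The corpus must exist, decode as UTF-8, and contain some text.
    corpus = root / "corpus" / corpus_name
    if not corpus.is_file():
        problems.append(f"missing corpus: {corpus}")
    else:
        try:
            if not corpus.read_text(encoding="utf-8").strip():
                problems.append(f"empty corpus: {corpus}")
        except UnicodeDecodeError:
            problems.append(f"corpus is not valid UTF-8: {corpus}")

    # At least one .ttf/.otf font must be present for the language.
    font_dir = root / "font" / lang_code
    fonts = []
    if font_dir.is_dir():
        fonts = [p for p in font_dir.iterdir() if p.suffix.lower() in FONT_SUFFIXES]
    if not fonts:
        problems.append(f"no .ttf/.otf fonts in: {font_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_resources(Path("resources"), "urdu.txt", "ur"):
        print(problem)
```

The check only looks at file names and encoding. It does not verify that a font actually covers the script in the corpus.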

Configuring Generation

YAML config files (e.g., config_ur.yaml) define page size, font range, distortions, and paths.

Example Urdu config:

corpus_path: "resources/corpus/urdu.txt"
font_dir: "resources/font/ur/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"
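When several languages share the same page settings, the per-language configs can be stamped out from one template instead of being edited by hand. This is a rough sketch that reuses the keys from the Urdu example above. The language-to-corpus mapping and the function names are assumptions for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Same keys and values as the Urdu example; only the paths vary per language.
CONFIG_TEMPLATE = """\
corpus_path: "resources/corpus/{corpus}"
font_dir: "resources/font/{lang}/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"
"""

# Hypothetical mapping of language code to corpus file; adjust to your resources.
LANGUAGES = {"ur": "urdu.txt", "ar": "arabic.txt"}


def render_config(lang: str, corpus: str) -> str:
    """Fill the template for one language."""
    return CONFIG_TEMPLATE.format(lang=lang, corpus=corpus)


def write_configs(out_dir: Path, languages: Dict[str, str] = LANGUAGES) -> List[Path]:
    """Write config_<lang>.yaml for every language and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for lang, corpus in languages.items():
        path = out_dir / f"config_{lang}.yaml"
        path.write_text(render_config(lang, corpus), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    for path in write_configs(Path(".")):
        print("wrote", path)
```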

Generating Synthetic Data

Run the CLI:

synthtiger -o ./outputs/synthdog_ur -c 1000 -w 8 -v template.py SynthDoG config_ur.yaml

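Here -o sets the output directory, -c the number of samples (1000), and -w the number of worker processes (8); -v enables verbose output. Images and their text labels are written to ./outputs/synthdog_ur/.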

Repeat with config_ar.yaml, config_fa.yaml, etc. for multiple languages.
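Running the same command for each language is easy to script. The sketch below builds the exact command shown above for each language code and, by default, only prints it. The function names and the dry_run switch are illustrative, and the config_<lang>.yaml naming is assumed from the examples in this post.

```python
import subprocess
from typing import Iterable, List


def build_command(lang: str, count: int = 1000, workers: int = 8) -> List[str]:
    """Assemble the synthtiger invocation for one language code."""
    return [
        "synthtiger",
        "-o", f"./outputs/synthdog_{lang}",
        "-c", str(count),
        "-w", str(workers),
        "-v",
        "template.py", "SynthDoG", f"config_{lang}.yaml",
    ]


def generate_all(langs: Iterable[str], count: int = 1000, workers: int = 8,
                 dry_run: bool = True) -> None:
    """Print (dry run) or execute the command for every language in turn."""
    for lang in langs:
        cmd = build_command(lang, count, workers)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop as soon as one run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    generate_all(["ur", "ar", "fa"])
```

Set dry_run=False once the printed commands look right. Each language then gets its own output directory.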


Formatting for Donut

Donut expects each image to be paired with a JSON ground-truth record. Structure your dataset like:

my_dataset/
 ├─ train/
 │   ├─ metadata.jsonl
 │   ├─ 00000001.png
 │   └─ ...
 ├─ validation/
 │   └─ ...
 └─ test/
     └─ ...

Each line in metadata.jsonl:

{"file_name": "00000001.png", "ground_truth": "{\"gt_parse\":{\"text_sequence\":\"یہ اردو کا متن ہے\"}}"}
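Note that ground_truth is itself a JSON string nested inside the outer JSON object, which is easy to get wrong when writing lines by hand. Here is a minimal sketch of producing such lines with the standard json module. The function names are hypothetical, and the (file name, text) pairs are assumed to come from your own post-processing of the generator output.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple


def metadata_line(file_name: str, text: str) -> str:
    """Build one metadata.jsonl line in the format shown above."""
    # Inner object is serialized first, compactly, and embedded as a string.
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    # ensure_ascii=False keeps Arabic-script text readable in the file.
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,
    )


def write_metadata(pairs: Iterable[Tuple[str, str]], path: Path) -> None:
    """Write one line per (file_name, text) pair as UTF-8."""
    with path.open("w", encoding="utf-8") as fh:
        for file_name, text in pairs:
            fh.write(metadata_line(file_name, text) + "\n")


if __name__ == "__main__":
    print(metadata_line("00000001.png", "یہ اردو کا متن ہے"))
```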

Donut tokenizes the ground_truth string internally. Ensure that file_name matches the image file name exactly and that text_sequence contains the RTL ground-truth text.
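Both of those conditions can be checked mechanically before training. The sketch below walks one split directory and reports records whose image is missing or whose text_sequence is empty. validate_split is a hypothetical helper, and it assumes images sit next to metadata.jsonl as in the layout above.

```python
import json
from pathlib import Path
from typing import List


def validate_split(split_dir: Path) -> List[str]:
    """Return a list of problems found in split_dir/metadata.jsonl."""
    errors = []
    with (split_dir / "metadata.jsonl").open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
                file_name = record["file_name"]
                # ground_truth is a JSON string nested inside the record.
                text = json.loads(record["ground_truth"])["gt_parse"]["text_sequence"]
            except (ValueError, KeyError, TypeError) as exc:
                errors.append(f"line {line_no}: malformed record ({exc!r})")
                continue
            if not (split_dir / file_name).is_file():
                errors.append(f"line {line_no}: image not found: {file_name}")
            if not isinstance(text, str) or not text.strip():
                errors.append(f"line {line_no}: empty text_sequence")
    return errors


if __name__ == "__main__":
    for error in validate_split(Path("my_dataset/train")):
        print(error)
```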


Advanced Tips

  • Layouts: Customize template.py for multi-column, headers, or tables.
  • Effects: Add noise, blur, or perspective distortion in YAML for realism.
  • Fonts: Use multiple fonts per language to avoid overfitting.
  • Mixed Scripts: Include English corpora to simulate bilingual documents.
  • Scaling: Generate 10k–100k samples to pre-train Donut effectively.
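For the mixed-scripts tip, one simple approach is to blend English lines into the RTL corpus before generation. The following is a sketch only: it assumes the generator samples text from the corpus file line by line, and mix_corpora with its latin_ratio parameter is invented for this example.

```python
import random
from typing import List, Sequence


def mix_corpora(rtl_lines: Sequence[str], latin_lines: Sequence[str],
                latin_ratio: float = 0.2, seed: int = 0) -> List[str]:
    """Return all RTL lines plus enough Latin lines to make up about
    latin_ratio of the result, shuffled reproducibly."""
    if not 0 <= latin_ratio < 1:
        raise ValueError("latin_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n / (len(rtl) + n) = latin_ratio for n, capped by what is available.
    wanted = round(len(rtl_lines) * latin_ratio / (1 - latin_ratio))
    n_latin = min(wanted, len(latin_lines))
    mixed = list(rtl_lines) + rng.sample(list(latin_lines), n_latin)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    urdu = ["یہ اردو کا متن ہے"] * 8
    english = ["This is English text.", "Another English line.", "A third line."]
    for line in mix_corpora(urdu, english):
        print(line)
```

Write the result back as a UTF-8 file under resources/corpus/ and point the config at it. Whether both scripts then appear on the same page depends on how the template samples text.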

Conclusion

With SynthDoG-RTL, you can rapidly bootstrap synthetic OCR datasets for the major RTL languages. Once formatted as described above, the generated data can be fed directly to Donut, letting you train or fine-tune robust document understanding models even in low-resource settings.

