Tesseract Training

#tesseract #ocr #machinelearning

Overview of Training Process

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#introduction

The LSTM training process is conceptually the same as for base Tesseract:

1. Prepare training text.
2. Render text to image + box file. (Or create hand-made box files for existing image data.)
3. Make unicharset file. (Can be partially specified, i.e. created manually.)
4. Make a starter traineddata from the unicharset and optional dictionary data.
5. Run tesseract to process image + box file to make training data set.
6. Run training on training data set.
7. Combine data files.

The key differences from base Tesseract training are:

The boxes only need to be at the textline level. It is thus far easier to make training data from existing image data.
The .tr files are replaced by .lstmf data files.
Fonts can and should be mixed freely instead of being separate.
The clustering steps (mftraining, cntraining, shapeclustering) are replaced with a single slow lstmtraining step.
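
The seven steps can be sketched as a dry-run shell script that only prints the commands it would run. This is an illustration under assumptions, not the wiki's own script: every path, the font name, the helper names (run, make_list, lstm_pipeline) and the default net spec are placeholders, and the flags should be checked against the installed tools before setting DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run sketch of the LSTM training steps. Paths, font and net spec are
# placeholders (assumptions), not values taken from these notes.
: "${DRY_RUN:=1}"
if [ -z "${NET_SPEC:-}" ]; then
  NET_SPEC='[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'
fi

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Write one .lstmf path into a list file for lstmtraining.
make_list() {
  printf '%s\n' "$1" > "$2"
}

# lstm_pipeline LANG FONT WORKDIR
lstm_pipeline() (
  lang=$1 font=$2 work=$3
  base="$work/$lang.$font.exp0"
  list="$work/$lang.training_files.txt"
  # Steps 1-2: training text is assumed to exist; render image + box file.
  run text2image --text="$work/$lang.training_text" --outputbase="$base" \
      --font="$font" --fonts_dir=/usr/share/fonts
  # Step 3: derive the unicharset from the box file.
  run unicharset_extractor --output_unicharset "$work/$lang.unicharset" \
      --norm_mode 1 "$base.box"
  # Step 4: build the starter traineddata.
  run combine_lang_model --input_unicharset "$work/$lang.unicharset" \
      --script_dir "$work/langdata" --output_dir "$work/starter" --lang "$lang"
  # Step 5: image + box file -> .lstmf training data, then list it.
  run tesseract "$base.tif" "$base" --psm 6 lstm.train
  run make_list "$base.lstmf" "$list"
  # Step 6: the single slow training step.
  run lstmtraining --traineddata "$work/starter/$lang/$lang.traineddata" \
      --net_spec "$NET_SPEC" --model_output "$work/checkpoints/$lang" \
      --train_listfile "$list" --max_iterations 5000
  # Step 7: combine the checkpoint and starter into the final traineddata.
  run lstmtraining --stop_training \
      --continue_from "$work/checkpoints/${lang}_checkpoint" \
      --traineddata "$work/starter/$lang/$lang.traineddata" \
      --model_output "$work/$lang.traineddata"
)
```

Calling lstm_pipeline foo Arial /tmp/work prints the seven commands in order without touching the file system.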

Understanding the Various Files Used During Training

As with base Tesseract, the completed LSTM model and everything else it needs is collected in the traineddata file. Unlike base Tesseract, a starter traineddata file is given during training, and has to be set up in advance. It can contain:

Config file providing control parameters.
Unicharset defining the character set.
Unicharcompress, aka the recoder, which maps the unicharset further to the codes actually used by the neural network recognizer.
Punctuation pattern dawg, with patterns of punctuation allowed around words.
Word dawg. The system word-list language model.
Number dawg, with patterns of numbers that are allowed.

The unicharset and the unicharcompress (recoder) must be provided. The others are optional, but if any of the dawgs are provided, the punctuation dawg must also be provided. A new tool, combine_lang_model, is provided to make a starter traineddata from a unicharset and optional wordlists.
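
The dependency rules for the starter traineddata can be written down as a small check. This is a sketch under assumptions: check_starter is a hypothetical helper, the component names (lstm-unicharset, lstm-recoder, lstm-punc-dawg and so on) are assumed to follow the suffixes that combine_tessdata -d lists, and the required set is assumed to be the unicharset and the recoder.

```shell
#!/bin/sh
# check_starter COMPONENT...
# Exit status 0 if the component set satisfies the rules, 1 otherwise.
# Component names are assumptions modelled on traineddata file suffixes.
check_starter() (
  components="$*"
  # True if the named component is in the list.
  has() {
    case " $components " in
      *" $1 "*) return 0 ;;
      *) return 1 ;;
    esac
  }
  status=0
  # Assumed mandatory: the character set and the recoder.
  for req in lstm-unicharset lstm-recoder; do
    if ! has "$req"; then
      echo "missing required component: $req"
      status=1
    fi
  done
  # Any word or number dawg requires the punctuation dawg as well.
  if has lstm-word-dawg || has lstm-number-dawg; then
    if ! has lstm-punc-dawg; then
      echo "a dawg is present but lstm-punc-dawg is missing"
      status=1
    fi
  fi
  return "$status"
)
```

For example, check_starter lstm-unicharset lstm-recoder lstm-word-dawg reports the missing punctuation dawg and returns 1.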

Making Box Files

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-training-data

Tesseract 4 accepts several box file formats for LSTM training, though they differ from the one used by Tesseract 3.

Each line in the box file matches a 'character' (glyph) in the tiff image.

The line format is <symbol> <left> <bottom> <right> <top> <page>, where <left> <bottom> <right> <top> <page> could be the bounding-box coordinates of a single glyph or of a whole textline.

To mark an end-of-textline, a special line must be inserted after a series of lines. It has a tab character in place of the symbol: <tab> <left> <bottom> <right> <top> <page>.
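
As an illustration of textline-level boxes, the sketch below writes one box line per character, each carrying the bounding box of the whole line, followed by the tab line that ends the textline. The function name line_to_box and its argument order are hypothetical, and the field order (symbol, left, bottom, right, top, page) is an assumption about the format. Characters are split with plain shell pattern matching, so multi-byte text depends on the shell and locale.

```shell
#!/bin/sh
# line_to_box TEXT LEFT BOTTOM RIGHT TOP [PAGE]
# Emit one box line per character of TEXT, all with the same line-level box,
# then a tab line marking the end of the textline.
line_to_box() (
  rest=$1 left=$2 bottom=$3 right=$4 top=$5 page=${6:-0}
  while [ -n "$rest" ]; do
    ch=${rest%"${rest#?}"}   # first character of the remaining text
    rest=${rest#?}           # drop that character
    printf '%s %d %d %d %d %d\n' "$ch" "$left" "$bottom" "$right" "$top" "$page"
  done
  # End-of-textline marker: a tab in place of the symbol.
  printf '\t %d %d %d %d %d\n' "$left" "$bottom" "$right" "$top" "$page"
)
```

A call such as line_to_box "Hi" 10 20 300 60 > line.box produces three lines: one for H, one for i, and the tab line.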

Note that in all cases, even for right-to-left languages such as Arabic, the text transcription for the line should be ordered left-to-right. In other words, the network learns from left to right regardless of the language, and the right-to-left/bidi handling happens at a higher level inside Tesseract.

Using tesstrain.sh

The setup for running tesstrain.sh is the same as for base Tesseract. Use the --linedata_only option for LSTM training. It is beneficial to have more training text and to make more pages, though, as neural nets don't generalize as well and need to train on something similar to what they will be running on. If the target domain is severely limited, then all the dire warnings about needing a lot of training data may not apply, but the network specification may need to be changed.

Training data is created using tesstrain.sh as follows:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

The above command makes LSTM training data equivalent to the data used to train base Tesseract for English. For making a general-purpose LSTM-based OCR engine, it is woefully inadequate, but makes a good tutorial demo.

Now try this to make eval data for the 'Impact Condensed' font:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval
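
The two tesstrain.sh invocations differ only in --fontlist and --output_dir, so the shared flags can be kept in one place. The wrapper below is a hypothetical convenience (make_linedata and DRY_RUN are not part of Tesseract); it prints the command by default and runs it only when DRY_RUN=0.

```shell
#!/bin/sh
# make_linedata OUTPUT_DIR [FONTLIST]
# Build the tesstrain.sh command line used in these notes; print it, and
# execute it only when DRY_RUN=0.
: "${DRY_RUN:=1}"
make_linedata() (
  out_dir=$1 fontlist=${2:-}
  set -- src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
      --linedata_only --noextract_font_properties --langdata_dir ../langdata \
      --tessdata_dir ./tessdata
  # Restrict rendering to the given font(s) only when one was requested.
  if [ -n "$fontlist" ]; then
    set -- "$@" --fontlist "$fontlist"
  fi
  set -- "$@" --output_dir "$out_dir"
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
)
```

With this in place, make_linedata ~/tesstutorial/engtrain corresponds to the training command and make_linedata ~/tesstutorial/engeval "Impact Condensed" to the eval command.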

We will use that data later to demonstrate tuning.
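
lstmtraining reads its training and eval data through a list file of .lstmf paths, and tesstrain.sh is expected to leave one (eng.training_files.txt) in each output directory. For hand-made data, or to sanity-check an output directory, a list can be rebuilt with a helper like the one below. The name write_listfile is hypothetical, and this is a sketch rather than part of the Tesseract tooling.

```shell
#!/bin/sh
# write_listfile DIR OUT
# Write the paths of all DIR/*.lstmf files (in glob order) to OUT, one per
# line. Returns 1 without writing OUT if DIR contains no .lstmf files.
write_listfile() (
  dir=$1 out=$2
  set -- "$dir"/*.lstmf
  # An unmatched glob stays literal, so test that the first match exists.
  if [ ! -e "$1" ]; then
    echo "no .lstmf files in $dir" >&2
    return 1
  fi
  printf '%s\n' "$@" > "$out"
  echo "$# files listed in $out"
)
```

For example, write_listfile ~/tesstutorial/engeval /tmp/eval_files.txt lists the eval data made for the Impact Condensed font.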