Stefan Weil edited this page Apr 18, 2020 · 13 revisions

This is work in progress, so please be patient.

Introduction

The British Library provides free transcriptions of Arabic handwritten text and has already used them to train models with Transkribus.

These transcriptions can also be used to train Tesseract.

Reference

Data preparation

Create a new directory and run the following commands to prepare the data for the training process.

mkdir -p ~/ArabicHandwriting
cd ~/ArabicHandwriting

# Get the data.
curl -L "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=e03280ef-5a75-4193-a8b5-1265f295e5cf" >RASM2019_part_1.zip
curl -L "https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=907b2e2a-3f23-49b8-8eef-f073c8bb97ab" >RASM2019_part_2.zip

# Extract the data. Use 7za instead of unzip because there is an error in RASM2019_part_2.zip.
7za x RASM2019_part_1.zip
7za x RASM2019_part_2.zip
mkdir -p IMG PAGE
mv *.tif IMG
mv *.xml PAGE

# Remove spaces in filenames (workaround because currently not fully supported by OCR-D).
# Each command below removes only one space per run, which is why each is run twice.
for i in IMG/* PAGE/*; do mv -v "$i" "${i/ /}"; done
for i in IMG/* PAGE/*; do mv -v "$i" "${i/ /}"; done
perl -pi -e 's/(imageFilename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(imageFilename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(filename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(filename=.*) (.*tif)/$1$2/' PAGE/*

# Fix path for images for further processing.
perl -pi -e 's/imageFilename="/imageFilename="IMG\//' PAGE/*

# Remove alternative image filenames which are not available from PAGE files.
perl -pi -e 's/.*AlternativeImage.*//' PAGE/*

# Create OCR-D workspace and add images and PAGE files.
ocrd workspace init
for i in IMG/*; do base=$(basename "$i" .tif); ocrd workspace add "$i" -G IMG -i "${base}_img" -g "$base" -m image/tiff; done
for i in PAGE/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G PAGE -i "${base}_page" -g "$base" -m application/vnd.prima.page+xml; done

# Binarize and denoise images.
ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'

# Extract the line images.
ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml

# Remove empty texts (files contain only a line feed) which cannot be used for training.
rm -v $(find LINES -size 1c)
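
Besides empty texts, transcriptions that contain the Unicode replacement character U+FFFD (UTF-8 bytes ef bf bd) are reported as not encodable during training, as the log excerpt under Open Questions shows. The following is a minimal sketch for listing such files before training. It assumes that the transcriptions are the *.gt.txt files below the given directory and that a grep with `-r` and `--include` (for example GNU grep) is available; the function name is only illustrative.

```shell
# List ground truth texts that contain U+FFFD (UTF-8 bytes ef bf bd).
# Assumption: transcriptions are the *.gt.txt files below the given directory.
find_replacement_chars() {
    dir=${1:-LINES}
    # printf with octal escapes yields the three bytes of U+FFFD.
    grep -rlF --include='*.gt.txt' "$(printf '\357\277\275')" "$dir"
}
```

Calling `find_replacement_chars LINES` prints one path per affected file. Whether such lines should be corrected or removed is a separate decision.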

Training

Training starts from the existing Tesseract model script/Arabic.traineddata.

# Create box files needed for Tesseract training.
for t in ~/ArabicHandwriting/GT/LINES/*.txt; do test -f ${t/gt.txt/box} || (echo $t && ./generate_wordstr_box.py -i ${t/gt.txt/bin.png} -t $t -r >${t/gt.txt/box}); done 

nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES PSM=13 START_MODEL=Arabic TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata EPOCHS=20 lists >>data/ArabicHandwritingOCRD.log
nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES PSM=13 START_MODEL=Arabic TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata EPOCHS=20 training >>data/ArabicHandwritingOCRD.log
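
Both make calls append to data/ArabicHandwritingOCRD.log, so progress can be followed from that file. The sketch below extracts the key figures from the checkpoint lines. It assumes that those lines have the same layout as the "At iteration ..." lines quoted under Open Questions and that the middle iteration counter is the one of interest; the function name is hypothetical.

```shell
# Print "iteration char_train skip_ratio" for every checkpoint line of a log.
# Assumption: checkpoint lines look like
#   At iteration A/B/C, ..., char train=X%, ..., skip ratio=Y%, ...
# and the middle counter (B) is reported.
training_progress() {
    sed -n 's|^At iteration [0-9]*/\([0-9]*\)/[0-9]*,.*char train=\([0-9.]*\)%.*skip ratio=\([0-9.]*\)%.*|\1 \2 \3|p' "$1"
}
```

For example, `training_progress data/ArabicHandwritingOCRD.log | tail -n 5` would show the five most recent checkpoints.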

The ground truth lines are split into 2351 lines for training and 262 lines for validation.

After one epoch (2351 iterations), the character error rate (CER) is about 46 %. With sufficient training (200 epochs, about 32 hours), the CER falls below 5 %.
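
The figures above can be checked with a little arithmetic. This is only a sketch that uses the numbers given in the text (2351 training lines, 262 validation lines, 200 epochs, about 32 hours); the function name is illustrative.

```shell
# Derive a few figures from the numbers given in the text.
training_figures() {
    train=2351; validation=262; epochs=200; hours=32
    echo "total lines: $((train + validation))"
    echo "iterations for $epochs epochs: $((train * epochs))"
    awk -v t="$train" -v v="$validation" \
        'BEGIN { printf "training share: %.1f %%\n", 100 * t / (t + v) }'
    awk -v h="$hours" -v n="$epochs" \
        'BEGIN { printf "minutes per epoch: %.1f\n", 60 * h / n }'
}
training_figures
```

This gives 2613 lines in total, a training share of about 90 %, 470200 iterations for 200 epochs and roughly 9.6 minutes per epoch. The iteration count is in the same range as the counters in the log excerpt under Open Questions.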

Results

The best and fast Tesseract models trained using the steps above are available from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ArabicHandwritingOCRD/.

Open Questions

The CER achieved on other Arabic handwriting still has to be measured.

The Tesseract training reports many encoding and other problems, with a rather large skip ratio of more than 8 %. Here is a typical example:

At iteration 139783/462420/506586, Mean rms=0.254%, delta=0.043%, char train=4.88%, word train=21.6%, skip ratio=8.7%,  wrote checkpoint.

Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed! Failure bytes: ef bf bd
Can't encode transcription: 'امهرد نوسمخو ةيام نكي ىش ىلع ةموسقم امهرد رشع ةسمخ ىف مث نيمهردو ىش�' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Encoding of string failed! Failure bytes: ef bf bd 20 d8 a8 d9 88 d8 aa d9 83 d9 85 d9 84 d8 a7 20 d8 a7 d8 b0 d9 87
Can't encode transcription: 'اٰمو اهعيابط ناٰيب يف ثلاثلا بابلا � بوتكملا اذه' in language ''
At iteration 139783/462520/506696, Mean rms=0.252%, delta=0.04%, char train=4.764%, word train=21.182%, skip ratio=9.5%,  wrote checkpoint.
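
To see which of these problems dominates, the messages can be counted per type. The following sketch counts the message texts that appear in the excerpt above; it assumes the log is a plain text file such as data/ArabicHandwritingOCRD.log, and the function name is hypothetical.

```shell
# Count how often each failure message from the excerpt occurs in a log.
summarize_failures() {
    log=$1
    for msg in 'Compute CTC targets failed!' \
               'Encoding of string failed!' \
               'Image too small to scale!!' \
               'Image not trainable'; do
        # grep -c prints the number of matching lines (0 if there are none).
        printf '%d\t%s\n' "$(grep -cF -- "$msg" "$log")" "$msg"
    done
}
```

Running `summarize_failures data/ArabicHandwritingOCRD.log` prints one line per message type with its count in the first column.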
