
This is work in progress, so please be patient.

Introduction

The British Library provides free transcriptions of Arabic handwritten text. They have already run training with Transkribus.

Those transcriptions can also be used to train Tesseract.

Reference

Data preparation

Create a new directory and run the following commands to prepare the data for the training process.

mkdir -p ~/ArabicHandwriting
cd ~/ArabicHandwriting

# Get the data.
curl -L https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=e03280ef-5a75-4193-a8b5-1265f295e5cf >RASM2019_part_1.zip
curl -L https://bl.oar.bl.uk/fail_uploads/download_file?fileset_id=907b2e2a-3f23-49b8-8eef-f073c8bb97ab >RASM2019_part_2.zip
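
# Optional check (not part of the original workflow): list the archive sizes
# to confirm that both downloads completed.
ls -lh RASM2019_part_1.zip RASM2019_part_2.zip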

# Extract the data. Use 7za instead of unzip because there is an error in RASM2019_part_2.zip.
7za x RASM2019_part_1.zip
7za x RASM2019_part_2.zip
mkdir -p IMG PAGE
mv *.tif IMG
mv *.xml PAGE
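
# Optional check: every TIFF should have a matching PAGE file, so the two
# counts should be equal.
ls IMG | wc -l
ls PAGE | wc -l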

# Remove spaces in filenames (workaround because filenames with spaces are
# currently not fully supported by OCR-D). Each command is run twice because
# the substitution only removes the first space, so names containing two
# spaces need a second pass.
for i in IMG/* PAGE/*; do mv -v "$i" "${i/ /}"; done
for i in IMG/* PAGE/*; do mv -v "$i" "${i/ /}"; done
perl -pi -e 's/(imageFilename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(imageFilename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(filename=.*) (.*tif)/$1$2/' PAGE/*
perl -pi -e 's/(filename=.*) (.*tif)/$1$2/' PAGE/*
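
# Optional check: neither the filenames nor the references inside the PAGE
# files should still contain spaces; both commands should print nothing.
find IMG PAGE -name '* *'
grep -l 'imageFilename="[^"]* ' PAGE/*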

# Fix path for images for further processing.
perl -pi -e 's/imageFilename="/imageFilename="IMG\//' PAGE/*

# Remove AlternativeImage references from the PAGE files, as the referenced image files are not available.
perl -pi -e 's/.*AlternativeImage.*//' PAGE/*
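
# Optional check: no AlternativeImage references should remain (prints nothing).
grep -l AlternativeImage PAGE/*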

# Create OCR-D workspace and add images and PAGE files.
ocrd workspace init
for i in IMG/*; do base=$(basename "$i" .tif); ocrd workspace add "$i" -G IMG -i "${base}_img" -g "$base" -m image/tiff; done
for i in PAGE/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G PAGE -i "${base}_page" -g "$base" -m application/vnd.prima.page+xml; done
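
# Optional: validate the new workspace. The exact CLI may vary between OCR-D
# versions; here the METS path is passed directly.
ocrd workspace validate mets.xml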

# Binarize and denoise images.
ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'
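
# Optional: list the file groups created so far (should include IMG, PAGE,
# WOLF, WOLF-IMG, DENOISE and DENOISE-IMG).
ocrd workspace list-group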

# Extract the line images.
ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml

# Remove empty text files (they contain only a line feed), as they cannot be used for training.
rm -v $(find LINES -size 1c)
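
# Count the remaining line image / text pairs that will be used for training
# (assuming the extracted texts use the .gt.txt suffix, as in the training
# step below).
find LINES -name '*.gt.txt' | wc -l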

Training

Here training is started from the existing Tesseract model script/Arabic.traineddata, using the tesstrain Makefile (targets lists and training).

# Create box files needed for Tesseract training.
for t in ~/ArabicHandwriting/GT/LINES/*.txt; do test -f ${t/gt.txt/box} || (echo $t && ./generate_wordstr_box.py -i ${t/gt.txt/bin.png} -t $t -r >${t/gt.txt/box}); done 
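
# Optional check: list any box files which came out empty (generation failed);
# such lines should be removed before training. Prints nothing if all is well.
find ~/ArabicHandwriting/GT/LINES -name '*.box' -size 0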

nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES PSM=13 START_MODEL=Arabic TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata EPOCHS=20 lists >>data/ArabicHandwritingOCRD.log
nohup make LANG_TYPE=RTL MODEL_NAME=ArabicHandwritingOCRD GROUND_TRUTH_DIR=/home/stweil/src/ArabicHandwriting/GT/LINES PSM=13 START_MODEL=Arabic TESSDATA=/home/stweil/src/github/OCR-D/venv-20200408/share/tessdata EPOCHS=20 training >>data/ArabicHandwritingOCRD.log
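
# Follow the training progress in the log file (e.g. from a second terminal).
tail -f data/ArabicHandwritingOCRD.log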

After one epoch, the character error rate (CER) is about 46 %. With sufficient training, the CER falls below 8 %.
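
The trained model can then be tried on an extracted line image. A minimal sketch, assuming the tesstrain Makefile wrote its final model to data/ArabicHandwritingOCRD.traineddata and that line.bin.png is one of the extracted line images:

tesseract line.bin.png stdout --tessdata-dir data -l ArabicHandwritingOCRD --psm 13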

