German Konzilsprotokolle
The question of whether Tesseract works for handwritten text recognition has been asked multiple times. This page documents an experiment that might help to answer it.
We used a data set which was created in the context of the READ project by Dirk Alvermann (Universitätsarchiv Greifswald) and has been published via Zenodo:
Tobias Grüning, Gundram Leifert, Johannes Michael, Tobias Strauß, Max Weidemann, Roger Labahn. (2016). read_dataset_german_konzilsprotokolle [Data set]. Zenodo. http://doi.org/10.5281/zenodo.215383
It contains 8 770 transcribed text lines of handwritten historical documents from the late 18th century. They are represented as image-PAGE-XML pairs.
Download and extraction result in the following directory structure:
── german_konzilsprotokolle
    ├── data
    │   └── Greifswald_Alvermann
    │       ├── Copy_of_1794-95
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_1795-96
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_1796-97
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_AA_1794-95
    │       │   ├── page
    │       │   └── tif
    └── lists
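Before moving any files, it may be worth confirming that every image really has a PAGE XML partner. The following is a minimal sketch, not part of the data set or of OCR-D: `check_pairs` is a hypothetical helper, and it assumes that the TIFF files sit directly in `tif/` and that each PAGE file in `page/` shares the image's base name.

```shell
# Sketch: report TIFF images that lack a PAGE XML file of the same base name.
# check_pairs is a hypothetical helper; the tif/ and page/ layout is assumed
# from the directory tree of the extracted data set.
check_pairs() {
    local dir=$1 missing=0 img base
    for img in "$dir"/tif/*.tif; do
        [ -e "$img" ] || continue              # glob matched nothing
        base=$(basename "$img" .tif)
        if [ ! -f "$dir/page/$base.xml" ]; then
            echo "missing PAGE XML for $img"
            missing=$((missing + 1))
        fi
    done
    echo "$missing image(s) without PAGE XML in $dir"
    [ "$missing" -eq 0 ]                       # non-zero exit status on gaps
}
```

It could be called once per directory, e.g. `check_pairs /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1794-95`.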
Since tesstrain expects text-image pairs at the line level, the first step is to extract them from the page images using the coordinates given in the PAGE XML. Luckily, ocrd_segment offers an easy-to-use processor for this task. Entering the OCR-D ecosystem also gives us access to various image preprocessing operations such as (superior) binarization and denoising.
We assume a working installation of OCR-D's core module and the OCR-D modules ocrd_cis, ocrd_olena and ocrd_segment. Navigate to a directory of your choice and run
ocrd workspace init .
Images and XML files may be added to the workspace via
mkdir IMG
mkdir PAGE
for i in `find /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1794-95 -name "*.tif"`; do base=`basename $i .tif`; echo $base; mv $i IMG/1794-95_${base}.tif; mv /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1794-95/page/${base}.xml PAGE/1794-95_${base}.xml; done
for i in `find IMG -name "*.tif"`; do base=`basename $i .tif`; ocrd workspace add $i -G IMG -i ${base}_img -g $base -m 'image/tiff'; done
for i in `find PAGE -name "*.xml"`; do base=`basename $i .xml`; ocrd workspace add $i -G PAGE -i ${base}_page -g $base -m 'application/vnd.prima.page+xml'; done
For technical reasons, the attribute imageFilename of the element Page in the PAGE XML files has to be adjusted to match the renamed images. E.g.,
for i in `find PAGE -name "1794-95_*.xml"`; do echo $i; sed -i 's/imageFilename="/imageFilename="IMG\/1794-95_/' $i; done
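A quick way to see whether the rewrite worked is to check that each PAGE file now references an image that exists. This is a hedged sketch only: `check_image_refs` is a hypothetical helper, and it assumes that `imageFilename` is given relative to the workspace root, as the `sed` expression above writes it.

```shell
# Sketch: verify that the imageFilename attribute of every PAGE file
# points at an existing file. check_image_refs is a hypothetical helper.
check_image_refs() {
    local ws=${1:-.} bad=0 xml ref
    for xml in "$ws"/PAGE/*.xml; do
        [ -e "$xml" ] || continue              # glob matched nothing
        # Pull out the attribute value; only the first match is used.
        ref=$(sed -n 's/.*imageFilename="\([^"]*\)".*/\1/p' "$xml" | head -n 1)
        if [ -z "$ref" ] || [ ! -f "$ws/$ref" ]; then
            echo "$xml: imageFilename '$ref' not found"
            bad=$((bad + 1))
        fi
    done
    echo "$bad PAGE file(s) with a dangling imageFilename"
    [ "$bad" -eq 0 ]
}
```

Run it from the workspace root (the directory containing mets.xml), or pass that directory as the first argument.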
Repeat these steps for the other Copy_of_ directories as well.
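The repetition can be wrapped in a small function. The sketch below is one possible way to do it, not the procedure used for the experiment: `stage_copy` is a hypothetical helper that derives the file name prefix from the directory name (`Copy_of_AA_1794-95` becomes `AA_1794-95`), moves each image-XML pair into `IMG/` and `PAGE/`, and rewrites `imageFilename`. It assumes GNU `sed` and TIFF files located directly in `tif/`.

```shell
# Sketch: stage one Copy_of_* directory into IMG/ and PAGE/ of the
# current workspace. stage_copy is a hypothetical helper.
stage_copy() {
    local src=${1%/} prefix img base
    prefix=$(basename "$src")
    prefix=${prefix#Copy_of_}                  # Copy_of_1794-95 -> 1794-95
    mkdir -p IMG PAGE
    for img in "$src"/tif/*.tif; do
        [ -e "$img" ] || continue              # glob matched nothing
        base=$(basename "$img" .tif)
        if [ ! -f "$src/page/$base.xml" ]; then
            echo "skipping $img: no PAGE XML" >&2
            continue
        fi
        mv "$img" "IMG/${prefix}_${base}.tif"
        mv "$src/page/$base.xml" "PAGE/${prefix}_${base}.xml"
        # Point imageFilename at the renamed image (GNU sed in-place edit).
        sed -i "s|imageFilename=\"|imageFilename=\"IMG/${prefix}_|" \
            "PAGE/${prefix}_${base}.xml"
    done
}
```

Called from the workspace root once per directory, e.g. `stage_copy /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1795-96`, it only prepares the files. The `ocrd workspace add` loops shown earlier still have to be run afterwards.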
As an initial setup, we choose binarization following Wolf et al. (2002) and denoising (a.k.a. despeckling) as provided by ocropy. Both operations can be conveniently applied to the existing line annotations using the corresponding OCR-D interfaces:
ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'
This results in clean line images. They can be extracted together with the corresponding ground truth (GT) using
ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml
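Since the data set is stated to contain 8 770 transcribed lines, a rough count of the extracted pairs can serve as a sanity check. The sketch below rests on an assumption about the output naming: that transcriptions are written as `*.gt.txt` files (the convention tesstrain reads) next to PNG images sharing the same name stem, possibly with an extra infix such as `.bin`. `count_line_pairs` is a hypothetical helper; adjust the patterns to whatever your LINES directory actually contains.

```shell
# Sketch: count extracted transcriptions and those lacking a line image.
# count_line_pairs is a hypothetical helper; the *.gt.txt / *.png naming
# is an assumption about the extractor's output.
count_line_pairs() {
    local dir=$1 total=0 orphans=0 gt stem img found
    for gt in "$dir"/*.gt.txt; do
        [ -e "$gt" ] || continue               # glob matched nothing
        total=$((total + 1))
        stem=${gt%.gt.txt}
        found=no
        # Accept stem.png as well as stem.<infix>.png
        for img in "$stem".png "$stem".*.png; do
            [ -e "$img" ] && found=yes
        done
        [ "$found" = yes ] || orphans=$((orphans + 1))
    done
    echo "transcriptions: $total, without image: $orphans"
}
```

For example, `count_line_pairs LINES` prints one summary line for the LINES file group directory.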