Stefan Weil edited this page Sep 12, 2024 · 38 revisions

Text recognition (OCR)

Technology

The first complete OCR was done in July 2017 using Tesseract 4.0.0-alpha.20170703 (see results). That version supported text recognition by a neural network (LSTM), but did not use it for Fraktur, so our text recognition was based on a Tesseract 3 training model.

New LSTM models which also support Fraktur were provided in August 2017 and are very promising (see results), so the next text recognition will use LSTM.

This next recognition was started in December 2018 with Tesseract 4.0.0, see the preliminary results.

In DFG-funded projects, improved text recognition models were trained for Tesseract and kraken. These models were used for the latest experimental and production OCR runs.

Text recognition with OCR-D

OCR-D technology was tested with one of the very latest models from UB Mannheim, german_print_20.traineddata, which was created in January 2024.

Minimal OCR-D workflow

# Create a local workspace with all best resolution images for a newspaper issue (here: 1900-01-02).
ocrd workspace clone -a -q DEFAULT https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19000102/DeutReunP_856399094_19000102.xml

# Run segmentation and OCR and produce ALTO files which are required for the presentation.
time ocrd-tesserocr-recognize -I DEFAULT -O PAGE_GERMAN_PRINT -P segmentation_level region -P textequiv_level word -P find_tables true -P model german_print

# Convert the PAGE XML output from the previous step into ALTO XML files.
ocrd-fileformat-transform -I PAGE_GERMAN_PRINT -O ALTO_GERMAN_PRINT -P from-to "page alto"

While the initial step of getting the page images takes only a few seconds, the text recognition takes much longer, here about 27 minutes for 52 pages:

1881.76user 894.55system 27:15.93elapsed 169%CPU (0avgtext+0avgdata 3936800maxresident)k
984inputs+238656outputs (81major+12623008minor)pagefaults 0swaps
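As a back-of-the-envelope sketch (assuming the 52-page figure and the elapsed time reported above), this works out to roughly half a minute of wall-clock time per page:

```python
# Rough per-page throughput for the recognition step above.
# 27:15.93 elapsed wall-clock time, 52 pages (values from the run log).
elapsed_s = 27 * 60 + 15.93
pages = 52
per_page = elapsed_s / pages
print(f"{per_page:.1f} s per page")  # → 31.5 s per page
```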

The final step failed with an error, and no ALTO XML files were produced.

Known OCR and scan problems

OCR problems with Tesseract

Wrong post-OCR correction

Tesseract uses a dictionary. This leads to odd, unwanted results, because it "detects" words like computer, Google or Internet, which definitely did not exist before 1945.

The dictionary also contains many confusions (B / ß, ii / ü). In addition, several characters were not trained and therefore cannot be recognized, notably the paragraph and tilde characters.
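Confusions like B / ß and ii / ü can in principle be repaired after OCR by testing the substitution against a wordlist. A minimal sketch (the wordlist, the confusion pairs and the `correct` function are purely illustrative; this is not how Tesseract or our pipeline handles it):

```python
# Hypothetical post-OCR cleanup of two dictionary confusions (B / ß, ii / ü).
# KNOWN_WORDS stands in for a real German lexicon.
KNOWN_WORDS = {"Straße", "Tür", "über"}
CONFUSIONS = [("B", "ß"), ("ii", "ü")]

def correct(word: str) -> str:
    """Try each confusion pair; keep a replacement only if it yields a known word."""
    if word in KNOWN_WORDS:
        return word
    for wrong, right in CONFUSIONS:
        candidate = word.replace(wrong, right)
        if candidate in KNOWN_WORDS:
            return candidate
    return word

print(correct("StraBe"))  # → Straße
print(correct("Tiir"))    # → Tür
```

Only accepting a substitution when it produces a known word keeps the correction from introducing new errors in words that were already right.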

OCR of double pages

  • Wrong page separation (OCR does not respect page boundaries) 004-8445/0036 (ocr), see wrong text Rudolf Lantzsch in München.

OCR of tables

  • Extremely high resource needs for Tesseract OCR 030-8471/0651 (ocr, best ocr), double page with large tables takes about 2600 s.

  • Wrong table decomposition (OCR does not respect table columns) 167-9449/0015 (ocr, best ocr), see wrong text Gebiet des Generalgouvernements Türkei.

Layout recognition

Accuracy

| image | CER (1) | WER (1) | CER (2) | WER (2) | CER (3) |
|-------|---------|---------|---------|---------|---------|
| 1819-03-30 p. 1 | 9.7 % | 25.0 % | 4.4 % | 14.3 % | |
| 1881-02-01 p. 3 | unusable | unusable | 3.8 % | 16.2 % | 2.2 % |
| 1881-02-01 p. 4 | unusable | unusable | 4.6 % | 16.7 % | |
| 1919-05-08 p. 2 | 28.4 % | 44.7 % | 95.2 % | 71.4 % | |
| 1919-05-08 p. 3 | unusable | unusable | 10.4 % | 30.8 % | |
| 1919-05-08 p. 4 | unusable | unusable | 10.2 % | 31.5 % | |
| 1921-09-22 p. 1 | unusable | 77.8 % | 18.0 % | 32.0 % | |
| 1944-01-03 p. 1 | 29.4 % | 34.8 % | 14.0 % | 21.0 % | 12.0 % |

(1) Tesseract 4.0.0-alpha.20170703, script/Fraktur
(2) Tesseract 4.0.0-beta.4, script/Fraktur
(3) Tesseract 4.0.0, Fraktur_5000000 + voting

CER = Character error rate
WER = Word error rate
accuracy = 100 % - (error rate)
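The CER values in the table are, in essence, edit-distance ratios: the Levenshtein distance between OCR output and ground truth, divided by the ground-truth length (WER is the same idea applied to word tokens). A minimal sketch, not the actual ocr-evaluation-tools implementation, with invented sample strings:

```python
# Character error rate as Levenshtein distance / ground-truth length.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ocr: str, truth: str) -> float:
    return levenshtein(ocr, truth) / len(truth)

# Two typical Fraktur confusions: long s (ſ) read as f, u read as n.
truth = "Geſchichte der Zeitung"
ocr = "Gefchichte der Zeitnng"
print(f"CER = {cer(ocr, truth):.1%}")  # → CER = 9.1%
```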

The accuracy was measured using the ocr-evaluation-tools. Please note that the ground truth was created manually and is known to contain errors, too, so in some cases Tesseract is right and the ground truth is wrong. This implies that the real error rates of Tesseract are slightly lower. See the original data for more details.
