Stefan Weil edited this page Oct 14, 2019 · 38 revisions

Text recognition (OCR)

Technology

The first complete OCR was done in July 2017 using Tesseract 4.0.0-alpha.20170703 (see results). That version supports text recognition by a neural network (LSTM), but does not use it for Fraktur, so our text recognition was based on a Tesseract 3 training model.

New LSTM models which also support Fraktur were provided in August 2017 and are very promising (see results), so the next text recognition will use LSTM.

This next recognition run was started in December 2018 with Tesseract 4.0.0 (see the preliminary results).

Known OCR and scan problems

OCR problems with Tesseract

Wrong post OCR

Tesseract uses a dictionary. This leads to odd, unwanted results, because it "detects" words like computer, Google or Internet, which definitely did not exist before 1945.

The dictionary also contains many confusions (B / ß, ii / ü). In addition, several characters were not trained and therefore cannot be recognized, notably the paragraph and tilde characters.
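One way to test how much the dictionary contributes to such errors is to run Tesseract with its word lists switched off. The sketch below only builds the command line. It assumes the Tesseract configuration variables `load_system_dawg` and `load_freq_dawg`, passed with `-c`, and the `script/Fraktur` model named in the accuracy table; the function name and the file names are hypothetical, and whether disabling the dictionary improves results on this collection has not been measured here.

```python
def build_tesseract_command(image, output_base, lang="script/Fraktur",
                            use_dictionary=False):
    """Return the argument list for a Tesseract run (hypothetical helper).

    With use_dictionary=False the word-list DAWGs are not loaded, so the
    recognizer cannot pull results towards modern dictionary words.
    """
    command = ["tesseract", image, output_base, "-l", lang]
    if not use_dictionary:
        # Assumption: these two variables control loading of the main
        # and the frequent-word dictionaries.
        command += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return command


# Hypothetical file names; pass the list to subprocess.run(..., check=True)
# to actually start the recognition.
example_command = build_tesseract_command("page.png", "page")
```

Comparing the output of both variants against the ground truth would show whether the dictionary helps or hurts for historical text.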

OCR of double pages

  • Wrong page separation (OCR does not respect page boundaries): 004-8445/0036 (ocr), see the wrong text "Rudolf Lantzsch in München".

OCR of tables

  • Extremely high resource needs for Tesseract OCR: 030-8471/0651 (ocr, best ocr), a double page with large tables takes about 2600 s.

  • Wrong table decomposition (OCR does not respect table columns): 167-9449/0015 (ocr, best ocr), see the wrong text "Gebiet des Generalgouvernements Türkei".

Layout recognition

Accuracy

| image           | CER (1)  | WER (1)  | CER (2) | WER (2) | CER (3) |
|-----------------|----------|----------|---------|---------|---------|
| 1819-03-30 p. 1 | 9.7 %    | 25.0 %   | 4.4 %   | 14.3 %  |         |
| 1881-02-01 p. 3 | unusable | unusable | 3.8 %   | 16.2 %  | 2.2 %   |
| 1881-02-01 p. 4 | unusable | unusable | 4.6 %   | 16.7 %  |         |
| 1919-05-08 p. 2 | 28.4 %   | 44.7 %   | 95.2 %  | 71.4 %  |         |
| 1919-05-08 p. 3 | unusable | unusable | 10.4 %  | 30.8 %  |         |
| 1919-05-08 p. 4 | unusable | unusable | 10.2 %  | 31.5 %  |         |
| 1921-09-22 p. 1 | unusable | 77.8 %   | 18.0 %  | 32.0 %  |         |
| 1944-01-03 p. 1 | 29.4 %   | 34.8 %   | 14.0 %  | 21.0 %  | 12.0 %  |

(1) Tesseract 4.0.0-alpha.20170703, script/Fraktur
(2) Tesseract 4.0.0-beta.4, script/Fraktur
(3) Tesseract 4.0.0, Fraktur_5000000 + voting

CER = Character error rate
WER = Word error rate
accuracy = 100 % - (error rate)
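
As an illustration of these definitions, the following minimal sketch computes both error rates as the edit (Levenshtein) distance between ground truth and OCR text, divided by the length of the ground truth. It is not the code of the ocr-evaluation-tools, which may normalize or tokenize the text differently; the function names and the sample strings are assumptions chosen for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution
            ))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Error rate in percent, relative to the length of the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def cer(ground_truth, ocr_text):
    """Character error rate in percent."""
    return error_rate(ground_truth, ocr_text)


def wer(ground_truth, ocr_text):
    """Word error rate in percent, using whitespace tokenization."""
    return error_rate(ground_truth.split(), ocr_text.split())


# Made-up OCR result with two wrong characters in two different words.
ground_truth = "Rudolf Lantzsch in München"
ocr_text = "Rudolf Lantzfch in Munchen"
print(f"CER = {cer(ground_truth, ocr_text):.1f} %")
print(f"WER = {wer(ground_truth, ocr_text):.1f} %")
print(f"character accuracy = {100.0 - cer(ground_truth, ocr_text):.1f} %")
```

Because insertions count as errors, an error rate computed this way can exceed 100 % when the OCR output is much longer than the ground truth.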

The accuracy was measured using the ocr-evaluation-tools. Please note that the ground truth was created manually and is known to contain errors, too, so in some cases Tesseract is right and the ground truth is wrong. This implies that the real error rates of Tesseract are slightly lower than the values shown. See the original data for more details.
