Stefan Weil edited this page Sep 25, 2018 · 38 revisions

Text recognition (OCR)

Technology

The first complete OCR was done in July 2017 using Tesseract 4.0.0-alpha.20170703. That version supports text recognition by a neural network (LSTM), but does not use it for Fraktur, so our text recognition was based on a Tesseract 3 training model.

New LSTM models which also support Fraktur were provided in August 2017 and are very promising, so the next text recognition will use LSTM.

Known OCR and scan problems

  • High contrast 142-9550/0471 (ocr, best ocr), left side very light, right side light / dark.

  • Translucent 005-7924/0223 (ocr, best ocr). Such scans take a very long time to process with Tesseract, and Tesseract reports many diacritics. The OCR results are very bad, typically with long sequences of 11111.

  • Skewed scans with different angles for the left and right sides 056-9937/0014, 072-7991/0011 (ocr, best ocr). The OCR results are very bad, typically with long sequences of 11111.

  • Unsharp scans 037-9918/0108 (ocr).
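The long sequences of 11111 mentioned above could serve as a cheap indicator for pages that need a second look. The following is only a minimal sketch: the threshold `MIN_RUN` and the helper names are assumptions for illustration, not part of the existing workflow.

```python
import re

# Assumption: five or more consecutive "1" characters indicate a failed
# recognition. The threshold is a guess and needs tuning on real data.
MIN_RUN = 5
RUN_OF_ONES = re.compile(r"1{%d,}" % MIN_RUN)


def suspicious_runs(text):
    """Return all runs of MIN_RUN or more consecutive '1' characters."""
    return RUN_OF_ONES.findall(text)


def looks_broken(text, max_runs=0):
    """Flag a page whose OCR text contains more than max_runs such runs."""
    return len(suspicious_runs(text)) > max_runs
```

A year such as 1911 or a short number like 1111 does not trigger the check, but legitimate long runs of 1 (for example in tables) would, so flagged pages still need manual inspection.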

OCR problems with Tesseract

Wrong post OCR

Tesseract uses a dictionary. This leads to funny unwanted results, because it "detects" words like computer, Google or Internet which definitely did not exist before 1945.

The dictionary also contains lots of confusions (B / ß, ii / ü). In addition, several characters were not trained and therefore cannot be recognized, notably the paragraph and tilde characters.
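One thing that might be worth trying is running Tesseract with its word lists disabled. The sketch below only builds the command line. `load_system_dawg` and `load_freq_dawg` are existing Tesseract configuration variables and `frk` is the Fraktur language code, but the helper name and the file names are hypothetical, and it was not tested here whether this improves the results for this collection.

```python
def tesseract_command(image, output_base, lang="frk", use_dictionary=False):
    """Build a Tesseract command line, optionally without dictionaries.

    The image and output names are supplied by the caller; nothing is run here.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if not use_dictionary:
        # Disable the main word list and the frequent-word list.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(tesseract_command("page.tif", "page")))
```

The resulting list can be passed to `subprocess.run`. Whether the two settings take effect when given on the command line may depend on the Tesseract version.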

OCR of double pages

  • Wrong page separation (OCR does not respect page boundaries) 004-8445/0036 (ocr), see wrong text Rudolf Lantzsch in München.

OCR of tables

  • Extremely high resource needs for Tesseract OCR 030-8471/0651 (ocr, best ocr), double page with large tables takes about 2600 s.

  • Wrong table decomposition (OCR does not respect table columns) 167-9449/0015 (ocr, best ocr), see wrong text Gebiet des Generalgouvernements Türkei.

Accuracy

image            CER (1)   WER (1)   CER (2)   WER (2)
1819-03-30 p. 1  9.7 %     25.0 %    4.4 %     14.3 %
1881-02-01 p. 3  unusable  unusable  3.8 %     16.2 %
1881-02-01 p. 4  unusable  unusable  4.6 %     16.7 %
1919-05-08 p. 3  28.4 %    44.7 %    95.2 %    71.4 %
1944-01-03 p. 1  29.4 %    34.8 %    14.0 %    21.0 %

(1) Tesseract 4.0.0-alpha.20170703
(2) Tesseract 4.0.0-beta.4

CER = Character error rate
WER = Word error rate
accuracy = 100 % - (error rate)

The accuracy was measured using the ocr-evaluation-tools. Please note that the ground truth was created manually and is known to contain errors, too, so in some cases Tesseract is right and the ground truth is wrong. That implies that the real error rates of Tesseract are slightly lower than reported. See original data for more details.
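To illustrate what the two error rates mean, here is a minimal sketch that computes them from the Levenshtein (edit) distance between ground truth and OCR output. This is not the implementation used by the ocr-evaluation-tools. It assumes the common definition of an error rate as edit distance divided by the length of the ground truth, with words split on whitespace, and all function names are chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance relative to the ground truth length, in percent."""
    if len(ref) == 0:
        raise ValueError("ground truth must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ground_truth, ocr_text):
    """Character error rate in percent."""
    return error_rate(ground_truth, ocr_text)


def wer(ground_truth, ocr_text):
    """Word error rate in percent (words split on whitespace)."""
    return error_rate(ground_truth.split(), ocr_text.split())
```

With this definition a single B / ß confusion in a four-word line counts as one wrong character but also as one wrong word out of four, which is why the WER is usually much higher than the CER. Because insertions are counted as errors too, a rate can exceed 100 %.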
