Text recognition

Text recognition (OCR)

Technology

The first complete OCR was done in July 2017 using Tesseract 4.0.0-alpha.20170703 (see results). That version supports text recognition by a neural network (LSTM), but does not use it for Fraktur, so our text recognition was based on a Tesseract 3 training model.

New LSTM models which also support Fraktur were provided in August 2017 and are very promising (see results), so the next text recognition will use LSTM.

This next recognition run was started in December 2018 with Tesseract 4.0.0; see the preliminary results.

Known OCR and scan problems

High contrast 142-9550/0471 (ocr, best ocr): left side very light, right side light / dark.
Translucent 005-7924/0223 (ocr, best ocr). Such scans take a very long time with Tesseract, and Tesseract reports many diacritics. They produce very bad OCR results, typically with long sequences of 11111.
Skewed scans with different angles for the left and right sides 056-9937/0014, 072-7991/0011 (ocr, best ocr). They produce very bad OCR results, typically with long sequences of 11111.
Blurred scans 037-9918/0108 (ocr).
Light scans with low contrast 118-9526/0485 (ocr).
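
The "long sequences of 11111" symptom described above can be checked for automatically. The following is only a minimal sketch, not part of the actual workflow: the function name and the threshold of five consecutive digits are assumptions chosen for illustration.

```python
import re

# Assumption: a run of at least five consecutive "1" characters marks a failed page.
# The threshold is illustrative and may need tuning against real OCR output.
ONES_RUN = re.compile(r"1{5,}")


def looks_like_failed_ocr(text: str, min_runs: int = 1) -> bool:
    """Return True if the text contains at least `min_runs` long runs of '1'."""
    return len(ONES_RUN.findall(text)) >= min_runs
```

Pages flagged this way could be listed for a manual check of the scan (translucent or skewed) before they are recognized again.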

OCR problems with Tesseract

Wrong post OCR

Tesseract uses a dictionary. This leads to odd, unwanted results, because it "detects" words like computer, Google or Internet, which definitely did not exist before 1945.

The dictionary also contains many confusions (B / ß, ii / ü). In addition, several characters were not trained and therefore cannot be recognized, notably the paragraph and tilde characters.
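
Anachronistic dictionary hits of this kind can be searched for in the OCR output. The sketch below is a hypothetical helper, not something Tesseract provides: the word list only contains the three examples named above and would have to be extended by hand.

```python
import re

# Words named in the text above; any further entries would be assumptions.
ANACHRONISMS = frozenset({"computer", "google", "internet"})


def find_anachronisms(text: str, words=ANACHRONISMS) -> list[str]:
    """Return all tokens of `text` whose lowercase form is in `words`."""
    # [^\W\d_]+ matches runs of letters, including umlauts and ß.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [token for token in tokens if token.lower() in words]
```

For example, `find_anachronisms("Der Computer in München")` returns `["Computer"]`. Such hits only mark suspicious places; the correct historical word still has to be determined from the scan.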

OCR of double pages

Wrong page separation (OCR does not respect page boundaries) 004-8445/0036 (ocr), see wrong text Rudolf Lantzsch in München.

OCR of tables

Extremely high resource needs for Tesseract OCR 030-8471/0651 (ocr, best ocr), double page with large tables takes about 2600 s.
Wrong table decomposition (OCR does not respect table columns) 167-9449/0015 (ocr, best ocr), see wrong text Gebiet des Generalgouvernements Türkei.

Layout recognition

Column separation failed 022-9903/0526.

Accuracy

image	CER (1)	WER (1)	CER (2)	WER (2)	CER (3)
1819-03-30 p. 1	9.7 %	25.0 %	4.4 %	14.3 %
1881-02-01 p. 3	unusable	unusable	3.8 %	16.2 %	2.2 %
1881-02-01 p. 4	unusable	unusable	4.6 %	16.7 %
1919-05-08 p. 2	28.4 %	44.7 %	95.2 %	71.4 %
1919-05-08 p. 3	unusable	unusable	10.4 %	30.8 %
1919-05-08 p. 4	unusable	unusable	10.2 %	31.5 %
1921-09-22 p. 1	unusable	77.8 %	18.0 %	32.0 %
1944-01-03 p. 1	29.4 %	34.8 %	14.0 %	21.0 %	12.0 %

(1) Tesseract 4.0.0-alpha.20170703, script/Fraktur
(2) Tesseract 4.0.0-beta.4, script/Fraktur
(3) Tesseract 4.0.0, Fraktur_5000000 + voting

CER = Character error rate
WER = Word error rate
accuracy = 100 % - (error rate)

The accuracy was measured using the ocr-evaluation-tools. Please note that the ground truth was created manually and is known to contain errors, too, so in some cases Tesseract is right and the ground truth is wrong. This implies that the real error rates of Tesseract are slightly lower than reported. See original data for more details.
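
To illustrate the CER and WER definitions, here is a minimal sketch based on the Levenshtein edit distance. It is not the implementation used by the ocr-evaluation-tools, which may normalize text and count errors differently; all function names are chosen for this example only.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp) -> float:
    """Edit distance relative to the reference length, in percent."""
    if len(ref) == 0:
        raise ValueError("reference (ground truth) must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ground_truth: str, ocr_text: str) -> float:
    """Character error rate in percent."""
    return error_rate(ground_truth, ocr_text)


def wer(ground_truth: str, ocr_text: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    return error_rate(ground_truth.split(), ocr_text.split())
```

With the ground truth "das alte Haus" and the OCR result "das alte Hans", one of three words is wrong, so the WER is about 33.3 %. Because insertions count as errors too, a rate computed this way can exceed 100 % for very poor results.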
