Text recognition
The first complete OCR was done in July 2017 using Tesseract 4.0.0-alpha.20170703 (see results). That version supports text recognition by a neural network (LSTM), but does not use it for Fraktur, so our text recognition was based on a Tesseract 3 training model.
New LSTM models which also support Fraktur were provided in August 2017 and are very promising (see results), so the next text recognition uses LSTM.
This next recognition was started in December 2018 with Tesseract 4.0.0; see the preliminary results.
- High contrast 142-9550/0471 (ocr, best ocr), left side very light, right side light / dark.
- Translucent 005-7924/0223 (ocr, best ocr). Such scans take Tesseract a very long time, and Tesseract reports many diacritics. They give very bad OCR results, typically with long sequences of 11111.
- Skewed scans with different angles for left and right side 056-9937/0014, 072-7991/0011 (ocr, best ocr). They also give very bad OCR results, typically with long sequences of 11111.
- Blurred scans 037-9918/0108 (ocr).
- Light scans with low contrast 118-9526/0485 (ocr).
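The "long sequences of 11111" mentioned above suggest a cheap way to flag pages whose OCR output is probably unusable. The following is only a minimal sketch: the function names and the threshold `MIN_RUN` are assumptions made for illustration, not part of the workflow described here.

```python
import re

# Hypothetical threshold: the text only speaks of "long sequences of 11111",
# so five consecutive '1' characters is an assumed cut-off.
MIN_RUN = 5


def longest_one_run(text: str) -> int:
    """Return the length of the longest run of consecutive '1' characters."""
    runs = re.findall(r"1+", text)
    return max((len(run) for run in runs), default=0)


def looks_degenerate(text: str, min_run: int = MIN_RUN) -> bool:
    """Flag OCR text that contains a suspiciously long run of '1' characters."""
    return longest_one_run(text) >= min_run
```

Ordinary dates and years such as "1911" stay below the assumed threshold, so only pages with genuinely long runs would be flagged for manual review.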
Tesseract uses a dictionary. This leads to odd, unwanted results: it "detects" words like computer, Google or Internet, which definitely did not exist before 1945.
The dictionary also contains lots of confusions (B / ß, ii / ü). In addition, several characters were not trained and therefore cannot be recognized, notably the paragraph and tilde characters.
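Dictionary artifacts of this kind can be searched for in the finished OCR text. The sketch below is an assumption about how such a check might look: the blocklist contains only the three example words named above, and the function name `find_anachronisms` is hypothetical.

```python
import re

# Illustrative blocklist: only the three example words from the text.
# A real check would need a much longer, curated list.
ANACHRONISMS = {"computer", "google", "internet"}


def find_anachronisms(text, blocklist=ANACHRONISMS):
    """Return all words in text (in order) that appear in the blocklist."""
    # [^\W\d_]+ matches runs of Unicode letters, including umlauts and ß.
    words = re.findall(r"[^\W\d_]+", text)
    return [word for word in words if word.lower() in blocklist]
```

A hit does not prove a recognition error, but on pre-1945 material it marks a spot worth checking against the scan.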
- Wrong page separation (OCR does not respect page boundaries) 004-8445/0036 (ocr), see wrong text Rudolf Lantzsch in München.
- Extremely high resource needs for Tesseract OCR 030-8471/0651 (ocr, best ocr), double page with large tables takes about 2600 s.
- Wrong table decomposition (OCR does not respect table columns) 167-9449/0015 (ocr, best ocr), see wrong text Gebiet des Generalgouvernements Türkei.
- Column separation failed 022-9903/0526.
image | CER (1) | WER (1) | CER (2) | WER (2) | CER (3) |
---|---|---|---|---|---|
1819-03-30 p. 1 | 9.7 % | 25.0 % | 4.4 % | 14.3 % | |
1881-02-01 p. 3 | unusable | unusable | 3.8 % | 16.2 % | 2.2 % |
1881-02-01 p. 4 | unusable | unusable | 4.6 % | 16.7 % | |
1919-05-08 p. 2 | 28.4 % | 44.7 % | 95.2 % | 71.4 % | |
1919-05-08 p. 3 | unusable | unusable | 10.4 % | 30.8 % | |
1919-05-08 p. 4 | unusable | unusable | 10.2 % | 31.5 % | |
1921-09-22 p. 1 | unusable | 77.8 % | 18.0 % | 32.0 % | |
1944-01-03 p. 1 | 29.4 % | 34.8 % | 14.0 % | 21.0 % | 12.0 % |
(1) Tesseract 4.0.0-alpha.20170703, script/Fraktur
(2) Tesseract 4.0.0-beta.4, script/Fraktur
(3) Tesseract 4.0.0, Fraktur_5000000 + voting
CER = Character error rate
WER = Word error rate
accuracy = 100 % - (error rate)
The accuracy was measured using the ocr-evaluation-tools. Please note that the ground truth was created manually and is known to contain errors, too, so in some cases Tesseract is right and the ground truth is wrong. This implies that the real error rates of Tesseract are a little lower than the measured ones. See original data for more details.
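To make the metrics concrete, here is a minimal sketch of how CER, WER and accuracy are commonly computed: the Levenshtein edit distance between OCR output and ground truth, divided by the length of the ground truth. This is an assumption about the usual definition, not a description of what the ocr-evaluation-tools do internally (they may normalize whitespace or characters differently). All function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the ground truth length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def accuracy(error_rate):
    """accuracy = 100 % - (error rate), as defined in the legend."""
    return 100.0 - error_rate
```

For example, `cer("Straße", "Strafe")` counts one substitution over six characters, about 16.7 %. Because insertions are counted as errors too, an error rate defined this way can exceed 100 % for very noisy output.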