Text recognition
The first complete OCR was done in July 2017 using Tesseract 4.0.0-alpha.20170703 (see results). That version supports text recognition with a neural network (LSTM), but does not use it for Fraktur, so our text recognition was based on a Tesseract 3 training model.
New LSTM models which also support Fraktur were provided in August 2017 and proved very promising (see results), so the next text recognition was planned to use LSTM.
That recognition run was started in December 2018 with Tesseract 4.0.0; see the preliminary results.
In DFG-funded projects, improved text recognition models were trained for Tesseract and kraken. These models were used for the latest experimental and production OCR runs.
OCR-D technology was tested with one of the latest models from UB Mannheim, german_print_20.traineddata (2024/01).
OCR-D allows many different OCR workflows, but basically each of them has some common steps:
- Prepare a local workspace for a single newspaper issue
- Run segmentation and text recognition (and optionally additional OCR-D processors)
- Create ALTO XML from PAGE XML
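Condensed into shell commands, these three steps look roughly as follows (a sketch only; the concrete commands, parameters and timings are discussed in detail below):
# 1. Prepare a local workspace for one newspaper issue (METS URL as in the example below).
ocrd workspace clone -a -q DEFAULT <METS_URL>
# 2. Run segmentation and text recognition in a single processor call (further parameters omitted).
ocrd-tesserocr-recognize -I DEFAULT -O PAGE_GERMAN_PRINT -P model german_print
# 3. Convert the PAGE XML results to ALTO XML.
ocrd-fileformat-transform -I PAGE_GERMAN_PRINT -O ALTO_GERMAN_PRINT -P from-to "page alto"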
The first step, getting the page images, takes only a few seconds on hosts with a fast internet connection:
# Create a local workspace with all best resolution images for a newspaper issue (here: 1900-01-02).
ocrd workspace clone -a -q DEFAULT https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19000102/DeutReunP_856399094_19000102.xml
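After cloning, the workspace can be inspected with the ocrd workspace subcommands (a sketch; it assumes the cloned workspace is the current directory and a reasonably recent ocrd CLI):
# List the available file groups (should include DEFAULT).
ocrd workspace list-group
# List the downloaded page images of the DEFAULT file group.
ocrd workspace find --file-grp DEFAULT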
The OCR-D documentation recommends three different workflows.
# Run segmentation and OCR.
time ocrd-tesserocr-recognize -I DEFAULT -O PAGE_GERMAN_PRINT -P segmentation_level region -P textequiv_level word -P find_tables true -P model german_print
The text recognition takes a long time, here about 27 minutes for 52 pages. The total computation time (user plus system) is 2776 seconds or 46:16 minutes:
1881.76user 894.55system 27:15.93elapsed 169%CPU (0avgtext+0avgdata 3936800maxresident)k
984inputs+238656outputs (81major+12623008minor)pagefaults 0swaps
A second run was slightly faster (total computation time 1909 s or 31:49 min):
1618.09user 290.89system 26:17.10elapsed 121%CPU (0avgtext+0avgdata 3915736maxresident)k
0inputs+236896outputs (0major+6837265minor)pagefaults 0swaps
Initially, the workflow recommended for best results was tried:
time ocrd process \
"cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN" \
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
"skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
"skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
"cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page" \
"cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
"calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"
This workflow does not recognize any text. It looks like the segmentation step does not output the text regions that are required by the subsequent dewarping and recognition steps. The workflow therefore terminates rather early, after less than two hours:
6987.72user 1027.41system 1:53:37elapsed 117%CPU (0avgtext+0avgdata 2697356maxresident)k
26184inputs+592168outputs (21major+47729693minor)pagefaults 0swaps
To fix this workflow, the failing segmentation processor cis-ocropy-segment was replaced by eynollah-segment. Eynollah finished the segmentation after 6:27:44 hours. It creates too many threads and therefore wastes much system time on scheduling, in total 342380 s or 95:06:20 h of computation time. The available GPU was not used. The segmentation used up to 16 GiB of memory:
282805.62user 59574.09system 6:27:44elapsed 1471%CPU (0avgtext+0avgdata 16351628maxresident)k
36992inputs+70096outputs (249major+399930411minor)pagefaults 0swaps
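The replaced step corresponds to a processor call like the following (a sketch; ocrd-eynollah-segment may additionally require a model parameter, shown here only as a hypothetical placeholder):
# Replacement for the failing cis-ocropy-segment step; input comes from the deskewing step above.
# Depending on the version, a parameter such as -P models <path-to-eynollah-models> may be needed (placeholder).
time ocrd-eynollah-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG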
Based on the Eynollah results, the subsequent dewarping takes 43 min:
2567.00user 22.90system 42:55.86elapsed 100%CPU (0avgtext+0avgdata 639396maxresident)k
22520inputs+786496outputs (24major+3466810minor)pagefaults 0swaps
The final text recognition with calamari-recognize runs for 2:04:36 hours, using 14309 s or 3:58:29 h of computation time and up to 6 GiB of memory:
12910.53user 1398.25system 2:04:36elapsed 191%CPU (0avgtext+0avgdata 6134644maxresident)k
13184inputs+120776outputs (5major+10706645minor)pagefaults 0swaps
Another recommended workflow, which uses Tesseract for deskewing, segmentation and recognition, was also tried:
time ocrd process \
"cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN" \
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
"skimage-denoise -I OCR-D-CROP -O OCR-D-BIN-DENOISE -P level-of-operation page" \
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
"tesserocr-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P shrink_polygons true" \
"cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-DEWARP" \
"tesserocr-recognize -I OCR-D-SEG-DEWARP -O OCRD_SLOWER_PROCESSOR -P textequiv_level glyph -P overwrite_segments true -P model german_print"
This workflow reports errors in the last step (text recognition) after running for 3:31:50 hours. Nevertheless, it produces PAGE XML files with text recognition results. The total computation time is 18552 s or 5:09:12 h; peak memory usage is 8765644 KiB:
13658.20user 4893.78system 3:31:50elapsed 145%CPU (0avgtext+0avgdata 8765644maxresident)k
60952inputs+3116224outputs (23major+35608871minor)pagefaults 0swaps
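Despite the reported errors, the PAGE XML output can be checked for recognized text, for example like this (a sketch; it assumes the output files are stored in a directory named after the fileGrp, which is the usual OCR-D convention):
# Count the PAGE XML files that contain at least one TextEquiv element.
grep -l 'TextEquiv' OCRD_SLOWER_PROCESSOR/*.xml | wc -l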
# Convert the PAGE XML output from the previous step into ALTO XML files.
# time ocrd-fileformat-transform -I PAGE_GERMAN_PRINT -O ALTO_GERMAN_PRINT -P from-to "page alto"
time ocrd-fileformat-transform -I PAGE_GERMAN_PRINT -O ALTO_GERMAN_PRINT -P from-to "page alto" -P script-args '--no-check-border'
The final step initially failed with an error, and no ALTO XML files were produced. This could be fixed by adding the parameter -P script-args '--no-check-border'. The duration is about 1 minute for all pages, but the total computation time is surprisingly high at 15:15 min, which is caused by excessive parallel computing. A second run had a total computation time of 963 s or 16:03 min:
284.46user 678.53system 0:51.51elapsed 1869%CPU (0avgtext+0avgdata 175704maxresident)k
0inputs+95904outputs (81major+5770573minor)pagefaults 0swaps
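The generated ALTO files can be checked for well-formed XML, for example with xmllint (a sketch; it assumes libxml2's xmllint is installed and that the files are stored in the ALTO_GERMAN_PRINT directory of the workspace):
# Report any ALTO file which is not well-formed XML.
for f in ALTO_GERMAN_PRINT/*.xml; do xmllint --noout "$f" || echo "broken: $f"; done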
# Run segmentation and OCR and produce ALTO and text files.
for image in *.jpg; do tesseract "$image" "ocr/${image%.jpg}" -l german_print alto txt; done
Duration and total computation time are equal, namely 1188 seconds or 19:48 minutes:
1182.77user 4.75system 19:47.59elapsed 99%CPU (0avgtext+0avgdata 473152maxresident)k
0inputs+43776outputs (0major+1506053minor)pagefaults 0swaps
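Because this plain Tesseract run is strictly sequential (99 % CPU), it could be sped up on a multi-core host, for example with GNU parallel (a sketch; it assumes GNU parallel is installed and that there is enough RAM for several concurrent Tesseract processes):
# One Tesseract process per page image; {.} is the file name without its extension.
# OMP_THREAD_LIMIT=1 keeps each process single-threaded (OpenMP builds of Tesseract otherwise start extra threads).
mkdir -p ocr
OMP_THREAD_LIMIT=1 parallel 'tesseract {} ocr/{.} -l german_print alto txt' ::: *.jpg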
The similar processing in OCR-D required 1909 or 2776 s of total computation time, which is 161 % or 234 % of the Tesseract time. So even the fastest OCR-D processing handles only about one page in the time plain Tesseract processes two.
While OCR-D used up to 3915736 KiB of memory, Tesseract used only 473152 KiB. That means OCR-D requires more than 8 times the maximum memory of Tesseract.
Several types of scans cause problems:
- High contrast 142-9550/0471 (ocr, best ocr): left side very light, right side light / dark.
- Translucent scans 005-7924/0223 (ocr, best ocr): such scans need very much time with Tesseract, which reports many diacritics; they result in very bad OCR results, typically with long sequences of 11111.
- Skewed scans with different angles for the left and right side 056-9937/0014, 072-7991/0011 (ocr, best ocr): these also result in very bad OCR results, typically with long sequences of 11111.
- Unsharp scans 037-9918/0108 (ocr).
- Light scans with low contrast 118-9526/0485 (ocr).
Tesseract uses a dictionary. This leads to funny unwanted results, because it "detects" words like computer, Google or Internet, which definitely did not exist before 1945.
The dictionary also causes lots of confusions (B / ß, ii / ü). In addition, several characters were not trained and therefore cannot be recognized, notably the paragraph and tilde characters.
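If the anachronistic dictionary matches are a problem, Tesseract's built-in word lists can be disabled per run (a sketch; load_system_dawg and load_freq_dawg are documented Tesseract config variables, but their effect on a given model should be verified; image.jpg and out are placeholders):
# Recognize one page with the dictionary support turned off.
tesseract image.jpg out -l german_print -c load_system_dawg=0 -c load_freq_dawg=0 alto txt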
- Wrong page separation (OCR does not respect page boundaries) 004-8445/0036 (ocr); see the wrong text Rudolf Lantzsch in München.
- Extremely high resource needs for Tesseract OCR 030-8471/0651 (ocr, best ocr): a double page with large tables takes about 2600 s.
- Wrong table decomposition (OCR does not respect table columns) 167-9449/0015 (ocr, best ocr); see the wrong text Gebiet des Generalgouvernements Türkei.
- Column separation failed 022-9903/0526.
image | CER (1) | WER (1) | CER (2) | WER (2) | CER (3) |
---|---|---|---|---|---|
1819-03-30 p. 1 | 9.7 % | 25.0 % | 4.4 % | 14.3 % | |
1881-02-01 p. 3 | unusable | unusable | 3.8 % | 16.2 % | 2.2 % |
1881-02-01 p. 4 | unusable | unusable | 4.6 % | 16.7 % | |
1919-05-08 p. 2 | 28.4 % | 44.7 % | 95.2 % | 71.4 % | |
1919-05-08 p. 3 | unusable | unusable | 10.4 % | 30.8 % | |
1919-05-08 p. 4 | unusable | unusable | 10.2 % | 31.5 % | |
1921-09-22 p. 1 | unusable | 77.8 % | 18.0 % | 32.0 % | |
1944-01-03 p. 1 | 29.4 % | 34.8 % | 14.0 % | 21.0 % | 12.0 % |
(1) Tesseract 4.0.0-alpha.20170703, script/Fraktur
(2) Tesseract 4.0.0-beta.4, script/Fraktur
(3) Tesseract 4.0.0, Fraktur_5000000 + voting
CER = Character error rate
WER = Word error rate
accuracy = 100 % - error rate (e.g. a CER of 4.4 % corresponds to a character accuracy of 95.6 %)
The accuracy was measured using the ocr-evaluation-tools. Please note that the ground truth was created manually and is known to contain errors, too, so in some cases Tesseract is right and the ground truth is wrong. This implies that the real error rates of Tesseract are slightly lower than reported. See the original data for more details.