Planning
Here we can plan the next releases of Tesseract.
Here are some ideas for future Tesseract releases.
-
Use LLVM tools: clang-format, clang-tidy, sanitizers.
-
Replace more Tesseract data types with C++ standard types (GenericVector, ...), especially for the API.
-
Add a JSON (or XML) output format. It will be used for full OCR and for psm 2 (layout info only).
-
Add option to use alternative binarization methods from leptonica.
-
Add an option to output separate files for multipage input (out1.hocr, out2.hocr ...).
-
Add a multi-threading option to the command line (OpenMP will be disabled at runtime in this mode).
-
Explore the option to use Protocol Buffers or FlatBuffers for the traineddata.
-
Improve error handling and don't ignore return values from functions (see discussion).
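One way to approach the error-handling item above is to mark status-returning functions `[[nodiscard]]` (C++17), so the compiler warns when a caller drops the result. The sketch below is illustrative only: `WriteOutput` and `RunPage` are hypothetical names, not existing Tesseract functions.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of the Tesseract API.
// [[nodiscard]] asks the compiler to warn if the caller ignores the result.
[[nodiscard]] bool WriteOutput(const std::string& path, const std::string& text) {
  std::FILE* fp = std::fopen(path.c_str(), "wb");
  if (fp == nullptr) {
    return false;  // the output file could not be created
  }
  const std::size_t written = std::fwrite(text.data(), 1, text.size(), fp);
  const bool closed = std::fclose(fp) == 0;
  return written == text.size() && closed;
}

// Hypothetical caller that reports and propagates the failure instead of ignoring it.
[[nodiscard]] int RunPage(const std::string& path, const std::string& text) {
  if (!WriteOutput(path, text)) {
    std::fprintf(stderr, "Error: cannot write output file %s\n", path.c_str());
    return 1;
  }
  return 0;
}
```

With this attribute in place, a bare call such as `WriteOutput(path, text);` produces a compiler warning, which makes ignored return values easier to find during a clean-up pass.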
That will be the next release. See also the release notes.
See also the discussion for issue #1423.
-
Issues with the "bug" label (see list here)
-
Noise characters recognized with bbox as the entire page #1192
-
Segmentation fault when using integer models for LSTM training #1573
-
Report a warning when the Tesseract initialisation code detects an unsupported locale setting. (See comment)
-
Insufficient error message when output file cannot be created Issue 1424
-
“no best words!!” on mixed language (fra+ara) items (see issue 235)
-
mgr_.Init(traineddata_path.c_str()):Error:Assert failed: #1075 (see issue 1075)
-
https://github.com/zdenop/tessdata_downloader
Script for installing only selected languages from GitHub (see issue)
Depending on available resources and opinions, these suggestions will either be added to the planning for the next or a future release or abandoned.
-
Enhance --list-langs to show additional information for scripts and languages, such as legacy / LSTM and version.
This will make the command slower, because each file must be opened and parsed. Add this as --list-langs-details or as --list-lang-details for one language file based on lang-code?
-
tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
-
In addition to the current proprietary format, Tesseract could also support ZIP archives (see discussion). A possible implementation using libarchive is available, but needs more testing.
-
"Training light" - Learning by doing (see issue)
-
Modify text2image to use PrepareDistortedPix() #1052
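The `tessedit_load_sublangs` suggestion above amounts to resolving a sub-language file against the directory of its parent traineddata file rather than against the tessdata root. A minimal sketch with `std::filesystem` (C++17) follows; the function name `ResolveSublang` and the `script/` layout used in the example are assumptions for illustration, not existing Tesseract code.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: builds the path of a sub-language file that sits
// in the same directory as the parent traineddata file.
fs::path ResolveSublang(const fs::path& parent_traineddata,
                        const std::string& sublang) {
  return parent_traineddata.parent_path() / (sublang + ".traineddata");
}
```

For a parent at `tessdata/script/Latin.traineddata` and a sub-language `Fraktur`, this helper would yield `tessdata/script/Fraktur.traineddata` instead of looking in `tessdata/` directly.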
Tesseract 4.0 should be a full replacement for Tesseract 3.05 and have the same features when used with the old OCR engine (--oem 0). The following regressions still need verification (are they really regressions, or are they just missing features for LSTM):
These features still work with the old OCR engine (--oem 0), but are missing and desired for LSTM.
-
Black list / White list (See issue). Here is a workaround.
-
hOCR font info (See comment)
Here we collect important issues and features for the release(s) following 4.0.0.
-
New LSTM-based OSD detector (see comment).
-
Remove Legacy Tesseract Engine (see issue)
-
Better Multi-language implementation for training (See comment)
-
ARM SIMD support for dot product #519
-
Using OpenMP for dot product #983
-
This does not include OpenCL or the old Tesseract engine.
-
Tesseract creates output for missing input (see issue 1023).
Mostly solved, but could be improved.
-
Issue 1353: Patch for /training/tessopt.cpp (see pull request 13)
It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).
Old wiki - no longer maintained. The pages were moved, see the new documentation.
All pages were moved to tesseract-ocr/tessdoc.
The latest documentation is available at https://tesseract-ocr.github.io/.