convert eng training to h5 model #71

Open
ehrenmann1977 opened this issue Apr 2, 2022 · 3 comments


ehrenmann1977 commented Apr 2, 2022

How can I export a Keras model for the English language? Is it possible to export the corpus so that I can use it for neural network training? I mean something like the MNIST dataset.

Member

stweil commented May 17, 2023

Good question. Tesseract uses its own model file format. But it should be possible to convert the included neural network to any other model format which supports the same network specification.

We still have to find someone who wants to implement that (and also the other direction).

@stefan6419846

Is there any documentation available on the model file format Tesseract uses (*.traineddata file format specification)?

Member

stweil commented May 19, 2023

The command line tool combine_tessdata can list and extract all components of a model file:

% combine_tessdata -d /opt/homebrew/share/tessdata/eng.traineddata 
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
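
To work with that listing programmatically, here is a minimal Python sketch that parses the lines shown above and slices one component out of the model file. It assumes that `offset` is a byte offset from the start of the file and `size` a byte count. That is not documented here, but it fits the listing, where each offset plus its size equals the next offset. The names `Component`, `parse_listing` and `read_component` are made up for this example.

```python
import re
from typing import List, NamedTuple


class Component(NamedTuple):
    index: int
    name: str
    size: int
    offset: int


# Matches entries such as "17:lstm:size=401636, offset=192".
_ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text: str) -> List[Component]:
    """Parse the output of `combine_tessdata -d` into Component tuples."""
    components = []
    for line in text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:  # the "Version:..." line and anything else is skipped
            index, name, size, offset = match.groups()
            components.append(Component(int(index), name, int(size), int(offset)))
    return components


def read_component(path: str, comp: Component) -> bytes:
    """Return the raw bytes of one component.

    Assumption: comp.offset counts bytes from the beginning of the file.
    """
    with open(path, "rb") as f:
        f.seek(comp.offset)
        return f.read(comp.size)


LISTING = """\
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
"""

components = parse_listing(LISTING)
# The components are stored back to back: offset + size == next offset.
contiguous = all(
    a.offset + a.size == b.offset for a, b in zip(components, components[1:])
)
```

With a real model file, `read_component("eng.traineddata", components[0])` would return the still undecoded bytes of the lstm component.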

Another tool, dawg2wordlist, can convert the dawg components to plain text files, and the unicharset is already text. That's the easy part.
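
As a rough sketch of that easy part, the two helpers below build the command lines for unpacking a model and converting a dawg to a word list. Some assumptions: `combine_tessdata -u` is taken to write one file per component as prefix plus component name (for example `eng.lstm-word-dawg`), and `dawg2wordlist` is taken to expect the unicharset, the dawg and the output file in that order. Please check both against the help output of your installation. The helper names and the `.txt` output name are made up for this example.

```python
import shutil
import subprocess
from typing import List


def unpack_command(traineddata: str, prefix: str) -> List[str]:
    """Command that unpacks all components to files named <prefix><component>."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def wordlist_command(prefix: str, dawg: str = "lstm-word-dawg") -> List[str]:
    """Command that converts one unpacked dawg to a plain text word list.

    Assumed argument order: unicharset file, dawg file, output file.
    """
    return [
        "dawg2wordlist",
        prefix + "lstm-unicharset",
        prefix + dawg,
        prefix + dawg + ".txt",  # hypothetical output name
    ]


def run_if_available(cmd: List[str]) -> bool:
    """Run cmd if its executable is on PATH; return False if it is missing."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

For example, `run_if_available(unpack_command("eng.traineddata", "eng."))` followed by `run_if_available(wordlist_command("eng."))` should leave a text file `eng.lstm-word-dawg.txt` if both tools are installed.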

The interesting part is the lstm component with the neural network. It's not documented, so the program code is the reference for it. Look for DeSerialize in the lstm code.
