
ketos linegen CLI -d is ambiguous #306

Closed
bertsky opened this issue Nov 16, 2021 · 7 comments

Comments

bertsky commented Nov 16, 2021

In ketos linegen, you currently have:

  -d, --disable-degradation       Dont degrade output lines.

  -d, --distort FLOAT             Mean of folded normal distribution to take
                                  distortion values from

You might want to rename one, e.g. -D.
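
For illustration, a minimal click sketch of how the two options could coexist after such a rename (the option semantics are taken from the help text above; the default value and the command body are made up):

```python
import click

@click.command()
@click.option('-D', '--disable-degradation', is_flag=True,
              help="Don't degrade output lines.")
@click.option('-d', '--distort', type=click.FLOAT, default=2.0,
              help='Mean of folded normal distribution to take distortion '
                   'values from')
def linegen(disable_degradation, distort):
    """Stub command showing the disambiguated short options."""
    click.echo(f'degrade: {not disable_degradation}, distort: {distort}')

if __name__ == '__main__':
    linegen()
```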

mittagessen (Owner) commented Nov 16, 2021

The module hasn't been touched in a long time and should definitely be revisited. At least with the older, shallow network architecture, synthetic data didn't actually help to improve a model or even to bootstrap a rough working one.

bertsky commented Nov 16, 2021

> The module hasn't been touched in a long time and should definitely be revisited. At least with the older, shallow network architecture, synthetic data didn't actually help to improve a model or even to bootstrap a rough working one.

Ah, good to know. And that applies to handwriting, or print, or both?

Also, what exactly counts as shallow for you here (or as deep)? For example, Tesseract's default 1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 seems much narrower and shallower than other systems' defaults (IIUC):

  • 1,48,0,1 Ct3,3,40 Mp2,2 Ct3,3,60 Mp2,2 Lfx100 Lrx100 (from Wick et al. 2018 for print)
  • 1,48,0,1 Ct3,3,64 Mp2,2 Ct3,3,128 Mp2,2 Lfx100 Lrx100 (from Wick et al. 2018 for print)
  • 1,48,0,1 Ct3,3,16 Ct3,3,24 Ct3,3,36 Ct3,3,54 Ct3,3,82 Ct3,3,124 Mp2,2 Lfx350 Lrx350 (from Liebl&Burghardt 2020 for print)
  • 1,128,0,1 Ct3,3,16 Mp2,2 Ct3,3,32 Mp2,2 Ct3,3,48 Mp2,2 Ct3,3,64 Ct3,3,80 Lfx256 Lrx256 Lfx256 Lrx256 Lfx256 Lrx256 Lfx256 Lrx256 Lfx256 Lrx256 (from Puigcerver 2017 for handwriting)

Assuming I got that right, where would Kraken's old and new default fit in?
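
For a rough size comparison, kraken's VGSL parser should be able to instantiate such specs directly; a sketch, assuming TorchVGSLModel accepts a bracketed spec string and that an output layer has to be appended (the O1c100, i.e. 100 output classes, is an arbitrary stand-in):

```python
# Rough parameter-count comparison via kraken's VGSL parser; assumes a
# recent kraken install. The appended O1c100 output layer is arbitrary.
from kraken.lib import vgsl

specs = {
    'Wick et al. 2018 (small)':
        '[1,48,0,1 Ct3,3,40 Mp2,2 Ct3,3,60 Mp2,2 Lfx100 Lrx100 O1c100]',
    'Wick et al. 2018 (large)':
        '[1,48,0,1 Ct3,3,64 Mp2,2 Ct3,3,128 Mp2,2 Lfx100 Lrx100 O1c100]',
}

for name, spec in specs.items():
    net = vgsl.TorchVGSLModel(spec)
    n_params = sum(p.numel() for p in net.nn.parameters())
    print(f'{name}: {n_params / 1e6:.2f}M parameters')
```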

mittagessen commented Nov 16, 2021

That was only for print and with the non-PyTorch single-BiLSTM-layer model.

The current default is [1,48,0,1 Cr4,2,32,4,2 Gn32 Cr4,2,64,1,1 Gn32 Mp4,2,4,2 Cr3,3,128,1,1 Gn32 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5], so somewhere between Liebl & Burghardt and Puigcerver. The one we use for most handwriting is [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do], but that one has the drawback that it doesn't converge for small datasets (which is the reason we haven't made it the default yet, despite it reducing CER by ~75% for handwriting).

Tesseract is a bit weird. After writing the initial VGSL implementation I tried to use Tesseract's specs, replicating their hyperparameters as much as possible, but I never got anything with LfysXX layers to even remotely reproduce their numbers. It has to be said, though, that Tesseract's training procedure is decidedly non-standard: there's backtracking on plateaus, a per-layer LR heuristic in addition to what Adam does, a weird CTC implementation that somewhat mirrors the Breuelian formulation, and all kinds of custom bits and bobs. Even just the summarizing LSTM layers by themselves are esoteric enough that I haven't seen anyone else using them.

EDIT: On a test set (print, polytonic Greek, single font, 2.5k lines, binary) I get 99.4% character accuracy with a summarizing layer and 99.7% with the large configuration.

bertsky commented Nov 18, 2021

Interesting, thanks!

I haven't looked much into Tesseract's training procedure yet (good to know). Its many other performance optimizations already make it impossible to compare and reproduce precisely, I'm afraid.

> but I never got anything with LfysXX layers to even remotely reproduce their numbers
> Even just the summarizing LSTM layers by themselves are esoteric enough that I haven't seen anyone else using them.

Do you mean the "implicit baseline normalization" (described here, p. 21)? Perhaps other systems either rely on explicit dewarping, or use 2D-LSTMs, or simply try to compensate with a larger input height? But your last edit suggests you did apply this successfully – so how does it compare to the same config without LfysXX?
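
To make sure we mean the same thing, here is my rough PyTorch reading of a VGSL Lfys layer: a forward LSTM run along the vertical axis that keeps only its final state, collapsing the height to 1. Purely illustrative, not Tesseract's actual code:

```python
import torch
import torch.nn as nn

class SummarizingYLSTM(nn.Module):
    """Sketch of VGSL 'Lfys<n>': an LSTM over the y axis whose final
    state summarizes each column, collapsing the height dimension."""
    def __init__(self, in_channels: int, hidden: int):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # (N, C, H, W) -> (N*W, H, C): one vertical sequence per column
        seq = x.permute(0, 3, 2, 1).reshape(n * w, h, c)
        _, (h_n, _) = self.lstm(seq)               # h_n: (1, N*W, hidden)
        out = h_n[-1].reshape(n, w, -1)            # final state per column
        return out.permute(0, 2, 1).unsqueeze(2)   # (N, hidden, 1, W)

x = torch.randn(2, 16, 36, 200)                    # a batch of line images
print(SummarizingYLSTM(16, 48)(x).shape)           # torch.Size([2, 48, 1, 200])
```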

mittagessen commented Nov 18, 2021 via email

bertsky commented Nov 26, 2021

> input height has nothing to do with it

I disagree: if you normalize ("deslope"/dewarp) the baseline in advance, then the same height contains more information. And if you rely on vertical summarization to do the job implicitly, then you obviously need a larger input height.

> They probably put that in the presentation because Thomas Breuel was at Google at the time and the old ocropus had this heuristic CenterLineNormalizer.

It does not contain that kind of code, though.

> For Tesseract it is a bit of a moot point I guess as their line extractor is so old it can't find anything but the straightest of lines anyway.

Right, but that probably does not matter much, because you can do line detection externally (and during training you can still augment by warping).

> OCR systems using the baseline paradigm for segmentation get auto-normalized lines for recognition as you can just map the baseline into the plane with a piecewise affine transform which works well even for extreme curvatures while the implicit (or network-internal approaches such as STNs) have limits.

I agree, external/explicit dewarping is probably more robust (but let's see how the new transformer / multi-head self-attention architectures fare).
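
For reference, that piecewise affine mapping can be sketched with scikit-image; the control-point layout and the fixed margins here are my own assumptions, not kraken's actual implementation:

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def dewarp_line(image, baseline, height=48):
    """Straighten a text line by mapping its baseline onto a horizontal."""
    bl = np.asarray(baseline, dtype=float)             # (N, 2) points as (x, y)
    # control points in the curved input: above, on, and below the baseline
    curved = np.concatenate([bl + [0, -height], bl, bl + [0, height]])
    # matching points in the straightened output
    xs = bl[:, 0]
    straight = np.concatenate([
        np.stack([xs, np.zeros_like(xs)], axis=1),          # top edge
        np.stack([xs, np.full_like(xs, height)], axis=1),   # baseline row
        np.stack([xs, np.full_like(xs, 2.0 * height)], axis=1),
    ])
    # warp() expects a map from output to input coordinates
    tform = PiecewiseAffineTransform()
    tform.estimate(straight, curved)
    return warp(image, tform, output_shape=(2 * height, image.shape[1]))

# e.g. a synthetic sinusoidal baseline across a 200px-wide line image
img = np.random.rand(120, 200)
x = np.arange(0, 201, 20.0)
bl = np.stack([x, 80 + 10 * np.sin(x / 40.0)], axis=1)
print(dewarp_line(img, bl).shape)                      # (96, 200)
```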

> For comparison, I get (character accuracy on the Greek print set): [...] and the second one converges a lot slower (ca. epoch 50, in contrast to 30 for the other architecture).

I see – thanks! (Perhaps the vertical summary could be trained/regulated specially to converge faster?)

mittagessen commented Nov 26, 2021

> I disagree: if you normalize ("deslope"/dewarp) the baseline in advance, then the same height contains more information. And if you rely on vertical summarization to do the job implicitly, then you obviously need a larger input height.

OK, I formulated that badly. For some material larger input heights give better results (we've seen that for many Hebrew manuscripts), but I don't believe this is related to any improved capability to compensate for baseline position. I'm pulling this out of my ass, but naïvely I'd expect implicit baseline compensation to improve with additional contextual information, not necessarily just by having the same information at a higher resolution (in fact the latter could be detrimental, as the receptive field of the convolutional stack is limited).
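
Back-of-the-envelope: the vertical receptive field of a small conv/pool stack is easy to compute, which shows how quickly extra input height stops adding context. The layer list below is hypothetical, not kraken's default:

```python
def receptive_field(layers):
    """layers: (kernel, stride) pairs along one axis, applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump
        jump *= s              # strides compound the sampling step
    return rf

# e.g. three 3x3 convs with two 2x2 max-pools in between (vertical axis):
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # -> 18
```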

> It does not contain that kind of code, though.

Yes, as I said, a lot of the ocropus-y features in that presentation never ended up in Tesseract.

> (but let's see how the new transformer / multi-head self-attention architectures fare).

For now they mostly seem to require more training data for the same results, with slower inference. At least that's what the literature (and some quick experiments on my side) suggest.

> I see – thanks! (Perhaps the vertical summary could be trained/regulated specially to converge faster?)

Yeah, I didn't fiddle around with the hyperparameters much. Doing hyperparameter search with kraken is a bit of a pain right now as the datasets load so slowly. It is entirely possible that Tesseract's explicit per-layer learning rates, beyond what Adam does, were added for those layers. But IDK, in the end you can probably get the exact same result with a stack of 1xX convolutional layers when using a fixed input height.
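
For the record, per-layer learning rates on top of Adam are easy to emulate with torch.optim parameter groups; the toy module split and the rates below are made up:

```python
import torch
import torch.nn as nn

# toy stand-in for a conv stack plus a recurrent stack
model = nn.ModuleDict({
    'convs': nn.Conv2d(1, 32, 3),
    'recurrent': nn.LSTM(32, 256),
})
optimizer = torch.optim.Adam([
    {'params': model['convs'].parameters(), 'lr': 1e-3},      # faster convs
    {'params': model['recurrent'].parameters(), 'lr': 3e-4},  # slower LSTMs
])
```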
