Rewrite #33

Merged
merged 39 commits into from
Oct 23, 2022

Conversation

kba
Member

@kba kba commented Dec 30, 2020

Rewrite the segmentation and add recognition with support for the upcoming kraken 3.0

TODO

  • coordinates for words
  • handle detected image regions
  • Fix CI
  • Documentation

@kba kba marked this pull request as ready for review December 31, 2020 16:11
Collaborator

@bertsky bertsky left a comment

Great to see Kraken support going!

Wonder if the blla code couldn't give us the hierarchy (region-line mapping) directly...

Also, is there some (region) reading order detection anywhere in Kraken?

@bertsky
Collaborator

bertsky commented Jan 8, 2021

Here's an example:

  • input binary (cropped, deskewed, binarized): FILE_0003_CROPPED-IMG
  • wrapped blla segmenter (still without enlarging regions to avoid extruding lines):
    kraken-segment

Another example:

  • input binary (cropped, deskewed, binarized):
    FILE_0010_CROPPED-IMG
  • wrapped blla segmenter:
    kraken-segment_overseg

@bertsky
Collaborator

bertsky commented Jan 11, 2021

And here's a crop from the first example again, this time after 232a055 (enlarging regions to avoid extruding lines) and using PageViewer with PRImA-Research-Lab/prima-page-viewer#18 to show the baselines:

FILE_0002_LINES_KRAKEN_pageviewer-baselines

Perhaps @mittagessen should comment on what we are seeing here. Is this what we should expect from the blla segmenter, or are these caused by bugs in kraken or bad wrapping in ocrd_kraken?

@mittagessen

mittagessen commented Jan 11, 2021 via email

@bertsky
Collaborator

bertsky commented Jan 11, 2021

I guess it's just (very) crappy output caused by a combination of the default model being trained only on handwritten text and the high line count on the page, which tends to cause the line merging you're seeing. We changed the postprocessing a few weeks ago; it is a bit more sensitive to low-confidence detections but solved a number of other rather annoying problems. Retraining the default model at a higher resolution than now should largely resolve the problem and is fairly high up on my todo list. Nothing wrong on your part. Try manuscripts next time ;). I'd also gladly take high line-count (>30) print datasets to incorporate into the default model training data.

Thanks @mittagessen for these explanations. That makes me wonder whether you train on (fixed-size) crops/tiles or on full images (see here for a study of which tiling options work best), and whether you account for different pixel densities (see here for how the Qurator team deals with this). (Using Mask-RCNN I used to get problems when mixing books and newspapers, esp. when DPI varied...) Could you please elaborate on your take regarding these aspects?

As to datasets, how about PubLayNet (very large, but modern/synthetic), and datasets listed here under metric: pages (mid-size, historic)?

(EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

@kba maybe in light of this, it makes most sense to also wrap both the blla and legacy line detectors in a region2line mode (or level-of-operation=region)?

@mittagessen

mittagessen commented Jan 11, 2021 via email

@bertsky
Collaborator

bertsky commented Jan 12, 2021

As to datasets, how about PubLayNet (very large, but modern/synthetic)
Unfortunately, that one doesn't have the baselines necessary to train the segmenter.

Yes, and it has more issues. Unfortunately, they did not publish the method or data of their PDF-XML alignment, so all we could do is post-process.

But...

(EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

...is that even an option?

I'll sift through these in a bit, thx.

Also, there are quite a few more (not yet properly listed) under https://github.com/cneud/ocr-gt/issues

@mittagessen

...is that even an option?

Yes, you can train only regions or only lines (or a subset of types of either). The code actually supports multi-model inference for segmentation as well so you'd be able to mix and match models to your particular use-case. Of course with great flexibility comes great potential for blowing one's foot off.

@mittagessen

BTW about the reading order question above. It's a bit complicated as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are per definition unordered but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order, e.g. (L = line, R = region):
L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8) with the actual lines in R0 and R1 being ordered separately and substituted afterwards so the final output is L0-L1-L5-L4-L3-L2-L7-L8.

That will change mid-term though as we need a more capable reading order thingy for the semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.
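
A minimal sketch of the placeholder substitution described above, reproducing the L0 ... L8 example. This is not kraken's actual code: the function name `flatten_reading_order` and the dictionary-based data model are assumptions made for illustration, and both the top-level order and the per-region line orders are taken as already computed.

```python
from typing import Dict, List


def flatten_reading_order(top_level: List[str],
                          region_lines: Dict[str, List[str]]) -> List[str]:
    """Replace each region placeholder with its own, separately ordered lines.

    top_level    -- ids of free lines and textual regions, in reading order
                    (regions act as dummy lines at this stage)
    region_lines -- for each textual region id, its lines in reading order
    """
    ordered: List[str] = []
    for item in top_level:
        if item in region_lines:
            # a region placeholder: substitute the lines it contains
            ordered.extend(region_lines[item])
        else:
            # a free line outside any region
            ordered.append(item)
    return ordered


top_level = ["L0", "L1", "R0", "L2", "R1"]
region_lines = {"R0": ["L5", "L4", "L3"], "R1": ["L7", "L8"]}
print("-".join(flatten_reading_order(top_level, region_lines)))
# prints L0-L1-L5-L4-L3-L2-L7-L8
```

Non-textual regions (such as stamps) simply never appear in `top_level` here, which matches the statement that regions themselves stay unordered.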

@bertsky
Collaborator

bertsky commented Jan 12, 2021

BTW about the reading order question above. It's a bit complicated as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are per definition unordered but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order, e.g. (L = line, R = region):
L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8) with the actual lines in R0 and R1 being ordered separately and substituted afterwards so the final output is L0-L1-L5-L4-L3-L2-L7-L8.

Ok, that's what I do as well in my Ocropy fork – only that I use recursive X-Y cut for region segmentation/grouping.
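
For readers unfamiliar with the term, here is a rough sketch of recursive X-Y cut on bounding boxes. It is a generic textbook-style illustration, not the code from the Ocropy fork mentioned above; the box format `(x0, y0, x1, y1)` and the helper names are assumptions. It tries horizontal cuts first, then vertical ones, and recurses into each part.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed format


def _split(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by an empty gap along one axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered by the current group
    for box in ordered[1:]:
        if box[lo] >= reach:
            groups.append([box])  # a gap: cut here, start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return the boxes in reading order via recursive X-Y cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (1, 0):  # horizontal cuts (along y) first, then vertical
        groups = _split(boxes, axis)
        if len(groups) > 1:
            return [b for group in groups for b in xy_cut(group)]
    # no clean cut exists: fall back to top-to-bottom, left-to-right
    return sorted(boxes, key=lambda b: (b[1], b[0]))


heading = (0, 0, 100, 10)
left1, left2 = (0, 20, 45, 50), (0, 60, 45, 90)
right1, right2 = (55, 20, 100, 45), (55, 48, 100, 90)
order = xy_cut([right2, left2, heading, right1, left1])
# heading first, then the left column top to bottom, then the right column
```

The ordered groups produced this way could then play the role of the region placeholders in the substitution scheme quoted above.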

So, @kba we should try to wrap this functionality for PAGE here, too.

That will change mid-term though as we need a more capable reading order thingy for the semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.

Am I right to assume you plan to do that with some neural modelling, @mittagessen?

@mittagessen

mittagessen commented Jan 12, 2021 via email

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
@bertsky
Collaborator

bertsky commented Jul 15, 2022

I had to revert the conditional binary input for the segmenter, because it looks like binarization always produces better results for blla.mlmodel. @mittagessen could you please comment?

@mittagessen

Do you have an example of the material you're testing on? We don't really do binarization anymore as it breaks most degraded manuscripts so the model wasn't even evaluated (nor trained) on it. Would be good to see what exactly is happening.

@bertsky
Collaborator

bertsky commented Jul 16, 2022

I had to revert the conditional binary input for the segmenter, because it looks like binarization always produces better results for blla.mlmodel. @mittagessen could you please comment?

Do you have an example of the material you're testing on? We don't really do binarization anymore as it breaks most degraded manuscripts so the model wasn't even evaluated (nor trained) on it. Would be good to see what exactly is happening.

Oh, in that case... I have compared with/out SBB binarization (using the latest model) on this material.

@mittagessen

Huh, interesting. Those look fairly similar to the stuff in cBAD but if the binarization is excellent it could give a boost in accuracy. In any case, I wouldn't force inputs to be binarized but if there's a good one available and you get better results there's no reason not to use it.

@bertsky
Collaborator

bertsky commented Jul 18, 2022

Huh, interesting. Those look fairly similar to the stuff in cBAD but if the binarization is excellent it could give a boost in accuracy. In any case, I wouldn't force inputs to be binarized but if there's a good one available and you get better results there's no reason not to use it.

In that case, we should make the choice dependent on the workflow – by passing an empty feature selector/filter when blla is used. I'll revert once again.

@kba, if you could look into the CI permissions problem?
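
To illustrate what "leaving the selector/filter empty" means in practice, here is a simplified, self-contained sketch of feature-based image selection. It is only loosely modelled on how OCR-D picks among derived images and is not the OCR-D API: the function `pick_image`, the `(filename, features)` tuples and the file names are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_image(alternatives: List[Tuple[str, str]],
               feature_selector: str = "",
               feature_filter: str = "") -> Optional[str]:
    """Pick the most recent derived image that matches the constraints.

    alternatives     -- (filename, comma-separated features), oldest first
    feature_selector -- features the image must have (empty: no requirement)
    feature_filter   -- features the image must not have (empty: none barred)
    """
    wanted = {f for f in feature_selector.split(",") if f}
    unwanted = {f for f in feature_filter.split(",") if f}
    for filename, features in reversed(alternatives):
        present = {f for f in features.split(",") if f}
        if wanted <= present and not (unwanted & present):
            return filename
    return None  # the caller would fall back to the original image


images = [("page_cropped.png", "cropped"),
          ("page_bin.png", "cropped,binarized")]

# Empty selector/filter: the latest image wins, so a binarized image is
# used if the workflow produced one, and a non-binarized one otherwise.
pick_image(images)                              # "page_bin.png"
pick_image(images[:1])                          # "page_cropped.png"
# Forcing or forbidding binarization would look like this:
pick_image(images[:1], feature_selector="binarized")  # None
pick_image(images, feature_filter="binarized")        # "page_cropped.png"
```

With both arguments left empty, the processor imposes no requirement of its own, so whether blla sees a binarized image depends only on the preceding workflow steps.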

bertsky and others added 3 commits July 18, 2022 23:33
(Leave the selector/filter empty, so it depends on the workflow:
 If binarization is available, it will get used.)

This reverts commit 0eecf6c.
@kba
Member Author

kba commented Jul 22, 2022

@kba, if you could look into the CI permissions problem?

CI is working again, there was an issue with the deployment key and some minor typos and missing models.

@bertsky bertsky self-requested a review August 29, 2022 10:38
Collaborator

@bertsky bertsky left a comment

I'd say this should be merged for now.

Documentation (I guess you meant README.md?) would be nice of course.

Features like reading order, better extraction of the region-line hierarchy, more efficient decoding, etc. can be tracked in dedicated issues.

@bertsky
Collaborator

bertsky commented Aug 29, 2022

BTW, one thing we could also add is model URLs in the ocrd-tool.json for segmentation and for recognition (especially with the new models from UB Mannheim). (We could even make the suffix .mlmodel disappear.)

@bertsky
Collaborator

bertsky commented Oct 13, 2022

@kba I did all of the above and fixed the CI again (with a workaround for this new problem in core). Now ready for merging AFAICS.
