Rewrite #33

kba · 2020-12-30T18:53:48Z

Rewrite the segmentation and add recognition with support for the upcoming kraken 3.0

TODO

coordinates for words
handle detected image regions
Fix CI
Documentation

bertsky

Great to see Kraken support going!

Wonder if the blla code couldn't give us the hierarchy (region-line mapping) directly...

Also, is there some (region) reading order detection anywhere in Kraken?

bertsky · 2021-01-08T15:03:06Z

Here's an example:

input binary (cropped, deskewed, binarized):
wrapped blla segmenter (still without enlarging regions to avoid extruding lines):

Another example:

input binary (cropped, deskewed, binarized):
wrapped blla segmenter:

bertsky · 2021-01-11T13:43:28Z

And here's a crop from the first example again, this time after 232a055 (enlarging regions to avoid extruding lines) and using PageViewer with PRImA-Research-Lab/prima-page-viewer#18 to show the baselines:

Perhaps @mittagessen should comment on what we are seeing here. Is this what we should expect from the blla segmenter, or are these caused by bugs in kraken or bad wrapping in ocrd_kraken?

mittagessen · 2021-01-11T14:48:56Z

I guess it's just (very) crappy output caused by a combination of the default model being trained only on handwritten text and the high line count on the page, which tends to cause the line merging you're seeing. We changed the postprocessing a few weeks ago; it is now a bit more sensitive to low-confidence detections but solved a number of other rather annoying problems. Retraining the default model at a larger resolution than now should largely resolve the problem and is fairly high up on my todo list. Nothing wrong on your part. Try manuscripts next time ;). I'd also gladly take high line-count (>30) print datasets to incorporate into the default model training data.

bertsky · 2021-01-11T15:49:01Z

Thanks @mittagessen for these explanations. That makes me wonder whether you train on (fixed-size) crops/tiles or on full images (see here for a study of which tiling options work best), and whether you account for different pixel densities (see here for how the Qurator team deals with this). (Using Mask-RCNN, I used to get problems when mixing books and newspapers, esp. when DPI varied...) Could you please elaborate on your take regarding these aspects?

As to datasets, how about PubLayNet (very large, but modern/synthetic), and datasets listed here under metric: pages (mid-size, historic)?

(EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

@kba maybe in light of this, it makes most sense to also wrap both the blla and legacy line detectors in a region2line mode (or level-of-operation=region)?

mittagessen · 2021-01-11T16:24:17Z

Could you please elaborate on your take regarding these aspects?

The net is trained on full pages with a normalized page height, by default 1200px, to keep the memory consumption of the method below 5 GB. As mentioned, the line merging disappears for all but the craziest material (rotuli, maps, some inscriptions, newspapers) when increasing this to 1600px (~33% line separation at net scale). Full-page training has been a design decision since the first iteration of the method (U-Net), as even with the standard tiling techniques we never completely got rid of border effects. The current method is more or less a ReNet, which might perform even worse with the reduced context that tiling provides, but I haven't evaluated it extensively. It is on our radar though, as wanting to be able to process the crazy stuff is our shtick. We haven't encountered any issues relating to scale as described by the Qurator people. Anything from 75 dpi to 600+ dpi seems to work reasonably well in the same model, even if it was not trained on that resolution, and with different input heights of the model. I'd guess that's largely because even low-resolution scans are at worst only ~50% smaller than the 'native' input size, so resizing effects are rather modest.
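To make the height normalization concrete, here is a minimal arithmetic sketch. It is not kraken code: the function names and the example page size (2480x3508 px) and line pitch (70 px) are assumptions chosen for illustration, and it reads the "~33%" above as the relative gain in line spacing when going from 1200px to 1600px.

```python
def normalize_height(width, height, target_height=1200):
    """Return (new_width, new_height, factor) for a page resized to a fixed height.

    Hypothetical helper: the aspect ratio is preserved, only the height is fixed.
    """
    factor = target_height / height
    return round(width * factor), target_height, factor


def pitch_at_net_scale(pitch_px, page_height, target_height):
    """Vertical distance between neighbouring lines after the resize, in pixels."""
    return pitch_px * target_height / page_height


# Assumed example page: 2480x3508 px with lines roughly 70 px apart.
page_w, page_h, pitch = 2480, 3508, 70

p1200 = pitch_at_net_scale(pitch, page_h, 1200)
p1600 = pitch_at_net_scale(pitch, page_h, 1600)
gain = p1600 / p1200 - 1  # relative increase in line separation

print(normalize_height(page_w, page_h, 1200))
print(f"pitch at 1200px: {p1200:.1f}px, at 1600px: {p1600:.1f}px, gain: {gain:.0%}")
```

The gain depends only on the ratio of the two target heights (1600/1200), not on the page itself, which is why dense pages benefit uniformly from the larger input height.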

As to datasets, how about [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) (very large, but modern/synthetic)

Unfortunately, that one doesn't have the baselines necessary to train the segmenter.

[here](https://github.com/cneud/ocr-gt/blob/master/ocr-gt.yml) under

I'll sift through these in a bit, thx.

bertsky · 2021-01-12T12:37:12Z

As to datasets, how about PubLayNet (very large, but modern/synthetic)
Unfortunately, that one doesn't have the baselines necessary to train the segmenter.

Yes, and it has more issues. Unfortunately, they did not publish the method or data of their PDF-XML alignment, so all we could do is post-process.

But...

(EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

...is that even an option?

I'll sift through these in a bit, thx.

Also, there are quite a few more (not yet properly listed) under https://github.com/cneud/ocr-gt/issues

mittagessen · 2021-01-12T12:56:13Z

...is that even an option?

Yes, you can train only regions or only lines (or a subset of types of either). The code actually supports multi-model inference for segmentation as well so you'd be able to mix and match models to your particular use-case. Of course with great flexibility comes great potential for blowing one's foot off.

mittagessen · 2021-01-12T13:23:42Z

BTW about the reading order question above. It's a bit complicated as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are per definition unordered but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order, e.g. (L = line, R = region):
L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8) with the actual lines in R0 and R1 being ordered separately and substituted afterwards so the final output is L0-L1-L5-L4-L3-L2-L7-L8.
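The substitution step described here can be sketched in a few lines. This is an illustration only, not kraken's implementation: `flatten_reading_order` and its two arguments are hypothetical names, and the top-level order and the per-region line orders are assumed to have been determined already.

```python
def flatten_reading_order(top_level, region_lines):
    """Replace each region placeholder by its own, separately ordered lines.

    top_level:    ids in reading order, where textual regions stand in as dummy lines
    region_lines: mapping from region id to the ordered line ids inside that region
    """
    order = []
    for item in top_level:
        if item in region_lines:
            # A region: substitute its lines in their internal order.
            order.extend(region_lines[item])
        else:
            # A free-standing line: keep it as is.
            order.append(item)
    return order


top_level = ["L0", "L1", "R0", "L2", "R1"]
region_lines = {"R0": ["L5", "L4", "L3"], "R1": ["L7", "L8"]}

print("-".join(flatten_reading_order(top_level, region_lines)))
# prints L0-L1-L5-L4-L3-L2-L7-L8
```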

That will change mid-term though as we need a more capable reading order thingy for the semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.

bertsky · 2021-01-12T14:11:10Z

BTW about the reading order question above. It's a bit complicated as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are per definition unordered but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order, e.g. (L = line, R = region):
L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8) with the actual lines in R0 and R1 being ordered separately and substituted afterwards so the final output is L0-L1-L5-L4-L3-L2-L7-L8.

Ok, that's what I do as well in my Ocropy fork – only that I use recursive X-Y cut for region segmentation/grouping.

So, @kba we should try to wrap this functionality for PAGE here, too.

That will change mid-term though as we need a more capable reading order thingy for the semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.

Am I right to assume you plan to do that with some neural modelling, @mittagessen?

mittagessen · 2021-01-12T15:14:15Z

Am I right to assume you plan to do that with some neural modelling, @mittagessen?

When you have a hammer everything looks like a nail, so yes. I had some basic code for a graph NN orderer but which features to actually use is quite unknown.

bertsky · 2022-07-15T17:36:04Z

I had to revert the conditional binary input for the segmenter, because it looks like binarization always produces better results for blla.mlmodel. @mittagessen could you please comment?

mittagessen · 2022-07-16T10:52:21Z

Do you have an example of the material you're testing on? We don't really do binarization anymore as it breaks most degraded manuscripts so the model wasn't even evaluated (nor trained) on it. Would be good to see what exactly is happening.

bertsky · 2022-07-16T20:34:56Z

Oh, in that case... I have compared with and without SBB binarization (using the latest model) on this material.

mittagessen · 2022-07-18T12:32:52Z

Huh, interesting. Those look fairly similar to the stuff in cBAD but if the binarization is excellent it could give a boost in accuracy. In any case, I wouldn't force inputs to be binarized but if there's a good one available and you get better results there's no reason not to use it.

bertsky · 2022-07-18T21:33:10Z

In that case, we should make the choice dependent on the workflow – by passing an empty feature selector/filter when blla is used. I'll revert once again.
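A rough sketch of the idea, assuming derived images are tracked as an ordered list with feature tags. This is not the OCR-D core API: `pick_image`, its arguments and the file names are hypothetical, and the sketch only illustrates why an empty selector/filter leaves the choice to the workflow.

```python
def pick_image(alternatives, feature_selector="", feature_filter=""):
    """Pick the most recently derived image that satisfies selector and filter.

    alternatives:     list of (filename, features) from oldest to newest,
                      with features given as a comma-separated string
    feature_selector: comma-separated features the image must have
    feature_filter:   comma-separated features the image must not have
    Returns the filename, or None if nothing matches (in this sketch).
    """
    wanted = {f for f in feature_selector.split(",") if f}
    unwanted = {f for f in feature_filter.split(",") if f}
    for filename, comments in reversed(alternatives):
        features = {f for f in comments.split(",") if f}
        if wanted <= features and not (unwanted & features):
            return filename
    return None


# Workflow with a binarization step before segmentation:
with_bin = [("page_crop.png", "cropped"), ("page_bin.png", "cropped,binarized")]
# Workflow without one:
without_bin = [("page_crop.png", "cropped")]

print(pick_image(with_bin))                                # binarized image gets used
print(pick_image(without_bin))                             # falls back to the cropped image
print(pick_image(with_bin, feature_filter="binarized"))    # explicitly avoid binarized input
print(pick_image(without_bin, feature_selector="binarized"))  # requirement cannot be met
```

With both arguments empty, the processor neither forces nor forbids binarization: whatever the preceding workflow steps produced last is what the segmenter sees.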

@kba, if you could look into the CI permissions problem?

kba · 2022-07-22T14:37:47Z

@kba, if you could look into the CI permissions problem?

CI is working again, there was an issue with the deployment key and some minor typos and missing models.

bertsky

I'd say this should be merged for now.

Documentation (I guess you meant README.md?) would be nice of course.

Features like reading order, better extraction of the region-line hierarchy, more efficient decoding etc. can be tracked by dedicated issues.

bertsky · 2022-08-29T10:44:57Z

BTW, one thing we could also add is model URLs in the ocrd-tool.json for segmentation and for recognition (especially with the new models from UB Mannheim). (We could even make the suffix .mlmodel disappear.)

bertsky · 2022-10-13T13:29:11Z

@kba I did all of the above and fixed the CI again (with a workaround for this new problem in core). Now ready for merging AFAICS.

kba added 10 commits December 28, 2020 14:14

update requirements

99e370e

ocrd-tool: add parameter description (placeholders)

5a9ea7d

[wip] upgrade to use 3.x baseline segmenter

07fd913

blla segmenter basically working

c530706

segment: more robust handling of lines not matched to regions

ef40b8e

implement legacy segmenter

33e765f

decouple CLIs, add -recognize

5feb52c

implement basic recognition

28ef7cb

recognize: output words and glyphs with conf

b67e7c3

ci: do not build python 3.5 anymore

276f542

kba force-pushed the resources branch from 15c6bed to 276f542 December 31, 2020 15:52

require core 2.22.0b3+ for kraken resources

c2ceef9

kba marked this pull request as ready for review December 31, 2020 16:11

calculate word coords from glyph coords

8be2abd

bertsky requested changes Jan 4, 2021

bertsky added 2 commits January 8, 2021 14:14

segment: convert coords back to absolute

8d93794

segment: enlarge regions to check for contained lines, ensure polygons are valid

232a055

bertsky requested changes Jan 21, 2021

ocrd_kraken/recognize.py (outdated, resolved)

Update ocrd_kraken/recognize.py

97ee297

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>

bertsky requested changes Jan 21, 2021

ocrd_kraken/recognize.py (resolved)

Revert "segment: use binarized input only if use_legacy"

0eecf6c

This reverts commit b85e147.

bertsky and others added 3 commits July 18, 2022 23:33

Revert "Revert "segment: use binarized input only if use_legacy""

104f138

(Leave the selector/filter empty, so it depends on the workflow: If binarization is available, it will get used.) This reverts commit 0eecf6c.

fix typos

8b57e46

fix test_binarize

d27f1a4

kba force-pushed the resources branch from 65d15e5 to 25fbdb7 July 22, 2022 14:23

ci: download model first

6f4a7bf

kba force-pushed the resources branch from 25fbdb7 to 6f4a7bf July 22, 2022 14:31

📦 v0.2.0

f8235c4

bertsky self-requested a review August 29, 2022 10:38

bertsky approved these changes Aug 29, 2022

bertsky added 5 commits October 13, 2022 11:09

recognize: clip bboxes to canvas

ad8eb16

ocrd-tool: add model resources

36a463d

CI: add Py39, fix Py36

6297b33

udpate+fix Docker build

446af81

documentation

c7b7f18

bertsky added 2 commits October 13, 2022 15:45

Makefile: fix typo

45143de

CI: remove assets

831e017

kba merged commit d5ac876 into master Oct 23, 2022

kba deleted the resources branch October 23, 2022 12:37

This was referenced May 25, 2023

Restrict segmentation to print space #22

Closed

documentation: README completeness, debug ocrd-tool.json #29

Closed

ocrd-kraken-segment creates negative coordinates (=invalid PAGE) #34

Open
