
Improve tooling and add OCR-D CLI #5

Merged
merged 27 commits from the ocrd-cli branch into qurator-spk:master on Oct 22, 2020

Conversation

Contributor

@kba commented Oct 15, 2020

  • add setup.py, requirements.txt, Makefile
  • clean up sbb_binarizer code and split CLI into own module
  • proof-of-concept OCR-D CLI

TODO:

  • Fix conversion from SbbBinarizer OpenCV2 image to PIL
  • Refactor SbbBinarizer to load the model once and have a stateless run method. Not fully possible, since four models are run in sequence, but SbbBinarizer is now refactored so that only model_dir is an instance variable; the other arguments are passed to run (see the sketch after this list).
  • Set up CI. It is set up, but sbb_binarization requires too much RAM for binarizing two pages on Circle CI's free plan (4 GB).
  • Improve documentation
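
For illustration, a rough sketch (not the actual sbb_binarization code) of the refactored interface described in the second TODO item; only model_dir lives on the instance, everything else is passed to run, and the per-model prediction itself is omitted:

```python
from os import listdir
from os.path import join

class SbbBinarizer:
    def __init__(self, model_dir):
        # The only instance variable is the directory holding the four models.
        self.model_dir = model_dir

    def run(self, image, patches=True, save=None):
        # All per-call options (input image, patch mode, output path) are
        # arguments of run(), so the object stays stateless apart from model_dir.
        predictions = []
        for model_name in sorted(listdir(self.model_dir)):
            model_path = join(self.model_dir, model_name)
            predictions.append(model_path)  # placeholder for loading + predicting
        return image                        # placeholder for the combined result
```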

@cneud mentioned this pull request Oct 15, 2020
@cneud linked an issue Oct 15, 2020 that may be closed by this pull request
sbb_binarize/ocrd_cli.py
"description": "PAGE XML hierarchy level to operate on"
},
"patches": {
"description": "by setting this parameter to true you let the model to see the image in patches.",
Contributor

It would be helpful to document the ramifications of this choice here (speed vs quality?)

Contributor Author

@vahidrezanezhad What are the consequences of using the "patches" approach?

Member

> @vahidrezanezhad What are the consequences of using the "patches" approach?

In the case of binarization, this parameter should be set to "True"; otherwise the model sees the whole image at once and the quality will be terrible.
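
To make the difference concrete, here is a simplified sketch of whole-image versus patch-wise prediction. It is not the actual sbb_binarization code; the model input size, the Keras-style model.predict call, and the stitching are assumptions.

```python
import numpy as np
import cv2

def binarize_whole(model, page, model_input_size=448):
    # patches=False: the whole page is resized down to the model input size,
    # so thin strokes are lost at page scale.
    small = cv2.resize(page, (model_input_size, model_input_size))
    pred = model.predict(small[np.newaxis])[0]
    return cv2.resize(pred, (page.shape[1], page.shape[0]))

def binarize_patches(model, page, model_input_size=448):
    # patches=True: the page is tiled at its original resolution and the
    # per-tile predictions are stitched back together.
    h, w = page.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, model_input_size):
        for x in range(0, w, model_input_size):
            tile = page[y:y + model_input_size, x:x + model_input_size]
            th, tw = tile.shape[:2]
            padded = cv2.copyMakeBorder(tile, 0, model_input_size - th,
                                        0, model_input_size - tw,
                                        cv2.BORDER_REPLICATE)
            pred = model.predict(padded[np.newaxis])[0]
            out[y:y + th, x:x + tw] = pred[:th, :tw].squeeze()
    return out
```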

Contributor

Ok, then I'd recommend against exposing this parameter at all.

Also, how about setting it to False on region (or line) level, i.e. when running on region crops or textline crops?

Member

> Ok, then I'd recommend against exposing this parameter at all.
>
> Also, how about setting it to False on region (or line) level, i.e. when running on region crops or textline crops?

I would recommend applying binarization at the page level, since the best performance is achieved when you feed the model at real scale (that is, with patches set to "True"). So what happens if you apply it at line level with the patches parameter set to False? The textline region will be resized to the model's input shape, and I am sure the performance will be worse.
The same holds at region level. In all cases the patches parameter is better set to True.

Member

> @vahidrezanezhad understood – thanks for clarifying!
>
> One follow-up though: Can you give an impression how sensitive these models are to different scales / pixel densities on the input side? (So, if the model has seen, say, only 600 DPI newspapers during training, will it still work on (patches of) 300 DPI books?)

Actually, this tool has no problem with different pixel densities. But for large-scale text like headers (not just any header, I mean really large-scale ones) or for images, I had trouble getting the desired result. That was because the GT does not include any images or any large-scale texture. However, this is also resolved by the combination of 4 models. If you apply all 4 of our models with the sbb_binarization tool, you can see that it works for a wide variety of scales. To make a long story short: if you set patches to True and use all 4 models, rest assured this tool will deliver :)

Contributor
@bertsky commented Oct 16, 2020

> Actually, this tool has no problem with different pixel densities.

Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?

> But for large-scale text like headers (not just any header, I mean really large-scale ones) or for images, I had trouble getting the desired result. That was because the GT does not include any images or any large-scale texture. However, this is also resolved by the combination of 4 models.

Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.

This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?

Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?
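
A minimal sketch of the foreground combination just described, assuming each model's output has already been thresholded to a binary mask (this is not the actual run implementation):

```python
import numpy as np

def combine_foreground(masks):
    # A pixel counts as foreground (1) if any of the models predicted it
    # as foreground; otherwise it stays background (0).
    combined = np.zeros_like(masks[0], dtype=bool)
    for mask in masks:
        combined |= mask.astype(bool)
    return combined.astype(np.uint8)

# Hypothetical usage with per-model binary masks:
# masks = [predict_mask(model, page) for model in models]
# binarized = combine_foreground(masks)
```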

Member

> Actually, this tool has no problem with different pixel densities.
>
> Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?
>
> But for large-scale text like headers (not just any header, I mean really large-scale ones) or for images, I had trouble getting the desired result. That was because the GT does not include any images or any large-scale texture. However, this is also resolved by the combination of 4 models.
>
> Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.
>
> This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?
>
> Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?

I am not 100% sure, but I think it should work for any DPI. If not, don't worry, we have also developed something for that :)

Yes, you are right, those 4 models are trained with different scales.

At SBB we have already recognized how important scale and the DPI of documents are (we reached this conclusion based on OCR scans used in the OCR community: they were low-DPI, low-scale images, and our sbb_textline_detector had a hard time with layout detection on them). So we decided to train two models: first, a model to enhance the image (for enhancement we also needed the right scale), and second, a model to detect the scale of a document, or rather the number of columns in it. I am happy to say both are trained and the results are really good. We have integrated these two models into our new layout detector and you will see it soon. Again, if our binarization does not work because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.

Contributor

> So we decided to train two models: first, a model to enhance the image (for enhancement we also needed the right scale), and second, a model to detect the scale of a document, or rather the number of columns in it. I am happy to say both are trained and the results are really good. We have integrated these two models into our new layout detector and you will see it soon. Again, if our binarization does not work because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.

Awesome! Looking forward to seeing the new version of the segmentation, and feeling much better about this knowing that your 4-scaled binarization models will also benefit.

Contributor

@kba we should probably document all this in our workflow guide or wiki etc.

sbb_binarize/ocrd-tool.json
"default": false
},
"model": {
"description": "models directory.",
Contributor

Relative paths, too?

Contributor Author

The binarizer itself accepts relative paths, but we still have the chdir(workspace.directory) in the processor, so relative paths would have to be relative to the workspace :/
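
To illustrate the effect (the function below is only a hedged illustration, not the actual processor code): once the processor has changed into the workspace directory, any relative model path resolves against that directory.

```python
import os

def resolve_model_param(model_param, workspace_directory):
    # Mimics the effect of chdir(workspace.directory) in the processor:
    # a relative value like "models/" ends up meaning "<workspace>/models/".
    os.chdir(workspace_directory)
    return os.path.abspath(model_param)
```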

Contributor

Okay, so when your resource resolution arrives, this will automatically include the package data directory as search path?

Contributor Author

yes

sbb_binarize/ocrd_cli.py (further review threads, resolved)
setup.py (review thread, resolved)
@kba marked this pull request as ready for review on October 16, 2020, 15:49
@vahidrezanezhad merged commit 3e60a62 into qurator-spk:master on Oct 22, 2020
@kba deleted the ocrd-cli branch on October 27, 2020, 17:05
@bertsky mentioned this pull request on Jan 11, 2021

Successfully merging this pull request may close these issues.

Add interfaces for use within OCR-D