
Improve tooling and add OCR-D CLI #5

Merged
merged 27 commits from the ocrd-cli branch into qurator-spk:master on Oct 22, 2020

Conversation

Contributor

@kba commented Oct 15, 2020

  • add setup.py, requirements.txt, Makefile
  • clean up sbb_binarizer code and split CLI into own module
  • proof-of-concept OCR-D CLI

TODO:

  • Fix conversion from SbbBinarizer OpenCV2 image to PIL
  • Refactor SbbBinarizer to load the model once and have a stateless run method. Not fully possible, since four models are run in sequence, but SbbBinarizer is now refactored so that only model_dir is an instance variable; the other arguments are passed to run (see the sketch after this list).
  • Set up CI. It is set up, but sbb_binarization requires too much RAM for binarizing two pages on Circle CI's free plan (4 GB).
  • Improve documentation
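
For illustration, a rough sketch (not the actual sbb_binarization code) of the refactored interface described in the second TODO item; only model_dir lives on the instance, everything else is passed to run, and the per-model prediction itself is omitted:

```python
from os import listdir
from os.path import join

class SbbBinarizer:
    def __init__(self, model_dir):
        # The only instance variable is the directory holding the four models.
        self.model_dir = model_dir

    def run(self, image, patches=True, save=None):
        # All per-call options (input image, patch mode, output path) are
        # arguments of run(), so the object stays stateless apart from model_dir.
        predictions = []
        for model_name in sorted(listdir(self.model_dir)):
            model_path = join(self.model_dir, model_name)
            predictions.append(model_path)  # placeholder for loading + predicting
        return image                        # placeholder for the combined result
```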

@cneud mentioned this pull request Oct 15, 2020
@cneud linked an issue Oct 15, 2020 that may be closed by this pull request
sbb_binarize/ocrd_cli.py
"description": "PAGE XML hierarchy level to operate on"
},
"patches": {
"description": "by setting this parameter to true you let the model to see the image in patches.",
Contributor

It would be helpful to document the ramifications of this choice here (speed vs quality?)

Contributor Author

@vahidrezanezhad What are the consequences of using the "patches" approach?

Member

> @vahidrezanezhad What are the consequences of using the "patches" approach?

In the case of binarization, this parameter should be set to "True"; otherwise the model sees the whole image at once and the quality will be terrible.
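
To make the difference concrete, here is a simplified sketch of whole-image versus patch-wise prediction. It is not the actual sbb_binarization code; the model input size, the Keras-style model.predict call, and the stitching are assumptions.

```python
import numpy as np
import cv2

def binarize_whole(model, page, model_input_size=448):
    # patches=False: the whole page is resized down to the model input size,
    # so thin strokes are lost at page scale.
    small = cv2.resize(page, (model_input_size, model_input_size))
    pred = model.predict(small[np.newaxis])[0]
    return cv2.resize(pred, (page.shape[1], page.shape[0]))

def binarize_patches(model, page, model_input_size=448):
    # patches=True: the page is tiled at its original resolution and the
    # per-tile predictions are stitched back together.
    h, w = page.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, model_input_size):
        for x in range(0, w, model_input_size):
            tile = page[y:y + model_input_size, x:x + model_input_size]
            th, tw = tile.shape[:2]
            padded = cv2.copyMakeBorder(tile, 0, model_input_size - th,
                                        0, model_input_size - tw,
                                        cv2.BORDER_REPLICATE)
            pred = model.predict(padded[np.newaxis])[0]
            out[y:y + th, x:x + tw] = pred[:th, :tw].squeeze()
    return out
```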

Contributor

Ok, then I'd recommend against exposing this parameter at all.

Also, how about setting it to False on region (or line) level, i.e. when running on region crops or textline crops?

Member

> Ok, then I'd recommend against exposing this parameter at all.
>
> Also, how about setting it to False on region (or line) level, i.e. when running on region crops or textline crops?

I would recommend applying binarization at the page level, since the best performance is achieved when you feed the model at real scale (that is, with patches set to "True"). So what happens if you apply it at line level with the patches parameter set to False? The textline region will be resized to the model's input shape, and I am sure the performance will be worse.
The same holds at region level. In all cases the patches parameter is better set to True.

Member

> @vahidrezanezhad understood – thanks for clarifying!
>
> One follow-up though: Can you give an impression how sensitive these models are to different scales / pixel densities on the input side? (So, if the model has seen, say, only 600 DPI newspapers during training, will it still work on (patches of) 300 DPI books?)

Actually, this tool has no problem with different pixel densities. But for large-scale text like headers (not just any header, I mean really large-scale ones) or for images, I had trouble getting the desired result. That was because the GT does not include any images or any large-scale texture. However, this is also resolved by the combination of 4 models. If you apply all 4 of our models with the sbb_binarization tool, you can see that it works for a wide variety of scales. To make a long story short: if you set patches to True and use all 4 models, rest assured this tool will deliver :)

Contributor
@bertsky commented Oct 16, 2020

> Actually, this tool has no problem with different pixel densities.

Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?

> But for large-scale text like headers (not just any header, I mean really large-scale ones) or for images, I had trouble getting the desired result. That was because the GT does not include any images or any large-scale texture. However, this is also resolved by the combination of 4 models.

Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.

This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?

Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?
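
A minimal sketch of the foreground combination just described, assuming each model's output has already been thresholded to a binary mask (this is not the actual run implementation):

```python
import numpy as np

def combine_foreground(masks):
    # A pixel counts as foreground (1) if any of the models predicted it
    # as foreground; otherwise it stays background (0).
    combined = np.zeros_like(masks[0], dtype=bool)
    for mask in masks:
        combined |= mask.astype(bool)
    return combined.astype(np.uint8)

# Hypothetical usage with per-model binary masks:
# masks = [predict_mask(model, page) for model in models]
# binarized = combine_foreground(masks)
```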

Member

> Actually, this tool has no problem with different pixel densities.
>
> Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?
>
> But for large-scale text like headers (not just any header, I mean really large-scale ones) or for images, I had trouble getting the desired result. That was because the GT does not include any images or any large-scale texture. However, this is also resolved by the combination of 4 models.
>
> Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.
>
> This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?
>
> Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?

I am not 100% sure, but I think it should work for any DPI. If not, don't worry, we have also developed something for that :)

Yes, you are right, those 4 models are trained with different scales.

At SBB we have already recognized how important scale and the DPI of documents are (we reached this conclusion based on OCR scans used in the OCR community: they were low-DPI, low-scale images, and our sbb_textline_detector had a hard time with layout detection on them). So we decided to train two models: first, a model to enhance the image (for enhancement we also needed the right scale), and second, a model to detect the scale of a document, or rather the number of columns in it. I am happy to say both are trained and the results are really good. We have integrated these two models into our new layout detector and you will see it soon. Again, if our binarization does not work because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.

Contributor

> So we decided to train two models: first, a model to enhance the image (for enhancement we also needed the right scale), and second, a model to detect the scale of a document, or rather the number of columns in it. I am happy to say both are trained and the results are really good. We have integrated these two models into our new layout detector and you will see it soon. Again, if our binarization does not work because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.

Awesome! Looking forward to seeing the new version of the segmentation, and feeling much better about this knowing that your 4-scaled binarization models will also benefit.

Contributor

@kba we should probably document all this in our workflow guide or wiki etc.

sbb_binarize/ocrd-tool.json
"default": false
},
"model": {
"description": "models directory.",
Contributor

Relative paths, too?

Contributor Author

The binarizer itself accepts relative paths, but we still have the chdir(workspace.directory) in the processor, so relative paths would have to be relative to the workspace :/
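
To illustrate the effect (the function below is only a hedged illustration, not the actual processor code): once the processor has changed into the workspace directory, any relative model path resolves against that directory.

```python
import os

def resolve_model_param(model_param, workspace_directory):
    # Mimics the effect of chdir(workspace.directory) in the processor:
    # a relative value like "models/" ends up meaning "<workspace>/models/".
    os.chdir(workspace_directory)
    return os.path.abspath(model_param)
```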

Contributor

Okay, so when your resource resolution arrives, this will automatically include the package data directory as search path?

Contributor Author

yes

sbb_binarize/ocrd_cli.py (further review threads, resolved)
setup.py (review thread, resolved)
@kba marked this pull request as ready for review on October 16, 2020, 15:49
@vahidrezanezhad merged commit 3e60a62 into qurator-spk:master on Oct 22, 2020
@kba deleted the ocrd-cli branch on October 27, 2020, 17:05
@bertsky mentioned this pull request on Jan 11, 2021

Successfully merging this pull request may close these issues.

Add interfaces for use within OCR-D