Improve tooling and add OCR-D CLI #5
Conversation
sbb_binarize/ocrd-tool.json (Outdated)

"description": "PAGE XML hierarchy level to operate on"
},
"patches": {
"description": "by setting this parameter to true you let the model to see the image in patches.",
It would be helpful to document the ramifications of this choice here (speed vs quality?)
@vahidrezanezhad What are the consequences of using the "patches" approach?
> @vahidrezanezhad What are the consequences of using the "patches" approach?
In the case of binarization, this parameter should be set to true; otherwise, the model will see the whole image at once and the quality will be terrible.
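To illustrate the difference being discussed, here is a minimal sketch of patch-wise inference: the page is tiled, each tile is predicted at its native resolution, and the results are stitched back together. The function and parameter names here are hypothetical, not the actual sbb_binarize API; with `patches` off, the whole page would instead be downscaled to the model's fixed input shape, losing stroke-level detail.

```python
import numpy as np

def predict_patchwise(model, image, patch_size=448):
    """Illustrative sketch (hypothetical names): tile the image, predict
    each tile at native resolution, and stitch the results together."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size]
            ph, pw = patch.shape[:2]
            # pad border tiles up to the model's fixed input shape
            padded = np.zeros((patch_size, patch_size), dtype=patch.dtype)
            padded[:ph, :pw] = patch
            pred = model(padded)  # model sees the text at its real scale
            out[y:y + ph, x:x + pw] = pred[:ph, :pw]
    return out
```

A full-page run, by contrast, would be a single `model(resize(image, input_shape))` call, which is what degrades quality.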
Ok, then I'd recommend against exposing this parameter at all.
Also, how about setting it to `False` on region (or line) level, i.e. when running on region crops or textline crops?
> Ok, then I'd recommend against exposing this parameter at all. Also, how about setting it to `False` on region (or line) level, i.e. when running on region crops or textline crops?
I would recommend applying binarization at page level, since the best performance is achieved when you feed the model the real scale (that is, with patches set to true). So what would happen at line level with the patches parameter set to false? The textline region would be resized to the model's input shape, and I am sure performance would be worse.
The same goes for region level. In all cases, patches is better set to true.
@vahidrezanezhad understood – thanks for clarifying!
One follow-up though: Can you give an impression how sensitive these models are to different scales / pixel densities on the input side? (So, if the model has seen, say, only 600 DPI newspapers during training, will it still work on (patches of) 300 DPI books?)
Actually, this tool has no problem with different pixel densities. But for large-scale text like headers (not just any header, I mean really large ones) or images, I had problems getting the desired result. That was because the GT does not include any images or any large-scale texture. But this is also resolved by the combination of 4 models. If you apply all 4 of our models with the sbb binarization tool, you can see that it works for a wide variety of scales. To make a long story short: if you set patches to true and use all 4 models, you can be sure this tool will do something :)
> Actually, this tool has no problem with different pixel densities.
Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?
> But for large-scale text like headers (not just any header, I mean really large ones) or images, I had problems getting the desired result. That was because the GT does not include any images or any large-scale texture. But this is also resolved by the combination of 4 models.
Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.
This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?
Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?
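The ensemble rule described above (foreground wherever *any* model predicted foreground) can be sketched as a simple union over the per-model binarizations. This is an illustrative reconstruction, not the actual `run` method:

```python
import numpy as np

def combine_predictions(binarized_maps):
    """Sketch of the ensemble rule discussed above: a pixel is foreground
    if ANY of the per-model binarizations marked it as foreground.
    Assumes each map is a boolean array with True = foreground."""
    combined = np.zeros_like(binarized_maps[0], dtype=bool)
    for fg in binarized_maps:
        combined |= fg  # pixel-wise union of foreground masks
    return combined
```

A morphologically restricted variant, as with multi-scale Sauvola, would instead gate each coarser mask on overlap with the finer ones before merging.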
> > Actually, this tool has no problem with different pixel densities.
>
> Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?
>
> > But for large-scale text like headers (not just any header, I mean really large ones) or images, I had problems getting the desired result. That was because the GT does not include any images or any large-scale texture. But this is also resolved by the combination of 4 models.
>
> Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.
>
> This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?
>
> Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?
I am not 100% sure, but I think it should work for any DPI. If not, don't worry, we have also developed something for that :)
Yes, you are right, those 4 models were trained with different scales.
At sbb we have already recognized how important scale, and to some degree the DPI of a document, is (we came to this conclusion based on OCR scans used in the OCR community; they were low-DPI, low-scale images, and our sbb-textline-detector had a hard time with layout detection on them). So we decided to train 2 models: first a model to enhance the image (for enhancement we also needed the right scale), and second a model to detect the scale of the document, or rather the number of columns. I am happy to say both are trained and the results are really good. We have integrated these 2 models into our new layout detector and you will see it soon. Again, if our binarization is not working because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.
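The preprocessing pipeline described above can be sketched as follows. All names here are hypothetical placeholders for the models being described, since that code has not been published yet:

```python
def prepare_and_binarize(image, scale_detector, enhancer, binarizer):
    """Hypothetical sketch of the pipeline described above: detect the
    document scale (number of columns), enhance the image accordingly,
    then run the existing binarization. All callables are illustrative."""
    n_columns = scale_detector(image)      # model 2: estimate scale / columns
    enhanced = enhancer(image, n_columns)  # model 1: enhance at the right scale
    return binarizer(enhanced)             # existing 4-model binarization
```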
> So we decided to train 2 models: first a model to enhance the image (for enhancement we also needed the right scale), and second a model to detect the scale of the document, or rather the number of columns. I am happy to say both are trained and the results are really good. We have integrated these 2 models into our new layout detector and you will see it soon. Again, if our binarization is not working because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.
Awesome! Looking forward to seeing the new version of the segmentation, and feeling much better about this knowing that your 4-scaled binarization models will also benefit.
@kba we should probably document all this in our workflow guide or wiki etc.
"default": false
},
"model": {
"description": "models directory.",
Relative paths, too?
The binarizer itself accepts relative paths, but we still have the `chdir(workspace.directory)` in the processor, so relative paths would have to be relative to the workspace :/
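To spell out the caveat: after the processor changes into the workspace directory, a relative `--model` path resolves against the workspace, not against the caller's original cwd. A minimal sketch of that resolution (the helper name is hypothetical):

```python
import os

def resolve_model_dir(model_dir, workspace_dir):
    """Hypothetical helper illustrating the caveat above: because the
    processor chdir()s into the workspace, a relative model path is
    interpreted relative to the workspace directory."""
    if os.path.isabs(model_dir):
        return model_dir  # absolute paths are unaffected by chdir()
    return os.path.normpath(os.path.join(workspace_dir, model_dir))
```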
Okay, so when your resource resolution arrives, this will automatically include the package data directory as search path?
yes
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
setup.py, requirements.txt, Makefile
TODO:
- ~~load the model once and have a stateless `run` method~~ That is not possible because it's 4 models that are run in sequence. But `SbbBinarizer` is now refactored to have only `model_dir` as an instance var; the other args are passed to `run`.
- ~~Setup CI~~ It is set up, but sbb_binarization requires too much RAM for binarizing two pages on Circle CI's free plan (4 GB).