Improve tooling and add OCR-D CLI #5

Merged
merged 27 commits into from
Oct 22, 2020
47e8dda
:art: clean up README, create proper Makefile
kba Oct 15, 2020
b8310a4
setup.py/requirements.txt/gitignore
kba Oct 15, 2020
150f031
add ocrd-tool.json
kba Oct 15, 2020
71d4440
:art: clean up code
kba Oct 15, 2020
389ef08
put CLI into its own module
kba Oct 15, 2020
ca03844
allow passing image directly, return image on binarize
kba Oct 15, 2020
5909f94
initial OCR-D interface
kba Oct 15, 2020
ee26ebd
implement region/line binarization
kba Oct 15, 2020
6913415
require ocrd
kba Oct 15, 2020
d4c97eb
add OCR-D click interface
kba Oct 15, 2020
a4c0cf0
shut up keras/tensorflow
kba Oct 15, 2020
fabb638
add assets subrepo
kba Oct 15, 2020
12b44af
convert between cv2 and pil, DRY binarizer call
kba Oct 15, 2020
0650bad
image_from_segment: exclude already binarized images
kba Oct 16, 2020
ee82d18
Set page_id from input file
kba Oct 16, 2020
2eec687
Append "binarized" to AlternativeImage/comments
kba Oct 16, 2020
eece89b
patches should be true by default, ht @vahidrezanezhad
kba Oct 16, 2020
b7a6301
fix cv2pil ht @bertsky
kba Oct 16, 2020
e5bc557
change description ht @bertsky @vahidrezanezhad
kba Oct 16, 2020
a1c8f6f
line-level binarization independent of region-level
kba Oct 16, 2020
1fa5812
SbbBinarizer: refactor (variable names, less instance-wide state)
kba Oct 16, 2020
645ec41
ocrd-tool: disable "line" level-of-operation for now
kba Oct 16, 2020
fad7b7a
rewrite "normal" CLI with click
kba Oct 16, 2020
2bc6ccc
replace patches string comparison with use_patches boolean
kba Oct 16, 2020
6543e74
remove "patches" parameter, always use_patches
kba Oct 16, 2020
a24670e
minimal CI setup
kba Oct 16, 2020
84d6b1f
:bug: numpy.ndarray.as{_,}type
kba Oct 16, 2020
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.egg-info
__pycache__
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "repo/assets"]
path = repo/assets
url = https://github.com/OCR-D/assets
4 changes: 4 additions & 0 deletions Makefile
@@ -0,0 +1,4 @@
all: install

install:
pip install .
22 changes: 17 additions & 5 deletions README.md
@@ -1,18 +1,30 @@
# Binarization

> Binarization for document images

## Introduction
This tool performs document image binarization (i.e. transform colour/grayscale to black-and-white pixels) for OCR using multiple trained models.

This tool performs document image binarization (i.e. transform colour/grayscale
to black-and-white pixels) for OCR using multiple trained models.

## Installation

Clone the repository, enter it and run
`./make`

`pip install .`

### Models

Pre-trained models can be downloaded from here:

https://qurator-data.de/sbb_binarization/

## Usage
`sbb_binarize -m <directory with models> -i <image file>
-p <set to true to let the model see the image divided into patches>
-s <directory where the results will be saved>`

```sh
sbb_binarize \
-m <directory with models> \
-i <image file> \
-p <set to true to let the model see the image divided into patches> \
-s <directory where the results will be saved>
```
1 change: 1 addition & 0 deletions ocrd-tool.json
1 change: 1 addition & 0 deletions repo/assets
Submodule assets added at 32fde9
6 changes: 6 additions & 0 deletions requirements.txt
@@ -0,0 +1,6 @@
numpy >= 1.17.0, < 1.19.0
setuptools >= 41
opencv-python-headless
ocrd >= 2.18.0
keras >= 2.3.1, < 2.4
tensorflow >= 1.15, < 1.16
28 changes: 28 additions & 0 deletions sbb_binarize/cli.py
@@ -0,0 +1,28 @@
"""
sbb_binarize CLI
"""

from argparse import ArgumentParser

from .sbb_binarize import SbbBinarizer

def main():
    parser = ArgumentParser()

    parser.add_argument('-i', '--image', default=None, help='image.')
    parser.add_argument('-p', '--patches', default=False, help='by setting this parameter to true you let the model to see the image in patches.')
    parser.add_argument('-s', '--save', default=False, help='save prediction with a given name here. The name and format should be given (outputname.tif).')
    parser.add_argument('-m', '--model', default=None, help='models directory.')

    options = parser.parse_args()

    binarizer = SbbBinarizer(
        image_path=options.image,
        model=options.model,
        patches=options.patches,
        save=options.save
    )
    binarizer.run()

if __name__ == "__main__":
    main()
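
A caveat with the `-p/--patches` option as declared above: without a `type`, argparse stores whatever string follows `-p`, so `-p false` yields the non-empty (and therefore truthy) string `'false'`. The sketch below shows one possible way to parse the flag as a real boolean. `str2bool` and `build_parser` are hypothetical names used for illustration and are not part of this PR.

```python
from argparse import ArgumentParser, ArgumentTypeError

def str2bool(value):
    """Hypothetical helper: map common true/false spellings to a bool."""
    normalized = value.strip().lower()
    if normalized in ('true', 't', 'yes', 'y', '1'):
        return True
    if normalized in ('false', 'f', 'no', 'n', '0'):
        return False
    raise ArgumentTypeError('expected a boolean, got %r' % value)

def build_parser():
    parser = ArgumentParser()
    # type=str2bool converts the raw command-line string before it is stored
    parser.add_argument('-p', '--patches', type=str2bool, default=False)
    return parser
```

With this, `build_parser().parse_args(['-p', 'false']).patches` is the boolean `False` rather than a string.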
32 changes: 32 additions & 0 deletions sbb_binarize/ocrd-tool.json
@@ -0,0 +1,32 @@
{
  "version": "0.0.1",
  "git_url": "https://github.com/qurator-spk/sbb_binarization",
  "tools": {
    "ocrd-sbb-binarize": {
      "executable": "ocrd-sbb-binarize",
      "description": "Smart binarization with sbb_binarization",
      "categories": ["Image preprocessing"],
      "steps": ["preprocessing/optimization/binarization"],
      "input_file_grp": [],
      "output_file_grp": [],
      "parameters": {
        "operation_level": {
          "type": "string",
          "enum": ["page", "region", "line"],
          "default": "page",
          "description": "PAGE XML hierarchy level to operate on"
        },
        "patches": {
          "description": "by setting this parameter to true you let the model to see the image in patches.",
Contributor

It would be helpful to document the ramifications of this choice here (speed vs quality?)

Contributor Author

@vahidrezanezhad What are the consequences of using the "patches" approach?

Member

> @vahidrezanezhad What are the consequences of using the "patches" approach?

In the case of binarization, this parameter should be set to "True"; otherwise the model will see the whole image at once and the quality will be terrible.

Contributor

Ok, then I'd recommend against exposing this parameter at all.

Also, how about setting it to False on region (or line) level, i.e. when running on region crops or textline crops?

Member

> Ok, then I'd recommend against exposing this parameter at all.
>
> Also, how about setting it to False on region (or line) level, i.e. when running on region crops or textline crops?

I would recommend applying binarization on the page region, since the best performance is achieved when you feed the model at real scale (that is, with patches set to "True"). So if you apply this at line level with the patches parameter set to False, what happens? The textline region will be resized to the model's input shape, and I am sure that performance will be worse.
The same holds at region level. In all cases it is better to set the patches parameter to True.
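
To illustrate the distinction made in this comment: with patches, the model sees fixed-size windows at the image's native scale, whereas without patches the whole image is resized to the input shape. The following is a minimal pure-Python sketch of computing such windows. The name `patch_boxes`, the 224-pixel patch size and the edge handling are assumptions for illustration, not taken from the sbb_binarization code.

```python
def patch_boxes(width, height, patch=224):
    """Return (x0, y0, x1, y1) windows of size patch x patch covering the image.

    Windows at the right/bottom edge are shifted back inside the image,
    so they overlap their neighbours instead of being rescaled.
    Images smaller than one patch get a single, smaller window.
    """
    def starts(size):
        if size <= patch:
            return [0]
        positions = list(range(0, size - patch, patch))
        positions.append(size - patch)  # last window flush with the edge
        return positions

    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

Every window keeps the original pixel scale, which is the property the comment attributes to the patches mode.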

Member

> @vahidrezanezhad understood – thanks for clarifying!
>
> One follow-up though: Can you give an impression how sensitive these models are to different scales / pixel densities on the input side? (So, if the model has seen, say, only 600 DPI newspapers during training, will it still work on (patches of) 300 DPI books?)

Actually, this tool has no problem with different pixel densities. But for large-scale text like headers (not just any header, I mean really large ones) or images, I had problems getting my desired result. This was because the GT does not include any images or any large-scale texture. But this is also resolved by combining 4 models. If you apply all 4 of our models with the sbb binarization tool, you can see that it mostly works for a big variety of scales. To make a long story short: if you set patches to True and use all 4 models, you can be sure that this tool will do something :)

Contributor

@bertsky bertsky Oct 16, 2020

> Actually, this tool has no problem with different pixel densities.

Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?

> But for the large scale texts like headers (Not any header I mean really large scale ones) or images I had problem to get my desired result. It was because of GT which does not include any images and any large scale texture. But this is also resolved with combination of 4 models.

Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.

This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?

Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?
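
The combination rule described in this comment (a pixel is foreground if any of the models predicted foreground) can be sketched as follows. This is an illustration on nested lists of 0/1 values, with `combine_foreground` as a hypothetical name; it is not the code of the `run` method.

```python
def combine_foreground(predictions):
    """Union of binary masks: 1 where any prediction is 1, else 0.

    predictions: non-empty list of equally sized 2-D lists of 0/1 values,
    one mask per model.
    """
    rows, cols = len(predictions[0]), len(predictions[0][0])
    return [
        [int(any(mask[r][c] for mask in predictions)) for c in range(cols)]
        for r in range(rows)
    ]
```

On NumPy arrays the same rule is an element-wise logical OR across the per-model masks.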

Member

> > Actually, this tool has no problem with different pixel densities.
>
> Okay, even if they differ widely, like using the model(s) on 100 or 1200 DPI scans?
>
> > But for the large scale texts like headers (Not any header I mean really large scale ones) or images I had problem to get my desired result. It was because of GT which does not include any images and any large scale texture. But this is also resolved with combination of 4 models.
>
> Oh, now I get it, that's why the run method predicts on multiple models and then assigns foreground wherever any of them predicted fg. So these models actually were trained on multiple scales.
>
> This is reminiscent of multi-scale Sauvola running on different image resolutions, and then re-combining them morphologically. Do you think there would be merit in having such morphological restrictions (and a well-defined order from largest to smallest) here as well?
>
> Also, how does this interface (practically) with text/non-text segmentation? If images are somewhat problematic, should they be detected on the raw image and then removed from the binarized image before entering sbb_textline_detector?

I am not 100% sure, but I think it should work for any DPI. If not, do not worry, we have also developed something for that :)

Yes, you are right, those 4 models are trained at different scales.

At SBB we have already recognized how important the scale, and to some extent the DPI, of documents is (we got to this point based on ocr-scans documents which were used with ocr society; they were low-DPI, low-scale images, and our sbb-textline-detector had a hard time with layout detection). So we decided to train 2 models: first, a model to enhance the image (for enhancing we also needed the right scale), and second, a model to detect the scale of the document, or better to say, the number of columns in the document. I am happy to say both are trained and the results are really good. We have integrated these 2 models into our new layout detector and you will see it soon. Again, if our binarization is not working because of the scale or quality of a document, we can now first prepare the document with those models and then feed it to the binarization model.

Contributor

> So we decided to train 2 models. First a model to enhance image (for enhancing we also needed a right scale) and second a model to detect scale of document or better to say number of columns in document. I am happy to say both are trained and results are really good. We have integrated this 2 models in our new layout detector and you will see it soon. Again, if our binarization is not working because of scale or quality of document, now we can first prepare document with those model and then feed it to binarization model.

Awesome! Looking forward to seeing the new version of the segmentation, and feeling much better about this knowing that your 4-scaled binarization models will also benefit.

Contributor

@kba we should probably document all this in our workflow guide or wiki etc.

          "type": "boolean",
          "default": false
        },
        "model": {
          "description": "models directory.",
Contributor

Relative paths, too?

Contributor Author

The binarizer itself accepts relative paths, but we still have the chdir(workspace.directory) in the processor, so relative paths would have to be relative to the workspace :/
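
One way to sidestep the working-directory dependence described in this comment is to resolve the model path to an absolute path before any `chdir` happens. The following is a minimal sketch of that idea; `resolve_model_path` is a hypothetical helper and is not part of OCR-D or of this PR.

```python
import os

def resolve_model_path(path, base_dir=None):
    """Return an absolute path.

    Relative paths are anchored at base_dir, or at the working directory
    at call time when base_dir is not given.
    """
    if os.path.isabs(path):
        return path
    return os.path.abspath(os.path.join(base_dir or os.getcwd(), path))
```

Calling this once at startup, before the processor changes directory, keeps a relative path pointing at the location the user meant.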

Contributor

Okay, so when your resource resolution arrives, this will automatically include the package data directory as search path?

Contributor Author

yes

          "type": "string",
          "required": true
        }
      }
    }
  }
}
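
To show how a parameter specification like the one above can be consumed, here is a minimal sketch that fills in defaults and checks the `enum` and `required` constraints. `apply_parameter_spec` and the trimmed-down `SPEC` are hypothetical and for illustration only; OCR-D core ships its own validation, which this does not reproduce.

```python
import json

# Trimmed copy of the "parameters" object from the tool description above
SPEC = json.loads('''
{
  "operation_level": {"type": "string", "enum": ["page", "region", "line"], "default": "page"},
  "patches": {"type": "boolean", "default": false},
  "model": {"type": "string", "required": true}
}
''')

def apply_parameter_spec(spec, given):
    """Return a copy of `given` with defaults applied; raise ValueError on violations."""
    result = dict(given)
    for name, rules in spec.items():
        if name not in result:
            if rules.get('required'):
                raise ValueError('missing required parameter: %s' % name)
            if 'default' in rules:
                result[name] = rules['default']
        elif 'enum' in rules and result[name] not in rules['enum']:
            raise ValueError('%s must be one of %s' % (name, rules['enum']))
    return result
```

For example, passing only `{'model': '/models'}` yields a dictionary that also contains `operation_level='page'` and `patches=False`.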
129 changes: 129 additions & 0 deletions sbb_binarize/ocrd_cli.py
@@ -0,0 +1,129 @@
# TODO: AlternativeImage 'binarized' comment should be additive

import os.path
from pkg_resources import resource_string
from json import loads

from PIL import Image
import numpy as np
import cv2
from click import command

from ocrd_utils import (
    getLogger,
    assert_file_grp_cardinality,
    make_file_id,
    MIMETYPE_PAGE
)
from ocrd import Processor
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import AlternativeImageType, to_xml
from ocrd.decorators import ocrd_cli_options, ocrd_cli_wrap_processor

from .sbb_binarize import SbbBinarizer

OCRD_TOOL = loads(resource_string(__name__, 'ocrd-tool.json').decode('utf8'))
TOOL = 'ocrd-sbb-binarize'

def cv2pil(img):
    color_converted = cv2.cvtColor(img, cv2.COLOR_BGRA2RGB)
    return Image.fromarray(color_converted)

def pil2cv(img):
    # from ocrd/workspace.py
    color_conversion = cv2.COLOR_GRAY2BGR if img.mode in ('1', 'L') else cv2.COLOR_RGB2BGR
    pil_as_np_array = np.array(img).astype('uint8') if img.mode == '1' else np.array(img)
    return cv2.cvtColor(pil_as_np_array, color_conversion)

class SbbBinarizeProcessor(Processor):

    def __init__(self, *args, **kwargs):
        kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
        kwargs['version'] = OCRD_TOOL['version']
        super().__init__(*args, **kwargs)

    def _run_binarizer(self, img):
        return cv2pil(
            SbbBinarizer(
                image=pil2cv(img),
                model=self.model_path,
                patches=self.use_patches,
                save=None).run())

    def process(self):
        """
        Binarize with sbb_binarization
        """
        LOG = getLogger('processor.SbbBinarize')
        assert_file_grp_cardinality(self.input_file_grp, 1)
        assert_file_grp_cardinality(self.output_file_grp, 1)

        oplevel = self.parameter['operation_level']
        self.use_patches = self.parameter['patches'] # pylint: disable=attribute-defined-outside-init
        self.model_path = self.parameter['model'] # pylint: disable=attribute-defined-outside-init

        for n, input_file in enumerate(self.input_files):
            file_id = make_file_id(input_file, self.output_file_grp)
            page_id = input_file.pageId or input_file.ID
            LOG.info("INPUT FILE %i / %s", n, page_id)
            pcgts = page_from_file(self.workspace.download_file(input_file))
            self.add_metadata(pcgts)
            pcgts.set_pcGtsId(file_id)
            page = pcgts.get_Page()
            # fetch the page image before branching: the region and line branches use it too
            page_image, page_xywh, _ = self.workspace.image_from_page(page, page_id)

            if oplevel == 'page':
                LOG.info("Binarizing on 'page' level in page '%s'", page_id)
                bin_image = self._run_binarizer(page_image)
                # update METS (add the image file):
                bin_image_path = self.workspace.save_image_file(bin_image,
                        file_id + '.IMG-BIN',
                        page_id=page_id,
                        file_grp=self.output_file_grp)
                # the keyword is `comments`, as in the region and line branches
                page.add_AlternativeImage(AlternativeImageType(filename=bin_image_path, comments="binarized"))

            else:
                regions = page.get_AllRegions(['Text', 'Table'])
                if not regions:
                    LOG.warning("Page '%s' contains no text/table regions", page_id)

                for region in regions:
                    region_image, region_xywh = self.workspace.image_from_segment(region, page_image, page_xywh)

                    if oplevel == 'region':
                        region_image_bin = self._run_binarizer(region_image)
                        region_image_bin_path = self.workspace.save_image_file(
                                region_image_bin,
                                "%s_%s.IMG-BIN" % (file_id, region.id),
                                page_id=page_id,
                                file_grp=self.output_file_grp)
                        region.add_AlternativeImage(
                            AlternativeImageType(filename=region_image_bin_path, comments='binarized'))

                    elif oplevel == 'line':
                        lines = region.get_TextLine()
                        if not lines:
                            LOG.warning("Page '%s' region '%s' contains no text lines", page_id, region.id)
                        for line in lines:
                            line_image, line_xywh = self.workspace.image_from_segment(line, page_image, page_xywh)
                            line_image_bin = self._run_binarizer(line_image)
                            line_image_bin_path = self.workspace.save_image_file(
                                    line_image_bin,
                                    "%s_%s_%s.IMG-BIN" % (file_id, region.id, line.id),
                                    page_id=page_id,
                                    file_grp=self.output_file_grp)
                            line.add_AlternativeImage(
                                AlternativeImageType(filename=line_image_bin_path, comments='binarized'))

            self.workspace.add_file(
                ID=file_id,
                file_grp=self.output_file_grp,
                pageId=input_file.pageId,
                mimetype=MIMETYPE_PAGE,
                local_filename=os.path.join(self.output_file_grp, file_id + '.xml'),
                content=to_xml(pcgts))

@command()
@ocrd_cli_options
def cli(*args, **kwargs):
    return ocrd_cli_wrap_processor(SbbBinarizeProcessor, *args, **kwargs)
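
The TODO at the top of this file asks for the `binarized` marker to be additive rather than replacing whatever an `AlternativeImage` already carries in its comments. A minimal sketch of that idea follows, assuming the comments are a comma-separated feature list; `add_feature` is a hypothetical helper name and not part of this PR.

```python
def add_feature(comments, feature='binarized'):
    """Append `feature` to a comma-separated comments string, without duplicates."""
    features = [f for f in (comments or '').split(',') if f]
    if feature not in features:
        features.append(feature)
    return ','.join(features)
```

The result could then be passed as the `comments` argument when the new `AlternativeImageType` is created, so that features recorded for the source image are carried over.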