
Add PaddleDetection-based Layout Model #54

Merged: 15 commits merged into Layout-Parser:master on Aug 17, 2021

Conversation


@an1018 an1018 commented Aug 6, 2021

Hi, I have reorganized the file structure of Paddle.
Please review. Thanks!

setup.py Outdated
@@ -27,12 +27,22 @@
"torch",
"torchvision",
"iopath",
"tqdm",


Please remove this one; installing paddleocr will automatically install this module.


if not enforce_cpu:
    # initial GPU memory (MB), device ID
    config.enable_use_gpu(200, 0)


Please change this to 2000.

@lolipopshock lolipopshock changed the title reorganize paddle structure Add PaddleDetection-based Layout Model Aug 6, 2021
@lolipopshock
Member

Thanks for this PR. Here are some general questions/thoughts:

  • Please remove the OCR part from this PR.
  • Please use PEP8 as the variable naming convention -- you could use black and pylint for checking.
  • For the preprocess.py and layoutmodel.py files:
    1. I think they contain a lot of unnecessarily complicated design -- can we simplify the functions to minimal complexity?
    2. And more fundamentally, why should the preprocess function live within layoutparser rather than paddledetection? I think it's paddledetection's obligation to do the preprocessing for the models.
  • Also, could you add tests for the paddle modules? You could follow the examples in test_model (a rough sketch follows below).
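
For illustration, a minimal test along those lines might look like this (the lp:// config path and the fixture image path are assumptions; mirror the actual entries used in tests/test_model.py):

import cv2
import layoutparser as lp


def test_paddle_detection_layout_model():
    # Hypothetical catalog entry; substitute the real config path.
    model = lp.PaddleDetectionLayoutModel(
        "lp://PubLayNet/ppyolov2_r50vd_dcn_365e/config"
    )
    # Hypothetical fixture; reuse the image from the existing tests.
    image = cv2.imread("tests/fixtures/model/test_model_image.jpg")
    layout = model.detect(image)
    # Every detected element should be a TextBlock inside a Layout.
    assert isinstance(layout, lp.Layout)
    assert all(isinstance(block, lp.TextBlock) for block in layout)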

@an1018
Contributor Author

an1018 commented Aug 7, 2021

> Thanks for this PR. Here are some general questions/thoughts:
>
> * Please remove the OCR part from this PR.
> * Please use PEP8 as the variable naming convention -- you could use `black` and `pylint` for checking.
> * For the `preprocess.py` and `layoutmodel.py` files:
>   1. I think they contain a lot of unnecessarily complicated design -- can we simplify the functions to minimal complexity?
>   2. And more fundamentally, why should the `preprocess` function live within `layoutparser` rather than `paddledetection`? I think it's `paddledetection`'s obligation to do the preprocessing for the models.
> * Also, could you add tests for the paddle modules? You could follow the examples in [`test_model`](https://github.com/Layout-Parser/layout-parser/blob/master/tests/test_model.py).

Thanks for your reply, I'll modify it later!

@littletomatodonkey

> Thanks for this PR. Here are some general questions/thoughts:
>
> • Please remove the OCR part from this PR.
> • Please use PEP8 as the variable naming convention -- you could use black and pylint for checking.
> • For the preprocess.py and layoutmodel.py files:
>   1. I think they contain a lot of unnecessarily complicated design -- can we simplify the functions to minimal complexity?
>   2. And more fundamentally, why should the preprocess function live within layoutparser rather than paddledetection? I think it's paddledetection's obligation to do the preprocessing for the models.
> • Also, could you add tests for the paddle modules? You could follow the examples in test_model.

Hi, for the preprocess.py and layoutmodel.py files:

  1. we decouple the preprocessing from the inference engine because it makes the code clearer.
  2. we extract the preprocessing code to reduce the dependency on paddledetection; otherwise users might have to download the whole repo, which is too heavy.

@an1018
Contributor Author

an1018 commented Aug 8, 2021

Hi, I've modified the code, including:

  • remove the OCR part

  • use pylint for checking
    (For the __init__.py and catalog.py files: these two had only minor modifications, so they were not checked.)

  • add tests for the paddle modules in test_model

setup.py Outdated
@@ -33,6 +33,15 @@
'google-cloud-vision==1',
'pytesseract'
],
"paddleocr": [
Member

Please don't change anything here - I'll modify it later myself.

Contributor Author

Done

@@ -11,11 +11,13 @@

from .ocr import (
GCVFeatureType, GCVAgent,
TesseractFeatureType, TesseractAgent
TesseractFeatureType, TesseractAgent,
PaddleocrAgent
Member

Please don't add OCR Agent in this PR -- right now we focus on LayoutModel.

Contributor Author

Done

@@ -86,7 +86,6 @@
0: "Table"
},
}
# fmt: on
Member

Why was this removed?

Contributor Author

Sorry, I've modified it back.

}

# fmt: off
LABEL_MAP_CATALOG = {
Member

Please remove the labeling mappings for unused datasets.

Contributor Author

Done

# fmt: on


class LayoutParserDetectron2ModelHandler(PathHandler):
Member

Change class name to LayoutParserPaddleModelHandler.

Contributor Author

Done

trt_min_shape=1,
trt_max_shape=1280,
trt_opt_shape=640,
min_subgraph_size=3):
Member

And is there any documentation, in English, on how to set and adjust these parameters? If not, I would suggest removing them.

Contributor Author

Done

enable_mkldnn=True,
thread_num=10,
min_subgraph_size=3,
use_dynamic_shape=False,
Member

And apparently some of the hyperparameters/configurations are not used. Please remove.

Contributor Author

Done

inputs = self.create_inputs(im, im_info)
return inputs

def postprocess(self, np_boxes, np_masks, inputs):
@lolipopshock lolipopshock (Member) Aug 6, 2021

I think we need to merge postprocess with gather_output. It's not reasonable to break them into two functions.
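
For illustration, the merged function might look roughly like this (assuming PaddleDetection's usual [class_id, score, x1, y1, x2, y2] row layout for np_boxes; the exact names are illustrative):

# Assumes: from layoutparser.elements import Layout, Rectangle, TextBlock
def postprocess(self, np_boxes):
    """Convert raw detector output directly into a layoutparser Layout."""
    layout = Layout()
    for class_id, score, x1, y1, x2, y2 in np_boxes:
        if score < self.threshold:
            continue  # drop low-confidence detections
        layout.append(
            TextBlock(
                Rectangle(x1, y1, x2, y2),
                type=self.label_map.get(int(class_id), int(class_id)),
                score=score,
            )
        )
    return layout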

Contributor Author

I've merged them, thanks!


if not enforce_cpu:
    # initial GPU memory (MB), device ID
    config.enable_use_gpu(2000, 0)
Member

Why is it 2000 rather than another value? Please leave a note in the comment. Thanks.
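
For example, the note could be as simple as this (the stated rationale is illustrative, not authoritative):

if not enforce_cpu:
    # Reserve an initial GPU memory pool of 2000 MB on device 0:
    # an empirical default that comfortably fits the layout models;
    # adjust downwards for smaller GPUs.
    config.enable_use_gpu(2000, 0)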

Contributor Author

Done

setup.py Outdated
@@ -33,6 +33,12 @@
'google-cloud-vision==1',
'pytesseract'
],
"paddlepaddle": [
Member

Sorry for not being clear earlier. Please remove all the new extras_require, thanks.

Contributor Author

Got it.

@@ -10,7 +10,7 @@ class DropboxHandler(HTTPURLHandler):
"""

def _get_supported_prefixes(self):
return ["https://www.dropbox.com"]
return ["https://www.dropbox.com","https://paddle-model-ecology.bj.bcebos.com"]
@lolipopshock lolipopshock (Member) Aug 11, 2021

I don't think this is a good idea -- perhaps it's better to create something like class PaddleModelURLHandler(HTTPURLHandler): in the paddle model catalog.py file.

Member

Please see the comments below for more detailed instructions.

layout = self.gather_output(np_boxes, np_masks)
return layout

def untar_files(self, model_tar, model_dir):
Member

It feels counter-intuitive to put this function inside the modeling folder; I think it's better to include this part inside the PathManager, by rewriting the _get_local_path function. So basically it will be something like this:
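
# Assumes the surrounding catalog.py already imports os, logging, uuid,
# urlparse (from urllib.parse), and iopath's HTTPURLHandler, get_cache_dir,
# file_lock, and download helpers.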

class PaddleModelURLHandler(HTTPURLHandler):
    """
    Supports download and file check for Paddle model links
    """

    def _get_supported_prefixes(self):
        return ["https://paddle-model-ecology.bj.bcebos.com"]

    def _isfile(self, path):
        return path in self.cache_map
    
    def _get_local_path(
        self,
        path: str,
        force: bool = False,
        cache_dir: Optional[str] = None,
        **kwargs: Any,
    ) -> str:
        """
        This implementation downloads the remote resource and caches it locally.
        The resource will only be downloaded if not previously requested.
        """
        self._check_kwargs(kwargs)
        if (
            force
            or path not in self.cache_map
            or not os.path.exists(self.cache_map[path])
        ):
            logger = logging.getLogger(__name__)
            parsed_url = urlparse(path)
            dirname = os.path.join(
                get_cache_dir(cache_dir), os.path.dirname(parsed_url.path.lstrip("/"))
            )
            filename = path.split("/")[-1]
            if len(filename) > self.MAX_FILENAME_LEN:
                filename = filename[:100] + "_" + uuid.uuid4().hex

            cached = os.path.join(dirname, filename)
            with file_lock(cached):
                if not os.path.isfile(cached):
                    logger.info("Downloading {} ...".format(path))
                    cached = download(path, dirname, filename=filename)
                
#### ----> Add from here 
                if path.endswith(".tar"):
                    untar_function() 

            logger.info("URL {} cached in {}".format(path, cached))
            self.cache_map[path] = cached
        return self.cache_map[path]

You might also need to check whether the untar'ed version exists beforehand to avoid duplicated downloading.
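
For example, a guard along these lines inside _get_local_path, placed before the download is triggered (illustrative only; reuses the cached naming from the snippet above):

if path.endswith(".tar"):
    untarred_dir = cached[: -len(".tar")]
    if os.path.isdir(untarred_dir):
        # Extracted files already present; skip re-downloading the archive.
        self.cache_map[path] = untarred_dir
        return untarred_dir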

image, im_info = permute(image, im_info)

inputs = {}
inputs['image'] = np.array((image, )).astype('float32')
@lolipopshock lolipopshock (Member) Aug 11, 2021

This feels weird -- probably it's better to have np.array(image)[np.newaxis, :].astype('float32').

config of model, defined by `Config(model_dir)`
model_path (str):
The path to the saved weights of the model.
threshold (float):
@lolipopshock lolipopshock (Member) Aug 11, 2021

Don't forget to clean up the docstrings -- and you can specify what the extra configs are under the extra_configs block.

config_path = self._reconstruct_path_with_detector_name(config_path)
model_tar = PathManager.get_local_path(config_path)

pre_dir = os.path.dirname(model_tar)
Member

Please remove this -- see the details below.

thread_num=extra_config.get('thread_num',10))

self.threshold = extra_config.get('threshold',0.5)
self.input_shape = extra_config.get('input_shape',[3,640,640])
Member

As input_shape is in self.im_info, you might want to remove this variable.

self.threshold = extra_config.get('threshold',0.5)
self.input_shape = extra_config.get('input_shape',[3,640,640])
self.label_map = label_map
self.im_info = {
@lolipopshock lolipopshock (Member) Aug 11, 2021

So basically this is something like default_image_info, right? It's better to use a more descriptive name than im_info, which is also the output of the preprocessors.

im_info (dict): info of processed image
"""
image = image.astype(np.float32, copy=False)
mean = np.array(mean)[np.newaxis, np.newaxis, :]
@lolipopshock lolipopshock (Member) Aug 11, 2021

Why initialize the arrays like this every time? We can just initialize mean and std once as the appropriate ndarrays.
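
i.e., something like this in the initializer, computed once (the attribute names and the ImageNet statistics are placeholders; use the model's actual values):

# Broadcastable over (H, W, C) images without per-call reshaping.
self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)[np.newaxis, np.newaxis, :]
self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)[np.newaxis, np.newaxis, :]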

inputs (dict): input of model
"""
# read rgb image
image, im_info = decode_image(image, self.im_info)
Member

I don't like the design here:

  1. In the current setting, self.im_info will be changed for every image input, which doesn't make sense.
  2. im_info is only changed in the resize function, and decode_image doesn't change the image at all - can we further simplify the APIs?

For example, an alternative could be:

image = image.copy()

input_shape = np.array(image.shape[:2], dtype=np.float32)
image, scale_factor = resize(image) # change the return value 
image = (image - self.extra_config['pixel_mean']) \
    / self.extra_config['pixel_std']  # may need to change how the values are loaded
image = image.transpose((2, 0, 1)).copy()  # The model requires channel-first input

model_input_image_shape = np.array(image.shape[1:], dtype=np.float32)  # (H, W) after the transpose

image_info = {
    'scale_factor': scale_factor,
    'im_shape': model_input_image_shape,
    'input_shape': input_shape,
}

@lolipopshock
Member

lolipopshock commented Aug 11, 2021

Thanks for the updates! The current version is better than the previous one, but I think the code could be further simplified and more structured -- please check the comments for possible updates.

@lolipopshock
Member

I've also carefully checked the code, and added some changes. Some key updates:

  1. improve the downloading logic -- it will remove the tar file after downloading and check for the existence of the untar'ed files (a rough sketch below).
  2. further simplify layoutmodel.py, removing unnecessary variables and definitions.
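
The tar handling is roughly along these lines (simplified sketch; the helper name is illustrative):

import os
import tarfile


def _untar_if_needed(tar_path: str) -> str:
    """Extract a downloaded model archive, then delete the tar file.

    Both steps are skipped if the extracted directory already exists.
    """
    target_dir = tar_path[: -len(".tar")]
    if not os.path.isdir(target_dir):
        with tarfile.open(tar_path) as tar:
            tar.extractall(target_dir)
        os.remove(tar_path)  # reclaim disk space once extracted
    return target_dir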

Please note: in the previous version you mentioned that the paddle model may support specifying batch_size during inference; however, that variable was never used.

And in the future we might need to move the paddle model to dropbox/aws, as the downloading speed is ~2MB/s in the US, taking around 2~3 minutes to download the whole model, which might be suboptimal.

Otherwise I think this PR is ready to be merged - please take a quick look and let me know if the new changes are OK. Thanks for the great work!

@littletomatodonkey

> I've also carefully checked the code, and added some changes. Some key updates:
>
> 1. improve the downloading logic -- it will remove the tar file after downloading and check for the existence of the untar'ed files.
> 2. further simplify layoutmodel.py, removing unnecessary variables and definitions.
>
> Please note: in the previous version you mentioned that the paddle model may support specifying batch_size during inference; however, that variable was never used.
>
> And in the future we might need to move the paddle model to dropbox/aws, as the downloading speed is ~2MB/s in the US, taking around 2~3 minutes to download the whole model, which might be suboptimal.
>
> Otherwise I think this PR is ready to be merged - please take a quick look and let me know if the new changes are OK. Thanks for the great work!

Hi, thanks for your new changes, which further simplify the code.

  • For the model download source, it is OK to support dropbox/aws, but we hope the Baidu source remains the preferred choice, since there are also many developers in China.
  • Batch size is not used in the code; it was just designed for future expansion. Thanks for checking.
  • The changes are OK for me. Thanks again for your careful review!

@an1018
Contributor Author

an1018 commented Aug 17, 2021

The changes are OK for me, too.

Through this repo and the whole modification process, I also learned a lot. Thank you very much!

@lolipopshock
Member

> > I've also carefully checked the code, and added some changes. Some key updates:
> >
> > 1. improve the downloading logic -- it will remove the tar file after downloading and check for the existence of the untar'ed files.
> > 2. further simplify layoutmodel.py, removing unnecessary variables and definitions.
> >
> > Please note: in the previous version you mentioned that the paddle model may support specifying batch_size during inference; however, that variable was never used.
> > And in the future we might need to move the paddle model to dropbox/aws, as the downloading speed is ~2MB/s in the US, taking around 2~3 minutes to download the whole model, which might be suboptimal.
> > Otherwise I think this PR is ready to be merged - please take a quick look and let me know if the new changes are OK. Thanks for the great work!
>
> Hi, thanks for your new changes, which further simplify the code.
>
> • For the model download source, it is OK to support dropbox/aws, but we hope the Baidu source remains the preferred choice, since there are also many developers in China.
> • Batch size is not used in the code; it was just designed for future expansion. Thanks for checking.
> • The changes are OK for me. Thanks again for your careful review!

I think ultimately we'll provide better downloading support, e.g., mirrored download sites or something like what's mentioned in #43. At that point, people from different regions can choose the best download method accordingly.

@lolipopshock lolipopshock merged commit 035f66a into Layout-Parser:master Aug 17, 2021