handle errors in zip_input_files #12

bertsky · 2024-08-20T07:34:44Z

See discussion on OCR-D/core#1240

…nput files

…new-processor-api-input-file-errors

…onfig.OCRD_MISSING_INPUT

bertsky · 2024-08-20T07:37:50Z

@kba I tentatively merged #9 here.

@kba @MehmedGIT this effectively undoes some of your recent commits:

Please review (cannot invite you directly).

MehmedGIT · 2024-08-20T10:56:39Z

src/ocrd/processor/base.py

                        if on_error == 'skip':
                            ift[i] = None
                        elif on_error == 'first':
                            pass # keep first match
                        elif on_error == 'last':
                            ift[i] = file_
                        elif on_error == 'abort':
-                            raise ValueError(
-                                "Multiple '%s' matches for page '%s' in fileGrp '%s'." % (
-                                    mimetype, file_.pageId, ifg))
+                            raise NonUniqueInputFile(ifg, file_.pageId, mimetype)
                        else:
                            raise Exception("Unknown 'on_error' strategy '%s'" % on_error)


This should become a separate method. I see the same if checks are repeated below. The constant strings for the on_error strategy could also become an Enum class. The Enum class would also help with the validator and parser in the config.

This should become a separate method. I see the same if checks are repeated below.

We used to have two discerning error messages here. (In one case, a specific MIME type was requested, like PAGE-only or image-only or JSON or whatever, but that was not unique. In another, no MIME type was requested, so the PAGE+images logic should apply but did not establish uniqueness via some PAGE file.)

How would you like to see this refactored into a function, exactly?

The constant strings for the on_error strategy could also become an Enum class.

Agreed.

The Enum class would also help with the validator and parser in the config.

Remember that this concrete setting should not be decided by the user. The config has a different granularity level: what should happen to input files missing (or skipped because of being non-unique).

How would you like to see this refactored into a function, exactly?

Something like:

def handle_errors(ift[i], file_, ifg: str, on_error: str): if on_error == 'skip': ift[i] = None elif on_error == 'first': pass # keep first match elif on_error == 'last': ift[i] = file_ elif on_error == 'abort': raise NonUniqueInputFile(ifg, file_.pageId, None) else: raise Exception("Unknown 'on_error' strategy '%s'" % on_error)

or

def handle_errors(file_, on_error: str, ifg: str) -> Union[None, file_]: if on_error == 'skip': return None elif on_error == 'first': return None elif on_error == 'last': return file_ elif on_error == 'abort': raise NonUniqueInputFile(ifg, file_.pageId, None) else: raise Exception("Unknown 'on_error' strategy '%s'" % on_error)

The only difference I see is the warning logs content which could be preserved outside the method. If different things happened, say on skip and/or abort, then we no longer have the method option. Anyway, nothing important for the current phase of v3.

Ok, sure, we can pass in the differentiating part of the message for NonUniqueInputFile.

But this is problematic:

if on_error == 'skip': return None elif on_error == 'first': return None

How would you call that function?

With skip, we should set the input file to None. With first, we should keep the previous OcrdFile instance.

Right, the return version would not work and the other version requires passing some arguments for just checking the on_error state. Let's leave it as it is.

MehmedGIT · 2024-08-20T10:59:33Z

src/ocrd_utils/config.py

+config.add("OCRD_MISSING_INPUT",
+    description="How to deal with missing input files (for some fileGrp/pageId) during processing [SKIP|ABORT]",
+    default=(True, 'SKIP'),
+    validator=lambda val: val in ['SKIP', 'ABORT'],


What about FIRST and LAST? Or are these supposed to be only internals?

Does not apply here. Remember first and last are only for the internal problem of dealing with non-unique matches. We did not want to bother the user with this level.

- introduce `config.OCRD_MISSING_OUTPUT` and catch exceptions during `process_page_file`: - `ABORT`: re-raise - `SKIP`: ignore and continue with next page - `COPY`: ignore and provide input PAGE-XML as output (with just `@PcGtsId` and `Metadata` added to simulate the processing step) - introduce `config.OCRD_EXISTING_OUTPUT`: - `ABORT`: re-raise FileExistsError - `SKIP`: ignore and continue with next page - `OVERWRITE`: force overwriting the exact output files (instead of removing output files indiscriminately) - 🔥 remove `Workspace.overwrite_mode`, have `--overwrite` merely delegate to `config.OCRD_EXISTING_OUTPUT=OVERWRITE` - introduce `--debug`, just delegate to `config.OCRD_MISSING_OUTPUT=ABORT` - `cli.bashlib.input-files`: delegate everything to `ocrd_cli_wrap_processor` (for CLI handling) and `process_workspace` (for error handling), but override `process_page_file` to (never fail and) print bash-friendly strings for actual processing - update tests, add `test_processor.test_run_output_missing` covering all `OCRD_MISSING_OUTPUT` options and the newly `OCRD_EXISTING_OUTPUT=SKIP`

bertsky · 2024-08-20T22:31:45Z

f00ecda: this covers the remaining aspects of error handling on the processor level as I understand them:

dealing with exceptions during the process_page_file call tree; acting on config.OCRD_MISSING_OUTPUT accordingly; --debug behaviour
avoiding FileExistsError or dealing with it after process_page_file calls; acting on config.OCRD_EXISTING_OUTPUT accordingly; --overwrite behaviour

By @kba's request, I pushed this to the same PR so we can quickly go forward merging everything into new-processor-api for the next release. So please re-review!

MehmedGIT · 2024-08-21T12:37:34Z

Seems I commented on a specific commit and that got lost and is not shown here.

kba

LGTM, only suggestions for cosmetics, functionality-wise good to go IMHO!

kba · 2024-08-21T14:13:34Z

src/ocrd/processor/base.py

+        self.pageId = pageId
+        self.mimetype = mimetype
+        self.message = (f"Could not find input file for fileGrp {fileGrp} "
+                        f"and pageId {pageId} under mimetype {mimetype or 'PAGE+image(s)'}")


Appreciate the specific exceptions. Maybe we should keep all of those in a separate ocrd_utils.exceptions before we release 3.0 stable.

Wouldn't that defeat package encapsulation (and possibly lead to circularity)? If I use stuff from ocrd which throws an exception, which I want to be able to handle, then that should be in ocrd as well.

Is there a case where ocrd module code will be used inside ocrd_utils.exceptions and then that code is imported back to ocrd from ocrd_utils? IMO, no circularity should happen if implemented properly, i.e., just importing the required definitions or methods of a class instead of *. This said dot operations in Python are also more expensive in terms of performance.

import workspace workspace.add_file()

is less efficient than

from workspace import add_file add_file()

due to the extra attribute lookup overhead (usually bad inside loops).

Interesting about the import syntax. I prefer the latter anyway, good to hear that it's also faster.

Wouldn't that defeat package encapsulation (and possibly lead to circularity)? If I use stuff from ocrd which throws an exception, which I want to be able to handle, then that should be in ocrd as well.

The dependency graph is ocrd_utils -> ocrd_models -> ocrd_validators -> ocrd -> ocrd_network. So unless ocrd_utils.exceptions imports something from one of the other packages (like the problem with in #12), there won't be an issue, at least for these simple, builtin-derived exceptions. If we want e.g. JsonSchemaDeprecationWarning in there as well, then we'd have to move it up the package hierarchy, e.g. to ocrd.exceptions. Which would not work for JsonSchemaDeprecationWarning because that is used in ocrd_validators.

It is not important, we just have quite a few custom exceptions and I thought it would be sensible to have them in one place, as many python projects do. But let's put a pin into that discussion until we're ready to release, there's more pressing issues.

src/ocrd/processor/base.py

tests/processor/test_processor.py

Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>

kba · 2024-08-21T16:50:10Z

tests/processor/test_processor.py is missing an from ocrd.processor import NonUniqueInputFile.

bertsky · 2024-08-21T17:52:59Z

tests/processor/test_processor.py is missing an from ocrd.processor import NonUniqueInputFile.

oops – thanks!

kba and others added 6 commits August 19, 2024 14:26

add config.OCRD_DOWNLOAD_INPUT

cd4c96c

Processor.zip_input_files: warning instead of exception for missing i…

785d607

…nput files

Processor.zip_input_files: introduce NonUniqueInputFile exception

b12849d

Processor.process_workspace: zip_input_files w/o require_first

95d3658

Merge remote-tracking branch 'origin/download-files-config-var' into …

2e2bda6

…new-processor-api-input-file-errors

Processor.zip_input_files: introduce MissingInputFile exception and c…

c729841

…onfig.OCRD_MISSING_INPUT

MehmedGIT reviewed Aug 20, 2024

View reviewed changes

MehmedGIT approved these changes Aug 20, 2024

View reviewed changes

bertsky added 2 commits August 21, 2024 00:06

processor parameter decorator: no '{}' default (unnecessary)

577baa5

MehmedGIT approved these changes Aug 21, 2024

View reviewed changes

kba approved these changes Aug 21, 2024

View reviewed changes

bertsky and others added 4 commits August 21, 2024 18:01

ocrd_utils.config: add variables to module docstring

fdd5d16

improve docstrings, re-generate docs

6d87f9e

Processor.zip_input_files: more verbose log msg

9942bbe

Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>

test_processor: test for specific exception

8a584e9

Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>

test_processor: fix missing import

8077d45

bertsky merged commit 6b68f7a into new-processor-api Aug 21, 2024
1 check passed

bertsky deleted the new-processor-api-input-file-errors branch September 23, 2024 15:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle errors in zip_input_files #12

handle errors in zip_input_files #12

bertsky commented Aug 20, 2024 •

edited

Loading

bertsky commented Aug 20, 2024

MehmedGIT Aug 20, 2024

bertsky Aug 20, 2024

MehmedGIT Aug 20, 2024 •

edited

Loading

bertsky Aug 20, 2024

MehmedGIT Aug 20, 2024

MehmedGIT Aug 20, 2024

bertsky Aug 20, 2024

bertsky commented Aug 20, 2024

MehmedGIT commented Aug 21, 2024 •

edited

Loading

kba left a comment

kba Aug 21, 2024

bertsky Aug 21, 2024

MehmedGIT Aug 21, 2024

kba Aug 21, 2024

kba commented Aug 21, 2024

bertsky commented Aug 21, 2024

handle errors in zip_input_files #12

handle errors in zip_input_files #12

Conversation

bertsky commented Aug 20, 2024 • edited Loading

bertsky commented Aug 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MehmedGIT Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bertsky commented Aug 20, 2024

MehmedGIT commented Aug 21, 2024 • edited Loading

kba left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kba commented Aug 21, 2024

bertsky commented Aug 21, 2024

bertsky commented Aug 20, 2024 •

edited

Loading

MehmedGIT Aug 20, 2024 •

edited

Loading

MehmedGIT commented Aug 21, 2024 •

edited

Loading