-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle errors in zip_input_files #12
handle errors in zip_input_files #12
Conversation
…new-processor-api-input-file-errors
…onfig.OCRD_MISSING_INPUT
@kba I tentatively merged #9 here. @kba @MehmedGIT this effectively undoes some of your recent commits: Please review (cannot invite you directly). |
if on_error == 'skip': | ||
ift[i] = None | ||
elif on_error == 'first': | ||
pass # keep first match | ||
elif on_error == 'last': | ||
ift[i] = file_ | ||
elif on_error == 'abort': | ||
raise ValueError( | ||
"Multiple '%s' matches for page '%s' in fileGrp '%s'." % ( | ||
mimetype, file_.pageId, ifg)) | ||
raise NonUniqueInputFile(ifg, file_.pageId, mimetype) | ||
else: | ||
raise Exception("Unknown 'on_error' strategy '%s'" % on_error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should become a separate method. I see the same if checks are repeated below. The constant strings for the on_error
strategy could also become an Enum class. The Enum class would also help with the validator and parser in the config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should become a separate method. I see the same if checks are repeated below.
We used to have two discerning error messages here. (In one case, a specific MIME type was requested, like PAGE-only or image-only or JSON or whatever, but that was not unique. In another, no MIME type was requested, so the PAGE+images logic should apply but did not establish uniqueness via some PAGE file.)
How would you like to see this refactored into a function, exactly?
The constant strings for the
on_error
strategy could also become an Enum class.
Agreed.
The Enum class would also help with the validator and parser in the config.
Remember that this concrete setting should not be decided by the user. The config has a different granularity level: what should happen to input files missing (or skipped because of being non-unique).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you like to see this refactored into a function, exactly?
Something like:
def handle_errors(ift[i], file_, ifg: str, on_error: str):
if on_error == 'skip':
ift[i] = None
elif on_error == 'first':
pass # keep first match
elif on_error == 'last':
ift[i] = file_
elif on_error == 'abort':
raise NonUniqueInputFile(ifg, file_.pageId, None)
else:
raise Exception("Unknown 'on_error' strategy '%s'" % on_error)
or
def handle_errors(file_, on_error: str, ifg: str) -> Union[None, file_]:
if on_error == 'skip':
return None
elif on_error == 'first':
return None
elif on_error == 'last':
return file_
elif on_error == 'abort':
raise NonUniqueInputFile(ifg, file_.pageId, None)
else:
raise Exception("Unknown 'on_error' strategy '%s'" % on_error)
The only difference I see is the warning logs content which could be preserved outside the method. If different things happened, say on skip and/or abort, then we no longer have the method option. Anyway, nothing important for the current phase of v3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sure, we can pass in the differentiating part of the message for NonUniqueInputFile.
But this is problematic:
if on_error == 'skip': return None elif on_error == 'first': return None
How would you call that function?
With skip, we should set the input file to None. With first, we should keep the previous OcrdFile instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the return version would not work and the other version requires passing some arguments for just checking the on_error state. Let's leave it as it is.
config.add("OCRD_MISSING_INPUT", | ||
description="How to deal with missing input files (for some fileGrp/pageId) during processing [SKIP|ABORT]", | ||
default=(True, 'SKIP'), | ||
validator=lambda val: val in ['SKIP', 'ABORT'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about FIRST and LAST? Or are these supposed to be only internals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not apply here. Remember first
and last
are only for the internal problem of dealing with non-unique matches. We did not want to bother the user with this level.
- introduce `config.OCRD_MISSING_OUTPUT` and catch exceptions during `process_page_file`: - `ABORT`: re-raise - `SKIP`: ignore and continue with next page - `COPY`: ignore and provide input PAGE-XML as output (with just `@PcGtsId` and `Metadata` added to simulate the processing step) - introduce `config.OCRD_EXISTING_OUTPUT`: - `ABORT`: re-raise FileExistsError - `SKIP`: ignore and continue with next page - `OVERWRITE`: force overwriting the exact output files (instead of removing output files indiscriminately) - 🔥 remove `Workspace.overwrite_mode`, have `--overwrite` merely delegate to `config.OCRD_EXISTING_OUTPUT=OVERWRITE` - introduce `--debug`, just delegate to `config.OCRD_MISSING_OUTPUT=ABORT` - `cli.bashlib.input-files`: delegate everything to `ocrd_cli_wrap_processor` (for CLI handling) and `process_workspace` (for error handling), but override `process_page_file` to (never fail and) print bash-friendly strings for actual processing - update tests, add `test_processor.test_run_output_missing` covering all `OCRD_MISSING_OUTPUT` options and the newly `OCRD_EXISTING_OUTPUT=SKIP`
f00ecda: this covers the remaining aspects of error handling on the processor level as I understand them:
By @kba's request, I pushed this to the same PR so we can quickly go forward merging everything into |
Seems I commented on a specific commit and that got lost and is not shown here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only suggestions for cosmetics, functionality-wise good to go IMHO!
self.pageId = pageId | ||
self.mimetype = mimetype | ||
self.message = (f"Could not find input file for fileGrp {fileGrp} " | ||
f"and pageId {pageId} under mimetype {mimetype or 'PAGE+image(s)'}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate the specific exceptions. Maybe we should keep all of those in a separate ocrd_utils.exceptions
before we release 3.0 stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that defeat package encapsulation (and possibly lead to circularity)? If I use stuff from ocrd
which throws an exception, which I want to be able to handle, then that should be in ocrd
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a case where ocrd
module code will be used inside ocrd_utils.exceptions
and then that code is imported back to ocrd
from ocrd_utils
? IMO, no circularity should happen if implemented properly, i.e., just importing the required definitions or methods of a class instead of *
. This said dot operations in Python are also more expensive in terms of performance.
import workspace
workspace.add_file()
is less efficient than
from workspace import add_file
add_file()
due to the extra attribute lookup overhead (usually bad inside loops).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting about the import syntax. I prefer the latter anyway, good to hear that it's also faster.
Wouldn't that defeat package encapsulation (and possibly lead to circularity)? If I use stuff from ocrd which throws an exception, which I want to be able to handle, then that should be in ocrd as well.
The dependency graph is ocrd_utils -> ocrd_models -> ocrd_validators -> ocrd -> ocrd_network
. So unless ocrd_utils.exceptions
imports something from one of the other packages (like the problem with in #12), there won't be an issue, at least for these simple, builtin-derived exceptions. If we want e.g. JsonSchemaDeprecationWarning
in there as well, then we'd have to move it up the package hierarchy, e.g. to ocrd.exceptions
. Which would not work for JsonSchemaDeprecationWarning
because that is used in ocrd_validators
.
It is not important, we just have quite a few custom exceptions and I thought it would be sensible to have them in one place, as many python projects do. But let's put a pin into that discussion until we're ready to release, there's more pressing issues.
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
|
oops – thanks! |
See discussion on OCR-D/core#1240