Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle errors in zip_input_files #12

Merged
merged 13 commits into from
Aug 21, 2024

Conversation

bertsky
Copy link
Owner

@bertsky bertsky commented Aug 20, 2024

@bertsky
Copy link
Owner Author

bertsky commented Aug 20, 2024

@kba I tentatively merged #9 here.

@kba @MehmedGIT this effectively undoes some of your recent commits:

Please review (cannot invite you directly).

Comment on lines 620 to 629
if on_error == 'skip':
ift[i] = None
elif on_error == 'first':
pass # keep first match
elif on_error == 'last':
ift[i] = file_
elif on_error == 'abort':
raise ValueError(
"Multiple '%s' matches for page '%s' in fileGrp '%s'." % (
mimetype, file_.pageId, ifg))
raise NonUniqueInputFile(ifg, file_.pageId, mimetype)
else:
raise Exception("Unknown 'on_error' strategy '%s'" % on_error)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should become a separate method. I see the same if checks are repeated below. The constant strings for the on_error strategy could also become an Enum class. The Enum class would also help with the validator and parser in the config.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should become a separate method. I see the same if checks are repeated below.

We used to have two discerning error messages here. (In one case, a specific MIME type was requested, like PAGE-only or image-only or JSON or whatever, but that was not unique. In another, no MIME type was requested, so the PAGE+images logic should apply but did not establish uniqueness via some PAGE file.)

How would you like to see this refactored into a function, exactly?

The constant strings for the on_error strategy could also become an Enum class.

Agreed.

The Enum class would also help with the validator and parser in the config.

Remember that this concrete setting should not be decided by the user. The config has a different granularity level: what should happen to input files missing (or skipped because of being non-unique).

Copy link

@MehmedGIT MehmedGIT Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you like to see this refactored into a function, exactly?

Something like:

def handle_errors(ift[i], file_, ifg: str, on_error: str):
  if on_error == 'skip':
      ift[i] = None
  elif on_error == 'first':
      pass # keep first match
  elif on_error == 'last':
      ift[i] = file_
  elif on_error == 'abort':
      raise NonUniqueInputFile(ifg, file_.pageId, None)
  else:
      raise Exception("Unknown 'on_error' strategy '%s'" % on_error)

or

def handle_errors(file_, on_error: str, ifg: str) -> Union[None, file_]:
  if on_error == 'skip':
      return None
  elif on_error == 'first':
      return None
  elif on_error == 'last':
      return file_
  elif on_error == 'abort':
      raise NonUniqueInputFile(ifg, file_.pageId, None)
  else:
      raise Exception("Unknown 'on_error' strategy '%s'" % on_error)

The only difference I see is the warning logs content which could be preserved outside the method. If different things happened, say on skip and/or abort, then we no longer have the method option. Anyway, nothing important for the current phase of v3.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sure, we can pass in the differentiating part of the message for NonUniqueInputFile.

But this is problematic:

if on_error == 'skip':
      return None
  elif on_error == 'first':
      return None

How would you call that function?

With skip, we should set the input file to None. With first, we should keep the previous OcrdFile instance.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the return version would not work and the other version requires passing some arguments for just checking the on_error state. Let's leave it as it is.

config.add("OCRD_MISSING_INPUT",
description="How to deal with missing input files (for some fileGrp/pageId) during processing [SKIP|ABORT]",
default=(True, 'SKIP'),
validator=lambda val: val in ['SKIP', 'ABORT'],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about FIRST and LAST? Or are these supposed to be only internals?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not apply here. Remember first and last are only for the internal problem of dealing with non-unique matches. We did not want to bother the user with this level.

- introduce `config.OCRD_MISSING_OUTPUT` and
  catch exceptions during `process_page_file`:
  - `ABORT`: re-raise
  - `SKIP`: ignore and continue with next page
  - `COPY`: ignore and provide input PAGE-XML
    as output (with just `@PcGtsId` and `Metadata`
    added to simulate the processing step)

- introduce `config.OCRD_EXISTING_OUTPUT`:
  - `ABORT`: re-raise FileExistsError
  - `SKIP`: ignore and continue with next page
  - `OVERWRITE`: force overwriting the exact
    output files (instead of removing output files
    indiscriminately)

- 🔥 remove `Workspace.overwrite_mode`,
  have `--overwrite` merely delegate to
  `config.OCRD_EXISTING_OUTPUT=OVERWRITE`

- introduce `--debug`, just delegate to
  `config.OCRD_MISSING_OUTPUT=ABORT`

- `cli.bashlib.input-files`: delegate everything
  to `ocrd_cli_wrap_processor` (for CLI handling)
  and `process_workspace` (for error handling),
  but override `process_page_file` to (never fail and)
  print bash-friendly strings for actual processing

- update tests, add `test_processor.test_run_output_missing`
  covering all `OCRD_MISSING_OUTPUT` options and the newly
  `OCRD_EXISTING_OUTPUT=SKIP`
@bertsky
Copy link
Owner Author

bertsky commented Aug 20, 2024

f00ecda: this covers the remaining aspects of error handling on the processor level as I understand them:

  • dealing with exceptions during the process_page_file call tree; acting on config.OCRD_MISSING_OUTPUT accordingly; --debug behaviour
  • avoiding FileExistsError or dealing with it after process_page_file calls; acting on config.OCRD_EXISTING_OUTPUT accordingly; --overwrite behaviour

By @kba's request, I pushed this to the same PR so we can quickly go forward merging everything into new-processor-api for the next release. So please re-review!

@MehmedGIT
Copy link

MehmedGIT commented Aug 21, 2024

Seems I commented on a specific commit and that got lost and is not shown here.

Copy link

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only suggestions for cosmetics, functionality-wise good to go IMHO!

self.pageId = pageId
self.mimetype = mimetype
self.message = (f"Could not find input file for fileGrp {fileGrp} "
f"and pageId {pageId} under mimetype {mimetype or 'PAGE+image(s)'}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the specific exceptions. Maybe we should keep all of those in a separate ocrd_utils.exceptions before we release 3.0 stable.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that defeat package encapsulation (and possibly lead to circularity)? If I use stuff from ocrd which throws an exception, which I want to be able to handle, then that should be in ocrd as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where ocrd module code will be used inside ocrd_utils.exceptions and then that code is imported back to ocrd from ocrd_utils? IMO, no circularity should happen if implemented properly, i.e., just importing the required definitions or methods of a class instead of *. This said dot operations in Python are also more expensive in terms of performance.

import workspace
workspace.add_file()

is less efficient than

from workspace import add_file
add_file()

due to the extra attribute lookup overhead (usually bad inside loops).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting about the import syntax. I prefer the latter anyway, good to hear that it's also faster.

Wouldn't that defeat package encapsulation (and possibly lead to circularity)? If I use stuff from ocrd which throws an exception, which I want to be able to handle, then that should be in ocrd as well.

The dependency graph is ocrd_utils -> ocrd_models -> ocrd_validators -> ocrd -> ocrd_network. So unless ocrd_utils.exceptions imports something from one of the other packages (like the problem with in #12), there won't be an issue, at least for these simple, builtin-derived exceptions. If we want e.g. JsonSchemaDeprecationWarning in there as well, then we'd have to move it up the package hierarchy, e.g. to ocrd.exceptions. Which would not work for JsonSchemaDeprecationWarning because that is used in ocrd_validators.

It is not important, we just have quite a few custom exceptions and I thought it would be sensible to have them in one place, as many python projects do. But let's put a pin into that discussion until we're ready to release, there's more pressing issues.

src/ocrd/processor/base.py Show resolved Hide resolved
src/ocrd/processor/base.py Outdated Show resolved Hide resolved
tests/processor/test_processor.py Outdated Show resolved Hide resolved
tests/processor/test_processor.py Outdated Show resolved Hide resolved
bertsky and others added 4 commits August 21, 2024 18:01
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
@kba
Copy link

kba commented Aug 21, 2024

tests/processor/test_processor.py is missing an from ocrd.processor import NonUniqueInputFile.

@bertsky
Copy link
Owner Author

bertsky commented Aug 21, 2024

tests/processor/test_processor.py is missing an from ocrd.processor import NonUniqueInputFile.

oops – thanks!

@bertsky bertsky merged commit 6b68f7a into new-processor-api Aug 21, 2024
1 check passed
@bertsky bertsky deleted the new-processor-api-input-file-errors branch September 23, 2024 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants