Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/magic desc mime type #21

Closed

Conversation

conitrade-as
Copy link
Contributor

This change set adds the magic description and mime type to the classifier result.

To support easier test writing the artifacts have been augmented with a .json file which contains the expected classification result. Makes for much easier test case writing.

@chivay
Copy link
Contributor

chivay commented Apr 28, 2021

Hi,
adding MIME type to headers seems like a nice addition that could work as an alternative to current conventions (kind, type headers).
However, using libmagic output doesn't seem to be a good fit for headers. In some cases, its too specific (timestamps, size).

For example:

Microsoft Cabinet archive data, Windows 2000/XP setup, 235156 bytes...

gzip compressed data, was "Order 002_PDF.exe", last modified: Thu Apr 30 23:25:26 2020, ....

When writing a karton service, you'd have basically no way to listen for such tasks - you can't use wildcards in headers.

Do you have any specific usecase for this? Maybe moving magic to payload would be a better idea?

@conitrade-as
Copy link
Contributor Author

Sounds like moving magic to a payload is the way to go. What do you think about the mime? That seems reasonable, since that can be used for routing.

@chivay
Copy link
Contributor

chivay commented Apr 28, 2021

Yup, mime in headers sounds good :)

karton/classifier/classifier.py Show resolved Hide resolved
payload.update(expected["payload"])

res = self.run_task(task)
self.assertTasksEqual(res, [Task(expected["headers"], payload)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite simple and cool, but I'm not sure if hardcoding libmagic output in test cases is a good idea - it changes between versions. These tests are quite broken on my system (file v5.40). Examples:

AssertionError: 'PDF document, version 1.4, 1 pages' != 'PDF document, version 1.4'
- PDF document, version 1.4, 1 pages
?                          ---------
+ PDF document, version 1.4
 : Incorrect value of payload.magic
AssertionError: 'ASCII text, with very long lines (32088), with no line terminators' != 'ASCII text, with very long lines, with no line terminators'
- ASCII text, with very long lines (32088), with no line terminators
?                                 --------
+ ASCII text, with very long lines, with no line terminators
AssertionError: 'Zip archive data, at least v2.0 to extract, compression method=deflate' != 'Zip archive data, at least v2.0 to extract'
- Zip archive data, at least v2.0 to extract, compression method=deflate
?                                           ----------------------------
+ Zip archive data, at least v2.0 to extract

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is to be expected. Something like libmagic behaves differently, sometimes even within the same version (e.g. with local magic files, etc.). That's why in my experience we choose a frame of reference which is used for the test execution. In this case, this frame of reference is the CI/CD pipeline which runs the tests (based on Ubuntu latest in this case). Then everyone uses that frame of reference for development. This is proven to work and guarantees stable builds.

Copy link
Contributor

@chivay chivay May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAK from me. Tests that rely on a specific environment are a bad idea. Apart from annoyances, such as inability to run the tests on a development machine, there are some other problems:

  1. Using ubuntu-latest for testing basically guarantees that the environment will be different at some point. When GitHub changes latest from 20.04 to 21.04, tests will probably break unexpectedly and require fixing.
  2. Our CI is not and should not be the only testing environment (the more testing we get, the better). At this point we already have a downstream - nixpkgs that runs classifier tests in their CI infrastructure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand your point of view. So what do you suggest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the tests require some more consideration I would suggest splitting them off to a separate PR.
We could merge the MIME / magic additions now and take some more time designing the implementation of test suite.

Copy link
Contributor

@msm-code msm-code May 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"writing the tests later" == "writing the tests never". Maybe we can fix it somehow after all?

@msm-code
Copy link
Contributor

Copied from slack:

Hmm, can we pin libmagic version somehow? And ship a statically compiled binary? ping @chivay (cert.pl) . The upside is that the service becomes more "stable" and works in the same way everywhere

Another way to fix it would be to use fuzzy matching for payload["magic"], for example only check first word (or ignore it during the comparison, though i prefer the first option).

@conitrade-as
Copy link
Contributor Author

Superseded by #23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants