Feature/magic desc mime type #21

conitrade-as · 2021-04-26T18:32:26Z

This change set adds the magic description and mime type to the classifier result.

To make test cases easier to write, each artifact now comes with a .json file that contains its expected classification result.
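
As a rough illustration of that layout, the sketch below pairs a sample with a sibling .json file and loads its expected headers and payload. The `<sample>.json` naming scheme, the `load_expected` and `demo` helpers, and all example values are assumptions made for illustration; only the `headers` and `payload` keys appear in the test code of this PR.

```python
import json
import tempfile
from pathlib import Path


def load_expected(sample_path: Path) -> dict:
    """Load the expectation stored next to a sample (hypothetical layout)."""
    # Assumption: "document.pdf" is described by "document.pdf.json".
    expected_path = sample_path.with_name(sample_path.name + ".json")
    with expected_path.open() as f:
        expected = json.load(f)
    # Missing sections default to empty dicts so callers can index both keys.
    return {
        "headers": expected.get("headers", {}),
        "payload": expected.get("payload", {}),
    }


def demo() -> dict:
    """Write a throwaway sample plus expectation file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "document.pdf"
        sample.write_bytes(b"%PDF-1.4\n")
        # Placeholder values, not taken from the real artifacts.
        expectation = {
            "headers": {"type": "sample", "mime": "application/pdf"},
            "payload": {"magic": "PDF document, version 1.4"},
        }
        (Path(tmp) / "document.pdf.json").write_text(json.dumps(expectation))
        return load_expected(sample)
```

Keeping the expectation beside the artifact means a new test case is just two files, with no changes to the test code itself.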

chivay · 2021-04-28T08:27:55Z

Hi,
adding the MIME type to headers seems like a nice addition that could work as an alternative to the current conventions (kind, type headers).
However, libmagic output doesn't seem to be a good fit for headers. In some cases, it's too specific (timestamps, size).

For example:

Microsoft Cabinet archive data, Windows 2000/XP setup, 235156 bytes...

gzip compressed data, was "Order 002_PDF.exe", last modified: Thu Apr 30 23:25:26 2020, ....

When writing a karton service, you'd have basically no way to listen for such tasks - you can't use wildcards in headers.

Do you have any specific use case for this? Maybe moving magic to the payload would be a better idea?

conitrade-as · 2021-04-28T15:56:54Z

Sounds like moving magic to the payload is the way to go. What do you think about the MIME type? Keeping it in the headers seems reasonable, since it can be used for routing.

chivay · 2021-04-28T16:20:44Z

Yup, mime in headers sounds good :)
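
To make the routing point concrete, here is a toy model of exact-match header filtering. It is not karton's implementation; `matches` is a hypothetical helper that assumes a consumer filter matches only when every filter key equals the corresponding task header exactly, which is the no-wildcards constraint described earlier in the thread. The header values are illustrative.

```python
def matches(filter_headers: dict, task_headers: dict) -> bool:
    """Return True when every filter key equals the task header exactly."""
    return all(task_headers.get(k) == v for k, v in filter_headers.items())


# A MIME type is a short, stable value a consumer can subscribe to.
pdf_filter = {"type": "sample", "mime": "application/pdf"}
pdf_task = {"type": "sample", "mime": "application/pdf"}

# A libmagic description embeds per-file details (names, timestamps),
# so no fixed filter value can equal it.
gzip_filter = {"type": "sample", "magic": "gzip compressed data"}
gzip_task = {
    "type": "sample",
    "magic": 'gzip compressed data, was "Order 002_PDF.exe", '
             "last modified: Thu Apr 30 23:25:26 2020",
}

print(matches(pdf_filter, pdf_task))
print(matches(gzip_filter, gzip_task))
```

Under this model the first check succeeds and the second fails, which is why the description fits better in the payload than in the headers.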

karton/classifier/classifier.py

chivay · 2021-05-04T09:00:05Z

tests/__init__.py

+                    payload.update(expected["payload"])
+
+                res = self.run_task(task)
+                self.assertTasksEqual(res, [Task(expected["headers"], payload)])


This looks quite simple and cool, but I'm not sure if hardcoding libmagic output in test cases is a good idea - it changes between versions. These tests are quite broken on my system (file v5.40). Examples:

AssertionError: 'PDF document, version 1.4, 1 pages' != 'PDF document, version 1.4'
- PDF document, version 1.4, 1 pages
?                          ---------
+ PDF document, version 1.4
 : Incorrect value of payload.magic

AssertionError: 'ASCII text, with very long lines (32088), with no line terminators' != 'ASCII text, with very long lines, with no line terminators'
- ASCII text, with very long lines (32088), with no line terminators
?                                 --------
+ ASCII text, with very long lines, with no line terminators

AssertionError: 'Zip archive data, at least v2.0 to extract, compression method=deflate' != 'Zip archive data, at least v2.0 to extract'
- Zip archive data, at least v2.0 to extract, compression method=deflate
?                                           ----------------------------
+ Zip archive data, at least v2.0 to extract

conitrade-as · 2021-05-04

Yes, that is to be expected. libmagic behaves differently between environments, sometimes even within the same version (e.g. with local magic files). That's why, in my experience, we choose a frame of reference for test execution. In this case, the frame of reference is the CI/CD pipeline that runs the tests (based on Ubuntu latest). Everyone then uses that frame of reference for development. This is proven to work and guarantees stable builds.

chivay · 2021-05-07

NAK from me. Tests that rely on a specific environment are a bad idea. Apart from annoyances, such as being unable to run the tests on a development machine, there are some other problems:

Using ubuntu-latest for testing basically guarantees that the environment will be different at some point. When GitHub changes latest from 20.04 to 21.04, tests will probably break unexpectedly and require fixing.

Our CI is not and should not be the only testing environment (the more testing we get, the better). At this point we already have a downstream, nixpkgs, that runs the classifier tests in its own CI infrastructure.

conitrade-as · 2021-05-07

I can understand your point of view. So what do you suggest?

chivay · 2021-05-10

Since the tests require some more consideration, I would suggest splitting them off into a separate PR.
We could merge the MIME / magic additions now and take some more time to design the test suite.

msm-code · 2021-05-10

"writing the tests later" == "writing the tests never". Maybe we can fix it somehow after all?

msm-code · 2021-05-10T21:38:26Z

Copied from slack:

Hmm, can we pin the libmagic version somehow? And ship a statically compiled binary? Ping @chivay (cert.pl). The upside is that the service becomes more "stable" and works the same way everywhere.

Another way to fix it would be to use fuzzy matching for payload["magic"], for example checking only the first word (or ignoring it during the comparison, though I prefer the first option).
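
A minimal sketch of the first-word comparison, using the mismatching strings reported earlier in the review. `first_word` and `magic_matches` are hypothetical helpers, not part of the classifier's test suite, and the sketch assumes the leading word of a libmagic description stays stable across versions.

```python
def first_word(magic: str) -> str:
    """Return the leading whitespace-delimited word of a description."""
    parts = magic.split()
    return parts[0] if parts else ""


def magic_matches(expected: str, actual: str) -> bool:
    """Fuzzy comparison: only the first word has to agree."""
    return first_word(expected) == first_word(actual)


# Pairs that differ only in version-specific detail still match.
pairs = [
    ("PDF document, version 1.4", "PDF document, version 1.4, 1 pages"),
    (
        "ASCII text, with very long lines, with no line terminators",
        "ASCII text, with very long lines (32088), with no line terminators",
    ),
    (
        "Zip archive data, at least v2.0 to extract",
        "Zip archive data, at least v2.0 to extract, compression method=deflate",
    ),
]
for expected, actual in pairs:
    print(magic_matches(expected, actual))
```

The trade-off is precision: with this check, "PDF document" and any other description starting with "PDF" are treated as equal, so it guards against gross misclassification rather than exact output.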

conitrade-as · 2021-05-17T08:44:56Z

Superseded by #23

conitrade-as added 2 commits April 26, 2021 20:24

refactor test aftifacts for increased readability

2b5cc0d

return magic description and mime type

0e9a351

add magic as a task payload

59fab15

conitrade-as force-pushed the feature/magic-desc-mime-type branch from 95f2493 to 59fab15 on April 29, 2021 07:43

chivay reviewed May 4, 2021


conitrade-as closed this May 17, 2021
