auxiliary file download: JSON file extension/type is lost #8241

Closed
raprasad opened this issue Nov 13, 2021 · 8 comments · Fixed by #8282

Comments

@raprasad
Contributor

raprasad commented Nov 13, 2021

What steps does it take to reproduce the issue?

  1. Use the Dataverse API to upload an Auxiliary file that is a JSON file (e.g. dp_release.json):
  2. Download the file via the UI
  3. The downloaded file has ".txt" appended to its name. The user may rename the file to end with ".json", but nothing indicates that it is supposed to be a JSON file.
    (Note: this doesn't happen when adding PDFs as auxiliary files.)
  • When does this issue occur?
    When downloading an auxiliary JSON file.

  • Which page(s) does it occur on?
    Dataset page, file download dropdown.

  • What happens?
    See steps to reproduce above

  • To whom does it occur (all users, curators, superusers)?
    Tested this as the dataset owner.

  • What did you expect to happen?
    The file would be downloaded with a .json extension

Which version of Dataverse are you using?
v. 5.5 build develop-6cb87ee5d

Any related open or closed issues to this bug report?

@djbrooke
Contributor

Thanks @raprasad. We can take a look. If you wouldn't mind, could you please see if this happens in 5.8 (on demo.dataverse.org)?

@pdurbin
Member

pdurbin commented Nov 15, 2021

This is definitely a known issue and has been true since 5.5. 5c67f95 is a related commit we can look at. When we fix this we should also adjust the tests that have comments like FIXME: This should be ".json" instead of ".txt".

In short, the content type (MIME type) for these JSON files is being stored in Dataverse as "text/plain" rather than "application/json".

@qqmyers
Member

qqmyers commented Nov 15, 2021

Looks like the core problem is that the MIME type detection used,

auxFile.setContentType(tika.detect(storageIO.getAuxFileAsInputStream(auxExtension)));

doesn't recognize application/json and just assigns text/plain. The best suggestions for ~fixes I've seen are to use the tika.detect(Stream, Metadata) method and pass a Metadata with the RESOURCE_NAME_KEY set to a filename/extension associated with application/json or with Metadata.CONTENT_TYPE set to application/json (somewhat circular since we're trying to detect the type, but this evidently makes Tika assign application/json as the right subtype of text/plain), or to just try parsing with a JSON parser (perhaps only if Tika says text/plain to start).
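
The last option above (only second-guess the detector when it answers text/plain) could look roughly like the sketch below. This is an illustration, not the Dataverse code: the class name JsonFallback, the method refine, and the parsesAsJson predicate are all hypothetical, and the JSON check is passed in as a predicate because the JDK has no built-in JSON parser.

```java
import java.util.function.Predicate;

// Hypothetical helper, not part of Dataverse or Tika.
final class JsonFallback {
    private JsonFallback() {}

    /**
     * Refines a detected content type. If the detector reported the generic
     * "text/plain" and the body passes the supplied JSON check, report
     * "application/json" instead. Any other detected type is returned as-is.
     *
     * @param detectedType type reported by the detector (e.g. Tika)
     * @param body         file content as text
     * @param parsesAsJson assumed wrapper around a real JSON parser
     */
    static String refine(String detectedType, String body, Predicate<String> parsesAsJson) {
        // Only run the (potentially expensive) parse for plain-text results.
        if ("text/plain".equals(detectedType) && parsesAsJson.test(body)) {
            return "application/json";
        }
        return detectedType;
    }
}
```

In practice the predicate would wrap whatever JSON library the application already ships and return false when parsing throws.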

@pdurbin
Member

pdurbin commented Nov 17, 2021

Since this is at the top of Up Next I thought I'd mention that pull request #8237 is related so watch out for potential merge conflicts.

@djbrooke
Contributor

djbrooke commented Nov 17, 2021

  • There are two approaches for this: we could detect the type and provide an extension based on the result of that detection, OR we could just give back the file extension that was supplied in the first place. To be discussed: whether we want to do what we do in other parts of the application (that may be driven by ingest), or whether we want to treat this separately because of the use cases around auxiliary files, where the client knows what it's uploading and therefore what it wants to get back.
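
To illustrate the first approach and why the bug shows up, here is a minimal sketch of deriving a download name from the stored content type. The class AuxFileExtensions, its methods, and the three-entry mapping are assumptions made for illustration; the actual lookup in Dataverse may differ.

```java
// Hypothetical helper, not the Dataverse implementation.
final class AuxFileExtensions {
    private AuxFileExtensions() {}

    /** Maps a stored content type to a file extension; unknown types get none. */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return "";
        }
        switch (contentType) {
            case "application/json": return ".json";
            case "application/pdf":  return ".pdf";
            case "text/plain":       return ".txt";
            default:                 return "";
        }
    }

    /** Builds the download file name from a base name and the stored type. */
    static String downloadName(String baseName, String contentType) {
        return baseName + extensionFor(contentType);
    }
}
```

With a lookup like this, a JSON file whose type was stored as text/plain comes back as dp_release.txt, while the same file stored as application/json comes back as dp_release.json.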

@pdurbin pdurbin self-assigned this Nov 18, 2021
@pdurbin pdurbin removed their assignment Nov 22, 2021
pdurbin added a commit that referenced this issue Dec 7, 2021
"application/octet-stream" is the default when the user doesn't supply a
content type. So if it's this, send it through Tika. Yes, a user
can supply "application/octet-stream" and this will also be sent through
Tika.
@raprasad
Contributor Author

raprasad commented Dec 7, 2021

@pdurbin: Has a format been decided upon for sending over the content type? If so, will it be backward compatible with the current API? Thanks.

kcondon added a commit that referenced this issue Dec 7, 2021
allow users to override content type for aux files #8241
@pdurbin
Member

pdurbin commented Dec 8, 2021

@raprasad yes, a format has been decided. Pull request #8282 has been merged and it's probably easiest to show the "diff" of the curl command:

[Image: Screen Shot 2021-12-08 at 8 20 25 AM]

That is to say, from curl you'll pass something like this:

-F 'file=@data.json;type=application/json'

And yes, the change is backward compatible. If you don't specify a content type we'll run the file through Tika.detect in the hopes of figuring out what it is.
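
The override-with-fallback behavior described here and in the commit message can be sketched as follows. This is not the code merged in #8282: the class AuxContentTypeResolver, the method resolve, and the Supplier standing in for the Tika call are hypothetical, and treating a null or empty type the same as the default is an added assumption.

```java
import java.util.function.Supplier;

// Hypothetical helper, not the code from pull request #8282.
final class AuxContentTypeResolver {
    /** Default content type when the client does not supply one. */
    static final String DEFAULT_TYPE = "application/octet-stream";

    private AuxContentTypeResolver() {}

    /**
     * Returns the client-supplied content type unless it is the default
     * "application/octet-stream" (or missing, which is an assumption here),
     * in which case the detector is consulted instead.
     *
     * @param suppliedType content type sent with the upload, may be null
     * @param detector     stand-in for content sniffing such as tika.detect
     */
    static String resolve(String suppliedType, Supplier<String> detector) {
        boolean missing = suppliedType == null || suppliedType.trim().isEmpty();
        if (missing || DEFAULT_TYPE.equals(suppliedType)) {
            // Covers both "not supplied" and an explicit octet-stream.
            return detector.get();
        }
        return suppliedType;
    }
}
```

Because the detector is only called when no specific type was supplied, existing clients that omit the type keep the old detection behavior, which is what makes the change backward compatible.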

@raprasad
Contributor Author

raprasad commented Dec 8, 2021

Thanks @pdurbin

poikilotherm pushed a commit to poikilotherm/dataverse that referenced this issue Dec 15, 2021
"application/octet-stream" is the default when the user doesn't supply a
content type. So if it's this, send it through Tika. Yes, a user
can supply "application/octet-stream" and this will also be sent through
Tika.