
Support uploading of archives (ZIP, other). #8029

Open
mankoff opened this issue Jul 28, 2021 · 6 comments
Labels
Component: JSF (involves modifying JSF code, which is being replaced with React) · Feature: File Upload & Handling · Type: Suggestion · User Role: Depositor (creates datasets, uploads data, etc.)

Comments

@mankoff
Contributor

mankoff commented Jul 28, 2021

Following on #3439 because that is closed.

There are many good reasons to discourage uploading archived files, summarized as:

It is not FAIR

However, there are many use cases where archives (e.g. ZIP files) are necessary, or at least a major improvement over unarchived uploads. Dataverse should discourage but support archive uploads.

One simple use case is sharing a shapefile, a format that is itself already not very FAIR. This format is actually a folder of n files (6 or 7?), one of which has the extension 'shp' and is also called a 'shapefile'. However, if users download only that file, it does not work; they need the entire folder. There is no benefit to exposing these 6 or 7 binary files individually; they should be distributed as a package, and the most common and well-supported package format is a ZIP file.

Another use case: 100,000 small text files that together make up a data set. One could argue it is un-FAIR to mint 100,000 new DOIs and require the user to deal with each of these files individually, especially given existing issues with Dataverse bulk download.

@stevenmce

ADA has the same requirement: we make extensive use of zips for various reasons, including the cases above.

@pdurbin
Member

pdurbin commented Oct 13, 2022

@mankoff @stevenmce (and others reading this) how should it look in the UI? Please feel free to draw on a cocktail napkin. 😄

Do you agree with "Add a checkbox to disable unzipping in order to push zipped files," which is how #3439 is worded?

That way, Dataverse is discouraging (as @mankoff said) the uploading of zips (unzipping would be the default), but you can opt out of unzipping by checking a box.

A cool new feature of 5.12 is support for a new Zip Previewer and file extractor for zip files, an external tool. Basically, it allows you to navigate within zip files uploaded to Dataverse and download this or that file from within the zip. Pretty neat! But would it encourage more zips? 😄 Either way, I think we should let people more easily decide if they want zips or not. Give them that checkbox (or whatever UI), I say, so they don't have to resort to double zipping.

Finally, should we "small chunk" this by offering the ability to opt out of unzipping per upload via the API? Is that of value? We recently started allowing people to opt out of ingest via the API, thanks to @lubitchv in PR #8532.

@pdurbin added the labels Type: Suggestion, Feature: File Upload & Handling, User Role: Depositor, and Component: JSF on Oct 13, 2022
@sergejzr

sergejzr commented Nov 9, 2022

Hello, just to add that there is also a restriction on the number of files in a zip, which causes trouble, especially when importing a large number of datasets from other sources. In such cases we would prefer to keep the zip as-is by default.

@pdurbin
Member

pdurbin commented Nov 18, 2022

I'd just like to note that @mankoff let us know that he's personally less involved with a Dataverse installation these days so perhaps a fresh issue with a new champion would be good. Or I'd be happy to create a Google doc if the new champions would prefer to refine some ideas there first. Please get in touch!

My suggestion would be an API-first approach. There's a new ability to skip ingest of tabular files when uploading via the API (tabIngest=false), so perhaps we could have a similar unzip=false option when uploading via the API. Perhaps the new dvwebloader could use such an unzip=false API option, if we added it.
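As a concrete sketch of that API-first shape: the first command below follows the existing tabIngest=false pattern (passed via jsonData on the native API's add-file endpoint), and the second shows a hypothetical unzip flag in the same style — no such parameter exists today. Server, DOI, and token are placeholders, and the commands are only printed (a dry run), not executed against any installation.

```shell
# Placeholders, not a real installation, token, or dataset.
SERVER="https://demo.dataverse.org"
PID="doi:10.70122/FK2/EXAMPLE"
API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Existing behavior: skip tabular ingest on upload via jsonData (tabIngest=false).
CMD_INGEST="curl -H X-Dataverse-key:$API_TOKEN -X POST -F file=@archive.zip -F 'jsonData={\"tabIngest\":\"false\"}' $SERVER/api/datasets/:persistentId/add?persistentId=$PID"

# HYPOTHETICAL: an analogous flag to keep the zip intact (does not exist yet).
CMD_UNZIP="curl -H X-Dataverse-key:$API_TOKEN -X POST -F file=@archive.zip -F 'jsonData={\"unzip\":\"false\"}' $SERVER/api/datasets/:persistentId/add?persistentId=$PID"

# Dry run: print the commands instead of hitting a server.
echo "$CMD_INGEST"
echo "$CMD_UNZIP"
```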

@kuhlaid
Contributor

kuhlaid commented Jun 1, 2023

My use case for zip archives is a large number of ultrasound images that I am archiving by type. I uploaded one of these zip files (11.zip, containing about 650 images) to a Dataverse dataset today (6/1/2023) via the API. I was not expecting the zip to auto-unzip and leave me without a zip file reference within the dataset, but that is what happened.

I do not foresee any use cases where researchers will want to pick out individual files for analysis, so minting an individual DOI for each of the subfiles in my zip archives seems unnecessary and wasteful. Having an MD5 or other checksum seems sufficient, and if the files are described sufficiently in the metadata, then having Dataverse auto-extract the contents of the zip file and pollute the files list in the web interface is undesirable and makes it less user friendly.

If the API would allow for quick extraction of files based on the directoryLabel, then the auto-extraction of archives in Dataverse would be slightly more palatable. As it stands, I cannot simply make an API query for 11.zip, extract the dataset ID, and download the data from that one archive; instead I would need to query an API endpoint containing the directoryLabel of the individual files from the 11.zip file, parse out those dataset IDs, and then build a POST request with those IDs to retrieve the files.

With that said, there are many reasons to keep zip files intact, un-extracted, as resources within a dataset. Some sets of files should only exist as a group and not separately, and minting DOIs for the individual files of a large archive can be an unnecessary resource drain and can make it appear that the upload failed due to the long processing times (which is what I am experiencing currently).
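The multi-step lookup described above can be sketched in shell. The JSON here is a hand-made sample shaped roughly like the native API's dataset file listing (GET /api/datasets/{id}/versions/{version}/files); real responses differ, and jq would normally do the JSON work, but plain grep keeps the sketch dependency-free.

```shell
# Hand-made sample, loosely modeled on the native API's file-listing response.
LISTING='{"status":"OK","data":[
  {"label":"README.md","dataFile":{"id":101}},
  {"label":"11.zip","dataFile":{"id":202}}
]}'

# Step 1: find the entry whose label is 11.zip and pull out its file id.
FILE_ID=$(printf '%s\n' "$LISTING" \
  | grep '"label":"11.zip"' \
  | grep -o '"id":[0-9]*' \
  | grep -o '[0-9]*')
echo "file id for 11.zip: $FILE_ID"

# Step 2: that id would then go into a download request, e.g.
echo "GET /api/access/datafile/$FILE_ID"
```

The point of the sketch is the number of hops: listing, filtering, id extraction, and only then a download request, versus a single request if the zip had stayed a zip.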

@pdurbin
Member

pdurbin commented Jun 2, 2023

If the API would allow for quick extraction of files based on the directoryLabel, then the auto-extraction of archives in Dataverse is slightly more palatable.

Yes, this is already supported. You can find some screenshots here:

For now, the workaround to keep the zip a zip is to double zip. I do it myself: pdurbin/open-source-at-harvard-primary-data@2092b75

Here's a copy/paste of what I do (I'm on a Mac):

zip -r primary-data.zip primary-data -x '**/.*' -x '**/__MACOSX'
zip -r outer.zip primary-data.zip -x '**/.*' -x '**/__MACOSX'

In Dataverse, outer.zip is unzipped, leaving primary-data.zip: https://dataverse.harvard.edu/file.xhtml?fileId=6867328&version=4.0
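The double-zip workaround can be sanity-checked locally without a Dataverse instance. The sketch below substitutes python3's shutil/zipfile for the zip/unzip CLI so it runs anywhere python3 is present; the layout (primary-data, outer.zip) follows the commands above, and the final extraction simulates what Dataverse does to the outer archive on upload.

```shell
# Build a sample folder in a scratch directory.
WORK=$(mktemp -d)
mkdir -p "$WORK/primary-data"
echo "hello" > "$WORK/primary-data/file.txt"

# Inner zip: the archive we actually want to keep intact in Dataverse.
python3 -c "
import shutil
shutil.make_archive('$WORK/primary-data', 'zip', '$WORK', 'primary-data')
"

# Outer zip: the sacrificial wrapper that Dataverse will unzip on upload.
python3 -c "
import zipfile
with zipfile.ZipFile('$WORK/outer.zip', 'w') as z:
    z.write('$WORK/primary-data.zip', 'primary-data.zip')
"

# Simulate Dataverse unzipping the outer archive: the inner zip survives.
python3 -c "
import zipfile
zipfile.ZipFile('$WORK/outer.zip').extractall('$WORK/extracted')
"
ls "$WORK/extracted/primary-data.zip"
```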
