
Support uploading of archives (ZIP, other). #8029

Open
mankoff opened this issue Jul 28, 2021 · 6 comments
Labels
Component: JSF (involves modifying JSF code, which is being replaced with React) · Feature: File Upload & Handling · Type: Suggestion · User Role: Depositor (creates datasets, uploads data, etc.)

Comments

@mankoff
Contributor

mankoff commented Jul 28, 2021

Following on #3439 because that is closed.

There are many good reasons to discourage uploading archived files, summarized as:

It is not FAIR

However, there are many use cases where archives (e.g. ZIP files) are necessary, or at least a major improvement over unarchived uploads. Dataverse should discourage but support archive uploads.

One simple use case is sharing a shapefile, a format that is itself already not very FAIR. This format is actually a folder of n files (6 or 7?), one of which has the extension 'shp' and is also called a 'shapefile'. However, if users download only that file, it does not work; they need the entire folder. There is no benefit to exposing these 6 or 7 binary files individually; they should be distributed as a package, and the most common and well-supported package format is a ZIP file.

Another use case: 100,000 small text files that together make up a data set. One could argue it is un-FAIR to mint 100,000 new DOIs and require the user to deal with each of these files individually, especially given existing issues with Dataverse bulk download.

@stevenmce

ADA has the same requirement: we make extensive use of zips for various reasons, including the cases above.

@pdurbin
Member

pdurbin commented Oct 13, 2022

@mankoff @stevenmce (and others reading this) how should it look in the UI? Please feel free to draw on a cocktail napkin. 😄

Do you agree with "Add a checkbox to disable unzipping in order to push zipped files," which is how #3439 is worded?

That way, Dataverse is discouraging (as @mankoff said) the uploading of zips (unzipping would be the default), but you can opt out of unzipping by checking a box.

A cool new feature of 5.12 is support for a new Zip Previewer and file extractor for zip files, an external tool. Basically, it allows you to navigate within zip files uploaded to Dataverse and download this or that file from within the zip. Pretty neat! But would it encourage more zips? 😄 Either way, I think we should let people more easily decide if they want zips or not. Give them that checkbox (or whatever UI), I say, so they don't have to resort to double zipping.

Finally, should we "small chunk" this by offering the ability to opt out of unzipping per upload via the API? Is that of value? We recently started allowing people to opt out of ingest via the API, thanks to @lubitchv in PR #8532.

@pdurbin added the labels Type: Suggestion, Feature: File Upload & Handling, User Role: Depositor, and Component: JSF on Oct 13, 2022
@sergejzr

sergejzr commented Nov 9, 2022

Hello, just to add that there is also a restriction on the number of files in a zip, which causes trouble, especially when importing a large number of datasets from other sources. In such cases we would prefer to keep the zip as-is by default.

@pdurbin
Member

pdurbin commented Nov 18, 2022

I'd just like to note that @mankoff let us know that he's personally less involved with a Dataverse installation these days so perhaps a fresh issue with a new champion would be good. Or I'd be happy to create a Google doc if the new champions would prefer to refine some ideas there first. Please get in touch!

My suggestion would be an API-first approach. There's a new ability to skip ingest of tabular files when uploading via the API (tabIngest=false), so perhaps we could have a similar unzip=false option when uploading via the API. Perhaps the new dvwebloader could use such an unzip=false API option, if we added it.
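As a concrete sketch of that API-first shape: the first command below follows the existing tabIngest=false pattern (passed via jsonData on the native API's add-file endpoint), and the second shows a hypothetical unzip flag in the same style — no such parameter exists today. Server, DOI, and token are placeholders, and the commands are only printed (a dry run), not executed against any installation.

```shell
# Placeholders, not a real installation, token, or dataset.
SERVER="https://demo.dataverse.org"
PID="doi:10.70122/FK2/EXAMPLE"
API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Existing behavior: skip tabular ingest on upload via jsonData (tabIngest=false).
CMD_INGEST="curl -H X-Dataverse-key:$API_TOKEN -X POST -F file=@archive.zip -F 'jsonData={\"tabIngest\":\"false\"}' $SERVER/api/datasets/:persistentId/add?persistentId=$PID"

# HYPOTHETICAL: an analogous flag to keep the zip intact (does not exist yet).
CMD_UNZIP="curl -H X-Dataverse-key:$API_TOKEN -X POST -F file=@archive.zip -F 'jsonData={\"unzip\":\"false\"}' $SERVER/api/datasets/:persistentId/add?persistentId=$PID"

# Dry run: print the commands instead of hitting a server.
echo "$CMD_INGEST"
echo "$CMD_UNZIP"
```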

@kuhlaid
Contributor

kuhlaid commented Jun 1, 2023

My use case for zip archives is a large number of ultrasound images that I am archiving by type. I uploaded one of these zip files (11.zip, containing about 650 images) to a Dataverse dataset today (6/1/2023) via the API. I was not expecting the zip to auto-unzip and leave me without a zip file reference within the dataset, but that is what happened.

I do not foresee any use cases where researchers will want to pick out individual files for analysis, so minting an individual DOI for each of the subfiles in my zip archives seems unnecessary and wasteful. Having an MD5 or other checksum seems sufficient, and if the files are described sufficiently in the metadata, then having Dataverse auto-extract the contents of the zip file and pollute the files list in the web interface is undesirable and makes it less user friendly.

If the API would allow for quick extraction of files based on the directoryLabel, then the auto-extraction of archives in Dataverse would be slightly more palatable. As it stands, I cannot simply make an API query for 11.zip, extract the dataset ID, and download the data from that one archive; instead I would need to query an API endpoint containing the directoryLabel of the individual files from the 11.zip file, parse out those dataset IDs, and then build a POST request with those IDs to retrieve the files.

With that said, there are many reasons to keep zip files intact, un-extracted, as resources within a dataset. Some sets of files should only exist as a group and not separately, and minting DOIs for the individual files of a large archive can be an unnecessary resource drain and can make it appear that the upload failed due to the long processing times (which is what I am experiencing currently).
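The multi-step lookup described above can be sketched in shell. The JSON here is a hand-made sample shaped roughly like the native API's dataset file listing (GET /api/datasets/{id}/versions/{version}/files); real responses differ, and jq would normally do the JSON work, but plain grep keeps the sketch dependency-free.

```shell
# Hand-made sample, loosely modeled on the native API's file-listing response.
LISTING='{"status":"OK","data":[
  {"label":"README.md","dataFile":{"id":101}},
  {"label":"11.zip","dataFile":{"id":202}}
]}'

# Step 1: find the entry whose label is 11.zip and pull out its file id.
FILE_ID=$(printf '%s\n' "$LISTING" \
  | grep '"label":"11.zip"' \
  | grep -o '"id":[0-9]*' \
  | grep -o '[0-9]*')
echo "file id for 11.zip: $FILE_ID"

# Step 2: that id would then go into a download request, e.g.
echo "GET /api/access/datafile/$FILE_ID"
```

The point of the sketch is the number of hops: listing, filtering, id extraction, and only then a download request, versus a single request if the zip had stayed a zip.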

@pdurbin
Member

pdurbin commented Jun 2, 2023

If the API would allow for quick extraction of files based on the directoryLabel, then the auto-extraction of archives in Dataverse is slightly more palatable.

Yes, this is already supported. You can find some screenshots here:

For now, the workaround to keep the zip a zip is to double zip. I do it myself: pdurbin/open-source-at-harvard-primary-data@2092b75

Here's a copy/paste of what I do (I'm on a Mac):

zip -r primary-data.zip primary-data -x '**/.*' -x '**/__MACOSX'
zip -r outer.zip primary-data.zip -x '**/.*' -x '**/__MACOSX'

In Dataverse, outer.zip is unzipped, leaving primary-data.zip: https://dataverse.harvard.edu/file.xhtml?fileId=6867328&version=4.0
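The double-zip workaround can be sanity-checked locally without a Dataverse instance. The sketch below substitutes python3's shutil/zipfile for the zip/unzip CLI so it runs anywhere python3 is present; the layout (primary-data, outer.zip) follows the commands above, and the final extraction simulates what Dataverse does to the outer archive on upload.

```shell
# Build a sample folder in a scratch directory.
WORK=$(mktemp -d)
mkdir -p "$WORK/primary-data"
echo "hello" > "$WORK/primary-data/file.txt"

# Inner zip: the archive we actually want to keep intact in Dataverse.
python3 -c "
import shutil
shutil.make_archive('$WORK/primary-data', 'zip', '$WORK', 'primary-data')
"

# Outer zip: the sacrificial wrapper that Dataverse will unzip on upload.
python3 -c "
import zipfile
with zipfile.ZipFile('$WORK/outer.zip', 'w') as z:
    z.write('$WORK/primary-data.zip', 'primary-data.zip')
"

# Simulate Dataverse unzipping the outer archive: the inner zip survives.
python3 -c "
import zipfile
zipfile.ZipFile('$WORK/outer.zip').extractall('$WORK/extracted')
"
ls "$WORK/extracted/primary-data.zip"
```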
