Add a checkbox to disable unzipping in order to push zipped files #3439

Closed
bjonnh opened this issue Oct 30, 2016 · 28 comments
Labels
Feature: File Upload & Handling
Type: Suggestion (an idea)
User Role: Depositor (creates datasets, uploads data, etc.)
UX & UI: Design (this issue needs input on the design of the UI and from the product owner)

Comments

@bjonnh

bjonnh commented Oct 30, 2016

My use case is NMR datasets, which (depending on the format) can be composed of multiple files in a directory hierarchy.

The way we do it now is to do double zipping.
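For readers unfamiliar with the workaround, here is a minimal sketch of double zipping using only the Python standard library. The function name `double_zip` and the file names are made up for illustration, and it assumes (as this thread describes) that only the outer archive gets unpacked on upload, so the inner zip arrives as a single file.

```python
import io
import zipfile
from pathlib import Path


def double_zip(source_dir, out_path):
    """Zip source_dir into an inner archive, then wrap it in an outer zip.

    Hypothetical helper: the inner archive is built in memory, so this
    sketch only suits datasets that fit comfortably in RAM.
    """
    source = Path(source_dir)

    # Inner archive: the dataset itself, directory hierarchy preserved.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as inner:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                inner.write(path, path.relative_to(source).as_posix())

    # Outer archive: holds only the inner zip, stored without recompression.
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_STORED) as outer:
        outer.writestr(source.name + ".zip", buffer.getvalue())

    return Path(out_path)
```

Calling `double_zip("nmr_run", "upload.zip")` would produce an `upload.zip` whose only member is `nmr_run.zip`.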

People in my group had trouble the first time they tried using DV because of this.

Would it be possible to add a checkbox saying "do not unzip" to the upload system? (pdurbin told me this used to be an option.)

J.

@pdurbin
Member

pdurbin commented Oct 30, 2016

Right, @bjonnh and I were talking about this at http://irclog.iq.harvard.edu/dataverse/2016-10-30#i_44159 and I can't find anything about this feature at http://guides.dataverse.org/en/3.6.2/dataverse-user-main.html but if memory serves there was a checkbox to prevent unzipping on upload.

While we're adding the checkbox, we should also make sure that an equivalent boolean is added to the new "native add" API being developed in #1612. Otherwise, this will be a GUI only feature.

I wonder if this will be trickier now that there's a "drag and drop" component for uploading files. Hmm. Maybe you'd have to tick the "do not unzip" checkbox and then drag the files over.

To be clear, all this "double zip" business is really a workaround for the following shortcomings:

@lmaylein
Contributor

Heidelberg University Library would appreciate this feature

@asconrad

One of the use cases in our Dataverse pilot was astrophysical datasets, each organised with different kinds of data around one particular star. We built the datasets in BagIt format for long-term preservation and zipped them to preserve the structure and save space.
For this use case it would be good to keep the zip intact. A "don't unzip" flag would, however, need to be accessible via the API as well to be really valuable to us.

@shlake
Contributor

shlake commented Sep 14, 2017

UVa would like a place to turn off auto unzipping of zip files and yes, this needs to be offered via API as well (keeping zips zipped).

@nmedeiro

nmedeiro commented Sep 14, 2017 via email

@pdurbin
Member

pdurbin commented Dec 18, 2018

Related: #5396

@pdurbin
Member

pdurbin commented Mar 6, 2019

I'm just copying and pasting the feedback from @amberleahey at #2107 (comment)

"Hi folks! I know this post is old but I wanted to chime in and ask if there are plans to add the option to upload a .zip and NOT unpack? This would allow authors to choose to upload and retain zip if they wanted, maybe it would be default, but at least present the option. I think the inability to do so at the moment means users are double zipping and creating tar zip packages to get around it. We noticed this in our instance anyway. Any thoughts on reviving this convo?"

@djbrooke
Contributor

We have file hierarchy support in Dataverse now (hooray!) so I'm going to adjust the title here and close out a similar issue in #2107.

@djbrooke changed the title from "Add a checkbox to disable unzipping in order to push zipped hierarchical data files." to "Add a checkbox to disable unzipping in order to push zipped files" on Jul 18, 2019
@pdurbin
Member

pdurbin commented Jul 18, 2019

Just to remind everyone, @rmo-cdsp already made pull request #5396 to address this issue, to add a checkbox in the UI that says "Unzip zip files". He also implemented an API equivalent. So we can warm up that pull request (merge the latest from develop) if we'd like to test it. I'd be happy to spin up a branch so people can take a look. I haven't seen the UI myself.

@djbrooke
Contributor

I don't think we should implement this in the codebase, as keeping the files zipped means users miss out on a lot of file-level options (ingest, exploration, UNFs) and I think we'd want individual files for preservation purposes instead of a zip. By implementing file hierarchy we picked off one use case for keeping things zipped and I'm interested in other use cases so that we can discuss ways of addressing them without encouraging people to keep things zipped. Maybe an expansion of package files can help here.

@pdurbin
Member

pdurbin commented Jul 18, 2019

@djbrooke I absolutely agree that understanding the use cases, the reasons why people feel the need to have zipped files in Dataverse, is crucial. I agree that zip is a suboptimal preservation format.

From the comments above, here is my take on why people who have written in this issue want zipped files in Dataverse:

If anyone reading this could elaborate on why you want to store zipped files in Dataverse, please leave a comment. Thanks! 😄 🙏

@nmedeiro

nmedeiro commented Jul 18, 2019 via email

@shlake
Contributor

shlake commented Jul 18, 2019

My original want was to preserve file hierarchy (DONE !!! thanks). But recently I got a request to upload "lots" of files, and the researcher did not want "lots" of individual files in their dataset. They wanted to zip them (and keep them zipped) and upload them as one.

I am curious what @djbrooke means by "expansion of package files". Does Dataverse have "package files"?

@scolapasta
Contributor

@shlake "Package" files are what we use to track uploads via rsync. In the original use case, individual files didn't matter; an end user would only want to download them as a "package".

When I first suggested the concept of "package files", I had in mind that this would be the way Dataverse could deal with any set of files that fit these criteria, and not just via rsync.

On the back end we could unzip and store individually (for preservation), as well as keep a copy of the zip for easy download.

I believe this is what we do now with package files to allow download via S3.
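To make the "unzip for preservation, keep the zip for download" idea concrete, here is a rough sketch in plain Python. It is not how Dataverse stores package files; the `store_package` name and the `files/` directory layout are assumptions made up for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def store_package(zip_path, storage_dir):
    """Extract every member for preservation and keep the original zip too.

    Hypothetical layout: members go under <storage_dir>/files/, and the
    untouched archive is copied next to that directory.
    """
    zip_path = Path(zip_path)
    storage = Path(storage_dir)
    files_dir = storage / "files"
    files_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        # Individual files, available for file-level handling.
        archive.extractall(files_dir)
        extracted = sorted(
            info.filename for info in archive.infolist() if not info.is_dir()
        )

    # The archive itself, kept for one-click download of the whole set.
    package_copy = storage / zip_path.name
    shutil.copy2(zip_path, package_copy)
    return extracted, package_copy
```

The obvious cost of this design is that every byte is stored twice, once extracted and once inside the retained archive.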

@jggautier
Contributor

Some depositors wanted files kept zipped to avoid Dataverse's duplication checks.

@shlake
Contributor

shlake commented Jul 18, 2019

@jggautier YES! I forgot about that - to avoid Dataverse's duplication check!

@scolapasta
Contributor

@jggautier @shlake sure, but in that case it's a workaround for the real desire, which is for Dataverse to allow the same file (same checksum) to be uploaded multiple times. We have discussed this separately and may change the rule (at least allow it if the files are in different directories, or show a warning instead of an error).

@shlake
Contributor

shlake commented Jul 18, 2019

@scolapasta - now I remember that one UVa dataset had a duplicate file

Here are my comments from a Google Group discussion: https://groups.google.com/forum/?hl=en#!topic/dataverse-community/FLnm8-60sOs

I am not in the business of questioning why a researcher has "duplicate" files with different file names in their dataset. So is there any workaround for Dataverse to accept these files?

If a researcher has two files that just happen to contain the same information (the same checksum), I don't think that should stop that file from being uploaded, maybe flagged??. There may be a reason for different filenames w/ same content (such as: used in a script as part of analysis - where the title of the file is important to the script and thus would be important for transparency and understanding of the methodology).
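For anyone who wants to check a dataset for this situation before uploading, here is a small sketch that groups files by checksum. It assumes MD5 is the checksum in play (duplicate MD5s come up later in this thread); `find_duplicate_content` and `md5_of` are hypothetical helper names, not part of Dataverse.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_content(root):
    """Map each checksum shared by two or more files to their relative paths."""
    root = Path(root)
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[md5_of(path)].append(path.relative_to(root).as_posix())
    # Keep only checksums that occur more than once.
    return {d: names for d, names in by_digest.items() if len(names) > 1}
```

Files with different names but identical bytes end up in the same group, which is exactly the case described above.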

@pdurbin
Member

pdurbin commented Jul 18, 2019

"There may be a reason for different filenames w/ same content"

I just gave an example of how it's common in the Python world to see empty __init__.py files scattered throughout a project over in the "As a researcher, I need to publish a dataset that contains files with the same content, which are handled differently" issue at #4813 (comment)

Thanks to all for all the feedback so far! Much appreciated! I'm always interested in the "why". 😄

Another thought: What if there were a checkbox in the UI to disable unzipping but it could be turned off at the installation level? (Or the opposite, you have to explicitly turn on the checkbox.) That way installations would have some choice.

@bjonnh
Author

bjonnh commented Jul 19, 2019

I believe this all solves the issue I raised as the feature is now present and working. Thanks to all of you that have been involved.

@bjonnh bjonnh closed this as completed Jul 19, 2019
@mankoff
Contributor

mankoff commented Apr 22, 2021

Why is this issue closed? Was something implemented for this feature?

@djbrooke
Contributor

Hi @mankoff - the original reporter closed it.

The strategy around this has been to better handle the specific cases about "why" people want to keep their files zipped, instead of providing the option to keep the files zipped during upload. The reasoning is that zips are not as FAIR, there is a significant file-level feature set missed out on with zips, and zips are not as preservation-friendly. We've added support for lots of use cases (file hierarchy, duplicate MD5s, duplicate file names in different folders, etc.) that were brought up as reasons for keeping things zipped, but I'm sure there are others as well.

@mankoff
Contributor

mankoff commented Apr 22, 2021

Thanks for the explanation. Our use cases: Shapefiles. Or a group of 10 little MATLAB functions.

@foobarbecue
Contributor

foobarbecue commented Jul 27, 2021

We also would really like to have this checkbox. We have people that like to upload zip files containing tens of thousands of images and text files (e.g. training sets or engineering datasets). Dataverse chokes on these .zips trying to unzip them. Maybe I could talk them into using tarballs or something... For now, we're using the zip-in-a-zip workaround.
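As a side note, it is cheap to see how big such an archive is before uploading, since the zip central directory can be read without extracting anything. A minimal sketch with the standard library follows; `summarize_zip` is a made-up name, and what counts as "too many files" is left to the reader.

```python
import zipfile


def summarize_zip(zip_path):
    """Count files and total uncompressed bytes without extracting."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so only real files are counted.
        files = [info for info in archive.infolist() if not info.is_dir()]
    return {
        "file_count": len(files),
        "uncompressed_bytes": sum(info.file_size for info in files),
    }
```

A depositor could run this first and fall back to the zip-in-a-zip workaround only when the file count is large.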

@lmaylein
Contributor

@djbrooke

"The strategy around this has been to better handle the specific cases about "why" people want to keep their files zipped, instead of providing the option to keep the files zipped during upload."

I would have fewer problems importing data with more complex folder structures if the tree view was the default. I'm afraid many users overlook the possibility to switch to the tree view.

@mankoff
Contributor

mankoff commented Jul 28, 2021

@lmaylein Tree view isn't even an option if there are no sub-folders, so it shouldn't (currently) be default because setting it to default only if sub-folders exist would lead to two very different views depending on the dataverse. However, I agree with you. Tree view as a default should be a DV-level option, and datasets without folders should present as ./ or /.

I believe it is Dataverse / IQSS policy not to re-open closed issues, so this discussion is probably occurring in the wrong place. I suggest someone here open a new ticket. I agree with @djbrooke and his comment above about all the issues with ZIP files. But Dataverse policy here may be letting perfection be the enemy of good. Insisting on perfect FAIR-ness and not supporting ZIP files will push some users to other solutions. I think the correct behavior is to allow ZIP files but discourage them. Make us jump through some hoops to do it, but allow it. Archives of some type are a requirement for certain types of data.

@djbrooke
Contributor

@lmaylein @mankoff @foobarbecue - thanks for the discussion here. I do think a new issue would be a good way to restart the discussion. I still have reservations about this, but it would be good to reset on the remaining (or new!) use cases for why files should remain zipped.

Regarding the tree view, we may revisit it in the future but it was initially implemented as just a view without any of the usual file-level options. I'd be hesitant to make it the default view.

@mankoff
Contributor

mankoff commented Jul 28, 2021

See #8029.
