IQSS/6763-multi-part upload API calls #6995
Conversation
Putting this up as a draft so people can comment on the basic design. I've tested using the DVUploader, so I know the API works.

The design here avoids changing the basic upload process (in general, and definitely w.r.t. single-part directUpload). Part of doing that was not tracking in-progress MP uploads in Dataverse. As is, the API to initiate sends URLs with the info needed to perform the part uploads to S3 and to complete/cancel the request at the end (the complete/cancel via Dataverse). That avoids having to keep state in Dataverse (the URLs have the state), but relies on the client to request a cancel/complete; otherwise the partial upload stays in AWS. Nominally this is another possible form of temp file that could/should be periodically cleaned up.

Alternate designs could track open MP uploads directly, but there's no clear way to know when a client has stopped uploading, except waiting for the upload URLs to expire plus some time for transfers to complete. From the UI, it might make sense to keep track, so that Dataverse would know which MP upload in AWS to cancel if the user cancels the overall upload/dataset-creation action, although this could probably be handled on the browser side in the JavaScript that's managing the overall direct/MP uploads.

As you can tell, my guess is that this design is reasonable, but others may be able to see scenarios where an alternative would be better. If so, it would be good to fix that now (since the DVUploader and other API clients would have to keep up with/support multiple versions if the API changes later).
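To make the stateless design above concrete, here is a rough sketch of what the JSON response to the initiation call looks like in my testing. Treat field names and URL shapes as illustrative, not a final contract: a map of presigned part-upload URLs (keyed by part number) plus the abort and complete URLs that carry the state back to Dataverse.

```json
{
  "status": "OK",
  "data": {
    "partSize": 5242880,
    "urls": {
      "1": "https://bucket.s3.amazonaws.com/...?X-Amz-Signature=...",
      "2": "https://bucket.s3.amazonaws.com/...?X-Amz-Signature=..."
    },
    "abort": "/api/datasets/mpupload?...",
    "complete": "/api/datasets/mpupload?...",
    "storageIdentifier": "s3://bucket:..."
  }
}
```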
Testing with DVUploader (the as-yet-unreleased 1.1.0 version at https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader), the new API call appears to be working - tested up to ~40GB and part sizes from 5MB (5*1024*1024 bytes) to 100MB. I'm going to move on to looking at how this might work in the UI as well, but I don't expect API changes will be needed for that (changes to EditDataFilesPage, probably) - so it would be good to get any design/API naming issues settled now, since I have to match those in both Dataverse and DVUploader. I can put the DVUploader jar somewhere if someone wants to try it out without having to build it.
It's the max that a single-part direct upload can be (as coded), and the minimum part size for multipart uploads. These could be separate settings, but it's not clear that's useful to support. E.g., for AWS S3, the minimum part size is 5 MB but single uploads can go up to 5 GB, so that is the max single upload.
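As a rough illustration of how these limits interact (constants here follow AWS S3's published defaults of a 5 MB minimum part size and 5 GB maximum single PUT; this is a sketch, not Dataverse code):

```python
# AWS S3 minimum multipart part size (5 MiB) and maximum single-PUT
# object size (5 GiB). In the design discussed above, one setting
# serves as both the single-upload cutoff and the minimum part size.
MIN_PART_SIZE = 5 * 1024 * 1024
MAX_SINGLE_UPLOAD = 5 * 1024 * 1024 * 1024

def needs_multipart(file_size: int, threshold: int = MAX_SINGLE_UPLOAD) -> bool:
    """A file larger than the single-upload limit must go multipart."""
    return file_size > threshold

def part_count(file_size: int, part_size: int) -> int:
    """Number of parts for a multipart upload. Every part except the
    last must be at least MIN_PART_SIZE, or S3 rejects the final
    'complete' call with an EntityTooSmall error (see the 400 mentioned
    later in this thread)."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below the S3 minimum")
    # integer ceiling division avoids float rounding on huge sizes
    return (file_size + part_size - 1) // part_size
```

With the tested sizes from this thread, a ~40 GB file at 100 MB parts comes out to a few hundred parts, comfortably under S3's 10,000-part cap.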
Thanks @qqmyers, I'd be very excited to get this in as API-only ASAP, since we have a few large upload requests in the Harvard Dataverse queue and we've been doing the unsupported db-switcheroo method. :)
At one point I was sending parts that were below the min part size on AWS, and everything worked until a 400/EntityTooSmall response when trying to complete. This commit would try to clean up after any similar problems (the code now stops this particular one from happening).
add release note
Correction: since the API changes were just to the abort and complete API calls, the URLs for which are supplied via the JSON response to the API request to start a multipart upload, the DVUploader didn't have to change :-) - I did find/fix an auth issue in the complete API call in the PR, though.
Conflicts: src/main/webapp/editFilesFragment.xhtml
Similarly, could you please sync up this branch with develop? - there appears to be an actual conflict in this one.
Otherwise this looks really cool. (And worked for me! - when built and deployed as 4.20).
Conflicts: src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java
@landreev - thanks - both PRs are now updated.
Conflicts: src/main/java/edu/harvard/iq/dataverse/EditDatafilesPage.java src/main/java/edu/harvard/iq/dataverse/api/Datasets.java src/main/java/edu/harvard/iq/dataverse/dataaccess/S3AccessIO.java
We still want to trigger the retry or report the error in this case; prior code assumed the last call to finish succeeded.
Have completed baseline testing. Works, with a couple of notable points to investigate:
For cancel: catching the per-datafile cancel button and handling the abort (after all currently-uploading parts are done) in MP calls, to remove the file. Note that for single-part uploads, cancelling the file does not yet remove the temp file, but in that case we do add the temp label so files can be cleaned up later. (The todo note includes more info.)
@kcondon - I think the above are fixed:
So - I think everything that can be fixed for this PR is fixed and it's ready for retesting. Other than regression testing that things still work, I think you should try to cancel a single upload (small or multipart size) and make sure you can upload a second file in the same upload session, and try cancelling the overall upload - checking via the AWS S3 API to make sure that no multipart uploads are left open.
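For anyone writing a client against these calls: the complete call needs the ETags S3 returned for each uploaded part. A minimal sketch of assembling that request body client-side (the shape - a JSON map of part number to ETag - matches what I've been sending from DVUploader, but treat it as an assumption rather than a spec):

```python
import json

def build_complete_body(etags_by_part: dict) -> str:
    """Build the JSON body for the 'complete multipart upload' call:
    a map of part number (as a string) to the ETag S3 returned when
    that part was PUT. Part numbers must run 1..N with no gaps, or
    S3 will reject the completion."""
    parts = sorted(etags_by_part)
    if parts != list(range(1, len(parts) + 1)):
        raise ValueError("part numbers must be contiguous, starting at 1")
    return json.dumps({str(n): etags_by_part[n] for n in parts})
```

If the client never sends this (or the abort), the partial upload stays open at AWS - which is exactly the cleanup concern discussed above.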
Regression test:
Test Fixes:
What this PR does / why we need it: This builds on #6994 to add three API calls needed to initiate, and complete or abort, a direct multipart upload to S3. It will also include analogous calls in EditDatafilesPage to support MP upload through the UI (these could potentially be pulled into another PR).
Which issue(s) this PR closes:
Closes #6763
Special notes for your reviewer:
Suggestions on how to test this: The easiest way to test the API calls is to try large uploads using the DVUploader (v1.1.0+ working code, no release yet). The UI support can be tested with large files via the normal direct upload UI (once completed).
Does this PR introduce a user interface change? If mockups are available, please link/include them here: ~~No~~ - probably will change how the progress bars are computed when using multiple parts
Is there a release notes update needed for this change?: Just to announce - no config changes needed.
Additional documentation: