write 5+ GiB (matrices) to S3Store #687
Merged
Ensure `S3Store` does not attempt to write too-large chunks to S3 (5+ GiB)

Changes
Underlying library `s3fs` automatically writes objects to S3 in "chunks" or "parts" -- i.e. via multipart upload -- in line with S3's minimum part size for multipart uploads of 5 MiB.

This should, in general, avoid S3's maximum limit per (part) upload of 5 GiB. However, `s3fs` assumes that no single `write()` might exceed that maximum, and as such fails to chunk out such too-large upload requests prompted by singular writes of 5+ GiB.

This can and should be resolved in `s3fs`. But first, it can, should be, and is resolved here in `S3Store`.

Resolves #530
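As a rough sketch of the idea (the helper and constant names here are illustrative, not the actual `S3Store` code): any single oversized write is split into sub-limit slices, so `s3fs` only ever sees writes it can handle.

```python
# Sketch of the chunked-write idea; names here are illustrative, not the
# actual S3Store implementation.
S3_MAX_PART_SIZE = 5 * 2**30  # S3 rejects any single uploaded part over 5 GiB


def write_chunked(fileobj, payload: bytes, chunk_size: int = S3_MAX_PART_SIZE - 1):
    """Write `payload` to `fileobj` in slices no larger than `chunk_size`.

    A memoryview keeps the slicing zero-copy, which matters when the
    payload itself spans several GiB.
    """
    view = memoryview(payload)
    for start in range(0, len(view), chunk_size):
        fileobj.write(view[start:start + chunk_size])
```

With something like this in place, a 6 GiB payload reaches `s3fs` as two writes, each within the per-part limit, and `s3fs`'s own multipart machinery takes over from there.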
Testing
In addition to the included unit test, a functional test script was composed – currently uploaded to Gist as `test_store.py`.
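The script itself lives in the Gist; very roughly, its two modes exercise the store along these lines (the store API shown is an assumption for illustration):

```python
# Rough illustration of the functional test's two modes; the real script
# is in the test_store.py Gist, and the store API shown is assumed.
import os

ONE_GIB = 2**30


def exercise_store(store, length=6 * ONE_GIB, method="incremental"):
    payload = os.urandom(length)  # random bytes, e.g. 6 GiB
    with store.open("wb") as fileobj:  # hypothetical open() on the store
        if method == "full":
            # One monolithic 5+ GiB write: the case that trips up s3fs.
            fileobj.write(payload)
        else:
            # Incremental mode: 1 GiB writes stay well under the limit.
            for start in range(0, length, ONE_GIB):
                fileobj.write(payload[start:start + ONE_GIB])
```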
In its default, incremental mode, the functional test found no issue – writing to `s3fs` in chunks under 5 GiB is no problem. On the other hand, the following command to test a single 6 GiB write…

```sh
$ AWS_PROFILE=dsapp-ddj time nice -n 3 \
    python test_store.py --assert-type=S3Store --method=full -l 6GiB \
    s3://review.dssg.io/test.6GiB.full.dat
```
…resulted in:
…itself caused by:
Only with the included changeset did the above command succeed in writing the object to S3, with the timing output:
(It certainly wasn't quick, but it was memory-stable. In the error case, RAM use exploded and threatened system stability – perhaps due to the payload being passed around and even copied by error handlers.)
Further consideration
It is important to cover this edge case and ensure that we can write 5+ GiB strings as necessary. However, it would be best (at least for memory stability) to avoid attempting such writes where possible.
A cursory look over the codebase brought this back to the fore, in `triage.component.catwalk.storage.CSVMatrixStore.save`:

I remember that an attempt was made to write the above without storing entire payloads in memory at once (e.g. passing the file object to `gzip` such that it could write to it iteratively according to standard buffering); however, this option was put aside, likely for good reason.

That said, we should seriously consider ways to maintain such standard buffering interactions, in lieu of working entirely in RAM, where possible, at least to better avoid such edge cases and limitations. A sketch of what that could look like follows.
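For instance, a streaming variant of the save path might hand the store's file object to `gzip.GzipFile` and let pandas write through it under standard buffering, never materializing the full compressed payload (a sketch; `store.open` and the CSV layout are assumptions):

```python
import gzip
import io


def save_streaming(matrix_df, store):
    """Sketch: stream a DataFrame to the store through gzip under standard
    buffering, rather than building the whole payload in RAM first.
    (`store.open` and the CSV layout are assumptions for illustration.)"""
    with store.open("wb") as raw:
        with gzip.GzipFile(fileobj=raw, mode="wb") as compressed:
            # TextIOWrapper lets to_csv write text into the binary gzip stream.
            with io.TextIOWrapper(compressed, encoding="utf-8") as text:
                matrix_df.to_csv(text)
```

Each buffered write that reaches the S3 file object should then be small, so the 5 GiB edge case would never arise in the first place.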