write 5+ GiB (matrices) to S3Store #687

Merged – merged 1 commit into master from jsl/s3store-5gb on May 8, 2019

Conversation

@jesteria (Member) commented May 7, 2019

Ensure S3Store does not attempt to write too-large chunks to S3 (5+ GiB)

Changes

Underlying library s3fs automatically writes objects to S3 in "chunks"
or "parts" -- i.e. via multipart upload -- in line with S3's minimum
limit for multipart of 5 MiB.

This should, in general, avoid S3's maximum limit per (part) upload of
5 GiB. However, s3fs assumes that no single write() might
exceed the maximum, and as such fails to chunk out such too-large upload
requests prompted by singular writes of 5+ GiB.

This can and should be resolved in s3fs. In the meantime, it can and should
be – and with this changeset is – resolved here in S3Store.
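
For illustration, a rough sketch of the approach – breaking any single oversized payload into sub-5 GiB pieces before passing them to the underlying s3fs file object. The names below are illustrative, not the actual S3Store code:

    # Sketch only: chunk a too-large payload into pieces below S3's 5 GiB
    # per-part ceiling before writing each piece to the s3fs file object.
    S3_MAX_PART_BYTES = 5 * 1024 ** 3  # 5 GiB

    def chunked_write(fileobj, payload, chunk_size=S3_MAX_PART_BYTES - 1):
        view = memoryview(payload)  # avoid copying the (possibly huge) payload
        for start in range(0, len(view), chunk_size):
            fileobj.write(view[start:start + chunk_size])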

resolves #530

Testing

In addition to the included unit test, a functional test script was composed – currently uploaded to Gist as test_store.py.
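
Roughly, the script exercises two write patterns (the sketch below is an illustrative approximation of those modes, not the Gist's contents; the S3Store construction shown is simplified):

    import s3fs
    from triage.component.catwalk.storage import S3Store

    GIB = 1024 ** 3

    def full_write(url, size=6 * GIB):
        # One giant write() via S3Store: the case that hit EntityTooLarge
        # before this changeset.
        S3Store(url).write(b"\0" * size)

    def incremental_write(url, size=6 * GIB, chunk=256 * 2 ** 20):
        # Many sub-5 GiB writes straight through s3fs: its own multipart
        # chunking handles these without issue.
        with s3fs.S3FileSystem().open(url, "wb") as f:
            for _ in range(size // chunk):
                f.write(b"\0" * chunk)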

In its default, incremental mode, the functional test found no issue – writing to s3fs in chunks under 5 GiB is no problem.

On the other hand, the following command to test a single 6 GiB write…

$ AWS_PROFILE=dsapp-ddj time nice -n 3 \
    python test_store.py --assert-type=S3Store --method=full -l 6GiB \
    s3://review.dssg.io/test.6GiB.full.dat

…resulted in:

OSError: Write failed: review.dssg.io/test.6GiB.full.dat

…itself caused by:

botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when calling the PutObject operation: Your proposed upload exceeds the maximum allowed size

Only with the included changeset did the above command succeed in writing the object to S3, with the timing output:

139.18user 55.41system 1:17:30elapsed 4%CPU (0avgtext+0avgdata 6845376maxresident)k
0inputs+48outputs (0major+7931521minor)pagefaults 0swaps

(It certainly wasn't quick, but it was memory-stable. In the error case, RAM use exploded and threatened system stability – perhaps due to the payload being passed around and even copied by error handlers.)

Further consideration

It is important to cover this edge case and ensure that we can write 5+ GiB strings when necessary. However, it would be best (at least for memory stability) to avoid doing so where possible.

A cursory look over the codebase brought this back to the fore, in triage.component.catwalk.storage.CSVMatrixStore.save:

self.matrix_base_store.write(gzip.compress(self.full_matrix_for_saving.to_csv(None).encode("utf-8")))

I remember an attempt to write the above without holding the entire payload in memory at once (e.g. passing the file object to gzip so that it could write to it iteratively, via standard buffering); however, that option was put aside (likely for good reason).

That said, we should seriously consider ways to preserve such standard buffered interactions, rather than working entirely in RAM, where possible (at least to better avoid edge cases and limitations like this one).
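
As a hedged sketch of that buffered alternative (assuming a hypothetical open_write() on the store that yields a writable binary file object – not part of the current Store API):

    import gzip
    import io

    def save_streaming(self):
        # Hypothetical: stream the CSV through gzip into the store's file
        # object, so neither the full CSV text nor its compressed form is
        # ever held in memory all at once.
        with self.matrix_base_store.open_write() as raw:
            with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                with io.TextIOWrapper(gz, encoding="utf-8") as text:
                    self.full_matrix_for_saving.to_csv(text)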

@codecov-io commented May 7, 2019

Codecov Report

Merging #687 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #687      +/-   ##
==========================================
+ Coverage   82.18%   82.19%   +<.01%     
==========================================
  Files          93       93              
  Lines        6271     6279       +8     
==========================================
+ Hits         5154     5161       +7     
- Misses       1117     1118       +1
Impacted Files Coverage Δ
src/triage/component/catwalk/storage.py 93.46% <100%> (+0.18%) ⬆️
src/triage/component/architect/utils.py 41.02% <0%> (-0.54%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@thcrock (Contributor) commented May 8, 2019

A note on the 'further consideration' section: all good points. Yes, that change was held back until post-release because it is a larger change and not necessarily needed before feature freeze. I'll also note, though, that in today's compressed-CSV world, matrix writing is extremely unlikely to be the procedure that surfaces this 5 GB problem. It's model writing.

https://github.com/dssg/triage/blob/master/src/triage/component/catwalk/storage.py#L246

The way we write models does not seem to introduce the single-pass write problem, although I haven't looked at joblib's code, so perhaps it does.
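
For what it's worth, a hedged illustration of why dumping straight to a file object would tend to avoid the single-pass problem (names here are hypothetical, not the actual storage.py code): joblib/pickle would then typically issue many smaller write() calls rather than one 5+ GiB write.

    import joblib

    def save_model(model, model_store):
        # Hypothetical open() on the store returning a writable binary
        # file object; joblib streams the pickled model into it.
        with model_store.open("wb") as f:
            joblib.dump(model, f, compress=True)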

@thcrock thcrock merged commit 4f8992b into master May 8, 2019
@thcrock thcrock deleted the jsl/s3store-5gb branch May 8, 2019 18:43
Successfully merging this pull request may close these issues.

s3fs cannot write matrices larger than 5GB