write 5+ GiB (matrices) to S3Store #687

Merged – merged 1 commit into master from jsl/s3store-5gb on May 8, 2019

Conversation

@jesteria (Member) commented May 7, 2019

Ensure S3Store does not attempt to write too-large chunks to S3 (5+ GiB)

Changes

Underlying library s3fs automatically writes objects to S3 in "chunks"
or "parts" -- i.e. via multipart upload -- in line with S3's minimum
limit for multipart of 5 MiB.

This should, in general, avoid S3's maximum limit per (part) upload of
5 GiB. However, s3fs assumes that no single write() might
exceed the maximum, and as such fails to chunk out such too-large upload
requests prompted by singular writes of 5+ GiB.

This can and should be resolved in s3fs. In the meantime, it can and should
be – and with this changeset is – resolved here in S3Store.
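
For illustration, a rough sketch of the approach – breaking any single oversized payload into sub-5 GiB pieces before passing them to the underlying s3fs file object. The names below are illustrative, not the actual S3Store code:

    # Sketch only: chunk a too-large payload into pieces below S3's 5 GiB
    # per-part ceiling before writing each piece to the s3fs file object.
    S3_MAX_PART_BYTES = 5 * 1024 ** 3  # 5 GiB

    def chunked_write(fileobj, payload, chunk_size=S3_MAX_PART_BYTES - 1):
        view = memoryview(payload)  # avoid copying the (possibly huge) payload
        for start in range(0, len(view), chunk_size):
            fileobj.write(view[start:start + chunk_size])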

resolves #530

Testing

In addition to the included unit test, a functional test script was composed – currently uploaded to Gist as test_store.py.
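
Roughly, the script exercises two write patterns (the sketch below is an illustrative approximation of those modes, not the Gist's contents; the S3Store construction shown is simplified):

    import s3fs
    from triage.component.catwalk.storage import S3Store

    GIB = 1024 ** 3

    def full_write(url, size=6 * GIB):
        # One giant write() via S3Store: the case that hit EntityTooLarge
        # before this changeset.
        S3Store(url).write(b"\0" * size)

    def incremental_write(url, size=6 * GIB, chunk=256 * 2 ** 20):
        # Many sub-5 GiB writes straight through s3fs: its own multipart
        # chunking handles these without issue.
        with s3fs.S3FileSystem().open(url, "wb") as f:
            for _ in range(size // chunk):
                f.write(b"\0" * chunk)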

In its default, incremental mode, the functional test found no issue – writing to s3fs in chunks under 5 GiB is no problem.

On the other hand, the following command to test a single 6 GiB write…

$ AWS_PROFILE=dsapp-ddj time nice -n 3 \
    python test_store.py --assert-type=S3Store --method=full -l 6GiB \
    s3://review.dssg.io/test.6GiB.full.dat

…resulted in:

OSError: Write failed: review.dssg.io/test.6GiB.full.dat

…itself caused by:

botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when calling the PutObject operation: Your proposed upload exceeds the maximum allowed size

Only with the included changeset did the above command succeed in writing the object to S3, with the timing output:

139.18user 55.41system 1:17:30elapsed 4%CPU (0avgtext+0avgdata 6845376maxresident)k
0inputs+48outputs (0major+7931521minor)pagefaults 0swaps

(It certainly wasn't quick, but it was memory-stable. In the error case, RAM use exploded and threatened system stability – perhaps due to the payload being passed around and even copied by error handlers.)

Further consideration

It is important to cover this edge case and ensure that we can write 5+ GiB strings when necessary. However, it would be best (at least for memory stability) to avoid doing so where possible.

A cursory look over the codebase brought this back to the fore, in triage.component.catwalk.storage.CSVMatrixStore.save:

self.matrix_base_store.write(gzip.compress(self.full_matrix_for_saving.to_csv(None).encode("utf-8")))

I remember an attempt to write the above without holding the entire payload in memory at once (e.g. passing the file object to gzip so that it could write to it iteratively, via standard buffering); however, that option was put aside (likely for good reason).

That said, we should seriously consider ways to preserve such standard buffered interactions, rather than working entirely in RAM, where possible (at least to better avoid edge cases and limitations like this one).
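
As a hedged sketch of that buffered alternative (assuming a hypothetical open_write() on the store that yields a writable binary file object – not part of the current Store API):

    import gzip
    import io

    def save_streaming(self):
        # Hypothetical: stream the CSV through gzip into the store's file
        # object, so neither the full CSV text nor its compressed form is
        # ever held in memory all at once.
        with self.matrix_base_store.open_write() as raw:
            with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                with io.TextIOWrapper(gz, encoding="utf-8") as text:
                    self.full_matrix_for_saving.to_csv(text)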

@codecov-io commented May 7, 2019

Codecov Report

Merging #687 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #687      +/-   ##
==========================================
+ Coverage   82.18%   82.19%   +<.01%     
==========================================
  Files          93       93              
  Lines        6271     6279       +8     
==========================================
+ Hits         5154     5161       +7     
- Misses       1117     1118       +1
Impacted Files Coverage Δ
src/triage/component/catwalk/storage.py 93.46% <100%> (+0.18%) ⬆️
src/triage/component/architect/utils.py 41.02% <0%> (-0.54%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@thcrock (Contributor) commented May 8, 2019

A note on the 'further consideration' section: all good points. Yes, that change was held back until post-release because it is a larger change and not necessarily needed before feature freeze. I'll also note, though, that in today's compressed-CSV world, matrix writing is extremely unlikely to be the procedure that surfaces this 5 GB problem. It's model writing.

https://github.com/dssg/triage/blob/master/src/triage/component/catwalk/storage.py#L246

The way we write models does not seem to introduce the single-pass write problem, although I haven't looked at joblib's code, so perhaps it does.
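
For what it's worth, a hedged illustration of why dumping straight to a file object would tend to avoid the single-pass problem (names here are hypothetical, not the actual storage.py code): joblib/pickle would then typically issue many smaller write() calls rather than one 5+ GiB write.

    import joblib

    def save_model(model, model_store):
        # Hypothetical open() on the store returning a writable binary
        # file object; joblib streams the pickled model into it.
        with model_store.open("wb") as f:
            joblib.dump(model, f, compress=True)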

@thcrock thcrock merged commit 4f8992b into master May 8, 2019
@thcrock thcrock deleted the jsl/s3store-5gb branch May 8, 2019 18:43
Successfully merging this pull request may close these issues.

s3fs cannot write matrices larger than 5GB