Truly stream chunks so the entire blob doesn't need to be kept in memory #11009
Comments
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
Hi @jabbera, thanks.
Bump. I've been trying to find a way for the Python API to stream, and I haven't found one yet, or maybe I missed it. The chunks() method in the API documentation doesn't have any description, so I'm thankful someone figured this out.
Thanks @jabbera for this request. I too was struggling to figure this out.
Hi @jabbera and @shahdadpuri-varun, sorry for the delay on this. This feature is already available in the SDK.
I will close this issue for now. Let me know if you have any other questions!
Hi @jabbera, thanks for the example code, it's really helped me out. To say "This feature is already available in the SDK" is overselling it a little. For example, at the moment, if I want to copy files from my Azure blob store to a partner's AWS S3 bucket, I would expect to be able to do something like:
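(A rough sketch of what I mean; the connection string, bucket, and blob names are placeholders, and boto3's `upload_fileobj` is assumed as the S3 side.)

```python
# Hypothetical "pythonic" copy: hand the downloader straight to boto3.
# This is what I would *expect* to work, but StorageStreamDownloader is
# not a readable file-like object, so upload_fileobj cannot consume it.
import boto3
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<azure-connection-string>", container_name="source", blob_name="big-file.bin"
)
s3 = boto3.client("s3")

s3.upload_fileobj(blob.download_blob(), "partner-bucket", "big-file.bin")
```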
In reality this requires doing:
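(Roughly the following, reusing the `blob` and `s3` clients from the sketch above.)

```python
import io

# readall() materialises the whole blob in memory before S3 ever sees a byte.
data = blob.download_blob().readall()
s3.upload_fileobj(io.BytesIO(data), "partner-bucket", "big-file.bin")
```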
Which loads the entire file into memory and then uploads it to AWS S3.
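In practice the workaround looks roughly like the sketch below (the `ChunkStream` wrapper is an illustrative name of my own, again reusing the `blob` and `s3` clients from the first snippet):

```python
import io

class ChunkStream(io.RawIOBase):
    """Minimal read-only adapter that feeds chunks() to anything expecting read()."""

    def __init__(self, chunk_iter):
        self._chunks = chunk_iter
        self._buffer = b""

    def readable(self):
        return True

    def read(self, size=-1):
        # Pull chunks until we can satisfy the requested read (or run out).
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        if size < 0:
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

# Download and upload proceed concurrently, one chunk at a time.
s3.upload_fileobj(ChunkStream(blob.download_blob().chunks()), "partner-bucket", "big-file.bin")
```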
Which is almost (but not quite!) the first example and notably uploads and downloads at the same time, using significantly less memory for large files.
I missed that this was closed. @ollytheninja reiterated my point. This SDK is striving to be as Pythonic as possible, and that chunks() API is about as far from Pythonic as possible. StorageStreamDownloader, despite its name, is not a Python stream.
If you look at my sample code, I use the API you described here, so I know it exists. The issue is that StorageStreamDownloader isn't an actual Python stream, so it's useless across 99 percent of the Python IO ecosystem unless you want to download the entire blob into memory. (Hint: we don't want one of those fancy 200TB blobs you just released sitting in RAM if we are copying it somewhere :-))
That's basically what my function does. It keeps at most 2x the chunk size in memory at any one time.
Correct, currently it's doing what you illustrated on the first line: while it fetches the file in chunks, it doesn't expose those chunks as a stream, meaning you cannot process the file in a streaming fashion; it will pull down the entire file before passing it on. This is especially confusing when "StorageStreamDownloader" returns a file-like Python object and not a stream-like Python object. Exposing a stream-like object that buffers two chunks and fetches another when the first starts being read means processing a file uses [3 * chunksize] of memory rather than [filesize]. This is not only useful for the example here of transferring files out to another provider, but also when (for example) processing frames in a video, searching large log files, etc.
If I get some time I'll see about making a pull request and a new issue for this. @tasherif-msft, what are your views on reopening this? Or should we raise a new issue?
@jabbera and @ollytheninja, thanks for the continued engagement on the topic, though I'm still a little unclear. @jabbera, it sounds like you are saying that your code does not load the entire file into memory, but rather at most two chunks, and passes them down the line completely (sounds like streaming). Whereas @ollytheninja, you are saying the entire file gets pulled down before it is passed on. Is it actually possible to accomplish the second illustration above:
My code only keeps one chunk in a buffer, plus whatever your read size is.
OK, cool. Have you tried (or are you able) to POST chunks out as part of, say, an upload to S3 without holding the entire file in memory?
No. I use it to stream really large compressed text files (30-40GB compressed), decompress them, and parse them into a more usable format.
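As a rough sketch of that kind of pipeline (zlib's streaming decompressor is used here for illustration; the connection string, container, and blob names are placeholders):

```python
import zlib
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<azure-connection-string>", container_name="logs", blob_name="huge.log.gz"
)

# wbits=31 tells zlib to expect a gzip header, so chunks can be decompressed
# incrementally without ever holding the whole blob (or its output) in memory.
decompressor = zlib.decompressobj(wbits=31)
for chunk in blob.download_blob().chunks():
    text = decompressor.decompress(chunk)
    # ...parse `text` incrementally here...
```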
Could we get some more samples in the docs for iterating over chunks()?
How can I increase the BUFFER_SIZE for the chunks? Are there any docs?
I'm facing a similar situation. Thanks to anyone who can fix this :)
I'm shocked this is still open. Native Python stream functionality should be core to this library.
Hi all, apologies that this thread has gone quiet for some time. It is true that StorageStreamDownloader does not currently behave like a native Python stream. @virtualdvid David, you can control the buffer size for chunks using the max_chunk_get_size keyword argument (see azure-sdk-for-python/sdk/storage/azure-storage-blob/azure/storage/blob/_blob_client.py, lines 130 to 131 at 073c3e8).
That being said, a little while ago I did start the work to add a proper stream interface to StorageStreamDownloader.
I resolved the issue from the previous comment by adding the "max_chunk_get_size" argument to BlobServiceClient:
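(A sketch of what that looks like; the connection string is a placeholder and the 100 MiB value is just an example.)

```python
from azure.storage.blob import BlobServiceClient

# max_chunk_get_size controls how large each downloaded chunk is, in bytes.
service = BlobServiceClient.from_connection_string(
    "<azure-connection-string>",
    max_chunk_get_size=100 * 1024 * 1024,  # e.g. 100 MiB chunks
)
```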
Could someone update us on the status of this issue? |
Hi @ericthomas1, #24275 was recently merged, which added a standard read method to StorageStreamDownloader. In the meantime, or as an alternative, the chunks() iterator discussed above can be used.
Is your feature request related to a problem? Please describe.
I have large gz files I need to stream (10-50GB). I don't want to (and don't have the memory to) download the blob into memory first. gz is a streaming format, so I only need a chunk at a time.
Describe the solution you'd like
Something like the sketch below. Note the AzureBlobStream implementation, which only keeps one chunk in memory at a time. It would be nice if StorageStreamDownloader simply acted like a stream and behaved this way.
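(The original code was trimmed here, so this is an illustrative reconstruction of the kind of wrapper meant; the connection string, container, and blob names are placeholders.)

```python
import io
from azure.storage.blob import BlobClient


class AzureBlobStream(io.RawIOBase):
    """Read-only stream over a blob that holds at most one chunk in memory."""

    def __init__(self, blob_client):
        self._chunks = blob_client.download_blob().chunks()
        self._buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Fetch the next chunk only when the current buffer is exhausted.
        while not self._buffer:
            try:
                self._buffer = next(self._chunks)
            except StopIteration:
                return 0  # EOF
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n


if __name__ == "__main__":
    import gzip

    blob = BlobClient.from_connection_string(
        "<azure-connection-string>", container_name="data", blob_name="big-file.gz"
    )
    # Wrap in BufferedReader so callers get read(), readline(), iteration, etc.
    stream = io.BufferedReader(AzureBlobStream(blob), buffer_size=4 * 1024 * 1024)
    with gzip.open(stream, "rt") as fh:
        for line in fh:
            ...  # process one decompressed line at a time
```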
Describe alternatives you've considered
Downloading tens of GB into memory.
Additional context
None.
Edit: updated code to work....