Is there a way to truly stream BlockBlock downloads and uploads? #17149

Closed
ali-elgabri opened this issue Mar 7, 2021 · 3 comments · Fixed by #17435
Labels
- Client: This issue points to a problem in the data-plane of the library.
- customer-reported: Issues that are reported by GitHub users external to the Azure organization.
- question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
- Storage: Storage Service (Queues, Blobs, Files)

Comments

@ali-elgabri

ali-elgabri commented Mar 7, 2021

I believe I have a very simple requirement for which a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.

Some context

I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as they want until the close() function is invoked. So, it was very easy for me to:

  1. Get a CloudBlockBlob object, open its BlobInputStream, and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's how I understood it. When some amount of data is read from its buffer, the same amount of new data is fetched, so it always holds approximately 4MB of unread data (until all data is retrieved).
  2. Perform some operations on that data.
  3. Retrieve the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write the processed data to it.

A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both the input and output streams, and all was good.

Now for Python

Now, the Python SDK is a little different, and obviously for good reason; the io module works differently from Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
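
To make that concrete, here is a minimal sketch of that readinto pattern as I understand it (assuming the v12 azure-storage-blob package; the connection string, container, and blob names are placeholders):

```python
# Minimal sketch: download a blob into an in-memory stream via readinto().
# Connection string and names below are placeholders.
import io

from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="my-container",
    blob_name="my-blob",
)

downloader = blob_client.download_blob()  # StorageStreamDownloader
buffer = io.BytesIO()
downloader.readinto(buffer)               # copies the blob's bytes into the buffer
buffer.seek(0)                            # rewind before processing the data
```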

For uploads, I would call the BlobClient's upload method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
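
So, in principle, something like this should be accepted (a minimal sketch of handing upload_blob an iterable, here a generator of bytes; names are placeholders):

```python
# Minimal sketch: upload_blob reading from an iterable (a generator of bytes).
# The client consumes the iterable as the upload proceeds.
from azure.storage.blob import BlobClient

def produce_data():
    # placeholder for whatever produces data incrementally
    for i in range(3):
        yield b"some bytes, piece %d\n" % i

out_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="my-container",    # placeholder
    blob_name="output-blob",          # placeholder
)
out_client.upload_blob(produce_data(), overwrite=True)
```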

I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:

  1. When I call download_blob, I get back a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length parameters to download only the amount of data I want. Perhaps I can call it once with download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB), process that data, and so on (a sketch of this loop follows this list). This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (set it to True) so that the StorageStreamDownloader only downloads 4MB at a time. But this all results in several problems: that's not really streaming from a stream object. I'll still have to call download and readinto several times. And fine, I would do that, if it weren't for the second problem:
  2. How the heck do I stream an upload? The upload can take a stream. But if the stream doesn't auto-update itself, then I can only upload once, because all the blobs I deal with must be BlockBlobs. The docs for the upload_blob function say that I can provide an overwrite param that does:

keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data.
If True, upload_blob will overwrite the existing data. If set to False, the
operation will fail with ResourceExistsError. The exception to the above is with Append
blob types: if set to False and the data already exists, an error will not be raised
and the data will be appended to the existing blob. If set overwrite=True, then the existing
append blob will be deleted, and a new one created. Defaults to False.

  3. And this makes sense because BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or that holds all the data, then the upload() function will terminate as soon as it finishes, right?
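
For concreteness, here is roughly what the ranged-download loop from point 1 would look like (a sketch only; the 4MB window, names, and connection string are placeholders):

```python
# Sketch of the ranged-download loop from point 1: pull the blob down in
# fixed-size windows via offset/length instead of all at once.
from azure.storage.blob import BlobClient

WINDOW = 4 * 1024 * 1024  # 4MB per request (placeholder size)

blob_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="my-container",    # placeholder
    blob_name="big-blob",             # placeholder
)

size = blob_client.get_blob_properties().size
offset = 0
while offset < size:
    length = min(WINDOW, size - offset)
    data = blob_client.download_blob(offset=offset, length=length).readall()
    # ... process `data` here ...
    offset += length
```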

Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other azure SDKs I know about.

To recap

Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:

  1. Retrieving some data from a blob (the data received in one call should not be more than 10MB) in a stream.
  2. Compressing the data in that stream.
  3. Uploading the data to a blob.

This should work for blobs of all sizes; whether its 1MB or 10MB or 10GB should not matter. Step 2 can be anything really; it can also be nothing. Just as long as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be an amount more than 10MB.
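
For what it's worth, my reading of the client docs is that the max_single_get_size and max_chunk_get_size keywords cap how much each download request pulls, which would cover the 10MB constraint; a hedged sketch, assuming those keywords behave as documented (names are placeholders):

```python
# Sketch: cap each download request at ~10MB, assuming max_single_get_size /
# max_chunk_get_size behave as documented. Names are placeholders.
from azure.storage.blob import BlobClient

TEN_MB = 10 * 1024 * 1024

blob_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="my-container",    # placeholder
    blob_name="big-blob",             # placeholder
    max_single_get_size=TEN_MB,       # size of the first download request
    max_chunk_get_size=TEN_MB,        # size of each subsequent chunk request
)

for chunk in blob_client.download_blob().chunks():
    # each chunk should be at most ~10MB; step 2 (processing) would go here
    pass
```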

I hope this makes sense! I just want to stream data.

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Mar 7, 2021
@yunhaoling yunhaoling added the Storage Storage Service (Queues, Blobs, Files) label Mar 8, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Mar 8, 2021
@yunhaoling yunhaoling added Client This issue points to a problem in the data-plane of the library. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Mar 8, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Mar 8, 2021
@yunhaoling
Contributor

Thanks @ali-elgabri for reaching out! We'll take a look ASAP.
Adding @xiafu-msft, who can further help with usage of the Storage SDK.

@yunhaoling yunhaoling modified the milestone: [2021] April Mar 8, 2021
@ali-elgabri
Author

@yunhaoling @xiafu-msft any updates guys?

@tasherif-msft
Contributor

Hi @ali-elgabri - you can use chunks() for your use case. Using readall() would return the entirety of your blob data, while readinto() would stream the data into a stream handle. You cannot pass a StorageStreamDownloader itself to upload_blob - the upload cannot stream from that object.
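
Roughly, that pipeline could look like this (a sketch only, assuming azure-storage-blob v12; the connection string and blob names are placeholders, and the gzip step just stands in for whatever processing is needed):

```python
# Sketch: stream a blob down via chunks(), gzip-compress it on the fly, and
# stream the compressed bytes back up via upload_blob. Names are placeholders.
import zlib

from azure.storage.blob import BlobClient

CONN_STR = "<connection-string>"  # placeholder

src = BlobClient.from_connection_string(CONN_STR, container_name="my-container", blob_name="source-blob")
dst = BlobClient.from_connection_string(CONN_STR, container_name="my-container", blob_name="source-blob.gz")

def compressed_chunks():
    # wbits = MAX_WBITS | 16 tells zlib to emit gzip-framed output
    compressor = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)
    # chunks() yields the blob's content piece by piece instead of all at once
    for chunk in src.download_blob().chunks():
        compressed = compressor.compress(chunk)
        if compressed:
            yield compressed
    yield compressor.flush()

# upload_blob accepts an iterable of bytes, so each compressed piece is
# uploaded as it is produced instead of buffering the whole blob in memory
dst.upload_blob(compressed_chunks(), overwrite=True)
```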

I believe you posted a PR with a sample regarding chunks(). I will modify the sample so it is well incorporated with the rest of our samples and will merge it.

We will also improve the docstrings of download_blob() to reference the ability to use chunks().

Let me know if you need anything else!

@tasherif-msft tasherif-msft self-assigned this Mar 18, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023