Is there a way to truly stream BlockBlock downloads and uploads? #17149
Labels: Client, customer-reported, question, Storage
I believe I have a very simple requirement, but finding a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.
Some context
I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function that one can call to write data as frequently as they want, until the close() function is invoked. So, it was very easy for me to:

1. Get a CloudBlockBlob object, open its BlobInputStream, and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained about 4MB of data - at least, that's what I understood. Whenever some amount of data was read from its buffer, the same amount of new data was fetched, so it always held approximately 4MB of fresh data (until all the data was retrieved).
2. Get the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I did some operations on.

A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both the input and output streams, and all was good.

Now for Python
Now, the Python SDK is a little different, and obviously for good reason: the io module works differently than Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.

For uploads, I would call the BlobClient's upload_blob method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
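To make that concrete, here is roughly how I have been wiring those two calls together - a minimal sketch, where the connection string, container, and blob names are all placeholders:

```python
from io import BytesIO

from azure.storage.blob import BlobServiceClient

# Placeholder connection details.
service = BlobServiceClient.from_connection_string("<connection-string>")
src = service.get_blob_client("my-container", "source-blob")
dst = service.get_blob_client("my-container", "dest-blob")

# Download: download_blob() returns a StorageStreamDownloader,
# and readinto() writes the blob's contents into my stream.
buffer = BytesIO()
src.download_blob().readinto(buffer)

# Upload: upload_blob() accepts a file-like object (IO[AnyStr]).
buffer.seek(0)
dst.upload_blob(buffer, overwrite=True)
```

This works, but the buffer ends up holding the entire blob in memory, which is exactly what I am trying to avoid.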
I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I suspect that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:
1. Downloads: Calling download_blob() gives me a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length parameters to download just the amount of data I want. Perhaps I could call download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB), process that data, and so on. This is unfavorable. The other thing I could do is set the max_chunk_get_size parameter on the BlobClient and turn on the validate_content flag (make it True) so that the StorageStreamDownloader only downloads 4MB at a time. But this all results in several problems: that's not really streaming from a stream object, and I'd still have to call download and readinto several times. And fine, I would do that (see the sketch below), if it weren't for the second problem:
2. Uploads: I can't write to BlockBlobs incrementally. The docs for the upload_blob function say that I can provide an overwrite param, but as far as I can tell, BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or one that holds all the data, then the upload() function will terminate as soon as it finishes, right?

Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other Azure SDKs I know about.
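For what it's worth, here is what the offset/length workaround from problem 1 looks like in practice - a minimal sketch, where the 4MB chunk size is my own choice and process() is a hypothetical placeholder:

```python
from azure.storage.blob import BlobServiceClient

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB per call; my choice, not a library default

service = BlobServiceClient.from_connection_string("<connection-string>")
src = service.get_blob_client("my-container", "source-blob")

blob_size = src.get_blob_properties().size
offset = 0
while offset < blob_size:
    length = min(CHUNK_SIZE, blob_size - offset)
    # Each iteration opens a fresh ranged download -- not one live stream.
    chunk = src.download_blob(offset=offset, length=length).readall()
    process(chunk)  # placeholder for whatever work I need to do
    offset += length
```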
To recap

Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot receive all the data in a blob at once. Blobs are likely to be over 1GB, and all that pretty stuff. I would honestly love some example code that shows:

1. Downloading some data from a blob into a stream
2. Doing something with the data in the stream
3. Uploading that data in a stream to another blob
This should work for blobs of all sizes; whether it's 1MB or 10MB or 10GB should not matter. Step 2 can be anything, really; it can also be nothing. As long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be more than 10MB.
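To illustrate the shape of what I'm after, here is my best guess at such a pipeline - an untested sketch that assumes StorageStreamDownloader.chunks() yields the blob piece by piece and that stage_block()/commit_block_list() is the intended way to build up the destination blob incrementally:

```python
import uuid

from azure.storage.blob import BlobBlock, BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
src = service.get_blob_client("my-container", "source-blob")
dst = service.get_blob_client("my-container", "dest-blob")

block_ids = []
downloader = src.download_blob()  # StorageStreamDownloader

# chunks() yields the blob in pieces (sized by the client's
# max_chunk_get_size), so the whole blob is never held in memory.
for chunk in downloader.chunks():
    data = chunk                   # step 2: process here; pass-through for now
    block_id = uuid.uuid4().hex    # unique, fixed-length id per block
    dst.stage_block(block_id=block_id, data=data)
    block_ids.append(BlobBlock(block_id=block_id))

# The destination blob only becomes visible once the block list is committed.
dst.commit_block_list(block_ids)
```

I am not sure whether staging blocks like this is the blessed approach - hence the question.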
I hope this makes sense! I just want to stream data.