Cannot stream-write a zipfile #631

Closed
Clockwork-Muse opened this issue Oct 19, 2021 · 6 comments · Fixed by #644
Labels: api: storage (Issues related to the googleapis/python-storage API), status: investigating (The issue is under investigation, which is determined to be non-trivial)

Comments

@Clockwork-Muse

I'm attempting to stream-write an uncompressed zip file (200MB+ -> 1GB+ eventually, mostly ~3MB images) to avoid writing to disk. Unfortunately, when the zip file is closed it at least partially flushes the stream, and the storage client seems to assume that a flush will only ever happen to close the streamed file. That isn't the case here, and seems wrong in general, since buffers are often flushed when they reach saturation.

Environment details

  • OS type and version: Ubuntu 20.04 (custom devcontainer docker image)
  • Python version: 3.8.10
  • pip version: 21.3
  • google-cloud-storage version: 1.42.3

Code example

import zipfile

from google.cloud.storage import Blob, Client

client = Client("some-account")
blob = Blob.from_string("gs://some-bucket/some-folder/something.zip", client)

with blob.open(mode="wb") as blob_file, \
    zipfile.ZipFile(blob_file, mode="w") as zip_file:
    # Empty/not empty doesn't matter, the same error is generated
    pass

Stack trace

Traceback (most recent call last):
  File "s.py", line 14, in <module>
    pass
  File "/usr/lib/python3.8/zipfile.py", line 1312, in __exit__
    self.close()
  File "/usr/lib/python3.8/zipfile.py", line 1839, in close
    self._write_end_record()
  File "/usr/lib/python3.8/zipfile.py", line 1947, in _write_end_record
    self.fp.flush()
  File "/workspaces/someproject/.pyenv/lib/python3.8/site-packages/google/cloud/storage/fileio.py", line 401, in flush
    raise io.UnsupportedOperation(
io.UnsupportedOperation: Cannot flush without finalizing upload. Use close() instead.


Workaround

Insert an io.BufferedWriter between the zipfile and the blob stream:

import io
import zipfile

from google.cloud.storage import Blob, Client

client = Client("some-account")
blob = Blob.from_string("gs://some-bucket/some-folder/something.zip", client)

with blob.open(mode="wb") as blob_file, \
    io.BufferedWriter(blob_file) as binary_file, \
    zipfile.ZipFile(binary_file, mode="w") as zip_file:
    # Empty/not empty doesn't matter; a new error is still raised on close
    pass

This results in the file being written to cloud storage (and, at least for a simple case, with correctly written contents), but it prints a new error, presumably because io.BufferedWriter closes the underlying blob_file when its context exits, so the outer with block then calls close() on an already-closed file:

Traceback (most recent call last):
  File "s.py", line 24, in <module>
    pass
  File "/workspaces/someproject/.pyenv/lib/python3.8/site-packages/google/cloud/storage/fileio.py", line 406, in close
    self._checkClosed()  # Raises ValueError if closed.
  File "/workspaces/someproject/.pyenv/lib/python3.8/site-packages/google/cloud/storage/fileio.py", line 413, in _checkClosed
    raise ValueError("I/O operation on closed file.")
ValueError: I/O operation on closed file.
product-auto-label bot added the api: storage label (Oct 19, 2021)
frankyn self-assigned this (Oct 20, 2021)
frankyn added the status: investigating label (Oct 20, 2021)
@Clockwork-Muse (Author)

For files of any real size, the workaround I gave above is not sufficient; I get errors around the ~50MB range:

Traceback (most recent call last):
  File "src/s.py", line 27, in <module>
    o.export_function(**{f"some_image_{i}.png": img, f"some{i}.jpg": img})
  File "/some_project/src/output_manager.py", line 76, in export_function
    image_file.write(image_bytes)
  File "/usr/lib/python3.8/zipfile.py", line 1141, in write
    self._fileobj.write(data)
OSError: raw write() returned invalid length 41943095 (should have been between 0 and 55)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/contextlib.py", line 510, in __exit__
    if cb(*exc_details):
  File "/usr/lib/python3.8/zipfile.py", line 1312, in __exit__
    self.close()
  File "/usr/lib/python3.8/zipfile.py", line 1839, in close
    self._write_end_record()
  File "/usr/lib/python3.8/zipfile.py", line 1947, in _write_end_record
    self.fp.flush()
OSError: raw write() returned invalid length 41943601 (should have been between 0 and 561)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/contextlib.py", line 510, in __exit__
    if cb(*exc_details):
OSError: raw write() returned invalid length 41943601 (should have been between 0 and 561)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/s.py", line 28, in <module>
    o.export_text_function(gt, [])
  File "/some_project/src/output_manager.py", line 200, in __exit__
    return self.__context_stack.__exit__(exc_type, exc_val, exc_tb)
  File "/usr/lib/python3.8/contextlib.py", line 525, in __exit__
    raise exc_details[1]
  File "/usr/lib/python3.8/contextlib.py", line 510, in __exit__
    if cb(*exc_details):
  File "/some_project/.pyenv/lib/python3.8/site-packages/google/cloud/storage/fileio.py", line 406, in close
    self._checkClosed()  # Raises ValueError if closed.
  File "/some_project/.pyenv/lib/python3.8/site-packages/google/cloud/storage/fileio.py", line 413, in _checkClosed
    raise ValueError("I/O operation on closed file.")
ValueError: I/O operation on closed file.

... it's not clear to me why export_text_function is getting called here, since it's the export_function call that throws the error while exiting my wrapping context (strategically placed print-debugging lines aren't being reached, so I'm really confused).

In any case, the archive written up to that point does reach the gcloud storage location; it can be downloaded and unpacked, and those files can be read.
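
For context on the raw write() returned invalid length message: io.BufferedWriter enforces the RawIOBase contract that write() must return the number of bytes consumed by that call, between 0 and len(b). A value like 41943601 looks like a cumulative stream position rather than a per-call count, which would suggest (an assumption, not something confirmed in this thread) that the blob writer's write() is returning its total position. A minimal sketch of the contract check, using a deliberately buggy, hypothetical raw stream:

import io

class CumulativePositionRaw(io.RawIOBase):
    """Hypothetical raw stream whose write() (incorrectly) returns the
    cumulative position instead of the bytes consumed by this call."""

    def __init__(self):
        self._pos = 0

    def writable(self):
        return True

    def write(self, b):
        self._pos += len(b)
        return self._pos  # BUG: should be len(b)

buffered = io.BufferedWriter(CumulativePositionRaw(), buffer_size=8)
buffered.write(b"0123456789abcdef")  # larger than the buffer: written through; return of 16 is valid
buffered.write(b"xy")                # small write: sits in the buffer

try:
    buffered.flush()  # raw write() of 2 bytes returns 18 -> contract violation
except OSError as exc:
    print(exc)  # raw write() returned invalid length 18 (should have been between 0 and 2)
# (at interpreter exit, the destructor will attempt another flush; ignore for this sketch)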

frankyn assigned andrewsg and unassigned frankyn (Oct 26, 2021)
@andrewsg (Contributor)

Hello, thank you for your report.

Some context here: the reason flush() triggers a file close is that the file-like I/O uses a resumable upload handler under the hood, and resumable uploads require chunks of exactly the same size every time until the final chunk, which may be of arbitrary size. flush() sends a chunk of arbitrary size, which is the signal to the remote server that the upload is complete.
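
A toy sketch of that constraint (not the library's actual code; upload_chunk and finalize are hypothetical stand-ins for the transport calls):

CHUNK_SIZE = 256 * 1024  # resumable uploads require a fixed multiple of 256 KiB

def upload_chunk(data):   # stand-in for the real mid-upload transport call
    pass

def finalize(data):       # stand-in for the final, arbitrary-size chunk
    pass

class ResumableSketch:
    """Toy model of why flushing a partial chunk must finalize the upload."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data
        # Mid-upload, only full fixed-size chunks may be sent.
        while len(self._buffer) >= CHUNK_SIZE:
            upload_chunk(self._buffer[:CHUNK_SIZE])
            self._buffer = self._buffer[CHUNK_SIZE:]
        return len(data)

    def close(self):
        # Any chunk smaller than CHUNK_SIZE tells the server "this is the
        # last one", so sending the partial buffer is inherently a close,
        # and flush() cannot be implemented without finalizing.
        finalize(self._buffer)
        self._buffer = b""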

I don't think I yet fully understand your use case or the connection between the zipfile closing and flush() being triggered. Is there relevant code missing from your example? If the pass section is a redaction rather than your literal code, the redacted code may be very relevant to understanding what is happening here.

@Clockwork-Muse (Author)

The same error is generated whether or not anything replaces pass.

This repro produces the same error as my real code:

import zipfile

from google.cloud.storage import Blob, Client

client = Client("client_account")
blob = Blob.from_string("gs://some_bucket_path/something.zip", client)

with blob.open(mode="wb") as blob_file, \
    zipfile.ZipFile(blob_file, mode="w") as zip_file, \
    open("large_image.png", mode="rb") as large_image_file, \
    open("small_image.jpg", mode="rb") as small_image_file:
    large_image_bytes = large_image_file.read()
    small_image_bytes = small_image_file.read()
    for i in range(10):
        print(f"writing image {i}")
        with zip_file.open(f"{i}.png", mode="w") as zip_entry:
            zip_entry.write(large_image_bytes)
        with zip_file.open(f"{i}.jpg", mode="w") as zip_entry:
            zip_entry.write(small_image_bytes)
        print(f"Succeeded writing image {i}")

... it turns out this actually succeeds in writing all of the images to the zip on gcloud, but then throws an error on finish:

writing image 0
Succeeded writing image 0
writing image 1
Succeeded writing image 1
writing image 2
Succeeded writing image 2
writing image 3
Succeeded writing image 3
writing image 4
Succeeded writing image 4
writing image 5
Succeeded writing image 5
writing image 6
Succeeded writing image 6
writing image 7
Succeeded writing image 7
writing image 8
Succeeded writing image 8
writing image 9
Succeeded writing image 9
Traceback (most recent call last):
  File "s.py", line 22, in <module>
    print(f"Succeeded writing image {i}")
  File "/usr/lib/python3.8/zipfile.py", line 1312, in __exit__
    self.close()
  File "/usr/lib/python3.8/zipfile.py", line 1839, in close
    self._write_end_record()
  File "/usr/lib/python3.8/zipfile.py", line 1947, in _write_end_record
    self.fp.flush()
  File "/project_path/.pyenv/lib/python3.8/site-packages/google/cloud/storage/fileio.py", line 401, in flush
    raise io.UnsupportedOperation(
io.UnsupportedOperation: Cannot flush without finalizing upload. Use close() instead.

@andrewsg (Contributor)

Oh, I see: zipfile issues the flush() on the blob file automatically on close.

With the resumable upload behavior, where uploading a partial chunk necessarily closes the file, there's no way we can simply make flush() work the way people expect it to. Given that restriction, what kind of behavior would make more sense in this case?

@Clockwork-Muse (Author)

I don't know; I just don't want it to throw errors when I close the file.

@andrewsg (Contributor)

andrewsg commented Oct 29, 2021

Okay. I'll work on a new behavior here. In the meantime, this is perhaps a silly workaround, but inside your with statement you can set the private attribute blob_file._text_mode = True to silence the flush() error. This won't actually put the blob into text mode; it is just a signal to flush() that it should not raise an error, which is the behavior TextIOWrapper requires.
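
Applied to the original repro, the workaround might look like this (a sketch; _text_mode is a private attribute and could change in a future release):

import zipfile

from google.cloud.storage import Blob, Client

client = Client("some-account")
blob = Blob.from_string("gs://some-bucket/some-folder/something.zip", client)

with blob.open(mode="wb") as blob_file:
    blob_file._text_mode = True  # private attribute: silences the flush() error
    with zipfile.ZipFile(blob_file, mode="w") as zip_file:
        zip_file.writestr("hello.txt", b"hello")  # any writes work as before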
