OOM when running to_dataframe_iterable with bqstorage client #1292

Closed
andaag opened this issue Jul 17, 2022 · 2 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API.

Comments


andaag commented Jul 17, 2022

Environment details

  • OS type and version: Linux; also happens in the python:3.9-slim-bullseye Docker container running in GKE.
  • Python version: 3.9.12
  • pip version: 22.0.4
  • google-cloud-bigquery version: 3.2.0

Steps to reproduce

  1. Read from a large table with to_dataframe_iterable(bqstorage_client)
  2. Memory keeps filling until the OOM killer kicks in.
  3. Disable bqstorage_client and the problem is gone. (EDIT: not entirely sure this is true; I think it still happens, just astronomically slower. Iterating by row is different, though.)

Code example

from google.cloud import bigquery_storage

# bigquery_result is a QueryJob returned by client.query(...)

# Runs out of memory:
bqstorage_client = bigquery_storage.BigQueryReadClient()
for df in bigquery_result.result().to_dataframe_iterable(bqstorage_client=bqstorage_client, max_queue_size=2):
    pass

# Works fine:
for row in bigquery_result.result():
    pass

Is max_queue_size not propagated, or something like that? The table I'm reading from is 24 GB and not partitioned. I've been trying to use tracemalloc etc. to track down what's going on, but haven't been successful. Happy to help add debug information if anyone has ideas on how to resolve this one.

product-auto-label bot added the api: bigquery label Jul 17, 2022

andaag commented Jul 18, 2022

I think max_queue_size was simply set so that memory usage sat right at the limit of how much memory this pod had.

Increasing the memory on the pod considerably and limiting max_queue_size to 1 seems to have solved the issue.
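
For reference, a rough sketch of that combination (the query string and the per-chunk process() call are illustrative placeholders, not from my original snippet):

from google.cloud import bigquery, bigquery_storage

client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

# Placeholder query; in practice this reads the large, unpartitioned table.
bigquery_result = client.query("SELECT * FROM `project.dataset.large_table`")

# max_queue_size=1 caps the shared queue of pre-fetched dataframes at one,
# so the download threads can't buffer far ahead of the consuming loop.
for df in bigquery_result.result().to_dataframe_iterable(
    bqstorage_client=bqstorage_client, max_queue_size=1
):
    process(df)  # placeholder for the actual per-chunk work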

andaag closed this as completed Jul 18, 2022

michalc commented Aug 25, 2023

We had this issue as well. We started to suspect that to_dataframe_iterable can kick off lots of threads, each of which uses a lot of memory, and also suspected that max_queue_size didn't affect it. Not using the bqstorage client also seemed to fix the memory issue, but was much slower.

To get the best of both worlds, we came up with a bit of a hack that replaces the bqStorageClient's create_read_session method, forcing its max_stream_count argument to always be 1:

from google.cloud.bigquery_storage import BigQueryReadClient

bqStorageClient = BigQueryReadClient(...)

original_create_read_session = bqStorageClient.create_read_session

def create_read_session(*args, **kwargs):
    # Drop whatever max_stream_count the caller asked for and force a single stream.
    kwargs.pop('max_stream_count', None)
    return original_create_read_session(*args, max_stream_count=1, **kwargs)

bqStorageClient.create_read_session = create_read_session

This modified bqStorageClient can be passed to to_dataframe_iterable as usual.

We then get the speed boost of using the BigQuery Storage API, but memory usage is wonderfully constant.
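
For completeness, a rough sketch of how the patched client gets wired up end to end (the query string and handle_chunk() are illustrative placeholders):

from google.cloud import bigquery
from google.cloud.bigquery_storage import BigQueryReadClient

bq_client = bigquery.Client()
bqStorageClient = BigQueryReadClient()

# Same monkey-patch as above: force every read session down to a single stream.
original_create_read_session = bqStorageClient.create_read_session

def create_read_session(*args, **kwargs):
    kwargs.pop('max_stream_count', None)
    return original_create_read_session(*args, max_stream_count=1, **kwargs)

bqStorageClient.create_read_session = create_read_session

rows = bq_client.query("SELECT * FROM `project.dataset.large_table`").result()
for df in rows.to_dataframe_iterable(bqstorage_client=bqStorageClient):
    handle_chunk(df)  # placeholder for the per-dataframe processing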
