OOM when running to_dataframe_iterable with bqstorage client #1292

Closed
andaag opened this issue Jul 17, 2022 · 2 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API.

Comments


andaag commented Jul 17, 2022

Environment details

  • OS type and version: Linux; also happens in the python:3.9-slim-bullseye Docker container running in GKE.
  • Python version: 3.9.12
  • pip version: 22.0.4
  • google-cloud-bigquery version: 3.2.0

Steps to reproduce

  1. Read from a large table with to_dataframe_iterable(bqstorage_client)
  2. Memory keeps filling until the OOM killer kicks in.
  3. Disable bqstorage_client and the problem is gone. (EDIT: not entirely sure this is true; I think it still happens, just astronomically slower. Iterating by row is different, though.)

Code example

from google.cloud import bigquery_storage

# bigquery_result is a QueryJob returned by client.query(...)

# Runs out of memory:
bqstorage_client = bigquery_storage.BigQueryReadClient()
for df in bigquery_result.result().to_dataframe_iterable(bqstorage_client=bqstorage_client, max_queue_size=2):
    pass

# Works fine:
for row in bigquery_result.result():
    pass

Is max_queue_size not propagated, or something like that? The table I'm reading from is 24 GB and not partitioned. I've been trying to use tracemalloc etc. to track down what's going on, but haven't been successful. Happy to help add debug information if anyone has ideas on how to resolve this one.

product-auto-label bot added the api: bigquery label Jul 17, 2022

andaag commented Jul 18, 2022

I think max_queue_size was simply set so that memory usage sat right at the limit of how much memory this pod had.

Increasing the memory on the pod considerably and limiting max_queue_size to 1 seems to have solved the issue.
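
For reference, a rough sketch of that combination (the query string and the per-chunk process() call are illustrative placeholders, not from my original snippet):

from google.cloud import bigquery, bigquery_storage

client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

# Placeholder query; in practice this reads the large, unpartitioned table.
bigquery_result = client.query("SELECT * FROM `project.dataset.large_table`")

# max_queue_size=1 caps the shared queue of pre-fetched dataframes at one,
# so the download threads can't buffer far ahead of the consuming loop.
for df in bigquery_result.result().to_dataframe_iterable(
    bqstorage_client=bqstorage_client, max_queue_size=1
):
    process(df)  # placeholder for the actual per-chunk work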

andaag closed this as completed Jul 18, 2022

michalc commented Aug 25, 2023

We had this issue as well. We started to suspect that to_dataframe_iterable can kick off lots of threads, each of which uses a lot of memory, and also suspected that max_queue_size didn't affect it. Not using the bqstorage client also seemed to fix the memory issue, but was much slower.

To get the best of both worlds, we came up with a bit of a hack that replaces the bqStorageClient's create_read_session method, forcing its max_stream_count argument to always be 1:

from google.cloud.bigquery_storage import BigQueryReadClient

bqStorageClient = BigQueryReadClient(...)

original_create_read_session = bqStorageClient.create_read_session

def create_read_session(*args, **kwargs):
    # Drop whatever max_stream_count the caller asked for and force a single stream.
    kwargs.pop('max_stream_count', None)
    return original_create_read_session(*args, max_stream_count=1, **kwargs)

bqStorageClient.create_read_session = create_read_session

This modified bqStorageClient can be passed to to_dataframe_iterable as usual.

We then get the speed boost of using the BigQuery Storage API, but memory usage is wonderfully constant.
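
For completeness, a rough sketch of how the patched client gets wired up end to end (the query string and handle_chunk() are illustrative placeholders):

from google.cloud import bigquery
from google.cloud.bigquery_storage import BigQueryReadClient

bq_client = bigquery.Client()
bqStorageClient = BigQueryReadClient()

# Same monkey-patch as above: force every read session down to a single stream.
original_create_read_session = bqStorageClient.create_read_session

def create_read_session(*args, **kwargs):
    kwargs.pop('max_stream_count', None)
    return original_create_read_session(*args, max_stream_count=1, **kwargs)

bqStorageClient.create_read_session = create_read_session

rows = bq_client.query("SELECT * FROM `project.dataset.large_table`").result()
for df in rows.to_dataframe_iterable(bqstorage_client=bqStorageClient):
    handle_chunk(df)  # placeholder for the per-dataframe processing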
