Make max_stream_count configurable when using BigQuery Storage API #2030

Open · kien-truong opened this issue Sep 24, 2024 · 1 comment
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API)

Currently, for APIs that can use the BQ Storage client to fetch data, such as to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server:

requested_streams = 1 if preserve_order else 0

session = bqstorage_client.create_read_session(
    parent="projects/{}".format(project_id),
    read_session=requested_session,
    max_stream_count=requested_streams,
)

This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.

The BQ Storage API documentation also suggests capping max_stream_count when resources are constrained:

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest

Typically, clients should either leave this unset to let the system to determine an upper bound OR set this a size for the maximum "units of work" it can gracefully handle.
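
For reference, callers who use the Storage API read client directly can already pass this cap themselves. A minimal sketch (project, dataset, and table names are placeholders):

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/my-project/datasets/my_dataset/tables/my_table",
    data_format=types.DataFormat.ARROW,
)
session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=4,  # cap the "units of work" to what we can gracefully handle
)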

This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Storage client object (#1292); a condensed sketch of the idea follows.
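
This is a minimal sketch of the workaround, not the exact code from that thread; the cap of 8 is an arbitrary example value to tune against your memory budget:

from google.cloud import bigquery, bigquery_storage

bq_client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

# Wrap create_read_session so every call clamps max_stream_count.
_original_create_read_session = bqstorage_client.create_read_session

def _capped_create_read_session(*args, max_stream_count=0, **kwargs):
    # 0 means "let the server decide"; replace it with a hard cap.
    # A value of 1 (preserve_order) is already below the cap and passes through.
    capped = min(max_stream_count, 8) if max_stream_count else 8
    return _original_create_read_session(*args, max_stream_count=capped, **kwargs)

bqstorage_client.create_read_session = _capped_create_read_session

rows = bq_client.query("SELECT * FROM `my_dataset.big_table`").result()
for df in rows.to_dataframe_iterable(bqstorage_client=bqstorage_client):
    ...  # each batch now comes from at most 8 concurrent read streams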

However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
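
For illustration only, the public surface might look like this; max_stream_count is the proposed parameter and does not exist on these methods today:

rows = bq_client.query("SELECT * FROM `my_dataset.big_table`").result()
for df in rows.to_dataframe_iterable(
    bqstorage_client=bqstorage_client,
    max_stream_count=4,  # proposed: cap the number of concurrent read streams
):
    process(df)  # process() is a placeholder for user code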

kien-truong (Author) commented Sep 24, 2024

Using the default setting, in the worst-case scenario, with n download streams, we would have to hold up to 2n result pages in memory (a rough sizing sketch follows after the list):

  • 1 result page inside each download thread, times n threads
  • n result pages in the transfer queue between the download threads and the main thread
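
As a rough illustration (the 10 MB page size is an assumption, not a measurement; actual page sizes vary with schema and row size):

# Worst case: one page resident per download thread plus one per queue slot,
# i.e. 2 * n_streams pages in memory at once.
def worst_case_memory_mb(n_streams: int, page_mb: float = 10.0) -> float:
    return 2 * n_streams * page_mb

print(worst_case_memory_mb(100))  # 100 streams x 10 MB pages -> 2000.0 MB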
