Make max_stream_count configurable when using BigQuery Storage API #2030

Open · kien-truong opened this issue Sep 24, 2024 · 1 comment
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API)

Currently, for APIs that can use the BQ Storage client to fetch data, such as to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server:

requested_streams = 1 if preserve_order else 0

session = bqstorage_client.create_read_session(
    parent="projects/{}".format(project_id),
    read_session=requested_session,
    max_stream_count=requested_streams,
)

This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.

The BQ Storage API documentation also suggests capping max_stream_count when resources are constrained:

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest

Typically, clients should either leave this unset to let the system to determine an upper bound OR set this a size for the maximum "units of work" it can gracefully handle.
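
For reference, callers who use the Storage API read client directly can already pass this cap themselves. A minimal sketch (project, dataset, and table names are placeholders):

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/my-project/datasets/my_dataset/tables/my_table",
    data_format=types.DataFormat.ARROW,
)
session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=4,  # cap the "units of work" to what we can gracefully handle
)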

This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Storage client object (#1292); a condensed sketch of the idea follows.
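
This is a minimal sketch of the workaround, not the exact code from that thread; the cap of 8 is an arbitrary example value to tune against your memory budget:

from google.cloud import bigquery, bigquery_storage

bq_client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

# Wrap create_read_session so every call clamps max_stream_count.
_original_create_read_session = bqstorage_client.create_read_session

def _capped_create_read_session(*args, max_stream_count=0, **kwargs):
    # 0 means "let the server decide"; replace it with a hard cap.
    # A value of 1 (preserve_order) is already below the cap and passes through.
    capped = min(max_stream_count, 8) if max_stream_count else 8
    return _original_create_read_session(*args, max_stream_count=capped, **kwargs)

bqstorage_client.create_read_session = _capped_create_read_session

rows = bq_client.query("SELECT * FROM `my_dataset.big_table`").result()
for df in rows.to_dataframe_iterable(bqstorage_client=bqstorage_client):
    ...  # each batch now comes from at most 8 concurrent read streams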

However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
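
For illustration only, the public surface might look like this; max_stream_count is the proposed parameter and does not exist on these methods today:

rows = bq_client.query("SELECT * FROM `my_dataset.big_table`").result()
for df in rows.to_dataframe_iterable(
    bqstorage_client=bqstorage_client,
    max_stream_count=4,  # proposed: cap the number of concurrent read streams
):
    process(df)  # process() is a placeholder for user code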

kien-truong (Author) commented Sep 24, 2024

Using the default setting, in the worst-case scenario, with n download streams, we would have to hold up to 2n result pages in memory (a rough sizing sketch follows after the list):

  • 1 result page inside each download thread, times n threads
  • n result pages in the transfer queue between the download threads and the main thread
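
As a rough illustration (the 10 MB page size is an assumption, not a measurement; actual page sizes vary with schema and row size):

# Worst case: one page resident per download thread plus one per queue slot,
# i.e. 2 * n_streams pages in memory at once.
def worst_case_memory_mb(n_streams: int, page_mb: float = 10.0) -> float:
    return 2 * n_streams * page_mb

print(worst_case_memory_mb(100))  # 100 streams x 10 MB pages -> 2000.0 MB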
