
[Bug]: BigQueryBatchFileLoads in python loses data when using WRITE_TRUNCATE #23306

Closed
steveniemitz opened this issue Sep 20, 2022 · 5 comments · Fixed by #24536

Comments

@steveniemitz (Contributor)

What happened?

When using WRITE_TRUNCATE, if the data being loaded is split into multiple temp tables, the copy jobs that merge the results together can each end up using WRITE_TRUNCATE, so only the last copy job "wins" and its output overwrites that of all the other jobs.
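To see why data is lost, here is a minimal simulation of BigQuery copy-job write dispositions (plain Python, not Beam or the BigQuery API; `run_copy_job` is a hypothetical helper):

```python
def run_copy_job(dest, src_rows, disposition):
    """Apply one copy job to the destination table, modeled as a list of rows."""
    if disposition == "WRITE_TRUNCATE":
        dest.clear()          # truncate drops all existing rows first
    dest.extend(src_rows)     # then the source table's rows are appended

temp_tables = [["a1", "a2"], ["b1"]]  # results split across two temp tables

# Buggy behavior: every copy job uses WRITE_TRUNCATE, so each job wipes
# out the previous job's output and only the last temp table survives.
dest = ["old"]
for rows in temp_tables:
    run_copy_job(dest, rows, "WRITE_TRUNCATE")
assert dest == ["b1"]  # rows a1, a2 were lost

# Intended behavior: truncate exactly once, then append the remaining jobs.
dest = ["old"]
for i, rows in enumerate(temp_tables):
    disposition = "WRITE_TRUNCATE" if i == 0 else "WRITE_APPEND"
    run_copy_job(dest, rows, disposition)
assert dest == ["a1", "a2", "b1"]  # all rows preserved, old data replaced
```

The intended behavior requires that whichever code issues the copy jobs can see all of them, which is exactly what breaks when they arrive in separate bundles.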

It looks like there was an attempt to handle this here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L545. However, that code assumes all inputs to TriggerCopyJobs arrive in the same bundle, and from what I observed this is not the case: the log lines from that step show different "work" fields for copy jobs 1 and 2.

There probably needs to be a GBK before this step in order to make sure that all copy jobs actually are executed in the same unit?
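One way the suggested GroupByKey could work, sketched in plain Python rather than as the actual Beam patch (the function name and tuple shapes here are hypothetical): group all copy-job requests for a destination table under one key so a single unit of work plans them, demoting every job after the first to WRITE_APPEND.

```python
from collections import defaultdict

def plan_copy_jobs(copy_requests):
    """copy_requests: iterable of (destination_table, temp_table) pairs.

    Returns (destination, temp_table, disposition) triples with exactly
    one WRITE_TRUNCATE per destination table."""
    grouped = defaultdict(list)  # stands in for a Beam GroupByKey on destination
    for dest, temp in copy_requests:
        grouped[dest].append(temp)

    plan = []
    for dest, temps in grouped.items():
        # The first copy job truncates; the rest append to the same table.
        for i, temp in enumerate(temps):
            disposition = "WRITE_TRUNCATE" if i == 0 else "WRITE_APPEND"
            plan.append((dest, temp, disposition))
    return plan

plan = plan_copy_jobs([("t", "tmp1"), ("t", "tmp2"), ("t", "tmp3")])
```

Because all requests for a destination are processed together, the "truncate exactly once" decision no longer depends on bundle boundaries.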

Issue Priority

Priority: 1

Issue Component

Component: io-py-gcp

@johnjcasey (Contributor)

@ahmedabu98 how familiar are you with this? It strikes me as similar to the other BQ changes you've done.

@ahmedabu98 (Contributor)

Yeah it's in the same context, I can take a stab at this one

@ahmedabu98 (Contributor)

.take-issue

@kennknowles (Member)

@ahmedabu98 are you still working on this?

@ahmedabu98 (Contributor)

This resurfaced in #24535; work is being done there.
