
[Bug]: BigQueryBatchFileLoads in python loses data when using WRITE_TRUNCATE #23306

Closed
steveniemitz opened this issue Sep 20, 2022 · 5 comments · Fixed by #24536

Comments

@steveniemitz (Contributor)

What happened?

When using WRITE_TRUNCATE, if the data being loaded is split into multiple temp tables, the copy jobs that merge the results together can each end up using WRITE_TRUNCATE, so only the last copy job "wins" and its output overwrites that of all the other jobs.
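To see why data is lost, here is a minimal simulation of BigQuery copy-job write dispositions (plain Python, not Beam or the BigQuery API; `run_copy_job` is a hypothetical helper):

```python
def run_copy_job(dest, src_rows, disposition):
    """Apply one copy job to the destination table, modeled as a list of rows."""
    if disposition == "WRITE_TRUNCATE":
        dest.clear()          # truncate drops all existing rows first
    dest.extend(src_rows)     # then the source table's rows are appended

temp_tables = [["a1", "a2"], ["b1"]]  # results split across two temp tables

# Buggy behavior: every copy job uses WRITE_TRUNCATE, so each job wipes
# out the previous job's output and only the last temp table survives.
dest = ["old"]
for rows in temp_tables:
    run_copy_job(dest, rows, "WRITE_TRUNCATE")
assert dest == ["b1"]  # rows a1, a2 were lost

# Intended behavior: truncate exactly once, then append the remaining jobs.
dest = ["old"]
for i, rows in enumerate(temp_tables):
    disposition = "WRITE_TRUNCATE" if i == 0 else "WRITE_APPEND"
    run_copy_job(dest, rows, disposition)
assert dest == ["a1", "a2", "b1"]  # all rows preserved, old data replaced
```

The intended behavior requires that whichever code issues the copy jobs can see all of them, which is exactly what breaks when they arrive in separate bundles.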

It looks like there was an attempt to handle this here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L545. However, that code assumes all inputs to TriggerCopyJobs arrive in the same bundle, and from what I observed this is not the case: the log lines from that step show different "work" fields for copy jobs 1 and 2.

There probably needs to be a GBK before this step in order to make sure that all copy jobs actually are executed in the same unit?
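One way the suggested GroupByKey could work, sketched in plain Python rather than as the actual Beam patch (the function name and tuple shapes here are hypothetical): group all copy-job requests for a destination table under one key so a single unit of work plans them, demoting every job after the first to WRITE_APPEND.

```python
from collections import defaultdict

def plan_copy_jobs(copy_requests):
    """copy_requests: iterable of (destination_table, temp_table) pairs.

    Returns (destination, temp_table, disposition) triples with exactly
    one WRITE_TRUNCATE per destination table."""
    grouped = defaultdict(list)  # stands in for a Beam GroupByKey on destination
    for dest, temp in copy_requests:
        grouped[dest].append(temp)

    plan = []
    for dest, temps in grouped.items():
        # The first copy job truncates; the rest append to the same table.
        for i, temp in enumerate(temps):
            disposition = "WRITE_TRUNCATE" if i == 0 else "WRITE_APPEND"
            plan.append((dest, temp, disposition))
    return plan

plan = plan_copy_jobs([("t", "tmp1"), ("t", "tmp2"), ("t", "tmp3")])
```

Because all requests for a destination are processed together, the "truncate exactly once" decision no longer depends on bundle boundaries.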

Issue Priority

Priority: 1

Issue Component

Component: io-py-gcp

@johnjcasey (Contributor)

@ahmedabu98 how familiar are you with this? It strikes me as similar to the other BQ changes you've done.

@ahmedabu98 (Contributor)

Yeah it's in the same context, I can take a stab at this one

@ahmedabu98 (Contributor)

.take-issue

@kennknowles (Member)

@ahmedabu98 are you still working on this?

@ahmedabu98 (Contributor)

This resurfaced in #24535; work is being done there.
