Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize insert_overwrite incremental strategy with WRITE_TRUNCATE / Partition copy #167

Merged
merged 2 commits into from
Oct 25, 2022

Conversation

github-christophe-oudar
Copy link
Contributor

@github-christophe-oudar github-christophe-oudar commented Apr 24, 2022

resolves #77

Description

  • Add a copy_partitions to partition_by object (default to false), throw an error if incremental_strategy != 'insert_overwrite'.
  • If insert_overwrite & copy_partitions, we replace the MERGE part by:
    • In static mode (partitions defined), we copy explicit partitions from the output in temp table
    • In dynamic mode, we retrieve the partitions from the output in temp table and copy all of them

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-bigquery next" section.

@cla-bot cla-bot bot added the cla:yes label Apr 24, 2022
@Kayrnt Kayrnt force-pushed the copy-partitions branch 4 times, most recently from 2274a2e to 75d90db Compare April 24, 2022 14:46
@github-christophe-oudar github-christophe-oudar marked this pull request as ready for review April 24, 2022 23:21
@Kayrnt Kayrnt force-pushed the copy-partitions branch 2 times, most recently from d21d712 to 4c1e485 Compare April 29, 2022 08:26
@jtcohen6 jtcohen6 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Jul 18, 2022
@@ -47,11 +47,16 @@ class PartitionConfig(dbtClassMixin):
data_type: str = "date"
granularity: str = "day"
range: Optional[Dict[str, Any]] = None
time_ingestion_partitioning: bool = False
copy_partitions: bool = False
Copy link
Contributor

@colin-rogers-dbt colin-rogers-dbt Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-christophe-oudar - Is there a use case for copy_partitions outside of time_ingestion_partitioning or for not using copy_partitions with time_ingestion_partitioning? Just wondering if we need another config parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely, it's also much faster than a merge so I'm going to use it on all my models whether it's time ingestion partitioning or column type partitioning! It really fits my use case because I'm usually processing models partition per partition but it might not fit everyone (though honestly, most people in my opinion can afford a temporary loss of consistency for such a speed gain 👍 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a use case for time_ingestion_partitioning = True and copy_partitions = False ? Should we allow users to do this?

@Fleid - I know you had concerns with adding more variables for users to set but I think we need to keep these separate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here some use cases:

  • time_ingestion_partitioning = True and copy_partitions = False => for atomic merge for time ingestion partitioned table (useful for people that can't afford potential inconsistency on multiple partitions)
  • time_ingestion_partitioning = False and copy_partitions = True =>
    for fast atomic insert overwrite on a single partition or fast non atomic on multiple partitions on column type partitioning (quite common at my company for small/medium tables)
  • time_ingestion_partitioning = False and copy_partitions = False => before the 2 PRs
  • time_ingestion_partitioning = True and copy_partitions = True => for fast atomic insert overwrite on a single partition or fast non atomic on multiple partitions (a common use case at my company for large tables as it's the best performing tables from what we tested - we might need to double check now)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great context! I'm going to do a follow up PR after we merge to update our documentation with this context.

Copy link
Contributor

@colin-rogers-dbt colin-rogers-dbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing some local testing, one question so far but overall it seems to function well. Hope to be able to move this and #136 forward this week

Also: apologies for the extreme delay on this, still getting ramped up but we do want to prioritize getting this merged post-Coalesce

CHANGELOG.md Outdated
@@ -9,6 +9,11 @@
- Use dbt.tests.adapter.basic in tests (new test framework) ([#135](https://github.com/dbt-labs/dbt-bigquery/issues/135), [#142](https://github.com/dbt-labs/dbt-bigquery/pull/142))
- Adding pre-commit and black formatter hooks ([#147](https://github.com/dbt-labs/dbt-bigquery/pull/147))
- Adding pre-commit code changes ([#148](https://github.com/dbt-labs/dbt-bigquery/pull/148))
- Add support for ingestion time partitioned table using incremental materialization ([#136](https://github.com/dbt-labs/dbt-bigquery/pull/136))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like you manually added a changelog entry but we're using changie? https://github.com/dbt-labs/dbt-bigquery/blob/main/CONTRIBUTING.md#updating-docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'll update, it was done before changie was introduced

@Kayrnt Kayrnt force-pushed the copy-partitions branch 2 times, most recently from 56352a8 to 333865f Compare October 20, 2022 21:10
@github-christophe-oudar
Copy link
Contributor Author

@colin-rogers-dbt I rebased it, I think you have put the label to trigger the integration tests so that we can check there's no regression there too.

@Kayrnt Kayrnt force-pushed the copy-partitions branch 8 times, most recently from 9ded8d9 to 35f0e87 Compare October 25, 2022 12:22
@colin-rogers-dbt
Copy link
Contributor

This integ test is failing due to a known issue with our testing. No action required from @Kayrnt

Copy link
Contributor

@colin-rogers-dbt colin-rogers-dbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ok to test ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering test all
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize insert_overwrite incremental strategy with WRITE_TRUNCATE / Partition copy
5 participants