
[CT-1521] [Bug] The static partitions configuration for incremental materializations doesn't work if it is a list that is declared via a statement block #6278

Closed
achilleas-stachtiaris opened this issue Nov 17, 2022 · 3 comments
Labels
bug Something isn't working user docs [docs.getdbt.com] Needs better documentation wontfix Not a bug or out of scope for dbt-core

Comments


achilleas-stachtiaris commented Nov 17, 2022

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Referencing the incremental materialization documentation for static partitions, the partitions configuration accepts a list. That list is iterated by the macro bq_copy_partitions(), and evaluated by the macro bq_insert_overwrite_sql(), which invokes bq_static_insert_overwrite_sql() if partitions is not none and partitions != [].

When this list is built in a macro and then supplied to a model's partitions configuration, bq_static_insert_overwrite_sql() is only invoked if the list is declared using in-line text within a {% set %} clause. If a statement block is used to construct the list (for example, dbt_utils.get_column_values(), which returns an iterable), bq_dynamic_insert_overwrite_sql() is invoked instead, because the partitions list is omitted 🤔. This happens even though the list evaluates identically to one defined using in-line text inside a {% set %} clause.
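The static-vs-dynamic dispatch can be modeled in a few lines (a hedged Python sketch; the real check lives in the bq_insert_overwrite_sql Jinja macro, not in Python):

```python
# Minimal Python model of the Jinja check
# `partitions is not none and partitions != []` that routes
# between the static and dynamic insert_overwrite strategies.
def choose_strategy(partitions):
    if partitions is not None and partitions != []:
        return "static"   # bq_static_insert_overwrite_sql
    return "dynamic"      # bq_dynamic_insert_overwrite_sql

# A list that is present at evaluation time selects the static path;
# a config that resolved to nothing falls back to the dynamic path.
print(choose_strategy(["'2022-11-17'", "'2022-11-16'"]))  # static
print(choose_strategy(None))                              # dynamic
```

The key point of the bug report is that both ways of building the list look equivalent in Jinja, yet only one of them survives into this check.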

See the following code / dummy data as examples:

Macro to construct the partitions list
If use_statement_block is True, the macro uses dbt_utils.get_column_values() to construct the list. Otherwise, the macro uses a list defined in-line via {% set %}.

{% macro pull_partitions_to_replace(use_statement_block=False) %}

	{% if use_statement_block %}
		
		{% set partitions_to_replace = dbt_utils.get_column_values(table=ref('partitions'), column='date') %}

	{% else %}

		{% set partitions_to_replace = ["'2022-11-17'", "'2022-11-16'", "'2022-11-15'"] %}

	{% endif %}

	{{ return(partitions_to_replace) }}

{% endmacro %}

Model defining one column to be queried in a statement block to construct a partitions list

{{ config(materialized='table') }}

select
    "'2022-11-17'" as date
union all
select
    "'2022-11-16'" as date
union all
select
    "'2022-11-15'"  as date

An incremental model
You will notice a few Jinja {{ }} evaluators; we use these to compare the two lists which contain our partitions.

Remember that bq_insert_overwrite_sql uses the check partitions is not none and partitions != [] to determine whether to invoke the static or the dynamic statement - this is very important.

Also, {{ partitions | join(', ') }} is used by bq_static_insert_overwrite_sql().

{{
config(
    materialized = 'incremental'
    , incremental_strategy = 'insert_overwrite'
    , partition_by = {'field' : 'date', 'data_type' : 'date', 'granularity': 'day'}
    , partitions = pull_partitions_to_replace(use_statement_block=False)
      )
}}

-- {{ pull_partitions_to_replace(use_statement_block=False) is iterable and pull_partitions_to_replace(use_statement_block=False) is not string }}
-- {{ pull_partitions_to_replace(use_statement_block=False) is not none and pull_partitions_to_replace(use_statement_block=False) != [] }}

-- {{ pull_partitions_to_replace(use_statement_block=True) is iterable and pull_partitions_to_replace(use_statement_block=True) is not string }}
-- {{ pull_partitions_to_replace(use_statement_block=True) is not none and pull_partitions_to_replace(use_statement_block=True) != [] }}

-- {{ pull_partitions_to_replace(use_statement_block=True) }}
-- {{ pull_partitions_to_replace(use_statement_block=False) }}

-- {{ pull_partitions_to_replace(use_statement_block=True) | join (', ') }}
-- {{ pull_partitions_to_replace(use_statement_block=False) | join (', ') }}

with
    data as (
        select
            date('2022-11-17') as date
        union all
        select
            date('2022-11-16') as date
        union all
        select
            date('2022-11-15') as date
    )

select
    *
from
    data

{% if is_incremental() %}

    where
        date in ('2022-11-17', '2022-11-16')

{% endif %}

Expected Behavior

By default, in the above incremental model, we call the macro with use_statement_block=False, which uses the list from the in-line {% set %} definition, and that works as expected 👍

  1. dbt run --select partitions
  2. dbt run --select incremental_model --full-refresh
  3. dbt run --select incremental_model

In target/run/../partitions.sql we can see the correctly compiled statement against BigQuery, which uses the partitions configuration to determine which partitions to replace.

when not matched by source
         and DBT_INTERNAL_DEST.date in (
              '2022-11-17', '2022-11-16', '2022-11-15'
          ) 
        then delete

    when not matched then insert
        (`date`)
    values
        (`date`)

I expected the same to occur with use_statement_block=True, since the Jinja evaluators we placed in SQL comments demonstrate that the two list objects in memory evaluate identically. Why would dbt not accept the partitions configuration? 🤔

-- True
-- True

-- True
-- True

-- ["'2022-11-17'", "'2022-11-16'", "'2022-11-15'"]
-- ["'2022-11-17'", "'2022-11-16'", "'2022-11-15'"]

-- '2022-11-17', '2022-11-16', '2022-11-15'
-- '2022-11-17', '2022-11-16', '2022-11-15'

🐛 The actual bug

When we set use_statement_block=True, the partitions configuration is omitted and bq_dynamic_insert_overwrite_sql() is invoked as a result. This means that dbt does not accept the list as a valid config value, and will recompute its own partitions merge strategy rather than using the partitions list supplied in the configuration.

❗ Looking at the source, I cannot understand why this is happening, and believe it is a bug.

I've looked through the incremental materialization and the insert_overwrite strategy source

Thanks a lot for taking the time to read through this! 😃

Steps To Reproduce

Given the code I shared above, you can reproduce this by:

dbt run --select partitions
dbt run --select incremental_model --full-refresh
dbt run --select incremental_model - with use_statement_block=False, which behaves as expected
dbt run --select incremental_model - with use_statement_block=True, which behaves unexpectedly

Relevant log output

No response

Environment

- OS: macOS Monterey 12.6
- Python: 3.7.13
- dbt: 1.0.3

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

@achilleas-stachtiaris achilleas-stachtiaris added bug Something isn't working triage labels Nov 17, 2022
@github-actions github-actions bot changed the title [Bug] The static partitions configuration for incremental materializations doesn't work if it is a list that is declared via a statement block [CT-1521] [Bug] The static partitions configuration for incremental materializations doesn't work if it is a list that is declared via a statement block Nov 17, 2022
@jtcohen6 jtcohen6 added wontfix Not a bug or out of scope for dbt-core user docs [docs.getdbt.com] Needs better documentation and removed triage labels Nov 17, 2022
@jtcohen6
Contributor

@achilleas-stachtiaris Thanks for the thorough write-up!

This is a general principle within dbt: All node definitions, dependencies, and configuration need to be resolved at parse time, before dbt has made any connections to the database. If we don't do a good job of stating that in the docs on configs today, we should! It's what enables dbt to statically and reliably understand the shape of the DAG, without database state as an input, and to support config: as a selection method.

I appreciate that this does limit certain use cases. A similar one that comes to mind: before the addition of merge_exclude_columns as a config in v1.3, folks were confused about why they couldn't set:

{{ config(
    merge_include_columns = dbt_utils.star(...)
) }}

That doesn't work because the star macro requires an introspective database query (compile time), and configs are all resolved at parse time.
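The parse-time limitation can be made concrete with a toy Python simulation (names like get_column_values are stand-ins for illustration; this is not dbt's internal code): at parse time no database connection exists, so an introspective macro has nothing to return, and the config captures that empty result.

```python
# Toy model of parse time vs. compile time (hypothetical names).
# At parse time there is no database connection, so an
# introspective macro cannot return real values.
def get_column_values(connection, table, column):
    if connection is None:            # parse time: no connection yet
        return None
    return connection[table][column]  # compile/run time: real query

fake_db = {"partitions": {"date": ["'2022-11-17'", "'2022-11-16'"]}}

parse_time_config = get_column_values(None, "partitions", "date")
run_time_values = get_column_values(fake_db, "partitions", "date")

print(parse_time_config)  # None -> the partitions config ends up "omitted"
print(run_time_values)    # the list the model author expected
```

This is why the in-line {% set %} list works (it needs no database) while the statement-block list silently becomes nothing in the config.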

We do enable a small subset of configurations to accept macros, which are then "late-rendered" at runtime, namely pre-hook and post-hook: https://docs.getdbt.com/docs/building-a-dbt-project/dont-nest-your-curlies#an-exception


To support this category of use cases, we'd need a more-general rule about what dbt can expect to know when. We've talked about "decorators" for macros that would indicate whether they're:

  • "Pure" / "immutable" macros: Static inputs + outputs, can be resolved at parse time and used for configuration — we could even look to support these in .yml files!
  • "Stable" / "introspective" macros: Runs an introspective query that returns database state, without changing that state. Will always return the same results, given same inputs (arguments + db state). These should be supported at compile time, to template out model SQL. For certain configurations, we could explicitly support "re-rendering" at compile time, to gain access to database state as an input.
  • "Dirty" / "volatile" / "state change" macros: Have side effects that change database state, e.g. creating or altering tables. If we do our jobs right, these should exist exclusively within materialization logic, and never within model logic; hence, it's an anti-pattern to see post hooks containing truncate or delete!

That's a much bigger idea, and worth further discussion. In the meantime, I'm going to close this one, as a known limitation of dbt configuration, and outside the scope of the built-in feature at play here ("static" partitions configured for insert_overwrite). I think you could work around this, with custom code, by reimplementing the other macros that feed into the insert_overwrite materialization, and perform a dynamic lookup at runtime.

@jtcohen6 jtcohen6 closed this as not planned Nov 17, 2022
@achilleas-stachtiaris
Author

@jtcohen6 Right back at you, thanks a lot for the prompt and informative reply! 😃

This makes sense. I do admit that the ability to declare macros for hooks in config() may have given me false indications about what's possible. The separation between parse and compile is very clear to me now with regards to this, so thanks again for that.

Using your advice and with a bit more research, I slightly repurposed the partitions configuration to act as an input to a macro I wrote that extends the functionality of bq_insert_overwrite(), in order to compute the partitions at compile time - this is basically:

{% if partitions is string %}

    {% set partitions = partitions_to_replace(the_source=partitions, return_config=True) %}

{% endif %}

Where partitions_to_replace() builds a list by accepting a source table and comparing its partitions to the partitions of {{ this }} (provided that we want to incrementally load new data from that source table into {{ this }}).

That way I can "dynamically parameterise" the "static" input, which enables incremental insert_overwrite models to be totally on auto-pilot.

Works well, so great outcome 🙌
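The pattern described above - passing a string through the config and expanding it into a list at materialization time - can be sketched in Python (hypothetical names; the author's real implementation is Jinja inside the materialization):

```python
# Sketch of the workaround: the config carries a *string* that is
# resolved into a partition list at materialization time, when a
# database lookup is possible. All names here are hypothetical.
def partitions_to_replace(source_table, db):
    """Stand-in for a compile-time lookup of a table's partitions."""
    return db[source_table]["date"]

def resolve_partitions(partitions, db):
    # Mirrors the Jinja guard: {% if partitions is string %} ... {% endif %}
    if isinstance(partitions, str):
        return partitions_to_replace(partitions, db)
    return partitions

fake_db = {"src": {"date": ["'2022-11-17'", "'2022-11-16'"]}}

# A plain string survives parse time unchanged, then is expanded
# into the real partition list inside the materialization.
print(resolve_partitions("src", fake_db))
# A list supplied directly is passed through untouched.
print(resolve_partitions(["'2022-11-15'"], fake_db))
```

The design choice is that the config value itself stays static (a string), satisfying dbt's parse-time rule, while the expensive lookup is deferred to a point where a connection exists.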

I love the idea you mentioned regarding Macro decorators, will try and see if I can contribute somehow 😃

Thanks! 🎉

@jtcohen6
Contributor

Amazing! Glad you were able to get this working for your use case.

I opened up a new discussion for the Big Idea around clearly defining macros that are "pure" versus ones with dynamic state / side effects: #6280
