Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CT 2196, CT2121 constraints column order #7161

Merged
merged 12 commits into from
Mar 19, 2023
Merged

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Mar 13, 2023

resolves #6975, #7064

Description

Modify create_table_as ddl to use the order specified in the model config "columns". Modify "assert_columns_equivalent" to not check column order.

Checklist

Change check of column definitions to be order aganostic.
@gshank gshank requested review from a team as code owners March 13, 2023 18:03
@cla-bot cla-bot bot added the cla:yes label Mar 13, 2023
@gshank gshank requested review from MichelleArk, emmyoop, peterallenwebb and mikealfare and removed request for nathaniel-may and stu-k March 13, 2023 18:03
@gshank gshank requested a review from a team as a code owner March 13, 2023 19:43
Copy link
Member

@emmyoop emmyoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just a matter of if you wanted to update the wording of the error to include something about the model having contracts enabled or not.

Comment on lines 2146 to 2151
msg = (
"Please ensure the name, data_type, and number of columns in your `yml` file "
"match the columns in your SQL file.\n"
f"Schema File Columns: {self.yaml_columns}\n"
f"SQL File Columns: {self.sql_columns}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you going to update this as related to #7147?

{#-- Column with name not found in yaml --#}
{%- do exceptions.raise_contract_error(yaml_columns, sql_columns) -%}
{%- endif -%}
{%- if sql_col.dtype != yaml_col[0].dtype -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously column types were compared based on their formatted representation, which could be data platform specific as implemented by the <adapter>__format_column macro.

For example, BigQuery's format_column implementation compares data_type values by column.data_type as opposed to column.dtype in order to make comparisons on nested data types.

Strictly comparing SQL and yml data_type values by dtype would allow the following contract to be accepted in BigQuery:

SELECT
STRUCT("test" AS name, [1.0,2.0] AS laps) as some_struct
models:
  - name: test_schema
    config:
      contract: true
    columns:
      - name: some_struct
        data_type: STRUCT<name FLOAT64, laps STRING> #wrong! but accepted because dtype == STRUCT for both SQL and schema.yml

One workaround would be to do this comparison using format_column instead, i.e: adapter.dispatch('format_column', 'dbt')(sql_col) != adapter.dispatch('format_column', 'dbt')(yaml_col[0]). This would also ensure the comparison and error messaging are using consistent logic. An alternative would be to implement a default__compare_data_types macro to enable adapter-specific implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at using format_column but the default implementation of that just uses the dtype. Is that different in the other adapters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually see any implementations of format_column in the adapters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I was on the wrong bigquery branch. Bigquery is the only adapter that implements it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we include the formatted column in the Column structures returned by "get_column_schema_from_query"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or concatenate the other parts of the column structure for comparison?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a lot of logic to put in a jinja template. Could we put this in a python function and then wrap that python function in this macro? I'm cutting corners, but this is what's in my head:

columns_spec_ddl.sql

{% macro assert_columns_equivalent(sql) %}

  {% set sql_schema = get_column_schema_from_query(sql) %}

  {% set model_contract = model_contract(model['columns']) %}

  {% do assert_schema_meets_contract(sql_schema, model_contract) %}

dbt/adapters/base/impl.py (for lack of a better spot)

# these are defined elsewhere, but look something like this
ModelContract = List[ColumnContract]
Schema = List[Column]


def model_contract(model) -> ModelContract:
    # I assume we have a way of creating a model contract from a `schema.yml` file
    return ModelContract(model)


def assert_schema_meets_contract(schema: Schema, model_contract: ModelContract)
    if len(schema) != len(model_contract):
        raise ContractError(msg)
    for schema_column, contract_column in zip(sorted(schema), sorted(model_contract)):
        try:
            assert schema_column.name == contract_column.name
            assert schema_column.dtype == contract_column.dtype
        except AssertionError:
            raise ContractError(msg)

I think the python version would be much easier to unit test.

"Contracts are enabled for this model. "
"Please ensure the name, data_type, and number of columns in your `yml` file "
"match the columns in your SQL file.\n"
f"Schema File Columns: {self.yaml_columns}\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to sort these both alphabetically (message only) so that it's easier for the user to spot the difference.

{#-- Column with name not found in yaml --#}
{%- do exceptions.raise_contract_error(yaml_columns, sql_columns) -%}
{%- endif -%}
{%- if sql_col.dtype != yaml_col[0].dtype -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a lot of logic to put in a jinja template. Could we put this in a python function and then wrap that python function in this macro? I'm cutting corners, but this is what's in my head:

columns_spec_ddl.sql

{% macro assert_columns_equivalent(sql) %}

  {% set sql_schema = get_column_schema_from_query(sql) %}

  {% set model_contract = model_contract(model['columns']) %}

  {% do assert_schema_meets_contract(sql_schema, model_contract) %}

dbt/adapters/base/impl.py (for lack of a better spot)

# these are defined elsewhere, but look something like this
ModelContract = List[ColumnContract]
Schema = List[Column]


def model_contract(model) -> ModelContract:
    # I assume we have a way of creating a model contract from a `schema.yml` file
    return ModelContract(model)


def assert_schema_meets_contract(schema: Schema, model_contract: ModelContract)
    if len(schema) != len(model_contract):
        raise ContractError(msg)
    for schema_column, contract_column in zip(sorted(schema), sorted(model_contract)):
        try:
            assert schema_column.name == contract_column.name
            assert schema_column.dtype == contract_column.dtype
        except AssertionError:
            raise ContractError(msg)

I think the python version would be much easier to unit test.

@gshank
Copy link
Contributor Author

gshank commented Mar 14, 2023

I don't see how a "ModelContract" would be different than model.columns. Also this code has the same issue that the earlier jinja code did in that it's only checking dtype, not the formatted column. Would we also have to implement a "format_column" method in the python adapters?

@MichelleArk
Copy link
Contributor

I don't see how a "ModelContract" would be different than model.columns.

Currently, these two are really the same thing but I could see a definition of ModelContract that includes model columns (names + data_types) as well as constraints, and potentially materializations. Basically any model attribute that could be modified in a way that would constitute a breaking change (#7065, #6869).

In Mike's proposal, assert_schema_meets_contract would just be comparing the model columns component of ModelContract to implement the pre-flight model contract checks necessary for the scope of this work. However, a ModelContract structure to represent the more general model contract could be useful as we implement state:modified checks to detect breaking changes to model contracts (#7065, #6869). I pushed up some prototyping I did on the state:modified check for column names + data_types and could imagine a ModelContract being instantiated here.

Also this code has the same issue that the earlier jinja code did in that it's only checking dtype, not the formatted column. Would we also have to implement a "format_column" method in the python adapters?

I agree with this - we'd need to add an overridable formatted_dtype property/method on the Column interface to provide a complete and user-friendly representation of the data_type. Note: we should not just use the existing data_type definitions for this comparison because it does some dbt-specific formatting, which would make the mismatched contract error message less reflective of what the user actually has to specify as a valid value for data_type in their schema.yml (more context here: #6986 (comment))

@gshank
Copy link
Contributor Author

gshank commented Mar 15, 2023

This all might be a good idea, but is totally out of scope for this ticket as it was defined. If you want us to close this one and wait until this other implementation is defined, let me know.

@MichelleArk
Copy link
Contributor

MichelleArk commented Mar 15, 2023

out of scope for this ticket as it was defined

100% agree! The scope for this issue was to update the behaviour of the contract checks + constraint ddl generation to be order-agnostic, not a broader rewrite of the contract check. I think if we decide to go down the route of introducing a more generalized ModelContract structure as part of thestate:modified issues (#7065, #6869), we can revisit making use of it to implement more of the contract validation in python. I can make an issue to keep track of the idea.

@@ -25,11 +25,26 @@

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking through what this might look like across all potential adapters, and if we need to add contract-related items in the future. With that context, I have the following questions:

  1. Do you think this could benefit from becoming a "dispatch" method? The goal of this would be to create a hard divide between contracted models and non-contracted models.
  2. Do we need to include the column select statement in default__get_select_subquery if we already validated get_assert_columns_equivalent? We validate that the number of columns are the same, and the names are the same. So I think the subquery already limits to just the columns that we want.
  3. It looks like get_assert_columns_equivalent might have been renamed assert_columns_equivalent.

With my assumptions (not necessarily true of course):

{% macro default__create_table_as(temporary, relation, sql) -%}
  {% if config.get('contract', False) %}
    {{ default__create_table_as_with_contract(temporary, relation, sql }}
  {% else %}
    {{ default__create_table_as_without_contract(temporary, relation, sql }}
  {% endif %}
{% endmacro %}

{% macro default__create_table_as_with_contract(temporary, relation, sql) %}
  {{ get_assert_columns_equivalent(sql) }}
  
  create {% if temporary: -%}temporary{%- endif %} table
    {{ relation.include(database=(not temporary), schema=(not temporary)) }}
    {{ get_columns_spec_ddl() }}
  as ({{ sql }})

{% endmacro %}

{% macro default__create_table_as_without_contract(temporary, relation, sql) %}
  create {% if temporary: -%}temporary{%- endif %} table
    {{ relation.include(database=(not temporary), schema=(not temporary)) }}
  as ({{ sql }})
{% endmacro %}

This would maintain backwards compatibility because we're keeping the macro create_table_as, which is what would have been overridden. And if we need to update only one of these in the future, it isolates the updates, instead of impacting all "create table" workflows. I'm open to feedback and I'm just trying to communicate a thought here. Please let me know what you think of these recommendations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the get_select_subquery in order to put the columns in the right order. We removed validating the order because we added the subquery. As far as splitting out the macro into two, probably @jtcohen6 and @MichelleArk should weigh in on that.

@gshank gshank merged commit a203fe8 into main Mar 19, 2023
@gshank gshank deleted the ct-2196-constraints_column_order branch March 19, 2023 23:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2121] Make model contracts agnostic to ordering
4 participants