Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for Explicit Schema Definitions (DDLs) when Creating a Table / Model #2191

Closed
ericxiao251 opened this issue Mar 10, 2020 · 11 comments
Closed
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@ericxiao251
Copy link

Describe the feature

For a given DBT Model, we would like the ability to specify the column names, type, description and nullability explicitly. We are using BigQuery and this is possible within a CTAS query.

We want this feature mainly so that we can prevent creating any tables with columns that should not be null that have null values in it. We do not want to do a check afterwards with a dbt test, as downstream datasets /reports can potentially be using "corrupted" data.

i.e.

CREATE OR REPLACE TABLE
  `dbt_examples`.`my_first_dbt_model`
  -- table / model DDL starts here
  ( 
    x INT64 NOT NULL,
    y INT64
  )
  -- table / model DDL ends here
  OPTIONS() AS (
  SELECT
    1,
    Null
);

reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_examples

Describe alternatives you've considered

Adding a nullability check in a model's test, so that dbt test can do this check, but there are a lot of downsides to doing it this way:

  • In BigQuery, these types of checks are expensive. So to do this check on every column of table post creation can get expensive, especially at scale when you have n (tables) * m (columns).
  • We do not want to expose datasets to downstream models / reporting layer if a null value appears if it's not supposed to appear. We consider this "corrupted" data and that would produce incorrect results downstream.

Additional context

We mainly work with BigQuery and we know that this is possible with BigQuery, not sure about the other data warehouses.

I think it would just be overall beneficial to have the check done at creation time, rather than post creation time.

This would give more context to users in the warehouse, when they look at the table schema, ie. what is this column? is this column nullable? etc.

Who will this benefit?

Anyone using BigQuery, this would reduce their warehousing costs significantly.

@ericxiao251 ericxiao251 added enhancement New feature or request triage labels Mar 10, 2020
@drewbanin drewbanin removed enhancement New feature or request triage labels Mar 18, 2020
@drewbanin
Copy link
Contributor

Thanks for opening this thoughtful issue @ericxiao251 -- see also a previous discussion over here: #1438

I don't have any problem at all with this on BigQuery and Snowflake. I don't think this should be the default though. We also have an issue for adding a "strict" config to schema.yml files: #1570 . Maybe there's some sensible interop between these two features? Maybe something like:

# schema.yml

models:
 - name: my_model
    ddl_specifies_schema: true
    columns: [ ... ]

I think ddl_specifies_schema is a terrible name, but I don't have any good ideas at present. What kind of interface are you imagining for specifying this config?

@drewbanin drewbanin added enhancement New feature or request discussion labels Mar 18, 2020
@ericxiao251
Copy link
Author

We're still very new to DBT, but my initial thoughts were to enable this option in a new version, version: 3 of the schemas.yml, but your suggestion provides a simple and easy to use solution!

looking at #1570, we would probably enforce this if ddl_specifies_schema: true?

@spcvalente
Copy link

hi! is there any development on this? thanks

@smomen
Copy link

smomen commented Jul 29, 2022

This would be a positive development for our team. Some positives:

  • easier to migrate from legacy systems where data models are maintained in create table statements (or sqlalchemy models or equivalent)
  • single source-of-truth for a final table schema, which is nice encapsulation in cases where many tables are being union-ed together into a final materialization, and we'd have to build a casting/type enforcement model. This model becomes a nuisance if an alternative like table definitions is available?

Can take advantage of some features offered by table definitions:

  • faster-failure on schema failures - rather than having to rely on "a column is not null" dbt test, specifying that a column is not nullable in a table definition results in earlier failure which can be desirable.
  • specifying default values

Now, as an alternative for those looking to utilize create table statements in their current dbt project, we came up with a hack:

  • specify your model as a view
  • add a post-hook e.g. insert_into_table(this) that does a create table like... some predefined but empty table object with your desired table definition, and insert from the view into that table, doing the name swapping/gymanstics to ensure the final table has the right name.
  • alternative, the post-hook can specify your create table statement and use that to create the table and insert the view results into the table; like previous solution, name swapping required to ensure the right outcome.

some pseudocode:

{% macro insert_into_predefined_model(this, stage_prefix) %}

    {% set entity = source('predefined', this.name|replace(stage_prefix, '')) %}

    {% set sql %}
        create or replace transient table {{ temp_target }} as ( select * from {{ entity }} limit 0 );
        {% for col in additional_fields %}
            {% if col.name not in expected_fields %}
                {{ exceptions.raise_compiler_error("When compiling " ~ this ~ "... found an unexpected field: " ~ col) }}
            {% endif %}
            alter table {{ temp_target }} add column  {{ col.name }} {{ col.data_type }};
        {% endfor %}

        insert into {{ temp_target }} ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop view if exists {{ this }};
        alter table {{ temp_target }} rename to {{ this }}
    {% endset %}

    {{ return(sql) }}
{% endmacro %}

or



{% macro insert_into_with_my_table_defn(this) %}
    {% set create_table_sql %}
        CREATE OR REPLACE TABLE {{ this.schema }}.blah (
            a                                    number(38,0),
            b                                   number(38,0)
...
     )
    {% endset %}
    {% do run_query(create_table_sql) %}
    {% set insert_stmt %}
            insert into {{ this.schema }}.BLAH ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
            drop table if exists {{ this }};
            alter table {{ this.schema }}.BLAH rename to {{ this }};
    {% endset %}
    {% do run_query(insert_stmt) %}

{% endmacro %}

@mmiyahara
Copy link

I built a working example in BigQuery that @smomen suggested, so let me share it.

  • schema.yml

    version: 2
    
    models:
      - name: my_model
        columns:
          - name: x
            description: "x column"
            meta:
              type: "INTEGER"
              mode: "REQUIRED"
          - name: y
            description: "y column"
            meta:
              type: "INTEGER"
  • my_model.sql

    {{
        config(
            materialized='table',
            post_hook="{{create_table_with_table_schema()}}"
        )
    }}
    
    SELECT
        1 AS x,
        2 AS y
  • create_table_with_table_schema.sql

    {% macro create_table_with_table_schema() %}
      {% set col_obj = model.columns %}
      {% set table_name = model.name %}
      {% set tmp_table_name = "{}_tmp".format(table_name) %}
    
      {%- set sql %}
        create or replace table {{this.schema}}.{{tmp_table_name}}
        (
          {%- for col_name in col_obj %}
            {%- set type = col_obj[col_name].meta.type %}
            {%- set mode = col_obj[col_name].meta.mode %}
            {%- set description = col_obj[col_name].description %}
    
            {%- set col_definition %}
              {{col_name}} {{type}} {{"not null" if mode == "REQUIRED"}} {{'options(description="{}")'.format(description) if description}}
            {%- endset %}
    
            {%- if loop.last %}
              {{col_definition}}
            {%- else %}
              {{col_definition}},
            {%- endif %}
          {%- endfor %}
        );
        insert into {{ this.schema }}.{{ tmp_table_name }} ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop table if exists {{ this }};
        alter table {{ this.schema }}.{{ tmp_table_name }} rename to {{ table_name }};
      {% endset %}
    
      {% do run_query(sql) %}
    {% endmacro %}

screenshot 2022-08-21 17 12 55
screenshot 2022-08-21 17 13 11

  • The column type and mode are specified in the meta property.
  • Since I noticed that the second ... in alter table ... rename to ...; only accepts table (not project_id.database.table) in BigQuery, I changed alter table {{ temp_target }} rename to {{ this }} to alter table {{ this.schema }}.{{ tmp_table_name }} rename to {{ table_name }};.

@smomen
Copy link

smomen commented Dec 15, 2022

Friends, worth keeping an eye on #6079 which might satisfy this issue mostly. You'll get not-null enforcement at execution time. @mmiyahara what do you think?

@mmiyahara
Copy link

@smomen
Thank you for letting me know. I agree that #6079 might solve this issue. I'll keep an eye on it.

@awilliamsOM1
Copy link

Is this issue solved by #6271 ? Model contracts are a new feature that dbt-labs has been working on.

@jtcohen6
Copy link
Contributor

Short answer: yes!

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Sep 10, 2023
@github-actions
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

7 participants