
[CT-1045] [Feature] Provide option to allow tests to be ran before a model's data is updated #5660

Closed
adamcunnington-mlg opened this issue Aug 16, 2022 · 3 comments
Labels
dbt tests Issues related to built-in dbt testing functionality

Comments


adamcunnington-mlg commented Aug 16, 2022

Describe the feature

Context

Currently, DBT tests run after a model is built.
If a test fails, some characteristic of the data is unexpected - but by then it's too late. That "bad" data is already in the model and impacting downstream models - the integrity of the data is broken, and this can flow through to other models where crazy things might happen (row explosion, uh oh) and to use cases (e.g. real-time dashboards) that suddenly show the wrong data.

There is no straightforward solution to this problem. On Snowflake, there is a neat way of achieving blue/green (although I think it's a slight misuse of the term, because we're talking about new data, not [necessarily] new software) because Snowflake supports atomic table swaps. Others have implemented more complex workarounds, such as managing separate staging and production environments - orchestrated either via Airflow, or perhaps achieved in DBT via logical layers with different run commands. Calogica have a nice write-up about how they use DBT to achieve this.
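
For concreteness, here is a minimal sketch of the swap-based approach on Snowflake - this is not something DBT does natively today, and the table names are hypothetical:

```sql
-- Build the new data into a staging table first (the model SQL goes here)
CREATE OR REPLACE TABLE analytics_staging.orders AS
SELECT * FROM raw.orders;

-- ...run tests against analytics_staging.orders...

-- If (and only if) the tests pass, promote the staging table atomically
ALTER TABLE analytics.orders SWAP WITH analytics_staging.orders;
```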

My conjecture, though, is that this is a really core need that should be possible, natively, within DBT.

Acknowledgements

I am aware that there are use cases where you don't want to prevent "bad" data from flowing through in the event of a test failure - but I do believe it should be an option, and probably the default behaviour (setting aside, for one moment, compromises that might be made when rolling out such a feature so as not to cause a breaking change).

Also, it's worth calling out that tests actually serve two purposes:

  1. Validating a change in model configuration / SQL has not caused regression
  2. Validating that new data meets the expected profile

These are arguably quite different things. The former is more like a conventional software test: we'd expect to run it before deploying our updated artifacts (in this case, updated model code), which we do via a typical DBT dev workflow. The latter is a runtime consideration relating to the assertion of our data's profile. This is a very different thing, and it's why it's so important to be able to pass the tests before updating the model's data (or at least have the option to do so). The first is a solved problem. The second is not.

Discussion

DBT makes decisions about what should be abstracted behind a common layer and what should be adapter-specific. It feels to me like the ability, and the user-facing interface, to have tests run (logically) before a model is updated should be adapter-agnostic - but I appreciate this is a totally non-trivial problem to solve. Perhaps adapter-specific implementations could be a starting point, and this could be homogenised later.

I'm also aware, at the other end of the spectrum, that if you are going to consider running tests "transactionally" before a model is updated, what about a model that depends on another model? You may wish to update both or neither. Sure! But one step at a time.

I imagine a world where there is a per-adapter implementation, perhaps hidden behind a common model property which controls whether tests run before or after.

Specifically, and selfishly, I want this capability in BigQuery. BigQuery is quite limited in terms of the tools it gives us to help, but multi-statement transactions have been around a short while and are due to move from preview to GA in the not-too-distant future. I know transactions are not a silver-bullet solution to everything, but maybe there's at least some mileage in exploring the feasibility of updating the model and then running all model- and column-level tests within the same transaction, rolling back if any fail (maybe the failure behaviour is a config that belongs to the tests: array items rather than at the model level)?
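
As a rough feasibility sketch only - not existing DBT behaviour. Since BigQuery disallows DDL on permanent tables inside a transaction, this version stages the new results in a temp table and applies them via DML; all table and column names are hypothetical:

```sql
BEGIN TRANSACTION;

-- Stage the new model results in a temp table (BigQuery allows temp-table
-- DDL inside transactions, but not DDL on permanent tables).
CREATE TEMP TABLE orders_new AS
SELECT * FROM staging.stg_orders;  -- model SQL would go here

-- A not_null-style test: if it finds bad rows, roll back and abort the script.
IF EXISTS (SELECT 1 FROM orders_new WHERE order_id IS NULL) THEN
  ROLLBACK TRANSACTION;
  RAISE USING MESSAGE = 'not_null test failed on order_id; model not updated';
END IF;

-- Only reached if the test passed: replace the data via DML.
DELETE FROM analytics.orders WHERE TRUE;
INSERT INTO analytics.orders SELECT * FROM orders_new;

COMMIT TRANSACTION;
```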

Keen to start a discussion on this!

Describe alternatives you've considered

Managing a custom blue/green approach.

Who will this benefit?

A lot of people. I don't have any hard numbers to back this up, but I suspect that the majority of DBT users would not expect a model to be updated if tests fail. I imagine some of those users implement workarounds which cost development time, processing resources, and opportunity cost in complex support routines. I imagine that others don't do anything in particular but just set expectations with users (directly or otherwise) that when data breaks, it'll be fixed quickly - but it will be broken for a period of time. Most data analytics still isn't mission-critical, after all (despite what we all say).

Are you interested in contributing this feature?

Perhaps! Certainly dev resources within my team.

Anything else?

https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions

@adamcunnington-mlg adamcunnington-mlg added enhancement New feature or request triage labels Aug 16, 2022
@github-actions github-actions bot changed the title [Feature] Provide option to allow [CT-1045] [Feature] Provide option to allow Aug 16, 2022
@adamcunnington-mlg adamcunnington-mlg changed the title [CT-1045] [Feature] Provide option to allow [CT-1045] [Feature] Provide option to allow tests to be ran before a model is updated Aug 16, 2022
@adamcunnington-mlg adamcunnington-mlg changed the title [CT-1045] [Feature] Provide option to allow tests to be ran before a model is updated [CT-1045] [Feature] Provide option to allow tests to be ran before a model's data is updated Aug 16, 2022
@rileyschack

+1 for this ability. I realize it may not be the easiest implementation. My org is starting to experiment with table/dataset cloning as a means to work with dbt's current functionality. We're even contemplating building a clone materialization that would only run if all tests pass (using dbt build).
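
Roughly, the promotion step of such a clone materialization might look like this on BigQuery (names hypothetical; table clones were still rolling out at the time):

```sql
-- 1. Build the model into a staging dataset (e.g. via dbt build against a
--    staging target) and run its tests there.
-- 2. Promote the verified table with a zero-copy clone. Drop-and-clone is
--    shown for simplicity; a production version would need to handle this
--    window more carefully.
DROP TABLE IF EXISTS analytics.orders;

CREATE TABLE analytics.orders
CLONE analytics_staging.orders;
```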

@jtcohen6 jtcohen6 self-assigned this Aug 17, 2022
@dbeatty10

#1054 has an interesting discussion on this topic.

@jtcohen6 jtcohen6 added dbt tests Issues related to built-in dbt testing functionality and removed enhancement New feature or request triage labels Aug 20, 2022
@jtcohen6

@adamcunnington-mlg This is a tremendous start to a discussion! You've compiled some great research here. In fact, I am going to convert this issue to a discussion.

I also agree with @dbeatty10 that #1054 offers some great historical context on the ideas we've had in the past, as well as the tripwires we've come across.

I appreciate the problem you're raising—when bad data's in the production pipeline, it's in the production pipeline—and I agree it's a very common one. There is a diversity of tactics folks use to solve it today. The two most common I see:

  1. Define tests on sources, to validate that new data meets the expected profile. In the unified DAG of dbt build, those tests will run before any models depending on those sources are built, and those models will skip if the tests fail.
  2. Use a "write-audit-publish" or "blue/green" pattern to run, test, and only then deploy models to production. You can do all three steps once per model, although the pattern I see more often is to swap an entire schema/database from "staging" to "production" when it's ready. This also avoids the risk of downstream queriers seeing inconsistent data if one "final" model finishes building earlier than another. (A sketch of the "publish" step follows below.)
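
On Snowflake, for instance, the "publish" step of that schema-level pattern can be a single atomic swap (schema names hypothetical):

```sql
-- Everything was built and tested in analytics_staging;
-- publish it all at once, atomically.
ALTER SCHEMA analytics SWAP WITH analytics_staging;
```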

While not perfect solutions, those two constructs have served many teams quite well. They care more about the holistic deployment of an entire DAG than about treating each model as its own microcosm.

I do think that dbt could do more to make the write-audit-publish / "blue-green" pattern more readily available. It is certainly easiest on Snowflake, due to its support for zero-copy cloning, though it still requires some custom macros + operations to string together. BigQuery also supports zero-cost copying, for tables and datasets, and (I understand) is rolling out capabilities around table cloning (for discussion in dbt-labs/dbt-bigquery#270). Once there is a critical mass of support among a few of the most popular adapters, I'd feel more comfortable with the idea of building out an adapter-agnostic abstraction within dbt.


Each model to its own

There are three patterns I could see to support this on a model-by-model basis, all of which are much more case-by-case:

  1. Tests within transactions: On data platforms that support transactions, run tests before the commit, and roll back if they fail. This is totally adapter-specific: Redshift supports transactions completely, Spark/Databricks not at all, Snowflake in very odd ways, and BigQuery only very recently. Within dbt, it would require us to rethink tests as a property of a model, run during that model's materialization, rather than as their own node types within the DAG. (How to treat tests that select from multiple models...?) I'm not dismissing this idea completely, just noting that results will vary significantly here, and I'd want to take account of the several adapter/materialization permutations before proceeding.
  2. Tests within materializations: For certain materializations on certain adapters (namely incremental), which produce a temporary table before upserting/merging/overwriting into the existing result set, test the temporary results before upserting/merging/overwriting. This is an issue that Niall just opened, and to which I owe a response, though his motivation is more about the performance of the test queries than rolling back if the tests fail: [CT-793] [Feature] Testable _tmp tables in incremental models #5427. The appeal of this approach is that it doesn't strictly require transactions, because the temp table can be thrown out without a formal rollback—no harm, no foul—but it does get into the implementation details of specific materializations on specific adapters, and it raises the same questions about how tests need to be defined in the manifest and run in the DAG.
  3. Database constraints: Historically, analytical databases have done something very ugly with constraints: they allow them, they use them in query optimization, but they do not enforce them. So, despite adding a unique constraint to a column, it may well contain dupes—and the database's optimizer will act as if it doesn't, risking incorrect results! For that reason, dbt has really, really discouraged the use of database constraints in the past. Still, there is a pattern of cross-applying dbt tests as database constraints for detection by other tools (https://hub.getdbt.com/snowflake-labs/dbt_constraints/latest/). On at least one data platform (Databricks), constraints are fully enforced at table creation/update time (Support Delta constraints databricks/dbt-databricks#71). That feels like a pretty good pattern where it's available! (See the sketch after this list.)
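
For illustration, enforced constraints on Databricks/Delta look roughly like this (table and column names hypothetical):

```sql
-- Delta enforces both of these at write time: an INSERT/MERGE that
-- violates them fails, rather than landing bad data in the table.
ALTER TABLE analytics.orders ALTER COLUMN order_id SET NOT NULL;

ALTER TABLE analytics.orders
  ADD CONSTRAINT order_total_positive CHECK (order_total >= 0);
```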

I agree that, in the ideal case, there would be a common abstraction in dbt to identify which tests are worth rolling back for—with adapter-specific implementations behind the scenes—just as dbt build has allowed users to say that all error-severity tests are worth stopping the DAG for. I'm also wary of building an abstraction that risks leaking beyond its utility, because of how different these data platforms really are behind their shiny SQL-y exteriors.

Let's keep the discussion going :)

@jtcohen6 jtcohen6 removed their assignment Aug 20, 2022
@dbt-labs dbt-labs locked and limited conversation to collaborators Aug 20, 2022
@jtcohen6 jtcohen6 converted this issue into discussion #5687 Aug 20, 2022
