This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[CT-1045] [Feature] Provide option to allow tests to be run before a model's data is updated #5660
Comments
+1 for this ability. I realize it may not be the easiest implementation. My org is starting to experiment with table/dataset cloning as a means to work within dbt's current functionality. We're even contemplating building a clone materialization that would only run if all tests pass.
#1054 has an interesting discussion on this topic.
@adamcunnington-mlg This is a tremendous start to a discussion! You've compiled some great research here. I am going to convert this to a discussion, as a matter of fact. I also agree with @dbeatty10 that #1054 offers some great historical context on the ideas we've had in the past, as well as the tripwires we've come across. I appreciate the problem you're raising—when bad data's in the production pipeline, it's in the production pipeline—and I agree it's a very common one. Folks use a diversity of tactics to solve it today. The two most common I see:
While not perfect solutions, those two constructs have served many teams quite well. They care more about the holistic deployment of an entire DAG, rather than treating each model as its own microcosm. I do think that dbt could do more to make the write-audit-publish / "blue-green" pattern more readily available. It is certainly easiest on Snowflake, thanks to its support for zero-copy cloning, though it still requires some custom macros + operations to string together. BigQuery also supports zero-cost copying for tables and datasets, and (I understand) is rolling out capabilities around table cloning (for discussion in dbt-labs/dbt-bigquery#270). Once there is a critical mass of support among a few of the most popular adapters, I'd feel more comfortable with the idea of building out an adapter-agnostic abstraction within dbt.
Each model to its own
There are three patterns I could see to support this on a model-by-model basis, all of which are much more case-by-case:
I agree that, in the ideal case, there would be a common abstraction in dbt to identify which tests are worth rolling back for—with adapter-specific implementations behind the scenes. Let's keep the discussion going :)
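The write-audit-publish pattern described above can be sketched as a custom dbt macro invoked via run-operation. This is only an illustration of the "custom macros + operations" mentioned here, not an existing dbt feature; the macro name and schema names are invented, and it assumes the model has already been built and tested in a separate audit schema on Snowflake:

```sql
-- macros/publish_if_tests_pass.sql (hypothetical macro and schema names)
-- Sketch of a write-audit-publish step on Snowflake. Run after
-- `dbt build` has succeeded against the audit schema, e.g.:
--   dbt run-operation publish_if_tests_pass --args '{model_name: my_model}'
{% macro publish_if_tests_pass(model_name) %}
  {% set audit = 'analytics_audit.' ~ model_name %}
  {% set prod  = 'analytics.' ~ model_name %}
  {# Snowflake's SWAP WITH atomically exchanges the two tables,
     so production only ever sees fully tested data. #}
  {% do run_query('ALTER TABLE ' ~ prod ~ ' SWAP WITH ' ~ audit) %}
{% endmacro %}
```

Because the swap is metadata-only, the publish step is near-instant regardless of table size.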
Describe the feature
Context
Currently, DBT tests run after a model is built.
If a test fails, there is some characteristic of the data that is unexpected - but it's too late. That "bad" data is in the model and is impacting downstream models - the integrity of the data is broken and this could flow through to other models where crazy things might happen (row explosion uh oh) and use cases (i.e. real-time dashboards) suddenly show the wrong data.
There is no straightforward solution to this problem. On Snowflake, there is a neat way of achieving blue/green (although I think it's a slight misuse of the term blue/green, because we're talking about new data, not [necessarily] new software) because Snowflake supports table swaps. Others have implemented more complex workarounds, such as managing separate staging and production environments - orchestrated either via Airflow, or perhaps achieved in DBT via logical layers with different run commands. Calogica have a nice write-up about how they use DBT to achieve this here.
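The Snowflake approach referenced above boils down to building the new data into a staging table, testing it there, and only then swapping it into place. A minimal sketch, with illustrative table names standing in for a real model:

```sql
-- 1. Build the new data into a staging table, out of sight of consumers
CREATE OR REPLACE TABLE analytics.orders_staging AS
SELECT * FROM raw.orders;  -- stand-in for the model's real SQL

-- 2. Run tests against analytics.orders_staging here (e.g. dbt test)

-- 3. Only if all tests pass: atomically swap staging into production
ALTER TABLE analytics.orders SWAP WITH analytics.orders_staging;
```

If any test fails at step 2, the swap simply never runs and production keeps serving the last known-good data.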
My conjecture though is that this is a really core need that should be possible, natively, within DBT.
Acknowledgements
I am aware that there are use cases where you don't want to prevent "bad" data from flowing through in the event of a test failure - but I do believe it should be an option, and probably the default behaviour (ignoring for a moment the compromises that might be made when rolling out such a feature so as not to cause a breaking change).
Also, it's worth calling out that tests actually serve two purposes:
1. Validating the logic of a model's code during development.
2. Asserting, at runtime, that the data itself matches its expected profile.
These are arguably quite different things. The former is more like a conventional software test, and we'd expect to run it before deploying our updated artifacts (in this case, updated model code) - which we do via a typical DBT dev workflow. The second is a runtime consideration relating to the assertion of our data's profile. This is a very different thing, and it's why it's so important to be able to pass the tests before updating the model's data (or at least have the option to do this). The first is a solved problem. The second is not.
Discussion
DBT makes decisions about what should be abstracted behind a common layer and what should be adapter-specific. It feels to me like the ability, and user-facing interface, for having tests run (logically) before a model is updated, should be adapter-agnostic - but I appreciate this is a totally non-trivial problem to solve. Perhaps adapter-specific could be a starting point and this could be homogenised later.
I'm also aware, at the other end of the spectrum, that if you are going to consider running tests "transactionally" before a model is updated, what about a model that is dependent on another model? You may wish to update both or none at all. Sure! But one step at a time.
I imagine a world where there is a per-adapter implementation but perhaps hidden behind a common model property which controls whether tests run before or after.
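A hypothetical shape for that common model property might look like the following schema.yml fragment. To be clear, the `tests_run` key is invented purely for illustration - no such config exists in dbt today:

```yaml
# models/schema.yml -- hypothetical syntax, not a real dbt config
models:
  - name: my_model
    config:
      # Invented flag: run this model's tests before publishing its data,
      # abandoning the update if any test fails. The adapter decides how
      # (transaction rollback, clone-and-swap, etc.).
      tests_run: before_update   # vs. today's behaviour: after_update
```

The value of a property like this is that each adapter could implement it with whatever primitive it has available, while the user-facing interface stays identical.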
Specifically, and selfishly, I want this capability in BigQuery. BigQuery is quite limited in terms of the tools it gives us to help, but multi-statement transactions have been around a short while and are due to move from preview to GA in the not-too-distant future. I know transactions are not a silver-bullet solution to everything, but maybe there's at least some mileage in exploring the feasibility of updating the model and then running all model- and column-level tests within the same transaction, rolling back if any fail (maybe the failure behaviour is a config that belongs to the tests: array items rather than at the model level)?
Keen to start a discussion on this!
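That transactional idea could be sketched in BigQuery's multi-statement SQL roughly as follows. Table and column names are invented, and note one real constraint: BigQuery transactions allow DML, but not DDL on permanent tables, so the model update here is expressed as DELETE + INSERT rather than CREATE OR REPLACE TABLE:

```sql
BEGIN TRANSACTION;

-- Rebuild the model's data with DML (full-refresh style)
DELETE FROM analytics.my_model WHERE TRUE;
INSERT INTO analytics.my_model (id, amount)
SELECT id, amount FROM analytics.stg_orders;

-- Inline equivalent of a not_null test on id; undo the update if it fails
IF (SELECT COUNT(*) FROM analytics.my_model WHERE id IS NULL) > 0 THEN
  ROLLBACK TRANSACTION;
  RAISE USING MESSAGE = 'not_null test failed on my_model.id; update rolled back';
END IF;

COMMIT TRANSACTION;
```

If the test condition fires, the ROLLBACK restores the previous data and the RAISE surfaces the failure to the caller; downstream consumers never see the bad rows.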
Describe alternatives you've considered
Managing a custom blue/green approach.
Who will this benefit?
A lot of people. I don't have any hard numbers to back this up, but I suspect that the majority of DBT users would not expect a model to be updated if tests fail. I imagine some of those users implement workarounds that cost development time and processing resources, plus the opportunity cost of complex support routines. I imagine that others don't do anything in particular but just set the expectation with users (directly or otherwise) that when data breaks, it'll be fixed quickly - but it will be broken for a period of time. Most data analytics still isn't mission critical, after all (despite what we all say).
Are you interested in contributing this feature?
Perhaps! Certainly dev resources within my team.
Anything else?
https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions