
Run tests after model creation, persisting updates only if the tests pass #1054

Closed
drewbanin opened this issue Oct 12, 2018 · 18 comments
Labels
stale Issues that have gone stale

Comments

@drewbanin
Contributor

Feature

Feature description

There have been a bunch of suggestions in Slack to run tests during model creation, and only persist changes if the tests pass. Here is a sampling of such suggestions:

jdub:

It’d be interesting if dbt could run tests within a transformation transaction, and bail out of the transaction if the tests fail. [...] Simple things like the existing unique and null checks, but also more interesting tests like, “Did this transformation produce data within correct / sensible boundaries?”

Josh Temple:

Or create data in a staging environment, test it, and copy to production only if tests pass. That way if the tests don't pass, you have data you can reference and explore to understand why.

Leon:

I think the data lake paradigm for ETL is the ideal that we personally want to move to -> create everything in staging, run tests, update catalog conditionally on things passing

tmurphy:

The way we’re solving this is using zero-copy clones on Snowflake. Each merge request gets its own clone, and only once dbt runs successfully and passes do we merge the code in to be run on production.

Who will this benefit?

A paradigm like this would help ensure that errant transformations are caught before they are deployed to production.

Things to think about:

  • If the test fails for a model, should dbt revert the update and then SKIP all of the dependent models?
  • Should this rollback paradigm happen for every test? Or just certain designated tests?
  • How should this work for things like incremental models? They're going to be difficult to "roll back" without doing presumably expensive operations on BQ/Snowflake

There's lots more to discuss here too, so please drop your thoughts in the comments below!

@mikekaminsky
Contributor

I think this is a great idea. There are a few inter-related issues and features which I'll try to start picking apart. This would be a pretty major architecture change, so I think we will all benefit if we can get clear on the requirements, priorities, and streams of work.

Transactions

I feel strongly that DBT should be running in pseudo-transactions (henceforth "transactions") when it's running in production (i.e., there are downstream users querying the transformations). Here's why: in the middle of a DBT run, end users could receive invalid query results if two relations have become inconsistent with each other.

Imagine you're calculating an order rate among users as the share of users that have an order. If this query gets executed after the users relation has been created but before the orders relation has finished, you're going to get misleading results.
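To make that concrete, a dashboard query along these lines (relation names are illustrative) would return a misleading rate if it runs after users has been rebuilt but before orders has:

select
    count(distinct o.user_id)::float / count(distinct u.user_id) as order_rate
from analytics.users u
left join analytics.orders o on o.user_id = u.user_id;

-- the freshly-added users inflate the denominator while their orders haven't landed yet,
-- so order_rate is understated until the run finishes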

Proposed architecture:

  • All transformations get written into staging schemas like $staging_my_schema_name
  • Once the DBT run completes, you do the following (maybe in a db-level transaction for seamlessness; a rough SQL sketch follows below):
    • All "live" relations get renamed to "$old_"
    • All of the new relations get renamed into their production namespaces
    • All of the "$old_" relations get dropped

In this world, end users will never be querying data that are part old and part new.
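A minimal sketch of what the promotion step could look like if the swap happens at the schema level (schema names here are made up; a relation-level swap would instead rename each table/view individually):

begin;
alter schema analytics rename to analytics__old;
alter schema dbt_staging_analytics rename to analytics;
commit;

-- once the swap has succeeded, clean up the previous generation
drop schema analytics__old cascade;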

Concerns:

  • People who are using post-hooks for managing permissions may end up wanting to tweak when these permissions are getting set.

DBT Run Modes

As I alluded to above, you probably do not want this transaction-like behavior when you're developing. To support this, I'd recommend implementing DBT run "modes" that allow users to control how DBT behaves with respect to handling this transaction-like behavior.

To start, I'd propose the following modes (not mutually exclusive):

  • with-staging-schemas: this gets you the transaction-like behavior described above
  • fail-fast: stop dbt run as soon as it encounters an error (in with-staging-schemas mode, this means that the new transformations will not be promoted)
  • keep-going: inverse of fail-fast -- dbt builds as many models as it can despite any errors encountered (this is the current default behavior, I believe)

Note: I'm not at all committed to these names and I'm sure we can come up with better ones.

Tests

Once the above concepts are implemented, we can add some configuration so that tests can cause "errors" which may or may not halt a DBT run depending upon which run mode we're in.

One could imagine another configuration flag that indicates whether or not tests should be counted as "errors" for the purpose of a dbt run.

The last feature, which I expect people will want, would be the ability to configure tests with severity levels. So we could introduce "log", "warn", and "error" types (or whatever) for tests, then we could update the configuration options so that different test failure levels can be handled differently. In my mind, this feature comes last.

@mikekaminsky
Contributor

See also #1005 which is relevant.

@joshtemple
Contributor

Shooting blind here, but would it be possible to use tagging under the hood somehow to implement this? After running dbt test, dbt would tag models as failed or passed, and then the user would choose to run only models tagged as passed or failed.

This would also help in cases where you run successfully, test and find a few models failing tests, fix the issues with the models, and want to only re-run the models that ran successfully but didn't test successfully. In that case you could just run with tag failed.

@drewbanin
Contributor Author

@joshtemple I really like the workflow you're describing! I'm unsure if tags are going to be the best way to do it, but I think this is absolutely something that dbt should support

@drewbanin
Contributor Author

hey @mikekaminsky -- this is a pretty significant departure from how dbt currently works. I just took another scan through here, and I think your description of an all-or-nothing build is most closely aligned with how I think this should work. I think there are still some unanswered questions here, and I think the first step in making headway on this issue is probably making sure that we have most/all of our bases covered. Some questions I have:

  1. how should incremental models work in this paradigm?
  2. what do we do on BigQuery (it does not have a rename statement)
  3. how do we interleave tests with models?
  4. at what “level” are the old/new relations swapped? Do we rename whole schemas/datasets? Or the individual database relations?

@mikekaminsky
Contributor

mikekaminsky commented Nov 27, 2018

Great questions @drewbanin! Going to table BigQuery for now and come back to it (I've never used it, so I'm fuzzy on a bunch of important details we'd need in order to solve it).

How should incremental models work in this paradigm?

Great question, and something I hadn't thought of yet. As I've conceptualized it, there are two issues:

  • You don't want to rebuild the whole incremental model in the staging location (defeats the purpose of an incremental model!)
  • Models downstream of the incremental relation need to query from the "updated" incremental model

Options:

  • Leave incremental models in place and update them normally

Easy, but sort of defeats the purpose of this whole exercise as it doesn't ensure consistency during your builds

  • view + UNION:
    • During the build, create a "temp" table that holds all of the records that need to be inserted or updated into the incremental model.
    • In the staging schema, in a view, UNION the incremental model with the temp table (applying the appropriate WHERE clause to the incremental relation)
    • At the end of the run, apply the update / insert logic to update the incremental model appropriately

There will be some performance penalty for models that are downstream of the incremental model, but it's unclear how much (depends on how big the change is and how the data are distributed). A rough SQL sketch of this option follows the list below.

  • Deep copy
    • In Redshift (at least) you can use CREATE TABLE (LIKE ...) plus an INSERT ... SELECT to deep copy a table.
    • Then we can apply the update / insert in the staging location as you'd expect

This will be slow (unsure how slow??) for large tables, though notably not nearly as slow as rebuilding the whole table -- we're basically just making Redshift do a lot of extra disk I/O for nothing. Might actually be an okay compromise, but would want to test it with someone who has bigger data to get a sense of the performance impact.
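Here's the rough SQL sketch of the view + UNION option mentioned above, with made-up relation names (analytics.my_incremental is the existing incremental table, dbt_staging is the staging schema):

-- 1. build only the new/changed rows for this run
create table dbt_staging.my_incremental__delta as
select *
from source_schema.events
where event_time > (select max(event_time) from analytics.my_incremental);

-- 2. expose a consistent "updated" version of the incremental model to downstream staging models
create view dbt_staging.my_incremental as
select * from analytics.my_incremental
where unique_key not in (select unique_key from dbt_staging.my_incremental__delta)
union all
select * from dbt_staging.my_incremental__delta;

-- 3. only after the whole run (and its tests) succeeds, apply the usual delete + insert
--    from dbt_staging.my_incremental__delta into analytics.my_incremental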

How do we interleave tests with models?

I don't think we need to implement this yet. I think we should table and come back to it when we want to implement the testing part of this (phase 2).

At what “level” are the old/new relations swapped? Do we rename whole schemas/datasets? Or the individual database relations?

The way I use dbt, I could / would do it at the schema level (dbt-managed schemas are wholly separate from non-dbt-managed schemas). From what I've gathered in the Slack channel, not everyone follows this rule, so we'd probably have to do everything at the relation / model level. I'm thinking this would be trivial as long as we keep track of where everything is in the staging schema and where everything "belongs", but let me know if I'm just not thinking of something.

@drewbanin
Contributor Author

@mikekaminsky I really like your view approach to incremental models! That's a fantastic way to test changes without applying them. I think you're right about the performance penalty, but maybe that's tolerable given the other benefits of this approach. AFAIK, the deep copy approach will probably work on every database in existence (definitely BigQuery), so that could also be compelling.

I'm always super wary of modifying dbt's materialization code, as we've done seemingly innocuous but wholly incorrect things in the past. There's a whole list of things to know like:

  1. Redshift chokes if a table is dropped inside of a transaction at the same time as a user selects from that table
  2. Redshift chokes if two drop table... cascade statements run concurrently, trying to drop the same downstream view
  3. Redshift chokes if you look at it funny
  4. Snowflake's views cannot be renamed if the tables they depend on have changed in ways that make the view no longer valid
  5. BigQuery has draconian limits around ddl/dml operations like:
  • Maximum number of table operations per day — 1,000
  • Maximum rate of dataset metadata update operations — 1 operation every 2 seconds per dataset

There is more to this list, but I just wanted to sketch out the kind of thinking that goes into a feature like this. These constraints are individually workable, but together they make it challenging to design a feature that 1) operates as intended and 2) uses each database to its fullest potential.

In the past, we've built features that weren't implemented on certain databases due to limitations of the database itself. BigQuery, for instance, only got create table as syntax within the past year! The biggest challenge here is going to be defining the desired dbt workflow, then figuring out how to make it happen for each database. So, I do really like your idea of DBT Run Modes from above. It could just be that this more advanced stuff isn't available on BQ until that particular technology reaches DDL parity with the rest of dbt's adapters.

Last, I agree that we should tackle the model building part of this first, then add testing into the mix separately. That change has more to do with how dbt's graph selector includes nodes in a given run, and should "just work" once this "atomic swap" paradigm is implemented.

PS I think we'll need to make dbt drop these new relations somehow. The relations from a failed run should stick around long enough to debug the problem, and then should be easy to clean up with a dbt command IMO.

@mikekaminsky
Contributor

Lots of good thoughts, @drewbanin!

I'd be interested in moving this forward without bigquery support (especially because of the draconian ddl limits!!)

You definitely have a better perspective on where / how this can go wrong given the diversity of implementations you've seen. I've only really seen a handful (and all of them with Redshift), so I just haven't seen these issues crop up. I'll defer to you on whether this is worth moving forward on!

Regarding having DBT clean up, it sort of depends on how you implement it. We could have DBT drop the old, failed relations, or we could also just let those get blown away when the next successful run happens (if we build everything in a dedicated staging schema like $dbt_staging, it'd be pretty straightforward to just DROP CASCADE that schema when we want to clear out the old stuff).
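For example, with a dedicated staging schema (name illustrative), the cleanup could be as simple as:

drop schema if exists dbt_staging cascade;
create schema dbt_staging;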

Another comment incoming on late-binding views ...

@mikekaminsky
Contributor

mikekaminsky commented Nov 28, 2018

When building views in atomic mode, we need to make sure they are not late-binding. Let's imagine we have table_a and view_b in our build where view_b depends on table_a.

In atomic mode, we run CREATE TABLE $dbt_staging.table_a AS ... and then CREATE VIEW $dbt_staging.view_b AS SELECT .... FROM $dbt_staging.table_a.

At the end of the run, we run RENAME $dbt_staging.table_a TO prod_schema.table_a; RENAME $dbt_staging.view_b TO prod_schema.view_b; If the view were late-binding, after the rename the view wouldn't point to the correct place!

So the view would need to be "regular" in order for this to work. Since we use late-binding views to get around the DROP CASCADE that happens during a non-atomic build, I think this will be fine unless there's some other use case for late-binding views I'm not thinking of!
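For reference, on Redshift the only syntactic difference is the trailing clause (names here are illustrative); a regular view tracks the underlying table through a rename, while a late-binding view re-resolves the name at query time:

-- regular (bound) view: follows table_a even if table_a is later renamed
create view dbt_staging.view_b as
select * from dbt_staging.table_a;

-- late-binding view: resolves "dbt_staging.table_a" by name when queried,
-- so it breaks if that name no longer exists after the swap
create view dbt_staging.view_b_lb as
select * from dbt_staging.table_a
with no schema binding;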

@drewbanin
Contributor Author

@mikekaminsky I totally buy the premise but I'm unsure about that implementation. For one, relations on Redshift can't be renamed across schemas :)

Will keep this in mind, but I think that we'll likely be best served by late binding views in the general case. Redshift (and Postgres) are really the only databases out there that support bound views, so I'd be hesitant to leverage them as a core part of this strategy.

@davehowell

"Redshift chokes if you look at it funny" quote of the century.

There's another potential option alongside @mikekaminsky's view + UNION suggestion, which is to use ALTER TABLE APPEND once the tests are successful.

Downsides are that it commits implicitly and can't be run in a transaction, and it needs a vacuum afterwards (although Redshift does that automatically these days, so don't worry about it). Upsides are that it's a metadata operation and almost instant.
Ideally I would do something like this, though since you can't run APPEND inside a transaction it won't be truly atomic:

CREATE TABLE "tempy" LIKE destination_table
.. do the incremental insert into "tempy"
...create the view over "tempy" & destination_table
... run the tests

--if the tests succeed:
BEGIN TRANSACTION;
DELETE FROM destination_table WHERE exists ( select unique_key from "tempy" table );
ALTER TABLE destination_table APPEND FROM "tempy" table;
COMMIT;

There might be a cleverer way to do it that doesn't leave a window where destination_table has deleted data. The reason for reducing the update & insert into a delete & append is that updates in Redshift already work by marking the old rows as deleted and writing both inserted and updated rows to the unsorted region at the end of the table, so we may as well do the same explicitly. Vacuuming takes care of garbage-collecting the deleted rows; the sorting isn't automatic, but on Redshift that's maintenance the user already has to worry about.

  • Deep copy is faster than doing a vacuum and is recommended in some cases, but it uses a huge amount of space and takes a long time for large tables; I've seen 28 minutes for a ~1 TB table, although that was on a small 6-node cluster.
  • Late-binding is preferred for views, mainly because we don't want to drop views when dropping tables and CASCADE is scary.

@drewbanin
Contributor Author

hey @davehowell - cool idea! We played around with alter table ... append before. I'm not sure if the requirements changed, but in #395 I found that:

  • both the source and destination tables must be permanent
  • both tables must share identical schemas (i think including varchar sizes)
    • this is tractable (but hard) given that the source data we want to append is generated from a select statement. Redshift determines the exact column types as the result of the query

dbt's incremental models currently use delete + insert for exactly the reason you've described!
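For context, that delete + insert pattern looks roughly like this (names illustrative, not dbt's exact generated SQL):

begin;

delete from analytics.my_incremental
where unique_key in (select unique_key from my_incremental__dbt_tmp);

insert into analytics.my_incremental
select * from my_incremental__dbt_tmp;

commit;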

@davehowell

If you create a permanent table using LIKE (I called it "tempy" because it's permanent but you would drop it later), then the properties are identical: you've cloned the metadata of the destination table. That the increment comes from a select statement is inconsequential, because the append is from tempy to the destination table. If the incremental select can be coerced directly into the destination table, it follows that it can be inserted into a table just like it. Sure, when the incremental model is first run it determines the datatypes of that destination because of CTAS, but in this scenario the datatypes have already been set in stone.

@drewbanin
Contributor Author

drewbanin commented May 16, 2019

Oh, roger, I think I misread tempy :)

dbt actually does some amount of "column expansion" for incremental models currently. The first time you run an incremental model, it's possible that the type of your last_name column is varchar(12). Then, someone with a longer last name signs up, and the type generated by Redshift is instead varchar(32). In this case, dbt resizes the column in place for you.
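As an aside, one way such an in-place widen can be done on Redshift (a sketch, not necessarily dbt's exact mechanism) is the add/copy/drop/rename dance:

alter table analytics.users add column last_name__tmp varchar(32);
update analytics.users set last_name__tmp = last_name;
alter table analytics.users drop column last_name;
alter table analytics.users rename column last_name__tmp to last_name;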

In order to determine the datatypes for the newly selected data, we'd probably need to create a temp table from a select statement, inspect the types, then insert into the tempy table, then append into the destination table. That would negate much of the benefit of the create table ... like + append workflow you're describing.

The alternative is of course to not do column expansion (or try to handle changing column type inconsistencies at all!) which is honestly a pretty reasonable answer. It just means that folks might be running a whole lot of --full-refreshes :)

In the interest of completeness, I think that another solution to this is to have users specify their model types in their sql statements, eg:

select
  first_name::varchar(128) as first_name,
  last_name::varchar(128) as last_name,
...

I think if we imposed this responsibility on users, then the append statement would be a little more tractable for most use cases.

Final thoughts: If you'd like, we can re-open the discussion on #395 to discuss the use of alter table append in dbt. I think that's a more appropriate place for us to go deep on this particular Redshift optimization.

@davehowell

Ah yes that would be a problem. I knew about the column expansion, that's bitten me in a weird edge case where the dist-key field became a candidate for expansion, and altering dist-keys is another case where Redshift chokes. Of course, the fix was exactly as you suggested, explicitly casting the column type :)

I think column-expansion is really nice for analysts, same with CTAS type-inference generally, so it totally makes sense for the dbt use-case of accelerating the analyst workflow, but I secretly hate inferred types and have started casting explicitly anyway. I also apply some encoding/compression as well with macros on first-run of incremental models. Actually I hate that I have to do so much hand-holding for Redshift and I keep hearing the wonders of Snowflake...but I digress.

So yeah, I totally agree that column-expansion is a case of being too nice to users, but now that it's there you can't really take it away without making analysts face up to the realities of strong types. Maybe allow disabling of that feature, or disable it when this testing mode is enabled. Or maybe, just as you said, take the functionality away and make people use explicit types. In most cases they wouldn't be needed anyway.

@ismailsimsek

jdub:
It’d be interesting if dbt could run tests within a transformation transaction, and bail out of the transaction if the tests fail. [...] Simple things like the existing unique and null checks, but also more interesting tests like, “Did this transformation produce data within correct / sensible boundaries?”

IMO this is the correct implementation, very similar to a post-hook: run it inside the transaction, rolling back if something fails.

Most databases support transactions, so it should be possible to run the DML plus the tests (SELECTs) and roll back the transaction if they fail.

For databases that don't support transactions, it should be up to the macro to implement the rollback scenario.

  • For this it could be useful to control the transaction inside a macro/snapshot; the implementation would then have more room to implement a rollback scenario, e.g.:
{%- do adapter.start_transaction() -%}
.. do something
{%- do adapter.commit_transaction() -%}

In general I would avoid dropping and re-creating target models/tables, or swapping them.

For example: BigQuery supports the following actions inside a transaction, and these are sufficient to implement the feature: SELECT, INSERT, UPDATE, DELETE, and MERGE statements, plus DDL statements on temporary entities.
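A rough sketch of that idea in BigQuery scripting, assuming an illustrative table proj.analytics.orders and a uniqueness test on order_id (not an existing dbt feature):

BEGIN TRANSACTION;

MERGE `proj.analytics.orders` t
USING `proj.staging.orders_delta` s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);

-- the "test": order_id must still be unique after the merge
IF (SELECT COUNT(*)
    FROM (SELECT order_id
          FROM `proj.analytics.orders`
          GROUP BY order_id
          HAVING COUNT(*) > 1)) > 0
THEN
  ROLLBACK TRANSACTION;
ELSE
  COMMIT TRANSACTION;
END IF;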

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added the stale label on Apr 11, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
