Defer refers to outdated seed #2909

dmateusp · 2020-11-24T14:44:08Z

Describe the bug

I'm running the following on CI:

aws s3 cp $(MANIFEST_REMOTE_PATH) target/prod/manifest.json
dbt seed --select "state:modified+" --full-refresh --state target/prod/ --target dev-master
dbt run --models "state:modified+" --defer --state target/prod/ --target dev-master
...

Say we have a seed countries.csv:

name
Portugal
Ireland
Switzerland

And a model dim_countries.sql:

select to_upper(name) as name
from {{ ref("countries") }}

Now in one PR we change the seed to

id,name
1,Portugal
2,Ireland
3,Switzerland

And we change the model to:

select
  id, 
  to_upper(name) as name
from {{ ref("countries") }}

dbt correctly identifies that countries.csv changed, and that dim_countries.sql changed. However when the model runs, it fails with "column id does not exist" because the model tries to read the seed from the "main run" (it defers the seed) instead of identifying that it re-ran.

Note that I tried copying the manifest produced by dbt seed into target/prod/ but what happens then is that dbt does not identify the model as "modified"

dbt version

0.18.1

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2020-11-24T15:37:37Z

Thanks for the writeup @dmateusp. The issue here is that --defer simply switches the reference of any resource not included in the selection criteria. As part of node selection, dbt excludes all resources from the selection criteria that are not relevant to the invocation type, e.g. a seed from dbt run. So today there's no real way to not defer a seed.

I think this fix is going to be tricky: we'd need dbt to defer on the basis of an intermediate set of selected nodes, rather than the final set. This is part of what made test deferral so cumbersome (#2701), and ultimately not worthwhile. In this case, though, I think we need to figure out an answer.

A much longer-term resolution here would be a generalized command that can operate on both seeds and models (#2743).

In the meantime, the only workarounds I can think of are quite hacky. E.g. if you have a staging model (models/staging/seeds/*.sql) that selects from the seed, and ref() that staging model instead of the seed downstream, you could try to run that view once (without deferral), before running the rest of the DAG with deferral...

dbt seed -s state:modified --state .state
dbt run -m state:modified+1,staging.seeds --state .state # no deferral
dbt run -m state:modified+ --exclude state:modified+1,staging.seeds --defer --state .state

dmateusp · 2020-11-24T15:47:45Z

Thank you for the details @jtcohen6, the staging model is an interesting solution but not something practical to enforce in our project right now unfortunately.

For a short term solution would it be possible to extend dbt to allow deferring only models ? That way I could change my CI to always re-create all the seeds (which is not a super big deal since seeds should be small)

dbt run -m state:modified+ --defer models <...> ?

jtcohen6 · 2020-11-24T15:56:25Z

@dmateusp That's a good thought, and if it proves much more straightforward, that may be the move. In either case, it will require tweaking some of the logic of deferral—which isn't something we can squeeze in for the next minor version (v0.19), but possibly for the one after.

In the meantime, I'll plan to document this as a known caveat to state comparison + deferral.

jtcohen6 · 2020-12-03T17:32:13Z

One approach that's occurred to me, though I haven't thought through all the implications:

Today, we defer all models that are not included in the node selection criteria. That means we defer:

seeds (always)
modified models that are included, then excluded, from shifting criteria, e.g. dbt run -m state:modified && dbt run -m state:modified,config.materialized:incremental (see slack thread)

Perhaps we shouldn't use selection criteria as the basis for deferral. Instead, during compilation, we could check to see if a referent's "new" representation (ci_schema.identifier) exists in the database (via cache lookup). If it doesn't exist, we "fall back" to the comparison manifest's representation (prod_schema.identifier) of a node with the same unique_id.

If we took this more-naive approach to deferral:

Would it also help us avoid tricky issues around manifest construction when previous source or ephemeral model parents have gone missing (KeyError while using deferred flag and model filters #2875)?
Could we even revisit the question of how deferral might work for tests (Support deferred tests #2701)?

dmateusp · 2020-12-04T09:55:45Z

I like that a lot!

dmateusp added bug Something isn't working triage labels Nov 24, 2020

jtcohen6 added state Stateful selection (state:modified, defer) and removed triage labels Nov 24, 2020

jtcohen6 added this to the Oh-Twenty [TBD] milestone Nov 24, 2020

jtcohen6 mentioned this issue Dec 12, 2020

Defer iff unselected reference does not exist in current env #2946

Merged

4 tasks

jtcohen6 modified the milestones: Oh-Twenty [TBD], Kiyoshi Kuromiya Dec 12, 2020

jtcohen6 closed this as completed in #2946 Dec 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Defer refers to outdated seed #2909

Defer refers to outdated seed #2909

dmateusp commented Nov 24, 2020 •

edited

Loading

jtcohen6 commented Nov 24, 2020

dmateusp commented Nov 24, 2020

jtcohen6 commented Nov 24, 2020 •

edited

Loading

jtcohen6 commented Dec 3, 2020 •

edited

Loading

dmateusp commented Dec 4, 2020

Defer refers to outdated seed #2909

Defer refers to outdated seed #2909

Comments

dmateusp commented Nov 24, 2020 • edited Loading

Describe the bug

dbt version

jtcohen6 commented Nov 24, 2020

dmateusp commented Nov 24, 2020

jtcohen6 commented Nov 24, 2020 • edited Loading

jtcohen6 commented Dec 3, 2020 • edited Loading

dmateusp commented Dec 4, 2020

dmateusp commented Nov 24, 2020 •

edited

Loading

jtcohen6 commented Nov 24, 2020 •

edited

Loading

jtcohen6 commented Dec 3, 2020 •

edited

Loading