
[CT-168] Cache objects for selected resources only? #4688

Closed
jtcohen6 opened this issue Feb 5, 2022 · 4 comments · Fixed by #5036
Labels
adapter_caching (Issues related to the adapter's relation cache) · performance · Team:Adapters (Issues designated for the adapter area of the code)

Comments

@jtcohen6
Contributor

jtcohen6 commented Feb 5, 2022

Background

Should we cache everything dbt cares about? Or should we try to cache only the objects relevant to resources selected to run? This is an increasingly important consideration as folks move to:

  • Execute smaller sub-DAGs within ever-larger projects
  • Manage their slice of a multi-package dbt deployment

After all, the intent of the relation cache is to improve performance. If, in its current unadaptive form, it isn't serving that purpose, we should change its behavior.

Details

We know the set of selected resources at the time we populate the adapter cache, and we already create schemas (if they do not yet exist) only for objects that are selected. Should we also limit the reach of our metadata queries to introspect only the schemas we care about?

self.create_schemas(adapter, selected_uids)
self.populate_adapter_cache(adapter)

Should we go one step further and use filters/wildcards to cache only the objects we care about? This seems necessary on dbt-spark (issue linked below).
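As a rough sketch of the idea (hypothetical names, not the real dbt API — `populate_adapter_cache` does not currently take a schema filter), the set of schemas to introspect could be derived from the selected nodes:

```python
# Hypothetical sketch: derive the (database, schema) pairs used by the
# selected resources, so cache population could be limited to just those
# schemas. `nodes` maps unique_id -> node metadata; names are illustrative.
def required_schemas(nodes, selected_uids):
    """Collect (database, schema) pairs for the selected resources only."""
    return {
        (node["database"], node["schema"])
        for uid, node in nodes.items()
        if uid in selected_uids
    }

nodes = {
    "model.proj.a": {"database": "analytics", "schema": "core"},
    "model.proj.b": {"database": "analytics", "schema": "staging"},
}
# Only model a is selected, so only its schema would be introspected:
schemas = required_schemas(nodes, {"model.proj.a"})
# schemas == {("analytics", "core")}
```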

Risks

  • The cache has made us "greedy" when it comes to really simple metadata lookups. What if, over the course of running a few models, we end up spending more time in one-off lookup queries (due to cache misses) than we saved at the start, by skipping one batch metadata query that seemed irrelevant?
  • --defer depends on being able to access cache information about resources that are not selected (context in Defer iff unselected reference does not exist in current env #2946):

unique_id not in selected and
not adapter.get_relation(
    current.database, current.schema, current.identifier
)
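A toy illustration of that risk (not dbt's actual cache implementation): if only schemas for selected resources are cached, a lookup for an unselected relation misses, and the defer check above can no longer distinguish "relation absent in the warehouse" from "relation simply not cached".

```python
# Toy relation cache illustrating the --defer risk. All names here are
# illustrative, not dbt internals.
class ToyCache:
    def __init__(self, cached_schemas, warehouse):
        self.cached_schemas = cached_schemas  # schemas we introspected at startup
        self.warehouse = warehouse            # ground truth: set of (schema, name)

    def get_relation(self, schema, identifier):
        if schema not in self.cached_schemas:
            # Cache miss: indistinguishable from "relation does not exist"
            return None
        key = (schema, identifier)
        return key if key in self.warehouse else None

warehouse = {("analytics", "dim_customers")}
full = ToyCache({"analytics"}, warehouse)
partial = ToyCache(set(), warehouse)  # "analytics" was not selected, so not cached

# The full cache finds the unselected relation; the partial cache
# wrongly reports it as missing, which would flip the defer condition.
```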

Questions

  • If dbt misses something in its cache, are we liable to return slow or incorrect results? The answer significantly changes the stakes of such a change.
  • Would we want cache behavior to vary based on --defer behavior?

Related

@jtcohen6 jtcohen6 added the performance, Team:Execution, and Team:Adapters labels Feb 5, 2022
@github-actions github-actions bot changed the title Cache objects for selected resources only? [CT-168] Cache objects for selected resources only? Feb 5, 2022
@karunpoudel
Contributor

karunpoudel commented Mar 6, 2022

Hi, we are trying to run individual models one at a time from Apache Airflow (by parsing manifest.json). We have a common project where multiple teams work, so there are about 30 schemas in total across various databases (Snowflake). When running a single model, it takes 2 minutes just to run show terse objects across the various schemas, while the main query for the model completes in 9 seconds. This is very inefficient. It should be sufficient to run show terse objects on the target schema of that particular model only.

I was hoping there could be a command-line or config parameter to change this default behavior. That would let users choose what is best for their use case.

@boxysean
Contributor

I spoke with a dbt user today who is trying to reference the data type of a column in a post-hook macro. They observed that, if the data type of a column changes between two successive runs (say from TEXT to NUMBER), then the post-hook macro that references the data type will see the type from before the run (e.g., TEXT rather than NUMBER).

I think this is an example of negative/unexpected behavior of caching.
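A minimal toy reproduction of that behavior (illustrative only, not dbt's actual cache code): column metadata cached at run start is never refreshed after the materialization rebuilds the table, so a post-hook reading from the cache sees the pre-run type.

```python
# Toy sketch of stale column-type caching. All names are hypothetical.
column_cache = {}

def populate_cache():
    # At run start, the warehouse reports the column as TEXT,
    # and that value is stored in the cache.
    column_cache[("my_schema", "my_model", "status")] = "TEXT"

def run_model():
    # The materialization rebuilds the table; the column is now NUMBER
    # in the warehouse, but the cache entry is never refreshed.
    return "NUMBER"

def post_hook_column_type():
    # The post-hook consults the cache, not the warehouse.
    return column_cache[("my_schema", "my_model", "status")]

populate_cache()
actual = run_model()           # "NUMBER" — the type now in the warehouse
cached = post_hook_column_type()  # "TEXT" — the stale pre-run type
```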

@jtcohen6
Contributor Author

@boxysean Great catch. Confirming that the user was on Spark/Databricks? That's the only adapter I know of where we cache column-level info at the start of the run, since it's available from the caching query, such that subsequent calls to adapter.get_columns_in_relation can be cache hits.

This is a tricky one: it wouldn't be solved by limiting the cache to selected/run relations, nor would it be solved by updating the cache with a more detailed relation object after the materialization runs (since that happens after the post-hook). It's a good reason to prefer a less powerful caching query that doesn't return column info (option 2 here: dbt-labs/dbt-spark#296).

@boxysean
Contributor

@jtcohen6 They are on Snowflake. I'll try to get them to chime in directly and share their code to illustrate.

@jtcohen6 jtcohen6 added the adapter_caching label Dec 7, 2022