
[CT-168] Cache objects for selected resources only? #4688

Closed
jtcohen6 opened this issue Feb 5, 2022 · 4 comments · Fixed by #5036
Labels
adapter_caching (Issues related to the adapter's relation cache) · performance · Team:Adapters (Issues designated for the adapter area of the code)

Comments

@jtcohen6
Contributor

jtcohen6 commented Feb 5, 2022

Background

Should we cache everything dbt cares about? Or should we try to cache only the objects relevant to resources selected to run? This is an increasingly important consideration as folks move to:

  • Execute smaller sub-DAGs within ever-larger projects
  • Manage their slice of a multi-package dbt deployment

After all, the intent of the relation cache is to improve performance. If, in its current unadaptive form, it isn't serving that purpose, we should change its behavior.

Details

We know the set of selected resources at the time we populate the adapter cache, and we already create schemas (if they do not yet exist) only for objects that are selected. Should we also limit the reach of our metadata queries to introspect only the schemas we care about?

self.create_schemas(adapter, selected_uids)
self.populate_adapter_cache(adapter)

Should we go one step further and use filters/wildcards to cache only the objects we care about? This seems necessary on dbt-spark (issue linked below).
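As a rough sketch of the idea (hypothetical names, not the real dbt API — `populate_adapter_cache` does not currently take a schema filter), the set of schemas to introspect could be derived from the selected nodes:

```python
# Hypothetical sketch: derive the (database, schema) pairs used by the
# selected resources, so cache population could be limited to just those
# schemas. `nodes` maps unique_id -> node metadata; names are illustrative.
def required_schemas(nodes, selected_uids):
    """Collect (database, schema) pairs for the selected resources only."""
    return {
        (node["database"], node["schema"])
        for uid, node in nodes.items()
        if uid in selected_uids
    }

nodes = {
    "model.proj.a": {"database": "analytics", "schema": "core"},
    "model.proj.b": {"database": "analytics", "schema": "staging"},
}
# Only model a is selected, so only its schema would be introspected:
schemas = required_schemas(nodes, {"model.proj.a"})
# schemas == {("analytics", "core")}
```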

Risks

  • The cache has made us "greedy" when it comes to really simple metadata lookups. What if, over the course of running a few models, we end up spending more time in one-off lookup queries (due to cache misses) than we saved at the start, by skipping one batch metadata query that seemed irrelevant?
  • --defer depends on being able to access cache information about resources that are not selected (context in Defer iff unselected reference does not exist in current env #2946):

unique_id not in selected and
not adapter.get_relation(
    current.database, current.schema, current.identifier
)
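A toy illustration of that risk (not dbt's actual cache implementation): if only schemas for selected resources are cached, a lookup for an unselected relation misses, and the defer check above can no longer distinguish "relation absent in the warehouse" from "relation simply not cached".

```python
# Toy relation cache illustrating the --defer risk. All names here are
# illustrative, not dbt internals.
class ToyCache:
    def __init__(self, cached_schemas, warehouse):
        self.cached_schemas = cached_schemas  # schemas we introspected at startup
        self.warehouse = warehouse            # ground truth: set of (schema, name)

    def get_relation(self, schema, identifier):
        if schema not in self.cached_schemas:
            # Cache miss: indistinguishable from "relation does not exist"
            return None
        key = (schema, identifier)
        return key if key in self.warehouse else None

warehouse = {("analytics", "dim_customers")}
full = ToyCache({"analytics"}, warehouse)
partial = ToyCache(set(), warehouse)  # "analytics" was not selected, so not cached

# The full cache finds the unselected relation; the partial cache
# wrongly reports it as missing, which would flip the defer condition.
```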

Questions

  • If dbt misses something in its cache, are we liable to return slow or incorrect results? The answer significantly changes the stakes of such a change.
  • Would we want cache behavior to vary based on --defer behavior?

Related

@jtcohen6 jtcohen6 added the performance, Team:Execution, and Team:Adapters labels Feb 5, 2022
@github-actions github-actions bot changed the title Cache objects for selected resources only? [CT-168] Cache objects for selected resources only? Feb 5, 2022
@karunpoudel
Contributor

karunpoudel commented Mar 6, 2022

Hi, we are trying to run individual models one at a time from Apache Airflow (by parsing manifest.json). We have a common project where multiple teams work, so there are about 30 schemas in total across various databases (Snowflake). When running a single model, it takes 2 minutes just to run show terse objects across the various schemas, while the main query for the model completes in 9 seconds. This is very inefficient. It should be sufficient to run show terse objects on the target schema of that particular model only.

I was hoping there could be a command-line or config parameter to change this default behavior. That would let users choose what is best for their use case.

@boxysean
Contributor

I spoke with a dbt user today who is trying to reference the data type of a column in a post-hook macro. They observed that, if the data type of a column changes between two successive runs (say from TEXT to NUMBER), then the post-hook macro that references the data type will see the type from before the run (e.g., TEXT rather than NUMBER).

I think this is an example of negative/unexpected behavior of caching.
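A minimal toy reproduction of that behavior (illustrative only, not dbt's actual cache code): column metadata cached at run start is never refreshed after the materialization rebuilds the table, so a post-hook reading from the cache sees the pre-run type.

```python
# Toy sketch of stale column-type caching. All names are hypothetical.
column_cache = {}

def populate_cache():
    # At run start, the warehouse reports the column as TEXT,
    # and that value is stored in the cache.
    column_cache[("my_schema", "my_model", "status")] = "TEXT"

def run_model():
    # The materialization rebuilds the table; the column is now NUMBER
    # in the warehouse, but the cache entry is never refreshed.
    return "NUMBER"

def post_hook_column_type():
    # The post-hook consults the cache, not the warehouse.
    return column_cache[("my_schema", "my_model", "status")]

populate_cache()
actual = run_model()           # "NUMBER" — the type now in the warehouse
cached = post_hook_column_type()  # "TEXT" — the stale pre-run type
```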

@jtcohen6
Contributor Author

@boxysean Great catch. Confirming that the user was on Spark/Databricks? That's the only adapter I know of where we cache column-level info at the start of the run, since it's available from the caching query, such that subsequent calls to adapter.get_columns_in_relation can be cache hits.

This is a tricky one: it wouldn't be solved by limiting the cache to selected/run relations, nor would it be solved by updating the cache with a more detailed relation object after the materialization runs (since that happens after the post-hook). It's a good reason to prefer a less powerful caching query that doesn't return column info (option 2 here: dbt-labs/dbt-spark#296).

@boxysean
Contributor

@jtcohen6 They are on Snowflake. I'll try to get them to chime in directly and share their code to illustrate.

@jtcohen6 jtcohen6 added the adapter_caching label Dec 7, 2022