[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models #4957

bneijt · 2022-03-25T13:45:26Z

Is there an existing feature request for this?

I have searched the existing issues

Describe the Feature

We sometime deprecate and remove old models and implement renames or new models. To stay in control of the data we maintain in the database, we would like to be able to have dbt automatically remove tables that are not linked to models defined in dbt.

This will make sure that in your test/prod environments you do not end up with stale, possibly sensitive, data still in the database that is no longer maintained.

General idea is that you are allowed to write dbt_project.yml configuration that states:

models:
  - +schema: something
  - +maintain: drop

which would drop tables/views not used in the something schema. (future alternatives might include "rename" as a maintenance approach or move to another schema).

Because I want the feature to work on some but not all databases (we share a dev database in our project) I would like to have another configuration required in the profile.yml which allows this to start working on the specific database:

snowflake:
  outputs:
    dev:
      ...
    test:
        maintain: true
        database: test

Describe alternatives you've considered

Using post-run macro's as described in https://discourse.getdbt.com/t/faq-cleaning-up-removed-models-from-your-production-schema/113/3 (ends up being to backend specific and I would like more testing and control based on central configuration)
Writing a separate python script that uses the dbt connection modules to achieve the same (feels hacky and is really project specific)
Writing a dbt package (does not have the same control over the complete project configuration as I would like.)

Who will this benefit?

Anybody who is willing to implement any of the macro's mentioned in https://discourse.getdbt.com/t/faq-cleaning-up-removed-models-from-your-production-schema/113/3 should be able to use this feature instead and end up with a more configurable solution.
People on a tight storage budget for their database instances
People who run dashboards on top of dbt created data marts and want those dashboards to fail instead of continue with stale data
People who use dbt to process sensitive/GDPR data that needs to be either removed or part of the current processing license to be allowed in the database.

Are you interested in contributing this feature?

Yes, me and a collegue of mine would like to write code for this

Anything else?

Might also help with #3685

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2022-03-25T14:54:41Z

(Similar proposal from long ago, linking for historical context: #1135. I don't know if I still agree with the proposal outlined in my last comment there, though.)

iknox-fa · 2022-04-07T22:01:41Z

Thanks for the well thought out feature request @bneijt! I think this is a great idea and I know we'd love to have something like this in core. We have it as a backlog ticket now, but it's probably not in the immediate future plans for the core team. That said, if it's something you and your colleague would like to contribute that would be really fantastic-- I'm sure we can help provide some assistance along the way as well.

iknox-fa · 2022-04-07T22:02:22Z

(by the way, be sure to check out #1135 there's lots of good ideas and prior art in there)

agoblet · 2022-04-11T10:52:25Z

Hi, @bneijt's colleague here. Good to read that you are enthusiastic about the proposal. We think that implementing this feature consists of 2 tasks:

Allow users to specify which models should be autoremoved when no longer needed. We plan to do this using a manage_schema flag in the dbt_project.yml / profile.yml as described in the initial post of this thread.
Perform the drops on the models that are removed and are specified to be managed by dbt via task 1.

We looked into #1135. It mainly talks about diff / compare functionality. We think that this is not needed for the MVP, and can be added in a later release.

We are curious to hear whether you agree with our ideas!

bneijt · 2022-04-15T12:39:56Z

After some consideration, we have come to the following:
in dbt_project.yml we would add:

managed_schemas:
  - database: your_database_name
    schema: your_schema_name
    action: warn/drop/info/truncate....

By making it top-level config, we ensure we don't mix management and the cascading nature of the project config. This also allows us to drop the check on defining both schema and manage action at the same level.

for the profile.yml configuration, we will add a config to the target itself, which will enable/disable the management of schemas in the target:

profile_name:
  outputs:
    dev:
      ....
    test:
      manage_schemas: false
    prod:
      manage_schemas: true

Default for manage_schemas is false, configuration is done at a target level.

jtcohen6 · 2022-04-18T13:22:41Z

@bneijt Thanks for sharing the concrete proposal! Two feelings I wanted to share:

First: dbt should support a --dry-run mode for object management. It should be able to list or log which objects it would be dropping, separate from actually dropping them.

Second: Specifying schemas to manage in dbt_project.yml feels slightly off to me. This is quite different from how users interact with schemas otherwise, as a property of individual nodes. It's not an unheard-of request to define schemas in their own terms, e.g. in order to define a schema's description (#1714). If the ask is to manage schemas in particular, though, why not as part of a more general-purpose solution to managing objects (#3391), rather than a one-off project-level manage_schemas config?

As a dbt user, I'd be more inclined to pass a runtime config, dbt --manage-schemas run, that has the effect of "managing" (dropping unrecognized objects) from any schema in which dbt is also materializing models. For instance, given model_a in schema schema_a, dbt --manage-schemas run --select model_a would "clean" schema_a as a post hook, but no other schemas.

Operationally, this could be done at the end of the run, akin to an on-run-end hook. It could use the cache that was constructed during the run (pairing nicely with another runtime config we just added, --cache-selected-only), by picking out all the cached objects from that schema that don't match a manifest node (enabled models, sources, seeds, snapshots, etc). I think the relevant method is adapter.cache.get_relations:

dbt-core/core/dbt/adapters/cache.py

Lines 453 to 459 in 14c2bd9

    
               def get_relations(self, database: Optional[str], schema: Optional[str]) -> List[Any]: 
        
                   """Case-insensitively yield all relations matching the given schema. 
        
                   :param str schema: The case-insensitive schema name to list from. 
        
                   :return List[BaseRelation]: The list of relations with the given 
        
                       schema 
        
                   """

bneijt · 2022-04-18T18:36:55Z

First: yes a --dry-run is on our todo. First iteration we want to have a configurable set of possible action: values, including warn as: only warn about extra models. This would allow for a dry-run on the different environments before you start enabling the actual management/drop action.

Second: Our first intention was to have it as part of the node configuration, next to the model schema attribute, you could just add manage_schema. However, this can cause confusing configurations like:

models:
  target:
    +schema: schema_a
    a:
      b:
        +schema: schema_b
    c:
      +schema: schema_b
      +manage_schema: true

Now all models below b would be managed, even though this is only specified at the end of the configuration. Bottom line, enabling management on a schema is not a cascading configuration property when it comes to models.

The only proper solution seems to be a top-level configuration in the project. If we would only want to allow this configuration to be done via the command line, I think the definitions of which schema's should be managed will get out of hand (we want to manage about 10 schema's in my current project).

We are currently experimenting with the code in a feature branch on our fork https://github.com/BigDataRepublic/dbt-core/blob/c22aaafff0f25d10d6769850ff50413c02063937/core/dbt/task/run.py#L488 with a pull request towards the for to allow us to track discussions and progress BigDataRepublic#1.

As for the caching behavior, that is something we still need to look in to, might indeed be nice to pair that as you mention.

agoblet · 2022-05-04T13:52:39Z

@jtcohen6 I don’t think that not specifying schemas to manage, and simply manage all schemas that have models, works. When removing the last model(s) of a schema from your project, these models are left in the database, while they should be removed. Do I miss something here?

jtcohen6 · 2022-06-30T13:25:00Z

@bneijt @agoblet Thanks so much for opening the PR! This really helped me get a feel for what you're thinking. I'm going to leave some comments over there.

I like the idea of warn as a dry-run option. I think this gets us a lot of value, and the stakes are much lower in the event that dbt over-eagerly identifies a resource to be dropped.

When removing the last model(s) of a schema from your project, these models are left in the database, while they should be removed. Do I miss something here?

Hmm, good point. A few wild ideas that occur to me:

Authorizing dbt to "manage" an entire database (i.e. Snowflake logical database, BigQuery project) where it expects to controls all resources. This seems like a common pattern for folks who ask dbt to produce many custom schemas.
dbt could infer, based on database permissions, which schemas its user/role/principal is the owner of — and drop any which it does not recognize as the landing place for currently enabled resoruces. Again, though, this would require different environments (dev/staging/prod) to exist in separate logical databases, or to use different roles (e.g. Snowflake) that clearly disambiguate environment-level ownership. I could see this really meshing nicely with best practices for DWH setup/management, but I imagine it wouldn't work in practice for many many folks who have slightly different setups.

github-actions · 2023-02-17T02:04:38Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

joellabes · 2023-02-20T02:32:11Z

The issue is still very active over in its PR! #5392

bneijt · 2024-02-04T10:19:44Z

Doen a lot of work in a pr and had some feedback from the team. Meanwhile I moved on from using dbt and the original pr would require a complete rewrite to fit on the current main branch. Closing this issue due to inactivity.

bneijt added enhancement New feature or request triage labels Mar 25, 2022

github-actions bot changed the title ~~[Feature] Manage full schema content with dbt by dropping tables/views not maintained in models~~ [CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models Mar 25, 2022

jtcohen6 added the Team:Execution label Mar 25, 2022

iknox-fa removed the triage label Apr 7, 2022

bneijt mentioned this issue Apr 22, 2022

Feature/ct 426 scratch BigDataRepublic/dbt-core#1

Closed

bneijt mentioned this issue Jun 27, 2022

Feature CT-426 manage schemas #5392

Closed

10 tasks

ChenyuLInx added the triage label Jun 27, 2022

ChenyuLInx assigned jtcohen6 Jun 27, 2022

jtcohen6 removed the triage label Jun 30, 2022

jtcohen6 removed their assignment Aug 20, 2022

jtcohen6 mentioned this issue Sep 6, 2022

[CT-1080] BigQuery dataset expiration time dbt-labs/dbt-bigquery#289

Closed

yu-iskw mentioned this issue Oct 28, 2022

[CT-1442] [Feature] Enable to access disabled resources with the graph macro #6169

Closed

3 tasks

github-actions bot added the stale Issues that have gone stale label Feb 17, 2023

jtcohen6 removed the stale Issues that have gone stale label Feb 20, 2023

jtcohen6 removed the Team:Execution label Jul 19, 2023

bneijt closed this as completed Feb 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models #4957

[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models #4957

bneijt commented Mar 25, 2022

jtcohen6 commented Mar 25, 2022 •

edited

Loading

iknox-fa commented Apr 7, 2022

iknox-fa commented Apr 7, 2022

agoblet commented Apr 11, 2022 •

edited

Loading

bneijt commented Apr 15, 2022

jtcohen6 commented Apr 18, 2022

bneijt commented Apr 18, 2022

agoblet commented May 4, 2022

jtcohen6 commented Jun 30, 2022

github-actions bot commented Feb 17, 2023

joellabes commented Feb 20, 2023

bneijt commented Feb 4, 2024

[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models #4957

[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models #4957

Comments

bneijt commented Mar 25, 2022

Is there an existing feature request for this?

Describe the Feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

jtcohen6 commented Mar 25, 2022 • edited Loading

iknox-fa commented Apr 7, 2022

iknox-fa commented Apr 7, 2022

agoblet commented Apr 11, 2022 • edited Loading

bneijt commented Apr 15, 2022

jtcohen6 commented Apr 18, 2022

bneijt commented Apr 18, 2022

agoblet commented May 4, 2022

jtcohen6 commented Jun 30, 2022

github-actions bot commented Feb 17, 2023

joellabes commented Feb 20, 2023

bneijt commented Feb 4, 2024

jtcohen6 commented Mar 25, 2022 •

edited

Loading

agoblet commented Apr 11, 2022 •

edited

Loading