Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models #4957

Closed
1 task done
bneijt opened this issue Mar 25, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@bneijt
Copy link

bneijt commented Mar 25, 2022

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

We sometime deprecate and remove old models and implement renames or new models. To stay in control of the data we maintain in the database, we would like to be able to have dbt automatically remove tables that are not linked to models defined in dbt.

This will make sure that in your test/prod environments you do not end up with stale, possibly sensitive, data still in the database that is no longer maintained.

General idea is that you are allowed to write dbt_project.yml configuration that states:

models:
  - +schema: something
  - +maintain: drop

which would drop tables/views not used in the something schema. (future alternatives might include "rename" as a maintenance approach or move to another schema).

Because I want the feature to work on some but not all databases (we share a dev database in our project) I would like to have another configuration required in the profile.yml which allows this to start working on the specific database:

snowflake:
  outputs:
    dev:
      ...
    test:
        maintain: true
        database: test

Describe alternatives you've considered

  • Using post-run macro's as described in https://discourse.getdbt.com/t/faq-cleaning-up-removed-models-from-your-production-schema/113/3 (ends up being to backend specific and I would like more testing and control based on central configuration)
  • Writing a separate python script that uses the dbt connection modules to achieve the same (feels hacky and is really project specific)
  • Writing a dbt package (does not have the same control over the complete project configuration as I would like.)

Who will this benefit?

  • Anybody who is willing to implement any of the macro's mentioned in https://discourse.getdbt.com/t/faq-cleaning-up-removed-models-from-your-production-schema/113/3 should be able to use this feature instead and end up with a more configurable solution.
  • People on a tight storage budget for their database instances
  • People who run dashboards on top of dbt created data marts and want those dashboards to fail instead of continue with stale data
  • People who use dbt to process sensitive/GDPR data that needs to be either removed or part of the current processing license to be allowed in the database.

Are you interested in contributing this feature?

Yes, me and a collegue of mine would like to write code for this

Anything else?

Might also help with #3685

@bneijt bneijt added enhancement New feature or request triage labels Mar 25, 2022
@github-actions github-actions bot changed the title [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models [CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models Mar 25, 2022
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 25, 2022

(Similar proposal from long ago, linking for historical context: #1135. I don't know if I still agree with the proposal outlined in my last comment there, though.)

@iknox-fa iknox-fa removed the triage label Apr 7, 2022
@iknox-fa
Copy link
Contributor

iknox-fa commented Apr 7, 2022

Thanks for the well thought out feature request @bneijt! I think this is a great idea and I know we'd love to have something like this in core. We have it as a backlog ticket now, but it's probably not in the immediate future plans for the core team. That said, if it's something you and your colleague would like to contribute that would be really fantastic-- I'm sure we can help provide some assistance along the way as well.

@iknox-fa
Copy link
Contributor

iknox-fa commented Apr 7, 2022

(by the way, be sure to check out #1135 there's lots of good ideas and prior art in there)

@agoblet
Copy link
Contributor

agoblet commented Apr 11, 2022

Hi, @bneijt's colleague here. Good to read that you are enthusiastic about the proposal. We think that implementing this feature consists of 2 tasks:

  1. Allow users to specify which models should be autoremoved when no longer needed. We plan to do this using a manage_schema flag in the dbt_project.yml / profile.yml as described in the initial post of this thread.
  2. Perform the drops on the models that are removed and are specified to be managed by dbt via task 1.

We looked into #1135. It mainly talks about diff / compare functionality. We think that this is not needed for the MVP, and can be added in a later release.

We are curious to hear whether you agree with our ideas!

@bneijt
Copy link
Author

bneijt commented Apr 15, 2022

After some consideration, we have come to the following:
in dbt_project.yml we would add:

managed_schemas:
  - database: your_database_name
    schema: your_schema_name
    action: warn/drop/info/truncate....

By making it top-level config, we ensure we don't mix management and the cascading nature of the project config. This also allows us to drop the check on defining both schema and manage action at the same level.

for the profile.yml configuration, we will add a config to the target itself, which will enable/disable the management of schemas in the target:

profile_name:
  outputs:
    dev:
      ....
    test:
      manage_schemas: false
    prod:
      manage_schemas: true

Default for manage_schemas is false, configuration is done at a target level.

@jtcohen6
Copy link
Contributor

@bneijt Thanks for sharing the concrete proposal! Two feelings I wanted to share:

First: dbt should support a --dry-run mode for object management. It should be able to list or log which objects it would be dropping, separate from actually dropping them.

Second: Specifying schemas to manage in dbt_project.yml feels slightly off to me. This is quite different from how users interact with schemas otherwise, as a property of individual nodes. It's not an unheard-of request to define schemas in their own terms, e.g. in order to define a schema's description (#1714). If the ask is to manage schemas in particular, though, why not as part of a more general-purpose solution to managing objects (#3391), rather than a one-off project-level manage_schemas config?

As a dbt user, I'd be more inclined to pass a runtime config, dbt --manage-schemas run, that has the effect of "managing" (dropping unrecognized objects) from any schema in which dbt is also materializing models. For instance, given model_a in schema schema_a, dbt --manage-schemas run --select model_a would "clean" schema_a as a post hook, but no other schemas.

Operationally, this could be done at the end of the run, akin to an on-run-end hook. It could use the cache that was constructed during the run (pairing nicely with another runtime config we just added, --cache-selected-only), by picking out all the cached objects from that schema that don't match a manifest node (enabled models, sources, seeds, snapshots, etc). I think the relevant method is adapter.cache.get_relations:

def get_relations(self, database: Optional[str], schema: Optional[str]) -> List[Any]:
"""Case-insensitively yield all relations matching the given schema.
:param str schema: The case-insensitive schema name to list from.
:return List[BaseRelation]: The list of relations with the given
schema
"""

@bneijt
Copy link
Author

bneijt commented Apr 18, 2022

First: yes a --dry-run is on our todo. First iteration we want to have a configurable set of possible action: values, including warn as: only warn about extra models. This would allow for a dry-run on the different environments before you start enabling the actual management/drop action.

Second: Our first intention was to have it as part of the node configuration, next to the model schema attribute, you could just add manage_schema. However, this can cause confusing configurations like:

models:
  target:
    +schema: schema_a
    a:
      b:
        +schema: schema_b
    c:
      +schema: schema_b
      +manage_schema: true    

Now all models below b would be managed, even though this is only specified at the end of the configuration. Bottom line, enabling management on a schema is not a cascading configuration property when it comes to models.

The only proper solution seems to be a top-level configuration in the project. If we would only want to allow this configuration to be done via the command line, I think the definitions of which schema's should be managed will get out of hand (we want to manage about 10 schema's in my current project).

We are currently experimenting with the code in a feature branch on our fork https://github.com/BigDataRepublic/dbt-core/blob/c22aaafff0f25d10d6769850ff50413c02063937/core/dbt/task/run.py#L488 with a pull request towards the for to allow us to track discussions and progress BigDataRepublic#1.

As for the caching behavior, that is something we still need to look in to, might indeed be nice to pair that as you mention.

@agoblet
Copy link
Contributor

agoblet commented May 4, 2022

@jtcohen6 I don’t think that not specifying schemas to manage, and simply manage all schemas that have models, works. When removing the last model(s) of a schema from your project, these models are left in the database, while they should be removed. Do I miss something here?

@jtcohen6
Copy link
Contributor

@bneijt @agoblet Thanks so much for opening the PR! This really helped me get a feel for what you're thinking. I'm going to leave some comments over there.

I like the idea of warn as a dry-run option. I think this gets us a lot of value, and the stakes are much lower in the event that dbt over-eagerly identifies a resource to be dropped.

When removing the last model(s) of a schema from your project, these models are left in the database, while they should be removed. Do I miss something here?

Hmm, good point. A few wild ideas that occur to me:

  • Authorizing dbt to "manage" an entire database (i.e. Snowflake logical database, BigQuery project) where it expects to controls all resources. This seems like a common pattern for folks who ask dbt to produce many custom schemas.
  • dbt could infer, based on database permissions, which schemas its user/role/principal is the owner of — and drop any which it does not recognize as the landing place for currently enabled resoruces. Again, though, this would require different environments (dev/staging/prod) to exist in separate logical databases, or to use different roles (e.g. Snowflake) that clearly disambiguate environment-level ownership. I could see this really meshing nicely with best practices for DWH setup/management, but I imagine it wouldn't work in practice for many many folks who have slightly different setups.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Feb 17, 2023
@joellabes
Copy link
Contributor

The issue is still very active over in its PR! #5392

@jtcohen6 jtcohen6 removed the stale Issues that have gone stale label Feb 20, 2023
@bneijt
Copy link
Author

bneijt commented Feb 4, 2024

Doen a lot of work in a pr and had some feedback from the team. Meanwhile I moved on from using dbt and the original pr would require a complete rewrite to fit on the current main branch. Closing this issue due to inactivity.

@bneijt bneijt closed this as completed Feb 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants