Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional condition to not_null_proportion test #691

Closed

Conversation

elyobo
Copy link
Contributor

@elyobo elyobo commented Sep 28, 2022

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

The dbt core not_null test allows a where parameter and the dbt utils expression_is_true allows a condition parameter, both of which allow the test to operate on a subset of rows.

This adds a similar functoinality to the not_null_proportion test, sticking with the condition convention from `expression_is_true.

Checklist

  • This code is associated with an Issue which has been triaged and accepted for development.
  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@elyobo elyobo force-pushed the optional-condition-on-not-null-proportion branch from 4078ec6 to 32557ea Compare September 28, 2022 12:51
The dbt core `not_null` test allows a `where` parameter and the dbt
utils `expression_is_true` allows a `condition` parameter, both of which
allow the test to operate on a subset of rows.

This adds a similar functoinality to the `not_null_proportion` test,
sticking with the `condition` convention from `expression_is_true.
@elyobo elyobo force-pushed the optional-condition-on-not-null-proportion branch from 32557ea to 5fdeabd Compare September 28, 2022 12:53
@joellabes
Copy link
Contributor

@elyobo all tests get a where clause for free: https://docs.getdbt.com/reference/resource-configs/where, so I don't think this PR is necessary - let me know if I've misunderstood though!

@joellabes joellabes closed this Sep 28, 2022
@elyobo
Copy link
Contributor Author

elyobo commented Sep 29, 2022

🤯 woah, I did not know that, that's great - I'll test that out for my use case to confirm but it should be fine. (Edit: confirmed, works great, sorry for the PR-that-turned-out-to-be-a-support-request).

Question now would be - why does expression_is_true have condition then? Doesn't where make it redundant?

Incidentally it looks like that where substitution is another "this is might cost a lot in BigQuery" concern as per #683 🤔 Another possible feature (I haven't looked for an issue yet) would be reporting on data usage in tests as it does when running models in build/run, costs like this are hidden away.

@joellabes
Copy link
Contributor

Question now would be - why does expression_is_true have condition then? Doesn't where make it redundant?

That test predates the addition of native where support - we're removing the not_null_where and unique_where tests in dbt utils 1.0 next month for the same reason. It'd be nice to remove this condition argument, but migration would be messy... 🤔 i'll have a ponder.

Incidentally it looks like that where substitution is another "this is might cost a lot in BigQuery" concern

How come? because it makes a CTE containing a select * that gets a filter applied to it, and then that CTE gets tested? (I dont remember how the tests are implemented offhand, but I think that’s it).

I would expect any query optimiser worth its salt to notice that those columns aren't used in later parts of the query and ignore them. Here's an example from a very good Snowflake-flavoured article talking about column pruning.

@elyobo
Copy link
Contributor Author

elyobo commented Sep 29, 2022

Yes, sorry, you're quite right about the pruning - so long as the columns are eventually not selected then BQ will prune them, so it's not making the problem any worse and can be addressed in specific tests that end up doing select * 👍

condition removal is a BC break, but the migration for users isn't too bad (there's an exact replacement that provides the same functionality) so a major release would normally be appropriate; if it's likely to cause users pain then maybe flag it as deprecated (can you emit warnings?) and remove it in a future major release instead?

@joellabes
Copy link
Contributor

Yeah OK I'm sold - I've made #695 to track it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants