Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use QUALIFY clause in deduplicate macro for Redshift #811

Merged

Conversation

yauhen-sobaleu
Copy link
Contributor

resolves #713

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

In Redshift deduplicate macro causes rows with NULL values in any column to be discarded due to specifics of the natural join.

Since Redshift has added support for QUALIFY keyword (https://aws.amazon.com/about-aws/whats-new/2023/07/amazon-redshift-qualify-clause-select-sql-statement/) we can get rid of natural join in the macro and fix the problem in an elegant manner.

Compare inputs and outputs:

Old version

with data as (

    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
    UNION ALL
    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
),

data_deduped AS (
    with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by id
                order by ts desc
            ) as rn
        from data as _inner
    )

    select
        distinct data.*
    from data as data
    
    natural join row_numbered
    where row_numbered.rn = 1
)

select * from data_deduped

Expected results

         ts          | id | null_field 
---------------------+----+------------
 2021-04-03 23:00:26 |  1 |          
(1 row)

Actual results

 ts | id | null_field 
----+----+------------
(0 rows)

New version

with data as (

    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
    UNION ALL
    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
),

data_deduped AS (
 
    select *
    from data as tt
    qualify
        row_number() over (
            partition by id
            order by ts
        ) = 1
)

select * from data_deduped

Actual result

         ts          | id | null_field 
---------------------+----+------------
 2021-04-03 23:00:26 |  1 |          
(1 row)

Checklist

  • This code is associated with an Issue which has been triaged and accepted for development.
  • x ] I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yauhen-sobaleu ! 🚀

@dbeatty10 dbeatty10 added this pull request to the merge queue Jul 25, 2023
Merged via the queue into dbt-labs:main with commit a17d66f Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dbt_utils.deduplicate macro returns zero rows when deduplicating a CTE with NULL column (Redshift)
2 participants