
Deduplication macro #335

Closed
blake-enyart opened this issue Feb 12, 2021 · 9 comments · Fixed by #512
Labels
enhancement, good first issue

Comments

@blake-enyart

Describe the feature

It would be great to have a deduplication macro that utilizes something such as the surrogate_key macro to deduplicate on. #29 could serve as the basis for this.

Describe alternatives you've considered

At this point, the only alternative is writing this out as a full SQL query, and this feels like a common, repetitive type of query that is prime for integrating into a macro.

Additional context

Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.
I work mainly in Snowflake, but this could be generalized. Perhaps a first draft in Snowflake that expands outward from there?

Who will this benefit?

It is very common practice to deduplicate data on ingestion, in addition to type casting, so this would greatly expedite onboarding new data sources.

Are you interested in contributing this feature?

I would need a hand getting started on this. I can leverage #29 as a reference, but would greatly appreciate any support and perspective provided here.

blake-enyart added the enhancement and triage labels on Feb 12, 2021
@rwatts3 commented Feb 19, 2021

Here is a macro I use to dedupe in BigQuery; not sure if it's relevant or could help.

{%- macro deduped_dataset(database, table, key="id") -%}
  select deduped.*
    from (
      select
        _inner.*,
        row_number() over (partition by {{ key }} order by updated_at desc ) as rn
      from {{ source(database, table) }} as _inner
    ) as deduped
  where deduped.rn = 1
{%- endmacro -%}
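
For illustration, a model calling this macro might look like the following (the shop source, orders table, and order_id key are hypothetical, and the source would need to be declared in a sources .yml; note the macro hard-codes updated_at as the ordering column, so the table must have one):

-- models/orders_deduped.sql (hypothetical names)
-- Keeps the most recently updated row per order_id.
{{ deduped_dataset('shop', 'orders', key='order_id') }}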

@switzer commented Mar 1, 2021

We've had resource issues using row_number() compared with array_agg(). Our dedupe macro is as follows:

{% macro dedupe_source(source, group_by, order_by = None) %}
  select
    event.*
  from (
    select
      array_agg (
        {{ source }}
        {% if order_by != None %}
        order by {{ order_by }}
        {% endif %}
        limit 1
      )[offset(0)] event
    from {{ source }}
    group by {{ group_by }}
  )
{% endmacro %}

It is more performant (it can handle more rows, though not necessarily faster), so you may want to consider going down this path. Adding the LIMIT 1 allows BigQuery to vastly reduce the amount of data it needs to process for the query.

In addition, rather than passing in a dataset and table, it would be better to pass in a source so that this can be used as a separate CTE. Of course, you could call dedupe_source(source(db, table)) if you like, so this would supersede the existing functionality.

Also, we use this macro to dedupe based on a configurable column AND order by, so that you can control which row is selected.

@blake-enyart (Author)

Hi @switzer and @rwatts3 !

Thank you both for sharing your dedupe methodologies. I'm working on standing up some client projects right now and hope to utilize these shortly :)

@thiklock

Hi @blake-enyart , @switzer and @rwatts3 !

I was trying to apply this method across many columns (4), but I got some errors.

I was wondering if you managed to find a solution for deduping based on many columns.

@switzer commented Sep 24, 2021

There is no limit to the number of columns you can deduplicate across; just pass a comma-separated list in the string. For example, you can call the macro as follows:

with
source as (
  select * from table
),
optimised_source as (

    {{ dedupe_source('source', 'col1, col2, col3, col4', 'col9, col8, col7 desc' ) }}

)

...

@keurcien

Great idea! Using Airbyte in incremental mode without normalization, I could see a great benefit in having such a macro dedicated to Airbyte's raw data. Thanks for sharing your snippet @switzer

@joellabes (Contributor)

👋 Hey team! I'd welcome a PR that implemented this.

It will need to be cross-database compatible by the time we merge it in. I don't know offhand whether all four of the supported adapters (Snowflake, BigQuery, Redshift, Postgres) have array_agg or not; if they don't, falling back to the row_number-based approach seems reasonable. Also, remember that there are integration tests that can validate everything works correctly across all warehouses, so you don't have to have access to them all individually.

Happy to help out with any questions along the way 🚀
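
For anyone picking this up: in dbt, cross-database behaviour like this is usually wired up with adapter.dispatch, which resolves to an adapter-specific macro when one exists and otherwise falls back to a default. A minimal sketch, assuming a macro named deduplicate (the name and signature here are assumptions, not a settled API):

{% macro deduplicate(relation, partition_by, order_by) %}
    {# Resolves to e.g. bigquery__deduplicate if defined, else default__deduplicate. #}
    {{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by)) }}
{% endmacro %}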

@judahrand (Contributor) commented Mar 6, 2022

@joellabes I thought I'd take a crack at this. It turns out that:

  • Redshift does not have an ARRAY_AGG function.
  • Snowflake and Postgres do not have a LIMIT clause available in their ARRAY_AGG implementations.

For these reasons, I've ended up implementing the ROW_NUMBER approach as the default and the ARRAY_AGG approach as a BigQuery-specific overload.

It was also necessary for the macro to take a relation as an argument, because most of the databases do not have a SELECT * EXCEPT (...) clause and we introduce a new column with ROW_NUMBER. Therefore, we have to use dbt_utils.star to build the select list from the relation.
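
To make that concrete, here is a rough sketch of the shape described above (an illustration under the stated design, not the exact code merged in #512):

{%- macro default__deduplicate(relation, partition_by, order_by) -%}

    select
        {# dbt_utils.star re-derives the column list, so the helper rn column is not exposed. #}
        {{ dbt_utils.star(from=relation, relation_alias='deduped') }}
    from (
        select
            _inner.*,
            row_number() over (
                partition by {{ partition_by }}
                order by {{ order_by }}
            ) as rn
        from {{ relation }} as _inner
    ) as deduped
    where deduped.rn = 1

{%- endmacro -%}

{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}

    {# BigQuery overload: array_agg ... limit 1 per group, as in the snippets above. #}
    select unique_row.*
    from (
        select
            array_agg(original order by {{ order_by }} limit 1)[offset(0)] as unique_row
        from {{ relation }} as original
        group by {{ partition_by }}
    )

{%- endmacro -%}

Snowflake and Postgres would then fall through to the default, consistent with the ARRAY_AGG limitations noted above.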

@dbeatty10 (Contributor)

Fixed by #512
