Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include_columns argument breaks one_hot_encoder macro #12

Open
andrewlin0 opened this issue Aug 6, 2021 · 2 comments
Open

include_columns argument breaks one_hot_encoder macro #12

andrewlin0 opened this issue Aug 6, 2021 · 2 comments

Comments

@andrewlin0
Copy link

I have been working with the one_hot_encoder macro on a feature store and I came across a bug when using it.

Bug

The bug that I encountered was that when the include_columns argument was not '*', the code for the one_hot_encoder would insert blanks as the columns in the select statement and the code would not compile because of the comma that would be alone there.

For example, if the code was this:

gender as ( {{ dbt_ml_preprocessing.one_hot_encoder( source_table = ref('my_table'), source_column = 'GENDER', include_columns = ['ID']) }} )

The compiled SQL file would provide an "unexpected ',' error" because the line would look like:

select , case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)

When it should be:

select GENDER, case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)

I looked at the code for the one_hot_encoder and I found the problem to be in lines 102-104 of the code. (In the deafult__one_hot_encoder macro).

For some reason, when include_columns is set to a list of columns we want, the "column" in "for column in col_list" is already the name of what is in the list. For example, if include_columns = ['ID'], column would be 'ID', so when you do column.name on that, it would be like 'ID'.name and it returns a blank for some reason.

It works fine when include_columns = "*" because that makes "column" something like SnowflakeColumn(column='name', dtype='VARCHAR', char_size = 16777216, numeric_precision=None, numeric_scale=None), so when you do column.name on it, it will give you the name.

Quick Solution

I have come up with a quick solution to this bug, and it is probably not the most elegant and robust. In the original code, under line 55, I made a new variable called 'flag' and set it equal to 'inc' like this: {% set flag = 'not_inc' %}. This will be a variable that tells the macro if the include_columns argument is "*" or not. Then under line 60 in the original code, I set flag = 'inc' because it would mean the include_columns variable was set by the programmer. So like this: {% set flag = 'inc' %}.

Then on line 70 of the original code, I added that variable into the adapter.dispatch thing, so the new line was:

{{ adapter.dispatch('one_hot_encoder',packages=['dbt_ml_preprocessing'])(source_table, source_column, category_values, handle_unknown, col_list, flag) }}

as opposed to

{{ adapter.dispatch('one_hot_encoder',packages=['dbt_ml_preprocessing'])(source_table, source_column, category_values, handle_unknown, col_list) }}

Finally, for the default__one_hot_encoder macro on line 73 of the original code, I changed the logic up a bit to catch the instances where the include_columns argument was not "*". I replaced lines 76-78 of the original code with:

 {%- if flag == 'not_inc' -%}
         {% for column in col_list %}
             {{ column.name }},
         {%- endfor -%}

 {%- elif flag == 'inc' -%}
        {% for column in col_list %}
            {{ column }},
        {%- endfor -%}
    {%- endif -%}

This logic makes sure that if include_arguments is something like ['ID'], then it will not do column.name on 'ID', but rather just take the column name straight from column. Otherwise, when include_columns="*", proceed like before.

This solution may not be robust, so I was hoping to find out why this bug occurs and maybe find a better fix for it.

@Chryzanthemum
Copy link

you're a wizard ty

@coisnepe
Copy link

Wouldn't these be easier ?

image image

OR

image image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants