Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add unit argument and apply conversion for kms #340

Merged

Conversation

bastienboutonnet
Copy link
Contributor

@bastienboutonnet bastienboutonnet commented Feb 27, 2021

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master

  • new functionality — please ensure the base branch is the latest dev/ branch
    To be honest I'm not too sure what this counts as. Haversine distance macro doesn't support kms #338 is labelled as a bug, but it's also kind of a feature request --let me know what you want to consider this one as and I'll adjust the diff target

  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

the haversine_distance() calculation outputs results in miles only at the minute. Interest in being able to output haversine in kilometers was registered in #338 .

Changes

calling with unit='km' (distance between Amsterdam and Paris)

select {{haversine_distance(48.864716, 2.349014, 52.379189, 4.899431, unit='km')}} as hav_dist
from table_a

returns ~ 430
Screenshot 2021-02-27 at 14 58 53

calling witout a unit param or (unit='mi')

select {{haversine_distance(48.864716, 2.349014, 52.379189, 4.899431)}} as hav_dist
from table_a

returns ~267
Screenshot 2021-02-27 at 14 57 34

Closes #338

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)

    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have updated the README.md (if applicable)

  • I have added tests & descriptions to my models (and macros if applicable)

  • I have added an entry to CHANGELOG.md

    • Not sure if this change would cause a minor or patch increment. In semver world since this add a backwards compatible features it should trigger a bump to 0.7.0 but not sure if you follow that here, so let me know and then I can update the changelog accordingly.

@bastienboutonnet bastienboutonnet marked this pull request as ready for review February 27, 2021 14:05
@joellabes
Copy link
Contributor

🙌 Thanks @bastienboutonnet!

This repo contains another project which imports dbt-utils and runs integration tests against seed files. Check out https://github.com/fishtown-analytics/dbt-utils/pull/334/files for an example of configuring tests

@bastienboutonnet
Copy link
Contributor Author

bastienboutonnet commented Mar 1, 2021

Thanks for the pointers @joellabes I'll see about writing a test for this function at some point this week

@bastienboutonnet
Copy link
Contributor Author

@joellabes I added some tests and they seem to pass locally and on circle CI except for snowflake and BQ. I'm assuming because I'm a 3rd party these tests are bound to fail? Either way, let me know if that's the issue of if there is anything I should do (or not do) in the code.

Regarding the changelog todo on the PR template my question still stands let me know what I should do when we get closer to merge I guess?

@joellabes
Copy link
Contributor

I'm assuming because I'm a 3rd party these tests are bound to fail?

I don't think that's true - I've had all the adapters succeed on past PRs. My best guess is that BQ and SF do something differently - perhaps their column type inference isn't playing nicely with the seed, or the conversion_rate multiplication doesn't work for some reason. I only have access to a Redshift warehouse so can't muck around to find out why, and I'm external to Fishtown too so I can't see the CircleCI errors - @clrcrl will have to let you know the specifics.

In integration_tests/models/geo/test_haversine_distance_mi.sql, it might be nice to union that model and run it again without passing in the mi argument to test the default/backwards-compatible case as well (unless you have that covered somewhere else that I missed).

0.7.0 is the next significant (intentionally avoiding semver words) version of dbt-utils, so I expect that this sort of net-new functionality would go there as opposed to a patch version. Again a good q for Claire

@bastienboutonnet
Copy link
Contributor Author

bastienboutonnet commented Mar 8, 2021

@joellabes with the union part of your message I'm not sure I understand. I think I understand you'd like a test where we call the macro without the unit argument which makes tons of sense. But I don't know why union. Did you have something in mind?

Btw yeah you're right the Snowflake issue seems to be with the fact that it doesn't like ^. I think it probably uses pow() or something but then I'm curious if there is a snowflake adaptor specific version of the macro somewhere @joellabes @clrcrl ?

@joellabes
Copy link
Contributor

Did you have something in mind?

Broadly, I was thinking something like

select convert(lat_1, lon_1, lat_2, lon_2, ‘mi’)
from table

union all

select convert(lat_1, lon_1, lat_2, lon_2)
from table

(written on my phone so only very broadly accurate)

By doing that, you’d get 0, 1 or 2 failing rows depending on which part broke. It’s probably bad practice to overload the test, but I was trying to avoid having to make a third separate query 😬

not sure if there’s a SF-friendly version of the macro sorry! Weird that it got this far without breaking. Maybe it’s never worked but no one uses this on BQ or SF

@bastienboutonnet
Copy link
Contributor Author

bastienboutonnet commented Mar 8, 2021

@joellabes Ok I see what you meant. I can get that done in that way I think it's not that bad actually coming to think about it.

So yeah I checked into my team's macro for haversine and we have used the pow() functions on our Snowflake warehouse. I think there is a way to distribute warehouse specific macros and let the macro adaptor choose it at runtime, I'll see if there is an example of this already on the repo or wait till you or @clrcrl let me know how best to do it.

But YES! very strange people didn't complain so far, or maybe they have been like us and have used their own macros for haversine. Anyway, would be a great idea to fix that for everyone right?

UPDATE: I checked the postgres documentation and it looks like they also have a pow() function which could be also a solution, not sure if it works for both Redshift and BQ so we probably still want to go down the route of distributing adapted macros for some warehouses. Let me know what your thoughts are on this.

UPDATE 2 actually BQ fails on the radians so that's another issue and they don't seem to have a builtin math function to convert degrees to radians which is gonna call for an adapted macro for BQ for sure.

@bastienboutonnet
Copy link
Contributor Author

bastienboutonnet commented Mar 8, 2021

@joellabes I just implemented the unioned test and it seems to work fine on pg and redshift.

What do you think should happen qua the other dbs? Do we want to make adapters part of this PR?

I think if so, maybe we shouldn't as it's gonna be a pretty haphazard PR but I'd be happy to contribute to a snowflake adapter if I can figure out how the adapters macros are done here. It's been a while since my materialisation macros that I contributed in core.

@bastienboutonnet
Copy link
Contributor Author

I think I may just be good with creating a new macro and name it snowflake__haversine_distance with the db specific function in it and dbt would take care of selecting it based on the adapter type?

For BQ I was looking and they don't seem to have radians function to convert something that's in degrees + I cannot test it as I can't play with a BQ warehouse. Looks like if we implemented it for BQ we'd also have to make some maths happen to convert radians to degrees. It's probably doable though.

Let me know what makes sense.

@joellabes
Copy link
Contributor

Sorry for the delay getting back to you!

I think I may just be good with creating a new macro and name it snowflake__haversine_distance with the db specific function in it and dbt would take care of selecting it based on the adapter type?

Yes I think that's right!

Re BigQuery, if it was already broken and you don't use BQ, I don't think you need to feel obliged to bend over backwards to make it work. You could either add an implementation that no-ops and returns null, or throw the equivalent of a NotImplementedException, or do nothing at all and it will throw a more confusing error. When a BQ user comes along who wants this, they can always open another PR. I think there are other macros that do this but I don't remember which offhand.

@bastienboutonnet
Copy link
Contributor Author

bastienboutonnet commented Mar 11, 2021

no worries at all @joellabes

Ok I'll have a go at a snowflake implementation for sure because I can actually test it out. For BQ I think what you propose makes sense, however not sure why you're pointing me to the Microsoft website here:

or throw the equivalent of a NotImplementedException,

I'd actually like to be able to throw a proper error if BQ users were to use this macro rather than a null. How best do you think we can do that?

@joellabes
Copy link
Contributor

not sure why you're pointing me to the Microsoft website here

In a past life I was a C# developer, so that was the concept I knew it as - I'm not sure what the dbtonic way of doing this is.

The closest equivalent I could find was https://github.com/fishtown-analytics/dbt-utils/blob/c64b52068070023bc3524525e06a14ad72d485a4/macros/cross_db_utils/datediff.sql#L55. I think you'd basically have to make a stub of a bigquery implementation that threw the exception.

Then there would have to be some way of excluding that one from CI, because an exception that's intentionally thrown is still an exception... Hopefully @clrcrl has some way of excluding them somehow 🤔

@bastienboutonnet
Copy link
Contributor Author

@joellabes so I looked into the postgres and redshift documentations and they also have the pow() function that is the only way to make snowflake calculate power. So in the interest of not duplicating too much code I went ahead and made the default__ implementation use pow(). Judging by the tests this seems to work fine for PG, redshift and snowflake which is really nice.

I'll work on making the BQ exception raising happen soon.

@bastienboutonnet bastienboutonnet force-pushed the feat/haversine_in_km_and_miles branch 3 times, most recently from fa915ca to c1ae430 Compare March 14, 2021 20:36
@bastienboutonnet
Copy link
Contributor Author

@joellabes @clrcrl in the end I went ahead and also worked an adaptor macro for big query, it was a bit of a pain but I managed. I created so many failing CI runs I hope you folks don't get notified on each failures.

Anyway, let me know if you have any additional feedback

@joellabes
Copy link
Contributor

Incredible, thanks so much @bastienboutonnet! Sorry that this turned out to be so much more work than first thought 😭

@bastienboutonnet
Copy link
Contributor Author

Ah no worries, if I didn't want to do it I would have said it. I was stuck on my other open source projects today so it was something fun and useful to do.

Is there a changelog entry I should be writing btw? And or some documentation?

Also, not sure if this should be merged into the main branch or develop.

I guess by this point this PR is more of a feature than a bugfix so I rebase on dev/ and change the target of the PR.

@joellabes
Copy link
Contributor

Not sure on those logistics - Claire's got "clear the packages backlog" on her sprint for this fortnight so I imagine she'll be checking this out pretty soon though!

@clrcrl clrcrl changed the base branch from master to dev/0.7.0 April 16, 2021 18:47
Copy link
Contributor

@clrcrl clrcrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these changes! Yay km!!!

I wrote some suggestions for how this code might be made a little cleaner :)

{% endmacro %}

{% macro default__haversine_distance(lat1,lon1,lat2,lon2) -%}
{% macro default__haversine_distance(lat1,lon1,lat2,lon2,unit) -%}
Copy link
Contributor

@clrcrl clrcrl Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's set this as an optional argument with a default!

Suggested change
{% macro default__haversine_distance(lat1,lon1,lat2,lon2,unit) -%}
{% macro default__haversine_distance(lat1, lon1, lat2, lon2, unit='mi') -%}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, wasn't sure if we wanted to implement it on the specific macros but that makes sense.

{% macro default__haversine_distance(lat1,lon1,lat2,lon2) -%}
{% macro default__haversine_distance(lat1,lon1,lat2,lon2,unit) -%}
{# vanilla macro is in miles #}
{% set conversion_rate = '' %}
Copy link
Contributor

@clrcrl clrcrl Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little cleaner, and has some error handling. I like setting variables as numbers instead of strings where possible, so I've updated the code below as well

Suggested change
{% set conversion_rate = '' %}
{%- if unit == 'mi' %}
{% set conversion_rate = 1 %}
{% elif unit == 'km' %}
{% set conversion_rate = 1.60934 %}
{% else %}
{{ exceptions.raise_compiler_error("unit input must be one of 'mi' or 'km'. Got ' ~ unit) }}
{% endif -%}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks thats a lot better indeed! Done :)

macros/geo/haversine_distance.sql Outdated Show resolved Hide resolved
macros/geo/haversine_distance.sql Outdated Show resolved Hide resolved
multiply instead of divide and move logic in core macro

rename things a bit

update readme
too many curly brackets

make params be strings

pass strings

remove round to see if this was the problem

let's try casting

round the output field instead

add commas

cast the macro result directly

rename expected to output

is it a rounding error

swap the units

what is going on...

throw a log to see what we get

remove curlies

try computing in a CTE first'

add comma

I'm sleeping...

fixup! I'm sleeping...

rename in CTE

dont quote?

fix macro and separate unit in two tests

fix yaml indent

fix indentation in schema yaml

alias the field

clean up

fixup! clean up

make the macro log stuff

remove logging in test macro

remove the select star
@bastienboutonnet
Copy link
Contributor Author

Hey @clrcrl thanks a lot for the feedback all done. Hope this works 🥳

Copy link
Contributor

@clrcrl clrcrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@clrcrl clrcrl merged commit f4ef848 into dbt-labs:dev/0.7.0 Apr 21, 2021
@bastienboutonnet
Copy link
Contributor Author

@clrcrl Thanks for merging the PR. Btw, I had not added any changelog entries yet because I wasn't sure whether this was bug or feature. Would you like me to add something to the changelog or you will do it when you release?

@bastienboutonnet bastienboutonnet deleted the feat/haversine_in_km_and_miles branch April 23, 2021 17:18
clrcrl pushed a commit that referenced this pull request May 18, 2021
clrcrl pushed a commit that referenced this pull request May 18, 2021
clrcrl pushed a commit that referenced this pull request May 18, 2021
clrcrl pushed a commit that referenced this pull request May 18, 2021
clrcrl pushed a commit that referenced this pull request May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Haversine distance macro doesn't support kms
3 participants