Update to Splink v4.0 #3834
-import splink.duckdb.comparison_level_library as cll
-import splink.duckdb.comparison_library as cl
-import splink.duckdb.comparison_template_library as ctl
-from splink.duckdb.blocking_rule_library import block_on
+import splink.comparison_level_library as cll
+import splink.comparison_library as cl
+from splink import block_on
There was a big namespace reorganization in v4. Same modules and functions, but imported from different places.
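For anyone migrating other modules, here is a minimal sketch of the import rewrite, built only from the import lines touched in this diff. `rewrite_import` and both lookup tables are hypothetical names made up for illustration; the mapping is not a complete list of what moved in Splink v4.

```python
from typing import Optional

# Old import line -> new import line, copied from this diff only.
V3_TO_V4_IMPORTS = {
    "import splink.duckdb.comparison_level_library as cll":
        "import splink.comparison_level_library as cll",
    "import splink.duckdb.comparison_library as cl":
        "import splink.comparison_library as cl",
    "from splink.duckdb.blocking_rule_library import block_on":
        "from splink import block_on",
}

# This import is deleted in the diff with no replacement line.
V3_REMOVED_IMPORTS = {
    "import splink.duckdb.comparison_template_library as ctl",
}


def rewrite_import(line: str) -> Optional[str]:
    """Return the v4 form of a known v3 import line.

    Returns None for an import the diff drops, and the line unchanged
    when it is not one of the imports listed above.
    """
    stripped = line.strip()
    if stripped in V3_REMOVED_IMPORTS:
        return None
    return V3_TO_V4_IMPORTS.get(stripped, line)
```

Any remaining `ctl.` call sites still need to be ported by hand, since dropping the import does not rename the calls.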
@@ -19,7 +18,7 @@
 blocking_rule_7 = "l.report_year = r.report_year and l.capacity_mw = r.capacity_mw and substr(l.plant_name_mphone,1,2) = substr(r.plant_name_mphone,1,2)"
 blocking_rule_8 = "l.report_year = r.report_year and l.installation_year = r.installation_year and substr(l.plant_name_mphone,1,2) = substr(r.plant_name_mphone,1,2)"
 blocking_rule_9 = "l.report_year = r.report_year and l.construction_year = r.construction_year and substr(l.plant_name_mphone,1,2) = substr(r.plant_name_mphone,1,2)"
-blocking_rule_10 = block_on(["report_year", "net_generation_mwh"])
+blocking_rule_10 = block_on("report_year", "net_generation_mwh")
The function signature changed: `block_on` now takes an arbitrary number of positional arguments instead of a single list.
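To make the list-versus-positional difference concrete, here is a small stand-in written in plain Python. `equality_rule` is a hypothetical helper, not Splink's `block_on`: it just spells out the equality conditions as a SQL string in the same `l.`/`r.` style as the hand-written rules above, and the SQL that Splink actually generates may differ.

```python
def equality_rule(*column_names):
    """Build an equality blocking rule as a SQL string in the l./r. style.

    Hypothetical illustration only; Splink's block_on returns its own
    rule object rather than a plain string.
    """
    # Tolerate the old call shape, equality_rule(["a", "b"]), by
    # unpacking a single list or tuple argument.
    if len(column_names) == 1 and isinstance(column_names[0], (list, tuple)):
        column_names = tuple(column_names[0])
    if not column_names:
        raise ValueError("at least one column name is required")
    return " and ".join(f"l.{col} = r.{col}" for col in column_names)


# New call shape: one positional argument per column.
rule = equality_rule("report_year", "net_generation_mwh")
```

The single-list branch is only there to show what a compatibility shim could look like; the change in this PR simply updates the call site.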
-plant_name_comparison = ctl.name_comparison(
+plant_name_comparison = cl.NameComparison(
A lot of things that were previously functions have been replaced with classes.
term_frequency_adjustments=True,
)
fuel_type_code_pudl_comparison = cl.exact_match(
"fuel_type_code_pudl", term_frequency_adjustments=True
)
utility_name_comparison.configure(term_frequency_adjustments=True)
fuel_type_code_pudl_comparison = cl.ExactMatch("fuel_type_code_pudl")
fuel_type_code_pudl_comparison.configure(term_frequency_adjustments=True)
This seems like a funny change to me, but the `.configure()` method is now used to set attributes that are common to all the different kinds of match classes.
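One way to picture why shared options would live on a separate method is the toy sketch below. The `Toy*` classes are made up for illustration and are not Splink's implementation; the only things taken from the diff are the `term_frequency_adjustments` keyword and the `fuel_type_code_pudl` column, and the `utility_name` column name is an assumption.

```python
class ToyComparison:
    """Toy base class holding an option shared by every comparison type."""

    def __init__(self, column_name):
        self.column_name = column_name
        self.term_frequency_adjustments = False

    def configure(self, *, term_frequency_adjustments=False):
        # Shared settings are applied here, so each subclass constructor
        # only needs the arguments specific to that comparison.
        self.term_frequency_adjustments = term_frequency_adjustments
        return self  # returning self lets the toy calls be chained


class ToyExactMatch(ToyComparison):
    pass


class ToyNameComparison(ToyComparison):
    pass


# Two-step style, as in the diff.
fuel = ToyExactMatch("fuel_type_code_pudl")
fuel.configure(term_frequency_adjustments=True)

# Chained style, possible in this toy because configure() returns self.
utility = ToyNameComparison("utility_name").configure(
    term_frequency_adjustments=True
)
```

In this sketch every subclass inherits `configure()` unchanged, which is the sense in which the option is common to all the comparison classes.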
-return ctl.date_comparison(
+return cl.DateOfBirthComparison(
 column_name,
-damerau_levenshtein_thresholds=[],
-datediff_thresholds=[1, 2],
-datediff_metrics=["year", "year"],
+input_is_string=False,
+datetime_thresholds=[1, 2],
+datetime_metrics=["year", "year"],
Weirdly, they now call this `DateOfBirthComparison`, as if that's the only kind of date you might want to compare.
linker = Linker(
[eia_df, ferc_df],
settings=settings,
input_table_aliases=["eia_df", "ferc_df"],
settings_dict=settings_dict,
db_api=DuckDBAPI(),
The `Linker` class is agnostic as to which database backend it uses: you pass in an instance of a backend API class (here `DuckDBAPI()`) that it uses to talk to the database.
-damerau_levenshtein_thresholds=[],
-datediff_thresholds=[1, 2],
-datediff_metrics=["year", "year"],
+input_is_string=False,
This `input_is_string=False` parameter is the only thing I was worried about being more than a rename. It means that the date field being matched is a `Date` or `DateTime` type, which I wasn't sure about. But the ETL worked, so I'm assuming it was correct.
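If we want something firmer than "the ETL worked", a quick check on a sample of the column's values would confirm which setting is right. The helper below is a hypothetical sketch using plain Python values; for a pandas column you would more likely inspect the dtype, and pandas null markers such as `NaT` are not handled here.

```python
from datetime import date, datetime


def looks_like_string_dates(sample_values):
    """Return True if the non-null sample values are all str, and False
    if they are all date or datetime objects.

    Hypothetical helper: the result is the value one would expect to
    pass as input_is_string for that column.
    """
    non_null = [v for v in sample_values if v is not None]
    if not non_null:
        raise ValueError("no non-null values to inspect")
    if all(isinstance(v, str) for v in non_null):
        return True
    # datetime is a subclass of date, so this covers both types.
    if all(isinstance(v, date) for v in non_null):
        return False
    raise TypeError("mixed or unexpected value types in date column")
```

A `False` result for the installation and construction date columns would back up the `input_is_string=False` choice.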
* Update dependencies to allow imminent Numpy v2.0 release for testing
* Update minimum versions for Numpy 2.0 compatibility
* Fix Numpy 2.0 dtype issues and relock dependencies.
* Update dependencies to require Splink v4
* Updated Splink API usage to v4. Unit tests pass.
* Update linker namespace to work with Splink v4
* Update to Splink v4.0 (#3834)
* Update dependencies to require Splink v4
* Updated Splink API usage to v4. Unit tests pass.
* Update linker namespace to work with Splink v4
* Add release notes about Numpy v2 & Splink v4 update.
Overview
Closes #3735
Testing
To-do list