Update to Splink v4.0 #3834

Merged
zaneselvans merged 3 commits into numpy-2.0 from splink-v4 on Sep 8, 2024

zaneselvans (Member) commented Sep 7, 2024

Overview

Closes #3735

  • Splink v4 has been out for a few months. This PR updates our imports and API calls to use the new version.
  • The underlying methods haven't changed, but the code and API have been cleaned up, and there are some performance improvements.

Testing

To-do list

zaneselvans added the labels dependencies (pull requests that update a dependency file) and record-linkage (issues related to connecting related records / entities that don't have explicit IDs or keys) Sep 7, 2024
zaneselvans self-assigned this Sep 7, 2024
zaneselvans linked an issue Sep 7, 2024 that may be closed by this pull request
zaneselvans changed the title from "Update dependencies to require Splink v4" to "Update dependencies to use Splink v4" Sep 7, 2024
Comment on lines -8 to +10
-import splink.duckdb.comparison_level_library as cll
-import splink.duckdb.comparison_library as cl
-import splink.duckdb.comparison_template_library as ctl
-from splink.duckdb.blocking_rule_library import block_on
+import splink.comparison_level_library as cll
+import splink.comparison_library as cl
+from splink import block_on

There was a big namespace reorganization in v4. Same modules and functions, but imported from different places.
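
A rough sketch of the reorganization, using only the import lines visible in this diff. The `V3_TO_V4_IMPORTS` table and the `migrate_import_line` helper are hypothetical names for illustration and are not part of Splink. The `comparison_template_library` import maps to `None` because this PR drops it and uses `cl.NameComparison` instead.

```python
from typing import Optional

# Import lines touched by this PR: old (v3) line -> new (v4) line.
# None marks an import that the PR removes rather than renames.
V3_TO_V4_IMPORTS = {
    "import splink.duckdb.comparison_level_library as cll":
        "import splink.comparison_level_library as cll",
    "import splink.duckdb.comparison_library as cl":
        "import splink.comparison_library as cl",
    "import splink.duckdb.comparison_template_library as ctl": None,
    "from splink.duckdb.blocking_rule_library import block_on":
        "from splink import block_on",
}


def migrate_import_line(line: str) -> Optional[str]:
    """Return the v4 form of a known v3 import line.

    Returns None when the import was dropped, and the line unchanged
    when it is not one of the imports listed above.
    """
    stripped = line.strip()
    if stripped in V3_TO_V4_IMPORTS:
        return V3_TO_V4_IMPORTS[stripped]
    return line
```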

@@ -19,7 +18,7 @@
blocking_rule_7 = "l.report_year = r.report_year and l.capacity_mw = r.capacity_mw and substr(l.plant_name_mphone,1,2) = substr(r.plant_name_mphone,1,2)"
blocking_rule_8 = "l.report_year = r.report_year and l.installation_year = r.installation_year and substr(l.plant_name_mphone,1,2) = substr(r.plant_name_mphone,1,2)"
blocking_rule_9 = "l.report_year = r.report_year and l.construction_year = r.construction_year and substr(l.plant_name_mphone,1,2) = substr(r.plant_name_mphone,1,2)"
-blocking_rule_10 = block_on(["report_year", "net_generation_mwh"])
+blocking_rule_10 = block_on("report_year", "net_generation_mwh")

Change in the function signature. Takes an arbitrary number of positional args now.
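
To illustrate the calling convention only, here is a minimal sketch. `block_on_sketch` is a hypothetical stand-in, not Splink's `block_on`: it returns a plain SQL string in the style of the hand-written rules above, whereas the real function builds a blocking rule for the chosen backend.

```python
def block_on_sketch(*col_names: str) -> str:
    """Hypothetical stand-in that accepts column names as positional args."""
    if not col_names:
        raise ValueError("at least one column name is required")
    # Join one equality condition per column, like the manual rules above.
    return " and ".join(f"l.{col} = r.{col}" for col in col_names)


# v4-style call: each column is its own positional argument.
rule = block_on_sketch("report_year", "net_generation_mwh")

# A list left over from v3-style code can be unpacked with * instead of
# being passed as a single list argument.
cols = ["report_year", "net_generation_mwh"]
assert block_on_sketch(*cols) == rule
```

Code that still builds its column list dynamically can keep doing so and unpack the list at the call site.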

Comment on lines -36 to +35
-plant_name_comparison = ctl.name_comparison(
+plant_name_comparison = cl.NameComparison(

A lot of things that were previously functions have been replaced with classes.

-term_frequency_adjustments=True,
 )
-fuel_type_code_pudl_comparison = cl.exact_match(
-"fuel_type_code_pudl", term_frequency_adjustments=True
-)
+utility_name_comparison.configure(term_frequency_adjustments=True)
+fuel_type_code_pudl_comparison = cl.ExactMatch("fuel_type_code_pudl")
+fuel_type_code_pudl_comparison.configure(term_frequency_adjustments=True)

This seems like a funny change to me, but the .configure() method is used to set attributes which are common to all the different kinds of match classes now.
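
A sketch of the same pattern written as one chained expression. It assumes Splink v4 is installed and that `configure()` returns the comparison object it was called on; the two-statement form in the diff works either way. The `fuel_type_comparison` helper name is hypothetical.

```python
try:
    import splink.comparison_library as cl  # Splink v4 namespace
except ImportError:
    cl = None  # lets this sketch be loaded where Splink is not installed


def fuel_type_comparison():
    """Build the exact-match comparison from the diff, or None without Splink v4."""
    if cl is None or not hasattr(cl, "ExactMatch"):
        return None
    # The constructor takes what is specific to this comparison (the column);
    # configure() sets the option shared across the comparison classes.
    # Assumption: configure() returns the comparison, so it can be chained.
    return cl.ExactMatch("fuel_type_code_pudl").configure(
        term_frequency_adjustments=True
    )
```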

Comment on lines -84 to +85
-return ctl.date_comparison(
+return cl.DateOfBirthComparison(
 column_name,
-damerau_levenshtein_thresholds=[],
-datediff_thresholds=[1, 2],
-datediff_metrics=["year", "year"],
+input_is_string=False,
+datetime_thresholds=[1, 2],
+datetime_metrics=["year", "year"],

Weirdly they went to calling this DateOfBirthComparison as if that's the only kind of date you might want to compare.

Comment on lines +225 to +229
linker = Linker(
[eia_df, ferc_df],
settings=settings,
input_table_aliases=["eia_df", "ferc_df"],
-settings_dict=settings_dict,
db_api=DuckDBAPI(),

The Linker class is agnostic as to which database backend it uses: you pass in an instance of the class that interfaces with that backend (here, DuckDBAPI()).
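
A minimal sketch of that idea, assuming Splink v4 with its DuckDB backend installed. `build_linker` and its `db_api` parameter are hypothetical wrappers around the call in the diff; the point is that swapping backends means passing a different API object, not choosing a different linker class.

```python
def build_linker(eia_df, ferc_df, settings, db_api=None):
    """Construct a Linker for the two input tables.

    db_api is any Splink database API object; when omitted, a DuckDB
    backend is created, matching the call shown in the diff.
    """
    # Imported lazily so this module loads even where Splink is missing.
    from splink import DuckDBAPI, Linker

    return Linker(
        [eia_df, ferc_df],
        settings=settings,
        input_table_aliases=["eia_df", "ferc_df"],
        db_api=db_api if db_api is not None else DuckDBAPI(),
    )
```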

-damerau_levenshtein_thresholds=[],
-datediff_thresholds=[1, 2],
-datediff_metrics=["year", "year"],
+input_is_string=False,

This input_is_string=False parameter is the only thing I was worried about being more than a rename -- it means that the date field that's being matched is a Date or DateTime type, which I wasn't sure about. But the ETL worked, so I'm assuming it was correct.
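
One way to check this assumption directly, rather than inferring it from a successful ETL run, is to inspect the values being matched. The sketch below is a hypothetical helper, not part of Splink or PUDL, and it assumes the column can be read out as plain Python objects with `None` for missing values.

```python
import datetime


def infer_input_is_string(values) -> bool:
    """Return True if all non-null values are str, False if all are dates.

    datetime.datetime is a subclass of datetime.date, so a single
    isinstance check covers both Date and DateTime values.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        raise ValueError("no non-null values to inspect")
    if all(isinstance(v, str) for v in non_null):
        return True
    if all(isinstance(v, datetime.date) for v in non_null):
        return False
    raise TypeError("values mix strings and dates, or hold another type")
```

If this returns `False` for the date column, `input_is_string=False` matches the data.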

zaneselvans requested review from pudlbot and removed request for katie-lamb September 8, 2024 03:58
zaneselvans changed the base branch from numpy-2.0 to main September 8, 2024 04:02
zaneselvans marked this pull request as ready for review September 8, 2024 04:02
pudlbot changed the title from "Update dependencies to use Splink v4" to "Update to Splink v4.0" Sep 8, 2024
pudlbot added this pull request to the merge queue Sep 8, 2024
pudlbot removed this pull request from the merge queue due to a manual request Sep 8, 2024
zaneselvans changed the base branch from main to numpy-2.0 September 8, 2024 04:11
zaneselvans merged commit f3935cf into numpy-2.0 Sep 8, 2024
18 checks passed
zaneselvans deleted the splink-v4 branch September 8, 2024 04:12
github-merge-queue bot pushed a commit that referenced this pull request Sep 8, 2024
* Update dependencies to allow imminent Numpy v2.0 release for testing

* Update minimum versions for Numpy 2.0 compatibility

* Fix Numpy 2.0 dtype issues and relock dependencies.

* Update dependencies to require Splink v4

* Updated Splink API usage to v4. Unit tests pass.

* Update linker namespace to work with Splink v4

* Update to Splink v4.0 (#3834)

* Update dependencies to require Splink v4
* Updated Splink API usage to v4. Unit tests pass.
* Update linker namespace to work with Splink v4

* Add release notes about Numpy v2 & Splink v4 update.
Labels: dependencies, record-linkage
Project status: Done
Linked issue: Update to Splink v4.0
2 participants