
Migrate where/when we filter for the freshest XBRL data #3861

Open
wants to merge 8 commits into base: main
Conversation

@cmgosnell (Member) commented Sep 20, 2024

Overview

This was done in conjunction with @jdangerx!

What problem does this address?
It came to our attention in #3857 that having filter_for_freshest_data inside of the XBRL io_manager was... a lot! It included a pretty slow step that was mostly a validation step.

What did you change?

  • move the filter_for_freshest_data processing out of the io_manager and into transform.ferc1. It took the longest to figure out where/when in the process this should happen... we ended up putting it in ferc1_transform_asset_factory, because that is where the raw tables get concatenated when there is more than one input raw table, so it is the last place where we have individual raw tables.
  • move __compare_dedupe_methodologies into a validation test
  • move the unit tests
  • also added a little defensive check in ferc1_transform_asset_factory to make sure all of the raw tables have either an instant or a duration table... because we had a typo in one of the input tables 😬 (see the sketch after this list)
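
For illustration, one plausible shape of that defensive check, assuming the raw FERC 1 XBRL table names carry an _instant or _duration suffix (a sketch only, not the shipped code; the function name is hypothetical):

def check_instant_or_duration(raw_table_names: list[str]) -> None:
    """Raise if any raw XBRL table is neither an instant nor a duration table.

    Meant to catch typos in the configured input table names early instead of
    letting a misnamed table slip quietly through the concatenation step.
    """
    bad = [
        name for name in raw_table_names
        if not name.endswith(("_instant", "_duration"))
    ]
    if bad:
        raise ValueError(f"Raw XBRL tables that are neither instant nor duration: {bad}")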

Implications??

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@cmgosnell self-assigned this Sep 20, 2024
@cmgosnell added the xbrl (Related to the FERC XBRL transition) label Sep 20, 2024
Comment on lines 184 to 185
NOTE: None of this actually calls any of the code over in pudl.transform.ferc1 rn.
That's maybe okay because we are mostly testing the data updates here.
Member Author (@cmgosnell):

Does this feel okay? We could pull some of the bits from the transform filter out into smaller functions to actually run here... but I don't think it is worth it, looking at the bits and the setup.

Member:

It would be good to test the actual code we're running in the transform - otherwise it's too easy for the test to drift from the implementation. One thing we could do is:

  • make filter_for_freshest take an optional deduper param that defaults to __apply_diffs - but in test, you could pass in various dedupe methodologies.
  • call filter_for_freshest(table_name) and filter_for_freshest(table_name, deduper=__best_snapshot) in test, then run __compare_dedupe_methodologies() on the two.

How does that sound to you?
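
A rough sketch of the optional-deduper shape being suggested here; the signature and the apply_diffs stand-in (for the private __apply_diffs helper) are assumptions for illustration, not the shipped code:

from collections.abc import Callable

import pandas as pd


def apply_diffs(duped_groups) -> pd.DataFrame:
    """Toy stand-in for __apply_diffs: later filings overwrite earlier ones,
    column by column (last non-null value per group wins)."""
    return duped_groups.last()


def filter_for_freshest_data(
    raw_xbrl: pd.DataFrame,
    primary_keys: list[str],
    deduper: Callable = apply_diffs,
) -> pd.DataFrame:
    """Keep one record per primary key, with a pluggable dedupe methodology."""
    dupe_mask = raw_xbrl.duplicated(subset=primary_keys, keep=False)
    never_duped = raw_xbrl[~dupe_mask]
    duped_groups = raw_xbrl[dupe_mask].groupby(primary_keys, as_index=False)
    return pd.concat([never_duped, deduper(duped_groups)], ignore_index=True)

A test could then call filter_for_freshest_data(raw, pks) and filter_for_freshest_data(raw, pks, deduper=best_snapshot) and hand the two results to __compare_dedupe_methodologies.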

Member Author (@cmgosnell):

Hm, after trying to do this, the main snag is that filter_for_freshest does the prep to get the never_duped and duped_groups, does the filtering, then concats them back together.

So instead of passing in an optional deduper method, I think it'll be simpler to pass in a compare_methods bool, which defaults to False.
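
Sketching the compare_methods shape described above, under the same assumptions (the dedupe bodies and the comparison hook are toy stand-ins, not the real __apply_diffs / __best_snapshot / __compare_dedupe_methodologies logic):

import pandas as pd


def compare_dedupe_methodologies(applied: pd.DataFrame, snapshot: pd.DataFrame) -> None:
    """Toy comparison hook; the real check measures how much data differs."""
    assert len(applied) == len(snapshot)


def filter_for_freshest_data(
    raw_xbrl: pd.DataFrame,
    primary_keys: list[str],
    compare_methods: bool = False,
) -> pd.DataFrame:
    """Dedupe via the apply-diffs methodology; optionally cross-check best-snapshot."""
    dupe_mask = raw_xbrl.duplicated(subset=primary_keys, keep=False)
    never_duped = raw_xbrl[~dupe_mask]
    duped_groups = raw_xbrl[dupe_mask].groupby(primary_keys, as_index=False)
    # Apply-diffs: later filings overwrite earlier ones, column by column.
    deduped = pd.concat([never_duped, duped_groups.last()], ignore_index=True)
    if compare_methods:
        # Best-snapshot stand-in: keep the newest whole filing per group,
        # assuming rows are already sorted by publication time.
        snapshot = pd.concat([never_duped, duped_groups.tail(1)], ignore_index=True)
        compare_dedupe_methodologies(deduped, snapshot)
    return deduped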

@jdangerx (Member) left a comment:

Whee! A few changes requested in testing (mostly around actually testing the code we're running in production) but otherwise looking good!

Review threads (resolved) on: src/pudl/transform/ferc1.py (4), test/unit/io_managers_test.py, test/unit/transform/ferc1_test.py

In test/validate/ferc1_test.py:
Comment on lines 218 to 222
Instead of stacking the two datasets, merging by context, and then
looking for left_only or right_only values, we just count non-null
values. This is because we would want to use the report_year as a
merge key, but that isn't available until after we pipe the
dataframe through `refine_report_year`.
Member:

We will have refine_report_year at this point, right? Can we do this more useful comparison now?

Member Author (@cmgosnell):

Ah okay, so now in __compare_dedupe_methodologies we could merge with indicator=True? I can try that out.

Member Author (@cmgosnell):

Or do you mean merge the two filtered dfs and check, column by column, for nulls in one filter method but not in the other?

Member:

Sorry, I wasn't clear! I had meant that we should check each individual XBRL fact to see if it had changed:

  • from null in apply-diffs to not-null in snapshot
  • vice versa
  • from one value in apply-diffs to a different value in snapshot

Originally / in this comment, I had envisioned this Pandas workflow:

  • stack datasets so there's only one data column
  • merge by XBRL context columns + variable name column
  • look at the prevalence of left_only/right_only values & any rows where the value_left and value_right were different

But! I think your plan of merging the dfs and checking column-by-column also works! In my head, the merge indicator operates on a row-by-row basis, so I figured stacking to one-xbrl-fact-per-row would be easiest. But you are the one who is writing this and also has more pandas experience so I'll defer to your judgment.
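
Sketching that workflow with toy frames; the column names and the two filtered inputs below are hypothetical stand-ins for the same table deduped two different ways:

import pandas as pd

context_cols = ["entity_id", "report_year"]

filtered_via_apply_diffs = pd.DataFrame(
    {"entity_id": ["C1", "C2"], "report_year": [2021, 2021],
     "opex": [10.0, None], "revenue": [100.0, 200.0]}
)
filtered_via_best_snapshot = pd.DataFrame(
    {"entity_id": ["C1", "C2"], "report_year": [2021, 2021],
     "opex": [10.0, 5.0], "revenue": [100.0, 200.0]}
)


def stack_facts(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape to one XBRL fact per row: context columns + fact name + value."""
    return df.melt(id_vars=context_cols, var_name="fact", value_name="value")


merged = stack_facts(filtered_via_apply_diffs).merge(
    stack_facts(filtered_via_best_snapshot),
    on=[*context_cols, "fact"],
    how="outer",
    suffixes=("_diffs", "_snapshot"),
    indicator=True,
)

# Facts that only show up under one methodology:
print(merged["_merge"].value_counts())

# Facts present under both, but null in one and not the other, or with two
# different values:
both = merged[merged["_merge"] == "both"]
changed = both[
    both["value_diffs"].ne(both["value_snapshot"])
    & ~(both["value_diffs"].isna() & both["value_snapshot"].isna())
]
print(changed)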

Member Author (@cmgosnell):

Ah yes, stacking the non-pk/context columns is a step I hadn't considered, and it would make this more zippy! Thanks for the extra context, I will try it out.

Comment on lines -308 to +309
- df_out.stack(params.stacked_column_name, dropna=False)
+ df_out.stack(params.stacked_column_name, future_stack=True)
Member Author (@cmgosnell):

Sorry, this is fully unrelated... but this was to remove an annoying FutureWarning. I tested the update by re-running all the core and out ferc assets, then running the ferc validation tests.
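
For context, the warning in question, assuming pandas 2.1+ where the old stack implementation is deprecated in favor of future_stack=True (the new implementation keeps NA rows by default, matching the old dropna=False behavior):

import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]}, index=["x", "y"])

# Old call: emits a FutureWarning on recent pandas, because dropna is not
# supported by the new stack implementation.
old = df.stack(dropna=False)

# New call: opts into the new implementation; NA rows are kept by default.
new = df.stack(future_stack=True)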

Comment on lines 30 to +33
"""Create PUDL input and output directories if they don't already exist."""
self.input_dir.mkdir(parents=True, exist_ok=True)
self.output_dir.mkdir(parents=True, exist_ok=True)
return self
Member Author (@cmgosnell):

This was to remove another, different warning, about model_validators needing self to be returned. I honestly don't know how this didn't junk everything up before.
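
For reference, the pydantic v2 pattern in play: a model_validator with mode="after" runs on the constructed instance and is expected to return self. The class below is an illustrative stand-in, not the actual PUDL settings model:

from pathlib import Path

from pydantic import BaseModel, model_validator


class WorkspacePaths(BaseModel):
    """Illustrative settings model with input and output directories."""

    input_dir: Path
    output_dir: Path

    @model_validator(mode="after")
    def create_directories(self) -> "WorkspacePaths":
        """Create the directories if they don't already exist."""
        self.input_dir.mkdir(parents=True, exist_ok=True)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        return self  # omitting this return is what the warning complains about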

Comment on lines +177 to +182
# some sample guys found to have higher filtering diffs
"core_ferc1__yearly_utility_plant_summary_sched200",
"core_ferc1__yearly_plant_in_service_sched204",
"core_ferc1__yearly_operating_expenses_sched320",
"core_ferc1__yearly_income_statements_sched114",
],
Member Author (@cmgosnell):

I added some more tables to this list.

@jdangerx (Member) left a comment:

🚢

primary_keys = get_primary_key_raw_xbrl(
    raw_table_name.removeprefix("raw_ferc1_xbrl__"), "ferc1"
)
filter_for_freshest_data_xbrl(
Member:

It's a little confusing that the assertions don't live in the test function, but I guess this works! I think someone confused about what this is actually testing would just go into filter_for_freshest and immediately see the assertion errors.
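
To illustrate the pattern being described, with purely hypothetical names rather than the PUDL functions: the assertion lives inside the helper, so the test body just calls it and passes as long as nothing raises.

import pandas as pd


def dedupe_with_internal_check(raw: pd.DataFrame, primary_keys: list[str]) -> pd.DataFrame:
    """Hypothetical helper: dedupe and assert internally that the result is sane."""
    deduped = raw.drop_duplicates(subset=primary_keys, keep="last")
    assert deduped[primary_keys].duplicated().sum() == 0, "dedupe left duplicate keys"
    return deduped


def test_dedupe_with_internal_check():
    """No asserts here; a failure would surface from inside the helper."""
    raw = pd.DataFrame({"entity_id": ["C1", "C1", "C2"], "value": [1.0, 2.0, 3.0]})
    dedupe_with_internal_check(raw, ["entity_id"])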

Labels: xbrl (Related to the FERC XBRL transition)
Projects: Status: In progress
2 participants