
Migrate where/when we filter for the freshest XBRL data #3861

Open
wants to merge 8 commits into base: main
Conversation

@cmgosnell (Member) commented Sep 20, 2024

Overview

This was done in conjunction with @jdangerx!

What problem does this address?
It came to our attention in #3857 that having filter_for_freshest_data inside of the XBRL io_manager was... a lot! It included a pretty slow step that was mostly a validation step.

What did you change?

  • move the filter_for_freshest_data processing out of the io_manager and into transform.ferc1. It took the longest to figure out where/when in the process this should happen... we ended up putting it in ferc1_transform_asset_factory, because that is where the raw tables get concatenated when there is more than one input raw table, so it is the last place where we have individual raw tables.
  • move __compare_dedupe_methodologies into a validation test
  • move the unit tests
  • also added a little defensive check in ferc1_transform_asset_factory to make sure all of the raw tables have either an instant or a duration table... because we had a typo in one of the input tables 😬 (see the sketch after this list)
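
For illustration, one plausible shape of that defensive check, assuming the raw FERC 1 XBRL table names carry an _instant or _duration suffix (a sketch only, not the shipped code; the function name is hypothetical):

def check_instant_or_duration(raw_table_names: list[str]) -> None:
    """Raise if any raw XBRL table is neither an instant nor a duration table.

    Meant to catch typos in the configured input table names early instead of
    letting a misnamed table slip quietly through the concatenation step.
    """
    bad = [
        name for name in raw_table_names
        if not name.endswith(("_instant", "_duration"))
    ]
    if bad:
        raise ValueError(f"Raw XBRL tables that are neither instant nor duration: {bad}")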

Implications??

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@cmgosnell self-assigned this Sep 20, 2024
@cmgosnell added the xbrl (Related to the FERC XBRL transition) label Sep 20, 2024
Comment on lines 184 to 185
NOTE: None of this actually calls any of the code over in pudl.transform.ferc1 rn.
That's maybe okay because we are mostly testing the data updates here.
Member Author (@cmgosnell):

Does this feel okay? We could pull some of the bits from the transform filter out into smaller functions to actually run here... but I don't think it is worth it, looking at the bits and the setup.

Member:

It would be good to test the actual code we're running in the transform - otherwise it's too easy for the test to drift from the implementation. One thing we could do is:

  • make filter_for_freshest take an optional deduper param that defaults to __apply_diffs - but in test, you could pass in various dedupe methodologies.
  • call filter_for_freshest(table_name) and filter_for_freshest(table_name, deduper=__best_snapshot) in test, then run __compare_dedupe_methodologies() on the two.

How does that sound to you?
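
A rough sketch of the optional-deduper shape being suggested here; the signature and the apply_diffs stand-in (for the private __apply_diffs helper) are assumptions for illustration, not the shipped code:

from collections.abc import Callable

import pandas as pd


def apply_diffs(duped_groups) -> pd.DataFrame:
    """Toy stand-in for __apply_diffs: later filings overwrite earlier ones,
    column by column (last non-null value per group wins)."""
    return duped_groups.last()


def filter_for_freshest_data(
    raw_xbrl: pd.DataFrame,
    primary_keys: list[str],
    deduper: Callable = apply_diffs,
) -> pd.DataFrame:
    """Keep one record per primary key, with a pluggable dedupe methodology."""
    dupe_mask = raw_xbrl.duplicated(subset=primary_keys, keep=False)
    never_duped = raw_xbrl[~dupe_mask]
    duped_groups = raw_xbrl[dupe_mask].groupby(primary_keys, as_index=False)
    return pd.concat([never_duped, deduper(duped_groups)], ignore_index=True)

A test could then call filter_for_freshest_data(raw, pks) and filter_for_freshest_data(raw, pks, deduper=best_snapshot) and hand the two results to __compare_dedupe_methodologies.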

Member Author (@cmgosnell):

Hm, after trying to do this, the main snag is that filter_for_freshest does the prep to get the never_duped and duped_groups, does the filtering, then concats them back together.

So instead of passing in an optional deduper method, I think it'll be simpler to pass in a compare_methods bool, which defaults to False.
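
Sketching the compare_methods shape described above, under the same assumptions (the dedupe bodies and the comparison hook are toy stand-ins, not the real __apply_diffs / __best_snapshot / __compare_dedupe_methodologies logic):

import pandas as pd


def compare_dedupe_methodologies(applied: pd.DataFrame, snapshot: pd.DataFrame) -> None:
    """Toy comparison hook; the real check measures how much data differs."""
    assert len(applied) == len(snapshot)


def filter_for_freshest_data(
    raw_xbrl: pd.DataFrame,
    primary_keys: list[str],
    compare_methods: bool = False,
) -> pd.DataFrame:
    """Dedupe via the apply-diffs methodology; optionally cross-check best-snapshot."""
    dupe_mask = raw_xbrl.duplicated(subset=primary_keys, keep=False)
    never_duped = raw_xbrl[~dupe_mask]
    duped_groups = raw_xbrl[dupe_mask].groupby(primary_keys, as_index=False)
    # Apply-diffs: later filings overwrite earlier ones, column by column.
    deduped = pd.concat([never_duped, duped_groups.last()], ignore_index=True)
    if compare_methods:
        # Best-snapshot stand-in: keep the newest whole filing per group,
        # assuming rows are already sorted by publication time.
        snapshot = pd.concat([never_duped, duped_groups.tail(1)], ignore_index=True)
        compare_dedupe_methodologies(deduped, snapshot)
    return deduped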

@jdangerx (Member) left a comment:

Whee! A few changes requested in testing (mostly around actually testing the code we're running in production) but otherwise looking good!

Review threads (resolved) on: src/pudl/transform/ferc1.py (4), test/unit/io_managers_test.py, test/unit/transform/ferc1_test.py

In test/validate/ferc1_test.py:
Comment on lines 218 to 222
Instead of stacking the two datasets, merging by context, and then
looking for left_only or right_only values, we just count non-null
values. This is because we would want to use the report_year as a
merge key, but that isn't available until after we pipe the
dataframe through `refine_report_year`.
Member:

We will have refine_report_year at this point, right? Can we do this more useful comparison now?

Member Author (@cmgosnell):

Ah okay, so now in __compare_dedupe_methodologies we could merge with indicator=True? I can try that out.

Member Author (@cmgosnell):

Or do you mean merge the two filtered dfs and check, column by column, for nulls in one filter method but not in the other?

Member:

Sorry, I wasn't clear! I had meant that we should check each individual XBRL fact to see if it had changed:

  • from null in apply-diffs to not-null in snapshot
  • vice versa
  • from one value in apply-diffs to a different value in snapshot

Originally / in this comment, I had envisioned this Pandas workflow:

  • stack datasets so there's only one data column
  • merge by XBRL context columns + variable name column
  • look at the prevalence of left_only/right_only values & any rows where the value_left and value_right were different

But! I think your plan of merging the dfs and checking column-by-column also works! In my head, the merge indicator operates on a row-by-row basis, so I figured stacking to one-xbrl-fact-per-row would be easiest. But you are the one who is writing this and also has more pandas experience so I'll defer to your judgment.
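
Sketching that workflow with toy frames; the column names and the two filtered inputs below are hypothetical stand-ins for the same table deduped two different ways:

import pandas as pd

context_cols = ["entity_id", "report_year"]

filtered_via_apply_diffs = pd.DataFrame(
    {"entity_id": ["C1", "C2"], "report_year": [2021, 2021],
     "opex": [10.0, None], "revenue": [100.0, 200.0]}
)
filtered_via_best_snapshot = pd.DataFrame(
    {"entity_id": ["C1", "C2"], "report_year": [2021, 2021],
     "opex": [10.0, 5.0], "revenue": [100.0, 200.0]}
)


def stack_facts(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape to one XBRL fact per row: context columns + fact name + value."""
    return df.melt(id_vars=context_cols, var_name="fact", value_name="value")


merged = stack_facts(filtered_via_apply_diffs).merge(
    stack_facts(filtered_via_best_snapshot),
    on=[*context_cols, "fact"],
    how="outer",
    suffixes=("_diffs", "_snapshot"),
    indicator=True,
)

# Facts that only show up under one methodology:
print(merged["_merge"].value_counts())

# Facts present under both, but null in one and not the other, or with two
# different values:
both = merged[merged["_merge"] == "both"]
changed = both[
    both["value_diffs"].ne(both["value_snapshot"])
    & ~(both["value_diffs"].isna() & both["value_snapshot"].isna())
]
print(changed)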

Member Author (@cmgosnell):

Ah yes, stacking the non-pk/context columns is a step I hadn't considered, and it would make this more zippy! Thanks for the extra context, I will try it out.

Comment on lines -308 to +309
- df_out.stack(params.stacked_column_name, dropna=False)
+ df_out.stack(params.stacked_column_name, future_stack=True)
Member Author (@cmgosnell):

Sorry, this is fully unrelated... but this was to remove an annoying FutureWarning. I tested the update by re-running all the core and out ferc assets, then running the ferc validation tests.
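
For context, the warning in question, assuming pandas 2.1+ where the old stack implementation is deprecated in favor of future_stack=True (the new implementation keeps NA rows by default, matching the old dropna=False behavior):

import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]}, index=["x", "y"])

# Old call: emits a FutureWarning on recent pandas, because dropna is not
# supported by the new stack implementation.
old = df.stack(dropna=False)

# New call: opts into the new implementation; NA rows are kept by default.
new = df.stack(future_stack=True)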

Comment on lines 30 to +33
"""Create PUDL input and output directories if they don't already exist."""
self.input_dir.mkdir(parents=True, exist_ok=True)
self.output_dir.mkdir(parents=True, exist_ok=True)
return self
Member Author (@cmgosnell):

This was to remove another, different warning, about model_validators needing self to be returned. I honestly don't know how this didn't junk everything up before.
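
For reference, the pydantic v2 pattern in play: a model_validator with mode="after" runs on the constructed instance and is expected to return self. The class below is an illustrative stand-in, not the actual PUDL settings model:

from pathlib import Path

from pydantic import BaseModel, model_validator


class WorkspacePaths(BaseModel):
    """Illustrative settings model with input and output directories."""

    input_dir: Path
    output_dir: Path

    @model_validator(mode="after")
    def create_directories(self) -> "WorkspacePaths":
        """Create the directories if they don't already exist."""
        self.input_dir.mkdir(parents=True, exist_ok=True)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        return self  # omitting this return is what the warning complains about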

Comment on lines +177 to +182
# some sample guys found to have higher filtering diffs
"core_ferc1__yearly_utility_plant_summary_sched200",
"core_ferc1__yearly_plant_in_service_sched204",
"core_ferc1__yearly_operating_expenses_sched320",
"core_ferc1__yearly_income_statements_sched114",
],
Member Author (@cmgosnell):

I added some more tables to this list.

@jdangerx (Member) left a comment:

🚢

primary_keys = get_primary_key_raw_xbrl(
    raw_table_name.removeprefix("raw_ferc1_xbrl__"), "ferc1"
)
filter_for_freshest_data_xbrl(
Member:

It's a little confusing that the assertions don't live in the test function, but I guess this works! I think someone confused about what this is actually testing would just go into filter_for_freshest and immediately see the assertion errors.
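
To illustrate the pattern being described, with purely hypothetical names rather than the PUDL functions: the assertion lives inside the helper, so the test body just calls it and passes as long as nothing raises.

import pandas as pd


def dedupe_with_internal_check(raw: pd.DataFrame, primary_keys: list[str]) -> pd.DataFrame:
    """Hypothetical helper: dedupe and assert internally that the result is sane."""
    deduped = raw.drop_duplicates(subset=primary_keys, keep="last")
    assert deduped[primary_keys].duplicated().sum() == 0, "dedupe left duplicate keys"
    return deduped


def test_dedupe_with_internal_check():
    """No asserts here; a failure would surface from inside the helper."""
    raw = pd.DataFrame({"entity_id": ["C1", "C1", "C2"], "value": [1.0, 2.0, 3.0]})
    dedupe_with_internal_check(raw, ["entity_id"])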

Labels: xbrl (Related to the FERC XBRL transition)
Projects: Status: In progress
2 participants