Ferc xbrl updates #233

zschira · 2024-06-25T16:41:00Z

Background

The pudl-archiver has been updated to decouple taxonomies from years of XBRL data. It now puts all versions of the taxonomies in a single zipfile, so if filings from the same year use different versions of the taxonomy, each filing can be matched to the version it references. This PR updates the extractor to accommodate that change. It now parses all taxonomies up front and builds a dictionary mapping each taxonomy version to its parsed taxonomy. Then, while parsing an individual filing, it detects which taxonomy version the filing references and uses that version to interpret the facts in the filing.
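
For illustration, here is a minimal sketch of the version-keyed lookup described above. The names `ParsedTaxonomy`, `parse_all_taxonomies`, and `interpret_filing`, and the shape of the filing dict, are hypothetical stand-ins, not the extractor's actual API.

```python
from dataclasses import dataclass


@dataclass
class ParsedTaxonomy:
    """Hypothetical stand-in for a parsed taxonomy."""

    version: str
    concepts: dict[str, str]  # concept name -> human-readable label


def parse_all_taxonomies(raw: dict[str, dict[str, str]]) -> dict[str, ParsedTaxonomy]:
    """Parse every archived taxonomy once and key the result by version."""
    return {
        version: ParsedTaxonomy(version=version, concepts=concepts)
        for version, concepts in raw.items()
    }


def interpret_filing(filing: dict, taxonomies: dict[str, ParsedTaxonomy]) -> dict[str, str]:
    """Interpret a filing's facts with the taxonomy version the filing references."""
    version = filing["taxonomy_version"]
    if version not in taxonomies:
        raise KeyError(f"Filing references unknown taxonomy version: {version}")
    taxonomy = taxonomies[version]
    # Relabel each fact using the concepts from the matching taxonomy version.
    return {taxonomy.concepts[name]: value for name, value in filing["facts"].items()}
```

Two filings from the same year can then resolve against different taxonomy versions simply by carrying different `taxonomy_version` values.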

jdangerx · 2024-06-25T18:48:47Z

See review of catalyst-cooperative/pudl-archiver#362 - I think we should consolidate the "figure out the filename" and "figure out what taxonomy each file points at" logic into the archiver, and just read all that information out of rssfeed here.

jdangerx · 2024-06-28T21:09:52Z

src/ferc_xbrl_extractor/instance.py

        )
        for filename in archive.namelist()
        if Path(filename).suffix in allowable_suffixes
    ]


-def get_filing_name(filing_metadata: dict[str, str | int]) -> str:


jdangerx

Seems fine, thanks for updating the tests too! There are a few small nits you can take or leave.

Don't forget to actually do a versioned release so that PUDL can depend on this -_-

jdangerx · 2024-06-28T21:14:32Z

src/ferc_xbrl_extractor/xbrl.py

@@ -231,11 +226,9 @@ def get_fact_tables(
    Data-Package descriptor describing the output database if requested.

    Args:
-        taxonomy_path: URL of taxonomy.
+        taxonomy_path: Zipfile with archived taxonomies for form.


nit: This arg is called taxonomy_source now.

jdangerx · 2024-06-28T21:15:48Z

tests/integration/data_quality_test.py

@@ -58,7 +52,7 @@ def test_lost_facts_pct(extracted, request):
        # fix the bug. We don't use xfail here because the parametrization is
        # at the *fixture* level, and only the lost facts tests should fail
        # for form 6.
-        assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.96
+        assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.97


nit: comment above still says 0.96

jdangerx · 2024-06-28T21:18:45Z

src/ferc_xbrl_extractor/xbrl.py

@@ -85,7 +83,7 @@ def extract(

 def table_data_from_instances(
    instance_builders: list[InstanceBuilder],
-    table_defs: dict[str, FactTable],
+    table_defs: dict[str, dict[str, FactTable]],


We had talked about batching instances such that each batch only had to parse one taxonomy - did you find that reading all the taxonomies for each worker turned out OK in the end?

Actually, I'm just parsing all the taxonomies at the beginning and then passing a dict of them to each worker. That's a bit less memory efficient (though not that bad), but the parsing only happens once.
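
A rough sketch of that "parse once, share with every worker" pattern, for readers following along. Everything here is an assumption for illustration: `parse_taxonomy`, `process_batch`, and `run` are hypothetical names, and a thread pool is used only to keep the example self-contained. With a process pool, the dict passed as an argument would be pickled and sent to the workers, which is where the extra memory cost would come from.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def parse_taxonomy(version: str) -> dict[str, str]:
    """Hypothetical parser; stands in for the expensive taxonomy parsing step."""
    return {"version": version}


def process_batch(batch: list[dict], taxonomies: dict[str, dict[str, str]]) -> list[str]:
    """Worker: resolve each filing against the already-parsed taxonomies."""
    return [taxonomies[filing["taxonomy_version"]]["version"] for filing in batch]


def run(versions: list[str], batches: list[list[dict]], workers: int = 2) -> list[list[str]]:
    # Parse every taxonomy exactly once, up front.
    taxonomies = {version: parse_taxonomy(version) for version in versions}
    # Hand the same version -> taxonomy dict to every worker.
    worker = partial(process_batch, taxonomies=taxonomies)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input batches.
        return list(pool.map(worker, batches))
```

The alternative raised in the question, batching instances so each batch needs only one taxonomy, would trade this up-front parse for more bookkeeping around how filings are grouped.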

jdangerx · 2024-06-28T21:19:56Z

tests/unit/instance_test.py

@@ -207,6 +192,7 @@ def test_all_fact_ids():
        instant_facts=instant_facts,
        duration_facts=duration_facts,
        filing_name="test_instance",
+        taxonomy_version="form-1_2022-01-01",


nit: it might be good to differentiate taxonomy_version in Instance, which doesn't expect .zip, from taxonomy_version in InstanceBuilder, which does.

They're all actually using the .zip version, fixed!

zschira added 5 commits June 21, 2024 12:54

Update extractor to handle multiple taxonomies gracefullyish

631fa5b

Fix medata handling

fa47111

Update examples in readme

d380ee3

Handle metadata from multiple taxonomies

059fe12

Process multiple years at once

ca5aa6d

zschira requested a review from jdangerx June 25, 2024 16:41

zschira added 3 commits June 26, 2024 11:36

Fix path when saving taxonomy metadata

8a6bd38

Get taxonomy version from metadta

7612f18

Fix integration tests

10e801a

jdangerx reviewed Jun 28, 2024

jdangerx approved these changes Jun 28, 2024

Fixed typos

26eecaa

zschira merged commit a90bf16 into main Jul 1, 2024
14 checks passed

zschira deleted the ferc_xbrl_updates branch July 1, 2024 19:46
