Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ferc xbrl updates #233

Merged
merged 9 commits into from
Jul 1, 2024
Merged

Ferc xbrl updates #233

merged 9 commits into from
Jul 1, 2024

Conversation

zschira
Copy link
Member

@zschira zschira commented Jun 25, 2024

Background

The pudl-archiver has been updated to decouple taxonomies from years of XBRL data. Now it puts all versions of taxonomies in a single zipfile, so if filings from the same year use different versions of the taxonomy, those can all be referenced. This PR updates the extractor to accommodate this change. Now, it will parse all taxonomies and create a dictionary which maps these parsed taxonomies to the version. Then, while parsing individual filings, it will detect the version of the taxonomy referenced in the filing, and use that version for interpreting the facts in the filing.

@zschira zschira requested a review from jdangerx June 25, 2024 16:41
@jdangerx
Copy link
Member

jdangerx commented Jun 25, 2024

See review of catalyst-cooperative/pudl-archiver#362 - I think we should consolidate the "figure out the filename" and "figure out what taxonomy each file points at" logic into the archiver, and just read all that information out of rssfeed here.

)
for filename in archive.namelist()
if Path(filename).suffix in allowable_suffixes
]


def get_filing_name(filing_metadata: dict[str, str | int]) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woohoo!

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine, thanks for updating the tests too! There's a few small nits you can take or leave.

Don't forget to actually do a versioned release so that PUDL can depend on this -_-

@@ -231,11 +226,9 @@ def get_fact_tables(
Data-Package descriptor describing the output database if requested.

Args:
taxonomy_path: URL of taxonomy.
taxonomy_path: Zipfile with archived taxonomies for form.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This arg is called taxonomy_source now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@@ -58,7 +52,7 @@ def test_lost_facts_pct(extracted, request):
# fix the bug. We don't use xfail here because the parametrization is
# at the *fixture* level, and only the lost facts tests should fail
# for form 6.
assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.96
assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.97
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment above still says 0.96

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@@ -85,7 +83,7 @@ def extract(

def table_data_from_instances(
instance_builders: list[InstanceBuilder],
table_defs: dict[str, FactTable],
table_defs: dict[str, dict[str, FactTable]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had talked about batching instances such that each batch only had to parse one taxonomy - did you find that reading all the taxonomies for each worker turned out OK in the end?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm just parsing all the taxonomies at the beginning, then pass a dict of those to each worker. It's a bit less memory efficient (but not that bad), but only does the parsing once.

@@ -207,6 +192,7 @@ def test_all_fact_ids():
instant_facts=instant_facts,
duration_facts=duration_facts,
filing_name="test_instance",
taxonomy_version="form-1_2022-01-01",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it might be good to differentiate taxonomy_version in Instance, which doesn't expect .zip, from taxonomy_version in InstanceBuilder, which does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There all actually using the .zip version, fixed!

@zschira zschira merged commit a90bf16 into main Jul 1, 2024
14 checks passed
@zschira zschira deleted the ferc_xbrl_updates branch July 1, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants