-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ferc xbrl updates #233
Ferc xbrl updates #233
Conversation
See review of catalyst-cooperative/pudl-archiver#362 - I think we should consolidate the "figure out the filename" and "figure out what taxonomy each file points at" logic into the archiver, and just read all that information out of |
) | ||
for filename in archive.namelist() | ||
if Path(filename).suffix in allowable_suffixes | ||
] | ||
|
||
|
||
def get_filing_name(filing_metadata: dict[str, str | int]) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woohoo!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine, thanks for updating the tests too! There's a few small nits you can take or leave.
Don't forget to actually do a versioned release so that PUDL can depend on this -_-
src/ferc_xbrl_extractor/xbrl.py
Outdated
@@ -231,11 +226,9 @@ def get_fact_tables( | |||
Data-Package descriptor describing the output database if requested. | |||
|
|||
Args: | |||
taxonomy_path: URL of taxonomy. | |||
taxonomy_path: Zipfile with archived taxonomies for form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This arg is called taxonomy_source
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
@@ -58,7 +52,7 @@ def test_lost_facts_pct(extracted, request): | |||
# fix the bug. We don't use xfail here because the parametrization is | |||
# at the *fixture* level, and only the lost facts tests should fail | |||
# for form 6. | |||
assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.96 | |||
assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.97 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comment above still says 0.96
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
@@ -85,7 +83,7 @@ def extract( | |||
|
|||
def table_data_from_instances( | |||
instance_builders: list[InstanceBuilder], | |||
table_defs: dict[str, FactTable], | |||
table_defs: dict[str, dict[str, FactTable]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had talked about batching instances such that each batch only had to parse one taxonomy - did you find that reading all the taxonomies for each worker turned out OK in the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm just parsing all the taxonomies at the beginning, then pass a dict of those to each worker. It's a bit less memory efficient (but not that bad), but only does the parsing once.
tests/unit/instance_test.py
Outdated
@@ -207,6 +192,7 @@ def test_all_fact_ids(): | |||
instant_facts=instant_facts, | |||
duration_facts=duration_facts, | |||
filing_name="test_instance", | |||
taxonomy_version="form-1_2022-01-01", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it might be good to differentiate taxonomy_version
in Instance
, which doesn't expect .zip
, from taxonomy_version
in InstanceBuilder
, which does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There all actually using the .zip
version, fixed!
Background
The
pudl-archiver
has been updated to decouple taxonomies from years of XBRL data. Now it puts all versions of taxonomies in a single zipfile, so if filings from the same year use different versions of the taxonomy, those can all be referenced. This PR updates the extractor to accommodate this change. Now, it will parse all taxonomies and create a dictionary which maps these parsed taxonomies to the version. Then, while parsing individual filings, it will detect the version of the taxonomy referenced in the filing, and use that version for interpreting the facts in the filing.