Fix issues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 #2854

Merged
Changes from 13 commits (20 commits total):
16c7fa7
Fix issues arising from stricter typing used in pandas 2.1
zaneselvans Sep 12, 2023
3d6912b
Try using small runner to see if that doesn't self-cancel
jdangerx Sep 13, 2023
3921f7e
Merge branch '2810-run-2021-ferc-1-data-through-new-more-complete-ext…
zaneselvans Sep 14, 2023
316ea51
Merge branch '2810-run-2021-ferc-1-data-through-new-more-complete-ext…
zaneselvans Sep 22, 2023
7d2d0fa
Updates some dependency versions
zaneselvans Sep 22, 2023
f5edb37
Merge branch '2810-run-2021-ferc-1-data-through-new-more-complete-ext…
zaneselvans Sep 22, 2023
36955c1
Merge branch '2810-run-2021-ferc-1-data-through-new-more-complete-ext…
zaneselvans Sep 22, 2023
1fd366b
Loosen explosion calculation error tolerance & use integer transmissi…
zaneselvans Sep 23, 2023
684e5d9
Merge branch '2810-run-2021-ferc-1-data-through-new-more-complete-ext…
zaneselvans Sep 23, 2023
9150b7f
Revert to using large runner for ci-integration
zaneselvans Sep 23, 2023
84a8ce7
Bump ferc-xbrl-extractor to v1.1.1
zaneselvans Sep 24, 2023
8f0ac69
Use pd.to_numeric() to convert numeric strings to numeric dtypes.
zaneselvans Sep 24, 2023
b3bfa74
Remove obsolete references to ferc1_schema tests.
zaneselvans Sep 24, 2023
31e334a
Update fsspec requirement from <2023.6.1,>=2021.7 to >=2021.7,<2023.9.3
dependabot[bot] Sep 25, 2023
4e5d603
Update dask requirement from <2023.9.2,>=2021.8 to >=2021.8,<2023.9.3
dependabot[bot] Sep 25, 2023
5415fa1
Merge pull request #2887 from catalyst-cooperative/dependabot/pip/dev…
zaneselvans Sep 25, 2023
fa9002a
Merge pull request #2888 from catalyst-cooperative/dependabot/pip/dev…
zaneselvans Sep 25, 2023
ab5f366
Merge branch 'dev' into 2810-run-2021-ferc-1-data-through-new-more-co…
zaneselvans Sep 25, 2023
11d66ea
Merge branch '2810-run-2021-ferc-1-data-through-new-more-complete-ext…
zaneselvans Sep 25, 2023
86c40a1
Get rid of an unnecessary temporary variable.
zaneselvans Sep 25, 2023
69 changes: 27 additions & 42 deletions docs/dev/testing.rst
@@ -155,18 +155,19 @@ their own:
doc8 -> Check the documentation input files for syntactical correctness.
docs -> Remove old docs output and rebuild HTML from scratch with Sphinx
unit -> Run all the software unit tests.
ferc1_solo -> Test whether FERC 1 can be loaded into the PUDL database alone.
integration -> Run all software integration tests and process a full year of data.
minmax_rows -> Check that all outputs have the expected number of rows.
validate -> Run all data validation tests. This requires a complete PUDL DB.
ferc1_schema -> Verify FERC Form 1 DB schema are compatible for all years.
jupyter -> Ensure that designated Jupyter notebooks can be executed.
full_integration -> Run ETL and integration tests for all years and data sources.
full -> Run all CI checks, but for all years of data.
build -> Prepare Python source and binary packages for release.
testrelease -> Do a dry run of Python package release using the PyPI test server.
release -> Release the PUDL package to the production PyPI server.
nuke -> Nuke & recreate SQLite & Parquet outputs, then run all tests and
data validations against the new outputs.
get_unmapped_ids -> Make the raw FERC1 DB and generate a PUDL database with only EIA in
order to generate any unmapped IDs.
Comment from zaneselvans (Member, Author) on lines -158 to +167:
The test environments were generally out of date and I just cut-and-pasted in the current ones.

I noticed that the ferc1_schema test was gone entirely, and am wondering if we are no longer checking anywhere for inter-year FERC DBF database schema compatibility with the reference year, or whether that got moved to the more generic DBF extractor code? I poked around a bit but didn't see it immediately. @rousik


Note that not all of them literally run tests. For instance, to lint and build the
documentation you can run:

.. code-block:: console

@@ -321,41 +322,25 @@ with the construction of that database. For example, the output routines:

We also use this option to run the data validations.

Assuming you do want to run the ETL and build new databases as part of the test
you're running, the contents of that database are determined by an ETL settings
file. By default, the settings file that's used is
``test/settings/integration-test.yml``. But it's also possible to use a
different input file, generating a different database, and then run some
tests against that database.

For example, we test that FERC 1 data can be loaded into a PUDL database all
by itself by running the ETL tests with a settings file that includes only a
couple of FERC 1 tables for a single year. This is the ``ferc1_solo`` Tox
test environment:

.. code-block:: console

$ pytest --etl-settings=test/settings/ferc1-solo-test.yml test/integration/etl_test.py

Similarly, we use the ``test/settings/full-integration-test.yml`` settings file
to specify an exhaustive collection of input data, and then we run a test that
checks that the database schemas extracted from all historical FERC 1 databases
are compatible with each other. This is the ``ferc1_schema`` test:

.. code-block:: console

$ pytest --etl-settings test/settings/full-integration-test.yml test/integration/etl_test.py::test_ferc1_schema

The raw input data that all the tests use is ultimately coming from our
`archives on Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__.
However, you can optionally tell the tests to look in different places for more
rapidly accessible caches of that data and to force the download of a fresh
copy (especially useful when you are testing the datastore functionality
specifically). By default, the tests will use the datastore that's part of your
local PUDL workspace.

For example, to run the ETL portion of the integration tests and download
fresh input data to a temporary datastore that's later deleted automatically:
Comment from zaneselvans (Member, Author) on lines -324 to -358:

Removed references to ferc1_schema and ferc1_solo environments, which no longer exist.

Assuming you do want to run the ETL and build new databases as part of the test you're
running, the contents of that database are determined by an ETL settings file. By
default, the settings file that's used is
``src/pudl/package_data/settings/etl_fast.yml``. But it's also possible to use a
different input file, generating a different database, and then run some tests against
that database.

We use the ``src/pudl/package_data/etl_full.yml`` settings file to specify an exhaustive
collection of input data.
zaneselvans marked this conversation as resolved.

The raw input data that all the tests use is ultimately coming from our `archives on
Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__. However, you can
optionally tell the tests to look in different places for more rapidly accessible
caches of that data and to force the download of a fresh copy (especially useful when
you are testing the datastore functionality specifically). By default, the tests will
use the datastore that's part of your local PUDL workspace.

For example, to run the ETL portion of the integration tests and download fresh input
data to a temporary datastore that's later deleted automatically:

.. code-block:: console

184 changes: 0 additions & 184 deletions docs/pudl/pudl-etl.dot
Comment from zaneselvans (Member, Author):

Removed this out-of-date flowchart describing the ETL.

Reply from a reviewer (Member):

I guess the dagster UI already shows a lot of this information now!

This file was deleted.

4 changes: 2 additions & 2 deletions pyproject.toml
@@ -14,7 +14,7 @@ dependencies = [
"addfips>=0.4,<0.5",
"alembic>=1.10.3,<1.13",
"catalystcoop.dbfread>=3.0,<3.1",
"catalystcoop.ferc-xbrl-extractor>=1.1.0,<2",
"catalystcoop.ferc-xbrl-extractor>=1.1.1,<2",
"coloredlogs>=14.0,<15.1", # Dagster requires 14.0
"dagster-webserver>=1.4,<1.5", # 1.2.2 is first version to support Python 3.11
"dagster>=1.4,<1.5", # 1.2.2 is first version to support Python 3.11
@@ -28,7 +28,7 @@ dependencies = [
"matplotlib>=3.3,<3.9", # Should make this optional with a "viz" extras
"networkx>=2.2,<3.2",
"numpy>=1.18.5,!=1.23.0,<1.27",
"pandas[parquet,excel,fss,gcp,compression]>=2.0,<2.1",
"pandas[compression,excel,fss,gcp,parquet,performance]>=2,<2.2",
Comment from zaneselvans (Member, Author):

In addition to allowing pandas 2.1.x, I alphabetized these extras and included `performance`, since it resulted in a noticeable speedup in my testing in the ferc-xbrl-extractor repo (even though that didn't end up being the fix I used).

"pyarrow>=7,<13",
"pydantic[email]>=1.7,<2",
"python-dotenv>=0.21,<1.1",
2 changes: 1 addition & 1 deletion src/pudl/metadata/classes.py
@@ -572,7 +572,7 @@ class Field(Base):
Examples:
>>> field = Field(name='x', type='string', constraints={'enum': ['x', 'y']})
>>> field.to_pandas_dtype()
CategoricalDtype(categories=['x', 'y'], ordered=False)
CategoricalDtype(categories=['x', 'y'], ordered=False, categories_dtype=object)
>>> field.to_sql()
Column('x', Enum('x', 'y'), CheckConstraint(...), table=None)
>>> field = Field.from_id('utility_id_eia')
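The doctest update above reflects a repr change in pandas itself: starting with pandas 2.1, `CategoricalDtype` includes `categories_dtype` in its repr. A minimal sketch of the difference:

```python
import pandas as pd

# pandas >=2.1 includes categories_dtype in the CategoricalDtype repr, so
# doctests that print the dtype need updating when bumping pandas.
dtype = pd.CategoricalDtype(categories=["x", "y"], ordered=False)
print(dtype)
# pandas 2.0: CategoricalDtype(categories=['x', 'y'], ordered=False)
# pandas 2.1: CategoricalDtype(categories=['x', 'y'], ordered=False, categories_dtype=object)
```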
2 changes: 1 addition & 1 deletion src/pudl/metadata/fields.py
@@ -1552,7 +1552,7 @@
"description": "For nuclear plants only, the unit number .One digit numeric. Nuclear plants are the only type of plants for which data are shown explicitly at the generating unit level.",
},
"num_transmission_circuits": {
"type": "number",
"type": "integer",
"description": "Number of circuits in a transmission line.",
},
"oil_fraction_cost": {
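Switching the field from "number" to "integer" changes the pandas dtype the column receives. A rough sketch of the difference (assuming the usual mapping of integer fields to pandas' nullable Int64, which is how integer columns with missing values are typically handled):

```python
import pandas as pd

# A count column with gaps: a "number" field lands in float64 (1.0, NaN),
# while an "integer" field maps to pandas' nullable Int64 (1, <NA>).
circuits = pd.Series([1, 2, None])          # inferred as float64
print(circuits.astype("float64").tolist())  # [1.0, 2.0, nan]
print(circuits.astype("Int64"))             # 1, 2, <NA> with dtype Int64
```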
2 changes: 1 addition & 1 deletion src/pudl/output/ferc1.py
@@ -44,7 +44,7 @@ class CalculationToleranceFerc1(BaseModel):
intertable_calculation_errors=0.20,
),
"balance_sheet_assets_ferc1": CalculationToleranceFerc1(
intertable_calculation_errors=0.85,
intertable_calculation_errors=1.00,
Comment from a reviewer (Member):

😢

Reply from zaneselvans (Member, Author):

It's okay man, we're back down to 60% on the other other FERC branch!

),
"balance_sheet_liabilities_ferc1": CalculationToleranceFerc1(
intertable_calculation_errors=0.07,
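For context, a rough sketch of what a tolerance like this means in practice (assumed enforcement logic, not the actual PUDL check):

```python
from pydantic import BaseModel

class CalculationToleranceFerc1(BaseModel):
    """Fraction of calculation errors tolerated before the check fails."""

    intertable_calculation_errors: float = 0.05

# A tolerance of 1.00 effectively disables the inter-table check for this table.
tol = CalculationToleranceFerc1(intertable_calculation_errors=1.00)
observed_error_rate = 0.85  # hypothetical fraction of mismatched calculations
assert observed_error_rate <= tol.intertable_calculation_errors
```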
19 changes: 10 additions & 9 deletions src/pudl/output/ferc714.py
@@ -135,14 +135,15 @@ def categorize_eia_code(
This function associates a ``respondent_type`` of ``utility``,
``balancing_authority`` or ``pandas.NA`` with each input ``eia_code`` using the
following rules:

* If an ``eia_code`` appears only in ``util_ids`` the ``respondent_type`` will be
  ``utility``.
* If ``eia_code`` appears only in ``ba_ids`` the ``respondent_type`` will be
  assigned ``balancing_authority``.
* If ``eia_code`` appears in neither set of IDs, ``respondent_type`` will be
  assigned ``pandas.NA``.
* If ``eia_code`` appears in both sets of IDs, then whichever ``respondent_type``
  has been selected with the ``priority`` flag will be assigned.

Note that the vast majority of ``balancing_authority_id_eia`` values also show up
as ``utility_id_eia`` values, but only a small subset of the ``utility_id_eia``
@@ -163,8 +164,7 @@ def categorize_eia_code(
"balancing_authority". The default is "balancing_authority".

Returns:
A DataFrame containing 2 columns: ``eia_code`` and ``respondent_type``.
"""
if priority == "balancing_authority":
primary = "balancing_authority"
@@ -226,14 +226,15 @@ def filled_balancing_authority_eia861(
# Build table of new rows
# Insert row for each target balancing authority-year pair
# missing from the original table, using the reference year as a template.
rows = []
rows: list[dict[str, Any]] = []
for ref, fix in zip(refs, ASSOCIATIONS):
for year in range(fix["to"][0], fix["to"][1] + 1):
key = (fix["id"], pd.Timestamp(year, 1, 1))
if key not in dfi.index:
rows.append({**ref, "report_date": key[1]})
# Append to original table
df = pd.concat([df, pd.DataFrame(rows)])
new_rows = apply_pudl_dtypes(pd.DataFrame(rows), group="eia")
Comment from a reviewer (Member):

super-nit: maybe `rows_dtyped` is a little clearer about what's new about these rows?

Or, later on line 279 you just skip the variable assignment altogether, which also seems legitimate - maybe we can do that here too?

Reply from zaneselvans (Member, Author):

Okay, went with the solution from below.

df = pd.concat([df, new_rows], axis="index")
# Remove balancing authorities treated as utilities
mask = df["balancing_authority_id_eia"].isin([util["id"] for util in UTILITIES])
return apply_pudl_dtypes(df[~mask], group="eia")
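Here and in the hunk below, casting the new rows before concatenation avoids dtype degradation. A minimal sketch of the issue (an assumed illustration with a made-up column name, not code from the PR):

```python
import pandas as pd

# Typed table plus rows built from dicts: the all-NA column in `rows` is
# inferred as object, and concatenating it drags the typed column to object.
df = pd.DataFrame({"ba_id": pd.array([1, 2], dtype="Int64")})
rows = pd.DataFrame([{"ba_id": None}])

print(pd.concat([df, rows]).dtypes)                             # ba_id: object
print(pd.concat([df, rows.astype({"ba_id": "Int64"})]).dtypes)  # ba_id: Int64
```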
@@ -275,7 +276,7 @@ def filled_balancing_authority_assn_eia861(
tables.append(ref.assign(report_date=key[1]))
replaced |= mask
# Append to original table with matching rows removed
df = pd.concat([df[~replaced], pd.concat(tables)])
df = pd.concat([df[~replaced], apply_pudl_dtypes(pd.concat(tables), group="eia")])
# Remove balancing authorities treated as utilities
mask = np.zeros(df.shape[0], dtype=bool)
tables = []
2 changes: 1 addition & 1 deletion src/pudl/transform/classes.py
@@ -552,7 +552,7 @@ def convert_units(col: pd.Series, params: UnitConversion) -> pd.Series:
new_name = col.name
if (col.name == new_name) & (params.from_unit != "") & (params.to_unit != ""):
logger.debug(f"Old and new column names are identical: {col.name}.")
col = (params.multiplier * col) + params.adder
col = (params.multiplier * pd.to_numeric(col)) + params.adder
Comment from zaneselvans (Member, Author):

There were half a dozen tables where `col` ended up being a numeric string like "0.0" as a result of ferc-xbrl-extractor==1.1.1 -- though I'm not sure what about the change in method over there affected the output type.

Reply from a reviewer (Member):

This was from 1.1.0 to 1.1.1? Surprising... this seems OK for now, but I'm definitely curious what changed.

It's totally possible that we'll keep changing the extractor, though - I don't think this is blocking.

Reply from zaneselvans (Member, Author):

There are notes in the PRs / issues, but the thing that changed is what kind of NULL value ends up being generated. In v1.1.0, with `gb.ffill().iloc[-1]`, we ended up with mixed str + np.nan => object columns that resulted in numeric values. In v1.1.1, using `gb.last()`, we ended up with mixed str + None => object columns that are interpreted as strings. 🤷🏼

col.name = new_name
return col
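A small sketch of the failure mode the `pd.to_numeric()` call guards against (illustrative values and column name, not from the PR):

```python
import pandas as pd

# Numeric strings arrive in an object-dtype column; arithmetic on them fails
# (multiplying a str by a float raises TypeError), so the column is converted
# explicitly before applying the unit conversion.
col = pd.Series(["0.0", "1.5", None], dtype="object", name="capacity_mw")

converted = pd.to_numeric(col)       # float64; None becomes NaN
scaled = (1000.0 * converted) + 0.0  # multiplier and adder now work as expected
```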
