
Fix issues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 #2854

Conversation


@zaneselvans zaneselvans commented Sep 12, 2023

PR Overview

  • Increase upper bound of allowed pandas versions to pandas<2.2
  • Fix a couple of instances where there were type mismatches in columns being concatenated, which pandas no longer allows.
  • Fix a doctest failure resulting from CategoricalDtype having explicit types now.
  • Address some lingering number-as-string columns apparently introduced by the change to ferc-xbrl-extractor. See this PR & comment for more details.
  • Remove references to the ferc1_schema test, since it's been removed from the integration tests.
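The concat type-mismatch fixes follow a pattern like this minimal sketch (column name and dtypes are hypothetical, not the actual PUDL code):

```python
import pandas as pd

# Two frames whose shared column has mismatched dtypes (one numeric,
# one string) -- the kind of mismatch pandas 2.1 is stricter about.
left = pd.DataFrame({"utility_id": [1, 2]})
right = pd.DataFrame({"utility_id": ["3", "4"]})

# Align dtypes explicitly before concatenating, instead of relying on
# pandas to silently coerce everything to object.
right = right.astype({"utility_id": "int64"})
combined = pd.concat([left, right], ignore_index=True)
assert combined["utility_id"].dtype == "int64"
```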

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans added the dependencies Pull requests that update a dependency file label Sep 12, 2023
@zaneselvans
Member Author

Unclear to me what is going on with the ci-integration test failing after 4 minutes.

@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 529c10d to 1eaef5a Compare September 13, 2023 16:58
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 3d6912b to 8676c93 Compare September 18, 2023 19:03
@zaneselvans zaneselvans marked this pull request as ready for review September 22, 2023 20:35
@zaneselvans
Member Author

@zschira It seems like something on this branch has made the XBRL extraction process catastrophically slow. It doesn't run out of memory, but it takes more than 6 hours, so the CI can't even complete within the allotted time. See this failed run.

I ran tox locally and the extraction was maybe a bit slow, but not crazy. It took 10 minutes to extract all of the XBRL data... and then it was followed by this bizarre 7-hour pause:

2023-09-22 23:19:20 [    INFO] catalystcoop.pudl.workspace.datastore:388 Retrieved Resource(ferc714/10.5281/zenodo.8326694/ferc714-xbrl-taxonomy-2022.zip) from cache.
2023-09-22 23:19:20 [    INFO] catalystcoop.pudl.workspace.datastore:388 Retrieved Resource(ferc714/10.5281/zenodo.8326694/ferc714-xbrl-2021.zip) from cache.
2023-09-22 23:19:20 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:254 Parsing taxonomy from <_io.BytesIO object at 0x1401c0090>
2023-09-22 23:20:49 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:138 Finished batch 1/3
2023-09-22 23:20:49 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:138 Finished batch 2/3
2023-09-22 23:20:49 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:138 Finished batch 3/3
2023-09-23 06:16:16 [    INFO] alembic.runtime.migration:213 Context impl SQLiteImpl.
2023-09-23 06:16:16 [    INFO] alembic.runtime.migration:216 Will assume non-transactional DDL.
2023-09-23 06:16:16 [    INFO] alembic.runtime.migration:619 Running stamp_revision  -> ec80dd91891a
2023-09-23 06:16:16 [   DEBUG] alembic.runtime.migration:827 new branch insert ec80dd91891a
2023-09-23 06:16:16 [    INFO] alembic.runtime.migration:213 Context impl SQLiteImpl.
2023-09-23 06:16:16 [    INFO] alembic.runtime.migration:216 Will assume non-transactional DDL.
2023-09-23 06:16:16 [    INFO] alembic.runtime.migration:619 Running stamp_revision  -> ec80dd91891a
2023-09-23 06:16:16 [   DEBUG] alembic.runtime.migration:827 new branch insert ec80dd91891a

But after the pause things continued as normal, with the entire Tox run taking nearly 8 hours to complete.

@zaneselvans
Member Author

zaneselvans commented Sep 23, 2023

Running

$ ferc_to_sqlite --clobber --gcs-cache-path gs://zenodo-cache.catalyst.coop src/pudl/package_data/settings/etl_full.yml

locally also results in multi-hour hangs in the XBRL extraction process.

Doesn't seem very informative, but I did a Ctrl-C and got:

^CTraceback (most recent call last):
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/watch_orphans.py", line 42, in <module>
Traceback (most recent call last):
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/watch_orphans.py", line 42, in <module>
    watch(sys.argv[1:])
    watch(sys.argv[1:])
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/watch_orphans.py", line 38, in watch
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/watch_orphans.py", line 38, in watch
    time.sleep(1)
    time.sleep(1)
KeyboardInterrupt
KeyboardInterrupt
2023-09-23 11:52:35 -0600 - dagster - DEBUG - ferc_to_sqlite_job - 20b685ae-5f69-4815-9ecb-696b40c90bda - 51947 - ENGINE_EVENT - Multiprocess executor: received termination signal - forwarding to active child processes
2023-09-23 11:52:37 -0600 - dagster - DEBUG - ferc_to_sqlite_job - 20b685ae-5f69-4815-9ecb-696b40c90bda - 51947 - ENGINE_EVENT - Multiprocess executor: interrupted all active child processes
2023-09-23 11:52:37 -0600 - dagster - ERROR - ferc_to_sqlite_job - 20b685ae-5f69-4815-9ecb-696b40c90bda - 51947 - RUN_FAILURE - Execution of run for "ferc_to_sqlite_job" failed. Execution was interrupted unexpectedly. No user initiated termination request was found, treating as failure.

dagster._core.errors.DagsterExecutionInterruptedError

Stack Trace:
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 755, in job_execution_iterator
    for event in job_context.executor.execute(job_context, execution_plan):
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/executor/multiprocess.py", line 307, in execute
    raise DagsterExecutionInterruptedError()

Traceback (most recent call last):
  File "/Users/zane/mambaforge/envs/pudl-dev/bin/ferc_to_sqlite", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/zane/code/catalyst/pudl/src/pudl/ferc_to_sqlite/cli.py", line 142, in main
    result = execute_job(
             ^^^^^^^^^^^^
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 460, in execute_job
    return _logged_execute_job(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/telemetry.py", line 168, in wrap
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 513, in _logged_execute_job
    return execute_run(
           ^^^^^^^^^^^^
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 256, in execute_run
    event_list = list(_execute_run_iterable)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 866, in __iter__
    yield from self.iterator(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 755, in job_execution_iterator
    for event in job_context.executor.execute(job_context, execution_plan):
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/executor/multiprocess.py", line 307, in execute
    raise DagsterExecutionInterruptedError()
dagster._core.errors.DagsterExecutionInterruptedError

@zaneselvans
Member Author

zaneselvans commented Sep 23, 2023

This behavior doesn't seem to show up in #2821, and the only thing that seems like it could be making a difference is the switch from pandas 2.0.x to 2.1.x, so I'm going to re-install pandas 2.0.3, re-run the script, and see if it behaves wildly differently...

Is there anything in the Pandas 2.1 release notes that seems like it might have impacted what we're doing? Most of the things I saw were related to data types. Maybe we are somehow ending up with a bunch of object columns that should be cast to string or string[pyarrow]?

It looks like this option can make PyArrow-backed strings the default, reflecting future behavior:

pd.options.future.infer_string = True
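A minimal sketch of the object-vs-string distinction in question (values hypothetical, not from the PUDL codebase):

```python
import pandas as pd

# Mixed str + None produces an object column -- the kind of column
# whose handling may shift under pandas 2.1's stricter type behavior.
s = pd.Series(["0.0", "1.5", None])
assert s.dtype == object

# Casting explicitly to the nullable string dtype makes the intent clear
# and avoids leaving ambiguous object columns lying around.
s_str = s.astype("string")
assert s_str.dtype == "string"
```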

With the only change being going from pandas v2.1.1 to v2.0.3, ferc_to_sqlite completes in 13 minutes locally rather than hanging for hours, so it seems like something funny is going on in there.

@zaneselvans zaneselvans marked this pull request as ready for review September 24, 2023 04:02
@zaneselvans
Member Author

There are some problems due to foreign key constraint violations & unmapped utilities, but those seem similar to what's happening in #2821 already.

There are also a few small changes in the number of records in the FERC 1 outputs, but I don't know whether these are the same as the ones that have popped up in #2821, or whether they're a result of the change I made in ferc-xbrl-extractor v1.1.1:

FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fbp_ferc1-25421] - ValueError: fbp_ferc1: found 25406 rows, expected 25421. Off by -0.059%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fuel_ferc1-48841] - ValueError: fuel_ferc1: found 48815 rows, expected 48841. Off by -0.053%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plant_in_service_ferc1-315206] - ValueError: plant_in_service_ferc1: found 315112 rows, expected 315206. Off by -0.030%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_all_ferc1-54284] - ValueError: plants_all_ferc1: found 54415 rows, expected 54284. Off by 0.241%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_hydro_ferc1-6796] - ValueError: plants_hydro_ferc1: found 6798 rows, expected 6796. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_small_ferc1-16235] - ValueError: plants_small_ferc1: found 16248 rows, expected 16235. Off by 0.080%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_steam_ferc1-30709] - ValueError: plants_steam_ferc1: found 30825 rows, expected 30709. Off by 0.378%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-purchased_power_ferc1-197523] - ValueError: purchased_power_ferc1: found 197947 rows, expected 197523. Off by 0.215%, allowed margin of 0.000%

@zaneselvans zaneselvans left a comment (Member Author)

This PR leaked over into ferc-xbrl-extractor because beyond pandas API changes in this repo, there was a big performance bottleneck over in that repo, which was addressed in catalyst-cooperative/ferc-xbrl-extractor#144

The CI in this repo doesn't pass right now, but I think that's for reasons unrelated to the pandas 2.1 update and the changes to ferc-xbrl-extractor -- I think the change in row counts and the foreign key constraint violations / unmapped plants and utilities also exist in #2821, but we should compare notes with @jdangerx.

We should probably merge this into the upstream branch before that branch hits dev because of the performance & compatibility issues with the ferc-xbrl-extractor==1.1.1 update.

Comment on lines -158 to +167
ferc1_solo -> Test whether FERC 1 can be loaded into the PUDL database alone.
integration -> Run all software integration tests and process a full year of data.
minmax_rows -> Check that all outputs have the expected number of rows.
validate -> Run all data validation tests. This requires a complete PUDL DB.
ferc1_schema -> Verify FERC Form 1 DB schema are compatible for all years.
jupyter -> Ensure that designated Jupyter notebooks can be executed.
full_integration -> Run ETL and integration tests for all years and data sources.
full -> Run all CI checks, but for all years of data.
build -> Prepare Python source and binary packages for release.
testrelease -> Do a dry run of Python package release using the PyPI test server.
release -> Release the PUDL package to the production PyPI server.
nuke -> Nuke & recreate SQLite & Parquet outputs, then run all tests and
data validations against the new outputs.
get_unmapped_ids -> Make the raw FERC1 DB and generate a PUDL database with only EIA in
order to generate any unmapped IDs.
Member Author

The test environments were generally out of date and I just cut and pasted in the current ones.

I noticed that the ferc1_schema test was gone entirely, and am wondering whether we are no longer checking anywhere for inter-year FERC DBF database schema compatibility with the reference year, or whether that got moved to the more generic DBF extractor code. I poked around a bit but didn't see it immediately. @rousik

Comment on lines -324 to -358
Assuming you do want to run the ETL and build new databases as part of the test
you're running, the contents of that database are determined by an ETL settings
file. By default, the settings file that's used is
``test/settings/integration-test.yml``. But it's also possible to use a
different input file, generating a different database, and then run some
tests against that database.

For example, we test that FERC 1 data can be loaded into a PUDL database all
by itself by running the ETL tests with a settings file that includes only a
couple of FERC 1 tables for a single year. This is the ``ferc1_solo`` Tox
test environment:

.. code-block:: console

$ pytest --etl-settings=test/settings/ferc1-solo-test.yml test/integration/etl_test.py

Similarly, we use the ``test/settings/full-integration-test.yml`` settings file
to specify an exhaustive collection of input data, and then we run a test that
checks that the database schemas extracted from all historical FERC 1 databases
are compatible with each other. This is the ``ferc1_schema`` test:

.. code-block:: console

$ pytest --etl-settings test/settings/full-integration-test.yml test/integration/etl_test.py::test_ferc1_schema

The raw input data that all the tests use is ultimately coming from our
`archives on Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__.
However, you can optionally tell the tests to look in different places for more
rapidly accessible caches of that data and to force the download of a fresh
copy (especially useful when you are testing the datastore functionality
specifically). By default, the tests will use the datastore that's part of your
local PUDL workspace.

For example, to run the ETL portion of the integration tests and download
fresh input data to a temporary datastore that's later deleted automatically:
Member Author

Removed references to ferc1_schema and ferc1_solo environments, which no longer exist.

Member Author

Removed this out-of-date flowchart describing the ETL.

Member

I guess the dagster UI already shows a lot of this information now!

- "pandas[parquet,excel,fss,gcp,compression]>=2.0,<2.1",
+ "pandas[compression,excel,fss,gcp,parquet,performance]>=2,<2.2",
Member Author

In addition to allowing pandas 2.1.x, I alphabetized these and included performance since it did result in a noticeable speedup in my testing in the ferc-xbrl-extractor repo (even though that didn't end up being the fix I used).

@@ -552,7 +552,7 @@ def convert_units(col: pd.Series, params: UnitConversion) -> pd.Series:
new_name = col.name
if (col.name == new_name) & (params.from_unit != "") & (params.to_unit != ""):
logger.debug(f"Old and new column names are identical: {col.name}.")
- col = (params.multiplier * col) + params.adder
+ col = (params.multiplier * pd.to_numeric(col)) + params.adder
Member Author

There were half a dozen tables where col ended up containing numeric strings like "0.0" as a result of ferc-xbrl-extractor==1.1.1 -- though I'm not sure what about the change in method over there affected the output type.
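A minimal sketch of why the pd.to_numeric call helps here (column values, name, and unit-conversion parameters are all hypothetical):

```python
import pandas as pd

# A column of numeric strings like those produced by the extractor change.
col = pd.Series(["0.0", "100.0", None], name="capacity_mw")

# multiplier * col would fail (or misbehave) on an object column of
# strings; pd.to_numeric coerces it to float64 first, preserving nulls
# as NaN. Multiplier/adder values are made up for illustration.
multiplier, adder = 0.5, 1.0
converted = (multiplier * pd.to_numeric(col)) + adder
assert converted.dtype == "float64"
```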

Member

This was from 1.1.0 to 1.1.1? Surprising... this seems OK for now but I'm definitely curious what changed.

It's totally possible that we'll keep changing the extractor - I don't think this is blocking, though.

Member Author

There are notes in the PRs / issues but the thing that changed is what kind of NULL value ends up being generated. In v1.1.0 with gb.ffill().iloc[-1] we ended up with mixed str + np.nan => object columns that resulted in numeric values. In v1.1.1 using gb.last() we ended up with mixed str + None => object columns that are interpreted as strings. 🤷🏼
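A minimal sketch of that NULL-value distinction (values hypothetical):

```python
import numpy as np
import pandas as pd

# v1.1.0-style result: str values mixed with np.nan
a = pd.Series(["1.0", np.nan])
# v1.1.1-style result: str values mixed with None
b = pd.Series(["1.0", None])

# Both are object columns, and both nulls register as NA...
assert a.dtype == object and b.dtype == object
assert a.isna().tolist() == b.isna().tolist() == [False, True]

# ...but the underlying null objects differ (float nan vs. NoneType),
# which can steer later type inference down different paths.
assert isinstance(a.iloc[1], float)
assert b.iloc[1] is None
```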

@zaneselvans zaneselvans changed the title Fix issues arising from stricter typing used in pandas 2.1 Fix isues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 Sep 24, 2023
@zaneselvans zaneselvans linked an issue Sep 24, 2023 that may be closed by this pull request
dependabot bot and others added 4 commits September 25, 2023 07:04
Updates the requirements on [fsspec](https://github.com/fsspec/filesystem_spec) to permit the latest version.
- [Commits](fsspec/filesystem_spec@2021.07.0...2023.9.2)

---
updated-dependencies:
- dependency-name: fsspec
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Updates the requirements on [dask](https://github.com/dask/dask) to permit the latest version.
- [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md)
- [Commits](dask/dask@2021.08.0...2023.9.2)

---
updated-dependencies:
- dependency-name: dask
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
…/fsspec-gte-2021.7-and-lt-2023.9.3

Update fsspec requirement from <2023.6.1,>=2021.7 to >=2021.7,<2023.9.3
…/dask-gte-2021.8-and-lt-2023.9.3

Update dask requirement from <2023.9.2,>=2021.8 to >=2021.8,<2023.9.3
@zaneselvans zaneselvans changed the title Fix isues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 Fix issues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 Sep 25, 2023
@jdangerx jdangerx left a comment (Member)

Yeah, this all looks pretty good! Thanks for doing all this maintenance :) :shipit:

@@ -44,7 +44,7 @@ class CalculationToleranceFerc1(BaseModel):
intertable_calculation_errors=0.20,
),
"balance_sheet_assets_ferc1": CalculationToleranceFerc1(
-        intertable_calculation_errors=0.85,
+        intertable_calculation_errors=1.00,
Member

😢

Member Author

It's okay man we're back down to 60% on the other other FERC branch!


for ref, fix in zip(refs, ASSOCIATIONS):
for year in range(fix["to"][0], fix["to"][1] + 1):
key = (fix["id"], pd.Timestamp(year, 1, 1))
if key not in dfi.index:
rows.append({**ref, "report_date": key[1]})
# Append to original table
-    df = pd.concat([df, pd.DataFrame(rows)])
+    new_rows = apply_pudl_dtypes(pd.DataFrame(rows), group="eia")
Member

super-nit: maybe rows_dtyped is a little clearer about what's new about these rows?

Or, later on line 279 you just skip the variable assignment altogether, which also seems legitimate - maybe we can do that here too?

Member Author

Okay, went with the solution from below.

@zaneselvans
Member Author

@jdangerx well for better or worse I apparently love doing this kind of maintenance stuff.

@zaneselvans zaneselvans merged commit 11c2311 into 2810-run-2021-ferc-1-data-through-new-more-complete-extractor Sep 26, 2023
8 of 9 checks passed
@zaneselvans zaneselvans deleted the pandas-2.1 branch September 26, 2023 00:20