
Pandas performance & automated release #144

Merged: 4 commits into main on Sep 24, 2023
Conversation

zaneselvans (Member) opened this pull request:

  • Address performance issues that surfaced under pandas 2.1
  • Add an automated release workflow that will attempt to publish a new package when a tag starting with v is pushed.
  • Address some deprecation warnings in the command line interface test cases.
  • Fix some bad URLs in the package metadata.
  • Add additional keywords to the package metadata.


codecov bot commented Sep 24, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (5e75cb8) 93.09% compared to head (eddb0a5) 93.09%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #144   +/-   ##
=======================================
  Coverage   93.09%   93.09%           
=======================================
  Files           8        8           
  Lines         594      594           
=======================================
  Hits          553      553           
  Misses         41       41           
Files Changed Coverage Δ
src/ferc_xbrl_extractor/xbrl.py 91.46% <ø> (ø)


@zaneselvans zaneselvans merged commit e9a931a into main Sep 24, 2023
14 checks passed
@zaneselvans zaneselvans deleted the pandas-performance branch September 24, 2023 03:44
@zaneselvans zaneselvans linked an issue Sep 24, 2023 that may be closed by this pull request
@@ -84,9 +84,11 @@ def _dedupe_newer_report_wins(df: pd.DataFrame, primary_key: list[str]) -> pd.Da
     old_index = df.index.names
     return (
         df.reset_index()
+        .sort_values("report_date")
zaneselvans (Member, Author) commented Sep 24, 2023:

I think we can sort_values() before doing the groupby because, according to the groupby docs, "Groupby preserves the order of rows within each group."
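
As a rough sketch of that pattern (the entity_id column and the toy data here are invented for illustration and aren't from this PR):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "entity_id": ["A", "B", "A", "B"],  # hypothetical grouping column
        "report_date": pd.to_datetime(
            ["2022-01-01", "2021-01-01", "2021-01-01", "2022-01-01"]
        ),
        "value": [2.0, 10.0, 1.0, 20.0],
    }
)

# Sort once before grouping; groupby preserves this row order within each
# group, so "last" within a group means "most recent report_date".
latest = df.sort_values("report_date").groupby("entity_id").last()
print(latest)
#            report_date  value
# entity_id
# A           2022-01-01    2.0
# B           2022-01-01   20.0
```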

zaneselvans (Member, Author) commented Sep 24, 2023:

One potentially non-deterministic outcome here is that if there are duplicate values of report_date within a group, then which row of data ends up being "last" might not always be the same. Do we think there can ever be multiple filings within a group that have the same date?

Another member replied:

Yes!

For some reason, I made this issue in the PUDL repo instead: catalyst-cooperative/pudl#2822, but the short of it is "we need to use metadata from rssfeed to sort them better."
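
A small sketch of the concern above (again with invented column names and toy data): when two rows in the same group share a report_date, which one ends up "last" is determined by the order the rows arrived in, not by anything in the data itself.

```python
import pandas as pd

dupes = pd.DataFrame(
    {
        "entity_id": ["A", "A"],
        "report_date": pd.to_datetime(["2022-01-01", "2022-01-01"]),
        "value": [1.0, 2.0],
    }
)

# A stable sort keeps the input order among tied report_dates, so the row
# that happens to come later in the input wins...
print(dupes.sort_values("report_date", kind="stable").groupby("entity_id").last())

# ...and reversing the input order flips which duplicate-date row wins.
print(
    dupes.iloc[::-1]
    .sort_values("report_date", kind="stable")
    .groupby("entity_id")
    .last()
)
```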

         .groupby(unique_cols)
-        .apply(lambda df: df.sort_values("report_date").ffill().iloc[-1])
+        .last()
zaneselvans (Member, Author) commented Sep 24, 2023:

groupby.last() selects the last non-null value from each column, which I think is the intention of ffill().iloc[-1], since that will forward-fill the last non-null value within the group and then select the last element, right?

Another member replied:

Perfect! That's the exact intent & also this conveys what's going on way better!
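
A quick sketch of that equivalence (invented column names and data, not from this PR): within each group, .last() returns the last non-null value of every column, so a trailing NaN falls back to the most recent non-null value, which is what the old ffill().iloc[-1] pattern produced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "entity_id": ["A", "A", "A"],
        "report_date": pd.to_datetime(["2021-01-01", "2022-01-01", "2023-01-01"]),
        "x": [1.0, np.nan, np.nan],  # last non-null is 1.0
        "y": [10.0, 20.0, np.nan],   # last non-null is 20.0
    }
)

deduped = df.sort_values("report_date").groupby("entity_id").last()
print(deduped)
#            report_date    x     y
# entity_id
# A           2023-01-01  1.0  20.0
```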

Comment on lines -112 to -145
#######################################################################################
# Software Package Build & Release
#######################################################################################
[testenv:build]
description = Prepare Python source and binary packages for release.
basepython = python3
skip_install = false
extras =
    dev
commands =
    bash -c 'rm -rf build/* dist/*'
    python -m build

[testenv:testrelease]
description = Do a dry run of Python package release using the PyPI test server.
basepython = python3
skip_install = false
extras =
    dev
commands =
    {[testenv:build]commands}
    twine check dist/*
    twine upload --verbose --repository testpypi --skip-existing dist/*

[testenv:release]
description = Release the package to the production PyPI server.
basepython = python3
skip_install = false
extras =
    dev
commands =
    {[testenv:build]commands}
    twine check dist/*
    twine upload --verbose --skip-existing dist/*
zaneselvans (Member, Author) commented:

Removed because it takes place in the release workflow now.

Another member replied:

wooohoo!

@@ -19,7 +19,7 @@ def test_extractor_scripts(script_runner, ep):

The script_runner fixture is provided by the pytest-console-scripts plugin.
"""
-    ret = script_runner.run(ep, "--help", print_result=False)
+    ret = script_runner.run([ep, "--help"], print_result=False)
zaneselvans (Member, Author) commented:

Just a switch from deprecated syntax.

@zaneselvans zaneselvans added the performance Resource consumption like memory or CPU intensity label Sep 25, 2023
@zaneselvans zaneselvans self-assigned this Sep 25, 2023
Labels
performance Resource consumption like memory or CPU intensity
Development

Successfully merging this pull request may close these issues.

XBRL extraction is much slower with pandas 2.1.1 than pandas 2.0.3