
Pandas performance & automated release #144

Merged: 4 commits into main on Sep 24, 2023
Conversation

zaneselvans (Member) opened this pull request:

  • Address performance issues that surfaced under pandas 2.1
  • Add an automated release workflow that will attempt to publish a new package when a tag starting with v is pushed.
  • Address some deprecation warnings in the command line interface test cases.
  • Fix some bad URLs in the package metadata.
  • Add additional keywords to the package metadata.


codecov bot commented Sep 24, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (5e75cb8) 93.09% compared to head (eddb0a5) 93.09%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #144   +/-   ##
=======================================
  Coverage   93.09%   93.09%           
=======================================
  Files           8        8           
  Lines         594      594           
=======================================
  Hits          553      553           
  Misses         41       41           
Files Changed Coverage Δ
src/ferc_xbrl_extractor/xbrl.py 91.46% <ø> (ø)


@zaneselvans zaneselvans merged commit e9a931a into main Sep 24, 2023
14 checks passed
@zaneselvans zaneselvans deleted the pandas-performance branch September 24, 2023 03:44
@zaneselvans zaneselvans linked an issue Sep 24, 2023 that may be closed by this pull request
@@ -84,9 +84,11 @@ def _dedupe_newer_report_wins(df: pd.DataFrame, primary_key: list[str]) -> pd.Da
     old_index = df.index.names
     return (
         df.reset_index()
+        .sort_values("report_date")
zaneselvans (Member, Author) commented Sep 24, 2023:

I think we can sort_values() before doing the groupby because, according to the groupby docs, "Groupby preserves the order of rows within each group."
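
As a rough sketch of that pattern (the entity_id column and the toy data here are invented for illustration and aren't from this PR):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "entity_id": ["A", "B", "A", "B"],  # hypothetical grouping column
        "report_date": pd.to_datetime(
            ["2022-01-01", "2021-01-01", "2021-01-01", "2022-01-01"]
        ),
        "value": [2.0, 10.0, 1.0, 20.0],
    }
)

# Sort once before grouping; groupby preserves this row order within each
# group, so "last" within a group means "most recent report_date".
latest = df.sort_values("report_date").groupby("entity_id").last()
print(latest)
#            report_date  value
# entity_id
# A           2022-01-01    2.0
# B           2022-01-01   20.0
```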

zaneselvans (Member, Author) commented Sep 24, 2023:

One potentially non-deterministic outcome here is that if there are duplicate values of report_date within a group, then which row of data ends up being "last" might not always be the same. Do we think there can ever be multiple filings within a group that have the same date?

Another member replied:

Yes!

For some reason, I made this issue in the PUDL repo instead: catalyst-cooperative/pudl#2822, but the short of it is "we need to use metadata from rssfeed to sort them better."
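
A small sketch of the concern above (again with invented column names and toy data): when two rows in the same group share a report_date, which one ends up "last" is determined by the order the rows arrived in, not by anything in the data itself.

```python
import pandas as pd

dupes = pd.DataFrame(
    {
        "entity_id": ["A", "A"],
        "report_date": pd.to_datetime(["2022-01-01", "2022-01-01"]),
        "value": [1.0, 2.0],
    }
)

# A stable sort keeps the input order among tied report_dates, so the row
# that happens to come later in the input wins...
print(dupes.sort_values("report_date", kind="stable").groupby("entity_id").last())

# ...and reversing the input order flips which duplicate-date row wins.
print(
    dupes.iloc[::-1]
    .sort_values("report_date", kind="stable")
    .groupby("entity_id")
    .last()
)
```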

         .groupby(unique_cols)
-        .apply(lambda df: df.sort_values("report_date").ffill().iloc[-1])
+        .last()
zaneselvans (Member, Author) commented Sep 24, 2023:

groupby.last() selects the last non-null value from each column, which I think is the intention of ffill().iloc[-1], since that will forward-fill the last non-null value within the group and then select the last element, right?

Another member replied:

Perfect! That's the exact intent & also this conveys what's going on way better!
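
A quick sketch of that equivalence (invented column names and data, not from this PR): within each group, .last() returns the last non-null value of every column, so a trailing NaN falls back to the most recent non-null value, which is what the old ffill().iloc[-1] pattern produced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "entity_id": ["A", "A", "A"],
        "report_date": pd.to_datetime(["2021-01-01", "2022-01-01", "2023-01-01"]),
        "x": [1.0, np.nan, np.nan],  # last non-null is 1.0
        "y": [10.0, 20.0, np.nan],   # last non-null is 20.0
    }
)

deduped = df.sort_values("report_date").groupby("entity_id").last()
print(deduped)
#            report_date    x     y
# entity_id
# A           2023-01-01  1.0  20.0
```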

Comment on lines -112 to -145
#######################################################################################
# Software Package Build & Release
#######################################################################################
[testenv:build]
description = Prepare Python source and binary packages for release.
basepython = python3
skip_install = false
extras =
    dev
commands =
    bash -c 'rm -rf build/* dist/*'
    python -m build

[testenv:testrelease]
description = Do a dry run of Python package release using the PyPI test server.
basepython = python3
skip_install = false
extras =
    dev
commands =
    {[testenv:build]commands}
    twine check dist/*
    twine upload --verbose --repository testpypi --skip-existing dist/*

[testenv:release]
description = Release the package to the production PyPI server.
basepython = python3
skip_install = false
extras =
    dev
commands =
    {[testenv:build]commands}
    twine check dist/*
    twine upload --verbose --skip-existing dist/*
zaneselvans (Member, Author) commented:

Removed because it takes place in the release workflow now.

Another member replied:

wooohoo!

@@ -19,7 +19,7 @@ def test_extractor_scripts(script_runner, ep):

The script_runner fixture is provided by the pytest-console-scripts plugin.
"""
-    ret = script_runner.run(ep, "--help", print_result=False)
+    ret = script_runner.run([ep, "--help"], print_result=False)
zaneselvans (Member, Author) commented:

Just a switch from deprecated syntax.

@zaneselvans zaneselvans added the performance Resource consumption like memory or CPU intensity label Sep 25, 2023
@zaneselvans zaneselvans self-assigned this Sep 25, 2023
Labels
performance Resource consumption like memory or CPU intensity
Development

Successfully merging this pull request may close these issues.

XBRL extraction is much slower with pandas 2.1.1 than pandas 2.0.3