
Cache dimension snakecase #181

Merged: 2 commits merged into main from cache-snakecase on Jan 19, 2024
Conversation

@rousik (Collaborator) commented Dec 29, 2023

It turns out that this piece of code gets run very frequently and accounts for a large share of the runtime of XBRL extraction. In the test scenario, this improvement seems to produce a ~15% reduction in extraction time.
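The diff itself isn't shown in the thread, but the pattern the PR title and description point at — memoizing a hot CamelCase-to-snake_case conversion so repeated dimension names become a dict lookup after the first call — can be sketched like this. The function and regex names below are illustrative, not the extractor's actual identifiers:

```python
from functools import lru_cache
import re

# Matches the position between a lowercase letter/digit and an uppercase
# letter, i.e. a CamelCase word boundary.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


@lru_cache(maxsize=None)
def to_snake_case(name: str) -> str:
    """Convert a CamelCase dimension name to snake_case, caching results.

    Dimension names repeat heavily across facts, so after the first
    conversion every subsequent call is a cache hit.
    """
    return CAMEL_BOUNDARY.sub("_", name).lower()
```

Because `lru_cache` keys on the string argument, the regex substitution runs only once per distinct name, which is where a ~15% speedup on a conversion-heavy workload could plausibly come from.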

rousik and others added 2 commits December 26, 2023 11:37
It turns out that this piece of code gets run very frequently
and accounts for quite a lot of runtime for xbrl extraction.
In the test scenario, this improvement seems to produce a ~15%
reduction in runtime.

codecov bot commented Dec 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (7c2e2f6) at 93.43% vs. head (e724aa2) at 93.44%.
The report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   93.43%   93.44%   +0.01%     
==========================================
  Files           8        8              
  Lines         609      610       +1     
==========================================
+ Hits          569      570       +1     
  Misses         40       40              


@zaneselvans (Member) left a comment

Cool! I bet there are other speedups lurking in here too. Thanks for finding this.

@rousik (Collaborator, Author) commented Dec 29, 2023

> Cool! I bet there are other speedups lurking in here too. Thanks for finding this.

Yep. I would like to test this in the full PUDL ETL before committing this PR, so I'm trying to figure out the right (and low-effort) way to go about that, since I'm assuming this won't be the last such change.

I should probably also write up a wiki doc on how to profile the code to find things like this. It turns out that multi-processing is kind of annoying here: the use of process executors definitely got in the way of getting clean profiles.
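The profiling write-up mentioned here isn't in the thread, but a minimal single-process recipe with the standard library's `cProfile`/`pstats` is one common starting point (the `extract_all` workload below is a stand-in, not the extractor's real entry point):

```python
import cProfile
import pstats


def extract_all(items):
    # Stand-in for the extraction workload being profiled; the real
    # entry point in ferc-xbrl-extractor differs.
    return [s.lower().replace(" ", "_") for s in items]


profiler = cProfile.Profile()
profiler.enable()
extract_all(["Some Dimension Name"] * 10_000)
profiler.disable()

# Print the five functions with the highest cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

Running the workload in a single process like this sidesteps the process-executor problem described above, since `cProfile` only sees the interpreter it runs in.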

@zaneselvans (Member) commented Dec 29, 2023

For testing the altered ferc-xbrl-extractor, is there a reason not to just check it out alongside the main pudl repo and then install it within your pudl-dev environment?

```shell
pip install --no-deps --no-cache-dir --editable ../ferc-xbrl-extractor
```

That should replace the version of ferc-xbrl-extractor in your local pudl-dev environment with whichever version you have checked out. Then you can edit either repository as needed and re-run the tests or the full ETL with Dagster, and they'll reflect changes in both pudl and ferc-xbrl-extractor.

Once you're satisfied with the changes to ferc-xbrl-extractor, you can commit them, we can cut a new release by pushing a version tag, and you can update the required version in PUDL's pyproject.toml.
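The tag-push release flow described above might look roughly like this; the version number `v1.2.3` is illustrative, and the sketch runs in a throwaway repo so the commands are self-contained:

```shell
# Create a scratch repo so this sketch is runnable anywhere.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "initial"

# Cut a release by creating an annotated version tag.
git tag -a v1.2.3 -m "Release v1.2.3"
git tag --list

# In the real workflow you'd then push the tag to trigger the release:
#   git push origin v1.2.3
```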

@rousik (Collaborator, Author) commented Dec 29, 2023

Thanks, that definitely sounds like the easiest way to go about testing this.

@rousik (Collaborator, Author) commented Jan 18, 2024

I have tested this in the context of ferc_to_sqlite; sadly, the benefits are minimal at that point. I suppose the multi-processing must be making the caches less effective, or the bottlenecks are different in that scenario. I avoided multi-processing for CPU profiling because it was more difficult to obtain profiles across process boundaries. I will try to look into this some more. We could submit this, but the benefits are likely not going to be as good as initially promised :-/

@zaneselvans (Member) left a comment

Seems like we might as well merge it in. It doesn't really add complexity and it'll improve performance in some computational contexts even if it doesn't right now, so why not?

@rousik (Collaborator, Author) commented Jan 18, 2024

> Seems like we might as well merge it in. It doesn't really add complexity and it'll improve performance in some computational contexts even if it doesn't right now, so why not?

I would like to re-run the analysis, and that would be harder once this is merged in. In general, I'm a bit wary of optimizations that don't actually help :-/

@zaneselvans (Member) commented
Haha, okay, well I'm fine with just closing this PR too. Splitting off the FERC extractions in our CI reorganization will be a much, much larger performance boost.

@rousik (Collaborator, Author) commented Jan 19, 2024

Let's merge. It may not help, but it won't hurt, and realistically I don't have enough cycles to dig into this before I clear other things off my plate. You're right that more significant speedups will likely come from the sharding of ferc_to_sqlite.

@zaneselvans zaneselvans added the performance Resource consumption like memory or CPU intensity label Jan 19, 2024
@zaneselvans zaneselvans merged commit 226bc86 into main Jan 19, 2024
16 checks passed
@zaneselvans zaneselvans deleted the cache-snakecase branch January 19, 2024 16:02
Labels
performance Resource consumption like memory or CPU intensity
2 participants