Replace use of list for cached_data and subject_ids in pytorch dataset with polars objects #74

Closed

juancq
Contributor

juancq commented Oct 19, 2023

This fixes the increased memory consumption when using multiple PyTorch dataloaders (issue #73). It also reduced the starting memory usage in my test case from 30 GB to 12 GB.

Removing this line makes all the difference:

self.cached_data = self.cached_data.rows()

Editing the following line didn't make much of a difference, but I edited it for consistency:

self.subject_ids = self.cached_data["subject_id"].to_list()
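
One plausible reason the `.rows()` conversion is so costly: it materializes every row as a Python tuple of Python objects, while a polars DataFrame keeps each column in a packed buffer. The sketch below is not the project's code. It uses only the standard library, with `array.array` standing in for a packed column, and the column names and row count are made up for illustration. It only gives a rough feel for the per-object overhead.

```python
import sys
from array import array

N_ROWS = 100_000  # hypothetical size, chosen only for illustration

# Columnar layout: one packed buffer per column (a stand-in for a polars column).
columns = {
    "subject_id": array("q", range(N_ROWS)),
    "value": array("d", (i * 0.5 for i in range(N_ROWS))),
}

# Row layout: a list of tuples, the shape that DataFrame.rows() returns.
rows = list(zip(columns["subject_id"], columns["value"]))


def columnar_bytes(cols):
    """Approximate size of the packed columns, buffers included."""
    return sum(sys.getsizeof(c) for c in cols.values())


def row_bytes(row_list):
    """Approximate size of the list, its tuples, and the boxed values.

    This slightly overcounts values that Python shares between rows
    (for example small cached integers), so treat it as an estimate.
    """
    total = sys.getsizeof(row_list)
    for row in row_list:
        total += sys.getsizeof(row)
        total += sum(sys.getsizeof(v) for v in row)
    return total


col_size = columnar_bytes(columns)
row_size = row_bytes(rows)
print(f"columnar: {col_size / 1e6:.1f} MB, rows: {row_size / 1e6:.1f} MB")
```

Per-object overhead of this kind is also often cited as a reason memory grows across forked dataloader workers: reading a Python object updates its reference count, which can force copy-on-write of the page it lives on. Whether that is what happened in issue #73 is an assumption here, not something this thread confirms.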

@mmcdermott
Owner

Hey @juancq -- how does this impact the iteration speed through the dataloader, though? The motivation to convert things to lists was that with raw polars objects, the base iteration speed was much slower.

@juancq
Contributor Author

juancq commented Oct 25, 2023

@mmcdermott I saw no noticeable difference in the iteration speed.
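
A claim like this can be checked with a small timing harness. The following is a sketch under assumptions, not the project's dataset class: `ListBacked` and `ColumnBacked` are hypothetical stand-ins for "rows stored as a list" and "rows assembled from columns on access", and `time_iteration` works with any object that supports `len()` and integer indexing, which includes a map-style PyTorch dataset.

```python
import time


def time_iteration(dataset, n_passes=3):
    """Return the best wall-clock time (seconds) for one full indexed pass."""
    best = float("inf")
    for _ in range(n_passes):
        start = time.perf_counter()
        for i in range(len(dataset)):
            _ = dataset[i]
        best = min(best, time.perf_counter() - start)
    return best


class ListBacked:
    """Hypothetical stand-in: every row is pre-materialized as a tuple."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


class ColumnBacked:
    """Hypothetical stand-in: rows are assembled from columns on access."""

    def __init__(self, columns):
        self.columns = columns
        self.n_rows = len(next(iter(columns.values())))

    def __len__(self):
        return self.n_rows

    def __getitem__(self, idx):
        return tuple(col[idx] for col in self.columns.values())


columns = {"subject_id": list(range(10_000)), "value": [0.5] * 10_000}
rows = list(zip(*columns.values()))

list_ds = ListBacked(rows)
column_ds = ColumnBacked(columns)

list_time = time_iteration(list_ds)
column_time = time_iteration(column_ds)
print(f"list-backed: {list_time:.4f} s, column-backed: {column_time:.4f} s")
```

To compare the two variants discussed here, the same `time_iteration` call could be run against the real dataset object before and after the change, ideally through the dataloader as well, since batching and worker processes can change the picture.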

@mmcdermott
Owner

@juancq I'm working on a different solution for this problem that also addresses some other issues. I'll tag you in that other PR. It's not 100% ready but it is close. It is a larger change, but I'll explain more there.
