Replace use of list for cached_data and subject_ids in pytorch dataset with polars objects #74

juancq · 2023-10-19T09:40:30Z

This fixes the increased memory consumption when using multiple PyTorch dataloaders (issue #73). It also reduced the starting memory usage in my test case from 30 GB to 12 GB.

Removing this line makes all the difference:

EventStreamGPT/EventStream/data/pytorch_dataset.py

Line 309 in b10e741

self.cached_data = self.cached_data.rows()

Editing the following line didn't make much of a difference, but I edited it for consistency:

EventStreamGPT/EventStream/data/pytorch_dataset.py

Line 306 in b10e741

self.subject_ids = self.cached_data["subject_id"].to_list()

mmcdermott · 2023-10-24T19:35:27Z

Hey @juancq -- how does this impact the iteration speed through the dataloader, though? The motivation to convert things to lists was that with raw polars objects, the base iteration speed was much slower.

juancq · 2023-10-25T00:39:54Z

@mmcdermott I saw no noticeable difference in the iteration speed.

mmcdermott · 2023-11-10T20:55:42Z

@juancq I'm working on a different solution for this problem that also addresses some other issues. I'll tag you in that other PR. It's not 100% ready but it is close. It is a larger change, but I'll explain more there.

Replace use of list for cached_data with polars

eb4ec8b

This fixes the increased memory consumption issues when using multiple pytorch dataloaders.

mmcdermott closed this Nov 10, 2023

mmcdermott mentioned this pull request Dec 14, 2023

DataLoader with num_workers > 0 increases memory consumption over time #73

Open
