Improve memory usage of MatrixBuilder #372

Closed
thcrock opened this issue Jan 21, 2018 · 0 comments
thcrock commented Jan 21, 2018

We have had memory problems while building matrices. Recycling worker processes in MulticoreExperiment can help, but we can likely optimize what the builder itself is doing. It currently reads all of the constituent dataframes (each feature group, plus labels) into memory, and then .join-s them together into one large dataframe before saving.

Potentially we could instead read the small dataframes from SQL one by one, join each into the big one, and make sure each small frame gets garbage collected before reading the next. Also, I have no idea if .join is the best way to perform the joining.
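The incremental approach described above could be sketched roughly as follows. This is a minimal sketch, not Triage's actual MatrixBuilder code; `frame_loaders` and `build_matrix_incrementally` are hypothetical names, with each loader standing in for e.g. a `pd.read_sql` call for one feature group:

```python
# Sketch of the incremental-join idea; NOT Triage's actual
# MatrixBuilder implementation. `frame_loaders` is a hypothetical
# list of callables, each returning one small DataFrame (e.g. a
# lambda wrapping pd.read_sql for a single feature group).
import gc
import pandas as pd

def build_matrix_incrementally(frame_loaders, index_cols):
    big_df = None
    for load in frame_loaders:
        small = load().set_index(index_cols)
        # Join this feature group into the accumulating matrix
        big_df = small if big_df is None else big_df.join(small)
        # Drop the reference and collect, so the small frame is freed
        # before the next one is read
        del small
        gc.collect()
    return big_df
```

Note that this only caps how many *small* frames are alive at once; the accumulating join itself still allocates a new frame on each step, which is the weakness the later commits address.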

@thcrock thcrock self-assigned this Oct 9, 2018
thcrock added a commit that referenced this issue Oct 10, 2018
Introduces triage.util.pandas.downcast_matrix, which downcasts the
values of a dataframe to their minimal dtypes (e.g. float64->float32, or
int64->int32). This is called in two places:

1. Within MatrixBuilder, on each smaller matrix before they are joined
together, with the intention of stopping memory from spiking at the
join step.

2. When loading a matrix into memory via the MatrixStore class.
Since it made sense to put this in the superclass rather than
forcing each subclass to implement it, it was added to the .matrix
getter. While doing this, it made sense to do the same for the set_index
call as well, allowing some further cleanup of the MatrixStore
subclasses.
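A downcaster in the spirit of triage.util.pandas.downcast_matrix might look like the sketch below. This is a hypothetical illustration, not the actual Triage implementation; note that `pd.to_numeric` picks the smallest dtype the values fit in, so small integers may land below int32 (e.g. int8):

```python
import pandas as pd

def downcast_matrix(df):
    """Sketch of a per-column dtype downcaster; hypothetical, not the
    actual triage.util.pandas.downcast_matrix implementation."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            # e.g. float64 -> float32 when the values allow it
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int32/int16/int8, whatever fits
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out
```

Downcasting before the join halves (or better) the per-column footprint of each input frame, which is exactly where the join-step spike originates.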
thcrock added a commit that referenced this issue Oct 11, 2018
thcrock added a commit that referenced this issue Oct 17, 2018
thcrock added a commit that referenced this issue Apr 4, 2019
thcrock added a commit that referenced this issue Apr 4, 2019
Matrix building is still very memory-intensive, for no particularly good
reason: we're not using the matrices at that point, just transferring
them from database to disk, with an in-Python join to get around column
limits. While we're still using pandas to build the matrices themselves,
this is hard to get around: any type of pandas join will use a multiple
of the memory the data actually needs. Bringing memory usage down to
what the data requires is better, but better still is making memory
usage controllable by never keeping the matrix in memory at all.

Using Ohio's PipeTextIO makes this technically feasible, but to make it
work out we also need to remove HDF support. HDF support was added
merely for the compression capabilities, and with recent changes to
compress CSVs, this is no longer needed.

MatrixStore changes:
- Remove HDFMatrixStore and hdf support from the experiment and CLI
- Modify MatrixStore.save to take in a bytestream instead of assuming it
has a dataframe available to convert
- Add null column check to loading/preprocessing instead of after we build matrix

MatrixBuilder changes:
- Convert the intermediate-dataframe-generating functions into query-generating functions, because we can't use intermediate dataframes anymore. These queries also no longer duplicate the index columns (entity id, as of date), so the Python joining code doesn't have to remove them manually.
- Since there is no dataframe anymore, the row count has to come from the database.
- Add more prebuild checks to make sure that the joins will work; without a dataframe.join at the end, column mismatches would otherwise no longer raise explicit errors.

Other changes:
- Remove unused utils that mentioned hdf
- Remove HDF section from experiment run document
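The streaming idea can be illustrated without a database. This is a toy sketch, not the actual Ohio/PipeTextIO integration; `stream_rows_to_csv` is a hypothetical name, and `rows` stands in for a server-side cursor or pipe producer. The point is that each row is written to the destination as it arrives, so the full matrix never exists in memory at once:

```python
import csv
import io

def stream_rows_to_csv(rows, header, fileobj):
    # Write each row as it arrives; peak memory is one row, not the
    # whole matrix. `rows` stands in for a server-side database
    # cursor or a PipeTextIO-style producer.
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow(row)

# Toy usage with an in-memory destination:
buf = io.StringIO()
stream_rows_to_csv([(1, 0.5), (2, 0.25)], ["entity_id", "score"], buf)
```

Combined with compressed CSV output, this makes HDF (which was only ever used here for its compression) unnecessary.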