Commit
Introduces triage.util.pandas.downcast_matrix, which downcasts the numeric columns of a dataframe to smaller dtypes that can still hold their values (e.g. float64->float32, or int64->int32).

This is called in two places:

1. Within MatrixBuilder, on each smaller matrix before they are joined together, with the intention of stopping memory from spiking at the join step.
2. When loading a matrix into memory from the MatrixStore class. Since it made sense to put this in the superclass rather than forcing each subclass to implement it, it was added to the .matrix getter. While doing this, it made sense to do the same for the set_index call as well, which allowed some further cleanup of the MatrixStore subclasses.
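To make the first call site more concrete, here is a minimal sketch of downcasting small matrices before joining them. It is not the MatrixBuilder code: the `downcast` helper is a condensed stand-in for the function added in this commit, and the frames, column names, and `entity_id` index are hypothetical.

```python
from functools import partial

import pandas as pd


def downcast(df):
    # Condensed stand-in for downcast_matrix: a float pass, then an integer pass.
    return df.apply(partial(pd.to_numeric, downcast="float")).apply(
        partial(pd.to_numeric, downcast="integer")
    )


# Hypothetical smaller matrices that share the same index.
index = pd.Index([1, 2, 3], name="entity_id")
counts = pd.DataFrame({"n_events": [4, 0, 7]}, index=index)  # int64 column
averages = pd.DataFrame({"avg_score": [0.5, 1.25, 2.0]}, index=index)  # float64 column

# Downcast each piece first, then join, so the joined frame is built
# from the narrower dtypes rather than from 64-bit columns.
joined = downcast(counts).join(downcast(averages))
naive = counts.join(averages)

print(joined.dtypes)
print(joined.memory_usage().sum(), "bytes vs", naive.memory_usage().sum(), "bytes")
```

Note that `pd.to_numeric` picks the smallest dtype that fits the data, so an integer column with small values can end up narrower than int32.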
Showing 5 changed files with 55 additions and 17 deletions.
@@ -0,0 +1,13 @@
from triage.util.pandas import downcast_matrix
from .utils import matrix_creator


def test_downcast_matrix():
    df = matrix_creator()
    downcasted_df = downcast_matrix(df)

    # make sure the contents are equivalent
    assert (downcasted_df == df).all().all()

    # make sure the memory usage is lower because there would be no point of this otherwise
    assert downcasted_df.memory_usage().sum() < df.memory_usage().sum()
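The memory assertion in the test sums `DataFrame.memory_usage()`, which reports a byte count for the index plus one per column. The sketch below shows what that comparison measures. The frame and its column names are made up for illustration and are not the output of `matrix_creator`, and `astype` is used here only to produce a narrower copy.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
smaller = df.astype({"a": "int8", "b": "float32"})

# memory_usage() returns a Series of byte counts: one entry labelled
# "Index", then one entry per column.
before = df.memory_usage()
after = smaller.memory_usage()

print(before)
print(after)

# Only the column entries shrink; the index entry is the same in both frames.
assert after["Index"] == before["Index"]
assert after.sum() < before.sum()
```

Because the index entry is included in the sum and is left untouched, the test's inequality holds only when at least one column actually gets narrower.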
@@ -0,0 +1,25 @@
from functools import partial
import pandas as pd
import logging


def downcast_matrix(df):
    """Downcast the numeric values of a matrix.

    This will make the matrix use less memory by turning, for instance,
    int64 columns into int32 columns.

    First converts floats and then integers.

    Operates on the dataframe as passed, without doing anything to the index.
    Callers may pass an index-less dataframe if they wish to re-add the index afterwards
    and save memory on the index storage.
    """
    logging.info("Downcasting matrix. Starting memory usage: %s", df.memory_usage())
    new_df = (
        df.apply(partial(pd.to_numeric, downcast="float"))
        .apply(partial(pd.to_numeric, downcast="integer"))
    )

    logging.info("Downcasted matrix. Final memory usage: %s", new_df.memory_usage())
    return new_df
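The docstring mentions passing an index-less dataframe and re-adding the index afterwards. One possible reading of that pattern is sketched below. The `downcast_matrix` here is a condensed copy without the logging calls, and the frame and its column names are hypothetical. It also assumes every column is numeric, since `pd.to_numeric` treats other column types differently.

```python
from functools import partial

import pandas as pd


def downcast_matrix(df):
    """Condensed copy of the function above, without the logging calls."""
    return df.apply(partial(pd.to_numeric, downcast="float")).apply(
        partial(pd.to_numeric, downcast="integer")
    )


# Hypothetical index-less matrix: the would-be index is still an ordinary column.
raw = pd.DataFrame(
    {
        "entity_id": [100001, 100002, 100003],
        "n_events": [4, 0, 7],
        "avg_score": [0.5, 1.25, 2.0],
    }
)

# Downcast every column first, then promote the key column to the index.
matrix = downcast_matrix(raw).set_index("entity_id")

print(matrix.dtypes)
print(matrix.index)
```

Whether the index itself ends up narrower depends on the pandas version, since older releases store integer indexes as 64-bit regardless of the column's dtype, so the sketch makes no claim about the resulting index dtype.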