Reduce Memory Usage of Matrix Building [Resolves #372]
Matrix building is still very memory-intensive, for no particularly good
reason: we're not using the matrices at that point, just transferring
them from database to disk with an in-Python join to get around column
limits. While we're still using pandas to build the matrices
themselves, this is hard to get around: any pandas join uses several
times the memory the data itself needs. Bringing memory usage down to
what the data actually requires is an improvement, but better still is
to make it controllable by never holding the whole matrix in memory.

Using Ohio's PipeTextIO makes this technically feasible, but to make it
work we also need to remove HDF support. HDF support was added merely
for its compression capabilities, and with recent changes that compress
CSVs, it is no longer needed.
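
A rough sketch of the streaming pattern this enables — the connection
string, query, and file path below are illustrative stand-ins, not
Triage's actual builder code:

```python
import ohio
import psycopg2

def stream_query_to_csv(connection, query, out_path):
    """Stream a query's rows from Postgres straight to a CSV file,
    so the full matrix is never held in memory."""
    def write_rows(pipe):
        # COPY writes CSV text into the pipe as the server produces it
        with connection.cursor() as cursor:
            cursor.copy_expert(
                f"COPY ({query}) TO STDOUT WITH CSV HEADER", pipe
            )

    # PipeTextIO runs write_rows concurrently and exposes the piped
    # text as a readable, iterable file-like object
    with ohio.PipeTextIO(write_rows) as pipe, open(out_path, "w") as out:
        for line in pipe:
            out.write(line)

connection = psycopg2.connect("dbname=mydb")  # hypothetical DSN
stream_query_to_csv(
    connection,
    "SELECT entity_id, as_of_date, feature_one FROM features",  # hypothetical query
    "matrix.csv",
)
```

Only the pipe's buffer is resident at any time, no matter how many rows
the query returns.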

MatrixStore changes:
- Remove HDFMatrixStore and hdf support from the experiment and CLI
- Modify MatrixStore.save to take a byte stream instead of assuming it
  has a dataframe available to convert (a sketch follows this list)
- Move the null-column check into loading/preprocessing, since there is
  no longer a just-built matrix to check afterwards
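
A minimal sketch of what the reworked save might look like — the class
internals and the `storage` attribute here are assumptions for
illustration, not the actual Triage implementation:

```python
import shutil

class MatrixStore:
    # Hypothetical sketch; `self.storage` stands in for whatever
    # storage backend (local disk, S3) has been configured.
    def save(self, from_fileobj):
        """Copy an already-encoded byte stream into storage, chunk by
        chunk, instead of serializing an in-memory dataframe."""
        with self.storage.open("wb") as target:
            # copyfileobj moves fixed-size chunks, so memory use is
            # bounded regardless of matrix size
            shutil.copyfileobj(from_fileobj, target)
```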

MatrixBuilder changes:
- Convert the intermediate-dataframe-generating functions into plain query-generating functions, since we can't use intermediate dataframes anymore. These queries also no longer duplicate the index columns (entity_id, as_of_date), so the Python joining code no longer has to strip them manually (see the sketch after this list).
- Since there is no dataframe anymore, the row count has to come from the database.
- Add more prebuild checks to verify that the joins will succeed; without a dataframe.join at the end, column mismatches would otherwise fail without an explicit error.
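
An illustration of both points — the function and table names here are
made up for the sketch, not Triage's actual internals:

```python
def feature_query(table, feature_columns):
    # Returns SQL instead of a dataframe; the index columns appear
    # once, in a fixed order, so nothing has to be stripped later.
    return (
        f"SELECT entity_id, as_of_date, {', '.join(feature_columns)} "
        f"FROM {table} ORDER BY entity_id, as_of_date"
    )

def matrix_row_count(connection, query):
    # With no dataframe to call len() on, ask the database instead.
    with connection.cursor() as cursor:
        cursor.execute(f"SELECT COUNT(*) FROM ({query}) AS matrix_query")
        return cursor.fetchone()[0]
```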

Other changes:
- Remove unused utils that mentioned hdf
- Remove HDF section from experiment run document
thcrock authored and Tristan Crockett committed Apr 4, 2019
1 parent da24b61 commit 7735fba
Showing 19 changed files with 360 additions and 648 deletions.
2 changes: 1 addition & 1 deletion docs/sources/experiments/architecture.md
@@ -334,7 +334,7 @@ On the other hand, new options that affect only runtime concerns (e.g. performan

## Storage Abstractions

-Another important part of enabling different execution contexts is being able to pass large, persisted objects (e.g. matrices or models) by reference to another process or cluster. To achieve this, as well as provide the ability to configure different storage mediums (e.g. S3) and formats (e,g, HDF) without changes to the Experiment class, all references to these large objects within any components are handled through an abstraction layer.
+Another important part of enabling different execution contexts is being able to pass large, persisted objects (e.g. matrices or models) by reference to another process or cluster. To achieve this, as well as provide the ability to configure different storage mediums (e.g. S3) without changes to the Experiment class, all references to these large objects within any components are handled through an abstraction layer.

### Matrix Storage

31 changes: 0 additions & 31 deletions docs/sources/experiments/running.md
@@ -114,37 +114,6 @@ experiment.run()

```

-## Using HDF5 as a matrix storage format
-
-Triage by default uses CSV format to store matrices, but this can take up a lot of space. However, this is configurable. Triage ships with an HDF5 storage module that you can use.
-
-### CLI
-
-On the command-line, this is configurable using the `--matrix-format` option, and supports `csv` and `hdf`.
-
-```bash
-triage experiment example/config/experiment.yaml --matrix-format hdf
-```
-
-### Python
-
-In Python, this is configurable using the `matrix_storage_class` keyword argument. To allow users to write their own storage modules, this is passed in the form of a class. The shipped modules are in `triage.component.catwalk.storage`. If you'd like to write your own storage module, you can use the [existing modules](https://github.com/dssg/triage/blob/master/src/triage/component/catwalk/storage.py) as a guide.
-
-```python
-from triage.experiments import SingleThreadedExperiment
-from triage.component.catwalk.storage import HDFMatrixStore
-
-experiment = SingleThreadedExperiment(
-    config=experiment_config,
-    db_engine=create_engine(...),
-    matrix_storage_class=HDFMatrixStore,
-    project_path='/path/to/directory/to/save/data',
-)
-experiment.run()
-```
-
-Note: The HDF storage option is *not* compatible with S3.

## Validating an Experiment

Configuring an experiment is complex, and running an experiment can take a long time as data scales up. If there are any misconfigured values, it's going to help out a lot to figure out what they are before we run the Experiment. So when you have completed your experiment config and want to test it out, it's best to validate the Experiment first. If any problems are detectable in your Experiment, either in configuration or the database tables referenced by it, this method will throw an exception. For instance, if I refer to the `cat_complaints` table in a feature aggregation but it doesn't exist, I'll see something like this:
