Evaluate on subsets [Resolves #535, #138]
This commit adds support for evaluating models against subsets of their
predictions, in both training and testing. It adds a table to the
results schemas to track subsets:

  - `model_metadata.subsets` stores subset metadata, including a hash,
    the subset configuration, and the time the row was created

The `evaluations` tables in the `train_results` and `test_results`
schemas are updated to include a new column, `subset_hash` (also added
to the primary key), which is an empty string for full-cohort
evaluations or contains the subset hash when the evaluation covers a
subset of the cohort.

A new alembic upgrade script creates the subsets table and updates the
evaluation tables.
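
As a rough sketch of the schema change (a hedged approximation, not the actual migration; column names and types beyond the hash, configuration, and created time are assumptions):

```python
from sqlalchemy import create_engine

# Hypothetical DDL approximating the migration described above; the real
# alembic script may differ in names, types, and constraints.
engine = create_engine("postgresql:///triage_results")  # assumed connection string

engine.execute("""
    create table if not exists model_metadata.subsets (
        subset_hash text primary key,     -- hash of the subset configuration
        config jsonb,                     -- the subset configuration itself
        created_timestamp timestamp       -- when the row was created
    )
""")

# the evaluations tables gain a subset_hash column (also part of the primary
# key); '' marks a full-cohort evaluation
for schema in ("train_results", "test_results"):
    engine.execute(f"""
        alter table {schema}.evaluations
        add column subset_hash text not null default ''
    """)
```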

Testing factories are included or modified for the subsets and
evaluation tables.

Most of the remaining code changes are made to the ModelEvaluator class,
which can now process subset queries and write the results to the
appropriate table [#535] and will record `NULL` values for undefined
metrics (whether due to an empty subset or lack of variation in labels
[#138]).
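
A minimal sketch of that undefined-metric behavior (the helper name and structure are illustrative, not the actual ModelEvaluator code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_or_null(metric_fn, labels, predictions):
    """Return the metric value, or None (stored as SQL NULL) when it is undefined.

    Mirrors the behavior described above: an empty subset, or labels with no
    variation (e.g. all negative), leave metrics such as AUC undefined.
    """
    labels = np.asarray(labels)
    if labels.size == 0 or len(np.unique(labels)) < 2:
        return None
    try:
        return metric_fn(labels, predictions)
    except ValueError:
        return None

# evaluate_or_null(roc_auc_score, [], [])                     -> None (empty subset)
# evaluate_or_null(roc_auc_score, [0, 0, 0], [0.2, 0.7, 0.4]) -> None (no label variation)
```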

Some changes are also made elsewhere in the experiment to support
optionally including subsets in the experiment configuration file, to
store subset metadata in the `model_metadata.subsets` table, and to
iterate over subsets in the model tester.

In addition, some changes to the documentation and `.gitignore` are
included to make modifying the results schema more joyful.
ecsalomon committed Feb 22, 2019
1 parent 59ffbe1 commit fb55bbb
Showing 27 changed files with 1,429 additions and 419 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -13,3 +13,4 @@ dist/
.ipynb_checkpoints/
venv/
my_db_config.yaml
database.yaml
11 changes: 10 additions & 1 deletion docs/sources/experiments/algorithm.md
@@ -309,7 +309,7 @@ The trained model's prediction probabilities (`predict_proba()`) are computed bo
### Individual Feature Importance
Feature importances (of a configurable number of top features, defaulting to 5) for each prediction are computed and written to the `test_results.individual_importances` table. Right now, there are no sophisticated calculation methods integrated into the experiment; simply the top 5 global feature importances for the model are copied to the `individual_importances` table.

#### Metrics
### Metrics
Triage allows for the computation of both testing set and training set evaluation metrics. Evaluation metrics, such as precision and recall at various thresholds, are written to either the `train_results.evaluations` or the `test_results.evaluations` table. Triage defines a number of [Evaluation Metrics](https://github.com/dssg/triage/blob/master/src/triage/component/catwalk/evaluation.py#L45-L58) that can be addressed by name in the experiment definition, along with a list of thresholds and/or other parameters (such as the 'beta' value for fbeta) to iterate through. Thresholding is done either via absolute value (top k) or percentile by sorting the predictions and labels by the row's predicted probability score, with ties broken at random (the random seed can be passed in the config file to make this deterministic), and assigning the predicted value as True for those above the threshold. Note that the percentile thresholds are in terms of the population percentage, not a cutoff threshold for the predicted probability.

Sometimes test matrices may not have labels for every row, so it's worth mentioning here how that is handled and interacts with thresholding. Rows with missing labels are not considered in the metric calculations, and if some of these rows are in the top k of the test matrix, no more rows are taken from the rest of the list for consideration. So if the experiment is calculating precision at the top 100 rows, and 40 of the top 100 rows are missing a label, the precision will actually be calculated on the 60 of the top 100 rows that do have a label. To make the results of this more transparent for users, a few extra pieces of metadata are written to the evaluations table for each metric score.
@@ -319,5 +319,14 @@ Sometimes test matrices may not have labels for every row, so it's worth mention
labels
* `num_positive_labels` - The number of positive labels in the test matrix
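
To illustrate the thresholding and missing-label handling described above, here is a rough sketch (not the catwalk implementation; names are illustrative):

```python
import numpy as np

def precision_at_top_pct(scores, labels, pct, sort_seed=None):
    """Illustrative sketch of the thresholding described above.

    The cutoff is a population percentage, ties are broken at random (seeded
    for determinism), and rows with a missing label (None) are dropped from
    the metric without pulling in replacement rows from below the threshold.
    """
    rng = np.random.RandomState(sort_seed)
    scores = np.asarray(scores, dtype=float)
    order = rng.permutation(len(scores))                          # random tie-breaker
    order = order[np.argsort(-scores[order], kind="mergesort")]   # stable sort by score
    k = int(len(scores) * pct / 100)                              # population percentage
    top_labels = [labels[i] for i in order[:k] if labels[i] is not None]
    return sum(top_labels) / len(top_labels) if top_labels else None
```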

Triage also supports evaluating a model on a subset of the predictions made.
This is done by passing one or more subset queries in the scoring config. The
model evaluator will then restrict the predictions to the valid entity-date
pairs for the given model and subset and will calculate metrics for that
subset, re-applying thresholds as necessary to the predictions in the subset.
Subset definitions are stored in the `model_metadata.subsets` table, and the
evaluations are stored in the `evaluations` tables. A hash of the subset
configuration identifies subset evaluations and links them to the `subsets`
table.
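
For example, subset evaluations can be pulled back out by joining on that hash; a sketch, assuming a SQLAlchemy `engine` and the table layout described above (the exact evaluation columns may differ):

```python
# list a model's subset evaluations alongside the subset configuration
subset_evaluations = engine.execute(
    """
    select e.model_id, e.metric, e.parameter, e.value, s.config
    from test_results.evaluations e
    join model_metadata.subsets s using (subset_hash)
    where e.subset_hash != ''  -- '' marks full-cohort evaluations
    """
).fetchall()
```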

### Recap
At this point, the 'model_metadata', 'train_results', and 'test_results' database schemas are fully populated with data about models, model groups, predictions, feature importances, and evaluation metrics for the researcher to query. In addition, the trained model pickle files are saved in the configured project path. The experiment is considered finished.
3 changes: 2 additions & 1 deletion docs/sources/experiments/running.md
@@ -266,11 +266,12 @@ After the experiment run, a variety of schemas and tables will be created and po
* model_metadata.experiment_models - A many-to-many table between experiments and models. This will have a row if the experiment used the model, regardless of whether or not it had to build it
* model_metadata.model_groups - A model group refers to all models that share parameters like classifier type, hyperparameters, etc., but *have different training windows*. Look at these to see how classifiers perform over different training windows.
* model_metadata.matrices - Each matrix that was used for training and testing has metadata written about it such as the matrix hash, length, and time configuration.
* model_metadata.subsets - Each evaluation subset that was used for model scoring has its configuration and a hash written here
* train_results.feature_importances - The sklearn feature importances results for each trained model
* train_results.predictions - Prediction probabilities for train matrix entities generated against trained models
* train_results.evaluations - Metric scores of trained models on the training data.
* test_results.predictions - Prediction probabilities for test matrix entities generated against trained models
* test_results.evaluations - Metric scores of trained models over given testing windows
* test_results.evaluations - Metric scores of trained models over given testing windows and subsets
* test_results.individual_importances - Individual feature importance scores for test matrix entities.

Here's an example query, which returns the top 10 model groups by precision at the top 100 entities:
Expand Down
28 changes: 28 additions & 0 deletions example/config/experiment.yaml
@@ -316,6 +316,26 @@ grid_config:
# sort_seed, if passed, will seed the random number generator for each model's
# metric creation phase. This affects how entities with the same probabilities
# are sorted
#
# subsets, if passed, will add evaluations for subset(s) of the predictions to
# the evaluations tables, using the same testing and training metric
# groups as used for overall evaluations but with any thresholds reapplied only
# to entities in the subset on the relevant as_of_dates. For example, when
# calculating precision@5_pct for the subset of women, the ModelEvaluator will
# count as predicted positive the top 5% of women, rather than any women in the
# top 5% overall. This is useful if, for example, different interventions will
# be applied to different subsets of entities (e.g., one program will provide
# subsidies to the top 500 women with children and another program will provide
# shelter to the top 150 women without children) and you would like to see
# whether a single model can be used for both applications. Subsets can also be
# used to see how a model's performance would be affected if the requirements
# for intervention eligibility became more restricted.
#
# Subsets should be a list of dictionaries with the following keys:
# - "name": a shorthand name for the subset
# - "query": a query that returns distinct entity_ids belonging to the
# subset on a given as_of_date with a placeholder for the
# as_of_date being queried
scoring:
    sort_seed: 5
    testing_metric_groups:
@@ -336,6 +356,14 @@ scoring:
    training_metric_groups:
        -
            metrics: [accuracy]
    subsets:
        -
            name: women
            query: |
                select distinct entity_id
                from demographics d
                where d.gender = 'woman'
                and demographic_date < '{as_of_date}'::date
# INDIVIDUAL IMPORTANCES
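To make the re-thresholding described in the config comments above concrete: for precision@5_pct on the `women` subset, the 5% cutoff is taken within the subset's predictions, not within the overall list. A rough sketch (helper and argument names are illustrative, not the ModelEvaluator API):

```python
def subset_precision_at_pct(predictions, subset_entity_ids, pct):
    """Illustrative only: predictions is a list of (entity_id, score, label) tuples."""
    # keep only predictions for entities in the subset on this as_of_date,
    # then re-apply the percentile threshold within the subset alone
    subset_preds = sorted(
        (p for p in predictions if p[0] in subset_entity_ids),
        key=lambda p: p[1],
        reverse=True,
    )
    k = int(len(subset_preds) * pct / 100)
    top_labels = [label for _, _, label in subset_preds[:k] if label is not None]
    return sum(top_labels) / len(top_labels) if top_labels else None

# With pct=5, the cutoff covers the top 5% of the subset's predictions, which
# will generally include entities that fall outside the overall top 5%.
```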
@@ -4,7 +4,7 @@
import testing.postgresql
from sqlalchemy.engine import create_engine

from triage.component.architect.cohort_table_generators import CohortTableGenerator
from triage.component.architect.entity_date_table_generators import EntityDateTableGenerator

from . import utils

@@ -18,19 +18,19 @@ def test_empty_output():
with testing.postgresql.Postgresql() as postgresql:
engine = create_engine(postgresql.url())
utils.create_binary_outcome_events(engine, "events", [])
table_generator = CohortTableGenerator(
table_generator = EntityDateTableGenerator(
query="select entity_id from events where outcome_date < '{as_of_date}'::date",
db_engine=engine,
cohort_table_name="exp_hash_cohort",
entity_date_table_name="exp_hash_cohort",
)

with pytest.raises(ValueError):
# Request time outside of available intervals
table_generator.generate_cohort_table([datetime(2015, 12, 31)])
table_generator.generate_entity_date_table([datetime(2015, 12, 31)])

(cohort_count,) = engine.execute(
f"""\
select count(*) from {table_generator.cohort_table_name}
select count(*) from {table_generator.entity_date_table_name}
"""
).first()

@@ -39,7 +39,7 @@ def test_empty_output():
engine.dispose()


def test_cohort_table_generator_replace():
def test_entity_date_table_generator_replace():
input_data = [
(1, datetime(2016, 1, 1), True),
(1, datetime(2016, 4, 1), False),
@@ -53,10 +53,10 @@ def test_cohort_table_generator_replace():
with testing.postgresql.Postgresql() as postgresql:
engine = create_engine(postgresql.url())
utils.create_binary_outcome_events(engine, "events", input_data)
table_generator = CohortTableGenerator(
table_generator = EntityDateTableGenerator(
query="select entity_id from events where outcome_date < '{as_of_date}'::date",
db_engine=engine,
cohort_table_name="exp_hash_cohort",
entity_date_table_name="exp_hash_entity_date",
replace=True
)
as_of_dates = [
@@ -67,7 +67,7 @@ def test_cohort_table_generator_replace():
datetime(2016, 5, 1),
datetime(2016, 6, 1),
]
table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
expected_output = [
(1, datetime(2016, 2, 1), True),
(1, datetime(2016, 3, 1), True),
@@ -91,20 +91,20 @@ def test_cohort_table_generator_replace():
results = list(
engine.execute(
f"""
select entity_id, as_of_date, active from {table_generator.cohort_table_name}
select entity_id, as_of_date, active from {table_generator.entity_date_table_name}
order by entity_id, as_of_date
"""
)
)
assert results == expected_output
utils.assert_index(engine, table_generator.cohort_table_name, "entity_id")
utils.assert_index(engine, table_generator.cohort_table_name, "as_of_date")
utils.assert_index(engine, table_generator.entity_date_table_name, "entity_id")
utils.assert_index(engine, table_generator.entity_date_table_name, "as_of_date")

table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
assert results == expected_output


def test_cohort_table_generator_noreplace():
def test_entity_date_table_generator_noreplace():
input_data = [
(1, datetime(2016, 1, 1), True),
(1, datetime(2016, 4, 1), False),
@@ -118,10 +118,10 @@ def test_cohort_table_generator_noreplace():
with testing.postgresql.Postgresql() as postgresql:
engine = create_engine(postgresql.url())
utils.create_binary_outcome_events(engine, "events", input_data)
table_generator = CohortTableGenerator(
table_generator = EntityDateTableGenerator(
query="select entity_id from events where outcome_date < '{as_of_date}'::date",
db_engine=engine,
cohort_table_name="exp_hash_cohort",
entity_date_table_name="exp_hash_entity_date",
replace=False
)

@@ -131,7 +131,7 @@ def test_cohort_table_generator_noreplace():
datetime(2016, 2, 1),
datetime(2016, 3, 1),
]
table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
expected_output = [
(1, datetime(2016, 2, 1), True),
(1, datetime(2016, 3, 1), True),
@@ -143,16 +143,16 @@ def test_cohort_table_generator_noreplace():
results = list(
engine.execute(
f"""
select entity_id, as_of_date, active from {table_generator.cohort_table_name}
select entity_id, as_of_date, active from {table_generator.entity_date_table_name}
order by entity_id, as_of_date
"""
)
)
assert results == expected_output
utils.assert_index(engine, table_generator.cohort_table_name, "entity_id")
utils.assert_index(engine, table_generator.cohort_table_name, "as_of_date")
utils.assert_index(engine, table_generator.entity_date_table_name, "entity_id")
utils.assert_index(engine, table_generator.entity_date_table_name, "as_of_date")

table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
assert results == expected_output

# 2. generate a cohort for a different subset of as-of-dates,
@@ -163,7 +163,7 @@ def test_cohort_table_generator_noreplace():
datetime(2016, 5, 1),
datetime(2016, 6, 1),
]
table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
expected_output = [
(1, datetime(2016, 2, 1), True),
(1, datetime(2016, 3, 1), True),
Expand All @@ -187,7 +187,7 @@ def test_cohort_table_generator_noreplace():
results = list(
engine.execute(
f"""
select entity_id, as_of_date, active from {table_generator.cohort_table_name}
select entity_id, as_of_date, active from {table_generator.entity_date_table_name}
order by entity_id, as_of_date
"""
)
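As exercised by the tests above, the renamed generator can be used along these lines (a minimal sketch; the query and parameter names come from the tests, while the database URL and `events` table are assumptions):

```python
from datetime import datetime

from sqlalchemy.engine import create_engine
from triage.component.architect.entity_date_table_generators import EntityDateTableGenerator

engine = create_engine("postgresql:///mydb")  # assumed database URL

table_generator = EntityDateTableGenerator(
    query="select entity_id from events where outcome_date < '{as_of_date}'::date",
    db_engine=engine,
    entity_date_table_name="exp_hash_entity_date",
    replace=True,
)
# writes one active flag per entity/as-of-date pair into the named table
table_generator.generate_entity_date_table([datetime(2016, 2, 1), datetime(2016, 3, 1)])
```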
12 changes: 6 additions & 6 deletions src/tests/architect_tests/test_integration.py
@@ -16,7 +16,7 @@
FeatureGroupMixer,
)
from triage.component.architect.label_generators import LabelGenerator
from triage.component.architect.cohort_table_generators import CohortTableGenerator
from triage.component.architect.entity_date_table_generators import EntityDateTableGenerator
from triage.component.architect.planner import Planner
from triage.component.architect.builders import MatrixBuilder
from triage.component.catwalk.storage import ProjectStorage
@@ -160,9 +160,9 @@ def basic_integration_test(
test_durations=["1months"],
)

cohort_table_generator = CohortTableGenerator(
entity_date_table_generator = EntityDateTableGenerator(
db_engine=db_engine,
cohort_table_name="cohort_abcd",
entity_date_table_name="cohort_abcd",
query="select distinct(entity_id) from events"
)

@@ -217,8 +217,8 @@ def basic_integration_test(
all_as_of_times.extend(test_matrix["as_of_times"])
all_as_of_times = list(set(all_as_of_times))

# generate cohort state table
cohort_table_generator.generate_cohort_table(as_of_dates=all_as_of_times)
# generate entity_date state table
entity_date_table_generator.generate_entity_date_table(as_of_dates=all_as_of_times)

# create labels table
label_generator.generate_all_labels(
@@ -263,7 +263,7 @@ def basic_integration_test(
},
],
feature_dates=all_as_of_times,
state_table=cohort_table_generator.cohort_table_name,
state_table=entity_date_table_generator.entity_date_table_name,
)
feature_table_agg_tasks = feature_generator.generate_all_table_tasks(
aggregations, task_type="aggregation"