Evaluate on subsets [Resolves #535, #138]
This commit adds support for evaluating models against subsets of their
predictions, in both training and testing. It adds a table to the
results schemas to track subsets:

  - `model_metadata.subsets` stores subset metadata, including a hash,
    the subset configuration, and the time the row was created

The `evaluations` tables in the `train_results` and `test_results`
schemas are updated to include a new column, `subset_hash` (also added
to the primary key), which is an empty string for full-cohort
evaluations or contains the subset hash when the evaluation covers a
subset of the cohort.

A new alembic upgrade script creates the subsets table and updates the
evaluation tables.
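
As a rough sketch of the schema change (a hedged approximation, not the actual migration; column names and types beyond the hash, configuration, and created time are assumptions):

```python
from sqlalchemy import create_engine

# Hypothetical DDL approximating the migration described above; the real
# alembic script may differ in names, types, and constraints.
engine = create_engine("postgresql:///triage_results")  # assumed connection string

engine.execute("""
    create table if not exists model_metadata.subsets (
        subset_hash text primary key,     -- hash of the subset configuration
        config jsonb,                     -- the subset configuration itself
        created_timestamp timestamp       -- when the row was created
    )
""")

# the evaluations tables gain a subset_hash column (also part of the primary
# key); '' marks a full-cohort evaluation
for schema in ("train_results", "test_results"):
    engine.execute(f"""
        alter table {schema}.evaluations
        add column subset_hash text not null default ''
    """)
```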

Testing factories are included or modified for the subsets and
evaluation tables.

Most of the remaining code changes are made to the ModelEvaluator class,
which can now process subset queries and write the results to the
appropriate table [#535] and will record `NULL` values for undefined
metrics (whether due to an empty subset or lack of variation in labels
[#138]).
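
A minimal sketch of that undefined-metric behavior (the helper name and structure are illustrative, not the actual ModelEvaluator code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_or_null(metric_fn, labels, predictions):
    """Return the metric value, or None (stored as SQL NULL) when it is undefined.

    Mirrors the behavior described above: an empty subset, or labels with no
    variation (e.g. all negative), leave metrics such as AUC undefined.
    """
    labels = np.asarray(labels)
    if labels.size == 0 or len(np.unique(labels)) < 2:
        return None
    try:
        return metric_fn(labels, predictions)
    except ValueError:
        return None

# evaluate_or_null(roc_auc_score, [], [])                     -> None (empty subset)
# evaluate_or_null(roc_auc_score, [0, 0, 0], [0.2, 0.7, 0.4]) -> None (no label variation)
```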

Some changes are also made elsewhere in the experiment to support
optionally including subsets in the experiment configuration file, to
store subset metadata in the `model_metadata.subsets` table, and to
iterate over subsets in the model tester.

In addition, some changes to the documentation and `.gitignore` are
included to make modifying the results schema more joyful.
ecsalomon committed Feb 22, 2019
1 parent 59ffbe1 commit fb55bbb
Showing 27 changed files with 1,429 additions and 419 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -13,3 +13,4 @@ dist/
.ipynb_checkpoints/
venv/
my_db_config.yaml
database.yaml
11 changes: 10 additions & 1 deletion docs/sources/experiments/algorithm.md
@@ -309,7 +309,7 @@ The trained model's prediction probabilities (`predict_proba()`) are computed bo
### Individual Feature Importance
Feature importances (of a configurable number of top features, defaulting to 5) for each prediction are computed and written to the `test_results.individual_importances` table. Right now, there are no sophisticated calculation methods integrated into the experiment; simply the top 5 global feature importances for the model are copied to the `individual_importances` table.

#### Metrics
### Metrics
Triage allows for the computation of both testing set and training set evaluation metrics. Evaluation metrics, such as precision and recall at various thresholds, are written to either the `train_results.evaluations` or the `test_results.evaluations` table. Triage defines a number of [Evaluation Metrics](https://github.com/dssg/triage/blob/master/src/triage/component/catwalk/evaluation.py#L45-L58) that can be addressed by name in the experiment definition, along with a list of thresholds and/or other parameters (such as the 'beta' value for fbeta) to iterate through. Thresholding is done either via absolute value (top k) or percentile by sorting the predictions and labels by the row's predicted probability score, with ties broken at random (the random seed can be passed in the config file to make this deterministic), and assigning the predicted value as True for those above the threshold. Note that the percentile thresholds are in terms of the population percentage, not a cutoff threshold for the predicted probability.

Sometimes test matrices may not have labels for every row, so it's worth mentioning here how that is handled and interacts with thresholding. Rows with missing labels are not considered in the metric calculations, and if some of these rows are in the top k of the test matrix, no more rows are taken from the rest of the list for consideration. So if the experiment is calculating precision at the top 100 rows, and 40 of the top 100 rows are missing a label, the precision will actually be calculated on the 60 of the top 100 rows that do have a label. To make the results of this more transparent for users, a few extra pieces of metadata are written to the evaluations table for each metric score.
@@ -319,5 +319,14 @@ Sometimes test matrices may not have labels for every row, so it's worth mention
labels
* `num_positive_labels` - The number of positive labels in the test matrix
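
To illustrate the thresholding and missing-label handling described above, here is a rough sketch (not the catwalk implementation; names are illustrative):

```python
import numpy as np

def precision_at_top_pct(scores, labels, pct, sort_seed=None):
    """Illustrative sketch of the thresholding described above.

    The cutoff is a population percentage, ties are broken at random (seeded
    for determinism), and rows with a missing label (None) are dropped from
    the metric without pulling in replacement rows from below the threshold.
    """
    rng = np.random.RandomState(sort_seed)
    scores = np.asarray(scores, dtype=float)
    order = rng.permutation(len(scores))                          # random tie-breaker
    order = order[np.argsort(-scores[order], kind="mergesort")]   # stable sort by score
    k = int(len(scores) * pct / 100)                              # population percentage
    top_labels = [labels[i] for i in order[:k] if labels[i] is not None]
    return sum(top_labels) / len(top_labels) if top_labels else None
```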

Triage also supports evaluating a model on a subset of the predictions made.
This is done by passing one or more subset queries in the scoring config. The
model evaluator will then restrict the predictions to the valid entity-date
pairs for the given model and subset and will calculate metrics for that
subset, re-applying thresholds as necessary to the predictions in the subset.
Subset definitions are stored in the `model_metadata.subsets` table, and the
evaluations are stored in the `evaluations` tables. A hash of the subset
configuration identifies subset evaluations and links them to the `subsets`
table.
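
For example, subset evaluations can be pulled back out by joining on that hash; a sketch, assuming a SQLAlchemy `engine` and the table layout described above (the exact evaluation columns may differ):

```python
# list a model's subset evaluations alongside the subset configuration
subset_evaluations = engine.execute(
    """
    select e.model_id, e.metric, e.parameter, e.value, s.config
    from test_results.evaluations e
    join model_metadata.subsets s using (subset_hash)
    where e.subset_hash != ''  -- '' marks full-cohort evaluations
    """
).fetchall()
```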

### Recap
At this point, the 'model_metadata', 'train_results', and 'test_results' database schemas are fully populated with data about models, model groups, predictions, feature importances, and evaluation metrics for the researcher to query. In addition, the trained model pickle files are saved in the configured project path. The experiment is considered finished.
3 changes: 2 additions & 1 deletion docs/sources/experiments/running.md
@@ -266,11 +266,12 @@ After the experiment run, a variety of schemas and tables will be created and po
* model_metadata.experiment_models - A many-to-many table between experiments and models. This will have a row if the experiment used the model, regardless of whether or not it had to build it
* model_metadata.model_groups - A model group refers to all models that share parameters like classifier type, hyperparameters, etc., but *have different training windows*. Look at these to see how classifiers perform over different training windows.
* model_metadata.matrices - Each matrix that was used for training and testing has metadata written about it such as the matrix hash, length, and time configuration.
* model_metadata.subsets - Each evaluation subset that was used for model scoring has its configuration and a hash written here
* train_results.feature_importances - The sklearn feature importances results for each trained model
* train_results.predictions - Prediction probabilities for train matrix entities generated against trained models
* train_results.evaluations - Metric scores of trained models on the training data.
* test_results.predictions - Prediction probabilities for test matrix entities generated against trained models
* test_results.evaluations - Metric scores of trained models over given testing windows
* test_results.evaluations - Metric scores of trained models over given testing windows and subsets
* test_results.individual_importances - Individual feature importance scores for test matrix entities.

Here's an example query, which returns the top 10 model groups by precision at the top 100 entities:
Expand Down
28 changes: 28 additions & 0 deletions example/config/experiment.yaml
@@ -316,6 +316,26 @@ grid_config:
# sort_seed, if passed, will seed the random number generator for each model's
# metric creation phase. This affects how entities with the same probabilities
# are sorted
#
# subsets, if passed, will add evaluations for subset(s) of the predictions to
# the evaluations tables, using the same testing and training metric
# groups as used for overall evaluations but with any thresholds reapplied only
# to entities in the subset on the relevant as_of_dates. For example, when
# calculating precision@5_pct for the subset of women, the ModelEvaluator will
# count as predicted positive the top 5% of women, rather than any women in the
# top 5% overall. This is useful if, for example, different interventions will
# be applied to different subsets of entities (e.g., one program will provide
# subsidies to the top 500 women with children and another program will provide
# shelter to the top 150 women without children) and you would like to see
# whether a single model can be used for both applications. Subsets can also be
# used to see how a model's performance would be affected if the requirements
# for intervention eligibility became more restricted.
#
# Subsets should be a list of dictionaries with the following keys:
# - "name": a shorthand name for the subset
# - "query": a query that returns distinct entity_ids belonging to the
# subset on a given as_of_date with a placeholder for the
# as_of_date being queried
scoring:
    sort_seed: 5
    testing_metric_groups:
@@ -336,6 +356,14 @@ scoring:
    training_metric_groups:
        -
            metrics: [accuracy]
    subsets:
        -
            name: women
            query: |
                select distinct entity_id
                from demographics d
                where d.gender = 'woman'
                and demographic_date < '{as_of_date}'::date
# INDIVIDUAL IMPORTANCES
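To make the re-thresholding described in the config comments above concrete: for precision@5_pct on the `women` subset, the 5% cutoff is taken within the subset's predictions, not within the overall list. A rough sketch (helper and argument names are illustrative, not the ModelEvaluator API):

```python
def subset_precision_at_pct(predictions, subset_entity_ids, pct):
    """Illustrative only: predictions is a list of (entity_id, score, label) tuples."""
    # keep only predictions for entities in the subset on this as_of_date,
    # then re-apply the percentile threshold within the subset alone
    subset_preds = sorted(
        (p for p in predictions if p[0] in subset_entity_ids),
        key=lambda p: p[1],
        reverse=True,
    )
    k = int(len(subset_preds) * pct / 100)
    top_labels = [label for _, _, label in subset_preds[:k] if label is not None]
    return sum(top_labels) / len(top_labels) if top_labels else None

# With pct=5, the cutoff covers the top 5% of the subset's predictions, which
# will generally include entities that fall outside the overall top 5%.
```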
@@ -4,7 +4,7 @@
import testing.postgresql
from sqlalchemy.engine import create_engine

from triage.component.architect.cohort_table_generators import CohortTableGenerator
from triage.component.architect.entity_date_table_generators import EntityDateTableGenerator

from . import utils

@@ -18,19 +18,19 @@ def test_empty_output():
with testing.postgresql.Postgresql() as postgresql:
engine = create_engine(postgresql.url())
utils.create_binary_outcome_events(engine, "events", [])
table_generator = CohortTableGenerator(
table_generator = EntityDateTableGenerator(
query="select entity_id from events where outcome_date < '{as_of_date}'::date",
db_engine=engine,
cohort_table_name="exp_hash_cohort",
entity_date_table_name="exp_hash_cohort",
)

with pytest.raises(ValueError):
# Request time outside of available intervals
table_generator.generate_cohort_table([datetime(2015, 12, 31)])
table_generator.generate_entity_date_table([datetime(2015, 12, 31)])

(cohort_count,) = engine.execute(
f"""\
select count(*) from {table_generator.cohort_table_name}
select count(*) from {table_generator.entity_date_table_name}
"""
).first()

@@ -39,7 +39,7 @@ def test_empty_output():
engine.dispose()


def test_cohort_table_generator_replace():
def test_entity_date_table_generator_replace():
input_data = [
(1, datetime(2016, 1, 1), True),
(1, datetime(2016, 4, 1), False),
@@ -53,10 +53,10 @@ def test_cohort_table_generator_replace():
with testing.postgresql.Postgresql() as postgresql:
engine = create_engine(postgresql.url())
utils.create_binary_outcome_events(engine, "events", input_data)
table_generator = CohortTableGenerator(
table_generator = EntityDateTableGenerator(
query="select entity_id from events where outcome_date < '{as_of_date}'::date",
db_engine=engine,
cohort_table_name="exp_hash_cohort",
entity_date_table_name="exp_hash_entity_date",
replace=True
)
as_of_dates = [
@@ -67,7 +67,7 @@ def test_cohort_table_generator_replace():
datetime(2016, 5, 1),
datetime(2016, 6, 1),
]
table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
expected_output = [
(1, datetime(2016, 2, 1), True),
(1, datetime(2016, 3, 1), True),
@@ -91,20 +91,20 @@ def test_cohort_table_generator_replace():
results = list(
engine.execute(
f"""
select entity_id, as_of_date, active from {table_generator.cohort_table_name}
select entity_id, as_of_date, active from {table_generator.entity_date_table_name}
order by entity_id, as_of_date
"""
)
)
assert results == expected_output
utils.assert_index(engine, table_generator.cohort_table_name, "entity_id")
utils.assert_index(engine, table_generator.cohort_table_name, "as_of_date")
utils.assert_index(engine, table_generator.entity_date_table_name, "entity_id")
utils.assert_index(engine, table_generator.entity_date_table_name, "as_of_date")

table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
assert results == expected_output


def test_cohort_table_generator_noreplace():
def test_entity_date_table_generator_noreplace():
input_data = [
(1, datetime(2016, 1, 1), True),
(1, datetime(2016, 4, 1), False),
@@ -118,10 +118,10 @@ def test_cohort_table_generator_noreplace():
with testing.postgresql.Postgresql() as postgresql:
engine = create_engine(postgresql.url())
utils.create_binary_outcome_events(engine, "events", input_data)
table_generator = CohortTableGenerator(
table_generator = EntityDateTableGenerator(
query="select entity_id from events where outcome_date < '{as_of_date}'::date",
db_engine=engine,
cohort_table_name="exp_hash_cohort",
entity_date_table_name="exp_hash_entity_date",
replace=False
)

@@ -131,7 +131,7 @@ def test_cohort_table_generator_noreplace():
datetime(2016, 2, 1),
datetime(2016, 3, 1),
]
table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
expected_output = [
(1, datetime(2016, 2, 1), True),
(1, datetime(2016, 3, 1), True),
@@ -143,16 +143,16 @@ def test_cohort_table_generator_noreplace():
results = list(
engine.execute(
f"""
select entity_id, as_of_date, active from {table_generator.cohort_table_name}
select entity_id, as_of_date, active from {table_generator.entity_date_table_name}
order by entity_id, as_of_date
"""
)
)
assert results == expected_output
utils.assert_index(engine, table_generator.cohort_table_name, "entity_id")
utils.assert_index(engine, table_generator.cohort_table_name, "as_of_date")
utils.assert_index(engine, table_generator.entity_date_table_name, "entity_id")
utils.assert_index(engine, table_generator.entity_date_table_name, "as_of_date")

table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
assert results == expected_output

# 2. generate a cohort for a different subset of as-of-dates,
@@ -163,7 +163,7 @@ def test_cohort_table_generator_noreplace():
datetime(2016, 5, 1),
datetime(2016, 6, 1),
]
table_generator.generate_cohort_table(as_of_dates)
table_generator.generate_entity_date_table(as_of_dates)
expected_output = [
(1, datetime(2016, 2, 1), True),
(1, datetime(2016, 3, 1), True),
Expand All @@ -187,7 +187,7 @@ def test_cohort_table_generator_noreplace():
results = list(
engine.execute(
f"""
select entity_id, as_of_date, active from {table_generator.cohort_table_name}
select entity_id, as_of_date, active from {table_generator.entity_date_table_name}
order by entity_id, as_of_date
"""
)
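As exercised by the tests above, the renamed generator can be used along these lines (a minimal sketch; the query and parameter names come from the tests, while the database URL and `events` table are assumptions):

```python
from datetime import datetime

from sqlalchemy.engine import create_engine
from triage.component.architect.entity_date_table_generators import EntityDateTableGenerator

engine = create_engine("postgresql:///mydb")  # assumed database URL

table_generator = EntityDateTableGenerator(
    query="select entity_id from events where outcome_date < '{as_of_date}'::date",
    db_engine=engine,
    entity_date_table_name="exp_hash_entity_date",
    replace=True,
)
# writes one active flag per entity/as-of-date pair into the named table
table_generator.generate_entity_date_table([datetime(2016, 2, 1), datetime(2016, 3, 1)])
```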
12 changes: 6 additions & 6 deletions src/tests/architect_tests/test_integration.py
@@ -16,7 +16,7 @@
FeatureGroupMixer,
)
from triage.component.architect.label_generators import LabelGenerator
from triage.component.architect.cohort_table_generators import CohortTableGenerator
from triage.component.architect.entity_date_table_generators import EntityDateTableGenerator
from triage.component.architect.planner import Planner
from triage.component.architect.builders import MatrixBuilder
from triage.component.catwalk.storage import ProjectStorage
@@ -160,9 +160,9 @@ def basic_integration_test(
test_durations=["1months"],
)

cohort_table_generator = CohortTableGenerator(
entity_date_table_generator = EntityDateTableGenerator(
db_engine=db_engine,
cohort_table_name="cohort_abcd",
entity_date_table_name="cohort_abcd",
query="select distinct(entity_id) from events"
)

@@ -217,8 +217,8 @@ def basic_integration_test(
all_as_of_times.extend(test_matrix["as_of_times"])
all_as_of_times = list(set(all_as_of_times))

# generate cohort state table
cohort_table_generator.generate_cohort_table(as_of_dates=all_as_of_times)
# generate entity_date state table
entity_date_table_generator.generate_entity_date_table(as_of_dates=all_as_of_times)

# create labels table
label_generator.generate_all_labels(
@@ -263,7 +263,7 @@ def basic_integration_test(
},
],
feature_dates=all_as_of_times,
state_table=cohort_table_generator.cohort_table_name,
state_table=entity_date_table_generator.entity_date_table_name,
)
feature_table_agg_tasks = feature_generator.generate_all_table_tasks(
aggregations, task_type="aggregation"