Insert Ranks for Predictions [Resolves #357]
Adds ranking to the predictions tables.

A few flavors of ranking are added.

- rank_abs (already existing column) - Absolute rank, starting at 1, without
  ties. Ties are broken based on either a random draw or a user-supplied
  fallback clause in the predictions table (e.g. label_value).
- rank_pct (already existing column) - Percentile rank, *without ties*, based
  on the rank_abs tiebreaking.
- rank_abs_with_ties - Absolute rank, starting at 1, with ties and skipping
  (e.g. if two entities are tied for 3, there will be no 4).

The tiebreaking for rank_abs (which cascades to rank_pct) is either done
randomly, using a random seed based on the model's seed, or according to the
user's choice at the new `prediction -> rank_tiebreaker` config value.

What is the model's seed, you ask? It's a new construct that we store in the
models table under 'random_seed'. For each model training task, we generate a
value between -1000000000 and 1000000000. This value is set as the Python
random seed right before an individual model is trained, so behavior is the
same in single-threaded and multiprocess training contexts. Where does the
randomness come from? The experiment now requires a random_seed in its config,
so it becomes part of the saved experiment config and drives the generation of
the per-model seeds.
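
For intuition, here is a minimal sketch of how that seeding behaves (not the
actual Triage code; `experiment_seed`, `model_seeds`, and `train_one` are
hypothetical names):

```python
import random

# The experiment-level seed (required in the experiment config) seeds Python's
# RNG, and each model training task then draws its own seed from it.
experiment_seed = 23895478  # e.g. the experiment config's `random_seed`
random.seed(experiment_seed)

model_seeds = {
    task_id: random.randint(-1_000_000_000, 1_000_000_000)
    for task_id in range(4)  # e.g. four training tasks in the grid
}

def train_one(task_id):
    # The stored seed is set right before the individual model is trained,
    # so single-threaded and multiprocess runs behave the same.
    random.seed(model_seeds[task_id])
    # ... fit the classifier here ...
```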

To save space in the predictions tables, and to remove unnecessary precision
that would render tiebreaking largely moot, the score column is converted to
DECIMAL(6, 5).

To keep track of how tiebreaking was done, a new prediction_metadata table
holds this metadata, whether it comes from user configuration or the
Triage-supplied default.

Implementation-wise, ranking is done via an update statement after predictions
are initially inserted with NULL ranks, to keep memory from ballooning.
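
The actual statement lives in catwalk (see the `Predictor.update_db_with_ranks`
method documented below); purely as a sketch of the idea (hypothetical SQL and
connection details, not the code in this commit), a backfill of the tied rank
columns for one model/matrix pair could look roughly like this:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://...")  # hypothetical connection string

# Fill the previously-NULL rank columns in one pass with window functions,
# so predictions never have to be held in application memory.
ranking_update = text("""
    UPDATE test_results.predictions AS p
    SET rank_abs_with_ties = r.abs_rank,
        rank_pct_with_ties = r.pct_rank
    FROM (
        SELECT entity_id, as_of_date,
               RANK() OVER (PARTITION BY as_of_date ORDER BY score DESC) AS abs_rank,
               CAST(RANK() OVER (PARTITION BY as_of_date ORDER BY score DESC) AS numeric)
                   / COUNT(*) OVER (PARTITION BY as_of_date) AS pct_rank
        FROM test_results.predictions
        WHERE model_id = :model_id AND matrix_uuid = :matrix_uuid
    ) AS r
    WHERE p.model_id = :model_id
      AND p.matrix_uuid = :matrix_uuid
      AND p.entity_id = r.entity_id
      AND p.as_of_date = r.as_of_date
""")

with engine.begin() as conn:
    conn.execute(ranking_update, {"model_id": 1, "matrix_uuid": "abcd1234"})
```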
thcrock committed May 3, 2019
1 parent 7435238 commit b8fc0cf
Showing 28 changed files with 839 additions and 78 deletions.
4 changes: 2 additions & 2 deletions docs/sources/dirtyduck/docs/eis.md
@@ -696,8 +696,8 @@ predictions_query: |
entity_id,
score,
label_value,
coalesce(rank_abs, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
coalesce(rank_abs_no_ties, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct_no_ties*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
from test_results.predictions
join models_dates_join_query using(model_id, as_of_date)
where model_id in (select model_id from models_list_query)
4 changes: 2 additions & 2 deletions docs/sources/dirtyduck/triage/eis_crosstabs_config.yaml
@@ -41,8 +41,8 @@ predictions_query: |
entity_id,
score,
label_value,
coalesce(rank_abs, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
coalesce(rank_abs_no_ties, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct_no_ties*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
from test_results.predictions
join models_dates_join_query using(model_id, as_of_date)
where model_id in (select model_id from models_list_query)
4 changes: 4 additions & 0 deletions docs/sources/dirtyduck/triage/experiments/inspections_dt.yaml
@@ -1,6 +1,7 @@
config_version: 'v6'

model_comment: 'inspections: DT'
random_seed: 12345

user_metadata:
  label_definition: 'failed'
@@ -184,6 +185,9 @@ feature_group_definition:

feature_group_strategies: ['all']

prediction:
  rank_tiebreaker: "best"

scoring:
  testing_metric_groups:
    -
1 change: 0 additions & 1 deletion docs/sources/experiments/algorithm.md
@@ -322,7 +322,6 @@ A few different versions of tiebreaking are implemented to deal with the nuances
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 0
* `standard_deviation` - If trials are needed to produce the `stochastic_value`, the standard deviation of these trials is written here. Otherwise this will be 0
*

Sometimes test matrices may not have labels for every row, so it's worth mentioning here how that is handled and interacts with thresholding. Rows with missing labels are not considered in the metric calculations, and if some of these rows are in the top k of the test matrix, no more rows are taken from the rest of the list for consideration. So if the experiment is calculating precision at the top 100 rows, and 40 of the top 100 rows are missing a label, the precision will actually be calculated on the 60 of the top 100 rows that do have a label. To make the results of this more transparent for users, a few extra pieces of metadata are written to the evaluations table for each metric score.
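
For instance, here is a toy numeric illustration of that rule (not the evaluation code; `Row` and the randomly generated data are made up):

```python
import random
from collections import namedtuple

# Hypothetical stand-in for rows of a test matrix with scores and labels.
Row = namedtuple("Row", ["score", "label"])

random.seed(0)
rows = [Row(random.random(), random.choice([0, 1, None])) for _ in range(500)]

# Take the top 100 by score; rows with a missing label are dropped from the
# calculation, and no additional rows are pulled in to replace them.
top_k = sorted(rows, key=lambda r: r.score, reverse=True)[:100]
labeled = [r for r in top_k if r.label is not None]
precision_at_100 = sum(r.label for r in labeled) / len(labeled)
print(f"{len(labeled)} labeled rows in the top 100, precision {precision_at_100:.2f}")
```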

50 changes: 50 additions & 0 deletions docs/sources/experiments/prediction-ranking.md
@@ -0,0 +1,50 @@
# Prediction Ranking

The predictions tables in the `train_results` and `test_results` schemas contain several different flavors of rankings, covering absolute vs percentile ranking and whether or not ties exist.

## Ranking columns

| Column name | Behavior |
| ----------- | ------- |
| `rank_abs_with_ties` | Absolute ranking, with ties. Ranks will skip after a set of ties, so if two entities are tied at rank 3, the next entity after them will have rank 5. |
| `rank_pct_with_ties` | Percentile ranking, with ties. Percentiles will skip after a set of ties, so if two entities out of ten are tied at 0.1 (the tenth percentile), the next entity after them will have 0.3 (the thirtieth percentile). |
| `rank_abs_no_ties` | Absolute ranking, with no ties. Ties are broken according to a configured choice of 'best', 'worst', or 'random', which is recorded in the `prediction_metadata` table. |
| `rank_pct_no_ties` | Percentile ranking, with no ties. Ties are broken according to a configured choice of 'best', 'worst', or 'random', which is recorded in the `prediction_metadata` table. |
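
For example, a query along these lines (a sketch; the connection string, model id, and date are placeholders) pulls a model's top 100 entities for a given date using the tie-free ranking:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://...")  # hypothetical connection string

top_100 = pd.read_sql(
    text("""
        SELECT entity_id, score, rank_abs_no_ties, rank_pct_no_ties
        FROM test_results.predictions
        WHERE model_id = :model_id
          AND as_of_date = :as_of_date
          AND rank_abs_no_ties <= 100
        ORDER BY rank_abs_no_ties
    """),
    engine,
    params={"model_id": 1, "as_of_date": "2019-05-03"},
)
```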


## Viewing prediction metadata

The `prediction_metadata` table contains information about how ties were broken. There is one row per model/matrix combination. For each model and matrix, it records:

- `tiebreaker_ordering` - The tiebreaker ordering rule (e.g. 'random', 'best', 'worst') used for the corresponding predictions.
- `random_seed` - The random seed used, if 'random' was the ordering; otherwise null.
- `predictions_saved` - Whether or not predictions were saved. If false, you should not expect to find any predictions for this model/matrix pair, but the row is still inserted as a record that prediction was performed.

There is one `prediction_metadata` table in each of the `train_results` and `test_results` schemas (in other words, wherever there is a companion `predictions` table).
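
As a quick way to inspect how ties were broken for a particular model (a sketch; the connection string and model id are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://...")  # hypothetical connection string

# One row per model/matrix pair, recording the ordering rule, the seed (if
# random ordering was used), and whether predictions were saved.
metadata = pd.read_sql(
    "SELECT * FROM test_results.prediction_metadata WHERE model_id = 1",
    engine,
)
print(metadata[["tiebreaker_ordering", "random_seed", "predictions_saved"]])
```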

## Backfilling ranks for old predictions

Prediction ranking is new to Triage, so you may have older Triage runs with no prediction ranks that you would like to backfill. To do this, use the `Predictor` class's `update_db_with_ranks` method. This example fills in rankings for test predictions; replace `TestMatrixType` with `TrainMatrixType` to rank train predictions (provided such predictions already exist).

```python
from triage.component.catwalk import Predictor
from triage.component.catwalk.storage import TestMatrixType

predictor = Predictor(
    db_engine=...,
    rank_order='worst',
    model_storage_engine=None,
)

predictor.update_db_with_ranks(
    model_id=...,  # model id of some model with test predictions for the companion matrix
    matrix_uuid=...,  # matrix uuid of some matrix with test predictions for the companion model
    matrix_type=TestMatrixType,
)
```


## Subsequent runs

If you run Triage experiments with `replace=False` and change nothing except the `rank_tiebreaker` in the experiment config, ranking will be redone and the corresponding row in `prediction_metadata` updated. You don't have to run a full experiment if that's all you want to do: following the backfill directions above redoes the ranking for an individual model/matrix pair. Changing the `rank_tiebreaker` and re-running the experiment is simply a handy way of redoing all of them at once.
2 changes: 2 additions & 0 deletions docs/sources/experiments/running.md
@@ -274,8 +274,10 @@ After the experiment run, a variety of schemas and tables will be created and po
* model_metadata.subsets - Each evaluation subset that was used for model scoring has its configuration and a hash written here
* train_results.feature_importances - The sklearn feature importances results for each trained model
* train_results.predictions - Prediction probabilities for train matrix entities generated against trained models
* train_results.prediction_metadata - Metadata about the prediction stage for a model and train matrix, such as tiebreaking configuration
* train_results.evaluations - Metric scores of trained models on the training data.
* test_results.predictions - Prediction probabilities for test matrix entities generated against trained models
* test_results.prediction_metadata - Metadata about the prediction stage for a model and test matrix, such as tiebreaking configuration
* test_results.evaluations - Metric scores of trained models over given testing windows and subsets
* test_results.individual_importances - Individual feature importance scores for test matrix entities.

18 changes: 18 additions & 0 deletions example/config/experiment.yaml
@@ -11,6 +11,9 @@ config_version: 'v6'
# model_comment (optional) will end up in the model_comment column of the
# models table for each model created in this experiment
model_comment: 'test'
# random_seed will be set in Python at the beginning of the experiment and
# affect the generation of all model seeds
random_seed: 23895478

# TIME SPLITTING
# The time window to look at, and how to divide the window into
@@ -299,6 +302,21 @@ grid_config:
logical_operator: 'and'


# PREDICTION
# How predictions are computed for train and test matrices
#
# Rank tiebreaking - In the predictions.rank_abs and rank_pct columns, ties in the score
# are broken either at random or based on the 'worst' or 'best' options. 'worst' is the default.
#
# 'worst' will break ties with the ascending label value, so if you take the top 'k' predictions, and there are ties across the 'k' threshold, the predictions above the threshold will be negative labels if possible.
# 'best' will break ties with the descending label value, so if you take the top 'k' predictions, and there are ties across the 'k' threshold, the predictions above the threshold will be positive labels if possible.
# 'random' will choose one random ordering to break ties. The result will be affected by
# the current state of Postgres' random number generator. Before ranking, the generator is seeded
# based on the *model*'s random seed.
#prediction:
# rank_tiebreaker: "worst"


# MODEL SCORING
# How each trained model is scored
#
4 changes: 2 additions & 2 deletions example/config/postmodeling_crosstabs.yaml
@@ -36,8 +36,8 @@ select model_id,
entity_id,
score,
label_value,
coalesce(rank_abs, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
coalesce(rank_abs_no_ties, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct_no_ties*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
from test_results.predictions
JOIN models_dates_join_query USING(model_id, as_of_date)
where model_id IN (select model_id from models_list_query)
50 changes: 35 additions & 15 deletions src/tests/catwalk_tests/test_model_trainers.py
@@ -1,17 +1,11 @@
import pandas
import testing.postgresql
import sqlalchemy
import unittest
from unittest.mock import patch
import random
import pytest

from sqlalchemy import create_engine
from triage.component.catwalk.db import ensure_db
from tests.results_tests.factories import init_engine

from triage.component.catwalk.model_grouping import ModelGrouper
from triage.component.catwalk.model_trainers import ModelTrainer
from tests.utils import rig_engines, get_matrix_store
from tests.utils import get_matrix_store


@pytest.fixture
@@ -43,6 +37,9 @@ def test_model_trainer(grid_config, default_model_trainer):
project_storage = trainer.model_storage_engine.project_storage
model_storage_engine = trainer.model_storage_engine

def set_test_seed():
    random.seed(5)
set_test_seed()
model_ids = trainer.train_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -75,6 +72,15 @@ def test_model_trainer(grid_config, default_model_trainer):
]
assert len(records) == 4

# 2. that the random seeds are distinct
records = [
    row
    for row in db_engine.execute(
        "select distinct random_seed from model_metadata.models"
    )
]
assert len(records) == 4

# 3. that the model sizes are saved in the table and all are < 1 kB
records = [
    row
@@ -99,7 +105,8 @@ def test_model_trainer(grid_config, default_model_trainer):
predictions = model_pickle.predict(test_matrix)
assert len(predictions) == 2

# 6. when run again, same models are returned
# 6. when run again with the same starting seed, same models are returned
set_test_seed()
new_model_ids = trainer.train_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -134,6 +141,7 @@ def test_model_trainer(grid_config, default_model_trainer):
    db_engine=db_engine,
    replace=True,
)
set_test_seed()
new_model_ids = trainer.train_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -163,6 +171,7 @@ def test_model_trainer(grid_config, default_model_trainer):
assert len(records) == 4 * 2 # maybe exclude entity_id? yes

# 8. if the cache is missing but the metadata is still there, reuse the metadata
set_test_seed()
for row in db_engine.execute("select model_hash from model_metadata.models"):
    model_storage_engine.delete(row[0])
new_model_ids = trainer.train_models(
@@ -173,6 +182,7 @@ def test_model_trainer(grid_config, default_model_trainer):
assert model_ids == sorted(new_model_ids)

# 9. that the generator interface works the same way
set_test_seed()
new_model_ids = trainer.generate_trained_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -233,31 +243,41 @@ def test_n_jobs_not_new_model(default_model_trainer):
        "max_features": ["sqrt", "log2"],
        "max_depth": [5, 10, 15, 20],
        "criterion": ["gini", "entropy"],
        "n_jobs": [12],
    },
}

trainer = default_model_trainer
project_storage = trainer.model_storage_engine.project_storage
db_engine = trainer.db_engine

# generate train tasks, with a specific random seed so that we can compare
# apples to apples later
random.seed(5)
train_tasks = trainer.generate_train_tasks(
    grid_config, dict(), get_matrix_store(project_storage)
)

assert len(train_tasks) == 35 # 32+3, would be (32*2)+3 if we didn't remove
assert (
    len([task for task in train_tasks if "n_jobs" in task["parameters"]]) == 32
)

for train_task in train_tasks:
    trainer.process_train_task(**train_task)

# since n_jobs is a runtime attribute of the model, it should not make it
# into the model group
for row in db_engine.execute(
    "select hyperparameters from model_metadata.model_groups"
):
    assert "n_jobs" not in row[0]

hashes = set(task['model_hash'] for task in train_tasks)
# generate the grid again with a different n_jobs (but the same random seed!)
# and make sure that the hashes are the same as before
random.seed(5)
grid_config['sklearn.ensemble.RandomForestClassifier']['n_jobs'] = [24]
new_train_tasks = trainer.generate_train_tasks(
    grid_config, dict(), get_matrix_store(project_storage)
)
assert hashes == set(task['model_hash'] for task in new_train_tasks)


def test_cache_models(default_model_trainer):
    assert not default_model_trainer.model_storage_engine.should_cache