Insert Ranks for Predictions [Resolves #357]
Adds ranking to the predictions tables.

A few flavors of ranking are added.

- rank_abs (already existing column) - Absolute rank, starting at 1, without
  ties. Ties are broken based on either a random draw or a user-supplied
  fallback clause in the predictions table (e.g. label_value).
- rank_pct (already existing column) - Percentile rank, *without ties*, based
  on the rank_abs tiebreaking.
- rank_abs_with_ties - Absolute rank, starting at 1, with ties and skipping
  (e.g. if two entities are tied for 3, there will be no 4).

The tiebreaking for rank_abs (which cascades to rank_pct) is either done
randomly, using a random seed based on the model's seed, or according to the
user's choice at the new `prediction -> rank_tiebreaker` config value.

What is the model's seed, you ask? It's a new construct that we store in the
models table under 'random_seed'. For each model training task, we generate a
value between -1000000000 and 1000000000. This value is set as the Python
random seed right before an individual model is trained, so behavior is the
same in single-threaded and multiprocess training contexts. Where does the
randomness come from? The experiment now requires a random_seed in its config,
so it becomes part of the saved experiment config and drives the generation of
the per-model seeds.
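
For intuition, here is a minimal sketch of how that seeding behaves (not the
actual Triage code; `experiment_seed`, `model_seeds`, and `train_one` are
hypothetical names):

```python
import random

# The experiment-level seed (required in the experiment config) seeds Python's
# RNG, and each model training task then draws its own seed from it.
experiment_seed = 23895478  # e.g. the experiment config's `random_seed`
random.seed(experiment_seed)

model_seeds = {
    task_id: random.randint(-1_000_000_000, 1_000_000_000)
    for task_id in range(4)  # e.g. four training tasks in the grid
}

def train_one(task_id):
    # The stored seed is set right before the individual model is trained,
    # so single-threaded and multiprocess runs behave the same.
    random.seed(model_seeds[task_id])
    # ... fit the classifier here ...
```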

To save space in the predictions tables, and to remove unnecessary precision
that would render tiebreaking largely moot, the score column is converted to
DECIMAL(6, 5).

To keep track of how tiebreaking was done, a new prediction_metadata table
holds this metadata, whether it comes from user configuration or the
Triage-supplied default.

Implementation-wise, ranking is done via an update statement after predictions
are initially inserted with NULL ranks, to keep memory from ballooning.
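
The actual statement lives in catwalk (see the `Predictor.update_db_with_ranks`
method documented below); purely as a sketch of the idea (hypothetical SQL and
connection details, not the code in this commit), a backfill of the tied rank
columns for one model/matrix pair could look roughly like this:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://...")  # hypothetical connection string

# Fill the previously-NULL rank columns in one pass with window functions,
# so predictions never have to be held in application memory.
ranking_update = text("""
    UPDATE test_results.predictions AS p
    SET rank_abs_with_ties = r.abs_rank,
        rank_pct_with_ties = r.pct_rank
    FROM (
        SELECT entity_id, as_of_date,
               RANK() OVER (PARTITION BY as_of_date ORDER BY score DESC) AS abs_rank,
               CAST(RANK() OVER (PARTITION BY as_of_date ORDER BY score DESC) AS numeric)
                   / COUNT(*) OVER (PARTITION BY as_of_date) AS pct_rank
        FROM test_results.predictions
        WHERE model_id = :model_id AND matrix_uuid = :matrix_uuid
    ) AS r
    WHERE p.model_id = :model_id
      AND p.matrix_uuid = :matrix_uuid
      AND p.entity_id = r.entity_id
      AND p.as_of_date = r.as_of_date
""")

with engine.begin() as conn:
    conn.execute(ranking_update, {"model_id": 1, "matrix_uuid": "abcd1234"})
```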
thcrock committed May 3, 2019
1 parent 7435238 commit b8fc0cf
Showing 28 changed files with 839 additions and 78 deletions.
4 changes: 2 additions & 2 deletions docs/sources/dirtyduck/docs/eis.md
@@ -696,8 +696,8 @@ predictions_query: |
entity_id,
score,
label_value,
coalesce(rank_abs, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
coalesce(rank_abs_no_ties, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct_no_ties*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
from test_results.predictions
join models_dates_join_query using(model_id, as_of_date)
where model_id in (select model_id from models_list_query)
4 changes: 2 additions & 2 deletions docs/sources/dirtyduck/triage/eis_crosstabs_config.yaml
@@ -41,8 +41,8 @@ predictions_query: |
entity_id,
score,
label_value,
coalesce(rank_abs, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
coalesce(rank_abs_no_ties, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct_no_ties*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
from test_results.predictions
join models_dates_join_query using(model_id, as_of_date)
where model_id in (select model_id from models_list_query)
4 changes: 4 additions & 0 deletions docs/sources/dirtyduck/triage/experiments/inspections_dt.yaml
@@ -1,6 +1,7 @@
config_version: 'v6'

model_comment: 'inspections: DT'
random_seed: 12345

user_metadata:
  label_definition: 'failed'
@@ -184,6 +185,9 @@ feature_group_definition:

feature_group_strategies: ['all']

prediction:
  rank_tiebreaker: "best"

scoring:
  testing_metric_groups:
    -
1 change: 0 additions & 1 deletion docs/sources/experiments/algorithm.md
@@ -322,7 +322,6 @@ A few different versions of tiebreaking are implemented to deal with the nuances
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 0
* `standard_deviation` - If trials are needed to produce the `stochastic_value`, the standard deviation of these trials is written here. Otherwise this will be 0
*

Sometimes test matrices may not have labels for every row, so it's worth mentioning here how that is handled and interacts with thresholding. Rows with missing labels are not considered in the metric calculations, and if some of these rows are in the top k of the test matrix, no more rows are taken from the rest of the list for consideration. So if the experiment is calculating precision at the top 100 rows, and 40 of the top 100 rows are missing a label, the precision will actually be calculated on the 60 of the top 100 rows that do have a label. To make the results of this more transparent for users, a few extra pieces of metadata are written to the evaluations table for each metric score.
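
For instance, here is a toy numeric illustration of that rule (not the evaluation code; `Row` and the randomly generated data are made up):

```python
import random
from collections import namedtuple

# Hypothetical stand-in for rows of a test matrix with scores and labels.
Row = namedtuple("Row", ["score", "label"])

random.seed(0)
rows = [Row(random.random(), random.choice([0, 1, None])) for _ in range(500)]

# Take the top 100 by score; rows with a missing label are dropped from the
# calculation, and no additional rows are pulled in to replace them.
top_k = sorted(rows, key=lambda r: r.score, reverse=True)[:100]
labeled = [r for r in top_k if r.label is not None]
precision_at_100 = sum(r.label for r in labeled) / len(labeled)
print(f"{len(labeled)} labeled rows in the top 100, precision {precision_at_100:.2f}")
```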

50 changes: 50 additions & 0 deletions docs/sources/experiments/prediction-ranking.md
@@ -0,0 +1,50 @@
# Prediction Ranking

The predictions tables in the `train_results` and `test_results` schemas contain several different flavors of rankings, covering absolute vs percentile ranking and whether or not ties exist.

## Ranking columns

| Column name | Behavior |
| ----------- | ------- |
| `rank_abs_with_ties` | Absolute ranking, with ties. Ranks will skip after a set of ties, so if two entities are tied at rank 3, the next entity after them will have rank 5. |
| `rank_pct_with_ties` | Percentile ranking, with ties. Percentiles will skip after a set of ties, so if two entities out of ten are tied at 0.1 (the tenth percentile), the next entity after them will have 0.3 (the thirtieth percentile). |
| `rank_abs_no_ties` | Absolute ranking, with no ties. Ties are broken according to a configured choice of 'best', 'worst', or 'random', which is recorded in the `prediction_metadata` table. |
| `rank_pct_no_ties` | Percentile ranking, with no ties. Ties are broken according to a configured choice of 'best', 'worst', or 'random', which is recorded in the `prediction_metadata` table. |
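
For example, a query along these lines (a sketch; the connection string, model id, and date are placeholders) pulls a model's top 100 entities for a given date using the tie-free ranking:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://...")  # hypothetical connection string

top_100 = pd.read_sql(
    text("""
        SELECT entity_id, score, rank_abs_no_ties, rank_pct_no_ties
        FROM test_results.predictions
        WHERE model_id = :model_id
          AND as_of_date = :as_of_date
          AND rank_abs_no_ties <= 100
        ORDER BY rank_abs_no_ties
    """),
    engine,
    params={"model_id": 1, "as_of_date": "2019-05-03"},
)
```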


## Viewing prediction metadata

The `prediction_metadata` table contains information about how ties were broken. There is one row per model/matrix combination. For each model and matrix, it records:

- `tiebreaker_ordering` - The tiebreaker ordering rule (e.g. 'random', 'best', 'worst') used for the corresponding predictions.
- `random_seed` - The random seed used, if 'random' was the ordering; otherwise null.
- `predictions_saved` - Whether or not predictions were saved. If false, you should not expect to find any predictions for this model/matrix pair, but the row is still inserted as a record that prediction was performed.

There is one `prediction_metadata` table in each of the `train_results` and `test_results` schemas (in other words, wherever there is a companion `predictions` table).
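
As a quick way to inspect how ties were broken for a particular model (a sketch; the connection string and model id are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://...")  # hypothetical connection string

# One row per model/matrix pair, recording the ordering rule, the seed (if
# random ordering was used), and whether predictions were saved.
metadata = pd.read_sql(
    "SELECT * FROM test_results.prediction_metadata WHERE model_id = 1",
    engine,
)
print(metadata[["tiebreaker_ordering", "random_seed", "predictions_saved"]])
```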

## Backfilling ranks for old predictions

Prediction ranking is new to Triage, so you may have older Triage runs with no prediction ranks that you would like to backfill. To do this, use the `Predictor` class's `update_db_with_ranks` method. This example fills in rankings for test predictions; replace `TestMatrixType` with `TrainMatrixType` to rank train predictions (provided such predictions already exist).

```python
from triage.component.catwalk import Predictor
from triage.component.catwalk.storage import TestMatrixType

predictor = Predictor(
    db_engine=...,
    rank_order='worst',
    model_storage_engine=None,
)

predictor.update_db_with_ranks(
    model_id=...,  # model id of some model with test predictions for the companion matrix
    matrix_uuid=...,  # matrix uuid of some matrix with test predictions for the companion model
    matrix_type=TestMatrixType,
)
```


## Subsequent runs

If you run Triage experiments with `replace=False` and change nothing except the `rank_tiebreaker` in the experiment config, ranking will be redone and the corresponding row in `prediction_metadata` updated. You don't have to run a full experiment if that's all you want to do: following the backfill directions above redoes the ranking for an individual model/matrix pair. Changing the `rank_tiebreaker` and re-running the experiment is simply a handy way of redoing all of them at once.
2 changes: 2 additions & 0 deletions docs/sources/experiments/running.md
@@ -274,8 +274,10 @@ After the experiment run, a variety of schemas and tables will be created and po
* model_metadata.subsets - Each evaluation subset that was used for model scoring has its configuration and a hash written here
* train_results.feature_importances - The sklearn feature importances results for each trained model
* train_results.predictions - Prediction probabilities for train matrix entities generated against trained models
* train_results.prediction_metadata - Metadata about the prediction stage for a model and train matrix, such as tiebreaking configuration
* train_results.evaluations - Metric scores of trained models on the training data.
* test_results.predictions - Prediction probabilities for test matrix entities generated against trained models
* test_results.prediction_metadata - Metadata about the prediction stage for a model and test matrix, such as tiebreaking configuration
* test_results.evaluations - Metric scores of trained models over given testing windows and subsets
* test_results.individual_importances - Individual feature importance scores for test matrix entities.

18 changes: 18 additions & 0 deletions example/config/experiment.yaml
@@ -11,6 +11,9 @@ config_version: 'v6'
# model_comment (optional) will end up in the model_comment column of the
# models table for each model created in this experiment
model_comment: 'test'
# random_seed will be set in Python at the beginning of the experiment and
# affect the generation of all model seeds
random_seed: 23895478

# TIME SPLITTING
# The time window to look at, and how to divide the window into
@@ -299,6 +302,21 @@ grid_config:
logical_operator: 'and'


# PREDICTION
# How predictions are computed for train and test matrices
#
# Rank tiebreaking - In the predictions.rank_abs and rank_pct columns, ties in the score
# are broken either at random or based on the 'worst' or 'best' options. 'worst' is the default.
#
# 'worst' will break ties with the ascending label value, so if you take the top 'k' predictions, and there are ties across the 'k' threshold, the predictions above the threshold will be negative labels if possible.
# 'best' will break ties with the descending label value, so if you take the top 'k' predictions, and there are ties across the 'k' threshold, the predictions above the threshold will be positive labels if possible.
# 'random' will choose one random ordering to break ties. The result will be affected by
# the current state of Postgres' random number generator. Before ranking, the generator is seeded
# based on the *model*'s random seed.
#prediction:
# rank_tiebreaker: "worst"


# MODEL SCORING
# How each trained model is scored
#
4 changes: 2 additions & 2 deletions example/config/postmodeling_crosstabs.yaml
@@ -36,8 +36,8 @@ select model_id,
entity_id,
score,
label_value,
coalesce(rank_abs, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
coalesce(rank_abs_no_ties, row_number() over (partition by (model_id, as_of_date) order by score desc)) as rank_abs,
coalesce(rank_pct_no_ties*100, ntile(100) over (partition by (model_id, as_of_date) order by score desc)) as rank_pct
from test_results.predictions
JOIN models_dates_join_query USING(model_id, as_of_date)
where model_id IN (select model_id from models_list_query)
50 changes: 35 additions & 15 deletions src/tests/catwalk_tests/test_model_trainers.py
@@ -1,17 +1,11 @@
import pandas
import testing.postgresql
import sqlalchemy
import unittest
from unittest.mock import patch
import random
import pytest

from sqlalchemy import create_engine
from triage.component.catwalk.db import ensure_db
from tests.results_tests.factories import init_engine

from triage.component.catwalk.model_grouping import ModelGrouper
from triage.component.catwalk.model_trainers import ModelTrainer
from tests.utils import rig_engines, get_matrix_store
from tests.utils import get_matrix_store


@pytest.fixture
@@ -43,6 +37,9 @@ def test_model_trainer(grid_config, default_model_trainer):
project_storage = trainer.model_storage_engine.project_storage
model_storage_engine = trainer.model_storage_engine

def set_test_seed():
    random.seed(5)
set_test_seed()
model_ids = trainer.train_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -75,6 +72,15 @@ def test_model_trainer(grid_config, default_model_trainer):
]
assert len(records) == 4

# 2. that the random seeds are distinct
records = [
    row
    for row in db_engine.execute(
        "select distinct random_seed from model_metadata.models"
    )
]
assert len(records) == 4

# 3. that the model sizes are saved in the table and all are < 1 kB
records = [
    row
@@ -99,7 +105,8 @@ def test_model_trainer(grid_config, default_model_trainer):
predictions = model_pickle.predict(test_matrix)
assert len(predictions) == 2

# 6. when run again, same models are returned
# 6. when run again with the same starting seed, same models are returned
set_test_seed()
new_model_ids = trainer.train_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -134,6 +141,7 @@ def test_model_trainer(grid_config, default_model_trainer):
    db_engine=db_engine,
    replace=True,
)
set_test_seed()
new_model_ids = trainer.train_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -163,6 +171,7 @@ def test_model_trainer(grid_config, default_model_trainer):
assert len(records) == 4 * 2 # maybe exclude entity_id? yes

# 8. if the cache is missing but the metadata is still there, reuse the metadata
set_test_seed()
for row in db_engine.execute("select model_hash from model_metadata.models"):
    model_storage_engine.delete(row[0])
new_model_ids = trainer.train_models(
@@ -173,6 +182,7 @@ def test_model_trainer(grid_config, default_model_trainer):
assert model_ids == sorted(new_model_ids)

# 9. that the generator interface works the same way
set_test_seed()
new_model_ids = trainer.generate_trained_models(
    grid_config=grid_config,
    misc_db_parameters=dict(),
@@ -233,31 +243,41 @@ def test_n_jobs_not_new_model(default_model_trainer):
        "max_features": ["sqrt", "log2"],
        "max_depth": [5, 10, 15, 20],
        "criterion": ["gini", "entropy"],
        "n_jobs": [12],
    },
}

trainer = default_model_trainer
project_storage = trainer.model_storage_engine.project_storage
db_engine = trainer.db_engine

# generate train tasks, with a specific random seed so that we can compare
# apples to apples later
random.seed(5)
train_tasks = trainer.generate_train_tasks(
    grid_config, dict(), get_matrix_store(project_storage)
)

assert len(train_tasks) == 35 # 32+3, would be (32*2)+3 if we didn't remove
assert (
    len([task for task in train_tasks if "n_jobs" in task["parameters"]]) == 32
)

for train_task in train_tasks:
    trainer.process_train_task(**train_task)

# since n_jobs is a runtime attribute of the model, it should not make it
# into the model group
for row in db_engine.execute(
    "select hyperparameters from model_metadata.model_groups"
):
    assert "n_jobs" not in row[0]

hashes = set(task['model_hash'] for task in train_tasks)
# generate the grid again with a different n_jobs (but the same random seed!)
# and make sure that the hashes are the same as before
random.seed(5)
grid_config['sklearn.ensemble.RandomForestClassifier']['n_jobs'] = [24]
new_train_tasks = trainer.generate_train_tasks(
    grid_config, dict(), get_matrix_store(project_storage)
)
assert hashes == set(task['model_hash'] for task in new_train_tasks)


def test_cache_models(default_model_trainer):
    assert not default_model_trainer.model_storage_engine.should_cache