
Insert Ranks for Predictions [Resolves #357] #671

Merged: 3 commits merged into master from ranking_try_3 on May 3, 2019

Conversation

@thcrock thcrock commented Apr 10, 2019 (Contributor)

Adds ranking to the predictions tables.

A few flavors of ranking are added.

- rank_abs (already existing column): absolute rank, starting at 1, without ties. Ties are broken by either a random draw or a user-supplied fallback clause referencing a column in the predictions table (e.g. label_value).
- rank_pct (already existing column): percentile rank, without ties. Based on the rank_abs tiebreaking.
- rank_abs_with_ties: absolute rank, starting at 1, with ties and skipping (e.g. if two entities are tied at rank 3, there will be no rank 4).

The tiebreaking for rank_abs (which cascades to rank_pct) is done either with RANDOM(), using a random seed derived from the model's seed (converted from Python seed-space into Postgres seed-space), or through user input at the new "prediction->rank_tiebreaker_order_by" config value.

What is the model's seed, you ask? It's a new construct that we store in the models table under 'random_seed'. For each model training task, we generate a value between -1000000000 and 1000000000. This value is set as the Python seed right before training of an individual model, so behavior is the same in single-threaded and multiprocess training contexts. How is this generated? The experiment requires that an experiment-level random seed be passed in the config, so it becomes part of the experiment config that is saved, and the per-model seeds are generated from it.
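
For illustration, a minimal sketch of how such a seed might be generated and mapped into Postgres' setseed() range. The helper names and the scaling choice here are hypothetical, not necessarily what Triage does internally:

```python
import random

# Bounds taken from the PR description; helper names are hypothetical.
SEED_MIN, SEED_MAX = -1_000_000_000, 1_000_000_000

def generate_model_random_seed() -> int:
    """Draw one per-model seed from the already-seeded global generator."""
    return random.randint(SEED_MIN, SEED_MAX)

def python_to_postgres_seed(model_seed: int) -> float:
    """Postgres' setseed() takes a float in [-1.0, 1.0]; scale the Python-space seed into that range."""
    return model_seed / float(SEED_MAX)

# At the start of the experiment: seed Python once with the experiment-level
# random_seed from the config, so per-model seed generation is reproducible.
random.seed(42)  # experiment-level random_seed from the experiment config

# For each model training task: draw a model seed, store it in the models table
# under random_seed, and re-seed Python right before training that model.
model_seed = generate_model_random_seed()
random.seed(model_seed)

# Before ranking: map the model seed into Postgres seed-space, e.g. to pass to
# "SELECT setseed(%(seed)s)" ahead of an ORDER BY ... RANDOM() tiebreak.
postgres_seed = python_to_postgres_seed(model_seed)
```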

To save space in the predictions tables, and to remove unnecessary precision that would make tiebreaking largely irrelevant, the score column in the predictions tables is converted to DECIMAL(6, 5).
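
As a sketch of the corresponding column change (illustrative only; the actual Alembic migration in the PR may differ, and the schema-qualified table name here is an assumption):

```sql
-- Illustrative only: DECIMAL(6, 5) keeps five digits after the decimal point,
-- which is plenty for scores in [0, 1] while dropping float-level precision.
ALTER TABLE test_results.predictions
    ALTER COLUMN score TYPE DECIMAL(6, 5);
```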

To keep track of how tiebreaking was done, there is a new prediction_metadata table that holds this metadata, whether it comes from user configuration or the Triage-supplied default.

Implementation-wise, the ranks are filled in via an update statement after predictions are initially inserted with NULL ranks, to prevent memory from ballooning.
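
A rough sketch of what that update might look like (illustrative only; the schema name, entity/date key columns, and the exact percentile formula are assumptions rather than the merged implementation):

```sql
-- Illustrative sketch: fill in ranks for one model after its predictions were
-- inserted with NULL ranks. RANDOM() is the default tiebreaker; a user-supplied
-- clause such as "label_value desc" could be substituted in the ORDER BY.
-- (A preceding "SELECT setseed(...)" call makes the RANDOM() tiebreak reproducible.)
UPDATE test_results.predictions AS p
SET
    rank_abs = ranked.abs_rank,
    rank_pct = ranked.abs_rank::float / ranked.total  -- one possible percentile definition
FROM (
    SELECT
        entity_id,
        as_of_date,
        ROW_NUMBER() OVER (ORDER BY score DESC, RANDOM()) AS abs_rank,
        COUNT(*) OVER ()                                  AS total
    FROM test_results.predictions
    WHERE model_id = 123
) AS ranked
WHERE p.model_id = 123
  AND p.entity_id = ranked.entity_id
  AND p.as_of_date = ranked.as_of_date;
```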

@codecov-io commented Apr 11, 2019

Codecov Report

Merging #671 into master will decrease coverage by 0.18%.
The diff coverage is 72.22%.


@@            Coverage Diff             @@
##           master     #671      +/-   ##
==========================================
- Coverage   82.78%   82.59%   -0.19%     
==========================================
  Files          90       93       +3     
  Lines        6012     6114     +102     
==========================================
+ Hits         4977     5050      +73     
- Misses       1035     1064      +29
| Impacted Files | Coverage Δ |
| -------------- | ---------- |
| src/triage/component/results_schema/__init__.py | 66.66% <ø> (ø) ⬆️ |
| ...embic/versions/609c7cc51794_rankify_predictions.py | 0% <0%> (ø) |
| src/triage/component/results_schema/schema.py | 98.36% <100%> (+0.15%) ⬆️ |
| src/triage/experiments/base.py | 95.39% <100%> (+0.03%) ⬆️ |
| src/triage/component/catwalk/model_trainers.py | 91.94% <100%> (-1.11%) ⬇️ |
| src/triage/component/catwalk/storage.py | 92.47% <100%> (+0.05%) ⬆️ |
| src/triage/util/random.py | 100% <100%> (ø) |
| src/triage/util/sql.py | 100% <100%> (ø) |
| src/triage/experiments/validate.py | 77.03% <83.33%> (+0.27%) ⬆️ |
| src/triage/component/catwalk/predictors.py | 95.09% <96.29%> (+0.29%) ⬆️ |
| ... and 5 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ecsalomon ecsalomon left a comment (Contributor)

I don't really have any comments on the code, but I found some of the documentation and configuration confusing.

  • I think we need to make it really clear where and how the random seed is used and stored -- i.e., what is its scope, especially as it relates to being stored with model metadata. A model's random seed seems intuitively similar to the random_state key in sklearn classes and so would also be part of the model group definition. If we don't intend to use the experiment random seed in this way, we need to be really clear in the language around it. On the other hand, perhaps it is desirable to set a single random seed to use for all elements of an experiment, including all model random_states. I haven't really thought through the implications of either.
  • Relatedly, what is the expected behavior of re-running the same config, changing only the random seed, with and without replace? What, if anything, is updated in the models and model_groups tables? In the predictions, evaluations, and predictions_metadata tables?
  • I'm unclear on the purpose and behavior of the other configuration options, and the remainder of my comments are on that topic.
  • I think it would be cool to eventually support something like @avishekrk's tie-breaking prioritization (e.g., prioritize women for housing interventions in the case of ties) and to see the impact of this kind of prioritization on evaluation and bias metrics. But this could still lead to ties, which would then need to be broken in the best/worst/stochastic manner. But this doesn't seem like the intention of the configuration, nor does it seem like an obvious extension of the configuration.
  • Have we considered how audition and/or postmodelling will handle models that have different configuration options for tie-breaking? Or different random seeds? Are these comparable or should they blow up? I know this PR doesn't have to make those changes, but I am trying to think through what the consequences of the new configuration options will be.
  • The configuration of rank_tiebreaker_order_by allows users to do both potentially dangerous (label_value desc) and useless (score asc) things. We already support plenty of ridiculous configurations, so this isn't a strong objection. But I'm not sure that it supports all that many useful things, either.

@@ -11,6 +11,9 @@ config_version: 'v6'
# model_comment (optional) will end up in the model_comment column of the
# models table for each model created in this experiment
model_comment: 'test'
# random_seed will be set in Python at the beginning of the experiment and
# affect the generation of all model seeds
Contributor

and model groups? or are model groups independent of seed? Edit: I was thinking about this as a replacement for sklearn's seed setting, which now (usefully) creates distinct model groups.

Contributor

It might be good to document here (and elsewhere) what the scope of the model random seed is/isn't.

# Example: "RANDOM()" (default),
# Will end up in a SQL clause like: "order by score desc, RANDOM()"
#
# Example: "label_value desc"
Contributor

Hmmm, the configuration here seems maybe undesirable to support as is -- it seems like a bad idea to sort by label_value desc as it will give an unrealistically positive view of any downstream processes that make use of ranking, unless there is a giant, flashing warning letting users know what is happening. If it were possible to join to another table and then sort on a value there (e.g., we want to prioritize the lowest income people in the event of ties), I could see this being useful, though that is obviously out of scope here. So I guess my question is: Is that the intended future of this configuration?

Contributor Author

This is a good point. I don't really see how to make a non-confusing configuration option support joins here. Like, we could let people put in the join clause, but disconnected from the query it's probably hard to reliably write. They could put the whole query, with what's in the code file now as an example, but that is

Honestly, given the limited scope of the usage of these columns (see my other reply just now), I'm inclined to just remove the configuration for this entirely. My opinion is that stuff like 'prioritize lowest income people in the event of ties' would belong in after-experimenting evaluation, which could be built around the enhanced sort_predictions_and_labels in the evaluation pull request: https://github.com/dssg/triage/pull/674/files#diff-69452711d71cc1b37559e5011b372db0 (for instance, we add onto that to make it take other arbitrary arrays that may have other data, and add some convenience onto that for getting the predictions and labels and other data)

Contributor

I think if we're reporting out best, worst, and stochastic cases in the evaluations table, it makes sense to allow those to be supported here, but I wonder if it might make more sense to just have a parameter with those explicit values allowed rather than an arbitrary order by clause.

In the future, maybe that could be extended to allow specifying a join table or something along those lines for more complex cases, but this may be a little safer and more explicit (especially with respect to errors confusing ascending and descending) for the current purposes.

Contributor Author

@shaycrk how would the stochastic case (30 trials, therefore 30 rankings) map to a ranking we can materialize here?

Contributor

Yeah, I think I got my mind all twisted thinking about this in terms of what's used for the metrics, but since there is no longer a "canonical" metric value but rather a range (worst, best) and central tendency (stochastic), there is no longer any alignment between the inserted ranks and the metric value. So I assume these will just be used for downstream processes, specifically postmodeling and bias analysis. So we should think about the configuration either in terms of what we want to support for those... or punt on configuration altogether until we start working on changes to those modules.

Contributor

Just as the random tie-breaking option (one realization from the stochastic case) that you're already implementing here does. I don't think we need the flexibility of allowing the user to specify any arbitrary order by clause (which may also be error-prone) as opposed to having a fixed menu of options that mirrors what we're putting in the evaluations table.

# of Postgres' random number generator. Before ranking, the generator is seeded based on
# the *model*'s random seed.
#prediction:
# rank_tiebreaker_order_by: "RANDOM()"
Contributor

It's unclear from the documentation here how this interacts with the (future) metrics behavior, which may be addressed in the next PR in this series. But I am wondering: If I set random() here, is that intended to apply before label_value in the best/worst metric comparisons or random() in the stochastic (leading to paradoxically invariant stochastic metrics)? In the example I gave above of prioritizing based on income, I would expect the sorting to be applied and then ties within that sorting to be broken by random() or label_value, but if I gave one of the example options, I would not.

Contributor Author

This doesn't affect the metrics behavior. It'd be really hard to integrate this schema with best/worst/30 random runs. Really, I think the mere existence of the stochastic option for evaluation should discourage us from trying to pin evaluation hopes on tightly coupling the metrics with the ranking that goes into this table. We're not reasonably going to store every ranking that was used, and there isn't even a true 'ranking' to recover if we use the stochastic value.

@shaycrk shaycrk self-requested a review April 25, 2019 19:47
update {table_name(prediction_obj)} as p
set
rank_abs = abs.abs_rank,
rank_pct = pct.pct_rank,
Contributor

I realize it would mean changing column names that already exist, but I would probably lean towards naming these rank_abs_no_ties and rank_pct_no_ties (or without_ties) just to be very explicit as well as to avoid suggesting that one is the preferred/default to use in general.

Also, we might have to calculate it directly, but it seems like we may also want rank_pct_with_ties too?

@thcrock thcrock force-pushed the ranking_try_3 branch 3 times, most recently from b8fc0cf to 02f60c6 Compare May 3, 2019 19:07
@@ -184,6 +185,9 @@ feature_group_definition:

feature_group_strategies: ['all']

prediction:
rank_tiebreaker: "best"
Contributor

maybe add a comment warning that this is illustrative, but you might not want to use best in most cases unless you have good reason to do so...

Contributor Author

I actually left this in accidentally from some speed testing I was doing. I'll remove it and let them use the default (unless @nanounanue wants to change it to something else)

| Column name | Behavior |
| ----------- | ------- |
| rank_abs_with_ties | Absolute ranking, with ties. Ranks will skip after a set of ties, so if two entities are tied at rank 3, the next entity after them will have rank 5. |
| rank_pct_with_ties | Percentile ranking, with ties. Percentiles will skip after a set of ties, so if two entities out of ten are tied at 0.1 (tenth percentile), the next entity after them will have 0.3 (thirtieth percentile) |
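
A small, self-contained pandas example (hypothetical scores, not from the PR) showing how min-method ranking skips after a group of ties, in the spirit of the with-ties columns described above:

```python
import pandas as pd

# Hypothetical scores for five entities; two are tied.
df = pd.DataFrame({"score": [0.93, 0.81, 0.81, 0.54, 0.20]})

# Rank with ties: tied scores share the lowest rank in their group, and the
# rank after the group is skipped (1, 2, 2, 4, 5 here).
df["rank_abs_with_ties"] = df["score"].rank(method="min", ascending=False).astype(int)

# Percentile version of the same ranking (1/5, 2/5, 2/5, 4/5, 5/5 here),
# so the value after the tied group jumps from 0.4 to 0.8.
df["rank_pct_with_ties"] = df["score"].rank(method="min", ascending=False, pct=True)

print(df)
```
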
Contributor

might be good to mention if the percentiles are calculated to a certain precision?

ranking_df['rank_pct_no_ties'] = numpy.array([1 - (rank - 1) / len(ranking_df) for rank in ranking_df['rank_abs_no_ties']])
ranking_df['rank_pct_with_ties'] = ranking_df['score'].rank(method='min', pct=True)

# with our rankings computed, upsert these ranks into the predictions table
Contributor

just an update, not an upsert, right? That is, it assumes the records are in the predictions table already (at least if I'm reading this correctly)

@shaycrk shaycrk left a comment (Contributor)

Made a couple of small notes about comments inline, but generally looks good to me!

@shaycrk shaycrk assigned thcrock and unassigned shaycrk May 3, 2019
@thcrock thcrock merged commit 1dc8a4a into master May 3, 2019
@thcrock thcrock deleted the ranking_try_3 branch May 3, 2019 22:29