Add compute best/worst/stochastic for each evaluation [Resolves #292] #674
Conversation
Instead of just computing evaluation metrics based on one sort seed, computes best/worst/stochastic (many sort seeds):

- Add schema migration to create new columns to store the different versions of each metric, plus number of trials and standard deviation. The old 'value' is copied to stochastic_value and then dropped.
- Modify sorting and thresholding routines to use numpy arrays instead of converting to Python lists.
- Update sort_predictions_and_labels to implement best and worst sort in addition to the random one.
- Update catwalk.ModelEvaluator to generate evaluations for the best/worst sorting for each metric, and do 30 trials for the metrics which have sufficiently different best/worst. To enable this, the evaluation is refactored to decouple the flattening of the metric definitions from the actual evaluation computation.
- Update Audition and Postmodeling to look at the 'stochastic value' by default.
- Remove sort_seed from the scoring example config as it is no longer used.
Codecov Report
@@ Coverage Diff @@
## master #674 +/- ##
==========================================
- Coverage 82.78% 82.47% -0.31%
==========================================
Files 90 92 +2
Lines 6012 6134 +122
==========================================
+ Hits 4977 5059 +82
- Misses 1035 1075 +40
Continue to review full report at Codecov.
        'num_labeled_above_threshold',
        'num_positive_labels'
    ])
):
I'm curious, how come subclass the namedtuples? If it's just to add `__doc__`, I think you have better options in Python 3 (and which don't require you to add `__slots__`, as you otherwise probably should here).

`typing.NamedTuple` provides a typed interface (and requires annotations) [docs]:

    class MetricEvaluationResult(typing.NamedTuple):
        """I have a doc string."""
        metric: typing.Any
        parameter: str = 'boop'
        …

If that doesn't suit, in Python 3 you can also set `__doc__` procedurally:

    MetricEvaluationResult = namedtuple('MetricEvaluationResult', …)
    MetricEvaluationResult.__doc__ = """…"""

And there are other niceties in Python 3 now. I'm guessing `SimpleNamespace` doesn't fit the bill here; but `dataclass` very well might. `dataclass` is rather like `namedtuple`, but more `object`-y:

    @dataclass
    class MetricEvaluationResult:
        """my docs"""
        metric: typing.Any
        parameter: str = 'boop'

It works quite like `namedtuple`, but the underlying implementation is quite different, since it's dressing up your class definition rather than generating a subclass of `tuple`: it generates the appropriate `__init__` for you, as well as all the other boring methods you probably want, like `__repr__` and `__eq__`. You can even pass the `frozen` flag to make a `dataclass` immutable, like a `namedtuple`.

Ultimately, a `namedtuple` might be lighter weight. But there are pros and cons 🤷‍♂️

Lots of stuff in Python 3, man 💥
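A runnable sketch of the `frozen` behavior mentioned above — the class name is reused from the snippet, but the field contents here are purely illustrative:

```python
import typing
from dataclasses import dataclass, FrozenInstanceError

# Illustrative only: a frozen dataclass gets __init__, __repr__, and __eq__
# generated for it, and frozen=True blocks attribute assignment afterward.
@dataclass(frozen=True)
class MetricEvaluationResult:
    """Holds one computed metric value."""
    metric: typing.Any
    parameter: str = 'boop'

result = MetricEvaluationResult(metric='precision@')
print(result)  # auto-generated __repr__

# auto-generated __eq__ compares field by field
same = result == MetricEvaluationResult(metric='precision@')
print(same)

try:
    result.metric = 'recall@'  # frozen=True makes this raise
except FrozenInstanceError:
    print('immutable, like a namedtuple')
```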
Thanks! Dataclass looks like the ticket
(yes, it was just to add the docstring)
👍
* `best_value` - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here.
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 1.
* `standard_deviation` - If trials are needed to produce the `stochastic_value`, the standard deviation of these trials is written here. Otherwise this will be null.
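The scheme those columns describe can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `precision_at_k` and `tiebreak_sort` are hypothetical helpers, and the tolerance constant is a stand-in for catwalk.evaluation.RELATIVE_TOLERANCE.

```python
import numpy

RELATIVE_TOLERANCE = 0.01  # stand-in for catwalk.evaluation.RELATIVE_TOLERANCE

def precision_at_k(sorted_labels, k):
    """Fraction of positive labels among the top k entries."""
    return numpy.mean(sorted_labels[:k])

def tiebreak_sort(scores, labels, label_order, rng=None):
    """Sort by score descending, breaking score ties by label.

    label_order: 1 puts positive labels first among ties (best case),
    -1 puts them last (worst case); None breaks ties randomly.
    """
    if label_order is None:
        tiebreak = rng.random(len(scores))
    else:
        tiebreak = label_order * labels
    # numpy.lexsort uses the LAST key as primary; negate for descending order
    order = numpy.lexsort((-tiebreak, -scores))
    return labels[order]

# Toy data: three entities tied at score 0.5, one of them a true positive
scores = numpy.array([0.9, 0.5, 0.5, 0.5, 0.1])
labels = numpy.array([1, 1, 0, 0, 0])
k = 3

best = precision_at_k(tiebreak_sort(scores, labels, 1), k)    # 2/3
worst = precision_at_k(tiebreak_sort(scores, labels, -1), k)  # 1/3

if abs(best - worst) <= RELATIVE_TOLERANCE * abs(best):
    # best and worst agree within tolerance: no trials needed
    stochastic, num_sort_trials, standard_deviation = worst, 1, None
else:
    rng = numpy.random.default_rng(0)
    trials = [
        precision_at_k(tiebreak_sort(scores, labels, None, rng), k)
        for _ in range(30)
    ]
    stochastic = numpy.mean(trials)
    num_sort_trials = 30
    standard_deviation = numpy.std(trials)
```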
I don't have a strong feeling on this, but perhaps the default for std dev should be 0 (i.e., no variance).
From the error we get in both Python and Postgres, it seems to be 'undefined' with fewer than 2 trials, and I'd call that different from 0.
On the other hand, if we take my answer to the other comment (the best and/or worst is a trial) we could just plug the best and worst in as the two trials, which is a defined variance of 0.
Yeah, I guess the difference is: do we take license to say we know the std dev of the metric is 0 when all values are (within our sig digits) equal, even if we don't do the trials, or do we report the standard deviation only when we've estimated it with trials.
In other words: just like the mean reported in `stochastic_value` is an estimate of the true mean of all metric values, the standard deviation based on the trials is just an estimate of the true standard deviation of all values. When all of the possible metric values are the same, we know the population value of the standard deviation is 0 without having to do the trials, so we could just enter that, just like we report the population value for the mean without doing the trials.
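A tiny numeric illustration of that point (the values here are made up):

```python
import numpy

# When every possible metric value is identical, the population standard
# deviation is exactly 0 -- no trials are needed to know it.
values = numpy.array([0.42, 0.42, 0.42, 0.42])
print(numpy.std(values))  # 0.0

# With distinct possible values, the std of a 30-trial sample only
# *estimates* the population std over all possible sort orders.
rng = numpy.random.default_rng(1)
sample = rng.choice([0.40, 0.44], size=30)
print(numpy.std(sample))  # an estimate, not the population value
```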
I just have a few comments/questions. Found it easy to understand, thought the documentation was clear.
* `worst_value` - Ordering by the label ascending. This has the effect of as many predicted negatives making it above thresholds as possible, thus producing the worst possible score.
* `best_value` - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here.
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 1.
Is it more realistic to say the number of trials is 0 (since none were done)?
I was thinking that the one that produced the final value (worst) was a trial but I guess that's not really accurate.
Yeah, I feel like 0 is a more transparent flag that no random trials were done, whereas the previous `value` column was truly based on 1 random trial. I think putting 1 or 2 here would make the meaning of the column (and also of the metrics computed) less clear, as these aren't (or shouldn't be) included in the estimates of the mean and standard deviation when trials are done.
@@ -70,7 +70,7 @@ def __init__(
         self.matrix_label_tuple = matrix, labels
         self.metadata = base_metadata
         self.label_count = label_count
-        self.init_labels = init_labels
+        self.init_labels = pandas.Series(init_labels)
Thank you
@@ -395,6 +395,8 @@ def design_matrix(self):

    @property
    def labels(self):
        if type(self.matrix_label_tuple[1]) != pd.Series:
            raise TypeError("Label stored as something other than pandas Series")
Mmmhmmm
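For illustration, a stripped-down sketch of the guard added in the diff above; the class name and everything outside the `labels` property are guesses made only to produce a runnable example:

```python
import pandas as pd

class MatrixStore:
    """Hypothetical stripped-down sketch of the class in the diff above."""

    def __init__(self, matrix, labels):
        self.matrix_label_tuple = matrix, labels

    @property
    def labels(self):
        # Fail loudly if labels were stored as anything but a pandas Series
        if type(self.matrix_label_tuple[1]) != pd.Series:
            raise TypeError("Label stored as something other than pandas Series")
        return self.matrix_label_tuple[1]

store = MatrixStore(matrix=None, labels=pd.Series([0, 1, 1]))
print(store.labels.tolist())  # [0, 1, 1]

bad = MatrixStore(matrix=None, labels=[0, 1, 1])
try:
    bad.labels
except TypeError as exc:
    print(exc)
```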
        )
    )
    return predictions_proba_sorted, labels_sorted
    numpy.random.seed(abs(sort_seed))
Do we want to disallow negative random seeds rather than alter them silently?
Sure. Python allows them, but we could just generate them between 0 and whatever instead of between -whatever/2 and +whatever/2, so no conversion would be needed to make them numpy-able.
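A sketch of that suggestion: generate sort seeds directly in numpy's legal range so no `abs()` conversion is ever needed. The helper name and the master-seed parameter are hypothetical, not code from this PR:

```python
import random
import numpy

# The legacy numpy.random.seed only accepts integer seeds in [0, 2**32 - 1],
# while Python's random module happily accepts negative seeds. Generating
# trial seeds directly in numpy's legal range avoids silently altering them.
def generate_sort_seeds(n_trials, master_seed=42):
    """Hypothetical helper: one non-negative seed per stochastic trial."""
    rng = random.Random(master_seed)
    return [rng.randrange(0, 2**32) for _ in range(n_trials)]

seeds = generate_sort_seeds(30)
numpy.random.seed(seeds[0])  # accepted as-is, no abs() needed
```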
Thanks! LGTM