Add compute best/worst/stochastic for each evaluation [Resolves #292] #674

Merged: 2 commits into master on May 2, 2019

Conversation

thcrock (Contributor) commented Apr 19, 2019

Instead of computing evaluation metrics based on just one sort seed,
this computes best/worst/stochastic values (using many sort seeds).

  • Add a schema migration to create new columns that store the different
    versions of each metric, plus the number of trials and the standard
    deviation. The old 'value' is copied to stochastic_value and then dropped.
  • Modify sorting and thresholding routines to use numpy arrays instead
    of converting to Python lists
  • Update sort_predictions_and_labels to implement the best and worst sorts
    in addition to the random one (see the sketch after this list)
  • Update catwalk.ModelEvaluator to generate evaluations for the
    best/worst sorting of each metric, and to run 30 trials for the metrics
    whose best and worst values differ sufficiently. To enable this, the
    evaluation is refactored to decouple the flattening of the metric
    definitions from the actual evaluation computation.
  • Update Audition and Postmodeling to look at the 'stochastic value' by
    default
  • Remove sort_seed from the scoring example config, as it is no longer used

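For illustration, here is a minimal sketch of the tie-breaking sort described
above. This is not the actual catwalk implementation; the signature and the
tiebreaker argument names are assumptions.

import numpy as np

def sort_predictions_and_labels(predictions, labels, tiebreaker='random', sort_seed=None):
    """Return predictions and labels sorted by score descending.

    Ties among equal scores are broken according to tiebreaker:
      'worst'  -> negative labels first within ties (worst possible score)
      'best'   -> positive labels first within ties (best possible score)
      'random' -> random order within ties, controlled by sort_seed
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    label_key = labels.astype(int)

    if tiebreaker == 'best':
        secondary = -label_key                    # 1s before 0s within ties
    elif tiebreaker == 'worst':
        secondary = label_key                     # 0s before 1s within ties
    elif tiebreaker == 'random':
        rng = np.random.RandomState(sort_seed)
        secondary = rng.permutation(len(labels))  # shuffle within ties
    else:
        raise ValueError(f"unknown tiebreaker: {tiebreaker}")

    # numpy.lexsort treats the last key as the primary sort key
    order = np.lexsort((secondary, -predictions))
    return predictions[order], labels[order]
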
@codecov-io commented Apr 19, 2019

Codecov Report

Merging #674 into master will decrease coverage by 0.3%.
The diff coverage is 80%.


@@            Coverage Diff             @@
##           master     #674      +/-   ##
==========================================
- Coverage   82.78%   82.47%   -0.31%     
==========================================
  Files          90       92       +2     
  Lines        6012     6134     +122     
==========================================
+ Hits         4977     5059      +82     
- Misses       1035     1075      +40
Impacted Files Coverage Δ
...ent/postmodeling/contrast/model_group_evaluator.py 64.3% <ø> (ø) ⬆️
...component/postmodeling/contrast/model_evaluator.py 68.27% <ø> (ø) ⬆️
src/triage/experiments/base.py 95.35% <ø> (ø) ⬆️
...rc/triage/component/audition/distance_from_best.py 100% <ø> (ø) ⬆️
...bic/versions/97cf99b7348f_evaluation_randomness.py 0% <0%> (ø)
src/triage/component/results_schema/schema.py 98.26% <100%> (+0.06%) ⬆️
src/triage/component/catwalk/metrics.py 98.33% <100%> (+0.02%) ⬆️
src/triage/component/catwalk/evaluation.py 98.57% <100%> (+0.5%) ⬆️
src/triage/component/catwalk/storage.py 92.11% <50%> (-0.31%) ⬇️
src/triage/util/random.py 83.33% <83.33%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

'num_labeled_above_threshold',
'num_positive_labels'
])
):
Member

I'm curious: why subclass the namedtuples? If it's just to add __doc__, I think you have better options in Python 3 (ones that don't require you to add __slots__, as you otherwise probably should here).

typing.NamedTuple provides a typed interface (and requires annotations) [docs]:

import typing

class MetricEvaluationResult(typing.NamedTuple):
    """I have a doc string."""
    metric: typing.Any
    parameter: str = 'boop'
    …

If that doesn't suit, in Python3, you can also set __doc__ procedurally:

MetricEvaluationResult = namedtuple('MetricEvaluationResult', …)
MetricEvaluationResult.__doc__ = """…"""

And there are other niceties in Python 3 now. I'm guessing SimpleNamespace doesn't fit the bill here, but dataclass very well might. dataclass is rather like namedtuple, but more object-y:

import typing
from dataclasses import dataclass

@dataclass
class MetricEvaluationResult:
    """my docs"""
    metric: typing.Any
    parameter: str = 'boop'

It works quite like namedtuple, but the underlying implementation is quite different, since it's dressing up your class definition, rather than generating a subclass of tuple: It generates the appropriate __init__ for you, as well as all the other boring methods you probably want, like __repr__ and __eq__. You can even pass the frozen flag to make a dataclass immutable, like a namedtuple.

Ultimately, a namedtuple might be lighter weight. But there are pros and cons 🤷‍♂️

Lots of stuff in Python3, man 💥

Contributor Author

Thanks! Dataclass looks like the ticket

Contributor Author

(yes, it was just to add the docstring)

Member

👍

* `best_value` - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 1
* `standard_deviation` - If trials are needed to produce the `stochastic_value`, the standard deviation of these trials is written here. Otherwise this will be null
Contributor

I don't have a strong feeling on this, but perhaps the default for st dev should be 0 (i.e., no variance).

Contributor Author

From the error we get in both Python and Postgres, it seems to be 'undefined' with fewer than 2 trials, and I'd call that different from 0.

On the other hand, if we take my answer to the other comment (that the best and/or worst counts as a trial), we could just plug the best and worst in as the two trials, which gives a defined variance of 0.

@ecsalomon (Contributor) commented Apr 25, 2019

Yeah, I guess the difference is: do we take license to say we know the std dev of the metric is 0 when all values are (within our sig digits) equal, even if we don't do the trials, or do we report the standard deviation only when we've estimated it with trials?

@ecsalomon (Contributor) commented Apr 25, 2019

In other words: Just like the mean reported in stochastic_value is an estimate of the true mean of all metric values, the standard deviation based on the trials is just an estimate of the true standard deviation of all values. When all of the possible metric values are the same, we know the population value of standard deviation is 0 without having to do the trials, so we could just enter that just like we report the population value for the mean without doing the trials.
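
For illustration only, a minimal sketch of the trial logic being discussed.
evaluate_one_trial is a hypothetical stand-in for one random
sort/threshold/evaluate pass, and the tolerance value here is assumed (the
real constant lives at catwalk.evaluation.RELATIVE_TOLERANCE).

import math
import numpy as np

TRIALS = 30
RELATIVE_TOLERANCE = 0.01  # assumed value for this sketch

def stochastic_summary(best_value, worst_value, evaluate_one_trial):
    """Return (stochastic_value, num_sort_trials, standard_deviation)."""
    if math.isclose(best_value, worst_value, rel_tol=RELATIVE_TOLERANCE):
        # Best and worst agree within tolerance, so no random trials are run.
        # As documented above, the worst value is reported with 1 trial and a
        # null standard deviation; this thread suggests 0 trials and 0 instead.
        return worst_value, 1, None
    trial_values = np.array([evaluate_one_trial() for _ in range(TRIALS)])
    return trial_values.mean(), TRIALS, trial_values.std()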

@ecsalomon (Contributor) left a comment

I just have a few comments/questions. Found it easy to understand, thought the documentation was clear.

* `worst_value` - Ordering by the label ascending. This has the effect of as many predicted negatives making it above thresholds as possible, thus producing the worst possible score.
* `best_value` - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 1
Contributor

Is it more realistic to say the number of trials is 0 (since none were done)?

Contributor Author

I was thinking that the one that produced the final value (worst) was a trial but I guess that's not really accurate.

@ecsalomon (Contributor) commented Apr 25, 2019

Yeah, I feel like 0 is a more transparent flag that no random trials were done, whereas the previous value column was truly based on 1 random trial. I think putting 1 or 2 here would make the meaning of the column (and also of the metrics computed) less clear, as these aren't (or shouldn't be) included in the estimates of the mean and standard deviation when trials are done.

@@ -70,7 +70,7 @@ def __init__(
         self.matrix_label_tuple = matrix, labels
         self.metadata = base_metadata
         self.label_count = label_count
-        self.init_labels = init_labels
+        self.init_labels = pandas.Series(init_labels)
Contributor

Thank you

@@ -395,6 +395,8 @@ def design_matrix(self):

     @property
     def labels(self):
+        if type(self.matrix_label_tuple[1]) != pd.Series:
+            raise TypeError("Label stored as something other than pandas Series")
Contributor

Mmmhmmm

)
)
return predictions_proba_sorted, labels_sorted
numpy.random.seed(abs(sort_seed))
Contributor

Do we want to disallow negative random seeds rather than alter them silently?

Contributor Author

Sure. Python allows them, but we could just generate seeds between 0 and the max instead of between -max/2 and +max/2, so no conversion would be needed to make them numpy-compatible.
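
A minimal sketch of that alternative (generate_sort_seed is a hypothetical
name, not necessarily the project's helper): drawing seeds directly from the
range numpy accepts removes the need for the abs() conversion.

import random
import numpy as np

def generate_sort_seed():
    """Draw a sort seed that numpy accepts without conversion."""
    # numpy.random.seed requires an integer in [0, 2**32 - 1]
    return random.randint(0, 2**32 - 1)

np.random.seed(generate_sort_seed())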

@ecsalomon (Contributor) left a comment

Thanks! LGTM

@saleiro merged commit 7435238 into master on May 2, 2019