Add compute best/worst/stochastic for each evaluation [Resolves #292] #674
Conversation
Instead of just computing evaluation metrics based on one sort seed, computes best/worst/stochastic (many sort seeds):

- Add schema migration to create new columns to store the different versions of each metric, plus number of trials and standard deviation. The old 'value' is copied to stochastic_value and then dropped.
- Modify sorting and thresholding routines to use numpy arrays instead of converting to Python lists.
- Update sort_predictions_and_labels to implement best and worst sort in addition to the random one.
- Update catwalk.ModelEvaluator to generate evaluations for the best/worst sorting for each metric, and do 30 trials for the metrics which have sufficiently different best/worst. To enable this, the evaluation is refactored to decouple the flattening of the metric definitions from the actual evaluation computation.
- Update Audition and Postmodeling to look at the 'stochastic value' by default.
- Remove sort_seed from the scoring example config as it is no longer used.
Codecov Report
@@ Coverage Diff @@
## master #674 +/- ##
==========================================
- Coverage 82.78% 82.47% -0.31%
==========================================
Files 90 92 +2
Lines 6012 6134 +122
==========================================
+ Hits 4977 5059 +82
- Misses 1035 1075 +40
Continue to review full report at Codecov.
        'num_labeled_above_threshold',
        'num_positive_labels'
    ])
):
I'm curious, how come subclass the namedtuples? If it's just to add `__doc__`, I think you have better options in Python 3 (and which don't require you to add `__slots__`, as you otherwise probably should here).

`typing.NamedTuple` provides a typed interface (and requires annotations) [docs]:

    class MetricEvaluationResult(typing.NamedTuple):
        """I have a doc string."""
        metric: typing.Any
        parameter: str = 'boop'
        …

If that doesn't suit, in Python 3 you can also set `__doc__` procedurally:

    MetricEvaluationResult = namedtuple('MetricEvaluationResult', …)
    MetricEvaluationResult.__doc__ = """…"""

And there are other niceties in Python 3 now. I'm guessing `SimpleNamespace` doesn't fit the bill here; but `dataclass` very well might. `dataclass` is rather like `namedtuple`, but more `object`-y:

    @dataclass
    class MetricEvaluationResult:
        """my docs"""
        metric: typing.Any
        parameter: str = 'boop'

It works quite like `namedtuple`, but the underlying implementation is quite different, since it's dressing up your class definition rather than generating a subclass of `tuple`: it generates the appropriate `__init__` for you, as well as all the other boring methods you probably want, like `__repr__` and `__eq__`. You can even pass the `frozen` flag to make a `dataclass` immutable, like a `namedtuple`.

Ultimately, a `namedtuple` might be lighter weight. But there are pros and cons 🤷‍♂️

Lots of stuff in Python 3, man 💥
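A runnable sketch of the `frozen` behavior mentioned above — the class name is reused from the snippet, but the field contents here are purely illustrative:

```python
import typing
from dataclasses import dataclass, FrozenInstanceError

# Illustrative only: a frozen dataclass gets __init__, __repr__, and __eq__
# generated for it, and frozen=True blocks attribute assignment afterward.
@dataclass(frozen=True)
class MetricEvaluationResult:
    """Holds one computed metric value."""
    metric: typing.Any
    parameter: str = 'boop'

result = MetricEvaluationResult(metric='precision@')
print(result)  # auto-generated __repr__

# auto-generated __eq__ compares field by field
same = result == MetricEvaluationResult(metric='precision@')
print(same)

try:
    result.metric = 'recall@'  # frozen=True makes this raise
except FrozenInstanceError:
    print('immutable, like a namedtuple')
```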
Thanks! Dataclass looks like the ticket
(yes, it was just to add the docstring)
👍
* `best_value` - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here.
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 1.
* `standard_deviation` - If trials are needed to produce the `stochastic_value`, the standard deviation of these trials is written here. Otherwise this will be null.
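The scheme those columns describe can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `precision_at_k` and `tiebreak_sort` are hypothetical helpers, and the tolerance constant is a stand-in for catwalk.evaluation.RELATIVE_TOLERANCE.

```python
import numpy

RELATIVE_TOLERANCE = 0.01  # stand-in for catwalk.evaluation.RELATIVE_TOLERANCE

def precision_at_k(sorted_labels, k):
    """Fraction of positive labels among the top k entries."""
    return numpy.mean(sorted_labels[:k])

def tiebreak_sort(scores, labels, label_order, rng=None):
    """Sort by score descending, breaking score ties by label.

    label_order: 1 puts positive labels first among ties (best case),
    -1 puts them last (worst case); None breaks ties randomly.
    """
    if label_order is None:
        tiebreak = rng.random(len(scores))
    else:
        tiebreak = label_order * labels
    # numpy.lexsort uses the LAST key as primary; negate for descending order
    order = numpy.lexsort((-tiebreak, -scores))
    return labels[order]

# Toy data: three entities tied at score 0.5, one of them a true positive
scores = numpy.array([0.9, 0.5, 0.5, 0.5, 0.1])
labels = numpy.array([1, 1, 0, 0, 0])
k = 3

best = precision_at_k(tiebreak_sort(scores, labels, 1), k)    # 2/3
worst = precision_at_k(tiebreak_sort(scores, labels, -1), k)  # 1/3

if abs(best - worst) <= RELATIVE_TOLERANCE * abs(best):
    # best and worst agree within tolerance: no trials needed
    stochastic, num_sort_trials, standard_deviation = worst, 1, None
else:
    rng = numpy.random.default_rng(0)
    trials = [
        precision_at_k(tiebreak_sort(scores, labels, None, rng), k)
        for _ in range(30)
    ]
    stochastic = numpy.mean(trials)
    num_sort_trials = 30
    standard_deviation = numpy.std(trials)
```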
I don't have a strong feeling on this, but perhaps the default for std dev should be 0 (i.e., no variance).
From the error we get in both Python and Postgres, it seems to be 'undefined' with fewer than 2 trials, and I'd call that different from 0.
On the other hand, if we take my answer to the other comment (the best and/or worst is a trial) we could just plug the best and worst in as the two trials, which is a defined variance of 0.
Yeah, I guess the difference is: do we take license to say we know the std dev of the metric is 0 when all values are (within our sig digits) equal, even if we don't do the trials, or do we report the standard deviation only when we've estimated it with trials.
In other words: just like the mean reported in `stochastic_value` is an estimate of the true mean of all metric values, the standard deviation based on the trials is just an estimate of the true standard deviation of all values. When all of the possible metric values are the same, we know the population value of the standard deviation is 0 without having to do the trials, so we could just enter that, just like we report the population value for the mean without doing the trials.
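A tiny numeric illustration of that point (the values here are made up):

```python
import numpy

# When every possible metric value is identical, the population standard
# deviation is exactly 0 -- no trials are needed to know it.
values = numpy.array([0.42, 0.42, 0.42, 0.42])
print(numpy.std(values))  # 0.0

# With distinct possible values, the std of a 30-trial sample only
# *estimates* the population std over all possible sort orders.
rng = numpy.random.default_rng(1)
sample = rng.choice([0.40, 0.44], size=30)
print(numpy.std(sample))  # an estimate, not the population value
```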
I just have a few comments/questions. Found it easy to understand, thought the documentation was clear.
* `worst_value` - Ordering by the label ascending. This has the effect of as many predicted negatives making it above thresholds as possible, thus producing the worst possible score.
* `best_value` - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
* `stochastic_value` - If the `worst_value` and `best_value` are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the `worst_value` is written here.
* `num_sort_trials` - If trials are needed to produce the `stochastic_value`, the number of trials taken is written here. Otherwise this will be 1.
Is it more realistic to say the number of trials is 0 (since none were done)?
I was thinking that the one that produced the final value (worst) was a trial but I guess that's not really accurate.
Yeah, I feel like 0 is a more transparent flag that no random trials were done, whereas the previous `value` column was truly based on 1 random trial. I think putting 1 or 2 here would make the meaning of the column (and also of the metrics computed) less clear, as these aren't (or shouldn't be) included in the estimates of the mean and standard deviation when trials are done.
@@ -70,7 +70,7 @@ def __init__(
         self.matrix_label_tuple = matrix, labels
         self.metadata = base_metadata
         self.label_count = label_count
-        self.init_labels = init_labels
+        self.init_labels = pandas.Series(init_labels)
Thank you
@@ -395,6 +395,8 @@ def design_matrix(self):

    @property
    def labels(self):
        if type(self.matrix_label_tuple[1]) != pd.Series:
            raise TypeError("Label stored as something other than pandas Series")
Mmmhmmm
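For illustration, a stripped-down sketch of the guard added in the diff above; the class name and everything outside the `labels` property are guesses made only to produce a runnable example:

```python
import pandas as pd

class MatrixStore:
    """Hypothetical stripped-down sketch of the class in the diff above."""

    def __init__(self, matrix, labels):
        self.matrix_label_tuple = matrix, labels

    @property
    def labels(self):
        # Fail loudly if labels were stored as anything but a pandas Series
        if type(self.matrix_label_tuple[1]) != pd.Series:
            raise TypeError("Label stored as something other than pandas Series")
        return self.matrix_label_tuple[1]

store = MatrixStore(matrix=None, labels=pd.Series([0, 1, 1]))
print(store.labels.tolist())  # [0, 1, 1]

bad = MatrixStore(matrix=None, labels=[0, 1, 1])
try:
    bad.labels
except TypeError as exc:
    print(exc)
```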
        )
    )
    return predictions_proba_sorted, labels_sorted
    numpy.random.seed(abs(sort_seed))
Do we want to disallow negative random seeds rather than alter them silently?
Sure. Python allows them, but we could just generate them between 0 and whatever instead of between -whatever/2 and +whatever/2, so no conversion would be needed to make them numpy-able.
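A sketch of that suggestion: generate sort seeds directly in numpy's legal range so no `abs()` conversion is ever needed. The helper name and the master-seed parameter are hypothetical, not code from this PR:

```python
import random
import numpy

# The legacy numpy.random.seed only accepts integer seeds in [0, 2**32 - 1],
# while Python's random module happily accepts negative seeds. Generating
# trial seeds directly in numpy's legal range avoids silently altering them.
def generate_sort_seeds(n_trials, master_seed=42):
    """Hypothetical helper: one non-negative seed per stochastic trial."""
    rng = random.Random(master_seed)
    return [rng.randrange(0, 2**32) for _ in range(n_trials)]

seeds = generate_sort_seeds(30)
numpy.random.seed(seeds[0])  # accepted as-is, no abs() needed
```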
Thanks! LGTM