Cohort and Label Deep Dive [Resolves #492] #577

thcrock · 2019-01-25T20:51:54Z

Add Cohort/Label Deep Dive to documentation site
Add TOC to mkdocs config to allow references within cohort/label doc

codecov-io · 2019-01-25T21:04:07Z

Codecov Report

Merging #577 into master will increase coverage by 0.33%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #577      +/-   ##
==========================================
+ Coverage   82.49%   82.83%   +0.33%     
==========================================
  Files          83       84       +1     
  Lines        4833     5173     +340     
==========================================
+ Hits         3987     4285     +298     
- Misses        846      888      +42

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/triage/util/db.py | `100% <0%> (ø)` | ⬆️ |
| ...c/versions/0bca1ba9706e_add_matrix_uuid_to_eval.py | `0% <0%> (ø)` | |
| src/triage/component/results_schema/schema.py | `97.72% <0%> (+0.12%)` | ⬆️ |
| src/triage/component/catwalk/evaluation.py | `97.54% <0%> (+0.12%)` | ⬆️ |
| src/triage/component/catwalk/storage.py | `92.88% <0%> (+0.48%)` | ⬆️ |
| src/triage/experiments/base.py | `97.47% <0%> (+1.41%)` | ⬆️ |
| src/triage/experiments/validate.py | `78.57% <0%> (+2.57%)` | ⬆️ |
| src/triage/validation_primitives.py | `74.28% <0%> (+3.31%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b8b56f1...2a7ad83.

nanounanue · 2019-01-28T16:06:53Z

docs/sources/experiments/cohort-labels.md

+
+This document assumes that the reader is familiar with the concept of a machine learning target variable and will focus on explaining what is unique to Triage.
+
+**A cohort is the population used used for modeling on a given as-of-date**. This is expressed as a list of *entities*. An entity is simply the object of prediction, such as a facility to inspect or a patient coming in for a visit. Early warning systems tend to include their entire population (or at least a large subset of it) in the cohort at any given date, while appointment-based problems may only include in a date's cohort the people who are scheduled for an appointment on that date.


list -> set

"Early warning systems tend to include their entire population (or at least a large subset of it)" -> "EWS tend to include their entire population"

I was trying to make the DDJ-type early warning cohorts fit here. Those cohorts are only sometimes the entire population; other times they are very large subsets like 'people who have ever been arrested'. I wanted a sentence that draws a distinction between visit-level problems, which have a small per-date cohort, and everything else.

Might be illustrative to give a quick example of each case. I think we have a good handle on the difference between those two types of problems, but it will probably be less familiar to external readers.
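
A quick example of each case could look something like the sketch below. It is only an illustration: the entity ids, the `population` set, and the `appointments` schedule are all made up, and none of this is Triage API.

```python
from datetime import date

# Hypothetical data: every entity the early warning system tracks,
# and a visit schedule for an appointment-based problem.
population = {1, 2, 3, 4, 5}
appointments = {
    date(2016, 1, 1): [2, 5],
    date(2016, 2, 1): [1, 2, 2],  # duplicate booking for entity 2
}


def early_warning_cohort(as_of_date):
    """Everyone in the tracked population, regardless of the date."""
    return set(population)


def appointment_cohort(as_of_date):
    """Only entities with an appointment scheduled on the as-of-date."""
    return set(appointments.get(as_of_date, []))


for as_of_date in sorted(appointments):
    print(as_of_date,
          sorted(early_warning_cohort(as_of_date)),
          sorted(appointment_cohort(as_of_date)))
```

Using a set for each cohort also covers the "list -> set" point: a duplicate booking does not put the same entity in the cohort twice.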

nanounanue · 2019-01-28T16:07:50Z

docs/sources/experiments/cohort-labels.md

+
+## Temporal Validation Refresher
+
+Triage uses temporal validation to select models because the real-world problems that Triage is built for tend to have a strong temporal correlation.  Picking a date range to train on and a date range afterwards to test on ensures that we don't leak data from the future into our models that wouldn't be available in a real-world deployment scenario. Because of this, we often talk in Triage about the *as-of-date*: all models trained by Triage are associated with an *as-of-date*, which means that all the data that goes into the model **is only included if it was known about before that date**. For more on temporal validation, see the [relevant section in Dirty Duck](https://dssg.github.io/dirtyduck/#sec-4-2-2-1).


I know that it is the standard narrative, but "temporal correlation" is confusing: Triage focuses on systems that evolve/change over time, whether or not there is a temporal correlation.

nanounanue · 2019-01-28T16:10:00Z

docs/sources/experiments/cohort-labels.md

+
+## Temporal Validation Refresher
+
+Triage uses temporal validation to select models because the real-world problems that Triage is built for tend to have a strong temporal correlation.  Picking a date range to train on and a date range afterwards to test on ensures that we don't leak data from the future into our models that wouldn't be available in a real-world deployment scenario. Because of this, we often talk in Triage about the *as-of-date*: all models trained by Triage are associated with an *as-of-date*, which means that all the data that goes into the model **is only included if it was known about before that date**. For more on temporal validation, see the [relevant section in Dirty Duck](https://dssg.github.io/dirtyduck/#sec-4-2-2-1).


The models are associated with an as-of-date, but every row inside the matrices is linked to one of a set of as-of-dates (including the one linked to the model).

I'm also trying to figure out how to phrase this here in the doc without going down a rabbit hole. I'll get back to you.
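
One possible way to picture the per-row as-of-dates, as a rough sketch only: the event log and the `event_count` feature below are hypothetical and do not reflect how Triage builds matrices internally. Each row is keyed by entity and as-of-date, and its feature is computed only from events known before that row's own date.

```python
from datetime import date

# Hypothetical event log: (entity_id, date the event became known).
events = [
    (44, date(2015, 12, 15)),
    (44, date(2016, 1, 20)),
    (60, date(2016, 2, 10)),
]

as_of_dates = [date(2016, 1, 1), date(2016, 2, 1), date(2016, 3, 1)]


def event_count(entity_id, as_of_date):
    """Count only events known strictly before the row's as-of-date."""
    return sum(1 for eid, known in events
               if eid == entity_id and known < as_of_date)


# One row per (entity_id, as_of_date); each row sees a different slice of history.
rows = [(eid, d, event_count(eid, d)) for d in as_of_dates for eid in (44, 60)]
for row in rows:
    print(row)
```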

nanounanue · 2019-01-28T16:17:47Z

docs/sources/experiments/cohort-labels.md

+44 | 2016-03-01 | True
+60 | 2016-03-01 | True
+
+Above we define three total cohorts, on `2016-01-01`, `2016-02-01`, and `2016-03-01`. The first two cohorts have two entities each and the last one has a new third entity. For the first cohort, only one of the entities has an explicitly defined label (meaning the label query didn't return anything for them on that date).


"Above we observe three total cohorts"

nanounanue · 2019-01-28T16:21:58Z

docs/sources/experiments/cohort-labels.md

+
+The final contents of the matrix, however, depend on the `include_missing_labels_in_train_as` setting.
+
+### Inspections-Style (don't include missing labels)


This is confusing, because we are including those rows as NULL... Is it possible to phrase this differently?

Yeah, I get the confusion. I don't know what the right phrasing is but I'll come up with something.
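
For what it's worth, the distinction might be easier to phrase next to a small sketch. The one below assumes that the cohort is left-joined to the labels (so a missing label shows up as NULL), and that leaving `include_missing_labels_in_train_as` unset drops those rows from the train matrix while `True` or `False` fills them in. The `train_rows` helper and its data are hypothetical, not Triage code.

```python
from datetime import date

# Hypothetical cohort and label query results, keyed by (entity_id, as_of_date).
cohort = [(44, date(2016, 1, 1)), (60, date(2016, 1, 1))]
labels = {(44, date(2016, 1, 1)): True}  # the label query returned nothing for 60


def train_rows(cohort, labels, include_missing_as=None):
    """Left-join the cohort to labels, then drop or impute missing labels.

    include_missing_as=None drops rows without a label (inspections-style);
    True or False fills the missing label with that value instead.
    """
    rows = []
    for key in cohort:
        label = labels.get(key)  # None plays the role of SQL NULL
        if label is None:
            if include_missing_as is None:
                continue
            label = include_missing_as
        rows.append((*key, label))
    return rows


print(train_rows(cohort, labels))
print(train_rows(cohort, labels, include_missing_as=False))
```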

nanounanue · 2019-01-28T16:35:11Z

docs/sources/experiments/cohort-labels.md

+experiment.generate_matrices()
+```
+
+The matrix generation process will run all of the cohort/label/feature generation above, and then save matrices to your project_path's `matrices` directory. By default, these are CSVs and should have a few columns: 'entity_id', 'date', 'test_1_sum', and 'failed_inspection'. The 'entity_id' and 'date' columns represent the index of this matrix, and 'failed_inspection' is the label.  Each of these CSV files has a YAML file starting with the same hash representing metadata about that matrix. If you want to look for just the train matrices to inspect the results of the `include_missing_labels_in_train_as` flag, try this command (assuming you can use bash):


It would be nice to include a few sentences about how the matrix name is created (i.e., what it depends on) and why it is stable.

I could add this, but is this in scope for a document that's focused on cohorts and labels? The only reason I even added anything about matrices is that that's where the joins and the label imputation are visible.

The 'Restarting an Experiment' section of the running document does cover this, though it could go into more detail for sure.

https://github.com/dssg/triage/blob/master/docs/sources/experiments/running.md#restarting-an-experiment

Is that a better place to put those details or do you think this cohort/label document makes more sense?

You are right :)

nanounanue · 2019-01-28T16:35:39Z

docs/sources/experiments/cohort-labels.md

+
+## Wrapup
+
+Cohorts and Labels require a lot of care to define correctly as they constitute a large part of the problem framing. Even if you leave all of your feature generation the same, you can completely change the problem you're modeling by changing the label and cohort. Testing your cohort and label config can give you confidence that you're framing the problem the way you expect.


I really liked this paragraph! 👍

shaycrk · 2019-02-07T04:02:47Z

docs/sources/experiments/cohort-labels.md

+
+**A cohort is the population used used for modeling on a given as-of-date**. This is expressed as a list of *entities*. An entity is simply the object of prediction, such as a facility to inspect or a patient coming in for a visit. Early warning systems tend to include their entire population (or at least a large subset of it) in the cohort at any given date, while appointment-based problems may only include in a date's cohort the people who are scheduled for an appointment on that date.
+
+**A label is the binary target variable for a member of the cohort at a given as-of-date and a given label timespan.** For instance, in an inspection prioritization problem the question being asked may be 'what facilities are at high risk of having a failed inspection in the next 6 months?' For this problem, the `label_timespan` is 6 months. There may be multiple label timespans tested in the same experiment, in which case there could be multiple labels for an entity and date.


Could also be multiple label definitions for the same entity, date, and timespan (e.g., any inspection failures vs failures with serious issues). Worth calling that out?
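
A minimal sketch of how the label timespan changes the label for the same entity and date, assuming made-up inspection records and using `timedelta` day counts as a stand-in for Triage's own interval strings. The `label` function is hypothetical; a second label definition (e.g., serious failures only) would just be another function of the same shape.

```python
from datetime import date, timedelta

# Hypothetical inspection results: (entity_id, inspection_date, failed).
inspections = [
    (44, date(2016, 2, 10), False),
    (44, date(2016, 5, 20), True),
    (60, date(2016, 8, 1), True),
]


def label(entity_id, as_of_date, label_timespan):
    """True if any inspection in [as_of_date, as_of_date + timespan) failed.

    Returns None when the entity had no inspection in the window, since
    there is no outcome to observe.
    """
    end = as_of_date + label_timespan
    outcomes = [failed for eid, day, failed in inspections
                if eid == entity_id and as_of_date <= day < end]
    return any(outcomes) if outcomes else None


as_of_date = date(2016, 1, 1)
for timespan in (timedelta(days=90), timedelta(days=180)):
    print(timespan.days,
          label(44, as_of_date, timespan),
          label(60, as_of_date, timespan))
```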

shaycrk · 2019-02-07T04:10:23Z

docs/sources/experiments/cohort-labels.md

+
+### Note 1
+
+The as_of_date is parsed as a timestamp in the database. This timestamp defaults to **the beginning of the date in question**. It's important to consider how this is used for feature generation. Features are only included if they are known about **before this timestamp**. So features will be only included for an as_of_date if they are known about **before that as_of_date**. If you want to work around this (e.g for visit-level problems in which you want to intake data **on the day of the visit and make predictions using that data the same day**), you can move your cohort up a day.


maybe just "midnight" rather than "the beginning of the date"? Also, probably worth calling out timezones here as a potential gotcha -- I think everything is expected to be a timestamp without a timezone, right? Likewise, seems worth noting that day is the smallest unit of time Triage can currently handle (I think that's correct?)

Well, we haven't fully considered what happens if the user puts a <1 day unit, like "5hour" in a timechop field. I'm kind of assuming something will break, but I'm trying it in dirty duck right now.
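
To make the midnight behavior concrete, here is a small sketch under two assumptions: timestamps are naive (no timezone), and "move your cohort up a day" means using the following day as the as-of-date. The helper names are hypothetical and only mimic the comparison described in the note.

```python
from datetime import date, datetime, time, timedelta


def as_of_timestamp(as_of_date):
    """Interpret a date as a naive timestamp at midnight (start of day)."""
    return datetime.combine(as_of_date, time.min)


def known_before(event_time, as_of_date):
    """A feature event counts only if it is strictly before the timestamp."""
    return event_time < as_of_timestamp(as_of_date)


visit_day = date(2016, 3, 1)
intake = datetime(2016, 3, 1, 9, 30)  # data collected the morning of the visit

print(known_before(intake, visit_day))                      # same-day data is excluded
print(known_before(intake, visit_day + timedelta(days=1)))  # shifted as-of-date includes it
```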

shaycrk · 2019-02-07T04:12:41Z

docs/sources/experiments/cohort-labels.md

+Triage expects the cohort to be a unique list of entity ids. Throughout the cohort example queries you will see `distinct(entity_id)` used to ensure this.
+
+### Example: Inspections
+Let's say I am prioritizing the inspection of restaurants. One simple definition of a cohort for restaurant inspection would be to include *any restaurants that have active permits in the last year* in the cohort. Assume that these permits are contained in a table, named `permits`, with the facility's id, a start date, and an end date of the permit.


probably with the restaurant's id rather than facility's id?

It is a facility, not a restaurant. That is the standard nomenclature of the Chicago Food Inspections (because you can actually inspect things that are not restaurants).

@thcrock @shaycrk

I changed the copy to switch back to 'facility' but also introduce the definition of a food service facility as being restaurants plus other things
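
For reference, the "active permits in the last year" cohort can be tried out on toy data. This sketch makes several assumptions: it uses an in-memory SQLite database rather than the project's real database, the column names `entity_id`, `start_date`, and `end_date` are guesses at the `permits` table layout, and the date arithmetic is SQLite syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table permits (entity_id integer, start_date text, end_date text)"
)
conn.executemany(
    "insert into permits values (?, ?, ?)",
    [
        (1, "2014-01-01", "2014-12-31"),  # expired more than a year before
        (2, "2015-06-01", "2016-05-31"),  # active on the as-of-date
        (2, "2014-06-01", "2015-05-31"),  # earlier permit for the same facility
        (3, "2015-01-01", "2015-03-31"),  # lapsed, but within the last year
        (4, "2016-02-01", "2017-01-31"),  # starts after the as-of-date
    ],
)

# A permit counts if it started before the as-of-date and ended no earlier
# than one year before it; distinct keeps each facility in the cohort once.
COHORT_QUERY = """
    select distinct entity_id
    from permits
    where start_date < :as_of_date
      and end_date >= date(:as_of_date, '-1 year')
    order by entity_id
"""


def cohort(as_of_date):
    rows = conn.execute(COHORT_QUERY, {"as_of_date": as_of_date})
    return [entity_id for (entity_id,) in rows]


print(cohort("2016-01-01"))
```

Facility 2 has two qualifying permits but appears once, which is the uniqueness the doc's `distinct(entity_id)` is there to guarantee.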

shaycrk · 2019-02-07T16:22:17Z

docs/sources/experiments/cohort-labels.md

+    group by entity_id
+  include_missing_labels_in_train_as: False
+  name: 'diabetes'
+```


Might be a good example for a comment about being thoughtful about your label definitions, since an alternative way to think about this problem would be something more similar to the inspections problem (e.g., you would need to run an A1C test to either diagnose someone with diabetes or positively conclude they don't have it, so anyone who hasn't been tested would be a null in that case...)

shaycrk · 2019-02-07T16:41:45Z

Looks pretty good to me overall -- made a few comments inline.

My only other general question might be whether there are additional gotchas/common mistakes people run into that are worth calling out here? (I'm thinking of things like understanding how your cohort changes over time, large label timespans that don't give enough space to do many splits, interactions between time window boundaries and your cuts - like if all your enrollment events happen on the same day, etc)

shaycrk

made a few comments -- feel free to incorporate or reject as you see fit and merge whenever you're good with it!

nanounanue · 2019-02-07T20:29:54Z

Yeah, I agree, just see my comment about restaurants vs facilities

thcrock · 2019-02-11T22:42:28Z

"My only other general question might be whether there are additional gotchas/common mistakes people run into that are worth calling out here? (I'm thinking of things like understanding how your cohort changes over time, large label timespans that don't give enough space to do many splits, interactions between time window boundaries and your cuts - like if all your enrollment events happen on the same day, etc)"

Some of these could be useful here, and some of these may be better to cover in the in-progress temporal deep dive. I'm also probably not qualified to write any of these 'additional gotchas' so I'll leave that for others to add to this doc in another PR.

Cohort and Label Deep Dive [Resolves #492]

70d2781

- Add Cohort/Label Deep Dive to documentation site - Add TOC to mkdocs config to allow references within cohort/label doc

nanounanue requested changes Jan 28, 2019

thcrock added 2 commits January 28, 2019 17:59

First round of changes from review

cbd8ee3

Trying to explain multiple as-of-dates

d5a8832

thcrock assigned nanounanue and shaycrk and unassigned nanounanue Feb 5, 2019

shaycrk reviewed Feb 7, 2019

shaycrk assigned thcrock and unassigned shaycrk Feb 7, 2019

shaycrk approved these changes Feb 7, 2019

Partial changes from review

e8bcceb

thcrock added 3 commits February 7, 2019 14:46

wip

cc7ffd5

Fix nomenclature of restaurant/facility

39abf51

Add note about diabetes inspections

2a7ad83

thcrock merged commit 3af1fed into master Feb 11, 2019

thcrock deleted the cohort_label_doc branch February 11, 2019 22:43
