Cohort and Label Deep Dive [Resolves #492] #577

Merged
merged 7 commits into from
Feb 11, 2019
@thcrock thcrock commented Jan 25, 2019

  • Add Cohort/Label Deep Dive to documentation site
  • Add TOC to mkdocs config to allow references within cohort/label doc

codecov-io commented Jan 25, 2019

Codecov Report

Merging #577 into master will increase coverage by 0.33%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #577      +/-   ##
==========================================
+ Coverage   82.49%   82.83%   +0.33%     
==========================================
  Files          83       84       +1     
  Lines        4833     5173     +340     
==========================================
+ Hits         3987     4285     +298     
- Misses        846      888      +42
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/triage/util/db.py | 100% <0%> (ø) | ⬆️ |
| ...c/versions/0bca1ba9706e_add_matrix_uuid_to_eval.py | 0% <0%> (ø) | |
| src/triage/component/results_schema/schema.py | 97.72% <0%> (+0.12%) | ⬆️ |
| src/triage/component/catwalk/evaluation.py | 97.54% <0%> (+0.12%) | ⬆️ |
| src/triage/component/catwalk/storage.py | 92.88% <0%> (+0.48%) | ⬆️ |
| src/triage/experiments/base.py | 97.47% <0%> (+1.41%) | ⬆️ |
| src/triage/experiments/validate.py | 78.57% <0%> (+2.57%) | ⬆️ |
| src/triage/validation_primitives.py | 74.28% <0%> (+3.31%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b8b56f1...2a7ad83.


This document assumes that the reader is familiar with the concept of a machine learning target variable and will focus on explaining what is unique to Triage.

**A cohort is the population used for modeling on a given as-of-date**. This is expressed as a list of *entities*. An entity is simply the object of prediction, such as a facility to inspect or a patient coming in for a visit. Early warning systems tend to include their entire population (or at least a large subset of it) in the cohort at any given date, while appointment-based problems may only include in a date's cohort the people who are scheduled for an appointment on that date.

Contributor

list -> set

Contributor

"Early warning systems tend to include their entire population (or at least a large subset of it)" -> "EWS tend to include their entire population"

Contributor Author

I was trying to make the DDJ-type early warning cohorts fit here. The cohorts are only sometimes the entire population, but can sometimes be very large subsets like 'people who have ever been arrested'. I wanted a sentence that draws a distinction between visit-level problems that have a small per-date cohort and everything else.

Contributor

Might be illustrative to give a quick example of each case. I think we have a good handle on the difference between those two types of problems, but it will probably be less familiar to external readers.
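
As an illustration of the two cases discussed in this thread, here is a minimal sketch in plain Python. The entity ids, the `appointments` mapping, and both function names are hypothetical and are not part of Triage; in Triage itself the cohort would come from a SQL query in the experiment config.

```python
from datetime import date

# Hypothetical toy data: every known entity, plus appointments keyed by date.
population = {1, 2, 3, 4, 5}
appointments = {
    date(2016, 1, 1): [2, 5],
    date(2016, 2, 1): [1, 2, 2],  # entity 2 happens to be booked twice
}


def early_warning_cohort(as_of_date):
    """Early-warning style: (nearly) the whole population on every date."""
    return set(population)


def appointment_cohort(as_of_date):
    """Appointment style: only entities scheduled on that exact date."""
    return set(appointments.get(as_of_date, []))
```

In this toy setup the early-warning cohort is the same on every date, while the appointment cohort is small and changes from one as-of-date to the next (and is empty on dates with no appointments).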


## Temporal Validation Refresher

Triage uses temporal validation to select models because the real-world problems that Triage is built for tend to have a strong temporal correlation. Picking a date range to train on and a date range afterwards to test on ensures that we don't leak data from the future into our models that wouldn't be available in a real-world deployment scenario. Because of this, we often talk in Triage about the *as-of-date*: all models trained by Triage are associated with an *as-of-date*, which means that all the data that goes into the model **is only included if it was known about before that date**. For more on temporal validation, see the [relevant section in Dirty Duck](https://dssg.github.io/dirtyduck/#sec-4-2-2-1).

Contributor

I know that it is the standard narrative, but "temporal correlation" is confusing: Triage focuses on systems that evolve/change over time, whether or not there is a temporal correlation.


Contributor

The models are associated with an as-of-date, but every row inside the matrices is linked to a set of as-of-dates (including the one linked to the model).

Contributor Author

I'm also trying to figure out how to phrase this here in the doc without going down a rabbit hole. I'll get back to you.
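
To make the "known about before that date" rule from the Temporal Validation Refresher concrete, here is a rough sketch in plain Python. It is not Triage's actual time-splitting logic; the event tuples and the function names `known_before` and `temporal_split` are assumptions made up for illustration.

```python
from datetime import date

# Hypothetical events: (entity_id, knowledge_date, value).
events = [
    (1, date(2015, 6, 1), 0.2),
    (1, date(2016, 2, 15), 0.9),
    (2, date(2015, 12, 31), 0.4),
    (2, date(2016, 1, 1), 0.7),
]


def known_before(events, as_of_date):
    """Keep only events known strictly before the as-of-date."""
    return [e for e in events if e[1] < as_of_date]


def temporal_split(events, train_end, test_end):
    """Train on events before train_end; test on [train_end, test_end)."""
    train = known_before(events, train_end)
    test = [e for e in events if train_end <= e[1] < test_end]
    return train, test
```

Because the training side only ever sees events dated before `train_end`, nothing from the test range can leak into it.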

44 | 2016-03-01 | True
60 | 2016-03-01 | True

Above we define three total cohorts, on `2016-01-01`, `2016-02-01`, and `2016-03-01`. The first two cohorts have two entities each and the last one has a new third entity. For the first cohort, only one of the entities has an explicitly defined label (meaning the label query didn't return anything for the other entity on that date).

Contributor

"Above we observe three total cohorts"


The final contents of the matrix, however, depend on the `include_missing_labels_in_train_as` setting.
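
The effect of the setting can be sketched in plain Python. This is an illustration under stated assumptions, not Triage's implementation: it assumes that leaving `include_missing_labels_in_train_as` unset keeps unlabeled rows out of the train matrix, and that setting it to `True` or `False` fills missing labels with that value. The entity ids, the data, and the function `build_train_rows` are hypothetical.

```python
from datetime import date

# Hypothetical cohort rows: (entity_id, as_of_date).
cohort = [
    (101, date(2016, 1, 1)),
    (102, date(2016, 1, 1)),
]
# Hypothetical label query results; entity 102 has no row on this date.
labels = {(101, date(2016, 1, 1)): True}


def build_train_rows(cohort, labels, include_missing_labels_in_train_as=None):
    """Left-join labels onto the cohort, then apply the missing-label policy.

    None -> assumed behavior: rows without a label are left out.
    bool -> rows without a label are filled in with that value.
    """
    rows = []
    for key in cohort:
        label = labels.get(key)  # None when the label query returned nothing
        if label is None:
            if include_missing_labels_in_train_as is None:
                continue
            label = include_missing_labels_in_train_as
        rows.append((*key, label))
    return rows
```

With the setting left unset, only entity 101 survives; with it set to `False`, entity 102 is kept with a label of `False`.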

### Inspections-Style (don't include missing labels)

Contributor

This is confusing, because we are including those rows as NULL... Is it possible to phrase this differently?

Contributor Author

Yeah, I get the confusion. I don't know what the right phrasing is, but I'll come up with something.

experiment.generate_matrices()
```

The matrix generation process will run all of the cohort/label/feature generation above, and then save matrices to your project_path's `matrices` directory. By default, these are CSVs and should have a few columns: 'entity_id', 'date', 'test_1_sum', and 'failed_inspection'. The 'entity_id' and 'date' columns represent the index of this matrix, and 'failed_inspection' is the label. Each of these CSV files has a companion YAML file, starting with the same hash, that holds metadata about that matrix. If you want to look for just the train matrices to inspect the results of the `include_missing_labels_in_train_as` flag, try this command (assuming you can use bash):

Contributor

It will be nice to include some sentences about how the matrix name is created (i.e. their dependencies) and how this will be stable.

Contributor Author

I could add this, but is this in scope for a document that's focused on cohorts and labels? The only reason I even added anything about matrices is that that's where the joins and the label imputation is visible.

The 'Restarting an Experiment' section of the running document does cover this, though it could go into more detail for sure.

https://github.com/dssg/triage/blob/master/docs/sources/experiments/running.md#restarting-an-experiment

Is that a better place to put those details or do you think this cohort/label document makes more sense?

Contributor

You are right :)
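
For readers who would rather inspect the matrices from Python than from a shell, here is a hedged sketch of the same idea. It assumes that each matrix's metadata file has a `.yaml` extension and contains a line `matrix_type: train`; neither detail is confirmed in this thread, and both helper functions are hypothetical.

```python
import csv
from pathlib import Path


def train_matrix_paths(matrices_dir):
    """Yield CSV paths whose sibling YAML metadata marks a train matrix.

    Assumption: the metadata contains a line `matrix_type: train`.
    """
    for yaml_path in sorted(Path(matrices_dir).glob("*.yaml")):
        lines = yaml_path.read_text().splitlines()
        if any(line.strip() == "matrix_type: train" for line in lines):
            yield yaml_path.with_suffix(".csv")


def count_missing_labels(csv_path, label_column="failed_inspection"):
    """Count rows whose label cell is empty in one matrix CSV."""
    with open(csv_path, newline="") as f:
        return sum(1 for row in csv.DictReader(f) if row[label_column] == "")
```

Running `count_missing_labels` over each path from `train_matrix_paths` gives a quick check of whether missing labels were left empty or filled in.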


## Wrapup

Cohorts and Labels require a lot of care to define correctly as they constitute a large part of the problem framing. Even if you leave all of your feature generation the same, you can completely change the problem you're modeling by changing the label and cohort. Testing your cohort and label config can give you confidence that you're framing the problem the way you expect.

Contributor

I really liked this paragraph! 👍

@thcrock thcrock assigned nanounanue and shaycrk and unassigned nanounanue Feb 5, 2019

**A label is the binary target variable for a member of the cohort at a given as-of-date and a given label timespan.** For instance, in an inspection prioritization problem the question being asked may be 'what facilities are at high risk of having a failed inspection in the next 6 months?' For this problem, the `label_timespan` is 6 months. There may be multiple label timespans tested in the same experiment, in which case there could be multiple labels for an entity and date.

Contributor

Could also be multiple label definitions for the same entity, date, and timespan (e.g., any inspection failures vs failures with serious issues). Worth calling that out?
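
The label definition under discussion can be sketched as a small function. This is an illustration only: the inspection data is made up, the window boundaries (`as_of_date` inclusive, end exclusive) are an assumption, and six months is approximated as 182 days to stay within the standard library.

```python
from datetime import date, timedelta

# Hypothetical inspection outcomes: (entity_id, inspection_date, failed).
inspections = [
    (1, date(2016, 2, 10), True),
    (1, date(2016, 9, 1), False),
    (2, date(2016, 8, 20), True),
]


def label_for(entity_id, as_of_date, label_timespan):
    """True/False if the entity was inspected in the window, else None.

    The window is [as_of_date, as_of_date + label_timespan).
    """
    end = as_of_date + label_timespan
    outcomes = [
        failed
        for eid, when, failed in inspections
        if eid == entity_id and as_of_date <= when < end
    ]
    return any(outcomes) if outcomes else None
```

The same entity and date can get different labels under different timespans: here entity 2 has no label over roughly six months from `2016-01-01`, but a positive label over a full year. A second label definition (for example, failures with serious issues only) would simply be another function of the same shape.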


### Note 1

The as_of_date is parsed as a timestamp in the database. This timestamp defaults to **the beginning of the date in question**. It's important to consider how this is used for feature generation: features are only included for an as_of_date if they are known about **before this timestamp**. If you want to work around this (e.g., for visit-level problems in which you want to intake data **on the day of the visit and make predictions using that data the same day**), you can move your cohort up a day.

Contributor

maybe just "midnight" rather than "the beginning of the date"? Also, probably worth calling out timezones here as a potential gotcha -- I think everything is expected to be a timestamp without a timezone, right? Likewise, seems worth noting that day is the smallest unit of time Triage can currently handle (I think that's correct?)

Contributor Author

Well, we haven't fully considered what happens if the user puts a <1 day unit, like "5hour" in a timechop field. I'm kind of assuming something will break, but I'm trying it in dirty duck right now.
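
The boundary behavior described in Note 1 can be illustrated with the standard library alone. This is a sketch, not Triage code: the helper names are hypothetical, timestamps are assumed to carry no timezone, and "move your cohort up a day" is read here as using the following day as the as_of_date.

```python
from datetime import date, datetime, time, timedelta


def as_of_timestamp(as_of_date):
    """Midnight at the start of the as-of-date (no timezone)."""
    return datetime.combine(as_of_date, time.min)


def is_usable(event_time, as_of_date):
    """An event feeds features only if it is known before the timestamp."""
    return event_time < as_of_timestamp(as_of_date)


visit_day = date(2016, 3, 1)
intake = datetime(2016, 3, 1, 9, 30)  # data collected the morning of the visit
```

Under these assumptions, `is_usable(intake, visit_day)` is false because 09:30 falls after midnight on the visit day, while `is_usable(intake, visit_day + timedelta(days=1))` is true.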

Triage expects the cohort to be a unique list of entity ids. Throughout the cohort example queries you will see `distinct(entity_id)` used to ensure this.

### Example: Inspections
Let's say I am prioritizing the inspection of restaurants. One simple definition of a cohort for restaurant inspection would be to include *any restaurants that have active permits in the last year* in the cohort. Assume that these permits are contained in a table, named `permits`, with the facility's id, a start date, and an end date of the permit.

Contributor

probably with the restaurant's id rather than facility's id?

Contributor

It's a facility, not a restaurant. That is the standard nomenclature of the Chicago Food Inspections (because you can actually inspect things that are not restaurants).

Contributor Author

I changed the copy to switch back to 'facility' but also introduce the definition of a food service facility as being restaurants plus other things
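
The cohort definition from the Inspections example ("any facilities that have active permits in the last year") can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the column names `entity_id`, `start_date`, and `end_date`, the sample rows, and the query itself are hypothetical, and a real Triage cohort query would run against the project's own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE permits (entity_id INTEGER, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO permits VALUES (?, ?, ?)",
    [
        (1, "2014-01-01", "2014-12-31"),  # expired more than a year ago
        (2, "2015-01-01", "2015-05-31"),  # ended within the last year
        (2, "2015-06-01", "2016-05-31"),  # renewal, still active
        (3, "2015-01-01", "2015-03-31"),  # ended within the last year
    ],
)

# A permit counts if its interval overlaps the year before the as-of-date.
COHORT_QUERY = """
    SELECT DISTINCT entity_id
    FROM permits
    WHERE start_date < :as_of_date
      AND end_date >= date(:as_of_date, '-1 year')
    ORDER BY entity_id
"""


def cohort(as_of_date):
    """Return the unique entity ids in the cohort for one as-of-date."""
    rows = conn.execute(COHORT_QUERY, {"as_of_date": as_of_date})
    return [entity_id for (entity_id,) in rows]
```

Entity 2 matches through two permits but appears only once, which is the uniqueness that the `distinct` in the example queries is there to guarantee.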

group by entity_id
include_missing_labels_in_train_as: False
name: 'diabetes'
```

Contributor

Might be a good example for a comment about being thoughtful about your label definitions since an alternative way to think about this problem would be something more similar to the inspections problem (e.g., you would need to run an A1C test to either diagnose someone with diabetes or positively conclude they don't have it, so anyone who hasn't been tested would be a null in that case...)

shaycrk commented Feb 7, 2019

Looks pretty good to me overall -- made a few comments inline.

My only other general question might be whether there are additional gotchas/common mistakes people run into that are worth calling out here? (I'm thinking of things like understanding how your cohort changes over time, large label timespans that don't give enough space to do many splits, interactions between time window boundaries and your cuts - like if all your enrollment events happen on the same day, etc)

@shaycrk shaycrk assigned thcrock and unassigned shaycrk Feb 7, 2019

@shaycrk shaycrk left a comment

made a few comments -- feel free to incorporate or reject as you see fit and merge whenever you're good with it!

@nanounanue

Yeah, I agree, just see my comment about restaurants vs facilities

thcrock commented Feb 11, 2019

"My only other general question might be whether there are additional gotchas/common mistakes people run into that are worth calling out here? (I'm thinking of things like understanding how your cohort changes over time, large label timespans that don't give enough space to do many splits, interactions between time window boundaries and your cuts - like if all your enrollment events happen on the same day, etc)"

Some of these could be useful here, and some of these may be better to cover in the in-progress temporal deep dive. I'm also probably not qualified to write any of these 'additional gotchas' so I'll leave that for others to add to this doc in another PR.

@thcrock thcrock merged commit 3af1fed into master Feb 11, 2019
@thcrock thcrock deleted the cohort_label_doc branch February 11, 2019 22:43
4 participants