Namespace cohort and labels tables by their config [Resolves #574] #576

Merged: 1 commit, Feb 1, 2019
38 changes: 12 additions & 26 deletions docs/sources/experiments/algorithm.md
@@ -56,39 +56,25 @@

This binary labels table is scoped to the entire Experiment, so all `as_of_time` and
`label_timespan` (taken straight from `temporal_config`) combinations are present. Additionally, the 'label_name' and
'label_type' are also recorded with each row in the table.

+The name of the labels table is based on both the name of the label and a hash of the label query (e.g. `labels_failedviolation_a0b1c2d3`), so any prior experiments that shared both the name and query will be able to reuse the labels table. If the 'replace' flag is not set, the labels table is queried for each `as_of_time` and `label_timespan` combination to check whether any matching rows exist. If any such rows exist, the label query for that date and timespan is not run.
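For illustration, here is a minimal sketch of how such a namespaced table name can be derived. The real `filename_friendly_hash` helper is imported from triage in `base.py` further down this diff; the stand-in below only approximates it, and the config values are hypothetical:

```
import hashlib
import json

def filename_friendly_hash(inputs):
    # Stand-in for triage's helper: deterministically hash config content
    # into a short hex digest that is safe to embed in a table name.
    return hashlib.md5(
        json.dumps(inputs, sort_keys=True).encode("utf-8")
    ).hexdigest()[:8]

label_config = {"name": "failedviolation", "query": "select entity_id ..."}
labels_table_name = "labels_{}_{}".format(
    label_config.get("name", "default"),
    filename_friendly_hash(label_config["query"]),
)
# -> e.g. 'labels_failedviolation_a0b1c2d3': an identical name and query in a
#    later experiment yields the identical table name, enabling reuse.
```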

At this point, the 'labels' table may not have entries for all entities and dates that need to be in a given matrix.
How these rows have their labels represented is up to the configured `include_missing_labels_in_train_as` value in the
experiment. This value is not processed when we generate the labels table, but later on when the matrix is built (see
'Retrieving Data and Saving Completed Matrix').
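As a concrete illustration, a label configuration of the shape described above might look like the following. The table and column names are hypothetical; only the `name`, `query`, and `include_missing_labels_in_train_as` keys reflect the documented behavior:

```
# Hypothetical label_config. The query is run once per as_of_date /
# label_timespan pair, with the bracketed placeholders filled in by the
# Experiment, and must return entity_id/outcome rows.
label_config = {
    "name": "failedviolation",
    "query": """
        select entity_id,
               bool_or(outcome = 'fail')::integer as outcome
        from semantic.events
        where outcome_date >= '{as_of_date}'::date
          and outcome_date < '{as_of_date}'::date + interval '{label_timespan}'
        group by entity_id
    """,
    # Entities present in a matrix but absent from the labels table get this
    # value at matrix-build time, not at label-generation time.
    "include_missing_labels_in_train_as": False,
}
```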


-### State Table
-
-The Experiment keeps track of the state of the entities. Based on the configuration, certain entities can be included
-or excluded for different time periods in feature imputation, creating matrices, or both.
-
-In code, it does this by computing what it calls the 'sparse' state table for an experiment. This is a table with a
-boolean flag entry for every entity, as_of_time, and state. The structure of this table allows for state filtering
-based on SQL conditions given by the user.
-
-Based on configuration, it can be created through one of three code paths:
+### Cohort Table

-1. If the user passes what we call a 'dense states' table, with the following structure: entity id/state/start/end, and
-a list of state filters, this 'dense states' table holds time ranges that entities were in for specific states.
-When converting this to a sparse table, we take each as_of_time present in the Experiment, and for each known state
-(that is, the distinct values found in the 'dense states' table), see if there is any entry in the dense states table
-with this state whose range overlaps with the as_of_time. If so, the entity is considered to be in that state as of that
-date.
+The Experiment keeps track of which entities are in the cohort on any given date. Similarly to the labels table, the experiment populates a cohort table using the following input:

-2. If the user passes what we call an 'entities' table, containing an entity_id, it will simply use all distinct
-entities present in said table, and mark them as 'active' for every as_of_time in the experiment. Any other columns are
-ignored.
+1. A query, provided by the user in the configuration file, that generates entity_ids for a given as_of_date.

-3. If the user passes a query, parameterized with an as of date, we will populate the table by running it for each
-as_of_date.
+2. Each as_of_date as defined in the temporal config.

-This table is created and exists until matrices are built, at which point it is considered unnecessary and then dropped.
+This cohort table is scoped to the entire Experiment, so all `as_of_times` (computed in step 1) are present.

+The name of the cohort table is based on both the name of the cohort and a hash of the cohort query (e.g. `cohort_permitted_a0b1c2d3`), so any prior experiments that shared both the name and query will be able to reuse the cohort table. If the 'replace' flag is not set, the cohort table is queried for each `as_of_time` to check whether any matching rows exist. If any such rows exist, the cohort query for that date is not run.
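For illustration, a matching cohort configuration might look like this. Table and column names are hypothetical; `{as_of_date}` is the placeholder the Experiment substitutes for each date:

```
# Hypothetical cohort_config: the query must return the entity_ids that are
# in the cohort as of the substituted '{as_of_date}'.
cohort_config = {
    "name": "permitted",
    "query": """
        select entity_id
        from semantic.permits
        where start_date <= '{as_of_date}'::date
          and (end_date is null or end_date > '{as_of_date}'::date)
    """,
}
```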

### Features

@@ -110,7 +96,7 @@

a column or SQL expression representing a numeric quantity present in the `from_obj`, and a list
of aggregate functions we want to use. The aggregate function is applied to the quantity.
* Each `group` is a column applied to the GROUP BY clause. Generally this is 'entity_id', but higher-level groupings
(for instance, 'zip_code') can be used as long as they can be rolled up to 'entity_id'.
-* By default the query is joined with the cohort table (see 'state table' above) to remove unnecessary rows. If `features_ignore_cohort` is passed to the Experiment this is not done.
+* By default the query is joined with the cohort table to remove unnecessary rows. If `features_ignore_cohort` is passed to the Experiment this is not done.

So a simplified version of a typical query would look like:
```
-- (simplified query elided in the collapsed portion of this diff)
```

@@ -137,7 +123,7 @@

entity-level and zipcode-level aggregates from both tables. This aggregation-level table contains the rows present
in the aggregation, pre-imputation. Its output location is generally `{prefix}_aggregation`
#### Imputing Values
-A table that looks similar, but with imputed values is created. The state table from above is passed into collate as
+A table that looks similar, but with imputed values is created. The cohort table from above is passed into collate as
the comprehensive set of entities and dates for which output should be generated, regardless if they exist in the
`from_obj`. Each feature column has an imputation rule, inherited from some level of the feature definition. The
imputation rules that are based on data (e.g. `mean`) use the rows from the `as_of_time` to produce the imputed value.
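A toy sketch of what a data-based rule like `mean` does. This is illustrative pandas, not collate's actual implementation; the imputation flag triage records is sketched here as an extra column:

```
import pandas as pd

# Two as_of_times, each with one missing feature value.
df = pd.DataFrame({
    "as_of_time": ["2016-01-01", "2016-01-01", "2017-01-01", "2017-01-01"],
    "inspections_count": [2.0, None, 10.0, None],
})

# Flag which rows were imputed, then fill each missing value with the mean
# of the rows sharing the same as_of_time.
df["inspections_count_imp"] = df["inspections_count"].isna().astype(int)
df["inspections_count"] = (
    df.groupby("as_of_time")["inspections_count"]
      .transform(lambda col: col.fillna(col.mean()))
)
# The 2016-01-01 missing value becomes 2.0; the 2017-01-01 one becomes 10.0.
```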
@@ -147,8 +133,8 @@

Its output location is generally `{prefix}_aggregation_imputed`.

At this point, we have at least three tables that are used to populate matrices:

-- `labels` with computed labels
-- `tmp_states_{experiment hash}` that tracks what `as_of_times` each entity was in each state.
+- `labels_{labelname}_{labelqueryhash}` with computed labels for each date
+- `cohort_{cohortname}_{cohortqueryhash}` with the cohort for each date
 - A `features.{prefix}_aggregation_imputed` table for each feature aggregation present in the experiment config.


4 changes: 2 additions & 2 deletions docs/sources/experiments/running.md
@@ -188,8 +188,8 @@

If an experiment fails for any reason, you can restart it.

By default, all work will be recreated. This includes label queries, feature queries, matrix building, model training, etc. However, if you pass the `replace=False` keyword argument, the Experiment will reuse what work it can.

-- Cohort Table: The Experiment keeps a cohort table namespaced by its experiment hash, and within that will check on a per-as-of-date level whether or not there are any existing rows, and skip the cohort query for that date if so. For this reason, it is *not* aware of specific entities or source events so if the source data has changed, you will not want to set `replace` to False. Don't expect too much reuse from this, however, as the table is experiment-namespaced. Essentially, this will only reuse data if the same experiment was run prior and failed part of the way through.
-- Labels Table: The Experiment keeps a labels table namespaced by its experiment hash, and within that will check on a per-`as_of_date`/`label timespan` level whether or not there are *any* existing rows, and skip the label query if so. For this reason, it is *not* aware of specific entities or source events so if the label query has changed or the source data has changed, you will not want to set `replace` to False. Don't expect too much reuse from this, however, as the table is experiment-namespaced. Essentially, this will only reuse data if the same experiment was run prior and failed part of the way through label generation.
+- Cohort Table: The Experiment refers to a cohort table namespaced by the cohort name and a hash of the cohort query, and in that way allows you to reuse cohorts between different experiments if their cohort names and queries are identical. When referring to this table, it will check on a per-as-of-date level whether or not there are any existing rows for that date, and skip the cohort query for that date if so. For this reason, it is *not* aware of specific entities or source events, so if the source data has changed, ensure that `replace` is set to True.
+- Labels Table: The Experiment refers to a labels table namespaced by the label name and a hash of the label query, and in that way allows you to reuse labels between different experiments if their label names and queries are identical. When referring to this table, it will check on a per-`as_of_date`/`label timespan` level whether or not there are *any* existing rows, and skip the label query if so. For this reason, it is *not* aware of specific entities or source events, so if the label query has changed or the source data has changed, ensure that `replace` is set to True.
- Features Tables: The Experiment will check on a per-table basis whether or not it exists and contains rows for the entire cohort, and skip the feature generation if so. It does not look at the column list for the feature table or inspect the feature data itself. So, if you have modified any source data that affects a feature aggregation, or added any columns to that aggregation, you won't want to set `replace` to False. However, it is cohort-and-date aware so you can change around your cohort and temporal configuration safely.
- Matrix Building: Each matrix's metadata is hashed to create a unique id. If a file exists in storage with that hash, it will be reused.
- Model Training: Each model's metadata (which includes its train matrix's hash) is hashed to create a unique id. If a file exists in storage with that hash, it will be reused.
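For example, restarting a failed run while reusing completed work looks like this (a minimal sketch, assuming you still have the same config dict and database engine as the failed run):

```
from triage.experiments import SingleThreadedExperiment

experiment = SingleThreadedExperiment(
    config=config,                         # same config as the failed run
    db_engine=db_engine,                   # engine for the same database
    project_path="/projects/inspections",  # hypothetical storage path
    replace=False,   # reuse cohort/label/feature/matrix/model work
)
experiment.run()
```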
1 change: 1 addition & 0 deletions requirement/test.txt
@@ -7,3 +7,4 @@

pytest<4.0.0 # pyup: ignore
pytest-cov==2.6.0
moto==1.3.7
fakeredis==0.16.0
+hypothesis==4.4.1
2 changes: 1 addition & 1 deletion src/tests/test_experiments.py
@@ -355,7 +355,7 @@ def test_profiling(db_engine):
    populate_source_data(db_engine)
    with TemporaryDirectory() as temp_dir:
        project_path = os.path.join(temp_dir, "inspections")
-        experiment = SingleThreadedExperiment(
+        SingleThreadedExperiment(
            config=sample_config(),
            db_engine=db_engine,
            project_path=project_path,
19 changes: 19 additions & 0 deletions src/tests/test_validation_primitives.py
@@ -0,0 +1,19 @@
+from triage.validation_primitives import string_is_tablesafe
+from hypothesis import given, example
+from hypothesis.strategies import text, characters
+
+
+# test with a variety of strings based on letters and numbers auto-generated by hypothesis
+# and also add a hardcoded example that includes underscores because those are fine
+@given(text(alphabet=characters(whitelist_categories=('Lu', 'Ll', 'Nd')), min_size=1))
+@example('a_valid_name')
+def test_string_is_tablesafe(s):
+    assert string_is_tablesafe(s)
+
+
+# test with a variety of strings based on unsafe characters auto-generated by hypothesis
+# and also add a hardcoded example that should be bad because it has spaces
+@given(text(alphabet='/ "'))
+@example('Spaces are not valid')
+def test_string_is_not_tablesafe(s):
+    assert not string_is_tablesafe(s)
20 changes: 15 additions & 5 deletions src/triage/experiments/base.py
@@ -40,7 +40,8 @@
    associate_models_with_experiment,
    associate_matrices_with_experiment,
    missing_matrix_uuids,
-    missing_model_hashes
+    missing_model_hashes,
+    filename_friendly_hash
)
from triage.component.catwalk.storage import (
    CSVMatrixStore,

@@ -113,8 +114,6 @@ def __init__(
        self.materialize_subquery_fromobjs = materialize_subquery_fromobjs
        self.features_ignore_cohort = features_ignore_cohort
        self.experiment_hash = save_experiment_and_get_hash(self.config, self.db_engine)
-        self.labels_table_name = "labels_{}".format(self.experiment_hash)
-        self.cohort_table_name = "cohort_{}".format(self.experiment_hash)
        self.initialize_components()

        self.cleanup = cleanup
@@ -157,6 +156,10 @@ def initialize_components(self):

        cohort_config = self.config.get("cohort_config", {})
        if "query" in cohort_config:
+            self.cohort_table_name = "cohort_{}_{}".format(
+                cohort_config.get('name', 'default'),
+                filename_friendly_hash(cohort_config['query'])
+            )
            self.cohort_table_generator = CohortTableGenerator(
                cohort_table_name=self.cohort_table_name,
                db_engine=self.db_engine,

@@ -170,16 +173,23 @@

                "or save time by only computing features for that cohort."
            )
            self.features_ignore_cohort = True
+            self.cohort_table_name = "cohort_{}".format(self.experiment_hash)
            self.cohort_table_generator = CohortTableGeneratorNoOp()

if "label_config" in self.config:
label_config = self.config["label_config"]
self.labels_table_name = "labels_{}_{}".format(
label_config.get('name', 'default'),
filename_friendly_hash(label_config['query'])
)
self.label_generator = LabelGenerator(
label_name=self.config["label_config"].get("name", None),
query=self.config["label_config"]["query"],
label_name=label_config.get("name", None),
query=label_config["query"],
replace=self.replace,
db_engine=self.db_engine,
)
else:
self.labels_table_name = "labels_{}".format(self.experiment_hash)
self.label_generator = LabelGeneratorNoOp()
logging.warning(
"label_config missing or unrecognized. Without labels, "
7 changes: 7 additions & 0 deletions src/triage/experiments/validate.py
@@ -10,6 +10,7 @@
from triage.component.timechop import Timechop

from triage.util.conf import convert_str_to_relativedelta
+from triage.validation_primitives import string_is_tablesafe


class Validator(object):
@@ -449,6 +450,9 @@ def _run(self, label_config):
                    key 'query' not found. You must define a label query."""
                )
            )
+        if 'name' in label_config and not string_is_tablesafe(label_config['name']):
+            raise ValueError("Section: label_config - "
+                             "name should only contain letters, numbers, and underscores")
        self._validate_query(label_config["query"])
        self._validate_include_missing_labels_in_train_as(
            label_config.get("include_missing_labels_in_train_as", None)
@@ -475,6 +479,9 @@ def _run(self, cohort_config):
                    {as_of_date} must be present"""
                )
            )
+        if 'name' in cohort_config and not string_is_tablesafe(cohort_config['name']):
+            raise ValueError("Section: cohort_config - "
+                             "name should only contain letters, numbers, and underscores")
        dated_query = query.replace("{as_of_date}", "2016-01-01")
        logging.info("Validating cohort query")
        try:
6 changes: 6 additions & 0 deletions src/triage/validation_primitives.py
@@ -149,3 +149,9 @@ def column_should_be_stringlike(table_name, column, db_engine):
"""
table_should_have_column(table_name, column, db_engine)
column_should_be_in_types(table_name, column, [VARCHAR, TEXT], db_engine)


def string_is_tablesafe(string):
if not string:
return False
return all(c.isalpha() or c.isdigit() or c == '_' for c in string)
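A quick demonstration of the new primitive (the values below are illustrative):

```
from triage.validation_primitives import string_is_tablesafe

assert string_is_tablesafe("failed_violation_2019")  # letters, digits, underscores
assert not string_is_tablesafe("failed violation")   # spaces are rejected
assert not string_is_tablesafe("")                   # empty values are rejected
```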