Namespace cohort and labels tables by their config [Resolves #574] #576

Merged: 1 commit, Feb 1, 2019
38 changes: 12 additions & 26 deletions docs/sources/experiments/algorithm.md
@@ -56,39 +56,25 @@

This binary labels table is scoped to the entire Experiment, so all `as_of_time` and
`label_timespan` (taken straight from `temporal_config`) combinations are present. Additionally, the 'label_name' and
'label_type' are also recorded with each row in the table.

+The name of the labels table is based on both the name of the label and a hash of the label query (e.g. `labels_failedviolation_a0b1c2d3`), so any prior experiments that shared both the name and query will be able to reuse the labels table. If the 'replace' flag is not set, the labels table is queried for each `as_of_time` and `label_timespan` combination to check whether any matching rows exist. If any such rows exist, the label query for that date and timespan is not run.
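For illustration, here is a minimal sketch of how such a namespaced table name can be derived. The real `filename_friendly_hash` helper is imported from triage in `base.py` further down this diff; the stand-in below only approximates it, and the config values are hypothetical:

```
import hashlib
import json

def filename_friendly_hash(inputs):
    # Stand-in for triage's helper: deterministically hash config content
    # into a short hex digest that is safe to embed in a table name.
    return hashlib.md5(
        json.dumps(inputs, sort_keys=True).encode("utf-8")
    ).hexdigest()[:8]

label_config = {"name": "failedviolation", "query": "select entity_id ..."}
labels_table_name = "labels_{}_{}".format(
    label_config.get("name", "default"),
    filename_friendly_hash(label_config["query"]),
)
# -> e.g. 'labels_failedviolation_a0b1c2d3': an identical name and query in a
#    later experiment yields the identical table name, enabling reuse.
```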

At this point, the 'labels' table may not have entries for all entities and dates that need to be in a given matrix.
How these rows have their labels represented is up to the configured `include_missing_labels_in_train_as` value in the
experiment. This value is not processed when we generate the labels table, but later on when the matrix is built (see
'Retrieving Data and Saving Completed Matrix').
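As a concrete illustration, a label configuration of the shape described above might look like the following. The table and column names are hypothetical; only the `name`, `query`, and `include_missing_labels_in_train_as` keys reflect the documented behavior:

```
# Hypothetical label_config. The query is run once per as_of_date /
# label_timespan pair, with the bracketed placeholders filled in by the
# Experiment, and must return entity_id/outcome rows.
label_config = {
    "name": "failedviolation",
    "query": """
        select entity_id,
               bool_or(outcome = 'fail')::integer as outcome
        from semantic.events
        where outcome_date >= '{as_of_date}'::date
          and outcome_date < '{as_of_date}'::date + interval '{label_timespan}'
        group by entity_id
    """,
    # Entities present in a matrix but absent from the labels table get this
    # value at matrix-build time, not at label-generation time.
    "include_missing_labels_in_train_as": False,
}
```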


-### State Table
-
-The Experiment keeps track of the state of the entities. Based on the configuration, certain entities can be included
-or excluded for different time periods in feature imputation, creating matrices, or both.
-
-In code, it does this by computing what it calls the 'sparse' state table for an experiment. This is a table with a
-boolean flag entry for every entity, as_of_time, and state. The structure of this table allows for state filtering
-based on SQL conditions given by the user.
-
-Based on configuration, it can be created through one of three code paths:
+### Cohort Table

-1. If the user passes what we call a 'dense states' table, with the following structure: entity id/state/start/end, and
-a list of state filters, this 'dense states' table holds time ranges that entities were in for specific states.
-When converting this to a sparse table, we take each as_of_time present in the Experiment, and for each known state
-(that is, the distinct values found in the 'dense states' table), see if there is any entry in the dense states table
-with this state whose range overlaps with the as_of_time. If so, the entity is considered to be in that state as of that
-date.
+The Experiment keeps track of which entities are in the cohort on any given date. Similarly to the labels table, the experiment populates a cohort table using the following input:

-2. If the user passes what we call an 'entities' table, containing an entity_id, it will simply use all distinct
-entities present in said table, and mark them as 'active' for every as_of_time in the experiment. Any other columns are
-ignored.
+1. A query, provided by the user in the configuration file, that generates entity_ids for a given as_of_date.

-3. If the user passes a query, parameterized with an as of date, we will populate the table by running it for each
-as_of_date.
+2. Each as_of_date as defined in the temporal config.

-This table is created and exists until matrices are built, at which point it is considered unnecessary and then dropped.
+This cohort table is scoped to the entire Experiment, so all `as_of_times` (computed in step 1) are present.

+The name of the cohort table is based on both the name of the cohort and a hash of the cohort query (e.g. `cohort_permitted_a0b1c2d3`), so any prior experiments that shared both the name and query will be able to reuse the cohort table. If the 'replace' flag is not set, the cohort table is queried for each `as_of_time` to check whether any matching rows exist. If any such rows exist, the cohort query for that date is not run.
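For illustration, a matching cohort configuration might look like this. Table and column names are hypothetical; `{as_of_date}` is the placeholder the Experiment substitutes for each date:

```
# Hypothetical cohort_config: the query must return the entity_ids that are
# in the cohort as of the substituted '{as_of_date}'.
cohort_config = {
    "name": "permitted",
    "query": """
        select entity_id
        from semantic.permits
        where start_date <= '{as_of_date}'::date
          and (end_date is null or end_date > '{as_of_date}'::date)
    """,
}
```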

### Features

@@ -110,7 +96,7 @@

a column or SQL expression representing a numeric quantity present in the `from_obj`, and a list
of aggregate functions we want to use. The aggregate function is applied to the quantity.
* Each `group` is a column applied to the GROUP BY clause. Generally this is 'entity_id', but higher-level groupings
(for instance, 'zip_code') can be used as long as they can be rolled up to 'entity_id'.
-* By default the query is joined with the cohort table (see 'state table' above) to remove unnecessary rows. If `features_ignore_cohort` is passed to the Experiment this is not done.
+* By default the query is joined with the cohort table to remove unnecessary rows. If `features_ignore_cohort` is passed to the Experiment this is not done.

So a simplified version of a typical query would look like:
```
-- (simplified query elided in the collapsed portion of this diff)
```

@@ -137,7 +123,7 @@

entity-level and zipcode-level aggregates from both tables. This aggregation-level table contains the rows present
in the aggregation, pre-imputation. Its output location is generally `{prefix}_aggregation`
#### Imputing Values
-A table that looks similar, but with imputed values is created. The state table from above is passed into collate as
+A table that looks similar, but with imputed values is created. The cohort table from above is passed into collate as
the comprehensive set of entities and dates for which output should be generated, regardless if they exist in the
`from_obj`. Each feature column has an imputation rule, inherited from some level of the feature definition. The
imputation rules that are based on data (e.g. `mean`) use the rows from the `as_of_time` to produce the imputed value.
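A toy sketch of what a data-based rule like `mean` does. This is illustrative pandas, not collate's actual implementation; the imputation flag triage records is sketched here as an extra column:

```
import pandas as pd

# Two as_of_times, each with one missing feature value.
df = pd.DataFrame({
    "as_of_time": ["2016-01-01", "2016-01-01", "2017-01-01", "2017-01-01"],
    "inspections_count": [2.0, None, 10.0, None],
})

# Flag which rows were imputed, then fill each missing value with the mean
# of the rows sharing the same as_of_time.
df["inspections_count_imp"] = df["inspections_count"].isna().astype(int)
df["inspections_count"] = (
    df.groupby("as_of_time")["inspections_count"]
      .transform(lambda col: col.fillna(col.mean()))
)
# The 2016-01-01 missing value becomes 2.0; the 2017-01-01 one becomes 10.0.
```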
@@ -147,8 +133,8 @@

Its output location is generally `{prefix}_aggregation_imputed`.

At this point, we have at least three tables that are used to populate matrices:

-- `labels` with computed labels
-- `tmp_states_{experiment hash}` that tracks what `as_of_times` each entity was in each state.
+- `labels_{labelname}_{labelqueryhash}` with computed labels for each date
+- `cohort_{cohortname}_{cohortqueryhash}` with the cohort for each date
 - A `features.{prefix}_aggregation_imputed` table for each feature aggregation present in the experiment config.


4 changes: 2 additions & 2 deletions docs/sources/experiments/running.md
@@ -188,8 +188,8 @@

If an experiment fails for any reason, you can restart it.

By default, all work will be recreated. This includes label queries, feature queries, matrix building, model training, etc. However, if you pass the `replace=False` keyword argument, the Experiment will reuse what work it can.

-- Cohort Table: The Experiment keeps a cohort table namespaced by its experiment hash, and within that will check on a per-as-of-date level whether or not there are any existing rows, and skip the cohort query for that date if so. For this reason, it is *not* aware of specific entities or source events so if the source data has changed, you will not want to set `replace` to False. Don't expect too much reuse from this, however, as the table is experiment-namespaced. Essentially, this will only reuse data if the same experiment was run prior and failed part of the way through.
-- Labels Table: The Experiment keeps a labels table namespaced by its experiment hash, and within that will check on a per-`as_of_date`/`label timespan` level whether or not there are *any* existing rows, and skip the label query if so. For this reason, it is *not* aware of specific entities or source events so if the label query has changed or the source data has changed, you will not want to set `replace` to False. Don't expect too much reuse from this, however, as the table is experiment-namespaced. Essentially, this will only reuse data if the same experiment was run prior and failed part of the way through label generation.
+- Cohort Table: The Experiment refers to a cohort table namespaced by the cohort name and a hash of the cohort query, and in that way allows you to reuse cohorts between different experiments if their cohort names and queries are identical. When referring to this table, it will check on a per-as-of-date level whether or not there are any existing rows for that date, and skip the cohort query for that date if so. For this reason, it is *not* aware of specific entities or source events, so if the source data has changed, ensure that `replace` is set to True.
+- Labels Table: The Experiment refers to a labels table namespaced by the label name and a hash of the label query, and in that way allows you to reuse labels between different experiments if their label names and queries are identical. When referring to this table, it will check on a per-`as_of_date`/`label timespan` level whether or not there are *any* existing rows, and skip the label query if so. For this reason, it is *not* aware of specific entities or source events, so if the label query has changed or the source data has changed, ensure that `replace` is set to True.
- Features Tables: The Experiment will check on a per-table basis whether or not it exists and contains rows for the entire cohort, and skip the feature generation if so. It does not look at the column list for the feature table or inspect the feature data itself. So, if you have modified any source data that affects a feature aggregation, or added any columns to that aggregation, you won't want to set `replace` to False. However, it is cohort-and-date aware so you can change around your cohort and temporal configuration safely.
- Matrix Building: Each matrix's metadata is hashed to create a unique id. If a file exists in storage with that hash, it will be reused.
- Model Training: Each model's metadata (which includes its train matrix's hash) is hashed to create a unique id. If a file exists in storage with that hash, it will be reused.
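For example, restarting a failed run while reusing completed work looks like this (a minimal sketch, assuming you still have the same config dict and database engine as the failed run):

```
from triage.experiments import SingleThreadedExperiment

experiment = SingleThreadedExperiment(
    config=config,                         # same config as the failed run
    db_engine=db_engine,                   # engine for the same database
    project_path="/projects/inspections",  # hypothetical storage path
    replace=False,   # reuse cohort/label/feature/matrix/model work
)
experiment.run()
```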
1 change: 1 addition & 0 deletions requirement/test.txt
@@ -7,3 +7,4 @@

pytest<4.0.0 # pyup: ignore
pytest-cov==2.6.0
moto==1.3.7
fakeredis==0.16.0
+hypothesis==4.4.1
2 changes: 1 addition & 1 deletion src/tests/test_experiments.py
@@ -355,7 +355,7 @@ def test_profiling(db_engine):
    populate_source_data(db_engine)
    with TemporaryDirectory() as temp_dir:
        project_path = os.path.join(temp_dir, "inspections")
-        experiment = SingleThreadedExperiment(
+        SingleThreadedExperiment(
            config=sample_config(),
            db_engine=db_engine,
            project_path=project_path,
19 changes: 19 additions & 0 deletions src/tests/test_validation_primitives.py
@@ -0,0 +1,19 @@
+from triage.validation_primitives import string_is_tablesafe
+from hypothesis import given, example
+from hypothesis.strategies import text, characters
+
+
+# test with a variety of strings based on letters and numbers auto-generated by hypothesis
+# and also add a hardcoded example that includes underscores because those are fine
+@given(text(alphabet=characters(whitelist_categories=('Lu', 'Ll', 'Nd')), min_size=1))
+@example('a_valid_name')
+def test_string_is_tablesafe(s):
+    assert string_is_tablesafe(s)
+
+
+# test with a variety of strings based on unsafe characters auto-generated by hypothesis
+# and also add a hardcoded example that should be bad because it has spaces
+@given(text(alphabet='/ "'))
+@example('Spaces are not valid')
+def test_string_is_not_tablesafe(s):
+    assert not string_is_tablesafe(s)
20 changes: 15 additions & 5 deletions src/triage/experiments/base.py
@@ -40,7 +40,8 @@
    associate_models_with_experiment,
    associate_matrices_with_experiment,
    missing_matrix_uuids,
-    missing_model_hashes
+    missing_model_hashes,
+    filename_friendly_hash
)
from triage.component.catwalk.storage import (
    CSVMatrixStore,

@@ -113,8 +114,6 @@ def __init__(
        self.materialize_subquery_fromobjs = materialize_subquery_fromobjs
        self.features_ignore_cohort = features_ignore_cohort
        self.experiment_hash = save_experiment_and_get_hash(self.config, self.db_engine)
-        self.labels_table_name = "labels_{}".format(self.experiment_hash)
-        self.cohort_table_name = "cohort_{}".format(self.experiment_hash)
        self.initialize_components()

        self.cleanup = cleanup
@@ -157,6 +156,10 @@ def initialize_components(self):

        cohort_config = self.config.get("cohort_config", {})
        if "query" in cohort_config:
+            self.cohort_table_name = "cohort_{}_{}".format(
+                cohort_config.get('name', 'default'),
+                filename_friendly_hash(cohort_config['query'])
+            )
            self.cohort_table_generator = CohortTableGenerator(
                cohort_table_name=self.cohort_table_name,
                db_engine=self.db_engine,

@@ -170,16 +173,23 @@

                "or save time by only computing features for that cohort."
            )
            self.features_ignore_cohort = True
+            self.cohort_table_name = "cohort_{}".format(self.experiment_hash)
            self.cohort_table_generator = CohortTableGeneratorNoOp()

if "label_config" in self.config:
label_config = self.config["label_config"]
self.labels_table_name = "labels_{}_{}".format(
label_config.get('name', 'default'),
filename_friendly_hash(label_config['query'])
)
self.label_generator = LabelGenerator(
label_name=self.config["label_config"].get("name", None),
query=self.config["label_config"]["query"],
label_name=label_config.get("name", None),
query=label_config["query"],
replace=self.replace,
db_engine=self.db_engine,
)
else:
self.labels_table_name = "labels_{}".format(self.experiment_hash)
self.label_generator = LabelGeneratorNoOp()
logging.warning(
"label_config missing or unrecognized. Without labels, "
7 changes: 7 additions & 0 deletions src/triage/experiments/validate.py
@@ -10,6 +10,7 @@
from triage.component.timechop import Timechop

from triage.util.conf import convert_str_to_relativedelta
+from triage.validation_primitives import string_is_tablesafe


class Validator(object):
@@ -449,6 +450,9 @@ def _run(self, label_config):
                    key 'query' not found. You must define a label query."""
                )
            )
+        if 'name' in label_config and not string_is_tablesafe(label_config['name']):
+            raise ValueError("Section: label_config - "
+                             "name should only contain letters, numbers, and underscores")
        self._validate_query(label_config["query"])
        self._validate_include_missing_labels_in_train_as(
            label_config.get("include_missing_labels_in_train_as", None)
@@ -475,6 +479,9 @@ def _run(self, cohort_config):
                    {as_of_date} must be present"""
                )
            )
+        if 'name' in cohort_config and not string_is_tablesafe(cohort_config['name']):
+            raise ValueError("Section: cohort_config - "
+                             "name should only contain letters, numbers, and underscores")
        dated_query = query.replace("{as_of_date}", "2016-01-01")
        logging.info("Validating cohort query")
        try:
6 changes: 6 additions & 0 deletions src/triage/validation_primitives.py
@@ -149,3 +149,9 @@ def column_should_be_stringlike(table_name, column, db_engine):
"""
table_should_have_column(table_name, column, db_engine)
column_should_be_in_types(table_name, column, [VARCHAR, TEXT], db_engine)


def string_is_tablesafe(string):
if not string:
return False
return all(c.isalpha() or c.isdigit() or c == '_' for c in string)
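A quick demonstration of the new primitive (the values below are illustrative):

```
from triage.validation_primitives import string_is_tablesafe

assert string_is_tablesafe("failed_violation_2019")  # letters, digits, underscores
assert not string_is_tablesafe("failed violation")   # spaces are rejected
assert not string_is_tablesafe("")                   # empty values are rejected
```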