Bias part 2 #688

saleiro · 2019-05-07T17:32:49Z

This is the second PR for the aequitas integration. It covers running a bias audit following the evaluation mechanics of Triage (for every matrix_type and subset_hash).

jesteria · 2019-05-08T15:40:44Z

src/triage/component/catwalk/evaluation.py

@@ -5,6 +5,7 @@
 import math

 import numpy
+import ohio.ext.pandas


I see we already have this import in src/triage/component/catwalk/predictors.py as well.

It's only really needed once.

Perhaps remove them and just have one in src/triage/__init__.py?

jesteria · 2019-05-08T15:42:10Z

src/triage/component/catwalk/evaluation.py



 RELATIVE_TOLERANCE = 0.01
 SORT_TRIALS = 30


+def query_protected_groups_table(db_engine, as_of_dates, protected_group_table_name, labels):
+    """Queries the protected groups table table to retrieve the protected attributes values for each as of date


typo

Suggested change

-    """Queries the protected groups table table to retrieve the protected attributes values for each as of date
+    """Query the protected groups table to retrieve the protected attribute values for each as of date.

jesteria · 2019-05-08T15:42:40Z

src/triage/component/catwalk/evaluation.py

+    Args:
+        db_engine (sqlalchemy.engine) a database engine
+        as_of_dates (list) the as_of_Dates to query
+        protected_group_table_name (str) the name of the table to query


missing labels (the argument isn't documented in Args)

jesteria · 2019-05-08T15:44:49Z

src/triage/component/catwalk/evaluation.py

+    """
+    protected_df = pandas.DataFrame.pg_copy_from(query_string, engine=db_engine, parse_dates=["as_of_date"],
+                                                 index_col=MatrixStore.indices)
+    return protected_df.align(labels, join="inner")[0]


jesteria · 2019-05-08T15:47:01Z

src/triage/component/catwalk/evaluation.py

@@ -402,7 +420,7 @@ def metric_definitions_from_matrix_type(self, matrix_type):
    def needs_evaluations(self, matrix_store, model_id, subset_hash=""):
        """Returns whether or not all the configured metrics are present in the
        database for the given matrix and model.
-
+train_results


jesteria · 2019-05-08T15:53:52Z

src/triage/component/catwalk/evaluation.py

+        Returns:
+
+        """
+        bias_audits = []


unused (remove?)

jesteria · 2019-05-08T15:55:08Z

src/triage/component/catwalk/evaluation.py

+        bias_audits = []
+        protected_df['model_id'] = model_id
+        protected_df['score'] = predictions_proba
+        protected_df['label_value'] = labels


How large might the protected dataframe be? Would it be unreasonable to copy it? It's generally not "nice" (or safe) to modify a data structure you receive like this (especially considering it'll be reused?).

(to be sure, it looks like it could be a "shallow" copy, so it should be ok?)
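One way to avoid mutating the caller's dataframe is sketched below. It is only an illustration, assuming the three columns from the diff above; the helper name `build_audit_frame` and the sample data are hypothetical and not part of the PR. `DataFrame.assign` returns a new frame, so the frame passed in keeps its original columns.

```python
import pandas as pd


def build_audit_frame(protected_df, model_id, scores, labels):
    """Return a new frame with the audit columns added.

    Hypothetical helper: the frame passed in is left unmodified.
    """
    # assign() builds and returns a new DataFrame instead of
    # writing the columns into protected_df in place.
    return protected_df.assign(
        model_id=model_id,
        score=scores,
        label_value=labels,
    )


# Made-up sample data, for illustration only.
protected_df = pd.DataFrame({"race": ["a", "b", "a"]}, index=[10, 11, 12])
audit_df = build_audit_frame(
    protected_df, model_id=1, scores=[0.9, 0.2, 0.5], labels=[1, 0, 1]
)

print(list(protected_df.columns))
print(list(audit_df.columns))
```

Whether the extra copy is affordable depends on how large the protected dataframe gets, which is the open question above.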

jesteria · 2019-05-08T15:56:50Z

src/triage/component/catwalk/evaluation.py

+        group_value_df['evaluation_start_time'] = evaluation_start_time
+        group_value_df['evaluation_end_time'] = evaluation_end_time
+        group_value_df['matrix_uuid'] = matrix_uuid
+        group_value_df['parameter'] =


this looks like a SyntaxError to me…

jesteria · 2019-05-08T16:01:00Z

src/triage/component/catwalk/evaluation.py

+        if group_value_df.empty:
+            raise ValueError("""
+            Bias audit: aequitas_audit() failed. Returned empty dataframe for model_id = {model_id}, and subset_hash = {subset_hash}
+            and predictions_schema = {schema}""".format())


this looks uninterpolated (and weirdly formatted). try?

raise ValueError(
    f"Bias audit: aequitas_audit() failed. "
    f"Returned empty dataframe for (model_id, subset_hash, predictions_schema): "
    f"({model_id!r}, {subset_hash!r}, {schema!r})"
)
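For the record, a small sketch of why the interpolation matters. Calling `.format()` with no arguments on a template that has named fields raises `KeyError`, so the intended `ValueError` would never surface. The function name `check_audit_result` and the sample values are hypothetical, used here only to exercise the message.

```python
def check_audit_result(is_empty, model_id, subset_hash, schema):
    """Hypothetical wrapper around the suggested raise statement."""
    if is_empty:
        raise ValueError(
            "Bias audit: aequitas_audit() failed. "
            "Returned empty dataframe for (model_id, subset_hash, predictions_schema): "
            f"({model_id!r}, {subset_hash!r}, {schema!r})"
        )


# A template with named fields and an argument-less .format()
# fails with KeyError before any ValueError can be raised.
format_error = None
try:
    "model_id = {model_id}".format()
except KeyError as exc:
    format_error = exc

# The f-string version interpolates the values as intended.
message = None
try:
    check_audit_result(True, 42, "abc123", "test_results")
except ValueError as exc:
    message = str(exc)

print(type(format_error).__name__)
print(message)
```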

jesteria · 2019-05-08T16:21:33Z

src/triage/component/catwalk/evaluation.py

+                    attribute_name=row['attribute_name'],
+                    attribute_value=row['attribute_value']
+                ).delete()
+            session.bulk_insert_mappings(matrix_type.aequitas_obj, group_value_df.to_dict(orient="records"))


(I know we talked about this, so you probably did try, but) did you simply try bulk_update_mappings? E.g.:

with scoped_session(self.db_engine) as session:
    session.bulk_update_mappings(matrix_type.aequitas_obj, group_value_df.to_dict(orient="records"))

It looks like your database index isn't set on the dataframe; but, if it were:

with scoped_session(self.db_engine) as session:
    session.bulk_update_mappings(matrix_type.aequitas_obj, group_value_df.reset_index().to_dict(orient="records"))

Of course, I presume this requires that an index is set on the database table (and that SQLAlchemy is aware of it).

I would guess that bulk_update_mappings is the most efficient for the situation, or at least more so than non-bulk SQLAlchemy methods. But, if it's really a lot of data, and otherwise I suppose for the record, you might do something like:

with db_engine.begin() as conn:
    # create temp table to which to COPY
    # using schema (not rows) of true target table
    conn.execute(f'''\
        create temp table group_values as
        select * from {matrix_type.aequitas_obj.__table__} limit 0
    ''')

    group_value_df.pg_copy_to('group_values', conn)

    conn.execute(f"""\
        insert into {matrix_type.aequitas_obj.__table__}
        on conflict do update
        select * from group_values
    """)
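As a rough illustration of replacing delete-then-insert with an upsert, here is a self-contained sketch. It uses the standard library's sqlite3 (SQLite 3.24 or newer) purely as a stand-in for the real database. The table layout, the `score` column, and the sample rows are assumptions, not the actual aequitas schema. PostgreSQL accepts a similar `on conflict (...) do update set ... = excluded....` clause, and in both databases the conflict target needs a unique index or primary key on the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical target table; the composite primary key is what
# lets "on conflict" detect rows that already exist.
conn.execute("""
    create table group_values (
        model_id integer,
        attribute_name text,
        attribute_value text,
        score real,
        primary key (model_id, attribute_name, attribute_value)
    )
""")

upsert = """
    insert into group_values (model_id, attribute_name, attribute_value, score)
    values (:model_id, :attribute_name, :attribute_value, :score)
    on conflict (model_id, attribute_name, attribute_value)
    do update set score = excluded.score
"""

first_run = [
    {"model_id": 1, "attribute_name": "race", "attribute_value": "a", "score": 0.5},
]
second_run = [
    {"model_id": 1, "attribute_name": "race", "attribute_value": "a", "score": 0.7},
    {"model_id": 1, "attribute_name": "race", "attribute_value": "b", "score": 0.3},
]

with conn:  # commits on success, rolls back on error
    conn.executemany(upsert, first_run)
    # The second run overwrites the existing row and adds a new one,
    # with no explicit delete beforehand.
    conn.executemany(upsert, second_run)

rows = conn.execute(
    "select model_id, attribute_name, attribute_value, score "
    "from group_values order by attribute_value"
).fetchall()
print(rows)
```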

thcrock · 2019-05-08T18:47:04Z

src/triage/component/catwalk/evaluation.py

@@ -438,6 +456,92 @@ def needs_evaluations(self, matrix_store, model_id, subset_hash=""):
        session.close()
        return needed

+    def _bias_audit(self,


verbify name (e.g. _write_audit_to_db) to make it clear just from looking at the name that nothing is being returned.

codecov-io · 2019-05-23T21:58:27Z

Codecov Report

Merging #688 into master will decrease coverage by 0.09%.
The diff coverage is 83.52%.

@@            Coverage Diff            @@
##           master     #688     +/-   ##
=========================================
- Coverage    82.3%   82.21%   -0.1%     
=========================================
  Files          94       95      +1     
  Lines        6387     6810    +423     
=========================================
+ Hits         5257     5599    +342     
- Misses       1130     1211     +81

Impacted Files	Coverage Δ
src/triage/experiments/base.py	`95.6% <ø> (ø)`	⬆️
src/triage/component/results_schema/__init__.py	`71.66% <ø> (ø)`	⬆️
...s_schema/alembic/versions/b4d7569d31cb_aequitas.py	`0% <0%> (ø)`
...component/postmodeling/contrast/model_evaluator.py	`67.76% <0%> (-0.51%)`	⬇️
src/triage/component/catwalk/__init__.py	`95.5% <100%> (+0.38%)`	⬆️
src/triage/component/catwalk/storage.py	`93.51% <100%> (+0.04%)`	⬆️
src/triage/component/results_schema/schema.py	`98.96% <100%> (+0.54%)`	⬆️
src/triage/experiments/validate.py	`77.13% <75%> (+0.31%)`	⬆️
...e/component/catwalk/protected_groups_generators.py	`94.73% <90%> (-1.01%)`	⬇️
src/triage/component/catwalk/evaluation.py	`98.02% <96.66%> (-0.56%)`	⬇️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5342254...10dad6c.

saleiro added 7 commits May 6, 2019 18:56

adding _bias_audit and results_schema

75eb720

adding _bias_audit and results_schema

a846d23

adding bias_audit tables to schema.py

6c475bb

adding bias_audit tables to schema.py

8bd35fc

aequitas alembic

4b30bb7

write to db

1fa50d2

write to db

c2d6695

saleiro assigned thcrock May 7, 2019

jesteria reviewed May 8, 2019

thcrock reviewed May 8, 2019

thcrock added 14 commits May 20, 2019 11:37

WIP

ab8e6d9

Merge remote-tracking branch 'origin/master' into bias_part_2

9765aec

WIP

9a04eac

Modify evaluation test, move querying of protected group table to protected group generator

e84f4cc

Did not commit everything before

5af73c8

Merge remote-tracking branch 'origin/master' into bias_part_2

c51c4c4

WIP

e3d9091

More doc changes

c6a64e7

A couple fixes

1be5a4d

Catwalk integration test

ab32b0e

Fix simple experiment test

dfbc251

Use matplotlib agg in audition test

9bd69b2

Try putting agg in conftest

e50bd1a

Update ml governance doc

6334a17

thcrock and others added 5 commits May 24, 2019 14:31

Since the matplotlib agg is in conftest we may not need it here

580d099

Update docstring for subset

f11e384

Simply as_dataframe and update PyYAML to jive with aequitas requirements

5573793

extending ML governance

70109f9

Add label join back in, make needs_evaluations bias audit aware

109ff02

thcrock added 3 commits May 29, 2019 16:33

Update colormap in test

519bb08

Update docstrings

f007f6a

Clarify percentile thresholds

10dad6c

thcrock merged commit e47f07a into master May 30, 2019

thcrock deleted the bias_part_2 branch May 30, 2019 16:39
