Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bias part 2 #688

Merged
merged 29 commits into from
May 30, 2019
Merged

Bias part 2 #688

merged 29 commits into from
May 30, 2019

Conversation

saleiro
Copy link
Member

@saleiro saleiro commented May 7, 2019

This is the second PR for aequitas integration and it covers the process of running a bias audit following the evaluation mechanics of Triage (for every matrix_type and subset_hash).

@@ -5,6 +5,7 @@
import math

import numpy
import ohio.ext.pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we already have this import in src/triage/component/catwalk/predictors.py as well.

It's only really needed once.

Perhaps remove them and just have one in src/triage/__init__.py?



RELATIVE_TOLERANCE = 0.01
SORT_TRIALS = 30


def query_protected_groups_table(db_engine, as_of_dates, protected_group_table_name, labels):
"""Queries the protected groups table table to retrieve the protected attributes values for each as of date
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Suggested change
"""Queries the protected groups table table to retrieve the protected attributes values for each as of date
"""Query the protected groups table to retrieve the protected attribute values for each as of date.

Args:
db_engine (sqlalchemy.engine) a database engine
as_of_dates (list) the as_of_Dates to query
protected_group_table_name (str) the name of the table to query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing labels

"""
protected_df = pandas.DataFrame.pg_copy_from(query_string, engine=db_engine, parse_dates=["as_of_date"],
index_col=MatrixStore.indices)
return protected_df.align(labels, join="inner")[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@@ -402,7 +420,7 @@ def metric_definitions_from_matrix_type(self, matrix_type):
def needs_evaluations(self, matrix_store, model_id, subset_hash=""):
"""Returns whether or not all the configured metrics are present in the
database for the given matrix and model.

train_results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hã?

Returns:

"""
bias_audits = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused (remove?)

bias_audits = []
protected_df['model_id'] = model_id
protected_df['score'] = predictions_proba
protected_df['label_value'] = labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large might the protected dataframe be? Would it be unreasonable to copy it? It's not generally "nice" (or safe) to modify a data structure you receive like this, (especially considering it'll be reused?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(to be sure, it looks like it could be a "shallow" copy, so it should be ok?)

group_value_df['evaluation_start_time'] = evaluation_start_time
group_value_df['evaluation_end_time'] = evaluation_end_time
group_value_df['matrix_uuid'] = matrix_uuid
group_value_df['parameter'] =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a SyntaxError to me…

if group_value_df.empty:
raise ValueError("""
Bias audit: aequitas_audit() failed. Returned empty dataframe for model_id = {model_id}, and subset_hash = {subset_hash}
and predictions_schema = {schema}""".format())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks uninterpolated (and weirdly formatted). try? –

raise ValueError(
    f"Bias audit: aequitas_audit() failed. "
    f"Returned empty dataframe for (model_id, subset_hash, predictions_schema): "
    f"({model_id!r}, {subset_hash!r}, {schema!r})"
)

attribute_name=row['attribute_name'],
attribute_value=row['attribute_value']
).delete()
session.bulk_insert_mappings(matrix_type.aequitas_obj, group_value_df.to_dict(orient="records"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I know we talked about this so you probably did try, but), did you try simply bulk_update_mappings? E.g.:

with scoped_session(self.db_engine) as session:
    session.bulk_update_mappings(matrix_type.aequitas_obj, group_value_df.to_dict(orient="records"))

It looks like your database index isn't set on the dataframe; but, if it were:

with scoped_session(self.db_engine) as session:
    session.bulk_update_mappings(matrix_type.aequitas_obj, group_value_df.reset_index().to_dict(orient="records"))

Of course, I presume this requires that an index is set on the database table, (and that SQLAlchemy is aware of it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess that bulk_update_mappings is the most efficient for the situation, or at least more so than non-bulk SQLAlchemy methods. But, if it's really a lot of data, and otherwise I suppose for the record, you might do something like:

with db_engine.begin() as conn:
    # create temp table to which to COPY
    # using schema (not rows) of true target table
    conn.execute(f'''\
        create temp table group_values
        as select * from {matrix_type.aequitas_obj.__table__} limit 0
    ''')

    group_value_df.pg_copy_to('group_values', conn)

    conn.execute(f"""\
        insert into {matrix_type.aequitas_obj.__table__}
        on conflict do update
        select * from group_values
    """)

@@ -438,6 +456,92 @@ def needs_evaluations(self, matrix_store, model_id, subset_hash=""):
session.close()
return needed

def _bias_audit(self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verbify name (e.g. _write_audit_to_db) to make it clear just from looking at the name that nothing is being returned.

@codecov-io
Copy link

codecov-io commented May 23, 2019

Codecov Report

Merging #688 into master will decrease coverage by 0.09%.
The diff coverage is 83.52%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #688     +/-   ##
=========================================
- Coverage    82.3%   82.21%   -0.1%     
=========================================
  Files          94       95      +1     
  Lines        6387     6810    +423     
=========================================
+ Hits         5257     5599    +342     
- Misses       1130     1211     +81
Impacted Files Coverage Δ
src/triage/experiments/base.py 95.6% <ø> (ø) ⬆️
src/triage/component/results_schema/__init__.py 71.66% <ø> (ø) ⬆️
...s_schema/alembic/versions/b4d7569d31cb_aequitas.py 0% <0%> (ø)
...component/postmodeling/contrast/model_evaluator.py 67.76% <0%> (-0.51%) ⬇️
src/triage/component/catwalk/__init__.py 95.5% <100%> (+0.38%) ⬆️
src/triage/component/catwalk/storage.py 93.51% <100%> (+0.04%) ⬆️
src/triage/component/results_schema/schema.py 98.96% <100%> (+0.54%) ⬆️
src/triage/experiments/validate.py 77.13% <75%> (+0.31%) ⬆️
...e/component/catwalk/protected_groups_generators.py 94.73% <90%> (-1.01%) ⬇️
src/triage/component/catwalk/evaluation.py 98.02% <96.66%> (-0.56%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5342254...10dad6c. Read the comment docs.

@thcrock thcrock merged commit e47f07a into master May 30, 2019
@thcrock thcrock deleted the bias_part_2 branch May 30, 2019 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants