Bias part 2 #688
Conversation
```python
@@ -5,6 +5,7 @@
import math

import numpy
import ohio.ext.pandas
```
I see we already have this import in src/triage/component/catwalk/predictors.py as well. It's only really needed once. Perhaps remove them and just have one in src/triage/__init__.py?
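(For reference, the reason one import suffices: Python caches imported modules in `sys.modules`, so a module's import-time side effects — such as `ohio.ext.pandas` registering its `pg_copy_from`/`pg_copy_to` accessors on pandas objects — run only once per process. A quick stdlib illustration:)

```python
import sys

# An import executes the module body (and any side effects) only the
# first time; later imports just return the cached module object.
import json
first = sys.modules["json"]

import json  # no-op: the module is already in sys.modules

assert sys.modules["json"] is first
```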
```python
RELATIVE_TOLERANCE = 0.01
SORT_TRIALS = 30


def query_protected_groups_table(db_engine, as_of_dates, protected_group_table_name, labels):
    """Queries the protected groups table table to retrieve the protected attributes values for each as of date
```
typo
```diff
-    """Queries the protected groups table table to retrieve the protected attributes values for each as of date
+    """Query the protected groups table to retrieve the protected attribute values for each as of date.
```
```python
    Args:
        db_engine (sqlalchemy.engine) a database engine
        as_of_dates (list) the as_of_dates to query
        protected_group_table_name (str) the name of the table to query
```
missing labels
```python
    """
    protected_df = pandas.DataFrame.pg_copy_from(query_string, engine=db_engine, parse_dates=["as_of_date"],
                                                 index_col=MatrixStore.indices)
    return protected_df.align(labels, join="inner")[0]
```
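For readers unfamiliar with `align`: with `join="inner"` it restricts both operands to their common index entries, and `[0]` selects the (now filtered) left operand. A small sketch with toy data, assuming the frame and the labels share the same entity/date index (`axis=0` is added here to make the index alignment explicit):

```python
import pandas as pd

# toy stand-ins for the protected-attributes frame and the labels series
protected = pd.DataFrame({"race": ["a", "b", "c"]}, index=[1, 2, 3])
labels = pd.Series([0, 1], index=[2, 3])

# join="inner" keeps only index entries present in both objects;
# [0] picks the aligned left operand (the protected attributes)
aligned = protected.align(labels, join="inner", axis=0)[0]

assert list(aligned.index) == [2, 3]
```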
👏
```python
@@ -402,7 +420,7 @@ def metric_definitions_from_matrix_type(self, matrix_type):
    def needs_evaluations(self, matrix_store, model_id, subset_hash=""):
        """Returns whether or not all the configured metrics are present in the
        database for the given matrix and model.

        train_results
```
huh?
```python
        Returns:

        """
        bias_audits = []
```
`bias_audits` is unused (remove?)
```python
bias_audits = []
protected_df['model_id'] = model_id
protected_df['score'] = predictions_proba
protected_df['label_value'] = labels
```
How large might the protected dataframe be? Would it be unreasonable to copy it? It's not generally "nice" (or safe) to modify a data structure you receive like this, (especially considering it'll be reused?).
(to be sure, it looks like it could be a "shallow" copy, so it should be ok?)
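One way to sidestep the mutation question entirely (a sketch, not the PR's code): `DataFrame.assign` returns a new frame carrying the extra columns, leaving the caller's object untouched, at roughly shallow-copy cost since the existing column data is shared.

```python
import pandas as pd

# toy stand-in for the received protected-attributes frame
protected_df = pd.DataFrame({"race": ["a", "b"]})

# assign() returns a new frame; the caller's frame is not modified
audited = protected_df.assign(model_id=7, score=[0.2, 0.9], label_value=[0, 1])

assert "model_id" not in protected_df.columns
assert list(audited.columns) == ["race", "model_id", "score", "label_value"]
```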
```python
group_value_df['evaluation_start_time'] = evaluation_start_time
group_value_df['evaluation_end_time'] = evaluation_end_time
group_value_df['matrix_uuid'] = matrix_uuid
group_value_df['parameter'] =
```
this looks like a SyntaxError to me…
```python
if group_value_df.empty:
    raise ValueError("""
        Bias audit: aequitas_audit() failed. Returned empty dataframe for model_id = {model_id}, and subset_hash = {subset_hash}
        and predictions_schema = {schema}""".format())
```
this looks uninterpolated (and weirdly formatted). Try:

```python
raise ValueError(
    f"Bias audit: aequitas_audit() failed. "
    f"Returned empty dataframe for (model_id, subset_hash, predictions_schema): "
    f"({model_id!r}, {subset_hash!r}, {schema!r})"
)
```
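(Worth noting: calling `str.format()` with no arguments on a template with named fields doesn't merely leave the placeholders uninterpolated — it raises `KeyError` on the first named field, so that error path would crash before the message is ever built:)

```python
template = "model_id = {model_id}, subset_hash = {subset_hash}"

# .format() with no arguments cannot fill the named fields
try:
    template.format()
    result = None
except KeyError as exc:
    result = str(exc)

assert result == "'model_id'"

# supplying the names (or using an f-string) fixes it
model_id, subset_hash = 3, "abc"
assert template.format(model_id=model_id, subset_hash=subset_hash) == "model_id = 3, subset_hash = abc"
assert f"model_id = {model_id}" == "model_id = 3"
```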
```python
        attribute_name=row['attribute_name'],
        attribute_value=row['attribute_value']
    ).delete()
session.bulk_insert_mappings(matrix_type.aequitas_obj, group_value_df.to_dict(orient="records"))
```
(I know we talked about this so you probably did try, but) did you try simply `bulk_update_mappings`? E.g.:

```python
with scoped_session(self.db_engine) as session:
    session.bulk_update_mappings(matrix_type.aequitas_obj, group_value_df.to_dict(orient="records"))
```

It looks like your database index isn't set on the dataframe; but, if it were:

```python
with scoped_session(self.db_engine) as session:
    session.bulk_update_mappings(matrix_type.aequitas_obj, group_value_df.reset_index().to_dict(orient="records"))
```

Of course, I presume this requires that an index is set on the database table (and that SQLAlchemy is aware of it).
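To make the bulk-mappings behavior concrete, here's a self-contained sketch against in-memory SQLite (toy table and column names, assuming SQLAlchemy 1.4+; `bulk_update_mappings` matches rows on the primary key carried in each mapping dict):

```python
import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class GroupValue(Base):  # toy stand-in for the aequitas results table
    __tablename__ = "group_values"
    id = sa.Column(sa.Integer, primary_key=True)
    attribute_name = sa.Column(sa.String)
    tpr_disparity = sa.Column(sa.Float)

engine = sa.create_engine("sqlite://")
Base.metadata.create_all(engine)

rows = [
    {"id": 1, "attribute_name": "race", "tpr_disparity": 1.2},
    {"id": 2, "attribute_name": "sex", "tpr_disparity": 0.9},
]

with Session(engine) as session:
    session.bulk_insert_mappings(GroupValue, rows)
    # the update locates each row via the primary key in the mapping
    session.bulk_update_mappings(GroupValue, [{"id": 1, "tpr_disparity": 1.5}])
    session.commit()

with Session(engine) as session:
    updated = session.get(GroupValue, 1).tpr_disparity

assert updated == 1.5
```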
I would guess that `bulk_update_mappings` is the most efficient for the situation, or at least more so than non-bulk SQLAlchemy methods. But if it's really a lot of data (and otherwise, I suppose, for the record), you might do something like:

```python
with db_engine.begin() as conn:
    # create temp table to which to COPY,
    # using the schema (not rows) of the true target table
    conn.execute(f'''\
        create temp table group_values
        as select * from {matrix_type.aequitas_obj.__table__} limit 0
    ''')
    group_value_df.pg_copy_to('group_values', conn)
    # note: in PostgreSQL the ON CONFLICT clause follows the SELECT, and a
    # real upsert needs an explicit conflict target and SET list filled in
    conn.execute(f"""\
        insert into {matrix_type.aequitas_obj.__table__}
        select * from group_values
        on conflict (...) do update set ...
    """)
```
```python
@@ -438,6 +456,92 @@ def needs_evaluations(self, matrix_store, model_id, subset_hash=""):
        session.close()
        return needed

    def _bias_audit(self,
```
verbify the name (e.g. `_write_audit_to_db`) to make it clear just from looking at the name that nothing is being returned.
Codecov Report

```
@@            Coverage Diff            @@
##           master     #688    +/-   ##
=========================================
- Coverage    82.3%   82.21%    -0.1%
=========================================
  Files          94       95      +1
  Lines        6387     6810    +423
=========================================
+ Hits         5257     5599    +342
- Misses       1130     1211     +81
```

Continue to review full report at Codecov.
This is the second PR for aequitas integration; it covers running a bias audit following Triage's evaluation mechanics (for every matrix_type and subset_hash).