-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Inferred Labels for Stacked Bar Charts #145
Introduce Inferred Labels for Stacked Bar Charts #145
Conversation
…() and expand_inferredlabels() for processsing, filtering and expanding inferred labels. 2. map_trip_data() to extract the mapping functionality.
…_viz_notebook_data() for refactor.
…and incorporate inferred label for Distribution of modes.
…Metrics 2. Update quality_text, fig, ax, text_results and introduce new plot_and_text_stacked_bar_chart() for all Stacked Bar Charts to represent inferred labels bar in generic_metrics notebook
…dd query for mode_of_interest for inferred labels 3. Update fig, ax, text_results, plot_and_text_stacked_bar_chart() for all Stacked Bar Charts.
…mode_specific_metrics notebook
…_ 2. Update plot_and_text_stacked_bar_chart() for Distribution of modes in commute trips
… stacked_bar_quality_text and stacked_bar_quality_text_inferred with plot_and_text_stacked_bar_chart() 3. Adjust plot_title to plot_title_no_quality
The charts look great! I'll look at the code next, but I do have one quick question - could a given trip be in the sensed, inferred, and labeled charts? Some of these look like labeled & inferred add to about 100%, but that is probably coincidence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than this variable name that might need to be changed LGTM!
viz_scripts/scaffolding.py
Outdated
|
||
return expanded_ct, file_suffix, quality_text, debug_df | ||
|
||
def map_trip_data(df, study_type, dynamic_labels, dic_re, dic_pur): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more specific variable name that could be used in places of df
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the variable name from df
to expanded_trip_df
, since the parameter is placeholder for both expanded_
labeled trips, and inferred trips.
We can label a trip detected as certain mode of commute to be the same or different as the one detected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the more specific variable name, looks good to me!
I can take a look at the code, but some high level comments just by looking at the charts first:
I also took a brief look at the code, and I think that the current |
…f there is user_input or not. If there is user_input, chose it over inferred_labels.
Rearranged the bars to showcase from top to bottom: labeled - inferred - sensed.
I have updated the filter to select trips which has either user_input or inferred_labels. Iterate through the filtered df - if there is user_input chose it, else look for inferred_labels. This way, we will have user label + inferred labels for inferred bars which would be "labeled and inferred". |
I need some clarification.
I have proceeded to implement with Approach B. I think it'd also be a good idea to account for a |
This is the updated chart with the Approach B. I tried to follow the below approach to understand the data better: I enlisted all the
It seems like e-bike trip is being mapped not just to BICYCLING, but also IN_VEHICLE and others. I will need to investigate further. |
I think one reason that the e-bike trips could be sensed as |
@iantei @Abby-Wheelis correct, I fully anticipate that sensed_mode is off, since I did not focus on e-bikes while making the original algorithms. I was questioning the inferred mode numbers (e.g. 75% of trips, at a proportion that is way off from labeled and sensed).
|
After filtering the labels form inferred_labels which has 'p' value greater than the confidence_threshold, for each trip. Referenced approach from |
…abeled and Inferred by OpenPATH ...
Testing Scenario: Dataset - Details of the script execution
Result: |
After merging with main: The charts look identical. The chart for Total trip length covered by mode has "Other" [3032] split into "Other" [1306] + "Airplane" [1726] for Labeled Similarly, we do not have Airplane trips in Total trip length covered by mode in land. |
On some of the bars next to sensed mode it says "(100% of all trips)" and some it says "(100%)" - can you standardize that please? Other than that slight issue the charts look great, I'll move on to reviewing the code itself |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few notes/questions here
viz_scripts/scaffolding.py
Outdated
if (max_entry['p'] > row.confidence_threshold): | ||
max_labels_list.append(max_entry['labels']) | ||
else: | ||
max_labels_list.append({}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if there are no user labels and the confidence is too low there won't be any labels? Do these entries get filtered out later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever there is no user labels and the 'p' value for the inferred_labels is less than the confidence_threshold. We are not considering the inferred labels of mode_confirm, purpose_confirm and replaced_mode, therefore an empty dictionary is being added.
Later these columns would be converted as NaN for mode_confirm, purpose_confirm and replaced_mode while creating a dataframe, which is subsequently added to the original dataframe.
When I looked up for
(expanded_ct_inferred['mode_confirm'].value_counts(dropna=False), expanded_ct_inferred['Mode_confirm'].value_counts(dropna=False))
I observed the below:
mode_confirm:
e-bike 1868
drove_alone 1101
shared_ride 768
NaN 655
walk 467
atv 42
e_car_shared_ride 15
e_car_drove_alone 13
not_a_trip 12
bike 8
side_by side 6
free_shuttle 3
lawn_mower 2
2
moco_transit 2
air 2
atv_ride 1
Name: mode_confirm, dtype: int64
Mode_confirm:
E-bike 1868
Gas Car, drove alone 1101
Gas Car, with others 768
Other 710
Walk 467
E-car, with others 15
E-car, drove alone 13
Not a Trip 12
Regular Bike 8
Free Shuttle 3
Airplane 2
Name: Mode_confirm, dtype: int64)
The values processed as NaN
was eventually converted to Other
, which is wrong.
Do these entries get filtered out later?
I have to filter it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considered the case of filtering with NaN
values, but there could be some existing NaN
values apart from the way it was computed earlier. Decided to pass uncertain
values for the keys to mode_confirm
, purpose_confirm
and replaced_mode
in the dictionary, and add it to the list.
After the convergence with the original dataframe, filter out the rows which have these mode_confirm, purpose_confirm and replaced_mode as uncertain
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with this for now, but I would expect to revisit and discuss as part of/after the footprint calculations. If the user labels are not present, they are mapped to NaN and we drop them (dropna
). Why do we want to handle this differently? (i.e. why don't we just dropna
again?) Given that we already indicate that we only represent a subset of the trips, why do we also need to label some of those as uncertain
?
viz_scripts/scaffolding.py
Outdated
else: | ||
max_labels_list.append(row.user_input) | ||
|
||
inferred_only_labels = pd.DataFrame(max_labels_list, index=inferred_ct.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is really inferred / user labels now, I think it might be best to rename it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it to labeled_inferred_labels
…append the labels_list with dict - uncertain for all labels. Later filter it out from the dataframe.
…emove reset_index on expanded_inferred_ct dataframe.
…d update the variable names to add prefix of labeled. We display both labeled and inferred labels altogether for inferred bars in stacked bar charts.
…e *_w_other cols for the dataframe.
…await from notebook. Update the map_trip_data() to be async, and call to it as await in scaffolding.py
Testing scenario: Dataset used - Executed notebooks -
Details - Execution of generic/mode_specific_metrics notebook
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging this now, barring the future cleanup items I have listed here.
"\n", | ||
"inferred_match = re.match(r'Based on ([0-9]+) confirmed trips from ([0-9]+) (users|testers and participants)\\nof ([0-9]+) total trips from ([0-9]+) (users|testers and participants) (\\(([0-9.]+|nan)%\\))', quality_text_inferred)\n", | ||
"stacked_bar_quality_text_inferred = f\"{inferred_match.group(1)} trips {inferred_match.group(7)}\\n from {inferred_match.group(2)} {inferred_match.group(3)}\"\n", | ||
"\n", | ||
"stacked_bar_quality_text_labeled, stacked_bar_quality_text_sensed, stacked_bar_quality_text_inferred" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope we fix/clean this up soon. Even if we wanted to generate parametrized text, it is a lot clearer to have a structure with the parameters instead of parsing out from existing text using regular expressions. And now that we have converted over to bar charts, we don't need the backwards compat of the old Based on...
text.
if len(labeled_inferred_ct) == 0: | ||
return labeled_inferred_ct | ||
|
||
def _select_max_label(row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am merging this for now, but note that there is a similar function in the server when we handle metrics to include trips with confidence > 90% as "confirmed". We do not and should not need to reinvent code.
"Trips_with_at_least_one_label": len(labeled_ct), | ||
"Trips_with_mode_confirm_label": trip_label_count("Mode_confirm", expanded_ct), | ||
"Trips_with_trip_purpose_label": trip_label_count("Trip_purpose", expanded_ct) | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'm merging this for now, but this should be switched to mode_confirm
or mode_confirm_with_other
, right? We should not be working with the display values in any internal code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would need to switch to mode_confirm_w_other
variable.
|
||
async def map_trip_data(expanded_trip_df, study_type, dynamic_labels, dic_re, dic_pur): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the reason for refactoring and pulling this out, but wonder why you chose to rename the parameter. It has lead to a lot of changed lines that are essentially just the variable rename. Not asking for a change, just an explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are calling map_trip_data(expanded_trip_df, ...)
from two functions -
load_viz_notebook_inferred_data(expanded_it, ...)
& load_viz_notebook_data(expanded_ct, ...)
.
Therefore, I wanted to use a variable name which would suit both the parameters.
A. Introduce processing functionalities in scaffolding.py for inferred labels.