Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patches: nan summaries breaking some deployments, filter land modes on underlying (not display) label #141

Merged
merged 6 commits into from
Sep 15, 2024

Conversation

Abby-Wheelis
Copy link
Member

To address #140 by updating the way we filter the data from display mode to underlying mode

What I am now using is mode_confirm, which should work for now (all of the label sets we have now using the label "air", but I think using the baseMode from the label file, which would be AIR, but that change will be somewhat more complicated.

If I figure out why there are so many Unknown ble mode entries in dfc-fermata, I'll fix that in this PR as well.

switching the condition from the display mode to the underlying mode
@Abby-Wheelis
Copy link
Member Author

image

Looking into what's happening with dfc-fermata by comparing May in the public and admin dashboards, and the results are the same. There are about 63% UNKNOWN trips on both dashboards for May, so the display is accurate.

We should keep looking into why there are so many trips that are not connected to a mode from a BLE beacon

@Abby-Wheelis
Copy link
Member Author

Testing with open-access data for the patch to #142

Screenshot 2024-09-06 at 11 47 43 AM

@Abby-Wheelis
Copy link
Member Author

So... air -> Air is not in our current csv way of translating the labels, so they get grouped as "Other"

Great advocacy for getting those labels unified as soon as we can.

I think the quickest fix would be to add "air -> Air" to the csv, flagging that we should get the labels unified (I would try to do that here, but I think #148 / #143 needs to be completed first to prevent mode chaos. The we can retire the csvs as soon as possible as a follow-on/clean-up to the unification of colors (which is proving to require more work with labels than we first thought)

@Abby-Wheelis Abby-Wheelis marked this pull request as ready for review September 6, 2024 20:52
@Abby-Wheelis
Copy link
Member Author

@iantei feel free to chime in on what you think about my approach to add air->Air to the csv for now, with the plan of unifying the labels and filtering on the base mode as soon as feasible.

@Abby-Wheelis Abby-Wheelis changed the title Represent only land modes Patches: nan summaries breaking some deployments, filter land modes on underlying (not display) label Sep 6, 2024
@Abby-Wheelis
Copy link
Member Author

This fixes #140 and #142

@iantei
Copy link
Contributor

iantei commented Sep 6, 2024

I think the approach to add air->Air to the csv would cover up for the default mapping case, which uses the csv for the mapping.
For the dynamic labels, currently we have usaid-laos-ev, godcgo-label-options label_options which have MODE.value = air, but caebike-label-options doesn't have MODE.value = air.
Would the above change also cover for this program?

An alternative temporary solution, which I can also think about is to modify the merge_small_entries(labels, values) function to restrict the conversion of labels = air from air to Other.
I think introducing new label on csv is also a good solution.

@iantei
Copy link
Contributor

iantei commented Sep 6, 2024

The code changes look good! Could you please test for the caebike-label-options program too?

@Abby-Wheelis
Copy link
Member Author

@iantei since air is not an option in that program, I would expect any airplane trips to have a user-entered label and so they would get grouped as other, I'll test with that dataset just to be sure!

@Abby-Wheelis
Copy link
Member Author

Screenshot 2024-09-07 at 10 22 03 AM

The two labeled charts look the same, despite the fact that there are quite a few trips that do seem to be "air" - but are user-custom labels. I wonder if ca-ebike would like to add the air mode? It would make their distance charts more representative (and not overwhelmed by "Other")

@Abby-Wheelis
Copy link
Member Author

Screenshot 2024-09-07 at 10 27 38 AM Counts are also largely "Other" - IIRC from the data, much of this might be testing related.

@shankari
Copy link
Contributor

shankari commented Sep 7, 2024

Can we get the breakdown of the "Other" to verify this?

At a high level, in a future fix, I think we should consider user labels if they represent a large enough proportion of the aggregate data. So if 40% of the data is "Other" and 90% of the "Other" is "Air", maybe we should just show "Air". But that is going to require a lot of judgement calls - what is "large enough proportion"? what if the text entered by users is subtly different? so let's punt on that for now.

@Abby-Wheelis
Copy link
Member Author

Screenshot 2024-09-07 at 10 42 15 AMScreenshot 2024-09-07 at 10 37 36 AM

The labeled trips at least are not air. Labels not in the dynamic labels include - drove_alone, shared_ride, bus, and e_car_shared_ride - all of the data that I have in this dump is from test users, and I'm guessing a lot of it is from before they added their custom labels.

The air/HSR trips (39/79,000 miles) must be unlabeled.

@shankari
Copy link
Contributor

shankari commented Sep 7, 2024

all of the data that I have in this dump is from test users, and I'm guessing a lot of it is from before they added their custom labels.

Yup, makes sense.

@Abby-Wheelis
Copy link
Member Author

So aside from the tabled point of

At a high level, in a future fix, I think we should consider user labels if they represent a large enough proportion of the aggregate data. So if 40% of the data is "Other" and 90% of the "Other" is "Air", maybe we should just show "Air". But that is going to require a lot of judgement calls - what is "large enough proportion"? what if the text entered by users is subtly different? so let's punt on that for now.

This seems to be working as expected, and can serve as a patch pending the more principled unification of labels, and usage of base modes (I added to the csv, and we still filter on mode_confirm, should update to baseMode when we get rid of the csvs)

check more carefully for missing summaries
@shankari
Copy link
Contributor

shankari commented Sep 9, 2024

I'm not going to insist on this right now, but in a cleanup change, I think it would be helpful to maybe have a unified layer that handles both dashboards. Let's wait until the carbon calculations are integrated into the public dashboard, though. If the unified layer is small and generic enough, it could even go into the server codebase, so that anybody else working with mongodumps (e.g. TSDC) could use it as well.

@Abby-Wheelis
Copy link
Member Author

I have tested with marking the entries that are malformated as INVALID as we do on the admin dash, but if I do that, they get grouped as OTHER since INVALID is not a valid sensed mode, but later when we merge small entries, they get marked as Other, leaving both OTHER and Other on the chart. I see 3 paths forward with this:

  1. Mark the data as UNKNOWN rather than INVALID (arguably, the mode is unknown, but because the data is invalid)
  2. add INVALID to the list of valid modes - it might become a segment in the chart, and would flag to us and program admins that there is an issue with the data
  3. Continue to allow INVALID to be marked for display as OTHER, but prevent OTHER and Other from appearing on the same chart (merge small entries into OTHER for sensed modes, not Other

I'll collect opinions at our meeting later today!

image
image

@Abby-Wheelis
Copy link
Member Author

They should be marked as INVALID and I should fix OTHER vs Other

@Abby-Wheelis
Copy link
Member Author

What the chart with invalid data will look like now:
image
What it would look like if INVALID were OTHER (to test if existing OTHER gets pooled with the small entries)
image

@shankari
Copy link
Contributor

@iantei @Abby-Wheelis this is not in "Ready for review", but I am deploying the related admin dashboard changes to staging, so am planning to merge and deploy this as well. Please make sure to pull this into future changes as well!

Comment on lines +49 to +62
if v2l_df.index[0][-1].isupper():
print("Found uppercase last letter")
# This part if a bit tricky
# We could have already had a non-zero other, and it could be small or large
if "OTHER" not in v2l_df.index:
# zero other will end up with misc_count
if misc_count.vals > 0:
v2l_df.loc["OTHER"] = misc_count
elif "OTHER" in small_chunk.index:
# non-zero small other will already be in misc_count
v2l_df.loc["OTHER"] = misc_count
else:
# non-zero large other, will not already be in misc_count
v2l_df.loc["OTHER"] = v2l_df.loc["OTHER"] + misc_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch! I hope this is simplified once we start using the basemode values are remove the CSV

@shankari shankari merged commit 6137db4 into e-mission:main Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Tasks completed
Development

Successfully merging this pull request may close these issues.

3 participants