Drag drop metadata improvements #1244

Merged: 6 commits, Jan 5, 2021

jameshadfield
Member

A number of improvements to the drag-and-drop metadata functionality. See commit messages and documentation changes in this PR for more details.

Closes #1242
Closes #1239

The refactor was motivated by upcoming functionality and this influenced the implementation here. The change in action type name indicates that we will be able to extract more information from dropped files than just extra color-bys.
Here we use a Microreact-style interpretation of column header names. Specifically, a double underscore in a header name marks what follows as extra information about the column named before it, with valid suffixes "colour", "autocolour" and "shape".

We ignore "shape" (as auspice doesn't have this capability) and "autocolour" (as that's our default).
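
A minimal sketch of the header interpretation described above, not the code in this PR. The helper names `parseHeader` and `classifyHeaders` are hypothetical, and treating a header with an unrecognised suffix as an ordinary trait is an assumption.

```javascript
// Suffixes named in the PR description. Only "colour" is acted upon.
const KNOWN_SUFFIXES = new Set(["colour", "autocolour", "shape"]);

// Hypothetical helper: split "random__colour" into { trait: "random", suffix: "colour" }.
// Headers without a recognised suffix are returned unchanged (assumption).
function parseHeader(header) {
  const idx = header.lastIndexOf("__");
  if (idx <= 0) return { trait: header, suffix: null };
  const suffix = header.slice(idx + 2);
  if (!KNOWN_SUFFIXES.has(suffix)) return { trait: header, suffix: null };
  return { trait: header.slice(0, idx), suffix };
}

// Hypothetical helper: sort headers into plain traits, colour-definition
// columns, and columns that are skipped ("shape", "autocolour").
function classifyHeaders(headers) {
  const traits = [];
  const colourColumns = {}; // trait name -> header holding its colours
  const ignored = [];
  for (const header of headers) {
    const { trait, suffix } = parseHeader(header);
    if (suffix === null) traits.push(trait);
    else if (suffix === "colour") colourColumns[trait] = header;
    else ignored.push(header);
  }
  return { traits, colourColumns, ignored };
}
```

For example, `classifyHeaders(["strain", "random", "random__colour", "random__shape"])` would report `random__colour` as the colour column for `random` and list `random__shape` as ignored.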

We parse a "colour" column as defining the colours for the corresponding trait column. We enforce that these are (long) HEX values, but this should be relaxed in the future. Where multiple nodes with the same trait value define different colours, we average them, as we do for the map.
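
One way the hex check and the averaging could look, as a sketch only. Averaging channel by channel in RGB space is an assumption (the PR does not say which colour space is used), and `isLongHex` and `averageHexColours` are hypothetical names.

```javascript
// Long-form hex only, e.g. "#184ae8". Short forms such as "#abc" are rejected.
const LONG_HEX = /^#[0-9a-fA-F]{6}$/;

function isLongHex(value) {
  return typeof value === "string" && LONG_HEX.test(value);
}

// Average a list of hex colours channel by channel in RGB space (assumption).
// Invalid entries are skipped; returns undefined if nothing valid remains.
function averageHexColours(hexes) {
  const valid = hexes.filter(isLongHex);
  if (valid.length === 0) return undefined;
  const sums = [0, 0, 0];
  for (const hex of valid) {
    for (let i = 0; i < 3; i++) {
      sums[i] += parseInt(hex.slice(1 + 2 * i, 3 + 2 * i), 16);
    }
  }
  const channels = sums.map((sum) =>
    Math.round(sum / valid.length).toString(16).padStart(2, "0")
  );
  return "#" + channels.join("");
}
```

With this sketch, a trait value whose nodes supply both `#000000` and `#ffffff` would be drawn as `#808080`.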
jameshadfield temporarily deployed to auspice-drag-drop-metad-2eccco (December 15, 2020 00:31, inactive)

/* There are a number of "special case" columns we currently ignore */
const fieldsToIgnore = new Set(["name", "div", "vaccine", "labels", "hidden", "mutations", "url", "authors", "accession", "traits", "children"]);
fieldsToIgnore.add("num_date").add("year").add("month").add("date"); /* TODO - implement date parsing */
Member Author

An obvious next step is to allow dates to be specified via additional metadata. Interpreting this as the num_date attr is slightly complex as the coloring & tree-metric are tied together in the code. A different approach would be to extend our color schemes to allow "temporal" types and then add temporal info in the CSV/TSV as simply an additional color-by.

Member

So, this would mean sequences in a time-tree with inferred dates were allowed to have 'real' dates supplied? Would it change where they're plotted on the x-axis?

Member Author

I'm leaning towards allowing metadata traits (in drag-n-drop data or in the main JSON) which are encoded as (e.g.) YYYY-MM-DD to be inferred as temporal types by auspice, and a colour scale generated accordingly. A good example of this would be nCoV's "submitted date" field. These would be "just another color-by" and thus different to the num_date field which is used by auspice for the tree-time view. I don't see any easy way to use metadata to update / influence node (temporal) positioning, but am open to suggestions. (It's easy to see how such metadata could define the x-axis position of tips, but what do we do with internal nodes / tips not in the metadata?)
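
To make the proposal concrete, here is a rough sketch of how a trait's values might be inferred as temporal. This is not implemented in this PR; the function names, the rule that every non-empty value must be a valid YYYY-MM-DD date, and the decimal-year conversion are all assumptions.

```javascript
const ISO_DATE = /^(\d{4})-(\d{2})-(\d{2})$/;

// Parse "YYYY-MM-DD" into a UTC Date, or return null if the string is not
// a real calendar date (e.g. "2020-02-31").
function parseIsoDate(value) {
  const match = ISO_DATE.exec(value);
  if (!match) return null;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  const roundTrips =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return roundTrips ? date : null;
}

// Treat a trait as temporal only if every non-empty value is a valid date.
function looksTemporal(values) {
  const present = values.filter((v) => v !== undefined && v !== "");
  return present.length > 0 && present.every((v) => parseIsoDate(v) !== null);
}

// Convert a date to a decimal year so it can feed a continuous colour scale.
function toDecimalYear(date) {
  const year = date.getUTCFullYear();
  const start = Date.UTC(year, 0, 1);
  const end = Date.UTC(year + 1, 0, 1);
  return year + (date.getTime() - start) / (end - start);
}
```

Under this sketch the values only drive a colour scale. Nothing here touches num_date or the positioning of nodes in the tree.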

Member

Sounds good! Yes, I agree, that's why I was curious - was wondering how you'd adjust the other parts of the tree to accommodate moving tips! Unfortunately I have no answers to offer, just curiosity :)

@emmahodcroft
Member

This all sounds super-cool James!! What a neat bunch of features. If you have a moment to supply a dummy CSV that shows an example how one can make use of these (how to organize them, etc) that would be fantastic! (No rush though!)

@emmahodcroft
Member

emmahodcroft commented Dec 17, 2020

If you can provide a dummy file showing how to format the drag-n-drop files for the new things that are accepted, I'm happy to work from that, modify, add, etc, and play around to help review! (Sorry if I've missed this file already being somewhere!)

jameshadfield temporarily deployed to auspice-drag-drop-metad-2eccco (December 17, 2020 22:36, inactive)
@jameshadfield
Member Author

One way to test the functionality of this is via a modified version of the zika-tutorial at https://auspice-drag-drop-metad-2eccco.herokuapp.com/zika-sparse-metadata. Notably this build does not have the country included as a coloring or a geographic resolution.

The additional metadata TSV at https://nextstrain-scratch.s3.amazonaws.com/james/zika-extra-metadata.tsv:

strain                       country           latitude     longitude     random  random__colour
1_0087_PF                    French Polynesia  -17.6797     -149.4068     blue    #184ae8
1_0181_PF                    French Polynesia  -17.6797     -149.4068     blue    #184ae8
1_0199_PF                    French Polynesia  -17.6797     -149.4068     blue    #184ae8
Aedes_aegypti/USA/2016/FL05  USA               39.7837304   -100.4458824  green   #2dc234
BRA/2016/FC_6706             Brazil            -10.3333332  -53.1999999   red     #e7182a
Brazil/2015/ZBRA105          Brazil            -10.3333332  -53.1999999   blue    #184ae8
...

will add the following data to around 25 of the strains in the build:

  • Country is added as a coloring, with auspice choosing the colours for these
  • A coloring entitled "random", with a separate column defining the colours for each value
  • An additional coloring (not in the TSV) is added by auspice identifying which strains were in the metadata file.
  • Filters are available in auspice for the colorings described above
  • Lat/long for strains. This is actually the lat/long of each strain's country, so this PR will group them together appropriately.
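
As an illustration of the list above, here is a sketch of reading such a TSV and deriving the extra "which strains were in the file" coloring. It assumes a simple tab-separated file with no quoted fields and the strain name in the first column. `parseTsv`, `markStrainsInFile` and the value format are hypothetical, not auspice's API.

```javascript
// Parse a simple TSV (no quoting) into a Map keyed by the first column.
function parseTsv(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const headers = lines[0].split("\t");
  const byStrain = new Map();
  for (const line of lines.slice(1)) {
    const cells = line.split("\t");
    const record = {};
    headers.forEach((header, i) => {
      record[header] = cells[i] ?? ""; // missing trailing cells become ""
    });
    byStrain.set(cells[0], record);
  }
  return { headers, byStrain };
}

// For each strain in the tree that also appears in the file, record a value
// that can be used as a coloring and as a filter (value format is assumed).
function markStrainsInFile(treeStrains, byStrain, fileName) {
  const marked = new Map();
  for (const strain of treeStrains) {
    if (byStrain.has(strain)) marked.set(strain, `Strains in ${fileName}`);
  }
  return marked;
}
```

Strains in the tree that are absent from the file simply get no value, which is what makes the derived coloring usable as a filter.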

@emmahodcroft
Member

So is this the colouring that just defines what was in the TSV?
[image]

This was a little confusing to me and I assumed it was a bug until I came back and re-read your points more carefully. I like the idea, but maybe you could show it on the drop-down as "Strains in XX.tsv" or similar?

Apart from that this seems to work great! Is it worth creating another random file for another build, or do you think testing this on the Zika tutorial is sufficient? I have some files for the LA builds which I could probably add random columns to - but maybe not until next week (sorry!).

@cassiawag
Contributor

I tested this out on the WA trees (https://dev.nextstrain.org/groups/blab/ncov/wa/1y) w/ real metadata, and overall, these are fantastic new features that work really well!

Specifically, I tested:

  • specifying new colors with hexcodes
  • adding lat & long
  • filtering by added metadata

Comments:

  • All the features worked well and as expected on the real data, including on fields containing spaces!
  • For this American speller, I would just encourage strong documenting that you must use __colour and not __color when adding hexcodes.
  • Should clearly document that if you add latitude and longitude and choose the geographic resolution of your added metadata, this will only display sequences with lat & long in your metadata on the map. This behavior was initially unexpected for me. But on thought, I think this makes sense for now as I'm not sure what geo resolution you would overlay added lat & longs on. In the future, maybe you could choose to overlay my_geo on any of the geo layers, like my_geo + division or my_geo + location or my_geo + region? I'm not sure what would be best; this would probably depend on how users interact with the map, and it's not a function I use very often.
  • I think the suggestion by @emmahodcroft to make the dropdown say "Strains in XXX.tsv" is a good one!

Thanks again for the awesome work, James!! These are super useful features.

@emmahodcroft
Member

emmahodcroft commented Dec 22, 2020

Very good point about spelling @cassiawag ! @jameshadfield would it be possible to let the code accept both color and colour? I know it's reflexive to type whichever you're used to 🙃
Otherwise, (and know it pains me to write this), we may be better standardising to color as this is more common in programming languages, so users such as myself who prefer colour are probably more used to switching to color for 'programmy' things than US users switching to colour 😛

Allows for a common use case where one wants to filter the dataset to those samples in the CSV/TSV.
Previously only `<trait>__colour` was valid to specify the colour hexes, which was chosen to maximise compatibility with Microreact format files. Here we allow `<trait>__color` to be used as an alternate spelling. If both are specified, then `__colour` is used.
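
The precedence rule can be sketched as follows. `findColourColumn` is a hypothetical helper, not the function used in auspice.

```javascript
// Return the header that defines colours for `trait`, preferring the
// "__colour" spelling over "__color" when both are present.
function findColourColumn(trait, headers) {
  const british = `${trait}__colour`;
  const american = `${trait}__color`;
  if (headers.includes(british)) return british;
  if (headers.includes(american)) return american;
  return null; // no colour column: colours are chosen automatically
}
```

A file with only `random__color` therefore works, and a file containing both spellings uses `random__colour`.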
This interprets specific lat/long columns in the CSV/TSV as associating a strain with a geographic location. As this approach defines lat-longs _per-sample_ it is orthogonal to Nextstrain's approach (where we associate coords to a metadata trait). The approach employed here is to create a new (dummy) trait whose values represent the unique lat/longs provided.

If a JSON does not define any lat/longs then the appropriate additional metadata will trigger the map to become available (and displayed).
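
A sketch of the dummy-trait idea described above, under stated assumptions: the column names `latitude` and `longitude` match the example TSV, the trait value is an arbitrary string key built from the coordinates, and `buildLatLongTrait` is a hypothetical name.

```javascript
// Build a dummy trait from per-sample coordinates. Strains sharing the same
// lat/long get the same trait value, so they are grouped together on the map.
function buildLatLongTrait(rows) {
  const strainToValue = {}; // strain name -> dummy trait value
  const coordinates = {}; // dummy trait value -> { latitude, longitude }
  for (const row of rows) {
    const lat = Number.parseFloat(row.latitude);
    const lon = Number.parseFloat(row.longitude);
    // Skip rows with missing or out-of-range coordinates.
    if (!Number.isFinite(lat) || !Number.isFinite(lon)) continue;
    if (lat < -90 || lat > 90 || lon < -180 || lon > 180) continue;
    const key = `${lat}_${lon}`;
    coordinates[key] = { latitude: lat, longitude: lon };
    strainToValue[row.strain] = key;
  }
  return { strainToValue, coordinates };
}
```

Only strains with usable coordinates receive a value for the new trait, which matches the behaviour noted in testing: choosing it as the geographic resolution shows only those strains on the map.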
jameshadfield force-pushed the drag-drop-metadata-improvements branch from 8a0fe85 to 9c15c4a (January 5, 2021 20:37)
jameshadfield merged commit 0423d6f into master (Jan 5, 2021)
jameshadfield deleted the drag-drop-metadata-improvements branch (January 5, 2021 20:39)
@jameshadfield
Member Author

Thanks for the testing and comments @cassiawag & @emmahodcroft -- I've updated the documentation, allowed American spellings (color), and changed the value to "Strains in XXX.tsv" which makes the filtering look much nicer! P.S. Microreact uses __colour which is why I initially chose it - but it makes sense to allow __color as well; if both are supplied then __colour takes precedence.
