Flag use of GISAID data in exported Auspice JSON #691

Closed
trvrb opened this issue Mar 6, 2021 · 9 comments · Fixed by #705
Labels
easy problem Requires less work than most issues enhancement New feature or request priority: high To be resolved before other issues

Comments

@trvrb
Member

trvrb commented Mar 6, 2021

Context

Currently, Auspice checks whether the URL contains nextstrain.org/ncov to decide whether to insert "enabled by data from GISAID" into the byline (https://github.com/nextstrain/auspice/blob/master/src/components/info/byline.js#L26). However, this is pretty limiting, and there are emerging pages under /groups and /community that use GISAID data.

Description

I think we should allow another element in the Auspice JSON schema: gisaid_data: true. Auspice would look at this value to decide whether or not to show "enabled by data from GISAID" in the byline.

This would require updating augur export to allow gisaid_data in the --auspice-config JSON file and would require updating the JSON v2 schema here: https://github.com/nextstrain/augur/blob/master/augur/data/schema-export-v2.json.

With Augur updated, we'd then need to update the ncov build. Once the live JSON files have been updated with this field, we can update Auspice to work from this input rather than from the URL string.
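
A minimal sketch of the difference between the two approaches, written in Python purely for illustration (Auspice itself is JavaScript). The function names, the example URL, and the placement of `gisaid_data` in a `meta` dict are assumptions, not Auspice's actual code:

```python
def flag_gisaid_from_url(url: str) -> bool:
    """Current heuristic: infer GISAID usage from the dataset URL."""
    return "nextstrain.org/ncov" in url


def flag_gisaid_from_json(meta: dict) -> bool:
    """Proposed approach: read an explicit flag from the dataset JSON."""
    return meta.get("gisaid_data", False) is True


# Hypothetical community build that uses GISAID data.
community_url = "https://nextstrain.org/community/example-lab/ncov"
community_meta = {"title": "Example build", "gisaid_data": True}

print(flag_gisaid_from_url(community_url))    # URL heuristic misses it
print(flag_gisaid_from_json(community_meta))  # explicit flag catches it
```

The point is only that the flag travels with the data, so it works regardless of where the dataset is hosted.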

@trvrb trvrb added enhancement New feature or request priority: moderate To be resolved after high priority issues easy problem Requires less work than most issues labels Mar 6, 2021
@rneher
Member

rneher commented Mar 7, 2021

I think this is a good direction, but I'd rather have this more generic. For example

data_provenance: 'gisaid'

@emmahodcroft
Member

I think this would be very useful, and Richard's idea is a good one: it would leave the door open to other acknowledgements in the future, if desired.

@trvrb
Member Author

trvrb commented Mar 7, 2021

Great idea @rneher. Perhaps in this case we'd want an array? I could imagine entries like:

data_provenance: ["gisaid"]
data_provenance: ["insdc", "cog"]

Do we want this to be free-form or would it be a restricted library of options? There could be entries like:

data_provenance: ["https://github.com/inrb-drc"]

for Ebola. And in this direction, you could imagine asking for full URLs so others can better track down data. The above would then be:

data_provenance: ["https://www.gisaid.org/"]
data_provenance: ["https://www.ncbi.nlm.nih.gov/", "https://www.cogconsortium.uk/"]

Or if we wanted to get to the point of automatically flagging data sources we'd want something like we have for maintainers, ie:

"data_provenance": [
  {
    "name": "Genbank",
    "url": "https://www.ncbi.nlm.nih.gov/"
  },
  {
    "name": "COG UK",
    "url": "https://www.cogconsortium.uk/"
  }
],

In this direction, the name GISAID or the URL gisaid.org would be special cased in Auspice to use the "enabled by data from GISAID" logo, but other uses could work like maintainers do now.

@rneher
Member

rneher commented Mar 7, 2021

I'd probably opt for the most expressive solution. Seems most future proof.

@huddlej
Contributor

huddlej commented Mar 10, 2021

What if the data provenance was an annotation in the metadata itself (e.g., the db column of the Zika metadata)?

These annotations could allow Auspice to determine the data source attributions from the metadata itself and save us from manual curation of an Auspice config file that could get out of sync with the data (for instance, if we add COG data to a build but forget to update the auspice JSON). With those annotations in the metadata, we could also enable filtering in Auspice by the data source field.

We track this information for some (most?) pathogens in fauna already. We would still need to maintain some mapping of the terms in the metadata to data source records that contain full name, URL, etc. for display in Auspice (something like the last example above, but maybe indexed by metadata field values). That mapping could live wherever makes the most sense technically.
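
To make the idea concrete, here is a rough Python sketch of deriving provenance entries from a metadata column. The `SOURCE_RECORDS` mapping, its keys, the function name, and the sample metadata are all hypothetical; only the `db` column name and the two name/URL records come from this thread:

```python
import csv
import io

# Hypothetical mapping from metadata field values to display records.
SOURCE_RECORDS = {
    "genbank": {"name": "Genbank", "url": "https://www.ncbi.nlm.nih.gov/"},
    "cog": {"name": "COG UK", "url": "https://www.cogconsortium.uk/"},
}


def provenance_from_metadata(tsv_text: str, column: str = "db") -> list:
    """Collect distinct data sources from a metadata TSV, in first-seen order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = []
    for row in reader:
        key = (row.get(column) or "").strip().lower()
        if key and key not in seen:
            seen.append(key)
    # Values without a mapping fall back to a name-only record.
    return [SOURCE_RECORDS.get(key, {"name": key}) for key in seen]


# Made-up metadata for illustration.
metadata = "strain\tdb\nA/1\tgenbank\nA/2\tcog\nA/3\tgenbank\n"
print(provenance_from_metadata(metadata))
```

Because the list is computed from the rows themselves, adding COG data to a build would update the attribution without anyone touching the Auspice config.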

@trvrb
Member Author

trvrb commented Mar 18, 2021

I see what you're going for here John. But I think a more manageable initial solution is the

"data_provenance": [
  {
    "name": "Genbank",
    "url": "https://www.ncbi.nlm.nih.gov/"
  },
  {
    "name": "COG UK",
    "url": "https://www.cogconsortium.uk/"
  }
],

block in the Auspice JSON. This would be manually updated for the moment, but could be something that's eventually automatically generated from metadata.tsv.

I'm bumping priority of this issue as it's now clear that it's necessary for others to surface this information.

@trvrb trvrb added priority: high To be resolved before other issues and removed priority: moderate To be resolved after high priority issues labels Mar 18, 2021
@dpark01

dpark01 commented Mar 18, 2021

This generalized approach sounds great to me. So what does the expression for GISAID look like?

"data_provenance": [
  {
    "name": "GISAID",
    "url": "https://www.gisaid.org/",
    "image": "https://www.gisaid.org/fileadmin/gisaid/img/schild.png"
  }
],

And auspice would turn that into an "enabled by data from [name](url)" unless image was defined, in which case it becomes "enabled by data from [image](url)"? (or actually a comma-separated list of data_provenance entries)

@trvrb
Member Author

trvrb commented Mar 18, 2021

Excellent point. I was imagining that we'd special-case a situation of data_provenance.name = "GISAID" to automatically replace with the GISAID logo. Other name/url pairs would just be linked text. So a hypothetical:

"data_provenance": [
  {
    "name": "GISAID",
    "url": "https://gisaid.org"
  },
  {
    "name": "COG UK",
    "url": "https://www.cogconsortium.uk"
  }
],

would be "Enabled by data from ((GISAID)) and COG UK" where "((GISAID))" is the logo that links to gisaid.org and "COG UK" is text that links to www.cogconsortium.uk.
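
As a rough illustration of that rendering rule, here is a Python sketch (Auspice itself is JavaScript, so this is not its implementation). The function names, the `((GISAID))` placeholder standing in for the logo, and the Markdown-style link output are assumptions made for the example:

```python
GISAID_LOGO = "((GISAID))"        # placeholder standing in for the logo image
GISAID_URL = "https://gisaid.org"


def render_source(entry: dict) -> str:
    """Render one provenance entry as a Markdown-style link."""
    name = entry["name"]
    if name == "GISAID":
        # Special case: show the logo and link to GISAID regardless of "url".
        return f"[{GISAID_LOGO}]({GISAID_URL})"
    url = entry.get("url")
    return f"[{name}]({url})" if url else name


def render_byline(data_provenance: list) -> str:
    """Join rendered entries as 'A', 'A and B', or 'A, B and C'."""
    parts = [render_source(entry) for entry in data_provenance]
    if not parts:
        return ""
    if len(parts) == 1:
        joined = parts[0]
    else:
        joined = ", ".join(parts[:-1]) + " and " + parts[-1]
    return "Enabled by data from " + joined


example = [
    {"name": "GISAID", "url": "https://gisaid.org"},
    {"name": "COG UK", "url": "https://www.cogconsortium.uk"},
]
print(render_byline(example))
```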

However, automatically grabbing an image is an interesting idea. There'd need to be some way to size it appropriately, though.

@dpark01

dpark01 commented Mar 18, 2021

Yeah, I think either GISAID gets special treatment in this spec or it doesn't, so it's either something like I originally proposed (where even the image URL is fully described, though then yes, dimensions become an issue) or it could be as simple as:

"data_provenance": [
  {
    "name": "GISAID"
  },
  {
    "name": "COG UK",
    "url": "https://www.cogconsortium.uk"
  }
],

No point asking the user to supply the URL to GISAID if we're already supplying the logo.
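
Under that simpler shape, each entry needs a name and the URL becomes optional. A hand-rolled Python check for such a structure might look like the sketch below. This is an assumption about what the rules could be, not the actual definitions in schema-export-v2.json, and the function name is hypothetical:

```python
def validate_data_provenance(value) -> list:
    """Return a list of error messages; an empty list means the value is valid."""
    if not isinstance(value, list):
        return ["data_provenance must be a list"]
    errors = []
    for i, entry in enumerate(value):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: must be an object")
            continue
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            errors.append(f"entry {i}: 'name' is required and must be a non-empty string")
        # Assumed rule: "url" is optional, but must be a string when present.
        if "url" in entry and not isinstance(entry["url"], str):
            errors.append(f"entry {i}: 'url' must be a string when present")
    return errors


config = [
    {"name": "GISAID"},
    {"name": "COG UK", "url": "https://www.cogconsortium.uk"},
]
print(validate_data_provenance(config))
```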

@huddlej huddlej self-assigned this Mar 19, 2021
huddlej added a commit that referenced this issue Mar 19, 2021
Adds support for a `data_provenance` field in the auspice v2 config and
exported auspice v2 JSONs through additions of schema definitions for
`data_provenance` and inclusion of an example provenance entry in the
Zika build's auspice config.

Fixes #691

5 participants