update pending dgidb API #40

Closed
colleenXu opened this issue Dec 16, 2021 · 18 comments

@colleenXu

BTE is using https://pending.biothings.io/dgidb quite a bit, and I believe it hasn't been updated since it was made (in 2018/2019?).

We are interested in updating BOTH the parser and the data for this API. The API currently puts biolink-model predicates in the association.edge_label field, so we may want to either remove that field or map the DGIdb relations to the most recent version of the biolink-model.

@andrewsu
Member

Before refreshing our pending API, @colleenXu, can you look at whether we can annotate their API for use within BTE directly? https://www.dgidb.org/api

@colleenXu
Author

colleenXu commented Dec 28, 2021

We could theoretically annotate the interactions endpoint with x-bte annotation, but there are some issues...

  • the genes field can accept Entrez IDs (not just gene symbols / names). However, the response will be under the matchedTerms field or ambiguousTerms field, depending on how their API was able to match the search term (and I don't see a clear pattern for when it'll do one vs the other). At the moment, to handle looking at two different response-mappings, we'd need duplicate operations / sub-queries...
  • the drugs field expects names. I've tried inputting chembl ID in different ways and haven't been successful in getting the record out.....so I think the "reverse operation" of drugs -> genes won't work...
  • This API is slower than a BioThings API would be (takes ~ 7 sec to return 1 gene's interactions) and likely has less capacity for batch-querying (it can accept a comma-delimited list for a GET parameter and the documentation says POST is possible)
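
On the client side, one way to avoid handling the two fields separately is to merge them after the response is parsed. This is only a minimal sketch: it assumes each of matchedTerms and ambiguousTerms holds a list of JSON objects (the entry layout is an assumption, not taken from the DGIdb docs), and collect_term_matches is a hypothetical helper. It does not solve the x-bte response-mapping problem itself.

```python
from typing import Any, Dict, List


def collect_term_matches(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Merge entries from both response fields into one list.

    Assumption: each field, when present, is a list of dicts.
    """
    matches = []
    for key in ("matchedTerms", "ambiguousTerms"):
        # A missing or null field is treated as an empty list.
        for entry in response.get(key) or []:
            # Keep track of the field each entry came from, since
            # ambiguous hits may need extra review downstream.
            matches.append({"source_field": key, **entry})
    return matches
```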

@colleenXu
Author

colleenXu commented Jan 10, 2022

noting current parser issues:

  • some chem may have incorrectly imported IDs (have a wikidata ID and a non-matching CHEMBL ID in the object.id field...)
  • the relationships in DGIdb are drug -> gene, but the API is set up with subject as gene and object as chem

@erikyao
Contributor

erikyao commented Apr 7, 2022

Hi @colleenXu, could you please double check the mapping from DGIdb interaction types to Biolink predicates?

Within the DGIdb Feb 2022 interactions.tsv, the interaction_types column has values like inhibitor, blocker, etc. My understanding is that, taking blocker as an example, biolink-model.yaml lists DGIdb:blocker as one of the narrow_mappings of the predicate decreases activity of. We would therefore map the blocker interaction type to the decreases_activity_of predicate (spaces replaced by underscores).

Please correct me if I am wrong.
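
The mapping rule described above could be sketched roughly as follows. Only the blocker example from this comment is filled in; the dictionary name and to_edge_label are hypothetical, and any other entries would have to be read from the narrow_mappings in biolink-model.yaml rather than guessed.

```python
from typing import Optional

# DGIdb interaction type -> Biolink predicate (as written in biolink-model.yaml).
# Only the example discussed in this thread is listed here.
INTERACTION_TYPE_TO_PREDICATE = {
    "blocker": "decreases activity of",
}


def to_edge_label(interaction_type: str) -> Optional[str]:
    """Return the underscore form of the mapped predicate, or None if unmapped."""
    predicate = INTERACTION_TYPE_TO_PREDICATE.get(interaction_type)
    if predicate is None:
        return None
    # Spaces in the predicate name are replaced by underscores.
    return predicate.replace(" ", "_")
```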

@colleenXu
Author

@erikyao @andrewsu My understanding is that we don't need to map from DGIdb interaction types to Biolink predicates in the API....this can be handled in the x-bte annotation.

In the past, Kevin had included this mapping in the parser for some reason...

@erikyao
Contributor

erikyao commented Apr 8, 2022

Hi @andrewsu @colleenXu , the latest DGIdb Feb 2022 interactions.tsv has a new column interaction_group_score, which is actually the "Interaction Score" as explained here.

E.g. the interaction between BMS-387032 and CDK7 has a score of 0.82, as shown here.

Shall we integrate these scores into a field, say association.score? Thanks!

Never mind. The scores were already being parsed; the comments in the parser are just outdated.

@erikyao
Contributor

erikyao commented Apr 8, 2022

Hi @colleenXu, w.r.t.

  • some chem may have incorrectly imported IDs (have a wikidata ID and a non-matching CHEMBL ID in the object.id field...)

I noticed that in the tsv file, the column drug_concept_id contains:

  1. empty values
  2. wikidata IDs in the form of wikidata:Q<num>, such as wikidata:Q419808 for SAPROPTERIN
  3. chembl IDs in the form of chembl:CHEMBL<num>, such as chembl:CHEMBL942 for BISACODYL

For cases 1 and 2, I'll query MyChem by drug name to get the chembl IDs, which should solve the wikidata problem.
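
The three cases above can be told apart with a small classifier like the one below. This is a sketch only: the function name and the returned labels are hypothetical, the patterns follow the two ID forms quoted above, and the MyChem lookup itself is not shown.

```python
import re
from typing import Optional, Tuple

WIKIDATA_RE = re.compile(r"^wikidata:(Q\d+)$")
CHEMBL_RE = re.compile(r"^chembl:(CHEMBL\d+)$")


def classify_drug_concept_id(value: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (kind, local_id) for a drug_concept_id cell.

    kind is one of "empty", "wikidata", "chembl", or "other".
    """
    value = (value or "").strip()
    if not value:
        return ("empty", None)
    match = WIKIDATA_RE.match(value)
    if match:
        return ("wikidata", match.group(1))
    match = CHEMBL_RE.match(value)
    if match:
        return ("chembl", match.group(1))
    # Anything unexpected is passed through so it can be inspected later.
    return ("other", value)
```

Rows classified as "empty" or "wikidata" are the ones that would need the name-based lookup.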

Could you explain what you mean by "a non-matching CHEMBL ID in the object.id field"? Thank you!


W.r.t.

  • the relationships in DGIdb are drug -> gene, but the API is set up with subject as gene and object as chem

I'll swap object and subject fields in the parser. The column-to-field relation would be like:

| Column Index | Column Name | Key Name |
| --- | --- | --- |
| 0 | gene_name | object.SYMBOL |
| 1 | gene_claim_name | |
| 2 | entrez_id | object.NCBIGene |
| 3 | interaction_claim_source | association.provided_by |
| 4 | interaction_types | association.relation_name |
| 5 | drug_claim_name | |
| 6 | drug_claim_primary_name | |
| 7 | drug_name | subject.name |
| 8 | drug_concept_id | subject.CHEMBL_COMPOUND |
| 9 | interaction_group_score | association.interaction_group_score |
| 10 | PMIDs | association.pubmed |
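
The column-to-field relation above could look roughly like this in code. It is a sketch, not the actual parser.py: it assumes the tsv has a header row with these column names, row_to_doc and parse_interactions are hypothetical names, and all values are kept as strings without any ID cleanup.

```python
import csv
import io

# TSV column name -> (section, key), taken from the table above.
# Columns with no key name in the table are left out.
COLUMN_TO_FIELD = {
    "gene_name": ("object", "SYMBOL"),
    "entrez_id": ("object", "NCBIGene"),
    "interaction_claim_source": ("association", "provided_by"),
    "interaction_types": ("association", "relation_name"),
    "drug_name": ("subject", "name"),
    "drug_concept_id": ("subject", "CHEMBL_COMPOUND"),
    "interaction_group_score": ("association", "interaction_group_score"),
    "PMIDs": ("association", "pubmed"),
}


def row_to_doc(row):
    """Build a subject/object/association document from one TSV row."""
    doc = {"subject": {}, "object": {}, "association": {}}
    for column, (section, key) in COLUMN_TO_FIELD.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of storing empty strings
            doc[section][key] = value
    return doc


def parse_interactions(tsv_text):
    """Parse tab-separated text (with a header row) into a list of documents."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row_to_doc(row) for row in reader]
```

Keeping the mapping in one dictionary means a field rename only touches a single line.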

@colleenXu
Author

colleenXu commented Apr 10, 2022

On the wikidata vs chembl.compound IDs

I don't quite remember what was going on, and I can't find an example in the current API. My guess is that every record in the current API has an object.CHEMBL_COMPOUND field, and I found some wikidata IDs in that field for some records...

To clarify:

  • I'm not sure what you mean by "ensembl ids" here. It sounds like you're dealing with CHEMBL IDs? specifically CHEMBL.COMPOUND IDs? And you'd query MyChem rather than MyGene?
  • are you mapping drug names (when the drug_concept_id field is empty) and drug wikidata IDs (when the drug_concept_id is a wikidata ID) to CHEMBL.COMPOUND ids - so the final biothings API only has chembl.compound IDs for chemicals? that'd be awesome

the column-to-field table

I'd like to keep the tsv column names in some cases....

  • column index 3: association.interaction_claim_source
  • column index 4: association.interaction_types
  • column 7: subject.drug_name
  • column 10: association.pmids

@colleenXu
Author

colleenXu commented Apr 10, 2022

Also, my understanding has been that:

  • subject.id has the prefix-id combo (CHEMBL.COMPOUND:1234) while subject.CHEMBL_COMPOUND would only have the ID (1234)
  • same with object.id (NCBIGene:1234) vs object.NCBIGene (1234)
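
To illustrate the convention as I understand it, here is a rough sketch. split_curie and build_node are hypothetical helpers, 1234 is the same placeholder ID as above, and the "." to "_" substitution is inferred from the CHEMBL.COMPOUND / CHEMBL_COMPOUND naming in this thread.

```python
def split_curie(curie):
    """Split "PREFIX:ID" into (prefix, local_id)."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError("not a prefixed ID: %r" % (curie,))
    return prefix, local_id


def build_node(curie):
    """Return a node with the full prefixed ID plus a bare-ID field."""
    prefix, local_id = split_curie(curie)
    # The namespace field uses "_" where the prefix has ".",
    # e.g. CHEMBL.COMPOUND -> CHEMBL_COMPOUND.
    return {"id": curie, prefix.replace(".", "_"): local_id}
```

With this, build_node("CHEMBL.COMPOUND:1234") keeps the prefix only on the id field, and build_node("NCBIGene:1234") does the same for genes.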

The current parser keeps the CHEMBL.COMPOUND prefix on both fields of the object (id + CHEMBL_COMPOUND) right now...

(note: I don't know where the convention above comes from. I get the sense that it's a biothings api / our lab thing. BTE used to need prefixes/no-prefixes on certain ID namespaces....but it seems to be doing okay right now...)

@erikyao
Contributor

erikyao commented Apr 11, 2022

> • I'm not sure what you mean by "ensembl ids" here. It sounds like you're dealing with CHEMBL IDs? specifically CHEMBL.COMPOUND IDs? And you'd query MyChem rather than MyGene?

Sorry, that was a typo. I meant chembl IDs, and yes, it's MyChem.

> • are you mapping drug names (when the drug_concept_id field is empty) and drug wikidata IDs (when the drug_concept_id is a wikidata ID) to CHEMBL.COMPOUND ids - so the final biothings API only has chembl.compound IDs for chemicals? that'd be awesome

yes, that's what I am going to do.

> The current parser keeps the CHEMBL.COMPOUND prefix on both fields of the object (id + CHEMBL_COMPOUND) right now...

I can fix this issue as well.

@erikyao
Contributor

erikyao commented Apr 11, 2022

@colleenXu, since we are going to remove the mapped predicates, the association.edge_label field will be removed as well. It was originally set at parser.py#L122.

@erikyao
Contributor

erikyao commented Apr 12, 2022

https://biothings.ncats.io/dgidb updated

erikyao closed this as completed Apr 12, 2022

@colleenXu
Author

@erikyao it looks like the field name suggestions here weren't addressed...what were your thoughts?

@colleenXu
Author

colleenXu commented Apr 13, 2022

@erikyao also, most of the API records seem to not have an association.relation_name field....it would help to give these records some kind of value. Right now BTE cannot get these associations...

I see "n/a" and "other/unknown" interaction types in https://dgidb.org/interaction_types so it would help to maybe keep these values. Maybe by having the values not_applicable and other_unknown....

@colleenXu
Author

Updated the x-bte annotation for dgidb; it still has the issues noted above: NCATS-Tangerine/translator-api-registry@0307491

@erikyao
Contributor

erikyao commented Apr 13, 2022

> the column-to-field table
>
> I'd like to keep the tsv column names in some cases....
>
>   • column index 3: association.interaction_claim_source
>   • column index 4: association.interaction_types
>   • column 7: subject.drug_name
>   • column 10: association.pmids

Revised as suggested

@erikyao
Contributor

erikyao commented Apr 13, 2022

> @erikyao also, most of the API records seem to not have an association.relation_name field....it would help to give these records some kind of value. Right now BTE cannot get these associations...
>
> I see "n/a" and "other/unknown" interaction types in https://dgidb.org/interaction_types so it would help to maybe keep these values. Maybe by having the values not_applicable and other_unknown....

not_applicable is now used for rows whose interaction type is empty.
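
A rough sketch of that normalization, for reference. normalize_interaction_type is a hypothetical name, and the handling of "n/a" and "other/unknown" follows the suggestion quoted above rather than anything confirmed to be in the parser.

```python
def normalize_interaction_type(raw):
    """Return a usable interaction type, filling in a value for empty cells."""
    value = (raw or "").strip()
    key = value.lower()
    if key in ("", "n/a"):
        # Empty cells get an explicit value so the record stays queryable.
        return "not_applicable"
    if key == "other/unknown":
        return "other_unknown"
    # Any other value is kept as it appears in the tsv.
    return value
```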

@colleenXu
Author

colleenXu commented Apr 15, 2022

These are the interaction_types in the raw data. I've used 12 unique interaction_type values for x-bte annotations (24 operations total for forward + reverse). NCATS-Tangerine/translator-api-registry@5ac52b7

  • activator
  • agonist
  • allosteric_modulator
  • antagonist
  • antibody
  • blocker
  • inhibitor
  • inverse_agonist
  • modulator
  • not_applicable (as mentioned above, a lot of rows had no value for interaction_types...)
  • partial_agonist
  • positive_modulator

For the other interaction_types in the data, I wrote the x-bte annotations but commented them out, because there were only a few records for each operation:

  • antisense_oligonucleotide: 4 records
  • inducer: 1 record
  • inhibitory_allosteric_modulator: 1 record
  • negative_modulator: 5 records
  • suppressor: 1 record
  • vaccine: 8 records
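
As a quick check of the counts above, the two lists can be written out as data. This is just an illustrative sketch: the (interaction_type, direction) tuples are a hypothetical stand-in for the actual x-bte operations in the registry YAML.

```python
# The 12 interaction_types with active x-bte operations.
ACTIVE_TYPES = [
    "activator", "agonist", "allosteric_modulator", "antagonist",
    "antibody", "blocker", "inhibitor", "inverse_agonist",
    "modulator", "not_applicable", "partial_agonist", "positive_modulator",
]

# interaction_types whose annotations are commented out -> record count.
COMMENTED_OUT = {
    "antisense_oligonucleotide": 4,
    "inducer": 1,
    "inhibitory_allosteric_modulator": 1,
    "negative_modulator": 5,
    "suppressor": 1,
    "vaccine": 8,
}

DIRECTIONS = ("forward", "reverse")


def planned_operations(interaction_types):
    """One operation per (interaction_type, direction) pair."""
    return [(t, d) for t in interaction_types for d in DIRECTIONS]


operations = planned_operations(ACTIVE_TYPES)
print(len(operations))              # 24 operations (12 types x 2 directions)
print(sum(COMMENTED_OUT.values()))  # 20 records not covered by active operations
```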
