CTD processing 3: handling output IDs when multiple ID prefixes are possible #585

Open
colleenXu opened this issue Mar 14, 2023 · 8 comments

Comments

@colleenXu
Collaborator

colleenXu commented Mar 14, 2023

Intro: see intro section of #583 (comment). Originally noted in #558 (comment)

3. handling output IDs when multiple ID prefixes are possible

Some operations are commented-out because BTE isn't properly handling the output IDs when multiple ID prefixes are possible. This happens when the output is a disease ID (which can be MESH or OMIM) or a pathway ID (which can be REACT or KEGG).

For example, BTE will fail to recognize that the API response returned both MESH and OMIM Disease IDs and will instead assign all the Disease IDs to the one ID-prefix assigned by the operation.

Edit SmartAPI yaml + run BTE locally

In a local copy of the SmartAPI yaml, uncomment the chemical2disease_1 and chemical2disease_2 operations (lines 125, 127, 211-231, 253-273, 548-551, 556-559).

Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:D004317"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}
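
A minimal sketch of building that POST in Python, for anyone who wants to script the reproduction. The base URL (`http://localhost:3000`) and the helper name `build_query_request` are assumptions rather than anything stated in this issue, and the SmartAPI ID stays a parameter because only `{id}` is given above. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: address of the local BTE instance. Adjust to your setup.
BTE_BASE_URL = "http://localhost:3000"


def build_query_request(smartapi_id: str, chemical_id: str) -> urllib.request.Request:
    """Build (but do not send) the TRAPI POST for one SmartAPI registration."""
    trapi_query = {
        "message": {
            "query_graph": {
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:related_to"],
                    }
                },
                "nodes": {
                    "n0": {
                        "ids": [chemical_id],
                        "categories": ["biolink:SmallMolecule"],
                    },
                    "n1": {"categories": ["biolink:Disease"]},
                },
            }
        }
    }
    return urllib.request.Request(
        url=f"{BTE_BASE_URL}/v1/smartapi/{smartapi_id}/query",
        data=json.dumps(trapi_query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the SmartAPI ID to pass in is whatever your local CTD registration uses.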

CTD's raw response

During execution, BTE should generate this query to CTD.

In CTD's raw response, some Disease IDs are MESH, like MESH:D015746 / Abdominal Pain, and others are OMIM, like OMIM:610141 / QT INTERVAL, VARIATION IN.

    {
        "CasRN": "23214-92-8",
        "ChemicalID": "D004317",
        "ChemicalName": "Doxorubicin",
        "DirectEvidence": "marker/mechanism",
        "DiseaseCategories": "Signs and symptoms",
        "DiseaseID": "MESH:D015746",
        "DiseaseName": "Abdominal Pain",
        "Input": "d004317",
        "PubMedIDs": "3712578|6542584"
    },


    {
        "CasRN": "23214-92-8",
        "ChemicalID": "D004317",
        "ChemicalName": "Doxorubicin",
        "DirectEvidence": "marker/mechanism",
        "DiseaseCategories": "Cardiovascular disease|Pathology (process)",
        "DiseaseID": "OMIM:610141",
        "DiseaseName": "QT INTERVAL, VARIATION IN",
        "Input": "d004317",
        "PubMedIDs": "12597018|7919046"
    },

BTE's current flawed response

BTE will run the operation with MESH Disease outputs and find the OMIM ID in CTD's response. It'll then strip the OMIM prefix off the ID and assign it as a MESH ID. This results in a flawed record: an edge to MESH:610141 (when the original ID was OMIM:610141 / QT INTERVAL, VARIATION IN).

                "034d46e56b095c750619cc51ee2cb1bf": {
                    "predicate": "biolink:related_to",
                    "subject": "PUBCHEM.COMPOUND:31703",
                    "object": "MESH:610141",

It'll then behave similarly for the operation with OMIM Disease outputs and the MESH IDs in CTD's response. So there'll be flawed records: edges like this one to OMIM:D015746 (when the original ID was MESH:D015746 / Abdominal Pain).

                "0ca3845c254a304918933c581de85ae4": {
                    "predicate": "biolink:related_to",
                    "subject": "PUBCHEM.COMPOUND:31703",
                    "object": "OMIM:D015746",
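
Both flawed edges are consistent with a transform that drops whatever prefix the API returned and prepends the prefix declared on the operation. The sketch below only mimics that observed behavior; `relabel_with_operation_prefix` is a hypothetical name and this is not BTE's actual code.

```python
def relabel_with_operation_prefix(raw_id: str, operation_prefix: str) -> str:
    """Mimic the flawed behavior: drop any prefix, then force the operation's prefix."""
    local_id = raw_id.split(":", 1)[-1]
    return f"{operation_prefix}:{local_id}"


# The two DiseaseID values from CTD's raw response above.
ctd_disease_ids = ["MESH:D015746", "OMIM:610141"]

# The operation declared with MESH outputs mislabels the OMIM record.
mesh_operation = [relabel_with_operation_prefix(i, "MESH") for i in ctd_disease_ids]
# The operation declared with OMIM outputs mislabels the MESH record.
omim_operation = [relabel_with_operation_prefix(i, "OMIM") for i in ctd_disease_ids]

print(mesh_operation)  # ['MESH:D015746', 'MESH:610141']
print(omim_operation)  # ['OMIM:D015746', 'OMIM:610141']
```

Each operation gets one ID right and one wrong, which matches the pair of bad edges shown here.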

I think Biolink API / Monarch's post-query processing + SmartAPI yaml response-mapping (which is to fields that exist only after the post-processing) is able to handle this situation, so maybe a solution like that will work here. However, it's perhaps not ideal that multiple operations are written + the same query is done repeatedly for different post-query processing.

This problem is related to past discussions on supporting multiple ID prefixes/namespaces as input / output. I'm not sure how much refactoring of code / x-bte annotation would be needed for a general solution...

@rjawesome
Contributor

rjawesome commented Mar 25, 2023

We could have a "hasPrefix" option in the yaml under the output. If this is set to true, BTE would expect the ID prefix at the beginning of each ID in the output (i.e. it would parse "MESH:0000" and see that it is a "MESH" ID) instead of using the output ID type from the SmartAPI yaml. In this case, instead of the output ID being labeled by its ID type in the response mapping (like "MESH"), it could be named something generic (like "OUTPUT"). This should at least fix the output problem. I could work on this feature.

@rjawesome
Contributor

Work started on multiple-prefixes branch of smartapi-kg and api-response-transform

@colleenXu
Collaborator Author

@rjawesome sorry for the late reply. I'm having trouble understanding your proposal...could you provide an example of x-bte annotation edits you're proposing?

And as I rethink this issue, I wonder if some discussion would help:

  • does this proposal mean 1 operation is written in these situations with multiple output ID-namespaces?
    • how does that affect the outputs section of the operation?
    • how does this affect how x-bte annotation is processed (into BTE MetaEdges?)
  • does this proposal only cover situations where ONE response field returns output IDs in different namespaces?
    • that's probably fine?
    • related situation?: multiple response fields (each returns output IDs in a single namespace). This is handled by creating multiple operations right now. This is the situation for "investigate why BTE doesn't retrieve variant-disease relations from clinvar" (#548), and for many operations for core and pending BioThings APIs
    • related situation?: supporting multiple input ID namespaces in 1 operation? This is tricky because in some cases, the different input ID-namespaces correspond with different querying info (parameters/requestBody stuff)...so I think it's fine to keep writing separate operations for different input ID namespaces...

@rjawesome
Contributor

rjawesome commented Mar 29, 2023

My proposal would just be in the yaml outputs section like so (only for outputs not inputs)

outputs:
- semantic: Disease
  hasPrefix: true

Then in the response mapping you would put OUTPUT instead of a prefix like MESH, i.e.

chemical2disease_1:
  OUTPUT: data.DiseaseID  ## has a prefix; the ID type will be determined by the prefix
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
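
To make the intended semantics concrete, here is a rough sketch of how a transform might resolve one output ID with and without `hasPrefix`. The function name `resolve_output_id` and its parameters are hypothetical and only illustrate the proposal; the real code lives on the multiple-prefixes branch and may differ.

```python
from typing import Optional, Tuple


def resolve_output_id(
    raw_id: str, has_prefix: bool, declared_prefix: Optional[str] = None
) -> Tuple[str, str]:
    """Return (prefix, curie) for one output ID.

    has_prefix=True  -> trust the prefix embedded in the API value (OUTPUT mapping).
    has_prefix=False -> use the single ID type declared on the operation.
    """
    if has_prefix:
        prefix, sep, local_id = raw_id.partition(":")
        if not sep or not prefix or not local_id:
            raise ValueError(f"expected a prefixed ID, got {raw_id!r}")
        return prefix, f"{prefix}:{local_id}"
    if declared_prefix is None:
        raise ValueError("declared_prefix is required when has_prefix is False")
    return declared_prefix, f"{declared_prefix}:{raw_id}"


print(resolve_output_id("OMIM:610141", has_prefix=True))
print(resolve_output_id("D015746", has_prefix=False, declared_prefix="MESH"))
```

Raising on an unprefixed value when `has_prefix` is true is one possible choice; silently falling back to a default prefix would reintroduce the mislabeling this issue describes.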

@rjawesome
Contributor

rjawesome commented Mar 29, 2023

Another feature that could be added here, since outputs is already an array: if each output ID type corresponds to a different field in the JSON, you could just add more outputs, one per ID type. For example, the operation outputs could look like

outputs:
- semantic: Disease
  id: MESH
- semantic: Disease
  id: OMIM

Then the response mapping would look like

chemical2disease_1:
  OMIM: data.DiseaseIDomim  ## omim disease id is located here in the json from api
  MESH: data.DiseaseIDmesh  ## mesh disease id is located here in the json from api
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs

In this case, BTE could split this into two operations, or when the response is received from the API it could just check which ID type exists in the output.

@rjawesome
Contributor

The first feature (hasPrefix) is currently working in multiple-prefixes branch

@rjawesome
Contributor

rjawesome commented Mar 29, 2023

Another way to solve this problem could be to use jq post-processing together with the hasPrefix/OUTPUT feature. jq post-processing could move all the IDs into the same field in the JSON before response mapping. So the operation could have

transformers:
  wrap_jq: '{data: [.[] | if .DiseaseIDomim then .DiseaseID = "OMIM:" + .DiseaseIDomim else . end | if .DiseaseIDmesh then .DiseaseID = "MESH:" + .DiseaseIDmesh else . end]}'

Then the usage of hasPrefix and OUTPUT in the response mapping would be exactly the same as the first proposal.
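
For readers less familiar with jq, the same merge can be sketched in Python. This mirrors the jq expression above under the same assumption: `DiseaseIDomim` and `DiseaseIDmesh` are illustrative field names from this comment, not fields CTD actually returns. The function name `merge_disease_id_fields` is also hypothetical.

```python
from typing import Any, Dict, List


def merge_disease_id_fields(records: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Copy per-namespace ID fields into one prefixed DiseaseID field, wrapped in 'data'."""
    merged = []
    for record in records:
        record = dict(record)  # shallow copy so the caller's input is not mutated
        if record.get("DiseaseIDomim") is not None:
            record["DiseaseID"] = "OMIM:" + record["DiseaseIDomim"]
        # As in the jq version, the MESH branch runs second and wins if both fields exist.
        if record.get("DiseaseIDmesh") is not None:
            record["DiseaseID"] = "MESH:" + record["DiseaseIDmesh"]
        merged.append(record)
    return {"data": merged}


example = merge_disease_id_fields(
    [{"DiseaseIDomim": "610141"}, {"DiseaseIDmesh": "D015746"}]
)
print([r["DiseaseID"] for r in example["data"]])  # ['OMIM:610141', 'MESH:D015746']
```

Records with neither field pass through unchanged, just as the `else . end` branches do in jq.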

@colleenXu
Collaborator Author

colleenXu commented Mar 30, 2023

@rjawesome could you pause your work on this particular issue? and keep the work specific to this issue on a separate branch from use-jmes-path (maybe you've already done this with multiple-prefixes)?

After talking with @tokebe, we agreed that there are some larger-scale issues that still have to be worked out, like:

  • what are the situations where multiple possible ID-namespaces (inputs + outputs) happen
    • is this related at all to different semantic types
  • given the info from the point above, what counts as 1 x-bte operation
  • whether the format of x-bte operations needs changing
  • how x-bte operations are processed and used by BTE

so I plan to write an issue and start discussions on that. I think after those discussions, it'll be clearer what the actual requirements / behavior we want for this issue is...

[EDIT: oh, one thing for sure is that in this use case and similar situations (one field, multiple ID prefixes), processing of the raw API response WILL BE REQUIRED to organize the IDs by namespace]
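
As a rough illustration of that processing step, the sketch below buckets raw CTD records by the prefix found in their `DiseaseID` field. The helper name `group_by_namespace` and the `UNKNOWN` bucket are assumptions made for the example; how BTE should actually do this is what the planned discussion is meant to settle.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_namespace(
    records: List[Dict[str, Any]], id_field: str = "DiseaseID"
) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket raw API records by the prefix embedded in one ID field."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        prefix, sep, _ = str(record.get(id_field, "")).partition(":")
        # Values without a "prefix:" part go to a catch-all bucket instead of being guessed.
        grouped[prefix if sep else "UNKNOWN"].append(record)
    return dict(grouped)


# Trimmed versions of the two CTD records quoted at the top of this issue.
raw = [
    {"DiseaseID": "MESH:D015746", "DiseaseName": "Abdominal Pain"},
    {"DiseaseID": "OMIM:610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
]
print(sorted(group_by_namespace(raw)))  # ['MESH', 'OMIM']
```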
