CTD processing 3: handling output IDs when multiple ID prefixes are possible #585

colleenXu · 2023-03-14T09:08:06Z

Intro: see intro section of #583 (comment). Originally noted in #558 (comment)

3. handling output IDs when multiple ID prefixes are possible

Some operations are commented out because BTE doesn't properly handle output IDs when multiple ID prefixes are possible. This happens when the output is a disease ID (which can be MESH or OMIM) or a pathway ID (which can be REACT or KEGG).

For example, BTE will fail to recognize that the API response returned both MESH and OMIM Disease IDs and will instead assign all the Disease IDs to the single ID prefix declared by the operation.
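
A minimal Python sketch of that failure mode, using the two Disease IDs from CTD's response discussed in this issue. This is not BTE's actual code: both function names are hypothetical, and the second one only illustrates what prefix-aware handling might look like.

```python
def assign_operation_prefix(raw_id, operation_prefix):
    """Mimic the flawed behavior: drop whatever prefix the API returned
    and re-label the ID with the operation's single prefix."""
    local_id = raw_id.split(":", 1)[-1]
    return f"{operation_prefix}:{local_id}"


def keep_returned_prefix(raw_id, operation_prefix):
    """Hypothetical alternative: trust a prefix already present in the ID,
    falling back to the operation's prefix only when there is none."""
    return raw_id if ":" in raw_id else f"{operation_prefix}:{raw_id}"


# Disease IDs as they appear in CTD's raw response.
ctd_disease_ids = ["MESH:D015746", "OMIM:610141"]

# An operation whose output is declared as MESH.
flawed = [assign_operation_prefix(i, "MESH") for i in ctd_disease_ids]
kept = [keep_returned_prefix(i, "MESH") for i in ctd_disease_ids]

print(flawed)
print(kept)
```

With the first function the OMIM ID ends up relabeled as a MESH ID, which is the kind of flawed record shown further down.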

Edit SmartAPI yaml + run BTE locally

In a local copy of the SmartAPI yaml, uncomment the chemical2disease_1 and chemical2disease_2 operations (lines 125, 127, 211-231, 253-273, 548-551, 556-559).

Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:D004317"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}
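
A minimal sketch of building that POST request in Python with only the standard library. The base URL (`http://localhost:3000`) and the `YOUR_SMARTAPI_ID` placeholder are assumptions; substitute whatever your local BTE instance and the CTD SmartAPI registration use. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_bte_request(base_url, smartapi_id, query):
    """Build (but do not send) a POST to the API-specific query endpoint."""
    url = f"{base_url}/v1/smartapi/{smartapi_id}/query"
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same query graph as the JSON above.
query = {
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"],
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:D004317"],
                    "categories": ["biolink:SmallMolecule"],
                },
                "n1": {"categories": ["biolink:Disease"]},
            },
        }
    }
}

# Assumed local address and placeholder ID; adjust for your setup.
request = build_bte_request("http://localhost:3000", "YOUR_SMARTAPI_ID", query)

# To send it (requires a running local BTE instance):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```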

CTD's raw response

During execution, BTE should generate this query to CTD.

In CTD's raw response, some Disease IDs are MESH, like MESH:D015746 / Abdominal Pain, and others are OMIM, like OMIM:610141 / QT INTERVAL, VARIATION IN.

    {
        "CasRN": "23214-92-8",
        "ChemicalID": "D004317",
        "ChemicalName": "Doxorubicin",
        "DirectEvidence": "marker/mechanism",
        "DiseaseCategories": "Signs and symptoms",
        "DiseaseID": "MESH:D015746",
        "DiseaseName": "Abdominal Pain",
        "Input": "d004317",
        "PubMedIDs": "3712578|6542584"
    },


    {
        "CasRN": "23214-92-8",
        "ChemicalID": "D004317",
        "ChemicalName": "Doxorubicin",
        "DirectEvidence": "marker/mechanism",
        "DiseaseCategories": "Cardiovascular disease|Pathology (process)",
        "DiseaseID": "OMIM:610141",
        "DiseaseName": "QT INTERVAL, VARIATION IN",
        "Input": "d004317",
        "PubMedIDs": "12597018|7919046"
    },

BTE's current flawed response

BTE will run the operation with MESH Disease outputs and find the OMIM ID in CTD's response. It then strips the OMIM prefix off and assigns the ID as a MESH ID. The result is a flawed record: an Edge to MESH:610141 (when the original ID was OMIM:610141 / QT INTERVAL, VARIATION IN).

                "034d46e56b095c750619cc51ee2cb1bf": {
                    "predicate": "biolink:related_to",
                    "subject": "PUBCHEM.COMPOUND:31703",
                    "object": "MESH:610141",

It then does the same with the operation with OMIM Disease outputs and the MESH IDs in CTD's response, producing flawed records: Edges like this one to OMIM:D015746 (when the original ID was MESH:D015746 / Abdominal Pain).

                "0ca3845c254a304918933c581de85ae4": {
                    "predicate": "biolink:related_to",
                    "subject": "PUBCHEM.COMPOUND:31703",
                    "object": "OMIM:D015746",

I think Biolink API / Monarch's post-query processing + SmartAPI yaml response-mapping (which is to fields that exist only after the post-processing) is able to handle this situation, so maybe a solution like that will work here. However, it's perhaps not ideal that multiple operations are written + the same query is done repeatedly for different post-query processing.

This problem is related to past discussions on supporting multiple ID prefixes/namespaces as input / output. I'm not sure how much refactoring of code / x-bte annotation would be needed for a general solution...


rjawesome · 2023-03-25T00:42:06Z

We could have a "hasPrefix" option in the yaml under the output. If it is set to true, BTE would expect the ID prefix at the beginning of each output ID (i.e. it would parse "MESH:0000" and see that it is a "MESH" ID) instead of using the output ID type from the SmartAPI yaml. In that case, the output ID in the response mapping could be given a generic name (like "OUTPUT") instead of being labeled by its ID type (like "MESH"). This should at least fix the output problem. I could work on this feature.

rjawesome · 2023-03-25T01:18:14Z

Work started on multiple-prefixes branch of smartapi-kg and api-response-transform

colleenXu · 2023-03-28T23:40:00Z

@rjawesome sorry for the late reply. I'm having trouble understanding your proposal...could you provide an example of x-bte annotation edits you're proposing?

And as I rethink this issue, I wonder if some discussion would help:

- does this proposal mean 1 operation is written in these situations with multiple output ID-namespaces?
  - how does that affect the outputs section of the operation?
  - how does this affect how x-bte annotation is processed (into BTE MetaEdges?)
- does this proposal only cover situations where ONE response field returns output IDs in different namespaces?
  - that's probably fine?
  - related situation?: multiple response fields (each returns output IDs in a single namespace). This is handled by creating multiple operations right now. This is the situation for "investigate why BTE doesn't retrieve variant-disease relations from clinvar" #548, and many operations for core and pending BioThings APIs
  - related situation?: supporting multiple input ID namespaces in 1 operation? This is tricky because in some cases, the different input ID-namespaces correspond with different querying info (parameters/requestBody stuff)...so I think it's fine to keep writing separate operations for different input ID namespaces...

rjawesome · 2023-03-29T01:13:52Z

My proposal would just be in the yaml outputs section like so (only for outputs not inputs)

outputs:
- semantic: Disease
  hasPrefix: true

Then in the response mapping you would put OUTPUT instead of a prefix (like MESH), i.e.

chemical2disease_1:
  OUTPUT: data.DiseaseID  ## HAS prefix, the ID type will be determined by the prefix
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
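
To make the intended behavior concrete, here is a rough Python sketch of what resolving an `OUTPUT` field could look like. It is only an illustration under my own assumptions, not the code in the multiple-prefixes branch; `resolve_output_id` is a hypothetical name.

```python
def resolve_output_id(record, field):
    """Split a prefixed ID such as "OMIM:610141" into (prefix, local_id).

    The ID type comes from the value itself rather than from the
    operation's declared output ID type.
    """
    value = record[field]
    prefix, sep, local_id = value.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"expected a prefixed ID in {field!r}, got {value!r}")
    return prefix, local_id


# Trimmed-down records from CTD's raw response.
records = [
    {"DiseaseID": "MESH:D015746", "DiseaseName": "Abdominal Pain"},
    {"DiseaseID": "OMIM:610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
]

resolved = [resolve_output_id(r, "DiseaseID") for r in records]
print(resolved)
```

Raising on an unprefixed value is a design choice for the sketch; a real implementation might instead fall back to a default ID type.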

rjawesome · 2023-03-29T01:13:57Z

Another feature could be added here, since outputs is already an array: if each output ID type corresponds to a different field in the JSON, you could just add more outputs. For example, the operation outputs could look like

outputs:
- semantic: Disease
  id: MESH
- semantic: Disease
  id: OMIM

Then the response mapping would look like

chemical2disease_1:
  OMIM: data.DiseaseIDomim  ## omim disease id is located here in the json from api
  MESH: data.DiseaseIDmesh  ## mesh disease id is located here in the json from api
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs

In this case, BTE could split this into two operations, or when the response is received from the API it could just see which ID type exists in the output.
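
A small Python sketch of the second option (checking which ID type exists in each record). The field names `DiseaseIDomim` and `DiseaseIDmesh` are the made-up ones from the mapping above, not real CTD fields, and `collect_output_ids` is a hypothetical helper.

```python
# Prefix -> response field, mirroring the response mapping above.
OUTPUT_FIELDS = {"OMIM": "DiseaseIDomim", "MESH": "DiseaseIDmesh"}


def collect_output_ids(record, output_fields):
    """Return a prefixed ID for every output field present in the record."""
    found = []
    for prefix, field in output_fields.items():
        value = record.get(field)
        if value:
            found.append(f"{prefix}:{value}")
    return found


# Illustrative records: each carries its ID in a different field.
records = [
    {"DiseaseIDmesh": "D015746", "DiseaseName": "Abdominal Pain"},
    {"DiseaseIDomim": "610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
]

ids_per_record = [collect_output_ids(r, OUTPUT_FIELDS) for r in records]
print(ids_per_record)
```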

rjawesome · 2023-03-29T01:34:48Z

The first feature (hasPrefix) is currently working in multiple-prefixes branch

rjawesome · 2023-03-29T17:35:11Z


Another way to solve this problem could be using JQ Post processing + the hasPrefix/OUTPUT feature. JQ post processing could move all the ids into the same field in the json before response mapping. So the operation could have

transformers:
  wrap_jq: '{data: [.[] | if .DiseaseIDomim then .DiseaseID = "OMIM:" + .DiseaseIDomim else . end | if .DiseaseIDmesh then .DiseaseID = "MESH:" + .DiseaseIDmesh else . end]}'

Then the usage of hasPrefix and OUTPUT in the response mapping would be exactly the same as the first proposal.
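
For readers less familiar with jq, this Python sketch roughly mirrors what the expression above is meant to do. The field names are the illustrative ones from the jq string, `wrap_records` is a hypothetical name, and jq's truthiness rules are approximated with a `None` check.

```python
def wrap_records(records):
    """Copy each per-namespace ID into one prefixed DiseaseID field and
    wrap the list under a "data" key, like the jq expression."""
    wrapped = []
    for record in records:
        record = dict(record)  # leave the caller's records untouched
        if record.get("DiseaseIDomim") is not None:
            record["DiseaseID"] = "OMIM:" + record["DiseaseIDomim"]
        # Applied second, so a MESH ID wins when both fields are present,
        # matching the order of the two jq if-expressions.
        if record.get("DiseaseIDmesh") is not None:
            record["DiseaseID"] = "MESH:" + record["DiseaseIDmesh"]
        wrapped.append(record)
    return {"data": wrapped}


raw = [
    {"DiseaseIDmesh": "D015746"},
    {"DiseaseIDomim": "610141"},
    {"DiseaseName": "no ID fields here"},
]

result = wrap_records(raw)
print([r.get("DiseaseID") for r in result["data"]])
```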

colleenXu · 2023-03-30T21:58:15Z

@rjawesome could you pause your work on this particular issue? and keep the work specific to this issue on a separate branch from use-jmes-path (maybe you've already done this with multiple-prefixes)?

After talking with @tokebe, we agreed that there are some larger-scale issues that still have to be worked out, like:

- what are the situations where multiple possible ID-namespaces (inputs + outputs) happen
  - is this related at all to different semantic types
- given the info from the point above, what counts as 1 x-bte operation
- whether the format of x-bte operations needs changing
- how x-bte operations are processed and used by BTE
- the structure of x-bte operations

so I plan to write an issue and start discussions on that. I think after those discussions, it'll be clearer what the actual requirements / behavior we want for this issue is...

[EDIT: oh, one thing for sure is that in this use case and similar situations (one field, multiple ID prefixes), processing of the raw API response WILL BE REQUIRED to organize the IDs by namespace]
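
As a rough illustration of that kind of processing, here is a Python sketch that organizes the IDs from one response field by namespace. `group_ids_by_namespace` is a hypothetical helper, and skipping unprefixed values is just one possible choice.

```python
from collections import defaultdict


def group_ids_by_namespace(records, field="DiseaseID"):
    """Group the prefixed IDs found in one response field by their prefix."""
    grouped = defaultdict(list)
    for record in records:
        value = record.get(field, "")
        prefix, sep, _local_id = value.partition(":")
        if sep and prefix:
            grouped[prefix].append(value)
        # Values without a prefix are skipped in this sketch.
    return dict(grouped)


# Trimmed-down records from CTD's raw response.
records = [
    {"DiseaseID": "MESH:D015746"},
    {"DiseaseID": "OMIM:610141"},
]

by_namespace = group_ids_by_namespace(records)
print(by_namespace)
```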

This was referenced Mar 17, 2023

Data source: RaMP biothings/pending.api#69

Closed

support JQ for API response transformation #489

Closed

andrewsu added the data source label Aug 22, 2023

colleenXu mentioned this issue Sep 14, 2023

x-bte annotation refactoring discussion #656

Open

colleenXu added the x-bte label Sep 28, 2023

colleenXu added needs discussion jq / jmespath labels Oct 18, 2023

This was referenced Oct 21, 2023

x-bte-refactoring: multiple input/output ID namespaces #748

Open

summary: x-bte-refactoring related issues #750

Open
