Retired CUIs in semmedVER43_2022_R_PREDICATION.csv #2

Closed
erikyao opened this issue May 17, 2022 · 16 comments

Comments

@erikyao
Collaborator

erikyao commented May 17, 2022

File semmedVER43_2022_R_PREDICATION.csv contains 117,589,597 rows. After removing rows with SUBJECT_NOVELTY == 0 or OBJECT_NOVELTY == 0, 81,282,024 rows remained. Among those rows, there are 303,080 unique subject CUIs, and 262,268 unique object CUIs (piped CUIs decomposed and counted).

Following MRCUI.RRF data analysis, we found that, for subject CUIs, the counts and ratios of retired CUIs are:

| type | count of CUIs | ratio |
| --- | --- | --- |
| retired | 10734 | 3.54% |
| deleted | 181 | 0.06% |
| injectively mapped | 10182 | 3.36% |
| bijectively mapped (subset of injectively mapped) | 3109 | 1.03% |
| one-to-many mapped | 371 | 0.12% |

and for object CUIs,

| type | count of CUIs | ratio |
| --- | --- | --- |
| retired | 9364 | 3.57% |
| deleted | 175 | 0.07% |
| injectively mapped | 8856 | 3.37% |
| bijectively mapped (subset of injectively mapped) | 2678 | 1.02% |
| one-to-many mapped | 333 | 0.13% |

It's a safe bet to consider only the deleted and bijectively mapped CUIs. Also, it may be worth considering only mappings with the SY relationship.
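The categories in the tables above can be sketched in code. This is a minimal illustration, not the script used for the analysis: it assumes the MRCUI.RRF rows have already been reduced to `(CUI1, REL, CUI2)` tuples, that deletions carry `REL == "DEL"` with an empty `CUI2`, and the function name `classify_retired` is hypothetical. In this thread "injective" means a retired CUI with exactly one replacement, so it covers both the `bijective` and `many-to-one` labels below.

```python
from collections import defaultdict


def classify_retired(mappings):
    """Label each retired CUI given (cui1, rel, cui2) tuples.

    Assumed input shape: rel == "DEL" (or an empty cui2) marks a deletion.
    """
    targets = defaultdict(set)  # retired CUI -> replacement CUIs
    sources = defaultdict(set)  # replacement CUI -> retired CUIs
    deleted = set()
    for cui1, rel, cui2 in mappings:
        if rel == "DEL" or not cui2:
            deleted.add(cui1)
        else:
            targets[cui1].add(cui2)
            sources[cui2].add(cui1)

    labels = {cui: "deleted" for cui in deleted if cui not in targets}
    for cui1, cui2s in targets.items():
        if len(cui2s) > 1:
            labels[cui1] = "one-to-many"
            continue
        (cui2,) = cui2s
        # Exactly one replacement: bijective if no other retired CUI shares it.
        labels[cui1] = "bijective" if len(sources[cui2]) == 1 else "many-to-one"
    return labels
```

Restricting to SY mappings would amount to filtering the input tuples on `rel == "SY"` before calling the function.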

@erikyao
Collaborator Author

erikyao commented May 17, 2022

@andrewsu initiated the following policies for retired CUIs and piped CUIs (each is quoted below, followed by my response):

> For the one-to-one bijective mappings, I agree on the simple replacement.

Confirmed.

> For the other injective many-to-one mappings, I think I'm also good with simple replacement.

Confirmed.

> For the one-to-many, I think I'm good with just duplicating the original record multiple times, one for each of the mapped CUI2s. It seems like this would be a pretty modest increase in size.

Discussion pending.
The expansion size is small, but it might introduce duplicate content among the expanded documents.

> For the deletions, I might be okay just leaving them in. It's a faithful statement of what's stated, and I think for BTE, it pretty much will end up being ignored since the node normalizer will not know what to do with them.

Negative.
These should be deleted when parsing/uploading. (Log if necessary.)

> For piped subjects/objects, I seem to recall having this discussion with Sander previously, and that he implemented a solution where the record was duplicated multiple times, similar to how I'm proposing handling the one-to-many case.

New analysis required.

@erikyao
Collaborator Author

erikyao commented May 23, 2022

Piped vs Non-Piped Rows

File semmedVER43_2022_R_PREDICATION.csv contains 117,589,597 rows. After removing rows with SUBJECT_NOVELTY == 0 or OBJECT_NOVELTY == 0, 81,282,024 rows remained, among which the distribution of rows with/without piped CUIs is:

| row type | count | ratio |
| --- | --- | --- |
| w/o piped CUIs | 74,288,575 | 91.4% |
| with piped CUIs | 6,993,449 | 8.6% |

Rows with Retired CUIs

The distributions of the counts of rows containing retired CUIs among the two types of rows are listed below, where the ratios are calculated against the total number of rows (81,282,024):

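| status | piped or not? | count | ratio | remark |
| --- | --- | --- | --- | --- |
| retired | non-piped | 4,101,513 | 5.05% | |
| | piped | 291,629 | 0.36% | |
| (1) deleted | non-piped | 14,737 | 0.02% | |
| | piped | 3,536 | 0.004% | |
| (2) injective | non-piped | 3,940,070 | 4.85% | |
| | piped | 283,839 | 0.35% | |
| (2.1) bijective | non-piped | 1,052,756 | 1.30% | |
| | piped | 155,048 | 0.19% | |
| (3) one-to-many | non-piped | 150,634 | 0.19% | Avg. out-degree 2.07 |
| | piped | 4,272 | 0.005% | Avg. out-degree 2.80 |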

Note that if we create a new predication for each mapped CUI, those 150,634 rows with one-to-many mapped, non-piped CUIs will expand to 311,800 documents (avg. out-degree 2.07). Similarly, the 4,272 piped rows will expand to 11,935 documents (avg. out-degree 2.80).

Impact of Splitting Policies on Piped CUIs

Note that in this section, retired CUIs (or the replacement plans) are not taken into consideration.

The current splitting policies were proposed here:

  • In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs) e.g., C0056207|3075, discard the numeric IDs and process as usual
  • In cases where the CUI only consists of one or more pipe-separated numeric IDs, create separate documents for each numeric id using the key ncbigene.

Following these policies, the 6,993,449 rows with piped CUIs will bring about 7,959,310 documents (1.14 docs per row). The total number of documents will be 7,959,310 + 74,288,575 = 82,247,885.

If we change the first policy and do not discard any of the numeric IDs, we will find 17,335,870 documents generated from those 6,993,449 rows (2.48 docs per row). The total number of documents will come to 17,335,870 + 74,288,575 = 91,624,445, an 11.5% increase from the original policies.
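The two splitting policies can be sketched as follows. This is an illustration only, not the parser's actual code: the function names and the `keep_gene_ids` flag are hypothetical, the `umls` key is an assumption (only `ncbigene` is named in the policy text), and every purely numeric part is presumed to be an NCBI Gene ID as the policy states. When both the subject and the object are piped, the sketch assumes one document per subject/object combination.

```python
from itertools import product


def split_piped(value, keep_gene_ids=False):
    """Split a piped field like 'C0056207|3075' into (key, id) pairs."""
    parts = value.split("|")
    cuis = [p for p in parts if not p.isdigit()]
    gene_ids = [p for p in parts if p.isdigit()]  # presumed NCBI Gene IDs

    ids = [("umls", cui) for cui in cuis]
    # Current policy: numeric IDs are kept only when no CUI is present.
    # Alternative policy (keep_gene_ids=True): never discard them.
    if keep_gene_ids or not cuis:
        ids += [("ncbigene", gene_id) for gene_id in gene_ids]
    return ids


def expand_row(subject, obj, keep_gene_ids=False):
    """One document per (subject id, object id) combination."""
    return list(product(split_piped(subject, keep_gene_ids),
                        split_piped(obj, keep_gene_ids)))
```

For example, `split_piped("C0056207|3075")` yields only the CUI under the current policy and both IDs with `keep_gene_ids=True`, which is where the jump from 1.14 to 2.48 docs per piped row comes from.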

In summary:

| Splitting Policies | Rows with Piped CUIs | Documents from Piped Rows | Docs per Piped Row | Total Documents | Remark |
| --- | --- | --- | --- | --- | --- |
| current | 6,993,449 | 7,959,310 | 1.14 | 82,247,885 | |
| new | 6,993,449 | 17,335,870 | 2.48 | 91,624,445 | 11.5% ⤴️ in total |

P.S. The current https://biothings.ncats.io/semmeddb API has 114,383,742 documents, but that count includes documents with a zero novelty score.

@andrewsu
Member

andrewsu commented May 24, 2022

Great, let's handle these by group:

  1. Deleted: Let's delete the 18,273 predications referencing deleted IDs
  2. Injective: The 4,223,909 predications here with old IDs would essentially map to the same number of predications with new IDs (not accounting for piping). That seems reasonable, so let's do this. (This is also the biggest group, so this decision likely solves 98% of the issue here...)
  3. One-to-many: So there are 154,906 predications with retired IDs that map to multiple new IDs. Can you easily calculate/estimate the number of predications that would turn into if you created new predications for every new ID? (E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D). Again, ignore piping for now... (I'm guessing the number here will still be very low as a percentage of semmeddb overall, so I lean toward just doing this.)
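The three groups above can be sketched in one function. This is a rough illustration under stated assumptions, not the implemented parser: `replace_retired`, `deleted`, and `mapping` are hypothetical names, `mapping` is assumed to be a dict from each retired CUI to the set of its current replacement CUIs, and piping is ignored.

```python
from itertools import product


def replace_retired(subject, obj, deleted, mapping):
    """Return the (subject, object) pairs a predication turns into.

    deleted: set of deleted CUIs; mapping: retired CUI -> set of new CUIs.
    """
    # Group 1: drop predications that reference a deleted CUI.
    if subject in deleted or obj in deleted:
        return []
    # Groups 2 and 3: swap in every mapped CUI; unmapped CUIs stay as-is.
    subjects = sorted(mapping.get(subject, {subject}))
    objects = sorted(mapping.get(obj, {obj}))
    return list(product(subjects, objects))


# A -> B_ret, with B_ret -> C and B_ret -> D, becomes A -> C and A -> D.
pairs = replace_retired("A", "B_ret", set(), {"B_ret": {"C", "D"}})
```

With an injective mapping the result has exactly one pair, so the predication count is unchanged; only one-to-many mappings add documents.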

Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications that would turn into if you created a new predication for each ID in the pipe?

@colleenXu

colleenXu commented May 25, 2022

Using the same numbering as Andrew did in his post:

  1. Deleted: I agree with Andrew on this.
  2. Injective: My understanding is that this means each retired UMLS CUI is mapped to 1 non-retired UMLS CUI. I'm still fine with the simple replacement that was agreed on earlier. However, I think there's still the piping to consider (will discuss in next post)
  3. One-to-many: hmmmm the UMLS browser seems to do something interesting here...it seems to map to the "first" non-retired ID in the table. Perhaps we can do the same thing and do simple replacement (like the injective case)? And there's still piping to consider (will discuss in next post)

@colleenXu

colleenXu commented May 25, 2022

With piping:

Point A

@andrewsu the scope of the issue was somewhat discussed here. However, the full effect on predications wasn't clear. For example, are there cases where both the subject + object have piped IDs - and how much expansion would then happen?

Point B

I think there's still some vagueness: are there any combos of IDs in a piped thing where the IDs represent "equivalent" things, to the point where we don't want to expand to multiple records? For example: when there's 1 Entrez ID and 1 CUI, are those two IDs "equivalent" enough that we just want a record with 1 of the IDs (probably the Entrez one)? Maybe one way to tell "equivalent" is when it's easy to find a cross-mapping between the Entrez ID and the CUI (in MyGene for instance)?

Point C

On the other hand, I'm starting to be less concerned about the chance of having "duplicated information" from expanding piped IDs that are basically equivalent into multiple records (each record = 1 combo of subject ID and object ID). At least, I think BTE can kinda handle it.

For example, semmeddb currently has 3 records corresponding to the exact same triple + pmid. But when BTE is queried for that triple (see query details below), the edge only has one instance of that PMID (8959933) in its biolink:publications array. This means BTE runs set-like operations to get only unique values (maybe here). To some extent, I think BTE will process the API's response and merge/take-only-unique-values.
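The kind of set-like merge described here could look like the sketch below. It is only an illustration of the idea, not BTE's code: the function name and the `pmids` record field are hypothetical.

```python
def merge_publications(records):
    """Collect PMIDs across records, keeping each one once in first-seen order."""
    merged = {}
    for record in records:
        for pmid in record["pmids"]:
            merged.setdefault(pmid, None)  # dict keys keep insertion order
    return list(merged)


# Three records for the same triple + PMID collapse to a single entry.
duplicates = [{"pmids": ["PMID:8959933"]}] * 3
merged = merge_publications(duplicates)
```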


response from querying only semmeddb through BTE (POST to http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query locally):

only one instance of PMID:8959933 in this response
{
    "workflow": [
        {
            "id": "lookup"
        }
    ],
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:location_of"
                    ]
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "UMLS:C1521748"
                    ],
                    "categories": [
                        "biolink:GrossAnatomicalStructure",
                        "biolink:AnatomicalEntity"
                    ],
                    "name": "Entire mastoid"
                },
                "n1": {
                    "ids": [
                        "UMLS:C0029440"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ],
                    "name": "osteoma"
                }
            }
        },
        "knowledge_graph": {
            "nodes": {
                "MONDO:0005166": {
                    "categories": [
                        "biolink:Disease"
                    ],
                    "name": "osteoma",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:xref",
                            "value": [
                                "MONDO:0005166",
                                "UMLS:C0029440",
                                "MESH:D010016",
                                "MEDDRA:10031249",
                                "NCIT:C3296",
                                "SNOMEDCT:302858007",
                                "SNOMEDCT:83612000",
                                "HP:0100246"
                            ]
                        },
                        {
                            "attribute_type_id": "biolink:synonym",
                            "value": [
                                "osteoma",
                                "Osteoma"
                            ]
                        },
                        {
                            "attribute_type_id": "num_source_nodes",
                            "value": 1
                        },
                        {
                            "attribute_type_id": "num_target_nodes",
                            "value": 0
                        },
                        {
                            "attribute_type_id": "source_qg_nodes",
                            "value": [
                                "n0"
                            ]
                        },
                        {
                            "attribute_type_id": "target_qg_nodes",
                            "value": []
                        }
                    ]
                },
                "UMLS:C1521748": {
                    "categories": [
                        "biolink:AnatomicalEntity"
                    ],
                    "name": "Entire mastoid",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:xref",
                            "value": [
                                "UMLS:C1521748"
                            ]
                        },
                        {
                            "attribute_type_id": "biolink:synonym",
                            "value": [
                                "Entire mastoid"
                            ]
                        },
                        {
                            "attribute_type_id": "num_source_nodes",
                            "value": 0
                        },
                        {
                            "attribute_type_id": "num_target_nodes",
                            "value": 1
                        },
                        {
                            "attribute_type_id": "source_qg_nodes",
                            "value": []
                        },
                        {
                            "attribute_type_id": "target_qg_nodes",
                            "value": [
                                "n1"
                            ]
                        }
                    ]
                }
            },
            "edges": {
                "aad0f5aae3e7a1ae1d5fa9772637ef2c": {
                    "predicate": "biolink:location_of",
                    "subject": "UMLS:C1521748",
                    "object": "MONDO:0005166",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": [
                                "infores:biothings-explorer"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": [
                                "infores:semmeddb"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": [
                                "infores:biothings-semmeddb"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:8959933",
                                "PMID:578417",
                                "PMID:24653880",
                                "PMID:702417",
                                "PMID:467282",
                                "PMID:130459",
                                "PMID:9441563",
                                "PMID:4264661",
                                "PMID:194342",
                                "PMID:686306",
                                "PMID:19977164",
                                "PMID:19977176",
                                "PMID:23377306",
                                "PMID:23120331",
                                "PMID:23532662",
                                "PMID:2076317",
                                "PMID:5381639",
                                "PMID:7446839",
                                "PMID:12802983",
                                "PMID:13381266",
                                "PMID:13717178",
                                "PMID:13149044",
                                "PMID:13749106",
                                "PMID:13445597",
                                "PMID:13760067",
                                "PMID:14100724",
                                "PMID:13379779",
                                "PMID:13408609",
                                "PMID:10721530",
                                "PMID:5923152",
                                "PMID:5808232",
                                "PMID:5998528",
                                "PMID:5014384",
                                "PMID:28764209",
                                "PMID:20554264",
                                "PMID:14912570",
                                "PMID:18357940",
                                "PMID:18633930",
                                "PMID:29255322",
                                "PMID:1897711",
                                "PMID:1742065",
                                "PMID:2044485",
                                "PMID:30199506",
                                "PMID:16001697",
                                "PMID:29392054",
                                "PMID:21620788",
                                "PMID:30388584",
                                "PMID:15021765",
                                "PMID:31750121",
                                "PMID:31262987"
                            ]
                        },
                        {
                            "attribute_type_id": "biolink:original_object",
                            "value": "C0029440"
                        },
                        {
                            "attribute_type_id": "biolink:original_predicate",
                            "value": "LOCATION_OF"
                        },
                        {
                            "attribute_type_id": "biolink:original_subject",
                            "value": "C1521748"
                        },
                        {
                            "attribute_type_id": "original_object_name",
                            "value": "Osteoma"
                        },
                        {
                            "attribute_type_id": "original_subject_name",
                            "value": "Entire mastoid"
                        }
                    ]
                }
            }
        },
        "results": [
            {
                "node_bindings": {
                    "n0": [
                        {
                            "id": "UMLS:C1521748"
                        }
                    ],
                    "n1": [
                        {
                            "id": "MONDO:0005166"
                        }
                    ]
                },
                "edge_bindings": {
                    "e01": [
                        {
                            "id": "aad0f5aae3e7a1ae1d5fa9772637ef2c"
                        }
                    ]
                },
                "score": 2.6815700080727343
            }
        ]
    },
    "logs": [
        {
            "timestamp": "2022-05-25T01:13:51.735Z",
            "level": "INFO",
            "message": "Node n0 with id [UMLS:C1521748] and category [biolink:GrossAnatomicalStructure] augmented with category [biolink:AnatomicalEntity] inferred from id.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.010Z",
            "level": "DEBUG",
            "message": "BTE identified 2 qNodes from your query graph",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.010Z",
            "level": "DEBUG",
            "message": "BTE identified 1 qEdges from your query graph",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.045Z",
            "level": "INFO",
            "message": "Executing e01: n0 --> n1",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.318Z",
            "level": "DEBUG",
            "message": "REDIS cache is not enabled.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.319Z",
            "level": "DEBUG",
            "message": "BTE is trying to find metaKG edges (smartAPI registry, x-bte annotation) connecting from GrossAnatomicalStructure,CellularComponent,Cell,PathologicalAnatomicalStructure,AnatomicalEntity to Disease with predicate location_of",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.351Z",
            "level": "DEBUG",
            "message": "BTE found 10 metaKG edges corresponding to e01. These metaKG edges comes from 1 unique APIs. They are BioThings SEMMEDDB API",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.353Z",
            "level": "DEBUG",
            "message": "BTE found 10 metaKG for this batch.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.353Z",
            "level": "DEBUG",
            "message": "call-apis: Resolving ID feature is turned on",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.353Z",
            "level": "DEBUG",
            "message": "call-apis: Number of API Edges received is 10",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.834Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 189 records, took 352ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.950Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 267 records, took 474ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:53.303Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 858 records, took 580ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:53.561Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 34 records, took 235ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:53.571Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 27 records, took 257ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:53.727Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): Cell > location_of > Disease (obtained 267 records, took 306ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:53.962Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 0 records, took 221ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:53.985Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): GrossAnatomicalStructure > location_of > Disease (obtained 0 records, took 250ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:54.169Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): CellularComponent > location_of > Disease (obtained 267 records, took 341ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:54.530Z",
            "level": "DEBUG",
            "message": "call-apis: Successful POST https://biothings.ncats.io/semmeddb (1 ID): Cell > location_of > Disease (obtained 189 records, took 288ms)",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:54.531Z",
            "level": "DEBUG",
            "message": "call-apis: Total number of records returned for this query is 2098",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.356Z",
            "level": "DEBUG",
            "message": "call-apis: qEdge queries complete in 2s",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.357Z",
            "level": "INFO",
            "message": "e01 execution: 10 queries (10 success/0 fail) and (0) cached qEdges return (2098) records",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.045Z",
            "level": "DEBUG",
            "message": "Edge manager is managing 1 qEdges.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.045Z",
            "level": "DEBUG",
            "message": "Next qEdge will pick lower entity value to use for query.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:52.045Z",
            "level": "DEBUG",
            "message": "Edge manager is sending next qEdge 'e01' for execution.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.410Z",
            "level": "DEBUG",
            "message": "'e01' kept (252) / dropped (1846) records.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.414Z",
            "level": "INFO",
            "message": "'e01' keeps (252) records!",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.414Z",
            "level": "DEBUG",
            "message": "Edge manager collected (252) records!",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.453Z",
            "level": "DEBUG",
            "message": "Successfully scored 1 results, couldn't score 0 results.",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.454Z",
            "level": "INFO",
            "message": "Execution Summary: (2) nodes / (1) edges / (1) results; (8/10) queries returned results from (1) unique APIs ",
            "code": null
        },
        {
            "timestamp": "2022-05-25T01:13:55.454Z",
            "level": "INFO",
            "message": "APIs: BioThings SEMMEDDB API",
            "code": null
        }
    ]
}

@erikyao
Collaborator Author

erikyao commented May 25, 2022

> Great, let's handle these by group:
>
>   1. Deleted: Let's delete the 18,273 predications referencing deleted IDs
>   2. Injective: The 4,223,909 predications here with old IDs would essentially map to the same number of predications with new IDs (not accounting for piping). That seems reasonable, so let's do this. (This is also the biggest group, so this decision likely solves 98% of the issue here...)
>   3. One-to-many: So there are 154,906 predications with retired IDs that map to multiple new IDs. Can you easily calculate/estimate the number of predications that would turn into if you created new predications for every new ID? (E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D). Again, ignore piping for now... (I'm guessing the number here will still be very low as a percentage of semmeddb overall, so I lean toward just doing this.)
>
> Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications that would turn into if you created a new predication for each ID in the pipe?

@andrewsu @colleenXu, please find my updated comments above.

@andrewsu
Member

Fantastic, I think we are very close here. @erikyao, In this comment, you mention there are three classes of piped IDs:

1 UMLS + 1 Entrez
1 UMLS + N Entrez
N Entrez

Can you post a sampling (maybe 20 examples) of the "1 UMLS + N Entrez" group? I'd just like to understand that group a bit better...

@erikyao
Collaborator Author

erikyao commented May 25, 2022

> Fantastic, I think we are very close here. @erikyao, In this comment, you mention there are three classes of piped IDs:
>
> 1 UMLS + 1 Entrez
> 1 UMLS + N Entrez
> N Entrez
>
> Can you post a sampling (maybe 20 examples) of the "1 UMLS + N Entrez" group? I'd just like to understand that group a bit better...

1 UMLS + 13 Entrez, 7 examples

'C0074479|4489|4490|4493|4494|4495|4496|4498|4499|4500|4501|4543|56052|644314'
'Antigens,CD43|MT1A|MT1B|MT1E|MT1F|MT1G|MT1H|MT1JP|MT1M|MT1L|MT1X|MTNR1A|ALG1|MT1IP'


'C0682972|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'G-Protein-Coupled Receptors|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'


'C0597298|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Protein Isoforms|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'


'C0079427|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Tumor Suppressor Genes|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'


'C0017968|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Glycoproteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'


'C0033684|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Proteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'


'C0033371|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Prolactin|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'

1 UMLS + 8 Entrez, 4 examples

'C0002210|250|470|6590|10850|26033|27295|55226|80150'
'alpha-Fetoproteins|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'


'C0212691|1523|4791|4940|6490|9733|22974|27044|84164'
'lyt-10 protein|CUX1|NFKB2|OAS3|PMEL|SART3|TPX2|SND1|ASCC2'


'C0126732|250|470|6590|10850|26033|27295|55226|80150'
'I Kappa B-Alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'


'C0600251|250|470|6590|10850|26033|27295|55226|80150'
'Interleukin-1 alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'

1 UMLS + 5 Entrez, 3 examples

'C0085828|2353|2354|3725|3726|3727'
'Transcription Factor AP-1|FOS|FOSB|JUN|JUNB|JUND'


'C0083957|3854|3872|5126|5311|8535'
'Proprotein Convertase 2|KRT6B|KRT17|PCSK2|PKD2|CBX4'


'C0135615|3853|5122|7832|10120|57332'
'Proprotein Convertase 1|KRT6A|PCSK1|BTG2|ACTR1B|CBX8'

1 UMLS + 3 Entrez, 7 examples

'C1141639|1081|3342|93659'
'Human Chorionic Gonadotropin|CGA|HTC2|CGB5'


'C0007082|1048|1084|5670'
'Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2'


'C0968902|2167|2971|7020'
'Transcription Factor AP-2 Alpha|FABP4|GTF3A|TFAP2A'


'C1335440|100616102|100862685|100862688'
'Polymerase Gene|ERVK-9|ERVK-19|ERVK-11'


'C1335439|100616102|100862685|100862688'
'Polymerase|ERVK-9|ERVK-19|ERVK-11'


'C0035681|100616102|100862685|100862688'
'DNA-Directed RNA Polymerase|ERVK-9|ERVK-19|ERVK-11'


'C0012892|100616102|100862685|100862688'
'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11'

@colleenXu

For "1 UMLS + N Entrez", it seems like the UMLS ID and the Entrez IDs are not equivalent. Then maybe we want to change the current splitting policy: "In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs) e.g., C0056207|3075, discard the numeric IDs and process as usual"? Change to not discarding the numeric IDs?

Has "Point B" above been explored? I was wondering if the "1 UMLS + 1 Entrez" are equivalent.

@andrewsu
Member

andrewsu commented Jun 1, 2022

Perhaps a generic way of handling the case of "1 UMLS + N Entrez" (including "1 UMLS + 1 Entrez") is to keep all Entrez IDs and create multiple records unless an Entrez ID also maps to the UMLS ID according to the Node Normalizer. Thoughts?
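A minimal sketch of this proposal is shown below, under stated assumptions: `ids_to_keep` and `equivalent_cuis` are hypothetical names, `equivalent_cuis` stands in for whatever Node Normalizer lookup ends up being used (no real API call is made here), and the piped value is assumed to start with the UMLS CUI followed by Entrez IDs.

```python
def ids_to_keep(piped_value, equivalent_cuis):
    """Keep the CUI plus every Entrez ID that does not normalize to that CUI.

    equivalent_cuis: callable taking an Entrez ID and returning the set of
    UMLS CUIs considered equivalent to it (hypothetical lookup).
    """
    cui, *gene_ids = piped_value.split("|")
    kept = [("umls", cui)]
    for gene_id in gene_ids:
        if cui in equivalent_cuis(gene_id):
            continue  # same entity as the CUI, so no extra record is needed
        kept.append(("ncbigene", gene_id))
    return kept
```

Passing the lookup in as a callable keeps the sketch independent of how the batch query is eventually done; each kept ID would then become its own record.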

@colleenXu

I think it's an interesting idea. Would we want to use MyGene, rather than Node Normalizer?

For example, one can query either the entrezgene field and then look at the umls field or vice versa...

Here's an example using this entry:

'C0012892|100616102|100862685|100862688'
'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11'

POST to https://mygene.info/v3/query?fields=entrezgene,umls,symbol,name,taxid:

{
    "q": "100616102,100862685,100862688",
    "scopes": "entrezgene"
}

Response. Notice that none of the UMLS IDs returned match C0012892 / DNA-Directed DNA Polymerase:

[
    {
        "query": "100616102",
        "_id": "100616102",
        "_score": 26.72278,
        "entrezgene": "100616102",
        "name": "endogenous retrovirus group K member 9",
        "symbol": "ERVK-9",
        "taxid": 9606,
        "umls": {
            "cui": "C3147204"
        }
    },
    {
        "query": "100862685",
        "notfound": true
    },
    {
        "query": "100862688",
        "_id": "100862688",
        "_score": 25.927315,
        "entrezgene": "100862688",
        "name": "endogenous retrovirus group K member 11",
        "symbol": "ERVK-11",
        "taxid": 9606,
        "umls": {
            "cui": "C3147206"
        }
    }
]
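Checking a response like the one above against the CUI in the piped value could be sketched as follows. The helper names are hypothetical, and the code assumes the `umls` field (or its `cui` value) may come back either as a single item or as a list, which is not confirmed by this example.

```python
def umls_cuis(hit):
    """Collect the UMLS CUIs reported in one hit of the query response."""
    umls = hit.get("umls")
    if umls is None:
        return set()  # e.g. hits flagged "notfound" carry no umls field
    entries = umls if isinstance(umls, list) else [umls]
    cuis = set()
    for entry in entries:
        cui = entry.get("cui")
        if isinstance(cui, list):
            cuis.update(cui)
        elif cui:
            cuis.add(cui)
    return cuis


def genes_matching_cui(hits, cui):
    """Return the queried Entrez IDs whose UMLS CUIs include the given CUI."""
    return [hit["query"] for hit in hits if cui in umls_cuis(hit)]


# Trimmed copy of the response shown above.
hits = [
    {"query": "100616102", "entrezgene": "100616102", "umls": {"cui": "C3147204"}},
    {"query": "100862685", "notfound": True},
    {"query": "100862688", "entrezgene": "100862688", "umls": {"cui": "C3147206"}},
]
matches = genes_matching_cui(hits, "C0012892")
```

Here `matches` is empty, which is the "not equivalent" outcome described above.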

@andrewsu
Member

andrewsu commented Jun 3, 2022

> Would we want to use MyGene, rather than Node Normalizer?

I think we should use Node Normalizer (assuming we can figure out batch querying via POST). Unless there is any other discussion or dissent, @erikyao please implement this behavior that I described in this comment.

@colleenXu

@andrewsu @erikyao did we ever decide on the one-to-many retired ID issue (andrew's post, my post)? This is before we get into pipes, where 1 retired ID is mapped to multiple current IDs.

@erikyao
Copy link
Collaborator Author

erikyao commented Jun 6, 2022

> @andrewsu @erikyao did we ever decide on the one-to-many retired ID issue (andrew's post, my post)? This is before we get into pipes, where 1 retired ID is mapped to multiple current IDs.

Hi @colleenXu , I think @andrewsu suggested replacement with all the mapped new IDs. Quote:

> E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D

@andrewsu
Member

andrewsu commented Jun 7, 2022

Given the small expansion in triples based on Yao's updated comment, yes, I think we should proceed with the plan that @erikyao quoted in the comment above...

> E.g., if semmed says A -> B_ret and MRCUI says B_ret -> C and B_ret -> D, then we would create 2 predications for A->C and A->D
