Pull in all of UMLS #119

Open
cbizon opened this issue Jan 28, 2022 · 16 comments

@cbizon
Contributor

cbizon commented Jan 28, 2022

Nodenormalizer has a few roles:

  1. Establishing equivalent ID sets
  2. Assigning types
  3. Returning labels

One problem is that people rely on it for 3, so if it can't do 1 and 2 for some chunk of data, you don't get any labels.

Could we, or should we, pull in all of UMLS, either by 1) correctly merging everything in there into cliques/types, or 2) just bringing in the unmapped parts as NamedThings so that we can at least serve labels?

@cbizon
Contributor Author

cbizon commented Jan 28, 2022

We could make use of the semantic types and biolink mappings, but I would want to review those pretty carefully first.

@cbizon
Contributor Author

cbizon commented Mar 15, 2022

@colleenXu can you comment on any particular types of entities that give you the most trouble?

@colleenXu

colleenXu commented Mar 19, 2022

I didn't find as many examples as I expected but here are some:

SEMMEDDB has associations with outdated identifiers, so these show up as well...

EDIT: the semmeddb data files do have pipe-delimited identifiers, so basically mapping to entrez gene IDs...

@colleenXu

More examples that SRI Node Normalizer doesn't fetch labels for. These map to biolink PhysiologicalProcess or MolecularActivity...

@colleenXu

colleenXu commented Mar 28, 2022

More examples of semmeddb semantic types already mentioned:

from other semmeddb semantic types:

@colleenXu

colleenXu commented Mar 28, 2022

Plant is particularly useful since it's used to annotate supplements in idisk...

@cbizon
Contributor Author

cbizon commented Jun 7, 2022

Many of the missing UMLS IDs are taxa (plants, birds, etc.). It looks like the Metathesaurus has good mappings to both MeSH and NCBI.

@cbizon
Contributor Author

cbizon commented Jun 8, 2022

Looking at some aapp (Amino Acid, Peptide, or Protein) concepts that don't map to other things in nodenorm at the moment, it looks like there are at least some decent mappings to MeSH. We should review whether we want these mappings; I seem to recall them occasionally giving some trouble...

@gaurav
Contributor

gaurav commented Jun 14, 2022

Here's one way in which we could proceed:

  1. Once all the compendia are generated, we generate a final "UMLS.txt" compendium that consists of every UMLS ID from MRCONSO.RRF with a relevant semantic type (e.g. T092 "Organization" is probably not relevant to Translator, so we can just exclude all those IDs), minus any IDs that were already included in any other compendia. We can add the UMLS label and the overall semantic type mapped to a Biolink type to the compendium as well. (We could also include synonymy information from across the UMLS if that would be useful). This should meet the specific aims of this issue to improve UMLS coverage, but at the cost of having lots of concepts that aren't properly clustered to each other.
  2. We then start looking at the semantic types we have in UMLS.txt and determining if more of those entries should be included in the specific compendia, such as anatomy, cellular component, and so on. The goal would be to reduce the number of entries in UMLS.txt by ensuring that its identifiers are properly clustered elsewhere in Babel. Eventually, we should end up with a UMLS.txt that only contains the identifiers that can't be easily placed anywhere else in the compendia.
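
Step 1 of this plan could be sketched roughly as follows. This is only an illustration, not Babel's actual code: the function name `leftover_umls`, the `EXCLUDED_TUIS` set and the `TUI_TO_BIOLINK` mapping are hypothetical, and the single Biolink assignment shown is an unreviewed placeholder. It assumes the standard pipe-delimited column layouts of MRCONSO.RRF and MRSTY.RRF, with no blank lines.

```python
import csv

# Hypothetical placeholders, not Babel's real configuration.
EXCLUDED_TUIS = {"T092"}  # "Organization", excluded per the plan above
TUI_TO_BIOLINK = {"T002": "biolink:OrganismTaxon"}  # illustrative, unreviewed


def leftover_umls(mrconso_lines, mrsty_lines, already_included):
    """Yield (curie, label, biolink_types) for UMLS IDs in no other compendium."""
    # MRSTY.RRF columns: CUI|TUI|STN|STY|ATUI|CVF|
    tuis_by_cui = {}
    for row in csv.reader(mrsty_lines, delimiter="|", quoting=csv.QUOTE_NONE):
        tuis_by_cui.setdefault(row[0], set()).add(row[1])

    emitted = set()
    # MRCONSO.RRF columns:
    # CUI|LAT|TS|LUI|STT|SUI|ISPREF|AUI|SAUI|SCUI|SDUI|SAB|TTY|CODE|STR|SRL|SUPPRESS|CVF|
    for row in csv.reader(mrconso_lines, delimiter="|", quoting=csv.QUOTE_NONE):
        cui, lat, ts, stt, ispref, label = (
            row[0], row[1], row[2], row[4], row[6], row[14]
        )
        curie = f"UMLS:{cui}"
        if curie in already_included or cui in emitted:
            continue
        # Use the preferred English name as the label.
        if (lat, ts, stt, ispref) != ("ENG", "P", "PF", "Y"):
            continue
        tuis = tuis_by_cui.get(cui, set())
        if not tuis or tuis <= EXCLUDED_TUIS:
            continue  # no semantic type at all, or only excluded ones
        biolink_types = sorted(
            {TUI_TO_BIOLINK[t] for t in tuis if t in TUI_TO_BIOLINK}
        )
        emitted.add(cui)
        yield curie, label, biolink_types
```

Here `already_included` would be the set of `UMLS:` CURIEs collected from every other compendium, so whatever this yields is exactly the remainder that step 2 would then try to shrink.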

@cbizon Do you think this would work?

@cbizon
Contributor Author

cbizon commented Jun 14, 2022

Yes, I think this is a very good plan.

@colleenXu

colleenXu commented Jun 14, 2022

Note that some UMLS IDs seem to map to more than one UMLS semantic type... For example, rosiglitazone is mapped to both Organic Chemical and Pharmacologic Substance.
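
One way to list such concepts is to group the rows of MRSTY.RRF by CUI. The sketch below assumes the standard MRSTY.RRF column order; the helper name `multi_typed_cuis` is made up for illustration.

```python
from collections import defaultdict


def multi_typed_cuis(mrsty_lines):
    """Return {CUI: sorted [(TUI, STY), ...]} for CUIs with >1 semantic type."""
    types_by_cui = defaultdict(set)
    for line in mrsty_lines:
        # MRSTY.RRF columns: CUI|TUI|STN|STY|ATUI|CVF|
        fields = line.rstrip("\n").split("|")
        types_by_cui[fields[0]].add((fields[1], fields[3]))
    return {
        cui: sorted(types)
        for cui, types in types_by_cui.items()
        if len(types) > 1
    }
```

Calling it on an open MRSTY.RRF file handle would return only the concepts that need a decision about which semantic type (or types) to keep.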

@gaurav
Contributor

gaurav commented Aug 29, 2022

Sorry it took me a while to get back to this! It looks like there are 1,339,426 UMLS IDs in the latest Babel run. I found 1,037,476 UMLS IDs not already present in Babel that have a single Biolink type.

There were also 3,110,863 UMLS IDs without a Biolink type, which are:

      1  {'T007': {'Bacterium'}, 'T121': {'Pharmacologic Substance'}} -> []
      5  {'T007': {'Bacterium'}, 'T204': {'Eukaryote'}} -> []
     41  {'T021': {'Fully Formed Anatomical Structure'}} -> []
    160  {'T016': {'Human'}} -> []
    165  {'T010': {'Vertebrate'}} -> []
    516  {'T001': {'Organism'}} -> []
    565  {'T008': {'Animal'}} -> []
    891  {'T120': {'Chemical Viewed Functionally'}} -> []
   2458  {'T090': {'Occupation or Discipline'}} -> []
   6184  {'T091': {'Biomedical Occupation or Discipline'}} -> []
   7991  {'T031': {'Body Substance'}} -> []
  10514  {'T194': {'Archaeon'}} -> []
  18360  {'T167': {'Substance'}} -> []
  30162  {'T011': {'Amphibian'}} -> []
  34624  {'T014': {'Reptile'}} -> []
  41733  {'T015': {'Mammal'}} -> []
  69714  {'T005': {'Virus'}} -> []
  94786  {'T012': {'Bird'}} -> []
 114974  {'T013': {'Fish'}} -> []
 324645  {'T004': {'Fungus'}} -> []
 514514  {'T002': {'Plant'}} -> []
 550402  {'T007': {'Bacterium'}} -> []
1287458  {'T204': {'Eukaryote'}} -> []

I'll be trying to map those to Biolink types next.
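
A tally like the one above could be produced along these lines. This is a sketch under assumptions, not the script that generated the listing: `unmapped_report` and its three dictionary arguments are hypothetical names, and the output layout only loosely imitates the report.

```python
from collections import Counter


def unmapped_report(tuis_by_cui, tui_names, tui_to_biolink):
    """Count CUIs with no Biolink type, grouped by semantic-type combination."""
    counts = Counter()
    for cui, tuis in tuis_by_cui.items():
        biolink = {tui_to_biolink[t] for t in tuis if t in tui_to_biolink}
        if not biolink:
            counts[tuple(sorted(tuis))] += 1

    # Smallest groups first, like the listing above.
    lines = []
    for combo, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        names = ", ".join(f"{t} {tui_names.get(t, '?')}" for t in combo)
        lines.append(f"{n:8d}  {names}")
    return lines
```

Grouping by the full combination of semantic types, rather than by each type separately, keeps mixed cases such as a concept typed as both Bacterium and Eukaryote visible as their own rows.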

There are also some UMLS IDs that have multiple Biolink types:

     49 biolink:Drug|biolink:Food
    136 biolink:Activity|biolink:Procedure
    137 biolink:Device|biolink:Drug
    698 biolink:PhysicalEntity|biolink:Publication
   2156 biolink:Drug|biolink:SmallMolecule
   4556 biolink:Agent|biolink:PhysicalEntity

I suspect that I can remove those biolink:PhysicalEntity types without affecting the interpretation of the provided UMLS concepts. For the others (e.g. C1971594 "Chantix 1 MG Oral Tablet" is both a device and a drug, while C1999759 "BZL101" is both a drug and a small molecule), I think it makes sense to categorize these concepts as both of the specified Biolink types. Does that sound right?

@cbizon
Contributor Author

cbizon commented Aug 29, 2022

This all looks good to me. I do have a few questions -

For all those ones that are different organism types (eukaryote, bird, etc), would it be feasible to go ahead and include them in the taxon concordance?

I don't really understand why C1971594 is a device?

Also, I would say that BZL101 is a small molecule but not a drug (in the Biolink sense of drug) unless we are doing molecule/drug conflation.

That said, the numbers on these are small, and I'm not concerned enough about them to hold this up.

@gaurav
Contributor

gaurav commented Aug 30, 2022

This might be doable by calling umls.write_umls_ids() and passing in the identifiers we need from the table of missing types above.

Not worth more than a day.

@gaurav
Contributor

gaurav commented Aug 30, 2022

  • Try to bring in just the taxa from the first list.
  • biolink:Device|biolink:Drug -> biolink:Drug
  • biolink:Drug|biolink:SmallMolecule -> biolink:SmallMolecule
  • biolink:Agent|biolink:PhysicalEntity -> biolink:Agent
  • biolink:PhysicalEntity|biolink:Publication -> biolink:Publication
  • biolink:Activity|biolink:Procedure -> pick one
  • biolink:Drug|biolink:Food -> biolink:Food
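
These rules can be written down as a small lookup table. The sketch below is illustrative only; `RESOLUTIONS` and `resolve_types` are hypothetical names. Because the list leaves biolink:Activity|biolink:Procedure as "pick one", that pair is deliberately left out of the table and comes back unresolved.

```python
# Pairwise resolutions taken from the list above. Activity|Procedure is
# omitted on purpose: the list only says "pick one", so no choice is made here.
RESOLUTIONS = {
    frozenset({"biolink:Device", "biolink:Drug"}): "biolink:Drug",
    frozenset({"biolink:Drug", "biolink:SmallMolecule"}): "biolink:SmallMolecule",
    frozenset({"biolink:Agent", "biolink:PhysicalEntity"}): "biolink:Agent",
    frozenset({"biolink:PhysicalEntity", "biolink:Publication"}): "biolink:Publication",
    frozenset({"biolink:Drug", "biolink:Food"}): "biolink:Food",
}


def resolve_types(types):
    """Collapse a set of Biolink types to one where a rule exists."""
    key = frozenset(types)
    if key in RESOLUTIONS:
        return [RESOLUTIONS[key]]
    # Single types and combinations without a rule are returned unchanged.
    return sorted(key)
```

Keying the table on a `frozenset` makes the lookup independent of the order in which the types were collected.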

@gaurav
Contributor

gaurav commented Sep 7, 2022

After the fixes described above, we're now down to 36,313 UMLS IDs without Biolink types and 0 UMLS IDs with multiple Biolink types. The UMLS IDs without Biolink types are:

     41 {'T021': {'Fully Formed Anatomical Structure'}} -> []
    160 {'T016': {'Human'}} -> []
    891 {'T120': {'Chemical Viewed Functionally'}} -> []
   2458 {'T090': {'Occupation or Discipline'}} -> []
   6184 {'T091': {'Biomedical Occupation or Discipline'}} -> []
   8219 {'T031': {'Body Substance'}} -> []
  18360 {'T167': {'Substance'}} -> []

The full list is available on Hatteras at /projects/babel/babel-outputs/2022sep6/reports/umls.txt.

I'll dig into this a bit more. At a quick look, T031 Body Substance might need to be added to anatomy (it includes things like peritoneal fluid, aqueous humor, and bronchoalveolar lavage fluid), but T167 Substance doesn't (it includes things like elementary particles, fossils, and hashish).

gaurav added a commit to TranslatorSRI/Babel that referenced this issue Oct 3, 2022
In TranslatorSRI/NodeNormalization#119 (comment), we identified many UMLS IDs that correspond to taxa but aren't currently imported. This PR updates the taxon compendia to include taxa from UMLS. Since Biolink Model v3.0.2 doesn't include UMLS as an identifier prefix for OrganismTaxon, this PR adds an `extra_prefixes` argument to `write_compendium()` to add `UMLS` to that list. However, since `UMLS` was added to the Biolink Model OrganismTaxon class (biolink/biolink-model#1084), we don't actually use this functionality here; I'm leaving it in since we might need it in the future.
gaurav added a commit to TranslatorSRI/Babel that referenced this issue Oct 10, 2022
This PR will implement the plan in TranslatorSRI/NodeNormalization#119 (comment) to (1) create a compendium of UMLS IDs, labels and types for UMLS IDs that weren't included in other compendia, and (2) produce a report that will help us find sections of the UMLS that should be incorporated into Babel in other compendia.

It also fixes a minor bug in which "UMLS" was not listed as a prerequisite for "chemical", which caused "chemical" in some cases to run before the UMLS ids, labels and synonyms had been produced.
gaurav added the Babel label (Issue applies to babel.) on Oct 28, 2022
gaurav added this to the "NodeNorm - needs investigation" milestone on Jun 8, 2023