investigate addition of supporting sentence to semmeddb API #563

Closed
andrewsu opened this issue Feb 14, 2023 · 7 comments

@andrewsu
Member

SemMedDB is a text-mined resource for extracting relationships (triples) from the literature. The schema for SemMedDB is described at https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html. Our current semmeddb API (http://biothings.ncats.io/semmeddb; parser: https://github.com/biothings/semmeddb) primarily focuses on the "PREDICATION" table. The Translator consortium would like to explore the addition of the actual sentence used to infer the triple, found in the "SENTENCE" table. I remember that we briefly explored this previously, and the substantial increase in size was a key consideration. (The PREDICATION file is 3 GB; the SENTENCE file is 15 GB.)

@colleenXu
Collaborator

Some previous discussion of "sentence" info:

@erikyao

erikyao commented Feb 23, 2023

We can load the SENTENCE table and join the SENTENCE records to our documents by SENTENCE_ID (as shown in the entity-relationship diagram at the bottom).

Currently our PREDICATION parser discards the SENTENCE_ID field since it's never used.
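
A minimal sketch of the join described above, assuming both tables have already been parsed into Python dicts. The keys `SENTENCE_ID`, `PREDICATION_ID`, and `PMID` follow the identifiers used in this thread; the `SENTENCE` key for the sentence text, the helper name `attach_sentences`, and the in-memory layout are assumptions for illustration, not the actual parser code.

```python
def attach_sentences(predications, sentences):
    """Left-join predication rows to sentence text on SENTENCE_ID.

    Both arguments are iterables of dicts (a hypothetical in-memory layout).
    Predications whose SENTENCE_ID has no match get SENTENCE = None.
    """
    # Build the lookup once so each predication is resolved in O(1).
    text_by_id = {s["SENTENCE_ID"]: s["SENTENCE"] for s in sentences}
    joined = []
    for pred in predications:
        doc = dict(pred)  # copy so the input rows are not mutated
        doc["SENTENCE"] = text_by_id.get(pred["SENTENCE_ID"])
        joined.append(doc)
    return joined


# The first row uses values from the record quoted in this issue;
# the second row is a made-up placeholder showing the unmatched case.
sentences = [
    {
        "SENTENCE_ID": 49832114,
        "PMID": 8455912,
        "SENTENCE": "Definitive diagnoses were 15 osteomyelitis, 14 soft tissue "
                    "lesions (nine cellulitis and five noninfected ischaemic or "
                    "trophic wounds), and nine degenerative bone disease.",
    }
]
predications = [
    {"PREDICATION_ID": 88510072, "SENTENCE_ID": 49832114, "PMID": 8455912},
    {"PREDICATION_ID": 0, "SENTENCE_ID": 0, "PMID": 0},
]

joined = attach_sentences(predications, sentences)
for doc in joined:
    print(doc["PREDICATION_ID"], doc["SENTENCE"])
```

Keeping only an id-to-text dictionary, rather than the full SENTENCE rows, is one way to limit what has to stay in memory during the join.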

@andrewsu
Member Author

@erikyao are you at all worried about the explosion in index size that would result?

@erikyao

erikyao commented Feb 23, 2023

> are you at all worried about the explosion in index size that would result?

@andrewsu I think the additional SENTENCE field(s) won't take too much index size.

I am more worried about the memory usage when loading the SENTENCE table. But we can always preprocess it and extract smaller intermediate files if memory usage becomes a real problem.
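
One way the preprocessing idea could look, as a rough sketch: stream the SENTENCE file row by row and write out only the rows whose id is referenced by a predication, so the full 15 GB file never has to sit in memory. The function name `filter_sentences`, the two-column CSV layout, and the default column positions are assumptions for illustration; the real SENTENCE file has more columns and its layout should be checked against the schema page.

```python
import csv
import io


def filter_sentences(src, dst, wanted_ids, id_col=0, text_col=1):
    """Stream CSV rows from src and write (id, text) for wanted ids to dst.

    Only one row is held in memory at a time. Returns the number of rows kept.
    The column positions are hypothetical defaults, not the real file layout.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    kept = 0
    for row in reader:
        if row[id_col] in wanted_ids:
            writer.writerow([row[id_col], row[text_col]])
            kept += 1
    return kept


# Tiny stand-in for the SENTENCE file; ids are strings as read from CSV.
src = io.StringIO("1,first sentence\n2,second sentence\n3,third sentence\n")
dst = io.StringIO()
kept = filter_sentences(src, dst, wanted_ids={"2", "3"})
print(kept)
print(dst.getvalue())
```

With real files, `src` and `dst` would be handles from `open(..., newline="")`, and `wanted_ids` would be the set of SENTENCE_ID values collected from the PREDICATION file.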

@andrewsu
Member Author

Super, thanks. Let's wait until we make a decision on #569 (next week) so we can possibly make both changes together.

@andrewsu
Member Author

andrewsu commented Jun 2, 2023

sentence context has now been added to the semmeddb2 API (which will soon replace the semmeddb API), so closing this issue

https://biothings.ncats.io/semmeddb2/association/C0007642-ISA-C0410013

{
  "_id": "C0007642-ISA-C0410013",
  "_version": 1,
  "object": {
    "name": "Soft tissue lesion",
    "novelty": 1,
    "semantic_type_abbreviation": "patf",
    "semantic_type_name": "Pathologic Function",
    "umls": "C0410013"
  },
  "pmid_count": 1,
  "predicate": "ISA",
  "predication": [
    {
      "object_score": 906,
      "object_text": "soft tissue lesions",
      "pmid": 8455912,
      "predication_id": 88510072,
      "sentence": "Definitive diagnoses were 15 osteomyelitis, 14 soft tissue lesions (nine cellulitis and five noninfected ischaemic or trophic wounds), and nine degenerative bone disease.",
      "sentence_id": 49832114,
      "subject_score": 888,
      "subject_text": "cellulitis"
    }
  ],
  "predication_count": 1,
  "subject": {
    "name": "Cellulitis",
    "novelty": 1,
    "semantic_type_abbreviation": "dsyn",
    "semantic_type_name": "Disease or Syndrome",
    "umls": "C0007642"
  }
}
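
For readers consuming this response, a short sketch of pulling the supporting sentences out of an association document. The helper name `supporting_sentences` is hypothetical, and the embedded JSON is an abbreviated copy of the record above rather than a live API call.

```python
import json

# Abbreviated copy of the association document shown above.
doc = json.loads("""
{
  "_id": "C0007642-ISA-C0410013",
  "predicate": "ISA",
  "predication": [
    {
      "pmid": 8455912,
      "predication_id": 88510072,
      "sentence": "Definitive diagnoses were 15 osteomyelitis, 14 soft tissue lesions (nine cellulitis and five noninfected ischaemic or trophic wounds), and nine degenerative bone disease.",
      "sentence_id": 49832114
    }
  ],
  "predication_count": 1
}
""")


def supporting_sentences(doc):
    """Return (pmid, sentence) pairs for every predication in the document.

    Uses .get() so documents without sentence context yield None
    instead of raising KeyError.
    """
    return [(p.get("pmid"), p.get("sentence")) for p in doc.get("predication", [])]


pairs = supporting_sentences(doc)
for pmid, sentence in pairs:
    print(pmid, sentence)
```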

andrewsu closed this as completed Jun 2, 2023
@colleenXu
Collaborator

Here's where I noted that I added sentence support to the x-bte annotations #569 (comment)
