Add discovery algorithm #93

Merged
merged 18 commits into from
Nov 14, 2023

Conversation

Member

@cthoyt cthoyt commented Nov 13, 2023

This PR adds a workflow that iterates through a list of URIs and tries to discover new common URI prefixes. Docs: https://curies.readthedocs.io/en/discovery/discovery.html

Algorithm

It follows this basic algorithm given by @matentzn:

  1. Define a prioritized list of delimiters (e.g., #, /)
  2. For each URI, for each delimiter:
    1. If the delimiter isn't present, continue
    2. Right split the URI based on the delimiter
    3. If the part after the split consists of at least one alphanumeric character, save the URI prefix as putative
  3. Apply a frequency cutoff, keeping only the putative URI prefixes that occur often enough
  4. Create dummy CURIE prefixes (ns1, ns2, ...) for each discovered URI prefix
  5. Return a Converter with the dummy CURIE prefix / discovered URI prefixes that can be chained onto a converter (don't modify converters in place, ever!)
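
The steps above can be sketched in plain Python. This is only an illustration and not the code in src/curies/discovery.py: the function name `discover_prefixes`, its parameters, and the plain dictionary return value are hypothetical, and it assumes that "alphanumeric" means `str.isalnum()` holds for the whole local identifier, that the first delimiter producing a match wins, and that a prefix is kept when it has strictly more than `cutoff` distinct local identifiers.

```python
from collections import defaultdict


def discover_prefixes(uris, delimiters=("#", "/"), cutoff=2):
    """Map dummy CURIE prefixes (ns1, ns2, ...) to frequently seen URI prefixes.

    Hypothetical helper for illustration; not part of the curies API.
    """
    luids_by_prefix = defaultdict(set)
    for uri in uris:
        # Delimiters are tried in priority order (step 1)
        for delimiter in delimiters:
            if delimiter not in uri:
                continue  # step 2.1
            head, luid = uri.rsplit(delimiter, maxsplit=1)  # step 2.2
            if luid.isalnum():  # step 2.3, assumed interpretation
                luids_by_prefix[head + delimiter].add(luid)
                break  # assumption: stop at the first delimiter that matches
    # Step 3: keep prefixes with more than `cutoff` distinct local identifiers
    frequent = sorted(
        prefix for prefix, luids in luids_by_prefix.items() if len(luids) > cutoff
    )
    # Step 4: assign dummy CURIE prefixes in a deterministic order
    return {f"ns{i}": prefix for i, prefix in enumerate(frequent, start=1)}


uris = [f"http://ran.dom/{i:03}" for i in range(30)]
print(discover_prefixes(uris, cutoff=3))
# {'ns1': 'http://ran.dom/'}
```

Step 5 is left out of the sketch: the real workflow wraps the result in a Converter so it can be chained onto an existing one instead of modifying it.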

Demo

Assume you have some ontology with randomly generated URIs with the prefix http://ran.dom/ (but you don't know this ahead of time). You can use the curies.discover function to create a converter that has a dummy CURIE prefix ns1 for this URI prefix.

import curies

converter = curies.get_obo_converter()

# URIs can come from wherever
uris = [f"http://ran.dom/{i:03}" for i in range(30)]

discovered_converter = curies.discover(uris, converter=converter)
discovered_converter.records
# [Record(prefix="ns1", uri_prefix="http://ran.dom/")]

# Now, you can chain your extra converter together with your first one
augmented_converter = curies.chain([converter, discovered_converter])

# Compression works for existing and for new URI prefix definitions
augmented_converter.compress("http://purl.obolibrary.org/obo/GO_1234567")
# 'GO:1234567'
augmented_converter.compress("http://ran.dom/002")
# 'ns1:002'

Use Cases

  1. OAEI runs a few competitions on generated ontologies with generated term IDs. To drive adoption of SSSOM in their matchers, there needs to be a zero-barrier solution to generating SSSOM files even in cases where there are no “real” ontologies involved.
  2. Some companies are interested in using SSSOM, but have proprietary URIs in their ontologies and aren't well-versed in practical semantics. Again, they want to extract SSSOM files without the distraction of having to write prefix maps.

@cthoyt cthoyt changed the title Add initial discovery algorithm Add discovery algorithm Nov 13, 2023
Collaborator

@matentzn matentzn left a comment

Amazing effort, made a first round of comments on this! THANKS

src/curies/discovery.py
records = []
record_number = 0
for uri_prefix, luids in sorted(counter.items()):
    if len(luids) > cutoff:
Collaborator

This is a matter of lack of brain power: given http://purl.obolibrary.org/yoyo/TMP_123, how does this code avoid adding both http://purl.obolibrary.org/yoyo/TMP_ and http://purl.obolibrary.org/yoyo/ as prefixes?

Member Author

It goes through the delimiters in priority order: first /, then #, then _. When it starts with /, it sees that TMP_123 isn't alphanumeric, so it moves on and therefore does not discover http://purl.obolibrary.org/yoyo/
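
A small trace of that behavior, as a rough sketch and not the code under review: the helper name `split_uri` is hypothetical, the delimiter order /, #, _ is taken from the reply above, and "alphanumeric" is assumed to mean `str.isalnum()`.

```python
def split_uri(uri, delimiters=("/", "#", "_")):
    """Return (uri_prefix, local_id) for the first delimiter that yields an
    alphanumeric local identifier, or None if no delimiter does.

    Hypothetical helper for illustration; not part of the curies API.
    """
    for delimiter in delimiters:
        if delimiter not in uri:
            continue
        head, tail = uri.rsplit(delimiter, maxsplit=1)
        if tail.isalnum():
            return head + delimiter, tail
        # Otherwise fall through to the next delimiter. For "/", the tail
        # "TMP_123" contains an underscore, so str.isalnum() is False.
    return None


print(split_uri("http://purl.obolibrary.org/yoyo/TMP_123"))
# ('http://purl.obolibrary.org/yoyo/TMP_', '123')
```

Only the prefix ending in TMP_ is produced, because each URI contributes at most one putative prefix under these assumptions.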

@cthoyt cthoyt marked this pull request as ready for review November 14, 2023 10:12
@cthoyt cthoyt enabled auto-merge (squash) November 14, 2023 10:52
@cthoyt cthoyt merged commit 5061a42 into main Nov 14, 2023
8 checks passed
@cthoyt cthoyt deleted the discovery branch November 14, 2023 10:58
2 participants