Add discovery algorithm #93

cthoyt · 2023-11-13T08:33:27Z

This PR adds a workflow that iterates through a list of URIs and tries to discover new common URI prefixes. Docs: https://curies.readthedocs.io/en/discovery/discovery.html

Algorithm

It follows this basic algorithm given by @matentzn:

1. Define a prioritized list of delimiters (e.g., #, /)
2. For each URI, for each delimiter:
   1. If the delimiter isn't present, continue
   2. Right split the URI based on the delimiter
   3. If the part after the split consists of at least one character and all of its characters are alphanumeric, save the URI prefix as putative
3. Apply a cutoff, keeping only the putative URI prefixes that occur frequently enough
4. Create dummy CURIE prefixes (ns1, ns2, ...) for each discovered URI prefix
5. Return a Converter with the dummy CURIE prefixes / discovered URI prefixes that can be chained onto a converter (don't modify converters in place, ever!)
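The steps above can be sketched in plain Python. This is a minimal illustration, not the code in src/curies/discovery.py: the function name discover_prefixes, the default delimiter order, the default cutoff, and the choice to stop at the first delimiter that yields an alphanumeric local ID are all assumptions made for the example. It returns a plain dict rather than a Converter so that it has no dependencies.

```python
from collections import defaultdict


def discover_prefixes(uris, delimiters=("/", "#", "_"), cutoff=1):
    """Sketch of prefix discovery (hypothetical helper, not the curies API).

    Returns a dict mapping dummy CURIE prefixes (ns1, ns2, ...) to URI prefixes.
    """
    # Map each putative URI prefix to the set of local IDs seen after it
    luids_by_prefix = defaultdict(set)
    for uri in uris:
        for delimiter in delimiters:
            if delimiter not in uri:
                continue
            # Right split: everything after the last delimiter is the local ID
            head, _, tail = uri.rpartition(delimiter)
            if tail.isalnum():
                luids_by_prefix[head + delimiter].add(tail)
                # Assumption: stop at the first delimiter that works
                break

    # Keep prefixes with more than `cutoff` distinct local IDs, in sorted order
    kept = [
        uri_prefix
        for uri_prefix, luids in sorted(luids_by_prefix.items())
        if len(luids) > cutoff
    ]
    return {f"ns{i}": uri_prefix for i, uri_prefix in enumerate(kept, start=1)}


uris = [f"http://ran.dom/{i:03}" for i in range(30)]
print(discover_prefixes(uris, cutoff=3))
```

Returning a new mapping instead of mutating an existing one mirrors the last step of the list: the result is meant to be chained onto a converter, not merged into it in place.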

Demo

Assume you have some ontology with randomly generated URIs with the prefix http://ran.dom/ (but you don't know this ahead of time). You can use the curies.discover function to create a converter that has a dummy CURIE prefix ns1 for this URI prefix.

import curies

converter = curies.get_obo_converter()

# URIs can come from wherever
uris = [f"http://ran.dom/{i:03}" for i in range(30)]

discovered_converter = curies.discover(uris, converter=converter)
discovered_converter.records
# [Record(prefix="ns1", uri_prefix="http://ran.dom/")]

# Now, you can chain your extra converter together with your first one
augmented_converter = curies.chain([converter, discovered_converter])

# Compression works for existing and for new URI prefix definitions
augmented_converter.compress("http://purl.obolibrary.org/obo/GO_1234567")
# 'GO:1234567'
augmented_converter.compress("http://ran.dom/002")
# 'ns1:002'
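To make the chaining step in the demo more concrete, here is a dependency-free sketch of the idea using plain dicts. It is not how curies implements chain or compress: the helper names chain_prefix_maps and compress, the rule that the earliest map wins on a prefix clash, and the longest-match lookup are assumptions chosen for illustration.

```python
def chain_prefix_maps(prefix_maps):
    """Merge prefix maps into a new dict; earlier maps win on conflicts (assumption)."""
    merged = {}
    for prefix_map in prefix_maps:
        for prefix, uri_prefix in prefix_map.items():
            # setdefault keeps the first definition and never mutates the inputs
            merged.setdefault(prefix, uri_prefix)
    return merged


def compress(uri, prefix_map):
    """Turn a URI into a CURIE using the longest matching URI prefix, else None."""
    best = None
    for prefix, uri_prefix in prefix_map.items():
        if uri.startswith(uri_prefix):
            if best is None or len(uri_prefix) > len(best[1]):
                best = (prefix, uri_prefix)
    if best is None:
        return None
    prefix, uri_prefix = best
    return f"{prefix}:{uri[len(uri_prefix):]}"


existing = {"GO": "http://purl.obolibrary.org/obo/GO_"}
discovered = {"ns1": "http://ran.dom/"}
augmented = chain_prefix_maps([existing, discovered])

print(compress("http://purl.obolibrary.org/obo/GO_1234567", augmented))
print(compress("http://ran.dom/002", augmented))
```

Both input dicts are left untouched, which is the same design choice the PR makes for converters.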

Use Cases

OAEI runs a few competitions on generated ontologies with generated term IDs. To drive adoption of SSSOM in their matchers, there needs to be a zero-barrier solution to generating SSSOM files even in cases where there are no “real” ontologies involved.
Some companies are interested in using SSSOM, but have proprietary URIs in their ontologies and aren't well-versed in practical semantics. Again, they want to extract SSSOM files without the distraction of having to write prefix maps.

matentzn

Amazing effort, made a first round of comments on this! THANKS

src/curies/discovery.py

matentzn · 2023-11-13T12:43:43Z

+    records = []
+    record_number = 0
+    for uri_prefix, luids in sorted(counter.items()):
+        if len(luids) > cutoff:


This may just be a lack of brain power on my part, but given http://purl.obolibrary.org/yoyo/TMP_123, how does this code avoid adding both http://purl.obolibrary.org/yoyo/TMP_ and http://purl.obolibrary.org/yoyo/ as prefixes?

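cthoyt · Nov 14, 2023

It tries the delimiters in priority order: first /, then #, then _. When it tries /, it sees that the part after the split, TMP_123, isn't alphanumeric (because of the underscore), so it moves on to the next delimiter and therefore does not discover http://purl.obolibrary.org/yoyo/ as a prefix.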
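The behavior described in this reply can be traced for the single URI from the question. The sketch below is illustrative only: trace_split is a hypothetical helper, and the delimiter order is taken from the reply rather than from the merged code.

```python
def trace_split(uri, delimiters=("/", "#", "_")):
    """Record, per delimiter, the candidate local ID and whether it is accepted."""
    steps = []
    for delimiter in delimiters:
        if delimiter not in uri:
            # Delimiter absent: nothing to split on
            steps.append((delimiter, None, False))
            continue
        _, _, tail = uri.rpartition(delimiter)
        accepted = tail.isalnum()  # "TMP_123".isalnum() is False: "_" is not alphanumeric
        steps.append((delimiter, tail, accepted))
        if accepted:
            break
    return steps


for step in trace_split("http://purl.obolibrary.org/yoyo/TMP_123"):
    print(step)
# ('/', 'TMP_123', False)
# ('#', None, False)
# ('_', '123', True)
```

Only the split on _ is accepted, so the only putative prefix recorded for this URI is http://purl.obolibrary.org/yoyo/TMP_.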

cthoyt added 3 commits November 13, 2023 09:28

Add initial discovery algorithm

fb42149

Lint and docs

20cc1cd

Update discovery.py

3b15a5d

cthoyt changed the title ~~Add initial discovery algorithm~~ Add discovery algorithm Nov 13, 2023

cthoyt added 6 commits November 13, 2023 09:43

Update test_discovery.py

c0e6179

Fix linting

17b3516

Add RDFlib

21aeec2

Add first tutorial

eec787e

Hide rdflib

0f3f3e8

Update docs

e65bfcb

matentzn reviewed Nov 13, 2023

cthoyt added 7 commits November 13, 2023 15:20

Update docs and refactor

7d8c073

Improve docs and tests

a4e372c

Merge branch 'main' into discovery

1786bfe

Improve ergonomics and docs

ff69131

More docs

a068715

Update docs

1ca7d8f

Fix tests

01851a1

cthoyt marked this pull request as ready for review November 14, 2023 10:12

cthoyt added 2 commits November 14, 2023 11:21

Update docs

5cf3a69

Update discovery.rst

9e248d7

cthoyt enabled auto-merge (squash) November 14, 2023 10:52

cthoyt merged commit 5061a42 into main Nov 14, 2023
8 checks passed

cthoyt deleted the discovery branch November 14, 2023 10:58
