Upgrading (non-bijective) prefix maps (#99)

Implement function to deterministically make an EPM from a non-bijective prefix map, motivated by mapping-commons/sssom-py#485
biopragmatics · Jan 16, 2024 · faef0bd
1 parent 094ad63
commit faef0bd

Showing 5 changed files with 122 additions and 2 deletions.
diff --git a/docs/source/struct.rst b/docs/source/struct.rst
@@ -86,6 +86,8 @@ and URI prefixes for most semantic spaces as well as specify which CURIE prefix
 URI prefix is the "preferred" one in a given context. Prefix maps, unfortunately, have no way to
 address this. Therefore, we're going to introduce a new data structure.
 
+.. _epms:
+
 Extended Prefix Maps
 --------------------
 Extended Prefix Maps (EPMs) address the issues with prefix maps by including explicit
@@ -106,8 +108,9 @@ containing an entry for ChEBI) looks like:
        }
    ]
 
-EPMs have the benefit that they are still encoded in JSON and can easily be encoded in
-YAML, TOML, RDF, and other schemata.
+An EPM is simply a list of records (see :class:`curies.Record`). EPMs have the benefit that they are still
+encoded in JSON and can easily be encoded in YAML, TOML, RDF, and other schemata. Further, prefix maps can be
+automatically upgraded into EPMs (with some caveats) using :func:`curies.upgrade_prefix_map`.
 
 .. note::
 

diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
@@ -92,6 +92,10 @@ This function also accepts a string with a HTTP, HTTPS, or FTP path to a remote
     structure for situations when there can be CURIE synonyms or even URI prefix synonyms is
     the *extended prefix map* (see below).
 
+    If you're not in a position where you can fix data issues upstream, you can try using
+    :func:`curies.upgrade_prefix_map` to extract a canonical extended prefix map from a non-bijective
+    prefix map.
+
 Loading Extended Prefix Maps
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Extended prefix maps (EPMs) address the issues with prefix maps by including explicit

diff --git a/src/curies/__init__.py b/src/curies/__init__.py
@@ -15,6 +15,7 @@
     load_jsonld_context,
     load_prefix_map,
     load_shacl,
+    upgrade_prefix_map,
     write_extended_prefix_map,
     write_jsonld_context,
     write_shacl,
@@ -42,6 +43,7 @@
     "remap_curie_prefixes",
     "remap_uri_prefixes",
     "rewire",
+    "upgrade_prefix_map",
     "get_version",
     # i/o
     "load_prefix_map",

diff --git a/src/curies/api.py b/src/curies/api.py
@@ -48,7 +48,10 @@
     "DuplicateValueError",
     "DuplicatePrefixes",
     "DuplicateURIPrefixes",
+    # Utilities
     "chain",
+    "upgrade_prefix_map",
+    # Loaders
     "load_extended_prefix_map",
     "load_prefix_map",
     "load_jsonld_context",
@@ -2191,3 +2194,85 @@ def _get_shacl_line(prefix: str, uri_prefix: str, pattern: Optional[str] = None)
         pattern = pattern.replace("\\", "\\\\")
         line += f'; sh:pattern "{pattern}"'
     return line + " ]"
+
+
+def upgrade_prefix_map(prefix_map: Mapping[str, str]) -> List[Record]:
+    """Convert a (potentially problematic) prefix map (i.e., not bijective) into a list of records.
+
+    A prefix map is bijective if it has no duplicate CURIE prefixes (i.e., keys in a dictionary) and
+    no duplicate URI prefixes (i.e., values in a dictionary). Because of the way that dictionaries work in Python,
+    we are always guaranteed that there are no duplicate keys.
+
+    However, it is both possible and frequent to have duplicate values. This happens because many semantic spaces
+    have multiple synonymous CURIE prefixes. For example, the `OBO in OWL <https://bioregistry.io/oboinowl>`_
+    vocabulary has two common, interchangeable prefixes: ``oio`` and ``oboInOwl`` (and the case variant ``oboinowl``).
+    Therefore, a prefix map might contain the following parts that make it non-bijective:
+
+    .. code-block:: json
+
+        {
+          "oio": "http://www.geneontology.org/formats/oboInOwl#",
+          "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#"
+        }
+
+    This is bad because this prefix map can't be used to deterministically compress a URI. For example, should
+    ``http://www.geneontology.org/formats/oboInOwl#hasDbXref`` be compressed to ``oio:hasDbXref`` or
+    ``oboInOwl:hasDbXref``? Neither is necessarily incorrect, but the issue here is that there is not an explicit
+    choice by the data modeler, meaning that data compressed into CURIEs with this non-bijective map might not be
+    readily integrable with other datasets.
+
+    The best solution to this situation is not more code, but rather for the data modeler to address the issue
+    upstream in the following steps:
+
+    1. Choose which of the prefix synonyms is going to be the primary prefix. If you're not sure, the
+       `Bioregistry <https://bioregistry.io/>`_ is a comprehensive registry of prefixes and their synonyms
+       applicable in the semantic web and the natural sciences. It gives a good suggestion of what the best
+       prefix is. In the OBO in OWL case, it suggests ``oboInOwl``.
+    2. Update all related data artifacts to only use that preferred prefix.
+    3. Either 1) remove the other synonyms (in this example, ``oio``) from the prefix map *or* 2) transition to
+       using :ref:`epms`, a more modern data structure for supporting URI and CURIE interconversion.
+
+    The first part of step 3 in this solution highlights one of the key shortcomings of prefix maps themselves:
+    they can't keep track of synonyms, which are often useful in data integration, especially when a single
+    prefix map is defined on the level of a project or community. The extended prefix map is a simple data structure
+    proposed to address this.
+
+    * * *
+
+    This function is for people who are not in the position to make the sustainable fix, and want to automate
+    the assignment of which is the preferred prefix. It uses a deterministic algorithm to choose from two or more
+    CURIE prefixes that have the same URI prefix and generate an extended prefix map in which they have been collapsed
+    into a single record. More specifically, the algorithm is based on a case-sensitive lexical sort of the prefixes.
+    The first in the sort order becomes the primary prefix and the others become synonyms in the resulting record.
+
+    :param prefix_map: A mapping whose keys represent CURIE prefixes and values represent URI prefixes
+    :return: A list of :class:`curies.Record` objects that together constitute an extended prefix map
+
+    >>> from curies import Converter, upgrade_prefix_map
+    >>> pm = {"a": "https://example.com/a/", "b": "https://example.com/a/"}
+    >>> records = upgrade_prefix_map(pm)
+    >>> converter = Converter(records)
+    >>> converter.expand("a:1")
+    'https://example.com/a/1'
+    >>> converter.expand("b:1")
+    'https://example.com/a/1'
+    >>> converter.compress("https://example.com/a/1")
+    'a:1'
+
+    .. note::
+
+        Thanks to `Joe Flack <https://github.com/joeflack4>`_ for proposing this algorithm
+        `in this discussion <https://github.com/mapping-commons/sssom-py/pull/485#discussion_r1451812733>`_.
+
+    """
+    uri_prefix_to_curie_synonyms = defaultdict(list)
+    for curie_prefix, uri_prefix in prefix_map.items():
+        uri_prefix_to_curie_synonyms[uri_prefix].append(curie_prefix)
+    priority_prefix_map = {
+        uri_prefix: sorted(curie_prefixes)
+        for uri_prefix, curie_prefixes in uri_prefix_to_curie_synonyms.items()
+    }
+    return [
+        Record(prefix=prefix, prefix_synonyms=prefix_synonyms, uri_prefix=uri_prefix)
+        for uri_prefix, (prefix, *prefix_synonyms) in sorted(priority_prefix_map.items())
+    ]
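
The grouping and case-sensitive sort described in the docstring above can be sketched without depending on `curies` itself. The sketch below is illustrative only: `SketchRecord` and `upgrade_sketch` are hypothetical names that stand in for `curies.Record` and `upgrade_prefix_map`, and only the three fields set in this diff are modeled. It applies the same steps to the OBO in OWL example from the docstring.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass
class SketchRecord:
    """Hypothetical stand-in for curies.Record with only the fields used here."""

    prefix: str
    uri_prefix: str
    prefix_synonyms: List[str] = field(default_factory=list)


def upgrade_sketch(prefix_map: Mapping[str, str]) -> List[SketchRecord]:
    """Collapse CURIE prefixes that share a URI prefix into one record each."""
    # Group CURIE prefixes by the URI prefix they point to.
    grouped: Dict[str, List[str]] = defaultdict(list)
    for curie_prefix, uri_prefix in prefix_map.items():
        grouped[uri_prefix].append(curie_prefix)

    records = []
    # Visit URI prefixes in sorted order so the output order is stable as well.
    for uri_prefix in sorted(grouped):
        # Case-sensitive sort: uppercase ASCII letters sort before lowercase ones.
        primary, *synonyms = sorted(grouped[uri_prefix])
        records.append(SketchRecord(primary, uri_prefix, synonyms))
    return records


obo_in_owl = "http://www.geneontology.org/formats/oboInOwl#"
records = upgrade_sketch(
    {"oio": obo_in_owl, "oboInOwl": obo_in_owl, "oboinowl": obo_in_owl}
)
print(records[0].prefix, records[0].prefix_synonyms)
```

For this input the primary prefix comes out as `oboInOwl`, because `I` sorts before `i` and `b` sorts before `i`. That it matches the Bioregistry suggestion mentioned in the docstring is a property of this particular example, not something the sort guarantees in general.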
diff --git a/tests/test_api.py b/tests/test_api.py
@@ -25,6 +25,7 @@
     ReferenceTuple,
     URIStandardizationError,
     chain,
+    upgrade_prefix_map,
 )
 from curies.sources import (
     BIOREGISTRY_CONTEXTS,
@@ -860,3 +861,28 @@ def test_version_type(self):
         """
         version = get_version()
         self.assertIsInstance(version, str)
+
+
+class TestUtils(unittest.TestCase):
+    """Test utility functions."""
+
+    def test_clean(self):
+        """Test upgrading a non-bijective prefix map."""
+        prefix_map = {
+            "b": "https://example.com/a/",
+            "a": "https://example.com/a/",
+            "c": "https://example.com/c/",
+        }
+        records = upgrade_prefix_map(prefix_map)
+        self.assertEqual(2, len(records))
+        a_record, c_record = records
+
+        self.assertEqual("a", a_record.prefix)
+        self.assertEqual(["b"], a_record.prefix_synonyms)
+        self.assertEqual("https://example.com/a/", a_record.uri_prefix)
+        self.assertEqual([], a_record.uri_prefix_synonyms)
+
+        self.assertEqual("c", c_record.prefix)
+        self.assertEqual([], c_record.prefix_synonyms)
+        self.assertEqual("https://example.com/c/", c_record.uri_prefix)
+        self.assertEqual([], c_record.uri_prefix_synonyms)
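
The test above lists `b` before `a` and still expects `a` as the primary prefix. As a rough illustration of why that matters, the sketch below contrasts a naive dictionary inversion, whose result depends on insertion order, with a choice based on case-sensitive sort order. Both helper functions are hypothetical and are not part of `curies`.

```python
from typing import Dict, Mapping


def naive_reverse(prefix_map: Mapping[str, str]) -> Dict[str, str]:
    """Invert a prefix map; later entries silently overwrite earlier ones."""
    return {uri_prefix: curie_prefix for curie_prefix, uri_prefix in prefix_map.items()}


def sorted_reverse(prefix_map: Mapping[str, str]) -> Dict[str, str]:
    """Keep, for each URI prefix, the CURIE prefix that sorts first (case-sensitive)."""
    chosen: Dict[str, str] = {}
    for curie_prefix, uri_prefix in prefix_map.items():
        if uri_prefix not in chosen or curie_prefix < chosen[uri_prefix]:
            chosen[uri_prefix] = curie_prefix
    return chosen


uri = "https://example.com/a/"
order_1 = {"a": uri, "b": uri}
order_2 = {"b": uri, "a": uri}

# The naive inversion keeps whichever prefix was inserted last.
naive_1 = naive_reverse(order_1)[uri]
naive_2 = naive_reverse(order_2)[uri]

# The sort-based choice is the same for both insertion orders.
stable_1 = sorted_reverse(order_1)[uri]
stable_2 = sorted_reverse(order_2)[uri]

print(naive_1, naive_2, stable_1, stable_2)
```

Here the naive inversion yields `b` for the first ordering and `a` for the second, while the sort-based choice yields `a` in both cases.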