Add subsetting functionality #67

Merged
merged 6 commits into from
Sep 8, 2023

@cthoyt cthoyt commented Sep 8, 2023

This functionality is useful for downstream applications like the following:

  1. You load a comprehensive extended prefix map, e.g., from the Bioregistry using curies.get_bioregistry_converter().
  2. You load some data that conforms to this prefix map by convention. This is often the case for semantic mappings stored in the SSSOM format.
  3. You extract the list of prefixes actually used within your data
  4. You subset the detailed extended prefix map to only include prefixes relevant for your data
  5. You output the subsetted extended prefix map alongside your data. Effectively, this reconciles the data with the prefix map, and it works especially well with the Bioregistry or other comprehensive extended prefix maps.

Here's a concrete example (which also includes a bit of data science) that
does this for the SSSOM mappings from the Disease Ontology project.

>>> import curies
>>> import pandas as pd
>>> commit = "faca4fc335f9a61902b9c47a1facd52a0d3d2f8b"
>>> url = f"https://raw.githubusercontent.com/mapping-commons/disease-mappings/{commit}/mappings/doid.sssom.tsv"
>>> df = pd.read_csv(url, sep="\t", comment="#")
>>> prefixes = {
...     curies.Reference.from_curie(curie).prefix
...     for column in ["subject_id", "predicate_id", "object_id"]
...     for curie in df[column]
... }
>>> converter = curies.get_bioregistry_converter()
>>> slim_converter = converter.get_subconverter(prefixes)
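
For readers who want to see steps 3 to 5 without network access, here is a minimal sketch using plain dictionaries. It is not how `curies` implements `get_subconverter()`. The names `FULL_PREFIX_MAP`, `ROWS`, `used_prefixes`, and `subset_prefix_map` are hypothetical, the rows are made-up illustrative data, and the sketch assumes every identifier is a well-formed `prefix:identifier` CURIE.

```python
import json

# Hypothetical stand-in for a comprehensive prefix map (not Bioregistry output)
FULL_PREFIX_MAP = {
    "DOID": "http://purl.obolibrary.org/obo/DOID_",
    "MESH": "http://id.nlm.nih.gov/mesh/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}

# Hypothetical SSSOM-like rows, made up for illustration
ROWS = [
    {
        "subject_id": "DOID:0000001",
        "predicate_id": "skos:exactMatch",
        "object_id": "MESH:D000001",
    },
    {
        "subject_id": "DOID:0000002",
        "predicate_id": "skos:exactMatch",
        "object_id": "MESH:D000002",
    },
]


def used_prefixes(rows, columns=("subject_id", "predicate_id", "object_id")):
    """Step 3: collect the prefixes that actually appear in the data."""
    # Assumes each value is a CURIE, so the prefix is the text before the first colon
    return {row[column].split(":", 1)[0] for row in rows for column in columns}


def subset_prefix_map(prefix_map, prefixes):
    """Step 4: keep only the entries whose prefix is used in the data."""
    return {
        prefix: uri_prefix
        for prefix, uri_prefix in prefix_map.items()
        if prefix in prefixes
    }


slim_prefix_map = subset_prefix_map(FULL_PREFIX_MAP, used_prefixes(ROWS))

# Step 5: serialize the slim prefix map so it can be shipped alongside the data
slim_json = json.dumps(slim_prefix_map, indent=2, sort_keys=True)
print(slim_json)
```

With the real library, the `slim_converter` from the example above plays the role of `slim_prefix_map` here.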

This PR also sneaks in a related documentation update on pandas DataFrame processing.

@cthoyt cthoyt enabled auto-merge (squash) September 8, 2023 13:12
@cthoyt cthoyt merged commit e5c56d6 into main Sep 8, 2023
8 checks passed
@cthoyt cthoyt deleted the subsets branch September 8, 2023 13:26