SSSOM: Separate profile for Literal Mappings #234

Closed
matentzn opened this issue Nov 9, 2022 · 27 comments · Fixed by #235 or #384

@matentzn
Collaborator

matentzn commented Nov 9, 2022

As discussed in #197, we are now going to provide a basic spec for a literal mapping. This is the suggestion:

  • Literals take the place of the "subjects" in the normal sssom
  • we ditch all of the subject related metadata columns
  • INCOMPLETE ISSUE DONT READ
@matentzn
Collaborator Author

Being addressed here #235

@udp is currently in a mad push on OLS, but maybe James, when you get a chance, can you tell me what else needs doing on that PR to call it "good enough" for v1?

@matentzn
Collaborator Author

cc @rsgoncalves, would also like to hear your input on the PR if you don't mind. Happy to answer any questions if you don't understand what exactly it does!

@rsgoncalves

rsgoncalves commented Dec 13, 2022

The PR looks good to me. I think we can swap out our (simple) mapping format with SSSOM straightforwardly. I could use your help clarifying a few things, further below.

For context: In the mapping tool we've been developing, the output is a simple table like so: (subject_id, subject_label, object_id, object_label, mapping_score). For example: ("bbj-a-113", "Endometrial cancer", "http://www.ebi.ac.uk/efo/EFO_1001512", "endometrial carcinoma", 0.977010419). The tool doesn't output a predicate but it is assumed to be exactMatch— through UIs, curators can change the predicate to 'broad' or 'narrow'.
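As a rough illustration of how such a five-column record could be carried over into SSSOM-style columns, here is a minimal Python sketch. The function names, the default `semapv:LexicalMatching` justification, and the choice of `confidence` as the slot for the score are assumptions made for the example, not something the tool or the spec prescribes.

```python
import csv
import io

# Slot names taken from the SSSOM columns used in this thread.
BASE_COLUMNS = [
    "subject_id", "subject_label", "predicate_id",
    "object_id", "object_label", "mapping_justification",
]


def to_sssom_row(record, predicate_id="skos:exactMatch",
                 justification="semapv:LexicalMatching",
                 score_slot="confidence"):
    """Turn a (subject_id, subject_label, object_id, object_label, score)
    tuple into a dict keyed by SSSOM slot names.

    The defaults are assumptions: the tool emits no predicate, so
    exactMatch is filled in, and the score goes into `score_slot`.
    """
    subject_id, subject_label, object_id, object_label, score = record
    return {
        "subject_id": subject_id,
        "subject_label": subject_label,
        "predicate_id": predicate_id,
        "object_id": object_id,
        "object_label": object_label,
        "mapping_justification": justification,
        score_slot: score,
    }


def to_tsv(records, score_slot="confidence"):
    """Serialise the records as a tab-separated table with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=BASE_COLUMNS + [score_slot],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(to_sssom_row(record, score_slot=score_slot))
    return buffer.getvalue()


example = ("bbj-a-113", "Endometrial cancer",
           "http://www.ebi.ac.uk/efo/EFO_1001512",
           "endometrial carcinoma", 0.977010419)
print(to_tsv([example]))
```

Passing `score_slot="similarity_score"` instead would write the same number under the other slot, which is the choice question 3 below asks about.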

Now to questions:

  1. In mapping some literals, like trait descriptions in OpenGWAS database records, we want to keep track of the (internal) record identifier associated with each literal. Would it be possible to keep some subject metadata as optional? I understand these identifiers are not of general use, but they're used in internal pipelines and this way we wouldn't have to maintain 2 tables (an internal identifier table + a SSSOM table).

  2. The description of literal_source states "URI of ontology source for the literal". Seems there's an assumption these literals come from ontologies— is this intended? Do values need to resolve? If not, I imagine this field could be (ab)used to store internal identifiers, for example, but not sure that's the appropriate use.

  3. What is the design difference between confidence and similarity_score (when to use one vs the other)? We output a mapping "confidence" score, which is a similarity score computed by one of multiple supported similarity metrics.

@matentzn
Collaborator Author

So in your use case, you always have an internal identifier? In this case, you could simply be using normal SSSOM rather than the literal profile?

The description of literal_source states "URI of ontology source for the literal". Seems there's an assumption these literals come from ontologies— is this intended? Do values need to resolve? If not, I imagine this field could be (ab)used to store internal identifiers, for example, but not sure that's the appropriate use.

Seems wrong, @udp.

What is the design difference between confidence and similarity_score (when to use one vs the other)? We output a mapping "confidence" score, which is a similarity score computed by one of multiple supported similarity metrics.

Great question, can you open a new issue about that? I will try my best to document the difference, but it is true that these two metrics will often coincide.

@rsgoncalves

No, not always. So far only in a couple of datasets have we had to maintain internal identifiers. I think the literal profile is still the route, with some optional field to specify such identifiers. Could that be literal_source perhaps? The label seems to suit the intent at least.

@matentzn
Collaborator Author

I think literal_source may be confused with being the context (text, database, etc) from which the literal originated, but it may be an option - I am 50/50 here. However, to avoid marketing issues moving forward: can you articulate a bit more how we can communicate the difference between:

Classic SSSOM:

subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification
A:1	label 1	skos:exactMatch	A:2	label 2	semapv:LexicalMatching

Literal Profile

literal_source	literal	predicate_id	object_id	object_label	mapping_justification
A:1	label 1	skos:exactMatch	A:2	label 2	semapv:LexicalMatching

matentzn added this to the 1.1.0 milestone Feb 16, 2023
matentzn added a commit that referenced this issue Aug 2, 2023
Fixes #197 
Fixes #234 

- [x] `docs/` have been added/updated if necessary
- [x] `make test` has been run locally
- [x] tests have been added/updated (if applicable)
- [ ] [CHANGELOG.md](https://github.com/mapping-commons/sssom/blob/0.9.0/CHANGELOG.md) has been updated.

This PR adds a new profile to SSSOM for the representation of literal
mappings, leaving the default SSSOM intact.

---------

Co-authored-by: James McLaughlin <james@mclgh.net>
cmungall reopened this Apr 2, 2024
@cmungall
Contributor

cmungall commented Apr 2, 2024

I see I approved #235 but I actually have a number of problems to raise

  • I thought this was going to be a separate profile, not go in the main schema
  • Currently this is causing some issues as we are reusing the same slot_uri for different meanings in literal and non literal mappings (some of these can be fixed in linkml but you will run into issues with jsonld)
  • don't use spaces in class names
  • I also pointed out overlap with annotation models
  • How are these to be used, exchanged, serialized? How are they collected? Will a MappingSet be extended to allow separate fields for literal mappings and mappings? (and how would this work e.g. with CSV)? Perhaps the range of mappings will be extended to be a union of mapping and literal mapping? Or will these be a separate LiteralMappingSet?

@matentzn
Collaborator Author

matentzn commented Apr 4, 2024

don't use spaces in class names

I see many models that do this, like https://github.com/biolink/biolink-model/blob/1698cf997785490304a617123d5e3a242c6b2bc0/biolink-model.yaml#L6128. Where can I find docs about this?

I thought this was going to be a separate profile, not go in the main schema

Is there something to read about modular schema development best practices?

Currently this is causing some issues as we are reusing the same slot_uri for different meanings in literal and non literal mappings (some of these can be fixed in linkml but you will run into issues with jsonld)

That was an honest mistake, now fixed. Technically literal mappings are not yet connected to the spec, we just wanted to have the docs out there to be able to use it, even if there is no tool support.

@gouttegd
Contributor

gouttegd commented Jul 25, 2024

Technically literal mappings are not yet connected to the spec, we just wanted to have the docs out there to be able to use it, even if there is no tool support.

But how is the “literal profile” even supposed to be used? All we have is a literal mapping class which cannot be contained in a mapping set (a mapping set can only contain mappings, not literal mappings).

I second @cmungall ’s questions:

How are these to be used, exchanged, serialized? How are they collected? Will a MappingSet be extended to allow separate fields for literal mappings and mappings? (and how would this work e.g. with CSV)? Perhaps the range of mappings will be extended to be a union of mapping and literal mapping? Or will these be a separate LiteralMappingSet?

Those questions should get answered before we make a SSSOM 1.0, or the “literal profile” should be removed from the 1.0 version in my opinion.

Right now, the “literal profile” is in effect impossible to implement in code.

@jamesamcl
Member

I think a separate literal mapping set would be fine? It was never the intention that they would be in the same file.

The use case for this is to publish all of the manually asserted string to term mappings we have collected in ZOOMA, see https://github.com/EBISPOT/zooma2sssom/tree/master/mappings

@gouttegd
Contributor

I don’t see why we even need a separate “profile” or a separate class for literal mappings for such a use case.

Why not simply put the literal in the subject_label? Along with a new EntityType value for subject_type that indicates that the subject is a literal and that, therefore, for this particular mapping it is the subject_label, not the subject_id, that matters (the subject_id can even be absent).

subject_label	predicate_id	object_id	mapping_justification	mapping_provider	subject_type
uterus	http://www.w3.org/2000/01/rdf-schema#label	http://purl.obolibrary.org/obo/BTO_0001424	https://w3id.org/semapv/vocab/ManualMappingCuration	https://www.ebi.ac.uk/vg/faang	literal
sperm	http://www.w3.org/2000/01/rdf-schema#label	http://purl.obolibrary.org/obo/CL_0000019	https://w3id.org/semapv/vocab/ManualMappingCuration	https://www.ebi.ac.uk/vg/faang	literal
kidney	http://www.w3.org/2000/01/rdf-schema#label	http://purl.obolibrary.org/obo/BTO_0000671	https://w3id.org/semapv/vocab/ManualMappingCuration	https://www.ebi.ac.uk/vg/faang	literal
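To make the proposal concrete, here is a minimal Python sketch of how a reader could treat such rows. It assumes the hypothetical `literal` value for `subject_type` suggested above (it is not an existing enum value), and the helper names are made up for illustration.

```python
import csv
import io

# One row from the table above, kept as tab-separated text.
SAMPLE_TSV = (
    "subject_label\tpredicate_id\tobject_id\tmapping_justification\t"
    "mapping_provider\tsubject_type\n"
    "uterus\thttp://www.w3.org/2000/01/rdf-schema#label\t"
    "http://purl.obolibrary.org/obo/BTO_0001424\t"
    "https://w3id.org/semapv/vocab/ManualMappingCuration\t"
    "https://www.ebi.ac.uk/vg/faang\tliteral\n"
)


def read_rows(tsv_text):
    """Parse the tab-separated text into a list of dicts, one per mapping."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def mapped_subject(row):
    """Return what is actually being mapped on the subject side.

    For a row whose subject_type is the (hypothetical) value "literal",
    that is the subject_label; otherwise it is the subject_id.
    """
    if row.get("subject_type") == "literal":
        return row["subject_label"]
    return row["subject_id"]


for row in read_rows(SAMPLE_TSV):
    print(mapped_subject(row), "->", row["object_id"])
```

Rows without a `subject_type`, or with any other value, fall through to the usual `subject_id` lookup, so ordinary mappings are read exactly as before.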

@jamesamcl
Member

Yes, I think this approach would work if subject_id is made optional.

@allenbaron

I think it's helpful to have an SSSOM-like approach for literals and I agree it fits well and doesn't necessarily need a separate "profile", but I wonder if it would lead to significant scope creep. Could SSSOM become a TSV format for annotating information about any kind of subject-predicate-object relationship? The more slots become optional, and the more optional slots exist, the more trouble developers will have implementing tools and the more trouble users will have finding a tool that does what they are looking for.

Why does the literal slot replace the subject_id slot instead of the object_id slot? Would literals ever be able to use oboInOwl synonym predicates? I can't see how oio:hasNarrowSynonym makes sense with a URI as the object. I imagine having the literal as subject works fine with existing mapping predicates (e.g. skos).

@gouttegd
Contributor

I think it's helpful to have an SSSOM-like approach for literals and I agree it fits well and doesn't necessarily need a separate "profile" but I wonder if it would lead to significant scope creep.

Maybe, but it seems there is clear interest in being able to represent such “literal mappings”. So the options are:

  1. Declare that this is completely out-of-scope for SSSOM. Sorry, this is not the standard you are looking for, it is not designed to handle this case and it will never be.

A likely outcome of this option is that people who need to handle this case will, in effect, “fork” SSSOM to create their own variant that can represent literal mappings. If several people do that, we will end up in the same situation as we were for general mappings before SSSOM: everyone will represent literal mappings with their own custom format, which will all be slightly incompatible with each other.

  2. Propose a variant of SSSOM that can represent literal mappings. This is what was started with #235 (Adding SSSOM profile for literals). Basically, instead of letting people fork SSSOM to handle literal mappings, we create our own fork (and we call that a “profile”).

Two problems with that approach.

First, for now it is incomplete. The “literal profile” defines a literal mapping object, but as @matentzn said, it is currently “not yet connected” to the rest of the spec. As such, it is of little value since developers cannot create tools to deal with such mappings.

Second, even when it is complete, the “literal profile” will be a mess to implement, at least in non-duck-typed languages. There is no relation between mapping and literal mapping, so polymorphism won’t help. The “best” solution will be to create a corresponding literal mapping set class. This will result in a lot of duplicated code, for (in my opinion) very little benefit.

  3. Adapt the model to support “literal mappings” without having to create a fork/profile. This is basically what I propose. In this case, the “adaptation” does not even require adding any new slots. All we need is a new value in the EntityType enum to indicate that one of the mapped entities is a “literal” rather than an “entity with an identifier”, and a paragraph somewhere in the spec to explain that when the subject_type of a mapping is a literal, then subject_label is mandatory and must contain the literal that is being mapped.

I see no obvious drawbacks to that approach, and only benefits. Notably:

a. This allows for either side of the mapping (subject or object) to be the literal. If subject_type is set to literal, then the subject is the literal, and the literal value is to be found in subject_label. If object_type is set to literal, then the object is the literal, and the literal value is to be found in object_label.

b. Consequently, this allows inversion of mappings according to SSSOM’s standard rules (contrary to the profile proposed in #235, where the literal can only be on the subject side).

c. As a side-effect, this even allows for literal-to-literal mappings, should anyone ever need to do that.

d. This allows mixing literal and non-literal mappings, should anyone ever need to do that. Not saying this is necessarily a good idea, but the approach automatically makes it possible without anything special to do. By contrast, the separate fork/profile route would never allow that unless we explicitly plan for this possibility.

e. Implementation-wise, this should be a breeze.
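Points (a) to (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: mappings are plain dicts keyed by slot name, `literal` is the hypothetical type value proposed here, and the small inverse-predicate table covers just a handful of SKOS predicates rather than SSSOM's full inversion rules.

```python
# Assumed, deliberately incomplete table of inverse predicates.
INVERSE_PREDICATES = {
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
}


def invert(mapping):
    """Return a new mapping with the subject_* and object_* slots swapped.

    Because the literal lives in subject_label/object_label and is flagged
    by subject_type/object_type, it moves sides like any other slot.
    """
    predicate = mapping["predicate_id"]
    if predicate not in INVERSE_PREDICATES:
        raise ValueError(f"no known inverse for {predicate}")
    inverted = {}
    for slot, value in mapping.items():
        if slot.startswith("subject_"):
            inverted["object_" + slot[len("subject_"):]] = value
        elif slot.startswith("object_"):
            inverted["subject_" + slot[len("object_"):]] = value
        else:
            inverted[slot] = value
    inverted["predicate_id"] = INVERSE_PREDICATES[predicate]
    return inverted


literal_mapping = {
    "subject_label": "uterus",
    "subject_type": "literal",
    "predicate_id": "skos:exactMatch",
    "object_id": "BTO:0001424",
    "mapping_justification": "semapv:ManualMappingCuration",
}
print(invert(literal_mapping))
```

Nothing in `invert` is specific to literals, which is the point: the same function handles entity-to-entity, literal-to-entity, and literal-to-literal rows.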

The more slots become optional, and the more optional slots exist, the more trouble developers will have implementing tools and the more trouble users will have finding a tool that does what they are looking for.

Apart from subject_id, predicate_id, object_id, and mapping_justification, all slots are optional. The only thing my proposition would change is that, when checking whether a mapping has a subject_id (resp. object_id), an implementation should check beforehand the value of subject_type (resp. object_type) – if the value is present and is literal, then it is subject_label (resp. object_label) that should be checked for existence.
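A minimal Python sketch of that check, assuming mappings are plain dicts keyed by slot name and that the new type value is spelled `literal` (both are assumptions for the example, as is the function name):

```python
def missing_required_slots(mapping, literal_type="literal"):
    """List the required slots that are absent or empty in `mapping`.

    For each side, the type slot is looked at first: if it says the side is
    a literal, the *_label slot is required; otherwise the *_id slot is.
    """
    missing = []
    for side in ("subject", "object"):
        if mapping.get(side + "_type") == literal_type:
            required = side + "_label"
        else:
            required = side + "_id"
        if not mapping.get(required):
            missing.append(required)
    for slot in ("predicate_id", "mapping_justification"):
        if not mapping.get(slot):
            missing.append(slot)
    return missing


literal_row = {
    "subject_label": "uterus",
    "subject_type": "literal",
    "predicate_id": "skos:exactMatch",
    "object_id": "BTO:0001424",
    "mapping_justification": "semapv:ManualMappingCuration",
}
print(missing_required_slots(literal_row))
```

A row with no `subject_type` is checked exactly as it is today, so the change is confined to one extra lookup per side.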

I do agree that the fact that most slots are optional can complicate the use of SSSOM, though. This, in fact, is where the notion of “profile” would be interesting, but it would be different from the type of “profile” that has been proposed in #235.

A “profile” could simply be a list of slots that, within the profile, should be considered mandatory. The spec could define a few such profiles, and users would be free to define their own.

The idea being that, once you have declared a set to adhere to a given profile (and the parser has verified that the set is indeed compliant with the indicated profile), you no longer have to worry about which slots are present or not because you already know that all slots mandated by your profile are present (if they were not, the parser would have rejected the set outright).
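A minimal Python sketch of that idea follows. The profile names, their slot lists, and the `ProfileError` class are all hypothetical; nothing like this exists in the spec or in current SSSOM tooling.

```python
# Hypothetical profiles: each is just a tuple of slots that must be present.
PROFILES = {
    "standard": ("subject_id", "predicate_id", "object_id",
                 "mapping_justification"),
    "labelled": ("subject_id", "subject_label", "predicate_id",
                 "object_id", "object_label", "mapping_justification"),
}


class ProfileError(ValueError):
    """Raised when a mapping set does not comply with its declared profile."""


def check_profile(mappings, profile):
    """Reject the whole set if any mapping lacks a slot the profile mandates."""
    required = PROFILES[profile]
    for index, mapping in enumerate(mappings):
        missing = [slot for slot in required if not mapping.get(slot)]
        if missing:
            raise ProfileError(
                f"mapping {index} lacks {missing}, required by {profile!r}")
    return mappings


rows = check_profile(
    [{"subject_id": "A:1", "subject_label": "label 1",
      "predicate_id": "skos:exactMatch",
      "object_id": "A:2", "object_label": "label 2",
      "mapping_justification": "semapv:LexicalMatching"}],
    "labelled",
)
# After the check, code downstream can index the mandated slots directly.
print(rows[0]["subject_label"])
```

The check runs once at parse time, so the rest of a tool can rely on the mandated slots without guarding every access.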

@allenbaron

Okay, I understand why an official "fork" for literals is desirable and your argument in 3 for adapting the model by adding literal to the list of possible values for subject_type and object_type.

It looks to me like rdfs:Literal is already an option for these type slots (https://mapping-commons.github.io/sssom/EntityTypeEnum/). Correct me if I read the LinkML documentation wrong.

Has anyone tried mixing different types in the same mappingSet?

On the one hand, it would be really convenient for me to curate both a mapping between entities, and a mapping between an entity and literal in a single TSV file like this (top = literal to owl:Class, bottom = owl:Class to owl:Class):

subject_id	subject_label	predicate_id	predicate_modifier	object_id	object_label	mapping_justification	subject_type	object_type
MESH:C535731	Chmrq1	oboInOwl:hasExactSynonym	Not	DOID:0070556	CAMRQ1	semapv:ManualMappingCuration	literal	linkml:Uriorcurie
MESH:C535731	Dysequilibrium syndrome	skos:exactMatch		DOID:0070556	CAMRQ1	semapv:ManualMappingCuration	linkml:Uriorcurie	linkml:Uriorcurie

I assume subject_id would be equivalent to literal_source in #235.

I can fairly easily tell which is which with just these two mappings, primarily by the predicate I chose to use. But if both mappings used skos:exactMatch as predicate, which I assume would be in spec, it would require a curator to look at the subject_type and object_type slot for every mapping to make sure they are curating the right slots. Having to look that up for every mapping would be much less simple than it is currently, especially when labels become longer and looking at types adds the need for lots of horizontal scrolling.

A “profile” could simply be a list of slots that, within the profile, should be considered mandatory. The spec could define a few such profiles, and users would be free to define their own.

I like this idea for a "profile".

It seems fairly straightforward to say in the standard mappings "profile" subject_id, predicate_id, object_id, and mapping_justification are required, while for the literal "profile" subject_label would become required and subject_id optional with no other changes to slots.

Profiles like this could be defined in the mappingSet metadata. Curators could be alerted that everything in a set is of a particular type (or set of allowed types), preventing the confusion I mentioned above. It does lose some of the convenience of creating mappings between very different types in the same file. I suppose you could always define a super "profile" that allows anything from the other defined profiles and then create tools to merge or split profiles.

@gouttegd
Contributor

It looks to me like rdfs:Literal is already an option for these type slots (https://mapping-commons.github.io/sssom/EntityTypeEnum/).

Yes. However, it’s unclear to me whether it is suitable here (the poor documentation of the model doesn’t help). Can it be used outside of an RDF context? If I have a list of, say, scRNAseq cell cluster names and I want to map them to Cell Ontology IDs, would it be correct to use rdfs literal as the subject_type even though the subjects are just entries in a flat list and are not part of any RDF graph at all?

Maybe it would be fine, maybe not. I just don’t know. Whoever came up with the values for the EntityType enum would need to clarify.

Has anyone tried mixing different types in the same mappingSet?

Do you mean mixing mappings with different subject_type (or object_type) values? I never had to do that (all the mappings I have to deal with are mappings between OWL classes), but that’s a completely supported situation. I don’t know about SSSOM-Py, but SSSOM-Java would have no problem whatsoever dealing with such mixed mappings.

Or did you mean, mixing (normal) mappings with literal mappings (as represented by the new profile)? Then no, right now it’s completely impossible to do that.

On the one hand, it would be really convenient for me to curate both a mapping between entities, and a mapping between an entity and literal in a single TSV file like this (top = literal to owl:Class, bottom = owl:Class to owl:Class):

subject_id	subject_label	predicate_id	predicate_modifier	object_id	object_label	mapping_justification	subject_type	object_type
MESH:C535731	Chmrq1	oboInOwl:hasExactSynonym	Not	DOID:0070556	CAMRQ1	semapv:ManualMappingCuration	literal	linkml:Uriorcurie
MESH:C535731	Dysequilibrium syndrome	skos:exactMatch		DOID:0070556	CAMRQ1	semapv:ManualMappingCuration	linkml:Uriorcurie	linkml:Uriorcurie

I am sorry but I don’t understand your example at all.

The second mapping states that DOID:0070556 is an exact match to MESH:C535731; the first one seems to state that DOID:0070556 is not an exact synonym to MESH:C535731. I don’t understand what that is supposed to mean. Why does MESH:C535731 have a different label in the two mappings? Why is the subject_type of the first mapping a literal, while it clearly refers to an entity?

(Besides, linkml:Uriorcurie is not a valid value for subject_ or object_type.)

I can fairly easily tell which is which with just these two mappings, primarily by the predicate I chose to use. But if both mappings used skos:exactMatch as predicate, which I assume would be in spec

Again, I don’t understand what you mean here. The spec does not and will not mandate which predicate to use (at most it can recommend that some predicates be used or conversely discourage the use of some others, but that’s it). Just because “literal mappings” would become an officially supported type of mapping does not mean that the spec would force you to use skos:exactMatch for those mappings.

It seems fairly straightforward to say in the standard mappings "profile" subject_id, predicate_id, object_id, and mapping_justification are required, while for the literal "profile" subject_label would become required and subject_id optional with no other changes to slots.

Something like that, yes.

Profiles like this could be defined in the mappingSet metadata.

I would not envision allowing the definition of a profile in a mapping set’s metadata. Instead, profiles should be defined externally, and a mapping set would simply declare that it uses a specific profile. Allowing each mapping set to define its own profile seems like a needless complication to me.

Curators could be alerted that everything in a set is of a particular type

You don’t need profiles to do that. subject_type and object_type are propagatable slots, which means that if all mappings in your set have the same subject_type (resp. object_type), you can set the subject_type (resp. object_type) once and for all in the mapping set’s metadata and the value you set there will apply to all mappings in the set.
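A simplified Python sketch of that propagation, for illustration only: the function name is made up, only the two slots named here are treated as propagatable, and keeping a mapping's own value when it already has one is an assumption of this sketch rather than a statement of the spec's exact rule.

```python
# Only the two slots discussed here; the real list of propagatable slots
# is defined by the SSSOM spec and is longer.
PROPAGATABLE_SLOTS = ("subject_type", "object_type")


def propagate(set_metadata, mappings):
    """Copy set-level values of propagatable slots down to each mapping.

    Assumption: a mapping that already carries its own value keeps it.
    The input mappings are left untouched; new dicts are returned.
    """
    result = []
    for mapping in mappings:
        row = dict(mapping)
        for slot in PROPAGATABLE_SLOTS:
            if slot in set_metadata and not row.get(slot):
                row[slot] = set_metadata[slot]
        result.append(row)
    return result


metadata = {"mapping_set_id": "https://example.org/sets/demo",
            "subject_type": "owl class"}
rows = [
    {"subject_id": "A:1", "predicate_id": "skos:exactMatch", "object_id": "B:1"},
    {"subject_id": "A:2", "predicate_id": "skos:exactMatch", "object_id": "B:2"},
]
for row in propagate(metadata, rows):
    print(row["subject_id"], row["subject_type"])
```

With the type set once in the metadata, a curator or a tool can tell what kind of subjects the whole set contains without inspecting each row.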

I suppose you could always define a super "profile" that allows anything from the other defined profiles

Or you can just not use profiles, if you need to merge several sets that are compliant with different profiles. Profiles, if we ever create them, would not be a mandatory feature – mapping sets would not have to have a profile.

@matentzn
Collaborator Author

Sorry for barging in here, I don't have time to comment on here much. Here is my very short take:

  1. Literal mappings and semantic entity mappings are very different things and should not be confused
  2. We have had lots of meetings about the literal profile and its motivation, discussed it at a workshop, had a PR open for months (or years) and it is done. I will take an unusually hard stance against pulling it back (not unbreakable, but hard, so it will be a big distraction for all of us).
  3. The main use case is to capture grounding decisions ("string occurrences in text that are supposed to refer to a concept"), sort of like a synonym table. I am very against mixing the concepts of synonyms and mappings.
  4. I do not think we should make our work of maintaining the SSSOM core model harder by considering what the status is of the profile(s, in the future). Let's just label them as "unofficial", not part of the main standard, for now.

@gouttegd
Contributor

gouttegd commented Jul 26, 2024

Literal mappings and semantic entity mappings are very different things and should not be confused

OK.

We have had lots of meetings about the literal profile and its motivation, discussed it at a workshop, had a PR open for months (or years)

Where is the trace of those “lots of meetings”?

All I know about is:

  • the discussion in #197 (Representing literal values in SSSOM), which has not been a particularly active one and which provides very few insights about how you came to the current “design” – in fact, according to that discussion most of the decisions were taken in what seems to have been a private discussion between you and @udp, the minutes of which are not available anywhere;

  • this issue, #234 (SSSOM: Separate profile for Literal Mappings), which was opened at a time the decision to go for a separate “literal profile” had already been taken;

  • the “2nd Mapping Commons Workshop on SSSOM”, which included a presentation by James about the literal profile – again, at a time the decision to use a separate profile for literal mappings had already been taken; the correctness of that decision does not seem to have been discussed following James’ presentation, at least not in the recording that is available. If there has been further discussion outside of the workshop itself, where are the minutes?

it is done

We have a very different opinion of what can be considered “done”. I say it again: the literal profile is right now unusable. There are way too many questions left open about how it can/should be used.

The only thing the literal profile does for now is causing confusion, by leading people to believe they can use SSSOM to represent literal mappings, which is absolutely not the case.

The EBI has already started to publish “literal mapping sets” (see James’ message above) in .sssom.tsv files. Anyone could legitimately conclude that those files are bona fide SSSOM/TSV files, and therefore would expect to be able to use them with the existing SSSOM/TSV tools. But those files are not SSSOM/TSV files – no SSSOM tools can deal with them! And no SSSOM tools will ever be able to deal with them.

Aren’t we supposed to care about “interoperability”?

I am very against mixing the concepts of synonyms and mappings.

OK.

I do not think we should make our work of maintaining the SSSOM core model harder by considering what the status is of the profile(s, in the future). Let's just label them as "unofficial", not part of the main standard, for now.

Hard disagree. You’re taking the easy path now without consideration for how hard you will make things in the future. That may be fine for software development in general (“move fast and break things”, as the tech bros of Silicon Valley are saying), but when designing a (hopefully) long-term standard, you want to move slow and fix things.

Right now, the “literal profile” is a half-assed design that no one knows how to use (not even you apparently). Leaving it like that and kicking the can down the road can only come back to hit us hard in the future.

@allenbaron

@gouttegd, the confusion caused by the example table I shared is exactly the point I was trying to make about mixing literal and entity mappings. The source of both mappings is the same MESH entity, but the top mapping is between the literal in the subject_label (which was a synonym linked to that MESH term) and the DOID entity, and states that that literal should NOT be mapped to the specified DOID because it is wrong, while the bottom mapping is between the MESH and DOID entities and states they are exact matches. I could have left out the subject_id for the top mapping because it is optional but I wanted to be able to tell in the future where this literal came from.

@matentzn, barging in... hahaha. Like you said, you put in work earlier when you had time. I'm sorry I couldn't contribute more at an earlier stage, but I'm fine with leaving literals as unofficial. Can I ask why rdfs:Literal is an option for subject_type, predicate_type and object_type in SSSOM?

@graybeal

A somewhat naive comment (it's hard to keep all these arguments clear without spending many hours; I thank those of you who have devoted that time):

  • Can someone easily describe the use case for literals in a mapping? I don't understand what they mean conceptually, especially when the location of the string is not captured (true?), and why the word "mapping" applies. I think I've read all the discussion but I haven't seen that crisply captured; a pointer is fine if I'm wrong.
  • I know I've read the justification for literals as subjects and I really don't understand that. It may be too sophisticated for my ontology expertise; you don't have to spend time on just me (but others like me will eventually use it, or try to).

With that in mind, I still think these basic thoughts could apply:

  • It is very non-ideal to have literals allowed by one part of the standard but not enabled by the other parts. Given that, the 'doneness' of the previous decision is not matched by the specification, so a decision needs to be taken.
  • If it is possible to make literals enabled with a minimal amount of work, I think that is worth doing, because it adds a lot of value. It seems to me literals only as objects makes the most sense as a minimal amount of work.
  • If it is not possible to enable literals with minimal work, remove the other reference(s) to them and stop here.
  • Even if mixing literals in the same spec/the same mapping is heresy for some reason, I think if they are to be supported at all then creating a whole separate profile is unhelpfully complicating for many users who want to do that. Because, usability.

I agree that if literals are not crisply specified in this standard, the chance of divergence and even competing standards is high. But if you think literal-included triples are not really mappings for SSSOM, then that's the principled decision on which you should stand, and that other thing is not a profile, it's a different standard.

@jonquet

jonquet commented Jul 27, 2024

Implementing literal mappings is, to me, the final step in making SSSOM a duplicate of RDF (and even RDF-star, as we can say things about the triple). I don't think we need another RDF.
This is exactly what the example just above shows.

You can drop the Simple, you can change Ontology to Resource, and you get SSRM.

@matentzn
Collaborator Author

In the interest of driving SSSOM 1.0 home in the coming weeks, and given the enormous number of things to unpack in all the comments given here, I am ok with yanking the literal profile from the standard, for now (not happy but I can read a room 😂). I can move it to another repo and develop it independently as a non-Standard, and make sure we communicate use cases clearly for this. One day in the far future we can move this "profile" or "standard for something else" back here and have a vote.

Please voice your objections to this approach by 1st August; I will be responsible for the move!

@gouttegd
Contributor

gouttegd commented Jul 27, 2024

I am ok with yanking the literal profile from the standard, for now
[…]
One day in the far future we can move this "profile" or "standard for something else" back here

This is just another version of “kicking the can down the road”, only in a different repo.

If the intention is that at some point the “literal mapping” becomes a part of SSSOM, we should think about how this will be done right now.

Adding in the future a new class of mappings is a completely different beast than adding or removing a slot in the existing mapping class.

For now, all the code dealing with SSSOM (in Python, Java, or any other language) can be built around the assumption that there is only one class of mapping. This is not something that will be easy to change, and the longer that assumption stays around, the harder it will be to change.

So if you already know that at some point you want the standard (and its implementations) to deal with several types of mappings (e.g. mapping and literal mapping), this is something that must be decided ASAP, not “sometime in the future”.

and have a vote

If you do this (make SSSOM 1.0 with no room for more than one class of mappings, then come back later with a proposition for another class as if it was an afterthought), I can already tell you what my vote will be: No. Absolutely not.

@jamesamcl
Member

jamesamcl commented Jul 27, 2024

Hi, typing from my phone as I’m away camping without access to a computer. For us (biocurators at EBI), mapping from term to term and mapping from string to term are two classes of the same problem. We often have datasets that require both types of mappings to get to the types of identifiers we want. For example, I am currently working with a dataset that has a mix of chemical names and CAS numbers. I want to map the CAS number where available (obviously a use case for core SSSOM) and otherwise map the chemical name (literal mapping). This is perhaps not the best example, but I can dig out plenty more when I get back to the office.

At EBI we use two tools for these term and literal mappings respectively: OXO and ZOOMA. So far we have maintained the databases for these tools internally which is not in the FAIR spirit of our community. We are therefore opening up OXO using SSSOM and hope to do the same with ZOOMA.

Just like term mappings, literal mappings are context-dependent (we maintain different literal mapping sets per project in ZOOMA for this reason) and have associated metadata, e.g. lexical match or manual curation, a mapping author, a date, etc. I don’t think solving these problems twice by making a new SSSLiteralOM, complete with website, issue tracker and so on, is the best way to spend our time when we already have the community mindshare (or so I thought) and infrastructure here to support it.

In fact this is extremely unlikely to happen with the resources we have, so ZOOMA’s data would stay loosely specified and difficult to use - but I thought we left this kind of thing in the past and moved towards trying to agree on things to enable interoperability.

@gouttegd
Contributor

gouttegd commented Jul 27, 2024

@jamesamcl I am not against representing literal mappings in SSSOM.

I do share a bit of @jonquet ’s concerns about re-inventing RDF, but from what I’ve seen in the wild I am afraid that horse has left the barn anyway: people have already started to use SSSOM/TSV to serialise arbitrary RDF triples and not only triples that represent “mappings”. (This is a concern that has already been mentioned in #324). This is not what SSSOM is intended for, but I don’t think there’s much we can do about it. Once you put a tool in people’s hands, they will use it in any way they like. A kitchen knife is not supposed to be used to turn screws, but people will use one for that purpose if they don’t have a screwdriver. So what? I don’t think we should prevent SSSOM from being useful to manipulate mappings just because people find it useful to do other things with it (including things they shouldn’t do).

But if we are to allow literal mappings to be represented in SSSOM, we should do so correctly, and I am sorry but #235 is not a correct solution in its current state.

I see two ways of representing literal mappings in SSSOM:

A) Having a separate literal mapping class. That’s what #235 is about, but it does it in a way that leaves way too many questions open.

If we want to go that route, I will insist that these questions must be addressed ASAP, before SSSOM 1.0 is published, because as I have stated above, this route breaks the assumption that there is only one class of mapping. That assumption has been there since the beginning of SSSOM, and is still present everywhere in the current form of the standard even after #235 has been merged.

In particular, the MappingSet class (which is the basis for the SSSOM/TSV format, since a SSSOM/TSV file is basically a serialisation of a MappingSet object) can only contain Mapping objects, not LiteralMapping objects.

So if we now have to deal with more classes of mappings than just Mapping, how are we going to do that?

  • Keep MappingSet as containing only Mapping objects, and have a separate LiteralMappingSet class to contain LiteralMapping object?
  • Add a literal_mappings slot to MappingSet?
  • Make LiteralMapping a subclass of Mapping, so we can keep MappingSet unchanged?
  • Make the mappings slots of MappingSet accept indifferently a list of Mapping or a list of LiteralMapping?
  • Other ideas?

Whatever method we choose is going to have huge implications for SSSOM implementations (especially implementations in statically typed languages), so I am flatly opposed to postponing any decision on this to after 1.0. I don’t care if this means that 1.0 is going to be delayed by 10 months until we figure out how to do it.
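To make the implementation cost concrete, here is a rough Python sketch of two of the options listed above (a separate `literal_mappings` slot, and a `mappings` slot that accepts either class). It is only an illustration: the class names `MappingSetToday`, `MappingSetWithLiteralSlot` and `MappingSetWithUnion` and the helper `subject_of` are hypothetical, only a handful of slots are shown, and the `LiteralMapping` slot names follow the literal-profile example given later in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Mapping:
    # Only a few slots are shown; the real class has many more.
    subject_id: str
    predicate_id: str
    object_id: str
    mapping_justification: str


@dataclass
class LiteralMapping:
    # Slot names as used in the literal-profile example in this thread.
    literal: str
    predicate_id: str
    object_id: str
    mapping_justification: str
    literal_source: Optional[str] = None


@dataclass
class MappingSetToday:
    # Current assumption: a set holds exactly one class of mapping.
    mapping_set_id: str
    mappings: List[Mapping] = field(default_factory=list)


@dataclass
class MappingSetWithLiteralSlot:
    # Option: a second, separately typed slot next to `mappings`.
    mapping_set_id: str
    mappings: List[Mapping] = field(default_factory=list)
    literal_mappings: List[LiteralMapping] = field(default_factory=list)


@dataclass
class MappingSetWithUnion:
    # Option: one slot with two possible element types.
    mapping_set_id: str
    mappings: List[Union[Mapping, LiteralMapping]] = field(default_factory=list)


def subject_of(m: Union[Mapping, LiteralMapping]) -> str:
    """Return whatever plays the subject role, branching on the class."""
    if isinstance(m, LiteralMapping):
        return m.literal
    return m.subject_id
```

With the union variant, every function that walks `mappings` needs a branch like the one in `subject_of`; with the extra-slot variant, every reader, writer and merger has to learn about a second list. Either way, code written against `MappingSetToday` would have to change.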

B) Shoehorn literal mappings into the existing Mapping class. That is basically what I proposed in this comment. We don’t create a new class, but we define a way to use the existing Mapping class to represent literal mappings.

That would be a much less invasive change, with far fewer implications for implementations, because the assumption that there will only ever be one class of mapping would stand. For that reason: (1) I tend to favor that route; (2) if we want to go that route we can easily postpone that to after 1.0.

@matentzn
Collaborator Author

matentzn commented Aug 9, 2024

Alright, here we go.

#384 introduces an alternative model to the "literal mapping" proposal we have previously added. It is built on the following assumptions:

  • Introducing a profile causes a needless overhead which requires specialised tooling
  • We can achieve the same functionality with a very minimal intervention in the main SSSOM metadata model (see various @gouttegd comments above)

I do understand that there are various opposing views on the need for a "literal" profile, but I think this super minimal intervention will satisfy both sides. In essence, we do not have a literal profile; we have a convention that allows us to represent an "entity" by its label (subject_label) rather than by a semantic identifier. This means we do not need specialised tooling and documentation (nor training).

Huge thanks to @gouttegd 🙏 who managed to steer this massive carrier ship after it had left the harbor. This is rarely successful and needed a huge amount of thought, testing, and patience (mostly with me and my constant questions), and I am supremely happy we managed to make it!

🎉 THAT WAS IT FOLKS - the last issue before SSSOM 1.0 (#189). Thanks to all of you who helped and contributed; now the carrier ship has sailed off the horizon, hopefully, to connect the isolated shores of our data islands!

@gouttegd
Contributor

gouttegd commented Aug 9, 2024

For those who had started to create pseudo-SSSOM/TSV files using the “literal profile” (even though this has never been officially feasible since that profile had never been connected to the rest of the spec), SSSOM-Java will support reading such files and converting them to the new proposed convention.

That is, given an input file like this:

#curie_map:
#  some: https://example.org/my_source_of_literals
#  BTO: http://purl.obolibrary.org/obo/BTO_
#  CL: http://purl.obolibrary.org/obo/CL_
#mapping_set_id: https://example.org/myset
literal   literal_source   predicate_id      object_id     mapping_justification
uterus    some:source      skos:exactMatch   BTO:0001424   semapv:ManualMappingCuration
sperm     some:source      skos:exactMatch   CL:0000019    semapv:ManualMappingCuration
kidney    some:source      skos:exactMatch   BTO:0000671   semapv:ManualMappingCuration

SSSOM-CLI will silently convert it into:

#curie_map:
#  BTO: http://purl.obolibrary.org/obo/BTO_
#  CL: http://purl.obolibrary.org/obo/CL_
#  some: https://example.org/my_source_of_literals
#mapping_set_id: https://example.org/myset
#subject_type: rdfs literal
#subject_source: some:source
subject_label   predicate_id      object_id       mapping_justification
kidney          skos:exactMatch   BTO:0000671     semapv:ManualMappingCuration
sperm           skos:exactMatch   CL:0000019      semapv:ManualMappingCuration
uterus          skos:exactMatch   BTO:0001424     semapv:ManualMappingCuration
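
For readers who want to do the same rewrite outside SSSOM-Java, the transformation between the two files above can be approximated in a few lines of Python. This is a minimal sketch, not the SSSOM-Java implementation: the function name `convert_literal_profile` is made up, the input is assumed to be tab-separated with a `#`-prefixed metadata block, the source is hoisted to a set-level `subject_source` only when every row has the same `literal_source`, and rows and `curie_map` entries are left in their original order rather than sorted.

```python
import csv
import io

# Old column name -> column name under the subject_label convention.
RENAME = {"literal": "subject_label", "literal_source": "subject_source"}


def convert_literal_profile(text: str) -> str:
    """Rewrite a "literal profile" TSV string using the subject_label convention."""
    lines = text.splitlines()
    header = [ln for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if ln and not ln.startswith("#")]

    reader = csv.DictReader(body, delimiter="\t")
    rows = list(reader)
    old_columns = reader.fieldnames or []

    # Every subject in the converted set is a literal.
    header.append("#subject_type: rdfs literal")

    # Assumption: move the source to the set level only if all rows agree.
    sources = {row.get("literal_source") for row in rows}
    hoist = len(sources) == 1 and all(sources)
    if hoist:
        header.append(f"#subject_source: {sources.pop()}")

    new_columns = [
        RENAME.get(col, col)
        for col in old_columns
        if not (hoist and col == "literal_source")
    ]

    out = io.StringIO()
    for line in header:
        out.write(line + "\n")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(new_columns)
    for row in rows:
        renamed = {RENAME.get(key, key): value for key, value in row.items()}
        writer.writerow([renamed[col] for col in new_columns])
    return out.getvalue()
```

If the rows carry different `literal_source` values, the sketch keeps a per-row `subject_source` column instead of writing the set-level line.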
