Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset re-work for 7.0 release #1814

Closed
wants to merge 8 commits into from
Closed

dataset re-work for 7.0 release #1814

wants to merge 8 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Apr 14, 2022

Summary of changes

Complete replacement of ConjunctiveGraph by Dataset , imposition of identifier-as-context throughout. Commits split into tolerable subsections, core changes confined to 5837d01

Checklist

  • Checked that there aren't other open pull requests for the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests and type checking passes. (all tests passing prior to test_variants.py)
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies. (in progess)
    • Considered adding additional documentation. (in progress)
    • Considered adding an example in ./examples for new features. (in progress)
    • Considered updating our changelog (CHANGELOG.md). (TBD)
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

Three (likely, trivial) test failures with test_variant.py

FAILED test/test_dataset/test_dataset_variants.py::test_variants[variants/xml_literal] - assert {(rdflib.term...ml_literal'))} == {(rdflib.term...ral')), None)}
FAILED test/test_dataset/test_dataset_variants.py::test_variants[variants/special_chars] - AssertionError: assert {(rdflib.term...), None), ...} == {(rdflib.term...chars')), ...}
FAILED test/test_dataset/test_dataset_variants.py::test_variants[variants/simple_triple] - AssertionError: assert {(rdflib.term...ple_triple'))} == {(rdflib.term...ject'), None)}
FAILED test/test_graph/test_variants.py::test_variants[variants/xml_literal] - assert {(rdflib.term...ml_literal'))} == {(rdflib.term...ral')), None)}
FAILED test/test_graph/test_variants.py::test_variants[variants/special_chars] - AssertionError: assert {(rdflib.term...), None), ...} == {(rdflib.term...chars')), ...}
FAILED test/test_graph/test_variants.py::test_variants[variants/simple_triple] - AssertionError: assert {(rdflib.term...ple_triple'))} == {(rdflib.term...ject'), None)}
================================================================ 6 failed, 5383 passed, 6 skipped, 154 xfailed, 5 xpassed, 115 warnings in 182.92s (0:03:02) ================================================================

@ghost ghost marked this pull request as draft April 14, 2022 15:53
@ghost
Copy link
Author

ghost commented Apr 14, 2022

Comments on changes:

WORK-IN-PROGRESS: Re-working of "identifier-as-context" PR and removing ConjunctiveGraph, re-positioned as a contribution to release 7.0 as the removal of ConjunctiveGraph comprehensively breaks the public API.

I've been working on this more complete implementation since January, hence the large number of changes. Actually there's relatively little impact on core library files, what's different from the last rebased PR is a few changed lines here and there in the parsers and serializers to get a coherent implementation. The bulk of the changes are additional tests and test fixture data.

  1. ConjunctiveGraph completely replaced by Dataset which now inherits directly from Graph.
  2. Identifier-as-context referencing imposed throughout.
  3. Store changed to use identifier-as-context, API remains the same (for now).
  4. Parsers and serializers changed to use Dataset instead of ConjunctiveGraph..
  5. Tests, tests and more tests.

Isomorphism is an unresolved issue, as are operators. The W3 declined to publish a formal semantics for Datasets and the pertinent issue was resolved as "won't fix"

https://www.w3.org/2011/rdf-wg/track/issues/17

RESOLVED: close issue-17 -- there is no general purpose way to merge datasets;
it can only be done with external knowledge.

David Wood, 29 Oct 2012, 13:42:58

I've retained my attempted implementation of operators so that reviewers can more easily get an idea of the representational issues without having to code them up but atm I'm inclined to believe that RDFLib should not support operators for Datasets (because it seems semantically vacuous to me) and that the implementation of Dataset operators can be removed.

Isomorphic functionality is currently sematically suspect, having been casually extended to use Datasets. There are only a couple of feasible changes that can be made without wrecking the implementation of the RGDA1 graph digest algorithm and I'm currently thrashing through the issues attendant on QuotedGraphs, formulae and the consequent implications for Dataset conicalization and isomorphism. One issue presenting difficulty is that IsomorphicGraph was a subclass of ConjunctiveGraph but can now only be a subclass of Dataset and atm I'm not at all sure that's a trivial change.

@aucampia
Copy link
Member

@gjhiggins not sure exactly what your plan is to merge this, but I would be very happy to slice a bunch of things off from this into seperate PRs - for example most of the changes to change tests from ConjunctiveGraph to DataSet would possibly make sense independently of this PR.

if isbnode(s):
s = _blank
if isbnode(p):
p = _blank
if isbnode(o):
o = _blank
gcopy.add((s, p, o))
gcopy.add((s, p, o, DATASET_DEFAULT_GRAPH_ID if c is None else c))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is still in draft, just thinking, would it not be better to associate None with the default Graph ID? That would make the interface a bit easier to use I think.

@ghost ghost changed the title changes to core library code dataset re-work for 7.0 release Apr 14, 2022
@aucampia
Copy link
Member

aucampia commented Apr 15, 2022

@gjhiggins happy to do rebases/merges on this, just let me know either here or on gitter and I will rebase with priority. Also happy to fix any other CI problems with this.

@ghost
Copy link
Author

ghost commented Apr 15, 2022

Uneasy about the migration to Dataset

The more I think about it, the more uncomfortable I am about this move as it necessitates imposing some arbitrary decisions about the (undefined) semantics of named graphs. Just last night @aucampia and I had to resolve a mismatch in expectations about the syntactic-vs-semantic content of named graphs in the context of checking the accuracy of round-tripping.

For a quad-aware store (i.e. all of the current RDFLib stores) the Graph.parse() method arbitrarily changes the content of the stored quads depending on whether the publicID parameter is set and/or whether there is a detectable document location:

"publicID: the logical URI to use as the document base. If None specified the document location is used (at least in the case where there is a document location)".

In practice, it can result in test failures, e.g. the mismatch between a triple parsed in the context of a document base:

(
    rdflib.term.URIRef('http://example.org/subject'),
    rdflib.term.URIRef('http://example.org/predicate'),
    rdflib.term.URIRef('http://example.org/object'),
    rdflib.term.URIRef('example:variants/simple_triple')
)

and one created as a result of parsing the same triple but without a document location:

(
    rdflib.term.URIRef('http://example.org/subject'),
    rdflib.term.URIRef('http://example.org/predicate'),
    rdflib.term.URIRef('http://example.org/object'),
    None
)

Given two Datasets with quad content as above, resolving the question of whether the two Datasets are isomorphic is arbitrary - either the names of graphs are treated as i) semantic tokens, in which case the datasets are not considered to be isomorphic or ii) merely syntactic tokens in which case the datasets can be considered to be isomorphic.

Currently, graph identifiers are ignored by isomorphic and the fact that named graphs can have BNode identifiers is worrying in this context, to say the least. I do appreciate that the implementation of ConjunctiveGraph's set-theoretic operators blithely ignores these considerations and this (arguably casual) approach to the semantics seems not to present users with any (reported) issues - I'm left wondering whether this eminently pragmatic approach to "working with RDF" isn't actually offering users some critically-useful advantage.

The W3 declined to publish a formal semantics for Datasets and whilst there was discussion of the implications, the issue of merging datasets was closed as "can't fix" (my interpretation):

https://www.w3.org/2011/rdf-wg/track/issues/17

RESOLVED: close issue-17 -- there is no general purpose way to merge datasets;
it can only be done with external knowledge.

David Wood, 29 Oct 2012, 13:42:58

If one can't provide semantically-sound support for UNION, then that must extend to all of the other set-theoretic operators. My concern here is that if Datasets can't be merged without external knowledge then it would seem that computing isomorphism must be similarly constrained - or am I missing something?

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
7.0 Changes planned for version 7
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants