dataset re-work for 7.0 release #1814

ghost · 2022-04-14T15:52:50Z

Summary of changes

Complete replacement of ConjunctiveGraph by Dataset , imposition of identifier-as-context throughout. Commits split into tolerable subsections, core changes confined to 5837d01

Checklist

Checked that there aren't other open pull requests for the same change.
Added tests for any changes that have a runtime impact.
Checked that all tests and type checking passes. (all tests passing prior to test_variants.py)
For changes that have a potential impact on users of this project:
- Updated relevant documentation to avoid inaccuracies. (in progess)
- Considered adding additional documentation. (in progress)
- Considered adding an example in ./examples for new features. (in progress)
- Considered updating our changelog (CHANGELOG.md). (TBD)
Considered granting push permissions to the PR branch,
so maintainers can fix minor issues and keep your PR up to date.

Three (likely, trivial) test failures with test_variant.py

FAILED test/test_dataset/test_dataset_variants.py::test_variants[variants/xml_literal] - assert {(rdflib.term...ml_literal'))} == {(rdflib.term...ral')), None)}
FAILED test/test_dataset/test_dataset_variants.py::test_variants[variants/special_chars] - AssertionError: assert {(rdflib.term...), None), ...} == {(rdflib.term...chars')), ...}
FAILED test/test_dataset/test_dataset_variants.py::test_variants[variants/simple_triple] - AssertionError: assert {(rdflib.term...ple_triple'))} == {(rdflib.term...ject'), None)}
FAILED test/test_graph/test_variants.py::test_variants[variants/xml_literal] - assert {(rdflib.term...ml_literal'))} == {(rdflib.term...ral')), None)}
FAILED test/test_graph/test_variants.py::test_variants[variants/special_chars] - AssertionError: assert {(rdflib.term...), None), ...} == {(rdflib.term...chars')), ...}
FAILED test/test_graph/test_variants.py::test_variants[variants/simple_triple] - AssertionError: assert {(rdflib.term...ple_triple'))} == {(rdflib.term...ject'), None)}
================================================================ 6 failed, 5383 passed, 6 skipped, 154 xfailed, 5 xpassed, 115 warnings in 182.92s (0:03:02) ================================================================

black the other new tests

ghost · 2022-04-14T17:34:21Z

Comments on changes:

WORK-IN-PROGRESS: Re-working of "identifier-as-context" PR and removing ConjunctiveGraph, re-positioned as a contribution to release 7.0 as the removal of ConjunctiveGraph comprehensively breaks the public API.

I've been working on this more complete implementation since January, hence the large number of changes. Actually there's relatively little impact on core library files, what's different from the last rebased PR is a few changed lines here and there in the parsers and serializers to get a coherent implementation. The bulk of the changes are additional tests and test fixture data.

ConjunctiveGraph completely replaced by Dataset which now inherits directly from Graph.
Identifier-as-context referencing imposed throughout.
Store changed to use identifier-as-context, API remains the same (for now).
Parsers and serializers changed to use Dataset instead of ConjunctiveGraph..
Tests, tests and more tests.

Isomorphism is an unresolved issue, as are operators. The W3 declined to publish a formal semantics for Datasets and the pertinent issue was resolved as "won't fix"

https://www.w3.org/2011/rdf-wg/track/issues/17

RESOLVED: close issue-17 -- there is no general purpose way to merge datasets;
it can only be done with external knowledge.

David Wood, 29 Oct 2012, 13:42:58

I've retained my attempted implementation of operators so that reviewers can more easily get an idea of the representational issues without having to code them up but atm I'm inclined to believe that RDFLib should not support operators for Datasets (because it seems semantically vacuous to me) and that the implementation of Dataset operators can be removed.

Isomorphic functionality is currently sematically suspect, having been casually extended to use Datasets. There are only a couple of feasible changes that can be made without wrecking the implementation of the RGDA1 graph digest algorithm and I'm currently thrashing through the issues attendant on QuotedGraphs, formulae and the consequent implications for Dataset conicalization and isomorphism. One issue presenting difficulty is that IsomorphicGraph was a subclass of ConjunctiveGraph but can now only be a subclass of Dataset and atm I'm not at all sure that's a trivial change.

aucampia · 2022-04-14T18:09:20Z

@gjhiggins not sure exactly what your plan is to merge this, but I would be very happy to slice a bunch of things off from this into seperate PRs - for example most of the changes to change tests from ConjunctiveGraph to DataSet would possibly make sense independently of this PR.

aucampia · 2022-04-14T19:13:06Z

test/test_serializers/test_prettyxml.py

        if isbnode(s):
            s = _blank
        if isbnode(p):
            p = _blank
        if isbnode(o):
            o = _blank
-        gcopy.add((s, p, o))
+        gcopy.add((s, p, o, DATASET_DEFAULT_GRAPH_ID if c is None else c))


I know this is still in draft, just thinking, would it not be better to associate None with the default Graph ID? That would make the interface a bit easier to use I think.

aucampia · 2022-04-15T13:55:02Z

@gjhiggins happy to do rebases/merges on this, just let me know either here or on gitter and I will rebase with priority. Also happy to fix any other CI problems with this.

ghost · 2022-04-15T19:34:09Z

Uneasy about the migration to Dataset

The more I think about it, the more uncomfortable I am about this move as it necessitates imposing some arbitrary decisions about the (undefined) semantics of named graphs. Just last night @aucampia and I had to resolve a mismatch in expectations about the syntactic-vs-semantic content of named graphs in the context of checking the accuracy of round-tripping.

For a quad-aware store (i.e. all of the current RDFLib stores) the Graph.parse() method arbitrarily changes the content of the stored quads depending on whether the publicID parameter is set and/or whether there is a detectable document location:

"publicID: the logical URI to use as the document base. If None specified the document location is used (at least in the case where there is a document location)".

In practice, it can result in test failures, e.g. the mismatch between a triple parsed in the context of a document base:

(
    rdflib.term.URIRef('http://example.org/subject'),
    rdflib.term.URIRef('http://example.org/predicate'),
    rdflib.term.URIRef('http://example.org/object'),
    rdflib.term.URIRef('example:variants/simple_triple')
)

and one created as a result of parsing the same triple but without a document location:

(
    rdflib.term.URIRef('http://example.org/subject'),
    rdflib.term.URIRef('http://example.org/predicate'),
    rdflib.term.URIRef('http://example.org/object'),
    None
)

Given two Datasets with quad content as above, resolving the question of whether the two Datasets are isomorphic is arbitrary - either the names of graphs are treated as i) semantic tokens, in which case the datasets are not considered to be isomorphic or ii) merely syntactic tokens in which case the datasets can be considered to be isomorphic.

Currently, graph identifiers are ignored by isomorphic and the fact that named graphs can have BNode identifiers is worrying in this context, to say the least. I do appreciate that the implementation of ConjunctiveGraph's set-theoretic operators blithely ignores these considerations and this (arguably casual) approach to the semantics seems not to present users with any (reported) issues - I'm left wondering whether this eminently pragmatic approach to "working with RDF" isn't actually offering users some critically-useful advantage.

The W3 declined to publish a formal semantics for Datasets and whilst there was discussion of the implications, the issue of merging datasets was closed as "can't fix" (my interpretation):

https://www.w3.org/2011/rdf-wg/track/issues/17

RESOLVED: close issue-17 -- there is no general purpose way to merge datasets;
it can only be done with external knowledge.

David Wood, 29 Oct 2012, 13:42:58

If one can't provide semantically-sound support for UNION, then that must extend to all of the other set-theoretic operators. My concern here is that if Datasets can't be merged without external knowledge then it would seem that computing isomorphism must be similarly constrained - or am I missing something?

changes to core library code

5837d01

ghost marked this pull request as draft April 14, 2022 15:53

Graham Higgins added 7 commits April 14, 2022 16:58

edited and additional tests

ee888d1

changes to examples

273a8c9

preliminary editing of docs

a42c31c

fix type issues, black new tests

2104637

comment out fragment of test to avoid type error precommit failure

15fb72d

black the other new tests

re-blacking

172ebcb

adjust test to skip berkeleydb if module unavailable

4fba0ff

aucampia reviewed Apr 14, 2022

View reviewed changes

ghost changed the title ~~changes to core library code~~ dataset re-work for 7.0 release Apr 14, 2022

ghost mentioned this pull request Apr 15, 2022

rdfpipe adds a graph name to triples in the default graph, consequently breaking round tripping #1804

Closed

ghost added the 7.0 Changes planned for version 7 label Jun 8, 2022

white-gecko added this to the 2022 October release (7.0.0) milestone Jun 20, 2022

This was referenced Jul 14, 2022

Issues with rdflib.Dataset #2021

Open

SPARQL Update insert into default graph does not work for Dataset #2019

Open

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dataset re-work for 7.0 release #1814

dataset re-work for 7.0 release #1814

ghost commented Apr 14, 2022 •

edited by ghost

Loading

ghost commented Apr 14, 2022 •

edited by ghost

Loading

aucampia commented Apr 14, 2022

aucampia Apr 14, 2022

aucampia commented Apr 15, 2022 •

edited

Loading

ghost commented Apr 15, 2022

dataset re-work for 7.0 release #1814

dataset re-work for 7.0 release #1814

Conversation

ghost commented Apr 14, 2022 • edited by ghost Loading

Summary of changes

Checklist

ghost commented Apr 14, 2022 • edited by ghost Loading

Comments on changes:

aucampia commented Apr 14, 2022

aucampia Apr 14, 2022

Choose a reason for hiding this comment

aucampia commented Apr 15, 2022 • edited Loading

ghost commented Apr 15, 2022

Uneasy about the migration to Dataset

ghost commented Apr 14, 2022 •

edited by ghost

Loading

ghost commented Apr 14, 2022 •

edited by ghost

Loading

aucampia commented Apr 15, 2022 •

edited

Loading