
Discussion around new dict-based store implementation #1133

Merged
merged 6 commits on Aug 27, 2020

Conversation

ashleysommer
Contributor

@ashleysommer ashleysommer commented Jul 19, 2020

RDFLib currently has two different in-memory triplestore implementations, called Memory and IOMemory.

For a long while now, RDFLib has used the "Integer-Optimised" IOMemory Store implementation as the default backing triplestore for Graphs, ConjunctiveGraphs, and Datasets.

In Python 2.6/2.7 and Python 3.3/3.4 this Integer-Optimised implementation was in most use cases faster than the normal dict-based Memory store.

It has been discussed several times that we should conduct some benchmarks and potentially move back to using the dict-based Memory store by default, because Python 3.5+ has some good advancements in dict performance, and it is likely now better than the IOMemory implementation.

I went to do some benchmarks and realized a major problem with the proposition. The old Memory store is what is known as a naive triplestore implementation; that is, it is not context-aware and not graph-aware. It looks like when RDFLib gained support for ConjunctiveGraphs and Datasets years ago, compatibility with these features was built into the IOMemory store, but not the dict-based Memory store.

That means the Memory store currently cannot be used as the backing store for a ConjunctiveGraph or Dataset object, and cannot store triples parsed by the N3, TriG, NQuads, or JSON-LD parsers, because each of those parsers requires the backing store to be context-aware, graph-aware, or both.

So I've gone ahead and implemented a new in-memory store which has those new features. It is essentially a cross between the current Memory and IOMemory stores. It uses the dict-based storage and lookup mechanisms from Memory and the context-aware and graph-aware features from IOMemory, while discarding the integer-optimisation features.
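To make the design concrete, here is a minimal, hypothetical sketch (not RDFLib's actual code, and much simplified) of what a dict-based triple index that is also context-aware looks like, with None acting as a wildcard in lookups:

```python
# Hypothetical sketch of a dict-based, context-aware triple index.
# Class and attribute names are illustrative, not RDFLib's.

class DictTripleStore:
    def __init__(self):
        # subject -> predicate -> object -> set of context ids (dict-based lookup)
        self._spo = {}
        # context id -> set of triples (makes the store graph-aware)
        self._contexts = {}

    def add(self, triple, context):
        s, p, o = triple
        self._spo.setdefault(s, {}).setdefault(p, {}).setdefault(o, set()).add(context)
        self._contexts.setdefault(context, set()).add(triple)

    def remove(self, triple, context):
        s, p, o = triple
        ctxs = self._spo.get(s, {}).get(p, {}).get(o)
        if ctxs is not None:
            ctxs.discard(context)
            if not ctxs:
                del self._spo[s][p][o]
        self._contexts.get(context, set()).discard(triple)

    def triples(self, pattern, context=None):
        # None in the pattern means "match anything"
        s, p, o = pattern
        for s2, po in ([(s, self._spo.get(s, {}))] if s is not None else self._spo.items()):
            for p2, oc in ([(p, po.get(p, {}))] if p is not None else po.items()):
                for o2, ctxs in ([(o, oc.get(o, set()))] if o is not None else oc.items()):
                    if ctxs and (context is None or context in ctxs):
                        yield (s2, p2, o2)
```

The nested-dict layout is what makes lookups fast on modern Python, since each pattern position resolves to a single dict access instead of an integer-index translation step.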

For now, the new Store is called Memory2 (and the old one is renamed Memory1). After some bug fixing and tiny optimizations, this new store now passes all of the built-in RDFLib tests, plus some other tests and sanity-checks that I've thrown at it.

All this work would be for naught if there's no performance gains, so here's the benchmark results!

Python 3.5
Memory1  1.3062 seconds
Memory2  3.1013 seconds
IOMemory 3.8140 seconds

Python 3.6
Memory1  0.9054 seconds
Memory2  2.3544 seconds
IOMemory 3.1724 seconds

Python 3.7
Memory1  0.8885 seconds
Memory2  2.2288 seconds
IOMemory 2.9218 seconds

Python 3.8
Memory1  0.7363 seconds
Memory2  1.8404 seconds
IOMemory 2.3520 seconds

The benchmark operation consists of adding 100,000 triples to a Graph, traversing the graph and listing each of its triples, then deleting every triple in the graph. The operation is performed 10 times for each store type and the total is divided by 10 to get the average operation time over 10 runs. Precise run times are provided by the timeit library.
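The methodology above can be sketched with timeit; a plain dict stands in here for the actual store, and the triple count is scaled down so it runs quickly:

```python
# Sketch of the add / traverse / delete benchmark loop described above.
# The dict is a stand-in for an RDFLib store; names are illustrative.
import timeit

def benchmark_op(n=1000):
    store = {}
    # add n triples
    for i in range(n):
        store[("s%d" % i, "p", "o")] = True
    # traverse and list every triple
    triples = list(store)
    # delete every triple
    for t in triples:
        del store[t]

# 10 runs, total divided by 10, as in the methodology above
avg = timeit.timeit(benchmark_op, number=10) / 10
print("average per run: %.4f seconds" % avg)
```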

When viewing the results above, it's easy to see that the original (old) dict-based Memory store is the fastest by far, but remember it is not graph-aware and not context-aware, so it is not usable as a default memory store type for a large portion of RDFLib users.

The take-away result is that, comparing Memory2 against IOMemory, the new implementation is roughly 19-26% faster depending on the Python version.

Also very interesting is how much difference is seen between each version of Python. Even on Python 3.5, which did not have much dict performance optimisation, the Memory2 store is still faster than IOMemory. And for each new major version of Python the performance increases significantly.
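For reference, the per-version speedup of Memory2 over IOMemory can be recomputed directly from the table above:

```python
# Speedup of Memory2 over IOMemory, computed from the benchmark table.
results = {
    "3.5": (3.1013, 3.8140),  # (Memory2, IOMemory) seconds
    "3.6": (2.3544, 3.1724),
    "3.7": (2.2288, 2.9218),
    "3.8": (1.8404, 2.3520),
}
speedups = {v: (io - m2) / io * 100 for v, (m2, io) in results.items()}
for version, pct in speedups.items():
    print(f"Python {version}: Memory2 is {pct:.1f}% faster than IOMemory")
# Python 3.5: 18.7%, 3.6: 25.8%, 3.7: 23.7%, 3.8: 21.8%
```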

@ashleysommer
Contributor Author

ashleysommer commented Jul 19, 2020

This thread is mostly for discussion.
Memory1 and Memory2 are placeholder names, for purpose of comparison.
Likely Memory1 will be renamed to NaiveMemory or just removed entirely.
Memory2 will be renamed to Memory and become the default.
And IOMemory will probably be removed (I don't think it's of any use to anyone).

I know some tests in Travis are failing; that's because Memory was renamed to Memory1 and some tests don't like that, and some tests assume that IOMemory will always be the default store and fail when it's not.

There are definitely more optimizations to be made in the triplestore code. The current implementation was done in a way that keeps its operation as close as possible to the old Memory store. It's at a point now where any further changes would deviate from known, battle-tested code and could potentially introduce changes in store behaviour, so I haven't looked further into that yet.

@ashleysommer ashleysommer changed the title Discussion around new in-memory graph implementation Discussion around new dict-based store implementation Jul 19, 2020
remove duplicate function definition
Remove an unused emptygen function
@tgbugs
Contributor

tgbugs commented Jul 20, 2020

This sounds great! How does the memory usage compare? I've seen cases where dicts can consume more memory as they get large and wonder if that might be the case here.

@ashleysommer
Contributor Author

@tgbugs
I'm still working on capturing hard numbers around memory use.
But preliminary findings, based on watching OS memory-percentage usage while running the benchmarks, suggest that Memory2 uses the same or slightly less memory (~5% less) than IOMemory, while the old Memory1 store uses the least memory (~10% less) because it doesn't track triple membership in context namespaces and doesn't track graphs in the store.
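For hard numbers, one option is the standard library's tracemalloc module; the builder function below is a placeholder stand-in for loading triples into a store:

```python
# Sketch of measuring peak allocation with tracemalloc.
# build_dict_store is a placeholder for populating a real store.
import tracemalloc

def measure_peak(build):
    tracemalloc.start()
    obj = build()  # keep a reference so the object stays alive while measuring
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def build_dict_store(n=10_000):
    # stand-in for adding n triples to an in-memory store
    return {("s%d" % i, "p", "o%d" % i): True for i in range(n)}

peak = measure_peak(build_dict_store)
print("peak traced memory: %d bytes" % peak)
```

Running the same measurement against each store class would give comparable, Python-level figures rather than OS-reported process percentages.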

…layer of abstraction and everyone knows that when doing a lookup, None means ANY.

Fix a flake8 lint error, change bare exceptions for LookupError exceptions.
@ashleysommer
Contributor Author

I've included this new in-memory triplestore implementation in the latest PySHACL release; it patches rdflib to use Memory2 by default to speed up SHACL validation.

Running pySHACL benchmarks, Memory2 is faster by:

  • 10.3% when benchmarking SHACL validation with no inferencing
  • 17% when benchmarking SHACL validation with rdfs inferencing
  • 19.5% when benchmarking SHACL validation with rdfs+owlrl inferencing

All unit tests and integration tests still pass normally when using this store.

@nicholascar
Member

I support bringing in a new, faster, Store!

@nicholascar
Member

A new Store PR could also tidy up the locations of memory.py & sleepycat.py files (https://github.com/RDFLib/rdflib/tree/master/rdflib/plugins) placing them in the same dir as all the other Stores (https://github.com/RDFLib/rdflib/tree/master/rdflib/plugins/stores)

Renamed Memory2 to Memory
Renamed Memory1 to SimpleMemory
Set default store to new Memory
Fixed tests
Fixed docs
@ashleysommer
Contributor Author

ashleysommer commented Aug 19, 2020

Renamed Memory2 to Memory
Renamed Memory1 to SimpleMemory
Set default store to new Memory
Moved sleepycat.py and memory.py into plugins/stores/ directory
Fixed tests
Fixed docs
This is now ready for final review and merge 🎉

@coveralls

Coverage Status

Coverage decreased (-0.02%) to 75.773% when pulling 860628c on ashleysommer:in_memory_store into 89cb369 on RDFLib:master.

