Discussion around new dict-based store implementation #1133
Conversation
This thread is mostly for discussion. I know some tests in Travis are failing. There's definitely some more optimizations to be made in the triplestore code. The current implementation was done in a way to keep it as close as possible in operation to the old
Remove duplicate function definition
Remove an unused emptygen function
This sounds great! How does the memory usage compare? I've seen cases where dicts can consume more memory as they get large and wonder if that might be the case here.
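One way to get a rough answer (a hedged sketch, not a claim about rdflib's actual numbers): `tracemalloc` can measure the allocation cost of building a nested-dict triple index as it grows. The index shape and names below are illustrative, not rdflib's real internals.

```python
# Hypothetical measurement sketch: how much memory does a nested-dict
# triple index take as it grows? (Index shape is illustrative only.)
import tracemalloc

def build_index(n):
    # subject -> predicate -> set of objects
    index = {}
    for i in range(n):
        index.setdefault(f"s{i % 100}", {}).setdefault("p", set()).add(f"o{i}")
    return index

tracemalloc.start()
index = build_index(10_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"index holds {sum(len(po['p']) for po in index.values())} triples, "
      f"~{current / 1024:.0f} KiB allocated (peak {peak / 1024:.0f} KiB)")
```

Repeating the measurement at increasing `n` would show whether the dict-based layout grows memory faster than the integer-optimised one.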
@tgbugs
…layer of abstraction, and everyone knows that when doing a lookup, None means ANY. Fix a flake8 lint error; change bare excepts to LookupError exceptions.
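To illustrate the lookup convention mentioned above (a sketch, not the actual rdflib code): treating `None` as a wildcard over a naive `{s: {p: {o, ...}}}` index looks roughly like this.

```python
# Illustrative sketch of the "None means ANY" triple-pattern lookup
# over a naive nested-dict index {subject: {predicate: {objects}}}.

def triples(index, pattern):
    """Yield (s, p, o) triples matching a pattern; None matches anything."""
    s, p, o = pattern
    for subj in ([s] if s is not None else list(index)):
        preds = index.get(subj, {})
        for pred in ([p] if p is not None else list(preds)):
            objs = preds.get(pred, set())
            if o is not None:
                if o in objs:
                    yield (subj, pred, o)
            else:
                for obj in objs:
                    yield (subj, pred, obj)

index = {
    "alice": {"knows": {"bob", "carol"}, "age": {"42"}},
    "bob": {"knows": {"alice"}},
}

print(sorted(triples(index, ("alice", "knows", None))))
# [('alice', 'knows', 'bob'), ('alice', 'knows', 'carol')]
```

The bound positions narrow the search to a single dict key, so fully-bound lookups stay O(1) while wildcard positions fall back to iteration.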
I've included this new in-memory triplestore implementation in the latest PySHACL release; it patches rdflib to use it. Running pySHACL benchmarks,
All unit tests and integration tests still pass normally when using this store.
I support bringing in a new, faster, Store!
A new Store PR could also tidy up the locations of
- Renamed `Memory2` to `Memory`
- Renamed `Memory1` to `SimpleMemory`
- Set default store to new `Memory`
- Fixed tests
- Fixed docs
RDFLib currently has two different in-memory triplestore implementations, called `Memory` and `IOMemory`.

For a long while now, RDFLib has used the "Integer-Optimised" `IOMemory` Store implementation as the default backing triplestore for Graphs, ConjunctiveGraphs, and Datasets. In Python 2.6/2.7 and Python 3.3/3.4 this Integer-Optimised implementation was in most use cases faster than the normal `dict`-based `Memory` store.

It has been discussed several times that we should conduct some benchmarks and potentially move back to using the `dict`-based `Memory` store by default, because Python 3.5+ has some good advancements in `dict` performance, and it is likely now better than the `IOMemory` implementation.

I went to do some benchmarks and realized a major problem with the proposition. The old `Memory` store is what is known as a naive triplestore implementation; that is, it is not context-aware and not graph-aware. It looks like when RDFLib gained support for ConjunctiveGraphs and Datasets years ago, compatibility with these features was built into the `IOMemory` store, but not the `dict`-based `Memory` store.

That means the `Memory` store currently cannot be used as the backing store for a ConjunctiveGraph or Dataset object, and cannot store triples parsed from the N3, TriG, NQuads, or JSON-LD parsers, because each of those parsers requires the backing store to be either context-aware, graph-aware, or both.

So I've gone ahead and implemented a new in-memory store which has those new features. It is essentially a cross between the current
`Memory` and `IOMemory` stores. It uses the dict-based storage and lookup mechanisms from `Memory` and the context-aware and graph-aware features from `IOMemory`, while discarding the integer-optimisation features.

For now, the new Store is called `Memory2` (and the old one is renamed `Memory1`). After some bug fixing and tiny optimizations, this new store now passes all of the built-in RDFLib tests, plus some other tests and sanity-checks that I've thrown at it.

All this work would be for naught if there were no performance gains, so here are the benchmark results!
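The cross-over design described above, dict-based storage plus context-awareness, can be sketched roughly like this. This is an illustration of the idea only, not the actual `Memory2` code; the class and method names are made up for the example.

```python
# Rough sketch of a context-aware dict-based store: one nested
# {s: {p: {objects}}} index per context, rather than a single flat index.
from collections import defaultdict

class TinyContextStore:
    def __init__(self):
        # context -> subject -> predicate -> set of objects
        self._spo = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

    def add(self, triple, context):
        s, p, o = triple
        self._spo[context][s][p].add(o)

    def triples(self, pattern, context=None):
        """Yield ((s, p, o), ctx); a None context searches all contexts."""
        s, p, o = pattern
        contexts = [context] if context is not None else list(self._spo)
        for ctx in contexts:
            for subj in ([s] if s is not None else list(self._spo[ctx])):
                for pred in ([p] if p is not None else list(self._spo[ctx][subj])):
                    for obj in self._spo[ctx][subj][pred]:
                        if o is None or obj == o:
                            yield (subj, pred, obj), ctx

store = TinyContextStore()
store.add(("alice", "knows", "bob"), context="g1")
store.add(("alice", "knows", "carol"), context="g2")
print(len(list(store.triples(("alice", "knows", None)))))               # across all contexts
print(len(list(store.triples(("alice", "knows", None), context="g1"))))  # one context only
```

The per-context indexing is what lets the same dict-based lookup serve ConjunctiveGraph and Dataset queries that scope a pattern to one named graph or to all of them.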
The benchmark operation consists of adding 100,000 triples to a Graph, traversing the graph and listing each of its triples, then deleting every triple in the graph. The operation is performed 10x for each store type, and the result is divided by 10 to get the average operation time over 10 runs. Precise run time is provided by the `timeit` library.

When viewing the results above, it's easy to see that the original (old) dict-based `Memory` store is the fastest by far, but remember it is not graph-aware and not context-aware, so it is not usable as a default memory store type for a large portion of RDFLib users.

The take-away result is that between `Memory2` and `IOMemory`, this new implementation is around 22-24% faster.

Also very interesting is how much difference is seen between each version of Python. Even on Python 3.5, which did not have much dict performance optimisation, the `Memory2` store is still faster than `IOMemory`. And for each new major version of Python the performance increases significantly.
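The benchmark procedure described above can be sketched with `timeit`. A plain `set` stands in for a Store here so the snippet is self-contained; the real benchmark substitutes each rdflib store class in its place and uses 100,000 triples rather than the 1,000 used below.

```python
# Sketch of the add / traverse / delete benchmark, averaged over 10 runs.
import timeit

N = 1_000   # the real benchmark used 100,000
RUNS = 10

def benchmark_once():
    store = set()                      # stand-in for a Store instance
    for i in range(N):                 # add N triples
        store.add((f"s{i}", "p", f"o{i}"))
    listed = [t for t in store]        # traverse, listing each triple
    for t in listed:                   # delete every triple
        store.remove(t)
    return len(store)

# run RUNS times, then divide by RUNS for the average single-run time
avg = timeit.timeit(benchmark_once, number=RUNS) / RUNS
print(f"average over {RUNS} runs: {avg:.6f}s")
```

Because `timeit` disables garbage collection during timing by default, this isolates the store operations themselves from GC pauses.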