
Discussion around new dict-based store implementation #1133

Merged
merged 6 commits on Aug 27, 2020

Conversation

ashleysommer
Contributor

@ashleysommer ashleysommer commented Jul 19, 2020

RDFLib currently has two different in-memory triplestore implementations, called Memory and IOMemory.

For a long while now, RDFLib has used the "Integer-Optimised" IOMemory Store implementation as the default backing triplestore for Graphs, ConjunctiveGraphs, and Datasets.

In Python 2.6/2.7 and Python 3.3/3.4 this Integer-Optimised implementation was in most use cases faster than the normal dict-based Memory store.

It has been discussed several times that we should conduct some benchmarks and potentially move back to using the dict-based Memory store by default, because Python 3.5+ has some good advancements in dict performance, and it is likely now better than the IOMemory implementation.

I went to do some benchmarks and realized a major problem with the proposition. The old Memory store is what is known as a naive triplestore implementation; that is, it is not context-aware and not graph-aware. It looks like when RDFLib gained support for ConjunctiveGraphs and Datasets years ago, compatibility with these features was built into the IOMemory store, but not the dict-based Memory store.

That means the Memory store currently cannot be used as the backing store for a ConjunctiveGraph or Dataset object, and cannot store triples parsed by the N3, TriG, NQuads, or JSON-LD parsers, because each of those parsers requires the backing store to be context-aware, graph-aware, or both.

So I've gone ahead and implemented a new in-memory store which has those new features. It is essentially a cross between the current Memory and IOMemory stores. It uses the dict-based storage and lookup mechanisms from Memory and the context-aware and graph-aware features from IOMemory, while discarding the integer-optimisation features.
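To make the design concrete, here is a minimal, hypothetical sketch (not RDFLib's actual code, and much simplified) of what a dict-based triple index that is also context-aware looks like, with None acting as a wildcard in lookups:

```python
# Hypothetical sketch of a dict-based, context-aware triple index.
# Class and attribute names are illustrative, not RDFLib's.

class DictTripleStore:
    def __init__(self):
        # subject -> predicate -> object -> set of context ids (dict-based lookup)
        self._spo = {}
        # context id -> set of triples (makes the store graph-aware)
        self._contexts = {}

    def add(self, triple, context):
        s, p, o = triple
        self._spo.setdefault(s, {}).setdefault(p, {}).setdefault(o, set()).add(context)
        self._contexts.setdefault(context, set()).add(triple)

    def remove(self, triple, context):
        s, p, o = triple
        ctxs = self._spo.get(s, {}).get(p, {}).get(o)
        if ctxs is not None:
            ctxs.discard(context)
            if not ctxs:
                del self._spo[s][p][o]
        self._contexts.get(context, set()).discard(triple)

    def triples(self, pattern, context=None):
        # None in the pattern means "match anything"
        s, p, o = pattern
        for s2, po in ([(s, self._spo.get(s, {}))] if s is not None else self._spo.items()):
            for p2, oc in ([(p, po.get(p, {}))] if p is not None else po.items()):
                for o2, ctxs in ([(o, oc.get(o, set()))] if o is not None else oc.items()):
                    if ctxs and (context is None or context in ctxs):
                        yield (s2, p2, o2)
```

The nested-dict layout is what makes lookups fast on modern Python, since each pattern position resolves to a single dict access instead of an integer-index translation step.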

For now, the new Store is called Memory2 (and the old one is renamed Memory1). After some bug fixing and tiny optimizations, this new store now passes all of the built-in RDFLib tests, plus some other tests and sanity-checks that I've thrown at it.

All this work would be for naught if there's no performance gains, so here's the benchmark results!

Python 3.5
Memory1  1.3062 seconds
Memory2  3.1013 seconds
IOMemory 3.8140 seconds

Python 3.6
Memory1  0.9054 seconds
Memory2  2.3544 seconds
IOMemory 3.1724 seconds

Python 3.7
Memory1  0.8885 seconds
Memory2  2.2288 seconds
IOMemory 2.9218 seconds

Python 3.8
Memory1  0.7363 seconds
Memory2  1.8404 seconds
IOMemory 2.3520 seconds

The benchmark operation consists of adding 100,000 triples to a Graph, traversing the graph and listing each of its triples, then deleting every triple in the graph. The operation is performed 10 times for each store type and the total is divided by 10 to get the average operation time over 10 runs. Precise run times are provided by the timeit library.
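The methodology above can be sketched with timeit; a plain dict stands in here for the actual store, and the triple count is scaled down so it runs quickly:

```python
# Sketch of the add / traverse / delete benchmark loop described above.
# The dict is a stand-in for an RDFLib store; names are illustrative.
import timeit

def benchmark_op(n=1000):
    store = {}
    # add n triples
    for i in range(n):
        store[("s%d" % i, "p", "o")] = True
    # traverse and list every triple
    triples = list(store)
    # delete every triple
    for t in triples:
        del store[t]

# 10 runs, total divided by 10, as in the methodology above
avg = timeit.timeit(benchmark_op, number=10) / 10
print("average per run: %.4f seconds" % avg)
```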

When viewing the results above, it's easy to see that the original (old) dict-based Memory store is the fastest by far, but remember it is not graph-aware and not context-aware, so it is not usable as a default memory store type for a large portion of RDFLib users.

The take-away result is that, comparing Memory2 against IOMemory, the new implementation is roughly 19-26% faster depending on the Python version.

Also very interesting is how much difference is seen between each version of Python. Even on Python 3.5, which did not have much dict performance optimisation, the Memory2 store is still faster than IOMemory. And for each new major version of Python the performance increases significantly.
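For reference, the per-version speedup of Memory2 over IOMemory can be recomputed directly from the table above:

```python
# Speedup of Memory2 over IOMemory, computed from the benchmark table.
results = {
    "3.5": (3.1013, 3.8140),  # (Memory2, IOMemory) seconds
    "3.6": (2.3544, 3.1724),
    "3.7": (2.2288, 2.9218),
    "3.8": (1.8404, 2.3520),
}
speedups = {v: (io - m2) / io * 100 for v, (m2, io) in results.items()}
for version, pct in speedups.items():
    print(f"Python {version}: Memory2 is {pct:.1f}% faster than IOMemory")
# Python 3.5: 18.7%, 3.6: 25.8%, 3.7: 23.7%, 3.8: 21.8%
```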

@ashleysommer
Contributor Author

ashleysommer commented Jul 19, 2020

This thread is mostly for discussion.
Memory1 and Memory2 are placeholder names, for purpose of comparison.
Likely Memory1 will be renamed to NaiveMemory or just removed entirely.
Memory2 will be renamed to Memory and become the default.
And IOMemory will probably be removed (I don't think it's of any use to anyone).

I know some tests in Travis are failing; that's because Memory was renamed to Memory1 and some tests don't like that, and some tests assume that IOMemory will always be the default store and fail when it's not.

There are definitely more optimizations to be made in the triplestore code. The current implementation was done in a way that keeps its operation as close as possible to the old Memory store. It's at a point now where any further changes would deviate from known, battle-tested code and could potentially introduce changes in store behaviour, so I haven't looked further into that yet.

@ashleysommer ashleysommer changed the title Discussion around new in-memory graph implementation Discussion around new dict-based store implementation Jul 19, 2020
remove duplicate function definition
Remove an unused emptygen function
@tgbugs
Contributor

tgbugs commented Jul 20, 2020

This sounds great! How does the memory usage compare? I've seen cases where dicts can consume more memory as they get large and wonder if that might be the case here.

@ashleysommer
Contributor Author

@tgbugs
I'm still working on capturing hard numbers around memory use.
But preliminary findings, based on watching OS memory-percentage usage while running the benchmarks, suggest that Memory2 uses the same or slightly less memory (~5% less) than IOMemory, while the old Memory1 store uses the least memory (~10% less) because it doesn't track triple membership in context namespaces and doesn't track graphs in the store.
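For hard numbers, one option is the standard library's tracemalloc module; the builder function below is a placeholder stand-in for loading triples into a store:

```python
# Sketch of measuring peak allocation with tracemalloc.
# build_dict_store is a placeholder for populating a real store.
import tracemalloc

def measure_peak(build):
    tracemalloc.start()
    obj = build()  # keep a reference so the object stays alive while measuring
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def build_dict_store(n=10_000):
    # stand-in for adding n triples to an in-memory store
    return {("s%d" % i, "p", "o%d" % i): True for i in range(n)}

peak = measure_peak(build_dict_store)
print("peak traced memory: %d bytes" % peak)
```

Running the same measurement against each store class would give comparable, Python-level figures rather than OS-reported process percentages.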

…layer of abstraction and everyone knows that when doing a lookup, None means ANY.

Fix a flake8 lint error, change bare exceptions for LookupError exceptions.
@ashleysommer
Contributor Author

I've included this new in-memory triplestore implementation in the latest PySHACL release; it patches rdflib to use Memory2 by default to speed up SHACL validation.

Running pySHACL benchmarks, Memory2 is faster by:

  • 10.3% when benchmarking SHACL validation with no inferencing
  • 17% when benchmarking SHACL validation with rdfs inferencing
  • 19.5% when benchmarking SHACL validation with rdfs+owlrl inferencing

All unit tests and integration tests still pass normally when using this store.

@nicholascar
Member

I support bringing in a new, faster, Store!

@nicholascar
Member

A new Store PR could also tidy up the locations of memory.py & sleepycat.py files (https://github.com/RDFLib/rdflib/tree/master/rdflib/plugins) placing them in the same dir as all the other Stores (https://github.com/RDFLib/rdflib/tree/master/rdflib/plugins/stores)

Renamed Memory2 to Memory
Renamed Memory1 to SimpleMemory
Set default store to new Memory
Fixed tests
Fixed docs
@ashleysommer
Contributor Author

ashleysommer commented Aug 19, 2020

Renamed Memory2 to Memory
Renamed Memory1 to SimpleMemory
Set default store to new Memory
Moved sleepycat.py and memory.py into plugins/stores/ directory
Fixed tests
Fixed docs
This is now ready for final review and merge 🎉

@coveralls

Coverage Status

Coverage decreased (-0.02%) to 75.773% when pulling 860628c on ashleysommer:in_memory_store into 89cb369 on RDFLib:master.

