Cardinality estimates are essential for finding a good join order and thereby improving query performance. We performed these experiments to assess the impact of shapes statistics of RDF graphs on cardinality estimation. We generated global and shapes statistics and proposed a join ordering technique that uses these statistics to estimate cardinalities and produce efficient query plans. We used two synthetic datasets (LUBM, WATDIV) and one real dataset (YAGO-4). We compared the resulting plans against those produced by the Jena ARQ query engine, GraphDB, Characteristic Sets, and the SumRDF approach. On this page we present the technical details of our experiments: how to generate the statistics, how to run the experiments, the links to the datasets, and finally the results.
All data and results presented in our experimental study are available at https://github.com/Kashif-Rabbani/sparql-optimization/ under the Apache License 2.0.
We used the following datasets, queries, and statistics:
Given an RDF graph, we used the shaclgen library (https://pypi.org/project/shaclgen/) to generate its SHACL shapes graph.
We used the Shapes Annotator component to extend the SHACL shapes graph with the statistics of the RDF graph. For example, for the YAGO-4 dataset, we used the https://github.com/Kashif-Rabbani/sparql-optimization/blob/main/code/yagoConfig.properties file with generateStatistics=true.
We loaded all datasets into Jena TDB, bundled the code in a Jar, and created a config file for each type of experiment. For example, we used the following pattern to run the experiments with each approach:
1. Shapes Statistics
> Set the appropriate paths for the Jena TDB and the directory containing queries in the config file, e.g., for the YAGO-4 dataset: https://github.com/Kashif-Rabbani/sparql-optimization/blob/main/code/yagoConfig.properties
> Set shapeExec=true and set the number of times each query should run.
> Run java -jar code.jar yagoConfig.properties YAGO &> output.log
> Logs are saved in the OUTPUT_QUERY directory as benchmarks.csv and also in the output.log file.
> Use these logs to plot the results.
2. Global Statistics
> Follow the same steps as for Shapes Statistics, but set shapeExec=false and globalStatsExec=true.
3. Jena ARQ
> Follow the same steps as above, but set shapeExec=false, globalStatsExec=false, and jenaExec=true.
4. GraphDB
> We loaded each dataset into GraphDB and used its 'onto:explain' feature, described at https://graphdb.ontotext.com/documentation/standard/explain-plan.html, to inspect the query plans and their cardinalities.
5. Characteristic Sets
> We used the extended characteristic sets implementation from https://github.com/gmontoya/federatedOptimizer to generate characteristic sets for each dataset and then generated the corresponding query plans.
-
6. SumRDF Cardinality Estimator (official link)
> We implemented our join ordering algorithm using the SumRDF cardinality estimator. The code is available in the folder https://github.com/Kashif-Rabbani/sparql-optimization/tree/main/sumRDF
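The three config-driven runs above (shapes statistics, global statistics, and Jena) differ only in which of shapeExec, globalStatsExec, and jenaExec is set to true. The following Python helper is a minimal sketch of how one might switch a properties file between these modes before each run. It is not part of the repository: the function names set_mode and run_command are hypothetical, and it assumes the config file uses plain key=value lines. Only the three flag names and the java command come from the steps above.

```python
import re

# Flag names taken from the steps above; all other lines are left untouched.
MODE_FLAGS = ("shapeExec", "globalStatsExec", "jenaExec")


def set_mode(properties_text, mode):
    """Return the properties text with exactly one execution flag set to true.

    Hypothetical helper: assumes simple `key=value` lines.
    """
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODE_FLAGS}")

    lines = properties_text.splitlines()
    seen = set()
    for i, line in enumerate(lines):
        # Match `key =` at the start of a line; commented-out lines do not match.
        match = re.match(r"\s*([A-Za-z0-9_.]+)\s*=", line)
        if match and match.group(1) in MODE_FLAGS:
            key = match.group(1)
            lines[i] = f"{key}={'true' if key == mode else 'false'}"
            seen.add(key)

    # Append any flag that was missing from the file.
    for key in MODE_FLAGS:
        if key not in seen:
            lines.append(f"{key}={'true' if key == mode else 'false'}")
    return "\n".join(lines) + "\n"


def run_command(config_file, dataset):
    """Argument list for the java invocation quoted in the steps above.

    The `&> output.log` redirection is a shell feature, so it is not included
    here; with subprocess, pass stdout/stderr file handles instead.
    """
    return ["java", "-jar", "code.jar", config_file, dataset]


if __name__ == "__main__":
    # Illustrative file content only; the real config file has more keys.
    sample = "shapeExec=true\nglobalStatsExec=false\njenaExec=false\n"
    print(set_mode(sample, "jenaExec"))
    print(run_command("yagoConfig.properties", "YAGO"))
```

Rewriting all three flags together, rather than editing one by hand, avoids accidentally leaving two modes enabled between runs.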
The results are discussed in the paper and are available in the results_data folder.