Merge pull request #29 from Swirrl/doc-improvements

Simplify README a bit to more strongly recommend a default usage
Swirrl · May 29, 2020 · a53d4e0
2 parents 5c37735 + fe22bf8
commit a53d4e0

Showing 5 changed files with 288 additions and 189 deletions.
diff --git a/README.md b/README.md
@@ -1,247 +1,171 @@
-# rdf-validator
+# RDF Validator
 
-Runner for RDF test cases
+A simple runner for RDF test cases & validations.
 
-## Installation
+RDF Validator runs a collection of test cases against a SPARQL endpoint. The endpoint can be either an HTTP(S) SPARQL
+endpoint or a file or directory of RDF files on disk. Test cases can be specified as a SPARQL query file containing
+an `ASK` or a `SELECT` query, or as a suite of such files with a suite manifest.
 
-Install [leiningen](https://leiningen.org/) and then run
+Main features:
 
-```
-lein uberjar
-```
-
-This will build a standalone jar in the `target/uberjar` directory.
-
-## Usage
-
-`rdf-validator` runs a collection of test cases against a SPARQL endpoint. The endpoint can be either a HTTP(s) SPARQL
-endpoint or a file or directory on disk. Test cases can be specified as either a SPARQL query file, or a directory
-of such files.
-
-The repository contains versions of the well-formed cube validation queries defined in the [RDF data cube specification](https://www.w3.org/TR/vocab-data-cube/#wf).
-These are defined as SPARQL SELECT queries rather than the ASK queries defined in the specification to enable more detailed error reporting.
+- 👍 SPARQL `SELECT` or `ASK` queries as validations
+- 👌🏾 Package suites as git dependencies with a simple manifest format
+- 🏃 Run 3rd party validations as git or maven dependencies (thanks to the Clojure CLI tools)
+- 🏃🏾 Run validations against SPARQL endpoints or files of RDF
+- 🚴 Optionally dynamically generate queries with handlebars-like [selmer](https://github.com/yogthos/Selmer) templates
 
-To run these tests against a local SPARQL endpoint:
+## Quick Start
 
-    $ java -jar rdf-validator-standalone.jar --endpoint http://localhost/sparql/query --suite ./queries
-
-This will run all test cases in the queries directory against the endpoint. Test cases can be run individually:
+The recommended way to install and run the RDF Validator as an application is via the Clojure command line tools.
 
-    $ java -jar rdf-validator-standalone.jar --endpoint http://localhost/sparql/query --suite ./queries/01_SELECT_Observation_Has_At_Least_1_Dataset.sparql
-        
-SPARQL endpoints can also be loaded from a file containing serialised RDF triples:
+The advantage of this method is that suites of validations can be included as git deps, which are
+fetched and installed automatically on first use and cached thereafter.  This means you can use Clojure's
+`deps.edn` file to fetch suites of validations from 3rd parties easily.
 
-    $ java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite ./queries
-
-Multiple test cases can be specified:
-
-    $ java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite test1.sparql --suite test2.sparql 
-
-The RDF dataset can also be specified:
+To do this follow these steps:
 
-    $ java -jar rdf-validator-standalone.jar --endpoint data.ttl --graph http://graph1 --graph http://graph2 --suite test1.sparql
-
-Graphs are added a named graphs and included in the default graph. 
+First [install the Clojure CLI tools](https://clojure.org/guides/getting_started#_clojure_installer_and_cli_tools).
 
-## Writing test cases
+Then specify a `deps.edn` file like this:
 
-Test cases are expressed as either SPARQL ASK or SELECT queries. These queries are run against the target endpoint and the outcome of the test is based on the
-result of the query execution.
-
-### ASK queries
-
-SPARQL ASK queries are considered to have failed if they evaluate to `true` so should be written to find invalid statements. 
-This is consistent with the queries defined in the RDF data cube specification.
+```clojure
+{
+ :aliases {:rdf-validator {:extra-deps { swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
+                                                               :sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
+                           :main-opts ["-m" "rdf-validator.core"]}
+                           }
+ }
+```
 
-### SELECT queries
+This then lets you run the command line validator like so:
 
-SPARQL SELECT queries are considered to have failed if they return any matching solutions. Like ASK queries they should return bindings describing invalid resources.
+    $ clojure -A:rdf-validator <ARGS>
 
-### Query variables
+You'll then want to configure some validation suites and supply the validator with the location of some RDF, either via a SPARQL endpoint or as a file of triples.
 
-Validation queries can be parameterised with query variables which must be provided when the test suite is run. Query variables have the format `{{variable-name}}`
-within a query file. For example to validate no statements exist with a specified predicate, the following query could be defined:
+### Including a test suite
 
-*bad_predicate.sparql*
-```sparql
-SELECT ?this WHERE {
-  ?this <{{bad-predicate}}>> ?o .
-}
-```
+The easiest way to include a test suite is to include an existing one as a dependency in your `deps.edn`.  `deps.edn` supports
+[various ways of fetching and resolving dependencies](https://clojure.org/reference/deps_and_cli#_dependencies) and putting them
+on the classpath, such as via git deps, maven packaged jars, or just dependencies at a `:local/root`.
 
-when running this test case, the value of `bad-predicate` must be provided. This is done by providing an EDN file containing variable
-bindings. The EDN document should contain a map from keywords to the corresponding string values e.g.
+To do this we can include a 3rd party suite [such as those found in this repo](https://github.com/Swirrl/pmd-rdf-validations) like this:
 
-*variables.edn*
 ```clojure
-{ :bad-predicate "http://to-be-avoided"
-  :other-variable "http://other" }
+ {:deps {;; NOTE each dep here is a validation suite
+         swirrl/validations.qb {:git/url "git@github.com:Swirrl/pmd-rdf-validations.git"
+                                :sha "63479f200a7c3d1b0e63bc43b2617181644c846b"
+                                :deps/manifest :deps
+                                :deps/root "qb"}
+        }
+ :aliases {:rdf-validator {:extra-deps { swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
+                                                               :sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
+                           :main-opts ["-m" "rdf-validator.core"]}
+                     }
+ }
 ```
 
-the file of variable bindings is specified when running the test case(s) using the `--variables` parameter e.g.
+This particular repository contains multiple suites, each defined as its own dep within the same repo.  The `:deps/root` key essentially
+lets us point to a directory containing a dep; here the dep is a copy of the [integrity constraints](https://www.w3.org/TR/vocab-data-cube/#wf-rules)
+from the [RDF Data Cube Vocabulary](https://www.w3.org/TR/vocab-data-cube/).
 
-## Defining test suites
+Once these are specified we can run them against a repository containing data cubes, e.g.
 
-A test suite defines a group of tests to be run. A test suite can be created from a single test file or a directory containing test files as shown in the
-examples above. A test suite can also be defined within an EDN file that lists the tests it contains. The minimal form of this EDN file is:
+    $ clojure -A:rdf-validator --endpoint http://some.domain/sparql/query
 
-```clojure
-{
-  :suite-name ["test1.sparql"
-               "dir/test2.sparql"]
-  :suite2     ["suite2/test3.sparql"]
-}
-```
+Note that this command will first fetch the validation suite dependency, cache it locally for future use, and run all the validation suites
+we put on the classpath (here just the data cube validations).
 
-Each key in the top-level map defines a test suite and the corresponding value contains the suite definition. Each test definition in the associated
-list should be a path to a test file relative to the suite definition file. The type and name of each test is derived from the test file name. These
-can be stated explicitly by defining tests within a map:
+### Writing your own validation suites
 
-```clojure
-{
-  :suite-name [{:source "test1.sparql"
-                :type :sparql
-                :name "first"}
-               {:source "test2.sparql"
-                :name "second"}
-               {:source "test3.sparql"}
-               "dir/test4.sparql"]
-}
-```
+Validations can be supplied on the command line as just a directory of `.sparql` files, or specified on the JVM's classpath via your `deps.edn` file.
 
-When defining test definitions explicitly, only the `:source` key is required, the type and name will be derived from the test file name if not
-provided. The two styles of defining tests can be combined within a test suite definition as defined above.
+Here we demonstrate writing a simple classpath suite, as it is the easiest way to manage suites of validations that can be included as libraries.  Other
+supported methods are described in the more detailed docs.
 
-### Combining test suites
-
-Test suites can selectively include test cases from other test suites:
+To do this first add a `:paths ["src"]` key to your `deps.edn`:
 
 ```clojure
-{
-  :suite1 ["test1.sparql"
-           "test2.sparql"]
-  :suite2 ["test3.sparql"]
-  :suite3 {:import [:suite1 :suite2]
-           :exclude [:suite1/test1]
-           :tests [{:source "test4.txt"
-                    :type :sparql}]}
-}
+ {:paths ["src"]
+  :deps {;; NOTE each dep here is a validation suite
+         swirrl/validations.qb {:git/url "git@github.com:Swirrl/pmd-rdf-validations.git"
+                                :sha "63479f200a7c3d1b0e63bc43b2617181644c846b"
+                                :deps/manifest :deps
+                                :deps/root "qb"}
+        }
+ :aliases {:rdf-validator {:extra-deps { swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
+                                                               :sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
+                           :main-opts ["-m" "rdf-validator.core"]}
+                     }
+
+ }
 ```
 
-Test suites can import any number of other suites - this includes each test from the referenced suite into the importing suite. Any tests defined 
-in the imported suites can be selectively excluded by referencing them in the `:exclude` list. Each entry should contain a keyword of the form
-`:suite-name/test-name`. By default test names are the stem of the file name up to the file extension e.g. the test for file `"test1.sparql"`
-will be named `"test"`.
+This essentially says to include the `"src"` directory on the JVM's classpath when running the validator.  Next create the suite with the following
+directory structure:
 
-Test suite extensions must be acyclic e.g. `:suite1` importing `:suite2` which in turn imports `:suite1` is an error.
-An error will be raised if any suite listed within an extension list is not defined, but suites do not need to be defined within the
-same suite file. For example given two test files:
+    /your/validation/repo
+      |---- deps.edn
+      |---- src
+              |---- rdf-validator-suite.edn
+              |---- myorg
+                    |---- mysuite
+                          |---- test1.sparql
+                          |---- test2.sparql
 
-#### suite1.edn
-```clojure
-{:suite1 ["test1.sparql"]}
-```
+Then in the `rdf-validator-suite.edn` file, which must be at a classpath root (i.e. at the root of the "src" directory), specify the suite's name and the relative paths
+to the SPARQL files to include in the suite, e.g.
 
-#### suite2.edn
 ```clojure
-{:suite2 {:import [:suite1]
-          :tests ["test2.sparql"]}}
+{
+  :suite-name ["myorg/mysuite/test1.sparql"
+               "myorg/mysuite/test2.sparql"]
+}
 ```
 
-this is valid as long as `suite1.edn` is provided as a suite whenever `suite2.edn` is required e.g.
+Next write your SPARQL validations and run like so:
 
-    java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite suite1.edn --suite suite2.edn
-
-### Locating suites on the Java classpath
+    $ clojure -A:rdf-validator --endpoint http://some.domain/sparql/query
 
-In addition to test suites explicitly provided through the `--suite` parameter, rdf-validator also searches the classpath for test
-suite EDN definitions. The searched test suite files should be called `rdf-validator-suite.edn` and follow the format detailed above.
-When running from the command line, the containing directory should be added to the Java classpath using the `-classpath` option.
-Given an `rdf-validator-suite.edn` file:
+[More on defining test suites](/docs/DEFINING_TEST_SUITES.md)
 
-#### rdf-validator-suite.edn
-```clojure
-{:cp-suite ["test1.sparql"
-            "test2.sparql"]}
-```
-
-If this file is placed alongside the referenced `test1.sparql` and `test2.sparql` files in the directory `/tmp/rdf/my-suite` it can
-be run as follows:
+### Writing SPARQL validations
 
-    java -cp "/tmp/rdf/my-suite:rdf-validator-standalone.jar" clojure.main -m rdf-validator.core --endpoint data.ttl
+Validations are written as either SPARQL `SELECT` queries which should find and return validation failures, or
+`ASK` queries which fail when returning `true`.
 
-Use of the `-jar` option overrides any specified `-classpath` value, so the command above explicitly adds `rdf-validator.jar` to
-the classpath and invokes `clojure.main` instead (which in turn executes the `rdf-validator` main method).
+We recommend preferring the `SELECT` style as it provides more information to users on what went wrong.  For example
+this query is a port of IC-1 from the RDF Data Cube spec into `SELECT` style.
 
-### Running via the Clojure tool
+It will return any `qb:Observation`s that are not also in a `qb:dataSet`:
 
-Manually building a Java classpath as shown above is tedious and error-prone. The [Clojure command-line tool](https://clojure.org/reference/deps_and_cli)
-can automate the generation of the classpath and allows test suite directories to be packaged an distributed through `.jar` files or
-remote `git` repositories. To run `rdf-validator` through the `clojure` tool, first create a new directory containing a `deps.edn` file:
-
-#### deps.edn
-```clojure
-{:deps {swirrl/rdf-validator {:local/root "/path/to/rdf-validator.jar"}
-        suite {:local/root "/path/to/test/suite"}}
- :aliases {:rdf-validator {:main-opts ["-m" "rdf-validator.core"]}}}
+```sparql
+PREFIX qb:      <http://purl.org/linked-data/cube#>
+
+SELECT (?obs AS ?obsWithNoDataset)
+WHERE {
+  {
+    # Check observation has a data set
+    ?obs a qb:Observation .
+    FILTER NOT EXISTS { ?obs qb:dataSet ?dataset1 . }
+  }
+}
 ```
 
-The `/path/to/test/suite` directory should contain `deps.edn` file along with a `src` directory containing an `rdf-validator-suite.edn` file with the 
-format described above i.e.
-
-    /path/to/test/suite
-      |---- deps.edn
-      |---- src
-              |---- rdf-validator-suite.edn
-              |---- test1.sparql
-              |---- test2.sparql
-
-The `deps.edn` file can be empty, although it can also be used to reference dependencies such as other test suites it imports 
-from (see below on how to specify dependencies). The `clojure` tool will put the `/path/to/test/suite/src` directory on the java classpath and the 
-`:rdf-validator` alias will invoke `clojure.main` with the required arguments.
-
-Now `rdf-validator` can be run with:
+Some more example `SELECT` queries for validating RDF Data Cubes can be [found here](https://github.com/Swirrl/pmd-rdf-validations/tree/master/pmd-qb/src/swirrl/validations/pmd-qb).
 
-    clj -A:rdf-validator --endpoint data.ttl
-
-This will run the test cases defined in `/path/to/test/suite/src/rdf-validator-suite.edn`
+Additionally, RDF Validator supports an advanced feature, which is rarely needed: pre-processing queries with [selmer](https://github.com/yogthos/Selmer) by replacing "handlebars-like" variables (e.g. `{{dataset-uri}}`) with the bindings supplied via `--variables` as an `.edn` map, e.g.
 
-The `suite` dependency does not necessarily need to be defined locally. The `clojure` tool allows dependencies to be specified
-in remote `git` repositories or `.jar` files. If the test suite was hosted in a `git` repository instead, `deps.edn` could be
-modified to refer to the desired commit. Similarly, the `rdf-validator` dependency can refer to a version on Github rather than
-a local `.jar` file:
-
-#### deps.edn
 ```clojure
-{:deps {swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git" :sha "9e87347db0784cca974ad140b5091e1b3ae3c4f8"}
-        suite {:git/url "https://github.com/my/rdf/validator/suite" :sha "0f95c170d3799af13f51a5945339cae972866ff0"}}
- :aliases {:rdf-validator {:main-opts ["-m" "rdf-validator.core"]}}}
+{:dataset-uri "http://my.domain/data/my-dataset"}
 ```
 
-### Running individual suites
-
-By default all test cases within all test suites will be executed when running `rdf-validator`.
-This may be undesirable if many test suites are defined, or if one suite imports from another since
-this will cause imported test cases to be executed multiple times.
-
-Individual test suites can be executed by providing the suite names to be run in an argument list
-to the command-line invocation e.g.
+[More on writing test cases](/docs/WRITING_TEST_CASES.md)
 
-#### tests.edn
-```clojure
-{:suite1 ["test1.sparql" "test2.sparql" "test3.sparql"]
- :suite2 {:import [:suite1]
-          :exclude [:suite1/test2]
-          :tests ["test4.sparql"]
- :suite3 ["test5.sparql"]}
-```
+## Usage
 
-    java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite tests.edn suite2 suite3
-
-This will execute the tests defined within `suite2` and `suite3` within `tests.edn`.
+[More on command line options and usage](/docs/USAGE.md)
 
-    $ java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite bad_predicate.sparql --variables variables.edn
-
 ## License
 
 Copyright © 2018 Swirrl IT Ltd.
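
The suite layout, manifest and `--variables` pre-processing described in the README changes above can be sketched as a short shell script. This is an illustrative sketch only: the `./my-validations` directory, the `SUITE_REPO` variable, the contents of `test2.sparql` and the `variables.edn` file name are assumptions made for the example, and `sed` is used purely as a stand-in to preview the substitution that Selmer would perform (Selmer itself supports far more than plain string replacement).

```shell
# Hypothetical location for the new suite; override with SUITE_REPO if desired.
repo="${SUITE_REPO:-./my-validations}"
mkdir -p "$repo/src/myorg/mysuite"

# deps.edn: puts "src" on the classpath and pins the validator (sha copied from the README).
cat > "$repo/deps.edn" <<'EOF'
{:paths ["src"]
 :aliases {:rdf-validator {:extra-deps {swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
                                                              :sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
                           :main-opts ["-m" "rdf-validator.core"]}}}
EOF

# Suite manifest at the classpath root; paths are relative to src/.
cat > "$repo/src/rdf-validator-suite.edn" <<'EOF'
{:suite-name ["myorg/mysuite/test1.sparql"
              "myorg/mysuite/test2.sparql"]}
EOF

# test1: the SELECT port of IC-1 shown in the README.
cat > "$repo/src/myorg/mysuite/test1.sparql" <<'EOF'
PREFIX qb: <http://purl.org/linked-data/cube#>

SELECT (?obs AS ?obsWithNoDataset)
WHERE {
  ?obs a qb:Observation .
  FILTER NOT EXISTS { ?obs qb:dataSet ?dataset1 . }
}
EOF

# test2: a made-up templated validation that reports the dataset if it has no label.
cat > "$repo/src/myorg/mysuite/test2.sparql" <<'EOF'
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataset
WHERE {
  BIND(<{{dataset-uri}}> AS ?dataset)
  FILTER NOT EXISTS { ?dataset rdfs:label ?label . }
}
EOF

# Variable bindings, in the map-of-keywords form the README describes.
cat > "$repo/variables.edn" <<'EOF'
{:dataset-uri "http://my.domain/data/my-dataset"}
EOF

# Preview of the rendered query (sed stand-in for Selmer, not the validator's own code path).
sed 's|{{dataset-uri}}|http://my.domain/data/my-dataset|g' "$repo/src/myorg/mysuite/test2.sparql"
```

With the Clojure CLI tools installed, the suite would then be run from inside that directory with something like `clojure -A:rdf-validator --endpoint http://some.domain/sparql/query --variables variables.edn`, using only the options that the README itself mentions.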

diff --git a/docs/COMPILING.md b/docs/COMPILING.md
@@ -0,0 +1,15 @@
+# Compiling
+
+Instead of running the RDF Validator via the [clojure CLI tools](https://clojure.org/guides/getting_started#_clojure_installer_and_cli_tools), it
+is also possible to AOT compile it as an uberjar and run it with the incantation: `java -jar rdf-validator.jar`.
+
+This has the small advantage of reducing start-up time a little; however, it also makes it substantially harder to assemble dependencies via
+the command line tools.  Hence this mechanism is no longer recommended.
+
+To compile an uberjar though, you need to first install [leiningen](https://leiningen.org/) and then run:
+
+```
+lein uberjar
+```
+
+This will build a standalone jar in the `target/uberjar` directory.
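
The build step described in `docs/COMPILING.md` can be wrapped in a small helper, sketched below under a few assumptions: leiningen's default `<name>-<version>-standalone.jar` naming is in effect (a project can override it), and the `find_uberjar` function name is invented for this example. The build itself is guarded so that the script is harmless when run outside a leiningen project.

```shell
# Print the first standalone jar found under the given build directory
# (defaults to target/uberjar, where `lein uberjar` writes its output).
find_uberjar() {
  find "${1:-target/uberjar}" -name '*-standalone.jar' 2>/dev/null | head -n 1
}

# Build only when leiningen is installed and we are inside a leiningen project.
if command -v lein >/dev/null 2>&1 && [ -f project.clj ]; then
  lein uberjar
  echo "Built: $(find_uberjar)"
fi
```

The resulting jar could then be run with, for example, `java -jar "$(find_uberjar)" --endpoint data.ttl --suite ./queries`, where `data.ttl` is a placeholder for your own RDF file.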