Merge pull request #29 from Swirrl/doc-improvements
Simplify README a bit to more strongly recommend a default usage
RickMoynihan committed May 29, 2020
2 parents 5c37735 + fe22bf8 commit a53d4e0
Showing 5 changed files with 288 additions and 189 deletions.
302 changes: 113 additions & 189 deletions README.md
@@ -1,247 +1,171 @@
# rdf-validator
# RDF Validator

Runner for RDF test cases
A simple runner for RDF test cases and validations.

## Installation
RDF Validator runs a collection of test cases against a SPARQL endpoint. The endpoint can be either an HTTP(S) SPARQL
endpoint or a file or directory of RDF files on disk. Test cases can be specified as a SPARQL query file containing
an `ASK` or a `SELECT` query, or as a suite of such files with a suite manifest.

Install [leiningen](https://leiningen.org/) and then run
Main features:

```
lein uberjar
```

This will build a standalone jar in the `target/uberjar` directory.

## Usage

`rdf-validator` runs a collection of test cases against a SPARQL endpoint. The endpoint can be either an HTTP(S) SPARQL
endpoint or a file or directory on disk. Test cases can be specified as either a SPARQL query file, or a directory
of such files.

The repository contains versions of the well-formed cube validation queries defined in the [RDF data cube specification](https://www.w3.org/TR/vocab-data-cube/#wf).
These are defined as SPARQL SELECT queries rather than the ASK queries defined in the specification to enable more detailed error reporting.
- 👍 SPARQL `SELECT` or `ASK` queries as validations
- 👌🏾 Package suites as git dependencies with a simple manifest format
- 🏃 Run 3rd party validations as dependencies via git or maven dependencies (thanks to the Clojure CLI tools)
- 🏃🏾 Run validations against SPARQL endpoints or files of RDF
- 🚴 Optionally dynamically generate queries with handlebars-like [selmer](https://github.com/yogthos/Selmer) templates

To run these tests against a local SPARQL endpoint:
## Quick Start

$ java -jar rdf-validator-standalone.jar --endpoint http://localhost/sparql/query --suite ./queries

This will run all test cases in the queries directory against the endpoint. Test cases can be run individually:
The recommended way to install and run the RDF Validator as an application is via the Clojure command line tools.

$ java -jar rdf-validator-standalone.jar --endpoint http://localhost/sparql/query --suite ./queries/01_SELECT_Observation_Has_At_Least_1_Dataset.sparql
SPARQL endpoints can also be loaded from a file containing serialised RDF triples:
The advantage of this method is that suites of validations can be included as git deps, which are
automatically fetched and installed on first use and cached thereafter. This means you can use Clojure's
`deps.edn` file to fetch suites of validations from 3rd parties easily.

$ java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite ./queries

Multiple test cases can be specified:

$ java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite test1.sparql --suite test2.sparql

The RDF dataset can also be specified:
To do this follow these steps:

$ java -jar rdf-validator-standalone.jar --endpoint data.ttl --graph http://graph1 --graph http://graph2 --suite test1.sparql

Graphs are added as named graphs and included in the default graph.
First, [install the Clojure CLI tools](https://clojure.org/guides/getting_started#_clojure_installer_and_cli_tools).

## Writing test cases
Then specify a `deps.edn` file like this:

Test cases are expressed as either SPARQL ASK or SELECT queries. These queries are run against the target endpoint and the outcome of the test is based on the
result of the query execution.

### ASK queries

SPARQL ASK queries are considered to have failed if they evaluate to `true` so should be written to find invalid statements.
This is consistent with the queries defined in the RDF data cube specification.
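
As a minimal sketch of an ASK-style test case, the shell snippet below writes a query file modelled on the IC-1 constraint from the RDF data cube specification. The path `queries/observation_has_dataset.sparql` is a hypothetical name chosen for illustration.

```shell
# Hypothetical directory and file name, used only for this example.
mkdir -p queries

# Quoted heredoc delimiter so the shell leaves the query text untouched.
cat > queries/observation_has_dataset.sparql <<'EOF'
PREFIX qb: <http://purl.org/linked-data/cube#>

# Evaluates to true (a failed test) if any observation lacks a qb:dataSet.
ASK {
  ?obs a qb:Observation .
  FILTER NOT EXISTS { ?obs qb:dataSet ?dataset . }
}
EOF
```

The file could then be passed with `--suite queries/observation_has_dataset.sparql`, in the same way as any other individual test case.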
```clojure
{
:aliases {:rdf-validator {:extra-deps { swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
:sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
:main-opts ["-m" "rdf-validator.core"]}
}
}
```

### SELECT queries
This then lets you run the command line validator like so:

SPARQL SELECT queries are considered to have failed if they return any matching solutions. Like ASK queries they should return bindings describing invalid resources.
$ clojure -A:rdf-validator <ARGS>

### Query variables
You'll then want to configure some validation suites and supply the validator with the location of some RDF, either a SPARQL endpoint or a file of triples.

Validation queries can be parameterised with query variables which must be provided when the test suite is run. Query variables have the format `{{variable-name}}`
within a query file. For example to validate no statements exist with a specified predicate, the following query could be defined:
### Including a test suite

*bad_predicate.sparql*
```sparql
SELECT ?this WHERE {
?this <{{bad-predicate}}> ?o .
}
```
The easiest way to include a test suite is to include an existing one as a dependency in your `deps.edn`. `deps.edn` supports
[various ways of fetching and resolving dependencies](https://clojure.org/reference/deps_and_cli#_dependencies) and putting them
on the classpath, such as via git deps, maven packaged jars, or just dependencies at a `:local/root`.

When running this test case, the value of `bad-predicate` must be provided. This is done by providing an EDN file containing variable
bindings. The EDN document should contain a map from keywords to the corresponding string values, e.g.
To do this we can include a 3rd party suite [such as those found in this repo](https://github.com/Swirrl/pmd-rdf-validations) like this:

*variables.edn*
```clojure
{ :bad-predicate "http://to-be-avoided"
:other-variable "http://other" }
{:deps {;; NOTE each dep here is a validation suite
swirrl/validations.qb {:git/url "git@github.com:Swirrl/pmd-rdf-validations.git"
:sha "63479f200a7c3d1b0e63bc43b2617181644c846b"
:deps/manifest :deps
:deps/root "qb"}
}
:aliases {:rdf-validator {:extra-deps { swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
:sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
:main-opts ["-m" "rdf-validator.core"]}
}
}
```

The file of variable bindings is specified when running the test case(s) using the `--variables` parameter, e.g.
This particular repository contains multiple suites, each defined as its own dep within the same repo. The `:deps/root` key
lets us point to a directory containing a dep. Here the dep is a copy of the [integrity constraints](https://www.w3.org/TR/vocab-data-cube/#wf-rules)
from the [RDF Data Cube Vocabulary](https://www.w3.org/TR/vocab-data-cube/).

## Defining test suites
Once these are specified we can run them against a repository containing data cubes, e.g.

A test suite defines a group of tests to be run. A test suite can be created from a single test file or a directory containing test files as shown in the
examples above. A test suite can also be defined within an EDN file that lists the tests it contains. The minimal form of this EDN file is:
$ clojure -A:rdf-validator --endpoint http://some.domain/sparql/query

```clojure
{
:suite-name ["test1.sparql"
"dir/test2.sparql"]
:suite2 ["suite2/test3.sparql"]
}
```
Note that this command will first fetch the validation suite dependency, cache it locally for future use, and run all the validation suites
we put on the classpath (here just the data cube validations).

Each key in the top-level map defines a test suite and the corresponding value contains the suite definition. Each test definition in the associated
list should be a path to a test file relative to the suite definition file. The type and name of each test is derived from the test file name. These
can be stated explicitly by defining tests within a map:
### Writing your own validation suites

```clojure
{
:suite-name [{:source "test1.sparql"
:type :sparql
:name "first"}
{:source "test2.sparql"
:name "second"}
{:source "test3.sparql"}
"dir/test4.sparql"]
}
```
Validations can be supplied on the command line as just a directory of `.sparql` files, or specified on the JVM's classpath via your `deps.edn` file.

When defining test definitions explicitly, only the `:source` key is required, the type and name will be derived from the test file name if not
provided. The two styles of defining tests can be combined within a test suite definition as defined above.
Here we demonstrate writing a simple classpath suite, as it is the easiest way to manage suites of validations that can be included as libraries. Other
supported methods are described in the more detailed docs.

### Combining test suites

Test suites can selectively include test cases from other test suites:
To do this first add a `:paths ["src"]` key to your `deps.edn`:

```clojure
{
:suite1 ["test1.sparql"
"test2.sparql"]
:suite2 ["test3.sparql"]
:suite3 {:import [:suite1 :suite2]
:exclude [:suite1/test1]
:tests [{:source "test4.txt"
:type :sparql}]}
}
{:paths ["src"]
:deps {;; NOTE each dep here is a validation suite
swirrl/validations.qb {:git/url "git@github.com:Swirrl/pmd-rdf-validations.git"
:sha "63479f200a7c3d1b0e63bc43b2617181644c846b"
:deps/manifest :deps
:deps/root "qb"}
}
:aliases {:rdf-validator {:extra-deps { swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
:sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
:main-opts ["-m" "rdf-validator.core"]}
}

}
```

Test suites can import any number of other suites - this includes each test from the referenced suite into the importing suite. Any tests defined
in the imported suites can be selectively excluded by referencing them in the `:exclude` list. Each entry should contain a keyword of the form
`:suite-name/test-name`. By default test names are the stem of the file name up to the file extension, e.g. the test for file `"test1.sparql"`
will be named `"test1"`.
This tells the validator to include the `"src"` directory on the JVM's classpath when it runs. Next create the suite with the following
directory structure:

Test suite extensions must be acyclic e.g. `:suite1` importing `:suite2` which in turn imports `:suite1` is an error.
An error will be raised if any suite listed within an extension list is not defined, but suites do not need to be defined within the
same suite file. For example given two test files:
/your/validation/repo
|---- deps.edn
|---- src
|---- rdf-validator-suite.edn
|---- myorg
|---- mysuite
|---- test1.sparql
|---- test2.sparql

#### suite1.edn
```clojure
{:suite1 ["test1.sparql"]}
```
Then in the `rdf-validator-suite.edn` file, which must be at a classpath root (i.e. at the root of the "src" directory), specify the suite's name and the relative paths
to the SPARQL files to include in the suite, e.g.

#### suite2.edn
```clojure
{:suite2 {:import [:suite1]
:tests ["test2.sparql"]}}
{
:suite-name ["myorg/mysuite/test1.sparql"
"myorg/mysuite/test2.sparql"]
}
```

This is valid as long as `suite1.edn` is provided as a suite whenever `suite2.edn` is required, e.g.
Next write your SPARQL validations and run like so:

java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite suite1.edn --suite suite2.edn

### Locating suites on the Java classpath
$ clojure -A:rdf-validator --endpoint http://some.domain/sparql/query
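
The steps above can be sketched as a short shell script. The repository name `my-validations`, the suite key `:suite-name`, and the choice of a single validation file are assumptions made for illustration; the `deps.edn` alias is copied from the examples in this README.

```shell
# Hypothetical repository layout: src/ is the classpath root.
mkdir -p my-validations/src/myorg/mysuite

# deps.edn puts "src" on the classpath and defines the :rdf-validator alias.
cat > my-validations/deps.edn <<'EOF'
{:paths ["src"]
 :aliases {:rdf-validator {:extra-deps {swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git"
                                                             :sha "fd848fabc5718f876f99ee4ee5a3f89ea8529571"}}
                           :main-opts ["-m" "rdf-validator.core"]}}}
EOF

# The manifest sits at the classpath root and lists paths relative to it.
cat > my-validations/src/rdf-validator-suite.edn <<'EOF'
{:suite-name ["myorg/mysuite/test1.sparql"]}
EOF

# One SELECT validation: returns observations that have no qb:dataSet.
cat > my-validations/src/myorg/mysuite/test1.sparql <<'EOF'
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (?obs AS ?obsWithNoDataset)
WHERE {
  ?obs a qb:Observation .
  FILTER NOT EXISTS { ?obs qb:dataSet ?dataset1 . }
}
EOF
```

Running `clojure -A:rdf-validator --endpoint <endpoint>` from inside `my-validations` should then pick the suite up from the classpath.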

In addition to test suites explicitly provided through the `--suite` parameter, rdf-validator also searches the classpath for test
suite EDN definitions. The searched test suite files should be called `rdf-validator-suite.edn` and follow the format detailed above.
When running from the command line, the containing directory should be added to the Java classpath using the `-classpath` option.
Given an `rdf-validator-suite.edn` file:
[More on defining test suites](/docs/DEFINING_TEST_SUITES.md)

#### rdf-validator-suite.edn
```clojure
{:cp-suite ["test1.sparql"
"test2.sparql"]}
```

If this file is placed alongside the referenced `test1.sparql` and `test2.sparql` files in the directory `/tmp/rdf/my-suite` it can
be run as follows:
### Writing SPARQL validations

java -cp "/tmp/rdf/my-suite:rdf-validator-standalone.jar" clojure.main -m rdf-validator.core --endpoint data.ttl
Validations are written as either SPARQL `SELECT` queries which should find and return validation failures, or
`ASK` queries which fail when returning `true`.

Use of the `-jar` option overrides any specified `-classpath` value, so the command above explicitly adds `rdf-validator-standalone.jar` to
the classpath and invokes `clojure.main` instead (which in turn executes the `rdf-validator` main method).
We recommend preferring the `SELECT` style as it provides more information to users on what went wrong. For example,
this query is a port of IC-1 from the RDF Data Cube spec into `SELECT` style.

### Running via the Clojure tool
It will return any `qb:Observation`s that are not also in a `qb:dataSet`:

Manually building a Java classpath as shown above is tedious and error-prone. The [Clojure command-line tool](https://clojure.org/reference/deps_and_cli)
can automate the generation of the classpath and allows test suite directories to be packaged and distributed through `.jar` files or
remote `git` repositories. To run `rdf-validator` through the `clojure` tool, first create a new directory containing a `deps.edn` file:

#### deps.edn
```clojure
{:deps {swirrl/rdf-validator {:local/root "/path/to/rdf-validator.jar"}
suite {:local/root "/path/to/test/suite"}}
:aliases {:rdf-validator {:main-opts ["-m" "rdf-validator.core"]}}}
```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (?obs AS ?obsWithNoDataset)
WHERE {
{
# Check observation has a data set
?obs a qb:Observation .
FILTER NOT EXISTS { ?obs qb:dataSet ?dataset1 . }
}
}
```

The `/path/to/test/suite` directory should contain a `deps.edn` file along with a `src` directory containing an `rdf-validator-suite.edn` file with the
format described above i.e.

/path/to/test/suite
|---- deps.edn
|---- src
|---- rdf-validator-suite.edn
|---- test1.sparql
|---- test2.sparql

The `deps.edn` file can be empty, although it can also be used to reference dependencies such as other test suites it imports
from (see below on how to specify dependencies). The `clojure` tool will put the `/path/to/test/suite/src` directory on the java classpath and the
`:rdf-validator` alias will invoke `clojure.main` with the required arguments.

Now `rdf-validator` can be run with:
Some more example `SELECT` queries for validating RDF Data cubes can be [found here](https://github.com/Swirrl/pmd-rdf-validations/tree/master/pmd-qb/src/swirrl/validations/pmd-qb)

clj -A:rdf-validator --endpoint data.ttl

This will run the test cases defined in `/path/to/test/suite/src/rdf-validator-suite.edn`
Additionally, RDF Validator supports an advanced feature, which usually isn't needed, to pre-process queries with [selmer](https://github.com/yogthos/Selmer). It replaces "handlebars-like" variables (e.g. `{{dataset-uri}}`) with the bindings supplied through `--variables` as an `.edn` map, e.g.

The `suite` dependency does not necessarily need to be defined locally. The `clojure` tool allows dependencies to be specified
in remote `git` repositories or `.jar` files. If the test suite was hosted in a `git` repository instead, `deps.edn` could be
modified to refer to the desired commit. Similarly, the `rdf-validator` dependency can refer to a version on Github rather than
a local `.jar` file:

#### deps.edn
```clojure
{:deps {swirrl/rdf-validator {:git/url "https://github.com/Swirrl/rdf-validator.git" :sha "9e87347db0784cca974ad140b5091e1b3ae3c4f8"}
suite {:git/url "https://github.com/my/rdf/validator/suite" :sha "0f95c170d3799af13f51a5945339cae972866ff0"}}
:aliases {:rdf-validator {:main-opts ["-m" "rdf-validator.core"]}}}
{:dataset-uri "http://my.domain/data/my-dataset"}
```
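
As a rough sketch of how a templated query and its bindings fit together, the snippet below writes both files. The directory `templated`, the file name `dataset_is_typed.sparql`, and the check itself (that the bound dataset URI is typed as a `qb:DataSet`) are assumptions chosen for illustration.

```shell
# Hypothetical directory for the templated query and its bindings.
mkdir -p templated

# Quoted heredoc so the shell does not touch the {{dataset-uri}} placeholder.
cat > templated/dataset_is_typed.sparql <<'EOF'
PREFIX qb: <http://purl.org/linked-data/cube#>
# Returns the dataset URI if it is not declared as a qb:DataSet.
SELECT ?dataset WHERE {
  VALUES ?dataset { <{{dataset-uri}}> }
  FILTER NOT EXISTS { ?dataset a qb:DataSet . }
}
EOF

# EDN map of bindings: the keyword name matches the template variable.
cat > templated/variables.edn <<'EOF'
{:dataset-uri "http://my.domain/data/my-dataset"}
EOF
```

The two files would then be supplied together, passing the query via `--suite` and the bindings via `--variables templated/variables.edn`.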

### Running individual suites

By default all test cases within all test suites will be executed when running `rdf-validator`.
This may be undesirable if many test suites are defined, or if one suite imports from another since
this will cause imported test cases to be executed multiple times.

Individual test suites can be executed by providing the suite names to be run in an argument list
to the command-line invocation e.g.
[More on writing test cases](/docs/WRITING_TEST_CASES.md)

#### tests.edn
```clojure
{:suite1 ["test1.sparql" "test2.sparql" "test3.sparql"]
:suite2 {:import [:suite1]
:exclude [:suite1/test2]
:tests ["test4.sparql"]}
:suite3 ["test5.sparql"]}
```
## Usage

java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite tests.edn suite2 suite3

This will execute the tests defined within `suite2` and `suite3` within `tests.edn`.
[More on command line options and usage](/docs/USAGE.md)

$ java -jar rdf-validator-standalone.jar --endpoint data.ttl --suite bad_predicate.sparql --variables variables.edn

## License

Copyright © 2018 Swirrl IT Ltd.
15 changes: 15 additions & 0 deletions docs/COMPILING.md
@@ -0,0 +1,15 @@
# Compiling

Rather than running via the [clojure CLI tools](https://clojure.org/guides/getting_started#_clojure_installer_and_cli_tools), it
is also possible to AOT compile the RDF Validator as an uberjar and run it with the incantation: `java -jar rdf-validator.jar`.

This has the small advantage that it reduces start up time a little, however it does also make it substantially harder to assemble dependencies via
the command line tools. Hence this mechanism is no longer recommended.

To compile an uberjar though, you need to first install [leiningen](https://leiningen.org/) and then run:

```
lein uberjar
```

This will build a standalone jar in the `target/uberjar` directory.