Overview of the code

csv2rdf is an implementation of the Generating RDF from Tabular Data on the Web specification. This specification in turn directly depends on two main specifications:

  1. Model for Tabular Data and Metadata on the Web
  2. Metadata Vocabulary for Tabular Data.

These specifications themselves rely on other specifications. All specifications required for the implementation of csv2rdf are listed in the External specifications section.

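The majority of the code implements these three specifications. The code for each can be found in an accompanying namespace:

  1. Generating RDF from Tabular Data on the Web: csv2rdf.csvw
  2. Model for Tabular Data and Metadata on the Web: csv2rdf.tabular
  3. Metadata Vocabulary for Tabular Data: csv2rdf.metadata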

An overview of the main concepts within the implementations of those namespaces follows.

Metadata

csv2rdf.metadata

The Metadata Vocabulary for Tabular Data specification defines the structure of metadata JSON files along with how they should be processed and validated. Metadata documents are traversed from the root document down to the leaves (i.e. primitive JSON values: strings, numbers and booleans) and then validated and normalised from the leaves back up to the root. Each part of the document is processed by a 'validator' function responsible for validating the input and normalising the output.

Validators

Conceptually a validator is a function Validator[TOut](ParseContext, JSONValue & ErrorHandler): (TOut | invalid), i.e. a function which accepts two or three arguments (the ErrorHandler being the optional third) and returns either the normalised output or a special invalid value indicating an error. Later validators may handle invalid values by e.g. using a specified default value or ignoring the property. Parsed metadata documents should not contain any instances of the invalid value. Validators should throw an exception to indicate a fatal error (e.g. if a required property is missing), or log a warning message if the invalid input is non-fatal.
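The validator contract can be illustrated with a minimal sketch. The names invalid, string-validator and required-string, and the :path key on the context map, are hypothetical and chosen for illustration only; they are not taken from csv2rdf.metadata.

```clojure
(ns example.validator)

;; Hypothetical sentinel for the special invalid value (an assumption of this sketch).
(def invalid ::invalid)

(defn invalid? [x]
  (= invalid x))

(defn string-validator
  "Validator[String]: returns the string unchanged. Non-string input is
  treated as non-fatal: a warning is printed and invalid is returned."
  [context x]
  (if (string? x)
    x
    (do
      (println "WARN" (:path context) "expected a string but got" (pr-str x))
      invalid)))

(defn required-string
  "Validator[String] for a required property: invalid input is fatal."
  [context x]
  (let [result (string-validator context x)]
    (if (invalid? result)
      (throw (ex-info "Required string property is invalid"
                      {:path (:path context) :value x}))
      result)))
```

Here the same underlying check is used in a non-fatal form (warn and return invalid) and a fatal form (throw), mirroring the two error styles described above.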

Combinators

Validator functions are composed using the combinators defined in the csv2rdf.metadata.validator namespace. Metadata documents are then validated using either the table-group or table validators depending on the keys contained in the root metadata object.
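As a rough sketch of what composing validators with combinators might look like, the following defines two hypothetical combinators, with-default and array-of. These names and their behaviour are assumptions made for illustration and may not match the combinators in csv2rdf.metadata.validator.

```clojure
(ns example.combinators)

;; Hypothetical sentinel for the special invalid value.
(def invalid ::invalid)

(defn number-validator
  "Validator[Number]: returns the number, or invalid for any other input."
  [context x]
  (if (number? x) x invalid))

(defn with-default
  "Combinator: returns a validator which substitutes `default` whenever
  the wrapped validator returns invalid."
  [validator default]
  (fn [context x]
    (let [result (validator context x)]
      (if (= invalid result) default result))))

(defn array-of
  "Combinator: returns a validator for JSON arrays. Each element is
  validated with `validator` under a context whose path is extended by
  the element index; invalid elements are dropped from the output."
  [validator]
  (fn [context xs]
    (if (sequential? xs)
      (into []
            (comp (map-indexed
                    (fn [i x] (validator (update context :path conj i) x)))
                  (remove #(= invalid %)))
            xs)
      invalid)))
```

With these definitions, ((array-of number-validator) {:path []} [1 "x" 3]) keeps only the valid elements, and ((with-default number-validator 0) {:path []} "x") falls back to the default.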

ParseContext

The ParseContext, defined in csv2rdf.metadata.context, contains information about the item currently being validated, such as its location within the document or the URI of the containing metadata document. The parse context is updated during parsing by the JSON-LD context of contained metadata documents as well as by validators for composite JSON types (objects and arrays). The path of the item being validated describes its location within the metadata document, for example the value 4 in

{"docs": [{"key1": 1, "key2": "foo"}, {"key1": 4}, {"key1": 3, "key2": "bar"}]}

has path ["docs", 1, "key1"] i.e. the value for the key1 key within the object at index 1 within the array under the docs key in the root object.
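The path in this example can be reproduced with a small sketch. It assumes the context is a plain map with a :path vector and uses a hypothetical append-path helper; the real ParseContext in csv2rdf.metadata.context may be structured differently.

```clojure
(ns example.path)

;; The example document as Clojure data, as a JSON parser that keeps
;; string keys would produce it.
(def doc
  {"docs" [{"key1" 1, "key2" "foo"}
           {"key1" 4}
           {"key1" 3, "key2" "bar"}]})

(defn append-path
  "Returns a context whose path is extended with an object key or an
  array index (hypothetical helper)."
  [context segment]
  (update context :path conj segment))

;; Descend from the root as validators for objects and arrays would.
(def context
  (-> {:path []}
      (append-path "docs")
      (append-path 1)
      (append-path "key1")))

;; The accumulated path addresses the value directly.
(get-in doc (:path context)) ;; => 4
```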

ErrorHandler

Some validators need to behave differently on invalid input in different circumstances, e.g. sometimes an invalid URI value is fatal, while sometimes a default value can be returned after logging a warning. Such validators can be parameterised by an error callback function responsible for reporting the error. Error callbacks should have signature ErrorHandler[TOut](String): (TOut | invalid), i.e. they should take a single string parameter describing the error and either return a default value, return invalid, or raise an exception if the error is fatal.
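A minimal sketch of a validator parameterised by an error callback follows. The names warn-with-default, fatal and uri-validator are hypothetical and only illustrate the ErrorHandler signature described above.

```clojure
(ns example.error-handler)

;; Hypothetical sentinel for the special invalid value.
(def invalid ::invalid)

(defn warn-with-default
  "Returns an ErrorHandler which prints a warning and returns `default`."
  [default]
  (fn [message]
    (println "WARN:" message)
    default))

(defn fatal
  "ErrorHandler which treats every error as fatal."
  [message]
  (throw (ex-info message {})))

(defn uri-validator
  "Returns a Validator[URI] which delegates all error reporting to `on-error`."
  [on-error]
  (fn [context x]
    (if (string? x)
      (try
        (java.net.URI. ^String x)
        (catch java.net.URISyntaxException e
          (on-error (str "Invalid URI '" x "': " (.getMessage e)))))
      (on-error (str "Expected a URI string but got " (pr-str x))))))
```

The same validator can then be made strict with (uri-validator fatal) or lenient with (uri-validator (warn-with-default invalid)), without changing the parsing logic.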

Tabular data

csv2rdf.tabular

The Model for Tabular Data and Metadata on the Web specification defines how metadata documents should be located and loaded before being used to parse and validate tabular data files. The metadata location process is implemented in csv2rdf.tabular.metadata. Tabular data is parsed into a lazy sequence of maps, each representing a single data record within the input file according to the structure defined in the metadata file.

Cell parsing

The csv2rdf.tabular.cell namespace is responsible for parsing input cell values according to the column datatypes defined in the metadata document. Cell values are converted to appropriate Java types depending on the column datatype. Parsed values are then validated according to any size or range constraints for the datatype.
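The parse-then-validate flow might be sketched as follows. The parse-cell function, the datatype map keys (:base, :minimum, :maximum) and the shape of the result map are assumptions for illustration, and only two numeric datatypes are handled.

```clojure
(ns example.cell)

(defn parse-cell
  "Parses the cell string `s` according to a datatype map such as
  {:base \"integer\" :minimum 0 :maximum 100}. Returns a map with the
  parsed :value and a vector of :errors. Unknown base types are left as
  strings, and unparseable input keeps the original string as its value."
  [{:keys [base minimum maximum]} ^String s]
  (try
    (let [value (case base
                  "integer" (Long/parseLong s)
                  "double"  (Double/parseDouble s)
                  s)]
      (cond
        ;; Range constraints only apply to values parsed as numbers.
        (and minimum (number? value) (< value minimum))
        {:value value :errors [(str "Value is below the minimum " minimum)]}

        (and maximum (number? value) (> value maximum))
        {:value value :errors [(str "Value is above the maximum " maximum)]}

        :else
        {:value value :errors []}))
    (catch NumberFormatException _
      {:value s :errors [(str "Cannot parse '" s "' as " base)]})))
```

Returning errors alongside the value, rather than throwing, is a choice made for this sketch so that a single bad cell does not stop processing of the rest of the row.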

CSV Reader

The csv2rdf.tabular.csv.reader namespace implements a parser for CSV files according to the tabular specification. The reader returns a lazy sequence of maps describing the raw CSV data according to the given CSV dialect. For example the input file

col1,col2
a,b
c,d

will return a sequence like

({:source-row-number 1, :content "col1,col2", :comment nil, :cells ["col1" "col2"], :type :data}
 {:source-row-number 2, :content "a,b", :comment nil, :cells ["a" "b"], :type :data}
 {:source-row-number 3, :content "c,d", :comment nil, :cells ["c" "d"], :type :data})
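Row maps of the shape shown above can be produced lazily with a few lines of Clojure. This is a deliberately naive sketch: read-rows is a hypothetical function, it takes only a delimiter rather than a full CSV dialect, and it does not handle quoting, escaping or comment lines.

```clojure
(ns example.reader
  (:require [clojure.string :as str])
  (:import (java.util.regex Pattern)))

(defn read-rows
  "Returns a lazy sequence of row maps for the given sequence of lines.
  Cells are obtained by splitting each line on the literal `delimiter`."
  [lines delimiter]
  (let [separator (re-pattern (Pattern/quote delimiter))]
    (map-indexed
      (fn [idx line]
        {:source-row-number (inc idx)   ; source rows are numbered from 1
         :content line
         :comment nil
         ;; A limit of -1 keeps trailing empty cells.
         :cells (str/split line separator -1)
         :type :data})
      lines)))
```

Because map-indexed is lazy, rows are only built as they are consumed, so (first (read-rows lines ",")) does not need the whole input to be read first.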

Row parsing

The csv2rdf.tabular.csv namespace transforms each row record from the reader according to the structure described by the metadata document. This includes parsing cell values and deriving cell URIs from property templates.
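Deriving a URI from a template and a row's cells can be sketched as below. This handles only simple {name} variables, a small subset of the URI Template syntax, and assumes the template variables are column names; expand-template and row->url are hypothetical names, not functions from csv2rdf.

```clojure
(ns example.row
  (:require [clojure.string :as str])
  (:import (java.net URLEncoder)))

(defn expand-template
  "Replaces each {name} in `template` with the percent-encoded value of
  `name` in `vars`. Missing variables expand to the empty string."
  [template vars]
  (str/replace template
               #"\{([^}]+)\}"
               (fn [[_ var-name]]
                 ;; URLEncoder encodes spaces as +, so rewrite them as %20.
                 (-> (URLEncoder/encode (str (get vars var-name "")) "UTF-8")
                     (str/replace "+" "%20")))))

(defn row->url
  "Expands `template` using the cells of one row keyed by column name."
  [template column-names cells]
  (expand-template template (zipmap column-names cells)))
```

For instance, (row->url "http://example.com/item/{col1}" ["col1" "col2"] ["a" "b"]) expands the template using the first cell of the row.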

CSVW

csv2rdf.csvw

The Generating RDF from Tabular Data on the Web specification defines how to create RDF from a tabular data file and associated metadata. This is implemented within the csv2rdf.csvw namespaces - the minimal and standard modes are implemented by the csv2rdf.csvw.minimal and csv2rdf.csvw.standard namespaces respectively.

Grafter

Grafter is a Clojure library for processing RDF data. The CSVW functions output a lazy sequence of Grafter statement records.

External specifications

The code contains (not necessarily complete) implementations of various specifications required to implement the CSVW conversion process. Smaller specifications are implemented in a single namespace, while larger specifications are split across multiple sub-namespaces. Functions within these namespaces may make reference to a particular section of the specification they implement. Such functions are tagged with a metadata item indicating the specification/section they implement e.g.

(defn ^{:metadata-spec "5.1.2"} link-property ...)

This indicates the link-property function implements some part of section 5.1.2 of the "Metadata Vocabulary for Tabular Data" specification. Below is the list of specifications implemented or referenced within the code:

| Specification | Metadata key | Namespace |
| --- | --- | --- |
| Generating RDF from Tabular Data on the Web | :csvw-spec | csv2rdf.csvw |
| Model for Tabular Data and Metadata on the Web | :tabular-spec | csv2rdf.tabular |
| Metadata Vocabulary for Tabular Data | :metadata-spec | csv2rdf.metadata |
| W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes | :xml-schema-spec | csv2rdf.xml |
| JSON-LD 1.0 | :json-ld-spec | csv2rdf.json-ld |
| JSON-LD 1.0 Processing Algorithms and API | :jsonld-api-spec | csv2rdf.json-ld |
| Tags for Identifying Languages | :bcp47-spec | csv2rdf.bcp47 |
| Number Format Patterns | | csv2rdf.uax35 |
| URI Template | | csv2rdf.uri-template |
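Because the specification tags described above are ordinary var metadata, they can be queried at the REPL. The sketch below defines a stand-in link-property with a placeholder body (the real body is elided in this document) and a hypothetical spec-tagged helper which lists the tagged functions in a namespace.

```clojure
(ns example.spec-meta)

(defn ^{:metadata-spec "5.1.2"} link-property
  "Stand-in with a placeholder body, used only to carry the tag."
  [context x]
  x)

(defn spec-tagged
  "Returns a map of symbol to section for every public var in the
  namespace `ns-sym` tagged with the metadata key `spec-key`."
  [ns-sym spec-key]
  (into {}
        (for [[sym v] (ns-publics ns-sym)
              :let [section (get (meta v) spec-key)]
              :when section]
          [sym section])))

;; The tag is available on the var's metadata.
(:metadata-spec (meta #'link-property)) ;; => "5.1.2"
```

Calling (spec-tagged 'example.spec-meta :metadata-spec) returns only link-property, since spec-tagged itself carries no tag.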