Frictionless Data Julia Libraries - Design Document

Oleg Lavrovsky ~ @loleg

Last updated: November 20, 2017

Overview

Frictionless Data is a set of lightweight specifications, libraries and software improving the ways to get, share, and validate data.

This design document focuses on the functional specification and design of two code libraries written in Julia: "Table Schema" and "Data Package". The design follows the general design principles described at specs.frictionlessdata.io and in the V1 announcement (blog.okfn.org, hackmd.io).

Functional Specification

Each library needs to implement a set of core “actions” that are further described in the implementation documentation. For simplicity, these core actions are reproduced here, along with links to the corresponding unit test cases:

Table Schema

  • read and validate a Table Schema descriptor (test: schema.jl)
  • create/edit a Table Schema descriptor (test: schema.jl)
  • provide a model-type interface to interact with a descriptor (test: schema.jl)
  • infer a Table Schema descriptor from a supplied sample of data (test: infer.jl)
  • validate a data source against the Table Schema descriptor (test: validate.jl)
  • validate in response to editing the descriptor (test: changes.jl)
  • enable streaming and reading of a data source through a Table Schema (test: read.jl)
  • read a data source, casting values on iteration (test: schema.jl)
  • save a descriptor to disk (test: save.jl)
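
To make the "infer" action above more concrete, here is a minimal sketch in plain Julia, assuming the sample arrives as rows of raw strings. The helper names `guess_type` and `infer_descriptor` are hypothetical and are not part of the proposed API; only the type names ("string", "integer", "number", "boolean") come from the Table Schema specification.

```julia
# Hypothetical helper: guess a Table Schema type name for one column of raw strings.
# Empty cells are ignored; a column with no usable values falls back to "string".
function guess_type(values::Vector{String})
    nonempty = filter(!isempty, values)
    isempty(nonempty) && return "string"
    all(v -> tryparse(Int, v) !== nothing, nonempty) && return "integer"
    all(v -> tryparse(Float64, v) !== nothing, nonempty) && return "number"
    all(v -> lowercase(v) in ("true", "false"), nonempty) && return "boolean"
    return "string"
end

# Hypothetical helper: build a descriptor Dict from headers and sample rows.
function infer_descriptor(headers::Vector{String}, rows::Vector{Vector{String}})
    fields = Dict{String,Any}[]
    for (i, name) in enumerate(headers)
        column = String[row[i] for row in rows]
        push!(fields, Dict{String,Any}("name" => name, "type" => guess_type(column)))
    end
    return Dict{String,Any}("fields" => fields)
end

headers = ["id", "price", "name", "active"]
rows = [["1", "2.5", "apple", "true"], ["2", "3", "pear", "false"]]
descriptor = infer_descriptor(headers, rows)
```

The sketch tries the narrowest type first, so a column only becomes "number" when at least one value fails to parse as an integer. A real implementation would likely sample a bounded number of rows and cover more of the specification's types.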

Data Package

  • read an existing Data Package descriptor (test: read.jl)
  • validate an existing Data Package descriptor, including profile-specific validation via the registry of JSON Schemas
  • create a new Data Package descriptor
  • edit an existing Data Package descriptor
  • as part of editing a descriptor, helper methods to add and remove resources from the resources array
  • validate edits made to a data package descriptor
  • save a Data Package descriptor to a file path
  • zip a Data Package descriptor and its co-located references (more generically: "zip a data package")
  • read a zip file that "claims" to be a data package
  • save a zipped Data Package to disk
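
The resource helpers and edit validation listed above could look roughly like the following sketch, which treats a descriptor as a plain `Dict{String,Any}`. The function names `add_resource!`, `remove_resource!` and `descriptor_errors` are hypothetical, and the only rule checked is the specification's requirement that a Data Package has at least one resource.

```julia
# Hypothetical helper: append a resource descriptor, creating the array if needed.
function add_resource!(descriptor::Dict{String,Any}, resource::Dict{String,Any})
    resources = get!(descriptor, "resources", Dict{String,Any}[])
    push!(resources, resource)
    return descriptor
end

# Hypothetical helper: remove every resource with the given name.
# Returns the number of resources removed.
function remove_resource!(descriptor::Dict{String,Any}, name::AbstractString)
    resources = get(descriptor, "resources", nothing)
    resources === nothing && return 0
    before = length(resources)
    filter!(r -> get(r, "name", nothing) != name, resources)
    return before - length(resources)
end

# Minimal structural check, meant to be re-run after each edit.
function descriptor_errors(descriptor::Dict{String,Any})
    errors = String[]
    resources = get(descriptor, "resources", nothing)
    if !(resources isa AbstractVector) || isempty(resources)
        push!(errors, "descriptor must contain at least one resource")
    end
    return errors
end

pkg = Dict{String,Any}("name" => "example")
add_resource!(pkg, Dict{String,Any}("name" => "cities", "path" => "cities.csv"))
```

Returning a list of error messages instead of throwing keeps the helper usable while a descriptor is only partially edited. Profile-specific validation against the JSON Schema registry is outside the scope of this sketch.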

API Proposal

Package names should be short, match the base name of their source directory, and use CamelCase, as per the conventions described in Julia's Manual on Packages.

We will have two central types within the project: Schema and Table. These will allow us to have constructions like Schema.infer(), which are desirable for readability.

This first design proposal follows the basic usages described in tableschema-py, tableschema-js and tableschema-go.

The Schema() type constructor accepts a stream (file I/O), string (JSON) or dictionary (parsed object) representation of a table schema:

# each constructor returns a Schema; problems are reported through schema.errors
Schema(dictionary::Dict)
Schema(filename::String)
Schema(stream::IO)
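
One way to picture these constructors is through Julia's multiple dispatch, as in the sketch below. It uses a hypothetical `SketchSchema` type so that it does not claim to be the library's implementation, and it collects problems in an `errors` field, as the usage example does. Because Julia's Base has no JSON parser, the stream method takes a caller-supplied parsing function, which is an assumption of this sketch only.

```julia
# Hypothetical stand-in for the proposed Schema type.
struct SketchSchema
    descriptor::Dict{String,Any}
    errors::Vector{String}
end

# Dictionary constructor: checks the two structural rules of the Table Schema
# specification used here (a "fields" array, and a "name" on every field).
function SketchSchema(descriptor::Dict)
    errors = String[]
    fields = get(descriptor, "fields", nothing)
    if !(fields isa AbstractVector)
        push!(errors, "descriptor must contain a \"fields\" array")
    else
        for (i, f) in enumerate(fields)
            if !(f isa AbstractDict) || !haskey(f, "name")
                push!(errors, "field $i is missing a \"name\"")
            end
        end
    end
    return SketchSchema(Dict{String,Any}(descriptor), errors)
end

# Stream constructor: reads all text, then delegates to the Dict method.
# `parse_json` is any function that turns JSON text into a Dict.
SketchSchema(stream::IO, parse_json::Function) =
    SketchSchema(parse_json(read(stream, String)))

ok = SketchSchema(Dict("fields" => [Dict("name" => "id", "type" => "integer")]))
bad = SketchSchema(Dict("title" => "no fields"))
```

Here `ok.errors` is empty and `bad.errors` holds one message, so callers can inspect the result instead of catching an exception.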

Table represents a table that is an instance of the schema, and is validated by it.

Field represents a single field descriptor in the schema, corresponding to a column in a table.

Usage

For an example usage sequence, please see runtests.jl in the test subfolder. Tables and schemas can be loaded as follows:

using TableSchema

# read a Table Schema from a JSON file:
filestream = open("schema.json")
schema = Schema(filestream)

# errors is empty, or holds an error summary
# (assuming errors is a collection):
err = schema.errors

# read a Table from a CSV file:
filestream = open("data.csv")
table = Table(filestream)
rows = table.read()

# save the Schema back to a file
if isempty(table.errors) && table.schema.valid
  table.schema.save("data_schema.json")
end
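
The "cast on iteration" action from the functional specification can be sketched with a lazy generator, so that each row is converted only when it is consumed. The names `cast_value` and `cast_rows` are hypothetical, and the sketch assumes rows of raw strings together with a list of Table Schema type names in column order.

```julia
# Hypothetical helper: cast one raw string according to a Table Schema type name.
# parse throws an error on malformed input; unknown types are kept as strings.
function cast_value(fieldtype::AbstractString, raw::AbstractString)
    fieldtype == "integer" && return parse(Int, raw)
    fieldtype == "number"  && return parse(Float64, raw)
    fieldtype == "boolean" && return parse(Bool, raw)
    return String(raw)
end

# Lazily cast rows: nothing is converted until the generator is iterated.
cast_rows(types, rows) = (map(cast_value, types, row) for row in rows)

types = ["integer", "number", "string"]
raw = [["1", "2.5", "apple"], ["2", "3", "pear"]]
for row in cast_rows(types, raw)
    println(row)
end
```

A generator keeps memory use flat for large sources, which fits the streaming goal listed above. A fuller version would probably record cast failures in an errors collection instead of throwing.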

Implementation