
2.2. Knowledge Graph Bootstrap


No data can be properly ingested into HADatAc unless its Knowledge Graph has been bootstrapped.

2.2.1. HADatAc Knowledge Graph (KG)

HADatAc's Knowledge Graph is composed of the content inside both the data repository (SOLR) and the metadata repository (Blazegraph).

2.2.2. Bootstrap without LabKey

The bootstrapping process, which is described in its entirety in this Section 2.2, erases all data and metadata content inside HADatAc (preserving only information about user accounts). No bootstrapping should occur on a HADatAc instance that has already been bootstrapped, unless the knowledge graph needs to be reconstructed from scratch.

The bootstrapping process consists of four steps if not using LabKey, or six steps if using LabKey, as described in Section 2.2.3:

Step 1. Erase existing files uploaded into HADatAc's processed and unprocessed folders, if any
Step 2. Erase any previous data/metadata
Step 3. Upload namespace list
Step 4. Upload ontologies

(Step 1) Erase Any Existing File on HADatAc

Action: Go to "HADatAc Home > Manage Data Files". Click the "Delete" button next to each file in both the processed folder and the unprocessed folder.

Verification: Still in "HADatAc Home > Manage Data Files", confirm that no files are listed under either the processed folder or the unprocessed folder.

(Step 2) Erase Data and Metadata Repositories

Clearing the repositories before beginning the ingestion of datasets is one way to ensure exact control over the contents of HADatAc's repositories.

Action: Clean the repositories. Go to "HADatAc Home > Repository Management". Click the following buttons, one at a time: "Clean Data", "Clean Data Acquisitions", and "Clean Metadata".

Verification: After cleaning, the Data Repository and the Metadata Repository should be empty except for the user graph content described in the note below.

Note: The "user graph content" should have some triples corresponding to knowledge about user accounts within HADatAc. There is no option in HADatAc itself to erase the user graph, because erasing user-related triples would prevent users from staying logged in or logging into HADatAc.
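If you prefer to double-check the cleaning programmatically, the hedged sketch below queries the triplestore over the standard SPARQL protocol and counts the triples in each named graph; after a successful clean, only the user graph should report any triples. The endpoint URL is an assumption (a common Blazegraph default); the exact address is in HADatAc's configuration file.

```python
# Hedged sketch: count triples per named graph after cleaning.
# Assumes Blazegraph is running in quads mode (named graphs) and that the
# endpoint below is correct -- check HADatAc's configuration file.
import requests

BLAZEGRAPH_SPARQL = "http://localhost:9999/blazegraph/sparql"  # assumed default

QUERY = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
"""

response = requests.get(
    BLAZEGRAPH_SPARQL,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

# After a successful clean, only the user graph should report any triples.
for binding in response.json()["results"]["bindings"]:
    print(binding["g"]["value"], binding["triples"]["value"])
```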

(Step 3) Upload namespace list

This configuration file lists all the namespaces used in HADatAc's knowledge base. The list is cached inside HADatAc while it is running, so HADatAc must be restarted for any change to this configuration file to take effect.

This list has two roles:

  • to indicate which ontologies should be loaded into HADatAc's knowledge graph
  • to indicate which namespaces may be used as a prefix during the execution of any SPARQL query against the repository

Every loaded ontology should be included in the list of SPARQL prefixes, but the content inside the knowledge graph may refer to terms from namespaces that are not necessarily loaded into the knowledge graph.

Each namespace entry has the following attributes:

  • definition (mandatory): it is of the form [abbreviation]=[reference_URI], e.g., xsd=http://www.w3.org/2001/XMLSchema#, where [abbreviation] is xsd and [reference_URI] is http://www.w3.org/2001/XMLSchema#
  • mime type (only for loaded ontologies): this attribute identifies the encoding format of the ontology to be loaded. If the ontology is encoded in RDF/XML, the value for this attribute should be application/rdf+xml. If the ontology is encoded in Turtle, the value should be text/turtle.
  • loading URL (only for loaded ontologies): this is a URL that should be resolvable on the web (it is suggested to test each URL in a browser, or with a small script such as the sketch below, to verify that it is resolvable).
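Testing each loading URL in a browser works, but a small script can check the whole list at once. The sketch below is only an illustration: the URLs are placeholders to be replaced with the loading URLs from your own namespace list.

```python
# Illustrative sketch: check that every loading URL in the namespace list
# resolves on the web. The URLs below are placeholders only.
import requests

loading_urls = [
    "http://www.w3.org/2001/XMLSchema#",      # placeholder example
    "http://www.w3.org/2000/01/rdf-schema#",  # placeholder example
]

for url in loading_urls:
    try:
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
        if status >= 400:  # some servers reject HEAD; retry with GET
            status = requests.get(url, timeout=10).status_code
        print(f"{url} -> HTTP {status}")
    except requests.RequestException as exc:
        print(f"{url} -> NOT RESOLVABLE ({exc})")
```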

(Step 4) Upload ontologies

IMPORTANT NOTE ABOUT UPLOADING ONTOLOGIES: This step should only be performed as part of the knowledge graph bootstrap process. Any attempt to upload ontologies as described here AFTER a knowledge graph has been bootstrapped will corrupt the existing knowledge graph and invalidate any study data that has already been ingested. See Section 6.2 to update an ontology that has already been uploaded.

HADatAc ontologies provide the concepts required for the framework to acquire and manage scientific data. These ontologies may be loaded directly from the web, or cached locally so that they can be reloaded later when connectivity may be unavailable.

Action: Go to "HADatAc Home > Manage Ontologies > Load Ontologies from the Web". Loading ontologies from the web also caches them locally. If you already have cached ontologies, you can instead use "Load cached ontologies" to load the content of these ontologies into the metadata repository.

Verification: verify the message about loading ontologies. Each ontology in the namespace list should result in some triples being added to the metadata repository (i.e., the triplestore). After loading the ontologies, the number of triples should increase to more than 15,000. The exact number of triples may vary, since the content of these ontologies, which are maintained by third-party entities, may change over time.

Alternative Verification: Go to the Blazegraph service (see the exact address in HADatAc's configuration file), and then to the query menu. Once there, type the following SPARQL query:

select ?s ?p ?o where { ?s ?p ?o . }

In the case of this example, you should get more than 15,000 triples as a result.
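The same verification can be scripted against the Blazegraph SPARQL endpoint using a COUNT aggregate instead of listing every triple. This is a hedged sketch: the endpoint URL is an assumption (a common Blazegraph default; use the address from HADatAc's configuration file), and the 15,000 threshold is the approximate figure quoted above.

```python
# Hedged sketch: verify the ontology load by counting all triples.
# The endpoint below is an assumed default -- use the address given in
# HADatAc's configuration file.
import requests

BLAZEGRAPH_SPARQL = "http://localhost:9999/blazegraph/sparql"  # assumed default

QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o . }"

response = requests.get(
    BLAZEGRAPH_SPARQL,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

count = int(response.json()["results"]["bindings"][0]["n"]["value"])
print(f"{count} triples in the metadata repository")
assert count > 15000, "ontology load appears incomplete"
```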

2.2.3. Bootstrap with LabKey

The six steps required to bootstrap a knowledge graph with LabKey consist of the four steps described in Section 2.2.2, followed by the two additional steps (i.e., Step 5 and Step 6) described below:

Step 5. Upload base ontology
Step 6. Upload knowledge graph

(Step 5) Upload Base Ontology (only if your Base Ontology is in LabKey)

Action: Go through the following steps:

  • Go to "HADatAc Home > Load Ontology from LabKey"
  • If you are logging into LabKey for the first time during the current session, HADatAc will ask for your LabKey username and password
  • Once you provide your credentials, select the "view" button of the specific LabKey folder that you are going to upload your data from
  • Press the "Batch Loading Concepts" button at the bottom of the page

Note that you have two options to upload the main ontology: (a) use LabKey to feed your main ontology ("Load Ontology from LabKey"), or (b) upload an ontology file itself using "Upload Additional Knowledge".

A brief note about LabKey: LabKey contains a copy of HADatAc's Knowledge Graph state encoded as a collection of tables. Each table is either a concept table or an instance list. The first column of each table is always hasUri. The second column is either rdfs:subClassOf or rdf:type (also written in RDF as 'a'). If the second column is the rdfs:subClassOf predicate, the table holds concepts; otherwise it holds instances. The complete set of concept tables represents the "ontology" that is loaded from LabKey, and the set of instance tables holds all of HADatAc's instances. The KG is the combination of the ontology, the instances, and the values from the data repository. The sketch below illustrates this mapping.
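To make the table-to-triple mapping concrete, here is an illustrative sketch showing how a concept-table row and an instance-list row would become triples. This is not HADatAc's actual converter, and all names and URIs in it are hypothetical.

```python
# Illustrative sketch (not HADatAc's converter): how LabKey table rows map
# to triples. All names and URIs are hypothetical.

def rows_to_triples(rows):
    """Map table rows to (subject, predicate, object) triples.

    The first column is always hasUri; the second column is either
    rdfs:subClassOf (a concept table) or rdf:type (an instance list).
    """
    triples = []
    for row in rows:
        predicate = "rdfs:subClassOf" if "rdfs:subClassOf" in row else "rdf:type"
        triples.append((row["hasUri"], predicate, row[predicate]))
    return triples

# A concept-table row declares a term as a subclass of another concept.
concept_rows = [{"hasUri": "ex:BloodSample", "rdfs:subClassOf": "ex:Sample"}]
# An instance-list row declares an entity as an instance of a concept.
instance_rows = [{"hasUri": "ex:sample-001", "rdf:type": "ex:BloodSample"}]

print(rows_to_triples(concept_rows))   # [('ex:BloodSample', 'rdfs:subClassOf', 'ex:Sample')]
print(rows_to_triples(instance_rows))  # [('ex:sample-001', 'rdf:type', 'ex:BloodSample')]
```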

Verification: verify the message generated after selecting "Batch Loading Concepts". It should say that the concepts were loaded successfully. If the loading fails, HADatAc will show the TTL file that was generated from LabKey and indicate the number of the line in that file that prevented content from being added into HADatAc. In that case, you need to locate and fix the content of your LabKey repository that corresponds to the error message.

(Step 6) Upload Knowledge Graph (only if your knowledge graph is in LabKey)

Action: Use "Load Knowledge from LabKey". HADatAc will ask for your LabKey username and password. Then click "view" on your specific LabKey folder and check the "Select All" checkbox.

Click the "Batch Loading Selected Instance Data" button.

Verification: a message like the following should appear: "Operation [load] complete -- check the results above to see if the parsing of the facts was successful."
