
2.2. Knowledge Graph Bootstrap


Data cannot be properly ingested into HADatAc unless its Knowledge Graph has been properly constructed.

HADatAc's Knowledge Graph is composed of the content of both the data repository (SOLR) and the metadata repository (Blazegraph). The bootstrapping process, described in full in this Section 2.2, erases all data and metadata content inside HADatAc (preserving only information about user accounts). Bootstrapping should not be performed on a HADatAc instance that has already been bootstrapped. The bootstrapping process consists of the following steps:

Step 1. Erase existing files uploaded into HADatAc, if any
Step 2. Erase any previous data/metadata
Step 3. Upload supporting ontologies
Step 4. Upload main project ontology
Step 5. Upload the "knowledge graph", which is composed of the instances related to the concepts added in Steps 3 and 4 above

(Step 1) Erase Any Existing File on HADatAc

Action: Go to "HADatAc Home > Manage Data Files". Select and Delete button next to each file in both processed folder and unprocessed folder

Verification: Still in "HADatAc Home > Manage Data Files", confirm that no files are listed under either the processed folder or the unprocessed folder.

(Step 2) Erase Data and Metadata Repositories

Clearing the repositories before beginning dataset ingestion is one way to ensure exact control over the contents of HADatAc's repositories.

Action: Clean the repositories. Go to "HADatAc Home > Repository Management". Select the following buttons, one at a time: "Clean Data", "Clean Data Acquisitions" and "Clean Metadata".

Verification: After cleaning, the Data Repository and the Metadata Repository should report the following:

Data Repository
Status: OPERATIONAL
Data Acquisitions: 0 documents. Data Content: 0 documents.

Metadata Repository
Status: OPERATIONAL
Metadata Content: 0 triples. User Graph Content: 0 triples.

Note: The "user graph content" should have some triples corresponding to knowledge about user accounts within HADatAc. There is no option in HADatAc itself to erase the user graph because erasing user-related triples would prevent one of staying logged or logging into HADatAc.

(Step 3) Upload Supporting Ontologies

HADatAc's associated (supporting) ontologies provide the concepts required for the framework to acquire and manage scientific data. These ontologies may be loaded directly from the web (which also caches them locally), or reloaded later from the local cache when web connectivity is unavailable.

Action: Go to "HADatAc Home > Upload Supporting Ontologies > Load Ontologies from the Web". By loading ontologies from the web you are also caching them locally. In case you already have cached ontologies, you can also used the "Load cached ontologies" to upload the content of these ontologies into the metadata repository.

Verification (a): Verify the message about loading the supporting ontologies. All of the ontologies should result in triples being added to the metadata repository (i.e., the triplestore). After loading the supporting ontologies, the number of triples should increase to more than 15,000. The exact number of triples may vary, because the content of these ontologies, which are maintained by third-party entities, may change over time.

Triples before [loadOntologies]: 0
Triples after [loadOntologies]: (more than 15,000 triples)

Verification (b): Alternatively, one can verify that the supporting ontologies were loaded by querying the triplestore. Go to the Blazegraph service (see the exact address in HADatAc's configuration file) and open the query menu. Once there, type the following SPARQL query:

select ?s ?p ?o where { ?s ?p ?o . }

In the case of this example, you should get more than 15,000 triples as the result.
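If you only want the total rather than a full listing of triples, a count-only variant of the same query can be used (a minimal sketch; Blazegraph supports standard SPARQL 1.1 aggregates):

select (count(*) as ?total) where { ?s ?p ?o . }

After the supporting ontologies are loaded, the returned count should again be above 15,000.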

(Step 4) Upload Domain Ontology

Action: Go through the following steps:

  • Go to "HADatAc Home > Load Ontology from Labkey"
  • If you are logging into LabKey for the first time during the current session, HADatAc will ask for your LabKey username and password
  • Once you provide your credentials, select the "view" button of the specific LabKey folder that you are going to upload your data from
  • Press the "Batch Loading Concepts" button at the bottom of the page

Note that you have two options for uploading the main ontology: (a) use LabKey to feed your main ontology ("Load Ontology from LabKey"), or (b) upload an ontology file directly by using "Upload Additional Knowledge".

A brief note about LabKey: LabKey contains a copy of the state of HADatAc's Knowledge Graph encoded as a collection of tables. Each table is either a concept table or an instance list. The first column of each table is always hasUri. The second column is either rdfs:subClassOf or rdf:type (also written as 'a' in RDF). If the second column is the rdfs:subClassOf predicate, the table is a concept table; otherwise it is an instance table. The complete set of concept tables represents the "ontology" that is loaded from LabKey. The set of instance tables holds all of HADatAc's instances. The Knowledge Graph is the combination of the ontology, the instances, and the values in the data repository.
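As an illustration, the URIs below are hypothetical (not taken from an actual project, and the exact TTL generated by HADatAc may differ), but they sketch how one row of a concept table and one row of an instance table translate into Turtle statements:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# row from a concept table (second column is rdfs:subClassOf)
<http://example.org/ont#BloodPressure> rdfs:subClassOf <http://example.org/ont#VitalSign> .

# row from an instance table (second column is rdf:type, i.e., 'a')
<http://example.org/kg#Sensor-001> rdf:type <http://example.org/ont#Instrument> .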

Verification: Verify the message generated after selecting "Batch Loading Concepts". It should say that the concepts were loaded successfully. If the loading fails, the message will identify the TTL file generated from LabKey and the number of the line in that file that prevented the content from being added to HADatAc. In that case, you need to locate and fix the content of your LabKey repository that corresponds to the error message.

(Step 5) Upload Knowledge Base

Action: Now you need to "Load Knowledge from LabKey". HADatAc will ask for your LabKey username and password. Then click "view" on your specific LabKey folder and click the "Select All" checkbox.

Click the "Batch Loading Selected Instance Data" button.

Verification: Verify the message generated after selecting "Batch Loading Selected Instance Data". It should read:

Msg: Operation [load] complete -- check the results above to see if the parsing of the facts was successful.
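As an additional spot check (a sketch along the same lines as the query in Step 3), you can query the triplestore for typed instances; after this step the results should include the instances loaded from your LabKey folder:

select ?instance ?type where { ?instance a ?type . }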
