
2.2. Knowledge Graph Bootstrap


No data can be properly ingested into HADatAc unless its Knowledge Graph is fully constructed.

HADatAc's Knowledge Graph is composed of the content of both the data repository (SOLR) and the metadata repository (Blazegraph). The bootstrapping process, which is described in full in this Section 2.2, erases all data and metadata content inside HADatAc (preserving only information about user accounts). No bootstrapping should occur on a HADatAc instance that has already been bootstrapped. The bootstrapping process consists of these steps:

Step 1. Erase existing files uploaded into HADatAc, if any
Step 2. Erase any previous data/metadata
Step 3. Upload ontologies
Step 4. Upload main project ontology
Step 5. Upload the "knowledge graph", which is composed of the instances related to the concepts added in Steps 3 and 4 above

(Step 1) Erase Any Existing Files on HADatAc

Action: Go to "HADatAc Home > Manage Data Files". Select the Delete button next to each file in both the processed folder and the unprocessed folder.

Verification: Still in "HADatAc Home > Manage Data Files", confirm that no files are listed under either the processed folder or the unprocessed folder.

(Step 2) Erase Data and Metadata Repositories

Clearing the repositories before beginning dataset ingestion is one way to ensure exact control over the contents of HADatAc's repositories.

Action: Clean the repositories. Go to "HADatAc Home > Repository Management". Select the following buttons, one at a time: "Clean Data", "Clean Data Acquisitions", and "Clean Metadata".

Verification: After cleaning, both the Data Repository and the Metadata Repository should show empty content, except for the user graph content described in the note below.

Note: The "user graph content" should retain some triples corresponding to knowledge about user accounts within HADatAc. There is no option in HADatAc itself to erase the user graph, because erasing user-related triples would prevent users from staying logged in or from logging into HADatAc.
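If you want to confirm what remains in the metadata repository after cleaning, you can query the Blazegraph service directly (its address is in HADatAc's configuration file). The following sketch is a standard SPARQL 1.1 aggregate query, not a HADatAc-specific command, and it assumes the store is configured with named graphs (quads), as HADatAc's use of a separate user graph implies. After cleaning, only the user graph should report a non-zero count:

# List every named graph and its triple count;
# after cleaning, only the user graph should remain populated
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g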

(Step 3) Upload Ontologies

HADatAc's ontologies provide the concepts required for the framework to acquire and manage scientific data. These ontologies may be loaded directly from the web, or cached locally so they can be reloaded later, when connectivity may be unavailable.

Action: Go to "HADatAc Home > Manage Ontologies > Load Ontologies from the Web". Loading ontologies from the web also caches them locally. If you already have cached ontologies, you can instead use the "Load cached ontologies" option to load the content of these ontologies into the metadata repository.

Verification: Verify the message about loading ontologies. Each ontology in the namespace list should result in triples being added to the metadata repository (i.e., the triplestore). After loading the ontologies, the total should increase to more than 15,000 triples. The exact number may vary, since the content of these ontologies, which are maintained by third-party entities, may change over time.

Alternative Verification: Go to the Blazegraph service (see the exact address in HADatAc's configuration file), and then to the query menu. Once there, type the following SPARQL query:

SELECT ?s ?p ?o WHERE { ?s ?p ?o . }

In the case of this example, you should get more than 15,000 triples as a result.
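Listing all triples in the browser can be slow. If you only need the total, the following standard SPARQL 1.1 aggregate query returns it directly; after this step the count should exceed 15,000:

# Count all triples in the metadata repository instead of listing them
SELECT (COUNT(*) AS ?total) WHERE { ?s ?p ?o . }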

(Step 4) Upload Domain Ontology (only if your Domain Ontology is in LabKey)

Action: Go through the following steps:

  • Go to "HADatAc Home > Load Ontology from LabKey"
  • If you are logging into LabKey for the first time during the current session, HADatAc will ask for your LabKey username and password
  • Once you provide your credentials, select the "view" button of the specific LabKey folder from which you are going to upload your data
  • Press the "Batch Loading Concepts" button at the bottom of the page

Note that you have two options for uploading the main ontology: (a) use LabKey to feed your main ontology ("Load Ontology from LabKey"), or (b) upload an ontology file directly using "Upload Additional Knowledge".

A brief note about LabKey: LabKey contains a copy of HADatAc's Knowledge Graph state encoded as a collection of tables. Each table is either a concept table or an instance list. The first column of each table is always hasUri. The second column is either rdfs:subClassOf or rdf:type (also represented in RDF by the shorthand 'a'). If the second column is the rdfs:subClassOf predicate, the table defines concepts; otherwise it lists instances. The complete set of concept tables represents the "ontology" that is loaded from LabKey. The set of instance tables contains all the instances of HADatAc. The Knowledge Graph is the combination of the ontology, the instances, and the values from the data repository.
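As an illustration, here is the kind of Turtle (TTL) content that a concept-table row and an instance-table row translate to. The prefix and URIs below are invented for this example; actual URIs come from your LabKey tables:

@prefix ex:   <http://example.org/kb#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Concept-table row: hasUri = ex:Thermometer, second column = rdfs:subClassOf
ex:Thermometer rdfs:subClassOf ex:Instrument .

# Instance-table row: hasUri = ex:Thermometer-001, second column = rdf:type (written 'a')
ex:Thermometer-001 a ex:Thermometer .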

Verification: Verify the message generated after selecting "Batch Loading Concepts". It should say that the concepts were loaded successfully. If the loading fails, the message will indicate the number of the line, in the TTL file generated from LabKey, that prevented content from being added to HADatAc. In that case, you need to locate and fix the content of your LabKey repository that corresponds to the error message.

(Step 5) Upload Knowledge Base (only if your knowledge graph is in LabKey)

Action: Now you need to "Load Knowledge from LabKey". HADatAc will ask for your LabKey username and password. Then click "view" on your specific LabKey folder and check the "Select All" checkbox.

Click the "Batch Loading Selected Instance Data" button.

Verification: HADatAc should display the message "Operation [load] complete -- check the results above to see if the parsing of the facts was successful."
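As an additional check, you can sample the loaded instances in the Blazegraph query interface. This is a plain SPARQL query using the 'a' shorthand for rdf:type described in Step 4; which instances and types appear depends on your LabKey content:

# Sample a few of the newly loaded instances together with their types
SELECT ?instance ?type
WHERE { ?instance a ?type . }
LIMIT 20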
