
4.2. The Human-Aware Science Ontology (HAScO)


4.2.1. Scientific Activities

HAScO takes the view that "science is organized knowledge" and recognizes that many events may be required to acquire and organize knowledge, including learning new knowledge from the acquisition and analysis of data. One of HAScO's goals is to support the identification and categorization of these events, viewed as scientific activities, along with supporting a deeper representation of the events and their interdependencies.

Figure 1 shows the three essential scientific activities defined in HAScO: Study, DataAcquisition, and Deployment. HAScO scientific activities are defined as subclasses of W3C PROV's Activity. That means that they are "something that occurs over a period of time and acts upon or with entities; that may include consuming, processing, transforming, modifying, relocating, using, or generating entities." The exact nature of the entities depends on the kind of scientific activity, as described in the subsections below.

Figure 1: Data Acquisitions
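As a concrete, hedged illustration of this subclass relationship, the sketch below uses Python with rdflib to declare the three scientific activities as subclasses of prov:Activity. The hasco namespace IRI shown is an assumption for illustration and may differ from the IRI used by a given HADatAc installation.

```python
# Minimal sketch (assumed hasco namespace IRI): the three HAScO scientific
# activities declared as subclasses of W3C PROV's Activity.
from rdflib import Graph, Namespace, RDFS

HASCO = Namespace("http://hadatac.org/ont/hasco/")  # assumed namespace IRI
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("hasco", HASCO)
g.bind("prov", PROV)

for activity in ("Study", "DataAcquisition", "Deployment"):
    # Each HAScO scientific activity "occurs over a period of time and acts
    # upon or with entities", i.e., it is a prov:Activity.
    g.add((HASCO[activity], RDFS.subClassOf, PROV.Activity))

print(g.serialize(format="turtle"))
```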

Studies
Processes of acquiring and organizing knowledge are central to scientific advancement, and they may range from the short-term performance of a single scientific activity to the long-term performance of many complex scientific activities. From a practical point of view, these activities need specific goals, a well-defined plan, and many other components, such as a leader and a line of funding. No less important is for those activities to have a name that identifies them as a Study.

Figure 2: Studies

HAScO does not have the specific goal of fully defining a scientific study. Instead, HAScO aims to identify how data are directly related to named studies, how data support the process of achieving study goals, and, increasingly relevant as we further improve the process of acquiring knowledge from data, how data originally acquired in support of one study may be reused in other studies, i.e., data re-purposing.

Data Acquisitions
A DataAcquisition is both an event of using an instrument to acquire data and the overall collection of all data values acquired by the instrument during its deployment, where all the values belonging to the collection have exactly the same quality. HAScO defines data quality in terms of the entire set of configurations and properties of the instrument and corresponding deployment that were used to acquire the data. For instance, instrument properties are key to defining data accuracy and resolution, while instrument/deployment configuration parameters are key to the precision with which the data are acquired.

HAScO's DataAcquisition, as a data collection, may correspond to named things including "datasets", "execution results", and "data files", while as an event it may correspond to events such as "runs" and "batches." This exemplifies one of the nuances related to the use of the term, since a dataset may be a data collection composed of the data from multiple data acquisitions, while a data file may be just a small part of a much larger data acquisition.

In Figure 3, we see that a study, in terms of data, is characterized by its associated collection of data acquisitions. For instance, one may ask "what are the variables of a study?", "who are the subjects of a study?" or "what are the samples used in or collected for a study?" According to HAScO, these questions can be translated into "what are the variables of the data acquisitions of a study?" and so on, as sketched in the query below. This flexibility of defining the properties of a study through the inheritance of properties from its data acquisitions reflects the dynamic nature of studies, which are allowed to change as they evolve and new discoveries are made. This is particularly important for longitudinal studies, for example.
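The sketch below illustrates this translation with a SPARQL query embedded in Python. The predicate names hasco:isDataAcquisitionOf and hasco:hasVariable, as well as the study IRI, are illustrative assumptions rather than the authoritative HAScO vocabulary.

```python
# Minimal sketch: answering "what are the variables of a study?" by going
# through the study's data acquisitions. The predicates and the study IRI
# below are illustrative assumptions, not authoritative HAScO terms.
from rdflib import Graph

QUERY = """
PREFIX hasco: <http://hadatac.org/ont/hasco/>

SELECT DISTINCT ?variable WHERE {
    ?dataAcquisition hasco:isDataAcquisitionOf <http://example.org/study/STUDY-001> ;
                     hasco:hasVariable ?variable .
}
"""

g = Graph()
# g.parse("knowledge-graph.ttl")  # load the knowledge graph before querying
for row in g.query(QUERY):
    print(row.variable)
```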

Figure 3: Deployments

As shown in Figure 3, each data acquisition is associated with a single deployment. Data acquisitions only exist in the context of deployments. This means that the start date/time of a data acquisition cannot occur before the start date/time of its associated deployment, and the ending date/time of a deployment is also the ending date/time of any open data acquisition associated with the deployment.
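These temporal constraints can be made concrete with a small check. The following is a minimal Python sketch under assumed class and field names; it is not part of the HADatAc code base.

```python
# Minimal sketch (assumed names) of the temporal rule: a data acquisition
# only exists within the time bounds of its associated deployment.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Deployment:
    start: datetime
    end: Optional[datetime] = None   # no end time => the deployment is ongoing

@dataclass
class DataAcquisition:
    start: datetime
    end: Optional[datetime] = None   # no end time => the acquisition is still open

def starts_consistently(acq: DataAcquisition, dep: Deployment) -> bool:
    """A data acquisition cannot start before its deployment starts."""
    return acq.start >= dep.start

def end_deployment(dep: Deployment, acq: DataAcquisition, when: datetime) -> None:
    """Ending a deployment also ends any open data acquisition associated with it."""
    dep.end = when
    if acq.end is None:
        acq.end = when
```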

Deployments
A Deployment is the placement of an instrument on a platform so that it is ready to acquire data. More than one instrument can be placed on a single platform at any given time, meaning that many deployments may be active on a single platform simultaneously. A deployment must have a start time and may have a stop time. If a deployment has no end time, it is assumed to be ongoing.

In the specific case of sensor networks, we observe that detachable detectors are connected to instruments at the same time that instruments are deployed on platforms. It is intentional that HAScO does not have a concept called "Sensor", since the term is sometimes used to refer to detectors, at other times to instruments, and very often to a combination of detectors and instruments. Omitting this often overloaded term allows HAScO users to decrease ambiguity in the usage of sensor-related terms.

4.2.2. Instrument As A Unification Factor

HAScO's Instrument is a concept imported from the VSTO ontology and is the key building block of sensor networks. Figure 4 shows a specialization of Instrument into Questionnaire, PhysicalInstrument, and Model, where PhysicalInstrument is the concept currently used as the sensor network's building block.

Figure 4: Instruments.

The use of non-physical instruments along with deployments and data acquisitions has shown that HAScO's generalization of questionnaires and models as instruments enables a uniform characterization of data, in the sense that each data point, regardless of its provenance, was acquired by an instrument deployed on a platform, and that the quality of the data is defined by instrument/deployment configurations and settings.
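A minimal sketch of this uniform characterization is given below in Python; all class and field names are illustrative assumptions, and the point is only that every data value, whatever its origin, traces back to an instrument and its deployment on a platform.

```python
# Illustrative sketch (assumed names): every data point, whether it comes from
# a physical instrument, a questionnaire, or a computer model, traces back to
# an instrument deployed on a platform.
from dataclasses import dataclass
from typing import Union

@dataclass
class Instrument:
    label: str
    kind: str                # e.g. "PhysicalInstrument", "Questionnaire", "Model"

@dataclass
class Deployment:
    instrument: Instrument
    platform: str            # e.g. a weather station, or a subject acting as a platform

@dataclass
class DataPoint:
    value: Union[float, str]
    deployment: Deployment   # quality is defined by instrument/deployment settings

reading = DataPoint(21.4, Deployment(Instrument("thermometer-01", "PhysicalInstrument"),
                                     "weather-station-A"))
answer = DataPoint("non-smoker", Deployment(Instrument("smoking-questionnaire", "Questionnaire"),
                                            "subject-042"))
```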

Physical Instrument as Instrument
Measurements are values resulting from the use of sensing capabilities to quantify observable physical phenomena. Sensing capabilities are provided by a sensor network, which is composed of platforms, instruments, and detectors.

Figure 5: Sensing Perspectives.

Questionnaire as Instrument
Questionnaires are often viewed as instruments for eliciting human knowledge. Figure 3 shows that questionnaires are required to be "deployed" on a platform to be associated with a data acquisition. Defining samples as a subclass of vstoi:Platform is needed to identify that data are coming from subjects. The same approach holds when data are coming from any sample. In fact, Figure 6 shows that Sample is an Entity, and that Subject is a Sample, namely, a population sample.

Figure 6: Samples.

This use of samples as platforms is needed to state, for instance, that a specific elicited value, such as a subject's answer about smoking habits, comes from a subject that is also identified as a HAScO Subject. It is important to stress that a platform is sometimes a single sample, for example, when a data acquisition is concerned with data coming from a single sample, and sometimes a collection of samples such as a SampleCollection or a Cohort (i.e., a collection of Subjects).
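The subclass relations described for Figure 6 can be stated in a couple of triples. The rdflib sketch below is illustrative; the hasco namespace IRI is an assumption.

```python
# Minimal sketch (assumed namespace IRI) of the sample-related classes
# described for Figure 6: a Sample is an Entity, and a Subject is a Sample.
from rdflib import Graph, Namespace, RDFS

HASCO = Namespace("http://hadatac.org/ont/hasco/")  # assumed namespace IRI

g = Graph()
g.bind("hasco", HASCO)
g.add((HASCO.Sample, RDFS.subClassOf, HASCO.Entity))   # Sample is an Entity
g.add((HASCO.Subject, RDFS.subClassOf, HASCO.Sample))  # Subject is a (population) Sample
```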

Computer Model as Instrument
Once the data output from a computer model has been systematically compared against observational data under multiple conditions, the model is said to be validated, and it may be used to simulate situations that may be difficult to predict in real life or that may occur in the future. The values generated by these models are basically the same kind of output that one would expect to see from a physical instrument, and the parameters used to configure the model (including any kind of training data) are the ones defining the meaning of the output.

Uncertainty properties of model-generated data may be very different from those from physical instruments. However, the data is indeed expected to be semantically equivalent to the data measured by their corresponding physical instruments.

For HAScO, ComputerModels that are capable of generating data semantically equivalent to physical instruments are considered subclasses of vstoi:Instrument. Also, ComputerModels under development and/or under a validation process, although still incapable of matching the definition above, are also considered to be instruments.

4.2.3. Scientific Data Organization

Data Acquisition Schema

HAScO identifies two kinds of schemas considered in a study. One kind of schema, called a "data dictionary", is often provided along with a study's data to explain the meaning of the data to potential data users. The other kind of schema HAScO calls a "Data Acquisition Schema", since it is used to identify the portions of a dataset that are relevant to a study and how the content of these portions should be semantically represented. It is assumed that a study is comprised of at least one DataAcquisition and that each data acquisition is described by one DataAcquisitionSchema. It is important to note that data acquisitions are specified in the context of one study (although they may be reused in other studies), and that data acquisition schemas are study-independent and may be used by multiple data acquisitions from distinct studies.

Figure 7: Data Acquisition Schemas

Figure 7 shows that a data acquisition schema is comprised of a collection of DataAcquisitionSchemaAttributes, and that each schema attribute is associated with three classes: a subclass of hasco:Entity, a subclass of sio:Attribute, and a subclass of uo:Unit. With the characterization of these three classes, we can verify whether any two data points are semantically equivalent if, for instance, we can retrieve the data acquisitions of the points and compare the schema attributes of these two points. It is important to note that two points that are semantically equivalent according to the criteria above may still have very distinct quality properties as defined by the context of the data, that is, the data acquisition itself.
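A minimal sketch of this equivalence check, assuming illustrative class and field names and hypothetical CURIEs, is the comparison of the (entity, attribute, unit) annotation carried by the schema attributes behind two data points:

```python
# Minimal sketch (assumed names): two data points are semantically equivalent
# when their schema attributes point to the same Entity, Attribute, and Unit
# classes. Quality still depends on each point's own data acquisition.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataAcquisitionSchemaAttribute:
    entity: str     # IRI of a subclass of hasco:Entity
    attribute: str  # IRI of a subclass of sio:Attribute
    unit: str       # IRI of a subclass of uo:Unit

def semantically_equivalent(a: DataAcquisitionSchemaAttribute,
                            b: DataAcquisitionSchemaAttribute) -> bool:
    return (a.entity, a.attribute, a.unit) == (b.entity, b.attribute, b.unit)

# Hypothetical CURIEs, for illustration only.
attr_a = DataAcquisitionSchemaAttribute("ex:AirParcel", "ex:Temperature", "ex:DegreeCelsius")
attr_b = DataAcquisitionSchemaAttribute("ex:AirParcel", "ex:Temperature", "ex:DegreeCelsius")
assert semantically_equivalent(attr_a, attr_b)
```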

Data Acquisitions Over Time
Within a deployment, a triggering event establishes the beginning of a new data acquisition and marks the end of an ongoing data acquisition, if one exists within the deployment. A triggering event indicates a change in the deployment configuration, which may be a change in the instrument itself. Any configuration change during an ongoing deployment means that, within the deployment, data acquired before the change should only be compared or analyzed against data acquired after the event if the impact of the change event on the quality of the acquired data is clearly understood and taken into consideration.

As events that signal changes to configuration conditions, some types of triggering events can be identified (a minimal sketch of how a trigger is handled follows the list):

  • New deployment: the activity of deploying an instrument marks the initial data acquisition of that single deployment.
  • New standard operating procedure (SOP): SOPs are written criteria about how experiments and observations should be conducted to maintain uniformity and predictability. Any change to an SOP, although not interfering with the quality of the collected data itself, triggers a new data acquisition because that data is supposed to be used under new circumstances.
  • Staff personnel change: e.g., a change of the lab technician handling or monitoring any sensor network component.
  • Calibration procedure: changes to instrument configuration parameters made by calibration (automatic or not) may interfere with the data collected by the instrument, thus initiating a new data acquisition.
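The following is a minimal Python sketch, under assumed class and method names, of how a triggering event closes the ongoing data acquisition (if any) and starts a new one within the same deployment.

```python
# Minimal sketch (assumed names): within a deployment, a triggering event
# closes the ongoing data acquisition, if any, and opens a new one.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class DataAcquisition:
    start: datetime
    end: Optional[datetime] = None   # open until a trigger occurs or the deployment ends

@dataclass
class Deployment:
    start: datetime
    acquisitions: List[DataAcquisition] = field(default_factory=list)

    def on_trigger(self, when: datetime) -> DataAcquisition:
        """Handle a triggering event (new SOP, staff change, calibration, ...)."""
        if self.acquisitions and self.acquisitions[-1].end is None:
            self.acquisitions[-1].end = when      # close the ongoing acquisition
        new_acquisition = DataAcquisition(start=when)
        self.acquisitions.append(new_acquisition)
        return new_acquisition

deployment = Deployment(start=datetime(2019, 3, 1))
deployment.on_trigger(datetime(2019, 3, 1))  # new deployment: first data acquisition
deployment.on_trigger(datetime(2019, 3, 8))  # e.g., calibration: closes the first, opens a second
```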

4.2.4. HAScO Supporting Ontologies

HAScO's original annotation of data entities and attributes was based on The Extensible Observation Ontology (OBOE), which presents itself as a way to overcome difficulties in the tasks of discovering and accessing ecological data by providing a generic framework for describing the semantics of datasets consisting of observations and measurements. OBOE describes and relates scientific observations, measurements, entities, characteristics and measurement standards. The Entity, Characteristic and Measurement Standard classes are supposed to be extended to better specify each type of object, ultimately building a meaningful hierarchy of concepts that can be reused to annotate each scientific data point.

In 2015, after using OBOE for more than one year, the HAScO team decided to replace OBOE with the Semanticscience Integrated Ontology (SIO), which defines the types and relations currently used in HAScO for objects, processes and attributes, and therefore provides the integrated framework in which the ontology is rooted.

SIO is a self-contained ontology that provides support for information resources as well as for hypothetical, fictional, and imaginary entities, and it has seen increasing usage, particularly in biomedical settings. It is not integrated with other ontologies, but we were able to easily integrate it with the other ontologies used in HAScO. SIO is centered around descriptions of objects, processes, their attributes, and time.

Foundational Ontologies for Provenance and Sensing Networks
Provenance knowledge is crucial for scientific data, enabling one to understand and thus answer many important questions about the data: What are the data about? Were the data measured, elicited, or computed? Which instrument was used to acquire the data? What was the main reason for the data to be acquired? The PROV Ontology (PROV-O) is the W3C recommendation for encoding provenance on the web, and it is the foundation used by HAScO to support the capture of provenance knowledge.

HAScO's support for provenance is also tailored for science and for data quality. The Human-Aware Sensor Network Ontology (HASNetO) is a comprehensive alignment and integration of the VSTO-I sensing infrastructure and the PROV ontology. The HASNetO work that gave origin to HAScO precedes the W3C's Semantic Sensor Network Ontology (SSN). Although SSN is capable of describing sensor networks used to collect data, SSN does not rely on standard provenance approaches, such as the W3C's PROV, and thus is limited when attempting to describe human interventions in sensor networks. Furthermore, SSN is restricted to working with data acquired through measurements made in sensor networks, leaving the users of the ontology with the task of aligning the SSN metadata to the metadata of any scientific data that is not derived from sensor networks.

Foundational Ontologies for Instruments and Units
The Virtual Solar-Terrestrial Ontology - Instrument model (VSTO-I) is an ontology that contains concepts describing entities capable of collecting data (e.g., instruments, detectors and platforms) and activities related to these entities, such as the deployment of an instrument on a platform. The VSTO-I ontology's development was led by the National Center for Atmospheric Research's High Altitude Observatory in collaboration with McGuinness Associates, and it has been used and refined by a number of organizations, including the Woods Hole Oceanographic Institution's BCO-DMO data coordination effort.

To represent units of measure in HAScO, we use the Units Ontology (UO). UO is an ontology from the OBO Foundry that provides URIs, labels, definitions, and a hierarchy for all of the SI standard units of measure. UO units are commonly used with data aligned with either SIO or the OBO-Foundry ontologies, making it one of the more widely-adopted measurement unit ontologies available.

4.2.5. Generating Domain Ontology From Metadata Repository

4.2.5.1. Domain Ontology

Ontologies are used to build HADatAc's Knowledge Graph. These ontologies specify the vocabulary used to build the metadata specifications described in Section 5. The Domain Ontology is one of the ontologies loaded during the knowledge graph bootstrap.

The domain ontology is specifically designed for each area of research: each HADatAc installation is tailored for a given domain, namely the domain specified by the domain ontology loaded into that installation.

One of the most challenging aspects of the HADatAc domain ontology is that, in most cases, the ontology is not just a static document but an evolving vocabulary. For this reason, we are currently using LabKey, a Laboratory Information Management System (LIMS), to support the collaborative management of multiple tables. One of our goals is to move the tables currently under LabKey into something easier to install and use, such as Google Docs.

4.2.5.2. Generating Domain Ontology From Metadata Repository
