OS-Climate Data Commons Developer Guide

This developer guide is for data engineers, data scientists, and developers of the OS-Climate community who are looking to leverage the OS-Climate Data Commons to build data ingestion and processing pipelines, as well as AI/ML pipelines. It shows, step by step, how to configure your development environment, structure projects, and manage data and code in a way that complies with our Architecture Blueprint.

Need Help?

  • Outage / system failure: File a Linux Foundation (LF) outage ticket (note: select OS-Climate from the project list)
  • New infrastructure request (e.g. software upgrade): File an LF ticket (note: select OS-Climate from the project list)
  • General infrastructure support: Get help on the OS-Climate Slack Data Commons channel
  • Data Commons developer support: Get help on the OS-Climate Slack Developers channel

OS-Climate's Cluster Information

  • Cluster 1 (CL1): used for development and initial upgrades of applications
  • Cluster 2 (CL2): stable cluster; the sandbox UI and released versions of tools are available from this cluster
  • Cluster 3 (CL3): administrative cluster, managed by Red Hat and the Linux Foundation IT organization
  • Cluster 4 (CL4): latest implementation of Red Hat's Data Mesh pattern (under construction); follows the Open Data Hub Data Mesh pattern

Tools

Pipeline development leverages a number of tools provided by the Data Commons. The table below provides an overview of the key technologies involved, as well as links to their development instances:

| Technology | Description | Link |
| --- | --- | --- |
| GitHub | Version control tool used to maintain the pipelines as code | OS-Climate GitHub |
| GitHub Projects | Project tracking tool that integrates issues and pull requests | Data Commons Project Board |
| JupyterHub | Self-service environment for Jupyter notebooks used to develop pipelines | JupyterHub Development Instance |
| Kubeflow Pipelines | MLOps tool to support model development, training, serving and automated machine learning | |
| Trino | Distributed SQL query engine for big data, used for data ingestion and distributed queries (see the connection sketch after this table) | Trino Console |
| CloudBeaver | Web-based database GUI tool providing a rich web interface to Trino | CloudBeaver Development Instance |
| Pachyderm | Data-driven pipeline management tool for machine learning, providing version control for data | |
| dbt | SQL-based data transformation tool providing git-enabled version control of data transformation pipelines | |
| Great Expectations | Data quality tool providing git-enabled data quality pipelines management | |
| OpenMetadata | Centralized metadata store providing data discovery, data collaboration, metadata versioning and data lineage | OpenMetadata Development Instance |
| Airflow | Workflow management platform for data engineering pipelines | Airflow Development Instance |
| Apache Superset | Data exploration and visualization platform | Superset Development Instance |
| Grafana | Analytics and interactive visualization platform | Grafana Development Instance |
| INCEpTION | Text-annotation environment, primarily used by OS-C for machine learning-based data extraction | INCEpTION Development Instance |
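Pipelines and notebooks typically reach Trino through the Trino Python client. The snippet below is only an illustrative sketch: the hostname, catalog, schema, table, and authentication method are placeholders, not the actual values of the Data Commons instance, so substitute the settings of the cluster you are targeting.

```python
import trino

# Connect to a Trino instance (host, catalog and schema below are
# placeholders -- use the values for your cluster and environment).
conn = trino.dbapi.connect(
    host="trino.example.os-climate.org",   # hypothetical host
    port=443,
    http_scheme="https",
    auth=trino.auth.JWTAuthentication("YOUR_ACCESS_TOKEN"),  # one possible auth method
    user="your-username",
    catalog="osc_datacommons_dev",          # hypothetical catalog
    schema="demo",                          # hypothetical schema
)

# Run a simple distributed query and fetch the results.
cur = conn.cursor()
cur.execute("SELECT * FROM my_table LIMIT 10")
print(cur.fetchall())
```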

GitOps for reproducibility, portability, traceability with AI support

Nowadays, developers (including data scientists) use Git and GitOps practices to store and share code on development platforms such as GitHub. GitOps best practices allow for reproducibility and traceability in projects. For this reason, we have decided to adopt a GitOps approach to managing the platform, the data pipeline code, and the data and related artifacts.

One of the most important requirements to ensure data quality through reproducibility is dependency management. Having dependencies clearly managed in audited configuration artifacts allows portability of notebooks, so they can be shared safely with others and reused in other projects.
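As a concrete illustration (assuming a Python notebook project whose environment is managed with pipenv, which is one common choice rather than a requirement), dependencies can be declared in a configuration artifact that is versioned alongside the notebooks:

```toml
# Pipfile -- committed with the notebooks; the generated Pipfile.lock pins
# the exact, audited package versions that make the environment reproducible.
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pandas = "*"   # resolved and pinned in Pipfile.lock
trino = "*"

[requires]
python_version = "3.9"
```

Because both the Pipfile and its lock file live in Git, anyone cloning the repository can rebuild the same environment, which is what makes the notebooks portable and their results reviewable.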

Project templates

We use two project templates as starting points for new repositories:

Used together, these templates address both data scientists' needs (e.g. notebooks, models) and data engineers' needs (e.g. data and metadata pipelines). Having structure in a project ensures all the pieces required for the Data and MLOps lifecycles are present and easily discoverable.

Tutorial Steps

  1. Prerequisites

ML Lifecycle / Source Lifecycle

  1. Set up your initial environment

  2. Explore notebooks and manage dependencies

  3. Push changes to GitHub

  4. Set up pipelines to create releases, build images, and enable dependency management

DataOps Lifecycle

  1. Data Ingestion Pipeline Overview (a rough extract-and-load sketch follows this list)

  2. Data Extraction

  3. Data Loading

  4. Data Transformation

  5. Metadata Management
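The tutorials above walk through each DataOps step in detail. As a rough, generic illustration of the extract-and-load pattern (not the exact code from the tutorials), a notebook might stage a source file into the data lake as follows; the file name, bucket, and object prefix are placeholders.

```python
import boto3
import pandas as pd

# Extract: read a (hypothetical) source file into a DataFrame.
df = pd.read_csv("emissions_sample.csv")

# Light transformation: normalise column names so they are SQL-friendly.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Load: write Parquet (requires pyarrow or fastparquet) and upload it to the
# S3 bucket backing the data lake (bucket and prefix are placeholders).
df.to_parquet("emissions_sample.parquet", index=False)
s3 = boto3.client("s3")
s3.upload_file(
    "emissions_sample.parquet",
    "my-datacommons-bucket",
    "raw/emissions/emissions_sample.parquet",
)
```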

ModelOps Lifecycle

  1. ModelOps Lifecycle Overview

  2. Set up and deploy the inference application

  3. Test the deployed inference application (a request sketch follows this list)

  4. Monitor your inference application
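Once an inference application is deployed, a quick smoke test from a notebook or script can confirm that it answers requests. The snippet below is only a generic sketch assuming an HTTP endpoint that accepts JSON; the actual route and payload schema come from the deployment tutorial above.

```python
import requests

# Send a test payload to the deployed inference endpoint (URL and payload
# shape are placeholders -- use the route and schema of your deployment).
url = "https://my-inference-app.example.os-climate.org/predict"
payload = {"instances": [[0.12, 0.34, 0.56]]}

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```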