
EBISPOT/GrEBI


GrEBI (Graphs@EBI)

HPC pipeline to integrate knowledge graphs from EMBL-EBI resources, the MONARCH Initiative KG, ROBOKOP, Ubergraph, and other sources into giant (multi-terabyte) materialised, clique-merged Neo4j+Solr+RocksDB databases.

| Datasource | Loaded from |
| --- | --- |
| IMPC | EBI |
| GWAS Catalog | EBI |
| OLS | EBI |
| OpenTargets | EBI |
| Metabolights | EBI |
| ChEMBL | EBI |
| Reactome | EBI, MONARCH |
| BGee | MONARCH |
| BioGrid | MONARCH |
| Gene Ontology (GO) Annotation Database | MONARCH |
| HGNC (HUGO Gene Nomenclature Committee) | MONARCH |
| Human Phenotype Ontology Annotations (HPOA) | MONARCH |
| NCBI Gene | MONARCH |
| PHENIO | MONARCH |
| PomBase | MONARCH |
| ZFIN | MONARCH |
| Protein ANalysis THrough Evolutionary Relationships (PANTHER) | MONARCH, ROBOKOP |
| STRING | MONARCH, ROBOKOP |
| Comparative Toxicogenomics Database (CTD) | MONARCH, ROBOKOP |
| Alliance of Genome Resources | MONARCH, ROBOKOP |
| BINDING | ROBOKOP |
| CAM KG | ROBOKOP |
| The Comparative Toxicogenomics Database (CTD) | ROBOKOP |
| Drug Central | ROBOKOP |
| The Alliance of Genome Resources | ROBOKOP |
| The Genotype-Tissue Expression (GTEx) portal | ROBOKOP |
| Guide to Pharmacology database (GtoPdb) | ROBOKOP |
| Hetionet | ROBOKOP |
| HMDB | ROBOKOP |
| Human GOA | ROBOKOP |
| Integrated Clinical and Environmental Exposures Service (ICEES) KG | ROBOKOP |
| IntAct | ROBOKOP |
| Protein ANalysis THrough Evolutionary Relationships (PANTHER) | ROBOKOP |
| Pharos | ROBOKOP |
| STRING | ROBOKOP |
| Text Mining Provider KG | ROBOKOP |
| Viral Proteome | ROBOKOP |
| AOPWiki | AOPWikiRDF |
| Ubergraph | |
| Human Reference Atlas KG | |

The resulting graphs can be downloaded from https://ftp.ebi.ac.uk/pub/databases/spot/kg/ebi/

Implementation

The pipeline is implemented as Rust programs with simple CLIs, orchestrated with Nextflow.

The primary output of the pipeline is a property graph for Neo4j. The input format (after ingest steps extract the data from KGX, RDF, and bespoke DB formats) is simple JSONL files, to which "bruteforce" integration is applied:

  • All strings that begin with any IRI or CURIE prefix from the Bioregistry are canonicalised to the standard CURIE form
  • All property values that are the identifier of another node in the graph become edges
  • Cliques of equivalent nodes are merged into single nodes
  • Cliques of equivalent properties are merged into single properties (and for ontology-defined properties, the qualified safe labels are used)
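
The first three steps above can be illustrated with a minimal, std-only Rust sketch. It is not the GrEBI implementation: the prefix table stands in for the Bioregistry, the `equivalent_to` property name, the `Node` struct, and the `canonicalise`/`integrate` functions are all hypothetical, and nodes are held in memory rather than read from JSONL. Merging of equivalent properties is not covered.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Hypothetical stand-in for the Bioregistry: (IRI or CURIE prefix, canonical CURIE prefix).
const PREFIXES: &[(&str, &str)] = &[
    ("http://example.org/disease_", "DIS:"),
    ("dis:", "DIS:"),
];
/// Hypothetical property name that declares two identifiers equivalent.
const EQUIV_PROP: &str = "equivalent_to";

/// Rewrite a string to canonical CURIE form if it starts with a known prefix.
fn canonicalise(value: &str) -> String {
    for (from, to) in PREFIXES {
        if let Some(local) = value.strip_prefix(*from) {
            return format!("{}{}", to, local);
        }
    }
    value.to_string()
}

/// Union-find over identifiers, used to group equivalent nodes into cliques.
#[derive(Default)]
struct Cliques {
    parent: HashMap<String, String>,
}

impl Cliques {
    fn find(&mut self, id: &str) -> String {
        let parent = self.parent.entry(id.to_string()).or_insert_with(|| id.to_string()).clone();
        if parent == id {
            return parent;
        }
        let root = self.find(&parent);
        self.parent.insert(id.to_string(), root.clone()); // path compression
        root
    }

    fn union(&mut self, a: &str, b: &str) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            // Keep the lexicographically smaller id as representative (deterministic output).
            let (keep, drop) = if ra < rb { (ra, rb) } else { (rb, ra) };
            self.parent.insert(drop, keep);
        }
    }
}

struct Node {
    id: String,
    props: Vec<(String, String)>, // (property name, value)
}

type Edge = (String, String, String); // (from, property, to)

/// Returns (clique representative -> member ids, edges between representatives).
fn integrate(nodes: &[Node]) -> (BTreeMap<String, BTreeSet<String>>, BTreeSet<Edge>) {
    let mut cliques = Cliques::default();
    // Pass 1: canonicalise ids and union identifiers linked by the equivalence property.
    for n in nodes {
        let id = canonicalise(&n.id);
        cliques.find(&id);
        for (p, v) in &n.props {
            if p == EQUIV_PROP {
                cliques.union(&id, &canonicalise(v));
            }
        }
    }
    let known: BTreeSet<String> = cliques.parent.keys().cloned().collect();
    // Pass 2: group identifiers by clique.
    let mut merged: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for id in &known {
        merged.entry(cliques.find(id)).or_default().insert(id.clone());
    }
    // Pass 3: property values that identify another node become edges.
    let mut edges = BTreeSet::new();
    for n in nodes {
        let from = cliques.find(&canonicalise(&n.id));
        for (p, v) in &n.props {
            let v = canonicalise(v);
            if p != EQUIV_PROP && known.contains(&v) {
                edges.insert((from.clone(), p.clone(), cliques.find(&v)));
            }
        }
    }
    (merged, edges)
}

fn main() {
    let nodes = vec![
        Node { id: "http://example.org/disease_1".into(),
               props: vec![("equivalent_to".into(), "ALT:9".into()),
                           ("label".into(), "some disease".into())] },
        Node { id: "ALT:9".into(), props: vec![] },
        Node { id: "GENE:7".into(), props: vec![("associated_with".into(), "dis:1".into())] },
    ];
    let (merged, edges) = integrate(&nodes);
    println!("{:?}", merged);
    println!("{:?}", edges);
}
```

In this toy input, `DIS:1` and `ALT:9` collapse into one clique, and the `associated_with` value `dis:1` is canonicalised, recognised as a node identifier, and emitted as an edge to the clique representative. The `label` value matches no identifier, so it stays a plain property.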

In addition to Neo4j, the nodes and edges are loaded into Solr for full-text search and RocksDB for id->object resolution.