This repository contains all of our deployment and orchestration configuration.
At TuringML we strive to take advantage of existing open source projects and to create customisable pipelines for data engineering, data science, and business intelligence.
This repo should provide a single-click deployment, with all required settings, for every application we need at TuringML.
To run the application, you first need the Docker image, which lets you run all commands without installing dependencies locally.
Prerequisites:
- Clone the repository
- Start your Docker environment:

$ make build

or

$ make pull

Either command gets the container needed to manage the infrastructure.

$ make run

runs the application. Here it is important to set the following environment variables:

- CLIENT_ID

Example:

$ CLIENT_ID=client_name make run
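As a minimal sketch of the behaviour the `run` target is expected to have (this is an assumption for illustration, not the actual Makefile contents), a wrapper should fail fast when `CLIENT_ID` is unset:

```shell
# Hypothetical helper mirroring the CLIENT_ID check that `make run`
# presumably performs; the message and function name are illustrative.
require_client_id() {
  if [ -z "${CLIENT_ID:-}" ]; then
    echo "CLIENT_ID must be set, e.g. CLIENT_ID=client_name make run" >&2
    return 1
  fi
  echo "running for client: ${CLIENT_ID}"
}
```

Failing early keeps a missing client name from silently producing infrastructure under the wrong identifier.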
- Create / update infrastructure:

$ ansible-playbook playbook.yaml

spins up the whole infrastructure or updates it accordingly.

- Delete infrastructure:

$ ansible-playbook playbook.yaml -e state=absent

removes the whole infrastructure.
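The two invocations differ only in the `state` extra variable, so a single hypothetical wrapper (not part of this repo; shown as a dry run that prints the command instead of executing it) can cover both:

```shell
# Hypothetical wrapper: one entry point for create/update/delete,
# mapping a state argument to the ansible-playbook invocations above.
# It echoes the command (dry run) rather than running it.
infra() {
  state="${1:-present}"   # present = create/update, absent = delete
  case "$state" in
    present) echo "ansible-playbook playbook.yaml" ;;
    absent)  echo "ansible-playbook playbook.yaml -e state=absent" ;;
    *)       echo "unknown state: $state" >&2; return 1 ;;
  esac
}
```

Defaulting to `present` matches Ansible's usual convention, where create/update is the implicit state.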
Each subfolder contains a dedicated README with the installation instructions.
- Kubernetes
- NiFi
  - ZooKeeper
- Druid
  - ZooKeeper
  - EMR (AWS)
  - MySQL (RDS)
  - S3
- Kafka
  - ZooKeeper
- Superset
  - MySQL
  - Redis
- Heapster
  - Grafana / Cadvisor
- Vault
- API
- UI
- Flink
With these services we aim to provide the following functionality:
A collector allows a user to collect data from several sources with many different data types.
We would like to support two types of data:
- Event level data
- Relational data
The following sources are supported for event-level data:
- Kafka (NiFi)
- Kinesis (NiFi)
- Cloud storage
- GCS (NiFi)
- S3 (NiFi)
- ADL (NiFi)
- FTP (NiFi)
The following sources are supported for relational data:
- HTTP pull (API) (event based)
- HTTP push (NiFi)
- Databases (RDBMS / JDBC?) (? Druid? Auto-updated?)
- MongoDB
- Cloud storage
- GCS
- S3
- ADL
- FTP
// TODO: Check support for all database types with JDBC and Calcite/Avatica etc.
We would like to support the following data formats:
- JSON
- AVRO
- CSV

These formats apply only to the Cloud Storage / FTP and streaming services.
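To make the format support concrete, here is a minimal sketch that converts one CSV event row into the JSON shape the pipeline might ingest; the field names (`user_id`, `action`, `ts`) are assumptions for illustration, not a defined schema:

```shell
# Hypothetical converter: one CSV event row -> one JSON event object.
# Assumes a fixed, comma-separated field order: user_id,action,ts.
csv_to_json() {
  echo "$1" | awk -F',' \
    '{printf "{\"user_id\":%s,\"action\":\"%s\",\"ts\":\"%s\"}\n", $1, $2, $3}'
}
```

Avro, being a binary format with an embedded schema, would need a proper serialisation library rather than a one-liner like this.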
For enrichment, we support two types of enrichers:
- IP enrichment via MaxMind; other services may come later
- Lookup enrichment (HTTP / DB) with non-time-series data. With HTTP lookups we depend on an external request per event to enrich the information. (Most likely not in phase one)
All other connected data sources should store their data in Redis (check with NiFi) for easy lookup.
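The lookup-style enrichment can be sketched as a join of an event key against a cached reference table. In the pipeline this table would live in Redis or a MaxMind database; in this illustrative sketch a plain `key,value` file stands in, and the function name and output format are assumptions:

```shell
# Hypothetical lookup enrichment: match an event key (e.g. an IP)
# against a key,value reference table and attach the value.
enrich() {
  key="$1"; table="$2"
  value=$(awk -F',' -v k="$key" '$1 == k {print $2; exit}' "$table")
  if [ -n "$value" ]; then
    echo "${key} -> ${value}"
  else
    echo "${key} -> unknown"
  fi
}
```

The "unknown" fallback matters: an enricher should pass events through with a marker rather than drop them when the lookup misses.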
- (no Service for)
- (no Service for)
- (no Service for)
- (Druid / NiFi, Cloud Storage, RDBMS, etc...)
- Should we use Helm?