Build a pipeline for an open COVID-19 dataset. The dataset comes from corona-virus-report. It contains cumulative confirmed, recovered, and death figures by day, country, and province, with latitude and longitude coordinates added at the country level. The pipeline performs ETL on the raw data and produces a visualization of the worldwide spread.
The pipeline works as follows:
- Save the raw (messy) dataset to the covid-19-raw-data S3 bucket.
- Run the AWS Glue Crawler on covid-19-raw-data S3 bucket to parse JSONs and create the covid-19-raw-data table in the Glue Data Catalog.
- Run the Glue ETL Job on covid-19-raw-data table to:
- clean the data
- save the cleaned JSON output to the covid-19-output-data S3 bucket.
- Run the AWS Glue Crawler on covid-19-output-data S3 bucket to parse JSONs and create the covid-19-output-data table in the Glue Data Catalog.
- Query the covid-19-output-data table in Amazon Athena, remove duplicates, and create the final covid19_app_data_athena table in the Glue Data Catalog.
- Connect Apache Superset to the covid19_app_data_athena table and build the visualization dashboard.
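The steps above can be sketched end to end with boto3. The bucket and table names come from the list; the helper names, the `covid19` database, the result-staging prefix, and the cleaning rules are assumptions, not taken from the source:

```python
"""Sketch of the pipeline steps with boto3 (AWS calls need credentials)."""
import os

RAW_BUCKET = "covid-19-raw-data"
OUTPUT_BUCKET = "covid-19-output-data"

def raw_key(local_path, prefix="corona-virus-report"):
    """S3 key under which one raw dataset file is stored."""
    return f"{prefix}/{os.path.basename(local_path)}"

def upload_raw(local_path):
    """Step 1: save the raw dataset file to the raw-data bucket."""
    import boto3  # deferred import: only needed when actually uploading
    boto3.client("s3").upload_file(local_path, RAW_BUCKET, raw_key(local_path))

def start_crawler(name):
    """Steps 2 and 4: run a Glue crawler so the bucket's JSON schema
    lands in the Glue Data Catalog."""
    import boto3
    boto3.client("glue").start_crawler(Name=name)

def clean_record(rec):
    """Step 3 sketch: normalize one raw record. Field names follow the
    corona-virus-report dataset; the exact rules are assumptions."""
    return {
        "province": (rec.get("Province/State") or "").strip(),
        "country": rec["Country/Region"].strip(),
        "date": rec["Date"],
        "lat": float(rec["Lat"]),
        "long": float(rec["Long"]),
        "confirmed": int(rec.get("Confirmed") or 0),
        "recovered": int(rec.get("Recovered") or 0),
        "deaths": int(rec.get("Deaths") or 0),
    }

def dedupe_ctas_sql(source_table="covid-19-output-data",
                    target_table="covid19_app_data_athena"):
    """Step 5: Athena CTAS that writes the de-duplicated final table.
    The hyphenated source name must be double-quoted in Athena SQL."""
    return (f'CREATE TABLE {target_table} AS '
            f'SELECT DISTINCT * FROM "{source_table}"')

def run_athena_query(sql, database="covid19",
                     results=f"s3://{OUTPUT_BUCKET}/athena-results/"):
    """Submit a query to Athena and return the query execution id."""
    import boto3
    resp = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results},
    )
    return resp["QueryExecutionId"]
```

A run would then be: `upload_raw(...)`, `start_crawler("covid-19-raw-data")`, the Glue ETL job, `start_crawler("covid-19-output-data")`, and finally `run_athena_query(dedupe_ctas_sql())`.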
- Install and Configure Superset
- PyAthena / PyAthenaJDBC update for China
- Covid_19_ETL step-by-step guide
- The public data lake for analysis of COVID-19 data
- An example AWS public data lake for analyzing COVID-19 (in Chinese)
- Apache Superset LDAP authentication with Active Directory
- covid-19-end-to-end-analytics-with-aws-glue-athena-and-quicksight: a public data lake for analysis of COVID-19 data
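For the Superset connection step, Superset reaches Athena through PyAthena's `awsathena+rest` SQLAlchemy dialect. A minimal sketch of the database URI, where the region, schema, and staging bucket are placeholders and credentials are assumed to come from the environment:

```python
from urllib.parse import quote_plus

def athena_sqlalchemy_uri(region, schema, staging_dir):
    """Build the SQLAlchemy URI Superset uses for Athena via PyAthena.
    The s3_staging_dir value must be URL-encoded."""
    return (
        f"awsathena+rest://@athena.{region}.amazonaws.com:443/"
        f"{schema}?s3_staging_dir={quote_plus(staging_dir)}"
    )

# Placeholder values; paste the result into Superset's database form:
# athena_sqlalchemy_uri("us-east-1", "covid19",
#                       "s3://covid-19-output-data/athena-results/")
```

With that URI registered, the covid19_app_data_athena table becomes queryable as a Superset dataset for the dashboard.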