Data Engineering on a simulated song streaming application with Kafka, PySpark, dbt, S3, Redshift.
Streamify is a music streaming company that prides itself on user satisfaction. The intelligence of the application comes from a team of Data Techies who track, monitor, and curate playlists uniquely for each user, making it less likely that a user skips a track, because the app knows them that well.
Some common questions asked by the Business Intelligence team are:
- What song/genre does user A play the most?
- What artists are listened to the most by each user?
- At what time are these particular artists listened to?
- What are the most played songs per location?
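Questions like these reduce to simple aggregations over the listen-event stream. A minimal sketch in plain Python (the event field names `user_id`, `song`, and `artist` are assumptions about the eventsim schema, not the actual one):

```python
from collections import Counter

# Hypothetical listen events, shaped like the records eventsim emits
# (field names are assumptions for illustration).
events = [
    {"user_id": "A", "song": "Song X", "artist": "Artist 1"},
    {"user_id": "A", "song": "Song X", "artist": "Artist 1"},
    {"user_id": "A", "song": "Song Y", "artist": "Artist 2"},
    {"user_id": "B", "song": "Song Y", "artist": "Artist 2"},
]

def most_played_song(events, user_id):
    """Return the song a given user has played the most."""
    plays = Counter(e["song"] for e in events if e["user_id"] == user_id)
    song, _count = plays.most_common(1)[0]
    return song

print(most_played_song(events, "A"))  # → Song X
```

In the actual pipeline these aggregations live in dbt models over Redshift; the sketch above only shows the shape of the computation.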
The tools used to answer them:

- Terraform (Infrastructure as Code)
- Apache Kafka (Streaming)
- Apache Spark (Streamed data processor)
- Apache Airflow (Workflow management)
- AWS S3 (Data lake)
- dbt (Data Transformation)
- Amazon Redshift (Data Warehouse)
All data in the lake (AWS S3) is stored in CSV format.
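As a sketch, the CSV objects in the lake might be keyed by topic and date. The bucket name and key scheme below are assumptions for illustration, not this project's actual layout:

```python
from datetime import date

# Hypothetical lake layout: one prefix per Kafka topic, partitioned by date.
# Bucket name and key scheme are assumptions, not the project's actual structure.
BUCKET = "streamify-lake"

def csv_key(topic: str, day: date, batch: int) -> str:
    """Build an S3 key for one micro-batch CSV."""
    return f"{topic}/{day:%Y/%m/%d}/batch-{batch:05d}.csv"

print(f"s3://{BUCKET}/{csv_key('listen_events', date(2022, 3, 1), 7)}")
# → s3://streamify-lake/listen_events/2022/03/01/batch-00007.csv
```

Partitioning keys by date keeps downstream loads (e.g. a daily Airflow DAG copying into Redshift) cheap, since each run only lists one prefix.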
- Visit the Docker page to install Docker on your Mac, Windows, or Linux OS. Test the installation by running the hello-world image. If this works, you're good to go.
- Don't forget to log in on your terminal with `docker login` (or `sudo docker login`), then enter your username and password.
- Run the following command to power up Kafka:

```bash
cd kafka && docker compose build && docker compose up
```
- If everything builds well, you'll be able to view the Confluent Control Center UI in your browser at localhost:9021.
- Start eventsim to generate streaming events:

```bash
cd scripts && bash eventsim_startup.sh
```
- (optional) Run the following to follow the container logs:

```bash
docker logs --follow million_events
```

It may take a while for these topics to reflect in your UI, but once they do, you'll have about four topics altogether.
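Each of those topics carries JSON-encoded events. A minimal sketch of how one such record could be serialized for a producer (the field names, topic name, and the `kafka-python` calls in the trailing comment are assumptions, not the project's actual code):

```python
import json
import time

def serialize_event(user_id: str, song: str, artist: str) -> bytes:
    """Encode a listen event as UTF-8 JSON, the way a Kafka producer would.
    Field names are assumptions about the eventsim schema."""
    event = {
        "user_id": user_id,
        "song": song,
        "artist": artist,
        "ts": int(time.time() * 1000),  # event time in epoch millis
    }
    return json.dumps(event).encode("utf-8")

payload = serialize_event("A", "Song X", "Artist 1")

# With the broker from the compose file running, a payload like this could be
# sent using kafka-python (a hypothetical client choice for illustration):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("listen_events", payload)
```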
- Run the extraction script:

```bash
cd lake && python extraction.py
```

Spark reads data from the broker(s) every 120 seconds, and each read is saved as a new CSV file using Spark's default part-file naming. Watch Spark perform its magic ;)