
Medplot Data Mining Project


What is Medplot

Diabetes is one of the most common chronic diseases in many countries, with the United States leading the way. It is a severe chronic disease in which individuals lose the ability to regulate their blood glucose levels effectively. Complications such as heart disease, vision loss, lower-limb amputations, and kidney disease are associated with chronically high blood glucose levels in diabetics. The disease can therefore reduce both quality of life and life expectancy.

According to the Centers for Disease Control and Prevention (CDC), 34.2 million Americans had diabetes and 88 million had prediabetes in 2018. In addition, the CDC estimates that 1 in 5 diabetics and about 8 in 10 prediabetics are unaware of their risk. Diabetes also places a tremendous burden on the economy: the cost of diagnosed diabetes is approximately $327 billion per year, and the total cost of undiagnosed diabetes and prediabetes is nearly $400 billion per year.

The primary aim of the Medplot data mining project is to identify which groups of people are particularly predisposed to developing diabetes, so that prevention measures can be designed more effectively and targeted more precisely.

The Behavioral Risk Factor Surveillance System dataset

We use data from the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS was established in 1984 by the CDC, the national public health agency of the United States. The BRFSS collects data in all 50 U.S. states as well as in the District of Columbia and three U.S. territories.

Being the largest continuously conducted health survey system in the world, the BRFSS conducts more than 400,000 interviews per year to collect data about health-related risk behaviors, chronic health conditions, and the use of preventive services. Examples of the data collected include tobacco use, health care coverage, physical activity, and fruit and vegetable consumption. We use the dataset from 2015 for our analysis; it is provided as a CSV file. The data provided by the BRFSS includes 330 factors for 441,456 interviewed individuals. A small subset of the dataset can be viewed in the file brfss_dataset/2015_small.csv. The complete dataset can be found on Kaggle.
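As a minimal sketch, the small subset can be inspected with pandas; only the file path comes from this README, and no column names are assumed.

# Load the small BRFSS subset shipped with the repository; the full 2015.csv from
# Kaggle can be read the same way after downloading it into brfss_dataset/.
import pandas as pd

brfss_small = pd.read_csv("brfss_dataset/2015_small.csv")

print(brfss_small.shape)   # (number of interviewed individuals, number of factors)
print(brfss_small.head())  # first few rows of the survey data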

Overview of the different approaches

We implement different classification approaches to test which one performs best at identifying patients who have diabetes. For each of the following approaches we created a Jupyter notebook that contains the data analysis with the respective model. Functionality that is used in multiple notebooks is implemented separately and can be imported into the notebooks.

K-nearest neighbors

Nearest centroids

Naive Bayes

Decision Tree

Since decision trees work best with label encodings, the parameters were optimized for this encoding from the beginning. The model achieves a moderate overall performance: the recall of approximately 80% is relatively good, while the precision of approximately 30% is relatively poor.
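A minimal sketch of this approach with scikit-learn is shown below. It loads the data via get_preprocessed_brfss_dataset (described further down); the encoder step may be redundant if the preprocessing already label-encodes the features, and the hyperparameters are illustrative rather than the tuned values from the notebook.

# Illustrative decision-tree sketch; X may already be label-encoded by the preprocessing,
# and max_depth / class_weight are example values, not the tuned ones.
from preprocessing.preprocessing import get_preprocessed_brfss_dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score, precision_score

X, y = get_preprocessed_brfss_dataset()
X_encoded = OrdinalEncoder().fit_transform(X)  # label (ordinal) encoding of the survey answers

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("Recall:   ", recall_score(y_test, y_pred))     # roughly 0.8 in the notebook
print("Precision:", precision_score(y_test, y_pred))  # roughly 0.3 in the notebook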


Random forest

The performance of the random forest classifier is similar to the performance of the decision tree classifier and only slightly better.

XGBoost

The performance of the XGBoost classifier is similar to the performance of the decision tree and the random forest classifier. It seems that the tree-based classifiers cannot fully capture the problem.

Support Vector Machine

The SVM was implemented with a linear kernel function and evaluated using a train/validation/test split.
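A minimal sketch of this setup is shown below, using scikit-learn's LinearSVC as an efficient stand-in for a linear-kernel SVM on a dataset of this size; the split ratios, scaling step, and parameters are assumptions, not the notebook's exact configuration.

# Linear SVM sketch with a train/validation/test split; LinearSVC is used here for speed
# and is equivalent in spirit to an SVC with a linear kernel.
from preprocessing.preprocessing import get_preprocessed_brfss_dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import fbeta_score

X, y = get_preprocessed_brfss_dataset()

# First split off a test set, then carve a validation set out of the remaining data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

scaler = StandardScaler().fit(X_train)
svm = LinearSVC(C=1.0, class_weight="balanced", max_iter=5000)
svm.fit(scaler.transform(X_train), y_train)

# Tune hyperparameters such as C against the validation set, report the final test score.
print("Validation F2:", fbeta_score(y_val, svm.predict(scaler.transform(X_val)), beta=2))
print("Test F2:      ", fbeta_score(y_test, svm.predict(scaler.transform(X_test)), beta=2))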

Artificial neural network

The neural network uses an embedding layer for the categorical input values, followed by a dropout layer. For the numerical input values, batch normalization is applied first. After these first steps the input values are concatenated and passed through multiple linear layers.
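A minimal PyTorch sketch of the described architecture is given below; the embedding sizes, hidden-layer widths, and dropout rate are assumptions and not the values used in the notebook.

# Sketch of the described architecture: per-feature embeddings with dropout for the
# categorical inputs, batch normalization for the numerical inputs, then concatenation
# followed by several linear layers. All layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MedplotNet(nn.Module):
    def __init__(self, categorical_cardinalities, num_numerical, hidden_sizes=(128, 64)):
        super().__init__()
        # One embedding per categorical feature.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, min(50, (card + 1) // 2)) for card in categorical_cardinalities]
        )
        emb_dim = sum(emb.embedding_dim for emb in self.embeddings)
        self.emb_dropout = nn.Dropout(0.3)
        # Batch normalization for the numerical inputs.
        self.num_bn = nn.BatchNorm1d(num_numerical)
        # Concatenated embeddings and numerical features feed multiple linear layers.
        layers, in_dim = [], emb_dim + num_numerical
        for hidden in hidden_sizes:
            layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.3)]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, 1))  # single logit: diabetes yes/no
        self.layers = nn.Sequential(*layers)

    def forward(self, x_cat, x_num):
        # x_cat: LongTensor (batch, n_categorical), x_num: FloatTensor (batch, n_numerical)
        emb = torch.cat([emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)
        emb = self.emb_dropout(emb)
        num = self.num_bn(x_num)
        return self.layers(torch.cat([emb, num], dim=1))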

Results

Model                                 F2-score
K-nearest neighbors                   0.5876770480427772
Nearest centroid                      0.5745641671917665
Decision Tree                         0.5947951590481642
Random Forest                         0.6011349227795215
XGBoost                               0.6086731525564065
Support Vector Classifier             0.6083371698823924
Naive Bayes                           0.8274523809825522
Neural network with embeddings        0.8655531658316447
Neural network with one-hot encoding  0.8242131481621253
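All values in the table are F2-scores, i.e. the F-beta measure with beta = 2, which treats recall as twice as important as precision and therefore fits a screening setting where missing a diabetic patient is worse than a false alarm. With scikit-learn it can be computed as follows (the labels below are purely illustrative):

# F-beta with beta=2 treats recall as twice as important as precision.
from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # illustrative ground-truth diabetes labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # illustrative model predictions

print(fbeta_score(y_true, y_pred, beta=2))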

Result plot

General functions for all notebooks

To install all packages that are required for the general functions you can use the requirements.txt file. Open a terminal, navigate to the project root, activate the project environment, and execute the following command to install all dependencies from the requirements.txt file.

pip install -r requirements.txt

Download the Behavioral Risk Factor Surveillance System dataset

To download the BRFSS dataset you can use the following function. The function requires that the kaggle Python package is already installed. You can create your personal Kaggle API key under Account -> API -> Create New API Token. If you download the BRFSS dataset manually, please copy the 2015.csv file into the brfss_dataset folder.

from preprocessing.preprocessing import download_brfss_dataset

download_brfss_dataset("kaggle_username", "your_kaggle_api_key")

Load preprocessed dataset

To load the preprocessed dataset, import and use the following function. The function requires the dataset to be present in the brfss_dataset folder.

from preprocessing.preprocessing import get_preprocessed_brfss_dataset

brfss_dataset, brfss_target = get_preprocessed_brfss_dataset()

The file preprocessing.py contains further preprocessing functions, for example to get a train-test split, to oversample the dataset, or to get the target one-hot encoded.
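The following sketch illustrates the same two steps with generic libraries rather than the repository's own helpers, whose exact names are not listed here; inside the notebooks the functions from preprocessing.py should be preferred.

# Generic illustration of a stratified train-test split and oversampling of the minority
# (diabetic) class; it stands in for the corresponding helpers in preprocessing.py.
from preprocessing.preprocessing import get_preprocessed_brfss_dataset
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X, y = get_preprocessed_brfss_dataset()

# Stratification keeps the diabetic/non-diabetic ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training data so the test set keeps the original class distribution.
X_train_res, y_train_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)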

For the artificial neural network, specific preprocessing functions are implemented in the neural_network_perprocessing.py file. This preprocessing is needed in particular for the embedding layer of the neural network.

Ready-to-use plot functions

Plot functions can be found in the directory visualization. Currently, the following plot functions are available (a usage sketch follows the list):

  • classification_plots.py
    • plot_decision_boundary (interesting for the k-nearest neighbors classifier)
    • plot_confusion_matrix
  • general_plots
    • plot_class_frequencies (useful to show that the dataset is imbalanced)
  • neural_network_plots
    • plot_loss
    • plot_multiple_loss_curves
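As a hedged usage sketch, a notebook might call the confusion-matrix helper roughly like this; the import path and the signature of plot_confusion_matrix are assumptions, so check visualization/classification_plots.py for the actual parameters.

# Hypothetical usage; the real signature of plot_confusion_matrix may differ.
from visualization.classification_plots import plot_confusion_matrix

y_true = [1, 0, 1, 1, 0]  # illustrative ground-truth diabetes labels
y_pred = [1, 0, 0, 1, 1]  # illustrative model predictions

plot_confusion_matrix(y_true, y_pred)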

Continuous Integration pipeline

The CI pipeline is based on GitHub Actions and consists of two steps.

  • Flake8: This step lints the code and ensures that all Python files follow the PEP 8 style guide. To execute flake8 locally, open a terminal, navigate to the project root, ensure that the project environment where flake8 is installed is activated, and execute the command flake8.
  • Unit tests: Running the unit tests locally works like the flake8 execution, but instead of executing the command flake8, run the command pytest. If you are working in PyCharm you can also right-click on the test folder and select Run pytest in test.

About

Data Mining project for the Data Mining lecture at the University of Mannheim HWS22
