
Heart Disease
Predictive Modeling
Intro / Background:
​
With the resources we collected, we found that angina is chest pain caused by insufficient blood flow and oxygen to part of the heart muscle, and that having angina raises the risk of a heart attack. Our research also indicated that ST-segment depression is associated with an increased risk of subsequent cardiac events, so it may affect the diagnosis of heart disease as well. Meanwhile, research has shown that adults aged 65 and older have a higher risk of developing heart disease. Therefore, variables related to age, angina, and ST depression are likely candidates to be associated with our response variable.
​
Exploratory Data Analysis:
​
The heat map shows the correlations between the variables. The variable most correlated with the response is "exang", the exercise-induced angina indicator. While the majority of the variables are positively correlated with the heart-disease diagnosis, some, such as "thalach" (maximum heart rate achieved), are negatively correlated with it. A sketch of how this heat map can be reproduced follows the figure.

[Figure: correlation heat map of the predictors and the heart-disease diagnosis]
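
A minimal sketch of how such a heat map can be produced, assuming the data sits in a pandas DataFrame; the file name "heart.csv" and the "target" diagnosis column are assumptions, not from the report.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("heart.csv")  # hypothetical file name

# Correlation matrix of all numeric columns, drawn as a heat map
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation with heart-disease diagnosis")
plt.show()

# Correlations with the diagnosis alone, sorted; per the figure, "exang"
# ranks highest and "thalach" comes out negative
print(corr["target"].drop("target").sort_values(ascending=False))
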
As age increases, the number of people in the sample diagnosed with heart disease also increases. However, the sample contains more people aged 50-70 than any other age group, so the raw counts alone do not show that the rate of heart disease rises with age; the per-group rate needs further analysis, as sketched after the figure below.

[Figure: number of heart-disease diagnoses by age group]
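
A sketch of that rate check: compare the share of diagnoses per age group with the raw group sizes. The bin edges and column names are assumptions.

import pandas as pd

df = pd.read_csv("heart.csv")  # hypothetical file name
age_group = pd.cut(df["age"], bins=[0, 40, 50, 60, 70, 80])

# "count" is the sample size per group; "rate" is the share diagnosed,
# which is what the raw counts alone cannot show
summary = df.groupby(age_group)["target"].agg(count="size", rate="mean")
print(summary)
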
Oldpeak, which represents ST depression induced by exercise, is plotted against the number of heart-disease diagnoses. The rate of diagnosis increases significantly with oldpeak: the majority of people with high oldpeak values are diagnosed with heart disease, while fewer than half of those with low oldpeak values are.

[Figures: heart-disease diagnoses by oldpeak value; histograms and pairwise scatterplots of the variables]

Based on the histograms of the variables and the scatterplots of pairs of variables, no pair of predictors exhibits a strong mutual trend, so there is no evidence of serious multicollinearity between the variables.
​
The box plots show the range of resting blood pressure by age group. In general, resting blood pressure increases with age, although the 60-70 group shows higher blood pressure than the 70-80 group.
Thalach, the maximum heart rate, moves in the opposite direction: the highest maximum heart rates occur in the under-40 age group and the lowest in the 70-80 group.

[Figures: resting blood pressure by age group; maximum heart rate (thalach) by age group]

Preprocessing / Recipes:
​
We one-hot encoded the variables cp, restecg, slope, ca, and thal because they are essentially categorical. This is better than treating them as numeric, which would imply an ordering among the categories when there isn't one. Also, for a variable like slope, the jump from one coded value to the next is not necessarily linear or equal to 1, as the raw codes would suggest.
For the remaining numeric variables, we used a min-max scaler, which is a sensible default in general and helps KNN in particular, since KNN relies on Euclidean distance. A sketch of this preprocessing follows.
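
A minimal scikit-learn sketch of this recipe, assuming the UCI heart-disease column names; the report's actual preprocessing code may differ.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical = ["cp", "restecg", "slope", "ca", "thal"]
numeric = ["age", "sex", "trestbps", "chol", "fbs", "thalach", "exang", "oldpeak"]

preprocess = ColumnTransformer([
    # One-hot encoding avoids imposing an order on the category codes
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Min-max scaling keeps features on [0, 1], which matters for KNN's
    # Euclidean distances
    ("scale", MinMaxScaler(), numeric),
])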
​
Candidate models | Model evaluation and tuning:
​
Our method for deciding which models to move forward with was simply to pit the models against one another with little to no modification. To measure their performance, we used 10-fold cross-validation and collected the F1 score for each model; the harness sketched below was reused for every candidate.
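
A sketch of this evaluation harness, reusing the preprocess transformer from the previous section; the file and column names are assumptions.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

df = pd.read_csv("heart.csv")    # hypothetical file name
X = df.drop(columns=["target"])  # "target" column name is an assumption
y = df["target"]

def mean_f1(model):
    """Mean F1 score of a candidate model under 10-fold cross-validation."""
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    return cross_val_score(pipe, X, y, cv=10, scoring="f1").mean()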
​
Logistic Regression:
​
For logistic regression, we trained the model with regularization on the preprocessed data and measured its performance using 10-fold cross-validation. With L1 regularization the model received a mean F1 score of ~0.7900, and with L2 regularization a mean F1 score of ~0.8000.
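
A sketch of the two regularized fits using the mean_f1 harness above; liblinear is one solver that supports both penalties.

from sklearn.linear_model import LogisticRegression

for penalty in ["l1", "l2"]:  # reported means: ~0.7900 (L1), ~0.8000 (L2)
    clf = LogisticRegression(penalty=penalty, solver="liblinear")
    print(penalty, mean_f1(clf))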
​
Support Vector Machine:
​
For SVM, we first compared the linear, poly, and rbf kernels. At first glance, the poly and rbf kernels performed better than the linear kernel, which is understandable since those kernels are more flexible: both the degree-2 poly kernel and the rbf kernel achieved mean F1 scores of ~0.8200. Tuning the regularization parameter improved both kernels further (see the final-model discussion below).
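
A sketch of the kernel comparison and the subsequent regularization search; the exact grid of C values is an assumption.

from sklearn.svm import SVC

# Compare the three kernels (degree only applies to the poly kernel)
for kernel in ["linear", "poly", "rbf"]:
    print(kernel, mean_f1(SVC(kernel=kernel, degree=2)))

# Tune the regularization parameter C for the rbf kernel
for C in [0.1, 0.4, 0.5, 1.0, 2.0]:
    print(C, mean_f1(SVC(kernel="rbf", C=C)))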
​
K-Nearest Neighbors:
​
For KNN, we searched k from 1 through 30 to find the best value. Several values of k (10, 15, and 27) tied for the best performance, each with a mean F1 score around ~0.8000. In the end we chose k = 10, the smallest of the tied values, since the larger values offered no gain in F1.
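
A sketch of the k search, again reusing the mean_f1 harness.

from sklearn.neighbors import KNeighborsClassifier

# Try every k from 1 through 30 and keep the best-scoring one
scores = {k: mean_f1(KNeighborsClassifier(n_neighbors=k)) for k in range(1, 31)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])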
​
XGBoost:
​
For decision trees, we wanted to incorporate boosting right away, since boosting will most likely improve the model; for this project we tried XGBoost. We first tuned the depth of the trees and found that 6 is the ideal depth. We then lowered the learning rate and increased the number of estimators to make the model more accurate, and we added row sampling to reduce variance and guard against overfitting. Unfortunately, the model did not perform as well as expected, receiving a mean F1 score of only ~0.7700.
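
A sketch of this XGBoost configuration; the specific learning rate, estimator count, and subsample ratio are assumptions, not the report's exact settings.

from xgboost import XGBClassifier

xgb = XGBClassifier(
    max_depth=6,         # tuned tree depth
    learning_rate=0.05,  # lowered from the default of 0.3
    n_estimators=500,    # more trees to compensate for the lower rate
    subsample=0.8,       # row sampling to reduce variance
    eval_metric="logloss",
)
print(mean_f1(xgb))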
​
​
​
Discussion of Final Model (SVM):
​
While most of the models had mean F1 scores hovering around ~0.8000, SVM was the only one able to reach almost ~0.8500. Both the rbf kernel and the degree-2 poly kernel worked equally well; the ideal regularization coefficient was 0.5 for the rbf kernel and 0.4 for the poly kernel.
We chose the rbf-kernel SVM as our final model based on its highest cross-validated F1 score. One possible drawback is that it is hard to interpret. Although this is true of most machine-learning algorithms, an easier-to-interpret model might be preferable in a real healthcare setting.
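
A sketch of the final model as described, combined with the earlier preprocessing in a single pipeline; the train/test split here is only for illustration.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

final_model = Pipeline([
    ("prep", preprocess),               # recipe from the preprocessing section
    ("svm", SVC(kernel="rbf", C=0.5)),  # tuned regularization coefficient
])
final_model.fit(X_train, y_train)
print(final_model.score(X_test, y_test))  # held-out accuracy as a sanity check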
​
​
Potential improvements:
​
A potential improvement would be to spend more time on EDA and feature engineering, as we primarily worked on tuning parameters and comparing models. For all of the models, we converted the categorical predictors into numeric form, since some models require this preprocessing, but we could have built a more detailed recipe. Doing so would require a deeper understanding of the subject: with more time, we could have consulted more external resources on heart disease to better understand the domain.
​







