Description: Heart disease is one of the deadliest diseases in the world. This project is about deciding whether a patient is positive or negative for heart disease. The process is implemented in an R script that uses a Logistic Regression model and a Random Forest model.
In this article, we’ll use the Heart Disease Dataset available from the UCI Machine Learning Repository; specifically, the processed Cleveland subset of that data. The variables used in this article consist of one dependent variable, with a “yes” or “no” value stating whether the patient has heart disease, and 13 independent variables. The dataset contains 303 records in total: 164 ‘Negative’ and 139 ‘Positive’.
The code below shows how to load the dataset from the UCI Machine Learning Repository, name the columns, and recode the diagnosis into a binary target.
library(caret)   # createDataPartition(), train(), confusionMatrix()
heart = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", header=FALSE, sep=",", na.strings='?')
names(heart) = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
heart$target = ifelse(heart$target==0, 0, 1)   # recode the 0-4 diagnosis values into binary 0/1
# Treat the categorical columns as factors so glm/randomForest expand them into the dummy levels seen in the summaries below
heart[c("sex","cp","fbs","restecg","exang","slope","thal","target")] = lapply(heart[c("sex","cp","fbs","restecg","exang","slope","thal","target")], factor)
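As a quick check (not part of the original pipeline), the class distribution described above can be confirmed by tabulating the target column:
# 0 = negative, 1 = positive; should match the 164/139 split described above
table(heart$target)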
The code below shows the data cleaning process. Rows with missing values are removed; the data is then shuffled and split into training and testing sets.
# Remove rows with missing values
dataset=na.omit(heart)
str(dataset)
#Randomize
set.seed(42)
rows = sample(nrow(dataset))
dataset = dataset[rows, ]
#Hold Out Method (data splitting)
train.flag = createDataPartition(y=dataset$target, p=0.8, list=FALSE)
training = dataset[ train.flag, ]
testing = dataset[-train.flag, ]
In this example, the dataset is split using an 80/20 pattern: 80% of the instances, chosen at random, are used to train the model and the remaining 20% to validate it.
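A quick check of the resulting split sizes (illustrative) confirms the 80/20 proportion; with the 297 complete cases this works out to 238 training and 59 testing rows:
# Number of rows assigned to each partition
nrow(training)
nrow(testing)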
After these preprocessing steps, the data is analyzed with the Logistic Regression algorithm. First we build a model that includes all of the independent variables.
model1 = train(target~., data=training, method="glm", family=binomial(link="logit"))
summary(model1)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7770 -0.4940 -0.1005 0.2828 2.4811
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.272703 3.765205 -2.197 0.028010 *
age -0.012120 0.028856 -0.420 0.674472
sex1 2.019067 0.627393 3.218 0.001290 **
cp2 1.843374 0.954542 1.931 0.053463 .
cp3 0.532689 0.797000 0.668 0.503900
cp4 2.970446 0.885516 3.354 0.000795 ***
trestbps 0.024254 0.013604 1.783 0.074606 .
chol 0.006870 0.004487 1.531 0.125784
fbs1 -0.262333 0.688536 -0.381 0.703202
restecg1 1.005438 2.985951 0.337 0.736326
restecg2 0.441863 0.456003 0.969 0.332549
thalach -0.014120 0.013791 -1.024 0.305893
exang1 0.576336 0.515313 1.118 0.263388
oldpeak 0.623193 0.281367 2.215 0.026769 *
slope2 1.145002 0.589961 1.941 0.052281 .
slope3 0.262959 1.075011 0.245 0.806758
ca 1.434166 0.352831 4.065 4.81e-05 ***
thal6 -0.303190 0.924977 -0.328 0.743077
thal7 1.059326 0.485854 2.180 0.029232 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 328.58 on 237 degrees of freedom
Residual deviance: 143.72 on 219 degrees of freedom
AIC: 181.72
Number of Fisher Scoring iterations: 6
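The significance codes above drive the variable selection described next; as a side note, they can also be pulled programmatically from the glm fit stored inside the caret object (a minimal sketch):
# Coefficient table of the underlying glm; column 4 holds the p-values
coefs = summary(model1$finalModel)$coefficients
coefs[coefs[, 4] < 0.05, ]   # rows significant at the 5% level (including the intercept)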
The model1 summary shows that not all of the independent variables are significant. Therefore, the next model is built with only the significant variables: sex, cp, oldpeak, ca, and thal.
model2 = train(target~sex+cp+oldpeak+ca+thal, data=training, method="glm",family=binomial(link="logit"))
summary(model2)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0190 -0.5064 -0.1651 0.3697 2.7527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.9193 0.9413 -5.226 1.73e-07 ***
sex1 1.1536 0.4865 2.371 0.017741 *
cp2 1.3163 0.8743 1.506 0.132167
cp3 0.4907 0.7681 0.639 0.522946
cp4 2.9655 0.7819 3.793 0.000149 ***
oldpeak 0.9260 0.2294 4.036 5.43e-05 ***
ca 1.3103 0.2998 4.371 1.24e-05 ***
thal6 0.2715 0.8495 0.320 0.749249
thal7 1.3233 0.4406 3.003 0.002673 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 328.58 on 237 degrees of freedom
Residual deviance: 162.70 on 229 degrees of freedom
AIC: 180.7
Number of Fisher Scoring iterations: 6
After the model is formed, the next step is to evaluate the model on the testing data with the confusion matrix shown in the output below:
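The output below can be reproduced by predicting on the held-out testing set and passing the predictions to caret's confusionMatrix(); a minimal sketch (the object name pred2 is illustrative):
# Class predictions for the testing data, compared against the true labels
pred2 = predict(model2, newdata = testing)
confusionMatrix(pred2, testing$target)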
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 25 6
1 7 21
Accuracy : 0.7797
95% CI : (0.6527, 0.8771)
The accuracy obtained on the testing data is 77.97%.
Next, the data is analyzed with the Random Forest algorithm. The best Random Forest model is selected by comparing model accuracy across multiple mtry, ntree, and nodesize values. Because caret's built-in "rf" method tunes only mtry, a custom model specification (customRF) is defined so that ntree and nodesize can be tuned as well. The code below shows the syntax for building the random forest model over a grid of hyperparameter values.
library(randomForest)   # randomForest() is called inside the custom caret method
x = dataset[, 1:13]
y = subset(dataset, select = c(14))
# Custom caret model specification exposing mtry, ntree, and nodesize as tuning parameters
customRF = list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters = data.frame(parameter = c("mtry", "ntree", "nodesize"), class = rep("numeric", 3), label = c("mtry", "ntree", "nodesize"))
customRF$grid = function(x, y, len = NULL, search = "grid") {}
customRF$fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, nodesize = param$nodesize, ...)
}
customRF$predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) predict(modelFit, newdata)
customRF$prob = function(modelFit, newdata, preProc = NULL, submodels = NULL) predict(modelFit, newdata, type = "prob")
customRF$sort = function(x) x[order(x[, 1]), ]
customRF$levels = function(x) x$classes
# Repeated 3-fold cross-validation (10 repeats) scores each hyperparameter combination
fitControl = trainControl(method = "repeatedcv", number = 3, repeats = 10)
grid = expand.grid(.mtry = c(2, 3, 4, 5, 6),
                   .ntree = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000),
                   .nodesize = c(1:5))
set.seed(123)
fit_rf = train(target ~ ., data = training, method = customRF, tuneGrid = grid, trControl = fitControl)
The tuning grid covers mtry values of 2, 3, 4, 5 and 6, ntree values of 100 through 1000 in steps of 100, and nodesize values of 1 through 5. The cross-validation results show that the model with mtry = 2, ntree = 500 and nodesize = 5 has the highest accuracy.
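The winning combination and the full accuracy grid can be inspected directly from the fitted caret object (a sketch using the names defined above):
fit_rf$bestTune   # best mtry/ntree/nodesize combination found by cross-validation
fit_rf$results    # accuracy for every combination in the grid
plot(fit_rf)      # accuracy plotted against the tuning parameters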
After the model is formed, the next step is to evaluate the model on the testing data with the confusion matrix shown in the output below:
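As with the logistic regression model, the evaluation below can be reproduced by predicting on the testing set (a sketch; pred_rf is an illustrative name):
# Class predictions from the tuned random forest on the held-out data
pred_rf = predict(fit_rf, newdata = testing)
confusionMatrix(pred_rf, testing$target)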
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 28 7
1 4 20
Accuracy : 0.8136
95% CI : (0.6909, 0.9031)
The accuracy obtained on the testing data is 81.36%.
Based on the classification results of the two algorithms, it can be concluded that the Random Forest algorithm performs slightly better than the Logistic Regression algorithm at classifying the UCI heart disease data. This is supported by the confusion matrix results, where the accuracy on the testing data is 81.36% for Random Forest versus 77.97% for Logistic Regression.