In this blog, we will apply the Decision Tree algorithm in Python to predict diabetes in a dataset.
Let’s define the problem statement first:
We have to build a model that minimizes the misclassification of
diabetes in the diabetes dataset.
Just to give a small introduction to diabetes:
Diabetes is a disease in which the body is unable to properly use
and store glucose (a form of sugar). Glucose backs up in the bloodstream —
causing one’s blood glucose (sometimes referred to as blood sugar) to rise too
high.
There are two major types of diabetes:
Type 1 (formerly called juvenile-onset or insulin-dependent)
diabetes: here the body completely stops producing insulin, a hormone that
enables the body to use glucose found in foods for energy. People with type 1
diabetes must take daily insulin injections to survive. This form of diabetes
usually develops in children or young adults, but can occur at any age.
Type 2 (formerly called adult-onset or non-insulin-dependent)
diabetes: here the body doesn’t produce enough insulin and/or is unable
to use insulin properly (insulin resistance). This form of diabetes usually
occurs in people who are over 40, overweight, and have a family history of
diabetes, although today it is increasingly occurring in younger people,
particularly adolescents.
Diabetes can occur in anyone. However, people who have close
relatives with the disease are somewhat more likely to develop it. Other risk
factors include obesity, high cholesterol, high blood pressure, and physical
inactivity. The risk of developing diabetes also increases as people grow
older. People who are over 40 and overweight are more likely to develop
diabetes, although the incidence of type 2 diabetes in adolescents is growing.
Also, people who develop diabetes while pregnant (a condition called gestational
diabetes) are more likely to develop full-blown diabetes later in life.
Nowadays, hospitals collect large amounts of information in the form of
patient records. Knowledge discovery for predictive purposes
is done through data mining, an analysis technique that helps in
proposing inferences and supports decision-making through algorithms applied
to the large amounts of data generated by these medical centers. Considering the
importance of early medical diagnosis of this disease, data mining techniques
can be applied to help detect diabetes at an early stage and guide
treatment, which may help in avoiding complications.
This blog focuses on diabetes recorded in pregnant women. A
Decision Tree classifier is trained on the train dataset to predict whether
diabetes is recorded in a patient or not. The results and application of the
algorithm are presented in this blog.
The tools and classifier used:
We create a mining model based on a classification algorithm
in order to provide a simple solution to the problem of diagnosing diabetes
in women. The results have been analyzed using statistical methods and are
presented in the sections below.
Here we have used Python as our tool for the analysis, together with the
decision tree algorithm: a flowchart-like tree structure used as a method for
classification and prediction. The root and internal nodes hold the attribute
tests that separate instances with different features, while leaf nodes
denote the class variable.
Here the decision tree provides a powerful technique for
classification and prediction in the diabetes diagnosis problem. Each node of the
decision tree is chosen by calculating the information gain of all
attributes and selecting the highest; if a specific attribute gives an unambiguous result
(an explicit classification of the class attribute), the branch of that attribute is
terminated and the target value is assigned to it.
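The node-selection rule described above can be sketched with a toy information-gain computation. The helper names `entropy` and `information_gain`, and the example split, are illustrative and not part of the blog's code:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, groups):
    """Reduction in entropy after splitting `labels` into `groups`."""
    total = len(labels)
    children = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - children

# Toy example: 8 patients, diabetic (True) or not (False),
# split by a hypothetical test such as plasma glucose above some threshold
parent = [True, True, True, True, False, False, False, False]
left   = [True, True, True, False]
right  = [True, False, False, False]
print(round(information_gain(parent, [left, right]), 4))  # 0.1887
```

The tree builder would compute this gain for every candidate attribute and pick the split with the highest value.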
A brief look at
the diabetes dataset and its attributes:
Dataset | No. of Attributes | No. of Instances
Original Dataset | 8 | 768
Train Dataset | 8 | 614
Test Dataset | 8 | 154
S.No. | Attribute | Variable name in the dataset
1 | Pregnancies count | pregnancies
2 | Plasma glucose concentration | plasma glucose
3 | Diastolic blood pressure | blood pressure
4 | Triceps skin fold thickness | triceps skin thickness
5 | Insulin | insulin
6 | Body mass index (kg/m2) | bmi
7 | Diabetes pedigree function | diabetes pedigree
8 | Age (years) | age
9 | Diabetes (True / False) | diabetes
- Diabetes is the dependent variable we need to
predict from the model we will build, and it is suspected to be affected
by the independent variables
- The other 8 variables are the independent
variables which are suspected to affect the dependent variable
The methodology used:
We first load the file from our location, check the data
structure, and do a univariate analysis of the data.
import pandas as pd
%matplotlib inline

# Load the dataset and inspect its structure
location = r'C:\Users\Rajat\Documents\Python\Assignment\Custom_Diabetes_Dataset.csv'
diabetes_data = pd.read_csv(location)
diabetes_data.head(5)
diabetes_data.info()
On inspection, the following variables contain ‘0’ values:
- Plasma Glucose
- Blood Pressure
- Triceps Skin Thickness
- Insulin
- BMI
We will fill in these ‘0’ values with their respective median values.
The reason for using the median is that, since this dataset describes living
patients, these variables cannot truly be ‘0’.
We can then check the distribution of the dataset before and after
this conversion.
-
For Plasma Glucose
diabetes_data['plasma glucose'].hist()
diabetes_data['plasma glucose'] = diabetes_data['plasma glucose'].replace(0, diabetes_data['plasma glucose'].median())
diabetes_data['plasma glucose'].hist()
-
For Blood Pressure
diabetes_data['blood pressure'].hist()
diabetes_data['blood pressure'] = diabetes_data['blood pressure'].replace(0, diabetes_data['blood pressure'].median())
diabetes_data['blood pressure'].hist()
-
For Triceps Skin Thickness
diabetes_data['triceps skin thickness'].hist()
diabetes_data['triceps skin thickness'] = diabetes_data['triceps skin thickness'].replace(0, diabetes_data['triceps skin thickness'].median())
diabetes_data['triceps skin thickness'].hist()
-
For Insulin
diabetes_data['insulin'].hist()
diabetes_data['insulin'] = diabetes_data['insulin'].replace(0, diabetes_data['insulin'].median())
diabetes_data['insulin'].hist()
-
For BMI
diabetes_data['bmi'].hist()
diabetes_data['bmi'] = diabetes_data['bmi'].replace(0, diabetes_data['bmi'].median())
diabetes_data['bmi'].hist()
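The five near-identical replacements above can also be written as a single loop over the affected columns. This sketch runs on a small stand-in frame (`df`) with the same column names; the real code would operate on the `diabetes_data` frame loaded earlier:

```python
import pandas as pd

# Small stand-in frame with the same column names as the real dataset
df = pd.DataFrame({
    'plasma glucose':         [0, 120, 140, 100],
    'blood pressure':         [70, 0, 80, 75],
    'triceps skin thickness': [0, 30, 25, 20],
    'insulin':                [0, 90, 110, 0],
    'bmi':                    [32.0, 0.0, 28.5, 30.1],
})

# Columns where a literal 0 is physiologically impossible
zero_invalid = ['plasma glucose', 'blood pressure',
                'triceps skin thickness', 'insulin', 'bmi']
for col in zero_invalid:
    # Median is computed before replacement, matching the per-column code above
    df[col] = df[col].replace(0, df[col].median())

print((df[zero_invalid] == 0).sum().sum())  # 0 zeros remain
```

Note that, as in the per-column code, the median is computed while the ‘0’ values are still present, which slightly biases it downward.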
Then, the data was divided into a training set (a random 80% of the
dataset) and a test set (the remaining 20% of the dataset).
train_set = diabetes_data.sample(frac = 0.8, random_state = 768)
test_set = diabetes_data.drop(train_set.index)
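As a quick sanity check of this split, the sketch below uses a dummy frame with the same row count as the dataset (768 rows) and confirms the 80/20 split yields 614 training and 154 test rows with no overlap:

```python
import pandas as pd

# Dummy frame with the same row count as the full dataset
df = pd.DataFrame({'x': range(768)})

train_set = df.sample(frac=0.8, random_state=768)   # random 80%
test_set = df.drop(train_set.index)                 # remaining 20%

print(len(train_set), len(test_set))  # 614 154
assert train_set.index.intersection(test_set.index).empty
```

Because `drop` removes exactly the sampled index labels, every row lands in exactly one of the two sets.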
A model is then built by applying the decision tree classifier
to the train dataset with the criterion “entropy”, after which data
are classified as “tested-positive” or “tested-negative” depending on the final
result of the decision tree that is constructed.
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion = "entropy")
We separate the dependent variable from the train set and build a
model from this dependent variable and the other independent
variables of the train set. We do the same separation for the test set.
train_result = train_set.drop(['diabetes'], axis = 1)
clf = clf.fit(train_result, train_set.diabetes)
test_result = test_set.drop(['diabetes'], axis = 1)
Then we apply the model on the test set (set excluding the
dependent variable).
pred = clf.predict(test_result)
Now we check the accuracy and prediction capability of our model
by calculating its accuracy score on the test set.
from sklearn.metrics import accuracy_score
accuracy_score(test_set.diabetes, pred)
The model’s performance on the test set can be examined in more detail through a
confusion matrix.
In scikit-learn, the confusion matrix is a matrix whose rows indicate the actual classes and whose columns indicate the predicted classes, with the first row/column corresponding to the negative class and the second to the positive class.
Connecting this with our model:
the rows indicate the actual condition of the patients in the test dataset, where the
first row denotes patients who are actually not diabetic and the second row
denotes patients who are actually diabetic. Similarly, the columns denote the
model’s predictions, where the first column denotes patients predicted as not
diabetic and the second column denotes patients predicted as diabetic.
from sklearn.metrics import confusion_matrix
c = confusion_matrix(test_set.diabetes, pred)
Reading the confusion matrix of our model: it correctly predicted 76 patients as not having diabetes who indeed did not have it, and 24 cases as having diabetes who indeed did. That said, the model also has some errors: it predicted 31 cases as diabetic who actually do not have diabetes, and 23 cases as not having diabetes who actually do.
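The four counts can also be read off programmatically with `ravel()`. This sketch uses toy actual/predicted labels in place of the model's real `test_set.diabetes` and `pred`:

```python
from sklearn.metrics import confusion_matrix

# Toy actual/predicted labels (False = not diabetic, True = diabetic);
# in the blog these would be test_set.diabetes and pred
actual    = [False, False, False, True, True, False]
predicted = [False, True,  False, True, False, False]

# With labels sorted as (False, True), rows are actual and columns predicted,
# so ravel() yields true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)  # 3 1 1 1
```

Accuracy is then simply (tn + tp) divided by the total count.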
We can fine-tune our model in many ways to reach a better accuracy, but since this blog is just an introduction to applying the Decision Tree algorithm in Python, we will not focus on those improvement techniques.
The codes and the data set can be accessed from the below google drive link:
https://drive.google.com/drive/folders/0B7DVvFs6qM1dYlBuX043WkEtM0k