Wednesday, 7 December 2016

Python uses a tree to predict diabetes... Don't believe it? Read on!

In this blog, we will apply the Decision Tree algorithm in Python to predict diabetes in a dataset.

Let’s define the problem statement first:
We have to build a model that minimizes the misclassification of diabetes in the diabetes dataset.

Just to give a small introduction to diabetes:
Diabetes is a disease in which the body is unable to properly use and store glucose (a form of sugar). Glucose backs up in the bloodstream — causing one’s blood glucose (sometimes referred to as blood sugar) to rise too high.

There are two major types of diabetes-
Type 1 (formerly called juvenile-onset or insulin-dependent) diabetes - Here the body completely stops producing any insulin, a hormone that enables the body to use glucose found in foods for energy. People with type 1 diabetes must take daily insulin injections to survive. This form of diabetes usually develops in children or young adults, but can occur at any age.

Type 2 (formerly called adult-onset or non-insulin-dependent) diabetes – It is when the body doesn’t produce enough insulin and/or is unable to use insulin properly (insulin resistance). This form of diabetes usually occurs in people who are over 40, overweight, and have a family history of diabetes, although today it is increasingly occurring in younger people, particularly adolescents.

Diabetes can occur in anyone. However, people who have close relatives with the disease are somewhat more likely to develop it. Other risk factors include obesity, high cholesterol, high blood pressure, and physical inactivity. The risk of developing diabetes also increases as people grow older. People who are over 40 and overweight are more likely to develop diabetes, although the incidence of type 2 diabetes in adolescents is growing. Also, people who develop diabetes while pregnant (a condition called gestational diabetes) are more likely to develop full-blown diabetes later in life.

Nowadays, a large amount of information is collected by hospitals in the form of patient records. Knowledge discovery for predictive purposes is done through data mining, an analysis technique that helps draw inferences and supports decision-making through algorithms applied to the large amounts of data these medical centers generate. Given the importance of early medical diagnosis of this disease, data mining techniques can help detect diabetes at an early stage and guide treatment, which may help in avoiding complications.
This blog focuses on diabetes recorded in pregnant women. The Decision Tree algorithm is applied to a train dataset to predict whether diabetes is recorded in a patient, and the results and application of the algorithm are presented below.

The tools and classifier used:
We create a mining model based on a classification algorithm to provide a simpler solution to the problem of diagnosing diabetes in women. The results are analyzed using statistical methods and presented in the sections below.

Here we have used Python as our tool for the analysis. We have used the decision tree algorithm, a flowchart-like tree structure used as a method for classification and prediction. The root and internal nodes hold the attribute tests that separate instances with different features, branches carry the outcomes of those tests, and leaf nodes hold the values of the class variable.

Here the decision tree provides a powerful technique for classification and prediction in the diabetes diagnosis problem. Each node of the tree is chosen by calculating the information gain of every attribute and picking the highest; if a specific attribute gives an unambiguous result (an explicit classification of the class attribute), that branch is terminated and the target value is assigned to it.
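To make the information-gain criterion concrete, here is a small self-contained sketch (toy labels, not the actual dataset) of how entropy and information gain are computed for one candidate split:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into `groups`."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy split: 10 patients; a candidate attribute test sends 4 of them
# down a left branch and 6 down a right branch.
parent = [True] * 4 + [False] * 6
left = [True, True, True, False]
right = [True, False, False, False, False, False]
print(round(information_gain(parent, [left, right]), 3))  # ≈ 0.256
```

The attribute whose test yields the largest such gain becomes the splitting node; the tree grower repeats this at every node.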

A little brief about the diabetes dataset and its attributes:

Dataset             No. of Attributes   No. of Instances
Original Dataset    8                   768
Train Dataset       8                   614
Test Dataset        8                   154

S.No.   Attribute                       Variable name in the dataset
1       Pregnancies count               pregnancies
2       Plasma glucose concentration    plasma glucose
3       Diastolic blood pressure        blood pressure
4       Triceps skin fold thickness     triceps skin thickness
5       Insulin                         insulin
6       Body mass index (kg/m²)         bmi
7       Diabetes pedigree function      diabetes pedigree
8       Age (years)                     age
9       Diabetes (True / False)         diabetes

-          Diabetes is the dependent variable we need to predict from the model we will build
-          The other 8 variables are the independent variables which are suspected to affect the dependent variable

The methodology used:

We first load the file from its location, check the data structure, and do a univariate analysis of the data.
import pandas as pd
%matplotlib inline

# Path to the local copy of the dataset
location = r'C:\Users\Rajat\Documents\Python\Assignment\Custom_Diabetes_Dataset.csv'
diabetes_data = pd.read_csv(location)

diabetes_data.head(5)   # first few rows
diabetes_data.info()    # column types and non-null counts

We find that there are ‘0’ values in the dataset. The variables which have ‘0’ values are:
-          Plasma Glucose
-          Blood Pressure
-          Triceps Skin Thickness
-          Insulin
-          BMI

We will fill these ‘0’ values with the respective column medians. Replacing them is justified by the fact that this dataset describes living patients, for whom a genuine value of 0 in these variables is physiologically impossible; the zeros therefore stand for missing measurements.
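As a side note, the same median replacement can be written as one short loop over the affected columns; a minimal sketch, with column names assumed to match this dataset:

```python
import pandas as pd

# Columns where a literal 0 really means "missing" (names assumed
# to match the dataset used in this post).
zero_as_missing = ['plasma glucose', 'blood pressure',
                   'triceps skin thickness', 'insulin', 'bmi']

def fill_zeros_with_median(df, columns):
    """Return a copy with 0 replaced by the column median in each column."""
    df = df.copy()
    for col in columns:
        df[col] = df[col].replace(0, df[col].median())
    return df

# Usage: diabetes_data = fill_zeros_with_median(diabetes_data, zero_as_missing)
```

Note the median here is computed over all values, zeros included, which matches the per-column replacements that follow.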

We can then check the distribution of the dataset before and after this conversion.
-          For Plasma Glucose

diabetes_data['plasma glucose'].hist()

diabetes_data['plasma glucose'] = diabetes_data['plasma glucose'].replace(0, diabetes_data['plasma glucose'].median())

diabetes_data['plasma glucose'].hist()


-          For Blood pressure

diabetes_data['blood pressure'].hist()

diabetes_data['blood pressure'] = diabetes_data['blood pressure'].replace(0, diabetes_data['blood pressure'].median())
diabetes_data['blood pressure'].hist()


-          For Triceps Skin Thickness
diabetes_data['triceps skin thickness'].hist()
diabetes_data['triceps skin thickness'] = diabetes_data['triceps skin thickness'].replace(0, diabetes_data['triceps skin thickness'].median())
diabetes_data['triceps skin thickness'].hist()


-          For Insulin
diabetes_data['insulin'].hist()
diabetes_data['insulin'] = diabetes_data['insulin'].replace(0, diabetes_data['insulin'].median())
diabetes_data['insulin'].hist()


-          For BMI

diabetes_data['bmi'].hist()

diabetes_data['bmi'] = diabetes_data['bmi'].replace(0, diabetes_data['bmi'].median())

diabetes_data['bmi'].hist()


Then, the data are divided into a training set (a random 80% of the dataset) and a test set (the remaining 20%).

train_set = diabetes_data.sample(frac = 0.8, random_state = 768)
test_set = diabetes_data.drop(train_set.index)
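An equivalent split can also be obtained with scikit-learn's train_test_split, which can additionally stratify on the target so that both sets keep a similar diabetic/non-diabetic ratio; a sketch, assuming the same diabetes_data frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_80_20(df, target='diabetes', seed=768):
    """80/20 split, stratified on the target column so both
    sets keep a similar class balance."""
    return train_test_split(df, test_size=0.2, random_state=seed,
                            stratify=df[target])

# Usage: train_set, test_set = split_80_20(diabetes_data)
```

Stratification matters most for imbalanced targets like this one, where a purely random 20% sample can over- or under-represent the diabetic class.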

A model is then built by applying the decision tree classifier to the train dataset with “entropy” as the splitting criterion, after which the data are classified as “tested-positive” or “tested-negative” depending on the final result of the constructed decision tree.
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion = "entropy") 

We separate the dependent variable from the train set and fit the model on the remaining independent variables. We prepare the test set in the same way.
train_result = train_set.drop(['diabetes'], axis = 1)
clf = clf.fit(train_result, train_set.diabetes)

test_result = test_set.drop(['diabetes'], axis = 1)
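As a quick sanity check, the splits a fitted tree has learned can be printed as text rules with tree.export_text (available in recent scikit-learn versions); a minimal sketch on toy data:

```python
from sklearn import tree

# Toy data in the same shape as ours: feature rows, binary target.
X = [[0, 10], [1, 20], [0, 30], [1, 40]]
y = [False, False, True, True]

clf_demo = tree.DecisionTreeClassifier(criterion='entropy').fit(X, y)
# Print the learned splits as indented if/else rules.
print(tree.export_text(clf_demo, feature_names=['f0', 'f1']))
```

The same call on the fitted clf above (with the dataset's column names as feature_names) shows exactly which attributes the tree splits on and in what order.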

Then we apply the model to the test set (which excludes the dependent variable).

pred = clf.predict(test_result)


Now we check the accuracy and prediction capability of our model:

First, we calculate the accuracy percentage of the model.

from sklearn.metrics import accuracy_score 
accuracy_score(test_set.diabetes, pred)
This accuracy on the test set can be examined in more detail through the confusion matrix.

from sklearn.metrics import confusion_matrix
c = confusion_matrix(test_set.diabetes, pred)
print(c)


The confusion matrix gives us the output as a matrix where, in scikit-learn's convention, the rows indicate the actual classes and the columns indicate the predicted classes: the first row and column correspond to the negative (False) class, the second to the positive (True) class.


Connecting this with our model: the rows denote the actual condition of the patients in the test set, the first row for patients who are not diabetic and the second for patients who are diabetic. The columns denote the model’s predictions, the first column for patients predicted not diabetic and the second for patients predicted diabetic.

Reading off the confusion matrix of our model, it correctly predicted 76 patients as non-diabetic who actually did not have diabetes, and 24 as diabetic who actually do have diabetes. That said, the model also makes errors: it predicts 31 cases to be diabetic when they actually don't have diabetes, and predicts 23 cases as not having diabetes when in actuality they do.
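The four counts above translate directly into the usual screening metrics; a small sketch using the numbers reported in this post:

```python
# Counts reported above: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = 76, 31, 23, 24

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall: diabetics correctly flagged
specificity = tn / (tn + fp)   # non-diabetics correctly cleared
print(round(accuracy, 3), round(sensitivity, 3),
      round(specificity, 3))  # → 0.649 0.511 0.71
```

For a medical screen, the sensitivity of about 51% is the worrying number: nearly half the diabetic patients are missed, which plain accuracy hides.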

We can fine-tune our model in many ways to achieve better accuracy, but since this blog only introduces the application of the Decision Tree algorithm using Python, we will not focus on those improvement techniques here.

The code and the dataset can be accessed from the Google Drive link below:
https://drive.google.com/drive/folders/0B7DVvFs6qM1dYlBuX043WkEtM0k
