Thursday, 8 December 2016

Can a hamburger set wages for a company??? Yes it cannnn…

Applying a linear regression algorithm in Python (in a Spark context environment) to predict international hourly wages from the price of a 'Big Mac' hamburger

Let’s define the managerial and statistical problem first:
The question is – is it possible to develop a model that predicts the net hourly wage of a worker anywhere in the world from the price of a Big Mac hamburger in that country? We have to construct a model that determines the net hourly wage of a worker from the price of a Big Mac, and then use it to predict the net hourly wage of a worker in a country where a Big Mac costs $3.00.

To introduce this decision dilemma:
The McDonald's Corporation is the leading global food-service retailer, with more than 36,615 outlets serving nearly 50 million people in more than 119 countries each day. This global presence, together with its consistency in food offerings and restaurant operations, makes McDonald's a unique and attractive setting for economists making salary and price comparisons around the world.

Because the Big Mac hamburger is a standardized hamburger produced and sold in virtually every McDonald's around the world, The Economist, a weekly newspaper focusing on international politics and business news and opinion, has been compiling Big Mac prices as an indicator of exchange rates since as early as 1986. Building on this idea, some researchers proposed comparing wage rates across countries using the price of a Big Mac hamburger.

The tool used:
Here, instead of running Python explicitly in the Spark environment ("pyspark"), which by default starts with a Spark context already created, we do it the other way round: from the terminal window we call 'ipython notebook' (with no Spark context yet), and create the Spark context ourselves once we are inside the Jupyter notebook.

So now we are in the Jupyter notebook interface, and we create the Spark context to run the code below. We also import the libraries we will be using for this model building.
from __future__ import division  # Python 2: make / behave as true division
import sys
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark import SparkConf, SparkContext

# create a local Spark context ourselves, since we did not start from pyspark
conf = SparkConf().setMaster("local[*]").setAppName('pyspark')
sc = SparkContext(conf=conf)



The model used:
So far so good; we now proceed to create a linear regression model that helps us predict the net hourly wage of a worker from the price of a Big Mac.

Linear regression is a procedure that produces a mathematical model (a function) that can be used to predict one variable from other variables. Simple regression is bivariate (two variables) and linear (only a line fit is attempted). Simple regression analysis produces a model that attempts to predict a y variable, referred to as the dependent variable, from an x variable, referred to as the independent variable. The general form of the simple regression line is the slope-intercept equation of a line: the model consists of the slope of the line as the coefficient of x and a y-intercept value as a constant.

After the equation of the line has been developed, several statistics are available that can be used to determine how well the line fits the data. Using the historical (or given) values of x, predicted values of y can be calculated by inserting values of x into the regression equation. The predicted values can then be compared with the actual values of y to determine how well the regression equation fits the known data.

A little brief about the Big Mac dataset and its attributes:
We have data of Big Mac prices and net hourly wage figures (both in US dollars) for 27 countries. The net hourly wages are based on a weighted average of 12 professions.

Country          Big Mac price (US $)   Net hourly wage (US $)
Argentina        1.78                   3.3
Australia        3.84                   14
Brazil           4.91                   4.3
Britain          3.48                   13.9
Canada           4                      12.8
Chile            3.34                   3.1
China            1.95                   3
Czech Republic   3.43                   5.1
Denmark          4.9                    17.7
Hungary          3.33                   3
Indonesia        2.51                   1.3
Japan            3.67                   15.7
Malaysia         2.19                   3.1
Mexico           2.5                    1.8
New Zealand      3.59                   8.4
Philippines      2.19                   1.4
Poland           2.6                    4.1
Russia           2.33                   5.9
Singapore        3.08                   5.9
South Africa     2.45                   5.1
South Korea      2.82                   6.1
Sweden           6.56                   13.5
Switzerland      6.19                   22.6
Thailand         2.17                   2.6
Turkey           3.89                   4.3
UAE              2.99                   10.1
United States    3.73                   16.5


The methodology used: 
Here the dependent variable y is 'Net hourly wage' and the independent variable x is 'Big Mac price'. We first plot both variables on a scatter plot to see what type of relationship, if any, exists between them, and then proceed to further analysis.

# read the data and draw the scatter plot
df = pd.read_csv('Big_Mac.csv', header=None, names=['x', 'y'])
x = np.array(df.x)
y = np.array(df.y)

# scatter plot of the data, with an option to save the figure
def scatterPlot(x, y, yp=None, savePng=False):
    plt.xlabel('Big Mac prices in US $')
    plt.ylabel('Net hourly wages in US $')
    plt.scatter(x, y, marker='x')

    if yp is not None:        # overlay a fitted line if one is supplied
        plt.plot(x, yp)

    if not savePng:
        plt.show()
    else:
        name = raw_input('Figure: ')   # Python 2; use input() on Python 3
        plt.savefig(name + '.png')

scatterPlot(x, y)
From the scatter plot we can see that these two variables have a positive relationship.

So now we compute the correlation matrix to see the degree of correlation between the two variables. Note that the code below uses Spearman's rank correlation (method = "spearman"); passing method = "pearson" would give Pearson's coefficient of correlation instead.

data = sc.textFile("Big_Mac.csv")

def parseVector(line):
    # data values separated by commas
    return np.array([float(x) for x in line.split(',')])

parsedData_corr = data.map(parseVector)

from pyspark.mllib.stat import Statistics 
correlation_matrix = Statistics.corr(parsedData_corr, method = "spearman")

headers = ["Big Mac price","Net Hourly wage"]
corr_matrix = pd.DataFrame(correlation_matrix, index = headers, columns = headers)
corr_matrix

We can see a relatively high positive correlation of about 0.71 between these two variables. (Note that the correlation coefficient itself is not the proportion of variability explained; that is its square, r² ≈ 0.50, so roughly half the variability in the response variable, net hourly wage, is explained by the explanatory variable, Big Mac price.)
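A quick illustrative check of the distinction between the correlation coefficient and the proportion of variability explained (the 0.71 figure is the correlation computed above):

```python
# the proportion of variability explained (r squared) is the square of
# the correlation coefficient, not the coefficient itself
r = 0.7062            # correlation of roughly 0.71, as computed above
r_squared = r ** 2    # about 0.50, i.e. roughly half the variability
```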

Hence, we can go for linear regression technique to build our prediction model.

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Load and parse the data again for building the linear model:
# the label is the net hourly wage and the single feature is the Big Mac price
def parsePoint(line):
    values = [float(v) for v in line.split(',')]
    return LabeledPoint(values[1], [values[0]])

parsedData = data.map(parsePoint)

# Build the model (10 iterations, step size 0.01, fitting an intercept)
model = LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.01,
                                      intercept=True)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda kv: (kv[0] - kv[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print ("Model coefficients:" + str(model))


# fit the least-squares line with numpy and overlay it on the scatter plot
(m, b) = np.polyfit(x, y, 1)
print('Slope is ' + str(m))
print('Y intercept is ' + str(b))
yp = np.polyval([m, b], x)
scatterPlot(x, y, yp)
Finally, we can plot the best-fit line (fitted here with np.polyfit) and read off the regression equation for our model.

The model gives us the y-intercept as -4.154 and the slope as 3.547.
This gives us the regression equation as:
                Net Hourly Wage = -4.154 + 3.547 (Price of Big Mac)

While the y-intercept has virtually no practical meaning in this analysis, the slope indicates that for every dollar increase in the price of a Big Mac, there is an incremental increase of $3.547 in the net hourly wage of workers in a country.

Using this regression model, our second decision dilemma can also be solved – finding the net hourly wage for a country with a $3.00 Big Mac. This can be predicted by substituting x = 3 into the model:
                Net Hourly Wage = -4.154 + 3.547 (3) = $6.49

Hence, our model predicts that the net hourly wage for a country is $ 6.49 when the price of a Big Mac is $ 3.00.
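That substitution can be wrapped in a tiny helper; the function name is hypothetical, and the coefficients are the fitted values from above:

```python
# hypothetical helper wrapping the fitted regression equation
def predict_wage(big_mac_price, intercept=-4.154, slope=3.547):
    """Predict net hourly wage (US $) from a Big Mac price (US $)."""
    return intercept + slope * big_mac_price

print(round(predict_wage(3.00), 2))   # 6.49, matching the result above
```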

The code and the data can be accessed from the Google Drive link below:
https://drive.google.com/drive/folders/0B7DVvFs6qM1ddDl0NkNGM2NmeVk?usp=sharing


Wednesday, 7 December 2016

Python uses tree to predict diabetes..Don't Believe It....Read on!!!

In this blog, we will apply the decision tree algorithm in Python to predict diabetes in a dataset.

Let’s define the problem statement first:
We have to build a model that classifies patients in the diabetes dataset with as little misclassification as possible.

Just to give a small introduction to diabetes:
Diabetes is a disease in which the body is unable to properly use and store glucose (a form of sugar). Glucose backs up in the bloodstream — causing one’s blood glucose (sometimes referred to as blood sugar) to rise too high.

There are two major types of diabetes-
Type 1 (formerly called juvenile-onset or insulin-dependent) diabetes - Here the body completely stops producing any insulin, a hormone that enables the body to use glucose found in foods for energy. People with type 1 diabetes must take daily insulin injections to survive. This form of diabetes usually develops in children or young adults, but can occur at any age.

Type 2 (formerly called adult-onset or non-insulin-dependent) diabetes – It is when the body doesn’t produce enough insulin and/or is unable to use insulin properly (insulin resistance). This form of diabetes usually occurs in people who are over 40, overweight, and have a family history of diabetes, although today it is increasingly occurring in younger people, particularly adolescents.

Diabetes can occur in anyone. However, people who have close relatives with the disease are somewhat more likely to develop it. Other risk factors include obesity, high cholesterol, high blood pressure, and physical inactivity. The risk of developing diabetes also increases as people grow older. People who are over 40 and overweight are more likely to develop diabetes, although the incidence of type 2 diabetes in adolescents is growing. Also, people who develop diabetes while pregnant (a condition called gestational diabetes) are more likely to develop full-blown diabetes later in life.

Nowadays, large amounts of information are collected in the form of patient records by hospitals. Knowledge discovery for predictive purposes is done through data mining, an analysis technique that helps in drawing inferences. This method supports decision-making through algorithms applied to the large amounts of data generated by medical centers. Considering the importance of early medical diagnosis of this disease, data mining techniques can help in detecting diabetes at an early stage and treating it, which may help in avoiding complications.
This blog focuses on diabetes recorded in pregnant women. Here, the decision tree algorithm has been used on the train dataset to predict whether or not diabetes is recorded in a patient. The results and application of the algorithm are presented below.

The tools and classifier used:
We create a mining model based on a classification algorithm in order to provide a simpler solution to the problem of diagnosing diabetes in women. The results have been analyzed using statistical methods and are presented in the sections below.

Here we have used Python as our tool for the analysis. We have used the decision tree algorithm, a flowchart-like tree structure used as a method for classification and prediction, represented with nodes and internodes. The root and internal nodes hold the test cases used to separate instances with different features; internal nodes are themselves the result of attribute test cases, and leaf nodes denote the class variable.

Here the decision tree provides a powerful technique for classification and prediction in the diabetes diagnosis problem. Each node of the decision tree is found by calculating the highest information gain over all attributes, and if a specific attribute gives an unambiguous end product (an explicit classification of the class attribute), that branch is terminated and the target value is assigned to it.
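A rough sketch of that calculation for a single candidate split (toy counts, not the diabetes data): information gain is the parent node's entropy minus the weighted entropy of the children.

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a two-class node with pos/neg instances."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# parent node: 8 positive, 8 negative instances (hypothetical)
parent = entropy(8, 8)                      # maximally impure: 1.0 bit

# candidate split -> left child (7, 1), right child (1, 7)
left, right = entropy(7, 1), entropy(1, 7)
weighted = (8 / 16) * left + (8 / 16) * right

# information gain of the split; the tree picks the attribute maximizing this
info_gain = parent - weighted
```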

A little brief about the diabetes dataset and its attributes:
Dataset            No. of Attributes   No. of Instances
Original Dataset   8                   768
Train Dataset      8                   614
Test Dataset       8                   154

S.No.   Attribute                      Variable name in the dataset
1       Pregnancies count              pregnancies
2       Plasma glucose concentration   plasma glucose
3       Diastolic blood pressure       blood pressure
4       Triceps skin fold thickness    triceps skin thickness
5       Insulin                        insulin
6       Body mass index (kg/m2)        bmi
7       Diabetes pedigree function     diabetes pedigree
8       Age (years)                    age
9       Diabetes (True / False)        diabetes

-          Diabetes is the dependent variable we need to predict from the model we will build; it is suspected to be affected by the independent variables
-          The other 8 variables are the independent variables suspected to affect the dependent variable

The methodology used:

We first load the file from our location and check the data structure and do a univariate analysis of the data.
import pandas as pd
%matplotlib inline 
location = r'C:\Users\Rajat\Documents\Python\Assignment\Custom_Diabetes_Dataset.csv'
diabetes_data = pd.read_csv(location)
diabetes_data.head(5)
diabetes_data.info()

We find that there are ‘0’ values in the dataset. The variables which have ‘0’ values are:
-          Plasma Glucose
-          Blood Pressure
-          Triceps Skin Thickness
-          Insulin
-          BMI

We will fill in these '0' values with their respective column medians. The rationale for using the median is that, since this dataset describes living patients, these variables cannot genuinely be '0'; a zero therefore marks a missing measurement.
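As a compact sketch of this idea (a toy frame with made-up values, reusing two of the dataset's column names), zeros can be replaced by each column's median in a loop; as in the per-column code that follows, the median is computed over the raw column, zeros included:

```python
import pandas as pd

# toy frame standing in for the real file (values are made up)
df = pd.DataFrame({'plasma glucose': [148, 0, 183, 89, 0],
                   'bmi': [33.6, 26.6, 0, 28.1, 43.1]})

# replace zeros in each column with that column's median
for col in ['plasma glucose', 'bmi']:
    df[col] = df[col].replace(0, df[col].median())
```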

We can then check the distribution of the dataset before and after this conversion.
-          For Plasma Glucose

diabetes_data['plasma glucose'].hist()

diabetes_data['plasma glucose'] = diabetes_data['plasma glucose'].replace(0, diabetes_data['plasma glucose'].median())

diabetes_data['plasma glucose'].hist()


-          For Blood pressure

diabetes_data['blood pressure'].hist()

diabetes_data['blood pressure'] = diabetes_data['blood pressure'].replace(0, diabetes_data['blood pressure'].median())
diabetes_data['blood pressure'].hist()


-          For Triceps Skin Thickness
diabetes_data['triceps skin thickness'].hist()
diabetes_data['triceps skin thickness'] = diabetes_data['triceps skin thickness'].replace(0, diabetes_data['triceps skin thickness'].median())
diabetes_data['triceps skin thickness'].hist()


-          For Insulin
diabetes_data['insulin'].hist()
diabetes_data['insulin'] = diabetes_data['insulin'].replace(0, diabetes_data['insulin'].median())
diabetes_data['insulin'].hist()


-          For BMI

diabetes_data['bmi'].hist()

diabetes_data['bmi'] = diabetes_data['bmi'].replace(0, diabetes_data['bmi'].median())

diabetes_data['bmi'].hist()


Then, the data was divided into a training set (a random 80% of the dataset) and a test set (the remaining 20%).

train_set = diabetes_data.sample(frac = 0.8, random_state = 768)
test_set = diabetes_data.drop(train_set.index)

A model is then built by employing the decision tree classifier on the train dataset with "entropy" as the criterion, after which cases are classified as "tested-positive" or "tested-negative" depending on the final result of the constructed decision tree.
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion = "entropy") 

We separate the dependent variable from the train set and build a model from the remaining independent variables. We do the same for the test set.
train_result = train_set.drop(['diabetes'], axis = 1)
clf = clf.fit(train_result, train_set.diabetes)

test_result = test_set.drop(['diabetes'], axis = 1)

Then we apply the model to the test set (the features, excluding the dependent variable).

pred = clf.predict(test_result)


Now we check the accuracy and prediction capability of our model:

We then calculate the accuracy percentage of our model.

from sklearn.metrics import accuracy_score 
accuracy_score(test_set.diabetes, pred)
This accuracy, computed on the test set, can be examined in more detail through a confusion matrix.

from sklearn.metrics import confusion_matrix
c=confusion_matrix(test_set.diabetes, pred)


The confusion matrix gives us the output as a matrix where, in scikit-learn's convention, the rows indicate the actual results (the first row being the true actual condition, the second row the false actual condition) and the columns indicate the predicted results (the first column being true predictions of the negative class, the second column predictions of the positive class).


Here, we connect this with our model as follows – the rows correspond to the actual condition of the patients in the test dataset, where the first row denotes patients who are not diabetic and the second row patients who are diabetic. Similarly, the columns correspond to our model's predictions, where the first column denotes patients predicted as not diabetic and the second column patients predicted as diabetic.

Reading off the confusion matrix of our model, we can say that it correctly predicted 76 patients as non-diabetic whose actual condition was non-diabetic, and correctly predicted 24 cases as diabetic who actually have diabetes. That said, the model also makes errors: it predicts 31 cases as diabetic that actually do not have diabetes, and 23 cases as non-diabetic that actually do have diabetes.
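Working from those four counts, the overall accuracy can be reproduced by hand, matching what accuracy_score computes:

```python
# counts read off the confusion matrix above
tn, fp = 76, 31   # actually non-diabetic: correctly / incorrectly classified
fn, tp = 23, 24   # actually diabetic: incorrectly / correctly classified

total = tn + fp + fn + tp          # 154 test cases
accuracy = (tn + tp) / total       # proportion classified correctly

print(round(accuracy, 3))          # 0.649
```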

We can fine-tune our model in many ways to arrive at better accuracy, but since this blog is just an introduction to applying the decision tree algorithm in Python, we will not focus on those improvement techniques.

The code and the dataset can be accessed from the Google Drive link below:
https://drive.google.com/drive/folders/0B7DVvFs6qM1dYlBuX043WkEtM0k