Thursday, 8 December 2016

Can a hamburger set wages for a company??? Yes it cannnn…

Application of a linear regression algorithm in Python, in a Spark context environment, to predict international hourly wages from the price of a ‘Big Mac’ hamburger

Let’s define the managerial and statistical problem first:
The question is: can we develop a model that predicts the net hourly wage of a worker in a country from the price of a Big Mac hamburger in that country? And, given such a model, what net hourly wage would it predict for a country where a Big Mac costs $3.00?

To introduce this decision dilemma:
The McDonald’s Corporation is the leading global food-service retailer, with more than 36,615 outlets serving nearly 50 million people in more than 119 countries each day. This global presence, in addition to its consistency in food offerings and restaurant operations, makes McDonald’s a unique and attractive setting for economists to make salary and price comparisons around the world.

Because the Big Mac is a standardized hamburger produced and sold in virtually every McDonald’s around the world, The Economist, a weekly newspaper focusing on international politics and business news and opinion, was compiling information about Big Mac prices as an indicator of exchange rates as early as 1986. Building on this idea, some researchers proposed comparing wage rates across countries with the price of a Big Mac hamburger.

The tool used:
Here we use Spark from Python, but instead of launching Python inside the Spark environment (which by default starts with a Spark context already created), we create the Spark context ourselves. So, from the terminal window, we launch ‘ipython notebook’ from our Spark setup (without a Spark context at this point) and create the context once we are in the Jupyter notebook.

So now we are in the Jupyter notebook interface, and we create the Spark context that the rest of the code below relies on. We also import the libraries we will be using to build this model.
# __future__ imports must come before any other statement in the cell
from __future__ import division
import sys
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark import SparkConf
from pyspark import SparkContext

# Create a local Spark context that uses all available cores
conf = SparkConf().setMaster("local[*]").setAppName('pyspark')
sc = SparkContext(conf=conf)



The model used:
So far so good, so now we proceed to create a linear regression model which helps us to predict the net hourly wage of a worker by the price of a Big Mac.

Linear regression is a procedure that produces a mathematical model (function) that can be used to predict one variable by other variables. Simple regression is bivariate (two variables) and linear (only line fit is attempted). Simple regression analysis produces a model that attempts to predict a y variable, referred to as the dependent variable, by an x variable, referred to as the independent variable. The general form of the equation of the simple regression line is the slope-intercept equation of a line. The equation of the simple regression model consists of the slope of the line as a coefficient of x and a y-intercept value as a constant.

After the equation of the line has been developed, several statistics are available that can be used to determine how well the line fits the data. Using the historical (or given) values of x, predicted values of y can be calculated by inserting values of x into the regression equation. The predicted values can then be compared with the actual values of y to determine how well the regression equation fits the known data.
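
As a minimal sketch of the idea above, the slope and y-intercept of the simple regression line can be computed directly from the usual least-squares formulas. The function name fit_simple_regression and the four demo points are made up for illustration only (they are not the Big Mac data), and NumPy is assumed to be available.

```python
import numpy as np

def fit_simple_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum of cross-deviations / sum of squared deviations of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the least-squares line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up illustrative points lying exactly on y = 1 + 2x (NOT the Big Mac data)
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [3.0, 5.0, 7.0, 9.0]

slope_demo, intercept_demo = fit_simple_regression(x_demo, y_demo)

# Predicted y values, obtained by inserting the known x values into the equation
predicted_demo = intercept_demo + slope_demo * np.asarray(x_demo)
```

Comparing predicted_demo with y_demo is the "predicted versus actual" check described above; for real data the two will differ, and the size of those differences tells us how well the line fits.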

A brief note about the Big Mac dataset and its attributes:
We have data of Big Mac prices and net hourly wage figures (both in US dollars) for 27 countries. The net hourly wages are based on a weighted average of 12 professions.

Country          Big Mac price (US $)   Net hourly wage (US $)
Argentina        1.78                   3.3
Australia        3.84                   14
Brazil           4.91                   4.3
Britain          3.48                   13.9
Canada           4                      12.8
Chile            3.34                   3.1
China            1.95                   3
Czech Republic   3.43                   5.1
Denmark          4.9                    17.7
Hungary          3.33                   3
Indonesia        2.51                   1.3
Japan            3.67                   15.7
Malaysia         2.19                   3.1
Mexico           2.5                    1.8
New Zealand      3.59                   8.4
Philippines      2.19                   1.4
Poland           2.6                    4.1
Russia           2.33                   5.9
Singapore        3.08                   5.9
South Africa     2.45                   5.1
South Korea      2.82                   6.1
Sweden           6.56                   13.5
Switzerland      6.19                   22.6
Thailand         2.17                   2.6
Turkey           3.89                   4.3
UAE              2.99                   10.1
United States    3.73                   16.5


The methodology used: 
Here the dependent variable y is the ‘Net hourly wage’ and the independent variable x is the ‘Big Mac price’. We first plot both variables on a scatter plot to see whether any relationship exists between them, and of what type, before moving on to the rest of the analysis.

# Load the data: column x is the Big Mac price, column y is the net hourly wage
df = pd.read_csv('Big_Mac.csv', header=None, names=['x', 'y'])
x = np.array(df.x)
y = np.array(df.y)
theta = np.zeros((2, 1))

# Scatter plot of the data, with an optional fitted line and an option to save the figure.
def scatterPlot(x, y, yp=None, savePng=False):
    plt.xlabel('Big Mac prices in US $')
    plt.ylabel('Net Hourly wages in US $')
    plt.scatter(x, y, marker='x')

    # "is not None" avoids an element-wise comparison when yp is a NumPy array
    if yp is not None:
        plt.plot(x, yp)

    if not savePng:
        plt.show()
    else:
        # raw_input is Python 2; use input() on Python 3
        name = raw_input('Figure: ')
        plt.savefig(name + '.png')

scatterPlot(x, y)
From the scatter plot we can see that these two variables have a positive relationship.

So now we compute the correlation matrix to see how strongly the two variables are correlated. The code below requests Spearman’s rank correlation coefficient (method = "spearman").

data = sc.textFile("Big_Mac.csv")

def parseVector(line):
    # data values separated by commas
    return np.array([float(x) for x in line.split(',')])

parsedData_corr = data.map(parseVector)

from pyspark.mllib.stat import Statistics 
correlation_matrix = Statistics.corr(parsedData_corr, method = "spearman")

headers = ["Big Mac price","Net Hourly wage"]
corr_matrix = pd.DataFrame(correlation_matrix, index = headers, columns = headers)
corr_matrix

We can see that there is a relatively high positive correlation between these two variables: the coefficient is about 0.7062. Note that this measures how strongly net hourly wages and Big Mac prices move together; it is not the share of variability in net hourly wages explained by Big Mac prices. That share is given by the R-squared of the fitted regression.

Hence, we can go for linear regression technique to build our prediction model.
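
To relate a correlation coefficient to the proportion of variability explained by a regression line, here is a small sketch using only NumPy. It uses just the first five rows of the table above, so the numbers it produces are illustrative and are not the results for the full dataset. The variable names are assumptions for this example. The point it shows is that, for a simple linear regression, the squared Pearson correlation equals R-squared.

```python
import numpy as np

# First five rows of the table above (Argentina, Australia, Brazil, Britain, Canada)
price = np.array([1.78, 3.84, 4.91, 3.48, 4.00])
wage = np.array([3.3, 14.0, 4.3, 13.9, 12.8])

# Pearson correlation coefficient between price and wage
r = np.corrcoef(price, wage)[0, 1]

# Least-squares line wage = intercept + slope * price
slope, intercept = np.polyfit(price, wage, 1)
residuals = wage - (intercept + slope * price)

# R-squared: share of the variability in wage explained by the fitted line
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((wage - wage.mean()) ** 2)

print("Pearson r          = %.4f" % r)
print("r squared          = %.4f" % (r ** 2))
print("R-squared of fit   = %.4f" % r_squared)
```

The last two printed values agree, which is why it is the square of the Pearson coefficient, and not the coefficient itself, that can be read as "variability explained".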

# Load and parse the data again for building the linear model
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # Column 0 is the Big Mac price (feature x); column 1 is the net hourly wage (label y)
    values = [float(v) for v in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[1], [values[0]])

parsedData = data.map(parsePoint)

# Build the model: 10 iterations, step size 0.01 (no intercept is fitted by default)
model = LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.01)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda kv: (kv[0] - kv[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print("Model coefficients: " + str(model))

# Least-squares fit of a straight line (degree 1) with NumPy
(m, b) = np.polyfit(x, y, 1)
print('Slope is ' + str(m))
print('Y intercept is ' + str(b))

# Predicted wages on the fitted line, plotted over the scatter plot
yp = np.polyval([m, b], x)
scatterPlot(x, y, yp)
Finally, we plot the best-fit line (computed above with np.polyfit) and read off the regression equation.

The model gives us the y-intercept as -4.154 and the slope as 3.547.
This gives us the regression equation as:
                Net Hourly Wage =  - 4.154  +   3.547 (Price of Big Mac)

While the y-intercept has virtually no practical meaning in this analysis, the slope indicates that for every one-dollar increase in the price of a Big Mac, the predicted net hourly wage of workers in a country increases by $3.547.

Using this regression model, our second decision dilemma also can be solved - where we had to find out the net hourly wage for a country with a $ 3.00 Big Mac. This can be predicted by substituting x = 3 into the above model:
                Net Hourly Wage =  - 4.154  +   3.547 (3)  =  $ 6.49

Hence, our model predicts that the net hourly wage for a country is $ 6.49 when the price of a Big Mac is $ 3.00.
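
The substitution above can be wrapped in a small helper, sketched here under the assumption that the intercept and slope are exactly the rounded values quoted in the regression equation. The function name predict_net_hourly_wage is hypothetical and introduced only for this example.

```python
# Coefficients as quoted in the regression equation above (rounded values)
INTERCEPT = -4.154
SLOPE = 3.547

def predict_net_hourly_wage(big_mac_price):
    """Predicted net hourly wage (US $) for a given Big Mac price (US $)."""
    return INTERCEPT + SLOPE * big_mac_price

wage_at_3 = predict_net_hourly_wage(3.00)
print("Predicted net hourly wage: $%.2f" % wage_at_3)
```

Big Mac prices in the data range from $1.78 to $6.56, so predictions for prices well outside that range should be treated with caution.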

The code and the data can be accessed from the Google Drive link below:
https://drive.google.com/drive/folders/0B7DVvFs6qM1ddDl0NkNGM2NmeVk?usp=sharing

