Applying linear regression in Python with a Spark context to predict international hourly wages from the price of a Big Mac hamburger
Let’s define the managerial and statistical problem first:
Is it possible to develop a model that predicts the net hourly wage of a worker in a given country from the price of a Big Mac hamburger in that country? And, once we have such a model, can we predict the net hourly wage of a worker in a country where a Big Mac costs $3.00?
To introduce this decision dilemma:
The McDonald’s Corporation is the leading global food-service retailer, with more than 36,615 outlets serving nearly 50 million people in more than 119 countries each day. This global presence, in addition to its consistency in food offerings and restaurant operations, makes McDonald’s a unique and attractive setting for economists to make salary and price comparisons around the world.
Because the Big Mac is a standardized hamburger produced and sold in virtually every McDonald’s around the world, The Economist, a weekly newspaper focusing on international politics and business news and opinion, began compiling information about Big Mac prices as an indicator of exchange rates as early as 1986. Building on this idea, some researchers proposed comparing wage rates across countries with the price of a Big Mac hamburger.
The tool used:
Here we use Spark from Python. Rather than running Python inside the Spark environment, which by default starts with the Spark context already in it, we run Spark from the terminal window and call ‘ipython notebook’ without a Spark context, then create the context ourselves once we are inside the Jupyter notebook.
Now that we are in the Jupyter notebook interface, we create the Spark context used by the code below and import the libraries needed for building this model.
from __future__ import division  # Python 2 only; must be the first statement in the cell

import sys
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark import SparkConf, SparkContext

# Run Spark locally on all available cores under the application name 'pyspark'
conf = SparkConf().setMaster("local[*]").setAppName('pyspark')
sc = SparkContext(conf=conf)
The model used:
So far so good. We now proceed to create a linear regression model that predicts the net hourly wage of a worker from the price of a Big Mac.
Linear regression is a procedure that produces a mathematical model (function) that can be used to predict one variable from other variables. Simple regression is bivariate (two variables) and linear (only a straight-line fit is attempted). Simple regression analysis produces a model that attempts to predict a y variable, referred to as the dependent variable, from an x variable, referred to as the independent variable. The general form of the equation of the simple regression line is the slope-intercept equation of a line: the model consists of the slope of the line as the coefficient of x and a y-intercept value as a constant.
After the equation of the line has been developed, several
statistics are available that can be used to determine how well the line fits
the data. Using the historical (or given) values of x, predicted values of y
can be calculated by inserting values of x into the regression equation. The
predicted values can then be compared with the actual values of y to determine
how well the regression equation fits the known data.
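As a minimal sketch of the procedure just described, the snippet below computes the least-squares slope and intercept from their closed-form expressions, generates predicted values of y, and compares them with the actual values through the sum of squared errors and the coefficient of determination. The five data points and the helper name `fit_simple_regression` are made up for illustration and are not part of the Big Mac dataset.

```python
import numpy as np

def fit_simple_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = co-variation of x and y divided by the variation of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The least-squares line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical toy data, for illustration only
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

slope, intercept = fit_simple_regression(x_toy, y_toy)
y_pred = intercept + slope * x_toy          # predicted values of y
sse = np.sum((y_toy - y_pred) ** 2)         # variation left unexplained by the line
sst = np.sum((y_toy - y_toy.mean()) ** 2)   # total variation in y
r_squared = 1 - sse / sst                   # share of variation explained

print("slope = %.3f, intercept = %.3f" % (slope, intercept))
print("SSE = %.3f, R^2 = %.3f" % (sse, r_squared))
```

The closer the sum of squared errors is to zero (and R^2 to 1), the better the line fits the known data.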
A brief note about the Big Mac dataset and its attributes:
We have data on Big Mac prices and net hourly wages (both in US dollars) for 27 countries. The net hourly wages are based on a weighted average of 12 professions.
Country | Big Mac price (US $) | Net hourly wage (US $)
Argentina | 1.78 | 3.3
Australia | 3.84 | 14
Brazil | 4.91 | 4.3
Britain | 3.48 | 13.9
Canada | 4 | 12.8
Chile | 3.34 | 3.1
China | 1.95 | 3
Czech Republic | 3.43 | 5.1
Denmark | 4.9 | 17.7
Hungary | 3.33 | 3
Indonesia | 2.51 | 1.3
Japan | 3.67 | 15.7
Malaysia | 2.19 | 3.1
Mexico | 2.5 | 1.8
New Zealand | 3.59 | 8.4
Philippines | 2.19 | 1.4
Poland | 2.6 | 4.1
Russia | 2.33 | 5.9
Singapore | 3.08 | 5.9
South Africa | 2.45 | 5.1
South Korea | 2.82 | 6.1
Sweden | 6.56 | 13.5
Switzerland | 6.19 | 22.6
Thailand | 2.17 | 2.6
Turkey | 3.89 | 4.3
UAE | 2.99 | 10.1
United States | 3.73 | 16.5
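The code in the following sections reads a file named Big_Mac.csv. A minimal sketch for recreating such a file from the table is given here. It assumes the layout implied by the `pd.read_csv(..., header = None, names = ['x','y'])` call used later: no header row, the Big Mac price in the first column and the net hourly wage in the second. The name `write_big_mac_csv` is illustrative, and the author's actual file may be laid out differently.

```python
# Rows typed in from the table: (country, Big Mac price in US $, net hourly wage in US $)
BIG_MAC_ROWS = [
    ("Argentina", 1.78, 3.3), ("Australia", 3.84, 14), ("Brazil", 4.91, 4.3),
    ("Britain", 3.48, 13.9), ("Canada", 4, 12.8), ("Chile", 3.34, 3.1),
    ("China", 1.95, 3), ("Czech Republic", 3.43, 5.1), ("Denmark", 4.9, 17.7),
    ("Hungary", 3.33, 3), ("Indonesia", 2.51, 1.3), ("Japan", 3.67, 15.7),
    ("Malaysia", 2.19, 3.1), ("Mexico", 2.5, 1.8), ("New Zealand", 3.59, 8.4),
    ("Philippines", 2.19, 1.4), ("Poland", 2.6, 4.1), ("Russia", 2.33, 5.9),
    ("Singapore", 3.08, 5.9), ("South Africa", 2.45, 5.1), ("South Korea", 2.82, 6.1),
    ("Sweden", 6.56, 13.5), ("Switzerland", 6.19, 22.6), ("Thailand", 2.17, 2.6),
    ("Turkey", 3.89, 4.3), ("UAE", 2.99, 10.1), ("United States", 3.73, 16.5),
]

def write_big_mac_csv(path):
    """Write one 'price,wage' line per country, with no header row (assumed layout)."""
    with open(path, "w") as f:
        for _country, price, wage in BIG_MAC_ROWS:
            f.write("%s,%s\n" % (price, wage))

if __name__ == "__main__":
    write_big_mac_csv("Big_Mac.csv")
```

The country names are kept in the list for readability only; they are left out of the file so that every line parses as two numbers.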
The methodology used:
Here the dependent variable y is the net hourly wage and the independent variable x is the Big Mac price. We first plot the two variables on a scatter plot to see whether a relationship exists between them, and of what type, before moving on to further analysis.
# Load the data: column 0 is the Big Mac price (x), column 1 the net hourly wage (y)
df = pd.read_csv('Big_Mac.csv', header=None, names=['x', 'y'])
x = np.array(df.x)
y = np.array(df.y)
theta = np.zeros((2, 1))

# Scatter plot of the data, optionally with fitted values yp and an option to save the figure
def scatterPlot(x, y, yp=None, savePng=False):
    plt.xlabel('Big Mac prices in US $')
    plt.ylabel('Net Hourly wages in US $')
    plt.scatter(x, y, marker='x')
    if yp is not None:  # '!= None' would compare element-wise on a NumPy array
        plt.plot(x, yp)
    if not savePng:
        plt.show()
    else:
        name = raw_input('Figure: ')  # Python 2; use input() on Python 3
        plt.savefig(name + '.png')

scatterPlot(x, y)
From the scatter plot we can see that these two variables have a positive relationship.
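Next, we compute the correlation matrix to measure the degree of correlation between the two variables. Note that the call below passes method = "spearman", which yields Spearman’s rank correlation coefficient; passing method = "pearson" would yield Pearson’s coefficient of correlation instead.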
from pyspark.mllib.stat import Statistics

data = sc.textFile("Big_Mac.csv")

def parseVector(line):
    # data values separated by commas
    return np.array([float(v) for v in line.split(',')])

parsedData_corr = data.map(parseVector)

# method="spearman" gives the rank correlation; use method="pearson" for Pearson's r
correlation_matrix = Statistics.corr(parsedData_corr, method="spearman")
headers = ["Big Mac price", "Net Hourly wage"]
corr_matrix = pd.DataFrame(correlation_matrix, index=headers, columns=headers)
corr_matrix

We can see that there is a relatively high positive correlation between these two variables, with a coefficient of 0.7062. Keep in mind that a correlation coefficient is not itself the share of variability in the response variable (net hourly wages) explained by the explanatory variable (Big Mac prices); that share is the coefficient of determination, which for simple regression is the square of Pearson’s coefficient.
Hence, we can use linear regression to build our prediction model.
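As a cross-check that does not need Spark, the following sketch computes Pearson's coefficient r with NumPy on the 27 rows of the table and squares it to obtain the coefficient of determination, the quantity that measures how much of the variability in wages a straight-line fit on price accounts for. The values are typed in from the table and the variable names are illustrative.

```python
import numpy as np

# Big Mac prices and net hourly wages (US $), in the row order of the table
price = np.array([1.78, 3.84, 4.91, 3.48, 4.00, 3.34, 1.95, 3.43, 4.90,
                  3.33, 2.51, 3.67, 2.19, 2.50, 3.59, 2.19, 2.60, 2.33,
                  3.08, 2.45, 2.82, 6.56, 6.19, 2.17, 3.89, 2.99, 3.73])
wage = np.array([3.3, 14.0, 4.3, 13.9, 12.8, 3.1, 3.0, 5.1, 17.7,
                 3.0, 1.3, 15.7, 3.1, 1.8, 8.4, 1.4, 4.1, 5.9,
                 5.9, 5.1, 6.1, 13.5, 22.6, 2.6, 4.3, 10.1, 16.5])

r = np.corrcoef(price, wage)[0, 1]   # Pearson's coefficient of correlation
r_squared = r ** 2                   # coefficient of determination (simple regression)

print("Pearson r = %.4f" % r)
print("r squared = %.4f" % r_squared)
```

Because r lies between -1 and 1, r squared is never larger than the absolute value of r, so reading r directly as a percentage of variability explained overstates the fit.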
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

# Load and parse the data again for building the linear model.
# Column 0 is the Big Mac price (feature), column 1 the net hourly wage (label).
def parsePoint(line):
    values = [float(v) for v in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[1], values[:1])

parsedData = data.map(parsePoint)

# Build the model: 10 iterations, step size 0.01 (no intercept is fitted by default)
model = LinearRegressionWithSGD.train(parsedData, 10, .01)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda kv: (kv[0] - kv[1])**2).reduce(lambda a, b: a + b) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print("Model coefficients:" + str(model))

# Ordinary least-squares fit with NumPy; this is the line that gets plotted
(m, b) = np.polyfit(x, y, 1)
print('Slope is ' + str(m))
print('Y intercept is ' + str(b))
yp = np.polyval([m, b], x)
scatterPlot(x, y, yp)
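The `LinearRegressionWithSGD` call above runs only 10 iterations with a step size of 0.01 and, by default, fits no intercept, so its coefficient should not be expected to match the `np.polyfit` line. To illustrate how an iterative fit approaches the least-squares solution when given enough iterations, here is a plain batch gradient-descent sketch in NumPy. It is not MLlib's implementation; the learning rate, the iteration count and the function name `gradient_descent_fit` are assumptions chosen for this data.

```python
import numpy as np

# Big Mac prices and net hourly wages (US $), typed in from the table
price = np.array([1.78, 3.84, 4.91, 3.48, 4.00, 3.34, 1.95, 3.43, 4.90,
                  3.33, 2.51, 3.67, 2.19, 2.50, 3.59, 2.19, 2.60, 2.33,
                  3.08, 2.45, 2.82, 6.56, 6.19, 2.17, 3.89, 2.99, 3.73])
wage = np.array([3.3, 14.0, 4.3, 13.9, 12.8, 3.1, 3.0, 5.1, 17.7,
                 3.0, 1.3, 15.7, 3.1, 1.8, 8.4, 1.4, 4.1, 5.9,
                 5.9, 5.1, 6.1, 13.5, 22.6, 2.6, 4.3, 10.1, 16.5])

def gradient_descent_fit(x, y, learning_rate=0.05, iterations=20000):
    """Fit y ~ intercept + slope * x by batch gradient descent on the squared error."""
    X = np.column_stack([np.ones_like(x), x])   # column of ones carries the intercept
    theta = np.zeros(2)                         # [intercept, slope], starting at zero
    n = len(y)
    for _ in range(iterations):
        gradient = X.T.dot(X.dot(theta) - y) / n   # gradient of half the mean squared error
        theta -= learning_rate * gradient
    return theta

b_gd, m_gd = gradient_descent_fit(price, wage)
m_ls, b_ls = np.polyfit(price, wage, 1)         # closed-form least squares, for comparison

print("gradient descent: slope = %.3f, intercept = %.3f" % (m_gd, b_gd))
print("np.polyfit:       slope = %.3f, intercept = %.3f" % (m_ls, b_ls))
```

With unscaled prices the intercept converges slowly, which is why the sketch uses thousands of iterations rather than ten.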
Finally, we plot the best-fit line given by our model and read off the regression equation.
The least-squares fit gives a y-intercept of -4.154 and a slope of 3.547.
This gives us the regression equation:
Net Hourly Wage = -4.154 + 3.547 (Price of Big Mac)
While the y-intercept has virtually no practical meaning in this analysis, the slope indicates that each one-dollar increase in the price of a Big Mac is associated with an increase of $3.547 in the net hourly wage of workers in a country.
Using this regression model, our second decision dilemma can also be solved: finding the net hourly wage for a country with a $3.00 Big Mac. This can be predicted by substituting x = 3 into the above model:
Net Hourly Wage = -4.154 + 3.547 (3) = $6.49
Hence, our model predicts that the net hourly wage for a country is $6.49 when the price of a Big Mac is $3.00.
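The substitution can be reproduced in a few lines. This sketch refits the line with `np.polyfit` on the values typed in from the table and evaluates it at a Big Mac price of $3.00; the helper name `predict_wage` is an illustrative assumption.

```python
import numpy as np

# Big Mac prices and net hourly wages (US $), typed in from the table
price = np.array([1.78, 3.84, 4.91, 3.48, 4.00, 3.34, 1.95, 3.43, 4.90,
                  3.33, 2.51, 3.67, 2.19, 2.50, 3.59, 2.19, 2.60, 2.33,
                  3.08, 2.45, 2.82, 6.56, 6.19, 2.17, 3.89, 2.99, 3.73])
wage = np.array([3.3, 14.0, 4.3, 13.9, 12.8, 3.1, 3.0, 5.1, 17.7,
                 3.0, 1.3, 15.7, 3.1, 1.8, 8.4, 1.4, 4.1, 5.9,
                 5.9, 5.1, 6.1, 13.5, 22.6, 2.6, 4.3, 10.1, 16.5])

# Degree-1 polynomial fit returns the coefficients highest power first: (slope, intercept)
slope, intercept = np.polyfit(price, wage, 1)

def predict_wage(big_mac_price):
    """Predicted net hourly wage (US $) for a given Big Mac price (US $)."""
    return intercept + slope * big_mac_price

print("Net Hourly Wage = %.3f + %.3f (Price of Big Mac)" % (intercept, slope))
print("Predicted wage at $3.00: $%.2f" % predict_wage(3.0))
```

Predictions of this kind are most trustworthy for prices inside the observed range of the data, here $1.78 to $6.56.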
https://drive.google.com/drive/folders/0B7DVvFs6qM1ddDl0NkNGM2NmeVk?usp=sharing