In the question of Hadoop vs. Spark, the most accurate view is that their designers intended Hadoop and Spark to work together on the same team.
A direct comparison of Hadoop and Spark is difficult because they do many of the same things but are also non-overlapping in some areas.
For example, Spark has no file management of its own and therefore must rely on the Hadoop Distributed File System (HDFS) or some other solution. It is wiser to compare Hadoop MapReduce with Spark, because the two are more comparable as data processing engines.
As data science has matured over the past few years, so has the need for a different approach to data and its “bigness”. Deciding which platform is better requires examining attributes of each, including performance, fault tolerance, cost, ease of use, data processing, compatibility, and security.
The most important thing to remember about
Hadoop and Spark is that their use is not an either-or scenario because they are
not mutually exclusive. Nor is one necessarily a drop-in replacement for the
other. The two are compatible with each other and that makes their pairing an
extremely powerful solution for a variety of big data applications.
Hadoop:
Hadoop is an Apache.org project: a software library and framework that allows for distributed processing of large data sets (big data) across computer clusters using simple programming models. Hadoop can scale from a single computer system up to thousands of commodity systems that offer local storage and compute power.
Many companies that use big data sets and
analytics use Hadoop. It has become the de facto standard in big data
applications. Hadoop originally was designed to handle crawling and searching
billions of web pages and collecting their information into a database. The
result of the desire to crawl and search the web was Hadoop’s HDFS and its
distributed processing engine, MapReduce.
Principal Hadoop concepts are:
- Hadoop Distributed File System (HDFS) - HDFS distributes large files across multiple machines in a way that is invisible to the user. It is useful to companies when data sets become so large or so complex that their current solutions cannot effectively process the information in what the data users consider a reasonable amount of time.
- MapReduce - MapReduce is an excellent text processing engine and rightly so since crawling and searching the web (its first job) are both text-based tasks. It is a way to restructure a job so that it can be broken down into independent tasks:
- A Map Task
- A Reduce Task
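To make that split concrete, here is a minimal plain-Python sketch of the two phases of a MapReduce-style word count (map_task and reduce_task are illustrative names, not Hadoop's actual API):

```python
from itertools import groupby
from operator import itemgetter

def map_task(line):
    # Map phase: emit a (word, 1) pair for every word in one input line
    return [(word, 1) for word in line.split()]

def reduce_task(pairs):
    # Shuffle/sort by key, then Reduce phase: sum the counts per word
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

# Map tasks run independently, one per input split (here, per line)
lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in map_task(line)]
counts = reduce_task(mapped)  # e.g. counts["the"] == 2
```

Because each map task touches only its own split, the framework can scatter the map work across a cluster and shuffle the intermediate pairs to the reducers.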
Spark:
Spark is regarded as "a fast and general engine for large-scale data processing." Spark's in-memory processing is very fast (up to 100 times faster than Hadoop MapReduce), and it also runs up to 10 times faster on disk. It can also perform batch processing; however, it really excels at streaming workloads, interactive queries, and machine learning.
Usability is also improved through rich APIs in Java, Python, and Scala, and through its interactive shell, which reduces the amount of coding required (by up to 2-10 times less code).
Spark's big claim to fame is its real-time data processing capability as compared to MapReduce's disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules.
Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn't have its own distributed filesystem, but it can use HDFS.
Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage while Spark uses Resilient Distributed Datasets (RDDs), which make fault tolerance very efficient.
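A rough sketch of why RDDs make fault tolerance efficient: an RDD records its lineage (how it was derived from its parent) rather than checkpointing data to disk, so a lost result can simply be recomputed. ToyRDD below is a hypothetical illustration of that idea, not Spark's API:

```python
class ToyRDD:
    """Toy illustration of lineage-based recovery (not Spark's real API)."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # only the source RDD holds materialized data
        self.parent = parent   # lineage: where this dataset came from
        self.fn = fn           # lineage: how it was derived

    def map(self, fn):
        # A transformation only records lineage; nothing is computed yet
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # If a result is lost, replaying the lineage rebuilds it on demand
        if self.parent is None:
            return list(self._data)
        return [self.fn(x) for x in self.parent.compute()]

source = ToyRDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2)   # lineage only, no data copied
```

Recomputing from lineage avoids the cost of replicating intermediate results to persistent storage, which is what MapReduce relies on.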
Hadoop vs. Spark summary:
Upon first glance, it would seem that using Spark would be the default choice for any big data application. However, that's not the case. MapReduce has made inroads into the big data market for businesses that need huge datasets brought under control by commodity systems. Spark's speed, agility, and relative ease of use are perfect complements to MapReduce's low cost of operation.
The truth is that Spark and MapReduce have a symbiotic relationship. Hadoop provides features that Spark does not possess, such as a distributed file system, and Spark provides real-time, in-memory processing for those data sets that require it. The perfect big data scenario is exactly as the designers intended: Hadoop and Spark working together on the same team.
How to do it?
Now, the flip side is that Spark doesn't run comfortably on Windows, but it works perfectly on Linux.
Step 1:
If you have Windows, then you have 2 options to do the same:
- either make your system dual boot with Windows and Linux
- or create a virtual Linux machine in Windows
Step 2:
Install VirtualBox on Windows
We have to create a Linux Virtual machine in Windows by
downloading the Oracle VM VirtualBox for Windows and installing the same.
VirtualBox, originally developed by Sun Microsystems and now
owned by Oracle, can simulate a standalone computer. Each standalone computer
(aka virtual appliance) can run its own operating system (guest OS) and is
self-contained (delete the appliance and your host system is back to its
original state). The appliances can interact with each other and be a part of
your home network and will be treated as a separate system. Because of these
features and more, VirtualBox allows you to test various operating systems,
without making permanent changes to your host OS.
First, go to the VirtualBox download page,
scroll down and find the latest version (currently 5.1.0). Click on the latest
version number and then, on the following page, scroll down, find the .exe file,
and download it to a known location on your Windows computer.
Once the VirtualBox Windows installer is downloaded, run the executable file and follow the onscreen instructions to install VirtualBox on Windows.
After the VirtualBox installation finishes, you will have to restart your computer. After the reboot, VirtualBox should be available in your apps. You can now run VirtualBox and create a virtual machine with almost any OS.
Step 3:
Step 3.A - Install Ubuntu image in VirtualBox
Now that we have installed VirtualBox on Windows, we have to create a Linux virtual machine, and we start by installing an Ubuntu image in Oracle VM VirtualBox.
1. In VirtualBox, click the “New” button to start the virtual machine wizard.
2. Give your virtual machine a name and select the operating system you’ll be running, then click “Next”. For this example, you’ll be installing Ubuntu: type Ubuntu and, for “Operating System,” choose “Linux.” The version will automatically default to “Ubuntu.” Click “Next” when you’re done. (I already have an Ubuntu image, hence I’m naming this one Ubuntu1.)
3. Select the amount of memory your VM will use and click “Next.” When you chose your operating system in the previous step, VirtualBox automatically recommended the proper amount of memory to use. If you feel this amount isn’t correct, you can move the slider or type a new amount in the box. To run Spark, at least 6 GB of RAM needs to be allocated. Click “Next” when you’re done.
4. Click “Next” to create a new virtual hard disk, then click “Next” again. This opens a second wizard to create a new virtual hard disk.
5. Select VDI (VirtualBox Disk Image) and click on
‘Next’.
6. Select either "Fixed-Size Storage" or
"Dynamically Expanding Storage" depending upon your needs. A fixed
size storage is going to be the size of the virtual hard disk on the host OS
(e.g.: a virtual disk 8 GB will be 8 GB on the host OS's hard disk). A
dynamically expanding storage will be only the size of Ubuntu on your hard
disk, but will grow in size as files are added to it until it reaches its
limit.
7. Choose the default name and size of the virtual
hard disk. Again, VirtualBox recommends the proper size of your virtual hard
disk. If you feel this amount isn’t correct, you can move the slider or type a
new amount in the box. Click “Create” when you’re done and wait while
VirtualBox creates the new virtual hard disk. You will see your new virtual
machine in list.
Step 3.B - Giving Ubuntu1 path for setup in VirtualBox
Now that we have created the Ubuntu virtual machine in Windows, we have to give it the path to the Ubuntu ISO image so that it can boot the Ubuntu installer.
1. Select your new virtual machine. Once you've
done this, click the “Settings” button
2. Click the “Storage” tab. Click the “CD/DVD" icon with the "+" on it and select the ISO to mount by clicking “Add Optical Drive” and “Choose Disk”.
3. The Ubuntu ISO will be mounted under the controller device.
4. You may now close the settings window and return to the main window. Your Ubuntu machine is ready to boot now.
Step 3.C – Installing Ubuntu OS in the Ubuntu image in the VirtualBox
Now that we have created the Ubuntu virtual machine and given it the path of the setup file, we have to install the Ubuntu OS in this Ubuntu image.
1. Select your new virtual machine. Once you've done this, click the “Start” button.
2. The Ubuntu virtual machine will start in a separate window.
3. The machine will boot from the selected ISO and you will see a language option. Choose your preferred language and press Enter.
4. In the next window you will see the install options. You can choose to try Ubuntu without installing, install Ubuntu, check the disk and memory for defects and problems, or boot from an existing hard disk. Choose the "Install Ubuntu" option here.
5. Once Ubuntu has loaded, choose your language and click “Continue".
6. On the next screen, Ubuntu will give you a checklist and ask whether you want to update during the install. Choose your required option and click "Continue".
7. The next option will ask whether you want to delete all data and install, or create your own partitions from the option "Something Else".
8. Select your time zone from the map, then click “Continue.”
9. Click “Continue” to keep the default keyboard layout or choose your desired one.
10. Type your username in the first text box. This will automatically fill in the login name and computer name. Type your password, confirm it, and click "Continue".
11. Ubuntu will begin the installation now.
12. Once the installation is complete, click “Restart Now” to finish the installation.
13. The machine will restart and the installed Ubuntu will load from the hard disk. Provide the password for your username and log in to the main Ubuntu window.
14. Once your Ubuntu virtual machine starts, you will find that it runs in a small window. To make it occupy the full screen you need to install the VirtualBox Guest Additions. Once you have logged in to Ubuntu, click on the "Devices" menu in VirtualBox and select "Insert Guest Additions CD Image...".
15. When Ubuntu asks to install a program and needs a password, type your user password and click "Install Now."
16. Let the terminal program run, and when it has finished, press Enter.
17. Reboot your VM. Once it has booted, click on the "View" menu and click "Auto-resize Guest Display"; you will now have a full-resolution Ubuntu VM on your computer.
Step 4:
Install Java in Ubuntu
The virtual machine image does not come with Java, which is essential for Spark, so we have to install Java.
First, update the package index by using this command:
- sudo apt-get update
Next, install Java. Specifically, this command will install the Java Runtime Environment (JRE):
- sudo apt-get install default-jre
There is another default Java installation called the JDK (Java Development Kit). The JDK is usually only needed if you are going to compile Java programs or if the software that will use Java specifically requires it. The JDK does contain the JRE, so there are no disadvantages if you install the JDK instead of the JRE, except for the larger file size. You can install the JDK with the following command:
- sudo apt-get install default-jdk
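A quick way to confirm the install succeeded is sketched below; note that the JAVA_HOME path shown is only the usual location for Ubuntu's default-jdk package and may differ on your system:

```shell
# Print the installed Java version (note: java writes this to stderr)
java -version

# Optional: some tools want JAVA_HOME set. The path below is the usual
# one for Ubuntu's default-jdk; verify yours with:
#   readlink -f "$(command -v java)"
export JAVA_HOME=/usr/lib/jvm/default-java
```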
Step 5:
Install Spark in Ubuntu
I.
Python - The Ubuntu 16.10 virtual machine comes with Python 3.5 already installed, which is adequate if you want to use Spark at the command line. However, it is better to install the IPython notebook. There are many ways to install it, but the easiest is to download and install Anaconda from the Anaconda download page.
Note that this needs to be downloaded inside the Ubuntu guest OS and not the Windows host OS if we are using a VM. When the install script asks whether Anaconda should be placed in the system path, answer yes.
Start python and ipython from the Ubuntu prompt and you should see that Anaconda's version of Python is being loaded.
II.
Spark - the instructions given here have been derived from this page, but with some significant deviations to accommodate the current version of the IPython notebook.
1. Download the latest version of Spark from http://spark.apache.org/downloads.html
- Choose the latest version from the Spark releases.
- For the package type, do NOT choose the source code, as otherwise you will have to compile it. Instead, choose the package with the latest pre-built Hadoop.
- Choose direct download, not an Apache mirror.
- Download the .tgz file.
2. Unzip the tgz file, move the resulting directory to a convenient location, and give it a simple name. In our case it was /home/rajat/spark16.
3. Add the following lines to the end of the file .profile. If that does not take effect, type the same commands at the terminal:
- export SPARK_HOME=/home/rajat/spark16
- export PATH=$SPARK_HOME/bin:$PATH
- export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
- export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
- export SPARK_LOCAL_IP=localhost
To get the correct version of the py4j-n.n-src.zip file, go to $SPARK_HOME/python/lib (here /home/rajat/spark16/python/lib) and use the actual file name; in this case it is py4j-0.10.3-src.zip. The last two paths are required because in many cases the py4j library is not found otherwise.
Now there are 3 ways one can begin Spark:
1. To start Spark in command-line mode, enter the "pyspark" command and you should see the familiar Spark screen. To quit, enter exit().
2. To start Spark in the IPython notebook format, enter the command IPYTHON_OPTS="notebook" pyspark. Please note that the strategy of using profiles for starting the IPython notebook may not work, as the current version of Jupyter no longer supports profiles; hence this strategy was used. This will start the server and make it available at port 8888 on localhost. To quit, press Ctrl-C twice in quick succession.
3. An alternative way of starting the notebook, not involving IPYTHON_OPTS, is easier:
- Start the notebook with ipython notebook (or alternatively, jupyter notebook).
- Execute these two lines in the first cell of the notebook:
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
III.
Now that we have Spark running on our Ubuntu machine, we can check its status at http://localhost:4040
Step 6:
Run a simple sample program to check that everything is working.
Now we will run a sample word-count program on Spark and Python using each of the 3 ways to initiate the Spark/Python environment.
Enter each of these lines as a command at the
pyspark prompt:
text = sc.textFile("datafile.txt")
print(text)
from operator import add
def tokenize(text):
    return text.split()
words = text.flatMap(tokenize)
print(words)
wc = words.map(lambda x: (x, 1))
print(wc.toDebugString())
counts = wc.reduceByKey(add)
counts.saveAsTextFile("output-dir")
The final output in Hadoop style will be stored in a directory called
"output-dir". Remember Hadoop, and hence Spark, does not allow the
same output directory to be reused.
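As a sanity check on what the pipeline above should produce, here is a plain-Python equivalent in which each step mirrors one Spark operation (the sample line is made up; substitute the contents of your datafile.txt):

```python
from collections import Counter

lines = ["to be or not to be"]                       # stand-in for datafile.txt
words = [w for line in lines for w in line.split()]  # flatMap(tokenize)
pairs = [(w, 1) for w in words]                      # map(lambda x: (x, 1))
counts = Counter()
for word, n in pairs:                                # reduceByKey(add)
    counts[word] += n
# counts now holds {"to": 2, "be": 2, "or": 1, "not": 1}
```

The per-word totals here should match the pairs Spark writes into the part files under "output-dir".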
The same commands can also be entered one by one in the IPython notebook with the Spark context environment, and we will get the same output.
If we want to run the same program in an IPython notebook without the Spark context environment, we first create the SparkContext as shown in Step 5, then run the same commands one by one, and the output will be the same.
If we want to run the same program in command-line mode, we enter the same commands one by one at the command prompt and run them, and the output will again be the same.
This establishes that we have Spark and Python working
smoothly on our machine.