In the question of Hadoop vs. Spark, the most accurate view is that their designers intended Hadoop and Spark to work together on the same team.
A direct comparison of Hadoop and Spark is difficult because they do many of the same things but are also non-overlapping in some areas.
For example, Spark has no file management of its own and therefore must rely on the Hadoop Distributed File System (HDFS) or some other solution. It is wiser to compare Hadoop MapReduce with Spark, because the two are more comparable as data processing engines.
As data science has matured over the past few years, so has the need for a different approach to data and its “bigness”. Deciding which platform is better requires examining attributes of each, including performance, fault tolerance, cost, ease of use, data processing, compatibility, and security.
The most important thing to remember about
Hadoop and Spark is that their use is not an either-or scenario because they are
not mutually exclusive. Nor is one necessarily a drop-in replacement for the
other. The two are compatible with each other and that makes their pairing an
extremely powerful solution for a variety of big data applications.
Hadoop:
Hadoop is an Apache.org project: a software library and framework that allows for distributed processing of large data sets (big data) across computer clusters using simple programming models. Hadoop can scale from a single computer system up to thousands of commodity systems that offer local storage and compute power.
Many companies that use big data sets and
analytics use Hadoop. It has become the de facto standard in big data
applications. Hadoop originally was designed to handle crawling and searching
billions of web pages and collecting their information into a database. The
result of the desire to crawl and search the web was Hadoop’s HDFS and its
distributed processing engine, MapReduce.
Principal Hadoop concepts are:
- Hadoop Distributed File System (HDFS) - HDFS distributes large files across multiple machines in a way that is invisible to the user. It is useful to companies when data sets become so large or so complex that their current solutions cannot effectively process the information in what the data users consider a reasonable amount of time.
- MapReduce - MapReduce is an excellent text processing engine and rightly so since crawling and searching the web (its first job) are both text-based tasks. It is a way to restructure a job so that it can be broken down into independent tasks:
- A Map Task
- A Reduce Task
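To make that split concrete, here is a minimal plain-Python sketch of the two phases of a MapReduce-style word count (map_task and reduce_task are illustrative names, not Hadoop's actual API):

```python
from itertools import groupby
from operator import itemgetter

def map_task(line):
    # Map phase: emit a (word, 1) pair for every word in one input line
    return [(word, 1) for word in line.split()]

def reduce_task(pairs):
    # Shuffle/sort by key, then Reduce phase: sum the counts per word
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

# Map tasks run independently, one per input split (here, per line)
lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in map_task(line)]
counts = reduce_task(mapped)  # e.g. counts["the"] == 2
```

Because each map task touches only its own split, the framework can scatter the map work across a cluster and shuffle the intermediate pairs to the reducers.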
Spark:
Spark is regarded as "a fast and general engine for large-scale data processing." Spark's in-memory processing is very fast (up to 100 times faster than Hadoop MapReduce), and it also runs up to 10 times faster on disk. It can also perform batch processing; however, it really excels at streaming workloads, interactive queries, and machine learning.
Usability is also improved through rich APIs in Java, Python, and Scala, and through its interactive shell, which reduces the amount of coding required (by up to 2-10 times less code).
Spark's big claim to fame is its real-time data processing capability as compared to MapReduce's disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules.
Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn't have its own distributed filesystem, but it can use HDFS.
Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage while Spark uses Resilient Distributed Datasets (RDDs), which make fault tolerance very efficient.
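A rough sketch of why RDDs make fault tolerance efficient: an RDD records its lineage (how it was derived from its parent) rather than checkpointing data to disk, so a lost result can simply be recomputed. ToyRDD below is a hypothetical illustration of that idea, not Spark's API:

```python
class ToyRDD:
    """Toy illustration of lineage-based recovery (not Spark's real API)."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # only the source RDD holds materialized data
        self.parent = parent   # lineage: where this dataset came from
        self.fn = fn           # lineage: how it was derived

    def map(self, fn):
        # A transformation only records lineage; nothing is computed yet
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # If a result is lost, replaying the lineage rebuilds it on demand
        if self.parent is None:
            return list(self._data)
        return [self.fn(x) for x in self.parent.compute()]

source = ToyRDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2)   # lineage only, no data copied
```

Recomputing from lineage avoids the cost of replicating intermediate results to persistent storage, which is what MapReduce relies on.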
Hadoop vs. Spark summary:
Upon first glance, it would seem that using Spark would be the default choice for any big data application. However, that's not the case. MapReduce has made inroads into the big data market for businesses that need huge datasets brought under control by commodity systems. Spark's speed, agility, and relative ease of use are perfect complements to MapReduce's low cost of operation.
The truth is that Spark and MapReduce have a symbiotic relationship. Hadoop provides features that Spark does not possess, such as a distributed file system, and Spark provides real-time, in-memory processing for those data sets that require it. The perfect big data scenario is exactly as the designers intended: Hadoop and Spark working together on the same team.
How to do it?
Now, the flip side is that Spark doesn't run comfortably on Windows, but it works perfectly on Linux.
Step 1:
If you have Windows, then you have 2 options to do the same:
- either make your system dual boot with Windows and Linux
- or create a virtual Linux machine in Windows
Step 2:
Install VirtualBox on Windows
We have to create a Linux Virtual machine in Windows by
downloading the Oracle VM VirtualBox for Windows and installing the same.
VirtualBox, originally developed by Sun Microsystems and now
owned by Oracle, can simulate a standalone computer. Each standalone computer
(aka virtual appliance) can run its own operating system (guest OS) and is
self-contained (delete the appliance and your host system is back to its
original state). The appliances can interact with each other and be a part of
your home network and will be treated as a separate system. Because of these
features and more, VirtualBox allows you to test various operating systems,
without making permanent changes to your host OS.
First, go to the VirtualBox download page,
scroll down and find the latest version (currently 5.1.0). Click on the latest
version number and then, on the following page, scroll down, find the .exe file,
and download it to a known location on your Windows computer.
Once the VirtualBox Windows installer is downloaded, run the executable file and follow the onscreen instructions to install VirtualBox on Windows.
After the VirtualBox installation finishes, you will have to restart your computer. After the reboot, VirtualBox should be available in your apps. You can now run VirtualBox and create a virtual machine with almost any OS.
Step 3:
Step 3.A - Install Ubuntu image in VirtualBox
Now that we have installed VirtualBox on Windows, we have to create a Linux virtual machine, and we start by installing an Ubuntu image in Oracle VM VirtualBox.
1. In VirtualBox, click the “New” button to start the virtual machine wizard.
2. Give your virtual machine a name and select the operating system you’ll be running, then click “Next”. For this example, you’ll be installing Ubuntu: type Ubuntu and, for “Operating System,” choose “Linux.” The version will automatically default to “Ubuntu.” Click “Next” when you’re done. (I already have an Ubuntu image, hence I’m naming this one Ubuntu1.)
3. Select the amount of memory your VM will use and click “Next.” When you chose your operating system in the previous step, VirtualBox automatically recommended the proper amount of memory to use. If you feel this amount isn’t correct, you can move the slider or type a new amount in the box. To run Spark, at least 6 GB of RAM needs to be allocated. Click “Next” when you’re done.
4. Click “Next” to create a new virtual hard disk, then click “Next” again. This opens a second wizard to create a new virtual hard disk.
5. Select VDI (VirtualBox Disk Image) and click on
‘Next’.
6. Select either "Fixed-Size Storage" or
"Dynamically Expanding Storage" depending upon your needs. A fixed
size storage is going to be the size of the virtual hard disk on the host OS
(e.g.: a virtual disk 8 GB will be 8 GB on the host OS's hard disk). A
dynamically expanding storage will be only the size of Ubuntu on your hard
disk, but will grow in size as files are added to it until it reaches its
limit.
7. Choose the default name and size of the virtual
hard disk. Again, VirtualBox recommends the proper size of your virtual hard
disk. If you feel this amount isn’t correct, you can move the slider or type a
new amount in the box. Click “Create” when you’re done and wait while
VirtualBox creates the new virtual hard disk. You will see your new virtual
machine in list.
Step 3.B - Giving Ubuntu1 path for setup in VirtualBox
Now that we have created the Ubuntu virtual machine in Windows, we have to give it the path to the Ubuntu ISO image so that it can boot the Ubuntu installer.
1. Select your new virtual machine. Once you've
done this, click the “Settings” button
2. Click the “Storage” tab. Click the “CD/DVD" icon with the "+" on it and select the ISO to mount by clicking “Add Optical Drive” and “Choose Disk”.
3. The Ubuntu ISO will be mounted under the controller device.
4. You may now close the settings window and return to the main window. Your Ubuntu machine is ready to boot now.
Step 3.C – Installing Ubuntu OS in the Ubuntu image in the VirtualBox
Now that we have created the Ubuntu virtual machine and given it the path of the setup file, we have to install the Ubuntu OS in this Ubuntu image.
1. Select your new virtual machine. Once you've done this, click the “Start” button.
2. The Ubuntu virtual machine will start in a separate window.
3. The machine will boot from the selected ISO and you will see a language option. Choose your preferred language and press Enter.
4. In the next window you will see the install options. You can choose to try Ubuntu without installing, install Ubuntu, check the disk and memory for defects and problems, or boot from an existing hard disk. Choose the "Install Ubuntu" option here.
5. Once Ubuntu has loaded, choose your language and click “Continue".
6. On the next screen, Ubuntu will give you a checklist and ask whether you want to update during the install. Choose your required option and click "Continue".
7. The next option will ask whether you want to delete all data and install, or create your own partitions from the option "Something Else".
8. Select your time zone from the map, then click “Continue.”
9. Click “Continue” to keep the default keyboard layout or choose your desired one.
10. Type your username in the first text box. This will automatically fill in the login name and computer name. Type your password, confirm it, and click "Continue".
11. Ubuntu will begin the installation now.
12. Once the installation is complete, click “Restart Now” to finish the installation.
13. The machine will restart and the installed Ubuntu will load from the hard disk. Provide the password for your username and log in to the main Ubuntu window.
14. Once your Ubuntu virtual machine starts, you will find that it runs in a small window. To make it occupy the full screen you need to install the VirtualBox Guest Additions. Once you have logged in to Ubuntu, click on the "Devices" menu in VirtualBox and select "Insert Guest Additions CD Image...".
15. When Ubuntu asks to install a program and needs a password, type your user password and click "Install Now."
16. Let the terminal program run, and when it has finished, press Enter.
17. Reboot your VM. Once it has booted, click on the "View" menu and click "Auto-resize Guest Display"; you will now have a full-resolution Ubuntu VM on your computer.
Step 4:
Install Java in Ubuntu
The virtual machine image does not come with Java, which is essential for Spark, so we have to install Java.
First, update the package index by using this command:
- sudo apt-get update
Next, install Java. Specifically, this command will install the Java Runtime Environment (JRE):
- sudo apt-get install default-jre
There is another default Java installation called the JDK (Java Development Kit). The JDK is usually only needed if you are going to compile Java programs or if the software that will use Java specifically requires it. The JDK does contain the JRE, so there are no disadvantages if you install the JDK instead of the JRE, except for the larger file size. You can install the JDK with the following command:
- sudo apt-get install default-jdk
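A quick way to confirm the install succeeded is sketched below; note that the JAVA_HOME path shown is only the usual location for Ubuntu's default-jdk package and may differ on your system:

```shell
# Print the installed Java version (note: java writes this to stderr)
java -version

# Optional: some tools want JAVA_HOME set. The path below is the usual
# one for Ubuntu's default-jdk; verify yours with:
#   readlink -f "$(command -v java)"
export JAVA_HOME=/usr/lib/jvm/default-java
```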
Step 5:
Install Spark in Ubuntu
I.
Python - The Ubuntu 16.10 virtual machine comes with Python 3.5 already installed, which is adequate if you want to use Spark at the command line. However, it is better to install the IPython notebook. There are many ways to install it, but the easiest is to download and install Anaconda from the Anaconda download page.
Note that this needs to be downloaded inside the Ubuntu guest OS and not the Windows host OS if we are using a VM. When the install script asks whether Anaconda should be placed in the system path, answer yes.
Start python and ipython from the Ubuntu prompt and you should see that Anaconda's version of Python is being loaded.
II.
Spark - the instructions given here have been derived from this page, but with some significant deviations to accommodate the current version of the IPython notebook.
1. Download the latest version of Spark from http://spark.apache.org/downloads.html
- Choose the latest version from the Spark releases.
- For the package type, do NOT choose the source code, as otherwise you will have to compile it. Instead, choose the package with the latest pre-built Hadoop.
- Choose direct download, not an Apache mirror.
- Download the .tgz file.
2. Unzip the tgz file, move the resulting directory to a convenient location, and give it a simple name. In our case it was /home/rajat/spark16.
3. Add the following lines to the end of the file .profile. If that does not take effect, type the same commands at the terminal:
- export SPARK_HOME=/home/rajat/spark16
- export PATH=$SPARK_HOME/bin:$PATH
- export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
- export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
- export SPARK_LOCAL_IP=localhost
To get the correct version of the py4j-n.n-src.zip file, go to $SPARK_HOME/python/lib (here /home/rajat/spark16/python/lib) and use the actual file name; in this case it is py4j-0.10.3-src.zip. The last two paths are required because in many cases the py4j library is not found otherwise.
Now there are 3 ways one can begin Spark:
1. To start Spark in command-line mode, enter the "pyspark" command and you should see the familiar Spark screen. To quit, enter exit().
2. To start Spark in the IPython notebook format, enter the command IPYTHON_OPTS="notebook" pyspark. Please note that the strategy of using profiles for starting the IPython notebook may not work, as the current version of Jupyter no longer supports profiles; hence this strategy was used. This will start the server and make it available at port 8888 on localhost. To quit, press Ctrl-C twice in quick succession.
3. An alternative way of starting the notebook, not involving IPYTHON_OPTS, is easier:
- Start the notebook with ipython notebook (or alternatively, jupyter notebook).
- Execute these two lines in the first cell of the notebook:
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
III.
Now that we have Spark running on our Ubuntu machine, we can check its status at http://localhost:4040
Step 6:
Run a simple sample program to check that everything is working.
Now we will run a sample word-count program on Spark and Python using each of the 3 ways to initiate the Spark/Python environment.
Enter each of these lines as a command at the
pyspark prompt:
text = sc.textFile("datafile.txt")
print(text)
from operator import add
def tokenize(text):
    return text.split()
words = text.flatMap(tokenize)
print(words)
wc = words.map(lambda x: (x, 1))
print(wc.toDebugString())
counts = wc.reduceByKey(add)
counts.saveAsTextFile("output-dir")
The final output in Hadoop style will be stored in a directory called
"output-dir". Remember Hadoop, and hence Spark, does not allow the
same output directory to be reused.
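As a sanity check on what the pipeline above should produce, here is a plain-Python equivalent in which each step mirrors one Spark operation (the sample line is made up; substitute the contents of your datafile.txt):

```python
from collections import Counter

lines = ["to be or not to be"]                       # stand-in for datafile.txt
words = [w for line in lines for w in line.split()]  # flatMap(tokenize)
pairs = [(w, 1) for w in words]                      # map(lambda x: (x, 1))
counts = Counter()
for word, n in pairs:                                # reduceByKey(add)
    counts[word] += n
# counts now holds {"to": 2, "be": 2, "or": 1, "not": 1}
```

The per-word totals here should match the pairs Spark writes into the part files under "output-dir".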
The same commands can also be entered one by one in the IPython notebook with the Spark context environment, and we will get the same output.
If we want to run the same program in an IPython notebook without the Spark context environment, we first create the SparkContext as shown in Step 5, then run the same commands one by one, and the output will be the same.
If we want to run the same program in command-line mode, we enter the same commands one by one at the command prompt and run them, and the output will again be the same.
This establishes that we have Spark and Python working
smoothly on our machine.