
Spark vs Hadoop

 


Before answering this question, let me give you a brief introduction to Spark and Hadoop.

Spark:

Apache Spark is an open source cluster computing system that aims to make data analytics fast (both fast to run and fast to write).

It is developed in Scala, a functional programming language well suited to distributed systems: a lot can be accomplished in a small piece of code, and the result remains readable and easy to understand.

Hadoop:

Hadoop is an Apache open source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.

It uses its own distributed storage system, HDFS (Hadoop Distributed File System).

For organizations looking to adopt big data analytics, here's a comparative look at Apache Hadoop and Apache Spark:

 

[Table: feature-by-feature comparison of Apache Spark and Hadoop]

Though Spark is known to run faster than Hadoop, it does not provide its own distributed storage system.

For distributed storage, Spark can interface with any of the following (a short Scala sketch follows the list):

  • Hadoop Distributed File System (HDFS)
  • Cassandra
  • OpenStack Swift
  • Amazon S3
  • Kudu
  • A custom storage solution can also be implemented.
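
As a rough illustration, Spark usually selects the storage backend from the URI scheme of the input path alone. Below is a minimal sketch; all hosts, bucket names, and paths are hypothetical placeholders, and the s3a:// scheme assumes the hadoop-aws connector is on the classpath.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: the URI scheme of the path picks the storage backend.
    // Hosts, buckets, and paths below are hypothetical placeholders.
    val spark = SparkSession.builder.appName("StorageDemo").master("local[*]").getOrCreate()

    val fromHdfs  = spark.read.textFile("hdfs://namenode:8020/data/events.log")  // HDFS
    val fromS3    = spark.read.textFile("s3a://my-bucket/data/events.log")       // Amazon S3
    val fromLocal = spark.read.textFile("file:///tmp/events.log")                // local file system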

Spark course: http://www.muniversity.mobi/course/preview.php?id=282

Word count program:

 

[Figure: word count implementations side by side — a few lines of Spark code vs. 50+ lines of Hadoop MapReduce code]

We can see that the code size is reduced drastically in Spark as compared to Hadoop.
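
For reference, here is what the Spark version looks like. This is a minimal Scala sketch; the input path input.txt is a placeholder.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local SparkSession for demonstration; "input.txt" is a placeholder path.
        val spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Split lines into words, pair each word with 1, and sum the counts per word.
        sc.textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .collect()
          .foreach { case (word, n) => println(s"$word: $n") }

        spark.stop()
      }
    }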

Hadoop MapReduce writes all of the data back to disk after each operation to ensure that recovery is possible if required. Spark, by contrast, manages most of its operations in memory, storing data as RDDs (Resilient Distributed Datasets), so no time is spent moving data in and out of the disk. MapReduce spends a lot of time on these input/output operations, which increases latency; as a result, operations finish sooner and the user sees a tremendous gain in performance.
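
The in-memory difference is easy to see with caching. In the sketch below (assuming a SparkContext named sc is already available, e.g. spark.sparkContext, and with a hypothetical log path), cache() keeps the filtered RDD in memory, so the second action is served from RAM instead of re-reading the file — exactly the disk round-trip MapReduce would pay between jobs.

    // Minimal sketch; "hdfs:///logs/app.log" is a hypothetical path.
    val errors = sc.textFile("hdfs:///logs/app.log")
      .filter(_.contains("ERROR"))
      .cache()                                    // keep partitions in memory after first use

    val total  = errors.count()                   // first action: reads from disk, then caches
    val byWord = errors.flatMap(_.split("\\s+"))
      .countByValue()                             // second action: served from memory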

If you have a large amount of structured data at rest and Hadoop is already installed, there is no pressing need to install Spark, since Spark is a less mature ecosystem and needs further improvement in areas like security and integration with BI (Business Intelligence) tools.

[Figure: Google Trends chart of search interest in Spark vs. Hadoop]

Image source: Google Trends

Spark is becoming more popular than Hadoop, and its number of code contributors keeps growing. It seems like Sparkians just fall in love with Spark.

 

What makes Spark superior to Hadoop?

  • Speed:
    • Spark runs programs up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk.
  • Ease of use:
    • Developers can quickly write applications in familiar languages: Python, Java, Scala, or R (SparkR provides the R integration).
  • MLlib:
    • Spark provides its own library of machine learning algorithms for faster analysis, so no third-party machine learning tool is needed.
  • Real-time streaming:
    • Spark can process streaming data (e.g. Twitter data), i.e. data in flow, whereas Hadoop MapReduce can only process data at rest (see the first sketch after this list).
  • GraphX:
    • Most graph-processing algorithms (e.g. PageRank) require multiple iterations. In Hadoop MapReduce this multiplies the number of I/O operations, because processed data is written to disk after each iteration.
    • Since Spark processes data in memory and has a built-in graph library, GraphX, we get better performance in a simpler way (see the second sketch after this list).
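
To make the streaming point concrete, here is a minimal Spark Streaming sketch (DStream API) that counts words in 5-second micro-batches from a TCP socket; the host and port are placeholder values (you can feed it locally with nc -lk 9999).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Word counts over 5-second micro-batches from a socket stream.
    // "localhost" and 9999 are placeholder values.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()

And for GraphX, a minimal PageRank sketch; the edge-list file name and the convergence tolerance here are hypothetical.

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.sql.SparkSession

    // PageRank over an edge list ("srcId dstId" per line); file name is a placeholder.
    val spark = SparkSession.builder.appName("PageRankDemo").master("local[*]").getOrCreate()
    val graph = GraphLoader.edgeListFile(spark.sparkContext, "followers.txt")
    val ranks = graph.pageRank(0.0001).vertices   // iterate until ranks change by < 0.0001
    ranks.take(5).foreach(println)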

 


Is it required to have a good understanding of Hadoop first to learn Spark?


 

No, it is not required to learn Hadoop before learning Spark. Newbies are always in a dilemma about what to choose first. Though many companies nowadays use Hadoop's HDFS as the distributed storage system for Spark, it is not necessary to know Hadoop MapReduce.

That said, Spark + HDFS has been a great combination to date.

 

Bottom line:

Spark took MapReduce to the next level with less expensive shuffles in data processing. In fact, it is an alternative to MapReduce within the Hadoop ecosystem, not an alternative to Hadoop itself; Spark and Hadoop can work together. If you have already invested in setting up Hadoop, you can install Spark on top of HDFS. Since Spark requires a large amount of RAM (it performs most operations in memory), go for Hadoop if you don't want to puncture your pocket.

