Apache Spark is one of the hottest Big Data technologies of 2015; its impact has been so great that many assume it will eventually replace Apache Hadoop. This article gives a general description of Spark and compares it with Hadoop to address that question.
In recent years, on hearing the term Big Data, the first thing most of us think of is Apache Hadoop, the technology Doug Cutting wrote in 2005 based on Google's GFS (Google File System) and MapReduce papers. In April 2008, Hadoop became the fastest system to sort 1 terabyte of data, taking only 209 seconds on a 910-node cluster and beating the previous record of 297 seconds. In November 2008, Google announced that its MapReduce system needed just 68 seconds to sort 1 terabyte of data. In May 2009, Yahoo, using Hadoop, handled the same task in 62 seconds. Since then, an entire ecosystem has been built on the foundation of Hadoop to solve Big Data problems.
Hadoop consists of two major components:
- HDFS: the Hadoop Distributed File System, which enables data storage across a cluster of many commodity machines.
- MapReduce: a framework that allows parallel data processing across the cluster.
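To make the MapReduce model concrete, here is a minimal sketch in plain Python of the classic word-count job. This is a toy simulation of the programming model only, not Hadoop's actual Java API: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group; for word count, sum the ones.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data", "big deal"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

In real Hadoop, the map and reduce tasks run in parallel on different cluster nodes, and the intermediate pairs are written to disk between the phases, which is exactly the overhead Spark later avoids.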
On top of these two components, the open-source community has developed many other tools to make working with Hadoop more efficient:
- HBase: a NoSQL database built on HDFS that supports unstructured data.
- Flume: collects data from sources such as log systems.
- Oozie: defines dependencies among actions and sets up workflows for MapReduce jobs.
- Hive: accepts SQL-like commands and translates them into sets of MapReduce actions.
- Mahout: a library for machine-learning problems.
- Sqoop: transfers data between relational databases and HDFS.
Matei Zaharia, the father of Spark, used Hadoop from its early days. In 2009, while at UC Berkeley, he wrote Apache Spark to solve machine-learning problems, because Hadoop MapReduce was inefficient for them. Soon afterwards, he realized that Spark was valuable not only for machine learning but also for processing entire data workflows.
Spark consists of several components:
- The central component is Spark Core, which provides the basic functions of Spark such as task scheduling, memory management, fault recovery, and interaction with storage systems. In particular, Spark Core provides the API that defines the RDD (Resilient Distributed Dataset), a collection of items distributed across cluster nodes that can be processed in parallel.
- Spark can run on many types of cluster managers, such as Hadoop YARN, Apache Mesos, or Spark's own built-in cluster manager, the Standalone Scheduler.
- Spark SQL enables structured data queries through SQL commands. Spark SQL can also work with many data sources, such as Hive tables, Parquet, and JSON.
- Spark Streaming provides an API for straightforward processing of streaming data.
- MLlib provides many machine-learning algorithms, such as classification, regression, clustering, and collaborative filtering.
- GraphX is the library for graph processing.
One of the reasons Spark runs faster than Hadoop MapReduce is that it keeps intermediate results in memory: the output of each operation stays in memory, where subsequent operations can use it directly, instead of being written back to HDFS after every step as in Hadoop MapReduce.
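The idea behind RDDs and in-memory reuse can be sketched with a toy class in plain Python. This is not Spark's real implementation or API, just an illustration of the two key behaviors: transformations are recorded lazily and only computed when an action is called, and a cached result stays in memory so later actions reuse it instead of recomputing it (or, in Hadoop's case, rereading it from HDFS). Note that real Spark's `cache()` is itself lazy; here it materializes eagerly to keep the sketch short.

```python
class ToyRDD:
    # A toy, single-machine stand-in for an RDD: each instance holds a
    # deferred computation (a thunk) rather than the data itself.
    def __init__(self, compute):
        self._compute = compute
        self._cached = None

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: list(data))

    def map(self, f):
        # Transformation: records the step lazily, computes nothing yet.
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Keep the computed result in memory for reuse by later actions.
        self._cached = self._materialize()
        return self

    def _materialize(self):
        return self._cached if self._cached is not None else self._compute()

    def collect(self):
        # Action: triggers the actual computation.
        return self._materialize()

nums = ToyRDD.parallelize(range(6)).map(lambda x: x * x).cache()
evens = nums.filter(lambda x: x % 2 == 0).collect()  # reuses cached squares
odds = nums.filter(lambda x: x % 2 == 1).collect()   # reuses them again
print(evens, odds)  # [0, 4, 16] [1, 9, 25]
```

The two `collect()` calls both read the squared values from memory; an equivalent pair of MapReduce jobs would each reread the input and rewrite intermediate results to HDFS, which is where much of Spark's speed advantage comes from.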
By comparison, in 2013 Hadoop needed a cluster of 2,100 machines and 72 minutes to sort 100 TB of data, while Spark sorted the same amount in 23 minutes with only about a tenth as many machines. In many cases, Spark can run about 30-50 times faster than Hadoop MapReduce.
To get an overview of Spark, let's look at some statistics:
Among the libraries Spark provides, 69% of users use Spark SQL, 62% use DataFrames, and Spark Streaming and MLlib + GraphX each stand at 58%.
Programmers can write Spark applications in a number of different languages. In 2014, 84% of users used Scala, while Java and Python each accounted for 38% (users can use more than one language in their applications). In 2015, Spark began to support R, which quickly attracted 18% of users, and the share of Python users rose to 58%.
In 2015, Spark became the most active open-source project in the field of Big Data, with regular contributions from more than 800 developers at over 200 companies around the world.
At present, many large companies use Spark in their products, including Yahoo, eBay, IBM, and Cisco. Tencent runs the largest known Spark cluster, with around 8,000 nodes, while Databricks and Alibaba have used Spark to process a petabyte of data.
Now, let's return to the overall picture:
- Hadoop MapReduce could be replaced by Spark Core. As the analysis above shows, speed is Spark Core's most outstanding feature, and it threatens to leave Hadoop MapReduce behind in the Big Data wave.
- YARN could be replaced by Apache Mesos or by Spark's own Standalone Scheduler.
- Hadoop HDFS was written in the 2000s, when memory was extremely costly; now that memory is getting cheaper, memory-centric distributed storage systems such as Apache Tachyon (recently renamed Alluxio) may well become a trend.
- Spark SQL can be used as a substitute for Hive, and it continues to evolve to become easier to use and to provide new functions.
- Mahout is losing market share and stands a high chance of being replaced by Spark MLlib.
Hadoop has been developed over a long time and has proved its efficiency in many products, so although some components of the Hadoop ecosystem can be replaced by Spark, it would be wrong to say that Spark will completely wipe out Hadoop. Still, with Spark's spectacular advance in recent years, as programmers and computer scientists adopt this helpful tool, people may talk less about the "Hadoop Stack" and more about a "Big Data Stack" with many options beyond Hadoop.
Nguyen Viet Cuong – FPT HO