Apache Spark
When processing large amounts of data, it's common to distribute and parallelize the workload across a cluster of machines. Apache Spark is a framework that sits between the applications above and the cluster of resources below. Spark doesn't manage the low-level storage and compute resources directly. Instead, it makes use of other frameworks such as Mesos or Hadoop. In fact, Apache Spark is described as "a unified analytics engine for large-scale data processing".
Applications written in several popular languages can use Spark, and support for more languages is on the way. Because Spark ships with many useful libraries, a single application can combine different types of processing. Spark is widely used for machine learning workloads.
Spark is open source and is managed by the Apache Software Foundation.
Discussion
-
In which application scenarios is Spark useful? Hadoop MapReduce is great for batch processing, where data is typically read from disk, processed, and written back to disk. But MapReduce is inefficient for multi-pass applications that read the same data more than once. Performance drops due to data duplication, serialization and disk I/O. Apache Spark was created to solve this and is useful for the following:
- Iterative Algorithms: This includes machine learning algorithms that by nature process data through many iterations.
- Interactive Data Mining: Data is loaded into RAM once and then repeatedly queried. Interactive or ad-hoc analysis often includes visualizations. The language for querying the data must also be expressive.
- Streaming Applications: For real-time analysis, data must be analyzed as it arrives in the system. There's a need to maintain aggregate state over time.
Spark can be used in gaming, e-commerce, finance, advertising, and more. Many of these involve real-time analysis and unstructured data sources. Uber used Spark for feature aggregation in its ML data pipeline. Spark can help in scaling data pipelines for genomics. One researcher analyzed NASA server logs (~300MB) using database-like queries and regular expressions via Spark.
-
What makes Spark better than Hadoop MapReduce? Like MapReduce, Spark is scalable, distributed and fault tolerant, but it's also more efficient and easier to use. While MapReduce reads from and writes to disk between tasks, Spark caches data in memory and thereby improves performance.
Spark does this by providing a data abstraction called Resilient Distributed Dataset (RDD). Interfacing to RDD is enhanced with DataFrame and Dataset APIs. This is just one example of Spark's better usability via rich APIs and a functional programming model.
MapReduce runs each task in its own JVM process, whereas Spark can execute multiple tasks inside a single JVM process. Another advantage of Spark is that each JVM process lives for the entire duration of the application, unlike in MapReduce where the process exits once its task completes. This means that new tasks submitted to a Spark cluster can start much faster, and CPU utilization is better. The tradeoff is that resource management is coarse grained, but cluster managers can mitigate this (for example, through YARN's container resizing).
However, MapReduce may still be relevant for linear processing of huge datasets that can't fit into memory, particularly for join operations.
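The benefit of in-memory caching for multi-pass workloads can be illustrated with a toy model. The sketch below is plain Python, not Spark code: `CountingSource` and both `iterate_*` functions are hypothetical names invented for this illustration, and the read counter merely stands in for expensive disk I/O.

```python
# Toy model in plain Python (not Spark code): compares re-reading the input
# on every pass with loading it once and reusing it from memory.

class CountingSource:
    """Hypothetical data source that counts how often it is read in full."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1  # stands in for an expensive disk read
        return list(self._values)


def iterate_without_cache(source, passes):
    """Each pass reloads the data, as a chain of disk-based jobs would."""
    total = 0
    for _ in range(passes):
        total += sum(source.load())
    return total


def iterate_with_cache(source, passes):
    """The data is loaded once and reused from memory on every pass."""
    cached = source.load()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


uncached_source = CountingSource(range(5))
cached_source = CountingSource(range(5))
uncached_total = iterate_without_cache(uncached_source, passes=10)
cached_total = iterate_with_cache(cached_source, passes=10)

# Same answer either way; only the number of full reads differs.
print(uncached_total, uncached_source.reads)
print(cached_total, cached_source.reads)
```

Both variants compute the same result, but the uncached variant reads the source once per pass. In real Spark code, the analogous step is marking an RDD or DataFrame for reuse with `cache()` before iterating over it.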
-
What's the architecture of Apache Spark? A Spark application runs in a distributed manner on a cluster of nodes. It consists of two parts: the main program, called the driver program, and executors, which are processes on worker nodes where the actual execution happens. The driver program contains the SparkContext object that coordinates the application.
A worker node is any node that runs application code. Such code could be JAR or Python files, for example. Application code available in SparkContext is passed to executors. SparkContext also schedules and sends tasks to executors. Each application gets its own executors. This cleanly isolates one application from another. Multiple tasks can run within an executor due to multi-threading.
Since the driver program schedules tasks, it should be close to the worker nodes, preferably on the same local area network. It should also be network addressable from worker nodes and listen for incoming connections.
The driver program connects to a cluster manager that's responsible for resource allocation. Spark is agnostic to the choice of cluster manager: it could be Spark's own standalone manager, Apache Mesos, Hadoop YARN, Kubernetes, etc. All that's required is that the manager can acquire executor processes and that these can communicate with one another.
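The division of labor between driver and executor can be sketched with a thread pool. This is a toy model in plain Python, not Spark's implementation: `split_into_partitions`, `task` and `run_job` are hypothetical names, a single `ThreadPoolExecutor` stands in for one multi-threaded executor process, and the calling code plays the role of the driver.

```python
# Toy model (not Spark's implementation): a "driver" splits the data,
# hands one task per chunk to a multi-threaded "executor", and combines
# the partial results.
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one chunk per task."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """Unit of work run on one chunk: sum of squares."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, threads_per_executor=2):
    """Driver side: schedule tasks on the executor and merge their results."""
    partitions = split_into_partitions(data, num_partitions)
    # Several tasks share one executor, each running on its own thread.
    with ThreadPoolExecutor(max_workers=threads_per_executor) as executor:
        partial_results = list(executor.map(task, partitions))
    return sum(partial_results)


job_result = run_job(list(range(10)))
print(job_result)
```

The sketch leaves out everything that makes the real system distributed: network transfer of code and data, multiple worker nodes, and the cluster manager that allocates executors.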
-
What are some essential terms to know in Apache Spark? Here are some essential terms:
- Task: A unit of work sent to one executor.
- Job: Parallel computation consisting of multiple tasks, spawned in response to an action such as save or collect.
- Stage: A smaller set of tasks of a job, useful when one stage depends on another.
- RDD: A fault-tolerant collection of elements that can be processed in parallel.
- Partition: A smaller chunk into which an RDD is divided. A task is launched per partition. Thus, more partitions imply greater parallelism. Having 2-4 partitions per CPU is typical.
- Transformation: An operation performed on an RDD. Since RDDs are immutable, transformations on an RDD result in a new RDD.
- Action: An operation on an RDD that returns a result to the driver program.
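Several of these terms can be seen working together in a small toy model. The `ToyRDD` class below is hypothetical and written in plain Python; it only mimics the shape of the idea and is not Spark's RDD API. It also assumes that transformations are deferred until an action runs, which is how Spark behaves even though the definitions above don't say so.

```python
# Toy model (not Spark's RDD API): an immutable, partitioned collection
# whose transformations return new objects and run only when an action
# is called.

class ToyRDD:
    def __init__(self, partitions, ops=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = tuple(ops)  # pending transformations, applied on action

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Distribute the data round-robin over the given number of partitions."""
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        """Transformation: returns a new ToyRDD, leaving this one unchanged."""
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, predicate):
        """Transformation: returns a new ToyRDD, leaving this one unchanged."""
        return ToyRDD(self._partitions, self._ops + (("filter", predicate),))

    def num_partitions(self):
        return len(self._partitions)

    def _run_task(self, partition):
        """One task: apply all pending transformations to one partition."""
        items = list(partition)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        """Action: run one task per partition and return the results."""
        result = []
        for partition in self._partitions:
            result.extend(self._run_task(partition))
        return result


numbers = ToyRDD.parallelize(range(1, 9), num_partitions=4)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
collected = sorted(evens_squared.collect())
print(collected)
```

Here `filter` and `map` are the transformations, `collect` is the action, and each call to `_run_task` corresponds to one task working on one partition. The original `numbers` object is untouched by the chained calls.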
-
What's the Spark software stack? The Spark Core is the main programming abstraction. It gives APIs (in Java, Scala, Python, SQL, and R) to access and manipulate RDDs. To ease development, Spark comes with component libraries including Spark SQL/DataFrame/Dataset, Spark Streaming, MLlib and GraphX. Each of these serves specific application requirements but they all rely on Spark Core's unified API. This approach of a modular core plus useful components makes Spark attractive to developers.
User applications sit on top of these components. The Spark Shell is an example app that facilitates interactive analysis.
Spark Core itself doesn't manage cluster resources. This is done by a cluster manager. Spark comes with a standalone cluster manager but we are free to choose alternatives such as Hadoop YARN, Mesos, Kubernetes, etc. Spark Core also doesn't deal with disk storage, for which we can use Hadoop HDFS, Cassandra, HBase, S3, etc. For example, in one deployment we could choose YARN along with HDFS while the computing is managed via Spark Core.
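In practice, the choice of cluster manager is usually expressed through the master URL given to the application at launch. The sketch below is a hypothetical helper in plain Python, not part of Spark: `cluster_manager_for` is an invented name, the host names and ports are placeholders, and the URL schemes follow the master URL forms described in Spark's application submission documentation.

```python
# Hypothetical helper (not part of Spark): maps a Spark master URL to the
# cluster manager it selects. Host names and ports below are placeholders.

def cluster_manager_for(master_url):
    """Return a human-readable name for the manager a master URL selects."""
    if master_url == "local" or master_url.startswith("local["):
        return "local mode (no cluster manager)"
    if master_url == "yarn":
        return "Hadoop YARN"
    prefixes = {
        "spark://": "Spark standalone",
        "mesos://": "Apache Mesos",
        "k8s://": "Kubernetes",
    }
    for prefix, name in prefixes.items():
        if master_url.startswith(prefix):
            return name
    raise ValueError("unrecognized master URL: " + master_url)


examples = [
    "local[*]",
    "spark://master.example:7077",
    "yarn",
    "mesos://master.example:5050",
    "k8s://https://k8s.example:6443",
]
for url in examples:
    print(url, "->", cluster_manager_for(url))
```

Because the application code itself stays the same across these options, switching from a local test run to a YARN or Kubernetes deployment is largely a matter of changing this one setting.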
Milestones
Researchers at UC Berkeley's AMPLab create a cluster management framework called Mesos. To showcase how easy it is to build something on top of Mesos, they look at the limitations of MapReduce and Hadoop. These are good at batch processing but not at iterative or interactive processing, such as machine learning or interactive querying. To overcome these limitations, they build Spark. Spark is written in Scala and exposes a functional programming interface.
Spark is open sourced. Its creators publish the first paper on Spark, titled Spark: Cluster Computing with Working Sets.
2016
Spark 2.0.0 is released. To enable optimization, the DataFrame API was introduced in v1.3. The Dataset API, introduced in v1.6, enabled compile-time checks. From v2.0, Dataset presents a single abstraction, although language bindings that don't have type checks (Python, R) internally use DataFrame, which is an alias for Dataset[Row].
In February 2018, Spark 2.3.0 is released. With this release, native Spark jobs can be managed by Kubernetes. Data source API V2 improves over V1. Meanwhile, a ranking of distributed computing packages for data science shows Apache Spark at the top, followed by Apache Hadoop. In the enterprise, Apache Spark MLlib is the most adopted framework for ML and Big Data analytics, followed by TensorFlow.
Further Reading
- Apache Spark Docs. 2019b. "RDD Programming Guide." v2.4.3. Accessed 2019-07-06.
- Apache Spark Docs. 2019c. "Quick Start." v2.4.3. Accessed 2019-07-06.
- Ostrowski, Radek. 2015. "Introduction to Apache Spark with Examples and Use Cases." Toptal. Accessed 2019-05-23.
- McDonald, Carol. 2018. "Spark 101: What Is It, What It Does, and Why It Matters." Blog, MapR Technologies, Inc., October 17. Accessed 2019-07-06.
- Meerasahib, Anzy. 2018. "Apache Spark and Big Data: What's Ahead." TDWI, June 22. Accessed 2019-07-06.