Apache Spark

When processing large amounts of data, it's common to distribute and parallelize the workload across a cluster of machines. Apache Spark is a framework that sits between the applications above and the cluster resources below. Spark doesn't manage the low-level storage and compute resources directly. Instead, it relies on other frameworks such as Mesos or Hadoop for that. In fact, Apache Spark describes itself as "a unified analytics engine for large-scale data processing".

Applications written in many popular languages, including Java, Scala, Python, R and SQL, can make use of Spark, and support for more languages is on the way. Since Spark comes with many useful libraries, different types of processing are possible in a single application. Spark is particularly popular for Machine Learning workloads.

Spark is open source and is managed by the Apache Software Foundation.

Discussion

  • In which application scenarios is Spark useful?
    Spark use cases from a 2015 survey. Source: Ramel 2015.

    Hadoop MapReduce is great for batch processing, where data is typically read from disk, processed and written back to disk. But MapReduce is inefficient for multi-pass applications that read data more than once. Performance drops due to data duplication, serialization and disk I/O. Apache Spark was created to solve this and is useful for the following:

    • Iterative Algorithms: This includes machine learning algorithms that by nature process data through many iterations.
    • Interactive Data Mining: Data is loaded into RAM once and then repeatedly queried. Interactive or ad-hoc analysis often includes visualizations. The language for querying the data must also be expressive.
    • Streaming Applications: For real-time analysis, data must be analyzed as it arrives in the system. There's a need to maintain aggregate state over time.

    Spark can be used in gaming, e-commerce, finance, advertising, and more. Many of these involve real-time analysis and unstructured data sources. Uber used Spark for feature aggregation in its ML data pipeline. Spark can help in scaling data pipelines for genomics. One researcher analyzed NASA server logs (~300MB) using database-like queries and regular expressions via Spark.
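
    The log-analysis use case can be sketched briefly. The snippet below is a minimal, illustrative PySpark example, not the researcher's actual code; the file path, regular expressions and column names are assumptions.

    ```python
    # Parse raw server log lines with regular expressions, then run
    # database-like queries over the parsed fields.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, col

    spark = SparkSession.builder.appName("LogAnalysis").getOrCreate()

    logs = spark.read.text("access_log.txt")  # one log line per row, in column "value"

    parsed = logs.select(
        regexp_extract("value", r"^(\S+)", 1).alias("host"),
        regexp_extract("value", r'"\S+\s+(\S+)\s+\S+"', 1).alias("endpoint"),
        regexp_extract("value", r"\s(\d{3})\s", 1).cast("int").alias("status"),
    )

    # Database-like query: the most frequent endpoints returning HTTP 404.
    parsed.filter(col("status") == 404) \
          .groupBy("endpoint").count() \
          .orderBy(col("count").desc()) \
          .show(10)
    ```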

  • What makes Spark better than Hadoop MapReduce?
    Performance comparison of MapReduce and Spark. Source: Rifat 2018.

    Like MapReduce, Spark is scalable, distributed and fault tolerant, but it's also more efficient and easier to use. While MapReduce reads from and writes to disk between tasks, Spark caches data in memory and thereby improves performance.

    Spark does this by providing a data abstraction called the Resilient Distributed Dataset (RDD). The DataFrame and Dataset APIs provide a richer, higher-level interface on top of RDDs. This is just one example of Spark's better usability via rich APIs and a functional programming model.
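
    A short sketch can make the caching idea concrete. The following PySpark snippet is illustrative only; the data, column names and sizes are made up.

    ```python
    # Cache an RDD in memory so that multiple passes avoid repeated disk I/O,
    # then do the same with the higher-level DataFrame API.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CachingSketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1000000))
    rdd.cache()                                        # keep elements in memory after first use
    total = rdd.sum()                                  # first pass materializes the cache
    evens = rdd.filter(lambda x: x % 2 == 0).count()   # second pass reads from memory

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.cache()                                         # DataFrames can be cached too
    df.groupBy("label").count().show()
    ```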

    MapReduce jobs are executed as JVM processes, but Spark is able to execute multiple tasks inside a single JVM process. Another advantage of Spark is that each JVM process lives for the entire duration of the application, unlike in MapReduce where the process exits once execution completes. This means that new tasks submitted to a Spark cluster can start much faster, and CPU utilization is better. The tradeoff is that resource management becomes coarse grained, though cluster managers can mitigate this (for example, via YARN's container resizing).

    However, MapReduce may still be relevant for linear processing of huge datasets that can't fit into memory, particularly for join operations.

  • What's the architecture of Apache Spark?
    Apache Spark architecture. Source: Apache Spark Docs 2019a.

    A Spark application runs in a distributed manner on a cluster of nodes. It consists of two parts: the main program, called the driver program, and executors, which are processes on worker nodes where the actual execution happens. The driver program contains the SparkContext object that coordinates the application.

    A worker node is any node that runs application code. Such code could be JAR or Python files, for example. Application code available in SparkContext is passed to the executors. SparkContext also schedules and sends tasks to the executors. Each application gets its own executors, which cleanly isolates one application from another. Multiple tasks can run within an executor due to multi-threading.

    Since the driver program schedules tasks, it should be close to the worker nodes, preferably on the same local area network. It should also be network addressable from worker nodes and listen for incoming connections.

    The driver program connects to a cluster manager that's responsible for resource allocation. The cluster manager could be Spark's standalone manager, Apache Mesos, Hadoop YARN, Kubernetes, etc. The manager only needs to acquire executor processes; these then communicate with one another.
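
    As a rough sketch, the snippet below shows a driver program creating its SparkSession (and hence SparkContext) against a cluster manager. The master URL, executor settings and application name are illustrative assumptions; a YARN, Mesos or Kubernetes master would be configured the same way.

    ```python
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("ArchitectureSketch")
        .master("spark://cluster-manager-host:7077")  # e.g. Spark's standalone manager
        .config("spark.executor.memory", "2g")        # memory per executor process
        .config("spark.executor.cores", "2")          # concurrent tasks (threads) per executor
        .getOrCreate()
    )

    # The SparkContext inside the session coordinates the application:
    # it ships application code to executors and schedules tasks on them.
    print(spark.sparkContext.master)

    spark.stop()
    ```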

  • What are some essential terms to know in Apache Spark?
    Transformations and actions on an RDD. Source: Viswanath 2016.

    Here are some essential terms:

    • Task: A unit of work sent to one executor.
    • Job: A parallel computation consisting of multiple tasks, spawned in response to an action such as save or collect.
    • Stage: A smaller set of tasks within a job; stages depend on one another.
    • RDD: A fault-tolerant collection of elements that can be processed in parallel.
    • Partition: A smaller chunk into which an RDD is divided. A task is launched per partition. Thus, more partitions imply greater parallelism. Having 2-4 partitions per CPU is typical.
    • Transformation: An operation performed on an RDD. Since RDDs are immutable, transformations on an RDD result in a new RDD.
    • Action: An operation on an RDD that returns a result to the driver program (see the sketch after this list).
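
    The sketch below illustrates several of these terms in PySpark. The data and partition count are illustrative.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDTermsSketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), numSlices=4)    # an RDD split into 4 partitions
    print(rdd.getNumPartitions())                    # -> 4

    squares = rdd.map(lambda x: x * x)               # transformation: returns a new RDD, runs nothing yet
    small = squares.filter(lambda x: x < 50)         # another lazy transformation

    print(small.count())                             # action: spawns a job with one task per partition
    print(small.collect())                           # action: returns results to the driver program
    ```
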
  • What's the Spark software stack?
    Apache Spark software stack. Source: Kim and Bengfort 2016, ch. 4, fig. 4-1.

    Spark Core provides the main programming abstraction. It gives APIs (in Java, Scala, Python, SQL, and R) to access and manipulate RDDs. To ease development, Spark comes with component libraries including Spark SQL/DataFrame/Dataset, Spark Streaming, MLlib and GraphX. Each of these serves specific application requirements, but they all rely on Spark Core's unified API. This approach of a modular core plus useful components makes Spark attractive to developers.

    User applications sit on top of these components. The Spark Shell is an example app that facilitates interactive analysis.

    Spark Core itself doesn't manage cluster resources. This is done by a cluster manager. Spark comes with a standalone cluster manager, but we are free to choose alternatives such as Hadoop YARN, Mesos, Kubernetes, etc. Spark Core also doesn't deal with disk storage, for which we can use Hadoop HDFS, Cassandra, HBase, S3, etc. For example, in one deployment we could use YARN along with HDFS while the computing is managed via Spark Core.
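
    To illustrate the stack, the short sketch below uses the Spark SQL component on top of Spark Core; the table and column names are made up.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StackSketch").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
    )
    df.createOrReplaceTempView("people")   # expose the DataFrame to the SQL component

    # The SQL query and the equivalent DataFrame call run on the same core engine.
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    df.filter(df.age > 30).select("name").show()
    ```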

Milestones

2009

Researchers at UC Berkeley's AMPLab create a cluster management framework called Mesos. To showcase how easy it is to build something on top of Mesos, they look at the limitations of MapReduce and Hadoop. These are good at batch processing but not at iterative or interactive processing, such as machine learning or interactive querying. To overcome these limitations, they build Spark. Spark is written in Scala and exposes a functional programming interface.

2010

Spark is open sourced. Its creators publish the first paper on Spark, titled Spark: Cluster Computing with Working Sets.

2013

Spark is incubated under the Apache Software Foundation. In February 2014, it becomes a top-level Apache project.

2015

With the Spark 1.4 release, there's support for both Python 2 and 3. However, it's later announced that Python 2 support will be deprecated in the next major release, due in 2019.

Jul
2016
Spark 2.0.0 unifies DataFrame and Dataset APIs. Source: Damji 2016.

Spark 2.0.0 is released. The DataFrame API had been introduced in v1.3 to enable optimization, and the Dataset API, introduced in v1.6, added compile-time type checks. From v2.0, Dataset is presented as the single abstraction, although language bindings without compile-time type checks (Python, R) internally use DataFrame, which is an alias for Dataset[Row].

2018

In February, Spark 2.3.0 is released. With this release, native Spark jobs can be managed by Kubernetes. Data Source API V2 improves over V1. Meanwhile, a ranking of distributed computing packages for data science shows Apache Spark at the top, followed by Apache Hadoop. In the enterprise, Apache Spark MLlib is the most adopted library for ML and Big Data analytics, followed by TensorFlow.

Apr
2019

For using Spark from C# and F#, .NET for Apache Spark is launched as a preview project. Direction will come from the .NET Foundation. This effort replaces and deprecates Mobius.

References

  1. Apache Spark. 2018. "Spark Release 2.3.0." Apache Spark. Accessed 2019-07-06.
  2. Apache Spark. 2019a. "Apache Spark History." Accessed 2019-05-23.
  3. Apache Spark. 2019b. "Homepage." Apache Spark. Accessed 2019-05-23.
  4. Apache Spark. 2019c. "Spark Research." Apache Spark. Accessed 2019-05-27.
  5. Apache Spark. 2019d. "Spark News." Apache Spark. Accessed 2019-07-06.
  6. Apache Spark. 2019e. "Plan for dropping Python 2 support." Apache Spark. Accessed 2019-07-06.
  7. Apache Spark Docs. 2019a. "Cluster Mode Overview." v2.4.3. Accessed 2019-07-06.
  8. Apache Spark Docs. 2019b. "RDD Programming Guide." v2.4.3. Accessed 2019-07-06.
  9. Apache Spark Docs. 2019c. "Quick Start." v2.4.3. Accessed 2019-07-06.
  10. Bekker, Alex. 2017. "Spark vs. Hadoop MapReduce: Which big data framework to choose." Blog, ScienceSoft, September 14. Accessed 2019-07-06.
  11. Boland, Sean. 2018. "Ranking Popular Distributed Computing Packages for Data Science." The Data Incubator, February 07. Accessed 2019-05-23.
  12. Brust, Andrew. 2014. "Apache Spark becomes top-level project." ZDNet, February 27. Accessed 2019-05-23.
  13. Columbus, Louis. 2018. "Big Data Analytics Adoption Soared In The Enterprise In 2018." Forbes, December 23. Accessed 2019-05-23.
  14. Damji, Jules. 2016. "A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets." Engineering Blog, Databricks, July 14. Accessed 2019-05-23.
  15. Damji, Jules. 2018. "A Guide to Developer, Apache Spark Use Cases, and Deep Dives Talks at Spark + AI Summit." Blog, Databricks, May 23. Accessed 2019-05-23.
  16. Kešpret, Grega and Denny Lee. 2016. "An Illustrated Guide to Advertising Analytics." Blog, Databricks, February 02. Accessed 2019-05-23.
  17. Kim, Jenny, and Benjamin Bengfort. 2016. "Data Analytics with Hadoop." O'Reilly Media, Inc. Accessed 2019-07-06.
  18. McDonald, Carol. 2018. "Spark 101: What Is It, What It Does, and Why It Matters." Blog, MapR Technologies, Inc., October 17. Accessed 2019-07-06.
  19. Ostrowski, Radek. 2015. "Introduction to Apache Spark with Examples and Use Cases." Toptal. Accessed 2019-05-23.
  20. Phatak, Madhukara. 2015. "History of Apache Spark : Journey from Academia to Industry." Excerpts from O’Reilly Ben Lorica's interview of Ion Stoica, UC Berkeley professor and Databricks CEO, Madhukara's Blog, January 02. Accessed 2019-05-23.
  21. Phatak, Madhukara. 2016. "Introduction to Spark 2.0 : A Sneak Peek At Next Generation Spark." Madhukara's Blog, March 04. Accessed 2019-05-23.
  22. Ramel, David. 2015. "Spark Lighting a Big Data Fire, Survey Says." ADT Mag, September 24. Accessed 2019-07-06.
  23. Ramel, David. 2019. "'.NET for Apache Spark' Debuts for C#/F# Big Data." VisualStudio Magazine, April 25. Accessed 2019-05-23.
  24. Rifat, Taygan. 2018. "Getting Started with Apache Spark." September 04. Accessed 2019-05-23.
  25. Ryza, Sandy. 2014. "Apache Spark Resource Management and YARN App Models." Blog, Cloudera, May 30. Accessed 2019-07-06.
  26. Saby, Nastasia. 2018. "A comparison between RDD, DataFrame and Dataset in Spark from a developer’s point of view." Zenika, January 25. Accessed 2019-07-06.
  27. Sarkar, Dipanjan. 2019. "Scalable Log Analytics with Apache Spark — A Comprehensive Case-Study." Towards Data Science, April 10. Accessed 2019-05-23.
  28. Spark GitHub. 2019. "apache/spark." Apache Spark, on GitHub. Accessed 2019-05-23.
  29. Viswanath, Vishnu. 2016. "Spark RDDs Simplified." February 04. Accessed 2019-07-06.
  30. Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. "Spark: Cluster Computing with Working Sets." University of California, Berkeley. Accessed 2019-05-23.

Further Reading

  1. Apache Spark Docs. 2019b. "RDD Programming Guide." v2.4.3. Accessed 2019-07-06.
  2. Apache Spark Docs. 2019c. "Quick Start." v2.4.3. Accessed 2019-07-06.
  3. Ostrowski, Radek. 2015. "Introduction to Apache Spark with Examples and Use Cases." Toptal. Accessed 2019-05-23.
  4. McDonald, Carol. 2018. "Spark 101: What Is It, What It Does, and Why It Matters." Blog, MapR Technologies, Inc., October 17. Accessed 2019-07-06.
  5. Meerasahib, Anzy. 2018. "Apache Spark and Big Data: What's Ahead." TDWI, June 22. Accessed 2019-07-06.


Cite As

Devopedia. 2019. "Apache Spark." Version 5, July 7. Accessed 2023-11-13. https://devopedia.org/apache-spark

See Also

  • Spark Performance Tuning
  • Spark Data Handling
  • Spark Streaming
  • PySpark
  • Apache Mesos
  • Apache Ignite