Apache Beam

Apache Beam makes your data pipelines portable across languages and runtimes. Source: Mejía 2018, fig. 1.

With the rise of Big Data, many frameworks have emerged to process that data. These are either for batch processing, stream processing or both. Examples include Apache Hadoop MapReduce, Apache Spark, Apache Storm, and Apache Flink.

The problem for developers is that once a better framework arrives, they have to rewrite their code. Mature old code gets replaced with immature new code. They may even have to maintain two codebases and pipelines, one for batch processing and another for stream processing.

Apache Beam aims to solve this problem by offering a unified programming model for both batch and streaming workloads that can run on any supported distributed processing backend. Beam offers SDKs in a few languages and supports a number of backend execution engines, called runners.


  • Why do I need Apache Beam when there are already so many data processing frameworks?

    Having many processing frameworks is part of the problem. Developers have to write and maintain multiple pipelines to work with different frameworks. When a better framework comes along, there's significant effort involved in adopting it. Apache Beam solves this by enabling a single pipeline to be reused across multiple runtimes.

    The benefit of Apache Beam is therefore both in development and operations. Developers can focus on their pipelines and worry less about the runtime. Pipelines become portable, so there's no lock-in to a particular runtime. Beam SDKs allow developers to quickly integrate a pipeline into their applications.

    The Beam Model offers powerful semantics for developers to reason about data processing at a higher level of abstraction. Concepts such as windowing, ordering, triggering and accumulation are part of the Beam Model.

    Beam supports auto-scaling: it monitors current progress to dynamically reassign work to idle workers and scale the number of workers up or down.

    Since Apache Beam is open source, support for more languages (by SDK writers) or runtimes (by runner writers) can be added by the community.

  • What are the use cases served by Apache Beam?
    Some use cases of Apache Beam. Source: Iyer and Onofré 2018.

    Apache Beam is suitable for any task that can be parallelized by breaking down the data into smaller parts, each part running independently. Beam supports a wide variety of use cases. The simplest ones are perhaps Extract, Transform, Load (ETL) tasks that are typically used to move data across systems or formats.

    Beam supports batch as well as streaming workloads. In fact, the name Beam signifies a combination of "Batch" and "Stream". It therefore presents a unified model and API to define parallel processing pipelines for both types of workloads.

    Applications that use multiple streaming frameworks (such as Apache Spark and Apache Flink) can adopt Beam to simplify the codebase. A single data pipeline written in Beam can address both execution runtimes.

    Beam can be used for scientific computations. For example, Landsat data (satellite images) can be processed in parallel with a Beam pipeline.

    IoT applications often require real-time stream processing where Beam can be used. Another use case is computing scores for users in a mobile gaming app.

  • Which are the programming languages and runners supported by Apache Beam?
    Test status summary of Beam for languages and runners. Source: Apache Beam GitHub 2020a.

    Apache Beam started with a Java SDK. By 2020, it supported Java, Go, Python 2 and Python 3. Scio is a Scala API for Apache Beam.

    Among the main runners supported are Dataflow, Apache Flink, Apache Samza, Apache Spark and Twister2. Others include Apache Hadoop MapReduce, JStorm, IBM Streams, Apache Nemo, and Hazelcast Jet. Refer to the Beam Capability Matrix for more details.

    The Java SDK supports the main runners, but other SDKs support only some of them. This is because the runners themselves are written in Java, which makes supporting non-Java SDKs non-trivial. Beam's portability framework aims to improve this situation and enable full interoperability by defining data structures and protocols that can match any language to any runner.

    Direct Runner is useful during development and testing for execution on your local machine. It checks that your pipeline conforms to the Beam model, giving greater confidence that the pipeline will run correctly on various runners.

    Beam's portability framework comes with the Universal Local Runner (ULR), which complements the Direct Runner.

  • What are the essential programming abstractions in Apache Beam?
    An example of branching pipeline with three transforms. Source: Apache Beam Docs 2020d, fig. 2.

    Beam provides the following abstractions for data processing:

    • Pipeline: Encapsulates the entire task, including reading input data, transforming data and writing output. Pipelines are created with options: PipelineOptionsFactory returns a PipelineOptions object, which can specify, for example, the location of data, the runner to use, or runner-specific configuration. A pipeline can be linear or branching.
    • PCollection: Represents the data. Every step of a pipeline inputs and outputs PCollection objects. The data can be bounded (e.g. read from a file) or unbounded (e.g. streamed from a continuous source).
    • PTransform: Represents an operation on the data. Inputs are one or more PCollection objects. Outputs are zero or more PCollection objects. A transform doesn't modify the input collection. I/O transforms read and write to external storage. Core Beam transforms include ParDo, GroupByKey, CoGroupByKey, Combine, Flatten, and Partition. Built-in I/O transforms can connect to files, filesystems (e.g. Hadoop, Amazon S3), messaging systems (e.g. Kafka, MQTT), databases (e.g. Elasticsearch, MongoDB), and more.
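    To make these abstractions concrete, here is a toy model in plain Python. It only illustrates the concepts: the class names mirror Beam's terminology, but the implementation is a hypothetical sketch, not the Beam API.

```python
# Toy illustration of Beam's core abstractions (not the real Beam API).
class PCollection:
    """Immutable dataset passed between pipeline steps."""
    def __init__(self, elements):
        self.elements = tuple(elements)  # transforms never mutate input

class PTransform:
    """An operation: takes a PCollection, returns a new PCollection."""
    def __init__(self, fn):
        self.fn = fn
    def expand(self, pcoll):
        return PCollection(self.fn(pcoll.elements))

class Pipeline:
    """Chains transforms; encapsulates the whole task."""
    def __init__(self, source):
        self.pcoll = PCollection(source)
    def apply(self, transform):
        self.pcoll = transform.expand(self.pcoll)
        return self

# A linear pipeline: read -> split into words -> count per word.
p = Pipeline(["hello world", "", "hello beam"])
p.apply(PTransform(lambda lines: [w for l in lines for w in l.split()]))
p.apply(PTransform(lambda ws: [(w, ws.count(w)) for w in sorted(set(ws))]))
print(p.pcoll.elements)  # (('beam', 1), ('hello', 2), ('world', 1))
```

    Note how each apply step consumes one PCollection and produces a new one, leaving the input untouched, which is exactly the contract Beam's real transforms obey.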
  • What's the Beam execution model?

    Creating a pipeline doesn't imply immediate execution. The designated pipeline runner constructs a workflow graph of the pipeline, in which collections are connected via transforms. The graph is then submitted to the distributed processing backend for execution. Execution happens asynchronously, although some runners, such as Dataflow, also support blocking execution.
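    This deferred execution can be sketched in plain Python (a conceptual illustration, not Beam's implementation): applying transforms only records nodes in a graph, and nothing computes until run() is called.

```python
# Conceptual sketch of deferred execution: apply() builds a graph, run() executes it.
class DeferredPipeline:
    def __init__(self):
        self.graph = []          # ordered list of (name, fn) nodes
        self.executed = False

    def apply(self, name, fn):
        self.graph.append((name, fn))   # record only; no computation here
        return self

    def run(self, source):
        # The "runner" walks the workflow graph and executes each transform.
        data = source
        for name, fn in self.graph:
            data = fn(data)
        self.executed = True
        return data

p = DeferredPipeline().apply("square", lambda xs: [x * x for x in xs]) \
                      .apply("keep even", lambda xs: [x for x in xs if x % 2 == 0])
assert not p.executed            # graph built, nothing has run yet
result = p.run([1, 2, 3, 4])
print(result)  # [4, 16]
```

    Separating graph construction from execution is what lets the same pipeline definition be handed to different runners.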

  • What are some essential concepts of the Beam model?
    The Beam Model along with an example. Source: Adapted from Akidau 2016.

    Data often has two associated times: event time, when the event actually occurred, and processing time, when the event was observed in the system. Typically, processing time lags event time. This is called skew and it's highly variable.

    Bounded or unbounded data are grouped into windows by either event time or processing time. Windows themselves can be fixed, sliding or dynamic such as based on user sessions. Processing can also be time-agnostic by chopping unbounded data into a sequence of bounded data.
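    Fixed event-time windowing can be illustrated in plain Python. The sketch below is conceptual, not Beam's windowing implementation: each event is bucketed into a window by its event timestamp, regardless of arrival order.

```python
from collections import defaultdict

def fixed_windows(events, window_size):
    """Group (event_time, value) pairs into fixed event-time windows.

    Each window covers [start, start + window_size). A plain-Python
    illustration of the idea, not Beam's windowing machinery.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        windows[(start, start + window_size)].append(value)
    return dict(windows)

# Events arrive out of order; windowing is by event time, not arrival order.
events = [(12, "a"), (3, "b"), (7, "c"), (14, "d")]
print(fixed_windows(events, 10))
# {(10, 20): ['a', 'd'], (0, 10): ['b', 'c']}
```

    Sliding and session windows follow the same principle but with overlapping or data-driven window boundaries.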

    Data can arrive out of order and with unpredictable delays. There's no way of knowing if all data applicable to an event-time window have arrived. Beam overcomes this by tracking watermarks, which give a notion of data completeness. When the watermark is reached, the results are materialized.

    We can also materialize early results (before the watermark is reached) or late results (for data arriving after the watermark) using triggers. This allows us to refine results over time. Finally, accumulation specifies how to combine multiple results of the same window.
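    The interplay of watermarks and late data can be sketched as follows. This is a conceptual illustration in plain Python with a deliberately naive watermark (the highest event time seen so far); real watermark estimation in Beam is runner-dependent.

```python
def process_stream(events):
    """Classify events as on-time or late relative to a watermark.

    The watermark here is a naive estimate: the highest event time
    seen so far. An event whose event time is behind the watermark
    is late, and would trigger a refinement of already-materialized
    results. Conceptual sketch only, not Beam's watermark machinery.
    """
    watermark = 0
    on_time, late = [], []
    for event_time, value in events:  # in arrival (processing-time) order
        if event_time < watermark:
            late.append(value)        # arrived after the watermark passed it
        else:
            on_time.append(value)
            watermark = max(watermark, event_time)
    return on_time, late, watermark

# 'b' (event time 2) arrives after the watermark has advanced to 5: it's late.
stream = [(1, "a"), (5, "c"), (2, "b"), (8, "d")]
print(process_stream(stream))  # (['a', 'c', 'd'], ['b'], 8)
```

    An accumulation policy would then decide whether the late result for 'b' replaces, or is merged with, the result already emitted for its window.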

  • Could you point to useful developer resources to learn Apache Beam?

    Apache Beam's official website contains quick start guides and documentation. The Overview page is a good place to start. There's an example to try out Apache Beam on Colab.

    The Programming Guide is an essential read for developers who wish to use Beam SDKs and create data processing pipelines.

    Visit the Learning Resources page for links to useful resources. On GitHub, there's a curated list of Beam resources and a few code samples.

    Developers who wish to contribute to the Beam project should read the Contribution Guide. The code is on GitHub. The codebase also includes useful examples: Python examples are in sdks/python/apache_beam/examples and Go examples in sdks/go/examples.



Milestones

Jay Kreps questions the need to maintain and execute two parallel pipelines, one for batch processing (e.g. using Apache Hadoop MapReduce) and one for stream processing (e.g. using Apache Storm). The batch pipeline gives exact results and allows data reprocessing. The streaming pipeline has low latency but gives approximate results. This combination has been called the Lambda Architecture. Kreps instead proposes "a language or framework that abstracts over both the real-time and batch framework."

Streaming systems at Google that led to Apache Beam. Source: Akidau 2015a, 1:00.

At a Big Data conference in London, Google engineer Tyler Akidau talks about streaming systems. He introduces some of those currently used at Google: MillWheel, Google Flume, and Cloud Dataflow. In fact, Cloud Dataflow is based on Google Flume. These recent developments bring greater maturity to streaming systems, which so far have remained less mature compared to batch processing systems.

Beam logo is released in February 2016. Source: Apache Beam 2020d.

In February, Google's Dataflow is accepted by the Apache Software Foundation as an Incubator Project and named Apache Beam. The open-sourced code includes the Dataflow Java SDK, which already supports four runners. There are plans to build a Python SDK. Google Cloud Dataflow will continue as a managed service executing on the Google Cloud Platform. The Beam logo is also released in February.


Apache Beam graduates from being an incubator project to a top-level Apache project. During the incubation period (2016), the code was refactored and documentation was improved in an extensible vendor-neutral manner.


Beam 2.0 is released. This is the first stable release of Beam under the Apache brand. It's said that Beam is at this point "truly portable, truly engine agnostic, truly ready for use."


With the release of Beam 2.5.0, the Go SDK is now supported. Go pipelines run on the Dataflow runner.


Beam 2.23.0 is released. This release supports Twister2 runner and Python 3.8. It removes support for runners Gearpump and Apex.

Sample Code

  • // Source: https://beam.apache.org/get-started/try-apache-beam/
    // Accessed 2020-09-16
    // Example use of Pipeline
    package samples.quickstart;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import java.util.Arrays;
    public class WordCount {
      public static void main(String[] args) {
        String inputsDir = "data/*";
        String outputsPrefix = "outputs/part";
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("Read lines", TextIO.read().from(inputsDir))
            .apply("Find words", FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
            .apply("Filter empty words", Filter.by((String word) -> !word.isEmpty()))
            .apply("Count words", Count.perElement())
            .apply("Format results", MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> wordCount) ->
                      wordCount.getKey() + ": " + wordCount.getValue()))
            .apply("Write results", TextIO.write().to(outputsPrefix));
        pipeline.run();
      }
    }


  1. Akidau, Tyler. 2015a. "Say Goodbye to Batch." Strata+Hadoop World, O'Reilly, May 5-7. Accessed 2020-09-16.
  2. Akidau, Tyler. 2015b. "Streaming 101: The world beyond batch: A high-level tour of modern data-processing concepts." O'Reilly, August 5. Accessed 2020-09-16.
  3. Akidau, Tyler. 2016. "Streaming 102: The world beyond batch: The what, where, when, and how of unbounded data processing." O'Reilly, January 20. Accessed 2020-09-16.
  4. Apache Beam. 2020a. "Apache Beam Overview." Get Started, Apache Beam, September 1. Accessed 2020-09-16.
  5. Apache Beam. 2020b. "Portability Framework Roadmap." Roadmap, Apache Beam, July 22. Accessed 2020-09-16.
  6. Apache Beam. 2020c. "Apache Beam Mobile Gaming Pipeline Examples." Get Started, Apache Beam, May 28. Accessed 2020-09-16.
  7. Apache Beam. 2020d. "Apache Beam Logos." Apache Beam, May 15. Accessed 2020-09-17.
  8. Apache Beam Docs. 2020a. "Apache Beam Programming Guide." Apache Beam, August 21. Accessed 2020-09-16.
  9. Apache Beam Docs. 2020b. "Beam Capability Matrix." Apache Beam, May 15. Accessed 2020-09-16.
  10. Apache Beam Docs. 2020c. "Using the Direct Runner." Apache Beam, August 21. Accessed 2020-09-16.
  11. Apache Beam Docs. 2020d. "Design Your Pipeline." Apache Beam, May 15. Accessed 2020-09-16.
  12. Apache Beam Docs. 2020e. "Using the Google Cloud Dataflow Runner." Apache Beam, May 15. Accessed 2020-09-16.
  13. Apache Beam Docs. 2020f. "Built-in I/O Transforms." Apache Beam, June 10. Accessed 2020-09-16.
  14. Apache Beam GitHub. 2020a. "apache/beam." Apache Software Foundation, on GitHub, September 16. Accessed 2020-09-16.
  15. Apache Beam GitHub. 2020b. "Beam 2.23.0 release." apache/beam, Apache Software Foundation, on GitHub, July 30. Accessed 2020-09-16.
  16. Bonaci, Davor. 2017. "Apache Beam established as a new top-level project." Blog, Apache Beam, January 10. Accessed 2020-09-16.
  17. Google Cloud. 2020. "Programming model for Apache Beam." Dataflow Documentation, Google Cloud, June 26. Accessed 2020-09-16.
  18. Google Codelabs. 2020. "Distributed Computation of NDVI from Landsat Images using Cloud Dataflow." Codelabs, Google. Accessed 2020-09-16.
  19. Iyer, Anand and Jean-Baptiste Onofré. 2018. "Apache Beam: A Look Back at 2017." Blog, Apache Beam, January 10. Accessed 2020-09-16.
  20. Janakiram MSV. 2016. "All the Apache Streaming Projects: An Exploratory Guide." The New Stack, July 8. Accessed 2020-09-16.
  21. Kreps, Jay. 2014. "Questioning the Lambda Architecture." O'Reilly, July 2. Accessed 2020-09-16.
  22. Malone, James. 2016. "Apache Beam has a logo!" Blog, Apache Beam, February 22. Accessed 2020-09-17.
  23. Mejía, Ismaël. 2018. "Making data-intensive processing efficient and portable with Apache Beam." OpenSource.com, Red Hat, Inc., May 1. Accessed 2020-09-16.
  24. Perry, Frances. 2016. "Dataflow and open source - proposal to join the Apache Incubator." Blog, Google Cloud, January 20. Accessed 2020-09-16.
  25. Psaltis, Andrew. 2016. "Apache Beam: The Case for Unifying Streaming APIs." QCon New York, June 13. Accessed 2020-09-16.
  26. Rokni, Reza. 2020. "TensorFlow Extended (TFX): Using Apache Beam for large scale data processing." TensorFlow Blog, March 10. Accessed 2020-09-16.
  27. Romanenko, Alexey. 2018. "Apache Beam 2.5.0." Blog, Apache Beam, June 26. Accessed 2020-09-16.
  28. Woodie, Alex. 2017. "Google/ASF Tackle Big Computing Trade-Offs with Apache Beam 2.0." Datanami, May 19. Accessed 2020-09-16.

Further Reading

  1. Wasilewski, Kamil. 2020. "Apache Beam: Tutorial and Beginners' Guide." Blog, Polidea, January 16. Accessed 2020-09-16.
  2. Akidau, Tyler, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing." Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1792-1803, August. Accessed 2020-09-16.
  3. Apache Beam Docs. 2020a. "Apache Beam Programming Guide." Apache Beam, August 21. Accessed 2020-09-16.
  4. Akidau, Tyler. 2015b. "Streaming 101: The world beyond batch: A high-level tour of modern data-processing concepts." O'Reilly, August 5. Accessed 2020-09-16.
  5. Akidau, Tyler. 2016. "Streaming 102: The world beyond batch: The what, where, when, and how of unbounded data processing." O'Reilly, January 20. Accessed 2020-09-16.
  6. Li, Shen, Paul Gerver, John MacMillan, Daniel Debrunner, William Marshall, and Kun-Lung Wu. 2018. "Challenges and experiences in building an efficient apache beam runner for IBM streams." Proceedings of the VLDB Endowment, vol. 11, no. 12, pp. 1742-1754, August. Accessed 2020-09-16.

Cite As

Devopedia. 2020. "Apache Beam." Version 4, September 17. Accessed 2020-11-24. https://devopedia.org/apache-beam
Contributed by
1 author

Last updated on
2020-09-17 11:17:24