Apache Beam

Apache Beam makes your data pipelines portable across languages and runtimes. Source: Mejía 2018, fig. 1.

With the rise of Big Data, many frameworks have emerged to process that data. These are either for batch processing, stream processing or both. Examples include Apache Hadoop MapReduce, Apache Spark, Apache Storm, and Apache Flink.

The problem for developers is that once a better framework arrives, they have to rewrite their code. Mature old code gets replaced with immature new code. They may even have to maintain two codebases and pipelines, one for batch processing and another for stream processing.

Apache Beam aims to solve this problem by offering a unified programming model for both batch and streaming workloads that can run on any supported distributed processing backend. Beam offers SDKs in a few languages and supports a number of backend execution engines, called runners.


  • Why do I need Apache Beam when there are already so many data processing frameworks?

    Having many processing frameworks is part of the problem. Developers have to write and maintain multiple pipelines to work with different frameworks. When a better framework comes along, there's significant effort involved in adopting it. Apache Beam solves this by enabling a single pipeline to be reused across multiple runtimes.

    The benefit of Apache Beam is therefore both in development and operations. Developers can focus on their pipelines and worry less about the runtime. Pipelines become portable, so there's no lock-in to a particular runtime. Beam SDKs allow developers to quickly integrate a pipeline into their applications.

    The Beam Model offers powerful semantics for developers to reason about data processing at a higher level of abstraction. Concepts such as windowing, ordering, triggering and accumulation are part of the Beam Model.

    Beam supports auto-scaling: it monitors current progress to dynamically reassign work to idle workers and scale the number of workers up or down.

    Since Apache Beam is open source, support for more languages (by SDK writers) or runtimes (by runner writers) can be added by the community.

  • What are the use cases served by Apache Beam?
    Some use cases of Apache Beam. Source: Iyer and Onofré 2018.

    Apache Beam is suitable for any task that can be parallelized by breaking down the data into smaller parts, each part running independently. Beam supports a wide variety of use cases. The simplest ones are perhaps Extract, Transform, Load (ETL) tasks that are typically used to move data across systems or formats.

    Beam supports batch as well as streaming workloads. In fact, the name Beam signifies a combination of "Batch" and "Stream". It therefore presents a unified model and API to define parallel processing pipelines for both types of workloads.

    Applications that use multiple streaming frameworks (such as Apache Spark and Apache Flink) can adopt Beam to simplify the codebase. A single data pipeline written in Beam can address both execution runtimes.

    Beam can be used for scientific computations. For example, Landsat data (satellite images) can be processed in parallel with a Beam pipeline.

    IoT applications often require real-time stream processing where Beam can be used. Another use case is computing scores for users in a mobile gaming app.

  • Which are the programming languages and runners supported by Apache Beam?
    Test status summary of Beam for languages and runners. Source: Apache Beam GitHub 2020a.

    Apache Beam started with a Java SDK. By 2020, it supported Java, Go, Python 2 and Python 3. Scio is a Scala API for Apache Beam.

    Among the main runners supported are Dataflow, Apache Flink, Apache Samza, Apache Spark and Twister2. Others include Apache Hadoop MapReduce, JStorm, IBM Streams, Apache Nemo, and Hazelcast Jet. Refer to the Beam Capability Matrix for more details.

    The Java SDK supports the main runners, but other SDKs support only some of them. This is because the runners themselves are written in Java, which makes supporting non-Java SDKs non-trivial. Beam's portability framework aims to improve this situation and enable full interoperability by defining data structures and protocols that can match any language to any runner.

    Direct Runner is useful during development and testing for execution on your local machine. It checks that your pipeline conforms to the Beam model, giving greater confidence that the pipeline will run correctly on various runners.

    Beam's portability framework comes with the Universal Local Runner (ULR), which complements the Direct Runner.

  • What are the essential programming abstractions in Apache Beam?
    An example of branching pipeline with three transforms. Source: Apache Beam Docs 2020d, fig. 2.

    Beam provides the following abstractions for data processing:

    • Pipeline: Encapsulates the entire task, including reading input data, transforming data and writing output. Pipelines are created with options: PipelineOptionsFactory returns a PipelineOptions object, which can specify, for example, the location of data, the runner to use, or runner-specific configuration. A pipeline can be linear or branching.
    • PCollection: Represents the data. Every step of a pipeline inputs and outputs PCollection objects. The data can be bounded (e.g. read from a file) or unbounded (e.g. streamed from a continuous source).
    • PTransform: Represents an operation on the data. Inputs are one or more PCollection objects. Outputs are zero or more PCollection objects. A transform doesn't modify the input collection. I/O transforms read and write to external storage. Core Beam transforms include ParDo, GroupByKey, CoGroupByKey, Combine, Flatten, and Partition. Built-in I/O transforms can connect to files, filesystems (e.g. Hadoop, Amazon S3), messaging systems (e.g. Kafka, MQTT), databases (e.g. Elasticsearch, MongoDB), and more.
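    To make these abstractions concrete, here is a toy model in plain Python. It only illustrates the concepts: the class names mirror Beam's terminology, but the implementation is a hypothetical sketch, not the Beam API.

```python
# Toy illustration of Beam's core abstractions (not the real Beam API).
class PCollection:
    """Immutable dataset passed between pipeline steps."""
    def __init__(self, elements):
        self.elements = tuple(elements)  # transforms never mutate input

class PTransform:
    """An operation: takes a PCollection, returns a new PCollection."""
    def __init__(self, fn):
        self.fn = fn
    def expand(self, pcoll):
        return PCollection(self.fn(pcoll.elements))

class Pipeline:
    """Chains transforms; encapsulates the whole task."""
    def __init__(self, source):
        self.pcoll = PCollection(source)
    def apply(self, transform):
        self.pcoll = transform.expand(self.pcoll)
        return self

# A linear pipeline: read -> split into words -> count per word.
p = Pipeline(["hello world", "", "hello beam"])
p.apply(PTransform(lambda lines: [w for l in lines for w in l.split()]))
p.apply(PTransform(lambda ws: [(w, ws.count(w)) for w in sorted(set(ws))]))
print(p.pcoll.elements)  # (('beam', 1), ('hello', 2), ('world', 1))
```

    Note how each apply step consumes one PCollection and produces a new one, leaving the input untouched, which is exactly the contract Beam's real transforms obey.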
  • What's the Beam execution model?

    Creating a pipeline doesn't imply immediate execution. The designated pipeline runner constructs a workflow graph of the pipeline, in which collections are connected via transforms. The graph is then submitted to the distributed processing backend for execution. Execution happens asynchronously, although some runners, such as Dataflow, also support blocking execution.
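    This deferred execution can be sketched in plain Python (a conceptual illustration, not Beam's implementation): applying transforms only records nodes in a graph, and nothing computes until run() is called.

```python
# Conceptual sketch of deferred execution: apply() builds a graph, run() executes it.
class DeferredPipeline:
    def __init__(self):
        self.graph = []          # ordered list of (name, fn) nodes
        self.executed = False

    def apply(self, name, fn):
        self.graph.append((name, fn))   # record only; no computation here
        return self

    def run(self, source):
        # The "runner" walks the workflow graph and executes each transform.
        data = source
        for name, fn in self.graph:
            data = fn(data)
        self.executed = True
        return data

p = DeferredPipeline().apply("square", lambda xs: [x * x for x in xs]) \
                      .apply("keep even", lambda xs: [x for x in xs if x % 2 == 0])
assert not p.executed            # graph built, nothing has run yet
result = p.run([1, 2, 3, 4])
print(result)  # [4, 16]
```

    Separating graph construction from execution is what lets the same pipeline definition be handed to different runners.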

  • What are some essential concepts of the Beam model?
    The Beam Model along with an example. Source: Adapted from Akidau 2016.

    Data often has two associated times: event time, when the event actually occurred, and processing time, when the event was observed in the system. Typically, processing time lags event time. This is called skew and it's highly variable.

    Bounded or unbounded data are grouped into windows by either event time or processing time. Windows themselves can be fixed, sliding or dynamic such as based on user sessions. Processing can also be time-agnostic by chopping unbounded data into a sequence of bounded data.
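    Fixed event-time windowing can be illustrated in plain Python. The sketch below is conceptual, not Beam's windowing implementation: each event is bucketed into a window by its event timestamp, regardless of arrival order.

```python
from collections import defaultdict

def fixed_windows(events, window_size):
    """Group (event_time, value) pairs into fixed event-time windows.

    Each window covers [start, start + window_size). A plain-Python
    illustration of the idea, not Beam's windowing machinery.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        windows[(start, start + window_size)].append(value)
    return dict(windows)

# Events arrive out of order; windowing is by event time, not arrival order.
events = [(12, "a"), (3, "b"), (7, "c"), (14, "d")]
print(fixed_windows(events, 10))
# {(10, 20): ['a', 'd'], (0, 10): ['b', 'c']}
```

    Sliding and session windows follow the same principle but with overlapping or data-driven window boundaries.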

    Data can arrive out of order and with unpredictable delays. There's no way of knowing if all data applicable to an event-time window have arrived. Beam overcomes this by tracking watermarks, which give a notion of data completeness. When the watermark is reached, the results are materialized.

    We can also materialize early results (before the watermark is reached) or late results (for data arriving after the watermark) using triggers. This allows us to refine results over time. Finally, accumulation specifies how to combine multiple results of the same window.
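    The interplay of watermarks and late data can be sketched as follows. This is a conceptual illustration in plain Python with a deliberately naive watermark (the highest event time seen so far); real watermark estimation in Beam is runner-dependent.

```python
def process_stream(events):
    """Classify events as on-time or late relative to a watermark.

    The watermark here is a naive estimate: the highest event time
    seen so far. An event whose event time is behind the watermark
    is late, and would trigger a refinement of already-materialized
    results. Conceptual sketch only, not Beam's watermark machinery.
    """
    watermark = 0
    on_time, late = [], []
    for event_time, value in events:  # in arrival (processing-time) order
        if event_time < watermark:
            late.append(value)        # arrived after the watermark passed it
        else:
            on_time.append(value)
            watermark = max(watermark, event_time)
    return on_time, late, watermark

# 'b' (event time 2) arrives after the watermark has advanced to 5: it's late.
stream = [(1, "a"), (5, "c"), (2, "b"), (8, "d")]
print(process_stream(stream))  # (['a', 'c', 'd'], ['b'], 8)
```

    An accumulation policy would then decide whether the late result for 'b' replaces, or is merged with, the result already emitted for its window.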

  • Could you point to useful developer resources to learn Apache Beam?

    Apache Beam's official website contains quick start guides and documentation. The Overview page is a good place to start. There's an example to try out Apache Beam on Colab.

    The Programming Guide is an essential read for developers who wish to use Beam SDKs and create data processing pipelines.

    Visit the Learning Resources page for links to useful resources. On GitHub, there's a curated list of Beam resources and a few code samples.

    Developers who wish to contribute to the Beam project should read the Contribution Guide. The code is on GitHub. The codebase also includes useful examples: Python examples are in sdks/python/apache_beam/examples and Go examples in sdks/go/examples.



Milestones

Jay Kreps questions the need to maintain and execute two parallel pipelines, one for batch processing (e.g. using Apache Hadoop MapReduce) and one for stream processing (e.g. using Apache Storm). The batch pipeline gives exact results and allows data reprocessing. The streaming pipeline has low latency but gives approximate results. This combination has been called the Lambda Architecture. Kreps instead proposes "a language or framework that abstracts over both the real-time and batch framework."

Streaming systems at Google that led to Apache Beam. Source: Akidau 2015a, 1:00.

At a Big Data conference in London, Google engineer Tyler Akidau talks about streaming systems. He introduces some of those currently used at Google: MillWheel, Google Flume, and Cloud Dataflow. In fact, Cloud Dataflow is based on Google Flume. These recent developments bring greater maturity to streaming systems, which so far have remained less mature compared to batch processing systems.

Beam logo is released in February 2016. Source: Apache Beam 2020d.

In February, Google's Dataflow is accepted by the Apache Software Foundation as an Incubator Project and named Apache Beam. The open-sourced code includes the Dataflow Java SDK, which already supports four runners. There are plans to build a Python SDK. Google Cloud Dataflow will continue as a managed service executing on the Google Cloud Platform. The Beam logo is also released in February.


Apache Beam graduates from being an incubator project to a top-level Apache project. During the incubation period (2016), the code was refactored and documentation was improved in an extensible vendor-neutral manner.


Beam 2.0 is released. This is the first stable release of Beam under the Apache brand. It's said that Beam is at this point "truly portable, truly engine agnostic, truly ready for use."


With the release of Beam 2.5.0, the Go SDK is now supported. Go pipelines run on the Dataflow runner.


Beam 2.23.0 is released. This release supports Twister2 runner and Python 3.8. It removes support for runners Gearpump and Apex.

Sample Code

  • // Source: https://beam.apache.org/get-started/try-apache-beam/
    // Accessed 2020-09-16
    // Example use of Pipeline
    package samples.quickstart;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import java.util.Arrays;
    public class WordCount {
      public static void main(String[] args) {
        String inputsDir = "data/*";
        String outputsPrefix = "outputs/part";
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("Read lines", TextIO.read().from(inputsDir))
            .apply("Find words", FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
            .apply("Filter empty words", Filter.by((String word) -> !word.isEmpty()))
            .apply("Count words", Count.perElement())
            .apply("Format results", MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> wordCount) ->
                      wordCount.getKey() + ": " + wordCount.getValue()))
            .apply("Write results", TextIO.write().to(outputsPrefix));
        pipeline.run();
      }
    }


  1. Akidau, Tyler. 2015a. "Say Goodbye to Batch." Strata+Hadoop World, O'Reilly, May 5-7. Accessed 2020-09-16.
  2. Akidau, Tyler. 2015b. "Streaming 101: The world beyond batch: A high-level tour of modern data-processing concepts." O'Reilly, August 5. Accessed 2020-09-16.
  3. Akidau, Tyler. 2016. "Streaming 102: The world beyond batch: The what, where, when, and how of unbounded data processing." O'Reilly, January 20. Accessed 2020-09-16.
  4. Apache Beam. 2020a. "Apache Beam Overview." Get Started, Apache Beam, September 1. Accessed 2020-09-16.
  5. Apache Beam. 2020b. "Portability Framework Roadmap." Roadmap, Apache Beam, July 22. Accessed 2020-09-16.
  6. Apache Beam. 2020c. "Apache Beam Mobile Gaming Pipeline Examples." Get Started, Apache Beam, May 28. Accessed 2020-09-16.
  7. Apache Beam. 2020d. "Apache Beam Logos." Apache Beam, May 15. Accessed 2020-09-17.
  8. Apache Beam Docs. 2020a. "Apache Beam Programming Guide." Apache Beam, August 21. Accessed 2020-09-16.
  9. Apache Beam Docs. 2020b. "Beam Capability Matrix." Apache Beam, May 15. Accessed 2020-09-16.
  10. Apache Beam Docs. 2020c. "Using the Direct Runner." Apache Beam, August 21. Accessed 2020-09-16.
  11. Apache Beam Docs. 2020d. "Design Your Pipeline." Apache Beam, May 15. Accessed 2020-09-16.
  12. Apache Beam Docs. 2020e. "Using the Google Cloud Dataflow Runner." Apache Beam, May 15. Accessed 2020-09-16.
  13. Apache Beam Docs. 2020f. "Built-in I/O Transforms." Apache Beam, June 10. Accessed 2020-09-16.
  14. Apache Beam GitHub. 2020a. "apache/beam." Apache Software Foundation, on GitHub, September 16. Accessed 2020-09-16.
  15. Apache Beam GitHub. 2020b. "Beam 2.23.0 release." apache/beam, Apache Software Foundation, on GitHub, July 30. Accessed 2020-09-16.
  16. Bonaci, Davor. 2017. "Apache Beam established as a new top-level project." Blog, Apache Beam, January 10. Accessed 2020-09-16.
  17. Google Cloud. 2020. "Programming model for Apache Beam." Dataflow Documentation, Google Cloud, June 26. Accessed 2020-09-16.
  18. Google Codelabs. 2020. "Distributed Computation of NDVI from Landsat Images using Cloud Dataflow." Codelabs, Google. Accessed 2020-09-16.
  19. Iyer, Anand and Jean-Baptiste Onofré. 2018. "Apache Beam: A Look Back at 2017." Blog, Apache Beam, January 10. Accessed 2020-09-16.
  20. Janakiram MSV. 2016. "All the Apache Streaming Projects: An Exploratory Guide." The New Stack, July 8. Accessed 2020-09-16.
  21. Kreps, Jay. 2014. "Questioning the Lambda Architecture." O'Reilly, July 2. Accessed 2020-09-16.
  22. Malone, James. 2016. "Apache Beam has a logo!" Blog, Apache Beam, February 22. Accessed 2020-09-17.
  23. Mejía, Ismaël. 2018. "Making data-intensive processing efficient and portable with Apache Beam." OpenSource.com, Red Hat, Inc., May 1. Accessed 2020-09-16.
  24. Perry, Frances. 2016. "Dataflow and open source - proposal to join the Apache Incubator." Blog, Google Cloud, January 20. Accessed 2020-09-16.
  25. Psaltis, Andrew. 2016. "Apache Beam: The Case for Unifying Streaming APIs." QCon New York, June 13. Accessed 2020-09-16.
  26. Rokni, Reza. 2020. "TensorFlow Extended (TFX): Using Apache Beam for large scale data processing." TensorFlow Blog, March 10. Accessed 2020-09-16.
  27. Romanenko, Alexey. 2018. "Apache Beam 2.5.0." Blog, Apache Beam, June 26. Accessed 2020-09-16.
  28. Woodie, Alex. 2017. "Google/ASF Tackle Big Computing Trade-Offs with Apache Beam 2.0." Datanami, May 19. Accessed 2020-09-16.

Further Reading

  1. Wasilewski, Kamil. 2020. "Apache Beam: Tutorial and Beginners' Guide." Blog, Polidea, January 16. Accessed 2020-09-16.
  2. Akidau, Tyler, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing." Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1792-1803, August. Accessed 2020-09-16.
  3. Apache Beam Docs. 2020a. "Apache Beam Programming Guide." Apache Beam, August 21. Accessed 2020-09-16.
  4. Akidau, Tyler. 2015b. "Streaming 101: The world beyond batch: A high-level tour of modern data-processing concepts." O'Reilly, August 5. Accessed 2020-09-16.
  5. Akidau, Tyler. 2016. "Streaming 102: The world beyond batch: The what, where, when, and how of unbounded data processing." O'Reilly, January 20. Accessed 2020-09-16.
  6. Li, Shen, Paul Gerver, John MacMillan, Daniel Debrunner, William Marshall, and Kun-Lung Wu. 2018. "Challenges and experiences in building an efficient apache beam runner for IBM streams." Proceedings of the VLDB Endowment, vol. 11, no. 12, pp. 1742-1754, August. Accessed 2020-09-16.

Cite As

Devopedia. 2020. "Apache Beam." Version 4, September 17. Accessed 2020-11-24. https://devopedia.org/apache-beam
Contributed by
1 author

Last updated on
2020-09-17 11:17:24