Apache Beam

Apache Beam makes your data pipelines portable across languages and runtimes. Source: Mejía 2018, fig. 1.

With the rise of Big Data, many frameworks have emerged to process that data, targeting batch processing, stream processing, or both. Examples include Apache Hadoop MapReduce, Apache Spark, Apache Storm, and Apache Flink.

The problem for developers is that once a better framework arrives, they have to rewrite their code. Mature old code gets replaced with immature new code. They may even have to maintain two codebases and pipelines, one for batch processing and another for stream processing.

Apache Beam aims to solve this problem by offering a unified programming model for both batch and streaming workloads that can run on any supported distributed processing backend. Beam offers SDKs in a few languages and supports a number of backend execution engines.

Discussion

  • Why do I need Apache Beam when there are already so many data processing frameworks?

    Having many processing frameworks is part of the problem. Developers have to write and maintain multiple pipelines to work with different frameworks. When a better framework comes along, there's significant effort involved in adopting it. Apache Beam solves this by enabling a single pipeline to be reused across multiple runtimes.

    The benefit of Apache Beam is therefore both in development and operations. Developers can focus on their pipelines and worry less about the runtime. Pipelines become portable, so there's no lock-in to a particular runtime. Beam SDKs allow developers to quickly integrate a pipeline into their applications.

    The Beam Model offers powerful semantics for developers to think about data processing at a higher level of abstraction. Concepts such as windowing, ordering, triggering and accumulation are part of the Beam model.

    Beam has auto-scaling. It looks at current progress to dynamically reassign work to idle workers or scale the number of workers up or down.

    Since Apache Beam is open source, support for more languages (by SDK writers) or runtimes (by runner writers) can be added by the community.

  • What are the use cases served by Apache Beam?
    Some use cases of Apache Beam. Source: Iyer and Onofré 2018.

    Apache Beam is suitable for any task that can be parallelized by breaking down the data into smaller parts, each part running independently. Beam supports a wide variety of use cases. The simplest ones are perhaps Extract, Transform, Load (ETL) tasks that are typically used to move data across systems or formats.

    Beam supports batch as well as streaming workloads. In fact, the name Beam signifies a combination of "Batch" and "Stream". It therefore presents a unified model and API to define parallel processing pipelines for both types of workloads.

    Applications that use multiple streaming frameworks (such as Apache Spark and Apache Flink) can adopt Beam to simplify the codebase. A single data pipeline written in Beam can address both execution runtimes.

    Beam can be used for scientific computations. For example, Landsat data (satellite images) can be processed in parallel using Beam.

    IoT applications often require real-time stream processing where Beam can be used. Another use case is computing scores for users in a mobile gaming app.

  • Which are the programming languages and runners supported by Apache Beam?
    Test status summary of Beam for languages and runners. Source: Apache Beam GitHub 2020a.

    Apache Beam started with a Java SDK. By 2020, it supported Java, Go, Python 2 and Python 3. Scio is a Scala API for Apache Beam.

    Among the main runners supported are Dataflow, Apache Flink, Apache Samza, Apache Spark and Twister2. Others include Apache Hadoop MapReduce, JStorm, IBM Streams, Apache Nemo, and Hazelcast Jet. Refer to the Beam Capability Matrix for more details.

    The Java SDK supports the main runners, but other SDKs support only some of them. This is because the runners themselves are written in Java, which makes support for non-Java SDKs non-trivial. Beam's portability framework aims to improve this situation and enable full interoperability. This framework would define data structures and protocols that can match any language to any runner.

    The Direct Runner executes pipelines on your local machine and is useful during development and testing. It checks if your pipeline conforms to the Beam model, which brings greater confidence that the pipeline will run correctly on various runners.
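
    A minimal sketch of selecting the Direct Runner in the Java SDK; the same pipeline could target another runner by passing a different flag (such as --runner=FlinkRunner) instead.

    // Sketch: pinning a pipeline to the Direct Runner for local testing.
    import org.apache.beam.runners.direct.DirectRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
     
    public class DirectRunnerSketch {
      public static void main(String[] args) {
        // Equivalent to passing --runner=DirectRunner on the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.setRunner(DirectRunner.class);
     
        Pipeline pipeline = Pipeline.create(options);
        // ... apply transforms here ...
        pipeline.run().waitUntilFinish();
      }
    }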

    Beam's portability framework comes with the Universal Local Runner (ULR), which complements the Direct Runner.

  • What are the essential programming abstractions in Apache Beam?
    An example of a branching pipeline with three transforms. Source: Apache Beam Docs 2020d, fig. 2.

    Beam provides the following abstractions for data processing, tied together in a minimal sketch after this list:

    • Pipeline: Encapsulates the entire task, including reading input data, transforming data and writing output. Pipelines are created with options, using PipelineOptionsFactory that returns a PipelineOptions object. Options can specify, for example, the location of data, the runner to use, or runner-specific configuration. A pipeline can be linear or branching.
    • PCollection: Represents the data. Every step of a pipeline inputs and outputs PCollection objects. The data can be bounded (eg. read from a file) or unbounded (eg. streamed from a continuous source).
    • PTransform: Represents an operation on the data. Inputs are one or more PCollection objects. Outputs are zero or more PCollection objects. A transform doesn't modify the input collection. I/O transforms read and write to external storage. Core Beam transforms include ParDo, GroupByKey, CoGroupByKey, Combine, Flatten, and Partition. Built-in I/O transforms can connect to files, filesystems (eg. Hadoop, Amazon S3), messaging systems (eg. Kafka, MQTT), databases (eg. Elasticsearch, MongoDB), and more.
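
    The following sketch shows these abstractions in the Java SDK: a PipelineOptions object configures the pipeline, TextIO produces a bounded PCollection, and a ParDo transform outputs a new collection without modifying its input. The file paths are placeholders.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
     
    public class AbstractionsSketch {
      public static void main(String[] args) {
        // Options can carry the runner choice and runner-specific configuration.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
     
        // I/O transform: reading a file yields a bounded PCollection.
        PCollection<String> lines = pipeline.apply("Read", TextIO.read().from("input.txt"));
     
        // Core transform (ParDo): outputs a new PCollection; the input is untouched.
        PCollection<String> upper = lines.apply("Uppercase",
            ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(@Element String line, OutputReceiver<String> out) {
                out.output(line.toUpperCase());
              }
            }));
     
        upper.apply("Write", TextIO.write().to("output"));
        pipeline.run();
      }
    }
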
  • What's the Beam execution model?

    Creating a pipeline doesn't imply immediate execution. The designated pipeline runner will construct a workflow graph of the pipeline. Such a graph connects collections via transforms. Then the graph is submitted to the distributed processing backend for execution. Execution happens asynchronously. However, some runners such as Dataflow support blocking execution.
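
    A sketch of this behaviour in the Java SDK: run() submits the workflow graph and returns a PipelineResult immediately, while waitUntilFinish() (used here for illustration) blocks until the pipeline completes.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
     
    public class ExecutionSketch {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // ... transforms would be applied here before running ...
     
        // run() is asynchronous: it submits the workflow graph to the backend.
        PipelineResult result = pipeline.run();
     
        // Block until execution completes, mimicking a blocking runner.
        PipelineResult.State state = result.waitUntilFinish();
        System.out.println("Final state: " + state);
      }
    }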

  • What are some essential concepts of the Beam model?
    The Beam Model along with an example. Source: Adapted from Akidau 2016.

    Data often has two associated times: event time, when the event actually occurred, and processing time, when the event was observed in the system. Typically, processing time lags event time. This lag is called skew and it's highly variable.

    Bounded or unbounded data are grouped into windows by either event time or processing time. Windows themselves can be fixed, sliding, or dynamic, such as those based on user sessions. Processing can also be time-agnostic by chopping unbounded data into a sequence of bounded data.
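
    As an illustrative sketch in the Java SDK, the Window transform assigns elements to windows; all durations below are arbitrary choices.

    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Sessions;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;
     
    public class WindowingSketch {
      // Fixed: each element goes into a one-minute window by event time.
      static PCollection<String> fixed(PCollection<String> events) {
        return events.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
      }
     
      // Sliding: one-minute windows starting every ten seconds.
      static PCollection<String> sliding(PCollection<String> events) {
        return events.apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(1)).every(Duration.standardSeconds(10))));
      }
     
      // Dynamic: session windows that close after a ten-minute gap in activity.
      static PCollection<String> sessions(PCollection<String> events) {
        return events.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));
      }
    }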

    Data can arrive out of order and with unpredictable delays. There's no way of knowing if all data applicable to an event-time window have arrived. Beam overcomes this by tracking watermarks, which gives a notion of data completeness. When a watermark is reached, the results are materialized.

    We can also materialize early results (before the watermark is reached) or late results (data arriving after the watermark) using triggers. This allows us to refine results over time. Finally, accumulation specifies how to combine multiple results of the same window.
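
    A sketch combining these concepts in the Java SDK: an early firing before the watermark, late firings for data up to five minutes behind it, and accumulating panes so that later results refine earlier ones. The durations are illustrative.

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;
     
    public class TriggerSketch {
      static PCollection<String> triggered(PCollection<String> events) {
        return events.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Materialize the main result when the watermark passes the window...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ...emit a speculative early result 30 seconds after the first element...
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                // ...and re-fire for every late element.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Accept data arriving up to five minutes behind the watermark.
            .withAllowedLateness(Duration.standardMinutes(5))
            // Accumulation mode: each firing includes previously emitted results.
            .accumulatingFiredPanes());
      }
    }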

  • Could you point to useful developer resources to learn Apache Beam?

    Apache Beam's official website contains quick start guides and documentation. The Overview page is a good place to start. There's an example to try out Apache Beam on Colab.

    The Programming Guide is an essential read for developers who wish to use Beam SDKs and create data processing pipelines.

    Visit the Learning Resources page for links to useful resources. On GitHub, there's a curated list of Beam resources and a few code samples.

    Developers who wish to contribute to the Beam project should read the Contribution Guide. The code is on GitHub. The codebase also includes useful examples: Python examples are in sdks/python/apache_beam/examples and Go examples in sdks/go/examples.

Milestones

Jul
2014

Jay Kreps questions the need to maintain and execute parallel pipelines, one for batch processing (eg. using Apache Hadoop MapReduce) and one for stream processing (eg. using Apache Storm). The batch pipeline gives exact results and allows data reprocessing. The streaming pipeline has low latency and gives approximate results. This has been called Lambda Architecture. Kreps instead proposes, "a language or framework that abstracts over both the real-time and batch framework."

May
2015
Streaming systems at Google that led to Apache Beam. Source: Akidau 2015a, 1:00.

At a Big Data conference in London, Google engineer Tyler Akidau talks about streaming systems. He introduces some of those currently used at Google: MillWheel, Google Flume, and Cloud Dataflow. In fact, Cloud Dataflow is based on Google Flume. These recent developments bring greater maturity to streaming systems, which have so far lagged behind batch processing systems.

2016
The Beam logo is released in February 2016. Source: Apache Beam 2020d.

In February, Google's Dataflow is accepted by the Apache Software Foundation as an Incubator Project. It's named Apache Beam. The open-sourced code includes the Dataflow Java SDK, which already supports four runners. There are plans to build a Python SDK. Google Cloud Dataflow will continue as a managed service executing on the Google Cloud Platform. The Beam logo is also released in February.

Jan
2017

Apache Beam graduates from being an incubator project to a top-level Apache project. During the incubation period (2016), the code was refactored and documentation was improved in an extensible vendor-neutral manner.

May
2017

Beam 2.0 is released. This is the first stable release of Beam under the Apache brand. It's said that Beam is at this point "truly portable, truly engine agnostic, truly ready for use."

Jun
2018

With the release of Beam 2.5.0, the Go SDK is now supported. Go pipelines run on the Dataflow runner.

Jul
2020

Beam 2.23.0 is released. This release supports Twister2 runner and Python 3.8. It removes support for runners Gearpump and Apex.

Sample Code

  • // Source: https://beam.apache.org/get-started/try-apache-beam/
    // Accessed 2020-09-16
    // Example use of Pipeline
     
    package samples.quickstart;
     
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;
     
    import java.util.Arrays;
     
    public class WordCount {
      public static void main(String[] args) {
        String inputsDir = "data/*";
        String outputsPrefix = "outputs/part";
     
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("Read lines", TextIO.read().from(inputsDir))
            .apply("Find words", FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
            .apply("Filter empty words", Filter.by((String word) -> !word.isEmpty()))
            .apply("Count words", Count.perElement())
            .apply("Write results", MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> wordCount) ->
                      wordCount.getKey() + ": " + wordCount.getValue()))
            .apply(TextIO.write().to(outputsPrefix));
        pipeline.run();
      }
    }
     

References

  1. Akidau, Tyler. 2015a. "Say Goodbye to Batch." Strata+Hadoop World, O'Reilly, May 5-7. Accessed 2020-09-16.
  2. Akidau, Tyler. 2015b. "Streaming 101: The world beyond batch: A high-level tour of modern data-processing concepts." O'Reilly, August 5. Accessed 2020-09-16.
  3. Akidau, Tyler. 2016. "Streaming 102: The world beyond batch: The what, where, when, and how of unbounded data processing." O'Reilly, January 20. Accessed 2020-09-16.
  4. Apache Beam. 2020a. "Apache Beam Overview." Get Started, Apache Beam, September 1. Accessed 2020-09-16.
  5. Apache Beam. 2020b. "Portability Framework Roadmap." Roadmap, Apache Beam, July 22. Accessed 2020-09-16.
  6. Apache Beam. 2020c. "Apache Beam Mobile Gaming Pipeline Examples." Get Started, Apache Beam, May 28. Accessed 2020-09-16.
  7. Apache Beam. 2020d. "Apache Beam Logos." Apache Beam, May 15. Accessed 2020-09-17.
  8. Apache Beam Docs. 2020a. "Apache Beam Programming Guide." Apache Beam, August 21. Accessed 2020-09-16.
  9. Apache Beam Docs. 2020b. "Beam Capability Matrix." Apache Beam, May 15. Accessed 2020-09-16.
  10. Apache Beam Docs. 2020c. "Using the Direct Runner." Apache Beam, August 21. Accessed 2020-09-16.
  11. Apache Beam Docs. 2020d. "Design Your Pipeline." Apache Beam, May 15. Accessed 2020-09-16.
  12. Apache Beam Docs. 2020e. "Using the Google Cloud Dataflow Runner." Apache Beam, May 15. Accessed 2020-09-16.
  13. Apache Beam Docs. 2020f. "Built-in I/O Transforms." Apache Beam, June 10. Accessed 2020-09-16.
  14. Apache Beam GitHub. 2020a. "apache/beam." Apache Software Foundation, on GitHub, September 16. Accessed 2020-09-16.
  15. Apache Beam GitHub. 2020b. "Beam 2.23.0 release." apache/beam, Apache Software Foundation, on GitHub, July 30. Accessed 2020-09-16.
  16. Bonaci, Davor. 2017. "Apache Beam established as a new top-level project." Blog, Apache Beam, January 10. Accessed 2020-09-16.
  17. Google Cloud. 2020. "Programming model for Apache Beam." Dataflow Documentation, Google Cloud, June 26. Accessed 2020-09-16.
  18. Google Codelabs. 2020. "Distributed Computation of NDVI from Landsat Images using Cloud Dataflow." Codelabs, Google. Accessed 2020-09-16.
  19. Iyer, Anand and Jean-Baptiste Onofré. 2018. "Apache Beam: A Look Back at 2017." Blog, Apache Beam, January 10. Accessed 2020-09-16.
  20. Janakiram MSV. 2016. "All the Apache Streaming Projects: An Exploratory Guide." The New Stack, July 8. Accessed 2020-09-16.
  21. Kreps, Jay. 2014. "Questioning the Lambda Architecture." O'Reilly, July 2. Accessed 2020-09-16.
  22. Malone, James. 2016. "Apache Beam has a logo!" Blog, Apache Beam, February 22. Accessed 2020-09-17.
  23. Mejía, Ismaël. 2018. "Making data-intensive processing efficient and portable with Apache Beam." OpenSource.com, Red Hat, Inc., May 1. Accessed 2020-09-16.
  24. Perry, Frances. 2016. "Dataflow and open source - proposal to join the Apache Incubator." Blog, Google Cloud, January 20. Accessed 2020-09-16.
  25. Psaltis, Andrew. 2016. "Apache Beam: The Case for Unifying Streaming APIs." QCon New York, June 13. Accessed 2020-09-16.
  26. Rokni, Reza. 2020. "TensorFlow Extended (TFX): Using Apache Beam for large scale data processing." TensorFlow Blog, March 10. Accessed 2020-09-16.
  27. Romanenko, Alexey. 2018. "Apache Beam 2.5.0." Blog, Apache Beam, June 26. Accessed 2020-09-16.
  28. Woodie, Alex. 2017. "Google/ASF Tackle Big Computing Trade-Offs with Apache Beam 2.0." Datanami, May 19. Accessed 2020-09-16.

Further Reading

  1. Wasilewski, Kamil. 2020. "Apache Beam: Tutorial and Beginners' Guide." Blog, Polidea, January 16. Accessed 2020-09-16.
  2. Akidau, Tyler, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing." Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1792-1803, August. Accessed 2020-09-16.
  3. Apache Beam Docs. 2020a. "Apache Beam Programming Guide." Apache Beam, August 21. Accessed 2020-09-16.
  4. Akidau, Tyler. 2015b. "Streaming 101: The world beyond batch: A high-level tour of modern data-processing concepts." O'Reilly, August 5. Accessed 2020-09-16.
  5. Akidau, Tyler. 2016. "Streaming 102: The world beyond batch: The what, where, when, and how of unbounded data processing." O'Reilly, January 20. Accessed 2020-09-16.
  6. Li, Shen, Paul Gerver, John MacMillan, Daniel Debrunner, William Marshall, and Kun-Lung Wu. 2018. "Challenges and experiences in building an efficient Apache Beam runner for IBM Streams." Proceedings of the VLDB Endowment, vol. 11, no. 12, pp. 1742-1754, August. Accessed 2020-09-16.

Cite As

Devopedia. 2020. "Apache Beam." Version 4, September 17. Accessed 2024-06-25. https://devopedia.org/apache-beam