Dataflow / Apache Beam — Almost all you need to know
Use a unified programming model for both batch and streaming use cases — and run in a serverless fashion on Google Cloud
Dataflow is one of the runners for the open source Apache Beam framework. Apache Beam is an open source, unified model for defining both batch and streaming data-parallel-processing pipelines. It lets you execute your pipelines on multiple execution environments such as Dataflow, Spark, Samza, and Flink. The following diagram shows an example stream analytics pipeline on Google Cloud using Dataflow.
Before we get into Dataflow / Apache Beam, let’s first check out the situations for which Dataflow is a good fit.
Dataflow vs. Dataproc
Dataproc and Dataflow are the two data processing options available on Google Cloud. Both allow you to define batch and streaming data-parallel-processing pipelines. The following table will aid you in deciding which service to use in which scenario:
The Apache Beam programming model simplifies the mechanics of large-scale data processing. Using one of the Apache Beam SDKs, you build a program that defines the pipeline. Then, one of Apache Beam’s supported distributed processing backends, such as Dataflow, executes the pipeline.
Pipelines — A pipeline encapsulates the entire series of computations involved in reading input data, transforming that data, and writing output data.
PCollection — A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline's data. A PCollection can hold a dataset of a fixed size (bounded) or an unbounded dataset from a continuously updating data source.
Transforms — A transform represents a processing operation that transforms data. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection (filter, map, etc.), and produces one or more PCollections as output.
ParDo — A ParDo is the core parallel-processing operation in the Apache Beam SDKs, invoking a user-specified function on each element of the input PCollection.
Pipeline I/O — Apache Beam I/O connectors let you read data into your pipeline and write output data from your pipeline. An I/O connector consists of a source and a sink.
Pipeline Runner — Pipeline Runners are the software that accepts a pipeline and executes it.
Event time — The time a data event occurs, determined by the timestamp on the data element itself. This contrasts with the time the actual data element gets processed at any stage in the pipeline.
Windowing — Windowing enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
Watermarks — Apache Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Apache Beam tracks a watermark because data is not guaranteed to arrive in a pipeline in time order or at predictable intervals. In addition, there are no guarantees that data events will appear in the pipeline in the same order that they were generated.
Trigger — Triggers determine when to emit aggregated results as data arrives. For bounded data, results are emitted after all of the input has been processed. For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed.
Dataflow takes bounded or unbounded data as a source and writes the processed data to sinks such as Bigtable, BigQuery, and AI Platform. It performs all the resource management required to run the pipeline and can constantly rebalance work. Using adaptive autoscaling, it scales up or down based on the transformations performed in the individual steps of the pipeline. Intelligent watermarking, auto-healing, monitoring, and log collection are all built in.
You cannot use only a key to group elements in an unbounded collection (Streaming). There might be infinitely many elements for a given key in streaming data because the data source constantly adds new elements. You can use windows, watermarks, and triggers to aggregate elements in unbounded collections.
Windows and windowing functions
Windowing functions divide unbounded collections into logical components, or windows. Types of windows:
- Tumbling windows (called fixed windows in Apache Beam)
- Hopping windows (called sliding windows in Apache Beam)
- Session windows
Tumbling windows — A tumbling window represents a consistent, disjoint time interval in the data stream.
For example, if you set a thirty-second tumbling window, the elements with timestamp values [0:00:00–0:00:30) are in the first window. Elements with timestamp values [0:00:30–0:01:00) are in the second window.
The following image illustrates how elements are divided into thirty-second tumbling windows.
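Window assignment here is simple arithmetic: an element's window starts at its timestamp rounded down to a multiple of the window size. The following is a pure-Python sketch of that semantics, not the Beam API; in Beam itself you would apply beam.WindowInto(beam.window.FixedWindows(30)).

```python
# Assign timestamps (in seconds) to thirty-second tumbling windows.
# Pure-Python illustration of the semantics, not the Beam API.
def tumbling_window(timestamp: float, size: float = 30.0) -> tuple:
    """Return the [start, end) interval of the window holding `timestamp`."""
    start = (timestamp // size) * size  # round down to a multiple of the size
    return (start, start + size)

for ts in [5, 29, 30, 59, 61]:
    print(ts, "->", tumbling_window(ts))
# 5 and 29 land in [0, 30); 30 and 59 in [30, 60); 61 in [60, 90)
```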
Hopping windows — A hopping window represents a consistent time interval in the data stream. Hopping windows can overlap, whereas tumbling windows are disjoint.
For example, a hopping window can start every ten seconds and capture one minute of data. The frequency with which hopping windows begin is called the period. This example has a one-minute window duration and a ten-second period.
The following image illustrates how elements are divided into one-minute hopping windows with a thirty-second period.
To take running averages of data, use hopping windows. You can use one-minute hopping windows with a thirty-second period to compute a one-minute running average every thirty seconds.
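Because hopping windows overlap, a single element belongs to several windows at once: every window whose start lies at most one window-size before the element's timestamp. The following is a pure-Python sketch of that assignment, not the Beam API; in Beam you would apply beam.WindowInto(beam.window.SlidingWindows(60, 10)).

```python
# Which one-minute hopping windows (ten-second period) contain a timestamp?
# Pure-Python illustration of the semantics, not the Beam API.
def hopping_windows(timestamp: float, size: float = 60.0, period: float = 10.0):
    """Return every [start, start + size) window containing `timestamp`."""
    start = (timestamp // period) * period  # latest window that can contain it
    windows = []
    while start > timestamp - size:        # walk back one window-size
        windows.append((start, start + size))
        start -= period
    return windows

# Each element falls in size / period = 6 overlapping windows.
print(len(hopping_windows(65)))  # -> 6
```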
Session windows — A session window contains elements within a gap duration of another element. The gap duration is an interval between new data in a data stream. If data arrives after the gap duration, the data is assigned to a new window.
For example, session windows can divide a data stream representing user mouse activity. This data stream might have long periods of idle time interspersed with many clicks. A session window can contain the data generated by the clicks.
Session windowing assigns different windows to each data key. Tumbling and hopping windows contain all elements in the specified time interval, regardless of data keys.
The following image visualizes how elements are divided into session windows.
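The gap-duration rule can be captured in a few lines: walk through a key's timestamps in order and start a new session whenever the gap since the previous element is exceeded. The following is a pure-Python sketch, not the Beam API; in Beam you would apply beam.WindowInto(beam.window.Sessions(gap)) and the grouping happens per key.

```python
# Split a sorted list of event timestamps into session windows.
# Pure-Python illustration of the semantics, not the Beam API.
def sessionize(timestamps, gap: float = 10.0):
    """Start a new session whenever the gap since the last event exceeds `gap`."""
    sessions = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] > gap:  # gap exceeded: close the session
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Bursts of clicks separated by idle time become separate sessions.
clicks = [1, 2, 4, 30, 31, 80]
print(sessionize(clicks))  # -> [[1, 2, 4], [30, 31], [80]]
```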
Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. By default, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and Persistent Disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.
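What a shuffle does can be shown in miniature: hash-partition (key, value) pairs across workers, then group the values by key within each partition. The following is a pure-Python sketch of the concept only, not Dataflow's actual implementation, which is distributed, fault tolerant, and heavily optimized.

```python
# Hash-partition (key, value) pairs across "workers", then group
# values by key within each partition. Conceptual sketch only, not
# Dataflow's implementation.
from collections import defaultdict

def shuffle(pairs, num_partitions: int = 3):
    """Partition by hash of key, then group values per key in each partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return [dict(p) for p in partitions]

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
grouped = shuffle(pairs)
# All values for a given key end up grouped in exactly one partition,
# which is what lets downstream aggregation run per key in parallel.
```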
Benefits of Dataflow Shuffle
The service-based Dataflow Shuffle has the following benefits:
- Faster execution time of batch pipelines for the majority of pipeline job types.
- A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
- Better autoscaling since VMs no longer hold any shuffle data and can therefore be scaled down earlier.
- Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as it would without the feature.
Dataflow Flexible Resource Scheduling (FlexRS)
Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible virtual machine (VM) instances and regular VMs. By running preemptible VMs and regular VMs in parallel, Dataflow improves the user experience if Compute Engine stops preemptible VM instances during a system event. FlexRS helps to ensure that the pipeline continues to make progress and that you do not lose previous work when Compute Engine preempts your preemptible VMs.
FlexRS is most suitable for workloads that are not time-critical, such as daily or weekly jobs that can complete within a certain time window.
Security and permissions for pipelines on Google Cloud Platform
When you run your pipeline, Dataflow uses two service accounts to manage security and permissions: the Dataflow service account and the controller service account. The Dataflow service uses the Dataflow service account as part of the job creation request (for example, to check project quota and to create worker instances on your behalf), and during job execution to manage the job. Worker instances use the controller service account to access input and output resources after you submit your job.
When you run your pipeline on the Dataflow service, it uses a service account (service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com). This account is automatically created when a Dataflow project is created, is assigned the Dataflow Service Agent role on the project, and has the necessary permissions to run a Dataflow job under the project, including starting Compute Engine workers.
Workers use your project's Compute Engine default service account as the controller service account. This service account (<project-number>-compute@developer.gserviceaccount.com) is automatically created when you enable the Compute Engine API. If you want to use your own service account instead, it must have the Dataflow Worker role.
You gotta know!
- You can run Dataflow pipelines using SQL in the BigQuery UI using the Cloud Dataflow Engine. Dataflow SQL does not process late data.
- You may run up to 100 concurrent Dataflow jobs per Google Cloud project.
- The Dataflow service currently allows a maximum of 1000 Compute Engine instances per job. The default machine type is n1-standard-1 for a batch job and n1-standard-4 for a streaming job; when using the default machine types, the Dataflow service can therefore allocate up to 4000 cores per job.
- Shared-core machine types such as g1-series workers are not supported under Dataflow's Service Level Agreement.