Ch. 5 Pipelines with Cloud Dataflow

31 Terms

1

What is a pipeline?

full set of data transformations from ingestion to output

2

What is Cloud Dataflow?

an ETL pipeline tool that processes data in parallel

3

What are Cloud Dataflow pipeline executions called?

jobs

4

What type of jobs does Dataflow support?

real-time and batch processing jobs

5

What are the uses of Dataflow?

fraud detection, personalization of user experiences, and IoT analytics

6

What is the structure of a Dataflow pipeline?

read data from a source using the Apache Beam SDK, apply data transformations, and output the result to a sink using the Apache Beam SDK
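
A rough sketch of that read-transform-write structure using the Apache Beam Python SDK (the gs:// paths are placeholders):

```python
import apache_beam as beam

# The gs:// paths below are hypothetical placeholders.
with beam.Pipeline() as p:
    (p
     | 'ReadSource' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')    # read from a source
     | 'Transform' >> beam.Map(lambda line: line.upper())                    # apply a transformation
     | 'WriteSink' >> beam.io.WriteToText('gs://my-bucket/output/results'))  # write to a sink
```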

7

What is the Driver program & what do you use to write it?

it defines your pipeline; you write it using the Apache Beam SDK in Java or Python

8

What is a Runner?

software that manages the execution of a pipeline and acts as a translator for the backend execution framework

9

What is the Backend?

a massively parallel processing system (aka Cloud Dataflow)

10

How do the Driver & Runner work together?

the driver is submitted to the runner for processing
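
A minimal sketch of how a driver program picks its runner through pipeline options: DirectRunner executes locally (handy for debugging), while DataflowRunner submits the job to the Cloud Dataflow backend. The project, region, and bucket values are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The runner option decides what executes the driver program's pipeline:
# 'DirectRunner' runs it locally, 'DataflowRunner' submits it to Cloud Dataflow.
# project, region, and temp_location values are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',
    region='us-central1',
    temp_location='gs://my-bucket/temp')

with beam.Pipeline(options=options) as p:
    p | beam.Create(['hello', 'dataflow']) | beam.Map(print)
```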

11

What is a PCollection?

represents an array of data elements as it is transformed inside a pipeline

12

When is a PCollection treated as Batch vs. Streaming?

when the data is bounded (e.g., read in full from a fixed source) it is treated as batch; when it is unbounded (e.g., arriving continuously from something like a smart device) it is treated as streaming
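
A hedged sketch of the contrast: a bounded read produces a PCollection processed as batch, while an unbounded read such as Pub/Sub (with the streaming option set) produces a streaming one. The bucket and topic names are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded source: the PCollection is complete, so the job runs as batch.
with beam.Pipeline() as p:
    batch_pc = p | beam.io.ReadFromText('gs://my-bucket/events.csv')

# Unbounded source (e.g. smart-device events on Pub/Sub): streaming job.
streaming_options = PipelineOptions(streaming=True)
with beam.Pipeline(options=streaming_options) as p:
    streaming_pc = p | beam.io.ReadFromPubSub(
        topic='projects/my-project-id/topics/device-events')
```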

13

Where should you debug your pipeline?

on your local machine

14

What are the steps to design a pipeline?

location of data, input data structure & format, transformation objectives, output data structure & location

15

What are the steps to the creation of a pipeline?

create pipeline object, create a PCollection using read or create transform, apply transforms, write out final PCollection
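
Those four steps, sketched with the Apache Beam Python SDK (the output path is a placeholder):

```python
import apache_beam as beam

# 1. create the pipeline object
with beam.Pipeline() as p:
    # 2. create an initial PCollection with a create (or read) transform
    numbers = p | 'Create' >> beam.Create([1, 2, 3, 4, 5])
    # 3. apply transforms to produce new PCollections
    squares = numbers | 'Square' >> beam.Map(lambda n: n * n)
    # 4. write out the final PCollection
    squares | 'Write' >> beam.io.WriteToText('/tmp/squares')
```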

16

How do you merge branching pipelines?

flatten or join transforms
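
A small sketch of the flatten case, merging two branches into one PCollection (the example data is made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    web_orders = p | 'WebOrders' >> beam.Create(['order-1', 'order-2'])
    mobile_orders = p | 'MobileOrders' >> beam.Create(['order-3'])

    # Flatten merges PCollections of the same element type into one PCollection.
    all_orders = (web_orders, mobile_orders) | beam.Flatten()
    all_orders | beam.Map(print)
```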

17

What is an Aggregation transform?

takes multiple input elements from a PC & returns one output PC
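
For example, a CombinePerKey aggregation that collapses many input elements into one output element per key (the sample data is made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([('shoes', 2), ('hats', 5), ('shoes', 3)])

    # CombinePerKey folds every element sharing a key into a single output element.
    totals = sales | beam.CombinePerKey(sum)   # -> ('shoes', 5), ('hats', 5)
    totals | beam.Map(print)
```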

18

Characteristics of a PCollection (PC)

PCs are boundless arrays by default; PCs do not support random access; PC elements must all be the same datatype, have a timestamp attached, and are immutable

19

What is processing time?

the total time taken by all of the transformations that a PC goes through

20

What is windowing?

allows you to group PC elements by timestamp, especially useful for streaming data
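
A sketch of windowing with fixed 60-second windows; the timestamps here are attached by hand for illustration, whereas streaming sources normally supply them:

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create([('alice', 5), ('alice', 30), ('bob', 70)])
        # Attach an event timestamp (in seconds) taken from each element.
        | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
        # Group elements into fixed 60-second windows based on those timestamps.
        | beam.WindowInto(window.FixedWindows(60))
        # Aggregations now run per window rather than over the whole PCollection.
        | beam.CombinePerKey(sum)
        | beam.Map(print))
```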

21

What is the Cloud Dataflow service account?

an account created automatically when a Dataflow project is created; it manages job resources

22

What is an example of a Cloud Dataflow service account managing job resources?

removing worker VMs during pipeline processing

23

What type of role does the Cloud Dataflow service account assume, and why?

the service agent role, so that it has permission to run Dataflow jobs

24

What is the controller service account?

it is the account assigned to Dataflow workers

25

How is Compute Engine involved in the controller service account?

by default, the controller service account is the Compute Engine default service account

26

How are Compute Engine instances involved in Dataflow?

they execute pipeline operations

27

What does the controller service account do?

it performs metadata operations & accesses pipeline files, e.g. determining the size of a file in Cloud Storage

28

What is the user-managed controller service account & why use it?

it replaces the Compute Engine service account, so that you can create and use resources with custom access control
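
A hedged sketch of pointing a job at a user-managed controller service account via the service_account_email pipeline option (the account and project names are placeholders):

```python
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

# Placeholders: substitute your own project and user-managed service account.
options = PipelineOptions(runner='DataflowRunner',
                          project='my-project-id',
                          region='us-central1')
options.view_as(GoogleCloudOptions).service_account_email = (
    'my-dataflow-workers@my-project-id.iam.gserviceaccount.com')
# Worker VMs for this job then run as that account instead of the
# Compute Engine default service account.
```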

29

What is a regional endpoint?

a GCP geographic region

30

What does a regional endpoint do?

manages: metadata about Dataflow jobs, the workers, and the workers' zone location within the specified region
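
A short sketch of choosing the regional endpoint for a job with the region option (all values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Select the regional endpoint that will manage the job's metadata and workers.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project-id',
    '--region=europe-west1',
    '--temp_location=gs://my-bucket/temp',
])
```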

31

What is the purpose of a trigger in Dataflow?

triggers determine when to emit output data, and behave differently for bounded and unbounded data
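
A hedged sketch of a trigger on 60-second windows: emit early panes every 30 seconds of processing time, a final pane when the watermark passes the end of the window, and discard already-fired panes. The events PCollection passed in is assumed to come from an unbounded, timestamped source.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)

def window_with_trigger(events):
    # 'events' is assumed to be an unbounded PCollection with timestamps.
    # Early panes fire every 30 seconds of processing time, the final pane
    # fires when the watermark passes the end of the 60-second window, and
    # already-fired panes are discarded rather than accumulated.
    return events | beam.WindowInto(
        window.FixedWindows(60),
        trigger=AfterWatermark(early=AfterProcessingTime(30)),
        accumulation_mode=AccumulationMode.DISCARDING)
```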