Ch. 5 Pipelines with Cloud Dataflow

31 Terms

1

What is a pipeline?

full set of data transformations from ingestion to output

2

What is Cloud Dataflow?

an ETL pipeline tool that processes data in parallel

3

What are Cloud Dataflow pipeline executions called?

jobs

4

What type of jobs does Dataflow support?

real-time and batch processing jobs

5

What are the uses of Dataflow?

fraud detection, personalization of user experiences, and IoT analytics

6

What is the structure of a Dataflow pipeline?

read data from a source using the Apache Beam SDK, apply data transformations, and output the result to a sink using the Apache Beam SDK
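
A rough sketch of that read-transform-write structure using the Apache Beam Python SDK (the gs:// paths are placeholders):

```python
import apache_beam as beam

# The gs:// paths below are hypothetical placeholders.
with beam.Pipeline() as p:
    (p
     | 'ReadSource' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')    # read from a source
     | 'Transform' >> beam.Map(lambda line: line.upper())                    # apply a transformation
     | 'WriteSink' >> beam.io.WriteToText('gs://my-bucket/output/results'))  # write to a sink
```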

7

What is the Driver program & what do you use to write it?

it defines your pipeline; you write it using the Apache Beam SDK in Java or Python

8

What is a Runner?

software that manages the execution of a pipeline and acts as a translator for the backend execution framework

9

What is the Backend?

a massively parallel processing system (aka Cloud Dataflow)

10

How do the Driver & Runner work together?

the driver is submitted to the runner for processing
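
A minimal sketch of how a driver program picks its runner through pipeline options: DirectRunner executes locally (handy for debugging), while DataflowRunner submits the job to the Cloud Dataflow backend. The project, region, and bucket values are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The runner option decides what executes the driver program's pipeline:
# 'DirectRunner' runs it locally, 'DataflowRunner' submits it to Cloud Dataflow.
# project, region, and temp_location values are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',
    region='us-central1',
    temp_location='gs://my-bucket/temp')

with beam.Pipeline(options=options) as p:
    p | beam.Create(['hello', 'dataflow']) | beam.Map(print)
```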

11

What is a PCollection?

represents an array of data elements as it is transformed inside a pipeline

12

When is a PCollection treated as Batch vs. Streaming?

when the data is bounded (e.g., read in full from a fixed source) it is treated as batch; when it is unbounded (e.g., arriving continuously from something like a smart device) it is treated as streaming
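
A hedged sketch of the contrast: a bounded read produces a PCollection processed as batch, while an unbounded read such as Pub/Sub (with the streaming option set) produces a streaming one. The bucket and topic names are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded source: the PCollection is complete, so the job runs as batch.
with beam.Pipeline() as p:
    batch_pc = p | beam.io.ReadFromText('gs://my-bucket/events.csv')

# Unbounded source (e.g. smart-device events on Pub/Sub): streaming job.
streaming_options = PipelineOptions(streaming=True)
with beam.Pipeline(options=streaming_options) as p:
    streaming_pc = p | beam.io.ReadFromPubSub(
        topic='projects/my-project-id/topics/device-events')
```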

13

Where should you debug your pipeline?

on your local machine

14

What are the steps to design a pipeline?

location of data, input data structure & format, transformation objectives, output data structure & location

15

What are the steps to the creation of a pipeline?

create pipeline object, create a PCollection using read or create transform, apply transforms, write out final PCollection
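
Those four steps, sketched with the Apache Beam Python SDK (the output path is a placeholder):

```python
import apache_beam as beam

# 1. create the pipeline object
with beam.Pipeline() as p:
    # 2. create an initial PCollection with a create (or read) transform
    numbers = p | 'Create' >> beam.Create([1, 2, 3, 4, 5])
    # 3. apply transforms to produce new PCollections
    squares = numbers | 'Square' >> beam.Map(lambda n: n * n)
    # 4. write out the final PCollection
    squares | 'Write' >> beam.io.WriteToText('/tmp/squares')
```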

16

How do you merge branching pipelines?

flatten or join transforms
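
A small sketch of the flatten case, merging two branches into one PCollection (the example data is made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    web_orders = p | 'WebOrders' >> beam.Create(['order-1', 'order-2'])
    mobile_orders = p | 'MobileOrders' >> beam.Create(['order-3'])

    # Flatten merges PCollections of the same element type into one PCollection.
    all_orders = (web_orders, mobile_orders) | beam.Flatten()
    all_orders | beam.Map(print)
```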

17

What is an Aggregation transform?

takes multiple input elements from a PC & returns one output PC
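
For example, a CombinePerKey aggregation that collapses many input elements into one output element per key (the sample data is made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([('shoes', 2), ('hats', 5), ('shoes', 3)])

    # CombinePerKey folds every element sharing a key into a single output element.
    totals = sales | beam.CombinePerKey(sum)   # -> ('shoes', 5), ('hats', 5)
    totals | beam.Map(print)
```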

18

Characteristics of a PCollection (PC)

PCs are boundless arrays by default; PCs do not support random access; PC elements must all be the same datatype, have a timestamp attached, and are immutable

19

What is processing time?

the total time taken by all of the transformations that a PC goes through

20

What is windowing?

allows you to group PC elements by timestamp, especially useful for streaming data
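
A sketch of windowing with fixed 60-second windows; the timestamps here are attached by hand for illustration, whereas streaming sources normally supply them:

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create([('alice', 5), ('alice', 30), ('bob', 70)])
        # Attach an event timestamp (in seconds) taken from each element.
        | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
        # Group elements into fixed 60-second windows based on those timestamps.
        | beam.WindowInto(window.FixedWindows(60))
        # Aggregations now run per window rather than over the whole PCollection.
        | beam.CombinePerKey(sum)
        | beam.Map(print))
```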

21

What is the Cloud Dataflow service account?

an account created automatically when a Dataflow project is created; it manages job resources

22

What is an example of a Cloud Dataflow service account managing job resources?

removing worker VMs during pipeline processing

23

What type of role does the Cloud Dataflow service account assume, and why?

the service agent role, so that it has permission to run Dataflow jobs

24

What is the controller service account?

it is the account assigned to Dataflow workers

25

How is Compute Engine involved in the controller service account?

by default, the controller service account is the Compute Engine default service account

26

How are Compute Engine instances involved in Dataflow?

they execute pipeline operations

27

What does the controller service account do?

it performs metadata operations & accesses pipeline files, e.g. determining the size of a file in Cloud Storage

28

What is the user-managed controller service account & why use it?

it replaces the Compute Engine service account, so that you can create and use resources with custom access control
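
A hedged sketch of pointing a job at a user-managed controller service account via the service_account_email pipeline option (the account and project names are placeholders):

```python
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

# Placeholders: substitute your own project and user-managed service account.
options = PipelineOptions(runner='DataflowRunner',
                          project='my-project-id',
                          region='us-central1')
options.view_as(GoogleCloudOptions).service_account_email = (
    'my-dataflow-workers@my-project-id.iam.gserviceaccount.com')
# Worker VMs for this job then run as that account instead of the
# Compute Engine default service account.
```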

29

What is a regional endpoint?

a GCP geographic region

30

What does a regional endpoint do?

manages: metadata about Dataflow jobs, the workers, and the workers' zone location within the specified region
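
A short sketch of choosing the regional endpoint for a job with the region option (all values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Select the regional endpoint that will manage the job's metadata and workers.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project-id',
    '--region=europe-west1',
    '--temp_location=gs://my-bucket/temp',
])
```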

31

What is the purpose of a trigger in Dataflow?

triggers determine when to emit output data, and behave differently for bounded and unbounded data
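
A hedged sketch of a trigger on 60-second windows: emit early panes every 30 seconds of processing time, a final pane when the watermark passes the end of the window, and discard already-fired panes. The events PCollection passed in is assumed to come from an unbounded, timestamped source.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)

def window_with_trigger(events):
    # 'events' is assumed to be an unbounded PCollection with timestamps.
    # Early panes fire every 30 seconds of processing time, the final pane
    # fires when the watermark passes the end of the 60-second window, and
    # already-fired panes are discarded rather than accumulated.
    return events | beam.WindowInto(
        window.FixedWindows(60),
        trigger=AfterWatermark(early=AfterProcessingTime(30)),
        accumulation_mode=AccumulationMode.DISCARDING)
```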