What is a pipeline?
full set of data transformations from ingestion to output
What is Cloud Dataflow?
an ETL pipeline service that processes data in parallel
What are Cloud Dataflow pipeline executions called?
jobs
What type of jobs does Dataflow support?
real-time and batch processing jobs
What are the uses of Dataflow?
fraud detection, personalization of user experiences, and IoT analytics
What is the structure of a Dataflow pipeline?
read data from a source, apply data transformations, and write it out to a sink, all using the Apache Beam SDK
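The source → transforms → sink structure can be sketched in plain Python (illustrative stand-ins only; a real pipeline would use the Apache Beam SDK's I/O connectors and transforms):

```python
def read_source():
    # stand-in for a Beam read from a source, e.g. ReadFromText
    return ["alice,3", "bob,5"]

def transform(records):
    # stand-in for a Map/ParDo transform: keep the name, uppercase it
    return [line.split(",")[0].upper() for line in records]

def write_sink(elements):
    # stand-in for a Beam write to a sink, e.g. WriteToText
    return "\n".join(elements)

output = write_sink(transform(read_source()))
```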
What is the Driver program & what do you use to write it?
it defines your pipeline; write it using the Apache Beam SDK in Java or Python
What is a Runner?
software that manages the execution of a pipeline, and acts as a translator for the backend execution framework
What is the Backend?
a massively parallel processing system (aka Cloud Dataflow)
How do the Driver & Runner work together?
the driver is submitted to the runner for processing
What is a PCollection?
represents an array of data elements as it is transformed inside a pipeline
When is a PCollection treated as Batch vs. Streaming?
when the data comes from a bounded source, such as a file, it is batch; when it comes from a continuously updating source, such as a smart device, it is streaming
Where should you debug your pipeline?
on your local machine
What are the steps to design a pipeline?
location of data, input data structure & format, transformation objectives, output data structure & location
What are the steps to the creation of a pipeline?
create a pipeline object, create a PCollection using a read or create transform, apply transforms, write out the final PCollection
How do you merge branching pipelines?
flatten or join transforms
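The two merge options can be sketched in plain Python (illustrative helpers, not the Beam SDK; Beam's equivalents are Flatten and join-style transforms like CoGroupByKey):

```python
branch_a = [("u1", "click"), ("u2", "view")]
branch_b = [("u1", "purchase")]

# Flatten: concatenate collections of the same type into one
flattened = branch_a + branch_b

def join_by_key(left, right):
    # Join: pair up elements from each branch that share a key
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for kk, v in left if kk == k],
                [v for kk, v in right if kk == k])
            for k in keys}

joined = join_by_key(branch_a, branch_b)
```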
What is an Aggregation transform?
takes multiple input elements from a PCollection & returns a single-element output PCollection
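A plain-Python sketch of aggregation: many input elements collapse into a one-element output collection (Beam does this with combine transforms such as a global sum):

```python
values = [4, 1, 7]          # input collection with many elements
aggregated = [sum(values)]  # output collection with one element
```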
Characteristics of a PCollection (PC)
PCs are boundless arrays by default; PCs do not support random access; PC elements must be the same datatype, have a timestamp attached, and are immutable
What is processing time?
the total time a PCollection spends across all of its transformations
What is windowing?
allows you to group PC elements by timestamp; especially useful for streaming data
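Windowing by timestamp can be sketched in plain Python, here with fixed 60-second windows (Beam's fixed-window transform applies the same idea to a PCollection):

```python
from collections import defaultdict

events = [(5, "a"), (42, "b"), (75, "c")]  # (timestamp in seconds, value)
window_size = 60

windows = defaultdict(list)
for ts, value in events:
    # assign each element to the window containing its timestamp
    window_start = (ts // window_size) * window_size
    windows[window_start].append(value)
```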
What is the Cloud Dataflow service account?
an account created automatically when a Dataflow project is created, which manages job resources
What is an example of a Cloud Dataflow service account managing job resources?
removing worker VMs during pipeline processing
What type of role does Cloud Dataflow service account assume, and why?
the service agent role, so that it has permission to run Dataflow jobs
What is the controller service account?
it is the account assigned to Dataflow workers
How is the compute engine involved in the controller service account?
by default, the controller service account is the Compute Engine service account
How are Compute Engine instances involved in Dataflow?
they execute pipeline operations
What does the controller service account do?
it performs metadata operations & accesses pipeline files, e.g. determining the size of a file in Cloud Storage
What is the user-managed controller service account & why use it?
it replaces the Compute Engine service account, in order to create and use resources with custom access control
What is a regional endpoint?
a GCP geographic region
What does a regional endpoint do?
manages metadata about Dataflow jobs, manages workers, and determines the workers' zone placement within the specified region
What is the purpose of a trigger in dataflow?
triggers determine when to emit output data, and behave differently for bounded and unbounded data
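A plain-Python sketch of one trigger style, firing after a count of elements (akin to Beam's after-count triggers; the helper name is illustrative): buffered elements are emitted as a pane whenever the threshold is reached, with a final pane when bounded input ends.

```python
def run_with_count_trigger(stream, count=2):
    buffer, panes = [], []
    for element in stream:
        buffer.append(element)
        if len(buffer) >= count:
            panes.append(list(buffer))  # trigger fires: emit the pane
            buffer.clear()
    if buffer:
        panes.append(list(buffer))  # final pane when bounded input ends
    return panes

panes = run_with_count_trigger(["a", "b", "c"])
```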