Flashcards covering key concepts from the lecture on Spark RDDs, DataFrames, ML Pipelines, and Parallelization.
Resilient Distributed Datasets (RDDs)
A distributed memory abstraction enabling in-memory computations on large clusters in a fault-tolerant manner.
In-Memory (RDD Trait)
Data stored in memory as much as possible.
Immutable (RDD Trait)
RDDs cannot be changed after creation, but can be transformed into new RDDs.
Lazily Evaluated (RDD Trait)
RDD data is not available/transformed until an action is executed.
Parallel (RDD Trait)
RDDs can process data across multiple nodes.
Partitioned (RDD Trait)
Data in an RDD is divided into partitions and distributed across nodes.
Cacheable (RDD Trait)
RDDs can be kept in persistent storage (memory or disk) for reuse.
Transformation (RDD Operation)
Takes an RDD and returns a new RDD but nothing gets evaluated / computed
Action (RDD Operation)
All the data processing queries are computed (evaluated) and the result value is returned
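The transformation/action split can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's API: `ToyRDD` and its methods are hypothetical names that mimic how transformations record work and actions trigger it.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable collection
        self._steps = steps     # tuple of (kind, function) pairs, never mutated

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replay every recorded step over the source, one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: evaluate the recorded steps and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time square() is actually executed


def square(x):
    calls.append(x)
    return x * x


squares = ToyRDD([1, 2, 3, 4]).map(square)        # transformation
evens = squares.filter(lambda x: x % 2 == 0)      # transformation
calls_before_action = len(calls)                  # still 0: nothing has run
result = evens.collect()                          # action: now square() runs
```

Under these assumptions, `square` is not called until `collect()` runs, which is the lazy-evaluation behavior the cards describe.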
Spark Transformations
Create new datasets from an existing one; Spark optimizes the required calculations and recovers from failures
Spark Actions
Cause Spark to execute the recipe of transformations on the source data
Lineage Graph
RDDs track their lineage (how they were derived from other datasets) so that partitions can be computed, or recomputed after a failure, from data in stable storage.
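A minimal sketch of lineage-based recovery, in plain Python rather than Spark. The names `build_partition`, `stable_storage`, and `in_memory` are hypothetical, and the lineage is modeled simply as an ordered list of `map` and `filter` steps.

```python
def build_partition(source_rows, lineage):
    """Compute one partition by replaying its lineage over stable-storage rows."""
    rows = list(source_rows)
    for op, fn in lineage:
        if op == "map":
            rows = [fn(r) for r in rows]
        elif op == "filter":
            rows = [r for r in rows if fn(r)]
        else:
            raise ValueError(f"unknown op: {op}")
    return rows


# Assumed stable storage: two partitions of raw input that survive failures.
stable_storage = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage is the recipe, not the data.
lineage = [("map", lambda x: x * 10), ("filter", lambda x: x > 20)]

# Materialize both partitions in (simulated) worker memory.
in_memory = {pid: build_partition(rows, lineage)
             for pid, rows in stable_storage.items()}
before_failure = dict(in_memory)

# Simulate losing partition 1, then rebuild only that partition.
del in_memory[1]
in_memory[1] = build_partition(stable_storage[1], lineage)
```

Only the lost partition is recomputed; partition 0 is left untouched, which is the point of tracking lineage per partition instead of replicating the data.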
RDD Persistence
Nodes store partitions for reuse in other actions on that dataset
groupByKey([numPartitions])
Used for aggregating data in key-value pairs; when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs
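The grouping semantics can be sketched in plain Python. This is a toy model, not Spark's implementation: `group_by_key` is a hypothetical helper, and assigning keys to partitions by `hash(key) % num_partitions` is an assumption made for illustration.

```python
from collections import defaultdict


def group_by_key(pairs, num_partitions=2):
    """Group (key, value) pairs so each key's values end up in one partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # All pairs with the same key hash to the same partition.
        partitions[hash(key) % num_partitions][key].append(value)
    return [dict(p) for p in partitions]


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = group_by_key(pairs, num_partitions=3)

# Merge the partitions to inspect the overall result.
grouped = {k: v for part in partitions for k, v in part.items()}
```

Which partition a key lands in depends on its hash, but every key appears in exactly one partition together with all of its values.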
DataFrame API
An API for performing relational operations on both external data sources and Spark’s built-in RDDs
Catalyst
A highly extensible optimizer that uses Scala features to add composable rules, control code generation, and define extensions
DataFrames and Datasets
A DataFrame has a schema but is generic and untyped, whereas a Dataset is statically and strongly typed
DataFrame (ML Pipelines)
An ML dataset holding various data types, e.g. columns for text, feature vectors, true labels, & predictions
Transformer (ML Pipelines)
Algorithm transforming one DataFrame into another, e.g. an ML model that turns features into predictions
Estimator (ML Pipelines)
Algorithm fitting on a DataFrame to produce a Transformer, e.g. an ML algorithm that is fit on training data to produce an ML model
Pipeline (ML Pipelines)
Chains multiple Transformers and Estimators together to specify an ML workflow
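How the three roles fit together can be sketched with a toy pipeline in plain Python. None of this is Spark's ML API: rows are modeled as lists of dicts instead of a DataFrame, and `Tokenizer`, `KeywordEstimator`, `KeywordModel`, `Pipeline`, and `PipelineModel` are hypothetical classes written for illustration.

```python
class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class KeywordModel:
    """Transformer produced by fitting: predicts 1.0 if a learned keyword appears."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [{**r, "prediction": float(any(w in self.keywords for w in r["words"]))}
                for r in rows]


class KeywordEstimator:
    """Estimator: learns the words that occur only in positively labeled rows."""

    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1.0 for w in r["words"]}
        neg = {w for r in rows if r["label"] == 0.0 for w in r["words"]}
        return KeywordModel(pos - neg)


class PipelineModel:
    """Fitted pipeline: a chain of Transformers applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fitting turns every Estimator into a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)      # Estimator -> Transformer
            rows = stage.transform(rows)     # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "slow batch job", "label": 0.0},
    {"text": "spark job", "label": 1.0},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(training)
scored = model.transform([{"text": "Spark rocks"}, {"text": "batch job"}])
predictions = [r["prediction"] for r in scored]
```

The design point this sketch tries to capture is that fitting a pipeline yields something that is itself a Transformer, so the same chain of stages is applied to training and test data.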
Shared Variables
Variables are distributed to workers via closures
Broadcast variables
Read-only variables kept cached on each worker rather than shipped with every task
Accumulators
Only added to, e.g. sums/counters
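The two kinds of shared variable can be mimicked in plain Python with threads standing in for workers. This is a sketch under stated assumptions, not Spark's API: `ToyAccumulator`, `country_names`, and `task` are hypothetical names, and a read-only mapping plays the role of a broadcast variable.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from types import MappingProxyType


class ToyAccumulator:
    """Add-only counter: tasks may add to it, only the driver reads the total."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


# "Broadcast" lookup table: shared once and read-only for every task.
country_names = MappingProxyType({"de": "Germany", "fr": "France"})

# Accumulator counting records that could not be resolved.
unknown_codes = ToyAccumulator()


def task(partition):
    """Process one partition: read the broadcast value, add to the accumulator."""
    resolved = []
    for code in partition:
        name = country_names.get(code)
        if name is None:
            unknown_codes.add(1)
        else:
            resolved.append(name)
    return resolved


partitions = [["de", "xx"], ["fr", "de", "yy"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(task, partitions))
```

Tasks never read the accumulator and never write to the lookup table, which mirrors the one-directional use of each kind of shared variable.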
User applications
Result in a Directed Acyclic Graph (DAG) of operators