Spark RDD, DataFrame, ML Pipelines, and Parallelization

Description and Tags

Flashcards covering key concepts from the lecture on Spark RDDs, DataFrames, ML Pipelines, and Parallelization.


1. Resilient Distributed Datasets (RDDs)

A distributed memory abstraction enabling in-memory computations on large clusters in a fault-tolerant manner.

2. In-Memory (RDD Trait)

Data stored in memory as much as possible.

3. Immutable (RDD Trait)

RDDs cannot be changed after creation, but can be transformed into new RDDs.

4. Lazily Evaluated (RDD Trait)

RDD data is not available/transformed until an action is executed.

5. Parallel (RDD Trait)

RDDs can process data across multiple nodes.

6. Partitioned (RDD Trait)

Data in a RDD is divided and distributed.

7. Cacheable (RDD Trait)

RDDs can hold data in a persistent storage.

8. Transformation (RDD Operation)

Takes an RDD and returns a new RDD, but nothing gets evaluated or computed until an action runs.

9. Action (RDD Operation)

Triggers computation (evaluation) of all the pending data-processing transformations and returns the result value.

10. Spark Transformations

Create new datasets from existing ones; because they are lazy, Spark can optimize the required calculations and recover from failures by recomputing lost partitions.

11. Spark Actions

Cause Spark to execute the recipe of transformations on the source data and return a result.

12. Lineage Graph

RDDs keep track of the transformations that produced them (their lineage) so that lost partitions can be recomputed from data in stable storage.

13. RDD Persistence

When an RDD is persisted (e.g. with cache()), each node stores its partitions in memory for reuse in other actions on that dataset.

14. groupByKey([numPartitions])

Used for aggregating key-value data: called on a dataset of (K, V) pairs, it returns a dataset of (K, Iterable&lt;V&gt;) pairs, optionally with a given number of partitions.

15. DataFrame API

An API for performing relational operations (e.g. select, filter, join, aggregate) on both external data sources and Spark’s built-in RDDs.

16. Catalyst

A highly extensible query optimizer that uses Scala language features to add composable rules, control code generation, and define extension points.

17. DataFrames and Datasets

A DataFrame is generic and untyped: rows are checked against a schema only at runtime. A Dataset is strongly typed, with static type checking at compile time.

18. DataFrame (ML Pipelines)

An ML dataset that can hold various data types, e.g. columns for text, feature vectors, true labels, and predictions.

19. Transformer (ML Pipelines)

An algorithm that transforms one DataFrame into another, e.g. an ML model turns a DataFrame of features into a DataFrame with predictions.

20. Estimator (ML Pipelines)

An algorithm that is fit on a DataFrame to produce a Transformer, e.g. an ML algorithm fit on training data yields an ML model.

21. Pipeline (ML Pipelines)

Chains multiple Transformers and Estimators together to specify an ML workflow

22. Shared Variables

By default, variables referenced in closures are copied to each worker task; Spark’s shared variables (broadcast variables and accumulators) provide controlled alternatives.

23. Broadcast variables

“Keep rather than ship”: a read-only copy of the variable is cached on each worker node instead of being shipped with every task.

24. Accumulators

Variables that workers can only add to, e.g. for sums and counters; only the driver program reads the final value.

25. User applications

A user application compiles down to a Directed Acyclic Graph (DAG) of operators.