Spark RDD, DataFrame, ML Pipelines, and Parallelization

Description and Tags

Flashcards covering key concepts from the lecture on Spark RDDs, DataFrames, ML Pipelines, and Parallelization.


1. Resilient Distributed Datasets (RDDs)

A distributed memory abstraction enabling in-memory computations on large clusters in a fault-tolerant manner.

2. In-Memory (RDD Trait)

Data stored in memory as much as possible.

3. Immutable (RDD Trait)

RDDs cannot be changed after creation, but can be transformed into new RDDs.

4. Lazily Evaluated (RDD Trait)

RDD data is not available/transformed until an action is executed.

5. Parallel (RDD Trait)

RDDs can process data across multiple nodes.

6. Partitioned (RDD Trait)

Data in a RDD is divided and distributed.

7. Cacheable (RDD Trait)

RDDs can hold data in a persistent storage.

8. Transformation (RDD Operation)

Takes an RDD and returns a new RDD, but nothing gets evaluated or computed until an action runs.

9. Action (RDD Operation)

Triggers computation (evaluation) of all the pending data-processing transformations and returns the result value.

10. Spark Transformations

Create new datasets from existing ones; because they are lazy, Spark can optimize the required calculations and recover from failures by recomputing lost partitions.

11. Spark Actions

Cause Spark to execute the recipe of transformations on the source data and return a result.

12. Lineage Graph

RDDs keep track of the transformations that produced them (their lineage) so that lost partitions can be recomputed from data in stable storage.

13. RDD Persistence

When an RDD is persisted (e.g. with cache()), each node stores its partitions in memory for reuse in other actions on that dataset.

14. groupByKey([numPartitions])

Used for aggregating key-value data: called on a dataset of (K, V) pairs, it returns a dataset of (K, Iterable&lt;V&gt;) pairs, optionally with a given number of partitions.

15. DataFrame API

An API for performing relational operations (e.g. select, filter, join, aggregate) on both external data sources and Spark’s built-in RDDs.

16. Catalyst

A highly extensible query optimizer that uses Scala language features to add composable rules, control code generation, and define extension points.

17. DataFrames and Datasets

A DataFrame is generic and untyped: rows are checked against a schema only at runtime. A Dataset is strongly typed, with static type checking at compile time.

18. DataFrame (ML Pipelines)

An ML dataset that can hold various data types, e.g. columns for text, feature vectors, true labels, and predictions.

19. Transformer (ML Pipelines)

An algorithm that transforms one DataFrame into another, e.g. an ML model turns a DataFrame of features into a DataFrame with predictions.

20. Estimator (ML Pipelines)

An algorithm that is fit on a DataFrame to produce a Transformer, e.g. an ML algorithm fit on training data yields an ML model.

21. Pipeline (ML Pipelines)

Chains multiple Transformers and Estimators together to specify an ML workflow

22. Shared Variables

By default, variables referenced in closures are copied to each worker task; Spark’s shared variables (broadcast variables and accumulators) provide controlled alternatives.

23. Broadcast variables

“Keep rather than ship”: a read-only copy of the variable is cached on each worker node instead of being shipped with every task.

24. Accumulators

Variables that workers can only add to, e.g. for sums and counters; only the driver program reads the final value.

25. User applications

A user application compiles down to a Directed Acyclic Graph (DAG) of operators.