Spark & Hadoop – IDS 200 Lecture Vocabulary


Twelve vocabulary flashcards summarizing essential terms from the Spark & Hadoop lecture.


12 Terms

1

Distributed Database

A storage system that spreads large datasets across multiple devices, managing where data is placed, how it is retrieved, and how bottlenecks are avoided.

2

MapReduce

A two-stage algorithm for massive data processing: a Map phase to split work across nodes and a Reduce phase to merge the partial results.

3

Map Stage

The first phase of MapReduce in which a problem is broken into smaller pieces, each sent to separate hardware for parallel processing.

4

Reduce Stage

The second phase of MapReduce that combines the outputs from all map tasks to produce the final, complete result.
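The Map and Reduce stages above can be sketched in plain Python using the classic word-count example. This is an illustrative simulation, not the real Hadoop API: the `map_phase` and `reduce_phase` names are invented, and real clusters run the map tasks on separate machines rather than in one process.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map stage: turn one chunk of input into (word, 1) pairs.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce stage: merge all partial (word, 1) pairs into final totals.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each chunk stands in for the portion of data a separate node would map...
chunks = ["big data big", "data spark"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

# ...and the reduce stage combines every node's partial results.
counts = reduce_phase(mapped)  # {'big': 2, 'data': 2, 'spark': 1}
```

On a real cluster, the framework also shuffles the mapped pairs so that all pairs for the same word reach the same reducer.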

5

Hadoop

A disk-based, cluster-oriented MapReduce framework using HDFS for storage and YARN for scheduling; slower and older than Spark.

6

Hadoop Distributed File System (HDFS)

Hadoop's disk storage layer saving fixed-size data blocks across nodes, with NameNode managing metadata.
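The fixed-size blocks can be made concrete with a little arithmetic. In recent Hadoop versions the default HDFS block size is 128 MB (it is configurable, and older releases defaulted to 64 MB); this sketch just computes how many blocks a file occupies.

```python
# Default HDFS block size in recent Hadoop versions (configurable).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, in bytes

def num_blocks(file_size_bytes):
    # Ceiling division: a partial final block still counts as a block.
    return -(-file_size_bytes // BLOCK_SIZE)

print(num_blocks(1024**3))  # a 1 GiB file -> 8 blocks of 128 MiB
```

The NameNode records which data nodes hold each of those blocks; the file's bytes themselves live on the data nodes.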

7

NameNode

The master node in HDFS that maintains the list of all files and blocks stored on the cluster’s data nodes.

8

Yet Another Resource Negotiator (YARN)

Hadoop’s scheduler that assigns computing tasks to nodes and locates needed data via the NameNode.

9

Spark

A newer, in-memory (RAM-based) big data engine, faster than disk-based MapReduce; it often runs atop Hadoop and includes the MLlib machine learning library.

10

Hadoop Ecosystem

The broader set of technologies that support Hadoop clusters, sometimes used as an umbrella term that even includes Spark.

11

MLlib

Spark’s built-in machine learning library that supplies algorithms beyond basic MapReduce processing.

12

Petabyte

A data-size unit of 1,000 terabytes (10^15 bytes in decimal units), commonly cited as the scale handled by Hadoop clusters.
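The unit ladder is easy to check in decimal (SI) units; note that the binary unit of similar size, the pebibyte, is 2^50 bytes and is slightly larger.

```python
# Decimal (SI) data-size units.
KB = 10**3
MB = 10**6
GB = 10**9
TB = 10**12
PB = 1000 * TB  # one petabyte = 1,000 terabytes

print(PB)            # 10**15 bytes
print(PB // GB)      # a petabyte holds a million gigabytes
```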