Twelve vocabulary flashcards summarizing essential terms from the Spark & Hadoop lecture.
Distributed Database
A storage system spreading large datasets across multiple devices, managing data placement, retrieval, and bottlenecks.
MapReduce
A two-stage algorithm for massive data processing: a Map phase to split work across nodes and a Reduce phase to merge the partial results.
Map Stage
The first phase of MapReduce in which a problem is broken into smaller pieces, each sent to separate hardware for parallel processing.
Reduce Stage
The second phase of MapReduce that combines the outputs from all map tasks to produce the final, complete result.
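The Map and Reduce stages described above can be sketched as a word count in plain Python. This is only an illustration under stated assumptions: the function names `map_stage` and `reduce_stage` and the sample chunks are hypothetical, and everything runs in one process rather than on separate hardware.

```python
from collections import defaultdict

def map_stage(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_stage(mapped_outputs):
    """Reduce: merge the partial results of every map task into final counts."""
    totals = defaultdict(int)
    for pairs in mapped_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Hypothetical input, already split into chunks. On a real cluster each
# chunk would be handled by a different node in parallel.
chunks = ["spark is fast", "hadoop is older", "spark runs on hadoop"]

mapped = [map_stage(chunk) for chunk in chunks]   # Map stage
word_counts = reduce_stage(mapped)                # Reduce stage
print(word_counts)
```

Each call to `map_stage` depends only on its own chunk, which is what would let a framework such as Hadoop send the chunks to separate machines.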
Hadoop
A disk-based, cluster-oriented MapReduce framework using HDFS for storage and YARN for scheduling; slower and older than Spark.
Hadoop Distributed File System (HDFS)
Hadoop's disk storage layer, which splits files into fixed-size blocks stored across nodes, with the NameNode managing the metadata.
NameNode
The master node in HDFS that maintains the list of all files and blocks stored on the cluster’s data nodes.
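A toy model may help picture how fixed-size blocks and NameNode metadata fit together. This is not the Hadoop API: `ToyNameNode`, the node names, the round-robin placement, and the tiny block size are all assumptions chosen for illustration, and details such as replication are ignored.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it tracks where blocks live,
    not the block contents themselves."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}  # filename -> list of (block_index, node_name)

    def store(self, filename, data, block_size):
        blocks = split_into_blocks(data, block_size)
        placements = []
        for index, _block in enumerate(blocks):
            # Assumed policy: round-robin across the data nodes.
            node = self.data_nodes[index % len(self.data_nodes)]
            placements.append((index, node))
        self.block_map[filename] = placements
        return placements

    def locate(self, filename):
        """Look up which node holds each block of a file."""
        return self.block_map[filename]

name_node = ToyNameNode(["node-a", "node-b", "node-c"])
placements = name_node.store("lecture.txt", b"0123456789", block_size=4)
print(name_node.locate("lecture.txt"))
```

The point of the sketch is the separation of roles: the data nodes would hold the blocks, while the master only keeps the list of which block is where.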
Yet Another Resource Negotiator (YARN)
Hadoop’s scheduler that assigns computing tasks to nodes and locates needed data via the NameNode.
Spark
A newer, RAM-based big data engine that keeps data in memory for speed; it often runs atop Hadoop and includes the MLlib library.
Hadoop Ecosystem
The broader set of technologies that support Hadoop clusters, sometimes used as an umbrella term that even includes Spark.
MLlib
Spark’s built-in machine learning library that supplies algorithms beyond basic MapReduce processing.
Petabyte
A data-size unit of roughly 1,000 terabytes (TB), commonly referenced as the scale handled by Hadoop clusters.