13 Terms
1
Hadoop
A Java-based framework (not a database) for distributing and processing very large data sets across clusters of computers.
2
Two most important parts of Hadoop
HDFS (Hadoop Distributed File System) and MapReduce.
3
HDFS
A fault-tolerant distributed file system designed to store very large amounts of data and serve it at high throughput; it is the low-level storage layer that Hadoop uses directly.
4
Four HDFS assumptions
(1) High volume (terabyte+ files), (2) Write-once, read-many (no edits after close), (3) Streaming access (process whole files as a stream), (4) Fault tolerance (replicate data across many machines).
5
Client node (HDFS)
A node that makes requests to the file system.
6
Name node (HDFS)
The node that stores metadata about which blocks belong to which files and which data nodes hold them.
7
Data node (HDFS)
A node that stores the actual file data blocks.
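A minimal sketch of the bookkeeping described in the name node and data node cards above. The class and method names (`NameNode`, `add_block`, `locate`) and the block and node identifiers are hypothetical and are not part of the real Hadoop API; the point is only that the name node holds metadata, while the data itself lives on the data nodes.

```python
from typing import Dict, List, Set, Tuple


class NameNode:
    """Toy model of name node metadata (illustrative only, not Hadoop code)."""

    def __init__(self) -> None:
        # File path -> ordered list of block ids that make up the file.
        self.file_blocks: Dict[str, List[str]] = {}
        # Block id -> names of the data nodes that hold a replica.
        self.block_locations: Dict[str, Set[str]] = {}

    def add_block(self, path: str, block_id: str, data_nodes: List[str]) -> None:
        """Record that `block_id` belongs to `path` and is stored on `data_nodes`."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations.setdefault(block_id, set()).update(data_nodes)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Answer a client request: which blocks form the file, and where are they?"""
        return [
            (block_id, sorted(self.block_locations[block_id]))
            for block_id in self.file_blocks[path]
        ]


name_node = NameNode()
name_node.add_block("/logs/a.txt", "blk_1", ["dn1", "dn2", "dn3"])
name_node.add_block("/logs/a.txt", "blk_2", ["dn2", "dn3", "dn4"])
print(name_node.locate("/logs/a.txt"))
```

A client node would use the answer from `locate` to contact the listed data nodes directly for the block contents.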
8
Block report
A report sent from a data node to the name node (every 6 hours by default) listing which blocks it holds.
9
Heartbeat
A signal sent from a data node to the name node (every 3 seconds by default) to confirm it is still available.
10
What happens when a name node stops receiving heartbeats from a data node
The name node treats the data node as unavailable, excludes it from future read/write lists, and may instruct other data nodes to re-replicate the blocks it held.
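The heartbeat logic above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Hadoop implements it: `HeartbeatMonitor`, `under_replicated`, and the 30-second timeout are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from typing import Dict, List, Set

DEAD_AFTER_S = 30  # hypothetical timeout, chosen only for illustration


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each data node (toy model)."""

    def __init__(self, timeout_s: float = DEAD_AFTER_S) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def live_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is within the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout_s)

    def dead_nodes(self, now: float) -> List[str]:
        """Nodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


def under_replicated(
    block_locations: Dict[str, Set[str]], live: Set[str], replication: int
) -> List[str]:
    """Blocks with fewer live replicas than the target replication factor."""
    return sorted(
        block for block, nodes in block_locations.items()
        if len(nodes & live) < replication
    )


monitor = HeartbeatMonitor()
for node in ("dn1", "dn2", "dn3", "dn4"):
    monitor.heartbeat(node, now=0)
for node in ("dn1", "dn3", "dn4"):  # dn2 goes silent
    monitor.heartbeat(node, now=27)

live = set(monitor.live_nodes(now=40))
blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn3", "dn4"}}
print(monitor.dead_nodes(now=40), under_replicated(blocks, live, replication=3))
```

Here `dn2` is excluded from the live list, and `blk_1` is flagged because only two of its three replicas sit on live nodes; that is the point at which a name node could ask another data node to copy the block.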
11
MapReduce
A divide-and-conquer parallel processing technique: split a large data set into sub-blocks, compute intermediate results on each sub-block in parallel (map), then combine those results into one final answer (reduce).
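As a rough illustration of the pattern in the card above, here is a word count written in the map, group, and reduce style. It is a single-process sketch in plain Python, not Hadoop code: the function names are hypothetical, and a real cluster would run the map step on many machines in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: turn one sub-block of text into intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key so each word is reduced once."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: summarize each group into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks: Iterable[str]) -> Dict[str, int]:
    """Run map over every sub-block, then group and reduce the results."""
    intermediate: List[Tuple[str, int]] = []
    for chunk in chunks:  # each chunk could be handled by a different machine
        intermediate.extend(map_phase(chunk))
    return reduce_phase(shuffle(intermediate))


print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each call to `map_phase` depends only on its own chunk, the map work can be spread across nodes without coordination; only the grouping step needs to bring matching keys together.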