What is MapReduce
programming model and processing technique for distributed computing on large datasets
Who developed MapReduce
Google in 2004
Challenge that prompted MapReduce
build search engine that could index the entire web and answer search queries in milliseconds
breakthrough of MapReduce
made it possible to write a simple program and run it efficiently on a thousand machines in half an hour
speeding up development and prototyping cycle for engineers
MapReduce key operations
Map, Shuffle & Sort, Reduce
Map Phase
data split into chunks, each chunk processed in parallel by different machines
each mapper applies same function to its chunk
map phase output
key-value pairs
shuffle and sort phase
system groups all values with same key
data redistributed across cluster
ensures all values for a key go to the same reducer
reduce phase
each reducer processes all values for its assigned keys
combines the values to produce final result
reduce phase output
final key-value pairs
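To make the three phases concrete, here is a minimal in-process Python sketch of word count, the classic MapReduce example. The function names and single-process execution are illustrative only; real MapReduce runs the mappers and reducers on separate machines.

```python
from collections import defaultdict

# Map phase: each mapper turns its chunk into key-value pairs.
def map_phase(chunk):
    for word in chunk.split():
        yield (word, 1)

# Shuffle & sort phase: group all values by key and sort by key,
# as the framework would before handing keys to reducers.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Reduce phase: combine all values for a key into a final result.
def reduce_phase(key, values):
    return (key, sum(values))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
for key, values in shuffle_phase(mapped):
    print(reduce_phase(key, values))   # ('the', 3), ('fox', 2), ...
```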
scalability of MapReduce
handles petabytes of data across thousands of machines
linear scaling: double the machines - cut the time in half
automatic load distribution
MapReduce fault tolerance
automatic handling of failures
re-executes failed tasks on another machine
no single point of failure
MapReduce Simplicity
developers focus on business logic not distributed systems
framework handles parallelization, fault tolerance, and data distribution
familiar functional programming concepts
MapReduce cost effectiveness
efficient resource utilization across clusters
pay-as-you-scale model
MapReduce importance
changed how we think about processing large datasets
laid groundwork for entire big data ecosystem
core concepts still important in modern distributed systems
understanding MapReduce helps with other distributed computing concepts
what is Hadoop
framework that allows for distributed processing of large data sets across clusters of computers
cloud integration of hadoop
Hadoop integrates with cloud platforms offering flexible deployment and elastic scalability for big data applications
what does HDFS stand for
Hadoop distributed file system
HDFS challenge
ensure tolerance to node failure without losing data
HDFS features
stores files from hundreds of megabytes to petabytes
works on commonly available hardware; doesn’t have to be expensive, ultra-reliable hardware
required when dataset too large for single machine
manages storage across network of machines
focuses on reading the entire dataset efficiently, rather than quickly accessing the first record
HDFS complexities
network-based
complications from network programming
more intricate than regular disk filesystems
HDFS data distribution and replication
HDFS splits files into blocks and replicates each block 3 times to ensure fault tolerance and data reliability
splits into blocks (128MB by default)
3 copies balance data protection and storage costs
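A quick back-of-the-envelope sketch of what splitting and 3x replication mean for storage, using the defaults above; the function is purely illustrative:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def storage_footprint(file_size_bytes):
    """Blocks created and raw cluster storage used for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # HDFS stores only the actual bytes of the last, possibly short, block,
    # so raw usage is simply file size times the replication factor.
    return blocks, file_size_bytes * REPLICATION

blocks, raw = storage_footprint(1 * 1024**4)   # a 1 TB file
print(blocks, raw / 1024**4)                   # 8192 blocks, 3.0 TB raw
```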
HDFS Namenode
manages the filesystem namespace (structure of files and directories)
tracks hierarchy of files and directories
maintains metadata about the structure but doesn’t store block locations persistently
block locations are reconstructed from datanode block reports at startup, which keeps persistent metadata small and allows dynamic recovery
records metadata
manages file system operations
ensures consistency of namespace
namespace isn’t the actual data, but the map of the data
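A toy model of the namenode’s two in-memory maps may help: the namespace (file path to block IDs) is persisted, while block-to-datanode locations are rebuilt from datanode block reports at startup. All paths and IDs below are made up.

```python
# Namespace: "the map of the data, not the data" (persisted metadata).
namespace = {
    "/logs/2024/app.log": ["blk_101", "blk_102"],
    "/logs/2024/web.log": ["blk_103"],
}

# Block locations: NOT persisted; rebuilt from datanode reports at startup.
block_locations = {
    "blk_101": ["dn1", "dn2", "dn3"],
    "blk_102": ["dn2", "dn3", "dn4"],
    "blk_103": ["dn1", "dn4", "dn5"],
}

def locate(path):
    """Roughly what a client gets back when opening a file."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/2024/app.log"))
```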
HDFS datanode (worker)
stores and retrieves blocks
each file split into blocks
data nodes physically store blocks on local disks
when client requests file, data nodes serve blocks directly
reports block information to the namenode
handles read/write requests from clients
participates in replication
datanodes replicate blocks to other nodes based on replication factor set by HDFS (default is 3)
HDFS client - datanode interaction
client reads data directly from datanodes
after contacting namenode for block locations, client connects to nearest datanode
data streamed block-by-block from data nodes to the client
efficient design for scalability
namenode handles metadata only, not actual data transfer
this separation avoids bottlenecks and supports large-scale parallel reads
error handling during reads
if a datanode fails, client retries with another replica
corrupted blocks are reported to the namenode for recovery
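The read path and its retry behavior can be sketched in a few lines. This is a hypothetical simulation: the replica lists stand in for what the namenode returns, and the failure set simulates a dead datanode.

```python
block_locations = {               # per block: replicas, nearest first
    "blk_001": ["dn1", "dn2", "dn3"],
    "blk_002": ["dn2", "dn4", "dn5"],
}
failed_nodes = {"dn2"}            # simulate a datanode outage

def read_block(block_id, replicas):
    for datanode in replicas:
        if datanode in failed_nodes:
            print(f"{block_id}: {datanode} unreachable, trying next replica")
            continue              # client retries with another replica
        return f"{block_id} streamed from {datanode}"
    raise IOError(f"all replicas of {block_id} unavailable")

for block_id, replicas in block_locations.items():
    print(read_block(block_id, replicas))
```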
HDFS secondary namenode - purpose
not a backup or failover for primary namenode
role is to merge the namespace image (fsimage) with the edit log to keep metadata manageable
HDFS secondary namenode - function
periodically downloads the fsimage and edit log from namenode
merges them into a new fsimage and uploads it back to the namenode
helps prevent edit log from growing too large, which would slow down recovery
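Checkpointing is essentially "replay the edit log against the old fsimage to get a new one". A toy sketch, with simplified stand-ins for the real on-disk formats:

```python
fsimage = {"/data/a.txt": 2, "/data/b.txt": 5}   # path -> block count
edit_log = [
    ("create", "/data/c.txt", 3),
    ("delete", "/data/a.txt", None),
]

def checkpoint(image, log):
    merged = dict(image)
    for op, path, blocks in log:      # replay each logged operation
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged                     # new fsimage; edit log can be truncated

print(checkpoint(fsimage, edit_log))  # {'/data/b.txt': 5, '/data/c.txt': 3}
```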
HDFS secondary namenode - resource requirements
requires significant CPU and memory to perform the merge
stores copy of merged namespace image locally
HDFS secondary namenode - misconception
despite its name, secondary namenode CAN’T take over if primary namenode fails
it is a checkpointing assistant, not a high-availability solution
HDFS high availability namenode - purpose
eliminates the single point of failure in traditional HDFS architecture
ensures cluster remains operational even if the active namenode fails
HDFS high availability namenode - architecture
uses active-standby namenode configuration
both namenodes share access to a common edit log, typically stored on NFS or a Quorum Journal Manager
HDFS high availability namenode - Failover mechanism
ZooKeeper and the ZKFailoverController monitor namenode health
if the active namenode fails, automatic failover promotes the standby to active
HDFS high availability namenode - Fencing
prevents the old active namenode from corrupting the system after failover
techniques include killing the process or revoking access to shared storage
HDFS high availability namenode - key difference from secondary namenode
secondary namenode is not a failover node - it only merges metadata
HA Namenode is a true standby, ready to take over instantly
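A toy failover sequence, with a plain function standing in for ZooKeeper plus the ZKFailoverController; the states and names are hypothetical:

```python
class Namenode:
    def __init__(self, name, state):
        self.name, self.state, self.healthy = name, state, True

def failover(active, standby):
    if active.healthy:
        return active                 # nothing to do
    active.state = "fenced"           # fence first: cut off shared-edits access
    standby.state = "active"          # then promote the standby
    print(f"{active.name} fenced; {standby.name} promoted to active")
    return standby

nn1, nn2 = Namenode("nn1", "active"), Namenode("nn2", "standby")
nn1.healthy = False                   # simulate a crash of the active node
current = failover(nn1, nn2)          # nn2 takes over
```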
HDFS distributed file system - limitations
not suitable for applications needing data access in tens of milliseconds
limited by namenode memory: all filesystem metadata is held in RAM, so the amount of memory on the namenode caps the number of files
supports only one writer per file at a time; writes are append-only
MapReduce: processing framework
Hadoop provides the distributed storage (HDFS) and resource management (YARN) that MapReduce relies on to perform parallel data processing
Hadoop is the engine that runs MapReduce
MapReduce supports fault-tolerant execution across thousands of nodes for large-scale data processing
what is YARN
YARN: yet another resource negotiator
is the cluster resource management layer of hadoop
separates resource management from data processing, allowing multiple processing frameworks to run on same cluster
YARN architecture - ResourceManager (Master)
global authority for resource allocation across the cluster
manages node heartbeats and container assignments
YARN architecture - NodeManager (worker)
runs on each node in the cluster
reports resource usage and health to the ResourceManager
launches and monitors containers
YARN architecture - ApplicationMaster (per application)
manages the lifecycle of a single application
requests containers from ResourceManager
coordinates execution across nodeManagers
YARN architecture - container
logical unit of resource allocation (CPU, memory)
runs tasks assigned by the applicationMaster
YARN workflow
client submits an application to the resource manager
resourceManager launches an applicationMaster in a container
ApplicationMaster requests containers for tasks
NodeManagers launch containers and execute tasks
ApplicationMaster monitors progress and reports status
application completes, containers are released
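The whole submission flow compresses into a few function calls. The component names match YARN’s roles, but the logic below is a toy model, not the real RPC protocol:

```python
def resource_manager_submit(app_name):
    # Steps 1-2: client submits; RM launches an ApplicationMaster in a container
    print(f"ResourceManager: launching ApplicationMaster for {app_name}")
    application_master(app_name, num_tasks=3)

def application_master(app_name, num_tasks):
    # Step 3: AM requests containers for its tasks from the RM
    print(f"ApplicationMaster: requesting {num_tasks} containers")
    for task_id in range(num_tasks):
        node_manager_launch(app_name, task_id)   # Step 4: NMs launch containers
    # Steps 5-6: AM sees all tasks finish, signals completion, releases containers
    print(f"ApplicationMaster: {app_name} complete, releasing containers")

def node_manager_launch(app_name, task_id):
    print(f"NodeManager: running task {task_id} of {app_name} in a container")

resource_manager_submit("wordcount")
```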
YARN workflow - client submits job
mapReduce job is submitted to resourceManager
YARN launches a mapReduce applicationMaster in a container
YARN workflow - applicationMaster initialization
initializes job configuration and splits input data
requests containers for map tasks from the resourceManager
YARN workflow - map task execution
containers are launched on nodeManagers
each map task processes a split of input data and emits intermediate key-value pairs
YARN workflow - shuffle and sort phase
intermediate data shuffled across nodes
data sorted by key before being passed to reduce tasks
YARN workflow - reduce task execution
applicationMaster requests containers for reduce tasks
reduce tasks aggregate and process intermediate data to produce final output
YARN workflow - job completion
applicationMaster monitors task progress
once all tasks finish, it signals job completion and releases resources
benefits of YARN
scalability: supports thousands of concurrent applications
flexibility: enables multiple data processing engines
resource efficiency: dynamically allocates resources based on workload
fault tolerance: nodeManagers and applicationMasters can recover independently