What is MapReduce
programming model and processing technique for distributed computing on large datasets
Who developed MapReduce
Google in 2004
Challenge that prompted MapReduce
build search engine that could index the entire web and answer search queries in milliseconds
breakthrough of MapReduce
made it possible to write a simple program and run it efficiently on a thousand machines in half an hour
speeding up development and prototyping cycle for engineers
MapReduce key operations
Map, Shuffle & Sort, Reduce
Map Phase
data split into chunks, each chunk processed in parallel by different machines
each mapper applies same function to its chunk
map phase output
key-value pairs
shuffle and sort phase
system groups all values with same key
data redistributed across cluster
ensures all values for a key go to the same reducer
reduce phase
each reducer processes all values for its assigned keys
combines the values to produce final result
reduce phase output
final key-value pairs
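To make the three phases concrete, here is a minimal in-process Python sketch of word count, the classic MapReduce example. The function names and single-process execution are illustrative only; real MapReduce runs the mappers and reducers on separate machines.

```python
from collections import defaultdict

# Map phase: each mapper turns its chunk into key-value pairs.
def map_phase(chunk):
    for word in chunk.split():
        yield (word, 1)

# Shuffle & sort phase: group all values by key and sort by key,
# as the framework would before handing keys to reducers.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Reduce phase: combine all values for a key into a final result.
def reduce_phase(key, values):
    return (key, sum(values))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
for key, values in shuffle_phase(mapped):
    print(reduce_phase(key, values))   # ('the', 3), ('fox', 2), ...
```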
scalability of MapReduce
handles petabytes of data across thousands of machines
linear scaling: double the machines - cut the time in half
automatic load distribution
MapReduce fault tolerance
automatic handling of failures
re-executes failed tasks on another machine
no single point of failure
MapReduce Simplicity
developers focus on business logic not distributed systems
framework handles parallelization, fault tolerance, and data distribution
familiar functional programming concepts
MapReduce cost effectiveness
efficient resource utilization across clusters
pay-as-you-scale model
MapReduce importance
changed how we think about processing large datasets
laid groundwork for entire big data ecosystem
core concepts still important in modern distributed systems
understanding MapReduce helps with other distributed computing concepts
what is Hadoop
framework that allows for distributed processing of large data sets across clusters of computers
cloud integration of hadoop
Hadoop integrates with cloud platforms offering flexible deployment and elastic scalability for big data applications
what does HDFS stand for
Hadoop distributed file system
HDFS challenge
ensure tolerance to node failure without losing data
HDFS features
stores files from hundreds of megabytes to petabytes
works on commonly available hardware; doesn’t have to be expensive, ultra-reliable hardware
required when dataset too large for single machine
manages storage across network of machines
focuses on reading the entire dataset efficiently, rather than quickly accessing the first record
HDFS complexities
network-based
complications from network programming
more intricate than regular disk filesystems
HDFS data distribution and replication
HDFS splits files into blocks and replicates each block 3 times to ensure fault tolerance and data reliability
splits into blocks (128MB by default)
3 copies balance data protection and storage costs
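A quick back-of-the-envelope sketch of what splitting and 3x replication mean for storage, using the defaults above; the function is purely illustrative:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def storage_footprint(file_size_bytes):
    """Blocks created and raw cluster storage used for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # HDFS stores only the actual bytes of the last, possibly short, block,
    # so raw usage is simply file size times the replication factor.
    return blocks, file_size_bytes * REPLICATION

blocks, raw = storage_footprint(1 * 1024**4)   # a 1 TB file
print(blocks, raw / 1024**4)                   # 8192 blocks, 3.0 TB raw
```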
HDFS Namenode
manages the filesystem namespace (structure of files and directories)
tracks hierarchy of files and directories
maintains metadata about the structure but doesn’t store block locations persistently
block locations are reconstructed from datanode block reports at startup, which keeps persistent metadata small and allows dynamic recovery
records metadata
manages file system operations
ensures consistency of namespace
namespace isn’t the actual data, but the map of the data
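A toy model of the namenode’s two in-memory maps may help: the namespace (file path to block IDs) is persisted, while block-to-datanode locations are rebuilt from datanode block reports at startup. All paths and IDs below are made up.

```python
# Namespace: "the map of the data, not the data" (persisted metadata).
namespace = {
    "/logs/2024/app.log": ["blk_101", "blk_102"],
    "/logs/2024/web.log": ["blk_103"],
}

# Block locations: NOT persisted; rebuilt from datanode reports at startup.
block_locations = {
    "blk_101": ["dn1", "dn2", "dn3"],
    "blk_102": ["dn2", "dn3", "dn4"],
    "blk_103": ["dn1", "dn4", "dn5"],
}

def locate(path):
    """Roughly what a client gets back when opening a file."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/2024/app.log"))
```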
HDFS datanode (worker)
stores and retrieves blocks
each file split into blocks
data nodes physically store blocks on local disks
when client requests file, data nodes serve blocks directly
reports block information to the namenode
handles read/write requests from clients
participates in replication
datanodes replicate blocks to other nodes based on replication factor set by HDFS (default is 3)
HDFS client - datanode interaction
client reads data directly from datanodes
after contacting namenode for block locations, client connects to nearest datanode
data streamed block-by-block from data nodes to the client
efficient design for scalability
namenode handles metadata only, not actual data transfer
this separation avoids bottlenecks and supports large-scale parallel reads
error handling during reads
if a datanode fails, client retries with another replica
corrupted blocks are reported to the namenode for recovery
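The read path and its retry behavior can be sketched in a few lines. This is a hypothetical simulation: the replica lists stand in for what the namenode returns, and the failure set simulates a dead datanode.

```python
block_locations = {               # per block: replicas, nearest first
    "blk_001": ["dn1", "dn2", "dn3"],
    "blk_002": ["dn2", "dn4", "dn5"],
}
failed_nodes = {"dn2"}            # simulate a datanode outage

def read_block(block_id, replicas):
    for datanode in replicas:
        if datanode in failed_nodes:
            print(f"{block_id}: {datanode} unreachable, trying next replica")
            continue              # client retries with another replica
        return f"{block_id} streamed from {datanode}"
    raise IOError(f"all replicas of {block_id} unavailable")

for block_id, replicas in block_locations.items():
    print(read_block(block_id, replicas))
```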
HDFS secondary namenode - purpose
not a backup or failover for primary namenode
role is to merge the namespace image (fsimage) with the edit log to keep metadata manageable
HDFS secondary namenode - function
periodically downloads the fsimage and edit log from namenode
merges them into a new fsimage and uploads it back to the namenode
helps prevent edit log from growing too large, which would slow down recovery
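Checkpointing is essentially "replay the edit log against the old fsimage to get a new one". A toy sketch, with simplified stand-ins for the real on-disk formats:

```python
fsimage = {"/data/a.txt": 2, "/data/b.txt": 5}   # path -> block count
edit_log = [
    ("create", "/data/c.txt", 3),
    ("delete", "/data/a.txt", None),
]

def checkpoint(image, log):
    merged = dict(image)
    for op, path, blocks in log:      # replay each logged operation
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged                     # new fsimage; edit log can be truncated

print(checkpoint(fsimage, edit_log))  # {'/data/b.txt': 5, '/data/c.txt': 3}
```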
HDFS secondary namenode - resource requirements
requires significant CPU and memory to perform the merge
stores copy of merged namespace image locally
HDFS secondary namenode - misconception
despite its name, secondary namenode CAN’T take over if primary namenode fails
it is a checkpointing assistant, not a high-availability solution
HDFS high availability namenode - purpose
eliminates the single point of failure in traditional HDFS architecture
ensures cluster remains operational even if the active namenode fails
HDFS high availability namenode - architecture
uses active-standby namenode configuration
both namenodes share access to a common edit log, typically stored on NFS or a Quorum Journal Manager
HDFS high availability namenode - Failover mechanism
ZooKeeper and the ZKFailoverController monitor namenode health
if the active namenode fails, automatic failover promotes the standby to active
HDFS high availability namenode - Fencing
prevents the old active namenode from corrupting the system after failover
techniques include killing the process or revoking access to shared storage
HDFS high availability namenode - key difference from secondary namenode
secondary namenode is not a failover node - it only merges metadata
HA Namenode is a true standby, ready to take over instantly
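A toy failover sequence, with a plain function standing in for ZooKeeper plus the ZKFailoverController; the states and names are hypothetical:

```python
class Namenode:
    def __init__(self, name, state):
        self.name, self.state, self.healthy = name, state, True

def failover(active, standby):
    if active.healthy:
        return active                 # nothing to do
    active.state = "fenced"           # fence first: cut off shared-edits access
    standby.state = "active"          # then promote the standby
    print(f"{active.name} fenced; {standby.name} promoted to active")
    return standby

nn1, nn2 = Namenode("nn1", "active"), Namenode("nn2", "standby")
nn1.healthy = False                   # simulate a crash of the active node
current = failover(nn1, nn2)          # nn2 takes over
```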
HDFS distributed file system - limitations
not suitable for applications needing data access in tens of milliseconds
limited by namenode memory: all filesystem metadata is held in RAM, so the amount of memory on the namenode caps the number of files
supports only one writer per file at a time; writes are append-only
MapReduce: processing framework
Hadoop provides the distributed storage (HDFS) and resource management (YARN) that MapReduce relies on to perform parallel data processing
Hadoop is the engine that runs MapReduce
MapReduce supports fault-tolerant execution across thousands of nodes for large-scale data processing
what is YARN
YARN: yet another resource negotiator
is the cluster resource management layer of hadoop
separates resource management from data processing, allowing multiple processing frameworks to run on same cluster
YARN architecture - ResourceManager (Master)
global authority for resource allocation across the cluster
manages node heartbeats and container assignments
YARN architecture - NodeManager (worker)
runs on each node in the cluster
reports resource usage and health to the ResourceManager
launches and monitors containers
YARN architecture - ApplicationMaster (per application)
manages the lifecycle of a single application
requests containers from ResourceManager
coordinates execution across nodeManagers
YARN architecture - container
logical unit of resource allocation (CPU, memory)
runs tasks assigned by the applicationMaster
YARN workflow
client submits an application to the resource manager
resourceManager launches an applicationMaster in a container
ApplicationMaster requests containers for tasks
NodeManagers launch containers and execute tasks
ApplicationMaster monitors progress and reports status
application completes, containers are released
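The whole submission flow compresses into a few function calls. The component names match YARN’s roles, but the logic below is a toy model, not the real RPC protocol:

```python
def resource_manager_submit(app_name):
    # Steps 1-2: client submits; RM launches an ApplicationMaster in a container
    print(f"ResourceManager: launching ApplicationMaster for {app_name}")
    application_master(app_name, num_tasks=3)

def application_master(app_name, num_tasks):
    # Step 3: AM requests containers for its tasks from the RM
    print(f"ApplicationMaster: requesting {num_tasks} containers")
    for task_id in range(num_tasks):
        node_manager_launch(app_name, task_id)   # Step 4: NMs launch containers
    # Steps 5-6: AM sees all tasks finish, signals completion, releases containers
    print(f"ApplicationMaster: {app_name} complete, releasing containers")

def node_manager_launch(app_name, task_id):
    print(f"NodeManager: running task {task_id} of {app_name} in a container")

resource_manager_submit("wordcount")
```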
YARN workflow - client submits job
mapReduce job is submitted to resourceManager
YARN launches a mapReduce applicationMaster in a container
YARN workflow - applicationMaster initialization
initializes job configuration and splits input data
requests containers for map tasks from the resourceManager
YARN workflow - map task execution
containers are launched on nodeManagers
each map task processes a split of input data and emits intermediate key-value pairs
YARN workflow - shuffle and sort phase
intermediate data shuffled across nodes
data sorted by key before being passed to reduce tasks
YARN workflow - reduce task execution
applicationMaster requests containers for reduce tasks
reduce tasks aggregate and process intermediate data to produce final output
YARN workflow - job completion
applicationMaster monitors task progress
once all tasks finish, it signals job completion and releases resources
benefits of YARN
scalability: supports thousands of concurrent applications
flexibility: enables multiple data processing engines
resource efficiency: dynamically allocates resources based on workload
fault tolerance: nodeManagers and applicationMasters can recover independently