Cloud Computing Exam 1

49 Terms

1
New cards

What is MapReduce

programming model and processing technique for distributed computing on large datasets

2
New cards

Who developed MapReduce

Google in 2004

3
New cards

Challenge that prompted MapReduce

build a search engine that could index the entire web and answer search queries in milliseconds

4
New cards

breakthrough of MapReduce

  • made it possible to write a simple program and have it run efficiently on a thousand machines in half an hour

  • sped up the development and prototyping cycle for engineers

5
New cards

MapReduce key operations

Map, Shuffle & Sort, Reduce

6
New cards

Map Phase

  • data split into chunks, each chunk processed in parallel by different machines

  • each mapper applies same function to its chunk

7
New cards

map phase output

key-value pairs

8
New cards

shuffle and sort phase

  • system groups all values with same key

  • data redistributed across cluster

  • ensures all values for a key go to the same reducer

9
New cards

reduce phase

  • each reducer processes all values for its assigned keys

  • combines the values to produce final result

10
New cards

reduce phase output

final key-value pairs (a combined sketch of all three phases follows below)
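
To make cards 5-10 concrete, here is a minimal, self-contained Python sketch of the three phases using word count. It is a single-process simulation of the idea, not the Hadoop MapReduce API; the chunk data and the map_fn and reduce_fn names are illustrative.

from itertools import groupby

# Input "chunks": in real MapReduce each chunk would be processed on a different machine.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: each mapper applies the same function to its chunk and emits key-value pairs.
def map_fn(chunk):
    for word in chunk.split():
        yield (word, 1)

mapped = [pair for chunk in chunks for pair in map_fn(chunk)]

# Shuffle & sort phase: group all values with the same key so they reach the same reducer.
mapped.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in pairs] for key, pairs in groupby(mapped, key=lambda kv: kv[0])}

# Reduce phase: each reducer combines the values for its assigned keys into a final result.
def reduce_fn(key, values):
    return key, sum(values)

result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(result)  # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}

In real Hadoop the shuffle moves intermediate data across the network between machines; here it is just an in-memory sort and group.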

11
New cards

scalability of MapReduce

  • handles petabytes of data across thousands of machines

  • linear scaling: doubling the machines roughly halves the processing time

  • automatic load distribution

12
New cards

MapReduce fault tolerance

  • automatic handling of failures

  • re-executes failed tasks on another machine

  • no single point of failure

13
New cards

MapReduce Simplicity

  • developers focus on business logic not distributed systems

  • framework handles parallelization, fault tolerance, and data distribution

  • familiar functional programming concepts

14
New cards

MapReduce cost effectiveness

  • efficient resource utilization across clusters

  • pay-as-you-scale model

15
New cards

MapReduce importance

  • changed how we think about processing large datasets

  • laid groundwork for entire big data ecosystem

  • core concepts still important in modern distributed systems

  • understanding MapReduce helps with other distributed computing concepts

16
New cards

what is Hadoop

framework that allows for distributed processing of large data sets across clusters of computers

17
New cards

cloud integration of hadoop

Hadoop integrates with cloud platforms offering flexible deployment and elastic scalability for big data applications

18
New cards

what does HDFS stand for

Hadoop Distributed File System

19
New cards

HDFS challenge

ensure tolerance to node failure without losing data

20
New cards

HDFS features

  • stores files from hundreds of megabytes to petabytes

  • works on commonly available hardware; doesn’t have to be expensive, ultra-reliable hardware

  • required when dataset too large for single machine

  • manages storage across network of machines

  • focuses on reading the entire dataset efficiently rather than on low-latency access to the first record

21
New cards

HDFS complexities

  • network-based

  • complications from network programming

  • more intricate than regular disk filesystems

22
New cards

HDFS data distribution and replication

  • HDFS splits files into blocks and stores three copies of each block to ensure fault tolerance and data reliability

  • splits into blocks (128MB by default)

  • keeping three copies balances data protection against storage cost (see the placement sketch below)
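
A minimal Python sketch of the splitting and placement idea, assuming the 128 MB default block size and replication factor 3 described above. The round-robin placement and datanode names are illustrative; real HDFS uses a rack-aware placement policy.

import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical cluster

def place_blocks(file_size_bytes):
    # Split a file into blocks and assign each block to REPLICATION datanodes.
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    placement = {}
    for b in range(num_blocks):
        # Simple round-robin placement; real HDFS is rack-aware.
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
    return placement

# A 1 GB file becomes 8 blocks, each stored on 3 different datanodes.
for block, nodes in place_blocks(1024 * 1024 * 1024).items():
    print(f"block {block}: {nodes}")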

23
New cards

HDFS Namenode

  • manages the filesystem namespace (structure of files and directories)

  • tracks hierarchy of files and directories

    • maintains metadata about the structure but doesn’t store block locations persistently

    • block locations are reconstructed from datanode block reports at startup, which reduces memory use and allows dynamic recovery

  • records metadata

  • manages file system operations

  • ensures consistency of namespace

  • namespace isn’t the actual data, but the map of the data

24
New cards

HDFS datanode (worker)

  • stores and retrieves blocks

    • each file split into blocks

    • data nodes physically store blocks on local disks

    • when client requests file, data nodes serve blocks directly

  • reports block information to the namenode

  • handles read/write requests from clients

  • participates in replication

    • datanodes replicate blocks to other nodes based on replication factor set by HDFS (default is 3)

25
New cards

HDFS client - datanode interaction

  • client reads data directly from datanodes

    • after contacting namenode for block locations, client connects to nearest datanode

    • data streamed block-by-block from data nodes to the client

  • efficient design for scalability

    • namenode handles metadata only, not actual data transfer

    • this separation avoids bottlenecks and supports large-scale parallel reads

  • error handling during reads

    • if a datanode fails, client retries with another replica

    • corrupted blocks are reported to the namenode for recovery
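
A conceptual Python sketch of this read path: block locations come from namenode metadata, data is streamed from datanodes, and a failed replica triggers a retry on the next one. The class and function names are hypothetical, not the real HDFS client API.

class ReplicaUnavailable(Exception):
    pass

def read_block(datanode, block_id):
    # Pretend to fetch one block from a datanode; may raise ReplicaUnavailable.
    if datanode == "dn2":                 # simulate a failed datanode
        raise ReplicaUnavailable(datanode)
    return f"<data of block {block_id} from {datanode}>"

def read_file(namenode_metadata, path):
    # The namenode supplies block locations only; data is streamed from datanodes.
    blocks = namenode_metadata[path]
    content = []
    for block_id, replicas in blocks:
        for datanode in replicas:         # try the nearest replica first
            try:
                content.append(read_block(datanode, block_id))
                break
            except ReplicaUnavailable:
                continue                  # retry with another replica
        else:
            raise IOError(f"all replicas failed for block {block_id}")
    return content

metadata = {"/logs/app.log": [(0, ["dn2", "dn1", "dn3"]), (1, ["dn4", "dn5", "dn1"])]}
print(read_file(metadata, "/logs/app.log"))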

26
New cards

HDFS secondary namenode - purpose

  • not a backup or failover for primary namenode

  • role is to merge the namespace image (fsimage) with the edit log to keep metadata manageable

27
New cards

HDFS secondary namenode - function

  • periodically downloads the fsimage and edit log from namenode

  • merges them into a new fsimage and uploads it back to the namenode

  • helps prevent edit log from growing too large, which would slow down recovery
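
A minimal sketch of the checkpoint merge, with the fsimage modeled as a dict and the edit log as a list of operations. This illustrates the idea only; it is not Hadoop's actual fsimage/edits file format.

# Namespace image: a snapshot of the filesystem metadata at checkpoint time.
fsimage = {"/data": "dir", "/data/a.txt": "file"}

# Edit log: every metadata change recorded since the last checkpoint.
edit_log = [
    ("create", "/data/b.txt", "file"),
    ("delete", "/data/a.txt", None),
    ("mkdir", "/logs", "dir"),
]

def checkpoint(fsimage, edit_log):
    # Replay the edit log onto the fsimage to produce a new, merged fsimage.
    merged = dict(fsimage)
    for op, path, kind in edit_log:
        if op in ("create", "mkdir"):
            merged[path] = kind
        elif op == "delete":
            merged.pop(path, None)
    return merged

new_fsimage = checkpoint(fsimage, edit_log)
edit_log = []            # the edit log can now be truncated
print(new_fsimage)       # {'/data': 'dir', '/data/b.txt': 'file', '/logs': 'dir'}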

28
New cards

HDFS secondary namenode - resource requirements

  • requires significant CPU and memory to perform the merge

    • stores a copy of the merged namespace image locally

29
New cards

HDFS secondary namenode - misconception

  • despite its name, secondary namenode CAN’T take over if primary namenode fails

  • it is a checkpointing assistant, not a high-availability solution

30
New cards

HDFS high availability namenode - purpose

  • eliminates the single point of failure in traditional HDFS architecture

  • ensures cluster remains operational even if the active namenode fails

31
New cards

HDFS high availability namenode - architecture

  • uses active-standby namenode configuration

  • both namenodes share access to a common edit log, typically stored on NFS or managed by the Quorum Journal Manager

32
New cards

HDFS high availability namenode - Failover mechanism

  • zookeeper and ZKFailoverController monitor namenode health

  • if the active namenode fails, automatic failover promotes the standby to active

33
New cards

HDFS high availability namenode - Fencing

  • prevents the old active namenode from corrupting the system after failover

  • techniques include killing the process or revoking access to shared storage
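
A conceptual sketch of the failover-and-fencing sequence from cards 32-33. The health check, promotion, and fencing functions are hypothetical stand-ins for what ZooKeeper and the ZKFailoverController do; only the order of operations is the point.

class Namenode:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.state = "standby"

def is_healthy(namenode):
    # Stand-in for the ZKFailoverController health monitoring.
    return namenode.healthy

def fence(old_active):
    # Prevent the old active namenode from corrupting shared state after failover,
    # e.g. by killing its process or revoking its access to shared storage.
    old_active.state = "fenced"

def failover(active, standby):
    fence(active)              # fencing happens before the standby is promoted
    standby.state = "active"
    return standby, active     # the standby becomes the new active

active, standby = Namenode("nn1"), Namenode("nn2")
active.state = "active"

active.healthy = False         # simulate a crash of the active namenode
if not is_healthy(active):
    active, standby = failover(active, standby)

print(active.name, active.state)    # nn2 active
print(standby.name, standby.state)  # nn1 fenced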

34
New cards

HDFS high availability namenode - key difference from secondary namenode

  • secondary namenode is not a failover node - it only merges metadata

  • HA Namenode is a true standby, ready to take over instantly

35
New cards

HDFS distributed file system - limitations

  • not suitable for applications needing data access in tens of milliseconds

  • limited by namenode memory: all filesystem metadata is held in the namenode’s RAM, so capacity is capped by the amount of memory on the namenode

  • a file supports only one writer at a time; writes are append-only

36
New cards

MapReduce: processing framework

  • Hadoop provides the distributed storage (HDFS) and resource management (YARN) that MapReduce relies on to perform parallel data processing

  • Hadoop is the engine that runs MapReduce

  • MapReduce supports fault-tolerant execution across thousands of nodes for large-scale data processing

37
New cards

what is YARN

  • YARN: Yet Another Resource Negotiator

  • the cluster resource management layer of Hadoop

  • separates resource management from data processing, allowing multiple processing frameworks to run on same cluster

38
New cards

YARN architecture - ResourceManager (Master)

  • global authority for resource allocation across the cluster

  • manages node heartbeats and container assignments

39
New cards

YARN architecture - NodeManager (worker)

  • runs on each node in the cluster

  • reports resource usage and health to the ResourceManager

  • launches and monitors containers

40
New cards

YARN architecture - ApplicationMaster (per application)

  • manages the lifecycle of a single application

  • requests containers from ResourceManager

  • coordinates execution across nodeManagers

41
New cards

YARN architecture - container

  • logical unit of resource allocation (CPU, memory)

  • runs tasks assigned by the applicationMaster

42
New cards

YARN workflow

  1. client submits an application to the ResourceManager

  2. ResourceManager launches an ApplicationMaster in a container

  3. ApplicationMaster requests containers for tasks

  4. NodeManagers launch containers and execute tasks

  5. ApplicationMaster monitors progress and reports status

  6. when the application completes, its containers are released (see the sketch below)
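
A toy Python simulation of the six steps above. The ResourceManager, NodeManager, and ApplicationMaster classes are simplified stand-ins, not the real YARN APIs; memory is the only resource tracked.

class Container:
    def __init__(self, node, memory_mb):
        self.node, self.memory_mb = node, memory_mb

class NodeManager:
    # Runs on each node; launches and monitors containers (step 4).
    def __init__(self, name, memory_mb):
        self.name, self.free_mb = name, memory_mb

    def launch(self, memory_mb):
        self.free_mb -= memory_mb
        return Container(self.name, memory_mb)

class ResourceManager:
    # Global authority for resource allocation; hands out containers (steps 1-3).
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                return nm.launch(memory_mb)
        raise RuntimeError("cluster is out of resources")

class ApplicationMaster:
    # Per-application coordinator: requests containers and runs tasks (steps 3-6).
    def __init__(self, resource_manager):
        self.rm = resource_manager

    def run(self, tasks, memory_mb=1024):
        containers = [self.rm.allocate(memory_mb) for _ in tasks]
        for task, container in zip(tasks, containers):
            print(f"running {task} in a {container.memory_mb} MB container on {container.node}")
        return "SUCCEEDED"     # step 6: report completion; containers are released

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
am_container = rm.allocate(1024)    # step 2: the ApplicationMaster itself runs in a container
am = ApplicationMaster(rm)
print(am.run(["map-task-0", "map-task-1", "reduce-task-0"]))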

43
New cards

YARN workflow - client submits job

  • mapReduce job is submitted to resourceManager

  • YARN launches a mapReduce applicationMaster in a container

44
New cards

YARN workflow - applicationMaster initialization

  • initializes job configuration and splits input data

  • requests containers for map tasks from the resourceManager

45
New cards

YARN workflow - map task execution

  • containers are launched on nodeManagers

  • each map task processes a split of input data and emits intermediate key-value pairs

46
New cards

YARN workflow - shuffle and sort phase

  • intermediate data shuffled across nodes

  • data sorted by key before being passed to reduce tasks

47
New cards

YARN workflow - reduce task execution

  • applicationMaster requests containers for reduce tasks

  • reduce tasks aggregate and process intermediate data to produce final output

48
New cards

YARN workflow - job completion

  • applicationMaster monitors task progress

  • once all tasks finish, it signals job completion and releases resources

49
New cards

benefits of YARN

  • scalability: supports thousands of concurrent applications

  • flexibility: enables multiple data processing engines

  • resource efficiency: dynamically allocates resources based on workload

  • fault tolerance: nodeManagers and applicationMasters can recover independently