A collection of flashcards covering key concepts, definitions, and details related to Hadoop and Big Data.
What is bandwidth in networking?
Bandwidth is the speed at which data moves through a network, typically measured in Mbps (Megabits per second) or Gbps (Gigabits per second).
Why is bandwidth important in cloud migration projects?
Low bandwidth can cause data transfers to take days or weeks, especially when moving terabytes of data to the cloud.
What is an example of slow bandwidth affecting data transfer?
A 10 Mbps connection moves only 1.25 MB per second (10 megabits ÷ 8 bits per byte), which is far too slow for large-scale data movement.
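To see why low bandwidth stalls cloud migrations, the transfer time can be sketched with simple arithmetic. This is a minimal illustration assuming an ideal link with no protocol overhead; the function name is hypothetical.

```python
# Sketch: estimate how long a bulk data move takes over a given link.
# Assumes ideal throughput (no protocol overhead, retries, or congestion).

def transfer_days(data_tb: float, link_mbps: float) -> float:
    """Days needed to move data_tb terabytes over a link_mbps connection."""
    data_bits = data_tb * 8 * 10**12        # 1 TB = 10^12 bytes, 8 bits per byte
    seconds = data_bits / (link_mbps * 10**6)
    return seconds / 86_400                 # 86,400 seconds per day

# Moving a single terabyte over a 10 Mbps link:
print(round(transfer_days(1, 10), 2))  # → 9.26 (about nine days per TB)
```

At that rate, a multi-terabyte migration really would take weeks, which is why bandwidth planning matters before moving data to the cloud.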
What type of file system is needed for Big Data?
A Distributed File System (DFS) is used to handle large volumes of data efficiently.
Why is DFS preferred for Big Data?
DFS supports parallel processing and horizontal scaling, speeding up data access and processing.
What are the benefits of using DFS?
DFS reduces read time, increases flexibility and scalability, and enables better use of system bandwidth.
What are the drawbacks of using DFS?
DFS can lead to high network traffic and issues with data consistency and availability when DataNodes fail or lag.
What is Hadoop?
Hadoop is an open-source framework for processing and storing large-scale datasets across clusters of commodity hardware.
What are the core components of Hadoop?
HDFS, MapReduce, and YARN.
What is HDFS in Hadoop?
HDFS is the Hadoop Distributed File System that stores large data across many machines.
What is MapReduce in Hadoop?
MapReduce is a data processing model that uses Map and Reduce functions to process data in parallel.
What is YARN in Hadoop?
YARN (Yet Another Resource Negotiator) manages system resources like CPU and memory across the cluster.
What is Apache Pig used for?
Pig provides a high-level data-flow language, Pig Latin, for writing data transformations that run on Hadoop.
What is Apache Hive?
Hive allows users to query data in HDFS using SQL-like HiveQL.
What is Apache Spark?
Spark is a fast, in-memory data processing engine suitable for iterative tasks.
What is Apache HBase?
HBase is a NoSQL database for storing large volumes of unstructured data on top of HDFS.
What do Flume and Sqoop do?
Flume ingests streaming data such as logs into Hadoop, while Sqoop transfers bulk data between Hadoop and relational databases.
What are Ambari and ZooKeeper used for?
Ambari provisions, manages, and monitors Hadoop clusters, while ZooKeeper provides coordination services for distributed systems.
What does Oozie do in Hadoop?
Oozie schedules and manages Hadoop jobs and workflows.
What are Solr and Lucene used for?
They provide indexing and search capabilities for large datasets.
What is HDFS's storage model?
HDFS uses a write-once, read-many model: files can be appended to but not modified in place.
What is the NameNode in HDFS?
It stores metadata like file names and block locations in the cluster.
What are DataNodes in HDFS?
They store the actual file data in fixed-size blocks (128 MB by default).
What is the replication factor in HDFS?
It determines how many copies of each block are stored across different DataNodes.
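In a real cluster, the replication factor is set with the `dfs.replication` property in `hdfs-site.xml`; Hadoop's default is 3. A minimal fragment:

```xml
<!-- hdfs-site.xml: default number of copies kept for each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```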
What makes Hadoop resilient to failure?
Block replication across DataNodes, heartbeat checks to detect failed nodes, and metadata checkpointing by a secondary or standby NameNode for recovery.
How do you calculate total Hadoop storage needed?
Multiply the original data size by the replication factor. Example: 25 TB × 5 = 125 TB.
How do you calculate the number of DataNodes needed?
Divide the total storage required by the capacity of each DataNode. Example: 125 TB ÷ 5 TB = 25 DataNodes.
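The two sizing calculations above can be combined into one small helper. This is a sketch of the arithmetic from the flashcards, not a real capacity-planning tool; the function name is hypothetical, and it rounds the node count up since you cannot deploy a fraction of a machine.

```python
import math

def cluster_sizing(data_tb: float, replication_factor: int,
                   node_capacity_tb: float) -> tuple:
    """Return (total storage in TB, number of DataNodes needed)."""
    total_tb = data_tb * replication_factor          # every block is copied
    nodes = math.ceil(total_tb / node_capacity_tb)   # round up to whole nodes
    return total_tb, nodes

# The flashcard example: 25 TB of data, replication factor 5, 5 TB per node.
print(cluster_sizing(25, 5, 5))  # → (125, 25)
```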
What is the Map phase in MapReduce?
It processes raw data into intermediate key-value pairs, running in parallel across nodes.
What is the Reduce phase in MapReduce?
It aggregates and summarizes the output of the Map phase to produce the final result.
Does MapReduce move data or code?
It moves the code (query process) to where the data is stored, not the other way around.
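The Map and Reduce phases described above can be sketched in plain Python with the classic word-count example. This mimics the programming model only; real MapReduce also shuffles intermediate pairs between machines, which the in-memory dictionary stands in for here.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn each raw input line into intermediate (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + Reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

data = ["big data big cluster", "data moves code"]
print(reduce_phase(map_phase(data)))
# → {'big': 2, 'data': 2, 'cluster': 1, 'moves': 1, 'code': 1}
```

In a real cluster, many mappers run this logic in parallel on the nodes that already hold the data blocks, which is exactly the "move code to data" idea on the last card.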