A collection of flashcards covering key concepts, definitions, and details related to Hadoop and Big Data.
What is bandwidth in networking?
Bandwidth is the speed at which data moves through a network, typically measured in Mbps (Megabits per second) or Gbps (Gigabits per second).
Why is bandwidth important in cloud migration projects?
Low bandwidth can cause data transfers to take days or weeks, especially when moving terabytes of data to the cloud.
What is an example of slow bandwidth affecting data transfer?
A 10 Mbps connection moves only 1.25 MB per second (10 megabits ÷ 8 bits per byte), which is far too slow for large-scale data movement.
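To see why low bandwidth stalls cloud migrations, the transfer time can be sketched with simple arithmetic. This is a minimal illustration assuming an ideal link with no protocol overhead; the function name is hypothetical.

```python
# Sketch: estimate how long a bulk data move takes over a given link.
# Assumes ideal throughput (no protocol overhead, retries, or congestion).

def transfer_days(data_tb: float, link_mbps: float) -> float:
    """Days needed to move data_tb terabytes over a link_mbps connection."""
    data_bits = data_tb * 8 * 10**12        # 1 TB = 10^12 bytes, 8 bits per byte
    seconds = data_bits / (link_mbps * 10**6)
    return seconds / 86_400                 # 86,400 seconds per day

# Moving a single terabyte over a 10 Mbps link:
print(round(transfer_days(1, 10), 2))  # → 9.26 (about nine days per TB)
```

At that rate, a multi-terabyte migration really would take weeks, which is why bandwidth planning matters before moving data to the cloud.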
What type of file system is needed for Big Data?
A Distributed File System (DFS) is used to handle large volumes of data efficiently.
Why is DFS preferred for Big Data?
DFS supports parallel processing and horizontal scaling, speeding up data access and processing.
What are the benefits of using DFS?
DFS reduces read time, increases flexibility and scalability, and enables better use of system bandwidth.
What are the drawbacks of using DFS?
DFS can lead to high network traffic and issues with data consistency and availability when DataNodes fail or lag.
What is Hadoop?
Hadoop is an open-source framework for processing and storing large-scale datasets across clusters of commodity hardware.
What are the core components of Hadoop?
HDFS, MapReduce, and YARN.
What is HDFS in Hadoop?
HDFS is the Hadoop Distributed File System that stores large data across many machines.
What is MapReduce in Hadoop?
MapReduce is a data processing model that uses Map and Reduce functions to process data in parallel.
What is YARN in Hadoop?
YARN (Yet Another Resource Negotiator) manages system resources like CPU and memory across the cluster.
What is Apache Pig used for?
Pig provides a high-level data-flow language, Pig Latin, for writing data transformations that run on Hadoop.
What is Apache Hive?
Hive allows users to query data in HDFS using SQL-like HiveQL.
What is Apache Spark?
Spark is a fast, in-memory data processing engine suitable for iterative tasks.
What is Apache HBase?
HBase is a NoSQL database for storing large volumes of unstructured data on top of HDFS.
What do Flume and Sqoop do?
Flume ingests streaming data such as logs into Hadoop, while Sqoop transfers bulk data between Hadoop and relational databases.
What are Ambari and ZooKeeper used for?
Ambari provisions, manages, and monitors Hadoop clusters, while ZooKeeper provides coordination services for distributed systems.
What does Oozie do in Hadoop?
Oozie schedules and manages Hadoop jobs and workflows.
What are Solr and Lucene used for?
They provide indexing and search capabilities for large datasets.
What is HDFS's storage model?
HDFS uses a write-once, read-many model: files can be appended to but not modified in place.
What is the NameNode in HDFS?
It stores metadata like file names and block locations in the cluster.
What are DataNodes in HDFS?
They store the actual file data in fixed-size blocks (128 MB by default).
What is the replication factor in HDFS?
It determines how many copies of each block are stored across different DataNodes.
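In a real cluster, the replication factor is set with the `dfs.replication` property in `hdfs-site.xml`; Hadoop's default is 3. A minimal fragment:

```xml
<!-- hdfs-site.xml: default number of copies kept for each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```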
What makes Hadoop resilient to failure?
Block replication across DataNodes, heartbeat checks to detect failed nodes, and metadata checkpointing by a secondary or standby NameNode for recovery.
How do you calculate total Hadoop storage needed?
Multiply the original data size by the replication factor. Example: 25 TB × 5 = 125 TB.
How do you calculate the number of DataNodes needed?
Divide the total storage required by the capacity of each DataNode. Example: 125 TB ÷ 5 TB = 25 DataNodes.
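The two sizing calculations above can be combined into one small helper. This is a sketch of the arithmetic from the flashcards, not a real capacity-planning tool; the function name is hypothetical, and it rounds the node count up since you cannot deploy a fraction of a machine.

```python
import math

def cluster_sizing(data_tb: float, replication_factor: int,
                   node_capacity_tb: float) -> tuple:
    """Return (total storage in TB, number of DataNodes needed)."""
    total_tb = data_tb * replication_factor          # every block is copied
    nodes = math.ceil(total_tb / node_capacity_tb)   # round up to whole nodes
    return total_tb, nodes

# The flashcard example: 25 TB of data, replication factor 5, 5 TB per node.
print(cluster_sizing(25, 5, 5))  # → (125, 25)
```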
What is the Map phase in MapReduce?
It processes raw data into intermediate key-value pairs, running in parallel across nodes.
What is the Reduce phase in MapReduce?
It aggregates and summarizes the output of the Map phase to produce the final result.
Does MapReduce move data or code?
It moves the code (query process) to where the data is stored, not the other way around.
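The Map and Reduce phases described above can be sketched in plain Python with the classic word-count example. This mimics the programming model only; real MapReduce also shuffles intermediate pairs between machines, which the in-memory dictionary stands in for here.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn each raw input line into intermediate (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + Reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

data = ["big data big cluster", "data moves code"]
print(reduce_phase(map_phase(data)))
# → {'big': 2, 'data': 2, 'cluster': 1, 'moves': 1, 'code': 1}
```

In a real cluster, many mappers run this logic in parallel on the nodes that already hold the data blocks, which is exactly the "move code to data" idea on the last card.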