Chapter 10
Q: What is HDFS?
A: Hadoop Distributed File System, a distributed file system that runs on large clusters and provides high-throughput access to data.
Q: What are the key characteristics of HDFS?
A:
Scalable storage for large files
Replication for fault tolerance
Streaming data access
File appends
Q: How does HDFS ensure fault tolerance?
A: By replicating blocks of files on multiple machines.
Q: What are the main components of HDFS Architecture?
A: NameNode, Secondary NameNode, and DataNodes.
Q: What is the role of the NameNode in HDFS?
A: Manages the filesystem namespace, stores metadata, and is responsible for opening and closing files. No data flows through the NameNode.
Q: What does the Secondary NameNode do in HDFS?
A: Performs checkpointing: it periodically merges the NameNode's edit log into the filesystem image, keeping the metadata compact. It is not a hot standby for the NameNode.
Q: What is a DataNode in HDFS?
A: Stores actual data blocks and serves read and write requests from clients.
Q: Describe the Read Path in HDFS.
A:
Client requests file metadata from NameNode.
NameNode responds with DataNode locations.
Client reads data directly from DataNodes.
Q: Describe the Write Path in HDFS.
A:
Client requests to create a file from NameNode.
NameNode verifies permissions and creates the file entry; the client obtains an output stream.
Client writes data, which is split into packets and sent to DataNodes.
Data is replicated across DataNodes forming a pipeline.
DataNodes acknowledge successful writes back through the pipeline.
Client finalizes the file creation by closing the output stream.
Q: What is MapReduce 2.0 - YARN?
A: In Hadoop 2.0, YARN separates the resource management from the processing engine, allowing Hadoop to support different processing engines like MapReduce, Tez, and Spark.
Q: What are the main components of YARN?
A: Resource Manager (RM), Node Manager (NM), Application Master (AM), and Containers.
Q: What does the Resource Manager (RM) do in YARN?
A: Manages the global assignment of compute resources to applications through the Scheduler and Applications Manager.
Q: What is the role of the Application Master (AM) in YARN?
A: Manages the lifecycle of a specific application, including starting, monitoring, and restarting application tasks.
Q: What are Containers in YARN?
A: Bundles of resources allocated by the RM, granting applications the privilege to use specific amounts of CPU and memory.
Q: Name the three types of YARN Schedulers.
A: FIFO Scheduler, Fair Scheduler, and Capacity Scheduler.
Q: What is the FIFO Scheduler in YARN?
A: The default scheduler that processes jobs in a first-in, first-out manner without considering job priority or size.
Q: What is the Fair Scheduler in YARN?
A: Assigns resources to applications so that all applications receive, on average, an equal share of resources over time. Developed at Facebook.
Q: What is the Capacity Scheduler in YARN?
A: Enables running Hadoop applications in a shared, multi-tenant environment by providing capacity guarantees and resource limits. Developed at Yahoo.
Q: What are the two phases of MapReduce?
A:
Map Phase: Processes input data and generates intermediate key-value pairs.
Reduce Phase: Aggregates intermediate data with the same key.
Q: What are Numerical Summarization Patterns in MapReduce?
A: Patterns used to compute statistics such as counts, maximum, minimum, and mean.
Q: What is Sort and Shuffle in MapReduce?
A: The process of sorting and transferring intermediate data from the Map phase to the Reduce phase based on keys.
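The Map phase, sort-and-shuffle, and Reduce phase can be sketched in plain Python (no Hadoop cluster needed). This is a rough analogy, not the Hadoop API; the input lines are made up.

```python
# Word count as the three MapReduce stages, simulated in one process.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit an intermediate (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Sort-and-shuffle: sort by key, then group all values sharing a key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped}

counts = reduce_phase(shuffle(map_phase(["big data", "big clusters"])))
# counts == {"big": 2, "clusters": 1, "data": 1}
```

In real Hadoop the map and reduce tasks run on different machines and the shuffle moves data over the network, but the key-grouping logic is the same.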
Q: What is the purpose of Top-N in batch analytics?
A: To identify the top N records based on specific criteria, such as highest sales or most active users.
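A Top-N reducer typically keeps only a bounded heap of the N best records seen so far rather than sorting everything. A minimal sketch with made-up sales figures:

```python
# Top-N via heapq.nlargest: keep the 2 records with the highest sales.
import heapq

sales = [("alice", 120), ("bob", 340), ("carol", 95), ("dave", 210)]
top2 = heapq.nlargest(2, sales, key=lambda rec: rec[1])
# top2 == [("bob", 340), ("dave", 210)]
```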
Q: What is a Filter operation in batch analytics?
A: Selecting a subset of data that meets certain conditions or criteria.
Q: What is a Distinct operation in batch analytics?
A: Extracting unique records from a dataset by removing duplicates.
Q: What is Binning in batch analytics?
A: Grouping continuous data into discrete intervals or "bins" for analysis.
Q: What is an Inverted Index?
A: A data structure used to map content, such as words or terms, to their locations within a dataset, commonly used in search engines.
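An inverted index maps each term to the documents containing it, which is the shape a search-engine MapReduce job produces. A minimal sketch over three made-up documents:

```python
# Build an inverted index: term -> sorted list of document ids.
from collections import defaultdict

docs = {1: "big data systems", 2: "data pipelines", 3: "search systems"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

inverted = {term: sorted(ids) for term, ids in index.items()}
# inverted["data"] == [1, 2]; inverted["systems"] == [1, 3]
```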
Q: What are Joins in batch analytics?
A: Combining records from multiple datasets based on a related key or attribute.
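A join of two datasets on a shared key can be sketched in a few lines; the users/orders data is invented, and this mirrors what a reduce-side join does once records with the same key land in the same reducer.

```python
# Inner join of users and orders on user_id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

names = dict(users)  # key -> name lookup, like the grouped reducer input
joined = [(uid, names[uid], item) for uid, item in orders if uid in names]
# joined == [(1, "alice", "book"), (1, "alice", "pen"), (2, "bob", "mug")]
```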
Q: What is the Hortonworks Data Platform (HDP)?
A: An open-source platform distribution that includes various big data frameworks like Hadoop, YARN, HDFS, HBase, Hive, and Pig.
Q: What is the Cloudera Distribution for Hadoop (CDH)?
A: An open-source platform distribution that includes big data tools and frameworks such as Hadoop, YARN, HDFS, HBase, Hive, and Pig.
Q: What is Amazon Elastic MapReduce (EMR)?
A: A cloud-based big data cluster platform that supports various frameworks like Hadoop, Spark, Hive, and Pig.
Q: What is Pig in the Hadoop ecosystem?
A: A high-level data processing platform whose scripts, written in Pig Latin, are translated into MapReduce programs.
Q: What are the two main components of Pig?
A:
Pig Latin: The high-level data processing language.
Compiler: Translates Pig Latin scripts into MapReduce jobs.
Q: What are the two modes of operation in Pig?
A: Local mode and MapReduce mode.
Q: What are the primary data types in Pig?
A: Tuple, Bag, and Map.
Q: What is Apache Oozie?
A: A workflow scheduler system for managing Hadoop jobs, allowing the creation of workflows arranged as Directed Acyclic Graphs (DAG).
Q: How does Apache Oozie define workflows?
A: Using an XML-based process defining language called Hadoop Process Definition Language.
Q: What is Spark?
A: A cluster computing framework that supports in-memory processing, enabling real-time, batch, and interactive queries.
Q: What are the main components of Spark?
A: Spark Core, Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.
Q: What is Spark Core?
A: Provides the basic functionalities of Spark, including data abstraction through Resilient Distributed Datasets (RDDs).
Q: What is Spark Streaming?
A: A Spark component for processing and analyzing streaming data in real-time.
Q: What is Spark SQL?
A: A Spark component that enables interactive querying of data using SQL queries.
Q: What is Spark MLlib?
A: Spark’s machine learning library that provides algorithms for clustering, classification, regression, collaborative filtering, and dimensionality reduction.
Q: What is Spark GraphX?
A: A Spark component for graph processing, supporting graph algorithms like PageRank, connected components, and triangle counting.
Q: What are the main components of a Spark Cluster?
A: Driver, Cluster Manager, and Executors.
Q: What is the Driver in a Spark Cluster?
A: Consists of a driver program coordinated by the SparkContext object, managing the execution of the Spark application.
Q: What is the Cluster Manager in Spark?
A: Allocates resources across the cluster and manages the distribution of tasks to Executors.
Q: What are Executors in Spark?
A: Processes allocated on worker nodes that run application code and perform tasks.
Q: What is SparkContext?
A: An object that connects the Spark application to the cluster, used to create RDDs and manage resources.
Q: What are Resilient Distributed Datasets (RDDs) in Spark?
A: The primary data abstraction in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel.
Q: What are the two types of RDD Operations?
A: Transformations and Actions.
Q: What are Transformations in Spark RDD operations?
A: Operations that create a new RDD from an existing one. They are lazy and only executed when an action is called.
Q: What are Actions in Spark RDD operations?
A: Operations that trigger the execution of transformations and return a value to the driver program.
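The laziness of transformations can be illustrated with Python generators, which behave analogously: building the pipeline does no work until something consumes it. This is an analogy, not the Spark API.

```python
# "Transformations" build a plan; the "action" (list) triggers execution.
data = range(5)

doubled = (x * 2 for x in data)        # like rdd.map(lambda x: x * 2)
evens = (x for x in doubled if x > 4)  # like .filter(lambda x: x > 4)

result = list(evens)                   # the action: computation happens here
# result == [6, 8]
```

In Spark the same structure lets the scheduler fuse transformations into stages and avoid materializing intermediate datasets.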
Q: What is Apache Solr?
A: A scalable open-source framework for searching data, built on Apache Lucene, enabling indexing and searching of various data formats.
Q: What are the key features of Apache Solr?
A:
Faceting
Clustering
Spatial Search
Pagination and Ranking
Q: What is Elasticsearch?
A: An open-source search and analytics engine designed for scalability, used to store, search, and analyze large volumes of data quickly.
Q: What is a Cluster in Elasticsearch?
A: A group of nodes that work together to store and index data.
Q: What is a Node in Elasticsearch?
A: A single server within an Elasticsearch cluster responsible for storing and indexing data.
Q: What is an Index in Elasticsearch?
A: A collection of similar documents, such as customer records or product catalogs.
Q: What is a Document in Elasticsearch?
A: A single unit of data stored within an index.
Q: What are Shards in Elasticsearch?
A: Subdivisions of an index that allow for parallel storage and retrieval of data, enhancing scalability and performance.
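Routing a document to a shard is usually a stable hash of its id modulo the shard count, so both writes and lookups can be parallelized. A rough sketch (not Elasticsearch's actual routing function; the document id and shard count are made up):

```python
# Hash-based shard routing: map a document id to one of num_shards shards.
import zlib

def shard_for(doc_id: str, num_shards: int = 4) -> int:
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(doc_id.encode()) % num_shards

shard = shard_for("product-42")  # always the same shard for this id
```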
Q: What are the Common Features of Solr and Elasticsearch?
A:
Both are open-source, distributed, and fault-tolerant search frameworks.
Built on Apache Lucene.
Use shards for scalability.
Q: How does Solr differ from Elasticsearch in search capabilities?
A: Solr is more focused on text searches, while Elasticsearch excels in analytics queries, including grouping and filtering.
Q: What is a key deployment difference between Solr and Elasticsearch?
A: Elasticsearch is easier to set up and deploy as it doesn’t require Apache ZooKeeper, unlike Solr.
Q: How does Elasticsearch handle real-time search compared to Solr?
A: Elasticsearch is designed for near real-time search with low latency between indexing and search availability, whereas Solr was not originally built for real-time search.
Q: How do Solr and Elasticsearch manage shards differently?
A: Solr supports shard splitting to divide existing shards, while Elasticsearch supports shard rebalancing to distribute shards across nodes as new nodes join the cluster.