Chapter 10

Flashcard 1

Q: What is HDFS? 

A: Hadoop Distributed File System, a distributed file system that runs on large clusters and provides high-throughput access to data.


Flashcard 2

Q: What are the key characteristics of HDFS? 

A:

  • Scalable storage for large files

  • Replication for fault tolerance

  • Streaming data access

  • File appends


Flashcard 3

Q: How does HDFS ensure fault tolerance? 

A: By replicating blocks of files on multiple machines.


Flashcard 4

Q: What are the main components of HDFS Architecture? 

A: NameNode, Secondary NameNode, and DataNodes.


Flashcard 5

Q: What is the role of the NameNode in HDFS? 

A: Manages the filesystem namespace, stores metadata, and is responsible for opening and closing files. No data flows through the NameNode.

Flashcard 6

Q: What does the Secondary NameNode do in HDFS? 

A: Performs periodic checkpointing, merging the NameNode's edit log into the fsimage so the filesystem metadata stays compact.


Flashcard 7

Q: What is a DataNode in HDFS? 

A: Stores actual data blocks and serves read and write requests from clients.


Flashcard 8

Q: Describe the Read Path in HDFS. 

A:

  1. Client requests file metadata from NameNode.

  2. NameNode responds with DataNode locations.

  3. Client reads data directly from DataNodes.


Flashcard 9

Q: Describe the Write Path in HDFS. 

A:

  1. Client requests to create a file from NameNode.

  2. NameNode creates the file entry, and the client obtains an output stream.

  3. Client writes data, which is split into packets and sent to DataNodes.

  4. Data is replicated across DataNodes forming a pipeline.

  5. DataNodes acknowledge successful writes.

  6. Client finalizes the file creation by closing the output stream.


Flashcard 10

Q: What is MapReduce 2.0 - YARN? 

A: In Hadoop 2.0, YARN separates the resource management from the processing engine, allowing Hadoop to support different processing engines like MapReduce, Tez, and Spark.

Flashcard 11

Q: What are the main components of YARN? 

A: Resource Manager (RM), Node Manager (NM), Application Master (AM), and Containers.


Flashcard 12

Q: What does the Resource Manager (RM) do in YARN? 

A: Manages the global assignment of compute resources to applications through the Scheduler and Applications Manager.


Flashcard 13

Q: What is the role of the Application Master (AM) in YARN? 

A: Manages the lifecycle of a specific application, including starting, monitoring, and restarting application tasks.


Flashcard 14

Q: What are Containers in YARN? 

A: Bundles of resources allocated by the RM, granting applications the privilege to use specific amounts of CPU and memory.


Flashcard 15

Q: Name the three types of YARN Schedulers. 

A: FIFO Scheduler, Fair Scheduler, and Capacity Scheduler.




Flashcard 16

Q: What is the FIFO Scheduler in YARN? 

A: The default scheduler that processes jobs in a first-in, first-out manner without considering job priority or size.


Flashcard 17

Q: What is the Fair Scheduler in YARN? 

A: Assigns resources to applications so that all applications receive, on average, an equal share of resources over time. Developed at Facebook.


Flashcard 18

Q: What is the Capacity Scheduler in YARN? 

A: Enables running Hadoop applications in a shared, multi-tenant environment by providing capacity guarantees and resource limits. Developed at Yahoo.


Flashcard 19

Q: What are the two phases of MapReduce?

A:

  1. Map Phase: Processes input data and generates intermediate key-value pairs.

  2. Reduce Phase: Aggregates intermediate data with the same key.
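The two phases can be sketched in plain Python (a toy, single-process sketch of word count, not Hadoop's actual Java API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit one (word, 1) pair per word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sort by key (stand-in for shuffle), then sum counts per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data", "big clusters"])))
# counts == {"big": 2, "clusters": 1, "data": 1}
```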


Flashcard 20

Q: What are Numerical Summarization Patterns in MapReduce? 

A: Patterns used to compute statistics such as counts, maximum, minimum, and mean.
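A minimal sketch of the reduce side of a numerical-summarization pattern, computing count, min, max, and mean per key in one pass (plain Python, not a real MapReduce job):

```python
def summarize(pairs):
    # Accumulate (count, min, max, sum) per key, then derive the mean
    stats = {}
    for key, value in pairs:
        c, lo, hi, total = stats.get(key, (0, float("inf"), float("-inf"), 0.0))
        stats[key] = (c + 1, min(lo, value), max(hi, value), total + value)
    return {k: {"count": c, "min": lo, "max": hi, "mean": total / c}
            for k, (c, lo, hi, total) in stats.items()}

s = summarize([("temp", 10), ("temp", 30), ("hum", 50)])
# s["temp"] == {"count": 2, "min": 10, "max": 30, "mean": 20.0}
```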


Flashcard 21

Q: What is Sort and Shuffle in MapReduce? 

A: The process of sorting and transferring intermediate data from the Map phase to the Reduce phase based on keys.


Flashcard 22

Q: What is the purpose of Top-N in batch analytics? 

A: To identify the top N records based on specific criteria, such as highest sales or most active users.
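A Top-N sketch using Python's heapq (the sample data is made up for illustration; in MapReduce each mapper would emit its local top N and a single reducer would merge them the same way):

```python
import heapq

sales = [("alice", 120), ("bob", 340), ("carol", 95), ("dave", 210)]

# Top-2 records by sales amount
top2 = heapq.nlargest(2, sales, key=lambda r: r[1])
# top2 == [("bob", 340), ("dave", 210)]
```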


Flashcard 23

Q: What is a Filter operation in batch analytics? 

A: Selecting a subset of data that meets certain conditions or criteria.


Flashcard 24

Q: What is a Distinct operation in batch analytics? 

A: Extracting unique records from a dataset by removing duplicates.


Flashcard 25

Q: What is Binning in batch analytics? 

A: Grouping continuous data into discrete intervals or "bins" for analysis.
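Filter, Distinct, and Binning can each be expressed in a few lines of plain Python (a toy sketch; in MapReduce each would be a map-only or map-plus-reduce job):

```python
values = [3, 17, 17, 42, 8, 42, 25]

# Filter: keep records meeting a condition
filtered = [v for v in values if v > 10]   # [17, 17, 42, 42, 25]

# Distinct: remove duplicates (sorted for a stable result)
distinct = sorted(set(values))             # [3, 8, 17, 25, 42]

# Binning: group continuous values into width-10 bins
bins = {}
for v in values:
    bins.setdefault(v // 10 * 10, []).append(v)
# bins == {0: [3, 8], 10: [17, 17], 40: [42, 42], 20: [25]}
```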




Flashcard 26

Q: What is an Inverted Index?

A: A data structure used to map content, such as words or terms, to their locations within a dataset, commonly used in search engines.
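A minimal inverted index built in plain Python (hypothetical two-document corpus; real search engines add tokenization, stemming, and per-position postings):

```python
docs = {
    "d1": "hadoop stores big data",
    "d2": "spark processes big data fast",
}

# Inverted index: term -> sorted list of documents containing it
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, []).append(doc_id)
index = {term: sorted(ids) for term, ids in index.items()}
# index["big"] == ["d1", "d2"]; index["spark"] == ["d2"]
```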


Flashcard 27

Q: What are Joins in batch analytics? 

A: Combining records from multiple datasets based on a related key or attribute.
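A reduce-side join sketch in plain Python (made-up sample data): records from both datasets are grouped by the join key, then paired up, which is what the reducer does after the shuffle brings same-key records together.

```python
from collections import defaultdict

users  = [("u1", "Ada"), ("u2", "Ben")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "ink")]

# Group records from each side by key (the shuffle's job in MapReduce)
grouped = defaultdict(lambda: ([], []))
for key, name in users:
    grouped[key][0].append(name)
for key, item in orders:
    grouped[key][1].append(item)

# Pair up records from the two sides per key (the reducer's job)
joined = [(key, name, item)
          for key, (names, items) in grouped.items()
          for name in names for item in items]
# joined contains ("u1", "Ada", "book"), ("u1", "Ada", "pen"), ("u2", "Ben", "ink")
```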


Flashcard 28

Q: What is the Hortonworks Data Platform (HDP)?

A: An open-source platform distribution that includes various big data frameworks like Hadoop, YARN, HDFS, HBase, Hive, and Pig.


Flashcard 29

Q: What is the Cloudera Distribution for Hadoop (CDH)?

A: An open-source platform distribution that includes big data tools and frameworks such as Hadoop, YARN, HDFS, HBase, Hive, and Pig.


Flashcard 30

Q: What is Amazon Elastic MapReduce (EMR)?

A: A cloud-based big data cluster platform that supports various frameworks like Hadoop, Spark, Hive, and Pig.



Flashcard 31

Q: What is Pig in the Hadoop ecosystem?

A: A high-level data processing platform whose scripts, written in the Pig Latin language, are translated into MapReduce programs.


Flashcard 32

Q: What are the two main components of Pig?

A:

  1. Pig Latin: The high-level data processing language.

  2. Compiler: Translates Pig Latin scripts into MapReduce jobs.


Flashcard 33

Q: What are the two modes of operation in Pig?

A: Local mode and MapReduce mode.


Flashcard 34

Q: What are the primary data types in Pig?

A: Tuple, Bag, and Map.


Flashcard 35

Q: What is Apache Oozie?

A: A workflow scheduler system for managing Hadoop jobs, allowing the creation of workflows arranged as Directed Acyclic Graphs (DAG).



Flashcard 36

Q: How does Apache Oozie define workflows? 

A: Using an XML-based process definition language called hPDL (Hadoop Process Definition Language).


Flashcard 37

Q: What is Spark?

A: A cluster computing framework that supports in-memory processing, enabling real-time, batch, and interactive queries.


Flashcard 38

Q: What are the main components of Spark?

A: Spark Core, Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.


Flashcard 39

Q: What is Spark Core?

A: Provides the basic functionalities of Spark, including data abstraction through Resilient Distributed Datasets (RDDs).


Flashcard 40

Q: What is Spark Streaming?

A: A Spark component for processing and analyzing streaming data in real-time.




Flashcard 41

Q: What is Spark SQL?

A: A Spark component that enables interactive querying of data using SQL queries.


Flashcard 42

Q: What is Spark MLlib?

A: Spark’s machine learning library that provides algorithms for clustering, classification, regression, collaborative filtering, and dimensionality reduction.


Flashcard 43

Q: What is Spark GraphX?

A: A Spark component for graph processing, supporting graph algorithms like PageRank, connected components, and triangle counting.


Flashcard 44

Q: What are the main components of a Spark Cluster?

A: Driver, Cluster Manager, and Executors.


Flashcard 45

Q: What is the Driver in a Spark Cluster? 

A: The driver program, which creates the SparkContext object and coordinates the execution of the Spark application.




Flashcard 46

Q: What is the Cluster Manager in Spark? 

A: Allocates resources across the cluster and manages the distribution of tasks to Executors.


Flashcard 47

Q: What are Executors in Spark? 

A: Processes allocated on worker nodes that run application code and perform tasks.


Flashcard 48

Q: What is SparkContext?

A: An object that connects the Spark application to the cluster, used to create RDDs and manage resources.


Flashcard 49

Q: What are Resilient Distributed Datasets (RDDs) in Spark? 

A: The primary data abstraction in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel.


Flashcard 50

Q: What are the two types of RDD Operations?

A: Transformations and Actions.




Flashcard 51

Q: What are Transformations in Spark RDD operations? 

A: Operations that create a new RDD from an existing one. They are lazy and only executed when an action is called.


Flashcard 52

Q: What are Actions in Spark RDD operations? 

A: Operations that trigger the execution of transformations and return a value to the driver program.
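The transformation/action split can be illustrated with plain Python generators rather than Spark itself: generators are lazy like RDD transformations, and nothing runs until an "action" consumes the pipeline.

```python
# Not Spark: a plain-Python analogy for lazy transformations vs. actions.
data = range(1, 6)                       # source "RDD": 1..5
doubled = (x * 2 for x in data)          # "transformation": lazy, nothing computed yet
evens   = (x for x in doubled if x > 4)  # another lazy "transformation"

result = list(evens)                     # "action": triggers execution of the chain
# result == [6, 8, 10]
```

In actual PySpark the equivalent chain would be `sc.parallelize(range(1, 6)).map(lambda x: x * 2).filter(lambda x: x > 4).collect()`, where `map` and `filter` are transformations and `collect` is the action.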


Flashcard 53

Q: What is Apache Solr?

A: A scalable open-source framework for searching data, built on Apache Lucene, enabling indexing and searching of various data formats.


Flashcard 54

Q: What are the key features of Apache Solr?

A:

  • Faceting

  • Clustering

  • Spatial Search

  • Pagination and Ranking


Flashcard 55

Q: What is Elasticsearch?

A: An open-source search and analytics engine designed for scalability, used to store, search, and analyze large volumes of data quickly.


Flashcard 56

Q: What is a Cluster in Elasticsearch? 

A: A group of nodes that work together to store and index data.


Flashcard 57

Q: What is a Node in Elasticsearch? 

A: A single server within an Elasticsearch cluster responsible for storing and indexing data.


Flashcard 58

Q: What is an Index in Elasticsearch? 

A: A collection of similar documents, such as customer records or product catalogs.


Flashcard 59

Q: What is a Document in Elasticsearch? 

A: A single unit of data stored within an index.


Flashcard 60

Q: What are Shards in Elasticsearch? 

A: Subdivisions of an index that allow for parallel storage and retrieval of data, enhancing scalability and performance.
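Elasticsearch routes each document to a shard with a deterministic formula, shard = hash(_routing) % number_of_primary_shards (the real implementation hashes with murmur3; the sketch below uses Python's built-in hash() as a stand-in, which is not stable across processes):

```python
def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    # Simplified stand-in for Elasticsearch's routing formula:
    #   shard = hash(_routing) % number_of_primary_shards
    # Python's hash() replaces murmur3 here, for illustration only.
    return hash(doc_id) % num_primary_shards

shard = pick_shard("order-42", 5)
assert 0 <= shard < 5  # every document lands on exactly one valid shard
```

This is also why the number of primary shards is fixed at index-creation time: changing it would re-route existing documents to different shards.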




Flashcard 61

Q: What are the Common Features of Solr and Elasticsearch? 

A:

  • Both are open-source, distributed, and fault-tolerant search frameworks.

  • Built on Apache Lucene.

  • Use shards for scalability.


Flashcard 62

Q: How does Solr differ from Elasticsearch in search capabilities? 

A: Solr is more focused on text searches, while Elasticsearch excels in analytics queries, including grouping and filtering.


Flashcard 63

Q: What is a key deployment difference between Solr and Elasticsearch? 

A: Elasticsearch is easier to set up and deploy as it doesn’t require Apache ZooKeeper, unlike Solr.


Flashcard 64

Q: How does Elasticsearch handle real-time search compared to Solr?

A: Elasticsearch is designed for near real-time search with low latency between indexing and search availability, whereas Solr was not originally built for real-time search.


Flashcard 65

Q: How do Solr and Elasticsearch manage shards differently? 

A: Solr supports shard splitting to divide existing shards, while Elasticsearch supports shard rebalancing to distribute shards across nodes as new nodes join the cluster.
