Chapter 10
Q: What is HDFS?
A: Hadoop Distributed File System, a distributed file system that runs on large clusters and provides high-throughput access to data.
Q: What are the key characteristics of HDFS?
A:
Scalable storage for large files
Replication for fault tolerance
Streaming data access
File appends
Q: How does HDFS ensure fault tolerance?
A: By replicating blocks of files on multiple machines.
Q: What are the main components of HDFS Architecture?
A: NameNode, Secondary NameNode, and DataNodes.
Q: What is the role of the NameNode in HDFS?
A: Manages the filesystem namespace, stores metadata, and is responsible for opening and closing files. No data flows through the NameNode.
Q: What does the Secondary NameNode do in HDFS?
A: Performs checkpointing: it periodically merges the NameNode's edit log into the filesystem image, keeping the metadata compact. It is not a hot standby for the NameNode.
Q: What is a DataNode in HDFS?
A: Stores actual data blocks and serves read and write requests from clients.
Q: Describe the Read Path in HDFS.
A:
Client requests file metadata from NameNode.
NameNode responds with DataNode locations.
Client reads data directly from DataNodes.
Q: Describe the Write Path in HDFS.
A:
Client requests to create a file from NameNode.
NameNode verifies permissions and creates the file entry; the client obtains an output stream.
Client writes data, which is split into packets and sent to DataNodes.
Data is replicated across DataNodes forming a pipeline.
DataNodes acknowledge successful writes back through the pipeline.
Client finalizes the file creation by closing the output stream.
Q: What is MapReduce 2.0 - YARN?
A: In Hadoop 2.0, YARN separates the resource management from the processing engine, allowing Hadoop to support different processing engines like MapReduce, Tez, and Spark.
Q: What are the main components of YARN?
A: Resource Manager (RM), Node Manager (NM), Application Master (AM), and Containers.
Q: What does the Resource Manager (RM) do in YARN?
A: Manages the global assignment of compute resources to applications through the Scheduler and Applications Manager.
Q: What is the role of the Application Master (AM) in YARN?
A: Manages the lifecycle of a specific application, including starting, monitoring, and restarting application tasks.
Q: What are Containers in YARN?
A: Bundles of resources allocated by the RM, granting applications the privilege to use specific amounts of CPU and memory.
Q: Name the three types of YARN Schedulers.
A: FIFO Scheduler, Fair Scheduler, and Capacity Scheduler.
Q: What is the FIFO Scheduler in YARN?
A: The default scheduler that processes jobs in a first-in, first-out manner without considering job priority or size.
Q: What is the Fair Scheduler in YARN?
A: Assigns resources to applications so that all applications receive, on average, an equal share of resources over time. Developed at Facebook.
Q: What is the Capacity Scheduler in YARN?
A: Enables running Hadoop applications in a shared, multi-tenant environment by providing capacity guarantees and resource limits. Developed at Yahoo.
Q: What are the two phases of MapReduce?
A:
Map Phase: Processes input data and generates intermediate key-value pairs.
Reduce Phase: Aggregates intermediate data with the same key.
Q: What are Numerical Summarization Patterns in MapReduce?
A: Patterns used to compute statistics such as counts, maximum, minimum, and mean.
Q: What is Sort and Shuffle in MapReduce?
A: The process of sorting and transferring intermediate data from the Map phase to the Reduce phase based on keys.
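The Map phase, sort-and-shuffle, and Reduce phase can be sketched in plain Python (no Hadoop cluster needed). This is a rough analogy, not the Hadoop API; the input lines are made up.

```python
# Word count as the three MapReduce stages, simulated in one process.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit an intermediate (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Sort-and-shuffle: sort by key, then group all values sharing a key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped}

counts = reduce_phase(shuffle(map_phase(["big data", "big clusters"])))
# counts == {"big": 2, "clusters": 1, "data": 1}
```

In real Hadoop the map and reduce tasks run on different machines and the shuffle moves data over the network, but the key-grouping logic is the same.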
Q: What is the purpose of Top-N in batch analytics?
A: To identify the top N records based on specific criteria, such as highest sales or most active users.
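A Top-N reducer typically keeps only a bounded heap of the N best records seen so far rather than sorting everything. A minimal sketch with made-up sales figures:

```python
# Top-N via heapq.nlargest: keep the 2 records with the highest sales.
import heapq

sales = [("alice", 120), ("bob", 340), ("carol", 95), ("dave", 210)]
top2 = heapq.nlargest(2, sales, key=lambda rec: rec[1])
# top2 == [("bob", 340), ("dave", 210)]
```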
Q: What is a Filter operation in batch analytics?
A: Selecting a subset of data that meets certain conditions or criteria.
Q: What is a Distinct operation in batch analytics?
A: Extracting unique records from a dataset by removing duplicates.
Q: What is Binning in batch analytics?
A: Grouping continuous data into discrete intervals or "bins" for analysis.
Q: What is an Inverted Index?
A: A data structure used to map content, such as words or terms, to their locations within a dataset, commonly used in search engines.
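An inverted index maps each term to the documents containing it, which is the shape a search-engine MapReduce job produces. A minimal sketch over three made-up documents:

```python
# Build an inverted index: term -> sorted list of document ids.
from collections import defaultdict

docs = {1: "big data systems", 2: "data pipelines", 3: "search systems"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

inverted = {term: sorted(ids) for term, ids in index.items()}
# inverted["data"] == [1, 2]; inverted["systems"] == [1, 3]
```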
Q: What are Joins in batch analytics?
A: Combining records from multiple datasets based on a related key or attribute.
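A join of two datasets on a shared key can be sketched in a few lines; the users/orders data is invented, and this mirrors what a reduce-side join does once records with the same key land in the same reducer.

```python
# Inner join of users and orders on user_id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

names = dict(users)  # key -> name lookup, like the grouped reducer input
joined = [(uid, names[uid], item) for uid, item in orders if uid in names]
# joined == [(1, "alice", "book"), (1, "alice", "pen"), (2, "bob", "mug")]
```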
Q: What is the Hortonworks Data Platform (HDP)?
A: An open-source platform distribution that includes various big data frameworks like Hadoop, YARN, HDFS, HBase, Hive, and Pig.
Q: What is the Cloudera Distribution for Hadoop (CDH)?
A: An open-source platform distribution that includes big data tools and frameworks such as Hadoop, YARN, HDFS, HBase, Hive, and Pig.
Q: What is Amazon Elastic MapReduce (EMR)?
A: A cloud-based big data cluster platform that supports various frameworks like Hadoop, Spark, Hive, and Pig.
Q: What is Pig in the Hadoop ecosystem?
A: A high-level data processing platform whose scripts, written in Pig Latin, are translated into MapReduce programs.
Q: What are the two main components of Pig?
A:
Pig Latin: The high-level data processing language.
Compiler: Translates Pig Latin scripts into MapReduce jobs.
Q: What are the two modes of operation in Pig?
A: Local mode and MapReduce mode.
Q: What are the primary data types in Pig?
A: Tuple, Bag, and Map.
Q: What is Apache Oozie?
A: A workflow scheduler system for managing Hadoop jobs, allowing the creation of workflows arranged as Directed Acyclic Graphs (DAG).
Q: How does Apache Oozie define workflows?
A: Using an XML-based process defining language called Hadoop Process Definition Language.
Q: What is Spark?
A: A cluster computing framework that supports in-memory processing, enabling real-time, batch, and interactive queries.
Q: What are the main components of Spark?
A: Spark Core, Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.
Q: What is Spark Core?
A: Provides the basic functionalities of Spark, including data abstraction through Resilient Distributed Datasets (RDDs).
Q: What is Spark Streaming?
A: A Spark component for processing and analyzing streaming data in real-time.
Q: What is Spark SQL?
A: A Spark component that enables interactive querying of data using SQL queries.
Q: What is Spark MLlib?
A: Spark’s machine learning library that provides algorithms for clustering, classification, regression, collaborative filtering, and dimensionality reduction.
Q: What is Spark GraphX?
A: A Spark component for graph processing, supporting graph algorithms like PageRank, connected components, and triangle counting.
Q: What are the main components of a Spark Cluster?
A: Driver, Cluster Manager, and Executors.
Q: What is the Driver in a Spark Cluster?
A: Consists of a driver program coordinated by the SparkContext object, managing the execution of the Spark application.
Q: What is the Cluster Manager in Spark?
A: Allocates resources across the cluster and manages the distribution of tasks to Executors.
Q: What are Executors in Spark?
A: Processes allocated on worker nodes that run application code and perform tasks.
Q: What is SparkContext?
A: An object that connects the Spark application to the cluster, used to create RDDs and manage resources.
Q: What are Resilient Distributed Datasets (RDDs) in Spark?
A: The primary data abstraction in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel.
Q: What are the two types of RDD Operations?
A: Transformations and Actions.
Q: What are Transformations in Spark RDD operations?
A: Operations that create a new RDD from an existing one. They are lazy and only executed when an action is called.
Q: What are Actions in Spark RDD operations?
A: Operations that trigger the execution of transformations and return a value to the driver program.
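The laziness of transformations can be illustrated with Python generators, which behave analogously: building the pipeline does no work until something consumes it. This is an analogy, not the Spark API.

```python
# "Transformations" build a plan; the "action" (list) triggers execution.
data = range(5)

doubled = (x * 2 for x in data)        # like rdd.map(lambda x: x * 2)
evens = (x for x in doubled if x > 4)  # like .filter(lambda x: x > 4)

result = list(evens)                   # the action: computation happens here
# result == [6, 8]
```

In Spark the same structure lets the scheduler fuse transformations into stages and avoid materializing intermediate datasets.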
Q: What is Apache Solr?
A: A scalable open-source framework for searching data, built on Apache Lucene, enabling indexing and searching of various data formats.
Q: What are the key features of Apache Solr?
A:
Faceting
Clustering
Spatial Search
Pagination and Ranking
Q: What is Elasticsearch?
A: An open-source search and analytics engine designed for scalability, used to store, search, and analyze large volumes of data quickly.
Q: What is a Cluster in Elasticsearch?
A: A group of nodes that work together to store and index data.
Q: What is a Node in Elasticsearch?
A: A single server within an Elasticsearch cluster responsible for storing and indexing data.
Q: What is an Index in Elasticsearch?
A: A collection of similar documents, such as customer records or product catalogs.
Q: What is a Document in Elasticsearch?
A: A single unit of data stored within an index.
Q: What are Shards in Elasticsearch?
A: Subdivisions of an index that allow for parallel storage and retrieval of data, enhancing scalability and performance.
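Routing a document to a shard is usually a stable hash of its id modulo the shard count, so both writes and lookups can be parallelized. A rough sketch (not Elasticsearch's actual routing function; the document id and shard count are made up):

```python
# Hash-based shard routing: map a document id to one of num_shards shards.
import zlib

def shard_for(doc_id: str, num_shards: int = 4) -> int:
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(doc_id.encode()) % num_shards

shard = shard_for("product-42")  # always the same shard for this id
```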
Q: What are the Common Features of Solr and Elasticsearch?
A:
Both are open-source, distributed, and fault-tolerant search frameworks.
Built on Apache Lucene.
Use shards for scalability.
Q: How does Solr differ from Elasticsearch in search capabilities?
A: Solr is more focused on text searches, while Elasticsearch excels in analytics queries, including grouping and filtering.
Q: What is a key deployment difference between Solr and Elasticsearch?
A: Elasticsearch is easier to set up and deploy as it doesn’t require Apache ZooKeeper, unlike Solr.
Q: How does Elasticsearch handle real-time search compared to Solr?
A: Elasticsearch is designed for near real-time search with low latency between indexing and search availability, whereas Solr was not originally built for real-time search.
Q: How do Solr and Elasticsearch manage shards differently?
A: Solr supports shard splitting to divide existing shards, while Elasticsearch supports shard rebalancing to distribute shards across nodes as new nodes join the cluster.