What is HDFS?
The Hadoop Distributed File System: a distributed file system that runs on large clusters of commodity hardware and provides high-throughput access to data.
What are the key characteristics of HDFS?
Scalable storage for large files, Replication for fault tolerance, Streaming data access, File appends.
How does HDFS ensure fault tolerance?
By replicating blocks of files on multiple machines.
What are the main components of HDFS Architecture?
NameNode, Secondary NameNode, and DataNodes.
What is the role of the NameNode in HDFS?
Manages the filesystem namespace, stores metadata, and is responsible for opening and closing files.
What does the Secondary NameNode do in HDFS?
Performs checkpointing: it periodically merges the NameNode's edit log into the filesystem image to keep the metadata compact. It is not a hot standby for the NameNode.
What is a DataNode in HDFS?
Stores actual data blocks and serves read and write requests from clients.
Describe the Read Path in HDFS.
Client requests file metadata from NameNode; NameNode responds with DataNode locations; Client reads data directly from DataNodes.
Describe the Write Path in HDFS.
Client requests to create a file from NameNode; NameNode responds with an output stream; Client writes data, which is split into packets and sent to DataNodes; Data is replicated across DataNodes forming a pipeline; DataNodes acknowledge successful writes; Client finalizes the file creation by closing the output stream.
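The replication pipeline above can be sketched in plain Python. This is a conceptual illustration only; the `DataNode` class and its methods are hypothetical, not part of any Hadoop API.

```python
# Hypothetical sketch of the HDFS write pipeline: packets flow down a
# chain of DataNodes and acknowledgements flow back toward the client.

class DataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next DataNode in the pipeline
        self.blocks = []              # packets replicated locally

    def write_packet(self, packet):
        self.blocks.append(packet)           # store the replica locally
        if self.downstream:                  # forward along the pipeline
            self.downstream.write_packet(packet)
        return "ack"                         # ack travels back upstream

# Build a 3-node pipeline (replication factor 3).
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

for packet in ["pkt-0", "pkt-1"]:
    assert dn1.write_packet(packet) == "ack"

# Every DataNode in the pipeline now holds a replica of each packet.
print([dn.blocks for dn in (dn1, dn2, dn3)])
```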
What is MapReduce 2.0 - YARN?
In Hadoop 2.0, YARN separates the resource management from the processing engine, allowing Hadoop to support different processing engines like MapReduce, Tez, and Spark.
What are the main components of YARN?
Resource Manager (RM), Node Manager (NM), Application Master (AM), and Containers.
What does the Resource Manager (RM) do in YARN?
Manages the global assignment of compute resources to applications through the Scheduler and Applications Manager.
What is the role of the Application Master (AM) in YARN?
Manages the lifecycle of a specific application, including starting, monitoring, and restarting application tasks.
What are Containers in YARN?
Bundles of resources allocated by the RM, granting applications the privilege to use specific amounts of CPU and memory.
Name the three types of YARN Schedulers.
FIFO Scheduler, Fair Scheduler, and Capacity Scheduler.
What is the FIFO Scheduler in YARN?
The default scheduler that processes jobs in a first-in, first-out manner without considering job priority or size.
What is the Fair Scheduler in YARN?
Assigns resources to applications so that all applications receive, on average, an equal share of resources over time.
What is the Capacity Scheduler in YARN?
Enables running Hadoop applications in a shared, multi-tenant environment by providing capacity guarantees and resource limits.
What are the two phases of MapReduce?
Map Phase: Processes input data and generates intermediate key-value pairs; Reduce Phase: Aggregates intermediate data with the same key.
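The two phases (plus the shuffle between them) can be sketched with a pure-Python word count. No Hadoop API is used; the function names are illustrative only.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Sort-and-shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    # Aggregate all values that share a key.
    return key, sum(values)

lines = ["big data big clusters", "big data"]
intermediate = list(chain.from_iterable(map_phase(l) for l in lines))
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 3, 'clusters': 1, 'data': 2}
```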
What are Numerical Summarization Patterns in MapReduce?
Patterns used to compute statistics such as counts, maximum, minimum, and mean.
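A minimal sketch of this pattern: one group-then-aggregate pass computes count, min, max, and mean per key. The keys and values are hypothetical sample data.

```python
from collections import defaultdict

records = [("sensor-a", 3.0), ("sensor-b", 5.0), ("sensor-a", 7.0)]

groups = defaultdict(list)
for key, value in records:          # shuffle: group values by key
    groups[key].append(value)

summary = {
    key: {"count": len(vs), "min": min(vs), "max": max(vs),
          "mean": sum(vs) / len(vs)}
    for key, vs in groups.items()   # reduce: aggregate per key
}
print(summary["sensor-a"])  # {'count': 2, 'min': 3.0, 'max': 7.0, 'mean': 5.0}
```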
What is Sort and Shuffle in MapReduce?
The process of sorting and transferring intermediate data from the Map phase to the Reduce phase based on keys.
What is the purpose of Top-N in batch analytics?
To identify the top N records based on specific criteria, such as highest sales or most active users.
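A common way to implement Top-N: each mapper keeps a local top-N, and the reducer merges the local lists into a global top-N. A sketch with hypothetical data:

```python
import heapq

def local_top_n(records, n):
    # Each partition (mapper) emits only its own top-N candidates.
    return heapq.nlargest(n, records, key=lambda r: r[1])

partition_1 = [("alice", 120), ("bob", 75)]
partition_2 = [("carol", 200), ("dave", 90)]

# The reducer merges the small candidate lists into the global top-2.
candidates = local_top_n(partition_1, 2) + local_top_n(partition_2, 2)
top2 = heapq.nlargest(2, candidates, key=lambda r: r[1])
print(top2)  # [('carol', 200), ('alice', 120)]
```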
What is a Filter operation in batch analytics?
Selecting a subset of data that meets certain conditions or criteria.
What is a Distinct operation in batch analytics?
Extracting unique records from a dataset by removing duplicates.
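Filter and Distinct reduce to a predicate and a de-duplication pass; a tiny sketch over hypothetical values:

```python
values = [4, 8, 15, 8, 16, 23, 42, 42]

filtered = [v for v in values if v > 10]   # Filter: keep records matching a condition
distinct = sorted(set(values))             # Distinct: remove duplicate records

print(filtered)  # [15, 16, 23, 42, 42]
print(distinct)  # [4, 8, 15, 16, 23, 42]
```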
What is Binning in batch analytics?
Grouping continuous data into discrete intervals or 'bins' for analysis.
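A binning sketch: continuous ages are mapped to discrete labeled bins. The edges and labels are hypothetical.

```python
import bisect
from collections import Counter

edges = [18, 35, 65]                           # bin boundaries
labels = ["minor", "adult", "middle", "senior"]

def bin_of(age):
    # bisect finds which interval the value falls into.
    return labels[bisect.bisect_right(edges, age)]

ages = [12, 25, 40, 70]
histogram = Counter(bin_of(a) for a in ages)
print(dict(histogram))  # {'minor': 1, 'adult': 1, 'middle': 1, 'senior': 1}
```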
What is an Inverted Index?
A data structure used to map content, such as words or terms, to their locations within a dataset, commonly used in search engines.
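A minimal inverted-index sketch: each term maps to the set of documents containing it. Document ids and text are hypothetical.

```python
from collections import defaultdict

docs = {
    "doc1": "hadoop stores big data",
    "doc2": "spark processes big data fast",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)       # term -> documents containing it

print(sorted(index["big"]))    # ['doc1', 'doc2']
print(sorted(index["spark"]))  # ['doc2']
```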
What are Joins in batch analytics?
Combining records from multiple datasets based on a related key or attribute.
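A reduce-side join sketch: records from both datasets are grouped by the join key, then matched per key. The datasets are hypothetical.

```python
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

# Shuffle stage: records from both datasets land in the same key bucket.
buckets = defaultdict(lambda: {"user": None, "orders": []})
for uid, name in users:
    buckets[uid]["user"] = name
for uid, item in orders:
    buckets[uid]["orders"].append(item)

# Reduce stage: match each user with that user's orders.
joined = [(b["user"], item) for b in buckets.values() for item in b["orders"]]
print(joined)  # [('alice', 'book'), ('alice', 'pen'), ('bob', 'lamp')]
```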
What is the Hortonworks Data Platform (HDP)?
An open-source platform distribution that includes various big data frameworks like Hadoop, YARN, HDFS, HBase, Hive, and Pig.
What is the Cloudera Distribution for Hadoop (CDH)?
An open-source platform distribution that includes big data tools and frameworks such as Hadoop, YARN, HDFS, HBase, Hive, and Pig.
What is Amazon Elastic MapReduce (EMR)?
A cloud-based big data cluster platform that supports various frameworks like Hadoop, Spark, Hive, and Pig.
What is Pig in the Hadoop ecosystem?
A high-level data processing platform: scripts written in its language, Pig Latin, are compiled into MapReduce jobs.
What are the two main components of Pig?
Pig Latin: The high-level data processing language; Compiler: Translates Pig Latin scripts into MapReduce jobs.
What are the two modes of operation in Pig?
Local mode and MapReduce mode.
What are the primary data types in Pig?
Tuple, Bag, and Map.
What is Apache Oozie?
A workflow scheduler system for managing Hadoop jobs, allowing the creation of workflows arranged as Directed Acyclic Graphs (DAG).
How does Apache Oozie define workflows?
Using an XML-based process defining language called Hadoop Process Definition Language.
What is Spark?
A cluster computing framework that supports in-memory processing, enabling real-time, batch, and interactive queries.
What are the main components of Spark?
Spark Core, Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.
What is Spark Core?
Provides the basic functionalities of Spark, including data abstraction through Resilient Distributed Datasets (RDDs).
What is Spark Streaming?
A Spark component for processing and analyzing streaming data in real-time.
What is Spark SQL?
A Spark component that enables interactive querying of data using SQL queries.
What is Spark MLlib?
Spark’s machine learning library that provides algorithms for clustering, classification, regression, collaborative filtering, and dimensionality reduction.
What is Spark GraphX?
A Spark component for graph processing, supporting graph algorithms like PageRank, connected components, and triangle counting.
What are the main components of a Spark Cluster?
Driver, Cluster Manager, and Executors.
What is the Driver in a Spark Cluster?
The process running the application's main() function; it creates the SparkContext object, which coordinates the execution of the Spark application.
What is the Cluster Manager in Spark?
Allocates resources across the cluster and manages the distribution of tasks to Executors.
What are Executors in Spark?
Processes allocated on worker nodes that run application code and perform tasks.
What is SparkContext?
An object that connects the Spark application to the cluster, used to create RDDs and manage resources.
What are Resilient Distributed Datasets (RDDs) in Spark?
The primary data abstraction in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel.
What are the two types of RDD Operations?
Transformations and Actions.
What are Transformations in Spark RDD operations?
Operations that create a new RDD from an existing one. They are lazy and only executed when an action is called.
What are Actions in Spark RDD operations?
Operations that trigger the execution of transformations and return a value to the driver program.
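The lazy-evaluation contract can be sketched in pure Python. The `MiniRDD` class below is hypothetical and does not use the real Spark API (where you would write `sc.parallelize(...).map(...).filter(...).collect()`); it only shows that transformations record a pipeline and an action executes it.

```python
class MiniRDD:
    def __init__(self, data, pipeline=()):
        self.data = data
        self.pipeline = pipeline          # recorded, not yet executed

    def map(self, fn):                    # transformation: lazy
        return MiniRDD(self.data, self.pipeline + (("map", fn),))

    def filter(self, pred):               # transformation: lazy
        return MiniRDD(self.data, self.pipeline + (("filter", pred),))

    def collect(self):                    # action: runs the pipeline
        items = list(self.data)
        for kind, fn in self.pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = MiniRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has executed yet; collect() triggers the whole pipeline.
print(rdd.collect())  # [20, 30, 40]
```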
What is Apache Solr?
A scalable open-source framework for searching data, built on Apache Lucene, enabling indexing and searching of various data formats.
What are the key features of Apache Solr?
Faceting, Clustering, Spatial Search, Pagination and Ranking.
What is Elasticsearch?
An open-source search and analytics engine designed for scalability, used to store, search, and analyze large volumes of data quickly.
What is a Cluster in Elasticsearch?
A group of nodes that work together to store and index data.
What is a Node in Elasticsearch?
A single server within an Elasticsearch cluster responsible for storing and indexing data.
What is an Index in Elasticsearch?
A collection of similar documents, such as customer records or product catalogs.
What is a Document in Elasticsearch?
A single unit of data stored within an index.
What are Shards in Elasticsearch?
Subdivisions of an index that allow for parallel storage and retrieval of data, enhancing scalability and performance.
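Conceptually, a document is routed to a shard by hashing its id modulo the shard count. The sketch below only illustrates the idea; Elasticsearch's actual routing formula and hash function differ.

```python
import zlib

NUM_SHARDS = 3

def shard_for(doc_id):
    # A stable hash (crc32) keeps routing deterministic across runs.
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS

docs = ["order-1", "order-2", "order-3", "order-1"]
placements = [shard_for(d) for d in docs]

# The same document id always routes to the same shard.
assert placements[0] == placements[3]
print(placements)
```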
What are the Common Features of Solr and Elasticsearch?
Both are open-source, distributed, and fault-tolerant search frameworks. Built on Apache Lucene. Use shards for scalability.
How does Solr differ from Elasticsearch in search capabilities?
Solr is more focused on text searches, while Elasticsearch excels in analytics queries, including grouping and filtering.
What is a key deployment difference between Solr and Elasticsearch?
Elasticsearch is easier to set up and deploy as it doesn’t require Apache ZooKeeper, unlike Solr.
How does Elasticsearch handle real-time search compared to Solr?
Elasticsearch is designed for near real-time search with low latency between indexing and search availability, whereas Solr was not originally built for real-time search.
How do Solr and Elasticsearch manage shards differently?
Solr supports shard splitting to divide existing shards, while Elasticsearch supports shard rebalancing to distribute shards across nodes as new nodes join the cluster.