What is HDFS?
The Hadoop Distributed File System: a distributed file system that runs on large clusters of commodity hardware and provides high-throughput access to data.
What are the key characteristics of HDFS?
Scalable storage for large files, Replication for fault tolerance, Streaming data access, File appends.
How does HDFS ensure fault tolerance?
By replicating blocks of files on multiple machines.
What are the main components of HDFS Architecture?
NameNode, Secondary NameNode, and DataNodes.
What is the role of the NameNode in HDFS?
Manages the filesystem namespace, stores metadata, and is responsible for opening and closing files.
What does the Secondary NameNode do in HDFS?
Performs checkpointing: it periodically merges the NameNode's edit log into the filesystem image to keep the metadata compact. It is not a hot standby for the NameNode.
What is a DataNode in HDFS?
Stores actual data blocks and serves read and write requests from clients.
Describe the Read Path in HDFS.
Client requests file metadata from NameNode; NameNode responds with DataNode locations; Client reads data directly from DataNodes.
Describe the Write Path in HDFS.
Client requests to create a file from NameNode; NameNode responds with an output stream; Client writes data, which is split into packets and sent to DataNodes; Data is replicated across DataNodes forming a pipeline; DataNodes acknowledge successful writes; Client finalizes the file creation by closing the output stream.
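The replication pipeline above can be sketched in plain Python. This is a conceptual illustration only; the `DataNode` class and its methods are hypothetical, not part of any Hadoop API.

```python
# Hypothetical sketch of the HDFS write pipeline: packets flow down a
# chain of DataNodes and acknowledgements flow back toward the client.

class DataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next DataNode in the pipeline
        self.blocks = []              # packets replicated locally

    def write_packet(self, packet):
        self.blocks.append(packet)           # store the replica locally
        if self.downstream:                  # forward along the pipeline
            self.downstream.write_packet(packet)
        return "ack"                         # ack travels back upstream

# Build a 3-node pipeline (replication factor 3).
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

for packet in ["pkt-0", "pkt-1"]:
    assert dn1.write_packet(packet) == "ack"

# Every DataNode in the pipeline now holds a replica of each packet.
print([dn.blocks for dn in (dn1, dn2, dn3)])
```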
What is MapReduce 2.0 - YARN?
In Hadoop 2.0, YARN separates the resource management from the processing engine, allowing Hadoop to support different processing engines like MapReduce, Tez, and Spark.
What are the main components of YARN?
Resource Manager (RM), Node Manager (NM), Application Master (AM), and Containers.
What does the Resource Manager (RM) do in YARN?
Manages the global assignment of compute resources to applications through the Scheduler and Applications Manager.
What is the role of the Application Master (AM) in YARN?
Manages the lifecycle of a specific application, including starting, monitoring, and restarting application tasks.
What are Containers in YARN?
Bundles of resources allocated by the RM, granting applications the privilege to use specific amounts of CPU and memory.
Name the three types of YARN Schedulers.
FIFO Scheduler, Fair Scheduler, and Capacity Scheduler.
What is the FIFO Scheduler in YARN?
The default scheduler that processes jobs in a first-in, first-out manner without considering job priority or size.
What is the Fair Scheduler in YARN?
Assigns resources to applications so that all applications receive, on average, an equal share of resources over time.
What is the Capacity Scheduler in YARN?
Enables running Hadoop applications in a shared, multi-tenant environment by providing capacity guarantees and resource limits.
What are the two phases of MapReduce?
Map Phase: Processes input data and generates intermediate key-value pairs; Reduce Phase: Aggregates intermediate data with the same key.
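The two phases (plus the shuffle between them) can be sketched with a pure-Python word count. No Hadoop API is used; the function names are illustrative only.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Sort-and-shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    # Aggregate all values that share a key.
    return key, sum(values)

lines = ["big data big clusters", "big data"]
intermediate = list(chain.from_iterable(map_phase(l) for l in lines))
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 3, 'clusters': 1, 'data': 2}
```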
What are Numerical Summarization Patterns in MapReduce?
Patterns used to compute statistics such as counts, maximum, minimum, and mean.
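A minimal sketch of this pattern: one group-then-aggregate pass computes count, min, max, and mean per key. The keys and values are hypothetical sample data.

```python
from collections import defaultdict

records = [("sensor-a", 3.0), ("sensor-b", 5.0), ("sensor-a", 7.0)]

groups = defaultdict(list)
for key, value in records:          # shuffle: group values by key
    groups[key].append(value)

summary = {
    key: {"count": len(vs), "min": min(vs), "max": max(vs),
          "mean": sum(vs) / len(vs)}
    for key, vs in groups.items()   # reduce: aggregate per key
}
print(summary["sensor-a"])  # {'count': 2, 'min': 3.0, 'max': 7.0, 'mean': 5.0}
```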
What is Sort and Shuffle in MapReduce?
The process of sorting and transferring intermediate data from the Map phase to the Reduce phase based on keys.
What is the purpose of Top-N in batch analytics?
To identify the top N records based on specific criteria, such as highest sales or most active users.
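A common way to implement Top-N: each mapper keeps a local top-N, and the reducer merges the local lists into a global top-N. A sketch with hypothetical data:

```python
import heapq

def local_top_n(records, n):
    # Each partition (mapper) emits only its own top-N candidates.
    return heapq.nlargest(n, records, key=lambda r: r[1])

partition_1 = [("alice", 120), ("bob", 75)]
partition_2 = [("carol", 200), ("dave", 90)]

# The reducer merges the small candidate lists into the global top-2.
candidates = local_top_n(partition_1, 2) + local_top_n(partition_2, 2)
top2 = heapq.nlargest(2, candidates, key=lambda r: r[1])
print(top2)  # [('carol', 200), ('alice', 120)]
```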
What is a Filter operation in batch analytics?
Selecting a subset of data that meets certain conditions or criteria.
What is a Distinct operation in batch analytics?
Extracting unique records from a dataset by removing duplicates.
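Filter and Distinct reduce to a predicate and a de-duplication pass; a tiny sketch over hypothetical values:

```python
values = [4, 8, 15, 8, 16, 23, 42, 42]

filtered = [v for v in values if v > 10]   # Filter: keep records matching a condition
distinct = sorted(set(values))             # Distinct: remove duplicate records

print(filtered)  # [15, 16, 23, 42, 42]
print(distinct)  # [4, 8, 15, 16, 23, 42]
```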
What is Binning in batch analytics?
Grouping continuous data into discrete intervals or 'bins' for analysis.
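A binning sketch: continuous ages are mapped to discrete labeled bins. The edges and labels are hypothetical.

```python
import bisect
from collections import Counter

edges = [18, 35, 65]                           # bin boundaries
labels = ["minor", "adult", "middle", "senior"]

def bin_of(age):
    # bisect finds which interval the value falls into.
    return labels[bisect.bisect_right(edges, age)]

ages = [12, 25, 40, 70]
histogram = Counter(bin_of(a) for a in ages)
print(dict(histogram))  # {'minor': 1, 'adult': 1, 'middle': 1, 'senior': 1}
```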
What is an Inverted Index?
A data structure used to map content, such as words or terms, to their locations within a dataset, commonly used in search engines.
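A minimal inverted-index sketch: each term maps to the set of documents containing it. Document ids and text are hypothetical.

```python
from collections import defaultdict

docs = {
    "doc1": "hadoop stores big data",
    "doc2": "spark processes big data fast",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)       # term -> documents containing it

print(sorted(index["big"]))    # ['doc1', 'doc2']
print(sorted(index["spark"]))  # ['doc2']
```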
What are Joins in batch analytics?
Combining records from multiple datasets based on a related key or attribute.
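A reduce-side join sketch: records from both datasets are grouped by the join key, then matched per key. The datasets are hypothetical.

```python
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

# Shuffle stage: records from both datasets land in the same key bucket.
buckets = defaultdict(lambda: {"user": None, "orders": []})
for uid, name in users:
    buckets[uid]["user"] = name
for uid, item in orders:
    buckets[uid]["orders"].append(item)

# Reduce stage: match each user with that user's orders.
joined = [(b["user"], item) for b in buckets.values() for item in b["orders"]]
print(joined)  # [('alice', 'book'), ('alice', 'pen'), ('bob', 'lamp')]
```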
What is the Hortonworks Data Platform (HDP)?
An open-source platform distribution that includes various big data frameworks like Hadoop, YARN, HDFS, HBase, Hive, and Pig.
What is the Cloudera Distribution for Hadoop (CDH)?
An open-source platform distribution that includes big data tools and frameworks such as Hadoop, YARN, HDFS, HBase, Hive, and Pig.
What is Amazon Elastic MapReduce (EMR)?
A cloud-based big data cluster platform that supports various frameworks like Hadoop, Spark, Hive, and Pig.
What is Pig in the Hadoop ecosystem?
A high-level data processing platform: scripts written in its language, Pig Latin, are compiled into MapReduce jobs.
What are the two main components of Pig?
Pig Latin: The high-level data processing language; Compiler: Translates Pig Latin scripts into MapReduce jobs.
What are the two modes of operation in Pig?
Local mode and MapReduce mode.
What are the primary data types in Pig?
Tuple, Bag, and Map.
What is Apache Oozie?
A workflow scheduler system for managing Hadoop jobs, allowing the creation of workflows arranged as Directed Acyclic Graphs (DAG).
How does Apache Oozie define workflows?
Using an XML-based process defining language called Hadoop Process Definition Language.
What is Spark?
A cluster computing framework that supports in-memory processing, enabling real-time, batch, and interactive queries.
What are the main components of Spark?
Spark Core, Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.
What is Spark Core?
Provides the basic functionalities of Spark, including data abstraction through Resilient Distributed Datasets (RDDs).
What is Spark Streaming?
A Spark component for processing and analyzing streaming data in real-time.
What is Spark SQL?
A Spark component that enables interactive querying of data using SQL queries.
What is Spark MLlib?
Spark’s machine learning library that provides algorithms for clustering, classification, regression, collaborative filtering, and dimensionality reduction.
What is Spark GraphX?
A Spark component for graph processing, supporting graph algorithms like PageRank, connected components, and triangle counting.
What are the main components of a Spark Cluster?
Driver, Cluster Manager, and Executors.
What is the Driver in a Spark Cluster?
The process running the application's main() function; it creates the SparkContext object, which coordinates the execution of the Spark application.
What is the Cluster Manager in Spark?
Allocates resources across the cluster and manages the distribution of tasks to Executors.
What are Executors in Spark?
Processes allocated on worker nodes that run application code and perform tasks.
What is SparkContext?
An object that connects the Spark application to the cluster, used to create RDDs and manage resources.
What are Resilient Distributed Datasets (RDDs) in Spark?
The primary data abstraction in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel.
What are the two types of RDD Operations?
Transformations and Actions.
What are Transformations in Spark RDD operations?
Operations that create a new RDD from an existing one. They are lazy and only executed when an action is called.
What are Actions in Spark RDD operations?
Operations that trigger the execution of transformations and return a value to the driver program.
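The lazy-evaluation contract can be sketched in pure Python. The `MiniRDD` class below is hypothetical and does not use the real Spark API (where you would write `sc.parallelize(...).map(...).filter(...).collect()`); it only shows that transformations record a pipeline and an action executes it.

```python
class MiniRDD:
    def __init__(self, data, pipeline=()):
        self.data = data
        self.pipeline = pipeline          # recorded, not yet executed

    def map(self, fn):                    # transformation: lazy
        return MiniRDD(self.data, self.pipeline + (("map", fn),))

    def filter(self, pred):               # transformation: lazy
        return MiniRDD(self.data, self.pipeline + (("filter", pred),))

    def collect(self):                    # action: runs the pipeline
        items = list(self.data)
        for kind, fn in self.pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = MiniRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has executed yet; collect() triggers the whole pipeline.
print(rdd.collect())  # [20, 30, 40]
```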
What is Apache Solr?
A scalable open-source framework for searching data, built on Apache Lucene, enabling indexing and searching of various data formats.
What are the key features of Apache Solr?
Faceting, Clustering, Spatial Search, Pagination and Ranking.
What is Elasticsearch?
An open-source search and analytics engine designed for scalability, used to store, search, and analyze large volumes of data quickly.
What is a Cluster in Elasticsearch?
A group of nodes that work together to store and index data.
What is a Node in Elasticsearch?
A single server within an Elasticsearch cluster responsible for storing and indexing data.
What is an Index in Elasticsearch?
A collection of similar documents, such as customer records or product catalogs.
What is a Document in Elasticsearch?
A single unit of data stored within an index.
What are Shards in Elasticsearch?
Subdivisions of an index that allow for parallel storage and retrieval of data, enhancing scalability and performance.
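Conceptually, a document is routed to a shard by hashing its id modulo the shard count. The sketch below only illustrates the idea; Elasticsearch's actual routing formula and hash function differ.

```python
import zlib

NUM_SHARDS = 3

def shard_for(doc_id):
    # A stable hash (crc32) keeps routing deterministic across runs.
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS

docs = ["order-1", "order-2", "order-3", "order-1"]
placements = [shard_for(d) for d in docs]

# The same document id always routes to the same shard.
assert placements[0] == placements[3]
print(placements)
```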
What are the Common Features of Solr and Elasticsearch?
Both are open-source, distributed, and fault-tolerant search frameworks. Built on Apache Lucene. Use shards for scalability.
How does Solr differ from Elasticsearch in search capabilities?
Solr is more focused on text searches, while Elasticsearch excels in analytics queries, including grouping and filtering.
What is a key deployment difference between Solr and Elasticsearch?
Elasticsearch is easier to set up and deploy as it doesn’t require Apache ZooKeeper, unlike Solr.
How does Elasticsearch handle real-time search compared to Solr?
Elasticsearch is designed for near real-time search with low latency between indexing and search availability, whereas Solr was not originally built for real-time search.
How do Solr and Elasticsearch manage shards differently?
Solr supports shard splitting to divide existing shards, while Elasticsearch supports shard rebalancing to distribute shards across nodes as new nodes join the cluster.