Chapter 10

65 Terms

1

What is HDFS?

Hadoop Distributed File System, a distributed file system that runs on large clusters and provides high-throughput access to data.

2

What are the key characteristics of HDFS?

Scalable storage for large files, Replication for fault tolerance, Streaming data access, File appends.

3

How does HDFS ensure fault tolerance?

By replicating blocks of files on multiple machines.
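
A minimal sketch of why block replication gives fault tolerance. The round-robin placement, the function names, and the block size are assumptions made for illustration; real HDFS placement is rack-aware and is not reproduced here.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into blocks and put each block on `replication` distinct
    DataNodes. Round-robin placement is a hypothetical stand-in policy."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed):
    """The file stays readable if every block keeps at least one live replica."""
    return all(
        any(dn not in failed for dn in replicas) for replicas in placement.values()
    )


placement = place_blocks(file_size=300, block_size=128,
                         datanodes=["dn1", "dn2", "dn3", "dn4"])
print(readable_after_failure(placement, failed={"dn1", "dn2"}))
```

With three replicas per block, losing two machines still leaves a copy of every block in this toy layout.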

4

What are the main components of HDFS Architecture?

NameNode, Secondary NameNode, and DataNodes.

5

What is the role of the NameNode in HDFS?

Manages the filesystem namespace, stores metadata, and is responsible for opening and closing files.

6

What does the Secondary NameNode do in HDFS?

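Performs periodic checkpointing: it merges the NameNode's edit log into the filesystem image (fsimage) so that the metadata stays compact. It is a checkpointing helper, not a standby NameNode.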

7

What is a DataNode in HDFS?

Stores actual data blocks and serves read and write requests from clients.

8

Describe the Read Path in HDFS.

Client requests file metadata from NameNode; NameNode responds with DataNode locations; Client reads data directly from DataNodes.
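
The read path can be sketched with two toy classes. `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names for illustration only and are not the HDFS client API; the point is that metadata comes from the NameNode while bytes come straight from DataNodes.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, block_map):
        self.block_map = block_map  # path -> [(block_id, [datanode names])]

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds the actual block bytes."""

    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    data = b""
    # Step 1: ask the NameNode for block locations (metadata only).
    for block_id, locations in namenode.get_block_locations(path):
        # Step 2: read each block directly from a DataNode holding a replica.
        data += datanodes[locations[0]].read_block(block_id)
    return data


namenode = ToyNameNode({"/logs/a.txt": [("blk_1", ["dn1", "dn2"]),
                                        ("blk_2", ["dn2", "dn3"])]})
datanodes = {
    "dn1": ToyDataNode({"blk_1": b"hello "}),
    "dn2": ToyDataNode({"blk_1": b"hello ", "blk_2": b"world"}),
    "dn3": ToyDataNode({"blk_2": b"world"}),
}
print(read_file("/logs/a.txt", namenode, datanodes))  # b'hello world'
```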

9

Describe the Write Path in HDFS.

Client asks the NameNode to create a file; NameNode responds and the client obtains an output stream; Client writes data, which is split into packets and sent to DataNodes; Data is replicated across DataNodes, which form a pipeline; DataNodes acknowledge successful writes; Client finalizes the file creation by closing the output stream.
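
The packet and pipeline steps of the write path can be illustrated as follows. This is a simplified sketch: `write_through_pipeline` and the tiny packet size are assumptions, and real acknowledgements travel back through the pipeline asynchronously rather than being counted in a loop.

```python
def write_through_pipeline(data, pipeline, packet_size=4):
    """Split `data` into packets and push each packet through every DataNode
    in the pipeline. A packet counts as acknowledged once all replicas have it."""
    stores = {dn: bytearray() for dn in pipeline}
    acked_packets = 0
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        for dn in pipeline:  # each DataNode stores the packet, then forwards it
            stores[dn].extend(packet)
        acked_packets += 1
    return stores, acked_packets


stores, acked = write_through_pipeline(b"hello hdfs", ["dn1", "dn2", "dn3"])
print(acked)  # 3 packets for 10 bytes at 4 bytes per packet
```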

10

What is MapReduce 2.0 - YARN?

In Hadoop 2.0, YARN separates the resource management from the processing engine, allowing Hadoop to support different processing engines like MapReduce, Tez, and Spark.

11

What are the main components of YARN?

Resource Manager (RM), Node Manager (NM), Application Master (AM), and Containers.

12

What does the Resource Manager (RM) do in YARN?

Manages the global assignment of compute resources to applications through the Scheduler and Applications Manager.

13

What is the role of the Application Master (AM) in YARN?

Manages the lifecycle of a specific application, including starting, monitoring, and restarting application tasks.

14

What are Containers in YARN?

Bundles of resources allocated by the RM, granting applications the privilege to use specific amounts of CPU and memory.

15

Name the three types of YARN Schedulers.

FIFO Scheduler, Fair Scheduler, and Capacity Scheduler.

16

What is the FIFO Scheduler in YARN?

Processes jobs in first-in, first-out order without considering job priority or size. FIFO was the default in early Hadoop releases; recent Apache Hadoop YARN releases default to the Capacity Scheduler.

17

What is the Fair Scheduler in YARN?

Assigns resources to applications so that all applications receive, on average, an equal share of resources over time.
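
One way to make "equal share" concrete is max-min fairness, where an application that needs less than its share frees the remainder for the others. The sketch below is an illustration of that idea under assumed names (`fair_shares`, a single resource dimension); it is not the Fair Scheduler's code.

```python
def fair_shares(capacity, demands):
    """Max-min fair split of `capacity` among apps with the given demands."""
    alloc = {app: 0.0 for app in demands}
    remaining = float(capacity)
    active = {app for app, demand in demands.items() if demand > 0}
    while active and remaining > 1e-9:
        share = remaining / len(active)
        # Apps whose outstanding demand fits inside an equal share are satisfied.
        satisfied = {a for a in active if demands[a] - alloc[a] <= share}
        if not satisfied:
            for a in active:
                alloc[a] += share
            remaining = 0.0
        else:
            for a in satisfied:
                need = demands[a] - alloc[a]
                alloc[a] += need
                remaining -= need
            active -= satisfied
    return alloc


print(fair_shares(12, {"a": 2, "b": 8, "c": 8}))  # {'a': 2.0, 'b': 5.0, 'c': 5.0}
```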

18

What is the Capacity Scheduler in YARN?

Enables running Hadoop applications in a shared, multi-tenant environment by providing capacity guarantees and resource limits.

19

What are the two phases of MapReduce?

Map Phase: Processes input data and generates intermediate key-value pairs; Reduce Phase: Aggregates intermediate data with the same key.
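
The two phases can be sketched in plain Python with the classic word count. This runs in a single process and the function names are illustrative assumptions; a real job would distribute the map and reduce calls across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def group_by_key(pairs):
    """Stand-in for the framework step that groups values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values that share the same key."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in group_by_key(intermediate).items())


print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```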

20

What are Numerical Summarization Patterns in MapReduce?

Patterns used to compute statistics such as counts, maximum, minimum, and mean.
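
A hedged sketch of a numerical summarization: each record is turned into a partial (min, max, sum, count) tuple per key, and the partials are merged per key. The tuple layout and the name `summarize` are assumptions chosen so that the mean can be computed from mergeable sums and counts.

```python
from collections import defaultdict


def summarize(records):
    """records: iterable of (key, numeric value) pairs."""
    partials = defaultdict(list)
    for key, value in records:
        # Map side: one partial summary per record.
        partials[key].append((value, value, value, 1))
    summary = {}
    for key, parts in partials.items():
        # Reduce side: merge partial summaries that share a key.
        total = sum(p[2] for p in parts)
        count = sum(p[3] for p in parts)
        summary[key] = {
            "min": min(p[0] for p in parts),
            "max": max(p[1] for p in parts),
            "mean": total / count,
            "count": count,
        }
    return summary


print(summarize([("x", 1), ("x", 3), ("y", 10)]))
```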

21

What is Sort and Shuffle in MapReduce?

The process of sorting and transferring intermediate data from the Map phase to the Reduce phase based on keys.
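
Sort and shuffle can be illustrated by hashing each key to a reducer partition and sorting every partition by key. `zlib.crc32` is used here only as a deterministic stand-in hash; the partitioner and function names are assumptions, not Hadoop's implementation.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def sort_and_shuffle(map_outputs, num_reducers):
    """map_outputs: one list of (key, value) pairs per mapper."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[partition(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # each reducer sees its keys in order
    return partitions


parts = sort_and_shuffle([[("b", 1), ("a", 1)], [("a", 2), ("c", 1)]], num_reducers=2)
print(parts)
```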

22

What is the purpose of Top-N in batch analytics?

To identify the top N records based on specific criteria, such as highest sales or most active users.
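
A Top-N job can be sketched as a local top N per input split followed by a global top N over the surviving candidates. The record shape `(name, score)` and the function names are assumptions for illustration.

```python
import heapq


def local_top_n(records, n):
    """Mapper side: keep only the n highest-scoring records of one split."""
    return heapq.nlargest(n, records, key=lambda r: r[1])


def top_n(splits, n):
    """Reducer side: merge the local candidates and keep the global top n."""
    candidates = [r for split in splits for r in local_top_n(split, n)]
    return heapq.nlargest(n, candidates, key=lambda r: r[1])


splits = [[("a", 5), ("b", 9)], [("c", 7), ("d", 1)]]
print(top_n(splits, 2))  # [('b', 9), ('c', 7)]
```

Trimming each split first keeps the data sent to the final step small, since at most n records per split can appear in the answer.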

23

What is a Filter operation in batch analytics?

Selecting a subset of data that meets certain conditions or criteria.
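
Filtering needs no aggregation, so it can be sketched as a map-only job in which the mapper emits a record only when it satisfies the condition. The names below are illustrative assumptions.

```python
def filter_mapper(record, predicate):
    """Emit the record unchanged if it meets the condition, otherwise nothing."""
    if predicate(record):
        yield record


def run_filter(records, predicate):
    return [out for record in records for out in filter_mapper(record, predicate)]


print(run_filter([1, 5, 10, 15], lambda x: x >= 10))  # [10, 15]
```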

24

What is a Distinct operation in batch analytics?

Extracting unique records from a dataset by removing duplicates.
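
One common way to express Distinct in map/reduce terms is to use the whole record as the key: grouping by key collapses duplicates, and the reducer emits each key once. A minimal single-process sketch under that assumption:

```python
def distinct(records):
    grouped = {}
    for record in records:
        # Map: emit (record, None). Grouping by key merges duplicates.
        grouped.setdefault(record, []).append(None)
    # Reduce: emit each key exactly once, ignoring the grouped values.
    return sorted(grouped)


print(distinct(["b", "a", "b"]))  # ['a', 'b']
```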

25

What is Binning in batch analytics?

Grouping continuous data into discrete intervals or 'bins' for analysis.
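
Binning can be sketched with `bisect`, which finds the interval a value falls into given sorted bin edges. The edges and labels here are hypothetical examples.

```python
from bisect import bisect_right
from collections import Counter


def bin_label(value, edges):
    """Return a label for the half-open interval [low, high) containing value."""
    i = bisect_right(edges, value)
    if i == 0:
        return f"< {edges[0]}"
    if i == len(edges):
        return f">= {edges[-1]}"
    return f"[{edges[i - 1]}, {edges[i]})"


def bin_counts(values, edges):
    return Counter(bin_label(v, edges) for v in values)


print(bin_counts([5, 18, 30, 65, 80], edges=[18, 65]))
```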

26

What is an Inverted Index?

A data structure used to map content, such as words or terms, to their locations within a dataset, commonly used in search engines.
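
A minimal inverted index maps each term to the sorted list of documents that contain it. Whitespace tokenization and the document IDs below are simplifying assumptions; search engines apply far richer text analysis.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: mapping of document ID to text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


index = build_inverted_index({1: "big data tools", 2: "Data search", 3: "search tools"})
print(index["data"])  # [1, 2]
```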

27

What are Joins in batch analytics?

Combining records from multiple datasets based on a related key or attribute.
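
A reduce-side inner join is one way to combine two datasets: records are tagged with their source, grouped by the join key, and paired within each group. The sketch below assumes records of the form `(key, value)` and uses made-up sample data.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    groups = defaultdict(lambda: {"L": [], "R": []})
    # Map: tag every record with the dataset it came from.
    for key, value in left:
        groups[key]["L"].append(value)
    for key, value in right:
        groups[key]["R"].append(value)
    # Reduce: pair left and right records that share a key (inner join).
    return [
        (key, l, r)
        for key in sorted(groups)
        for l in groups[key]["L"]
        for r in groups[key]["R"]
    ]


users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
print(reduce_side_join(users, orders))  # [(1, 'alice', 'book'), (1, 'alice', 'pen')]
```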

28

What is the Hortonworks Data Platform (HDP)?

An open-source platform distribution that includes various big data frameworks like Hadoop, YARN, HDFS, HBase, Hive, and Pig.

29

What is the Cloudera Distribution for Hadoop (CDH)?

An open-source platform distribution that includes big data tools and frameworks such as Hadoop, YARN, HDFS, HBase, Hive, and Pig.

30

What is Amazon Elastic MapReduce (EMR)?

A cloud-based big data cluster platform that supports various frameworks like Hadoop, Spark, Hive, and Pig.

31

What is Pig in the Hadoop ecosystem?

A high-level data processing platform whose scripts, written in the Pig Latin language, are compiled into MapReduce programs.

32

What are the two main components of Pig?

Pig Latin: The high-level data processing language; Compiler: Translates Pig Latin scripts into MapReduce jobs.

33

What are the two modes of operation in Pig?

Local mode and MapReduce mode.

34

What are the primary data types in Pig?

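Tuple, Bag, and Map. These are Pig's complex types; scalar types such as int, long, float, double, chararray, and bytearray also exist.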

35

What is Apache Oozie?

A workflow scheduler system for managing Hadoop jobs, allowing the creation of workflows arranged as Directed Acyclic Graphs (DAG).
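
The DAG idea can be illustrated with Python's standard `graphlib` module (Python 3.9+): each action lists the actions it depends on, and a topological order gives a valid execution sequence. The action names are hypothetical, and this is not Oozie's workflow format.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: action -> set of actions that must finish first.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "index": {"clean"},
    "report": {"aggregate", "index"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)  # one valid order; "aggregate" and "index" may appear either way round
```

Because the graph is acyclic, such an order always exists; a cycle would raise `graphlib.CycleError`.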

36

How does Apache Oozie define workflows?

Using an XML-based process definition language called hPDL (Hadoop Process Definition Language).

37

What is Spark?

A cluster computing framework that supports in-memory processing, enabling real-time (streaming), batch, and interactive workloads.

38

What are the main components of Spark?

Spark Core, Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.

39

What is Spark Core?

Provides the basic functionalities of Spark, including data abstraction through Resilient Distributed Datasets (RDDs).

40

What is Spark Streaming?

A Spark component for processing and analyzing streaming data in real-time.

41

What is Spark SQL?

A Spark component that enables interactive querying of data using SQL queries.

42

What is Spark MLlib?

Spark’s machine learning library that provides algorithms for clustering, classification, regression, collaborative filtering, and dimensionality reduction.

43

What is Spark GraphX?

A Spark component for graph processing, supporting graph algorithms like PageRank, connected components, and triangle counting.

44

What are the main components of a Spark Cluster?

Driver, Cluster Manager, and Executors.

45

What is the Driver in a Spark Cluster?

The process that runs the application's main program and creates the SparkContext object, which coordinates the execution of the Spark application on the cluster.

46

What is the Cluster Manager in Spark?

Allocates resources across the cluster to applications (for example Spark standalone, YARN, or Mesos). Once Executors are acquired, the driver sends tasks to them.

47

What are Executors in Spark?

Processes allocated on worker nodes that run application code and perform tasks.

48

What is SparkContext?

An object that connects the Spark application to the cluster, used to create RDDs and manage resources.

49

What are Resilient Distributed Datasets (RDDs) in Spark?

The primary data abstraction in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel.

50

What are the two types of RDD Operations?

Transformations and Actions.

51

What are Transformations in Spark RDD operations?

Operations that create a new RDD from an existing one. They are lazy and only executed when an action is called.

52

What are Actions in Spark RDD operations?

Operations that trigger the execution of transformations and return a value to the driver program.
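
The lazy-transformation and eager-action split can be shown with a toy class. `ToyRDD` is a hypothetical single-process stand-in written for this illustration; it is not the Spark API, although its method names mirror common RDD operations.

```python
class ToyRDD:
    def __init__(self, source, ops=()):
        self._source = source
        self._ops = ops  # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work.
    def map(self, fn):
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._ops + (("filter", fn),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: run the recorded transformations and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def traced_square(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * x

rdd = ToyRDD([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
print(calls)          # [] because nothing has executed yet
print(rdd.collect())  # [4, 16]
```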

53

What is Apache Solr?

A scalable open-source framework for searching data, built on Apache Lucene, enabling indexing and searching of various data formats.

54

What are the key features of Apache Solr?

Faceting, Clustering, Spatial Search, Pagination and Ranking.

55

What is Elasticsearch?

An open-source search and analytics engine designed for scalability, used to store, search, and analyze large volumes of data quickly.

56

What is a Cluster in Elasticsearch?

A group of nodes that work together to store and index data.

57

What is a Node in Elasticsearch?

A single server within an Elasticsearch cluster responsible for storing and indexing data.

58

What is an Index in Elasticsearch?

A collection of similar documents, such as customer records or product catalogs.

59

What is a Document in Elasticsearch?

A single unit of data stored within an index.

60

What are Shards in Elasticsearch?

Subdivisions of an index that allow for parallel storage and retrieval of data, enhancing scalability and performance.
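
The general idea behind sharding is that a hash of a routing value (by default the document ID), taken modulo the number of primary shards, decides where a document lives. In the sketch below `zlib.crc32` is a stand-in hash chosen for determinism and `shard_for` is a hypothetical name; Elasticsearch's actual hash function and routing details differ.

```python
import zlib
from collections import Counter


def shard_for(doc_id, num_primary_shards):
    """Map a document ID to a shard number in [0, num_primary_shards)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"customer-{i}" for i in range(1000)]
distribution = Counter(shard_for(doc_id, 5) for doc_id in doc_ids)
print(sorted(distribution.items()))  # documents per shard
```

Since the shard number depends on the shard count, changing the number of primary shards would send existing IDs to different shards under this scheme.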

61

What are the Common Features of Solr and Elasticsearch?

Both are open-source, distributed, fault-tolerant search frameworks built on Apache Lucene, and both use shards for scalability.

62

How does Solr differ from Elasticsearch in search capabilities?

Solr is more focused on text searches, while Elasticsearch excels in analytics queries, including grouping and filtering.

63

What is a key deployment difference between Solr and Elasticsearch?

Elasticsearch is easier to set up and deploy because it doesn’t require Apache ZooKeeper, whereas Solr needs ZooKeeper when run in distributed (SolrCloud) mode.

64

How does Elasticsearch handle real-time search compared to Solr?

Elasticsearch is designed for near real-time search with low latency between indexing and search availability, whereas Solr was not originally built for real-time search.

65

How do Solr and Elasticsearch manage shards differently?

Solr supports shard splitting to divide existing shards, while Elasticsearch supports shard rebalancing to distribute shards across nodes as new nodes join the cluster.
