Big Data
data whose scale, distribution, diversity, and timeliness
require new data architectures
new analytical methodologies
integrating multiple skills
4 V’s of Big Data
Volume: large amounts of data
Variety: diverse sources
Structured (table), semi-structured (XML), unstructured
Velocity: changing rapidly
Value: has some positive value (utility)
knowledge
information and understanding gained from experience or education
cloud computing, mobile computing, knowledge management, ontology
intelligence
mental capability involved in the ability to acquire and apply knowledge and skills
Agents, etc.
Data Processing Capabilities
skill, rule, knowledge triangle (knowledge bottom up)
Embedded Intelligence
every device will have processing power and storage
Ex:
Smart phones generate half of data traffic
Possible Issue: communication bottleneck
Being able to track everything anywhere anytime
Possible Issue: processing bottleneck
Big Data tasks
problem solving
learning
decision making
planning
Vehicle Commute Time Prediction
big data structures
structured
semi-structured
quasi structured
unstructured
structured
data with a defined data type and format
semi-structured
textual data files with discernable pattern enabling parsing
XML
Quasi Structured
textual data with erratic data formats, can be formatted with effort
Unstructured
no coherent structure and is typically stored as different types of files
Business Risk
customer churn, regulatory requirements
Data product
any application that combines data with statistical algorithms for inference or prediction (i.e., systems that learn from data)
data products redefined
derive value from data and generate more data by influencing human behaviour or making inferences or predictions upon new data
why can't you use a traditional file system
it is not fault tolerant (keeps only one copy of the data), uses a small block size, and suffers from large seek times
hadoop
open-source implementation of two Google papers (GFS and MapReduce); created by Doug Cutting, the creator of Apache Lucene
HDFS properties
fault tolerance
reliable, scalable platform for storage/analysis
fault tolerance
commonly achieved through replication: redundant copies of the data are kept by the system to avoid data loss
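As an illustration (not from the notes), a client can inspect or change a file's replication factor through the Hadoop FileSystem API; the file path below is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");  // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Ask HDFS to keep three redundant copies of this file's blocks.
        fs.setReplication(file, (short) 3);
    }
}
```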
why hadoop better than databases
Disk Seek Time: Faster by moving code to data, minimizing latency from disk seek times.
Flexible Schema: Less rigid, supports unstructured and semi-structured data with schema-on-read.
Handles Large Datasets: Optimized for big data processing with efficient streaming reads/writes.
No Costly Data Loading: Avoids extensive data loading phases typical of databases.
Flat Data Structure: Works well with denormalized data, avoiding latency from normalized (nonlocal) data
Disk Seek time
Databases move data to code, relying on high disk seek times.
High seek times add latency, slowing down large-scale analysis.
Hadoop moves code to data, reducing the need for disk seeking and latency
normal forms
Databases normalize data to maintain integrity and reduce redundancy.
Normalization causes data to be nonlocal, increasing latency for data retrieval.
Hadoop optimizes for contiguous, local data for high-speed streaming reads/writes.
Flat, denormalized data structures are more efficient for Hadoop processing
rigid schema
Databases require a predefined schema (schema-on-write), limiting flexibility.
Structured data must follow rigid formats (e.g., tables with set columns).
Hadoop uses schema-on-read, working well with unstructured/semi-structured data.
No strict schema requirements in Hadoop, allowing for faster data loading.
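A minimal sketch of the schema-on-read idea (assuming a comma-delimited record with name, age, and city fields): the structure is imposed only when the raw line is read, not when it is written:

```java
public class SchemaOnRead {
    public static void main(String[] args) {
        String rawLine = "Ada Lovelace,36,London";  // raw text, no schema enforced on write
        String[] fields = rawLine.split(",");       // the "schema" is applied at read time
        String name = fields[0];
        int age = Integer.parseInt(fields[1]);
        String city = fields[2];
        System.out.println(name + " (" + age + ") lives in " + city);
    }
}
```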
Data Locality and HDFS
Why isn’t a database with lots of disks enough for large-scale analysis?
Databases require high disk seek times, which adds latency. Hadoop instead moves code to data, reducing latency
How does Hadoop handle data structure differently from traditional databases?
Hadoop uses schema-on-read, which works well with unstructured or semi-structured data, unlike databases that need rigid schemas
What challenges does data normalization pose for Hadoop?
Normalization creates nonlocal data, increasing latency for Hadoop's high-speed streaming reads, which work best with local, contiguous data.
What is data locality in Hadoop?
Data locality means Hadoop co-locates data with compute nodes, so data access is faster because it’s local
How does Hadoop conserve network bandwidth?
By using data locality and modeling network topology, Hadoop reduces the need to move data, conserving bandwidth, which is precious because it can easily be saturated by data transfers, impacting performance
distributed system requirements:
fault tolerance
recoverability
consistency
scalability
how hadoop addresses requirements
data distributed immediately when added to cluster and stored on multiple nodes
nodes prefer to process data stored locally, minimizing traffic across the network
data stored in fixed size blocks and each is duplicated multiple times across system to provide redundancy
computation
organized into jobs; jobs are broken into tasks, where each node performs a task on a single block
jobs are written at a high level, without concern for network programming, timing, or low-level infrastructure, allowing developers to focus on the data rather than the details
How does Hadoop handle large computations?
Hadoop breaks computations into jobs, which are further broken into tasks. Each node performs a task on a single block of data
Why don’t Hadoop jobs require concern for network programming?
Jobs are written at a high level, allowing developers to focus on data and computation rather than network or low-level distributed programming
How does Hadoop minimize network traffic between nodes?
Tasks are independent with no inter-process dependencies, reducing the need for nodes to communicate during processing and avoiding potential deadlocks
How is Hadoop fault-tolerant?
Hadoop uses task redundancy, so if a node or task fails, another node can complete the task, ensuring the final computation is correct
How does Hadoop utilize worker nodes
The master program allocates work to nodes, allowing parallel processing across many nodes on different portions of the dataset
What makes Hadoop different from traditional data warehouses?
Hadoop can run on economical, commercial off-the-shelf hardware, unlike traditional data warehouses that often require specialized hardware
What are the two primary components of Hadoop's architecture?
Hadoop's architecture is composed of HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator).
What is the role of HDFS in Hadoop
HDFS manages data storage across disks in the cluster, providing a distributed file system for big data
What does YARN do in the Hadoop architecture?
YARN acts as a cluster resource manager, allocating computational resources (like CPU and memory) to applications for distributed computation.
How do HDFS and YARN work together to optimize network traffic?
HDFS and YARN minimize network traffic by ensuring that data is local to the required computation, reducing the need to transfer data across the network
How does Hadoop ensure fault tolerance and consistency?
Hadoop duplicates both data and tasks across nodes, enabling fault tolerance, recoverability, and consistency within the cluster
What feature of Hadoop’s architecture allows for scalability?
The cluster is centrally managed to provide scalability and to abstract low-level details of clustering programming
How do HDFS and YARN function as more than just a platform?
Together, HDFS and YARN provide an operating system for big data applications, supporting distributed storage and computation
Is Hadoop a type of hardware?
No, Hadoop is not hardware; it is a distributed file system (HDFS) that runs on a cluster of machines
What is YARN in the Hadoop cluster?
YARN is a cluster resource manager that coordinates background services on a group of machines, typically using commodity hardware.
What is a Hadoop cluster
A Hadoop cluster is a set of machines running HDFS and YARN, where each machine is known as a node
How do Hadoop clusters scale?
Hadoop clusters scale horizontally, meaning that as more nodes are added, the cluster’s capacity and performance increase linearly
What are daemon processes in Hadoop
Daemon processes are background services in YARN and HDFS that do not require user input and run continuously
How do Hadoop processes operate within a cluster
Hadoop processes run as services on each cluster node, accepting input and delivering output through the network.
How are Hadoop daemon processes managed in terms of resources
Each process runs in its own Java Virtual Machine (JVM) with dedicated system resource allocation, managed independently by the operating system
What role do master nodes play in a Hadoop cluster?
Master nodes coordinate services for Hadoop workers and serve as entry points for user access. They are essential for maintaining cluster coordination
What is the primary function of worker nodes in a Hadoop cluster?
Worker nodes perform tasks assigned by master nodes, such as storing or retrieving data and executing distributed computations in parallel
Why are master processes usually run on their own nodes?
Running master processes on separate nodes prevents them from competing for resources with worker nodes, avoiding potential bottlenecks
Can master daemons run on a single node in a Hadoop cluster?
Yes, in smaller clusters, master daemons can run on a single node if necessary
What is the role of the NameNode in Hadoop?
The NameNode (Master) stores the directory tree of the file system, file metadata, and the locations of each file block in the cluster. It directs clients to the appropriate DataNodes for data access.
What does the Secondary NameNode do?
The Secondary NameNode performs housekeeping tasks and check-pointing for the NameNode but is not a backup NameNode
What is the function of a DataNode in Hadoop?
A DataNode (Worker) stores HDFS blocks on local disks and reports health and data status back to the NameNode
Describe the workflow between a client, NameNode, and DataNode in Hadoop.
The client requests data from the NameNode, which provides a list of DataNodes storing the data. The client then retrieves data directly from those DataNodes, with the NameNode directing traffic like a “traffic cop.”
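A sketch of that client-side workflow using Hadoop's FileSystem API (the file path is supplied as a command-line argument); open() consults the NameNode for block locations, and the bytes then stream directly from the DataNodes:

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // client entry point to HDFS
        InputStream in = null;
        try {
            // open() asks the NameNode which DataNodes hold the blocks;
            // the data itself is read from those DataNodes, not the NameNode.
            in = fs.open(new Path(args[0]));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```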
What is the role of the ResourceManager in YARN?
The ResourceManager (Master) allocates and monitors cluster resources like memory and CPU, managing job scheduling for applications
What is the purpose of the ApplicationMaster in YARN?
The ApplicationMaster coordinates the execution of a specific application on the cluster as scheduled by the ResourceManager.
What does the NodeManager do in YARN?
The NodeManager (Worker) runs tasks on individual nodes, manages resource allocation, and reports the health and status of tasks
Explain the YARN workflow for executing a job.
A client requests resources from the ResourceManager, which assigns an ApplicationMaster to handle the job. The ApplicationMaster coordinates the job, while the ResourceManager tracks status, and NodeManagers execute tasks on the nodes
MapReduce
programming model for data processing
MapReduce functions
map function, shuffle and sort, reduce
map function
this step takes a large dataset and transforms it into key-value pairs
shuffle and sort
once the map phase is complete, the framework groups the output by key and organizes it for further processing by reduce
reduce function
processes the sorted key-value pairs, typically aggregating or summarizing the data to produce the final result
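As a toy, single-machine illustration of the three phases (not Hadoop code; the "year,temperature" record format is an assumption):

```java
import java.util.*;

public class MapReduceToy {
    public static void main(String[] args) {
        List<String> records = List.of("1950,0", "1950,22", "1949,111", "1949,78");

        // Map + shuffle/sort (combined here for brevity): each record is parsed
        // into a (year, temperature) pair and grouped under its year key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            String[] parts = record.split(",");
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1]));
        }

        // Reduce phase: aggregate each group, here by taking the maximum.
        grouped.forEach((year, temps) ->
            System.out.println(year + "\t" + Collections.max(temps)));
    }
}
```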
What Unix tool is commonly used for processing line-oriented data in data analysis?
awk is the classic tool for processing line-oriented data
What is the purpose of the Unix script shown for analyzing temperature data?
The script loops through compressed year files, extracts the air temperature and quality code fields with awk, checks data validity, and finds the maximum temperature for each year
How does the awk script validate air temperature data?
It ensures the temperature is not a missing value (9999) and checks the quality code to confirm the reading is not erroneous
What are the limitations of using Unix tools like awk for large-scale data analysis?
Limitations include difficulty in dividing equal-size pieces, the need for complex result combination, and being limited to the capacity of a single machine
Why is it challenging to divide data into equal-size pieces in Unix-based data analysis?
File sizes can vary by year, so some processes finish faster, with the longest file determining the total runtime. Dividing into fixed-size chunks is one solution
What are the challenges in combining results from independent processes in Unix-based analysis?
Results may need further processing, such as sorting and concatenating, which can be delicate when using fixed-size chunks
What is the limitation of single-machine processing in Unix-based data analysis?
Some datasets exceed the capacity of one machine, necessitating multiple machines and coordination for distributed processing
How does MapReduce differ from Unix tools in handling large datasets?
MapReduce splits tasks into map and reduce phases, allowing parallel processing across multiple machines, unlike Unix tools which are limited to a single machine
What is the purpose of the map function in Hadoop’s MapReduce?
The map function extracts relevant fields (e.g., year and temperature) from raw data and filters out bad records, preparing data for the reduce phase
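A sketch of such a mapper using the Hadoop MapReduce API, consistent with the description above; the character offsets for the year, temperature, and quality code are assumptions about the record layout, not taken from the notes:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;  // sentinel for a missing reading

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);                            // assumed field position
        int airTemperature = Integer.parseInt(line.substring(87, 92));   // assumed field position
        String quality = line.substring(92, 93);                         // assumed field position

        // Filter out missing or suspect readings before emitting (year, temperature).
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
```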
How does the reduce function work in Hadoop’s MapReduce?
The reduce function aggregates data from the map phase, such as finding the maximum temperature for each year.
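A matching reducer sketch (a minimal version, not necessarily the exact class the notes refer to) that keeps the maximum temperature seen for each year key:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All temperatures for this year arrive grouped together; keep the largest.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
```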
Why is the key-value structure important in Hadoop’s MapReduce?
The key-value pairs allow MapReduce to process and organize data across multiple nodes, with the key often being the offset in the file and the value being the data
What role does the map function play in handling erroneous data in Hadoop?
The map function filters out missing, suspect, or erroneous data during its processing phase
What is the role of the END block in the awk script?
The END block in awk executes after all lines in a file are processed, printing the maximum temperature value found
What is the benefit of MapReduce’s two-phase processing approach?
By separating processing into the map and reduce phases, MapReduce enables efficient parallel processing, improving scalability and speed
Why is network efficiency critical in Hadoop’s MapReduce?
Network efficiency is important because moving data across nodes is costly. MapReduce minimizes this by processing data locally in the map phase and only transmitting necessary results for the reduce phase
Why doesn’t the awk script in Unix handle large datasets as efficiently as MapReduce?
The awk script is limited by the processing capacity of a single machine, whereas MapReduce can distribute and process data across multiple nodes, allowing for handling of larger datasets
How is a MapReduce job executed in Hadoop?
A MapReduce job is executed by calling submit() on a Job object or using waitForCompletion(), which submits the job if not already submitted and waits for it to finish
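A driver sketch of that call, assuming the mapper and reducer classes sketched above; input and output paths are taken from the command line:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // waitForCompletion() submits the job (if not already submitted)
        // and blocks until it finishes, returning true on success.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```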
What are the five main entities involved in a MapReduce job process?
The five entities are: the client, YARN Resource Manager, YARN Node Managers, MapReduce Application Master, and the distributed filesystem
What is the role of the YARN Resource Manager in a MapReduce job?
The Resource Manager allocates and coordinates computing resources across the cluster
Describe the role of the YARN Node Manager in MapReduce.
Node Managers launch and monitor computing containers on each node in the cluster
What does the MapReduce Application Master do?
It coordinates the execution of tasks within the MapReduce job, running tasks in containers allocated by the Resource Manager
What are the steps for job submission in a MapReduce process?
Step 1: The submit() method requests a new application ID from the Resource Manager.
Step 2: JobSubmitter calculates input splits and checks output specifications.
Step 3: Job resources (JAR file, configuration, input splits) are copied to the shared filesystem.
Step 4: JobSubmitter calls submitApplication() to officially submit the job
What are the steps in job initialization for MapReduce?
Step 5a: YARN scheduler allocates a container.
Step 5b: Resource Manager launches the Application Master's process in the allocated container under the Node Manager.
Step 6: Application Master (MRAppMaster) starts, responsible for managing the job's map and reduce tasks
What are the steps in task initialization for MapReduce?
Step 7: MRAppMaster retrieves input splits, creates map and reduce task objects, and assigns IDs.
Step 8: Application Master requests containers from the Resource Manager for each map and reduce task
What are the steps involved in task execution in a MapReduce job?
Step 9a: Resource Manager assigns containers, and the Node Manager starts tasks as YarnChild processes.
Step 9b: YarnChild retrieves configuration, JAR file, and necessary resources.
Step 10: YarnChild executes the map or reduce task, processing data as instructed
What are the five main entities in a MapReduce job, and what are their roles?
Client: Submits the job.
YARN Resource Manager: Allocates cluster resources and coordinates task scheduling.
YARN Node Manager: Launches and monitors containers on each node.
MapReduce Application Master: Manages and tracks tasks, requesting resources from the Resource Manager.
Distributed Filesystem (e.g., HDFS): Stores job files, input splits, and configurations for access by all components
major benefit of using hadoop
ability to handle failures while allowing the job to complete successfully.
need to consider failures of:
task
application master
node manager
resource manager
What is a common cause of task failure in MapReduce?
Task failure often occurs when user code in the map or reduce task throws a runtime exception, which is reported to the Application Master