BDA UNIT-1
Introduction to Big Data
Big Data refers to large, complex datasets that traditional data management tools cannot efficiently process.
Characteristics (Five V’s): Volume, Velocity, Variety, Veracity, and Value.
Problems with Traditional Approach
Traditional data systems, such as relational databases, do not handle the rapid growth and complexity of big data effectively.
Key issues include high storage costs, an inability to process high-velocity data streams, and difficulty with heterogeneous data (structured, semi-structured, and unstructured).
What is Hadoop?
Hadoop is an open-source framework for distributed storage and processing of big data.
It runs on clusters of commodity hardware and emphasizes high data throughput rather than low latency.
Need for HDFS and MapReduce in Hadoop
HDFS (Hadoop Distributed File System): Splits large files into blocks and stores them across multiple nodes, with replication for redundancy and fault tolerance.
MapReduce: A programming model for processing large datasets in parallel across a distributed cluster.
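To make the model concrete, the following is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. It follows the standard tutorial example rather than any code from this unit; the class name and command-line paths are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in this node's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts gathered for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the reducer class is reused as a combiner: counts are pre-summed on each mapper node, which reduces the volume of data shuffled across the network to the reducers.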
Google File System (GFS)
GFS is a proprietary distributed file system developed by Google to manage large datasets across clusters of commodity hardware.
It was designed for reliability and efficient data access, and its design directly inspired HDFS.
Building Blocks of Hadoop
Core components include HDFS (distributed storage), YARN (resource management), and MapReduce (data processing).
Introducing and Configuring a Hadoop Cluster
Installation requires setting up Linux, Java, and user permissions.
Configuration is done through XML files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml) that control Hadoop's behavior.
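As an illustrative sketch, a minimal core-site.xml and hdfs-site.xml might contain the following. The hostname, port, and replication value are example settings for a small cluster, not required defaults; fs.defaultFS and dfs.replication are standard Hadoop property names.

    <!-- core-site.xml: tells clients where the file system lives -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode-host:9000</value>  <!-- example NameNode address -->
      </property>
    </configuration>

    <!-- hdfs-site.xml: controls HDFS storage behavior -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>  <!-- number of copies kept of each block -->
      </property>
    </configuration>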
Hadoop Ecosystem Components
HDFS: Manages storage.
YARN: Manages resources.
MapReduce: Processes data.
Other components include Pig (dataflow scripting), Hive (SQL-like querying), and Mahout (machine learning).
HDFS Architecture
Comprises a master-slave setup with a NameNode (master) and DataNodes (slaves).
DataNodes store the actual data blocks, while the NameNode manages the file system namespace and metadata.
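In practice, this architecture is exercised through the hdfs command-line shell; a short illustrative session (the paths are hypothetical) looks like:

    hdfs dfs -mkdir -p /user/student/input        # NameNode records the new directory in its namespace
    hdfs dfs -put data.txt /user/student/input    # file is split into blocks and replicated across DataNodes
    hdfs dfs -ls /user/student/input              # listing is answered from NameNode metadata
    hdfs dfs -cat /user/student/input/data.txt    # blocks are read back from the DataNodes that hold them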
Key Features of Hadoop
Scalability, fault tolerance, and high availability are critical for managing large datasets.
Hadoop's ability to work with unstructured data gives it an edge in handling big data.
Advantages of Hadoop
Cost-effective due to its use of commodity hardware.
Scalability allows clusters to grow as data volumes increase.
Fault tolerance ensures data reliability by replicating blocks across distributed nodes.
Disadvantages of Hadoop
Primarily supports batch processing rather than real-time analytics.
Not suited for large numbers of small files: the NameNode keeps metadata for every file and block in memory, so millions of small files can exhaust its heap.
Real-time Applications of Hadoop
Used in sectors such as national security, healthcare, finance, and smart-city development to analyze big data for decision-making and to optimize operations.
Summary
Hadoop provides a robust framework for handling large datasets, integrating diverse data sources, and processing data efficiently in a distributed environment, making it a cornerstone technology for big data analytics.