
BDA Unit 1

Introduction to Big Data

  • Big Data refers to large, complex datasets that traditional data management tools cannot efficiently process.

  • Characteristics (Five V’s): Volume, Velocity, Variety, Veracity, and Value.

Problems with Traditional Approach

  • Traditional data systems (relational databases) do not handle the rapid growth and complexity of big data effectively.

  • Key issues include high storage costs, the inability to process data arriving at high velocity, and difficulty handling heterogeneous (mixed structured and unstructured) data.

What is Hadoop

  • An open-source distributed processing framework designed for big data applications.

  • Designed to run on commodity hardware and emphasizes high data throughput.

Need of HDFS and MapReduce in Hadoop

  • HDFS (Hadoop Distributed File System): Stores large datasets across multiple nodes for redundancy and fault tolerance.

  • MapReduce: A programming model for processing large datasets in parallel across a distributed cluster (see the word-count sketch below).
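
To make the model concrete, here is a minimal word-count job in Java, essentially the canonical example from the Hadoop MapReduce tutorial; the class name and command-line paths are illustrative. The map phase emits a (word, 1) pair per token, and the reduce phase sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run with hadoop jar wordcount.jar WordCount <input> <output>; the output directory must not already exist.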

Google File System (GFS)

  • GFS is a proprietary distributed file system developed by Google to manage large datasets across clusters of commodity hardware.

  • It was designed for reliability and efficient access to large files on unreliable commodity hardware, and its design directly inspired HDFS.

Building Blocks of Hadoop

  • Core components include HDFS (storage), YARN (Yet Another Resource Negotiator, for resource management), and MapReduce (data processing).

Introducing and Configuring Hadoop Cluster

  • Installation requires a Linux environment, a Java installation (JDK), and a dedicated user with appropriate permissions.

  • Behavior is controlled through XML configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, etc.); the sketch below shows how these settings surface in code.
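
A small sketch of reading and overriding configuration, assuming Hadoop's standard org.apache.hadoop.conf.Configuration API; the hostname and port below are placeholders, not values from the source notes.

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
  public static void main(String[] args) {
    // Loads core-default.xml and core-site.xml from the classpath;
    // site files and explicit set() calls override the shipped defaults.
    Configuration conf = new Configuration();

    // Equivalent to setting the fs.defaultFS property in core-site.xml
    // (hostname and port are placeholders).
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

    // dfs.replication normally lives in hdfs-site.xml; Hadoop's default is 3.
    System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
    System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
  }
}
```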

Hadoop Ecosystem Components

  • HDFS: Manages storage.

  • YARN: Manages resources.

  • MapReduce: Processes data.

  • Other components include Pig (high-level dataflow scripting), Hive (SQL-like querying), and Mahout (machine learning).

HDFS Architecture

  • Uses a master-slave architecture with a single NameNode (master) and multiple DataNodes (slaves).

  • DataNodes store the actual data blocks, while the NameNode maintains the filesystem metadata (namespace and block locations); the client sketch below makes this division of labor concrete.
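
A minimal client sketch using the standard HDFS Java API: metadata operations (create, open) go to the NameNode, while the file bytes themselves stream to and from DataNodes. The path and file contents are illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.get() contacts the NameNode named by fs.defaultFS.
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // illustrative path

    // Write: the NameNode allocates blocks, the client streams to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the NameNode returns block locations, the client reads from DataNodes.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}
```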

Key Features of Hadoop

  • Scalability, fault tolerance, and high availability are critical for managing large datasets.

  • Hadoop's ability to work with unstructured data gives it an edge in handling big data.

Advantages of Hadoop

  • Cost-effective due to commodity hardware use.

  • Scalability allows for accommodating increasing data volumes.

  • Fault tolerance via block replication (default replication factor of 3) ensures data reliability across distributed nodes.

Disadvantages of Hadoop

  • Primarily supports batch processing, not real-time analytics.

  • Not suited for large numbers of small files, which can overwhelm the NameNode's memory (see the back-of-envelope estimate below).
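
A back-of-envelope illustration of the small-files problem, assuming the commonly cited rule of thumb that each file, directory, or block object consumes roughly 150 bytes of NameNode heap (an approximation, not an exact figure):

```java
public class SmallFilesEstimate {
  // Rough rule of thumb: each file, directory, or block object takes
  // on the order of 150 bytes of NameNode heap.
  static final long BYTES_PER_OBJECT = 150;

  public static void main(String[] args) {
    // 100 million 1 KB files: one file object plus one block object each.
    long smallFiles = 100_000_000L;
    long smallHeap = smallFiles * 2 * BYTES_PER_OBJECT; // ~30 GB of heap

    // The same ~100 GB stored as 128 MB blocks in a few large files:
    // under 800 block objects instead of 200 million.
    long largeBlocks = (smallFiles * 1024L) / (128L * 1024 * 1024);
    long largeHeap = largeBlocks * BYTES_PER_OBJECT; // well under 1 MB

    System.out.printf("Small files: ~%.1f GB of NameNode heap%n", smallHeap / 1e9);
    System.out.printf("Large files: ~%.3f MB of NameNode heap%n", largeHeap / 1e6);
  }
}
```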

Real-time Applications of Hadoop

  • Used in sectors such as national security, healthcare, finance, and smart-city development to analyze big data for decision-making and operational optimization.

Summary

Hadoop provides a robust framework for handling large datasets, integrating diverse data sources, and processing data efficiently in a distributed environment, making it a cornerstone technology for big data analytics.