BDA UNIT-1
Introduction to Big Data
Big Data refers to large, complex datasets that traditional data management tools cannot efficiently process.
Characteristics (Five V’s): Volume, Velocity, Variety, Veracity, and Value.
Problems with Traditional Approach
Traditional data systems, such as relational databases, do not handle the rapid growth and complexity of big data effectively.
Key issues include high storage costs, an inability to process high-velocity data streams, and difficulty with heterogeneous data (structured, semi-structured, and unstructured).
What is Hadoop?
Hadoop is an open-source framework for distributed storage and processing of big data.
It runs on clusters of commodity hardware and emphasizes high data throughput rather than low latency.
Need for HDFS and MapReduce in Hadoop
HDFS (Hadoop Distributed File System): Splits large files into blocks and stores them across multiple nodes, with replication for redundancy and fault tolerance.
MapReduce: A programming model for processing large datasets in parallel across a distributed cluster.
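To make the model concrete, the following is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. It follows the standard tutorial example rather than any code from this unit; the class name and command-line paths are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in this node's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts gathered for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the reducer class is reused as a combiner: counts are pre-summed on each mapper node, which reduces the volume of data shuffled across the network to the reducers.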
Google File System (GFS)
GFS is a proprietary distributed file system developed by Google to manage large datasets across clusters of commodity hardware.
It was designed for reliability and efficient data access, and its design directly inspired HDFS.
Building Blocks of Hadoop
Core components include HDFS (distributed storage), YARN (resource management), and MapReduce (data processing).
Introducing and Configuring a Hadoop Cluster
Installation requires setting up Linux, Java, and user permissions.
Configuration is done through XML files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml) that control Hadoop's behavior.
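As an illustrative sketch, a minimal core-site.xml and hdfs-site.xml might contain the following. The hostname, port, and replication value are example settings for a small cluster, not required defaults; fs.defaultFS and dfs.replication are standard Hadoop property names.

    <!-- core-site.xml: tells clients where the file system lives -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode-host:9000</value>  <!-- example NameNode address -->
      </property>
    </configuration>

    <!-- hdfs-site.xml: controls HDFS storage behavior -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>  <!-- number of copies kept of each block -->
      </property>
    </configuration>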
Hadoop Ecosystem Components
HDFS: Manages storage.
YARN: Manages resources.
MapReduce: Processes data.
Other components include Pig (dataflow scripting), Hive (SQL-like querying), and Mahout (machine learning).
HDFS Architecture
Comprises a master-slave setup with a NameNode (master) and DataNodes (slaves).
DataNodes store the actual data blocks, while the NameNode manages the file system namespace and metadata.
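In practice, this architecture is exercised through the hdfs command-line shell; a short illustrative session (the paths are hypothetical) looks like:

    hdfs dfs -mkdir -p /user/student/input        # NameNode records the new directory in its namespace
    hdfs dfs -put data.txt /user/student/input    # file is split into blocks and replicated across DataNodes
    hdfs dfs -ls /user/student/input              # listing is answered from NameNode metadata
    hdfs dfs -cat /user/student/input/data.txt    # blocks are read back from the DataNodes that hold them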
Key Features of Hadoop
Scalability, fault tolerance, and high availability are critical for managing large datasets.
Hadoop's ability to work with unstructured data gives it an edge in handling big data.
Advantages of Hadoop
Cost-effective due to its use of commodity hardware.
Scalability allows clusters to grow as data volumes increase.
Fault tolerance ensures data reliability by replicating blocks across distributed nodes.
Disadvantages of Hadoop
Primarily supports batch processing rather than real-time analytics.
Not suited for large numbers of small files: the NameNode keeps metadata for every file and block in memory, so millions of small files can exhaust its heap.
Real-time Applications of Hadoop
Used in sectors such as national security, healthcare, finance, and smart-city development to analyze big data for decision-making and to optimize operations.
Summary
Hadoop provides a robust framework for handling large datasets, integrating diverse data sources, and processing data efficiently in a distributed environment, making it a cornerstone technology for big data analytics.