Big Data Study Notes
Unit I: Introduction to Big Data
Definition of Big Data:
- Big Data refers to large volumes of structured and unstructured data that are so complex and voluminous that traditional data processing software cannot adequately manage them.
- Evolution:
- The evolution of Big Data can be traced through the increasing efficiency of data generation, storage, and analysis technologies since the 1960s.
- Key contributions from various innovations, including the internet, cloud computing, and IoT devices, have propelled Big Data to prominence in analytics and decision-making processes.
5 Vs of Big Data:
- Volume:
- Refers to the amount of data generated every second.
- Example: Social media platforms generate over 500 terabytes of data daily.
- Velocity:
- The speed at which data is generated, processed, and analyzed.
- Example: Stock market transactions happening in milliseconds.
- Variety:
- The different types of data (structured, semi-structured, unstructured).
- Example: User-generated content (text, images, video) from social media.
- Veracity:
- The trustworthiness and accuracy of data.
- Example: Inconsistent data from public sources can lead to misleading conclusions.
- Value:
- The insights gained from analyzing data to enhance decision-making.
- Example: Retailers using customer data to create personalized marketing campaigns.
Data Types:
- Structured Data:
- Data that adheres to a predefined model or format.
- Example: SQL databases, where data is organized in tables with defined relationships.
- Semi-Structured Data:
- Data that does not conform to a strict structure but still contains tags or markers to separate data elements.
- Example: JSON and XML files.
- Unstructured Data:
- Data that lacks a specific format; often textual or visual.
- Example: Text documents, images, and video files.
Characteristics of Big Data:
- Velocity:
- Data is continuously generated and must be processed in real time.
- Example: Online transaction processing.
- Variety:
- Data types include structured, semi-structured, and unstructured sources.
- Example: Integration of customer feedback from surveys and reviews with structured sales data.
Sources of Big Data:
- Social Media:
- User interactions and content creation across platforms like Facebook, Twitter, and Instagram.
- IoT Devices:
- Sensors and devices creating data streams in real time, such as smart home devices.
- Transactional Data:
- Sales transactions captured in the retail and banking sectors.
- Machine-Generated Data:
- Logs and telemetry data generated by machines and systems.
- Public Data:
- Government databases, census data, and public datasets available online.
Limitations of Traditional Data Processing Systems:
- Inability to handle the volume, variety, and velocity of Big Data.
- Long processing times that prevent near real-time analysis.
- Rigid data models and schemas that are not suited to flexible data types.
Challenges in Big Data Storage and Processing:
- Scalability issues as data grows exponentially.
- Managing diverse data formats and types.
- Data quality and integrity concerns, especially with data from external sources.
- Privacy and compliance regulations affecting data storage and usage.
Applications of Big Data:
- Retail Analytics:
- Using customer data to enhance inventory management and personalize marketing efforts.
- IoT:
- Leveraging data from connected devices for smarter resource management, predictive maintenance, and enhanced customer experiences.
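The structured / semi-structured / unstructured distinction from this unit can be made concrete with a short Python sketch (the sample records are invented for illustration):

```python
import json
import re

# Structured: fixed schema, like a row in a SQL table.
structured_row = ("C1001", "Alice", 250.00)

# Semi-structured: self-describing keys, but no rigid schema (JSON).
semi_structured = json.loads('{"customer": "Alice", "tags": ["vip", "newsletter"]}')

# Unstructured: free text; any structure must be extracted, e.g. with a regex.
unstructured = "Alice posted: 'Great store, spent about $250 today!'"
amounts = re.findall(r"\$(\d+)", unstructured)

print(structured_row[2])        # direct positional access: 250.0
print(semi_structured["tags"])  # key-based access: ['vip', 'newsletter']
print(amounts)                  # extracted from free text: ['250']
```

Note how the effort needed to get a value out grows as the structure decreases: positional access, then key lookup, then pattern extraction.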
Unit II: Hadoop Ecosystem
HDFS Architecture:
- NameNode:
- The master server that manages the metadata and namespace of HDFS.
- Responsible for maintaining the directory structure, file-to-block mapping, and block location information.
- DataNode:
- The worker nodes that store data blocks and serve read and write requests from clients.
MapReduce Job Workflow:
- Input: Reads input data from the Hadoop Distributed File System (HDFS).
- Mapper: Processes input data and produces intermediate key-value pairs.
- Shuffle: Sorts and reorganizes the intermediate data by key.
- Reducer: Aggregates the intermediate data by key, producing the output dataset.
- Output: Stores the final output in HDFS.
YARN Architecture and Components:
- ResourceManager:
- Manages resources and allocates them to applications.
- NodeManager:
- Manages the execution of containers within a node.
- ApplicationMaster:
- Manages the lifecycle of applications and negotiates resources from the ResourceManager.
Importance of YARN in Hadoop:
- Enables multi-tenancy, allowing multiple data processing frameworks to run concurrently on the same cluster.
- Offers improved resource management and allocation for better cluster utilization.
Pig vs Hive:
- Pig:
- A data flow language for processing large data sets using scripts.
- Use case: Text processing and transformation tasks.
- Hive:
- A SQL-like interface for data warehousing and querying large datasets.
- Use case: Data analysis using SQL queries.
HBase vs HDFS:
- HBase:
- A non-relational, distributed database that runs on HDFS; suitable for real-time read/write access.
- HDFS:
- A distributed file system designed for high throughput access to large datasets.
- Use Cases:
- HBase for real-time analytics; HDFS for batch processing of large amounts of data.
Fault Tolerance in HDFS:
- Data is replicated across multiple DataNodes to ensure reliability.
- Automatic recovery mechanisms are triggered when a DataNode fails, allowing continued data availability.
Data Locality in Hadoop:
- Importance: Deploying compute tasks close to data storage reduces data transfer overhead and optimizes processing efficiency.
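The MapReduce job workflow described in this unit (input → map → shuffle → reduce → output) can be sketched as a word count in plain Python; this is a toy illustration of the pattern, not Hadoop's actual Java API:

```python
from collections import defaultdict

def mapper(line):
    # Map: emit an intermediate (word, 1) pair for every word in a line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: aggregate the values for a single key.
    return (key, sum(values))

lines = ["Big Data is big", "data drives decisions"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(intermediate)
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```

In a real Hadoop job the mappers and reducers run in parallel on different DataNodes, and the shuffle moves intermediate data across the network, but the logical stages are exactly these.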
Unit III: Apache Spark
Resilient Distributed Dataset (RDD):
- Fundamental data structure of Spark that enables parallel processing of data spread across a cluster.
- Provides fault tolerance through lineage information; if a partition of an RDD is lost, it can be rebuilt using its lineage.
Comparison of Apache Spark and Hadoop MapReduce:
- Spark provides faster processing due to in-memory computation while MapReduce relies on disk storage.
- Spark supports real-time data processing and batch processing, whereas MapReduce is typically limited to batch processing.
Apache Spark Architecture:
- Driver:
- Coordinates the execution of tasks and manages resources.
- Executor:
- Processes the data and executes computations on the cluster nodes.
- Cluster Manager:
- Allocates resources and monitors the cluster’s status.
Transformations and Actions in Spark:
- Transformations:
- Lazy evaluation operations that produce new RDDs from existing ones.
- Example: map() and filter() functions.
- Actions:
- Operations that return a result or write data to an external system.
- Example: collect() and count() functions.
DataFrames and Datasets:
- DataFrames:
- A distributed collection of data organized into named columns; offers optimization features over RDDs.
- Datasets:
- A typed extension of DataFrames; combines the benefits of RDDs and DataFrames.
Spark SQL:
- Provides a programming interface for working with structured and semi-structured data using SQL queries.
- Benefits include ease of use, integration with BI tools, and optimized execution plans for queries.
Fault Tolerance in Spark:
- Achieves fault tolerance through lineage graphs, which keep track of RDD transformations and allow for rebuilding lost data quickly.
Comparison of Apache Storm and Apache Flink:
- Apache Storm:
- Real-time computation system designed for distributed stream processing.
- Apache Flink:
- Stream and batch processing engine with advanced state management features; suited for event-driven applications.
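The lazy-transformation vs. eager-action distinction from this unit can be mimicked in plain Python with generators; this is only an analogy to illustrate lazy evaluation, not the Spark API:

```python
data = range(1, 6)

# "Transformations": generators are lazy -- nothing is computed yet.
mapped = (x * 10 for x in data)           # analogous to rdd.map(lambda x: x * 10)
filtered = (x for x in mapped if x > 20)  # analogous to .filter(lambda x: x > 20)

# "Actions": materializing the pipeline is what forces evaluation.
result = list(filtered)  # analogous to collect()
count = len(result)      # analogous to count()

print(result)  # [30, 40, 50]
print(count)   # 3
```

As in Spark, deferring work until an action runs lets the whole pipeline be evaluated in one pass instead of materializing each intermediate step.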
Unit IV: Distributed Systems and NoSQL Databases
CAP Theorem:
- States that a distributed data store can guarantee at most two of the following three properties at any given time: Consistency, Availability, and Partition Tolerance.
- Example:
- A system that prioritizes consistency and availability cannot tolerate network partitions.
Partition Tolerance:
- The ability of a distributed system to continue operating despite network failures that prevent some nodes from communicating with others.
Types of NoSQL Databases:
- Column-Family Stores:
- Example: Apache Cassandra for large amounts of structured data.
- Key-Value Stores:
- Example: Redis for caching and session management.
- Document-Oriented Databases:
- Example: MongoDB for unstructured data, allowing schema flexibility.
- Graph Databases:
- Example: Neo4j for complex relationships and social network data.
Comparison of HBase and Cassandra:
- HBase is built on HDFS and offers strongly consistent, high-volume read/write access; Cassandra prioritizes high availability and is a better fit for widely distributed deployments.
Key-Value Stores (e.g., Redis):
- Characterized by storing data as pairs of keys and values, enabling fast access.
- Commonly used for caching, session management, and real-time analytics.
Document-Oriented Databases (MongoDB):
- Provides a flexible schema that allows the storage of complex, hierarchical data structures in JSON-like documents.
BASE vs ACID Models:
- ACID:
- Database transaction properties to ensure reliable processing (Atomicity, Consistency, Isolation, Durability).
- BASE:
- Offers flexibility and availability over strict consistency, focusing on Basically Available, Soft state, and Eventually consistent systems.
Comparison of Data Formats:
- JSON:
- A lightweight format for data interchange but can be verbose; suitable for web APIs.
- Avro:
- A binary format that supports schema evolution; good for data serialization.
- Parquet:
- A columnar storage file format optimized for performance and storage efficiency; commonly used in big data processing.
- ORC:
- A highly efficient columnar format designed for read-heavy and analytics workloads.
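The row-oriented vs. columnar distinction behind these formats can be illustrated in plain Python with the json module (the actual Avro/Parquet/ORC encodings are binary and far more compact than this sketch):

```python
import json

records = [
    {"id": 1, "product": "laptop", "price": 999},
    {"id": 2, "product": "mouse", "price": 25},
]

# Row-oriented layout (the idea behind JSON/Avro): one whole record at a time.
row_oriented = json.dumps(records)

# Columnar layout (the idea behind Parquet/ORC): all values of one column
# stored together, which compresses well and lets analytical queries read
# only the columns they actually need.
columnar = {
    "id": [r["id"] for r in records],
    "product": [r["product"] for r in records],
    "price": [r["price"] for r in records],
}

print(columnar["price"])  # a single-column read: [999, 25]
```

This is why columnar formats dominate read-heavy analytics workloads, while row-oriented formats suit record-at-a-time interchange.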
Unit V: Data Ingestion and Analysis
Definition of Data Ingestion:
- The process of obtaining and importing data for immediate use or storage in a database.
- Importance:
- Crucial for data analysis, machine learning, and building predictive models.
Architecture of Apache Flume:
- A service for efficiently collecting, aggregating, and moving large amounts of streaming data.
- Components include sources (data input), channels (data transfer), and sinks (data output).
Comparison of Apache Sqoop and Apache Kafka:
- Sqoop:
- Tool for transferring bulk data between Hadoop and structured data stores.
- Kafka:
- Distributed streaming platform for building real-time data pipelines and streaming applications.
Role of Apache Sqoop:
- Facilitates data transfer between Hadoop and relational databases, enabling ETL processes in big data environments.
Advantages of Python in Big Data Analysis:
- Rich ecosystem of libraries (e.g., Pandas, NumPy, PySpark) enables data manipulation, analysis, and visualization with ease.
- Compatibility with other technologies in the Hadoop ecosystem.
Spark MLlib:
- A scalable machine learning library for Spark, providing algorithms and utilities for classification, regression, clustering, and more.
Comparison of Tableau and Power BI:
- Tableau:
- Data visualization tool known for its robust capabilities in visual analytics and dashboards.
- Power BI:
- Microsoft’s business analytics service enabling interactive visualizations and business intelligence features with an easy-to-use interface.
Real-World Big Data Application Case Study:
- Retail Analytics:
- Utilizing customer transaction data to optimize inventory levels, enhance customer experiences, and drive sales through personalized promotions.
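The retail-analytics case study above can be sketched as a small Python aggregation over hypothetical transaction data (customer IDs, products, and amounts are invented for illustration):

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, product, amount).
transactions = [
    ("C1", "laptop", 999.0),
    ("C2", "mouse", 25.0),
    ("C1", "mouse", 25.0),
    ("C2", "keyboard", 45.0),
    ("C2", "monitor", 180.0),
]

# Aggregate total spend per customer -- the raw input for personalized promotions.
spend = defaultdict(float)
for customer, _product, amount in transactions:
    spend[customer] += amount

# Rank customers by spend so the highest-value ones can be targeted first.
top_customers = sorted(spend.items(), key=lambda kv: kv[1], reverse=True)
print(top_customers)  # [('C1', 1024.0), ('C2', 250.0)]
```

At production scale the same group-and-aggregate logic would run as a Spark or MapReduce job over transaction data stored in HDFS, but the analytical idea is identical.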