Big Data Study Notes

Unit I: Introduction to Big Data

  • Definition of Big Data:
      - Big Data refers to large volumes of structured and unstructured data that are so complex and voluminous that traditional data processing software cannot adequately manage them.
      - Evolution:
        - The evolution of Big Data can be traced through advances in data generation, storage, and analysis technologies since the 1960s.
        - Key contributions from various innovations, including the internet, cloud computing, and IoT devices, have propelled Big Data to prominence in analytics and decision-making processes.

  • 5 Vs of Big Data:
      - Volume:
        - Refers to the sheer scale of data generated and stored.
        - Example: Social media platforms generate over 500 terabytes of data daily.
      - Velocity:
        - The speed at which data is generated, processed, and analyzed.
        - Example: Stock market transactions happening in milliseconds.
      - Variety:
        - The different types of data (structured, semi-structured, unstructured).
        - Example: User-generated content (text, images, video) from social media.
      - Veracity:
        - The trustworthiness and accuracy of data.
        - Example: Inconsistent data from public sources can lead to misleading conclusions.
      - Value:
        - The insights gained from analyzing data to enhance decision-making.
        - Example: Retailers using customer data to create personalized marketing campaigns.

  • Data Types:
      - Structured Data:
        - Data that adheres to a predefined model or format.
        - Example: SQL databases, where data is organized in tables with defined relationships.
      - Semi-Structured Data:
        - Data that does not conform to a strict structure but still contains tags or markers to separate data elements.
        - Example: JSON and XML files.
      - Unstructured Data:
        - Data that lacks a specific format; often textual or visual.
        - Example: Text documents, images, and video files.
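The structured/semi-structured distinction can be illustrated with Python's standard json module: a JSON document carries tags (field names), but records need not share a fixed schema the way rows in a SQL table must. A minimal sketch (the field names are invented for illustration):

```python
import json

# Two "customer" documents: semi-structured, so fields may differ per record
# (a relational table with a fixed schema could not hold both as-is).
raw = '''
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ben", "tags": ["premium", "newsletter"]}
]
'''

records = json.loads(raw)
for rec in records:
    # Fields are accessed by name -- the "markers" that make this semi-structured
    print(rec["id"], rec.get("email", "no email on file"))
```

The `get` fallback is the price of schema flexibility: consuming code must tolerate missing fields rather than rely on a schema enforced at write time.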

  • Characteristics of Big Data:
      - Velocity:
        - Data is continuously generated and must be processed in real time.
          - Example: Online transaction processing.
      - Variety:
        - Data types include structured, semi-structured, and unstructured sources.
          - Example: Integration of customer feedback from surveys and reviews with structured sales data.

  • Sources of Big Data:
      - Social Media:
        - User interactions and content creation across platforms like Facebook, Twitter, and Instagram.
      - IoT Devices:
        - Sensors and devices creating real-time data streams, such as smart home devices.
      - Transactional Data:
        - Sales transactions captured in the retail and banking sectors.
      - Machine-Generated Data:
        - Logs and telemetry data generated by machines and systems.
      - Public Data:
        - Government databases, census data, and public datasets available online.

  • Limitations of Traditional Data Processing Systems:
      - Inability to handle the volume, variety, and velocity of Big Data.
      - Long processing times that prevent near real-time analysis.
      - Rigid data models and schemas that are ill-suited to flexible data types.

  • Challenges in Big Data Storage and Processing:
      - Scalability issues as data grows exponentially.
      - Managing diverse data formats and types.
      - Data quality and integrity concerns, especially with external sources.
      - Privacy and compliance regulations affecting data storage and usage.

  • Applications of Big Data:
      - Retail Analytics:
        - Using customer data to enhance inventory management and personalize marketing efforts.
      - IoT:
        - Leveraging data from connected devices for smarter resource management, predictive maintenance, and enhanced customer experiences.

Unit II: Hadoop Ecosystem

  • HDFS Architecture:
      - NameNode:
        - The master server that manages the metadata and namespace of HDFS.
        - Responsible for maintaining the directory structure, file-to-block mapping, and block location information.
      - DataNode:
        - The worker nodes that store data blocks and serve read and write requests from clients.

  • MapReduce Job Workflow:
      - Input: Read input data from the Hadoop Distributed File System (HDFS).
      - Mapper: Processes input data and produces intermediate key-value pairs.
      - Shuffle: Sorts and regroups the intermediate data by key.
      - Reducer: Aggregates the intermediate values for each key, producing the output dataset.
      - Output: Stores the final output in HDFS.
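The map-shuffle-reduce phases above can be sketched in pure Python as a word-count job. This is a conceptual illustration of the workflow, not actual Hadoop API code:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit an intermediate (word, 1) pair for each word in the input
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle phase: group intermediate values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate all values for a key into one output record
    return (key, sum(values))

lines = ["big data big ideas", "data locality matters"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(intermediate)
result = dict(reducer(k, v) for k, v in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'locality': 1, 'matters': 1}
```

In real Hadoop, each phase runs distributed across the cluster and the shuffle moves data over the network; the data flow, however, is exactly this shape.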

  • YARN Architecture and Components:
      - ResourceManager:
        - Manages resources and allocates them to applications.
      - NodeManager:
        - Manages the execution of containers within a node.
      - ApplicationMaster:
        - Manages the lifecycle of applications and negotiates resources from the ResourceManager.

  • Importance of YARN in Hadoop:
      - Enables multi-tenancy, allowing multiple data processing frameworks to run concurrently on the same cluster.
      - Offers improved resource management and allocation for better cluster utilization.

  • Pig vs Hive:
      - Pig:
        - A data flow language for processing large data sets using scripts.
        - Use case: Text processing and transformation tasks.
      - Hive:
        - A SQL-like interface for data warehousing and querying large datasets.
        - Use case: Data analysis using SQL queries.

  • HBase vs HDFS:
      - HBase:
        - A non-relational, distributed database that runs on HDFS; suitable for real-time read/write access.
      - HDFS:
        - A distributed file system designed for high throughput access to large datasets.
      - Use Cases:
        - HBase for real-time analytics, and HDFS for batch processing of large amounts of data.

  • Fault Tolerance in HDFS:
      - Data is replicated across multiple DataNodes to ensure reliability.
      - Automatic recovery mechanisms are triggered when a DataNode fails, allowing continued data availability.
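Replication-based recovery can be modeled conceptually: each block lives on several DataNodes, so when one node fails the block remains readable from the surviving replicas and is re-replicated to restore the target replication factor. A toy sketch of the NameNode's bookkeeping (node and block names are invented):

```python
# Toy model of HDFS block replication (HDFS defaults to a factor of 3).
REPLICATION = 3

# block -> set of DataNodes currently holding a replica (names illustrative)
block_map = {"blk_001": {"dn1", "dn2", "dn3"},
             "blk_002": {"dn2", "dn3", "dn4"}}
live_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_failure(failed, block_map, live_nodes):
    # Sketch of NameNode logic: drop the failed node, then re-replicate
    # any under-replicated block onto other live nodes.
    live_nodes.discard(failed)
    for block, holders in block_map.items():
        holders.discard(failed)
        while len(holders) < REPLICATION:
            candidates = live_nodes - holders
            if not candidates:
                break  # cluster too small to restore the replication factor
            holders.add(sorted(candidates)[0])

handle_failure("dn2", block_map, live_nodes)
```

After the failure of `dn2`, both blocks still have three replicas, so no data is lost and clients can keep reading throughout.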

  • Data Locality in Hadoop:
      - Importance: Deploying compute tasks close to data storage reduces data transfer overhead and optimizes processing efficiency.
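Locality-aware scheduling can be illustrated with a toy scheduler that prefers a node already holding the block (node-local execution) and falls back to any available node otherwise; block and node names are invented:

```python
# block -> nodes holding a replica (illustrative names)
block_locations = {"blk_A": {"node1", "node2"}, "blk_B": {"node3"}}

def schedule_task(block, available_nodes):
    # Prefer a node that already stores the block (no network transfer);
    # otherwise fall back to any available node and pay the transfer cost.
    local = block_locations.get(block, set()) & available_nodes
    if local:
        return sorted(local)[0], "node-local"
    return sorted(available_nodes)[0], "remote"

print(schedule_task("blk_A", {"node2", "node4"}))  # node-local placement
print(schedule_task("blk_B", {"node3"} - {"node3"} | {"node1", "node4"}))  # remote
```

Real schedulers add intermediate tiers (rack-local) between these two extremes, but the principle is the same: move computation to the data, not data to the computation.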

Unit III: Apache Spark

  • Resilient Distributed Dataset (RDD):
      - Fundamental data structure of Spark that enables parallel processing of data spread across a cluster.
      - Provides fault tolerance through lineage information; if a partition of an RDD is lost, it can be rebuilt using its lineage.
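The lineage idea can be mimicked in plain Python: instead of storing computed results, record the parent dataset plus the transformation that produced this one, so any lost partition can be recomputed on demand. A conceptual sketch, not the Spark implementation:

```python
class ToyRDD:
    """Records its parent and transformation so partitions can be rebuilt."""

    def __init__(self, partitions, parent=None, fn=None):
        self._cache = partitions         # may be lost on "node failure"
        self.parent, self.fn = parent, fn

    def map(self, fn):
        # Return a new dataset described only by its lineage (no data yet)
        return ToyRDD(None, parent=self, fn=fn)

    def compute(self, i):
        # Rebuild partition i from lineage if it is not materialized
        if self._cache is not None:
            return self._cache[i]
        return [self.fn(x) for x in self.parent.compute(i)]

base = ToyRDD([[1, 2], [3, 4]])          # two partitions of source data
squared = base.map(lambda x: x * x)      # lineage: base -> map(square)
print(squared.compute(1))                # recomputed from lineage: [9, 16]
```

Because only the recipe is stored, recovery recomputes just the lost partition rather than replicating every intermediate result, which is why lineage-based fault tolerance is cheap.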

  • Comparison of Apache Spark and Hadoop MapReduce:
      - Spark provides faster processing due to in-memory computation, while MapReduce writes intermediate results to disk between stages.
      - Spark supports real-time data processing and batch processing, whereas MapReduce is typically limited to batch processing.

  • Apache Spark Architecture:
      - Driver:
        - Coordinates the execution of tasks and manages resources.
      - Executor:
        - Processes the data and executes computations on the cluster nodes.
      - Cluster Manager:
        - Allocates resources and monitors the cluster’s status.

  • Transformations and Actions in Spark:
      - Transformations:
        - Lazily evaluated operations that produce new RDDs from existing ones.
        - Example: the map() and filter() functions.
      - Actions:
        - Operations that return a result or write data to an external system.
        - Example: collect() and count() functions.
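The lazy/eager split can be illustrated with Python generators: like Spark transformations, a generator pipeline only describes work, and nothing executes until an action-like call consumes the result. A conceptual analogy, not PySpark API code:

```python
evaluated = []

def trace(x):
    # Record when an element is actually processed
    evaluated.append(x)
    return x * 2

data = range(5)
# "Transformation": builds a lazy pipeline; no element is processed yet,
# analogous to chaining map() and filter() on an RDD.
pipeline = (trace(x) for x in data if x % 2 == 0)
assert evaluated == []   # still lazy

# "Action": forces evaluation of the whole pipeline, like collect()
result = list(pipeline)
print(result, evaluated)
```

Deferring work this way lets Spark see the whole chain of transformations before running anything, which is what makes pipeline-wide optimization possible.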

  • DataFrames and Datasets:
      - DataFrames:
        - A distributed collection of data organized into named columns; offers optimization features over RDDs.
      - Datasets:
        - A typed extension of DataFrames; combines the benefits of RDDs and DataFrames.

  • Spark SQL:
      - Provides a programming interface for working with structured and semi-structured data using SQL queries.
      - Benefits include ease of use, integration with BI tools, and optimized query execution plans.

  • Fault Tolerance in Spark:
      - Achieves fault tolerance through lineage graphs, which keep track of RDD transformations and allow for rebuilding lost data quickly.

  • Comparison of Apache Storm and Apache Flink:
      - Apache Storm:
        - A real-time computation system designed for distributed stream processing.
      - Apache Flink:
        - Stream and batch processing engine with advanced state management features; suited for event-driven applications.

Unit IV: Distributed Systems and NoSQL Databases

  • CAP Theorem:
      - States that a distributed data store can guarantee at most two of the following three properties at any given time: Consistency, Availability, and Partition Tolerance.
      - Example:
        - Since network partitions cannot be ruled out in practice, a distributed system must effectively choose between consistency and availability when a partition occurs: CP systems (e.g., HBase) sacrifice availability, while AP systems (e.g., Cassandra) sacrifice strict consistency.

  • Partition Tolerance:
      - The ability of a distributed system to continue operating despite network failures that prevent some nodes from communicating with others.

  • Types of NoSQL Databases:
      - Column-Family Stores:
        - Example: Apache Cassandra for large amounts of structured data.
      - Key-Value Stores:
        - Example: Redis for caching and session management.
      - Document-Oriented Databases:
        - Example: MongoDB for unstructured data, allowing schema flexibility.
      - Graph Databases:
        - Example: Neo4j for complex relationships and social network data.

  • Comparison of HBase and Cassandra:
      - HBase runs on top of HDFS and offers strongly consistent, high-volume random read/write operations; Cassandra favors high availability and partition tolerance, making it a better fit for geographically distributed deployments.

  • Key-Value Stores (e.g., Redis):
      - Characterized by storing data as pairs of keys and values, enabling fast access.
      - Commonly used for caching, session management, and real-time analytics.
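The key-value access pattern can be sketched with a dict-backed store supporting per-key expiry, loosely mirroring how Redis sets a value with a TTL for session caching. A toy in-process stand-in, not a Redis client:

```python
import time

class ToyKVStore:
    """Dict-backed key-value store with optional per-key TTL (Redis-like sketch)."""

    def __init__(self):
        self._data = {}   # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key, default=None):
        value, expiry = self._data.get(key, (default, None))
        if expiry is not None and time.monotonic() > expiry:
            del self._data[key]   # lazily evict expired keys on read
            return default
        return value

store = ToyKVStore()
store.set("session:42", {"user": "asha"}, ttl=30.0)
print(store.get("session:42"))
```

Lookups are a single hash-table access with no joins or query planning, which is why key-value stores deliver the fast, predictable reads that caching and session management need.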

  • Document-Oriented Databases (MongoDB):
      - Provides a flexible schema that allows the storage of complex, hierarchical data structures in JSON-like documents.

  • BASE vs ACID Models:
      - ACID:
        - Database transaction properties to ensure reliable processing (Atomicity, Consistency, Isolation, Durability).
      - BASE:
        - Offers flexibility and availability over strict consistency, focusing on Basically Available, Soft state, and Eventually consistent systems.

  • Comparison of Data Formats:
      - JSON:
        - A lightweight format for data interchange but can be verbose; suitable for web APIs.
      - Avro:
        - A binary format that supports schema evolution; good for data serialization.
      - Parquet:
        - A columnar storage file format optimized for performance and storage efficiency; commonly used in big data processing.
      - ORC:
        - A highly efficient columnar format designed for read-heavy and analytics workloads.
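The row-versus-columnar distinction behind Parquet and ORC can be shown with plain Python structures: a columnar layout lets an analytic query touch only the column it needs, while a row layout scans whole records. A conceptual sketch with made-up data:

```python
# Row-oriented layout (like JSON or Avro): each record stored together.
rows = [
    {"id": 1, "city": "Pune", "amount": 120.0},
    {"id": 2, "city": "Delhi", "amount": 75.5},
    {"id": 3, "city": "Pune", "amount": 40.0},
]

# Columnar layout (like Parquet or ORC): each column stored contiguously.
columns = {
    "id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
    "amount": [120.0, 75.5, 40.0],
}

# Analytic query: total amount. The row layout must visit every full
# record; the columnar layout reads only the "amount" column.
total_rows = sum(r["amount"] for r in rows)
total_cols = sum(columns["amount"])
assert total_rows == total_cols
```

On disk, the columnar version also compresses better, since values of one type sit next to each other; that combination of selective reads and compression is what makes Parquet and ORC the default for analytics workloads.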

Unit V: Data Ingestion and Analysis

  • Definition of Data Ingestion:
      - The process of obtaining and importing data for immediate use or storage in a database.
      - Importance:
        - Crucial for data analysis, machine learning, and building predictive models.

  • Architecture of Apache Flume:
      - A service for efficiently collecting, aggregating, and moving large amounts of streaming data.
      - Components include sources (data input), channels (data transfer), and sinks (data output).
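The source → channel → sink flow can be mimicked with a queue buffering events between a producer and a consumer; a conceptual sketch of Flume's pipeline shape, not Flume configuration:

```python
from queue import Queue

def source(events, channel):
    # Source: ingests events (e.g., log lines) and puts them on the channel
    for event in events:
        channel.put(event)

def sink(channel, destination):
    # Sink: drains the channel and delivers events to the destination store
    while not channel.empty():
        destination.append(channel.get())

channel = Queue()        # Channel: buffers events in transit
hdfs_stand_in = []       # stand-in for an HDFS sink target
source(["log line 1", "log line 2"], channel)
sink(channel, hdfs_stand_in)
print(hdfs_stand_in)
```

The channel decouples the two ends: a burst of incoming events queues up safely even if the sink is momentarily slower, which is exactly the buffering role Flume's channels play.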

  • Comparison of Apache Sqoop and Apache Kafka:
      - Sqoop:
        - A tool for transferring bulk data between Hadoop and structured data stores.
      - Kafka:
        - Distributed streaming platform for building real-time data pipelines and streaming applications.

  • Role of Apache Sqoop:
      - Facilitates data transfer between Hadoop and relational databases, enabling ETL processes in big data environments.

  • Advantages of Python in Big Data Analysis:
      - Rich ecosystem of libraries (e.g., Pandas, NumPy, PySpark) enables data manipulation, analysis, and visualization with ease.
      - Compatibility with other technologies in the Hadoop ecosystem.

  • Spark MLlib:
      - A scalable machine learning library for Spark, providing algorithms and utilities for classification, regression, clustering, and more.

  • Comparison of Tableau and Power BI:
      - Tableau:
        - A data visualization tool known for its robust visual analytics and dashboard capabilities.
      - Power BI:
        - Microsoft’s business analytics service enabling interactive visualizations and business intelligence features with an easy-to-use interface.

  • Real-World Big Data Application Case Study:
      - Retail Analytics:
        - Utilizing customer transaction data to optimize inventory levels, enhance customer experiences, and drive sales through personalized promotions.