Big Data Study Notes
Unit I: Introduction to Big Data
Definition of Big Data:
- Big Data refers to large volumes of structured and unstructured data that are so complex and voluminous that traditional data processing software cannot adequately manage them.
- Evolution:
- The evolution of Big Data can be traced through the increasing efficiency of data generation, storage, and analysis technologies since the 1960s.
- Key contributions from various innovations, including the internet, cloud computing, and IoT devices, have propelled Big Data to prominence in analytics and decision-making processes.
5 Vs of Big Data:
- Volume:
- Refers to the amount of data generated every second.
- Example: Social media platforms generate over 500 terabytes of data daily.
- Velocity:
- The speed at which data is generated, processed, and analyzed.
- Example: Stock market transactions happening in milliseconds.
- Variety:
- The different types of data (structured, semi-structured, unstructured).
- Example: User-generated content (text, images, video) from social media.
- Veracity:
- The trustworthiness and accuracy of data.
- Example: Inconsistent data from public sources can lead to misleading conclusions.
- Value:
- The insights gained from analyzing data to enhance decision-making.
- Example: Retailers using customer data to create personalized marketing campaigns.
Data Types:
- Structured Data:
- Data that adheres to a predefined model or format.
- Example: SQL databases, where data is organized in tables with defined relationships.
- Semi-Structured Data:
- Data that does not conform to a strict structure but still contains tags or markers to separate data elements.
- Example: JSON and XML files.
- Unstructured Data:
- Data that lacks a specific format; often textual or visual.
- Example: Text documents, images, and video files.
Characteristics of Big Data:
- Velocity:
- Data is continuously generated and must be processed in real time.
- Example: Online transaction processing.
- Variety:
- Data types include structured, semi-structured, and unstructured sources.
- Example: Integration of customer feedback from surveys and reviews with structured sales data.
Sources of Big Data:
- Social Media:
- User interactions and content creation across platforms like Facebook, Twitter, and Instagram.
- IoT Devices:
- Sensors and devices creating data streams in real time, such as smart home devices.
- Transactional Data:
- Sales transactions captured in the retail and banking sectors.
- Machine-Generated Data:
- Logs and telemetry data generated by machines and systems.
- Public Data:
- Government databases, census data, and public datasets available online.
Limitations of Traditional Data Processing Systems:
- Inability to handle the volume, variety, and velocity of Big Data.
- Long processing times that prevent near real-time analysis.
- Rigid data models and schemas that are not suited to flexible data types.
Challenges in Big Data Storage and Processing:
- Scalability issues as data grows exponentially.
- Managing diverse data formats and types.
- Data quality and integrity concerns, especially with data from external sources.
- Privacy and compliance regulations affecting data storage and usage.
Applications of Big Data:
- Retail Analytics:
- Using customer data to enhance inventory management and personalize marketing efforts.
- IoT:
- Leveraging data from connected devices for smarter resource management, predictive maintenance, and enhanced customer experiences.
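The structured / semi-structured / unstructured distinction from this unit can be made concrete with a short Python sketch (the sample records are invented for illustration):

```python
import json
import re

# Structured: fixed schema, like a row in a SQL table.
structured_row = ("C1001", "Alice", 250.00)

# Semi-structured: self-describing keys, but no rigid schema (JSON).
semi_structured = json.loads('{"customer": "Alice", "tags": ["vip", "newsletter"]}')

# Unstructured: free text; any structure must be extracted, e.g. with a regex.
unstructured = "Alice posted: 'Great store, spent about $250 today!'"
amounts = re.findall(r"\$(\d+)", unstructured)

print(structured_row[2])        # direct positional access: 250.0
print(semi_structured["tags"])  # key-based access: ['vip', 'newsletter']
print(amounts)                  # extracted from free text: ['250']
```

Note how the effort needed to get a value out grows as the structure decreases: positional access, then key lookup, then pattern extraction.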
Unit II: Hadoop Ecosystem
HDFS Architecture:
- NameNode:
- The master server that manages the metadata and namespace of HDFS.
- Responsible for maintaining the directory structure, file-to-block mapping, and block location information.
- DataNode:
- The worker nodes that store data blocks and serve read and write requests from clients.
MapReduce Job Workflow:
- Input: Reads input data from the Hadoop Distributed File System (HDFS).
- Mapper: Processes input data and produces intermediate key-value pairs.
- Shuffle: Sorts and reorganizes the intermediate data by key.
- Reducer: Aggregates the intermediate data by key, producing the output dataset.
- Output: Stores the final output in HDFS.
YARN Architecture and Components:
- ResourceManager:
- Manages resources and allocates them to applications.
- NodeManager:
- Manages the execution of containers within a node.
- ApplicationMaster:
- Manages the lifecycle of applications and negotiates resources from the ResourceManager.
Importance of YARN in Hadoop:
- Enables multi-tenancy, allowing multiple data processing frameworks to run concurrently on the same cluster.
- Offers improved resource management and allocation for better cluster utilization.
Pig vs Hive:
- Pig:
- A data flow language for processing large data sets using scripts.
- Use case: Text processing and transformation tasks.
- Hive:
- A SQL-like interface for data warehousing and querying large datasets.
- Use case: Data analysis using SQL queries.
HBase vs HDFS:
- HBase:
- A non-relational, distributed database that runs on HDFS; suitable for real-time read/write access.
- HDFS:
- A distributed file system designed for high throughput access to large datasets.
- Use Cases:
- HBase for real-time analytics; HDFS for batch processing of large amounts of data.
Fault Tolerance in HDFS:
- Data is replicated across multiple DataNodes to ensure reliability.
- Automatic recovery mechanisms are triggered when a DataNode fails, allowing continued data availability.
Data Locality in Hadoop:
- Importance: Deploying compute tasks close to data storage reduces data transfer overhead and optimizes processing efficiency.
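The MapReduce job workflow described in this unit (input → map → shuffle → reduce → output) can be sketched as a word count in plain Python; this is a toy illustration of the pattern, not Hadoop's actual Java API:

```python
from collections import defaultdict

def mapper(line):
    # Map: emit an intermediate (word, 1) pair for every word in a line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: aggregate the values for a single key.
    return (key, sum(values))

lines = ["Big Data is big", "data drives decisions"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(intermediate)
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```

In a real Hadoop job the mappers and reducers run in parallel on different DataNodes, and the shuffle moves intermediate data across the network, but the logical stages are exactly these.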
Unit III: Apache Spark
Resilient Distributed Dataset (RDD):
- Fundamental data structure of Spark that enables parallel processing of data spread across a cluster.
- Provides fault tolerance through lineage information; if a partition of an RDD is lost, it can be rebuilt using its lineage.
Comparison of Apache Spark and Hadoop MapReduce:
- Spark provides faster processing due to in-memory computation while MapReduce relies on disk storage.
- Spark supports real-time data processing and batch processing, whereas MapReduce is typically limited to batch processing.
Apache Spark Architecture:
- Driver:
- Coordinates the execution of tasks and manages resources.
- Executor:
- Processes the data and executes computations on the cluster nodes.
- Cluster Manager:
- Allocates resources and monitors the cluster’s status.
Transformations and Actions in Spark:
- Transformations:
- Lazy evaluation operations that produce new RDDs from existing ones.
- Example: map() and filter() functions.
- Actions:
- Operations that return a result or write data to an external system.
- Example: collect() and count() functions.
DataFrames and Datasets:
- DataFrames:
- A distributed collection of data organized into named columns; offers optimization features over RDDs.
- Datasets:
- A typed extension of DataFrames; combines the benefits of RDDs and DataFrames.
Spark SQL:
- Provides a programming interface for working with structured and semi-structured data using SQL queries.
- Benefits include ease of use, integration with BI tools, and optimized execution plans for queries.
Fault Tolerance in Spark:
- Achieves fault tolerance through lineage graphs, which keep track of RDD transformations and allow for rebuilding lost data quickly.
Comparison of Apache Storm and Apache Flink:
- Apache Storm:
- Real-time computation system designed for distributed stream processing.
- Apache Flink:
- Stream and batch processing engine with advanced state management features; suited for event-driven applications.
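The lazy-transformation vs. eager-action distinction from this unit can be mimicked in plain Python with generators; this is only an analogy to illustrate lazy evaluation, not the Spark API:

```python
data = range(1, 6)

# "Transformations": generators are lazy -- nothing is computed yet.
mapped = (x * 10 for x in data)           # analogous to rdd.map(lambda x: x * 10)
filtered = (x for x in mapped if x > 20)  # analogous to .filter(lambda x: x > 20)

# "Actions": materializing the pipeline is what forces evaluation.
result = list(filtered)  # analogous to collect()
count = len(result)      # analogous to count()

print(result)  # [30, 40, 50]
print(count)   # 3
```

As in Spark, deferring work until an action runs lets the whole pipeline be evaluated in one pass instead of materializing each intermediate step.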
Unit IV: Distributed Systems and NoSQL Databases
CAP Theorem:
- States that a distributed data store can guarantee at most two of the following three properties at any given time: Consistency, Availability, and Partition Tolerance.
- Example:
- A system that prioritizes consistency and availability cannot tolerate network partitions.
Partition Tolerance:
- The ability of a distributed system to continue operating despite network failures that prevent some nodes from communicating with others.
Types of NoSQL Databases:
- Column-Family Stores:
- Example: Apache Cassandra for large amounts of structured data.
- Key-Value Stores:
- Example: Redis for caching and session management.
- Document-Oriented Databases:
- Example: MongoDB for unstructured data, allowing schema flexibility.
- Graph Databases:
- Example: Neo4j for complex relationships and social network data.
Comparison of HBase and Cassandra:
- HBase is built on HDFS and offers strongly consistent, high-volume read/write access; Cassandra prioritizes high availability and is a better fit for widely distributed deployments.
Key-Value Stores (e.g., Redis):
- Characterized by storing data as pairs of keys and values, enabling fast access.
- Commonly used for caching, session management, and real-time analytics.
Document-Oriented Databases (MongoDB):
- Provides a flexible schema that allows the storage of complex, hierarchical data structures in JSON-like documents.
BASE vs ACID Models:
- ACID:
- Database transaction properties to ensure reliable processing (Atomicity, Consistency, Isolation, Durability).
- BASE:
- Offers flexibility and availability over strict consistency, focusing on Basically Available, Soft state, and Eventually consistent systems.
Comparison of Data Formats:
- JSON:
- A lightweight format for data interchange but can be verbose; suitable for web APIs.
- Avro:
- A binary format that supports schema evolution; good for data serialization.
- Parquet:
- A columnar storage file format optimized for performance and storage efficiency; commonly used in big data processing.
- ORC:
- A highly efficient columnar format designed for read-heavy and analytics workloads.
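The row-oriented vs. columnar distinction behind these formats can be illustrated in plain Python with the json module (the actual Avro/Parquet/ORC encodings are binary and far more compact than this sketch):

```python
import json

records = [
    {"id": 1, "product": "laptop", "price": 999},
    {"id": 2, "product": "mouse", "price": 25},
]

# Row-oriented layout (the idea behind JSON/Avro): one whole record at a time.
row_oriented = json.dumps(records)

# Columnar layout (the idea behind Parquet/ORC): all values of one column
# stored together, which compresses well and lets analytical queries read
# only the columns they actually need.
columnar = {
    "id": [r["id"] for r in records],
    "product": [r["product"] for r in records],
    "price": [r["price"] for r in records],
}

print(columnar["price"])  # a single-column read: [999, 25]
```

This is why columnar formats dominate read-heavy analytics workloads, while row-oriented formats suit record-at-a-time interchange.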
Unit V: Data Ingestion and Analysis
Definition of Data Ingestion:
- The process of obtaining and importing data for immediate use or storage in a database.
- Importance:
- Crucial for data analysis, machine learning, and building predictive models.
Architecture of Apache Flume:
- A service for efficiently collecting, aggregating, and moving large amounts of streaming data.
- Components include sources (data input), channels (data transfer), and sinks (data output).
Comparison of Apache Sqoop and Apache Kafka:
- Sqoop:
- Tool for transferring bulk data between Hadoop and structured data stores.
- Kafka:
- Distributed streaming platform for building real-time data pipelines and streaming applications.
Role of Apache Sqoop:
- Facilitates data transfer between Hadoop and relational databases, enabling ETL processes in big data environments.
Advantages of Python in Big Data Analysis:
- Rich ecosystem of libraries (e.g., Pandas, NumPy, PySpark) enables data manipulation, analysis, and visualization with ease.
- Compatibility with other technologies in the Hadoop ecosystem.
Spark MLlib:
- A scalable machine learning library for Spark, providing algorithms and utilities for classification, regression, clustering, and more.
Comparison of Tableau and Power BI:
- Tableau:
- Data visualization tool known for its robust capabilities in visual analytics and dashboards.
- Power BI:
- Microsoft’s business analytics service enabling interactive visualizations and business intelligence features with an easy-to-use interface.
Real-World Big Data Application Case Study:
- Retail Analytics:
- Utilizing customer transaction data to optimize inventory levels, enhance customer experiences, and drive sales through personalized promotions.
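The retail-analytics case study above can be sketched as a small Python aggregation over hypothetical transaction data (customer IDs, products, and amounts are invented for illustration):

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, product, amount).
transactions = [
    ("C1", "laptop", 999.0),
    ("C2", "mouse", 25.0),
    ("C1", "mouse", 25.0),
    ("C2", "keyboard", 45.0),
    ("C2", "monitor", 180.0),
]

# Aggregate total spend per customer -- the raw input for personalized promotions.
spend = defaultdict(float)
for customer, _product, amount in transactions:
    spend[customer] += amount

# Rank customers by spend so the highest-value ones can be targeted first.
top_customers = sorted(spend.items(), key=lambda kv: kv[1], reverse=True)
print(top_customers)  # [('C1', 1024.0), ('C2', 250.0)]
```

At production scale the same group-and-aggregate logic would run as a Spark or MapReduce job over transaction data stored in HDFS, but the analytical idea is identical.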