Big Data Introduction

UNIT I: What is Big Data?

  • Big Data Definition

    • Big Data is defined as a large amount of data that cannot be stored or processed by conventional data storage or processing equipment.

    • The complexity and volume of data generated by human and machine activities make it difficult for humans to interpret and unsuitable for storage in a relational database for analysis.

    • However, when properly analyzed using modern tools, this massive data can provide organizations with insights that help improve business decisions.

1. INTRODUCTION OF BIG DATA

  • With the rapid growth of Internet usage, there is an exponential increase in data generation.

    • Sources of data:

    • Millions of messages on platforms like WhatsApp, Facebook, or Twitter.

    • Trillions of photos and countless hours of videos uploaded on platforms like YouTube.

  • According to recent surveys, 2.5 quintillion bytes of data are generated daily (2.5 × 10^18 bytes).

  • Characteristics of Big Data:

    • Size: Data sets are too large for conventional storage and processing systems.

    • Complexity: Data may be structured or unstructured, arriving at a high velocity.

    • 80% of the existing data has been generated in recent years.

    • Capturing this data provides little value unless transformed into business value.

    • Challenges include managing, analyzing, and converting data into valuable insights.

2. Evolution of Big Data

  • The first reference to Big Data was in a 1997 paper by NASA scientists on visualizing large data sets.

  • The concept of Big Data was later popularized by McKinsey.

  • The 3Vs of Big Data, defined by analyst Doug Laney:

    • Volume

    • Velocity

    • Variety

  • The processing life cycle encompasses several stages:

    • Acquisition

    • Preprocessing

    • Storage and Management

    • Privacy and Security

    • Analysis

    • Visualization

  • Growth of data from 600 MB in the 1950s to 100 petabytes in 2010 (100,000,000,000 MB).

3. Failure of Traditional Database in Handling Big Data

  • Relational Database Management Systems (RDBMS): Traditionally prevalent for data storage, they struggle with the massive, complex data landscape due to limitations in performance and scalability.

  • Numerous vendors provide database systems, but they encounter performance declines and increased costs with larger data sets.

  • A comparative attribute table illustrates the differences between RDBMS and Big Data.

4. The 3 Vs of Big Data

  1. Volume:

    • Big Data continuously grows due to increased data capture by businesses.

    • Measurement ranges from terabytes (TB) to zettabytes (ZB) where:

      • 1 TB = 1024 GB

      • 1 PB = 1024 TB

      • 1 EB = 1024 PB

      • 1 ZB = 1024 EB

    • Source examples: Social media, POS transactions, online banking.
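The factor-of-1024 conversions above can be sketched in a few lines of Python. The helper below is hypothetical, written only to illustrate the unit ladder from GB up to ZB:

```python
# Binary storage units, as listed above: each step up is a factor of 1024.
# `to_gigabytes` is a made-up helper for illustration, not a library API.
UNITS = ["GB", "TB", "PB", "EB", "ZB"]

def to_gigabytes(value, unit):
    """Convert a size in the given unit down to gigabytes."""
    exponent = UNITS.index(unit)      # number of 1024 steps above GB
    return value * (1024 ** exponent)

print(to_gigabytes(1, "TB"))  # 1024
print(to_gigabytes(1, "PB"))  # 1048576
```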

  2. Velocity:

    • Refers to the speed of data generation and processing.

    • Big Data's rapid flow complicates capture and analysis:

      • For example, in 60 seconds:

      • 3.3 million Facebook posts.

      • 450,000 tweets.

      • 400 hours of YouTube uploads.

  3. Variety:

    • Data format variation includes structured, semi-structured, and unstructured datasets.

    • Structured data: Organized in tables (e.g., employee details).

    • Semi-structured data: Contains tags (e.g., XML).

    • Unstructured data: Raw and unorganized (e.g., emails, photos).
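The semi-structured case above can be made concrete with a small XML example: the tags describe each field, but there is no fixed relational schema. The sample record below is invented for illustration and parsed with Python's standard-library `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

# A semi-structured record: tags label the fields, but nothing enforces
# a fixed set of columns the way a relational table would.
doc = """
<employee>
  <name>Asha</name>
  <department>Analytics</department>
</employee>
"""

root = ET.fromstring(doc)
print(root.find("name").text)        # Asha
print(root.find("department").text)  # Analytics
```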

5. Different Types of Data

  • Data can be categorized into two primary types:

    • Human-generated Data: Produced through human interactions with machines (e.g., emails, social media posts).

    • Machine-generated Data: Generated by computer applications (e.g., GPS data).

  • Types of Data:

    • Structured Data: Formatted for relational databases (e.g., company records).

    • Unstructured Data: Lacks clear organization (e.g., social media posts).

    • Semi-structured Data: Contains some organizational properties but doesn’t fit relational schemas (e.g., XML files).

6. Characteristics Of Big Data

  1. Volume: Large data sets from interactions (e.g., bookings, streaming).

  2. Velocity: Speed of data flow from various sources, requiring real-time analytics.

  3. Value: Importance of data in supporting business goals.

  4. Variety: Different data types being processed.

  5. Veracity: Consistency and accuracy of data.

  6. Validity: Ensuring lawful and ethical data sourcing (GDPR compliance).

  7. Volatility: Lifespan and changing nature of data, e.g., social media sentiment.

  8. Visualization: Presenting data in understandable formats for decision-making.

  9. Vulnerability: Risks associated with data breaches.

  10. Variability: Inconsistencies in data across sources and over time.

7. Big Challenges with Big Data

  • Issues that pose implementation hurdles:

    • Data Volume: Handling the vast quantity of data.

    • Data Variety: Processing various data types.

    • Data Velocity: Real-time processing needs.

    • Data Veracity: Ensuring quality and reliability.

    • Data Security and Privacy: Safeguarding sensitive information.

    • Data Integration: Combining data from different origins.

    • Data Analytics: Extracting insights from complex datasets.

    • Data Governance: Creating policies and systems.

8. Business Intelligence vs Big Data

  • Business Intelligence (BI): Focuses on analyzing structured data for operational insights.

  • Big Data: Centers on large, complex datasets, emphasizing predictive and prescriptive analytics.

  • Both BI and Big Data support data-driven decisions and performance evaluations, but differ significantly in handling data types, volumes, sources, and analysis approaches.

9. Difference between Data Warehouse and Hadoop

  • Data Warehouse: Structured for querying and analytical reporting.

  • Hadoop: Open-source framework for distributed data storage and processing.

10. Non-definitional traits of Big Data

  • Data Exhaust: Byproducts of digital activities useful for analysis.

  • Dark Data: Collected data not utilized for analysis, leading to wasted resources.

  • Data Quality Challenges: Negative impacts on insights due to poor quality.

  • Data Lifecycle Complexity: Challenges managing data through its life stages.

  • Integration Difficulties: Merging diverse data types can be complex.

  • Data Sensitivity and Privacy: Addressing ethical and legal challenges.

  • Infrastructure Costs: The requirement of scalable technologies incurs costs.

11. Big Data Infrastructure

  • Key components include:

    • Hadoop: Open-source framework for data storage on commodity hardware.

    • HDFS: Distributed file system for efficient storage of large datasets.

    • MapReduce: Enables distributed processing of data across multiple machines.
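The MapReduce pattern named above can be sketched as a toy, single-process word count (a real Hadoop job would distribute the map and reduce phases across machines; everything here, including the sample lines, is illustrative):

```python
from collections import defaultdict

# Map phase: each input line is turned into (word, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Reduce phase: counts grouped under each word are summed.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insights", "data drives decisions"]

# "Shuffle" step: group all mapper output by key (the word).
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

print(reduce_phase(grouped))  # {'big': 2, 'data': 2, ...}
```

In a distributed setting, the framework partitions the input among mapper nodes and routes each key to one reducer; the per-phase logic stays this simple.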

12. Big Data Life Cycle

  • Phases include data generation, aggregation, preprocessing, analytics, and visualization.

    • Challenges persist across the lifecycle, but supporting technologies address data handling effectively.

    • Tools like MapReduce and HDFS facilitate processing and meaningful insights extraction for effective decision-making.