Big Data Introduction
UNIT I: What is Big Data?
Big Data Definition
Big Data is defined as data too large to be stored or processed by conventional data storage or processing equipment.
The scale and complexity of data generated by human and machine activities make it difficult for humans to interpret and a poor fit for a relational database for analysis.
However, when properly analyzed using modern tools, this massive data can provide organizations with insights that help improve business decisions.
1. INTRODUCTION OF BIG DATA
With the rapid growth of Internet usage, there is an exponential increase in data generation.
Sources of data:
Millions of messages on platforms like WhatsApp, Facebook, or Twitter.
Trillions of photos and countless hours of videos uploaded on platforms like YouTube.
According to recent surveys, 2.5 quintillion bytes of data ($2.5 \times 10^{18}$ bytes) are generated daily.
Characteristics of Big Data:
Size: Data sets are too large to store or process on a single conventional machine.
Complexity: Data may be structured or unstructured, arriving at a high velocity.
80% of the existing data has been generated in recent years.
Captured data provides little benefit unless it is transformed into business value.
Challenges include managing, analyzing, and converting data into valuable insights.
2. Evolution of Big Data
The first reference to Big Data was in a 1997 paper by NASA scientists on visualizing large data sets.
The concept of Big Data was later popularized by McKinsey.
The 3Vs of Big Data, defined by analyst Doug Laney:
Volume
Velocity
Variety
The processing life cycle encompasses several stages:
Acquisition
Preprocessing
Storage and Management
Privacy and Security
Analyzing
Visualization
Growth of data from 600 MB in the 1950s to 100 petabytes in 2010 (100 PB = 100,000,000,000 MB).
3. Failure of Traditional Database in Handling Big Data
Relational Database Management Systems (RDBMS): Traditionally prevalent for data storage, they struggle with the massive, complex data landscape due to limitations in performance and scalability.
Numerous vendors provide database systems, but they encounter performance declines and increased costs with larger data sets.
A comparative attribute table illustrates the differences between RDBMS and Big Data.
4. The 3 Vs of Big Data
Volume:
Big Data continuously grows due to increased data capture by businesses.
Measurement ranges from terabytes (TB) to zettabytes (ZB).
Source examples: Social media, POS transactions, online banking.
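The span from terabytes to zettabytes covers nine orders of magnitude. A minimal sketch of the decimal (SI) byte-unit ladder, assuming 1000-based prefixes rather than the 1024-based binary units some systems use:

```python
# Decimal (SI) byte units, each 1000x the previous.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given SI unit to bytes."""
    exponent = 3 * (UNITS.index(unit) + 1)  # KB = 10^3, MB = 10^6, ...
    return value * 10 ** exponent

# 1 ZB is a billion TB:
print(to_bytes(1, "ZB") // to_bytes(1, "TB"))  # → 1000000000
```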
Velocity:
Refers to the speed of data generation and processing.
Big Data's rapid flow complicates capture and analysis:
For example, in 60 seconds:
3.3 million Facebook posts.
450,000 tweets.
400 hours of YouTube uploads.
Variety:
Data format variation includes structured, semi-structured, and unstructured datasets.
Structured data: Organized in tables (e.g., employee details).
Semi-structured data: Contains tags (e.g., XML).
Unstructured data: Raw and unorganized (e.g., emails, photos).
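The three formats above can be contrasted in a short sketch (the records and field names are invented for illustration):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, fits directly into a relational table.
structured = io.StringIO("id,name,dept\n1,Asha,Sales\n2,Ravi,HR\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tags describe the data, but the schema is flexible.
semi = ET.fromstring("<employee><name>Asha</name><dept>Sales</dept></employee>")
name = semi.find("name").text

# Unstructured: free text with no inherent schema.
unstructured = "Met the client today; they were happy with the demo."

print(rows[0]["name"], name, len(unstructured.split()))
```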
5. Different Types of Data
Data can be categorized into two primary types:
Human-generated Data: Produced through human interactions with machines (e.g., emails, social media posts).
Machine-generated Data: Generated by computer applications (e.g., GPS data).
Types of Data:
Structured Data: Formatted for relational databases (e.g., company records).
Unstructured Data: Lacks clear organization (e.g., social media posts).
Semi-structured Data: Contains some organizational properties but doesn’t fit relational schemas (e.g., XML files).
6. Characteristics Of Big Data
Volume: Large data sets from interactions (e.g., bookings, streaming).
Velocity: Speed of data flow from various sources, requiring real-time analytics.
Value: Importance of data in supporting business goals.
Variety: Different data types being processed.
Veracity: Consistency and accuracy of data.
Validity: Ensuring lawful and ethical data sourcing (GDPR compliance).
Volatility: Lifespan and changing nature of data, e.g., social media sentiment.
Visualization: Presenting data in understandable formats for decision-making.
Vulnerability: Risks associated with data breaches.
Variability: Inconsistencies in data across sources and over time.
7. Big Challenges with Big Data
Issues that pose implementation hurdles:
Data Volume: Handling the vast quantity of data.
Data Variety: Processing various data types.
Data Velocity: Real-time processing needs.
Data Veracity: Ensuring quality and reliability.
Data Security and Privacy: Safeguarding sensitive information.
Data Integration: Combining data from different origins.
Data Analytics: Extracting insights given the complexity of datasets.
Data Governance: Creating policies and systems.
8. Business Intelligence vs Big Data
Business Intelligence (BI): Focuses on analyzing structured data for operational insights.
Big Data: Centers on large, complex datasets, emphasizing predictive and prescriptive analytics.
Both BI and Big Data support data-driven decisions and performance evaluations, but differ significantly in handling data types, volumes, sources, and analysis approaches.
9. Difference between Data Warehouse and Hadoop
Data Warehouse: Structured for querying and analytical reporting.
Hadoop: Open-source framework for distributed data storage and processing.
10. Non-definitional traits of Big Data
Data Exhaust: Byproducts of digital activities useful for analysis.
Dark Data: Collected data not utilized for analysis, leading to wasted resources.
Data Quality Challenges: Negative impacts on insights due to poor quality.
Data Lifecycle Complexity: Challenges managing data through its life stages.
Integration Difficulties: Merging diverse data types can be complex.
Data Sensitivity and Privacy: Addressing ethical and legal challenges.
Infrastructure Costs: The requirement of scalable technologies incurs costs.
11. Big Data Infrastructure
Key components include:
Hadoop: Open-source framework for data storage on commodity hardware.
HDFS: Distributed file system for efficient storage of large datasets.
MapReduce: Enables distributed processing of data across multiple machines.
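The MapReduce idea can be sketched as a pure-Python word count (this illustrates the paradigm only, not the Hadoop API itself): the map step emits key-value pairs, and the reduce step aggregates values per key after an implicit shuffle.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per key, as Hadoop does after shuffling."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data big insights", "data drives decisions"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))  # e.g. {'big': 2, 'data': 2, ...}
```

In real Hadoop, the map and reduce functions run in parallel on different machines, and the framework handles shuffling pairs with the same key to the same reducer.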
12. Big Data Life Cycle
Phases include data generation, aggregation, preprocessing, analytics, and visualization.
Challenges persist across the lifecycle, but supporting technologies address data handling effectively.
Tools like MapReduce and HDFS facilitate processing and the extraction of meaningful insights for effective decision-making.