Big Data Introduction
UNIT I: What is Big Data?
Big Data Definition
Big Data is defined as data too large to be stored or processed by conventional data storage or processing equipment.
The scale and complexity of data generated by human and machine activities make it difficult for humans to interpret and a poor fit for a relational database for analysis.
However, when properly analyzed using modern tools, this massive data can provide organizations with insights that help improve business decisions.
1. INTRODUCTION OF BIG DATA
With the rapid growth of Internet usage, there is an exponential increase in data generation.
Sources of data:
Millions of messages on platforms like WhatsApp, Facebook, or Twitter.
Trillions of photos and countless hours of videos uploaded on platforms like YouTube.
According to recent surveys, 2.5 quintillion bytes of data ($2.5 \times 10^{18}$ bytes) are generated daily.
Characteristics of Big Data:
Size: Data sets are too large to store or process on a single conventional machine.
Complexity: Data may be structured or unstructured, arriving at a high velocity.
80% of the existing data has been generated in recent years.
Captured data provides little benefit unless it is transformed into business value.
Challenges include managing, analyzing, and converting data into valuable insights.
2. Evolution of Big Data
The first reference to Big Data was in a 1997 paper by NASA scientists on visualizing large data sets.
The concept of Big Data was later popularized by McKinsey.
The 3Vs of Big Data, defined by analyst Doug Laney:
Volume
Velocity
Variety
The processing life cycle encompasses several stages:
Acquisition
Preprocessing
Storage and Management
Privacy and Security
Analyzing
Visualization
Growth of data from 600 MB in the 1950s to 100 petabytes in 2010 (100 PB = 100,000,000,000 MB).
3. Failure of Traditional Database in Handling Big Data
Relational Database Management Systems (RDBMS): Traditionally prevalent for data storage, they struggle with the massive, complex data landscape due to limitations in performance and scalability.
Numerous vendors provide database systems, but they encounter performance declines and increased costs with larger data sets.
A comparative attribute table illustrates the differences between RDBMS and Big Data.
4. The 3 Vs of Big Data
Volume:
Big Data continuously grows due to increased data capture by businesses.
Measurement ranges from terabytes (TB) to zettabytes (ZB).
Source examples: Social media, POS transactions, online banking.
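The span from terabytes to zettabytes covers nine orders of magnitude. A minimal sketch of the decimal (SI) byte-unit ladder, assuming 1000-based prefixes rather than the 1024-based binary units some systems use:

```python
# Decimal (SI) byte units, each 1000x the previous.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given SI unit to bytes."""
    exponent = 3 * (UNITS.index(unit) + 1)  # KB = 10^3, MB = 10^6, ...
    return value * 10 ** exponent

# 1 ZB is a billion TB:
print(to_bytes(1, "ZB") // to_bytes(1, "TB"))  # → 1000000000
```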
Velocity:
Refers to the speed of data generation and processing.
Big Data's rapid flow complicates capture and analysis:
For example, in 60 seconds:
3.3 million Facebook posts.
450,000 tweets.
400 hours of YouTube uploads.
Variety:
Data format variation includes structured, semi-structured, and unstructured datasets.
Structured data: Organized in tables (e.g., employee details).
Semi-structured data: Contains tags (e.g., XML).
Unstructured data: Raw and unorganized (e.g., emails, photos).
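The three formats above can be contrasted in a short sketch (the records and field names are invented for illustration):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, fits directly into a relational table.
structured = io.StringIO("id,name,dept\n1,Asha,Sales\n2,Ravi,HR\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tags describe the data, but the schema is flexible.
semi = ET.fromstring("<employee><name>Asha</name><dept>Sales</dept></employee>")
name = semi.find("name").text

# Unstructured: free text with no inherent schema.
unstructured = "Met the client today; they were happy with the demo."

print(rows[0]["name"], name, len(unstructured.split()))
```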
5. Different Types of Data
Data can be categorized into two primary types:
Human-generated Data: Produced through human interactions with machines (e.g., emails, social media posts).
Machine-generated Data: Generated by computer applications (e.g., GPS data).
Types of Data:
Structured Data: Formatted for relational databases (e.g., company records).
Unstructured Data: Lacks clear organization (e.g., social media posts).
Semi-structured Data: Contains some organizational properties but doesn’t fit relational schemas (e.g., XML files).
6. Characteristics Of Big Data
Volume: Large data sets from interactions (e.g., bookings, streaming).
Velocity: Speed of data flow from various sources, requiring real-time analytics.
Value: Importance of data in supporting business goals.
Variety: Different data types being processed.
Veracity: Consistency and accuracy of data.
Validity: Ensuring lawful and ethical data sourcing (GDPR compliance).
Volatility: Lifespan and changing nature of data, e.g., social media sentiment.
Visualization: Presenting data in understandable formats for decision-making.
Vulnerability: Risks associated with data breaches.
Variability: Inconsistencies in data across sources and over time.
7. Big Challenges with Big Data
Issues that pose implementation hurdles:
Data Volume: Handling the vast quantity of data.
Data Variety: Processing various data types.
Data Velocity: Real-time processing needs.
Data Veracity: Ensuring quality and reliability.
Data Security and Privacy: Safeguarding sensitive information.
Data Integration: Combining data from different origins.
Data Analytics: Extracting insights given the complexity of datasets.
Data Governance: Creating policies and systems.
8. Business Intelligence vs Big Data
Business Intelligence (BI): Focuses on analyzing structured data for operational insights.
Big Data: Centers on large, complex datasets, emphasizing predictive and prescriptive analytics.
Both BI and Big Data support data-driven decisions and performance evaluations, but differ significantly in handling data types, volumes, sources, and analysis approaches.
9. Difference between Data Warehouse and Hadoop
Data Warehouse: Structured for querying and analytical reporting.
Hadoop: Open-source framework for distributed data storage and processing.
10. Non-definitional traits of Big Data
Data Exhaust: Byproducts of digital activities useful for analysis.
Dark Data: Collected data not utilized for analysis, leading to wasted resources.
Data Quality Challenges: Negative impacts on insights due to poor quality.
Data Lifecycle Complexity: Challenges managing data through its life stages.
Integration Difficulties: Merging diverse data types can be complex.
Data Sensitivity and Privacy: Addressing ethical and legal challenges.
Infrastructure Costs: The requirement of scalable technologies incurs costs.
11. Big Data Infrastructure
Key components include:
Hadoop: Open-source framework for data storage on commodity hardware.
HDFS: Distributed file system for efficient storage of large datasets.
MapReduce: Enables distributed processing of data across multiple machines.
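The MapReduce idea can be sketched as a pure-Python word count (this illustrates the paradigm only, not the Hadoop API itself): the map step emits key-value pairs, and the reduce step aggregates values per key after an implicit shuffle.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per key, as Hadoop does after shuffling."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data big insights", "data drives decisions"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))  # e.g. {'big': 2, 'data': 2, ...}
```

In real Hadoop, the map and reduce functions run in parallel on different machines, and the framework handles shuffling pairs with the same key to the same reducer.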
12. Big Data Life Cycle
Phases include data generation, aggregation, preprocessing, analytics, and visualization.
Challenges persist across the lifecycle, but supporting technologies address data handling effectively.
Tools like MapReduce and HDFS facilitate processing and the extraction of meaningful insights for effective decision-making.