Definition of Big Data:
The term has been in common use since the late 1990s and early 2000s.
Quantitative Definition: Information greater than 1 gigabyte (1 GB).
Subjective Definition: Describes the massive volume of continuously generated information, its speed, and the variety of its sources.
5 V's of Big Data:
Volume: Amount of data (in bytes) being managed.
Velocity: Speed of data acquisition and processing.
Variety: Range of data types (unstructured, semi-structured, and structured).
Veracity: Accuracy and trustworthiness of the data.
Value: Usefulness and insight potential of the data.
Understanding the scale of data:
Kilobytes (KB): 1,024 bytes; example: a paragraph of a text document.
Megabytes (MB): 1,024 KB; example: two novels of 50,000 words each.
Gigabytes (GB): 1,024 MB; example: Beethoven's 5th Symphony in FLAC format.
Terabytes (TB): 1,024 GB; example: all X-rays stored in a large hospital.
Petabytes (PB): 1,024 TB; example: half the contents of all US academic research libraries.
Exabytes (EB): 1,024 PB; example: one fifth of the words ever spoken.
Zettabytes (ZB): 1,024 EB; example: as many grains of sand as there are on Earth’s beaches.
Yottabytes (YB): 1,024 ZB; example: as many atoms as in 7,000 human bodies.
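The unit ladder above can be sketched in a few lines of Python. This is a minimal illustration that assumes the 1,024-based steps used in these notes; the names UNITS, unit_in_bytes, and human_readable are hypothetical helpers, not part of any library.

```python
# Unit names in ascending order; each step is a factor of 1,024,
# following the convention used in the notes above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def unit_in_bytes(unit: str) -> int:
    """Return the size of one unit in bytes, e.g. 'MB' -> 1024 ** 2."""
    return 1024 ** UNITS.index(unit)


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or when no larger unit exists.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


if __name__ == "__main__":
    print(unit_in_bytes("GB"))         # 1073741824
    print(human_readable(1536))        # 1.50 KB
    print(human_readable(1024 ** 8))   # 1.00 YB
```

Storing the names in an ordered list means the exponent is simply the list index, so adding a larger unit needs no other change.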
Vast amounts of data are generated every day.
2024 forecast: 0.4 ZB generated per day.
Historical data growth: 2 ZB in 2010 to an expected 147 ZB in 2024, a growth factor of ~74x.
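The figures above can be checked with simple arithmetic. The sketch below assumes the 147 ZB forecast is the total generated during 2024, so dividing by the number of days gives the daily rate. The compound annual growth rate is an extra derived figure, not a number from the notes.

```python
# Figures taken from the notes above.
DATA_2010_ZB = 2
DATA_2024_ZB = 147   # forecast
DAYS_IN_2024 = 366   # 2024 is a leap year

# Overall growth factor between 2010 and 2024.
growth_factor = DATA_2024_ZB / DATA_2010_ZB       # 73.5, i.e. ~74x

# Average daily volume, assuming 147 ZB is the total for the year.
per_day_zb = DATA_2024_ZB / DAYS_IN_2024          # about 0.4 ZB per day

# Derived for illustration: the constant yearly growth rate that would
# turn 2 ZB into 147 ZB over 14 years.
years = 2024 - 2010
annual_growth = growth_factor ** (1 / years) - 1  # roughly 36% per year

if __name__ == "__main__":
    print(f"growth factor: {growth_factor:.1f}x")
    print(f"per day:       {per_day_zb:.2f} ZB")
    print(f"annual growth: {annual_growth:.1%}")
```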
Who generates this data?:
Sources produce both structured and unstructured data; examples include:
Text, images, audio, and video on social media.
Click-stream data from e-commerce sites.
IoT devices (an estimated 200 billion in 2020), which generate extensive data.
Machine data from industrial sensors.
Healthcare data including electronic health records.
Logs from web apps providing visitor data.
Financial transaction data.
Big data challenges include storage and processing.
Traditional databases struggle with the large volumes coming from:
Device measurements (e.g. COVID-19 temperature data).
User-generated content from social media.
Transaction data from retail and financial sectors.
Video streams from services like Netflix.
Data types and examples:
Structured data: Defined format (e.g. tables).
Semi-structured data: Textual data with patterns (e.g. HTML, JSON).
Unstructured data: Lacks inherent structure (e.g. images, PDFs).
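The three data types above can be illustrated with Python's standard library. This is only a sketch: the sample records (order IDs, the user "alice", the device field) are invented for illustration.

```python
import csv
import io
import json

# Structured: every row has the same named columns, so fields can be
# addressed directly and aggregated.
csv_text = "order_id,amount\n1001,19.99\n1002,5.50\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing keys, but nesting and optional fields
# mean each record may have a different shape.
json_text = '{"user": "alice", "tags": ["sale", "mobile"], "device": {"os": "android"}}'
event = json.loads(json_text)
device_os = event.get("device", {}).get("os")    # may be absent in other records
referrer = event.get("referrer")                 # missing here, so None

# Unstructured: just bytes. Nothing in the data itself names fields;
# meaning has to come from a decoder (image codec, PDF parser, ...).
blob = bytes([0x89, 0x50, 0x4E, 0x47])           # first four bytes of a PNG file

if __name__ == "__main__":
    print(len(rows), round(total, 2))
    print(device_os, referrer)
    print(len(blob), "raw bytes")
```

The structured rows can be summed straight away, the JSON record needs defensive lookups, and the byte string needs an external decoder before any analysis is possible.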
Veracity: accuracy, consistency, and trustworthiness of the data are crucial.
Data should be validated against reliable sources.
GIGO principle: Garbage In Garbage Out.
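A small example shows GIGO in practice. The readings below are hypothetical body-temperature values in degrees Celsius, and the 30 to 45 plausibility range is an assumed validation rule, not a clinical standard.

```python
from statistics import mean

# Hypothetical sensor readings in degrees Celsius. -999.0 stands in for a
# "missing value" sentinel and 412.0 for a misplaced decimal point.
readings = [36.6, 37.1, -999.0, 36.9, 412.0]


def is_plausible(temp_c: float) -> bool:
    """Assumed validation rule: keep only physically plausible body temperatures."""
    return 30.0 <= temp_c <= 45.0


clean = [r for r in readings if is_plausible(r)]

naive_mean = mean(readings)   # garbage in: the two bad values dominate the result
clean_mean = mean(clean)      # after validation: a believable average

if __name__ == "__main__":
    print(f"without validation: {naive_mean:.2f}")
    print(f"with validation:    {clean_mean:.2f}")
```

Two bad values out of five are enough to push the unvalidated average below zero, which is why validation comes before analysis.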
Value: the focus is on extracting meaningful insights for applications such as:
Dynamic ticket pricing in the airline industry.
Optimizing marketing strategies based on analysis of customer behavior.
Volume:
E-commerce transactions generate vast datasets from customer behavior.
Social media platforms create large volumes of user data daily.
Velocity:
Real-time stock trading systems handling high-frequency data.
Platforms like Netflix using immediate data for user recommendations.
Variety:
Multimedia platforms like YouTube managing diverse data types.
Veracity:
Financial fraud detection requires accuracy in transaction data analysis.
Value:
Predictive analytics in retail that personalizes customer experiences and enhances operational performance.
Low costs for data acquisition and storage.
Effects of digitalization across various domains (documents, IoT).
Increased effectiveness of data in AI and machine learning contexts.
Ability to integrate heterogeneous data from numerous sources for analysis.