BIGdata-concepts

Concepts in Big Data

  • Definition of Big Data:

    • The term has been in common use since the late 1990s to early 2000s.

    • Quantitative Definition: Information greater than 1 gigabyte (1 GB).

    • Subjective Definition: Describes the massive volume of continuously generated information, its speed, and the variety of its sources.

Characteristics of Big Data

  • 5 V's of Big Data:

    1. Volume: Amount of data (in bytes) being managed.

    2. Velocity: Speed of data acquisition and processing.

    3. Variety: Range of data types (unstructured, semi-structured, and structured).

    4. Veracity: Accuracy and trustworthiness of the data.

    5. Value: Usefulness and insight potential of the data.

Data Volume Units

  • Understanding the scale of data:

    • Kilobytes (KB): 1,024 bytes; example: a paragraph of a text document.

    • Megabytes (MB): 1,024 KB; example: two novels of 50,000 words each.

    • Gigabytes (GB): 1,024 MB; example: Beethoven's 5th Symphony in FLAC format.

    • Terabytes (TB): 1,024 GB; example: all X-rays stored in a large hospital.

    • Petabytes (PB): 1,024 TB; example: half the contents of all US academic research libraries.

    • Exabytes (EB): 1,024 PB; example: one fifth of the words ever spoken.

    • Zettabytes (ZB): 1,024 EB; example: as many grains of sand as there are on Earth’s beaches.

    • Yottabytes (YB): 1,024 ZB; example: as many atoms as in 7,000 human bodies.
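
The unit ladder above can be sketched in a few lines of Python. This is a minimal illustration that assumes the 1,024-based convention used in these notes; the names `UNITS`, `unit_size`, and `human_readable` are made up for the example.

```python
# Unit names in ascending order; each step is a factor of 1,024,
# following the convention used in the list above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def unit_size(unit: str) -> int:
    """Return the number of bytes in one of the given unit."""
    return 1024 ** UNITS.index(unit)


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at YB (the last one).
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(unit_size("GB"))               # bytes in one gigabyte
print(human_readable(5 * 1024**3))   # a 5 GB file
```

Because every step multiplies by 1,024, one yottabyte is 1,024 to the power 8 bytes, which is why the real-world examples grow so quickly from one unit to the next.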

Current and Future Data Generation

  • Daily, vast amounts of data are generated.

    • 2024 forecast: about 0.4 ZB generated per day.

    • Historical data growth: from 2 ZB in 2010 to an expected 147 ZB in 2024, a growth factor of about 74x (147 / 2 = 73.5).
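
The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted above; the compound annual growth rate is an extra derived figure added for illustration, and it assumes smooth year-on-year growth, which the notes do not claim.

```python
# Figures taken from the notes above.
ZB_2010 = 2.0      # zettabytes generated in 2010
ZB_2024 = 147.0    # zettabytes expected in 2024
YEARS = 2024 - 2010
DAYS_IN_2024 = 366  # 2024 is a leap year

# Overall growth factor between the two years.
growth_factor = ZB_2024 / ZB_2010

# Implied average yearly growth rate (assumption: steady compounding).
cagr = growth_factor ** (1 / YEARS) - 1

# Average daily volume if the 2024 total were spread evenly over the year.
daily_zb = ZB_2024 / DAYS_IN_2024

print(f"growth factor: {growth_factor:.1f}x")
print(f"implied yearly growth: {cagr:.1%}")
print(f"average per day in 2024: {daily_zb:.2f} ZB")
```

Under these assumptions the yearly growth works out to roughly 36%, and the daily average matches the 0.4 ZB forecast.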

Data Sources

  • Who generates this data?

    • Sources produce both structured and unstructured data; examples include:

      • Text, images, audio, and video on social media.

      • Click-stream data from e-commerce sites.

      • IoT devices, which generate extensive data; one widely cited forecast projected 200 billion connected devices by 2020.

      • Machine data from industrial sensors.

      • Healthcare data including electronic health records.

      • Logs from web apps providing visitor data.

      • Financial transaction data.

Characteristics Explained

Volume

  • The main challenges of big data are storage and processing.

    • Traditional databases struggle with the large volumes produced by:

      • Device measurements (e.g. temperature readings collected during COVID-19).

      • User-generated content from social media.

      • Transaction data from retail and financial sectors.

      • Video streams from services like Netflix.

Variety

  • Data types and examples:

    • Structured data: Defined format (e.g. tables).

    • Semi-structured data: Text organized by tags or keys but without a fixed schema (e.g. HTML, JSON).

    • Unstructured data: Lacks inherent structure (e.g. images, PDFs).
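
The difference between the three types shows up in how much a program can assume before reading the data. The Python sketch below is illustrative only: the sample values (`order_id`, `amount`, `user`, and so on) are invented, and the unstructured case inspects just the first four bytes of a PNG file signature.

```python
import csv
import io
import json

# Structured: every row has the same columns, so fields can be read by name.
csv_text = "order_id,amount\n1001,19.99\n1002,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys describe the data, but nesting and optional fields
# mean each lookup has to allow for missing values.
json_text = '{"user": "ana", "tags": ["sale", "new"], "address": {"city": "Lisbon"}}'
doc = json.loads(json_text)
city = doc.get("address", {}).get("city")
phone = doc.get("phone")  # absent in this record, so this is None

# Unstructured: just bytes. Without a dedicated decoder, the most we can do
# here is check the leading bytes of the file signature.
blob = bytes([0x89, 0x50, 0x4E, 0x47]) + b"...pixel data..."
is_png = blob.startswith(b"\x89PNG")

print(len(rows), round(total, 2), city, phone, is_png)
```

The more structure the data carries, the less work is needed before analysis, which is part of why variety adds cost to big data pipelines.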

Veracity

  • Accuracy, consistency, and trustworthiness are crucial.

    • Data should be validated against reliable sources before it is used.

    • GIGO principle: Garbage In, Garbage Out; flawed input data produces flawed results.
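
A minimal Python sketch of validating records before analysis, in the spirit of GIGO. The rules, the field names, and the `VALID_CURRENCIES` reference set are all hypothetical; a real pipeline would check against whatever trusted source applies to its data.

```python
# Hypothetical reference list standing in for a "reliable source".
VALID_CURRENCIES = {"USD", "EUR", "GBP"}


def validate(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks clean)."""
    problems = []
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        problems.append("amount must be a positive number")
    if record.get("currency") not in VALID_CURRENCIES:
        problems.append("unknown currency")
    return problems


# Invented sample records: one clean, three with different defects.
records = [
    {"id": 1, "amount": 25.0, "currency": "EUR"},
    {"id": 2, "amount": -3.0, "currency": "EUR"},
    {"id": 3, "amount": 10.0, "currency": "XXX"},
    {"id": 4, "currency": "USD"},
]

# Keep clean records for analysis; keep the reasons for every rejection.
clean = [r for r in records if not validate(r)]
rejected = {r["id"]: validate(r) for r in records if validate(r)}

print([r["id"] for r in clean])
print(rejected)
```

Recording why each record was rejected, rather than silently dropping it, makes it possible to trace quality problems back to their source.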

Value

  • The focus is on extracting meaningful insights for applications such as:

    • Dynamic ticket pricing in the airline industry.

    • Optimizing marketing strategies based on analysis of customer behavior.

Real-World Applications

  • Volume:

    • E-commerce transactions generate vast datasets from customer behavior.

    • Social media platforms create large volumes of user data daily.

  • Velocity:

    • Real-time stock trading systems handling high-frequency data.

    • Platforms like Netflix using real-time data for user recommendations.

  • Variety:

    • Multimedia platforms like YouTube managing diverse data types.

  • Veracity:

    • Financial fraud detection requires accuracy in transaction data analysis.

  • Value:

    • Predictive analytics in retail that personalizes customer experiences and improves operational performance.

Drivers of Big Data

  • Low costs for data acquisition and storage.

  • Effects of digitalization across various domains (e.g. documents, IoT).

  • Increased value of data as input to AI and machine learning.

  • Ability to integrate heterogeneous data from numerous sources for analysis.