Big Data

  • Big data is a collection of large, complex data sets that are difficult to process using traditional tools.

  • Challenges include: capture, storage, search, sharing, transfer, analysis, and visualization.

  • Every day over $3.3 \times 10^{18}$ bytes of data are created.

  • Forecasts predict that 181 zettabytes of data will be created, captured, copied, and consumed globally in 2025.

  • What is considered "big data" varies depending on the organization's capabilities.
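
To put the figures above on a common scale, the following is a minimal sketch of decimal (SI) byte-unit conversion. The unit table uses the standard SI prefixes; the helper names `to_bytes` and `from_bytes` are illustrative and not taken from the notes.

```python
# Decimal (SI) byte units. The helper names below are illustrative assumptions.
UNITS = {
    "kB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,  # exabyte
    "ZB": 10**21,  # zettabyte
}

def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes."""
    return value * UNITS[unit]

def from_bytes(n_bytes, unit):
    """Convert a byte count to the given unit."""
    return n_bytes / UNITS[unit]

# 3.3 x 10^18 bytes, written as an exact integer.
daily_bytes = 33 * 10**17
daily_exabytes = from_bytes(daily_bytes, "EB")

# 181 zettabytes expressed in bytes.
forecast_2025_bytes = to_bytes(181, "ZB")

print(daily_exabytes, "EB per day")
print(forecast_2025_bytes, "bytes forecast for 2025")
```

One exabyte is $10^{18}$ bytes and one zettabyte is $10^{21}$ bytes, so the daily figure is 3.3 exabytes and the 2025 forecast is $1.81 \times 10^{23}$ bytes.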

Sources of Big Data

  • Archives (Historical records)

  • Documents (e.g., Email, Word, PDF)

  • Data from business apps (ERP, CRM, HR)

  • Public data (Government websites)

  • Social media (Twitter, Facebook)

  • Machine log data (Call details, event logs)

  • Media (Images, audio, video)

  • Sensor data (Process control devices, smart meters)

Characteristics of Big Data (4Vs)

  • Volume: Amount of data.

  • Velocity: Speed at which data is created and stored.

  • Variety: Different forms of data (structured, semi-structured, unstructured).

  • Veracity: Quality and trustworthiness of data.
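
The three forms of data named under Variety can be illustrated with a short sketch. All sample values here (customer IDs, amounts, the city, the note text) are made up for illustration and do not come from the notes.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (e.g. a CSV export).
structured_source = io.StringIO("customer_id,amount\n101,25.50\n102,13.00\n")
rows = list(csv.DictReader(structured_source))

# Semi-structured: self-describing, but fields and nesting may vary per record.
semi_structured = json.loads(
    '{"customer_id": 101, "tags": ["vip"], "address": {"city": "Oslo"}}'
)

# Unstructured: free text with no predefined schema; meaning must be extracted.
unstructured = "Customer 101 called to say the delivery arrived late."
word_count = len(unstructured.split())

print(rows[0]["amount"])
print(semi_structured["address"]["city"])
print(word_count)
```

The structured rows can be queried by column name directly, the JSON record needs its nesting navigated, and the free text offers nothing beyond raw words until further processing is applied.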

Challenges of Big Data

  • Choosing what data to store.

  • Where and how to store the data.

  • Finding relevant data for decision-making.

  • Deriving value from the data.

  • Protecting data from unauthorized access.

Data Integration and Data Warehousing

  • Data Rich, Information Poor: Organizations have a lot of data but lack processes to turn it into meaningful information.

  • Solution: Data Integration

    • Improves the quality of business decisions.

    • Provides reliable, consistent, understandable, and easily manipulated data for analysis.
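
As a rough illustration of what data integration involves, the sketch below merges records from two hypothetical systems (a CRM and an ERP) that describe the same customer with different field names and date formats. The record layouts and function names are assumptions made for this example only.

```python
from datetime import datetime

# Hypothetical records from two systems describing the same customer differently.
crm_records = [{"CustomerID": "101", "Name": " alice smith ", "Joined": "2023-01-15"}]
erp_records = [{"cust_no": 101, "total_spend": "250.00", "last_order": "15/03/2024"}]

def normalize_crm(rec):
    """Map a CRM record onto the shared schema."""
    return {
        "customer_id": int(rec["CustomerID"]),
        "name": rec["Name"].strip().title(),
        "joined": datetime.strptime(rec["Joined"], "%Y-%m-%d").date(),
    }

def normalize_erp(rec):
    """Map an ERP record onto the shared schema."""
    return {
        "customer_id": int(rec["cust_no"]),
        "total_spend": float(rec["total_spend"]),
        "last_order": datetime.strptime(rec["last_order"], "%d/%m/%Y").date(),
    }

def integrate(crm, erp):
    """Combine both sources into one record per customer, keyed by customer_id."""
    merged = {}
    for rec in map(normalize_crm, crm):
        merged[rec["customer_id"]] = rec
    for rec in map(normalize_erp, erp):
        merged.setdefault(rec["customer_id"], {"customer_id": rec["customer_id"]}).update(rec)
    return merged

customers = integrate(crm_records, erp_records)
print(customers[101])
```

After integration, each customer has a single record with consistent keys, types, and date handling, which is what makes the data reliable and easy to manipulate for analysis.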

Data Warehouse

  • A large database that collects business information from many sources to support management decision-making.

  • Data Sources:

    • Internal operational systems.

    • External data purchased from outside sources.

    • Data from social networking.

    • Clickstream data.
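
The idea of collecting data from many sources into one database can be sketched with Python's built-in `sqlite3` module. The table name `sales_facts`, its columns, and the sample rows are hypothetical; a production warehouse would use a dedicated platform and a far richer schema.

```python
import sqlite3

# In-memory database standing in for the warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_facts (source TEXT, region TEXT, amount REAL)")

# Hypothetical batches from three kinds of sources.
internal_sales = [("internal", "North", 1200.0), ("internal", "South", 800.0)]
purchased_data = [("external", "North", 300.0)]
clickstream_orders = [("clickstream", "South", 50.0)]

# Load every batch into the same table so it can be queried together.
for batch in (internal_sales, purchased_data, clickstream_orders):
    conn.executemany("INSERT INTO sales_facts VALUES (?, ?, ?)", batch)

# A management-style question: total amount per region across all sources.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales_facts GROUP BY region ORDER BY region"
).fetchall()

print(totals)
```

Because all sources land in one place, a single query answers a question that would otherwise require consulting each system separately.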

Data Marts and Data Lakes

  • Data mart: A subset of a data warehouse that supports decision-making in a small to medium-sized business or in a single department.

  • Data lake: Stores all data in its raw, unaltered form.

    • Raw data is available whenever it is needed for analysis.
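
The contrast between the two can be sketched in a few lines. This is a simplified illustration, assuming in-memory lists in place of real storage; the source names, sample payloads, and helper functions are all hypothetical.

```python
import json

# Data lake: keep every payload exactly as received; interpret it only when read.
data_lake = []

def ingest(raw, source):
    """Store the raw payload unaltered, tagged with where it came from."""
    data_lake.append({"source": source, "raw": raw})

ingest('{"dept": "sales", "amount": 120}', "orders_api")
ingest("2024-03-15 10:02:11 ERROR timeout", "app_log")

def read_json(source):
    """Apply structure at read time, only to payloads from the given source."""
    return [json.loads(item["raw"]) for item in data_lake if item["source"] == source]

# Data mart: a department-specific subset of already-curated warehouse rows.
warehouse = [
    {"dept": "sales", "metric": "revenue", "value": 1500},
    {"dept": "hr", "metric": "headcount", "value": 42},
    {"dept": "sales", "metric": "orders", "value": 37},
]

def build_data_mart(rows, dept):
    """Select only the rows relevant to one department."""
    return [row for row in rows if row["dept"] == dept]

sales_mart = build_data_mart(warehouse, "sales")

print(read_json("orders_api"))
print(sales_mart)
```

The lake accepts a JSON order and a plain log line without caring about their shape, while the mart exposes only the cleaned rows that one department needs.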