Big Data Analytics Notes

Big Data Analytics

Introduction

  • Data Definition: According to the Merriam-Webster Dictionary, data is defined as:
    • Information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.
    • Information in digital form that can be transmitted or processed.
    • Information that has been translated into a form that is efficient for movement or processing in computing.

Data Generation

  • Daily Data Production: Over the past decade, global data production has increased dramatically due to the digital transformation of societies worldwide.
  • Amount of Data Created Daily: Approximately 402.74 million terabytes of data are generated every day.
    • Drivers: cloud computing, streaming services, and connected devices.

Storing Data: Types of File Sizes

  • Computers represent data (video, images, sound, text) as binary values (1s and 0s).
  • A bit is the smallest unit representing a single binary value.
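  • Storage and memory are typically measured in megabytes and gigabytes, but the scale continues to increase.
    • "Brontobyte" is an unofficial term for roughly 10^{27} bytes (2^{90} bytes under the 1024-based convention used below).
  • Units of Data Measurement (binary convention, where each unit is 1024 times the previous one; the SI decimal convention uses 1000 instead):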
    • 1 byte = 8 bits
    • 1 kilobyte = 1024 bytes
    • 1 megabyte = 1024 kilobytes
    • 1 gigabyte = 1024 megabytes
    • 1 terabyte = 1024 gigabytes
    • 1 petabyte = 1024 terabytes
    • 1 exabyte = 1024 petabytes
    • 1 zettabyte = 1024 exabytes
    • 1 yottabyte = 1024 zettabytes
    • 1 brontobyte = 1024 yottabytes
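
As a rough illustration of the table above, the following Python sketch converts between units using the same 1024-based factors. The names `UNITS`, `to_bytes`, and `humanize` are made up for this example. It also assumes that the daily figure of 402.74 million terabytes is read under the same 1024-based convention, which the notes do not specify.

```python
# Unit names in ascending order; each is 1024 times the previous one,
# matching the 1024-based table in these notes.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
         "petabytes", "exabytes", "zettabytes", "yottabytes"]


def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes (1024-based)."""
    return value * 1024 ** UNITS.index(unit)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return value, unit
        value /= 1024


# Assumption: the 402.74 million terabytes per day uses 1024-based terabytes.
daily_bytes = to_bytes(402.74e6, "terabytes")
value, unit = humanize(daily_bytes)
print(f"{value:.2f} {unit} per day")
```

Under that assumption, the daily figure works out to roughly 384 exabytes.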

Big Data

  • Extremely large, complex datasets that cannot be processed using traditional data management tools.
  • Generated rapidly from sources like social media, IoT devices, transactions, and sensors.

Characteristics of Big Data (5 V's)

  • Volume
  • Velocity
  • Variety
  • Veracity
  • Value

Volume

  • Massive amount of data.
  • Big data is data whose volume is so large that it cannot fit on a single machine.
  • Specialized tools and frameworks are needed to store, process, and analyze it.
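
To make the "cannot fit on a single machine" idea concrete, here is a minimal single-process sketch of the map-and-reduce pattern that frameworks such as Hadoop apply across many machines. The partitions, `map_partition`, and `reduce_counts` are hypothetical stand-ins; a real framework would also handle distribution, storage, and failures.

```python
from collections import Counter


def map_partition(lines):
    """Count words within one partition (the 'map' step)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Merge per-partition counts into one result (the 'reduce' step)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical data: each inner list stands for a chunk held on a different machine.
partitions = [
    ["big data is big"],
    ["data moves fast", "big tools help"],
]

word_counts = reduce_counts(map_partition(p) for p in partitions)
print(word_counts.most_common(2))
```

Each partition is processed independently, so only the small per-partition summaries need to be brought together at the end.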

Velocity

  • Fast generation and processing of data (real-time or nearly real-time).
  • Describes how quickly data flows from various sources and how fast it needs to be handled to remain useful.
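
One simple way to quantify velocity is to count how many events arrive within a trailing time window. The sketch below assumes events come with timestamps in seconds and in arrival order; `events_in_window` and the sample stream are illustrative only.

```python
from collections import deque


def events_in_window(timestamps, window_seconds):
    """Yield (timestamp, count) where count is the number of events
    that arrived within the trailing window ending at that timestamp."""
    window = deque()
    for ts in timestamps:
        window.append(ts)
        # Drop events that have fallen out of the trailing window.
        while window[0] <= ts - window_seconds:
            window.popleft()
        yield ts, len(window)


# Hypothetical arrival times (seconds), already sorted.
stream = [0.0, 0.5, 0.9, 3.0, 3.1, 10.0]
counts = list(events_in_window(stream, window_seconds=1.0))
for ts, count in counts:
    print(f"t={ts:>4}: {count} event(s) in the last second")
```

Because the function is a generator, it processes events one at a time and keeps only the current window in memory.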

Variety

  • Different data types: structured, unstructured, or semi-structured.
  • Includes text data, images, audio, video, and sensor data.

Types of Big Data

  • Unstructured:
    • Has no inherent format and is stored as various file types.
    • Examples: texts, PDFs, images, and videos
  • Semi-structured:
    • Data files with a recognizable pattern that enables analysis.
    • Uses tags or markers to separate elements.
    • Examples: XML and JSON files; spreadsheets are sometimes placed in this category as well
  • Structured:
    • Has a defined data model, format, and structure.
    • Follows a strict schema of rows and columns, which makes the data easy to process.
    • Examples: SQL Databases
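
The difference between the three types shows up in how much work it takes to read them. The following sketch uses small made-up samples: CSV rows for structured data, JSON for semi-structured data, and a free-text review for unstructured data.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = io.StringIO("order_id,amount\n1,19.99\n2,5.50\n")
rows = list(csv.DictReader(structured))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys act as markers, but records may differ in shape.
semi_structured = '[{"user": "ana", "tags": ["sale"]}, {"user": "ben"}]'
records = json.loads(semi_structured)
tag_counts = [len(record.get("tags", [])) for record in records]

# Unstructured: free text with no schema, so any structure must be imposed by us.
unstructured = "Great jacket, fast delivery. Would buy again!"
word_count = len(unstructured.split())

print(f"total={total:.2f}, tag_counts={tag_counts}, word_count={word_count}")
```

Note that the JSON reader has to tolerate a missing `tags` key, while the CSV reader can rely on every row having the same columns.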

Veracity

  • Data accuracy and reliability.
  • Refers to the quality, accuracy, integrity, and credibility of data.
  • Assesses how accurate, consistent, and free from errors the data is.
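
A basic veracity check can be as simple as counting missing values and duplicate records. The function `quality_report` and the sensor readings below are hypothetical; real data quality checks are usually more extensive.

```python
def quality_report(records, required_fields):
    """Count missing values and exact duplicate records in a list of dicts."""
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for record in records:
        for field in required_fields:
            if record.get(field) in (None, ""):
                missing[field] += 1
        # Sorting by key gives every identical record the same hashable key.
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical sensor readings with one missing value and one duplicate.
records = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},
    {"id": 1, "temp_c": 21.5},
]
report = quality_report(records, ["id", "temp_c"])
print(report)
```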

Value

  • Extracting meaningful insights used in decision-making.
  • Value refers to how useful, important, and impactful the data is once it is turned into a benefit or insight.

Main Sources of Big Data

  • Social Media Data (posts, likes, comments, shares)
  • Business Transactions and Sales Data (sales records, invoices, bank transactions)
  • IoT (Internet of Things) and Sensor Data (logs, temperature, GPS data)
  • Web and Clickstream Data (website logs, cookies, session data)
  • Machine-generated data (event logs, error reports, system performance)
  • Scientific and Healthcare data (patient records, MRI scans, lab reports)
  • Financial and Economic Data (market trends, financial reports)
  • Government and Public Data (government databases, statistics)
  • Multimedia and Streaming Data (videos, audios, images)

Big Data Analytics

  • The process of examining large, complex datasets to uncover hidden patterns, correlations, trends, and insights.
  • A broad term that encompasses the processes, technologies, frameworks, and algorithms to extract meaningful insights from data.

Goals of Big Data Analytics

  • To find patterns in the data:
    • Examples: finding the top 10 coldest days in the year, finding which pages are visited the most on a particular website, or finding the most searched celebrity in a specific year
  • To find relationships in the data:
    • Examples: finding similar news articles, finding similar patients in an electronic health record system, finding related products on an eCommerce website, or finding the correlation between news items and stock prices
  • To predict something:
    • Examples: whether a transaction is a fraud or not, whether it will rain on a particular day, or whether a tumor is benign or malignant
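
The first two goals can be sketched in a few lines of Python. The example below finds the coldest days (a pattern) and computes a Pearson correlation coefficient (a relationship). All data values are invented, and `coldest_days` and `pearson` are illustrative helper names.

```python
import heapq
import math


def coldest_days(temps_by_day, n):
    """Return the n (day, temperature) pairs with the lowest temperatures."""
    return heapq.nsmallest(n, temps_by_day.items(), key=lambda item: item[1])


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Pattern: hypothetical daily temperatures in degrees Celsius.
temps = {"Jan 3": -4.0, "Jan 4": -7.5, "Feb 1": 1.0, "Jul 9": 28.0}
print(coldest_days(temps, 2))

# Relationship: hypothetical counts of news items vs. stock price changes.
news_items = [1, 2, 3, 4, 5]
price_change = [1.0, 1.5, 1.4, 2.2, 2.6]
r = pearson(news_items, price_change)
print(f"correlation: {r:.2f}")
```

A coefficient close to 1 indicates a strong positive linear relationship, though correlation alone does not show that one variable causes the other.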

Types of Big Data Analytics

  • Descriptive Analytics
  • Diagnostic Analytics
  • Predictive Analytics
  • Prescriptive Analytics

Descriptive Analytics

  • What happened?
  • Analyzes past data to present it in a summarized form that can be easily interpreted.
  • Example: A store analyzes last year’s sales data to identify trends.
  • Insight: Customers prefer shopping in the evening and spend more during holiday seasons.
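
A descriptive summary of this kind might be produced as follows. The sales records are invented, and the 18:00 cut-off for "evening" is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical sales records: (hour of day, amount).
sales = [(10, 20.0), (19, 35.0), (20, 50.0), (11, 15.0), (19, 40.0)]


def total_by_period(sales, evening_start=18):
    """Sum sales amounts into 'daytime' and 'evening' buckets."""
    totals = defaultdict(float)
    for hour, amount in sales:
        period = "evening" if hour >= evening_start else "daytime"
        totals[period] += amount
    return dict(totals)


totals = total_by_period(sales)
for period, amount in totals.items():
    print(f"{period}: {amount:.2f}")
```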

Diagnostic Analytics

  • Why did it happen?
  • Analyzes past data to diagnose why certain events happened.
  • Example: A company investigates why sportswear sales increased in Q4.
  • Insight: The sales spike was driven by seasonal fitness trends and New Year resolutions.
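
One common first step in a diagnosis is to break the overall change down by category and see which categories drove it. The sketch below uses invented Q3 and Q4 figures, and `change_by_category` is a hypothetical helper.

```python
def change_by_category(before, after):
    """Return (category, change) pairs sorted from largest increase to smallest."""
    changes = ((name, after[name] - before.get(name, 0.0)) for name in after)
    return sorted(changes, key=lambda pair: pair[1], reverse=True)


# Hypothetical sportswear sales by category.
q3 = {"running shoes": 200.0, "yoga mats": 50.0, "jerseys": 100.0}
q4 = {"running shoes": 320.0, "yoga mats": 120.0, "jerseys": 105.0}

drivers = change_by_category(q3, q4)
for name, change in drivers:
    print(f"{name}: {change:+.1f}")
```

The breakdown only shows where the increase occurred; explaining why still requires linking it to outside factors such as seasonal trends.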

Predictive Analytics

  • What will happen?
  • Predicts whether an event will occur, what its likely outcome is, or what future values will be, using prediction models.
  • Example: A clothing company wants to forecast sales for the next quarter.
  • Insight: Weather forecasts predict a cold winter, leading to increased sales of jackets and hoodies.
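
As a minimal sketch of a prediction model, the code below fits a straight line to past quarterly sales using ordinary least squares and extends it one quarter ahead. The sales figures are invented, and a linear trend is only one possible modeling assumption; it ignores factors such as the weather mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales (in thousands) for the last four quarters.
quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 120.0, 130.0]

slope, intercept = fit_line(quarters, sales)
forecast = slope * 5 + intercept  # projected value for quarter 5
print(f"forecast for next quarter: {forecast:.1f}")
```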

Prescriptive Analytics

  • What should we do next?
  • Prescribes actions, or the best option to follow among the available options.
  • Example: A store wants to maximize sales and customer engagement.
  • Insight: Strategies like targeted ads, loyalty programs, and discount coupons increase customer retention, optimize inventory, and boost sales revenue.
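
In its simplest form, prescribing an action means scoring each available option and picking the best one. The sketch below ranks three strategies by estimated net gain; the costs and revenue estimates are invented, and in practice they would come from a predictive model.

```python
# Hypothetical options with estimated cost and expected revenue.
options = {
    "targeted_ads": {"cost": 500.0, "expected_revenue": 1400.0},
    "loyalty_program": {"cost": 300.0, "expected_revenue": 1000.0},
    "discount_coupons": {"cost": 200.0, "expected_revenue": 800.0},
}


def net_gain(option):
    """Estimated revenue minus cost for one option."""
    return option["expected_revenue"] - option["cost"]


def best_option(options):
    """Return the name of the option with the highest estimated net gain."""
    return max(options, key=lambda name: net_gain(options[name]))


choice = best_option(options)
print(f"recommended action: {choice}")
```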

How Big Data Analytics Works

  • Data Collection
  • Data Storage and Management
  • Data Processing / Data Cleaning / Data Wrangling
  • Data Analysis
  • Data Visualization and Decision-Making

Example: Sales Data

  • Data Collection: Website clicks, product views, customer purchases, reviews, cart abandonment.
  • Data Storage and Management: Databases store millions of transactions like product searches, purchase history, etc.
  • Data Processing / Data Cleaning / Data Wrangling: Removes duplicate transactions, formats data, and detects trends.
  • Data Analysis: Predict future sales trends.
  • Data Visualization and Decision-Making: Track sales performance & adjust pricing.
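
The five steps can be walked through end to end on a tiny dataset. This sketch uses an in-memory SQLite database as a stand-in for a real data store. The purchase events and table names are hypothetical, and the analysis step shown here is a simple revenue summary rather than a forecast.

```python
import sqlite3

# 1. Data collection: raw purchase events, including one duplicate.
events = [
    ("t1", "jacket", 60.0),
    ("t2", "hoodie", 40.0),
    ("t2", "hoodie", 40.0),  # duplicate transaction
    ("t3", "jacket", 60.0),
]

# 2. Data storage: load the events into a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (txn_id TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", events)

# 3. Data cleaning: drop exact duplicate rows.
conn.execute(
    "CREATE TABLE clean AS "
    "SELECT DISTINCT txn_id, product, amount FROM purchases"
)

# 4. Data analysis: total revenue per product.
revenue = dict(
    conn.execute("SELECT product, SUM(amount) FROM clean GROUP BY product")
)

# 5. Visualization and decision-making: a plain-text report stands in for a chart.
for product, total in sorted(revenue.items()):
    print(f"{product:<8} {total:>8.2f}")

conn.close()
```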

Applications of Big Data Analytics

| Industry | Big Data Applications | Example | Tools Used |
|---|---|---|---|
| Business & E-Commerce | Personalized recommendations, price optimization, fraud detection | Amazon, Shopify | Google Analytics, Hadoop |
| Healthcare & Medicine | Predicting diseases, personalized treatments | IBM Watson Health | AI, IoT, EHRs |
| Finance & Banking | Fraud detection, risk assessment, stock market predictions | JPMorgan Chase | SQL, Python, Blockchain |
| Retail & Supply Chain | Inventory management, demand forecasting | Walmart | Predictive Analytics, IoT |
| Social Media & Marketing | Targeted ads, sentiment analysis, engagement tracking | Netflix, Facebook | AI, Data Mining |
| Government & Smart Cities | Traffic management, crime prediction, disaster response | Singapore Smart City | IoT Sensors, AI |
| Education & Learning | Personalized learning, student performance tracking | Coursera, Udemy | AI, LMS |
| Sports & Entertainment | Player tracking, fan engagement, game strategy | NBA, FIFA | AI, Wearable Tech |

Benefits of Big Data Analytics

  • Improvement of decision-making.
  • More effective marketing.
  • Improvement of customer service.
  • Increased efficiency of operations.

Summary

  • Data is defined as raw (unprocessed) information used as a basis for reasoning, discussion, or calculation.
  • The world generates approximately 402.74 million terabytes of data every day.
  • Characteristics of big data: volume, velocity, variety, veracity, and value.
  • Big data analytics encompasses the processes, technologies, frameworks, and algorithms to extract meaningful insights from data.
  • Goals of BDA: find patterns, find relationships, and predict (forecast).
  • Types of BDA: descriptive, diagnostic, predictive, prescriptive.
  • Processes of BDA: data collection, data storage and management, data processing and cleaning, data analysis, and data visualization and decision-making.