Big Data Analytics Notes
Big Data Analytics
Introduction
- Data Definition: According to the Merriam-Webster Dictionary, data is defined as:
- Information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.
- Information in digital form that can be transmitted or processed.
- Information that has been translated into a form that is efficient for movement or processing in computing.
Data Generation
- Daily Data Production: Over the past decade, global data production has increased dramatically due to the digital transformation of societies worldwide.
- Amount of Data Created Daily: Approximately 402.74 million terabytes of data are generated every day.
- Drivers: cloud computing, streaming services, and connected devices.
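To put the daily figure in perspective, the arithmetic below converts it to zettabytes per day and per year. This is only a rough sketch: it assumes decimal (SI) prefixes, where 1 TB = 10^12 bytes and 1 ZB = 10^21 bytes, and the variable names are illustrative.

```python
# Figure quoted in the notes: about 402.74 million terabytes per day.
TB_PER_DAY = 402.74e6

# Assumption: decimal (SI) prefixes, 1 TB = 10**12 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21

bytes_per_day = TB_PER_DAY * BYTES_PER_TB
zb_per_day = bytes_per_day / BYTES_PER_ZB
zb_per_year = zb_per_day * 365

print(f"{zb_per_day:.3f} ZB per day")
print(f"{zb_per_year:.0f} ZB per year")
```

Under these assumptions the daily figure works out to roughly 0.4 zettabytes per day, or about 147 zettabytes per year.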
Storing Data: Types of File Sizes
- Computers represent data (video, images, sound, text) as binary values (1s and 0s).
- A bit is the smallest unit representing a single binary value.
- Storage and memory are typically measured in megabytes and gigabytes, but the scale continues to increase.
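- "Brontobyte" is an unofficial term for roughly 10^{27} bytes (1024 yottabytes in the convention listed below).
- Units of Data Measurement (binary convention, where each step up is a factor of 1024):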
- 1 byte = 8 bits
- 1 kilobyte = 1024 bytes
- 1 megabyte = 1024 kilobytes
- 1 gigabyte = 1024 megabytes
- 1 terabyte = 1024 gigabytes
- 1 petabyte = 1024 terabytes
- 1 exabyte = 1024 petabytes
- 1 zettabyte = 1024 exabytes
- 1 yottabyte = 1024 zettabytes
- 1 brontobyte = 1024 yottabytes
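The unit ladder above can be turned into a small converter. The sketch below assumes the same factor-of-1024 convention as the list; the helper names `to_bytes` and `humanize` are made up for illustration.

```python
# Units in ascending order; each one is 1024 times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte", "brontobyte"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * 1024 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the most compact unit from UNITS."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit if the value never does.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}s"
        value /= 1024

print(to_bytes(1, "gigabyte"))   # 1073741824
print(humanize(5 * 1024**4))     # 5.00 terabytes
```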
Big Data
- Extremely large, complex datasets that cannot be processed using traditional data management tools.
- Generated rapidly from sources like social media, IoT devices, transactions, and sensors.
Characteristics of Big Data (5 V's)
- Volume
- Velocity
- Variety
- Veracity
- Value
Volume
- Massive amount of data.
- The volume is so large that the data cannot be stored or processed on a single machine.
- Specialized tools and frameworks are needed to store, process, and analyze it.
Velocity
- Fast generation and processing of data (real-time or nearly real-time).
- Describes how quickly data flows from various sources and how fast it needs to be handled to remain useful.
Variety
- Different data types: structured, unstructured, or semi-structured.
- Includes text data, images, audio, video, and sensor data.
Types of Big Data
- Unstructured:
- Has no inherent format and is stored in many different file types.
- Examples: free text, PDFs, images, and videos
- Semi-structured:
- Data files with an apparent pattern, enabling analysis.
- Uses tags or markers to separate elements.
- Examples: JSON and XML files (spreadsheets with irregular layouts are sometimes included)
- Structured:
- Has a defined data model, format, and structure.
- Follows a strict schema of rows and columns, which makes the data easy to process.
- Examples: SQL Databases
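The difference between the three types shows up in how much work is needed before the data can be queried. The following is a minimal sketch using only the Python standard library; the table, records, and review text are invented for illustration.

```python
import json
import sqlite3

# Structured: a fixed schema of rows and columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales (item, amount) VALUES (?, ?)",
                 [("jacket", 59.0), ("hoodie", 35.5)])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Semi-structured: keys act as markers, but records need not share the same fields.
records = json.loads('[{"user": "ana", "likes": 3}, {"user": "ben", "tags": ["sale"]}]')
likes = sum(r.get("likes", 0) for r in records)

# Unstructured: free text with no schema, so any structure has to be derived.
review = "Great jacket, warm and light. Would buy this jacket again."
jacket_mentions = review.lower().count("jacket")

print(total, likes, jacket_mentions)
```

The structured query needs no preparation, the semi-structured records need a default for missing fields, and the unstructured text needs a rule (here a simple substring count) before it yields a number.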
Veracity
- Data accuracy and reliability.
- Refers to the quality, accuracy, integrity, and credibility of data.
- Concerns how accurate, consistent, and error-free the data is.
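Veracity is often checked with simple data-quality counts before any analysis begins. The sketch below is one possible approach, not a standard procedure: the sensor readings, the plausible temperature range, and the function name `quality_report` are all assumptions.

```python
# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s2", "temp_c": None},
    {"sensor": "s1", "temp_c": 21.5},    # exact duplicate of the first record
    {"sensor": "s3", "temp_c": 480.0},   # outside the assumed plausible range
]

def quality_report(rows, low=-50.0, high=60.0):
    """Count duplicate, missing, and out-of-range records."""
    seen = set()
    report = {"missing": 0, "duplicates": 0, "out_of_range": 0}
    for row in rows:
        key = (row["sensor"], row["temp_c"])
        if key in seen:
            # A repeated record is counted once as a duplicate and then skipped.
            report["duplicates"] += 1
            continue
        seen.add(key)
        if row["temp_c"] is None:
            report["missing"] += 1
        elif not low <= row["temp_c"] <= high:
            report["out_of_range"] += 1
    return report

report = quality_report(readings)
print(report)
```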
Value
- Extracting meaningful insights used in decision-making.
- Value is the usefulness, importance, and impact of the data as a benefit or insight.
Main Sources of Big Data
- Social Media Data (posts, likes, comments, shares)
- Business Transactions and Sales Data (sales records, invoices, bank transactions)
- IoT (Internet of Things) and Sensor Data (logs, temperature, GPS data)
- Web and Clickstream Data (website logs, cookies, session data)
- Machine-generated data (event logs, error reports, system performance)
- Scientific and Healthcare data (patient records, MRI scans, lab reports)
- Financial and Economic Data (market trends, financial reports)
- Government and Public Data (government databases, statistics)
- Multimedia and Streaming Data (videos, audios, images)
Big Data Analytics
- The process of examining large, complex datasets to uncover hidden patterns, correlations, trends, and insights.
- A broad term that encompasses the processes, technologies, frameworks, and algorithms to extract meaningful insights from data.
Goals of Big Data Analytics
- To find patterns in the data:
- Examples: finding the top 10 coldest days in the year, finding which pages are visited the most on a particular website, or finding the most searched celebrity in a specific year
- To find relationships in the data:
- Examples: finding similar news articles, finding similar patients in an electronic health record system, finding related products on an eCommerce website, or finding the correlation between news items and stock prices
- To predict something:
- Examples: whether a transaction is a fraud or not, whether it will rain on a particular day, or whether a tumor is benign or malignant
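The first two goals can be illustrated on a toy scale. The sketch below finds the coldest days (a pattern) and computes a Pearson correlation (a relationship); all data values are invented, and real big data workloads would run the same ideas on distributed frameworks rather than in-memory lists.

```python
import heapq
import math

# Hypothetical daily minimum temperatures: (date, degrees Celsius).
daily_min_temp = [("01-03", -4.0), ("01-04", -7.5), ("02-11", -1.0),
                  ("07-20", 18.0), ("12-24", -6.0)]

# Pattern: the three coldest days.
coldest = heapq.nsmallest(3, daily_min_temp, key=lambda day: day[1])

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Relationship: hypothetical daily news-item counts and stock price changes.
news_items = [1, 2, 3, 4, 5]
price_change = [0.5, 1.0, 1.4, 2.1, 2.4]
r = pearson(news_items, price_change)

print(coldest)
print(round(r, 3))
```

A coefficient close to 1 indicates a strong positive linear relationship, although correlation alone does not show that one quantity causes the other.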
Types of Big Data Analytics
- Descriptive Analytics
- Diagnostic Analytics
- Predictive Analytics
- Prescriptive Analytics
Descriptive Analytics
- What happened?
- Analyzes past data to present it in a summarized form that can be easily interpreted.
- Example: A store analyzes last year’s sales data to identify trends.
- Insight: Customers prefer shopping in the evening and spend more during holiday seasons.
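A descriptive summary of this kind can be sketched by grouping past sales by time of day. The sales records and the morning/afternoon/evening cut-offs below are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical past sales: (hour of purchase, amount).
sales = [(10, 20.0), (11, 15.0), (19, 40.0), (20, 55.0), (21, 35.0)]

def period(hour):
    """Label an hour as morning, afternoon, or evening (assumed cut-offs)."""
    if hour < 12:
        return "morning"
    return "afternoon" if hour < 18 else "evening"

# Group the amounts by period of the day.
by_period = defaultdict(list)
for hour, amount in sales:
    by_period[period(hour)].append(amount)

# Summarize each group: number of orders, total, and average amount.
summary = {name: {"orders": len(v), "total": sum(v), "average": mean(v)}
           for name, v in by_period.items()}
print(summary)
```

In this toy data the evening group has both more orders and a higher average amount, which is the kind of summary a descriptive report would surface.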
Diagnostic Analytics
- Why did it happen?
- Analyzes past data to diagnose why certain events happened.
- Example: A company investigates why sportswear sales increased in Q4.
- Insight: The sales spike was driven by seasonal fitness trends and New Year resolutions.
Predictive Analytics
- What will happen?
- Uses prediction models to estimate whether an event will occur, its likely outcome, or future values.
- Example: A clothing company wants to forecast sales for the next quarter.
- Insight: Weather forecasts predict a cold winter, so sales of jackets and hoodies are expected to increase.
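One of the simplest prediction models is a straight-line trend fitted by least squares. The sketch below uses that technique on invented quarterly sales; it ignores external signals such as the weather, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for the last four quarters.
quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 125.0, 135.0]

slope, intercept = fit_line(quarters, sales)

# Extrapolate the fitted trend to the next quarter.
forecast = slope * 5 + intercept
print(f"Forecast for quarter 5: {forecast:.1f}")
```

A production forecast would normally add more features (seasonality, weather, promotions) and validate the model on held-out data.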
Prescriptive Analytics
- What should we do next?
- Recommends actions, or the best option among those available.
- Example: A store wants to maximize sales and customer engagement.
- Insight: Strategies like targeted ads, loyalty programs, and discount coupons increase customer retention, optimize inventory, and boost sales revenue.
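In its simplest form, prescribing an action means scoring each available option and picking the best one. The sketch below assumes a single score (estimated uplift minus cost) and invented figures; real prescriptive systems typically use optimization or simulation over many constraints.

```python
# Hypothetical options: name -> (estimated revenue uplift, cost).
options = {
    "targeted_ads": (5000.0, 1200.0),
    "loyalty_program": (4200.0, 900.0),
    "discount_coupons": (3000.0, 1500.0),
}

def net_benefit(name):
    """Estimated uplift minus cost for one option."""
    uplift, cost = options[name]
    return uplift - cost

# Rank the options from best to worst and recommend the top one.
ranked = sorted(options, key=net_benefit, reverse=True)
best = ranked[0]
print(best, net_benefit(best))
```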
How Big Data Analytics Works
- Data Collection
- Data Storage and Management
- Data Processing / Data Cleaning / Data Wrangling
- Data Analysis
- Data Visualization and Decision-Making
Example: Sales Data
- Data Collection: Website clicks, product views, customer purchases, reviews, cart abandonment.
- Data Storage and Management: Databases store millions of transactions like product searches, purchase history, etc.
- Data Processing / Data Cleaning / Data Wrangling: Remove duplicate transactions and standardize formats so that trends can be detected.
- Data Analysis: Predict future sales trends.
- Data Visualization and Decision-Making: Track sales performance and adjust pricing.
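The five stages can be traced end to end on a handful of records. This is a toy sketch under stated assumptions: the purchase events are invented, an in-memory dictionary stands in for a database, and a text bar chart stands in for a dashboard.

```python
from collections import Counter

# 1. Collection: hypothetical raw purchase events (order id, product, price as text).
raw_events = [
    ("o1", "Jacket ", "59.00"),
    ("o2", "hoodie", "35.50"),
    ("o2", "hoodie", "35.50"),   # duplicate transaction
    ("o3", "JACKET", "61.00"),
]

# 2. Storage: a dict keyed by order id stands in for a database.
store = {}
for order_id, product, price in raw_events:
    # 3. Cleaning: keep the first copy of each order and normalize the fields.
    if order_id not in store:
        store[order_id] = (product.strip().lower(), float(price))

# 4. Analysis: revenue per product.
revenue = Counter()
for product, price in store.values():
    revenue[product] += price

# 5. Visualization: a plain-text bar chart, one '#' per 10 units of revenue.
for product, amount in revenue.most_common():
    print(f"{product:<8} {'#' * int(amount // 10)} {amount:.2f}")
```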
Applications of Big Data Analytics
Industry | Big Data Applications | Example Organizations | Tools and Technologies |
---|---|---|---|
Business & E-Commerce | Personalized recommendations, price optimization, fraud detection | Amazon, Shopify | Google Analytics, Hadoop |
Healthcare & Medicine | Predicting diseases, personalized treatments | IBM Watson Health | AI, IoT, EHRs |
Finance & Banking | Fraud detection, risk assessment, stock market predictions | JPMorgan Chase | SQL, Python, Blockchain |
Retail & Supply Chain | Inventory management, demand forecasting | Walmart | Predictive Analytics, IoT |
Social Media & Marketing | Targeted ads, sentiment analysis, engagement tracking | Netflix, Facebook | AI, Data Mining |
Government & Smart Cities | Traffic management, crime prediction, disaster response | Singapore Smart City | IoT Sensors, AI |
Education & Learning | Personalized learning, student performance tracking | Coursera, Udemy | AI, LMS |
Sports & Entertainment | Player tracking, fan engagement, game strategy | NBA, FIFA | AI, Wearable Tech |
Benefits of Big Data Analytics
- Improved decision-making.
- More effective marketing.
- Improved customer service.
- More efficient operations.
Summary
- Data is defined as raw (unprocessed) information used as a basis for reasoning, discussion, or calculation.
- The world generates approximately 402.74 million terabytes of data every day.
- Characteristics of big data: volume, velocity, variety, veracity, and value.
- Big data analytics encompasses the processes, technologies, frameworks, and algorithms to extract meaningful insights from data.
- Goals of BDA: find patterns, find relationships, and predict (forecast).
- Types of BDA: descriptive, diagnostic, predictive, prescriptive.
- Processes of BDA: data collection, data storage and management, data processing and cleaning, data analysis, and data visualization and decision-making.