TOPIC 1
Big Data Computing
Topic 1: Data Analytics and Statistics
01 What is Big Data?
Definition: Big Data refers to a huge volume of data that cannot be stored and processed using traditional computing approaches within a given time frame.
02 What is Big Data Analytics?
Big Data Analytics is a process that aims to extract meaningful insights from large datasets, identifying hidden patterns, unknown correlations, market trends, and customer preferences.
03 Types of Big Data
Structured Data: Data with a defined data model. Examples: databases, CSV files, Excel spreadsheets.
Unstructured Data: Data without a predefined structure. Examples: images, audio files, video files.
Semi-Structured Data: A type of data that does not follow a strict structure but still has some form of organization. Examples: emails, log files, and XML documents.
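The three categories above can be illustrated with a minimal Python sketch. The sample CSV text, XML text, and placeholder bytes are hypothetical and stand in for real files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every row follows the same fixed columns (hypothetical sample).
csv_text = "id,name,score\n1,Asha,82\n2,Ben,75\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: tags give some organization, but fields can vary per record
# (the second student has no <score> element).
xml_text = """
<students>
  <student id="1"><name>Asha</name><score>82</score></student>
  <student id="2"><name>Ben</name></student>
</students>
"""
root = ET.fromstring(xml_text.strip())
students = [
    {"id": s.get("id"), "name": s.findtext("name"), "score": s.findtext("score")}
    for s in root.findall("student")
]

# Unstructured: raw bytes with no data model (placeholder bytes, not a real image).
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])

print(rows)
print(students)
print(len(image_bytes), "bytes with no predefined fields")
```

The missing score comes back as None, which shows why semi-structured data usually needs extra handling before analysis.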
04 Characteristics of Big Data Analytics
Big Data is characterized by three primary attributes:
Volume: Refers to the vast amounts of data generated.
Velocity: Refers to the speed at which data is generated and transmitted.
Variety: Refers to the different types of data being generated (structured, semi-structured, unstructured).
05 Process of Big Data Analytics
The analytics process typically involves the following steps:
Case study and evaluation
Identification of particular data
Filtering data
Data extraction
Aggregation of data
Visualization of data
Data analysis
Final analysis and result generation, aiding in decision-making.
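The filtering, extraction, aggregation, and analysis steps can be sketched in plain Python as follows. The records and field names are hypothetical, and a real pipeline would typically run on a distributed framework rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw records; None marks a missing reading.
records = [
    {"city": "Pune", "temp_c": 31.0},
    {"city": "Pune", "temp_c": None},
    {"city": "Delhi", "temp_c": 38.5},
    {"city": "Delhi", "temp_c": 36.5},
    {"city": "Pune", "temp_c": 29.0},
]

# Filtering: drop records with missing values.
clean = [r for r in records if r["temp_c"] is not None]

# Extraction: keep only the fields needed for the analysis.
pairs = [(r["city"], r["temp_c"]) for r in clean]

# Aggregation: group readings by city.
grouped = defaultdict(list)
for city, temp in pairs:
    grouped[city].append(temp)

# Analysis: mean temperature per city, ready for visualization or reporting.
mean_temp = {city: sum(temps) / len(temps) for city, temps in grouped.items()}
print(mean_temp)
```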
06 Big Data Application Domains
Major application domains include:
Education
Healthcare
Weather forecasting
Agriculture
Manufacturing
07 Benefits of Big Data Analytics
In Education: Personalized education, better career choices, improved grading systems, enhanced student results, and reduced dropout rates.
In Healthcare: Improved care delivery, fraud detection, health tracking, efficient operations, and advanced patient care.
In Weather Forecasting: More accurate storm predictions, flood risk assessment, and optimal resource management.
In Agriculture: Boosted productivity through crop predictions, risk assessment, natural trend monitoring, and automation.
In Manufacturing: Quality control, defect tracking, improved supply planning, and more efficient manufacturing processes.
08 Some Real-Time Applications of Big Data Analytics
Notable examples include applications from Google and IBM, in areas such as weather forecasting and stock trading, among others.
Google employs various big data tools to provide instant information based on user preferences through search algorithms and inquiry handling.
IBM's Deep Thunder provides hyper-localized weather forecasts for precise areas.
The New York Stock Exchange generates about 1 terabyte of trade data daily.
Social media platforms such as Facebook generate more than 500 terabytes of data daily.
A single jet engine generates over 10 terabytes of data in half an hour.
09 Big Data Analytics Tools and Technologies
Hadoop: An open-source framework for distributed storage and processing.
NoSQL Databases: Databases like MongoDB and Cassandra that handle various data types without fixed schemas.
Tableau: A visualization platform for analyzing big data.
Python and R: Programming languages used for machine learning and statistical analysis.
Machine Learning Frameworks: Tools for predictive modeling such as TensorFlow and PyTorch.
10 Benefits of Big Data Analytics
Real-time Intelligence: Quick analysis and decision-making.
Better-informed Decisions: Uncovering hidden patterns and trends for planning.
Cost Savings: By identifying inefficiencies and optimizing processes.
Enhanced Customer Engagement: Greater understanding of customer behaviors to improve experiences.
Optimized Risk Management: Proactive risk assessment and prediction using analytics.
Disadvantages of Big Data
01 Privacy and Security Concerns
Potential risks associated with data breaches and personal data misuse.
02 Data Quality and Reliability
Issues stemming from improper data management affecting analysis outcomes.
03 High Implementation and Maintenance Costs
Significant investment required for infrastructure and human resources.
04 Data Governance and Compliance
Legal regulations governing data handling and usage.
05 Data Overload and Complexity
Difficulty in deriving insights from vast amounts of data due to complexity.
06 Lack of Skilled Professionals
Shortage of qualified personnel with the necessary expertise in data analytics.
07 Legal and Regulatory Challenges
Navigating laws pertaining to data management and integrity.
08 Ethical Considerations
Maintaining ethical standards in data collection and usage practices.
09 Data Bias and Discrimination
Risk of biased outcomes from unrepresentative datasets.
10 Resistance to Change
Organizational inertia that hinders adoption of big data solutions.
Data Analytics
Definition: A process of cleaning, transforming, and modeling data to extract useful information for decision-making.
Purpose: To derive actionable insights from data analyses.
Types of Data Analytics
Descriptive Analytics: Summarizes historical data to understand past events.
Diagnostic Analytics: Analyzes past outcomes to uncover reasons behind them.
Predictive Analytics: Uses historical data to forecast future events.
Prescriptive Analytics: Offers recommendations to optimize future outcomes.
Real-time Analytics: Enables immediate data processing for quick decision-making.
Spatial Analytics: Analyzes location-based (geographic) data to find patterns and optimize decisions.
Text Analytics: Extracts insights from unstructured textual data using NLP techniques.
Statistical Foundations for Big Data
Statistics: The science of collecting, analyzing, and interpreting data to find patterns and make decisions.
Descriptive Statistics: Simplifies and organizes data for better comprehension.
Inferential Statistics: Draws conclusions about populations based on sample data.
Types of Data: Qualitative (descriptive) and Quantitative (numerical).
Basics of Statistics
Parameters and Definitions:
Population Mean (μ): Average of the entire group.
Sample Mean: Average of a subset of the population.
Standard Deviation (σ): Measures how spread out the data is from the mean.
Variance (σ²): Average of the squared deviations from the mean; the square of the standard deviation.
Range (R): Difference between largest and smallest values in the dataset.
Measures of Central Tendency
Mean: $\bar{x} = \frac{\text{Sum of Values}}{\text{Number of Values}}$
Median: Middle value when data is sorted.
Mode: Most frequently occurring value in the dataset.
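As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The sample values are hypothetical.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical sample

mean = statistics.mean(values)      # sum of values / number of values = 30 / 6
median = statistics.median(values)  # average of the two middle values, 3 and 5
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

With an even number of values, the median is the average of the two middle values.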
Measures of Dispersion
Range: Difference between max and min values.
Variance ($\text{Var}$): $\text{Var}(X) = E[(X - E[X])^2]$
Standard Deviation ($\text{SD}$): $\text{SD} = \sqrt{\text{Variance}}$
Interquartile Range (IQR): $IQR = Q_3 - Q_1$
Quartiles are computed as follows:
Q1 (First Quartile): Middle of the lower half of data.
Q2 (Median): The median of the entire dataset.
Q3 (Third Quartile): Middle of the upper half of data.
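A minimal sketch of these dispersion measures, assuming population (divide-by-n) variance and the median-of-halves quartile method described above. The helper `quartiles` and the data are hypothetical. Other quartile conventions exist and can give slightly different values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For an odd count, the overall median is excluded from both halves.
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

data_range = max(data) - min(data)
variance = statistics.pvariance(data)  # population variance, E[(X - E[X])^2]
std_dev = statistics.pstdev(data)      # square root of the variance
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(data_range, variance, std_dev, (q1, q2, q3), iqr)
```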
Measure of Shape
Skewness: Asymmetry of a distribution.
Positive: $\text{Mean} > \text{Median}$
Negative: $\text{Mean} < \text{Median}$
Symmetrical: $\text{Mean} = \text{Median}$
Kurtosis: Measures how heavy the tails (and how sharp the peak) of a distribution are compared with a normal distribution.
Types: Mesokurtic, Leptokurtic, Platykurtic.
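Skewness and kurtosis can be sketched with moment-based formulas. This assumes the population (divide-by-n) moments and reports excess kurtosis (kurtosis minus 3), so a normal distribution scores about 0 (mesokurtic), heavier tails score above 0 (leptokurtic), and lighter tails score below 0 (platykurtic). The function names and data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def skewness(values):
    """Third central moment scaled by the cube of the standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 10]  # one large value pulls the mean above the median

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

For the right-skewed sample the mean (3.6) exceeds the median (2), matching the positive-skew rule above.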
Measure of Relationship
Covariance: $\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
Correlation: Strength and direction of linear relationship; $r_{xy} = \frac{\text{Cov}(X, Y)}{SD_X \times SD_Y}$
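The two formulas can be checked with a short sketch, assuming the divide-by-n (population) form used above. The helper names and the paired data are hypothetical.

```python
import math

def covariance(x, y):
    """Population covariance: mean of the products of deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(x, x))  # Cov(X, X) is the variance of X
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]  # hypothetical paired observations
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r_xy = correlation(x, y)
print(cov_xy, r_xy)
```

Covariance depends on the units of X and Y, while the correlation always lies between -1 and 1.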
Probability Theory
Sample Space: Set of all possible outcomes.
Event: A subset of the sample space.
Joint Probability: $P(A \text{ and } B) = P(A) \times P(B|A)$ in general; this reduces to $P(A) \times P(B)$ only when A and B are independent.
Union of Events: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
Conditional Probability: $P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$
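A small worked example with one roll of a fair die, using exact fractions. The events A (even number) and B (greater than 3) are hypothetical choices. They are not independent, so the joint probability differs from the simple product P(A) × P(B).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # all outcomes of one fair die roll
A = {2, 4, 6}                      # event: the roll is even
B = {4, 5, 6}                      # event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)               # joint probability: outcomes {4, 6}
p_a_or_b = p_a + p_b - p_a_and_b      # union rule
p_a_given_b = p_a_and_b / p_b         # conditional probability
independent = p_a_and_b == p_a * p_b  # product rule holds only if independent

print(p_a_and_b, p_a_or_b, p_a_given_b, independent)
```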
Bayes' Theorem
Formula: $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$
Helps update probabilities based on new evidence.
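To see the update in action, consider a hypothetical screening test. The prevalence, sensitivity, and false-positive rate below are illustrative assumptions, not real figures. Here A is "has the condition" and B is "tests positive".

```python
from fractions import Fraction

# Illustrative assumptions only.
prior = Fraction(1, 100)               # P(A): prevalence of the condition
sensitivity = Fraction(95, 100)        # P(B|A): positive test given the condition
false_positive_rate = Fraction(5, 100) # P(B|not A)

# P(B) by the law of total probability.
evidence = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
posterior = sensitivity * prior / evidence

print(posterior, float(posterior))
```

Even with a fairly accurate test, the posterior is only about 16% because the condition is rare, which is the kind of update the theorem formalizes.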
Probability Functions
Probability Mass Function (PMF): For discrete random variables.
Probability Density Function (PDF): For continuous random variables.
Cumulative Distribution Function (CDF): Probability that the variable takes a value less than or equal to a certain threshold.
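A minimal sketch of a PMF and its CDF for a discrete variable, using the number of heads in 4 fair coin tosses (a binomial variable) as a hypothetical example. For a continuous variable the PDF plays the role of the PMF, with the sum replaced by an integral.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(X <= k): running total of the PMF up to and including k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n, p = 4, 0.5  # 4 tosses of a fair coin
pmf_values = [binomial_pmf(k, n, p) for k in range(n + 1)]
cdf_at_2 = binomial_cdf(2, n, p)

print(pmf_values)  # probabilities for 0..4 heads; they sum to 1
print(cdf_at_2)    # probability of at most 2 heads
```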
Probability Distribution Functions
Normal Distribution: Described by mean (μ) and standard deviation (σ); bell-shaped curve.
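Student t-distribution: Used when the sample size is small and the population standard deviation is unknown.
Chi-squared distribution: Used in tests on categorical variables, such as goodness-of-fit and independence tests.
Binomial distribution: Models the number of successes in a fixed number of independent trials, each with the same success probability.
Poisson distribution: Models the number of events occurring in a given interval of time or space.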
Uniform distribution: Constant probability across outcomes.
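Two of these distributions can be explored with the standard library alone. The sketch below assumes a standard normal (μ = 0, σ = 1) and a Poisson rate of 2 events per interval. Both parameter choices and the helper `poisson_pmf` are hypothetical.

```python
import math
from statistics import NormalDist

# Normal distribution: probability of falling within one standard deviation of the mean.
std_normal = NormalDist(mu=0.0, sigma=1.0)
within_one_sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with average rate lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

rate = 2.0  # assumed average number of events per interval
p_no_events = poisson_pmf(0, rate)
p_at_most_3 = sum(poisson_pmf(k, rate) for k in range(4))

print(round(within_one_sd, 4))
print(round(p_no_events, 4), round(p_at_most_3, 4))
```

The first value is the familiar "about 68% within one standard deviation" property of the bell-shaped curve.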
Parameter Estimation for Statistical Inference
Population: Complete group of interest.
Sample: Subset drawn from the population.
Expectation: Mean or expected value of a variable.
Parameter: Numerical characteristics of the population.
Statistic: Value computed from the sample to estimate population parameters.
Estimation: Process of inferring parameters from statistics.
Bias: Difference between estimator's expected value and the true parameter.
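Estimator bias can be demonstrated with a small simulation. This sketch assumes samples of size 5 drawn from a normal population with true variance 1, and compares the variance estimator that divides by n with the one that divides by n - 1. The seed, sample size, and trial count are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 5, 20000
true_variance = 1.0     # the population parameter being estimated

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sample_mean = sum(sample) / n
    squared_dev = sum((x - sample_mean) ** 2 for x in sample)
    biased_total += squared_dev / n          # divide by n
    unbiased_total += squared_dev / (n - 1)  # divide by n - 1

# Averaging over many samples approximates each estimator's expected value.
biased_avg = biased_total / trials
unbiased_avg = unbiased_total / trials

print("divide by n:    ", round(biased_avg, 3), "bias", round(biased_avg - true_variance, 3))
print("divide by n - 1:", round(unbiased_avg, 3), "bias", round(unbiased_avg - true_variance, 3))
```

The divide-by-n estimator averages close to (n - 1)/n = 0.8 of the true variance, while the divide-by-(n - 1) estimator averages close to 1.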
Hypothesis Testing
Null Hypothesis (H₀): No significant effect.
Alternative Hypothesis (H₁): There is a significant effect.
Degrees of Freedom: Number of independent values that are free to vary when computing a statistic.
Level of Significance (α): Threshold for determining significance.
p-value: Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true.
Type I and Type II Errors:
Type I: Incorrectly rejecting H₀; false positive.
Type II: Not rejecting H₀ when it is false; false negative.
Confidence Intervals: Range to estimate the true population parameter with a specified confidence level.
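The pieces above fit together in a one-sample, two-sided z-test. The sketch below assumes a known population standard deviation, and every number in it is illustrative.

```python
import math
from statistics import NormalDist

# Illustrative assumptions only.
mu0 = 50.0          # population mean claimed by H0
sigma = 6.0         # population standard deviation, assumed known
n = 36              # sample size
sample_mean = 52.0  # observed sample mean
alpha = 0.05        # level of significance

std_error = sigma / math.sqrt(n)
z = (sample_mean - mu0) / std_error

std_normal = NormalDist()
# Two-sided p-value: chance of a |z| at least this large if H0 is true.
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject_h0 = p_value < alpha

# 95% confidence interval for the population mean.
z_crit = std_normal.inv_cdf(1 - alpha / 2)
ci = (sample_mean - z_crit * std_error, sample_mean + z_crit * std_error)

print(z, round(p_value, 4), reject_h0)
print(tuple(round(b, 2) for b in ci))
```

Rejecting H₀ here carries a risk of a Type I error of at most α. The confidence interval excludes 50, which agrees with the test result.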
Statistical Tests
Parametric Tests: Assume the data follow a specific distribution, typically the normal distribution.
Z-test, t-test, F-test: Compare means (Z-test, t-test) and variances (F-test) across groups.
ANOVA (Analysis of Variance): Compares means across three or more groups.
Chi-squared Test: Association between categorical variables.
Non-Parametric Tests: Do not assume a specific distribution for the data, e.g., Mann-Whitney U Test.
A/B Testing: Compares two versions to determine effectiveness.
Regression: Models relationship between variables; formula: $y = \beta_0 + \beta_1 x$.
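For the regression formula, the coefficients can be estimated by ordinary least squares. This is a minimal sketch for one predictor. The helper `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx              # slope
    beta0 = y_bar - beta1 * x_bar    # intercept: line passes through the means
    return beta0, beta1

x = [1, 2, 3, 4, 5]  # hypothetical predictor values
y = [2, 4, 5, 4, 5]  # hypothetical responses

beta0, beta1 = fit_line(x, y)
prediction = beta0 + beta1 * 6  # predicted y for a new x = 6

print(round(beta0, 2), round(beta1, 2), round(prediction, 2))
```

The slope equals the covariance of x and y divided by the variance of x, which ties regression back to the measures of relationship.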
Summary
These study notes encompass various aspects of big data computing, analytics, and statistics, providing a solid foundation for understanding the vast field of data analytics and its applications in the real world. From definitions to complex concepts like hypothesis testing and regression, this guide serves as a comprehensive resource for students and professionals alike in the domains of data science and analytics.
What is Big Data?
A) A small amount of data that can be processed easily
B) A large volume of data that cannot be processed using traditional methods
C) Only structured data
D) Any data generated by a computer
Correct Answer: B) A large volume of data that cannot be processed using traditional methods
What are the primary characteristics of Big Data analytics?
A) Volume, Variety, Velocity
B) Quality, Reliability, Consistency
C) Speed, Accuracy, Flexibility
D) Size, Variability, Validity
Correct Answer: A) Volume, Variety, Velocity
Which of the following is NOT a type of Big Data?
A) Structured Data
B) Semi-Structured Data
C) Unstructured Data
D) Temporal Data
Correct Answer: D) Temporal Data
Which tool is an open-source framework for distributed storage and processing?
A) Tableau
B) Hadoop
C) Python
D) SQL
Correct Answer: B) Hadoop
What is one benefit of Big Data analytics in healthcare?
A) Increased paperwork
B) Fraud detection and improved care delivery
C) More medication errors
D) Less patient tracking
Correct Answer: B) Fraud detection and improved care delivery
What is 'data overload'?
A) When data is too small to analyze
B) Managing too much data that becomes complex and hard to interpret
C) A process of data cleansing
D) A type of software for data storage
Correct Answer: B) Managing too much data that becomes complex and hard to interpret
What is Big Data Analytics?
A) The process of collecting data from small datasets
B) A process that aims to extract meaningful insights from large datasets
C) A type of data storage system
D) A method for cleaning data
Correct Answer: B) A process that aims to extract meaningful insights from large datasets
Which of the following is an application domain of Big Data?
A) Education
B) Fashion
C) Sports
D) None of the above
Correct Answer: A) Education
What is the primary attribute of Big Data that refers to the speed at which data is generated?
A) Volume
B) Variety
C) Velocity
D) Variability
Correct Answer: C) Velocity
What does 'semi-structured data' refer to?
A) Data with a fixed structure
B) Data that does not follow a strict structure but has some form of organization
C) Data that is completely unorganized
D) None of the above
Correct Answer: B) Data that does not follow a strict structure but has some form of organization
Which of the following is a benefit of Big Data analytics in agriculture?
A) Increased pesticide use
B) More significant environmental impact
C) Boosted productivity through crop predictions
D) Reduced efficiency in farming techniques
Correct Answer: C) Boosted productivity through crop predictions
What kind of data is dealt with in spatial analytics?
A) Historical data
B) Numeric data
C) Location-based data
D) Textual data
Correct Answer: C) Location-based data
What is a common tool used for data visualization in Big Data analytics?
A) Excel
B) Tableau
C) SQL Server
D) Access
Correct Answer: B) Tableau
Which statistical method is used to predict future events based on historical data?
A) Prescriptive Analytics
B) Descriptive Analytics
C) Predictive Analytics
D) Diagnostic Analytics
Correct Answer: C) Predictive Analytics
What does the term 'data governance' refer to?
A) Managing data infrastructure
B) Developing policies for data management
C) Cleaning data
D) Storing data securely
Correct Answer: B) Developing policies for data management
Which distribution is commonly used for categorical variables?
A) Normal distribution
B) Binomial distribution
C) Chi-squared distribution
D) Poisson distribution
Correct Answer: C) Chi-squared distribution
What is the main challenge of data quality in Big Data?
A) Accurate representation of data
B) High storage cost
C) Consistency over time
D) All of the above
Correct Answer: A) Accurate representation of data
Which of the following statements about NoSQL databases is TRUE?
A) They use fixed schemas for data storage
B) They only handle structured data
C) They can manage various data types without fixed schemas
D) They are not scalable
Correct Answer: C) They can manage various data types without fixed schemas
What is a limitation of traditional data processing systems when dealing with Big Data?
A) They are highly scalable
B) They can handle real-time data
C) They struggle with large volumes of streaming data
D) They support structured query languages
Correct Answer: C) They struggle with large volumes of streaming data
What is an example of unstructured data?
A) Excel spreadsheet
B) Customer survey results
C) Email messages
D) Database entries
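Correct Answer: C) Email messages (the free-text body is unstructured, although an email as a whole, with its headers, is often classed as semi-structured)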