Big Data Computing

Topic 1: Data Analytics and Statistics

01 What is Big Data?
  • Definition: Big Data refers to a huge volume of data that cannot be stored and processed using traditional computing approaches within a given time frame.

02 What is Big Data Analytics?
  • Big Data Analytics is a process that aims to extract meaningful insights from large datasets, identifying hidden patterns, unknown correlations, market trends, and customer preferences.

03 Types of Big Data
  • Structured Data: Data with a defined data model. Examples: databases, CSV files, Excel spreadsheets.

  • Unstructured Data: Data without a predefined structure. Examples: images, audio files, video files.

  • Semi-Structured Data: A type of data that does not follow a strict structure but still has some form of organization. Examples: emails, log files, and XML documents.

04 Characteristics of Big Data
  • Big Data is characterized by three primary attributes:

    • Volume: Refers to the vast amounts of data generated.

    • Velocity: Refers to the speed at which data is generated and transmitted.

    • Variety: Refers to the different types of data being generated (structured, semi-structured, unstructured).

05 Process of Big Data Analytics
  • The analytics process typically involves the following steps:

    1. Case study and evaluation

    2. Identification of particular data

    3. Filtering data

    4. Data extraction

    5. Aggregation of data

    6. Visualization of data

    7. Data analysis

    8. Final analysis and result generation, aiding in decision-making.
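
As a rough illustration of steps 3 to 5 (filtering, extraction, aggregation), here is a minimal Python sketch. The records, field names, and helper functions are hypothetical and exist only for this example; a real big data pipeline would run these steps on a distributed framework rather than on an in-memory list.

```python
from collections import defaultdict

# Hypothetical raw records; the field names are made up for illustration.
records = [
    {"region": "north", "product": "A", "amount": 120.0},
    {"region": "south", "product": "A", "amount": 80.0},
    {"region": "north", "product": "B", "amount": None},  # incomplete record
    {"region": "north", "product": "A", "amount": 30.0},
    {"region": "south", "product": "B", "amount": 50.0},
]


def filter_records(rows):
    """Step 3 (filtering): drop records with a missing amount."""
    return [r for r in rows if r["amount"] is not None]


def extract_fields(rows):
    """Step 4 (extraction): keep only the fields the analysis needs."""
    return [(r["region"], r["amount"]) for r in rows]


def aggregate(pairs):
    """Step 5 (aggregation): total amount per region."""
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)


totals = aggregate(extract_fields(filter_records(records)))
for region, total in sorted(totals.items()):
    print(f"{region}: {total:.1f}")
```

The per-region totals are the kind of summary that would then feed the visualization and analysis steps.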

06 Big Data Application Domains
  • Major application domains include:

    • Education

    • Healthcare

    • Weather forecasting

    • Agriculture

    • Manufacturing

07 Benefits of Big Data Analytics by Domain
  • In Education: Personalized education, better career choices, improved grading systems, enhanced student results, and reduced dropout rates.

  • In Healthcare: Improved care delivery, fraud detection, health tracking, efficient operations, and advanced patient care.

  • In Weather Forecasting: More accurate storm predictions, flood risk assessment, and optimal resource management.

  • In Agriculture: Boosted productivity through crop predictions, risk assessment, natural trend monitoring, and automation.

  • In Manufacturing: Quality control, defect tracking, improved supply planning, and more efficient manufacturing processes.

08 Some Real-Time Applications of Big Data Analytics
  • Examples from web search, weather forecasting, stock trading, social media, and aviation:

    • Google employs various big data tools to provide instant information based on user preferences through search algorithms and inquiry handling.

    • IBM's Deep Thunder provides hyper-localized weather forecasts for precise areas.

    • The New York Stock Exchange generates ~1 terabyte of trade data daily.

    • Social media platforms such as Facebook generate more than 500 terabytes of data daily.

    • A single jet engine generates over 10 terabytes of data in half an hour.

09 Big Data Analytics Tools and Technologies
  • Hadoop: An open-source framework for distributed storage and processing.

  • NoSQL Databases: Databases like MongoDB and Cassandra that handle various data types without fixed schemas.

  • Tableau: A visualization platform for analyzing big data.

  • Python and R: Programming languages used for machine learning and statistical analysis.

  • Machine Learning Frameworks: Tools for predictive modeling such as TensorFlow and PyTorch.

10 Benefits of Big Data Analytics
  • Real-time Intelligence: Quick analysis and decision-making.

  • Better-informed Decisions: Uncovering hidden patterns and trends for planning.

  • Cost Savings: By identifying inefficiencies and optimizing processes.

  • Enhanced Customer Engagement: Greater understanding of customer behaviors to improve experiences.

  • Optimized Risk Management: Proactive risk assessment and prediction using analytics.

Disadvantages of Big Data

01 Privacy and Security Concerns
  • Potential risks associated with data breaches and personal data misuse.

02 Data Quality and Reliability
  • Issues stemming from improper data management affecting analysis outcomes.

03 High Implementation and Maintenance Costs
  • Significant investment required for infrastructure and human resources.

04 Data Governance and Compliance
  • Legal regulations governing data handling and usage.

05 Data Overload and Complexity
  • Difficulty in deriving insights from vast amounts of data due to complexity.

06 Lack of Skilled Professionals
  • Shortage of qualified personnel with the necessary expertise in data analytics.

07 Legal and Regulatory Challenges
  • Navigating laws pertaining to data management and integrity.

08 Ethical Considerations
  • Maintaining ethical standards in data collection and usage practices.

09 Data Bias and Discrimination
  • Risk of biased outcomes from unrepresentative datasets.

10 Resistance to Change
  • Organizational inertia that hinders adoption of big data solutions.

Data Analytics

  • Definition: A process of cleaning, transforming, and modeling data to extract useful information for decision-making.

  • Purpose: To derive actionable insights from data analyses.

Types of Data Analytics
  1. Descriptive Analytics: Summarizes historical data to understand past events.

  2. Diagnostic Analytics: Analyzes past outcomes to uncover reasons behind them.

  3. Predictive Analytics: Uses historical data to forecast future events.

  4. Prescriptive Analytics: Offers recommendations to optimize future outcomes.

  5. Real-time Analytics: Enables immediate data processing for quick decision-making.

  6. Spatial Analytics: Analyzes location-based (geographic) data to find patterns and optimize decisions.

  7. Text Analytics: Extracts insights from unstructured textual data using NLP techniques.

Statistical Foundations for Big Data

  • Statistics: The science of collecting, analyzing, and interpreting data to find patterns and make decisions.

  • Descriptive Statistics: Simplifies and organizes data for better comprehension.

  • Inferential Statistics: Draws conclusions about populations based on sample data.

  • Types of Data: Qualitative (descriptive) and Quantitative (numerical).

Basics of Statistics
  • Parameters and Definitions:

    • Population Mean (μ): Average of the entire group.

    • Sample Mean ($\bar{x}$): Average of a subset of the population.

    • Standard Deviation (σ): Measures how spread out the data is around the mean; it is the square root of the variance.

    • Variance (σ²): Average of the squared deviations of the values from the mean.

    • Range (R): Difference between largest and smallest values in the dataset.

Measures of Central Tendency
  1. Mean: $\bar{x} = \frac{\text{Sum of Values}}{\text{Number of Values}}$

  2. Median: Middle value when data is sorted.

  3. Mode: Most frequently occurring value in the dataset.
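
The three measures can be checked on a small dataset. This is a minimal sketch using Python's standard `statistics` module; the sample values are made up for illustration.

```python
import statistics

# Illustrative values, already sorted for readability.
values = [2, 3, 3, 5, 7, 10]

mean = sum(values) / len(values)    # sum of values / number of values
median = statistics.median(values)  # average of the two middle values (3 and 5)
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)  # 5.0 4.0 3
```

Here the mean (5.0) is pulled above the median (4.0) by the single large value 10, which is the usual reason to report both.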

Measures of Dispersion
  • Range: Difference between max and min values.

  • Variance ($\text{Var}$): $\text{Var}(X) = E[(X - E[X])^2]$

  • Standard Deviation ($\text{SD}$): $\text{SD} = \sqrt{\text{Var}(X)}$

  • Interquartile Range (IQR): $IQR = Q_3 - Q_1$

  • Quartiles are computed as follows:

    • Q1 (First Quartile): Middle of the lower half of data.

    • Q2 (Median): The median of the entire dataset.

    • Q3 (Third Quartile): Middle of the upper half of data.
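
The dispersion measures above can be computed directly from their definitions. This is a sketch under two stated assumptions: the variance is the population form (divide by n), and the quartiles use the median-of-halves rule, leaving the overall median out of both halves when the number of values is odd. Other quartile conventions exist and can give slightly different results. The helper name `quartiles` and the data are illustrative.

```python
import math
import statistics


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: with an odd number of values, the overall median is
    excluded from both the lower and the upper half.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)


data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)                          # 9 - 2 = 7
mean = sum(data) / len(data)                                # 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)   # population variance: 4.0
sd = math.sqrt(variance)                                    # 2.0
q1, q2, q3 = quartiles(data)                                # 4.0, 4.5, 6.0
iqr = q3 - q1                                               # 2.0

print(data_range, variance, sd, (q1, q2, q3), iqr)
```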

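Measures of Shape
  1. Skewness: Asymmetry of a distribution.

    • Positive (right-skewed): typically $\text{Mean} > \text{Median}$

    • Negative (left-skewed): typically $\text{Mean} < \text{Median}$

    • Symmetrical: $\text{Mean} = \text{Median}$

  2. Kurtosis: Heaviness of a distribution's tails (often described as peakedness) relative to a normal distribution.

    • Types: Mesokurtic (similar to normal), Leptokurtic (heavier tails, sharper peak), Platykurtic (lighter tails, flatter peak).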
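
Both shape measures can be computed from central moments. The sketch below assumes the moment-based population definitions: skewness is the third central moment divided by the variance raised to the power 1.5, and excess kurtosis is the fourth central moment divided by the squared variance, minus 3. Under that convention an excess kurtosis near 0 is mesokurtic, above 0 leptokurtic, and below 0 platykurtic. The function names and datasets are illustrative.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)      # True: positive skew, mean (4.75) > median (3)
print(skewness(symmetric))             # 0.0 for perfectly symmetric data
print(excess_kurtosis(symmetric) < 0)  # True: flatter than normal (platykurtic)
```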

Measures of Relationship
  • Covariance: $Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

  • Correlation: Strength and direction of linear relationship; $r_{xy} = \frac{Cov(X, Y)}{SD_X \times SD_Y}$
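
The two formulas translate almost line for line into code. This minimal sketch uses the 1/n (population) form of the covariance, matching the formula above; the variable names `hours` and `scores` and their values are hypothetical.

```python
import math


def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x_i - mean_x) * (y_i - mean_y))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) / (SD_X * SD_Y)."""
    sd_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

print(covariance(hours, scores))             # 11.0
print(round(correlation(hours, scores), 3))  # close to 1: strong positive linear relationship
```

Covariance depends on the units of both variables, while correlation is unitless and always lies between -1 and 1, which makes it easier to compare across datasets.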

Probability Theory
  • Sample Space: Set of all possible outcomes.

  • Event: A subset of the sample space.

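    • Joint Probability: $P(A \text{ and } B) = P(A) \times P(B)$ when A and B are independent; in general, $P(A \text{ and } B) = P(A|B) \times P(B)$

    • Union of Events: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

    • Conditional Probability: $P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$, defined when $P(B) > 0$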
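
These rules can be verified by enumerating a small sample space. The sketch below uses two fair dice and exact fractions; the events A, B, and C are chosen only for illustration. It also shows that the product form of the joint probability holds for independent events but not for dependent ones.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered outcomes of rolling two fair dice.
sample_space = set(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) when every outcome in the sample space is equally likely."""
    return Fraction(len(event), len(sample_space))


A = {o for o in sample_space if o[0] == 6}      # first die shows 6
B = {o for o in sample_space if sum(o) >= 10}   # total is at least 10
C = {o for o in sample_space if o[1] % 2 == 0}  # second die is even

p_and = prob(A & B)            # joint probability: 1/12
p_or = prob(A | B)             # union: 1/4
p_a_given_b = p_and / prob(B)  # conditional probability: 1/2

print(p_and, p_or, p_a_given_b)
print(p_or == prob(A) + prob(B) - p_and)  # True: the union rule holds
print(p_and == prob(A) * prob(B))         # False: A and B are dependent
print(prob(A & C) == prob(A) * prob(C))   # True: A and C are independent
```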

Bayes' Theorem
  • Formula: $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$

    • Helps update probabilities based on new evidence.
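
A common way to see the "update" in action is a screening test. The numbers below (1% prevalence, 95% sensitivity, 5% false positive rate) are hypothetical, and the denominator P(B) is obtained with the law of total probability, which is an extra step not spelled out in the notes.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical screening-test numbers, for illustration only.
p_disease = 0.01            # prior P(A): prevalence of the condition
p_pos_given_disease = 0.95  # P(B|A): test is positive when the condition is present
p_pos_given_healthy = 0.05  # false positive rate

# P(B) by the law of total probability over "disease" and "healthy".
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 3))  # 0.161
```

Even with a fairly accurate test, the posterior is only about 16% because the prior is so low; the evidence raises the probability from 1% but does not make the condition likely.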

Probability Functions
  1. Probability Mass Function (PMF): For discrete random variables.

  2. Probability Density Function (PDF): For continuous random variables.

  3. Cumulative Distribution Function (CDF): Probability that the variable takes a value less than or equal to a certain threshold.
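
To make the three functions concrete, the sketch below evaluates a PMF and CDF for a discrete variable (number of heads in 4 fair coin flips, a binomial example chosen for illustration) and a PDF and CDF for a continuous one (the standard normal, via `statistics.NormalDist` from the standard library).

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """PMF: probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """CDF: probability of at most k successes, i.e. the sum of the PMF up to k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Discrete case: number of heads in 4 fair coin flips.
pmf_2 = binomial_pmf(2, 4, 0.5)  # P(X = 2)  = 6/16  = 0.375
cdf_2 = binomial_cdf(2, 4, 0.5)  # P(X <= 2) = 11/16 = 0.6875

# Continuous case: standard normal distribution.
std_normal = NormalDist(mu=0.0, sigma=1.0)
density_at_0 = std_normal.pdf(0.0)  # a density (about 0.3989), not a probability
cdf_at_0 = std_normal.cdf(0.0)      # P(X <= 0) = 0.5

print(pmf_2, cdf_2, round(density_at_0, 4), cdf_at_0)
```

For a continuous variable the PDF value at a single point is a density; probabilities come from the CDF, for example `std_normal.cdf(1) - std_normal.cdf(-1)` for the interval from -1 to 1.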

Probability Distribution Functions
  1. Normal Distribution: Described by mean (μ) and standard deviation (σ); bell-shaped curve.

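  2. Student t-distribution: Used when the sample size is small and the population standard deviation is unknown.

  3. Chi-squared distribution: Used in tests on categorical variables, such as tests of independence and goodness of fit.

  4. Binomial distribution: Models the number of successes in a fixed number of independent trials.

  5. Poisson distribution: Models the number of events occurring in a given interval.

  6. Uniform distribution: Constant probability across all outcomes in its range.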
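
For reference, the standard textbook forms of four of these distributions are written out here. The symbols n (number of trials), p (success probability), k (observed count), λ (average event rate), and a, b (interval endpoints) do not appear in the notes and are introduced only for this sketch; the t and chi-squared densities are omitted because their forms are longer.

Normal density with mean μ and standard deviation σ:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Binomial PMF for k successes in n trials:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$

Poisson PMF for k events in an interval with average rate λ:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$

Continuous uniform density on the interval [a, b]:

$$f(x) = \frac{1}{b - a}, \quad a \le x \le b$$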

Parameter Estimation for Statistical Inference
  • Population: Complete group of interest.

  • Sample: Subset drawn from the population.

  • Expectation: Mean or expected value of a variable.

  • Parameter: A numerical characteristic of the population, such as μ or σ.

  • Statistic: Value computed from the sample to estimate population parameters.

  • Estimation: Process of inferring parameters from statistics.

  • Bias: Difference between estimator's expected value and the true parameter.
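
Bias can be shown exactly on a tiny example. The sketch below takes a hypothetical population of three values, enumerates every equally likely sample of size 2 drawn with replacement, and compares the expected value of two variance estimators (dividing by n and by n - 1) with the true population variance. The population, the sample size, and the `ddof` parameter name are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6]  # hypothetical population
n = 2                   # sample size


def mean(xs):
    """Exact mean as a Fraction."""
    return Fraction(sum(xs), len(xs))


def variance(xs, ddof=0):
    """Variance with divisor len(xs) - ddof (ddof=0: divide by n; ddof=1: divide by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)


true_var = variance(population)  # the parameter being estimated: 8/3

# All 9 equally likely samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

expected_biased = mean([variance(s, ddof=0) for s in samples])    # 4/3
expected_unbiased = mean([variance(s, ddof=1) for s in samples])  # 8/3

print(expected_biased - true_var)    # -4/3: dividing by n underestimates on average
print(expected_unbiased - true_var)  # 0: dividing by n - 1 is unbiased here
```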

Hypothesis Testing
  1. Null Hypothesis (H₀): States that there is no effect or difference.

  2. Alternative Hypothesis (H₁): States that there is an effect or difference.

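  3. Degrees of Freedom: Number of independent values that are free to vary in a calculation.

  4. Level of Significance (α): Threshold for rejecting H₀ (commonly 0.05); it is the accepted probability of a Type I error.

  5. p-value: Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true.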

  6. Type I and Type II Errors:

    • Type I: Incorrectly rejecting H₀; false positive.

    • Type II: Not rejecting H₀ when it is false; false negative.

  7. Confidence Intervals: Range of values expected to contain the true population parameter at a specified confidence level.
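
The pieces above fit together in a one-sample z-test. This is a minimal sketch that assumes the population standard deviation is known and the sample mean is approximately normal; the function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma.

    Returns (z, p_value, confidence_interval, reject_h0).
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a result at least as extreme as z, in either direction.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # (1 - alpha) confidence interval for the population mean.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (sample_mean - z_crit * standard_error,
          sample_mean + z_crit * standard_error)
    return z, p_value, ci, p_value < alpha


# Hypothetical numbers: H0 says the mean is 50; a sample of 64 has mean 52.
z, p_value, ci, reject = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=8.0, n=64)

print(z)                              # 2.0
print(round(p_value, 4))              # 0.0455
print(tuple(round(v, 2) for v in ci)) # (50.04, 53.96)
print(reject)                         # True at alpha = 0.05
```

The 95% confidence interval excludes 50, which agrees with rejecting H₀ at α = 0.05. If H₀ were in fact true, this rejection would be a Type I error.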

Statistical Tests
  1. Parametric Tests: Assume the data follow a specific distribution, usually the normal distribution.

    • Z-test and t-test: Compare means. F-test: Compares variances.

  2. ANOVA (Analysis of Variance): Compares the means of three or more groups.

  3. Chi-squared Test: Tests for association between categorical variables.

  4. Non-Parametric Tests: Make no assumption about the underlying distribution, e.g., Mann-Whitney U Test.

  5. A/B Testing: Compares two versions to determine effectiveness.

  6. Regression: Models relationship between variables; formula: $y = \beta_0 + \beta_1 x$.
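
For the regression formula, the coefficients can be estimated by ordinary least squares. The notes do not name a fitting method, so least squares is an assumption here, and the function name and data are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = beta0 + beta1 * x.

    beta1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    beta0 = mean_y - beta1 * mean_x
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Illustrative data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

beta0, beta1 = fit_line(xs, ys)
print(beta0, beta1)       # 1.0 2.0
print(beta0 + beta1 * 6)  # prediction for x = 6: 13.0
```

With noisy real data the points would not fall exactly on the line, and the fitted coefficients would be the values that minimize the sum of squared vertical distances.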

Summary

  • These notes cover big data computing, analytics, and statistics, from basic definitions to hypothesis testing and regression, as a foundation for students and professionals in data science and analytics.

  1. What is Big Data?

    • A) A small amount of data that can be processed easily

    • B) A large volume of data that cannot be processed using traditional methods

    • C) Only structured data

    • D) Any data generated by a computer

    • Correct Answer: B) A large volume of data that cannot be processed using traditional methods

  2. What are the primary characteristics of Big Data analytics?

    • A) Volume, Variety, Velocity

    • B) Quality, Reliability, Consistency

    • C) Speed, Accuracy, Flexibility

    • D) Size, Variability, Validity

    • Correct Answer: A) Volume, Variety, Velocity

  3. Which of the following is NOT a type of Big Data?

    • A) Structured Data

    • B) Semi-Structured Data

    • C) Unstructured Data

    • D) Temporal Data

    • Correct Answer: D) Temporal Data

  4. Which tool is an open-source framework for distributed storage and processing?

    • A) Tableau

    • B) Hadoop

    • C) Python

    • D) SQL

    • Correct Answer: B) Hadoop

  5. What is one benefit of Big Data analytics in healthcare?

    • A) Increased paperwork

    • B) Fraud detection and improved care delivery

    • C) More medication errors

    • D) Less patient tracking

    • Correct Answer: B) Fraud detection and improved care delivery

  6. What is 'data overload'?

    • A) When data is too small to analyze

    • B) Managing too much data that becomes complex and hard to interpret

    • C) A process of data cleansing

    • D) A type of software for data storage

    • Correct Answer: B) Managing too much data that becomes complex and hard to interpret

  7. What is Big Data Analytics?

    • A) The process of collecting data from small datasets

    • B) A process that aims to extract meaningful insights from large datasets

    • C) A type of data storage system

    • D) A method for cleaning data

    • Correct Answer: B) A process that aims to extract meaningful insights from large datasets

  8. Which of the following is an application domain of Big Data?

    • A) Education

    • B) Fashion

    • C) Sports

    • D) None of the above

    • Correct Answer: A) Education

  9. Which primary attribute of Big Data refers to the speed at which data is generated?

    • A) Volume

    • B) Variety

    • C) Velocity

    • D) Variability

    • Correct Answer: C) Velocity

  10. What does 'semi-structured data' refer to?

    • A) Data with a fixed structure

    • B) Data that does not follow a strict structure but has some form of organization

    • C) Data that is completely unorganized

    • D) None of the above

    • Correct Answer: B) Data that does not follow a strict structure but has some form of organization

  11. Which of the following is a benefit of Big Data analytics in agriculture?

    • A) Increased pesticide use

    • B) More significant environmental impact

    • C) Boosted productivity through crop predictions

    • D) Reduced efficiency in farming techniques

    • Correct Answer: C) Boosted productivity through crop predictions

  12. What kind of data is dealt with in spatial analytics?

    • A) Historical data

    • B) Numeric data

    • C) Location-based data

    • D) Textual data

    • Correct Answer: C) Location-based data

  13. What is a common tool used for data visualization in Big Data analytics?

    • A) Excel

    • B) Tableau

    • C) SQL Server

    • D) Access

    • Correct Answer: B) Tableau

  14. Which type of analytics is used to predict future events based on historical data?

    • A) Prescriptive Analytics

    • B) Descriptive Analytics

    • C) Predictive Analytics

    • D) Diagnostic Analytics

    • Correct Answer: C) Predictive Analytics

  15. What does the term 'data governance' refer to?

    • A) Managing data infrastructure

    • B) Developing policies for data management

    • C) Cleaning data

    • D) Storing data securely

    • Correct Answer: B) Developing policies for data management

  16. Which distribution is commonly used for categorical variables?

    • A) Normal distribution

    • B) Binomial distribution

    • C) Chi-squared distribution

    • D) Poisson distribution

    • Correct Answer: C) Chi-squared distribution

  17. What is the main challenge of data quality in Big Data?

    • A) Accurate representation of data

    • B) High storage cost

    • C) Consistency over time

    • D) All of the above

    • Correct Answer: A) Accurate representation of data

  18. Which of the following statements about NoSQL databases is TRUE?

    • A) They use fixed schemas for data storage

    • B) They only handle structured data

    • C) They can manage various data types without fixed schemas

    • D) They are not scalable

    • Correct Answer: C) They can manage various data types without fixed schemas

  19. What is a limitation of traditional data processing systems when dealing with Big Data?

    • A) They are highly scalable

    • B) They can handle real-time data

    • C) They struggle with large volumes of streaming data

    • D) They support structured query languages

    • Correct Answer: C) They struggle with large volumes of streaming data

  20. What is an example of unstructured data?

    • A) Excel spreadsheet

    • B) Customer survey results

    • C) Email messages

    • D) Database entries

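    • Correct Answer: C) Email messages (the body of a message is free-form text, although the headers give email as a whole a semi-structured form)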