Module 1 - Statistics

Data Distributions

Introduction to Statistical Methods

Course Overview

  • Class Title: STAT 2331
  • Institution: Southern Methodist University (SMU)

What is Statistics?

  • Real-World Examples:
    • Points scored by each player on a basketball team
    • Unemployment rate
    • Percentage of students who indulge in binge drinking
  • Definition of Statistics:
    • Statistics are numbers calculated from data. As a discipline, statistics also involves:
    • Designing studies.
    • Making data-informed decisions.
  • Additional Aspects of Statistics:
    • Statistics is not just about numbers; it is fundamentally about:
    • Collecting data.
    • Analyzing data.
    • Interpreting data to influence decisions.

Why Should You Care?

  • Growth of the Field:
    • The field of statistics has grown rapidly over the past few decades.
  • Related Fields:
    • Data science and machine learning are recent outgrowths of statistics.
  • Job Security:
    • Statisticians enjoy excellent job security with opportunities in multiple sectors:
    • Academia
    • Industry
    • Consulting
    • Government
  • Everyday Relevance:
    • Statistics permeates daily life, shaping how we make decisions and understand the world.

Questions Statistics Can Answer

  • Questions that statistics can address include:
    • Are large doses of vitamin C beneficial?
    • Which team is most likely to win a tournament?
    • How can you analyze whether a diet really works?
    • What is the chance your tax return will be audited?
    • Is there bias against women in appointing managers?
    • Are rates of anxiety and depression increasing over time?

Module Objectives

  • Key Learning Outcomes:
    • Distinguish between statistics and parameters.
    • Explain the impact of variability on sampling and inference.
    • Distinguish between categorical and quantitative data.
    • Describe and display categorical data:
    • Graphically (e.g., charts).
    • Numerically (e.g., through totals or percentages).
    • Describe and display quantitative data:
    • Graphically (e.g., histograms).
    • Numerically (e.g., means or medians).

Four Main Phases of Statistical Analysis

  • Research Question:
    • Choose a question you are interested in answering.
  • Design:
    • Outline how to obtain data to answer the question of interest.
  • Description:
    • Summarize the data obtained.
  • Inference:
    • Make decisions in the face of uncertainty based on the data.

Probability versus Statistics

  • Scenario 1:

    • Known population percentage: in a population of 1000 people aged 60 or older, 21% have diabetes.
    • Calculate the probability that 3 out of 10 randomly sampled people have diabetes.
    • This scenario involves: Probability.
  • Scenario 2:

    • Unknown population percentage: sample indicates 2 out of 10 people have diabetes.
    • Use data to predict the population percentage.
    • This scenario involves: Statistics.
    • Focus in this class: Scenario 2.
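
Scenario 1 can be worked as a binomial calculation. The sketch below assumes each of the 10 sampled people independently has a 21% chance of having diabetes (a close approximation when only 10 people are drawn from 1000) and reads "3 out of 10" as exactly 3. Scenario 2 runs in the other direction: the sample proportion serves as the estimate of the unknown population percentage. The function name is illustrative.

```python
from math import comb

def binomial_probability(k, n, p):
    """P(exactly k successes in n independent trials with success chance p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Scenario 1 (probability): the population percentage is known,
# and the question is about what a sample will look like.
p_three = binomial_probability(3, 10, 0.21)
print(f"P(exactly 3 of 10 have diabetes) = {p_three:.3f}")

# Scenario 2 (statistics): the sample is known,
# and the question is about the population percentage.
sample_proportion = 2 / 10
print(f"Estimated population percentage = {sample_proportion:.0%}")
```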

The Language of Statistics

  • Subjects:
    • The individuals or entities we observe or measure (often people).
  • Target Population:
    • The entire group of subjects about which we want to draw conclusions.
  • Sample Frame:
    • The list from which the sample is selected.
  • Sample:
    • A subset of the population for which we plan to have data.
  • Parameter:
    • A numerical summary of the population.
  • Statistic:
    • A numerical summary of a sample.
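
The difference between a parameter and a statistic can be seen in a small simulation. This is a sketch on a hypothetical population (1000 people, 21% of whom have diabetes); the seed and the number of samples are arbitrary choices. The parameter is a single fixed number, while the statistic changes from sample to sample.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1000 people, 210 of whom (21%) have diabetes.
population = [1] * 210 + [0] * 790

# Parameter: a numerical summary of the whole population.
parameter = sum(population) / len(population)

# Statistic: the same summary computed on a random sample of 10 people.
sample_proportions = []
for _ in range(5):
    sample = random.sample(population, 10)
    sample_proportions.append(sum(sample) / len(sample))

print("Parameter (population proportion):", parameter)
print("Statistics (sample proportions):", sample_proportions)
```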

Examples in Statistical Analysis

Example 1: Dieting Obese Women Study

  • Research Question:
    • Do dieting obese women following a new diet plan lose more weight over 6 weeks than those who don’t?
  • Design:
    • Randomly sample 30 women following the plan and 30 not following it.
    • Weigh everyone at the start and again at 6 weeks.
  • Description:
    • Record each woman's weight loss. Mean weight loss: 18 pounds for women on the new plan, 12 pounds for those not on it.
  • Inference:
    • Conclusion: Women on the new diet plan lose more weight than those not on it.
  • Terms to Identify in This Study:
    • Subjects
    • Target Population
    • Sample Frame
    • Sample
    • Statistic(s)
    • Parameter(s)
  • Importance of Random Sampling:
    • Helps make results generalizable to the target population and reduces selection bias.
  • Impact of Larger Sample Sizes:
    • Generally gives more precise estimates, because sample statistics vary less from sample to sample.

Inference: The Goal of Statistical Analysis

  • Research Questions:
    • Focused on specific populations.
  • Design Considerations:
    • Create a list (sample frame) and draw a random sample from it.
  • Description of Data:
    • Summarizes sample findings with statistics.
  • Inference:
    • Uses sample statistics to draw conclusions or make predictions about population parameters.

Statistical Analysis Starts with a Data Set

Example Data Set Overview

ID  Product               Type  Sensory Score  # Cookies  Cals  Fat
1   Archway Home Style    Soft  32             1.2        155   6
2   Busy Baker            Hard  33             2.1        171   9
3   Entenmann's           Soft  44             3          150   8
4   Estee                 Hard  13             4.1        150   7
5   Famous Amos           Hard  33             3.2        148   6
6   Freihofer's           Soft  45             2.3        138   7
7   Keebler Chips Deluxe  Hard  50             2          160   10

  • Key Variables:
    • Categorical Variables: Product (brand name) and Type (soft or hard).
    • Quantitative Variables: Sensory score, number of cookies, calories, fat content.

Types of Variables

Categorical Variables

  • Definition:
    • A variable that fits observations into categories.
  • Sub-types:
    • Nominal: Cases are classified into groups without any order.
    • Examples: Gender, color.
    • Ordinal: Groups maintain a natural order.
    • Examples: Educational levels, satisfaction ratings.

Quantitative Variables

  • Definition:
    • Observations take numerical values.
  • Sub-types:
    • Discrete: Countable list of possible values (often whole numbers) with no values in between.
    • Examples: Number of students.
    • Continuous: Infinitely many values are possible between any two points.
    • Examples: Weight, height.

Exploratory Data Analysis

Components of Exploratory Data Analysis

  • Steps Involved:
    • Look at each variable individually.
    • Analyze relationships between variables.
    • Display data visually (charts, graphs).
    • Summarize data numerically.
  • Data Example:
    • From 1900 to 2016, 1303 unprovoked shark attacks reported in the USA:
    • California: 187
    • Florida: 828
    • Hawaii: 230
    • Texas: 58
    • Related questions for investigations:
    • Identify variable of interest.
    • Establish type and sub-type.
    • Define values for the variable.

Exploring a Single Categorical Variable

Methods of Analysis

  • Frequency Tables:
    • List possible values for a variable along with their counts.
  • Pie Charts:
    • Visualize the percentage of each category within the whole dataset.
  • Bar Charts:
    • Use bars to represent counts or proportions for each category.

Example Data for Categorical Analysis

State        Frequency  Percentage
California   187        14.4%
Florida      828        63.5%
Hawaii       230        17.7%
Texas        58         4.5%
Total        1303       100%
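
The percentages in a frequency table come from dividing each count by the total. A minimal sketch using the shark attack counts from the notes (the variable names are illustrative):

```python
# Counts of unprovoked shark attacks by state, 1900 to 2016 (from the notes).
counts = {"California": 187, "Florida": 828, "Hawaii": 230, "Texas": 58}

total = sum(counts.values())

# Percentage for each category = count / total * 100.
percentages = {state: 100 * n / total for state, n in counts.items()}

print(f"{'State':<12}{'Frequency':>10}{'Percentage':>12}")
for state, n in counts.items():
    print(f"{state:<12}{n:>10}{percentages[state]:>11.1f}%")
print(f"{'Total':<12}{total:>10}{100:>11.1f}%")
```

Because each percentage is rounded to one decimal place, the rounded values can add up to slightly more or less than 100%.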

Categorical Data Summary

  • Characteristics:
    • Observations fit within groups (e.g., named categories, ordered groups).
  • Key Features:
    • Count, percentage, or proportion calculated.
  • Graphical Displays:
    • Include frequency tables and various chart types.

Exploring a Single Quantitative Variable

Features Analyzed

  • Shape:
    • Distribution of the sample values.
  • Center:
    • Measured using mean, median, or mode.
  • Spread:
    • Measured using standard deviation, variance, range, or interquartile range (IQR).
  • Graphical Displays:
    • Histograms and boxplots help visualize data patterns.

Histograms

Basic Features

  • Definition:
    • A graphical representation that divides quantitative data into intervals (bins).
  • Count Observations in Each Bin:
    • Create a frequency table reflecting the grouped data.
  • Graphical Display:
    • Bins on the horizontal axis, frequency on the vertical axis.

Example Histogram Data

Age         Frequency  Relative Frequency
6.5-8.5     6          0.021
8.5-10.5    7          0.025
10.5-12.5   14         0.049
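
The binning step behind a histogram can be sketched as follows. The ages below are hypothetical values made up for illustration, not the course data; only the bin boundaries match the table. The helper `bin_counts` is an illustrative name, and it assumes each bin includes its lower edge and excludes its upper edge.

```python
def bin_counts(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical ages, made up for illustration (not the course data).
ages = [7, 8, 8, 9, 10, 10, 11, 11, 12, 12]
edges = [6.5, 8.5, 10.5, 12.5]  # bin boundaries as in the table

counts = bin_counts(ages, edges)

# Relative frequency = bin count / total number of observations.
relative = [c / len(ages) for c in counts]

# Text histogram: one row per bin, one '#' per observation.
for i, (c, r) in enumerate(zip(counts, relative)):
    label = f"{edges[i]}-{edges[i + 1]}"
    print(f"{label:<10} {'#' * c:<6} count={c} rel_freq={r:.2f}")
```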

Describing the Shape of Quantitative Data

Key Shapes to Identify

  • Number of Modes:
    • Unimodal (one peak), bimodal (two peaks), or uniform (no distinct peak).
  • Skewness:
    • Symmetric, skewed right (positive skew), skewed left (negative skew).

Comparing Distributions

Hypothetical Examples

  • Common Quantitative Variables: Many familiar variables have a characteristic shape.
  • Examples:
    • IQ (approximately normal: symmetric and bell-shaped).
    • Human lifespan, i.e., age at death (skewed left: most deaths occur at older ages, with a tail of early deaths).
    • Annual incomes of adults (skewed right: a few very high incomes stretch the upper tail).

Measuring the Center of Quantitative Data

Measures of Central Tendency

  • Mode:
    • The most frequently occurring value.
  • Mean:
    • Average calculated as the sum of observations divided by the sample size.
    • Mathematical Expression: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
  • Median:
    • Middle value when observations are ordered.
    • Example Series: [2, 4, 6, 8, 10] – Median = 6.
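
The three measures of center can be computed with Python's standard `statistics` module. The sketch uses the example series from the notes for the mean and median; the list used for the mode is hypothetical, since the example series has no repeated value.

```python
from statistics import mean, median, mode

series = [2, 4, 6, 8, 10]  # example series from the notes

series_mean = mean(series)      # sum of observations / sample size
series_median = median(series)  # middle value of the ordered list

# With an even number of observations, the median is the average
# of the two middle values.
even_median = median([2, 4, 6, 8])

# Hypothetical list with a repeated value, to illustrate the mode.
example_mode = mode([1, 3, 3, 5, 7])

print(series_mean, series_median, even_median, example_mode)
```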

Comparing Mean and Median

Impact of Outliers

  • Outlier Definition:
    • A data point that is far outside the general distribution of the dataset.
  • Comparative Behavior:
    • Median: Resistant to outliers; remains stable with extreme values.
    • Mean: Sensitive to outliers; a single extreme value can pull it far from the bulk of the data.
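
A short sketch of this behavior: the second list is a hypothetical variation of the first, with the largest value replaced by an outlier.

```python
from statistics import mean, median

data = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 100]  # hypothetical: 10 replaced by an outlier

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("Without outlier: mean =", mean_before, "median =", median_before)
print("With outlier:    mean =", mean_after, "median =", median_after)
```

The mean moves from 6 to 24, while the median stays at 6.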

Measuring the Spread of Quantitative Data

Key Metrics

  • Range:
    • Calculated as the difference between maximum and minimum values.
      \text{Range} = \text{Maximum} - \text{Minimum}
  • Variance:
    • Roughly the average squared deviation from the mean (the sum is divided by n - 1 rather than n):
      s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
  • Standard Deviation (SD):
    • Square root of the variance:
      s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
  • Interquartile Range (IQR):
    • Difference between the third quartile (Q3) and the first quartile (Q1):
      \text{IQR} = Q3 - Q1
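
The four measures of spread can be sketched in Python. Quartile conventions differ between textbooks and software; the helper `quartiles` below is an illustrative function that assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data (leaving out the middle value when n is odd), so other tools may report slightly different quartiles.

```python
from statistics import median, variance, stdev

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values; the middle value is left out when n is odd.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [2, 4, 6, 8, 10]

data_range = max(data) - min(data)  # maximum - minimum
s_squared = variance(data)          # sample variance, divides by n - 1
s = stdev(data)                     # square root of the sample variance
q1, q3 = quartiles(data)
iqr = q3 - q1

print("Range:", data_range)
print("Variance:", s_squared)
print("Standard deviation:", round(s, 3))
print("IQR:", iqr)
```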

Identifying Outliers

Criteria for Outlier Detection

  • Mild Outliers:
    • Values less than Q1 - 1.5 \times IQR or greater than Q3 + 1.5 \times IQR (but not beyond the 3 \times IQR limits for extreme outliers).
  • Extreme Outliers:
    • Values less than Q1 - 3 \times IQR or greater than Q3 + 3 \times IQR.
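
These criteria translate directly into a small classifier. The function name `classify` and the quartile values in the usage lines are hypothetical, chosen only to illustrate the rule.

```python
def classify(value, q1, q3):
    """Label a value using the 1.5 x IQR and 3 x IQR limits."""
    iqr = q3 - q1
    # Check the wider (extreme) limits first.
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

# Hypothetical quartiles for illustration: Q1 = 10, Q3 = 20, so IQR = 10.
# Mild limits are -5 and 35; extreme limits are -20 and 50.
for v in [12, 36, 55, -8]:
    print(v, "->", classify(v, 10, 20))
```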

Boxplots

Features of Boxplots

  • Summary Visualization:
    • Represents the five-number summary: minimum, Q1, median, Q3, and maximum.
  • Display characteristics of the data:
    • Spread and potential outliers.
  • Whiskers:
    • Extend from the quartiles to the farthest non-outlier points.

Example Boxplot Analysis

  • Example Data:
    • Set: [1, 2, 7, 11, 12, 18, 23, 25, 98, 197]
  • Five Number Summary:
    • Minimum = 1
    • Q1 = 7
    • Median = 15 (the average of the two middle values, 12 and 18, since the set has ten values)
    • Q3 = 25
    • Maximum = 197
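
The five-number summary for this data set can be computed with a short sketch. It assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data, which is one common convention; with ten values, the median is the average of the fifth and sixth ordered values. The function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])   # median of the lower half
    q3 = median(ordered[-half:])  # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

data = [1, 2, 7, 11, 12, 18, 23, 25, 98, 197]
minimum, q1, med, q3, maximum = five_number_summary(data)

# Values beyond Q1 - 1.5 x IQR or Q3 + 1.5 x IQR are flagged as outliers.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
flagged = [x for x in data if x < lower_limit or x > upper_limit]

print("Five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", iqr, "Limits:", lower_limit, upper_limit)
print("Flagged as outliers:", flagged)
```

Under this convention the IQR is 18, so any value above 52 is flagged; in a boxplot the upper whisker would stop at 25 and the values 98 and 197 would be drawn as individual points.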

Comparing Histograms and Corresponding Boxplots

Insight on Distribution:

  • Histograms:
    • Show the shape of the distribution, including the number of modes and skewness.
  • Boxplots:
    • Show the five-number summary and flag potential outliers.
  • Both:
    • Graphically represent the center and spread of the data.

Conclusion: Understanding Quantitative Data

Summary of Key Concepts

  • Characteristics:
    • Observations take numerical values on which arithmetic operations are meaningful.
  • Sub-types:
    • Discrete (countable values) vs. continuous (any value within an interval).
  • Graphical Displays:
    • Histogram and boxplot for analyzing spread and identifying outliers.
  • Numerical Summaries:
    • Essential features: shape, center (mean/median), spread (variance, standard deviation, IQR), and outliers.
    • Use resistant measures for skewed data (median and IQR).