Module 1 - Statistics

Data Distributions

Introduction to Statistical Methods

Course Overview

Class Title: STAT 2331
Institution: Southern Methodist University (SMU)

What is Statistics?

Real-World Examples:
- Points scored by each player on a basketball team
- Unemployment rate
- Percentage of students who indulge in binge drinking
Definition of Statistics:
- Statistics (plural) are numbers calculated from data. Statistics as a discipline also involves:
- Designing studies.
- Making data-informed decisions.
Additional Aspects of Statistics:
- Statistics is not just about numbers; it is fundamentally about:
- Collecting data.
- Analyzing data.
- Interpreting data to influence decisions.

Why Should You Care?

Growth of the Field:
- The field of statistics has grown rapidly over the past few decades.
Related Fields:
- Data science and machine learning are recent outgrowths of statistics.
Job Security:
- Statisticians enjoy excellent job security with opportunities in multiple sectors:
- Academia
- Industry
- Consulting
- Government
Everyday Relevance:
- Statistics permeate our daily lives, affecting decision-making and understanding of the world.

Questions Statistics Can Answer

Questions that statistics can address include:
- Are large doses of vitamin C beneficial?
- Which team is most likely to win a tournament?
- How can you analyze whether a diet really works?
- What is the chance your tax return will be audited?
- Is there bias against women in appointing managers?
- Are rates of anxiety and depression increasing over time?

Module Objectives

Key Learning Outcomes:
- Distinguish between statistics and parameters.
- Explain the impact of variability on sampling and inference.
- Distinguish between categorical and quantitative data.
- Describe and display categorical data:
- Graphically (e.g., charts).
- Numerically (e.g., through totals or percentages).
- Describe and display quantitative data:
- Graphically (e.g., histograms).
- Numerically (e.g., means or medians).

Four Main Phases of Statistical Analysis

Research Question:
- Choose a question you are interested in answering.
Design:
- Outline how to obtain data to answer the question of interest.
Description:
- Summarize the data obtained.
Inference:
- Make decisions in the face of uncertainty based on the data.

Probability versus Statistics

Scenario 1:
- Known probability: 21% of the people in a population of 1000 aged 60 or older have diabetes.
- Calculate the probability that 3 out of 10 sampled people will have diabetes.
- This scenario involves: Probability.
Scenario 2:
- Unknown population percentage: sample indicates 2 out of 10 people have diabetes.
- Use data to predict the population percentage.
- This scenario involves: Statistics.
- Focus in this class: Scenario 2.
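
A minimal sketch of the contrast between the two scenarios. It assumes the 10 people are sampled at random; the binomial formula treats the draws as independent, while the hypergeometric formula accounts for sampling without replacement from the population of 1000 (210 of whom have diabetes). The function names are illustrative and not part of the course material.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeometric_pmf(k, n, successes, population):
    """Probability of exactly k successes when drawing n without replacement."""
    return (comb(successes, k) * comb(population - successes, n - k)
            / comb(population, n))

# Scenario 1 (probability): the population percentage is known (21%).
p_binomial = binomial_pmf(3, 10, 0.21)
p_exact = hypergeometric_pmf(3, 10, 210, 1000)

# Scenario 2 (statistics): the population percentage is unknown,
# so the sample proportion serves as a point estimate of it.
sample_proportion = 2 / 10

print(f"Scenario 1, binomial approximation:       {p_binomial:.4f}")
print(f"Scenario 1, sampling without replacement: {p_exact:.4f}")
print(f"Scenario 2, estimated population proportion: {sample_proportion:.2f}")
```

Because only 10 of 1000 people are sampled, the two Scenario 1 answers should be close, which is why the simpler binomial model is often used.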

The Language of Statistics

Subjects:
- The subjects we observe or measure (often people).
Target Population:
- The entire group of subjects about which we want to draw conclusions.
Sample Frame:
- The list from which the sample is selected.
Sample:
- A subset of the population for which we plan to have data.
Parameter:
- A numerical summary of the population.
Statistic:
- A numerical summary of a sample.

Examples in Statistical Analysis

Example 1: Dieting Obese Women Study

Research Question:
- Do dieting obese women following a new diet plan lose more weight over 6 weeks than those who don’t?
Design:
- Randomly sample 30 women following the plan and 30 not following it.
- Weigh everyone at the start and again at 6 weeks.
Description:
- Record each woman's total weight loss; the mean weight loss was 18 pounds for those on the new plan and 12 pounds for those not on it.
Inference:
- Conclusion: Women on the new diet plan lose more weight than those not on it.
- Identify each of the following in this study:
- Subjects
- Target Population
- Sample Frame
- Sample
- Statistic(s)
- Parameter(s)
Importance of Random Sampling:
- Helps make results generalizable to the target population and reduces selection bias.
Impact of Larger Sample Sizes:
- Generally leads to more reliable results.
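
The effect of sample size can be illustrated with a small simulation. This is only a sketch: the population below is hypothetical (simulated weight-loss values, not data from the study), and the seed, sample sizes, and number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 weight-loss values in pounds (not course data).
population = [random.gauss(15, 6) for _ in range(10_000)]
parameter = statistics.mean(population)  # the population mean is a parameter

def spread_of_sample_means(sample_size, repetitions=500):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(30)
spread_large = spread_of_sample_means(300)

print(f"Population mean (parameter):   {parameter:.2f}")
print(f"Spread of sample means, n=30:  {spread_small:.2f}")
print(f"Spread of sample means, n=300: {spread_large:.2f}")
```

Each sample mean is a statistic that varies from sample to sample; with the larger sample size the sample means cluster more tightly around the parameter.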

Inference: The Goal of Statistical Analysis

Research Questions:
- Focused on specific populations.
Design Considerations:
- Create a list (sample frame) and gather random data.
Description of Data:
- Summarizes sample findings with statistics.
Inference:
- Leads to hypotheses or predictions about population parameters.

Statistical Analysis Starts with a Data Set

Example Data Set Overview

ID	Product	Type	Sensory Score	# Cookies	Cals	Fat
1	Archway Home Style	Soft	32	1.2	155	6
2	Busy Baker	Hard	33	2.1	171	9
3	Entenmann's	Soft	44	3	150	8
4	Estee	Hard	13	4.1	150	7
5	Famous Amos	Hard	33	3.2	148	6
6	Freihofer's	Soft	45	2.3	138	7
7	Keebler Chips Deluxe	Hard	50	2	160	10
…	…	…	…	…	…	…

Key Variables:
- Categorical Variables: Product (brand name) and Type (soft or hard).
- Quantitative Variables: Sensory score, number of cookies, calories, fat content.
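
As a rough illustration, the seven visible rows can be stored as records and summarized by the categorical variable Type. The remaining rows are elided in the table, so the results below describe only these seven cookies; the field order and variable names are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Only the seven rows shown in the table; the rest of the data set is elided.
# Field order: (id, product, type, sensory_score, cookies, calories, fat)
rows = [
    (1, "Archway Home Style", "Soft", 32, 1.2, 155, 6),
    (2, "Busy Baker", "Hard", 33, 2.1, 171, 9),
    (3, "Entenmann's", "Soft", 44, 3, 150, 8),
    (4, "Estee", "Hard", 13, 4.1, 150, 7),
    (5, "Famous Amos", "Hard", 33, 3.2, 148, 6),
    (6, "Freihofer's", "Soft", 45, 2.3, 138, 7),
    (7, "Keebler Chips Deluxe", "Hard", 50, 2, 160, 10),
]

# Group a quantitative variable (calories) by a categorical variable (type).
calories_by_type = defaultdict(list)
for _id, _product, cookie_type, _score, _cookies, calories, _fat in rows:
    calories_by_type[cookie_type].append(calories)

type_counts = {t: len(v) for t, v in calories_by_type.items()}
mean_calories = {t: mean(v) for t, v in calories_by_type.items()}

for cookie_type in type_counts:
    print(f"{cookie_type}: n={type_counts[cookie_type]}, "
          f"mean calories={mean_calories[cookie_type]:.2f}")
```

Counting makes sense for the categorical variable, while averaging makes sense only for the quantitative one.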

Types of Variables

Categorical Variables

Definition:
- A variable that fits observations into categories.
Sub-types:
- Nominal: Cases are classified into groups without any order.
- Examples: Gender, color.
- Ordinal: Groups maintain a natural order.
- Examples: Educational levels, satisfaction ratings.

Quantitative Variables

Definition:
- Observations take numerical values.
Sub-types:
- Discrete: Values come from a countable list (often whole numbers), with no intermediate values possible.
- Examples: Number of students.
- Continuous: Any value within an interval is possible, so infinitely many values lie between any two points.
- Examples: Weight, height.

Exploratory Data Analysis

Components of Exploratory Data Analysis

Steps Involved:
- Look at each variable individually.
- Analyze relationships between variables.
- Display data visually (charts, graphs).
- Summarize data numerically.
Data Example:
- From 1900 to 2016, 1303 unprovoked shark attacks were reported in the USA:
- California: 187
- Florida: 828
- Hawaii: 230
- Texas: 58
- Questions to investigate:
- What is the variable of interest?
- What are its type and sub-type?
- What values can the variable take?

Exploring a Single Categorical Variable

Methods of Analysis

Frequency Tables:
- List possible values for a variable along with their counts.
Pie Charts:
- Visualize the percentage of each category within the whole dataset.
Bar Charts:
- Use bars to represent counts or proportions for each category.

Example Data for Categorical Analysis

State	Frequency	Percentage
California	187	14.4%
Florida	828	63.5%
Hawaii	230	17.7%
Texas	58	4.5%
Total	1303	100%
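
The percentage column can be reproduced from the counts. A short sketch, with variable names chosen for illustration:

```python
# Counts of unprovoked shark attacks by state, taken from the table above.
attacks = {"California": 187, "Florida": 828, "Hawaii": 230, "Texas": 58}

total = sum(attacks.values())

# Percentage = category count divided by the total, times 100.
percentages = {
    state: round(100 * count / total, 1) for state, count in attacks.items()
}

for state, count in attacks.items():
    print(f"{state:<12}{count:>6}{percentages[state]:>7.1f}%")
print(f"{'Total':<12}{total:>6}")
```

Because each percentage is rounded separately, the rounded values add up to 100.1 rather than exactly 100.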

Categorical Data Summary

Characteristics:
- Observations fit within groups (e.g., named categories, ordered groups).
Key Features:
- Count, percentage, or proportion calculated.
Graphical Displays:
- Include frequency tables and various chart types.

Exploring a Single Quantitative Variable

Features Analyzed

Shape:
- Distribution of the sample values.
Center:
- Measured using mean, median, or mode.
Spread:
- Measured using standard deviation, variance, range, or interquartile range (IQR).
Graphical Displays:
- Histograms and boxplots help visualize data patterns.

Histograms

Basic Features

Definition:
- A graphical representation that divides quantitative data into intervals (bins).
Count Observations in Each Bin:
- Create a frequency table of the grouped data.
Graphical Display:
- Bins on the horizontal axis, frequency on the vertical axis.

Example Histogram Data

Age	Frequency	Relative Frequency
6.5-8.5	6	0.021
8.5-10.5	7	0.025
10.5-12.5	14	0.049
…	…	…
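
A minimal sketch of how observations are sorted into bins. The ages below are hypothetical (the full data set behind the table is not given); the bin start of 6.5 and width of 2 match the table, and the function name is an assumption for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_bins:  # ignore values outside the binned range
            counts[index] += 1
    return counts

ages = [7, 8, 9, 10, 11, 11, 12]  # hypothetical ages, not the course data
start, width = 6.5, 2
counts = bin_counts(ages, start, width, n_bins=3)

# Relative frequency = bin count divided by the number of observations.
relative = [c / len(ages) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    low = start + i * width
    print(f"{low}-{low + width}\t{c}\t{r:.3f}")
```

In this sketch a value exactly on a bin edge is placed in the higher bin; other conventions exist, and the choice should be stated when it matters.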

Describing the Shape of Quantitative Data

Key Shapes to Identify

Number of Modes:
- Unimodal (one peak), bimodal (two peaks), or uniform (no distinct peak).
Skewness:
- Symmetric, skewed right (positive skew), skewed left (negative skew).

Comparing Distributions

Hypothetical Examples

Common Quantitative Variables: Some variables tend to show characteristic shapes.
Examples:
- IQ scores (roughly symmetric and bell-shaped).
- Life expectancy for humans (typically skewed left, since most people live to older ages and a smaller number die young).
- Annual incomes of adults (typically skewed right).

Measuring the Center of Quantitative Data

Measures of Central Tendency

Mode:
- The most frequently occurring value.
Mean:
- Average calculated as the sum of observations divided by the sample size.
- Mathematical Expression: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
Median:
- Middle value when observations are ordered; with an even number of observations, the average of the two middle values.
- Example Series: [2, 4, 6, 8, 10] – Median = 6.
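
The three measures can be written out directly from their definitions. This is a sketch for illustration; Python's standard `statistics` module provides equivalent functions.

```python
from collections import Counter

def mean(values):
    """Sum of the observations divided by the sample size."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values tied for the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

series = [2, 4, 6, 8, 10]
print(mean(series), median(series))   # both equal 6 for this series
print(median([2, 4, 6, 8]))           # even n: average of 4 and 6
print(modes([1, 2, 2, 3]))            # 2 occurs most often
```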

Comparing Mean and Median

Impact of Outliers

Outlier Definition:
- A data point that is far outside the general distribution of the dataset.
Comparative Behavior:
- Median: Resistant to outliers; remains stable with extreme values.
- Mean: Sensitive to outliers; can shift dramatically toward extreme values.
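
A small made-up example shows the difference. Replacing the largest value with an outlier changes the mean but leaves the median where it was.

```python
from statistics import mean, median

typical = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 100]  # same data, but the largest value is now an outlier

mean_typical, median_typical = mean(typical), median(typical)
mean_outlier, median_outlier = mean(with_outlier), median(with_outlier)

print(f"Without outlier: mean={mean_typical}, median={median_typical}")
print(f"With outlier:    mean={mean_outlier}, median={median_outlier}")
```

The mean moves from 6 to 24, while the median stays at 6.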

Measuring the Spread of Quantitative Data

Key Metrics

Range:
- Calculated as the difference between maximum and minimum values.
  \text{Range} = \text{Maximum} - \text{Minimum}
Variance:
- Reflects the average squared deviation from the mean:
  s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
Standard Deviation (SD):
- Square root of the variance:
  s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
Interquartile Range (IQR):
- Difference between the third quartile (Q3) and the first quartile (Q1):
  \text{IQR} = Q3 - Q1
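
A worked sketch of the range, variance, and standard deviation on a small made-up data set, computed step by step from the formulas and then checked against Python's `statistics` module, which uses the same n - 1 divisor. The IQR is left out here because its value depends on the quartile convention used.

```python
import math
import statistics

data = [2, 4, 6, 8, 10]  # small made-up sample
n = len(data)
x_bar = sum(data) / n    # sample mean

data_range = max(data) - min(data)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Sample standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"Range = {data_range}")
print(f"Variance = {variance}")
print(f"Standard deviation = {std_dev:.4f}")

# Cross-check against the standard library.
print(statistics.variance(data), statistics.stdev(data))
```

Here the squared deviations are 16, 4, 0, 4, and 16, which sum to 40, so the variance is 40 / 4 = 10.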

Identifying Outliers

Criteria for Outlier Detection

Mild Outliers:
- Values that are less than Q1 - 1.5 \times IQR or greater than Q3 + 1.5 \times IQR (but not beyond the 3 \times IQR fences).
Extreme Outliers:
- Values less than Q1 - 3 \times IQR or greater than Q3 + 3 \times IQR.
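
These criteria can be sketched as a small classification function. The quartiles used below are hypothetical, and the function name is an assumption made for illustration.

```python
def classify(value, q1, q3):
    """Label a value as an extreme outlier, a mild outlier, or neither."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "not an outlier"

# Hypothetical quartiles: IQR = 10, so the 1.5 x IQR fences are -5 and 35
# and the 3 x IQR fences are -20 and 50.
q1, q3 = 10, 20

for value in (30, 40, 60):
    print(value, classify(value, q1, q3))
```

The extreme check comes first so that a value beyond the 3 × IQR fences is not also reported as mild.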

Boxplots

Features of Boxplots

Summary Visualization:
- Represents the five-number summary: minimum, Q1, median, Q3, and maximum.
Display characteristics of the data:
- Spread and potential outliers.
Whiskers:
- Extend from the quartiles to the farthest non-outlier points.

Example Boxplot Analysis

Example Data:
- Set: [1, 2, 7, 11, 12, 18, 23, 25, 98, 197]
Five Number Summary:
- Minimum = 1
- Q1 = 7
- Median = 15 (the average of the two middle values, 12 and 18)
- Q3 = 25
- Maximum = 197
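
The summary can be checked with a short sketch. It assumes quartiles are computed as the medians of the lower and upper halves of the ordered data; other conventions (including the defaults in some software) can give slightly different quartiles. With ten values, the median is the average of the fifth and sixth ordered values, 12 and 18.

```python
def median(ordered):
    """Median of an already sorted list."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum, with quartiles as medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [1, 2, 7, 11, 12, 18, 23, 25, 98, 197]
minimum, q1, med, q3, maximum = five_number_summary(data)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
# Whiskers stop at the farthest points that are not outliers.
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

print("Five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("Outliers:", outliers, "whiskers:", whisker_low, whisker_high)
```

With Q1 = 7 and Q3 = 25, the IQR is 18 and the fences are -20 and 52, so 98 and 197 are flagged as outliers and the upper whisker ends at 25.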

Comparing Histograms and Corresponding Boxplots

Insight on Distribution:

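Histograms:
- Show the overall shape of the distribution (modes and skewness).
Boxplots:
- Show the five-number summary and mark potential outliers.
Both:
- Graphically represent spread and outliers while giving an intuitive sense of the center.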

Conclusion: Understanding Quantitative Data

Summary of Key Concepts

Characteristics:
- Observations take numerical values on which arithmetic operations are meaningful.
Sub-types:
- Discrete vs. Continuous Variables: Indicates the type of numerical data.
Graphical Displays:
- Histogram and boxplot for analyzing spread and identifying outliers.
Numerical Summaries:
- Essential features: shape, center (mean/median), spread (variance, standard deviation, IQR), and outliers.
- Use resistant measures for skewed data (median and IQR).