Module 1 - Statistics
Data Distributions
Introduction to Statistical Methods
Course Overview
- Class Title: STAT 2331
- Institution: Southern Methodist University (SMU)
What is Statistics?
- Real-World Examples:
- Points scored by each player on a basketball team
- Unemployment rate
- Percentage of students who indulge in binge drinking
- Definition of Statistics:
- Statistics are numbers calculated from data. The discipline of statistics also involves:
- Designing studies.
- Making data-informed decisions.
- Additional Aspects of Statistics:
- Statistics is not just about numbers; it is fundamentally about:
- Collecting data.
- Analyzing data.
- Interpreting data to influence decisions.
Why Should You Care?
- Growth of the Field:
- The field of statistics has grown rapidly over the past few decades.
- Related Fields:
- Data science and machine learning are recent outgrowths of statistics.
- Job Security:
- Statisticians enjoy excellent job security with opportunities in multiple sectors:
- Academia
- Industry
- Consulting
- Government
- Everyday Relevance:
- Statistics permeate our daily lives, affecting decision-making and understanding of the world.
Questions Statistics Can Answer
- Questions that statistics can address include:
- Are large doses of vitamin C beneficial?
- Which team is most likely to win a tournament?
- How can you analyze whether a diet really works?
- What is the chance your tax return will be audited?
- Is there bias against women in appointing managers?
- Are rates of anxiety and depression increasing over time?
Module Objectives
- Key Learning Outcomes:
- Distinguish between statistics and parameters.
- Explain the impact of variability on sampling and inference.
- Distinguish between categorical and quantitative data.
- Describe and display categorical data:
- Graphically (e.g., charts).
- Numerically (e.g., through totals or percentages).
- Describe and display quantitative data:
- Graphically (e.g., histograms).
- Numerically (e.g., means or medians).
Four Main Phases of Statistical Analysis
- Research Question:
- Choose a question you are interested in answering.
- Design:
- Outline how to obtain data to answer the question of interest.
- Description:
- Summarize the data obtained.
- Inference:
- Make decisions in the face of uncertainty based on the data.
Probability versus Statistics
Scenario 1:
- Known probability: 21% of the people in a population of 1000 aged 60 or older have diabetes.
- Calculate the probability that 3 out of 10 sampled people will have diabetes.
- This scenario involves: Probability.
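As a rough sketch of the Scenario 1 calculation, the snippet below computes the probability that exactly 3 of 10 sampled people have diabetes. It assumes that "3 out of 10" means exactly 3 and that 21% of 1000 is 210 people; the function names are illustrative. The hypergeometric version treats the sample as drawn without replacement, and the binomial version is the usual approximation when the sample is small relative to the population.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success chance p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeometric_pmf(k, n, successes, population):
    """P(exactly k successes when drawing n items without replacement)."""
    return (comb(successes, k) * comb(population - successes, n - k)
            / comb(population, n))

# Assumed reading of Scenario 1: population of 1000, 210 with diabetes (21%).
p_binom = binomial_pmf(3, 10, 0.21)
p_hyper = hypergeometric_pmf(3, 10, 210, 1000)

print(f"binomial approximation:           {p_binom:.4f}")
print(f"hypergeometric (no replacement):  {p_hyper:.4f}")
```

The two answers should be close, because 10 people is a small fraction of 1000.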
Scenario 2:
- Unknown population percentage: a sample shows that 2 out of 10 people have diabetes.
- Use data to predict the population percentage.
- This scenario involves: Statistics.
- Focus in this class: Scenario 2.
The Language of Statistics
- Subjects:
- The subjects we observe or measure (often people).
- Target Population:
- The specific subjects in which we are interested.
- Sample Frame:
- The list from which the sample is selected.
- Sample:
- A subset of the population for which we plan to have data.
- Parameter:
- A numerical summary of the population.
- Statistic:
- A numerical summary of a sample.
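To make the parameter/statistic distinction concrete, here is a small simulation sketch. The population is hypothetical (1000 people, 210 of them with diabetes, echoing Scenario 1), and the seed and variable names are arbitrary choices for illustration.

```python
import random

random.seed(2331)  # arbitrary seed so the run is repeatable

# Hypothetical population: 1 = has diabetes, 0 = does not.
population = [1] * 210 + [0] * 790

# Parameter: a numerical summary of the whole population.
parameter = sum(population) / len(population)

# Statistic: the same summary computed from a random sample of 10 people.
sample = random.sample(population, 10)
statistic = sum(sample) / len(sample)

print(f"population proportion (parameter): {parameter:.2f}")
print(f"sample proportion (statistic):     {statistic:.2f}")
```

Rerunning with a different seed changes the statistic but not the parameter, which is the variability that inference has to account for.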
Examples in Statistical Analysis
Example 1: Dieting Obese Women Study
- Research Question:
- Do dieting obese women following a new diet plan lose more weight over 6 weeks than those who don’t?
- Design:
- Randomly sample 30 women following the plan and 30 not following it.
- Weigh everyone at the start and again at 6 weeks.
- Description:
- Weight loss was recorded for each woman; the mean weight loss was 18 pounds for those on the new plan and 12 pounds for those not on it.
- Inference:
- Conclusion: Women on the new diet plan lose more weight than those not on it.
- Variable Identification:
- Subjects
- Target Population
- Sample Frame
- Sample
- Statistic(s)
- Parameter(s)
- Importance of Random Sampling:
- Helps make results generalizable to the target population and reduces selection bias.
- Impact of Larger Sample Sizes:
- Generally leads to more reliable results.
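The point about sample size can be illustrated with a simulation. This is a sketch under assumed conditions, not part of the study: the population of weight losses is made up (roughly normal with mean 15 and SD 8 pounds), and `spread_of_sample_means` is a hypothetical helper.

```python
import random
import statistics

random.seed(1)  # arbitrary seed for repeatability

# Made-up population of 10,000 six-week weight losses (pounds).
population = [random.gauss(15, 8) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repetitions=2000):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(100)

print(f"n = 10:  sample means vary with SD {spread_small:.2f}")
print(f"n = 100: sample means vary with SD {spread_large:.2f}")
```

Sample means from the larger samples cluster more tightly around the population mean, which is one sense in which larger samples are more reliable.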
Inference: The Goal of Statistical Analysis
- Research Questions:
- Focused on specific populations.
- Design Considerations:
- Create a list (sample frame) and gather random data.
- Description of Data:
- Summarizes sample findings with statistics.
- Inference:
- Leads to hypotheses or predictions about population parameters.
Statistical Analysis Starts with a Data Set
Example Data Set Overview
| ID | Product | Type | Sensory Score | # Cookies | Cals | Fat |
|---|---|---|---|---|---|---|
| 1 | Archway Home Style | Soft | 32 | 1.2 | 155 | 6 |
| 2 | Busy Baker | Hard | 33 | 2.1 | 171 | 9 |
| 3 | Entenmann's | Soft | 44 | 3 | 150 | 8 |
| 4 | Estee | Hard | 13 | 4.1 | 150 | 7 |
| 5 | Famous Amos | Hard | 33 | 3.2 | 148 | 6 |
| 6 | Freihofer's | Soft | 45 | 2.3 | 138 | 7 |
| 7 | Keebler Chips Deluxe | Hard | 50 | 2 | 160 | 10 |
| … | … | … | … | … | … | … |
- Key Variables:
- Categorical Variables: Type (Soft or Hard), which labels the kind of product.
- Quantitative Variables: Sensory score, number of cookies, calories, fat content.
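One way to hold such a data set in code is as a list of records, one per subject (here, one per cookie brand). The sketch below uses only the seven rows visible in the table and a subset of its columns, so the summaries are for illustration only; the field names are hypothetical shorthand.

```python
from statistics import mean

# Only the seven rows shown in the table; the remaining rows are elided there.
cookies = [
    {"product": "Archway Home Style", "type": "Soft", "sensory": 32, "cals": 155, "fat": 6},
    {"product": "Busy Baker", "type": "Hard", "sensory": 33, "cals": 171, "fat": 9},
    {"product": "Entenmann's", "type": "Soft", "sensory": 44, "cals": 150, "fat": 8},
    {"product": "Estee", "type": "Hard", "sensory": 13, "cals": 150, "fat": 7},
    {"product": "Famous Amos", "type": "Hard", "sensory": 33, "cals": 148, "fat": 6},
    {"product": "Freihofer's", "type": "Soft", "sensory": 45, "cals": 138, "fat": 7},
    {"product": "Keebler Chips Deluxe", "type": "Hard", "sensory": 50, "cals": 160, "fat": 10},
]

# A categorical variable is summarized with counts per category.
type_counts = {}
for row in cookies:
    type_counts[row["type"]] = type_counts.get(row["type"], 0) + 1

# A quantitative variable is summarized with numbers such as a mean.
mean_cals_by_type = {
    t: mean(row["cals"] for row in cookies if row["type"] == t)
    for t in type_counts
}

print(type_counts)
print(mean_cals_by_type)
```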
Types of Variables
Categorical Variables
- Definition:
- A variable that fits observations into categories.
- Sub-types:
- Nominal: Cases are classified into groups without any order.
- Examples: Gender, color.
- Ordinal: Groups maintain a natural order.
- Examples: Educational levels, satisfaction ratings.
Quantitative Variables
- Definition:
- Observations take numerical values.
- Sub-types:
- Discrete: Values come from a list of separate values (often counts), with no intermediate values possible.
- Examples: Number of students.
- Continuous: Infinitely many values are possible between any two points.
- Examples: Weight, height.
Exploratory Data Analysis
Components of Exploratory Data Analysis
- Steps Involved:
- Look at each variable individually.
- Analyze relationships between variables.
- Display data visually (charts, graphs).
- Summarize data numerically.
- Data Example:
- From 1900 to 2016, 1303 unprovoked shark attacks were reported in the USA:
- California: 187
- Florida: 828
- Hawaii: 230
- Texas: 58
- Related questions for investigations:
- Identify variable of interest.
- Establish type and sub-type.
- Define values for the variable.
Exploring a Single Categorical Variable
Methods of Analysis
- Frequency Tables:
- List possible values for a variable along with their counts.
- Pie Charts:
- Visualize the percentage of each category within the whole dataset.
- Bar Charts:
- Use bars to represent counts or proportions for each category.
Example Data for Categorical Analysis
| State | Frequency | Percentage |
|---|---|---|
| California | 187 | 14.4% |
| Florida | 828 | 63.5% |
| Hawaii | 230 | 17.7% |
| Texas | 58 | 4.5% |
| Total | 1303 | 100% |
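The percentages in the table follow directly from the counts. A minimal sketch of that calculation, with variable names chosen for illustration:

```python
# Counts of unprovoked shark attacks by state, from the frequency table.
attacks = {"California": 187, "Florida": 828, "Hawaii": 230, "Texas": 58}

total = sum(attacks.values())

# Percentage = 100 * category count / total count.
percentages = {state: 100 * count / total for state, count in attacks.items()}

for state, pct in percentages.items():
    # A row of '#' characters gives a rough text bar chart.
    bar = "#" * round(pct / 2)
    print(f"{state:<12}{attacks[state]:>5}{pct:>7.1f}%  {bar}")
print(f"{'Total':<12}{total:>5}")
```

Because each percentage is rounded separately, the rounded values add up to 100.1 rather than exactly 100.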
Categorical Data Summary
- Characteristics:
- Observations fit within groups (e.g., named categories, ordered groups).
- Key Features:
- Count, percentage, or proportion calculated.
- Graphical Displays:
- Frequency tables, pie charts, and bar charts.
Exploring a Single Quantitative Variable
Features Analyzed
- Shape:
- Distribution of the sample values.
- Center:
- Measured using mean, median, or mode.
- Spread:
- Measured using standard deviation, variance, range, or interquartile range (IQR).
- Graphical Displays:
- Histograms and boxplots help visualize data patterns.
Histograms
Basic Features
- Definition:
- A graphical representation that divides quantitative data into intervals (bins).
- Count Observations in Each Bin:
- Create a frequency table reflecting the grouped data.
- Graphical Display:
- Bins on the horizontal axis, frequency on the vertical axis.
Example Histogram Data
| Age | Frequency | Relative Frequency |
|---|---|---|
| 6.5-8.5 | 6 | 0.021 |
| 8.5-10.5 | 7 | 0.025 |
| 10.5-12.5 | 14 | 0.049 |
| … | … | … |
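The binning step can be sketched as follows. The ages below are invented for illustration (the notes do not list the raw data); only the bin edges 6.5, 8.5, 10.5, and 12.5 come from the table, and `bin_counts` is a hypothetical helper.

```python
def bin_counts(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Invented ages; half-integer edges mean no whole-number age sits on a boundary.
ages = [7, 8, 8, 9, 10, 10, 11, 11, 11, 12, 12, 12]
edges = [6.5, 8.5, 10.5, 12.5]

counts = bin_counts(ages, edges)
relative = [c / len(ages) for c in counts]  # frequency / number of observations

for i, (c, r) in enumerate(zip(counts, relative)):
    print(f"{edges[i]}-{edges[i + 1]}: frequency {c}, relative frequency {r:.3f}")
```

Plotting the bins on the horizontal axis and these counts on the vertical axis gives the histogram.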
Describing the Shape of Quantitative Data
Key Shapes to Identify
- Number of Modes:
- Unimodal, bimodal, uniform distributions.
- Skewness:
- Symmetric, skewed right (positive skew), skewed left (negative skew).
Comparing Distributions
Hypothetical Examples
- Common Quantitative Variables: Many familiar variables have a characteristic distribution shape.
- Examples:
- IQ (generally normally distributed).
- Life span of humans (typically skewed left: most people live to old age, with a tail of early deaths).
- Annual incomes of adults (typically right skewed).
Measuring the Center of Quantitative Data
Measures of Central Tendency
- Mode:
- The most frequently occurring value.
- Mean:
- Average calculated as the sum of observations divided by the sample size.
- Mathematical Expression: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
- Median:
- Middle value when observations are ordered.
- Example Series: [2, 4, 6, 8, 10] – Median = 6.
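Python's standard `statistics` module computes all three measures. The first list is the example series above; the second and third lists are made up to show an even-length median and a repeated value.

```python
import statistics

series = [2, 4, 6, 8, 10]  # example series from the notes
series_mean = statistics.mean(series)      # sum of observations / sample size
series_median = statistics.median(series)  # middle value of the ordered list

# With an even number of observations, the median is the average
# of the two middle values (made-up data).
even_series = [2, 4, 6, 8]
even_median = statistics.median(even_series)

# The mode is the most frequently occurring value (made-up data).
repeated = [3, 5, 5, 7, 9]
repeated_mode = statistics.mode(repeated)

print(series_mean, series_median, even_median, repeated_mode)
```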
Impact of Outliers
- Outlier Definition:
- A data point that is far outside the general distribution of the dataset.
- Comparative Behavior:
- Median: Resistant to outliers; remains stable with extreme values.
- Mean: Sensitive to outliers; a single extreme value can pull it far from the bulk of the data.
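A quick check of this behavior: take the series [2, 4, 6, 8, 10], replace its largest value with 100 (an arbitrary extreme value chosen for illustration), and compare how far each measure moves.

```python
import statistics

original = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 100]  # 100 is an arbitrary extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(original)
median_shift = statistics.median(with_outlier) - statistics.median(original)

print(f"mean moves by   {mean_shift}")    # pulled toward the outlier
print(f"median moves by {median_shift}")  # the middle value is unchanged
```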
Measuring the Spread of Quantitative Data
Key Metrics
- Range:
- Calculated as the difference between maximum and minimum values.
\text{Range} = \text{Maximum} - \text{Minimum}
- Variance:
- Roughly the average squared deviation from the mean (the sum is divided by n - 1 rather than n):
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
- Standard Deviation (SD):
- Square root of the variance:
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
- Interquartile Range (IQR):
- Difference between the third quartile (Q3) and the first quartile (Q1):
\text{IQR} = Q3 - Q1
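The formulas above translate almost line for line into code. This sketch works through them on the series [2, 4, 6, 8, 10] and cross-checks the results against the standard library. Quartile conventions differ between textbooks and software; `statistics.quantiles` with its default method is just one convention, assumed here for illustration.

```python
import math
import statistics

x = [2, 4, 6, 8, 10]
n = len(x)
x_bar = sum(x) / n

data_range = max(x) - min(x)                              # Maximum - Minimum
variance = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)   # s^2
sd = math.sqrt(variance)                                  # s

# Cross-check against the library's sample variance and standard deviation.
assert math.isclose(variance, statistics.variance(x))
assert math.isclose(sd, statistics.stdev(x))

q1, _, q3 = statistics.quantiles(x, n=4)  # default ("exclusive") method
iqr = q3 - q1                             # Q3 - Q1

print(data_range, variance, round(sd, 3), iqr)
```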
Identifying Outliers
Criteria for Outlier Detection
- Mild Outliers:
- Values that are less than Q1 - 1.5 \times IQR or greater than Q3 + 1.5 \times IQR.
- Extreme Outliers:
- Values less than Q1 - 3 \times IQR or greater than Q3 + 3 \times IQR.
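The two rules can be written as a small classifier. In this sketch the quartiles are passed in rather than computed, `classify_outlier` is a hypothetical name, and "mild" is taken to mean beyond the 1.5 × IQR fences but not beyond the 3 × IQR fences. The values Q1 = 7 and Q3 = 25 are the quartiles from the boxplot example in these notes.

```python
def classify_outlier(value, q1, q3):
    """Label a value using the 1.5 x IQR and 3 x IQR fences."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

# Q1 = 7 and Q3 = 25 give IQR = 18, so the fences sit at
# -20 and 52 (1.5 x IQR) and at -47 and 79 (3 x IQR).
for value in [23, 60, 98, 197]:  # 60 is a made-up value for illustration
    print(value, classify_outlier(value, q1=7, q3=25))
```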
Boxplots
Features of Boxplots
- Summary Visualization:
- Represents the five-number summary: minimum, Q1, median, Q3, and maximum.
- Display characteristics of the data:
- Spread and potential outliers.
- Whiskers:
- Extend from the quartiles to the farthest non-outlier points.
Example Boxplot Analysis
- Example Data:
- Set: [1, 2, 7, 11, 12, 18, 23, 25, 98, 197]
- Five Number Summary:
- Minimum = 1
- Q1 = 7
- Median = 15 (the average of the two middle values, 12 and 18)
- Q3 = 25
- Maximum = 197
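The five-number summary for this data set can be reproduced with a short script. It assumes the median-of-halves convention for quartiles (Q1 is the median of the lower half, Q3 the median of the upper half, with the middle value excluded when n is odd); other conventions give slightly different quartiles. The helper name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the middle
    upper = data[-half:]   # values above the middle
    return data[0], median(lower), median(data), median(upper), data[-1]

data = [1, 2, 7, 11, 12, 18, 23, 25, 98, 197]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Whiskers stop at the farthest points inside the 1.5 x IQR fences.
inside = [v for v in data if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
whisker_low, whisker_high = min(inside), max(inside)

print("five-number summary:", minimum, q1, med, q3, maximum)
print("whiskers end at:", whisker_low, "and", whisker_high)
```

Because the set has an even number of values, the script computes the median as the average of the two middle values, 12 and 18. The values 98 and 197 fall outside the upper fence, so a boxplot would draw them as individual points rather than extending the whisker to them.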
Comparing Histograms and Corresponding Boxplots
Insight on Distribution:
- Histograms and boxplots both graphically represent spread and outliers while offering an intuitive view of the center.
Conclusion: Understanding Quantitative Data
Summary of Key Concepts
- Characteristics:
- Observations are numerical values on which arithmetic operations are meaningful.
- Sub-types:
- Discrete vs. Continuous Variables: Indicates the type of numerical data.
- Graphical Displays:
- Histogram and boxplot for analyzing spread and identifying outliers.
- Numerical Summaries:
- Essential features: shape, center (mean/median), spread (variance, standard deviation, IQR), and outliers.
- Use resistant measures for skewed data (median and IQR).