Descriptive Statistics and Probability Concepts

Part II: Descriptive Statistics

Ch 5: The Normal Approximation for Data

Required Reading: All Sections

Instructor: Shengjie Jiang, Ph.D.

Date: 1/18

The Normal (Probability) Distribution

The normal distribution is the most important of all probability distributions.
Applications of the Normal Distribution:
- Health-related characteristics (e.g., heights, weights, cholesterol levels, blood pressure).
- Psychological measurements (e.g., intelligence and aptitude tests).
- Measurement errors in scientific experiments.
- Economic measurements and indicators (e.g., flood measurements).
Definition: A continuous random variable X is said to be normally distributed with mean µ and standard deviation σ if the density function of X has the form:
$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$
Notation:
- $X \sim N(\mu, \sigma)$
- Read as: X follows a Normal Distribution of mean µ and standard deviation σ.
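
The density formula above can be evaluated directly. The following is a minimal sketch; the function name `normal_pdf` is an illustrative choice, not part of the course material.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, where sigma is the standard deviation."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The curve peaks at the mean and is symmetric around it.
peak = normal_pdf(0.0, 0.0, 1.0)
left = normal_pdf(-1.0, 0.0, 1.0)
right = normal_pdf(1.0, 0.0, 1.0)
print(peak, left, right)
```

Note that the function returns the height of the curve, not a probability; probabilities correspond to areas under the curve.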

Normal Curve Characteristics

Single Peak: The curve has one maximum point.
Total Area: The total area under the curve is 100% (or 1).
Position: The curve is always above the horizontal axis.
Center: The mean µ indicates the center of the distribution.
Symmetry: The distribution is symmetric around the mean.
Inflection Points: Points where the curve changes concavity.
Standard Deviation (SD): Distance from the mean to each inflection point.
- Roughly 68% of the area under the curve lies within one standard deviation of the mean, i.e., in $(\mu - \sigma, \mu + \sigma)$.

Mean and Standard Deviation of the Normal Curve

A normal curve is fully defined by its parameters, mean µ and standard deviation σ:
$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$
Since the total area is 1 (or 100%), the normal curve can be used to approximate histograms of data that are normally distributed.
The area under the curve corresponds to probabilities.

Changes in the Normal Curve

Increasing Mean: Shifts the curve to the right.
Increasing Standard Deviation: Flattens the curve.
Shape: The basic shape remains unchanged and the area under the normal curve remains equal to 1.
Empirical Rule: All normal distributions, regardless of their parameters, obey the Empirical Rule.

Normal Curve and the Empirical Rule (68-95-99.7 Rule)

About 68% of all observations lie within $(µ - σ, µ + σ)$.
About 95% of all observations lie within $(µ - 2σ, µ + 2σ)$.
About 99.7% of all observations lie within $(µ - 3σ, µ + 3σ)$.
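
The three percentages can be checked numerically. This is a sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_within` is an assumption made for illustration.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # The result does not depend on mu or sigma, only on k.
    print(k, round(area_within(k), 4), round(area_within(k, mu=64.5, sigma=2.5), 4))
```

The exact areas are approximately 68.27%, 95.45%, and 99.73%, which the rule rounds to 68%, 95%, and 99.7%.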

Example: Women’s Heights

Women’s heights are normally distributed with:
- mean µ = 64.5 inches
- standard deviation σ = 2.5 inches
According to the Empirical Rule:
1. 68% of women: Heights between $64.5 - 2.5 = 62$ and $64.5 + 2.5 = 67$ inches.
2. 95% of women: Heights between $64.5 - 2 \times 2.5 = 59.5$ and $64.5 + 2 \times 2.5 = 69.5$ inches.
3. 99.7% of women: Heights between $64.5 - 3 \times 2.5 = 57$ and $64.5 + 3 \times 2.5 = 72$ inches.

Additional Questions regarding Women’s Heights

In what range do the middle 95% of all women lie?
About what percentage of women are taller than 67 inches?
About what percentage are shorter than 59.5 inches?
What percentage of women are shorter than 69.5 inches?
In what range do the top 2.5% of all women lie?

Problematic Scenario

What if the percentages we are interested in cannot be expressed in terms of 68%, 95%, or 99.7%?
Example scenarios: What are the percentages of women shorter than 68 inches or taller than 70 inches?

Standard Normal Distribution

Defined by:
- mean $µ = 0$
- standard deviation $σ = 1$
Notation: $N(0, 1)$
Equation:
$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \quad -\infty < x < \infty$
Area: The total area equals 1.
Bell-shape: The distribution is symmetric about its mean.
Standardization Process: Any normal distribution can be converted to a standard normal distribution using the formula:
$X \sim N(µ, σ) \Rightarrow z = \frac{X - µ}{σ} \sim N(0, 1)$
Z-score: A standard unit that indicates how many standard deviations an observation is from the mean.
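
As a quick check of the standardization formula, the sketch below converts a value to standard units and confirms that the area to its left is the same under the original curve and under the standard normal curve. The values 64.5 and 2.5 are the women's heights parameters from earlier in these notes; the function name `z_score` is illustrative.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma, x = 64.5, 2.5, 67.0
z = z_score(x, mu, sigma)

# Area to the left of x under N(mu, sigma) vs. area to the left of z under N(0, 1).
p_original = NormalDist(mu, sigma).cdf(x)
p_standard = NormalDist(0, 1).cdf(z)
print(z, p_original, p_standard)
```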

Back to Women’s Heights with Standard Units

Finding Percentages for Z-scores: For common Z-scores ($z = 0, ±1, ±2, ±3$), percentages can be read off from the Empirical Rule. Other Z-scores require a normal table or software.
Calculation for Specific Heights:
- What percentage of women are taller than 70 inches?
- What percentage are shorter than 68 inches?
- What percentage are between 68 inches and 70 inches?
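
One way to work these three questions is to convert each height to standard units and look up the area under the standard normal curve. The sketch below uses `statistics.NormalDist` in place of a normal table; the variable names are illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 64.5, 2.5          # women's heights, in inches
standard = NormalDist(0, 1)    # standard normal curve

def z_score(x):
    return (x - MU) / SIGMA

z_68 = z_score(68)             # 1.4 standard units
z_70 = z_score(70)             # 2.2 standard units

p_taller_70 = 1 - standard.cdf(z_70)                  # area to the right of z = 2.2
p_shorter_68 = standard.cdf(z_68)                     # area to the left of z = 1.4
p_between = standard.cdf(z_70) - standard.cdf(z_68)   # area between the two

print(f"taller than 70 in: {p_taller_70:.1%}")
print(f"shorter than 68 in: {p_shorter_68:.1%}")
print(f"between 68 and 70 in: {p_between:.1%}")
```

Under the normal model this gives roughly 1.4%, 91.9%, and 6.7%, respectively.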

Practice Question

Hypothesis: Brain weights of people affected by a disease are normally distributed, mean 1000 g and standard deviation 100 g. Questions:
1. Find the probability that a brain weight is less than 850 g.
2. What is the probability that brain weight is above 1250 g?
3. What is the probability of a brain weight between 905 g and 1300 g?
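
The three probabilities can be checked with a short sketch, again using `statistics.NormalDist` instead of a table. The variable names are illustrative.

```python
from statistics import NormalDist

brain = NormalDist(mu=1000, sigma=100)   # brain weights in grams

p_below_850 = brain.cdf(850)                      # z = -1.5
p_above_1250 = 1 - brain.cdf(1250)                # z = 2.5
p_between = brain.cdf(1300) - brain.cdf(905)      # z from -0.95 to 3

print(round(p_below_850, 4), round(p_above_1250, 4), round(p_between, 4))
```

The results are approximately 0.067, 0.006, and 0.828.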

Standard Units: Another Example

Z-Score Definition: Measures distance from the mean in standard deviations:
- Positive Z-score: Above the mean.
- Negative Z-score: Below the mean.
- Small Z-score: Close to the mean.
- Large Z-score: Far from the mean.

Comparison of Mid-term Scores: Alice's Performance

Exam Scores Comparison:
- Organic Chemistry: Mean = 55, SD = 25, Alice = 80.
- Statistics: Mean = 50, SD = 10, Alice = 75.
Questions:
1. On which test did Alice perform better relative to her peers?
2. What are her percentiles for both midterms?
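
A sketch of the comparison: convert each score to standard units, then convert the Z-scores to percentiles. The percentile step assumes the exam scores are approximately normal, which the notes do not state explicitly.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    return (x - mean) / sd

z_chem = z_score(80, 55, 25)     # Organic Chemistry
z_stats = z_score(75, 50, 10)    # Statistics

# Percentile = area to the left of the Z-score (assumes a normal model).
standard = NormalDist(0, 1)
pct_chem = 100 * standard.cdf(z_chem)
pct_stats = 100 * standard.cdf(z_stats)

print(z_chem, z_stats)
print(round(pct_chem, 1), round(pct_stats, 1))
```

Although Alice scored 25 points above the mean on both exams, her Statistics score is 2.5 SDs above the mean versus 1 SD for Organic Chemistry, so she did better relative to her peers in Statistics.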

Independent Exercise

2008 Olympics: Dobrynska's long jump was 6.63 m, longer than average, and Fountain won the 200 m run in 23.21 s, faster than average.
Statistics:
- Long Jump Mean = 6.11 m, SD = 0.24 m.
- 200 m Run Mean = 24.71 s, SD = 0.70 s.
- Question: Whose performance was more impressive?
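
One possible approach is to compare Z-scores, remembering that in a race a lower time is better, so the sign is flipped for the 200 m run. The variable names below are illustrative.

```python
def z_score(x, mean, sd):
    return (x - mean) / sd

# Long jump: larger distances are better.
jump_advantage = z_score(6.63, 6.11, 0.24)

# 200 m run: smaller times are better, so negate the Z-score.
run_advantage = -z_score(23.21, 24.71, 0.70)

print(round(jump_advantage, 2), round(run_advantage, 2))
```

Both performances are a little over 2 SDs better than average; by this measure the long jump is very slightly more impressive.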

Percentiles of Normal Distribution

Backward Normal Calculation: Given an area or percentage, we want to find the corresponding value $x$.
Formula:
$x = µ + z · σ$
Example: Find the score needed to fall into the top 10% (90th percentile).
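
The backward calculation can be sketched as follows. The exam mean of 70 and SD of 10 are hypothetical values chosen only for illustration, since the example does not specify them; `inv_cdf` plays the role of reading a normal table backward.

```python
from statistics import NormalDist

def percentile_value(mu, sigma, p):
    """Value x such that a fraction p of N(mu, sigma) lies below x."""
    z = NormalDist(0, 1).inv_cdf(p)   # Z-score with area p to its left
    return mu + z * sigma             # x = mu + z * sigma

# Hypothetical exam: mean 70, SD 10 (assumed, not from the notes).
cutoff = percentile_value(70, 10, 0.90)
print(round(cutoff, 1))
```

The 90th percentile corresponds to $z \approx 1.28$, so the cutoff is about 1.28 SDs above the mean.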

Practice Questions on Backward Normal Calculation

In the “Brain weights” scenario, below what weight do the lightest 10% of brain weights fall?
For cereal boxes with a normal model, mean 16.3 ounces, SD 0.2 ounces: What fraction will be underweight (less than 16 ounces), and below what weight do the lightest 5% of boxes fall?
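
These practice questions can be checked with a short sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Brain weights: mean 1000 g, SD 100 g.
brain = NormalDist(1000, 100)
brain_10th = brain.inv_cdf(0.10)        # weight with 10% of brains below it

# Cereal boxes: mean 16.3 oz, SD 0.2 oz.
cereal = NormalDist(16.3, 0.2)
frac_underweight = cereal.cdf(16.0)     # forward calculation, z = -1.5
cereal_5th = cereal.inv_cdf(0.05)       # backward calculation, z is about -1.645

print(round(brain_10th, 1), round(frac_underweight, 4), round(cereal_5th, 2))
```

The answers are roughly 872 g, 6.7% underweight, and 15.97 ounces.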

Part III: Correlation and Regression

Ch 8: Correlation

Ch 9: Outliers & Association is Not Causation

Required Reading: Sections 8.1, 8.2, 8.4, Summary; Sections 9.1, 9.3, 9.5, Summary

Date: 1/23

Scatterplots (Scatter Diagrams)

Importance: Previously discussed methods were suitable for single quantitative variables.
However, relationships between two quantitative variables require different analytic methods.
Purpose: Scatterplots illustrate relationships between two variables.
Features to Summarize:
1. Shape: linearity pattern, clusters, outliers.
2. Direction: increasing (positive), decreasing (negative), or no relationship.
3. Strength: proximity of points to an imaginary line.

Scatterplot Research Historical Context

Statisticians in Victorian England studied hereditary influences extensively, collecting large datasets (e.g., father-son height pairs).
Such data reveal relationships, e.g., shorter fathers tend to have shorter sons.

Describing Scatterplots' Shape

Shape Characteristics:
- Linearity: Is the pattern linear or curved?
- Clusters: Are there several clusters?
- Outliers: Are there notable exceptions?
Direction:
- Positive: y-value rises with increasing x-value.
- Negative: y-value falls with increasing x-value.
- No Relationship: Random pattern observed.
Strength Assessment:
- Strong: Points are tightly clustered around a line/curve.
- Weak: Points are scattered far from a line.

Strength of Association in Scatterplots

A strong relationship means that knowing one variable helps a great deal in predicting the other.
Conversely, a weak relationship means that knowing one variable helps little in guessing the other.

Analyzing Scatterplots - Practical Situation

Example: Consider the association between beer consumption and blood alcohol level.
How to describe any association and possible outliers?

Correlation Coefficient

Defined as a measure of linear association or clustering around a line.
Observation: Two scatter diagrams may have the same center and spread.
- However, one may show a strong linear association while the other shows looser clustering around a line.

Direction and Strength of Linear Associations

Positive correlation: Increasing trends within datasets (correlation coefficient 0.00 to 1.00).
Negative correlation: Decreasing trends within datasets (correlation coefficient -1.00 to 0.00).

Correlation Coefficient Details

Definition: Measures strength of linear associations.
Formula for correlation coefficient:
$r = \frac{1}{n} \sum_{k=1}^{n} \frac{(X_k - \bar{X})}{SD_X} \frac{(Y_k - \bar{Y})}{SD_Y}$
Properties:
- Range: From -1 to 1.
- Sign indicates direction: positive/negative.
- Magnitude: Indicates the strength of the linear association: values near 0 indicate a weak linear association, values near 1 or -1 indicate a strong one.
- Independence of Measurement: Correlation is unitless; it does not change if a variable is shifted by a constant or rescaled by a positive constant.
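
The formula can be turned into a few lines of code. This is a minimal sketch: it uses the divide-by-$n$ standard deviation so that it matches the formula above, and the data are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Average of the products of x and y in standard units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    products = (((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys))
    return sum(products) / n

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```

For these points $r \approx 0.77$: a positive, moderately strong linear association.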

Distinguishing Features of Correlation Coefficient

The correlation coefficient is unchanged when x and/or y are expressed in different measurement units.
Multiplying one of the variables by a negative constant flips the sign of the correlation but leaves its magnitude unchanged.
Example situation: Daily maximum temperatures recorded in Fahrenheit vs Celsius.
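
The sketch below illustrates both points with temperatures. The temperature and sales figures are hypothetical, and the `correlation` helper is a plain implementation of the divide-by-$n$ formula.

```python
import math

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

# Hypothetical daily maximum temperatures (Fahrenheit) and a second variable.
temps_f = [50, 59, 68, 77, 86]
sales = [12, 15, 21, 24, 30]

# Changing units is a shift plus a positive rescaling: C = (F - 32) * 5/9.
temps_c = [(f - 32) * 5 / 9 for f in temps_f]

r_f = correlation(temps_f, sales)
r_c = correlation(temps_c, sales)
r_neg = correlation([-f for f in temps_f], sales)  # multiply by a negative constant

print(round(r_f, 4), round(r_c, 4), round(r_neg, 4))
```

The first two values agree, and the third has the same magnitude with the opposite sign.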

Common Misinterpretations of Correlation

Statements Review

Correct/Incorrect points:
1. Correlation cannot exceed 1 or go below -1.
2. Reporting correlation in units (like inches/pounds) is invalid.
3. Switching x and y does not alter correlation.

Necessity of Visual Data Representation

Caution: Always plot data!
- Example: Multiple datasets can share identical correlation values (very misleading as illustrated by Anscombe’s quartet).

Outlier Sensitivity in Correlation

Sensitivity: Correlation is influenced by outliers whose positions affect the magnitude and sign.
Common Practice: Recompute the correlation after removing outliers and compare it with the original value.
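
A small illustration of this sensitivity, using made-up data in which a single outlier reverses the sign of the correlation. The `correlation` helper implements the divide-by-$n$ formula.

```python
import math

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

# Hypothetical points lying close to a rising line.
xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
r_without = correlation(xs, ys)

# The same points plus one outlier far below the line.
r_with = correlation(xs + [10], ys + [-10.0])

print(round(r_without, 3), round(r_with, 3))
```

Reporting both values, together with a scatterplot, makes the influence of the outlier visible.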

Correlation does not Imply Causation

Case Study: Among children, shoe size and arithmetic scores are strongly correlated because age acts as a confounding variable: older children have both larger feet and better arithmetic skills.
- Controlling for lurking variables is vital for drawing causal inferences.

Causation Analysis Example

Positive correlation exists between hospital size (number of beds) and median patient stay duration.
- Causation Implication: Cannot determine causation based solely on correlation.

Managing Common Confounding Factors

Example: A correlation observed between TV violence and adolescent behavior may stem from a common confounding variable, such as being raised in a violent environment.

Part IV: Probability

Ch 14: More about Chances

Required Reading: All Sections

Outcomes Listing Technique

Methodology: List all possible outcomes in such a way that each is equally likely; the chance of an event is then the number of outcomes in the event divided by the total number of outcomes.
Example: Tossing two fair coins: What is the probability of at least one tail?
- Outcome List: {(H,H), (H,T), (T,H), (T,T)}
- Total Probability Calculation:
  $P(at\ least\ one\ T) = \frac{3}{4}$
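
The listing technique for this example can be sketched with `itertools.product`, which enumerates every ordered pair of tosses. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing two fair coins.
outcomes = list(product("HT", repeat=2))

# Outcomes that contain at least one tail.
favorable = [o for o in outcomes if "T" in o]

p_at_least_one_tail = Fraction(len(favorable), len(outcomes))
print(outcomes)
print(p_at_least_one_tail)
```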

Listing Techniques: Box Example

Method: Draw two tickets from a box with replacement.

Probability Evaluation:

Outcomes of draws successfully listed.
Probabilities calculated as necessary.

Addition Rule Introduction

The probability of at least one of two events A or B occurring is given by:
$P(A\ or\ B) = P(A) + P(B) − P(A\ and\ B)$

Additive Rule for Mutually Exclusive Events

Definition: If two events cannot occur simultaneously (mutually exclusive), then:
- $P(A\ and\ B) = 0$
For mutually exclusive events, the addition rule therefore simplifies to:
$P(A\ or\ B) = P(A) + P(B)$
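
Both forms of the addition rule can be verified by listing outcomes. The sketch below uses one roll of a fair die, a hypothetical example that is not taken from the notes, with events represented as sets of outcomes.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}          # equally likely outcomes of one roll

def prob(event):
    return Fraction(len(event & die), len(die))

A = {2, 4, 6}                     # roll is even
B = {5, 6}                        # roll is greater than 4

# General rule: subtract the overlap so it is not counted twice.
p_union_direct = prob(A | B)
p_union_rule = prob(A) + prob(B) - prob(A & B)

# Mutually exclusive events: no overlap, so the probabilities simply add.
C, D = {1}, {2}
p_exclusive = prob(C) + prob(D)

print(p_union_direct, p_union_rule, p_exclusive)
```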

Practical Example of Addition Rule in Boxes

Given the events defined with random selections, determine pairs and their probabilities using the addition principles.

Review of Addition vs. Multiplication Rules

Addition Rule: Gives the chance that at least one of A or B occurs.
Multiplication Rule: Gives the chance that A and B both occur.
Note: For independent events, the multiplication rule reduces to the simpler form $P(A\ and\ B) = P(A) \times P(B)$.
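
For independent events, the multiplication rule can be checked by listing outcomes. The sketch below reuses the two-coin setting; the event names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))   # two tosses of a fair coin

def prob(condition):
    hits = [o for o in outcomes if condition(o)]
    return Fraction(len(hits), len(outcomes))

p_a = prob(lambda o: o[0] == "H")                       # A: first toss is heads
p_b = prob(lambda o: o[1] == "H")                       # B: second toss is heads
p_both = prob(lambda o: o[0] == "H" and o[1] == "H")    # A and B

# The two tosses are independent, so P(A and B) = P(A) * P(B).
print(p_both, p_a * p_b)
```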

Insurance Plan Example

Health and dental insurance choice analysis involving random employee selection to assess probabilities based on provided employee data.

Try at Home Tasks

Explore card drawing scenarios providing varied outcomes within probability frameworks.

Closing Remarks

Theoretical conclusions should always be backed by practical applications, and dependencies arising from physical or other relationships between events should be recognized.

Probabilities and Calculating Techniques

Residuals

Definition: The difference between an observed value and the value predicted by the model.
Formula: $R = y - \hat{y}$
Assess model fit based on residual behavior against expected outcomes.
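
A sketch of computing residuals. It assumes the predictions come from a least-squares line, which this part of the notes does not spell out, and the data are made up for illustration.

```python
# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares line (assumed model): y_hat = intercept + slope * x.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # R = y - y_hat

print([round(r, 2) for r in residuals])
```

A positive residual means the point lies above the line and a negative one means it lies below. For a least-squares line the residuals sum to zero, so what matters in a residual plot is whether they show any pattern.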

Residuals as Validation Tool

Residuals help validate linear models and check model assumptions, notably through residual plots.

Types of Predictions

Extrapolation: Prediction for values outside the range of the sampled data; such predictions are often unreliable.
Interpolation: Prediction for values within the range of the sampled data, where the model is more trustworthy.

Conclusion on Predictions

Key to predictive analysis is thorough visual representation of data, models, and residuals to ascertain operational significance.

Summary

Evolving understanding of statistical principles, visual representations, and rigorous inquiries is paramount to mastery in the field.
Through detailed articulation of complex illustrations, students are better prepared to navigate statistical realms effectively.