Descriptive Statistics and Probability Concepts
Part II: Descriptive Statistics
Ch 5: The Normal Approximation for Data
Required Reading: All Sections
Instructor: Shengjie Jiang, Ph.D.
Date: 1/18
The Normal (Probability) Distribution
The normal distribution is the most important of all probability distributions.
Applications of the Normal Distribution:
Health-related characteristics (e.g., heights, weights, cholesterol levels, blood pressure).
Psychological measurements (e.g., intelligence and aptitude tests).
Measurement errors in scientific experiments.
Economic measurements and indicators (e.g., flood measurements).
Definition: A continuous random variable X is said to be normally distributed with mean µ and standard deviation σ if the density function of X has the form:
$$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$$
Notation: $X \sim N(\mu, \sigma)$
Read as: X follows a Normal Distribution of mean µ and standard deviation σ.
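The density formula in the definition can be evaluated directly. The sketch below is only an illustration: the parameter values (mean 10, SD 2) are assumptions chosen for the example, not values from these notes. It checks the hand-written formula against Python's `statistics.NormalDist`.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_density(x: float, mu: float, sigma: float) -> float:
    """Density f(x) of a normal curve with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

# Illustrative parameters only (assumed for this example, not from the notes).
mu, sigma = 10.0, 2.0

peak = normal_density(mu, mu, sigma)         # height of the curve at the mean
left = normal_density(mu - 1.5, mu, sigma)   # a point 1.5 units below the mean
right = normal_density(mu + 1.5, mu, sigma)  # the mirror point above the mean
reference = NormalDist(mu, sigma).pdf(mu)    # library value for comparison

print(peak, reference)  # the two values should agree
print(left, right)      # equal heights, by symmetry about the mean
```

The equal heights at `mu - 1.5` and `mu + 1.5` reflect the symmetry of the curve, and the maximum sits at the mean.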
Normal Curve Characteristics
Single Peak: The curve has one maximum point.
Total Area: The total area under the curve is 100% (or 1).
Position: The curve is always above the horizontal axis.
Center: The mean µ indicates the center of the distribution.
Symmetry: The distribution is symmetric around the mean.
Inflection Points: Points where the curve changes concavity.
Standard Deviation (SD): Distance from the mean to the inflection points.
Roughly 68% of the area under the curve lies within one standard deviation of the mean, i.e., in $(\mu - \sigma, \mu + \sigma)$.
Mean and Standard Deviation of the Normal Curve
A normal curve is fully defined by its parameters, mean µ and standard deviation σ:
$$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$$
Since the total area is 1 (or 100%), we can derive insights about data represented by histograms that model normally distributed data.
The area under the curve corresponds to probabilities.
Changes in the Normal Curve
Increasing Mean: Shifts the curve to the right.
Increasing Standard Deviation: Flattens the curve.
Shape: The basic shape remains unchanged and the area under the normal curve remains equal to 1.
Empirical Rule: All normal distributions, regardless of parameters, share the Empirical Rule:
Normal Curve and the Empirical Rule (68-95-99.7 Rule)
About 68% of observations lie within $(µ - σ, µ + σ)$.
About 95% of observations lie within $(µ - 2σ, µ + 2σ)$.
About 99.7% of observations lie within $(µ - 3σ, µ + 3σ)$.
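The three percentages can be checked numerically. This is a minimal sketch using the standard normal curve from Python's `statistics` module; the helper name `area_within` is made up for the illustration.

```python
from statistics import NormalDist

# Standard normal curve (mean 0, SD 1). After standardization,
# the same areas apply to any normal curve.
Z = NormalDist(mu=0.0, sigma=1.0)

def area_within(k: float) -> float:
    """Area under the standard normal curve between -k and +k SDs."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * area_within(k):.2f}%")
```

The printed values are close to 68%, 95%, and 99.7%, which is why the rule is stated with rounded figures.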
Example: Women’s Heights
Women’s heights are normally distributed with:
mean µ = 64.5 inches
standard deviation σ = 2.5 inches
According to the Empirical Rule:
68% of women: Heights between $64.5 - 2.5 = 62$ and $64.5 + 2.5 = 67$ inches.
95% of women: Heights between $64.5 - 2 \times 2.5 = 59.5$ and $64.5 + 2 \times 2.5 = 69.5$ inches.
99.7% of women: Heights between $64.5 - 3 \times 2.5 = 57$ and $64.5 + 3 \times 2.5 = 72$ inches.
Additional Questions regarding Women’s Heights
In what range do the middle 95% of all women lie?
About what percentage of women are taller than 67 inches?
About what percentage are shorter than 59.5 inches?
What percentage of women are shorter than 69.5 inches?
In what range do the top 2.5% of all women lie?
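One way to approach these questions is to combine the Empirical Rule with the symmetry of the curve: the area outside an interval splits equally between the two tails. The sketch below uses only the rounded 68% and 95% figures, so the answers are approximate; the variable names are chosen for the illustration.

```python
# Approximate Empirical Rule areas (in percent) within 1 and 2 SDs of the mean.
within_1sd = 68.0
within_2sd = 95.0

mu, sigma = 64.5, 2.5  # women's heights in inches, from the example

# Middle 95%: within 2 SDs of the mean.
middle_95 = (mu - 2 * sigma, mu + 2 * sigma)

# Symmetry: the leftover area is shared equally by the two tails.
taller_than_67 = (100.0 - within_1sd) / 2              # 67 = mu + 1 SD
shorter_than_59_5 = (100.0 - within_2sd) / 2           # 59.5 = mu - 2 SD
shorter_than_69_5 = 100.0 - (100.0 - within_2sd) / 2   # 69.5 = mu + 2 SD
top_2_5_cutoff = mu + 2 * sigma                        # top 2.5% lie above this

print(middle_95, taller_than_67, shorter_than_59_5,
      shorter_than_69_5, top_2_5_cutoff)
```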
Problematic Scenario
What if the percentages we are interested in cannot be expressed in terms of 68%, 95%, or 99.7%?
Example scenarios: What are the percentages of women shorter than 68 inches or taller than 70 inches?
Standard Normal Distribution
Defined by:
mean µ = 0
standard deviation σ = 1
Notation: $Z \sim N(0, 1)$
Equation:
$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \quad -\infty < x < \infty$$
Area: The total area equals 1.
Bell-shape: The distribution is symmetric about its mean.
Standardization Process: Any normal distribution can be converted to a standard normal distribution using the formula: $z = \frac{x - \mu}{\sigma}$
Z-score: A standard unit that indicates how many standard deviations an observation is from the mean.
Back to Women’s Heights with Standard Units
Finding Percentages for Z-scores: For common Z-scores ($z = 0, ±1, ±2, ±3$), percentages can be determined from the Empirical Rule.
Calculation for Specific Heights:
What percentage of women are taller than 70 inches?
What percentage are shorter than 68 inches?
What percentage are between 68 inches and 70 inches?
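For heights such as 68 and 70 inches, which are not a whole number of SDs from the mean, one option is to convert to standard units and read the area from the standard normal curve. The sketch below uses `statistics.NormalDist` in place of a printed normal table; the helper `z_score` is a name chosen for the illustration.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5  # women's heights in inches, from the example

def z_score(x: float, mu: float, sigma: float) -> float:
    """Convert a value to standard units: SDs above (+) or below (-) the mean."""
    return (x - mu) / sigma

Z = NormalDist()  # standard normal: mean 0, SD 1

z68 = z_score(68, mu, sigma)  # 1.4 SDs above the mean
z70 = z_score(70, mu, sigma)  # 2.2 SDs above the mean

pct_taller_70 = 100 * (1 - Z.cdf(z70))          # area to the right of z70
pct_shorter_68 = 100 * Z.cdf(z68)               # area to the left of z68
pct_between = 100 * (Z.cdf(z70) - Z.cdf(z68))   # area between the two

print(f"taller than 70 in: {pct_taller_70:.1f}%")
print(f"shorter than 68 in: {pct_shorter_68:.1f}%")
print(f"between 68 and 70 in: {pct_between:.1f}%")
```

Under the normal model this gives roughly 1.4%, 91.9%, and 6.7%, respectively.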
Practice Question
Hypothesis: Brain weights of people affected by a disease are normally distributed, mean 1000 g and standard deviation 100 g. Questions:
Find the probability that a brain weight is less than 850 g.
What is the probability that brain weight is above 1250 g?
What is the probability of a brain weight between 905 g and 1300 g?
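These three probabilities can be worked the same way: standardize, then take areas under the normal curve. A minimal sketch, assuming the normal model stated in the question and using `statistics.NormalDist` rather than a table:

```python
from statistics import NormalDist

# Brain weights in grams, as given in the practice question.
brain = NormalDist(mu=1000, sigma=100)

p_below_850 = brain.cdf(850)                    # z = (850 - 1000) / 100 = -1.5
p_above_1250 = 1 - brain.cdf(1250)              # z = 2.5, right-hand tail
p_between = brain.cdf(1300) - brain.cdf(905)    # z from -0.95 to 3.0

print(f"P(weight < 850)          = {p_below_850:.4f}")
print(f"P(weight > 1250)         = {p_above_1250:.4f}")
print(f"P(905 < weight < 1300)   = {p_between:.4f}")
```

The results are about 0.067, 0.006, and 0.828.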
Standard Units: Another Example
Z-Score Definition: Measures distance from the mean in standard deviations: $z = \frac{x - \mu}{\sigma}$
Positive Z-score: Above the mean.
Negative Z-score: Below the mean.
Small Z-score: Close to the mean.
Large Z-score: Far from the mean.
Comparison of Mid-term Scores: Alice's Performance
Exam Scores Comparison:
Organic Chemistry: Mean = 55, SD = 25, Alice = 80.
Statistics: Mean = 50, SD = 10, Alice = 75.
Questions:
On which test did Alice perform better relative to her peers?
What are her percentiles for both midterms?
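A sketch of the comparison in standard units follows. The percentiles rely on an assumption that is not stated in the exercise itself, namely that both sets of exam scores follow the normal curve.

```python
from statistics import NormalDist

def z_score(x: float, mean: float, sd: float) -> float:
    """Standard units: how many SDs the score is above the class mean."""
    return (x - mean) / sd

z_chem = z_score(80, mean=55, sd=25)  # Organic Chemistry
z_stat = z_score(75, mean=50, sd=10)  # Statistics

# Percentiles under the ASSUMPTION that scores are normally distributed.
Z = NormalDist()
pctile_chem = 100 * Z.cdf(z_chem)
pctile_stat = 100 * Z.cdf(z_stat)

print(f"Chemistry:  z = {z_chem:.1f}, percentile = {pctile_chem:.1f}")
print(f"Statistics: z = {z_stat:.1f}, percentile = {pctile_stat:.1f}")
```

Both raw scores are 25 points above the mean, but 25 points is 1 SD in Organic Chemistry and 2.5 SDs in Statistics, so the Statistics score is the stronger one relative to her peers.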
Independent Exercise
2008 Olympics: Dobrynska performed a long jump of 6.63 m, longer than average, and Fountain won the 200 m run in 23.21 s, also faster than average.
Statistics:
Long Jump Mean = 6.11 m, SD = 0.24 m.
200 m Run Mean = 24.71 s, SD = 0.70 s.
Question: Whose performance was more impressive?
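One possible approach is to put both performances in standard units. The only subtlety is direction: a longer jump is better, but a shorter running time is better, so the sign of the run's z-score is flipped before comparing. The variable names below are chosen for the illustration.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Standard units: SDs above (+) or below (-) the mean."""
    return (x - mean) / sd

z_jump = z_score(6.63, mean=6.11, sd=0.24)    # long jump, metres
z_run = z_score(23.21, mean=24.71, sd=0.70)   # 200 m run, seconds

# For a race, a time BELOW the mean is the better result,
# so flip the sign to get "SDs better than average".
jump_sds_better = z_jump
run_sds_better = -z_run

print(f"long jump: {jump_sds_better:.2f} SDs better than average")
print(f"200 m run: {run_sds_better:.2f} SDs better than average")
```

Both come out a little over 2 SDs better than average, with the long jump ahead by a small margin.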
Percentiles of Normal Distribution
Backward Normal Calculation: Given an area or percentage, we want to find the corresponding value x.
Formula: $x = \mu + z\sigma$, where z is the standard normal value with the given area to its left.
Example: Find the score needed to fall into the top 10% (90th percentile).
Practice Questions on Backward Normal Calculation
In the “Brain weights” scenario, what weight does 10% of brain weights fall below?
For cereal boxes with a normal model, mean 16.3 ounces, SD 0.2 ounces: What fraction will be underweight (less than 16 ounces), and below what weight do the lightest 5% of boxes fall?
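The backward calculation for these practice questions can be sketched with `NormalDist.inv_cdf`, which returns the value with a given area to its left. It plays the role of reading a normal table in reverse and then applying x = µ + zσ.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# Brain weights: mean 1000 g, SD 100 g. Find the 10th percentile.
z_10 = Z.inv_cdf(0.10)           # z with 10% of the area to its left
brain_10th = 1000 + z_10 * 100   # x = mu + z * sigma

# Cereal boxes: mean 16.3 oz, SD 0.2 oz.
cereal = NormalDist(mu=16.3, sigma=0.2)
frac_underweight = cereal.cdf(16.0)   # forward calculation: P(weight < 16)
cereal_5th = cereal.inv_cdf(0.05)     # backward: weight with 5% below it

print(f"10% of brain weights fall below {brain_10th:.1f} g")
print(f"fraction of boxes under 16 oz: {frac_underweight:.4f}")
print(f"5% of boxes weigh less than {cereal_5th:.2f} oz")
```

Under these models, about 10% of brain weights fall below roughly 872 g, about 6.7% of boxes are underweight, and 5% of boxes weigh less than about 15.97 ounces.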
Part III: Correlation and Regression
Ch 8: Correlation
Ch 9: Outliers & Association is Not Causation
Required Reading: Sections 8.1, 8.2, 8.4, Summary; Sections 9.1, 9.3, 9.5, Summary
Date: 1/23
Scatterplots (Scatter Diagrams)
Importance: Previously discussed methods were suitable for single quantitative variables.
However, relationships between two quantitative variables require different analytic methods.
Purpose: Scatterplots illustrate relationships between two variables.
What to look for in a scatterplot:
Shape: linearity pattern, clusters, outliers.
Direction: increasing (positive), decreasing (negative), or no relationship.
Strength: proximity of points to an imaginary line.
Scatterplot Research Historical Context
Victorian England statisticians researched hereditary influences extensively, collecting vast datasets (e.g., father-son height pairs).
Summarizing such data reveals relationships, e.g., shorter fathers tend to have shorter sons.
Describing Scatterplots' Shape
Shape Characteristics:
Linearity: Is the pattern linear or curved?
Clusters: Are there several clusters?
Outliers: Are there notable exceptions?
Direction:
Positive: y-value rises with increasing x-value.
Negative: y-value falls with increasing x-value.
No Relationship: Random pattern observed.
Strength Assessment:
Strong: Points are tightly clustered around a line/curve.
Weak: Points are scattered far from a line.
Strength of Association in Scatterplots
A strong relationship means that knowing one variable helps a lot in predicting the other.
Conversely, a weak relationship implies little assistance in guessing one variable from the other.
Analyzing Scatterplots - Practical Situation
Example: Consider the association between beer consumption and blood alcohol level.
How to describe any association and possible outliers?
Correlation Coefficient
Defined as a measure of linear association or clustering around a line.
Observation: Two scatter diagrams may have the same center and spread for both variables.
However, one may show a strong linear association while the other shows looser clustering around a line.
Direction and Strength of Linear Associations
Positive correlation: Increasing trends within datasets (correlation coefficient 0.00 to 1.00).
Negative correlation: Decreasing trends within datasets (correlation coefficient -1.00 to 0.00).
Correlation Coefficient Details
Definition: Measures strength of linear associations.
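Formula for correlation coefficient: $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, which equals the average of the products of x and y in standard units.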
Properties:
Range: From -1 to 1.
Sign indicates direction: positive/negative.
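Magnitude: Indicates the strength of the linear association: values near 0 indicate a weak linear association, values near 1 or -1 indicate a strong one.
Independence of Measurement: Correlation is unitless; it does not change if a constant is added to a variable or a variable is multiplied by a positive constant.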
Distinguishing Features of Correlation Coefficient
Remains unchanged under different measurement units of x and/or y.
Multiplying one variable by a negative constant flips the sign of the correlation but leaves its magnitude unchanged.
Example situation: Daily maximum temperatures recorded in Fahrenheit vs Celsius.
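These properties can be checked on a small example. The temperature readings below are made-up values for two hypothetical weather stations (they are not data from these notes), and `correlation` is a helper written for this sketch.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r of paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical daily maximum temperatures (degrees Fahrenheit) at two stations.
station_a_f = [68, 72, 75, 71, 80, 85, 78]
station_b_f = [65, 70, 74, 73, 79, 83, 75]

# Change of units: Fahrenheit to Celsius is a shift plus a positive rescaling.
station_a_c = [(f - 32) * 5 / 9 for f in station_a_f]

r_ff = correlation(station_a_f, station_b_f)
r_cf = correlation(station_a_c, station_b_f)   # one variable now in Celsius
r_swapped = correlation(station_b_f, station_a_f)
r_negated = correlation([-f for f in station_a_f], station_b_f)

print(r_ff, r_cf)   # same value: units do not matter
print(r_swapped)    # same value: switching x and y does not matter
print(r_negated)    # same magnitude, opposite sign
```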
Common Misinterpretations of Correlation
Statements Review
Correct/Incorrect points:
Correlation cannot exceed 1 or go below -1.
Reporting correlation in units (like inches/pounds) is invalid.
Switching x and y does not alter correlation.
Necessity of Visual Data Representation
Caution: Always plot data!
Example: Multiple datasets can share identical correlation values (very misleading as illustrated by Anscombe’s quartet).
Outlier Sensitivity in Correlation
Sensitivity: Correlation is influenced by outliers; depending on its position, a single outlier can change both the magnitude and the sign of r.
Common Practice: Recompute the correlation after removing the outlier and compare it with the original value.
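A small illustration of this sensitivity follows. The data are made up for the example: six points lying close to a line, plus one far-off point. The helper `correlation` is written for this sketch.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r of paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: points clustered tightly around a rising line.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# The same data with a single outlier far to the right and low.
x_out = x + [20]
y_out = y + [0.0]

r_clean = correlation(x, y)
r_with_outlier = correlation(x_out, y_out)

print(f"r without the outlier: {r_clean:.3f}")
print(f"r with the outlier:    {r_with_outlier:.3f}")
```

In this example one point is enough to turn a strong positive correlation into a negative one, which is why the scatterplot should be inspected before r is reported.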
Correlation does not Imply Causation
Case Study: Strong correlation between shoe size and arithmetic scores in children due to age acting as a confounding variable.
Controlling for lurking variables is vital before drawing causal inferences.
Causation Analysis Example
Positive correlation exists between hospital size (number of beds) and median patient stay duration.
Causation Implication: Cannot determine causation based solely on correlation.
Managing Common Confounding Factors
Example: A correlation observed between TV violence and adolescent behavior may stem from a common confounding variable, such as growing up in a violent environment.
Part IV: Probability
Ch 14: More about Chances
Required Reading: All Sections
Outcomes Listing Technique
Methodology: Identify all possible outcomes within an ‘equally likely’ framework to accurately calculate events' chances.
Example: Tossing two fair coins: What is the probability of at least one tail?
Outcome List: {(H,H), (H,T), (T,H), (T,T)}
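Total Probability Calculation: Three of the four equally likely outcomes contain at least one tail, so $P(\text{at least one tail}) = 3/4$.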
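The listing technique for the two-coin example can be sketched in a few lines: enumerate every equally likely outcome, then count those in the event.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of tossing two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Event of interest: at least one tail appears.
favorable = [o for o in outcomes if "T" in o]

# Equally likely outcomes: chance = (number favorable) / (number possible).
p_at_least_one_tail = Fraction(len(favorable), len(outcomes))

print(outcomes)
print(p_at_least_one_tail)
```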
Listing Techniques: Box Example
Method: Draw two tickets from a box with replacement.
Probability Evaluation:
List all possible outcomes of the two draws.
Compute the required probabilities from that list.
Addition Rule Introduction
The probability of at least one of two events A or B occurring is given by: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
Additive Rule for Mutually Exclusive Events
Definition: If two events cannot occur simultaneously (mutually exclusive), then $P(A \text{ and } B) = 0$.
For mutually exclusive events the addition rule therefore simplifies to: $P(A \text{ or } B) = P(A) + P(B)$
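The addition rule can be verified by listing outcomes. The events below, defined on one roll of a fair die, are hypothetical examples chosen for this sketch and do not come from the notes.

```python
from fractions import Fraction

die = [1, 2, 3, 4, 5, 6]  # one roll of a fair die, equally likely faces

def prob(event) -> Fraction:
    """Chance of an event: favorable faces divided by all faces."""
    return Fraction(sum(1 for face in die if event(face)), len(die))

# Hypothetical events that CAN happen together (a roll of 6 is in both).
is_even = lambda face: face % 2 == 0
above_four = lambda face: face > 4

p_a = prob(is_even)
p_b = prob(above_four)
p_a_and_b = prob(lambda f: is_even(f) and above_four(f))
p_a_or_b = prob(lambda f: is_even(f) or above_four(f))

# Mutually exclusive events: a single roll cannot be both 1 and 2.
p_one_or_two = prob(lambda f: f == 1 or f == 2)

print(p_a_or_b, p_a + p_b - p_a_and_b)   # general addition rule
print(p_one_or_two, prob(lambda f: f == 1) + prob(lambda f: f == 2))
```

Subtracting P(A and B) corrects for counting the roll of 6 twice; for mutually exclusive events there is nothing to subtract.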
Practical Example of Addition Rule in Boxes
Given the events defined with random selections, determine pairs and their probabilities using the addition principles.
Review of Addition vs. Multiplication Rules
Addition Rule: Evaluates at least one occurrence of A or B taking place.
Multiplication Rule: Evaluates simultaneous occurrence of two events.
Note: For independent events, the multiplication rule reduces to the simpler case $P(A \text{ and } B) = P(A) \times P(B)$.
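The independent case can be illustrated with two draws made with replacement. The box contents and the two events below are assumptions made up for this sketch.

```python
from fractions import Fraction
from itertools import product

box = [1, 2, 3, 4]  # hypothetical tickets in the box

# Two draws WITH replacement: 16 equally likely ordered pairs.
draws = list(product(box, repeat=2))

def prob(event) -> Fraction:
    """Chance of an event: favorable pairs divided by all pairs."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

first_even = lambda d: d[0] % 2 == 0   # event A: first ticket is even
second_is_one = lambda d: d[1] == 1    # event B: second ticket is 1

p_a = prob(first_even)
p_b = prob(second_is_one)
p_both = prob(lambda d: first_even(d) and second_is_one(d))

print(p_both, p_a * p_b)  # equal, because the draws are independent
```

With replacement, the first draw does not change the box for the second draw, so the chances multiply directly.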
Insurance Plan Example
Health and dental insurance choice analysis involving random employee selection to assess probabilities based on provided employee data.
Try at Home Tasks
Explore card drawing scenarios providing varied outcomes within probability frameworks.
Closing Remarks
Theoretical conclusions must always be buttressed with practical applications and dependencies recognized on physical or flower relationships.
Probabilities and Calculating Techniques
Residuals
Definition: Differences between observed values and the values predicted by the model.
Formula: residual = observed value - predicted value, i.e., $e = y - \hat{y}$
Assess model fit based on residual behavior against expected outcomes.
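A minimal sketch of computing residuals follows. It assumes the predictions come from a least-squares line, which these notes do not specify, and the data points are made up for the example.

```python
# Hypothetical paired data (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# ASSUMPTION: predictions come from the least-squares line y_hat = a + b * x.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]

# Residual = observed value minus predicted value.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, ri in zip(x, residuals):
    print(f"x = {xi}: residual = {ri:+.2f}")
```

For a least-squares line the residuals sum to zero, so what matters is their pattern: residuals scattered without structure around zero are consistent with a linear model, while a curve or fan shape in a residual plot suggests the model assumptions fail.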
Residuals as Validation Tool
Properties of residuals assist in validating linear models and checking model assumptions, notably through residual plots representation.
Types of Predictions
Extrapolation: Prediction for x-values outside the range of the sampled data; less reliable because the fitted relationship may not hold there.
Interpolation: Prediction for x-values within the range of the sampled data; generally more reliable.
Conclusion on Predictions
Key to predictive analysis is thorough visual representation of data, models, and residuals to ascertain operational significance.
Summary
Evolving understanding of statistical principles, visual representations, and rigorous inquiries is paramount to mastery in the field.
Through detailed articulation of complex illustrations, students are better prepared to navigate statistical realms effectively.