Comprehensive notes on Probability, Normal Distribution, Z-scores, and Normality Assessment
Probability foundations
- Probability definition: probability is the chance of something happening; a proportion with values from 0 to 1 (or 0% to 100%).
- Formal expression (conceptual): P(A) = (number of favorable outcomes) / (total number of possible outcomes).
- Why probability matters:
- We cannot observe the entire population; we rely on samples that should represent the population.
- Probability distributions and especially the normal distribution underpin many statistical techniques and tests.
- Many statistical tests assume the data follow a certain shape (often a symmetric bell-shaped curve), i.e., a probability distribution with particular properties.
- Empirical vs theoretical distributions:
- Empirical distributions: based on observed data (e.g., a sample of 50 people plotted as a curve).
- Theoretical distributions: based on theories/mathematical functions; used to calculate theoretical probabilities for outcomes under the curve.
- Data types and distributions:
- Data can be continuous or categorical.
- This unit focuses on continuous distributions, highlighting the normal and t distributions (noting that the t distribution closely resembles the normal distribution for larger samples).
- Probability basics revisited with simple examples:
- Coin toss: two outcomes; probability of heads (or tails) is P(H) = P(T) = 1/2 = 0.5 = 50%.
- Rolling a six-sided die: probability of getting a 4 is P(4) = 1/6 ≈ 0.1667 (≈ 16.67%).
- Multiple-choice guess with 4 options: probability of a correct guess is P = 1/4 = 0.25 = 25%.
- If an event's probability is 0.05, that means a 5% chance, i.e., 0.05 = 5% (5 in 100) or 1 in 20.
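These simple favorable-over-total calculations can be sketched in a few lines of Python; a minimal illustration using the standard-library `fractions` module (names like `probability` are mine, not from the notes):

```python
from fractions import Fraction

# Probability of an event = favorable outcomes / total possible outcomes.
def probability(favorable: int, total: int) -> Fraction:
    return Fraction(favorable, total)

coin = probability(1, 2)   # P(heads) = 1/2
die = probability(1, 6)    # P(rolling a 4) = 1/6
mcq = probability(1, 4)    # P(correct guess among 4 options) = 1/4

print(float(coin), float(mcq))              # 0.5 0.25
print(round(float(die), 4))                 # 0.1667
print(Fraction(5, 100) == Fraction(1, 20))  # True: 5% is the same as 1 in 20
```

Using `Fraction` keeps the favorable/total structure visible before converting to a decimal or percentage.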
- Shape requirements in statistics:
- Many statistical tests assume a symmetric, bell-shaped curve (normal distribution). A variety of other theoretical forms exist for different data types.
Normal distribution: key features and intuition
- Normal distribution (bell-shaped, symmetric):
- One peak (mode), symmetry around the center.
- Mean = Median = Mode for a perfectly normal distribution (in practice, they are identical or very close when data are approximately normal).
- If you draw a vertical line at the mean, left half and right half are mirror images (line of symmetry).
- Area under the curve:
- Total area under the normal curve is 1 (or 100%).
- The area between two points on the curve represents the probability (relative frequency) of observing values within that interval.
- Example and spread:
- IQ distribution example: mean μ=100, standard deviation σ=15.
- About fixed-proportion intervals around the mean:
- Within one standard deviation: ≈68.3% of observations.
- Within two standard deviations: ≈95.4%.
- Within three standard deviations: ≈99.7%.
- Consequences: about 68%, 95%, and 99.7% of observations lie within 1, 2, or 3 SDs of the mean, respectively (the 68-95-99.7 rule).
- Practice with the IQ example (numeric illustration):
- Mean IQ = 100, σ=15.
- Interval within 1 SD: [100−15,100+15]=[85,115] contains 68.3% of IQs.
- Interval within 2 SDs: [100−30,100+30]=[70,130] contains 95.4%.
- Interval within 3 SDs: [100−45,100+45]=[55,145] contains 99.7%.
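The fixed proportions above can be verified by integrating the normal curve over each interval; a short sketch using `scipy.stats.norm` (assuming SciPy is available), with the IQ parameters from the notes:

```python
from scipy.stats import norm

mu, sigma = 100, 15  # IQ example from the notes

# Proportion of a normal distribution within k standard deviations of the mean:
# the area under the curve between mu - k*sigma and mu + k*sigma.
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    p = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    print(f"within {k} SD [{lo}, {hi}]: {p:.1%}")
```

The printed proportions match the 68.3%, 95.4%, and 99.7% figures quoted above for the intervals [85, 115], [70, 130], and [55, 145].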
- Standard normal distribution:
- A special form of the normal distribution with mean μ=0 and standard deviation σ=1.
- Denoted as Z∼N(0,1).
- The horizontal axis represents z-scores (standardized scores):
- Z = +1 corresponds to 1 standard deviation above the mean.
- Z = -1 corresponds to 1 standard deviation below the mean.
- Z-scores provide a common scale to compare scores from different distributions (e.g., comparing blood pressure and cholesterol on a common footing).
Z-scores: calculation, interpretation, and applications
- Definition and purpose:
- Z-score is the distance between a value and the mean, measured in standard deviations.
- Formula (population parameters): z = (X − μ) / σ
- For sample data (to standardize sample scores): z = (X − X̄) / s
- Worked examples:
- Example 1: observed score X=40.5, sample mean Xˉ=40, sample SD s=0.5.
- Z-score: z = (40.5 − 40) / 0.5 = 1.
- Interpretation: the value is 1 standard deviation above the average for the population/sample.
- Example 2: cross-unit comparisons (e.g., blood pressure vs cholesterol):
- Two different measures with different units and scales can be compared using their Z-scores.
- If blood pressure z-score = 0 (average) and cholesterol z-score = +1 (above average), you can interpret them on a common scale without units.
- Example 3: comparing performance across units (final assessments) using z-scores:
- If unit A has z = -1 and unit B has z = +1.5 for the same student, relative performance is 1.5 SDs above average in unit B and 1 SD below average in unit A.
- Why z-scores are useful:
- Enable meaningful comparisons across different metrics and scales.
- Helpful for health indicators and other outcomes when you want a consistent reference (relative standing within the population).
- Z-score calculation recap:
- For a single observed score: z = (X − μ) / σ.
- For a sample score: z = (X − X̄) / s.
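The recap formula can be wrapped in a tiny helper; a minimal sketch (the function name `z_score` is mine) reproducing the worked example from the notes and the IQ scale for a cross-metric comparison:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance of a value from the mean, measured in standard-deviation units."""
    return (x - mean) / sd

# Worked example from the notes: X = 40.5, mean = 40, SD = 0.5
print(z_score(40.5, 40, 0.5))  # 1.0 -> one SD above the average

# Same idea on the IQ scale (mean 100, SD 15): an IQ of 130 sits 2 SDs above the mean.
print(z_score(130, 100, 15))   # 2.0
```

Because both results are unitless, scores from completely different scales (blood pressure, cholesterol, exam marks) can be compared directly.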
The normal vs t distributions; when to use each
- Normal distribution vs t distribution:
- Normal distribution: the baseline bell-shaped curve used for many large-sample statistics.
- t distribution: similar to normal distribution in shape, especially for larger samples, but with heavier tails; used in t-tests to compare means when sample sizes are small or when the population SD is unknown.
- For larger samples, the t distribution resembles the normal distribution closely and many tests converge to their normal counterparts.
- T-tests (what they are and how they relate to the unit):
- Parametric tests that compare means between groups.
- Appropriate for continuous data and when assumptions (e.g., normality) are reasonably met.
- There are three t-tests discussed in the unit (one per week).
- Important caveat about mean as a statistic:
- The mean is an appropriate descriptive statistic only for continuous data, not categorical data.
When data are not symmetric: skewness, outliers, and their effects
- Skewness and tails:
- Negative skew (left-skew): left tail longer; many data values cluster toward the right.
- Positive skew (right-skew): right tail longer; many data values cluster toward the left with a few extreme high values.
- The most common causes of skewness include outliers/extreme values in the dataset.
- Outliers and the mean vs the median:
- Example with a small data set: {1,2,3,4,5} (mean = 3, median = 3).
- Replace a value (e.g., 5) with 10: mean shifts (to 4), median remains 3.
- Replace with 20: mean shifts further (to 6), median still 3.
- This demonstrates that the mean is sensitive to outliers, which can drag tails toward high or low ends.
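The mean-versus-median demonstration above runs directly with the standard-library `statistics` module; a minimal sketch using the exact numbers from the notes:

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5]
print(mean(data), median(data))  # 3 3

data[-1] = 10                    # replace the 5 with 10
print(mean(data), median(data))  # 4 3 -> mean shifts, median holds

data[-1] = 20                    # replace with 20
print(mean(data), median(data))  # 6 3 -> the mean chases the outlier
```

The median's resistance to outliers is why it is preferred as a summary for skewed data.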
- Parametric vs nonparametric tests (revisited):
- Parametric tests (e.g., t-tests) assume a particular distribution shape (often normal).
- Nonparametric tests do not require a normal distribution; they are less powerful but useful for non-normal data.
- Nonparametric tests may be introduced later in the course (e.g., in a later week).
- Data transformation as a remedy:
- If data are skewed, transformations can reduce skewness and outlier effects, making data more symmetric.
- Note: transformation techniques are not covered in this introductory unit.
How to assess whether data are normally distributed: an eight-part checklist
- The eight assessment tools (8a and 8b are subparts of item 8):
1) Mean, median, and mode
- In a symmetric distribution, these three measures tend to be close or identical.
- In practice, the descriptive table in SPSS may not always show the mode; you may need to compute or request it in your lab handout.
2) Skewness statistic - The value should be near zero for normal data.
3) Skewness z-score - Computed as z_skew = Skewness / SE_Skewness.
- For normality, this z-score should lie within approximately ±1.96 (smaller samples) or, in larger samples, can be up to about ±2.5.
4) Kurtosis statistic - The value should be near zero for normal data (excess kurtosis is often reported).
5) Kurtosis z-score - Computed as z_kurt = Kurtosis / SE_Kurtosis.
- For normality, this z-score should lie within approximately ±1.96 (smaller samples) or up to about ±2.5 in larger samples.
6) Shapiro-Wilk test (SW test) - Hypothesis: the data come from a population that is normally distributed.
- If the significance value (p-value) is < 0.05, the assumption of normality is violated (i.e., the data are not from a normal distribution).
- If p >= 0.05, the assumption of normality is not violated (the data may be normal).
- Note: Kolmogorov-Smirnov test is another option, but both tests are sometimes criticized for being oversensitive.
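The numeric checks in items 2-6 can be computed with `scipy.stats`; a sketch on simulated data, assuming SciPy is available. Note the standard errors here use the common approximations √(6/n) and √(24/n), whereas SPSS reports slightly more exact values:

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=200)  # simulated, approximately normal
n = len(data)

# Approximate standard errors of skewness and kurtosis.
se_skew = np.sqrt(6 / n)
se_kurt = np.sqrt(24 / n)

z_skew = skew(data) / se_skew
z_kurt = kurtosis(data) / se_kurt  # excess kurtosis: near 0 for a normal curve
print(f"z_skew = {z_skew:.2f}, z_kurt = {z_kurt:.2f}")  # typically within ±1.96 here

stat, p = shapiro(data)
print(f"Shapiro-Wilk p = {p:.3f}")  # p >= 0.05 -> normality not rejected
```

With genuinely normal data the z-scores usually fall inside ±1.96 and the Shapiro-Wilk p-value is usually non-significant, though with large samples the test can flag trivial departures (the oversensitivity noted above).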
7) Histogram - Visual inspection of the shape; a histogram helps identify skewness and outliers (e.g., a long right tail indicating positive skew).
8) Graphical methods (8a and 8b): - 8a Normal probability plots (also called Q-Q plots):
- Data are plotted against a theoretical normal distribution; a straight diagonal line indicates approximate normality.
- Deviations from the line indicate departures from normality; outliers appear as points far from the line.
- Two examples may be shown to illustrate different data sets.
- 8b Detrended Q-Q plot (Detrended QQ plot):
- Plots the deviations from the diagonal line; helps visualize non-normality more clearly.
- If the data are perfectly normal, points would lie along a horizontal line; real data show deviations from this line.
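The quantities behind a Q-Q plot can be computed without drawing it; a sketch using `scipy.stats.probplot` (assuming SciPy is available), where an r value close to 1 means the points hug the diagonal:

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)
data = rng.normal(size=100)

# probplot pairs observed quantiles with theoretical normal quantiles and
# fits a least-squares line; r near 1 indicates approximate normality.
(theoretical_q, observed_q), (slope, intercept, r) = probplot(data, dist="norm")
print(f"r = {r:.3f}")

# Detrended view (item 8b): deviations of the observed quantiles from the line.
deviations = observed_q - (slope * theoretical_q + intercept)
print(f"max |deviation| = {np.abs(deviations).max():.3f}")
```

Passing `plot=plt` (with matplotlib) would draw the same information as the SPSS normal probability plot.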
- Practical note on interpretation:
- Do not decide normality based on a single test or plot; rely on all eight checks together and provide an overall conclusion.
- In the pulse rate example (pulse rate before exam vs during exam):
- Pulse rate before exam showed positive skew based on multiple normality checks.
- Pulse rate during exam appeared more normally distributed.
- Example dataset interpretation (pulse rate before vs during exam):
- Conclusion example: the pulse rates before the exam are positively skewed in the population.
- Graphs and outputs mentioned:
- Histogram: visual check for skewness and outliers.
- Stem-and-leaf plot: preserves actual data values and shows distribution; useful for small-to-moderate samples and is often undervalued compared to histograms.
- Box plot: shows median, quartiles, and potential outliers; symmetric box and whiskers indicate normality patterns; visible outliers suggest deviations from symmetry.
- Normal probability plot (item 8a): assesses normality by comparing observed data to a theoretical normal distribution.
- Detrended Q-Q plot (item 8b): complements QQ plot by highlighting deviations from normality more clearly.
- Practical lab use:
- In Computer Lab 3, you will learn how to generate all these graphs and tables in SPSS.
- The graphs and brief interpretations will be required in the first assessment: Data analysis and research design evaluation 1.
Putting it all together: applying the concepts in assessment
- When you are asked to determine whether a dataset is normally distributed:
- Use all eight methods (8a/8b included) to form an overall judgment.
- Check for symmetry across the tools:
- mean ≈ median ≈ mode; skewness and kurtosis statistics (and their z-scores) near zero;
- SW test non-significant; histogram roughly bell-shaped;
- stem-and-leaf shows a realistic, unskewed pattern; box plot with symmetric whiskers around the median;
- normal probability plot points close to the diagonal; detrended QQ plot points scattered around the horizontal line.
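The numeric half of this judgment can be partially automated; a rough sketch (the function name and the mean-median tolerance of SD/4 are my own choices, not from the notes), assuming SciPy is available. The visual checks still need human eyes:

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

def normality_checklist(data, z_cut=1.96):
    """Automate the numeric normality checks; histograms, stem-and-leaf,
    box plots, and Q-Q plots must still be inspected visually."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    z_skew = skew(data) / np.sqrt(6 / n)       # approximate SE of skewness
    z_kurt = kurtosis(data) / np.sqrt(24 / n)  # approximate SE of kurtosis
    _, p = shapiro(data)
    return {
        "mean_median_close": abs(np.mean(data) - np.median(data)) < np.std(data) / 4,
        "skew_z_ok": abs(z_skew) <= z_cut,
        "kurt_z_ok": abs(z_kurt) <= z_cut,
        "shapiro_ok": p >= 0.05,
    }

rng = np.random.default_rng(1)
checks = normality_checklist(rng.normal(size=150))
print(checks)
```

As the notes stress, no single entry in the returned dictionary decides the question; the overall pattern across all checks (plus the plots) does.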
- Real-world takeaway:
- Normality is an assumption for many parametric tests (e.g., t-tests).
- If normality is violated, consider data transformation or nonparametric alternatives (nonparametric tests are covered later).
- Final reminder from the lecture:
- Do not rely on a single test to judge normality.
- The eight methods provide a robust toolkit for evaluating the shape of the data and the appropriateness of parametric tests.
- You will practice these techniques in SPSS in the lab and apply them in your first assessment.
Key formulas and concepts (summary)
- Probability basics:
- P(A) = favorable outcomes / total outcomes
- Normal distribution properties:
- Symmetric bell curve; mean = median = mode (in a perfect normal distribution).
- P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx, with total area under the curve equal to 1.
- 68-95-99.7 rule (for normal distribution):
- Within one SD: ≈68.3%; within two SDs: ≈95.4%; within three SDs: ≈99.7%.
- Standard normal distribution:
- Z ∼ N(0,1); z = (X − μ) / σ
- Z-scores: interpretation and cross-metric comparisons (e.g., cardiovascular metrics).
- Skewness and kurtosis: assess symmetry and tail behavior; z-scores for skewness/kurtosis help decide normality.
- Normality tests and plots (the eight tools): mean/median/mode; skewness statistic and z-score; kurtosis statistic and z-score; SW test; histogram; normal probability (Q-Q) plot; detrended QQ plot; supported by stem-and-leaf and box plots.
- Data handling notes:
- Transformation as a method to correct non-normality (not covered in this introductory unit).
- Parametric vs nonparametric tests: rely on normality for the former; nonparametric tests do not require such assumptions but may be less powerful.