Comprehensive notes on Probability, Normal Distribution, Z-scores, and Normality Assessment

Probability foundations

  • Probability definition: probability is the chance of something happening; a proportion with values from 0 to 1 (or 0% to 100%).
    • Formal expression (conceptual): P(A) = (number of favorable outcomes) / (total number of possible outcomes).
  • Why probability matters:
    • We cannot observe the entire population; we rely on samples that should represent the population.
    • Probability distributions and especially the normal distribution underpin many statistical techniques and tests.
    • Many statistical tests assume the data follow a certain shape (often a symmetric bell-shaped curve), i.e., a probability distribution with particular properties.
  • Empirical vs theoretical distributions:
    • Empirical distributions: based on observed data (e.g., a sample of 50 people plotted as a curve).
    • Theoretical distributions: based on theories/mathematical functions; used to calculate theoretical probabilities for outcomes under the curve.
  • Data types and distributions:
    • Data can be continuous or categorical.
    • This unit focuses on continuous distributions, highlighting the normal and t distributions (which behave similarly for larger samples).
  • Probability basics revisited with simple examples:
    • Coin toss: two outcomes; probability of heads (or tails) is P(H) = P(T) = 1/2 = 0.5 = 50%.
    • Rolling a six-sided die: probability of getting a 4 is P(4) = 1/6 ≈ 0.1667 (≈ 16.67%).
    • Multiple-choice guess with 4 options: probability of a correct guess is P = 1/4 = 0.25 = 25%.
    • If an event's probability is 0.05, that means a 5% chance, i.e., 5 in 100 or 1 in 20.
  • Shape requirements in statistics:
    • Many statistical tests assume a symmetric, bell-shaped curve (normal distribution). A variety of other theoretical forms exist for different data types.
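The simple probabilities above are just favorable outcomes over total outcomes; a minimal sketch using Python's standard-library fractions module (variable names are ours, for illustration):

```python
from fractions import Fraction

# Each probability is favorable outcomes / total possible outcomes
p_heads = Fraction(1, 2)   # coin toss: one favorable outcome of two
p_die_4 = Fraction(1, 6)   # one face of a fair six-sided die
p_guess = Fraction(1, 4)   # one correct option out of four

print(float(p_heads))            # 0.5
print(round(float(p_die_4), 4))  # 0.1667
print(float(p_guess))            # 0.25
```

Fraction keeps the exact ratio; converting to float gives the decimal (and, times 100, the percentage) form used in the notes.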

Normal distribution: key features and intuition

  • Normal distribution (bell-shaped, symmetric):
    • One peak (mode), symmetry around the center.
    • Mean = Median = Mode for a perfectly normal distribution (in practice, they are identical or very close when data are approximately normal).
    • If you draw a vertical line at the mean, left half and right half are mirror images (line of symmetry).
  • Area under the curve:
    • Total area under the normal curve is 1 (or 100%).
    • The area between two points on the curve represents the probability (relative frequency) of observing values within that interval.
  • Example and spread:
    • IQ distribution example: mean μ = 100, standard deviation σ = 15.
    • Fixed-proportion intervals around the mean:
    • Within one standard deviation: ≈ 68.3% of observations.
    • Within two standard deviations: ≈ 95.4%.
    • Within three standard deviations: ≈ 99.7%.
    • Consequence: about 68%, 95%, and 99.7% of observations lie within 1, 2, or 3 SDs of the mean, respectively (the 68-95-99.7 rule).
  • Practice with the IQ example (numeric illustration):
    • Mean IQ = 100, σ = 15.
    • Interval within 1 SD: [100 - 15, 100 + 15] = [85, 115] contains ≈ 68.3% of IQs.
    • Interval within 2 SDs: [100 - 30, 100 + 30] = [70, 130] contains ≈ 95.4%.
    • Interval within 3 SDs: [100 - 45, 100 + 45] = [55, 145] contains ≈ 99.7%.
  • Standard normal distribution:
    • A special form of the normal distribution with mean μ = 0 and standard deviation σ = 1.
    • Denoted Z ~ N(0, 1).
    • The horizontal axis represents z-scores (standardized scores):
    • Z = +1 corresponds to 1 standard deviation above the mean.
    • Z = -1 corresponds to 1 standard deviation below the mean.
    • Z-scores provide a common scale to compare scores from different distributions (e.g., comparing blood pressure and cholesterol on a common footing).
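The fixed-proportion intervals for the IQ example can be verified with the standard library (a sketch using statistics.NormalDist, available since Python 3.8; the area between two points is the difference of the cumulative distribution function at those points):

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # IQ example: mean 100, SD 15

# Probability of landing in an interval = area under the curve there
within_1sd = iq.cdf(115) - iq.cdf(85)   # ~ 0.683
within_2sd = iq.cdf(130) - iq.cdf(70)   # ~ 0.954
within_3sd = iq.cdf(145) - iq.cdf(55)   # ~ 0.997

print(round(within_1sd, 3), round(within_2sd, 3), round(within_3sd, 3))
# 0.683 0.954 0.997
```

Because the normal distribution is defined by μ and σ alone, the same three proportions hold for any normal variable, not just IQ.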

Z-scores: calculation, interpretation, and applications

  • Definition and purpose:
    • A z-score is the distance between a value and the mean, measured in standard deviations.
    • Formula (population parameters): z = (X - μ) / σ
    • For sample data (to standardize sample scores): z = (X - X̄) / s
  • Worked examples:
    • Example 1: observed score X = 40.5, sample mean X̄ = 40, sample SD s = 0.5.
    • Z-score: z = (40.5 - 40) / 0.5 = 1.
    • Interpretation: the value is 1 standard deviation above the average.
    • Example 2: cross-unit comparisons (e.g., blood pressure vs cholesterol):
    • Two different measures with different units and scales can be compared using their Z-scores.
    • If blood pressure z-score = 0 (average) and cholesterol z-score = +1 (above average), you can interpret them on a common scale without units.
    • Example 3: comparing performance across units (final assessments) using z-scores:
    • If unit A has z = -1 and unit B has z = +1.5 for the same student, relative performance is 1.5 SDs above average in unit B and 1 SD below average in unit A.
  • Why z-scores are useful:
    • Enable meaningful comparisons across different metrics and scales.
    • Helpful for health indicators and other outcomes when you want a consistent reference (relative standing within the population).
  • Z-score calculation recap:
    • For a single observed score (population parameters): z = (X - μ) / σ.
    • For a sample score: z = (X - X̄) / s.
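Both formulas are the same arithmetic with different inputs; a minimal Python sketch (the function name z_score and the second pair of example numbers are ours, for illustration):

```python
def z_score(x, center, spread):
    """Distance between a value and the mean, in standard deviations.
    Pass (mu, sigma) for population parameters or (x_bar, s) for a sample."""
    return (x - center) / spread

# Example 1 from the notes: X = 40.5, mean = 40, SD = 0.5
print(z_score(40.5, 40, 0.5))   # 1.0

# Cross-metric comparison on a common, unit-free scale: an IQ of 130
# (mean 100, SD 15) vs a score of 85 on a test with mean 70, SD 10
print(z_score(130, 100, 15))    # 2.0
print(z_score(85, 70, 10))      # 1.5
```

The last two calls show why z-scores enable cross-unit comparison: 2.0 > 1.5, so the IQ score is relatively further above its average even though the raw numbers are on different scales.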

Calculating and interpreting a z-score: a worked example

  • Given a raw score, mean, and standard deviation:
    • Suppose observed score X = 40.5, sample mean X̄ = 40, sample SD s = 0.5.
    • Z-score: z = (40.5 - 40) / 0.5 = 1.
    • Interpretation: this score is one standard deviation above the average.

The normal vs t distributions; when to use each

  • Normal distribution vs t distribution:
    • Normal distribution: the baseline bell-shaped curve used for many large-sample statistics.
    • t distribution: similar to normal distribution in shape, especially for larger samples, but with heavier tails; used in t-tests to compare means when sample sizes are small or when the population SD is unknown.
    • For larger samples, the t distribution resembles the normal distribution closely and many tests converge to their normal counterparts.
  • T-tests (what they are and how they relate to the unit):
    • Parametric tests that compare means between groups.
    • Appropriate for continuous data and when assumptions (e.g., normality) are reasonably met.
    • There are three t-tests discussed in the unit (one per week).
  • Important caveat about mean as a statistic:
    • The mean is an appropriate descriptive statistic only for continuous data, not categorical data.

When data are not symmetric: skewness, outliers, and their effects

  • Skewness and tails:
    • Negative skew (left-skew): left tail longer; many data values cluster toward the right.
    • Positive skew (right-skew): right tail longer; many data values cluster toward the left with a few extreme high values.
    • The most common causes of skewness include outliers/extreme values in the dataset.
  • Outliers and the mean vs the median:
    • Example with a small data set: {1,2,3,4,5} (mean = 3, median = 3).
    • Replace a value (e.g., 5) with 10: mean shifts (to 4), median remains 3.
    • Replace with 20: mean shifts further (to 6), median still 3.
    • This demonstrates that the mean is sensitive to outliers, which can drag tails toward high or low ends.
  • Parametric vs nonparametric tests (revisited):
    • Parametric tests (e.g., t-tests) assume a particular distribution shape (often normal).
    • Nonparametric tests do not require a normal distribution; they are less powerful but useful for non-normal data.
    • Nonparametric tests may be introduced later in the course (e.g., in a later week).
  • Data transformation as a remedy:
    • If data are skewed, transformations can reduce skewness and outlier effects, making data more symmetric.
    • Note: transformation techniques are not covered in this introductory unit.
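The outlier demonstration above is easy to replicate with Python's statistics module (a minimal sketch of the {1, 2, 3, 4, 5} example from the notes):

```python
from statistics import mean, median

base = [1, 2, 3, 4, 5]
print(mean(base), median(base))                          # 3 3

# Replace the 5 with increasingly extreme values: the mean is
# dragged toward the outlier while the median stays put
print(mean([1, 2, 3, 4, 10]), median([1, 2, 3, 4, 10]))  # 4 3
print(mean([1, 2, 3, 4, 20]), median([1, 2, 3, 4, 20]))  # 6 3
```

This is the numeric version of the skewness story: the outlier pulls the mean into the tail, while the median, as a positional statistic, is robust to it.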

How to assess whether data are normally distributed: an eight-part checklist (with subparts a and b under item 8)

  • The eight assessment tools (a and b are subparts of item 8):
      1) Mean, median, and mode
    • In a symmetric distribution, these three measures tend to be close or identical.
    • In practice, the descriptive table in SPSS may not always show the mode; you may need to compute or request it in your lab handout.
      2) Skewness statistic
    • The value should be near zero for normal data.
      3) Skewness z-score
    • Computed as z_skew = Skewness / SE_skewness.
    • For normality, this z-score should lie within approximately ±1.96 (smaller samples) or, in larger samples, up to about ±2.5.
      4) Kurtosis statistic
    • The value should be near zero for normal data (excess kurtosis is often reported).
      5) Kurtosis z-score
    • Computed as z_kurt = Kurtosis / SE_kurtosis.
    • For normality, this z-score should lie within approximately ±1.96 (smaller samples) or up to about ±2.5 in larger samples.
      6) Shapiro-Wilk test (SW test)
    • Hypothesis: the data come from a population that is normally distributed.
    • If the significance value (p-value) is < 0.05, the assumption of normality is violated (i.e., the data are not from a normal distribution).
    • If p >= 0.05, the assumption of normality is not violated (the data may be normal).
    • Note: Kolmogorov-Smirnov test is another option, but both tests are sometimes criticized for being oversensitive.
      7) Histogram
    • Visual inspection of the shape; a histogram helps identify skewness and outliers (e.g., a long right tail indicating positive skew).
      8) Graphical methods (a and b under item 8):
    • 8a Normal probability plots (also called Q-Q plots):
      • Data are plotted against a theoretical normal distribution; a straight diagonal line indicates approximate normality.
      • Deviations from the line indicate departures from normality; outliers appear as points far from the line.
      • Two examples may be shown to illustrate different data sets.
    • 8b Detrended Q-Q plot:
      • Plots the deviations from the diagonal line; helps visualize non-normality more clearly.
      • If the data are perfectly normal, points would lie along a horizontal line; real data show deviations from this line.
  • Practical note on interpretation:
    • Do not decide normality based on a single test or plot; rely on all eight checks together and provide an overall conclusion.
    • In the pulse rate example (pulse rate before exam vs during exam):
    • Pulse rate before exam showed positive skew based on multiple normality checks.
    • Pulse rate during exam appeared more normally distributed.
  • Example dataset interpretation (pulse rate before vs during exam):
    • Conclusion example: the pulse rates before the exam are positively skewed in the population.
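Checklist items 2 and 3 can be computed by hand; the sketch below uses the adjusted Fisher-Pearson skewness and the standard-error formula that SPSS reports (the helper name skewness_z and the example data are ours, for illustration):

```python
import math

def skewness_z(data, cutoff=1.96):
    """Return (skewness, SE, z-score) and whether |z| is within the cutoff."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    # Adjusted Fisher-Pearson skewness (the statistic SPSS reports)
    g1 = (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in data)
    # Standard error of skewness
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    z = g1 / se
    return g1, se, z, abs(z) <= cutoff

print(skewness_z([1, 2, 3, 4, 5, 6, 7]))   # symmetric: skewness 0, z 0
print(skewness_z([1, 1, 2, 2, 3, 4, 15]))  # long right tail: positive z
```

The kurtosis z-score (items 4 and 5) follows the same pattern: statistic divided by its standard error, compared against the same ±1.96 / ±2.5 cutoffs.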

Visual and tabular tools to assess distribution (practical notes for SPSS and lab work)

  • Graphs and outputs mentioned:
    • Histogram: visual check for skewness and outliers.
    • Stem-and-leaf plot: preserves actual data values and shows distribution; useful for small-to-moderate samples and is often undervalued compared to histograms.
    • Box plot: shows median, quartiles, and potential outliers; symmetric box and whiskers indicate normality patterns; visible outliers suggest deviations from symmetry.
    • Normal probability plot (item 8a): assesses normality by comparing observed data to a theoretical normal distribution.
    • Detrended Q-Q plot (item 8b): complements QQ plot by highlighting deviations from normality more clearly.
  • Practical lab use:
    • In Computer Lab 3, you will learn how to generate all these graphs and tables in SPSS.
    • The graphs and brief interpretations will be required in the first assessment: Data analysis and research design evaluation 1.
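A normal probability (Q-Q) plot simply pairs each sorted observation with a theoretical normal quantile; a stdlib sketch (the (i + 0.5)/n plotting position is one common convention, an assumption on our part; SPSS uses a similar Blom-type formula):

```python
from statistics import NormalDist

def qq_points(data):
    """Pair sorted observations with standard-normal quantiles.
    Points lying near a straight line suggest approximate normality."""
    n = len(data)
    std = NormalDist()  # standard normal, N(0, 1)
    return [(std.inv_cdf((i + 0.5) / n), x)
            for i, x in enumerate(sorted(data))]

for theo, obs in qq_points([3, 1, 2, 5, 4]):
    print(round(theo, 3), obs)
```

A detrended Q-Q plot (item 8b) would instead plot each observation's vertical deviation from the fitted line, so perfectly normal data scatter around a horizontal zero line.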

Putting it all together: applying the concepts in assessment

  • When you are asked to determine whether a dataset is normally distributed:
    • Use all eight methods (8a/8b included) to form an overall judgment.
    • Check for symmetry: mean ≈ median ≈ mode; skewness/kurtosis near zero; SW test non-significant; histogram roughly bell-shaped; stem-and-leaf shows a realistic, unskewed pattern; box plot with symmetric whiskers around the median; normal probability plots show points close to the diagonal; detrended QQ plot shows points scattered around the horizontal line.
  • Real-world takeaway:
    • Normality is an assumption for many parametric tests (e.g., t-tests).
    • If normality is violated, consider data transformation or nonparametric alternatives (nonparametric tests are covered later).
  • Final reminder from the lecture:
    • Do not rely on a single test to judge normality.
    • The eight methods provide a robust toolkit for evaluating the shape of the data and the appropriateness of parametric tests.
    • You will practice these techniques in SPSS in the lab and apply them in your first assessment.

Quick recap of key formulas and concepts (compact reference)

  • Probability basics:
    • P(A) = (favorable outcomes) / (total outcomes)
  • Normal distribution properties:
    • Symmetric bell curve; mean = median = mode (in a perfect normal distribution).
    • P(a ≤ X ≤ b) = the area under the density curve between a and b (the integral of f(x) from a to b), with total area under the curve equal to 1.
  • 68-95-99.7 rule (for normal distribution):
    • Within one SD: ≈ 68.3%; within two SDs: ≈ 95.4%; within three SDs: ≈ 99.7%.
  • Standard normal distribution:
    • Z ~ N(0, 1); z = (X - μ) / σ
  • Z-scores: interpretation and cross-metric comparisons (e.g., cardiovascular metrics).
  • Skewness and kurtosis: assess symmetry and tail behavior; z-scores for skewness/kurtosis help decide normality.
  • Normality assessment (eight tools): mean/median/mode; skewness statistic; skewness z-score; kurtosis statistic; kurtosis z-score; SW test; histogram; normal probability and detrended Q-Q plots (with stem-and-leaf and box plots as supporting visuals).
  • Data handling notes:
    • Transformation as a method to correct non-normality (not covered in this introductory unit).
    • Parametric vs nonparametric tests: rely on normality for the former; nonparametric tests do not require such assumptions but may be less powerful.