Comprehensive notes on Probability, Normal Distribution, Z-scores, and Normality Assessment
Probability foundations
- Probability definition: probability is the chance of something happening; a proportion with values from 0 to 1 (or 0% to 100%).
- Formal expression (conceptual): P(A) = (number of favorable outcomes) / (total number of possible outcomes).
- Why probability matters:
- We cannot observe the entire population; we rely on samples that should represent the population.
- Probability distributions and especially the normal distribution underpin many statistical techniques and tests.
- Many statistical tests assume the data follow a certain shape (often a symmetric bell-shaped curve), i.e., a probability distribution with particular properties.
- Empirical vs theoretical distributions:
- Empirical distributions: based on observed data (e.g., a sample of 50 people plotted as a curve).
- Theoretical distributions: based on theories/mathematical functions; used to calculate theoretical probabilities for outcomes under the curve.
- Data types and distributions:
- Data can be continuous or categorical.
- This unit focuses on continuous distributions, highlighting the normal and t distributions (noting that the t distribution closely resembles the normal distribution for larger samples).
- Probability basics revisited with simple examples:
- Coin toss: two outcomes; probability of heads (or tails) is P(H) = P(T) = 1/2 = 0.5 = 50%.
- Rolling a six-sided die: probability of getting a 4 is P(4) = 1/6 ≈ 0.1667 (≈ 16.67%).
- Multiple-choice guess with 4 options: probability of a correct guess is P = 1/4 = 0.25 = 25%.
- If an event's probability is 0.05, that means a 5% chance, i.e., 0.05 = 5% (5 in 100) or 1 in 20.
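These simple favorable-over-total calculations can be sketched in a few lines of Python; a minimal illustration using the standard-library `fractions` module (names like `probability` are mine, not from the notes):

```python
from fractions import Fraction

# Probability of an event = favorable outcomes / total possible outcomes.
def probability(favorable: int, total: int) -> Fraction:
    return Fraction(favorable, total)

coin = probability(1, 2)   # P(heads) = 1/2
die = probability(1, 6)    # P(rolling a 4) = 1/6
mcq = probability(1, 4)    # P(correct guess among 4 options) = 1/4

print(float(coin), float(mcq))              # 0.5 0.25
print(round(float(die), 4))                 # 0.1667
print(Fraction(5, 100) == Fraction(1, 20))  # True: 5% is the same as 1 in 20
```

Using `Fraction` keeps the favorable/total structure visible before converting to a decimal or percentage.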
- Shape requirements in statistics:
- Many statistical tests assume a symmetric, bell-shaped curve (normal distribution). A variety of other theoretical forms exist for different data types.
Normal distribution: key features and intuition
- Normal distribution (bell-shaped, symmetric):
- One peak (mode), symmetry around the center.
- Mean = Median = Mode for a perfectly normal distribution (in practice, they are identical or very close when data are approximately normal).
- If you draw a vertical line at the mean, left half and right half are mirror images (line of symmetry).
- Area under the curve:
- Total area under the normal curve is 1 (or 100%).
- The area between two points on the curve represents the probability (relative frequency) of observing values within that interval.
- Example and spread:
- IQ distribution example: mean μ=100, standard deviation σ=15.
- About fixed-proportion intervals around the mean:
- Within one standard deviation: ≈68.3% of observations.
- Within two standard deviations: ≈95.4%.
- Within three standard deviations: ≈99.7%.
- Consequences: about 68%, 95%, and 99.7% of observations lie within 1, 2, or 3 SDs of the mean, respectively (the 68-95-99.7 rule).
- Practice with the IQ example (numeric illustration):
- Mean IQ = 100, σ=15.
- Interval within 1 SD: [100−15,100+15]=[85,115] contains 68.3% of IQs.
- Interval within 2 SDs: [100−30,100+30]=[70,130] contains 95.4%.
- Interval within 3 SDs: [100−45,100+45]=[55,145] contains 99.7%.
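The fixed proportions above can be verified by integrating the normal curve over each interval; a short sketch using `scipy.stats.norm` (assuming SciPy is available), with the IQ parameters from the notes:

```python
from scipy.stats import norm

mu, sigma = 100, 15  # IQ example from the notes

# Proportion of a normal distribution within k standard deviations of the mean:
# the area under the curve between mu - k*sigma and mu + k*sigma.
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    p = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    print(f"within {k} SD [{lo}, {hi}]: {p:.1%}")
```

The printed proportions match the 68.3%, 95.4%, and 99.7% figures quoted above for the intervals [85, 115], [70, 130], and [55, 145].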
- Standard normal distribution:
- A special form of the normal distribution with mean μ=0 and standard deviation σ=1.
- Denoted as Z∼N(0,1).
- The horizontal axis represents z-scores (standardized scores):
- Z = +1 corresponds to 1 standard deviation above the mean.
- Z = -1 corresponds to 1 standard deviation below the mean.
- Z-scores provide a common scale to compare scores from different distributions (e.g., comparing blood pressure and cholesterol on a common footing).
Z-scores: calculation, interpretation, and applications
- Definition and purpose:
- Z-score is the distance between a value and the mean, measured in standard deviations.
- Formula (population parameters): z = (X − μ) / σ
- For sample data (to standardize sample scores): z = (X − X̄) / s
- Worked examples:
- Example 1: observed score X=40.5, sample mean Xˉ=40, sample SD s=0.5.
- Z-score: z = (40.5 − 40) / 0.5 = 1.
- Interpretation: the value is 1 standard deviation above the average for the population/sample.
- Example 2: cross-unit comparisons (e.g., blood pressure vs cholesterol):
- Two different measures with different units and scales can be compared using their Z-scores.
- If blood pressure z-score = 0 (average) and cholesterol z-score = +1 (above average), you can interpret them on a common scale without units.
- Example 3: comparing performance across units (final assessments) using z-scores:
- If unit A has z = -1 and unit B has z = +1.5 for the same student, relative performance is 1.5 SDs above average in unit B and 1 SD below average in unit A.
- Why z-scores are useful:
- Enable meaningful comparisons across different metrics and scales.
- Helpful for health indicators and other outcomes when you want a consistent reference (relative standing within the population).
- Z-score calculation recap:
- For a single observed score: z = (X − μ) / σ.
- For a sample score: z = (X − X̄) / s.
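The recap formula can be wrapped in a tiny helper; a minimal sketch (the function name `z_score` is mine) reproducing the worked example from the notes and the IQ scale for a cross-metric comparison:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance of a value from the mean, measured in standard-deviation units."""
    return (x - mean) / sd

# Worked example from the notes: X = 40.5, mean = 40, SD = 0.5
print(z_score(40.5, 40, 0.5))  # 1.0 -> one SD above the average

# Same idea on the IQ scale (mean 100, SD 15): an IQ of 130 sits 2 SDs above the mean.
print(z_score(130, 100, 15))   # 2.0
```

Because both results are unitless, scores from completely different scales (blood pressure, cholesterol, exam marks) can be compared directly.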
The normal vs t distributions; when to use each
- Normal distribution vs t distribution:
- Normal distribution: the baseline bell-shaped curve used for many large-sample statistics.
- t distribution: similar to normal distribution in shape, especially for larger samples, but with heavier tails; used in t-tests to compare means when sample sizes are small or when the population SD is unknown.
- For larger samples, the t distribution resembles the normal distribution closely and many tests converge to their normal counterparts.
- T-tests (what they are and how they relate to the unit):
- Parametric tests that compare means between groups.
- Appropriate for continuous data and when assumptions (e.g., normality) are reasonably met.
- There are three t-tests discussed in the unit (one per week).
- Important caveat about mean as a statistic:
- The mean is an appropriate descriptive statistic only for continuous data, not categorical data.
When data are not symmetric: skewness, outliers, and their effects
- Skewness and tails:
- Negative skew (left-skew): left tail longer; many data values cluster toward the right.
- Positive skew (right-skew): right tail longer; many data values cluster toward the left with a few extreme high values.
- The most common causes of skewness include outliers/extreme values in the dataset.
- Outliers and the mean vs the median:
- Example with a small data set: {1,2,3,4,5} (mean = 3, median = 3).
- Replace a value (e.g., 5) with 10: mean shifts (to 4), median remains 3.
- Replace with 20: mean shifts further (to 6), median still 3.
- This demonstrates that the mean is sensitive to outliers, which can drag tails toward high or low ends.
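The mean-versus-median demonstration above runs directly with the standard-library `statistics` module; a minimal sketch using the exact numbers from the notes:

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5]
print(mean(data), median(data))  # 3 3

data[-1] = 10                    # replace the 5 with 10
print(mean(data), median(data))  # 4 3 -> mean shifts, median holds

data[-1] = 20                    # replace with 20
print(mean(data), median(data))  # 6 3 -> the mean chases the outlier
```

The median's resistance to outliers is why it is preferred as a summary for skewed data.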
- Parametric vs nonparametric tests (revisited):
- Parametric tests (e.g., t-tests) assume a particular distribution shape (often normal).
- Nonparametric tests do not require a normal distribution; they are less powerful but useful for non-normal data.
- Nonparametric tests may be introduced later in the course (e.g., in a later week).
- Data transformation as a remedy:
- If data are skewed, transformations can reduce skewness and outlier effects, making data more symmetric.
- Note: transformation techniques are not covered in this introductory unit.
How to assess whether data are normally distributed: an eight-part checklist
- The eight assessment tools (8a and 8b are subparts of item 8):
1) Mean, median, and mode
- In a symmetric distribution, these three measures tend to be close or identical.
- In practice, the descriptive table in SPSS may not always show the mode; you may need to compute or request it in your lab handout.
2) Skewness statistic - The value should be near zero for normal data.
3) Skewness z-score - Computed as z_skew = Skewness / SE_Skewness.
- For normality, this z-score should lie within approximately ±1.96 (smaller samples) or, in larger samples, can be up to about ±2.5.
4) Kurtosis statistic - The value should be near zero for normal data (excess kurtosis is often reported).
5) Kurtosis z-score - Computed as z_kurt = Kurtosis / SE_Kurtosis.
- For normality, this z-score should lie within approximately ±1.96 (smaller samples) or up to about ±2.5 in larger samples.
6) Shapiro-Wilk test (SW test) - Hypothesis: the data come from a population that is normally distributed.
- If the significance value (p-value) is < 0.05, the assumption of normality is violated (i.e., the data are not from a normal distribution).
- If p >= 0.05, the assumption of normality is not violated (the data may be normal).
- Note: Kolmogorov-Smirnov test is another option, but both tests are sometimes criticized for being oversensitive.
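The numeric checks in items 2-6 can be computed with `scipy.stats`; a sketch on simulated data, assuming SciPy is available. Note the standard errors here use the common approximations √(6/n) and √(24/n), whereas SPSS reports slightly more exact values:

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=200)  # simulated, approximately normal
n = len(data)

# Approximate standard errors of skewness and kurtosis.
se_skew = np.sqrt(6 / n)
se_kurt = np.sqrt(24 / n)

z_skew = skew(data) / se_skew
z_kurt = kurtosis(data) / se_kurt  # excess kurtosis: near 0 for a normal curve
print(f"z_skew = {z_skew:.2f}, z_kurt = {z_kurt:.2f}")  # typically within ±1.96 here

stat, p = shapiro(data)
print(f"Shapiro-Wilk p = {p:.3f}")  # p >= 0.05 -> normality not rejected
```

With genuinely normal data the z-scores usually fall inside ±1.96 and the Shapiro-Wilk p-value is usually non-significant, though with large samples the test can flag trivial departures (the oversensitivity noted above).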
7) Histogram - Visual inspection of the shape; a histogram helps identify skewness and outliers (e.g., a long right tail indicating positive skew).
8) Graphical methods (8a and 8b): - 8a Normal probability plots (also called Q-Q plots):
- Data are plotted against a theoretical normal distribution; a straight diagonal line indicates approximate normality.
- Deviations from the line indicate departures from normality; outliers appear as points far from the line.
- Two examples may be shown to illustrate different data sets.
- 8b Detrended Q-Q plot (Detrended QQ plot):
- Plots the deviations from the diagonal line; helps visualize non-normality more clearly.
- If the data are perfectly normal, points would lie along a horizontal line; real data show deviations from this line.
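The quantities behind a Q-Q plot can be computed without drawing it; a sketch using `scipy.stats.probplot` (assuming SciPy is available), where an r value close to 1 means the points hug the diagonal:

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)
data = rng.normal(size=100)

# probplot pairs observed quantiles with theoretical normal quantiles and
# fits a least-squares line; r near 1 indicates approximate normality.
(theoretical_q, observed_q), (slope, intercept, r) = probplot(data, dist="norm")
print(f"r = {r:.3f}")

# Detrended view (item 8b): deviations of the observed quantiles from the line.
deviations = observed_q - (slope * theoretical_q + intercept)
print(f"max |deviation| = {np.abs(deviations).max():.3f}")
```

Passing `plot=plt` (with matplotlib) would draw the same information as the SPSS normal probability plot.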
- Practical note on interpretation:
- Do not decide normality based on a single test or plot; rely on all eight checks together and provide an overall conclusion.
- In the pulse rate example (pulse rate before exam vs during exam):
- Pulse rate before exam showed positive skew based on multiple normality checks.
- Pulse rate during exam appeared more normally distributed.
- Example dataset interpretation (pulse rate before vs during exam):
- Conclusion example: the pulse rates before the exam are positively skewed in the population.
- Graphs and outputs mentioned:
- Histogram: visual check for skewness and outliers.
- Stem-and-leaf plot: preserves actual data values and shows distribution; useful for small-to-moderate samples and is often undervalued compared to histograms.
- Box plot: shows median, quartiles, and potential outliers; symmetric box and whiskers indicate normality patterns; visible outliers suggest deviations from symmetry.
- Normal probability plot (item 8a): assesses normality by comparing observed data to a theoretical normal distribution.
- Detrended Q-Q plot (item 8b): complements QQ plot by highlighting deviations from normality more clearly.
- Practical lab use:
- In Computer Lab 3, you will learn how to generate all these graphs and tables in SPSS.
- The graphs and brief interpretations will be required in the first assessment: Data analysis and research design evaluation 1.
Putting it all together: applying the concepts in assessment
- When you are asked to determine whether a dataset is normally distributed:
- Use all eight methods (8a/8b included) to form an overall judgment.
- Check for symmetry across the tools:
- mean ≈ median ≈ mode; skewness and kurtosis statistics (and their z-scores) near zero;
- SW test non-significant; histogram roughly bell-shaped;
- stem-and-leaf shows a realistic, unskewed pattern; box plot with symmetric whiskers around the median;
- normal probability plot points close to the diagonal; detrended QQ plot points scattered around the horizontal line.
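The numeric half of this judgment can be partially automated; a rough sketch (the function name and the mean-median tolerance of SD/4 are my own choices, not from the notes), assuming SciPy is available. The visual checks still need human eyes:

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

def normality_checklist(data, z_cut=1.96):
    """Automate the numeric normality checks; histograms, stem-and-leaf,
    box plots, and Q-Q plots must still be inspected visually."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    z_skew = skew(data) / np.sqrt(6 / n)       # approximate SE of skewness
    z_kurt = kurtosis(data) / np.sqrt(24 / n)  # approximate SE of kurtosis
    _, p = shapiro(data)
    return {
        "mean_median_close": abs(np.mean(data) - np.median(data)) < np.std(data) / 4,
        "skew_z_ok": abs(z_skew) <= z_cut,
        "kurt_z_ok": abs(z_kurt) <= z_cut,
        "shapiro_ok": p >= 0.05,
    }

rng = np.random.default_rng(1)
checks = normality_checklist(rng.normal(size=150))
print(checks)
```

As the notes stress, no single entry in the returned dictionary decides the question; the overall pattern across all checks (plus the plots) does.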
- Real-world takeaway:
- Normality is an assumption for many parametric tests (e.g., t-tests).
- If normality is violated, consider data transformation or nonparametric alternatives (nonparametric tests are covered later).
- Final reminder from the lecture:
- Do not rely on a single test to judge normality.
- The eight methods provide a robust toolkit for evaluating the shape of the data and the appropriateness of parametric tests.
- You will practice these techniques in SPSS in the lab and apply them in your first assessment.
Key formulas and concepts (summary)
- Probability basics:
- P(A) = favorable outcomes / total outcomes
- Normal distribution properties:
- Symmetric bell curve; mean = median = mode (in a perfect normal distribution).
- P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx, with total area under the curve equal to 1.
- 68-95-99.7 rule (for normal distribution):
- Within one SD: ≈68.3%; within two SDs: ≈95.4%; within three SDs: ≈99.7%.
- Standard normal distribution:
- Z ∼ N(0,1); z = (X − μ) / σ
- Z-scores: interpretation and cross-metric comparisons (e.g., cardiovascular metrics).
- Skewness and kurtosis: assess symmetry and tail behavior; z-scores for skewness/kurtosis help decide normality.
- Normality tests and plots (the eight tools): mean/median/mode; skewness statistic and z-score; kurtosis statistic and z-score; SW test; histogram; normal probability (Q-Q) plot; detrended QQ plot; supported by stem-and-leaf and box plots.
- Data handling notes:
- Transformation as a method to correct non-normality (not covered in this introductory unit).
- Parametric vs nonparametric tests: rely on normality for the former; nonparametric tests do not require such assumptions but may be less powerful.