Lecture Notes: Describing Data, Normal Distribution, and Box Plots

Raw Data, Frequency Distributions, and Descriptive Thinking

  • Lecturer: Ollie (office 151, building 41; open-door policy; contact via email).

  • Focus of lecture: turn raw data into a frequency distribution; understand data characteristics; interpret and graph data (histograms, bar charts); describe the shape of distributions; introduce shape parameters such as skewness and kurtosis; calculate and interpret measures of central tendency and spread (mean, variability measures such as range, variance, standard deviation); the central limit theorem; box plots and interpreting them by hand.

  • Practical framing: consider rugby season data as an example: tallying the number of tries scored per match. Each match yields a count of tries; possible counts per match range from 0 to 7. Data form: a sequence of 24 matches with a count of tries per match.

  • Key idea: start with simple counts (frequencies), then move to probabilities and descriptive summaries.

  • Example setup from rugby data:

    • Minimum score (tries in a match) = 0; maximum = 7.

    • Possible outcomes per match: 0, 1, 2, 3, 4, 5, 6, 7 tries.

    • Frequency table concept: count how many matches produced each outcome (e.g., 3 matches with 0 tries; 5 matches with 1 try; 4 matches with 7 tries; total matches = 24).

    • A frequency chart mirrors the same information in a graphical form (counts for each possible score).

  • Frequency charts vs. descriptive questions:

    • Frequency charts describe “what happened” (descriptive performance).

    • To make predictions, convert to probabilities (predictive statements).
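The frequency-table step can be sketched in Python. The 24 match scores below are hypothetical, chosen only to agree with the counts quoted above (3 matches with 0 tries, 5 with 1, 4 with 7); the actual season data are not listed in the notes.

```python
from collections import Counter

# Hypothetical try counts for 24 matches (illustrative; chosen to agree with
# the counts quoted in the notes: three 0-try, five 1-try, four 7-try matches).
tries = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3,
         3, 3, 4, 4, 4, 5, 5, 6, 7, 7, 7, 7]

# Frequency table: how many matches produced each possible score 0..7.
freq = {score: 0 for score in range(8)}
freq.update(Counter(tries))

for score, count in sorted(freq.items()):
    print(f"{score} tries: {count} match(es)")
```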

Probability, Frequency, and Cumulative Concepts

  • Probability definition used here:

    • For a discrete outcome, the probability is the frequency divided by the total number of outcomes:

    • $P(X = x) = \frac{n_x}{N}$, where $n_x$ is the number of observations with value $x$ and $N$ is the total number of observations.

  • Cumulative probability concept:

    • By converting the frequency table to probabilities and summing, you obtain a distribution that sums to 100% (total probability = 1 or 100%).

    • The summed probabilities across outcomes give a cumulative picture of likelihoods for different scores.
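The frequency-to-probability conversion is one division per outcome. The frequency table below is hypothetical (only the 0-, 1-, and 7-try counts are stated in the notes; the remaining counts are illustrative):

```python
# Hypothetical frequency table (only the 0-, 1-, and 7-try counts are stated
# in the notes; the remaining counts are illustrative).
freq = {0: 3, 1: 5, 2: 2, 3: 4, 4: 3, 5: 2, 6: 1, 7: 4}
N = sum(freq.values())                      # total matches = 24

# P(X = x) = n_x / N for each possible score.
prob = {x: n / N for x, n in freq.items()}

# Cumulative probabilities P(X <= x); the final entry must reach 1 (100%).
cum, running = {}, 0.0
for x in sorted(prob):
    running += prob[x]
    cum[x] = running
```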

  • Linking to averages and forecasting:

    • With a probability distribution, you can ask: on average, how many tries will we score across the next $k$ matches (e.g., next 7 matches).

    • The term "average" introduces a central tendency concept; note that averages can vary across seasons due to random factors, changes in players, luck, etc. This leads to the idea of a sampling distribution of the mean.

Averages, Season-to-Season Variability, and the Sampling Distribution

  • Sample mean for a season of 24 matches:

    • The example gives an observed average of approximately $\bar{x} = 3.42$ tries per match for the 24 matches in the season.

  • Variability in averages:

    • The same game played in a different season would likely yield a different average due to random variation and changing conditions.

    • The concept of random variation leading to wiggle in the observed average is introduced as the need to model variability across samples.

  • Sampling distribution of the mean (concept):

    • To understand how much the observed average might wobble due to randomness, imagine shuffling the existing outcomes and recomputing the mean many times (a simulation).

    • Process described: start with the original data; shuffle the results at random (a permutation); compute the new average; repeat ~1000 times; collect all the new averages.

    • Plot: a distribution of these new averages forms the sampling distribution of the mean.

    • The running mean across the simulation is indicated as a red line; the average of all the new means (the mean of the sampling distribution) is shown as a green line.

  • Purpose of the simulation:

    • Shows how the mean could vary under random reordering of the same data, helping to understand the variability of the sample mean in future seasons.

    • With many repetitions (e.g., 1000), a bell-shaped distribution tends to emerge – the normal distribution – under fairly broad conditions (Central Limit Theorem).

  • Important caveat about the interpretation:

    • The average from the historical data is not a guaranteed constant; randomness means the observed central tendency can shift from season to season.

    • The sampling distribution provides a framework to quantify this wobble and to judge how surprising a future observation would be under random variation.
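The simulation can be sketched as below. One caveat: a pure permutation of the same 24 values leaves their mean unchanged, so the demonstration presumably resampled with replacement (a bootstrap); that assumption is made here, and the try counts are hypothetical.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Hypothetical season of 24 try counts (the real data are not listed).
tries = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3,
         3, 3, 4, 4, 4, 5, 5, 6, 7, 7, 7, 7]

# Resample the season with replacement ~1000 times, recording each new mean.
# (A pure shuffle would leave the mean unchanged, so resampling with
# replacement is assumed to be what the demonstration actually did.)
means = [statistics.mean(random.choices(tries, k=len(tries)))
         for _ in range(1000)]

# `means` is the (bootstrap) sampling distribution of the mean; its centre
# sits close to the observed season average.
print(round(statistics.mean(tries), 2))   # observed season mean
print(round(statistics.mean(means), 2))   # centre of the sampling distribution
```

Plotting a histogram of `means` gives the bell-shaped picture described above, with the spread of the histogram quantifying how much the average could wobble.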

The Central Limit Theorem and the Normal Distribution (Bell Curve)

  • Central idea: when randomness is present in a sufficiently large system, the distribution of the sample mean approaches a bell-shaped (normal) distribution.

  • Demonstration description:

    • Repeatedly shuffling (randomizing) the data and computing the mean produces a distribution of means that resembles a normal curve.

    • This is called the sampling distribution of the mean.

    • As the number of simulations grows (e.g., ~1000), the resulting distribution stabilizes, illustrating the normal distribution phenomenon.

  • The bell curve is a fundamental shape in statistics for random processes; when data are truly random, the bell shape tends to appear.

  • Visual/contextual example:

    • The bell curve is used to assess how surprising a result is by comparing it to what would be expected under random chance alone.

    • The idea that the normal curve represents a predictable pattern for many natural phenomena is a key theme in statistics.
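A minimal CLT demonstration independent of the rugby data: sample means drawn from a clearly non-normal parent (a fair die, uniform on 1..6) still pile up in a bell shape around the parent mean of 3.5. The sample size and number of repetitions below are arbitrary choices.

```python
import random
import statistics

random.seed(7)  # reproducible illustration

# 1000 samples of 30 die rolls each; each sample contributes one mean.
sample_means = [statistics.mean(random.randint(1, 6) for _ in range(30))
                for _ in range(1000)]

# The individual rolls are uniformly distributed, yet the sample means
# cluster symmetrically around 3.5 in an approximately normal pattern.
```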

Properties of the Normal Distribution (as introduced)

  • Symmetry:

    • The normal curve is symmetric about its center; folding it in half at the mean makes the two sides align exactly.

  • Modality:

    • It is unimodal; the mode (most common value) aligns with the mean in a symmetric normal distribution.

  • Skew (positive/negative):

    • Skewness describes asymmetry; negative skew means the tail extends to the left; positive skew means the tail extends to the right.

    • In a perfectly normal distribution, skewness is zero and the left/right sides mirror each other.

  • Kurtosis:

    • Describes peakedness; leptokurtic means the distribution is more peaked than normal; platykurtic means flatter; mesokurtic means roughly normal (normal kurtosis).

    • The terms used: leptokurtic (peaked), platykurtic (flat), mesokurtic (normal-like).

  • Notation used in the lecture:

    • Skewness can be positive or negative; the sign indicates the tail direction.

    • Kurtosis is discussed qualitatively with terms leptokurtic, platykurtic, mesokurtic rather than a single numeric value here.
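Although the lecture treats skewness qualitatively, the sign convention can be checked numerically. This sketch uses the population skewness (the mean cubed z-score) on made-up data; the lecture does not give this formula, so it is an illustration only.

```python
import statistics

def skewness(data):
    """Population skewness: the mean cubed z-score. Negative = tail to the
    left, positive = tail to the right, ~0 for a symmetric distribution."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 3 for x in data) / len(data)

# A right-tailed (positively skewed) sample and its mirror image:
right_tailed = [1, 2, 2, 3, 3, 3, 10]
left_tailed = [-x for x in right_tailed]
```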

  • Interactive exploration (tool):

    • A tool allows you to adjust distribution features (e.g., mode, skew, kurtosis) and visualize a histogram/bar chart with an overlay of the median (green line) and other features.

    • You can move between tri-modal, bimodal, and unimodal shapes and observe how skew and kurtosis affect the graph.

    • The median line is shown and can be reset; the tool helps illustrate how skew and kurtosis move the mean relative to the median.

  • Practical note:

    • The center of a skewed distribution is often better described by the median than by the mean (e.g., median house price in highly-skewed price data).

    • Example explanation: in highly skewed distributions (e.g., very high outliers), the mean can be pulled away from the bulk of the data; the median remains a better central tendency measure.
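The house-price point can be made concrete with made-up numbers (in $1000s): one extreme listing drags the mean far above the typical price, while the median stays with the bulk of the data.

```python
import statistics

# Hypothetical house prices in $1000s; the last value is an extreme outlier.
prices = [250, 260, 270, 280, 300, 310, 320, 5000]

mean_price = statistics.mean(prices)      # pulled toward the outlier
median_price = statistics.median(prices)  # stays with the bulk of the data

print(mean_price, median_price)  # the mean is roughly triple the median
```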

Measures of Spread and Central Tendency

  • Central tendency measures introduced/used:

    • Mean:

    • The average value of the data set.

    • Median: the middle value when data are ordered; in even-sized samples, the median is the average of the two central values.

    • Mode: the most frequent value in the data set.

  • Spread (variability) measures introduced:

    • Range: max minus min (overall span).

    • Variance: a measure of how spread out the data are around the mean.

    • Standard deviation: the square root of the variance; a direct, interpretable scale of dispersion around the mean.
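All six measures can be read off a small made-up sample with Python's standard `statistics` module; `pvariance` and `pstdev` divide by $N$, matching the definitional forms used in the lecture.

```python
import statistics

data = [0, 1, 1, 2, 3, 3, 3, 7]  # small illustrative sample (N = 8)

mean = statistics.mean(data)           # sum of values / N       -> 2.5
median = statistics.median(data)       # middle of ordered data  -> 2.5
mode = statistics.mode(data)           # most frequent value     -> 3
data_range = max(data) - min(data)     # max minus min           -> 7
variance = statistics.pvariance(data)  # divides by N            -> 4.0
sd = statistics.pstdev(data)           # sqrt(variance)          -> 2.0
```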

  • Definitional vs computational forms of variance and standard deviation:

    • Variance (definitional form):

    • $\text{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$

    • where $\bar{x}$ is the sample mean.

    • Standard deviation (definitional form):

    • $\text{SD}(X) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$

    • The lecture emphasizes a computational form to avoid rounding and complexity when calculating by hand:

    • $\text{Var}(X) = \frac{1}{N} \left( \sum_{i=1}^{N} x_i^2 - \frac{\left( \sum_{i=1}^{N} x_i \right)^2}{N} \right)$

    • Then,

    • $\text{SD}(X) = \sqrt{\text{Var}(X)}$

  • Practical calculation steps (as described):

    • Step 1: compute the sums needed ($\sum_i x_i$ and $\sum_i x_i^2$) and the sample size $N$.

    • Step 2: plug into the chosen formula; separate components before plugging into the full equation to avoid mistakes.

    • Step 3: take the square root if you are computing SD from the variance formula.

  • Note on usage:

    • The definitional form can be conceptually helpful, but the computational form is typically used in practice for efficiency and accuracy.
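The three calculation steps map directly to code, using a small made-up sample:

```python
import math

data = [0, 1, 1, 2, 3, 3, 3, 7]  # small made-up sample
N = len(data)

# Step 1: the two sums the computational form needs.
sum_x = sum(data)                  # sum of x_i    -> 20
sum_x2 = sum(x * x for x in data)  # sum of x_i^2  -> 82

# Step 2: Var(X) = (1/N) * (sum of x_i^2 - (sum of x_i)^2 / N).
variance = (sum_x2 - sum_x ** 2 / N) / N   # (82 - 400/8) / 8 = 4.0

# Step 3: square-root the variance to get the standard deviation.
sd = math.sqrt(variance)                   # 2.0
```

Keeping `sum_x` and `sum_x2` as separate named components before plugging them into the formula is exactly the "separate components first" advice from Step 2.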

Box Plots, Quartiles, and the Five-Number Summary

  • Five-number summary components:

    • Minimum, Q1 (first quartile), Median (Q2), Q3 (third quartile), Maximum.

  • Quartiles and percentiles:

    • Quartile 1 (Q1) corresponds to the 25th percentile.

    • Quartile 2 (Q2) is the 50th percentile (the median).

    • Quartile 3 (Q3) corresponds to the 75th percentile.

  • How Q1 and Q3 are found (ordering and splitting):

    • Order data from smallest to largest.

    • Find the median (Q2); split data into lower and upper halves.

    • Compute the medians of each half to obtain Q1 (lower half) and Q3 (upper half).

    • For an even number of data points, the exact middle lies between two values; Q2 is then the average of those two central values.
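The ordering-and-halving procedure can be written out directly. Conventions for quartiles differ when $N$ is odd; this sketch assumes the middle value is excluded from both halves, which may or may not match the lecture's convention.

```python
def median_of(ordered):
    """Median of an already-sorted list; with an even length it is the
    average of the two central values."""
    n, mid = len(ordered), len(ordered) // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method: sort, find the median,
    then take the medians of the lower and upper halves. (With odd N the
    middle value is excluded from both halves; conventions vary.)"""
    s = sorted(values)
    n = len(s)
    q2 = median_of(s)
    q1 = median_of(s[: n // 2])        # lower half
    q3 = median_of(s[(n + 1) // 2:])   # upper half
    return q1, q2, q3
```

For the ordered data 1..8 this gives Q1 = 2.5, Q2 = 4.5, Q3 = 6.5.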

  • Example (from rugby data):

    • 25th percentile (Q1) around a score of 1; Median (Q2) around 3; 75th percentile (Q3) around 5.5.

  • Interquartile Range (IQR):

    • $\text{IQR} = Q_3 - Q_1$

    • Describes the middle 50% of the data, ignoring the extremes.

  • Box plots and their anatomy:

    • The box spans from Q1 to Q3; a line inside marks the median (Q2).

    • Hinges: alternative names for Q1 and Q3.

    • Whiskers extend from the box to values that are not outliers; they do not necessarily reach min and max.

    • Outliers: points that lie beyond the fences, often shown as hollow circles or asterisks; fences defined as:

    • Lower fence: $Q_1 - 1.5 \times \text{IQR}$

    • Upper fence: $Q_3 + 1.5 \times \text{IQR}$

    • Near vs far outliers: far outliers are sometimes defined as lying beyond 3 × IQR from the quartiles (rather than 1.5 × IQR).

    • Adjacent values: the most extreme data values that still lie within the fences; the whiskers extend to them under the 1.5 × IQR rule.
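Fences and outlier flagging in code (quartiles again via the median-of-halves method; quartile conventions vary, so fence positions can differ slightly between tools):

```python
def median_of(ordered):
    """Median of an already-sorted list."""
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def fences_and_outliers(values):
    """Return the 1.5*IQR fences and the points beyond them (the outliers)."""
    s = sorted(values)
    n = len(s)
    q1 = median_of(s[: n // 2])        # median of the lower half
    q3 = median_of(s[(n + 1) // 2:])   # median of the upper half
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper, [x for x in s if x < lower or x > upper]
```

For `[1, 2, 3, 4, 5, 6, 7, 100]` the fences come out at -3.5 and 12.5, so only 100 is flagged as an outlier.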

  • Box plots as a communicative tool:

    • Box plots convey central tendency, spread, skew, and presence of outliers in one image.

  • Practice tool for box plots:

    • A dedicated tool is provided to generate data sets, sort, and analyze them, showing the minimum, maximum, median, IQR, and where outliers lie; use it to practice and become proficient.

Skewness, Kurtosis, and Data Shape Interpretation

  • Skewness:

    • Describes asymmetry of the distribution around the mean.

    • Negative skew: tail to the left; Positive skew: tail to the right.

    • In the lecture, skew direction is indicated with a directional description rather than purely left/right terminology; the idea is to connect skew to the visual tilt of the distribution.

  • Kurtosis:

    • Leptokurtic: highly peaked distribution (many values near the mean).

    • Platykurtic: flat distribution (more spread out around the mean).

    • Mesokurtic: roughly normal (moderate peakedness).

  • Median versus mean under skew:

    • In skewed distributions, the median is often a better measure of central tendency than the mean because it is less influenced by extreme values.

  • Interactive exploration (visual intuition):

    • The lecture mentions a tool that lets you adjust skew and kurtosis and observe how the histogram/bar chart and the green median line respond.

  • Practical example relating to real-world data:

    • The speaker uses house prices to illustrate why the median can be more informative than the mean in skewed distributions (e.g., very high outliers pulling the mean upward).

Quartiles, Percentiles, and the 5-Number Summary (Further Details)

  • Quartiles and percentiles explained with ordered data:

    • Order data from smallest to largest, then locate quartiles by halving data segments.

    • If the exact middle falls between two data points (even N), take the average of the two central values to obtain the median; apply the same principle recursively for Q1 and Q3.

  • Interpreting the 25th, 50th, and 75th percentiles:

    • 25th percentile (Q1): value below which 25% of data fall.

    • 50th percentile (Median): value below which 50% of data fall.

    • 75th percentile (Q3): value below which 75% of data fall.

  • Using quartiles to describe data spread:

    • The interquartile range (IQR) captures where the bulk (middle 50%) of data lie and is robust to outliers.

  • The relationship to the five-number summary and box plots:

    • The five-number summary feeds directly into the construction of a box plot.

    • The box plot visually encodes: min, Q1, median, Q3, max; with whiskers and potential outliers.

Important Worldview Notes and Practical Takeaways

  • Descriptive vs. inferential statistics:

    • Descriptive statistics summarize data (counts, proportions, means, spreads, shapes).

    • Inferential statistics (to be discussed next week) evaluate how surprising or significant observed patterns are, often via P-values and hypothesis testing.

  • The 68-95-99.7 rule (empirical rule) for normal distributions:

    • About 68% of data fall within 1 standard deviation of the mean.

    • About 95% fall within 2 standard deviations.

    • About 99.7% fall within 3 standard deviations.
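The empirical rule can be checked by simulation; the sample size and seed below are arbitrary choices.

```python
import random
import statistics

random.seed(3)  # reproducible illustration

# Draw a large sample from a standard normal distribution.
sample = [random.gauss(0, 1) for _ in range(100_000)]
mu = statistics.mean(sample)
sigma = statistics.pstdev(sample)

def within(k):
    """Proportion of the sample within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in sample) / len(sample)

# within(1), within(2), within(3) land near 0.68, 0.95, and 0.997.
```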

  • P-values (hint for next week):

    • P-values quantify how surprising data are under a null hypothesis; they will be introduced after covering bell curves and the normal distribution in more depth.

  • Real-world relevance and ethical considerations:

    • Understanding data shape, central tendency, and spread informs decision-making in research, business, and policy.

    • You should consider how outliers and skew can distort interpretations and choose appropriate summary measures (e.g., mean vs. median).

Quick Notation Recap (Key Formulas to Remember)

  • Probability of an outcome:

    • $P(X = x) = \frac{n_x}{N}$

  • Mean (sample):

    • $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ (the sum of the observations divided by $N$).

  • Variance (definitional):

    • $\text{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$

    • (Conceptually: the average squared distance from the mean.)

  • Variance (computational form):

    • $\text{Var}(X) = \frac{1}{N} \left( \sum_{i=1}^{N} x_i^2 - \frac{\left( \sum_{i=1}^{N} x_i \right)^2}{N} \right)$

  • Standard deviation:

    • $\text{SD}(X) = \sqrt{\text{Var}(X)}$

  • Interquartile range:

    • $\text{IQR} = Q_3 - Q_1$

  • Fences for outliers:

    • Lower fence: $Q_1 - 1.5 \times \text{IQR}$

    • Upper fence: $Q_3 + 1.5 \times \text{IQR}$

  • Adjacents and outliers:

    • Adjacent values: the most extreme data values inside the fences; the whiskers extend to them. Outliers lie beyond the fences and are often shown as hollow circles or asterisks.

  • Box plot components (five-number summary):

    • Minimum, Q1, Median, Q3, Maximum; hinges and whiskers; box spans Q1–Q3; median line inside; outliers beyond fences.

  • Normal distribution and the empirical rule:

    • Within 1 SD: ~68% of data; within 2 SD: ~95%; within 3 SD: ~99.7%.

// End of notes