Understanding Data – Comprehensive Study Notes
Learning objectives (Today's Learning Objectives)
Organise raw data into a frequency distribution table.
Interpret and create graphical representations of data, such as histograms and bar charts.
Describe the shape of a distribution (e.g., skew and kurtosis).
Calculate and interpret measures of central tendency (mean, median, mode).
Calculate and interpret measures of variability (range, variance, standard deviation).
Explain and apply the central limit theorem, showing how sampling distributions of the mean tend towards normality as sample size increases.
Construct and interpret boxplots by hand.
Real-world framing (First Principles Counting Events)
View data collection as simple records of each occurrence of an event (e.g., rugby tries).
Example scenario: tallying tries in a rugby season across 24 games; per-game tries belong to the set {0,1,2,3,4,5,6,7}.
Frequency table example: number of matches that scored a specific number of tries (e.g., 0, 1, 2, …, 7) across 24 matches.
This forms the basis for constructing a frequency chart and then moving to probabilities.
From frequencies to probabilities
A frequency chart answers descriptive questions: how many matches produced each score.
To make predictive statements, convert counts to probabilities by dividing by the total number of matches.
Example: If 3 matches scored 0 tries out of 24, the probability is P(X=0)=\frac{3}{24}=0.125=12.5\% (and similarly for other scores).
Summary form: probability for a score s is P(X=s)=\frac{\text{count of matches with score } s}{\text{total matches}}.
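The counts-to-probabilities conversion can be sketched in Python, using the 24 rugby scores listed later in these notes:

```python
from collections import Counter

# Per-match try counts for the 24-game season (data from these notes).
scores = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3]

counts = Counter(scores)                  # frequency table: score -> number of matches
total = len(scores)                       # 24 matches
probs = {s: counts[s] / total for s in sorted(counts)}  # P(X = s) for each score s

print(probs[0])  # 3 matches scored 0 tries: 3/24 = 0.125, i.e. 12.5%
```

The probabilities sum to 1 by construction, which is the predictive counterpart of the descriptive frequency table.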
Shifting from a single game to a run of games: the sample mean
Definition of sample mean: \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i, i.e., the sum of the observed scores divided by the number of observations.
Example: for eight matches with scores 3, 5, 1, 4, 2, 3, 2, 4 the sum is 24, so the average is \bar{X}=\frac{24}{8}=3.0.
Issue: averages vary from season to season due to random factors (weather, injuries, referee decisions, opposition form, luck).
Conclusion: averages vary across seasons; randomness prevents exact replication of a season’s average.
Building the sampling distribution of the mean (thousands of “What‑If” seasons)
Concept: imagine many possible seasons by resampling the actual season's scores (drawing with replacement) to simulate new seasons; note that a plain reshuffle of all the scores would leave the season average unchanged.
Process: resample the scores, compute a new season average, and repeat thousands of times.
The collection of these averages forms the sampling distribution of the mean.
The mean of all these simulated averages approximates the population mean, denoted by \mu.
Visualization: plot the distribution of these averages; with enough repetitions, the distribution tends to a bell-shaped curve (normal distribution).
Key insight: randomness in data generation tends to produce a normal-shaped sampling distribution for the mean.
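The "thousands of what-if seasons" idea can be sketched in Python. This draws each simulated season with replacement (bootstrap-style), since resampling, rather than a pure shuffle, is what produces season-to-season variation in the average:

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the simulation is reproducible
scores = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3]

# Simulate 5000 "what-if" seasons: draw 24 scores with replacement each time
# and record that season's average.
season_means = [mean(random.choices(scores, k=len(scores))) for _ in range(5000)]

# The simulated averages cluster tightly around the original mean (~3.42),
# and their spread is much smaller than the spread of individual scores.
print(round(mean(season_means), 2))
print(round(stdev(season_means), 2))
```

Plotting a histogram of `season_means` would show the bell shape the notes describe: the sampling distribution of the mean tends toward normality even though the raw scores are not normal.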
Normal bell curves and their properties
The normal distribution (bell curve) arises when randomness is added to data.
Bell curves appear when averaging independent, mildly random events.
Notation:
Theoretical (ideal) normal curve is used for comparison with actual data.
Actual distribution is shown by a dashed grey line; the theoretical curve is shown by a solid red area in demonstrations.
Interactive tools illustrate how changing a dataset's symmetry (skew) and kurtosis shifts the mean, median, and mode.
Bell Curve properties: symmetry, modality, skew, and kurtosis
Normal distributions: a subset of bell curves with specific characteristics:
Symmetry: the curve is perfectly balanced; folding in half yields identical sides. (Skew = 0)
Modality: only one hump (single mode).
Skew: no leaning; equal tail lengths; mean = median = mode in the center for a perfect normal.
Kurtosis: neither too pointy nor too flat; moderate peak and tails.
Skewness (asymmetry)
Positive skew (right-skewed): right tail longer; most scores cluster at lower values; mean > median.
Negative skew (left-skewed): left tail longer; most scores cluster at higher values; mean < median.
Kurtosis (peakedness)
Leptokurtic: sharp peak and fatter tails; extreme values more likely.
Platykurtic: flatter peak and thinner tails; extreme values less likely.
Mesokurtic: normal level of peakedness.
Distribution Shaper and interactive exploration (conceptual)
Distribution Shaper lets you manipulate a dataset and observe changes in:
Mean, Median, and Mode positions as the data reshapes.
Target Skew (γ) and Target Excess Kurtosis (δ) sliders to explore symmetry and peakedness effects.
Theoretical Bell Curve vs. Actual Distribution: compare a theoretical normal curve to your data’s actual shape.
Practical note: interactive tools can illustrate how, for example, bimodal distributions influence where the bell curve would lie on average.
Important link: distribution exploration tools and demonstrations (links provided in lecture materials).
Standard deviation and the idea of spread around the mean
Purpose: measure how spread out data are around the mean, not just the center.
Intuition: a larger standard deviation means data are more spread out; a smaller one means data are clustered near the mean.
The example demonstrates the concept visually by selecting a random score x and comparing its distance to the mean μ, using a tape-measure analogy.
Step-by-step intuition for standard deviation (illustrative demonstration)
Step 1: select a random score x on the bell curve.
Step 2: measure the horizontal distance from x to the mean μ.
Step 3: note that for a symmetric distribution the mean \mu sits at the center of the curve, so this distance is measured from the peak.
Step 4: compute the difference x - \mu (the sign matters for a moment, but we account for all data points).
Step 5: to avoid cancellation of positive and negative differences, square each difference before combining: (x_i - \mu)^2.
Step 6: to move from a total squared distance to an average squared distance, divide by the number of values N; this yields the variance \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, not yet the standard deviation.
Step 7: the final step is taking the square root of the variance to obtain the standard deviation: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}.
Computational form (often used in practice): rearrange the computation as \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2}.
Note on forms: the “definitional form” emphasizes the sum of squared deviations; the “computational form” uses sums of squares and the mean to simplify calculations.
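The equivalence of the two forms can be checked numerically with a small sketch (the helper names `sd_definitional` and `sd_computational` are illustrative, not from the slides):

```python
from math import sqrt

def sd_definitional(xs):
    """Definitional form: sqrt of the mean squared deviation from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sqrt(sum((x - mu) ** 2 for x in xs) / n)

def sd_computational(xs):
    """Computational form: sqrt of (mean of squares minus square of the mean)."""
    n = len(xs)
    return sqrt(sum(x * x for x in xs) / n - (sum(xs) / n) ** 2)

xs = [4, 3, 5, 1, 6, 2]
# The two forms are algebraically identical, so they agree to rounding error.
print(abs(sd_definitional(xs) - sd_computational(xs)) < 1e-9)
```

The computational form is convenient for hand calculation because it only needs the running totals \sum x_i and \sum x_i^2.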
Rugby data example: computing μ and σ for 24 matches
Data: Tries scored per match (24 values):
{4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3}
Totals:
Sum of scores: \sum x_i = 82
Sum of squares: \sum x_i^2 = 420
N = 24
Mean (population mean):
\mu = \frac{\sum x_i}{N} = \frac{82}{24} \approx 3.4167.
Variance (population):
\sigma^2 = \frac{\sum x_i^2}{N} - \mu^2 = \frac{420}{24} - (3.4167)^2 \approx 5.8264.
Standard deviation:
\sigma = \sqrt{\sigma^2} = \sqrt{5.8264} \approx 2.414.
In the slides, these numbers are shown as components for demonstrating the computation, with the final result approximately 2.41.
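A quick Python check of the slide numbers, using the computational form on the 24-match data:

```python
from math import sqrt

scores = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3]
n = len(scores)                    # N = 24
s1 = sum(scores)                   # sum of scores = 82
s2 = sum(x * x for x in scores)    # sum of squares = 420
mu = s1 / n                        # population mean, about 3.4167
sigma = sqrt(s2 / n - mu ** 2)     # computational form, about 2.414
print(round(mu, 4), round(sigma, 3))  # 3.4167 2.414
```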
Probability and the area under the curve
Frequencies convert to probabilities: in a relative-frequency histogram, the area of each bar equals the probability of that outcome.
For the rugby data, probabilities for each score (0,1,2,3,4,5,6,7) are the corresponding counts divided by 24, e.g., P(X=0)=12.5%, P(X=1)=20.8%, etc.
The histogram is a discrete representation; the bell curve is its continuous counterpart in the normal-model context.
The key link: the area under the curve (the integral) equals probability; the sum of bar areas equals 1 (or 100%).
The 68–95–99.7 rule (empirical rule)
If data are normal, then approximately:
68% lie within one standard deviation of the mean: P(|X-\mu|\leq \sigma) \approx 0.68.
95% lie within two standard deviations: P(|X-\mu|\leq 2\sigma) \approx 0.95.
99.7% lie within three standard deviations: P(|X-\mu|\leq 3\sigma) \approx 0.997.
This provides a practical ruler for judging how surprising a result is based on its distance from the mean in terms of standard deviations.
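The empirical rule can be verified by simulation. This sketch draws standard-normal values with `random.gauss` and counts how many fall within 1, 2, and 3 standard deviations of the mean:

```python
import random

random.seed(0)  # fixed seed for reproducibility
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # standard normal: mu=0, sigma=1

for k in (1, 2, 3):
    within = sum(abs(x) <= k for x in xs) / len(xs)
    print(k, round(within, 3))  # roughly 0.68, 0.95, 0.997
```

The proportions approach 68%, 95%, and 99.7% as the sample grows, matching the rule; for non-normal data these percentages no longer hold.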
Quartiles, percentiles, and the five-number summary
Quartiles divide data into four equal parts:
Q1: lower quartile (25th percentile)
Q2: median (50th percentile)
Q3: upper quartile (75th percentile)
Five-number summary: min, Q1, Q2 (median), Q3, max.
The slides provide a concrete example for 24 rugby scores: Q1 = 1, Q3 = 5.5, giving an IQR of 4.5 (Q3 − Q1).
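The slide values can be reproduced with the median-of-halves convention (other software may use slightly different quartile rules, so results can vary by a fraction):

```python
from statistics import median

scores = sorted([4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3])

half = len(scores) // 2        # 12 values in each half for N = 24
q1 = median(scores[:half])     # median of the lower half -> Q1
q2 = median(scores)            # overall median -> Q2
q3 = median(scores[-half:])    # median of the upper half -> Q3
five_num = (min(scores), q1, q2, q3, max(scores))
print(five_num)  # min=0, Q1=1, median=3, Q3=5.5, max=7
```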
Interquartile range (IQR) and fences
IQR measures the spread of the middle 50% of the data:
IQR = Q3 - Q1 = 4.5.
Fences define boundaries for detecting outliers:
Inner fences:
Lower inner fence = Q1 - 1.5 \times IQR
Upper inner fence = Q3 + 1.5 \times IQR
Outer fences:
Lower outer fence = Q1 - 3 \times IQR
Upper outer fence = Q3 + 3 \times IQR
Adjacent values: the smallest data point at least as large as the lower inner fence, and the largest data point at most as large as the upper inner fence.
Whiskers extend to the furthest adjacent values; outliers are plotted beyond the inner fences.
In the rugby example, with Q1 = 1, Q3 = 5.5, IQR = 4.5:
Lower inner fence = 1 - 1.5 \times 4.5 = -5.75
Upper inner fence = 5.5 + 1.5 \times 4.5 = 12.25
All data lie within the inner fences, so there are no outliers.
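The fence and adjacent-value calculations for the rugby example can be sketched as:

```python
scores = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3]
q1, q3 = 1.0, 5.5                                  # quartiles from the notes
iqr = q3 - q1                                      # 4.5
lower_inner = q1 - 1.5 * iqr                       # -5.75
upper_inner = q3 + 1.5 * iqr                       # 12.25

# Outliers fall beyond the inner fences; here there are none.
outliers = [x for x in scores if x < lower_inner or x > upper_inner]

# Whiskers reach the adjacent values: the most extreme points inside the fences.
low_adj = min(x for x in scores if x >= lower_inner)
high_adj = max(x for x in scores if x <= upper_inner)
print(iqr, outliers, low_adj, high_adj)  # 4.5 [] 0 7
```

Since every score lies between -5.75 and 12.25, the whiskers simply run to the data minimum (0) and maximum (7).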
Boxplots: construction and interpretation
Boxplot purpose: visually summarize distribution symmetry, spread, and central tendency.
Five-number summary anchors the box and whiskers:
Box represents the IQR (middle 50%), spanning from Q1 to Q3.
Median (Q2) is a line inside the box.
Whiskers extend to the most extreme data points within the inner fences (adjacent values).
Outliers are plotted individually using distinct symbols (e.g., hollow circle for mild outliers, asterisk or filled symbol for extreme outliers).
Outer fences separate mild from extreme outliers.
Boxplots are linked conceptually to histograms and normal distributions for assessing symmetry and spread.
Practical assessment: compare your boxplot to a symmetric, bell-curved distribution to gauge skewness and tail behavior.
Boxplots versus histograms and their link to bell curves
Boxplots are a compact summary focusing on center, spread, and tails; histograms display frequency distribution across values.
A boxplot corresponds to a symmetric normal distribution benchmark: the central box contains the densest part of the data; whiskers cover the remainder.
In normally distributed data, over 99% of values fall within the inner fences (1.5×IQR beyond the quartiles), so points beyond the whiskers are rare.
Skewness in a boxplot appears as an asymmetrical box or whiskers.
Practical considerations and study tips
Understand when to use descriptive charts (histograms, frequency tables) vs when to infer probabilities (probability distributions, sampling distributions).
Remember the two forms for standard deviation computations and know how to switch between them for manual calculation vs using summaries:
Population form: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2}.
Computational form: \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}.
For boxplots, always interpret outliers in context; an absence of outliers does not imply perfect symmetry, and the fences are a methodological rule, not an absolute boundary.
Real-world relevance: Central limit theorem justifies using normal approximations for sampling distributions of the mean in many practical settings (e.g., psychology, economics, experimental data).
Ethical and practical implications: relying on the normal approximation requires awareness of data distribution; violations (heavy tails, strong skew) can lead to misinterpretation of p-values and confidence intervals.
Key equations to remember (LaTeX)
Population mean: \mu = \frac{1}{N}\sum_{i=1}^{N} x_i
Population variance: \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2
Population standard deviation (computational form): \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}
Sample mean (for a sample of size n): \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i
IQR: IQR = Q3 - Q1
Inner fences: \text{Lower inner fence} = Q1 - 1.5 \times IQR, \text{Upper inner fence} = Q3 + 1.5 \times IQR
Outer fences: \text{Lower outer fence} = Q1 - 3 \times IQR, \text{Upper outer fence} = Q3 + 3 \times IQR
68–95–99.7 rule: within 1\sigma, 2\sigma, 3\sigma from the mean correspond to roughly 68%, 95%, 99.7% of data respectively.
Quick takeaway
Descriptive stats (mean, median, mode; range, variance, standard deviation) describe data; sampling distributions and the central limit theorem justify normal models for averages as sample sizes grow.
Visual tools (histograms, boxplots) help assess distribution shape, skewness, outliers, and variability.
The five-number summary (min, Q1, median, Q3, max) and IQR are central to boxplots and outlier detection.
References to interactive tools and further exploration
Distribution Shaper and related interactive demonstrations mentioned in the slides illustrate how skewness and kurtosis affect the mean, median, and mode in real-time.
Online links included in lecture materials (e.g., gemini and related shared resources) provide hands-on practice with the boxplot construction and distribution visualization.
Summary slide reminders
Boxplots provide a compact summary of distribution shape, spread, and central tendency.
Boxplots are essential in psychology and other sciences for quickly assessing symmetry, skewness, and potential outliers.
Boxplots rely on the five-number summary and include whiskers, adjacent values, and outlier markers.
Final note on connections to earlier material and real-world relevance
Relationships among mean, median, mode become particularly clear in normal distributions; deviations reveal skewness/kurtosis and non-normality.
The central limit theorem underpins many statistical methods used in research and industry, reinforcing the practical value of understanding sampling distributions and standard deviation as a measure of spread.