Understanding Data – Comprehensive Study Notes

  • Learning objectives (Today's Learning Objectives)

    • Organise raw data into a frequency distribution table.

    • Interpret and create graphical representations of data, such as histograms and bar charts.

    • Describe the shape of a distribution (e.g., skew and kurtosis).

    • Calculate and interpret measures of central tendency (mean, median, mode).

    • Calculate and interpret measures of variability (range, variance, standard deviation).

    • Explain and apply the central limit theorem, showing how sampling distributions of the mean tend towards normality as sample size increases.

    • Construct and interpret boxplots by hand.

  • Real-world framing (First Principles Counting Events)

    • View data collection as simple records of each occurrence of an event (e.g., rugby tries).

    • Example scenario: tallying tries in a rugby season across 24 games; per-game tries belong to the set {0,1,2,3,4,5,6,7}.

    • Frequency table example: number of matches that scored a specific number of tries (e.g., 0, 1, 2, …, 7) across 24 matches.

    • This forms the basis for constructing a frequency chart and then moving to probabilities.

  • From frequencies to probabilities

    • A frequency chart answers descriptive questions: how many matches produced each score.

    • To make predictive statements, convert counts to probabilities by dividing by the total number of matches.

    • Example: If 3 matches scored 0 tries out of 24, the probability is P(X=0)=\frac{3}{24}=0.125=12.5\% (and similarly for other scores).

    • Summary form: probability for a score s is P(X=s)=\frac{\text{count of matches with score } s}{\text{total matches}}.
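The counts-to-probabilities conversion can be sketched in Python (a minimal sketch; the per-score tallies below are taken from the 24-match rugby dataset used in these notes):

```python
# Tally of tries per match across the 24-game season:
# score -> number of matches that finished with that many tries.
counts = {0: 3, 1: 5, 2: 1, 3: 4, 4: 2, 5: 3, 6: 2, 7: 4}

total = sum(counts.values())  # 24 matches in all
probs = {score: n / total for score, n in counts.items()}

print(probs[0])  # 3/24 = 0.125, i.e. P(X=0) = 12.5%
```

Dividing every count by the same total guarantees the probabilities sum to 1, which is the bridge to the "area under the curve equals probability" idea later in the notes.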

  • Shifting from a single game to a run of games: the sample mean

    • Definition of the sample mean: the sum of the observed scores divided by the number of matches, \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i.

    • Example: for eight matches with scores 3,5,1,4,2,3,2,4 the sum is 24, so the average is \bar{X}=\frac{24}{8}=3.0.

    • Issue: averages vary from season to season due to random factors (weather, injuries, referee decisions, opposition form, luck).

    • Conclusion: any single season's average is one realisation among many possible ones; randomness prevents exact replication of a season's average.

  • Building the sampling distribution of the mean (thousands of “What‑If” seasons)

    • Concept: imagine many possible seasons by resampling the actual season's scores (drawing 24 scores with replacement from the observed set) to simulate new seasons; merely reordering the same scores would leave the average unchanged.

    • Process: resample the scores, compute the new season's average, and repeat thousands of times.

    • The collection of these averages forms the sampling distribution of the mean.

    • The mean of all these simulated averages approximates the population mean, denoted by \mu.

    • Visualization: plot the distribution of these averages; with enough repetitions, the distribution tends to a bell-shaped curve (normal distribution).

    • Key insight: randomness in data generation tends to produce a normal-shaped sampling distribution for the mean.
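The resampling process above can be sketched as a short simulation (a sketch, assuming Python's standard library; the 24 scores are the rugby data from these notes):

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Observed season: tries per match across 24 games.
season = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
          5, 0, 1, 4, 5, 6, 7, 3]

# Simulate thousands of "what-if" seasons by drawing 24 scores with
# replacement from the observed season, recording each season's average.
averages = [
    statistics.mean(random.choices(season, k=len(season)))
    for _ in range(10_000)
]

# The simulated averages centre on the observed mean (about 3.42) and are
# far less spread out than the raw scores: the sampling distribution of
# the mean is narrower than the data distribution.
print(round(statistics.mean(averages), 2))
print(round(statistics.stdev(averages), 2))
```

Plotting a histogram of `averages` would show the bell shape the notes describe: the sampling distribution of the mean tends towards normality even though the raw scores are not normal.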

  • Normal bell curves and their properties

    • The normal distribution (bell curve) arises when randomness is added to data.

    • Bell curves appear when averaging independent, mildly random events.

    • Theoretical (ideal) normal curve is used for comparison with actual data.

    • Actual distribution is shown by a dashed grey line; the theoretical curve is shown by a solid red area in demonstrations.

    • Interactive tools illustrate how changing data toward symmetry and kurtosis shifts the mean, median, and mode.

  • Bell Curve properties: symmetry, modality, skew, and kurtosis

    • Normal distributions: a subset of bell curves with specific characteristics:

    • Symmetry: the curve is perfectly balanced; folding in half yields identical sides. (Skew = 0)

    • Modality: only one hump (single mode).

    • Skew: no leaning; equal tail lengths; mean = median = mode in the center for a perfect normal.

    • Kurtosis: neither too pointy nor too flat; moderate peak and tails.

    • Skewness (asymmetry)

    • Positive skew (right-skewed): right tail longer; most scores cluster at lower values; mean > median.

    • Negative skew (left-skewed): left tail longer; most scores cluster at higher values; mean < median.

    • Kurtosis (peakedness)

    • Leptokurtic: sharp peak and fatter tails; extreme values more likely.

    • Platykurtic: flatter peak and thinner tails; extreme values less likely.

    • Mesokurtic: normal level of peakedness.
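Skewness and excess kurtosis can be computed from the data's moments. A minimal sketch (assuming Python; the moment-based population formulas shown here are one standard convention, and `skewness_and_excess_kurtosis` is a hypothetical helper name):

```python
def skewness_and_excess_kurtosis(data):
    """Moment-based (population) skewness and excess kurtosis.

    Skewness is 0 for a perfectly symmetric distribution; excess kurtosis
    is 0 for a normal curve (mesokurtic), negative for platykurtic data,
    positive for leptokurtic data.
    """
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # variance
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

season = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
          5, 0, 1, 4, 5, 6, 7, 3]
skew, kurt = skewness_and_excess_kurtosis(season)
# The rugby scores are spread fairly evenly across 0-7, so the excess
# kurtosis comes out negative (platykurtic).
```

Feeding perfectly symmetric data into the function returns a skewness of exactly 0, matching the "folding in half" description above.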

  • Distribution Shaper and interactive exploration (conceptual)

    • Distribution Shaper lets you manipulate a dataset and observe changes in:

    • Mean, Median, and Mode positions as the data reshapes.

    • Target Skew (γ) and Target Excess Kurtosis (δ) sliders to explore symmetry and peakedness effects.

    • Theoretical Bell Curve vs. Actual Distribution: compare a theoretical normal curve to your data’s actual shape.

    • Practical note: interactive tools can illustrate how, for example, bimodal distributions influence where the bell curve would lie on average.

    • Important link: distribution exploration tools and demonstrations (links provided in lecture materials).

  • Standard deviation and the idea of spread around the mean

    • Purpose: measure how spread out data are around the mean, not just the center.

    • Intuition: a larger standard deviation means data are more spread out; a smaller one means data are clustered near the mean.

    • The example demonstrates the concept visually by selecting a random score x and comparing its distance to the mean μ, using a tape-measure analogy.

  • Step-by-step intuition for standard deviation (illustrative demonstration)

    • Step 1: select a random score x on the bell curve.

    • Step 2: measure the horizontal distance from x to the mean μ.

    • Step 3: note that for a symmetric distribution the mean \mu sits at the centre of the curve, so this distance shows how far the score lies from the typical value.

    • Step 4: compute the difference x − \mu; the sign records whether the score lies above or below the mean.

    • Step 5: to stop positive and negative differences cancelling when all data points are combined, square each difference: (x_i - \mu)^2.

    • Step 6: to move from a total squared distance to an average squared distance, divide the sum by the number of values N; this yields the variance, \text{Var} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \sigma^2, not yet the standard deviation.

    • Step 7: the final step is taking the square root of the variance to obtain the standard deviation: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}.

    • Computational form (often used in practice): rearrange the computation as \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2}.

    • Note on forms: the “definitional form” emphasizes the sum of squared deviations; the “computational form” uses sums of squares and the mean to simplify calculations.
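Both forms give the same answer, which is easy to confirm numerically. A minimal sketch (assuming Python; the function names are hypothetical labels for the two forms):

```python
import math

def sd_definitional(data):
    """Definitional form: root of the mean squared deviation from the mean."""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)

def sd_computational(data):
    """Computational form: mean of the squares minus the square of the mean."""
    n = len(data)
    return math.sqrt(sum(x * x for x in data) / n - (sum(data) / n) ** 2)

data = [4, 3, 5, 1, 6, 2, 1, 7]  # any list of scores works
print(sd_definitional(data), sd_computational(data))  # identical up to rounding
```

The computational form needs only running totals of x and x², which is why it suits hand calculation from a summary table; the definitional form requires a second pass after the mean is known.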

  • Rugby data example: computing μ and σ for 24 matches

    • Data: Tries scored per match (24 values):
      {4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3}

    • Totals:

    • Sum of scores: \sum x_i = 82

    • Sum of squares: \sum x_i^2 = 420

    • N = 24

    • Mean (population mean):
      \mu = \frac{\sum x_i}{N} = \frac{82}{24} \approx 3.4167.

    • Variance (population):
      \sigma^2 = \frac{\sum x_i^2}{N} - \mu^2 = \frac{420}{24} - (3.4167)^2 \approx 5.8264.

    • Standard deviation: \sigma = \sqrt{\sigma^2} = \sqrt{5.8264} \approx 2.414.

    • In the slides, these numbers are shown as components for demonstrating the computation, with the final result approximately 2.41.
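The whole computation can be reproduced directly from the 24 scores (a sketch, assuming Python; variable names are illustrative):

```python
import math

tries = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
         5, 0, 1, 4, 5, 6, 7, 3]

n = len(tries)                        # 24 matches
total = sum(tries)                    # sum of scores: 82
total_sq = sum(x * x for x in tries)  # sum of squares: 420

mu = total / n                        # population mean, ~3.4167
variance = total_sq / n - mu ** 2     # computational form, ~5.8264
sigma = math.sqrt(variance)           # standard deviation, ~2.414

print(round(mu, 4), round(variance, 4), round(sigma, 3))
```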

  • Probability and the area under the curve

    • Frequencies convert to probabilities: in a relative-frequency histogram, the area of each bar equals the probability of that outcome.

    • For the rugby data, probabilities for each score (0,1,2,3,4,5,6,7) are the corresponding counts divided by 24, e.g., P(X=0)=12.5%, P(X=1)=20.8%, etc.

    • The histogram is a discrete representation; the bell curve is its continuous counterpart in the normal-model context.

    • The key link: the area under the curve (the integral) equals probability; the sum of bar areas equals 1 (or 100%).

  • The 68–95–99.7 rule (empirical rule)

    • If data are normal, then approximately:

    • 68% lie within one standard deviation of the mean: P(|X-\mu|\leq \sigma) \approx 0.68.

    • 95% lie within two standard deviations: P(|X-\mu|\leq 2\sigma) \approx 0.95.

    • 99.7% lie within three standard deviations: P(|X-\mu|\leq 3\sigma) \approx 0.997.

    • This provides a practical ruler for judging how surprising a result is based on its distance from the mean in terms of standard deviations.
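The empirical rule can be checked with a quick simulation (a sketch, assuming Python's standard library; the draw count and seed are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Draw a large sample from a normal distribution N(mu, sigma^2).
mu, sigma = 0.0, 1.0
draws = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    """Fraction of draws within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in draws) / len(draws)

# Should land near 0.68, 0.95, and 0.997 respectively.
print(round(within(1), 3), round(within(2), 3), round(within(3), 4))
```

The same `within` check applied to non-normal data (e.g., heavily skewed scores) would drift away from 68–95–99.7, which is why the rule is only a ruler for approximately normal data.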

  • Quartiles, percentiles, and the five-number summary

    • Quartiles divide data into four equal parts:

    • Q1: lower quartile (25th percentile)

    • Q2: median (50th percentile)

    • Q3: upper quartile (75th percentile)

    • Five-number summary: min, Q1, Q2 (median), Q3, max.

    • The slides provide a concrete example for 24 rugby scores: Q1 = 1, Q3 = 5.5, giving an IQR of 4.5 (Q3 − Q1).
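The Q1 = 1 and Q3 = 5.5 values follow from the median-of-halves convention: for 24 sorted scores, Q1 is the median of the lower 12 and Q3 the median of the upper 12. A sketch (assuming Python; note that other quartile conventions, such as the inclusive or exclusive percentile methods, can give slightly different answers):

```python
import statistics

def quartiles_median_of_halves(data):
    """Q1, Q2, Q3 via the median-of-halves method: split the sorted data at
    the median and take the median of each half (excluding the middle value
    when the count is odd)."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    q2 = statistics.median(s)
    q1 = statistics.median(s[:half])
    q3 = statistics.median(s[half + n % 2:])
    return q1, q2, q3

tries = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
         5, 0, 1, 4, 5, 6, 7, 3]
q1, q2, q3 = quartiles_median_of_halves(tries)
print(q1, q2, q3, q3 - q1)  # 1.0 3.0 5.5 4.5
```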

  • Interquartile range (IQR) and fences

    • IQR measures the spread of the middle 50% of the data:
      IQR = Q3 - Q1 = 4.5.

    • Fences define boundaries for detecting outliers:

    • Inner fences:

      • Lower inner fence = Q1 - 1.5 \times IQR

      • Upper inner fence = Q3 + 1.5 \times IQR

    • Outer fences:

      • Lower outer fence = Q1 - 3 \times IQR

      • Upper outer fence = Q3 + 3 \times IQR

    • Adjacent values: the smallest data point at least as large as the lower inner fence, and the largest data point at most as large as the upper inner fence.

    • Whiskers extend to the furthest adjacent values; outliers are plotted beyond the inner fences.

    • In the rugby example, with Q1 = 1, Q3 = 5.5, IQR = 4.5:

    • Lower inner fence = 1 - 1.5 \times 4.5 = -5.75

    • Upper inner fence = 5.5 + 1.5 \times 4.5 = 12.25

    • All data lie within the inner fences, so there are no outliers.
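The fence arithmetic and outlier check for the rugby example can be sketched as (assuming Python; `tukey_fences` is a hypothetical helper name for the standard 1.5×IQR / 3×IQR rule):

```python
def tukey_fences(q1, q3):
    """Inner (1.5 x IQR) and outer (3 x IQR) fences from the quartiles."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

tries = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
         5, 0, 1, 4, 5, 6, 7, 3]

inner, outer = tukey_fences(1, 5.5)  # Q1 and Q3 from the rugby data
outliers = [x for x in tries if x < inner[0] or x > inner[1]]

print(inner)     # (-5.75, 12.25)
print(outliers)  # [] -- every score lies within the inner fences
```

Since all 24 scores fall inside the inner fences, the boxplot whiskers simply run to the minimum (0) and maximum (7) of the data.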

  • Boxplots: construction and interpretation

    • Boxplot purpose: visually summarize distribution symmetry, spread, and central tendency.

    • Five-number summary anchors the box and whiskers:

    • Box represents the IQR (middle 50%), spanning from Q1 to Q3.

    • Median (Q2) is a line inside the box.

    • Whiskers extend to the most extreme data points within the inner fences (adjacent values).

    • Outliers are plotted individually using distinct symbols (e.g., hollow circle for mild outliers, asterisk or filled symbol for extreme outliers).

    • Outer fences separate mild from extreme outliers.

    • Boxplots are linked conceptually to histograms and normal distributions for assessing symmetry and spread.

    • Practical assessment: compare your boxplot to a symmetric, bell-curved distribution to gauge skewness and tail behavior.

  • Boxplots versus histograms and their link to bell curves

    • Boxplots are a compact summary focusing on center, spread, and tails; histograms display frequency distribution across values.

    • A boxplot corresponds to a symmetric normal distribution benchmark: the central box contains the densest part of the data; whiskers cover the remainder.

    • In normally distributed data, over 99% of values fall within the inner fences (Q1 − 1.5×IQR to Q3 + 1.5×IQR), so whiskers cover almost all of the data.

    • Skewness in a boxplot appears as an asymmetrical box or whiskers.

  • Practical considerations and study tips

    • Understand when to use descriptive charts (histograms, frequency tables) vs when to infer probabilities (probability distributions, sampling distributions).

    • Remember the two forms for standard deviation computations and know how to switch between them for manual calculation vs using summaries:

    • Population (definitional) form: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2}.

    • Computational form: \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}.

    • For boxplots, always interpret outliers in context; an absence of outliers does not imply perfect symmetry, and the fences are a methodological rule, not an absolute boundary.

    • Real-world relevance: Central limit theorem justifies using normal approximations for sampling distributions of the mean in many practical settings (e.g., psychology, economics, experimental data).

    • Ethical and practical implications: relying on the normal approximation requires awareness of data distribution; violations (heavy tails, strong skew) can lead to misinterpretation of p-values and confidence intervals.

  • Key equations to remember (LaTeX)

    • Population mean: \mu = \frac{1}{N}\sum_{i=1}^{N} x_i

    • Population variance: \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2

    • Population standard deviation (computational form): \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}

    • Sample mean (for a sample of size n): \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i

    • IQR: IQR = Q3 - Q1

    • Inner fences: \text{Lower inner fence} = Q1 - 1.5 \times IQR, \text{Upper inner fence} = Q3 + 1.5 \times IQR

    • Outer fences: \text{Lower outer fence} = Q1 - 3 \times IQR, \text{Upper outer fence} = Q3 + 3 \times IQR

    • 68–95–99.7 rule: within 1\sigma, 2\sigma, 3\sigma from the mean correspond to roughly 68%, 95%, 99.7% of data respectively.

  • Quick takeaway

    • Descriptive stats (mean, median, mode; range, variance, standard deviation) describe data; sampling distributions and the central limit theorem justify normal models for averages as sample sizes grow.

    • Visual tools (histograms, boxplots) help assess distribution shape, skewness, outliers, and variability.

    • The five-number summary (min, Q1, median, Q3, max) and IQR are central to boxplots and outlier detection.

  • References to interactive tools and further exploration

    • Distribution Shaper and related interactive demonstrations mentioned in the slides illustrate how skewness and kurtosis affect the mean, median, and mode in real-time.

    • Online links included in lecture materials (e.g., gemini and related shared resources) provide hands-on practice with the boxplot construction and distribution visualization.

  • Summary slide reminders

    • Boxplots provide a compact summary of distribution shape, spread, and central tendency.

    • Boxplots are essential in psychology and other sciences for quickly assessing symmetry, skewness, and potential outliers.

    • Boxplots rely on the five-number summary and include whiskers, adjacent values, and outlier markers.

  • Final note on connections to earlier material and real-world relevance

    • Relationships among mean, median, mode become particularly clear in normal distributions; deviations reveal skewness/kurtosis and non-normality.

    • The central limit theorem underpins many statistical methods used in research and industry, reinforcing the practical value of understanding sampling distributions and standard deviation as a measure of spread.