Understanding Data – Comprehensive Study Notes

  • Learning objectives (Today's Learning Objectives)

    • Organise raw data into a frequency distribution table.

    • Interpret and create graphical representations of data, such as histograms and bar charts.

    • Describe the shape of a distribution (e.g., skew and kurtosis).

    • Calculate and interpret measures of central tendency (mean, median, mode).

    • Calculate and interpret measures of variability (range, variance, standard deviation).

    • Explain and apply the central limit theorem, showing how sampling distributions of the mean tend towards normality as sample size increases.

    • Construct and interpret boxplots by hand.

  • Real-world framing (First Principles Counting Events)

    • View data collection as simple records of each occurrence of an event (e.g., rugby tries).

    • Example scenario: tallying tries in a rugby season across 24 games; per-game tries belong to the set {0,1,2,3,4,5,6,7}.

    • Frequency table example: number of matches that scored a specific number of tries (e.g., 0, 1, 2, …, 7) across 24 matches.

    • This forms the basis for constructing a frequency chart and then moving to probabilities.

  • From frequencies to probabilities

    • A frequency chart answers descriptive questions: how many matches produced each score.

    • To make predictive statements, convert counts to probabilities by dividing by the total number of matches.

    • Example: If 3 matches scored 0 tries out of 24, the probability is P(X=0)=\frac{3}{24}=0.125=12.5\% (and similarly for other scores).

    • Summary form: probability for a score s is P(X=s)=\frac{\text{count of matches with score } s}{\text{total matches}}.
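The counts-to-probabilities conversion can be sketched in Python (a minimal sketch; the per-score tallies below are taken from the 24-match rugby dataset used in these notes):

```python
# Tally of tries per match across the 24-game season:
# score -> number of matches that finished with that many tries.
counts = {0: 3, 1: 5, 2: 1, 3: 4, 4: 2, 5: 3, 6: 2, 7: 4}

total = sum(counts.values())  # 24 matches in all
probs = {score: n / total for score, n in counts.items()}

print(probs[0])  # 3/24 = 0.125, i.e. P(X=0) = 12.5%
```

Dividing every count by the same total guarantees the probabilities sum to 1, which is the bridge to the "area under the curve equals probability" idea later in the notes.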

  • Shifting from a single game to a run of games: the sample mean

    • Definition of the sample mean: the sum of the observed scores divided by the number of matches, \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i.

    • Example: for eight matches with scores 3,5,1,4,2,3,2,4 the sum is 24, so the average is \bar{X}=\frac{24}{8}=3.0.

    • Issue: averages vary from season to season due to random factors (weather, injuries, referee decisions, opposition form, luck).

    • Conclusion: any single season's average is one realisation among many possible ones; randomness prevents exact replication of a season's average.

  • Building the sampling distribution of the mean (thousands of “What‑If” seasons)

    • Concept: imagine many possible seasons by resampling the actual season's scores (drawing 24 scores with replacement from the observed set) to simulate new seasons; merely reordering the same scores would leave the average unchanged.

    • Process: resample the scores, compute the new season's average, and repeat thousands of times.

    • The collection of these averages forms the sampling distribution of the mean.

    • The mean of all these simulated averages approximates the population mean, denoted by \mu.

    • Visualization: plot the distribution of these averages; with enough repetitions, the distribution tends to a bell-shaped curve (normal distribution).

    • Key insight: randomness in data generation tends to produce a normal-shaped sampling distribution for the mean.
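The resampling process above can be sketched as a short simulation (a sketch, assuming Python's standard library; the 24 scores are the rugby data from these notes):

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Observed season: tries per match across 24 games.
season = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
          5, 0, 1, 4, 5, 6, 7, 3]

# Simulate thousands of "what-if" seasons by drawing 24 scores with
# replacement from the observed season, recording each season's average.
averages = [
    statistics.mean(random.choices(season, k=len(season)))
    for _ in range(10_000)
]

# The simulated averages centre on the observed mean (about 3.42) and are
# far less spread out than the raw scores: the sampling distribution of
# the mean is narrower than the data distribution.
print(round(statistics.mean(averages), 2))
print(round(statistics.stdev(averages), 2))
```

Plotting a histogram of `averages` would show the bell shape the notes describe: the sampling distribution of the mean tends towards normality even though the raw scores are not normal.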

  • Normal bell curves and their properties

    • The normal distribution (bell curve) arises when randomness is added to data.

    • Bell curves appear when averaging independent, mildly random events.

    • Theoretical (ideal) normal curve is used for comparison with actual data.

    • Actual distribution is shown by a dashed grey line; the theoretical curve is shown by a solid red area in demonstrations.

    • Interactive tools illustrate how changing data toward symmetry and kurtosis shifts the mean, median, and mode.

  • Bell Curve properties: symmetry, modality, skew, and kurtosis

    • Normal distributions: a subset of bell curves with specific characteristics:

    • Symmetry: the curve is perfectly balanced; folding in half yields identical sides. (Skew = 0)

    • Modality: only one hump (single mode).

    • Skew: no leaning; equal tail lengths; mean = median = mode in the center for a perfect normal.

    • Kurtosis: neither too pointy nor too flat; moderate peak and tails.

    • Skewness (asymmetry)

    • Positive skew (right-skewed): right tail longer; most scores cluster at lower values; mean > median.

    • Negative skew (left-skewed): left tail longer; most scores cluster at higher values; mean < median.

    • Kurtosis (peakedness)

    • Leptokurtic: sharp peak and fatter tails; extreme values more likely.

    • Platykurtic: flatter peak and thinner tails; extreme values less likely.

    • Mesokurtic: normal level of peakedness.
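Skewness and excess kurtosis can be computed from the data's moments. A minimal sketch (assuming Python; the moment-based population formulas shown here are one standard convention, and `skewness_and_excess_kurtosis` is a hypothetical helper name):

```python
def skewness_and_excess_kurtosis(data):
    """Moment-based (population) skewness and excess kurtosis.

    Skewness is 0 for a perfectly symmetric distribution; excess kurtosis
    is 0 for a normal curve (mesokurtic), negative for platykurtic data,
    positive for leptokurtic data.
    """
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # variance
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

season = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
          5, 0, 1, 4, 5, 6, 7, 3]
skew, kurt = skewness_and_excess_kurtosis(season)
# The rugby scores are spread fairly evenly across 0-7, so the excess
# kurtosis comes out negative (platykurtic).
```

Feeding perfectly symmetric data into the function returns a skewness of exactly 0, matching the "folding in half" description above.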

  • Distribution Shaper and interactive exploration (conceptual)

    • Distribution Shaper lets you manipulate a dataset and observe changes in:

    • Mean, Median, and Mode positions as the data reshapes.

    • Target Skew (γ) and Target Excess Kurtosis (δ) sliders to explore symmetry and peakedness effects.

    • Theoretical Bell Curve vs. Actual Distribution: compare a theoretical normal curve to your data’s actual shape.

    • Practical note: interactive tools can illustrate how, for example, bimodal distributions influence where the bell curve would lie on average.

    • Important link: distribution exploration tools and demonstrations (links provided in lecture materials).

  • Standard deviation and the idea of spread around the mean

    • Purpose: measure how spread out data are around the mean, not just the center.

    • Intuition: a larger standard deviation means data are more spread out; a smaller one means data are clustered near the mean.

    • The example demonstrates the concept visually by selecting a random score x and comparing its distance to the mean μ, using a tape-measure analogy.

  • Step-by-step intuition for standard deviation (illustrative demonstration)

    • Step 1: select a random score x on the bell curve.

    • Step 2: measure the horizontal distance from x to the mean μ.

    • Step 3: note that for a symmetric distribution the mean \mu sits at the centre of the curve, so this distance shows how far the score lies from the typical value.

    • Step 4: compute the difference x − \mu; the sign records whether the score lies above or below the mean.

    • Step 5: to stop positive and negative differences cancelling when all data points are combined, square each difference: (x_i - \mu)^2.

    • Step 6: to move from a total squared distance to an average squared distance, divide the sum by the number of values N; this yields the variance, \text{Var} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \sigma^2, not yet the standard deviation.

    • Step 7: the final step is taking the square root of the variance to obtain the standard deviation: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}.

    • Computational form (often used in practice): rearrange the computation as \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2}.

    • Note on forms: the “definitional form” emphasizes the sum of squared deviations; the “computational form” uses sums of squares and the mean to simplify calculations.
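Both forms give the same answer, which is easy to confirm numerically. A minimal sketch (assuming Python; the function names are hypothetical labels for the two forms):

```python
import math

def sd_definitional(data):
    """Definitional form: root of the mean squared deviation from the mean."""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)

def sd_computational(data):
    """Computational form: mean of the squares minus the square of the mean."""
    n = len(data)
    return math.sqrt(sum(x * x for x in data) / n - (sum(data) / n) ** 2)

data = [4, 3, 5, 1, 6, 2, 1, 7]  # any list of scores works
print(sd_definitional(data), sd_computational(data))  # identical up to rounding
```

The computational form needs only running totals of x and x², which is why it suits hand calculation from a summary table; the definitional form requires a second pass after the mean is known.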

  • Rugby data example: computing μ and σ for 24 matches

    • Data: Tries scored per match (24 values):
      {4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0, 5, 0, 1, 4, 5, 6, 7, 3}

    • Totals:

    • Sum of scores: \sum x_i = 82

    • Sum of squares: \sum x_i^2 = 420

    • N = 24

    • Mean (population mean):
      \mu = \frac{\sum x_i}{N} = \frac{82}{24} \approx 3.4167.

    • Variance (population):
      \sigma^2 = \frac{\sum x_i^2}{N} - \mu^2 = \frac{420}{24} - (3.4167)^2 \approx 5.8264.

    • Standard deviation: \sigma = \sqrt{\sigma^2} = \sqrt{5.8264} \approx 2.414.

    • In the slides, these numbers are shown as components for demonstrating the computation, with the final result approximately 2.41.
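The whole computation can be reproduced directly from the 24 scores (a sketch, assuming Python; variable names are illustrative):

```python
import math

tries = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
         5, 0, 1, 4, 5, 6, 7, 3]

n = len(tries)                        # 24 matches
total = sum(tries)                    # sum of scores: 82
total_sq = sum(x * x for x in tries)  # sum of squares: 420

mu = total / n                        # population mean, ~3.4167
variance = total_sq / n - mu ** 2     # computational form, ~5.8264
sigma = math.sqrt(variance)           # standard deviation, ~2.414

print(round(mu, 4), round(variance, 4), round(sigma, 3))
```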

  • Probability and the area under the curve

    • Frequencies convert to probabilities: in a relative-frequency histogram, the area of each bar equals the probability of that outcome.

    • For the rugby data, probabilities for each score (0,1,2,3,4,5,6,7) are the corresponding counts divided by 24, e.g., P(X=0)=12.5%, P(X=1)=20.8%, etc.

    • The histogram is a discrete representation; the bell curve is its continuous counterpart in the normal-model context.

    • The key link: the area under the curve (the integral) equals probability; the sum of bar areas equals 1 (or 100%).

  • The 68–95–99.7 rule (empirical rule)

    • If data are normal, then approximately:

    • 68% lie within one standard deviation of the mean: P(|X-\mu|\leq \sigma) \approx 0.68.

    • 95% lie within two standard deviations: P(|X-\mu|\leq 2\sigma) \approx 0.95.

    • 99.7% lie within three standard deviations: P(|X-\mu|\leq 3\sigma) \approx 0.997.

    • This provides a practical ruler for judging how surprising a result is based on its distance from the mean in terms of standard deviations.
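The empirical rule can be checked with a quick simulation (a sketch, assuming Python's standard library; the draw count and seed are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Draw a large sample from a normal distribution N(mu, sigma^2).
mu, sigma = 0.0, 1.0
draws = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    """Fraction of draws within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in draws) / len(draws)

# Should land near 0.68, 0.95, and 0.997 respectively.
print(round(within(1), 3), round(within(2), 3), round(within(3), 4))
```

The same `within` check applied to non-normal data (e.g., heavily skewed scores) would drift away from 68–95–99.7, which is why the rule is only a ruler for approximately normal data.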

  • Quartiles, percentiles, and the five-number summary

    • Quartiles divide data into four equal parts:

    • Q1: lower quartile (25th percentile)

    • Q2: median (50th percentile)

    • Q3: upper quartile (75th percentile)

    • Five-number summary: min, Q1, Q2 (median), Q3, max.

    • The slides provide a concrete example for 24 rugby scores: Q1 = 1, Q3 = 5.5, giving an IQR of 4.5 (Q3 − Q1).
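The Q1 = 1 and Q3 = 5.5 values follow from the median-of-halves convention: for 24 sorted scores, Q1 is the median of the lower 12 and Q3 the median of the upper 12. A sketch (assuming Python; note that other quartile conventions, such as the inclusive or exclusive percentile methods, can give slightly different answers):

```python
import statistics

def quartiles_median_of_halves(data):
    """Q1, Q2, Q3 via the median-of-halves method: split the sorted data at
    the median and take the median of each half (excluding the middle value
    when the count is odd)."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    q2 = statistics.median(s)
    q1 = statistics.median(s[:half])
    q3 = statistics.median(s[half + n % 2:])
    return q1, q2, q3

tries = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
         5, 0, 1, 4, 5, 6, 7, 3]
q1, q2, q3 = quartiles_median_of_halves(tries)
print(q1, q2, q3, q3 - q1)  # 1.0 3.0 5.5 4.5
```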

  • Interquartile range (IQR) and fences

    • IQR measures the spread of the middle 50% of the data:
      IQR = Q3 - Q1 = 4.5.

    • Fences define boundaries for detecting outliers:

    • Inner fences:

      • Lower inner fence = Q1 - 1.5 \times IQR

      • Upper inner fence = Q3 + 1.5 \times IQR

    • Outer fences:

      • Lower outer fence = Q1 - 3 \times IQR

      • Upper outer fence = Q3 + 3 \times IQR

    • Adjacent values: the smallest data point at least as large as the lower inner fence, and the largest data point at most as large as the upper inner fence.

    • Whiskers extend to the furthest adjacent values; outliers are plotted beyond the inner fences.

    • In the rugby example, with Q1 = 1, Q3 = 5.5, IQR = 4.5:

    • Lower inner fence = 1 - 1.5 \times 4.5 = -5.75

    • Upper inner fence = 5.5 + 1.5 \times 4.5 = 12.25

    • All data lie within the inner fences, so there are no outliers.
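The fence arithmetic and outlier check for the rugby example can be sketched as (assuming Python; `tukey_fences` is a hypothetical helper name for the standard 1.5×IQR / 3×IQR rule):

```python
def tukey_fences(q1, q3):
    """Inner (1.5 x IQR) and outer (3 x IQR) fences from the quartiles."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

tries = [4, 3, 5, 1, 6, 2, 1, 7, 0, 3, 1, 3, 1, 7, 7, 0,
         5, 0, 1, 4, 5, 6, 7, 3]

inner, outer = tukey_fences(1, 5.5)  # Q1 and Q3 from the rugby data
outliers = [x for x in tries if x < inner[0] or x > inner[1]]

print(inner)     # (-5.75, 12.25)
print(outliers)  # [] -- every score lies within the inner fences
```

Since all 24 scores fall inside the inner fences, the boxplot whiskers simply run to the minimum (0) and maximum (7) of the data.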

  • Boxplots: construction and interpretation

    • Boxplot purpose: visually summarize distribution symmetry, spread, and central tendency.

    • Five-number summary anchors the box and whiskers:

    • Box represents the IQR (middle 50%), spanning from Q1 to Q3.

    • Median (Q2) is a line inside the box.

    • Whiskers extend to the most extreme data points within the inner fences (adjacent values).

    • Outliers are plotted individually using distinct symbols (e.g., hollow circle for mild outliers, asterisk or filled symbol for extreme outliers).

    • Outer fences separate mild from extreme outliers.

    • Boxplots are linked conceptually to histograms and normal distributions for assessing symmetry and spread.

    • Practical assessment: compare your boxplot to a symmetric, bell-curved distribution to gauge skewness and tail behavior.

  • Boxplots versus histograms and their link to bell curves

    • Boxplots are a compact summary focusing on center, spread, and tails; histograms display frequency distribution across values.

    • A boxplot corresponds to a symmetric normal distribution benchmark: the central box contains the densest part of the data; whiskers cover the remainder.

    • In normally distributed data, over 99% of values fall within the inner fences (Q1 − 1.5×IQR to Q3 + 1.5×IQR), so whiskers cover almost all of the data.

    • Skewness in a boxplot appears as an asymmetrical box or whiskers.

  • Practical considerations and study tips

    • Understand when to use descriptive charts (histograms, frequency tables) vs when to infer probabilities (probability distributions, sampling distributions).

    • Remember the two forms for standard deviation computations and know how to switch between them for manual calculation vs using summaries:

    • Population (definitional) form: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2}.

    • Computational form: \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}.

    • For boxplots, always interpret outliers in context; an absence of outliers does not imply perfect symmetry, and the fences are a methodological rule, not an absolute boundary.

    • Real-world relevance: Central limit theorem justifies using normal approximations for sampling distributions of the mean in many practical settings (e.g., psychology, economics, experimental data).

    • Ethical and practical implications: relying on the normal approximation requires awareness of data distribution; violations (heavy tails, strong skew) can lead to misinterpretation of p-values and confidence intervals.

  • Key equations to remember (LaTeX)

    • Population mean: \mu = \frac{1}{N}\sum_{i=1}^{N} x_i

    • Population variance: \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2

    • Population standard deviation (computational form): \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}

    • Sample mean (for a sample of size n): \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i

    • IQR: IQR = Q3 - Q1

    • Inner fences: \text{Lower inner fence} = Q1 - 1.5 \times IQR, \text{Upper inner fence} = Q3 + 1.5 \times IQR

    • Outer fences: \text{Lower outer fence} = Q1 - 3 \times IQR, \text{Upper outer fence} = Q3 + 3 \times IQR

    • 68–95–99.7 rule: within 1\sigma, 2\sigma, 3\sigma from the mean correspond to roughly 68%, 95%, 99.7% of data respectively.

  • Quick takeaway

    • Descriptive stats (mean, median, mode; range, variance, standard deviation) describe data; sampling distributions and the central limit theorem justify normal models for averages as sample sizes grow.

    • Visual tools (histograms, boxplots) help assess distribution shape, skewness, outliers, and variability.

    • The five-number summary (min, Q1, median, Q3, max) and IQR are central to boxplots and outlier detection.

  • References to interactive tools and further exploration

    • Distribution Shaper and related interactive demonstrations mentioned in the slides illustrate how skewness and kurtosis affect the mean, median, and mode in real-time.

    • Online links included in lecture materials (e.g., gemini and related shared resources) provide hands-on practice with the boxplot construction and distribution visualization.

  • Summary slide reminders

    • Boxplots provide a compact summary of distribution shape, spread, and central tendency.

    • Boxplots are essential in psychology and other sciences for quickly assessing symmetry, skewness, and potential outliers.

    • Boxplots rely on the five-number summary and include whiskers, adjacent values, and outlier markers.

  • Final note on connections to earlier material and real-world relevance

    • Relationships among mean, median, mode become particularly clear in normal distributions; deviations reveal skewness/kurtosis and non-normality.

    • The central limit theorem underpins many statistical methods used in research and industry, reinforcing the practical value of understanding sampling distributions and standard deviation as a measure of spread.