Comprehensive Study Notes: Data Recording, Variables, Histograms, Boxplots, and Scatter Plots

Data Recording and Variables

  • Data in this class are recorded as a dataset where each row is a single case (e.g., an individual subject, company, animal, or a specific event such as a collision between a bird and an airplane).

  • For each case, information is stored in columns called variables.

  • Example variable shown: num Enjs (the number of engines of the airplane involved in the collision).

  • Important principle: variables describe characteristics of single cases, not aggregated statistics about the whole dataset.

  • Variables must be things that can differ from case to case. If a quantity is constant across all cases (like the overall average age), it is not a variable.

  • Distinguishing between variable types:

    • Numerical variables: can take many numerical values and allow arithmetic operations.

      • Discrete numerical variables: take values from a countable set (e.g., 0, 1, 2, 3).

        • Example: number of engines (you can’t have 4.5 engines in this dataset).

      • Continuous numerical variables: can take any value within an interval (e.g., feet above ground level).

    • Categorical variables: describe categories or groups.

      • Nominal: categories with no natural order (e.g., type of transportation: car, bus, train, bicycle).

      • Ordinal: categories with a natural order (e.g., political leanings: liberal, moderate, conservative).

  • Special-case examples from the dataset discussed:

    • Bird strike data: "num Enjs" is numerical and discrete.

    • Feet above ground level: numerical and continuous.

    • Cloud cover: recorded as a category (no clouds, overcast, etc.); this is categorical and ordinal because there’s a sensible order to cloud cover levels.

    • Number of birds hit: listed as categories (0, 1, 2–10, 11–100, over 100); this is ordinal because there is a natural ordering.

  • Note on two-category variables (e.g., gender): for many analyses it doesn’t matter which category you label first; either coding works similarly for binary variables.

  • Data examples used to illustrate concepts:

    • Bird strike dataset: number of engines (numerical discrete); feet above ground level (numerical continuous); cloud cover (categorical ordinal); number of birds hit (categorical ordinal with ordered bins).

    • FEV dataset (Forced Expiratory Volume): contains subjects’ age, FEV, and other variables; used to discuss the distribution of lung capacity.
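
The row-per-case layout and the variable types above can be sketched in Python. The records below are invented for illustration and are not taken from the bird strike dataset; the field names `num_engines`, `feet_agl`, `cloud_cover`, and `birds_hit`, and the cloud cover labels, are hypothetical stand-ins.

```python
# Each dict is one case (one bird strike); each key is a variable.
# All values are made up for illustration.
cases = [
    {"num_engines": 2, "feet_agl": 1500.0, "cloud_cover": "overcast", "birds_hit": "2-10"},
    {"num_engines": 4, "feet_agl": 250.5, "cloud_cover": "no cloud", "birds_hit": "1"},
    {"num_engines": 2, "feet_agl": 0.0, "cloud_cover": "some cloud", "birds_hit": "1"},
]

# Ordinal variables need an explicit order; the labels alone do not carry it.
BIRDS_HIT_ORDER = ["0", "1", "2-10", "11-100", "over 100"]


def birds_hit_rank(case):
    """Position of a case's birds_hit category in the natural ordering."""
    return BIRDS_HIT_ORDER.index(case["birds_hit"])


# Numerical variables support arithmetic. Note that this mean summarizes the
# whole dataset, so it is a statistic, not a variable.
mean_engines = sum(c["num_engines"] for c in cases) / len(cases)

# Ordinal variables support comparison and sorting by rank.
most_birds = max(cases, key=birds_hit_rank)

print(mean_engines)
print(most_birds["birds_hit"])
```

Storing the order of the `birds_hit` categories separately is one simple design choice; the point is only that an ordinal variable can be ranked while a nominal one cannot.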

Histograms and Class Intervals

  • A histogram is a graphical representation of a single variable.

  • Build by dividing the data into class intervals (bins):

    • Example intervals for FEV: [0, 1), [1, 2), [2, 3), … (in liters).

  • Class intervals (bins) can be chosen arbitrarily; you decide how many intervals and how wide they should be based on what you want to communicate.

  • Process for a histogram:

    • Create interval boundaries on the x-axis.

    • Count how many observations fall into each interval and plot on the y-axis.
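
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-width half-open bins; the function name `histogram_counts` and the FEV readings are hypothetical, not values from the course dataset.

```python
import math


def histogram_counts(values, start, width, n_bins):
    """Count values in the half-open bins [start, start + width), and so on.

    Values that fall outside all bins are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        idx = math.floor((v - start) / width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


# Hypothetical FEV readings in liters (not from the actual FEV dataset).
fev = [0.8, 1.4, 1.9, 2.0, 2.3, 2.7, 3.1, 3.6, 4.2]

counts = histogram_counts(fev, start=0.0, width=1.0, n_bins=5)

# Crude text histogram: one '#' per observation in each bin.
for i, c in enumerate(counts):
    print(f"[{i}, {i + 1}): {'#' * c}")
```

Calling the function again with a different `width` and `n_bins` is a quick way to see how the picture changes with the choice of class intervals.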

  • Observations about bin choices:

    • Fewer intervals (broader bins) give a smoother, simpler summary but may hide details.

    • More intervals (narrow bins) reveal more detail but may show noise or random fluctuations.

    • It’s common to try several bin configurations to communicate the main features (e.g., one peak vs multiple small bumps).

  • Shape concepts derived from histograms:

    • Symmetric (bell-shaped) distributions have roughly equal tails on both sides.

    • Right-skewed (positive skew): tail extends to the right.

    • Left-skewed (negative skew): tail extends to the left.

    • Skew direction is the direction of the longer tail.

  • Modality:

    • Unimodal: a single peak.

    • Bimodal: two distinct peaks (often suggests two subpopulations).

    • Multimodal: three or more peaks.

  • Outliers: points far away from the main cluster; can indicate data entry errors, unusual observations, or interesting phenomena.

  • Comparing groups with histograms:

    • For example, lung capacity (FEV) for smokers vs. non-smokers can be compared by overlaying or stacking histograms.

    • In the example, smokers appear to have a higher average lung capacity on the histogram, suggesting a shift to the right relative to non-smokers.

  • Scale options on the y-axis:

    • Frequency scale: y-axis shows actual counts (frequency) in each bin.

    • Density scale: y-axis shows density so that the bar areas sum to 1 (the total proportion).

  • Density-scale histogram details:

    • Height of a bar = (proportion in bin) / (bin width).

    • Proportion in a bin = (count in bin) / N, where N is the total number of observations.

    • Area of all bars equals 1, representing the total proportion of observations.
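
As a rough check of these formulas, the sketch below converts bin counts to density heights and confirms that the bar areas sum to 1. The counts and the 0.5 L bin width are hypothetical, and `density_heights` is an illustrative name.

```python
def density_heights(counts, bin_width):
    """Bar heights on the density scale: (count / N) / bin_width."""
    n_total = sum(counts)
    return [(c / n_total) / bin_width for c in counts]


# Hypothetical counts for five equal-width bins of 0.5 L each.
counts = [2, 5, 9, 3, 1]
bin_width = 0.5

heights = density_heights(counts, bin_width)

# Area of each bar = height * width = proportion of observations in that bin.
areas = [h * bin_width for h in heights]

print(heights)
print(sum(areas))        # total area, equal to 1 up to rounding
print(sum(areas[1:4]))   # proportion of observations in the middle three bins
```

Summing the areas of adjacent bars, as in the last line, gives the proportion of observations in that range of values.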

  • When to use density vs frequency:

    • Density scale handles unequal bin widths more easily and provides a direct interpretation in terms of area (proportion within a range).

    • In this course, bins are generally equal width, but the density interpretation remains useful.

  • Example interpretation with a density histogram:

    • If the shaded area spans FEV between 0.5 and 1.0 L, that area equals the proportion of subjects in that range.

    • The area between 2.0 and 4.0 L would be the proportion in that range, and the total area under the histogram is 1.

  • Practical interpretation: density-scale histograms allow straightforward back-and-forth between area under the curve and proportion of observations in a range.

  • Quick takeaways about histograms:

    • Shape: symmetry, skew, number of peaks.

    • Center and spread: where is the center (mean/median) and how spread out is the data (range, standard deviation, or IQR).

    • Outliers: look for unusual observations beyond the main cluster.

    • When comparing two groups, histograms can reveal shifts in distribution between groups.

Center and Spread: Mean, Median, and Variability

  • Mean (average):

    • Definition: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

    • Intuition: the balancing point of the data, if each observation is imagined as an equal weight placed at its value on a number line.

  • Median (middle value):

    • For odd n: the middle value after sorting.

    • For even n: the average of the two middle values.

    • Examples:

      • Dataset: $\{4, 8, 3, 5, 12\}$; sorted: $\{3, 4, 5, 8, 12\}$; median = 5.

      • If we add a sixth value (e.g., 100): sorted becomes $\{3, 4, 5, 8, 12, 100\}$; median $= \frac{5+8}{2} = 6.5$.

  • Sensitivity to outliers:

    • Mean can be dramatically affected by extreme values (outliers).

    • Example: adding a value of 100 to the dataset above changed the mean from 6.4 to 22, but the median changed much less (from 5 to 6.5).

    • Conclusion: the median is more robust (less affected by outliers) than the mean.
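
The numbers in this example can be reproduced with Python's standard `statistics` module. The data are the values from the notes; only the variable names are mine.

```python
from statistics import mean, median

data = [4, 8, 3, 5, 12]
with_outlier = data + [100]

# Mean and median before adding the extreme value.
print(mean(data), median(data))

# Mean and median after adding 100: the mean jumps, the median barely moves.
print(mean(with_outlier), median(with_outlier))

mean_shift = mean(with_outlier) - mean(data)
median_shift = median(with_outlier) - median(data)
print(mean_shift, median_shift)
```

The shift in the mean is roughly ten times the shift in the median, which is what "robust" means in practice here.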

  • Estimating mean and median from a histogram:

    • Median: the point where half the data are to the left and half to the right (50% area on each side).

    • Mean: the balancing point of the actual distribution (imagine balancing the histogram physically).

  • Skew and location of mean vs median:

    • For symmetric distributions, mean and median coincide (both near the center).

    • For right-skewed distributions, the mean is larger than the median (mean pulled toward the longer tail).

    • For left-skewed distributions, the mean is smaller than the median.
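
A small sketch of these three cases, using invented datasets chosen only to have the stated shapes (the helper name `compare_center` is hypothetical):

```python
from statistics import mean, median


def compare_center(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # long tail to the right
left_skewed = [21 - x for x in right_skewed]    # mirror image: long tail to the left

for name, values in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    m, med = compare_center(values)
    print(f"{name}: mean={m}, median={med}")
```

In each skewed case the mean sits on the tail side of the median, pulled there by the single extreme value.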

  • Five-number summary (a compact numerical summary of a distribution):

    • The five numbers are: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.

    • The interquartile range (IQR) = Q3 - Q1; not technically part of the five-number summary, but used to describe spread.

    • Example process to compute a five-number summary:

      • Sort the data.

      • Find the overall median (Q2).

      • Find the median of the lower half (Q1) and the median of the upper half (Q3).

      • The min is the smallest value and the max is the largest value.

    • Note: different software may compute quartiles with slightly different values (e.g., 33.5 vs 34); small differences are acceptable.
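
The median-of-halves procedure above might be coded as follows. This is a sketch under one stated assumption: when n is odd, the overall median is left out of both halves. The notes do not specify this detail, and other conventions exist, which is part of why software can disagree slightly.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumption: for odd n, the overall median is excluded from both halves.
    Requires at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return min(data), median(lower), median(data), median(upper), max(data)


summary = five_number_summary([3, 4, 5, 8, 12, 100])
print(summary)

q1, q3 = summary[1], summary[3]
print("IQR =", q3 - q1)
```

Python's built-in `statistics.quantiles(data, n=4)` uses an interpolation rule instead, so its quartiles can differ a little from the ones computed here.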

  • Boxplot (box-and-whiskers plot):

    • Box spans from Q1 to Q3; a line inside the box marks the median (Q2).

    • Whiskers extend to the most extreme data within 1.5 * IQR from the quartiles.

    • Observations beyond the whiskers are plotted individually as outliers.

    • Upper and lower fences (often drawn as dotted lines) are:

      • Upper fence = $Q3 + 1.5\cdot\mathrm{IQR}$

      • Lower fence = $Q1 - 1.5\cdot\mathrm{IQR}$

    • Boxplot interpretation:

      • The box shows the middle 50% of the data.

      • The length of the box indicates the spread of the central portion; longer whiskers indicate more spread overall.

      • Boxplots do not show modality (unimodal vs bimodal) well.
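
To make the fence and whisker rules concrete, here is a minimal sketch that takes the quartiles as inputs and works out what a boxplot would draw. The function name `boxplot_stats` is illustrative. The data are the six values from the median example, with Q1 = 4 and Q3 = 12 assumed (the medians of the lower and upper halves).

```python
def boxplot_stats(values, q1, q3):
    """Fences, whisker ends, and outliers for given quartiles q1 and q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "fences": (lower_fence, upper_fence),
        # Whiskers stop at the most extreme data points inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


data = [3, 4, 5, 8, 12, 100]
stats = boxplot_stats(data, q1=4, q3=12)
print(stats)
```

Note that the whiskers end at actual data values (3 and 12 here), not at the fences themselves, and that 100 is plotted separately as an outlier.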

  • Standard deviation vs interquartile range:

    • IQR describes the spread of the central 50% and is robust to outliers.

    • Standard deviation describes the typical distance of data from the mean and is measured in the same units as the data.

    • The standard deviation is often preferred when discussing distributions that are roughly bell-shaped and not heavily skewed.

  • Quick example of standard deviation (small dataset):

    • Dataset: $\{1, 2, 2, 7\}$.

    • Mean: $\bar{x} = \frac{1+2+2+7}{4} = \frac{12}{4} = 3$.

    • Deviations from the mean: $\{-2, -1, -1, 4\}$.

    • Population variance would be $\sigma^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{4}(4+1+1+16) = \frac{22}{4} = 5.5$.

    • Sample variance: divide by $n-1 = 3$ to get $s^2 = \frac{1}{3}(4+1+1+16) = \frac{22}{3} \approx 7.33$.

    • Sample standard deviation: $s = \sqrt{7.33} \approx 2.71$.
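
The same arithmetic can be checked step by step in Python and then compared against the standard `statistics` module, whose `pvariance` divides by n and whose `variance` and `stdev` divide by n - 1.

```python
from statistics import mean, pvariance, stdev, variance

data = [1, 2, 2, 7]

# Step-by-step version, mirroring the hand calculation.
x_bar = mean(data)
deviations = [x - x_bar for x in data]
sum_sq = sum(d ** 2 for d in deviations)

pop_var = sum_sq / len(data)            # divide by n
sample_var = sum_sq / (len(data) - 1)   # divide by n - 1
sample_sd = sample_var ** 0.5

print(x_bar, deviations, sum_sq)
print(pop_var, round(sample_var, 2), round(sample_sd, 2))

# Library versions for comparison.
print(pvariance(data), round(variance(data), 2), round(stdev(data), 2))
```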

  • Practical interpretation:

    • Standard deviation gives a sense of how far observations typically lie from the mean, in the same units as the data (e.g., inches, centimeters).

    • The 68% rule and the 95% rule (empirical rules) apply best to symmetric, bell-shaped data:

      • About 68% of observations lie within one standard deviation of the mean.

      • About 95% lie within two standard deviations of the mean.
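
A quick simulation can illustrate both the rule and its limits. This is a sketch on simulated data, not course data: the mean of 170, the SD of 10, the sample size, and the seed are arbitrary choices, and the exponential sample is included only as an example of a strongly right-skewed shape.

```python
import random
from statistics import mean, stdev


def proportion_within(values, k):
    """Proportion of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return sum(1 for v in values if abs(v - m) <= k * s) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped data.
bell = [rng.gauss(170, 10) for _ in range(10_000)]
p1 = proportion_within(bell, 1)
p2 = proportion_within(bell, 2)
print(round(p1, 3), round(p2, 3))

# Simulated right-skewed data: the 68% figure no longer describes it well.
skewed = [rng.expovariate(1.0) for _ in range(10_000)]
p1_skewed = proportion_within(skewed, 1)
print(round(p1_skewed, 3))
```

For the bell-shaped sample the two proportions land close to 68% and 95%; for the skewed sample the one-SD proportion is noticeably higher, which is why the rule is stated for symmetric, bell-shaped data.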

  • Summary:

    • Center and spread are essential summaries.

    • Depending on the data shape and outliers, you may prefer mean/SD or median/IQR for describing a distribution.

Two-Variable Representations: Scatter Plots

  • A scatter plot is used to explore relationships between two variables.

  • In a scatter plot:

    • The horizontal axis (x-axis) shows one variable (e.g., age).

    • The vertical axis (y-axis) shows the other variable (e.g., FEV).

    • Each point represents one case in the dataset.

  • What to look for in a scatter plot:

    • General form of the relationship: linear, nonlinear, or no relationship.

    • Directionality: positive associations (both variables increase together) vs negative associations (one increases while the other decreases).

    • Strength of the association: how tightly clustered the points are around a pattern.

    • Nonlinear patterns: some relationships may bend or curve (less often analyzed in introductory stats).

    • Anomalies: clusters, multiple clusters, and outliers can be visible as unusual groupings or isolated points.
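
Direction and strength are judged by eye in these notes. One common numerical companion, which the notes themselves do not cover, is the Pearson correlation coefficient: its sign gives the direction and its magnitude (between 0 and 1) gives the strength of a linear association. The sketch below is offered only as an illustration, and the (age, FEV) pairs are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical (age, FEV) pairs, invented for illustration.
age = [5, 7, 9, 11, 13, 15]
fev = [1.2, 1.7, 2.1, 2.8, 3.1, 3.6]

r = pearson_r(age, fev)
print(round(r, 3))  # positive and close to 1: strong positive linear association

# Flipping the sign of y reverses the direction but not the strength.
r_neg = pearson_r(age, [-y for y in fev])
print(round(r_neg, 3))
```

A value near 0 would only indicate the absence of a linear pattern; a curved relationship can still be strong, so the plot should always be inspected as well.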

  • Cautions about interpretation:

    • Observational data: a scatter plot shows association, not causation. When we say X is associated with Y, it does not imply X causes Y.

  • Practical notes:

    • You can create multiple scatter plots for different pairs of variables when more than two variables exist.

    • A strong, tight cluster around a line suggests a strong linear association; a loose cluster indicates a weaker association.

    • If Y varies widely among cases with the same X value, then knowing X tells you little about Y there; if Y stays within a narrow range at each X value, X predicts Y closely, which indicates a strong association.

  • Additional visual cues:

    • Clusters can indicate subgroups or different populations within the data.

    • Outliers can stand out and may warrant further investigation.

  • Final takeaway: scatter plots are a first-pass tool to assess relationships and guide subsequent modeling decisions.

Quick Reference and Connections

  • Variable types recap:

    • Numerical: continuous vs discrete.

    • Categorical: nominal vs ordinal.

    • Binary variables (two categories) can be coded flexibly without changing the analysis results for many methods.

  • Histogram vs density histogram:

    • Both convey distribution shape, center, and spread; density histograms emphasize area and can handle varying bin widths more naturally.

  • Boxplots and the five-number summary:

    • Boxplot visualizes Q1, Q2 (median), Q3, and potential outliers via fences and whiskers.

  • Center and spread choices depend on data shape:

    • For symmetric, bell-shaped data, mean and median align and SD and IQR provide complementary spread information.

    • For skewed data, the median is often a better measure of center, and the mean may be pulled toward the tail.

  • Real-world relevance and ethics:

    • Understanding distributions helps in quality control, safety assessments (e.g., aircraft-wildlife collisions), medical interpretation (e.g., FEV), and policy decisions.

    • When interpreting, distinguish correlation from causation and be mindful of outliers, sampling bias, and measurement error.

  • Foundational principles:

    • Descriptive statistics summarize data succinctly but do not imply causation.

    • Visual data exploration (histograms, boxplots, scatter plots) guides hypotheses and modeling choices.

  • Notation recap:

    • Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

    • Median: the middle value (or average of the two middle values) in the ordered data.

    • Quartiles: Q1, Q2 (= median), Q3; IQR = Q3 - Q1

    • Boxplot fences: upper fence = $Q3 + 1.5\cdot\mathrm{IQR}$, lower fence = $Q1 - 1.5\cdot\mathrm{IQR}$

    • Standard deviation (sample): $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

    • Density histogram height: $\text{height} = \frac{\text{count in bin}}{N \cdot \text{bin width}}$

    • Area under a density histogram equals 1 (proportion across all data).