Comprehensive Study Notes: Data Recording, Variables, Histograms, Boxplots, and Scatter Plots
Data Recording and Variables
Data in this class are recorded as a dataset where each row is a single case (e.g., an individual subject, company, animal, or a specific event such as a collision between a bird and an airplane).
For each case, information is stored in columns called variables.
Example variable shown: num Enjs (the number of engines of the airplane involved in the collision).
Important principle: variables describe characteristics of single cases, not aggregated statistics about the whole dataset.
Variables must be things that can differ from case to case. If a quantity is constant across all cases (like the overall average age), it is not a variable.
Distinguishing between variable types:
Numerical variables: can take many numerical values and allow arithmetic operations.
Discrete numerical variables: take values from a countable set (e.g., 0, 1, 2, 3).
Example: number of engines (you can’t have 4.5 engines in this dataset).
Continuous numerical variables: can take any value within an interval (e.g., feet above ground level).
Categorical variables: describe categories or groups.
Nominal: categories with no natural order (e.g., type of transportation: car, bus, train, bicycle).
Ordinal: categories with a natural order (e.g., political leanings: liberal, moderate, conservative).
Special-case examples from the dataset discussed:
Bird strike data: "num Enjs" is numerical and discrete.
Feet above ground level: numerical and continuous.
Cloud cover: measured and classified (no clouds, overcast, etc.); this is categorical and ordinal because there’s a sensible order to cloud cover levels.
Number of birds hit: listed as categories (0, 1, 2–10, 11–100, over 100); this is ordinal because there is a natural ordering.
Note on two-category variables (e.g., gender): for many analyses it doesn’t matter which category you label first; either coding works similarly for binary variables.
Data examples used to illustrate concepts:
Bird strike dataset: number of engines (numerical discrete); feet above ground level (numerical continuous); cloud cover (categorical ordinal); number of birds hit (categorical ordinal with ordered bins).
In the FEV dataset (Forced Expiratory Volume): contains subjects’ age, FEV, and other variables; used to discuss distribution of lung capacity.
Histograms and Class Intervals
A histogram is a graphical representation of a single variable.
Build by dividing the data into class intervals (bins):
Example intervals for FEV: [0, 1), [1, 2), [2, 3), … (in liters).
Class intervals (bins) can be chosen arbitrarily; you decide how many intervals and how wide they should be based on what you want to communicate.
Process for a histogram:
Create interval boundaries on the x-axis.
Count how many observations fall into each interval and plot on the y-axis.
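The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the helper name `bin_counts` and the FEV readings are made up, and the intervals are left-closed like [0, 1), [1, 2), and so on.

```python
import math

def bin_counts(values, width=1.0, start=0.0):
    """Count observations in left-closed intervals [left, left + width)."""
    counts = {}
    for v in values:
        k = math.floor((v - start) / width)  # index of the interval containing v
        left = start + k * width             # left boundary of that interval
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical FEV readings in liters (invented for illustration)
fev = [0.8, 1.2, 1.9, 2.0, 2.4, 2.7, 3.1, 3.3, 3.9, 4.6]
counts = bin_counts(fev, width=1.0)

# Text version of the histogram: one '#' per observation in the bin
for left, c in counts.items():
    print(f"[{left:.0f}, {left + 1:.0f}): {'#' * c} ({c})")
```

Changing `width` is a quick way to see how broader or narrower bins change the picture. Note that this sketch only lists bins that contain at least one observation.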
Observations about bin choices:
Fewer intervals (broader bins) give a smoother, simpler summary but may hide details.
More intervals (narrow bins) reveal more detail but may show noise or random fluctuations.
It’s common to try several bin configurations to communicate the main features (e.g., one peak vs multiple small bumps).
Shape concepts derived from histograms:
Symmetric (bell-shaped) distributions have roughly equal tails on both sides.
Right-skewed (positive skew): tail extends to the right.
Left-skewed (negative skew): tail extends to the left.
Skew direction is the direction of the longer tail.
Modality:
Unimodal: a single peak.
Bimodal: two distinct peaks (often suggests two subpopulations).
Multimodal: three or more peaks.
Outliers: points far away from the main cluster; can indicate data entry errors, unusual observations, or interesting phenomena.
Comparing groups with histograms:
For example, lung capacity (FEV) for smokers vs. non-smokers can be compared by overlaying or stacking histograms.
In the example, smokers appear to have a higher average lung capacity on the histogram, suggesting a shift to the right relative to non-smokers.
Scale options on the y-axis:
Frequency scale: y-axis shows actual counts (frequency) in each bin.
Density scale: y-axis shows density so that the bar areas sum to 1 (the total proportion).
Density-scale histogram details:
Height of a bar = (proportion in bin) / (bin width).
Proportion in a bin = (count in bin) / N, where N is the total number of observations.
Area of all bars equals 1, representing the total proportion of observations.
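As a small check of these formulas, the sketch below converts bin counts to density heights and confirms that the bar areas add up to 1, including a case with unequal bin widths. The counts and widths are hypothetical, and `density_heights` is an illustrative helper name.

```python
def density_heights(counts, widths):
    """Height of each bar = (count / N) / bin width."""
    n = sum(counts)
    return [c / (n * w) for c, w in zip(counts, widths)]

def total_area(heights, widths):
    """Sum of bar areas (height * width)."""
    return sum(h * w for h, w in zip(heights, widths))

# Equal-width bins (hypothetical counts for five 1-liter bins)
counts_equal = [1, 2, 3, 3, 1]
widths_equal = [1.0] * 5
heights_equal = density_heights(counts_equal, widths_equal)

# Unequal-width bins: a 1-liter bin followed by a 4-liter bin
counts_unequal = [3, 7]
widths_unequal = [1.0, 4.0]
heights_unequal = density_heights(counts_unequal, widths_unequal)

print(heights_equal, total_area(heights_equal, widths_equal))
print(heights_unequal, total_area(heights_unequal, widths_unequal))
```

In the unequal-width case the wider bin holds 70% of the observations but has the lower bar, because its proportion is spread over four times the width.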
When to use density vs frequency:
Density scale handles unequal bin widths more easily and provides a direct interpretation in terms of area (proportion within a range).
In this course, bins are generally equal width, but the density interpretation remains useful.
Example interpretation with a density histogram:
If the blue area corresponds to FEV between 0.5 and 1.0 L, the area equals the proportion of subjects in that range.
The area between 2.0 and 4.0 L would be the proportion in that range, and the total area under the histogram is 1.
Practical interpretation: density-scale histograms let you read the proportion of observations in a range directly as the area of the bars over that range.
Quick takeaways about histograms:
Shape: symmetry, skew, number of peaks.
Center and spread: where is the center (mean/median) and how spread out is the data (range, standard deviation, or IQR).
Outliers: look for unusual observations beyond the main cluster.
When comparing two groups, histograms can reveal shifts in distribution between groups.
Center and Spread: Mean, Median, and Variability
Mean (average):
Definition: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$
Intuition: the balancing point of the data, if you imagine each observation as an equal weight placed at its value on a number line.
Median (middle value):
For odd n: the middle value after sorting.
For even n: the average of the two middle values.
Examples:
Dataset: $\{4, 8, 3, 5, 12\}$ sorted: $\{3, 4, 5, 8, 12\}$; median = 5.
If we add a sixth value (e.g., 100): sorted becomes $\{3, 4, 5, 8, 12, 100\}$; median = $\frac{5+8}{2} = 6.5$.
Sensitivity to outliers:
Mean can be dramatically affected by extreme values (outliers).
In the example above, adding a value of 100 changed the mean from 6.4 to 22, but the median changed much less (from 5 to 6.5).
Conclusion: the median is more robust (less affected by outliers) than the mean.
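The numbers in this example can be reproduced with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

data = [4, 8, 3, 5, 12]
with_outlier = data + [100]  # same data plus one extreme value

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# How far each summary moved when the outlier was added
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves by more than 15 units while the median moves by 1.5, which is the robustness property described above.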
Estimating mean and median from a histogram:
Median: the point where half the data are to the left and half to the right (50% area on each side).
Mean: the balancing point of the actual distribution (imagine balancing the histogram physically).
Skew and location of mean vs median:
For symmetric distributions, mean and median coincide (both near the center).
For right-skewed distributions, the mean is larger than the median (mean pulled toward the longer tail).
For left-skewed distributions, the mean is smaller than the median.
Five-number summary (a compact numerical summary of a distribution):
The five numbers are: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
The interquartile range (IQR) = Q3 - Q1; not technically part of the five-number summary, but used to describe spread.
Example process to compute a five-number summary:
Sort the data.
Find the overall median (Q2).
Find the median of the lower half (Q1) and the median of the upper half (Q3).
The min is the smallest value and the max is the largest value.
Note: different software may compute quartiles with slightly different values (e.g., 33.5 vs 34); small differences are acceptable.
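A minimal sketch of the process above, using the median-of-halves approach. One assumption is made here that the notes do not specify: when n is odd, the middle value is left out of both halves. Other conventions exist, which is one reason software packages can report slightly different quartiles. The function name is illustrative, and it needs at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # values below the median position
    upper = xs[half + (n % 2):]  # skip the middle value when n is odd (assumed convention)
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

summary = five_number_summary([4, 8, 3, 5, 12, 100])
print(summary)  # (3, 4, 6.5, 12, 100)

q1, q3 = summary[1], summary[3]
print("IQR =", q3 - q1)
```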
Boxplot (box-and-whiskers plot):
Box spans from Q1 to Q3; a line inside the box marks the median (Q2).
Whiskers extend to the most extreme data points that still lie within 1.5 * IQR of the quartiles (that is, between the fences defined below).
Observations beyond the whiskers are plotted individually as outliers.
Upper and lower fences (often drawn as dotted lines) are:
Upper fence = $Q3 + 1.5\cdot\mathrm{IQR}$
Lower fence = $Q1 - 1.5\cdot\mathrm{IQR}$
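To make the fence rule concrete, here is a short sketch that computes the fences, flags outliers, and finds where the whiskers end. The data are the six values from the median example, and the quartiles Q1 = 4 and Q3 = 12 are assumed to come from the median-of-halves method; other quartile conventions could give slightly different fences.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the two quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [3, 4, 5, 8, 12, 100]
q1, q3 = 4, 12  # assumed quartiles (median of lower and upper halves)

lower, upper = fences(q1, q3)

# Points beyond the fences are drawn individually as outliers
outliers = [x for x in data if x < lower or x > upper]

# Whiskers stop at the most extreme points that are still inside the fences
inside = [x for x in data if lower <= x <= upper]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower, upper)
print("outliers:", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```

Here the whiskers end at actual data values (3 and 12), not at the fences themselves.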
Boxplot interpretation:
The box shows the middle 50% of the data.
The length of the box indicates the spread of the central portion; longer whiskers indicate more spread overall.
Boxplots do not show modality (unimodal vs bimodal) well.
Standard deviation vs interquartile range:
IQR describes the spread of the central 50% and is robust to outliers.
Standard deviation describes the typical distance of data from the mean and is measured in the same units as the data.
The standard deviation is often preferred when discussing distributions that are roughly bell-shaped and not heavily skewed.
Quick example of standard deviation (small dataset):
Dataset: $\{1, 2, 2, 7\}$.
Mean: $\bar{x} = \frac{1+2+2+7}{4} = \frac{12}{4} = 3$.
Deviations from the mean: $\{-2, -1, -1, 4\}$.
Population variance would be $\sigma^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{4}(4+1+1+16) = \frac{22}{4} = 5.5$.
Sample variance: divide by $n-1 = 3$ to get $s^2 = \frac{1}{3}(4+1+1+16) = \frac{22}{3} \approx 7.33$.
Sample standard deviation: $s = \sqrt{7.33} \approx 2.71$.
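The same arithmetic can be checked by hand in code and against Python's standard `statistics` module, where `pvariance` divides by n and `variance`/`stdev` divide by n - 1. Variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance, variance, stdev

data = [1, 2, 2, 7]
n = len(data)
x_bar = sum(data) / n

# Step by step, mirroring the worked example
deviations = [x - x_bar for x in data]
sum_sq = sum(d ** 2 for d in deviations)
pop_var = sum_sq / n           # divide by n
sample_var = sum_sq / (n - 1)  # divide by n - 1
sample_sd = sqrt(sample_var)

print(deviations, sum_sq)
print(pop_var, round(sample_var, 2), round(sample_sd, 2))

# The library functions should agree with the hand computation
print(pvariance(data), variance(data), stdev(data))
```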
Practical interpretation:
Standard deviation gives a sense of how far observations typically lie from the mean, in the same units as the data (e.g., inches, centimeters).
The 68% rule and the 95% rule (empirical rules) apply best to symmetric, bell-shaped data:
About 68% of observations lie within one standard deviation of the mean.
About 95% lie within two standard deviations of the mean.
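One way to see these rules in action is to simulate bell-shaped data and count. The sketch below uses simulated normal values (mean 50, SD 10, both chosen arbitrarily), not course data, so the exact proportions will differ slightly from 68% and 95%.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
data = [random.gauss(50, 10) for _ in range(20000)]  # simulated, bell-shaped

m, s = mean(data), stdev(data)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

within_1 = share_within(1)
within_2 = share_within(2)
print(f"within 1 SD: {within_1:.3f}, within 2 SD: {within_2:.3f}")
```

Repeating this with strongly skewed data would generally give proportions further from 68% and 95%, which is why the rules are tied to symmetric, bell-shaped distributions.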
Summary:
Center and spread are essential summaries.
Depending on the data shape and outliers, you may prefer mean/SD or median/IQR for describing a distribution.
Two-Variable Representations: Scatter Plots
A scatter plot is used to explore relationships between two variables.
In a scatter plot:
The x-axis shows the variable on the horizontal axis (e.g., age).
The y-axis shows the variable on the vertical axis (e.g., FEV).
Each point represents one case in the dataset.
What to look for in a scatter plot:
General form of the relationship: linear, nonlinear, or no relationship.
Directionality: positive associations (both variables increase together) vs negative associations (one increases while the other decreases).
Strength of the association: how tightly clustered the points are around a pattern.
Nonlinear patterns: some relationships may bend or curve (less often analyzed in introductory stats).
Anomalies: clusters, multiple clusters, and outliers can be visible as unusual groupings or isolated points.
Cautions about interpretation:
Observational data: a scatter plot shows association, not causation. When we say X is associated with Y, it does not imply X causes Y.
Practical notes:
You can create multiple scatter plots for different pairs of variables when more than two variables exist.
A strong, tight cluster around a line suggests a strong linear association; a loose cluster indicates a weaker association.
If the Y values at a fixed value of X span a wide range, Y depends only weakly on X there; if they fall in a tight range, knowing X pins down Y fairly precisely, which suggests a strong dependence.
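The idea of looking at the spread of Y at a fixed X can be sketched numerically. The (age, FEV) pairs below are invented for illustration, and `y_range_by_x` is a hypothetical helper, not something from the course materials.

```python
from collections import defaultdict

def y_range_by_x(points):
    """For each distinct x, return the range (max - min) of the y values seen there."""
    groups = defaultdict(list)
    for x, y in points:
        groups[x].append(y)
    return {x: max(ys) - min(ys) for x, ys in sorted(groups.items())}

# Hypothetical (age in years, FEV in liters) pairs
points = [
    (8, 1.5), (8, 1.7), (8, 1.6),     # tight cluster of FEV values at age 8
    (12, 2.0), (12, 3.4), (12, 2.7),  # much wider spread at age 12
]

spread = y_range_by_x(points)
for age, r in spread.items():
    print(f"age {age}: FEV range = {r:.1f} L")
```

In this made-up data, age says more about FEV at age 8 (narrow range) than at age 12 (wide range).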
Additional visual cues:
Clusters can indicate subgroups or different populations within the data.
Outliers can stand out and may warrant further investigation.
Final takeaway: scatter plots are a first-pass tool to assess relationships and guide subsequent modeling decisions.
Quick Reference and Connections
Variable types recap:
Numerical: continuous vs discrete.
Categorical: nominal vs ordinal.
Binary variables (two categories) can be coded flexibly without changing the analysis results for many methods.
Histogram vs density histogram:
Both convey distribution shape, center, and spread; density histograms emphasize area and can handle varying bin widths more naturally.
Boxplots and the five-number summary:
Boxplot visualizes Q1, Q2 (median), Q3, and potential outliers via fences and whiskers.
Center and spread choices depend on data shape:
For symmetric, bell-shaped data, mean and median align and SD and IQR provide complementary spread information.
For skewed data, the median is often a better measure of center, and the mean may be pulled toward the tail.
Real-world relevance and ethics:
Understanding distributions helps in quality control, safety assessments (e.g., aircraft-wildlife collisions), medical interpretation (e.g., FEV), and policy decisions.
When interpreting, distinguish correlation from causation and be mindful of outliers, sampling bias, and measurement error.
Foundational principles:
Descriptive statistics summarize data succinctly but do not imply causation.
Visual data exploration (histograms, boxplots, scatter plots) guides hypotheses and modeling choices.
Notation recap:
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
Median: the middle value (or average of the two middle values) in the ordered data.
Quartiles: Q1, Q2 (= median), Q3; IQR = Q3 - Q1
Boxplot fences: upper fence = $Q3 + 1.5\cdot\mathrm{IQR}$, lower fence = $Q1 - 1.5\cdot\mathrm{IQR}$
Standard deviation (sample): $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$
Density histogram height: $\text{height} = \frac{\text{count in bin}}{N \cdot \text{bin width}}$
Area under a density histogram equals 1 (proportion across all data).