Comprehensive Study Notes: Data Recording, Variables, Histograms, Boxplots, and Scatter Plots

Data Recording and Variables

Data in this class are recorded as a dataset where each row is a single case (e.g., an individual subject, company, animal, or a specific event such as a collision between a bird and an airplane).
For each case, information is stored in columns called variables.
Example variable shown: num_engs (the number of engines of the airplane involved in the collision).
Important principle: variables describe characteristics of single cases, not aggregated statistics about the whole dataset.
Variables must be things that can differ from case to case. If a quantity is constant across all cases (like the overall average age), it is not a variable.
Distinguishing between variable types:
- Numerical variables: can take many numerical values and allow arithmetic operations.
  - Discrete numerical variables: take values from a countable set (e.g., 0, 1, 2, 3).
    - Example: number of engines (you can’t have 4.5 engines in this dataset).
  - Continuous numerical variables: can take any value within an interval (e.g., feet above ground level).
- Categorical variables: describe categories or groups.
  - Nominal: categories with no natural order (e.g., type of transportation: car, bus, train, bicycle).
  - Ordinal: categories with a natural order (e.g., political leanings: liberal, moderate, conservative).
Special-case examples from the dataset discussed:
- Bird strike data: "num_engs" is numerical and discrete.
- Feet above ground level: numerical and continuous.
- Cloud cover: recorded as categories (no clouds, overcast, etc.); this is categorical and ordinal because there’s a sensible order to cloud cover levels.
- Number of birds hit: listed as categories (0, 1, 2–10, 11–100, over 100); this is ordinal because there is a natural ordering.
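As a rough illustration of what "ordinal" means in practice, the sketch below stores the ordered bins for number of birds hit and sorts some observations by their rank. The observations and the helper name `level_rank` are made up for illustration and are not taken from the dataset.

```python
# Ordered bins as listed in the notes (ASCII hyphens used for the ranges);
# earlier in the list means fewer birds.
BIRDS_HIT_LEVELS = ["0", "1", "2-10", "11-100", "over 100"]


def level_rank(label):
    """Return the position of a label in the ordering (0 = lowest)."""
    return BIRDS_HIT_LEVELS.index(label)


# Hypothetical observations, one per case (not real data).
observed = ["2-10", "1", "over 100", "1", "0", "11-100"]

# Ordinal data can be sorted and compared by rank ...
sorted_observed = sorted(observed, key=level_rank)
print(sorted_observed)

# ... but the labels are not numbers, so averaging them is not meaningful.
print(level_rank("2-10") < level_rank("11-100"))
```

A nominal variable such as type of transportation would have no such ranking, so only counting the cases in each category makes sense for it.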
Note on two-category variables (e.g., gender): for many analyses it doesn’t matter which category you label first; either coding works similarly for binary variables.
Data examples used to illustrate concepts:
- Bird strike dataset: number of engines (numerical discrete); feet above ground level (numerical continuous); cloud cover (categorical ordinal); number of birds hit (categorical ordinal with ordered bins).
- FEV dataset (Forced Expiratory Volume): contains subjects’ age, FEV, and other variables; used to discuss the distribution of lung capacity.

Histograms and Class Intervals

A histogram is a graphical representation of a single variable.
Build by dividing the data into class intervals (bins):
- Example intervals for FEV: [0, 1), [1, 2), [2, 3), … (in liters).
Class intervals (bins) can be chosen arbitrarily; you decide how many intervals and how wide they should be based on what you want to communicate.
Process for a histogram:
- Create interval boundaries on the x-axis.
- Count how many observations fall into each interval and plot on the y-axis.
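The two steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-width intervals of the form [left, right); the function name `histogram_counts` and the FEV-like values are invented for the example and are not the real data.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [left, right)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        # Values outside the covered range are ignored in this sketch.
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Made-up FEV-like values in liters (illustrative only).
fev_values = [0.8, 1.4, 1.9, 2.1, 2.3, 2.7, 3.0, 3.2, 3.9, 4.6]
counts = histogram_counts(fev_values, start=0.0, width=1.0, n_bins=5)

# Text version of the histogram: one row per interval.
for i, c in enumerate(counts):
    left = 0.0 + i * 1.0
    print(f"[{left:.0f}, {left + 1.0:.0f}): {'#' * c} ({c})")
```

Changing `width` and `n_bins` is a quick way to see how the picture changes with the bin choice.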
Observations about bin choices:
- Fewer intervals (broader bins) give a smoother, simpler summary but may hide details.
- More intervals (narrow bins) reveal more detail but may show noise or random fluctuations.
- It’s common to try several bin configurations to communicate the main features (e.g., one peak vs multiple small bumps).
Shape concepts derived from histograms:
- Symmetric distributions (e.g., bell-shaped) have roughly equal tails on both sides.
- Right-skewed (positive skew): tail extends to the right.
- Left-skewed (negative skew): tail extends to the left.
- Skew direction is the direction of the longer tail.
Modality:
- Unimodal: a single peak.
- Bimodal: two distinct peaks (often suggests two subpopulations).
- Multimodal: three or more peaks.
Outliers: points far away from the main cluster; can indicate data entry errors, unusual observations, or interesting phenomena.
Comparing groups with histograms:
- For example, lung capacity (FEV) for smokers vs. non-smokers can be compared by overlaying or stacking histograms.
- In the example, smokers appear to have a higher average lung capacity on the histogram, suggesting a shift to the right relative to non-smokers.
Scale options on the y-axis:
- Frequency scale: y-axis shows actual counts (frequency) in each bin.
- Density scale: y-axis shows density so that the bar areas sum to 1 (the total proportion).
Density-scale histogram details:
- Height of a bar = (proportion in bin) / (bin width).
- Proportion in a bin = (count in bin) / N, where N is the total number of observations.
- Area of all bars equals 1, representing the total proportion of observations.
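A small sketch of the density-scale calculation, using invented counts. The function name `density_heights` is hypothetical. The second call merges two bins into one wider bin to show that the total area stays at 1 even when the widths differ.

```python
def density_heights(counts, bin_widths):
    """Bar height = (count / N) / bin width, for each bin."""
    n = sum(counts)
    return [(c / n) / w for c, w in zip(counts, bin_widths)]


def total_area(heights, bin_widths):
    """Sum of bar areas (height times width)."""
    return sum(h * w for h, w in zip(heights, bin_widths))


# Invented counts for bins [0,1), [1,2), [2,3), [3,4), [4,5).
equal_counts = [1, 2, 3, 3, 1]
equal_widths = [1.0, 1.0, 1.0, 1.0, 1.0]
equal_heights = density_heights(equal_counts, equal_widths)

# Same data, but [2,3) and [3,4) merged into one bin [2,4) of width 2.
merged_counts = [1, 2, 6, 1]
merged_widths = [1.0, 1.0, 2.0, 1.0]
merged_heights = density_heights(merged_counts, merged_widths)

print(equal_heights, total_area(equal_heights, equal_widths))
print(merged_heights, total_area(merged_heights, merged_widths))

# Proportion of observations in [2, 4) is the area of the merged bar.
proportion_2_to_4 = merged_heights[2] * merged_widths[2]
print(proportion_2_to_4)
```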
When to use density vs frequency:
- Density scale handles unequal bin widths more easily and provides a direct interpretation in terms of area (proportion within a range).
- In this course, bins are generally equal width, but the density interpretation remains useful.
Example interpretation with a density histogram:
- If the blue area corresponds to FEV between 0.5 and 1.0 L, the area equals the proportion of subjects in that range.
- The area between 2.0 and 4.0 L would be the proportion in that range, and the total area under the histogram is 1.
Practical interpretation: density-scale histograms let you move directly between the area of the bars over a range and the proportion of observations in that range.
Quick takeaways about histograms:
- Shape: symmetry, skew, number of peaks.
- Center and spread: where is the center (mean/median) and how spread out is the data (range, standard deviation, or IQR).
- Outliers: look for unusual observations beyond the main cluster.
- When comparing two groups, histograms can reveal shifts in distribution between groups.

Center and Spread: Mean, Median, and Variability

Mean (average):
- Definition: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$
- Intuition: the balancing point of the data if you imagine each observation as an equal weight placed on a number line.
Median (middle value):
- For odd n: the middle value after sorting.
- For even n: the average of the two middle values.
- Examples:
  - Dataset: $\{4, 8, 3, 5, 12\}$; sorted: $\{3, 4, 5, 8, 12\}$; median = $5$.
  - If we add a sixth value (e.g., 100): sorted becomes $\{3, 4, 5, 8, 12, 100\}$; median = $\frac{5+8}{2} = 6.5$.
Sensitivity to outliers:
- Mean can be dramatically affected by extreme values (outliers).
- Example: adding a value of 100 to the five-value dataset above changed the mean from 6.4 to 22, but the median changed much less (from 5 to 6.5).
- Conclusion: the median is more robust (less affected by outliers) than the mean.
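The comparison above can be checked with Python's standard `statistics` module. This is only a quick sketch using the five values from the median example plus the added value of 100.

```python
from statistics import mean, median

data = [4, 8, 3, 5, 12]
with_outlier = data + [100]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)

# How far each summary moved when the outlier was added.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before
print("shift in mean:", mean_shift, " shift in median:", median_shift)
```

The single extreme value moves the mean far more than it moves the median.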
Estimating mean and median from a histogram:
- Median: the point where half the data are to the left and half to the right (50% area on each side).
- Mean: the balancing point of the actual distribution (imagine balancing the histogram physically).
Skew and location of mean vs median:
- For symmetric distributions, mean and median coincide (both near the center).
- For right-skewed distributions, the mean is larger than the median (mean pulled toward the longer tail).
- For left-skewed distributions, the mean is smaller than the median.
Five-number summary (a compact numerical summary of a distribution):
- The five numbers are: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
- The interquartile range (IQR) = $Q3 - Q1$; not technically part of the five-number summary, but used to describe spread.
- Example process to compute a five-number summary:
  - Sort the data.
  - Find the overall median (Q2).
  - Find the median of the lower half (Q1) and the median of the upper half (Q3).
  - The min is the smallest value and the max is the largest value.
- Note: different software may compute quartiles with slightly different values (e.g., 33.5 vs 34); small differences are acceptable.
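The median-of-halves process can be sketched as follows. One assumption is made that the notes do not spell out: when the number of observations is odd, the overall median is left out of both halves. Other conventions exist, which is one reason software packages can disagree slightly. The function name `five_number_summary` is chosen for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least two observations. For odd n, the middle value is
    excluded from both halves (an assumed convention).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# The six values used in the median example earlier in these notes.
summary = five_number_summary([4, 8, 3, 5, 12, 100])
minimum, q1, q2, q3, maximum = summary
iqr = q3 - q1
print(summary, "IQR =", iqr)
```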
Boxplot (box-and-whiskers plot):
- Box spans from Q1 to Q3; a line inside the box marks the median (Q2).
- Whiskers extend to the most extreme data within 1.5 * IQR from the quartiles.
- Observations beyond the whiskers are plotted individually as outliers.
- Upper and lower fences (often drawn as dotted lines) are:
  - Upper fence = $Q3 + 1.5\cdot\mathrm{IQR}$
  - Lower fence = $Q1 - 1.5\cdot\mathrm{IQR}$
- Boxplot interpretation:
  - The box shows the middle 50% of the data.
  - The length of the box indicates the spread of the central portion; longer whiskers indicate more spread overall.
  - Boxplots do not show modality (unimodal vs bimodal) well.
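A minimal sketch of the fence rule, taking Q1 and Q3 as given inputs. The function names are hypothetical. The example values are the six numbers from the median example, with Q1 = 4 and Q3 = 12 assumed from the median-of-halves method.

```python
def boxplot_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def whiskers_and_outliers(values, q1, q3):
    """Return ((lower whisker, upper whisker), outliers).

    Whiskers end at the most extreme observations inside the fences;
    anything outside the fences is reported as an outlier.
    """
    lower, upper = boxplot_fences(q1, q3)
    inside = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return (min(inside), max(inside)), outliers


data = [3, 4, 5, 8, 12, 100]
fences = boxplot_fences(4, 12)
whiskers, outliers = whiskers_and_outliers(data, 4, 12)
print("fences:", fences)
print("whiskers:", whiskers)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data values, not at the fences themselves.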
Standard deviation vs interquartile range:
- IQR describes the spread of the central 50% and is robust to outliers.
- Standard deviation describes the typical distance of data from the mean and is measured in the same units as the data.
- The standard deviation is often preferred when discussing distributions that are roughly bell-shaped and not heavily skewed.
Quick example of standard deviation (small dataset):
- Dataset: $\{1, 2, 2, 7\}$.
- Mean: $\bar{x} = \frac{1+2+2+7}{4} = \frac{12}{4} = 3$.
- Deviations from the mean: $\{-2, -1, -1, 4\}$.
- Population variance would be $\sigma^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{4}(4+1+1+16) = \frac{22}{4} = 5.5.$
- Sample variance: divide by $n-1 = 3$ to get $s^2 = \frac{1}{3}(4+1+1+16) = \frac{22}{3} \approx 7.33.$
- Sample standard deviation: $s = \sqrt{7.33} \approx 2.71.$
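The arithmetic above can be reproduced step by step and cross-checked against Python's standard `statistics` module, where `pvariance` divides by n and `variance` and `stdev` divide by n - 1.

```python
from math import sqrt
from statistics import mean, pvariance, stdev, variance

data = [1, 2, 2, 7]
x_bar = mean(data)
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Hand calculation, following the steps in the notes.
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)
sample_sd = sqrt(sample_variance)

# Library versions for comparison.
print(population_variance, pvariance(data))
print(round(sample_variance, 2), round(variance(data), 2))
print(round(sample_sd, 2), round(stdev(data), 2))
```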
Practical interpretation:
- Standard deviation gives a sense of how far observations typically lie from the mean, in the same units as the data (e.g., inches, centimeters).
- The 68% rule and the 95% rule (empirical rules) apply best to symmetric, bell-shaped data:
  - About 68% of observations lie within one standard deviation of the mean.
  - About 95% lie within two standard deviations of the mean.
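One way to see the empirical rules at work is to simulate bell-shaped data and count. The sketch below is a simulation under assumptions, not an analysis of course data: the mean of 170, the standard deviation of 10, the sample size, and the seed are all arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped data (normal distribution, arbitrary parameters).
data = [random.gauss(170, 10) for _ in range(10_000)]

center = mean(data)
spread = stdev(data)


def fraction_within(values, center, spread, k):
    """Fraction of values within k standard deviations of the center."""
    inside = [v for v in values if abs(v - center) <= k * spread]
    return len(inside) / len(values)


within_one = fraction_within(data, center, spread, 1)
within_two = fraction_within(data, center, spread, 2)
print(f"within 1 SD: {within_one:.3f}")
print(f"within 2 SD: {within_two:.3f}")
```

For strongly skewed data the same counts can land noticeably away from 68% and 95%, which is why the rules are stated for bell-shaped distributions.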
Summary:
- Center and spread are essential summaries.
- Depending on the data shape and outliers, you may prefer mean/SD or median/IQR for describing a distribution.

Two-Variable Representations: Scatter Plots

A scatter plot is used to explore relationships between two variables.
In a scatter plot:
- The x-axis (horizontal) shows one variable (e.g., age).
- The y-axis (vertical) shows the other variable (e.g., FEV).
- Each point represents one case in the dataset.
What to look for in a scatter plot:
- General form of the relationship: linear, nonlinear, or no relationship.
- Directionality: positive associations (both variables increase together) vs negative associations (one increases while the other decreases).
- Strength of the association: how tightly clustered the points are around a pattern.
- Nonlinear patterns: some relationships may bend or curve (less often analyzed in introductory stats).
- Anomalies: clusters, multiple clusters, and outliers can be visible as unusual groupings or isolated points.
Cautions about interpretation:
- Observational data: a scatter plot shows association, not causation. When we say X is associated with Y, it does not imply X causes Y.
Practical notes:
- You can create multiple scatter plots for different pairs of variables when more than two variables exist.
- A strong, tight cluster around a line suggests a strong linear association; a loose cluster indicates a weaker association.
- If Y varies widely among cases that share the same X value, Y depends only weakly on X; if Y stays within a narrow range at each X value, the dependence is strong (Y can be predicted precisely from X).
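These notes describe direction and strength visually and do not define a numerical measure. As an assumption that goes beyond the notes, Pearson's correlation coefficient r is one common way to put a number on direction (the sign of r) and on the strength of a linear association (how close r is to 1 or -1). The sketch below computes it by hand on invented age and FEV-like values; none of the numbers come from the real dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values (one pair per case)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: each index is one case.
ages = [5, 7, 9, 11, 13, 15]
fev_tight = [1.2, 1.7, 2.2, 2.8, 3.3, 3.8]  # points close to a line
fev_loose = [1.5, 1.2, 3.0, 2.0, 3.9, 2.9]  # same trend, more scatter

r_tight = pearson_r(ages, fev_tight)
r_loose = pearson_r(ages, fev_loose)
r_negative = pearson_r(ages, list(reversed(fev_tight)))

print(f"tight cluster:   r = {r_tight:.3f}")
print(f"loose cluster:   r = {r_loose:.3f}")
print(f"decreasing in x: r = {r_negative:.3f}")
```

A number like r only summarizes linear patterns, so it complements the plot and does not replace looking at it.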
Additional visual cues:
- Clusters can indicate subgroups or different populations within the data.
- Outliers can stand out and may warrant further investigation.
Final takeaway: scatter plots are a first-pass tool to assess relationships and guide subsequent modeling decisions.

Quick Reference and Connections

Variable types recap:
- Numerical: continuous vs discrete.
- Categorical: nominal vs ordinal.
- Binary variables (two categories) can be coded flexibly without changing the analysis results for many methods.
Histogram vs density histogram:
- Both convey distribution shape, center, and spread; density histograms emphasize area and can handle varying bin widths more naturally.
Boxplots and the five-number summary:
- Boxplot visualizes Q1, Q2 (median), Q3, and potential outliers via fences and whiskers.
Center and spread choices depend on data shape:
- For symmetric, bell-shaped data, mean and median align and SD and IQR provide complementary spread information.
- For skewed data, the median is often a better measure of center, and the mean may be pulled toward the tail.
Real-world relevance and ethics:
- Understanding distributions helps in quality control, safety assessments (e.g., aircraft-wildlife collisions), medical interpretation (e.g., FEV), and policy decisions.
- When interpreting, distinguish correlation from causation and be mindful of outliers, sampling bias, and measurement error.
Foundational principles:
- Descriptive statistics summarize data succinctly but do not imply causation.
- Visual data exploration (histograms, boxplots, scatter plots) guides hypotheses and modeling choices.
Notation recap:
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
- Median: the middle value (or average of the two middle values) in the ordered data.
- Quartiles: Q1, Q2 (= median), Q3; IQR = $Q3 - Q1$
- Boxplot fences: upper fence = $Q3 + 1.5\cdot\mathrm{IQR}$, lower fence = $Q1 - 1.5\cdot\mathrm{IQR}$
- Standard deviation (sample): $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$
- Density histogram height: $\text{height} = \frac{\text{count in bin}}{N \cdot \text{bin width}}$
- Area under a density histogram equals 1 (proportion across all data).