Video Lecture Notes Review: Chapters 1–8 (Data Distributions, Displays, and Relationships)

Chapter 1 recap: data basics and terminology

Observations, measurements, and data organization
- Observations are data points collected about variables; data are organized to reveal patterns.
- Key distinctions: observations vs. variables, and experimental vs. observational data.
Frequency distributions as a key organizing tool
- Purpose: show how often each value of a variable occurs.
- Frequency distribution components: a table listing each value (or interval) and its frequency (count).
- Visual representations: simple bar charts and histograms to visualize frequencies.
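
A frequency table can be sketched in a few lines of Python. The sibling counts below are made-up illustration data, not values from the lecture, and the text "bars" only stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical data: number of siblings reported by 12 students
siblings = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 4, 1]

# Count how often each value occurs (the frequency distribution)
freq = Counter(siblings)

# Print the table in order of value, with a row of '#' as a text bar
print("value | count | bar")
for value in sorted(freq):
    print(f"{value:>5} | {freq[value]:>5} | {'#' * freq[value]}")
```

The counts always sum to the number of observations, which is a quick check that no value was dropped or counted twice.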
Bar charts vs histograms
- Bar charts: used for nominal and ordinal variables; gaps between bars; values on the x-axis are categories (nominal) or ordered categories (ordinal).
- Histograms: used for scale (continuous) variables; no gaps between bars; x-axis represents the variable values (or intervals/bins).
- Question to consider: which one has gaps? Answer: Bar charts have gaps; histograms do not.
Variables and types
- Nominal: categories with no natural order; typically visualized with bar charts.
- Ordinal: ordered categories; often visualized with bar charts, but a histogram may be used when the variable is treated as a scale variable and binned.
- Scale (continuous): numerical values with meaningful distance; typically visualized with histograms or frequency polygons.
Grouped frequency tables
- When there are many distinct values, group into bins (evenly spaced or logically defined) to create a meaningful, readable table.
- Important details: define bin boundaries precisely to avoid double-counting values at cutoffs.
Precision and bin edges
- Use the appropriate decimal places to delineate bins so the binning is consistent with measurement precision.
- Example reasoning: if a scale can report to two decimals, cutoffs should reflect that precision.
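
One way to avoid double-counting at cutoffs is to treat each bin as a half-open interval, so a value sitting exactly on a boundary belongs to the higher bin only. The sketch below assumes that convention; the weights, bin start, and bin width are hypothetical, and the printed labels use two decimals to match a scale that reports to two decimals.

```python
import math

# Hypothetical weights in kg, recorded to two decimal places
weights = [50.00, 52.35, 54.99, 55.00, 57.10, 59.99, 60.00, 63.42, 64.99]

def grouped_frequencies(values, start, width, n_bins):
    """Count values in half-open bins [lower, lower + width)."""
    counts = [0] * n_bins
    for v in values:
        index = math.floor((v - start) / width)
        if 0 <= index < n_bins:  # ignore values outside the table
            counts[index] += 1
    return counts

start, width = 50.0, 5.0
counts = grouped_frequencies(weights, start, width, n_bins=3)

# Label each bin at the measurement precision: 50.00-54.99, 55.00-59.99, ...
for i, count in enumerate(counts):
    lower = start + width * i
    print(f"{lower:.2f}-{lower + width - 0.01:.2f}: {count}")
```

Here 54.99 falls in the first bin and 55.00 in the second, so no value is counted in two bins.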

The normal distribution: shape, properties, and interpretation

Shape and key properties
- The normal distribution is bell-shaped, symmetric, and unimodal (one peak).
- For a normal distribution, the mean, median, and mode are all equal, because the curve is symmetric with a single peak.
- The area under the curve is what many statistics rely on to define probabilities/percentiles.
Normal curve math (conceptual, not just pictures)
- The curve is described by a mathematical formula that allows computing areas under the curve.
- The normal probability density function (PDF) is given by:
  f(x\mid \mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
- Boundaries placed a given number of standard deviations on either side of the mean enclose a fixed share of the area under the curve; values near the mean are the most common.
Standard deviations and empirical rule (68-95-99.7)
- Approximately 68% of observations lie within one standard deviation of the mean: P(|X-\mu|\le \sigma) \approx 0.68.
- Approximately 95% lie within two standard deviations: P(|X-\mu|\le 2\sigma) \approx 0.95.
- Approximately 99.7% lie within three standard deviations: P(|X-\mu|\le 3\sigma) \approx 0.997.
Example used in class
- If a normally distributed variable has mean \mu=100 and standard deviation \sigma=15:
- About 68% fall between [\mu-\sigma, \mu+\sigma] = [85, 115].
- About 95% fall between [\mu-2\sigma, \mu+2\sigma] = [70, 130].
- About 99.7% fall between [\mu-3\sigma, \mu+3\sigma] = [55, 145].
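
The 68-95-99.7 figures can be checked numerically. The sketch below uses only Python's standard library; the helper names (normal_pdf, normal_cdf, prob_within) are illustrative, and the cumulative area is computed with the error function rather than a table.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x (the PDF formula above)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def prob_within(k, mu, sigma):
    """Area within k standard deviations of the mean."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)

mu, sigma = 100, 15
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"[{low}, {high}]: {prob_within(k, mu, sigma):.4f}")
```

The areas come out close to 0.68, 0.95, and 0.997, and they do not depend on the particular \mu and \sigma chosen.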
Tails and asymptotes
- The normal curve is defined mathematically to extend indefinitely; tails approach the x-axis but never touch it (asymptotic tails).
Real-world implications and limits of the curve
- The mean is a measure of central tendency that coincides with the most common value (the mode) in a symmetric normal distribution.
- In practice, many real-world distributions are not perfectly normal; skewness and outliers are common.

Skewness, tails, and outliers

Skewness intuition
- Skew occurs when the distribution is stretched more toward one tail than the other.
- Direction of skew is described by the direction of the tail:
- Negative skew (left-skew): tail extends to the left (toward smaller values).
- Positive skew (right-skew): tail extends to the right (toward larger values).
The “whale” analogy for skew direction (informal visualization)
- A hump (the main body) plus a tail that points in the direction of skew helps visualize left vs right tails.
Outliers and their impact
- Outliers (like Warren Buffett in a sample of Omaha households) can pull the tail of the distribution and distort the mean.
- An extreme high value stretches the right tail and produces positive skew; an extreme low value stretches the left tail and produces negative skew.
- Outliers affect measures of central tendency and dispersion, complicating interpretation and statistical analyses.
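
The pull of a single outlier on the mean can be seen with a small made-up sample. The incomes below are hypothetical (in thousands of dollars) and the extreme value simply stands in for a Buffett-style household.

```python
import statistics

# Hypothetical household incomes, in thousands of dollars
incomes = [42, 48, 51, 55, 58, 61, 64, 70, 75]

# Same sample with one extreme household added
with_outlier = incomes + [1_000_000]

for label, data in (("without outlier", incomes), ("with outlier", with_outlier)):
    print(f"{label}: mean = {statistics.mean(data):,.1f}, "
          f"median = {statistics.median(data):,.1f}")
```

The median barely moves while the mean jumps far above every ordinary household, which is the signature of a positive (right) skew.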
Floor and ceiling effects
- Floor effect: data cannot go below a certain minimum (e.g., ACT scores cannot fall below 1; GPA cannot be negative).
- Ceiling effect: data cannot exceed a maximum (e.g., SAT/ACT upper bounds, GPA capped at 4.0, five-point scales in some schools).
- These effects can distort the apparent distribution and limit the ability to distinguish high- or low-performing individuals.
Concrete classroom examples
- GPA example: 0.0 to 4.0 scale; some schools use 5.0 scales with AP credits; the floor and ceiling define the range of values that can actually be observed.
- ACT example: scores go from 1 to 36 (or similar bounds in the course context); floor/ceiling considerations affect interpretation.
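
A ceiling effect can be simulated by clamping scores to the allowed range. This is only a sketch: the "true" scores are random draws with an assumed mean of 3.6 and SD of 0.5, chosen so that some students would exceed 4.0 if the scale allowed it.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" performance on a GPA-like scale, allowed to exceed 4.0
true_scores = [random.gauss(3.6, 0.5) for _ in range(1000)]

# The recorded GPA is clamped to the 0.0 to 4.0 range (floor and ceiling)
recorded = [min(max(score, 0.0), 4.0) for score in true_scores]

at_ceiling = sum(1 for gpa in recorded if gpa == 4.0)
print(f"Highest true score:    {max(true_scores):.2f}")
print(f"Highest recorded GPA:  {max(recorded):.2f}")
print(f"Students stuck at 4.0: {at_ceiling} of {len(recorded)}")
```

Everyone above the cap is recorded as exactly 4.0, so the scale can no longer tell the strongest students apart.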

Stem-and-leaf plots

What they are and when they’re useful
- Stem-and-leaf plots organize small datasets visually by separating the leading digits (stems) from the trailing digits (leaves).
- Helpful for very small datasets where you want a quick sense of the distribution without performing full numerical summaries.
Example from slides (minutes spent in the shower by women)
- Stems: 0 through 6 (representing 0.x to 6.x minutes); leaves represent decimal parts.
- Reading: each leaf corresponds to a value with the given stem as the leading digit(s).
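
A stem-and-leaf plot for one-decimal data can be built by splitting each value into its whole-number part and its tenths digit. The shower times below are hypothetical stand-ins, not the values from the slides.

```python
from collections import defaultdict

# Hypothetical shower times in minutes, recorded to one decimal place
times = [0.8, 1.5, 2.2, 2.9, 3.1, 3.4, 3.4, 4.0, 4.6, 5.3, 6.7]

def stem_and_leaf(values):
    """Map each stem (whole minutes) to its sorted list of leaves (tenths)."""
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)           # 3.4 -> 34, avoids float truncation
        stem, leaf = divmod(tenths, 10)  # 34 -> stem 3, leaf 4
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(times)

# Print every stem in range, even if it has no leaves
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

The row "3 | 144" reads as 3.1, 3.4, and 3.4 minutes; the longest row shows where the data pile up.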
Multi-population stem plots
- Can compare distributions across groups (e.g., men vs. women) by displaying side-by-side stems/leaves.
Cautions
- Not particularly scalable to large datasets; for small samples, they can be quick and informative.
Practical notes from the lecture
- The instructor considered stem-and-leaf useful for very small datasets; for larger datasets, other displays are preferred.

Displays of data and depictions of deception

The chapter on displays is useful beyond being an analytic tool; it helps you read and interpret how data are presented and how they can be manipulated.
The instructor emphasizes “how to lie with data” as a defensive approach: learn the tricks so you can recognize and resist them.
Seven kinds of deceptive practices discussed (including outright lies):
- False face validity: data seem related to a claim but aren’t necessarily the appropriate indicator.
- Interpolation: drawing a smooth line between observed points as if the values in between were known; missing or inconvenient points disappear and the data look smoother than they are.
- Extrapolation: making claims beyond the observed range; assumes the same trend continues.
- Biased scales: the way questions are framed or the response scale nudges answers toward a desired direction.
- Sneaky samples: sampling from a non-representative or otherwise biased subgroup (e.g., sampling only long-term meal-plan subscribers).
- Inaccurate values (misleading units or labeling): misrepresenting unit scales to overstate or understate changes.
- Outright lies: claims that are simply false; there is little defense against them beyond careful reading.
Examples used in class
- Enrollment reported as up 18%, but the underlying metric was credit hours, not headcount, which changes the interpretation.
- A survey claiming 100% of students agree faculty are exceptional, achieved by restricting the sample or response options to create a bias toward the positive answer.
- A cafeteria satisfaction study that sampled only residents in dormitories, biasing results toward a more favorable impression.
- Interpolated retention rates across missing years (e.g., 2016–2018 data missing) to imply a trend that isn’t fully supported by the data.
- Tuition-cost visuals that label a change as 5% while the axis scale makes the same dollar change look much larger or smaller than it is.
Practical takeaways
- Always consider sample representativeness, question wording, and axis labels when interpreting graphs or reporting statistics.
- Be wary of how data visuals can obscure or exaggerate true relationships.
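
The tuition-axis example can be made concrete by comparing the real percent change with how tall the bars are drawn. The dollar amounts and the truncated axis minimum below are hypothetical, and drawn_height_ratio is an illustrative helper, not a standard measure from the lecture.

```python
def percent_change(old, new):
    """Actual change as a percentage of the old value."""
    return (new - old) / old * 100

def drawn_height_ratio(old, new, axis_min):
    """How many times taller the new bar is drawn when the y-axis starts at axis_min."""
    return (new - axis_min) / (old - axis_min)

old_tuition, new_tuition = 20_000, 21_000  # hypothetical dollar amounts

print(f"Actual change: {percent_change(old_tuition, new_tuition):.1f}%")
print(f"Axis from 0:      new bar is {drawn_height_ratio(old_tuition, new_tuition, 0):.2f}x as tall")
print(f"Axis from 19,500: new bar is {drawn_height_ratio(old_tuition, new_tuition, 19_500):.2f}x as tall")
```

With the axis starting at zero the new bar is 1.05 times as tall, matching the 5% change; with the axis starting at 19,500 it is drawn three times as tall, although the data are identical.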

Interpolation, extrapolation, and related cautions in data claims

Interpolation
- Filling in missing data points within the observed range requires justification; missing data can mislead if presented as continuous.
Extrapolation
- Extending observed trends into future or outside the data range requires strong assumptions; outcomes can be unreliable.
- In the lecture, extrapolation is presented as an area where the predictive claim goes beyond available evidence and should be treated with caution.
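
Both moves amount to reading a value off a line drawn through two known points. The sketch below assumes a straight-line trend and uses hypothetical retention rates for 2015 and 2019; whether the real values follow that line is exactly what the data cannot tell us.

```python
def linear_value(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical retention rates (%) for the two years actually observed
year_a, rate_a = 2015, 80.0
year_b, rate_b = 2019, 72.0

# Interpolation: a year inside the observed range with no data
print(f"2017 (interpolated): {linear_value(2017, year_a, rate_a, year_b, rate_b):.1f}%")

# Extrapolation: a year outside the observed range
print(f"2025 (extrapolated): {linear_value(2025, year_a, rate_a, year_b, rate_b):.1f}%")
```

Both numbers look equally precise on the page, but neither is an observation; the extrapolated one leans on the additional assumption that the trend continues unchanged.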
Biased scales
- The shape and granularity of response scales can bias interpretation (e.g., forcing a positive response).
Sneaky samples (sampling bias)
- Sampling from a non-representative group (e.g., sampling only long-term subscribers, or only people in a specific demographic) can distort conclusions.
Practical example prompts
- Asking a campus population about the value of a service by surveying only active users could bias results toward higher perceived value.
- Asking questions that presume one outcome (e.g., only “strongly agree” options) can coerce a biased response.
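
The effect of surveying only active users can be sketched with a made-up campus. The group sizes and ratings below are hypothetical and deliberately simple: frequent users all rate the service 8 out of 10 and everyone else rates it 4.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical campus: 200 frequent users and 800 infrequent users
population = [("frequent", 8)] * 200 + [("infrequent", 4)] * 800

# Sneaky sample: 100 respondents drawn only from frequent users
sneaky_sample = [rating for group, rating in population if group == "frequent"][:100]

# Fairer sample: 100 respondents drawn at random from everyone
fair_sample = [rating for _, rating in random.sample(population, 100)]

sneaky_mean = statistics.mean(sneaky_sample)
fair_mean = statistics.mean(fair_sample)
population_mean = statistics.mean([rating for _, rating in population])

print(f"Population mean rating: {population_mean:.2f}")
print(f"Random-sample mean:     {fair_mean:.2f}")
print(f"Sneaky-sample mean:     {sneaky_mean:.2f}")
```

Both samples have 100 respondents, so the gap comes from who was asked, not from how many.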
Importance for research design
- Thoughtful consideration of who is asked, how questions are framed, and how data are pictured is central to credible research.

Bias, sampling, and scale concepts; distinguishing terms

Biased scales (response bias in scales)
- Scale design that nudges respondents toward a particular answer (e.g., limiting options to positive only) introduces bias.
Sneaky samples versus unbiased scales
- Sneaky sample: sampling from a population subset intentionally chosen to bias results (e.g., only those with a favorable view, or a group with specific incentives).
- Unbiased scale: a well-constructed scale that aims to capture genuine variability without pushing respondents toward a particular direction.
Distinguishing examples
- iPhone vs Android preference survey: collecting data only from people with experience in both platforms versus sampling from a convenience group (e.g., family of Apple employees) to produce favorable responses.
- Rec center fee survey: sampling only frequent users to argue for high perceived value, versus sampling a broader group to get a more representative view.
Teaching takeaway
- The difference between a biased scale and a sneaky sample is that one concerns how questions are scaled (response options), while the other concerns who is being asked and how that selection may distort results.
Real-world implications
- When evaluating programs or products, think about who was asked, how many, and which perspectives are included or excluded.

Scatter plots and relationships between two scale variables

What a scatter plot shows
- A scatter plot displays two continuous (scale) variables on the x- and y-axes, creating a cloud of points to inspect possible relationships.
- Requires both variables to be continuous; categorical variables can distort the visualization.
Range frame and readability
- The range frame is a technique to adjust axes so the plot is easier to read; e.g., starting the y-axis at a higher minimum to reduce white space.
Interpreting relationships
- Linear relationships: a straight-line pattern; the direction is indicated by the slope.
- Positive relationship: as x goes up, y goes up (upward slope from left to right).
- Negative relationship: as x goes up, y goes down (downward slope from left to right).
- Nonlinear relationships: patterns like parabolas; a cloud with curvature but no single straight-line fit may indicate a nonlinear relationship.
Examples discussed in class
- A scatter plot comparing study time (hours) to test score (0–100): potential positive linear relationship if more study time correlates with higher scores.
- Anxiety versus performance: a nonlinear, likely inverted-U shape (too little or too much anxiety hurts performance; moderate anxiety might optimize performance).
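
The difference between a linear and an inverted-U pattern shows up when a straight-line summary is applied to both. Pearson's r is used here only as a convenient numeric summary of linear association; it is an addition for illustration, and all data values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical study time (hours) vs. test score: roughly linear and positive
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 58, 61, 70, 74, 79, 85, 90]

# Hypothetical anxiety (1-9) vs. performance: inverted U peaking at moderate anxiety
anxiety = [1, 2, 3, 4, 5, 6, 7, 8, 9]
performance = [100 - 5 * (a - 5) ** 2 for a in anxiety]

print(f"study time vs. score:    r = {pearson_r(hours, scores):.2f}")
print(f"anxiety vs. performance: r = {pearson_r(anxiety, performance):.2f}")
```

The inverted-U data give an r near zero even though anxiety and performance are clearly related, which is why the scatter plot should be inspected before trusting any straight-line summary.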
Practical questions and axis orientation
- If the line slopes upward left to right, it is a positive relationship; downward slope indicates negative.
- If the plot appears inconsistent with the labeled axes, ensure axis orientation is correct (lower values on the left, higher values on the right for x-axis; similarly for y-axis).
- The instructor emphasizes consistency in axis labeling to avoid misinterpretation.

Quick recap of key ideas and practical notes

Always differentiate between types of data (nominal, ordinal, and scale, i.e. interval/ratio) to choose appropriate visualizations (bar chart vs histogram).
The normal distribution serves as a foundational model for many statistical methods, but real data often show skew and outliers; be prepared to assess how these distortions affect conclusions.
Be cautious of data visuals that suppress important variability or mislead through poor labeling, selective sampling, or inflated representations of change.
When designing studies or interpreting results, consider: who was surveyed, how questions were framed, what scale was used, and whether data gaps were interpolated or extrapolated.
For exams and coursework: plan to maximize performance on the first attempt, use resubmission strategically, and pace study with the scheduled units (Chapter 2 review, Chapter 3, then Chapter 4 topics).
Core formulas and concepts to remember:
- Normal PDF: f(x\mid \mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
- Empirical rule (68-95-99.7):
- P(|X-\mu|\le \sigma) \approx 0.68
- P(|X-\mu|\le 2\sigma) \approx 0.95
- P(|X-\mu|\le 3\sigma) \approx 0.997
- For a normal distribution with mean \mu=100 and SD \sigma=15, approximate ranges:
- 68%: [85, 115]
- 95%: [70, 130]
- 99.7%: [55, 145]
Notes on the exam schedule and logistics
- The exam allows a single attempt; resubmission can recover up to half the points.
- The exam format includes multiple-choice and long-answer questions; students should be concise and focused in long answers.
- Scheduling is flexible with respect to weekend openings and DLC reservations; the instructor will confirm specific dates and access arrangements.
- Computer-based data analysis (Jamovi) will be integrated into coursework to help students practice data interpretation.

Quick glossary (terms to know for the unit)

Frequency distribution: a table or graphic showing how often each value of a variable occurs.
Bar chart: categorical data visualization with gaps between bars (nominal/ordinal).
Histogram: continuous data visualization without gaps between bars; uses bins.
Normal distribution: bell-shaped, symmetric, unimodal distribution described by the normal curve.
Skewness: asymmetry in a distribution’s tails (negative = left, positive = right).
Floor/Ceiling effects: natural limits that constrain observed data.
Stem-and-leaf plot: a data display that splits data into stems and leaves to show distribution for small datasets.
Interpolation/extrapolation: filling in values between observed data points vs. predicting values beyond the observed data range.
Biased scale vs sneaky sample: bias built into the question or response options vs bias from surveying an unrepresentative subset of the population.
Scatter plot: a two-variable continuous display to assess relationships between X and Y.
