Video Lecture Notes Review: Chapters 1–8 (Data Distributions, Displays, and Relationships)

Chapter 1 recap: data basics and terminology

  • Observations, measurements, and data organization

    • Observations are data points collected about variables; data are organized to reveal patterns.

    • Distinguish observations from variables, and experimental data from observational data.

  • Frequency distributions as a key organizing tool

    • Purpose: show how often each value of a variable occurs.

    • Frequency distribution components: a table listing each value (or interval) and its frequency (count).

    • Visual representations: simple bar charts and histograms to visualize frequencies.

  • Bar charts vs histograms

    • Bar charts: used for nominal and ordinal variables; gaps between bars; values on the x-axis are categories (nominal) or ordered categories (ordinal).

    • Histograms: used for scale (continuous) variables; no gaps between bars; x-axis represents the variable values (or intervals/bins).

    • Question to consider: which one has gaps? Answer: Bar charts have gaps; histograms do not.

  • Variables and types

    • Nominal: categories with no natural order; typically visualized with bar charts.

    • Ordinal: ordered categories; often visualized with bar charts, but histograms may be used when treating as a scale with bins.

    • Scale (continuous): numerical values with meaningful distance; typically visualized with histograms or frequency polygons.

  • Grouped frequency tables

    • When there are many distinct values, group into bins (evenly spaced or logically defined) to create a meaningful, readable table.

    • Important details: define bin boundaries precisely to avoid double-counting values at cutoffs.

  • Precision and bin edges

    • Use the appropriate decimal places to delineate bins so the binning is consistent with measurement precision.

    • Example reasoning: if a scale can report to two decimals, cutoffs should reflect that precision.
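
The grouped frequency table and bin-edge points above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the lecture: the function name `grouped_frequency`, the quiz scores, and the choice of half-open bins `[lo, hi)` are all assumptions made for the example. Half-open bins are one common way to make sure a value sitting exactly on a cutoff is counted once.

```python
from collections import Counter


def grouped_frequency(values, start, width, n_bins):
    """Count values in half-open bins [lo, hi) so a cutoff value lands in exactly one bin."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)  # which bin this value falls into
        if 0 <= index < n_bins:
            counts[index] += 1
    table = []
    for i in range(n_bins):
        lo = start + i * width
        hi = lo + width
        table.append((lo, hi, counts[i]))
    return table


# Hypothetical quiz scores, made up for illustration only.
scores = [52, 60, 60, 67, 70, 71, 79, 80, 88, 95]
table = grouped_frequency(scores, start=50, width=10, n_bins=5)
for lo, hi, count in table:
    print(f"[{lo}, {hi}): {count}")
```

Note that the two scores of 60 fall in the 60s bin only, and the frequencies add up to the number of observations, which is a quick check that nothing was double-counted or dropped.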

The normal distribution: shape, properties, and interpretation

  • Shape and key properties

    • The normal distribution is bell-shaped, symmetric, and unimodal (one peak).

    • For a normal distribution, the mean, median, and mode are all equal, because the curve is symmetric with a single peak.

    • The area under the curve is what many statistics rely on to define probabilities/percentiles.

  • Normal curve math (conceptual, not just pictures)

    • The curve is described by a mathematical formula that allows computing areas under the curve.

    • The normal probability density function (PDF) is given by:
      f(x\mid \mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

    • Boundaries set at whole numbers of standard deviations from the mean (for example, ±1 or ±2) mark off regions of the curve with known areas; the values nearest the mean are the most common.

  • Standard deviations and empirical rule (68-95-99.7)

    • Approximately 68% of observations lie within one standard deviation of the mean: P(|X-\mu|\le \sigma) \approx 0.68.

    • Approximately 95% lie within two standard deviations: P(|X-\mu|\le 2\sigma) \approx 0.95.

    • Approximately 99.7% lie within three standard deviations: P(|X-\mu|\le 3\sigma) \approx 0.997.

  • Example used in class

    • If a normally distributed variable has mean \mu=100 and standard deviation \sigma=15:

    • About 68% fall between [\mu-\sigma, \mu+\sigma] = [85, 115].

    • About 95% fall between [\mu-2\sigma, \mu+2\sigma] = [70, 130].

    • About 99.7% fall between [\mu-3\sigma, \mu+3\sigma] = [55, 145].

  • Tails and asymptotes

    • The normal curve is defined mathematically to extend indefinitely; tails approach the x-axis but never touch it (asymptotic tails).

  • Real-world implications and limits of the curve

    • The mean is a central tendency that coincides with the most common value in a symmetric normal distribution.

    • In practice, many real-world distributions are not perfectly normal; skewness and outliers are common.
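
The 68-95-99.7 figures and the class example (mean 100, standard deviation 15) can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper names `normal_pdf` and `area_within` are made up for this illustration and are not from the lecture.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal density, written out term by term from the formula in the notes."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def area_within(dist, k):
    """Area under the curve within k standard deviations of the mean."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)


dist = NormalDist(mu=100, sigma=15)
for k in (1, 2, 3):
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    print(f"within {k} SD: [{lo:.0f}, {hi:.0f}] covers {area_within(dist, k):.4f}")

# The hand-written formula should agree with the library's density.
print(normal_pdf(110, 100, 15), dist.pdf(110))
```

The areas come out close to 0.68, 0.95, and 0.997, which is why the empirical rule is stated as an approximation rather than as exact values.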

Skewness, tails, and outliers

  • Skewness intuition

    • Skew occurs when the distribution is stretched more toward one tail than the other.

    • Direction of skew is described by the direction of the tail:

    • Negative skew (left-skew): tail extends to the left (toward smaller values).

    • Positive skew (right-skew): tail extends to the right (toward larger values).

  • The “whale” analogy for skew direction (informal visualization)

    • A hump (the main body) plus a tail that points in the direction of skew helps visualize left vs right tails.

  • Outliers and their impact

    • Outliers (like Warren Buffett in a sample of Omaha households) can pull the tail of the distribution and distort the mean.

    • An extreme high value stretches the right tail and produces positive skew, even in a distribution that would otherwise lean left; an extreme low value stretches the left tail and produces negative skew.

    • Outliers affect measures of central tendency and dispersion, complicating interpretation and statistical analyses.

  • Floor and ceiling effects

    • Floor effect: data cannot go below a certain minimum (e.g., an ACT score cannot fall below 1, and a GPA cannot be negative).

    • Ceiling effect: data cannot exceed a maximum (e.g., SAT/ACT upper bounds, GPA capped at 4.0, five-point scales in some schools).

    • These effects can distort the apparent distribution and limit the ability to distinguish high- or low-performing individuals.

  • Concrete classroom examples

    • GPA example: 0.0 to 4.0 scale; some schools use 5.0 scales with AP credits; ceiling/floor define the actionable range.

    • ACT example: scores go from 1 to 36 (or similar bounds in the course context); floor/ceiling considerations affect interpretation.
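
To see how a single outlier distorts the mean and the skew, and how a ceiling hides differences, here is a small sketch. All numbers are hypothetical and chosen only for illustration; the `skewness` helper (the average cubed z-score) is one common definition of skew and is an assumption here, since the lecture treats skew only visually.

```python
from statistics import mean, median


def skewness(values):
    """Average cubed z-score: positive means a longer right tail, negative a longer left tail."""
    m = mean(values)
    n = len(values)
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / n


# Hypothetical household incomes in thousands of dollars.
incomes = [40, 45, 50, 55, 60]
with_outlier = incomes + [10_000]  # one extremely wealthy household

for label, data in (("without outlier", incomes), ("with outlier", with_outlier)):
    print(f"{label}: mean={mean(data):.1f}, median={median(data):.1f}, "
          f"skew={skewness(data):.2f}")

# Ceiling effect: hypothetical uncapped grade averages recorded on a scale capped at 4.0.
raw_gpas = [3.2, 3.9, 4.3, 4.8]
capped = [min(g, 4.0) for g in raw_gpas]
print(capped)
```

The outlier drags the mean far above every typical household while the median barely moves, and the skew turns strongly positive. After capping, the two highest students become indistinguishable, which is the loss of information a ceiling effect causes.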

Stem-and-leaf plots

  • What they are and when they’re useful

    • Stem-and-leaf plots organize small datasets visually by separating the leading digits (stems) from the trailing digits (leaves).

    • Helpful for very small datasets where you want a quick sense of the distribution without performing full numerical summaries.

  • Example from slides (minutes spent in the shower by women)

    • Stems: 0 through 6 (representing 0.x to 6.x minutes); leaves represent decimal parts.

    • Reading: each leaf corresponds to a value with the given stem as the leading digit(s).

  • Multi-population stem plots

    • Can compare distributions across groups (e.g., men vs. women) by displaying side-by-side stems/leaves.

  • Cautions

    • Not particularly scalable to large datasets; for small samples, they can be quick and informative.

  • Practical notes from the lecture

    • The instructor considered stem-and-leaf useful for very small datasets; for larger datasets, other displays are preferred.
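
A stem-and-leaf plot is simple enough to build by hand or in a few lines of code. The sketch below is an illustration under stated assumptions: it uses whole-number data with the tens digit as the stem and the ones digit as the leaf (the slide example used a different split), and both the `stem_and_leaf` function and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build plot lines for non-negative integers: stem = tens digit(s), leaf = ones digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Hypothetical durations in minutes, made up for illustration.
minutes = [5, 8, 12, 15, 15, 21, 27, 40]
plot = stem_and_leaf(minutes)
print("\n".join(plot))
```

Keeping the empty stem for the 30s is a deliberate choice: dropping it would hide the gap between the main body of the data and the value of 40.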

Displays of data and depictions of deception

  • The chapter on displays is useful beyond being an analytic tool; it helps you read and interpret how data are presented and how they can be manipulated.

  • The instructor emphasizes “how to lie with data” as a defensive approach: learn the tricks so you can recognize and resist them.

  • Seven kinds of deceptive practices discussed (the seventh is the outright lie):

    • False face validity: data seem related to a claim but aren’t necessarily the appropriate indicator.

    • Interpolation: assuming values between observed data points follow the same pattern, for example by drawing a line across omitted or missing points so the data look smoother than they are.

    • Extrapolation: making claims beyond the observed range; assumes the same trend continues.

    • Biased scales: the way questions are framed or the response scale nudges answers toward a desired direction.

    • Sneaky samples: sampling from a non-representative or otherwise biased subgroup (e.g., sampling only long-term meal-plan subscribers).

    • Inaccurate values (misleading units or labeling): misrepresenting unit scales to overstate or understate changes.

    • Outright lies: claims that are simply false; careful reading offers only limited protection.

  • Examples used in class

    • Enrollment increases reported as 18% but the underlying metric was per-credit-hour changes, not headcount, altering interpretation.

    • A survey claiming 100% of students agree faculty are exceptional, achieved by restricting the sample or response options to create a bias toward the positive answer.

    • A cafeteria satisfaction study that sampled only residents in dormitories, biasing results toward a more favorable impression.

    • Interpolated retention rates across missing years (e.g., 2016–2018 data missing) to imply a trend that isn’t fully supported by the data.

    • Tuition-cost visuals that label a change as 5% while the axis scale makes the same change look larger or smaller than it is in absolute dollars, so the change is overstated or understated.

  • Practical takeaways

    • Always consider sample representativeness, question wording, and axis labels when interpreting graphs or reporting statistics.

    • Be wary of how data visuals can obscure or exaggerate true relationships.

Interpolation, extrapolation, and related cautions in data claims

  • Interpolation

    • Filling in missing data points within the observed range requires justification; missing data can mislead if presented as continuous.

  • Extrapolation

    • Extending observed trends into future or outside the data range requires strong assumptions; outcomes can be unreliable.

    • In the lecture, extrapolation is presented as an area where the predictive claim goes beyond available evidence and should be treated with caution.

  • Biased scales

    • The shape and granularity of response scales can bias interpretation (e.g., forcing a positive response).

  • Sneaky samples (sampling bias)

    • Sampling from a non-representative group (e.g., sampling only long-term subscribers, or only people in a specific demographic) can distort conclusions.

  • Practical example prompts

    • Asking a campus population about the value of a service by surveying only active users could bias results toward higher perceived value.

    • Asking questions that presume one outcome (e.g., only “strongly agree” options) can coerce a biased response.

  • Importance for research design

    • Thoughtful consideration of who is asked, how questions are framed, and how data are pictured is central to credible research.
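
The difference between interpolation and extrapolation can be made concrete with a straight line through two observed points. This is only a sketch: the retention figures are hypothetical, and `linear_estimate` and `kind_of_estimate` are names invented for the example. A straight line is also an assumption in itself, which is exactly the kind of assumption the notes warn about.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def kind_of_estimate(x, x0, x1):
    """Label an estimate by whether x falls inside or outside the observed range."""
    return "interpolation" if min(x0, x1) <= x <= max(x0, x1) else "extrapolation"


# Hypothetical retention rates (%) observed in only two years.
year_a, rate_a = 2015, 80.0
year_b, rate_b = 2019, 84.0

for year in (2017, 2025):
    estimate = linear_estimate(year, year_a, rate_a, year_b, rate_b)
    print(year, kind_of_estimate(year, year_a, year_b), estimate)
```

Both numbers look equally precise when printed, but neither is an observation. The 2017 value hides whatever actually happened in the missing years, and the 2025 value assumes the trend continues unchanged well past the data.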

Bias, sampling, and scale concepts; distinguishing terms

  • Biased scales (response bias in scales)

    • Scale design that nudges respondents toward a particular answer (e.g., limiting options to positive only) introduces bias.

  • Sneaky samples versus unbiased scales

    • Sneaky sample: sampling from a population subset intentionally chosen to bias results (e.g., only those with a favorable view, or a group with specific incentives).

    • Unbiased scale: a well-constructed scale that aims to capture genuine variability without pushing respondents toward a particular direction.

  • Distinguishing examples

    • iPhone vs Android preference survey: collecting data only from people with experience in both platforms versus sampling from a convenience group (e.g., family of Apple employees) to produce favorable responses.

    • Rec center fee survey: sampling only frequent users to argue for high perceived value, versus sampling a broader group to get a more representative view.

  • Teaching takeaway

    • The difference between a biased scale and a sneaky sample is that one concerns how questions are scaled (response options), while the other concerns who is being asked and how that selection may distort results.

  • Real-world implications

    • When evaluating programs or products, think about who was asked, how many, and which perspectives are included or excluded.
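
A sneaky sample can be illustrated with a tiny made-up population. In the sketch below, the ratings, the group sizes, and the "frequent" versus "rare" user labels are all hypothetical, loosely modeled on the rec center example; the point is only that the same question yields a different average depending on who is asked.

```python
from statistics import mean

# Hypothetical campus of 100 students: (usage group, rating of the service on a 1-5 scale).
population = (
    [("frequent", 5)] * 20
    + [("frequent", 4)] * 10
    + [("rare", 2)] * 50
    + [("rare", 3)] * 20
)

# Representative view: everyone's rating counts.
overall_mean = mean(rating for _, rating in population)

# Sneaky sample: only frequent users are surveyed.
sneaky_mean = mean(rating for group, rating in population if group == "frequent")

print(f"whole population: {overall_mean:.2f}")
print(f"frequent users only: {sneaky_mean:.2f}")
```

The response scale is identical in both cases, so this is not a biased scale; the distortion comes entirely from who was selected to answer.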

Scatter plots and relationships between two scale variables

  • What a scatter plot shows

    • A scatter plot displays two continuous (scale) variables on the x- and y-axes, creating a cloud of points to inspect possible relationships.

    • Requires both variables to be continuous; categorical variables can distort the visualization.

  • Range frame and readability

    • The range frame is a technique to adjust axes so the plot is easier to read; e.g., starting the y-axis at a higher minimum to reduce white space.

  • Interpreting relationships

    • Linear relationships: a straight-line pattern; the direction is indicated by the slope.

    • Positive relationship: as x goes up, y goes up (upward slope from left to right).

    • Negative relationship: as x goes up, y goes down (downward slope from left to right).

    • Nonlinear relationships: patterns like parabolas; a cloud with curvature but no single straight-line fit may indicate a nonlinear relationship.

  • Examples discussed in class

    • A scatter plot comparing study time (hours) to test score (0–100): potential positive linear relationship if more study time correlates with higher scores.

    • Anxiety versus performance: a nonlinear, likely inverted-U shape (too little or too much anxiety hurts performance; moderate anxiety might optimize performance).

  • Practical questions and axis orientation

    • If the line slopes upward left to right, it is a positive relationship; downward slope indicates negative.

    • If the plot appears inconsistent with the labeled axes, ensure axis orientation is correct (lower values on the left, higher values on the right for x-axis; similarly for y-axis).

    • The instructor emphasizes consistency in axis labeling to avoid misinterpretation.
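
The direction of a linear relationship can be read from the slope of a best-fit line, and the same calculation shows why a straight line can miss a nonlinear pattern. The sketch below uses an ordinary least-squares slope, which is an assumption for illustration (the notes describe slope only visually), and both datasets are hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the best-fit straight line: positive rises left to right, negative falls."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical study time (hours) and test scores: roughly linear and positive.
hours = [1, 2, 3, 4, 5]
scores = [55, 62, 70, 78, 85]

# Hypothetical anxiety levels and performance: an inverted-U pattern.
anxiety = [1, 2, 3, 4, 5]
performance = [50, 70, 80, 70, 50]

print("study time vs score slope:", least_squares_slope(hours, scores))
print("anxiety vs performance slope:", least_squares_slope(anxiety, performance))
```

The first slope is clearly positive. The second is zero even though anxiety and performance are strongly related, because the rising and falling halves of the inverted U cancel out; this is why the scatter plot should be inspected before trusting a straight-line summary.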

Quick recap of key ideas and practical notes

  • Always differentiate between types of data (nominal, ordinal, and scale, also called interval/ratio) to choose appropriate visualizations (bar chart vs histogram).

  • The normal distribution serves as a foundational model for many statistical methods, but real data often show skew and outliers; be prepared to assess how these distortions affect conclusions.

  • Be cautious of data visuals that suppress important variability or mislead through poor labeling, selective sampling, or inflated representations of change.

  • When designing studies or interpreting results, consider: who was surveyed, how questions were framed, what scale was used, and whether data gaps were interpolated or extrapolated.

  • For exams and coursework: plan to maximize performance on the first attempt, use resubmission strategically, and pace study with the scheduled units (Chapter 2 review, Chapter 3, then Chapter 4 topics).

  • Core formulas and concepts to remember:

    • Normal PDF: f(x\mid \mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

    • Empirical rule (68-95-99.7):

    • P(|X-\mu|\le \sigma) \approx 0.68

    • P(|X-\mu|\le 2\sigma) \approx 0.95

    • P(|X-\mu|\le 3\sigma) \approx 0.997

    • For a normal distribution with mean $\mu=100$ and SD $\sigma=15$, approximate ranges:

    • 68%: [85, 115]

    • 95%: [70, 130]

    • 99.7%: [55, 145]

  • Notes on the exam schedule and logistics

    • The exam is a single attempt; resubmission can recover up to half the points.

    • The exam format includes multiple-choice and long-answer questions; students should be concise and focused in long answers.

    • Scheduling is flexible as to weekend openings and DLC reservations; the instructor will confirm specific dates and access arrangements.

    • Computers and data analysis (Jamovi) will be integrated into coursework to help students practice data interpretation.

Quick glossary (terms to know for the unit)

  • Frequency distribution: a table or graphic showing how often each value of a variable occurs.

  • Bar chart: categorical data visualization with gaps between bars (nominal/ordinal).

  • Histogram: continuous data visualization without gaps between bars; uses bins.

  • Normal distribution: bell-shaped, symmetric, unimodal distribution described by the normal curve.

  • Skewness: asymmetry in a distribution’s tails (negative = left, positive = right).

  • Floor/Ceiling effects: natural limits that constrain observed data.

  • Stem-and-leaf plot: a data display that splits data into stems and leaves to show distribution for small datasets.

  • Interpolation/extrapolation: estimating values between observed data points versus predicting beyond the observed data range.

  • Biased scale vs sneaky sample: bias built into the response scale or question design vs bias from sampling only a subset of the population.

  • Scatter plot: a two-variable continuous display to assess relationships between X and Y.
