Notes on Section 2.2: One Quantitative Variable and Related Topics

Schedule and midterm logistics

  • Regular quiz this week on Friday.
  • Next Friday: first midterm, held in class during the regular period.
  • No lecture or quiz during midterm week.
  • After midterm, bonus quiz next week.
  • Overall sequence: a regular quiz, then a midterm, then a bonus quiz.
  • Midterm structure: two parts, A and B.
    • Part A (in-person): in-class problem set; you must write out your work and hand it in.
    • Part B (online, completed in the classroom): online portion using YLEAF+; bring a computer, tablet, or smartphone so you can complete Part B in the room.
  • If you do not take Part A, your Part B score is automatically zero, regardless of Part B performance.
  • Part A must have your name on it; otherwise you may receive a zero.
  • Makeup policy:
    • If you miss the midterm and provide an official document (e.g., doctor’s note), you can request a makeup.
    • Otherwise, there is a uniform makeup option.
  • The makeup exam will cover sections 1.1 to 2.3 for sure, and possibly section 2.4.
  • The instructor will confirm whether 2.4 is included and when makeup might be offered (likely Friday or Monday).
  • Focus for study: sections 1.1 to 2.3 (with potential coverage of 2.4).
  • Exam timing is in-session, so you should manage your time accordingly.
  • Any questions about the schedule? (pause for questions)

Section 2.2: One quantitative variable

  • Topic focus: one quantitative (numerical) variable.
  • Contrast with categorical variables (where we know how to build a frequency table and a relative frequency table, and visualize with bar plots or pie charts).
  • For a quantitative variable, a frequency table is not as straightforward to write down, so we visualize data instead.
  • Goals for this section: understand the shape of the distribution, the center, and the spread of a single quantitative variable.
  • Analogy: disc vs. data
    • A disc has two parameters: center (the location) and radius (the size/spread).
    • A data set is not as neat as a disc; it may be an irregular collection of points, possibly forming complex shapes.
    • Therefore, a single center is often insufficient to describe data; multiple parameters help describe location and spread.

Example context: movies in 2011

  • Data in an original four-column Excel-style table:
    • Column 1: cases (rows) = individual movies.
    • Variables include: studio (categorical) and world gross (quantitative).
  • For a categorical variable (e.g., studio), we create a frequency table with categories on the left and counts on the right.
  • For a quantitative variable (e.g., world gross), a frequency table is less practical; we turn to visualization.
  • Dot plot as an initial visualization
    • Axes: x-axis represents the quantitative values (numbers, not categories).
    • Each dot corresponds to a case (one dot per case).
    • If a value appears multiple times, display multiple dots at that value (e.g., two cases with the same value -> two dots at that value).
    • If a value does not occur in the data, there is no dot at that point.
    • Advantages: can see the distribution and density of individual values; captures all data points visually.
    • Limitations: can be cluttered if there are many distinct values; hard to see the overall pattern for large data sets.
  • Transition to histograms for a cleaner summary
    • A histogram groups data into intervals (bins) along the x-axis and uses rectangle heights to show counts.
    • How to construct a histogram:
      • Choose an interval width (e.g., 200, 100, etc.).
      • For each interval, count the number of observations (dots) that fall into that interval; this count becomes the height of the corresponding rectangle.
      • Example process: intervals [0,200], [200,400], [400,600], etc.; count the dots within each interval to set the rectangle heights. Decide in advance which interval a boundary value such as 200 belongs to, so that no observation is counted twice.
    • Histogram vs dot plot:
      • Histogram uses bins/intervals; dot plot uses exact values.
      • Bar chart (for categorical data) uses bars for each category; a histogram uses continuous intervals along the x-axis.
    • When to use which:
      • For qualitative (categorical) data: bar chart.
      • For quantitative (numerical) data: histogram (and optionally dot plot for raw data).
  • Important distinctions in visualization choices
    • The number of bars in a bar chart is fixed by the number of categories.
    • The number of bars in a histogram is not fixed; it can be adjusted (by choosing bin width or the number of bins) to balance detail and readability.
    • Software defaults determine the number of bins unless you change them; adjust the bin width or bin count if the default hides or exaggerates the pattern in the data.
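
The bin-counting procedure for a histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name histogram_counts is hypothetical, the world-gross values are made up (not the 2011 movie data), all values are assumed to be at least start, and each bin is treated as including its left endpoint but not its right, so a boundary value such as 200 is counted once.

```python
def histogram_counts(values, bin_width, start=0):
    """Count observations per bin; bin k covers [start + k*w, start + (k+1)*w).

    Assumes every value is >= start. Returns {left_edge: count},
    including empty bins so that gaps in the data stay visible.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if not values:
        return {}
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = {start + k * bin_width: 0 for k in range(n_bins)}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin containing v
        counts[start + k * bin_width] += 1
    return counts


# Hypothetical world-gross values in millions of dollars (not real data).
gross = [12, 45, 150, 199, 200, 310, 420, 980]

counts = histogram_counts(gross, bin_width=200)
for left, c in counts.items():
    # One '#' per observation: a text version of the rectangle heights.
    print(f"[{left}, {left + 200}): {'#' * c}")
```

Calling histogram_counts(gross, bin_width=500) on the same values gives only two bars, which shows how the bin width changes the level of detail.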

Shape of the distribution

  • Three common shapes discussed:
    • Symmetric (approximately symmetric around a center): the left and right sides mirror each other.
    • Right-skewed: a long tail extending to the right (toward larger values).
    • Left-skewed: a long tail extending to the left (toward smaller values).
  • Bell-shaped distribution (the statistician’s ideal): a smooth, roughly bell-shaped curve (a specific type of symmetric distribution).
  • Notes on symmetry
    • Perfect symmetry is rare in real data due to sampling variability and measurement error.
    • In practice, a distribution is described as approximately symmetric when its left and right sides roughly mirror each other; the typical example is a roughly bell-shaped histogram.
  • Intuition about symmetry examples
    • A symmetric curve might resemble a parabola or a wavy line that can be flipped over to match the other half; in statistics, we focus on bell-shaped symmetry rather than exact mathematical symmetry.
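
As a rough illustration of the three shapes, the sketch below labels a list of histogram bin counts (read left to right) by checking for an exact mirror image and, failing that, by comparing how many bins extend to each side of the tallest bin. The function describe_shape and its rule are assumptions made for this example, not a standard definition of skewness; real data usually need a judgment call from the plot.

```python
def describe_shape(counts):
    """Crude shape label from histogram bin counts, left to right.

    Assumes leading and trailing empty bins have been trimmed.
    """
    if counts == counts[::-1]:
        return "symmetric"                  # left half mirrors right half exactly
    peak = counts.index(max(counts))        # position of the tallest bin
    left_tail = peak                        # number of bins left of the peak
    right_tail = len(counts) - 1 - peak     # number of bins right of the peak
    if right_tail > left_tail:
        return "right-skewed"               # longer tail toward larger values
    if left_tail > right_tail:
        return "left-skewed"                # longer tail toward smaller values
    return "roughly symmetric"


# Hypothetical bin counts for three histograms.
print(describe_shape([1, 3, 5, 3, 1]))
print(describe_shape([5, 4, 2, 1, 1]))
print(describe_shape([1, 1, 2, 4, 5]))
```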

Notation and basics of data tables

  • Variables and notation
    • In data tables, we denote the quantitative variable as x (or another letter, if you specify it).
    • The i-th observation of x is denoted x_i; if a second variable is called y, its i-th observation is y_i.
    • If you switch rows in a two-column table, you effectively switch the corresponding values; the data as a set are unchanged, but the table representation changes.
  • The importance of fixing a table
    • Once you fix the table (i.e., identify which column represents which variable), do not switch rows independently when discussing data, to avoid misinterpretation.
  • Mean and median as measures of center
    • Mean (average): the arithmetic average of the data.
    • Formula for the sample mean:
      \bar{x} = \frac{\text{sum of all } x_i}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
    • Population mean: denoted by \mu (the Greek letter mu).
    • The sample mean is written \bar{x}; the population mean is written \mu.
    • The reasoning for notation variations is historical; always define what your symbols mean in your work.
  • Center concepts: mean vs median
    • The mean is one measure of center (the average of all data points).
    • The median (denoted by m) is the middle value when data are ordered from smallest to largest.
    • How to determine the median:
      • If n is odd, the median is the unique middle value after ordering.
      • If n is even, the median is the average of the two middle values after ordering.
    • Examples and intuition:
      • For a three-number dataset (n = 3): the median is the second value after sorting.
      • For an even-numbered dataset (n even): take the two central values and average them to obtain the median.
  • Outliers (intuitive discussion; later formal definition)
    • Outliers are observations that lie far from the center of the data and may be considered unusual or extreme.
    • Intuition: values like a very large or very small observation relative to the rest can be treated as outliers.
    • In class, outliers were discussed informally with examples (e.g., a data point much farther from the central cluster).
    • A precise, widely used mathematical definition of outliers is postponed for later in the course (referred to as “eight minutes” for a more formal treatment).
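
The definitions of the sample mean and the median translate directly into code. The sketch below uses small made-up data sets and hypothetical function names (sample_mean, median); Python's standard statistics module is used only as a cross-check.

```python
import statistics


def sample_mean(xs):
    """Sample mean: sum of all x_i divided by n."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value after ordering; average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # unique middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two central values


odd = [7, 1, 4]        # n = 3: ordered as 1, 4, 7
even = [8, 1, 4, 10]   # n = 4: ordered as 1, 4, 8, 10

print(sample_mean(odd), median(odd))
print(sample_mean(even), median(even))

# Cross-check against the standard library.
assert sample_mean(even) == statistics.mean(even)
assert median(even) == statistics.median(even)
```

For the n = 3 data the median is the second value after sorting (4); for the n = 4 data it is the average of 4 and 8.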

Notation conventions and practical tips

  • Notation for sample size and variables
    • The sample size is denoted by n.
    • Some instructors may use other letters (e.g., m, N) to denote a size; if you use another letter, you must define what it means to avoid ambiguity.
    • The instructor emphasizes: if you use a non-standard symbol (e.g., m) without defining it, it may be interpreted as the sample size by default; always define your notation clearly.
  • Variables in a two-variable context
    • Commonly, we use x for a primary quantitative variable and y for a secondary variable if needed.
    • For a data table with two variables, the i-th row contains the i-th observation of each variable (e.g., x_i and y_i).
    • If you switch rows, you change the arrangement of the data; this affects the table, even though the underlying data are the same.
  • Basic statistical definitions in practice
    • Mean (sample): \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
    • Mean (population): \mu
    • Median: defined as the middle value after ordering; for even n, the average of the two central values.
    • Order of operations: when describing the dataset, fix the convention for which column represents which variable and use consistent notation throughout.
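
The point about rows can be made concrete with a tiny two-column table. In this sketch the data are made up, and it reads "switching independently" as reordering one column without the other, which is an interpretation of the note above: moving whole rows keeps every (x_i, y_i) pair intact, while reordering a single column pairs values from different cases.

```python
# Hypothetical two-variable table: each row is one case (x_i, y_i).
rows = [(3, 30), (1, 10), (2, 20)]

# Swapping whole rows changes the layout of the table, not the data set.
swapped = [rows[1], rows[0], rows[2]]
print(sorted(swapped) == sorted(rows))

# Reordering one column on its own breaks the pairing between x and y.
xs = [x for x, _ in rows]
ys = [y for _, y in rows]
broken = list(zip(sorted(xs), ys))   # x sorted alone, y left in place
print(broken)
print(set(broken) == set(rows))
```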

Quick reference: key formulas and concepts from Section 2.2

  • Describing a dataset with one quantitative variable:
    • Dot plot: visualize each observation as a dot at its value; one dot per case; multiple dots stacked at the same value represent multiple cases.
    • Histogram: group data into bins; the height of each bin equals the number of observations within that bin; provides a summarized view of the distribution.
  • Measures of center:
    • Sample mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
    • Population mean: \mu
    • Median: middle value after ordering; for odd n, single middle value; for even n, average of two central values.
  • Shape descriptors:
    • Symmetric (approximately bell-shaped) vs right-skewed vs left-skewed distributions.
    • Bell-shaped distribution is a common reference in statistics for a symmetric, unimodal shape.
  • Notation reminders:
    • Sample size: n (also sometimes denoted by other letters with explicit definitions).
    • Variables: commonly x (and sometimes y) with observations x_i (or y_i).
  • Practical data representation tips:
    • For categorical variables, use a frequency table (categories vs counts).
    • For quantitative variables, prefer visualization (dot plots, histograms) and summary statistics over raw tables when counts are large.
  • Real-world data example (movies, 2011):
    • Categorical: studio (e.g., Universal, Warner Bros., etc.)
    • Quantitative: world gross (in dollars)
    • Use a frequency table for categorical data; use dot plot/histogram for quantitative data to understand the distribution of world gross across movies.
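
For the categorical side, a frequency table and a relative frequency table can be built with the standard library's Counter. The studio list below is invented for illustration and is not taken from the 2011 movie data.

```python
from collections import Counter

# Hypothetical studio labels, one per movie (not the real data set).
studios = ["Universal", "Warner Bros.", "Universal",
           "Disney", "Warner Bros.", "Universal"]

freq = Counter(studios)                               # category -> count
n = len(studios)                                      # number of cases
rel_freq = {s: c / n for s, c in freq.items()}        # category -> proportion

# Print categories on the left, counts and proportions on the right.
for studio, count in freq.most_common():
    print(f"{studio:<14}{count:>3}{rel_freq[studio]:>8.2f}")
```

The relative frequencies are the counts divided by n, so they add up to 1.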

Summary takeaways

  • Section 2.2 focuses on understanding a single quantitative variable, its distribution shape, center, and spread, and how to visualize data effectively when a simple frequency table is not feasible.
  • Visualization tools include dot plots and histograms, with histograms offering a flexible approach to summarize distributions via bins.
  • The mean and median provide different notions of center, with the mean being sensitive to outliers and the median offering robustness to outliers.
  • Notation and data representation practices are important for clear communication: define variables, fix the data table structure, and distinguish between sample and population concepts.
  • A practical data set (movies, 2011) illustrates the distinction between categorical vs quantitative variables and the corresponding visualization strategies.
  • Understand the differences between symmetric, right-skewed, and left-skewed shapes, and the intuition behind bell-shaped distributions.
  • Outliers are discussed conceptually; a precise definition is introduced later in the course.
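
The different behavior of the mean and the median around an outlier is easy to check numerically. The numbers below are made up; 100 plays the role of an informal outlier, in the intuitive sense used in these notes.

```python
import statistics

data = [2, 3, 4, 5, 6]           # hypothetical observations
with_outlier = data + [100]      # same data plus one extreme value

mean_before = statistics.mean(data)
median_before = statistics.median(data)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

One extreme value moves the mean from 4 to 20, while the median only moves from 4 to 4.5.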