Notes on Section 2.2: One Quantitative Variable and Related Topics

Schedule and midterm logistics

Regular quiz this week on Friday.
Next Friday: first midterm held in-class during regular period.
No lecture or quiz during midterm week.
After the midterm, there is a bonus quiz the following week.
Overall sequence: a regular quiz, then a midterm, then a bonus quiz.
Midterm structure: two parts, A and B.
- Part A (in-person): in-class problem set; you must write out your work and hand it in.
- Part B (online, completed in person): online portion using YLEAF+; you must bring a computer, tablet, or smartphone to complete Part B in the classroom.
If you do not take Part A, your Part B score is automatically zero, regardless of Part B performance.
Part A must have your name on it; otherwise you may receive a zero.
Makeup policy:
- If you miss the midterm and provide an official document (e.g., doctor’s note), you can request a makeup.
- Otherwise, there is a uniform makeup option.
The makeup exam will cover sections 1.1 to 2.3 for sure, and possibly section 2.4.
The instructor will confirm whether 2.4 is included and when makeup might be offered (likely Friday or Monday).
Focus for study: sections 1.1 to 2.3 (with potential coverage of 2.4).
Exam timing is in-session, so you should manage your time accordingly.
Any questions about the schedule? (pause for questions)

Section 2.2: One quantitative variable

Topic focus: one quantitative (numerical) variable.
Contrast with categorical variables (where we know how to build a frequency table and a relative frequency table, and visualize with bar plots or pie charts).
For a quantitative variable, a frequency table is not as straightforward to write down, so we visualize data instead.
Goals for this section: understand the shape of the distribution, the center, and the spread of a single quantitative variable.
Analogy: disc vs. data
- A disc has two parameters: center (the location) and radius (the size/spread).
- A data set is not as neat as a disc; it may be an irregular collection of points, possibly forming a complex shape.
- Therefore, a single center is often insufficient to describe data; multiple parameters help describe location and spread.

Example context: movies in 2011

Data in an original four-column Excel-style table:
- Column 1: cases (rows) = individual movies.
- Variables include: studio (categorical) and world gross (quantitative).
For a categorical variable (e.g., studio), we create a frequency table with categories on the left and counts on the right.
For a quantitative variable (e.g., world gross), a frequency table is less practical; we turn to visualization.
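As a minimal sketch of the frequency table described above for a categorical variable, the following counts how often each category occurs and converts the counts to proportions. The studio labels here are hypothetical stand-ins, not the actual 2011 movie data.

```python
from collections import Counter

# Hypothetical studio labels for six movies (illustration only, not the real data).
studios = ["Universal", "Warner Bros.", "Universal", "Disney", "Warner Bros.", "Universal"]

# Frequency table: category -> count.
freq = Counter(studios)

# Relative frequency table: category -> proportion of all cases.
n = len(studios)
rel_freq = {studio: count / n for studio, count in freq.items()}

# Categories on the left, counts and proportions on the right.
for studio, count in freq.most_common():
    print(f"{studio:<14}{count:>3}{rel_freq[studio]:>8.2f}")
```

The same approach is much less useful for world gross, because almost every movie would have its own distinct value and nearly every count would be 1.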
Dot plot as an initial visualization
- Axes: x-axis represents the quantitative values (numbers, not categories).
- Each dot corresponds to a case (one dot per case).
- If a value appears multiple times, display multiple dots at that value (e.g., two cases with the same value -> two dots at that value).
- If a value does not occur in the data, there is no dot at that point.
- Advantages: can see the distribution and density of individual values; captures all data points visually.
- Limitations: can be cluttered if there are many distinct values; hard to see the overall pattern for large data sets.
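The dot plot rules above (one dot per case, repeated values stacked) can be sketched in text form. The values are hypothetical, and this simplified version lists only the values that occur, so unlike a real dot plot its axis is not drawn to scale.

```python
from collections import Counter

# Hypothetical values (think: world gross in millions of dollars); not real data.
values = [50, 120, 120, 300, 50, 50, 700]

# How many cases share each exact value.
counts = Counter(values)

def dot_plot_rows(counts):
    """Return one text row per observed value: the value, then one dot per case."""
    return [f"{value:>5} | " + "." * counts[value] for value in sorted(counts)]

rows = dot_plot_rows(counts)
for row in rows:
    print(row)
```

A value that never occurs (for example 200 here) simply gets no row, which mirrors the "no dot at that point" rule.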
Transition to histograms for a cleaner summary
- A histogram groups data into intervals (bins) along the x-axis and uses rectangle heights to show counts.
- How to construct a histogram:
  - Choose an interval width (e.g., 200, 100, etc.).
  - For each interval, count the number of observations (dots) that fall into that interval; this count becomes the height of the corresponding rectangle.
  - Example process: intervals [0,200), [200,400), [400,600), etc. (here each interval is taken to include its left endpoint but not its right, so a value on a boundary is counted exactly once); count the dots within each interval to set the rectangle heights.
- Histogram vs dot plot:
  - Histogram uses bins/intervals; dot plot uses exact values.
  - Bar chart (for categorical data) uses one bar per category; a histogram uses contiguous intervals along a numerical x-axis.
- When to use which:
  - For qualitative (categorical) data: bar chart.
  - For quantitative (numerical) data: histogram (and optionally dot plot for raw data).
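The construction steps above can be sketched as a small counting routine. This is an illustration under stated assumptions: the data are hypothetical, all values are at least `start`, and each bin includes its left endpoint but not its right, so a value sitting exactly on a boundary lands in one bin only.

```python
def histogram_counts(values, width, start=0):
    """Count observations in bins [start, start+width), [start+width, start+2*width), ...

    Assumes every value is >= start.
    """
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

# Hypothetical values; not the real movie data.
values = [50, 120, 120, 300, 50, 50, 700, 200, 410]
counts = histogram_counts(values, width=200)

# Each count is the height of one rectangle; print the bars sideways.
for i, count in enumerate(counts):
    low = i * 200
    print(f"[{low:>3}, {low + 200:>3}) | " + "#" * count)
```

Note that 200 is counted in the second bin, not the first, because of the left-closed convention assumed here.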
Important distinctions in visualization choices
- The number of bars in a bar chart is fixed by the number of categories.
- The number of bars in a histogram is not fixed; it can be adjusted (by choosing bin width or the number of bins) to balance detail and readability.
- Software defaults may determine the number of bins; you can adjust these settings so the histogram shows the shape of the data more clearly.
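To illustrate how the bin width changes the number of bars, the sketch below recounts one hypothetical data set with three different widths. It assumes non-negative values and bins that start at 0 and include their left endpoint only.

```python
def histogram_counts(values, width):
    """Counts for bins [0, width), [width, 2*width), ... (non-negative values assumed)."""
    n_bins = int(max(values) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int(v // width)] += 1
    return counts

# Hypothetical values; not real data.
values = [50, 120, 120, 300, 50, 50, 700, 200, 410]

# Same data, three bin widths: narrower bins give more bars and more detail.
bins_by_width = {width: histogram_counts(values, width) for width in (100, 200, 400)}
for width, counts in bins_by_width.items():
    print(f"width={width}: {len(counts)} bars, counts={counts}")
```

With a very wide bin almost everything falls into one bar and the shape is hidden; with a very narrow bin the histogram approaches a dot plot.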

Shape of the distribution

Three common shapes discussed:
- Symmetric (approximately symmetric around a center): the left and right sides mirror each other.
- Right-skewed: a long tail extending to the right (toward larger values).
- Left-skewed: a long tail extending to the left (toward smaller values).
Bell-shaped distribution (the statistician’s ideal): a smooth, roughly bell-shaped curve (a specific type of symmetric distribution).
Notes on symmetry
- Perfect symmetry is rare in real data due to sampling variability and measurement error.
- In practice, a shape is described as approximately symmetric when its left and right sides roughly mirror each other; the usual reference shape is the bell.
Intuition about symmetry examples
- A symmetric curve might resemble a parabola or a wavy line that can be flipped over to match the other half; in statistics, we focus on bell-shaped symmetry rather than exact mathematical symmetry.

Notation and basics of data tables

Variables and notation
- In data tables, we denote the quantitative variable as x (or another letter, if you specify it).
- The i-th observation is denoted x_i; if a second variable is denoted y, its i-th observation is y_i.
- If you switch rows in a two-column table, you effectively switch the corresponding values; the data as a set are unchanged, but the table representation changes.
The importance of fixing a table
- Once you fix the table (i.e., identify which column represents which variable), do not reorder the values in one column independently of the other; each row must keep the values that belong to the same case, to avoid misinterpretation.
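A small sketch of why the pairing within a row matters, using a hypothetical two-column table. Moving whole rows keeps each case's values together; sorting a single column on its own does not. This reading of "switching rows independently" is an assumption about what the notes intend.

```python
# Hypothetical two-variable table: row i holds (x_i, y_i) for one case.
x = [3, 1, 2]
y = [30, 10, 20]

rows = list(zip(x, y))            # [(3, 30), (1, 10), (2, 20)]

# Reordering whole rows changes the table layout but not the data as a set.
reordered = sorted(rows)          # [(1, 10), (2, 20), (3, 30)]
same_data = set(reordered) == set(rows)

# Sorting only the x column breaks the pairing between x_i and y_i.
broken = list(zip(sorted(x), y))  # [(1, 30), (2, 10), (3, 20)]
pairs_preserved = set(broken) == set(rows)

print(same_data, pairs_preserved)
```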
Mean and median as measures of center
- Mean (average): the arithmetic average of the data.
- Formula for the sample mean:
  \bar{x} = \frac{\text{sum of all } x_i}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
- The sample mean is written \bar{x}; the population mean is written \mu (μ).
- The reasoning for notation variations is historical; always define what your symbols mean in your work.
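As a worked instance of the sample mean formula, the sketch below adds up all observations and divides by n. The data values are hypothetical.

```python
def sample_mean(xs):
    """Sample mean: the sum of all x_i divided by the sample size n."""
    n = len(xs)
    return sum(xs) / n

# Hypothetical sample with n = 4.
x = [2, 4, 4, 10]
x_bar = sample_mean(x)  # (2 + 4 + 4 + 10) / 4

print(x_bar)
```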
Center concepts: mean vs median
- The mean is one measure of center (the average of all data points).
- The median (denoted by m) is the middle value when data are ordered from smallest to largest.
- How to determine the median:
  - If n is odd, the median is the unique middle value after ordering.
  - If n is even, the median is the average of the two middle values after ordering.
- Examples and intuition:
  - For a three-number dataset (n = 3): the median is the second value after sorting.
  - For an even-numbered dataset (n even): take the two central values and average them to obtain the median.
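The odd and even cases of the median rule can be sketched directly. The function name and the data are illustrative choices, not part of the course material.

```python
def median(xs):
    """Middle value after sorting; for even n, the average of the two middle values."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: unique middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two central values

m_odd = median([7, 1, 4])       # sorted: 1, 4, 7
m_even = median([7, 1, 4, 10])  # sorted: 1, 4, 7, 10

print(m_odd, m_even)
```

Sorting first is essential: the middle position of the unsorted list has no special meaning.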
Outliers (intuitive discussion; later formal definition)
- Outliers are observations that lie far from the center of the data and may be considered unusual or extreme.
- Intuition: values like a very large or very small observation relative to the rest can be treated as outliers.
- In class, outliers were discussed informally with examples (e.g., a data point much farther from the central cluster).
- A precise, widely used mathematical definition of outliers is postponed until later in the course, where it receives a more formal treatment.
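One way to see why a single extreme observation matters is to compare the mean and the median before and after adding it. The numbers are hypothetical, and 500 is called an outlier only in the informal sense used above.

```python
from statistics import mean, median

# Hypothetical data forming a tight cluster.
data = [10, 12, 13, 15, 20]

# The same data plus one value far from the cluster.
with_outlier = data + [500]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # center of the original cluster
print(mean_after, median_after)    # the mean moves far more than the median
```

In this sketch the mean jumps from 14 to 95 while the median only moves from 13 to 14, which is the sense in which the median is the more robust measure of center.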

Notation conventions and practical tips

Notation for sample size and variables
- The sample size is denoted by n.
- Some instructors may use other letters (e.g., m, N) to denote a size; if you use another letter, you must define what it means to avoid ambiguity.
- The instructor emphasizes: if you use a non-standard symbol (e.g., m) without defining it, it may be interpreted as the sample size by default; always define your notation clearly.
Variables in a two-variable context
- Commonly, we use x for a primary quantitative variable and y for a secondary variable if needed.
- For a data table with two variables, the i-th row contains the i-th observation across variables (e.g., x_i, y_i).
- If you switch rows, you change the arrangement of the data; this affects the table, even though the underlying data are the same.
Basic statistical definitions in practice
- Mean (sample): \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- Mean (population): \mu
- Median: defined as the middle value after ordering; for even n, the average of the two central values.
- Consistency: when describing the dataset, fix the convention for which column represents which variable and use the same notation throughout.

Quick reference: key formulas and concepts from Section 2.2

Describing a dataset with one quantitative variable:
- Dot plot: visualize each observation as a dot at its value; one dot per case; multiple dots stacked at the same value represent multiple cases.
- Histogram: group data into bins; the height of each bin equals the number of observations within that bin; provides a summarized view of the distribution.
Measures of center:
- Sample mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- Population mean: \mu
- Median: middle value after ordering; for odd n, single middle value; for even n, average of two central values.
Shape descriptors:
- Symmetric (approximately bell-shaped) vs right-skewed vs left-skewed distributions.
- Bell-shaped distribution is a common reference in statistics for a symmetric, unimodal shape.
Notation reminders:
- Sample size: n (also sometimes denoted by other letters with explicit definitions).
- Variables: commonly x (and sometimes y) with observations x_i (or y_i).
Practical data representation tips:
- For categorical variables, use a frequency table (categories vs counts).
- For quantitative variables, prefer visualization (dot plots, histograms) and summary statistics over raw tables when counts are large.
Real-world data example (movies, 2011):
- Categorical: studio (e.g., Universal, Warner Bros., etc.)
- Quantitative: world gross (in dollars)
- Use a frequency table for categorical data; use dot plot/histogram for quantitative data to understand the distribution of world gross across movies.

Summary takeaways

Section 2.2 focuses on understanding a single quantitative variable, its distribution shape, center, and spread, and how to visualize data effectively when a simple frequency table is not feasible.
Visualization tools include dot plots and histograms, with histograms offering a flexible approach to summarize distributions via bins.
The mean and median provide different notions of center, with the mean being sensitive to outliers and the median offering robustness to outliers.
Notation and data representation practices are important for clear communication: define variables, fix the data table structure, and distinguish between sample and population concepts.
A practical data set (movies, 2011) illustrates the distinction between categorical vs quantitative variables and the corresponding visualization strategies.
Understand the differences between symmetric, right-skewed, and left-skewed shapes, and the intuition behind bell-shaped distributions.
Outliers are discussed conceptually; a precise definition is introduced later in the course.