Notes on Section 2.2: One Quantitative Variable and Related Topics
Schedule and Midterm logistics
- Regular quiz this week on Friday.
- Next Friday: first midterm held in-class during regular period.
- No lecture or quiz during midterm week.
- After midterm, bonus quiz next week.
- Overall sequence: a regular quiz, then a midterm, then a bonus quiz.
- Midterm structure: two parts, A and B.
- Part A (in-person): in-class problem set; you must write out your work and hand it in.
- Part B (online, completed in person): online portion on YLEAF+, done in the classroom; you must bring a computer, tablet, or smartphone.
- If you do not take Part A, your Part B score is automatically zero, regardless of Part B performance.
- Part A must have your name on it; otherwise you may receive a zero.
- Makeup policy:
- If you miss the midterm and provide an official document (e.g., doctor’s note), you can request a makeup.
- Otherwise, there is a uniform makeup option.
- The makeup exam will cover sections 1.1 to 2.3 for sure, and possibly section 2.4.
- The instructor will confirm whether 2.4 is included and when makeup might be offered (likely Friday or Monday).
- Focus for study: sections 1.1 to 2.3 (with potential coverage of 2.4).
- Exam timing is in-session, so you should manage your time accordingly.
- Any questions about the schedule? (pause for questions)
Section 2.2: One quantitative variable
- Topic focus: one quantitative (numerical) variable.
- Contrast with categorical variables (where we know how to build a frequency table and a relative frequency table, and visualize with bar plots or pie charts).
- For a quantitative variable, a frequency table is not as straightforward to write down, so we visualize data instead.
- Goals for this section: understand the shape of the distribution, the center, and the spread of a single quantitative variable.
- Analogy: disc vs. data
- A disc has two parameters: center (the location) and radius (the size/spread).
- A data set is not as neat as a disc; it may be an irregular collection of points forming a complex shape.
- Therefore, a single center is often insufficient to describe data; multiple parameters help describe location and spread.
Example context: movies in 2011
- Data in an original four-column Excel-style table:
- Column 1: cases (rows) = individual movies.
- Variables include: studio (categorical) and world gross (quantitative).
- For a categorical variable (e.g., studio), we create a frequency table with categories on the left and counts on the right.
- For a quantitative variable (e.g., world gross), a frequency table is less practical; we turn to visualization.
- Dot plot as an initial visualization
- Axes: x-axis represents the quantitative values (numbers, not categories).
- Each dot corresponds to a case (one dot per case).
- If a value appears multiple times, display multiple dots at that value (e.g., two cases with the same value -> two dots at that value).
- If a value does not occur in the data, there is no dot at that point.
- Advantages: can see the distribution and density of individual values; captures all data points visually.
- Limitations: can be cluttered if there are many distinct values; hard to see the overall pattern for large data sets.
- Transition to histograms for a cleaner summary
- A histogram groups data into intervals (bins) along the x-axis and uses rectangle heights to show counts.
- How to construct a histogram:
- Choose an interval width (e.g., 200, 100, etc.).
- For each interval, count the number of observations (dots) that fall into that interval; this count becomes the height of the corresponding rectangle.
- Example process: intervals [0,200), [200,400), [400,600), etc. (a value exactly on a boundary, such as 200, is counted in only one interval); count the dots within each interval to set the rectangle heights.
- Histogram vs dot plot:
- Histogram uses bins/intervals; dot plot uses exact values.
- Bar chart (for categorical data) uses bars for each category; a histogram uses continuous intervals along the x-axis.
- When to use which:
- For qualitative (categorical) data: bar chart.
- For quantitative (numerical) data: histogram (and optionally dot plot for raw data).
- Important distinctions in visualization choices
- The number of bars in a bar chart is fixed by the number of categories.
- The number of bars in a histogram is not fixed; it can be adjusted (by choosing bin width or the number of bins) to balance detail and readability.
- Software defaults determine the number of bins unless you change them; you can adjust these settings to improve readability.
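The dot plot and histogram constructions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values in `gross` are made up (they are not the 2011 movie data), the function names `dot_counts` and `histogram_counts` are hypothetical, and the bins are taken as half-open intervals [low, high) so that a boundary value is counted once.

```python
from collections import Counter

def dot_counts(values):
    """Dot plot data: number of dots stacked at each distinct value."""
    return dict(sorted(Counter(values).items()))

def histogram_counts(values, width, start=0):
    """Histogram data: ((low, high), count) for each bin [low, high).

    Assumes every value is >= start and that values is not empty.
    """
    per_bin = Counter(int((v - start) // width) for v in values)
    n_bins = max(per_bin) + 1
    return [
        ((start + k * width, start + (k + 1) * width), per_bin.get(k, 0))
        for k in range(n_bins)
    ]

# Made-up world gross values (say, in millions of dollars), for illustration only.
gross = [30, 150, 150, 210, 390, 450, 980]

print(dot_counts(gross))
for (low, high), count in histogram_counts(gross, width=200):
    # One '#' per observation, so the bar length plays the role of the rectangle height.
    print(f"[{low}, {high}): {'#' * count}")
```

Calling `histogram_counts(gross, width=500)` on the same values gives only two bins, which shows how the choice of bin width changes the number of bars.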
Shape of the distribution
- Three common shapes discussed:
- Symmetric (approximately symmetric around a center): the left and right sides mirror each other.
- Right-skewed: a long tail extending to the right (toward larger values).
- Left-skewed: a long tail extending to the left (toward smaller values).
- Bell-shaped distribution (the statistician’s ideal): a smooth, roughly bell-shaped curve (a specific type of symmetric distribution).
- Notes on symmetry
- Perfect symmetry is rare in real data due to sampling variability and measurement error.
- In statistics, shapes are described as approximately symmetric if they resemble a bell shape in practice.
- Intuition about symmetry examples
- A symmetric curve might resemble a parabola or a wavy line that can be flipped over to match the other half; in statistics, we focus on bell-shaped symmetry rather than exact mathematical symmetry.
Notation and basics of data tables
- Variables and notation
- In data tables, we denote the quantitative variable as x (or another letter, if you specify it).
- The i-th observation is denoted x_i; if a second variable is denoted y, its i-th observation is y_i.
- If you switch rows in a two-column table, you effectively switch the corresponding values; the data as a set are unchanged, but the table representation changes.
- The importance of fixing a table
- Once you fix the table (i.e., identify which column represents which variable), do not switch rows independently when discussing data, to avoid misinterpretation.
- Mean and median as measures of center
- Mean (average): the arithmetic average of the data.
- Formula for the sample mean:
\bar{x} = \frac{\text{sum of all } x_i}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
- Population mean: denoted by mu (μ).
- The sample mean is written \bar{x}; the population mean is written \mu.
- The reasoning for notation variations is historical; always define what your symbols mean in your work.
- Center concepts: mean vs median
- The mean is one measure of center (the average of all data points).
- The median (denoted by m) is the middle value when data are ordered from smallest to largest.
- How to determine the median:
- If n is odd, the median is the unique middle value after ordering.
- If n is even, the median is the average of the two middle values after ordering.
- Examples and intuition:
- For a three-number dataset (n = 3): the median is the second value after sorting.
- For an even-numbered dataset (n even): take the two central values and average them to obtain the median.
- Outliers (intuitive discussion; later formal definition)
- Outliers are observations that lie far from the center of the data and may be considered unusual or extreme.
- Intuition: values like a very large or very small observation relative to the rest can be treated as outliers.
- In class, outliers were discussed informally with examples (e.g., a data point much farther from the central cluster).
- A precise, widely used mathematical definition of outliers is postponed until later in the course, where it gets a more formal treatment.
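The definitions of the mean and median above can be checked on a small example. The sketch below is a minimal illustration, not course-provided code: the data values are made up, and the value 100 is added only to play the role of an informal outlier.

```python
def mean(xs):
    """Sample mean: sum of all x_i divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value after sorting; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd n: the unique middle value
    return (s[mid - 1] + s[mid]) / 2  # even n: average of the two central values

data = [3, 1, 4, 8, 2]        # n = 5 (odd), made-up values
with_outlier = data + [100]   # n = 6 (even), one extreme value added

print(mean(data), median(data))                  # 3.6 3
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 19.67, median is 3.5
```

Adding the single extreme value moves the mean from 3.6 to about 19.67, while the median only moves from 3 to 3.5, which is why the median is described as more robust to outliers.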
Notation conventions and practical tips
- Notation for sample size and variables
- The sample size is denoted by n.
- Some instructors may use other letters (e.g., m, N) to denote a size; if you use another letter, you must define what it means to avoid ambiguity.
- The instructor emphasizes: if you use a non-standard symbol (e.g., m) without defining it, it may be interpreted as the sample size by default; always define your notation clearly.
- Variables in a two-variable context
- Commonly, we use x for a primary quantitative variable and y for a secondary variable if needed.
- For a data table with two variables, the i-th row contains the i-th observation of each variable (e.g., x_i, y_i).
- If you switch rows, you change the arrangement of the data; this affects the table, even though the underlying data are the same.
- Basic statistical definitions in practice
- Mean (sample): \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- Mean (population): \mu
- Median: defined as the middle value after ordering; for even n, the average of the two central values.
- Order of operations: when describing the dataset, fix the convention for which column represents which variable and use consistent notation throughout.
- Describing a dataset with one quantitative variable:
- Dot plot: visualize each observation as a dot at its value; one dot per case; multiple dots stacked at the same value represent multiple cases.
- Histogram: group data into bins; the height of each bin equals the number of observations within that bin; provides a summarized view of the distribution.
- Measures of center:
- Sample mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- Population mean: \mu
- Median: middle value after ordering; for odd n, single middle value; for even n, average of two central values.
- Shape descriptors:
- Symmetric (approximately bell-shaped) vs right-skewed vs left-skewed distributions.
- Bell-shaped distribution is a common reference in statistics for a symmetric, unimodal shape.
- Notation reminders:
- Sample size: n (also sometimes denoted by other letters with explicit definitions).
- Variables: commonly x (and sometimes y) with observations x_i (or y_i).
- Practical data representation tips:
- For categorical variables, use a frequency table (categories vs counts).
- For quantitative variables, prefer visualization (dot plots, histograms) and summary statistics over raw tables when counts are large.
- Real-world data example (movies, 2011):
- Categorical: studio (e.g., Universal, Warner Bros., etc.)
- Quantitative: world gross (in dollars)
- Use a frequency table for categorical data; use dot plot/histogram for quantitative data to understand the distribution of world gross across movies.
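For the categorical side of this comparison, a frequency table and a relative frequency table can be built as in the sketch below. The list of studios is invented for illustration (it is not the actual 2011 data, and "Other" is a placeholder category), and the variable names are hypothetical.

```python
from collections import Counter

# Made-up studio labels, one entry per movie (case); not the real 2011 data.
studios = ["Universal", "Warner Bros.", "Universal", "Other", "Warner Bros.", "Universal"]

n = len(studios)                                   # sample size
freq = Counter(studios)                            # frequency table: category -> count
rel_freq = {s: c / n for s, c in freq.items()}     # relative frequency: count / n

# Categories on the left, counts and proportions on the right.
for studio, count in freq.most_common():
    print(f"{studio:<14}{count:>3}{rel_freq[studio]:>8.2f}")
```

The counts add up to n and the relative frequencies add up to 1, which is a quick check that no case was dropped.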
Summary takeaways
- Section 2.2 focuses on understanding a single quantitative variable, its distribution shape, center, and spread, and how to visualize data effectively when a simple frequency table is not feasible.
- Visualization tools include dot plots and histograms, with histograms offering a flexible approach to summarize distributions via bins.
- The mean and median provide different notions of center, with the mean being sensitive to outliers and the median offering robustness to outliers.
- Notation and data representation practices are important for clear communication: define variables, fix the data table structure, and distinguish between sample and population concepts.
- A practical data set (movies, 2011) illustrates the distinction between categorical vs quantitative variables and the corresponding visualization strategies.
- Understand the differences between symmetric, right-skewed, and left-skewed shapes, and the intuition behind bell-shaped distributions.
- Outliers are discussed conceptually; a precise definition is introduced later in the course.