Chapter 1 Notes: Exploring Data — Summary and Notes

Statistics and Data: Key Concepts from Chapter 1

Statistics is the science of learning from data. Data are numbers with context; numbers by themselves are not informative. Context connects data to the real world (e.g., a baby weighing 10.5 pounds vs. 10.5 ounces is meaningful only with context).
Why study statistics? To make sound, data-based decisions in careers and everyday life amid abundant data (polls, prices, tests, etc.).
Data Beat Personal Experiences: anecdotes can be misleading; large, well-designed studies provide more reliable conclusions. Example contrast: Danish Cancer Society study with 350,000 Danish residents found no statistical difference in brain-cancer rates between cell-phone users and non-users, despite compelling anecdotes.
Where data come from matters: data sources and sampling methods affect conclusions. Anecdotes can mislead; representative data are needed for trustworthy results. Example: in an Ann Landers write-in poll, 70% of responding parents said they would not have children again, whereas a representative survey found that 91% of parents would.
Always Plot Your Data: graphs often reveal patterns a table cannot, which makes them a powerful tool for learning from data. As Yogi Berra put it, "You can observe a lot just by watching."
Gapminder example: Life expectancy vs. income shows overall positive association but with notable outliers and differences across countries. Data visualization can reveal nontrivial patterns and exceptions.
Individuals and Variables (Chapter 1 Introduction):
- Individuals: the objects described by a data set (people, animals, things).
- Variables: characteristics measured for each individual; can take different values across individuals.
- Types of variables:
  - Categorical variable: places an individual into categories (labels). Examples: gender, race, occupation. Some categorical variables group values of a quantitative variable (e.g., age groups like 0–9, 10–19, …).
  - Quantitative variable: takes numerical values where arithmetic makes sense (e.g., age, GPA, height).
  - Not every variable recorded as a number is quantitative (e.g., a ZIP code is made of digits, but averaging ZIP codes is meaningless, so it is categorical).
- Practical questions when encountering a new data set: Who are the individuals? What are the variables and their units?
CensusAtSchool example (data set exercise): random sample of 10 Canadian students with variables such as Province, Gender, Languages spoken, Handedness, Height (cm), Wrist circumference, Preferred communication. Used to illustrate data structure and variable types; helps distinguish categorical vs. quantitative variables.
Distribution: the set of values a variable takes and the frequency or relative frequency of each value.
Data exploration approach: start by examining each variable by itself, then study relationships among variables. Graph first, then numerical summaries.
Inference: moving from data at hand to conclusions about a population. An example activity uses a lottery to explore whether a result could happen by chance (discrimination in hiring). Inference depends on how data were produced (sampling vs. experiment). Probability is developed in later chapters.
Chapter 1 summary (organization):
- Distinguish between categorical and quantitative variables; understand distributions;
- Use graphs (bar charts, pie charts for categorical; dotplots, stemplots, histograms for quantitative);
- Learn about marginal and conditional distributions in two-way tables to study relationships between two categorical variables;
- Learn about center and spread for quantitative data (mean, median, range, IQR, standard deviation);
- Understand outliers and the 1.5 × IQR rule; learn to interpret five-number summary and boxplots; and know when to use boxplots for comparisons.

Section 1.1 Analyzing Categorical Data

Goals for analyzing categorical data:
- Display the distribution of a single categorical variable with bar graphs or pie charts; decide when a pie chart is appropriate.
- Identify deceptive graphs (e.g., mis-scaled bars, misleading pies, pictographs).
- Describe the distribution with counts or percents.
- For two categorical variables, use a two-way table to describe joint distributions and compute:
  - Marginal distributions: distributions of a variable across the whole table (row or column margins).
  - Conditional distributions: distributions of one variable for a fixed value of the other variable (e.g., Opinion among Women).
- Association: an association exists if knowing one variable's value helps predict the other; no association means conditional distributions are the same.
Example: Radio Station Formats
- A table shows counts of stations by format; a corresponding relative frequency (percent) table shows percentages. The counts should sum to the overall total; the percentages may not sum to exactly 100% (e.g., 99.9%) because each one is rounded.
- Visuals: Bar graphs and pie charts display distributions; bar graphs are typically easier to read and more flexible for comparing quantities.
- Pie charts must include all categories that form the whole; they are awkward if the goal is to compare multiple quantities in the same units.
Two-way tables and marginal distributions
- In a two-way table, the row totals describe the distribution of the row category within the entire sample; the column totals describe the distribution of the column category within the entire sample.
- Percentages can be computed as counts divided by the table total to get the marginal distribution in percent.
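The marginal calculation described above can be sketched in Python. The table below uses made-up counts for illustration (it is not the survey data from the text), and the function name `marginal_distributions` is a hypothetical helper, not something from the chapter.

```python
# Hypothetical two-way table of counts (made-up numbers, illustration only).
# Outer keys are row categories (opinion); inner keys are column categories.
table = {
    "Good chance": {"Female": 30, "Male": 20},
    "50-50 chance": {"Female": 15, "Male": 10},
    "No chance": {"Female": 5, "Male": 20},
}


def marginal_distributions(table):
    """Return (row_marginal, column_marginal) as percents of the table total."""
    total = sum(sum(cells.values()) for cells in table.values())

    # Row marginal: each row total divided by the table total.
    row_marginal = {
        row: 100 * sum(cells.values()) / total for row, cells in table.items()
    }

    # Column marginal: accumulate column totals, then divide by the table total.
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    col_marginal = {col: 100 * t / total for col, t in col_totals.items()}

    return row_marginal, col_marginal


row_pct, col_pct = marginal_distributions(table)
print(row_pct)  # marginal distribution of opinion, in percent
print(col_pct)  # marginal distribution of the column variable, in percent
```

Each marginal distribution uses the table total as its denominator, which is what separates it from the conditional distributions discussed next.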
Conditional distributions (an example with gender and opinion about wealth by age 30)
- For a given gender, compute the distribution of opinions (percentages across the five categories in that column).
- For each opinion category, compute the distribution of gender within that category (percentages across the two genders within that row).
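As a rough sketch of both directions of conditioning, the Python below computes a conditional distribution within one column and within one row. The counts are invented and use three opinion categories rather than the five in the text's example; the helper names are hypothetical.

```python
# Hypothetical two-way table of counts (made-up numbers, illustration only).
table = {
    "Good chance": {"Female": 30, "Male": 20},
    "50-50 chance": {"Female": 15, "Male": 10},
    "No chance": {"Female": 5, "Male": 20},
}


def conditional_by_column(table, column):
    """Percent distribution of the row variable among individuals in one column."""
    col_total = sum(cells[column] for cells in table.values())
    return {row: 100 * cells[column] / col_total for row, cells in table.items()}


def conditional_by_row(table, row):
    """Percent distribution of the column variable among individuals in one row."""
    row_total = sum(table[row].values())
    return {col: 100 * count / row_total for col, count in table[row].items()}


female = conditional_by_column(table, "Female")  # opinions among females
male = conditional_by_column(table, "Male")      # opinions among males
no_chance = conditional_by_row(table, "No chance")  # gender among "No chance"

print(female)
print(male)
print(no_chance)
```

The denominator is the column (or row) total rather than the table total. In this invented table the two column-wise distributions differ, which is the pattern the notes describe as an association.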
Association and interpretation
- Side-by-side bar graphs help compare conditional distributions across groups (women vs. men).
- An obvious association appears when conditional distributions differ notably between groups; no association is suggested when distributions are similar.
Titanic data example (illustrative two-way table)
- Survival status by class of travel and by gender shows how two categorical variables interact; prompts exploration of gender effect, class effect, and their interaction.
AP exam tips for Section 1.1
- Distinguish categorical vs. quantitative variables early; the variable type determines which graphs and summaries are appropriate.
- Use side-by-side bar graphs to compare conditional distributions; interpret association by comparing conditional distributions.

Section 1.2 Displaying Quantitative Data with Graphs

Graphs for quantitative data:
- Dotplots: simple, show each value as a dot on a number line; good for small data sets.
- Stemplots (stem-and-leaf plots): retain actual data values while showing distribution; effective for small to moderate data sets; can split stems to improve readability; back-to-back stemplots compare two groups.
- Histograms: group data into equal-width classes; display the distribution in terms of counts or relative frequencies (percentages).
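To make the stemplot idea concrete, here is a minimal Python sketch that assumes non-negative whole-number data, with the tens digit as the stem and the ones digit as the leaf. The function name `stemplot` and the sample values are made up for illustration; splitting stems and back-to-back plots are not handled.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot lines for non-negative integers (stem = tens, leaf = ones)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)

    lines = []
    # Include every stem between the smallest and largest, even empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines


plot = stemplot([12, 15, 21, 40, 47, 12])  # hypothetical data
for line in plot:
    print(line)
```

Because every leaf is an actual digit from the data, the original values can be read back from the plot, which is the advantage over a histogram that the notes mention.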
SOCS: Shape, Outliers, Center, Spread — a quick, informal framework for describing a distribution from a graph.
Examples and guidance:
- U.S. women’s soccer goals data (2012): example of a dotplot; describe shape (peak near 4 goals), center (~3 goals), spread (0 to 14); identify possible outliers (13, 14) and discuss whether they are genuine outliers depending on context.
- EPA highway mileage data (dotplot): identify shape, center, spread, and outliers (e.g., Prius very high mileage; Bentley Mulsanne very low mileage).
- Distribution shapes: symmetric, skewed left/right, bimodal, multimodal; note that many biological measurements are roughly symmetric, salaries/prices tend to be right-skewed.
Histograms: how to construct
- Divide data into classes of equal width (e.g., 0 to <5, 5 to <10, etc.).
- Count or compute relative frequencies for each class.
- The choice of class width and boundaries affects the histogram’s appearance; five classes is a common minimum for a useful view.
- Relative frequency histograms facilitate comparisons across data sets of different sizes.
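The construction steps above can be sketched as a small Python function. The starting point, class width, and data are arbitrary choices made for illustration, and `histogram_classes` is a hypothetical helper name.

```python
def histogram_classes(values, start, width, n_classes):
    """Count values in equal-width classes [start + k*width, start + (k+1)*width).

    Returns (counts, relative_frequencies), one entry per class.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class containing v
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the chosen classes")
        counts[k] += 1

    n = len(values)
    relative = [c / n for c in counts]
    return counts, relative


data = [1, 3, 4, 5, 7, 9, 10, 12, 14, 18]  # hypothetical data
counts, relative = histogram_classes(data, start=0, width=5, n_classes=4)
print(counts)    # counts for 0 to <5, 5 to <10, 10 to <15, 15 to <20
print(relative)  # the same classes as proportions of all observations
```

A value on a boundary (such as 5 or 10 here) goes into the class that starts at that value, matching the "0 to <5, 5 to <10" convention in the notes. Rerunning with a different `width` shows how the class choice changes the picture.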
When to use which graph:
- Use dotplots/stemplots for small data; histograms for large data sets; avoid pictographs for quantitative data.
Comparing distributions: the group comparisons from Section 1.1 carry over to quantitative variables. Graph each group on a common scale to compare a quantitative variable across groups (e.g., household sizes by country, or distributions across age groups).
Technology corners (calculator tips): how to set up and interpret histograms using TI-Nspire, TI-83/84, or TI-89; steps include inputting data, choosing histogram type, adjusting bin width, and interpreting outputs.
Practice concepts from Section 1.2:
- Distinguishing between bar graphs (categorical) and histograms (quantitative).
- The importance of consistent axis labeling and scale, especially when comparing distributions of different sizes.
- The risk of misleading graphs (e.g., scale manipulation, pictographs) and the need for cautious interpretation.

Section 1.3 Describing Quantitative Data with Numbers

Measures of center:
- Mean (arithmetic average):
  - Definition: for a sample $X_1, \dots, X_n$, the mean is $\bar{x} = \frac{1}{n} \sum_{i=1}^n X_i$.
  - Population mean: $\mu = \frac{1}{N} \sum_{i=1}^N X_i$.
  - The mean is a balance point of the data; it uses all data values and is sensitive to outliers and skew.
- Median: the middle value when data are ordered; resistant to outliers; preferred for skewed distributions or when outliers are present.
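A quick way to see the difference in resistance is to add one extreme value and recompute both measures. The sketch below uses Python's standard `statistics` module; the data values are invented for illustration.

```python
from statistics import mean, median

data = [10, 12, 15, 20, 25]   # hypothetical values
with_outlier = data + [90]    # the same data plus one extreme value

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # center of the original data
print(mean_after, median_after)    # center after adding the extreme value

# How far each measure moved when the extreme value was added.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before
print(mean_shift, median_shift)
```

In this example the mean moves by more than 12 units while the median moves by 2.5, which is what "resistant" means in practice.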
Measures of spread:
- Range: difference between max and min; sensitive to extreme values.
- Interquartile Range (IQR): $IQR = Q3 - Q1$ ; measures spread of the central 50% of the data; resistant to outliers.
- Standard Deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{x})^2}$ ; measures typical distance from the mean; not resistant to outliers; uses same units as the data.
- Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{x})^2$ .
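The variance and standard deviation formulas can be followed step by step in Python. This sketch uses invented data, computes the sample versions (dividing by n − 1) by hand, and then checks them against the standard library's `statistics.variance` and `statistics.stdev`.

```python
from math import sqrt
from statistics import stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

n = len(data)
x_bar = sum(data) / n                             # sample mean
squared_devs = [(x - x_bar) ** 2 for x in data]   # (X_i - x_bar)^2 for each value
s_squared = sum(squared_devs) / (n - 1)           # sample variance
s = sqrt(s_squared)                               # sample standard deviation

print(x_bar, s_squared, s)

# The standard library uses the same n - 1 divisor for these two functions.
print(variance(data), stdev(data))
```

Here the squared deviations sum to 32, so the variance is 32/7 (about 4.57) and the standard deviation is about 2.14, in the same units as the data.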
The five-number summary: minimum, Q1, median, Q3, maximum. Used to construct a boxplot and to describe center/spread; used to identify outliers via the 1.5 × IQR rule:
- Outlier criterion: an observation is an outlier if it falls below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$ .
Boxplots: graphical representation of the five-number summary; whiskers extend to the smallest/largest non-outlier values; outliers are plotted individually.
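The five-number summary, the 1.5 × IQR rule, and the whisker endpoints can be sketched together in Python. Quartile conventions differ between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data, leaving out the overall median when n is odd. The notes do not state a convention, so treat that choice, the helper names, and the data as assumptions.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # With an odd count, the middle value belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def boxplot_parts(values):
    """Return the pieces needed to draw a boxplot with outliers marked."""
    xs = sorted(values)
    minimum, q1, med, q3, maximum = five_number_summary(xs)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "five_number": (minimum, q1, med, q3, maximum),
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        # Whiskers stop at the most extreme values that are not outliers.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }


data = [5, 10, 10, 15, 15, 20, 20, 25, 30, 85]  # hypothetical values
parts = boxplot_parts(data)
for name, value in parts.items():
    print(name, value)
```

For this data the fences are −12.5 and 47.5, so 85 is flagged as an outlier and the upper whisker ends at 30 rather than at the maximum.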
Choosing measures of center and spread:
- For roughly symmetric distributions without outliers: use mean and standard deviation.
- For skewed distributions or data with outliers: use median and IQR (more resistant to extreme values).
The relationship between mean and median:
- In symmetric distributions, they are close (often equal when perfectly symmetric).
- In skewed distributions, the mean is pulled toward the tail; the median is more robust.
- Example: travel times to work in North Carolina showed mean > median due to right-skew; removing an outlier reduces the mean more than the median.
Inference and data production: remember that inferences depend on how data were produced (sampling vs. experiments) and on the assumptions behind the methods used.
Technology and practice:
- Calculators and software can compute numerical summaries (mean, median, quartiles, IQR, standard deviation) and produce boxplots. They also offer practice with “What’s the shape? SOCS.”
Four-step problem-solving framework for statistics problems:
- State: what is the question?
- Plan: what methods will answer it?
- Do: perform the calculations and make graphs.
- Conclude: interpret the result in the real-world context.
Data Exploration Case Study: “Do Pets or Friends Help Reduce Stress?”
- Randomized groups (pet, friend, control) to measure stress via heart rate during a stressful task.
- Through such case studies, you practice comparing distributions and choosing appropriate measures of center/spread.

Practical Formulas and Concepts to Remember

Distribution of a variable: what values the variable takes and how often it takes them.
Marginal distribution (two-way tables): distribution of a single variable ignoring the other;
- Example: marginal distribution of opinions across all respondents.
Conditional distribution (two-way tables): distribution of a variable for a fixed value of another variable;
- Example: conditional distribution of gender within each opinion category, or opinions within each gender.
Association in two categorical variables: knowing one variable helps predict the other; absence of association means conditional distributions are identical.
SOCS framework for quantitative data: Shape, Outliers, Center, Spread.
Five-number summary and boxplot: (Minimum, Q1, Median, Q3, Maximum).
1.5 × IQR rule for outliers: observations outside [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] are outliers.
When to use which: use histograms/dotplots/stemplots for quantitative data; bar charts/pie charts for categorical data; use relative frequencies when comparing groups of different sizes.
Inference and data production (two routes to keep in mind):
- Sampling: describe the sample, then generalize to the population it represents.
- Experiments: support cause-and-effect conclusions, with emphasis on randomization and control.
Technology: calculators (TI-83/84, TI-89, TI-Nspire) and software can produce:
- One-variable statistics: n, mean, median, min, max, Q1, Q3, IQR, standard deviation.
- Boxplots: with or without outliers; parallel boxplots for comparisons.

Quick Reference: Key Formulas (LaTeX)

Mean (sample): $\bar{x} = \frac{1}{n} \sum_{i=1}^n X_i$
Population mean: $\mu = \frac{1}{N} \sum_{i=1}^N X_i$
Variance (sample): $s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2}$
Interquartile Range: $IQR = Q3 - Q1$
Five-number summary: (Minimum, Q1, Median, Q3, Maximum)
Boxplot whiskers extend to the smallest and largest data points that are not outliers.
Outlier rule: an observation $X$ is an outlier if $X < Q1 - 1.5 \times IQR$ or $X > Q3 + 1.5 \times IQR$.