Chapter 1 Notes: Exploring Data — Summary and Notes
Statistics and Data: Key Concepts from Chapter 1
- Statistics is the science of learning from data. Data are numbers with context; numbers by themselves are not informative. Context connects data to the real world (e.g., a baby weighing 10.5 pounds vs. 10.5 ounces is meaningful only with context).
- Why study statistics? To make sound, data-based decisions in careers and everyday life amid abundant data (polls, prices, tests, etc.).
- Data Beat Personal Experiences: anecdotes can be misleading; large, well-designed studies provide more reliable conclusions. Example contrast: Danish Cancer Society study with 350,000 Danish residents found no statistical difference in brain-cancer rates between cell-phone users and non-users, despite compelling anecdotes.
- Where data come from matters: data sources and sampling methods affect conclusions, and representative data are needed for trustworthy results. Example: in an Ann Landers write-in poll, 70% of the parents who responded said they would not have children again, whereas a representative survey found that 91% would.
- Always Plot Your Data: graphs often reveal patterns that a table cannot, which makes them powerful tools for learning from data (Yogi Berra: "You can observe a lot by just watching").
- Gapminder example: Life expectancy vs. income shows overall positive association but with notable outliers and differences across countries. Data visualization can reveal nontrivial patterns and exceptions.
- Individuals and Variables (Section 1.1):
- Individuals: the objects described by a data set (people, animals, things).
- Variables: characteristics measured for each individual; can take different values across individuals.
- Types of variables:
- Categorical variable: places an individual into categories (labels). Examples: gender, race, occupation. Some categorical variables group values of a quantitative variable (e.g., age groups like 0–9, 10–19, …).
- Quantitative variable: takes numerical values where arithmetic makes sense (e.g., age, GPA, height).
- Not every number is quantitative (e.g., a ZIP code is made of digits, but averaging ZIP codes is meaningless, so it is a categorical variable).
- Practical questions when encountering a new data set: Who are the individuals? What are the variables and their units?
- CensusAtSchool example (data set exercise): random sample of 10 Canadian students with variables such as Province, Gender, Languages spoken, Handedness, Height (cm), Wrist circumference, Preferred communication. Used to illustrate data structure and variable types; helps distinguish categorical vs. quantitative variables.
- Distribution: the set of values a variable takes and the frequency or relative frequency of each value.
- Data exploration approach: start by examining each variable by itself, then study relationships among variables. Graph first, then numerical summaries.
- Inference: moving from data at hand to conclusions about a population. An example activity uses a lottery to explore whether a result could happen by chance (discrimination in hiring). Inference depends on how data were produced (sampling vs. experiment). Probability is developed in later chapters.
- Chapter 1 summary (organization):
- Distinguish between categorical and quantitative variables; understand distributions;
- Use graphs (bar charts, pie charts for categorical; dotplots, stemplots, histograms for quantitative);
- Learn about marginal and conditional distributions in two-way tables to study relationships between two categorical variables;
- Learn about center and spread for quantitative data (mean, median, range, IQR, standard deviation);
- Understand outliers and the 1.5 × IQR rule; learn to interpret five-number summary and boxplots; and know when to use boxplots for comparisons.
Section 1.1 Analyzing Categorical Data
- Goals for analyzing categorical data:
- Display the distribution of a single categorical variable with bar graphs or pie charts; decide when a pie chart is appropriate.
- Identify deceptive graphs (e.g., mis-scaled bars, misleading pies, pictographs).
- Describe the distribution with counts or percents.
- For two categorical variables, use a two-way table to describe joint distributions and compute:
- Marginal distributions: distributions of a variable across the whole table (row or column margins).
- Conditional distributions: distributions of one variable for a fixed value of the other variable (e.g., Opinion among Women).
- Association: an association exists if knowing one variable's value helps predict the other; no association means conditional distributions are the same.
- Example: Radio Station Formats
- A frequency table shows counts of stations by format; the corresponding relative frequency table shows percents. Counts should sum to the overall total; rounded percents may not sum to exactly 100% (e.g., 99.9%) because of roundoff error.
- Visuals: Bar graphs and pie charts display distributions; bar graphs are typically easier to read and more flexible for comparing quantities.
- Pie charts must include all categories that form the whole; they are awkward if the goal is to compare multiple quantities in the same units.
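To make the count-versus-percent distinction concrete, here is a minimal Python sketch of a frequency table and its relative frequency table. The station list is made up for illustration (it is not the data from the radio example), and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical format labels for 9 stations (made up for illustration).
formats = ["Country"] * 4 + ["News/Talk"] * 3 + ["Rock"] * 2

counts = Counter(formats)        # frequency table: category -> count
total = sum(counts.values())     # counts must add up to the overall total

# Relative frequency table: percent of the whole, rounded to one decimal.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

for fmt, n in counts.most_common():
    print(f"{fmt:<10} {n:>3} {percents[fmt]:>6.1f}%")

# The rounded percents need not add up to exactly 100.
print("sum of rounded percents:", round(sum(percents.values()), 1))
```

With these made-up counts the rounded percents are 44.4, 33.3, and 22.2, which sum to 99.9 rather than 100. That is roundoff error, not a mistake in the table.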
- Two-way tables and marginal distributions
- In a two-way table, the row totals describe the distribution of the row category within the entire sample; the column totals describe the distribution of the column category within the entire sample.
- Percentages can be computed as counts divided by the table total to get the marginal distribution in percent.
- Conditional distributions (an example with gender and opinion about wealth by age 30)
- For a given gender, compute the distribution of opinions (percentages across the five categories in that column).
- For each opinion category, compute the distribution of gender within that category (percentages across the two genders within that row).
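The marginal and conditional calculations above can be sketched in Python as follows. The counts are invented for illustration and use only three opinion categories instead of the five in the example; the function names are hypothetical.

```python
# Hypothetical two-way table (made-up counts): rows are opinions, columns are genders.
table = {
    "Almost no chance": {"Female": 10, "Male": 10},
    "Good chance":      {"Female": 30, "Male": 20},
    "Almost certain":   {"Female": 10, "Male": 20},
}

def marginal_rows(table):
    """Marginal distribution of the row variable: row totals as percents of the table total."""
    total = sum(sum(row.values()) for row in table.values())
    return {r: 100 * sum(row.values()) / total for r, row in table.items()}

def conditional_on_column(table, col):
    """Distribution of the row variable among individuals in one column (e.g., among women)."""
    col_total = sum(row[col] for row in table.values())
    return {r: 100 * row[col] / col_total for r, row in table.items()}

def conditional_on_row(table, r):
    """Distribution of the column variable among individuals in one row (one opinion)."""
    row_total = sum(table[r].values())
    return {c: 100 * n / row_total for c, n in table[r].items()}

print(marginal_rows(table))
print(conditional_on_column(table, "Female"))
print(conditional_on_column(table, "Male"))
print(conditional_on_row(table, "Good chance"))
```

In this made-up table, 60% of women but only 40% of men answer "Good chance". Because the two conditional distributions differ, knowing gender helps predict opinion, which is what an association means.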
- Association and interpretation
- Side-by-side bar graphs help compare conditional distributions across groups (women vs. men).
- An obvious association appears when conditional distributions differ notably between groups; no association is suggested when distributions are similar.
- Titanic data example (illustrative two-way table)
- Survival status by class of travel and by gender shows how two categorical variables interact; prompts exploration of gender effect, class effect, and their interaction.
- AP exam tips for Section 1.1
- Distinguish categorical vs. quantitative variables early; the variable type determines which graphs and summaries are appropriate.
- Use side-by-side bar graphs to compare conditional distributions; interpret association by comparing conditional distributions.
Section 1.2 Displaying Quantitative Data with Graphs
- Graphs for quantitative data:
- Dotplots: simple, show each value as a dot on a number line; good for small data sets.
- Stemplots (stem-and-leaf plots): retain actual data values while showing distribution; effective for small to moderate data sets; can split stems to improve readability; back-to-back stemplots compare two groups.
- Histograms: group data into equal-width classes; display the distribution in terms of counts or relative frequencies (percentages).
- SOCS: Shape, Outliers, Center, Spread — a quick, informal framework for describing a distribution from a graph.
- Examples and guidance:
- U.S. women’s soccer goals data (2012): example of a dotplot; describe shape (peak near 4 goals), center (~3 goals), spread (0 to 14); identify possible outliers (13, 14) and discuss whether they are genuine outliers depending on context.
- EPA highway mileage data (dotplot): identify shape, center, spread, and outliers (e.g., Prius very high mileage; Bentley Mulsanne very low mileage).
- Distribution shapes: symmetric, skewed left/right, bimodal, multimodal; note that many biological measurements are roughly symmetric, salaries/prices tend to be right-skewed.
- Histograms: how to construct
- Divide data into classes of equal width (e.g., 0 to <5, 5 to <10, etc.).
- Count or compute relative frequencies for each class.
- The choice of class width and boundaries affects the histogram’s appearance; five classes is a common minimum for a useful view.
- Relative frequency histograms facilitate comparisons across data sets of different sizes.
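The construction steps above can be sketched in Python. This is an illustrative sketch only: the data values are made up, and `histogram_classes` is a hypothetical helper, not a calculator or library function.

```python
import math
from collections import Counter

def histogram_classes(values, width, start=0):
    """Count values in equal-width classes [start + k*width, start + (k+1)*width).

    Returns a list of (lower, upper, count, relative_frequency) tuples,
    including empty classes between the smallest and largest occupied class.
    """
    counts = Counter(math.floor((v - start) / width) for v in values)
    n = len(values)
    rows = []
    for k in range(min(counts), max(counts) + 1):
        lower = start + k * width
        count = counts.get(k, 0)
        rows.append((lower, lower + width, count, count / n))
    return rows

# Hypothetical data (made up for illustration).
data = [1, 3, 4, 5, 7, 8, 9, 12, 14, 22]

for lower, upper, count, rel in histogram_classes(data, width=5):
    # A row of asterisks stands in for the bar of the histogram.
    print(f"{lower:>2} to <{upper:<2} {count:>2} {rel:6.1%} {'*' * count}")
```

Rerunning with a different `width` or `start` shows how the choice of classes changes the apparent shape. The relative frequency column is the one to compare when two data sets have different sizes.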
- When to use which graph:
- Use dotplots/stemplots for small data; histograms for large data sets; avoid pictographs for quantitative data.
- Comparing distributions: extending the conditional-distribution idea from Section 1.1, compare the distribution of a quantitative variable across groups (e.g., household sizes by country, or distributions across age groups).
- Technology corners (calculator tips): how to set up and interpret histograms using TI-Nspire, TI-83/84, or TI-89; steps include inputting data, choosing histogram type, adjusting bin width, and interpreting outputs.
- Practice concepts from Section 1.2:
- Distinguishing between bar graphs (categorical) and histograms (quantitative).
- The importance of consistent axis labeling and scale, especially when comparing distributions of different sizes.
- The risk of misleading graphs (e.g., scale manipulation, pictographs) and the need for cautious interpretation.
Section 1.3 Describing Quantitative Data with Numbers
Measures of center:
- Mean (arithmetic average):
- Definition: for a sample $x_1, \dots, x_n$, the mean is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
- Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$, where $N$ is the population size.
- The mean is a balance point of the data; it uses all data values and is sensitive to outliers and skew.
- Median: the middle value when data are ordered; resistant to outliers; preferred for skewed distributions or when outliers are present.
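A small Python sketch can illustrate why the median is called resistant and the mean is not. The travel times below are made up for illustration; they are not the North Carolina data.

```python
from statistics import mean, median

# Hypothetical travel times in minutes (made up for illustration).
times = [10, 15, 20, 20, 25, 30]
with_outlier = times + [120]   # add one extreme commute

print("without outlier:", mean(times), median(times))
print("with outlier:   ", round(mean(with_outlier), 1), median(with_outlier))
```

Adding the single 120-minute value pulls the mean from 20 up to about 34.3, while the median stays at 20.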
Measures of spread:
- Range: difference between max and min; sensitive to extreme values.
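- Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$; measures spread of the central 50% of the data; resistant to outliers.
- Standard Deviation: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$; measures typical distance from the mean; not resistant to outliers; uses same units as the data.
- Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, the square of the standard deviation.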
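As a worked illustration, the sample variance and standard deviation can be computed step by step and checked against Python's `statistics` module. This sketch assumes the usual sample formulas with divisor n − 1; the data values are made up.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical data (made up for illustration).
data = [2, 4, 4, 6, 9]

xbar = mean(data)                          # sample mean
deviations = [x - xbar for x in data]      # distance of each value from the mean
s2 = sum(d ** 2 for d in deviations) / (len(data) - 1)   # sample variance (divide by n - 1)
s = sqrt(s2)                               # standard deviation, in the units of the data

print("mean:", xbar)
print("deviations:", deviations)
print("variance:", s2, "standard deviation:", round(s, 3))

# The library functions use the same n - 1 divisor.
print(variance(data), round(stdev(data), 3))
```

Here the mean is 5, the squared deviations sum to 28, and dividing by n − 1 = 4 gives a variance of 7, so the standard deviation is √7 ≈ 2.65.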
The five-number summary: minimum, Q1, median, Q3, maximum. Used to construct a boxplot and to describe center/spread; used to identify outliers via the 1.5 × IQR rule:
- Outlier criterion: an observation is an outlier if it falls below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$.
Boxplots: graphical representation of the five-number summary; whiskers extend to the smallest/largest non-outlier values; outliers are plotted individually.
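The five-number summary and the 1.5 × IQR rule can be sketched in Python. One assumption to flag: the quartiles here are taken as the medians of the lower and upper halves of the ordered data (leaving out the overall median when n is odd), a convention common in introductory courses. Calculators and software may use other quartile rules and give slightly different values. The data and function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two values.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves of the
    sorted data; the overall median is left out of both halves when n is odd.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(values):
    """Values flagged by the 1.5 x IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

# Hypothetical data (made up for illustration).
data = [5, 10, 10, 15, 20, 20, 25, 30, 40, 85]

print(five_number_summary(data))
print(outliers(data))
```

For this made-up data set, Q1 = 10 and Q3 = 30, so IQR = 20 and the cutoffs are −20 and 60. Only 85 is flagged, and in a boxplot the upper whisker would stop at 40, the largest value that is not an outlier.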
Choosing measures of center and spread:
- For roughly symmetric distributions without outliers: use mean and standard deviation.
- For skewed distributions or data with outliers: use median and IQR (more resistant to extreme values).
The relationship between mean and median:
- In symmetric distributions, they are close (often equal when perfectly symmetric).
- In skewed distributions, the mean is pulled toward the tail; the median is more robust.
- Example: travel times to work in North Carolina showed mean > median due to right-skew; removing an outlier reduces the mean more than the median.
Inference and data production: remember that inferences depend on how data were produced (sampling vs. experiments) and on the assumptions behind the methods used.
Technology and practice:
- Calculators and software can compute numerical summaries (mean, median, quartiles, IQR, standard deviation) and produce boxplots. They also offer practice with “What’s the shape? SOCS.”
Four-step problem-solving framework for statistics problems:
- State: what is the question?
- Plan: what methods will answer it?
- Do: perform the calculations and make graphs.
- Conclude: interpret the result in the real-world context.
Data Exploration Case Study: “Do Pets or Friends Help Reduce Stress?”
- Randomized groups (pet, friend, control) to measure stress via heart rate during a stressful task.
- Through such case studies, you practice comparing distributions and choosing appropriate measures of center/spread.
Practical Formulas and Concepts to Remember
- Distribution of a variable: what values the variable takes and how often it takes them.
- Marginal distribution (two-way tables): distribution of a single variable ignoring the other;
- Example: marginal distribution of opinions across all respondents.
- Conditional distribution (two-way tables): distribution of a variable for a fixed value of another variable;
- Example: conditional distribution of gender within each opinion category, or opinions within each gender.
- Association in two categorical variables: knowing one variable helps predict the other; absence of association means conditional distributions are identical.
- SOCS framework for quantitative data: Shape, Outliers, Center, Spread.
- Five-number summary and boxplot: (Minimum, Q1, Median, Q3, Maximum).
- 1.5 × IQR rule for outliers: observations outside [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] are outliers.
- When to use which: use histograms/dotplots/stemplots for quantitative data; bar charts/pie charts for categorical data; use relative frequencies when comparing groups of different sizes.
- Inference depends on how the data were produced:
- Sampling: a representative (random) sample supports generalizing from the data to the population.
- Experiments: support cause-and-effect conclusions, with emphasis on randomization and control.
- Technology: calculators (TI-83/84, TI-89, TI-Nspire) and software can compute:
- One-variable statistics: n, mean, median, min, max, Q1, Q3, IQR, standard deviation.
- Boxplots: with or without outliers; parallel boxplots for comparisons.
Quick Reference: Key Formulas (LaTeX)
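- Mean (sample): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Variance (sample): $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
- Interquartile Range: $\text{IQR} = Q_3 - Q_1$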
- Five-number summary: (Minimum, Q1, Median, Q3, Maximum)
- Boxplot whiskers extend to the smallest and largest data points that are not outliers.
- Outlier rule: an observation $x$ is an outlier if $x < Q_1 - 1.5 \times \text{IQR}$ or $x > Q_3 + 1.5 \times \text{IQR}$.