Chapter 1 Notes: Exploring Data — Summary and Notes

Statistics and Data: Key Concepts from Chapter 1

  • Statistics is the science of learning from data. Data are numbers with context; numbers by themselves are not informative. Context connects data to the real world (e.g., a baby weighing 10.5 pounds vs. 10.5 ounces is meaningful only with context).
  • Why study statistics? To make sound, data-based decisions in careers and everyday life amid abundant data (polls, prices, tests, etc.).
  • Data Beat Personal Experiences: anecdotes can be misleading; large, well-designed studies provide more reliable conclusions. Example contrast: Danish Cancer Society study with 350,000 Danish residents found no statistical difference in brain-cancer rates between cell-phone users and non-users, despite compelling anecdotes.
  • Where data come from matters: data sources and sampling methods affect conclusions. Write-in polls can mislead just as anecdotes can; representative data are needed for trustworthy results. Example: in an Ann Landers write-in poll, 70% of responding parents said they regretted having kids, whereas a representative survey found that 91% would have children again.
  • Always Plot Your Data: graphs often reveal patterns a table cannot. Graphs are powerful for learning from data (Yogi Berra quote: observe a lot just by watching).
  • Gapminder example: Life expectancy vs. income shows overall positive association but with notable outliers and differences across countries. Data visualization can reveal nontrivial patterns and exceptions.
  • Individuals and Variables (Section 1.1):
    • Individuals: the objects described by a data set (people, animals, things).
    • Variables: characteristics measured for each individual; can take different values across individuals.
    • Types of variables:
    • Categorical variable: places an individual into categories (labels). Examples: gender, race, occupation. Some categorical variables group values of a quantitative variable (e.g., age groups like 0–9, 10–19, …).
    • Quantitative variable: takes numerical values where arithmetic makes sense (e.g., age, GPA, height).
    • Not every number is quantitative (e.g., a ZIP code is written with digits, but averaging ZIP codes is meaningless, so it is categorical).
    • Practical questions when encountering a new data set: Who are the individuals? What are the variables and their units?
  • CensusAtSchool example (data set exercise): random sample of 10 Canadian students with variables such as Province, Gender, Languages spoken, Handedness, Height (cm), Wrist circumference, Preferred communication. Used to illustrate data structure and variable types; helps distinguish categorical vs. quantitative variables.
  • Distribution: the set of values a variable takes and the frequency or relative frequency of each value.
  • Data exploration approach: start by examining each variable by itself, then study relationships among variables. Graph first, then numerical summaries.
  • Inference: moving from data at hand to conclusions about a population. An example activity uses a lottery to explore whether a result could happen by chance (discrimination in hiring). Inference depends on how data were produced (sampling vs. experiment). Probability is developed in later chapters.
  • Chapter 1 summary (organization):
    • Distinguish between categorical and quantitative variables; understand distributions;
    • Use graphs (bar charts, pie charts for categorical; dotplots, stemplots, histograms for quantitative);
    • Learn about marginal and conditional distributions in two-way tables to study relationships between two categorical variables;
    • Learn about center and spread for quantitative data (mean, median, range, IQR, standard deviation);
    • Understand outliers and the 1.5 × IQR rule; learn to interpret five-number summary and boxplots; and know when to use boxplots for comparisons.
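
To make the idea of a distribution concrete, here is a minimal Python sketch that tabulates the counts and relative frequencies of one categorical variable. The handedness values are made up for illustration (they are not the CensusAtSchool data), and the helper name `distribution` is hypothetical.

```python
from collections import Counter

# Hypothetical handedness values for ten students (made-up data).
handedness = ["Right", "Right", "Left", "Right", "Right",
              "Right", "Left", "Right", "Ambidextrous", "Right"]

def distribution(values):
    """Return {category: (count, relative frequency)} for a list of values."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in counts.items()}

dist = distribution(handedness)
for category, (count, rel_freq) in dist.items():
    # Show each category with its count and its percent of all individuals.
    print(f"{category:<13} count={count}  percent={rel_freq:.0%}")
```

The same table answers both "what values does the variable take?" and "how often does it take each one?", which is all a distribution is.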

Section 1.1 Analyzing Categorical Data

  • Goals for analyzing categorical data:
    • Display the distribution of a single categorical variable with bar graphs or pie charts; decide when a pie chart is appropriate.
    • Identify deceptive graphs (e.g., mis-scaled bars, misleading pies, pictographs).
    • Describe the distribution with counts or percents.
    • For two categorical variables, use a two-way table to describe joint distributions and compute:
    • Marginal distributions: distributions of a variable across the whole table (row or column margins).
    • Conditional distributions: distributions of one variable for a fixed value of the other variable (e.g., Opinion among Women).
    • Association: an association exists if knowing one variable's value helps predict the other; no association means conditional distributions are the same.
  • Example: Radio Station Formats
    • A frequency table shows counts of stations by format; a corresponding relative frequency table shows percentages. The counts should sum to the overall total; rounded percentages may not sum to exactly 100% (e.g., 99.9%) because of roundoff error.
    • Visuals: Bar graphs and pie charts display distributions; bar graphs are typically easier to read and more flexible for comparing quantities.
    • Pie charts must include all categories that form the whole; they are awkward if the goal is to compare multiple quantities in the same units.
  • Two-way tables and marginal distributions
    • In a two-way table, the row totals describe the distribution of the row category within the entire sample; the column totals describe the distribution of the column category within the entire sample.
    • Percentages can be computed as counts divided by the table total to get the marginal distribution in percent.
  • Conditional distributions (an example with gender and opinion about wealth by age 30)
    • For a given gender, compute the distribution of opinions (percentages across the five categories in that column).
    • For each opinion category, compute the distribution of gender within that category (percentages across the two genders within that row).
  • Association and interpretation
    • Side-by-side bar graphs help compare conditional distributions across groups (women vs. men).
    • An association is suggested when the conditional distributions differ notably between groups; similar conditional distributions suggest no association.
  • Titanic data example (illustrative two-way table)
    • Survival status by class of travel and by gender shows how two categorical variables interact; prompts exploration of gender effect, class effect, and their interaction.
  • AP exam tips for Section 1.1
    • Identify whether each variable is categorical or quantitative early; the variable type determines which graphs and summaries are appropriate.
    • Use side-by-side bar graphs to compare conditional distributions; interpret association by comparing conditional distributions.
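
The marginal and conditional calculations described in this section can be sketched in a few lines of Python. The counts below are invented for illustration and use three opinion categories rather than the five in the textbook example; the function names are hypothetical.

```python
# Hypothetical two-way table (made-up counts): opinion by gender.
# Each gender is a column; each opinion is a row.
table = {
    "Women": {"Good chance": 30, "50-50": 50, "Little chance": 20},
    "Men":   {"Good chance": 60, "50-50": 25, "Little chance": 15},
}

def marginal_by_opinion(table):
    """Marginal distribution of opinion: row totals divided by the table total."""
    grand_total = sum(sum(col.values()) for col in table.values())
    totals = {}
    for col in table.values():
        for opinion, count in col.items():
            totals[opinion] = totals.get(opinion, 0) + count
    return {opinion: total / grand_total for opinion, total in totals.items()}

def conditional_given_gender(table, gender):
    """Conditional distribution of opinion within one gender (one column)."""
    col = table[gender]
    col_total = sum(col.values())
    return {opinion: count / col_total for opinion, count in col.items()}

marginal = marginal_by_opinion(table)
women = conditional_given_gender(table, "Women")
men = conditional_given_gender(table, "Men")

for opinion in marginal:
    print(f"{opinion:<14} all={marginal[opinion]:.1%}  "
          f"women={women[opinion]:.1%}  men={men[opinion]:.1%}")
```

In this made-up table the two conditional distributions differ (30% of women vs. 60% of men answer "Good chance"), so knowing gender helps predict opinion; that is what an association looks like numerically.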

Section 1.2 Displaying Quantitative Data with Graphs

  • Graphs for quantitative data:
    • Dotplots: simple, show each value as a dot on a number line; good for small data sets.
    • Stemplots (stem-and-leaf plots): retain actual data values while showing distribution; effective for small to moderate data sets; can split stems to improve readability; back-to-back stemplots compare two groups.
    • Histograms: group data into equal-width classes; display the distribution in terms of counts or relative frequencies (percentages).
  • SOCS: Shape, Outliers, Center, Spread — a quick, informal framework for describing a distribution from a graph.
  • Examples and guidance:
    • U.S. women’s soccer goals data (2012): example of a dotplot; describe shape (peak near 4 goals), center (~3 goals), spread (0 to 14); identify possible outliers (13, 14) and discuss whether they are genuine outliers depending on context.
    • EPA highway mileage data (dotplot): identify shape, center, spread, and outliers (e.g., Prius very high mileage; Bentley Mulsanne very low mileage).
    • Distribution shapes: symmetric, skewed left/right, bimodal, multimodal; note that many biological measurements are roughly symmetric, salaries/prices tend to be right-skewed.
  • Histograms: how to construct
    • Divide data into classes of equal width (e.g., 0 to <5, 5 to <10, etc.).
    • Count or compute relative frequencies for each class.
    • The choice of class width and boundaries affects the histogram’s appearance; five classes is a common minimum for a useful view.
    • Relative frequency histograms facilitate comparisons across data sets of different sizes.
  • When to use which graph:
    • Use dotplots/stemplots for small data; histograms for large data sets; avoid pictographs for quantitative data.
  • Two-way tables and marginal/conditional distributions: expanding from Section 1.1 examples to include distributions across groups for quantitative contexts (e.g., comparing household sizes by country or comparing distributions across age groups).
  • Technology corners (calculator tips): how to set up and interpret histograms using TI-Nspire, TI-83/84, or TI-89; steps include inputting data, choosing histogram type, adjusting bin width, and interpreting outputs.
  • Practice concepts from Section 1.2:
    • Distinguishing between bar graphs (categorical) and histograms (quantitative).
    • The importance of consistent axis labeling and scale, especially when comparing distributions of different sizes.
    • The risk of misleading graphs (e.g., scale manipulation, pictographs) and the need for cautious interpretation.
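
The histogram construction steps above (equal-width classes, then counts or relative frequencies) can be sketched as follows. The data values, the class width of 5, and the helper name `histogram_counts` are assumptions chosen for illustration.

```python
def histogram_counts(data, start, width, num_classes):
    """Count values in equal-width classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * num_classes
    for x in data:
        k = int((x - start) // width)  # index of the class that contains x
        if 0 <= k < num_classes:
            counts[k] += 1
        else:
            raise ValueError(f"{x} falls outside the classes")
    return counts

# Hypothetical data set (made-up values).
data = [1, 3, 4, 5, 7, 8, 8, 9, 12, 14, 15, 21]
counts = histogram_counts(data, start=0, width=5, num_classes=5)
rel = [c / len(data) for c in counts]

for k, (c, r) in enumerate(zip(counts, rel)):
    low, high = 5 * k, 5 * (k + 1)
    # A text histogram: one asterisk per observation in the class.
    print(f"{low:>2} to <{high:<2} {'*' * c:<6} count={c}  rel. freq.={r:.1%}")
```

Changing `width` or `start` regroups the same data and can change the apparent shape, which is why the choice of classes matters.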

Section 1.3 Describing Quantitative Data with Numbers

  • Measures of center:

    • Mean (arithmetic average):
    • Definition: for a sample X_1, …, X_n, the mean is \bar{x} = \frac{1}{n} \sum_{i=1}^n X_i.
    • Population mean: \mu = \frac{1}{N} \sum_{i=1}^N X_i.
    • The mean is a balance point of the data; it uses all data values and is sensitive to outliers and skew.
    • Median: the middle value when data are ordered; resistant to outliers; preferred for skewed distributions or when outliers are present.
  • Measures of spread:

    • Range: difference between max and min; sensitive to extreme values.
    • Interquartile Range (IQR): IQR = Q_3 - Q_1; measures spread of the central 50% of the data; resistant to outliers.
    • Standard Deviation: s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2}; measures typical distance from the mean; not resistant to outliers; uses same units as the data.
    • Variance: s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2.
  • The five-number summary: minimum, Q1, median, Q3, maximum. Used to construct a boxplot and to describe center/spread; used to identify outliers via the 1.5 × IQR rule:

    • Outlier criterion: an observation is an outlier if it falls below Q_1 - 1.5 \times IQR or above Q_3 + 1.5 \times IQR.
  • Boxplots: graphical representation of the five-number summary; whiskers extend to the smallest/largest non-outlier values; outliers are plotted individually.
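
As a rough illustration of the five-number summary and the 1.5 × IQR rule, here is a short Python sketch. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves, excluding the overall median when n is odd); calculators and software may use slightly different rules. The travel times are made up.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    lower = values[: n // 2]          # values below the median position
    upper = values[(n + 1) // 2 :]    # values above the median position
    return (values[0], median(lower), median(values), median(upper), values[-1])

def outliers(data):
    """Observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical travel times in minutes (made-up data).
times = [5, 10, 10, 15, 20, 20, 25, 30, 30, 40, 85]
print(five_number_summary(times))  # (5, 10, 20, 30, 85)
print(outliers(times))             # [85]
```

Here IQR = 30 − 10 = 20, so the cutoffs are −20 and 60. Only 85 lies outside them, and a boxplot of these data would draw its upper whisker to 40 and plot 85 individually.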

  • Choosing measures of center and spread:

    • For roughly symmetric distributions without outliers: use mean and standard deviation.
    • For skewed distributions or data with outliers: use median and IQR (more resistant to extreme values).
  • The relationship between mean and median:

    • In symmetric distributions, they are close (often equal when perfectly symmetric).
    • In skewed distributions, the mean is pulled toward the tail; the median is more robust.
    • Example: travel times to work in North Carolina showed mean > median due to right-skew; removing an outlier reduces the mean more than the median.
  • Inference and data production: remember that inferences depend on how data were produced (sampling vs. experiments) and on the assumptions behind the methods used.

  • Technology and practice:

    • Calculators and software can compute numerical summaries (mean, median, quartiles, IQR, standard deviation) and produce boxplots. They also offer practice with “What’s the shape? SOCS.”
  • Four-step problem-solving framework for statistics problems:

    • State: what is the question?
    • Plan: what methods will answer it?
    • Do: perform the calculations and make graphs.
    • Conclude: interpret the result in the real-world context.
  • Data Exploration Case Study: “Do Pets or Friends Help Reduce Stress?”

    • Randomized groups (pet, friend, control) to measure stress via heart rate during a stressful task.
    • Through such case studies, you practice comparing distributions and choosing appropriate measures of center/spread.
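
To see why the median and IQR are recommended for skewed data, the following sketch compares the mean, median, and standard deviation of a small right-skewed data set with and without its largest value. The numbers are invented for illustration; `statistics.stdev` uses the n − 1 divisor, matching the sample standard deviation in these notes.

```python
import statistics

# Hypothetical right-skewed travel times in minutes (made-up data).
with_outlier = [5, 10, 10, 15, 20, 20, 25, 30, 30, 40, 85]
without_outlier = [x for x in with_outlier if x != 85]

for label, data in [("with 85", with_outlier), ("without 85", without_outlier)]:
    mean = statistics.mean(data)
    med = statistics.median(data)
    sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    print(f"{label:<11} mean={mean:.2f}  median={med}  s={sd:.2f}")
```

Dropping the single large value lowers the mean from about 26.4 to 20.5, while the median stays at 20. The mean is pulled toward the long right tail; the median resists it.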

Practical Formulas and Concepts to Remember

  • Distribution of a variable: what values the variable takes and how often it takes them.
  • Marginal distribution (two-way tables): distribution of a single variable ignoring the other;
    • Example: marginal distribution of opinions across all respondents.
  • Conditional distribution (two-way tables): distribution of a variable for a fixed value of another variable;
    • Example: conditional distribution of gender within each opinion category, or opinions within each gender.
  • Association in two categorical variables: knowing one variable helps predict the other; absence of association means conditional distributions are identical.
  • SOCS framework for quantitative data: Shape, Outliers, Center, Spread.
  • Five-number summary and boxplot: (Minimum, Q1, Median, Q3, Maximum).
  • 1.5 × IQR rule for outliers: observations outside [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] are outliers.
  • When to use which: use histograms/dotplots/stemplots for quantitative data; bar charts/pie charts for categorical data; use relative frequencies when comparing groups of different sizes.
  • Inference and data production: two settings to keep in mind:
    • Sampling: describe the sample, with the aim of generalizing to a population.
    • Experiments: support cause-and-effect conclusions, with emphasis on randomization and control.
  • Technology: calculators (TI-83/84, TI-89, TI-Nspire) and software can produce:
    • One-variable statistics: n, mean, median, min, max, Q1, Q3, IQR, standard deviation.
    • Boxplots: with or without outliers; parallel boxplots for comparisons.

Quick Reference: Key Formulas (LaTeX)

  • Mean (sample): \bar{x} = \frac{1}{n} \sum_{i=1}^n X_i
  • Population mean: \mu = \frac{1}{N} \sum_{i=1}^N X_i
  • Variance (sample): s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2
  • Standard deviation: s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x})^2}
  • Interquartile Range: IQR = Q_3 - Q_1
  • Five-number summary: (Minimum, Q1, Median, Q3, Maximum)
  • Boxplot whiskers extend to the smallest and largest data points that are not outliers.
  • Outlier rule: an observation is an outlier if
    X < Q_1 - 1.5 \times IQR \quad \text{or} \quad X > Q_3 + 1.5 \times IQR.
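
As a quick check on the mean, variance, and standard deviation formulas, here is a worked calculation for a small made-up sample (2, 4, 4, 6, 9; n = 5). The data are hypothetical and chosen so the arithmetic is easy to follow by hand.

```latex
\begin{align*}
\bar{x} &= \frac{1}{5}(2 + 4 + 4 + 6 + 9) = \frac{25}{5} = 5 \\
s^2 &= \frac{1}{5-1}\left[(2-5)^2 + (4-5)^2 + (4-5)^2 + (6-5)^2 + (9-5)^2\right] \\
    &= \frac{9 + 1 + 1 + 1 + 16}{4} = \frac{28}{4} = 7 \\
s &= \sqrt{7} \approx 2.65
\end{align*}
```

Note that the single value 9 contributes 16 of the 28 units of squared deviation, which is why the standard deviation is not resistant to extreme values.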