Identify situations in which statistics can be misleading
Define “Statistics”
What statistics are
Statistics include numerical facts and figures (data). Examples:
The largest earthquake measured 9.2 on the Richter scale.
Men are at least 10× more likely than women to commit murder.
One in every 8 South Africans is HIV positive (as a fraction: 1/8 = 0.125).
By the year 2020, there will be 15 people aged 65 and over for every new baby born (ratio 15:1).
The study of statistics involves math and calculations of numbers, but also depends heavily on how numbers are chosen and interpreted.
Flaws in interpretation (three scenarios to spot incorrect interpretations)
Scenario 1: A new advertisement for Ben & Jerry's ice cream led to a 30% increase in sales for the following three months. Major flaw: history effect. The increase may be due to typical seasonal increases in June–August rather than the advertisement.
Scenario 2: The more churches in a city, the more crime there is. Major flaw: third-variable problem. A third variable (e.g., population size) could cause both higher church counts and higher crime rates; correlation does not imply causation.
Scenario 3: 75% more interracial marriages are occurring this year than 25 years ago. Major flaw: missing context and base-rate information. Without the base rate of interracial marriages 25 years ago, a 75% increase could describe a very small absolute change; the rate may also have fluctuated over the years, and the figure by itself says nothing about how accepted such marriages are.
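The base-rate point in Scenario 3 can be made concrete with a small sketch. The rates below are hypothetical values chosen only for illustration (they are not from any source); the helper name relative_increase is likewise an assumption. Both scenarios produce the same "75% more" headline while describing very different absolute changes.

```python
def relative_increase(old_rate, new_rate):
    """Relative change from old_rate to new_rate, as a fraction of old_rate."""
    return (new_rate - old_rate) / old_rate

# Hypothetical base rates (share of all marriages), for illustration only.
scenarios = {
    "low base rate": (0.01, 0.0175),   # 1% of marriages -> 1.75%
    "high base rate": (0.20, 0.35),    # 20% of marriages -> 35%
}

for label, (old, new) in scenarios.items():
    rel = relative_increase(old, new)
    absolute = new - old
    print(f"{label}: relative increase {rel:.0%}, "
          f"absolute change {absolute:.2%} of all marriages")
```

The same relative increase corresponds to a change of under one percentage point in the first case and fifteen percentage points in the second, which is why the claim needs its base rate.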
Takeaway about statistics
Statistics are not only facts and figures; they are a range of techniques for analyzing, interpreting, displaying, and making decisions based on data.
Statistics involve both data collection and the interpretation of data in context.
Importance of Statistics
Learning objectives
Give examples of statistics encountered in everyday life
Give examples of how statistics can lend credibility to an argument
Why statistics matter
To take control of your life, you must evaluate data and claims.
Poor reasoning can lead to manipulation or bad decisions; statistics provide tools to react intelligently.
Examples of statistical claims (illustrative, not guaranteed true)
4/5 dentists recommend Dentine.
About 85% of lung cancers in men and 45% in women are tobacco-related.
Condoms are effective 94% of the time (effectiveness claim).
Native Americans are more likely to be hit crossing the street than other groups (ethnic comparisons).
Persuasive effect of eye contact, loud and rapid speech.
Women earn about 75 cents for every dollar earned by men in the same job.
Egg whites study claiming longer life span (new finding).
Baseball batting averages over .400 (prediction/expectation claim).
In a room of 30 people, there is about an 80% chance that at least two share a birthday.
A tongue-in-cheek claim: 79.48% of all statistics are made up on the spot.
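Some of these claims can be checked rather than taken on trust. As a minimal sketch, the birthday claim can be computed exactly under two simplifying assumptions that are not stated in the list: 365 equally likely birthdays and independent people. The function name prob_shared_birthday is hypothetical. Under these assumptions, the value for 30 people comes out near 0.71, somewhat below the quoted 80%, which fits the caveat that the listed claims are illustrative.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes `days` equally likely birthdays and independent people
    (a simplification: leap days and seasonal birth patterns are ignored).
    """
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

for n in (23, 30, 35):
    print(n, round(prob_shared_birthday(n), 3))
```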
Takeaway
These claims illustrate that statistics appear across diverse domains (psychology, health, law, sports, business).
Data interpretation matters; not all statistics are equally credible or interpreted appropriately.
The goal is to become an intelligent consumer of statistical claims by questioning sources and procedures.
Descriptive Statistics
Prerequisites: none
Learning objectives
Define “descriptive statistics”
Distinguish descriptive statistics from inferential statistics
Descriptive statistics defined
Descriptive statistics summarize and describe data, such as data collected from experiments, surveys, or records.
Data vs datum: data is plural; a single piece is a datum.
Example: birth certificates. Descriptive statistics might include the percentage of certificates issued in New York State or the average age of the mother; any number computed from the data at hand counts as a descriptive statistic.
Descriptive vs inferential
Descriptive statistics describe the data at hand and do not generalize beyond it.
Inferential statistics generalize from a sample to a larger population; this is covered in another section.
Examples (Table 1 and Table 2 in the text)
Example: Average salaries for various occupations in 1999, in U.S. dollars (descriptive table)
Pediatricians: 112760
Dentists: 106130
Podiatrists: 100090
Physicists: 76140
Architects: 53410
School, clinical, and counseling psychologists: 49720
Flight attendants: 475?10 (note: exact value as provided in the source)
Elementary school teachers: 39560
Police officers: 385?10 (note: exact value as provided in the source)
Floral designers: 18980
Example: Number of unmarried men per 100 unmarried women in U.S. metro areas in 1990 (descriptive)
Descriptive statistics in sports and other domains
Descriptive statistics are central to sports (e.g., shooting percentages, etc.).
Olympic marathon data show historical winning times for men and women (since 1984 for women and earlier for men).
Observations
Descriptive statistics can reveal disparities (e.g., gender/occupation pay gaps, regional distributions) but they require careful interpretation.
Descriptive statistics and interpretations
They can highlight patterns, but by themselves they do not explain causes or allow generalizations beyond the observed data.
They can be used to illustrate points in arguments, but they can also mislead if the data source, sampling, or context is biased or incomplete.
Additional context
The text includes additional descriptive data around Olympic times and gender comparisons, emphasizing the need to question data sources and to connect descriptive statistics to the larger questions they raise.
Men: 1896 Spiridon Louis (GRE) 2:58:50; 1900 Michel Theato (FRA) 2:59:45; 1904 Thomas Hicks (USA) 3:28:53; …; 1988 Gelindo Bordin (ITA) 2:10:32; 2004 Stefano Baldini (ITA) 2:10:55.
Descriptive insight into inference
Descriptive statistics can be used to explore questions like whether gender gaps are closing or whether record times will continue to fall, but such inferences require inferential methods.
Inferential Statistics
Prerequisites: Chapter 1: Descriptive Statistics
Learning objectives
Distinguish between a sample and a population
Define inferential statistics
Identify biased samples
Distinguish between simple random sampling and stratified sampling
Distinguish between random sampling and random assignment
Populations and samples
A population is the entire group of interest; a sample is a subset drawn from the population.
Example #1: National Election Commission survey on voting fairness. You cannot ask every American; you sample a subset and infer about the population.
The sample should be representative; bias occurs if the sample over-represents a segment (e.g., only Floridians or only Republicans).
Inferential statistics rely on assumptions about how the sample was drawn; a random sample tends to mirror the population's proportions, and the match generally improves as the sample size grows.
Example #2: Average number of math classes taken by graduating seniors nationwide
Population: all graduating seniors in the U.S.
A sample might be 50 students from each of several institutions; compute average and generalize with caution.
Potential sampling bias: overrepresentation of math majors or institutions with heavy math requirements.
Sampling bias and examples
Example #3: Substitute teacher asks the 10 students in the front row for their scores; population is all students in the class; front-row sample may be biased.
Example #4: Coach samples 8 volunteers to estimate cartwheels by freshmen; volunteers are not representative.
Example #5: A twins study draws on the National Twin Registry; selecting twins by last name (Z, B) and taking only every other name introduce bias and make the sample non-representative.
Population vs sample clarity is essential; the registry may not be representative of all twins.
Sample size matters
Random samples of small size may be non-representative due to sampling variability.
Example: If you sample 20 individuals from a population with equal male/female distribution, there is a nontrivial chance (about 0.06) that 70% or more are female purely by chance.
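When simple random sampling is infeasible, other sampling methods such as stratified sampling are used. Random assignment is a separate idea: it concerns how subjects are allocated to conditions, not how they are sampled.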
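The 0.06 figure above can be reproduced with a short binomial calculation. This sketch assumes each of the 20 sampled individuals is independently female with probability 0.5; the function name prob_at_least is hypothetical.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): sum the upper tail term by term."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# 70% of a sample of 20 is 14 individuals.
p_70_percent_or_more = prob_at_least(14, 20)
print(round(p_70_percent_or_more, 3))  # about 0.058, i.e. roughly 0.06
```

Rerunning the same calculation with a larger sample (for example 140 or more out of 200) gives a far smaller probability, which is the sense in which larger random samples are more reliably representative.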
Random assignment vs sampling
Random assignment: in experiments, subjects are randomly assigned to treatment and control groups so that the groups differ only by chance before treatment; this is crucial for internal validity.
Example: antidepressant vs placebo; random assignment prevents systematic bias (e.g., early arrivals potentially different from late arrivals).
A non-random assignment can bias results; a non-random sample affects generalizability rather than internal validity.
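A minimal sketch of random assignment, assuming 20 hypothetical subject IDs and an even split into two groups. The function name randomly_assign and the seed value are illustrative choices, not part of the text. Shuffling before splitting means arrival order (or any other ordering of the list) cannot systematically determine group membership.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)       # a seed makes the example reproducible
    shuffled = list(subjects)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs, listed in order of arrival.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
treatment, control = randomly_assign(subjects, seed=0)
print("treatment:", treatment)
print("control:  ", control)
```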
Stratified sampling
Stratified sampling ensures representation across distinct strata (subgroups).
Example: urban university study on capital punishment; 200 students; 70% day students, 30% night students; sample 140 day and 60 night students so the sample proportions match population proportions, improving inference reliability.
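The allocation in this example can be sketched as proportional allocation followed by a simple random draw within each stratum. The rosters below are hypothetical (1,400 day and 600 night student IDs, matching the 70/30 split), and the function names are illustrative assumptions.

```python
import random

def proportional_allocation(sample_size, strata_shares):
    """Number to sample from each stratum, proportional to its population share.

    Note: with rounding, the counts need not sum exactly to sample_size
    for arbitrary shares; they do for the 70/30 example below.
    """
    return {name: round(sample_size * share)
            for name, share in strata_shares.items()}

def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

allocation = proportional_allocation(200, {"day": 0.70, "night": 0.30})
print(allocation)  # {'day': 140, 'night': 60}

# Hypothetical rosters of student IDs.
strata = {
    "day": [f"D{i}" for i in range(1400)],
    "night": [f"N{i}" for i in range(600)],
}
sample = stratified_sample(strata, allocation, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
```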
Variables
Prerequisites: none
Learning objectives
Define and distinguish between independent and dependent variables
Define and distinguish between discrete and continuous variables
Define and distinguish between qualitative and quantitative variables
Independent and dependent variables
A variable is a property that can take on different values; independent variable is manipulated by the experimenter; dependent variable is the outcome measured.
Example #1: Can blueberries slow aging? Independent: dietary supplement (none, blueberry, strawberry, spinach); Dependent: memory test and motor skills tests; blueberry shows strongest improvement.
Example #2: Does beta-carotene protect against cancer? Independent: supplement vs placebo; Dependent: cancer occurrence over lifetime; results showed no systematic difference.
Example #3: How bright should brake lights be? Independent: brightness of brake lights; Dependent: time to hit brakes.
Levels of an independent variable
If there are two experimental conditions (experimental vs. control), the independent variable has two levels.
If comparing five types of diets, the independent variable has five levels.
Qualitative and Quantitative variables
Qualitative (categorical) variables express a quality (e.g., hair color, eye color, religion, gender). They do not imply a numerical ordering.
Quantitative variables are numerical (e.g., height, weight, shoe size).
The type of independent variable in the blueberries example is qualitative; the dependent variable memory test is quantitative.
Discrete vs Continuous variables
Discrete: possible values are distinct points (e.g., number of children in a household).
Continuous: possible values form a continuum (e.g., time to respond to a question).
In practice, measurement limits often prevent true continuity, but measurement remains conceptually continuous.
Percentiles
Prerequisites: none
Learning objectives
Define percentiles
Use three formulas for computing percentiles
Why percentiles matter
A percentile ranks a score relative to a distribution (e.g., what percentage of scores fall below yours).
Definitions of percentile
Definition 1: The percentile is the lowest score that is greater than the specified percentage of scores.
Definition 2: The percentile is the smallest score that is greater than or equal to the specified percentage of scores.
A third, commonly used approach takes a weighted average of the percentiles computed under the first two definitions; it handles rounding more gracefully and makes the median coincide with the 50th percentile.
Third definition (default in this text)
R = P/100 × (N + 1) where P is the desired percentile and N is the number of scores.
If R is an integer, the percentile equals the value with rank R.
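If R is not an integer, interpolate between the values with ranks IR and IR+1, where IR = floor(R) is the integer part of R and FR = R − IR is its fractional part: percentile = X(IR) + FR × (X(IR+1) − X(IR)), where X(k) is the score with rank k.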
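The third definition can be sketched as follows, using R = P/100 × (N + 1) and linear interpolation by the fractional part FR between the scores ranked IR and IR+1. The scores are illustrative values, the function name percentile is hypothetical, and the handling of ranks that fall below 1 or at or above N (returning the smallest or largest score) is an assumption, since the notes do not cover that case.

```python
import math

def percentile(scores, p):
    """Percentile via R = P/100 * (N + 1) with linear interpolation."""
    if not scores:
        raise ValueError("scores must not be empty")
    xs = sorted(scores)
    n = len(xs)
    r = p / 100 * (n + 1)
    ir = math.floor(r)          # integer part of the rank
    fr = r - ir                 # fractional part of the rank
    # Assumption: clamp ranks that fall outside 1..N to the extreme scores.
    if ir < 1:
        return xs[0]
    if ir >= n:
        return xs[-1]
    lower, upper = xs[ir - 1], xs[ir]   # ranks are 1-based, lists are 0-based
    return lower + fr * (upper - lower)

scores = [3, 5, 7, 8, 9, 11, 13, 15]    # illustrative data
print(percentile(scores, 25))  # R = 2.25, so 5 + 0.25 * (7 - 5) = 5.5
print(percentile(scores, 50))  # R = 4.5,  so 8 + 0.5 * (9 - 8) = 8.5
```

With an odd number of scores, R for the 50th percentile is an integer and the function returns the middle score, so the median and the 50th percentile agree.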
Levels of Measurement
Nominal scales
Nominal scales embody the lowest level of measurement; their categories cannot be ordered meaningfully.
Ordinal scales
Categories are ordered (e.g., very dissatisfied, somewhat dissatisfied, somewhat satisfied, very satisfied)
Allow some comparison (one person more satisfied than another) but do not guarantee equal intervals between adjacent levels.
Differences between adjacent levels may not be equal; the step from 1 to 2 may not equal the step from 3 to 4.
Changing response format (e.g., using numbers 1–4) does not inherently make the scale interval/ratio; equal-interval interpretation still may not hold.
Interval scales
Intervals have the same meaning across the scale (e.g., Fahrenheit temperature).
Do not have a true zero point; zero is arbitrary (e.g., 0°F is not the absence of temperature).
Ratios do not make sense on interval scales (e.g., 80°F is not twice as hot as 40°F).
Ratio scales
Have all properties of nominal, ordinal, and interval scales, plus a true zero point (absence of the quantity).
Examples: Kelvin temperature (true zero), money (e.g., 50 cents vs 25 cents; 50 is twice 25).
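A quick numeric sketch of why ratios are meaningful on a ratio scale but not on an interval scale. It converts the two Fahrenheit temperatures mentioned above to Kelvin using the standard conversion formula; the function name is hypothetical.

```python
def fahrenheit_to_kelvin(f):
    """Standard conversion: K = (F - 32) * 5/9 + 273.15."""
    return (f - 32.0) * 5.0 / 9.0 + 273.15

ratio_fahrenheit = 80 / 40
ratio_kelvin = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)
print(round(ratio_fahrenheit, 3))  # 2.0 on the arbitrary-zero scale
print(round(ratio_kelvin, 3))      # about 1.08 on the true-zero scale

# Differences, by contrast, carry over consistently (up to the unit size).
diff_kelvin = fahrenheit_to_kelvin(80) - fahrenheit_to_kelvin(40)
print(round(diff_kelvin, 2))       # 40 Fahrenheit degrees = about 22.22 kelvins
```

The apparent "twice as hot" ratio of 2.0 disappears once the zero point is the true absence of the quantity, while the 40-degree interval keeps a fixed meaning.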
Psychology measurement in context
Rating scales used in psychology are often ordinal (e.g., 5- or 7-point scales).
It is common to compute means for ordinal data, but there are debates; caution is warranted because equal-interval assumptions may be violated.
Memory experiments often yield counts (which can be treated as ratio data due to a true zero and meaningful differences).
Takeaway
The level of measurement constrains what statistics can be meaningfully computed.
Distributions
Prerequisites: Chapter 1: Variables
Learning objectives
Define distribution; interpret a frequency distribution
Distinguish a frequency distribution from a probability distribution
Construct a grouped frequency distribution for a continuous variable
Identify skew, bimodality, leptokurtosis, and platykurtosis
Distributions of discrete variables (M&M example)
Count colors in a bag; frequency table describes the distribution of color counts.
A frequency distribution can be graphed as a histogram of discrete counts (Figure 1).
For all M&Ms produced, the manufacturer reports proportions (probability distribution) that sum to 1. For example, Brown ≈ 0.30, Red ≈ 0.30, Yellow ≈ 0.15, Green ≈ 0.15, Blue ≈ 0.05, Orange ≈ 0.05.
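The difference between a frequency distribution (counts in one bag) and a set of proportions (which sum to 1) can be sketched with a hypothetical bag. The counts below are made up for illustration and are not the manufacturer's figures.

```python
from collections import Counter

# Hypothetical contents of a single bag (55 candies).
bag = (["brown"] * 17 + ["red"] * 18 + ["yellow"] * 7 +
       ["green"] * 7 + ["blue"] * 2 + ["orange"] * 4)

counts = Counter(bag)                    # frequency distribution: color -> count
total = sum(counts.values())
proportions = {color: n / total for color, n in counts.items()}

for color, n in counts.most_common():
    print(f"{color:<7} count={n:>2}  proportion={proportions[color]:.3f}")

print("proportions sum to", round(sum(proportions.values()), 6))
```

The proportions from one bag are only sample estimates; the manufacturer's reported proportions describe all candies produced, so the two sets of numbers will generally differ.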
Continuous variables and grouped frequency distributions
Example: times taken to move a cursor over a target (20 trials) produce a continuous variable; a simple frequency distribution would be uninformative since almost no two times are identical.
Solution: group into intervals and create a grouped frequency distribution; visualize with a histogram (Figure 3).
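A minimal sketch of building a grouped frequency distribution. The 20 response times (in milliseconds) are hypothetical, and the choice of 100 ms intervals starting at 500 ms is an assumption for illustration; each interval includes its lower bound and excludes its upper bound.

```python
def grouped_frequency(values, start, width, n_bins):
    """Count values in half-open intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:            # values outside the range are ignored
            counts[idx] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical times (ms) for 20 trials; no two values are identical.
times = [568, 577, 581, 640, 641, 645, 657, 673, 696, 703,
         720, 728, 729, 777, 808, 824, 825, 865, 875, 1007]

table = grouped_frequency(times, start=500, width=100, n_bins=6)
for low, high, count in table:
    print(f"{low:>4}-{high:<4} {count:>2} {'*' * count}")
```

The printed asterisks give a rough text histogram; a different interval width would change the apparent shape, so the width is worth choosing deliberately.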
Probability densities and the normal distribution
For continuous variables, distributions are represented as probability densities (area under the curve equals 1).
The probability of exactly a specific value is essentially zero; the probability of falling within an interval is the area under the curve for that interval.
The normal distribution is a bell-shaped density; it is used as a common approximation for many naturally occurring phenomena.
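The idea that probabilities for a continuous variable are areas under the density can be sketched for the normal distribution. This example uses the standard relationship between the normal cumulative distribution function and the error function; the function names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """Area under the normal density to the left of x."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

def prob_between(a, b, mean=0.0, sd=1.0):
    """Area under the normal density between a and b."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)

print(round(prob_between(-1, 1), 4))    # within one SD of the mean: about 0.6827
print(round(prob_between(-10, 10), 4))  # nearly the whole curve: about 1.0
print(prob_between(0.5, 0.5))           # a single point has zero width: 0.0
```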
Shape of distributions
Normal density is symmetric with a single peak in the middle; tails extend indefinitely.
Skew: a distribution with a longer tail on one side. Positive skew (skewed to the right) has a longer right tail; negative skew (skewed to the left) has a longer left tail.
Kurtosis: relative to a normal distribution, a leptokurtic distribution has heavier, longer tails (more data in the tails), while a platykurtic distribution has lighter, shorter tails and a flatter shape.
Bimodal distribution: two distinct peaks.
Example: Old Faithful geyser eruption times show a bimodal distribution.
Visual guides (descriptive): shape descriptions include symmetry, skewness, and kurtosis as important features for understanding data distributions.
Summation Notation
Prerequisites: None
Learning objectives
Use summation notation to express the sum of all numbers
Use summation notation to express the sum of a subset of numbers
Use summation notation to express the sum of squares
Basic idea
Σ denotes summation. For $X_1, X_2, X_3, X_4$ (weights in grams of 4 grapes):
The sum: $\sum_{i=1}^{4} X_i = X_1 + X_2 + X_3 + X_4$
Example (sum of four numbers)
If X = [4.6, 5.1, 4.9, 4.4], then
$\sum_{i=1}^{4} X_i = 4.6 + 5.1 + 4.9 + 4.4 = 19.0$
Sum of squares vs square of sums
Sum of squares: $\sum_{i=1}^{4} X_i^2 = 4.6^2 + 5.1^2 + 4.9^2 + 4.4^2 = 90.54$
The square of the sum: $\left(\sum_{i=1}^{4} X_i\right)^2 = 19^2 = 361$
The distinction: squaring each value before summing is not the same as summing first and then squaring.
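The sums above can be reproduced with a few lines of code, using the four grape weights from the example. The subset sum over the first two values is an added illustration of summing part of a list; variable names are arbitrary.

```python
import math

weights = [4.6, 5.1, 4.9, 4.4]          # X_1 .. X_4, in grams

total = math.fsum(weights)                           # sum of all X_i
first_two = math.fsum(weights[:2])                   # sum of X_i for i = 1..2
sum_of_squares = math.fsum(x ** 2 for x in weights)  # square each, then sum
square_of_sum = total ** 2                           # sum first, then square

# Rounding is only for display; floating-point sums can carry tiny errors.
print(round(total, 2))            # 19.0
print(round(first_two, 2))        # 9.7
print(round(sum_of_squares, 2))   # 90.54
print(round(square_of_sum, 2))    # 361.0
```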
Cross products
Given pairs $(X_i, Y_i)$, the sum of cross products is $\sum_{i=1}^{n} X_i Y_i$: multiply each X by its paired Y, then add the products.
Example (X, Y pairs and cross products): for the example data in the text, $\sum_{i=1}^{n} X_i Y_i = 28$.
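A minimal sketch of a sum of cross products. The pairs below are hypothetical values chosen so that the result matches the total of 28 quoted above; they are not claimed to be the original data. The function name is also an assumption.

```python
def sum_cross_products(xs, ys):
    """Sum of X_i * Y_i over paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return sum(x * y for x, y in zip(xs, ys))

# Hypothetical paired data.
xs = [1, 2, 3]
ys = [3, 2, 7]

print([x * y for x, y in zip(xs, ys)])   # individual cross products: [3, 4, 21]
print(sum_cross_products(xs, ys))        # 28
```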
Practical takeaway
Summation notation provides compact, precise ways to express sums and related statistics (sums of squares, cross products, etc.).
Linear Transformations
Prerequisites: None
Learning objectives
Give the formula for a linear transformation
Determine whether a transformation is linear
Describe what is linear about a linear transformation
What a linear transformation does
You may need to transform measurements from one scale to another (e.g., feet to inches, inches to feet).
Example: To convert feet to inches, multiply by 12. If x is feet, then the transformation T is:
T(x)=12x
Conversely, to convert inches to feet, divide by 12: T(x)=x/12
What makes a transformation linear
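A transformation is linear if it has the form Y = bX + A: each value is multiplied by a constant b, and a constant A is then added. Plotting the transformed values against the originals gives a straight line, which is what is linear about it.
Measurement scale changes such as feet to inches are the special case A = 0, which also preserves ratios and satisfies T(cX + dY) = cT(X) + dT(Y).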
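A short sketch of these ideas, taking the common statistics-text form Y = bX + A as the working definition of a linear transformation (an assumption about the intended formula). The Celsius-to-Fahrenheit case is an added illustration of a nonzero constant term, and the function name linear_transform is hypothetical. The check is that equal steps in X produce equal steps in Y, which fails for a nonlinear transformation such as squaring.

```python
def linear_transform(x, b, a=0.0):
    """Y = b*X + a: multiply by a constant, then add a constant."""
    return b * x + a

# Feet to inches (b = 12, a = 0) and back again (b = 1/12).
feet = [1.0, 2.5, 6.0]
inches = [linear_transform(x, 12) for x in feet]
print(inches)                                      # [12.0, 30.0, 72.0]
print([linear_transform(y, 1 / 12) for y in inches])

# Celsius to Fahrenheit (b = 1.8, a = 32): equal steps in, equal steps out.
celsius = [0, 10, 20, 30]
fahrenheit = [linear_transform(c, 1.8, 32) for c in celsius]
steps = [fahrenheit[i + 1] - fahrenheit[i] for i in range(len(fahrenheit) - 1)]
print(steps)                                       # each step is about 18

# Squaring is not linear: equal steps in X give unequal steps in Y.
squared = [c ** 2 for c in celsius]
print([squared[i + 1] - squared[i] for i in range(len(squared) - 1)])  # [100, 300, 500]
```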
Significance in statistics
Linear transformations can be used to rescale, standardize, or normalize data without changing the underlying relationships between variables.