Ch. 1 Statistics

What Are Statistics

Learning objectives
- Describe the range of applications of statistics
- Identify situations in which statistics can be misleading
- Define “Statistics”
What statistics are
- Statistics include numerical facts and figures (data). Examples:
- The largest earthquake measured $9.2$ on the Richter scale.
- Men are at least $10\times$ more likely than women to commit murder.
- One in every $8$ South Africans is HIV positive (fractional representation: $\frac{1}{8} = 0.125$ ).
- By the year 2020, there will be 15 people aged 65 and over for every new baby born (ratio $15:1$ ).
- The study of statistics involves math and calculations of numbers, but also depends heavily on how numbers are chosen and interpreted.
Flaws in interpretation (three scenarios to spot incorrect interpretations)
- Scenario 1: A new advertisement for Ben & Jerry's ice cream led to a 30% increase in sales for the following three months. Major flaw: history effect. The increase may be due to typical seasonal increases in June–August rather than the advertisement.
- Scenario 2: The more churches in a city, the more crime there is. Major flaw: third-variable problem. A third variable (e.g., population size) could cause both higher church counts and higher crime rates; correlation does not imply causation.
- Scenario 3: 75% more interracial marriages are occurring this year than 25 years ago. Major flaw: lack of context and rate information. Without the base rate of interracial marriages, the claim may be misleading: a 75% rise from a very small number is still a small number, the count may have fluctuated historically, and the statistic alone does not show that such marriages have become more accepted.
Takeaway about statistics
- Statistics are not only facts and figures; they are a range of techniques for analyzing, interpreting, displaying, and making decisions based on data.
- Statistics involve both data collection and the interpretation of data in context.

Importance of Statistics

Learning objectives
- Give examples of statistics encountered in everyday life
- Give examples of how statistics can lend credibility to an argument
Why statistics matter
- To take control of your life, you must evaluate data and claims.
- Poor reasoning can lead to manipulation or bad decisions; statistics provide tools to react intelligently.
Examples of statistical claims (illustrative, not guaranteed true)
- 4 out of 5 dentists recommend Dentyne.
- Almost $85\%$ of lung cancers in men and $45\%$ in women are tobacco-related.
- Condoms are effective $94\%$ of the time (effectiveness claim).
- Native Americans are more likely to be hit crossing the street than other groups (ethnic comparisons).
- People are more persuasive when they make eye contact and speak loudly and rapidly.
- Women earn about 75 cents for every dollar earned by men in the same job.
- A new study claims that eating egg whites lengthens life span (new finding).
- Predictions about baseball batting averages over $.400$ (prediction/expectation claim).
- In a room of $30$ people, there is about an $80\%$ chance that at least two share a birthday.
- A tongue-in-cheek claim: $79.48\%$ of all statistics are made up on the spot.
Takeaway
- These claims illustrate that statistics appear across many domains (psychology, health, law, sports, business).
- Data interpretation matters; not all statistics are equally credible or interpreted appropriately.
- The goal is to become an intelligent consumer of statistical claims by questioning sources and procedures.

Descriptive Statistics

Prerequisites: none
Learning objectives
- Define “descriptive statistics”
- Distinguish descriptive statistics from inferential statistics
Descriptive statistics defined
- Descriptive statistics summarize and describe data (the data being collected from experiments, surveys, or records).
- Data vs datum: data is plural; a single piece is a datum.
- Example: birth certificates. Descriptive statistics might be the percentage issued in New York State or the average age of the mother; any number computed from the data counts as a descriptive statistic for the data at hand.
Descriptive vs inferential
- Descriptive statistics describe the data at hand and do not generalize beyond it.
- Inferential statistics generalize from a sample to a larger population; this is covered in another section.
Examples (Table 1 and Table 2 in the text)
- Example: Average salaries for various occupations in 1999 (descriptive table)
- Pediatricians: $112760$
- Dentists: $106130$
- Podiatrists: $100090$
- Physicists: $76140$
- Architects: $53410$
- School, clinical, and counseling psychologists: $49720$
- Flight attendants: $475?10$ (note: exact value as provided in the source)
- Elementary school teachers: $39560$
- Police officers: $385?10$ (note: exact value as provided in the source)
- Floral designers: $18980$
- Example: Number of unmarried men per 100 unmarried women in U.S. metro areas in 1990 (descriptive)
- Cities with mostly men:
- Jacksonville, NC: 224
- Killeen-Temple, TX: 123
- Fayetteville, NC: 118
- Lawton, OK: 116
- State College, PA: 113
- Clarksville-Hopkinsville, TN-KY: 113
- Anchorage, Alaska: 112
- Salinas-Seaside-Monterey, CA: 112
- Bryan-College Station, TX: 111
- Cities with mostly women:
- Sarasota, FL: 66
- Bradenton, FL: 68
- Altoona, PA: 69
- Springfield, IL: 70
- Descriptive statistics in sports and other domains
- Descriptive statistics are central to sports (e.g., shooting percentages, etc.).
- Olympic marathon data show historical winning times for men (since 1896) and women (since 1984).
- Observations
- Descriptive statistics can reveal disparities (e.g., gender/occupation pay gaps, regional distributions) but they require careful interpretation.
Descriptive statistics and interpretations
- They can highlight patterns, but by themselves they do not explain causes or allow generalizations beyond the observed data.
- They can be used to illustrate points in arguments, but they can also mislead if the data source, sampling, or context is biased or incomplete.
Additional context
- The text includes additional descriptive data around Olympic times and gender comparisons, emphasizing the need to question data sources and to connect descriptive statistics to the larger questions they raise.
Table 3 (Olympic marathon – winning times)
- Women: 1984 Joan Benoit (USA) 2:24:52; 1988 Rosa Mota (POR) 2:25:40; 1992 Valentina Yegorova (UT) 2:32:41; 1996 Fatuma Roba (ETH) 2:26:05; 2000 Naoko Takahashi (JPN) 2:23:14; 2004 Mizuki Noguchi (JPN) 2:26:20.
- Men: 1896 Spiridon Louis (GRE) 2:58:50; 1900 Michel Theato (FRA) 2:59:45; 1904 Thomas Hicks (USA) 3:28:53; …; 1988 Gelindo Bordin (ITA) 2:10:32; 2004 Stefano Baldini (ITA) 2:10:55.
Descriptive insight into inference
- Descriptive statistics can be used to explore questions like whether gender gaps are closing or whether record times will continue to fall, but such inferences require inferential methods.

Inferential Statistics

Prerequisites: Chapter 1: Descriptive Statistics
Learning objectives
- Distinguish between a sample and a population
- Define inferential statistics
- Identify biased samples
- Distinguish between simple random sampling and stratified sampling
- Distinguish between random sampling and random assignment
Populations and samples
- A population is the entire group of interest; a sample is a subset drawn from the population.
- Example #1: National Election Commission survey on voting fairness. You cannot ask every American; you sample a subset and infer about the population.
- The sample should be representative; bias occurs if the sample over-represents a segment (e.g., only Floridians or only Republicans).
- Inferential statistics rely on sampling assumptions; a random sample tends to mirror the population's proportions, and the match improves as the sample size grows.
Example #2: Average number of math classes taken by graduating seniors nationwide
- Population: all graduating seniors in the U.S.
- A sample might be 50 students from each of several institutions; compute average and generalize with caution.
- Potential sampling bias: overrepresentation of math majors or institutions with heavy math requirements.
Sampling bias and examples
- Example #3: Substitute teacher asks the 10 students in the front row for their scores; population is all students in the class; front-row sample may be biased.
- Example #4: Coach samples 8 volunteers to estimate cartwheels by freshmen; volunteers are not representative.
- Example #5: Twins study uses the National Twin Registry; selecting by last-name initial (Z, B) and taking every other name makes the sample non-random and potentially unrepresentative.
- Population vs sample clarity is essential; the registry may not be representative of all twins.
Sample size matters
- Random samples of small size may be non-representative due to sampling variability.
- Example: If you sample 20 individuals from a population with equal male/female distribution, there is a nontrivial chance (about $0.06$ ) that 70% or more are female purely by chance.
- Thus, inferential statistics incorporate sample size into generalizations; larger samples reduce sampling error.
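The 0.06 figure above can be checked directly. The following is a minimal sketch, assuming each sampled person is independently female with probability 0.5, so the count of females follows a binomial distribution; the helper name `prob_at_least` is illustrative and not from the text.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# 70% of 20 is 14, so we need the chance of 14 or more females.
p_small = prob_at_least(14, 20)
# Same 70% threshold with a sample ten times larger (140 of 200).
p_large = prob_at_least(140, 200)

print(round(p_small, 4))  # 0.0577
print(p_large < p_small)  # True
```

With 200 people instead of 20, the same 70% imbalance becomes far less likely, which is the sense in which larger samples reduce sampling error.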
More complex sampling and issues
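- When simple random sampling is infeasible or insufficient, other sampling methods such as stratified sampling are used. Random assignment is a separate idea: it concerns how subjects already in a study are allocated to conditions, not how they are sampled.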
Random assignment vs sampling
- Random assignment: in experiments, randomly assign subjects to treatment vs control groups to ensure equivalence; crucial for internal validity.
- Example: antidepressant vs placebo; random assignment prevents systematic bias (e.g., early arrivals potentially different from late arrivals).
- A non-random assignment can bias results; a non-random sample affects generalizability rather than internal validity.
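Random assignment can be sketched in a few lines. This is an illustrative sketch only: the subject IDs, the even split into two groups, and the function name `randomly_assign` are assumptions, not details from the text.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle subjects and split them into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy so the arrival order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs, listed in order of arrival.
arrivals = [f"S{i:02d}" for i in range(1, 21)]
treatment, control = randomly_assign(arrivals, seed=42)

print(len(treatment), len(control))  # 10 10
```

Because group membership depends only on the shuffle, arrival order (early vs late) cannot systematically line up with treatment vs control.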
Stratified sampling
- Stratified sampling ensures representation across distinct strata (subgroups).
- Example: urban university study on capital punishment; 200 students; 70% day students, 30% night students; sample 140 day and 60 night students so the sample proportions match population proportions, improving inference reliability.
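The proportional allocation in this example can be sketched as follows. Only the 70/30 split and the total of 200 come from the example; the enrolment lists and the function name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, shares, total, seed=None):
    """Draw round(share * total) units at random from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = round(shares[name] * total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical enrolment lists; only the 70/30 split comes from the example.
strata = {
    "day": [f"day-{i}" for i in range(7000)],
    "night": [f"night-{i}" for i in range(3000)],
}
shares = {"day": 0.70, "night": 0.30}
sample = stratified_sample(strata, shares, total=200, seed=1)

print({name: len(chosen) for name, chosen in sample.items()})  # {'day': 140, 'night': 60}
```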

Variables

Prerequisites: none
Learning objectives
- Define and distinguish between independent and dependent variables
- Define and distinguish between discrete and continuous variables
- Define and distinguish between qualitative and quantitative variables
Independent and dependent variables
- A variable is a property that can take on different values; independent variable is manipulated by the experimenter; dependent variable is the outcome measured.
- Example #1: Can blueberries slow aging? Independent: dietary supplement (none, blueberry, strawberry, spinach); Dependent: memory test and motor skills tests; blueberry shows strongest improvement.
- Example #2: Does beta-carotene protect against cancer? Independent: supplement vs placebo; Dependent: cancer occurrence over lifetime; results showed no systematic difference.
- Example #3: How bright should brake lights be? Independent: brightness of brake lights; Dependent: time to hit brakes.
Levels of an independent variable
- If there are two experimental conditions (experimental vs. control), the independent variable has two levels.
- If comparing five types of diets, the independent variable has five levels.
Qualitative vs Quantitative variables
- Qualitative (categorical) variables express a quality (e.g., hair color, eye color, religion, gender). They do not imply a numerical ordering.
- Quantitative variables are numerical (e.g., height, weight, shoe size).
- The type of independent variable in the blueberries example is qualitative; the dependent variable memory test is quantitative.
Discrete vs Continuous variables
- Discrete: possible values are distinct points (e.g., number of children in a household).
- Continuous: possible values form a continuum (e.g., time to respond to a question).
- In practice, measurement limits often prevent true continuity, but measurement remains conceptually continuous.

Percentiles

Prerequisites: none
Learning objectives
- Define percentiles
- Use three formulas for computing percentiles
Why percentiles matter
- A percentile ranks a score relative to a distribution (e.g., what percentage of scores fall below yours).
Definitions of percentile
- Definition 1: The percentile is the lowest score that is greater than the specified percentage of scores.
- Definition 2: The percentile is the smallest score that is greater than or equal to the specified percentage of scores.
- A third, commonly used approach interpolates between adjacent ranked scores (a weighted average of the two); it handles rounding and makes the median equal to the 50th percentile.
Third definition (default in this text)
- $R = \frac{P}{100} \times (N + 1)$, where $P$ is the desired percentile and $N$ is the number of scores.
- If $R$ is an integer, the percentile equals the value with rank $R$.
- If $R$ is not an integer, interpolate between the values with ranks $IR$ and $IR + 1$, where $IR = \lfloor R \rfloor$ is the integer part of $R$ and $FR = R - IR$ is its fractional part.
- $\text{Percentile} = \text{Value}_{IR} + FR \times (\text{Value}_{IR+1} - \text{Value}_{IR})$.
Example 1 (8 numbers in Table 1)
- P = 25, N = 8 → R = 25/100 × (8 + 1) = 2.25
- IR = 2, FR = 0.25; values with ranks 2 and 3 are 5 and 7; percentile = 5 + 0.25 × (7 − 5) = 5.5
- If using Definition 1, the 25th percentile would be 7; Definition 2 would be 5.
Example 2 (20 quiz scores in Table 2)
- 25th percentile: R = 25/100 × (20 + 1) = 5.25, so IR = 5, FR = 0.25 → percentile = 5 (as shown by the data)
- 85th percentile: IR = 17, FR = 0.85; score at rank 17 is 9 and rank 18 is 10; percentile = 0.85 × (10 − 9) + 9 = 9.85
Important notes
- When FR = 0, percentile equals the score at rank IR.
- For the 50th percentile (the median), the rank under the third definition is R = 50/100 × (N + 1) = (N + 1)/2.
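The third definition translates directly into code. The following is a sketch under stated assumptions: the eight scores are an assumed data set chosen so that ranks 2 and 3 are 5 and 7, as in Example 1, and the clamping of ranks outside 1..N is a choice made here, not something the text specifies.

```python
import math

def percentile(scores, p):
    """Percentile by the third definition: R = p/100 * (N + 1), then interpolate."""
    ordered = sorted(scores)
    n = len(ordered)
    r = p * (n + 1) / 100
    ir = math.floor(r)   # integer part of the rank (IR)
    fr = r - ir          # fractional part of the rank (FR)
    # Assumption: ranks outside 1..N are clamped to the smallest/largest score.
    if ir < 1:
        return ordered[0]
    if ir >= n:
        return ordered[-1]
    lower, upper = ordered[ir - 1], ordered[ir]   # values at ranks IR and IR+1
    return lower + fr * (upper - lower)

# Assumed data: eight scores whose ranks 2 and 3 are 5 and 7, as in Example 1.
scores = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile(scores, 25))  # 5.5
print(percentile(scores, 50))  # 8.5
```

The 50th percentile here is 8.5, the midpoint of the two middle scores, which is the usual median for an even number of scores.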

Levels of Measurement

Prerequisites: Chapter 1: Variables
Learning objectives
- Define and distinguish among nominal, ordinal, interval, and ratio scales
- Identify scale type for a given variable
- Discuss proper use of measurement scales in psychological measurement
- Give examples of errors from misusing measurement scales
Nominal scales
- Name/categorize responses; no ordering implied (e.g., gender, handedness, favorite color, religion)
- They embody the lowest level of measurement; you cannot order categories meaningfully.
Ordinal scales
- Categories are ordered (e.g., very dissatisfied, somewhat dissatisfied, somewhat satisfied, very satisfied)
- Allow some comparison (one person more satisfied than another) but do not guarantee equal intervals between adjacent levels.
- Differences between adjacent levels may not be equal; the step from 1 to 2 may not equal the step from 3 to 4.
- Changing response format (e.g., using numbers 1–4) does not inherently make the scale interval/ratio; equal-interval interpretation still may not hold.
Interval scales
- Intervals have the same meaning across the scale (e.g., Fahrenheit temperature).
- Do not have a true zero point; zero is arbitrary (e.g., 0°F is not the absence of temperature).
- Ratios do not make sense on interval scales (e.g., 80°F is not twice as hot as 40°F).
Ratio scales
- Have all properties of nominal, ordinal, and interval scales, plus a true zero point (absence of the quantity).
- Examples: Kelvin temperature (true zero), money (e.g., 50 cents vs 25 cents; 50 is twice 25).
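A short calculation shows why ratios fail on an interval scale. This sketch uses the standard Fahrenheit-to-Kelvin conversion; the function name is illustrative.

```python
def fahrenheit_to_kelvin(f: float) -> float:
    """Convert an interval-scale reading (Fahrenheit) to the ratio-scale Kelvin."""
    return (f - 32) * 5 / 9 + 273.15

# Naive ratio of the Fahrenheit numbers suggests "twice as hot".
naive_ratio = 80 / 40
# Ratio on a scale with a true zero tells a different story.
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 2))  # 1.08
```

On the Kelvin scale, 80°F is only about 8% above 40°F, so the "twice" in the Fahrenheit numbers is an artifact of the arbitrary zero.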
Psychology measurement in context
- Rating scales used in psychology are often ordinal (e.g., 5- or 7-point scales).
- It is common to compute means for ordinal data, but there are debates; caution is warranted because equal-interval assumptions may be violated.
- Memory experiments often yield counts (which can be treated as ratio data due to a true zero and meaningful differences).
Takeaway
- The level of measurement constrains what statistics can be meaningfully computed.

Distributions

Prerequisites: Chapter 1: Variables
Learning objectives
- Define distribution; interpret a frequency distribution
- Distinguish a frequency distribution from a probability distribution
- Construct a grouped frequency distribution for a continuous variable
- Identify skew, bimodality, leptokurtosis, and platykurtosis
Distributions of discrete variables (M&M example)
- Count colors in a bag; frequency table describes the distribution of color counts.
- A frequency distribution can be graphed as a histogram of discrete counts (Figure 1).
- For all M&Ms produced, the manufacturer reports proportions (probability distribution) that sum to 1. For example, Brown ≈ 0.30, Red ≈ 0.30, Yellow ≈ 0.15, Green ≈ 0.15, Blue ≈ 0.05, Orange ≈ 0.05.
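The step from a frequency distribution (counts in one bag) to proportions that sum to 1 can be sketched as below. The bag contents are hypothetical and are not the counts from the text's figure.

```python
from collections import Counter

# Hypothetical bag of candies (illustrative counts only).
bag = (["brown"] * 17 + ["red"] * 18 + ["yellow"] * 7 +
       ["green"] * 7 + ["blue"] * 2 + ["orange"] * 4)

frequency = Counter(bag)              # frequency distribution: color -> count
total = sum(frequency.values())
proportions = {color: count / total for color, count in frequency.items()}

for color, count in frequency.most_common():
    print(f"{color}: {count} ({proportions[color]:.2f})")
```

The proportions describe this one bag; the manufacturer's probability distribution describes all candies produced, so the two need not match.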
Continuous variables and grouped frequency distributions
- Example: times taken to move a cursor over a target (20 trials) produce a continuous variable; a simple frequency distribution would be uninformative since almost no two times are identical.
- Solution: group into intervals and create a grouped frequency distribution; visualize with a histogram (Figure 3).
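A grouped frequency distribution can be built by mapping each value to the interval that contains it. This is a minimal sketch: the 20 response times, the interval width of 100 ms, and the starting point of 500 ms are assumptions for illustration, not the data behind the text's figure.

```python
import math
from collections import Counter

def grouped_frequencies(values, width, start):
    """Count values in intervals [start + k*width, start + (k+1)*width)."""
    bins = Counter(math.floor((v - start) / width) for v in values)
    # Only intervals that contain at least one value are reported.
    return {
        (start + k * width, start + (k + 1) * width): bins[k]
        for k in sorted(bins)
    }

# Hypothetical response times in milliseconds (not the data from the text).
times = [568, 577, 581, 640, 641, 645, 657, 673, 696, 703,
         720, 728, 729, 777, 808, 824, 825, 865, 875, 1007]
table = grouped_frequencies(times, width=100, start=500)

for (low, high), count in table.items():
    print(f"{low}-{high}: {count}")
```

The choice of interval width is a judgment call: intervals that are too narrow reproduce the uninformative raw table, and intervals that are too wide hide the shape.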
Probability densities and the normal distribution
- For continuous variables, distributions are represented as probability densities (area under the curve equals 1).
- The probability of exactly a specific value is essentially zero; the probability of falling within an interval is the area under the curve for that interval.
- The normal distribution is a bell-shaped density; it is used as a common approximation for many naturally occurring phenomena.
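The "probability as area" idea can be illustrated with Python's standard-library `statistics.NormalDist`. The standard normal (mean 0, standard deviation 1) and the particular intervals are choices made for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the density between -1 and 1 (within one standard deviation).
p_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# A very narrow interval around 0 has a very small area.
p_narrow = standard_normal.cdf(0.0005) - standard_normal.cdf(-0.0005)

print(round(p_within_one_sd, 4))  # 0.6827
print(p_narrow < 0.001)           # True
```

As the interval shrinks toward a single point, its area shrinks toward zero, which is why the probability of any exact value is zero for a continuous variable.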
Shape of distributions
- Normal density is symmetric with a single peak in the middle; tails extend indefinitely.
- Skew: a distribution with a longer tail on one side. Positive skew (skewed to the right) has a longer right tail; negative skew (skewed to the left) has a longer left tail.
- Kurtosis: relative to the normal distribution, a leptokurtic distribution has heavier tails (more data in the tails), and a platykurtic distribution has lighter tails and a flatter shape.
- Bimodal distribution: two distinct peaks.
Example: Old Faithful geyser eruption times show a bimodal distribution.
Visual guides (descriptive): shape descriptions include symmetry, skewness, and kurtosis as important features for understanding data distributions.

Summation Notation

Prerequisites: None
Learning objectives
- Use summation notation to express the sum of all numbers
- Use summation notation to express the sum of a subset of numbers
- Use summation notation to express the sum of squares
Basic idea
- Σ denotes summation. For $X_1, X_2, X_3, X_4$ (weights in grams of 4 grapes):
- The sum: $\sum_{i=1}^{4} X_i = X_1 + X_2 + X_3 + X_4$
Example (sum of four numbers)
- If X = [4.6, 5.1, 4.9, 4.4], then
- $\sum_{i=1}^{4} X_i = 4.6 + 5.1 + 4.9 + 4.4 = 19.0$
Sum of squares vs square of sums
- Sum of squares: $\sum_{i=1}^{4} X_i^2 = 4.6^2 + 5.1^2 + 4.9^2 + 4.4^2 = 90.54$
- The square of the sum: $\left(\sum_{i=1}^{4} X_i\right)^2 = 19^2 = 361$
- The distinction: squaring each value before summing generally gives a different result from summing first and then squaring.
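The grape-weight example can be reproduced in a few lines. This sketch rounds the results to two decimals, because floating-point sums of values like 4.6 are not exact.

```python
# Weights in grams of the four grapes from the example.
weights = [4.6, 5.1, 4.9, 4.4]

total = sum(weights)                             # sum of X_i
sum_of_squares = sum(x ** 2 for x in weights)    # square each value, then sum
square_of_sum = total ** 2                       # sum first, then square

print(round(total, 2), round(sum_of_squares, 2), round(square_of_sum, 2))
# 19.0 90.54 361.0
```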
Cross products
- Given pairs $(X_i, Y_i)$, the sum of cross products is $\sum_{i=1}^{n} X_i Y_i$: multiply each $X_i$ by its paired $Y_i$, then add the products.
- Example (X, Y pairs and cross products): the sum of $X_i Y_i$ is 28 for the data shown in the text.
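A sum of cross products can be computed by pairing the values with `zip`. The pairs below are hypothetical, chosen only so that the total matches the 28 quoted above; they are not confirmed to be the data from the text.

```python
# Hypothetical (X, Y) pairs, chosen so the cross products sum to 28.
xs = [1, 2, 3]
ys = [3, 2, 7]

cross_products = [x * y for x, y in zip(xs, ys)]   # X_i * Y_i for each pair
sum_of_cross_products = sum(cross_products)

print(cross_products)          # [3, 4, 21]
print(sum_of_cross_products)   # 28
```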
Practical takeaway
- Summation notation provides compact, precise ways to express sums and related statistics (sums of squares, cross products, etc.).

Linear Transformations

Prerequisites: None
Learning objectives
- Give the formula for a linear transformation
- Determine whether a transformation is linear
- Describe what is linear about a linear transformation
What a linear transformation does
- You may need to transform measurements from one scale to another (e.g., feet to inches, inches to feet).
- Example: To convert feet to inches, multiply by 12. If x is feet, then the transformation T is:
- $T(x) = 12x$
- Conversely, to convert inches to feet, divide by 12: $T(x) = x/12$
What makes a transformation linear
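- In statistics, a linear transformation multiplies each value by a constant and then adds a constant: $Y = bX + A$. The feet-to-inches conversion is the case $b = 12$, $A = 0$; Celsius to Fahrenheit ($F = 1.8C + 32$) has a nonzero $A$.
- What is linear about it: plotting the transformed values against the original values gives a straight line. The stricter linear-algebra property $T(cX + dY) = cT(X) + dT(Y)$ holds only when $A = 0$.
- Transformations with $A = 0$ (pure unit changes) preserve ratios between values; those with $A \neq 0$ preserve the relative size of differences but not ratios.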
Significance in statistics
- Linear transformations can be used to rescale, standardize, or normalize data without changing the underlying relationships between variables.
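The effect of rescaling on summary statistics can be seen in a small sketch. The heights are hypothetical, and the function `linear_transform` (applying $Y = bX + a$) is an illustrative helper, not something defined in the text.

```python
from statistics import mean, pstdev

def linear_transform(values, b, a=0.0):
    """Apply Y = b*X + a to every value."""
    return [b * x + a for x in values]

# Hypothetical heights in feet (illustrative only).
heights_ft = [5.0, 5.5, 6.0, 6.5]
heights_in = linear_transform(heights_ft, b=12)   # feet to inches

print(mean(heights_ft), mean(heights_in))         # 5.75 69.0
# The spread scales by the same factor of 12.
print(round(pstdev(heights_in) / pstdev(heights_ft), 6))  # 12.0
```

The mean is transformed exactly like an individual value, and the standard deviation is multiplied by the scale factor, so the relative positions of the data points are unchanged.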