Ch. 1 Statistics

What Are Statistics

  • Learning objectives
    • Describe the range of applications of statistics
    • Identify situations in which statistics can be misleading
    • Define “Statistics”
  • What statistics are
    • Statistics include numerical facts and figures (data). Examples:
    • The largest earthquake measured 9.2 on the Richter scale.
    • Men are at least 10 times more likely than women to commit murder.
    • One in every 8 South Africans is HIV positive (as a fraction, 1/8 = 0.125).
    • By the year 2020, there will be 15 people aged 65 and over for every new baby born (a ratio of 15:1).
    • The study of statistics involves math and calculations of numbers, but also depends heavily on how numbers are chosen and interpreted.
  • Flaws in interpretation (three scenarios to spot incorrect interpretations)
    • Scenario 1: A new advertisement for Ben & Jerry's ice cream led to a 30% increase in sales for the following three months. Major flaw: history effect. The increase may be due to typical seasonal increases in June–August rather than the advertisement.
    • Scenario 2: The more churches in a city, the more crime there is. Major flaw: third-variable problem. A third variable (e.g., population size) could cause both higher church counts and higher crime rates; correlation does not imply causation.
    • Scenario 3: 75% more interracial marriages are occurring this year than 25 years ago. Major flaw: lack of context and rate information. Without the base rate of interracial marriages, the claim may be misleading; the number could be fluctuating historically, and the statistic alone does not show acceptability.
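
Scenario 3 can be made concrete with a little arithmetic. The sketch below uses two invented base rates (they do not come from any data set) to show that the same 75% relative increase can mean very different absolute changes. The function name `apply_increase` is only illustrative.

```python
# Hypothetical base rates (fractions of all marriages); invented for illustration only.
base_rates = {"small base": 0.004, "large base": 0.20}
relative_increase = 0.75  # the "75% more" from Scenario 3


def apply_increase(base, increase):
    """Return the new rate after a relative increase."""
    return base * (1 + increase)


for label, base in base_rates.items():
    new = apply_increase(base, relative_increase)
    print(f"{label}: {base:.1%} -> {new:.1%} (absolute change {new - base:.1%})")
```

Without the base rate, a reader cannot tell which of these two situations the headline figure describes.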
  • Takeaway about statistics
    • Statistics are not only facts and figures; they are a range of techniques for analyzing, interpreting, displaying, and making decisions based on data.
    • Statistics involve both data collection and the interpretation of data in context.

Importance of Statistics

  • Learning objectives
    • Give examples of statistics encountered in everyday life
    • Give examples of how statistics can lend credibility to an argument
  • Why statistics matter
    • To take control of your life, you must evaluate data and claims.
    • Poor reasoning can lead to manipulation or bad decisions; statistics provide tools to react intelligently.
  • Examples of statistical claims (illustrative, not guaranteed true)
    • 4 out of 5 dentists recommend Dentine.
    • About 85% of lung cancers in men and 45% in women are tobacco-related.
    • Condoms are effective 94% of the time (effectiveness claim).
    • Native Americans are more likely to be hit crossing the street than other groups (ethnic comparisons).
    • Persuasive effect of eye contact, loud and rapid speech.
    • Women earn about $0.75 for every dollar earned by men in the same job.
    • Egg whites study claiming longer life span (new finding).
    • Baseball batting averages over .400 (prediction/expectation claim).
    • In a room of 30 people, there is about an 80% chance that at least two share a birthday.
    • A tongue-in-cheek claim: 79.48% of all statistics are made up on the spot.
  • Takeaway
    • These claims illustrate statistics are diverse across domains (psychology, health, law, sports, business).
    • Data interpretation matters; not all statistics are equally credible or interpreted appropriately.
    • The goal is to become an intelligent consumer of statistical claims by questioning sources and procedures.

Descriptive Statistics

  • Prerequisites: none
  • Learning objectives
    • Define “descriptive statistics”
    • Distinguish descriptive statistics from inferential statistics
  • Descriptive statistics defined
    • Descriptive statistics summarize and describe data (the data being collected from experiments, surveys, or records).
    • Data vs datum: data is plural; a single piece is a datum.
    • Example: for a set of birth certificates, descriptive statistics might be the percentage issued in New York State or the average age of the mother; any number computed from the data counts as a descriptor of the data at hand.
  • Descriptive vs inferential
    • Descriptive statistics describe the data at hand and do not generalize beyond it.
    • Inferential statistics generalize from a sample to a larger population; this is covered in another section.
  • Examples (Table 1 and Table 2 in the text)
    • Example: Average salaries for various occupations in 1999 (descriptive table)
    • Pediatricians: $112,760
    • Dentists: $106,130
    • Podiatrists: $100,090
    • Physicists: $76,140
    • Architects: $53,410
    • School, clinical, and counseling psychologists: $49,720
    • Flight attendants: 475?10 (value garbled in the source; reproduced as given)
    • Elementary school teachers: $39,560
    • Police officers: 385?10 (value garbled in the source; reproduced as given)
    • Floral designers: $18,980
    • Example: Number of unmarried men per 100 unmarried women in U.S. metro areas in 1990 (descriptive)
    • Metro areas with more men than women (values above 100):
    • Jacksonville, NC: 224
    • Killeen-Temple, TX: 123
    • Fayetteville, NC: 118
    • Lawton, OK: 116
    • State College, PA: 113
    • Clarksville-Hopkinsville, TN-KY: 113
    • Anchorage, Alaska: 112
    • Salinas-Seaside-Monterey, CA: 112
    • Bryan-College Station, TX: 111
    • Metro areas with more women than men (values below 100):
    • Sarasota, FL: 66
    • Bradenton, FL: 68
    • Altoona, PA: 69
    • Springfield, IL: 70
    • Descriptive statistics in sports and other domains
    • Descriptive statistics are central to sports (e.g., shooting percentages, etc.).
    • Olympic marathon data show historical winning times for men and women (women since 1984, men since 1896).
    • Observations
    • Descriptive statistics can reveal disparities (e.g., gender/occupation pay gaps, regional distributions) but they require careful interpretation.
  • Descriptive statistics and interpretations
    • They can highlight patterns, but by themselves they do not explain causes or allow generalizations beyond the observed data.
    • They can be used to illustrate points in arguments, but they can also mislead if the data source, sampling, or context is biased or incomplete.
  • Additional context
    • The text includes additional descriptive data around Olympic times and gender comparisons, emphasizing the need to question data sources and to connect descriptive statistics to the larger questions they raise.
  • Table 3 (Olympic marathon – winning times)
    • Women: 1984 Joan Benoit (USA) 2:24:52; 1988 Rosa Mota (POR) 2:25:40; 1992 Valentina Yegorova (UT) 2:32:41; 1996 Fatuma Roba (ETH) 2:26:05; 2000 Naoko Takahashi (JPN) 2:23:14; 2004 Mizuki Noguchi (JPN) 2:26:20.
    • Men: 1896 Spiridon Louis (GRE) 2:58:50; 1900 Michel Theato (FRA) 2:59:45; 1904 Thomas Hicks (USA) 3:28:53; …; 1988 Gelindo Bordin (ITA) 2:10:32; 2004 Stefano Baldini (ITA) 2:10:55.
  • Descriptive insight into inference
    • Descriptive statistics can be used to explore questions like whether gender gaps are closing or whether record times will continue to fall, but such inferences require inferential methods.

Inferential Statistics

  • Prerequisites: Chapter 1: Descriptive Statistics
  • Learning objectives
    • Distinguish between a sample and a population
    • Define inferential statistics
    • Identify biased samples
    • Distinguish between simple random sampling and stratified sampling
    • Distinguish between random sampling and random assignment
  • Populations and samples
    • A population is the entire group of interest; a sample is a subset drawn from the population.
    • Example #1: National Election Commission survey on voting fairness. You cannot ask every American; you sample a subset and infer about the population.
    • The sample should be representative; bias occurs if the sample over-represents a segment (e.g., only Floridians or only Republicans).
    • Inferential statistics rely on sampling assumptions; a random sample tends to mirror the population's proportions, and the match gets closer as the sample size grows.
  • Example #2: Average number of math classes taken by graduating seniors nationwide
    • Population: all graduating seniors in the U.S.
    • A sample might be 50 students from each of several institutions; compute average and generalize with caution.
    • Potential sampling bias: overrepresentation of math majors or institutions with heavy math requirements.
  • Sampling bias and examples
    • Example #3: Substitute teacher asks the 10 students in the front row for their scores; population is all students in the class; front-row sample may be biased.
    • Example #4: Coach samples 8 volunteers to estimate cartwheels by freshmen; volunteers are not representative.
    • Example #5: A twins study draws on the National Twin Registry; selecting by last-name initial (Z, B) and taking every other name are not random procedures, so they introduce bias and non-representativeness.
    • Population vs sample clarity is essential; the registry may not be representative of all twins.
  • Sample size matters
    • Random samples of small size may be non-representative due to sampling variability.
    • Example: If you sample 20 individuals from a population with equal male/female distribution, there is a nontrivial chance (about 0.06) that 70% or more are female purely by chance.
    • Thus, inferential statistics incorporate sample size into generalizations; larger samples reduce sampling error.
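
The 0.06 figure can be checked directly from the binomial distribution. The sketch below assumes each sampled person is female with probability 0.5, independently of the others; the helper name `prob_at_least` and the comparison sample of 200 are illustrative choices, not part of the original notes.

```python
from math import comb


def prob_at_least(k_min, n, p=0.5):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# 70% of 20 is 14, so "70% or more female" means at least 14 of the 20.
p_small = prob_at_least(14, 20)
# Same 70% threshold with a sample ten times larger (at least 140 of 200).
p_large = prob_at_least(140, 200)

print(f"n = 20:  {p_small:.3f}")   # about 0.058, which rounds to the 0.06 quoted above
print(f"n = 200: {p_large:.2e}")
```

The second probability is far smaller than the first, which illustrates why larger random samples are less likely to be badly unrepresentative.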
  • More complex sampling and issues
    • Simple random sampling is not always feasible. Two related ideas are covered next: random assignment (used within experiments) and stratified sampling (an alternative way of drawing a sample).
  • Random assignment vs sampling
    • Random assignment: in experiments, randomly assign subjects to treatment vs control groups to ensure equivalence; crucial for internal validity.
    • Example: antidepressant vs placebo; random assignment prevents systematic bias (e.g., early arrivals potentially different from late arrivals).
    • A non-random assignment can bias results; a non-random sample affects generalizability rather than internal validity.
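
As a minimal sketch of random assignment, the code below shuffles a list of hypothetical subject IDs (listed in arrival order) and splits it into two groups, so that arrival time cannot systematically determine who receives the treatment. The function name `randomly_assign` and the fixed seed are assumptions made for reproducibility.

```python
import random


def randomly_assign(subjects, seed=None):
    """Shuffle subjects and split them into two groups (treatment, control)."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subject IDs in arrival order (1 arrived first).
arrival_order = list(range(1, 21))
treatment, control = randomly_assign(arrival_order, seed=0)

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Assigning the first ten arrivals to the drug and the last ten to the placebo would be the non-random alternative that the notes warn against.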
  • Stratified sampling
    • Stratified sampling ensures representation across distinct strata (subgroups).
    • Example: an urban university study on capital punishment with a sample of 200 students; the student population is 70% day students and 30% night students, so the sample draws 140 day and 60 night students to match the population proportions, which improves the reliability of inferences.
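
The 140/60 split can be sketched as proportional allocation followed by a simple random sample within each stratum. The enrollment of 10,000 students and the function name `stratified_sample` are assumptions for illustration; only the 70%/30% split and the sample size of 200 come from the example.

```python
import random


def stratified_sample(strata, total_n, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_stratum = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, n_stratum)
    return sample


# Hypothetical enrollment of 10,000 students: 70% day, 30% night.
strata = {
    "day": [f"day-{i}" for i in range(7000)],
    "night": [f"night-{i}" for i in range(3000)],
}
sample = stratified_sample(strata, total_n=200, seed=0)
print({name: len(chosen) for name, chosen in sample.items()})
```

With proportions that do not divide evenly, the rounded stratum sizes may not add up exactly to the requested total, so a real study would need a rule for handling the remainder.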

Variables

  • Prerequisites: none
  • Learning objectives
    • Define and distinguish between independent and dependent variables
    • Define and distinguish between discrete and continuous variables
    • Define and distinguish between qualitative and quantitative variables
  • Independent and dependent variables
    • A variable is a property that can take on different values; independent variable is manipulated by the experimenter; dependent variable is the outcome measured.
    • Example #1: Can blueberries slow aging? Independent: dietary supplement (none, blueberry, strawberry, spinach); Dependent: memory test and motor skills tests; blueberry shows strongest improvement.
    • Example #2: Does beta-carotene protect against cancer? Independent: supplement vs placebo; Dependent: cancer occurrence over lifetime; results showed no systematic difference.
    • Example #3: How bright should brake lights be? Independent: brightness of brake lights; Dependent: time to hit brakes.
  • Levels of an independent variable
    • If there are two experimental conditions (experimental vs. control), the independent variable has two levels.
    • If comparing five types of diets, the independent variable has five levels.
  • Qualitative vs Quantitative variables
    • Qualitative (categorical) variables express a quality (e.g., hair color, eye color, religion, gender). They do not imply a numerical ordering.
    • Quantitative variables are numerical (e.g., height, weight, shoe size).
    • The type of independent variable in the blueberries example is qualitative; the dependent variable memory test is quantitative.
  • Discrete vs Continuous variables
    • Discrete: possible values are distinct points (e.g., number of children in a household).
    • Continuous: possible values form a continuum (e.g., time to respond to a question).
    • In practice, limited measurement precision means recorded values are never truly continuous, but the underlying variable is still treated as continuous.

Percentiles

  • Prerequisites: none
  • Learning objectives
    • Define percentiles
    • Use three formulas for computing percentiles
  • Why percentiles matter
    • A percentile ranks a score relative to a distribution (e.g., what percentage of scores fall below yours).
  • Definitions of percentile
    • Definition 1: The percentile is the lowest score that is greater than the specified percentage of scores.
    • Definition 2: The percentile is the smallest score that is greater than or equal to the specified percentage of scores.
    • A third, commonly used approach takes a weighted average of the percentiles from the first two definitions; it handles rounding more gracefully and lets the median be defined simply as the 50th percentile.
  • Third definition (default in this text)
    • R = P/100 × (N + 1) where P is the desired percentile and N is the number of scores.
    • If R is an integer, the percentile equals the value with rank R.
    • If R is not an integer, interpolate between the values with ranks IR and IR + 1, where IR = floor(R) is the integer part of R and FR = R − IR is its fractional part.
    • Percentile = Value_IR + FR × (Value_(IR+1) − Value_IR).
  • Example 1 (8 numbers in Table 1)
    • P = 25, N = 8 → R = 25/100 × (8 + 1) = 2.25
    • IR = 2, FR = 0.25; values with ranks 2 and 3 are 5 and 7; percentile = 5 + 0.25 × (7 − 5) = 5.5
    • If using Definition 1, the 25th percentile would be 7; Definition 2 would be 5.
  • Example 2 (20 quiz scores in Table 2)
    • 25th percentile: R = 25/100 × (20 + 1) = 5.25, so IR = 5, FR = 0.25 → percentile = 5 (as shown by the data)
    • 85th percentile: IR = 17, FR = 0.85; score at rank 17 is 9 and rank 18 is 10; percentile = 0.85 × (10 − 9) + 9 = 9.85
  • Important notes
    • When FR = 0, percentile equals the score at rank IR.
    • Under the third definition, the 50th percentile (the median) is the value at rank R = 50/100 × (N + 1) = (N + 1)/2, interpolating if R is not an integer.
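
The third definition translates into a short function. This is a sketch under stated assumptions: the eight scores are an assumed data set chosen so that ranks 2 and 3 hold 5 and 7 as in Example 1 (the original table is not reproduced in these notes), and clamping ranks that fall outside 1..N is a choice made here, not something the notes specify.

```python
import math


def percentile(scores, p):
    """Percentile by the third definition: rank R = p/100 * (N + 1), with interpolation."""
    ordered = sorted(scores)
    n = len(ordered)
    r = p / 100 * (n + 1)
    ir = math.floor(r)   # integer part of the rank
    fr = r - ir          # fractional part of the rank
    # Assumption: ranks outside 1..N are clamped to the smallest or largest score.
    if ir < 1:
        return ordered[0]
    if ir >= n:
        return ordered[-1]
    lower, upper = ordered[ir - 1], ordered[ir]  # ranks are 1-based, list indices 0-based
    return lower + fr * (upper - lower)


example_1 = [3, 5, 7, 8, 9, 11, 13, 15]  # assumed data consistent with Example 1
print(percentile(example_1, 25))  # 5.5, matching Example 1
print(percentile(example_1, 50))  # the median under this definition
```

Statistical packages implement several different percentile conventions, so results from other tools may differ slightly from this definition.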

Levels of Measurement

  • Prerequisites: Chapter 1: Variables
  • Learning objectives
    • Define and distinguish among nominal, ordinal, interval, and ratio scales
    • Identify scale type for a given variable
    • Discuss proper use of measurement scales in psychological measurement
    • Give examples of errors from misusing measurement scales
  • Nominal scales
    • Name/categorize responses; no ordering implied (e.g., gender, handedness, favorite color, religion)
    • They embody the lowest level of measurement; you cannot order categories meaningfully.
  • Ordinal scales
    • Categories are ordered (e.g., very dissatisfied, somewhat dissatisfied, somewhat satisfied, very satisfied)
    • Allow some comparison (one person more satisfied than another) but do not guarantee equal intervals between adjacent levels.
    • Differences between adjacent levels may not be equal; the step from 1 to 2 may not equal the step from 3 to 4.
    • Changing response format (e.g., using numbers 1–4) does not inherently make the scale interval/ratio; equal-interval interpretation still may not hold.
  • Interval scales
    • Intervals have the same meaning across the scale (e.g., Fahrenheit temperature).
    • Do not have a true zero point; zero is arbitrary (e.g., 0°F is not the absence of temperature).
    • Ratios do not make sense on interval scales (e.g., 80°F is not twice as hot as 40°F).
  • Ratio scales
    • Have all properties of nominal, ordinal, and interval scales, plus a true zero point (absence of the quantity).
    • Examples: Kelvin temperature (true zero), money (e.g., 50 cents vs 25 cents; 50 is twice 25).
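
A quick calculation shows why ratios are not meaningful on an interval scale. The sketch converts the two Fahrenheit readings from the example to kelvins (a ratio scale) using the standard conversion formula; the function name is illustrative.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins."""
    return (f - 32) * 5 / 9 + 273.15


hot_f, cool_f = 80.0, 40.0
ratio_f = hot_f / cool_f
ratio_k = fahrenheit_to_kelvin(hot_f) / fahrenheit_to_kelvin(cool_f)

print(f"ratio of Fahrenheit readings: {ratio_f:.2f}")  # 2.00
print(f"ratio of Kelvin readings:     {ratio_k:.2f}")  # 1.08
```

The apparent factor of two comes only from where the Fahrenheit scale places its arbitrary zero; measured from a true zero, the warmer reading is about 8% higher.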
  • Psychology measurement in context
    • Rating scales used in psychology are often ordinal (e.g., 5- or 7-point scales).
    • It is common to compute means for ordinal data, but there are debates; caution is warranted because equal-interval assumptions may be violated.
    • Memory experiments often yield counts (which can be treated as ratio data due to a true zero and meaningful differences).
  • Takeaway
    • The level of measurement constrains what statistics can be meaningfully computed.

Distributions

  • Prerequisites: Chapter 1: Variables
  • Learning objectives
    • Define distribution; interpret a frequency distribution
    • Distinguish a frequency distribution from a probability distribution
    • Construct a grouped frequency distribution for a continuous variable
    • Identify skew, bimodality, leptokurtosis, and platykurtosis
  • Distributions of discrete variables (M&M example)
    • Count colors in a bag; frequency table describes the distribution of color counts.
    • A frequency distribution can be graphed as a histogram of discrete counts (Figure 1).
    • For all M&Ms produced, the manufacturer reports proportions (probability distribution) that sum to 1. For example, Brown ≈ 0.30, Red ≈ 0.30, Yellow ≈ 0.15, Green ≈ 0.15, Blue ≈ 0.05, Orange ≈ 0.05.
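
The difference between a frequency distribution (counts in one bag) and a distribution of proportions can be sketched as follows. The bag contents are invented for illustration and are not the counts from the text's figure.

```python
from collections import Counter

# Hypothetical contents of one bag; counts are invented for illustration.
bag = (["brown"] * 17 + ["red"] * 18 + ["yellow"] * 7
       + ["green"] * 7 + ["blue"] * 2 + ["orange"] * 4)

frequencies = Counter(bag)            # frequency distribution (counts per color)
total = sum(frequencies.values())
proportions = {color: count / total for color, count in frequencies.items()}

for color, count in frequencies.most_common():
    print(f"{color:<7} count={count:>2}  proportion={proportions[color]:.3f}")
```

The proportions from a single bag sum to 1 just as the manufacturer's figures do, but they describe only that bag and will generally differ from the population proportions.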
  • Continuous variables and grouped frequency distributions
    • Example: times taken to move a cursor over a target (20 trials) produce a continuous variable; a simple frequency distribution would be uninformative since almost no two times are identical.
    • Solution: group into intervals and create a grouped frequency distribution; visualize with a histogram (Figure 3).
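
A grouped frequency distribution can be built by assigning each value to a fixed-width interval. The 20 response times below are invented, and the starting point (500 ms) and interval width (100 ms) are arbitrary choices for this sketch; the function assumes integer values no smaller than `start`.

```python
# Hypothetical response times in milliseconds for 20 trials (invented values).
times_ms = [568, 577, 581, 640, 641, 645, 657, 673, 696, 703,
            720, 728, 729, 777, 808, 824, 825, 865, 875, 1007]


def grouped_frequencies(values, start, width):
    """Count values falling in half-open intervals [lower, lower + width)."""
    n_bins = (max(values) - start) // width + 1
    counts = {start + i * width: 0 for i in range(n_bins)}
    for v in values:
        lower = start + ((v - start) // width) * width
        counts[lower] += 1
    return counts


width = 100
groups = grouped_frequencies(times_ms, start=500, width=width)
for lower, count in groups.items():
    print(f"[{lower}, {lower + width}): {'#' * count}")
```

The printed rows form a rough text histogram; changing the interval width changes how much detail the grouped distribution shows.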
  • Probability densities and the normal distribution
    • For continuous variables, distributions are represented as probability densities (area under the curve equals 1).
    • The probability of exactly a specific value is essentially zero; the probability of falling within an interval is the area under the curve for that interval.
    • The normal distribution is a bell-shaped density; it is used as a common approximation for many naturally occurring phenomena.
  • Shape of distributions
    • Normal density is symmetric with a single peak in the middle; tails extend indefinitely.
    • Skew: a distribution with a longer tail on one side. Positive skew (skewed to the right) has a longer right tail; negative skew (skewed to the left) has a longer left tail.
    • Kurtosis: leptokurtic distributions have heavy (long) tails, with relatively more of the data in the tails; platykurtic distributions have light (short) tails and tend to look flatter.
    • Bimodal distribution: two distinct peaks.
  • Example: Old Faithful geyser eruption times show a bimodal distribution.
  • Visual guides (descriptive): shape descriptions include symmetry, skewness, and kurtosis as important features for understanding data distributions.

Summation Notation

  • Prerequisites: None
  • Learning objectives
    • Use summation notation to express the sum of all numbers
    • Use summation notation to express the sum of a subset of numbers
    • Use summation notation to express the sum of squares
  • Basic idea
    • Σ denotes summation. For X1, X2, X3, X4 (weights in grams of 4 grapes):
    • The sum: $\sum_{i=1}^{4} X_i = X_1 + X_2 + X_3 + X_4$
  • Example (sum of four numbers)
    • If X = [4.6, 5.1, 4.9, 4.4], then
    • $\sum_{i=1}^{4} X_i = 4.6 + 5.1 + 4.9 + 4.4 = 19.0$
  • Sum of squares vs square of sums
    • Sum of squares: $\sum_{i=1}^{4} X_i^2 = 4.6^2 + 5.1^2 + 4.9^2 + 4.4^2 = 90.54$
    • The square of the sum: $\left(\sum_{i=1}^{4} X_i\right)^2 = 19^2 = 361$
    • The distinction: squaring before summing vs summing then squaring are not the same.
  • Cross products
    • Given pairs $(X_i, Y_i)$, the sum of cross products is $\sum_{i=1}^{n} X_i Y_i$: multiply each X by its paired Y, then add the products.
    • Example (X, Y pairs and cross products): for the example data in the source table, $\sum X_i Y_i = 28$.
  • Practical takeaway
    • Summation notation provides compact, precise ways to express sums and related statistics (sums of squares, cross products, etc.).
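
Each of the sums in this section maps onto a one-line computation. The grape weights are the four values from the example; the (X, Y) pairs are hypothetical, chosen only so that the cross products add up to 28 as in the text.

```python
X = [4.6, 5.1, 4.9, 4.4]   # grape weights in grams, from the example

total = sum(X)                          # sum of X_i
sum_of_squares = sum(x**2 for x in X)   # square each value, then add
square_of_sum = total**2                # add the values, then square

# Hypothetical (X, Y) pairs, chosen only so the cross products add up to 28.
pairs = [(1, 3), (2, 2), (3, 7)]
sum_cross_products = sum(x * y for x, y in pairs)

# Rounding hides tiny floating-point errors in the decimal sums.
print(round(total, 2), round(sum_of_squares, 2), round(square_of_sum, 2), sum_cross_products)
```

The second and third results differ (90.54 versus 361), which is the distinction between the sum of squares and the square of the sum.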

Linear Transformations

  • Prerequisites: None
  • Learning objectives
    • Give the formula for a linear transformation
    • Determine whether a transformation is linear
    • Describe what is linear about a linear transformation
  • What a linear transformation does
    • You may need to transform measurements from one scale to another (e.g., feet to inches, inches to feet).
    • Example: To convert feet to inches, multiply by 12. If x is feet, then the transformation T is:
    • T(x) = 12x
    • Conversely, to convert inches to feet, divide by 12: T(x) = x/12
  • What makes a transformation linear
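    • In statistics, a linear transformation multiplies each value by one constant and optionally adds a second: Y = bX + A. It is called linear because plotting Y against X gives a straight line. When A = 0, as in the feet/inches conversions, the transformation also preserves addition and scalar multiplication (T(aX + bY) = aT(X) + bT(Y)).
    • Unit conversions that only multiply by a constant, such as feet to inches, preserve the ratios between measurements.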
  • Significance in statistics
    • Linear transformations can be used to rescale, standardize, or normalize data without changing the underlying relationships between variables.
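
A minimal sketch of a linear transformation applied to data, using the feet-to-inches conversion from above. The general form y = b·x + a with an optional additive constant is a common convention assumed here; the notes themselves only show the multiply-by-12 case, and the heights are invented.

```python
from statistics import mean


def linear_transform(x, b, a=0.0):
    """Apply y = b * x + a to a single value."""
    return b * x + a


heights_ft = [5.0, 5.5, 6.0, 6.25]   # hypothetical heights in feet
heights_in = [linear_transform(h, b=12) for h in heights_ft]

print(heights_in)
# The mean transforms the same way as the individual values.
print(mean(heights_in), linear_transform(mean(heights_ft), b=12))
```

Because every value is rescaled by the same rule, the ordering of the observations and their relative spacing are unchanged; only the units differ.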