This chapter transitions from basic statistics to more advanced applications.
Prior chapters covered descriptive statistics and probability.
Moving from basic data description to statistical techniques for problem-solving.
In upcoming chapters, focus will shift to specific statistical methods and their real-world uses.
Describing data is fundamental but not enough to address complex statistical questions.
The goal is to use sample statistics to make inferences about larger population parameters.
The chapter introduces procedures for interpreting results of statistical tests.
Emphasis on how to apply statistical tests beyond just performing calculations.
8.1
Key Concepts:
Sampling Distributions: The distribution of sample statistics (e.g., means) from multiple samples taken from a population.
Sampling Error: Variability in sample statistics due to random differences in samples.
Standard Error of the Mean: The standard deviation of the sampling distribution; indicates the variability of sample means.
Study Example:
Duguid & Goncalo (2012) explored how perceived power affects self-reported height.
The high-power group (managers) overestimated their height by 0.66 inches on average; the low-power group (employees) underestimated theirs by 0.21 inches (a mean error of −0.21 inches).
The observed difference in means was 0.87 inches; the question is whether this difference is due to chance or a real effect.
Sampling Error:
Random samples will show some variability in sample statistics (like means).
Example: One sample may have more people who overestimate their height, leading to different sample means.
Illustration:
By simulating 10,000 samples, the spread of sample means shows sampling error. Most sample means will cluster around a central value, but some will fall below or above the average.
Standard error is calculated as the standard deviation of these sample means.
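A minimal R sketch of that simulation idea (the population mean, standard deviation, and sample size here are illustrative assumptions, not values from the study):
set.seed(1)
pop.mean <- 0; pop.sd <- 1                      # assumed population parameters
n <- 25                                         # assumed sample size
sample.means <- replicate(10000, mean(rnorm(n, pop.mean, pop.sd)))
hist(sample.means)                              # sample means cluster around the population mean
sd(sample.means)                                # empirical standard error of the mean
pop.sd / sqrt(n)                                # theoretical standard error, for comparison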
Application:
The concept of sampling distributions helps in hypothesis testing and predicting sample results without needing to collect thousands of samples.
8.2
Key Concepts from Section 8.2: Course Evaluations & Decision Making
1. Course Evaluations & Grades
Research Question: Do students’ expected grades influence how they rate a course?
Possible Explanations:
Course quality is independent of expected grades.
Poorly performing students rate the course poorly due to their struggles.
High-performing students rate the course highly because they gain more than just a grade.
Statistical Approach:
The correlation coefficient measures the relationship between expected grades and course ratings.
A sample of 50 courses shows an upward trend: better-expected grades correspond to better course ratings.
The key question: Is this pattern a real trend in the population or just a random fluctuation?
2. Sunk-Cost Fallacy Study
Definition: The tendency to continue investing in something just because of prior investments, even when it’s irrational.
Study by Strough et al. (2008):
Scenario: Watching a bad movie after paying for it vs. watching it for free.
Participants:
Younger adults (college students): Mean sunk-cost score = 1.39
Older adults (ages 58–91): Mean sunk-cost score = 0.75
Both groups had a standard deviation of 0.50
Possible Explanations:
The difference is due to random sampling variability.
The difference is real, meaning older adults are less likely to fall for the sunk-cost fallacy.
Statistical Approach:
Compares differences in means rather than relationships.
Uses sampling distributions to determine whether the observed difference is significant or due to chance.
3. Role of Sampling Distributions
Key Idea: Statistical tests rely on the concept of sampling distributions.
Sampling Distribution of a Statistic:
Describes the expected variation of a statistic (mean, correlation, etc.) across different samples.
Helps determine whether an observed value is likely under a given hypothesis.
Standard Error:
Measures how much sample statistics (e.g., means) vary across repeated samples.
Example: The standard error of means for 10 observations is 0.278, larger than the 0.125 for 50 observations, showing that smaller samples have greater variability.
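A quick consistency check: since the standard error of the mean is s/√n, both values imply an underlying standard deviation of roughly 0.88 (0.278 × √10 ≈ 0.88 and 0.125 × √50 ≈ 0.88); with s fixed, increasing the sample size from 10 to 50 cuts the standard error by more than half.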
4. Importance for Hypothesis Testing
The logic of hypothesis testing is the same across studies:
Compare observed data to what we expect under random chance.
Decide whether differences are meaningful or just noise.
Practical Application:
In both course evaluations and sunk-cost studies, the goal is to determine whether an observed pattern is significant.
This logic applies to all statistical tests, whether analyzing relationships (correlation) or differences (means).
Takeaway
Sampling distributions and standard errors help assess whether trends in data are real or due to chance.
Whether studying grades & evaluations or decision-making biases, hypothesis testing follows the same fundamental principles.
8.3
Key Concepts from Section 8.3: Hypothesis Testing
1. Understanding Variability in Sample Means
Behavior problem scores follow a normal distribution in the population: μ = 50, σ = 10.
Different samples will naturally have slightly different means due to chance.
Example: One sample may have 49.1, another 53.3, simply due to variability.
The key question: Are observed differences due to random chance or a real effect?
2. The Role of Sampling Distributions
A sampling distribution shows the expected variation in sample means.
If we repeatedly took samples of size n = 5, their means would form a distribution centered at μ = 50.
This allows us to determine how likely an extreme sample mean is under the assumption that H₀ is true.
3. Hypothesis Testing in Action
Case 1: Sample Mean = 56
Given the sampling distribution, the probability of getting a sample mean of 56 or higher is 0.094 (9.4%).
Since 9.4% is not very rare, we fail to reject H₀—this sample could still be from the normal population.
Case 2: Sample Mean = 62
Probability of getting a sample mean of 62 or higher is 0.0038 (0.38%).
Since this is a very rare event, we reject H₀—the sample likely comes from a population with μ > 50.
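A minimal R check of those two tail probabilities using the normal approximation to the sampling distribution of the mean (the chapter's values come from simulation, so they differ slightly from these):
se <- 10 / sqrt(5)                        # standard error of the mean for n = 5
1 - pnorm(56, mean = 50, sd = se)         # about 0.09  (compare 0.094 above)
1 - pnorm(62, mean = 50, sd = se)         # about 0.004 (compare 0.0038 above)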
4. The Steps of Hypothesis Testing
State the Research Hypothesis (H₁): Stressed children show more behavior problems than normal children (μ > 50).
State the Null Hypothesis (H₀): Stressed children do not differ from normal children (μ = 50).
Collect Data: Obtain a sample of stressed children and compute their mean score.
Determine the Sampling Distribution: Assume H₀ is true and examine the expected variation in sample means.
Compare the Sample Statistic to the Distribution: Calculate the probability of obtaining a sample mean as extreme as the one observed.
Make a Decision:
If the probability is very low (typically p < 0.05), reject H₀ (suggesting a real effect).
If the probability is not low, fail to reject H₀ (observed difference may be due to chance).
5. The Big Picture
Hypothesis testing follows the same logic regardless of complexity.
We always assess whether an observed result is extreme enough to doubt the assumption that it came from the null hypothesis population.
In practice, we use statistical formulas instead of manually simulating thousands of samples, but the underlying logic remains the same.
Takeaway
Hypothesis testing provides a structured way to determine whether observed differences are meaningful or just due to chance. By comparing sample data to what we’d expect under random variability, we can make informed conclusions about population differences.
8.4
8.4 The Null Hypothesis
1. The Role of the Null Hypothesis
The null hypothesis (H₀) is central to hypothesis testing, serving as the baseline assumption. It is often the opposite of what researchers aim to demonstrate. For example:
If we hypothesize that college students’ self-confidence scores differ from 100, H₀ states that their mean score is exactly 100.
If we hypothesize that two population means (μ₁ and μ₂) differ, H₀ states that they are equal (i.e., μ₁ - μ₂ = 0).
The term “null hypothesis” originates from the idea that it assumes no difference or no effect in the population.
2. Why Use the Null Hypothesis?
A. The Philosophical Argument (Fisher’s Perspective)
British statistician Sir Ronald Fisher emphasized that while we can’t prove something to be true, we can prove something to be false.
Example: Observing 3,000 cows with only one head does not prove that all cows have one head, but finding a single cow with two heads disproves that statement.
This idea is similar to the legal principle: “Innocent until proven guilty.”
The null hypothesis remains dominant in statistics despite debates on Fisher’s stance.
B. The Practical Reason
H₀ provides a starting point for statistical tests.
If we hypothesize that college students’ self-confidence is greater than 100, we don’t have a specific alternative hypothesis (e.g., μ = 101, 112, or 113?).
However, assuming H₀: μ = 100, we can create a sampling distribution for μ = 100 and test whether the observed data deviate significantly from it.
3. Fisher’s Contributions to Statistics
Sir Ronald Aylmer Fisher (1890–1962) made groundbreaking contributions, rivaling Karl Pearson. His influence includes:
Developing analysis of variance (ANOVA), a cornerstone of modern statistics.
Introducing the theory of maximum likelihood, foundational to many statistical models.
Formulating the concept of the null hypothesis, famously stating that “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”
Despite his immense contributions, Fisher had ongoing disputes with Jerzy Neyman and Egon Pearson (Karl Pearson’s son), who proposed an alternative hypothesis testing framework. Modern statistical testing is an amalgamation of their conflicting approaches.
Fisher’s impact on statistics remains unparalleled, shaping hypothesis testing as we use it today.
8.5
8.5 Test Statistics and Their Sampling Distributions
Sample statistics describe samples (e.g., mean, median, variance, correlation).
Test statistics (e.g., t, F, χ²) are used in hypothesis testing and have their own sampling distributions.
t-test:
Used to determine if two population means are equal (H₀: μ₁ = μ₂).
Sampling distribution of t can be constructed by drawing infinite sample pairs from the same population.
If observed t value is unlikely under H₀, we reject the null hypothesis.
Other test statistics (F, χ²) follow the same logic with different calculation methods.
Instead of infinite sampling, statistical techniques approximate these distributions.
Fisher (1935):
Null hypothesis is never proven, only disproven or retained.
Analogy to the legal system: “Innocent until proven guilty beyond a reasonable doubt.”
Understanding this concept is essential for hypothesis testing and significance tests.
8.6
8.6 Using the Normal Distribution to Test Hypotheses
Logic vs. Calculation: Understanding hypothesis testing does not require knowing the arithmetic yet.
Normal Distribution in Hypothesis Testing:
Used to test hypotheses about individual observations and sample statistics.
Here, focus is on individual observations for clarity.
Example: Finger-Tapping Test
Normal adults: Mean = 59 taps, SD = 7 taps.
Slower tapping rates indicate neurological issues.
Hypothetical case:
Grandfather scores 20 → Clearly abnormal.
Score of 52 → Slightly low, but not concerning.
Score of 45 → Should we assume a neurological issue?
Hypothesis Testing Approach:
H₀: The individual belongs to the population of healthy individuals.
If H₀ is true, we use mean = 59, SD = 7 to assess probability.
Compute the z-score to determine how extreme the score is: z = (45 − 59)/7 = −2.00.
From the standard normal table, P(z ≤ −2.00) = 0.0228.
If probability is very low, we reject H₀ and suspect a neurological issue.
If probability is not particularly low, we retain H₀ and assume health.
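The same calculation in R, as a brief sketch:
z <- (45 - 59) / 7                  # z = -2.00 for a score of 45
pnorm(z)                            # P(z <= -2.00) = 0.0228
pnorm(45, mean = 59, sd = 7)        # same probability computed directly from the raw score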
Key Takeaways:
Normal distribution provides a basis for testing hypotheses.
z-scores help determine how likely an observation is under H₀.
If probability is low, we reject H₀ (similar to legal concept of “beyond a reasonable doubt”).
8.7
Type I and Type II Errors: Key Points
Type I Error (False Positive):
Occurs when we reject the null hypothesis (H₀), even though it is actually true.
Symbol: α (alpha) — the probability of a Type I error.
Example: A medical test indicates someone has a disease when they are actually healthy.
Rejection Region: The threshold that determines when we reject H₀. If the result falls in this region, we reject H₀.
Probability of Type I Error: Determined by the size of the rejection region (e.g., α = 0.05).
Type II Error (False Negative):
Occurs when we fail to reject the null hypothesis (H₀), even though it is actually false.
Symbol: β (beta) — the probability of a Type II error.
Example: A medical test indicates someone is healthy when they are actually sick.
Decreasing α: Reduces the probability of Type I errors but increases the probability of Type II errors.
Power: The probability of correctly rejecting H₀ when it is false. Power = 1 − β.
Critical Value: The score that separates the rejection region from the rest of the distribution (e.g., a z-score of -1.645 for a 5% rejection region).
Hypothetical Example (Finger Tapping Test):
Null Hypothesis (H₀): The person is tapping at a normal speed (healthy).
Alternative Hypothesis (H₁): The person is tapping slower than normal (not healthy).
The critical value is 47.48, so if a person's score is below this, we reject H₀. However, there is still a 5% chance that a healthy person’s score will fall below this value (Type I error).
If H₁ is true, we could fail to detect the difference if the person’s score is not low enough to reject H₀ (Type II error).
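A short R sketch of the rejection region and the two error types; the "true" slowed-tapping mean of 50 used for the Type II/power calculation is an assumed value for illustration only:
crit <- qnorm(0.05, mean = 59, sd = 7)       # lower 5% cutoff, about 47.49
pnorm(crit, mean = 59, sd = 7)               # Type I error rate = 0.05 by construction
beta <- 1 - pnorm(crit, mean = 50, sd = 7)   # assumed impaired mean of 50: P(fail to reject H0)
power <- 1 - beta                            # probability of correctly rejecting H0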
Relationship between Type I and Type II Errors:
Decrease α: Less chance of a Type I error but higher chance of a Type II error.
Increase α: More chance of a Type I error but lower chance of a Type II error.
Power of a Test: The probability of correctly rejecting H₀ when it is false. It equals 1 − β.
Summary Table of Possible Outcomes
8.8
One-Tailed Tests:
Predict a specific direction for the outcome.
Rejection region is in only one tail (either low or high).
Risk of missing significant results in the opposite direction.
Two-Tailed Tests:
Open to outcomes in either direction.
Rejection region is in both tails of the distribution (e.g., 2.5% in each tail for α = 0.05).
More common, conservative approach.
Covers both high and low extreme outcomes.
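For a concrete comparison of the cutoffs at α = 0.05, a quick R sketch:
qnorm(0.95)                # one-tailed critical z:  1.645 (all 5% in one tail)
qnorm(c(0.025, 0.975))     # two-tailed critical z: -1.96 and +1.96 (2.5% in each tail)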
Choosing Between Tests:
Decision should be made before data collection.
One-tailed tests are more liberal (higher Type I error probability).
Two-tailed tests are more conservative (lower Type I error probability).
Type I and Type II Errors:
Type I error: Rejecting a true null hypothesis.
Type II error: Failing to reject a false null hypothesis.
Test choice impacts these error probabilities.
Jones and Tukey's Proposal:
Advocate for one-tailed tests without specifying direction beforehand.
Suggest focusing on the direction of the difference after data collection.
Shift Towards Confidence Intervals:
Emphasis is moving towards confidence intervals and effect sizes.
Provides more detailed insights than traditional hypothesis testing.
8.10
Study by Duguid and Goncalo (2012):
Compared reported heights for two groups: one told it would play managers, the other employees.
Manager group: Overestimated height by 0.66 inches.
Employee group: Underestimated height by 0.21 inches (mean error of −0.21 inches).
Difference in estimates: 0.87 inches.
Question: Did the roles assigned (but never played) influence height estimates?
Hypothesis Testing:
Three Alternatives:
Powerful roles lead to overestimations.
Powerful roles lead to underestimations.
No effect (fail to reject null hypothesis).
Choose two-tailed test: Difference could go either way.
Set significance level at α = 0.05 (5%).
Test involves t test for two independent groups (not discussed yet).
Data from Duguid and Goncalo:
Average overestimate: 0.225 inches.
Standard deviation of height differences: 0.88 inches.
Null hypothesis: Groups drawn from the same population (no effect of roles).
Sampling Study Setup:
Draw two samples (managers and employees) from population with mean difference of 0.225 inches.
Repeat 10,000 times to create distribution of mean differences under the null hypothesis.
Expected distribution is centered at 0.00, with most values between −0.5 and +0.5.
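A minimal R sketch of that resampling setup; the per-group sample size n is an assumption here (the group sizes aren't given in these notes), so the resulting p-value will not exactly match the 0.0003 reported below:
set.seed(42)
n <- 50                                    # assumed group size, for illustration
diffs <- replicate(10000, {
  managers  <- rnorm(n, mean = 0.225, sd = 0.88)   # both groups drawn from the same population
  employees <- rnorm(n, mean = 0.225, sd = 0.88)
  mean(managers) - mean(employees)
})
hist(diffs)                                # centered at 0.00, as described above
mean(abs(diffs) >= 0.87)                   # two-tailed p-value for the observed 0.87-inch difference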
Findings:
Actual observed difference: 0.87 inches.
Probability of observing this difference under the null hypothesis: p = 0.0003.
Conclusion: Reject the null hypothesis (roles influenced height estimates).
Visualization:
Figure 8.7 shows distribution of mean differences under the null hypothesis.
Future Implications:
The resampling approach demonstrated here (via simulation) is an alternative to traditional t tests.
This method may replace the t test in the future, but the fundamental idea remains the same: resampling and looking at results.
8.11
Course Evaluations and Sunk Costs Example:
First Example: Relationship between students’ course evaluations and expected grade.
Use correlation coefficient to represent the relationship.
Null hypothesis: Population correlation is 0.00 (no relationship).
Question: What’s the probability of getting a correlation as large as the sample correlation with a sample size of 15 if the null hypothesis is true?
Second Example: Age differences in endorsing the sunk-costs fallacy.
Compare the mean sunk-cost fallacy scores between younger and older participants.
Null hypothesis: No difference between the means in the population.
Question: What’s the probability of obtaining a difference in means as large as 0.64 (observed sample difference) if the null hypothesis is true?
Key Concepts:
For both examples, the basic approach is:
Set up the null hypothesis (no relationship or no difference).
Calculate the probability of obtaining the observed result if the null hypothesis is true.
Summary
Purpose of the Chapter:
Examine hypothesis testing theory without focusing on specific calculations.
Hypothesis testing is crucial for analyzing data in experiments.
Key Concepts:
Sampling Distribution:
A distribution of a statistic computed from an infinite number of samples from one population.
Tells us what values are reasonable for the statistic under certain conditions.
Null Hypothesis:
Represents no difference or relationship between groups or variables.
Tested by comparing the statistic to the sampling distribution if the null hypothesis is true.
Testing Hypotheses:
We can test simple hypotheses using knowledge of the normal distribution.
Errors:
Type I error: Rejecting the null hypothesis when it is true.
Type II error: Failing to reject a false null hypothesis.
One-tailed and Two-tailed Tests:
One-tailed: Rejects the null hypothesis when the result falls in the extreme of the designated tail.
Two-tailed: Rejects the null hypothesis when the result falls in the extreme of either tail.
Jones and Tukey’s Proposal:
Suggest ignoring the null hypothesis since it’s rarely exactly true.
Focus on determining which group has the larger mean.
Propose a one-tailed test without predefining the direction of the tail.
Correlation coefficient: a measure of the degree of relationship between two variables; it ranges from −1 to +1.
Standard error: represents the variability of those statistics from one sample to another.
Sampling error: the amount by which a sample statistic differs from the corresponding population parameter simply because of chance (not a mistake in measurement).
Standard error of the mean: the standard error is just the standard deviation of the sampling distribution of the mean.
Sampling distribution: tells us what values we might (or might not) expect to obtain for a particular statistic under a set of predefined conditions.
Conditional probability: the probability of one event occurring given that another event has occurred.
Sunk cost: cost that has already been incurred and cannot be recovered.
The standard error estimates how much sample means vary, based on the sample's standard deviation.
Sampling distributions help determine the likelihood of a sample statistic under given conditions; we use them to test hypotheses.
A sampling distribution is the distribution of a statistic from repeated random samples of a population.
The standard error is the standard deviation of a sampling distribution, representing how much sample statistics vary.
Variability due to chance: statistics obtained from samples naturally vary from one sample to another.
Fisher: we can never prove something to be true, but we can prove it to be false
Test statistics: statistics such as t, F, and χ² that are computed for specific statistical procedures; each has its own sampling distribution.
Rejecting H₀: one common convention rejects H₀ when the probability of the result under H₀ is less than or equal to .05; a stricter convention requires a probability less than or equal to .01.
Introduction:
The chapter focuses on understanding the relationship between two variables.
The concepts explored include plotting data, covariance, correlation coefficient, and statistical tests.
Covariance and Correlation Coefficient:
Covariance is used to measure the relationship between two variables numerically.
The correlation coefficient (r) is a better measure than covariance because it standardizes the relationship.
Rank-Based Data:
When data is in the form of ranks, correlation coefficients still work similarly.
Factors Affecting Correlation:
There are factors that influence correlation coefficients, which will be explored.
Statistical Test:
A statistical test is used to determine if the correlation is significantly different from 0, implying a true relationship between variables.
Different Correlation Coefficients:
Various correlation coefficients exist, and software can be used to compute them.
Research Examples:
Studies might look at relationships between two variables, such as:
Breast cancer incidence and sunlight exposure.
Life expectancy and alcohol consumption.
Likability and physical attractiveness.
Hoarding behavior and deprivation in hamsters.
Performance accuracy and response speed.
Life span and health expenditure in countries.
Correlation:
The relationship between two variables is called correlation.
The most common measure of correlation is the Pearson product-moment correlation coefficient (r).
Scatter Diagrams (Scatter Plots):
Scatter plots are used to visualize the relationship between two variables.
Each point represents a subject’s scores on two variables (X and Y).
The X-axis represents the predictor (independent) variable, and the Y-axis represents the criterion (dependent) variable.
Scatterplots help illustrate relationships, correlations, and regression lines.
Examples of Scatter Plots:
Figure 9.1: Relationship between infant mortality and the number of physicians per 10,000 population.
Figure 9.2: Relationship between life expectancy and health care expenditures in 23 developed countries.
Figure 9.3: Relationship between breast cancer rates and solar radiation.
Regression Lines:
Regression lines represent the best prediction of Y from X.
The regression line helps clarify the relationship and shows predicted values (Y hat) for given values of X.
Correlation:
The degree to which points cluster around the regression line is related to the correlation (r).
Correlation ranges from -1 to +1, indicating the strength and direction of the relationship.
Example: In Figure 9.1, the correlation is .81 (strong positive relationship).
Example: In Figure 9.2, the correlation is .14 (weak relationship, not statistically significant).
Example: In Figure 9.3, the correlation is −.76 (strong negative relationship).
Correlation Interpretation:
Positive correlations indicate that both variables move in the same direction.
Negative correlations indicate that as one variable increases, the other decreases.
The sign of the correlation (positive or negative) indicates the direction of the relationship, but both directions can have the same strength.
Example (Wine Consumption and Heart Disease):
Figure 9.4: Data on wine consumption and heart disease death rates across countries.
The relationship is negative: higher wine consumption correlates with lower death rates from heart disease.
The correlation is −.78, showing a strong negative relationship.
Quadrants in Scatter Plots:
Data points can be divided into four quadrants based on their position relative to the mean values of X and Y.
The pattern of data points in these quadrants helps determine the type of relationship:
Negative relationship: Most points in “Above-Below” and “Below-Above” quadrants.
Positive relationship: Most points in “Above-Above” and “Below-Below” quadrants.
No relationship: Even distribution of points across quadrants.
Other Influencing Variables:
Other variables (like solar radiation) can influence results.
For example, solar radiation may impact the rates of coronary heart disease, similar to its impact on breast cancer rates.
Conclusion:
Scatter plots and correlation coefficients help visualize and quantify relationships between variables.
The relationship between variables can be influenced by other factors, so interpretations should consider these additional influences.
Summary (Point Form):
Example: Relationship between Pace of Life and Heart Disease:
Study by Levine (1990): Investigated the relationship between the pace of life and age-adjusted death rates from ischemic heart disease in 36 cities.
Pace of Life Measurement:
Measured through the time taken for a bank clerk to make change, the time to walk 60 feet, and the speed at which people speak.
Average of these three measures was used as the "pace."
Data Plot (Figure 9.5):
X-axis: Pace of life (faster-paced cities on the right).
Y-axis: Age-adjusted death rate from heart disease.
A positive correlation is observed: Faster pace of life tends to be associated with higher death rates from heart disease.
Key Observations:
Strong Positive Relationship: As the pace of life increases, death rates from heart disease also increase, and vice versa.
Linear Relationship: The best-fit line is straight, indicating a linear relationship between the two variables.
Group Comparison: The highest pace scores have nearly twice the death rate compared to the lowest pace scores.
Conclusion:
The relationship between pace of life and heart disease is positive and linear, though not as strong as other examples.
This pattern mirrors findings in many psychological and behavioral studies.
Covariance Definition:
Covariance reflects the degree to which two variables vary together.
Positive Covariance: High scores on one variable are paired with high scores on the other.
Zero Covariance: High scores on one variable are paired equally with both high and low scores on the other.
Negative Covariance: High scores on one variable are paired with low scores on the other.
Mathematical Definition:
Covariance is mathematically similar to variance, with the formula reflecting how two variables change together.
Formula for Covariance: covₓᵧ = Σ(X − X̄)(Y − Ȳ) / (N − 1); the equation involves the differences between individual scores and their means for both variables X and Y.
Covariance Relationships:
Maximum Positive Covariance: When X and Y are perfectly positively correlated (both variables increase or decrease together).
Maximum Negative Covariance: When X and Y are perfectly negatively correlated (one increases while the other decreases).
Zero Covariance: When X and Y are not related, the covariance will be zero.
Summary (Point Form):
Pearson Product-Moment Correlation Coefficient (r):
Measures the strength and direction of the linear relationship between two variables.
Covariance is scaled by the standard deviations of the two variables to calculate r.
Formula for r: r = covₓᵧ / (sₓ sᵧ)
Maximum value of r is ±1.00, with:
+1.00 indicating a perfect positive relationship.
−1.00 indicating a perfect negative relationship.
0.00 indicating no relationship.
Interpretation of r:
The correlation coefficient r gives the degree of relationship between the variables.
Does not imply that, for example, 36% of the relationship exists; rather, it's the degree of the relationship.
Further interpretation can be done using r² (the coefficient of determination), discussed later.
Calculating r:
The Pearson correlation formula is used to compute r based on covariance and standard deviations.
The result for the data will show the degree of correlation, and it is typically computed using software or calculators.
Karl Pearson:
Developed the Pearson correlation coefficient and chi-square statistic.
Influential in statistics and contributed to other statistical techniques, though had a noted rivalry with R. A. Fisher.
Founded the first department of statistics at University College London.
Correlations with Ranked Data:
When data is ranked instead of using raw scores, the correlation between two variables can be measured.
Example: Ranking the quality of applicants based on different criteria (e.g., clarity, specificity of statements).
Spearman’s Rank Correlation Coefficient (ρ):
Measures the relationship between two sets of ranked data.
Denoted as ρ (Spearman’s rho).
Simple and commonly used, but not the only coefficient for ranked data.
Spearman’s Formula vs Pearson’s Formula:
Spearman’s formula is derived from Pearson’s formula but applied to ranked data.
Both formulas give the same result when calculating correlations for ranked data.
Spearman’s formula adjusts for tied ranks, bringing it in line with Pearson's r.
Interpretation:
Spearman’s ρ is similar to Pearson’s r in terms of interpretation.
Measures the strength and direction of the relationship between two variables, but based on ranks instead of raw data.
Why Rank?
Trust in Scale: Doubt about data accuracy (e.g., number of friends, attractiveness).
Down-weight Extreme Scores: Reduces distortion from outliers (e.g., Down syndrome incidence by mother's age).
Spearman’s ρ Interpretation:
Linear vs Monotonic: Spearman's ρ measures the strength of a monotonic relationship (a consistent rise or fall in rank order), not necessarily a strictly linear one.
Monotonic Relationship: Focuses on rank order, not exact values.
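A small R sketch showing that Spearman's ρ is just Pearson's r computed on the ranks (the x and y vectors here are made-up illustrative scores):
x <- c(12, 5, 28, 3, 40, 19, 7, 22)       # hypothetical raw scores
y <- c(15, 8, 30, 2, 36, 21, 10, 25)
cor(x, y, method = "spearman")            # Spearman's rho
cor(rank(x), rank(y))                     # Pearson's r on the ranks (same value)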
Factors Affecting Correlation:
Restriction of Range: Limited variability weakens correlation.
Nonlinearity: Nonlinear relationships aren't well captured by Pearson’s r.
Heterogeneous Subsamples: Mixed groups distort correlation.
The Effect of Range Restrictions and Nonlinearity
Range Restrictions: Limiting the range of X and Y can alter correlation.
Positive Effect: Range restrictions increase correlation when curvilinear relationships are eliminated.
Example: Restricting age in a study on reading ability can lead to higher correlations by removing nonlinear data points.
Curvilinear Relationships: Range restrictions may reduce the correlation if the data are curvilinear. Example: health care expenditures vs life expectancy shows a curvilinear relationship at higher expenditure levels.
Practical Example: Test scores and college GPAs — colleges select higher scoring students, which restricts the range and can reduce correlation.
The Effect of Heterogeneous Subsamples
Mixed Groups: Combining data from different groups can distort correlation.
Example: Height and weight correlation for males and females may appear strong when combined, but the correlation is weaker when data is separated by gender.
Cardiovascular Disease and Cholesterol: Combining male and female data may obscure trends. When separated, men show a stronger relationship between cholesterol and cardiovascular disease than women.
Beware Extreme Observations
Outliers: Extreme observations can significantly affect correlation.
Example: A dataset on tobacco and alcohol expenditures in Great Britain shows a low correlation of 0.224 due to an outlier from Northern Ireland.
Impact of Outliers: Removing Northern Ireland from the dataset leads to a stronger correlation of 0.784, highlighting the effect of extreme values on correlation.
9.8 Correlation and Causation
Correlation Does Not Imply Causation
Just because two variables are correlated doesn't mean one caused the other.
Common example: More physicians may correlate with higher infant mortality, but more doctors don't cause more deaths.
Seven Possible Reasons for a Correlation:
Causal Relationship: One variable causes the other (e.g., sunlight and vitamin D production linked to reduced breast cancer risk).
Reverse Causality: The response variable may actually cause the predictor variable (e.g., good relationships may lead to happiness).
Partial Causality: One variable may be a necessary cause, but the effect also depends on other factors (e.g., wealth leading to happiness only if conditions like family support exist).
Confounding (Third) Variable: A third variable could cause or influence both correlated variables (e.g., wine consumption and solar radiation both reducing heart disease).
Third Causal Variable: Both variables are affected by another external factor (e.g., family stability and physical illness being affected by external stressors).
Time-based Changes: Two variables might change together over time (e.g., divorce rate and drug offenses increasing together).
Coincidence: Two events may occur together by chance (e.g., marriage issues coinciding with a new family member's arrival, not necessarily causing each other).
Important Considerations in Causation:
Time Order: For A to cause B, A must occur before or at the same time as B.
Ruling Out Other Variables: The correlation should hold true even when controlling for other factors.
Randomization and Random Assignment: Randomly splitting people into groups and observing behavior helps strengthen the argument for causation.
Explanation: A reasonable explanation for the observed correlation is crucial to argue for causation. If no explanation exists, the relationship should be considered for further exploration.
9.9 If Something Looks Too Good to Be True, Perhaps It Is
Not All Statistical Results Are Meaningful:
Statistical findings, particularly correlations, might not always be as straightforward as they seem.
It's important to be cautious when interpreting correlations and regression results, as they may not reflect true relationships.
Example of Infant Mortality and Number of Physicians:
A positive correlation was found between the number of physicians per 10,000 population and the infant mortality rate, suggesting that as the number of physicians increases, infant mortality also increases.
This result is highly unlikely to be causal, as no one seriously believes that physicians cause infant deaths.
Considerations About the Data:
Sample Selection:
The data come from developed countries with high levels of healthcare, meaning variability in infant mortality and physicians isn't as large as it might be in a broader global sample.
Data Selection:
The data were selectively chosen, and the relationship might not be as dramatic if a broader set of countries were included.
Surprising Findings:
The relationship stood out because it was unexpected; researchers may find interesting results by looking for correlations in large datasets, even if there's no true connection.
Possible (But Unlikely) Explanations:
Reporting Problem:
More physicians could lead to more deaths being reported, which would inflate the correlation. However, this is unlikely in developed countries with established reporting practices.
Reverse Causality:
High infant mortality could attract more physicians, implying that the direction of causality might be reversed.
Population Density:
High population density, often linked to higher infant mortality, might also attract more physicians, creating a spurious correlation.
Weakening the Relationship:
The correlation is weaker when focusing on pediatricians and obstetricians, who have a more direct impact on infant mortality, but it remains positive, suggesting the relationship isn't purely causal.
9.10 Testing the Significance of a Correlation Coefficient
Correlation and Sampling Error:
A sample correlation coefficient may not exactly reflect the true correlation in the population.
For example, random data might produce a correlation of 0.278, but there’s no real correlation because the data are just random numbers.
Correlation coefficients are subject to sampling error and can vary from the true population correlation (often denoted as ρ).
Hypothesis Testing for Correlations:
The null hypothesis (H₀) assumes the true population correlation (ρ) is 0.
If we can reject H₀, it suggests that the variables are truly correlated in the population.
If we can't reject H₀, we lack sufficient evidence to conclude there's a relationship between the variables.
Significance of Correlation:
To determine if the correlation is statistically significant, we use hypothesis testing and a t-statistic.
The formula for the t-statistic is:
t = r √(N − 2) / √(1 − r²)
where:
r is the sample correlation coefficient
N is the number of pairs of data
This statistic is evaluated with degrees of freedom (df) = N − 2
Decision Rule:
Calculate the t value and use a probability calculator (like VassarStats) to find the p-value.
If the p-value is less than 0.05, we reject H₀ and conclude the correlation is statistically significant.
If p-value > 0.05, we fail to reject H₀ and conclude there's insufficient evidence to claim a correlation.
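A minimal R sketch of this test; the grade and rating vectors are hypothetical stand-ins for the 15 expected-grade and course-rating pairs:
grade  <- c(3.5, 3.0, 3.8, 2.9, 3.2, 3.6, 2.7, 3.9, 3.1, 3.4, 2.8, 3.7, 3.3, 3.0, 3.5)
rating <- c(3.8, 3.1, 4.0, 2.8, 3.0, 3.9, 2.9, 4.2, 3.2, 3.3, 3.0, 3.8, 3.4, 3.1, 3.6)
r <- cor(grade, rating)
N <- length(grade)
t <- r * sqrt(N - 2) / sqrt(1 - r^2)      # t statistic with df = N - 2
2 * pt(-abs(t), df = N - 2)               # two-tailed p-value
cor.test(grade, rating)                   # the same test done directly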
9.11 Confidence Intervals on Correlation Coefficients
Significance Tests vs. Confidence Intervals:
Significance test tells if correlation is different from 0.
Confidence interval estimates the range of true correlation.
Interpretation of Confidence Interval:
A 95% confidence interval (e.g., 0.042 to 0.619) means that if we constructed such intervals repeatedly, about 950 out of 1,000 of them would contain the true correlation.
Understanding Confidence Interval:
You are 95% confident the true correlation lies within the interval.
Strictly, the interval either contains the true correlation or it doesn’t, but this is practically understood as a probability.
Alternative Phrasing:
"The probability is 95% that the interval (0.042 − 0.619) includes ρ."
Widely accepted and practical, though not strictly correct.
9.12
Intercorrelation Matrices
Purpose: Shows pairwise correlations between multiple variables.
Example: Figure 9.12 shows correlations between educational expenditures, pupil-teacher ratio, salary, SAT, ACT scores, and participation in SAT/ACT across 50 states.
Table Format:
Rows and columns represent variables.
Each cell shows correlation between the variable on the row and the variable on the column.
Significant correlations marked with asterisks (* p < .05, ** p < .01, *** p < .001).
Scatterplot Matrix: Visualizes scatterplots between each pair of variables in a matrix format.
Interpretation:
Example anomaly: A negative correlation between SAT scores and expenditures appears to suggest that more spending leads to worse performance; further exploration is needed to understand this result.
Software:
SPSS: Provides intercorrelation matrix with exact p-values.
R: Requires more work to display similar output.
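A brief R sketch of producing such a matrix; df and its column names are a hypothetical data frame standing in for the state-level variables:
# df: hypothetical data frame with one column per variable (e.g., expend, ptratio, salary, satv, pctsat)
round(cor(df, use = "pairwise.complete.obs"), 2)   # pairwise correlation matrix
cor.test(df$expend, df$satv)                       # p-value for any single pair of interest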
Other Correlation Coefficients
Pearson’s r: Standard for continuous variables, measures linear relationship.
Spearman’s ρ: Used for ranked data; computes correlation based on ranks.
Point Biserial Correlation (r_pb):
Used when one variable is continuous and the other is dichotomous (two levels).
Example: Correlating test scores with "right/wrong" answers.
Calculation: Use Pearson’s formula; the result is simply labeled a point biserial correlation.
Interpretation remains the same as for Pearson’s r, but with one of the variables dichotomous.
Phi (ϕ) Coefficient:
Used when both variables are dichotomous.
Example: Correlating gender (coded 0/1) with church attendance (coded 0/1).
Calculated like Pearson’s r but called Phi for dichotomous variables.
Correlation Coefficients Overview:
All coefficients discussed (Pearson’s r, point biserial, phi, Spearman's ρ) are special cases of Pearson's r.
Most of them can be calculated using Pearson’s r formula, with appropriate interpretation based on variable types.
Table Summary:
Pearson’s r: for continuous data.
Point Biserial: for one continuous, one dichotomous variable.
Phi: for two dichotomous variables.
Spearman's ρ: for ranked data.
Using SPSS to Obtain Correlation Coefficients
SPSS Output Example:
Data shows the relationship between Pace of Life and Heart Disease.
Descriptive Statistics:
Pace: Mean = 22.8422, Std. Dev. = 3.01462, N = 36
Heart: Mean = 19.8056, Std. Dev. = 5.21437, N = 36
Correlations Table:
Pace and Heart:
Pearson correlation: 0.365
Significance (2-tailed): 0.029
Since 0.029 < 0.05, reject the null hypothesis that there is no correlation between the two variables.
Interpretation:
The correlation between pace of life and heart disease is significant at the 0.05 level.
The SPSS output provides both the correlation coefficient and the p-value for testing statistical significance.
9.15 r² and the Magnitude of an Effect
Effect Size Concept:
Effect sizes measure the strength of relationships.
One common effect size is r² (the square of the correlation coefficient).
Interpreting r²:
r² represents the percentage of variation in the dependent variable that can be explained by the independent variable.
Example: For a correlation (r) of 0.365, r² ≈ 0.133 (or 13%).
This means about 13% of the variation in heart disease across cities is explained by differences in pace of life.
Practical Meaning:
Although 13% may seem small, it is often considered respectable in many research contexts.
Ideal (but rarely achieved) is an r² near 1.00 (or 100%), meaning almost all variation is explained.
Importance:
r² gives an idea of how “important” or “influential” the predictor variable is.
9.17 A Review: Does Rated Course Quality Relate to Expected Grade?
Example Data:
Data from 50 courses, showing the relationship between Expected Grade (X) and Overall Quality Rating (Y).
Only the first 15 cases are shown for space-saving.
Steps to Calculate Correlation:
Calculate Mean & Standard Deviation for each variable (X and Y).
Covariance Calculation between Expected Grade and Overall Quality.
Pearson’s Correlation between the two variables.
Results:
Moderate Positive Correlation: Courses with higher mean grades tend to have higher ratings.
Significance: The correlation is statistically significant.
Interpretation:
The correlation doesn’t imply that higher grades cause higher ratings.
It’s possible that advanced courses, which often get higher ratings, also tend to have better-performing students.
Writing up the Results of a Correlational Study
Research Hypothesis:
The hypothesis tested was whether course evaluations are related to the grades students expect to receive in a course.
Data Collection:
Data were collected from 50 courses at a large state university in the Northeast.
Students rated the overall quality of the course on a 5-point scale and reported their anticipated grade.
Statistical Results:
A Pearson correlation between the mean course rating and the mean anticipated grade was calculated.
The correlation coefficient was significant at p < .05, indicating a relationship between the two variables.
Conclusion:
The results suggest that higher anticipated grades are associated with higher course ratings.
Interpretation remains unclear: It might be that students with higher grade expectations rate their courses more favorably, or perhaps students perform better in courses they rate higher due to better quality teaching.
Summary
Correlation Coefficient:
A measure of the relationship between two variables.
Started with linear relationships using scatterplots.
Covariance:
Briefly introduced as a measure that increases as the relationship between variables increases.
Pearson Product-Moment Correlation:
Derived using covariance and standard deviations to define the correlation coefficient.
Also discussed Spearman’s rank correlation (for ranked data), point biserial, and phi correlations (for dichotomous variables).
Factors Affecting Correlation:
Range Restriction: Limits the variability of data.
Heterogeneous Samples: Combining different groups can skew results.
Outliers: Extreme data points can distort the correlation.
Correlation vs. Causation:
It's essential to distinguish between correlation and causation. Review Utts’ list of possible explanations for observed relationships.
Testing Correlation Significance:
We can test if a sample correlation is significant using tables or software. The probability (p-value) shows whether the sample correlation reflects a true relationship in the population. Later, we’ll discuss using the t-distribution for this test.
Study Overview:
Investigated how stress affects mental health in first-year college students.
Used a scale to measure:
Frequency, importance, and impact of negative life events → Stress score
Presence of psychological symptoms → Symptom score
More weight given to frequent and impactful events.
Key Findings:
Distributions (stem-and-leaf displays & boxplots) show:
Both stress and symptom scores are unimodal and slightly positively skewed.
High variability → Needed to show differences in symptoms based on stress levels.
Outliers present in both variables.
Handling Outliers:
Check data legitimacy – Ensure subjects didn’t report unrealistic events or symptoms.
Identify overlapping outliers – See if the same participants have extreme scores in both variables.
Create a scatterplot – Detect potential influence of outliers on correlations.
Run analyses with and without outliers – Check if results significantly change.
Conclusion:
No evidence that outliers significantly impacted correlation or regression results.
These preliminary steps ensure reliable analysis and greater confidence in results.
Dataset Overview:
107 participants measured for stress (negative life events) and symptoms (psychological distress).
Descriptive statistics:
Mean Stress Score: 21.467
Mean Symptom Score: 90.701
Standard Deviations: Stress = 13.096, Symptoms = 20.266
Covariance: 134.301
Key Findings:
Correlation coefficient (r) = 0.506 → Indicates a strong positive relationship between stress and symptoms.
Statistical significance: p = .00000003 → Extremely unlikely that this correlation occurred by chance.
Degrees of freedom (df) = 105 (since df = N – 2).
Null hypothesis rejected, confirming a significant relationship between stress and mental health symptoms.
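Those figures can be reproduced directly from the descriptive statistics above; a short R check:
r <- 134.301 / (13.096 * 20.266)     # r = covariance / (sd_X * sd_Y) = 0.506
N <- 107
t <- r * sqrt(N - 2) / sqrt(1 - r^2) # about 6.0, with df = N - 2 = 105
2 * pt(-abs(t), df = N - 2)          # about 3e-08, matching p = .00000003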
Important Notes:
Correlation ≠ Causation – Stress may contribute to symptoms, but other factors could be involved.
Two-tailed test used, meaning the critical rejection region was divided across both extremes of the distribution.
Key Concepts:
Regression analysis helps predict Symptoms (Y) based on Stress (X) using a straight-line equation.
Scatterplot & Regression Line:
The scatterplot shows Symptoms increasing linearly with Stress.
The best-fitting regression line is superimposed on the scatterplot.
Regression Equation:
Ŷ = 73.891 + 0.7831X
Intercept (a = 73.891): Predicted symptom score when Stress = 0.
Slope (b = 0.7831): For each 1-point increase in Stress, Symptoms increase by 0.7831 points on average.
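The slope and intercept follow directly from the descriptive statistics reported earlier (covariance 134.301, stress SD 13.096, means 21.467 and 90.701); a quick R check:
b <- 134.301 / 13.096^2      # slope = cov(X, Y) / var(X) = 0.783
a <- 90.701 - b * 21.467     # intercept = mean(Y) - b * mean(X) = 73.89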
Key Observations:
Consistent Scatter: The spread of data around the regression line is relatively even across all stress levels.
Least Squares Regression: The best-fit line minimizes squared prediction errors.
Interpreting the Relationship:
The regression confirms the strong positive correlation (r = 0.506) found earlier.
While stress predicts symptoms, correlation does not imply causation.
Final Takeaway:
Higher stress levels are associated with higher psychological symptoms, but other factors may contribute.
Regression helps quantify how much symptoms change as stress increases.
Standardizing the Data:
Z-scores convert data to a mean of 0 and a standard deviation of 1 without changing relationships.
The standardized regression coefficient (β) represents how much Y (Symptoms) changes in standard deviation units for each 1 SD increase in X (Stress).
Key Fact: In simple regression, β = r (correlation coefficient) → Here, β = 0.506.
A 1 SD increase in Stress is linked to a 0.506 SD increase in Symptoms, showing a meaningful effect.
Regression to the Mean (Galton’s Observation):
Concept: Extreme scores (high or low) tend to be closer to the mean in later measurements.
Examples:
Test Performance: A student who scores very high or low on a test is likely to score closer to average next time.
Sports: "Rookie of the Year" may underperform in year two (sophomore slump).
Height: Very tall parents tend to have slightly shorter children; very short parents tend to have slightly taller children.
Practical Implications:
Statistical illusion: Improvements in low performers may occur without actual intervention (e.g., tutoring, training).
Random Assignment: Controls for regression to the mean when testing interventions.
Real-World Example: Gun laws and crime rates → Changes may be due to statistical trends, not actual policy effects.
Final Takeaway:
Standardizing data makes regression results easier to interpret.
Regression to the mean explains why extreme cases tend to shift toward average over time.
Key Point:
Fitting a regression line is just the beginning – the real question is how accurate the predictions are.
The goal isn’t just to draw a line, but to ensure that the line is a good fit to the data and provides reasonable predictions.
Prediction Without Knowledge of X:
Initially, when predicting Y (e.g., symptoms), we considered scenarios where we don’t know X (e.g., stress levels).
We need to assess whether predicting Y accurately without X is meaningful, and if using X (stress) really improves prediction accuracy.
Next Steps:
Focus on errors of prediction (how far predictions are from actual values) to evaluate the line’s usefulness.
Fit isn’t enough – we need to understand the accuracy of the regression model in making predictions.
Key Concept:
When predicting Y (symptoms) without knowing X (stress), the best prediction is the mean of all symptom scores.
The mean is the most accurate guess because it's closest to most of the data points.
Extreme predictions (like the smallest or largest score) will often be far off, while the mean provides a more reasonable estimate.
Error in Prediction:
The error of your prediction is the standard deviation of Y (symptoms).
This error is measured by how far the actual values deviate from the mean.
Formula for the standard deviation of Y:
sᵧ = √[ Σ(Yᵢ − Ȳ)² / (N − 1) ]
where:
Yᵢ = each individual symptom score,
Ȳ = the mean of all symptom scores,
N = the number of observations.
Variance of Y:
The variance (sᵧ²) is the average of the squared deviations from the mean, calculated as:
sᵧ² = Σ(Yᵢ − Ȳ)² / (N − 1)
Summary:
Predicting the mean minimizes error since it’s the closest estimate to most of the data.
The standard deviation and variance quantify how much individual data points deviate from the predicted mean.
Key Concept:
When predicting Y (symptoms) from X (stress), the best prediction is Ŷ(the value predicted from the regression equation).
The Standard Error of Estimate (sᵧₓ) measures how much actual Y values deviate from the predicted Ŷ values. It’s a measure of prediction error, showing how accurate our regression model is.
Formula for the Standard Error of Estimate (sᵧₓ):
The standard error of estimate is calculated as:
sᵧₓ = √[ Σ(Yᵢ − Ŷᵢ)² / (N − 2) ]
where:
Yᵢ = actual values of symptoms,
Ŷᵢ = predicted values of symptoms,
N = number of pairs of data points.
Residuals and Error Variance:
The residuals are the differences between actual values (Yᵢ) and predicted values (Ŷᵢ).
Squaring the residuals and averaging them gives us the error variance (or residual variance), the variance of the prediction errors: sᵧₓ² = Σ(Yᵢ − Ŷᵢ)² / (N − 2).
Example Calculation (Table 10.3):
The table shows data for the first 10 subjects, including:
Stress (X) and Symptoms (Y),
Predicted Symptoms (Ŷ) based on the regression line,
Residuals (Y − Ŷ) for each subject.
The sum of residuals is 0 because for every overestimation, there's an equal underestimation.
Squaring and summing the residuals gives us the value used to calculate the standard error of estimate.
Descriptive Statistics for the Data Set:
Mean Stress: 21.467
Mean Symptoms: 90.701
Standard Deviations:
Stress = 13.096
Symptoms = 20.266
Covariance: 134.301
Summary:
Standard Error of Estimate quantifies how far predictions (Ŷ) are from actual values (Y).
The residuals and their squared sum are used to calculate this error measure.
A smaller sᵧₓ means better predictions, indicating that the regression line is a good fit for the data.
Key Concept:
The Standard Error of Estimate (sᵧₓ) quantifies how much actual values (Y) deviate from predicted values (Ŷ) on the regression line.
It represents the standard deviation of the errors (residuals) made when using the regression equation to predict Y.
Formula for the Standard Error of Estimate (sᵧₓ):
From the previous explanation, the standard error of estimate is calculated as:
sᵧₓ = √[ Σ(Yᵢ − Ŷᵢ)² / (N − 2) ]
Alternative Expression for Standard Error of Estimate:
After some algebraic manipulation, the standard error of estimate can also be expressed in terms of the correlation:
sᵧₓ = sᵧ √[ (1 − r²)(N − 1) / (N − 2) ]
Calculation of Standard Error:
Direct Calculation:
From the data, we compute the sum of squared residuals (deviations of Y from Ŷ) and divide by N − 2 to get the standard error.
The R output provides the "Residual standard error," which corresponds to this value.
Interpretation:
The standard error of estimate is 17.562, meaning the standard deviation of the errors (or residuals) around the regression line is 17.562.
This indicates that, on average, our predictions are off by about 17.562 points when predicting symptoms from stress using the regression equation.
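That 17.562 also checks out against the correlation-based expression given above, and it is the "Residual standard error" R reports from a fitted model; a brief sketch (symptoms and stress stand for the raw data vectors, which are not reproduced in these notes):
20.266 * sqrt((1 - 0.506^2) * (107 - 1) / (107 - 2))   # about 17.56, from the summary statistics
fit <- lm(symptoms ~ stress)                           # using the raw data vectors
summary(fit)$sigma                                     # residual standard error = s(Y.X)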
Final Takeaway:
Smaller standard error means more accurate predictions.
A standard error of 17.562 suggests that there is variability in the predictions, but the goal is to make this value as small as possible for better-fitting models.
Key Concept:
The squared correlation coefficient (r²) tells us how much of the variability in Y (the dependent variable) can be explained by variability in X (the independent variable). It gives us a percentage of the variation in Y that is predictable from X.
For example, if r = 0.506, then r² = 0.506² ≈ 0.256, meaning that 25.6% of the variability in Symptoms can be predicted by Stress.
Example – Stress and Symptoms:
In the case of stress predicting symptoms, we can say that about 25% of the differences in symptoms are related to differences in stress levels. This percentage represents how much variability in symptoms is predictable from stress.
Important: A 25% prediction of variability in symptoms is impressive in behavioral sciences.
Example – Narcissism Over Time:
If we look at the relationship between time and narcissism, r = −0.29 and r² = 0.0841 (about 8%), meaning only about 8% of the variability in narcissism can be explained by changes over time.
Since the correlation is negative, it suggests that narcissism is decreasingover time, although this correlation is not statistically significant.
Example – Cigarette Smoking and Life Expectancy:
Consider the relationship between smoking behavior (X) and age at death (Y):
Variability in smoking behavior → Affects variability in life expectancy (Ŷ), as smokers and nonsmokers will have different predicted life expectancies.
Error variability → The differences in life expectancy that cannot be explained by smoking behavior (e.g., random factors, genetics).
The relevant quantities are:
Variability in smoking behavior (the predictor).
Total variability in life expectancy (the outcome).
Predictable variability in life expectancy: the part tied to differences in smoking behavior.
Unpredictable error variability in life expectancy: the part that smoking behavior cannot explain.
The total variability in life expectancy is the sum of the predictable and error portions.
Interpreting r² (The Percentage of Predictable Variability):
If the correlation (r) between smoking and life expectancy were 0.80, then r² = 0.64, meaning 64% of the differences in life expectancy can be explained by smoking behavior.
A smaller correlation, such as r = 0.20, would result in r² = 0.04, meaning only 4% of life expectancy variability is related to smoking behavior.
Misleading Perspective on r²:
The r² value can sometimes be misleading because it may suggest that a factor like smoking accounts for only 4% of life expectancy variability. However, this still has significance since many other factors (e.g., accidents, disease) influence life expectancy.
A 4% contribution to life expectancy variability is important, especially when dealing with major life outcomes like age at death.
The Debate Between r and r²:
r (the correlation) can be just as meaningful as r², particularly when we consider standardized measures.
For instance, if r = 0.75, then a difference of 1 standard deviation in smoking would result in a 0.75 SD change in life expectancy.
Some researchers argue for using r over r² to avoid underestimating the effect size since squaring r reduces its value.
Caution on Interpretation of Relationships:
Correlation does not imply causation.
For instance, saying "Smoking accounts for 4% of life expectancy variability" doesn’t mean smoking causes life expectancy to change. There are other factors at play.
Example: Shoulder pain and weather—while you might observe a correlation between the two, it doesn't mean one causes the other.
Conclusion:
The squared correlation coefficient (r²) is a useful measure for understanding how much variability in one variable can be predicted by another. However, care should be taken when interpreting small values, as they can still represent meaningful effects in complex real-world outcomes.
Key Concept:
Extreme values (outliers) can significantly affect regression results, including both the slope and statistical significance of the regression model. Even if an outlier doesn’t seem unusual in terms of a single variable, it can still distort the overall analysis by skewing the regression line.
Example - Expenditure Data:
In Table 10.4, the relationship between alcohol and tobacco expenditures in Great Britain was analyzed, and an extreme data point from Northern Ireland was included.
With the outlier: The slope of the regression line was 0.302, and the p-value was 0.509 (not significant).
Without the outlier: The slope increased to 1.006, and the p-value dropped to 0.007 (significant).
Effect of Extreme Value:
Removing the Northern Ireland point changed the regression slope dramatically, from a small value of 0.302 (with the outlier included) to a much larger 1.006 (with it excluded).
This illustrates how one extreme data point can pull the regression line toward it, affecting both the slope and the significance of the relationship.
Visual Illustration (Figure 10.2):
The scatterplot in Figure 10.2 shows two regression lines:
(a) With Northern Ireland: The line is less steep and has a higher p-value.
(b) Without Northern Ireland: The line becomes steeper, and the p-value drops, showing a significant relationship between the variables.
Conclusion:
Extreme values can heavily influence regression results, so it's essential to identify and consider the impact of outliers. In this case, the extreme value from Northern Ireland caused a drastic shift in the regression slope, making the relationship between tobacco and alcohol expenditures appear much stronger once the outlier was removed.
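A minimal R simulation (the numbers are made up, not the Table 10.4 expenditure data) showing how a single off-trend point can flatten the slope and inflate the p-value:

```r
set.seed(42)
x <- c(3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0)
y <- 1.0 * x + rnorm(10, sd = 0.6)          # true slope near 1
summary(lm(y ~ x))$coefficients             # slope close to 1, small p-value

x_out <- c(x, 4.0)                          # one extreme, off-trend point
y_out <- c(y, 12.0)                         # (plays the role Northern Ireland plays in the text)
summary(lm(y_out ~ x_out))$coefficients     # slope shrinks, p-value rises
```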
Key Concept:
In regression, we test both the correlation coefficient (r) and the slope (b) for significance.
Testing the significance of the slope is critical because it indicates whether there is a significant relationship between the predictor variable (X) and the outcome (Y).
Testing the Slope (b):
Simplified Approach:
If the correlation coefficient r is significant, then the slope (b) is also significant. This is because for simple linear regression with only one predictor, a significant correlation implies a significant relationship between the variables, and therefore the slope will also be non-zero.
So, if r is significant, you don’t need a separate test for the slope — it's already implied that the slope is non-zero.
Alternative Approach (Using t-test for the Slope):
You can also test the slope using the t-test, which compares the observed value of b to a test value (typically 0) to determine if the slope is significantly different from zero.
The formula for the t statistic is t = b / SE(b), where SE(b) is the standard error of the slope; in simple regression this test has N − 2 degrees of freedom.
The t statistic is compared to a critical value from the t distribution table (or evaluated directly by software). If the calculated t exceeds the critical value, the null hypothesis (that the slope is zero) is rejected, indicating a significant relationship between X and Y.
Understanding the Output in Regression Analysis:
In the regression output (e.g., SPSS or R printout), you’ll see:
Slope (b): The estimated change in Y for a one-unit change in X.
t: The t-statistic for testing whether the slope is significantly different from zero.
Sig: The p-value associated with the t-statistic. If p < 0.05, you reject the null hypothesis and conclude that the slope is significantly different from zero.
Example Output Interpretation:
t = 9.84 (for the slope)
Sig = 0.000 (indicating that the slope is significant at the 0.05 level).
Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis and conclude that stress significantly predicts symptoms.
Formula Recap:
t-test for the Slope: t = b / SE(b)
Decision Rule:
If p < 0.05, reject the null hypothesis H₀: b = 0, meaning the slope is significant.
If p ≥ 0.05, fail to reject H0, meaning the slope is not significant.
Summary:
If the correlation coefficient r is significant, then the slope (b) is also significant in simple regression.
Alternatively, you can use a t-test for the slope to test whether the slope is significantly different from zero.
The p-value from the t-test helps determine if the slope is significantly different from zero, thus indicating a meaningful relationship between X and Y.
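The following R sketch uses simulated data (the variable names Stress and Symptoms and all numbers here are hypothetical, not the study's data) to show where the slope, its t value, and its p-value appear in standard regression output, and that with one predictor the slope test agrees with the correlation test:

```r
set.seed(7)
Stress   <- rnorm(107, mean = 21, sd = 12)           # hypothetical predictor scores
Symptoms <- 73 + 0.8 * Stress + rnorm(107, sd = 17)  # hypothetical outcome scores

fit <- lm(Symptoms ~ Stress)
summary(fit)$coefficients      # columns: Estimate (b), Std. Error, t value, Pr(>|t|)
cor.test(Stress, Symptoms)     # same t and p: testing r is equivalent to testing b here
```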
Key Concept:
SPSS provides a detailed printout of the regression analysis, which includes the regression equation, significance tests, and various related statistics such as the slope, intercept, standard error of estimate, and ANOVA results.
Steps for Performing Regression in SPSS:
Go to: Analyze > Regression > Linear
Select Stress as the independent variable and Symptoms as the dependent variable.
In the Statistics button, request confidence intervals and select the appropriate plots.
Review the output, which includes:
Descriptive statistics (mean, standard deviation, sample size).
Correlation matrix with the correlation coefficient (e.g., r = 0.506).
Standard error of estimate and ANOVA results.
Interpretation of SPSS Output:
Correlation Matrix:
The correlation coefficient (r = 0.506) is reported, confirming our calculation.
p-value: Since SPSS reports one-tailed significance, double the p-value for a two-tailed test. For r = 0.506, the reported p-value is .000 (i.e., p < .001), so the correlation is statistically significant.
ANOVA Section:
This section tests whether the correlation coefficient is significant.
Sig. = 0.000 shows a significant relationship between Stress and Symptoms.
Coefficients Section:
The slope (B) for Stress and the intercept are reported.
The t-test for the slope tests if the slope is different from zero, which in this case is significant.
The t-test on the intercept tests if the intercept is significantly different from zero. This test is usually not important unless the intercept has a meaningful interpretation in the context of the data.
Final Output Summary:
The slope (B) and intercept values are given, along with t-tests and their associated p-values.
If the p-value for the slope is less than 0.05, it indicates a significant relationship between the independent and dependent variables.
The ANOVA and p-value confirm that the regression model is significant.
Conclusion:
SPSS provides a comprehensive printout that allows you to easily interpret the regression results, including the relationship between variables, significance tests, and confidence intervals.
Writing up the Results (Point Form)
Study Overview:
Examined the relationship between stress and mental health in 107 college students.
Students completed checklists on recent negative life events (stress) and psychological symptoms.
Hypothesis:
Prediction: Increased stress would be associated with higher symptoms.
Key Results:
Correlation between stress and symptoms: r = 0.506, p < 0.0001 (significant).
Interpretation: Higher stress levels are linked to higher symptom levels.
Variance Explained: Stress accounts for 25% of the variation in symptoms.
Additional Details:
Intercept not reported, as it has no meaningful interpretation in this context.
Conclusion:
Stress significantly predicts psychological symptoms, explaining 25% of the variation in symptoms.
Key Concepts:
Applets help illustrate key regression principles, making it easier to understand how changes in slope and intercept affect the regression line and predictions for Y.
Interactive tools allow you to explore best-fitting lines, influence of extreme data points, and perform t-tests for slope significance.
Main Applets and Their Uses:
Predicting Y:
Interactive Sliders: Adjust X, slope, and intercept to see how Ŷ (predicted Y) changes.
Example: When X = 10, Ŷ = 5. Change the slope to make it steeper and observe the effect on Ŷ.
Finding the Best-Fitting Line:
Adjustable Line: Move the regression line vertically (intercept) or rotate it (slope) to find the best fit for the data.
Goal: Align the line to minimize error and achieve the optimal prediction.
Best Fit Example: For given data, optimal intercept = 10.9, slope = 0.42.
Effect of Extreme Data Points:
Manipulate Data Points: Remove or adjust points and see how the slope and intercept change.
Example: Remove a point (e.g., at (2.5, 26.5)) to observe a change in slope from -3.0 to -3.3.
Slope Testing (t-test):
SlopeTest Applet: Illustrates the t-test for testing if the true slope is 0 (no relationship).
Process: Generate 100 samples, calculate slopes and t-values, and observe how t values vary.
Result: The distribution of t values smooths out as more sets are generated, with most values falling within ±3.00.
Slope Calculation and Significance:
Applets for t-test: Use the slope, standard error, and degrees of freedom to calculate t and p-values.
Example (with Northern Ireland outlier):
Slope = 0.302, Standard Error = 0.439, t = 0.688, p = 0.509.
This indicates the slope is not significantly different from 0 (no relationship).
Summary of Key Actions:
Predict Y: Use sliders to adjust X and observe changes in Ŷ.
Find the best-fitting line: Adjust the intercept and slope to fit the data.
Manipulate extreme data points: Remove data points and see how the regression line changes.
Use t-test for slope: Calculate the significance of the slope using t and p-values.
Conclusion:
These applets offer a hands-on approach to understanding regression analysis, helping visualize how data points, slopes, and intercepts interact, and how t-tests are used to assess the significance of the regression slope.
Key Concept:
Regression analysis is used to predict the Overall Quality (Y) of a course based on the Expected Grade (X).
Data in Table 10.5:
First 15 cases of the data are shown (50 cases used in calculations).
The goal is to calculate the regression equation and interpret the coefficients.
Steps for Regression Calculations:
Calculate Means and Standard Deviations:
Find the mean and standard deviation for both Expected Grade (X) and Overall Quality (Y).
Calculate Covariance:
Covariance = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (N − 1)
Calculate the Slope (b):
b = Covariance / Variance of X
Calculate the Intercept (a):
a = Ȳ − b × X̄
Regression Equation:
The regression equation becomes:
Ŷ = a + bX
Example equation:
Ŷ = 1.7174 + 0.5257X
Interpretation of Coefficients:
Intercept (1.7174): If the expected grade is 0, the predicted course rating is 1.7174 (though not meaningful as a real-world scenario).
Slope (0.5257): For every 1-point increase in expected grade, the overall course rating is predicted to increase by 0.5257 points.
Example: A course with a C (2.0) expected grade would have a rating 0.5257 points lower than a course with a B (3.0) expected grade.
Important Notes:
The regression equation is based on association and does not imply causality (e.g., low expected grades do not necessarily cause low ratings).
Using SPSS:
To reproduce the results, use Analyze > Regression > Linear and select Expected Grade (X) as the independent variable and Overall Quality (Y) as the dependent variable.
Select Estimates, Confidence Intervals, and Model Fit under the Statistics button.
SPSS Output (Table 10.6):
The SPSS output would provide the slope, intercept, and statistical significance of the regression model.
Conclusion:
This example demonstrates how to predict course ratings based on expected grades and how to interpret the slope and intercept of the regression equation.
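A brief R sketch of these hand calculations; the grade and quality vectors below are hypothetical values for illustration, not the 50 courses in Table 10.5:

```r
grade   <- c(3.5, 3.2, 2.8, 3.9, 3.0, 3.6, 2.5, 3.3, 3.1, 3.8)  # hypothetical expected grades
quality <- c(3.4, 3.5, 3.0, 4.0, 3.2, 3.7, 2.9, 3.5, 3.3, 3.9)  # hypothetical overall quality ratings

b <- cov(grade, quality) / var(grade)   # slope = covariance / variance of X
a <- mean(quality) - b * mean(grade)    # intercept = Ȳ − b·X̄
c(intercept = a, slope = b)

coef(lm(quality ~ grade))               # lm() reproduces the same two values
```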
Key Concept:
Correlation and regression both describe the relationship between two variables, but they are used for different purposes and provide different types of information.
Why Use Both?
Correlation:
Single number that describes the strength and direction of a relationship between two variables.
Example: A correlation of 0.85 between a manual skills test and job performance tells you the strength of the relationship quickly.
Advantage: Quick and simple summary of how two variables are related.
Limitation: Does not tell you about the magnitude of change between the variables.
Regression:
Provides more detailed information about the relationship, including the magnitude of change.
Example: A regression coefficient might tell you that a 10% increase in contraception use is associated with a 9.7% drop in infant mortality.
Advantage: Provides specific prediction of the outcome variable based on the predictor.
Limitation: Requires more calculation and is not as simple as a correlation.
Key Differences Between Regression and Correlation:
Correlation measures the strength of the relationship between variables, whereas regression quantifies how much the dependent variable (Y) changes when the independent variable (X) changes.
In simple regression with one predictor, both correlation and regression give similar information. However, regression provides more detail about the magnitude of change in the outcome variable.
Multiple Regression:
When you have multiple predictors, correlation and regression behave differently:
High correlation between predictors and the outcome might be due to one or both predictors.
Regression coefficients allow you to separate the influence of each predictor.
In multiple regression, you can examine separate roles of each predictor variable, even if the overall correlation is high.
Conclusion:
Use correlation when you want a quick measure of the relationship strength.
Use regression when you want to understand how much one variable affects another or when you need to predict the outcome.
In multiple regression, regression coefficients help distinguish the influence of individual predictors.
10.11 Summary
Regression Definition:
Regression is the prediction of one variable (Y) from one or more other variables (X).
The regression line is a straight line that represents the relationship between X and Y.
The equation for the regression line is: Ŷ = a + bX
Ŷ: Predicted value of Y based on X.
b: Slope of the regression line, representing how much Y changes with each unit change in X.
a: Intercept, representing the predicted value of Y when X is 0 (often has little substantive meaning).
Regression to the Mean:
Refers to the phenomenon where extreme scores tend to be followed by scores closer to the mean.
This concept also applies to regression, where extreme values in one measurement tend to predict less extreme values in subsequent measurements.
Errors of Prediction (Residuals):
Residuals are the differences between the observed Y and the predicted Ŷ.
The regression line is drawn to minimize the squared residuals, which is why it is called least squares regression.
Standard Error of Estimate:
The standard error of estimate measures the standard deviation of residuals.
A larger standard error indicates larger prediction errors.
It’s compared to the normal standard deviation, which measures deviations from the mean.
Coefficient of Determination (r²):
r² represents the percentage of variability in Y that can be predicted from X.
Interpreting r² can be difficult without context, which is why r is sometimes used as an alternative measure of effect size.
Significance Testing:
For two variables, testing the significance of r is equivalent to testing the significance of the slope (b).
A t-test for b is often used in regression analysis and is included in computer printouts.
Simple vs. Multiple Regression:
Simple regression: One predictor variable (X).
Multiple regression: Multiple predictors used to predict a single criterion variable (Y), which will be covered in the next chapter.
Introduction to t Tests for One Sample
Overview of t tests:
Focus on t test for one sample.
Understand sample means by drawing samples from a population.
Null Hypothesis Testing:
Discuss testing a null hypothesis about a mean when population variance is unknown.
Introduce t tests as an alternative when the population variance is unknown.
t Test Calculation:
Learn how to calculate t values.
Confidence Intervals:
Use confidence intervals to estimate true population mean.
Effect Size:
Measure the meaningfulness of results using effect size.
Case Example:
Apply concepts to an example where confidence intervals provide more insight than a standard t test.
Context of Hypothesis Testing
Chapters 8-11:
Focus on hypothesis testing, correlations, and regression coefficients.
Current Chapter:
Focus on hypothesis testing about means (testing population mean).
Example:
Williamson’s (2008) thesis on children of depressed parents and behavior problems.
The Anxious/Depressed subscale of the YSR (Youth Self-Report Inventory) was used as the measure.
Population mean (μ) = 50 (known).
Hypothesis Structure
Null Hypothesis (H₀):
μ = 50 (Sample mean equals population mean).
Alternative Hypothesis (H₁):
μ ≠ 50 (Sample mean is different from population mean).
Two-Tailed Alternative:
Reject H₀ if μ > 50 or μ < 50.
Example Data
Sample of 5 children:
Scores: 48, 62, 53, 66, 51.
Mean:
56.0 (6 points above population mean).
Standard Deviation:
7.65.
Question:
Is 56 surprisingly large, or could it be due to sampling error?
Next Steps
Determine Range:
Determine what range of sample means is expected for a population of normal children.
Test Calculations:
Proceed to calculate t values, p values, and construct confidence intervals.
12.1 Sampling Distribution of the Mean
Sampling Distribution Overview:
Distribution of values for a statistic (here, sample means) drawn from the population.
If we drew an infinite number of samples from the population and calculated the statistic for each sample, the resulting distribution of that statistic would be the sampling distribution.
Central Limit Theorem (CLT):
Mean of Sampling Distribution:
Mean of the sampling distribution = μ (population mean).
Variance and Standard Deviation:
Variance of the sampling distribution = σ² / N,
Standard deviation (standard error) = σ / √N.
Shape of Distribution:
As N (sample size) increases, the shape of the sampling distribution approaches a normal distribution.
Key Concepts of CLT:
If the population is normal, the sampling distribution of the mean will also be normal, regardless of N.
If the population is symmetric and unimodal (but not normal), the sampling distribution of the mean will be nearly normal even for relatively small sample sizes.
If the population is highly skewed, sample sizes of 30 or more are required for the sampling distribution to approximate normal.
Illustrating CLT:
Example with samples of N = 5 children:
Population: Random numbers from a normal distribution with μ = 50 and σ = 7.65.
Draw 10,000 samples of size 5 from the population.
Sampling Distribution: Plotting the 10,000 sample means shows a nearly normal distribution.
Figure 12.2a:
10,000 samples of size 5 (N = 5).
Resulting sampling distribution is nearly normal, with mean and standard deviation close to the population mean and standard deviation.
Figure 12.2b:
10,000 samples of size 30 (N = 30).
Sampling distribution becomes more normal, with mean close to 50 and standard deviation reduced to nearly σ / √N.
Summary
The Central Limit Theorem is essential for understanding how sample means behave and how they approximate a normal distribution as the sample size increases.
Even if the population is not normal, the sampling distribution of the mean will become normal as N increases.
CLT gives us useful information on mean, variance, standard deviation, and shape of the sampling distribution.
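A small R simulation, in the spirit of Figure 12.2, illustrating these points (the population is assumed normal with μ = 50 and σ = 7.65, as in the chapter's example):

```r
set.seed(123)
mu <- 50; sigma <- 7.65

means_n5  <- replicate(10000, mean(rnorm(5,  mu, sigma)))
means_n30 <- replicate(10000, mean(rnorm(30, mu, sigma)))

c(mean(means_n5),  sd(means_n5),  sigma / sqrt(5))    # SD of the means ≈ σ/√5 ≈ 3.42
c(mean(means_n30), sd(means_n30), sigma / sqrt(30))   # larger N, smaller standard error
hist(means_n5, breaks = 50, main = "Sampling distribution of the mean, N = 5")
```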
12.2 Testing Hypotheses about Means When σ Is Known
Overview:
Testing hypotheses about population means often involves comparing sample statistics to known population parameters.
In some cases, the population variance (σ²) is known, which allows us to use a more straightforward approach to hypothesis testing.
Link to Previous Concepts:
In Chapter 8, we learned about hypothesis testing using a z-test for a single observation (e.g., finger-tapping score).
This logic can be extended to testing sample means, using a z-testto compare the sample mean to a population mean when the population standard deviation (σ) is known.
Example of Testing a Hypothesis for Means:
Scenario: Testing if children from depressed homes show more behavior problems than the general population.
The null hypothesis (H₀): The sample mean comes from a population with a mean of 50 (normal children).
The alternative hypothesis (H₁): The sample mean is different from 50 (stressed children show a different level of behavior problems).
Central Limit Theorem Application:
We know the population mean (μ) and standard deviation (σ), so we can use the Central Limit Theorem to construct the sampling distribution of the mean.
The sampling distribution of the mean for samples of size 5 will have:
Mean = 50
Standard Error (SE) = σ / √N
Z-score Formula: z = (M − μ) / SE, where:
M = sample mean
μ = population mean
SE = standard error of the mean
Example Calculation:
Sample: 5 children with mean score of 56.0 and sample standard deviation of 7.65.
Population: μ = 50, σ = 7.65 (known).
Standard Error: SE = σ / √N = 7.65 / √5 ≈ 3.42
Z-score: z = (M − μ) / SE ≈ 1.76
The z-value tells us how far the sample mean is from the population mean in terms of standard errors.
Significance Level:
Using standard normal distribution tables or software, we find the probability of obtaining a z-value of 1.76.
Two-tailed test: Multiply the probability by 2 to account for deviations in both directions.
Decision:
If the probability (p-value) is less than 0.05, reject the null hypothesis.
In this case, the p-value for z = 1.76 is 0.079 (p-value > 0.05).
Conclusion: We do not reject the null hypothesis. The sample mean of 56.0 is not significantly different from the population mean of 50. We do not have enough evidence to conclude that stressed children show more or fewer behavior problems.
Large Sample Example (Williamson, 2008):
Sample of 166 children: Sample mean = 55.71, sample standard deviation = 7.35.
The population mean for normal children = 50, and population standard deviation = 10.
Using similar steps, we calculate the z-value for the sample mean.
The result gives us a high z-value and a very small p-value, so we reject the null hypothesis. The children in the study have a significantly higher mean behavior problem score than the general population.
Why Use Population σ When Available?:
If the population standard deviation (σ) is known, it's better to use it instead of the sample standard deviation (s) for more accurate results.
Reason: The sample standard deviation (s) is only an estimate of σ and, especially in small samples, tends to underestimate it, so using the known population value removes that extra source of error.
Key Takeaways:
When σ is known, we can use the z-test for hypothesis testing about population means.
The standard error plays a crucial role in determining the z-score.
If sample size is large, the sampling distribution of the mean approximates normal, even if the population is not normal.
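A minimal R sketch of this z test, using the five scores listed earlier and treating σ = 7.65 as known:

```r
scores <- c(48, 62, 53, 66, 51)       # the five YSR scores from the example
mu <- 50; sigma <- 7.65               # population values treated as known
N <- length(scores)

se <- sigma / sqrt(N)                 # standard error ≈ 3.42
z  <- (mean(scores) - mu) / se        # ≈ 1.75–1.76, depending on rounding
p  <- 2 * pnorm(-abs(z))              # two-tailed p ≈ 0.079
c(z = z, p = p)                       # p > .05, so H0 is not rejected
```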
The Sampling Distribution of s² (Sample Variance)
Key Concept:
The t-test uses s² (sample variance) as an estimate of σ² (population variance). Therefore, understanding the sampling distribution of s² is important, especially with small sample sizes.
Unbiased Estimator:
s² is an unbiased estimator of σ². This means that with repeated sampling, the average value of s² will equal σ².
Unbiased estimator = The expected value of s² is equal to σ² across many samples.
Skewness of the Sampling Distribution:
The sampling distribution of s² is positively skewed, particularly for small sample sizes.
This means that s² is more likely to underestimate σ² than overestimate it, especially in small samples.
Figure 12.4 shows a computer-generated sampling distribution of s², with sample size N = 10 from a normally distributed population.
Impact of Skewness:
Because of this skewness, using s² directly as an estimate of σ²might lead to errors in statistical tests, such as the t-test.
Problem: A sample variance (s²) is more likely to underestimate the population variance (σ²) than to overestimate it.
Consequence for the t-test:
If we use s² to estimate σ² and apply it directly in the t-test, this could lead to an overestimation of the t-value.
This happens because the sample variance often underestimates the true population variance, making the resulting t-value larger than the z-value that would have been calculated if the true variance (σ²) were known.
Example:
If we substitute s² for σ² and calculate the statistic as (X̄ − μ) / (s / √N):
The result is more likely to be inflated because of the skewness of the sampling distribution of the sample variance.
In smaller samples, this inflation can produce larger t values, which can result in a higher chance of a Type I error (rejecting a true null hypothesis).
Conclusion:
While s² is an unbiased estimator, its positive skewness, especially for small sample sizes, affects the reliability of the t-test.
This skewness increases the chances of underestimating the population variance and consequently inflating the t-statistic.
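A short R simulation illustrating the skewness of the sampling distribution of s² (population values assumed here: normal with σ² = 1, samples of N = 10):

```r
set.seed(1)
s2 <- replicate(10000, var(rnorm(10)))   # sample variances, N = 10, true σ² = 1

mean(s2)        # ≈ 1: s² is unbiased on average
mean(s2 < 1)    # > 0.5: a single s² underestimates σ² more often than it overestimates
hist(s2, breaks = 60, main = "Sampling distribution of s², N = 10 (positively skewed)")
```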
The t Statistic
Purpose:
The t statistic is used when we need to account for the fact that we are using sample estimates of population parameters (like sample variance s² instead of population variance σ²).
Formula:
For the z statistic: z = (X̄ − μ) / (σ / √N)
For the t statistic (when σ is unknown): t = (X̄ − μ) / (s / √N)
Here, X̄ is the sample mean, μ is the population mean, and s is the sample standard deviation.
Why Use t Statistic?:
Since s in any particular sample is likely to be smaller than σ (the sample variance tends to underestimate the population variance), using s in the denominator tends to inflate the statistic, giving a larger value than we would have obtained with σ.
If we used the z statistic with s instead of σ, we would incorrectly treat the result as a z-score, leading to too many significant results (Type I errors).
For example, using z = 1.96 as the cutoff for significance at α = 0.05 is inappropriate because it leads to more than 5% Type I errors; with a very small sample (e.g., 4 degrees of freedom), the correct two-tailed cutoff for t is 2.776.
The Solution – Student's t Distribution:
William Gosset (aka "Student") provided the solution by showing that substituting s for σ results in a Student’s t distribution.
We use the t statistic to evaluate the sample mean relative to the population mean by comparing it to the t distribution (instead of the normal distribution used for the z-test).
Degrees of Freedom (df):
The t distribution depends on degrees of freedom (df), which is typically calculated as df = n - 1, where n is the sample size.
The t distribution shape changes with df. As the sample size increases (larger df), the distribution becomes closer to the normal distribution, and the t statistic approaches the z statistic.
Impact of Increasing df:
Small df (small sample sizes): The t distribution is wider and more spread out, reflecting the greater uncertainty in the sample variance estimate.
Large df (large sample sizes): As df increases, the distribution becomes more like the normal distribution, and the t statistic approaches the z statistic.
Example:
In Figure 12.5, we see the t distribution for different df values:
For 1 degree of freedom: The t distribution is wider and more spread out.
For 30 degrees of freedom: The t distribution is closer to the normal distribution.
For ∞ degrees of freedom: The t distribution is identical to the normal distribution (z distribution).
Conclusion:
The t statistic adjusts for the fact that we are estimating the population variance, making it more appropriate for small sample sizes.
The t distribution allows us to evaluate the significance of the t statistic and to determine the correct cutoff values based on the degrees of freedom.
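A quick check of this convergence in R, comparing the two-tailed .05 critical values of t at several degrees of freedom with the normal cutoff:

```r
df <- c(1, 4, 9, 16, 30, 100, 1000)
round(qt(0.975, df), 3)   # 12.706, 2.776, 2.262, 2.120, 2.042, 1.984, 1.962
qnorm(0.975)              # 1.960: the z cutoff that t approaches as df grows
```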
Degrees of Freedom (df)
Definition:
The degrees of freedom (df) refer to the number of independent values in a dataset that are free to vary when calculating a statistic.
For a one-sample t test, df = N - 1, where N is the sample size.
Why N - 1?:
When calculating the sample variance s², the deviations of the observations from the sample mean are considered.
Since the sum of the deviations from the mean must always equal 0, only N - 1 of the deviations are independent. The final deviation is determined based on the sum constraint.
Example: With five scores, a mean of 10, and four of the scores known (e.g., 18, 18, 16, and 2), the fifth score is forced to be −4 so that the scores sum to 50 and the mean equals 10. Only four scores are free to vary, so there are 4 degrees of freedom (N − 1).
Degrees of Freedom in t-Test:
The t statistic uses df = N - 1 because s² is calculated based on sample data, not the population variance σ².
The formula for the sample variance uses N - 1 in the denominator, and thus the t distribution also depends on N - 1 degrees of freedom.
Context of Degrees of Freedom:
Degrees of freedom are not always N - 1. In different contexts:
For two-sample tests, df might be N₁ + N₂ - 2.
For chi-square tests, df is calculated based on the number of categories or groups minus the number of parameters estimated.
In general, df reflects the number of values in a dataset that are free to vary after accounting for any constraints or parameters.
General Rule:
Whenever you estimate a parameter using sample statistics (like the sample mean), you lose degrees of freedom because one or more values are constrained based on the estimation.
Example of the Use of t: Do Children Always Say What They Feel?
Study Context:
Study by Compas et al. (1994): Investigated children from families with a parent diagnosed with cancer.
Hypothesis: Children under stress may mask anxiety by giving socially desirable answers, reflected in high scores on the "Lie Scale" of the Children's Manifest Anxiety Scale (CMAS).
Goal: Test if children’s Lie Scale scores are higher than the population mean of 3.87 for normal children.
Data:
Sample Size (N): 36 children
Sample Mean (X̄): 4.39
Sample Standard Deviation (s): 2.61
Population Mean (μ): 3.87 (reported by Reynolds & Richmond, 1978)
Null Hypothesis:
The null hypothesis (H₀) is that the children’s Lie Scale scores come from a population with a mean of 3.87 (i.e., the children’s scores are not unusually high).
The alternative hypothesis (H₁) is that the children’s scores differ from 3.87.
Test Type:
Two-tailed t-test at the 5% significance level.
t Statistic Formula:
Formula: t = (X̄ − μ) / (s / √N)
For this study:
X̄ = 4.39, μ = 3.87, s = 2.61, N = 36.
Calculation of t:
Standard Error of the Mean (SE): SE = s / √N = 2.61 / √36 = 0.435
t value: t = (4.39 − 3.87) / 0.435 ≈ 1.20
Critical t value:
For df = 35 (degrees of freedom = N - 1), the critical t value at the two-tailed .05 significance level is ±2.03 (from the t-distribution table).
Decision:
The calculated t value (1.20) is less than the critical t value (±2.03).
p-value: Using the statistical tool, the p-value for t = 1.20 with 35 df is approximately 0.238.
Since p > 0.05, we fail to reject the null hypothesis.
Conclusion:
There is insufficient evidence to suggest that the children in the study have unusually high scores on the Lie Scale compared to the population of normal children. This implies that the children might not be masking their anxiety, and we need to explore other reasons for their low anxiety scores.
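A small R sketch reproducing this test from the reported summary statistics:

```r
xbar <- 4.39; mu <- 3.87; s <- 2.61; N <- 36   # reported summary statistics

se <- s / sqrt(N)                              # 2.61 / 6 = 0.435
t  <- (xbar - mu) / se                         # ≈ 1.20
p  <- 2 * pt(-abs(t), df = N - 1)              # ≈ 0.24, so H0 is not rejected
c(t = t, p = p, t_crit = qt(0.975, N - 1))     # critical value ≈ ±2.03
```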
12.4 Factors Affecting the t Statistic and Decision About H0
Several factors influence the magnitude of the t statistic and the likelihood of rejecting H0:
Obtained difference (X̄ − μ): Larger differences between the sample mean and the hypothesized mean result in larger t values.
Sample variance (s²): Smaller variance leads to a larger t value.
Sample size (N): Larger sample sizes decrease the standard error, increasing the t value.
Significance level (α): Determines the size of the rejection region.
One- or two-tailed test: A one-tailed test puts all of α in one tail, making it easier to reject H₀ when the difference is in the predicted direction; a two-tailed test splits α between the two tails.
12.5 Example: The Moon Illusion
Kaufman and Rock (1962) investigated the "moon illusion," where the moon looks larger on the horizon than at its zenith. To test this, they asked subjects to adjust a variable "moon" to match the size of a standard moon in different positions. Data for 10 subjects were used to determine if there was a significant illusion.
Data for the moon illusion:
Ratios of the variable moon to the standard moon: 1.73, 1.06, 2.03, 1.40, 0.95, 1.13, 1.41, 1.73, 1.63, 1.56
N = 10
X̄ = 1.463
s = 0.341
Using a t test of H₀: μ = 1.00, we calculate t = (X̄ − 1.00) / (s / √N) = 0.463 / (0.341 / √10) ≈ 4.29.
For 9 degrees of freedom and α = 0.05, the critical value of t is ±2.262. Since 4.29 > 2.262, we reject H₀ and conclude that the true mean ratio is not equal to 1.00, confirming the moon illusion.
Two-tailed probability for t = 4.29: 0.002, indicating strong evidence against H₀.
If t had been -4.29, the result would still lead to the rejection of H0, as the t statistic magnitude, not its sign, determines significance.
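The same analysis can be run in R on the ratios listed above; t.test() also returns the confidence intervals discussed below:

```r
ratio <- c(1.73, 1.06, 2.03, 1.40, 0.95, 1.13, 1.41, 1.73, 1.63, 1.56)

t.test(ratio, mu = 1.00)                       # t ≈ 4.29, df = 9, p ≈ .002, 95% CI ≈ [1.22, 1.71]
t.test(ratio, mu = 1.00, conf.level = 0.99)    # 99% CI ≈ [1.11, 1.81]
```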
Researchers often focus only on statistical significance, ignoring practical significance.
Jacob Cohen advocated for measuring effect size, showing how large a difference is.
Statistical significance with large samples can show trivial differences, while effect size shows if the difference is meaningful.
Effect size helps convey the practical relevance of findings.
The moon illusion experiment showed a 46.3% increase in size (ratio of 1.463), a substantial effect.
Reporting the mean ratio (1.463) gives a clear, meaningful understanding of the effect.
In some cases, average scores (e.g., 2.63 points increase in self-esteem) may not adequately communicate the magnitude of the effect.
Better measures are needed in such cases to convey impact effectively.
Confidence intervals estimate the true population mean (μ), giving a range where μ is likely to lie.
Point estimate: A single estimate, like the sample mean (X̄).
Interval estimate: A range of values with high probability of containing μ.
Confidence interval formula: CI = X̄ ± t(α) * (s / √N).
Confidence level: 95% means 95% of the time the interval contains the true mean, 99% means 99% of the time.
Example: For moon illusion data:
95% CI = [1.219, 1.707].
99% CI = [1.112, 1.814].
Both exclude μ = 1.00 (no illusion), supporting the existence of the illusion.
Confidence intervals reflect the sample's variability, and their width depends on the sample's standard deviation.
Interpretation of confidence intervals: The parameter (μ) is constant, but intervals vary.
A confidence statement indicates the probability that the interval includes the true value of μ, not the probability of μ being in the interval.
Data Example:
Katz et al. (1990) study on SAT-like exam scores without passage.
Sample size (N) = 28, with scores ranging from 33 to 60.
Null hypothesis: H₀: μ = 20 (expected score if guessed blindly).
t-test Calculation:
Sample mean (X̄) = 46.21, sample standard deviation (s) = 6.73.
t = (46.21 - 20) / (6.73 / √28) = 20.61.
The critical value for t at α = 0.05 with 27 degrees of freedom is 2.052.
t(27) = 20.61 > 2.052, so reject H₀ and conclude students performed better than chance.
Confidence Interval:
95% CI: X̄ ± t.05 (s / √N) = 46.21 ± 2.052 * 1.27 = [43.60, 48.82].
Effect Size (Cohen’s d):
d̂ = (X̄ - μ) / s = (46.21 - 20) / 6.73 = 3.89.
This indicates the students scored nearly 4 standard deviations higher than expected by chance.
Interpretation:
The students' performance is highly above chance levels, suggesting effective test-taking skills.
A small effect size (e.g., d̂ = 0.15) would indicate minimal practical significance, but d̂ = 3.89 shows a large and meaningful effect.
Conclusion:
t(27) = 20.61, p < .05, with a large effect size of d̂ = 3.89.
Students performed significantly better than expected purely by chance.
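A short R sketch reproducing these numbers from the reported summary statistics (the raw scores are not reproduced here):

```r
xbar <- 46.21; mu <- 20; s <- 6.73; N <- 28    # reported summary statistics

se <- s / sqrt(N)                              # ≈ 1.27
t  <- (xbar - mu) / se                         # ≈ 20.6
ci <- xbar + c(-1, 1) * qt(0.975, N - 1) * se  # ≈ [43.60, 48.82]
d  <- (xbar - mu) / s                          # ≈ 3.89
list(t = t, ci = ci, d = d)
```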
Sampling Distribution of t:
McClelland’s applet demonstrates how t distribution is generated by drawing samples from a population with a known mean (μ = 0).
Initial samples give extreme t values, which smooth out as more samples (up to 10,000) are drawn.
Distribution of t is wider than the normal (z) distribution; with small df, you reject H₀ too often if using the normal distribution.
Actual 5% cutoff for t (with 4 df) is ±2.776, not ±1.96.
Comparing z and t:
The t distribution has heavier tails than the normal distribution, requiring a larger cutoff for rejecting the null hypothesis.
As degrees of freedom (df) increase, the t distribution approaches the normal distribution.
Confidence Intervals (CIs):
Confidence intervals (CIs) illustrate the range of values that likely contain the population mean.
When the null hypothesis is false, many of the intervals will fail to include the value specified by H₀; the farther the true mean is from that value, the larger the proportion of intervals that exclude it.
Power of a test: the probability of correctly rejecting a false null hypothesis. Power = 1 - probability of Type II error.
Example: With 54% of intervals including the null hypothesis, power is 46%.
Applications:
Confidence limits for sample means can help assess the likelihood of including the true population mean.
Applets offer an interactive way to visualize t distributions, CIs, and power calculations.
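A rough R analogue of the power idea behind the applet; the true mean, null value, and sample size below are arbitrary choices for illustration:

```r
set.seed(99)
true_mu <- 0.5; null_mu <- 0; n <- 10          # arbitrary illustrative values

excludes_null <- replicate(10000, {
  ci <- t.test(rnorm(n, mean = true_mu, sd = 1), mu = null_mu)$conf.int
  null_mu < ci[1] || null_mu > ci[2]           # TRUE when the 95% CI excludes the null value
})
mean(excludes_null)                            # proportion excluding H0 ≈ power of the test
```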
Hart et al. (2013) Study on Dog Orientation:
The study explored whether dogs are sensitive to the earth's magnetic field, similar to birds and mammals.
Data was collected from 70 dogs of 37 breeds over 1,893 observations, examining their orientation during walks.
The goal was not just to test if the orientation was random but to determine what the orientation was and the confidence interval on that mean.
Null Hypothesis Test vs. Confidence Interval:
Although a null hypothesis test (Rayleigh test) could show if the orientation was random, it doesn’t provide detailed information about the orientation.
The confidence interval was far more informative as it showed the specific direction and the variability of that orientation.
Results by Field Stability:
Stable Field: A clear North/South orientation with a narrow confidence interval (173°/353°), SD = 29°, 95% CI = [164°, 182°] and [344°, 2°].
Moderately Unstable Field: Broader confidence interval width of 46°.
Unstable Field: No clear orientation, with no reasonable confidence interval, p = 0.233 (random-like behavior).
Key Takeaway:
Confidence intervals provided more meaningful information than null hypothesis testing.
The narrow confidence interval for stable fields showed the consistency of the dogs' behavior.
The results demonstrated the importance of confidence intervals to understand variability, rather than just determining significance.
Sampling Distribution of the Mean:
The distribution of means from repeated sampling, with its mean equal to the population mean and standard deviation equal to population SD divided by the square root of N.
Central Limit Theorem: As sample size increases, the sampling distribution approaches normal, allowing us to predict results of repeated sampling.
Testing the Null Hypothesis:
Known Population SD: Use z score to test the null hypothesis, subtracting hypothesized population mean from sample mean and dividing by the standard error.
Unknown Population SD: Substitute sample SD for population SD and use Student’s t distribution, factoring in degrees of freedom (N-1).
Factors Affecting t:
The difference between sample mean and null hypothesis mean, sample variance, and sample size (N).
The choice between a one- or two-tailed test affects the critical value but not the magnitude of t.
Effect Size:
A measure of how large the difference is, often scaled by standard deviation.
In some cases the raw mean difference alone is sufficient to convey the effect size, but other cases require scaling the difference in standard deviation units.
Confidence Limits:
Represent limits on the true population mean with a certain level of confidence (e.g., 95%).
The goal is to construct intervals that will include the true population mean 95% of the time.
Power:
Introduced briefly, to be discussed more in later chapters, related to the probability of correctly rejecting a false null hypothesis.
Paired Samples t-Test (Related Samples t-Test)
1. Transition from One-Sample to Two-Sample Tests
Previously: One-sample t-test (Comparing a sample mean to a known population mean).
Now: Paired-samples t-test (Comparing two means from the same individuals or matched pairs).
Goal: Test if the difference between two related means is significant.
2. What Are Related Samples?
Repeated Measures → Same participants measured twice (e.g., pre-test vs. post-test).
Example: Anxiety levels before & after donating blood.
Matched Pairs → Different but related participants (e.g., spouses, twins).
Example: Husband & wife rating marital satisfaction.
Key Idea: Scores in one sample predict scores in the other sample.
If one variable is high, the other is likely high too (positive correlation).
3. Why Use Related Samples?
Reduces individual differences → Each person acts as their own control.
Eliminates extraneous variance (between-subject variability).
Increases statistical power → More likely to detect a real effect.
4. Example Scenarios for Related Samples
Pre/Post Comparison:
Measuring blood pressure before & after a treatment.
Twin Studies:
Comparing cognitive abilities between identical twins.
Spousal Satisfaction:
Examining relationship satisfaction in couples.
🚀 Key Takeaway:
Using related samples helps reduce variability and improves statistical efficiency, making it easier to detect real differences!
Paired Samples t-Test: Weight Gain in Anorexia Study
1. Study Overview
Research Question: Does family therapy help anorexic girls gain weight?
Participants: 17 girls diagnosed with anorexia.
Measurements:
Before Treatment Weight (Mean = 83.23 lbs, SD = 5.02).
After Treatment Weight (Mean = 90.49 lbs, SD = 8.48).
Difference Score (After - Before): Mean = 7.26 lbs, SD = 7.16.
2. Why Use a Paired t-Test?
The same participants were measured before and after therapy → Related Samples.
Their weight before therapy predicts their weight after therapy (correlation = 0.54).
A standard independent samples t-test assumes no relationship between samples, which would be incorrect here.
Solution? Use difference scores (After - Before) and run a one-sample t-test on these difference scores.
3. Hypothesis Testing
Null Hypothesis (H₀): No significant weight change (Mean Difference = 0).
Alternative Hypothesis (H₁): Significant weight change (Mean Difference ≠ 0).
4. Data Insights & Considerations
Mean Weight Gain: 7.26 lbs → Suggests therapy was effective.
Possible Confounding Factor: Could weight gain be due to time rather than therapy?
Control Group (Not receiving therapy) did not gain weight.
Suggests weight gain was due to therapy, not time.
🚀 Key Takeaway:
Using difference scores in a paired t-test allows us to control for individual differences and determine whether the observed weight gain was statistically significant.
Paired t-Test Using Difference Scores
1. Why Use Difference Scores?
Instead of treating Before and After weights as two separate samples, we calculate difference scores (After - Before).
This approach reduces between-subject variability, making the test more sensitive.
If therapy had no effect, the average weight difference would be zero (H₀: μd = 0).
2. Null & Alternative Hypotheses
H₀ (Null Hypothesis): The mean difference score is 0 (no weight change).
H₁ (Alternative Hypothesis): The mean difference score is not 0 (significant weight change).
3. The t-Statistic Formula
t = D̄ / (s_D / √N), where:
D̄ = mean difference score (7.26 lbs)
s_D = standard deviation of the difference scores (7.16 lbs)
N = Number of participants (17)
4. Key Insights
This test is mathematically identical to a one-sample t-test, except we analyze difference scores instead of raw data.
If t is large enough, we reject H₀ and conclude that therapy had a significant effect.
🚀 Bottom Line:
By using difference scores, we account for individual variability and conduct a more precise test of whether therapy led to significant weight gain.
Degrees of Freedom in a Paired t-Test
1. Degrees of Freedom (df) Calculation
Since we are working with difference scores, N is the number of paired observations.
Just like in a one-sample t-test, we lose one degree of freedom for the sample mean.
Formula:
df = N − 1
In this case: df = 17 − 1 = 16
2. Interpreting the Results
Critical t-value (from the table): t.05(16) = 2.12
Obtained t-value: 4.18
Since 4.18 > 2.12, we reject H₀ → The weight gain is statistically significant.
p-value: 0.001 (very low), reinforcing that the weight gain is unlikely due to chance.
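A minimal R sketch reproducing this t test from the reported difference-score statistics:

```r
D_bar <- 7.26; s_D <- 7.16; N <- 17            # reported difference-score statistics

se <- s_D / sqrt(N)                            # ≈ 1.74
t  <- D_bar / se                               # ≈ 4.18
p  <- 2 * pt(-abs(t), df = N - 1)              # ≈ 0.001
c(t = t, p = p, t_crit = qt(0.975, N - 1))     # critical value ≈ ±2.12
```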
3. Running the Test in Statistical Software
SPSS:
Use Analyze → Compare Means → One-Sample t Test
Default 95% confidence interval (can be adjusted).
R command: t.test(Before, After, paired = TRUE)
This specifies that the Before and After scores are paired measurements.
🚀 Bottom Line:
The statistically significant weight gain suggests that the therapy had an impact, but we should consider other factors like natural growth before drawing final conclusions.
The Crowd Within Is Like the Crowd Without – Summary
Key Concept
Group judgments are often more accurate than individual judgments.
Vul & Pashler (2008): Explored whether multiple guesses from the same person improve accuracy.
Study Setup
Participants answered the same question twice, three weeks apart.
Correct answer = 100 (or rescaled to 100).
Compared:
Error of Average Guess (column 5) – Error from averaging both guesses.
Average of Individual Errors (column 8) – Mean of absolute errors from each guess.
Hypothesis: The Error of the Average Guess should be smaller than the Average of Errors.
Data Findings
| Metric | Mean |
| --- | --- |
| Error of Average Guess | 5.767 |
| Average of Individual Errors | 7.8 |
| Difference (Error Reduction) | −1.33 |
| Standard Deviation | 2.024 |
The difference score distribution was analyzed using a paired t-test.
Correlation between two guesses: 0.76 (positive correlation but not perfect).
Statistical Analysis
Null Hypothesis (H₀): The mean of the difference scores = 0 (no improvement).
Degrees of Freedom: df = 15 − 1 = 14
Critical t-value (two-tailed, α = 0.05): Found in appendix tables.
Obtained t-value: Exceeded the critical value → Reject H₀.
Conclusion: Averaging a person’s two guesses results in a better estimate than taking either guess alone.
Further Insights
A person’s first guess is usually more accurate than the second.
However, averaging both guesses gives an even better result.
This aligns with the “Wisdom of the Crowds” principle, even within individuals.
🚀 Bottom Line: If unsure, make two independent guesses and take their average!
Advantages and Disadvantages of Using Related Samples – Summary
Advantages of Related Samples
✅ Reduces Participant Variability:
Controls for individual differences.
Example: In the anorexia study, a 2-pound gain is treated the same whether a girl started at 73 lbs or 93 lbs.
Increases statistical power (better ability to detect real effects).
✅ Controls for Extraneous Variables:
No pre-existing differences between groups since the same participants are tested twice.
Eliminates confounds that would exist in independent-sample designs.
✅ Requires Fewer Participants:
More efficient than independent samples.
Easier to recruit 20 participants measured twice than 40 participants measured once.
Disadvantages of Related Samples
⚠ Order Effects:
First measurement influences the second (e.g., learning, fatigue).
Example: Taking a pretest on current events makes people more aware, improving posttest scores.
⚠ Carry-Over Effects:
Residual effects from the first condition affect performance in the second.
Example: Drug studies—first drug might still be in the system during the second condition.
⚠ Sensitization to Treatment:
Pretest might make participants aware of the study’s purpose.
Example: A pretest on breastfeeding attitudes might make participants suspicious when later exposed to a pro-breastfeeding intervention.
Key Question:
Would order or carry-over effects impact the moon illusion or anorexia studies?
Moon illusion: Probably not affected by order effects.
Anorexia study: Potential issue—normal growth over time could confound results.
Solution: Use a control group to account for natural changes.
🚀 Bottom Line: Related-measures designs are powerful but require careful control of order and carry-over effects
Effect Size – Summary
Why Effect Size Matters
✅ Significance ≠ Meaningfulness:
A large sample can make even small, unimportant differences statistically significant.
Effect size tells us how big a difference is, not just whether it exists.
Ways to Report Effect Size
1⃣ Raw Score Change (Easy to Understand)
Example: Average weight gain of 7.26 lbs in anorexia study.
Can also express as a percentage: e.g., 8.7% weight increase from baseline.
Best when the variable (e.g., weight) has clear meaning.
2⃣ Standardized Mean Difference (Cohen’s d)
Useful when raw scores aren’t intuitive (e.g., self-esteem).
Formula: d = (mean difference) / (standard deviation of pretest scores)
Numerator = mean difference.
Denominator = standard deviation (SD) of pretest scores.
Example: The girls gained 7.26 / 5.02 ≈ 1.45 pretest standard deviations in weight.
Interpretation: If self-esteem increased by 1.5 SD, that would be very meaningful—but "7.26 points on a self-esteem scale" isn't as clear.
Choosing the Right Denominator for d
| Situation | Best standard deviation to use |
| --- | --- |
| Single group, no before/after | SD of the raw scores |
| Before/after (paired data) | SD of the pretest scores |
| Two separate groups (e.g., husbands & wives) | Pooled SD (averaging the two groups' variances) |
Key Takeaway
📝 Use the measure that best communicates meaning!
If people understand raw scores, use them.
If the scale isn’t intuitive, express differences in standard deviation units (Cohen’s d).
Confidence Limits on Change – Summary
What Are Confidence Limits?
✅ A confidence interval (CI) estimates the range where the true population mean likely falls.
✅ Formula for a CI: CI = X̄ ± t_crit × (s / √N)
Applying Confidence Intervals to Related Samples
Instead of a single mean, we use the mean difference (e.g., pretest vs. posttest).
Instead of the SD of raw scores, we use the SD of the difference scores.
Purpose: To determine if the true mean difference is likely different from zero.
Example: Anorexia Study
Mean weight gain: 7.26 lbs
SD of difference scores: 7.16
Confidence interval tells us:
Intervals constructed this way will contain the true population mean weight gain 95% of the time.
If zero is NOT in the interval, we confirm a significant effect.
Key Takeaway
📌 Confidence limits provide a more informative measure than just a p-value.
📌 If the CI excludes zero, the result is statistically significant.
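A short R sketch computing these confidence limits (and Cohen's d) from the summary statistics reported above:

```r
D_bar <- 7.26; s_D <- 7.16; N <- 17; s_pre <- 5.02   # summary statistics from the study

se <- s_D / sqrt(N)
ci <- D_bar + c(-1, 1) * qt(0.975, N - 1) * se       # ≈ [3.58, 10.94]; zero is excluded
d  <- D_bar / s_pre                                  # ≈ 1.45 pretest standard deviations
list(ci = ci, d = d)
```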
SPSS and R for Related-Samples t Test
SPSS output provides descriptive statistics, correlation analysis, and t test results.
Data entered as two separate variables (Before and After).
Correlation between variables is tested for significance.
Paired t test on the mean differences follows.
SPSS computes t as -4.185 (negative due to subtraction order).
The sign of t is irrelevant here; it simply reflects the order in which the differences were calculated.
R Code for Paired t Test
Can be written in different formats, specifying paired differences.
Ensures proper handling of related-samples analysis.
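A sketch of two equivalent ways to run this test in R, assuming Before and After are numeric vectors holding each participant's paired scores in the same order:

```r
# Before and After are assumed to be numeric vectors of paired weights, in the same order.
t.test(After, Before, paired = TRUE)   # paired t on the two related columns
t.test(After - Before, mu = 0)         # identical result: one-sample t on the difference scores
```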
Writing up the Results
Study Context: Everitt (in Hand, 1994) investigated family therapy as a treatment for anorexia in 17 girls.
Procedure: Participants were weighed before and after several weeks of therapy.
Results:
Mean Pretreatment Weight: 83.23 lbs
Mean Posttreatment Weight: 90.49 lbs
Mean Weight Gain: 7.26 lbs
Statistical Significance: The weight gain was statistically significant (t = 4.185, p < .05).
Effect Size: d = 1.45 (gain of nearly 1.5 standard deviations from pretreatment weight).
Confidence Interval: 95% CI for weight gain ≈ [3.58, 10.94] lbs (from D̄ ± t.05(16) × s_D/√N), an interval that excludes zero and confirms a meaningful increase in weight.
Conclusion: Family therapy appears to contribute to significant weight gain beyond what can be attributed to normal growth.
Summary of Related Samples t-Tests
Comparison of Means:
Similar to one-sample t-tests, but with two related samples.
Uses difference scores to test if the mean difference is likely from a population where μ = 0.
Key Examples:
Vul & Pashler (2008): Multiple estimates from individuals improve accuracy.
Everitt (1994): Family therapy for anorexia led to significant weight gain.
Advantages of Repeated Measures Designs:
Reduces variability: Individual differences do not impact results.
Controls extraneous variables: Same participants in both conditions.
More statistical power: Requires fewer participants than independent samples.
Potential Issues:
Carry-over effects: Prior exposure may influence later responses.
Order effects: First measurement may change response to the second.
Effect Size:
Mean difference is useful when variables are intuitive (e.g., weight gain).
Cohen’s d standardizes differences using the pretest standard deviation rather than the standard deviation of the differences.
Confidence Intervals:
Purpose: Provides a range of plausible values for the mean difference.
Calculation: Mean difference ± (t-critical × standard error).
This method ensures a more accurate analysis when dealing with paired data, improving statistical power and control over variability.
Independent Samples t-Test Overview
Key Differences from Previous Tests:
Moving to independent samples: The t-test in this chapter applies to comparing two independent groups, unlike previous tests where the same participants were measured repeatedly.
New assumptions: We introduce assumptions that were not required for related samples. These assumptions are important for the t-test when using independent samples.
Context of Study:
Example from Chapter 13:
The study involved anorexic girls measured before and after a family therapy intervention.
Same participants observed at two different times (related samples).
Why Independent Samples?
Limitations of repeated measurements: In many studies, it’s not feasible or desirable to use repeated measures of the same participants.
Example: Comparing males and females on social ineptness, where it’s impossible to test the same people as both males and females.
Independent groups: Instead, we need two independent groups (e.g., a sample of males and a sample of females).
Common Uses of Independent Samples t-Test:
Visual discrimination in rats:
Comparing trials needed to reach a criterion between two groups (normal conditions vs. sensory deprivation).
Memory study:
Comparing retention levels in two groups of students (active declarative vs. passive negative sentences).
Helping behavior:
Comparing latency of helping behavior between participants tested alone vs. in groups.
Key Question:
Difference in sample means:
In experiments, the two sample means usually differ by at least a small amount.
The important question: Is that difference large enough to conclude that the samples come from different populations?
Example: Are the mean latencies of helping behavior in single-tested vs. group-tested participants significantly different?
Sampling Distribution & t-Test:
Sampling distribution of differences between means:
We need to understand this distribution and the t-test derived from it.
Analogous to Chapter 12: The process is similar to what was done in Chapter 12 when discussing the mean of one sample.
14.1: Distribution of Differences Between Means
Concept Overview:
When testing for a difference between the means of two populations, we test a null hypothesis about the population means, and the statistic we use is the difference between the two sample means.
Key Terminology:
Population 1 (μ₁, σ₁²): Mean (μ₁) and variance (σ₁²).
Population 2 (μ₂, σ₂²): Mean (μ₂) and variance (σ₂²).
Sample Sizes (n₁, n₂): Independent samples drawn from each population.
Steps for Analysis:
Sampling Process:
Sample size n₁ from population 1 and n₂ from population 2.
Record the sample means for each sample and their differences.
Independent Sampling:
The two sample means are independent of each other (even though, in the sampling demonstration, one sample is drawn from each population at the same time).
Distribution of Differences:
Focus on the sampling distribution of the differences between means.
Sampling Distribution Properties:
Mean of Distribution: Equal to μ₁ - μ₂.
Variance of Distribution: Determined by the Variance Sum Law:
Var(X − Y) = Var(X) + Var(Y), where X and Y are independent variables (here, the two sample means).
Variance of Sample Means:
From the Central Limit Theorem:
Variance of the distribution of sample mean 1: σ₁²/n₁
Variance of the distribution of sample mean 2: σ₂²/n₂
Therefore, the variance of the difference between means is σ₁²/n₁ + σ₂²/n₂ (and its square root is the standard error of the difference).
Shape of Distribution:
The difference between two independent normally distributed variables will itself be normally distributed.
Reasonable sample sizes lead to an approximately normal distribution.
Figure 14.1:
The figure shows the sampling distributions of the means and their differences between the two populations.
Conclusion:
Understanding the mean and variance of the difference between sample means allows us to conduct hypothesis testing on the differences.
The normality of the distribution is crucial and is supported by the Central Limit Theorem for reasonable sample sizes.
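A small R simulation of this sampling distribution; the population means, standard deviations, and sample sizes below are arbitrary illustrative values:

```r
set.seed(5)
mu1 <- 100; sd1 <- 15; n1 <- 20        # arbitrary illustrative population values
mu2 <- 95;  sd2 <- 12; n2 <- 25

diffs <- replicate(10000, mean(rnorm(n1, mu1, sd1)) - mean(rnorm(n2, mu2, sd2)))
c(mean(diffs), mu1 - mu2)                      # centered on μ₁ − μ₂
c(var(diffs),  sd1^2 / n1 + sd2^2 / n2)        # variance ≈ σ₁²/n₁ + σ₂²/n₂ (variance sum law)
hist(diffs, breaks = 50, main = "Differences between sample means (≈ normal)")
```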
The t Statistic
Overview:
The t statistic is used when we want to compare the means of two independent samples, especially when population variances are unknown. It is similar to the z statistic but adapted for scenarios where we must estimate population variance from sample data.
Key Steps:
Formulating the Test:
The observed difference between the sample means (denoted as M1−M2) is the statistic of interest.
The mean of the sampling distribution of differences is μ₁ − μ₂.
The standard error of the difference is calculated using the formula:
SE = √(σ₁²/n₁ + σ₂²/n₂)
where σ₁² and σ₂² are the population variances and n₁ and n₂ are the sample sizes.
Using t Statistic:
The formula for the t statistic: t=(M1−M2)/SE
where M1 and M2 are sample means.
The critical value for t is determined by the degrees of freedom, which are exact when the variances are pooled and approximated when they are not.
Challenges with Unknown Population Variances:
In most cases, the population variances are unknown. Therefore, we use sample variances to estimate the population variances, leading to the t-distribution.
The degrees of freedom are adjusted to account for the use of sample variances instead of population variances.
Pooling Variances:
The pooled variance provides a better estimate of the common population variance by averaging the two sample variances, weighting each by its degrees of freedom; this weighting matters most when the sample sizes are unequal.
Formula for the pooled variance: sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2).
After calculating the pooled variance, use it to compute the t statistic.
Degrees of Freedom:
The degrees of freedom for the two-independent-samples t-test are df = n₁ + n₂ − 2.
Hypothesis Testing:
The null hypothesis is typically H₀: μ₁ = μ₂, meaning there is no difference between the population means.
The alternative hypothesis is two-tailed: H1:μ1≠μ2
If the calculated t value exceeds the critical value from the t-distribution, the null hypothesis is rejected.
Example: Anorexia Study
Data:
Family Therapy Group (FT) and Control Group (C) are compared on weight gain.
Table 14.1 shows the weight gain data for both groups, where:
Family Therapy Group (FT): Mean = 7.26, St. Dev = 7.16, n = 17
Control Group (C): Mean = -0.45, St. Dev = 7.99, n = 26
Hypothesis:
Null Hypothesis (H₀): μFT=μC (No difference in weight gain)
Alternative Hypothesis (H₁): μFT≠μC (There is a significant difference)
Calculation:
Pooled Variance: sp² = [16(7.16²) + 25(7.99²)] / 41 ≈ 58.9
t Statistic: t = (7.26 − (−0.45)) / √(58.9 × (1/17 + 1/26)) ≈ 3.2 (a code sketch follows this example).
Critical t Value: Using degrees of freedom df=41, the critical t value is approximately ±2.021 for a 95% confidence level (two-tailed test).
Conclusion: Since the calculated t value (3.20) exceeds the critical value (±2.021), we reject the null hypothesis. Thus, we conclude that family therapy leads to significantly more weight gain compared to the control group.
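As an illustration only, the following Python sketch applies the pooled-variance formulas above to the summary statistics from Table 14.1 and cross-checks the result with scipy's ttest_ind_from_stats; small discrepancies with the values reported in the text reflect rounding.

```python
from math import sqrt
from scipy import stats

# Sketch of the pooled-variance t test from the summary statistics reported for
# the anorexia example (FT: mean 7.26, SD 7.16, n 17; Control: mean -0.45,
# SD 7.99, n 26). scipy's ttest_ind_from_stats is used only as a cross-check.
m1, s1, n1 = 7.26, 7.16, 17
m2, s2, n2 = -0.45, 7.99, 26

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
se = sqrt(sp2 * (1 / n1 + 1 / n2))                            # SE of the difference
t = (m1 - m2) / se
df = n1 + n2 - 2
t_crit = stats.t.ppf(0.975, df)                               # two-tailed, alpha = .05

print(f"pooled variance = {sp2:.2f}, t({df}) = {t:.2f}, critical = +/-{t_crit:.3f}")

# Cross-check: equal_var=True gives the same pooled-variance test
res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)
print(f"scipy: t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```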
Summary:
The t-test compares the means of two independent samples, especially when variances are unknown.
We calculate the t statistic by estimating the standard error using sample variances.
Pooling variances can improve estimates when sample sizes differ.
The degrees of freedom play a crucial role in determining the critical value for hypothesis testing.
14.2: Heterogeneity of Variance
Key Concept: Heterogeneity of Variance
Homogeneity of Variance is an assumption of the t-test for two independent samples, meaning that both populations have equal variances.
Heterogeneity of Variance occurs when the population variances are unequal (i.e., σ₁² ≠ σ₂²).
Effects of Heterogeneity of Variance on the t-Test:
Homogeneity assumption refers to population variances, not sample variances. It’s unlikely that sample variances will be exactly equal, even if the population variances are equal.
Rule of Thumb: If neither sample variance is more than about four times the other and the sample sizes are approximately equal, you can proceed with the standard t-test; heterogeneity will not have a serious effect in this case.
When variances are highly unequal, and sample sizes are also unequal, an alternative approach is needed.
Practical Example:
In Everitt’s study, the variances of the groups were very unequal (one variance was over six times larger than the other), so pooling the variances was inappropriate. Instead, Welch’s method was applied by:
Calculating separate variances for the two groups.
Adjusting degrees of freedom using the appropriate formula.
Statistical software can handle this calculation, and results (like the adjusted degrees of freedom) are easily obtained.
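The sketch below shows one way this adjustment can be computed in Python: a small function implementing the separate-variance t and the Welch–Satterthwaite adjusted degrees of freedom, checked against scipy's equal_var=False option. The summary values passed to it are hypothetical, chosen only to illustrate very unequal variances.

```python
from math import sqrt
from scipy import stats

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t test from summary statistics: separate variances and the
    Welch-Satterthwaite adjusted degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Hypothetical summary statistics with very unequal variances (illustrative only)
t, df, p = welch_t(m1=10.0, s1=12.0, n1=20, m2=4.0, s2=4.0, n2=35)
print(f"Welch t = {t:.2f}, adjusted df = {df:.1f}, p = {p:.4f}")

# scipy applies the same adjustment when equal_var=False
print(stats.ttest_ind_from_stats(10.0, 12.0, 20, 4.0, 4.0, 35, equal_var=False))
```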
What Causes Heterogeneity of Variance?
Exploring Differences: A key question to consider is why one group’s variance might be larger than another’s. For example, if family therapy is more effective for some girls but not for others, this could result in large variability in the family therapy group’s outcomes.
Important Implication: Heterogeneity of variance might not just be a nuisance. It could reveal something important about the treatment’s differential effectiveness and could shift research focus towards understanding the causes of the differences in variance.
Conclusion:
Homogeneity of variance is a key assumption for t-tests, but when violated, Welch’s t-test is a good alternative.
Heterogeneity of variance may indicate interesting findings about the data, especially when different treatments have varying effects on different individuals.
Statistical software makes handling these cases much easier and can compute results efficiently.
14.3: Nonnormality of Distributions
Key Concept:
One of the assumptions for the t-test is that the populations from which the data are sampled are normally distributed. However, in practice, slight deviations from normality (such as a roughly mound-shaped distribution) generally won't invalidate the t-test, especially for large sample sizes (typically n > 30).
Central Limit Theorem (CLT): For large samples, the sampling distribution of differences between means becomes approximately normal, regardless of the shape of the population distributions.
Guidelines:
Small Sample Sizes: If the sample size is small, nonnormality in the data can affect the validity of the test.
Large Sample Sizes: For larger samples, the sampling distribution will tend to be normal due to the CLT, even if the underlying population distribution is not perfectly normal.
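As a quick illustrative check (not part of the text), the simulation below draws samples from strongly skewed exponential populations and shows that the distribution of differences between sample means is nonetheless close to symmetric for moderate sample sizes.

```python
import numpy as np
from scipy.stats import skew

# Simulation sketch: even when both populations are strongly skewed
# (exponential), the distribution of differences between sample means is
# close to normal once samples are reasonably large. Values are illustrative.
rng = np.random.default_rng(1)
n1, n2, reps = 40, 40, 10_000

diffs = rng.exponential(scale=2.0, size=(reps, n1)).mean(axis=1) \
      - rng.exponential(scale=2.0, size=(reps, n2)).mean(axis=1)

# Skewness near 0 suggests the difference distribution is roughly symmetric
print("skewness of the simulated differences:", round(skew(diffs), 3))
```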
14.4: A Second Example with Two Independent Samples
Example: Homophobia and Sexual Arousal Study (Adams, Wright, and Lohr, 1996)
Research Question: Investigate whether homophobia (irrational fear of homosexuality) is related to anxiety about one's own sexuality. Specifically, homophobic individuals were hypothesized to show more sexual arousal to homosexual videos than nonhomophobic individuals.
Data: The data show sexual arousal levels (degree of sexual arousal) in response to a homosexual video for two groups:
Homophobic group: 35 participants
Nonhomophobic group: 29 participants
Data Summary:
Homophobic group:
Mean: 24.00
Variance: 148.87
Sample size (n₁): 35
Nonhomophobic group:
Mean: 16.50
Variance: 139.16
Sample size (n₂): 29
Hypotheses:
Null Hypothesis (H₀): There is no significant difference in arousal between homophobic and nonhomophobic groups. H0:μ1=μ2
Alternative Hypothesis (H₁): There is a significant difference in arousal between homophobic and nonhomophobic groups. H1:μ1≠μ2 (two-tailed test)
Procedure:
Calculate the Pooled Variance: Since the sample variances are similar, we can pool the variances for the t-test.
Pooled variance (sp²) is computed as sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2) ≈ 144.48.
Compute the t-Statistic: Using the pooled variance, we calculate t = (M₁ − M₂) / √(sp²(1/n₁ + 1/n₂)).
Degrees of Freedom: The degrees of freedom for the t-test is df=n1+n2−2.
Critical Value: Using the t-distribution table and the degrees of freedom (df = 62), we determine the critical value for a two-tailed test at α = 0.05, approximately ±2.00.
Calculation Results:
Calculated t value: approximately 2.48, which exceeds the critical t value, leading us to reject the null hypothesis (see the sketch below).
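The sketch below walks through the same pooled-variance arithmetic in Python using the summary statistics reported above; it is an illustration of the steps, not the authors' analysis.

```python
from math import sqrt
from scipy import stats

# Pooled-variance t test from the summary statistics of the homophobia study.
m1, v1, n1 = 24.00, 148.87, 35
m2, v2, n2 = 16.50, 139.16, 29

sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance, about 144.48
se = sqrt(sp2 * (1 / n1 + 1 / n2))                      # SE of the difference, about 3.02
t = (m1 - m2) / se                                      # about 2.48
df = n1 + n2 - 2                                        # 62
p = 2 * stats.t.sf(abs(t), df)
print(f"sp^2 = {sp2:.2f}, SE = {se:.2f}, t({df}) = {t:.2f}, p = {p:.3f}")
```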
Conclusion:
Statistically, homophobic individuals show significantly greater arousal to homosexual videos compared to nonhomophobic individuals.
Summary:
Nonnormality of distributions is less of a concern with large sample sizes due to the Central Limit Theorem.
In the homophobia study, we used the t-test to compare the means of two groups with similar variances, concluding that homophobic individuals show more sexual arousal to homosexual stimuli than nonhomophobic individuals.
14.5: Effect Size Again
Key Concept:
Effect size measures the magnitude of the difference between groups, providing context for statistical significance. It is important to go beyond simply stating that a difference is statistically significant and convey how large that difference is in a meaningful way.
Measuring Effect Size:
Standardized Measure:
Effect size can be measured by Cohen's d, which expresses the difference between the means in standard deviation units.
The formula for Cohen's d is:
d = (M₁ − M₂) / sp
where M1 and M2 are the means of the two groups, and sp is the pooled standard deviation.
Pooled Standard Deviation:
When there is no clear control group, the pooled standard deviation is used:
sp = √sp² = √([(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2))
This gives us a combined estimate of the standard deviation from both groups.
Choosing the Standard Deviation:
If there is a clear control group, use its standard deviation.
If there is no control group, pool the variances from both groups and take the square root to calculate sp
If variances are noticeably different, it’s better to use one group’s standard deviation and note this choice to the reader.
Example: Homophobia and Sexual Arousal
Pooled Variance: 144.48
The pooled standard deviation sp is the square root of the pooled variance: √144.48 ≈ 12.02.
Difference between Means: 7.5
Now, compute Cohen’s d:
d=7.5/12.0=0.625
This result tells us that the mean sexual arousal for homophobic participants is 0.625 standard deviations higher than the mean for nonhomophobic participants, which indicates a moderate effect size.
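A minimal Python sketch of this calculation, using the pooled variance reported above:

```python
from math import sqrt

# Cohen's d from the pooled variance, as described above (a sketch).
m1, m2, sp2 = 24.00, 16.50, 144.48
sp = sqrt(sp2)          # pooled standard deviation, about 12.02
d = (m1 - m2) / sp      # about 0.62
print(f"sp = {sp:.2f}, d = {d:.2f}")
```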
Choosing the Best Measure:
Standardized Effect Size: In cases where the units of measurement are arbitrary or unclear (like sexual arousal in the homophobia study), it’s useful to report the effect size in standard deviation units.
Units with Meaning: If the units of measurement are meaningful (e.g., weight gain, moon illusion), it may be more useful to report the raw difference between means or the ratio of means.
For example, in the moon illusion study, saying that the horizon moon appears 50% larger than the zenith moon is more informative than standardizing the measurement.
Goal: The goal is to give the reader an appreciation of the magnitude of the difference. Choose the measure that best represents the size of the difference:
Standardized effect size (e.g., Cohen’s d) is best when the units are arbitrary or unclear.
Raw difference or ratio is better when the units have clear meaning.
Conclusion:
Effect size provides a contextualized measure of the difference between groups, which is essential in conveying how meaningful a statistically significant result is.
Choose standardized effect sizes for arbitrary measurements and raw differences or ratios when the units are meaningful.
14.6: Confidence Limits on μ₁ − μ₂
Key Concept:
In addition to testing a null hypothesis and calculating an effect size, it’s crucial to examine confidence limits on the difference between population means. These confidence limits provide a range in which we expect the true difference to lie, giving us more context about the precision of our estimate.
Confidence Interval Calculation:
The logic for setting confidence limits is the same as for the one-sample case (discussed in Chapter 12). Instead of using the mean and standard error of the mean, we use the difference between the means and the standard error of the difference between means.
For a 95% confidence interval on the difference between means, the formula is:
Confidence Interval = (M₁ − M₂) ± t_critical × SE
where:
(M₁ − M₂) is the observed difference between the sample means.
t_critical is the critical t-value from the t-distribution for the desired confidence level (here, 95%).
SE is the standard error of the difference between means.
Example: Homophobia Study
Difference between Means: 7.5
Standard Error of the Difference (SE): approximately 3.02 (note that 12.0 is the pooled standard deviation, not the standard error of the difference).
Critical t-value: approximately 2.00, based on the degrees of freedom (df = 62), obtained from statistical tables or software.
Confidence Interval: For the homophobia study, after performing the calculation, we obtain a 95% confidence interval of (1.46, 13.54) for the difference in sexual arousal between homophobic and nonhomophobic participants.
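The following Python sketch reproduces this interval from the summary statistics, using scipy only for the critical t value; minor differences from the reported limits reflect rounding.

```python
from math import sqrt
from scipy import stats

# 95% CI on mu1 - mu2 for the homophobia example, using the pooled standard
# error of the difference (a sketch of the calculation described above).
m1, v1, n1 = 24.00, 148.87, 35
m2, v2, n2 = 16.50, 139.16, 29

df = n1 + n2 - 2
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
se = sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df)

diff = m1 - m2
lower, upper = diff - t_crit * se, diff + t_crit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")   # close to the (1.46, 13.54) reported above
```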
Interpretation:
We can be 95% confident that the true difference in sexual arousal between the two groups lies between 1.46 and 13.54.
The confidence interval does not include 0, which is consistent with our rejection of the null hypothesis. This confirms that homophobic individuals are statistically more sexually aroused by homosexual videos than nonhomophobic individuals.
Caution:
Although the result is statistically significant and the effect size (Cohen’s d = 0.62) is substantial, the confidence interval is wide.
A wide interval suggests uncertainty about the exact size of the difference between the means. This indicates that while we have a significant result, there is some doubt about the precision of the estimated difference.
Conclusion:
Confidence limits help to understand the range of plausible values for the difference between population means. They provide a more nuanced picture of the result, giving us both statistical significance and an idea of the precision of our estimate.
In this case, while the difference between means is significant, the wide confidence interval suggests that further research or a larger sample might be necessary to narrow down the exact magnitude of the difference.
14.7: Confidence Limits on Effect Size
Key Concept:
In addition to confidence intervals on the difference between means, we also want to calculate confidence intervals on the effect size. This allows us to not only determine the magnitude of the difference but also assess how precise that estimate is.
The statistic for effect size is denoted as d (Cohen’s d), which represents the standardized mean difference. The population effect size is represented by the Greek letter δ.
Confidence Interval for Effect Size:
The confidence interval for effect size is more difficult to calculate manually, but it is possible to estimate it using software tools like:
The program by Cumming and Finch (2001), which provides detailed steps for calculating these intervals.
The MBESS library in R (by Kelley and Lai), which is simpler to use and provides confidence intervals for effect size directly.
Steps to Calculate Confidence Interval for Effect Size:
A simplified approach is to use the calculated standardized effect size (Cohen’s d) along with the sample sizes for both groups.
Cohen’s d was previously calculated as 0.62 for the homophobia study.
The sample sizes are n1=35 and n2=29
Example:
The program or function will use the obtained t value and sample sizes to calculate the standardized mean difference (SMD) and its confidence interval.
By using the t statistic directly, you can bypass some intermediate steps and obtain the same result with minimal rounding errors.
Practical Application:
Confidence intervals on effect size indicate how precisely the magnitude of the effect has been estimated.
For instance, the calculated effect size of 0.62 indicates a moderate effect. By calculating the confidence interval, we can understand how reliable this estimate is and how it might vary with different samples (see the sketch below).
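The sketch below illustrates one common way such an interval is obtained: inverting the noncentral t distribution (the general approach behind tools like MBESS and the Cumming and Finch program), using the observed t and the sample sizes. It is an illustration of the idea, not the MBESS code itself, and the observed t of about 2.48 is taken from the earlier calculation.

```python
from math import sqrt
from scipy import stats
from scipy.optimize import brentq

# Sketch: confidence interval on the population effect size (delta) by
# inverting the noncentral t distribution, using the observed t and sample sizes.
t_obs, n1, n2 = 2.48, 35, 29
df = n1 + n2 - 2
scale = sqrt(1 / n1 + 1 / n2)       # converts a noncentrality parameter to d units

# Find noncentrality parameters that place t_obs at the 97.5th and 2.5th
# percentiles of the noncentral t distribution (cdf is decreasing in nc).
nc_lower = brentq(lambda nc: stats.nct.cdf(t_obs, df, nc) - 0.975, -10, 30)
nc_upper = brentq(lambda nc: stats.nct.cdf(t_obs, df, nc) - 0.025, -10, 30)

d_lower, d_upper = nc_lower * scale, nc_upper * scale
print(f"95% CI on the effect size: ({d_lower:.2f}, {d_upper:.2f})")
```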
14.8: Plotting the Results
Key Concept:
Plotting the results of a study helps to visualize the findings and make them more accessible for understanding. A bar graph is one of the most common ways to represent the data, and it can be enhanced with error bars to show the variability around the sample means.
Bar Graphs:
In a bar graph, the height of the bar represents the sample mean of each group. Each group is represented by a separate bar.
Error bars are added to show the variability of the data around the mean. These bars can indicate the standard error, standard deviation, or confidence intervals.
Error Bars:
Standard Error: In many cases, error bars represent the standard error of the mean, which is an estimate of how much the sample mean is expected to vary if the study were repeated.
Standard Deviation or Confidence Intervals: Some authors use error bars to show either the standard deviation or the confidence intervals around the mean. The exact definition of the error bars must be checked, as it’s not always clearly stated.
Example: Homophobia Study Data
For the homophobia study, the pooled variance was 144.48, so the pooled standard deviation is √144.48 ≈ 12.02.
The standard errors for the two groups were calculated as follows:
Homophobic group: 12.02 / √35 ≈ 2.03
Nonhomophobic group: 12.02 / √29 ≈ 2.23
The ends of the error bars (one standard error above and below the mean) are calculated as:
Homophobic group: The error bars extend from approximately 22.0 to 26.0.
Nonhomophobic group: The error bars extend from approximately 14.3 to 18.7.
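A minimal matplotlib sketch of such a bar graph, using the means and standard errors above; plotting choices such as colors and labels are arbitrary.

```python
import matplotlib.pyplot as plt

# Bar graph of the group means with error bars of one standard error,
# using the values worked out in the notes above.
groups = ["Homophobic", "Nonhomophobic"]
means = [24.0, 16.5]
std_errors = [2.03, 2.23]   # pooled SD of about 12.02 divided by sqrt(n) for each group

fig, ax = plt.subplots()
ax.bar(groups, means, yerr=std_errors, capsize=8, color=["steelblue", "lightgray"])
ax.set_ylabel("Degree of sexual arousal")
ax.set_title("Mean arousal by group (error bars = +/- 1 standard error)")
plt.show()
```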
Interpretation:
Short error bars: In this study, the error bars are quite short, indicating low variability around each mean. If the study were repeated many times, the sample mean for the homophobic group would fall between about 22 and 26 roughly two-thirds of the time, and the error bars for the two groups do not overlap.
Robust Effect: The short error bars and the clear separation between the groups indicate that the observed difference is likely to be robust and consistent if the study were repeated.
Conclusion:
Bar graphs with error bars are an effective way to visually represent the means and variability of the data.
Short error bars indicate reliable results with little fluctuation in the mean if the study were to be repeated.
14.9: Writing Up the Results
Key Concept:
When writing up the results of a statistical study, it's essential to provide clear and concise information about the study's purpose, procedures, statistical outcomes, effect size, and conclusions. This helps the reader understand the significance of the findings and their implications.
Components of the Write-Up:
Study Purpose and Procedure:
Briefly describe the research question, purpose, and methods of the study. This should give context for the data analysis.
Reporting Means and Standard Deviations:
Clearly state the mean and standard deviation for each group. This can either be done in the text or in a table for clarity.
t-Test Results:
Report the t-statistic, degrees of freedom (df), and the p-value. This provides the statistical evidence for the difference between the groups.
Effect Size:
Include a statement on the size of the effect (e.g., Cohen’s d), which gives the reader an understanding of the magnitude of the difference between groups, not just its significance.
Concluding Sentence:
End with a sentence summarizing the main conclusion drawn from the results.
14.10: Do Lucky Charms Work?
Study Overview:
Damisch, Stoberock, and Mussweiler (2010) conducted a series of studies to examine whether superstitious behaviors, like carrying a lucky charm, can actually improve performance on tasks. We will focus on their third study, which involved university students performing a memory task with or without a lucky charm.
Key Variables:
Independent Variable: Presence or absence of a lucky charm.
Dependent Variable: A combined measure of time and number of trials to complete the task, with lower scores indicating better performance.
Method:
Participants: 41 university students.
Group Division: Random assignment to either the Lucky Charm Present or Lucky Charm Absent group.
Task: Memory task.
Null Hypothesis:
The null hypothesis (H₀) is that there is no difference in performance between the two groups. In other words, the presence of the lucky charm does not affect performance.
Significance Level:
Alpha (α) = 0.05 (5% level of significance).
We will use a two-tailed test because it's possible that the lucky charm might also act as a distraction, leading to poorer performance.
Step-by-Step Analysis:
Calculate Sample Statistics:
The means and variances for the two groups are calculated as follows (you would provide these specific values here based on the data):
Lucky Charm Present Group:
Mean = X
Variance = Y
Lucky Charm Absent Group:
Mean = A
Variance = B
Pooling the Variances:
Pool the variances to get a better estimate of the population variance. The pooled variance formula is:
sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
where s₁² and s₂² are the variances of the two groups, and n₁ and n₂ are the sample sizes for each group.
Calculate t:
Use the pooled variance to compute the t statistic:
t = (M₁ − M₂) / √(sp²(1/n₁ + 1/n₂))
where M₁ and M₂ are the sample means.
Degrees of Freedom:
The degrees of freedom are calculated as df = n₁ + n₂ − 2.
Critical t-Value:
Using the t-distribution table, find the critical value for the given degrees of freedom and alpha level.
In this case, with 41 participants, df = 41 − 2 = 39, and for α = 0.05 the critical t-value is approximately ±2.02 (based on the appendices or software output).
Conclusion:
If the calculated t-value exceeds the critical t-value, we reject the null hypothesis.
For this example, the calculated t value of 2.12 exceeds the critical value of ±2.02, so we reject the null hypothesis and conclude that the presence of a lucky charm was associated with better performance on the memory task.
Effect Size:
The effect size can be calculated using Cohen’s d, which measures the standardized mean difference:
d = (M₁ − M₂) / sp
where M₁ and M₂ are the sample means and sp is the pooled standard deviation.
The result for Cohen's d in this case is 0.62, indicating that the group with the lucky charm performed about 2/3 of a standard deviation better than the group without the charm.
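For completeness, the sketch below wraps the whole sequence, pooled t, degrees of freedom, critical value, and Cohen's d, into one Python function that works on raw scores. The example arrays are hypothetical (lower scores meaning better performance) and are not the Damisch et al. data.

```python
import numpy as np
from scipy import stats

def two_sample_report(x, y, alpha=0.05):
    """Pooled-variance t test, df, critical value, p value, and Cohen's d for
    two independent samples of raw scores (a generic sketch of the steps above)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (x.mean() - y.mean()) / se
    p = 2 * stats.t.sf(abs(t), df)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    d = (x.mean() - y.mean()) / np.sqrt(sp2)
    return {"t": t, "df": df, "p": p, "t_critical": t_crit, "d": d}

# Hypothetical scores (NOT the Damisch et al. data) just to show usage.
charm_present = [12, 10, 14, 9, 11, 13, 10, 12, 11, 9]
charm_absent  = [15, 13, 16, 14, 12, 17, 15, 13, 14, 16]
print(two_sample_report(charm_present, charm_absent))
```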
Confidence Interval on Effect Size:
Using software like R or a similar statistical tool, calculate the confidence interval on the effect size. A wide interval indicates uncertainty, but the important point is that the confidence interval does not include 0, confirming that the effect is significant.
Confidence Limits on the Mean Difference:
The confidence limits for the mean difference provide a range in which the true difference lies, but in this case the interval gives little insight because the dependent variable is an unfamiliar composite measure. Effect size remains the most meaningful measure here.
Write-Up of the Results:
Study Overview: In a study designed to examine whether superstitious behavior (specifically, the presence of a lucky charm) affects performance, Damisch et al. (2010) asked participants to perform a memory task in the presence or absence of a lucky charm.
Results: The results showed that participants with their lucky charm present performed better than those without their charm (t(df) = X.XX, p < 0.05). The observed mean difference was significant, and the effect size (Cohen’s d) was 0.62, indicating a moderate effect. Participants with the lucky charm performed nearly 2/3 of a standard deviation better.
Conclusion: This study provides support for the idea that superstitions, like using a lucky charm, may enhance performance. Further experiments within this series also showed that the lucky charm group reported a greater sense of self-efficacy, which the authors suggest may explain the improved performance.
Conclusion:
The study demonstrates that the presence of a lucky charm led to better performance, supported by statistical significance and a meaningful effect size.
The confidence intervals further confirm the robustness of the effect.
This write-up clearly communicates the statistical results, their interpretation, and their practical significance.
14.12: Summary
This chapter focused on comparing the means of two independent samples, providing key insights into the statistical methods and concepts for such comparisons. Here's a recap of the main points:
1. Standard Error of Mean Differences:
The standard error of the difference between two means is the standard deviation of a theoretical set of differences between means. For independent samples, it is calculated as the square root of the sum of the variances, each divided by its own sample size: SE = √(σ₁²/n₁ + σ₂²/n₂), with sample variances substituted when the population variances are unknown.
2. Central Limit Theorem (CLT):
The CLT ensures that, under many conditions, the distribution of the differences between sample means will be normally distributed, especially as sample sizes increase.
3. Calculating t:
To calculate the t statistic, subtract the means of the two samples and divide by the estimated standard error of the difference: t = (M₁ − M₂) / SE
Pooling variances: If the variances of the two groups are assumed to be equal, we take the pooled variance (weighted average) for our standard error estimate.
4. Degrees of Freedom:
The degrees of freedom for a two-sample t-test is the sum of the degrees of freedom for each sample: df=n1+n2−2
5. Heterogeneity of Variance:
Heterogeneity of variance occurs when the variances of the two groups differ significantly.
In this case, the t statistic can be computed using the individual variances for each sample, and the degrees of freedom are adjusted accordingly. This adjustment is known as Welch’s test, and it is commonly reported by statistical software.
6. Effect Size (Cohen’s d):
Cohen’s d measures the size of the difference between the two sample means in terms of the standard deviation. This is calculated as:
d = (M₁ − M₂) / sp
The standard deviation used could be from a control group, another logically chosen group, or the pooled standard deviation.
7. Confidence Intervals:
We discussed calculating two types of confidence limits:
Confidence limits on the difference between means: This gives a range in which we expect the true difference in means to fall.
Confidence intervals on effect size: This can be easily calculated using software. These intervals provide a range of values for the true effect size.
95% confidence means that, in repeated studies, the calculated intervals will contain the true population values 95% of the time.
8. Plotting the Results:
Bar graphs are a common way to plot results, with error bars representing the variability around the means.
Error bars typically represent standard error, but they could also show standard deviation or confidence intervals.
It’s important to clearly specify what the error bars represent in any graph.
Conclusion:
This chapter covered all the critical steps for comparing the means of two independent samples, including calculating t, considering the effect size, and interpreting confidence intervals.
It also emphasized the importance of plotting the results clearly, including specifying what error bars represent, to make the findings easy to understand.