Chapter 1. Defining, Measuring, and Sampling: Statistics as Social Constructions
• Statistics as Social Constructions – Subjective choices in defining variables, measurement, and sampling shape statistical findings.
• Critical Evaluation of Statistics – Recognizing bad statistics, mutant statistics, soft statistics, and the dark figure (cases that go unreported or unrecorded).
• Defining Variables – Conceptual definitions (abstract meanings) vs. operational definitions (measurable specifications).
• Sampling Methods – Populations vs. samples; representativeness and adequacy of sample size.
• Sampling Bias – Systematic differences between a sample and the population:
o Selection bias – Certain individuals are more likely to be included.
o Volunteer bias – Participants who volunteer differ from those who don’t.
o Convenience sampling bias – Limited diversity due to easy access.
o Undercoverage bias – Certain groups are underrepresented.
o Nonresponse bias – Participants who decline differ systematically from those who participate.
o Survivorship bias – Only "survivors" of a process are analyzed.
o Healthy user bias – Participants are healthier than the general population.
o Recall bias – Inaccuracies in participants' memories affect data.
• Sampling Error – Random differences between a sample statistic and the true population parameter; sampling error shrinks as sample size increases.
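A minimal simulation sketch in Python (assuming NumPy; the population mean of 100 and SD of 15 are made-up values) illustrating how the typical gap between a sample mean and the population mean shrinks as sample size grows:

```python
import numpy as np

rng = np.random.default_rng(42)
POP_MEAN, POP_SD = 100, 15          # hypothetical population parameters
N_SIMULATIONS = 10_000              # samples drawn per sample size

for n in (10, 100, 1000):
    # Draw many samples of size n and record each sample mean
    sample_means = rng.normal(POP_MEAN, POP_SD, size=(N_SIMULATIONS, n)).mean(axis=1)
    # Average absolute distance from the true mean (one view of sampling error)
    typical_error = np.mean(np.abs(sample_means - POP_MEAN))
    print(f"n = {n:>4}: typical sampling error ≈ {typical_error:.2f}")
```

The typical error should drop roughly in proportion to the square root of the sample size.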
Chapter 2. Distributions and Measures: Tools for Summarizing Data
• Measurement Scales – Nominal (categories with no order), ordinal (ordered categories with unequal intervals), interval (equal intervals but no true zero), ratio (equal intervals with a true zero).
• Categorical vs. Continuous Data – Categorical (nominal, sometimes ordinal) vs. continuous (sometimes ordinal, interval, ratio); determines the choice of summary statistics and visualizations.
• Describing Categorical Data – Frequency distributions, bar charts; use the mode as the measure of central tendency.
• Describing Continuous Data – Histograms or density plots to assess the shape of the distribution.
o Symmetric Distributions – Normal (bell curve, mean = median = mode), uniform (all values equally likely); mean and standard deviation are the preferred summary statistics.
o Asymmetric Distributions – Positive skew (long right tail, mean > median), negative skew (long left tail, mean < median); median and IQR are more appropriate than mean and standard deviation.
• Symmetry, the Mean, and the Median – If the mean and median are equal, the distribution is symmetric; if they differ, the distribution is asymmetric, with the mean pulled toward the longer tail.
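A minimal sketch in Python (assuming NumPy and SciPy; the data are simulated) showing that the mean and median roughly coincide in a symmetric distribution, while the mean is pulled toward the long right tail in a positively skewed one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

symmetric = rng.normal(loc=50, scale=10, size=10_000)   # roughly normal
skewed = rng.exponential(scale=10, size=10_000)         # long right tail (positive skew)

for name, data in [("symmetric", symmetric), ("positively skewed", skewed)]:
    print(f"{name:>18}: mean = {np.mean(data):6.2f}, "
          f"median = {np.median(data):6.2f}, "
          f"skewness = {stats.skew(data):5.2f}")
```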
Chapter 3. Standard Scores: Providing Context
• Standard Scores – Transform raw scores into a common scale; z scores are the most widely used standard scores.
• Interpreting z Scores – Positive z scores indicate values above the mean; negative z scores indicate values below the mean; magnitude shows distance in standard deviations.
• Calculating z Scores – z = (X – M) / SD, where X is the raw score, M is the mean, and SD is the standard deviation (applied in the sketch at the end of this chapter).
• Using z Scores to Compare Different Distributions – Standardization allows meaningful comparisons between scores on different scales.
• The Normal Distribution – A symmetric, bell-shaped distribution where most scores cluster around the mean.
o 68 – 95 – 99.7 Rule – Approximately 68% of scores fall within 1 SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs.
• Percentile Ranks and the Unit Normal Table – Convert z scores into percentile ranks using a reference table.
• Estimating vs. Calculating Percentiles – Visual inspection of graphs helps estimate percentiles before using the unit normal table.
• Standardizing vs. Normalizing – Standardizing (e.g., linear transformation to z scores) changes scale but not shape; normalizing (e.g., nonlinear transformation such as logarithm) changes distribution shape.
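A minimal sketch in Python (assuming SciPy; the test mean of 500, SD of 100, and raw score of 650 are invented) that standardizes a raw score, checks the 68–95–99.7 rule, and converts the z score to a percentile rank using the normal distribution in place of a printed unit normal table:

```python
from scipy import stats

# Hypothetical test: mean 500, standard deviation 100 (made-up values)
M, SD = 500, 100
raw_score = 650

# Standardize: z = (X - M) / SD
z = (raw_score - M) / SD                      # -> 1.5

# Percentile rank: proportion of the normal curve below this z score
percentile = stats.norm.cdf(z) * 100

# 68-95-99.7 rule: proportion of scores within ±1, ±2, ±3 SDs of the mean
for k in (1, 2, 3):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within ±{k} SD: {coverage:.1%}")

print(f"z = {z:.2f}, percentile rank ≈ {percentile:.1f}")
```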
Chapter 4. The Logic of Statistical Testing: Ruling Out Chance
• Statistical Testing – Moves beyond describing data to evaluating whether observed patterns reflect true effects or random variation.
• Modeling Chance – Establishes expectations for data under the assumption that no real effect exists; comparisons to these expectations determine significance.
• Null and Alternative Hypotheses – The null hypothesis (H0) assumes no effect, while the alternative hypothesis (H1) suggests a true effect or difference.
• Sampling Error – Random variability causes sample statistics to differ from population parameters; statistical tests account for this variability when assessing significance.
• Sampling Distributions – Describe the expected variation in sample statistics under the null hypothesis, forming the basis for statistical decision-making.
• Decision Threshold and the Critical Region
o Alpha Level (α) – The preset probability threshold (typically 0.05) for defining statistical significance.
o One-Tailed vs. Two-Tailed Tests – One-tailed tests allocate the entire α level to one extreme of the sampling distribution, while two-tailed tests split it across both extremes; two-tailed tests are standard to avoid bias.
o Critical Value and Critical Region – The critical value marks the boundary for significance; the critical region consists of sample results so extreme that they would occur with probability less than α under H0, leading to its rejection.
• Interpreting Results
o If the test statistic falls inside the critical region, reject H0; the result is statistically significant.
o If the test statistic falls outside the critical region, retain H0; the result is not statistically significant.
• p Value and the Decision Rule
o p-Value Definition – The probability of obtaining a result at least as extreme as the observed data if H0 is true.
o Decision Rule – If p ≤ α, reject H0; if p > α, retain H0 (applied in the sketch at the end of this chapter).
• Type I and Type II Errors
o Type I Error (False Alarm) – Rejecting H0 when it is actually true; the probability of occurrence is α.
o Type II Error (Miss) – Retaining H0 when it is actually false; the probability of occurrence is denoted β and depends on statistical power.
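A minimal sketch in Python (assuming SciPy; the reaction-time data and the null value of 300 ms are invented) that runs a two-tailed one-sample t test and applies the p ≤ α decision rule from this chapter:

```python
from scipy import stats

# Invented sample of reaction times (ms) and a null-hypothesis mean of 300 ms
sample = [312, 298, 305, 321, 290, 315, 308, 299, 317, 304]
H0_MEAN = 300
ALPHA = 0.05

# Two-tailed one-sample t test: H0 says the population mean is 300
t_stat, p_value = stats.ttest_1samp(sample, popmean=H0_MEAN)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value <= ALPHA:
    print("Reject H0: the result is statistically significant.")
else:
    print("Retain H0: the result is not statistically significant.")
```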
Chapter 5. Effect Size: The Strength of Results
• Statistical vs. Practical Significance – Statistical significance indicates whether results are unlikely to be due to chance, but does not address whether they are meaningful in practical terms.
• Practical Significance – A statistically significant result may lack practical importance due to:
o Flawed study design – Poor controls or methodological issues can produce misleading results.
o Small effect size – A difference may be too minor to matter in real-world applications.
• Effect Size – Measures the magnitude of an observed difference or relationship, independent of sample size.
• Why Effect Size Matters – Helps compare findings across studies, interpret unfamiliar metrics, and assess the impact of research results.
• Effect Size for Mean Comparisons – Different approaches for measuring effect size when comparing two means:
o Raw Differences – Useful when measures are familiar (e.g., salary differences).
o Percentages – Helps interpret effects for ratio-scale data (e.g., older adults recall 33% fewer words than younger adults).
o Standardized Measures – Necessary when raw scores are unfamiliar or when comparing across studies.
• Standardized Effect Sizes for Mean Comparisons
o Cohen’s d – Expresses mean differences in standard deviation units, making comparisons across studies easier (computed in the sketch at the end of this chapter).
o η2 (Eta squared) – Proportion of variance explained, indicating how much of the variability in the dependent variable is accounted for by group differences.
• Effect Sizes for Associations – Measure the strength of relationships rather than group differences:
o r (Pearson’s correlation coefficient) – Strength of linear relationships, with values from -1 to 1.
o r2 (Coefficient of determination) – Proportion of variance in one variable explained by another.
• Rules of Thumb for Effect Sizes – Guidelines for interpreting effect sizes vary by context, but common benchmarks include:
o Cohen’s d – small (0.20), medium (0.50), large (0.80).
o r (correlation) – small (0.10), medium (0.30), large (0.50).
o η2 or r2 – small (0.01), medium (0.09), large (0.25).
• Contextual Interpretation of Effect Sizes
o Outcome Importance – Even small effects can be meaningful when outcomes are significant (e.g., public health interventions).
o Costs and Benefits – The value of an effect must be considered alongside the costs of acting on it:
§ Even small effect sizes may justify action if the benefits (e.g., improved efficiency, learning, health) outweigh costs.
§ Large effects may be impractical if implementation is costly or has unintended downsides (e.g., workplace interventions that reduce errors but harm employee morale). § Comparing intervention feasibility and opportunity costs ensures findings are meaningfully applied.o Accumulation Over Time – Effects may compound across contexts, increasing their real-world impact.
o Generality of Effects – Large effects in controlled studies may not generalize to real-world settings.
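A minimal sketch in Python (assuming NumPy and SciPy; both groups and the correlated variables are simulated) computing Cohen's d for a two-group mean comparison and Pearson's r with r² for an association:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# --- Cohen's d for a mean comparison (simulated groups) ---
group_a = rng.normal(loc=105, scale=15, size=50)
group_b = rng.normal(loc=100, scale=15, size=50)

# Pooled standard deviation across the two groups
n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * np.var(group_a, ddof=1) +
                     (n_b - 1) * np.var(group_b, ddof=1)) / (n_a + n_b - 2))
cohens_d = (np.mean(group_a) - np.mean(group_b)) / pooled_sd

# --- Pearson's r and r^2 for an association (simulated variables) ---
x = rng.normal(size=200)
y = 0.4 * x + rng.normal(size=200)            # built-in moderate relationship
r, _ = stats.pearsonr(x, y)

print(f"Cohen's d = {cohens_d:.2f}")
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")
```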
Chapter 6. Statistical Power: The Ability to Detect an Effect
• Statistical Power – The probability of correctly rejecting a false null hypothesis. Higher power means a greater ability to detect true effects; studies with low power run a high risk of missing true effects (Type II errors) and produce ambiguous null results that could reflect either no effect or insufficient power.
• Typical Power Levels in Research – Studies often have insufficient power, particularly for detecting small effects:
o Power is typically low for small effects (~0.23).
o Power is moderate for medium effects (~0.62).
o Power is acceptable for large effects (~0.84).
o Power has remained low across decades of research.
• Factors That Influence Statistical Power – Three main factors determine power:
o Sample Size – Larger samples reduce sampling error, increasing power.
o Effect Size – Larger effects are easier to detect, increasing power.
o Decision Threshold (α Level and Tails of the Test) – A higher alpha level or a one-tailed test increases power but also raises the risk of a Type I error (these trade-offs appear in the sketch at the end of this chapter).
• Power Analysis – Conducted during study planning to determine:
o Whether the study has a good chance of yielding informative results.
o The minimum sample size needed to achieve adequate power.
o The smallest effect size the study can reliably detect.
o Whether the study is worth conducting given constraints on data collection.
• How to Maximize Power – Strategies to improve a study’s ability to detect true effects:
o Increase Sample Size – The most effective way to boost power, as it reduces sampling variability.
o Collect More Data in Other Ways – Adding more trials in an experiment or more items on a scale provides more observations per participant, reducing noise and increasing power.
o Use a Within-Subjects Design When Possible – Comparing each participant to themselves reduces variability and boosts power.
o Use a Higher Alpha Level – Expands the critical region but increases Type I error risk.
o Use a One-Tailed Test When Justified – If the direction of the effect is confidently predicted, a one-tailed test allocates all α to one tail, improving power.
o Measure Variables More Precisely – Reduces measurement error, leading to clearer effect detection.
o Avoid Dichotomizing Continuous Variables – Grouping continuous measures into categories (e.g., high vs. low) discards information and weakens power.
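A minimal power-analysis sketch in Python (assuming the statsmodels package; the effect size, group sizes, and power target are illustrative) showing how alpha level, one- vs. two-tailed testing, and sample size affect power for an independent-samples t test:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power for a medium effect (d = 0.5) with 30 participants per group
base = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05, alternative='two-sided')

# Same design with a looser alpha, and with a one-tailed test
loose_alpha = analysis.power(effect_size=0.5, nobs1=30, alpha=0.10, alternative='two-sided')
one_tailed = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05, alternative='larger')

# Per-group sample size needed to reach 80% power for a medium effect
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                alternative='two-sided')

print(f"power (n=30/group, d=0.5, two-tailed, α=.05): {base:.2f}")
print(f"power with α=.10: {loose_alpha:.2f}")
print(f"power with a one-tailed test: {one_tailed:.2f}")
print(f"n per group for 80% power: {n_needed:.0f}")
```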
Chapter 7. Confidence Intervals: The Precision of Results
• Confidence Intervals as Precision Estimates – Confidence intervals (CIs) quantify the precision of sample estimates by providing a range of plausible values for a population parameter.
• Range of Values – A confidence interval is centered on the point estimate, with lower and upper bounds determined by statistical calculations.
• Confidence Levels – The standard confidence level is 95%, meaning that in repeated sampling, 95% of CIs would contain the true population parameter. Higher confidence levels (e.g., 99%) yield wider intervals, while lower confidence levels (e.g., 90%) yield narrower intervals.
• Sampling Error and Confidence Intervals
o Random variability (sampling error) affects CI width.
o Larger samples yield narrower intervals with greater precision.
o Small samples produce wider intervals, reflecting greater uncertainty.
• Interpreting Confidence Intervals – A 95% confidence interval does not mean there is a 95% probability that the population parameter falls within it; rather, it means that if the study were repeated many times, 95% of the resulting intervals would contain the true value.
• Confidence Intervals and Statistical Significance
o If the null hypothesis value (e.g., 0 for mean differences) falls outside the CI, the result is statistically significant at α = 0.05 in a two-tailed test.
o If the null hypothesis value falls within the CI, the result is not statistically significant.
• Advantages of Confidence Intervals Over Significance Tests
o Provide more information than p values alone by indicating precision.
o Are centered on the effect size, helping evaluate practical significance.
o Reveal the direction and magnitude of effects, avoiding overreliance on arbitrary significance thresholds.
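A minimal sketch in Python (assuming NumPy and SciPy; the difference scores are simulated) that computes a 95% confidence interval for a mean and reads off its relationship to a two-tailed test against 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated difference scores (e.g., post - pre), invented for illustration
diffs = rng.normal(loc=2.0, scale=5.0, size=40)

mean = np.mean(diffs)
sem = stats.sem(diffs)                      # standard error of the mean
df = len(diffs) - 1

# 95% CI for the mean, based on the t distribution
low, high = stats.t.interval(0.95, df, loc=mean, scale=sem)

print(f"mean difference = {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
if low > 0 or high < 0:
    print("0 lies outside the CI: significant at α = .05 (two-tailed).")
else:
    print("0 lies inside the CI: not significant at α = .05 (two-tailed).")
```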
Chapter 8. Reproducibility: The Problem of False Findings
• Challenges to Reproducibility – Many published research findings fail to replicate, undermining trust in scientific conclusions.
• False Findings in Research – John Ioannidis argued that most published findings are false due to high false-positive rates, small sample sizes, flexibility in study designs and analyses, conflicts of interest, and competitive research environments.
• Replication Crisis
o Replication is a cornerstone of science, but replications are rare due to a focus on novelty in academic publishing.
o Large-scale replication efforts in psychology found that fewer than half of published findings were successfully replicated.
• Questionable Research Practices (QRPs) – Researchers often engage in flexible analysis and reporting strategies that increase false positives:
o Researcher Degrees of Freedom – The flexibility researchers have in designing studies, analyzing data, and reporting results can lead to inflated false-positive rates.
o p Hacking – Conducting multiple analyses and reporting only those that produce significant results (its impact on false positives is simulated in the sketch at the end of this chapter).
o HARKing (Hypothesizing After the Results are Known) – Presenting post hoc explanations as if they were planned in advance.
o Selective Reporting – Failing to report all experimental conditions, variables, or analyses, leading to biased literature.
• Bias in Peer Review – Peer review is designed as a quality control mechanism but has systemic flaws:
o Volunteer Nature of Reviewing – Reviewers are unpaid, leading to variable effort and care in evaluations.
o Anonymity and Accountability – Anonymous reviews encourage honesty but reduce accountability and recognition, leading to inconsistent diligence.
o Reviewer Errors – Even careful reviewers may fail to detect undisclosed multiple statistical tests or subtle questionable research practices.
o Shared and Idiosyncratic Biases – Reviewers may favor research aligned with their own views or overlook methodological flaws in studies that support a preferred narrative.
o Resistance to Criticizing Common Practices – Reviewers may avoid critiquing methods they themselves use, such as convenience sampling or reliance on online participant pools.
• Strategies to Improve Reproducibility –
o Replication Studies – Encouraging direct and conceptual replications to verify results.
o Pre-Registration – Researchers publicly post hypotheses, methods, and analysis plans before data collection to prevent p hacking and HARKing.
o Registered Reports – Journals accept studies for publication based on methodological quality before results are known, reducing publication bias.
o Open Science Practices – Making data, analysis code, and materials publicly available for verification and reanalysis.
o Comprehensive Reporting – Requiring full disclosure of all analyses, experimental conditions, and results to ensure transparency.
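A minimal simulation sketch in Python (assuming NumPy and SciPy; the study setup is invented) illustrating how p hacking (testing five outcome measures on null data and reporting only the smallest p value) inflates the false-positive rate well above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

N_STUDIES = 5_000       # simulated "studies" with no true effect
N_PER_GROUP = 30
N_OUTCOMES = 5          # outcome measures the researcher can choose among
ALPHA = 0.05

false_positives = 0
for _ in range(N_STUDIES):
    p_values = []
    for _ in range(N_OUTCOMES):
        # Both groups come from the same distribution: H0 is true
        a = rng.normal(size=N_PER_GROUP)
        b = rng.normal(size=N_PER_GROUP)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    # "p hacking": report only the analysis with the smallest p value
    if min(p_values) <= ALPHA:
        false_positives += 1

print(f"nominal false-positive rate: {ALPHA:.0%}")
print(f"observed rate with {N_OUTCOMES} outcomes to choose from: "
      f"{false_positives / N_STUDIES:.0%}")
```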
Chapter 9. Meta-Analysis: Pooling Results Across Studies
• Limitations of Narrative Reviews – Traditional literature reviews rely on subjective impressions, which can lead to bias and inconsistency:
o Reviewers form qualitative impressions rather than aggregating data statistically.
o Small primary studies have high variability, making it hard to detect true effects.
o Differences in study findings are difficult to interpret without statistical tools.
• Meta-Analysis as a Systematic Review Method – A quantitative approach that synthesizes research findings to provide a more objective and reproducible summary of evidence.
• Effect Size Aggregation – Instead of counting significant results, meta-analysis calculates effect sizes (e.g., Cohen’s d, correlation coefficients) and statistically combines them to estimate the overall effect.
• Heterogeneity Across Studies – Meta-analyses examine variability in results:
o Differences in methods, samples, or conditions may explain variations in effect sizes.
o Some variability is expected due to sampling error, but systematic differences require statistical analysis.
• Moderator Analysis – Investigates factors that influence effect sizes across studies, such as differences in study design, participant characteristics, or measurement techniques.
• Publication Bias
o Studies with significant results are more likely to be published, skewing the literature.
o Funnel Plots – A visual diagnostic tool for detecting asymmetry in study distribution, which may indicate missing studies.
o Trim-and-Fill Method – A statistical adjustment to estimate the true effect size in the presence of publication bias.
• The Eight Core Elements of a Meta-Analysis –
1. Clearly defined research question to guide the synthesis.
2. Systematic literature search to identify relevant studies.
3. Effect size extraction from each study, converting findings into a common metric.
4. Weighting of studies so larger or more precise studies contribute more to the overall estimate.
5. Computation of a summary effect size to quantify the overall pattern across studies.
6. Assessment of heterogeneity to determine variability in effect sizes.
7. Moderator analysis to explore factors that might influence results.
8. Evaluation of publication bias using funnel plots or statistical adjustments.
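A minimal sketch in Python (assuming NumPy and SciPy; the per-study effect sizes and standard errors are invented) of core elements 3–6: effect sizes on a common metric, inverse-variance weighting, a summary effect with a confidence interval, and a simple heterogeneity check:

```python
import numpy as np
from scipy import stats

# Invented per-study effect sizes (Cohen's d) and their standard errors
effects = np.array([0.42, 0.15, 0.55, 0.30, 0.08])
std_errors = np.array([0.20, 0.10, 0.25, 0.15, 0.12])

# Fixed-effect weighting: each study is weighted by 1 / variance,
# so larger, more precise studies contribute more to the summary
weights = 1 / std_errors**2
summary_effect = np.sum(weights * effects) / np.sum(weights)
summary_se = np.sqrt(1 / np.sum(weights))

# 95% confidence interval for the summary effect
z_crit = stats.norm.ppf(0.975)
ci_low = summary_effect - z_crit * summary_se
ci_high = summary_effect + z_crit * summary_se

# Simple heterogeneity check (Cochran's Q): a Q much larger than k - 1
# degrees of freedom suggests more variation than sampling error alone
Q = np.sum(weights * (effects - summary_effect) ** 2)

print(f"summary d = {summary_effect:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
print(f"Q = {Q:.2f} with df = {len(effects) - 1}")
```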