Overview
Week 3 focuses on the design of a study, building on last week’s content about study design, hypothesis, and theory-to-prediction work.
The first half is a recap with new examples; the second half extends to additional topics in research design.
This lecture emphasizes that the mid-semester exam will heavily cover design concepts such as controlling variables, reducing random variability, managing individual differences, and avoiding confounding variables.
Acknowledges traditional land custodians and ongoing cultural connections as context for the session.
Goals of science (clarifying "control").
Four goals of science discussed previously: predict, describe, explain, and control.
There was a moment of confusion over whether “explain” had been mentioned; the speaker confirmed that describe and control are key components, and that explanation involves understanding the mechanisms behind observed relationships.
When scientists say they are "controlling" a variable in the context of study design, they mean reducing unwanted variability to better reveal the effect of the manipulated variables, not the everyday sense of manipulating the phenomenon to demonstrate full understanding.
Analogy: a stop sign controls behavior in everyday life; in experiments, control means limiting sources of variability (noise) so that any observed effect can be attributed more confidently to the manipulations rather than extraneous factors.
Key concepts in design and variability
Four sources of variability to manage in quantitative designs:
Experimental (systematic) manipulation of the IV and random assignment to conditions.
Noise/random variability: unpredictable fluctuations across participants or trials.
Individual differences: stable differences between participants that can obscure effects.
Confounding variables: variables that covary with the IV and DV and offer alternative explanations for observed effects.
Distinction of “control” meanings:
In theory-building, control means understanding and manipulating a phenomenon; in design, control means minimizing extraneous variability to reveal causal relationships.
Signal-to-noise ratio: noise (extraneous variability) can obscure the true signal (the effect of interest).
Situational variables (noise): context features like room lighting, sitting position, or distractions that can affect performance independent of the manipulation.
Measurement error: inaccuracies in how outcomes are measured, which add noise and can mask true effects.
Example of measurement error history: past psych experiments relied on human observers and were prone to human error; modern computer-based measures reduce this source of error, but the historical methods illustrate why measurement error matters.
Experimental designs: true experiments, quasi-experiments, and correlational designs
True experiments (experimental design):
Directly manipulate the independent variable (IV).
Random assignment of participants to conditions.
Two core features: manipulation of IV and random assignment to control for extraneous variability.
Quasi-experiments:
Also called “almost experiments” because they are not fully randomized or manipulated in a controlled way.
Use existing groups (e.g., smokers vs. non-smokers) when random assignment is infeasible or unethical.
Strengths: feasible when true experiments aren’t possible; ecological validity can be higher.
Limitations: higher risk of confounds and lower causal inference strength; may require careful matching, but cannot guarantee equivalence.
Correlational designs:
No random assignment; variables are measured as they occur in the world.
Relationships are assessed with a statistical coefficient (e.g., Pearson’s r) and scatter plots.
Cannot infer causality because direction of effect and third-variable confounds cannot be ruled out.
Useful for naturalistic testing and when random assignment is not feasible; high ecological validity but limited causal claims.
Key terminology in different designs:
Experimental design: independent variable (IV) is manipulated; dependent variable (DV) is measured.
Quasi-experimental design: IV-like variable is used to group participants, but groups are not created by random assignment.
Correlational design: predictor is the variable hypothesized to predict the other; criterion (or DV) is the outcome measured.
In correlational designs, the predictor is typically on the x-axis of scatter plots; the criterion on the y-axis.
In correlational work, terms IV and DV may be used loosely, but best practice in some courses uses predictor and criterion to reflect causal direction assumptions.
Terminology: IV, DV, predictor, and criterion
Independent Variable (IV): the variable that the experimenter directly manipulates in an experiment.
Dependent Variable (DV): the outcome measured, presumed to be influenced by the IV.
In quasi-experiments: the IV-like variable is used to categorize participants, not randomly assigned.
In correlational designs: there is no manipulation and no fixed IV/DV; the variable hypothesized to influence the other is termed the predictor, and the other variable is the criterion (DV).
Note on terminology in class: tutors may refer to the predictor and criterion in correlational designs; using IV/DV in correlational contexts is generally acceptable but can be flagged as imprecise by some instructors.
Noise, measurement error, and confounds: how noise can derail signals
Situational variables can introduce noise (e.g., flickering light, room distractions).
Individual differences: natural variation between participants that can obscure the effect of the IV; more critical in between-subjects designs.
Measurement error: inaccuracies in outcome measurement (e.g., human scoring mistakes in early psych experiments) that obscure true effects.
Confounding variables: variables that covary with the IV and DV, offering alternative explanations for observed effects (e.g., lighting confounds, education level differences across generations in studies that compare age groups).
Example of confounding: if a memory task uses distractors in a room with flickering lights, it’s unclear whether observed effects are due to distraction or lighting.
Confounds threaten causal inference; eliminating systematic differences between groups narrows explanations to two primary possibilities:
Differences due to chance (random variability).
Differences due to the IV (the manipulation).
Hypotheses and statistical testing: null vs alternative, and what we test
Null hypothesis (H0, H_naught): there is no relationship between the IV and DV (or no difference between groups).
Alternative hypothesis (H1): there is a relationship or a difference; often the direction is hypothesized (e.g., distraction impairs memory).
Important point about hypothesis testing:
Statistical tests assess the null hypothesis; rejecting H0 suggests the observed results are unlikely due to chance, thus tentatively supporting the alternative.
We never directly test the alternative; rejection of H0 leaves us with the alternative as the plausible explanation given the data.
The alternative is always tentative because there are infinitely many plausible alternative explanations and we cannot test them all.
Practical implications for interpretation:
If results are significant, we say the null hypothesis is rejected; if not, we fail to reject the null.
The presence of variability in groups does not invalidate the null hypothesis; it reflects real-world variability that must be accounted for in design and analysis.
Approaches to testing: quasi-experiments, correlational studies, and true experiments
Quasi-experiments revisited:
Use existing groups to address questions when random assignment is not feasible or ethical.
Examples: cross-sectional “age and fluid intelligence” studies, where education differences across generations can confound results; later longitudinal designs mitigate these confounds and reveal more gradual declines in fluid intelligence with age.
Limitation: more vulnerable to confounds; stronger causal claims require careful design and interpretation.
Correlational research revisited:
Measures a relationship between two variables without manipulating them.
Pros: naturalistic setting; ecologically valid.
Cons: cannot infer causality; third variables or reverse causation may explain observed relationships.
Example: ultrasounds and birth weight; more ultrasounds may be associated with lower birth weight because high-risk pregnancies prompt more ultrasounds, not because ultrasounds cause low birth weight.
True experiments revisited:
True experiments are distinguished by random assignment and controlled conditions, enabling stronger causal inferences.
In practice, ethical and logistical constraints often necessitate quasi-experimental or correlational designs.
When evaluating causal claims, the first question is whether participants were randomly assigned to conditions (random assignment is essential for strong causal claims).
Random assignment helps ensure equivalence of groups, reducing potential confounds and distributing individual differences and other random factors evenly.
Randomization, control, and experimental integrity
Random assignment vs haphazard group allocation:
True random assignment uses a defined random process (e.g., computer-generated random numbers) to assign participants to conditions.
Historically, randomization used random-number tables or physical randomization; modern practice largely uses computers with a random seed based on variable inputs (e.g., current time) to ensure unpredictability.
Pseudo-randomness is usually sufficient for research purposes; truly random numbers are not strictly necessary for good practice.
Why random assignment matters:
Avoids systematic differences between groups (e.g., illness prevalence, personality traits) that could confound results.
In drug trials, randomization helps mitigate placebo effects and expectation biases by distributing participants' expectations evenly across groups.
Break and reminder on terminology:
The lecturer postpones a formal definition of “treatment” but uses the term to describe the experimental manipulation that is expected to have an effect (e.g., the distraction condition is the treatment in a memory task).
Practical example (randomization in practice):
Random allocation with group sizes up to 100 participants can be achieved using computer-generated randomization; pseudo-random seeds are fine due to sufficient unpredictability.
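A minimal sketch of computer-generated random assignment, assuming two conditions and equal group sizes. The function name, participant IDs, and condition labels are hypothetical and not from the lecture. A fixed seed is used here only so the example is reproducible; leaving `seed=None` lets Python seed the generator from system sources, which is closer to the time-based seeding described above.

```python
import random

def assign_to_conditions(participant_ids, conditions, seed=None):
    """Randomly assign participants to conditions in (near-)equal group sizes."""
    rng = random.Random(seed)         # pseudo-random generator; seed=None uses system sources
    shuffled = list(participant_ids)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)             # random order removes any systematic allocation
    # Deal the shuffled participants out to the conditions in turn, like dealing cards.
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}

participants = [f"P{n:03d}" for n in range(1, 33)]  # 32 hypothetical participants
conditions = ["self_tickle", "robot_tickle"]         # hypothetical condition labels
allocation = assign_to_conditions(participants, conditions, seed=2024)

group_sizes = {c: list(allocation.values()).count(c) for c in conditions}
print(group_sizes)  # {'self_tickle': 16, 'robot_tickle': 16}
```

Shuffling first and then dealing participants out keeps the group sizes balanced, whereas flipping a coin for each participant would be random but could give unequal groups.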
Independent groups (between-subjects) vs repeated measures (within-subjects) designs
Independent groups design (between-subjects):
Each participant is tested in only one condition.
Advantages: simple to implement; reduces carryover and learning effects within a participant; each person contributes one data point.
Disadvantages: higher susceptibility to random variability due to individual differences; typically less sensitive to detecting IV effects because of between-subject noise.
Example: tickling study where one group self-tickles and another group is tickled by a robot; random assignment balances individual differences across groups.
Repeated measures design (within-subjects):
Each participant experiences all conditions (e.g., control and treatment) and is measured in each.
Advantages: reduces variability due to individual differences since each person serves as their own control; increases statistical power and sensitivity to detect effects.
Disadvantages: susceptibility to order effects, fatigue, practice, and carryover effects that can confound results.
Counterbalancing as a solution: toggling the order of conditions across participants to distribute order effects evenly (e.g., ABBA design).
Key contrasts and implications:
Independent groups have more noise due to individual differences; repeated measures minimize that noise but introduce order-related confounds that counterbalancing aims to mitigate.
Example with the tickling task:
Independent groups: 16 participants per group, two separate tables of scores for self-tickling vs robot tickling.
Repeated measures: each participant completes both conditions, yielding two scores per participant, one per condition; data are typically shown with lines connecting the two scores for each participant to illustrate within-subject changes.
Practical design considerations:
Repeated measures designs require careful planning to avoid non-equivalent conditions due to carryover or fatigue; counterbalancing (e.g., ABBA) helps balance practice and fatigue across conditions.
Balanced design concept: symmetry in the order of conditions across participants helps ensure that order effects do not favor one condition over another.
When data collection is lengthy (e.g., EEG studies), you may prefer repeated measures for sensitivity but must plan for order effects; counterbalancing is often essential.
Carryover effects, order effects, and counterbalancing
Order effects: outcomes in later conditions are systematically influenced by having completed earlier conditions (e.g., fatigue, learning, practice effects).
Carryover effects: effects of a previous condition persist and influence subsequent conditions, complicating interpretation of the current condition.
Counterbalancing strategies:
ABBA counterbalancing: half of participants do A then B, the other half do B then A; helps balance both order and carryover effects.
Balanced design: arranging conditions so that potential confounds are evenly distributed across order sequences.
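The half-and-half ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the lecturer's procedure: the function name and participant numbers are hypothetical, and participants are shuffled before orders are dealt out so that who receives which order is itself random.

```python
import random

def counterbalance(participant_ids, orders=(("A", "B"), ("B", "A")), seed=None):
    """Give each participant a condition order, using every order equally often."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)  # randomize which participants receive which order
    # Alternate through the available orders so each is used (near-)equally often.
    return {pid: orders[i % len(orders)] for i, pid in enumerate(shuffled)}

schedule = counterbalance(range(1, 21), seed=7)  # 20 hypothetical participants
n_ab = sum(1 for order in schedule.values() if order == ("A", "B"))
n_ba = sum(1 for order in schedule.values() if order == ("B", "A"))
print(n_ab, n_ba)  # 10 10
```

Because each condition appears first for half of the participants and second for the other half, practice or fatigue should affect both conditions about equally.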
When counterbalancing cannot resolve concerns:
Some effects (like long-lasting learning across different teaching methods) may not be fully counterbalanced; in such cases, researchers may need to redesign the experiment or use alternative methods.
Data representation in repeated-measures studies:
For independent groups: two separate data tables by group with each row representing a participant.
For repeated measures: a single data table with a row per participant and columns for each condition; you can visualize with connected lines to show within-participant changes.
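The two data layouts can be sketched in a few lines. All scores below are made-up ticklishness ratings for illustration only, not data from the lecture, and the variable names are hypothetical.

```python
from statistics import mean

# Independent groups: one table per group, one score (row) per participant.
self_tickle_group = [2, 3, 1, 4]
robot_tickle_group = [5, 6, 4, 7]

# Repeated measures: one table, one row per participant, one column per condition.
repeated = {
    "P1": {"self": 2, "robot": 5},
    "P2": {"self": 6, "robot": 8},
    "P3": {"self": 1, "robot": 4},
    "P4": {"self": 4, "robot": 7},
}

# Between-subjects comparison: difference between the two group means.
between_difference = mean(robot_tickle_group) - mean(self_tickle_group)

# Within-subjects comparison: each participant's own change across conditions.
within_differences = {pid: row["robot"] - row["self"] for pid, row in repeated.items()}

print(between_difference)  # 3.0
print(within_differences)  # {'P1': 3, 'P2': 2, 'P3': 3, 'P4': 3}
```

In the repeated-measures table the participants differ a lot in overall level (P2 rates everything higher than P3), yet the within-participant differences are nearly identical. This is the sense in which each person serves as their own control.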
Real-world and classroom examples used in the lecture
1999 self-tickling experiment (Blakemore, Frith, and Wolpert): research on ticklishness, used to contrast within-subjects and between-subjects control, with a robot that can move in two ways (predictable vs unpredictable). The design emphasizes within-subject control and careful manipulation of the predictive element to elicit the hypothesized difference in ticklishness.
Spoon creativity task (hypothetical, but used to illustrate order effects): repeat task to measure creativity with different music conditions (pop vs classical); without counterbalancing, order effects would confound the effect of music type on creativity. Counterbalancing (or using two different tasks) is necessary to avoid this confound.
Grip-strength device study (hypothetical): random assignment to training device; potential confounder is researcher encouragement; solution is to equalize encouragement across groups or use a factorial design to study interaction effects (two-by-two design).
Data interpretation and visualization: what to look for
Independent groups design data illustration:
Separate distributions for each group; significant differences indicate potential effects of the IV but may be obscured by between-group variability.
Repeated measures data illustration:
Each participant contributes data to every condition; plots often show lines connecting a participant’s two scores to visualize within-subject changes.
Patterns to watch for:
Large within-subject consistency suggests strong treatment effects; large between-subject variability suggests potential noise that counterbalancing and proper randomization must address.
Non-overlapping distributions between groups in an independent groups design strengthen confidence in a treatment effect; overlapping distributions indicate higher measurement noise and lower sensitivity.
Reading, preparation, and assessment guidance
Reading assignments for the course progression:
Grove chapters 1 and 2 (required for this lecture and upcoming assessments).
UQ Extend modules: chapters 1 and 2 in prior weeks; chapter 3 is the novel one for this lecture.
Aaron textbook (6th or 7th edition acceptable) as additional context.
Quiz and exam guidance:
Quiz 3 opens in one hour and closes on Monday; covers content from this lecture and Grove readings.
Mid-semester exam date announced: Saturday, September 6; more information to come as the date approaches.
Practical study tips:
Focus on understanding the null hypothesis and the logic of rejecting/failing to reject it.
Be comfortable distinguishing between true experiments, quasi-experiments, and correlational designs; know the strengths and limitations of each.
Practice identifying potential confounds and proposing counterbalancing or design changes to mitigate them.
Review terminology (IV, DV, predictor, criterion, treatment, control) and when each term is most appropriate.
Summary of takeaways
The central aim of research design is to maximize the ability to detect true effects by minimizing confounding factors and random noise.
Experimental control is about reducing variability from situational factors, measurement error, and individual differences to isolate the effect of the IV on the DV.
There are three main quantitative designs with distinct trade-offs: true experiments (random assignment; strong causal inference), quasi-experiments (existing groups; ethically feasible but weaker causal claims), and correlational designs (no manipulation; high ecological validity but cannot establish causality).
Independent groups and repeated measures designs each have advantages and pitfalls; counterbalancing is essential in repeated measures to manage order and carryover effects.
Random assignment is critical in true experiments to ensure equivalence of groups and to minimize selection bias; modern practice uses computer-generated randomization with a random seed.
Real-world examples (tickling study, memory distraction, the education-across-generations confound in age studies) illustrate how these principles are applied, the kinds of confounds that can arise, and the strategies used to address them.
Acknowledgement and Course Context
Welcome note and course focus: measurement, frequency distributions, and percentiles; gentle introduction to numbers.
Mid-semester exam scope: weeks 1–4; exam scheduled for Saturday, September 6 (announced on Blackboard).
Course trajectory: earlier weeks covered scientific process, study design, and questions in psychology; this week moves to data after collecting numbers.
Practical relevance: data cleaning, exploration, and plotting are essential across assignments, in honors year, and in the research process.
Measurement, Constructs, and the Philosophy of Measurement
Measurement goal: assign numbers to objects/observations according to consistent rules (operational definitions).
Constructs in psychology: psychological phenomena like anxiety, memory that are not directly observable but are labeled and studied.
Operational definition: boundaries/criteria to determine whether a phenomenon (construct) is present in a measured instance (e.g., infant imitation). Researchers may disagree; scientific discourse can refine definitions over time.
Observable phenomena and empiricism: measurement relies on observable, checkable, verifiable evidence shared openly for replication.
Scientific disagreement as progress: debate over definitions/methods pushes for better processes.
Variables: Types, Qualities, and Scales
Variable: a characteristic of interest for each individual in a population/sample (e.g., memory capacity, anxiety).
Qualitative vs. quantitative variables:
Qualitative: categories/labels (e.g., gender, eye color, political affiliation); not meaningful to compute averages.
Quantitative: numeric measures (e.g., height, weight, income); meaningful to apply statistics.
Coding and measurement rules:
Numbers can be used as labels (e.g., 0/1 coding for deceased/alive) but not all label-numbers support arithmetic operations.
Types of variables (overview, basic):
Discrete: whole-number values (no meaningful halves). Example: number of cars observed in a period.
Dichotomous: a discrete variable with only two possible values (e.g., yes/no; male/female; correct/incorrect).
Continuous: any value within a range (e.g., height, volume).
Measurement scales (order of sophistication):
Nominal: labels without meaningful order; e.g., color categories, political parties, jersey numbers.
Ordinal: ordered categories where order matters but intervals are not necessarily equal.
Interval: ordered with meaningful differences between values, but no meaningful zero. Example discussed: IQ differences; temperature scales like Celsius.
Ratio: interval properties plus a meaningful zero, allowing ratios (e.g., height, weight, Kelvin temperature).
Examples and nuances:
IQ: ordinal → interval when actual scores provided; distance between scores meaningful.
Temperature: Celsius is interval (differences meaningful) but lacks a true zero; Kelvin is ratio (has meaningful zero).
Age: often treated as ratio (meaningful zero) in many contexts; sometimes discussed as interval in teaching contexts.
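A small worked example of the interval vs ratio distinction for temperature. The two temperatures are arbitrary illustrative values; the only outside fact used is the standard conversion K = °C + 273.15.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (K = C + 273.15)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0  # arbitrary illustrative temperatures

# Differences are meaningful on an interval scale and survive the conversion.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are only meaningful when zero is a true zero (Kelvin, not Celsius).
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference: {difference_c:.1f} C vs {difference_k:.1f} K")
print(f"ratio: {ratio_c:.3f} in Celsius vs {ratio_k:.3f} in Kelvin")
```

The 10-degree difference is the same on both scales, but the "twice as hot" ratio of 2 in Celsius shrinks to roughly 1.035 in Kelvin, which is why ratio statements are reserved for ratio scales.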
Implications of scale choice for analysis: the chosen scale constrains which statistics and claims are valid.
Practical examples in measurement:
Eye color as nominal; cannot average eye color.
Height as ratio; allows means, proportions, comparisons like “twice as tall.”
How to report numbers: use of consistent labels and units; interpretability depends on scale properties.
Reliability and Validity of Measures
Reliability: consistency of a measure across time or raters.
Test-retest reliability: administering the same test twice should yield similar scores if the underlying trait is unchanged.
In practice, perfect identical scores are unrealistic due to day-to-day variation (e.g., sleep, mood).
Reliability is quantified via correlation between scores across occasions:
If scores on Test 1 and Test 2 are highly correlated, reliability is high.
Inter-rater reliability: when multiple raters judge the same thing (e.g., video ratings), their scores should be correlated.
Typical adequacy: correlations around 0.60 or higher are considered acceptable for reliability in many contexts.
Example: alpha waves as a biological fingerprint show very high test-retest reliability over months (almost identical scores).
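As a sketch of how test-retest reliability is quantified, the snippet below computes a Pearson correlation between two testing occasions. The scores are hypothetical, and the correlation is written out by hand so each step is visible; in practice a statistics package would normally be used.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same six people tested on two occasions.
test_1 = [12, 15, 9, 20, 17, 11]
test_2 = [13, 14, 10, 19, 18, 10]

reliability = pearson_r(test_1, test_2)
print(round(reliability, 2))
```

Nobody scores identically on both occasions, yet people keep roughly the same rank order, so the correlation is high. The same function could be applied to two raters' scores to estimate inter-rater reliability.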
Validity: whether a measure actually assesses the intended construct.
Internal validity: the extent to which observed effects are due to the manipulated variables, not confounds.
External validity: generalizability of results beyond the lab to real-world settings (issues with WEIRD samples: Western, Educated, Industrialized, Rich, Democratic).
Construct validity: whether the measure truly taps the theoretical construct (e.g., Beck Depression Inventory potentially overlapping with anxiety items; concerns about how well items map to depression construct).
Content/face validity: whether the measure appears to assess the intended construct on the surface (e.g., mental math tests appearing to measure math ability; head circumference appearing to measure head size, not intelligence).
Predictive validity: the extent to which scores on a measure predict related outcomes (e.g., ATAR predicting university performance).
Other validity considerations:
Construct validity and evolving measures: early measures may drift as constructs are better understood; poor initial alignment may be revised.
Content/face validity distinctions: a measure can be reliable but have low face validity if it doesn’t intuitively fit the construct.
Reliability vs validity relationship: a measure can be reliable but not valid; it must measure what it intends to measure to be useful.
Pilot Testing, Range Effects, and Study Design Considerations
Pilot testing: iterative testing of experimental design and stimuli to ensure the measurement range is appropriate.
Goals: avoid floor effects (too hard) and ceiling effects (too easy); ensure middle-range performance to observe differences.
Real-world example: quick demonstration with speed of stimulus presentation; initial results suggested adjustments to avoid near-zero performance.
Range effects and measurement quality:
Ceiling effect: all participants perform near the top; little room for differentiation.
Floor effect: all participants perform near the bottom; little room for differentiation.
Ideal measures sit in a middle range to maximize sensitivity to group differences.
Pilot testing as a standard in research: many published studies include extensive pilot work not visible in the final paper.
Study design considerations discussed earlier in the course:
Types of studies: experimental, randomized controlled trials; observational, quasi-experimental, correlational.
Randomization and control groups as tools to manage confounds.
Independent groups design vs. repeated measures design; counterbalancing as a method to balance potential confounds.
Construct-focused design notes: importance of naming and constructing meaningful constructs before measurement.
Data Presentation, Exploration, and Cleaning
Purpose of data presentation: to tell a clear story about results using figures and tables rather than lengthy narrative only.
Data are often messy: human data can include errors, nonsensical responses, and noise; cleaning is essential before analysis.
Data cleaning and exploration steps:
Inspect raw data to identify values outside plausible ranges (e.g., 0–10 scales with a value of 20).
Look for transcription or entry errors (e.g., too-high values in a given scale).
Clean data and summarize before performing analyses.
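The range check in the first step can be sketched as below, assuming responses on a 0–10 scale. The function name and the raw values are hypothetical. The function only flags suspicious entries; whether to correct or exclude them is a judgment the researcher still has to make.

```python
def find_out_of_range(responses, low=0, high=10):
    """Return (row_index, value) pairs for values outside the plausible range."""
    return [(i, value) for i, value in enumerate(responses)
            if not (low <= value <= high)]

raw = [7, 3, 20, 10, 0, -1, 5]  # hypothetical responses on a 0-10 scale
problems = find_out_of_range(raw)
print(problems)  # [(2, 20), (5, -1)]
```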
Data organization example: raw data matrix (100 students × 10 true/false questions) vs. summarized representations.
Summary representations help reveal patterns quickly:
Frequency tables: list all possible scores and the count of observations per score.
Frequency of 0–10 scores example: helps identify most common scores and check data integrity.
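A minimal sketch of building such a frequency table, assuming scores from 0 to 10. The quiz scores are made up for illustration. Listing every possible score, including those nobody obtained, makes gaps and impossible values easier to spot.

```python
from collections import Counter

def frequency_table(scores, possible_scores):
    """Return (score, count) pairs for every possible score, including zeros."""
    counts = Counter(scores)
    return [(score, counts.get(score, 0)) for score in possible_scores]

quiz_scores = [7, 8, 8, 5, 10, 7, 8, 6, 9, 7, 8, 3]  # hypothetical scores
table = frequency_table(quiz_scores, range(0, 11))

for score, count in table:
    print(score, count)
```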
Frequency tables vs. variability in data:
With many possible scores, frequency tables become unwieldy; interval-based bins improve readability.
Rule of thumb: 10–20 intervals (bins) balance granularity and interpretability; 15 bins often cited as a good middle ground.
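Grouping scores into intervals can be sketched as follows, assuming whole-number values and equal-width bins. The values and the function name are hypothetical, and only five bins are used to keep the example short (the rule of thumb above suggests more for real data).

```python
def grouped_frequencies(values, start, width, n_bins):
    """Count whole-number values in n_bins adjacent intervals of equal width.

    Each interval is labelled with inclusive limits (e.g. 60-64, 65-69),
    so no value can fall into two bins.
    """
    counts = [0] * n_bins
    for value in values:
        index = (value - start) // width  # which interval this value belongs to
        if not 0 <= index < n_bins:
            raise ValueError(f"{value} lies outside the binned range")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width - 1, counts[i])
            for i in range(n_bins)]

weights = [61, 63, 65, 68, 70, 72, 74, 74, 77, 81]  # hypothetical values
bins = grouped_frequencies(weights, start=60, width=5, n_bins=5)

for low, high, count in bins:
    print(f"{low}-{high}: {count}")
```

Raising an error for values outside the binned range is a deliberate choice in this sketch: silently dropping them would hide exactly the kind of entry error that data cleaning is meant to catch.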
Interpreting frequency data:
Relative frequency: proportion of observations in each bin: $\text{relative frequency} = \frac{f}{N}$
Cumulative frequency (CF): total observations with scores at or below a given bin.
Percentiles: boundaries where a given percentage of scores fall below that value.
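The relative and cumulative frequency columns can be derived from a plain frequency table as sketched below. The three-row table is hypothetical, and the dictionary keys (`rel_f`, `cf`) are illustrative names only.

```python
from itertools import accumulate

def summarize(frequencies):
    """Add relative and cumulative frequency to (score, f) rows sorted by score."""
    n = sum(f for _, f in frequencies)                       # total observations N
    cumulative = list(accumulate(f for _, f in frequencies))  # running total of f
    return [
        {"score": score, "f": f, "rel_f": f / n, "cf": cf}
        for (score, f), cf in zip(frequencies, cumulative)
    ]

rows = summarize([(1, 2), (2, 3), (3, 5)])  # hypothetical frequency table
for row in rows:
    print(row)
```

The last cumulative frequency should always equal N, and the relative frequencies should sum to 1, which gives a quick check on the table.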
Practical examples: weights of 72 male students; intervals like 60–64, 65–69, etc.; note about inclusive/exclusive bin definitions to avoid overlaps.
Why include empty/zero-edge bins: to enable certain plots (e.g., frequency polygons) that require zero values at the ends.
Frequency polygons and alternative plots:
Frequency polygon visually connects bin midpoints to show distribution shape.
Bar graphs for nominal data; histograms for continuous data with touching bars to show continuity.
Box-and-whisker plots provide information about median, interquartile range, and extremes.
Bar graphs vs. histograms:
Bar graphs: for qualitative (nominal) data; bars not touching; order is flexible to aid readability.
Histograms: for continuous data; bars touch to indicate continuity between bins; bin intervals matter.
Frequency polygons for multiple groups:
Example with male actual weight vs. male ideal weight; female weights and ideal weights plotted to compare distributions.
Telling a story with graphs:
Well-chosen figures reveal patterns and differences (e.g., male vs. female weight patterns and ideal vs. actual weights).
Graphs should be designed to convey a clear message, guiding interpretation.
Summary points for data presentation:
Sift, clean, and present data so a reader can understand at a glance.
Good figures prepare the data for inferential tests (e.g., verifying assumptions, handling missing data, removing outliers).
Choose graph types that best fit the data type (qualitative vs. quantitative) and the story you want to tell.
Use appropriate intervals (bins) when constructing histograms/frequency polygons.
Percentiles, Cumulative Frequencies, and Practical Calculations
Percentile concept: the value below which a specified percentage of scores fall.
90th percentile: 90% of scores are below this value.
Percentiles are computed by ranking scores and locating the boundary that separates the specified percentage of data.
Relative frequency vs. cumulative frequency:
Relative frequency: proportion of the total represented by a score/bin: $\text{relative frequency} = \frac{f}{N}$
Cumulative frequency (CF): sum of frequencies for all scores up to a given point.
Percentile calculation method:
Percentile rank = $\frac{\text{CF}}{N} \times 100$
To find the percentile of a given score, determine CF up to that score and divide by N, then multiply by 100.
Example walkthrough: with a table of scores and frequencies, CF is calculated by summing frequencies up to the target score; percentile = (CF / N) × 100.
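The percentile-rank calculation above can be sketched as a short function. The frequency table is hypothetical (20 scores in total) and the function name is illustrative.

```python
def percentile_rank(frequencies, score):
    """Percentile rank of a score: (CF / N) x 100, with CF counted at or below it."""
    n = sum(frequencies.values())
    cf = sum(f for s, f in frequencies.items() if s <= score)  # cumulative frequency
    return 100 * cf / n

# Hypothetical table mapping score -> frequency (N = 20).
class_scores = {21: 2, 22: 2, 23: 3, 24: 5, 25: 4, 26: 4}

print(percentile_rank(class_scores, 23))  # 35.0
print(percentile_rank(class_scores, 26))  # 100.0
```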
Inverting percentile calculations:
To find the score corresponding to a given percentile p, compute CF = (p/100) × N, then locate the score bin whose cumulative frequency reaches CF.
Worked example (class scores):
Suppose a small table with scores and frequencies; N = 20; to find the 35th percentile: compute CF = 0.35 × 20 = 7; find the score with CF of 7 (e.g., a score of 23). Therefore, 35th percentile corresponds to score 23.
Interpretation: a student scoring 23 did better than 35% of the class.
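The inverse calculation can be sketched the same way. The frequency table below is hypothetical; it was constructed so that N = 20 and the cumulative frequency reaches 7 at a score of 23, matching the worked example, but the individual frequencies are not from the lecture.

```python
def score_at_percentile(frequencies, p):
    """Return the lowest score whose cumulative frequency reaches (p/100) x N."""
    if not 0 < p <= 100:
        raise ValueError("p must be between 0 (exclusive) and 100 (inclusive)")
    n = sum(frequencies.values())
    target_cf = p * n / 100          # cumulative frequency to reach
    running = 0
    for score in sorted(frequencies):
        running += frequencies[score]  # cumulative frequency up to this score
        if running >= target_cf:
            return score

# Hypothetical table mapping score -> frequency (N = 20).
class_scores = {21: 2, 22: 2, 23: 3, 24: 5, 25: 4, 26: 4}

print(score_at_percentile(class_scores, 35))  # 23
```

When the target cumulative frequency falls between two rows, this sketch returns the first score that reaches it; textbooks sometimes interpolate within the bin instead, so answers can differ slightly between methods.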
More advanced example: hours of TV watched by 259 students (data from a lecture):
Distribution across categories (0–1, 2–3, 4–5, etc.) with cumulative frequencies calculated up to seven hours.
To find the percentile for seven hours, compute CF up to 7 hours and divide by 259, then multiply by 100; here, around the 63rd percentile.
Frequency polygon interpretation example:
Shade region left of a percentile boundary to visualize the proportion of data below that boundary (e.g., 63% area under the curve to the left of 7 hours).
Modern data practices: most computing of percentiles and other statistics is done with software, but understanding the underlying calculations is essential for intuition and debugging.
Summary of percentiles in reporting:
Percentiles provide meaningful benchmarks (e.g., “above 80% of the class”).
Use cumulative frequency and N carefully to avoid misinterpretation; ensure you read the correct cell when using rearranged equations.
Graph Types and Data Storytelling: Choosing the Right Display
Bar graphs (qualitative data):
Display counts per category; y-axis scale should reflect observed counts; bars should not touch to emphasize categorical separation.
Ordering the categories can help readers see patterns; the order is not inherently meaningful for nominal data but can aid interpretation.
Histograms (quantitative data):
Bars touch to indicate a continuum between bins; choose bin width carefully to reveal distribution shape without over-smoothing or over-fragmentation.
Box-and-whisker plots: quick view of distribution shape, median, interquartile range, and extremes.
Frequency polygons: smooth representation of distributions by connecting bin midpoints; useful for comparing distributions (e.g., groups vs. groups).
Practical storytelling with plots:
Use a graph to illustrate differences (e.g., male actual vs. ideal weight distributions) and to compare groups (e.g., male vs. female weight patterns).
Good plots support the narrative of your results and help convey your claims effectively.
Practical Advice for Exam Preparation and Next Steps
Data workflow in research:
Design measure with appropriate scale; pilot test to refine range and avoid floor/ceiling effects.
Collect data, then clean and explore before formal analysis.
Create figures that tell a story; choose graphs that fit the data type and the message.
Prepare for inferential statistics by ensuring data meet assumptions (normality, etc.).
Mathematical basics to brush up for next week:
Σ notation:
Distribution Shape
Three features to characterize a distribution: shape, central tendency, and variability.
Shape asks: What is the overall form of the distribution?
Normal distribution (bell curve, Gaussian) is a key reference shape in this course; many variables approximate it when there is enough data.
Visualizing shape: use histograms and frequency polygons; binning choice affects how features are seen.
Small to moderate samples (typical in psychology): using about 10–20 bins (e.g., ~15) is common; very large data sets allow more detailed binning.
Example with many data points (heights of 5,000 high school boys): using many bins reveals a smooth, continuous bell-shaped curve that matches the normal distribution.
Normal distributions enable a lot of math tricks and inferences; we’ll exploit this next week with Z scores.
Real data often depart from normality: small samples show deviations; most statistics discussed (correlations, t tests) are robust to small normality deviations.
Positively skewed distributions: tail extends to the right (e.g., house prices, reaction times).
Negatively skewed distributions: tail extends to the left (e.g., exam scores in an easy course, where most scores sit near the top of the scale).
Skewness affects which measures of central tendency are most appropriate; skew and ceiling/floor effects matter for interpretation.
If distribution is roughly symmetric and bell-shaped, mean, median, and mode are close to one another.
If distribution is not symmetric, middle measures diverge in informative ways (median often preferred for skewed data).
Looking ahead: next week we’ll see how Z scores relate to standardization and later to t tests.
Central Tendency: Mean, Median, and Mode
Central tendency answers: where do most scores cluster around?
Mode (most frequent value)
Simple to identify; useful for nominal data (eye color, political preference, etc.).
Example: data 1-2-3-3-4-4-5-5-5-6-7-7, mode is 5; if another value ties, the distribution becomes bimodal (e.g., modes at 5 and 7).
Strengths: unchanged by extreme scores; represents the most common value.
Weaknesses: can be unstable with small samples; not informative for most statistical calculations.
Important note: mode is the only sensible descriptor for strictly nominal data; it cannot be used for many inferential procedures.
Median (middle value of ordered data; 50th percentile)
Calculation: order data; if odd n, the middle score; if even n, the average of the two middle scores.
Robust to extreme scores; good for skewed distributions (e.g., house prices).
Example: data with 6 scores: 10, 20, 30, 40, 50, 60 → median is the average of the 3rd and 4th values (here (30+40)/2 = 35).
In skewed data, the median better represents a typical value than the mean.
In news media, the median is often used for reported incomes or house prices because it’s less affected by extreme values.
Mean (arithmetic average)
Formula for a sample of scores $x_1, x_2, \dots, x_n$: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
The mean is the balancing point or fulcrum of the distribution; it uses every score in the dataset.
Strengths: most informative statistic; mathematically convenient; basis for many formulas and tests; tends to be relatively stable with more data.
Weaknesses: sensitive to extreme scores (outliers) and skewed distributions; can be a poor summary of the center when data are highly skewed.
Notation and population vs. sample
Sample: use regular Latin letters; mean is denoted as $\bar{x}$ or sometimes $m$ in this course.
Population: use Greek letters; the population mean is $\mu$ (mu).
The sample mean is an unbiased estimator of the population mean: over many repeated samples, the average of the sample means converges to the true population mean.
Practical guidance on choosing a measure
Symmetric, unimodal distributions: mean ≈ median ≈ mode; mean is often used.
Skewed distributions or distributions with outliers: median is often a better descriptor of a “typical” value; mode can be informative for nominal-type data but not for most numeric analyses.
Bimodal distributions: mode(s) are informative; mean/median may be less representative of the most typical values.
Examples illustrating central tendency choices
Salary example (skewed distribution): six salaries, one very high at the top drags the mean above most values; median (e.g., 50,500) better represents a typical salary in a skewed dataset; mode (e.g., 38,000) may reflect the most common salary but not the typical value for planning.
Bi-modal example (playground ages vs. parents’ ages): two modes (young and older group) suggest reporting the modes rather than the mean/median alone.
Summary guidance for central tendency measures
Mode: useful for nominal data; best when reporting “the most frequent category.”
Median: robust to outliers and skew; preferred for skewed distributions.
Mean: uses all data; most informative in symmetric distributions; sensitive to outliers; useful for further calculations and inferential statistics.
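The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch: the first two datasets are the mode and median examples from these notes, while the `salaries` list is hypothetical, invented only to illustrate a skewed case with one extreme value.

```python
import statistics

# Data from the mode example in the notes.
scores = [1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7]
mode_value = statistics.mode(scores)            # most frequent score

# Data from the median example (even n: average the two middle scores).
six_scores = [10, 20, 30, 40, 50, 60]
median_value = statistics.median(six_scores)

# Hypothetical salaries with one very high value (not the lecture's data).
salaries = [38_000, 38_000, 45_000, 56_000, 60_000, 250_000]
salary_mean = statistics.mean(salaries)
salary_median = statistics.median(salaries)

print(mode_value, median_value)
print(round(salary_mean), salary_median)
```

In the hypothetical salary data the single extreme value pulls the mean above five of the six salaries, while the median stays near the middle of the group.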
Variability: Range, Variance, and Standard Deviation
Variability measures describe how spread out the scores are around the center.
Range
Definition: difference between the highest and lowest score.
Example: two datasets with the same center can have different spreads; range can be similar even if data are very differently distributed in between.
Drawbacks: highly sensitive to extreme scores; provides minimal information about the distribution beyond the endpoints.
Deviation scores
Definition: deviation of each score from the mean: $d_i = x_i - \bar{x}$.
Deviations from the mean always sum to zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
Variance: the mean squared deviation, $s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
Sum of squares: $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$, so $s^2 = \frac{SS}{n}$.
Standard deviation: the square root of the variance, $s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }$.
Worked example: $x = [2, 4, 8, 10]$, $n = 4$, $\bar{x} = \frac{2+4+8+10}{4} = 6$.
Deviations: $d = [2-6, 4-6, 8-6, 10-6] = [-4, -2, 2, 4]$.
Squared deviations: $d^2 = [16, 4, 4, 16]$.
Sum of squares: $SS = 40$; $s^2 = SS/n = 40/4 = 10$; $s = \sqrt{10} \approx 3.16$.
A second example from the lecture: $s^2 = 168/10 = 16.8$, $s = \sqrt{16.8} \approx 4.10$.
Standard deviation and the normal distribution: the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, and $\bar{x} \pm 3s$ contain about 68%, 95%, and 99.7% of scores.
This rule helps interpret how typical values lie relative to the mean in a normal distribution and underpins standardization via Z scores.
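The worked example with x = [2, 4, 8, 10] can be reproduced step by step. This is a minimal sketch that divides by n, as the descriptive formula in these notes does; the variable names are illustrative.

```python
import math

x = [2, 4, 8, 10]
n = len(x)
mean = sum(x) / n

deviations = [xi - mean for xi in x]        # each score minus the mean
ss = sum(d ** 2 for d in deviations)        # sum of squared deviations
variance = ss / n                           # divides by n, as in these notes
sd = math.sqrt(variance)

print(sum(deviations))                      # deviations from the mean sum to zero
print(ss, variance, round(sd, 2))
```

The standard library's `statistics.pvariance` and `statistics.pstdev` use the same divide-by-n definition, so they can serve as a cross-check.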
Practical implications of variability
Low variability around the mean means individuals are close to the mean; in school planning, you can tailor a lesson around the mean with confidence that most students perform similarly.
High variability means some individuals will be far from the mean; teaching, testing, or evaluation should accommodate a broader range of abilities.
In decision-making (e.g., selecting players, setting policies), knowing variability informs risk and planning (e.g., two players with same mean but different variability differ in reliability).
Population vs. Sample; Parameters vs. Statistics
Population vs. sample concepts
Population: the entire group of interest (e.g., all psych 1040 students, all Australians, all humans).
Sample: a subset drawn from the population (ideally randomly) to estimate population characteristics.
Notation and terminology
Population mean: $\mu$ (mu), a parameter (true mean of the population).
Sample mean: $\bar{x}$ or sometimes $m$, a statistic used to estimate the population mean.
The idea of an estimator: a statistic (like $\bar{x}$) used to estimate a population parameter (like $\mu$).
Unbiasedness of the sample mean: across repeated random samples, the average of the sample means converges to the true population mean.
Why sampling matters
In practice, you rarely measure the entire population due to cost and feasibility; random sampling provides estimates that are informative about the population.
The sample mean as an estimator is central to many statistical methods; its unbiasedness supports inferences about the world.
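A small simulation can make the unbiasedness idea concrete. This is a sketch under stated assumptions: the population below is hypothetical (10,000 simulated scores with mean near 100 and SD near 15), and the sample size of 25 and the 2,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 10,000 scores.
population = [random.gauss(100, 15) for _ in range(10_000)]
mu = statistics.mean(population)

# Draw many random samples of size 25 and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 25))
    for _ in range(2_000)
]
mean_of_means = statistics.mean(sample_means)

print(round(mu, 2), round(mean_of_means, 2))
print(round(min(sample_means), 2), round(max(sample_means), 2))
```

Individual sample means scatter above and below the population mean, but their average lands very close to it, which is what unbiasedness describes.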
Population parameters vs. sample statistics in research practice
Population parameter examples: population mean $\mu$, population variance, etc. (unknown in most real-world cases).
Sample statistic examples: sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$; sample variance $s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$; sample standard deviation $s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }$; with $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
Normal distribution intuition (68-95-99.7 rule):
Within one standard deviation: $\bar{x} - s \leq x \leq \bar{x} + s$ contains about 68% of the data; within two standard deviations contains about 95%; within three contains about 99.7%.
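The percentages in this rule can be recovered from the standard normal curve. A minimal sketch using `statistics.NormalDist` from Python's standard library (the dictionary name `areas` is illustrative):

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1

areas = {}
for k in (1, 2, 3):
    # Proportion of a normal distribution within k SDs of the mean.
    areas[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(areas[k] * 100, 1))
```

The three areas come out at roughly 68.3%, 95.4%, and 99.7%, which the rule rounds to 68, 95, and 99.7.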
Z-Scores and the Normal Distribution (Lecture Notes)
Acknowledgment of country: recognize the traditional owners of the lands where we meet, their ancestors and descendants, and their cultural and spiritual connections to country, acknowledging their contributions to Australian and global society and that these lands have been sites of education and research for millennia.
Week-to-week building: this week builds on last week's topic (distributions) and sets up next week's focus (correlations), which rely on z-scores. Correlations use z-scores in their calculation and are a core method in research and the basis of the upcoming assignment.
Recap: distributions have three key characteristics to describe them:
Shape
Measure of central tendency (where the center sits; the middle score)
Measure of spread (how spread out the scores are)
A normal distribution (bell-shaped curve, Gaussian) is symmetric around its central tendency (the mean); symmetry is one defining feature of the bell shape, although symmetry alone does not make a distribution normal.
Core concept for this week: z-scores are raw scores re-expressed in units of standard deviations (SD) from the mean. They standardize scores from any distribution, enabling comparisons across different scales.
Connection to standard deviation (SD): we already learned how to compute SD last week. Z-scores convert raw scores into standard deviation units using that SD. This is a unit conversion, not a change in the relative position of scores.
Why standardization matters: converting to z-scores allows comparisons across apples and oranges (different scales), and enables precise probability calculations under the standard normal curve.
What you’ll learn and how it fits into the course:
This week builds on SDs and normality; next week covers correlations, which use z-scores and form the basis for the assignment.
The normal distribution is a mathematical construct that can be described with a formula, allowing calculations beyond direct empirical data, and enabling a uniform framework across many variables.
What is a normal distribution? a family of symmetric, bell-shaped curves where:
The mean, median, and mode coincide (in a perfectly normal distribution).
The spread is characterized by SD; different distributions can have different means and SDs.
With enough data, many real-world variables (e.g., height, IQ) approach this ideal shape.
Why can we rely on a normal shape? The central limit theorem underpins parametric tests (e.g., t-tests). If the data are roughly normally distributed or the sample size is large, many statistical procedures work well and yield inferences about populations.
The standard deviation as a central concept:
It reflects the typical distance of scores from the mean.
It is the unit used to express dispersion; converting data to SD units yields z-scores.
Units and conversions (intuition):
Converting to SD units is a unit change, akin to converting height between centimeters and inches. The actual value doesn’t change, only the label changes.
Example intuition: Phil’s height (in inches vs. cm) is the same measurement represented differently; the same goes for standard deviation when converting to z-scores.
What is a z-score? The deviation of a score from the mean, expressed in units of SD.
Positive z-scores: above the mean; negative z-scores: below the mean; z ≈ 0: around the mean.
Z-score is a standard score: it converts raw scores into standard units, enabling comparisons across distributions.
Notation for mean and SD (population vs. sample):
Population mean: μ (mu)
Population SD: σ (sigma)
Sample mean: m (often denoted $\bar{x}$ or $M$ elsewhere; here referenced as m)
Sample SD: s.d. (often denoted as s or $s_d$)
In practice, if a formula uses μ and σ you’re dealing with population parameters; if it uses m and s (or s.d.) you’re dealing with a sample.
The z-score formula (transformation to standard normal):
For a score x from a distribution with mean μ and SD σ:
$z = \frac{x - \mu}{\sigma}$, which rearranges to $x = z\sigma + \mu$.
For a score from a sample with mean m and SD s: $z = \frac{x - m}{s}$, which rearranges to $x = zs + m$.
Why z-scores are useful:
They put different distributions on a common scale (standard normal with mean 0 and SD 1).
They allow meaningful comparisons across different measures (e.g., test scores from different subjects).
They enable precise probability statements about where a score lies in its distribution, using the standard normal curve.
Three distributions: a demonstration of standardization across different spreads:
Case A: mean = 100, SD = 10, score x = 110
Deviation = 110 - 100 = 10; z = 10/10 = 1.
Case B: mean = 100, SD = 15, score x = 110
Deviation = 10; z = 10/15 ≈ 0.67.
Case C: mean = 100, SD = 25, score x = 110
Deviation = 10; z = 10/25 = 0.40.
Observation: the same raw score (110) is more unusual in Case A (SD = 10, z = 1) than in Case C (SD = 25, z = 0.40): a larger SD means more spread, so a 10-point deviation is less remarkable. Z-scores express each score's position relative to its own distribution.
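The three cases can be computed with one small function. A minimal sketch; the function name `z_score` and the dictionary layout are illustrative choices.

```python
def z_score(x, mean, sd):
    """Deviation of a score from the mean, in standard deviation units."""
    return (x - mean) / sd

# SD for each case; the mean is 100 and the raw score is 110 throughout.
case_sds = {"A": 10, "B": 15, "C": 25}
z_by_case = {name: z_score(110, 100, sd) for name, sd in case_sds.items()}

for name, z in z_by_case.items():
    print(name, round(z, 2))
```

The raw deviation is 10 points in every case; only the SD used as the yardstick changes, and with it the z-score.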
Why different raw scores can be equally 'unusual' across distributions:
If two distributions have the same relative position (e.g., 1 SD above the mean) but different spreads, the z-score places them at the same relative location in standard units. This allows fair comparison of scores from different contexts.
Practical example: comparing two courses with different grading schemes
If calculus has mean 60 and SD 10, and another subject has mean 90 and SD 15, a calculus score of 70 and a score of 105 in the other subject cannot be compared from the raw numbers alone. Converting to z-scores shows where each sits relative to its own course’s distribution: (70 - 60)/10 = 1 and (105 - 90)/15 = 1, so the two performances are equivalent.
Real-world examples (illustrative):
Don Bradman (cricket) vs. Ted Williams (baseball):
By converting their sport-specific scores to z-scores, you can compare dominance within their peers despite different scales and sports.
Example approximations mentioned in the lecture: Bradman’s z-score was extremely high (described as around 4 to 5 SDs above the mean in the example). Williams also scored highly in his own distribution, but the z-score difference highlights relative outperformance within each sport.
Einstein’s IQ (reported around 180):
IQ tests are designed to have μ = 100 and σ = 15, so z = (180 - 100)/15 ≈ 5.33. Such a score is extraordinarily rare under the normal assumption.
Worked examples (practice with z-scores and back-conversion):
Example 1 (reverse from z to raw score):
Given mean μ = 55, SD σ = 3, and desired z = 1.5
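Convert to raw score: $x = \mu + z\sigma = 55 + 1.5 \times 3 = 59.5$.
Further back-conversions from the lecture:
$x = 60 + (-0.4) \times 10 = 56$ (mean 60, SD 10, z = -0.4).
The standard normal distribution has $\mu_Z = 0$ and $\sigma_Z = 1$.
$X = \mu + Z\sigma = 100 + 2.05 \times 15 \approx 130.75$ and $X = \mu + z\sigma = 100 + 1.64 \times 15 \approx 124.6$ (mean 100, SD 15).
Standardizing a raw score uses $z = \frac{x - \mu}{\sigma}$.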
Step 3: Use the standard normal curve to determine probabilities, percentiles, and relative standing via z-tables (or calculators).
Step 4: If needed, convert back to raw scores using $x = z\sigma + \mu$ to report concrete values.
Step 5: For comparisons across different measures, compare z-scores rather than raw scores.
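The table lookups in these steps can also be done in software. The sketch below uses `statistics.NormalDist` in place of a printed z-table; the raw score of 130 and the 95th-percentile cutoff are hypothetical choices on the IQ scale (mean 100, SD 15) used in these notes.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)
mu, sigma = 100, 15

# Raw score -> z -> percentile.
x = 130                                        # hypothetical raw score
z = (x - mu) / sigma
percentile = standard_normal.cdf(z) * 100      # percent of scores below x

# Percentile -> z -> raw score, using x = z * sigma + mu.
z_cut = standard_normal.inv_cdf(0.95)          # z with 95% of scores below it
x_cut = z_cut * sigma + mu

print(round(z, 2), round(percentile, 1))
print(round(z_cut, 3), round(x_cut, 1))
```

Software gives the 95% cutoff as z ≈ 1.645; a printed table rounded to 1.64 yields the 124.6 shown in the worked example, so small differences in the last digit are expected.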
Important practical implications and uses:
Z-scores provide a precise, standardized way to understand how far a score is from the mean, in SD units.
They enable meaningful comparisons across different measures and scales (e.g., cross-subject performance, different sports, or different tests).
They underpin probabilities and expectations about populations, enabling precise predictions and clinical cutoffs (e.g., thresholds like prosopagnosia ~5% cutoff).
They form the backbone of the standard normal curve, so results about probabilities and percentiles generalize across all normally distributed variables.
Real-world connections and philosophical notes:
The normal distribution is pervasive in psychology and natural phenomena; many variables converge to normality with sufficient data collection due to sampling processes and the central limit theorem.
Statistical reasoning using z-scores is a key skill for making inferences about populations from samples.
The practice has ethical and practical implications when used for clinical cutoffs or decisions (e.g., diagnosing conditions, eligibility for programs). Cutoffs are based on percentile/tail criteria, and the choice of α (e.g., 0.05) reflects conventions about balancing false positives and false negatives.
Quick connections to next topics:
Next week: correlations (which rely on z-scores for computation and interpretation).
Correlation analysis uses standardized scores to measure relationships between variables on different scales.
Summary takeaways:
Z-scores convert raw scores into SD units, enabling unit-free comparisons and precise probability calculations under the normal curve.
The standard normal distribution (mean 0, SD 1) provides a universal reference for all normal distributions.
Use z-tables (or calculators) to obtain percentile ranks and tail probabilities, and convert back to raw scores when needed.
The 68-95-99.7 rule gives quick intuition about where most data lie relative to the mean in a normal distribution.
Readings and prep for next lecture:
Reading: Chapter 3 of the textbook (this week).
Preparation for next week's lecture: Chapter 11 (Correlations).
Practice problems in UQ Extend Module 6.
Exam reminder (brief): details about permitted items and logistics for the upcoming exam are provided in class materials; bring pencils, ID, and any approved resources as specified.
Final note: the aim of this content is not just to pass an exam but to understand how distributions work in the world and how we can make precise inferences about them using z-scores and the normal distribution.
Introduction to Hypothesis Testing, Probability, and Sampling Distributions
This week's lecture, delivered by Josh Sabio, focuses on introducing hypothesis testing, probability, and sampling distributions. It builds upon previous lectures and introduces a cornerstone concept for inferential statistics.
Acknowledgement of Country
The University of Queensland acknowledges the traditional owners and their custodianship of the lands on which we meet, paying respects to their ancestors and descendants who maintain cultural and spiritual connections to country, and recognizing their valuable contributions to Australian and global society.
Recalling the Normal Distribution
We begin by recalling the normal distribution, a fundamental concept from previous lectures. It is useful because many variables of interest to psychologists, such as IQ, short-term memory capacity, personality traits, and even facial attractiveness, tend to be normally distributed. If a distribution is normal, we can use z-tables to determine the probability of certain values occurring within it. For example:
The probability that an individual's score is one standard deviation below the mean.
The probability that values fall between specific intervals (e.g., within one standard deviation of the mean).
However, the utility of z-tables is limited to variables that are explicitly normally distributed. If a variable of interest is not normal, direct application of z-tables or finding the area under the curve is not possible. This week introduces a distribution that is always normal (under certain conditions), forming the foundation of inferential statistics.
Core Concepts of Inferential Statistics
This lecture lays the groundwork for understanding inferential statistics, covering:
Characteristics of populations versus samples.
Factors affecting sampling, such as sampling variability and sampling error.
The supremely important sampling distribution of the mean, including its characteristics, the standard error of the mean (SEM), and its connection to the Central Limit Theorem.
Practical applications in exam-style questions.
Populations vs. Samples
Population: Refers to the entire group of individuals or observations that share a particular characteristic and to which researchers wish to generalize their conclusions. Its size and characteristics depend on how it's defined (e.g., all Australian citizens, all marmosets). Researchers typically aim to make conclusions about the population at large.
Sample: A subset of the population, chosen for research due to feasibility and affordability limitations (e.g., it's impractical to study all $30$ million Australian residents). In rare cases (e.g., studying a rare disease or a specific, small cohort like all students in a course), access to a full population might be possible.
Notation: Parameters vs. Statistics
Parameters: characteristics of a population, written with Greek letters.
Population mean: $\mu$ (mu).
Population standard deviation: $\sigma$ (sigma).
Statistics: characteristics of a sample, written with Roman letters.
Sample mean: $ M $.
Sample standard deviation: $ S $.
Sampling Variability and Sampling Error
Random Sampling: The principle that individuals are selected from a population such that each has an equal and independent chance of being chosen. This aligns with the concept of independent draws.
Sampling Error: The inherent difference between a randomly drawn sample's statistics and the corresponding population parameters. A sample's mean will inevitably deviate from the population's mean (e.g., a handful of balls from a basket will have a mean different from the whole basket).
Sampling Variability: The fact that, due to chance, two random samples drawn from the same population will have different statistics. Iteratively drawing samples will demonstrate this fluctuation (e.g., multiple petri dish samples from the same pool of bacteria will show varying bacterial counts).
Literary Digest Example (1936 Presidential Election)
This historical example illustrates the importance of unbiased random sampling. Literary Digest predicted a Landon victory based on $2$ million responses from $10$ million questionnaires. However, Roosevelt won. The survey was biased because it used car registries and telephone numbers during the Great Depression, inherently sampling wealthier individuals who could afford such luxuries, thus not representing the general population.
Probability - First Principles
Understanding probability is crucial for inferential statistics.
Basic Rule: For any event, the probability that it will occur plus the probability that it will not occur equals one: $P(\text{event}) + P(\text{not event}) = 1$.
Exact Probabilities: These can be derived from frequency distributions through recursive computation.
Birthday Problem Example
The probability that at least $2$ of $3$ people share a birthday is $P \approx 0.008$. This is computed by multiplying the probabilities that each person does not share a birthday with the previous ones ($365/365 \times 364/365 \times 363/365$, and so on for larger groups) and subtracting the product from $1$.
When plotted, the probability of a shared birthday rapidly increases with the number of people in a room. In a classroom of $30$ people, the probability of at least one shared birthday is approximately $70.63\%$.
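The birthday calculation can be written as a short loop. A minimal sketch, assuming birthdays are independent and equally likely across 365 days (leap years ignored); the function name is illustrative.

```python
def p_shared_birthday(n_people, days=365):
    """Probability that at least two of n_people share a birthday."""
    p_all_different = 1.0
    for k in range(n_people):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1.0 - p_all_different

print(round(p_shared_birthday(3), 3))    # 0.008
print(round(p_shared_birthday(30), 4))   # 0.7063
```

Calling the function for group sizes from 2 upward reproduces the rapidly rising curve described above.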
Jar of Balls Example
Jar 1: $10$ green, $10$ red balls ($20$ total). $P(\text{red ball}) = 10/20 = 0.5$ ($50\%$).
Jar 2: $19$ green, $1$ red ball ($20$ total). $P(\text{red ball}) = 1/20 = 0.05$ ($5\%$).
Conclusion: If we know the composition of a population, we can make statements about the probability of events. This connection is fundamental to inferential statistics.
Certainty and Statistical Convention
$ P=1.0 $: Absolutely certain (e.g., death and taxes).
$ P=0.5 $: $50/50$ chance (e.g., pulling a red card from a deck).
$ P=0.25 $: Pulling a diamond from a deck ($13 \text{ diamonds} / 52 \text{ cards} = 0.25$).
$ P=0.038 $: Pulling a red two from a $52$-card deck ($2 \text{ red twos} / 52 \text{ cards} \approx 0.038$).
Convention: In statistics, $p < 0.05$ is adopted, somewhat arbitrarily, as the threshold for a rare event or a significant effect. An outcome that would occur less than $5\%$ of the time by chance is treated as unusual.
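The jar and card probabilities can be written as exact fractions and compared with the 0.05 convention. A minimal sketch; the dictionary labels are illustrative.

```python
from fractions import Fraction

events = {
    "red ball, jar 1": Fraction(10, 20),
    "red ball, jar 2": Fraction(1, 20),
    "diamond": Fraction(13, 52),
    "red two": Fraction(2, 52),
}

alpha = Fraction(5, 100)   # conventional cutoff for a rare event

for name, p in events.items():
    p_not = 1 - p          # complement rule: P(event) + P(not event) = 1
    is_rare = p < alpha
    print(name, round(float(p), 3), round(float(p_not), 3), is_rare)
```

Only the red two falls below the cutoff; the red ball in jar 2 sits exactly at 0.05, which is not less than 0.05.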
The Gambler's Fallacy and Independent Draws
Independent Draws: The outcome of one trial does not influence the distribution of outcomes in the next (e.g., one coin flip doesn't affect the next).
Introduction to Hypothesis Testing
Welcome and importance of the lecture
This lecture builds on previous concepts, focusing on hypothesis testing as it is critical for conducting experiments and interpreting results in science.
Overview of Topics Covered
Statistical inference and decision-making processes under uncertainty
Understanding null hypothesis and alternative hypothesis
Explanation of statistical significance
Review of sampling distributions
Identifying decision errors in hypothesis testing
Concept of Statistical Inference
Science often deals with uncertainty when asking questions (example: Is smoking harmful?).
Hypothesis testing provides a framework for answering these questions, giving us likelihood rather than certainty.
Statistical tests can tell us how probable results like ours would be if chance alone were operating, but they do not confirm the correctness of a theory or the quality of the experiment conducted.
Null and Alternative Hypotheses
Null Hypothesis (H0): Assumes that there is no effect or no difference; serves as a default position.
Example: "Distraction does not impair memory performance."
Alternative Hypothesis (H1): Represents what the researcher aims to support, stating that there is an effect or a difference.
Example: "Distraction impairs memory performance."
Importance of accurately formulating these hypotheses when designing experiments.
Statistical Significance
Statistical Significance: A result is significant if it is unlikely to have occurred under the null hypothesis, typically denoted as $p < 0.05$.
The p-value is the probability of observing results at least as extreme as those obtained, if the null hypothesis is true.
Conventionally, a p-value of less than 0.05 is accepted as statistically significant; with this cutoff, a true null hypothesis is wrongly rejected (a Type I error) about 5% of the time.
Sampling Distributions
Defined characteristics of samples: shape, central tendency (mean, median, mode), and spread (range, variance, standard deviation).
Sampling Distribution of the Mean: Distributions are constructed by repeatedly sampling from the population and calculating means.
Central Limit Theorem states that as sample size increases, the distribution of sample means approaches a normal distribution.
This applies regardless of the population distribution.
Standard Error of the Mean (SEM): Indicates how much sample means are expected to vary around the true population mean; defined as the population standard deviation divided by the square root of the sample size:
$SEM = \frac{\sigma}{\sqrt{n}}$
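A simulation can show the SEM formula at work even when the population is far from normal. This is a sketch under stated assumptions: the population is a hypothetical exponential distribution with mean 10 and SD 10, and the sample sizes and number of repetitions are arbitrary.

```python
import math
import random
import statistics

random.seed(2)

sigma = 10   # SD of the hypothetical exponential population (mean is also 10)

def draw_sample(n):
    """Draw n scores from a strongly skewed (exponential) population."""
    return [random.expovariate(1 / 10) for _ in range(n)]

results = {}
for n in (4, 25, 100):
    means = [statistics.mean(draw_sample(n)) for _ in range(5_000)]
    observed_sd = statistics.stdev(means)      # spread of the sample means
    predicted_sem = sigma / math.sqrt(n)       # sigma / sqrt(n)
    results[n] = (observed_sd, predicted_sem)
    print(n, round(observed_sd, 2), round(predicted_sem, 2))
```

The observed spread of the sample means tracks σ/√n and shrinks as n grows; a histogram of `means` would also look increasingly bell-shaped at larger n, as the Central Limit Theorem describes.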
Decision Errors in Hypothesis Testing
Decision Errors: Errors that can be made during hypothesis testing, specifically Type I and Type II.
Type I Error (False Positive): Rejecting the null hypothesis when it is true.
Example: Concluding a drug is effective when it actually is not. Occurs with probability $α$ (commonly set at 0.05).
Type II Error (False Negative): Retaining the null hypothesis when it is false.
Example: Concluding a drug is ineffective when it actually works. Occurs with probability $β$.
The balance between reducing Type I and Type II errors is essential; reducing one can often increase the other.
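The meaning of α as a Type I error rate can be checked by simulating many experiments in which the null hypothesis is true. A minimal sketch, assuming a hypothetical population (mean 100, SD 15), samples of 25, and a two-tailed z-test with cutoff 1.96:

```python
import math
import random
import statistics

random.seed(3)

mu, sigma, n = 100, 15, 25        # hypothetical population and sample size
sem = sigma / math.sqrt(n)
critical_z = 1.96                 # two-tailed cutoff for alpha = 0.05

n_experiments = 10_000
false_positives = 0
for _ in range(n_experiments):
    # The null hypothesis is true: every sample comes from the same population.
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    z = (statistics.mean(sample) - mu) / sem
    if abs(z) > critical_z:
        false_positives += 1      # rejecting a true null is a Type I error

type_i_rate = false_positives / n_experiments
print(type_i_rate)
```

The proportion of false positives settles near 0.05. Raising the cutoff would lower it, at the cost of missing more real effects (more Type II errors).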
Importance of Proper Sample Size and Design
The size of the sample impacts the SEM; larger samples typically lead to a smaller SEM, thus increasing the likelihood of detecting true differences.
Proper experimental design helps mitigate bias in sampling and reduces potential errors.
Conclusion
Understanding hypothesis testing, its foundations, and implications is crucial for conducting scientific research and interpreting data results clearly and accurately.
Emphasis on continuous learning as the concepts will be revisited in future lectures.
Next topics will include practical applications of t-tests in hypothesis testing and practical examples.
Reminder about important readings and quiz deadlines for reinforcing learning.
Introduction and Review of Previous Lecture
Start of class after break
Reminder for students to have taken their time off during break
Focus: Reviewing the sampling distribution of the mean
Understanding means expected from a particular population
Sampling Distribution of the Mean
Definition:
A population has a known mean ($\mu$) and standard deviation ($\sigma$).
From this population, one can calculate the distribution of means using sample size ($n$).
z-tables can be used to find probabilities for these means.
Focus of today's class:
Building on previous content regarding z-scores and introducing t-distributions.
Explain how today’s lesson consolidates the previous one with a small but critical adjustment.
Consolidation of Previous Content
z-tests were the main focus of prior lectures:
Understanding the z-score for individual scores and sample means.
The population mean and standard deviation that a z-test requires are rarely known for real-world phenomena.
Exception: IQ scores that have known means and standard deviations.
Transition to t-distributions:
Real-world data often does not provide population means or standard deviations.
Use of t-distributions to conduct statistical analysis based on estimated population parameters.
Acknowledgment of Indigenous Lands
Pay respect to traditional owners of the land.
Review of Key Concepts from Previous Lecture
Population Distribution:
Population mean ($\mu$).
Sample from population creating a distribution of means by sampling repeatedly.
Mean of the distribution of means is equal to the population mean ($\mu$).
Spread is defined by standard error of the mean (SEM):
$SEM = \frac{\sigma}{\sqrt{n}}$. SEM quantifies the error in mean estimates.
Hypothesis Testing:
Definition of the null hypothesis ($H_0$) and alternative hypothesis ($H_1$).
Emphasis on testing the null hypothesis and not the alternative hypothesis.
Errors in hypothesis testing: Type I error (false positive) and Type II error (false negative).
Importance of controlling confounding factors to validate hypotheses.
Procedures for Hypothesis Testing
Calculating the z-score for means:
Given a distribution of sample means, determine the likelihood of a sample mean under the null hypothesis.
Typical critical z-score for significance: ±1.96 (two-tailed test at the 5% level).
Example: If a mean of a sample falls within expected range, retain $H_0$. If outside, reject it.
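The procedure above can be sketched as a small function. The numbers in the usage example are hypothetical (population mean 100, SD 15, a sample of 36 with mean 106), and the function name `z_test_mean` is illustrative.

```python
import math
from statistics import NormalDist

def z_test_mean(sample_mean, mu, sigma, n, critical_z=1.96):
    """Two-tailed z-test of a sample mean against a known population mean."""
    sem = sigma / math.sqrt(n)
    z = (sample_mean - mu) / sem
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if abs(z) > critical_z else "retain H0"
    return z, p_value, decision

# Hypothetical example values.
z, p_value, decision = z_test_mean(106, mu=100, sigma=15, n=36)
print(round(z, 2), round(p_value, 4), decision)
```

Here the SEM is 15/6 = 2.5, so a sample mean of 106 lies 2.4 standard errors above the population mean, beyond the 1.96 cutoff.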
Reiteration of Key Statistical Concepts
Null Hypothesis ($H_0$): No difference or effect expected in the measured variables (e.g., reaction times between alcoholics and non-drinkers).
Alternative Hypothesis ($H_1$): Expect a difference or effect.
Important to establish how to statistically support ($H_1$) through careful experimental design.
Transition to T-Distributions
Definition of t-distribution:
Similar shape to normal distribution but varies based on degrees of freedom ($df$).
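t-distributions account for the extra uncertainty that comes from estimating the population standard deviation from a sample; that uncertainty is larger when the sample is smaller.
Critical t and z values differ; critical t-values become more extreme with fewer degrees of freedom and approach the z values as $df$ increases.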
Understanding Degrees of Freedom
Degrees of freedom ($df$) explained:
$df = n - 1$ for a single sample.
Concept: how many values can vary independently when estimating a parameter.
Example of calculating degrees of freedom:
If $n=4$, $df$ will be $3$. One value is fixed to satisfy constraints.
Variance and Sample Variance
Explanation of the variance calculation:
Population variance ($\sigma^2$) vs sample variance ($s^2$).
Sample variance is calculated with correction: $s^2 = \frac{SS}{n - 1}$ to avoid underestimating variance.
Mean of sample means is unbiased, but variance calculations are biased unless corrected.
Bessel's Correction: Dividing by $n-1$ instead of $n$ corrects the bias, making the sample variance an unbiased estimator of the population variance.
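To make the correction concrete, the sketch below computes the variance of one small sample both ways. The four scores are made up for illustration (n = 4, so df = 3); Python's standard `statistics` module provides both versions as `pvariance` (divide by n) and `variance` (divide by n − 1).

```python
import statistics

scores = [2, 4, 4, 6]           # hypothetical sample, n = 4, so df = 3
mean = statistics.mean(scores)  # 4
ss = sum((x - mean) ** 2 for x in scores)  # sum of squares = 8

biased = ss / len(scores)          # divides by n     -> 2.0
unbiased = ss / (len(scores) - 1)  # divides by n - 1 -> about 2.67

# The standard library implements both versions.
assert biased == statistics.pvariance(scores)
assert abs(unbiased - statistics.variance(scores)) < 1e-12
print(biased, unbiased)
```

The corrected estimate is always the larger of the two, which is the direction needed to offset the underestimate.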
Summing Up Statistics Steps
Steps in statistical tests:
Calculate means, standard deviations, apply hypothesis testing frameworks.
Transition between t-tests and z-tests depending on known versus unknown parameters.
Use proper formulae to guide through variance estimation and the implications in reporting results.
Single Sample T-Test Explained
A single sample t-test applies when the population mean is known but the population standard deviation is not.
The standard deviation is estimated from the sample, and that estimate is used to compute the test statistic.
The test asks whether the sample mean is significantly different from the known population mean.
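A minimal sketch of the single sample t-test described above, using only the Python standard library. The function name and the five scores are hypothetical; the population mean of 10 is likewise an assumed value for illustration.

```python
import math
import statistics

def one_sample_t(scores, mu):
    """Return (t, df) for a single-sample t-test against population mean mu."""
    n = len(scores)
    s = statistics.stdev(scores)   # sample SD, uses n - 1
    sem = s / math.sqrt(n)         # estimated standard error of the mean
    t = (statistics.mean(scores) - mu) / sem
    return t, n - 1

# Hypothetical scores tested against a known population mean of 10.
t, df = one_sample_t([12, 14, 10, 16, 13], mu=10)
print(round(t, 3), df)
```

The resulting t is then compared with the critical value from a t table for the matching df; for df = 4 the two-tailed 5% cutoff is 2.776.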
Repeated Measures Example
Example scenario:
Comparing conditions: Self-tickling vs experimenter-tickling.
Operationalization and hypothesis framing.
Evaluating differences between two conditions using previous statistical methods discussed.
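One way to sketch the repeated measures analysis is to reduce each participant's two scores to a single difference score and then run a single sample t-test on those differences against zero. The ratings below are invented purely for illustration and are not data from the lecture; the function name is also hypothetical.

```python
import math
import statistics

def repeated_measures_t(cond_a, cond_b):
    """Return (t, df) for paired scores, testing H0: mean difference = 0."""
    if len(cond_a) != len(cond_b):
        raise ValueError("each participant needs a score in both conditions")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    sem_diff = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / sem_diff, n - 1

# Hypothetical ticklishness ratings (made up for illustration).
experimenter = [7, 6, 8, 5, 9]
self_tickle = [4, 5, 5, 3, 5]
t, df = repeated_measures_t(experimenter, self_tickle)
print(round(t, 2), df)
```

Working with difference scores removes stable individual differences from the error term, because each person serves as their own comparison.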
Final Notes and Homework Suggestions
Close out by assigning relevant exercises from Chapter 7.
Preparation for next week's lecture on independent groups t-tests covering further intricacies in analysis.
Introduction
This lecture is focused on the final statistical tests of the semester, transitioning into exam preparation in the upcoming weeks.
The speaker acknowledges the traditional owners of the land, showing respect for their heritage and contributions to societal development, emphasizing the long-standing traditions of research and learning on these lands.
Overview of Statistical Testing Logic
The overall logic of the statistical tests discussed centers around measuring a mean and determining its relationship to a known population distribution.
Hypothesis Testing Framework:
When a mean falls within a specific range of values (likely area), it is assumed to originate from the hypothesized population distribution, leading to retention of the null hypothesis.
If the mean is in an unlikely area, the null hypothesis is rejected, suggesting the mean may come from a different population.
The primary goal is to minimize human bias in making scientific conclusions through strict criteria for determining significant effects.
Inclusion of these criteria aids in objective decision making, allowing the evaluation of means based on set statistical cutoffs (e.g., t or z cutoffs).
Central Limit Theorem and Distribution Characteristics
The Central Limit Theorem asserts that regardless of a population’s distribution, the distribution of sample means will approach a normal distribution as sample size increases.
Key Characteristics of Distributions:
Shape: Determining whether the distribution is normal.
Center: Mean of the sampling distribution approximating the population mean.
Spread: Standard deviation of the distribution, which is crucial for analysis.
Most complexity in calculations arises from estimating or working with standard deviations, particularly when unknown.
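The three characteristics above can be checked with a short simulation. This sketch assumes an exponential population (chosen only because it is strongly skewed, with mean 1 and SD 1) and draws many samples of size 30; the sample size, repetition count, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N = 30          # size of each sample
REPS = 5000     # number of samples drawn

# Exponential population with rate 1: strongly skewed, mean 1, SD 1.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(N))
    for _ in range(REPS)
]

center = statistics.mean(sample_means)   # should sit near mu = 1
spread = statistics.stdev(sample_means)  # should sit near 1 / sqrt(30), about 0.18
print(round(center, 2), round(spread, 2))
```

Plotting `sample_means` as a histogram would show a roughly bell-shaped curve even though the underlying population is skewed.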
Statistical Tests Overview
Recap of statistical tests from the semester:
Z Test: Applicable when population standard deviation is known.
T Test: Used when the population standard deviation is unknown, requiring estimation.
Repeated Measures T Test: Uses one set of participants measured under both conditions and examines the distribution of their difference scores; the design must guard against order effects such as fatigue or carryover.
Independent Groups T Test: Compares two separate groups whose scores are independent; the population variance is estimated by pooling the two sample variances, on the assumption that the population variances are equal.
Detailed Explanation of Statistical Calculations
When population variance is unknown, adjustments are made, specifically using $(n-1)$ for degrees of freedom in variance calculations:
$s^2 = \frac{\text{sum of squares}}{n-1}$.
The critical part of hypothesis testing revolves around confirming whether your observed mean difference is significant compared to the null expectation.
Emphasis on using a pooled variance, in which each group's variance is weighted by its degrees of freedom, for estimating the population variance when using independent samples.
Independent Groups T Test Process
Logistics of conducting an independent t-test entail:
Establishing and confirming assumptions (normal distribution, homogeneity of variance, independence of observations).
Calculating individual group means, variances, and pooled variance (the latter being weighted based on sample sizes). This combines estimates based on degrees of freedom to draw sound conclusions on population characteristics.
Use the variance sum law: the variance of a distribution of differences is equal to the sum of the individual variances:
$\text{Var}(A - B) = \text{Var}(A) + \text{Var}(B)$
The test statistic is $t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\text{diff}}}$, where $S_{\text{diff}}$ is the standard error of the mean difference. The results inform whether to reject the null hypothesis based on observed versus expected outcomes from samples.
Example: Learning to Juggle
A practical example demonstrates an independent groups t-test applied to learning efficiency in juggling under two conditions: distributed learning (15 participants) and massed learning (20 participants).
Variables: the IV is the learning schedule (distributed vs. massed); the DV is performance, measured as catches made within the initial practice hours.
The hypotheses are:
Null Hypothesis (H0): There is no difference in performance between learning schedules.
Alternative Hypothesis (H1): One learning schedule outperforms the other.
Upon calculating means for each group from observed data:
Distributed learning mean = 7.4 catches
Massed learning mean = 5.2 catches
Statistical Testing: The independent groups t-test uses the estimated variances and the null hypothesis expectation of no difference to assess whether the observed difference falls outside normal sampling error, which would make the case for one approach being better than the other.
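The pooled-variance calculation described above can be sketched from summary statistics. In this sketch the group means (7.4 and 5.2) and group sizes (15 and 20) come from the juggling example, but the two variances are invented placeholders because the notes do not report them, so the resulting t value is illustrative only. The function name is hypothetical.

```python
import math

def pooled_t(m1, var1, n1, m2, var2, n2):
    """Independent groups t-test from summary statistics, equal variances assumed."""
    df1, df2 = n1 - 1, n2 - 1
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    var_pooled = (df1 * var1 + df2 * var2) / (df1 + df2)
    # Variance sum law: variances of the two sampling distributions add.
    s_diff = math.sqrt(var_pooled / n1 + var_pooled / n2)
    return (m1 - m2) / s_diff, df1 + df2

# Means and group sizes are from the juggling example;
# the variances (4.0 and 5.0) are invented placeholders, not lecture data.
t, df = pooled_t(7.4, 4.0, 15, 5.2, 5.0, 20)
print(round(t, 2), df)
```

Note that the degrees of freedom for the test are the sum of the two groups' degrees of freedom, here 14 + 19 = 33.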
Conclusion and Key Takeaways
The independence of the groups is crucial for valid statistical testing.
The speaker emphasizes the importance of correct experimental design and how poor design leads to confounding results that statistics alone cannot clarify.
Preparing for the exam requires focus on comprehensive understanding—working definitions and practice problems across different statistical tests.
The next week's material will include discussions of effect sizes and confidence intervals, critical tools for interpreting test results in practical applications.
Overview of Current Lecture Content
Introduction and reminders about assignments.
Assignment due in three hours.
Expectation that all have submitted and that everything is okay.
Transition from discussing statistical tests to exploring additional questions in hypothesis testing.
Disappointment in Hypothesis Testing
A common sentiment exists regarding hypothesis testing, specifically its limitations:
Main Limitation: We can only assess if a result is likely due to chance.
We cannot directly support the alternative hypothesis.
The course will cover additional methods to address significant questions such as:
How big is the effect likely to be?
Confidence intervals and their significance in understanding data sets.
Confidence Intervals
Definition: A confidence interval provides a plausible range for estimates, beyond the single point estimate.
Application: The lecture today will discuss:
How to calculate confidence intervals within various contexts and statistical tests.
The construction involves wrapping a "buffer zone" around point estimates to account for potential error.
Example of Confidence Intervals
Confidence intervals are essential for understanding variability and uncertainty:
For example, when taking a sample mean, we can extend the analysis to determine probable ranges where the true mean exists.
This leads to determining the confidence interval for sample data, ensuring estimates consider sampling errors.
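The "buffer zone" idea can be sketched as the sample mean plus or minus a critical value times the estimated standard error. This is a sketch under stated assumptions: the scores are hypothetical, the function name is invented, and the critical value is passed in by hand from a t table rather than computed.

```python
import math
import statistics

def mean_confidence_interval(scores, critical):
    """CI for the mean: M +/- critical * (s / sqrt(n)).

    `critical` is the two-tailed cutoff for the chosen confidence level,
    looked up in a t table with df = n - 1.
    """
    m = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return m - critical * sem, m + critical * sem

# Hypothetical data with n = 5, so df = 4; the 95% critical t for df = 4 is 2.776.
low, high = mean_confidence_interval([12, 14, 10, 16, 13], critical=2.776)
print(round(low, 2), round(high, 2))
```

A wider interval signals a less precise estimate; larger samples shrink the standard error and therefore the interval.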
Relations to Other Statistical Concepts
Connection Between Hypothesis Testing and Confidence Intervals:
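A confidence interval carries the same decision information as a p-value (a 95% interval that excludes the null value corresponds to p < .05, two-tailed) and also shows a range of plausible values. Both are now standard practice in scientific reporting.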
Example Statement in Scientific Writing:
When writing journal articles, results typically include:
Mean improvements
T-value
P-value
95% confidence interval
Effect size such as Cohen's d.
Evolution of Statistical Practices
Discussion of historical context regarding the replication crisis in psychology:
Approximately 15 years ago, the crisis highlighted that the results of many established studies could not be replicated.
The field is increasingly aware of statistical limitations and is transitioning towards better statistical practices.
Effect Sizes
Transition into effect sizes as a concept tied closely with confidence intervals:
Definition: An effect size measures how substantial or impactful a result is, focusing on how big an effect is relative to the variability in the scores.
Exploring Effect Sizes Further
Cohen's d: The most common effect size measure, standardizing effects across different studies.
It addresses additional concerns surrounding nuances of significance vs. meaningfulness.
Small vs. Big Effects
The influence of sample size on the detection of effects:
A significant finding may not equate to clinical or practical importance, especially if effects are tiny (e.g., 6.33 milliseconds in a tickling condition).
Importance of reporting effect sizes in studies.
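As a hedged illustration of why effect sizes are reported alongside significance, the sketch below computes Cohen's d for two independent groups as the mean difference divided by the pooled standard deviation. The function name and the summary statistics are hypothetical.

```python
import math

def cohens_d(m1, var1, n1, m2, var2, n2):
    """Cohen's d for two independent groups: mean difference / pooled SD."""
    var_pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(var_pooled)

# Hypothetical summary statistics: a 2-point difference, both variances 16.
d = cohens_d(52, 16, 30, 50, 16, 30)
print(d)  # 2 / 4 = 0.5
```

Unlike t, d does not grow with sample size, so a tiny effect stays tiny even when a large sample makes it statistically significant. Cohen's commonly cited benchmarks treat values around 0.2 as small, 0.5 as medium, and 0.8 as large.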
Statistical Understanding Evolution
Awareness that p-values were previously misinterpreted.
Historically, a low p-value (e.g., 0.01) was taken to mean that the alternative hypothesis was very likely true. This is incorrect: a p-value is the probability of obtaining data at least this extreme if the null hypothesis is true.
Importance of establishing educational norms in statistical understanding.
Bayesian vs. Frequentist Approaches
Suggestion to recognize additional approaches for analyzing data:
Bayesian Statistics: Emphasizes prior knowledge and is complicated, typically taught later in academic programs.
Estimation Approach: Suggests a focus on estimating parameters instead of relying solely on conventional p-values.
Recap of Lecture
Acknowledgment of traditional owners of lecture grounds to foster a respectful learning environment.
Recap of confidence intervals, hypothesis testing, and effect sizes as intertwined yet crucial statistical practices for a robust understanding of data.
Suggestion that students should prepare for midterm evaluations and exercises based on discussed exercises.
Practical Exercises with Statistical Tests
Application examples with confidence intervals, p-values, t-tests explored in detail:
Illustrating how to conduct single sample t-tests vs. independent groups t-tests.
Working through examples to bolster conceptual understanding of statistical concepts and applications to real-world observational data.
Conclusion
Discussion of upcoming material and expectations for quizzes and course evaluations:
Students are encouraged to give feedback on course content and resources.
Final Lecture Overview
The final lecture marks the end of the course where students will focus on exam preparation and course revision.
Exam Expectations
The exam is scheduled for the 19th at 8 AM.
Important to confirm the exam location in advance to avoid getting lost.
The exam carries a weight of 45% of the overall course assessment.
Caution: This exam will be more challenging than the mid-semester exam.
The scope of the exam covers the entire course material, including prior lectures.
Course Evaluations
Course evaluations are open for one more week; student participation is encouraged.
Evaluations help educators assess teaching effectiveness and provide feedback for improvement.
Positive feedback also benefits tutors who are the first point of contact for students.
Review of Key Concepts
Principles of Science
Psychology is presented as an empirical science, involving measurement and theory development.
Importance of skepticism in evaluating results:
Skepticism implies questioning results and considering alternative explanations until substantial evidence supports conclusions.
Introduced tentativeness in reporting findings:
Researchers express findings with caution, using phrases like "may cause" instead of asserting causal relationships definitively.
The role of openness in research:
Sharing methods, data, and software used for analyses promotes error correction and transparency.
Emphasis on anti-authoritarianism in science:
Results should be scrutinized regardless of the researcher's reputation or accolades.
Research Methodology
Not just a statistics course but focused on research methods critical for proper experimental design.
Types of Designs:
Experimental Designs: Allow manipulation of variables and the establishment of causal relationships through random allocation (independent vs. repeated measures).
Observational Studies: Correlational and quasi-experimental designs cannot establish causality and are subject to confounding factors.
Limitations of observational studies in asserting cause-effect relationships.
Measurement Scales
Four measurement scales:
Nominal: Categorical data (counts of occurrences).
Ordinal: Ordered categories without defined intervals.
Interval: Defined intervals without a true zero (e.g., temperature in degrees Celsius).
Ratio Scale: Defined intervals with a true zero point (e.g., weight).
Knowledge of different data representation methods is crucial, especially in interpreting graphical data.
Data Interpretation in Exam
Understanding of concepts like central tendency (mean, median, mode) and variability (variance, standard deviation) is essential.
Familiarity with normal distribution, z-scores, and their interpretation is vital for exam preparedness.
Example problem: Calculating IQ scores that bound the middle 95% based on a mean of 100 and standard deviation of 16:
Cutoffs established using z-tables (1.96 for 95% confidence).
Tips for questions regarding graphs include reading instructions carefully to ensure the correct data representation is utilized.
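The IQ example above (mean 100, standard deviation 16, middle 95%) can be worked through as follows. The sketch shows the hand calculation with the tabled cutoff of 1.96 and, as a cross-check, the same bounds from the standard library's `NormalDist.inv_cdf`; the variable names are illustrative.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# By hand with the tabled cutoff: mean +/- 1.96 * SD.
lower_by_hand = 100 - 1.96 * 16   # 68.64
upper_by_hand = 100 + 1.96 * 16   # 131.36

# Same bounds from the inverse CDF (2.5% in each tail).
lower = iq.inv_cdf(0.025)
upper = iq.inv_cdf(0.975)
print(round(lower, 2), round(upper, 2))  # 68.64 131.36
```

In the exam the z-table route is the one to use; the inverse CDF is only a way to confirm the arithmetic.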
Understanding Hypothesis Testing
Hypothesis testing is central in determining the significance of findings.
Null Hypothesis (H₀): Assumes no effect or difference.
Alternative Hypothesis (H₁): Represents the effect or difference researchers seek to establish.
Test types to differentiate:
Z-tests and T-tests based on given population parameters (considering known vs. unknown population standard deviations).
Understanding Type I (false positive) and Type II (false negative) errors, the significance levels (p < 0.05), and their consequences in research outcomes.
Statistical power is the likelihood of detecting a true effect.
T-Tests and Confidence Intervals
Familiarity with various t-tests (e.g., single sample, paired samples, independent groups), and when to apply them is crucial.
Understanding of confidence intervals and their application for hypothesis testing (whether the interval contains the null hypothesis).
Example: Calculating confidence intervals for mean CO₂ levels.
Knowledge of effect sizes (Cohen's d) that help interpret the magnitude of findings beyond mere significance.
Exam Format & Preparation Strategies
The exam format includes 44 multiple-choice questions over 2 hours.
Resources allowed include an unmarked dictionary, formula sheet, and statistical tables (provided during the exam).
Recommendations for effective exam preparation include:
Practice with past questions and develop familiarity with the course materials using real exam tools.
Engaging in self-testing to enhance retention and understanding of the content.
Creating personalized quiz questions as reinforcement strategies.
Final Notes
A reminder about the final quiz opening shortly.
Students should prepare thoroughly for the upcoming exam, ensuring they understand the logistics and material covered throughout the course.
Open invitation for students to seek advice regarding further psychological studies and career pathways.
Modules
Science as a way of knowing
Psychology is a diverse discipline — 53
The unifying quality of all psychological disciplines is that all psychologists try to understand behaviour using the methods of science
Epistemology — branch of philosophy that is concerned with the nature and scope of knowledge
Depending on how knowledge is acquired, it may reflect real understanding about the world or it could embody misinformation
Acquiring Knowledge
Personal experience — people experience things and all these experiences contribute to our knowledge of the world. This is problematic however as it is only evaluated by you and is open to influence from your own biases.
Authority — people appeal to authority in order to live their lives. Typically, one can verify if these people are truly authorities, however the authority may not be an expert in their discipline
Scientific Method
How knowledge is gained
Logic — reason through problems to generate new knowledge, such as solving a maths question with a maths
Empiricism — gain knowledge through careful and objective observation (seeing, hearing, touching, etc)
Rational — formulation of hypotheses and theories
Main features
Systematic observation
Critical analysis of data, hypotheses, and theories
Tentative acceptance of hypotheses and theories
Openness and independence from authority
Theory, experiments and statistics
Goals for scientific research
Describe a behaviour — if we want to understand any type of behaviour; describe it in detail and give the conditions under which it occurs on however many levels
Explain behaviour — why does the behaviour occur? How do environmental factors affect the behaviour, how does the presence of other people affect behaviour
Predict behaviour — want to know when the behaviour occurs. what are the specific cognitive, emotional, social, or environmental conditions
Control behaviour (deep understanding) — if we can control the behaviour, we can identify and manipulate the critical factors that promote or discourage a particular behaviour
Theory
Ideas about how nature works. In psychology, theories explain why behaviour occurs the way it does
A fully formed theory fulfills all four goals
A psychological theory is a precise statement of how events in the world affect behaviour specifically
it summarises existing knowledge on a topic
it outlines the relationships between the different factors involved
explains the phenomenon of interest
most importantly, a theory generates specific predictions about the outcomes of situations
Hypothesis
More focused and more tentative than a theory
Takes a theoretical claim and applies it to a specific setting, so a hypothesis is more focused than the theory. The more specific instances are found to support it across repeated studies, the stronger the support for the theory
In a typical formal experiment, two hypotheses are proposed
null hypothesis — denoted H₀ (the other, the alternative hypothesis, is denoted H₁). A statement that there is no difference between the groups we are comparing or that there is no systematic change in one variable that is tied to another variable
example - for smoking and health, the null hypothesis is that there is no relationship between the amount people smoke and their health
alternative hypothesis — a statement that there is a difference between the groups we are comparing or that there is a systematic change in one variable that is tied to another variable
example - for smoking and health, the alternative hypothesis is that there is a relationship, probably that the more people smoke the less healthy they will be
a drawback is that they cannot make an exact prediction
Statistics
Formal mathematical procedures that allow us to decide which of the two hypotheses to favour
Allow us to rule out chance as a possible reason for the pattern of results
A test of whether or not chance can explain the observed differences between the groups.
Principles of Science
The best method we have for generating knowledge about the universe, nature, and human behaviour is the scientific method.
The scientific method generates knowledge based on evidence. Faith-based knowledge and morality-based knowledge are examples of knowledge that are not generated through the scientific method.
Scientific claims, hypotheses, and theories are all based on evidence.
1. Objectivity
Evidence, when offered to support a claim or hypothesis, must be observable by any person.
Offering your personal thoughts or feelings as evidence is not acceptable because you are the only person who can observe them.
Therefore, in order to provide scientific evidence you have to be creative and think of ways to make your observations objective.
For example, a recent CNN article reported that some people’s phobia of flying is made worse when air disasters are reported in the media. To support that claim scientifically, we need to provide objective evidence that phobics are more anxious (which is a mental state not readily observable) than non-phobics when flying after seeing a news report of an air accident. We could measure heart rate, sweaty palms, or breathing rate, all of which are objective and tend to vary with levels of anxiety. These forms of evidence are more credible than simply asking a person to report how they feel because physiological responses are observable and measurable by anyone.
2. Skepticism
Science’s principle of skepticism requires that claims must be backed up with evidence and that this evidence must be carefully and critically evaluated.
When considering a claim that someone has made, your reflex skeptical response should be, “show me the evidence” and/or “let me see” and/or “let’s take a look”.
For example, the claim that heavier objects fall to the ground faster than lighter objects may sound intuitively correct without much reflection. However, we don’t know whether or not that claim is correct until we scrutinise it skeptically. We can thank Galileo for conducting his classic experiments to disprove that idea. Galileo’s skeptical approach forced him to test the claim, rather than accepting it without evidence. Galileo’s main finding was that two objects of different mass, dropped simultaneously from the same height in a vacuum, will indeed reach the ground at the same time. This is another instance revealing individual intuition as an unreliable source of knowledge. Imagine a feather and a boulder dropped from the same height in a vacuum. They will indeed reach the ground at the same time. The key is that they are dropped in a vacuum, which removes the effects of wind resistance. This is difficult to visualise since we rarely observe objects moving in a vacuum.
3. Openness/open-mindedness
When reporting their observations, scientists are required to describe the conditions under which these observations were made.
This includes exactly how the measurements were taken, who the participants were, and any other details relevant to the methods of acquiring the evidence.
It is imperative that another investigator reading the description can adequately reproduce the conditions of your observations so they can see for themselves.
The standard to which you should strive is to be able to report your observations so objectively that even your enemy would have to agree with you (Agnew & Pyke, 2007).
When different observers agree, their observations are said to be reliable.
When investigators ask about inter-rater or inter-observer reliability, they are asking whether different observers agree about the same observation.
4. Tentativeness
Scientists are never 100% certain of any finding, because they know that new evidence may come along that will force them to revise their conclusions or discard them altogether.
This is a difficult characteristic of science for the general public and new scientists to accept.
Why shouldn’t a well-executed study give the definitive answer to a question? Well, any study is only as good as the available theories, technology, and evidence.
In general, scientists accept that research findings are rarely 100% clear-cut and that ambiguity comes with the territory. Patience is required as the process of science weeds out erroneous conclusions and reveals the correct ones.
5. Independence from authority
The phrase, “because I said so” does not constitute scientific evidence. Solid, carefully collected evidence is the only authority in the scientific method.
Therefore, claims made from a source, no matter how reputable, must be supported by evidence.
Even then we interpret that evidence skeptically, evaluating how strongly it supports a claim, whether there were any errors made when collecting the evidence, and so on.
Scientific Process
Media vs. Science
Contentious topics in interviews often feature “gotcha” moments → entertaining but unproductive for truth-seeking.
Scientists use measured language (“balance of evidence supports…”, “unable to replicate…”) → reflects reality of knowledge building.
Nature of Scientific Process
Slow, methodical, involves blind alleys & dead ends.
Represented as a flowchart: idea → hypothesis → prediction → study → data collection → analysis → conclusion → replication.
Steps in the Process:
Idea/Theory: Explanation of how something works (e.g., spaced learning is more effective than cramming).
Hypothesis: Testable chunk of the theory.
Prediction: Specific, measurable outcome (e.g., spaced practice improves learning more than cramming).
Study Design:
Critical for reliable results.
Must follow principles of objectivity and empiricism.
Data Collection:
Questionnaires, interviews, online responses, etc.
Raw data initially disorganized.
Data Organization & Description:
Summarize and prepare data for analysis.
Inferential Statistics:
Generalize from sample to real-world population.
Decide if hypothesis is supported.
Re-evaluation:
If unsupported → modify, retest, or discard hypothesis.
If supported → publish results.
Replication:
Exact or conceptual replications.
Builds confidence if findings hold.
Failure to replicate reduces confidence → refine or discard hypothesis.
Outcome:
Supported hypotheses may evolve into accepted theories.
Process is iterative and self-correcting.


Ethics in Psychological Research
Definition:
Ethics = guidelines/principles for moral & just treatment of others.
In research: focus on how researchers treat participants, run studies, and conduct themselves.
Based on Universal Declaration of Ethical Principles for Psychology.
Four Guiding Ethical Principles
Respect for the Dignity of Persons and Peoples
Value, acknowledge, and treat all people equally regardless of origin, beliefs, or identity.
Special care for vulnerable groups (e.g., children, minorities).
Ensure equal opportunity to be seen, heard, acknowledged.
Protect anonymity and confidentiality.
Example: Evolving gender data collection → beyond male/female binary to non-binary & open responses to show respect and inclusion.
Competent Caring for the Well-Being of Persons and Peoples
Aim for research findings to enhance well-being.
Conduct research to benefit participants or at least cause no harm.
Plan for and mitigate possible harm.
“Competent” caring → researchers must have proper training for tools/tests used.
Case Study – Tuskegee Syphilis Study (1932–1972):
600 African-American men (399 with syphilis) misled, denied treatment (penicillin available from 1947).
No informed consent → participants not told study details.
Ethics committees now prevent such abuse (require informed consent, minimal/no harm).
Integrity
Conduct research with objectivity and honesty, free from self-interest or outside influence.
Avoid exploitation and bias in reporting.
Example – Grossarth-Maticek Research:
Linked personality types to cancer/heart disease.
Allegations of data falsification (e.g., reclassifying participants, duplicating data).
Funded by tobacco companies → possible conflict of interest.
Findings not replicated → likely due to falsified data.
Responsibility to Society
Psychology should contribute to understanding the human condition and improving well-being.
Researchers must:
Understand and follow ethical conduct.
Reflect on and update research practices to stay ethical.
Stanford Prison Experiment (1971)
Conducted at Stanford University by Philip Zimbardo.
Setup: Mock prison in psychology building; participants randomly assigned as prisoners or guards.
Payment: $15/day.
Role of Zimbardo: Prison superintendent.
Informed Consent Issues:
Participants given vague info, not told specifics (e.g., surprise home arrest, strip search).
Guards encouraged to be aggressive (no physical harm) to instill fear.
Ethical Concerns:
Prisoners who wanted to leave were told they could not.
Planned 2 weeks → ended after 6 days when an outsider raised concerns.
Zimbardo admitted losing objectivity due to his role.
Formal debriefing not until years later.
Variables in Research
Key Variables
Independent Variable (IV):
Manipulated by the experimenter (e.g., age groups, drug type).
Sometimes cannot be directly manipulated (e.g., age).
Dependent Variable (DV):
Measured outcome; depends on IV.
Example: Studying mental ability vs. age → IV = age, DV = IQ score.
Unwanted (Extraneous) Variables
Variables that contaminate results and obscure the relationship between IV and DV.
Situational Variables:
Environmental factors (temperature, noise, lighting, time of day).
Can affect all participants differently and unpredictably.
Individual Differences:
Natural variations between people (height, weight, motivation, anxiety).
Combine with situational variables to increase variability.
Measurement Error:
Inconsistencies in recording data (e.g., misreading ruler, stopwatch error).
Linked to experimenter’s attention, training, or bias.
Effect:
Random variability can weaken or completely hide real relationships.
Example: Teaching method study → with unwanted variables removed, clear difference; with them present, results less consistent.
Confounding Variables
Definition: Variables that vary systematically with IV, providing an alternative explanation for results → prevents establishing causation.
Example:
Testing two drugs on rats: all Drug A rats tested in the morning, all Drug B rats in the afternoon.
Time of day becomes a confounding variable.
Controlling Confounding Variables:
Keep constant: Test all groups under same conditions (e.g., all in morning).
Counterbalance: Spread variations evenly (e.g., half of each group in morning, half in afternoon).
True Experimental Designs — Key Features
At least two levels of the Independent Variable (IV)
One level can be absence of treatment (control group/placebo).
Other = presence of treatment (experimental group).
Random Assignment
Equal chance of being in any group (coin flip, random number table, etc.).
Purpose: Distribute extraneous factors (motivation, ability, age, health) evenly so they don’t vary systematically with the IV.
Control for Confounding Variables
Prevent alternative explanations for observed differences between conditions.
Independent Groups Design (Between-Subjects)
Structure:
Two or more groups, each experiencing a different level of the IV.
Participants randomly assigned to one group.
Experimental group receives IV; control group does not.
Example — Tickling Experiment:
IV: Who does the tickling (robot vs. self).
DV: Ticklishness rating (1–10).
32 participants → random assignment into 2 groups of 16.
Robot group tickled by robot; self group tickled themselves using robot arm.
Results: Robot tickle group generally rated higher ticklishness, but some overlap.
Drawbacks:
High variability from individual differences (e.g., natural differences in ticklishness).
Requires more participants than repeated measures.
Repeated Measures Design (Within-Subjects)
Structure:
Same participants tested in all conditions of the IV.
Fewer participants needed (e.g., 16 instead of 32 in tickling example).
Reduces random variability due to individual differences.
Order Effects:
Experiencing one condition may influence responses in the next.
Controlled via counterbalancing:
Half participants → Condition A then B.
Half participants → Condition B then A.
Example — Tickling Experiment:
All 16 participants experienced both robot and self tickling.
Order counterbalanced to control for order as a confound.
Results showed less spread in data → reduced variability from individual differences.
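Counterbalancing the order of two conditions could be sketched as follows. The 16 participant IDs, the condition labels "A" and "B", and the function name `counterbalance` are assumptions made for illustration.

```python
import random

def counterbalance(participants, orders=(("A", "B"), ("B", "A")), seed=None):
    """Randomly give each condition order to an equal share of participants."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Alternate through the shuffled list so each order is used equally often
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}

schedule = counterbalance(range(1, 17), seed=2)   # 16 hypothetical participant IDs
n_ab = sum(order == ("A", "B") for order in schedule.values())
n_ba = sum(order == ("B", "A") for order in schedule.values())
print(n_ab, n_ba)
```

With half the participants in each order, any order effect is spread evenly across both conditions instead of favouring one of them.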
When Repeated Measures is NOT Suitable
If one condition permanently changes participant responses (e.g., learning effects).
Example: Comparing teaching methods for statistics → learning from first method influences performance in second method.
Summary Table
Feature | Independent Groups (Between) | Repeated Measures (Within)
Participants per Condition | Different people in each group | Same people in all conditions
Randomization Purpose | Equalize groups | Control order effects
Main Advantage | No carryover/order effects | Reduces variability from individual differences; fewer participants
Main Disadvantage | More participants needed; variability from individual differences | Risk of order/carryover effects
Key Control Method | Random assignment | Counterbalancing
Observational Designs
Two types covered:
Correlational Design
Quasi-Experimental Design
Why not use true randomized experiments?
Sometimes impractical or unethical (e.g., cannot assign people to start smoking).
Example: Smoking & Health
Hypothesis: Smoking is bad for health.
Operational definition of health: Number of doctor visits per year.
Prediction: More smoking → more doctor visits.
Correlational Study
Method:
Observe people who already smoke/don’t smoke.
Measure:
Cigarettes smoked/day (IV)
Doctor visits/year (DV)
Example: Ask 200 people about both variables.
Create scatter plot: each point = one person’s data.
Observation: Positive relationship — heavier smokers see doctors more often.
Key point: No variables manipulated → just observation.
Limitation: Cannot conclude causation.
Quasi-Experimental Design
Similar to true experiment, but no random assignment.
Method:
Form groups based on pre-existing characteristics (e.g., smoking habits).
Example:
Light smokers: 0–10 cigarettes/day
Heavy smokers: 20–30 cigarettes/day
DV = doctor visits/year
Plot results: heavy smokers have more doctor visits.
Key difference from true experiment: Grouping based on existing traits, not random assignment.
Other uses:
Age (e.g., young vs. older adults)
Health conditions (e.g., high blood pressure vs. normal)
Income levels (e.g., wealthy vs. middle class)
Causation?
From correlational or quasi-experimental designs → No causal conclusion possible.
Unknown third variables could explain results (e.g., drinking, diet).
Conditions for Causal Inference
(Only true randomized experiments can fully meet all three)
Relationship: Regular & reliable changes in one variable associated with changes in the other.
Time Order: Cause precedes effect.
No Other Explanations: Alternative causes ruled out (via randomization).
Summary
Observational Designs = Correlational studies + Quasi-experiments.
Correlational study: Measures 2+ variables in same group, examines relationship.
Quasi-experiment: Like true experiment but without random assignment.
Limitation: Lack of randomization → cannot infer causation.
Only true experiments (with randomization) allow causal claims.
Measurement in research
type of data we collect determines how we can analyse and test hypotheses
measuring = assigning numbers to observations
Quantitative vs qualitative
quantitative: numbers have meaningful numeric value (e.g. ml milk, cm height)
qualitative: categories with no numeric meaning (e.g. eye colour, political leaning, countries visited)
Numbers used as labels for qualitative
numbers assigned to categories are only labels (e.g. 1 = brown eyes, 2 = blue)
numbers do not imply magnitude
Discrete vs continuous
discrete: cannot be subdivided into meaningful smaller units, no values in between (e.g. number of chess pieces lost)
dichotomous: special type of discrete with only two possible values (e.g. heads/tails, yes/no)
continuous: infinite values between points (e.g. milk measured 250, 250.2, 250.57 ml)
Levels/scales of measurement (lowest to highest)
Nominal
qualitative only
numbers do not show magnitude or order
example: finished race vs not finished (yes/no)
dichotomous nominal possible
Ordinal
ordering is implied
but the distance between ranks is unknown (the gaps may or may not be equal)
example: 1st, 2nd, 3rd place; we know order but not how much faster one is over another
Interval
equal intervals between scale points
can have negative values
example: seconds slower/faster than club record (e.g. –2, –1, +3)
Ratio
interval properties plus absolute zero point (cannot go below zero)
can talk about ratios (twice as much)
example: actual swim time in seconds, grams, millilitres
Choosing level of measurement
consider: magnitude? equal intervals? absolute zero?
none → nominal
magnitude only → ordinal
magnitude + equal intervals → interval
magnitude + equal intervals + absolute zero → ratio
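The three questions above form a simple decision rule. A minimal sketch, with the function name `level_of_measurement` chosen for illustration:

```python
def level_of_measurement(magnitude, equal_intervals, absolute_zero):
    """Return the scale implied by the three yes/no questions, asked in order."""
    if not magnitude:
        return "nominal"
    if not equal_intervals:
        return "ordinal"
    if not absolute_zero:
        return "interval"
    return "ratio"

print(level_of_measurement(False, False, False))  # finished race: yes/no
print(level_of_measurement(True, False, False))   # 1st, 2nd, 3rd place
print(level_of_measurement(True, True, False))    # seconds relative to club record
print(level_of_measurement(True, True, True))     # actual swim time in seconds
```

The questions are checked in order because each property builds on the previous one.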
Why level of measurement matters
determines what statistical analyses can be used
aim for highest level possible
can convert down (ratio → ordinal) but not up (ordinal → ratio)
Role in research design
selecting appropriate measurement level is critical to valid data analysis and hypothesis testing
Types of measures in psychology
three main kinds:
self-report: questionnaires, surveys, interviews (what people say they think/do)
behavioural: what participants actually do (e.g. aggressive acts counted, reaction time, bar presses in rats)
physiological: body/brain responses (e.g. heart rate, hormones, blood flow)
Two major issues in measurement
reliability
validity
Reliability
stability/consistency of measurement
example: solid ruler produces near-identical results each measurement
unreliable example: flexible/rubber ruler = inconsistent results
Types of reliability
test-retest reliability: the same test (or two closely equivalent versions) given at different times, compare the pattern of scores
internal consistency: compare scores on first half vs second half of test
interrater reliability: do different raters score the same behaviour similarly?
Validity
the degree to which the measure actually measures what it claims to measure
Types of validity
face validity:
does the measure look like it is measuring the right thing?
subjective judgement (e.g. math test should contain math problems)
predictive validity:
does the measure predict what it is supposed to predict?
example: ATAR predicting university performance
construct validity:
central to psychology
does the measure relate to other measures in ways theory says it should?
example: IQ test performance correlates with real-world indicators of intelligence (jobs, articulation, academic performance)
Summary
psychologists use self-report, behavioural, and physiological measures
every measure should be evaluated on reliability (consistency) and validity (does it measure what it should measure)
Organising raw data
first step: organise visually to get a sense of distribution
can use tables or graphs
Frequency tables (ungrouped)
used when range of scores is small and manageable
steps:
list every possible score from lowest to highest
tally occurrences
convert tallies to frequencies
can convert frequencies to percentages
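The steps for an ungrouped frequency table can be sketched as below. The quiz scores are hypothetical, and the sketch assumes whole-number scores so that "every possible score" is well defined.

```python
from collections import Counter

def frequency_table(scores):
    """Return rows of (score, frequency, percentage) from lowest to highest score."""
    counts = Counter(scores)                             # tally occurrences
    n = len(scores)
    rows = []
    for score in range(min(scores), max(scores) + 1):   # list every possible score
        f = counts.get(score, 0)                         # scores that never occur get 0
        rows.append((score, f, 100 * f / n))             # frequency as a percentage
    return rows

scores = [3, 5, 4, 3, 2, 5, 3, 4, 3, 1]   # hypothetical quiz scores
table = frequency_table(scores)
for score, f, pct in table:
    print(f"{score:>5} {f:>4} {pct:>6.1f}%")
```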
Grouped frequency tables
used when range of scores is large (too many rows if ungrouped)
create equal-sized score intervals (rule of thumb: produce 10–20 rows)
tally scores into intervals
benefit: more manageable summary
drawback: loss of precision (cannot see exact individual values)
Stem and leaf plots
middle form between table and graph
stem = higher unit (e.g. tens)
leaf = final digit
preserves individual scores while showing distribution visually
longest leaf row = interval with most scores
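A stem and leaf plot is easy to build by hand or in code. A minimal sketch, assuming non-negative whole-number scores with the tens as stems; the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Map each stem (tens) to its sorted list of leaves (final digits)."""
    plot = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)   # e.g. 27 -> stem 2, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

scores = [12, 25, 31, 27, 18, 22, 29, 35, 21, 14]  # hypothetical scores
plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Every original score can be read back from the plot, and the longest row (here the 20s) shows where most scores fall.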
Box and whisker plots
visual summary of range and quartiles
whiskers = lowest and highest scores
box = middle 50% (interquartile range)
middle of box = median (50th percentile)
box edges = 25th and 75th percentiles
Bar graphs
used for qualitative/nominal data
bars are spaced apart
X axis = categories
Y axis = frequencies
Y axis scale must accommodate highest frequency
Histograms
used for quantitative data
bars touch (no spacing)
can be standard or grouped
grouped histogram uses same intervals as grouped frequency table
Frequency polygons
histogram turned into line graph
plot a point for each interval frequency then connect points
allows overlay comparison (e.g. male vs female actual vs ideal weight)
Choosing table/graph format
choose based on data type (qualitative vs quantitative)
choose grouping based on range of scores (larger ranges require grouping)
aim for clearest visual snapshot of distribution
Purpose of percentiles
used to understand how a score compares to the rest of the data set
percentile = % of scores at or below a given score
helps interpret standing relative to others
Steps for computing percentiles (individual scores)
start with raw data (hard to interpret directly)
rank order scores (highest to lowest or lowest to highest)
determine n (total number of observations)
calculate simple frequency (SF) = how many times each score occurs
calculate cumulative frequency (CF) = number of scores at or below the given score
formula: percentile = (CF ÷ n) × 100
example: score of 14 with CF = 13 in dataset of n = 20 → percentile = 65
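The percentile steps can be checked in code. The 20 scores below are hypothetical, constructed so that a score of 14 has CF = 13 as in the example; the function name `percentile_rank` is illustrative.

```python
from collections import Counter

def percentile_rank(scores, score):
    """Percent of scores at or below `score`: (CF / n) * 100."""
    n = len(scores)                               # total number of observations
    cf = sum(1 for s in scores if s <= score)     # cumulative frequency
    return 100 * cf / n

# Hypothetical data set: n = 20, with 13 scores at or below 14
scores = [8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
          13, 14, 14, 15, 15, 16, 17, 18, 19, 20]

sf = Counter(scores)[14]                 # simple frequency of the score 14
pct = percentile_rank(scores, 14)
print(sf, pct)
```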
Grouped frequency distributions
used when too many individual score rows would be required
scores are grouped in equal sized intervals (e.g. 0–4, 5–9)
simple frequency = number of scores within each interval
loses precision (exact values unknown, only interval counts known)
Percentiles in grouped data
CF refers to total number of scores at or below the upper bound of the group
same formula used: (CF ÷ n) × 100
meaning of percentile here shifts: % of scores equal to or below the highest score in that group
example: group 25–29 with CF = 21 in dataset of n = 25 → percentile = 84
Summary definitions
n = total number of scores in dataset
simple frequency = how many times a score occurs (or count within group)
cumulative frequency = number of scores equal to or below that score (or group upper limit)
percentile = percent of all scores at or below a particular score (or group upper limit)
Once you have designed a study, the next step is to collect data, and that involves measuring our variables - the characteristics of interest. After the data is collected, the job of a researcher is to sift through the data, clean it up and present it so it is understandable at a glance and more amenable to statistical tests and analysis.
In this module, we looked at both measurement and ways to summarise, organise and display data. Below is a short snapshot of each topic.
1. Levels of measurement: nominal; ordinal; interval and ratio

2. Reliability and validity
Psychologists employ self-report, behavioural, and physiological measures in research.
Every measure can be assessed on its reliability and its validity.
Reliability refers to the stability or consistency of a measure.
Validity refers to the extent to which a measure is assessing what it is meant to assess.
3. Presenting data
Tables and graphs are used to provide a quick visual snapshot or summary of the data.
Choose the one that tells the story most effectively.
Consider whether to use grouping or not based on the range of scores in your data set.
Consider which graph is appropriate based on whether you have qualitative or quantitative data.
4. Percentiles
N refers to the number of scores or observations in our data set.
The simple frequency of a score refers to how many times that score appears in the data set.
The cumulative frequency of a given score refers to the number of scores in the set that are equal to or less than that score.
A percentile is the percent of all scores at or below a given score in the set.
Shape of distributions
the classic “normal curve” is symmetrical and bell-shaped
tails = ends/extremes of distribution
peak = highest point = most frequent score
Skew (symmetry)
skew = lack of symmetry in distribution
Positive skew
right tail extends further (tail goes toward larger scores on X axis)
example: income (few very high values → long right tail)
Negative skew
left tail extends further (tail goes toward lower scores)
example: exam scores when most students do well and only a few do poorly
Kurtosis (spread + peakedness)
kurtosis describes whether the distribution is tall/thin or flat/broad
Leptokurtic
tall and narrow
most scores cluster tightly around centre
tails may extend far
memory trick: “leap” (tall and thin)
Platykurtic
flatter, squashed
more scores spread out across a wider middle range
memory trick: “plateau” (flat top)
Shape and central tendency
mode = score with highest frequency = top of curve
if perfectly symmetrical (normal): mean = median = mode
if skewed: mean and median move toward tail (direction of skew)
Using mean/median/mode to detect skew
positive skew: mean and median > mode (mode smallest)
negative skew: mean and median < mode (mode biggest)
Summary
skew = symmetry of distribution
kurtosis = peakedness/spread of distribution
both concepts matter for many statistical analyses and interpretations
Measures of central tendency
statistics that identify the most representative / typical score in a dataset
goal = find the central point of a distribution
Three measures
mode
median
mean
Mode
most frequently occurring score
in graphs = tallest bar / highest point
can be unimodal, bimodal, or multimodal
advantage: not affected by extreme scores
example use: elections (party with most votes wins = modal response)
drawback: can have multiple modes → not always helpful for a single “typical” score
Median
the middle score once data are ordered lowest → highest
divides dataset into top 50% and bottom 50%
= 50th percentile
odd n → exact middle score
even n → midpoint between two middle scores
advantage: not affected by extreme scores
example use: median house price (better indicator when extreme values exist)
Mean (average)
sum of all scores ÷ number of scores
influenced by every score
disadvantage: heavily affected by extreme scores
can think of as the “balancing point” of distribution (extreme scores pull mean towards them)
Summary
mode: unaffected by outliers, but may have many modes
median: unaffected by outliers, good for skewed distributions
mean: familiar and widely used, but distorted by extreme values
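The effect of one extreme score on each measure can be seen with Python's standard `statistics` module. The income figures are hypothetical.

```python
import statistics

incomes = [30, 35, 35, 40, 45, 50, 400]   # hypothetical incomes ($1000s), one extreme score
without_extreme = incomes[:-1]            # same data with the extreme score dropped

for label, data in [("with extreme", incomes), ("without extreme", without_extreme)]:
    mode = statistics.mode(data)          # most frequent score
    median = statistics.median(data)      # middle score (midpoint of two middles if n is even)
    mean = statistics.mean(data)          # sum of scores divided by number of scores
    print(f"{label}: mode={mode}, median={median}, mean={mean:.2f}")
```

The mode stays at 35 and the median moves only from 37.5 to 40, while the mean jumps from about 39 to about 91 once the extreme score is included.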
Variability
how spread out scores are in a dataset
two sets can have same mean but very different variability → affects interpretation (e.g. teaching a class)
Three measures of variability
range
variance
standard deviation
Range
highest score minus lowest score
simplest index of variability
unstable because one extreme score can change it dramatically
Variance
based on deviation scores
deviation score = individual score minus mean
deviations can be positive or negative → they sum to zero
to remove negative signs, deviations are squared
variance = mean of squared deviation scores
notation: SD² or sometimes SS/N
large variance = greater spread
Standard deviation
square root of variance
expressed in original units (not squared units)
easier to interpret than variance
indicates typical distance of scores from the mean
Summary
range: easy, but distorted by outliers
variance: stable, uses all scores, but expressed in squared units
standard deviation: best interpretation of spread, uses all data, in original units of measurement
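The three measures can be computed step by step. This sketch follows the notes in dividing by N (the population formula); Python's `statistics.variance` divides by n − 1 instead, so `statistics.pvariance` is the matching library call. The scores are hypothetical.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]                # hypothetical scores

score_range = max(scores) - min(scores)          # highest minus lowest
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]          # deviation scores (sum to zero)
ss = sum(d ** 2 for d in deviations)             # sum of squared deviations (SS)
variance = ss / len(scores)                      # SS / N
sd = math.sqrt(variance)                         # back in original units

print(score_range, sum(deviations), variance, sd)
```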
In this module, we covered the different shapes of distributions and measures of central tendency and variability.
Shape of distributions
We considered how to evaluate the shape of a distribution of scores in terms of its symmetry and spread, starting with a bell shaped curve, called the normal curve.
2 statistical properties of distribution shape are:
Skew – the symmetry of the shape
positive skew
negative skew

Kurtosis – the spread and peakedness of the shape
leptokurtic
platykurtic

Source: Dorland, W (2012) Dorland's Illustrated Medical Dictionary 32nd Edition, Saunders/Elsevier. License: Statutory Educational License
Central tendency
The three measures of central tendency are the mode, the median, and the mean.
The mode is the most frequently occurring score in a distribution. In a frequency distribution, the mode is represented by the tallest bar in a bar graph or the highest point in a frequency polygon, for example.
The median is the middle most score after the data have been arranged from lowest to highest. The median divides the data set in half so an equal number of the scores are above and below it.
The mean is another word for the average and is the sum of all the scores divided by the number of scores you have. We summarise it with this formula:
M = ΣX ÷ N
Variability
The range is the simplest measure of variability. It is the difference between the highest score and the lowest score in the data set. The range is easy to compute but it is affected by extreme scores in the data.
The variance takes every score into account and therefore is more stable than the range. The formula for variance is SD² = Σ(X − M)² ÷ N. It is expressed in squared units and so it is not an intuitive description of variability in a data set.
The standard deviation has all the features of variance and the added benefit of being expressed in the original units of the data set and so it is an intuitive description of variability in a data set. The formula for standard deviation is SD = √(Σ(X − M)² ÷ N).
Standard scores (z-scores)
used to express how far above or below average a score is
allows comparison of different measurements on a common scale
Purpose
compares an individual score to the distribution it came from
answers: “how many standard deviations from the mean is this score?”
converts different metrics into the same unit (standard deviations)
Definition
z-score = (raw score – mean) ÷ standard deviation
sign tells direction: + = above mean, − = below mean
magnitude tells how far from the mean in SD units
Examples
mean response time = 1.25 sec, SD = 0.25 sec
Bruce: 1.75 sec → +2 SD → slower than average
Nancy: 1.00 sec → –1 SD → faster than average
Why they are useful
allow precise comparison across different tests/attributes
formalise the relative comparisons we make informally every day (smarter, faster, more musical, etc.)
can compare “apples vs oranges” (scores measured differently)
Applied example
Larry’s marks:
maths: 65% (class M=50, SD=10) → z = +1.5 (well above average)
music: 75% (class M=60, SD=15) → z = +1 (above average but not as much)
interpretation: Larry stands out more in math relative to his peers than in music
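The z-score examples above can be reproduced directly from the definition. The function name `z_score` is illustrative.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

bruce = z_score(1.75, mean=1.25, sd=0.25)   # response time example
nancy = z_score(1.00, mean=1.25, sd=0.25)
maths = z_score(65, mean=50, sd=10)         # Larry's marks
music = z_score(75, mean=60, sd=15)

print(bruce, nancy, maths, music)           # 2.0 -1.0 1.5 1.0
```

Although Larry's raw music mark is higher, his maths z-score is larger, which is why he stands out more in maths.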
Summary
z-scores show exact position of score within distribution
reported in standard deviation units
enable cross-comparison across different measures and scales
Standard normal distribution
a normal distribution expressed in z-scores
very important in psychology because many traits approximate normality (IQ, memory scores, etc)
Properties of normal distributions
unimodal and bell-shaped
symmetrical
tails approach x-axis but never touch
specific predictable % of scores fall within set SD units of mean
50% below mean, 50% above mean
34.13% between mean and +1 SD (and same between mean and –1 SD)
13.59% between +1 and +2 SD
2.14% between +2 and +3 SD
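These percentages can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8+). The helper `area_between` is an illustrative name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # standard normal: mean 0, SD 1

def area_between(z_low, z_high):
    """Proportion of the standard normal curve between two z-scores."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

for z_low, z_high in [(0, 1), (1, 2), (2, 3)]:
    pct = 100 * area_between(z_low, z_high)
    print(f"between z={z_low} and z={z_high}: {pct:.2f}%")
```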
Transforming any normal distribution to standard normal
different normal distributions can have different means and SDs
when raw scores are converted to z-scores → distribution becomes “standard normal”
standard normal always has:
mean = 0
SD = 1
mean becomes 0 because subtracting mean from every score centres it
SD becomes 1 because we divide by SD in the z-score formula
z-tables (tables of areas)
show area under the standard normal curve for given z-scores
table usually only lists positive z-scores (symmetry means negative side is identical)
table columns show:
area between mean and that z-value
area above that z-value in tail
Why this matters
lets us calculate exact percentage of population within any standard deviation range
lets us compare different measures using a common scale (z) and known proportions under curve
Using z-tables
z-tables let us find % of scores below, above, between values, or find a raw value from a percentile
ALWAYS draw a diagram first
remember: normal distribution is symmetrical → 50% below mean, 50% above mean
Example 1: percentile rank of a raw score
IQ mean = 100, SD = 15
what is percentile rank of IQ = 115?
convert to z: (115−100)/15 = 1
area between mean and z=1 = 34.13%
add the 50% below the mean: 50% + 34.13% = 84.13%
percentile rank ≈ 84th percentile
Example 2: % of scores between two values
IQ mean = 100, SD = 15
what % is between 100 and 115?
z = 1
area from mean to z=1 = 34.13%
answer: 34.13%
Example 3: find the raw score that cuts off the top 5%
IQ mean = 100, SD = 15
top 5% = area beyond z
area between mean and boundary = 50% − 5% = 45%
find z where table area is ~0.4495 → z = 1.64
convert z back to raw: X = M + z×SD = 100 + 1.64×15 = 124.6
top 5% have IQ > 124.6
Example 4: find raw scores cutting off top and bottom 2.5%
IQ mean = 100, SD = 16
area in tail 2.5% → z = ±1.96
lower boundary: X = 100 − 1.96×16 = 68.64
upper boundary: X = 100 + 1.96×16 = 131.36
2.5% have IQ < 68.64 and 2.5% have IQ > 131.36
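The four worked examples can be cross-checked in code, using `statistics.NormalDist` in place of a printed z-table. This is a sketch rather than the lecture's method: the software uses unrounded z-values, so Example 3 comes out at about 124.67 instead of the table-based 124.6 (z rounded to 1.64).

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Example 1: percentile rank of IQ = 115
percentile_115 = 100 * iq.cdf(115)

# Example 2: % of scores between 100 and 115
between_100_115 = 100 * (iq.cdf(115) - iq.cdf(100))

# Example 3: raw score cutting off the top 5% (the 95th percentile)
top_5_cutoff = iq.inv_cdf(0.95)

# Example 4: cut-offs for the bottom and top 2.5% (this example uses SD = 16)
iq16 = NormalDist(mu=100, sigma=16)
lower, upper = iq16.inv_cdf(0.025), iq16.inv_cdf(0.975)

print(round(percentile_115, 2), round(between_100_115, 2))
print(round(top_5_cutoff, 2), round(lower, 2), round(upper, 2))
```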
Summary
combine: z-scores + standard normal + z-table to find:
percentile rank of raw score
% between values
raw score for a given percentile
boundary cut-offs for any tail area
ALWAYS sketch the area and use symmetry to avoid mistakes
In this module, we introduced z-scores, z-tables and the normal distribution. We also looked at how z-tables can be used to work out specific areas under the standard normal curve in order to answer questions like:
1) What percentage of the scores fall below a given point?
2) What percentage of the scores fall above a given point?
3) What percentage of the scores fall between two values?
4) What is the actual value based on a percentile?
Here is a short summary of what we covered in this module.
z-scores
z-scores are a way of reporting a score’s precise position within a distribution.
They indicate a score’s distance from the mean in standard deviation units.
z-scores are a way of comparing performances that are measured in different units. In common terms, z-scores allow us to compare apples and oranges.
The normal distribution
Many of the attributes we measure in psychology tend to distribute as a normal distribution.
Normal distributions are unimodal and bell shaped.
The standard normal distribution has a mean of 0 and a standard deviation of 1.
We can use tables to work out specific areas under the curve.
z-scores and percentiles
We can use z-scores, the z-tables and the standard normal distribution to:
determine the percentile rank of specific scores
determine the specific score at a given percentile rank
find out the percentage of scores that fall within a given range
find the boundary values that mark off specific ranges in the distribution.
TIP!
To avoid careless mistakes, draw a picture of what you are looking for and remember that the normal distribution is symmetrical with 50% of the scores below the mean and 50% above it.
Correlation (overview)
describes the relationship between two variables
used constantly in psychology (brain activity vs behaviour, study time vs test score, etc)
Scatterplots
primary visual display for correlations
plots two variables on one graph
shows pattern/trend (if any)
linear relationship → points cluster around a straight line
positive relationship → higher X goes with higher Y
Pearson’s r (correlation coefficient)
numerical index capturing direction + strength of linear relationship
uses z-scores to compare how each person’s X score and Y score sit relative to their means
Logic of correlation using quadrants
vertical line = mean of X
horizontal line = mean of Y
creates 4 quadrants
upper right + lower left = both z-scores same sign → positive cross-products
upper left + lower right = opposite signs → negative cross-products
large magnitudes = greater influence on r
Computing Pearson’s r (conceptually)
convert raw scores → z-scores
multiply z-score pairs (cross-products)
sum cross-products
divide by N
result ranges from −1 to +1
+1 = perfect positive, −1 = perfect negative, 0 = no linear relationship
Why z-scores matter here
removes units of measurement
allows correlating different measures on different scales fairly
patterns remain identical after conversion, but now we can directly compute r
Summary
correlation quantifies whether above-average scores in one variable go with above or below average scores in another
uses z-scores + cross-products to produce a unitless index (Pearson’s r)
sign = direction, magnitude = strength
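The conceptual steps for Pearson's r translate directly into code. A minimal sketch, using SDs that divide by N as in these notes; the study-hours data and the function name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as the mean of the z-score cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)   # SD with N in the denominator
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    zx = [(x - mean_x) / sd_x for x in xs]                     # raw scores -> z-scores
    zy = [(y - mean_y) / sd_y for y in ys]
    cross_products = [a * b for a, b in zip(zx, zy)]           # multiply z-score pairs
    return sum(cross_products) / n                             # sum, then divide by N

hours = [1, 2, 3, 4, 5]            # hypothetical hours studied
scores = [40, 50, 45, 70, 80]      # hypothetical test scores (%)
r = pearson_r(hours, scores)
print(round(r, 2))
```

Points in the upper-right and lower-left quadrants contribute positive cross-products, so this mostly rising pattern gives an r close to +1.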
Correlation calculation using spreadsheets
dataset: 40 participants → hours studied vs test score (%)
first step: check data for errors or outliers using a scatterplot
impossible scores removed (e.g. negative hours or test scores)
after cleaning → 38 participants remain
Descriptive statistics
mean hours studied = 3.882
mean test score = 57.658
Step 1: deviation scores
subtract mean from each raw score
gives distance of each value from the mean
Step 2: variance and standard deviation
square deviation scores and sum
divide by N (38)
take square root to get SD
SD (hours studied) = 2.005
SD (test score) = 22.930
Step 3: z-scores
convert each deviation score to a z-score
(deviation ÷ SD)
Step 4: cross-products
multiply each pair of X and Y z-scores
sum of cross-products = 36.263
Step 5: compute Pearson’s r
r = Σ(zx × zy) / N
result: r(38) = .95
indicates a very strong positive linear correlation between study time and test score
Understanding Pearson’s r
measures how well data fit a straight line (line of best fit)
higher r → points cluster closer to line
perfect ±1 correlation = 100% shared variance, zero error variance
Coefficient of determination (shared variance)
r² = proportion of shared variance
for r = 0.95 → r² = 0.90 → 90% shared variance
meaning: 90% of variability in test scores is explained by hours studied
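The last two steps can be reproduced from the figures reported above (sum of cross-products = 36.263, N = 38). Note that the .90 follows from squaring the rounded r of .95; squaring the unrounded r gives roughly .91.

```python
sum_cross_products = 36.263     # reported in the notes
n = 38                          # participants remaining after cleaning

r = sum_cross_products / n                  # Pearson's r
r_rounded = round(r, 2)                     # value as reported
r_squared = round(r_rounded ** 2, 2)        # coefficient of determination from rounded r
r_squared_unrounded = round(r ** 2, 2)      # same, without rounding r first

print(r_rounded, r_squared, r_squared_unrounded)
```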
Reporting
standard reporting format:
r(38) = .95, p < .001, r² = .90
includes both correlation coefficient and proportion of shared variance
Summary
correlation analysis steps:
clean data (remove impossible/outlier scores)
compute means and SDs
convert to z-scores
compute cross-products
sum, divide by N → get Pearson’s r
r² × 100 = % shared variance (coefficient of determination)
Meaning of r (correlation coefficient)
magnitude of r = strength of relationship
~.20–.30 = small
~.60 = strong
1.0 = perfect relationship
r² = proportion of variance explained → increases as r increases
sign (positive/negative) = direction, but magnitude strength interpreted the same whether + or −
Direction interpretation
positive r → variables move together (↑X → ↑Y)
example: height ↑ → weight ↑
negative r → inverse relationship (↑X → ↓Y)
example: BAC ↑ → driving performance ↓
r = 0 → no linear association
Zero correlation could mean:
truly no relationship (data points random)
no variation in one variable (flat line)
Important cautions when interpreting correlation
correlation detects linear relationships only
non-linear (e.g. Yerkes-Dodson law: arousal vs performance) → r can hide real relationship → will look like “no correlation”
always check scatterplot first
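A perfectly predictable but non-linear pattern can still produce r = 0. The sketch below uses a hypothetical inverted-U data set (in the spirit of arousal vs performance) and a compact form of Pearson's r that is algebraically equivalent to the z-score method.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviation scores (equivalent to the mean z-score cross-product)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_products = sum(a * b for a, b in zip(dx, dy))
    return sum_products / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

arousal = [-2, -1, 0, 1, 2]                     # hypothetical, centred on the optimum
performance = [-(a ** 2) for a in arousal]      # best in the middle, worse at both extremes

r = pearson_r(arousal, performance)
print(abs(round(r, 6)))
```

Performance is completely determined by arousal here, yet the rising and falling halves cancel, which is why the scatterplot has to be inspected before r is interpreted.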
Range restriction problems
small/narrow range of one variable → r artificially smaller (underestimation)
example: measuring children’s height only between ages 8.5–9.5 → hides true height-age relationship
truncated range (only top end or bottom end) → same issue
example: only sampling top OP students → hides full negative relationship between OP and GPA (lower OP = better performance)
Outliers
extreme scores can massively distort r
scatterplot review is critical before interpreting results
Correlation ≠ causation
correlation shows association, NOT cause
possible explanations when r exists:
X causes Y (e.g. increasing temperature increases pressure in sealed container)
X and Y both caused by a third variable (e.g. sleeping with shoes on and headaches both caused by alcohol consumption)
spurious coincidence (e.g. mozzarella consumption correlates with engineering doctorates)
Summary
r magnitude tells strength, sign tells direction
r² tells proportion of shared variance
must consider: linearity, range restriction, truncation, outliers, third variables
always check scatterplot → never infer causation from correlation alone
Introducing correlation
Correlation captures the extent to which above-average scores on one variable go with scores above or below average on another variable.
z-scores are an index of a score’s position relative to the mean of a set of scores; converting to z-scores removes the units of measurement.
Cross products refers to multiplying the z-scores of one variable with the associated z-scores of the other variable.
The sign of the sum of the cross products indicates the direction of the relationship.
Pearson’s r (the sum of the cross products divided by N) expresses the relationship in standard form with a maximum value of +/- 1.
Calculating correlation
There are several steps involved in a typical correlation analysis.
The main analysis to arrive at the correlation coefficient for Pearson’s r uses the formula r = Σ(zx × zy) / N.
An additional analysis gives us the coefficient of determination: r², also referred to as the proportion of shared variance.
The percentage of shared variance is calculated by r² × 100.
Interpreting correlation
Factors that influence your interpretation of a correlation coefficient:
Magnitude
Direction
Linearity
Range restriction
Range truncation
Extreme scores
Attribution of result.
Probability + hypothesis testing context
humans are poor at intuitively judging probabilities
inferential statistics exist to formalise probability reasoning
Example: coin flip vs birthday paradox
coin flip: 50% chance
only ~23 people needed in a room for ~50% chance two share a birthday
illustrates how unintuitive probability is
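The birthday figure can be verified by computing the chance that everyone has a different birthday and subtracting it from 1. The sketch assumes 365 equally likely birthdays and ignores leap years.

```python
def p_shared_birthday(n_people, days=365):
    """Probability that at least two of n_people share a birthday."""
    p_all_different = 1.0
    for k in range(n_people):
        # Person k+1 must avoid the k birthdays already taken
        p_all_different *= (days - k) / days
    return 1 - p_all_different

print(round(p_shared_birthday(22), 3))   # just under .5
print(round(p_shared_birthday(23), 3))   # just over .5
```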
Descriptive vs inferential statistics
descriptive stats: describe samples
inferential stats: draw conclusions about population from sample
Populations vs samples
population = entire group of interest (e.g. all Australians, all PD patients)
sample = subset of population used to estimate population characteristics
population parameters use Greek letters (μ = population mean, σ = population SD)
sample stats use Latin letters (M = sample mean, SD = sample SD)
Probability example using marbles
if population composition known → can compute exact probability of selecting specific types
e.g. 10 red + 10 green → P(red) = .5
1 red + 19 green → P(red) = .05 (rare outcome)
Normal distribution and probability
can determine probability of obtaining a particular score (or score range) if distribution is normal
34.13% of scores fall between mean and +1 SD (same below)
95% of scores fall between ±1.96 SD
tails beyond ±1.96 SD = only 5% of area → very unlikely event
Implication
if an observation falls in an extreme tail, it is unlikely to have occurred by chance alone if it came from that population
inferential statistics allow us to evaluate how surprising / probable an observation is given population distribution
Summary
probability intuition is unreliable
inferential stats use probability to generalise sample results to populations
knowing the population distribution (particularly normal) lets us quantify how likely an observation is
Sampling and inferential statistics
inferential stats = making claims about a population using a sample
representative sample must be random → every individual has equal + independent chance of selection
biased sampling → leads to misleading conclusions (e.g. polling only rich)
Sampling error + sampling variability
sampling error = sample statistic differs slightly from population parameter
sampling variability = each sample will give slightly different values
even if population mean = 4, random samples may get means like 3.5, 4.5 etc
Sampling distributions
if we repeatedly draw samples (e.g. sample size = 25) and compute sample mean each time → distribution of sample means forms
distribution of sample means becomes normal shaped
mean of sampling distribution = population mean
Likely vs unlikely sample means
central 95% of sample means = “likely”
outer 5% in tails = “unlikely” (2.5% each tail)
means in tails are surprising if population truly has that μ
Defining the sampling distribution fully
To describe a distribution we need:
mean
SD
shape
1) Mean
mean of distribution of sample means = population mean (μ)
2) SD → Standard Error of the Mean (SEM)
variance of sampling distribution = population variance ÷ sample size
take square root → SEM
SEM = σ / √n
3) Shape
central limit theorem: sampling distribution ≈ normal if n ≥ 30 (regardless of original population shape)
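The three properties (mean, SEM, shape) can be checked by simulation. The sketch below is not from the lecture: it assumes a deliberately skewed population (exponential, mean 1, SD 1), draws 5000 samples of n = 30, and compares the mean and SD of the sample means with μ and σ / √n.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

N_PER_SAMPLE = 30   # size of each sample
N_SAMPLES = 5000    # number of repeated samples

# Assumed skewed population: exponential with mean 1 and SD 1.
POP_MEAN = 1.0
POP_SD = 1.0

sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.expovariate(1.0 / POP_MEAN) for _ in range(N_PER_SAMPLE)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sem = POP_SD / math.sqrt(N_PER_SAMPLE)

print("mean of sample means:", round(mean_of_means, 3), "vs mu =", POP_MEAN)
print("SD of sample means:  ", round(sd_of_means, 3),
      "vs sigma/sqrt(n) =", round(theoretical_sem, 3))
```

A histogram of `sample_means` would look roughly bell shaped even though the population itself is strongly skewed.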
Using SEM + z-tables example
population: mean = 70, SD = 20, sample size = 25
SEM = 20 / √25 = 20 / 5 = 4
central 95% cut-off = ±1.96 SEM (z = ±1.96)
lower limit = 70 − (1.96×4) = 62.16
upper limit = 70 + (1.96×4) = 77.84
→ 95% of sample means will fall between 62.16 and 77.84
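The same calculation as a small helper, as a sketch only. The function name `likely_range` is made up for illustration; it assumes the sampling distribution is normal.

```python
import math


def likely_range(mu, sigma, n, z_crit=1.96):
    """Limits of the central region of the distribution of sample means."""
    sem = sigma / math.sqrt(n)   # standard error of the mean
    margin = z_crit * sem        # distance from mu to each cut-off
    return mu - margin, mu + margin


lower, upper = likely_range(mu=70, sigma=20, n=25)
print(round(lower, 2), round(upper, 2))  # 62.16 77.84
```

Calling `likely_range(mu=100, sigma=15, n=9)` gives the 90.2 to 109.8 limits used in the IQ example.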
Summary
sampling error + variability = sample stats differ from population stats and from each other
sampling distribution = distribution of all possible sample means
properties of sampling distribution:
mean = μ
shape = normal (if n ≥ 30)
SD = SEM = σ / √n
SEM + z-tables can estimate probability of observing a particular sample mean
Hypothesis testing & sampling distributions
distribution of sample means = model of all possible outcomes if samples are drawn randomly from a population
shows what we’d expect if chance were the only factor
lets us objectively decide whether a given sample is likely or unlikely to represent that population
Likely vs unlikely events
middle 95% of sample means = likely region
outer 5% (2.5% in each tail) = unlikely regions
“p < 0.05” → observed sample mean fell in one of these tails
“statistically significant” = sample mean is unlikely under chance (if population parameters are true)
Example – IQ population
population: μ = 100, σ = 15
sample size = 9
SEM = σ / √n = 15 / 3 = 5
z = ±1.96 defines 95% boundaries
lower = 100 − 1.96×5 = 90.2
upper = 100 + 1.96×5 = 109.8
→ any sample mean between 90.2–109.8 = likely
→ means below 90.2 or above 109.8 = unlikely
Linking to hypotheses
Null hypothesis (H₀): no real effect; chance explains results
sample mean in likely region → consistent with H₀
Alternative hypothesis (H₁): something beyond chance is happening
sample mean in tail → consistent with H₁
suggests sample may come from different population (e.g., different mean IQ)
Interpretation
tails = unusual samples → raise suspicion of different underlying population
statistical test doesn’t tell why — only that it’s improbable under H₀
explanation comes from theory or context (e.g., schooling quality, pollution, etc.)
Summary
distribution of sample means = model of expected outcomes if sampling from known population
define likely (central 95%) vs unlikely (tail 5%) regions using z = ±1.96 and SEM
sample mean within likely region → supports Null
sample mean in tails → supports Alternative
Single sample z test (consolidation)
Concept
we are judging if a sample mean is likely to have come from a known population
we use the distribution of sample means as the model of what is expected by chance
if the sample mean falls in the central 95% → likely (consistent with Null)
if it falls in the outer 5% (top/bottom 2.5%) → unlikely (consistent with Alternative)
Example 1 (training program)
known population (no-training workers): μ = 53, σ = 7
sample after training: N = 25, sample mean M = 48
calculate SEM = σ / √n = 7 / 5 = 1.4
then compute z = (M − μ) / SEM = (48 − 53) / 1.4 = −3.57
critical cut-offs = ±1.96 (α = .05, two-tailed)
|−3.57| > 1.96 → sample mean is in the tail → unlikely it came from no-training population
→ conclude training reduced errors (statistically significant)
Example 2 (fake cavemen)
known cavemen population: μ = 142, σ = 20
sample: N = 16, M = 163
SEM = 20 / 4 = 5
z = (163 − 142) / 5 = 4.2
4.2 > 1.96 → in tail → unlikely to be cavemen
→ sample is probably just modern humans
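Both examples follow the same recipe, sketched below. The function name is illustrative; the two-tailed p value is an extra computed from the standard normal curve and is not quoted in the lecture.

```python
import math
from statistics import NormalDist


def single_sample_z(sample_mean, mu, sigma, n):
    """Return (z, two-tailed p) for a sample mean against a known population."""
    sem = sigma / math.sqrt(n)
    z = (sample_mean - mu) / sem
    p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


z_training, p_training = single_sample_z(sample_mean=48, mu=53, sigma=7, n=25)
z_cavemen, p_cavemen = single_sample_z(sample_mean=163, mu=142, sigma=20, n=16)

for label, z, p in [("training", z_training, p_training),
                    ("cavemen", z_cavemen, p_cavemen)]:
    decision = "reject H0" if abs(z) > 1.96 else "retain H0"
    print(label, round(z, 2), decision, "p < .05" if p < 0.05 else "p >= .05")
```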
Why this matters
z test is the direct statistical formalisation of:
“is this sample mean so extreme that chance is not a reasonable explanation?”
this exact logic underlies many later tests (single sample t, paired t, independent t)
if you understand this, you understand the core reasoning of inferential statistics
Summary
use population μ and σ to build the sampling distribution
convert sample mean to a z score using SEM
compare to critical cut-offs (usually ±1.96)
reject Null if z falls in tail (p < .05)
Introduction to hypothesis testing
We are very bad at judging probabilities off the top of our heads.
Inferential statistics are the set of rules we use to correctly evaluate probabilities.
Inferential statistics involves making a generalisation from a sample of data to the population we are interested in.
When we have sufficient information about the population we are sampling from, we can make judgments about how representative an observation is.
Samples, populations, and the distribution of sample means
Sampling error and sampling variability: any sample we select from a population will have slightly different statistics from the population it was selected from, and from other samples drawn from that population.
The distribution of sample means is the distribution of the means of all possible samples (of a given size) drawn from a population.
The distribution of sample means has three critical features:
its mean is the same as the population
it is a normal shaped distribution
the standard error of the mean is equal to the standard deviation of the population divided by the square-root of the sample size.
We can use the z-tables to answer questions about probabilities associated with any given sample mean.
Using the sampling distribution
The distribution of sample means is a model of expected samples drawn from a population we know about.
Conceptually, we set up regions in the distribution delimiting values of sample means we would deem likely and those we would deem unlikely.
After describing the distribution of sample means – by declaring its mean, shape and standard error of the mean, we employ critical z-score values and the standard error of the mean to determine the likeliness of our sample.
A likely sample is consistent with the Null hypothesis
An unlikely sample is consistent with the Alternative hypothesis.
Single sample z-test
These two examples are of a legitimate statistical test, the z test. This test is used often in psychological research.
The logic of this test forms the basis of several other statistical tests like the single sample t-test, the dependent groups t-test and even the independent groups t-test.
If you followed all these steps and understand the concepts, you are well prepared for the statistics we are going to explore.
If you don't fully understand the concepts, please go back and revise.
Null vs Alternative Hypotheses
Why we need them
the Alternative hypothesis (the “interesting” one) is always vague in real research
(“sleep deprivation slows reaction time”, “studying increases grades”)
you can’t test vague — you need a precise prediction
the Null Hypothesis gives us a precise prediction: ZERO difference
Null Hypothesis (H₀)
predicts that nothing special is happening
the sample is from the known population
any difference between sample mean and population mean is just sampling error
the mean of the sample = the mean of the known population (within random fluctuation)
Alternative Hypothesis (H₁)
predicts there IS some effect / difference
sample mean is NOT from that known population
the sample mean came from a different population (with a different mean)
How this connects to the distribution of sample means
we use the sampling distribution as a model of what chance produces under H₀
central 95% (green) = likely → retain H₀
extreme 5% (tails) = unlikely → reject H₀ and support H₁
Example (alcoholism and reaction time)
population non-alcoholics: μ = 375ms
H₀: alcoholics’ mean = 375ms (sample comes from that distribution)
H₁: alcoholics’ mean ≠ 375ms (sample is from a different distribution)
Summary
H₁ is the idea/theory we care about — but it’s not mathematically testable as stated
H₀ is crafted to be mathematically exact (no difference = 0)
hypothesis testing uses the Null Hypothesis as the model to test against:
sample mean in central region → consistent with H₀
sample mean in tails → inconsistent with H₀ → evidence for H₁
Key steps in hypothesis testing
context
research Q: do chronic alcoholics have slower reaction times than non-alcoholics?
known population (non-alcoholics): mean = 375ms, SD = 120
sample of 16 alcoholics: mean = 475ms
hypotheses
H₁ = alcoholics differ (slower) — vague
H₀ = alcoholics same as population (mean difference = zero) — precise + testable
logic
always start by assuming H₀ true
build model of what sample means look like if H₀ is true
compute probability of getting our sample mean under H₀
decision system
sample mean in central 95% of sampling distribution → probability high → retain H₀ → difference due to chance
sample mean in outer 2.5% tails → probability low → reject H₀ → difference not due to chance → support H₁
working the example
SEM = √(120² / 16) = 30
z = (475 − 375) / 30 = 3.33
critical z ±1.96 (5% cutoff)
conclusion
z = 3.33 is beyond 1.96 → very unlikely under H₀ → reject H₀
therefore adopt H₁: alcoholics have slower reaction times
summary
assume H₀
compute probability of the observed sample under H₀
high probability → retain H₀
low probability → reject H₀ and accept H₁
this is exactly what the z-test does
errors + power in hypothesis testing
two realities
reality 1 → H₀ is TRUE (no real effect)
reality 2 → H₀ is FALSE (there IS a real effect)
two decisions
retain H₀
reject H₀
correct decisions
retain H₀ when H₀ true
reject H₀ when H₀ false (this = POWER)
errors
Type I error = reject H₀ when H₀ actually true
false positive
controlled by the alpha level (α), the cut-off that the p value is compared against
alpha .05 means a 5% chance of a Type I error when H₀ is actually true
alpha .01 lowers Type I error risk to 1%
Type II error = retain H₀ when H₀ actually false
false negative
happens when real effect exists but test fails to detect it
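The link between alpha and the Type I error rate can be seen by simulation. This sketch is not from the lecture: it assumes H₀ is true by construction (samples really do come from the IQ population, μ = 100, σ = 15, n = 9) and counts how often a z test at ±1.96 rejects anyway.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

MU, SIGMA, N = 100.0, 15.0, 9   # H0 is true: samples come from this population
Z_CRIT = 1.96                   # two-tailed cut-off for alpha = .05
N_EXPERIMENTS = 20000

sem = SIGMA / math.sqrt(N)
rejections = 0
for _ in range(N_EXPERIMENTS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    z = (statistics.fmean(sample) - MU) / sem
    if abs(z) > Z_CRIT:
        rejections += 1          # a false positive, because H0 is true

type_i_rate = rejections / N_EXPERIMENTS
print("proportion of false positives:", round(type_i_rate, 3))
```

The proportion should sit close to .05, which is what setting alpha at .05 means.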
visual
H₀ and H₁ distributions overlap
if sample mean falls in overlap where it is not “extreme enough” → Type II error
power
power = correctly rejecting H₀ when H₀ is false
power = ability to detect a real effect
alpha + trade-off
raising alpha → more power → but more Type I errors
lowering alpha → fewer Type I errors → but more Type II errors → lower power
effect size influence
larger effect size → distributions further apart → less overlap → less Type II error → more power
sample size influence
larger sample → smaller standard error → bigger z values → more likely to reject H₀ → more power
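The influence of alpha, effect size and sample size on power can be sketched for the z test. The function below is an illustration, not the lecture's method: `true_diff` is an assumed real difference between the sample's population mean and the H₀ mean, and σ = 120 is borrowed from the reaction time example.

```python
import math
from statistics import NormalDist


def z_test_power(true_diff, sigma, n, alpha=0.05):
    """Power of a two-tailed single sample z test.

    true_diff is the assumed real difference from the H0 mean (hypothetical).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = true_diff / (sigma / math.sqrt(n))  # real effect in SEM units
    # Probability that the observed z lands beyond either critical cut-off.
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


baseline = z_test_power(true_diff=50, sigma=120, n=16)
print("baseline      ", round(baseline, 3))
print("bigger effect ", round(z_test_power(100, 120, 16), 3))
print("bigger sample ", round(z_test_power(50, 120, 64), 3))
print("stricter alpha", round(z_test_power(50, 120, 16, alpha=0.01), 3))
```

With no real effect (`true_diff=0`) the function returns alpha itself, which is the Type I error rate rather than power.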
summary
Type I = false alarm
Type II = miss
alpha sets Type I rate
power = ability to detect real effect
↑ alpha = ↑ power but ↑ Type I
↓ alpha = ↓ Type I but ↓ power
↑ effect size + ↑ sample size → ↑ power
Null and alternative hypothesis
The distribution of sample means and the definition of likely and unlikely sample means are directly linked to testing hypotheses.
A testable hypothesis must make a precise prediction that can be evaluated.
The Alternative Hypothesis (H₁) is usually the hypothesis that motivates a study. However, it is too imprecise to be tested directly.
The Null Hypothesis (H₀) is useful because it makes a precise prediction: ZERO, no relationship between the IV and DV, nothing special about our sample.
Therefore, the Null hypothesis is the critical hypothesis when we are testing hypotheses.
Null hypothesis statistical tests
We worked through the computational steps of testing the hypothesis that chronic alcoholics have a slower reaction than the general non-alcoholic population.
Our first step is to assume the null hypothesis is true.
We then evaluate the probability of observing our sample if the null hypothesis is true.
If the probability is high, we retain the null hypothesis
If the probability is low, we reject the null hypothesis and adopt the alternative hypothesis
We tested the likelihood of the sample with a z-test.
Decision errors
When conducting hypothesis testing, we can make correct decisions or errors
We can incorrectly reject (Type I) or retain (Type II) the null hypothesis.
The power of a test is the probability of correctly rejecting the null hypothesis (and inferring that our alternative hypothesis may be correct) when the null hypothesis is, in reality, false.
Attempting to maximise the power of a test is a juggling act between minimising two forms of error.
Power can also be influenced by effect size and sample size.
t test intro
foundation
z test → compares sample mean to population mean
uses population SD to compute standard error of mean (SEM)
SEM = how much a sample mean typically differs from the population mean
SEM calculation
σ² / N → then square root
or → σ / √N
same outcome
hypothesis testing with z
find SEM
compute z = (sample mean – population mean) / SEM
compare to critical ±1.96 (for α = .05, 2-tailed)
if z beyond ±1.96 = reject H₀
real world problem
we almost never know the population SD
cannot compute “true” SEM
solution = estimate
t test
replace σ with S (sample estimates)
t = (sample mean – population mean) / estimated SEM
S = estimate → because we now estimate population variance + SEM
estimating variance
sample mean is unbiased
sample variance is biased if divide by N
fix = divide by N-1 → inflates slightly → corrects bias
N-1 = degrees of freedom (df)
df = how many scores are free to vary given a fixed mean
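The N versus N−1 distinction can be seen on a tiny data set. The scores below are made up for illustration; `statistics.pvariance` divides by N and `statistics.variance` divides by N−1.

```python
import math
import statistics

scores = [4, 7, 6, 3, 5]  # hypothetical sample
n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)   # sum of squared deviations

biased_variance = ss / n                    # divides by N: underestimates
estimated_population_variance = ss / (n - 1)  # divides by df = N - 1

print(ss, biased_variance, estimated_population_variance)  # 10.0 2.0 2.5

# The standard library makes the same distinction.
assert math.isclose(biased_variance, statistics.pvariance(scores))
assert math.isclose(estimated_population_variance, statistics.variance(scores))
```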
evaluate t
same logic as z
but t distributions change shape depending on df
low df → heavier tails, more spread out than the normal curve (heavy tails make it leptokurtic)
as df ↑ → t distribution → approaches normal → critical t → approaches 1.96
critical values
t critical depends on df
example df = 15, α = .05 (two-tailed) → t critical ≈ ±2.131
need t observed to exceed ±2.131 to reject H₀
summary
t test = same logic as z but uses estimated SD
z = population SD known
t = population SD unknown
t distribution changes with df
small samples → need larger t to reject H₀
single-sample t test worked example
purpose
use t when you want to test whether a sample mean is significantly different from a known comparison mean
same logic as z test
BUT population SD unknown → need to estimate variance + SEM
example: CO₂ 1970 vs 2000
1970 CO₂ = comparison mean
sample from year 2000 = sample mean
have sum of squares for the sample
steps
1. estimated population variance
sample variance underestimates population variance
fix bias by dividing by N−1 (df)
SS / (N−1)
= 72600 / 24 = 3025
2. estimated standard error of mean (SEM)
SEM = √(estimated population variance / N)
= √(3025 / 25)
= √121
= 11
3. compute t
t = (sample mean – comparison mean) / estimated SEM
t obtained = 3.18
4. critical value
t distribution shape varies by df
low df → heavier tails → bigger critical values
df = N−1 = 24
t critical (α=.05 2-tailed) = 2.064
5. decision
t obtained = 3.18 > 2.064
reject H₀
conclude sample mean is unlikely to come from distribution centred at 1970 levels
interpret directionally: CO₂ 2000 > CO₂ 1970
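Steps 1 to 5 as a sketch, using only the numbers given in the notes. The sample and comparison means are not given, so the code works backwards from the reported t to the mean difference it implies; the critical value is entered from the t table rather than computed.

```python
import math

SS = 72600          # sum of squares for the year-2000 sample (from the notes)
N = 25              # sample size (from the notes)
T_OBTAINED = 3.18   # reported in the notes
T_CRITICAL = 2.064  # t table: df = 24, alpha = .05, two-tailed

df = N - 1
estimated_variance = SS / df                       # step 1: SS / (N - 1)
estimated_sem = math.sqrt(estimated_variance / N)  # step 2

# Step 3 rearranged: t = difference / SEM, so difference = t * SEM.
implied_mean_difference = T_OBTAINED * estimated_sem

reject_h0 = abs(T_OBTAINED) > T_CRITICAL           # steps 4 and 5

print(estimated_variance, estimated_sem)   # 3025.0 11.0
print(round(implied_mean_difference, 2))   # 34.98
print("reject H0" if reject_h0 else "retain H0")
```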
reporting (APA)
report direction of effect + means
report statistic as: t(24) = 3.18, p < .05
summary
calculate estimated variance using SS/(N−1)
derive estimated SEM
compute t
compare to t critical using df
reject or retain H₀
report t, df, p, and direction of difference
repeated measures (dependent means) t test
what it is
used when the same participants provide 2 scores
e.g. pre-post, or 2 conditions per participant
statistic is based on difference scores (condition1 − condition2)
why we use difference scores
we are not comparing 2 separate group means
we are comparing: mean difference between 2 conditions vs 0
H₀: mean difference = 0
H₁: mean difference ≠ 0
example: tickling robot
participants are tickled twice
condition A = self-controlled robot
condition B = experimenter-controlled robot
DV = milliseconds until pulling away
direction must be defined: in this case = self tickle minus experimenter tickle
positive numbers = self tickle tolerated longer
steps
compute difference score per participant
compute SS of these difference scores
estimated population variance = SS / (N−1)
SEM = √(estimated variance / N) based on difference scores
t = (mean difference – 0) / SEM
compare t obtained vs t critical using df = N−1
if |t obtained| > t critical → reject H₀
example results given
mean difference = 6.33 ms
estimated population variance = 26.06
SEM calculated from that
t obtained = 3.04
df = 5
t critical at α=.05 (two-tailed) = 2.571
3.04 > 2.571 → reject H₀
interpretation: participants tolerate tickling longer when they control it
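The steps can be sketched in two ways: from the summary values reported above, and from raw paired scores. The raw scores in the second part are invented purely to show the mechanics; they are not the tickling data.

```python
import math
import statistics


def t_from_summary(mean_diff, estimated_variance, n):
    """t for difference scores, given summary values."""
    sem = math.sqrt(estimated_variance / n)
    return (mean_diff - 0) / sem   # H0: mean difference = 0


def dependent_t(cond_a, cond_b):
    """t and df from paired scores, using difference scores (a - b)."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    estimated_variance = statistics.variance(diffs)   # SS / (N - 1)
    return t_from_summary(statistics.mean(diffs), estimated_variance, n), n - 1


# Values reported in the notes: N = 6 because df = 5.
t_notes = t_from_summary(mean_diff=6.33, estimated_variance=26.06, n=6)
print(round(t_notes, 2))  # 3.04

# Hypothetical paired scores, for illustration only.
self_controlled = [10, 12, 9, 14]
experimenter_controlled = [8, 9, 9, 10]
t_demo, df_demo = dependent_t(self_controlled, experimenter_controlled)
print(round(t_demo, 3), df_demo)
```

Swapping the order of the two conditions flips the sign of t, which is why the direction of subtraction has to be stated.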
reporting (APA)
report both condition means and the direction
specify this is a dependent means t test
report t(df) and p value
e.g.: t(5)=3.04, p<.05
An introduction to t-tests
When comparing a sample mean to a population mean we can use a z-test if we have both the population mean and its standard deviation or variance.
When we don’t have a population standard deviation or variance we need to calculate an estimate of it.
The estimated population variance is calculated with a denominator of N − 1 (the degrees of freedom) to adjust for an underestimating bias.
A t-test is a form of z-test that uses an estimated standard error of the mean, based on an estimate of the population standard deviation or variance.
Results of a t-test are evaluated against critical values on t distributions that differ based on degrees of freedom.
Single sample t-tests
When conducting a single sample t-test we need to calculate an estimated standard error of the mean from our sample data.
We do this by using information from our sample, such as the sum of squares.
Using our degrees of freedom of N – 1 we look up our t critical value in the t table.
We compare our t obtained score to the t critical value and the regions it defines for retaining or rejecting the null hypothesis.
We determine if we can reject the null hypothesis and find support for the alternative hypothesis that our sample mean differs significantly from the population mean.
When reporting a single sample t-test result be sure to report the means, the direction of the difference between them and t obtained with associated df and p values.
Dependent means t-test
A single sample t-test can be adapted to test the difference between two dependent means obtained from a repeated measures design.
Difference scores are calculated for each participant and these difference scores become the data used to calculate the estimated population variance and the estimated standard error of the mean.
The mean of the sampling distribution used as a comparison point will be zero, if the null hypothesis assumes no differences between means.
It is important to bear in mind the direction in which the difference scores are calculated to ensure results are interpreted correctly.
When reporting dependent means t-test results, means for both conditions should be reported along with the direction of the difference between them, the t value and its associated degrees of freedom and p value.
independent groups t-test (logic)
what it is
used when comparing two separate groups (different people in each group)
IV has 2 levels, DV is measured once in each group
groups can come from true random assignment OR quasi assignment
key = independent observations
hypotheses
H₀ = population means are equal → difference = 0
H₁ = population means differ → difference ≠ 0 (or specific direction if one-tailed)
conceptual model
like all other tests: logic is built around the sampling distribution
but this time → sampling distribution = distribution of differences between sample means
under H₀:
the 2 population distributions overlap (same mean)
the 2 sampling distributions of means overlap
therefore mean of “difference between means” = 0
effect direction example (tickling)
if robot tickling actually produces higher tickle ratings than self tickling → the 2 population means differ
differences between means (MA − MB) would not centre at 0
if H₀ is true → difference distribution centres exactly at 0
the computational complication
we do not have 1 sample (like single sample)
we do not have paired scores (like dependent t)
we have 2 independent samples and we must estimate population variance using both
therefore we need to compute a combined pooled variance based on both samples’ data
then we use that pooled variance to get the estimated standard error of the difference between means
summary
independent groups t compares 2 group means drawn from different people
H₀ predicts a difference of exactly 0
H₁ predicts a non-zero difference
the sampling distribution used is the distribution of differences between sample means
key computation = estimate standard error from both groups (pooled)
inference: if the obtained difference is unlikely under H₀ → reject H₀ and conclude group means differ in population
independent groups t-test worked example (distributed vs massed skateboarding practice)
design + context
quasi-experiment (participants self-selected which condition)
independent groups (different people in each condition)
IV = learning schedule
distributed: 1 hour × 8 days (n=15)
massed: 8 hours in 1 day (n=20)
DV = number of half-pipe reps they could do at test
distributed mean = 7.4
massed mean = 5.2
observed difference = 2.2 reps
hypotheses
H₁: mean distributed ≠ mean massed
H₀: mean distributed = mean massed (difference = 0)
logic
if H₀ is true → populations overlap → sampling distributions overlap → sampling distribution of mean differences is centred on 0
we compare our obtained difference (2.2) to this model to see if it is likely or unlikely under chance
assumptions
normal populations
homogeneity of variance (population variances roughly equal)
independence of observations (different people in each group)
computational steps (condensed conceptual)
compute estimated population variance separately for each group
combine these two estimates into a pooled variance (weighted by df)
compute variance of each group’s sampling distribution (divide pooled variance by that group’s n)
sum these two to get variance of sampling distribution of the difference
square-root = standard error of the difference
compute t = (difference between sample means) / (standard error)
result
t obtained = 4.23
df total = (n₁ − 1) + (n₂ − 1) = 33
critical t ≈ 2.04 (α = .05, 2-tailed)
inference
4.23 > 2.04 → difference very unlikely due to chance → reject H₀
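The pooling steps can be sketched as below. The two group variances are not given in the notes, so the `pooled_variance` call uses made-up values to show the weighting; the standard error and t use the pooled variance of 2.34 quoted in the effect size section of these notes.

```python
import math


def pooled_variance(var1, n1, var2, n2):
    """Combine two variance estimates, weighting each by its df."""
    df1, df2 = n1 - 1, n2 - 1
    return (df1 * var1 + df2 * var2) / (df1 + df2)


def se_of_difference(pooled_var, n1, n2):
    """Standard error of the difference between two independent means."""
    return math.sqrt(pooled_var / n1 + pooled_var / n2)


N_DISTRIBUTED, N_MASSED = 15, 20

# Hypothetical group variances, only to show the df weighting.
demo_pooled = pooled_variance(2.0, N_DISTRIBUTED, 3.0, N_MASSED)

# Values from the notes.
POOLED = 2.34
se = se_of_difference(POOLED, N_DISTRIBUTED, N_MASSED)
t_obtained = (7.4 - 5.2) / se
df_total = (N_DISTRIBUTED - 1) + (N_MASSED - 1)

print(round(se, 2), round(t_obtained, 2), df_total)
```

Without intermediate rounding t comes out at about 4.21; the 4.23 in the notes corresponds to dividing 2.2 by the rounded standard error of 0.52. Either way it is well beyond the critical value.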
interpretation + reporting (APA style)
distributed learning produced significantly more reps than massed learning
t(33) = 4.23, p < .05
summary
independent groups t compares 2 separate group means (different people)
must pool variances to estimate population variance
the pooled variance drives the standard error calculation
if obtained t > critical t → reject H₀ → support H₁
confidence intervals (CI) — key idea
same computations as hypothesis testing but used to estimate what the population mean probably is
instead of only saying “sample is / is not representative of X population,” we estimate the range where the true population mean likely sits
a point estimate = a single number (sample mean)
an interval estimate = that sample mean ± margin of error
CI logic (visual)
95% CI uses the central region of the sampling distribution (middle 95%)
mathematically → sample mean ± (critical value × standard error)
for z-based CI → critical value = 1.96
for t-based CI → critical value = tcrit from table (depends on df)
example 1 (alcoholism reaction time)
sample of 16 alcoholics: M = 475ms
known general population mean = 375ms, sd = 120ms, SEM = 30
margin of error = 1.96 × 30 = 58.8
95% CI = 475 ± 58.8 = [416.2, 533.8]
interpretation = we are 95% confident the true mean RT for chronic alcoholics is between 416.2 and 533.8ms
because entire CI is above 375ms → same inference as z-test: reject H₀
example 2 (distributed vs massed practice in skateboarding)
observed mean difference = 2.2 reps
t-critical ≈ 2.042 (table value for df = 30, the nearest tabled df below 33)
standard error of difference = 0.52
margin of error = 2.042 × 0.52 ≈ 1.06
95% CI = 2.2 ± 1.06 = [1.14, 3.26]
interpretation = we are 95% confident the true benefit of distributed learning is between 1.14 and 3.26 more reps
CI does NOT contain 0 → same inference as independent groups t-test: reject H₀
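Both intervals use the same formula, sketched here as a single helper. The function name is illustrative; the critical values are typed in from the z and t tables (about 2.04 for the t example), not computed.

```python
def confidence_interval(estimate, critical_value, standard_error):
    """Return (lower, upper) = estimate -/+ critical value x standard error."""
    margin = critical_value * standard_error
    return estimate - margin, estimate + margin


# Example 1: z-based CI around the alcoholic sample mean.
rt_low, rt_high = confidence_interval(475, 1.96, 30)
print(round(rt_low, 1), round(rt_high, 1))      # 416.2 533.8

# Example 2: t-based CI around the difference in reps.
reps_low, reps_high = confidence_interval(2.2, 2.04, 0.52)
print(round(reps_low, 2), round(reps_high, 2))  # 1.14 3.26

# Same decision as the hypothesis tests: is the null value outside the interval?
print(not (rt_low <= 375 <= rt_high), not (reps_low <= 0 <= reps_high))
```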
summary
confidence intervals use the same logic and maths as hypothesis tests
95% CI gives a plausible range of true population values
if a 95% CI does not include the null value (usually 0) → reject H₀
and this will always match the conclusion of a two-tailed hypothesis test at α = .05
effect size (Cohen’s d) — core concept
after we find a statistical difference → we need to know how big that difference actually is
effect size answers: “how far apart are the two populations?”
Cohen’s d = standardized effect size measure
interpreting Cohen’s d visually
small d → lots of overlap between populations
medium d → moderate overlap
large d → little overlap
Cohen’s d formula (concept)
(mean of population 2 − mean of population 1) / population SD
if we only have sample data → we estimate SD and use that estimate
example: positive information → attractiveness
no info group: M = 200, SD = 48
positive info group: M = 220
d = (220 − 200) / 48 = .42 (medium-ish)
if positive group was M = 210 → d = .21 (half as big)
benefit of Cohen’s d
like z → allows comparison across studies even when scales differ
one study used 0–400 scale, another used 1–10 scale → d standardizes them
general guideline (Cohen, 1988)
d ≈ .20 = small
d ≈ .50 = medium
d ≥ .80 = large
computing Cohen’s d after t-tests
same logic, but we need sample-estimated SD instead of population SD
dependent (repeated) example — tickle robot
mean difference = 6.33
estimated pop SD = sqrt(SS / df) = sqrt(130.33 / 5) ≈ 5.1
d = 6.33 / 5.1 ≈ 1.24 (large)
independent groups example — skateboarding
distributed − massed = 2.2 reps
pooled variance = 2.34 → sqrt = 1.53
d = 2.2 / 1.53 ≈ 1.44 (very large)
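All three effect sizes in these notes come from the same division, sketched below. The helper name is illustrative; which SD goes in the denominator (population SD, SD of difference scores, or pooled SD) follows the notes.

```python
import math


def cohens_d(mean_difference, sd):
    """Standardised difference between two means."""
    return mean_difference / sd


# Attractiveness example: population SD known.
d_attractiveness = cohens_d(220 - 200, 48)

# Tickle robot: SD estimated from the difference scores, sqrt(SS / df).
d_tickle = cohens_d(6.33, math.sqrt(130.33 / 5))

# Skateboarding: SD is the square root of the pooled variance.
d_skateboarding = cohens_d(2.2, math.sqrt(2.34))

print(round(d_attractiveness, 2))  # 0.42
print(round(d_tickle, 2))          # 1.24
print(round(d_skateboarding, 2))   # 1.44
```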
summary
Cohen’s d is a standardized measure of how far apart two means are
allows effect comparison across different scales + different studies
most journals require reporting effect size (APA manual requirement)
add it alongside p-values to properly describe your result’s practical importance
Independent groups design
Independent groups t-tests are appropriate for designs that have two groups of observations, e.g. a true randomised experiment with two groups, or a quasi-experiment with two groups.
In conducting this test, we assume the null hypothesis is true. This is visualised as the underlying populations of our two samples as completely overlapping with the same mean.
The resulting sampling distributions also overlap under the null hypothesis.
The distribution of differences between means has a mean of zero, which gives us a precise number against which to test our observed difference between the means of the two groups in our study.
The major computation in this analysis is finding the standard error of the distribution of differences between means.
Example: Learning the half-pipe
The computations for the independent groups t-test require us to make two estimates of the population variance based on the two independent samples.
These two estimates are combined into a single pooled estimate.
The pooled variance is used to compute the variance of the sampling distribution of the mean for each of the conditions.
These variances are added together to get the estimated variance of the sampling distribution of the difference between means.
Taking the square root of this estimated variance gives us the standard error of the difference between means.
To get the observed t-statistic, we divide the observed difference between the means of the Distributed and Massed groups by this standard error.
We compare our result with the critical value in the t-tables associated with the df closest to the df Total of our experiment without going over.
If the observed t-statistic is larger than the critical value, we reject the null hypothesis and adopt the alternative hypothesis.
Confidence intervals
Confidence intervals are an informative complement to the null hypothesis tests of z-tests, single sample t-tests, repeated measures/dependent means t-tests, and independent groups t-tests.
They are based on the same computations as the null hypothesis tests but use the information to generate an estimate of the mean of the population the sample came from.
Interpreting 95% confidence intervals leads us to the same conclusion as a null hypothesis test with an alpha level of 5% or 0.05.
Effect size
Cohen's d is a standardised measure of effect size.
It is an informative adjunct to a statistical test (e.g. repeated measures t-test; independent groups t-test).
It can further support a statistically significant result by showing how large the difference is.
Cohen's d enables comparison of effect sizes across studies that have used different measurements with different means and standard deviations.
The APA Publication Manual standard is to provide some measure of effect size in addition to the results of a significance test.
Readings
Grove — Chapter 1
1.1 Introduction
Common perception of psychology often limited to mental disorders and therapy (e.g., Freud, Dr. Phil).
Psychology is a broad discipline with over 50 divisions (American Psychological Association, Stanovich, 2007).
Examples of divisions: General psychology, Military psychology, Teaching psychology, Exercise and Sport psychology, Organisational psychology, Psychology and Law, Addictions, Study of Men and Masculinity.
Despite diversity, unifying feature: psychology embraces the scientific method to seek knowledge and truth about behavior.
1.2 Psychology and the Scientific Process
1.2.1 Knowledge from Personal Experience
Often, people do not know the source of their knowledge.
Example: Beliefs about jogging causing knee problems may come from anecdotal evidence (e.g., a friend’s injury).
Problem: Overgeneralization from limited cases and selective perception (focusing on negative incidents, ignoring positive).
Personal experience is biased and subjective; knowledge from it is unreliable and unverified.
1.2.2 Knowledge from Authority
We rely on experts or authorities (lecturers, mechanics) to simplify complex knowledge.
Trust is placed on qualifications and institutional evaluation.
Problem: Some authorities (politicians, broadcasters) are accepted without scrutiny of their expertise or evidence sources.
Authority figures can be biased or mistaken just like anyone else.
1.2.3 Rationalism: Knowledge from Reason
Knowledge can come from reasoning, e.g., mathematical proofs and logical syllogisms.
Example: If all brown dogs are friendly and Bonzo is a brown dog, then Bonzo is friendly.
Limitation: Validity depends entirely on the soundness of initial assumptions.
Rationalism is reliable only if premises are true.
1.2.4 Empiricism: Knowledge from Observation
“Seeing is believing” is problematic: observations are subjective, influenced by sensory differences, culture, mood, intoxication.
Different observers may describe the same event differently.
1.2.5 Knowledge from Science
All previous methods (experience, authority, reason, observation) are fallible and prone to error.
The scientific method is the best way to generate sound, reliable knowledge based on evidence.
Scientific claims must be supported by evidence, not faith or morality.
Five basic principles of science:
Objectivity
Skepticism
Openness / Open-mindedness
Tentativeness
Independence from authority
1.3 Principles of the Scientific Method
1.3.1 Objectivity
Evidence must be observable by anyone, not just personal feelings or thoughts.
Example: Measuring anxiety by physiological responses (heart rate, sweating) rather than self-report.
1.3.2 Skepticism
Science requires claims to be supported by evidence and critically evaluated.
Reflexive response: “Show me the evidence.”
No acceptance of claims based solely on authority or intuition.
Example: Galileo disproved the intuition that heavier objects fall faster by testing it experimentally.
1.3.3 Openness / Open-mindedness
Scientists must fully report methods and conditions so others can replicate studies.
Different interpretations or conclusions by others are acceptable if based on evidence.
Inter-rater reliability: agreement among different observers.
Must be willing to accept new evidence and revise beliefs accordingly, even if ideas seem extraordinary.
1.3.4 Tentativeness
Scientific knowledge is provisional and subject to revision with new data.
No scientific finding is 100% certain.
Confidence increases with repeated, replicated evidence (e.g., Pavlov’s classical conditioning).
Ambiguity is expected, especially early in research fields (example: ongoing debates about global warming).
1.3.5 Independence from Authority
Authority is irrelevant unless claims are supported by solid evidence.
Even reputable sources must be scrutinized skeptically.
1.4 Assumptions of Science
1.4.1 Nature is Lawfully Organised
There are finite, discoverable rules explaining natural phenomena, including human behavior.
This does not necessarily negate free will; humans show patterns and consistencies (e.g., language acquisition, altruism).
1.4.2 Science Assumes Determinism
Knowledge of rules allows prediction of behavior in given contexts.
Psychological rules predict behavior of most people most of the time (not necessarily every individual).
Example: Memory loss follows predictable patterns over time.
1.4.3 Science is Concerned with Solvable Problems
Only questions answerable through objective evidence are scientifically valid.
Questions like “Is there life after death?” or “Are people essentially good?” cannot be answered scientifically as phrased.
However, rephrasing questions to be measurable can make them scientific (e.g., “Are lone individuals more likely to help someone in distress than groups?”).
Example: The “bystander effect” showed individuals are more likely to help when alone than in groups, challenging intuition.
1.5 Goals of the Science of Psychology
Psychology seeks to understand behavior through four key goals:
1.5.1 Describe Behavior
First step: identify when and where behavior occurs.
Example: Pilots underestimate altitude more at night, resulting in short landings.
Description can be quantitative (distance from touchdown point, number of altitude adjustments) or qualitative.
1.5.2 Explain Behavior
Provide causes or reasons for observed behavior.
Avoid pseudo-explanations (circular reasoning).
Example: Pilots underestimate altitude at night due to lack of visual information, not because estimates are simply “unreliable.”
1.5.3 Predict Behavior
Use description and explanation to predict when behavior will or will not occur.
Example: Predict pilots will perform worse at night; if data contradicts this, revise the explanation.
1.5.4 Control Behavior
Use understanding to influence or modify behavior.
Example: Improving night landings by adding ground lights to increase visual information.
Successful control validates deeper understanding.
1.6 Scientific Hypotheses and Non-Scientific Theories
1.6.1 What Is a Theory?
Definition: A theory is a logically organized set of propositions (claims, statements, assertions) that:
Summarizes existing knowledge on a topic
Organizes that knowledge into specific relationships between variables/factors
Explains, at some level, the phenomenon of interest
Makes specific predictions about outcomes in situations relevant to the theory
Role in Psychology:
Theories explain why behavior occurs as it does
Theories are precise statements about how world events affect behavior
Include assumptions and concepts that must be clearly defined to be tested and understood
Example: Piaget’s Developmental Theory
Children develop through 4 stages, each with distinct cognitive/physical abilities
Stage 1 (0-2 years): Object permanence, intentional actions
Stage 2 (2-7 years): Use language to represent objects, classify by single features
Stage 3 (7-11 years): Logical thinking about objects/events
Stage 4 (11+ years): Abstract thinking, hypothesis testing, future and ideological thought
Requires operational definitions: precise criteria for testing components (e.g., what counts as "understanding logic")
1.6.2 What Is a Hypothesis?
Definition: A hypothesis is a focused, tentative explanation for behavior, derived from a theory.
Relationship to Theory:
Hypotheses test specific parts of broader theories.
Example: Testing if children 7-11 understand conservation of volume (a facet of Piaget’s theory)
Experiment Example:
Children given two identical short fat glasses of water; one is poured into a tall skinny glass
Prediction: Younger children (2-7) fail conservation (think tall glass has more water), older children (7-11) pass
Types of Hypotheses in Experiments:
Research (Alternative) Hypothesis (H1): Predicts a difference exists between groups (e.g., older children perform better)
Null Hypothesis (H0): Predicts no difference; any observed difference is due to chance or unrelated variables
1.6.3 Where Do Hypotheses Come From?
Generating Hypotheses:
Can be intimidating for new researchers
Eureka moments exist but most come from systematic reading and study
Process:
Read published research on a topic to become familiar with terminology, methods, and theories
Identify gaps or shortcomings in existing research that suggest new hypotheses
Example: Previous alcohol-driving study measured errors; new hypothesis could focus on reaction time for deeper insight
Replication:
Exact replication: Repeat study exactly to confirm original results (increases confidence or questions prior findings)
Conceptual replication: Test the same hypothesis but with modified methods to explore robustness (e.g., measuring reaction times as well as errors)
Analogy:
Exact replication = asking same witness repeatedly for consistent testimony
Conceptual replication = asking different witnesses for corroboration
1.6.4 Falsifiability
Core Principle:
Scientific theories/hypotheses must be falsifiable: it must be possible, in principle, for evidence to show that they are false
Non-falsifiable theories stall scientific progress (dead ends)
Example:
Theory of gravity predicts objects always fall down when dropped
If an object were observed floating up, theory would be falsified
Laws vs Theories:
Laws are theories that have withstood extensive attempts to falsify
Strictly, science only has theories, none are absolute laws
Analogy:
Theory = structural steel tested under stress
Each test strengthens or weakens confidence in the theory
Failing a test prompts modification or replacement of theory
1.6.5 Circular Hypotheses
Problem:
Hypotheses that explain an event by restating the event itself are circular and uninformative
Examples:
"The boy is distractible because he has attention deficit disorder" (distractibility = disorder)
"Financial crisis caused by economic panic" (crisis = panic)
1.6.6 Hypotheses Containing Non-Scientific Ideas or Forces
Issue:
Hypotheses involving concepts outside objective observation are untestable
Example:
Saying violent acts are caused by satanic possession is not testable scientifically, as "satan" is not objectively measurable
1.6.7 Hypotheses Containing Ill-Defined Terms
Problem:
Without clear definitions, hypotheses cannot be tested
Example:
"Man tried to assassinate the president because he was mentally disturbed" is untestable without a clear, measurable definition of "mentally disturbed"
1.6.8 Good Theories and Hypotheses Are Parsimonious
Parsimony: The simplest explanation that accounts for all observations is preferred
Approach:
Start with simple hypotheses explaining phenomena
Add complexity only when necessary due to new data
Example: Conway Lloyd Morgan’s principle (Morgan’s Canon):
Avoid attributing higher mental faculties to animals when simpler explanations (like trial-and-error) suffice
E.g., dog opening gate explained by trial and error, not logical reasoning
Parsimonious explanations are generally favored over more complex ones if equally effective
1.7 Scientific and Non-Scientific Evidence
Research Studies:
Ask and answer scientific questions
Two main types: (1) Experiments, (2) Observational studies
Must define:
Participants (who is studied)
Situations/settings for observation
Measurement methods for behavior
Criteria for Compelling Scientific Evidence:
Observations must be:
Objective: Free from bias and personal influence; results replicable regardless of observer
Systematic: Conducted step-by-step with a clear method (e.g., varying conditions in flight simulator to test pilot performance)
Controlled: Extraneous variables are held constant or eliminated to isolate factors of interest (e.g., controlling wind, cloud cover in pilot studies)
1.7.1 Evidence Must Be Empirical
Based on observable phenomena that can be independently verified
Disputes about interpretation resolved by gathering more observations
1.7.2 Observations Must Be Objective
Must not be influenced by the researcher’s beliefs or expectations
True objectivity means consistent results regardless of observer
1.7.3 Observations Must Be Systematic
Observations follow a clear, consistent procedure
Example: Testing pilot’s altitude judgment under varying levels of visual ground detail systematically
1.7.4 Observations Must Be Controlled
Control or hold constant all variables other than the one being studied
Example: Using a flight simulator to control environmental factors affecting pilot’s altitude estimate
1.8 Critically Evaluating Evidence
Scientific Attitude Essentials:
Maintain an objective stance
Be skeptical of others’ arguments and evidence
Stay open-minded to new and surprising evidence
Be tentative and cautious about your own claims
Avoid undue influence by authorities when conducting or interpreting research
Initial Step in Evaluation:
Assess whether the researcher adhered to these principles above
Quality of Evidence Criteria:
Is the evidence empirical?
Is the evidence objective?
Is the evidence systematic?
Was it acquired under controlled circumstances?
Consequences of Failure:
Failing any of these criteria weakens confidence in the study’s conclusions
Some failures are more damaging than others (to be discussed further in the next chapter)
1.9 The Scientific Process
Overview:
The scientific process can be visualized as a flow chart with feedback loops
Starts with an idea or theory explaining a phenomenon
Example: Children progress through four cognitive developmental stages
Step 1: Generate Hypotheses
Smaller, testable predictions derived from the theory
Example: 7-11-year-olds understand conservation; 2-7-year-olds do not
Step 2: Design a Study
Develop a carefully planned experiment to test hypotheses
Study design is critical; poor design leads to uninterpretable results
Chapter 2 focuses entirely on study design principles
Step 3: Collect and Organize Data
Raw data from questionnaires, interviews, tests, etc., initially unorganized
Summarize, organize, and describe the data to understand group relationships
Step 4: Data Analysis (Inferential Statistics)
Perform statistical tests to infer whether hypotheses are supported
This stage determines the reasonableness of the hypothesis as an explanation
Step 5: Interpret Results and Revise
If data support the hypothesis: proceed to write a report and publish findings
If data do not support hypothesis: revise or abandon it, retest with new hypothesis
This iterative revision is part of scientific progress
Step 6: Replication by Others
Studies may be replicated exactly or conceptually by other researchers
Consistent replication increases confidence and moves hypothesis toward theory status
Failure to replicate reduces confidence, possibly leading to hypothesis rejection or modification
2.2.3 Sub-Types of Research
Two Broad Categories Based on Goals
Basic Research – aims to increase knowledge for its own sake.
Applied Research – aims to solve practical problems.
Two Broad Categories Based on Methods
Qualitative Research
Focuses on descriptive, non-numerical data.
Examples:
Detailed interviews.
Classroom interaction recordings.
In-depth case studies (e.g., rare conditions, brain injury).
Produces extended descriptive narratives.
Typically does not use statistical analysis.
Quantitative Research
Involves measurements and numerical data.
Analyzed using descriptive and inferential statistics.
Primary focus of this course.
2.2.4 Experimental Variables
Independent Variable (I.V.)
Definition: The variable manipulated by the experimenter.
Assumed to cause changes in the dependent variable.
Example: In a study, adding an extra mental task during driving simulation.
Levels of I.V.:
Level 1: No additional task (control).
Level 2: Additional task (treatment).
Can be:
Directly manipulated (e.g., task type).
Naturally occurring and grouped (e.g., age, gender, IQ, height).
Must have at least two values.
Often includes a zero (no treatment) and non-zero (treatment) condition.
Dependent Variable (D.V.)
Definition: The variable measured at each level of the I.V.
Depends on the I.V. exposure.
Example: Driving performance scores after word-generation tasks.
Examples of D.V. measurements:
Reaction time (seconds) when I.V. is age.
Test performance when I.V. is teaching method.
2.2.5 Unwanted Variables
Purpose of Control
Goal: Determine if a true relationship exists between I.V. and D.V.
Challenge: Other factors (extraneous variables) can obscure results.
Need to identify and minimize these influences.
Random Variables
Definition: Influences on the D.V. not directly due to the I.V.
Three main sources:
Situational variables.
Individual differences.
Measurement error.
Effect: Obscures the I.V.–D.V. relationship.
Situational Variables
Characteristics of the testing environment.
Examples: Room temperature, noise level, lighting, time of day.
Can degrade or improve performance in unpredictable ways.
Example: Teaching method study – noisy/hot rooms may obscure differences in test performance.
Best practice: Optimize and hold situational variables constant.
Individual Differences
People vary in traits such as height, IQ, motivation, anxiety, concentration.
Even within the same experimental condition, individuals differ in D.V. results.
Can interact with situational variables to amplify variability.
Reduction strategies:
Select participants with similar characteristics relevant to the study.
Example: Use only psychology students in a statistics learning study to ensure uniform background.
Measurement Error
Differences due to experimenter performance, attention, or equipment.
Examples:
Misreading measurements.
Inconsistent stopwatch timing.
Missing behaviours in observation.
Reduction strategies:
Automate measurement where possible (e.g., computer-based reaction time tests).
Standardize instructions (script or recorded message).
Extraneous Variables and Their Effects
Situational variables, individual differences, and measurement error have non-systematic effects.
Can inflate or deflate scores randomly, weakening consistency.
Example: Table 2.1 shows that removing random variables clarifies the I.V.–D.V. relationship, while their presence obscures it.
Analogy: Random Variability as TV Static
Random variability is like TV static obscuring a clear picture.
The "signal" is the relationship between I.V. and D.V.
More variability = harder to detect the signal.
Controlling variability = tuning the channel for a clear picture.
Goal: Minimize random variability and eliminate confounding variables.
Experimental Setup – Blakemore et al. (1999)
Participants sat at a table with right palm facing up.
A robot arm’s tip lightly touched the participant’s right palm.
Participant’s left hand held a robot control arm.
Conditions:
Self-tickle: Participant controlled robot arm with left hand, moving over 2 cm distance at ~2 strokes/sec.
Robot-tickle: Robot arm movements controlled by computer, mimicking self-tickle movement parameters.
After each trial, participants rated ticklishness on a 1–10 scale.
Independent Groups Design
Two levels of Independent Variable (IV): self-tickle vs. robot-tickle.
Dependent Variable (DV): ticklishness rating (1–10).
Participants randomly assigned to one of the two groups.
Hypothesis: Higher ticklishness ratings when tickled by robot (external source) vs. self.
Data plotted with:
X-axis: condition (self-tickle, robot-tickle)
Y-axis: ticklishness rating
Robot-tickle scores generally higher but with overlap in ratings between groups.
Role of Inferential Statistics
Visual trends can suggest an effect but overlap in data points means differences could be due to chance.
Inferential tests determine whether observed differences are statistically significant.
Limitations of Independent Groups Design
High variability due to individual differences in ticklishness.
Ideal scenario would involve identical participants to remove variability — not feasible in reality.
Random variability can obscure effects of the IV.
Strategies to Minimize Variability in Independent Groups
Use a homogeneous sample (e.g., similar ages to control for changes in sensitivity with age).
Keep experimental environment constant (room, lighting, temperature, time of day).
Use the same experimenter or identical pre-recorded instructions.
Repeated Measures Design (Within-Subjects)
Same participants tested in all conditions — reduces variability from individual differences.
Only 16 participants needed for both conditions vs. 32 in independent groups.
Participants randomly assigned to order of conditions to control for order effects:
Half: robot-tickle first, then self-tickle.
Half: self-tickle first, then robot-tickle.
Counterbalancing ensures order effects do not confound results.
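A minimal sketch of the counterbalancing step described above, assuming 16 participants identified by the numbers 1 to 16. The function name `counterbalance` and the `seed` parameter are illustrative choices, not part of the original study.

```python
import random

def counterbalance(participant_ids, seed=None):
    """Randomly split participants into two equal-sized order groups."""
    rng = random.Random(seed)      # seeded only so the example is repeatable
    ids = list(participant_ids)
    rng.shuffle(ids)               # random assignment to order
    half = len(ids) // 2
    orders = {}
    for pid in ids[:half]:
        orders[pid] = ("robot-tickle", "self-tickle")
    for pid in ids[half:]:
        orders[pid] = ("self-tickle", "robot-tickle")
    return orders

orders = counterbalance(range(1, 17), seed=1)
robot_first = sum(1 for order in orders.values() if order[0] == "robot-tickle")
self_first = len(orders) - robot_first
print(robot_first, self_first)  # 8 8
```

Because each order is used equally often, any practice or fatigue effect is spread evenly across the two conditions rather than piling up in one of them.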
Benefits of Repeated Measures
Reduces random variability due to individual differences.
Tighter clustering of data points in results (less spread).
Example: Same participant’s ratings across conditions more consistent than ratings between two different people in same condition.
Limitations of Repeated Measures
Inappropriate if exposure to one condition permanently or temporarily alters performance in the other condition.
Example: Learning tasks — once learned, cannot be “unlearned” for subsequent conditions.
Certain drug studies or skill-based experiments require independent groups.
Key Points
Independent groups: more susceptible to variability; larger sample size needed.
Repeated measures: preferred when possible, as it reduces variability and sample size requirements.
Counterbalancing is essential in repeated measures to prevent order effects from biasing results.
Independent Groups Design
Different participants in each condition.
More variability due to individual differences.
Requires larger sample sizes.
No risk of order effects.
Appropriate when one condition might affect the other (learning, drug effects, etc.).
Repeated Measures Design
Same participants in all conditions.
Less variability — individual differences controlled.
Smaller sample size needed.
Requires counterbalancing to avoid order effects.
Not suitable if exposure to one condition permanently alters performance in another.
Displaying the Order in a Group of Numbers Using Tables and Graphs
Why Organize Numbers?
Raw data lists are often overwhelming and obscure patterns.
Organizing numbers into tables and graphs allows clearer visualization of patterns, distributions, and relationships.
Goal: make sense of data by revealing order in what appears chaotic.
Frequency Tables
Definition: A frequency table shows how often each value (or range of values) occurs in a dataset.
Steps to create:
List all possible values from highest to lowest.
Mark frequency (f): Count how many times each value occurs.
Calculate percentage (%): Divide each frequency by total number of scores, then multiply by 100.
Advantages:
Provides a structured overview of the distribution.
Easy to spot common and rare values.
Forms the foundation for constructing graphs.
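The three steps for building a frequency table can be sketched as follows. The scores are hypothetical whole-number values made up for illustration, and `frequency_table` is an assumed helper name.

```python
from collections import Counter

def frequency_table(scores):
    """Return (value, frequency, percentage) rows, highest value first."""
    counts = Counter(scores)
    n = len(scores)
    rows = []
    # Step 1: list every possible value from highest to lowest,
    # including values that never occur (frequency 0).
    for value in range(max(scores), min(scores) - 1, -1):
        f = counts[value]                  # Step 2: frequency
        pct = round(100 * f / n, 1)        # Step 3: percentage
        rows.append((value, f, pct))
    return rows

scores = [7, 8, 8, 7, 3, 1, 6, 9, 3, 8]  # hypothetical data
rows = frequency_table(scores)
for value, f, pct in rows:
    print(f"{value:>5} {f:>3} {pct:>6.1f}%")
```

Values with a frequency of 0 are kept in the table on purpose, so that gaps in the distribution stay visible.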
Grouped Frequency Tables
Used when data spans a wide range or has many distinct values.
Values are grouped into intervals (e.g., test scores 90–99, 80–89).
Rules for grouping:
Intervals must be of equal size (e.g., 10 points each).
No overlapping intervals.
Every value must fit into one and only one interval.
Purpose: Simplifies large datasets, making trends clearer.
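A short sketch of the grouping rules, assuming whole-number scores and an interval width of 10. The test scores are hypothetical and `grouped_frequency_table` is an illustrative name.

```python
def grouped_frequency_table(scores, width=10):
    """Return (interval_start, interval_end, frequency) rows, highest first."""
    lowest = (min(scores) // width) * width    # start of the lowest interval
    highest = (max(scores) // width) * width   # start of the highest interval
    rows = []
    for start in range(highest, lowest - 1, -width):
        end = start + width - 1                # equal-sized, non-overlapping
        f = sum(1 for x in scores if start <= x <= end)
        rows.append((start, end, f))
    return rows

test_scores = [95, 88, 72, 91, 67, 85, 79, 83, 90, 58]  # hypothetical data
grouped = grouped_frequency_table(test_scores)
for start, end, f in grouped:
    print(f"{start}-{end}: {f}")
```

Since each interval starts at a multiple of the width and ends one point before the next, every score falls into exactly one interval.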
Histograms
Definition: A bar graph representing a frequency distribution.
Key features:
X-axis: Variable values (e.g., scores).
Y-axis: Frequencies.
Bars: Touch each other (unlike bar charts for categories), indicating continuous data.
Interpretation:
Shape of distribution becomes visible (e.g., normal, skewed, uniform).
Easy to see modes, clustering, and spread of scores.
Frequency Polygons
Definition: A graph that uses points connected by straight lines to show frequency.
Steps to create:
Plot frequencies above each value or midpoint of intervals.
Connect the dots with lines.
Advantages:
Easier comparison of two or more groups on the same graph.
Highlights trends more smoothly than histograms.
Shapes of Distributions
Normal Distribution (bell curve): Symmetrical, unimodal, most scores around the middle.
Skewed Distributions:
Positively skewed (right-skewed): Long tail on right; common in income data (many low/mid values, few very high).
Negatively skewed (left-skewed): Long tail on left; less common, e.g., easy tests where most score high.
Bimodal or Multimodal: More than one peak in frequency; suggests multiple subgroups.
Rectangular/Uniform: Roughly equal frequency across all values.
Stem-and-Leaf Plots
Definition: A method to display data while retaining original values.
Structure:
Stem: First digit(s) of numbers.
Leaf: Last digit of numbers.
Advantages:
Maintains raw data visibility.
Useful for small datasets.
Quick way to see distribution shape and spread.
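As a rough illustration, a stem-and-leaf plot for non-negative two-digit whole numbers can be built like this. The data are hypothetical, and `stem_and_leaf` is an assumed helper that uses the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return plot lines such as '7 | 29' for non-negative integers."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)   # stem = leading digit(s), leaf = last digit
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

data = [95, 88, 72, 91, 67, 85, 79, 83, 90, 58]  # hypothetical data
plot = stem_and_leaf(data)
print("\n".join(plot))
```

Every original score can be read back from the plot (for example, stem 7 with leaves 2 and 9 gives 72 and 79), which is the property that distinguishes it from a histogram.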
The Importance of Visualization
Tables summarize precise numbers.
Graphs provide an immediate impression of patterns.
Together, they help identify outliers, clustering, distribution shape, and trends.
These representations are essential groundwork before moving into descriptive or inferential statistics.
Central Tendency and Variability
Orientation
Purpose: describe and summarise groups of scores using single “typical” values (central tendency) and “spread” (variability).
Representative value: mean (primary), with mode and median as alternatives.
Variability measures: variance and standard deviation.
Statistical formulas are “recipes”; symbols must be understood before use.
Central Tendency
Concepts
Central tendency: the middle or typical value of a distribution.
Three measures: mean, median, mode.
Choice of measure depends on scale of measurement, distribution shape, and presence of outliers.
Mean (arithmetic average)
Definition: sum of all scores divided by number of scores.
Notation and formula:
( M ) = mean (preferred symbol in psychology articles; sometimes written ( \bar{X} )).
( \sum ) = “sum of”.
( X ) = scores; ( N ) = number of scores.
( M = \dfrac{\sum X}{N} ).
Interpretation:
Balance point of the distribution (teeter-totter analogy).
Total distance above the mean equals total distance below.
Need not be an observed score; can be decimal even when all ( X ) are integers.
Worked examples (as given):
Dreams (10 students): ( M = 6 ).
Stress ratings (30 students): ( M = 6.43 ) (rounded to two decimals beyond original precision).
Social interactions (94 students): ( M = 17.39 ).
Steps to compute the mean
Add all scores: compute ( \sum X ).
Divide by ( N ).
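The two steps, plus the balance-point property, can be checked with a few lines of code. The ten scores below are illustrative values chosen so that the result matches the Dreams example (( M = 6 )); the chapter's raw scores are not reproduced in these notes, so treat them as an assumption.

```python
def mean(scores):
    """M = (sum of X) / N."""
    return sum(scores) / len(scores)

scores = [7, 8, 8, 7, 3, 1, 6, 9, 3, 8]   # illustrative data, sum = 60
m = mean(scores)
deviations = [x - m for x in scores]      # distance of each score from M

print(m)                 # 6.0
print(sum(deviations))   # 0.0: distances above and below the mean cancel
```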
Mode
Definition: most frequent single value in the distribution.
Identification:
Highest frequency in a frequency table; peak of a histogram.
Properties and cautions:
Can differ from mean/median in skewed or irregular distributions.
Can remain unchanged when many scores shift—often a poor overall summary for numerical data.
Appropriate for nominal variables (e.g., most common religion).
Median
Definition: middle score when scores are ordered low→high.
Even ( N ): median is the average of the two middle scores.
Steps:
Order scores.
Locate middle position: ((N+1)/2); if fractional, take the two adjacent scores and average.
Robustness:
Resistant to outliers; useful when extreme scores distort the mean.
Example (reaction times):
Mean inflated by one very long time; median better captures typical performance.
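A minimal sketch of the median steps, using the ( (N+1)/2 ) position rule. The reaction times are hypothetical, with one deliberately extreme value to show the robustness point.

```python
def median(scores):
    """Middle score of the ordered list; average of the two middle scores if N is even."""
    ordered = sorted(scores)
    n = len(ordered)
    position = (n + 1) / 2                  # 1-based middle position
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]      # score just below the middle position
    upper = ordered[int(position)]          # score just above the middle position
    return (lower + upper) / 2

reaction_times = [0.42, 0.45, 0.47, 0.50, 0.52, 3.10]  # hypothetical seconds
rt_mean = sum(reaction_times) / len(reaction_times)
rt_median = median(reaction_times)
print(round(rt_mean, 2), round(rt_median, 3))
```

Here the single long time pulls the mean up to about 0.91 s, while the median stays near 0.49 s, close to what most trials actually looked like.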
Comparing mean, median, mode
Skewed left (negative skew): typically mean < median < mode.
Skewed right (positive skew): typically mode < median < mean.
Normal (perfectly symmetric, unimodal): mean = median = mode.
Use guidelines:
Mean: equal-interval/ratio variables without extreme outliers; dominant in psychology.
Median: rank-order variables; when outliers or skew make the mean unrepresentative.
Mode: nominal variables; rarely used for numerical data.
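The usual ordering of the three measures under skew can be seen in a small example. Both data sets are hypothetical and were chosen to show the pattern clearly; real skewed data will not always follow it exactly.

```python
import statistics

right_skewed = [1, 1, 1, 2, 2, 3, 4, 10]    # hypothetical: long tail to the right
left_skewed = [10, 10, 10, 9, 9, 8, 7, 1]   # hypothetical: long tail to the left

def summarise(scores):
    """Return (mode, median, mean) for a list of scores."""
    return (statistics.mode(scores),
            statistics.median(scores),
            statistics.mean(scores))

r_mode, r_median, r_mean = summarise(right_skewed)
l_mode, l_median, l_mean = summarise(left_skewed)

print(r_mode, r_median, r_mean)   # mode < median < mean
print(l_mode, l_median, l_mean)   # mean < median < mode
```

In both cases the mean is dragged furthest toward the tail, which is why the median is often preferred for skewed data.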
Illustrative controversy (interpretation risk)
Partner-number preferences:
Means suggested huge male–female difference (e.g., 64.3 vs 2.8).
Medians and modes both near 1 for men and women, revealing strong skew driven by few extreme male responses.
Lesson: focusing only on the mean can misrepresent skewed distributions.
Variability
Concepts
Variability: the spread of scores around the mean.
Distributions can share the same mean but differ in spread; conversely, different means can have similar spread.
Two main descriptive measures: variance and standard deviation.
Variance
Definition (definitional form): average of the squared deviations from the mean.
Deviation score: ( X - M ).
Squared deviation: ( (X - M)^2 ).
Formulae:
( SD^2 = \dfrac{\sum (X - M)^2}{N} ) (definitional).
( SS = \sum (X - M)^2 ) (sum of squares); thus ( SD^2 = SS/N ).
Historical computational shortcut (less instructive): ( SD^2 = \dfrac{\sum X^2 - \dfrac{(\sum X)^2}{N}}{N} ).
Interpretation:
Larger when scores are more spread out.
Rarely reported descriptively because units are squared and less intuitive.
Standard Deviation
Definition: positive square root of the variance.
Formulae:
( SD = \sqrt{SD^2} = \sqrt{\dfrac{\sum (X - M)^2}{N}} = \sqrt{\dfrac{SS}{N}} ).
Interpretation:
Roughly the average distance of scores from the mean in original units.
Commonly reported alongside the mean.
Sensitivity:
Influenced by outliers; a single extreme score can substantially increase ( SD ).
Worked examples (as given)
Dreams (10 scores; ( M=6 )):
( SS = 66 ), ( SD^2 = 6.60 ), ( SD = 2.57 ).
Social interactions (94 scores; ( M=17.39 )):
( SS = 12{,}406.44 ), ( SD^2 = 131.98 ), ( SD = 11.49 ).
Practical tips (error-checking)
Deviation scores sum to ~0 (apart from rounding). If not, re-check work.
Do not take ( \sqrt{SS} ) directly; divide by ( N ) first to get variance, then take the square root.
Definitional vs computational formulas
Definitional: directly embodies the meaning (builds intuition); recommended for learning.
Computational: algebraically equivalent shortcut from pre-computer era; less transparent, mainly historical interest now.
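The definitional and computational formulas can be compared side by side. The ten scores are illustrative values chosen so that ( M = 6 ) and ( SS = 66 ), matching the Dreams example; they are an assumption, not data taken from these notes.

```python
import math

scores = [7, 8, 8, 7, 3, 1, 6, 9, 3, 8]   # illustrative data
n = len(scores)
m = sum(scores) / n

# Definitional route: deviations -> squared deviations -> SS -> variance -> SD
deviations = [x - m for x in scores]
ss = sum(d ** 2 for d in deviations)       # sum of squares
variance = ss / n                          # SD^2 = SS / N
sd = math.sqrt(variance)                   # divide by N first, then take the root

# Computational shortcut: (sum of X^2 - (sum of X)^2 / N) / N
variance_shortcut = (sum(x ** 2 for x in scores) - sum(scores) ** 2 / n) / n

print(sum(deviations))            # 0.0 (error check)
print(ss, variance, round(sd, 2))
print(variance_shortcut)
```

Both routes give the same variance; the definitional one simply makes each step of the meaning visible.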
Why variability matters in psychology
Core research aim: explain differences among people or conditions (e.g., stress differences explained by math experience; social interactions explained by traits like extraversion or by gender).
Many inferential methods partition variability to test hypotheses.
Dividing by ( N ) vs ( N-1 )
Descriptive statistics for a specific sample or group: this chapter uses ( SS/N ).
Estimating population variance from a sample (common in research reporting and software): ( SS/(N-1) ).
Expect ( SD ) from software (e.g., SPSS/Excel default options) to reflect ( N-1 ) unless configured otherwise.
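The difference between the two divisors is easy to see with Python's standard `statistics` module, where `pvariance`/`pstdev` divide by ( N ) and `variance`/`stdev` divide by ( N-1 ). The scores are illustrative values with ( SS = 66 ).

```python
import statistics

scores = [7, 8, 8, 7, 3, 1, 6, 9, 3, 8]   # illustrative data, SS = 66

descriptive_var = statistics.pvariance(scores)   # SS / N
estimated_var = statistics.variance(scores)      # SS / (N - 1)

descriptive_sd = statistics.pstdev(scores)
estimated_sd = statistics.stdev(scores)

print(round(descriptive_var, 2), round(descriptive_sd, 2))
print(round(estimated_var, 2), round(estimated_sd, 2))
```

The ( N-1 ) version is always slightly larger, and the gap shrinks as the sample gets bigger.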
Reporting in Research Articles
Typical practice
Means and standard deviations are routinely reported (text or tables).
Medians and variances are only occasionally reported.
Example (social media use):
MySpace minutes per day mean > Facebook; higher SD indicates greater variability for MySpace.
Example (doctoral program applicants):
Means often exceed medians, signalling right-skewed distributions (few programs with very high counts).
Controversy: “Tyranny of the Mean”
Core argument
Overreliance on averages can obscure meaningful individual patterns.
Skinner’s critique: averaging can produce curves that represent no actual individual.
Qualitative and single-case traditions:
Emphasise in-depth observation/interview to discover categories before quantifying.
Blended approach advocated: explore qualitatively, then measure quantitatively.
Cultural concern (Jung/Von Franz):
“Statistical mood” may erode sense of uniqueness; numbers can dull qualitative meaning of human outcomes.
Stereotype Threat, Equity, and Math Performance (context box)
Key points
Stereotype threat: situational activation of negative group stereotypes can depress performance.
Evidence:
Informing women that men do better on a test lowers women’s scores; removing the prompt eliminates differences.
Similar effects for African Americans, Latinos, low-SES students.
Societal gender equality correlates with reduced gender gaps and more women with high mathematical talent.
Mechanism:
Stereotypes consume working memory resources; attitudes alone are insufficient protection.
Practical implications:
Reframe beliefs (“women can do math as well as men”); reduce test-threat cues; over-prepare to buffer memory demands.
Empowerment: quantitative literacy is an enabling skill across careers.
Learning Aids (from the chapter’s in-text guidance)
Tips for success
Know symbols before applying formulas; treat formulas as recipes.
Round with two more decimal places than the original data where needed.
Check deviation-sum ≈ 0; sanity-check magnitude (e.g., an SD larger than the whole scale is a red flag).
Summary (concise)
Central tendency summarises “where” scores are (mean/median/mode); variability summarises “how spread out” they are (variance/SD).
Mean is foundational and widely used; median for skew/outliers or ordinal data; mode for nominal data.
Variance quantifies average squared spread; standard deviation puts spread back into original units.
Reporting convention in psychology: ( M ) and ( SD ) almost always; medians/variances sometimes.
Beware misleading means in skewed data; consider full distribution and alternative summaries.
Key Ingredients for Inferential Statistics
Overview
Purpose: Introduces foundational concepts for inferential statistics, bridging descriptive data and generalisation to populations.
Focus areas: Z scores, the normal curve, sample vs population, and probability.
Application: Enables psychologists to infer conclusions beyond immediate research participants.
Z Scores
Concept and Purpose
Describe where an individual score lies within a distribution.
Z score = number of standard deviations a score is above or below the mean.
Converts scores from raw units into standardised units.
Positive Z = above mean; Negative Z = below mean.
Formula and Components
Formula:
( Z = \dfrac{X - M}{SD} )
( X ): raw score
( M ): mean
( SD ): standard deviation
Reversed formula (to convert back):
( X = (Z)(SD) + M )
Worked Examples
Jerome’s “morning person” score: ( X = 5 ), ( M = 3.40 ), ( SD = 1.47 ).
( Z = (5 - 3.40)/1.47 = +1.09 ).
Ashley’s score: ( X = 2 ), ( M = 3.40 ), ( SD = 1.47 ).
( Z = (2 - 3.40)/1.47 = -0.95 ).
Interpretation
Z scores express relative position and distance from the mean in standard deviation units.
Facilitate comparisons across different measures.
Useful for identifying extreme scores or percentile positions.
Converting Z ↔ Raw
From Z to raw: multiply Z by SD, then add mean.
e.g., ( X = 1.5(4) + 12 = 18 ).
From raw to Z: subtract mean, divide by SD.
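Both conversions can be written as one-line functions and checked against the worked examples above. The function names `to_z` and `to_raw` are illustrative.

```python
def to_z(x, m, sd):
    """Raw score -> Z score: Z = (X - M) / SD."""
    return (x - m) / sd

def to_raw(z, m, sd):
    """Z score -> raw score: X = (Z)(SD) + M."""
    return z * sd + m

jerome = to_z(5, 3.40, 1.47)
ashley = to_z(2, 3.40, 1.47)
print(round(jerome, 2), round(ashley, 2))   # 1.09 -0.95

raw = to_raw(1.5, 12, 4)
print(raw)                                  # 18.0
```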
Distribution Properties of Z Scores
Mean of all Z scores = 0.
Standard deviation of all Z scores = 1.
Shape of distribution remains unchanged after conversion.
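A minimal Python sketch of the two conversions, reusing the Jerome and Ashley figures from the worked examples. The function names `z_score` and `raw_score` are illustrative, and the five-score list is made-up data used only to check the mean 0 / SD 1 property.

```python
from statistics import mean, pstdev

def z_score(x, m, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - m) / sd

def raw_score(z, m, sd):
    """Convert a Z score back to the original units."""
    return z * sd + m

jerome = z_score(5, 3.40, 1.47)   # about +1.09
ashley = z_score(2, 3.40, 1.47)   # about -0.95
back = raw_score(1.5, 12, 4)      # 18.0

# Hypothetical scores: standardising any set of scores gives mean 0 and SD 1.
scores = [2, 3, 4, 5, 6]
m, sd = mean(scores), pstdev(scores)   # pstdev divides by N (descriptive SD)
z_scores = [z_score(x, m, sd) for x in scores]
```

The shape claim also follows from the code: every score is shifted and rescaled by the same two constants, so the ordering and relative spacing of scores do not change.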
The Normal Curve
Definition
A theoretical, bell-shaped, symmetrical, unimodal distribution.
Serves as a reference model for natural and psychological variables.
Often referred to as the Gaussian distribution (after Gauss, originally derived by Abraham de Moivre).
Why It Occurs in Nature
Random influences combine to produce values clustering around the mean.
The central limit theorem: the combined effect of many small, independent random influences tends toward a normal distribution.
Applies to biological, behavioural, and social phenomena (e.g., reaction time, IQ, height).
Key Percentages
68% of scores within ±1 SD.
95% of scores within ±2 SD.
99.7% of scores within ±3 SD.
(Known as the Empirical Rule.)
Range from Mean | % of Scores
±1 SD | 68%
±2 SD | 95%
±3 SD | 99.7%
Example:
IQ: ( M = 100 ), ( SD = 15 )
85–115 = 68%
70–130 = 95%
Using Z Scores with the Normal Curve
Z scores locate raw scores on the normal curve.
Probabilities/percentages can be derived for any Z value.
Example: Z = +1 → 34% between mean and +1 SD; 16% above +1 SD.
The Normal Curve Table (Z Table)
Purpose
Lists percentages associated with each Z score.
Columns:
Z score
% Mean to Z
% in Tail
Symmetrical → positive and negative Z values mirror each other.
For Z = 0.64:
% Mean to Z = 23.89%
% in Tail = 26.11%
Total above mean = 50%.
Using the Z Table
To find the percentage above/below a score:
Convert raw → Z.
Sketch distribution and shade region of interest.
Estimate range using 50–34–14 rule.
Lookup exact value in Z table.
Add or subtract 50% as needed depending on side of mean.
Reverse Lookup
To find Z from a known percentage:
Locate nearest % in Tail or % Mean to Z in table.
Read corresponding Z.
Example: top 30% → % in Tail = 30.15% → Z = +0.52.
Applying the Normal Curve
Examples
IQ = 125
( Z = (125 - 100)/15 = +1.67 ).
Tail area = 4.75%. → 4.75% score higher.
IQ = 95
( Z = -0.33 ).
% Mean to Z (12.93%) + 50% above the mean = 62.93% score higher.
Finding Raw Scores from Percentages
Top 5% → ( Z = +1.64 ), ( X = (1.64)(15) + 100 = 124.6 ).
Top 55% → ( Z = -0.13 ), ( X = (-0.13)(15) + 100 = 98.05 ).
Middle 95% → ( Z = ±1.96 ), range = 70.6–129.4.
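The table lookups above can be cross-checked in code. This is a sketch using `statistics.NormalDist` from the Python standard library (3.8+) in place of the printed Z table; the variable names are illustrative. Results differ slightly from the table-based figures because the table rounds Z to two decimals.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Percentage scoring above a raw score (upper-tail area).
pct_above_125 = (1 - iq.cdf(125)) * 100   # table-based answer: 4.75%
pct_above_95 = (1 - iq.cdf(95)) * 100     # table-based answer: 62.93%

# Raw score that cuts off the top 5%, and the bounds of the middle 95%.
top_5_cutoff = iq.inv_cdf(0.95)           # table-based answer: 124.6
middle_95 = (iq.inv_cdf(0.025), iq.inv_cdf(0.975))
```

`cdf` gives the proportion below a score, so the proportion above is one minus that value; `inv_cdf` is the reverse lookup from a proportion to a score.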
Samples and Populations
Definitions
Population: Entire group of interest (e.g., all voters, all psychology students).
Sample: Subset drawn for study.
Purpose: Make inferences about population from sample.
Why Use Samples
Entire populations impractical to test.
Samples allow manageable data collection and generalisation.
Sampling Methods
Random selection: each individual has equal chance of inclusion (unbiased, ideal).
Haphazard selection: convenience-based (biased, common in psychology).
Cluster/multistage sampling: complex probability-based design (used in large-scale surveys).
Polling Example (Gallup)
Quota sampling (used pre-1948) → bias error (Dewey vs Truman).
Modern polls use probability sampling to reduce bias.
Still face issues: nonresponse, exclusion of mobile-only users.
Population Parameters vs Sample Statistics
Concept | Population | Sample
Mean | μ (mu) | M
Standard Deviation | σ (sigma) | SD
Variance | σ² | SD²
Parameters: actual (unknown) population values.
Statistics: computed from sample; estimates parameters.
Probability
Definition
Likelihood or proportion of a specific outcome occurring.
Formula:
( p = \dfrac{\text{successful outcomes}}{\text{total outcomes}} )
Range: 0 ≤ p ≤ 1.
Interpretations
Long-run (relative frequency): proportion expected over many trials.
e.g., coin → 0.5 heads.
Subjective: degree of belief or confidence (e.g., “95% sure the restaurant is open”).
Examples
Coin flip: p(heads) = 1/2 = 0.5.
Rolling ≤3 on a die: 3/6 = 0.5.
Random senior from 200 students (30 seniors): 30/200 = 0.15.
Probability Symbols
( p ): probability.
( p < .05 ): less than 5% chance (used in statistical significance).
Probability and the Normal Curve
Relationship
Normal curve = probability distribution.
Percentages under the curve = probabilities.
Between mean and +1 SD → p = 0.34.
Between -1.96 and +1.96 → p = 0.95.
Used to determine likelihood that a score occurs by chance.
Probability, Samples, and Populations
Probabilities indicate how likely it is that a sample’s score would be drawn from a given population.
Low-probability scores suggest the sample may come from a different population.
e.g., sample score = 4, population μ = 10, σ = 3 → p very low.
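To put a number on "very low", here is a short sketch that assumes the population is normally distributed (the notes do not state the shape) and asks how often a score of 4 or lower would occur.

```python
from statistics import NormalDist

population = NormalDist(mu=10, sigma=3)

z = (4 - 10) / 3                    # -2.0: two SDs below the population mean
p_at_or_below = population.cdf(4)   # about 0.023, i.e. roughly 2 in 100
```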
Advanced: Probability Rules
Addition Rule (“or” rule)
For mutually exclusive events:
( p(A \text{ or } B) = p(A) + p(B) )
e.g., roll of die: p(3 or 5) = 1/6 + 1/6 = 1/3.
Multiplication Rule (“and” rule)
For independent events:
( p(A \text{ and } B) = p(A) \times p(B) )
e.g., 2 coin flips → p(2 heads) = 0.5 × 0.5 = 0.25.
Conditional Probability
( p(A|B) ): probability of A given B occurs.
e.g., p(woman | College A) = 0.5; p(woman | College B) = 0.6.
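The addition and multiplication rules can be checked by counting outcomes directly. A small sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Addition rule: 3 and 5 are mutually exclusive outcomes of one die roll.
die = range(1, 7)
p_3_or_5 = Fraction(sum(1 for face in die if face in (3, 5)), len(die))

# Multiplication rule: two independent coin flips, four equally likely pairs.
flips = list(product("HT", repeat=2))
p_two_heads = Fraction(sum(1 for pair in flips if pair == ("H", "H")), len(flips))
```

Counting the full list of outcomes gives the same answers as the two formulas: 1/6 + 1/6 for the die, and 1/2 × 1/2 for the coins.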
Controversies
“Is the Normal Curve Really Normal?”
Many real-world distributions deviate from normality.
Micceri (1989): none of 440 psychological measures perfectly normal.
Deviations due to ceiling/floor effects, skewness, kurtosis.
Yet traditional statistical methods remain robust under moderate deviations (Sawilowsky & Blair, 1992).
Nonparametric methods (distribution-free) are alternatives.
Using Nonrandom Samples
Psychology often uses convenience samples (students, volunteers).
Justification: relationships between variables often generalise.
Other fields (sociology, medicine) emphasise random representativeness.
Example: Morgenstern et al. (2009) obesity study—checked representativeness and response rates.
Researchers acknowledge sampling limitations (e.g., Heyman et al., 2001).
Research Applications
In Articles
Z scores seldom reported directly.
Normal curve mentioned when describing score distributions.
Sampling methods explained in methodology (response rate, representativeness).
Probability shown in results as p-values (e.g., p < .05 or p < .01).
Summary
Z scores standardise raw data and locate scores within a distribution.
Normal curve describes the ideal symmetrical distribution underlying many statistical tests.
Samples vs populations: inferential logic based on estimating population parameters from samples.
Probability quantifies uncertainty and underlies inferential reasoning.
Statistical significance (p < .05) arises from these probability principles.
Correlation
Core Concept
Correlation = the statistical description of the relationship between two equal-interval numerical variables.
Shows whether high values on X tend to occur with high values on Y (positive), or with low values on Y (negative), or no systematic pattern (zero correlation).
Correlation is about association — not proof of cause.
Graphing the Relationship: Scatter Diagrams
Scatter plot: horizontal axis = predictor/cause variable (if known/theorised), vertical axis = outcome/criterion variable.
Each dot = one person’s score on both variables.
Roughly straight line pattern = linear correlation.
Shapes:
positive linear: up and right; highs with highs, lows with lows.
negative linear: down and right; highs with lows, lows with highs.
curvilinear: systematic but not straight (e.g., U-shape, inverted U).
no correlation: dots have no pattern at all.
Linear vs Curvilinear
Linear = described by a single straight line.
Curvilinear = relationship exists but is not linear; a linear correlation coefficient will underestimate this relationship.
Example: idealisation vs marital satisfaction → too much idealisation reduces satisfaction (inverted U shape).
Strength of a Correlation
Strength = how close dots cluster to the line.
Perfect = r = +1 or r = −1 (zero scatter around a line).
Weak = dots widely dispersed, line poorly represents pattern.
Correlation Coefficient (r)
r = numerical index of linear correlation (range −1 to +1).
Sign indicates direction (+ positive, − negative).
Magnitude indicates strength (absolute size).
Computed by converting raw scores to Z scores, computing cross-products, summing them, dividing by N.
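The computation just described can be sketched as follows. The data are hypothetical (hours slept and a mood rating for five people), and `pearson_r` is an illustrative name. SDs are computed with N in the denominator, matching the "divide by N" step.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r as the mean cross-product of Z scores (SDs computed with N)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    cross_products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(cross_products) / len(xs)

# Hypothetical data for five people.
hours = [5, 6, 7, 8, 9]
mood = [2, 4, 5, 4, 7]
r = pearson_r(hours, mood)   # about +.87: highs tend to go with highs
```

The sketch does not guard against a variable with zero spread, where the Z scores (and therefore r) are undefined.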
Significance Testing for Correlation
Null hypothesis: true population correlation = 0.
r is transformed to a t-value to test significance.
df = N − 2.
Same logic as other inferential tests.
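For reference, the r-to-t conversion is usually written in the following standard textbook form (it is not given explicitly in these notes):
( t = \dfrac{r\sqrt{N - 2}}{\sqrt{1 - r^{2}}} ), with ( df = N - 2 ).
e.g., ( r = .50 ), ( N = 27 ) → ( t = (.50)(5)/\sqrt{.75} = 2.89 ), ( df = 25 ).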
Correlation ≠ Causation
Three causality directions always possible:
X causes Y
Y causes X
third variable causes both
Only a true experiment (random assignment) can rule out alternative directions.
Statistical vs Research Design “Correlation”
“Correlation” as a statistic = Pearson r (formula).
“Correlational research” = any non-experimental study (surveys, observation etc). A correlational study can be analysed with or without Pearson r.
Interpretation Issues
r²: Proportion of Variance Explained
Square of r = proportionate reduction in error / proportion of variance accounted for.
A correlation twice as large in r is four times as large in r² (e.g., r = .20 → r² = .04; r = .40 → r² = .16).
Restriction of Range
Using limited range on one variable suppresses the correlation.
Example: only testing high scorers on an aptitude test → relationship to job performance seems artificially weak.
Unreliability of Measurement
Noise/inaccuracy reduces r.
If measures are poor, true correlation will be underestimated.
Outliers
Single extreme X/Y combination can massively distort r.
Outlier can change r from large to near zero or even switch sign.
Curvilinear Solutions: Spearman’s rho
If relationship is monotonic but not linear, convert to ranks → compute Spearman rho.
Less sensitive to outliers.
Does not require equal interval measurement.
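A sketch of the rank-then-correlate idea, assuming tied scores share the average of their ranks (a common convention, not stated in these notes). The data are hypothetical and all function names are illustrative.

```python
from statistics import mean, pstdev

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Mean cross-product of Z scores."""
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical monotonic but non-linear data: y is x cubed.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman_rho(x, y)   # ranks agree perfectly
r = pearson_r(x, y)        # linear r on the raw scores is lower
```

Because only the ordering of scores enters the calculation, stretching the top of the scale (as cubing does here) leaves rho unchanged while pulling Pearson r below 1.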
Effect Size and Power
Cohen guidance: r = .10 small; r = .30 medium; r = .50 large.
But these conventions are contested.
Power depends on expected effect size; moderate correlations require sizeable N to achieve .80 power.
Real-World Importance of Small r
Even very small correlations can have major real practical consequences if event is serious or sample is huge (e.g., aspirin and heart attack reduction; r = −.034 but huge real effects on mortality).
Correlation Matrices
Common format in journal articles.
Table format presenting correlations between multiple variables, often only lower triangle shown.
Asterisks indicate statistical significance.
Intro to Hypothesis Testing
Core Definition and Purpose
Hypothesis testing = formal procedure to decide if sample results support a theory / innovation about a population.
Hypothesis = prediction derived from observation, prior research, or theory.
Theory = explanatory principles about psychological processes from which specific predictions are derived.
Essential frame: sample → inference → population.
Cognitive Difficulty
Logic is counter-intuitive — requires multiple new abstractions at once.
Psychological research relies on this logic for almost all inferential conclusions.
Core Logic
Ask: how likely is the result if the opposite of what we predict is actually true?
If that probability is sufficiently low, reject the null hypothesis.
If rejecting the null, we indirectly support the research hypothesis.
If not extreme enough, outcome is inconclusive, not support for the null.
Key Terms
Research hypothesis (alternative): specifies predicted relationship/difference between populations.
Null hypothesis: predicts no difference or the opposite direction.
Comparison distribution: the reference population distribution if the null is true.
Cutoff sample score (critical value): threshold Z value beyond which result is too unlikely under null.
Statistically significant: outcome extreme enough to reject null under chosen significance level.
Five Steps of Hypothesis Testing
Restate question as research + null hypothesis about populations.
Determine comparison distribution characteristics (mean, SD, shape).
Determine cutoff sample score (Z critical) for chosen significance level.
Determine sample score on comparison distribution (compute Z).
Decide: reject or fail to reject null.
One-Tailed vs Two-Tailed
Directional hypothesis = one-tailed test.
region of rejection only in predicted direction.
less extreme critical cutoff.
downside: if effect occurs in opposite direction, cannot call significant.
Nondirectional hypothesis = two-tailed test.
tests for extreme results in both directions.
more conservative (critical values more extreme).
Convention: most researchers default to two-tailed unless explicitly justified otherwise.
Significance Levels
Conventional α levels: .05 and .01.
One-tailed .05 = Z crit +1.64 or −1.64 (sign follows the predicted direction).
Two-tailed .05 = Z crit ±1.96.
One-tailed .01 = Z crit +2.33 or −2.33 (sign follows the predicted direction).
Two-tailed .01 = Z crit ±2.58.
Interpretation Boundaries
Reject null → results are statistically significant; supports research hypothesis.
Fail to reject → results are inconclusive; cannot claim null is true.
Words “prove” or “true” must not be used about research findings.
Examples
Baby vitamin example:
prediction: earlier walking.
cutoff (5%, one-tailed): Z ≤ −1.64.
sample baby walked at 6 months: Z = −2.67 → reject null.
$10 million happiness example:
prediction: happier.
cutoff (5%): Z ≥ +1.64.
sample Z = +1 → not enough → inconclusive.
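Steps 3 to 5 can be sketched as a small function. This is an illustration for the known-variance (Z) case only. The baby-example values mu = 14 months and sigma = 3 months are assumptions chosen because they reproduce the quoted Z of −2.67; the notes themselves give only the Z score. The cutoff is computed from α rather than typed in.

```python
from statistics import NormalDist

def z_test(sample_score, mu, sigma, alpha=0.05, tail="lower"):
    """Find the cutoff, locate the sample score on the comparison distribution, decide.

    tail is "lower" or "upper" for a one-tailed test, "two" for a two-tailed test.
    """
    z = (sample_score - mu) / sigma
    std_normal = NormalDist()
    if tail == "lower":
        cutoff = std_normal.inv_cdf(alpha)
        reject = z <= cutoff
    elif tail == "upper":
        cutoff = std_normal.inv_cdf(1 - alpha)
        reject = z >= cutoff
    elif tail == "two":
        cutoff = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) >= cutoff
    else:
        raise ValueError("tail must be 'lower', 'upper' or 'two'")
    return z, cutoff, reject

# Baby example (assumed mu and sigma, see above): prediction is earlier walking.
z_baby, cutoff_baby, reject_baby = z_test(6, mu=14, sigma=3, tail="lower")

# Happiness example: only Z = +1 is given, so a standard-normal scale is used.
z_happy, cutoff_happy, reject_happy = z_test(1, mu=0, sigma=1, tail="upper")
```

A `reject` value of False corresponds to "fail to reject" (inconclusive), not to support for the null.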
Controversy: Banning Significance Tests
Critics argue logical problems, misuse, misinterpretation, arbitrary cutoffs.
Bayesian methods can compute odds of H1 vs H0 directly (Bayes factor).
Bayes factor shows p values between .05 and .01 often provide weak evidence.
Consensus to date: keep significance tests but use carefully.
APA: significance tests remain allowed but must not be misused.
Reporting in Research Articles
Authors typically report significance, test statistic symbol (t, F, χ²), and p level.
Tables often mark significant differences with asterisks.
Two-tailed is assumed unless stated otherwise.
Increasingly common to report exact p values (e.g., p = .03).
Summary Concept
Hypothesis testing = rejection logic.
We support theories by showing data are unlikely under the “no effect” assumption.
Statistical significance ≠ proof.
Failure to reject null ≠ evidence null is true.
Making Sense of Statistical Significance & Power
Decision errors in hypothesis testing
Decision errors happen because we infer about populations from samples; the procedure limits but cannot eliminate error.
Two error types:
Type I error (α): Rejecting the null when it is actually true; its probability equals the significance level you set. Researchers sometimes lower α (e.g., .001) to reduce this risk, but there is a cost.
Type II error (β): Failing to reject the null when the research hypothesis is actually true; practically worrying because effective interventions might be missed.
You can’t know post-hoc that you made either error; significance levels act as “insurance,” but pushing α very low increases the risk of Type II error.
Trade-off: Protecting against one error type usually raises the other; standard compromises are α = .05 or .01.
Alpha, beta, and terminology
α (alpha): Probability of a Type I error; same as the test’s significance level.
β (beta): Probability of a Type II error.
Statistical power
Definition: Probability of correctly rejecting the null when the research hypothesis is true; equivalently, 1 − β.
Determinants:
Effect size / population variability: Smaller population SD (larger effect size) narrows the distribution of means → less overlap → higher power.
Sample size (N): Larger N narrows the distribution of means independently of effect size → higher power. Sample size affects power but is separate from effect size.
Alpha level: More lenient α (e.g., .10 or .20) increases power but raises Type I risk; stricter α reduces power but protects against Type I error.
Illustration: Increasing N (e.g., from 64 to 500) can raise power dramatically (e.g., from ~37% to ~99%) by shrinking the standard error and reducing curve overlap.
Planning studies with power
Why plan for power: Too-low power makes significance unlikely even when the effect is real; researchers should avoid underpowered designs.
Figuring required N: Start from a target power (commonly 80%), expected effect size (or mean difference) and known/assumed SD to compute N. Example: with predicted mean difference of 8 and SD = 48, about N = 222 achieves 80% power (single-sample mean scenario).
In practice, researchers use power software or online calculators to invert the power steps and solve for N.
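A sketch of what such a calculator does for the simplest case: a one-tailed Z test on a single sample mean with known SD and α = .05. Those choices are assumptions; the notes do not say which test direction the N = 222 example uses, although a one-tailed .05 test reproduces it closely. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_tailed(mean_diff, sd, n, alpha=0.05):
    """Power of a one-tailed Z test for a single sample mean (known SD)."""
    d = mean_diff / sd
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Under the research hypothesis the test statistic is centred on d * sqrt(n).
    return 1 - STD_NORMAL.cdf(z_crit - d * sqrt(n))

def required_n(mean_diff, sd, power=0.80, alpha=0.05):
    """Sample size (not rounded) at which the target power is reached."""
    d = mean_diff / sd
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_power = STD_NORMAL.inv_cdf(power)
    return ((z_alpha + z_power) / d) ** 2

n_needed = required_n(8, 48)                # just over 222
power_at_64 = power_one_tailed(8, 48, 64)   # roughly .38 if the same difference and SD apply
```

Small gaps between these values and table-based answers come from rounding Z to two decimals. Both functions show the two separate levers: `d` carries the effect size, while `n` enters only through `sqrt(n)`.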
Practical guidance
Choosing α: Use conventional .05 unless there’s a strong reason to shift; lowering α (e.g., .001) buys Type I protection at the cost of more Type II errors.
When Type II risk matters most: Applied settings (e.g., clinical) where missing a real effect means withholding a beneficial treatment.
Do not conflate effect size with sample size; both influence power, but via different mechanisms and imply different design choices.
Introduction to t-tests — core idea
Up until now, hypothesis testing examples assumed you knew the population variance.
In real psychology research, you almost never know the population variance.
You usually have only samples.
Solution = t tests.
When do you use a t test?
Use a Z test when population variance is known (rare).
Use a t test when population variance is unknown (normal/real research case).
Most psychological research = comparing two sets of scores with unknown variance.
Two main t-test types in this chapter
Single sample t test: 1 sample mean vs a known population mean (population variance unknown).
Dependent means t test: same people measured twice → create difference scores → test mean difference ≠ 0.
Single sample t test — why, and what changes?
Same logic as Z test.
Two new complications:
Must estimate population variance from sample → use unbiased estimate (divide by N−1).
Once you estimate variance → distribution is NOT normal anymore → becomes t distribution.
Unbiased estimate + degrees of freedom
A sample’s own variance (SS divided by N) tends to underestimate the population variance.
Correction: divide SS by N − 1 → this denominator = df.
df = scores free to vary once you fix the sample mean → N − 1.
Distribution of means under t
Get estimated population variance → divide by N → get variance of distribution of means.
Square root = standard deviation of distribution of means (S_M).
Shape = t distribution with df = N − 1.
t distribution = normal-like but with heavier tails, so extreme scores are more likely than under the normal curve.
Lower df = heavier tails → more extreme cutoffs.
As df → ∞, t approaches normal.
Testing with t — same 5 steps, one change
Computationally identical to Z procedure, but:
cutoff comes from t table
sample test statistic is t not z
t = (M_sample − μ_comparison) / S_M
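The single sample computation can be sketched as below, using made-up data. `statistics.variance` already divides by N − 1, so it serves as the unbiased estimate. The Python standard library has no t distribution, so the cutoff would still come from a t table with the returned df.

```python
from math import sqrt
from statistics import mean, variance

def single_sample_t(scores, mu):
    """Return (t, df) for a single-sample t test against population mean mu."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least 2 scores to estimate variance")
    s2 = variance(scores)    # unbiased estimate: SS / (N - 1)
    s_m = sqrt(s2 / n)       # SD of the distribution of means (S_M)
    t = (mean(scores) - mu) / s_m
    return t, n - 1

# Hypothetical data: hours studied by five students, compared with mu = 4.
t, df = single_sample_t([5, 7, 6, 8, 4], mu=4)   # t is about 2.83 with df = 4
```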
Why dependent means t test exists
Real research design commonly uses repeated measures:
before vs after therapy
beloved vs neutral face in fMRI
Two scores per participant → create difference score for each participant.
Then perform a single sample t test on the difference scores.
Null mean for difference scores = 0.
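A sketch of the difference-score approach with hypothetical before/after ratings. The function name and the after-minus-before direction are illustrative choices; reversing the subtraction only flips the sign of t.

```python
from math import sqrt
from statistics import mean, variance

def dependent_means_t(before, after):
    """Return (t, df) from difference scores (after minus before), null mean 0."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need paired lists with at least 2 participants")
    diffs = [a - b for b, a in zip(before, after)]
    s_m = sqrt(variance(diffs) / len(diffs))   # S_M of the difference scores
    return mean(diffs) / s_m, len(diffs) - 1

# Hypothetical anxiety ratings for four clients before and after therapy.
before = [8, 6, 7, 9]
after = [5, 5, 4, 6]
t, df = dependent_means_t(before, after)   # t = -5.0 with df = 3
```

Note that df counts participants minus one, not scores minus one, because each person contributes a single difference score.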
When to use dependent means t test
Unknown population variance.
Repeated measures / paired samples / matched pairs.
Use difference scores.
Population 2 mean assumed = 0 (no change baseline).
Repeated measures = higher power
Because individual differences cancel out within person.
SD of difference scores tends to be smaller.
Smaller denominator → bigger effect size d → more power.
BUT repeated measures without a control group is methodologically weak (confounds: maturation, time effects, practice, attrition).
Reporting in articles
Standard format: t(df) = value, p = …
One-tailed or two-tailed reported if relevant.
Means reported for each condition.
t test for independent means — core purpose
Use when comparing two separate groups of people.
Population variances unknown → must estimate → therefore use t not z.
Common design in psychology: experimental vs control groups.
Comparison distribution = distribution of differences between means
NOT difference scores (dependent means = difference scores).
Independent means = 2 separate samples → focus is mean₁ – mean₂.
Logic: build 2 distributions of means → then build distribution of the difference between those means.
Mean of this comparison distribution
If H₀ is true → μ₁ = μ₂ → difference between means = 0.
Therefore mean of the distribution of differences between means = 0.
Estimating population variance
Must assume equal population variances (homogeneity of variance).
Estimate variance separately for each sample.
Combine them → pooled variance (weighted average by df).
Pooled variance
More weight goes to the sample with larger df.
Pooled variance MUST lie between the two individual estimates.
Variance + SD of distribution of differences between means
Step 1: pooled variance.
Step 2: divide by N₁ → variance of distribution of means₁.
Step 3: divide by N₂ → variance of distribution of means₂.
Step 4: add those two → variance of distribution of difference.
Step 5: square root → SD of distribution of difference.
df for t test for independent means
df_total = df₁ + df₂
df₁ = N₁ − 1
df₂ = N₂ − 1
Test statistic
t = (M₁ − M₂)/SD_difference
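The pooled-variance steps and the test statistic can be put together in one sketch. The two groups are hypothetical data and the function name is illustrative. The homogeneity-of-variance assumption is built in, because both groups share the same pooled estimate.

```python
from math import sqrt
from statistics import mean, variance

def independent_means_t(group1, group2):
    """Return (t, df_total, pooled_variance) for two separate groups."""
    n1, n2 = len(group1), len(group2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least 2 scores")
    df1, df2 = n1 - 1, n2 - 1
    df_total = df1 + df2
    # Step 1: pooled variance, weighting each estimate by its share of df.
    s2_pooled = (df1 / df_total) * variance(group1) + (df2 / df_total) * variance(group2)
    # Steps 2-4: variance of each distribution of means, then their sum.
    s2_difference = s2_pooled / n1 + s2_pooled / n2
    # Step 5: SD of the distribution of differences between means.
    s_difference = sqrt(s2_difference)
    t = (mean(group1) - mean(group2)) / s_difference
    return t, df_total, s2_pooled

# Hypothetical scores for an experimental and a control group.
experimental = [6, 8, 7, 9, 10]
control = [4, 6, 5, 5]
t, df_total, s2_pooled = independent_means_t(experimental, control)
```

With these data the pooled variance falls between the two separate estimates and sits closer to the experimental group's, which has the larger df.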
Assumptions
Each population = approx normal (t robust to moderate violation).
Population variances assumed equal (homogeneity).
Scores must be independent — no matching, no clustering, no nesting.
Effect size (independent means)
d = (μ₁ − μ₂)/σ (planned)
estimated d after study = (M₁ − M₂)/S_pooled
Power (independent means)
Same concept, but power depends on:
effect size
α
N
equal N gives more power than unequal N for the same total number of participants.
If N unequal → use harmonic mean to approximate power equivalence.
The “too many t tests” problem
Running many t tests inflates Type I error dramatically.
With α = .05, about five significant results out of 100 comparisons are expected by chance even when every null hypothesis is true.
Issue is real; solutions exist (e.g., Bonferroni correction).
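A short sketch of how quickly the risk grows, assuming the tests are independent and every null hypothesis is true (a simplification; real comparisons on the same data are often correlated). The Bonferroni step simply divides α by the number of tests. Function names are illustrative.

```python
def familywise_error(alpha, k):
    """Chance of at least one Type I error across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / k

expected_false_positives = 0.05 * 100         # 5 out of 100 true-null comparisons
risk_10_tests = familywise_error(0.05, 10)    # about 0.40
per_test_alpha = bonferroni_alpha(0.05, 10)   # 0.005
```

The cost of the correction is the usual trade-off: a stricter per-test α lowers power for each individual comparison.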
How it appears in articles
Usually: M, SD reported for each group + t(df) = value, p < .05
Table-level conceptual comparison
single sample t → compare 1 sample mean to known μ
dependent means t → compare same people twice (difference scores)
independent means t → compare 2 groups of different people