Module 1: Reliability and Validity

Measuring human behaviour

The Scientific Method

  1. Theory: working model of the world based on literature and sometimes our own intuitions about our observations

  2. Hypothesis: a testable idea

  3. Measurement: could mean running an experiment or collecting data in another way

  4. Statistics: looking for patterns in the collected data and assessing their meaningfulness

  5. Inference: making conclusions regarding the original theory based on the collected information.

Measurement yields two components: the truth, meaning the thing we actually wanted to measure, and error. Error can be bias (something we introduced ourselves that confounds the truth) or random error (something beyond our control). More attention should therefore be paid to biases, since they are something we can actually control, in contrast to random error.

Error comes from three sources: observers, researchers, and participants, each of which can introduce both bias and random error.

  • Observers: e.g., an observer watches a parent-child interaction and is then asked to assess the number of positive interactions. The observer's own subjective threshold for what counts as a positive interaction might affect their judgement. To fix this: define the task more precisely.

  • Researchers: The initial hypothesis might be biased, measures might also be biased

  • Participants: might try to be helpful by giving you the answers they think you want, especially children

Reliability and Validity in Measurement

Reliability: the consistency or repeatability of measures; Ex: an IQ test will give relatively similar results across time

  • Reducing random measurement error improves reliability

Validity: the extent to which the measure measures what it is supposed to measure; Ex: an IQ test measures intellect rather than English skills

  • Reliability puts a ceiling on validity: if a measure gives different results each time, it is likely not measuring what it is supposed to measure. If reliability is .70, the remaining .30 reflects random error; that .30 cannot be valid, because it is error rather than the construct being measured, so .70 is the ceiling for validity.

  • Reliability does not guarantee validity; just because something measures the same every time does not mean that it is measuring the right thing
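The ceiling idea can be illustrated with a short simulation (my own sketch, not from the notes): adding random error to a measure weakens its correlation with any criterion, so an unreliable measure cannot achieve high validity. The numbers and distributions below are arbitrary choices for illustration.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
n = 10_000
true_score = [random.gauss(0, 1) for _ in range(n)]
# A criterion that genuinely relates to the underlying construct
criterion = [t + random.gauss(0, 1) for t in true_score]
# A noisy (less reliable) measure of the same construct
noisy_measure = [t + random.gauss(0, 1.5) for t in true_score]

r_true = pearson(true_score, criterion)
r_noisy = pearson(noisy_measure, criterion)
print(r_noisy < r_true)  # the unreliable measure shows a weaker validity correlation
```

The construct-criterion relationship is identical in both cases; only the random measurement error differs, yet the observed validity correlation drops.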

How do we test reliability?

  • Inter-Rater/Inter-Observer Reliability: assess the degree to which different raters/observers give consistent estimates of the same phenomenon

  • Test-Retest Reliability: used to assess consistency of a measure from one time to another by looking for the correlation between scores across multiple administrations of the measure

  • Parallel-Forms Reliability: used to assess the consistency of the results of two tests constructed in the same way from the same content domain

    • Split-half Reliability: split the questionnaire in half and correlate scores on the two halves

    • Item-total Correlation: look at the correlation between each individual item and the total score (Ex: the relationship between different assignments on a course and the final grade)

  • Internal Consistency Reliability: used to assess consistency of results across items within a test

    • Cronbach’s alpha is based on the average correlation among all possible pairs of items; values of .80 or above are generally taken to indicate good internal consistency
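As an illustration (my own sketch, not from the notes), Cronbach's alpha is usually computed from the item variances and the variance of the total score; the item data below are made up for the example.

```python
import statistics

def cronbach_alpha(items):
    """items: a list of per-item score lists, each with one entry per respondent.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)"""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Three hypothetical questionnaire items that mostly agree -> high consistency
items = [
    [4, 5, 3, 5, 2, 4, 1, 5],
    [4, 4, 3, 5, 1, 4, 2, 5],
    [5, 5, 2, 4, 2, 3, 1, 4],
]
print(round(cronbach_alpha(items), 2))  # → 0.94, well above the .80 guideline
```

Items that track each other inflate the variance of the total relative to the sum of the individual item variances, which is what pushes alpha toward 1.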

Key types of validity:

  • Construct (Measurement) Validity: the extent to which the manipulations and measures (of both DV’s and IV’s) reflect the theoretical constructs of interest

    • Manipulations can be Instructional (experimental conditions are defined by what you tell participants), Environmental (experimental conditions are defined by the different circumstances: stage an event, present a stimulus, induce a state), and Stooges (fake participants altering the experimental conditions)

    • Convergent validity: the extent to which the measure correlates with scores on other similar measures related to the construct (Anxiety-Depression should be similar)

    • Discriminant validity: the extent to which the measure is different from the measures that are unrelated to the construct (Anxiety-typing speed should have low correlations)

    • Face validity: the extent to which on the face value the measure seems to be a good translation of the construct (completing sums to measure arithmetic ability)

    • Content validity: the extent to which the measure assesses the entire range of characteristics that are representative of the construct it is intending to measure

    • Criterion validity: the extent to which the scores on the measure distinguish participants on other variables that are expected to be related to it (depressives from non-depressives, criminals from non-criminals) - concurrent and predictive

  • External Validity: the extent to which the results can be generalised to other relevant populations, settings, or times

    • Studies have good external validity when results can be replicated when using different measures of the same variables, measuring a different sample of participants, and conducting the research in another setting

    • Ecological validity: the extent to which results can be generalised to real-life settings

    • Population Generalisation refers to applying the results from an experiment to a group of participants that is different from, and more encompassing than, the one used in the original experiment. It has been questioned whether the fact that most psychology studies are conducted on psychology students affects the generalisability of the results.

    • Temporal Generalisation: refers to applying the results from an experiment to a time that is different from the time when the original experiment was conducted

    • Environmental Generalisation refers to applying the results from an experiment to a situation or environment that differs from that of the original experiment

  • Internal Validity: the extent to which conclusions about causal relationships can be drawn from the results of a study; the extent to which we can say that any effects on the DV were caused by the IV

    • Inferences of cause-and-effect require three elements:

      • Co-variation

      • Temporal precedence

      • Elimination of alternative explanations

    • Threats to Internal Validity

      • Selection Bias: occurs if participants are chosen in such a way that the groups are not equal before the experiment; differences in the results may be due to the group differences rather than the IV influence.

      • Maturation: changes in participants during the course of an experiment or between measurements of the DV due to passage of time

        • Permanent: e.g., age, biological growth, cognitive development (most common: children growing)

        • Temporary: e.g., fatigue, boredom, hunger

      • Statistical Regression - Regression Towards the Mean: participants with extreme scores on the first measurement of DV tend to score closer to the mean on the subsequent measurement of DV

      • Mortality-Attrition: relates to premature dropout and differential dropout (dropout rates in control and intervention groups that vary systematically) across experimental conditions; if differential dropout occurs, it is likely that the groups of participants are no longer as equal at the end of the experiment as they were at the start

        • Common when the intervention is unpleasant or very demanding or is not working

      • History: outside events that may influence participants in the course of the experiment or between the DV measurements in a repeated-measures design; Ex: major historical events, or small changes like joining a gym, changing jobs, or having a baby

      • Testing: prior measurement of the DV may influence the results obtained for subsequent measurements; Ex: participant becomes aware of the study goals

      • Practice Effect: when a beneficial effect on a DV measurement is caused by previous experience with the DV measurement itself.

      • Instrumentation: changes in measurement of the DV that are due to the measuring device (equipment or human); the equipment or human measuring the DV changes the measuring criterion over time; Ex: grading essays without a marking sheet

      • Effects of studying people:

        • Observee Reactivity (Hawthorne Effect): participants change behaviour when they know they are being observed (‘reactivity’)

        • Social Desirability: reporting inaccurately on sensitive topics in the best possible light

      • Demand Effects: relate to an aspect of the research that allows participants to guess what the research is about

      • Placebo Effects: results from participant’s own expectations about experiments or expectations about what will happen or what is meant to happen

      • Experimenter Bias: errors in a study due to the predisposed notions or beliefs of the experimenter Ex: Observer bias (selective viewing or interpretation of behaviours, NLP)
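Regression toward the mean, one of the threats above, can be demonstrated with a quick simulation (my own sketch, not from the notes, with arbitrary made-up parameters): each person has a stable true score, and each test adds independent random noise, so people selected for extreme first-test scores tend to score closer to the mean on retest.

```python
import random
import statistics

random.seed(0)
# Stable true scores, plus fresh independent noise on each measurement
true_scores = [random.gauss(100, 15) for _ in range(100_000)]
test1 = [t + random.gauss(0, 10) for t in true_scores]
test2 = [t + random.gauss(0, 10) for t in true_scores]

# Select extreme scorers based on the first measurement only
extreme = [i for i, score in enumerate(test1) if score > 130]
mean1 = statistics.mean(test1[i] for i in extreme)
mean2 = statistics.mean(test2[i] for i in extreme)
print(mean2 < mean1)  # extreme scorers fall back toward the mean on retest
```

Nothing about the participants changed between tests; the drop happens because the first extreme scores were partly inflated by noise that does not repeat.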

How to improve construct validity? Improve the quality of manipulations by:

  • reducing random error by replicating the procedure

  • reducing experimenter bias

  • reducing participant bias

  • ensuring the manipulation has construct validity

  • doing a manipulation check: ask participants about their beliefs, attitudes, and other aspects of the study

Controlling threats to Internal Validity:

  • random allocation of participants to levels of IV and random sampling

  • treat all conditions equally except for intended IV manipulations

  • use appropriate control conditions where relevant

  • use double-blind studies where possible
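The first control above, random allocation, can be sketched as follows (a hypothetical helper of my own, not a standard library function): shuffling removes any systematic ordering before participants are dealt into equal-sized groups.

```python
import random

def randomly_allocate(participants, conditions):
    """Shuffle participants, then deal them round-robin into equal-sized groups."""
    pool = list(participants)
    random.shuffle(pool)  # destroys any systematic ordering (e.g., sign-up time)
    groups = {c: [] for c in conditions}
    for i, p in enumerate(pool):
        groups[conditions[i % len(conditions)]].append(p)
    return groups

random.seed(0)
groups = randomly_allocate(range(40), ["control", "intervention"])
print({c: len(ps) for c, ps in groups.items()})  # {'control': 20, 'intervention': 20}
```

Because every participant has the same chance of landing in either condition, pre-existing differences are expected to spread evenly across groups, which is what makes the groups comparable before the IV is manipulated.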

Replicating Science

The replication crisis in psychology

A study looked at the reproducibility of 100 well-known psychological studies. While 97% of the original studies reported significant effects, only 36% of the replications were significant. The mean replication effect size was approximately half that of the original studies, meaning that even when significant effects were found in the replications, the effects were not as large. Across the different fields of psychology, cognitive psychology proved the most replicable (around 50%), whereas social psychology was the least replicable (~25%).

This does not necessarily mean that the original studies are wrong.
