Quantitative Research Notes

Measurement Reliability

  • Definition: Refers to the consistency of a measurement tool.

Internal Consistency

  • Definition: Assesses the extent to which all items on a measure consistently measure the same construct.

  • Relevance: Only relevant for measurement tools with multiple items.

  • Evaluation:

    • Use statistics to calculate correlations between items.

    • High correlations suggest items measure something similar.

    • Calculations are covered in Jamovi Week 5.

  • Statistics:

    • Cronbach’s alpha (α):

      • A single value indicating the degree to which items are intercorrelated.

      • Interpreted like a correlation coefficient (r).

      • Values close to 0 indicate no correlation; values close to 1 indicate strong correlation.

      • Most common measure of internal consistency in psychology.

      • Cronbach (1951) recommended α > .70 as acceptable, but this guideline is criticised as arbitrary (Taber, 2018).

      • Acceptable criteria vary by field (e.g., medical fields often require α > .80).

      • Tavakol and Dennick (2011) suggest that if alpha is too high (e.g., > .90), items may be redundant.

    • McDonald’s omega (ω) – aka omega total:

      • Similar to Cronbach’s alpha but uses a different statistical model.

      • More accurate when assumptions are not met (e.g., non-normal distribution, differing variances).

      • Interpreted the same way as Cronbach’s alpha.

      • Becoming increasingly popular.

    • Item-rest correlations:

      • Correlations between each item and the total of all other items.

      • Helps identify inconsistent items for adaptation or removal.

    • Kuder-Richardson Formulas 20 and 21 (KR20 and KR21):

      • Measure internal consistency of scales with binary responses (e.g., Yes/No).
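
As an illustration of the statistics above, Cronbach’s alpha can be computed directly from the item variances and the variance of the total score. This is a minimal Python sketch; the response matrix is hypothetical, and real analyses would normally use jamovi or a statistics package.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 4 participants x 3 items on a 1-5 scale
scores = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [1, 2, 1]]
```

With these made-up scores the items are highly intercorrelated, so alpha comes out close to 1; uncorrelated items would push it toward 0.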

Test-Retest Reliability

  • Definition: Consistency of a measure over time.

  • Relevance: Only for constructs stable over time (e.g., personality traits), not constructs expected to vary (e.g., mood).

  • Evaluation:

    • Administer the same measurement tool at two different points in time.

    • Calculate the correlation between the two sets of scores.

    • High correlation indicates high test-retest reliability.
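
The correlation step above can be sketched with a plain Pearson correlation. This is a minimal Python sketch with hypothetical score lists; in practice the correlation would come from jamovi or similar software.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical: the same trait measure given to 5 people at two time points
time1 = [10, 14, 8, 12, 16]
time2 = [11, 13, 9, 12, 15]
```

Here the two sets of hypothetical scores track each other closely, so r is near 1, indicating high test-retest reliability.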

Interrater Reliability

  • Definition: The extent to which different raters or observers provide consistent ratings when assessing the same phenomenon or behaviour.

  • Relevance: Only when a measurement tool is administered or scored by more than one person.

  • Evaluation:

    • Often tested using Cohen’s Kappa.

    • Values closer to 1 indicate higher interrater reliability (1 = perfect agreement).
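
Cohen’s kappa compares the observed agreement between two raters with the agreement expected by chance. A minimal Python sketch follows; the two rating lists are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of each category's marginal proportions
    expected = sum(counts1[c] * counts2.get(c, 0) for c in counts1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical: two observers coding 8 behaviours as "on"-task or "off"-task
r1 = ["on", "on", "off", "on", "off", "on", "off", "on"]
r2 = ["on", "on", "off", "off", "off", "on", "off", "on"]
```

For these made-up codings the raters agree on 7 of 8 behaviours, and after correcting for chance agreement kappa is 0.75; identical ratings give kappa = 1.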

The Relationship Between Validity and Reliability

  • Neither valid nor reliable (Target a): Tools produce inconsistent scores that do not accurately measure the intended construct.

  • Reliable but not valid (Target b): Tools produce consistent scores, but these scores do not accurately reflect the construct being measured.

    • Example: A self-report measure of self-esteem with questions about a person's favourite animals.

  • For a measurement tool to be valid, it must be reliable (Target c): A valid measure consistently yields results that accurately reflect the intended construct.

    • Example: A valid depression test consistently and accurately identifies symptoms of depression.

  • Reliability is necessary, but not sufficient, for validity.

Measurement Sensitivity

  • Definition: The ability of a measurement tool to discriminate between individuals who vary on a construct.

  • Evaluation:

    • Administer the test to groups known to vary on the construct.

    • Check if they (a) score within the expected range and (b) show significant differences between theoretically different groups.

    • Examine the variability of scores: a well-spread distribution suggests good sensitivity, while a narrow range indicates lower sensitivity.

Experimental Manipulation

Choosing the Right Experimental Manipulation

  • Experimental manipulation involves systematically changing conditions participants experience to see if it causes changes in dependent variables.

  • Crucial for designing valid experiments.

  • If the manipulation doesn't effectively alter the intended variable, the study can't properly test the hypotheses.

  • A poorly executed manipulation means that the results won't accurately reflect the relationship you're trying to investigate.

Characteristics of a Good Experimental Manipulation

  • Construct Validity: It must actually manipulate the variable it is intended to.

  • Strength of Manipulation: It needs to be strong enough to influence participants' behaviour.

    • Example: Inducing anger should elicit anger, not just mild annoyance.

    • A manipulation that isn't strong enough may fail to produce the desired effect, leading to a failure to detect an effect even if one truly exists in the real world.

  • Reliability: A valid, effective manipulation should cause the same effect each time it is used.

Evaluating Manipulation

  • Pilot Testing: Small studies run before the main study.

    • Useful for testing construct validity and effectiveness of experimental manipulation.

    • Test if the manipulation causes changes in relevant criteria.

    • Test if participants perceive the manipulation as relevant to the construct.

    • Conduct a preliminary test of the strength of the experimental manipulation.

    • Comparing the results of a study to the results of an initial pilot study can also provide some evidence about the reliability of the experimental manipulation's effects.

    • Note that not all experimental studies will conduct pilot testing, as it may not be possible to conduct these due to time and cost constraints.

  • Manipulation Checks: Measurements included in your study to verify that the manipulation is influencing the variable it was intended to manipulate.

    • Example: Asking participants to report how angry they feel after being exposed to an anger-inducing manipulation.

    • Note that it may not be appropriate to use manipulation checks in all studies, as sometimes doing so can make participants suspicious of the hypotheses and bias their responses.

  • Evaluate Face Validity: On the surface, how well does the manipulation align with the theoretical definition of the construct the study aims to manipulate?

  • Replication: Re-running the study with a new sample to verify that any effects found the first time were not an artefact of chance.

    • Exact replication: Involves repeating the study exactly as it was run the first time, but with a new group of people.

    • Conceptual replication: Conducting another study to test the same hypotheses as the first, but changing some aspects of the method (e.g., measuring the same DV in a different way).

    • Replication and extension: Repeating the study using the same methodology as the original, but adding elements to extend it in some way (e.g., adding additional DVs to see if the results generalise to other outcomes).

Choosing an Experimental Manipulation

  • Step 1: Define Construct: Clearly define the constructs you are interested in studying. Know what exactly it is you intend to manipulate.

  • Step 2: Review the Literature: Examine previous studies to identify what (if any) types of manipulations have been used in the past and if these were effective.

  • Step 3: Select or Design: Select an existing manipulation if one is available. If not, adapt an existing one by changing it to make it appropriate for your study. If neither option is possible, design your own manipulation based on the definition of the construct.

  • Step 4: Evaluate: Evaluate the validity and strength of the experimental manipulation, either with pilot testing or by evaluating within the context of the study itself.

  • Step 5: Replicate: Replicate the results to examine the reliability of manipulation's effects.

Sampling

Choosing the Right Sample and Sample Size

  • Selecting the Right Population: Researchers need to carefully consider who the relevant population for their study is. Who is the study aiming to learn about? This should be determined by the research question.

    • General population research: Studies aim to draw conclusions about the general population. Refers to a broad, inclusive group with a wide range of characteristics, without any specific restrictions. Used when research aims to understand phenomena that apply to people broadly.

    • Specific population research: About specific groups (e.g., aiming to understand risk factors for poor mental health in refugees). The population of interest must be clearly defined, and inclusion and exclusion criteria should be established to identify participants who are or are not part of the population.

  • Sample Representativeness:

    • Refers to the degree to which a subset of individuals, items, or data points selected for analysis accurately reflects the larger population from which it is drawn.

    • If the sample is representative of the population, the results of a study are likely to generalise to the population the sample represents.

How to Recruit a Representative Sample

  • Probability Sampling: Each member of the population has a known chance of being selected.

    • Simple Random Sampling: Everyone in the population has an equal chance of being selected.

    • Systematic Sampling: The first person selected from a population is random, but from then on, selection follows a systematic rule (e.g., every 10th person is selected).

    • Stratified Sampling: The population is divided into subgroups (e.g., age groups), and then participants are randomly chosen from each group.

    • Cluster Sampling: The population is divided into groups, and then we randomly choose one or more of those groups and sample everyone in it.

  • Non-Probability Sampling: Not everyone in the population has a chance of being chosen, and selection is not random.

    • Convenience Sampling: Participants are selected based on ease of access/availability.

    • Quota Sampling: The researcher decides beforehand how many people of certain characteristics they need to match the population (e.g., 40% from rural locations, 60% from urban). The researcher then recruits this number of people fitting each characteristic.

    • Purposive Sampling: Participants are selected based on specific characteristics/criteria (e.g., only people working in hospitality and who have experienced food insecurity).

    • Snowball or Referral Sampling: Existing participants recruit future participants by passing on information about the study.
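
The probability sampling schemes above can be sketched in a few lines of Python. This is an illustrative sketch only; the population, strata, and sample sizes are hypothetical.

```python
import random

random.seed(1)  # fixed seed so this sketch is reproducible
population = list(range(1, 1001))  # hypothetical population of 1000 IDs

# Simple random sampling: every member has an equal chance of selection
simple = random.sample(population, k=50)

# Systematic sampling: random starting point, then every 10th member
start = random.randrange(10)
systematic = population[start::10]

# Stratified sampling: divide into subgroups, then sample randomly within each
strata = {"even": [p for p in population if p % 2 == 0],
          "odd": [p for p in population if p % 2 == 1]}
stratified = [p for group in strata.values() for p in random.sample(group, k=25)]
```

Cluster sampling would differ from the stratified case in the last step: instead of sampling within every subgroup, you would randomly pick whole subgroups and include everyone in them.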

How to Evaluate Sample Representativeness

  • Sample Size: A larger sample generally provides more reliable estimates of population parameters.

  • Demographic Characteristics: Evaluate if the sample demographic characteristics reflect the characteristics of the population (age, gender, ethnicity, socioeconomic status, education level, geographic location, etc.).

Sample Size and Statistical Power

  • Statistical Power:

    • Refers to the probability that a statistical test will correctly reject the null hypothesis when it is false.

    • Measures the likelihood of a study detecting a significant effect, assuming the effect truly exists in the real world.

    • A study with high statistical power has a better chance of detecting true effects and is also likely to produce more precise effect estimates.

    • High statistical power increases the likelihood that a statistically significant result reflects a true effect rather than a false positive.

  • How Much Power Do You Need?

    • Cumming (2012) recommends a statistical power of at least 0.80 (80% chance of detecting a true effect).

    • Trade-off between power and the Type I error rate (α).

    • Type I error: Finding a significant effect that is a false positive.

    • Type II error: Concluding there is no effect when one actually exists.

    • When power is high (e.g., because α is set higher), the odds of a Type II error are low, but the odds of a Type I error can be higher.

    • When power is low (e.g., because α is set lower), the odds of a Type II error are high, but the odds of a Type I error are lower.

  • Sample Size and Power:

    • Sample size directly affects statistical power.

    • Larger sample sizes increase statistical power because they provide more information and reduce random variability (error) in the data.

    • Smaller sample sizes result in lower statistical power because they provide less reliable estimates of population parameters and introduce more error into the data.

  • Other Factors That Influence Power:

    • Effect size: The magnitude of the difference or relationship between variables in the population. Larger effect sizes are easier to detect, leading to higher power.

    • Significance level (α): The threshold set to determine statistical significance. A lower significance level decreases the chance of a Type I error but also reduces statistical power.

    • Test sensitivity: The ability of the statistical test to detect differences or relationships. More sensitive tests have higher power.

    • Study design and type of inferential analyses: Some study designs and their associated analyses are inherently more powerful than others. For example, repeated measures designs tend to be more powerful than between-groups designs.
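
One common effect size for a two-group comparison is Cohen’s d: the difference between group means divided by the pooled standard deviation. A minimal Python sketch, with hypothetical group scores:

```python
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Cohen's d: standardised mean difference using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * stdev(group1) ** 2 +
                  (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5
```

By convention, |d| around 0.2 is often described as a small effect, 0.5 as medium, and 0.8 as large; larger values are easier to detect, so they require smaller samples for the same power.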

How to Calculate the Sample Size You Need

  • An a priori power analysis is a statistical procedure conducted before data collection to determine the required sample size needed to achieve a certain level of statistical power for a planned hypothesis test.

  • Information needed:

    • Estimated effect size: Determined by searching the literature or selecting the smallest effect size you are interested in detecting.

    • Significance level (alpha): Select the threshold for significance you will use for your inferential analyses (this should be determined a priori, not later when you get to the statistical analysis).

    • Type of analysis: Determined by your hypotheses and the type of data you have collected.
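
These three inputs are enough to sketch an a priori sample-size calculation for a two-group comparison. The Python sketch below uses the normal approximation to the power function; dedicated tools (e.g., G*Power) use the exact t distribution and give slightly larger answers.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)
```

For a medium effect (d = 0.5) at α = .05 and 80% power, this approximation gives 63 participants per group (the exact t-test calculation gives about 64), illustrating how smaller expected effects demand larger samples.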

Other Considerations for Quantitative Research

Internal Validity

  • Refers to the extent to which a study can establish a causal relationship between the IV and DV.

  • Threats to internal validity undermine our ability to draw conclusions about causality from a study.

  • Threats to internal validity:

    • Study design: Non-experimental studies cannot test causality and thus have low internal validity.

    • Poor experimental control: Poor experimental control increases the likelihood of extraneous variables confounding the results of the study.

    • Use of invalid or ineffective experimental manipulations: If experimental manipulations are not effective, or their validity has not been verified, they may appear to work but fail to manipulate the intended variable properly.

    • Use of invalid or unreliable measurement tools: If we cannot be sure that we are measuring something accurately or consistently, we cannot draw valid conclusions about causation.

External Validity

  • Refers to the extent to which the results of a study can be generalised or applied to settings outside the study.

  • Threats to external validity undermine our ability to generalise the results to people or environments outside of the study.

  • Threats to external validity:

    • Poor ecological validity: Ecological validity refers to how natural or realistic the experimental environment and tasks are.

    • Poor psychological realism: If the mental processes used for a task in a research study are very different from the processes a person would use in real life, the results may not be generalisable outside of the study.

    • Non-representative samples: If a sample differs significantly from the population.

Why Does Good Research Methodology Matter?

  • When designing and evaluating research, it is essential to consider the appropriate research design, the selection of valid and reliable methods, and both the population and sample to ensure the validity of the study's conclusions.

  • If the wrong research design is chosen, the study may not actually test the hypotheses accurately.

  • If the wrong manipulation is chosen, or if there is low internal validity, you cannot be confident that any effects you observe were caused by the factor you think caused them.

  • If poor-quality measurement tools are chosen, you cannot be confident that you are measuring what you think you are, or that those measurement tools will be able to discriminate between participants effectively.

  • If the wrong population is chosen, you may be researching a phenomenon among people it does not affect.

  • If the wrong sample is chosen, the results may not represent the population, or you may not have sufficient power to detect a real effect.

  • If external validity is low, the results may not apply to people or settings outside of the study.

Operationalising Variables

Conceptual vs. Operational Variables

  • Conceptual Variable: A theoretical concept or idea used to describe an abstract phenomenon that cannot be directly observed or measured (e.g., intelligence, motivation, stress).

    • Abstract and can be interpreted in various ways (e.g., anxiety).

  • Operational Variable: A specific, measurable representation of a construct in the context of a particular study.

    • Defines how the construct will be observed, measured, or manipulated.

Choosing the Right Measures

  • Measurement refers to assigning numbers to represent constructs.

  • Choosing a measure involves selecting a type of observation and a scale of measurement.

Choose the Type of Observation

  • There are three primary ways we operationalise constructs, each involving the collection of different types of data: self-report measures, behavioural measures, and physiological measures.

    • Self-Report: Participants provide information about themselves through questionnaires, interviews, or surveys.

      • Includes reports on their feelings, thoughts, attitudes, behaviours, or experiences.

      • Example: Ask participants to report how many hours they slept the previous night and to rate their sleep quality from 1 (very poor) to 5 (very good).

      • Advantages: Direct access to an individual's thoughts, feelings, and perceptions, allowing us to measure these things even though we cannot observe them directly. Self-report measures are cheap and easy to administer to a large group.

      • Disadvantages: Responses may be influenced by social desirability bias, memory errors, or a lack of self-awareness.

    • Behavioural Observation: Directly observing and recording aspects of an individual's behaviour, often within a natural or controlled setting.

      • Example: Measuring sleep by observing someone in bed and recording how long their eyes are open or closed, how often they move, etc.

      • Advantages: Behavioural observations are more objective than self-report because they avoid the disadvantages above (e.g., a person may report sleeping 3 hours when behavioural observation indicates they actually slept 6).

      • Disadvantages: Observers may have biases or make subjective interpretations that influence the accuracy of the observation. Reactivity can occur (a change in behaviour when participants know they are being observed). Can be time-consuming and costly.

    • Physiological Measures: Recording biological data from participants (e.g., HR, hormone levels, brain activity, skin conductance).

      • Example: Sleep may be measured by recording a person's brain waves, heart rate, respiratory rate, eye movements, and muscle tension.

      • Advantages: Data is objective (though interpretation of the data may still be subjective). Offers a direct link to the physiological processes underlying psychological states.

      • Disadvantages: Can be expensive and complex to conduct. The relationship between physiological signals and psychological states can be complex and difficult to interpret. Some methods can be invasive or uncomfortable for participants, potentially resulting in a change of behaviour.

Choose the Scale of Measurement

  • Categorical Data (Nominal Data): Assigning numbers to represent different discrete categories defined by specific characteristics.

    • Each number serves as a label denoting the category in which the participant is placed.

    • No specific order.

    • Example: whether a participant is a student or not; the participant's nationality.

  • Ordinal Data: Consists of categories with a specific order or ranking, but the intervals between these categories are not necessarily equal or known.

    • Examples: Grades (HD, D, C, P, F); satisfaction ratings from 1 (very unsatisfied) to 5 (very satisfied).

    • Inherently discrete and best characterised as a type of categorical data. However, ordinal data can sometimes be treated as continuous for analytical convenience, especially when the number of categories is large and the order represents a progression (e.g., Likert scales).

  • Continuous Data: Refers to data that can take any value within a given range, with intervals between numbers always being the same distance (e.g., the difference in age between 1 and 2 years is the same as the difference between 20 and 21 years).

    • Data is not restricted to specific, discrete categories; for instance, a person doesn't have to be exactly 20 or 21; they can be 20.3 or 20.7.

    • Interval Data: Continuous data that does not have a true meaningful zero.

      • A meaningful zero means that zero indicates the absence of the variable.

      • Example: temperature does not have a true zero (0 degrees does not represent the absence of temperature).

    • Ratio Data: Continuous data with a true, meaningful zero.

      • Example: measuring the distance a participant sits away from a confederate (0 would indicate they are 0 cm away from the other person; an absence of distance).

Measurement Validity

  • Validity of measurement refers to the degree to which the measurement tool accurately measures the construct it is intended to.

  • There are several types of validity, which fit into the categories of construct validity and criterion validity.

Construct Validity

  • Refers to the extent to which we are confident that a measurement tool actually measures the construct it claims to measure.

    • Face Validity: Refers to whether or not, on the surface, the tool appears to measure what it is supposed to.

      • Evaluation: Apply logic to assess whether the characteristics of the tool subjectively appear to relate to the construct. In the context of a self-report questionnaire, for example, we would consider whether the items seem relevant to the construct being measured.

    • Content Validity: Refers to the extent to which a measure represents all facets of a given construct.

      • Evaluation: Typically evaluated through expert and end-user judgement. Experts in the relevant field review the measurement tool to ensure that it covers the full range of the concept being measured. End users are also often involved to evaluate whether the tool reflects their real-world experience.

    • Convergent Validity: Refers to whether or not the tool correlates with other measures of the same construct.

      • Evaluation: Testing whether scores on the measurement tools correlate with scores from a different measure of the same construct.

    • Discriminant Validity (Divergent Validity): Refers to the degree to which a measure does not correlate too strongly with measures of other constructs that are theoretically different.

      • Evaluation: Testing whether scores on our measurement tool correlate with scores from a measure of an unrelated construct. If the scores don't correlate, or at least don't correlate too highly, this provides evidence that the measures are assessing different constructs.

    • Known-Groups Validity: Refers to whether a measurement tool can distinguish between groups that it is theoretically expected to distinguish between.

      • Evaluation: Administering the measurement tool to different groups of people who are expected to score differently on the construct, and then testing if they produce significantly different scores.

Criterion Validity

  • If your scale measures the construct it claims to measure, it should correlate with factors known to be related to that construct.

  • Concurrent Validity: Refers to the extent to which scores on the measurement tool correlate with scores on a criterion, when both are measured at the same time.

    • Evaluation: Participants complete both the measurement tool we are interested in and another measure of a criterion variable during the same time period. We then test if their scores on each measure are correlated.

  • Predictive Validity: Refers to the extent to which scores on the measurement tool correlate with scores on a criterion, when the criterion is measured at some point in the future.

    • Evaluation: Participants would complete the measurement tool we are interested in at one time point, and then we would gather data about the criterion at a later time. We would then test if their scores on each are correlated.

Quantitative Approach Rationale: A quantitative approach is suitable when the constructs of interest can be measured quantitatively (e.g., for the research question "Is there a relationship between chronic stress and working memory capacity?").