Variables, Sampling, Validity, and Reliability

Sampling

  • Sampling 101
    • To conduct a research study, researchers need people/subjects/participants.

On the Road to Generalizability

  • The main objective of quantitative research is to obtain generalizable results.
  • Generalizable results reflect the true state of affairs in the population of interest.
  • To claim generalizability, the sample needs to be representative of the population.

Key Concerns

  • How certain can we be that the findings from the sample will hold true for the entire population?
  • How representative is the sample of the population from which it is drawn?
  • How well can we generalize our results to the population under study?

Population vs. Sample

  • Population: The totality to whom/which you wish to generalize your study findings.
  • Sample: The participants in your study.
    • Select a sample from the population and conduct study with sample participants.
    • Generalize the findings/results from your sample back to the population.

Sampling Procedures

  • Probability Sampling
    • Simple Random
    • Systematic Random
    • Stratified
    • Multi-stage Cluster
  • Non-probability Sampling
    • Convenience
    • Snowball
    • Purposive

Probability Sampling

  • Ensures that your sample is representative of the population (on the characteristics deemed important for the study).
  • Basic principle:
    • A sample will be representative of the population if all members of the population have an equal chance of being selected in the sample.
  • Allows the researcher to calculate the relationship between the sample and the population.

Types of Probability Sample

  • Simple random sample
  • Systematic random sample
  • Stratified random sampling
  • Multistage cluster sampling

Simple Random Sample

  • Define the population, list all members, assign numbers
  • Use a table of random numbers to select
  • Use a “lottery” method
  • Use a computer program to randomly select
  • Each member has an equal and independent chance of being selected.
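
The "computer program" option above can be sketched in a few lines of Python. This is only an illustration: the numbered list of 10,000 members and the fixed seed are assumptions made for the example, not part of any real study.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical sampling frame: members numbered 1..10000
population = list(range(1, 10001))
sample = simple_random_sample(population, 1000, seed=42)
print(len(sample))  # 1000
```

Sampling without replacement means no member can appear in the sample twice.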

Systematic Random Sample

  • Divide the size of the population by the size of the desired sample to get the sampling interval (k).
  • Randomly select the first person from among the first k people on the list, then select every kth person after that.
  • e.g., To select a sample of 1000 people from a list of 10,000, k = 10,000 / 1000 = 10: randomly select a starting person from the first 10, then select every 10th person from the list.
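
A minimal Python sketch of the same procedure, assuming the population is available as an ordered list (the list of 10,000 numbered members is made up for illustration):

```python
import random


def systematic_sample(population, n, seed=None):
    """Select every k-th member after a random start within the first k."""
    k = len(population) // n      # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)      # random starting index in 0..k-1
    return population[start::k][:n]


# Hypothetical list of 10,000 members; we want a sample of 1000, so k = 10
population = list(range(1, 10001))
sample = systematic_sample(population, 1000, seed=1)
print(len(sample))  # 1000
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the interval.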

Stratified Sampling

  • Use when you want to make sure the profile of the sample matches the profile of the population on some important characteristics, e.g., age, location, ethnicity.
  • Researcher divides population into subpopulations (strata) and randomly samples from the strata.
  • NB: can have proportional representation or disproportionate representation (but disproportionate sample would not be used to generalise to entire population, only the subgroups).
  • Why use stratified sampling?
    • Can reduce sampling error by ensuring ratios reflect the actual population (e.g., ratio of different ethnic groups).
    • To ensure that small subpopulations are included in the sample.

Stratified Random Sample Example

  • Stratified random sampling example (proportional)
  • Strata – states in Australia
  • Population – all Australian adults
  • Sample – subset of all Australian adults
  • Randomly sample X% of Australians from each state (stratum) so that the proportion of people in the final sample matches the proportion of people within each state across Australia.
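
Proportional stratified sampling can be sketched as follows. The three strata and their sizes are invented for the example (they are not real state populations), and the function name is hypothetical.

```python
import random


def proportional_stratified_sample(strata, fraction, seed=None):
    """Randomly sample the same fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        n = round(len(members) * fraction)  # stratum's share of the sample
        sample[name] = rng.sample(members, n)
    return sample


# Hypothetical strata (made-up sizes, 50% / 30% / 20% of the population)
strata = {
    "State A": [f"A{i}" for i in range(5000)],
    "State B": [f"B{i}" for i in range(3000)],
    "State C": [f"C{i}" for i in range(2000)],
}
sample = proportional_stratified_sample(strata, 0.10, seed=7)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)  # {'State A': 500, 'State B': 300, 'State C': 200}
```

Because the same fraction is drawn from every stratum, the sample keeps the 50/30/20 split of the population.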

Multi-Stage Cluster Sampling

  • Begin with a sample of groupings and then sample individuals
  • e.g. Rural sample
    • Define rural townships as those with populations < X
    • Get listing of all relevant townships
    • Take a random sample of townships
    • Randomly sample people from within the randomly sampled townships
  • Not the same as stratified sampling: every stratum must be sampled, whereas only the randomly selected clusters are sampled.
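
The rural example can be sketched as a two-stage procedure. The township names, sizes, and the numbers of clusters and people drawn are all assumptions chosen for illustration.

```python
import random


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick people within them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen                              # stage 2
    }


# Hypothetical listing: 20 rural townships with 200 residents each
townships = {
    f"Township {t}": [f"T{t}-P{p}" for p in range(200)] for t in range(20)
}
sample = two_stage_cluster_sample(townships, n_clusters=5, n_per_cluster=30, seed=3)
print(len(sample))  # 5 townships end up in the sample; the other 15 are never visited
```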

Multi-Stage/Multi-Phase Sampling

  • Larger sample obtained first in order to identify members of a sub-sample
  • Study participants then randomly chosen from the sub-sample
  • Good (but costly) way to identify subgroups that are not readily identifiable
  • e.g., A large community survey in Australia includes one question asking whether respondents have a previous diagnosis of X disease; those who report X disease are then followed up for sampling.

Non-Probability Sampling

  • Not every member of the population has an equal chance of being part of the sample
  • Why use it, then?
    • There are no lists for some populations under study, e.g.,
      • The homeless
      • Certain occupations (e.g., farmers)
      • Hidden or specific populations (e.g., farmers with mental health issues)
    • Convenience/resource restrictions

Convenience Samples

  • A sample of available participants, e.g.,
    • Students enrolled in a particular course
    • People passing a particular location
  • Advantages:
    • Easy, inexpensive
  • Disadvantages:
    • No control over representativeness
    • Bias!

Snowball Sampling

  • Used mainly for hard-to-study populations, e.g.,
    • Homeless young people
    • QUT students who access the public library at night.
    • People with a not readily/commonly listed characteristic (e.g., Holocaust survivors)
  • Involves collecting data from members of the population who can be located and then asking those members to provide information/contacts for other members of the population.

Quota Sample

  • Non-probability sampling equivalent of a stratified random sample
  • Want to reflect relative proportions of a population
  • But you don’t/aren’t able to sample randomly from each stratum as you do in stratified random samples

Purposive/Judgment Sampling

  • Clear purpose to the sampling strategy: select key informants, atypical cases, deviant cases, or a diversity of cases.
  • Selecting a sample based on knowledge of the population, its elements, and the purpose of the study
  • Often used to:
    • Select cases that might be especially informative
    • Select cases in a difficult-to-reach population
    • Select cases for in-depth investigation

Purposive/Judgment Sampling Examples

  • STUDYING THE PROBLEMS EXPERIENCED BY NEW IMMIGRANTS
    • Interview key people involved in agencies that help immigrants such as ethnic welfare groups, community immigration legal aid groups
    • Interviewing people with extensive experience with immigrants likely to provide rich data
  • COMPARISON OF LEFT-WING AND RIGHT-WING STUDENTS
    • May not be possible to sample all left-wing and right-wing students
    • Instead, you could sample the membership of left-wing (e.g., Socialist Alliance) and right-wing (e.g., Young Liberals) groups on campus

Which Method of Sampling Should I Use?

  • As a major aim of quantitative research is the ability to generalise results, the best method is usually a probability sampling one.
  • However, this is often not workable or feasible given resources, time, or the specific target population.
  • Sampling method used should be fully explained, and caveats about the likely generalisability of results made accordingly so that the reader can review your results in an informed way.

Determining Sample Size

How many participants do I need for my study?
  • Largely determined by the analysis you plan to conduct with the data derived
  • Generally, the more complex the analysis, the larger the sample required
  • You can statistically estimate the required sample size (power analysis)

Determining Sample Size

Larger sample sizes are needed:
  • When the sample is heterogeneous
    • Composed of widely different kinds of people
  • When you want to break down the sample into multiple subcategories
    • e.g., looking at genders separately
  • If you want to obtain a narrow or more precise confidence interval
  • When you expect a small effect or weak relationship
  • For some statistical techniques
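
To illustrate the points about heterogeneity and precision, here is a rough sketch using the standard formula for estimating a mean to within a chosen margin of error, n = (z × SD / margin)². This formula is not from the slides; it assumes a simple random sample and a known standard deviation, and the SD and margins below are made up.

```python
import math
from statistics import NormalDist


def n_for_margin(sd, margin, confidence=0.95):
    """Sample size needed so the confidence interval is mean +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil((z * sd / margin) ** 2)


print(n_for_margin(sd=15, margin=3))    # 97
print(n_for_margin(sd=15, margin=1.5))  # 385: halving the margin roughly quadruples n
print(n_for_margin(sd=30, margin=3))    # a more heterogeneous sample also needs a larger n
```

This is a different calculation from a power analysis, which would also take the expected effect size into account.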

Determining Sample Size - Rules

  • Five simple rules for determining sample size
    • If the population is smaller than 100, use the entire population
    • Larger sample sizes make it easier to detect an effect or relationship in the population
    • Compare to other research studies in the area by doing a literature review
    • Use a power table for a rough estimate
    • Use a sample size calculator (e.g., G*Power)

Moving Right Along

  • So we have thought about how to get our sample
  • How representative it is going to be via how random our sampling method has been
  • And we have briefly considered how big our sample should be
  • Time to start turning our attention to our variables and design

Last Week You Were Introduced To…

  • Independent vs. dependent variables
  • Categorical vs continuous variables

Operationalization

Conceptual Definition → Operational Definition

  • “…A description of the “operations” that will be undertaken in measuring a concept” (Rubin & Babbie, 2008, p. 160).
  • Specific procedures by which the researcher measures and/or manipulates a variable
  • Turning abstract concepts into concrete variables that we can measure or manipulate
  • The more careful and complete the operational definition, the more precise the measurement of the variable

Operationalization

  • Operationalization of IVs
    • How are you going to manipulate it? How might you measure it?
  • Operationalization of DVs
    • How are you going to measure it?

Levels of Measurement

  • When we want to measure something (e.g., religion, self-esteem, tennis ability), we need to choose a metric with which we can measure it.
  • The metric will determine the statistical analyses we can perform.

Levels of Measurement

  • Nominal: Something which is purely categorical information (about the quality or the ‘kind’ of thing).
    • e.g., religion: Jewish, Protestant, Catholic, Buddhist, etc.
    • Not a quantity, but rather a discrete quality that something can have
  • Ordinal: A rank order. Ordinal variables do indicate an underlying quantity, but the gaps between ranks are not necessarily equal, so you cannot meaningfully add, subtract, divide, etc.
  • Interval: A true number in the sense that there are equal intervals implied, but no true zero point.
    • e.g., temperature in degrees Celsius or Fahrenheit
  • Ratio: A true number. The distinguishing feature of a ratio scale variable is that it has a meaningful zero point, that participants could use to indicate the quantity is completely absent.

Summary of Levels of Measurement

Level      Categories/values   Ranks   Equal intervals   Zero point
Nominal    X
Ordinal    X                   X
Interval   X                   X       X
Ratio      X                   X       X                 X

Reliability & Validity

  • Applies mostly to indexes/scales
  • How do we assess whether our measures/operationalisations are good?
    • Are they valid?
    • Are they reliable?
  • The issue is that you can’t assess these until after you have developed your questionnaires and used them
  • This is why a pilot test can be so beneficial
  • Many people choose to use established measures rather than develop their own

Validity Types

  • You can consider the overall validity of a design/piece of research
    • We call this internal validity and external validity (future weeks)
  • You can also consider the validity of variables within a study
  • We are going to consider this in more detail now…

Validity

  • Are we measuring what we think we are?
  • Is our measure credible, is it believable?
  • Why is validity an issue?
    • Many (if not most) variables in social research cannot be directly observed
      • e.g., motivation, satisfaction, helplessness
  • The challenge:
    • To make a judgment call about whether we are measuring what we think we’re measuring

Types of Validity

Face Validity

  • Asks the question:
    • On the face of it, does my measure seem to relate to the construct?
  • e.g., On the face of it, which of the following is a more valid measure of worker morale:
    • Number of grievances filed with the union, or
    • Number of books borrowed by workers during off-duty hours
  • Measures that lack face validity have the potential to alienate research participants (what are they really trying to measure?)
  • A weak, subjective method for assessing validity, but a first step

Content Validity

  • The extent to which the measure represents a balanced, adequate sampling of relevant dimensions
  • Consider what should go into a measure and what should stay out - define the boundaries
  • How much does the measure cover the content of the definition?
  • e.g., Which of the following would be a more valid test of mathematical ability?
    • A 20-question test containing addition problems
    • A 20-question test containing addition, subtraction, multiplication, division, fraction problems

Criterion-Related Validity

  • Involves checking the performance of your measure against some external criterion
  • Two types
    • Concurrent: does it relate to a known criterion, for example, an alternative (gold standard) measure of the same construct?
    • Predictive: does the measure predict/relate to some criterion that you would expect it to predict?

Concurrent Validity

  • Establish the validity of your measure by comparing it to a “gold standard” (i.e., an existing validated measure of the same construct)

Concurrent Validity Example

  • The super-duper new IQ test vs. WAIS (gold standard)
  • Randomly select a representative sample (N = 100)
  • Give the first 50 participants the super-duper test and then the WAIS; give the other 50 the WAIS and then the super-duper test
  • Correlate scores on the two tests
    • High Pearson’s r = good concurrent validity
    • Low Pearson’s r = low concurrent validity
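
The "correlate scores" step might look like this in Python. The six pairs of scores are made up purely for illustration (a real study would use all 100 participants), and `pearson_r` is a hand-written helper, not a function from any statistics package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical scores for six participants on both tests
new_test = [95, 102, 110, 88, 120, 105]
wais = [98, 100, 112, 90, 118, 103]

r = pearson_r(new_test, wais)
print(round(r, 2))  # a value close to 1 would suggest good concurrent validity
```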

Predictive Validity

  • Does the measure predict something that it’s theoretically supposed to predict?
  • Does the measure differentiate between people in the way you would expect?
    • e.g., people with different mental disorders, elite versus amateur athletes, etc.
  • What should a measure of the following constructs predict?
    • IQ -> perhaps some cognitive-based performance task
    • Workplace depression scale -> number of mental health sick days

Predictive Validity Example

  • Student self-report measure of interviewing skills prior to placement
  • Placement supervisor & client ratings of interviewing skills at the end of placement
  • If the self-report measure predicts the later ratings, then it has good predictive validity

Construct Validity

  • Demonstrating that the measure relates to the theoretical construct of interest
  • Two types
    • Convergent
      • Demonstrating that the measure relates to measures of similar and related constructs
    • Divergent
      • Demonstrating the measure does not relate to unrelated constructs

Summary of Validity Types

Type         Description
Face         In the judgment of others, items appear to relate to construct
Content      Captures the entire meaning (all elements of definition) of a construct
Criterion    Agrees with external source
Concurrent   Agrees with pre-existing “gold-standard” measure
Predictive   Agrees with future behavior
Construct    How well multiple indicators relate to each other (consistent with theory)
Convergent   Similar measures (or measures of theoretically related constructs) are related
Divergent    Different measures (or…) are unrelated

Reliability

  • The consistency or repeatability of your measurement
  • For example, say I weigh myself on some scales at one point in time and then weigh myself 5 mins later and it says I’m 5 kilos heavier
    • My conclusion: the scales are dodgy!
    • The scientific conclusion: the scales are an unreliable measurement instrument

Types of Reliability

  • Stability of the measure (test-retest)
  • Internal consistency of the measure (split-half, Cronbach’s alpha)
  • Agreement or consistency across raters (inter-rater)

Test-Retest Reliability

  • Addresses the stability of your measure
  • You administer the measure at one point in time (time 1)
  • You then give the same measure to the same participants at a later point in time (time 2)
  • You correlate the scores on the two measures

Problems with Test-Retest

  • Two main problems
    • Memory effect
      • Participants remember and repeat the answers they gave the first time
    • Practice effect
      • Performance improves because of practice in test taking
  • Other considerations: how long should the interval between the two administrations be?
    • If too short, there’s a greater risk of memory effects
    • If too long, there’s a risk of other variables (e.g., additional learning) influencing results

Split-Half Reliability

  • Administer a battery of questions
  • Split the measure into two halves
  • Correlate the scores on the two halves of the measure
  • Higher correlation means greater reliability
  • Strength: eliminates memory & practice effects
  • Limitation: are the two halves equivalent?

Split-Half Reliability Example
  • Measure of Prejudice toward First Nations Australians
  • 20-item scale
  • Score on one half of the test (10 items) vs. score on the other half of the test (10 items)
  • Higher correlation means higher reliability
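
A small sketch of the split-half calculation. It assumes an odd/even split of the items (the slides do not say how the halves are chosen) and uses made-up responses from five participants on a shortened 6-item scale rather than the full 20 items.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def split_half_correlation(responses):
    """Correlate each person's total on odd-numbered items with their total on even-numbered items."""
    odd_half = [sum(person[0::2]) for person in responses]   # items 1, 3, 5, ...
    even_half = [sum(person[1::2]) for person in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_half, even_half)


# Hypothetical data: one row per participant, one column per item (1-5 ratings)
responses = [
    [4, 5, 4, 4, 5, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 4],
    [1, 2, 1, 2, 2, 1],
]
r_halves = split_half_correlation(responses)
print(round(r_halves, 2))  # higher values indicate greater split-half reliability
```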

Inter-Item Reliability

  • Assesses the ‘internal consistency’ of your measure
  • i.e., it tells you how well the items or questions in your measure appear to reflect the same underlying construct
  • You will get good internal consistency if each individual responds in approximately the same way across the questions on your survey
  • Cronbach’s alpha can range from 0 (when the items are not correlated with one another) to 1.00 (when all items are perfectly correlated to each other). The closer the alpha is to 1.00, the better the reliability of the measure
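
Cronbach’s alpha is commonly computed as α = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The sketch below applies that formula to made-up responses (five participants, six items); in practice a statistics package would report this for you.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one list of item scores per participant."""
    k = len(responses[0])                                   # number of items
    item_variances = [variance([person[i] for person in responses]) for i in range(k)]
    total_variance = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: one row per participant, one column per item (1-5 ratings)
responses = [
    [4, 5, 4, 4, 5, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 4, 4],
    [1, 2, 1, 2, 2, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # 0.975
```

The invented responses are deliberately very consistent across items, which is why alpha comes out so close to 1.00.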

Inter-Rater or Inter-Observer Reliability

  • Checking the match between two or more raters or judges
  • e.g., Research investigating the relationship between communication and family functioning
    • Coding videos for hostile statements – need to check the agreement amongst the coders

Calculation of Inter-Rater Reliability

  • Nominal or ordinal scale
    • The percentage of times different raters agree
  • Interval or ratio scale
    • Correlation coefficient
    • Other statistical methods – beyond scope of PYB210.
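
For the nominal case, percentage agreement between two raters can be sketched as below. The ten codes per coder are invented for illustration, following the hostile-statement coding example.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters assigned the same code."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must code the same set of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)


# Hypothetical codes from two coders for the same ten video segments
coder_1 = ["hostile", "neutral", "hostile", "neutral", "neutral",
           "hostile", "neutral", "neutral", "hostile", "neutral"]
coder_2 = ["hostile", "neutral", "neutral", "neutral", "neutral",
           "hostile", "neutral", "hostile", "hostile", "neutral"]

agreement = percent_agreement(coder_1, coder_2)
print(agreement)  # 80.0
```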

Interpreting Reliability Coefficients

  • What kind of reliability coefficients should I be aiming for?
    • Test-retest coefficients > .70
    • Internal consistency >.70 (but ideally much higher)
    • Rating consistency >.90
  • These are relatively arbitrary but serve as a benchmark

Reliability and Measurement Error

  • One theme we will come back to later on, when talking about statistics, is that measurement error serves to weaken our statistical tests
  • All other things being equal, more error in measurement means lower power
  • Choosing a measure that is highly reliable decreases measurement error and increases the power of your design

The Relationship Between Reliability and Validity

  • Can a measure be reliable but not valid?
    • Yes! You could have a consistent measure that does not actually measure the construct
  • Can a measure be valid but not reliable?
    • Yes.
    • Example: a tool that is valid but difficult to administer (e.g., skin fold tests, which require technical skill) may be unreliable across multiple administrators.

A Dartboard Analogy

  • High validity, low reliability
  • High reliability, low validity

Summary of Validity Types

Type         Description
Face         In the judgment of others, items appear to relate to construct
Content      Captures the entire meaning (all elements of definition) of a construct
Criterion    Agrees with external source
Concurrent   Agrees with pre-existing “gold-standard” measure
Predictive   Agrees with future behavior
Construct    How well multiple indicators relate to each other (consistent with theory)
Convergent   Similar measures (or measures of theoretically related constructs) are related
Divergent    Different measures (or…) are unrelated

Summary of Reliability Types

Type                     Description
Test-retest              Same questionnaire given on two occasions and data correlated
Split-half               Questionnaire split in half and data from the two halves correlated
Inter-item reliability   Overall correlation between items in the scale
Inter-rater              Checking for agreement between multiple raters or judges