Variables, Sampling, Validity, and Reliability

Sampling

Sampling 101
- To conduct a research study, researchers need people/subjects/participants.

On the Road to Generalizability

The main objective of quantitative research is to obtain generalizable results.
Generalizable results reflect the true state of affairs in the population of interest.
To claim generalizability, the sample needs to be representative of the population.

Key Concerns

How certain can we be that the findings from the sample will hold true for the entire population?
How representative is the sample of the population from which it is drawn?
How well can we generalize our results to the population under study?

Population vs. Sample

Population: The totality to whom/which you wish to generalize your study findings.
Sample: The participants in your study.
- Select a sample from the population and conduct the study with the sample participants.
- Generalize the findings/results from your sample back to the population.

Sampling Procedures

Probability Sampling
- Simple Random
- Systematic Random
- Stratified
- Multi-stage Cluster
Non-probability Sampling
- Convenience
- Snowball
- Purposive

Probability Sampling

Ensures that your sample is representative of the population (on the characteristics deemed important for the study).
Basic principle:
- A sample will be representative of the population if all members of the population have an equal chance of being selected in the sample.
Allows the researcher to calculate the relationship between the sample and the population.

Types of Probability Sample

Simple random sample
Systematic random sample
Stratified random sampling
Multistage cluster sampling

Simple Random Sample

Define the population, list all members, assign numbers
Use a table of random numbers to select
Use a “lottery” method
Use a computer program to randomly select
Each member has an equal and independent chance of being selected.
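The "computer program" option can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the sampling frame of 10,000 numbered members, the sample size, and the function name `simple_random_sample` are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical sampling frame: members numbered 1 to 10,000
frame = list(range(1, 10_001))
sample = simple_random_sample(frame, 1000, seed=42)
```

Recording the seed alongside the sample lets someone else reproduce exactly the same draw.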

Systematic Random Sample

Divide the size of the population by the size of the desired sample to get the sampling interval k, randomly select a starting person from the first k people on the list, then select every kth person after that.
e.g., To select a sample of 1000 people from a list of 10,000, the interval is 10,000 / 1000 = 10: randomly select a starting person from the first 10 and then select every 10th person from the list.
Every kth person
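A minimal sketch of the every-kth-person rule, using the 10,000-person list from the example. It assumes the random start is drawn from the first k people on the list; the frame and the function name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th member after a random start within the first k."""
    k = len(frame) // n            # sampling interval (10 in the example)
    rng = random.Random(seed)
    start = rng.randrange(k)       # random start index between 0 and k - 1
    return frame[start::k][:n]     # every k-th member from the start

# Hypothetical sampling frame: members numbered 1 to 10,000
frame = list(range(1, 10_001))
sample = systematic_sample(frame, 1000, seed=1)
```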

Stratified Sampling

Used when you want to make sure the profile of the sample matches the profile of the population on some important characteristics, e.g., age, location, ethnicity.
Researcher divides population into subpopulations (strata) and randomly samples from the strata.
NB: can have proportional representation or disproportionate representation (but disproportionate sample would not be used to generalise to entire population, only the subgroups).
Why use stratified sampling?
- Can reduce sampling error by ensuring that ratios in the sample reflect those in the actual population (e.g., the ratio of different ethnic groups).
- To ensure that small subpopulations are included in the sample.

Stratified Random Sample Example

Stratified random sampling example (proportional)
Strata – states in Australia
Population – all Australian adults
Sample – subset of all Australian adults
Randomly sample X% of Australians from each state stratum so that the proportion of people in the final sample matches the proportion of people within each state across Australia.
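The proportional version can be sketched as follows. The state counts are made up for illustration (they are not real population figures), and the function name and the 10% sampling fraction are assumptions.

```python
import random
from collections import Counter, defaultdict

def proportional_stratified_sample(frame, stratum_of, fraction, seed=None):
    """Randomly sample the same fraction of members from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in frame:
        strata[stratum_of(person)].append(person)   # group people by stratum
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)          # same fraction per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical frame of (person_id, state) pairs with made-up counts
frame = (
    [(i, "QLD") for i in range(200)]
    + [(i, "NSW") for i in range(200, 500)]
    + [(i, "TAS") for i in range(500, 550)]
)
sample = proportional_stratified_sample(frame, lambda p: p[1], 0.10, seed=7)
state_counts = Counter(state for _, state in sample)
```

Because the same fraction is taken from every stratum, the state proportions in the sample match those in the frame.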

Multi-Stage Cluster Sampling

Begin with a sample of groupings and then sample individuals
e.g. Rural sample
- Define rural townships as those with populations < X
- Get listing of all relevant townships
- Take a random sample of townships
- Randomly sample people from within the randomly sampled townships
Not the same as stratified sampling: in stratified sampling every stratum is sampled, whereas in cluster sampling only the randomly selected clusters are sampled.
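The two stages in the rural example can be sketched like this. The township names, the number of residents, and the function name are hypothetical, and the sketch assumes a listing of residents is available for each selected township.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly sample clusters. Stage 2: sample people within them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)        # stage 1
    return {
        name: rng.sample(clusters[name],                     # stage 2
                         min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical listing: 20 townships with 50 residents each
townships = {
    f"town_{t}": [f"town_{t}_resident_{r}" for r in range(50)]
    for t in range(20)
}
sample = two_stage_cluster_sample(townships, n_clusters=5, n_per_cluster=10, seed=3)
```

Only 5 of the 20 townships contribute participants, which is the difference from stratified sampling.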

Multi-Stage/Multi-Phase Sampling

A larger sample is obtained first in order to identify members of a subgroup of interest
The sub-sample for the study is then randomly chosen from the identified members
Good (but costly) way to identify subgroups that are not readily identifiable
e.g., A large community survey in Australia includes one question asking whether respondents have a previous diagnosis of X disease; those who report X disease are then followed up for sampling.

Non-Probability Sampling

Members of the population do not all have an equal (or known) chance of being part of the sample
Why use it, then?
- There are no lists for some populations under study, e.g.,
  - The homeless
  - Certain occupations (e.g., farmers)
  - Hidden or specific populations (e.g., farmers with mental health issues)
- Convenience/resource restrictions

Convenience Samples

A sample of available participants, e.g.,
- Students enrolled in a particular course
- People passing a particular location
Advantages:
- Easy, inexpensive
Disadvantages:
- No control over representativeness
- Bias!

Snowball Sampling

Used mainly for hard-to-study populations, e.g.,
- Homeless young people
- QUT students who access the public library at night.
- People with a not readily/commonly listed characteristic (e.g., Holocaust survivors)
Involves collecting data from members of the population who can be located and then asking those members to provide information/contacts for other members of the population.

Quota Sample

Non-probability sampling equivalent of a stratified random sample
Want to reflect relative proportions of a population
But you don’t/aren’t able to sample randomly from each stratum as you do in stratified random sampling

Purposive/Judgment Sampling

Clear purpose to the sampling strategy: select key informants, atypical cases, deviant cases, or a diversity of cases.
Selecting a sample based on knowledge of the population, its elements, and the purpose of the study
Often used to:
- Select cases that might be especially informative
- Select cases in a difficult-to-reach population
- Select cases for in-depth investigation

Purposive/Judgment Sampling Examples

STUDYING THE PROBLEMS EXPERIENCED BY NEW IMMIGRANTS
- Interview key people involved in agencies that help immigrants such as ethnic welfare groups, community immigration legal aid groups
- Interviewing people with extensive experience with immigrants likely to provide rich data
COMPARISON OF LEFT-WING AND RIGHT-WING STUDENTS
- May not be possible to sample all left-wing and right-wing students
- Instead, you could sample the membership of left (e.g., Socialist alliance) and right-wing groups on campus (e.g., Young liberals)

Which Method of Sampling Should I Use?

As a major aim of quantitative research is the ability to generalise results, the best method is usually a probability sampling one.
However, this is often not workable or feasible given resources, time, and the specific target population.
Sampling method used should be fully explained, and caveats about the likely generalisability of results made accordingly so that the reader can review your results in an informed way.

Determining Sample Size

How many participants do I need for my study?

Largely determined by the analysis you plan to conduct on the data
Generally, the more complex the analysis, the larger the sample required
You can statistically estimate the required sample size (power analysis)

Determining Sample Size

Larger sample sizes are needed:

When the sample is heterogeneous
- Composed of widely different kinds of people
When you want to break down the sample into multiple subcategories
- e.g., Look at genders separately
If you want to obtain a narrow or more precise confidence interval
When you expect a small effect or weak relationship
For some statistical techniques

Determining Sample Size - Rules

Five simple rules for determining sample size
- If the population is smaller than 100, use the entire population
- Larger sample sizes make it easier to detect an effect or relationship in the population
- Compare to other research studies in the area by doing a literature review
- Use a power table for a rough estimate
- Use a sample size calculator (e.g., G*Power)
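To see why small effects need larger samples, here is a rough sketch of the calculation a sample size calculator performs. It assumes a two-tailed comparison of two independent group means, with the effect size expressed as a standardised mean difference (Cohen's d, which is not introduced in these notes), and uses the normal approximation. Dedicated tools such as G*Power use the t distribution and give slightly larger numbers. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group (normal approximation)."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value, two-tailed test
    z_beta = z.inv_cdf(power)             # value matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

medium = n_per_group(0.5)   # a medium-sized effect
small = n_per_group(0.2)    # a small effect needs far more participants
```

With these defaults the approximation gives 63 per group for d = 0.5 and 393 per group for d = 0.2.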

Moving Right Along

So we have thought about how to get our sample
And how representative it is going to be, depending on how random our sampling method has been
And we have briefly considered how big our sample should be
Time to start turning our attention to our variables and design

Last Week You Were Introduced To…

Independent vs. dependent variables
Categorical vs continuous variables

Operationalization

Conceptual Definition → Operational Definition

“…A description of the ‘operations’ that will be undertaken in measuring a concept” (Rubin & Babbie, 2008, p. 160).
Specific procedures by which the researcher measures and/or manipulates a variable
Turning abstract concepts into concrete variables that we can measure or manipulate
The more careful and complete the operational definition, the more precise the measurement of the variable

Operationalization

Operationalization of IVs
- How are you going to manipulate it? How might you measure it?
Operationalization of DVs
- How are you going to measure it?

Levels of Measurement

When we want to measure something (e.g., religion, self-esteem, tennis ability), we need to choose a metric with which we can measure it.
The metric will determine the statistical analyses we can perform.

Levels of Measurement

Nominal: Something which is purely categorical information (about the quality or the ‘kind’ of thing).
- e.g., religion: Jewish, Protestant, Catholic, Buddhist, etc.
- Not a quantity, but rather a discrete quality that something can have
Ordinal: A rank order. Ordinal variables do indicate an underlying quantity, but the distances between ranks are not necessarily equal, so you cannot meaningfully add, subtract, multiply, or divide them.
Interval: A true number in the sense that there are equal intervals implied, but no true zero point.
- e.g., temperature in degrees Celsius or Fahrenheit
Ratio: A true number. The distinguishing feature of a ratio scale variable is that it has a meaningful zero point, which indicates that the quantity is completely absent.

Summary of Levels of Measurement

	Categories/ values	Ranks	Equal intervals	Zero point
Nominal	X
Ordinal	X	X
Interval	X	X	X
Ratio	X	X	X	X

Reliability & Validity

Applies mostly to indexes/scales
How do we assess whether our measures/operationalisations are good?
- Are they valid?
- Are they reliable?
The issue is that you can’t assess these until after you have developed your questionnaires and used them
This is why a pilot test can be so beneficial
Many people choose to use established measures rather than develop their own

Validity Types

You can consider the overall validity of a design/piece of research
- We call this internal validity and external validity (future weeks)
You can also consider the validity of variables within a study
We are going to consider this in more detail now…

Validity

Are we measuring what we think we are?
Is our measure credible, is it believable?
Why is validity an issue?
- Many (if not most) variables in social research cannot be directly observed
  - e.g., Motivation, satisfaction, helplessness
The challenge:
- To make a judgment call about whether we are measuring what we think we’re measuring

Types of Validity

Face Validity

Asks the question:
- On the face of it, does my measure seem to relate to the construct?
e.g., On the face of it, which of the following is a more valid measure of worker morale:
- Number of grievances filed with the union, or
- Number of books borrowed by workers during off-duty hours
Measures that lack face validity have the potential to alienate research participants (what are they really trying to measure?)
A weak, subjective method for assessing validity, but a first step

Content Validity

The extent to which the measure represents a balanced and adequate sampling of the relevant dimensions of the construct
Consider what should go into a measure and what should stay out - define the boundaries
How much does the measure cover the content of the definition?
e.g., Which of the following would be a more valid test of mathematical ability?
- A 20-question test containing addition problems
- A 20-question test containing addition, subtraction, multiplication, division, fraction problems

Criterion-Related Validity

Involves checking the performance of your measure against some external criterion
Two types
- Concurrent: does it relate to a known criterion, for example, an alternative (gold standard) measure of the same construct?
- Predictive: does the measure predict/relate to some criterion that you would expect it to predict?

Concurrent Validity

Establish the validity of your measure by comparing it to a “gold standard” (i.e., an existing validated measure of the same construct)

Concurrent Validity Example

The super-duper new IQ test vs. the WAIS (gold standard)
Randomly select a representative sample ( $N = 100$ )
Give the first 50 participants the super-duper test and then the WAIS; give the other 50 the WAIS and then the super-duper test (counterbalancing the order controls for order effects)
Correlate scores on the two tests
- High Pearson’s $r$ = good concurrent validity
- Low Pearson’s $r$ = low concurrent validity
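The "correlate scores on the two tests" step can be sketched as below. The scores are made up for six participants purely to show the calculation (the example above uses $N = 100$), and `pearson_r` is a hand-written helper rather than a function from a statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)   # sum of squares for x
    ss_y = sum((b - mean_y) ** 2 for b in y)   # sum of squares for y
    return cov / sqrt(ss_x * ss_y)

# Made-up scores for the same six participants on both tests
new_test = [95, 102, 110, 88, 120, 105]
wais = [98, 100, 112, 90, 118, 103]
r = pearson_r(new_test, wais)
```

The same calculation applies to test-retest reliability, with time 1 and time 2 scores in place of the two tests.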

Predictive Validity

Does the measure predict something that it’s theoretically supposed to predict?
Does the measure differentiate between people in the way you would expect?
- e.g., people with different mental disorders, elite versus amateur athletes, etc.
What should a measure of the following constructs predict?
- IQ -> perhaps some cognitive-based performance task
- Workplace depression scale -> number of mental health sick days

Predictive Validity Example

Student self-report measure of interviewing skills prior to placement
Placement supervisor & client ratings of interviewing skills at the end of placement
If the self-report measure predicts the later ratings, then it has good predictive validity

Construct Validity

Demonstrating that the measure relates to the theoretical construct of interest
Two types
- Convergent
  - Demonstrating that the measure relates to measures of similar and related constructs
- Divergent
  - Demonstrating the measure does not relate to unrelated constructs

Summary of Validity Types

Type	Description
Face	In the judgment of others, items appear to relate to construct
Content	Captures the entire meaning (all elements of definition) of a construct
Criterion	Agrees with external source
Concurrent	Agrees with pre-existing “gold-standard” measure
Predictive	Agrees with future behavior
Construct	How well multiple indicators relate to each other (consistent with theory)
Convergent	Similar measures (or measures of theoretically related constructs) are related
Divergent	Different measures (or…) are unrelated

Reliability

The consistency or repeatability of your measurement
For example, say I weigh myself on some scales at one point in time and then weigh myself 5 mins later and it says I’m 5 kilos heavier
- My conclusion: the scales are dodgy!
- The scientific conclusion: the scales are an unreliable measurement instrument

Types of Reliability

Stability of the measure (test-retest)
Internal consistency of the measure (split-half, Cronbach’s alpha)
Agreement or consistency across raters (inter-rater)

Test-Retest Reliability

Addresses the stability of your measure
You administer the measure at one point in time (time 1)
You then give the same measure to the same participants at a later point in time (time 2)
You correlate the scores from the two occasions

Problems with Test-Retest

Two main problems
- Memory effect
  - Participants remember and repeat their earlier answers
- Practice effect
  - Performance improves because of practice in test taking
Other consideration: how long should the interval between the two tests be?
- If too short, there’s a greater risk of memory effects
- If too long, there’s a risk of other variables (e.g., additional learning) influencing results

Split-Half Reliability

Administer a battery of questions
Split the measure into two halves
Correlate the scores on the two halves of the measure
Higher correlation means greater reliability
Strength: eliminates memory & practice effects
Limitation: are the two halves equivalent?

Split-Half Reliability Example

Measure of Prejudice toward First Nations Australians
20-item scale
Score on one half of test (10 items) vs. score on other half of test (10 items)
Higher correlation means higher reliability
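A minimal sketch of the split-half calculation. It assumes an odd/even split of the items, which is only one of several possible ways to form the halves, and uses made-up responses from five participants on a six-item scale rather than the 20-item scale above. The helper names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def split_half_r(responses):
    """Correlate each participant's total on odd items with even items."""
    odd_half = [sum(person[0::2]) for person in responses]    # items 1, 3, 5
    even_half = [sum(person[1::2]) for person in responses]   # items 2, 4, 6
    return pearson_r(odd_half, even_half)

# Made-up data: one row per participant, one column per item (rated 1 to 5)
responses = [
    [1, 2, 1, 2, 1, 1],
    [2, 2, 3, 2, 2, 3],
    [3, 3, 3, 4, 3, 3],
    [4, 4, 5, 4, 4, 4],
    [5, 5, 4, 5, 5, 5],
]
r_halves = split_half_r(responses)
```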

Inter-Item Reliability

Assesses the ‘internal consistency’ of your measure
i.e., it tells you how well the items or questions in your measure appear to reflect the same underlying construct
You will get good internal consistency if individuals respond in approximately the same way to questions on your survey
Cronbach’s alpha typically ranges from 0 (when the items are not correlated with one another) to 1.00 (when all items are perfectly correlated with each other). The closer alpha is to 1.00, the better the reliability of the measure
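For reference, Cronbach's alpha can be computed from the item variances and the variance of the total scores. The sketch below uses the standard formula with made-up responses from five participants on a six-item scale; the function name `cronbach_alpha` is hypothetical, and in practice a statistics package would do this for you.

```python
from statistics import variance

def cronbach_alpha(responses):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(responses[0])                                  # number of items
    item_vars = [variance([person[i] for person in responses])
                 for i in range(k)]                        # variance per item
    total_var = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Made-up data: one row per participant, one column per item (rated 1 to 5)
responses = [
    [1, 2, 1, 2, 1, 1],
    [2, 2, 3, 2, 2, 3],
    [3, 3, 3, 4, 3, 3],
    [4, 4, 5, 4, 4, 4],
    [5, 5, 4, 5, 5, 5],
]
alpha = cronbach_alpha(responses)
```

Participants here answer all six items in roughly the same way, so alpha comes out close to 1.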

Inter-Rater or Inter-Observer Reliability

Checking the match between two or more raters or judges, e.g., research investigating the relationship between communication and family functioning
Coding videos for hostile statements requires checking the agreement amongst the coders

Calculation of Inter-Rater Reliability

Nominal or ordinal scale
- The percentage of times different raters agree
Interval or ratio scale
- Correlation coefficient
- Other statistical methods – beyond scope of PYB210.
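The percentage agreement for nominal codes can be sketched as follows, using the hostile-statement coding example. The codes for the ten video segments are made up, and the function name `percent_agreement` is hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters gave the same code."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must code the same set of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)

# Made-up codes from two coders for the same ten video segments
coder_1 = ["hostile", "neutral", "hostile", "neutral", "neutral",
           "hostile", "neutral", "hostile", "neutral", "neutral"]
coder_2 = ["hostile", "neutral", "neutral", "neutral", "neutral",
           "hostile", "neutral", "hostile", "hostile", "neutral"]
agreement = percent_agreement(coder_1, coder_2)
```

The coders disagree on two of the ten segments, so agreement is 80%.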

Interpreting Reliability Coefficients

What kind of reliability coefficients should I be aiming for?
- Test-retest coefficients > .70
- Internal consistency >.70 (but ideally much higher)
- Rating consistency >.90
These are relatively arbitrary but serve as a benchmark

Reliability and Measurement Error

One theme we will come back to later on, when talking about statistics, is that measurement error serves to weaken our statistical tests
All other things being equal, more error in measurement means lower power
Choosing a measure that is highly reliable decreases measurement error and increases the power of your design

The Relationship Between Reliability and Validity

Can a measure be reliable but not valid?
- Yes! You could have a consistent measure that does not actually measure the construct
Can a measure be valid but not reliable?
- Yes.
- Example of a tool that is valid but unreliable: something that is difficult to administer (e.g., skin fold tests, which require technical skill) may be unreliable across multiple administrators.

A Dartboard Analogy

High validity, low reliability
High reliability, low validity

Summary of Reliability Types

Type	Description
Test-retest	Same questionnaire given on two occasions and scores correlated
Split Half	Split questionnaire in half and correlate scores from the two halves
Inter-item reliability	Overall correlation between items in the scale
Inter-rater	Checking for agreement between multiple raters or judges