Lecture 3 Flashcards (1)
Introduction
- This lecture focuses on transforming raw data into usable knowledge for analysis in SPSS.
- The goal is to clean and manipulate data obtained from sources like Qualtrics to ensure accurate and reliable results.
Data Cleaning: Garbage In, Garbage Out
- Principle: Clean data is crucial for generating meaningful insights; otherwise, the analysis will produce unreliable results.
- Raw Data: Keep the original, untouched data set separate from the working data.
- One raw data set that you never touch.
- A second, working copy that you can manipulate and clean up.
Step 1: Removing Unnecessary Data
Excluding Participants Who Declined Participation
- Remove responses from individuals who indicated they did not want to participate.
- Process:
- Go to Data -> Select Cases.
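- Use an IF condition that keeps everyone except the participants who answered "no" to the participation question.
- Syntax example:
SELECT IF (participant <> 2).
(where 2 represents "no"; the condition describes the cases to keep, so it must exclude the value 2)
- Execute the syntax to remove the cases that fail the condition.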
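As a rough illustration outside SPSS, the same filter can be sketched in Python. The field name `participant` and the coding 2 = "no" come from the example above; the coding 1 = "yes", the `case` field, and the sample rows are assumptions made up for the sketch.

```python
# Hypothetical survey rows; participant: 1 = "yes" (assumed), 2 = "no".
responses = [
    {"case": 1, "participant": 1},
    {"case": 2, "participant": 2},
    {"case": 3, "participant": 1},
]

# Keep only the cases that satisfy the condition,
# i.e. everyone who did not answer "no".
consented = [row for row in responses if row["participant"] != 2]

kept_cases = [row["case"] for row in consented]
print(kept_cases)
```

Note that the condition is written in terms of who stays, not who leaves, which mirrors how a select-style filter behaves.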
Excluding Invariant Responses (Straight-liners)
- Identify and remove responses where participants provided the same answer for all questions, indicating a lack of attention.
- In the lecture example, respondents 2 and 4 are invariant because their answers show no variation across questions.
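One simple way to flag straight-liners is to check whether a respondent's answers contain only one distinct value. The sketch below assumes made-up ratings on a 1-7 scale; the helper `is_straight_liner` is a hypothetical name, not an SPSS function.

```python
def is_straight_liner(answers):
    """Return True when all answers are identical (needs at least one answer)."""
    return len(set(answers)) == 1

# Made-up responses keyed by respondent number.
ratings = {
    1: [3, 5, 4, 6],
    2: [4, 4, 4, 4],
    3: [7, 2, 5, 1],
    4: [1, 1, 1, 1],
}

# Flag respondents with no variation, then keep everyone else.
invariant_ids = [rid for rid, answers in ratings.items() if is_straight_liner(answers)]
kept = {rid: answers for rid, answers in ratings.items() if rid not in invariant_ids}

print(invariant_ids)
```

An equivalent check is that the standard deviation of a respondent's answers equals zero.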
Excluding Rushers
- Identify and remove participants who completed the survey too quickly, suggesting they did not engage thoughtfully with the questions.
- Together with straight-liners (invariants), rushers are the two kinds of inattentive respondents to remove.
- Process:
- Create a new variable called "duration" to record the time taken to complete the survey.
- Convert the duration from seconds to minutes by dividing by 60: Duration(minutes) = \frac{Duration(seconds)}{60}.
- Establish a reasonable minimum time based on the survey's intended length (e.g., at least 5 minutes for a 15-20 minute survey).
- Remove responses with durations below this threshold.
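The conversion and threshold steps above can be sketched as follows. The 5-minute minimum is the example threshold from these notes; the completion times are made up for illustration.

```python
MIN_MINUTES = 5  # example threshold from the notes for a 15-20 minute survey

# Made-up completion times in seconds, keyed by respondent number.
duration_seconds = {1: 960, 2: 180, 3: 1140, 4: 299}

# Convert seconds to minutes by dividing by 60.
duration_minutes = {rid: secs / 60 for rid, secs in duration_seconds.items()}

# Keep respondents whose duration is at or above the minimum.
not_rushers = [rid for rid, mins in duration_minutes.items() if mins >= MIN_MINUTES]

print(not_rushers)
```

Respondent 4 finishes in just under 5 minutes (299 seconds) and is therefore dropped, which shows why the threshold should be chosen deliberately rather than by habit.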
Step 2: Handling Missing Data
Types of Missing Data
- Skipped questions: Participants saw the question but chose not to answer (indicated by a specific code, e.g., -99).
- Unseen questions: Participants did not reach the question due to early dropout or branching (indicated by a blank cell).
- In short, there are two types of missing data:
- The participant saw the question but skipped it.
- The participant never saw the question.
Coding Missing Values
- Define missing value codes in SPSS to ensure they are not included in calculations.
- Process:
- Go to Variable View.
- For each variable with missing data, specify the missing value code (e.g., -99) under the "Missing" column.
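To see why the missing value code has to be declared, the sketch below compares a mean that wrongly treats -99 as a real answer with one that excludes it. The data are made up, and `None` stands in for a blank (unseen) cell as an assumption of this example.

```python
from statistics import mean

SKIPPED = -99  # example code from the notes for "saw the question but skipped it"

# Made-up answers: -99 marks a skipped question, None marks an unseen one.
q1 = [5, -99, 3, None, 4]

# Exclude both kinds of missing values before computing anything.
valid = [x for x in q1 if x is not None and x != SKIPPED]

naive_mean = mean(x for x in q1 if x is not None)  # wrongly counts -99 as an answer
clean_mean = mean(valid)

print(naive_mean, clean_mean)
```

The naive mean is pulled far below the scale's range by a single -99, which is the kind of distortion that declaring the code as missing is meant to prevent.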
Step 3: Creating a Respondent ID Variable
- Generate a unique identification number for each respondent.
- Process:
- Compute a new variable named "respondent_ID" using the $CASENUM system variable, which represents the case number.
- Syntax:
COMPUTE respondent_ID = $CASENUM.
- To place the new variable at the beginning of the data set, go to Variable View, insert a new variable where you want it, then run the compute command.
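The idea of numbering cases in file order can be sketched outside SPSS like this. The rows are made up, and `respondent_ID` simply mirrors the variable name used above.

```python
# Made-up data rows in file order.
rows = [{"Q1": 5}, {"Q1": 3}, {"Q1": 6}]

# Number cases from 1 in file order and put the ID first in each row.
with_ids = [{"respondent_ID": n, **row} for n, row in enumerate(rows, start=1)]

print(with_ids[0])
```

Because the ID is based on the current row order, it is safest to assign it once and store it, so that later sorting or filtering does not change which ID belongs to which respondent.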
Step 4: Reverse Coding
- Reverse code items whose scale is inverted so that all items point in the same direction before analysis.
- Reverse-worded items are included in the survey to check that participants are paying attention.
- Process:
- Use the Recode into Different Variables function.
- For each reverse-coded item, assign new values such that:
- 1 becomes 7
- 2 becomes 6
- 3 becomes 5
- 4 stays 4
- 5 becomes 3
- 6 becomes 2
- 7 becomes 1
- Syntax example:
RECODE Q9_1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO Q9_1_reversed.
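For a 1-7 scale, the mapping above is the same as computing 8 minus the original value, or more generally (low + high) minus the value. A minimal sketch, where `reverse_code` is a hypothetical helper name:

```python
def reverse_code(value, low=1, high=7):
    """Mirror a response on a low..high scale: low + high - value."""
    return low + high - value

original = [1, 2, 3, 4, 5, 6, 7]
reversed_scores = [reverse_code(v) for v in original]

print(reversed_scores)  # [7, 6, 5, 4, 3, 2, 1]
```

The arithmetic form is convenient when many items share the same scale, while the explicit value-by-value mapping is easier to audit.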
Step 5: Standardizing Variables / Mean Centering
Mean Centering
- Adjust the scale by subtracting the midpoint value (e.g., 4 for a 7-point scale) from each response.
- Helps balance responses around a neutral point.
- Process:
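- Compute a new variable by subtracting the centering value from the original variable.
- Syntax example:
COMPUTE Q9_2_centered = Q9_2 - 4.
(for a scale of 1-7)
- NewValue = OldValue - CenteringValue. Here the centering value is the scale midpoint (4); to center on the sample mean instead, subtract the observed mean of the variable.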
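The sketch below contrasts centering on the scale midpoint, as in the example above, with centering on the observed sample mean. The responses are made up for illustration.

```python
from statistics import mean

MIDPOINT = 4  # midpoint of a 1-7 scale

q9_2 = [2, 4, 7, 5]  # made-up responses

# Centering on the scale midpoint: 4 becomes 0, values below it turn negative.
midpoint_centered = [x - MIDPOINT for x in q9_2]

# Centering on the sample mean: the centered values sum to zero.
sample_mean = mean(q9_2)
mean_centered = [x - sample_mean for x in q9_2]

print(midpoint_centered, mean_centered)
```

The two versions differ only in which constant is subtracted, so they give the same result only when the sample mean happens to equal the midpoint.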
Standardizing (Z-Scores)
- Transform variables to a standard scale with a mean of 0 and a standard deviation of 1 using the formula: z = \frac{x - \mu}{\sigma}, where x is the observed value, \mu is the mean, and \sigma is the standard deviation.
- Enables direct comparison between variables with different scales (e.g., age and income).
- If some items use a 1-5 scale and others a 1-7 scale, standardize them before comparing.
- Likewise, suppose you want to compare age to income: income is a large number with a wide range, while age is a smaller number with a tighter range. Standardizing puts both on the same scale.
- For example, if a person aged 65 is one standard deviation above the mean age and a person whose income is 120,000 is one standard deviation above the mean income, the two can be compared directly.
- Process:
- Use descriptive statistics to get the mean and the standard deviation.
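The z-score formula above can be sketched as follows. The ages and incomes are made up, and the sketch assumes the sample standard deviation (n - 1 denominator); `z_scores` is a hypothetical helper name.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in values]

# Made-up data on very different scales.
ages = [25, 35, 45, 55, 65]
incomes = [40000, 60000, 80000, 100000, 120000]

z_age = z_scores(ages)
z_income = z_scores(incomes)

# The oldest person and the highest earner sit equally far above their means.
print(round(z_age[-1], 3), round(z_income[-1], 3))
```

After standardizing, each variable has a mean of 0 and a standard deviation of 1, so positions on age and income can be compared directly even though the raw units differ.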