Lecture 3 Flashcards (1)

Data Transformation and Cleaning in SPSS

Introduction

  • This lecture focuses on transforming raw data into usable knowledge for analysis in SPSS.
  • The goal is to clean and manipulate data obtained from sources like Qualtrics to ensure accurate and reliable results.

Data Cleaning: Garbage In, Garbage Out

  • Principle: Clean data is crucial for generating meaningful insights; otherwise, the analysis will produce unreliable results.
  • Raw Data: Keep the original, untouched data set separate from the working data.
    • One raw data set that you never edit.
    • One working copy that you manipulate and clean up.

Step 1: Removing Unnecessary Data

Excluding Participants Who Declined Participation

  • Remove responses from individuals who indicated they did not want to participate.
  • Process:
    • Go to Data -> Select Cases.
    • Use an IF condition to filter out participants who answered "no" to the participation question.
    • Syntax example: SELECT IF (participant <> 2). (where 2 represents "no"). The condition describes the cases to keep, so it must exclude the "no" code.
    • Execute the syntax to remove these cases.
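
The filtering logic can be checked outside SPSS with a minimal Python sketch. The sample rows and the helper name keep_consenting are hypothetical; only the variable name participant and the code 2 = "no" come from the notes.

```python
# Hypothetical responses; "participant" uses the coding from the notes (2 = "no").
responses = [
    {"id": 1, "participant": 1},
    {"id": 2, "participant": 2},
    {"id": 3, "participant": 1},
]


def keep_consenting(rows, declined_code=2):
    """Return only the rows whose participation answer is not the 'no' code."""
    return [row for row in rows if row["participant"] != declined_code]


# Build a working copy; the raw list is left untouched.
working = keep_consenting(responses)
print([row["id"] for row in working])
```

Note that the raw list is never modified, which mirrors the advice to keep the raw data set separate from the working copy.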

Excluding Invariant Responses (Straight-liners)

  • Identify and remove responses where participants provided the same answer for all questions, indicating a lack of attention.
  • In the lecture example, respondents two and four are invariant: their answers show no variation across questions.
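
One way to flag invariant responses is to test whether a respondent's answers contain only one distinct value. The sketch below uses Python and made-up answers (chosen so that respondents two and four come out invariant); the function name is_straight_liner is hypothetical.

```python
def is_straight_liner(answers):
    """True when every answer has the same value, i.e. there is no variation."""
    return len(set(answers)) == 1


# Made-up answers to five 7-point items, for illustration only.
survey = {
    "respondent_1": [5, 3, 6, 2, 4],
    "respondent_2": [4, 4, 4, 4, 4],
    "respondent_3": [7, 6, 7, 5, 6],
    "respondent_4": [1, 1, 1, 1, 1],
}

flagged = [name for name, answers in survey.items() if is_straight_liner(answers)]
print(flagged)
```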

Excluding Rushers

  • Identify and remove participants who completed the survey too quickly, suggesting they did not engage thoughtfully with the questions.
    • Rushers and straight-liners (invariants) are the two kinds of inattentive respondents to screen out.
  • Process:
    • Create a new variable called "duration" to record the time taken to complete the survey.
    • Convert the duration from seconds to minutes by dividing by 60: Duration_{minutes} = \frac{Duration_{seconds}}{60}.
    • Establish a reasonable minimum time based on the survey's intended length (e.g., at least 5 minutes for a 15-20 minute survey).
    • Remove responses with durations below this threshold.
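
The conversion and the threshold rule can be sketched in Python as follows. The durations and respondent labels are invented; the 5-minute minimum is the example threshold from the notes and should be adjusted to the survey's intended length.

```python
MIN_MINUTES = 5  # example threshold from the notes; adjust per survey


def to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60


# Invented completion times in seconds.
durations_seconds = {"r1": 1020, "r2": 180, "r3": 300, "r4": 95}

# Keep respondents at or above the threshold; those below it are rushers.
kept = {
    rid: to_minutes(secs)
    for rid, secs in durations_seconds.items()
    if to_minutes(secs) >= MIN_MINUTES
}
print(kept)
```

A respondent exactly at the threshold is kept here, because the notes remove only durations below it.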

Step 2: Handling Missing Data

Types of Missing Data

  • There are two types of missing data:
    • Skipped questions: Participants saw the question but chose not to answer (indicated by a specific code, e.g., -99).
    • Unseen questions: Participants did not reach the question due to early dropout or branching (indicated by a blank cell).

Coding Missing Values

  • Define missing value codes in SPSS to ensure they are not included in calculations.
  • Process:
    • Go to Variable View.
    • For each variable with missing data, specify the missing value code (e.g., -99) under the "Missing" column.
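
To see why the missing code has to be declared, the Python sketch below compares a mean that treats -99 as a real answer with one that excludes it. The data are invented, None stands in for a blank cell, and the helper name valid_mean is hypothetical.

```python
SKIPPED = -99  # code for "saw the question but skipped it", as in the notes


def valid_mean(values, missing_codes=(SKIPPED,)):
    """Mean of the values that are neither blank (None) nor a missing code."""
    valid = [v for v in values if v is not None and v not in missing_codes]
    return sum(valid) / len(valid) if valid else None


# Invented answers to one 7-point item: one skipped (-99), one unseen (None).
q1 = [5, -99, 3, None, 4]

seen = [v for v in q1 if v is not None]
naive_mean = sum(seen) / len(seen)  # wrongly counts -99 as an answer
clean_mean = valid_mean(q1)         # uses only 5, 3 and 4

print(naive_mean, clean_mean)
```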

Step 3: Creating a Respondent ID Variable

  • Generate a unique identification number for each respondent.
  • Process:
    • Compute a new variable named respondent_ID from the $CASENUM system variable, which holds the case number.
    • Syntax: COMPUTE respondent_ID = $CASENUM.
    • To place the new variable at the beginning of the data set, go to Variable View, insert a new variable where you want it, and then run the COMPUTE command.
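
As a rough analogue outside SPSS, the same idea is to number the rows from 1 and put that number in the first column. The rows below are invented; only the name respondent_ID comes from the notes.

```python
# Invented rows without an identifier.
rows = [{"Q1": 5}, {"Q1": 3}, {"Q1": 6}]

# Number the rows starting at 1 and place respondent_ID as the first column.
with_ids = [
    {"respondent_ID": number, **row}
    for number, row in enumerate(rows, start=1)
]
print(with_ids)
```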

Step 4: Reverse Coding

  • Reverse code items whose scale is inverted so that all items run in the same direction. Inverted items are included to check that participants are paying attention.
  • Process:
    • Use the Recode into Different Variables function.
    • For each reverse-coded item, assign new values such that:
      • 1 becomes 7
      • 2 becomes 6
      • 3 becomes 5
      • 4 stays 4
      • 5 becomes 3
      • 6 becomes 2
      • 7 becomes 1
    • Syntax example: RECODE Q9_1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO Q9_1_reversed.
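
The seven value pairs above follow one rule: the reversed value equals the scale minimum plus the scale maximum, minus the original value (8 minus the answer on a 1-7 scale). A short Python sketch, with the hypothetical helper reverse_code, confirms that the rule reproduces the mapping.

```python
def reverse_code(value, low=1, high=7):
    """Reverse a response on a low..high scale: low maps to high and vice versa."""
    return low + high - value


# Rebuild the full recode table for a 7-point scale.
mapping = {value: reverse_code(value) for value in range(1, 8)}
print(mapping)
```

The same function covers other scale lengths, for example reverse_code(2, low=1, high=5) for a 5-point item.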

Step 5: Standardizing Variables / Mean Centering

Mean Centering

  • Adjust the scale by subtracting a central value from each response. The lecture example subtracts the scale midpoint (e.g., 4 for a 7-point scale); strict mean centering subtracts the variable's sample mean instead.
  • Helps balance responses around a neutral point.
  • Process:
    • Compute a new variable by subtracting the centering value from the original variable.
    • Syntax example: COMPUTE Q9_2_centered = Q9_2 - 4. (for a scale of 1-7).
    • NewValue = OldValue - Mean (or OldValue - Midpoint, as in the example above)
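
The Python sketch below contrasts centering on the scale midpoint (the value 4 used in the syntax example) with centering on the sample mean. The responses are invented, and the two results differ whenever the sample mean is not equal to the midpoint.

```python
from statistics import mean

MIDPOINT = 4  # midpoint of a 1-7 scale, as in the syntax example

# Invented responses to item Q9_2.
q9_2 = [2, 4, 5, 7, 7]

# Centering on the scale midpoint: 0 means a neutral answer.
midpoint_centered = [x - MIDPOINT for x in q9_2]

# Centering on the sample mean: 0 means an average answer in this sample.
sample_mean = mean(q9_2)
mean_centered = [x - sample_mean for x in q9_2]

print(midpoint_centered, mean_centered)
```

Only the mean-centered version is guaranteed to sum to zero.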

Standardizing (Z-Scores)

  • Transform variables to a standard scale with a mean of 0 and a standard deviation of 1 using the formula: z = \frac{x - \mu}{\sigma}, where x is the observed value, \mu is the mean, and \sigma is the standard deviation.
  • Enables direct comparison between variables with different scales (e.g., age and income).
  • If some items use a 1-5 scale and others a 1-7 scale, standardize them before comparing.
  • For example, to compare age with income: income is a large number with a wide range, while age is a smaller number with a tighter range. Standardizing puts both on the same scale.
  • If a 65-year-old is one standard deviation above the mean age, and a person whose income is 120,000 is one standard deviation above the mean income, the two can be compared directly.
  • Process:
    • Use descriptive statistics to get the mean and the standard deviation.
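
The z-score formula can be tried out with a small Python sketch. The ages and incomes are invented so that 65 and 120,000 each fall one standard deviation above their mean, matching the example in the notes. The sketch assumes the sample standard deviation (n - 1 in the denominator), and the helper name z_scores is hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - mu) / sigma for x in values]


# Invented data on two very different scales.
ages = [35, 50, 65]
incomes = [80000, 100000, 120000]

age_z = z_scores(ages)
income_z = z_scores(incomes)
print(age_z, income_z)
```

After standardizing, the last age and the last income have the same z-score, so they occupy the same relative position in their distributions even though the raw numbers are far apart.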