Overview of Class Session on Data Analysis

  • The class focuses on data analysis using the 2018 General Social Survey (GSS) dataset.

  • Key objectives include:

    • Finding the correlation coefficient between variables.

    • Creating simple scatter plots.

    • Introducing Ordinary Least Squares (OLS) regression.

Setting Up the Data in Stata

  • The first step involves opening a do-file and preparing Stata for analysis:

    • Command to clear any existing data in memory: clear.

    • Set working directory to the location of the GSS file.

    • Open the GSS dataset using: use <path_to_gss_file>.
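
The setup steps above might look like this at the top of a do-file. This is a minimal sketch: the folder and file name are placeholders, not the actual paths used in class.

```stata
* Start from an empty workspace
clear

* Point Stata at the folder that holds the GSS file (placeholder path)
cd "<path_to_gss_folder>"

* Load the dataset (placeholder file name)
use "<gss_file>.dta"

* Quick check that the data loaded as expected
describe, short
```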

Correlation Analysis

1. Finding the Correlation Coefficient

  • The command to find the correlation coefficient is correlate, or its shorthand corr.

  • Example of finding correlation between education (educ) and respondent's income (conrinc):

    • Command: correlate educ conrinc

    • Outputs a correlation matrix.

    • Interpretation: The correlation coefficient ranges from -1 to 1.

    • Values closer to 0 indicate weaker relationships.

    • Values closer to 1 or -1 indicate stronger relationships.

    • Example observed value: 0.33 suggests a moderate positive correlation between education and income.
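
As a small sketch of the two-variable case, the coefficient can also be read back from Stata's stored results after the command runs. The variable names educ and conrinc are assumed to match the GSS codebook; adjust them if your file names them differently.

```stata
* Pearson correlation between years of education and respondent's income
* (variable names assumed from the GSS codebook)
correlate educ conrinc

* correlate leaves the coefficient and the sample size in r()
display "r = " %5.3f r(rho) "   N = " r(N)

* Squaring r gives the share of variance the two variables have in common
display "r-squared = " %5.3f (r(rho)^2)
```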

2. Correlation Between More Variables

  • The correlation coefficient can also be calculated for multiple variables simultaneously.

  • Example for education, income, age, and parents' education:

    • Command: correlate educ conrinc age paeduc maeduc

    • Outputs a correlation matrix displaying relationships:

      • Education & Income: ~0.33

      • Education & Age: ~0.039 (weak correlation).

      • Father's education & Mother's education: ~0.677 (strong positive correlation).
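
One detail worth keeping in mind with several variables: correlate drops any respondent who is missing a value on any listed variable. The sketch below contrasts it with pwcorr, which is not mentioned in the session notes and is shown here only as a related built-in command. Variable names are assumed from the GSS codebook.

```stata
* Correlation matrix; a case missing on ANY listed variable is excluded
correlate educ conrinc age paeduc maeduc

* Pairwise version: each coefficient uses every case available for that pair
* obs prints the number of cases, sig prints the p-value under each coefficient
pwcorr educ conrinc age paeduc maeduc, obs sig
```

Because the two commands can use different samples, their coefficients may differ slightly for the same pair of variables.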

3. Spearman's Rank Order Correlation Coefficient

  • If one or both variables are ordinal, use Spearman's rank-order correlation instead:

    • Command: spearman <variables>.

  • Example of Spearman with education:

    • Output: around 0.41, interpreted on the same -1 to 1 scale as the correlation coefficient.
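
A brief sketch of the Spearman command follows. The notes do not record which variable was paired with education, so degree (highest degree earned, an ordinal variable) is used purely as an illustrative assumption, and the result need not match the 0.41 reported above.

```stata
* Spearman's rho between years of education and highest degree
* (degree is an assumed, illustrative choice of ordinal variable)
spearman educ degree

* stats() controls what is printed: rho, number of observations, p-value
spearman educ degree, stats(rho obs p)
```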

Basic Scatter Plot Creation

1. Creating a Scatter Plot

  • To visualize relationships, scatter plots are used:

    • Command: scatter <y_variable> <x_variable>.

  • Example: scatter plot between age and work hours.

    • Interpretation of the resulting plot: no clear association is visible; the points form a shapeless blob.

  • Adding a fitted line:

    • Command: scatter hrs1 age || lfit hrs1 age

    • The fitted line slopes downward, showing a negative relationship: as age increases, work hours tend to decrease.
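
The plotting commands above can be sketched as follows. The variable name hrs1 (hours worked last week) is assumed from the GSS codebook, and the axis titles are illustrative.

```stata
* Raw scatter plot: y variable first, then x variable
scatter hrs1 age

* Same plot with a linear fit overlaid; || separates the two layers
scatter hrs1 age || lfit hrs1 age

* Equivalent twoway syntax with smaller markers and axis titles added
twoway (scatter hrs1 age, msize(small)) (lfit hrs1 age), ///
    ytitle("Hours worked last week") xtitle("Age") legend(off)
```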

Introduction to OLS Regression

1. Basic OLS Regression Command

  • To regress respondent income on education:

    • Command: regress conrinc educ

    • Output includes:

      • R-squared: the proportion of variance in income explained by the model.

      • Coefficient on educ: each additional year of education is associated with an increase of approximately $4,350 in income.

2. Regression with Age

  • Repeat regression with age:

    • Command: regress conrinc age

    • Outcome: Each additional year of age is associated with an increase of approximately $512 in income. A coefficient is conventionally treated as statistically significant when its p-value is below 0.05.
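
To make the coefficient interpretation concrete, the sketch below reruns the education regression and uses the stored estimates to compute predicted income at two education levels. Variable names are assumed from the GSS codebook, and the 12- and 16-year values are arbitrary illustrations.

```stata
* Simple OLS: dependent variable first, then the predictor
regress conrinc educ

* The slope and R-squared remain available after estimation
display "slope on educ = " %9.2f _b[educ]
display "R-squared     = " %5.3f e(r2)

* Predicted income from the fitted line: intercept + slope * years
display "predicted at 12 years = " %9.0f (_b[_cons] + 12*_b[educ])
display "predicted at 16 years = " %9.0f (_b[_cons] + 16*_b[educ])
```

The gap between the two predictions is four times the slope, which is what "each additional year is associated with an increase of b" means in practice.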

3. Categorical Variables in Regression

  • For nominal or ordinal independent variables, use the i. prefix:

    • Example Command: regress conrinc i.degree

    • This creates a dummy variable for each category except the reference group, which defaults to the lowest-valued category.

    • Each coefficient is interpreted as the difference in expected income between that category and the reference group.

Changing Reference Groups in Regression

  • To specify different reference groups:

    • Use the ib<value>. prefix in place of i. (where <value> is the coded value of the category to use as the reference).

    • Example: To set high school as the reference group in the regression, use:

      • Command: regress conrinc ib1.degree (where 1 corresponds to high school in the coding).
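
A short sketch of the reference-group workflow is shown below. Checking the numeric coding first is a suggested habit rather than a step from the session, and the variable names are assumed from the GSS codebook.

```stata
* See the category labels, then the underlying numeric codes
tabulate degree
tabulate degree, nolabel

* Default: the lowest-valued category is the reference group
regress conrinc i.degree

* Use the category coded 1 as the reference group instead
regress conrinc ib1.degree
```

Both regressions fit the same model; only the comparison category changes, so the coefficients shift while R-squared stays the same.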

Adding More Variables to the Regression Model

  • To consider multiple factors:

    • Command: regress conrinc age i.degree hrs1

    • Interpretation: each coefficient shows how age, education (degree), or hours worked per week is associated with income, holding the other variables constant.

    • Example Outputs:

      • Each additional year of age is associated with an increase of approximately $500 in income.

      • Each additional hour worked per week is associated with an increase of around $700 in income, and the coefficient is statistically significant.
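
The multiple regression can be sketched as below, with a few follow-up commands for reading the results. The testparm step is an addition not covered in the notes, and the variable names are assumed from the GSS codebook.

```stata
* Multiple regression: income on age, degree categories, and weekly hours
regress conrinc age i.degree hrs1

* Individual coefficients by name after estimation
display "age:  " %9.2f _b[age]
display "hrs1: " %9.2f _b[hrs1]

* Number of cases actually used (cases missing on any variable are dropped)
display "N = " e(N)

* Joint test that all degree dummies are zero (not covered in the notes)
testparm i.degree
```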

Conclusion

  • To summarize the regression workflow: set up the data, run the regression with the dependent variable listed first and the independent variables after it, and interpret categorical coefficients relative to their reference group.

  • Future sessions will cover advanced regression techniques and better data visualization practices.