Overview of Class Session on Data Analysis
The class focuses on data analysis using the 2018 General Social Survey (GSS) dataset.
Key objectives include:
Finding the correlation coefficient between variables.
Creating simple scatter plots.
Introducing Ordinary Least Squares (OLS) regression.
Setting Up the Data in Stata
The first step is to open a do-file and prepare Stata for analysis:
Clear any existing data from memory:
clear
Set the working directory to the location of the GSS file (the cd command does this).
Open the GSS dataset:
use <path_to_gss_file>
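A minimal sketch of these setup steps as they might appear at the top of a do-file. Stata's bundled auto dataset (loaded with sysuse) stands in for the GSS file so the lines run anywhere; the folder in the commented-out cd line is hypothetical.

```stata
* Top of a do-file: start clean, then load data.
clear                         // drop any dataset currently in memory

* cd "C:/my_course/data"      // hypothetical folder; edit and uncomment
* use <path_to_gss_file>      // the real GSS file would be opened here

sysuse auto                   // stand-in: example dataset shipped with Stata
describe, short               // quick check of observations and variables
```

Running describe right after loading is a cheap way to confirm that the intended file is the one in memory.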
Correlation Analysis
1. Finding the Correlation Coefficient
The command to find the correlation coefficient is correlate, or its shorthand corr.
Example of finding the correlation between education (educ) and respondent's income (conrank):
Command:
correlate educ conrank
This outputs a correlation matrix.
Interpretation: The correlation coefficient ranges from -1 to 1.
Values closer to 0 indicate weaker relationships.
Values closer to 1 or -1 indicate stronger relationships.
Example observed value: 0.33 suggests a moderate positive correlation between education and income.
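As a hedged illustration of reading the coefficient back from Stata, the sketch below uses the bundled auto dataset rather than the GSS, so price and mpg are stand-ins for the class variables. After correlate runs with two variables, the coefficient is available as r(rho) and the sample size as r(N).

```stata
sysuse auto, clear            // stand-in data, not the GSS
correlate price mpg           // prints a 2 x 2 correlation matrix
display "r = " %6.3f r(rho)   // coefficient saved by correlate
display "N = " r(N)           // observations used
```

In this stand-in data the coefficient is negative: cars with better mileage tend to cost less.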
2. Correlation Between More Variables
The correlation coefficient can also be calculated for multiple variables simultaneously.
Example for education, income, age, and parents' education:
Command:
correlate educ conrank age pa_educ ma_educ
This outputs a correlation matrix displaying the pairwise relationships:
Education & Income: ~0.33
Education & Age: ~0.039 (weak correlation).
Father's education & Mother's education: ~0.677 (strong positive correlation).
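One detail matters with several variables: correlate keeps only observations that are complete on every listed variable, while pwcorr uses all available observations for each pair. The sketch below shows the difference on the bundled auto dataset (a stand-in, not the GSS), where rep78 has missing values; n_listwise is a scalar name chosen for illustration.

```stata
sysuse auto, clear                       // stand-in data, not the GSS
correlate price mpg weight rep78         // listwise: rows with any missing value dropped
scalar n_listwise = r(N)
display "listwise N = " n_listwise
pwcorr price mpg weight rep78, obs       // pairwise: each cell uses all available rows
```

When the two approaches give noticeably different coefficients, missing data is shaping the result and deserves a closer look.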
3. Spearman's Rank Order Correlation Coefficient
If one or both variables are ordinal, use Spearman's rank correlation:
Command:
spearman <variables>
Example of Spearman with education:
Output: around 0.41 (interpreted the same way as the coefficient from correlate).
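A small sketch of spearman on an ordinal variable, again using the bundled auto dataset as a stand-in for the GSS. Here rep78 is a repair rating coded 1 to 5, which plays the role of the ordinal variable.

```stata
sysuse auto, clear            // stand-in data, not the GSS
spearman rep78 mpg            // rep78 is a 1-5 ordinal repair rating
display "rho = " %6.3f r(rho) "   p = " %6.4f r(p)
```

Because the coefficient is computed on ranks, it only assumes that the categories are ordered, not that they are equally spaced.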
Basic Scatter Plot Creation
1. Creating a Scatter Plot
Scatter plots are used to visualize relationships:
Command:
scatter <y_variable> <x_variable>
Example: a scatter plot of work hours (hours) against age.
Interpretation of the resulting plot: no clear association; the points form a shapeless blob.
Adding a fitted line:
Command:
scatter hours age || lfit hours age
The fitted line shows a negative relationship: as age increases, work hours tend to decrease.
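The overlay can also be written with parentheses instead of ||, which some find easier to read once options are added. This is a sketch on the bundled auto dataset (a stand-in, not the GSS); the graph name fitplot is an arbitrary label chosen for illustration.

```stata
sysuse auto, clear                          // stand-in data, not the GSS
* Same kind of graph as: scatter mpg weight || lfit mpg weight
twoway (scatter mpg weight) (lfit mpg weight), ///
    name(fitplot, replace) legend(off)
```

Naming the graph keeps it in memory under that name, so a later graph command does not overwrite it.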
Introduction to OLS Regression
1. Basic OLS Regression Command
To regress respondent income on education:
Command:
regress conrank educ
Output includes:
The R-squared value, which is the proportion of variance in the dependent variable explained by the model.
The coefficient on educ: each additional year of education is associated with an increase of approximately $4,350 in income.
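With a single predictor, R-squared is simply the square of the correlation coefficient, which ties this section back to correlate. The sketch below checks that on the bundled auto dataset (a stand-in, not the GSS); rho is a scalar name chosen for illustration.

```stata
sysuse auto, clear                 // stand-in data, not the GSS
quietly correlate price mpg
scalar rho = r(rho)                // keep r for comparison after the regression
regress price mpg                  // dependent variable first, then predictor
display "slope     = " _b[mpg]     // change in price per extra mpg
display "R-squared = " e(r2)
display "r squared = " rho^2       // equals R-squared with one predictor
```

The slope is read the same way as the education coefficient in the notes: the expected change in the dependent variable for a one-unit increase in the predictor.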
2. Regression with Age
Repeat the regression with age as the predictor:
Command:
regress conrank age
Outcome: each additional year of age corresponds to an increase of approximately $512 in income. A coefficient is conventionally treated as statistically significant if its p-value is below 0.05.
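To make the p-value less of a black box, the sketch below rebuilds it from the coefficient and its standard error after a regression on the bundled auto dataset (a stand-in, not the GSS). The scalar names t_stat and p_val are illustrative; they are deliberately not single letters, because Stata would read a short name like p as an abbreviation of the variable price.

```stata
sysuse auto, clear                               // stand-in data, not the GSS
regress price weight
scalar t_stat = _b[weight] / _se[weight]         // t statistic for the slope
scalar p_val  = 2 * ttail(e(df_r), abs(t_stat))  // two-sided p-value
display "t = " %6.2f t_stat "   p = " %8.6f p_val
display cond(p_val < 0.05, "significant at the 5% level", "not significant at the 5% level")
```

The values should match the t and P>|t| columns that regress prints for weight.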
3. Categorical Variables in Regression
For nominal or ordinal independent variables, use the i. prefix:
Example command:
regress conrank i.degree
This creates dummy variables, with the reference group set to the lowest value of the variable.
Each coefficient in the output is interpreted as a comparison between that category and the reference group.
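A sketch of the i. prefix on the bundled auto dataset (a stand-in, not the GSS), where rep78 has five categories. The baselevels display option adds a row for the reference category, which makes it easier to see which group the other coefficients are compared against.

```stata
sysuse auto, clear                 // stand-in data, not the GSS
tabulate rep78                     // five categories, coded 1 to 5
regress price i.rep78, baselevels  // category 1 (lowest value) is the reference
```

Five categories produce four coefficients, one for each category other than the reference.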
Changing Reference Groups in Regression
To specify a different reference group:
Use the ib<TYPE>. prefix (where <TYPE> is the value to use as the reference).
Example: to set high school as the reference group in the regression, use:
Command:
regress conrank ib1.degree
(where 1 corresponds to high school in the coding).
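Changing the reference group changes which comparisons the coefficients report, not how well the model fits. The sketch below checks this on the bundled auto dataset (a stand-in, not the GSS) by comparing R-squared under two base categories; r2_default is a scalar name chosen for illustration.

```stata
sysuse auto, clear                   // stand-in data, not the GSS
quietly regress price i.rep78        // default reference: lowest value (1)
scalar r2_default = e(r2)
regress price ib3.rep78, baselevels  // reference switched to category 3
display "same R-squared: " (reldif(e(r2), r2_default) < 1e-8)
```

The final line displays 1 when the two R-squared values agree.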
Adding More Variables to the Regression Model
To consider multiple factors at once:
Command:
regress conrank age i.degree hours1
Interpretation covers how age, education (degree), and hours worked per week (hours1) are each associated with income, holding the other variables constant.
Example outputs:
Each additional year of age is associated with an increase of approximately $500 in income.
Each additional hour worked per week is associated with an increase of around $700 in income, and the coefficient is statistically significant.
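A sketch of a model that mixes continuous and categorical predictors, using the bundled auto dataset as a stand-in for the GSS. Individual coefficients can be pulled out with _b[], including the coefficient for one level of a factor variable.

```stata
sysuse auto, clear                    // stand-in data, not the GSS
regress price mpg weight i.foreign    // two continuous predictors plus one categorical
display "weight slope, holding mpg and foreign fixed: " _b[weight]
display "foreign vs domestic gap, same controls:      " _b[1.foreign]
```

Each coefficient describes the association with the dependent variable after accounting for the other predictors in the model.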
Conclusion
In summary, the regression workflow is: set up the data, run the regression with the dependent variable listed first and the independent variables after it, and interpret the results, reading coefficients on categorical variables relative to their reference group.
Future sessions will cover advanced regression techniques and better data visualization practices.