Lurking Variables: We can draw causal conclusions only if no lurking variables are present, which highlights the importance of a well-designed experiment.
Random Allocation: Randomly assign subjects to different treatment groups to ensure unbiased results before drawing any causation conclusions.
Scatterplot: Utilize scatterplots to explore relationships between two numerical variables. Visual inspection can indicate linear associations that merit further analysis.
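A minimal sketch of this step in Python (the hours/scores data below are invented for illustration; matplotlib is assumed to be available):

```python
# Plot two numerical variables to visually check for a linear association.
import matplotlib.pyplot as plt

hours = [1, 2, 3, 4, 5, 6, 7, 8]           # explanatory variable (x), made up
scores = [55, 60, 62, 68, 70, 75, 78, 82]  # response variable (y), made up

plt.scatter(hours, scores)
plt.xlabel("Hours studied (x)")
plt.ylabel("Exam score (y)")
plt.title("Scatterplot: checking for a linear association")
plt.show()
```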
Linear Regression: If a linear relationship is detected, we can quantify this association by fitting a linear regression model.
Determining the Best Fit: When comparing candidate lines (e.g., a solid versus a dotted line on a plot), the better fit is the one that minimizes the vertical distances between observed and predicted values.
Linear Equation: The basic linear equation is expressed as:
( Y = mx + b )
- ( x ): Explanatory variable
- ( Y ): Response variable
- ( m ): Slope
- ( b ): Y-intercept
Adjusted Symbols: In advanced statistics, we use alternative symbols:
- ( b_0 ): Y-intercept (formerly ( b ))
- ( b_1 ): Slope (formerly ( m ))
Multiple Variables: Incorporate additional explanatory variables by expanding the model (e.g., ( Y = b_0 + b_1x_1 + b_2x_2 + \dots )).
Equation of the Line: ( Y = b_0 + b_1x )
- Y-intercept at ( (0, b_0) )
- Change in ( Y ): ( b_1 ) indicates how ( Y ) changes with each unit increase in ( x ).
Positive vs. Negative Slope:
- ( +b_1 ): For a unit increase in ( x ), ( Y ) increases by ( b_1 )
- ( -b_1 ): For a unit increase in ( x ), ( Y ) decreases by ( b_1 )
Observed vs. Predicted Values: For each observed data point, the deviation, or residual, is calculated (a computational sketch follows the list of deviation types below):
Residual = Observed y - Predicted y
Deviation Types:
Positive Deviation: When the observed y value is above the predicted line.
Negative Deviation: When the observed y value is below the predicted line.
Zero Deviation: When the observed y value falls exactly on the predicted line.
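A small computational sketch of these three residual types, using the illustrative line ( Y = 19 + 0.7x ) that appears later in these notes (the observed points are made up):

```python
# residual = observed y - predicted y; its sign tells us whether the point
# lies above (+), below (-), or exactly on (0) the fitted line.
def predict(x, b0=19.0, b1=0.7):
    return b0 + b1 * x

observations = [(10, 27.5), (20, 33.0), (30, 39.0)]  # (x, observed y), invented

for x, y in observations:
    residual = y - predict(x)
    kind = "positive" if residual > 0 else "negative" if residual < 0 else "zero"
    print(f"x={x}: observed={y}, predicted={predict(x):.1f}, "
          f"residual={residual:+.1f} ({kind} deviation)")
```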
Minimizing Deviation: The best-fit line minimizes the sum of squared residuals; the result is the least-squares regression line.
Formulas for Slope and Intercept:
Slope ( b_1 ): ( b_1 = r \cdot \frac{s_y}{s_x} ), where ( r ) is the correlation coefficient and ( s_x ), ( s_y ) are the sample standard deviations of ( x ) and ( y ).
Intercept ( b_0 ): ( b_0 = \bar{y} - b_1 \cdot \bar{x} ), where ( \bar{x} ) and ( \bar{y} ) are the sample means.
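A minimal sketch of these two formulas in Python, using the standard-library statistics module (statistics.correlation requires Python 3.10+; the data are invented):

```python
# Least-squares slope and intercept from the formulas above:
# b1 = r * (s_y / s_x), b0 = ybar - b1 * xbar
import statistics as st

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

r = st.correlation(x, y)             # sample correlation coefficient r
b1 = r * st.stdev(y) / st.stdev(x)   # slope
b0 = st.mean(y) - b1 * st.mean(x)    # intercept

print(f"fitted line: Y = {b0:.3f} + {b1:.3f}x")
```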
Regression Model Interpretation:
Predictive equation: ( Y = 19 + 0.7x )
Slope Interpretation: Each additional hour of study is associated with a 0.7-point increase in the predicted final exam score.
Y-Intercept Meaning: The intercept (19) is the predicted score when ( x = 0 ), i.e., with no hours of study. This interpretation is only meaningful if the observed data include values near ( x = 0 ).
Extrapolation Caution: Avoid using regression models for values outside the observed data range, as it can lead to misleading conclusions.
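One way to make this caution concrete is a hypothetical prediction helper that refuses to extrapolate; the observed range (0 to 40 hours) and the coefficients echo the example above but are otherwise assumptions:

```python
# Predict only within the observed range of x; raise on extrapolation.
def predict_score(hours, b0=19.0, b1=0.7, x_min=0.0, x_max=40.0):
    if not (x_min <= hours <= x_max):
        raise ValueError(f"{hours} h lies outside the observed range "
                         f"[{x_min}, {x_max}]; refusing to extrapolate")
    return b0 + b1 * hours

print(predict_score(10))   # 26.0 -- within the data, a reasonable prediction
# predict_score(100)       # would raise: far beyond the observed study hours
```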
Example with Schooling Years:
Regression analysis can quantify the association between years of schooling and salary; as always, beware of extrapolating outside the observed range of schooling years.
R-squared Value: Represents the proportion of variation in the response variable explained by the explanatory variable.
Values range from 0 (no explained variation) to 1 (all variation explained).
In many scientific contexts, an R-squared above 0.6 (60%) suggests a relatively good model, though what counts as "good" depends on the field and the question.
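For simple linear regression (one explanatory variable), R-squared is just the square of the correlation coefficient ( r ), as this sketch with invented data shows:

```python
# R^2 = r^2 in simple linear regression: the proportion of variation
# in y explained by x (statistics.correlation requires Python 3.10+).
import statistics as st

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

r = st.correlation(x, y)
print(f"R^2 = {r**2:.3f}")
```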
Effect of Outliers: Outliers can significantly influence the regression line, resulting in skewed interpretations. Researchers must verify whether outliers are valid data points.
Regression Line Behavior with Outliers: Outliers can pull the regression line toward themselves, distorting the apparent relationship; a small demonstration follows.
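A sketch of this pulling effect, applying the least-squares formulas from above to invented data with and without a single outlier:

```python
# Compare the fitted line before and after adding one influential outlier.
import statistics as st

def fit(x, y):
    b1 = st.correlation(x, y) * st.stdev(y) / st.stdev(x)
    return st.mean(y) - b1 * st.mean(x), b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]               # perfectly linear: b0 = 0, b1 = 2
print("clean data:   b0=%.2f, b1=%.2f" % fit(x, y))

x_out, y_out = x + [10], y + [2]   # one extreme point, far from the pattern
print("with outlier: b0=%.2f, b1=%.2f" % fit(x_out, y_out))
```

With the outlier included, the fitted slope even changes sign, showing how a single point can distort the estimated relationship.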
Residuals: The differences between observed and predicted values; their scatter indicates how well a linear model fits the data.
Residual Plot Analysis: A residual plot with points scattered randomly around the horizontal zero line indicates that a linear model fits well.
Non-linear Relationships Indicated: Persistent patterns (e.g., curves) in a residual plot suggest that a linear model may not adequately capture the relationship in the data; see the sketch below.
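A minimal residual-plot sketch: the data below are deliberately quadratic, so the residuals from a straight-line fit show a clear curved pattern (matplotlib and Python 3.10+ assumed):

```python
# Fit a line, then plot residuals against x; a curve signals a poor linear fit.
import matplotlib.pyplot as plt
import statistics as st

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi**2 for xi in x]   # deliberately non-linear response

b1 = st.correlation(x, y) * st.stdev(y) / st.stdev(x)
b0 = st.mean(y) - b1 * st.mean(x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Curved residual pattern: linear model misses the relationship")
plt.show()
```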