To model relationships between variables mathematically in order to facilitate predictions, inform business decisions, and reveal underlying trends in data.
Indicate relationships among variables, helping to visualize potential connections.
Can indicate whether a relationship exists, but correlation alone does not imply causation.
Commonly used coefficients include Pearson's r, which quantifies the strength and direction of a linear relationship.
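A minimal sketch of computing Pearson's r with SciPy, using made-up values for two variables:

```python
# Hypothetical data: advertising spend vs. sales (illustrative values only).
from scipy import stats

ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

r, p_value = stats.pearsonr(ad_spend, sales)
print(f"Pearson's r = {r:.3f}, p-value = {p_value:.4f}")
# r close to +1 or -1 indicates a strong linear relationship; near 0, a weak one.
```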
Primarily used to predict the value of one variable based on another, providing a deeper understanding of relationships.
Addresses research questions like: "Given X, what is the predicted Y?"
Assists businesses in making forecasts and setting strategic goals based on data-driven insights.
A family of analyses used to predict the value of a dependent variable (Y) from one or more independent variables (X).
Can be used to evaluate the effects of changes in predictor variables on the outcome variable, thus informing business decisions.
Dependent Variable (Y): The variable we aim to predict; also referred to as the response, outcome, explained, or predicted variable.
Independent Variable (X): The variable that is manipulated or categorized to predict changes in the dependent variable; also known as explanatory or predictor variable.
Simple Linear Regression
Utilizes one independent variable to predict Y.
Example: Apartment size predicting monthly rent.
Straightforward and easy to interpret, but limited to assessing a single relationship; see the sketch below.
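A minimal sketch of the apartment-size example with statsmodels, using hypothetical sizes and rents:

```python
import numpy as np
import statsmodels.api as sm

size = np.array([500, 650, 800, 950, 1100])    # sq ft (hypothetical)
rent = np.array([900, 1100, 1350, 1500, 1750]) # $/month (hypothetical)

X = sm.add_constant(size)        # adds the intercept term B0
model = sm.OLS(rent, X).fit()    # ordinary least squares fit
print(model.params)              # [B0, B1]: intercept and slope
print(model.predict([[1, 700]])) # predicted rent for a 700 sq ft unit
```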
Multiple Linear Regression
Involves two or more independent variables predicting Y.
Example: Predicting album sales based on advertising budget, genre, and other factors.
More complex and capable of evaluating several influences simultaneously, allowing businesses to understand multifaceted relationships.
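A minimal sketch of the album-sales example with two hypothetical predictors (advertising budget and radio airplay):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; column names are illustrative.
df = pd.DataFrame({
    "sales":   [120, 150, 180, 210, 260, 300],
    "adverts": [10, 15, 20, 25, 30, 35],
    "airplay": [5, 8, 6, 12, 14, 18],
})

model = smf.ols("sales ~ adverts + airplay", data=df).fit()
print(model.params)  # one coefficient per predictor, plus the intercept
```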
A straight line that best describes the overall trend between dependent and independent variables, crucial for visualizing relationships.
Goal: Minimize the distance (residuals) between each observed data point and the line, thus providing a model that best represents the data.
Deviations from the regression line, crucial for determining the accuracy of the model.
Good regression models minimize the sum of squared residuals, enhancing predictive power and reliability.
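A minimal sketch showing that the least-squares line minimizes the sum of squared residuals (SSR), using made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
residuals = y - (b0 + b1 * x)  # observed minus predicted
print("SSR (fitted line):", np.sum(residuals ** 2))

# Any other line yields a larger sum of squared residuals:
worse = y - (b0 + (b1 + 0.5) * x)
print("SSR (perturbed slope):", np.sum(worse ** 2))
```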
y' = B0 + B1X
y': Predicted value of Y.
B0: y-intercept (the expected value of Y when X = 0).
B1: Slope (the change in Y for a 1-unit increase in X).
If B1 = 2 and both variables are measured in thousands of dollars, an increase of $1,000 in advertising predicts an increase of $2,000 in sales, showcasing the financial impact of the independent variable on the outcome.
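A worked check of that interpretation, assuming hypothetical coefficient values and both variables in thousands of dollars:

```python
b0, b1 = 5.0, 2.0   # hypothetical intercept and slope

def predict(x):
    return b0 + b1 * x  # y' = B0 + B1*X

print(predict(11) - predict(10))  # 2.0: +$1,000 in X predicts +$2,000 in Y
```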
H0: B1 = 0 (no relationship between X and Y).
H1: B1 ≠ 0 (there is a relationship between X and Y).
Commonly set at alpha = 0.05, indicating the threshold for determining statistical significance.
Determines whether the null hypothesis can be rejected, based on the estimated regression coefficients and their p-values.
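A minimal sketch of this test with statsmodels, using made-up data; the slope's t-statistic and p-value come directly from the fit:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.3, 4.1, 5.8, 8.4, 9.9, 12.1])

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.tvalues[1], model.pvalues[1])  # t-statistic and p-value for B1
# Reject H0 at alpha = 0.05 when the slope's p-value falls below 0.05.
```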
R-Squared: Indicates the proportion of variability in Y explained by X; a key metric for assessing model fit.
Adjusted R-Squared: Adjusts for the number of predictors, penalizing unnecessary variables and giving a fairer measure of model fit.
Displays regression coefficients for each predictor, their significance levels, and confidence intervals, enabling a comprehensive view of variable impacts.
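A minimal sketch of reading these fit statistics and the coefficient table from a statsmodels result, with hypothetical data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":  [3.1, 4.0, 5.2, 6.1, 7.3, 8.0],
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
})
fit = smf.ols("y ~ x1 + x2", data=df).fit()

print(fit.rsquared, fit.rsquared_adj)  # R-squared and adjusted R-squared
print(fit.summary())  # coefficients, significance levels, confidence intervals
```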
Linearity
Assumes that independent variables have a straight-line relationship with the dependent variable; essential for validity in linear regression.
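One way to check this formally is statsmodels' rainbow test; a minimal sketch with made-up data (a small p-value suggests non-linearity):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_rainbow

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.1])

fit = sm.OLS(y, sm.add_constant(x)).fit()
fstat, p = linear_rainbow(fit)
print(f"Rainbow test p = {p:.3f}")  # p > 0.05: no evidence of non-linearity
```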
Normality of Residuals
Residuals should be normally distributed around 0.
Methods for checking include Normal Probability Plots, Histograms, and Shapiro-Wilk tests, critical for ensuring reliable inference.
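A minimal sketch of the Shapiro-Wilk check on residuals, with made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.9, 4.2, 5.8, 8.1, 9.7, 12.3, 13.8, 16.2])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p:.3f}")  # p > 0.05: no evidence against normality
```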
Independence of Errors
Residuals must be uncorrelated with one another (i.e., no autocorrelation); often checked using the Durbin-Watson test.
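A minimal sketch of the Durbin-Watson check with statsmodels, using made-up data; values near 2 suggest no autocorrelation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1, 11.8, 14.2, 15.9])

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))  # ~2 is good; values near 0 or 4 flag autocorrelation
```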
Homoscedasticity
Residuals should exhibit constant variance across all levels of the independent variable, necessary for the reliability of coefficients.
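A minimal sketch of a Breusch-Pagan test for constant residual variance, using statsmodels and hypothetical data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.2, 3.9, 6.3, 7.8, 10.2, 11.7, 14.1, 16.0])

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p = {lm_p:.3f}")  # p > 0.05: no evidence of heteroscedasticity
```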
No Multicollinearity
Predictors should not be highly correlated with one another; checked using correlation coefficients or Variance Inflation Factor (VIF) values to ensure model stability.
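A minimal sketch of computing VIF values with statsmodels; column names and data are hypothetical, and VIFs above roughly 5-10 are a common warning threshold:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 5, 8, 10, 13],  # strongly correlated with x1
    "x3": [5, 3, 6, 2, 7, 4],
})
X = sm.add_constant(df)

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```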
Transforms qualitative variables into dummy variables for inclusion in regression analysis.
For k categories, k - 1 dummy variables are needed to represent the categorical data without redundancy.
Allows for comparisons between categories; for example, if the coefficient for electrical repair is positive, electrical repairs take longer than the reference category (mechanical repairs); see the sketch below.
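A minimal sketch of dummy coding with pandas; the repair-type data are hypothetical, and with k = 2 categories a single dummy suffices (mechanical as the reference):

```python
import pandas as pd

df = pd.DataFrame({
    "repair_type": ["mechanical", "electrical", "mechanical", "electrical"],
    "hours":       [3.0, 5.5, 2.5, 6.0],
})

# k - 1 = 1 dummy: 1 for electrical, 0 for mechanical (the reference category)
df["electrical"] = (df["repair_type"] == "electrical").astype(int)
print(df)
# A positive coefficient on "electrical" in a regression of hours would mean
# electrical repairs take longer than mechanical repairs on average.
```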
The regression model can include both types for broader analysis, offering a richer dataset for predictions.
Care must be taken in interpretation to account for the influence of each variable, ensuring clarity in communication of results.
Understanding regression and its assumptions is essential for accurate modeling and prediction in a business context, equipping managers and analysts with tools to make informed decisions based on statistical evidence and trends.