Introduction to Linear Regression and Gradient Descent

Introduction to Regression

Regression in Machine Learning: Regression is a supervised learning technique specifically used to model the relationship between a dependent variable (often referred to as the target) and one or more independent variables (regarded as features or predictors).
Goal of Regression: The primary objective is to create a predictive model capable of making continuous predictions or estimates. This distinguishes regression from classification, which focuses on discrete categories or classes.
Function Fitting: In a regression problem, the aim is typically to find a function that best fits the provided data. This allows for the prediction of a numeric value for the dependent variable based on known values of independent variables.
Representation: This function is frequently represented as a linear equation. However, in more complex scenarios, the function can be nonlinear.
Output: What defines a regression model is its output, which is always a numeric value.

A Simple Practice Example: Predicting House Prices

Problem Description: Imagine a dataset containing information about various houses, including features such as the number of bedrooms, square footage, neighborhood, and other factors.
Objective: Build a regression model that predicts the selling price of a house based on these specific features.
Variables in the Example:
- Dependent Variable (Target): The selling price of the house (a continuous numeric value).
- Independent Variables (Features): Number of bedrooms, square footage, location, etc.
Learning Process: Using a regression algorithm like linear regression, the model learns the relationships between the features and house prices from training data to make predictions on new, unseen houses.

Linear Regression with One Variable

Definition: Linear regression is a method that estimates a numeric value, such as a house price, a stock value, life expectancy, or the time a user spends on a website, by modeling the target as a linear function of the features.
Simple Linear Regression: This specific type involves only two variables: one independent variable (predictor) and one dependent variable (target).
Model Equation: The goal is to identify a linear relationship expressed as a straight-line equation: $Y = b_0 + b_1 \times X$
Variable Definitions:
- $Y$ : The dependent variable (the target we want to predict).
- $X$ : The independent variable (the predictor feature).
- $b_0$ : The intercept, which is the value of $Y$ when $X = 0$ (the point where the line crosses the $Y$-axis).
- $b_1$ : The slope, representing the change in $Y$ for a one-unit change in $X$.
Objective: The model aims to find values for $b_0$ and $b_1$ that minimize the sum of squared differences between predicted values and actual values.
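As a minimal sketch of this objective, the least-squares values of $b_0$ and $b_1$ for a single feature can be computed directly from the sample means. The function name and the small data set below are illustrative assumptions, not part of the original notes.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y ~ b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data lying close to y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

The fitted values should land near the intercept 1 and slope 2 used to make up the data.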

Real Estate Practical Solution: Housing Prices by Room Count

Scenario: A real estate agent wants to infer the price of a house by comparing it to others using a single feature: the number of rooms.
Dataset (Table 3.1):
- House 1: 1 Room, Price 150.
- House 2: 2 Rooms, Price 200.
- House 3: 3 Rooms, Price 250.
- House 4: 4 Rooms, Price ?
- House 5: 5 Rooms, Price 350.
- House 6: 6 Rooms, Price 400.
- House 7: 7 Rooms, Price 450.
Pattern Recognition: Each additional room increases the price by 50 (for example, 200 - 150 = 50). Working back from the 1-room house gives a base price of 150 - 50 = 100, plus an extra charge of 50 per room.
Derived Formula: $\text{Price} = 100 + 50 \times (\text{Number of rooms})$
Prediction for House 4: Substituting 4 rooms into the formula gives a predicted price of 100 + 50 × 4 = 300.
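The derived formula can be checked against the known rows of Table 3.1 with a short sketch. The constant and function names are chosen here for illustration only.

```python
# Known rows from Table 3.1 as (rooms, price); house 4 is left out because its price is unknown.
known_houses = [(1, 150), (2, 200), (3, 250), (5, 350), (6, 400), (7, 450)]

BASE_PRICE = 100      # price implied for zero rooms
PRICE_PER_ROOM = 50   # increase per additional room


def predict_price(rooms):
    """Apply the derived formula: Price = 100 + 50 * rooms."""
    return BASE_PRICE + PRICE_PER_ROOM * rooms


# The formula reproduces every known row exactly.
assert all(predict_price(rooms) == price for rooms, price in known_houses)

print(predict_price(4))  # 300
```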

Key Machine Learning Concepts

Features: Properties used to make predictions (e.g., number of rooms, crime rate, house age, size). In the simple example, the feature is the number of rooms.
Labels: The target we try to predict (e.g., house price).
Model: A rule or formula (like the linear equation) used to predict a label from features.
Prediction: The specific output produced by the model (e.g., a predicted price of 300).
Weights: Factors by which features are multiplied. In the house formula, the weight is 50. It indicates how much the label increases when the feature increases by one unit.
Bias: A constant value in the formula not attached to any feature, representing the value of the label when all features are zero. In the house formula, the bias is 100 (the base price).
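To make the roles of weights and bias concrete, here is a small sketch of a linear model as a weighted sum. The `predict` helper is a hypothetical name; only the single-feature house model (weight 50, bias 100) comes from the text.

```python
def predict(features, weights, bias):
    """Linear model: bias plus the weighted sum of the features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Single-feature house model from the text: weight 50, bias 100.
print(predict([4], weights=[50], bias=100))  # 300

# With every feature at zero, the prediction equals the bias.
print(predict([0], weights=[50], bias=100))  # 100
```

Writing the model this way shows that adding a feature only means adding one more weight; the bias stays a single constant.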

Model Representation and Geometry

Supervised Learning Context: The training data is labeled, meaning each example comes with its "right answer."
Training Set Architecture:
- Training Set: Data used for learning.
- Learning Algorithm: Processes the training set to output a hypothesis function.
- Hypothesis ( $h$ ): The function output by the learning algorithm, denoted by a lowercase $h$. It maps an input $x$ to an estimated output $y$.
Linear Equations Geometry:
- Equation: $y = f(x) = \theta_0 + \theta_1 x$
- $\theta_1$ : The slope (Rise over Run). This is the weight of the feature.
- $\theta_0$ : The $Y$-intercept. This is the bias of the model.
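"Rise over run" can be turned into a short calculation: given two points on a line, the slope is the change in $y$ divided by the change in $x$, and the intercept follows from either point. The helper name `line_through` is an assumption made for this sketch.

```python
def line_through(p1, p2):
    """Return (theta0, theta1) for the line through two points with different x values."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    theta1 = (y2 - y1) / (x2 - x1)   # rise over run
    theta0 = y1 - theta1 * x1        # value of y at x = 0
    return theta0, theta1


# Two rows of the housing table: (rooms, price).
theta0, theta1 = line_through((1, 150), (2, 200))
print(theta0, theta1)  # 100.0 50.0
```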
Types of Relationships:
- Positive Linear Relationship: Line goes up as $X$ increases.
- Negative Linear Relationship: Line goes down as $X$ increases.
- Relationship Not Linear: Data points do not form a straight line.
- No Relationship: Points are scattered with no discernible trend.

Cost Function

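Purpose: The cost function measures how far the model's predictions are from the actual values. The parameters $\theta_0$ and $\theta_1$ should be chosen so that $h_\theta(x)$ is as close to $y$ as possible on the training examples.
Mean Squared Error (MSE): The error is computed by averaging the squared differences between predictions and actual values. The definition below divides by $2m$ rather than $m$; the extra factor of $\frac{1}{2}$ does not change where the minimum lies, and it cancels the 2 that appears when the square is differentiated.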
Mathematical Definition: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
- $m$ : The number of training examples.
- $h_\theta(x^{(i)})$ : The prediction on the $i$-th training example.
- $y^{(i)}$ : The actual value of the $i$-th training example.
Objective: Minimize the function $J(\theta_0, \theta_1)$.
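The definition of $J$ translates almost line for line into code. The following is a sketch under the assumption that the hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$; the function names are illustrative, and the data are the known rows of the housing table.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / (2m)) * sum of squared prediction errors."""
    m = len(xs)
    squared_errors = ((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)


# Known rows of the housing table (rooms, price).
rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

print(cost(100, 50, rooms, prices))  # 0.0: the derived formula fits every known row
print(cost(0, 50, rooms, prices))    # 5000.0: every prediction is 100 too low
```

In the second call each squared error is 10000, so the sum over six examples is 60000 and dividing by $2m = 12$ gives 5000.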

Optimization and Gradient Descent

Gradient Descent Definition: An iterative optimization algorithm used to minimize a differentiable function, such as the cost function $J$. It is used extensively across machine learning.
Problem Setup:
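1. Start with initial values for $\theta_0, \theta_1$.
2. Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$.
3. Stop when the updates no longer reduce $J$, that is, on reaching a minimum. For linear regression with the squared-error cost, $J$ is convex, so this minimum is the global one.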
Landscape Analogy: Imagine a landscape with hills (high error) and valleys (low error). The goal is to reach the lowest point as rapidly as possible.
Algorithm Update Rule: $\theta_j := \theta_j - \alpha \times \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
- $:=$ : Assignment operator.
- $\alpha$ : Learning rate (defines the size of steps taken during descent).
- $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ : Partial derivative term indicating the slope.
Simultaneous Update: All parameters must be updated simultaneously, so every partial derivative is evaluated at the old parameter values before any parameter is overwritten:
1. $\text{temp}_0 := \theta_0 - \alpha \times \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
2. $\text{temp}_1 := \theta_1 - \alpha \times \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
3. $\theta_0 := \text{temp}_0$
4. $\theta_1 := \text{temp}_1$
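The difference between a simultaneous and a non-simultaneous update is easiest to see on a function whose two parameters interact. The sketch below uses a made-up function, $J = \theta_0^2 + \theta_0 \theta_1 + \theta_1^2$, chosen only for illustration; it is not the regression cost from these notes.

```python
def grad(theta0, theta1):
    """Partial derivatives of the illustrative J = theta0**2 + theta0*theta1 + theta1**2."""
    return 2 * theta0 + theta1, theta0 + 2 * theta1


def simultaneous_step(theta0, theta1, alpha):
    # Both derivatives are evaluated at the old (theta0, theta1).
    d0, d1 = grad(theta0, theta1)
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    return temp0, temp1


def sequential_step(theta0, theta1, alpha):
    # Incorrect variant: theta0 is overwritten before theta1's derivative is computed.
    theta0 = theta0 - alpha * grad(theta0, theta1)[0]
    theta1 = theta1 - alpha * grad(theta0, theta1)[1]
    return theta0, theta1


print(simultaneous_step(1.0, 1.0, alpha=0.1))
print(sequential_step(1.0, 1.0, alpha=0.1))
```

Starting from (1, 1) with $\alpha = 0.1$, the simultaneous step moves both parameters to about 0.7, while the sequential variant moves $\theta_1$ to about 0.73 because it already sees the updated $\theta_0$.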
Gradient Descent for Linear Regression (Batch Gradient Descent):
- Update for $j=0$ (Bias): $\theta_0 := \theta_0 - \alpha \times \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})$
- Update for $j=1$ (Weight): $\theta_1 := \theta_1 - \alpha \times \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
Batch Property: "Batch" gradient descent means that every step of the algorithm uses all training examples in the dataset: each sum above runs over $i = 1, \dots, m$.
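The two update rules can be combined into a short training loop. This is a minimal sketch, not a tuned implementation: the starting point (0, 0), the learning rate 0.05, and the iteration count 5000 are assumptions picked so that the loop converges on the housing data.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # assumed starting point
    for _ in range(iterations):
        # Each step uses the errors on all m training examples.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Known rows of the housing table (rooms, price).
rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]
theta0, theta1 = batch_gradient_descent(rooms, prices)
print(round(theta0, 2), round(theta1, 2))
```

On this data the parameters should approach the bias 100 and weight 50 of the derived formula. A noticeably larger learning rate makes the steps overshoot and the loop diverge, which is why $\alpha$ has to be chosen with care.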

Visualization of Training Process

The progress of gradient descent can be observed across iterations: the regression line gradually adjusts its slope and intercept to fit the scatter plot of data points more closely.
Typically, as iterations increase (from Iteration 1 to Iteration 11), the error decreases and the regression line moves closer to the cluster of data points.
In a simple weight-only hypothesis where $\theta_0 = 0$, plotting the cost function $J(\theta_1)$ gives a parabola whose minimum marks the optimal slope (e.g., $\theta_1 = 1$, where the error equals 0).
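The parabolic shape can be reproduced numerically by evaluating $J(\theta_1)$ at a few slopes. The three data points below are an assumed toy data set lying exactly on $y = x$, chosen so that the minimum falls at $\theta_1 = 1$; they are not taken from the notes.

```python
def cost_weight_only(theta1, xs, ys):
    """J(theta1) for the weight-only hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data lying exactly on y = x, so the best slope is 1.
xs = [1, 2, 3]
ys = [1, 2, 3]

for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, cost_weight_only(theta1, xs, ys))
```

The printed costs fall to zero at $\theta_1 = 1$ and rise symmetrically on either side, which is the parabola described above.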