Introduction to Linear Regression and Gradient Descent

Introduction to Regression

  • Regression in Machine Learning: Regression is a supervised learning technique specifically used to model the relationship between a dependent variable (often referred to as the target) and one or more independent variables (regarded as features or predictors).

  • Goal of Regression: The primary objective is to create a predictive model capable of making continuous predictions or estimates. This distinguishes regression from classification, which focuses on discrete categories or classes.

  • Function Fitting: In a regression problem, the aim is typically to find a function that best fits the provided data. This allows for the prediction of a numeric value for the dependent variable based on known values of independent variables.

  • Representation: This function is frequently represented as a linear equation. However, in more complex scenarios, the function can be nonlinear.

  • Output: The output of a regression model is always a number, not a category label.

A Simple Practice Example: Predicting House Prices

  • Problem Description: Imagine a dataset containing information about various houses, including features such as the number of bedrooms, square footage, neighborhood, and other factors.

  • Objective: Build a regression model that predicts the selling price of a house based on these specific features.

  • Variables in the Example:

    • Dependent Variable (Target): The selling price of the house (a continuous numeric value).

    • Independent Variables (Features): Number of bedrooms, square footage, location, etc.

  • Learning Process: Using a regression algorithm like linear regression, the model learns the relationships between the features and house prices from training data to make predictions on new, unseen houses.

Linear Regression with One Variable

  • Definition: Linear regression models the target as a linear function of the input features. It is used to estimate values such as house prices, stock values, life expectancy, or the time a user spends on a website.

  • Simple Linear Regression: This specific type involves only two variables: one independent variable (predictor) and one dependent variable (target).

  • Model Equation: The goal is to identify a linear relationship expressed as a straight-line equation:     \( Y = b_0 + b_1 \times X \)

  • Variable Definitions:

    • \( Y \): The dependent variable (the target we want to predict).

    • \( X \): The independent variable (the predictor feature).

    • \( b_0 \): The intercept, which is the value of \( Y \) at the point where the line crosses the \( Y \)-axis (that is, when \( X = 0 \)).

    • \( b_1 \): The slope, representing the change in \( Y \) for a one-unit change in \( X \).

  • Objective: The model aims to find values for \( b_0 \) and \( b_1 \) that minimize the sum of squared differences between predicted values and actual values.
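
For a single feature, the least-squares objective above has a standard closed-form solution: the slope is the co-variation of the feature and target divided by the variation of the feature, and the intercept makes the line pass through the point of means. The following is a minimal sketch in plain Python. The function name `fit_simple_linear_regression` and the sample data are illustrative assumptions, not part of the original notes.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: summed co-deviation of x and y divided by summed squared deviation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical sample data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

Because the sample points were chosen to lie on a straight line, the fitted parameters recover that line exactly. With noisy data the same function would return the best compromise in the squared-error sense.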

Real Estate Practical Solution: Housing Prices by Room Count

  • Scenario: A real estate agent wants to infer the price of a house by comparing it to others using a single feature: the number of rooms.

  • Dataset (Table 3.1):

    • House 1: 1 Room, Price 150.

    • House 2: 2 Rooms, Price 200.

    • House 3: 3 Rooms, Price 250.

    • House 4: 4 Rooms, Price ?

    • House 5: 5 Rooms, Price 350.

    • House 6: 6 Rooms, Price 400.

    • House 7: 7 Rooms, Price 450.

  • Pattern Recognition: Adding one room increases the price by $50. This implies a base price of $100 and an extra charge of $50 per room.

  • Derived Formula:     \( \text{Price} = 100 + 50 \times (\text{Number of rooms}) \)

  • Prediction for House 4: Based on the formula, the predicted price is $300.
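
The pattern-recognition step above can be retraced in a few lines of Python. This is only a sketch of the reasoning: it derives the per-room charge and base price from the first two rows of Table 3.1, checks the rule against every known row, and then fills in House 4. The variable and function names are hypothetical.

```python
# Known rows from Table 3.1 as (rooms, price); House 4 is the unknown.
known_houses = [(1, 150), (2, 200), (3, 250), (5, 350), (6, 400), (7, 450)]

# Price change per extra room, estimated from the first two rows.
(rooms_a, price_a), (rooms_b, price_b) = known_houses[0], known_houses[1]
price_per_room = (price_b - price_a) / (rooms_b - rooms_a)

# Base price: what remains after removing the per-room charge.
base_price = price_a - price_per_room * rooms_a


def predict_price(rooms):
    """Apply the derived rule: base price plus a fixed amount per room."""
    return base_price + price_per_room * rooms


# Check that the rule reproduces every known row before trusting it.
assert all(predict_price(rooms) == price for rooms, price in known_houses)

print(predict_price(4))
```

Reading the rule off two rows works here only because the table is perfectly linear. Real data is noisy, which is why the later sections fit the line by minimizing a cost instead.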

Key Machine Learning Concepts

  • Features: Properties used to make predictions (e.g., number of rooms, crime rate, house age, size). In the simple example, the feature is the number of rooms.

  • Labels: The target we try to predict (e.g., house price).

  • Model: A rule or formula (like the linear equation) used to predict a label from features.

  • Prediction: The specific output produced by the model (e.g., predicted price of 300).

  • Weights: Factors by which features are multiplied. In the house formula, the weight is 50. It indicates how much the label increases when the feature increases by one unit.

  • Bias: A constant value in the formula not attached to any feature, representing the value of the label when all features are zero. In the house formula, the bias is 100 (the base price).
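
To make the vocabulary above concrete, here is a small sketch that stores the weights and bias of a model and computes a prediction from a list of feature values. The `LinearModel` class is a hypothetical helper written for this illustration; only the numbers 50 and 100 come from the house formula.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LinearModel:
    """A linear model: prediction = bias + sum(weight_i * feature_i)."""
    weights: Sequence[float]
    bias: float

    def predict(self, features: Sequence[float]) -> float:
        if len(features) != len(self.weights):
            raise ValueError("expected one feature value per weight")
        return self.bias + sum(w * x for w, x in zip(self.weights, features))


# The house-price rule from the notes: one feature (rooms), weight 50, bias 100.
house_model = LinearModel(weights=[50.0], bias=100.0)

prediction = house_model.predict([4])   # label predicted for a 4-room house
at_zero = house_model.predict([0])      # with all features zero, only the bias remains
one_more_room = house_model.predict([5]) - prediction  # effect of one extra unit of the feature

print(prediction, at_zero, one_more_room)
```

The last two values mirror the definitions: the bias is the prediction when every feature is zero, and the weight is the change in the prediction per one-unit increase in the feature.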

Model Representation and Geometry

  • Supervised Learning Context: Labeled data is provided with "right answers."

  • Training Set Architecture:

    • Training Set: Data used for learning.

    • Learning Algorithm: Processes the training set to output a hypothesis function.

    • Hypothesis (\( h \)): Usually denoted by a lowercase \( h \), it maps from \( x \) (input) to \( y \) (estimated output).

  • Linear Equations Geometry:

    • Equation: \( y = f(x) = \theta_0 + \theta_1 x \)

    • \( \theta_1 \): The slope (Rise over Run). This is the weight of the feature.

    • \( \theta_0 \): The \( Y \)-intercept. This is the bias of the model.

  • Types of Relationships:

    • Positive Linear Relationship: Line goes up as \( X \) increases.

    • Negative Linear Relationship: Line goes down as \( X \) increases.

    • Relationship Not Linear: Data points do not form a straight line.

    • No Relationship: Points are scattered with no discernible trend.
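
The geometric reading of the slope and intercept can be checked numerically. The sketch below computes rise over run between two points and reads the direction of the line from the sign of the slope. The helper names and the sample points are assumptions chosen for illustration, and the sign test only describes straight lines, not the nonlinear or no-relationship cases.

```python
def hypothesis(theta0, theta1, x):
    """Evaluate the line y = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def slope_between(p, q):
    """Rise over run between two points p = (x1, y1) and q = (x2, y2)."""
    (x1, y1), (x2, y2) = p, q
    if x2 == x1:
        raise ValueError("run is zero; the slope of a vertical line is undefined")
    return (y2 - y1) / (x2 - x1)


def direction(theta1):
    """Describe the direction of a line from the sign of its slope."""
    if theta1 > 0:
        return "positive"
    if theta1 < 0:
        return "negative"
    return "flat"


# Hypothetical points chosen for illustration.
rising = slope_between((0, 1), (2, 5))    # rise 4 over run 2
falling = slope_between((0, 5), (2, 1))   # rise -4 over run 2

print(rising, direction(rising))
print(falling, direction(falling))
# The intercept is the value of the line at x = 0.
print(hypothesis(1.0, rising, 0))
```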

Cost Function

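  • Purpose: The cost function measures the model's prediction error on the training set. The parameters \( \theta_0 \) and \( \theta_1 \) should be chosen so that \( h_\theta(x) \) is as close to \( y \) as possible for the training examples.

  • Mean Squared Error (MSE): The error is computed by averaging the squared differences between predictions and actual values. The definition below divides by \( 2m \) rather than \( m \), so it is half the MSE; the factor of \( \frac{1}{2} \) cancels when the square is differentiated and does not change which parameters minimize the cost.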

  • Mathematical Definition:     \( J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)

    • \( m \): The number of training examples.

    • \( h_\theta(x^{(i)}) \): The prediction on the \( i \)-th training example.

    • \( y^{(i)} \): The actual value of the \( i \)-th training example.

  • Objective: Minimize the function \( J(\theta_0, \theta_1) \).
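
The cost definition above translates almost directly into code. The following is a minimal sketch, assuming the single-feature hypothesis with parameters theta0 and theta1. The function name `compute_cost` is hypothetical, and the data are the known rows of the room-count example.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Cost J(theta0, theta1): sum of squared errors divided by 2m."""
    m = len(xs)
    if m == 0 or m != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    squared_errors = (
        ((theta0 + theta1 * x) - y) ** 2   # (h(x) - y)^2 for one example
        for x, y in zip(xs, ys)
    )
    return sum(squared_errors) / (2 * m)


# Known rows of the room-count example (rooms, price).
rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

perfect = compute_cost(100, 50, rooms, prices)   # the rule Price = 100 + 50 * rooms
no_bias = compute_cost(0, 50, rooms, prices)     # same slope, but bias dropped

print(perfect, no_bias)
```

The first call returns zero because the rule matches every row. Dropping the bias makes every prediction 100 too low, so each squared error is 10000 and the cost is 60000 / 12 = 5000.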

Optimization and Gradient Descent

  • Gradient Descent Definition: An iterative optimization algorithm used to minimize a differentiable function, including the cost function \( J \). It is used extensively across machine learning.

  • Problem Setup:

    1. Start with initial values for \( \theta_0, \theta_1 \).

    2. Keep changing \( \theta_0, \theta_1 \) to reduce \( J(\theta_0, \theta_1) \).

    3. Stop when reaching a minimum.

  • Landscape Analogy: Imagine a landscape with hills (high error) and valleys (low error). The goal is to reach the lowest point as rapidly as possible.

  • Algorithm Update Rule:     \( \theta_j := \theta_j - \alpha \times \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \)

    • \( := \): Assignment operator.

    • \( \alpha \): Learning rate (defines the size of steps taken during descent).

    • \( \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \): Partial derivative term indicating the slope.

  • Simultaneous Update: All parameters must be updated simultaneously, so that both partial derivatives are evaluated at the old parameter values before either parameter is overwritten:

    \( \text{temp}_0 := \theta_0 - \alpha \times \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \)

    \( \text{temp}_1 := \theta_1 - \alpha \times \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \)

    \( \theta_0 := \text{temp}_0 \)

    \( \theta_1 := \text{temp}_1 \)

  • Gradient Descent for Linear Regression (Batch Gradient Descent):

    • Update for \( j = 0 \) (Bias):         \( \theta_0 := \theta_0 - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \)

    • Update for \( j = 1 \) (Weight):         \( \theta_1 := \theta_1 - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \)

  • Batch Property: "Batch" gradient descent means that every step of the algorithm uses all training examples in the dataset.
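
The two update rules and the simultaneous-update requirement can be put together in a short loop. The following is a minimal sketch of batch gradient descent for the single-feature hypothesis, run on the known rows of the room-count example. The function name, the learning rate of 0.05, and the iteration count are assumptions picked for this small dataset; other data would likely need different settings.

```python
def gradient_descent(xs, ys, alpha, iterations, theta0=0.0, theta1=0.0):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    for _ in range(iterations):
        # Prediction errors h(x) - y, computed from the current parameters
        # over the whole training set (this is what makes it "batch").
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were evaluated at the old
        # parameters before either parameter is overwritten.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


rooms = [1, 2, 3, 5, 6, 7]
prices = [150, 200, 250, 350, 400, 450]

# alpha and the iteration count are assumptions tuned to this small dataset.
theta0, theta1 = gradient_descent(rooms, prices, alpha=0.05, iterations=5000)
print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")
```

On this data the parameters approach the bias of 100 and weight of 50 from the derived formula. A learning rate that is too large would make the updates overshoot and diverge, while one that is too small would need many more iterations.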

Visualization of Training Process

  • The progression of gradient descent can be observed through iterations where the regression line gradually adjusts its slope and intercept to fit the scatter plot of data points more closely.

  • Typically, as iterations increase (from Iteration 1 to Iteration 11), the error decreases, and the regression line moves closer to the cluster of data points.

  • In a simple weight-only hypothesis where \( \theta_0 = 0 \), plotting the cost function \( J(\theta_1) \) results in a parabolic curve where the minimum point represents the optimal slope value (e.g., \( \theta_1 = 1 \) where error equals 0).
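
The parabolic shape of the weight-only cost can be seen by evaluating it on a grid of slopes. The sketch below assumes toy data lying on the line y = x, which is one dataset for which the optimal slope is 1 and the error there is 0. The notes do not state the data behind their plot, so these points and the function name are assumptions.

```python
def cost_weight_only(theta1, xs, ys):
    """J(theta1) for the hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data on the line y = x, so the best slope is exactly 1.
xs = [1, 2, 3]
ys = [1, 2, 3]

# Evaluate the cost on a grid of candidate slopes from 0 to 2.
candidates = [i * 0.25 for i in range(9)]
for theta1 in candidates:
    print(f"theta1 = {theta1:.2f}  J = {cost_weight_only(theta1, xs, ys):.4f}")

best_theta1 = min(candidates, key=lambda t: cost_weight_only(t, xs, ys))
print("lowest cost at theta1 =", best_theta1)
```

The printed costs fall as the slope approaches 1 and rise symmetrically on the other side, which is the parabola described above.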