Math 240 Gradient Descent and Linear Regression

Last updated 11:37 PM on 4/22/26

18 Terms

New cards

In gradient descent, what is the purpose of the learning rate alpha?

It sets the step size: how far the parameters move along the negative gradient at each iteration.

New cards

What is typically used as the cost function in Linear Regression with Gradient Descent?

Mean Squared Error

New cards

What does the gradient represent in gradient descent?

Rate of change of the cost with respect to the parameters
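One way to see the gradient as a rate of change is to approximate it with finite differences. The sketch below is illustrative only: it assumes the example cost F(x, y) = (x-1)^2 + (y-1)^2, and the helper name numerical_gradient and the step h are hypothetical choices.

```python
import numpy as np

def F(p):
    # Example cost: F(x, y) = (x - 1)^2 + (y - 1)^2
    return (p[0] - 1.0) ** 2 + (p[1] - 1.0) ** 2

def numerical_gradient(f, p, h=1e-5):
    # Central difference in each coordinate: (f(p + h) - f(p - h)) / (2h)
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for k in range(p.size):
        e = np.zeros_like(p)
        e[k] = h
        grad[k] = (f(p + e) - f(p - e)) / (2.0 * h)
    return grad

# Analytic gradient at (3, 3) is [2(3-1), 2(3-1)] = [4, 4]
g = numerical_gradient(F, [3.0, 3.0])
```

Each entry of g estimates how fast the cost changes when only that parameter is nudged.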

New cards

Why do we normalize data before applying gradient descent?

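To improve convergence speed. When features share a similar scale, the cost surface is less elongated, so a single learning rate works well in every direction.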
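A minimal sketch of one common normalization, standardization to zero mean and unit standard deviation. The function name standardize and the sample values are assumptions made for illustration; other scalings (such as min-max) are also used.

```python
import numpy as np

def standardize(X):
    # Rescale to zero mean and unit standard deviation.
    # Returns mu and sigma so new data can be scaled the same way.
    mu = np.mean(X)
    sigma = np.std(X)
    return (X - mu) / sigma, mu, sigma

X = np.array([1000.0, 2000.0, 3000.0, 4000.0])
X_scaled, mu, sigma = standardize(X)
```

After scaling, the feature values sit near the range [-2, 2] instead of the thousands, so the same alpha is less likely to overshoot.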

New cards

If the Gradient Descent method does not converge, what could be the problem?

The learning rate is too high, causing oscillations or divergence
The learning rate is too low, so progress is too slow to reach the minimum within the iteration limit
Unscaled data leading to slow convergence
A complex non-convex surface containing local minima or saddle points where optimizer gets stuck
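The effect of the learning rate can be sketched on the one-dimensional cost F(x) = x^2, whose gradient is 2x. The function name run_gd and the specific alpha values are assumptions chosen for illustration.

```python
def run_gd(alpha, x0=1.0, steps=20):
    # Gradient descent on F(x) = x^2, gradient 2x.
    # Each update multiplies x by (1 - 2 * alpha).
    x = x0
    for _ in range(steps):
        x = x - alpha * 2.0 * x
    return x

x_small = run_gd(0.1)  # factor 0.8 per step: shrinks toward the minimum at 0
x_large = run_gd(1.1)  # factor -1.2 per step: sign flips and magnitude grows
```

With alpha = 0.1 the iterate decays toward 0, while with alpha = 1.1 it oscillates across the minimum with growing magnitude.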

New cards

Consider the function F(x, y) = (𝑥−1)^2 + (𝑦−1)^2 with initial guess (3,3). Compute the first step of the gradient descent method using the best possible alpha (exact line search).

Gradient F = [(2(x-1)), (2(y-1))]
[(2(3-1)), (2(3-1))] = [4, 4]

x1 = x0 - a (gradient) → 3 - 4a

y1 = y0 - a (gradient) → 3 - 4a

F(a) = ((3-4a)-1)² + ((3-4a)-1)²
= (2-4a)² + (2-4a)²
= 2(2-4a)²

F’(a) = 4(2-4a)(-4) = -16(2-4a)
-16(2-4a) = 0
-32 + 64a = 0
64a = 32
a = 0.5

x1 = 3-4(0.5) = 1

y1 = 3-4(0.5) = 1

1st step = (1,1)
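The worked step can be checked numerically. This is a sketch only: the names grad_F, p0, and alphas are illustrative, and the grid scan over alpha is a rough stand-in for solving F’(a) = 0 by hand.

```python
import numpy as np

def F(p):
    return (p[0] - 1.0) ** 2 + (p[1] - 1.0) ** 2

def grad_F(p):
    return np.array([2.0 * (p[0] - 1.0), 2.0 * (p[1] - 1.0)])

p0 = np.array([3.0, 3.0])
g = grad_F(p0)                      # gradient at the initial guess

# Scan candidate alphas and keep the one with the lowest cost along -g
alphas = np.linspace(0.0, 1.0, 101)
costs = [F(p0 - a * g) for a in alphas]
best_alpha = alphas[int(np.argmin(costs))]

p1 = p0 - 0.5 * g                   # first step with the alpha found by hand
```

For this bowl-shaped function the exact line search lands on the minimum in a single step, which is not typical of general cost functions.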

New cards

Write the formula for exact line search for x and y

x1 = x0 - a (gradient x)

y1 = y0 - a (gradient y)

where a is the value that minimizes F(x0 - a (gradient x), y0 - a (gradient y))

New cards

What are the 3 ways to find a regression line?

Analytical Method: Solving for the slope (m) and intercept (b) using the OLS formulas, derived by setting the partial derivatives of the MSE equal to 0. Calculated from scratch using NumPy operations.
Optimization Method (Gradient Descent): Iteratively updating the slope (m) and intercept (b) using the partial derivatives of the loss function.
Library/Built-in Method: Using sklearn to abstract the math and fit the model using built in functions

New cards

Below is an incomplete Gradient Descent function. Fill in the missing code to accomplish two things:

Implement the step update.
Add a step-size reduction where the learning rate alpha is cut in half (alpha / 2) if the absolute value of the current function value F(x) is strictly greater than 100.

def GradientDescentMod(DF, F, x0, maxit=100, alpha=0.1):
    x = np.copy(x0)
    for i in range(maxit):
        # 1. Fill in the update rule for x
        x = ______________________________
        
        # 2. Modify alpha if abs(F(x)) > 100
        if ______________________________:
            alpha = ___________________
            
    return x

x = x - alpha * DF(x)

if abs(F(x)) > 100:
    alpha = alpha / 2
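Putting the filled-in blanks back into the function gives a runnable version. The test function F and its gradient DF are assumptions added only to exercise the code.

```python
import numpy as np

def GradientDescentMod(DF, F, x0, maxit=100, alpha=0.1):
    x = np.copy(x0)
    for i in range(maxit):
        # 1. Step against the gradient
        x = x - alpha * DF(x)

        # 2. Halve alpha whenever |F(x)| exceeds 100
        if abs(F(x)) > 100:
            alpha = alpha / 2

    return x

# Illustrative cost with its minimum at (1, 1)
F = lambda x: np.sum((x - 1.0) ** 2)
DF = lambda x: 2.0 * (x - 1.0)

x_min = GradientDescentMod(DF, F, np.array([3.0, 3.0]))
```

On this cost F never exceeds 100, so alpha stays at 0.1 and the halving branch only matters for starts that overshoot badly.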

New cards

What is Stochastic Gradient Descent?

Instead of looking at the whole dataset, SGD picks a single data point at each iteration, calculates the gradient for just that point, and takes a step.

New cards

How does gradient descent know when to stop taking steps?

It stops when the step size (the learning rate times the gradient) is very close to 0, or when the maximum number of iterations is reached.
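A stopping rule of this kind might look like the sketch below. The function name gradient_descent_tol, the tolerance tol, and the iteration cap are assumptions, not values from the course.

```python
import numpy as np

def gradient_descent_tol(DF, x0, alpha=0.1, tol=1e-8, maxit=10_000):
    x = np.array(x0, dtype=float)
    for i in range(maxit):
        step = alpha * DF(x)
        x = x - step
        # Stop once the step is too small to matter
        if np.linalg.norm(step) < tol:
            return x, i + 1
    # Iteration cap reached without meeting the tolerance
    return x, maxit

# Gradient of (x - 1)^2 + (y - 1)^2
x_min, n_iter = gradient_descent_tol(lambda x: 2.0 * (x - 1.0), [3.0, 3.0])
```

Returning the iteration count makes it easy to tell a converged run from one that simply ran out of iterations.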

New cards

How does the gradient of the MAE behave differently from the MSE? What are the two main consequences of this behavior?

The MAE uses the sign function, meaning the gradient is always a constant magnitude (±1). MSE uses the raw error, so its gradient shrinks as it gets closer to the correct answer.

MAE is highly robust to outliers because a massive error still only generates +1 or -1.
Since the gradient never shrinks, MAE will “bounce” back and forth across the minimum instead of settling, so the learning rate has to be reduced over time.
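The difference in gradient size can be seen per data point. In this sketch the residual values are made up for illustration, and the gradients are taken with respect to the prediction for a single point.

```python
import numpy as np

# Residuals (y - y_pred): one outlier, one moderate error, one near-perfect fit
errors = np.array([100.0, 1.0, 0.01])

# Squared error (y - y_pred)^2 has derivative -2 * error with respect to y_pred
mse_grad_mag = np.abs(-2.0 * errors)

# Absolute error |y - y_pred| has derivative -sign(error) with respect to y_pred
mae_grad_mag = np.abs(-np.sign(errors))
```

The MSE magnitudes are 200, 2, and 0.02, so the outlier dominates; the MAE magnitudes are all 1, regardless of how large or small the error is.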

New cards

Consider a linear regression model defined by the equation y = mx + b where m is the slope and b is the intercept.

Write down the cost function for Mean Squared Error (MSE) and derive its partial derivatives with respect to m and b. Solve for m and b.
Write down the cost function for Mean Absolute Error (MAE) and derive its partial derivatives with respect to v0 and v1

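MSE = 1/N sum(yi - (m xi + b))²
dMSE/dm = -2/N sum(xi (yi - (m xi + b)))
dMSE/db = -2/N sum(yi - (m xi + b))
Setting both partial derivatives to 0 and solving:
1. b = y(bar) - m x(bar)
2. m = (sum(xi yi) - N x(bar) y(bar)) / (sum(xi²) - N x(bar)²)
MAE = 1/N sum|yi - (v0 xi + v1)|
1. dMAE/dv0 = -1/N sum(xi sgn(yi - (v0 xi + v1)))
2. dMAE/dv1 = -1/N sum(sgn(yi - (v0 xi + v1)))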
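The MSE partial derivatives asked for in this card translate directly into NumPy. The function name get_gradients_mse is an assumption, chosen to mirror get_gradients_mae; the stochastic gradient descent cards call a helper named get_gradients, and this is one plausible candidate for it.

```python
import numpy as np

def get_gradients_mse(X, y, m, b):
    # Residuals of the line y = m * x + b
    residual = y - (m * X + b)
    # dMSE/dm = -2/N * sum(x_i * residual_i)
    d_m = -2.0 * np.mean(X * residual)
    # dMSE/db = -2/N * sum(residual_i)
    d_b = -2.0 * np.mean(residual)
    return d_m, d_b

X = np.array([0.0, 1.0, 2.0])
y = 2.0 * X + 1.0

at_start = get_gradients_mse(X, y, 0.0, 0.0)    # far from the true line
at_optimum = get_gradients_mse(X, y, 2.0, 1.0)  # on the true line
```

At the true slope and intercept both partial derivatives are 0, which is the condition the analytical solution solves for.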

New cards

Using Python and NumPy, write the function: get_gradients_mae(X, y, v0, v1). The function should compute and return the gradient for the slope and intercept based on your derivations in Part A.

import numpy as np

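def get_gradients_mae(X, y, v0, v1):
    # Predictions of the line y = v0 * x + v1
    y_pred = v0 * X + v1
    # Partial derivatives of the MAE with respect to slope and intercept
    d_v0 = -1 * np.mean(X * np.sign(y - y_pred))
    d_v1 = -1 * np.mean(np.sign(y - y_pred))

    return d_v0, d_v1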

New cards

Write a Python function fit_analytical(X, y) that computes the exact slope (m) and intercept (b) for a simple 1D Linear Regression using the analytical formulas (Ordinary Least Squares). Do not use loops or Gradient Descent.

import numpy as np

def fit_analytical(X, y):
    x_mean = np.mean(X)
    y_mean = np.mean(y)
    
    numerator = np.sum((X - x_mean) * (y - y_mean))
    denominator = np.sum((X - x_mean)**2)
    m = numerator / denominator
    
    b = y_mean - m * x_mean
    
    return m, b
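The closed form can be sanity-checked against a library fit. This sketch uses np.polyfit as the reference rather than sklearn, which is a substitution made here to keep the example NumPy-only; the data is synthetic and noise-free.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * X + 2.0  # points exactly on the line y = 3x + 2

# Same closed form as fit_analytical, written inline
m = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b = y.mean() - m * X.mean()

# Degree-1 polynomial fit returns [slope, intercept]
m_ref, b_ref = np.polyfit(X, y, 1)
```

Both approaches recover a slope of 3 and an intercept of 2 up to floating-point rounding.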

New cards

Write the random sampling method for stochastic gradient descent in python.

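import numpy as np

# get_gradients is assumed to be defined elsewhere (for example, the MSE gradients)
def sgd_method_1(X, y, m, b, alpha, max_epochs, B):
    N = len(X)

    for i in range(max_epochs):
        # Draw B distinct random indices for this mini-batch
        idx = np.random.choice(N, B, replace=False)

        X_batch = X[idx]
        y_batch = y[idx]

        grad_m, grad_b = get_gradients(X_batch, y_batch, m, b)

        m = m - alpha * grad_m
        b = b - alpha * grad_b

    return m, b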

New cards

Write the permutation method for stochastic gradient descent in python

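import numpy as np

# get_gradients is assumed to be defined elsewhere (for example, the MSE gradients)
def sgd_method_2(X, y, m, b, alpha, max_epochs, B):
    N = len(X)

    for i in range(max_epochs):
        # Shuffle all indices once per epoch
        indices = np.random.permutation(N)

        # Walk through the shuffled data in consecutive batches of size B
        for j in range(0, N, B):
            batch_idx = indices[j : j+B]

            X_batch = X[batch_idx]
            y_batch = y[batch_idx]

            # [Gradient function here]
            grad_m, grad_b = get_gradients(X_batch, y_batch, m, b)

            m = m - alpha * grad_m
            b = b - alpha * grad_b

    return m, b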
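An end-to-end run of the permutation method might look like the sketch below. Several pieces are assumptions: get_gradients is filled in with MSE gradients, the function is renamed sgd_permutation, a seeded np.random.default_rng replaces np.random.permutation for reproducibility, and the data is synthetic and noise-free.

```python
import numpy as np

def get_gradients(X, y, m, b):
    # Assumed helper: MSE gradients for the line y = m * x + b
    residual = y - (m * X + b)
    return -2.0 * np.mean(X * residual), -2.0 * np.mean(residual)

def sgd_permutation(X, y, m, b, alpha, max_epochs, B, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(max_epochs):
        indices = rng.permutation(N)      # reshuffle once per epoch
        for j in range(0, N, B):
            idx = indices[j : j + B]      # every sample is used once per epoch
            grad_m, grad_b = get_gradients(X[idx], y[idx], m, b)
            m = m - alpha * grad_m
            b = b - alpha * grad_b
    return m, b

X = np.linspace(0.0, 1.0, 50)
y = 3.0 * X + 2.0                         # true slope 3, intercept 2

m, b = sgd_permutation(X, y, 0.0, 0.0, alpha=0.1, max_epochs=500, B=10)
```

Because the data has no noise, every mini-batch agrees on the same minimum, so the estimates settle close to m = 3 and b = 2; with noisy data they would keep fluctuating around the optimum.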

New cards

What is an epoch?

One complete pass through the entire training set.
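In mini-batch terms, an epoch corresponds to a fixed number of parameter updates. A small sketch, with the dataset size N and batch size B picked arbitrarily for illustration:

```python
import math

N = 1000  # number of training samples (illustrative)
B = 32    # mini-batch size (illustrative)

# One update per batch; a final partial batch still counts as an update
updates_per_epoch = math.ceil(N / B)

# Same count as the batch start positions in a range(0, N, B) loop
batch_starts = list(range(0, N, B))
last_batch_size = N - batch_starts[-1]
```

Here one epoch is 32 updates, and the last batch holds only 8 samples because 1000 is not a multiple of 32.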