In gradient descent, what is the purpose of the learning rate alpha?
It sets the step size: how far the parameters move along the negative gradient at each update, and therefore how quickly the model learns.
What is typically used as the cost function in Linear Regression with Gradient Descent?
Mean Squared Error
What does the gradient represent in gradient descent?
Rate of change of the cost with respect to the parameters
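A quick way to see the gradient as a rate of change is to compare an analytic gradient with a finite-difference estimate. This is only a sketch: the bowl-shaped cost `F`, the helper `numerical_gradient`, and the step `h` are illustrative assumptions.

```python
import numpy as np

def F(p):
    # Illustrative cost with its minimum at (1, 1)
    return (p[0] - 1) ** 2 + (p[1] - 1) ** 2

def grad_F(p):
    # Analytic gradient: one partial derivative per parameter
    return np.array([2 * (p[0] - 1), 2 * (p[1] - 1)])

def numerical_gradient(f, p, h=1e-6):
    # Central differences: change in cost per unit change in each parameter
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for k in range(p.size):
        step = np.zeros_like(p)
        step[k] = h
        g[k] = (f(p + step) - f(p - step)) / (2 * h)
    return g

p0 = np.array([3.0, 3.0])
analytic = grad_F(p0)
numeric = numerical_gradient(F, p0)
```

The two results should agree closely, which is a common sanity check for a hand-derived gradient.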
Why do we normalize data before applying gradient descent?
To improve convergence speed
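One common form of normalization is z-score standardization, sketched below. The helper name `standardize` and the sample matrix are assumptions for illustration; other scalings (such as min-max) are also used.

```python
import numpy as np

def standardize(X):
    # Shift each column to mean 0 and scale it to standard deviation 1
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

# Two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
X_scaled, mean, std = standardize(X)
```

After scaling, both features have comparable ranges, so a single learning rate suits every parameter. The returned `mean` and `std` would be reused to scale new data the same way.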
If the Gradient Descent method does not converge, what could be the problem?
The learning rate is too high, causing oscillations or divergence
The learning rate is too low, so progress is too slow to reach the minimum within the iteration limit
Unscaled data leading to slow convergence
A complex non-convex surface containing local minima or saddle points where optimizer gets stuck
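The effect of the learning rate can be illustrated on f(x) = x², whose gradient is 2x. This is a minimal sketch; the function `run_gd`, the starting point, and the three alpha values are illustrative assumptions.

```python
def run_gd(alpha, x0=5.0, steps=50):
    # Plain gradient descent on f(x) = x**2, where f'(x) = 2x
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

x_high = run_gd(alpha=1.1)    # each step overshoots: |x| grows (divergence)
x_good = run_gd(alpha=0.1)    # shrinks steadily toward the minimum at 0
x_low = run_gd(alpha=0.001)   # moves in the right direction, but barely
```

Each update multiplies x by (1 - 2·alpha), so the iterates grow when that factor has magnitude above 1 and crawl when it is close to 1.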
Consider the function F(x, y) = (x-1)^2 + (y-1)^2 with initial guess (3,3). Compute the first step of the gradient descent method using the best possible alpha (exact line search).
Gradient F = [(2(x-1)), (2(y-1))]
[(2(3-1)), (2(3-1))] = [4, 4]
x1 = x0 - a (gradient) → 3 - 4a
y1 = y0 - a (gradient) → 3 - 4a
F(a) = ((3 - 4a)-1)² + ((3-4a)-1)²
= (2-4a)² + (2-4a)²
= 2(2-4a)²
F’(a) = 4(2-4a)(-4) = -16(2-4a)
-16(2-4a) = 0
-32 + 64a = 0
64a = 32
a = 0.5
x1 = 3-4(0.5) = 1
y1 = 3-4(0.5) = 1
1st step = (1,1)
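The hand calculation can be checked numerically. The sketch below scans a grid of candidate step sizes instead of solving F’(a) = 0; the grid itself is an assumption made for illustration.

```python
import numpy as np

def F(p):
    return (p[0] - 1) ** 2 + (p[1] - 1) ** 2

def grad_F(p):
    return np.array([2 * (p[0] - 1), 2 * (p[1] - 1)])

p0 = np.array([3.0, 3.0])
g = grad_F(p0)                        # gradient at the initial guess

# Evaluate F along the descent direction for each candidate alpha
alphas = np.linspace(0.0, 1.0, 101)   # assumed search grid
values = np.array([F(p0 - a * g) for a in alphas])

best_alpha = alphas[np.argmin(values)]
p1 = p0 - best_alpha * g              # first gradient descent step
```

On this grid the smallest value of F occurs at a = 0.5, matching the derivation.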
Write the formula for exact line search for x and y
x1 = x0 - a (gradient x)
y1 = y0 - a (gradient y)
where a is the value that minimizes F(x0 - a (gradient x), y0 - a (gradient y))
What are the 3 ways to find a regression line?
Analytical Method: Solving for the slope (m) and intercept (b) using OLS, derived from setting the partial derivatives of the MSE equal to 0. Calculated from scratch using NumPy operations.
Optimization Method (Gradient Descent): Iteratively updating the slope (m) and intercept (b) using the partial derivatives of the loss function.
Library/Built-in Method: Using sklearn to abstract the math and fit the model with built-in functions.
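The three routes can be compared side by side on a toy dataset. This is a sketch under assumptions: the data, learning rate, and iteration count are illustrative, and `np.polyfit` stands in for the library route so the example needs only NumPy (the card itself names sklearn).

```python
import numpy as np

# Illustrative data lying exactly on y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

# 1. Analytical: OLS closed form
m_ols = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b_ols = y.mean() - m_ols * X.mean()

# 2. Optimization: gradient descent on the MSE
m_gd, b_gd, alpha = 0.0, 0.0, 0.05
for _ in range(5000):
    residual = y - (m_gd * X + b_gd)
    m_gd -= alpha * (-2 * np.mean(X * residual))
    b_gd -= alpha * (-2 * np.mean(residual))

# 3. Built-in: a degree-1 polyfit returns [slope, intercept]
m_lib, b_lib = np.polyfit(X, y, 1)
```

All three should land on approximately the same slope and intercept. Gradient descent only approaches the answer, so it is compared with a tolerance.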
Below is an incomplete Gradient Descent function. Fill in the missing code to accomplish two things:
Implement the step update.
Add a momentum mechanic where the learning rate alpha is cut in half (alpha / 2) if the absolute value of the current function value F(x) is strictly greater than 100.
def GradientDescentMod(DF, F, x0, maxit=100, alpha=0.1):
    x = np.copy(x0)
    for i in range(maxit):
        # 1. Fill in the update rule for x
        x = ______________________________
        # 2. Modify alpha if F(x) > 100
        if ______________________________:
            alpha = ___________________
    return x
x = x - alpha * DF(x)
if abs(F(x)) > 100:
    alpha = alpha / 2
What is Stochastic Gradient Descent?
Instead of looking at the whole dataset, SGD picks a single data point at each iteration, calculates the gradient for just that point, and takes a step.
How does gradient descent know when to stop taking steps?
It stops taking steps when the step size (learning rate times gradient) is very close to 0, or when the maximum number of iterations is reached.
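A stopping rule of this kind might look like the sketch below. The function name `gradient_descent`, the tolerance `tol`, and the iteration cap `maxit` are assumptions for illustration.

```python
import numpy as np

def gradient_descent(DF, x0, alpha=0.1, tol=1e-8, maxit=10_000):
    x = np.array(x0, dtype=float)
    for i in range(maxit):
        step = alpha * DF(x)            # step size = learning rate * gradient
        x = x - step
        if np.linalg.norm(step) < tol:  # step is effectively zero: stop
            return x, i + 1
    return x, maxit                     # safety cap reached without converging

def DF(p):
    # Gradient of (x - 1)^2 + (y - 1)^2
    return 2 * (p - 1)

x_min, n_iter = gradient_descent(DF, [3.0, 3.0])
```

Here the loop ends well before `maxit` because the gradient, and therefore the step, shrinks as the iterate approaches the minimum at (1, 1).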
How does the gradient of the MAE behave differently from the MSE? What are the two main consequences of this behavior?
The MAE uses the sign function, meaning the gradient is always a constant magnitude (±1). MSE uses the raw error, so its gradient shrinks as it gets closer to the correct answer.
MAE is highly robust to outliers because a massive error still only generates +1 or -1.
Since the gradient never shrinks, MAE will “bounce” back and forth across the minimum instead of settling perfectly, requiring you to manually reduce learning rate over time.
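The contrast shows up directly in the per-sample gradients with respect to the prediction. In this sketch the residual values are made up for illustration, with the last one playing the role of an outlier.

```python
import numpy as np

# Residuals y - y_pred: small, moderate, and an outlier
errors = np.array([0.1, 1.0, 1000.0])

# d/d(y_pred) of (y - y_pred)^2 is -2 * (y - y_pred): scales with the error
mse_grad = -2 * errors

# d/d(y_pred) of |y - y_pred| is -sign(y - y_pred): always magnitude 1
mae_grad = -np.sign(errors)
```

The outlier dominates the MSE gradient, while every sample contributes the same magnitude under MAE, even the one that is almost correct.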
Consider a linear regression model defined by the equation y = mx + b where m is the slope and b is the intercept.
Write down the cost function for Mean Squared Error (MSE) and derive its partial derivatives with respect to m and b. Solve for m and b.
Write down the cost function for Mean Absolute Error (MAE) and derive its partial derivatives with respect to v0 and v1
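MSE = (1/N) sum(yi - (m xi + b))²
dMSE/dm = -(2/N) sum(xi (yi - (m xi + b)))
dMSE/db = -(2/N) sum(yi - (m xi + b))
Setting both partial derivatives equal to 0 and solving:
b = y(bar) - m x(bar)
m = (sum(xi yi) - N x(bar) y(bar)) / (sum(xi²) - N x(bar)²)
MAE = (1/N) sum|yi - (v0 xi + v1)|
dMAE/dv0 = -(1/N) sum(xi sgn(yi - (v0 xi + v1)))
dMAE/dv1 = -(1/N) sum(sgn(yi - (v0 xi + v1)))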
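The MSE partial derivatives, -(2/N) sum(xi (yi - (m xi + b))) for m and -(2/N) sum(yi - (m xi + b)) for b, translate to NumPy roughly as follows. The function name `get_gradients_mse` is an assumption chosen to mirror the MAE version.

```python
import numpy as np

def get_gradients_mse(X, y, m, b):
    # Residuals between the targets and the current line's predictions
    residual = y - (m * X + b)
    d_m = -2 * np.mean(X * residual)   # partial derivative with respect to m
    d_b = -2 * np.mean(residual)       # partial derivative with respect to b
    return d_m, d_b

# Illustrative data on y = 2x + 1
X = np.array([0.0, 1.0, 2.0])
y = 2.0 * X + 1.0
grad_at_start = get_gradients_mse(X, y, 0.0, 0.0)
grad_at_optimum = get_gradients_mse(X, y, 2.0, 1.0)
```

At the true slope and intercept both gradients are zero, which is the condition the analytical solution is derived from.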
Using Python and NumPy, write the function: get_gradients_mae(X, y, v0, v1). The function should compute and return the gradient for the slope and intercept based on your derivations in Part A.
import numpy as np
def get_gradients_mae(X, y, v0, v1):
    # Predictions of the current line: v0 is the slope, v1 the intercept
    y_pred = v0 * X + v1
    # MAE gradients use only the sign of each residual
    d_v0 = -1 * np.mean(X * np.sign(y - y_pred))
    d_v1 = -1 * np.mean(np.sign(y - y_pred))
    return d_v0, d_v1
Write a Python function fit_analytical(X, y) that computes the exact slope (m) and intercept (b) for a simple 1D Linear Regression using the analytical formulas (Ordinary Least Squares). Do not use loops or Gradient Descent.
import numpy as np
def fit_analytical(X, y):
    x_mean = np.mean(X)
    y_mean = np.mean(y)
    # Slope: sum of centered cross products over sum of centered squares of X
    numerator = np.sum((X - x_mean) * (y - y_mean))
    denominator = np.sum((X - x_mean)**2)
    m = numerator / denominator
    # Intercept: the fitted line passes through (x_mean, y_mean)
    b = y_mean - m * x_mean
    return m, b
Write the random sampling method for stochastic gradient descent in python.
import numpy as np
def sgd_method_1(X, y, m, b, alpha, max_epochs, B):
    # Assumes get_gradients(X, y, m, b) is defined elsewhere and returns (grad_m, grad_b)
    N = len(X)
    for i in range(max_epochs):
        # Draw B distinct random indices for this step's mini-batch
        idx = np.random.choice(N, B, replace=False)
        X_batch = X[idx]
        y_batch = y[idx]
        grad_m, grad_b = get_gradients(X_batch, y_batch, m, b)
        m = m - alpha * grad_m
        b = b - alpha * grad_b
    return m, b
Write the permutation method for stochastic gradient descent in python
import numpy as np
def sgd_method_2(X, y, m, b, alpha, max_epochs, B):
    # Assumes get_gradients(X, y, m, b) is defined elsewhere and returns (grad_m, grad_b)
    N = len(X)
    for i in range(max_epochs):
        # Shuffle once per epoch so every sample is used exactly once
        indices = np.random.permutation(N)
        for j in range(0, N, B):
            batch_idx = indices[j : j+B]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]
            # [Gradient function here]
            grad_m, grad_b = get_gradients(X_batch, y_batch, m, b)
            m = m - alpha * grad_m
            b = b - alpha * grad_b
    return m, b
What is an epoch?
One complete pass through the entire training set.
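To make the definition concrete, the sketch below splits one shuffled pass into mini-batches and checks that every sample appears exactly once. The dataset size `N`, batch size `B`, and seed are illustrative assumptions.

```python
import numpy as np

N, B = 10, 4                        # assumed dataset size and batch size
rng = np.random.default_rng(0)

# One epoch: shuffle all indices, then walk through them in chunks of B
indices = rng.permutation(N)
batches = [indices[j:j + B] for j in range(0, N, B)]

updates_per_epoch = len(batches)    # ceil(N / B) parameter updates
seen = np.sort(np.concatenate(batches))
```

With N = 10 and B = 4 an epoch consists of three updates (batches of 4, 4, and 2 samples), and together they cover each index from 0 to 9 once.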