Flashcards covering key concepts from the lecture on machine learning algorithms, gradient descent, derivatives, and learning rules in the context of neural networks.
Supervised Learning
A machine learning paradigm in which the training data consist of input patterns paired with their desired responses; learning proceeds by minimizing an empirical risk function over these pairs.
Unsupervised Learning
A machine learning paradigm in which the training data contain no explicit desired responses; the algorithm instead looks for patterns or structure in the input data.
Reinforcement Learning
A machine learning paradigm in which an agent learns to make decisions by taking actions in an environment, guided by reward signals rather than explicit desired responses.
Empirical Risk Function Framework
A framework in which a machine learning algorithm is specified by an objective function, the empirical risk, that it seeks to minimize; the empirical risk is the average loss incurred on the training data.
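As a worked form of this idea (a sketch consistent with the cards below; the per-example loss symbol c and the pairs (s_i, y_i) are assumed notation, not taken verbatim from the lecture), the empirical risk over n training records can be written as:

```latex
\ell_n(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} c\big(s_i, y_i; \theta\big)
```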
Gradient Descent
An algorithm used to minimize an objective function by iteratively taking steps proportional to the negative of the gradient of the function.
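In symbols, one common form of the update (a sketch using the notation of the other cards, where gamma_t is the learning rate and g is the gradient defined below):

```latex
\theta(t+1) \;=\; \theta(t) \;-\; \gamma_t \, g\big(\theta(t)\big),
\qquad
g(\theta) \;=\; \left( \frac{d\ell_n}{d\theta} \right)^{\mathsf{T}}
```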
Output Unit (ÿ, "y double dot")
A component in a learning machine that produces a prediction, often a sigmoidal function of input patterns and parameters.
Input Units (s_i)
Nodes that receive and represent components of an input pattern, where each component is an activation level.
Connection Weights (theta)
Parameters in a learning machine that represent the strength of the connections between input units and an output unit.
Sigmoidal Function (Logistic Sigmoid)
A common activation function defined as 1 / (1 + e^(-x)), which outputs a value between 0 and 1, often interpreted as a probability.
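A minimal Python sketch of this activation function (NumPy assumed; not taken from the lecture itself):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```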
Goal of Learning
To minimize the empirical risk function by adjusting the machine's parameters (connection weights).
Batch Gradient Descent
An optimization strategy where all training data is used to compute the derivative of the empirical risk function to update parameters in each iteration.
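A minimal batch gradient descent sketch for the single sigmoidal output unit described above, assuming the log-likelihood (cross-entropy) loss defined later in the deck (NumPy; variable names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_gradient_descent(S, y, gamma=0.1, iterations=1000):
    """Fit connection weights theta for a single sigmoidal output unit.

    S : (n, q) matrix of input patterns (rows are training records).
    y : (n,) vector of binary desired responses.
    The full training set is used to compute the gradient at every iteration.
    """
    n, q = S.shape
    theta = np.zeros(q)
    for _ in range(iterations):
        y_hat = sigmoid(S @ theta)      # predictions for all records
        grad = S.T @ (y_hat - y) / n    # gradient of the cross-entropy risk
        theta = theta - gamma * grad    # step opposite the gradient
    return theta
```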
Learning Rate (gamma_t)
A scalar variable in gradient descent algorithms that determines the step size for parameter updates, also called step size.
Function Decomposition
A technique to dissect a complicated function into a sequence of simpler, easy-to-work-with functions for easier derivative computation.
Scalar Chain Rule
A calculus rule for finding the derivative of a composite function, stating that if f is a function of h, and h is a function of g, and g is a function of theta, then df/d(theta) = df/dh * dh/dg * dg/d(theta).
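As a worked illustration combining the two cards above (a sketch; the intermediate symbol psi for the weighted sum and the squared-error loss are assumed choices, not taken verbatim from the lecture), the per-example loss for the sigmoidal unit decomposes and differentiates link by link:

```latex
c(\theta) = f\big(h(g(\theta))\big), \qquad
g(\theta) = \theta^{\mathsf{T}} s = \psi, \qquad
h(\psi) = \frac{1}{1 + e^{-\psi}} = \ddot{y}, \qquad
f(\ddot{y}) = (y - \ddot{y})^2
```

```latex
\frac{dc}{d\theta}
= \frac{df}{d\ddot{y}} \cdot \frac{d\ddot{y}}{d\psi} \cdot \frac{d\psi}{d\theta}
= \underbrace{-2\,(y - \ddot{y})}_{df/d\ddot{y}}
  \;\underbrace{\ddot{y}\,(1 - \ddot{y})}_{d\ddot{y}/d\psi}
  \;\underbrace{s^{\mathsf{T}}}_{d\psi/d\theta}
```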
Continuously Differentiable
A property of a function where all its partial derivatives exist and are continuous.
Gradient (g)
Defined as the transpose of the derivative of a scalar function with respect to a vector of parameters (dl/d(theta)), resulting in a column vector.
Derivative of a Vector of Functions (Jacobian Matrix)
For a vector-valued function v that maps R^q to R^m, its derivative with respect to theta is an m x q matrix where element (i, j) is the derivative of the i-th component function (v_i) with respect to the j-th input variable (theta_j).
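A small concrete instance (illustrative, not from the lecture): for v mapping R^2 to R^2 with components v_1 = theta_1 theta_2 and v_2 = theta_1 + theta_2^2, the Jacobian is the 2 x 2 matrix of partial derivatives:

```latex
v(\theta) = \begin{bmatrix} \theta_1 \theta_2 \\ \theta_1 + \theta_2^2 \end{bmatrix}
\quad\Longrightarrow\quad
\frac{dv}{d\theta} =
\begin{bmatrix}
\partial v_1 / \partial \theta_1 & \partial v_1 / \partial \theta_2 \\
\partial v_2 / \partial \theta_1 & \partial v_2 / \partial \theta_2
\end{bmatrix}
=
\begin{bmatrix}
\theta_2 & \theta_1 \\
1 & 2\theta_2
\end{bmatrix}
```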
Vector Chain Rule
A generalization of the chain rule for vector-valued functions, using Jacobian matrices: dF/d(theta) = dU/dV * dV/d(theta).
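Dimension bookkeeping makes the rule concrete (a sketch; F = U(V(theta)) with V mapping R^q to R^m and U mapping R^m to R^p are assumed placeholders):

```latex
\underbrace{\frac{dF}{d\theta}}_{p \times q}
\;=\;
\underbrace{\frac{dU}{dV}}_{p \times m}
\;
\underbrace{\frac{dV}{d\theta}}_{m \times q}
```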
Adaptive Learning Rule
A learning rule where parameters are updated using only a single data record at a time, rather than the entire dataset.
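A minimal sketch of the adaptive (per-record) counterpart of the batch routine shown earlier, again assuming the sigmoidal unit and the cross-entropy gradient (NumPy; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_learning(S, y, gamma=0.1, epochs=10):
    """Update theta after each individual training record, not the full batch."""
    n, q = S.shape
    theta = np.zeros(q)
    for _ in range(epochs):
        for i in range(n):
            y_hat = sigmoid(S[i] @ theta)   # prediction for one record
            grad = (y_hat - y[i]) * S[i]    # gradient from that record only
            theta = theta - gamma * grad    # immediate parameter update
    return theta
```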
Log-likelihood Loss (Cross-entropy Loss)
An objective function particularly suitable for supervised learning tasks where the target (y_i) is binary (0 or 1); the loss is the negative log of the probability the machine assigns to the observed target, i.e. -log(ÿ) when y_i = 1 and -log(1 - ÿ) when y_i = 0.
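Written as a single expression for a binary target (the two cases in the definition combined):

```latex
c_i(\theta) \;=\; -\Big[\, y_i \log \ddot{y}_i \;+\; (1 - y_i)\,\log\big(1 - \ddot{y}_i\big) \Big]
```

This reduces to -log(ÿ_i) when y_i = 1 and to -log(1 - ÿ_i) when y_i = 0.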
Softmax
An activation function used in multi-class classification where output nodes represent probabilities for each class, all of which are non-negative and sum up to one.
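A minimal Python sketch (NumPy assumed; the max subtraction is a standard numerical-stability trick, not something stated in the lecture):

```python
import numpy as np

def softmax(z):
    """Map a vector of real scores to probabilities that sum to one."""
    z = z - np.max(z)   # shift scores for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # non-negative entries summing to 1.0
```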