Flashcards covering key concepts from the lecture on machine learning algorithms, gradient descent, derivatives, and learning rules in the context of neural networks.
Supervised Learning
A machine learning paradigm in which the training data consist of input patterns paired with their desired responses; learning proceeds by minimizing an empirical risk function over these pairs.
Unsupervised Learning
A machine learning paradigm in which the training data contain no explicit desired responses; the algorithm instead looks for patterns or structure in the input data.
Reinforcement Learning
A machine learning paradigm in which an agent learns to make decisions by taking actions in an environment, guided by reward signals rather than explicit desired responses.
Empirical Risk Function Framework
A framework in which a machine learning algorithm is specified by an objective function, the empirical risk, that it seeks to minimize; the empirical risk is the average loss incurred on the training data.
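As a worked form of this idea (a sketch consistent with the cards below; the per-example loss symbol c and the pairs (s_i, y_i) are assumed notation, not taken verbatim from the lecture), the empirical risk over n training records can be written as:

```latex
\ell_n(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} c\big(s_i, y_i; \theta\big)
```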
Gradient Descent
An algorithm used to minimize an objective function by iteratively taking steps proportional to the negative of the gradient of the function.
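In symbols, one common form of the update (a sketch using the notation of the other cards, where gamma_t is the learning rate and g is the gradient defined below):

```latex
\theta(t+1) \;=\; \theta(t) \;-\; \gamma_t \, g\big(\theta(t)\big),
\qquad
g(\theta) \;=\; \left( \frac{d\ell_n}{d\theta} \right)^{\mathsf{T}}
```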
Output Unit (ÿ, "y double dot")
A component in a learning machine that produces a prediction, often a sigmoidal function of input patterns and parameters.
Input Units (s_i)
Nodes that receive and represent components of an input pattern, where each component is an activation level.
Connection Weights (theta)
Parameters in a learning machine that represent the strength of the connections between input units and an output unit.
Sigmoidal Function (Logistic Sigmoid)
A common activation function defined as 1 / (1 + e^(-x)), which outputs a value between 0 and 1, often interpreted as a probability.
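A minimal Python sketch of this activation function (NumPy assumed; not taken from the lecture itself):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```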
Goal of Learning
To minimize the empirical risk function by adjusting the machine's parameters (connection weights).
Batch Gradient Descent
An optimization strategy where all training data is used to compute the derivative of the empirical risk function to update parameters in each iteration.
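A minimal batch gradient descent sketch for the single sigmoidal output unit described above, assuming the log-likelihood (cross-entropy) loss defined later in the deck (NumPy; variable names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_gradient_descent(S, y, gamma=0.1, iterations=1000):
    """Fit connection weights theta for a single sigmoidal output unit.

    S : (n, q) matrix of input patterns (rows are training records).
    y : (n,) vector of binary desired responses.
    The full training set is used to compute the gradient at every iteration.
    """
    n, q = S.shape
    theta = np.zeros(q)
    for _ in range(iterations):
        y_hat = sigmoid(S @ theta)      # predictions for all records
        grad = S.T @ (y_hat - y) / n    # gradient of the cross-entropy risk
        theta = theta - gamma * grad    # step opposite the gradient
    return theta
```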
Learning Rate (gamma_t)
A scalar variable in gradient descent algorithms that determines the step size for parameter updates, also called step size.
Function Decomposition
A technique to dissect a complicated function into a sequence of simpler, easy-to-work-with functions for easier derivative computation.
Scalar Chain Rule
A calculus rule for finding the derivative of a composite function, stating that if f is a function of h, and h is a function of g, and g is a function of theta, then df/d(theta) = df/dh * dh/dg * dg/d(theta).
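As a worked illustration combining the two cards above (a sketch; the intermediate symbol psi for the weighted sum and the squared-error loss are assumed choices, not taken verbatim from the lecture), the per-example loss for the sigmoidal unit decomposes and differentiates link by link:

```latex
c(\theta) = f\big(h(g(\theta))\big), \qquad
g(\theta) = \theta^{\mathsf{T}} s = \psi, \qquad
h(\psi) = \frac{1}{1 + e^{-\psi}} = \ddot{y}, \qquad
f(\ddot{y}) = (y - \ddot{y})^2
```

```latex
\frac{dc}{d\theta}
= \frac{df}{d\ddot{y}} \cdot \frac{d\ddot{y}}{d\psi} \cdot \frac{d\psi}{d\theta}
= \underbrace{-2\,(y - \ddot{y})}_{df/d\ddot{y}}
  \;\underbrace{\ddot{y}\,(1 - \ddot{y})}_{d\ddot{y}/d\psi}
  \;\underbrace{s^{\mathsf{T}}}_{d\psi/d\theta}
```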
Continuously Differentiable
A property of a function where all its partial derivatives exist and are continuous.
Gradient (g)
Defined as the transpose of the derivative of a scalar function with respect to a vector of parameters (dl/d(theta)), resulting in a column vector.
Derivative of a Vector of Functions (Jacobian Matrix)
For a vector-valued function v that maps R^q to R^m, its derivative with respect to theta is an m x q matrix where element (i, j) is the derivative of the i-th component function (v_i) with respect to the j-th input variable (theta_j).
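A small concrete instance (illustrative, not from the lecture): for v mapping R^2 to R^2 with components v_1 = theta_1 theta_2 and v_2 = theta_1 + theta_2^2, the Jacobian is the 2 x 2 matrix of partial derivatives:

```latex
v(\theta) = \begin{bmatrix} \theta_1 \theta_2 \\ \theta_1 + \theta_2^2 \end{bmatrix}
\quad\Longrightarrow\quad
\frac{dv}{d\theta} =
\begin{bmatrix}
\partial v_1 / \partial \theta_1 & \partial v_1 / \partial \theta_2 \\
\partial v_2 / \partial \theta_1 & \partial v_2 / \partial \theta_2
\end{bmatrix}
=
\begin{bmatrix}
\theta_2 & \theta_1 \\
1 & 2\theta_2
\end{bmatrix}
```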
Vector Chain Rule
A generalization of the chain rule for vector-valued functions, using Jacobian matrices: dF/d(theta) = dU/dV * dV/d(theta).
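Dimension bookkeeping makes the rule concrete (a sketch; F = U(V(theta)) with V mapping R^q to R^m and U mapping R^m to R^p are assumed placeholders):

```latex
\underbrace{\frac{dF}{d\theta}}_{p \times q}
\;=\;
\underbrace{\frac{dU}{dV}}_{p \times m}
\;
\underbrace{\frac{dV}{d\theta}}_{m \times q}
```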
Adaptive Learning Rule
A learning rule where parameters are updated using only a single data record at a time, rather than the entire dataset.
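A minimal sketch of the adaptive (per-record) counterpart of the batch routine shown earlier, again assuming the sigmoidal unit and the cross-entropy gradient (NumPy; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_learning(S, y, gamma=0.1, epochs=10):
    """Update theta after each individual training record, not the full batch."""
    n, q = S.shape
    theta = np.zeros(q)
    for _ in range(epochs):
        for i in range(n):
            y_hat = sigmoid(S[i] @ theta)   # prediction for one record
            grad = (y_hat - y[i]) * S[i]    # gradient from that record only
            theta = theta - gamma * grad    # immediate parameter update
    return theta
```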
Log-likelihood Loss (Cross-entropy Loss)
An objective function particularly suitable for supervised learning tasks where the target (y_i) is binary (0 or 1); the loss is the negative log of the probability the machine assigns to the observed target, i.e. -log(ÿ) when y_i = 1 and -log(1 - ÿ) when y_i = 0.
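Written as a single expression for a binary target (the two cases in the definition combined):

```latex
c_i(\theta) \;=\; -\Big[\, y_i \log \ddot{y}_i \;+\; (1 - y_i)\,\log\big(1 - \ddot{y}_i\big) \Big]
```

This reduces to -log(ÿ_i) when y_i = 1 and to -log(1 - ÿ_i) when y_i = 0.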
Softmax
An activation function used in multi-class classification where output nodes represent probabilities for each class, all of which are non-negative and sum up to one.
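A minimal Python sketch (NumPy assumed; the max subtraction is a standard numerical-stability trick, not something stated in the lecture):

```python
import numpy as np

def softmax(z):
    """Map a vector of real scores to probabilities that sum to one."""
    z = z - np.max(z)   # shift scores for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # non-negative entries summing to 1.0
```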