Convergence Concepts and Continuous Functions
Announcements, Logistics, and Module Overview
Remote Meetings: This week (Tuesday and Thursday) classes will be held remotely. The plan is to return to in-person sessions next week.
TA Office Hours: The Teaching Assistant (TA) holds office hours every Tuesday from 12 PM to 1 PM. The TA's email address is available on the syllabus for scheduling alternative meeting times, especially if a group of students wishes to meet.
Homework Adjustments (Homework 1): Two problems were removed from Homework 1, and three new problems were added. At least one of the added problems is considered very easy. Students who have already worked on the removed problems should save their work, as these problems will be incorporated into Homework 2.
Syllabus and Lecture Notes Updates: Minor updates to the syllabus and lecture notes will be released after this lecture to ensure they are current.
Module Transition: The lecture is transitioning from Module M4 (hidden units) to Module M12, which covers convergence concepts and continuous functions.
Introduction to Convergence Concepts and Continuous Functions (Module M12)
Purpose: This module serves as a foundational stepping stone for:
Analyzing the convergence behavior of learning algorithms.
Designing machine learning algorithms with predictable and desirable behaviors.
Developing effective gradient descent algorithms.
Machine Learning Paradigms: A vast majority (approximately 95%) of machine learning algorithms, including supervised, unsupervised, and reinforcement learning, can be framed as minimizing an empirical risk function.
Gradient Descent Relevance: Gradient descent is a predominant method for minimizing these empirical risk functions.
The Crucial Missing Piece: Derivatives: To implement gradient descent, the computation of derivatives is essential. Understanding continuous functions is a prerequisite for computing these derivatives and for discussing the general behavior of learning algorithms.
Convergence Concepts
Convergence to a Point in Vector Space
Intuition: Imagine a learning machine whose parameters (e.g., \theta_1, \theta_2 for a two-parameter model, generalizable to more components in a vector \theta) evolve during learning. With each data interaction, the parameter vector updates (e.g., \theta_0, \theta_1, \theta_2, \dots), eventually settling on a specific point, \theta^*.
Batch Gradient Descent Example:
In a batch gradient descent algorithm, we start with an initial guess for the learning machine's parameters, \theta_0.
The parameters are updated iteratively using the formula \theta_{t+1} = \theta_t - \alpha \nabla R(\theta_t). Here, \alpha is the learning rate, and \nabla R is the gradient of the empirical risk function computed over the entire dataset (see the sketch after this list).
This iterative process generates a sequence of parameter vectors (\theta_0, \theta_1, \theta_2, \dots) aiming to converge to an optimal parameter vector, \theta^*, which is the goal of learning.
This concept applies broadly, from fine-tuning pre-trained models to training large deep neural networks from scratch.
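To make the update rule concrete, here is a minimal Python sketch of batch gradient descent. The function name, stopping rule, and the quadratic example risk are illustrative assumptions, not details from the lecture:

```python
import numpy as np

def batch_gradient_descent(grad_R, theta0, alpha=0.1, max_iters=1000, tol=1e-8):
    """Minimal batch gradient descent sketch.

    grad_R: function returning the gradient of the empirical risk at theta,
            computed over the entire dataset (assumed to be closed over by the caller).
    """
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iters):
        theta_next = theta - alpha * grad_R(theta)    # theta_{t+1} = theta_t - alpha * grad R(theta_t)
        if np.linalg.norm(theta_next - theta) < tol:  # stop once updates become negligible
            return theta_next
        theta = theta_next
    return theta

# Example: quadratic risk R(theta) = ||theta - target||^2, whose gradient is 2(theta - target).
target = np.array([1.0, -2.0])
theta_star = batch_gradient_descent(lambda th: 2.0 * (th - target), theta0=np.zeros(2))
print(theta_star)  # converges toward the known minimizer `target`
```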
Notation for Convergence to a Point:
\lim_{t \to \infty} \theta_t = \theta^*
\theta_t \to \theta^* \text{ as } t \to \infty
Formal Definition: Given any positive number \epsilon > 0, there exists a time T_\epsilon such that the distance between \theta_t and \theta^* (written ||\theta_t - \theta^*||) is less than \epsilon for all t \ge T_\epsilon.
Significance: Provides a rigorous mathematical language to precisely describe the learning machine's behavior (e.g., stating that parameter vectors converge to a specific learning state).
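A small numerical illustration of the \epsilon-T_\epsilon definition; the geometric sequence \theta_t = \theta^* + 0.5^t v is an assumed example, not from the lecture:

```python
import numpy as np

theta_star = np.array([1.0, -2.0])
v = np.array([1.0, 1.0])
eps = 1e-3

# The sequence theta_t = theta^* + 0.5^t * v converges to theta^*; since the
# distance shrinks monotonically, the first t below eps serves as T_eps.
for t in range(100):
    theta_t = theta_star + 0.5**t * v
    if np.linalg.norm(theta_t - theta_star) < eps:
        print(f"T_eps = {t}: ||theta_t - theta^*|| < {eps} for all t >= {t}")
        break
```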
Convergence to a Set
Motivation: For complex learning machines, or in scenarios with multiple equivalent optimal solutions (e.g., a set of global minimizers), it might be more appropriate to describe convergence to an entire set of points rather than a single point.
Intuition: A sequence of parameter vectors might approach and eventually remain within a defined set S. The learning trajectory could converge to this set from various starting points or along different paths.
Notation: \theta_t \to S \text{ as } t \to \infty
Formal Definition (Conceptual): Similar to point convergence, but instead of measuring the distance to a single point \theta^*, we measure the distance from \theta_t to the closest point in the set S, i.e., d(\theta_t, S) = \min_{s \in S} ||\theta_t - s||. The sequence converges to S if this minimum distance becomes arbitrarily small.
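A minimal sketch of this distance-to-set computation, assuming S is a finite set of vectors (for infinite sets the minimum becomes an infimum); the two-minimizer example is illustrative:

```python
import numpy as np

def distance_to_set(theta, S):
    """Distance from theta to the closest point in a finite set S of vectors."""
    return min(np.linalg.norm(theta - s) for s in S)

# Example: two equivalent global minimizers. The trajectory approaches one of
# them, so its distance to the set S shrinks toward zero.
S = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
for t in range(5):
    theta_t = np.array([1.0 + 0.5**t, 0.5**t])
    print(t, distance_to_set(theta_t, S))
```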
Bounded Sequence
Definition: A sequence of vectors (x_1, x_2, x_3, \dots) is bounded if there exists a finite constant K (where K < \infty) such that the norm of every vector in the sequence satisfies ||x_t|| < K for all t.
Significance: Bounded sequences often lead to easier mathematical analysis in learning problems.
Example 1: Binary Feature Vectors:
Consider feature vectors composed of ones and zeros, i.e., elements of \{0, 1\}^d (e.g., a 1000-dimensional vector of binary features). If the dimensionality d is a fixed constant, then such a sequence is bounded.
Explanation: With a fixed dimension d, there are only finitely many possible feature vectors (2^d of them). The all-ones vector has the largest magnitude, \sqrt{d}, so any constant K > \sqrt{d} bounds the sequence.
Example 2: Gaussian Random Variable Elements:
If the elements of feature vectors are realizations of Gaussian random variables, the sequence would not be bounded.
Explanation: For any finite constant K, a Gaussian random variable exceeds K with positive probability, so no finite K can bound the magnitudes of all possible vectors in such a sequence.
Heuristic: Data sequences whose feature vectors are composed of categorical elements typically result in bounded sequences.
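An empirical illustration of both examples above (a sketch, not a proof; the sample sizes and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

# Binary feature vectors in {0,1}^d: every norm is at most sqrt(d),
# so K = sqrt(d) + 1 bounds the whole sequence.
binary_norms = [np.linalg.norm(rng.integers(0, 2, size=d)) for _ in range(100)]
print(max(binary_norms) <= np.sqrt(d))  # True: the all-ones vector attains sqrt(d)

# Gaussian feature vectors: any finite sample has a finite maximum norm, but over
# all realizations the norms exceed every fixed K with positive probability.
gaussian_norms = [np.linalg.norm(rng.standard_normal(d)) for _ in range(100)]
print(max(gaussian_norms))
```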
Continuous Functions
Intuitive Understanding
A function is continuous if its graph can be traced without lifting the pen from the paper.
Discontinuities occur where there are 'jumps' or breaks in the graph (e.g., a function value abruptly shifting at a certain point).
Recipe for Checking Continuity (Theorem 5.1.7)
Properties of Continuous Functions:
Polynomials are continuous: E.g., L(\theta) = \theta_1 \theta_2 + \theta_3^{10}.
Exponential functions are continuous: E.g., L(\theta) = e^{\theta_1}.
Logarithmic functions are continuous on the positive reals: E.g., L(\theta_1) = \log(\theta_1) for \theta_1 > 0 (where \log denotes the natural logarithm).
Weighted sums of continuous functions are continuous: E.g., L(\theta) = 2e^{\theta_1} + (\theta_1 \theta_2 + \theta_3^{10}).
Products of continuous functions are continuous: E.g., L(\theta) = 2e^{\theta_1} \cdot (\theta_1 \theta_2 + \theta_3^{10}).
Compositions of continuous functions are continuous: If f is continuous and g is continuous, then f(g(x)) is continuous. E.g., L(\theta) = e^{||\theta||^2} (||\theta||^2 is continuous, e^x is continuous, so their composition is continuous).
Ratios of continuous functions are continuous: \frac{f(\theta)}{g(\theta)} is continuous if both f and g are continuous, provided that g(\theta) is never zero.
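A toy function assembled from the rules above; the specific L is an assumed example chosen so each rule appears once:

```python
import numpy as np

# Each piece below is continuous by one of the listed rules, so L is continuous.
def L(theta):
    theta = np.asarray(theta, dtype=float)
    poly = theta[0] * theta[1] + theta[2] ** 10  # polynomial: continuous
    expo = np.exp(theta[0])                      # exponential: continuous
    comp = np.exp(np.dot(theta, theta))          # composition e^{||theta||^2}: continuous
    ratio = expo / (1.0 + comp)                  # ratio; denominator >= 2 > 0: continuous
    return 2.0 * expo + poly * comp + ratio      # weighted sums and products: continuous

print(L([0.1, -0.2, 0.3]))
```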
Application: Neural Network Architecture and Empirical Risk Function
Example: Logistic Sigmoid Activation Function:
A common activation function for hidden units is the logistic sigmoid: h_j = \frac{1}{1 + e^{-\phi_j}}, where \phi_j is typically a weighted sum of inputs (e.g., \phi_j = \sum_k w_{jk} s_k + b_j), which is a polynomial function of the weights and inputs.
Continuity Analysis:
The numerator (1) is a constant, hence continuous.
The argument to the exponential, \phi_j, is a polynomial, making it continuous.
The exponential function e^x is continuous, so e^{-\phi_j} is continuous (composition of continuous functions).
The sum 1 + e^{-\phi_j} is continuous (sum of continuous functions).
The denominator 1 + e^{-\phi_j} is always positive (since e^{-\phi_j} > 0), so it never equals zero.
Therefore, the logistic sigmoid function h_j is continuous (ratio of continuous functions with a non-zero denominator).
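A minimal sketch of this unit in Python; the specific weight, bias, and input values are illustrative assumptions:

```python
import numpy as np

def logistic_sigmoid(phi):
    """h = 1 / (1 + exp(-phi)); the denominator is always > 1, so it is never zero."""
    return 1.0 / (1.0 + np.exp(-phi))

# phi_j as a weighted sum of inputs: phi_j = sum_k w_jk * s_k + b_j.
w = np.array([0.5, -1.0, 0.25, 2.0])
b = 0.1
s = np.array([1.0, 0.0, 1.0, 1.0])
phi = w @ s + b
print(logistic_sigmoid(phi))
```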
Problem Example: Is the Empirical Risk Function Continuous?
Consider a neural network with input features (s_1, \dots, s_4), three hidden units (h_1, h_2, h_3), and a single output unit, \ddot{y}.
Output Layer: \ddot{y} = \theta_1 h_1 + \theta_2 h_2 + \theta_3 h_3 (where \theta_1, \theta_2, \theta_3 are the weights from hidden to output units).
Hidden Layer Activation (e.g., for h_1): h_1 = \text{sigmoid}(\theta_4 s_1 + \theta_5 s_2 + \theta_6 s_3 + \theta_7 s_4), where these \theta values are the weights from inputs to hidden units.
The complete parameter vector \theta encompasses all weights in the network (e.g., \{\theta_1, \dots, \theta_{15}\} for this example).
Empirical Risk Function: L(\theta) = \frac{1}{N} \sum_{i=1}^N (y_i - \ddot{y}(s_i, \theta))^2, where s_i is the i-th input pattern and y_i is the target output.
Continuity Conclusion:
Each h_j is a continuous function of its input weights and features (as shown for the logistic sigmoid).
\ddot{y} is a weighted sum of the continuous functions h_j, with the output weights entering polynomially, making it a continuous function of \theta.
The squared difference (y_i - \ddot{y})^2 is a continuous function of \theta, and the sum over training examples and division by N preserve continuity.
Therefore, the empirical risk function L(\theta) is a continuous function of \theta in this architecture.
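A sketch of this 4-input, 3-hidden-unit, 1-output network and its empirical risk; the parameter layout within theta (first 3 entries as output weights, next 12 as hidden weights) and the random data are assumptions for illustration:

```python
import numpy as np

def sigmoid(phi):
    return 1.0 / (1.0 + np.exp(-phi))

def predict(theta, s):
    """ddot{y}(s, theta) for the network above (no bias terms, matching the lecture's equations)."""
    v = theta[:3]                  # output weights theta_1..theta_3
    W = theta[3:15].reshape(3, 4)  # input-to-hidden weights theta_4..theta_15
    h = sigmoid(W @ s)             # three continuous hidden activations
    return v @ h                   # weighted sum: continuous in theta

def empirical_risk(theta, S, y):
    """L(theta) = (1/N) * sum_i (y_i - ddot{y}(s_i, theta))^2."""
    preds = np.array([predict(theta, s_i) for s_i in S])
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(0)
theta = rng.standard_normal(15)
S = rng.standard_normal((10, 4))  # ten 4-dimensional input patterns
y = rng.standard_normal(10)       # ten target outputs
print(empirical_risk(theta, S, y))  # continuous in theta: small theta changes give small L changes
```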
Discontinuous Activation Functions and Their Implications
McCulloch-Pitts Formal Neuron (Logical Threshold Unit):
Definition: A threshold function, e.g., f(\phi) = \begin{cases} 1 & \text{if } \phi > 0 \\ 0 & \text{otherwise} \end{cases}.
Problem: This function is discontinuous (a step function), making it non-differentiable. Consequently, network architectures using such functions cannot be trained using gradient descent.
Historical Context (Hinton and Rumelhart): In the mid-1980s, when trying to apply gradient descent to McCulloch-Pitts networks (which used these threshold functions), D. Rumelhart suggested replacing the discontinuous threshold with a smooth sigmoidal approximation, making the network differentiable and therefore trainable by gradient descent.
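A minimal numerical comparison of the two activation functions, using finite-difference derivatives (an illustration under assumed step sizes, not from the lecture):

```python
import numpy as np

def step(phi):
    """McCulloch-Pitts threshold: 1 if phi > 0, else 0."""
    return (phi > 0).astype(float)

def sigmoid(phi):
    return 1.0 / (1.0 + np.exp(-phi))

# Finite-difference derivative: zero almost everywhere for the step function
# (no gradient signal for learning, except a spike at the jump at phi = 0),
# but smooth and nonzero for the sigmoid, which is why it supports gradient descent.
phi = np.linspace(-3, 3, 7)
eps = 1e-6
print((step(phi + eps) - step(phi - eps)) / (2 * eps))       # zeros away from phi = 0
print((sigmoid(phi + eps) - sigmoid(phi - eps)) / (2 * eps)) # smooth, positive values
```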