Final Exam Review: Autoencoders, Regularization, and Probabilistic Models

Final Exam Overview and Strategic Preparation

Exam Format and Structure
- Total Duration: The final exam runs $2$ hours ($120$ minutes), up from the $90$ minutes provided for the midterm, to account for the greater breadth of material.
- Question Composition: The exam consists of $20$ questions in total.
- Automatically Graded Questions: $14$ questions are automatically graded (e.g., multiple choice), similar in style to the midterm.
- Essay Questions: The remaining $6$ questions are essay questions.
- Weighting: The midterm was worth $10$ marks and acted primarily as a feedback mechanism. The final exam carries a higher weight. Within the final, the last question (and potentially the last two) is downweighted to $4$ marks because these questions are more complex, tougher, or involve more logical depth. Students are advised to save these for the end.
Material Scope and Emphasis
- Comprehensive Coverage: While the exam leans toward material covered in the second half of the semester, no portion of the course (pre- or post-midterm) can be safely ignored. A little less than half of the exam covers first-half material.
- Concept Extrapolation: Concepts from the first half are often assessed through the lens of more advanced second-half topics. For example, rather than a standalone question on simple Gradient Descent for Logistic Regression, the concept is more likely to be tested within the context of Backpropagation in Neural Networks.
- Preparation Metric: A benchmark for readiness is the ability to achieve a very high mark on the original midterm questions if they were administered again.
Writing Tips for the Final
- Succinctness: A good answer for an essay question is usually succinct and precise rather than long-winded. If an answer requires excessive time or volume, the student may be providing unnecessary detail or could be on the wrong track.
- Exception for Algorithms: Questions requiring the demonstration of an algorithm's steps (e.g., manual calculation of a process) are inherently longer because they require step-by-step documentation.
- Hedging: While precision is preferred, students can hedge their answers by adding more detail if they are uncertain, as there is no penalty for extra correct information.

Autoencoder Networks: Encoding and Decoding Dynamics

The Principle of Simultaneous Learning
- Learning in an autoencoder is not a sequential process where one layer is "learned" before the next. Instead, all learning is simultaneous across the entire network.
- Any parameter update "washes" through the network from inputs to the final loss function.
- Initial State: Weights are typically initialized randomly. After the first iteration, the parameters have merely been "nudged" a small amount away from their random state. Improvement in the loss function occurs globally rather than layer-by-layer.
The Role of the Bottleneck Layer
- The bottleneck layer is situated in the middle of the network architecture. Its purpose is to create a compressed representation of the input data.
- This layer must retain the most essential features (the "squeezing" effect) to allow the decoder to reconstruct the input with minimal information loss.
- To achieve a lower-dimensional representation, the number of nodes in the bottleneck must be smaller than the input/output layers. If the hidden layers were the same size as the input, the network could simply learn an identity function, achieving a loss of zero without extracting meaningful features.
Backpropagation and Global Guidance
- The loss function sits at the final layer, measuring reconstruction error. It acts as both an evaluation tool and a guide.
- Guidance is sent backward through the network via backpropagation to update every trainable parameter/weight.
- All parameters converge toward a state where the desired properties (compression and reconstruction) are achieved simultaneously.
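
The points above can be illustrated with a very small sketch. It assumes a purely linear autoencoder (2 inputs, 1 bottleneck unit, 2 outputs), toy data lying on a line, small fixed starting weights in place of random initialization, and a hand-picked learning rate. None of these choices come from the course material. What it is meant to show is that one reconstruction loss produces gradients for the encoder and the decoder together, and that both are updated in the same step.

```python
# Minimal linear autoencoder: 2 inputs -> 1 bottleneck unit -> 2 outputs.
# All names and values here are illustrative assumptions, not course code.

# Toy data lying on the line x2 = 2 * x1, so one number can describe each point.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Small fixed starting weights stand in for random initialization.
w_enc = [0.1, 0.1]   # encoder: z = w_enc . x
w_dec = [0.1, 0.1]   # decoder: x_hat = w_dec * z
learning_rate = 0.05


def reconstruction_loss(w_enc, w_dec):
    """Mean squared reconstruction error over the toy data set."""
    total = 0.0
    for x in data:
        z = w_enc[0] * x[0] + w_enc[1] * x[1]
        x_hat = (w_dec[0] * z, w_dec[1] * z)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


losses = [reconstruction_loss(w_enc, w_dec)]
for step in range(200):
    grad_enc = [0.0, 0.0]
    grad_dec = [0.0, 0.0]
    for x in data:
        # Forward pass through encoder and decoder.
        z = w_enc[0] * x[0] + w_enc[1] * x[1]
        residual = [w_dec[k] * z - x[k] for k in range(2)]
        # Backward pass: the same residual guides both sets of weights.
        dloss_dz = sum(2.0 * residual[k] * w_dec[k] for k in range(2))
        for k in range(2):
            grad_dec[k] += 2.0 * residual[k] * z / len(data)
            grad_enc[k] += dloss_dz * x[k] / len(data)
    # Encoder and decoder are updated in the same step, not one after the other.
    w_enc = [w_enc[k] - learning_rate * grad_enc[k] for k in range(2)]
    w_dec = [w_dec[k] - learning_rate * grad_dec[k] for k in range(2)]
    losses.append(reconstruction_loss(w_enc, w_dec))

print(f"loss before training: {losses[0]:.4f}")
print(f"loss after training:  {losses[-1]:.6f}")
```

Because the toy data is genuinely one-dimensional, a single bottleneck unit is enough here to drive the reconstruction loss close to zero. With data that does not lie on a line, some error would remain.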

Mathematical Frameworks for Loss and Reconstruction

Mean Squared Error (MSE) in Autoencoders
- Standard MSE is used to force outputs to align with inputs. In a supervised context, MSE is defined as: $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\text{prediction}_i - \text{ground truth}_i)^2$
- In an autoencoder, which is technically unsupervised or "self-supervised," the input representation itself acts as the "ground truth" label. The prediction is the output of the decoder.
Vector Norms and Distances
- If we represent inputs and outputs as vectors $a$ and $b$, the reconstruction error is fundamentally the squared norm of the difference between these vectors: $\|a - b\|^2$
- Differences accrue whenever a component (pixel value or feature) in the output vector deviates from the input vector. Minimizing the average squared distance keeps the output close to the input in high-dimensional space.
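
As a small worked example of the squared-norm view, the sketch below computes $\|a - b\|^2$ for each input/output pair and then averages over the examples. The function names and the sample vectors are made up for illustration, and averaging per example (rather than per component) is one convention among several.

```python
def squared_distance(a, b):
    """Squared Euclidean norm of the difference between two equal-length vectors."""
    return sum((a_k - b_k) ** 2 for a_k, b_k in zip(a, b))


def mean_reconstruction_error(inputs, outputs):
    """Average squared distance between each input and its reconstruction."""
    total = sum(squared_distance(x, x_hat) for x, x_hat in zip(inputs, outputs))
    return total / len(inputs)


# Hypothetical inputs and decoder outputs (three features each).
inputs = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
outputs = [[1.0, 2.5, 2.0], [0.0, 1.0, 0.0]]

per_example = [squared_distance(x, x_hat) for x, x_hat in zip(inputs, outputs)]
print(per_example)                                  # [1.25, 0.0]
print(mean_reconstruction_error(inputs, outputs))   # 0.625
```

Only the components that deviate contribute: the first example is off by $0.5$ and $1.0$ in two components, giving $0.25 + 1.0 = 1.25$, while the second is reconstructed exactly.
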
Task-Specific Loss Functions
- Loss functions are not arbitrary and cannot be swapped out without changing the objective of the task. The loss function encapsulates the specific engineering goal. In autoencoders, it specifically rewards the alignment of input and output to encourage dimensionality reduction as a byproduct.

Comparative Analysis: Autoencoders vs. PCA

Reconstruction Quality
- Autoencoders: Use nonlinear activation functions, allowing them to fit complex, non-flat manifolds in the data. For the same bottleneck size, this flexibility generally leads to better reconstruction than linear methods.
- Principal Component Analysis (PCA): A strictly linear method. When projecting data down and reconstructing it, PCA is limited to a linear subspace.
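
The linear limitation of PCA can be seen on a toy case. The sketch below is an illustration only, with made-up 2-D data and a hypothetical helper name. It projects points onto their first principal component and reconstructs them, using the closed-form principal-axis angle for a $2 \times 2$ covariance matrix instead of a general eigen-solver. Points on a straight line are recovered almost exactly, while points on a curve are not.

```python
import math


def pca_1d_reconstruction_error(points):
    """Project 2-D points onto their first principal component and measure the loss."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the 2x2 covariance matrix.
    c_xx = sum(x * x for x, _ in centered) / n
    c_yy = sum(y * y for _, y in centered) / n
    c_xy = sum(x * y for x, y in centered) / n

    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * c_xy, c_xx - c_yy)
    direction = (math.cos(theta), math.sin(theta))

    error = 0.0
    for x, y in centered:
        code = x * direction[0] + y * direction[1]   # 1-D compressed value
        x_hat, y_hat = code * direction[0], code * direction[1]
        error += (x - x_hat) ** 2 + (y - y_hat) ** 2
    return error / n


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
line_points = [(x, 2.0 * x) for x in xs]    # flat (linear) structure
curve_points = [(x, x * x) for x in xs]     # curved (nonlinear) structure

line_error = pca_1d_reconstruction_error(line_points)
curve_error = pca_1d_reconstruction_error(curve_points)
print(f"line:  {line_error:.6f}")
print(f"curve: {curve_error:.6f}")
```

The sketch does not train an autoencoder. It only shows the part of the comparison that concerns PCA: a single linear direction cannot follow a curved data manifold, so some reconstruction error is unavoidable there.
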
Visual Distortions and Fine Details
- In image reconstruction, fine details are the most likely to be lost during the compression/squeezing process.
- Example (Shirt): An autoencoder might reconstruct the general shape and color of a shirt but lose the specific logo or fine design elements present in the original image.
- Example (Shoe): A reconstruction might miss specific structural details (like parts of a heel), especially if the original resolution was low.
- Noise or errors in the process effectively "move" the input to a slightly different location in the feature space, resulting in a distorted output.

Advanced Regularization: Elastic Net

Definition and Composition
- Elastic Net is a regularization technique that combines the penalties of both $L_1$ (Lasso) and $L_2$ (Ridge) regularization.
- It aims to benefit from the sparsity-inducing properties of $L_1$ and the robust, parameter-shrinking properties of $L_2$.
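
A minimal sketch of the combined penalty follows. It assumes one common parametrization, with a separate strength for each term and the squared $L_2$ norm. The function and argument names are hypothetical, and libraries often use a different but equivalent parametrization (an overall strength plus a mixing ratio).

```python
def elastic_net_penalty(weights, l1_strength, l2_strength):
    """Weighted sum of the L1 norm and the squared L2 norm of the weights."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1_strength * l1_term + l2_strength * l2_term


weights = [0.5, -2.0, 0.0]
print(elastic_net_penalty(weights, l1_strength=1.0, l2_strength=0.0))  # pure Lasso: 2.5
print(elastic_net_penalty(weights, l1_strength=0.0, l2_strength=1.0))  # pure Ridge: 4.25
print(elastic_net_penalty(weights, l1_strength=0.5, l2_strength=0.5))  # mixture: 3.375
```

Setting either strength to zero recovers plain Ridge or plain Lasso, which is why Elastic Net can be read as a superset of both.
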
The SVM Context and Geometric Implications
- In Support Vector Machines (SVMs), the $L_2$ norm of the weight vector $w$ appears in the denominator of the point-to-hyperplane distance formula, so the margin between the bounding hyperplanes is proportional to $1/\|w\|_2$ and is maximized by minimizing $\|w\|_2$.
- Substituting $L_1$ for $L_2$ in an SVM objective function would change the semantics of the model. $L_1$ is axis-dependent and would widen the margin preferentially along directions tied to specific axes/feature values, rather than performing a smooth, orientation-independent maximization.
- Changing from $L_2$ to $L_1$ also costs differentiability (the $L_1$ norm has kinks wherever a weight is zero) and requires re-engineering the optimization program, e.g., moving away from the standard quadratic-programming formulation. The problem stays convex, but it is no longer smooth.
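
For reference, the distance formula behind the margin can be written out as follows. The bias term $b$, the point $x_0$, and the canonical scaling of the bounding hyperplanes to $w^\top x + b = \pm 1$ are standard notation assumed here, not taken from the lecture.

$$
\text{dist}(x_0, H) = \frac{|w^\top x_0 + b|}{\|w\|_2}, \qquad H = \{x : w^\top x + b = 0\}
$$

$$
\text{margin} = \frac{2}{\|w\|_2} \quad\Longrightarrow\quad \max_{w,b} \frac{2}{\|w\|_2} \;\text{ is equivalent to }\; \min_{w,b} \tfrac{1}{2}\|w\|_2^2
$$

Because $\|w\|_2$ does not change when the coordinate axes are rotated, the resulting margin is orientation-independent, whereas $\|w\|_1$ does change under rotation.
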
Hyperparameter Tuning Costs
- Practitioners often avoid Elastic Net by default because it introduces an additional hyperparameter to tune.
- Moving from one hyperparameter (just $L_1$ or just $L_2$) to two (a combination) multiplies the search space in cross-validation frameworks: a grid of $k$ candidate values becomes a grid of $k^2$ combinations, with a matching increase in computational cost.
- It is often reserved for cases where specific robustness is required or when a developer suspects a sparse representation ($L_1$) needs to be more equitably spread out across features ($L_2$).
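
The growth in tuning cost can be made concrete by counting model fits in a plain grid search. The candidate values and the fold count below are arbitrary examples.

```python
from itertools import product

# Hypothetical candidate values for each regularization strength.
l1_candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
l2_candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
n_folds = 5

# One penalty: one fit per candidate value per fold.
single_penalty_fits = len(l1_candidates) * n_folds

# Elastic Net: one fit per (L1, L2) combination per fold.
elastic_net_grid = list(product(l1_candidates, l2_candidates))
elastic_net_fits = len(elastic_net_grid) * n_folds

print(single_penalty_fits)  # 25 model fits for L1 alone (or L2 alone)
print(elastic_net_fits)     # 125 model fits for the combined grid
```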

Probabilistic Models: Naive Bayes Classification

Fundamentals of the Algorithm
- Naive Bayes is not an iterative learner; it has no parameters that are updated via backpropagation or gradient descent on a loss function. It uses direct statistical formulas (counts and proportions) to calculate probabilities.
- Its probabilities are estimated with the Maximum Likelihood Estimation (MLE) principle, i.e., directly from observed frequencies in the training data. Prediction then amounts to observing data $x$ and choosing the label $y$ that is most probable given that data.
The Independence Assumption
- The model is termed "naive" because it assumes that all features are independent of one another given the class label.
- Mathematical Basis: Under the independence assumption, the joint probability of the features given a label is simply the product of the individual proportions calculated for each feature: $P(x|y) = \prod_{j} P(x_j|y)$.
- Real-World Limitation: If features are highly correlated, the assumption fails, and the model may produce poor results. Despite this, it often works surprisingly well for foundational classification tasks.
Mathematics of Naive Bayes
- It utilizes Bayes' Theorem to find the probability of a label given features: $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$
- Since $P(x)$ is constant for all labels during comparison, the model simplifies the calculation to be proportional to $P(x|y)P(y)$.
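
The whole procedure fits in a few lines of counting. The sketch below is an illustration on a made-up categorical data set (the feature values, labels, and function names are all hypothetical), and it deliberately omits smoothing, so an unseen feature value for a label yields a probability of zero. It estimates $P(y)$ and each $P(x_j|y)$ as proportions, multiplies them under the independence assumption, and picks the label with the largest $P(x|y)P(y)$.

```python
from collections import Counter, defaultdict

# Hypothetical training set: (weather, wind) -> whether a game was played.
training_data = [
    (("sunny", "weak"), "yes"),
    (("sunny", "strong"), "no"),
    (("rainy", "strong"), "no"),
    (("sunny", "weak"), "yes"),
    (("rainy", "weak"), "yes"),
    (("rainy", "strong"), "no"),
]

# "Training" is a single counting pass; there is no loss and no iteration.
label_counts = Counter(label for _, label in training_data)
feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
for features, label in training_data:
    for j, value in enumerate(features):
        feature_counts[(j, label)][value] += 1


def unnormalized_posteriors(features):
    """Return P(y) * prod_j P(x_j | y) for every label y."""
    scores = {}
    for label, count in label_counts.items():
        score = count / len(training_data)                      # prior P(y)
        for j, value in enumerate(features):
            score *= feature_counts[(j, label)][value] / count  # P(x_j | y)
        scores[label] = score
    return scores


def predict(features):
    """Choose the label with the largest unnormalized posterior."""
    scores = unnormalized_posteriors(features)
    return max(scores, key=scores.get)


print(predict(("sunny", "weak")))    # yes
print(predict(("rainy", "strong")))  # no
```

Dividing the scores by $P(x)$ would turn them into proper probabilities, but since that divisor is the same for every label it does not change which label wins.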

Administrative and Logistic Details

Technical Requirements
- Programming: There will be absolutely no questions requiring students to write or debug Python or PyTorch code. Students do not need to memorize syntax.
- Pseudo-code: The professor may provide a few lines of pseudo-code (generic code-like logic) and ask for an explanation of the underlying algorithm or process.
Office Hours and Availability
- The professor will hold office hours on Tuesday from $4:00\,PM$ to $6:00\,PM$ (West Coast Time).
- Starting the $6th$ of the month, the professor will be on the East Coast for departmental events. Subsequent meetings must be arranged via email.
Course Evaluations
- Students are encouraged to complete course evaluations to assist in the constant improvement and tinkering of the curriculum.