L9 - Gradients, and a bit more about time series

Anomaly Detection

Self-Supervised Learning Task

Anomaly detection is a self-supervised machine learning task that identifies anomalies or outliers in a dataset: points that deviate significantly from the established norm, often signaling issues or events that warrant attention. It is particularly useful when the dataset contains few or no instances of the positive class, i.e., the anomalies themselves.

Approach

The main approach is to train a model to reconstruct normal data accurately. By learning to replicate the patterns and structure of typical data, the model becomes able to detect deviations. This is commonly done with an encoder-decoder model such as an autoencoder, which compresses and then decompresses the data.

Autoencoders
Training

Autoencoders are trained to reconstruct their input x, so that the composition g(f(x)) approximates x. The goals are:

  • Data Compression: The encoder f learns to compress the data into an efficient, lower-dimensional representation that captures its most salient features.

  • Data Reconstruction: The decoder g learns to decompress this representation back into the original data space, recreating the input as faithfully as possible.

Loss Function

Mean Squared Error (MSE) is the typical loss function, measuring the difference between the original and reconstructed data. The choice of loss depends on the task; for instance, cross-entropy loss may be preferred for binary or categorical data.

x - \hat{x} = x - g(f(x, \theta), \theta') \rightarrow 0

Where:

  • \theta denotes the parameters of the encoder, tuned during training.

  • \theta' denotes the parameters of the decoder, tuned likewise to optimize reconstruction accuracy. A code sketch of this setup follows.
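
Putting these pieces together, the following is a minimal PyTorch sketch of an autoencoder trained with an MSE reconstruction loss. The layer sizes, the 784-dimensional input, and the optimizer settings are illustrative assumptions, not values from the notes.

    import torch
    import torch.nn as nn

    # Illustrative autoencoder: the encoder f compresses x, the decoder g
    # reconstructs it. Input and bottleneck sizes are assumptions.
    class Autoencoder(nn.Module):
        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(            # f(x, theta)
                nn.Linear(input_dim, 128), nn.ReLU(),
                nn.Linear(128, latent_dim),
            )
            self.decoder = nn.Sequential(            # g(z, theta')
                nn.Linear(latent_dim, 128), nn.ReLU(),
                nn.Linear(128, input_dim),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))     # x_hat = g(f(x))

    model = Autoencoder()
    loss_fn = nn.MSELoss()                           # ||x - x_hat||^2
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.rand(64, 784)                          # a batch of "normal" data
    loss = loss_fn(model(x), x)                      # reconstruction loss
    loss.backward()                                  # gradients w.r.t. theta, theta'
    optimizer.step()                                 # update encoder and decoder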

Applications
  • Compression: The encoder reduces the dimensionality of the data while preserving the essential information, enabling efficient storage and processing.

  • Denoising: The encoder filters out noise in the input, allowing the decoder to reconstruct a cleaner version of the original.

  • Anomaly Detection: The encoder learns patterns from normal data and the decoder attempts to reconstruct the input. If the input contains anomalies or previously unseen patterns, the reconstruction quality suffers, which signals an anomaly.

Anomaly Detection with Autoencoders
Identifying Anomalies

By comparing the true input x with its reconstruction \hat{x}, we can measure how surprising or abnormal the input is, i.e., whether it matches the patterns learned from the training data.

  • Expected Features: If the input exhibits features matching the patterns learned during training, the reconstruction is accurate and ||x - \hat{x}|| is small.

  • Unexpected Features: If the input contains features that diverge from the learned patterns, the reconstruction is poor and ||x - \hat{x}|| is large.

Metric

Mean Squared Error (MSE) quantifies the dissimilarity between the original and reconstructed inputs. Anomalies typically produce high MSE values because the model cannot reconstruct them accurately.
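
As a hedged illustration of this scoring, the sketch below reuses the Autoencoder model from the earlier example; the placeholder data and the 99th-percentile threshold are assumptions.

    import torch

    # Per-sample reconstruction error as an anomaly score; the threshold is
    # calibrated on normal training data (99th percentile is an assumption).
    def anomaly_scores(model, x):
        model.eval()
        with torch.no_grad():
            x_hat = model(x)
        return ((x - x_hat) ** 2).mean(dim=1)        # MSE per sample

    x_train = torch.rand(1000, 784)                  # placeholder "normal" data
    x_test = torch.rand(10, 784)                     # placeholder data to screen

    train_scores = anomaly_scores(model, x_train)    # model: Autoencoder above
    threshold = torch.quantile(train_scores, 0.99)

    test_scores = anomaly_scores(model, x_test)
    is_anomaly = test_scores > threshold             # large error -> anomaly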

Anomaly Detection for Time Series
Implementation

An encoder-decoder CNN can be used to process 1D time series data (sketched below).

  • The model is trained on normal data to capture the temporal patterns and dependencies in the series.

  • Anomalous data is then identified by poor reconstruction, i.e., windows the model fails to replicate accurately.
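
A hedged sketch of such a model in PyTorch; the window length of 128 samples and the channel sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    # 1D convolutional encoder-decoder for windows of a univariate time series.
    class ConvAutoencoder1D(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose1d(32, 16, kernel_size=5, stride=2, padding=2,
                                   output_padding=1), nn.ReLU(),
                nn.ConvTranspose1d(16, 1, kernel_size=5, stride=2, padding=2,
                                   output_padding=1),
            )

        def forward(self, x):                        # x: (batch, 1, length)
            return self.decoder(self.encoder(x))

    ts_model = ConvAutoencoder1D()
    windows = torch.rand(8, 1, 128)                  # 8 windows of length 128
    recon = ts_model(windows)
    score = ((windows - recon) ** 2).mean(dim=(1, 2))  # per-window anomaly score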

Modern Networks for Sequences and Time Series
Recurrent Networks
  • Pros: Recurrent networks maintain context by storing and updating an internal state, which lets them capture temporal dependencies in sequential data (illustrated in the sketch below).

  • Cons:

    • Encoding the entire context in a fixed-length state vector (z) is difficult and often loses information.

    • Sequential processing (a for loop over time steps) slows down training, because each step depends on the previous one.
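
A minimal sketch of the fixed-length state bottleneck with a GRU encoder; the hidden size, sequence length, and feature dimension are assumptions.

    import torch
    import torch.nn as nn

    # A GRU processes the sequence step by step and summarises the whole
    # context in its final hidden state: the fixed-length vector z.
    gru = nn.GRU(input_size=1, hidden_size=64, batch_first=True)

    seq = torch.rand(4, 100, 1)      # 4 sequences, 100 time steps, 1 feature
    outputs, h_n = gru(seq)          # outputs: per-step hidden states
    z = h_n[-1]                      # (4, 64): fixed-length context vector z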

Transformer Model

A more powerful and scalable alternative is the Transformer model, which relies on attention mechanisms instead of recurrence and convolutions. This design enables parallel processing and improves the model's ability to handle long-range dependencies.

Transformers

The attention mechanism is the core of the Transformer architecture. It learns to assign different weights to different parts of the input; because these weights depend on the input values themselves, the model can focus on the most relevant information while processing a sequence.
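
A minimal sketch of scaled dot-product self-attention, in which the weights are computed from the input itself; the feature dimension, sequence length, and random projection matrices are assumptions.

    import torch
    import torch.nn.functional as F

    # Scaled dot-product self-attention: each position computes weights over
    # all positions, and the weights depend on the input values themselves.
    def self_attention(x, w_q, w_k, w_v):
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        weights = F.softmax(scores, dim=-1)          # input-dependent weights
        return weights @ v, weights

    d = 16                                           # feature dimension (assumed)
    x = torch.rand(10, d)                            # sequence of 10 embeddings
    w_q, w_k, w_v = (torch.rand(d, d) for _ in range(3))
    out, attn = self_attention(x, w_q, w_k, w_v)     # attn: (10, 10) weights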

Transformer Tasks

The encoder-decoder structure is versatile and can be used in parts or as a whole:

  • Encoder: Takes a sequence as input and produces a fixed-length output, such as a class label, representing the input sequence in a condensed form (see the example below).

  • Decoder: Takes a sequence as input and generates the next element autoregressively, as in models like ChatGPT. It builds output sequences element by element, conditioned on previously generated elements.

  • Encoder-Decoder: A sequence-to-sequence transformer, useful for translation tasks, at the cost of higher computational demands since both input and output sequences must be processed.
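
A hedged sketch of the encoder-only case (sequence in, fixed-length output out), built on PyTorch's nn.TransformerEncoder; the model size, mean-pooling choice, and two-class output are assumptions.

    import torch
    import torch.nn as nn

    # Encoder-only use: map a whole sequence to a fixed-length output
    # (here, class logits obtained by mean-pooling the encoded sequence).
    d_model, num_classes = 32, 2
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    classifier = nn.Linear(d_model, num_classes)

    tokens = torch.rand(8, 50, d_model)              # 8 sequences, 50 steps
    encoded = encoder(tokens)                        # (8, 50, d_model)
    logits = classifier(encoded.mean(dim=1))         # (8, num_classes)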

Using Gradients
Optimization with Gradients
  • Question: How do changes in the model parameters \theta affect the predictions? Understanding this relationship is central to optimizing model performance.

  • This is quantified by computing the gradient of the loss L with respect to the model parameters \theta:

\nabla L(\theta) = \begin{bmatrix} \frac{\partial L}{\partial \theta_0} \\ \vdots \\ \frac{\partial L}{\partial \theta_n} \end{bmatrix}

  • Process: Vary \theta and measure the effect on L. Iterating this procedure (gradient descent) tunes the parameters to minimize the loss and improve predictive accuracy, as in the sketch below.
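
A minimal PyTorch sketch of computing \nabla L(\theta) via automatic differentiation and taking one gradient-descent step; the linear model, random data, and learning rate are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Gradient of the loss L with respect to the model parameters theta.
    net = nn.Linear(3, 1)
    x, y = torch.rand(16, 3), torch.rand(16, 1)

    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()                                  # fills p.grad = dL/dtheta

    for name, p in net.named_parameters():
        print(name, p.grad.shape)                    # one partial per parameter

    with torch.no_grad():                            # one gradient-descent step
        for p in net.parameters():
            p -= 0.1 * p.grad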

Gradients: Explainability
  • Question: How does modifying the input data x influence the prediction? This is central to understanding model behavior and ensuring its reliability.

  • This technique explains model decisions on a per-datapoint basis, giving insight into why the model made a specific prediction for an individual input.

  • Compute the gradient of the model output S with respect to the input data x to measure how sensitive the output is to changes in the input.

Gradients: Explainability Example
Image Classifier

For an image classifier that distinguishes cats from dogs, differentiating the score for cat (S_c) with respect to the pixel values (x_1, …, x_n) yields an attribution map showing each pixel's contribution to the classification.

\nabla S_c(x) = \begin{bmatrix} \frac{\partial S_c}{\partial x_1} \\ \vdots \\ \frac{\partial S_c}{\partial x_n} \end{bmatrix} \leftarrow \text{attribution map}

The result is a per-pixel attribution of the cat-class score, highlighting the pixels that most influence the classification decision.
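
A hedged sketch of this computation; the tiny CNN classifier (clf) and the 32x32 input are illustrative assumptions, not a model from the notes.

    import torch
    import torch.nn as nn

    # Gradient of the cat score S_c with respect to the input pixels:
    # a simple saliency / pixel-attribution map.
    clf = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(8, 2),                             # scores for [cat, dog]
    )

    image = torch.rand(1, 3, 32, 32, requires_grad=True)
    score_cat = clf(image)[0, 0]                     # S_c
    score_cat.backward()                             # dS_c/dx for every pixel

    attribution = image.grad.abs().sum(dim=1)        # (1, 32, 32) attribution map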

Attribution Methods
Problems with Gradients
  • Noisy gradients: Gradients can be noisy, which obscures the true importance of individual features.

  • Vanishing gradients with ReLU: If a ReLU activation outputs 0, the gradient through it is also 0, so information is lost.

Solutions
  • SmoothGrad: Add noise to the input several times and average the resulting gradients; this reduces noise and smooths out the gradient landscape (see the sketch below).

  • Integrated Gradients: Define a baseline and integrate gradients along the path from the baseline to the input; accumulating the gradient over the whole path gives a more comprehensive attribution and mitigates the vanishing-gradient issue.

  • Clever Propagation: Methods such as DeepLIFT, Deconvolution, and Layerwise Relevance Propagation (LRP) control how gradients are propagated through activation functions and assign relevance scores to input features based on their contribution to the model's output.
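
A hedged SmoothGrad sketch, reusing the classifier clf from the earlier saliency example; the noise level and sample count are assumptions.

    import torch

    # SmoothGrad: average input gradients over several noisy copies of the
    # input to reduce gradient noise.
    def smoothgrad(model, image, target_class, n_samples=25, sigma=0.1):
        grads = torch.zeros_like(image)
        for _ in range(n_samples):
            noisy = (image.detach() + sigma * torch.randn_like(image))
            noisy.requires_grad_(True)
            score = model(noisy)[0, target_class]
            score.backward()
            grads += noisy.grad
        return grads / n_samples                     # averaged, smoother map

    saliency = smoothgrad(clf, torch.rand(1, 3, 32, 32), target_class=0)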

Adversarial Attacks
Observations from Pixel Attributions
  • Gradients are noisy, which makes it hard to identify the truly important features.

  • A few pixels have disproportionately large gradients and therefore strong leverage over the classification outcome, indicating potential vulnerabilities.

Constructing Adversarial Examples

Using the gradient \nabla S_c, one can identify the pixels that drive the correct prediction. Modifying those pixel values in the opposite direction produces adversarial examples that cause the model to misclassify the input.
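
One common instantiation of this idea is the Fast Gradient Sign Method (FGSM); the notes do not name a specific method, so the following is a hedged sketch reusing the classifier clf from above, with an assumed perturbation size eps.

    import torch

    # Step each pixel against the gradient of the correct-class score S_c,
    # pushing the input away from the features that support the true class.
    def adversarial_example(model, image, target_class, eps=0.03):
        image = image.clone().detach().requires_grad_(True)
        score = model(image)[0, target_class]        # S_c for the true class
        score.backward()
        adv = image - eps * image.grad.sign()        # move in opposing direction
        return adv.clamp(0, 1).detach()

    adv_image = adversarial_example(clf, torch.rand(1, 3, 32, 32), target_class=0)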

IRL Adversarial Examples

A classic example is an attack on speed limit signs: small, barely perceptible perturbations, such as stickers placed on the sign, can cause it to be misclassified by an autonomous vehicle. Such attacks can have serious real-world consequences, especially in safety-critical applications.

Adversarial Attacks
Targeted Attacks

Targeted attacks require access to the model for gradient computation, allowing the attacker to craft inputs that force a specific, predefined wrong prediction.

Black Box Attacks

Black-box attacks construct adversarial examples based only on the model's outputs for given inputs. They are harder to carry out because the attacker can only observe the model's behavior, without insight into its inner workings.

Defense Strategies
  • Adversarial training: Train the model on adversarial examples so that it learns to recognize and resist such attacks (a minimal sketch follows this list).

  • Ensembling: Combine the predictions of multiple models to increase robustness and reduce the impact of individual models' vulnerabilities and biases.

  • Data augmentation: Apply transformations such as rotation, scaling, and translation to diversify the training set, helping the model generalize better and become more robust.
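
A hedged adversarial-training sketch, reusing clf and adversarial_example from the earlier examples; the data, labels, optimizer settings, and batch construction are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Each batch is augmented with adversarially perturbed copies before the
    # usual supervised update, so the model also learns from attacked inputs.
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)

    images = torch.rand(8, 3, 32, 32)                # placeholder clean batch
    labels = torch.randint(0, 2, (8,))               # placeholder labels

    adv_images = torch.cat([
        adversarial_example(clf, img.unsqueeze(0), int(lbl))
        for img, lbl in zip(images, labels)
    ])

    batch_x = torch.cat([images, adv_images])        # clean + adversarial inputs
    batch_y = torch.cat([labels, labels])

    optimizer.zero_grad()
    loss = loss_fn(clf(batch_x), batch_y)
    loss.backward()
    optimizer.step()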
