CIVL3530 - Supervised Learning - Neural Networks Notes

Neural Networks

  • Neural Networks: Creating intermediate variables z_k = \sigma(w_{k1}x_1 + w_{k2}x_2 + … + w_{kJ}x_J + b_k). The dependent variable is the output function applied to a linear combination of the intermediate variables: y = o(w_{o1}z_1 + w_{o2}z_2 + … + w_{oK}z_K + b_o)
    • \sigma: Activation Function
    • o: Output Function
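  • A minimal NumPy sketch of the two equations above (the sigmoid activation, identity output function, and sizes J = 3, K = 4 are illustrative assumptions, not values from the notes):

      import numpy as np

      def sigmoid(a):
          return 1.0 / (1.0 + np.exp(-a))

      rng = np.random.default_rng(0)
      J, K = 3, 4                      # number of inputs and hidden units (assumed)
      x = rng.normal(size=J)           # inputs x_1 ... x_J
      W = rng.normal(size=(K, J))      # hidden-layer weights w_{kj}
      b = rng.normal(size=K)           # hidden-layer biases b_k
      w_o = rng.normal(size=K)         # output weights w_{o1} ... w_{oK}
      b_o = 0.1                        # output bias

      z = sigmoid(W @ x + b)           # z_k = sigma(w_{k1}x_1 + ... + w_{kJ}x_J + b_k)
      y = w_o @ z + b_o                # y = o(w_{o1}z_1 + ... + w_{oK}z_K + b_o), identity o here
      print(z, y)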

Deep Neural Networks

  • Deep Neural Networks: Neural networks with many hidden layers (e.g., 5-10).
    • Training is now possible due to faster computing resources (GPU).
    • Adding layers increases connections between nodes, thus increasing the number of weights and biases to find.
    • Identifying weights is particularly difficult in deep neural networks.
  • Feed-forward network: Information moves in one direction, from the input layer to the output layer (left to right in the diagrams).
  • Example: A network with 3 hidden layers.
  • Deep Neural Networks Equations:
    • z^{(1)} = \sigma(A^{0\rightarrow1}x)
    • z^{(2)} = \sigma(A^{1\rightarrow2}z^{(1)})
    • z^{(3)} = \sigma(A^{2\rightarrow3}z^{(2)})
    • y = o(A^{3\rightarrow4}z^{(3)}) which expands to
    • y = o(A^{3\rightarrow4}(\sigma(A^{2\rightarrow3}(\sigma(A^{1\rightarrow2}(\sigma(A^{0\rightarrow1}x)))))))
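  • The stacked equations above are just repeated matrix-vector products and activations; a minimal sketch with three hidden layers, assuming made-up layer sizes and matching the bias-free A^{l\rightarrow l+1} notation:

      import numpy as np

      def sigmoid(a):
          return 1.0 / (1.0 + np.exp(-a))

      rng = np.random.default_rng(0)
      sizes = [3, 8, 8, 8, 1]              # input, three hidden layers, output (assumed sizes)
      A = [rng.normal(size=(sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]  # A^{0->1}, A^{1->2}, A^{2->3}, A^{3->4}

      x = rng.normal(size=sizes[0])
      z = x
      for A_l in A[:-1]:                   # z^{(l)} = sigma(A^{l-1 -> l} z^{(l-1)})
          z = sigmoid(A_l @ z)
      y = A[-1] @ z                        # y = o(A^{3->4} z^{(3)}), identity output here
      print(y)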

Regularization / Dropout

  • Training Dataset: Data sample used to fit the model.
  • Validation Dataset: Data sample used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes biased as skill on the validation dataset is incorporated into the model configuration.
  • Test Dataset: The data sample used to provide an unbiased evaluation of a final model fit on the training dataset.
  • Train-test-validation split ratio is specific to the use case, depending on:
    • Whether the model requires substantial data for training.
    • Whether the model has many hyperparameters.
  • Beware of overfitting. Overfitting will result in high variance.
  • Regularization and dropout: Techniques to reduce overfitting.
    • Regularization: Add a penalty term to the loss function to penalize complicated models (models with high values of weights).
      • \lambda: Strength of regularization
      • \alpha \in [0,1]: Defines the balance between L1 and L2 penalties
    • Dropout: Prevents overfitting by randomly shutting down some neurons during training. Random subsets of the neural network are trained rather than the entire large network, and these subsets are averaged to create the final prediction.
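  • A minimal NumPy sketch of both ideas, assuming an elastic-net-style penalty \lambda(\alpha\|w\|_1 + (1-\alpha)\|w\|_2^2) (the exact form in the course materials may differ) and inverted dropout with a made-up keep probability:

      import numpy as np

      def regularization_penalty(weights, lam=0.01, alpha=0.5):
          # Term added to the loss: lambda * (alpha * L1 + (1 - alpha) * L2).
          # lam is the regularization strength, alpha balances L1 vs. L2 (assumed form).
          l1 = np.sum(np.abs(weights))
          l2 = np.sum(weights ** 2)
          return lam * (alpha * l1 + (1.0 - alpha) * l2)

      def dropout(activations, keep_prob=0.8, rng=np.random.default_rng(0)):
          # Inverted dropout: randomly zero some neurons during training and rescale
          # the rest so the expected activation is unchanged at prediction time.
          mask = rng.random(activations.shape) < keep_prob
          return activations * mask / keep_prob

      w = np.array([0.5, -1.2, 0.0, 2.0])
      z = np.array([0.3, 0.7, 0.1, 0.9])
      print(regularization_penalty(w), dropout(z))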
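  • Separately, a train/validation/test split as described at the start of this section can be produced with two calls to scikit-learn's train_test_split; the 70/15/15 ratio below is only an example, since the ratio is use-case specific:

      import numpy as np
      from sklearn.model_selection import train_test_split

      X = np.arange(200).reshape(100, 2)   # toy features
      y = np.arange(100)                   # toy targets

      # First split off the test set, then carve a validation set out of the remainder.
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
      X_train, X_val, y_train, y_val = train_test_split(
          X_train, y_train, test_size=0.15 / 0.85, random_state=0)

      print(len(X_train), len(X_val), len(X_test))   # roughly 70 / 15 / 15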

Categorical Data

  • Categorical data: Variables that contain label values rather than numeric values.
    • The number of possible values is often limited to a fixed set.
      • Example:
        • A “pet” variable with the values: “dog” and “cat”.
        • A “color” variable with the values: “red”, “green”, and “blue”.
        • A “place” variable with the values: “first”, “second”, and “third”.
  • Categorical data must be converted to a numerical form.
    • Integer encoding: "red" is 1, "green" is 2, and "blue" is 3.
    • One-hot encoding: Applied to the integer representation. The integer encoded variable is removed, and a new binary variable is added for each unique integer value.
  • Mean squared error is not a suitable loss function for classification problems.
  • The cross-entropy loss should be used with one-hot encoded targets.
  • For discrete probability distributions p and q:
    • H(p, q) = - \sum_{x \in X} p(x) \log q(x)
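  • A minimal NumPy sketch of integer encoding, one-hot encoding, and the cross-entropy between a one-hot target p and a predicted distribution q (the mapping here is zero-based and alphabetical, unlike the 1/2/3 example above, and the probabilities are made up):

      import numpy as np

      labels = ["red", "green", "blue", "red"]         # categorical data
      categories = sorted(set(labels))                 # ['blue', 'green', 'red']
      int_encoded = np.array([categories.index(c) for c in labels])

      # One-hot encoding: one binary column per unique category.
      one_hot = np.eye(len(categories))[int_encoded]

      # Cross-entropy H(p, q) = -sum_x p(x) log q(x) for a single example.
      p = one_hot[0]                                   # true distribution (one-hot target)
      q = np.array([0.1, 0.2, 0.7])                    # predicted probabilities (made up)
      print(one_hot, -np.sum(p * np.log(q)))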

Time Series

  • Time series data: Traffic measurements vs. time, river height vs. time, rainfall vs. time.
  • Sliding time window: A fixed-length window of past observations is used to predict a value a certain time ahead (e.g., predicting 30 minutes ahead).
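  • A minimal NumPy sketch of building sliding-window samples from a univariate series, assuming a window of 6 past observations and a prediction horizon of 3 steps (both values are made up):

      import numpy as np

      def sliding_windows(series, window=6, horizon=3):
          # Each sample pairs `window` consecutive observations with the value
          # `horizon` steps after the window ends.
          X, y = [], []
          for start in range(len(series) - window - horizon + 1):
              X.append(series[start:start + window])
              y.append(series[start + window + horizon - 1])
          return np.array(X), np.array(y)

      series = np.sin(np.linspace(0, 10, 50))   # toy "river height vs. time" signal
      X, y = sliding_windows(series)
      print(X.shape, y.shape)                   # (42, 6) and (42,)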

Recurrent Neural Networks

  • Feed-forward NN: One-to-one, fixed-size input to fixed-size output.
  • RNN: Various types including:
    • One-to-one
    • One-to-many (e.g., image captioning: Image -> sequence of words)
    • Many-to-one (e.g., sentiment analysis: Sequence of words -> sentiments)
    • Many-to-many (e.g., machine translation: RNN reads a sentence in English and outputs a sentence in French)
  • Vanilla Neural Networks are too constrained:
    • Accept a fixed-sized vector as input and produce a fixed-sized vector as output.
    • Use a fixed amount of computational steps.
  • RNNs allow operation over sequences of vectors (sequences in the input, the output, or both).
    • No pre-specified constraints on the lengths of sequences.
    • The recurrent transformation can be applied as many times as we like.
  • Key Idea: RNNs have an “internal state” that is updated as a sequence is processed.
  • RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector.
    • This can be interpreted as running a fixed program with certain inputs and some internal variables.
  • Unrolled RNN: Visual representation of how RNNs process sequences over time.
  • The same function and the same set of parameters are used at every time step.
  • Recurrence formula: h_t = f_W(h_{t-1}, x_t) (a minimal sketch of one step follows after this list).
    • h_t is the hidden state at time step t.
    • h_t = \tanh(W_{hh}h_{t-1} + W_{hx}x_t)
    • y_t = W_{hy}h_t
  • Computational Graph: Visual representation of the computations in an RNN.
    • Initial hidden state h_0: Either set to all 0 or use expert knowledge.
    • Same weights are used at every time step.
  • Different RNN computational graphs for different types of sequence processing:
    • Many-to-many: calculating loss at each step or total loss.
    • Many-to-one: calculating loss at the end.
    • One-to-many: generating a sequence from a single input.
  • RNN Detailed Look:
    • Diagram showing inputs, weights, hidden state, and bias.
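  • A minimal NumPy sketch of the recurrence formula above, applied over a short sequence with made-up dimensions (hidden size 4, input size 3); biases are omitted to match the formulas:

      import numpy as np

      rng = np.random.default_rng(0)
      H, D, T = 4, 3, 5                       # hidden size, input size, sequence length (assumed)
      W_hh = rng.normal(size=(H, H)) * 0.1    # hidden-to-hidden weights
      W_hx = rng.normal(size=(H, D)) * 0.1    # input-to-hidden weights
      W_hy = rng.normal(size=(1, H)) * 0.1    # hidden-to-output weights

      h = np.zeros(H)                         # initial hidden state h_0 set to all zeros
      xs = rng.normal(size=(T, D))            # input sequence x_1 ... x_T

      for x_t in xs:                          # the same weights are reused at every time step
          h = np.tanh(W_hh @ h + W_hx @ x_t)  # h_t = f_W(h_{t-1}, x_t)
          y_t = W_hy @ h                      # y_t = W_{hy} h_t
          print(y_t)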

Many to Many (Seq2Seq) Models

  • Maps input sequences to output sequences.
  • Input/output can have different lengths.
  • Trained end-to-end.
  • Examples: Machine Translation, Text Summarization, Speech Recognition, Dialogue Generation.
  • Structure:
    • Input: [x_1, x_2, x_3, …, x_n] → Encoder → Context → Decoder → Output: [y_1, y_2, y_3, …, y_m]

Seq2Seq Models Components

  • Encoder-Decoder Structure:
    • Encoder: Converts input sequence into context vector.
    • Decoder: Uses context to generate output sequence.
    • Input → [Encoder] → Context → [Decoder] → Output
  • Encoder Details:
    • Processes: [x_1, x_2, …, x_T]
    • Typically an RNN (e.g. vanilla RNN, LSTM, GRU).
    • Outputs: The final hidden state (used as the context vector), or all hidden states (for attention).
  • Decoder Details:
    • Takes the encoder's final hidden state as its initial hidden state.
    • Generates the output step by step: [y_1, y_2, …, y_{T'}]
    • Uses: Previous output token, hidden state from prior step, and optional attention vector.
  • Training: Uses teacher forcing (feed the ground-truth token as the next decoder input).
  • Inference: Feeds back its own predictions and stops at the EOS token or a maximum length.
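  • A minimal PyTorch sketch of the encoder-decoder structure, teacher forcing, and greedy inference; the vocabulary sizes, dimensions, token ids, and the GRU choice are all illustrative assumptions (attention is omitted), not the course's reference implementation:

      import torch
      import torch.nn as nn

      SRC_VOCAB, TGT_VOCAB, EMB, HID = 20, 22, 16, 32   # made-up sizes
      SOS, EOS = 0, 1                                   # assumed start/end-of-sequence token ids

      class Encoder(nn.Module):
          def __init__(self):
              super().__init__()
              self.embed = nn.Embedding(SRC_VOCAB, EMB)
              self.rnn = nn.GRU(EMB, HID, batch_first=True)

          def forward(self, src):
              _, h = self.rnn(self.embed(src))          # final hidden state = context vector
              return h

      class Decoder(nn.Module):
          def __init__(self):
              super().__init__()
              self.embed = nn.Embedding(TGT_VOCAB, EMB)
              self.rnn = nn.GRU(EMB, HID, batch_first=True)
              self.out = nn.Linear(HID, TGT_VOCAB)

          def forward(self, token, h):
              out, h = self.rnn(self.embed(token), h)
              return self.out(out), h                   # logits over the target vocabulary

      encoder, decoder = Encoder(), Decoder()
      src = torch.randint(2, SRC_VOCAB, (1, 7))         # one source sequence of length 7
      tgt = torch.randint(2, TGT_VOCAB, (1, 5))         # ground-truth target of length 5

      # Training step with teacher forcing: feed the ground-truth token at each step.
      h = encoder(src)
      loss_fn = nn.CrossEntropyLoss()
      loss = 0.0
      decoder_inputs = torch.cat([torch.tensor([[SOS]]), tgt[:, :-1]], dim=1)
      for t in range(tgt.size(1)):
          logits, h = decoder(decoder_inputs[:, t:t + 1], h)
          loss = loss + loss_fn(logits[:, 0, :], tgt[:, t])
      print(loss.item())

      # Inference: feed back the model's own prediction, stop at EOS or a maximum length.
      h = encoder(src)
      token = torch.tensor([[SOS]])
      for _ in range(10):
          logits, h = decoder(token, h)
          token = logits.argmax(dim=-1)
          if token.item() == EOS:
              break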

Why Seq2Seq Works

  • Handles variable-length input/output.
  • Trained end-to-end.
  • Supports Teacher Forcing.
  • Applicable across NLP, audio, etc.

Summary Table

  • Encoder: Reads and encodes input sequence.
  • Decoder: Generates output step by step.
  • EOS Token: Stops Output Generation.
  • Attention: Learns to focus on specific input tokens.
  • A flexible architecture behind the success of generative AI today.
  • Stage 1 is self-supervision (pre-training).
  • Stage 2 is fine-tuning (adaptation to downstream tasks).