Introduction to Deep Learning
Course Name: ADM3308
Institution: Telfer School of Management, University of Ottawa
Course Outline
Introduction
Convolutional Neural Networks (CNN)
Autoencoder
Recurrent Neural Networks (RNN)
Long Short Term Memory (LSTM)
Appendix: Deep learning software
Introduction
In 2006, AI researcher Geoffrey Hinton published a pivotal paper that raised awareness of deep learning.
Deep Learning Overview:
Involves complex networks with multiple layers.
Creates flexible models that uncover buried information in vast datasets more efficiently than traditional machine learning techniques that rely on hand-crafted features.
Classical Machine Learning vs. Deep Learning
Classical machine learning techniques:
Make predictions directly from a predetermined set of features specified by the user.
Deep learning techniques:
Use multiple transformation steps to construct complex features.
Analyzing Massive Low-level Data
Classical statistical and machine learning models, including neural networks, utilize available informative predictors (e.g., purchase data, bank account details, etc.).
Rapidly growing applications in voice and image recognition present numerous low-level granular predictors, such as:
Pixel values in images
Wave amplitudes in audio
Deep Learning's Impact:
Significant advancements in speech recognition, computer vision, and natural language processing.
Deep Learning for Image Processing (Unsupervised Learning)
Context:
In image recognition, pixel values serve as predictors, often exceeding 100,000.
The critical ability of deep learning models is to learn features without supervision.
Example:
Separate pixels in an image (e.g., a football field) into distinct areas, such as "green field" versus "yard markers," without prior knowledge of these concepts.
This leads to the emergence of boundaries and edges.
The learning process transitions from identifying local, simple features to encompassing global, complex features.
Example of Feature Detection
Task: Instructing a machine to find an eye in an image.
Features of Interest:
A small solid circle representing the pupil.
An iris surrounding the pupil.
A surrounding white area.
Simplified Image Representation
Assume an image composed of 14x7 pixels.
Each pixel value is a color code ranging from 0 to 255.
Example of a representative 14x7 matrix (values are arbitrary):
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0
0, 1, …, 1, 0, 0
1, 0, …, 5, 5, 5, …, 0, 0, 1, 0
etc.
Convolutional Neural Networks (CNN)
Definition:
A popular deep learning model implemented for image recognition.
Functionality:
Aggregate predictors (pixels) instead of assigning individual weights to each one; apply a convolution operation to the grouped pixels.
A common aggregation method involves a 3x3 pixel area, e.g., around a person's chin.
What is Convolution?
Mathematical Definition:
A convolution is a mathematical operation on two functions (e.g., f and g) that produces a third function showing how the shape of one is modified by the other.
Convolution describes the interaction and overlap of two functions as one slides across the other, serving the purpose of extracting and transforming features/signals/data.
Applying the Convolution
Convolution operation process:
Multiply the pixel matrix element-wise by a filter matrix, then sum the results.
Example Calculation:
For a filter that identifies central vertical lines:
Calculation: 0*25 + 1*200 + 0*25 + 0*25 + 1*225 + 0*25 + 0*25 + 1*225 + 0*25 = 650
Result: This sum is relatively high compared to other arrangements since pixel values are elevated in the central column.
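The arithmetic above can be reproduced with a few lines of NumPy. A minimal sketch follows, with the 3x3 patch and the vertical-line filter inferred from the calculation above; the exact pixel values are illustrative.

import numpy as np

# 3x3 image patch: low values (25) except for a bright central column
patch = np.array([[25, 200, 25],
                  [25, 225, 25],
                  [25, 225, 25]])

# Filter that responds to a central vertical line
vertical_filter = np.array([[0, 1, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# Element-wise multiply and sum: 0*25 + 1*200 + ... = 650
response = np.sum(patch * vertical_filter)
print(response)  # 650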
Continuing the Convolution Process
As the filter matrix shifts across the image:
It records its result at each position, producing a smaller matrix that indicates the presence or absence of vertical lines.
Other filters can identify horizontal lines, curves, and borders, representing hyper-local features.
Further convolutions on these local features yield a multi-dimensional matrix, or tensor, of higher-level features.
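As a rough sketch of this sliding step, the plain-Python loop below computes a "valid" convolution with stride 1; deep learning libraries implement this far more efficiently, and details such as padding and stride vary.

import numpy as np

def feature_map(image, filt):
    """Slide a filter across an image, recording the response at each position."""
    fh, fw = filt.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

# Example: the vertical-line filter applied to a random 14x7 "image"
vertical_filter = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
print(feature_map(np.random.randint(0, 256, (14, 7)), vertical_filter).shape)  # (12, 5)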
Filtering and Pooling Example
Example: Detecting edges of an image
Sobel filters are applied to filter the image.
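A minimal sketch of Sobel edge detection, assuming a grayscale image stored as a NumPy array: the two standard Sobel kernels respond to horizontal and vertical intensity changes, and their combined magnitude highlights edges.

import numpy as np
from scipy.signal import convolve2d

# Standard Sobel kernels for horizontal and vertical intensity changes
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
sobel_y = sobel_x.T

def sobel_edges(image):
    """Combine the two Sobel responses into an edge-strength map."""
    gx = convolve2d(image, sobel_x, mode="same")
    gy = convolve2d(image, sobel_y, mode="same")
    return np.sqrt(gx**2 + gy**2)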
Convolutions Produce Feature Reduction
In supervised learning, successful convolutions and features are preserved for tagging images.
Feature learning results in fewer, simpler features than the original set of pixel values.
Unsupervised Learning: Autoencoding
Deep learning networks can discover high-level features without labelling guidance.
Structure:
The network has a mechanism to generate an image from high-level features at a bottleneck.
The generated image is evaluated against the original, prompting adjustments similar to backpropagation if mismatched.
The network retains the weights and features that yield the best reconstructions.
Simple Autoencoder
Predicts its input.
Structure:
Includes one hidden layer (simple) or multiple hidden layers (deep autoencoder).
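A minimal Keras sketch of a simple autoencoder with a single hidden (bottleneck) layer; the layer sizes and placeholder data are arbitrary assumptions for illustration, not a prescribed course implementation.

import numpy as np
from tensorflow import keras

n_features, n_bottleneck = 100, 8   # arbitrary sizes chosen for illustration

inputs = keras.Input(shape=(n_features,))
encoded = keras.layers.Dense(n_bottleneck, activation="relu")(inputs)   # bottleneck representation
decoded = keras.layers.Dense(n_features, activation="linear")(encoded)  # reconstruction of the input
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, n_features)          # placeholder data
autoencoder.fit(X, X, epochs=10, verbose=0)   # target equals input: the network predicts itself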
Representation Comparison
Comparison of non-linear autoencoder and PCA in a 2D space showcasing learned data groupings.
Source: Hinton and Salakhutdinov (2006).
Recurrent Neural Networks (RNN)
Characteristics:
Networks possess cycle-forming connections (feedback).
Each hidden unit connects to itself and others, providing an internal state ideal for processing sequence data (e.g., handwriting recognition, speech, translation).
RNNs can be conceptually unrolled over time to make the computation easier to follow.
Every time step applies the same weights and biases to the links between units.
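A small NumPy sketch of an RNN forward pass unrolled over time, illustrating that the same weight matrices are reused at every step; the weight names (W_xh, W_hh, b_h) are illustrative assumptions.

import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Unrolled RNN forward pass: the same weights are reused at every time step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:                              # one step per element of the sequence
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # hidden state feeds back into itself
        states.append(h)
    return states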
Memory Loop in RNNs
RNNs include a memory loop allowing them to recall past information.
Applications: Suitable for time series analysis, sequential data like speech, music, and text.
Data Structure in RNNs
Original time series replicated into overlapping sub-series, with each labeled for one-step-ahead forecasting.
Predictors and targets are formatted as follows:
Series y_1, y_2, …, y_w predicts y_{w+1}
Series y_2, y_3, …, y_{w+1} predicts y_{w+2}
Continuing the pattern, y_{t-w}, …, y_{t-1} forecasts y_t.
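A hedged sketch of how such overlapping windows might be built from a series with NumPy; the helper name make_windows is purely illustrative.

import numpy as np

def make_windows(series, w):
    """Split a series into overlapping windows of length w and one-step-ahead targets."""
    X = np.array([series[i:i + w] for i in range(len(series) - w)])
    y = np.array([series[i + w] for i in range(len(series) - w)])
    return X, y

# Example: windows of length 3 over a short series
X, y = make_windows(np.arange(10), w=3)
# X[0] = [0, 1, 2] predicts y[0] = 3; X[1] = [1, 2, 3] predicts y[1] = 4; ...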
The Issue of Vanishing Gradients in RNNs
RNNs face challenges with gradient calculation, where the gradient for parameters at layer L decomposes into matrix multiplication forms.
Due to repetition of the same matrix across time steps, gradients can vanish to zero or explode to infinity, just as a number with magnitude below one shrinks toward zero, and one with magnitude above one grows without bound, when raised to higher and higher powers (see the sketch below).
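A tiny numerical illustration of this effect: repeatedly applying the same matrix, as backpropagation through time effectively does, drives values toward zero or toward infinity depending on its entries. The matrices below are arbitrary assumptions.

import numpy as np

W_small = np.array([[0.5, 0.0], [0.0, 0.5]])
W_large = np.array([[1.5, 0.0], [0.0, 1.5]])

print(np.linalg.matrix_power(W_small, 50))  # entries near zero -> vanishing gradient
print(np.linalg.matrix_power(W_large, 50))  # entries enormous  -> exploding gradient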
Long Short Term Memory (LSTM)
LSTMs address the short-term memory issues faced by RNNs concerning vanishing gradients.
They utilize a gate operator (forget gate) enabling information retention, whereby the network adjusts the retention period to optimize prediction performance.
LSTM Architecture and Functions
Designed specifically to resolve the vanishing gradient issue.
Structure:
Memory cells, input gates, output gates, and forget gates are incorporated.
Memory cells uniquely maintain information for extended durations.
Each cell possesses input and output gates controlled by learnable weights based on present observations and the prior hidden states.
Enhances backpropagation by allowing error terms to be stored and propagated without degradation.
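A minimal Keras sketch of an LSTM used for one-step-ahead forecasting on windows of length w; the window length and layer sizes are arbitrary assumptions, not a prescribed course implementation.

from tensorflow import keras

w = 20   # window length; arbitrary choice for illustration

inputs = keras.Input(shape=(w, 1))          # w time steps, one value per step
hidden = keras.layers.LSTM(32)(inputs)      # memory cells with input, output, and forget gates
outputs = keras.layers.Dense(1)(hidden)     # one-step-ahead forecast
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")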
LSTM Architecture
Source: Arpit Rathore, MDTI Project Report, University of Ottawa, 2021.
Appendix: Deep Learning Software
This appendix provides additional information on deep learning software; it is not part of the course material and is included for reference only.
Deep Learning Software - Theano
A Python library focused on deep learning research (Bergstra et al., 2010; Theano Development Team, 2016).
A versatile tool for mathematical programming that extends NumPy with symbolic differentiation and GPU support.
Features a high-level language for deep learning model expressions and a compiler optimized for performance leveraging GPU capabilities.
Supports execution on multiple GPUs.
Theano Features
Enables declaring symbolic variables for inputs and targets; numerical values are supplied at runtime.
Shared variables, such as weights, are tied to values stored in NumPy arrays.
Generates symbolic graphs defining mathematical operations comprising variable, constant, apply, and operation nodes.
Constant nodes aid optimization by remaining unchanged during computation.
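A short illustrative Theano fragment showing a symbolic variable, a shared variable, a compiled function, and symbolic differentiation; the specific expression is an arbitrary example.

import numpy as np
import theano
import theano.tensor as T

x = T.dvector("x")                           # symbolic input; values supplied at runtime
w = theano.shared(np.ones(3), name="w")      # shared variable backed by a NumPy array
y = T.dot(x, w) ** 2                         # symbolic graph of operations

f = theano.function(inputs=[x], outputs=y)   # compiled (optionally GPU-accelerated) function
grad_f = theano.function([x], T.grad(y, w))  # symbolic differentiation of the graph

print(f(np.array([1.0, 2.0, 3.0])))          # (1 + 2 + 3)^2 = 36.0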
Deep Learning Software - TensorFlow
A C++ and Python library for numerical computations typically associated with deep learning (Abadi et al., 2016).
Heavily inspired by Theano, utilizing dataflow graphs to represent the communication of multidimensional arrays (called tensors) between operations.
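A minimal illustrative fragment using the modern TensorFlow 2 eager API (the original release described in Abadi et al., 2016, used explicit graph sessions); the values are arbitrary.

import tensorflow as tf

# Tensors flow through a graph of operations; gradients are computed automatically
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable(tf.ones((2, 1)))

with tf.GradientTape() as tape:
    y = tf.reduce_sum(tf.matmul(x, w) ** 2)

print(tape.gradient(y, w))   # gradient of y with respect to w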