Sequences are ordered series of elements, which can be anything from words in a sentence to temperature readings over time. Understanding the order and relationships between these elements is crucial for various applications.
Natural Language: The order of words in a sentence drastically changes the meaning. For example, "I only drink coffee" has a different meaning from "Only I drink coffee." This highlights the importance of word order in conveying the intended message.
Time Series: These are sequences of data points indexed in time order. Examples include temperature readings recorded at specific times (e.g., 7°C at 11:00, 8°C at 12:00, …, 12°C at 15:00). Analyzing time series data helps in identifying trends, seasonality, and anomalies.
Audio, Video, DNA: These are also examples of sequences. Audio can be represented as a sequence of sound amplitudes over time. Video is a sequence of frames, each being an image. DNA is a sequence of nucleotides. Sequence analysis techniques can be applied to these diverse forms of data.
Classification: Assigning a sequence to a specific category.
Speech recognition: Converting spoken words into text requires classifying audio sequences into corresponding words or phonemes.
Fraud detection, network intrusion detection: Identifying malicious activities by analyzing sequences of transactions or network traffic data.
Fault detection and predictive maintenance: Predicting equipment failures by analyzing sequences of sensor data.
Medical diagnostics: Diagnosing diseases by analyzing sequences of medical images or patient data.
Sentiment analysis: Determining the emotional tone of a text by analyzing the sequence of words.
Topic classification: Categorizing documents based on their content by analyzing the sequence of words.
Forecasting (Regression of future values): Predicting future values based on past data.
Predicting weather, energy prices, stock prices: These predictions rely on models that can remember and understand patterns in historical time series data.
Requires a model that can remember the past and identify relevant patterns to extrapolate future values.
Sequence-to-sequence learning: Transforming one sequence into another.
Language translation: Converting a sentence from one language to another requires understanding the sequence of words in the source language and generating the corresponding sequence in the target language.
Image captioning: Generating a textual description of an image involves analyzing the sequence of visual features and producing a coherent sentence.
Text summarization: Condensing a long text into a shorter version requires understanding the sequence of sentences and identifying the most important information.
Requires a model that can remember the context and generate a new sequence based on that context.
For images, we look for patterns between neighboring pixels in 2D space. Similarly, for sequences, we look for patterns between neighboring elements (e.g., points in time) in 1D space.
While sequences are 1D, they can still have multiple channels. For example, a time series of stock prices might include channels for price, volume, and other indicators.
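As a small sketch of this, suppose we track a hypothetical stock as three channels (price, volume, one indicator); Keras sequence layers then expect the data shaped as (samples, timesteps, channels):
import numpy as np
series = np.random.rand(250, 3)     # 250 time steps, 3 channels (price, volume, indicator)
batch = series[np.newaxis, ...]     # add a batch axis: shape (1, 250, 3)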
Analogy between image and sequence processing:
keras.layers.Conv2D corresponds to keras.layers.Conv1D.
keras.layers.Conv2DTranspose corresponds to keras.layers.Conv1DTranspose.
keras.layers.MaxPooling2D corresponds to keras.layers.MaxPooling1D.
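A minimal sketch of the correspondence (the input shapes and layer sizes here are made up for illustration):
from tensorflow import keras
from tensorflow.keras import layers

image_in = keras.Input(shape=(64, 64, 3))    # (height, width, channels)
seq_in = keras.Input(shape=(120, 3))         # (timesteps, channels)

x_img = layers.MaxPooling2D(2)(layers.Conv2D(16, 3, activation="relu")(image_in))
x_seq = layers.MaxPooling1D(2)(layers.Conv1D(16, 3, activation="relu")(seq_in))   # same idea, one dimension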
Illustrative examples show how a convolution kernel slides along a 1D sequence to produce an output.
Cross-correlation is a similar operation to convolution, but one of the functions is reversed.
The formula for convolution: (f ∗ g)(t) ≡ ∫_{−∞}^{∞} f(τ) g(t − τ) dτ
Substituting f(t) → f(−t) gives the cross-correlation: (f ⋆ g)(t) ≡ (f(−t) ∗ g)(t)
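A quick NumPy check of this relationship (the arrays are arbitrary examples); in practice the distinction rarely matters for Conv layers, since the kernel weights are learned either way:
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

conv = np.convolve(f, g)                 # convolution: flips one signal before sliding
xcorr = np.correlate(f, g, mode="full")  # cross-correlation: slides without flipping
print(np.allclose(xcorr, np.convolve(f, g[::-1])))   # True: reversing one input turns one into the other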
CNNs are effective for classification because of translation invariance, meaning they can recognize patterns regardless of where they occur in the input.
For forecasting, translation invariance is typically not desired because recent data is more informative than older data. The relevance of data points often diminishes as they become more distant in time.
New assumption: Recent data is more informative than old data. This assumption underlies many forecasting models, which give more weight to recent observations.
Traditional neural networks have no memory or state. Each input is processed independently, without considering past inputs.
RNNs introduce a state by having each node store its previous output. This state allows the network to remember information about past inputs and use it to influence the processing of future inputs.
Recall that a regular Dense layer computes its output by
outputs = activation(tf.linalg.matvec(W, inputs) + b)
Where:
W is the weight matrix.
inputs is the vector of features.
b is the bias vector.
The recurrent node has two sets of weights:
W_x: The usual weights applied to the current input.
W_y: Weights applied to the previous output (state).
The outputs then become:
state_t = tf.zeros(shape=(num_output_features,))   # initial state: all zeros
outputs = []
for input_t in input_sequence:   # loop over the inputs at each time step t
    output_t = activation(tf.linalg.matvec(W_x, input_t) + tf.linalg.matvec(W_y, state_t) + b)
    outputs.append(output_t)
    state_t = output_t   # the current output becomes the next step's state
output_sequence = tf.stack(outputs, axis=0)
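This manual loop is essentially what keras.layers.SimpleRNN implements; a minimal sketch of the built-in equivalent (the input shape is hypothetical):
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(100, 8))                      # 100 time steps, 8 features per step
x = layers.SimpleRNN(32, return_sequences=True)(inputs)   # learns W_x, W_y and b, loops over time internally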
Sequence-to-sequence: Input is a sequence, and output is also a sequence.
Sequence-to-vector: Input is a sequence, and output is a fixed-size vector.
Vector-to-sequence: Input is a fixed-size vector, and output is a sequence.
Encoder-decoder networks: A combination of sequence-to-vector (encoder) and vector-to-sequence (decoder) models.
These are different architectures for handling sequences, as illustrated in Figure 15-4.
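A minimal Keras sketch of these patterns (layer sizes and lengths are arbitrary): return_sequences switches between sequence-to-sequence and sequence-to-vector, and RepeatVector is one way to go from a vector to a sequence.
from tensorflow import keras
from tensorflow.keras import layers

seq_in = keras.Input(shape=(120, 5))                               # (timesteps, features)
seq_to_seq = layers.SimpleRNN(16, return_sequences=True)(seq_in)   # output at every time step
seq_to_vec = layers.SimpleRNN(16)(seq_in)                          # only the last output

vec_in = keras.Input(shape=(16,))
vec_to_seq = layers.SimpleRNN(16, return_sequences=True)(
    layers.RepeatVector(120)(vec_in))                              # decode a vector into a sequence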
Simplest possible forecast: The value tomorrow is the same as the value today. This is a naive forecast that assumes no change over time.
y_t = y_{t−1}
More advanced forecast: The value tomorrow is given by a weighted sum of the previous time steps, plus a noise term ϵ_t. (The φ_i are the parameters of this autoregressive model.)
y_t = Σ_i φ_i y_{t−i} + ϵ_t
Can add moving average to get an ARMA model, look at differences to get an ARIMA model, and add seasonality to get a SARIMA model.
Partial autocorrelation plots are commonly used to choose the autoregressive order.
Lots of work on (traditional) statistical time series modelling - usually worth trying out before going to deep learning. Statistical models like ARIMA are often a good starting point for time series analysis.
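For example, a naive forecast and an ARIMA fit take only a few lines with statsmodels (the toy series and the order (5, 1, 0) below are arbitrary placeholders):
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.sin(np.linspace(0, 20, 200)) + 0.1 * np.random.randn(200)   # toy univariate series

naive_forecast = series[-1]                      # y_t = y_{t−1}: tomorrow equals today

fitted = ARIMA(series, order=(5, 1, 0)).fit()    # AR order 5, one differencing step, no MA term
forecast = fitted.forecast(steps=14)             # predict the next 14 values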
In practice, RNNs suffer from vanishing/exploding gradients during training, making it difficult for them to learn long-term dependencies. Vanishing gradients occur when the gradients become very small, preventing the network from learning. Exploding gradients occur when the gradients become very large, causing the network to become unstable.
A hidden state can be introduced that is distinct from the output. This hidden state provides additional memory capacity and can help the network learn more complex patterns.
Two most used approaches: LSTMs and GRUs. These are specialized types of recurrent cells that are designed to address the vanishing gradient problem.
Add long-term memory by having two states in each cell:
A short-term state h_t
A long-term state c_t
Gates determine data flow – small networks inside the cell act as gate operators. These gates regulate the flow of information into and out of the cell, allowing it to selectively remember or forget information.
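As a sketch, the standard LSTM update (conventional notation, not taken from this text) uses three sigmoid gates to decide what to forget, what to write, and what to output:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (forget gate: what to erase from c_{t−1})
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (input gate: what to write into c_t)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)   (output gate: what to expose as h_t)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c · [h_{t−1}, x_t] + b_c)
h_t = o_t ⊙ tanh(c_t)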
Simplified and somewhat more effective variant of the LSTM cell. GRUs have fewer parameters than LSTMs, making them faster to train.
As usual, we can increase the capacity by stacking layers. Stacking layers allows the network to learn more complex representations of the input data.
When building a deep RNN, intermediate layers should return the entire sequence. This allows the subsequent layers to access the full context of the input sequence.
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.GRU(32, return_sequences=True)(inputs)
x = layers.GRU(32, return_sequences=True)(x)
x = layers.GRU(32)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
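A typical training call for such a model might look like this (the dataset names and hyperparameters are placeholders):
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])     # regression on the next value
model.fit(train_dataset, epochs=10, validation_data=val_dataset)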
Some tricks to efficiently train recurrent networks:
Use saturating activation functions (tanh, sigmoid):
layers.LSTM(units, activation="tanh", recurrent_activation="sigmoid")
Use layer normalization (keras.layers.LayerNormalization) instead of batch normalization; it can help stabilize training and improve performance (a sketch appears after this list).
Add recurrent dropout (potentially in addition to regular dropout):
x = layers.LSTM(32, recurrent_dropout=0.25)(inputs)
Test if training runs faster on CPU than on GPU.
The optimized NVIDIA (cuDNN) backend is only used when LSTM/GRU layers keep their default arguments.
The for loop inside recurrent layers reduces parallelizability; the sequential nature of RNNs can limit the benefits of GPU acceleration.
Can optionally unroll for loops (memory intensive):
x = layers.LSTM(32, unroll=True)(inputs)
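Returning to the layer-normalization trick above: one way to apply it inside the recurrence (rather than between layers) is a custom cell wrapped in keras.layers.RNN. This is a sketch following a common pattern, with a hypothetical class name, not an official Keras layer:
from tensorflow import keras
from tensorflow.keras import layers

class LNSimpleRNNCell(layers.Layer):        # hypothetical name
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.cell = layers.SimpleRNNCell(units, activation=None)   # linear recurrence
        self.layer_norm = layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def call(self, inputs, states):
        outputs, new_states = self.cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))   # normalize, then apply activation
        return norm_outputs, [norm_outputs]

inputs = keras.Input(shape=(None, 8))       # any sequence length, 8 features per step
x = layers.RNN(LNSimpleRNNCell(32), return_sequences=True)(inputs)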
Even with the previous tricks, getting RNNs to learn patterns over >100 time steps is difficult. The vanishing gradient problem can make it challenging for RNNs to capture long-range dependencies.
Extract small-scale patterns with convolutional layers first, then apply recurrent layers (add stride > 1 to downsample). This can help to reduce the sequence length and make it easier for the RNN to learn long-range dependencies.
model = keras.Sequential([
keras.layers.Conv1D(filters=32, kernel_size=4, strides=2, activation="relu"),
keras.layers.GRU(32, return_sequences=True),
keras.layers.Dense(14)
])
Or maybe skip the recurrence altogether? WaveNet architecture:
keras.layers.Conv1D(..., padding="causal")  # look only backwards in time
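A minimal WaveNet-style sketch: stack causal convolutions with increasing dilation so the receptive field grows exponentially (filter counts and the feature dimension are arbitrary):
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 5))        # any length, 5 features per step
x = inputs
for dilation in (1, 2, 4, 8):                # receptive field doubles at each layer
    x = layers.Conv1D(32, kernel_size=2, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)
outputs = layers.Dense(1)(x)                 # one prediction per time step
model = keras.Model(inputs, outputs)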
For time series, the most recent data points are expected to be most important, so chronological ordering makes sense. In many time series applications, the past has a direct influence on the future.
Sometimes this is not the case - for instance, for text. The context of a word can depend on both the words that precede it and the words that follow it.
Process sequences both forwards and in reverse by using a bidirectional recurrent layer. This allows the network to capture information from both directions.
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Bidirectional(layers.LSTM(16))(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)