Perceptron
Simple neural unit that computes weighted sum of inputs and applies activation function
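A minimal sketch of the weighted-sum-plus-activation idea in NumPy, using a step activation and hand-picked AND-gate weights (the weights, bias, and inputs are illustrative, not from the cards):

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias
    z = np.dot(w, x) + b
    # Step activation: output 1 if the sum is non-negative, else 0
    return 1 if z >= 0 else 0

# Illustrative AND gate with hand-chosen weights and bias
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))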
Activation function
Function that introduces nonlinearity into a neural network
Sigmoid activation
Activation function that squashes output between 0 and 1; can cause vanishing gradients
ReLU activation
Activation function f(x) = max(0, x); helps reduce the vanishing gradient problem
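A quick sketch of both activations; the vanishing-gradient contrast shows up in their derivatives (sigmoid's gradient is at most 0.25 and shrinks for large inputs, while ReLU's gradient is exactly 1 for positive inputs). The sample inputs are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes output into (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)                # at most 0.25, tiny for large |x|

def relu(x):
    return np.maximum(0.0, x)         # max(0, x)

def relu_grad(x):
    return (x > 0).astype(float)      # 1 for positive inputs, 0 otherwise

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x), sigmoid_grad(x))
print(relu(x), relu_grad(x))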
Hidden layer
Intermediate layer that learns internal feature representations
Forward propagation
Process of passing inputs through layers to compute output
Loss function
Measures error between prediction and true label
Gradient descent
Optimization method that updates weights in direction that reduces loss
Learning rate
Step size used in gradient descent updates
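A minimal sketch of one gradient-descent loop on a toy quadratic loss; the learning rate of 0.1 and the loss function are illustrative choices:

# Minimize loss(w) = (w - 3)^2 with plain gradient descent
w = 0.0
learning_rate = 0.1            # step size for each update
for step in range(50):
    grad = 2 * (w - 3)         # d(loss)/dw
    w -= learning_rate * grad  # move against the gradient
print(w)                       # converges toward 3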
Backpropagation
Process of computing gradients backward through the network to update weights
Backpropagation step 1
Do forward pass and compute prediction
Backpropagation step 2
Compute loss/error
Backpropagation step 3
Compute gradient of loss with respect to output layer
Backpropagation step 4
Propagate gradients backward using chain rule
Backpropagation step 5
Update weights using gradient descent
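A sketch tying steps 1 through 5 together on a one-hidden-layer network with sigmoid activations and squared-error loss; the layer sizes, data, and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1))            # single training example
y = np.array([[1.0]])                  # target
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(100):
    # Step 1: forward pass to compute the prediction
    h = sigmoid(W1 @ x)
    y_hat = sigmoid(W2 @ h)
    # Step 2: compute the loss
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Step 3: gradient of the loss with respect to the output layer
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)
    # Step 4: propagate gradients backward using the chain rule
    delta1 = (W2.T @ delta2) * h * (1 - h)
    # Step 5: update weights with gradient descent
    W2 -= 0.5 * delta2 @ h.T
    W1 -= 0.5 * delta1 @ x.T
print(loss)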
Vanishing gradient
Gradients become very small in deep/recurrent networks, slowing learning
RNN
Neural network that processes sequential data using hidden state memory
Hidden state in RNN
Vector that carries information from previous time step
Shared weights in RNN
Same weights are reused across all time steps
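A minimal sketch of an RNN cell unrolled over a short sequence; note that the same W_x, W_h, and b are reused at every time step while the hidden state h carries context forward (all shapes and data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))   # toy input sequence
h = np.zeros(hidden_dim)                     # initial hidden state

for x_t in xs:
    # Shared weights at every step; h summarizes everything seen so far
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h.shape)   # (8,)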
Why RNNs are useful
Can model sequential/contextual information
RNN problem
Vanishing/exploding gradients and weak long-term memory
Long-term dependency problem
Standard RNN struggles to remember information far back in sequence
LSTM
Long Short-Term Memory network designed to preserve long-range information
Cell state in LSTM
Long-term memory pathway through the network
Forget gate in LSTM
Decides what information to remove from memory
Input gate in LSTM
Decides what new information to store
Output gate in LSTM
Decides what memory to expose as output
Why LSTM works better than RNN
Gates control memory flow and reduce vanishing gradients
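A sketch of one LSTM step showing how the forget, input, and output gates act on the cell state; the weights here are random placeholders rather than trained values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the forget (f), input (i),
    # output (o) gates and the candidate cell value (g)
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # what to erase
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # what to write
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # what to expose
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate memory
    c = f * c_prev + i * g        # cell state: long-term memory pathway
    h = o * np.tanh(c)            # hidden state: memory exposed as output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.normal(size=(d_h, d_in)) for k in "fiog"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "fiog"}
b = {k: np.zeros(d_h) for k in "fiog"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h.shape, c.shape)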
GRU
Gated Recurrent Unit; simpler gated recurrent architecture
Update gate in GRU
Controls how much old memory to keep vs new memory to add
Reset gate in GRU
Controls how much old information to forget when computing candidate memory
LSTM vs GRU
LSTM has 3 gates + cell state; GRU has 2 gates and simpler design
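For comparison, a sketch of one GRU step with its two gates (update z and reset r) and no separate cell state; the weights are again illustrative placeholders:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, U, b):
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])   # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])   # reset gate
    h_cand = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])
    # Blend old memory and candidate memory using the update gate
    return (1 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.normal(size=(d_h, d_in)) for k in "zrh"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}
print(gru_step(rng.normal(size=d_in), np.zeros(d_h), W, U, b).shape)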
Transformer
Attention-based sequence architecture; main advantage over RNNs is parallelization and better long-range dependency modeling
Self-attention
Mechanism that lets each word attend to every other word in sequence
Query in attention
Vector representing what a token is looking for
Key in attention
Vector representing what a token offers for matching
Value in attention
Vector containing information passed forward
Attention score
Similarity between Query and Key
Scaled dot-product attention
Attention computed as softmax(QKᵀ/√d_k)·V: scores QKᵀ are scaled by √d_k, passed through softmax, then used for a weighted sum of V
Why divide by √d_k
Prevents large dot products that make softmax unstable
Softmax in attention
Converts scores into probability weights
Attention output
Weighted combination of Value vectors
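A sketch putting the last few cards together: compute QKᵀ/√d_k, turn the scores into weights with softmax, then take the weighted combination of the Value vectors (shapes and data are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q, K = rng.normal(size=(seq_len, d_k)), rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)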
Multi-head attention
Multiple attention mechanisms run in parallel to capture different relationships
Positional encoding
Adds word-order information to transformer input
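A sketch of the sinusoidal positional encoding from the original transformer paper, one common way to inject word-order information; the sequence length and model dimension are illustrative:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]          # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])     # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])     # odd dimensions: cosine
    return pe

print(sinusoidal_positional_encoding(10, 16).shape)   # (10, 16)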
Residual connection
Skip connection that helps gradient flow and stabilizes training
Layer normalization
Normalizes activations to improve training stability
Masking
Prevents model from seeing future tokens during prediction
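A small sketch of a causal (look-ahead) mask: future positions get -inf added to their attention scores, so softmax assigns them zero weight; the sequence length is illustrative:

import numpy as np

seq_len = 4
# Strictly upper-triangular mask: position i may not attend to positions j > i
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
scores = np.zeros((seq_len, seq_len)) + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # each row only weights the current and earlier tokens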
Encoder in transformer
Processes input representation
Decoder in transformer
Generates output sequence
Why transformers outperform RNNs
Parallel training + stronger long-distance relationships
GPT
Decoder-only transformer model
GPT training objective
Predict next token (autoregressive)
GPT attention type
Left-to-right masked attention
GPT best at
Text generation
BERT
Encoder-only transformer model
BERT training objective
Masked language modeling
BERT attention type
Bidirectional attention
BERT best at
Language understanding tasks
[CLS] token
Special token representing full input sequence
[SEP] token
Special separator token between segments
[MASK] token
Token hidden during pretraining for prediction
BERT embeddings
Token embeddings + positional embeddings + segment embeddings
Segment embeddings in BERT
Distinguish sentence A vs sentence B
GPT vs BERT
GPT generates text autoregressively; BERT learns bidirectional representations
Paradigmatic association
Words similar in meaning or substitutable (ex: dog/cat)
Syntagmatic association
Words that frequently co-occur (ex: eat/food)
Entropy
Measures uncertainty/randomness
High entropy
Harder to predict
Low entropy
More predictable
Conditional entropy
Remaining uncertainty in X after knowing Y
Mutual information
Reduction in uncertainty of one variable given another
MI formula
I(X;Y)=H(X)−H(X∣Y)
Mutual information property
Symmetric and nonnegative
High mutual information
Strong association between variables
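A sketch computing entropy, conditional entropy, and mutual information from a small joint distribution, using H(X|Y) = H(X,Y) - H(Y) and I(X;Y) = H(X) - H(X|Y); the 2x2 joint table is made up for illustration:

import numpy as np

# Joint distribution P(X, Y) over two binary variables (illustrative numbers)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                    # marginal P(X)
p_y = p_xy.sum(axis=0)                    # marginal P(Y)

H_x = entropy(p_x)
H_xy = entropy(p_xy.ravel())
H_x_given_y = H_xy - entropy(p_y)         # H(X|Y) = H(X,Y) - H(Y)
mi = H_x - H_x_given_y                    # I(X;Y) = H(X) - H(X|Y)
print(round(H_x, 3), round(H_x_given_y, 3), round(mi, 3))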
Pull mode
User initiates search (search engines)
Push mode
System initiates delivery (recommender systems)
Content-based filtering
Recommend items similar to ones user liked before
Collaborative filtering
Recommend based on similar users' preferences
Cold start problem
New users or items have too little interaction data at first to make good recommendations
Memory-based collaborative filtering
Predict preferences using neighboring similar users
ALS
Alternating Least Squares matrix factorization method
ALS idea
Factor rating matrix into user vectors and item vectors
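A sketch of the ALS idea on a tiny rating matrix: alternately fix item vectors and solve a regularized least-squares problem for each user vector, then fix user vectors and solve for each item vector; the ratings, latent rank, and regularization strength are illustrative:

import numpy as np

R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])     # 0 = unobserved rating (illustrative)
mask = R > 0
k, lam = 2, 0.1                          # latent rank and regularization
rng = np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], k))     # user factors
V = rng.normal(size=(R.shape[1], k))     # item factors

for _ in range(20):
    # Fix item factors; solve a ridge regression for each user vector
    for u in range(R.shape[0]):
        Vu = V[mask[u]]
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * np.eye(k), Vu.T @ R[u, mask[u]])
    # Fix user factors; solve for each item vector
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ R[mask[:, i], i])

print(np.round(U @ V.T, 1))   # reconstructed ratings, including unobserved cells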
Topic mining
Discover hidden topics in text data
Topic representation
Probability distribution over words
Why topic as word distribution better
Represents richer, multi-word concepts
Language model
Probability distribution over text
Unigram language model
Assumes words generated independently
MLE
Choose parameters maximizing likelihood of observed data
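A sketch of maximum-likelihood estimation for a unigram language model: the MLE word probabilities are just relative counts, and text likelihood is a product of word probabilities under the independence assumption; the toy corpus is illustrative:

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

# MLE for a unigram model: p(w) = count(w) / total tokens
p = {w: c / total for w, c in counts.items()}
print(p["the"], p["cat"])   # 3/9 and 2/9

# Likelihood of a short text, assuming words are generated independently
text = ["the", "cat", "sat"]
likelihood = 1.0
for w in text:
    likelihood *= p[w]
print(likelihood)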
MAP
Choose parameters maximizing posterior probability
Mixture model
Data generated by combining multiple distributions
Background language model
Model for common/background words
Benefit of background LM
Absorbs stopwords/common words so the topical model focuses on content words
Law of total probability in mixture model
P(word) = Σ over topics of p(topic) · p(word ∣ topic)
EM algorithm
Iterative method for estimating hidden-variable models
E-step
Estimate hidden variable assignments probabilistically
M-step
Update parameters using expected assignments
EM weakness
Can converge to local maximum
PLSA
Probabilistic Latent Semantic Analysis; document is mixture of topics
PLSA output
Topic word distributions + topic proportions per document
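A sketch of the E-step / M-step loop for a two-component unigram mixture (a fixed background model plus one topic word distribution to be estimated), in the spirit of PLSA with a background language model; the corpus counts, background probabilities, and the 0.5 mixing weight are illustrative assumptions:

import numpy as np

vocab = ["the", "is", "text", "mining", "data"]
counts = np.array([10.0, 8.0, 5.0, 4.0, 3.0])     # word counts in one document
p_bg = np.array([0.4, 0.3, 0.1, 0.1, 0.1])        # fixed background model
lam_bg = 0.5                                      # P(background) mixing weight
p_topic = np.full(len(vocab), 1.0 / len(vocab))   # topic word distribution to learn

for _ in range(50):
    # E-step: probability each word occurrence came from the topic, not the background
    z_topic = (1 - lam_bg) * p_topic / ((1 - lam_bg) * p_topic + lam_bg * p_bg)
    # M-step: re-estimate the topic word distribution from expected counts
    expected = counts * z_topic
    p_topic = expected / expected.sum()

print(dict(zip(vocab, np.round(p_topic, 3))))   # common words get absorbed by the background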