Perceptron
Simple neural unit that computes weighted sum of inputs and applies activation function
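A minimal sketch of the weighted-sum-plus-activation idea in NumPy, using a step activation and hand-picked AND-gate weights (the weights, bias, and inputs are illustrative, not from the cards):

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias
    z = np.dot(w, x) + b
    # Step activation: output 1 if the sum is non-negative, else 0
    return 1 if z >= 0 else 0

# Illustrative AND gate with hand-chosen weights and bias
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))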
Activation function
Function that introduces nonlinearity into a neural network
Sigmoid activation
Activation function that squashes output between 0 and 1; can cause vanishing gradients
ReLU activation
Activation function f(x) = max(0, x); helps reduce the vanishing gradient problem
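A quick sketch of both activations; the vanishing-gradient contrast shows up in their derivatives (sigmoid's gradient is at most 0.25 and shrinks for large inputs, while ReLU's gradient is exactly 1 for positive inputs). The sample inputs are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes output into (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)                # at most 0.25, tiny for large |x|

def relu(x):
    return np.maximum(0.0, x)         # max(0, x)

def relu_grad(x):
    return (x > 0).astype(float)      # 1 for positive inputs, 0 otherwise

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x), sigmoid_grad(x))
print(relu(x), relu_grad(x))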
Hidden layer
Intermediate layer that learns internal feature representations
Forward propagation
Process of passing inputs through layers to compute output
Loss function
Measures error between prediction and true label
Gradient descent
Optimization method that updates weights in direction that reduces loss
Learning rate
Step size used in gradient descent updates
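A minimal sketch of one gradient-descent loop on a toy quadratic loss; the learning rate of 0.1 and the loss function are illustrative choices:

# Minimize loss(w) = (w - 3)^2 with plain gradient descent
w = 0.0
learning_rate = 0.1            # step size for each update
for step in range(50):
    grad = 2 * (w - 3)         # d(loss)/dw
    w -= learning_rate * grad  # move against the gradient
print(w)                       # converges toward 3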
Backpropagation
Process of computing gradients backward through the network to update weights
Backpropagation step 1
Do forward pass and compute prediction
Backpropagation step 2
Compute loss/error
Backpropagation step 3
Compute gradient of loss with respect to output layer
Backpropagation step 4
Propagate gradients backward using chain rule
Backpropagation step 5
Update weights using gradient descent
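A sketch tying steps 1 through 5 together on a one-hidden-layer network with sigmoid activations and squared-error loss; the layer sizes, data, and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1))            # single training example
y = np.array([[1.0]])                  # target
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(100):
    # Step 1: forward pass to compute the prediction
    h = sigmoid(W1 @ x)
    y_hat = sigmoid(W2 @ h)
    # Step 2: compute the loss
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Step 3: gradient of the loss with respect to the output layer
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)
    # Step 4: propagate gradients backward using the chain rule
    delta1 = (W2.T @ delta2) * h * (1 - h)
    # Step 5: update weights with gradient descent
    W2 -= 0.5 * delta2 @ h.T
    W1 -= 0.5 * delta1 @ x.T
print(loss)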
Vanishing gradient
Gradients become very small in deep/recurrent networks, slowing learning
RNN
Neural network that processes sequential data using hidden state memory
Hidden state in RNN
Vector that carries information from previous time step
Shared weights in RNN
Same weights are reused across all time steps
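A minimal sketch of an RNN cell unrolled over a short sequence; note that the same W_x, W_h, and b are reused at every time step while the hidden state h carries context forward (all shapes and data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))   # toy input sequence
h = np.zeros(hidden_dim)                     # initial hidden state

for x_t in xs:
    # Shared weights at every step; h summarizes everything seen so far
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h.shape)   # (8,)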
Why RNNs are useful
Can model sequential/contextual information
RNN problem
Vanishing/exploding gradients and weak long-term memory
Long-term dependency problem
Standard RNN struggles to remember information far back in sequence
LSTM
Long Short-Term Memory network designed to preserve long-range information
Cell state in LSTM
Long-term memory pathway through the network
Forget gate in LSTM
Decides what information to remove from memory
Input gate in LSTM
Decides what new information to store
Output gate in LSTM
Decides what memory to expose as output
Why LSTM works better than RNN
Gates control memory flow and reduce vanishing gradients
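A sketch of one LSTM step showing how the forget, input, and output gates act on the cell state; the weights here are random placeholders rather than trained values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the forget (f), input (i),
    # output (o) gates and the candidate cell value (g)
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # what to erase
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # what to write
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # what to expose
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate memory
    c = f * c_prev + i * g        # cell state: long-term memory pathway
    h = o * np.tanh(c)            # hidden state: memory exposed as output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.normal(size=(d_h, d_in)) for k in "fiog"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "fiog"}
b = {k: np.zeros(d_h) for k in "fiog"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h.shape, c.shape)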
GRU
Gated Recurrent Unit; simpler gated recurrent architecture
Update gate in GRU
Controls how much old memory to keep vs new memory to add
Reset gate in GRU
Controls how much old information to forget when computing candidate memory
LSTM vs GRU
LSTM has 3 gates + cell state; GRU has 2 gates and simpler design
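For comparison, a sketch of one GRU step with its two gates (update z and reset r) and no separate cell state; the weights are again illustrative placeholders:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, U, b):
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])   # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])   # reset gate
    h_cand = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])
    # Blend old memory and candidate memory using the update gate
    return (1 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.normal(size=(d_h, d_in)) for k in "zrh"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}
print(gru_step(rng.normal(size=d_in), np.zeros(d_h), W, U, b).shape)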
Transformer
Attention-based sequence architecture; main advantage over RNNs is parallelization and better long-range dependency modeling
Self-attention
Mechanism that lets each word attend to every other word in sequence
Query in attention
Vector representing what a token is looking for
Key in attention
Vector representing what a token offers for matching
Value in attention
Vector containing information passed forward
Attention score
Similarity between Query and Key
Scaled dot-product attention
Attention computed as softmax(QKᵀ/√d_k)·V: scores QKᵀ are scaled by √d_k, passed through softmax, then used for a weighted sum of V
Why divide by √d_k
Prevents large dot products that make softmax unstable
Softmax in attention
Converts scores into probability weights
Attention output
Weighted combination of Value vectors
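A sketch putting the last few cards together: compute QKᵀ/√d_k, turn the scores into weights with softmax, then take the weighted combination of the Value vectors (shapes and data are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q, K = rng.normal(size=(seq_len, d_k)), rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)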
Multi-head attention
Multiple attention mechanisms run in parallel to capture different relationships
Positional encoding
Adds word-order information to transformer input
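A sketch of the sinusoidal positional encoding from the original transformer paper, one common way to inject word-order information; the sequence length and model dimension are illustrative:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]          # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])     # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])     # odd dimensions: cosine
    return pe

print(sinusoidal_positional_encoding(10, 16).shape)   # (10, 16)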
Residual connection
Skip connection that helps gradient flow and stabilizes training
Layer normalization
Normalizes activations to improve training stability
Masking
Prevents model from seeing future tokens during prediction
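A small sketch of a causal (look-ahead) mask: future positions get -inf added to their attention scores, so softmax assigns them zero weight; the sequence length is illustrative:

import numpy as np

seq_len = 4
# Strictly upper-triangular mask: position i may not attend to positions j > i
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
scores = np.zeros((seq_len, seq_len)) + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # each row only weights the current and earlier tokens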
Encoder in transformer
Processes input representation
Decoder in transformer
Generates output sequence
Why transformers outperform RNNs
Parallel training + stronger long-distance relationships
GPT
Decoder-only transformer model
GPT training objective
Predict next token (autoregressive)
GPT attention type
Left-to-right masked attention
GPT best at
Text generation
BERT
Encoder-only transformer model
BERT training objective
Masked language modeling
BERT attention type
Bidirectional attention
BERT best at
Language understanding tasks
[CLS] token
Special token representing full input sequence
[SEP] token
Special separator token between segments
[MASK] token
Token hidden during pretraining for prediction
BERT embeddings
Token embeddings + positional embeddings + segment embeddings
Segment embeddings in BERT
Distinguish sentence A vs sentence B
GPT vs BERT
GPT generates text autoregressively; BERT learns bidirectional representations
Paradigmatic association
Words similar in meaning or substitutable (ex: dog/cat)
Syntagmatic association
Words that frequently co-occur (ex: eat/food)
Entropy
Measures uncertainty/randomness
High entropy
Harder to predict
Low entropy
More predictable
Conditional entropy
Remaining uncertainty in X after knowing Y
Mutual information
Reduction in uncertainty of one variable given another
MI formula
I(X;Y)=H(X)−H(X∣Y)
Mutual information property
Symmetric and nonnegative
High mutual information
Strong association between variables
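A sketch computing entropy, conditional entropy, and mutual information from a small joint distribution, using H(X|Y) = H(X,Y) - H(Y) and I(X;Y) = H(X) - H(X|Y); the 2x2 joint table is made up for illustration:

import numpy as np

# Joint distribution P(X, Y) over two binary variables (illustrative numbers)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                    # marginal P(X)
p_y = p_xy.sum(axis=0)                    # marginal P(Y)

H_x = entropy(p_x)
H_xy = entropy(p_xy.ravel())
H_x_given_y = H_xy - entropy(p_y)         # H(X|Y) = H(X,Y) - H(Y)
mi = H_x - H_x_given_y                    # I(X;Y) = H(X) - H(X|Y)
print(round(H_x, 3), round(H_x_given_y, 3), round(mi, 3))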
Pull mode
User initiates search (search engines)
Push mode
System initiates delivery (recommender systems)
Content-based filtering
Recommend items similar to ones user liked before
Collaborative filtering
Recommend based on similar users' preferences
Cold start problem
New users or items have too little interaction data at first to make good recommendations
Memory-based collaborative filtering
Predict preferences using neighboring similar users
ALS
Alternating Least Squares matrix factorization method
ALS idea
Factor rating matrix into user vectors and item vectors
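A sketch of the ALS idea on a tiny rating matrix: alternately fix item vectors and solve a regularized least-squares problem for each user vector, then fix user vectors and solve for each item vector; the ratings, latent rank, and regularization strength are illustrative:

import numpy as np

R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])     # 0 = unobserved rating (illustrative)
mask = R > 0
k, lam = 2, 0.1                          # latent rank and regularization
rng = np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], k))     # user factors
V = rng.normal(size=(R.shape[1], k))     # item factors

for _ in range(20):
    # Fix item factors; solve a ridge regression for each user vector
    for u in range(R.shape[0]):
        Vu = V[mask[u]]
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * np.eye(k), Vu.T @ R[u, mask[u]])
    # Fix user factors; solve for each item vector
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ R[mask[:, i], i])

print(np.round(U @ V.T, 1))   # reconstructed ratings, including unobserved cells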
Topic mining
Discover hidden topics in text data
Topic representation
Probability distribution over words
Why topic as word distribution better
Represents richer, multi-word concepts
Language model
Probability distribution over text
Unigram language model
Assumes words generated independently
MLE
Choose parameters maximizing likelihood of observed data
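A sketch of maximum-likelihood estimation for a unigram language model: the MLE word probabilities are just relative counts, and text likelihood is a product of word probabilities under the independence assumption; the toy corpus is illustrative:

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

# MLE for a unigram model: p(w) = count(w) / total tokens
p = {w: c / total for w, c in counts.items()}
print(p["the"], p["cat"])   # 3/9 and 2/9

# Likelihood of a short text, assuming words are generated independently
text = ["the", "cat", "sat"]
likelihood = 1.0
for w in text:
    likelihood *= p[w]
print(likelihood)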
MAP
Choose parameters maximizing posterior probability
Mixture model
Data generated by combining multiple distributions
Background language model
Model for common/background words
Benefit of background LM
Absorbs stopwords/common words so the topical model focuses on content words
Law of total probability in mixture model
P(word) = Σ over topics of p(topic) · p(word ∣ topic)
EM algorithm
Iterative method for estimating hidden-variable models
E-step
Estimate hidden variable assignments probabilistically
M-step
Update parameters using expected assignments
EM weakness
Can converge to local maximum
PLSA
Probabilistic Latent Semantic Analysis; document is mixture of topics
PLSA output
Topic word distributions + topic proportions per document
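A sketch of the E-step / M-step loop for a two-component unigram mixture (a fixed background model plus one topic word distribution to be estimated), in the spirit of PLSA with a background language model; the corpus counts, background probabilities, and the 0.5 mixing weight are illustrative assumptions:

import numpy as np

vocab = ["the", "is", "text", "mining", "data"]
counts = np.array([10.0, 8.0, 5.0, 4.0, 3.0])     # word counts in one document
p_bg = np.array([0.4, 0.3, 0.1, 0.1, 0.1])        # fixed background model
lam_bg = 0.5                                      # P(background) mixing weight
p_topic = np.full(len(vocab), 1.0 / len(vocab))   # topic word distribution to learn

for _ in range(50):
    # E-step: probability each word occurrence came from the topic, not the background
    z_topic = (1 - lam_bg) * p_topic / ((1 - lam_bg) * p_topic + lam_bg * p_bg)
    # M-step: re-estimate the topic word distribution from expected counts
    expected = counts * z_topic
    p_topic = expected / expected.sum()

print(dict(zip(vocab, np.round(p_topic, 3))))   # common words get absorbed by the background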