CS 410 EXAM 3

114 Terms

1
New cards

Perceptron

Simple neural unit that computes weighted sum of inputs and applies activation function
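
A minimal sketch of this in NumPy (the AND-gate weights and the step activation are illustrative choices, not from the course):

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a step activation."""
    z = np.dot(w, x) + b           # weighted sum of inputs
    return 1 if z > 0 else 0       # step activation function

# Example: hand-picked weights that make the unit act like logical AND
x = np.array([1, 1])
w = np.array([0.5, 0.5])
print(perceptron(x, w, b=-0.7))    # -> 1
```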

2
New cards

Activation function

Function that introduces nonlinearity into a neural network

3
New cards

Sigmoid activation

Activation function that squashes output between 0 and 1; can cause vanishing gradients

4
New cards

ReLU activation

Activation function $\max(0, x)$; helps reduce the vanishing gradient problem
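
A quick sketch of both activations in NumPy; the last line only illustrates why saturated sigmoids lead to vanishing gradients:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes output into (0, 1)

def relu(x):
    return np.maximum(0, x)           # max(0, x)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))                      # ~[0.007, 0.5, 0.993]
print(relu(x))                         # [0., 0., 5.]
print(sigmoid(x) * (1 - sigmoid(x)))   # sigmoid gradient is tiny for large |x|
```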

5
New cards

Hidden layer

Intermediate layer that learns internal feature representations

6
New cards

Forward propagation

Process of passing inputs through layers to compute output

7
New cards

Loss function

Measures error between prediction and true label

8
New cards

Gradient descent

Optimization method that updates weights in direction that reduces loss

9
New cards

Learning rate

Step size used in gradient descent updates
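
A minimal gradient-descent sketch; the loss f(w) = (w - 3)^2 and the learning rate of 0.1 are illustrative choices:

```python
# Gradient descent on f(w) = (w - 3)**2, whose gradient is 2 * (w - 3)
w = 0.0
learning_rate = 0.1                  # step size

for step in range(50):
    grad = 2 * (w - 3)               # gradient of the loss at the current weight
    w = w - learning_rate * grad     # move against the gradient to reduce the loss

print(w)   # converges toward the minimizer w = 3
```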

10
New cards

Backpropagation

Process of computing gradients backward through the network to update weights

11
New cards

Backpropagation step 1

Do forward pass and compute prediction

12
New cards

Backpropagation step 2

Compute loss/error

13
New cards

Backpropagation step 3

Compute gradient of loss with respect to output layer

14
New cards

Backpropagation step 4

Propagate gradients backward using chain rule

15
New cards

Backpropagation step 5

Update weights using gradient descent
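
The five steps above worked through for a single sigmoid neuron with squared-error loss (a toy sketch; the input, target, and initial weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 1.5, 1.0       # illustrative input and target
w, b, lr = 0.2, 0.0, 0.5   # illustrative initial weight, bias, learning rate

for epoch in range(3):
    # Step 1: forward pass and prediction
    z = w * x + b
    y_pred = sigmoid(z)
    # Step 2: compute loss/error
    loss = 0.5 * (y_pred - y_true) ** 2
    # Step 3: gradient of loss with respect to the output
    dL_dy = y_pred - y_true
    # Step 4: propagate backward with the chain rule (through sigmoid, then the weights)
    dy_dz = y_pred * (1 - y_pred)
    dL_dw = dL_dy * dy_dz * x
    dL_db = dL_dy * dy_dz
    # Step 5: gradient descent update
    w -= lr * dL_dw
    b -= lr * dL_db
    print(f"epoch {epoch}: loss={loss:.4f}")
```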

16
New cards

Vanishing gradient

Gradients become very small in deep/recurrent networks, slowing learning

17
New cards

RNN

Neural network that processes sequential data using hidden state memory

18
New cards

Hidden state in RNN

Vector that carries information from previous time step

19
New cards

Shared weights in RNN

Same weights are reused across all time steps
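
A sketch of a vanilla RNN unrolled over a toy sequence; note that the same W_xh, W_hh, and b are reused at every time step (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 4, 3, 5

# One set of weights, shared across all time steps
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))   # a toy input sequence
h = np.zeros(hidden_dim)               # initial hidden state

for t in range(T):
    # the hidden state carries information forward from previous time steps
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b)
    print(f"t={t}, h={np.round(h, 3)}")
```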

20
New cards

Why RNNs are useful

Can model sequential/contextual information

21
New cards

RNN problem

Vanishing/exploding gradients and weak long-term memory

22
New cards

Long-term dependency problem

Standard RNN struggles to remember information far back in sequence

23
New cards

LSTM

Long Short-Term Memory network designed to preserve long-range information

24
New cards

Cell state in LSTM

Long-term memory pathway through the network

25
New cards

Forget gate in LSTM

Decides what information to remove from memory

26
New cards

Input gate in LSTM

Decides what new information to store

27
New cards

Output gate in LSTM

Decides what memory to expose as output

28
New cards

Why LSTM works better than RNN

Gates control memory flow and reduce vanishing gradients
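
One LSTM time step written out in NumPy so the three gates and the cell state are visible (shapes and random weights are illustrative; real implementations usually pack the gate weights into a single matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
x = rng.normal(size=d_in)        # input at this time step
h_prev = np.zeros(d_h)           # previous hidden state (short-term memory)
c_prev = np.zeros(d_h)           # previous cell state (long-term memory)

def W():
    # one weight matrix per gate, acting on the concatenation [h_prev, x]
    return rng.normal(size=(d_h, d_h + d_in)) * 0.1

W_f, W_i, W_o, W_c = W(), W(), W(), W()
hx = np.concatenate([h_prev, x])

f = sigmoid(W_f @ hx)            # forget gate: what to erase from memory
i = sigmoid(W_i @ hx)            # input gate: what new information to store
c_tilde = np.tanh(W_c @ hx)      # candidate memory content
c = f * c_prev + i * c_tilde     # updated cell state
o = sigmoid(W_o @ hx)            # output gate: what memory to expose
h = o * np.tanh(c)               # new hidden state
print(np.round(h, 3))
```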

29
New cards

GRU

Gated Recurrent Unit; simpler gated recurrent architecture

30
New cards

Update gate in GRU

Controls how much old memory to keep vs new memory to add

31
New cards

Reset gate in GRU

Controls how much old information to forget when computing candidate memory

32
New cards

LSTM vs GRU

LSTM has 3 gates + cell state; GRU has 2 gates and simpler design
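
For contrast, one GRU step in the same style: two gates and no separate cell state (the final blend follows one common convention; some texts swap which term z scales):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(2)
d_in, d_h = 4, 3
x = rng.normal(size=d_in)
h_prev = np.zeros(d_h)

def W():
    return rng.normal(size=(d_h, d_h + d_in)) * 0.1

W_z, W_r, W_h = W(), W(), W()
hx = np.concatenate([h_prev, x])

z = sigmoid(W_z @ hx)                                     # update gate: old memory vs new
r = sigmoid(W_r @ hx)                                     # reset gate: how much old info to forget
h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate memory
h = (1 - z) * h_prev + z * h_tilde                        # blend old state and candidate
print(np.round(h, 3))
```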

33
New cards

Transformer

Attention-based architecture; main advantage over RNNs is parallelization and better long-range dependency modeling

34
New cards

Self-attention

Mechanism that lets each word attend to every other word in sequence

35
New cards

Query in attention

Vector representing what a token is looking for

36
New cards

Key in attention

Vector representing what a token offers for matching

37
New cards

Value in attention

Vector containing information passed forward

38
New cards

Attention score

Similarity between Query and Key

39
New cards

Scaled dot-product attention

Attention computed with $\frac{QK^{T}}{\sqrt{d_k}}$, softmax, then a weighted sum of V
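
A NumPy sketch of the formula; Q, K, V are small random matrices just to make the shapes concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # similarity of each Query with each Key, scaled
    if mask is not None:
        scores = scores + mask         # e.g. -inf above the diagonal for causal masking
    weights = softmax(scores, axis=-1) # probability weights over positions
    return weights @ V                 # weighted combination of Value vectors

rng = np.random.default_rng(3)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```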

40
New cards

Why divide by $\sqrt{d_k}$

Prevents large dot products that make softmax unstable

41
New cards

Softmax in attention

Converts scores into probability weights

42
New cards

Attention output

Weighted combination of Value vectors

43
New cards

Multi-head attention

Multiple attention mechanisms run in parallel to capture different relationships

44
New cards

Positional encoding

Adds word-order information to transformer input
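
A sketch of the sinusoidal encoding from the original Transformer paper (sine on even dimensions, cosine on odd), assuming that is the variant covered in class:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]           # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]        # index of each (sin, cos) dimension pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cosine
    return pe

print(positional_encoding(max_len=5, d_model=8).round(2))
```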

45
New cards

Residual connection

Skip connection that helps gradient flow and stabilizes training

46
New cards

Layer normalization

Normalizes activations to improve training stability

47
New cards

Masking

Prevents model from seeing future tokens during prediction
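
A sketch of a causal (look-ahead) mask: positions above the diagonal get -inf so softmax assigns them zero weight; it could be passed as the mask in the attention sketch above:

```python
import numpy as np

def causal_mask(seq_len):
    # 0 where attention is allowed, -inf where a token would see a future token
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

print(causal_mask(4))
```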

48
New cards

Encoder in transformer

Processes input representation

49
New cards

Decoder in transformer

Generates output sequence

50
New cards

Why transformers outperform RNNs

Parallel training + stronger long-distance relationships

51
New cards

GPT

Decoder-only transformer model

52
New cards

GPT training objective

Predict next token (autoregressive)

53
New cards

GPT attention type

Left-to-right masked attention

54
New cards

GPT best at

Text generation

55
New cards

BERT

Encoder-only transformer model

56
New cards

BERT training objective

Masked language modeling

57
New cards

BERT attention type

Bidirectional attention

58
New cards

BERT best at

Language understanding tasks

59
New cards

[CLS] token

Special token representing full input sequence

60
New cards

[SEP] token

Special separator token between segments

61
New cards

[MASK] token

Token hidden during pretraining for prediction

62
New cards

BERT embeddings

Token embeddings + positional embeddings + segment embeddings

63
New cards

Segment embeddings in BERT

Distinguish sentence A vs sentence B

64
New cards

GPT vs BERT

GPT generates text autoregressively; BERT learns bidirectional representations

65
New cards

Paradigmatic association

Words similar in meaning or substitutable (ex: dog/cat)

66
New cards

Syntagmatic association

Words that frequently co-occur (ex: eat/food)

67
New cards

Entropy

Measures uncertainty/randomness

68
New cards

High entropy

Harder to predict

69
New cards

Low entropy

More predictable

70
New cards

Conditional entropy

Remaining uncertainty in X after knowing Y

71
New cards

Mutual information

Reduction in uncertainty of one variable given another

72
New cards

MI formula

$I(X;Y) = H(X) - H(X|Y)$
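
A small numeric check of the formula on a made-up joint distribution over two binary variables:

```python
import numpy as np

# Illustrative joint distribution P(X, Y)
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

P_x = P_xy.sum(axis=1)                               # marginal P(X)
P_y = P_xy.sum(axis=0)                               # marginal P(Y)

H_x = -np.sum(P_x * np.log2(P_x))                    # entropy H(X)
H_xy = -np.sum(P_xy * np.log2(P_xy))                 # joint entropy H(X, Y)
H_x_given_y = H_xy + np.sum(P_y * np.log2(P_y))      # H(X|Y) = H(X, Y) - H(Y)
I = H_x - H_x_given_y                                # I(X;Y) = H(X) - H(X|Y)
print(round(I, 4))                                   # ~0.278 bits: knowing Y reduces uncertainty about X
```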

73
New cards

Mutual information property

Symmetric and nonnegative

74
New cards

High mutual information

Strong association between variables

75
New cards

Pull mode

User initiates search (search engines)

76
New cards

Push mode

System initiates delivery (recommender systems)

77
New cards

Content-based filtering

Recommend items similar to ones user liked before

78
New cards

Collaborative filtering

Recommend based on similar users' preferences

79
New cards

Cold start problem

Little user/item data initially

80
New cards

Memory-based collaborative filtering

Predict preferences using neighboring similar users

81
New cards

ALS

Alternating Least Squares matrix factorization method

82
New cards

ALS idea

Factor rating matrix into user vectors and item vectors
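
A bare-bones ALS sketch: fix the item vectors and solve regularized least squares for the user vectors, then swap (dense toy ratings; no handling of missing entries shown):

```python
import numpy as np

rng = np.random.default_rng(4)
R = np.array([[5., 4., 1.],          # toy user x item rating matrix
              [4., 5., 1.],
              [1., 1., 5.]])
n_users, n_items, k, lam = R.shape[0], R.shape[1], 2, 0.1

U = rng.normal(size=(n_users, k))    # user factor vectors
V = rng.normal(size=(n_items, k))    # item factor vectors

for _ in range(20):
    # fix V, solve regularized least squares for all user vectors at once
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    # fix U, solve for all item vectors
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

print(np.round(U @ V.T, 1))   # reconstructed ratings approximate R
```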

83
New cards

Topic mining

Discover hidden topics in text data

84
New cards

Topic representation

Probability distribution over words

85
New cards

Why topic as word distribution better

Represents richer, multi-word concepts

86
New cards

Language model

Probability distribution over text

87
New cards

Unigram language model

Assumes words generated independently

88
New cards

MLE

Choose parameters maximizing likelihood of observed data
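
A sketch of the MLE for a unigram language model, which reduces to relative word counts (the tiny corpus is made up):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat".split()
counts = Counter(corpus)
total = sum(counts.values())

# MLE: p(w) = count(w) / total words, which maximizes the likelihood of the observed text
p = {w: c / total for w, c in counts.items()}
print(p["the"], p["cat"])   # 0.375 0.25
```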

89
New cards

MAP

Choose parameters maximizing posterior probability

90
New cards

Mixture model

Data generated by combining multiple distributions

91
New cards

Background language model

Model for common/background words

92
New cards

Benefit of background LM

Removes stopwords/common words from topical model

93
New cards

Law of total probability in mixture model

$P(\text{word}) = \sum_{\text{topic}} p(\text{topic}) \, p(\text{word} \mid \text{topic})$

94
New cards

EM algorithm

Iterative method for estimating hidden-variable models

95
New cards

E-step

Estimate hidden variable assignments probabilistically

96
New cards

M-step

Update parameters using expected assignments
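
A sketch of EM for a two-component unigram mixture (a fixed background model plus one topic model to estimate), matching the mixture-model cards above; the corpus counts, background model, and mixing weight are illustrative:

```python
import numpy as np

vocab = ["the", "is", "text", "mining", "data"]
counts = np.array([10.0, 8.0, 6.0, 5.0, 4.0])      # word counts in one document

p_bg = np.array([0.4, 0.3, 0.1, 0.1, 0.1])         # fixed background model (common words)
lam_bg = 0.5                                        # probability of the background component
p_topic = np.full(len(vocab), 1.0 / len(vocab))     # topic word distribution, initialized uniform

for _ in range(50):
    # E-step: probability each word occurrence was generated by the topic (hidden variable)
    z_topic = (1 - lam_bg) * p_topic / ((1 - lam_bg) * p_topic + lam_bg * p_bg)
    # M-step: re-estimate the topic word distribution from expected counts
    expected = counts * z_topic
    p_topic = expected / expected.sum()

print(dict(zip(vocab, np.round(p_topic, 3))))   # background-like words get down-weighted
```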

97
New cards

EM weakness

Can converge to local maximum

98
New cards

PLSA

Probabilistic Latent Semantic Analysis; document is mixture of topics

99
New cards

PLSA output

Topic word distributions + topic proportions per document

100
New cards

User-controlled PLSA

PLSA extended with prior knowledge to guide discovered topics