why can there be no learning without inductive biases?
Any finite set of data is consistent with multiple (in fact, infinitely many) generalizations. To land on one, we need inductive biases; this holds for any type of learning.
Inductive problems vs. deductive problems
Deductive problems apply general, true rules to specific examples; inductive problems try to generate general rules from specific examples.
how is language an inductive learning situation?
we are given examples (words we hear) and must infer general rules
nature vs nurture (and why this framing is not perfect)
Nature roughly corresponds to inductive biases, nurture to the learner's data. HOWEVER, inductive biases can themselves be learned/changed (though some must be there from the get-go). Learning a language depends on both nature and nurture, with some aspects of language innate; the debate is about how much is innate.
The poverty of the stimulus / plato’s problem
The linguistic data children encounter are consistent with multiple possible generalizations about how a language's grammar works, yet children reliably arrive at certain generalizations rather than others. Therefore, children must have linguistic inductive biases that favor those generalizations (e.g., forming the question for "my walrus who will sleep can giggle": a linear rule says move the first auxiliary, a hierarchical rule says move the main-clause auxiliary, and children go hierarchical). Specific instances are debated, but the general claim "the data underdetermine the grammar, so we must have linguistic inductive biases" holds.
The subset problem
If two grammars are both consistent with the data and one generates a subset of what the other generates, positive evidence alone can never rule out the superset grammar (ex: pro-drop). (Assumption: the learner only receives positive evidence.)
Baker’s paradox
A version of the subset problem applied to double-object constructions in English: "I gave Alex a book" is fine, but "I donated the library a book" is not, so how do learners avoid overgeneralizing the alternation without negative evidence?
The gavagai problem
Point to a rabbit and say "gavagai": "gavagai" could mean rabbit, could be that rabbit's name, could mean animal, etc.
Goodman’s new riddle of induction (“grue”)
Why do we favor "green" meaning green rather than grue (green if observed before some future time t, blue otherwise)? Both are consistent with all observations so far; an inductive bias favors green.
symbolic structure properties + rules
Made up of discrete units (symbols) combined in structured ways. The symbols you use matter and the way you combine them matters; rules tell us what we can do with those structures.
3 key properties of symbolic structures
Compositionality: meaning is informed by meaning of the parts and how the parts are put together
Systematicity: a symbol has the same meaning in every context
Productivity: with a finite number of components, can create infinitely many configurations
rules and symbols: benefits and drawbacks
Useful for AI (symbolic systems generalize robustly) and for human linguistics (human language largely follows rules).
Ways in which it falls short:
- idioms (the meaning isn't composed from the parts)
- "I washed the plate" vs. "I washed my car" → the same word "washed" picks out somewhat different actions?
- Baker's paradox
(These can usually be patched by adding more rules, but that quickly gets ad hoc...)
Principles & Parameters
universal principles that hold across all languages combined with parameters that define constrained types of variation across languages
Issues with rules and symbols (4)
Rigidity: can only learn grammars within a very particular set of possible grammars (could be gotten around by making the space broader?)
Challenging search process: the space is really big (make the space smaller?)
Expressivity/tractability tradeoff: a broader space gives more expressivity but less ability to search through it
Dealing with uncertainty: no way to assign higher certainty to some hypotheses than to others
Frequentist definition of probability
There is some event with some outcomes. Probability asks: if we repeated this trial many times, in what proportion of repetitions would this particular outcome occur?
Bayesian definition
An event is a hypothesis about the world. P(event) is then our degree of belief that this hypothesis is correct.
joint + conditional probability
P(A, B): the probability that both A and B hold
P(A, B) = P(A) * P(B) only if A and B are independent; in general, P(A, B) = P(A | B) * P(B)
P(A | B): given that B holds, what is the probability of A?
P(A | B) = P(A, B) / P(B)
Bayes’ Rule / theorem
P(A | B) = (P(B | A) * P(A)) / (P(B))
in cog-sci, A = Hypothesis, B = Data
posterior, likelihood, prior
posterior: P(H | D), your degree of belief in hypothesis H after seeing data D (the common framing is that what gets learned is the hypothesis with the highest posterior)
likelihood: P(D | H): if we assume this hypothesis is correct, what is the probability of getting the data we have seen? Usually P(example | meaning) = 1 / (# of entities described by "meaning"), so a more specific meaning (e.g., "gavagai" as that one rabbit's name) gets a bigger likelihood. (Assumes examples are independent, which tbh isn't really true.)
For k independent examples, the likelihood is (1/n)^k, where n = # of entities the meaning picks out and k = # of examples seen.
prior: P(H) = our beliefs before seeing any data; usually set by the modeler to capture some sort of bias
normalizing constant
P(D): out of all possible datasets you could see, what's the probability of seeing this one? P(D) is the same for all hypotheses, so dividing by it just rescales everything by a constant and won't change the ranking.
general game when Bayesian modeling:
1. Hypothesize how the prior works in your particular case
2. Hypothesize how the likelihood works in your particular case
3. Use those hypotheses to compute (unnormalized) posteriors
4. The highest-posterior hypothesis is the one that is predicted to be learned
5. Check if the model’s predictions match human data!
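A minimal sketch of this game in Python, applied to the gavagai setting. The hypothesis space, extensions, and prior values are made up for illustration; the likelihood uses the size principle from above.

```python
# A minimal sketch of the Bayesian word-learning game above.
# The hypothesis space, extensions, and priors here are invented.
# Likelihood uses the size principle: P(example | H) = 1/|H| if the
# example is in H's extension, else 0; independent examples multiply.

hypotheses = {
    # meaning: (extension = set of entities it picks out, prior)
    "this-rabbit": ({"rabbit1"}, 0.2),
    "rabbit":      ({"rabbit1", "rabbit2"}, 0.5),
    "animal":      ({"rabbit1", "rabbit2", "owl1", "fish1"}, 0.3),
}

def unnormalized_posterior(hypothesis, examples):
    extension, prior = hypotheses[hypothesis]
    likelihood = 1.0
    for ex in examples:
        # Zero likelihood if the hypothesis is inconsistent with the data
        likelihood *= (1.0 / len(extension)) if ex in extension else 0.0
    return prior * likelihood

examples = ["rabbit1", "rabbit1"]  # heard "gavagai" twice, pointing at rabbit1
scores = {h: unnormalized_posterior(h, examples) for h in hypotheses}
predicted = max(scores, key=scores.get)  # highest posterior = predicted to be learned
print(scores, "->", predicted)
```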
how do Bayesian models account for rapid learning
The prior accounts for rapid learning: it gives a strong inductive bias toward certain meanings. If the intended meaning aligns with these priors, the model can arrive at it very quickly.
how do Bayesian models account for suspicious coincidences
Captured by the likelihood. If every example we get is of the same fish, that would be a crazy coincidence under a general meaning, so it becomes more likely that the word is actually the name of that one specific fish. If H is a more specific meaning, n_H is smaller, so the likelihood (1/n_H)^k shrinks more slowly for specific meanings than for general ones as examples accumulate. This is a SOLUTION TO THE SUBSET PROBLEM: a way to become confident in the smaller (subset) meaning from positive evidence alone; see the snippet below.
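Reusing the sketch above (with its made-up numbers), we can watch the suspicious coincidence kick in: with one example the general meaning "rabbit" wins on its higher prior, but repeated examples of the same rabbit flip the winner to the specific name.

```python
# Reusing `hypotheses` and `unnormalized_posterior` from the sketch above.
for k in [1, 2, 3]:
    scores = {h: unnormalized_posterior(h, ["rabbit1"] * k) for h in hypotheses}
    print(k, max(scores, key=scores.get))
# k=1 -> "rabbit"; k=2,3 -> "this-rabbit": (1/n)^k shrinks slower for small n
```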
how do Bayesian models account for learning from positive examples alone
By asking "what is the most probable explanation for the data?", we can favor some hypotheses over others, even if many are technically possible; no negative evidence is needed.
how do Bayesian models account for the fact that the hypothesis must be consistent with the data
the likelihood will be zero if the hypothesis is not consistent with the data
why is the intuition “inductive bias = prior“ more complicated?
Inductive biases are also coded into the model through the formula for computing the likelihood, and through the mere choice to use a Bayesian model (with a particular hypothesis space) at all.
Whole object bias (Bayesian model)
Markman (1989) argued that humans have a whole-object bias: a bias for words to refer to whole objects rather than to parts of objects. Our framing so far in fact has a whole-object constraint: the only possible meanings are whole objects! How could we instead make it a soft bias? Modify the prior: allow meanings that are just body parts ("rabbit tail", "owl wing", etc.), but preserve the preference for whole objects by giving the body parts a lower prior than the whole animals (sketch below).
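A minimal sketch of that prior modification; the meanings and prior masses are invented for illustration.

```python
# A soft whole-object bias: part meanings are allowed into the
# hypothesis space but get a lower prior. Numbers are made up.
WHOLE_OBJECT_PRIOR = 0.9   # total prior mass for whole-object meanings
PART_PRIOR = 0.1           # total prior mass for part meanings

whole_meanings = ["rabbit", "owl"]
part_meanings = ["rabbit-tail", "owl-wing"]

prior = {}
for m in whole_meanings:
    prior[m] = WHOLE_OBJECT_PRIOR / len(whole_meanings)
for m in part_meanings:
    prior[m] = PART_PRIOR / len(part_meanings)
# Parts are now possible meanings (soft bias), just dispreferred a priori
```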
Mutual exclusivity (Bayesian model)
Markman (1989) also argued for a mutual exclusivity bias: a bias toward word meanings for which the learner does not already have a word, i.e., a bias against synonyms. How can we instantiate this? Add to the model a list of the words the learner already knows, then use that to modify the prior: reduce the prior probability of any hypothesis the learner already has a word for (sketch below). Or we could reframe what counts as one hypothesis: frame it as learning the meanings of all the words you've encountered, not just one at a time; then computing the prior for that hypothesis could factor in whether any words share meanings.
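A minimal sketch of the first option (modifying the prior using known words); the penalty factor and vocabulary are invented for illustration.

```python
# Mutual exclusivity as a prior modification: hypotheses whose meaning
# the learner already has a word for get their prior scaled down.
ME_PENALTY = 0.1  # made-up factor applied if a synonym would result

known_words = {"bunny": "rabbit", "owl": "owl"}  # word -> meaning

def me_prior(meaning, base_prior):
    already_named = meaning in known_words.values()
    return base_prior * (ME_PENALTY if already_named else 1.0)

print(me_prior("rabbit", 0.5))  # penalized: learner already has "bunny"
print(me_prior("fish", 0.5))    # unpenalized
```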
how do you deal with noise? (Bayesian model)
Modify the likelihood: assume each example could be produced either correctly or as noise.
P(example | H) = P(correct) * P(example | H, correct) + P(noise) * P(example | noise)
P(correct) + P(noise) = 1
P(noise) = something small (0.001?)
P(example | noise) = 1 / (# of objects) (i.e., noise picks an object at random)
Result: the size-principle likelihood is reduced slightly (multiplied by P(correct)) and a small constant is added to it, so no data point is ever truly zero-probability under any hypothesis. A sketch follows below.
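A minimal sketch of this noisy likelihood; the noise rate and world size are invented for illustration.

```python
# Noise-tolerant likelihood: a mixture of the "correct" size-principle
# term and a uniform "noise" term over all objects. Values are made up.
P_NOISE = 0.001
P_CORRECT = 1.0 - P_NOISE
NUM_OBJECTS = 10  # total objects in the toy world

def noisy_likelihood(example, extension):
    clean = (1.0 / len(extension)) if example in extension else 0.0
    noise = 1.0 / NUM_OBJECTS  # noise picks an object at random
    return P_CORRECT * clean + P_NOISE * noise
# No example is ever zero-probability under any hypothesis now
```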
multiple examples? (Bayesian)
multiply the likelihoods of the individual examples together (they are assumed independent)
Marr’s Levels
An information-processing system can be characterized at three levels:
• Computational level: what function does the system compute? Example: adding 2 numbers, 16 + 29 = 45.
• Algorithmic and representational level: what algorithm and representation does the system use to compute this function? Representation: base-10 Arabic numerals (could instead use base-7, or Roman numerals, or ...). Algorithm: add digits right-to-left, carrying the 1 if necessary (could instead do "10 + 20 + 6 + 9", or "16 + 20 + 9", or "16 + 30 − 1", or ...).
• Implementational level: how does the system implement this algorithm? E.g., by passing electric signals inside a calculator.
what marr’s level is Bayesian models?
computational level: not trying to represent the actual way the mind works, but to perform the same computation the mind performs
tolerance principle
Let a rule R be defined over a set of N items. R is productive if and only if e, the number of items not supporting R, does not exceed θ_N, where θ_N = N / ln N.
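A minimal sketch of checking productivity under the Tolerance Principle; the item counts in the example are invented.

```python
import math

# Tolerance Principle: a rule over N items tolerates at most
# theta_N = N / ln(N) exceptions and still remains productive.
def is_productive(n_items, n_exceptions):
    theta = n_items / math.log(n_items)  # math.log is natural log
    return n_exceptions <= theta

# e.g., 120 verbs with 20 exceptions: threshold is 120/ln(120) ≈ 25.1
print(is_productive(120, 20))  # True: the rule stays productive
```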
what makes a linguistic rule productive
if it can be applied to novel forms
how is information represented inside a neural network?
as vectors
what are three ways to think about a vector?
a list of numbers, a point in space, a direction and magnitude
how can you calculate the similarity between two vectors in a neural network
Euclidean distance (assumes that angle and magnitude both matter; can be any non-negative number) or cosine similarity (only the angle/direction matters, not the magnitude; always between −1 and 1). See the sketch below.
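Both measures in plain Python, showing how the same pair of vectors can be far apart in Euclidean terms yet maximally similar in cosine terms.

```python
import math

# Two similarity measures for vectors, matching the definitions above.
def euclidean_distance(u, v):
    # Sensitive to both direction and magnitude; any value >= 0
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Sensitive to direction only; always between -1 and 1
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(euclidean_distance([1, 2], [2, 4]))  # ~2.24: magnitudes differ
print(cosine_similarity([1, 2], [2, 4]))   # 1.0: exactly the same direction
```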
what is a feature vector?
a feature vector is a vector where each number aligns with a feature: each position in the vector encodes some interpretable feature
why is the feature vector idea wrong, why is it still useful
Wrong because the network learns the features itself and often doesn't produce individually meaningful ones; still useful because the learned representations often look like rotated versions of meaningful feature vectors.
how are (basic) neural networks theories of representation
“feature vectors“
how are (basic) neural networks theories of processing
Every connection has a strength (weight). The weights are put in a matrix W whose number of rows equals the number of nodes in the next hidden state and whose number of columns equals the number of input (starting) nodes. W x produces a vector containing the values of all of the new nodes; a bias vector (same dimension as the result of the W x multiplication) can be added; then the sigmoid function is applied to squash every value into the range (0, 1). Position [i, j] in the weight matrix is the weight connecting input j to output i. A sketch follows below.
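A minimal sketch of one layer of processing with NumPy; the layer sizes are invented.

```python
import numpy as np

# One layer of processing: h = sigmoid(W @ x + b).
# Made-up sizes: 3 input nodes -> 2 hidden nodes.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes each value into (0, 1)

x = np.array([0.5, -1.0, 2.0])  # input vector (3 starting nodes)
W = np.random.randn(2, 3)       # rows = next layer's nodes, cols = inputs
b = np.random.randn(2)          # one bias per output node
h = sigmoid(W @ x + b)          # the new layer's values
# W[i, j] is the weight connecting input j to output i
print(h.shape)  # (2,)
```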
how are (basic) neural networks theories of learning
Start with random values for the weights and biases, then run on the data many times, adjusting those values to descend toward a local minimum of the loss.
How do you split your data (neural networks)
Training set (often about 80% of the data)
Validation set (often about 10% of the data)
Test set (often about 10% of the data)
The training set is used to train; while training, test on the validation set once in a while; the test set is used last, exactly once. This keeps you from fooling yourself via overfitting (memorizing the training data).
what is the loss? (neural networks)
A measure of how badly the network did; the network tries to minimize it using gradient descent. You might use mean squared error or other measures of badness (sketch below).
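A minimal sketch of mean squared error plus a bare-bones gradient-descent loop on made-up data, fitting a single weight w in the toy model prediction = w * x.

```python
# Made-up data following y = 2x; we recover w ≈ 2 by gradient descent.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse(w):
    # Mean squared error: average squared gap between prediction and truth
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0     # start from an arbitrary parameter value
lr = 0.01   # learning rate
for _ in range(200):
    # Gradient of the MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step downhill to reduce the loss
print(round(w, 3), round(mse(w), 6))  # w close to 2.0, loss near 0
```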
One way to set up next-word prediction (gen structure, what is Y, what is loss?)
general structure: to start, take in hidden state 0 plus the beginning-of-sentence token and produce hidden state 1 and prediction y1; next, take in hidden state 1 plus the next word of the sentence and produce hidden state 2 and y2; and so on.
y: a vector whose length is the size of the vocabulary; each spot corresponds to one word. It is produced by the softmax function, which makes every value non-negative and makes them all sum to 1 (a probability distribution over the next word; sketch after this card).
loss: cross-entropy: the negative log of the predicted probability of the correct word
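A minimal sketch of softmax and the cross-entropy loss over a made-up three-word vocabulary.

```python
import math

vocab = ["the", "cat", "sat"]  # toy vocabulary

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # non-negative, sums to 1

scores = [2.0, 1.0, 0.1]   # raw network outputs, one per word
y = softmax(scores)        # predicted probability of each next word

correct = vocab.index("cat")
loss = -math.log(y[correct])  # cross-entropy: -log P(correct word)
print(y, loss)
```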
central model of learning dogma
Inductive bias + data = knowledge
What are common targeted evaluations for testing machine learning?
Minimal pairs! Two sentences that differ in only one respect (like a controlled environment in a bio experiment). For each pair in your dataset, determine which sentence the network scores more highly (the probability it assigns to each sentence). A sketch follows below.
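A minimal sketch of this evaluation loop; `sentence_log_prob` is a hypothetical stand-in for however your model scores a sentence (e.g., summed log-probabilities of its words).

```python
# Minimal-pair evaluation: fraction of pairs where the model prefers
# the grammatical sentence. `sentence_log_prob(model, sentence)` is a
# hypothetical scoring function you would supply.
def evaluate(model, minimal_pairs, sentence_log_prob):
    correct = 0
    for grammatical, ungrammatical in minimal_pairs:
        # The model "passes" a pair if it scores the grammatical
        # sentence more highly than the ungrammatical one
        if sentence_log_prob(model, grammatical) > sentence_log_prob(model, ungrammatical):
            correct += 1
    return correct / len(minimal_pairs)
```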
Issues for evaluation (targeted evaluations) (neural networks ) (4)
1) spurious correlations (the network picked up on some other unintentional feature)
2) Memorization: your test items might have appeared in the training set (even the classic wug items are all over the web by now, lol)
3) Task demands: the model fails because of how the question is posed, not because it lacks the knowledge
4) Competence vs. performance
(1, 2) unfairly inflate performance
(3, 4) unfairly deflate performance
Competence vs. Performance
We are confident that humans know their language (competence), yet they still make mistakes in real-time use (performance); the same gap can apply to models.
how should the evaluation process be framed?
Like hypothesis testing rather than benchmarking: the goal is to understand what the model is doing, not to get it to perform its best (this encourages testing alternative hypotheses, which yields more confidence).
Multi-layer perceptron
The "basic neural network" discussed in the other cards: multiple layers of processing. The input is a single, fixed-size vector; each layer of processing produces an intermediate representation called a hidden state; the output is a single, fixed-size vector.
Recurrent neural network
(Like the next-word-prediction setup above.) The input is a sequence of vectors. A vector called the hidden state encodes what information has been encountered so far. At each time step:
• take in the previous hidden state and the next input
• update the hidden state
• produce an output vector
(a sketch follows below)
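A minimal sketch of one recurrent step with NumPy; the weight names, sizes, and tanh nonlinearity are illustrative choices, not the only option.

```python
import numpy as np

# One recurrent step: fold the next input into the running hidden state.
def rnn_step(h_prev, x, W_h, W_x, b):
    h = np.tanh(W_h @ h_prev + W_x @ x + b)  # update the hidden state
    return h  # an output vector can be read off this hidden state

hidden, inp = 4, 3
h = np.zeros(hidden)                  # hidden state 0
W_h = np.random.randn(hidden, hidden)
W_x = np.random.randn(hidden, inp)
b = np.random.randn(hidden)
for x in [np.random.randn(inp) for _ in range(5)]:  # a 5-step input sequence
    h = rnn_step(h, x, W_h, W_x, b)
```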
Transformer
Operates over a sequence of inputs. There are multiple layers; each layer has one hidden state for each input. Each of these hidden states is produced via attention, an operation that can selectively combine and modify information from all the hidden states in the previous layer. The last layer then produces an output for each input token. A sketch follows below.
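A minimal sketch of single-head, unmasked attention with NumPy; the projection matrices and sizes are illustrative.

```python
import numpy as np

# Attention: each new hidden state is a weighted combination of all the
# previous layer's hidden states, one output per input position.
def attention(H, W_q, W_k, W_v):
    Q, K, V = H @ W_q, H @ W_k, H @ W_v         # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each position attends to each other
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                          # selectively combine information

seq_len, d = 5, 8
H = np.random.randn(seq_len, d)                 # previous layer's hidden states
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
H_next = attention(H, W_q, W_k, W_v)            # one hidden state per input
```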