why can there be no learning without inductive biases?
Any finite set of data is consistent with multiple (in fact, infinitely many) generalizations. To land on one, we need inductive biases; this holds for any type of learning.
Inductive problems vs. deductive problems
Deductive problems apply general, true rules to specific examples; inductive problems try to generate general rules from specific examples.
how is language an inductive learning situation?
we are given examples (words we hear) and must infer general rules
nature vs nurture (and why this framing is not perfect)
Nature roughly corresponds to inductive biases, nurture to the learner's data. HOWEVER, inductive biases can themselves be learned/changed (though some must be there from the get-go). Learning a language depends on both nature and nurture, with some aspects of language innate; the debate is about how much is innate.
The poverty of the stimulus / plato’s problem
The linguistic data children encounter are consistent with multiple possible generalizations about how a language's grammar works, yet children reliably arrive at certain generalizations rather than others. Therefore, children must have linguistic inductive biases that favor those generalizations (e.g., forming the question for "my walrus who will sleep can giggle": a linear rule says move the first auxiliary, a hierarchical rule says move the main-clause auxiliary, and children go hierarchical). Specific instances are debated, but the general claim "the data underdetermine the grammar, so we must have linguistic inductive biases" holds.
The subset problem
If two grammars are both consistent with the data and one generates a subset of what the other generates, positive evidence alone can never rule out the superset grammar (ex: pro-drop). (Assumption: the learner only receives positive evidence.)
Baker’s paradox
A version of the subset problem applied to double-object constructions in English: "I gave Alex a book" is fine, but "I donated the library a book" is not, so how do learners avoid overgeneralizing the alternation without negative evidence?
The gavagai problem
Point to a rabbit and say "gavagai": "gavagai" could mean rabbit, could be that rabbit's name, could mean animal, etc.
Goodman’s new riddle of induction (“grue”)
Why do we favor "green" meaning green rather than grue (green if observed before some future time t, blue otherwise)? Both are consistent with all observations so far; an inductive bias favors green.
symbolic structure properties + rules
Made up of discrete units (symbols) combined in structured ways. The symbols you use matter and the way you combine them matters; rules tell us what we can do with those structures.
3 key properties of symbolic structures
Compositionality: meaning is informed by meaning of the parts and how the parts are put together
Systematicity: a symbol has the same meaning in every context
Productivity: with a finite number of components, can create infinitely many configurations
rules and symbols: benefits and drawbacks
Useful for AI (symbolic systems generalize robustly) and for human linguistics (human language largely follows rules).
Ways in which it falls short:
- idioms (the meaning isn't composed from the parts)
- "I washed the plate" vs. "I washed my car" → the same word "washed" picks out somewhat different actions?
- Baker's paradox
(These can usually be patched by adding more rules, but that quickly gets ad hoc...)
Principles & Parameters
universal principles that hold across all languages combined with parameters that define constrained types of variation across languages
Issues with rules and symbols (4)
Rigidity: can only learn grammars within a very particular set of possible grammars (could be gotten around by making the space broader?)
Challenging search process: the space is really big (make the space smaller?)
Expressivity/tractability tradeoff: a broader space gives more expressivity but less ability to search through it
Dealing with uncertainty: no way to assign higher certainty to some hypotheses than to others
Frequentist definition of probability
There is some event with some outcomes. Probability asks: if we repeated this trial many times, in what proportion of repetitions would this particular outcome occur?
Bayesian definition
An event is a hypothesis about the world. P(event) is then our degree of belief that this hypothesis is correct.
joint + conditional probability
P(A, B): the probability that both A and B hold
P(A, B) = P(A) * P(B) only if A and B are independent; in general, P(A, B) = P(A | B) * P(B)
P(A | B): given that B holds, what is the probability of A?
P(A | B) = P(A, B) / P(B)
Bayes’ Rule / theorem
P(A | B) = (P(B | A) * P(A)) / (P(B))
in cog-sci, A = Hypothesis, B = Data
posterior, likelihood, prior
posterior: P(H | D), your degree of belief in hypothesis H after seeing data D (the common framing is that what gets learned is the hypothesis with the highest posterior)
likelihood: P(D | H): if we assume this hypothesis is correct, what is the probability of getting the data we have seen? Usually P(example | meaning) = 1 / (# of entities described by "meaning"), so a more specific meaning (e.g., "gavagai" as that one rabbit's name) gets a bigger likelihood. (Assumes examples are independent, which tbh isn't really true.)
For k independent examples, the likelihood is (1/n)^k, where n = # of entities the meaning picks out and k = # of examples seen.
prior: P(H) = our beliefs before seeing any data; usually set by the modeler to capture some sort of bias
normalizing constant
P(D): out of all possible datasets you could see, what's the probability of seeing this one? P(D) is the same for all hypotheses, so dividing by it just rescales everything by a constant and won't change the ranking.
general game when Bayesian modeling:
1. Hypothesize how the prior works in your particular case
2. Hypothesize how the likelihood works in your particular case
3. Use those hypotheses to compute (unnormalized) posteriors
4. The highest-posterior hypothesis is the one that is predicted to be learned
5. Check if the model’s predictions match human data!
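A minimal sketch of this game in Python, applied to the gavagai setting. The hypothesis space, extensions, and prior values are made up for illustration; the likelihood uses the size principle from above.

```python
# A minimal sketch of the Bayesian word-learning game above.
# The hypothesis space, extensions, and priors here are invented.
# Likelihood uses the size principle: P(example | H) = 1/|H| if the
# example is in H's extension, else 0; independent examples multiply.

hypotheses = {
    # meaning: (extension = set of entities it picks out, prior)
    "this-rabbit": ({"rabbit1"}, 0.2),
    "rabbit":      ({"rabbit1", "rabbit2"}, 0.5),
    "animal":      ({"rabbit1", "rabbit2", "owl1", "fish1"}, 0.3),
}

def unnormalized_posterior(hypothesis, examples):
    extension, prior = hypotheses[hypothesis]
    likelihood = 1.0
    for ex in examples:
        # Zero likelihood if the hypothesis is inconsistent with the data
        likelihood *= (1.0 / len(extension)) if ex in extension else 0.0
    return prior * likelihood

examples = ["rabbit1", "rabbit1"]  # heard "gavagai" twice, pointing at rabbit1
scores = {h: unnormalized_posterior(h, examples) for h in hypotheses}
predicted = max(scores, key=scores.get)  # highest posterior = predicted to be learned
print(scores, "->", predicted)
```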
how do Bayesian models account for rapid learning
The prior accounts for rapid learning: it gives a strong inductive bias toward certain meanings. If the intended meaning aligns with these priors, the model can arrive at it very quickly.
how do Bayesian models account for suspicious coincidences
Captured by the likelihood. If every example we get is of the same fish, that would be a crazy coincidence under a general meaning, so it becomes more likely that the word is actually the name of that one specific fish. If H is a more specific meaning, n_H is smaller, so the likelihood (1/n_H)^k shrinks more slowly for specific meanings than for general ones as examples accumulate. This is a SOLUTION TO THE SUBSET PROBLEM: a way to become confident in the smaller (subset) meaning from positive evidence alone; see the snippet below.
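Reusing the sketch above (with its made-up numbers), we can watch the suspicious coincidence kick in: with one example the general meaning "rabbit" wins on its higher prior, but repeated examples of the same rabbit flip the winner to the specific name.

```python
# Reusing `hypotheses` and `unnormalized_posterior` from the sketch above.
for k in [1, 2, 3]:
    scores = {h: unnormalized_posterior(h, ["rabbit1"] * k) for h in hypotheses}
    print(k, max(scores, key=scores.get))
# k=1 -> "rabbit"; k=2,3 -> "this-rabbit": (1/n)^k shrinks slower for small n
```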
how do Bayesian models account for learning from positive examples alone
By asking "what is the most probable explanation for the data?", we can favor some hypotheses over others, even if many are technically possible; no negative evidence is needed.
how do Bayesian models account for the fact that the hypothesis must be consistent with the data
the likelihood will be zero if the hypothesis is not consistent with the data
why is the intuition “inductive bias = prior“ more complicated?
Inductive biases are also coded into the model through the formula for computing the likelihood, and through the mere choice to use a Bayesian model (with a particular hypothesis space) at all.
Whole object bias (Bayesian model)
Markman (1989) argued that humans have a whole-object bias: a bias for words to refer to whole objects rather than to parts of objects. Our framing so far in fact has a whole-object constraint: the only possible meanings are whole objects! How could we instead make it a soft bias? Modify the prior: allow meanings that are just body parts ("rabbit tail", "owl wing", etc.), but preserve the preference for whole objects by giving the body parts a lower prior than the whole animals (sketch below).
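A minimal sketch of that prior modification; the meanings and prior masses are invented for illustration.

```python
# A soft whole-object bias: part meanings are allowed into the
# hypothesis space but get a lower prior. Numbers are made up.
WHOLE_OBJECT_PRIOR = 0.9   # total prior mass for whole-object meanings
PART_PRIOR = 0.1           # total prior mass for part meanings

whole_meanings = ["rabbit", "owl"]
part_meanings = ["rabbit-tail", "owl-wing"]

prior = {}
for m in whole_meanings:
    prior[m] = WHOLE_OBJECT_PRIOR / len(whole_meanings)
for m in part_meanings:
    prior[m] = PART_PRIOR / len(part_meanings)
# Parts are now possible meanings (soft bias), just dispreferred a priori
```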
Mutual exclusivity (Bayesian model)
Markman (1989) also argued for a mutual exclusivity bias: a bias toward word meanings for which the learner does not already have a word, i.e., a bias against synonyms. How can we instantiate this? Add to the model a list of the words the learner already knows, then use that to modify the prior: reduce the prior probability of any hypothesis the learner already has a word for (sketch below). Or we could reframe what counts as one hypothesis: frame it as learning the meanings of all the words you've encountered, not just one at a time; then computing the prior for that hypothesis could factor in whether any words share meanings.
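A minimal sketch of the first option (modifying the prior using known words); the penalty factor and vocabulary are invented for illustration.

```python
# Mutual exclusivity as a prior modification: hypotheses whose meaning
# the learner already has a word for get their prior scaled down.
ME_PENALTY = 0.1  # made-up factor applied if a synonym would result

known_words = {"bunny": "rabbit", "owl": "owl"}  # word -> meaning

def me_prior(meaning, base_prior):
    already_named = meaning in known_words.values()
    return base_prior * (ME_PENALTY if already_named else 1.0)

print(me_prior("rabbit", 0.5))  # penalized: learner already has "bunny"
print(me_prior("fish", 0.5))    # unpenalized
```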
how do you deal with noise? (Bayesian model)
Modify the likelihood: assume each example could be produced either correctly or as noise.
P(example | H) = P(correct) * P(example | H, correct) + P(noise) * P(example | noise)
P(correct) + P(noise) = 1
P(noise) = something small (0.001?)
P(example | noise) = 1 / (# of objects) (i.e., noise picks an object at random)
Result: the size-principle likelihood is reduced slightly (multiplied by P(correct)) and a small constant is added to it, so no data point is ever truly zero-probability under any hypothesis. A sketch follows below.
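A minimal sketch of this noisy likelihood; the noise rate and world size are invented for illustration.

```python
# Noise-tolerant likelihood: a mixture of the "correct" size-principle
# term and a uniform "noise" term over all objects. Values are made up.
P_NOISE = 0.001
P_CORRECT = 1.0 - P_NOISE
NUM_OBJECTS = 10  # total objects in the toy world

def noisy_likelihood(example, extension):
    clean = (1.0 / len(extension)) if example in extension else 0.0
    noise = 1.0 / NUM_OBJECTS  # noise picks an object at random
    return P_CORRECT * clean + P_NOISE * noise
# No example is ever zero-probability under any hypothesis now
```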
multiple examples? (Bayesian)
multiply the likelihoods of the individual examples together (they are assumed independent)
Marr’s Levels
An information-processing system can be characterized at three levels:
• Computational level: what function does the system compute? Example: adding 2 numbers, 16 + 29 = 45.
• Algorithmic and representational level: what algorithm and representation does the system use to compute this function? Representation: base-10 Arabic numerals (could instead use base-7, or Roman numerals, or ...). Algorithm: add digits right-to-left, carrying the 1 if necessary (could instead do "10 + 20 + 6 + 9", or "16 + 20 + 9", or "16 + 30 − 1", or ...).
• Implementational level: how does the system implement this algorithm? E.g., by passing electric signals inside a calculator.
what marr’s level is Bayesian models?
computational level: not trying to represent the actual way the mind works, but to perform the same computation the mind performs
tolerance principle
Let a rule R be defined over a set of N items. R is productive if and only if e, the number of items not supporting R, does not exceed θ_N, where θ_N = N / ln N.
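A minimal sketch of checking productivity under the Tolerance Principle; the item counts in the example are invented.

```python
import math

# Tolerance Principle: a rule over N items tolerates at most
# theta_N = N / ln(N) exceptions and still remains productive.
def is_productive(n_items, n_exceptions):
    theta = n_items / math.log(n_items)  # math.log is natural log
    return n_exceptions <= theta

# e.g., 120 verbs with 20 exceptions: threshold is 120/ln(120) ≈ 25.1
print(is_productive(120, 20))  # True: the rule stays productive
```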
what makes a linguistic rule productive
if it can be applied to novel forms
how is information represented inside a neural network?
as vectors
what are three ways to think about a vector?
a list of numbers, a point in space, a direction and magnitude
how can you calculate the similarity between two vectors in a neural network
Euclidean distance (assumes that angle and magnitude both matter; can be any non-negative number) or cosine similarity (only the angle/direction matters, not the magnitude; always between −1 and 1). See the sketch below.
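Both measures in plain Python, showing how the same pair of vectors can be far apart in Euclidean terms yet maximally similar in cosine terms.

```python
import math

# Two similarity measures for vectors, matching the definitions above.
def euclidean_distance(u, v):
    # Sensitive to both direction and magnitude; any value >= 0
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Sensitive to direction only; always between -1 and 1
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(euclidean_distance([1, 2], [2, 4]))  # ~2.24: magnitudes differ
print(cosine_similarity([1, 2], [2, 4]))   # 1.0: exactly the same direction
```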
what is a feature vector?
a feature vector is a vector where each number aligns with a feature: each position in the vector encodes some interpretable feature
why is the feature vector idea wrong, why is it still useful
Wrong because the network learns the features itself and often doesn't produce individually meaningful ones; still useful because the learned representations often look like rotated versions of meaningful feature vectors.
how are (basic) neural networks theories of representation
“feature vectors“
how are (basic) neural networks theories of processing
Every connection has a strength (weight). The weights are put in a matrix W whose number of rows equals the number of nodes in the next hidden state and whose number of columns equals the number of input (starting) nodes. W x produces a vector containing the values of all of the new nodes; a bias vector (same dimension as the result of the W x multiplication) can be added; then the sigmoid function is applied to squash every value into the range (0, 1). Position [i, j] in the weight matrix is the weight connecting input j to output i. A sketch follows below.
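A minimal sketch of one layer of processing with NumPy; the layer sizes are invented.

```python
import numpy as np

# One layer of processing: h = sigmoid(W @ x + b).
# Made-up sizes: 3 input nodes -> 2 hidden nodes.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes each value into (0, 1)

x = np.array([0.5, -1.0, 2.0])  # input vector (3 starting nodes)
W = np.random.randn(2, 3)       # rows = next layer's nodes, cols = inputs
b = np.random.randn(2)          # one bias per output node
h = sigmoid(W @ x + b)          # the new layer's values
# W[i, j] is the weight connecting input j to output i
print(h.shape)  # (2,)
```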
how are (basic) neural networks theories of learning
Start with random values for the weights and biases, then run on the data many times, adjusting those values to descend toward a local minimum of the loss.
How do you split your data (neural networks)
Training set (often about 80% of the data)
Validation set (often about 10% of the data)
Test set (often about 10% of the data)
The training set is used to train; while training, test on the validation set once in a while; the test set is used last, exactly once. This keeps you from fooling yourself via overfitting (memorizing the training data).
what is the loss? (neural networks)
A measure of how badly the network did; the network tries to minimize it using gradient descent. You might use mean squared error or other measures of badness (sketch below).
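A minimal sketch of mean squared error plus a bare-bones gradient-descent loop on made-up data, fitting a single weight w in the toy model prediction = w * x.

```python
# Made-up data following y = 2x; we recover w ≈ 2 by gradient descent.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse(w):
    # Mean squared error: average squared gap between prediction and truth
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0     # start from an arbitrary parameter value
lr = 0.01   # learning rate
for _ in range(200):
    # Gradient of the MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step downhill to reduce the loss
print(round(w, 3), round(mse(w), 6))  # w close to 2.0, loss near 0
```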
One way to set up next-word prediction (gen structure, what is Y, what is loss?)
general structure: to start, take in hidden state 0 plus the beginning-of-sentence token and produce hidden state 1 and prediction y1; next, take in hidden state 1 plus the next word of the sentence and produce hidden state 2 and y2; and so on.
y: a vector whose length is the size of the vocabulary; each spot corresponds to one word. It is produced by the softmax function, which makes every value non-negative and makes them all sum to 1 (a probability distribution over the next word; sketch after this card).
loss: cross-entropy: the negative log of the predicted probability of the correct word
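A minimal sketch of softmax and the cross-entropy loss over a made-up three-word vocabulary.

```python
import math

vocab = ["the", "cat", "sat"]  # toy vocabulary

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # non-negative, sums to 1

scores = [2.0, 1.0, 0.1]   # raw network outputs, one per word
y = softmax(scores)        # predicted probability of each next word

correct = vocab.index("cat")
loss = -math.log(y[correct])  # cross-entropy: -log P(correct word)
print(y, loss)
```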
central model of learning dogma
Inductive bias + data = knowledge
What are common targeted evaluations for testing machine learning?
Minimal pairs! Two sentences that differ in only one respect (like a controlled environment in a bio experiment). For each pair in your dataset, determine which sentence the network scores more highly (the probability it assigns to each sentence). A sketch follows below.
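A minimal sketch of this evaluation loop; `sentence_log_prob` is a hypothetical stand-in for however your model scores a sentence (e.g., summed log-probabilities of its words).

```python
# Minimal-pair evaluation: fraction of pairs where the model prefers
# the grammatical sentence. `sentence_log_prob(model, sentence)` is a
# hypothetical scoring function you would supply.
def evaluate(model, minimal_pairs, sentence_log_prob):
    correct = 0
    for grammatical, ungrammatical in minimal_pairs:
        # The model "passes" a pair if it scores the grammatical
        # sentence more highly than the ungrammatical one
        if sentence_log_prob(model, grammatical) > sentence_log_prob(model, ungrammatical):
            correct += 1
    return correct / len(minimal_pairs)
```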
Issues for evaluation (targeted evaluations) (neural networks ) (4)
1) spurious correlations (the network picked up on some other unintentional feature)
2) Memorization: your test items might have appeared in the training set (even the classic wug items are all over the web by now, lol)
3) Task demands: the model fails because of how the question is posed, not because it lacks the knowledge
4) Competence vs. performance
(1, 2) unfairly inflate performance
(3, 4) unfairly deflate performance
Competence vs. Performance
We are confident that humans know their language (competence), yet they still make mistakes in real-time use (performance); the same gap can apply to models.
how should the evaluation process be framed?
Like hypothesis testing rather than benchmarking: the goal is to understand what the model is doing, not to get it to perform its best (this encourages testing alternative hypotheses, which yields more confidence).
Multi-layer perceptron
The "basic neural network" discussed in the other cards: multiple layers of processing. The input is a single, fixed-size vector; each layer of processing produces an intermediate representation called a hidden state; the output is a single, fixed-size vector.
Recurrent neural network
(Like the next-word-prediction setup above.) The input is a sequence of vectors. A vector called the hidden state encodes what information has been encountered so far. At each time step:
• take in the previous hidden state and the next input
• update the hidden state
• produce an output vector
(a sketch follows below)
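A minimal sketch of one recurrent step with NumPy; the weight names, sizes, and tanh nonlinearity are illustrative choices, not the only option.

```python
import numpy as np

# One recurrent step: fold the next input into the running hidden state.
def rnn_step(h_prev, x, W_h, W_x, b):
    h = np.tanh(W_h @ h_prev + W_x @ x + b)  # update the hidden state
    return h  # an output vector can be read off this hidden state

hidden, inp = 4, 3
h = np.zeros(hidden)                  # hidden state 0
W_h = np.random.randn(hidden, hidden)
W_x = np.random.randn(hidden, inp)
b = np.random.randn(hidden)
for x in [np.random.randn(inp) for _ in range(5)]:  # a 5-step input sequence
    h = rnn_step(h, x, W_h, W_x, b)
```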
Transformer
Operates over a sequence of inputs. There are multiple layers; each layer has one hidden state for each input. Each of these hidden states is produced via attention, an operation that can selectively combine and modify information from all the hidden states in the previous layer. The last layer then produces an output for each input token. A sketch follows below.
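A minimal sketch of single-head, unmasked attention with NumPy; the projection matrices and sizes are illustrative.

```python
import numpy as np

# Attention: each new hidden state is a weighted combination of all the
# previous layer's hidden states, one output per input position.
def attention(H, W_q, W_k, W_v):
    Q, K, V = H @ W_q, H @ W_k, H @ W_v         # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each position attends to each other
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                          # selectively combine information

seq_len, d = 5, 8
H = np.random.randn(seq_len, d)                 # previous layer's hidden states
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
H_next = attention(H, W_q, W_k, W_v)            # one hidden state per input
```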