Rationality (III): Reward & Reinforcement Learning
Tuesday, November 11
Course Announcements & Reminders
Homework #4 has been released.
Topic: Q-learning/reinforcement learning, using the algorithm learned today.
Due Date: Sunday, November 23.
Students have two attempts; the highest score will be recorded.
Today’s lecture involves hands-on practice to help prepare for the homework.
To leave time for the hands-on practice, there will be no iClicker quiz questions in class today.
Quiz #17 is available on Canvas, and it is to be completed by the end of today, November 11.
Event Announcement
Tonight's event: Backpacking with CogSci
Location: Jerusalem Garden, 955 Weiser Hall
Time: November 11, from 6 PM to 8 PM
Purpose: Plan your next semester with the advisors!
Food will be served!
Functional Problem in Cognitive Science
The functional problem that this capacity must solve:
Mapping from states to actions so as to maximize long-term discounted expected utility, given the current state (formalized below).
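In symbols, using $\theta$ for the discount factor to match the update rule later in the lecture (the notation here is an illustrative assumption, not from the slides), the problem is to find a policy mapping states to actions:

$$\pi^*(s) = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \theta^{t} R_t \,\middle|\, S_0 = s,\ \pi\right]$$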
Referencing David Marr’s three levels of explanation in cognitive science:
Functional Level: Defines problems to be solved.
Algorithmic Level: Describes procedures that enable the problems to be solved.
Physical Level: Involves the neural/chemical substrates in which the procedures are implemented.
The Q-learning algorithm has been linked to neural substrates including:
Ventral medial prefrontal cortex (vmPFC)
Striatum
Ventral tegmental area (VTA)
Are People Actually Rational?
Heuristics and Biases Research Program:
(Often) No, especially at the automatic, intuitive level.
Evolutionary Psychology Research Program:
(Often) Yes, if the problem is posed in the right format.
Neuroeconomics Research Program:
Yes, especially at the automatic, intuitive (emotional/affective) level.
Key Question: Why are these findings significant?
Affective systems may be implementing specialized algorithms that learn from experience.
Insights on Decision Making Under Uncertainty
Kahneman and Tversky's perspective:
In making predictions and judgments under uncertainty, people often do not follow statistical theory; instead, they rely on a limited number of heuristics, including:
Representativeness Heuristic: Judgments of probability are based on similarity to a prototype.
Availability Heuristic: Judgments of frequency are based on how easily examples come to mind.
Affect Heuristic: Judgments are influenced by gut affective reactions.
Framing Effects: Choices are perceived differently based on the presentation format.
Example Illustrations:
Linda Problem: Bank teller stereotype.
Words with 'n': First versus third position.
Stock Purchase Scenario: Choosing to buy Ford.
Disease Outbreak: Perceptions of risk.
Evolutionary Psychology's Perspective
Massively Modular Mind: Proponents suggest that when problems are presented in a format familiar to our evolved minds (e.g., using frequencies or social contract rules), people typically provide rationally correct answers.
Characteristics of modules include being domain-specific and adapted to ancestral environments.
Child vs. Adult Probabilistic Reasoning
Children vs. Adults: Babies and toddlers appear to excel at probabilistic reasoning (per infant cognition studies), while adults often struggle (per Heuristics and Biases findings).
Possible resolutions to revisit: task difficulty and the format in which information is presented.
Defining Neuroeconomics
Definition of Neuroeconomics (as per Glimcher & Rustichini):
“Economics, psychology, and neuroscience are converging into a unified discipline—neuroeconomics—with the goal of providing a general theory of human behavior.”
Key Concept: Reinforcement Learning (RL)
A family of algorithms for computing a value function.
Understanding Reinforcement Learning
Definition: Reinforcement learning (RL) is the problem of making decisions to maximize long-term, discounted expected rewards.
Methods: RL comprises various methods or algorithms to solve the RL problem in different settings.
Field: RL is a branch of machine learning and artificial intelligence.
Intersections: It also overlaps with psychology and cognitive neuroscience.
Recent Innovations:
DeepMind's reinforcement-learning breakthroughs led to significant AI advances, including AlphaGo's mastery of the game of Go.
Exploring the Q-learning Algorithm
Q-value update formula: $Q(S_t, a_t) \leftarrow Q(S_t, a_t) + \eta \left( R_t + \theta \max_{a_{t+1}} Q(S_{t+1}, a_{t+1}) - Q(S_t, a_t) \right)$
where:
$R_t$: Reward received at time $t$.
$\eta$: Learning rate (between 0 and 1).
$\theta$: Discount factor for future rewards.
The broader goal is to explore and learn a function (a policy) linking states to actions; a minimal code sketch of the update rule appears below.
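Here is a minimal Python sketch of this update for a tabular setting. The names `q_table` and `actions`, and the parameter defaults, are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Tabular Q-values: unseen (state, action) pairs default to 0.0.
q_table = defaultdict(float)

def q_update(state, action, reward, next_state, actions, eta=0.1, theta=0.9):
    """One Q-learning update, mirroring the formula above.

    eta   -- learning rate (between 0 and 1)
    theta -- discount factor for future rewards
    """
    # Best Q-value achievable from the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Temporal-difference error: (reward + discounted future value) - current estimate.
    td_error = reward + theta * best_next - q_table[(state, action)]
    # Nudge the current estimate toward the target by a fraction eta.
    q_table[(state, action)] += eta * td_error

# Hypothetical usage: after taking "right" in state s, landing in s2 with reward 5:
# q_update(s, "right", 5, s2, actions=["up", "down", "left", "right"])
```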
Reward Signals in Reinforcement Learning
Reward function: A mapping from states to quantities that indicate desirability.
Example (encoded as a simple lookup table in the sketch below):
$R(\text{“safe in my hole”}) = 23$
$R(\text{“tasting cheese”}) = 37$
$R(\text{“going on a hike”}) = 40$
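In code, a reward function of this kind can be just a dictionary. Treating unlisted states as reward 0 is an assumption made here for illustration.

```python
# Reward function for the example above, encoded as a lookup table.
R = {
    "safe in my hole": 23,
    "tasting cheese": 37,
    "going on a hike": 40,
}

def reward(state):
    # States not listed are assumed to be neutral (reward 0).
    return R.get(state, 0)
```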
Action Choices and Eventual Learning
Best Action: The action that maximizes expected cumulative future reward. This is defined as value or utility in sequential decision-making contexts.
The credit assignment problem: Understanding which past actions resulted in rewards, especially in scenarios where reward is delayed.
Temporal Difference Q-learning Algorithm: Can assign credit to past actions even when rewards are delayed (illustrated in the sketch below).
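The toy run below shows credit assignment in action on an assumed 3-step chain with a single delayed reward; with only one action per state, Q reduces to a state value, which keeps the sketch short.

```python
# Chain s0 -> s1 -> s2 -> terminal, where the only reward (+1) comes at the end.
values = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
eta, theta = 0.5, 0.9
transitions = [("s0", "s1"), ("s1", "s2"), ("s2", None)]

for episode in range(8):
    for state, next_state in transitions:
        r = 1.0 if next_state is None else 0.0            # delayed reward
        bootstrap = values[next_state] if next_state else 0.0
        values[state] += eta * (r + theta * bootstrap - values[state])
    # Early states gain value only after later states do: credit flows backward.
    print(episode, {s: round(v, 2) for s, v in values.items()})
```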
Incentive Receiver Example: Monsters and Kit-Kats
Scenario setting: An agent operates in a 2-D grid world, pursuing objectives (e.g., Kit-Kats) while avoiding negative stimuli (e.g., monsters).
The agent must navigate based on environmental conditions, its energy level, and the probabilities that its actions succeed.
Choosing among the possible actions amounts to maximizing reward while minimizing risk (e.g., avoiding monsters, managing energy); one way to encode such a state is sketched below.
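One hypothetical encoding of a state in this world, which also foreshadows the state-counting discussion that follows; the field names and types are assumptions, not the homework's specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    agent: tuple        # (row, col) of the agent's grid cell
    energy: int         # discrete energy level
    monster: tuple      # (row, col) of the monster's grid cell
    kitkats: frozenset  # cells that still contain a Kit-Kat

# frozen=True makes instances hashable, so they can serve as Q-table keys.
```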
State Representation in Q-learning Problems
The total number of possible states is the product of the possibilities for each component of the state: agent location, energy level, monster location, and which objectives remain.
Formula for total states: $|\text{agent locations}| \times |\text{energy levels}| \times |\text{monster locations}| \times |\text{Kit-Kat configurations}|$ (a worked example follows).
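A worked count under assumed problem sizes; the grid dimensions and item counts here are hypothetical, chosen only to show the combinatorics.

```python
grid_cells = 5 * 5        # agent can stand on any cell of a 5x5 grid
energy_levels = 4         # discrete energy levels
monster_cells = 5 * 5     # one monster, anywhere on the same grid
kitkat_configs = 2 ** 2   # each of 2 Kit-Kats is either present or eaten

total_states = grid_cells * energy_levels * monster_cells * kitkat_configs
print(total_states)       # 25 * 4 * 25 * 4 = 10,000
```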
Final Remarks on Q-learning and Homework Context
Revisit the Q-learning framework to build a robust understanding of action selection, value maximization, and the recursive structure of future rewards.
The Q-value update rule must be applied consistently after each action taken so that the values converge toward an optimal policy over many learning iterations (see the training-loop sketch below).
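A hedged sketch of applying the update rule across many episodes. It assumes a hypothetical environment object with `env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done)`, plus epsilon-greedy exploration; the homework's actual interface may differ.

```python
import random
from collections import defaultdict

def train(env, actions, episodes=500, eta=0.1, theta=0.9, epsilon=0.1):
    """Apply the Q-update consistently across episodes until values stabilize."""
    q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection: mostly exploit, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Terminal states contribute no future value.
            future = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += eta * (reward + theta * future - q[(state, action)])
            state = next_state
    return q
```

Epsilon-greedy is one common way to balance exploration with exploiting learned values; any exploration scheme that visits all state-action pairs sufficiently often works for tabular Q-learning.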
Students should be prepared to apply these principles to the homework scenarios.