CS 593RL Lec 1
Pigeons, puzzles, and the core idea of reinforcement learning
Video illustrates how a pigeon can be taught to distinguish between two written words and respond differently to each, rewarded with food for correct responses.
The bird’s behavior is shaped by environmental consequences rather than arising independently; this previews the idea that an agent learns through feedback from its environment.
Comparison to Thorndike’s puzzle boxes with cats: learning is still about shaping behavior through consequences, but the pigeon experiments add the dimension of reward frequency.
Reward frequency and learning:
If a pigeon is reinforced every single time it pecks, the behavior can become overly eager and tunnel-visioned, and it collapses quickly once rewards stop.
If rewards are infrequent (e.g., delivered only after many pecks or at irregular times), the learner may be confused early on or learn to wait for rewards rather than pecking consistently.
The core question: what reward schedule supports learning a task while sustaining behavior that maximizes total reward over time? (A toy sketch of common schedules appears below.)
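As a concrete illustration, here is a toy sketch of which pecks earn rewards under three classic schedules. The schedule names are standard behaviorist terminology; the parameters (every 5th peck, 20% chance) are my own illustrative choices, not values from the lecture.

```python
# Toy sketch of three classic reinforcement schedules. Parameters are
# arbitrary illustrative choices, not values from the lecture.
import random

random.seed(0)
N_PECKS = 20

continuous     = [True] * N_PECKS                                 # every peck rewarded
fixed_ratio    = [(i + 1) % 5 == 0 for i in range(N_PECKS)]       # every 5th peck
variable_ratio = [random.random() < 0.2 for _ in range(N_PECKS)]  # ~1 in 5, irregular

for name, schedule in [("continuous", continuous),
                       ("fixed-ratio (5)", fixed_ratio),
                       ("variable-ratio (~5)", variable_ratio)]:
    print(f"{name:20s}", "".join("R" if rewarded else "." for rewarded in schedule))
```

The classic behaviorist finding is that variable-ratio schedules, with their irregular rewards, tend to produce the most persistent responding.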
Rich Sutton and Andrew Barto define reinforcement learning as:
learning what to do, i.e., how to map situations to actions so as to maximize a numerical reward signal,
the learner is not told which actions to take; instead it must discover which actions yield the most reward by trying them,
actions may affect not only the immediate reward but also the next situation and, through it, all subsequent rewards (this objective is written out below).
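Aside (standard formalization, not yet introduced in the lecture): the "maximize reward over time" objective is usually written as the expected discounted return, following Sutton and Barto's convention (lowercase $r$ here to match this lecture's notation):

$$G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}, \qquad 0 \le \gamma \le 1$$

The agent seeks to maximize the expected return $\mathbb{E}[G_t]$; because $G_t$ sums future rewards, each action is judged by its long-term consequences, not just its immediate payoff.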
Feedback sources in RL (the meaning of reward/feedback):
Environment: rewards from the consequences of actions (e.g., points in video games; success/failure signals).
Teacher/Coach: directed feedback during early learning (e.g., guidance in sports or music).
Self-generated intrinsic motivation: internal drives such as curiosity (e.g., a cat knocking objects around out of curiosity).
Core idea reiterated: in reinforcement learning, the learner discovers effective actions on its own, rather than being explicitly instructed what to do.
In contrast, supervised learning uses demonstrations to tell the agent what to do; RL emphasizes trial-and-error exploration.
Reward signals are not only about the type of feedback but also about how often feedback is given; this ties back to reward schedules.
Skinnerian roots and reward shaping:
B. F. Skinner explored reward schedules and shaping to guide learning.
Pigeon-and-ball experiments: rewarding any response that resembles the desired action can accelerate learning toward a complex final behavior.
Bread-crumb learning: give incremental rewards for progress toward a complex task, not just for the final perfect action.
Reward shaping and its purpose:
Provide intermediate rewards to guide learning when the final task is too complex to achieve from scratch.
Adapt the reward over time to keep pushing the learner in the right direction (a standard formalization is sketched below).
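One widely used way to make shaping safe, not spelled out above but standard in the literature, is potential-based reward shaping (Ng, Harada, and Russell, 1999): add a difference of potentials so the set of optimal policies is provably unchanged. A minimal sketch, using a hypothetical 1-D goal-reaching potential of my own choosing:

```python
# Minimal sketch of potential-based reward shaping (Ng, Harada & Russell, 1999).
# The 1-D "distance to goal" potential is a hypothetical example; the lecture
# does not specify a particular potential function.
GOAL = 10      # hypothetical goal position on a line
GAMMA = 0.99   # discount factor (assumed)

def phi(state: float) -> float:
    """Potential: higher (less negative) the closer the state is to the goal."""
    return -abs(state - GOAL)

def shaped_reward(state: float, next_state: float, env_reward: float) -> float:
    """r' = r + gamma * phi(s') - phi(s): rewards progress toward the goal
    while provably preserving which policies are optimal."""
    return env_reward + GAMMA * phi(next_state) - phi(state)

# Moving from 3 to 4 (toward the goal) earns a positive shaping bonus:
print(shaped_reward(state=3, next_state=4, env_reward=0.0))
```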
Why RL matters in CS and AI:
RL enables agents to learn to act in environments where outcomes are uncertain and delayed.
Real-world impact examples include AlphaGo and robotics.
AlphaGo and self-play:
Go was long considered out of reach for computers: classical game engines depended on hand-crafted evaluation functions, and the field doubted any simple evaluation function could capture Go positions.
AlphaGo surpassed top human players using reinforcement learning with self-play: it trained by playing against copies of itself and learning from the outcomes.
Historical note: confident claims that no evaluation function would ever suffice for Go aged poorly in light of AlphaGo’s success.
Go specifics used to illustrate the power of RL:
AlphaGo’s famous move 37 (from its 2016 match against Lee Sedol): a move human experts considered highly unlikely, yet deeply creative and strategic, showing that RL can discover strategies humans overlook.
Other RL success stories and connections to real problems:
TD-Gammon: reinforcement learning with self-play achieved world-class backgammon performance.
Deep reinforcement learning in practice extends to robotics, recommendation systems, and content optimization (e.g., maximizing watch time by selecting thumbnails).
RL techniques underpin training for language models with human preferences (reinforcement learning from human feedback, or RLHF) and other AI alignment tasks.
Course goals and structure in this class:
Build agents that learn to act in applications; cover how RL works, algorithm types, and when to use them.
Focus on deep reinforcement learning (policy approximated by neural networks) for complex state spaces (e.g., images).
Emphasize theory and practical implementation; connect to industry use cases.
Important caution: RL is not a universal solve-everything tool.
It requires interaction with an environment (a simulator or the real world), and that environment must be reasonably faithful to the target task.
If the simulator is not faithful, learned policies may fail when deployed in the real world.
Real-world training raises safety and wear-and-tear concerns; precautions or staged training are sometimes needed.
Alternatives and tradeoffs to consider:
Supervised learning can be faster when demonstrations are available, but performance is capped by the quality of those demonstrations.
RL can discover novel strategies, but it often learns slowly and requires an environment to interact with (the two objectives are contrasted below).
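The tradeoff can be stated side by side (my formalization using the standard objectives; this notation is not from the lecture):

$$\text{Behavioral cloning (supervised):}\quad \max_\theta\; \mathbb{E}_{(s,\,a^{\text{demo}})\sim\mathcal{D}}\big[\log \pi_\theta(a^{\text{demo}}\mid s)\big]$$

$$\text{Reinforcement learning:}\quad \max_\theta\; \mathbb{E}_{\tau\sim\pi_\theta}\Big[\textstyle\sum_t \gamma^t r_t\Big]$$

The first can at best match the demonstrator it imitates; the second can surpass it, but only by collecting its own, often slow, trial-and-error experience.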
Notion of non-stationarity in RL:
The policy changes during training, so the distribution of experience the learner sees is non-stationary, even when the environment itself is fixed.
This complicates exploration and makes it easy to get stuck in local optima if updates are poor.
Reset and episodic nature in RL:
Training often relies on resetting to known good states at the start of each episode; frequent resets can be tedious in real-world tasks (e.g., robotics on sand).
Some research explores non-episodic RL and settings without easy resets.
Reward design challenges in RL:
Specifying reward functions can be hard, especially for complex tasks like language generation or multi-step activities (e.g., making a cup of coffee).
Dense rewards can speed learning but may constrain exploration and invite reward hacking, where the agent finds loopholes that maximize reward without achieving the intended task.
Examples of reward-hacking and pitfalls:
A robot that maximizes a distance-to-goal metric by moving blocks instead of actually solving the task (reward hacking example).
A game where the agent circles back to collect recurring points rather than playing the intended game, illustrating how simple reward signals can be gamed (see the sparse-vs-dense sketch below).
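To make the tension concrete, here is a hypothetical sketch of a sparse task reward versus a dense proxy that invites hacking; the scenario and names are illustrative, not the lecture's exact examples:

```python
# Hypothetical sketch: sparse task reward vs. a dense, hackable proxy for a
# block-to-goal manipulation task. Scenario and names are illustrative only.

def sparse_reward(block_pos: float, goal_pos: float) -> float:
    """+1 only when the task is actually solved: hard to learn from, hard to game."""
    return 1.0 if abs(block_pos - goal_pos) < 0.01 else 0.0

def dense_proxy_reward(gripper_pos: float, block_pos: float) -> float:
    """Rewards shrinking the gripper-to-block distance. A denser signal, but an
    agent can maximize it without solving the task, e.g., by knocking the
    block toward the gripper instead of grasping and placing it."""
    return -abs(gripper_pos - block_pos)
```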
Practical tips for learners:
Reward shaping can help early learning but should avoid destroying the agent’s ability to discover novel solutions.
Balance the density of rewards to encourage exploration while guiding progress.
Start with simpler tasks and gradually increase complexity to build intuition and robust policies.
Core Reinforcement Learning Concepts
Environment, state, action, reward, and policy:
Time-step flow: the environment is in state $s_t$; the agent takes action $a_t$ drawn from a policy $\pi(a_t \mid s_t)$; the environment then returns a reward $r_t$ and transitions to the next state $s_{t+1}$.
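A minimal, self-contained sketch of this loop; the toy 1-D environment and the uniform-random policy below are my illustrative stand-ins, not anything specified in the lecture:

```python
# Minimal agent-environment loop implementing the time-step flow above.
# LineWorld (a 1-D walk toward a goal) and the random policy are hypothetical
# stand-ins for illustration.
import random

random.seed(0)

class LineWorld:
    """State: integer position on a line. Actions: -1 (left) or +1 (right)."""
    def __init__(self, goal: int = 5):
        self.goal = goal

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        self.state += action
        done = self.state == self.goal
        reward = 1.0 if done else 0.0   # sparse reward: +1 only at the goal
        return self.state, reward, done

def policy(state: int) -> int:
    return random.choice([-1, 1])       # placeholder pi(a_t | s_t): uniform random

env = LineWorld()
s = env.reset()
episode_return = 0.0
for t in range(10_000):                 # cap episode length
    a = policy(s)                       # a_t ~ pi(. | s_t)
    s, r, done = env.step(a)            # environment emits r_t and s_{t+1}
    episode_return += r
    if done:
        break
print("episode return:", episode_return)
```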