
CS 593RL Lec 1

Pigeons, puzzles, and the core idea of reinforcement learning

  • Video illustrates how a pigeon can be taught to distinguish between two written words and respond differently to each one, rewarded with food.

  • The bird’s behavior is shaped by the environment rather than acting independently; this links to the idea that an agent learns through environmental feedback.

  • Comparison to the puzzle box with cats: learning is still about shaping behavior through consequences, but now we add the idea of frequency of rewards.

  • Reward frequency and learning:

    • If a pigeon is reinforced every single time it pecks, the behavior can become overly eager and tunnel-visioned, then collapse quickly once rewards stop.

    • If rewards are infrequent (e.g., only after many pecks or at irregular intervals), the learner may become confused or learn to wait for rewards rather than pecking consistently.

    • The core question: what reward schedule supports learning a task while promoting sustained behavior that maximizes total reward over time?

  • Rich Sutton and Andrew Barto define reinforcement learning as:

    • learning what to do, that is, how to map situations to actions so as to maximize a numerical reward signal,

    • the learner is not told which actions to take; it must discover which actions yield the most reward by trying them,

    • actions may affect not only immediate rewards but also future states and future rewards (made precise by the discounted return below).
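
To make "maximize total reward over time" concrete, the standard formalization (following Sutton and Barto's notation, which is assumed here rather than quoted from the lecture) is the discounted return that the agent maximizes in expectation:

```latex
% Discounted return G_t: the sum of future rewards, with gamma in [0, 1)
% weighting immediate reward more heavily than delayed reward.
% (Standard notation, assumed here rather than taken from this lecture.)
G_t \;=\; r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \cdots
    \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```

Maximizing the expected value of $G_t$ rather than the immediate reward $r_{t+1}$ is what lets an agent accept a small short-term cost for a larger payoff later.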

  • Feedback sources in RL (the meaning of reward/feedback):

    • Environment: rewards from the consequences of actions (e.g., points in video games; success/failure signals).

    • Teacher/Coach: directed feedback during early learning (e.g., guidance in sports or music).

    • Self-generated intrinsic motivation: internal drives or curiosity (e.g., a cat knocking objects around for curiosity).

  • Core idea reiterated: in reinforcement learning, the learner discovers its own actions, not via explicit instruction on what to do.

  • In contrast, supervised learning uses demonstrations to tell the agent what to do; RL emphasizes trial-and-error exploration.

  • Reward signals are not only about the type of feedback but also about how often feedback is given; this ties back to reward schedules.

  • Skinnerian roots and reward shaping:

    • B. F. Skinner explored reward schedules and shaping to guide learning.

    • Pigeon-ball experiments: rewarding any response that resembles the desired action can accelerate learning toward a complex final behavior.

    • Bread-crumb learning: give incremental rewards for progress toward a complex task, not just the final perfect action.

  • Reward shaping and its purpose:

    • Provide intermediate rewards to guide learning when the final task is too complex to achieve from scratch.

    • Adapt the reward over time to keep pushing the learner in the right direction (a minimal shaping sketch appears below).
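
A minimal sketch of shaping in code, assuming a hypothetical goal-reaching task where the agent's distance to the goal is known (the task, function names, and coefficients are illustrative, not from the lecture). The shaping term here is potential-based, a common way to add intermediate guidance while leaving the optimal behavior unchanged:

```python
def sparse_reward(dist_to_goal: float, threshold: float = 0.05) -> float:
    """Reward only on success: +1 when the goal is reached, 0 otherwise."""
    return 1.0 if dist_to_goal < threshold else 0.0

def shaped_reward(prev_dist: float, dist: float, gamma: float = 0.99,
                  threshold: float = 0.05) -> float:
    """Sparse success bonus plus a potential-based shaping term.

    With potential phi(s) = -distance, the shaping term
    F = gamma * phi(s') - phi(s) rewards per-step progress toward the goal
    without changing which policies are optimal.
    """
    shaping = gamma * (-dist) - (-prev_dist)
    return sparse_reward(dist, threshold) + shaping

# Toy usage: an agent that halves its distance each step gets useful feedback
# from the shaping term long before the sparse success bonus ever fires.
dist = 1.0
for step in range(6):
    new_dist = dist / 2.0
    print(step, round(shaped_reward(dist, new_dist), 3))
    dist = new_dist
```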

  • Why RL matters in CS and AI:

    • RL enables agents to learn to act in environments where outcomes are uncertain and delayed.

    • Real-world impact examples include AlphaGo and robotics.

  • AlphaGo and self-play:

    • Go was long thought to resist the chess-style approach of search guided by a hand-crafted evaluation function; the prevailing view was that no simple evaluation function would suffice.

    • AlphaGo surpassed human players using reinforcement learning with self-play; it trained by playing against copies of itself and learning from that experience.

    • Historical note: the idea that a simple evaluation function would never suffice aged poorly due to AlphaGo’s success.

  • Go specifics used to illustrate power of RL:

    • A move that human players would be unlikely to choose but that AlphaGo favored (e.g., the famous move 37) can turn out to be highly creative and strategic.

  • Other RL success stories and connections to real problems:

    • TD-Gammon and backgammon: self-play and reinforcement learning achieved high performance.

    • Deep reinforcement learning in practice extends to robotics, recommendation systems, and content optimization (e.g., maximizing watch time by selecting thumbnails).

    • RL techniques underpin training for language models with human preferences (reinforcement learning from human feedback, or RLHF) and other AI alignment tasks.

  • Course goals and structure in this class:

    • Build agents that learn to act in applications; cover how RL works, algorithm types, and when to use them.

    • Focus on deep reinforcement learning (the policy approximated by neural networks) for complex state spaces (e.g., images); a minimal policy-network sketch appears after this list.

    • Emphasize theory and practical implementation; connect to industry use cases.
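
As a concrete picture of a policy approximated by a neural network over image observations, here is a minimal sketch assuming PyTorch; the architecture, input size, and action count are illustrative, not the course's reference implementation:

```python
import torch
import torch.nn as nn

class ImagePolicy(nn.Module):
    """Maps an image observation to a distribution over discrete actions."""
    def __init__(self, n_actions: int, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # For an assumed 64x64 RGB input, the conv stack flattens to 64*4*4.
        self.head = nn.Sequential(nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
                                  nn.Linear(256, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.head(self.features(obs))
        return torch.distributions.Categorical(logits=logits)

policy = ImagePolicy(n_actions=4)
obs = torch.zeros(1, 3, 64, 64)   # dummy 64x64 RGB observation
action = policy(obs).sample()     # sample an action from the policy
print(action.item())
```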

  • Important caution: RL is not a universal solve-everything tool.

    • It requires interaction with an environment (a simulator or the real world), and that environment must have realistic fidelity to the target task.

    • If the simulator is not faithful, learned policies may fail in the real world.

    • Real-world training raises safety and wear-and-tear concerns; sometimes extra caution or staged training is needed.

  • Alternatives and tradeoffs to consider:

    • Supervised learning can be faster when demonstrations are available, but performance is limited by the quality of demonstrations.

    • RL explores novel strategies but often learns slowly and requires an environment to interact with.

  • Notion of non-stationarity in RL:

    • The policy changes during training, which makes the environment non-stationary from the agent’s perspective.

    • This can complicate exploration and make it easy to get stuck in poor local optima if updates go wrong.

  • Reset and episodic nature in RL:

    • Training often relies on resets to known good start states between attempts; frequent resets can be tedious in real-world tasks (e.g., robot experiments on sand). A standard episodic loop is sketched after this list.

    • Some research explores non-episodic RL and settings without easy resets.
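
For contrast with the non-episodic settings just mentioned, here is what the standard episodic loop looks like, assuming the Gymnasium API and its CartPole environment (both are assumptions for illustration, not something this lecture specifies); the explicit env.reset() at the top of every episode is exactly the reset being discussed:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
for episode in range(3):
    obs, info = env.reset()                  # reset to a known start state
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # placeholder random policy
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated       # end of this episode
    print(f"episode {episode}: return = {total_reward}")
env.close()
```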

  • Reward design challenges in RL:

    • Specifying reward functions can be hard, especially for complex tasks like language generation or multi-step activities (e.g., making a cup of coffee).

    • Dense rewards can speed learning but may constrain exploration and lead to reward hacking if the agent finds loopholes to maximize reward without achieving the intended task.

  • Examples of reward-hacking and pitfalls:

    • A robot that maximizes a distance-to-goal metric by moving blocks instead of actually solving the task (reward hacking example).

    • A game where the agent circles to collect recurring points rather than playing the intended game, illustrating how simple reward signals can be gamed (a toy calculation below makes the incentive explicit).
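
A toy calculation, with invented numbers, showing why circling for recurring points can beat finishing the game once the agent is asked only to maximize total reward:

```python
# Hypothetical scoring: +10 for finishing the level, or +1 for each lap past
# a pickup that respawns every lap. Both numbers are made up for illustration.
finish_bonus = 10.0
points_per_lap = 1.0
episode_length_in_laps = 50     # how long the episode lets the agent circle

finish_return = finish_bonus
circling_return = points_per_lap * episode_length_in_laps

print("finish the game:", finish_return)     # 10.0
print("circle forever:", circling_return)    # 50.0 -- the 'hack' wins
```

Under this reward the circling behavior is not a learner bug: it is the optimal policy for the reward as written, which is why the reward-design issues above matter.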

  • Practical tips for learners:

    • Reward shaping can help early learning but should avoid destroying the agent’s ability to discover novel solutions.

    • Balance the density of rewards to encourage exploration while guiding progress.

    • Start with simpler tasks and gradually increase complexity to build intuition and robust policies.

Core Reinforcement Learning Concepts

  • Environment, state, action, reward, and policy:

    • Time-step flow: the environment is in state $s_t$; the agent takes action $a_t$ drawn from a policy $\