
CS 593RL Lec 1

Pigeons, puzzles, and the core idea of reinforcement learning

  • Video illustrates how a pigeon can be taught to distinguish between two written words and respond differently to each one, rewarded with food.

  • The bird’s behavior is shaped by the environment rather than acting independently; this links to the idea that an agent learns through environmental feedback.

  • Comparison to the puzzle box with cats: learning is still about shaping behavior through consequences, but now we add the idea of frequency of rewards.

  • Reward frequency and learning:

    • If a pigeon is reinforced every single time it pecks, the behavior can become overly eager and tunnel-visioned, then collapse quickly once rewards stop.

    • If rewards are infrequent (e.g., only after many pecks or at irregular intervals), the learner may become confused or learn to wait for rewards rather than pecking consistently.

    • The core question: what reward schedule supports learning a task while promoting sustained behavior that maximizes total reward over time?

  • Rich Sutton and Andrew Barto define reinforcement learning as:

    • learning what to do, that is, how to map situations to actions so as to maximize a numerical reward signal,

    • the learner is not told which actions to take; it must discover which actions yield the most reward by trying them,

    • actions may affect not only immediate rewards but also future states and future rewards (made precise by the discounted return below).
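
To make "maximize total reward over time" concrete, the standard formalization (following Sutton and Barto's notation, which is assumed here rather than quoted from the lecture) is the discounted return that the agent maximizes in expectation:

```latex
% Discounted return G_t: the sum of future rewards, with gamma in [0, 1)
% weighting immediate reward more heavily than delayed reward.
% (Standard notation, assumed here rather than taken from this lecture.)
G_t \;=\; r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \cdots
    \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```

Maximizing the expected value of $G_t$ rather than the immediate reward $r_{t+1}$ is what lets an agent accept a small short-term cost for a larger payoff later.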

  • Feedback sources in RL (the meaning of reward/feedback):

    • Environment: rewards from the consequences of actions (e.g., points in video games; success/failure signals).

    • Teacher/Coach: directed feedback during early learning (e.g., guidance in sports or music).

    • Self-generated intrinsic motivation: internal drives or curiosity (e.g., a cat knocking objects around for curiosity).

  • Core idea reiterated: in reinforcement learning, the learner discovers its own actions, not via explicit instruction on what to do.

  • In contrast, supervised learning uses demonstrations to tell the agent what to do; RL emphasizes trial-and-error exploration.

  • Reward signals are not only about the type of feedback but also about how often feedback is given; this ties back to reward schedules.

  • Skinnerian roots and reward shaping:

    • B. F. Skinner explored reward schedules and shaping to guide learning.

    • Pigeon-ball experiments: rewarding any response that resembles the desired action can accelerate learning toward a complex final behavior.

    • Bread-crumb learning: give incremental rewards for progress toward a complex task, not just the final perfect action.

  • Reward shaping and its purpose:

    • Provide intermediate rewards to guide learning when the final task is too complex to achieve from scratch.

    • Adapt the reward over time to keep pushing the learner in the right direction (a minimal shaping sketch appears below).
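
A minimal sketch of shaping in code, assuming a hypothetical goal-reaching task where the agent's distance to the goal is known (the task, function names, and coefficients are illustrative, not from the lecture). The shaping term here is potential-based, a common way to add intermediate guidance while leaving the optimal behavior unchanged:

```python
def sparse_reward(dist_to_goal: float, threshold: float = 0.05) -> float:
    """Reward only on success: +1 when the goal is reached, 0 otherwise."""
    return 1.0 if dist_to_goal < threshold else 0.0

def shaped_reward(prev_dist: float, dist: float, gamma: float = 0.99,
                  threshold: float = 0.05) -> float:
    """Sparse success bonus plus a potential-based shaping term.

    With potential phi(s) = -distance, the shaping term
    F = gamma * phi(s') - phi(s) rewards per-step progress toward the goal
    without changing which policies are optimal.
    """
    shaping = gamma * (-dist) - (-prev_dist)
    return sparse_reward(dist, threshold) + shaping

# Toy usage: an agent that halves its distance each step gets useful feedback
# from the shaping term long before the sparse success bonus ever fires.
dist = 1.0
for step in range(6):
    new_dist = dist / 2.0
    print(step, round(shaped_reward(dist, new_dist), 3))
    dist = new_dist
```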

  • Why RL matters in CS and AI:

    • RL enables agents to learn to act in environments where outcomes are uncertain and delayed.

    • Real-world impact examples include AlphaGo and robotics.

  • AlphaGo and self-play:

    • Go was long thought to resist the chess-style approach of search guided by a hand-crafted evaluation function; the prevailing view was that no simple evaluation function would suffice.

    • AlphaGo surpassed human players using reinforcement learning with self-play; it trained by playing against copies of itself and learning from that experience.

    • Historical note: the idea that a simple evaluation function would never suffice aged poorly due to AlphaGo’s success.

  • Go specifics used to illustrate power of RL:

    • A move that human players would be unlikely to choose but that AlphaGo favored (e.g., the famous move 37) can turn out to be highly creative and strategic.

  • Other RL success stories and connections to real problems:

    • TD-Gammon and backgammon: self-play and reinforcement learning achieved high performance.

    • Deep reinforcement learning in practice extends to robotics, recommendation systems, and content optimization (e.g., maximizing watch time by selecting thumbnails).

    • RL techniques underpin training for language models with human preferences (reinforcement learning from human feedback, or RLHF) and other AI alignment tasks.

  • Course goals and structure in this class:

    • Build agents that learn to act in applications; cover how RL works, algorithm types, and when to use them.

    • Focus on deep reinforcement learning (the policy approximated by neural networks) for complex state spaces (e.g., images); a minimal policy-network sketch appears after this list.

    • Emphasize theory and practical implementation; connect to industry use cases.
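
As a concrete picture of a policy approximated by a neural network over image observations, here is a minimal sketch assuming PyTorch; the architecture, input size, and action count are illustrative, not the course's reference implementation:

```python
import torch
import torch.nn as nn

class ImagePolicy(nn.Module):
    """Maps an image observation to a distribution over discrete actions."""
    def __init__(self, n_actions: int, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # For an assumed 64x64 RGB input, the conv stack flattens to 64*4*4.
        self.head = nn.Sequential(nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
                                  nn.Linear(256, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.head(self.features(obs))
        return torch.distributions.Categorical(logits=logits)

policy = ImagePolicy(n_actions=4)
obs = torch.zeros(1, 3, 64, 64)   # dummy 64x64 RGB observation
action = policy(obs).sample()     # sample an action from the policy
print(action.item())
```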

  • Important caution: RL is not a universal solve-everything tool.

    • It requires interaction with an environment (a simulator or the real world), and that environment must have realistic fidelity to the target task.

    • If the simulator is not faithful, learned policies may fail in the real world.

    • Real-world training raises safety and wear-and-tear concerns; sometimes extra caution or staged training is needed.

  • Alternatives and tradeoffs to consider:

    • Supervised learning can be faster when demonstrations are available, but performance is limited by the quality of demonstrations.

    • RL explores novel strategies but often learns slowly and requires an environment to interact with.

  • Notion of non-stationarity in RL:

    • The policy changes during training, which makes the environment non-stationary from the agent’s perspective.

    • This can complicate exploration and make it easy to get stuck in poor local optima if updates go wrong.

  • Reset and episodic nature in RL:

    • Training often relies on resets to known good start states between attempts; frequent resets can be tedious in real-world tasks (e.g., robot experiments on sand). A standard episodic loop is sketched after this list.

    • Some research explores non-episodic RL and settings without easy resets.
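
For contrast with the non-episodic settings just mentioned, here is what the standard episodic loop looks like, assuming the Gymnasium API and its CartPole environment (both are assumptions for illustration, not something this lecture specifies); the explicit env.reset() at the top of every episode is exactly the reset being discussed:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
for episode in range(3):
    obs, info = env.reset()                  # reset to a known start state
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # placeholder random policy
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated       # end of this episode
    print(f"episode {episode}: return = {total_reward}")
env.close()
```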

  • Reward design challenges in RL:

    • Specifying reward functions can be hard, especially for complex tasks like language generation or multi-step activities (e.g., making a cup of coffee).

    • Dense rewards can speed learning but may constrain exploration and lead to reward hacking if the agent finds loopholes to maximize reward without achieving the intended task.

  • Examples of reward-hacking and pitfalls:

    • A robot that maximizes a distance-to-goal metric by moving blocks instead of actually solving the task (reward hacking example).

    • A game where the agent circles to collect recurring points rather than playing the intended game, illustrating how simple reward signals can be gamed (a toy calculation below makes the incentive explicit).
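
A toy calculation, with invented numbers, showing why circling for recurring points can beat finishing the game once the agent is asked only to maximize total reward:

```python
# Hypothetical scoring: +10 for finishing the level, or +1 for each lap past
# a pickup that respawns every lap. Both numbers are made up for illustration.
finish_bonus = 10.0
points_per_lap = 1.0
episode_length_in_laps = 50     # how long the episode lets the agent circle

finish_return = finish_bonus
circling_return = points_per_lap * episode_length_in_laps

print("finish the game:", finish_return)     # 10.0
print("circle forever:", circling_return)    # 50.0 -- the 'hack' wins
```

Under this reward the circling behavior is not a learner bug: it is the optimal policy for the reward as written, which is why the reward-design issues above matter.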

  • Practical tips for learners:

    • Reward shaping can help early learning but should avoid destroying the agent’s ability to discover novel solutions.

    • Balance the density of rewards to encourage exploration while guiding progress.

    • Start with simpler tasks and gradually increase complexity to build intuition and robust policies.

Core Reinforcement Learning Concepts

  • Environment, state, action, reward, and policy:

    • Time-step flow: the environment is in state $s_t$; the agent takes action $a_t$ drawn from a policy $\