
Reinforcement Learning Vocabulary Flashcards

Reinforcement Learning: Key Concepts

  • Reinforcement learning (RL) is an area of machine learning in which an agent learns how to behave in an environment by performing actions and observing the results.
  • In RL, learning is driven by interactions with the environment, not by explicit supervision.
  • Early framing contrasts supervised, unsupervised, and reinforcement learning to position RL within the broader ML landscape.

Learning paradigms: supervised, unsupervised, and reinforcement

  • Supervised Learning: learning from labeled examples to predict outputs.
  • Unsupervised Learning: discovering structure in unlabeled data.
  • Reinforcement Learning: an agent learns by trial-and-error to maximize cumulative reward over time.
  • RL emphasizes sequential decision making, where each action influences future states and rewards.

RL Process and Core Concepts

  • RL Process: An agent operates in an environment by taking actions and receiving feedback (rewards) that guide learning; a minimal sketch of this interaction loop follows the list below.
  • Core components:
    • Agent: The RL algorithm that learns from trial and error.
    • Environment: The world with which the agent interacts.
    • Action (A): All possible moves the agent can take.
    • State (S): The current situation returned by the environment.
    • Reward (R): Instant feedback from the environment evaluating the last action.
    • Policy (π): The strategy the agent uses to select actions based on the state.
    • Value (V): The expected long-term return from a state (discounted).
    • Action-value (Q): The expected return from taking a specific action in a state; like V, but additionally conditioned on the current action A.
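
The interaction loop these terms describe can be sketched in a few lines of Python. This is a minimal, hypothetical skeleton rather than code from the notes: the environment object with reset/step/actions methods is an assumed interface, and the random policy is only a placeholder for π.

    import random

    def run_episode(env, policy):
        # Basic RL loop: observe state S, choose action A via the policy,
        # receive reward R and the next state from the environment.
        state = env.reset()                          # initial state S
        total_reward, done = 0, False
        while not done:
            action = policy(state)                   # policy pi: state -> action
            state, reward, done = env.step(action)   # environment returns S', R
            total_reward += reward                   # accumulate the return
        return total_reward

    # Usage with a hypothetical environment and a purely random policy:
    # total = run_episode(env, lambda s: random.choice(env.actions(s)))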

Reward maximization and the exploration vs exploitation trade-off

  • Reward maximization theory: the agent should be trained to take actions that maximize the cumulative reward.
  • Exploitation: using known information to obtain higher rewards in the short term.
  • Exploration: seeking new information about the environment to improve long-term rewards.
  • The balance between exploitation and exploration is crucial for learning effective policies; a simple ε-greedy selection rule is sketched below.
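
A common way to manage this trade-off is an ε-greedy rule: with probability ε the agent explores a random action, otherwise it exploits the action with the highest current estimate. The sketch below is illustrative; the dictionary-based Q-table layout is an assumption, not something specified in the notes.

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # Explore with probability epsilon, otherwise exploit the best-known action.
        if random.random() < epsilon:
            return random.choice(actions)                  # exploration
        return max(actions, key=lambda a: Q[(state, a)])   # exploitation: argmax_a Q(s, a)

    # Q is assumed to be a dict keyed by (state, action) pairs, e.g. Q[(3, 1)] = 80.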

The formal framework: Markov Decision Process (MDP)

  • The standard mathematical framework for formulating an RL problem is the Markov Decision Process (MDP).
  • Key elements of the formulation:
    • Set of actions, A
    • Set of states, S
    • Reward, R
    • Policy, π
    • Value, V
  • RL context uses states and actions to model transitions and rewards as the agent moves through the environment.

Simple example: Shortest-path problem (graph-based RL)

  • Goal: Find the shortest path between node A and node D with minimum cost.
  • Components:
    • States: nodes {A, B, C, D}.
    • Actions: possible traversals between connected nodes (e.g., A → B, A → C, C → D, etc.).
    • Reward: the cost associated with each edge; by convention, edge costs are encoded as negative rewards so that minimizing total cost is equivalent to maximizing cumulative reward.
    • Policy: the path chosen to reach the destination (e.g., A → C → D).
  • Note: The edge costs and the example path illustrate how actions move between states and how rewards guide the chosen route (a small encoding of this setup is sketched below).
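
One way to encode this example is shown below. The notes do not give the edge costs, so the numbers are illustrative only; the point is that states are nodes, actions are edge traversals, and costs become negative rewards.

    # Hypothetical edge costs for the A-D graph (the notes give no numbers).
    edges = {
        ("A", "B"): 4, ("A", "C"): 2,
        ("B", "D"): 5, ("C", "D"): 3,
    }

    # States = nodes, actions = traversals along existing edges, and each edge
    # cost is a negative reward so that minimizing cost maximizes reward.
    reward = {(u, v): -cost for (u, v), cost in edges.items()}

    def path_reward(path):
        # Cumulative reward of a path such as ["A", "C", "D"].
        return sum(reward[(u, v)] for u, v in zip(path, path[1:]))

    print(path_reward(["A", "C", "D"]))   # -5 under the illustrative costs above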

Building-room environment: a graph-based RL example

  • Setup: Place an agent in one of five rooms (0–4) with the goal of reaching outside the building (room 5).
  • Rooms: 0, 1, 2, 3, 4 are inside; 5 represents the outside.
  • Connectivity: Doors connect the rooms; the doors from rooms 1 and 4 lead directly outside (to room 5).
  • Representation: Each room is a node; each door is a link (edge) in the graph.
  • Purpose: Demonstrates how an agent can learn a path to the outside by exploring door connections and receiving rewards.

Q-learning terminology and setup (state-action view)

  • States: rooms 0–5, including the outside (room 5), are the states.
  • Actions: Movement from one room to a connected room (edges/arrows).
  • Reward concept: Doors leading directly to the goal yield a reward of 100; doors not directly connected to the goal yield 0; because doors are two-way, each door has an immediate reward value in each direction.
  • In the illustrated example, the state is depicted as a node and an action as an arrow.
  • Example traversal (conceptual): The agent moves from state 2 to state 5 via intermediate states, e.g., 2 → 3, then from 3 (which connects to 2, 1, and 4) to 4, then 4 → 5.

Reward matrix R (conceptual)

  • R(s, a) represents the immediate reward for taking action a in state s.
  • In the room-building example, doors that lead directly to the goal have a reward of 100; other existing doors have a reward of 0; non-existent transitions (no door) are marked as unavailable, often shown as -1 in the table.
  • Because doors are bidirectional, each door contributes two entries to the matrix (one per direction), each with its own immediate reward.
  • Practical takeaway: R encodes the immediate value of taking a given action in a given state and forms the basis for updating Q (a concrete R matrix for this example is sketched below).
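
A concrete R matrix consistent with the connectivity described above might look like the following sketch (rows = current room, columns = action, i.e. target room). The 0–4 link and the 100-valued 5 → 5 self-loop are assumptions filled in to complete the matrix; -1 marks "no door".

    import numpy as np

    # Reward matrix R(s, a): rows = current room, columns = target room (action).
    #  -1 : no door (invalid action)
    #   0 : a door exists but does not lead directly to the goal
    # 100 : a door leading directly to the goal (room 5)
    R = np.array([
        [-1, -1, -1, -1,  0,  -1],   # room 0 (0-4 link assumed)
        [-1, -1, -1,  0, -1, 100],   # room 1
        [-1, -1, -1,  0, -1,  -1],   # room 2
        [-1,  0,  0, -1,  0,  -1],   # room 3
        [ 0, -1, -1,  0, -1, 100],   # room 4
        [-1,  0, -1, -1,  0, 100],   # room 5 (goal; self-loop assumed)
    ])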

The Q-matrix: memory of learned value estimates

  • Add a Q-matrix, Q, to represent the agent’s learned memory about action values in each state.
  • Rows: current state
  • Columns: possible actions leading to the next state
  • Update rule (core Q-learning formula):
    Q(s, a) = R(s, a) + \gamma \, \max_{a'} Q(s', a')
  • Discount factor \gamma:
    • Range: 0 ≤ \gamma ≤ 1.
    • If \gamma is closer to 0, the agent emphasizes immediate rewards; if \gamma is closer to 1, future rewards are weighted more heavily.
  • Initialization: typically initialize Q as a zero matrix and begin each training episode from a randomly chosen state (a single-step update is sketched below).
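
A single application of this update can be written as follows, assuming the R matrix sketched earlier and treating -1 entries as unavailable actions (an assumption about the encoding, not something the notes state):

    import numpy as np

    def q_update(Q, R, state, action, gamma=0.8):
        # One step of Q(s, a) = R(s, a) + gamma * max_a' Q(s', a').
        # In the room example the action *is* the next room, so s' = action.
        next_state = action
        valid = R[next_state] != -1    # actions with an existing door from s'
        Q[state, action] = R[state, action] + gamma * Q[next_state, valid].max()
        return Q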

Step-by-step Q-learning algorithm (high-level procedure)

  1. Set the gamma parameter and initialize the environment rewards in matrix R.
  2. Initialize the Q-matrix to zero.
  3. Choose a random initial state (current state).
  4. From the current state, select one among all possible actions (that lead to a next state).
  5. Execute the action to move to the next state.
  6. Compute the Q-value for the current state-action pair using the update rule:
    Q(state, action) = R(state, action) + \gamma \; \max[Q(next state, all actions)]
  7. Repeat steps 4–6 until the current state equals the goal state (the full training loop is sketched below).
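
Putting the steps together, a minimal training loop might look like the sketch below. It assumes the R matrix from the earlier sketch; the episode count is arbitrary, and exploration here is purely random (no ε-greedy), matching the procedure above.

    import numpy as np

    def train(R, gamma=0.8, goal=5, episodes=500):
        n = R.shape[0]
        Q = np.zeros_like(R, dtype=float)                  # step 2: Q starts at zero
        for _ in range(episodes):
            state = np.random.randint(n)                   # step 3: random initial state
            while state != goal:                           # step 7: stop at the goal state
                actions = np.flatnonzero(R[state] != -1)   # step 4: available actions
                action = np.random.choice(actions)         #         chosen at random
                next_state = action                        # step 5: move to the next state
                # step 6: Q(s, a) = R(s, a) + gamma * max_a' Q(s', a')
                valid_next = np.flatnonzero(R[next_state] != -1)
                Q[state, action] = R[state, action] + gamma * Q[next_state, valid_next].max()
                state = next_state
        return Q

    # After training, a greedy policy simply follows argmax_a Q(s, a) from any room.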

Episode walkthroughs (illustrative calculations)

  • Episode 1 (initial state = 1):
    • From state 1, a possible move is to state 5 (action 5).
    • Since Q-values are initially zero, the update is:
    • Q(1,5) = R(1,5) + \gamma \; \max[Q(5,1), Q(5,4), Q(5,5)]
    • With R(1,5) = 100 and initial Q-values all zeros, and \gamma = 0.8, we get:
    • Q(1,5) = 100 + 0.8 \times 0 = 100
  • Episode 2 (initial state = 3):
    • From state 3, possible moves include state 1 (action 1).
    • Compute: Q(3,1) = R(3,1) + \gamma \; \max[Q(1,3), Q(1,5)]
    • Given R(3,1) = 0, Q(1,5) = 100 from Episode 1, and Q(1,3) = 0, we have:
    • Q(3,1) = 0 + 0.8 \times \max[0, 100] = 0 + 0.8 \times 100 = 80
    • After this update, the Q-table reflects Q(3,1) = 80.
  • Updates continue as the agent explores further and revises Q-values for other state-action pairs, gradually shaping the policy toward high-value actions (the two updates above are reproduced in the sketch below).
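
These two hand calculations can be reproduced with the q_update and R sketches given earlier (both are assumed versions, as noted above):

    import numpy as np

    Q = np.zeros((6, 6))
    q_update(Q, R, state=1, action=5, gamma=0.8)   # Episode 1: 100 + 0.8 * 0  = 100
    q_update(Q, R, state=3, action=1, gamma=0.8)   # Episode 2: 0 + 0.8 * 100  = 80
    print(Q[1, 5], Q[3, 1])                        # 100.0 80.0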

Gamma parameter and learning dynamics (revisited)

  • Gamma (\gamma) controls the balance between immediate and future rewards.
  • Practical guidance: In the example, a value of 0.8 was used to emphasize future rewards while still paying attention to immediate outcomes.
  • The learning process continues episodically, updating Q-values based on observed rewards and the estimated future value of subsequent states (the effect of \gamma on a discounted return is illustrated below).
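
The effect of \gamma can be seen in a small illustrative calculation of a discounted return (the reward sequence below is made up for the illustration):

    def discounted_return(rewards, gamma):
        # Sum of gamma^t * r_t over a reward sequence.
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    rewards = [0, 0, 100]                      # reward arrives two steps in the future
    print(discounted_return(rewards, 0.8))     # ~64: future reward still counts strongly
    print(discounted_return(rewards, 0.1))     # ~1:  future reward is nearly ignored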

Practical implications and takeaways

  • Q-learning provides a model-free approach to learn good policies for sequential decision problems, without requiring a model of the environment's dynamics.
  • The method relies on iterative updates of the Q-table using observed rewards and the maximum estimated future value.
  • Key hyperparameters include the learning rate (implicitly 1 in the simplified update used here) and the discount factor \gamma.
  • RL approaches require careful design of rewards to align with the desired outcomes and to avoid unintended incentives.
  • In real-world applications, exploration strategies (e.g., ε-greedy) are typically employed to ensure sufficient state-action coverage.

Connections to foundational principles and real-world relevance

  • RL connects to dynamic programming through the Bellman equation (written out after this list), but without requiring a complete model of the environment.
  • Markov Decision Process formalism underpins many planning and control problems in robotics, operations research, and AI systems.
  • The room-building example illustrates how simple graph-based environments can be used to teach core RL ideas: states, actions, rewards, and the iterative improvement of action-values.
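
For reference, the general Bellman optimality equation for the action-value function is:

    Q^{*}(s, a) \;=\; R(s, a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a')

The simplified update used in these notes corresponds to the special case of deterministic transitions (each action leads to exactly one next room) and a learning rate of 1.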

Ethical, philosophical, and practical implications

  • Ethical: RL agents acting in the real world must be designed to avoid unsafe or unethical actions; safe exploration is critical when learning in physical systems or with humans.
  • Philosophical: RL embodies a pragmatic approach to learning from consequences, emphasizing experience-driven improvement rather than pre-programmed rules.
  • Practical: Effective RL requires careful reward shaping, sufficient exploration, and computational resources for updating and storing Q-values in larger state-action spaces.

Notation recap (quick reference)

  • States: S
  • Actions: A
  • Reward: R(s, a)
  • Policy: π
  • Value: V(s)
  • Action-value: Q(s, a)
  • Discount factor: \gamma
  • Q-learning update: Q(s, a) = R(s, a) + \gamma \; \max_{a'} Q(s', a')
  • Goal of examples: minimize cost or maximize cumulative reward through learned policy