Reinforcement Learning Vocabulary Flashcards
Reinforcement Learning: Key Concepts
- Reinforcement learning (RL) is a subset of artificial intelligence (AI) where an agent learns to behave in an environment by performing actions and observing the results.
- In RL, learning is driven by interactions with the environment, not by explicit supervision.
- Early framing contrasts supervised, unsupervised, and reinforcement learning to position RL within the broader ML landscape.
Learning paradigms: supervised, unsupervised, and reinforcement
- Supervised Learning: learning from labeled examples to predict outputs.
- Unsupervised Learning: discovering structure in unlabeled data.
- Reinforcement Learning: an agent learns by trial-and-error to maximize cumulative reward over time.
- RL emphasizes sequential decision making, where each action influences future states and rewards.
RL Process and Core Concepts
- RL Process: An agent operates in an environment by taking actions and receiving feedback (rewards) that guide learning.
- Core components:
- Agent: The RL algorithm that learns from trial and error.
- Environment: The world with which the agent interacts.
- Action (A): All possible moves the agent can take.
- State (S): The current situation returned by the environment.
- Reward (R): Instant feedback from the environment evaluating the last action.
- Policy (π): The strategy the agent uses to select actions based on the state.
- Value (V): The expected discounted long-term return obtained from a state when following the current policy.
- Action-value (Q): The expected discounted return obtained by taking a specific action A in state S and then following the policy.
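The two value quantities above have a standard formalization in terms of the discounted return; the return symbol G_t and the time-indexed S_t, A_t are standard notation, not symbols used in the source, and the discount factor \gamma is introduced later in these notes.

```latex
% Discounted return from time step t (\gamma is the discount factor, 0 <= \gamma <= 1)
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

% State value: expected return when starting in state s and following policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]

% Action value: expected return when taking action a in state s, then following \pi
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\ A_t = a \,\right]
```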
Reward maximization and the exploration vs exploitation trade-off
- Reward maximization theory: the agent should be trained to take actions that maximize the cumulative reward.
- Exploitation: using known information to obtain higher rewards in the short term.
- Exploration: seeking new information about the environment to improve long-term rewards.
- The balance between exploitation and exploration is crucial for learning effective policies (a simple ε-greedy sketch of this trade-off appears after this list).
- The standard mathematical framework for formulating an RL problem is the Markov Decision Process (MDP).
- Key elements used to attain a solution:
- Set of actions, A
- Set of states, S
- Reward, R
- Policy, π
- Value, V
- RL context uses states and actions to model transitions and rewards as the agent moves through the environment.
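To make the exploration vs exploitation balance concrete, here is a minimal sketch of the ε-greedy rule mentioned later in these notes: with probability ε the agent explores a random action, otherwise it exploits the action with the highest current Q-value. The function name, the NumPy Q-table, and the valid_actions argument are illustrative assumptions rather than anything specified in the source.

```python
import numpy as np

def epsilon_greedy(Q, state, valid_actions, epsilon=0.1, rng=None):
    """Choose an action in `state`: explore with probability epsilon, else exploit.

    Q             -- 2-D array of current action-value estimates, Q[state, action]
    valid_actions -- list of actions actually available in `state`
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Exploration: gather new information by trying a random valid action.
        return int(rng.choice(valid_actions))
    # Exploitation: use current knowledge and take the highest-valued action.
    best = int(np.argmax(Q[state, valid_actions]))
    return int(valid_actions[best])
```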
Simple example: Shortest-path problem (graph-based RL)
- Goal: Find the shortest path between node A and node D with minimum cost.
- Components:
- States: nodes {A, B, C, D}.
- Actions: possible traversals between connected nodes (e.g., A → B, A → C, C → D, etc.).
- Reward: the cost associated with each edge; by convention, edge costs are treated as negative rewards so that minimizing total cost is equivalent to maximizing cumulative reward.
- Policy: the path chosen to reach the destination (e.g., A → C → D).
- Note: Edge costs and the example path illustrate how actions move the agent between states and how rewards guide the chosen route; a small code sketch of this graph follows.
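One possible encoding of the shortest-path example as data is sketched below. The numeric edge costs are invented for illustration (the source gives no values), and each cost is negated so that minimizing total cost becomes maximizing cumulative reward, as noted above.

```python
# Hypothetical edge costs for the A-B-C-D graph (illustrative values only).
edge_cost = {
    ("A", "B"): 4, ("B", "D"): 5,
    ("A", "C"): 2, ("C", "D"): 3,
}

# Convention: treat each cost as a negative reward, so the best path is the
# one with the highest (least negative) cumulative reward.
reward = {edge: -cost for edge, cost in edge_cost.items()}

def path_return(path):
    """Cumulative reward of following a path such as ['A', 'C', 'D']."""
    return sum(reward[(s, s_next)] for s, s_next in zip(path, path[1:]))

print(path_return(["A", "C", "D"]))  # -5  -> total cost 5 (the better policy)
print(path_return(["A", "B", "D"]))  # -9  -> total cost 9
```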
Building-room environment: a graph-based RL example
- Setup: Place an agent in one of five rooms (0–4) with the goal of reaching outside the building (room 5).
- Rooms: 0, 1, 2, 3, 4 are inside; 5 represents the outside.
- Connectivity: Doors connect the rooms; rooms 1 and 4 have doors that open directly to the outside (room 5).
- Representation: Each room is a node; each door is a link (edge) in the graph.
- Purpose: Demonstrates how an agent can learn a path to the outside by exploring door connections and receiving rewards.
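A graph encoding of this setup is sketched below. The exact door layout is an assumption based on the standard version of this classic example and should be checked against the floor-plan figure; the point for these notes is simply that rooms are nodes and doors are two-way links.

```python
# Each room is a node; each door is an undirected link. Room 5 is the outside.
# Door layout assumed from the standard version of this example.
doors = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]

# Build a two-way adjacency list, since every door can be used in both directions.
neighbors = {room: [] for room in range(6)}
for a, b in doors:
    neighbors[a].append(b)
    neighbors[b].append(a)

print(neighbors[2])  # [3]        -> from room 2 the only move is into room 3
print(neighbors[3])  # [1, 2, 4]  -> rooms reachable from room 3
```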
Q-learning terminology and setup (state-action view)
- States: Rooms 0–5 are the states, with room 5 (the outside) serving as the goal state.
- Actions: Movement from one room to a connected room (edges/arrows).
- Reward concept: Doors that lead directly to the goal yield a reward of 100; doors not directly connected to the goal yield a reward of 0; doors are two-way, so each door has an immediate reward value in both directions.
- In the illustrated example, a state is depicted as a node and an action as an arrow.
- Example traversal (conceptual): The agent travels from state 2 to the goal state 5 via intermediate states (e.g., 2 → 3; from state 3 it can move to 1, 2, or 4; then 4 → 5 or 1 → 5).
Reward matrix R (conceptual)
- R(s, a) represents the immediate reward for taking action a in state s.
- In the room-building example, doors that lead directly to the goal have a reward of 100; other direct connections have 0; invalid or non-existent transitions are represented as null (often depicted with -1 in some tables).
- Because doors are two-way, each door corresponds to two directed transitions in the matrix, each with its own immediate reward.
- Practical takeaway: R encodes the immediate value of taking a given action in a given state, forming the basis for updating Q.
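Under these conventions, and assuming the standard door layout sketched earlier, the reward matrix R for the six states can be written out explicitly: -1 marks a missing door, 0 an ordinary door, and 100 a door that opens directly onto the goal.

```python
import numpy as np

# Rows = current state, columns = action (the room the agent moves into).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # from room 0: only a door to room 4
    [-1, -1, -1,  0, -1, 100],   # from room 1: door to 3, and straight outside (goal)
    [-1, -1, -1,  0, -1,  -1],   # from room 2: only a door to room 3
    [-1,  0,  0, -1,  0,  -1],   # from room 3: doors to 1, 2, and 4
    [ 0, -1, -1,  0, -1, 100],   # from room 4: doors to 0, 3, and straight outside
    [-1,  0, -1, -1,  0, 100],   # from room 5 (goal): doors back to 1 and 4, or stay
])
```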
The Q-matrix: memory of learned value estimates
- Add a Q-matrix, Q, to represent the agent’s learned memory about action values in each state.
- Rows: current state
- Columns: possible actions leading to the next state
- Update rule (core Q-learning formula):
  Q(s, a) = R(s, a) + \gamma \, \max_{a'} Q(s', a')
- Discount factor: \gamma (written as Γ in the transcript; standard notation is \gamma)
- Range: 0 ≤ \gamma ≤ 1
- If \gamma is closer to 0, the agent emphasizes immediate rewards; if \gamma is closer to 1, future rewards are weighted more heavily.
- Initialization: typically initialize Q as a zero matrix and set a learning schedule (e.g., starting from random states).
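A minimal sketch of the initialization and of a single application of the update rule follows. It assumes the R matrix written out above and \gamma = 0.8 as used in the walkthrough below, and it exploits the fact that in this example the action label is the same as the next state.

```python
import numpy as np

gamma = 0.8              # discount factor, 0 <= gamma <= 1
Q = np.zeros((6, 6))     # learned memory: rows = current state, columns = action

def q_update(Q, R, state, action, gamma=0.8):
    """One application of Q(s, a) = R(s, a) + gamma * max_a' Q(s', a').

    In the room example the action *is* the next state (s' = action), and the
    max is taken over actions that correspond to existing doors (R != -1).
    """
    next_state = action
    valid_next = np.where(R[next_state] != -1)[0]
    Q[state, action] = R[state, action] + gamma * Q[next_state, valid_next].max()
    return Q[state, action]
```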
Step-by-step Q-learning algorithm (high-level procedure)
- Set the gamma parameter and initialize the environment rewards in matrix R.
- Initialize the Q-matrix to zero.
- Choose a random initial state (current state).
- From the current state, select one among all possible actions (that lead to a next state).
- Execute the action to move to the next state.
- Compute the Q-value for the current state-action pair using the update rule:
  Q(state, action) = R(state, action) + \gamma \; \max[Q(next state, all actions)]
- Repeat the action-selection, execution, and update steps until the current state equals the goal state.
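Putting the procedure together, a minimal end-to-end sketch under the same assumptions (standard door layout, \gamma = 0.8, purely random action selection as in the original procedure) might look like this:

```python
import numpy as np

GAMMA = 0.8
GOAL = 5

# Immediate-reward matrix (assumed standard layout):
# -1 = no door, 0 = ordinary door, 100 = door leading directly to the goal.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

rng = np.random.default_rng(0)
Q = np.zeros_like(R, dtype=float)              # initialize the Q-matrix to zero

for episode in range(500):
    state = rng.integers(0, 6)                 # random initial state
    while state != GOAL:                       # episode ends at the goal state
        valid = np.where(R[state] != -1)[0]    # actions that lead to a next state
        action = rng.choice(valid)             # pick one possible action at random
        next_state = action                    # taking the action enters that room
        next_valid = np.where(R[next_state] != -1)[0]
        # Q-update: immediate reward plus discounted best value of the next state.
        Q[state, action] = R[state, action] + GAMMA * Q[next_state, next_valid].max()
        state = next_state

# Scale to 0-100 for readability; acting greedily on Q now leads outside from any room.
print((Q / Q.max() * 100).round().astype(int))
```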
Episode walkthroughs (illustrative calculations)
- Episode 1 (initial state = 1):
- From state 1, one possible move is to state 5 (action 5).
- Since Q-values are initially zero, the update is:
- Q(1,5) = R(1,5) + \gamma \; \max[Q(5,1), Q(5,4), Q(5,5)]
- With R(1,5) = 100 and initial Q-values all zeros, and \gamma = 0.8, we get:
- Q(1,5) = 100 + 0.8 \times 0 = 100
- Episode 2 (next episode, initial state = 3):
- From state 3, possible moves include to state 1 (action 1).
- Compute: Q(3,1) = R(3,1) + \gamma \; \max[Q(1,3), Q(1,5)]
- Given R(3,1) = 0, and with Q(1,5) already set to 100 while Q(1,3) remains 0, we have:
- Q(3,1) = 0 + 0.8 \times \max[0, 100] = 0 + 0.8 \times 100 = 80
- After this update, the Q-table reflects Q(3,1) = 80.
- Ongoing updates alternate as the agent continues to explore and update Q-values for other state-action pairs, gradually shaping the policy toward high-value actions.
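The two hand calculations above can be checked directly in code, again assuming the standard R matrix and \gamma = 0.8:

```python
import numpy as np

gamma = 0.8
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.zeros((6, 6))

# Episode 1: start in state 1, take action 5 (move outside).
Q[1, 5] = R[1, 5] + gamma * max(Q[5, 1], Q[5, 4], Q[5, 5])
print(Q[1, 5])   # 100.0

# Episode 2: start in state 3, take action 1 (move into room 1).
Q[3, 1] = R[3, 1] + gamma * max(Q[1, 3], Q[1, 5])
print(Q[3, 1])   # 80.0
```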
Gamma parameter and learning dynamics (revisited)
- Gamma (\gamma) controls the balance between immediate and future rewards.
- Practical guidance: In the example, a value of 0.8 was used to emphasize future rewards while still paying attention to immediate outcomes.
- The learning process continues episodically, updating Q-values based on observed rewards and the estimated future value of subsequent states.
Practical implications and takeaways
- Q-learning provides a model-free approach to learn good policies for sequential decision problems, without requiring a model of the environment's dynamics.
- The method relies on iterative updates of the Q-table using observed rewards and the maximum estimated future value.
- Key hyperparameters include the learning rate (implicitly 1 in the simple update used here) and the discount factor \gamma (gamma).
- RL approaches require careful design of rewards to align with the desired outcomes and to avoid unintended incentives.
- In real-world applications, exploration strategies (e.g., ε-greedy) are typically employed to ensure sufficient state-action coverage.
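Once the Q-table has stabilized, the learned policy is simply the greedy choice in each state; a small sketch, assuming the -1 convention from the R matrix above to mask non-existent moves:

```python
import numpy as np

def greedy_policy(Q, R):
    """For each state, return the valid action with the highest learned Q-value."""
    masked = np.where(R == -1, -np.inf, Q)   # never select a non-existent door
    return masked.argmax(axis=1)

# Usage (after training): greedy_policy(Q, R) gives, for every room, the door
# to take so that the agent heads toward room 5 along a high-value path.
```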
Connections to foundational principles and real-world relevance
- RL connects to dynamic programming concepts such as the Bellman equation (written out after this list), but without requiring a complete model of the environment.
- Markov Decision Process formalism underpins many planning and control problems in robotics, operations research, and AI systems.
- The room-building example illustrates how simple graph-based environments can be used to teach core RL ideas: states, actions, rewards, and the iterative improvement of action-values.
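For reference, the Bellman optimality equation that the Q-learning update approximates by sampling single transitions; the expectation is over the environment's (unknown) next-state distribution, which is exactly the model that Q-learning does not need:

```latex
Q^{*}(s, a) = \mathbb{E}\left[ R(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
```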
Ethical, philosophical, and practical implications
- Ethical: RL agents acting in the real world must be designed to avoid unsafe or unethical actions; safe exploration is critical when learning in physical systems or with humans.
- Philosophical: RL embodies a pragmatic approach to learning from consequences, emphasizing experience-driven improvement rather than pre-programmed rules.
- Practical: Effective RL requires careful reward shaping, sufficient exploration, and computational resources for updating and storing Q-values in larger state-action spaces.
Notation recap (quick reference)
- States: S
- Actions: A
- Reward: R(s, a)
- Policy: π
- Value: V(s)
- Action-value: Q(s, a)
- Discount factor: \gamma
- Q-learning update: Q(s, a) = R(s, a) + \gamma \; \max_{a'} Q(s', a')
- Goal of examples: minimize cost or maximize cumulative reward through learned policy