Reinforcement Learning Vocabulary Flashcards
Reinforcement Learning: Key Concepts
- Reinforcement learning (RL) is a subset of artificial intelligence (AI) where an agent learns to behave in an environment by performing actions and observing the results.
- In RL, learning is driven by interactions with the environment, not by explicit supervision.
- Early framing contrasts supervised, unsupervised, and reinforcement learning to position RL within the broader ML landscape.
Learning paradigms: supervised, unsupervised, and reinforcement
- Supervised Learning: learning from labeled examples to predict outputs.
- Unsupervised Learning: discovering structure in unlabeled data.
- Reinforcement Learning: an agent learns by trial-and-error to maximize cumulative reward over time.
- RL emphasizes sequential decision making, where each action influences future states and rewards.
RL Process and Core Concepts
- RL Process: An agent operates in an environment by taking actions and receiving feedback (rewards) that guide learning.
- Core components:
- Agent: The RL algorithm that learns from trial and error.
- Environment: The world with which the agent interacts.
- Action (A): All possible moves the agent can take.
- State (S): The current situation returned by the environment.
- Reward (R): Instant feedback from the environment evaluating the last action.
- Policy (π): The strategy the agent uses to select actions based on the state.
- Value (V): The expected discounted long-term return obtained from a state when following the current policy.
- Action-value (Q): The expected discounted return obtained by taking a specific action A in state S and then following the policy.
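The two value quantities above have a standard formalization in terms of the discounted return; the return symbol G_t and the time-indexed S_t, A_t are standard notation, not symbols used in the source, and the discount factor \gamma is introduced later in these notes.

```latex
% Discounted return from time step t (\gamma is the discount factor, 0 <= \gamma <= 1)
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

% State value: expected return when starting in state s and following policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]

% Action value: expected return when taking action a in state s, then following \pi
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\ A_t = a \,\right]
```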
Reward maximization and the exploration vs exploitation trade-off
- Reward maximization theory: the agent should be trained to take actions that maximize the cumulative reward.
- Exploitation: using known information to obtain higher rewards in the short term.
- Exploration: seeking new information about the environment to improve long-term rewards.
- The balance between exploitation and exploration is crucial for learning effective policies (a simple ε-greedy sketch of this trade-off appears after this list).
- The standard mathematical framework for formulating an RL problem is the Markov Decision Process (MDP).
- Key elements used to attain a solution:
- Set of actions, A
- Set of states, S
- Reward, R
- Policy, π
- Value, V
- RL context uses states and actions to model transitions and rewards as the agent moves through the environment.
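To make the exploration vs exploitation balance concrete, here is a minimal sketch of the ε-greedy rule mentioned later in these notes: with probability ε the agent explores a random action, otherwise it exploits the action with the highest current Q-value. The function name, the NumPy Q-table, and the valid_actions argument are illustrative assumptions rather than anything specified in the source.

```python
import numpy as np

def epsilon_greedy(Q, state, valid_actions, epsilon=0.1, rng=None):
    """Choose an action in `state`: explore with probability epsilon, else exploit.

    Q             -- 2-D array of current action-value estimates, Q[state, action]
    valid_actions -- list of actions actually available in `state`
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Exploration: gather new information by trying a random valid action.
        return int(rng.choice(valid_actions))
    # Exploitation: use current knowledge and take the highest-valued action.
    best = int(np.argmax(Q[state, valid_actions]))
    return int(valid_actions[best])
```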
Simple example: Shortest-path problem (graph-based RL)
- Goal: Find the shortest path between node A and node D with minimum cost.
- Components:
- States: nodes {A, B, C, D}.
- Actions: possible traversals between connected nodes (e.g., A → B, A → C, C → D, etc.).
- Reward: the cost associated with each edge; by convention, edge costs are treated as negative rewards so that minimizing total cost is equivalent to maximizing cumulative reward.
- Policy: the path chosen to reach the destination (e.g., A → C → D).
- Note: Edge costs and the example path illustrate how actions move the agent between states and how rewards guide the chosen route; a small code sketch of this graph follows.
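One possible encoding of the shortest-path example as data is sketched below. The numeric edge costs are invented for illustration (the source gives no values), and each cost is negated so that minimizing total cost becomes maximizing cumulative reward, as noted above.

```python
# Hypothetical edge costs for the A-B-C-D graph (illustrative values only).
edge_cost = {
    ("A", "B"): 4, ("B", "D"): 5,
    ("A", "C"): 2, ("C", "D"): 3,
}

# Convention: treat each cost as a negative reward, so the best path is the
# one with the highest (least negative) cumulative reward.
reward = {edge: -cost for edge, cost in edge_cost.items()}

def path_return(path):
    """Cumulative reward of following a path such as ['A', 'C', 'D']."""
    return sum(reward[(s, s_next)] for s, s_next in zip(path, path[1:]))

print(path_return(["A", "C", "D"]))  # -5  -> total cost 5 (the better policy)
print(path_return(["A", "B", "D"]))  # -9  -> total cost 9
```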
Building-room environment: a graph-based RL example
- Setup: Place an agent in one of five rooms (0–4) with the goal of reaching outside the building (room 5).
- Rooms: 0, 1, 2, 3, 4 are inside; 5 represents the outside.
- Connectivity: Doors connect the rooms; rooms 1 and 4 have doors that open directly to the outside (room 5).
- Representation: Each room is a node; each door is a link (edge) in the graph.
- Purpose: Demonstrates how an agent can learn a path to the outside by exploring door connections and receiving rewards.
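A graph encoding of this setup is sketched below. The exact door layout is an assumption based on the standard version of this classic example and should be checked against the floor-plan figure; the point for these notes is simply that rooms are nodes and doors are two-way links.

```python
# Each room is a node; each door is an undirected link. Room 5 is the outside.
# Door layout assumed from the standard version of this example.
doors = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]

# Build a two-way adjacency list, since every door can be used in both directions.
neighbors = {room: [] for room in range(6)}
for a, b in doors:
    neighbors[a].append(b)
    neighbors[b].append(a)

print(neighbors[2])  # [3]        -> from room 2 the only move is into room 3
print(neighbors[3])  # [1, 2, 4]  -> rooms reachable from room 3
```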
Q-learning terminology and setup (state-action view)
- States: Rooms 0–5 are the states, with room 5 (the outside) serving as the goal state.
- Actions: Movement from one room to a connected room (edges/arrows).
- Reward concept: Doors that lead directly to the goal yield a reward of 100; doors not directly connected to the goal yield a reward of 0; doors are two-way, so each door has an immediate reward value in both directions.
- In the illustrated example, a state is depicted as a node and an action as an arrow.
- Example traversal (conceptual): The agent travels from state 2 to the goal state 5 via intermediate states (e.g., 2 → 3; from state 3 it can move to 1, 2, or 4; then 4 → 5 or 1 → 5).
Reward matrix R (conceptual)
- R(s, a) represents the immediate reward for taking action a in state s.
- In the room-building example, doors that lead directly to the goal have a reward of 100; other direct connections have 0; invalid or non-existent transitions are represented as null (often depicted with -1 in some tables).
- Because doors are two-way, each door corresponds to two directed transitions in the matrix, each with its own immediate reward.
- Practical takeaway: R encodes the immediate value of taking a given action in a given state, forming the basis for updating Q.
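Under these conventions, and assuming the standard door layout sketched earlier, the reward matrix R for the six states can be written out explicitly: -1 marks a missing door, 0 an ordinary door, and 100 a door that opens directly onto the goal.

```python
import numpy as np

# Rows = current state, columns = action (the room the agent moves into).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # from room 0: only a door to room 4
    [-1, -1, -1,  0, -1, 100],   # from room 1: door to 3, and straight outside (goal)
    [-1, -1, -1,  0, -1,  -1],   # from room 2: only a door to room 3
    [-1,  0,  0, -1,  0,  -1],   # from room 3: doors to 1, 2, and 4
    [ 0, -1, -1,  0, -1, 100],   # from room 4: doors to 0, 3, and straight outside
    [-1,  0, -1, -1,  0, 100],   # from room 5 (goal): doors back to 1 and 4, or stay
])
```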
The Q-matrix: memory of learned value estimates
- Add a Q-matrix, Q, to represent the agent’s learned memory about action values in each state.
- Rows: current state
- Columns: possible actions leading to the next state
- Update rule (core Q-learning formula):
  Q(s, a) = R(s, a) + \gamma \, \max_{a'} Q(s', a')
- Discount factor: \gamma (written as Γ in the transcript; standard notation is \gamma)
- Range: 0 ≤ \gamma ≤ 1
- If \gamma is closer to 0, the agent emphasizes immediate rewards; if \gamma is closer to 1, future rewards are weighted more heavily.
- Initialization: typically initialize Q as a zero matrix and set a learning schedule (e.g., starting from random states).
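A minimal sketch of the initialization and of a single application of the update rule follows. It assumes the R matrix written out above and \gamma = 0.8 as used in the walkthrough below, and it exploits the fact that in this example the action label is the same as the next state.

```python
import numpy as np

gamma = 0.8              # discount factor, 0 <= gamma <= 1
Q = np.zeros((6, 6))     # learned memory: rows = current state, columns = action

def q_update(Q, R, state, action, gamma=0.8):
    """One application of Q(s, a) = R(s, a) + gamma * max_a' Q(s', a').

    In the room example the action *is* the next state (s' = action), and the
    max is taken over actions that correspond to existing doors (R != -1).
    """
    next_state = action
    valid_next = np.where(R[next_state] != -1)[0]
    Q[state, action] = R[state, action] + gamma * Q[next_state, valid_next].max()
    return Q[state, action]
```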
Step-by-step Q-learning algorithm (high-level procedure)
- Set the gamma parameter and initialize the environment rewards in matrix R.
- Initialize the Q-matrix to zero.
- Choose a random initial state (current state).
- From the current state, select one among all possible actions (that lead to a next state).
- Execute the action to move to the next state.
- Compute the Q-value for the current state-action pair using the update rule:
  Q(state, action) = R(state, action) + \gamma \; \max[Q(next state, all actions)]
- Repeat the action-selection, execution, and update steps until the current state equals the goal state.
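Putting the procedure together, a minimal end-to-end sketch under the same assumptions (standard door layout, \gamma = 0.8, purely random action selection as in the original procedure) might look like this:

```python
import numpy as np

GAMMA = 0.8
GOAL = 5

# Immediate-reward matrix (assumed standard layout):
# -1 = no door, 0 = ordinary door, 100 = door leading directly to the goal.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

rng = np.random.default_rng(0)
Q = np.zeros_like(R, dtype=float)              # initialize the Q-matrix to zero

for episode in range(500):
    state = rng.integers(0, 6)                 # random initial state
    while state != GOAL:                       # episode ends at the goal state
        valid = np.where(R[state] != -1)[0]    # actions that lead to a next state
        action = rng.choice(valid)             # pick one possible action at random
        next_state = action                    # taking the action enters that room
        next_valid = np.where(R[next_state] != -1)[0]
        # Q-update: immediate reward plus discounted best value of the next state.
        Q[state, action] = R[state, action] + GAMMA * Q[next_state, next_valid].max()
        state = next_state

# Scale to 0-100 for readability; acting greedily on Q now leads outside from any room.
print((Q / Q.max() * 100).round().astype(int))
```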
Episode walkthroughs (illustrative calculations)
- Episode 1 (initial state = 1):
- From state 1, one possible move is to state 5 (action 5).
- Since Q-values are initially zero, the update is:
- Q(1,5) = R(1,5) + \gamma \; \max[Q(5,1), Q(5,4), Q(5,5)]
- With R(1,5) = 100 and initial Q-values all zeros, and \gamma = 0.8, we get:
- Q(1,5) = 100 + 0.8 \times 0 = 100
- Episode 2 (next episode, initial state = 3):
- From state 3, possible moves include to state 1 (action 1).
- Compute: Q(3,1) = R(3,1) + \gamma \; \max[Q(1,3), Q(1,5)]
- Given R(3,1) = 0, and with Q(1,5) already set to 100 while Q(1,3) remains 0, we have:
- Q(3,1) = 0 + 0.8 \times \max[0, 100] = 0 + 0.8 \times 100 = 80
- After this update, the Q-table reflects Q(3,1) = 80.
- Ongoing updates alternate as the agent continues to explore and update Q-values for other state-action pairs, gradually shaping the policy toward high-value actions.
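The two hand calculations above can be checked directly in code, again assuming the standard R matrix and \gamma = 0.8:

```python
import numpy as np

gamma = 0.8
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.zeros((6, 6))

# Episode 1: start in state 1, take action 5 (move outside).
Q[1, 5] = R[1, 5] + gamma * max(Q[5, 1], Q[5, 4], Q[5, 5])
print(Q[1, 5])   # 100.0

# Episode 2: start in state 3, take action 1 (move into room 1).
Q[3, 1] = R[3, 1] + gamma * max(Q[1, 3], Q[1, 5])
print(Q[3, 1])   # 80.0
```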
Gamma parameter and learning dynamics (revisited)
- Gamma (\gamma) controls the balance between immediate and future rewards.
- Practical guidance: In the example, a value of 0.8 was used to emphasize future rewards while still paying attention to immediate outcomes.
- The learning process continues episodically, updating Q-values based on observed rewards and the estimated future value of subsequent states.
Practical implications and takeaways
- Q-learning provides a model-free approach to learn good policies for sequential decision problems, without requiring a model of the environment's dynamics.
- The method relies on iterative updates of the Q-table using observed rewards and the maximum estimated future value.
- Key hyperparameters include the learning rate (implicitly 1 in the simple update used here) and the discount factor \gamma (gamma).
- RL approaches require careful design of rewards to align with the desired outcomes and to avoid unintended incentives.
- In real-world applications, exploration strategies (e.g., ε-greedy) are typically employed to ensure sufficient state-action coverage.
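Once the Q-table has stabilized, the learned policy is simply the greedy choice in each state; a small sketch, assuming the -1 convention from the R matrix above to mask non-existent moves:

```python
import numpy as np

def greedy_policy(Q, R):
    """For each state, return the valid action with the highest learned Q-value."""
    masked = np.where(R == -1, -np.inf, Q)   # never select a non-existent door
    return masked.argmax(axis=1)

# Usage (after training): greedy_policy(Q, R) gives, for every room, the door
# to take so that the agent heads toward room 5 along a high-value path.
```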
Connections to foundational principles and real-world relevance
- RL connects to dynamic programming concepts such as the Bellman equation (written out after this list), but without requiring a complete model of the environment.
- Markov Decision Process formalism underpins many planning and control problems in robotics, operations research, and AI systems.
- The room-building example illustrates how simple graph-based environments can be used to teach core RL ideas: states, actions, rewards, and the iterative improvement of action-values.
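For reference, the Bellman optimality equation that the Q-learning update approximates by sampling single transitions; the expectation is over the environment's (unknown) next-state distribution, which is exactly the model that Q-learning does not need:

```latex
Q^{*}(s, a) = \mathbb{E}\left[ R(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
```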
Ethical, philosophical, and practical implications
- Ethical: RL agents acting in the real world must be designed to avoid unsafe or unethical actions; safe exploration is critical when learning in physical systems or with humans.
- Philosophical: RL embodies a pragmatic approach to learning from consequences, emphasizing experience-driven improvement rather than pre-programmed rules.
- Practical: Effective RL requires careful reward shaping, sufficient exploration, and computational resources for updating and storing Q-values in larger state-action spaces.
Notation recap (quick reference)
- States: S
- Actions: A
- Reward: R(s, a)
- Policy: π
- Value: V(s)
- Action-value: Q(s, a)
- Discount factor: \gamma
- Q-learning update: Q(s, a) = R(s, a) + \gamma \; \max_{a'} Q(s', a')
- Goal of examples: minimize cost or maximize cumulative reward through learned policy