Notes on Reinforcement Learning
Reinforcement Learning Overview
Definitions:
Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward.
Components of RL:
States (S): Represents the different situations in which the agent can find itself.
Actions (A): The decisions or moves made by the agent based on the current state.
Transition Model (T): Defines the probabilities of moving from one state to another given a specific action, denoted as T(s,a,s') = P(s'|s,a) .
Reward Function (R): Provides feedback on the outcome of an action in terms of immediate reward, represented as R(s,a,s') .
Discount Factor (γ): A number between 0 and 1 that reduces the weight of future rewards relative to immediate ones, accounting for uncertainty over time.
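As a concrete illustration, these components can be written out for a tiny, invented two-state MDP (the state names, probabilities, and rewards below are made up for illustration):

```python
# A hypothetical two-state MDP, spelled out as plain Python data.
STATES = ["cool", "overheated"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor

# Transition model T(s, a, s') = P(s'|s, a): probabilities over next states.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "overheated": 0.5},
    ("overheated", "slow"): {"overheated": 1.0},
    ("overheated", "fast"): {"overheated": 1.0},
}

# Reward function R(s, a, s'): immediate feedback for each transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): -10.0,
    ("overheated", "slow", "overheated"): 0.0,
    ("overheated", "fast", "overheated"): -1.0,
}
```

Later sketches reuse this dictionary style of representation.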
Markov Decision Processes (MDPs)
Characteristics:
MDPs exhibit the Markov property, meaning the future state depends only on the current state and action, and not on past states.
The objective in RL is to find a policy π(s) that specifies the best action for each state so as to maximize expected cumulative reward.
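To make the Markov property and the notion of a policy concrete, here is a minimal sketch (reusing the invented two-state MDP above) that represents a policy as a state-to-action mapping and samples each next state from T(s, a, ·) alone, with no dependence on earlier history:

```python
import random

# Hypothetical transition model from the earlier sketch.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "overheated": 0.5},
    ("overheated", "slow"): {"overheated": 1.0},
    ("overheated", "fast"): {"overheated": 1.0},
}

# A policy simply maps each state to an action.
policy = {"cool": "fast", "overheated": "slow"}

def rollout(start, steps=6):
    """Roll out a trajectory; each next state depends only on (s, a)."""
    s, trajectory = start, [start]
    for _ in range(steps):
        a = policy[s]
        next_states = list(T[(s, a)].keys())
        probs = list(T[(s, a)].values())
        s = random.choices(next_states, weights=probs)[0]
        trajectory.append(s)
    return trajectory

print(rollout("cool"))
```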
Reward Mechanism
Rewards:
Indicate immediate success or failure of an action taken in a given state.
Negative rewards act as penalties for inappropriate actions.
Rewards are discounted over time, meaning future rewards are worth less than immediate rewards.
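A short sketch of how discounting collapses a reward sequence into a single return, G = r_0 + γ·r_1 + γ²·r_2 + … (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over an observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# An immediate penalty followed by later rewards; the later ones count for less.
print(discounted_return([-1.0, 0.0, 5.0, 5.0]))
```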
Value and Utility Functions
Utility Function U(s): Measures the long-term benefit of being in a state.
Action-Utility Function Q(s,a): Represents how good it is to take action a in state s.
This is crucial for agents to learn behaviors that maximize long-term rewards.
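The two functions are linked by the standard Bellman relationships: Q(s,a) = Σ_{s'} T(s,a,s')[R(s,a,s') + γ·U(s')] and U(s) = max_a Q(s,a). A minimal value-iteration sketch over the invented two-state MDP from the earlier examples illustrates both, plus the greedy policy they induce:

```python
# Value iteration on the hypothetical two-state MDP used in earlier sketches.
STATES = ["cool", "overheated"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "overheated": 0.5},
    ("overheated", "slow"): {"overheated": 1.0},
    ("overheated", "fast"): {"overheated": 1.0},
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): -10.0,
    ("overheated", "slow", "overheated"): 0.0,
    ("overheated", "fast", "overheated"): -1.0,
}

def q_value(s, a, U):
    """Q(s,a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * U(s'))."""
    return sum(p * (R[(s, a, s2)] + GAMMA * U[s2]) for s2, p in T[(s, a)].items())

U = {s: 0.0 for s in STATES}
for _ in range(100):  # repeated Bellman updates converge to the true utilities
    U = {s: max(q_value(s, a, U) for a in ACTIONS) for s in STATES}

# Greedy policy: in each state, pick the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, U)) for s in STATES}
print(U, policy)
```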
Types of RL
Model-Based vs. Model-Free:
Model-Based RL: Learns the model of the environment (i.e., T and R).
Model-Free RL: The agent learns utility functions (U and Q) directly, without needing to estimate T and R.
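A sketch of the model-based idea: estimate T and R from logged transitions by counting (the experience tuples below are invented). A model-free agent skips this step and updates its utility estimates directly.

```python
from collections import defaultdict

# Hypothetical logged experience: (state, action, next_state, reward) tuples.
experience = [
    ("cool", "fast", "cool", 2.0),
    ("cool", "fast", "overheated", -10.0),
    ("cool", "fast", "cool", 2.0),
    ("overheated", "slow", "overheated", 0.0),
]

counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
reward_sum = defaultdict(float)                 # (s, a, s') -> summed reward

for s, a, s2, r in experience:
    counts[(s, a)][s2] += 1
    reward_sum[(s, a, s2)] += r

# Maximum-likelihood estimates of the transition model and reward function.
T_hat = {sa: {s2: c / sum(nexts.values()) for s2, c in nexts.items()}
         for sa, nexts in counts.items()}
R_hat = {sas: total / counts[sas[:2]][sas[2]] for sas, total in reward_sum.items()}

print(T_hat)
print(R_hat)
```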
Learning Approaches
Passive RL:
Assumes a fixed policy; the agent learns the expected utility of states by evaluating the outcomes it observes while following that policy (see the TD(0) sketch after this list).
Active RL:
The agent decides which actions to explore, balancing exploration and exploitation to optimize learned policies.
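A minimal sketch of passive learning with the temporal-difference (TD(0)) update: the agent follows a fixed policy and nudges its utility estimate toward each observed reward plus the discounted estimate of the next state. The environment below is a hand-written stand-in with made-up dynamics.

```python
import random

GAMMA, ALPHA = 0.9, 0.1
U = {"cool": 0.0, "overheated": 0.0}  # utility estimates being learned

def step(state):
    """Hypothetical environment under the fixed policy (fast in cool, slow otherwise)."""
    if state == "cool":
        return ("cool", 2.0) if random.random() < 0.5 else ("overheated", -10.0)
    return ("overheated", 0.0)

for _ in range(2000):          # many short episodes under the same fixed policy
    s = "cool"
    for _ in range(20):
        s2, r = step(s)
        # TD(0): move U(s) toward r + gamma * U(s') by a small step ALPHA.
        U[s] += ALPHA * (r + GAMMA * U[s2] - U[s])
        s = s2

print(U)
```

In the active setting, the agent would also choose which action to try at each step, for example with the ε-greedy rule described under Exploration Strategies below.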
Q-Learning and SARSA
Q-Learning:
A model-free, off-policy algorithm that updates the action-utility function Q using the reward received plus the discounted maximum estimated Q-value of the next state.
Optimizes the action-utility function directly, regardless of which policy generated the experience.
SARSA:
Similar to Q-learning, but on-policy: it updates toward the value of the next action actually taken, which leads to different exploration dynamics.
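The two update rules differ only in how the next state's value enters the target; a side-by-side sketch (α is the learning rate, and the state/action names in the usage lines are placeholders):

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1

def q_learning_update(Q, s, a, r, s2, actions):
    """Off-policy: the target uses the best action in s2, regardless of what is taken next."""
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy: the target uses the action a2 the agent actually takes in s2."""
    target = r + GAMMA * Q[(s2, a2)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

Q = defaultdict(float)
q_learning_update(Q, "cool", "fast", 2.0, "cool", ["slow", "fast"])
sarsa_update(Q, "cool", "fast", 2.0, "cool", "slow")
print(dict(Q))
```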
Exploration Strategies
Exploration vs. Exploitation:
The agent must explore sufficiently to discover the optimal actions while also exploiting known information to maximize rewards.
Common method: the ε-greedy strategy, in which the agent takes a random action with probability ε and the best-known (greedy) action with probability 1-ε.
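A small sketch of ε-greedy selection over a Q-table (the Q-values below are made up):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit

Q = {("cool", "slow"): 1.0, ("cool", "fast"): -2.0}
print(epsilon_greedy(Q, "cool", ["slow", "fast"]))
```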
Function Approximation
As state spaces grow, it becomes infeasible to maintain a table of Q-values.
Function Approximation: Uses methods like linear regression or neural networks to estimate Q-values across large state spaces.
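A sketch of the linear case: approximate Q(s,a) ≈ w · φ(s,a) and adjust the weights with a semi-gradient update toward the bootstrapped target (the feature function here is a made-up example for a 2-D state):

```python
import numpy as np

GAMMA, ALPHA = 0.9, 0.01
w = np.zeros(4)  # one weight per feature

def features(state, action):
    """Hypothetical hand-crafted features phi(s, a)."""
    x, y = state
    return np.array([1.0, x, y, 1.0 if action == "fast" else 0.0])

def q_approx(state, action):
    return float(np.dot(w, features(state, action)))

def semi_gradient_update(s, a, r, s2, actions):
    """Move w so that Q(s,a) approaches r + gamma * max_a' Q(s', a')."""
    global w
    target = r + GAMMA * max(q_approx(s2, a2) for a2 in actions)
    error = target - q_approx(s, a)
    w = w + ALPHA * error * features(s, a)

semi_gradient_update((0.5, 1.0), "fast", 2.0, (0.4, 0.9), ["slow", "fast"])
print(w)
```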
Deep Reinforcement Learning (DRL)
Deep Q-Networks (DQN): Integrates Q-learning with deep learning to allow an agent to learn directly from high-dimensional sensory data like images.
Key Innovations:
Experience Replay: Stores past transitions in a buffer and samples random minibatches from it, so the agent reuses previously encountered experience and decorrelates consecutive updates.
Target Networks: Stabilizes training by maintaining a separate network for producing target values during learning.
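A structural sketch of these two pieces, independent of any particular deep-learning library; the buffer capacity, batch size, and the dict-of-parameters stand-in for network weights are illustrative choices, not DQN's actual implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns random minibatches for training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def sync_target(online_params, target_params):
    """Hard update: copy the online network's parameters into the target network.
    Training targets are computed from the (less frequently updated) target copy."""
    target_params.clear()
    target_params.update(online_params)

buffer = ReplayBuffer()
buffer.add("s0", "a0", 1.0, "s1", False)
print(buffer.sample(1))
```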
Practical Applications
RL can be applied to various domains like game playing (e.g., Blackjack), robotics, and natural language processing.
Example: Blackjack can be modeled with reinforcement learning to optimize the player's strategy based on the current hand and the dealer's visible card.
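A closing sketch: tabular Q-learning on Blackjack, assuming the Gymnasium package and its Blackjack-v1 environment are available (observations there are tuples of player sum, dealer's showing card, and a usable-ace flag; actions are 0 = stick, 1 = hit). The episode count and learning parameters are arbitrary illustrative choices.

```python
import random
from collections import defaultdict

import gymnasium as gym  # assumes Gymnasium is installed

env = gym.make("Blackjack-v1")
Q = defaultdict(float)
GAMMA, ALPHA, EPSILON = 1.0, 0.05, 0.1
ACTIONS = [0, 1]  # 0 = stick, 1 = hit

def choose(state):
    """epsilon-greedy over the current Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for _ in range(50000):
    state, _ = env.reset()
    done = False
    while not done:
        action = choose(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Learned advice for one situation: player total 16 vs. dealer showing 10, no usable ace.
s = (16, 10, 0)
print("stick" if Q[(s, 0)] >= Q[(s, 1)] else "hit")
```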