Notes on Reinforcement Learning

Reinforcement Learning Overview

  • Definitions:

    • Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward.

  • Components of RL:

    • States (S): The set of situations in which the agent can find itself.

    • Actions (A): The decisions or moves made by the agent based on the current state.

    • Transition Model (T): Defines the probabilities of moving from one state to another given a specific action, denoted as T(s,a,s') = P(s'|s,a) .

    • Reward Function (R): Provides feedback on the outcome of an action in terms of immediate reward, represented as R(s,a,s') .

    • Discount Factor (γ): A factor between 0 and 1 that reduces the weight of future rewards, accounting for uncertainty over time.
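
A minimal sketch of how these components might be encoded for a toy problem (the state names, actions, and numbers below are made up purely for illustration):

```python
# Toy two-state MDP, purely illustrative: states, actions,
# transition model T, reward function R, and discount factor gamma.
states = ["cool", "overheated"]
actions = ["slow", "fast"]

# T[(s, a)] -> list of (next_state, probability); probabilities sum to 1.
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("overheated", 0.5)],
    ("overheated", "slow"): [("overheated", 1.0)],
    ("overheated", "fast"): [("overheated", 1.0)],
}

# R[(s, a, s')] -> immediate reward for that transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): -10.0,
    ("overheated", "slow", "overheated"): 0.0,
    ("overheated", "fast", "overheated"): -10.0,
}

gamma = 0.9  # discount factor
```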

Markov Decision Processes (MDPs)

  • Characteristics:

    • MDPs exhibit the Markov property, meaning the future state depends only on the current state and action, and not on past states.

    • The objective in RL is to find a policy π(s) that defines the best action for each state to maximize total expected reward.
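
When T and R are known, one standard way to compute such a policy is value iteration. A minimal sketch, reusing the toy states, actions, T, R, and gamma from the snippet above:

```python
def value_iteration(states, actions, T, R, gamma, iterations=100):
    # Utilities U(s), initialised to zero, refined with the Bellman update:
    # U(s) <- max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * U(s'))
    U = {s: 0.0 for s in states}

    def q_value(s, a):
        return sum(p * (R.get((s, a, s2), 0.0) + gamma * U[s2])
                   for s2, p in T[(s, a)])

    for _ in range(iterations):
        U = {s: max(q_value(s, a) for a in actions) for s in states}

    # Greedy policy: pick the action with the highest expected utility.
    pi = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
    return U, pi

U, pi = value_iteration(states, actions, T, R, gamma)
```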

Reward Mechanism

  • Rewards:

    • Indicate immediate success or failure of an action taken in a given state.

    • Negative rewards act as penalties for inappropriate actions.

    • Rewards are discounted over time, meaning future rewards are worth less than immediate rewards.
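
For example, with γ = 0.9 a reward sequence of 1, 1, 10 is worth 1 + 0.9*1 + 0.81*10 = 10.0 from the starting state. A tiny sketch of the calculation:

```python
def discounted_return(rewards, gamma=0.9):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 10]))  # approximately 10.0
```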

Value and Utility Functions

  • Utility Function U(s): Measures the long-term benefit (expected discounted reward) of being in state s; also written V(s).

  • Action-Utility Function Q(s,a): Represents how good it is to take action a in state s.

    • This is crucial for agents to learn behaviors that maximize long-term rewards; in particular, U(s) = max_a Q(s,a).
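
A minimal sketch of extracting U(s) and a greedy action from a Q-table (the table values here are made up for illustration):

```python
# Hypothetical Q-table: Q[(state, action)] -> estimated long-term value.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7,
     ("s1", "left"): -0.1, ("s1", "right"): 0.4}

def utility(Q, s, actions):
    # U(s) = max_a Q(s, a)
    return max(Q[(s, a)] for a in actions)

def greedy_action(Q, s, actions):
    # pi(s) = argmax_a Q(s, a)
    return max(actions, key=lambda a: Q[(s, a)])

print(utility(Q, "s0", ["left", "right"]))        # 0.7
print(greedy_action(Q, "s0", ["left", "right"]))  # "right"
```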

Types of RL

  • Model-Based vs. Model-Free:

    • Model-Based RL: Learns the model of the environment (i.e., T and R).

    • Model-Free RL: The agent learns the value functions (V and Q) directly without needing to understand T and R.

Learning Approaches

  • Passive RL:

    • Assumes a fixed policy and learns the expected utility of states under that policy (policy evaluation); a TD(0) sketch follows this list.

  • Active RL:

    • The agent decides which actions to explore, balancing exploration and exploitation to optimize learned policies.
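
For passive RL, a common model-free approach is temporal-difference (TD(0)) policy evaluation: follow the fixed policy and nudge U(s) toward the observed reward plus the discounted utility of the next state. A minimal sketch, where the environment interface env.reset()/env.step() and the fixed policy are assumptions, not part of the notes above:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    # U(s) estimates the utility of each state under the *fixed* policy.
    U = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)               # passive: the policy is given, not learned
            s2, r, done = env.step(a)   # assumed environment interface
            # TD(0) update: move U(s) toward the target r + gamma * U(s')
            U[s] += alpha * (r + gamma * U[s2] - U[s])
            s = s2
    return U
```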

Q-Learning and SARSA

  • Q-Learning:

    • A model-free reinforcement learning algorithm that updates the action-utility function Q toward the reward received plus the discounted maximum estimated value of the next state (an off-policy update).

    • Optimizes the action-value function directly.

  • SARSA:

    • Similar to Q-learning, but it is an on-policy algorithm: the update uses the action actually taken in the next state, which leads to different exploration dynamics. Both update rules are sketched below.
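
A minimal sketch of the two update rules over a Q-table keyed by (state, action); alpha is the learning rate. The only difference is the value used for the next state: Q-learning takes the max over actions (off-policy), while SARSA uses the action actually taken next (on-policy):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy target: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy target: r + gamma * Q(s', a'), where a' was actually taken
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```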

Exploration Strategies

  • Exploration vs. Exploitation:

    • The agent must explore sufficiently to discover the optimal actions while also exploiting known information to maximize rewards.

    • Common method: the ε-greedy strategy, where the agent takes a random action with probability ε and exploits the best-known action with probability 1-ε.
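
A minimal sketch of ε-greedy action selection over a Q-table keyed by (state, action):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```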

Function Approximation

  • As state spaces grow, it becomes infeasible to maintain a table of Q-values.

  • Function Approximation: Uses methods like linear regression or neural networks to estimate Q-values across large state spaces.
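
A minimal sketch of the linear case: Q(s,a) ≈ w · f(s,a) for some hand-designed feature function f (assumed here, not defined), with the weights adjusted by a semi-gradient update on the TD error:

```python
import numpy as np

def q_hat(w, features):
    # Q(s, a) approximated as a dot product of weights and features f(s, a).
    return np.dot(w, features)

def linear_q_update(w, features, r, next_features_per_action,
                    alpha=0.01, gamma=0.9):
    # features: f(s, a) as a NumPy array; next_features_per_action: one
    # feature vector per available action in the next state.
    target = r + gamma * max(q_hat(w, f2) for f2 in next_features_per_action)
    td_error = target - q_hat(w, features)
    return w + alpha * td_error * features
```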

Deep Reinforcement Learning (DRL)

  • Deep Q-Networks (DQN): Integrates Q-learning with deep learning to allow an agent to learn directly from high-dimensional sensory data like images.

  • Key Innovations:

    • Experience Replay: Stores past transitions in a buffer and samples random minibatches from it, so the agent learns efficiently from previously encountered states and breaks correlations between consecutive updates.

    • Target Networks: Stabilizes training by maintaining a separate, slowly updated copy of the Q-network that produces the target values during learning.
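
A minimal, framework-agnostic sketch of the two ideas; the Q-network itself and its optimizer are assumed to be defined elsewhere:

```python
import random
from collections import deque

class ReplayBuffer:
    # Experience replay: store past transitions and sample random minibatches.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# Target network idea: keep a frozen copy of the Q-network's parameters and
# refresh it only every N steps, so the regression targets
# r + gamma * max_a Q_target(s', a) change slowly during training.
def maybe_sync_target(step, online_params, target_params, sync_every=1_000):
    if step % sync_every == 0:
        target_params.update(online_params)  # assumes params stored as dicts
    return target_params
```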

Practical Applications

  • RL can be applied to various domains like game playing (e.g., Blackjack), robotics, and natural language processing.

    • Example: Blackjack can be modeled as an MDP, and reinforcement learning can optimize the player's strategy based on the current hand and the dealer's visible card.
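
A minimal sketch of tabular Q-learning on Blackjack, assuming the Gymnasium library and its Blackjack-v1 environment are available (observations are (player sum, dealer card, usable ace); actions are 0 = stick, 1 = hit):

```python
import random
from collections import defaultdict
import gymnasium as gym   # assumed dependency

env = gym.make("Blackjack-v1")
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 1.0, 0.1   # short episodes, so no discounting

for episode in range(50_000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy over the two actions (0 = stick, 1 = hit)
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = max(range(env.action_space.n), key=lambda x: Q[(s, x)])
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        best_next = 0.0 if done else max(Q[(s2, x)] for x in range(env.action_space.n))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```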