Notes on Reinforcement Learning
Reinforcement Learning Overview
Definitions:
Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward.
Components of RL:
States (S): Represents the different situations in which the agent can find itself.
Actions (A): The decisions or moves made by the agent based on the current state.
Transition Model (T): Defines the probabilities of moving from one state to another given a specific action, denoted as T(s,a,s') = P(s'|s,a) .
Reward Function (R): Provides feedback on the outcome of an action in terms of immediate reward, represented as R(s,a,s') .
Discount Factor (γ): A number between 0 and 1 that reduces the weight of future rewards relative to immediate ones, accounting for uncertainty over time.
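As a concrete illustration, these components can be written out for a tiny, invented two-state MDP (the state names, probabilities, and rewards below are made up for illustration):

```python
# A hypothetical two-state MDP, spelled out as plain Python data.
STATES = ["cool", "overheated"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor

# Transition model T(s, a, s') = P(s'|s, a): probabilities over next states.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "overheated": 0.5},
    ("overheated", "slow"): {"overheated": 1.0},
    ("overheated", "fast"): {"overheated": 1.0},
}

# Reward function R(s, a, s'): immediate feedback for each transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): -10.0,
    ("overheated", "slow", "overheated"): 0.0,
    ("overheated", "fast", "overheated"): -1.0,
}
```

Later sketches reuse this dictionary style of representation.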
Markov Decision Processes (MDPs)
Characteristics:
MDPs exhibit the Markov property, meaning the future state depends only on the current state and action, and not on past states.
The objective in RL is to find a policy π(s) that specifies the best action for each state so as to maximize expected cumulative reward.
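To make the Markov property and the notion of a policy concrete, here is a minimal sketch (reusing the invented two-state MDP above) that represents a policy as a state-to-action mapping and samples each next state from T(s, a, ·) alone, with no dependence on earlier history:

```python
import random

# Hypothetical transition model from the earlier sketch.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "overheated": 0.5},
    ("overheated", "slow"): {"overheated": 1.0},
    ("overheated", "fast"): {"overheated": 1.0},
}

# A policy simply maps each state to an action.
policy = {"cool": "fast", "overheated": "slow"}

def rollout(start, steps=6):
    """Roll out a trajectory; each next state depends only on (s, a)."""
    s, trajectory = start, [start]
    for _ in range(steps):
        a = policy[s]
        next_states = list(T[(s, a)].keys())
        probs = list(T[(s, a)].values())
        s = random.choices(next_states, weights=probs)[0]
        trajectory.append(s)
    return trajectory

print(rollout("cool"))
```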
Reward Mechanism
Rewards:
Indicate immediate success or failure of an action taken in a given state.
Negative rewards act as penalties for inappropriate actions.
Rewards are discounted over time, meaning future rewards are worth less than immediate rewards.
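A short sketch of how discounting collapses a reward sequence into a single return, G = r_0 + γ·r_1 + γ²·r_2 + … (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over an observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# An immediate penalty followed by later rewards; the later ones count for less.
print(discounted_return([-1.0, 0.0, 5.0, 5.0]))
```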
Value and Utility Functions
Utility Function U(s): Measures the long-term benefit of being in a state.
Action-Utility Function Q(s,a): Represents how good it is to take action a in state s.
This is crucial for agents to learn behaviors that maximize long-term rewards.
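The two functions are linked by the standard Bellman relationships: Q(s,a) = Σ_{s'} T(s,a,s')[R(s,a,s') + γ·U(s')] and U(s) = max_a Q(s,a). A minimal value-iteration sketch over the invented two-state MDP from the earlier examples illustrates both, plus the greedy policy they induce:

```python
# Value iteration on the hypothetical two-state MDP used in earlier sketches.
STATES = ["cool", "overheated"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "overheated": 0.5},
    ("overheated", "slow"): {"overheated": 1.0},
    ("overheated", "fast"): {"overheated": 1.0},
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): -10.0,
    ("overheated", "slow", "overheated"): 0.0,
    ("overheated", "fast", "overheated"): -1.0,
}

def q_value(s, a, U):
    """Q(s,a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * U(s'))."""
    return sum(p * (R[(s, a, s2)] + GAMMA * U[s2]) for s2, p in T[(s, a)].items())

U = {s: 0.0 for s in STATES}
for _ in range(100):  # repeated Bellman updates converge to the true utilities
    U = {s: max(q_value(s, a, U) for a in ACTIONS) for s in STATES}

# Greedy policy: in each state, pick the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, U)) for s in STATES}
print(U, policy)
```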
Types of RL
Model-Based vs. Model-Free:
Model-Based RL: Learns the model of the environment (i.e., T and R).
Model-Free RL: The agent learns utility functions (U and Q) directly, without needing to estimate T and R.
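A sketch of the model-based idea: estimate T and R from logged transitions by counting (the experience tuples below are invented). A model-free agent skips this step and updates its utility estimates directly.

```python
from collections import defaultdict

# Hypothetical logged experience: (state, action, next_state, reward) tuples.
experience = [
    ("cool", "fast", "cool", 2.0),
    ("cool", "fast", "overheated", -10.0),
    ("cool", "fast", "cool", 2.0),
    ("overheated", "slow", "overheated", 0.0),
]

counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
reward_sum = defaultdict(float)                 # (s, a, s') -> summed reward

for s, a, s2, r in experience:
    counts[(s, a)][s2] += 1
    reward_sum[(s, a, s2)] += r

# Maximum-likelihood estimates of the transition model and reward function.
T_hat = {sa: {s2: c / sum(nexts.values()) for s2, c in nexts.items()}
         for sa, nexts in counts.items()}
R_hat = {sas: total / counts[sas[:2]][sas[2]] for sas, total in reward_sum.items()}

print(T_hat)
print(R_hat)
```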
Learning Approaches
Passive RL:
Assumes a fixed policy; the agent learns the expected utility of states by evaluating the outcomes it observes while following that policy (see the TD(0) sketch after this list).
Active RL:
The agent decides which actions to explore, balancing exploration and exploitation to optimize learned policies.
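A minimal sketch of passive learning with the temporal-difference (TD(0)) update: the agent follows a fixed policy and nudges its utility estimate toward each observed reward plus the discounted estimate of the next state. The environment below is a hand-written stand-in with made-up dynamics.

```python
import random

GAMMA, ALPHA = 0.9, 0.1
U = {"cool": 0.0, "overheated": 0.0}  # utility estimates being learned

def step(state):
    """Hypothetical environment under the fixed policy (fast in cool, slow otherwise)."""
    if state == "cool":
        return ("cool", 2.0) if random.random() < 0.5 else ("overheated", -10.0)
    return ("overheated", 0.0)

for _ in range(2000):          # many short episodes under the same fixed policy
    s = "cool"
    for _ in range(20):
        s2, r = step(s)
        # TD(0): move U(s) toward r + gamma * U(s') by a small step ALPHA.
        U[s] += ALPHA * (r + GAMMA * U[s2] - U[s])
        s = s2

print(U)
```

In the active setting, the agent would also choose which action to try at each step, for example with the ε-greedy rule described under Exploration Strategies below.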
Q-Learning and SARSA
Q-Learning:
A model-free, off-policy algorithm that updates the action-utility function Q using the reward received plus the discounted maximum estimated Q-value of the next state.
Optimizes the action-utility function directly, regardless of which policy generated the experience.
SARSA:
Similar to Q-learning, but on-policy: it updates toward the value of the next action actually taken, which leads to different exploration dynamics.
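The two update rules differ only in how the next state's value enters the target; a side-by-side sketch (α is the learning rate, and the state/action names in the usage lines are placeholders):

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1

def q_learning_update(Q, s, a, r, s2, actions):
    """Off-policy: the target uses the best action in s2, regardless of what is taken next."""
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy: the target uses the action a2 the agent actually takes in s2."""
    target = r + GAMMA * Q[(s2, a2)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

Q = defaultdict(float)
q_learning_update(Q, "cool", "fast", 2.0, "cool", ["slow", "fast"])
sarsa_update(Q, "cool", "fast", 2.0, "cool", "slow")
print(dict(Q))
```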
Exploration Strategies
Exploration vs. Exploitation:
The agent must explore sufficiently to discover the optimal actions while also exploiting known information to maximize rewards.
Common method: the ε-greedy strategy, in which the agent takes a random action with probability ε and the best-known (greedy) action with probability 1-ε.
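A small sketch of ε-greedy selection over a Q-table (the Q-values below are made up):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit

Q = {("cool", "slow"): 1.0, ("cool", "fast"): -2.0}
print(epsilon_greedy(Q, "cool", ["slow", "fast"]))
```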
Function Approximation
As state spaces grow, it becomes infeasible to maintain a table of Q-values.
Function Approximation: Uses methods like linear regression or neural networks to estimate Q-values across large state spaces.
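A sketch of the linear case: approximate Q(s,a) ≈ w · φ(s,a) and adjust the weights with a semi-gradient update toward the bootstrapped target (the feature function here is a made-up example for a 2-D state):

```python
import numpy as np

GAMMA, ALPHA = 0.9, 0.01
w = np.zeros(4)  # one weight per feature

def features(state, action):
    """Hypothetical hand-crafted features phi(s, a)."""
    x, y = state
    return np.array([1.0, x, y, 1.0 if action == "fast" else 0.0])

def q_approx(state, action):
    return float(np.dot(w, features(state, action)))

def semi_gradient_update(s, a, r, s2, actions):
    """Move w so that Q(s,a) approaches r + gamma * max_a' Q(s', a')."""
    global w
    target = r + GAMMA * max(q_approx(s2, a2) for a2 in actions)
    error = target - q_approx(s, a)
    w = w + ALPHA * error * features(s, a)

semi_gradient_update((0.5, 1.0), "fast", 2.0, (0.4, 0.9), ["slow", "fast"])
print(w)
```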
Deep Reinforcement Learning (DRL)
Deep Q-Networks (DQN): Integrates Q-learning with deep learning to allow an agent to learn directly from high-dimensional sensory data like images.
Key Innovations:
Experience Replay: Stores past transitions in a buffer and samples random minibatches from it, so the agent reuses previously encountered experience and decorrelates consecutive updates.
Target Networks: Stabilizes training by maintaining a separate network for producing target values during learning.
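A structural sketch of these two pieces, independent of any particular deep-learning library; the buffer capacity, batch size, and the dict-of-parameters stand-in for network weights are illustrative choices, not DQN's actual implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns random minibatches for training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def sync_target(online_params, target_params):
    """Hard update: copy the online network's parameters into the target network.
    Training targets are computed from the (less frequently updated) target copy."""
    target_params.clear()
    target_params.update(online_params)

buffer = ReplayBuffer()
buffer.add("s0", "a0", 1.0, "s1", False)
print(buffer.sample(1))
```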
Practical Applications
RL can be applied to various domains like game playing (e.g., Blackjack), robotics, and natural language processing.
Example: Blackjack can be modeled with reinforcement learning to optimize the player's strategy based on the current hand and the dealer's visible card.
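A closing sketch: tabular Q-learning on Blackjack, assuming the Gymnasium package and its Blackjack-v1 environment are available (observations there are tuples of player sum, dealer's showing card, and a usable-ace flag; actions are 0 = stick, 1 = hit). The episode count and learning parameters are arbitrary illustrative choices.

```python
import random
from collections import defaultdict

import gymnasium as gym  # assumes Gymnasium is installed

env = gym.make("Blackjack-v1")
Q = defaultdict(float)
GAMMA, ALPHA, EPSILON = 1.0, 0.05, 0.1
ACTIONS = [0, 1]  # 0 = stick, 1 = hit

def choose(state):
    """epsilon-greedy over the current Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for _ in range(50000):
    state, _ = env.reset()
    done = False
    while not done:
        action = choose(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Learned advice for one situation: player total 16 vs. dealer showing 10, no usable ace.
s = (16, 10, 0)
print("stick" if Q[(s, 0)] >= Q[(s, 1)] else "hit")
```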