Reinforcement Learning

Big Picture Summary

  • An agent learns from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for its actions

Formal Definition

  • Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

The RL Process

  • The agent receives the state from the environment

  • Based on that state, the agent takes an action

  • The environment transitions to a new state

  • The environment gives some reward to the agent

  • The agent's goal is to maximize its cumulative reward, called the expected return
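The loop above can be sketched in plain Python. The `ToyEnv` environment, its states, and its reward values are all made up for illustration:

```python
import random

class ToyEnv:
    """A made-up environment for illustration only: reach state 3 to finish."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 advances the state; action 0 stays put.
        self.state = min(self.state + action, 3)
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
cumulative_reward = 0.0
done = False
while not done:
    action = random.choice([0, 1])            # the agent picks an action from the state
    state, reward, done = env.step(action)    # the environment returns a new state and reward
    cumulative_reward += reward
```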

Why is the goal of the agent to maximize the expected return?

  • Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).

Markov Property

  • The Markov Property implies that our agent needs only the current state to decide what action to take, not the history of all the states and actions it took before

Observation vs State

  • State s: a complete description of the state of the world (there is no hidden information); this is what the agent gets in a fully observed environment

  • Observation o: a partial description of the state; this is what the agent gets in a partially observed environment

Action Space

  • Set of all possible actions in an environment

  • Discrete space: number of actions is finite (this will be our blue agent)

  • Continuous space: the number of actions is infinite
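The distinction can be sketched in plain Python; the action names below are hypothetical:

```python
import random

# Discrete action space: a finite set of actions
# (the action names here are hypothetical, e.g. for a blue agent).
discrete_actions = ["monitor", "analyse", "remove", "restore"]
action = random.choice(discrete_actions)

# Continuous action space: an action is a real number drawn from a range,
# so the number of possible actions is infinite (e.g. a steering angle).
continuous_action = random.uniform(-1.0, 1.0)
```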

Rewards and Discounting

  • Rewards that come sooner are more likely to happen, since they are more predictable than long-term future rewards; we therefore discount future rewards, so the agent weighs immediate rewards more heavily while still accounting for long-term reward

  • To discount the rewards the following steps must be taken:

    • We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99

    • Then, each reward will be discounted by gamma raised to the power of the time step. As the time step increases, the future reward becomes less and less likely to happen

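The discounting steps above can be written out directly; the reward sequence here is made up:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum each reward discounted by gamma raised to its time step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical reward sequence: later rewards count for less.
rewards = [1.0, 1.0, 1.0, 1.0]
g = discounted_return(rewards, gamma=0.95)  # 1 + 0.95 + 0.95**2 + 0.95**3
```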

Types of tasks

  • Episodic: starting point and an ending point (a terminal state)

  • Continuing: tasks that continue forever (no terminal state)

Exploration vs Exploitation trade-off

  • Exploration is exploring the environment by trying random actions in order to find more information about the environment

  • Exploitation is exploiting known information to maximize the reward

  • In other terms, the agent might keep exploiting a known action that yields a small reward, when exploring further could reveal a larger reward elsewhere in the environment
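A common way to balance the two is an epsilon-greedy rule: explore with probability epsilon, otherwise exploit the best-known action. The action-value estimates below are invented:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # explore: random action
    return max(action_values, key=action_values.get)     # exploit: best estimate

# Made-up value estimates for two actions.
q = {"left": 0.2, "right": 0.7}
action = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 means pure exploitation
```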

Two main approaches for solving RL problems

  • Two ways of finding the optimal policy:

    • Directly, by teaching the agent which action to take given the current state = Policy-Based Methods

    • Indirectly, by teaching the agent to learn which state is more valuable and then take the action that leads to the more valuable states = Value-Based Methods

  • The Policy is the brain of our Agent, it’s the function that tells us what action to take given the state we are in

  • The policy is the function we want to learn; the goal of training is to find the optimal policy

1. Policy-Based Methods

  • Learn the policy function directly

    • define a probability distribution over the set of possible actions at that state

  • Two types of policies:

    • Deterministic: a policy at a given state will always return the same action

    • Stochastic: outputs a probability distribution over actions
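The two policy types can be sketched as plain functions; the action names and probabilities are invented:

```python
import random

# Deterministic policy: the same state always maps to the same action.
def deterministic_policy(state):
    return "right" if state >= 0 else "left"

# Stochastic policy: outputs a probability distribution over actions and
# samples from it (the probabilities here are invented).
def stochastic_policy(state):
    probs = {"left": 0.3, "right": 0.7}
    actions = list(probs)
    weights = list(probs.values())
    return random.choices(actions, weights=weights)[0]
```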

2. Value-Based Methods

  • learn a value function that maps a state to the expected value of being at that state

  • The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts by going to the state with the highest value

  • This is most likely the method we will choose for the CAGE challenge
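A minimal sketch of acting greedily with respect to a value function; the states, transitions, and values below are all made up:

```python
# Hypothetical value function: expected discounted return from each state.
state_values = {"A": 0.1, "B": 0.5, "C": 0.9}

# Hypothetical transitions from the current state: action -> successor state.
transitions = {"up": "B", "down": "C"}

def greedy_action(transitions, state_values):
    """Pick the action whose successor state has the highest value."""
    return max(transitions, key=lambda a: state_values[transitions[a]])

best = greedy_action(transitions, state_values)  # the action leading to state "C"
```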

The “Deep” in Reinforcement Learning

  • Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems