Reinforcement Learning

Big Picture Summary

An agent learns from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as their only feedback.

The agent receives the state from the environment.
Based on that state, the agent takes an action.
The environment moves to a new state.
The environment gives a reward to the agent.
The agent wants to maximise its expected cumulative reward, called the expected return.
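The loop above can be sketched in a few lines of Python. This is only an illustration: `CorridorEnv`, its reward values, and the two policies are hypothetical names made up for this example, not part of any library or of the CAGE challenge.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length, self.position + move))
        done = self.position == self.length
        # assumed reward scheme: small penalty per step, bonus at the goal
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the state -> action -> new state -> reward loop once."""
    state = env.reset()                          # agent receives the state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent takes an action
        state, reward, done = env.step(action)   # new state and reward
        total_reward += reward                   # accumulate the reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([0, 1])


def always_right(state):
    return 1


print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

The random policy usually collects less reward than `always_right`, which is the gap that learning is meant to close.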

This is because RL is based on the reward hypothesis, which states that all goals can be described as the maximization of the expected return (expected cumulative reward).

The Markov property implies that our agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.

State s: a complete description of the state of the world (there is no hidden information). This is what the agent receives in a fully observed environment.
Observation o: a partial description of the state. This is what the agent receives in a partially observed environment.
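To make the difference concrete, here is a small sketch. The corridor layout, the `observe` helper, and the `view_radius` parameter are all hypothetical and chosen only for illustration.

```python
def observe(state, view_radius=1):
    """Return only the cells within view_radius of the agent (a partial view)."""
    agent_index = state.index("A")
    start = max(0, agent_index - view_radius)
    end = min(len(state), agent_index + view_radius + 1)
    return state[start:end]


# Full state: the whole corridor, with the agent "A" and the goal "G" both visible.
full_state = ["A", ".", ".", ".", "G"]

# Observation: only the cells next to the agent, so the goal is hidden.
observation = observe(full_state)
print(full_state)
print(observation)
```

With the full state the agent knows where the goal is; with the observation it has to act without that information.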

Rewards that come sooner are more likely to happen, since they are more predictable than rewards far in the future. We therefore discount later rewards. How strongly depends on the discount rate: the larger it is, the more the agent prioritizes the long-term reward.
To discount the rewards, the following steps are taken:
- We define a discount rate called gamma. It must be between 0 and 1, and most of the time it is between 0.95 and 0.99
- Then, each reward is discounted by gamma to the exponent of the time step. As the time step increases, the future reward is less and less likely to happen (in the running example, the cat gets closer to us), so it counts for less
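The discounting steps above amount to a weighted sum. A minimal sketch, assuming the first reward of the episode is at time step 0 (so it is not discounted); the function name `discounted_return` is made up for this example.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over the rewards of one episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```

With gamma close to 1 the later rewards keep almost all of their weight, so the agent cares about the long term; with a small gamma it mostly cares about the next few rewards.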

Exploration is exploring the environment by trying random actions in order to find more information about the environment.
Exploitation is exploiting known information to maximize the reward.
In other words, an agent that only exploits may settle for a small reward it already knows how to get, whereas exploring further could reveal a larger reward somewhere else in the environment. This is the exploration/exploitation trade-off.
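One common way to balance the two is an epsilon-greedy rule. The notes above do not name it, so treat this as an illustrative assumption rather than the chosen method; `estimated_rewards` is a hypothetical list of the agent's current reward estimates, one per action.

```python
import random


def epsilon_greedy(estimated_rewards, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        # explore: try any action, regardless of what we currently believe
        return rng.randrange(len(estimated_rewards))
    # exploit: index of the highest estimated reward
    return max(range(len(estimated_rewards)), key=lambda a: estimated_rewards[a])


estimates = [0.1, 0.9, 0.3]
print(epsilon_greedy(estimates, epsilon=0.0))  # always exploits: action 1
print(epsilon_greedy(estimates, epsilon=1.0))  # always explores: any of 0, 1, 2
```

A small epsilon means mostly exploiting with occasional exploration; many setups start with a large epsilon and reduce it over time.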

Two ways of finding the optimal policy:
- Directly, by teaching the agent which action to take given the current state = Policy-Based Methods
- Indirectly, by teaching the agent which states are more valuable, so that it takes the action that leads to the more valuable states = Value-Based Methods

The policy is the brain of our agent: it’s the function that tells us what action to take given the state we are in.
The policy is the function we want to learn, so the goal is to find the optimal policy, the one that maximizes the expected return.

Policy-Based Methods: learn the policy function directly
- The function maps each state to the best action, or to a probability distribution over the set of possible actions at that state
Two types of policies:
- Deterministic: the policy always returns the same action for a given state
- Stochastic: the policy outputs a probability distribution over actions, and the action is sampled from it
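The two policy types can be sketched as lookup tables. The states, actions, and probabilities below are hypothetical and hand-written; a learned policy would produce them from training instead.

```python
import random

# Deterministic: each state maps to exactly one action.
DETERMINISTIC_POLICY = {"low_battery": "recharge", "full_battery": "search"}

# Stochastic: each state maps to a probability distribution over actions.
STOCHASTIC_POLICY = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "full_battery": {"recharge": 0.2, "search": 0.8},
}


def act_deterministic(state):
    """Always returns the same action for the same state."""
    return DETERMINISTIC_POLICY[state]


def act_stochastic(state, rng=random):
    """Samples an action from the distribution for this state."""
    distribution = STOCHASTIC_POLICY[state]
    actions = list(distribution)
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]


print(act_deterministic("low_battery"))  # recharge
print(act_stochastic("low_battery"))     # usually "recharge", sometimes "search"
```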

Value-Based Methods: learn a value function that maps a state to the expected value of being at that state
The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to the policy, which here means going to the state with the highest value
This is most likely the method we will choose for the CAGE challenge
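A rough sketch of how a value-based agent acts once it has state values. The values and transitions here are hypothetical and hand-written, and the sketch assumes the agent knows which state each action leads to; in practice the values are learned.

```python
# Hypothetical value of each state (higher is better).
STATE_VALUES = {"A": 0.2, "B": 0.5, "C": 0.9, "goal": 1.0}

# Hypothetical, deterministic transitions: state -> {action: next state}.
NEXT_STATE = {
    "A": {"left": "A", "right": "B"},
    "B": {"left": "A", "right": "C"},
    "C": {"left": "B", "right": "goal"},
}


def greedy_action(state):
    """Choose the action whose next state has the highest value."""
    options = NEXT_STATE[state]
    return max(options, key=lambda action: STATE_VALUES[options[action]])


def follow_values(start, max_steps=10):
    """Repeatedly move to the most valuable neighbouring state."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state not in NEXT_STATE:  # terminal state, nothing left to do
            break
        state = NEXT_STATE[state][greedy_action(state)]
        path.append(state)
    return path


print(follow_values("A"))  # ['A', 'B', 'C', 'goal']
```

Note that no policy is stored explicitly: the behaviour comes from the value function plus the rule "go to the best next state".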

Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems