Do unsupervised and supervised learning encompass all types of machine learning?
No. Supervised learning learns from labeled examples and unsupervised learning finds the underlying structure of a dataset, whereas reinforcement learning is reward driven. It is its own paradigm.
Exploration vs exploitation.
An agent must exploit previous experience in order to effectively seek reward. But in order to have had those experiences to begin with, it must have explored new experiences.
Is our daily life guided by reinforcement learning?
Yes, almost every aspect, even the mundane. How I choose to drive to work is a simple example: I have observed from past experience that the left lane moves faster than the others, except in very narrow circumstances that have me temporarily move into another lane.
Policy
A learning agent's way of behaving at a given time. A policy may assign a probability to each action.
Reward signal
The goal that the agent is seeking at a given time. At each step of the process the agent receives some amount of reward and the overarching goal is to maximize the total reward.
Value
The ____ of a state is roughly the total reward the agent expects to accumulate over the long term, starting from that state.
Evolutionary/genetic algorithms
Do not estimate value functions. Instead, they try random static policies and see which generate the most reward; the best of these are reused in the next lifecycle, and the process repeats for hundreds of generations.
S_t and S_{t+1}
The state at the current time and the state in the next tick
Greek letter alpha α
The step size parameter
Symmetries
It is possible due to the nature of some games that some positions are identical. This is evident in tic tac toe but also possible in more complex scenarios and can decrease the training load if they are recognized.
Greedy
Selecting whichever move has the highest estimated value in each state.
Neurodynamic programming
Using dynamic programming in combination with artificial neural networks
Reinforcer
The response to a stimulus that results in the strengthening of one behavior and possible weakening of another.
Temporal difference
Model-free RL method in which learning bootstraps from the current estimate of the value function.
From a code perspective, you would want an algorithm that makes the greedy choice almost every time but makes a random move a few percent of the time so that it can explore. Through these explorative moves, the algorithm eventually learns which of the random moves turned out to be the true best (greedy) moves.
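The idea above can be sketched as an epsilon-greedy loop on a multi-armed bandit. This is a minimal illustration, not a definitive implementation: the function names, the Gaussian reward noise, and the default of epsilon = 0.1 are all assumptions made for the example.

```python
import random

def epsilon_greedy_action(estimates, epsilon, rng=random):
    """Pick a random arm with probability epsilon, otherwise the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explorative move
    # Greedy move: index of the highest estimate (ties go to the lowest index).
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate a bandit whose rewards are Gaussian around true_means (assumed)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        a = epsilon_greedy_action(estimates, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        # Sample-average update: step size 1/n for the arm just pulled.
        estimates[a] += (reward - estimates[a]) / counts[a]
    return estimates, counts
```

Calling `run_bandit([0.0, 3.0, 1.0])` returns the learned estimates and how often each arm was pulled; with a small epsilon most pulls should end up on the arm with the highest true mean.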
A new problem is introduced: a shifting (nonstationary) environment. The outputs are not static; they can change at various points, and because the patterns keep changing, a system that weights all past rewards equally stops learning anything useful. In this case we replace the 1/n step size with a constant alpha, so that rewards are deprioritized as they get older.
Two rules, which are that the step size must be large enough to move past any noise at the beginning and
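To make the difference between the two step sizes concrete, here is a small sketch comparing a 1/n update with a constant-alpha update on a reward stream that changes halfway through. The reward stream, the function names, and alpha = 0.1 are assumptions chosen for illustration.

```python
def sample_average_update(estimate, reward, n):
    """1/n step size: every past reward carries equal weight."""
    return estimate + (reward - estimate) / n

def constant_step_update(estimate, reward, alpha=0.1):
    """Constant alpha: the weight on a reward shrinks by (1 - alpha) per newer reward."""
    return estimate + alpha * (reward - estimate)

# Hypothetical reward stream whose level jumps from 0 to 1 halfway through.
rewards = [0.0] * 100 + [1.0] * 100

avg_est, const_est = 0.0, 0.0
for n, r in enumerate(rewards, start=1):
    avg_est = sample_average_update(avg_est, r, n)
    const_est = constant_step_update(const_est, r)

print(f"1/n estimate:   {avg_est:.3f}")
print(f"alpha estimate: {const_est:.3f}")
```

The 1/n estimate ends near the overall average of the stream, while the constant-alpha estimate ends close to the most recent reward level, which is the behavior you want when the environment shifts.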
Optimistic initial values
Start the value estimates higher than you realistically expect the rewards to be. In the beginning the agent tries all the levers to see if any of them are good, but once the estimates settle it's like a greedy algorithm.
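A rough sketch of this effect, assuming deterministic rewards per arm and a purely greedy agent with a constant step size (both simplifications for the example; the function name is hypothetical):

```python
def greedy_with_initial_values(rewards_per_arm, initial_value, steps, alpha=0.1):
    """Run a purely greedy agent and return how often each arm was pulled."""
    estimates = [initial_value] * len(rewards_per_arm)
    pulls = [0] * len(rewards_per_arm)
    for _ in range(steps):
        # Always take the arm with the highest current estimate.
        a = max(range(len(estimates)), key=lambda i: estimates[i])
        pulls[a] += 1
        # Each pull drags the optimistic estimate toward the arm's real reward.
        estimates[a] += alpha * (rewards_per_arm[a] - estimates[a])
    return pulls

arm_rewards = [0.2, 0.8, 0.5]  # assumed fixed payoffs, for illustration only
pessimistic = greedy_with_initial_values(arm_rewards, initial_value=0.0, steps=200)
optimistic = greedy_with_initial_values(arm_rewards, initial_value=5.0, steps=200)
```

With an initial value of 0.0 the greedy agent locks onto the first arm it tries. With an initial value of 5.0 every arm looks disappointing at first, so all arms get sampled before the agent settles on the best one.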
Upper confidence bound action selection
Adds a bonus to the estimated reward of an action that has rarely been tried, and the bonus grows over time. Once in a while, trying that action will eventually look better than the greedy choice.
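One common way to write this selection rule adds a bonus of c * sqrt(ln t / N(a)) to each estimate, where N(a) is the number of times arm a has been pulled. The sketch below assumes that form; the function name and the default c = 2.0 are illustrative choices.

```python
import math

def ucb_action(estimates, counts, t, c=2.0):
    """Pick the arm with the highest estimate plus exploration bonus.

    estimates: current value estimate per arm
    counts:    number of pulls per arm
    t:         current time step (1-based)
    c:         exploration strength (assumed default)
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a  # untried arms are treated as maximally uncertain
    return max(
        range(len(estimates)),
        key=lambda a: estimates[a] + c * math.sqrt(math.log(t) / counts[a]),
    )
```

For example, `ucb_action([1.0, 0.9], [1000, 1], t=1001)` picks arm 1: its estimate is slightly lower, but it has been tried so rarely that its bonus outweighs the difference.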
Gradient bandit algorithms
Introduces a numerical preference for each action. Compares the average of all past rewards (a baseline) with the reward just received.
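A minimal sketch of one preference update, assuming the usual softmax form for turning preferences into probabilities. The function names and alpha = 0.1 are assumptions; the baseline is passed in by the caller and would typically be the running average of past rewards.

```python
import math

def softmax(preferences):
    """Convert preferences into action probabilities."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_update(preferences, action, reward, baseline, alpha=0.1):
    """Return new preferences after receiving `reward` for `action`."""
    probs = softmax(preferences)
    advantage = reward - baseline  # just-received reward vs. average so far
    new_preferences = []
    for a, h in enumerate(preferences):
        if a == action:
            # Chosen action: raise its preference if the reward beat the baseline.
            new_preferences.append(h + alpha * advantage * (1 - probs[a]))
        else:
            # Other actions move in the opposite direction.
            new_preferences.append(h - alpha * advantage * probs[a])
    return new_preferences
```

If the reward is above the baseline, the chosen action becomes more likely and the rest less likely; if it is below the baseline, the opposite happens.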
Bayesian methods and probability trees could be used to find the absolute perfect action, but this is computationally infeasible to process, even for just one step.
Performance for these algorithms
When tuned appropriately, these are all not that far apart in terms of performance.
How do you get the benefits of both epsilon = 0 and epsilon > 0?
Start with epsilon > 0 and as time progresses, trim down epsilon so that after all of the arms have been tried, the most greedy approach is known and chosen.
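One possible way to trim epsilon down is a linear schedule, sketched below. The function name, the linear shape, and the default values are assumptions; other decay shapes (for example exponential) would serve the same purpose.

```python
def decayed_epsilon(step, start=1.0, end=0.0, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```

Early steps explore heavily (`decayed_epsilon(0)` is 1.0), and after `decay_steps` the agent acts greedily (`decayed_epsilon(1000)` is 0.0).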
Do we always want to converge?
Only if the problem is stationary. For nonstationary problems, converging means we stop learning from the environment as it changes.
costly exploration
every action which is not the greedy action will have a cost associated with it
regret
expected loss over n actions that results from picking suboptimally, typically in the beginning
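As a small worked example, regret can be computed as the gap between the best arm's true mean and the true mean of each arm actually pulled, summed over all pulls. The function name and the example numbers are hypothetical, and this assumes the true means are known, which is only the case in a simulation.

```python
def total_regret(true_means, actions):
    """Sum of (best true mean - true mean of the chosen arm) over all actions."""
    best = max(true_means)
    return sum(best - true_means[a] for a in actions)

# Hypothetical true means for three arms, and a sequence of arms pulled.
means = [0.2, 1.0, 0.5]
print(total_regret(means, [0, 2, 1, 1, 1]))
```

Only the first two pulls in this example contribute regret; once the agent pulls the best arm, the regret stops growing.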
free exploration
every action is free until time n, and after n you act greedily. Can be difficult if the best action requires many pulls to reveal itself as the best.
Q*(a)
The true value of arm a if you pulled the lever an infinite number of times.
Q_hat(a)
The agent's estimate of Q*(a), which it updates to track Q*(a).