Chapter 1 & 2 RL

Last updated 5:50 PM on 6/7/26

30 Terms

New cards

Do supervised and unsupervised learning encompass all types of machine learning?

No. Supervised learning learns from labeled examples, and unsupervised learning finds the underlying structure of an unlabeled dataset. Reinforcement learning is reward driven instead, so it is its own paradigm.


Exploration vs exploitation.

An agent must exploit previous experience in order to effectively seek reward. But in order to have had those experiences to begin with, it must have explored new experiences.


Is our daily life guided by reinforcement learning?

Yes, almost every aspect, even the mundane. How I choose to drive to work is a simple example: I have observed from past experience that the left lane moves faster than the others, except in very narrow circumstances where I temporarily move into another lane.


Policy

A learning agent's way of behaving at a given time. A policy may be stochastic, assigning a probability to each action.


Reward signal

Defines the goal the agent is seeking. At each step of the process the agent receives some amount of reward, and the overarching goal is to maximize the total reward over the long run.


Value

The ____ of a state is roughly the total amount of reward the agent expects to accumulate over the long run, starting from that state.


Evolutionary/genetic algorithms

Do not estimate value functions. Instead, they try many static policies, see which generate the most reward, and carry the best of these, with random variations, into the next generation. This is repeated for hundreds of generations.


S_t and S_t+1

The state at the current time step and the state at the next time step.


Greek letter alpha α

The step size parameter


Symmetries

In some games, positions that look different are actually equivalent, for example up to rotation or reflection. This is evident in tic-tac-toe but also possible in more complex scenarios, and recognizing such positions can decrease the training load.


Greedy

Selecting, in each state, whichever move has the highest estimated value.


Neurodynamic programming

Using dynamic programming in combination with artificial neural networks


Reinforcer

A stimulus that, when it follows a behavior, strengthens that behavior and possibly weakens another.


Temporal difference

Model-free RL method that learns by bootstrapping from the current estimate of the value function: each state's estimate is moved toward the estimate for the state that follows it.
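As a rough illustration, the simplest form of this update can be written with the S_t, S_{t+1} and α symbols from the cards above. V is an assumed name for the agent's value estimate, and this reward-free form is the kind used for a game like tic-tac-toe, where the outcome only arrives at the end:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ V(S_{t+1}) - V(S_t) \bigr]
```

The bracketed term is the temporal difference: the gap between two estimates made at successive time steps.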


Thinking about this from a code perspective: you would want an algorithm that makes the greedy choice almost every time but makes a random, exploratory move a few percent of the time. Over time, these exploratory moves reveal which of the moves that first looked random are actually the true best greedy moves.
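A minimal sketch of that idea, usually called epsilon-greedy selection. The function name, the `q_estimates` list of per-action value estimates, and the default `epsilon` of 0.05 are assumptions chosen for illustration:

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.05, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Exploratory move: ignore the estimates and pick any action.
        return rng.randrange(len(q_estimates))
    best = max(q_estimates)
    # Greedy move: break ties among equally good actions at random.
    candidates = [a for a, q in enumerate(q_estimates) if q == best]
    return rng.choice(candidates)

# Example: mostly returns 1, occasionally a random index.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.05)
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable.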


A new problem is introduced: a shifting (nonstationary) environment, where the reward behind each action is not static and can change at various points. A system that weights all past rewards equally then stops learning anything useful, because the patterns keep changing. In this case we replace the 1/n step size with a constant alpha, so that rewards are deprioritized as they get older.
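The difference between the two step sizes can be sketched as follows. The function names, the alpha of 0.1, and the reward sequence (an arm whose payout jumps from 0.0 to 1.0 halfway through) are all made up for illustration:

```python
def sample_average_update(q, reward, n):
    """Step size 1/n: every reward seen so far gets equal weight."""
    return q + (reward - q) / n

def constant_step_update(q, reward, alpha=0.1):
    """Constant step size alpha: recent rewards outweigh older ones."""
    return q + alpha * (reward - q)

# Hypothetical arm whose payout shifts from 0.0 to 1.0 halfway through.
rewards = [0.0] * 100 + [1.0] * 100

q_avg, q_const = 0.0, 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = sample_average_update(q_avg, r, n)
    q_const = constant_step_update(q_const, r)

print(f"sample average: {q_avg:.3f}")
print(f"constant alpha: {q_const:.3f}")
```

After the shift, the sample average is still held back by the old rewards and ends near 0.5, while the constant-alpha estimate ends close to the new payout of 1.0.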


Two rules apply to the step size: it must be large enough to move past any noise and initial conditions at the beginning, and it must eventually become small enough for the estimates to converge.


Optimistic initial values

Start the value estimates higher than any reward you realistically expect. In the beginning the agent tries all the levers, because each one disappoints relative to its inflated estimate, but once the estimates settle it behaves like a greedy algorithm.
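A small sketch of why this works, under simplifying assumptions: the bandit here is deterministic (each arm always pays the same amount), selection is purely greedy, and the arm payouts, `alpha`, and initial values are hypothetical:

```python
def greedy_run(true_rewards, initial_value, steps, alpha=0.1):
    """Run pure greedy selection and return how often each arm was pulled."""
    q = [initial_value] * len(true_rewards)
    pulls = [0] * len(true_rewards)
    for _ in range(steps):
        # Greedy choice; ties go to the lowest index.
        a = max(range(len(q)), key=lambda i: q[i])
        # Move the chosen arm's estimate toward the reward it paid.
        q[a] += alpha * (true_rewards[a] - q[a])
        pulls[a] += 1
    return pulls

arms = [0.2, 0.5, 0.8]  # hypothetical fixed payouts

realistic = greedy_run(arms, initial_value=0.0, steps=200)
optimistic = greedy_run(arms, initial_value=5.0, steps=200)
print("initial value 0.0:", realistic)
print("initial value 5.0:", optimistic)
```

With an initial value of 0.0 the agent locks onto the first arm and never tries the others. With an initial value of 5.0 every arm gets pulled, and the best arm ends up with the most pulls.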


Upper confidence bound action selection

Adds a bonus to each action's estimated value, and the bonus grows the longer the action goes untried. Eventually the bonus makes a rarely tried action look better than the greedy one, so the new thing gets tried.
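One common way to write this selection rule scores each action as its estimate plus `c * sqrt(ln t / N(a))`, where `t` is the current step and `N(a)` is how often action `a` has been taken. The sketch below assumes that form; the function name, argument names, and the default `c` of 2.0 are illustrative:

```python
import math

def ucb_select(q_estimates, counts, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln t / N(a))."""
    # An action that has never been tried is treated as maximally uncertain.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_estimates, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# Arm 1 has a lower estimate but has only been pulled once, so its bonus wins.
choice = ucb_select([0.5, 0.4], counts=[100, 1], t=101)
```

A larger `c` makes the rule explore more, and a smaller `c` makes it behave more like plain greedy selection.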


Gradient bandit algorithms

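Introduces a numerical preference for each action instead of a value estimate. Each update compares the reward just received against a baseline, the average of all past rewards, and raises or lowers the preference accordingly.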
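A sketch of one preference update, assuming the usual softmax formulation in which preferences are turned into action probabilities. The function names and the `alpha` of 0.1 are illustrative, and `baseline` stands for the average of past rewards:

```python
import math

def softmax(preferences):
    """Convert preferences into action probabilities."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_update(preferences, action, reward, baseline, alpha=0.1):
    """Return new preferences after receiving `reward` for `action`."""
    probs = softmax(preferences)
    updated = []
    for a, (h, p) in enumerate(zip(preferences, probs)):
        indicator = 1.0 if a == action else 0.0
        # Better-than-baseline reward raises the chosen action's preference
        # and lowers the others; worse-than-baseline does the opposite.
        updated.append(h + alpha * (reward - baseline) * (indicator - p))
    return updated

new_prefs = gradient_bandit_update([0.0, 0.0, 0.0], action=1,
                                   reward=1.0, baseline=0.0)
```

Only the differences between preferences matter, which is why the updates across all actions sum to zero.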


In principle, Bayesian methods and probability trees could be used to find the absolute perfect action. In practice the tree of possibilities grows so quickly that computing it exactly is generally infeasible.


Performance for these algorithms

When tuned appropriately, these are all not that far apart in terms of performance.


How do you get the benefits of both epsilon = 0 and epsilon > 0?

Start with epsilon > 0 and reduce it as time progresses, so that once all of the arms have been tried enough, the best arm is known and chosen greedily.
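One possible schedule, as a sketch only: the function name, the `1 / (1 + decay * step)` shape, and the default parameters are hypothetical choices, and other decay curves would serve the same purpose:

```python
def decayed_epsilon(step, start=1.0, floor=0.0, decay=0.01):
    """Hypothetical schedule: epsilon shrinks as start / (1 + decay * step)."""
    return max(floor, start / (1.0 + decay * step))

# Explore heavily at first, then become nearly greedy.
for step in (0, 100, 1000, 10000):
    print(step, round(decayed_epsilon(step), 4))
```

Setting `floor` above zero keeps a little exploration forever, which may be preferable when the environment can change.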


Do we always want to converge?

Only if the problem is stationary. For nonstationary problems, converging means the estimates stop tracking the environment, so we never learn enough about how it has changed.


Costly exploration

Every action that is not the greedy action has a cost associated with it.


Regret

The expected loss over n actions that results from picking suboptimal arms, especially in the beginning, compared with always picking the best arm.
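Written out with the Q*(a) notation used in this set, one standard way to express this is shown below. A_t is an assumed symbol for the arm chosen at step t:

```latex
\mathrm{Regret}(n) = n \max_{a} Q^*(a) - \mathbb{E}\!\left[ \sum_{t=1}^{n} Q^*(A_t) \right]
```

The first term is what always pulling the best arm would earn in expectation, and the second is what the agent's actual choices earn.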


Free exploration

Every action is free until time n, and after n you act greedily. This can be difficult if the best action requires many pulls to reveal itself as the best.


Q*(a)

The true value of arm a: the average reward you would get if you pulled that lever an infinite number of times.


Q_hat(a)

The agent's estimate of Q*(a), which it keeps updating in order to track Q*(a).
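For a stationary arm, the simplest such estimate is the average of the rewards seen so far. In the sketch below, R_i(a) is an assumed symbol for the i-th reward received from arm a, and n is the number of times that arm has been pulled:

```latex
\hat{Q}_n(a) = \frac{1}{n} \sum_{i=1}^{n} R_i(a),
\qquad
\hat{Q}_n(a) \to Q^*(a) \quad \text{as } n \to \infty
```

The limit holds by the law of large numbers as long as the arm's reward distribution does not change.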
