Flashcards covering Reinforcement Learning concepts from CS 5804 lecture notes.
What is the basic idea behind Reinforcement Learning?
Agent receives feedback in the form of rewards; utility is defined by the reward function; agent acts to maximize expected rewards based on observed outcomes.
What components still exist from MDPs in Reinforcement Learning?
A set of states S, a set of actions A, a model T(s,a,s'), and a reward function R(s,a,s').
What is the new twist in Reinforcement Learning compared to MDPs?
In RL, we don't know T or R so we must actually try out actions and states to learn.
What is the Model-Based Learning Idea?
Learn an approximate model based on experiences, then solve for values as if the learned model were correct.
What is the first step in Model-Based Learning?
Count outcomes s' for each s, a, then normalize to estimate T(s,a,s') and discover R(s,a,s') when we experience (s, a, s').
What is the second step in Model-Based Learning?
Solve the learned MDP, for example, using value iteration.
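A minimal sketch of these two steps, assuming tabular states and actions and a stream of observed transitions (the function and variable names here are illustrative, not from the notes):

```python
from collections import defaultdict

# Step 1: estimate T(s,a,s') and R(s,a,s') from observed transitions (s, a, s', r).
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times s' was observed
rewards = {}                                     # rewards[(s, a, s')] = observed reward

def record(s, a, s_next, r):
    counts[(s, a)][s_next] += 1
    rewards[(s, a, s_next)] = r

def T_hat(s, a, s_next):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

# Step 2: solve the learned MDP with value iteration, as if the estimates were correct.
def value_iteration(states, actions, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: max(sum(T_hat(s, a, s2) * (rewards.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2 in states)
                    for a in actions)
             for s in states}
    return V
```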
What is Passive Reinforcement Learning?
Simplified task of policy evaluation with a fixed policy π(s), where the goal is to learn the state values V(s) without knowing T(s,a,s') or R(s,a,s').
What is the core idea behind Direct Evaluation in RL?
Average together observed sample values: act according to π and, each time a state is visited, write down the sum of discounted rewards received from that point on; average those samples.
What is a disadvantage of Direct Evaluation?
It wastes information about the connections between states: each state's value must be learned separately, so learning takes a long time.
What is the big idea behind Temporal Difference Learning?
Learn from every experience by updating V(s) each time we experience a transition (s, a, s', r), moving values toward the value of whatever successor occurs.
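A sketch of this TD(0) update for a fixed policy, assuming V is a dict of value estimates (the learning rate alpha and discount gamma are illustrative):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V(s) toward the observed sample r + gamma * V(s') (a running average)."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```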
What is TD Value Learning?
Mimicking Bellman updates with running sample averages is a model-free way to do policy evaluation. If we want to extract a new policy, we need to learn Q-values rather than state values so that action selection is model-free too.
What is Active Reinforcement Learning?
You don’t know the transitions T(s,a,s’) or the rewards R(s,a,s’), you choose the actions now, and the goal is to learn the optimal policy/values.
What best describes Active Reinforcement Learning?
Learner makes choices, balancing exploration vs. exploitation; this is NOT offline planning - you actually take actions in the world and find out what happens.
What is Value Iteration in RL?
Find successive (depth-limited) values: start with V_0(s) = 0 and compute the depth-(k+1) values from the depth-k values; Q-values are more useful, so compute those instead.
What is Q-Learning?
Learn Q(s,a) values as you go by receiving a sample (s,a,s’,r), considering the old estimate and your new sample estimate, and incorporating the new estimate into a running average.
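A sketch of the tabular Q-learning update just described, assuming Q is a dict keyed by (state, action) and legal_actions(s') returns the actions available in s' (names are illustrative):

```python
def q_update(Q, s, a, r, s_next, legal_actions, alpha=0.1, gamma=0.9):
    """Blend the old estimate with the new sample r + gamma * max_a' Q(s', a')."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next)), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```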
What is the amazing result of Q-Learning?
Q-learning converges to optimal policy -- even if you’re acting suboptimally!
What is the first step in Exploration?
Work through a small example: at each step you have two choices, exploit the action that currently looks best or try something new to explore.
What are some schemes for forcing exploration?
There are several schemes for forcing exploration; the simplest is random actions (ε-greedy): with small probability ε act randomly, otherwise act on the current policy.
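A sketch of ε-greedy action selection over such a tabular Q (epsilon is an illustrative value):

```python
import random

def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
    """With small probability epsilon act randomly; otherwise act on current Q estimates."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q.get((s, a), 0.0))
```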
What is a better idea for when to explore?
Explore areas whose badness is not (yet) established, eventually stop exploring.
What does an Exploration function do?
Takes a value estimate u and a visit count n, and returns an optimistic utility.
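One common choice (an assumption here, not quoted from the notes) is f(u, n) = u + k/n, which adds a bonus that shrinks as an action is tried more often; a sketch:

```python
def exploration_value(u, n, k=1.0):
    """Optimistic utility: the raw estimate u plus a visit-count bonus."""
    return u + k / (n + 1)   # +1 avoids division by zero for never-tried actions
```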
What is Regret?
Even if you learn the optimal policy, you still make mistakes along the way!
What is Regret a measure of?
Measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal omniscient (expected) rewards.
What is Basic Q-Learning?
Keep a table of all Q-values. The problem: there are too many states to visit them all in training, and too many states to hold the Q-table in memory.
What do we want to do instead with Basic Q-Learning?
Instead, we want to generalize: learn about a small number of training states from experience, then generalize that experience to new, similar situations.
What is the Solution for Feature-Based Representations?
Describe a state using a vector of features (properties).
What are Features in Basic Q-Learning?
Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
What is an advantage of a Linear Value Function?
Our experience is summed up in a few powerful numbers.
What is a disadvantage of Linear Value Function?
States may share features but actually be very different in value!
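A sketch of approximate Q-learning with such a linear value function, Q(s, a) = Σ_i w_i f_i(s, a); the weight update follows the standard feature-weighted form, and the feature dictionaries are assumed inputs:

```python
def q_value(weights, features):
    """Linear Q-value: dot product of the weight vector and the feature vector (both dicts)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approx_q_update(weights, feats_sa, r, best_next_q, alpha=0.05, gamma=0.9):
    """Nudge each weight by alpha * (sample - current estimate) * feature value."""
    difference = (r + gamma * best_next_q) - q_value(weights, feats_sa)
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```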
What is the Q-learning priority?
Get Q-values close (modeling).
What is the Action selection priority?
Get ordering of Q-values right (prediction).
What is the solution for Basic Q-Learning?
Learn policies that maximize rewards, not the values that predict them.
What is Policy Search?
Start with an initial linear value function or Q-function, then nudge each feature weight up and down and see if the resulting policy is better than before.
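A crude hill-climbing sketch of this nudge-and-evaluate loop; evaluate_policy is an assumed function that runs the policy induced by the weights and returns its average reward:

```python
def policy_search(weights, evaluate_policy, step=0.1, rounds=10):
    """Nudge each feature weight up and down; keep a change only if the policy improves."""
    best_score = evaluate_policy(weights)
    for _ in range(rounds):
        for name in list(weights):
            for delta in (step, -step):
                weights[name] += delta
                score = evaluate_policy(weights)
                if score > best_score:
                    best_score = score           # keep the improvement
                else:
                    weights[name] -= delta       # revert the nudge
    return weights
```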
What is the problem with Policy Search?
Often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best.
What is Loss Function?
Total error = Σ_i (y_i − ŷ_i)² = Σ_i (y_i − Σ_k w_k f_k(x_i))²
What is Gradient?
It is a vector of partial derivatives, one per input scalar. Defines tangent plane and points in the direction of fastest increase.
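A sketch tying the two cards together: the total squared error of a linear model and one gradient-descent step on its weights (the arrays and learning rate are illustrative):

```python
import numpy as np

def total_error(w, F, y):
    """Sum of squared residuals: sum_i (y_i - sum_k w_k f_k(x_i))^2, where F[i, k] = f_k(x_i)."""
    residuals = y - F @ w
    return float(residuals @ residuals)

def gradient_step(w, F, y, lr=0.01):
    """Step opposite the gradient of the total error: grad = -2 * F^T (y - F w)."""
    grad = -2.0 * F.T @ (y - F @ w)
    return w - lr * grad
```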
What is Actor-Critic Algorithms?
The policy π_θ(a|s) is no longer deterministic: take action a sampled from π_θ(a|s), then update θ based on the policy gradient ∇_θ log π_θ(a|s), weighted by the critic's value estimate (e.g., the advantage).
Define Double Deep Q-Network
A Deep Q-Network (DQN) is a deep neural network estimating Q-values; a Double DQN uses two such networks, an online network to select actions and a target network to evaluate them, which reduces overestimation of Q-values.
What are the iterate steps of Double Deep Q-Network?
Iterate: collect samples (s, a, r, s'), then update the Q-network toward targets of the form r + γ·Q_target(s', argmax_a Q(s', a)).
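A sketch of the Double DQN target in its standard form (the online network selects the action, the target network evaluates it); q_online and q_target are assumed callables returning a vector of Q-values per action:

```python
import numpy as np

def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.99):
    """Target = r + gamma * Q_target(s', argmax_a Q_online(s', a)); just r at episode end."""
    if done:
        return r
    best_action = int(np.argmax(q_online(s_next)))
    return r + gamma * q_target(s_next)[best_action]
```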
What is the collected initial dataset made of?
Human-provided data, a scripted controller, a baseline policy, or all of the above.
What is the standard real-world RL process made of?
Instrument the task so that we can run RL: safety mechanisms, autonomous collection, rewards, resets, etc.
A classification example: what is credit approval made of?
The “ideal credit approval function” and past data on customers (demographic, income, personal data).
What is the Supervised Learning Problem made of?
An unknown target function f : X → Y; classification is when Y is categorical (e.g., binary).
What is the Generalization Error made of?
Eout(h) = Pr[h(x) ≠ f(x)]. In practice, we estimate Eout by evaluating on a (held-out) test set; we call this the test error.
What is the main aim when choosing h from H?
Minimize training error. Many algorithms can be thought of within this broad framework: linear regression finds a weight vector w that minimizes the squared error, and logistic regression finds a linear function that minimizes the logistic (log) loss.
What are the main concepts of the Central Problems?
There are deep relationships between the stability and variance of a learning algorithm, hypothesis complexity, and generalization ability
What does Understanding the Setting consist of?
The data is still binary. Ways to think about this: each individual has a risk state or subjective probability, and whether the outcome happens is based on a biased coin flip.
What is the aim of the Probabilistic Interpretation?
We want to pick the w that maximizes the likelihood of the observed data.
What are the main concepts in Computing the Gradient for Logistic Regression?
Apply the chain rule to the sigmoid σ(z) = 1/(1 + exp(−z)); using σ'(z) = σ(z)(1 − σ(z)), the gradient of the log-likelihood with respect to w is Σ_i (y_i − σ(w·x_i)) x_i.
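A sketch of that gradient for the average negative log-likelihood; X, y, and w are illustrative NumPy arrays:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_gradient(w, X, y):
    """Gradient of the average negative log-likelihood: (1/n) * X^T (sigmoid(Xw) - y)."""
    p = sigmoid(X @ w)            # predicted probabilities
    return X.T @ (p - y) / len(y)
```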
What are the steps in designing versus learning features?
In classic models, features are designed by hand by examining the training set with an eye to linguistic intuitions and the literature, supplemented by insights from error analysis on the training set of an early version of the system.
When applying multi-class output in basic regression, what is the input?
A vector z = [z1, z2, …, zK]; each z_k is (just as for the sigmoid) the dot product between a weight vector w_k and the input vector x.
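A sketch of turning those K dot products into class probabilities with a softmax (W is an assumed K-by-d weight matrix):

```python
import numpy as np

def softmax_probs(W, x):
    """z_k = w_k . x for each class k, then normalize exp(z) into a probability vector."""
    z = W @ x
    z = z - z.max()               # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```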
What functions do neurons compute?
Linear unit: z; threshold/sign unit: sgn(z); sigmoid unit: 1/(1 + exp(−z)).
What are features from classifiers made of?
The input layer and the hidden layer; this is a two-layer feed-forward neural network.
What do we mean by a cost matrix?
A matrix whose (i, j) entry is the cost of misclassifying a class-j example as class i.
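A sketch of using such a cost matrix for cost-sensitive prediction: pick the class i with the lowest expected cost given predicted class probabilities (C and probs are illustrative):

```python
import numpy as np

def min_expected_cost_class(C, probs):
    """expected_cost[i] = sum_j C[i, j] * P(true class = j); predict the cheapest class i."""
    expected_cost = C @ probs
    return int(np.argmin(expected_cost))
```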