CS 5804: Introduction to Artificial Intelligence - Reinforcement Learning

Description and Tags

Flashcards covering Reinforcement Learning concepts from CS 5804 lecture notes.


53 Terms

1

What is the basic idea behind Reinforcement Learning?

Agent receives feedback in the form of rewards; utility is defined by the reward function; agent acts to maximize expected rewards based on observed outcomes.

2

What components still exist from MDPs in Reinforcement Learning?

A set of states S, a set of actions A, a model T(s,a,s'), and a reward function R(s,a,s').

3

What is the new twist in Reinforcement Learning compared to MDPs?

In RL, we don't know T or R so we must actually try out actions and states to learn.

4

What is the Model-Based Learning Idea?

Learn an approximate model based on experiences, then solve for values as if the learned model were correct.

5

What is the first step in Model-Based Learning?

Count outcomes s' for each s, a, then normalize to estimate T(s,a,s') and discover R(s,a,s') when we experience (s, a, s').
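
A minimal Python sketch of this counting-and-normalizing step; the function name and the (s, a, s', r) list format are illustrative assumptions, not from the lecture notes:

    from collections import defaultdict

    def estimate_model(transitions):
        """transitions: observed (s, a, s_next, r) tuples gathered while acting."""
        counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
        R_hat = {}                                       # discovered rewards R(s, a, s')
        for s, a, s_next, r in transitions:
            counts[(s, a)][s_next] += 1
            R_hat[(s, a, s_next)] = r
        T_hat = {}
        for (s, a), outcomes in counts.items():
            total = sum(outcomes.values())
            for s_next, n in outcomes.items():
                T_hat[(s, a, s_next)] = n / total        # normalize counts into probabilities
        return T_hat, R_hat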

6

What is the second step in Model-Based Learning?

Solve the learned MDP, for example, using value iteration.
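
A companion sketch of this second step, assuming the T_hat and R_hat dictionaries from the previous sketch; gamma and the iteration count are illustrative values:

    def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, iters=100):
        V = {s: 0.0 for s in states}                     # start with V0(s) = 0
        for _ in range(iters):
            V = {s: max(sum(T_hat.get((s, a, s2), 0.0)
                            * (R_hat.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2 in states)
                        for a in actions)
                 for s in states}
        return V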

7

What is Passive Reinforcement Learning?

Simplified task of policy evaluation with a fixed policy π(s), where the goal is to learn the state values V(s) without knowing T(s,a,s') or R(s,a,s').

8

What is the core idea behind Direct Evaluation in RL?

Average together observed sample values by acting according to π and writing down the sum of discounted rewards each time a state is visited.

9

What is a disadvantage of Direct Evaluation?

It wastes information about how states are connected because each state must be learned separately, so it takes a long time to learn.

10

What is the big idea behind Temporal Difference Learning?

Learn from every experience by updating V(s) each time we experience a transition (s, a, s', r), moving values toward the value of whatever successor occurs.

11

What is TD Value Learning?

Mimicking Bellman updates with running sample averages is a model-free way to do policy evaluation. If we want a new policy, we need to learn Q-values rather than state values, so that action selection is model-free too.
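
A minimal sketch of the temporal-difference value update for a fixed policy; alpha (learning rate) and gamma (discount) are assumed parameters:

    def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
        sample = r + gamma * V[s_next]              # value of whatever successor actually occurred
        V[s] = (1 - alpha) * V[s] + alpha * sample  # fold the sample into a running average
        return V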

12

What is Active Reinforcement Learning?

You don’t know the transitions T(s,a,s’) or the rewards R(s,a,s’), you choose the actions now, and the goal is to learn the optimal policy/values.

13

What best describes Active Reinforcement Learning?

Learner makes choices, balancing exploration vs. exploitation; this is NOT offline planning - you actually take actions in the world and find out what happens.

14

What is Value Iteration in RL?

Find successive (depth-limited) values: start with V0(s) = 0 and compute the depth k+1 values from the depth k values with the Bellman update. Q-values are more useful, so compute Q_k+1(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q_k(s',a') ] instead.

15

What is Q-Learning?

Learn Q(s,a) values as you go by receiving a sample (s,a,s’,r), considering the old estimate and your new sample estimate, and incorporating the new estimate into a running average.
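
A sketch of that running-average update, assuming Q is stored as a dict keyed by (state, action); alpha and gamma are illustrative parameters:

    def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9):
        sample = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)  # new sample estimate
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample         # blend into old estimate
        return Q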

16

What is the amazing result of Q-Learning?

Q-learning converges to optimal policy -- even if you’re acting suboptimally!

17

What is the first step in Exploration?

Work through a small example: at each step you have two choices, exploit the action that currently looks best or explore an action whose value you know less about.

18

What are some schemes for forcing exploration?

There are several schemes for forcing exploration; the simplest is random actions (ε-greedy): at each time step, act randomly with small probability ε and act on the current policy with large probability 1 - ε.
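
A sketch of ε-greedy action selection under those assumptions (dict-based Q-values, a small fixed epsilon):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)                      # explore: act randomly
        return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit: act on current Q-values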

19

What is a better idea for when to explore?

Explore areas whose badness is not (yet) established, and eventually stop exploring.

20

What does an Exploration function do?

Takes a value estimate u and a visit count n, and returns an optimistic utility.
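
For illustration, one common choice (not necessarily the one in the notes) is f(u, n) = u + k / n for some constant k, so rarely tried state-action pairs look better than their current estimate; the Q-learning sample then uses f(Q(s', a'), N(s', a')) in place of Q(s', a').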

21

What is Regret?

Even if you learn the optimal policy, you still make mistakes along the way!

22

What is Regret a measure of?

Measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal omniscient (expected) rewards.

23

What is Basic Q-Learning?

Keep a table of all Q-values. The problem: there are too many states to visit them all in training and too many states to hold the Q-table in memory.

24

What do we want to do instead with Basic Q-Learning?

Instead, we want to generalize: learn about a small number of training states from experience, and generalize that experience to new, similar situations.

25

What is the Solution for Feature-Based Representations?

Describe a state using a vector of features (properties).

26

What are Features in Basic Q-Learning?

Features are functions from states to real numbers (often 0/1) that capture important properties of the state.

27

What is an advantage of a linear value function?

Our experience is summed up in a few powerful numbers.
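
As a sketch of what those "few powerful numbers" are, a linear Q-function and a feature-weight update might look like this (the helper names and alpha are assumptions, not from the lecture):

    def q_value(weights, features):
        """Q(s, a) = w1*f1(s, a) + ... + wn*fn(s, a); `features` is the vector f(s, a)."""
        return sum(w * f for w, f in zip(weights, features))

    def update_weights(weights, features, target, alpha=0.05):
        """Move each weight in proportion to its feature and the error (target - prediction)."""
        error = target - q_value(weights, features)
        return [w + alpha * error * f for w, f in zip(weights, features)]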

28

What is a disadvantage of a linear value function?

States may share features but actually be very different in value!

29

What is the Q-learning priority?

Get Q-values close (modeling).

30

What is the action selection priority?

Get ordering of Q-values right (prediction).

31

What is the solution for Basic Q-Learning?

Learn policies that maximize rewards, not the values that predict them.

32

What is Policy Search?

Start with an initial linear value function or Q-function, then nudge each feature weight up and down and see if the resulting policy is better than before.
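
A rough sketch of that nudging procedure; evaluate_policy is a hypothetical stand-in for running episodes with the policy induced by the weights:

    def hill_climb(weights, evaluate_policy, step=0.01):
        best_score = evaluate_policy(weights)
        for i in range(len(weights)):
            for delta in (step, -step):
                candidate = list(weights)
                candidate[i] += delta
                score = evaluate_policy(candidate)   # is the resulting policy better than before?
                if score > best_score:
                    weights, best_score = candidate, score
        return weights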

33

What is the problem with Policy Search?

Often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best.

34

What is the loss function?

Total error = Σ_i (y_i − ŷ_i)² = Σ_i (y_i − Σ_k w_k f_k(x_i))²

35

What is Gradient?

It is a vector of partial derivatives, one per input scalar. It defines the tangent plane and points in the direction of fastest increase.
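
Since the gradient points in the direction of fastest increase, minimizing the loss above means stepping against it; a sketch of one such gradient-descent step (the data format and alpha are assumptions):

    def gradient_step(weights, data, alpha=0.01):
        """data: (feature_vector, y) pairs; loss is sum_i (y_i - w . f(x_i))^2."""
        grads = [0.0] * len(weights)
        for features, y in data:
            pred = sum(w * f for w, f in zip(weights, features))
            for k, f in enumerate(features):
                grads[k] += -2.0 * (y - pred) * f               # partial derivative of squared error
        return [w - alpha * g for w, g in zip(weights, grads)]  # move opposite the gradient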

36

What are Actor-Critic algorithms?

The policy is no longer deterministic. The actor takes actions sampled from a stochastic policy π(a|s; θ), and the parameters θ are updated based on the policy gradient ∇_θ log π(a|s; θ), weighted by the critic's value (or advantage) estimate.

37

Define Double Deep Q-Network

Deep Q-Network (DQN) = a deep neural network that estimates Q-values; a Double DQN uses two such networks, an online network and a target network, so that update targets come from the target network and Q-value overestimation is reduced.

38

What are the iteration steps of a Double Deep Q-Network?

Iterate: collect samples (s, a, s', r) by acting in the environment, then update Q based on the sampled targets.

39

What is the initial dataset collected from?

A human-provided demonstration, a scripted controller, a baseline policy, or all of the above.

40

What does the standard real-world RL process consist of?

Instrument the task so that we can run RL: safety mechanisms, autonomous collection, rewards, resets, etc.

41

A classification example: what does the credit approval problem consist of?

The "ideal credit approval function" and past data on customers (demographic, income, personal data).

42

What does the supervised learning problem consist of?

An unknown target function f : X → Y; classification is when Y is categorical (e.g. binary).

43

What is generalization error?

E_out(h) = Pr[h(x) ≠ f(x)]. In practice, we estimate E_out by evaluating on a (held-out) test set; we call this the test error.

44

What is the main aim when choosing h from H?

Minimize training error. Many algorithms can be thought of within this broad framework: linear regression finds a weight vector w that minimizes the squared error, and logistic regression finds a linear function that minimizes the logistic (log) loss.

45

What are the main concepts behind the central problems of learning?

There are deep relationships between the stability and variance of a learning algorithm, hypothesis complexity, and generalization ability

46

How should we understand the setting when the outcomes are binary?

The data is still binary. Ways to think about this: each individual has a risk state or subjective probability, and whether the outcome happens is based on a biased coin flip.

47

What is the aim of the probabilistic interpretation of logistic regression?

We want to pick the w that maximizes the likelihood of the observed data.

48

What are the main concepts in computing the gradient for logistic regression?

Apply the chain rule with the sigmoid σ(z) = 1 / (1 + exp(−z)), whose derivative is σ(z)(1 − σ(z)); this gives the gradient of the negative log-likelihood as a sum over examples of (σ(w·x_i) − y_i) x_i.
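
A sketch of that gradient computation for binary labels y in {0, 1}; the function names and data layout are assumptions:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def logistic_gradient(w, data):
        """data: (x, y) pairs with x a feature vector; gradient of the negative log-likelihood."""
        grad = [0.0] * len(w)
        for x, y in data:
            p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
            for k, xk in enumerate(x):
                grad[k] += (p - y) * xk                # (sigma(w.x) - y) * x, via the chain rule
        return grad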

49

How are features designed, as opposed to learned?

In classic models, features are designed by hand by examining the training set with an eye to linguistic intuitions and the literature, supplemented by insights from error analysis on the training set of an early version of the system.

50

When applying multi-class outputs in logistic regression, what is the input?

The input will be a vector z = [z1, z2, …, zK], where each zk is (just as for the sigmoid) the dot product between a weight vector wk and the input vector x; the softmax function then turns z into a probability distribution over the K classes.
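
A small sketch of the softmax step that turns z into class probabilities (a standard formulation, shown here for illustration):

    import math

    def softmax(z):
        exps = [math.exp(zk - max(z)) for zk in z]     # subtract max(z) for numerical stability
        total = sum(exps)
        return [e / total for e in exps]               # probability distribution over the K classes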

51

What activation functions do neurons use?

Linear unit: z; threshold/sign unit: sgn(z); sigmoid unit: 1 / (1 + exp(−z)), where z is the unit's net input.

52

What are "features from classifiers" made of?

The input layer and the hidden layer: together they form a two-layer feed-forward neural network whose hidden-layer activations serve as learned features.
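
A minimal sketch of such a two-layer feed-forward pass, where the hidden activations are the learned features (the weight layout is an assumption):

    import math

    def forward(x, W_hidden, w_out):
        """W_hidden: one weight vector per hidden unit; w_out: output-layer weights."""
        hidden = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
                  for row in W_hidden]                 # sigmoid hidden units = learned features
        output = sum(w * h for w, h in zip(w_out, hidden))
        return hidden, output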

53

What do we mean by a cost matrix?

The entry for (i, j) is the cost of misclassifying a class-j example as class i.