Multi-Armed Bandits

Flashcards covering the key concepts of Multi-Armed Bandits, including the problem definition, simplifications of RL, feedback types, policies, algorithms, and variants.

10 Terms

1

Multi-Armed Bandit Problem

A problem inspired by slot machines: a gambler (the agent) tries to maximize rewards by choosing which arm to pull in an environment of slot machines, receiving a payout (or lack thereof) as the reward signal.
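
As a concrete illustration, here is a minimal sketch of such an environment (hypothetical class and variable names; stationary Gaussian payouts are assumed, not taken from the cards):

```python
import numpy as np

class GaussianBandit:
    """k-armed bandit: each arm pays a noisy reward around a fixed true value."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.true_values = self.rng.normal(0.0, 1.0, size=k)  # q*(a) for each arm

    def pull(self, a):
        # Payout for pulling arm a: the arm's true value plus unit Gaussian noise.
        return self.rng.normal(self.true_values[a], 1.0)
```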

2

Simplification of RL in Multi-Armed Bandits

The environment has only a single state, so observing the environment state is unnecessary; the environment does not change, and the distribution of rewards is fixed.

3

Evaluative Feedback

Provides a reward depending on the action actually taken, where the reward signal is a function of the action, R(a).

4

Random Policy

A policy that chooses an action uniformly at random at every time step; it earns the average reward over all arms but ignores past experience.
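
A sketch of this policy over the hypothetical environment above:

```python
import numpy as np

rng = np.random.default_rng()

def random_policy(k):
    # Ignore all past experience; pick one of the k arms uniformly at random.
    return int(rng.integers(k))
```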

5

Value of an Action

The expected reward given that action is taken: q*(a) ≔ 𝔼[R_t | A_t = a].
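
Since q*(a) is unknown, it is usually approximated by an estimate Q_t(a), e.g. the sample average of rewards observed for each arm. A minimal incremental-update sketch (hypothetical names):

```python
import numpy as np

k = 10
Q = np.zeros(k)  # Q_t(a): current estimate of q*(a)
N = np.zeros(k)  # N_t(a): how many times each arm has been pulled

def update_estimate(a, reward):
    # Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```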

6

Greedy Action

Chooses the action that maximizes the current action-value estimate Q_t(a), i.e. A_t = argmax_a Q_t(a); however, it can be suboptimal if Q_t is a bad approximation of q*.
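
Given estimates Q_t(a) like those in the sketch above, the greedy choice is simply the argmax:

```python
import numpy as np

def greedy_action(Q):
    # A_t = argmax over a of Q_t(a); ties are broken by the lowest index.
    return int(np.argmax(Q))
```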

7

Explore-Exploit Tradeoff

The balance between choosing the greedy action to maximize rewards (exploit) and choosing a non-greedy action to improve the estimate Q_t(a) (explore).

8

"":-Greedy Algorithm

An algorithm that introduces exploration, either by randomly sampling arms initially or by choosing at each step between exploration (via the random policy) with probability ε and exploitation (via the greedy action) with probability 1 − ε.
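
A sketch of the second variant, assuming estimates Q_t(a) as above (hypothetical names):

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon, rng=np.random.default_rng()):
    # With probability epsilon, explore: pick an arm uniformly at random.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    # Otherwise exploit: pick the greedy arm under the current estimates.
    return int(np.argmax(Q))
```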

9

Non-Stationary/Dynamic Bandits

Bandits where the distribution of rewards changes over time, so the optimal action values q*(a) depend on time.
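
A common adjustment for this case (a standard technique, not stated on the card) is to replace the sample average with a constant step size α, so recent rewards are weighted more heavily:

```python
def update_estimate_nonstationary(Q, a, reward, alpha=0.1):
    # Exponential recency-weighted average: Q(a) <- Q(a) + alpha * (R - Q(a))
    Q[a] += alpha * (reward - Q[a])
```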

10

Contextual Bandits

Bandits where the agent can observe some clue or contextual information about the environment state, so the best action depends on this context.
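
One simple sketch (hypothetical, assuming a small discrete set of contexts) keeps a separate estimate table per context, so the greedy action depends on what is observed:

```python
import numpy as np

n_contexts, k = 3, 10
Q = np.zeros((n_contexts, k))  # Q[s, a]: estimate per (context, arm) pair

def contextual_greedy_action(context):
    # The best arm now depends on which context was observed.
    return int(np.argmax(Q[context]))
```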