Flashcards covering the key concepts of Multi-Armed Bandits, including the problem definition, simplifications of RL, feedback types, policies, algorithms, and variants.
Multi-Armed Bandit Problem
A problem inspired by slot machines: a gambler (the agent) tries to maximize reward by choosing which arm to pull among a set of slot machines (the environment), receiving a payout (or lack thereof) as the reward signal.
Simplification of RL in Multi-Armed Bandits
The environment has only a single state, so observing the state is unnecessary; the environment does not change, and the distribution of rewards is fixed.
Evaluative Feedback
Provides a reward depending on the action actually taken, where the reward signal is a function of the action, R(a).
Random Policy
A policy that chooses an action uniformly at random at every time step; on average it earns the mean reward across all arms, but it ignores past experience.
Value of an Action
The expected reward given that the action is taken: $q_*(a) := \mathbb{E}[R_t \mid A_t = a]$
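Since $q_*(a)$ is unknown to the agent, it is estimated from observed rewards; the later cards refer to this estimate as $Q_t(a)$. A standard choice (stated here as an assumption, following the usual sample-average estimator) is:

```latex
% Sample-average estimate of the action value, written Q_t(a) on the later cards:
% the mean of the rewards received so far on the time steps when action a was taken.
Q_t(a) := \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbf{1}[A_i = a]}
```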
Greedy Action
The action that maximizes the current action-value estimate, $A_t = \arg\max_a Q_t(a)$; however, it can be suboptimal if $Q_t$ is a poor approximation of $q_*$.
Explore-Exploit Tradeoff
The balance between choosing the greedy action to maximize rewards (exploit) and choosing a non-greedy action to improve the estimate $Q_t(a)$ (explore).
"":-Greedy Algorithm
An algorithm that introduces exploration either by randomly sampling arms at the start, or, at each time step, by exploring (choosing a random arm, as in the random policy) with probability ε and exploiting (choosing the greedy arm) with probability 1 − ε; a sketch follows below.
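A minimal sketch of the ε-greedy loop, assuming a simple stationary testbed where each arm's reward is Gaussian with a fixed mean; the function name, `eps`, and the testbed itself are illustrative assumptions, not part of the card.

```python
import numpy as np

def epsilon_greedy(true_means, eps=0.1, steps=1000, rng=None):
    """Run epsilon-greedy on a stationary Gaussian bandit (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    n_arms = len(true_means)
    Q = np.zeros(n_arms)   # action-value estimates Q_t(a)
    N = np.zeros(n_arms)   # pull counts per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.integers(n_arms)        # explore: random arm
        else:
            a = int(np.argmax(Q))           # exploit: greedy arm
        r = rng.normal(true_means[a], 1.0)  # reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample-average update
        total_reward += r
    return Q, total_reward

# Example: 4 arms with different expected payouts
Q_est, total = epsilon_greedy([0.2, 0.5, 1.0, 0.8], eps=0.1, steps=5000)
```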
Non-Stationary/Dynamic Bandits
Bandits where the distribution of rewards changes over time, so the optimal action values $q_*(a)$ depend on time.
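A common remedy for the non-stationary case (an assumption here, since the card does not name one) is to replace the sample-average update with a constant step size, so recent rewards count more than old ones:

```python
def update_nonstationary(Q, a, r, alpha=0.1):
    """Exponential recency-weighted update for non-stationary bandits (sketch).

    With a constant step size alpha, old rewards decay geometrically, so the
    estimate can track a reward distribution that drifts over time.
    """
    Q[a] += alpha * (r - Q[a])
    return Q
```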
Contextual Bandits
Bandits where the agent can observe some clue or contextual information about the environment state, so the best action depends on that context.
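A minimal illustration of how context enters the picture, assuming a small discrete set of contexts and a separate row of estimates per context; this tabular form and the simulated environment are assumptions for the sketch, not something the card prescribes.

```python
import numpy as np

def contextual_eps_greedy(n_contexts, n_arms, eps=0.1, steps=2000, rng=None):
    """Tabular contextual bandit sketch: one row of Q estimates per observed context."""
    rng = rng or np.random.default_rng()
    # Hypothetical environment: each (context, arm) pair has its own mean reward.
    true_means = rng.normal(0.0, 1.0, size=(n_contexts, n_arms))
    Q = np.zeros((n_contexts, n_arms))
    N = np.zeros((n_contexts, n_arms))
    for _ in range(steps):
        ctx = rng.integers(n_contexts)      # observe the contextual clue
        if rng.random() < eps:
            a = rng.integers(n_arms)        # explore
        else:
            a = int(np.argmax(Q[ctx]))      # exploit: best arm *for this context*
        r = rng.normal(true_means[ctx, a], 1.0)
        N[ctx, a] += 1
        Q[ctx, a] += (r - Q[ctx, a]) / N[ctx, a]
    return Q
```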