Reinforcement Learning

Reinforcement Learning:

  • receive feedback in the form of rewards

  • agent utility is defined by the reward function

  • must (learn to) act so as to maximize expected rewards

  • still assume MDP:

    • a set of states s S

    • a set of actions (per state) A

    • a model T(s, a, s’)

    • a reward function R(s, a, s’)

  • still looking for a policy π

  • now we don’t know T or R

    • therefore - experiement!

Model-Based Learning

  • learn an approximate model based on experiences

  • learn empirical MDP model

    • count outcomes s’ for each s, a

    • normalize to give estimate or T(s, a, s’)

    • discover each R(s, a, s’) when we experience (s, a, s’)

  • solve the learned MDP

Example A: Based on small amount of observed episodes

Model-Free IL

  • Passive Reinforcement Learning:

    • policy evaluation

      • input: fixed policy π(s)

      • don’t know transitions or rewards

      • goal: learn state values

    • learner is “along for the ride”

    • no choice about what actions to take

    • just execute the policy and learn from it

    • Direct Evaluation:

      • compute values for each state under π

      • average together observed sample values

      • ex A. output values:

        • A: -10, B: 8, C: [[(-1+10) + (-1+10) + (-1+10) + (-1 - 10)] / 4] = 4, D: 10, E: -2

    • Indirect Evaluation:

Temporal Difference Learning:

  • learn from every experience

    • update V(s) each time we experience a transition (s, a, s’, r)

  • policy still fixed, still doing evaluation

    • Sample of V(s)

      • sample = R(s, π(s), s’) + γVπ(s’)

    • Update to V(s)

      • Vπ(s) ← (1 - a)Vπ(s) + (a)sample

      • same update:

        • Vπ(s) ← Vπ(s) + a(sample - Vπ(s))

  • Exponential Moving Average

    • running interpolation update:

      • xbarn = (1-a) xbarn - 1 + a xn

Active Reinforcement Learning:

  • full reinforcement learning: optimal policies

    • don’t know transitions or rewards

    • choose the actions now

    • goal: learn the optimal policy / values

  • fundamental tradeoff: exploration vs. exploitation

    • Exploration:

      • How to explore? Random actions

        • every time step, flip a coin

        • with (small) probability ε, act randomly

        • with (large) probability 1-ε, act on current policy

      • when to explore?

        • random actions: explore a fixed amount

        • better: explore areas whose badness is not established yet, eventually top exploring

      • Exploration Function:

        • f(u, n) = u + k / n

      • Modified Q-Update:

        • Q(s, a) ← R(s, a, s’) + γ max f(Q(s’, a’), N(s’, a’))

Q-Value Iteration

  • find successive (depth-limited) values

    • start with V0(s) = 0

    • given Vk, calculate the depth k+1 values for all states

      • Vk+1(s) ← max T(s, a, s’) [R(s, a, s’) + γVk(s’)]

    • but Q values are more useful so compute them instead:

      • Qk+1(s, a) ← T(s, a, s’) [R(s, a, s’) + γ maxQk(s’, a’)]

Q-Learning

  • sample-based Q-value iteration

  • learn Q(s, a) values as you go

    • receive a sample

    • consider your old estimate

    • consider your new sample estimate

    • incorporate new into average

  • Off-policy learning

  • Approximate Q-Learning:

    • Generalizing across states:

      • learn about some small number of training states from experience

      • generalize that experience to new, similar situation

    • Feature-Based Representations:

      • describe a state using a vector of features (properties)

      • Q(s, a) = w1f1(s, a) + w2f2(s, a) + … + wnfn(s, a)

        • Transition = (s, a, r, s’)

        • difference = [r + γ max Q(s’, a’)] - Q(s, a)

        • Exact Q’s: Q(s, a) ← Q(s, a) + a[difference]

        • Approx Q’s: wi = wi + a[difference] fi(s, a)

Regret: measure of total mistake cost: difference between rewards and and optimal rewards

Policy Search:

  • often feature-based policies that work well aren’t the ones that approximate V / Q best

    • we should learn policies that maximize rewards, not the values that predict them

  • start with an ok solution then fine-tune by hill climbing on feature weights

  • simplest policy search:

    • start with initial linear value function or Q-functions

    • nudge each feature weight up and down and see if your policy is better than before