L12 - Operational Control of Cross-Linked Energy Systems by Means of Reinforcement Learning

11 Terms

1
New cards

Reinforcement learning (RL)

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents take actions in an environment in order to maximize a cumulative reward.

Reinforcement learning uses the formal framework of Markov decision processes (MDPs): a discrete stochastic approach to designing a controller that chooses actions to maximize (or minimize) a measure of a dynamical system's behavior (the reward) over time.
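
A minimal sketch of the interaction loop this describes. The env and agent objects are hypothetical placeholders (not from the lecture material); the point is the Sₜ → Aₜ → Rₜ₊₁ cycle and the cumulative reward being maximized.

def run_episode(env, agent, max_steps=10):
    state = env.reset()                          # initial state S_0
    cumulative_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                # A_t, chosen by the agent's policy
        state, reward, done = env.step(action)   # environment returns S_{t+1} and R_{t+1}
        cumulative_reward += reward              # the quantity the agent tries to maximize
        if done:
            break
    return cumulative_reward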

2
New cards

policy (π)

A policy is a mapping from perceived states (Sₜ) of the environment to actions (Aₜ) to be taken when in those states.

  • The policy (π) is the core of an RL agent.

  • A policy maps states (Sₜ) to actions (Aₜ) for what to do in each situation.

  • The policy improves by getting rewards (Rₜ) from the environment.

  • A policy can be a simple lookup table or a complex search/computation.

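An illustrative example of the simplest case, a policy stored as a lookup table (the gridworld states and action names are invented for illustration):

table_policy = {            # π: state → action
    (0, 0): "right",
    (0, 1): "right",
    (1, 0): "up",
    (1, 1): "up",
}

action = table_policy[(0, 0)]   # "what to do" when the agent perceives state (0, 0)
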
3
New cards

Q-Learning

What is the goal?

Lookup table: Q-Learning stores a learned value for every state–action pair in a lookup table (the Q-table).

The objective of Q-Learning is to find a policy that is optimal in the sense that the expected return over all successive time steps is the maximum achievable.

  • Q-Learning is a type of RL algorithm.

  • Goal: Find the optimal policy that maximizes the expected total reward over time.

  • Works by learning the value (Q-value) of taking an action in a given state.

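A minimal sketch of that lookup table: one entry per state–action pair (the 4×4 grid size and the action names are assumptions for illustration):

ACTIONS = ["up", "down", "left", "right"]
STATES = [(row, col) for row in range(4) for col in range(4)]   # assumed 4x4 gridworld

# Q(s, a) starts at 0 for every state-action pair; training gradually fills it in.
q_table = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
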
4
New cards

How can we find that policy?

  • Policy maps states → actions.

  • Q-Learning solves it in 2 steps:

    1. Find Q-values for every state–action pair.

    2. Choose the action with the best Q-value.

  • Rulebook:

    • Flag = +20 and restart

    • Green = +5

    • Red X = -5

    • Gray = -1

    • Max 10 steps before restart

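The rulebook translates directly into a reward function; the cell labels below are made-up names for the grid symbols:

CELL_REWARD = {"flag": +20, "green": +5, "red_x": -5, "gray": -1}
MAX_STEPS = 10      # the episode restarts after 10 steps (or when the flag is reached)

def reward(cell):
    return CELL_REWARD[cell]
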
5
New cards

What is a Q-value?

What would happen if we know optimal Q-values?

  • Q-value = expected return from taking an action in a state, then following the policy.

  • If we knew optimal Q-values, we could use a greedy policy → always choose the action with the highest Q-value.
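
With known (or learned) Q-values, that greedy policy is just an argmax over the actions available in the current state. A minimal sketch, reusing the q_table layout from the Q-Learning card:

def act_greedy(state, q_table):
    # pick the action with the highest Q-value in this state
    return max(q_table[state], key=q_table[state].get)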

6
New cards

Bellman Equation

Q-values are updated using the Bellman optimality equation (a standard form is written out below):

  • This is an iterative learning process; the Q-table stores the values learned about the environment.
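
The equation itself is not reproduced on this card; the standard tabular Q-learning update based on the Bellman optimality equation (using the α and γ from the hyperparameter card; the exact form on the lecture slide is not visible here) is:

  Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α · [ Rₜ₊₁ + γ · maxₐ Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ) ]

The same update as Python, applied to the q_table sketched earlier:

alpha, gamma = 0.7, 0.99    # learning rate and discount factor from the hyperparameter card

def q_update(q_table, s, a, r, s_next):
    best_next = max(q_table[s_next].values())                         # maxₐ Q(Sₜ₊₁, a)
    q_table[s][a] += alpha * (r + gamma * best_next - q_table[s][a])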

7
New cards

Does it make sense to use the greedy policy all the time?

  • Pro: Yes, because we want to maximize the cumulative reward!

  • Contra: No, because then we would take the first solution that works and never explore more profitable ones!

8
New cards

Difference between greedy and ε-greedy policy?

  • Greedy policy: Always picks the highest Q-value action → quick results, but may miss better paths.

  • ε-greedy policy: Sometimes picks random actions → allows discovering better solutions.

  • If we immediately follow the first working path we find, we might never find the extra points (or even the flag).

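A minimal sketch of ε-greedy action selection, assuming the same q_table layout as the earlier sketches:

import random

def act_epsilon_greedy(state, q_table, epsilon):
    if random.random() < epsilon:                          # with probability ε: explore (random action)
        return random.choice(list(q_table[state]))
    return max(q_table[state], key=q_table[state].get)     # otherwise: exploit (greedy action)
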
9
New cards

Exploration vs. Exploitation

  • Exploration is the act of exploring the environment (trying different actions) in order to gather information about it.

  • Exploitation is the act of exploiting the information that is already known in order to maximize the return.

  • ε-greedy policy: Mostly picks best Q-value, but with probability ε picks random action.

  • ε decreases over time (start high, end low).

10
New cards

Optimal Q-Table & Hyperparameters

  • After training, we get an optimal Q-table.

  • From it, we can derive optimal actions for each state.

  • Key hyperparameters:

    • α (learning rate) = 0.7

    • γ (discount factor) = 0.99

    • ε_max = 1, ε_min = 0.01, ε_decay = 0.01

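With these values, a common way to implement the decay from ε_max toward ε_min is an exponential schedule over episodes. The exact schedule used in the lecture is not shown, so the formula below is an assumption:

import math

ALPHA, GAMMA = 0.7, 0.99                        # learning rate and discount factor
EPS_MAX, EPS_MIN, EPS_DECAY = 1.0, 0.01, 0.01   # exploration schedule from the card

def epsilon(episode):
    # decays from EPS_MAX toward EPS_MIN as the episode count grows
    return EPS_MIN + (EPS_MAX - EPS_MIN) * math.exp(-EPS_DECAY * episode)
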
11
New cards

Simple vs. Complex Environments

Most environments relevant to real application cases are much more complex than our gridworld example! It is not possible to gather and store information about all states and actions. How is this problem solved?

  • Real-world environments are much more complex than the small grid example.

  • It’s impossible to store info for all states and actions.

  • Need advanced methods to handle large/continuous spaces.