Introduction to Reinforcement Learning
Freek Stulp
Overview
General principles of RL
Markov Decision Process as model
Values of states: V(s)
Values of state-actions: Q(a,s)
Exploration vs. Exploitation
Issues in RL
Conclusion
General principles of RL
Neural networks are supervised learning algorithms: for each input, we know the output.
What if we don't know the output for each input?
Example: a flight control system.
Let the agent learn how to achieve certain goals itself, through interaction with the environment.
General principles of RL
Let the agent learn how to achieve certain goals itself, through interaction with the environment.
[Diagram: the agent sends actions to the environment; the environment returns percepts and rewards to the agent.]
Rewards are used to specify goals (example: training a dog with treats).
This alone does not solve the problem: the agent must still discover which actions lead to reward.
Popular model: MDPs
Markov Decision Process = {S, A, R, T}
Set of states S
Set of actions A
Reward function R
Transition function T
Policy: π: S → A
Problem: find the policy π that maximizes the reward.
Discounted reward: r_0 + γ r_1 + γ^2 r_2 + ... + γ^n r_n
Markov property: T_{ss'} depends only on s and s', not on the history of earlier states.
[Diagram: a trajectory s_0 --a_0--> s_1 --a_1--> s_2 --a_2--> s_3, collecting rewards r_0, r_1, r_2 along the way.]
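As a concrete illustration (not from the slides), here is a minimal sketch of an MDP as plain Python data, with a tiny made-up two-state example and the discounted-reward sum from above; all names and numbers are hypothetical:

```python
# A tiny made-up MDP {S, A, R, T} plus a discount factor gamma.
S = ["s0", "s1"]                 # set of states S
A = ["stay", "go"]               # set of actions A
R = {"s0": 0.0, "s1": 1.0}       # reward function R(s)
T = {                            # transition function: T[s][a][s'] = probability
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
gamma = 0.9                      # discount factor

def discounted_return(rewards, gamma):
    """Discounted reward: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Return of a trajectory s0 -> s1 -> s1 with rewards 0, 1, 1:
print(discounted_return([0.0, 1.0, 1.0], gamma))  # 0 + 0.9 + 0.81 = 1.71
```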
Values of states: V^π(s)
Definition of value V^π(s): the cumulative reward when starting in state s and executing some policy π until a terminal state is reached.
The optimal policy yields V*(s).
R (rewards):
  0  -1  -1  -1
 -1  -1  -1  -1
 -1  -1  -1  -1
 -1  -1  -1   0

V^π(s) (random policy):
   0 -14 -20 -22
 -14 -20 -22 -20
 -20 -22 -20 -14
 -22 -20 -14   0

V*(s) (optimal policy):
  0  -1  -2  -3
 -1  -2  -3  -2
 -2  -3  -2  -1
 -3  -2  -1   0
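The random-policy values above can be reproduced with iterative policy evaluation. A sketch, assuming the usual setup for this 4x4 grid (terminal corners, reward -1 per step, moves off the grid leave the state unchanged, no discounting):

```python
import itertools

# 4x4 gridworld: terminal states in two opposite corners, reward -1 per step.
N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Deterministic move; bumping into the wall leaves the state unchanged."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

# Iterative policy evaluation for the equiprobable random policy.
V = {s: 0.0 for s in itertools.product(range(N), repeat=2)}
while True:
    delta = 0.0
    for s in V:
        if s in terminals:
            continue
        new = sum(0.25 * (-1.0 + V[step(s, a)]) for a in actions)
        delta = max(delta, abs(new - V[s]))
        V[s] = new
    if delta < 1e-9:
        break

for r in range(N):
    print(" ".join(f"{V[(r, c)]:6.1f}" for c in range(N)))  # 0, -14, -20, -22, ...
```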
Determining V^π(s)
Dynamic programming:
V^π(s) = R(s) + γ Σ_{s'} T_{ss'} V^π(s')
- Necessary to consider all states.
TD-learning:
V(s) ← V(s) + α (R(s) + γ V(s') - V(s))
+ Only visited states are used.
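A minimal sketch of the TD-learning update as code; the interaction that produces s, R(s), and s' is assumed to happen elsewhere, and alpha and gamma are the learning rate and discount:

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup: V(s) <- V(s) + alpha * (R(s) + gamma*V(s') - V(s)).

    Only the visited state s is updated, and no transition model T is needed.
    """
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])
```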
Values of state-actions: Q(a,s)
Q-values Q(a,s): the value of doing action a in state s.
Dynamic programming:
Q(a,s) = R(s) + γ Σ_{s'} T_{ss'} max_{a'} Q(a',s')
TD-learning:
Q(a,s) ← Q(a,s) + α (R(s) + γ max_{a'} Q(a',s') - Q(a,s))
T does not appear in this formula: model-free learning!
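A sketch of this model-free update (Q-learning) as tabular Python code, with the Q-table keyed by (action, state) pairs to match the Q(a,s) notation; the interface here is hypothetical:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-table, keyed by (action, state); unseen pairs start at 0

def q_update(s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning backup from a sampled experience (s, a, R(s), s').

    The transition model T never appears: only visited transitions are used.
    """
    best_next = max(Q[(a2, s_next)] for a2 in actions)
    Q[(a, s)] += alpha * (reward + gamma * best_next - Q[(a, s)])
```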
Exploration vs. Exploitation
Only exploitation: new (and possibly better) paths are never discovered.
Only exploration: what has been learned is never exploited.
A good trade-off: explore first to learn, exploit later to benefit.
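One standard way to realize this trade-off (not named on the slide) is an ε-greedy policy with a decaying ε: act randomly with probability ε, greedily otherwise, and shrink ε over time. A sketch, reusing the (action, state)-keyed Q-table from above:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)             # explore: random action
    return max(actions, key=lambda a: Q[(a, s)])  # exploit: greedy action

# Decay epsilon across episodes: explore a lot early, exploit what was learned later.
epsilon = 1.0
for episode in range(1000):
    epsilon = max(0.05, 0.995 * epsilon)
    # ... run one episode, picking actions with epsilon_greedy(Q, s, actions, epsilon)
```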
Some issues
Hidden state: if you don't know where you are, you can't know what to do.
Curse of dimensionality: very large state spaces.
Continuous state/action spaces: the algorithms above use discrete tables. What about continuous values?
Many of your articles discuss solutions to these problems.
Conclusion
RL: learning through interaction and rewards.
The Markov Decision Process is a popular model.
Values of states: V(s)
Values of state-actions: Q(a,s) (model-free!)
Still some open problems: not quite ready for complex real-world problems yet, but research is underway!
Literature
Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
Machine Learning, Tom M. Mitchell
Reinforcement Learning: A Tutorial, Mance E. Harmon and Stephanie S. Harmon