
Introduction to Reinforcement Learning
Freek Stulp
Overview
General principles of RL
Markov Decision Process as model
Values of states: V(s)
Values of state-actions: Q(a,s)
Exploration vs. Exploitation
Issues in RL
Conclusion
General principles of RL
Neural Networks are supervised learning algorithms:
for each input, we know the output.
What if we don't know the output for each input?

Flight control system example
Let the agent learn how to achieve certain goals itself,
through interaction with the environment.
General principles of RL
Let the agent learn how to achieve certain goals
itself, through interaction with the environment.
(Diagram: the agent sends actions to the environment; the environment returns a percept and a reward.)
Rewards are used to specify goals (example: training a dog)
Specifying rewards alone does not solve the problem: the agent still has to learn how to obtain them!
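As a concrete picture of this interaction loop, here is a minimal runnable Python sketch; the coin-guessing environment and the random agent are invented for illustration and are not part of the lecture.

```python
# A minimal sketch of the agent-environment loop. The coin-guessing
# environment and the random agent are invented illustrations.
import random

class CoinEnvironment:
    """Reward +1 if the agent's guess matches a hidden coin flip, else -1."""
    def step(self, action):
        coin = random.choice(["heads", "tails"])
        reward = 1 if action == coin else -1
        percept = coin                       # the agent observes the outcome
        return percept, reward

class RandomAgent:
    def act(self, percept):
        return random.choice(["heads", "tails"])

env, agent = CoinEnvironment(), RandomAgent()
percept = None
for step in range(5):
    action = agent.act(percept)              # agent -> environment: action
    percept, reward = env.step(action)       # environment -> agent: percept, reward
    print(step, action, percept, reward)
```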
Popular model: MDPs
Markov Decision Process = {S, A, R, T}
Set of states S
Set of actions A
Reward function R
Transition function T
Policy: π: S → A
Problem: Find the policy π that
maximizes the reward
Discounted reward:
r0 + γ r1 + γ² r2 + ... + γⁿ rn
Markov property

T_ss' depends only on the current state s, not on the history of earlier states
(Diagram: a trajectory s0 → s1 → s2 → s3, taking actions a0, a1, a2 and receiving rewards r0, r1, r2.)
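As a concrete illustration of the tuple {S, A, R, T} and of the discounted reward, here is a small Python sketch; the two-state example, the reward values, and γ = 0.9 are assumptions made for illustration.

```python
# A sketch of an MDP = {S, A, R, T} written out as plain Python data, plus
# the discounted reward of one trajectory. The two-state example is invented.

S = ["left", "goal"]                          # set of states S
A = ["stay", "move"]                          # set of actions A
R = {"left": -1, "goal": 0}                   # reward function R(s)
T = {                                         # transition function T(s, a) -> {s': prob}
    ("left", "stay"): {"left": 1.0},
    ("left", "move"): {"goal": 1.0},
    ("goal", "stay"): {"goal": 1.0},
    ("goal", "move"): {"goal": 1.0},
}

def discounted_return(rewards, gamma):
    """r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# rewards r0, r1, r2 collected along a trajectory s0 -> s1 -> s2 (values invented)
print(discounted_return([-1, -1, 0], gamma=0.9))   # -1.9
```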
Values of states: Vπ(s)
Definition of value Vπ(s)

Cumulative reward when starting in state s and executing
policy π until a terminal state is reached.
The optimal policy yields V*(s)
R (rewards):
 0  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1   0

Vπ(s) (random policy):
  0  -14  -20  -22
-14  -20  -22  -20
-20  -22  -20  -14
-22  -20  -14    0

V*(s) (optimal policy):
 0  -1  -2  -3
-1  -2  -3  -2
-2  -3  -2  -1
-3  -2  -1   0
Determining Vπ(s)
Dynamic programming
Vπ(s) = R(s) + Σ_s' T_ss' Vπ(s')
- Necessary to consider all states.
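Below is a small runnable sketch of this dynamic-programming sweep; the three-state chain, its rewards, and the fixed policy are invented for illustration (it is not the gridworld from the previous slide).

```python
# Dynamic programming for V(s): repeatedly apply
#   V(s) = R(s) + sum_s' T_ss' * V(s')
# to EVERY state. The three-state chain below is an invented example.

R = {"s0": -1.0, "s1": -1.0, "s2": 0.0}                  # reward per state
T = {"s0": {"s1": 1.0}, "s1": {"s2": 1.0}, "s2": {}}     # T_ss' under a fixed policy

V = {s: 0.0 for s in R}
for sweep in range(50):                 # iterate until the values settle
    for s in R:                         # drawback: must consider all states
        V[s] = R[s] + sum(p * V[s2] for s2, p in T[s].items())

print(V)   # {'s0': -2.0, 's1': -1.0, 's2': 0.0}
```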
TD-learning
V(s) = V(s) + α (R(s) + V(s') - V(s))
+ Only visited states are used.
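And a matching sketch of the TD update, which only touches the states that are actually visited; the trajectory, the learning rate α, and the replay count are invented for illustration.

```python
# TD-learning for V(s): V(s) = V(s) + alpha * (R(s) + V(s') - V(s)),
# applied only to the states visited along a trajectory (invented example).

alpha = 0.5
R = {"s0": -1.0, "s1": -1.0, "s2": 0.0}
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}

trajectory = [("s0", "s1"), ("s1", "s2")]     # visited (s, s') pairs
for episode in range(20):                     # replay the same episode a few times
    for s, s_next in trajectory:
        V[s] = V[s] + alpha * (R[s] + V[s_next] - V[s])

print(V)   # approaches {'s0': -2.0, 's1': -1.0, 's2': 0.0}
```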
Values of state-action: Q(a,s)
Q-values: Q(a,s)
Value of doing an action in a certain state.
Dynamic Programming:
Q(a,s) = R(s) + Σ_s' T_ss' max_a' Q(a',s')
TD-learning:
Q(a,s) = Q(a,s) + α (R(s) + max_a' Q(a',s') - Q(a,s))
T does not appear in this formula: model-free learning!
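A minimal sketch of this model-free update in Python; the states, actions, and the single experience tuple below are invented for illustration.

```python
# Model-free TD update for Q-values:
#   Q(a,s) = Q(a,s) + alpha * (R(s) + max_a' Q(a',s') - Q(a,s))
# The transition function T is never used, only observed experience.
from collections import defaultdict

alpha = 0.5
actions = ["left", "right"]
Q = defaultdict(float)                         # Q[(a, s)], defaults to 0.0

def td_update(s, a, r, s_next):
    best_next = max(Q[(a2, s_next)] for a2 in actions)
    Q[(a, s)] += alpha * (r + best_next - Q[(a, s)])

# one observed experience tuple: state, action, reward, next state (invented)
td_update("s0", "right", -1.0, "s1")
print(Q[("right", "s0")])                      # -0.5
```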
Exploration vs. Exploitation
Only exploitation:

New (maybe better) paths never discovered
Only exploration:

What is learned is never exploited
Good trade-off:

Explore first to learn, exploit later to benefit
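One common way to realize this trade-off is ε-greedy action selection with a decaying ε, sketched below; the Q-values and the decay schedule are assumptions for illustration, not a scheme prescribed by the lecture.

```python
# Epsilon-greedy action selection: explore with probability epsilon,
# otherwise exploit the current Q-values. Decaying epsilon gives
# "explore first, exploit later". Q-values and schedule are invented.
import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: Q.get((a, s), 0.0))    # exploit

Q = {("left", "s0"): -2.0, ("right", "s0"): -1.0}
for episode in range(5):
    epsilon = max(0.1, 1.0 - 0.2 * episode)                  # decay over time
    a = epsilon_greedy(Q, "s0", ["left", "right"], epsilon)
    print(episode, round(epsilon, 1), a)
```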
Some issues
Hidden state

If you don't know where you are, you can't know what to do.
Curse of dimensionality

Very large state spaces.
Continuous state/action spaces

The algorithms above use discrete tables over states and actions.
What about continuous values?
Many of your articles discuss solutions to these problems.
Conclusion
RL: Learning through interaction and rewards.
The Markov Decision Process is a popular model
Values of states: V(s)
Values of state-actions: Q(a,s)
(model-free!)
Still some problems: not quite ready for complex
real-world problems yet, but research is underway!
Literature
Artificial Intelligence: A Modern Approach

Stuart Russell and Peter Norvig
Machine Learning

Tom M. Mitchell
Reinforcement Learning: A Tutorial

Mance E. Harmon and Stephanie S. Harmon