Machine Learning in Computer Game Players


Chikayama & Taura Lab.
M1 Ayato Miki
1. Introduction
2. Computer Game Players
3. Machine Learning in Computer Game Players
4. Tuning Evaluation Functions
   ◦ Supervised Learning
   ◦ Reinforcement Learning
   ◦ Evolutionary Algorithms
5. Conclusion

• Improvements in computer game players
  ◦ DEEP BLUE defeated Kasparov in 1997
  ◦ GEKISASHI and TANASE SHOGI at the World Computer Shogi Championship (WCSC) 2008

• Strong computer game players are usually developed by strong human players
  ◦ Heuristics are input manually
  ◦ A lot of time and energy is devoted to tuning


Machine Learning enables automatic tuning
using a large amount of data
It is not necessary for a developer to be an
expert of the game
4
Computer Game Players
• Games
• Game Trees
• Game Tree Search
• Evaluation Function
• Turn-based games
  ◦ ex. tic-tac-toe, chess, shogi, poker, mah-jong…

• Additional classification
  ◦ two-player or otherwise
  ◦ zero-sum or otherwise
  ◦ deterministic or non-deterministic
  ◦ perfect or imperfect information

Game Tree Model
[Figure: a game tree in which levels alternate between the player's turn and the opponent's turn, and edges such as "move 1" and "move 2" represent moves]
• ex. Minimax search algorithm

[Figure: a minimax tree; Max nodes take the maximum of their children's values and Min nodes the minimum, so the leaf values back up to a root value of 5]
• It is difficult to search all the way to the leaf nodes
  ◦ There are 10^220 possible positions in shogi

• Instead, stop the search at a practicable depth and "evaluate" the frontier nodes
  ◦ Using an evaluation function
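To make the two ideas above concrete, here is a minimal sketch of depth-limited minimax in Python (not from the slides); the `children()` method and the `evaluate` function are assumed interfaces standing in for move generation and the evaluation function:

```python
def minimax(node, depth, maximizing, evaluate):
    """Depth-limited minimax: search `depth` plies, then evaluate.

    `node.children()` is assumed to yield the successor positions, and
    `evaluate(node)` is the evaluation function applied at the depth
    limit or at a leaf (a finished game).
    """
    children = list(node.children())
    if depth == 0 or not children:
        return evaluate(node)
    if maximizing:   # the player's turn: take the best child value
        return max(minimax(c, depth - 1, False, evaluate) for c in children)
    else:            # the opponent's turn: assume the worst
        return min(minimax(c, depth - 1, True, evaluate) for c in children)
```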
• Estimates the superiority of a position

• Elements
  ◦ the feature vector of the position
  ◦ the parameter vector

$V(s) = f(\phi(s), \omega)$

where $\phi(s)$ is the feature vector of position $s$ and $\omega$ is the parameter vector.
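The slides leave $f$ abstract; a linear combination of the features is the most common concrete choice, so the sketch below (the linear form and the example features are assumptions, not from the slides) evaluates a position as a dot product:

```python
import numpy as np

def V(phi_s, omega):
    """V(s) = f(phi(s), omega), here assumed linear: the dot product of the
    feature vector phi(s) of position s with the parameter vector omega."""
    return float(np.dot(omega, phi_s))

# e.g. hypothetical features (material, mobility, king danger) and weights
print(V(np.array([2.0, 5.0, -1.0]), np.array([1.0, 0.1, 0.5])))  # -> 2.0
```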
Machine Learning in Computer Game Players
• Initial work
  ◦ Samuel's research [1959]

• Learning objective
  ◦ What do computer game players learn?
• Many useful techniques
  ◦ Rote learning
  ◦ Quiescence search
  ◦ A 3-layer neural network evaluation function

• And some machine learning techniques
  ◦ Learning through self-play
  ◦ Temporal-difference learning
  ◦ Comparison training
What do they learn?
• Opening Book
• Search Control
• Evaluation Function
• Automatic construction of the evaluation function
  ◦ Construct and select a feature vector automatically
  ◦ ex. GLEM [Buro, 1998]
  ◦ Difficult

• Tuning evaluation function parameters
  ◦ Make a feature vector manually and tune its parameters automatically
  ◦ Easy and effective
Tuning Evaluation Functions
• Supervised Learning
• Reinforcement Learning
• Evolutionary Algorithms
Supervised Learning

• Provide the program with example positions and their exact evaluation values
• Adjust the parameters in a way that minimizes the error between the evaluation function outputs and the exact values

[Figure: example positions labeled with exact values such as 50 and 20; if a position labeled 50 is evaluated as $V(s) = 40$, then $\text{error} = V(s) - 50 = -10$]
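A minimal sketch of this tuning for the linear evaluation above: one stochastic-gradient step that shrinks the squared error between $V(s)$ and the labeled value (the learning rate and the toy numbers are arbitrary):

```python
import numpy as np

def supervised_step(phi_s, label, omega, lr=0.05):
    """Nudge omega to reduce (V(s) - label)^2 for linear V(s) = omega . phi(s)."""
    error = float(np.dot(omega, phi_s)) - label   # e.g. 40 - 50 = -10
    return omega - lr * error * phi_s             # step down the gradient

omega = np.zeros(3)
phi_s = np.array([1.0, 2.0, 0.5])     # features of one example position
for _ in range(100):                  # in practice, loop over many examples
    omega = supervised_step(phi_s, 50.0, omega)
print(np.dot(omega, phi_s))           # -> approaches the label 50
```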
Problems:
• Manually labeling positions is costly
• Giving exact quantitative evaluations is hard

→ Consider a softer approach
Soft Supervised Training

• Requires only the relative order of the possible moves
  ◦ Easier and more intuitive

[Figure: one position judged simply as better than (>) another, with no exact values]
• Comparison training using records of expert games
• Simple relative order: the expert move > all other moves
• Based on optimal control theory
• Minimize the cost function $J$:

$J(s_0, s_1, \ldots, s_{N-1}, \omega) = \sum_{i=0}^{N-1} l(s_i, \omega)$

where $s_i$ are the example positions in the records, $N$ is the total number of example positions, and $l(s_i, \omega)$ is the error function.
Error Function

$l(s, \omega) = \sum_{m=1}^{M-1} T[\xi(s'_m, \omega) - \xi(s'_{m_0}, \omega)]$

where the sum runs over the moves other than the recorded move, and:
◦ $s'_m$ : the child position reached by move $m$
◦ $M$ : the total number of possible moves
◦ $m_0$ : the move played in the record
◦ $\xi(s'_m, \omega)$ : the minimax search value
◦ $T(x)$ : the order discriminant function
• Sigmoid function

$T(x) = \dfrac{1}{1 + e^{-kx}}$

◦ $k$ is the parameter that controls the gradient
◦ As $k \to \infty$, $T(x)$ becomes the step function
◦ In that case, the error function means "the number of moves that were considered to be better than the move in the record"
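A sketch of the error function built from this sigmoid, assuming the minimax search values $\xi$ for each move have already been computed (the function and variable names are illustrative):

```python
import math

def T(x, k=1.0):
    """Order discriminant function: a sigmoid in x; a step function as k -> inf."""
    return 1.0 / (1.0 + math.exp(-k * x))

def l(xi_others, xi_expert, k=1.0):
    """l(s, omega): sum of T(xi(s'_m) - xi(s'_m0)) over the non-expert moves.
    With large k this counts the moves whose search value beats the expert move's."""
    return sum(T(xi_m - xi_expert, k) for xi_m in xi_others)

def J(examples, k=1.0):
    """J: the total cost over the example positions; each example pairs the
    non-expert moves' search values with the expert move's search value."""
    return sum(l(xi_others, xi_expert, k) for xi_others, xi_expert in examples)

# one position: the expert move searches to 30, the other moves to 10, 25, 40
print(l([10.0, 25.0, 40.0], 30.0, k=10.0))  # -> about 1: one move looks better
```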
• 30,000 professional game records and 30,000 high-rating game records from SHOGI CLUB 24 were used
• The weight parameters of about 10,000 feature elements were tuned
• The resulting program won the World Computer Shogi Championship 2006
• It is costly to accumulate a training data set
  ◦ Labeling positions manually takes a lot of time
  ◦ Using expert records has been successful

• But what if there are not enough expert records?
  ◦ New games
  ◦ Minor games

• Other approaches need no training set
  ◦ ex. Reinforcement Learning (next)
Reinforcement Learning
• The learner gets a "reward" from the environment
• In the domain of games, the reward is the final outcome (win/lose)
• Reinforcement learning requires only objective information about the game
[Figure: the final outcomes of games (e.g. +100, -100) propagated back over every move that was played]

→ Inefficient in games…
Temporal-Difference (TD) Learning

[Figure: values propagated step by step between successive positions rather than directly from the final outcome]

$\text{TD error} = r + V(s_{t+1}) - V(s_t)$
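A minimal TD(0) sketch of this update over a table of value estimates; the learning rate and the reward convention (zero until the final outcome) are assumptions:

```python
def td_update(values, s_t, s_next, r, alpha=0.1):
    """Apply TD error = r + V(s_{t+1}) - V(s_t) to a dict of value estimates."""
    td_error = r + values.get(s_next, 0.0) - values.get(s_t, 0.0)
    values[s_t] = values.get(s_t, 0.0) + alpha * td_error
    return td_error

values = {}
game = ["s0", "s1", "s2", "s3"]               # positions of one (self-play) game
for s_t, s_next in zip(game, game[1:]):
    r = 100.0 if s_next == game[-1] else 0.0  # reward only at the final outcome
    td_update(values, s_t, s_next, r)
```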
• TD-Gammon was trained through self-play

Version          Features                     Strength
TD-Gammon 0.0    Raw board information        Top of computer players
TD-Gammon 1.0    Plus additional heuristics   World-championship level
Problems:
• Falling into a local optimum
  ◦ Lack of playing variation
• Solutions
  ◦ Add intentional randomness
  ◦ Play against various players (computer/human)
• Credit Assignment Problem (CAP)
  ◦ It is not clear which action was effective
Evolutionary Algorithms
[Figure: the evolutionary cycle: initialize population → randomly vary individuals → evaluate "fitness" → apply selection → repeat]
• An evolutionary algorithm for a chess player
• Uses an open-source chess program
  ◦ Attempts to tune its parameters
• Make 10 initial parents
  ◦ Initialize their parameters with random values
• Create 10 offspring from each surviving parent by mutating the parental parameters

$\theta'_i = \theta_i + N(0, s'_i)$

where $N(\mu, \sigma)$ is a Gaussian random variable and $s'_i$ is the strategy parameter for parameter $i$.
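In code, the mutation might look like the sketch below (how the strategy parameters $s'_i$ themselves are varied is omitted):

```python
import random

def mutate(theta, s):
    """theta'_i = theta_i + N(0, s'_i): perturb each parameter with Gaussian
    noise whose standard deviation is its own strategy parameter s'_i."""
    return [theta_i + random.gauss(0.0, s_i) for theta_i, s_i in zip(theta, s)]
```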
• Each player plays ten games against randomly selected opponents
  ◦ 10 opponents are selected at random
• The ten best players become the parents of the next generation
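Putting the steps together, a sketch of one run of this selection scheme; `mutate_player` (wrapping the mutation above) and `play_score`, which should return the number of games won out of ten, are assumed helpers:

```python
import random

def evolve(parents, generations, mutate_player, play_score):
    """10 parents, 10 offspring each, ten games per player, ten best survive."""
    for _ in range(generations):
        population = parents + [mutate_player(p)
                                for p in parents for _ in range(10)]
        # each player plays ten games against randomly selected opponents
        scores = [play_score(p, random.sample(population, 10))
                  for p in population]
        ranked = sorted(zip(scores, range(len(population)), population),
                        reverse=True)            # index breaks score ties
        parents = [p for _, _, p in ranked[:10]]
    return parents
```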
The tuned parameters are:
• Material values
• Positional values
• The weights and biases of three neural networks
• Each network has 3 layers
  ◦ Input = the arrangement of a specific area of the board (the front 2 rows, the back 2 rows, or the center 4x4 square)
  ◦ Hidden = 10 units
  ◦ Output = the worth of that area's arrangement

[Figure: a network with 16 inputs, 10 hidden units, and 1 output]
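A sketch of one such 16-10-1 network's forward pass (the tanh hidden activation and the input encoding are assumptions; the slides do not specify them):

```python
import numpy as np

def area_value(x, W1, b1, W2, b2):
    """16 inputs (one board area) -> 10 hidden units -> 1 output: the
    worth of that area's arrangement."""
    h = np.tanh(W1 @ x + b1)      # W1: (10, 16), b1: (10,)
    return float(W2 @ h + b2)     # W2: (1, 10),  b2: (1,)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 16)    # encoded contents of a 16-square area
print(area_value(x, rng.normal(size=(10, 16)), rng.normal(size=10),
                 rng.normal(size=(1, 10)), rng.normal(size=1)))
```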
• Initial rating = 2066 (Expert)
  ◦ The rating of the original open-source player
• 10 independent trials (each of 50 generations)
• Best rating = 2437 (Senior Master)
• But the program still cannot compete with the strongest chess programs (around rating 2800)
Conclusion
Method                    Advantages                 Disadvantages
Supervised Learning       Direct and effective       Manual labeling cost
Reinforcement Learning    Wide application           Local optima, CAP
Evolutionary Algorithms   Wide application, no CAP   Indirect, random dispersion
Future work:
• Automatic position labeling
  ◦ Using records or computer play
• More sophisticated rewards
  ◦ Consider the opponent's strength
  ◦ Analyze moves for credit assignment
• Experiments in other games