Transcript slides

G54DMT – Data Mining Techniques and
Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
[email protected]
Topic 3: Data Mining
Lecture 2: Evolutionary Learning
Outline of the lecture
• Introduction and taxonomy
• Genetic algorithms
• Knowledge Representations
• Paradigms
• Two complete examples
– GAssist
– BioHEL
• Resources
Evolutionary Learning
• Application of any kind of evolutionary computation method (list below) to machine learning tasks:
– Genetic Algorithms
– Genetic Programming
– Evolution Strategies
– Ant Colony Optimization
– Particle Swarm Optimization
• Also known as
– Genetics-Based Machine Learning (GBML)
– Learning Classifier Systems (LCS), a subset of GBML
Paradigms and representation
• EL involves a huge mix of
– Search methods (previous slide)
– Representations
– Learning paradigms
• Learning paradigms: how the solution to the
machine learning problem is generated
• Representations: rules, decision trees,
synthetic prototypes, hyperspheres, etc.
Genetic Algorithm working cycle
[Diagram: the GA working cycle. Population A is evaluated; selection produces population B, crossover produces population C, mutation produces population D, which is then evaluated as the next population.]
Genetic Algorithms: terms
• Population
– Possible solutions of the problem
– Traditionally represented as bit-strings (e.g. each bit
associated with a feature, indicating whether it is selected or not)
– Each bit of an individual is called a gene
– Initial population is created at random
• Evaluation
– Giving a goodness value to each individual in the
population
• Selection
– Process that rewards good individuals
– Good individuals will survive, and get more than one copy
in the next population. Bad individuals will disappear
Genetic Algorithms
• Crossover
– Exchanging subparts of the solutions
– Examples: 1-point crossover, uniform crossover
– The crossover stage takes two individuals from
the population (parents) and, with a certain
probability Pc, generates two offspring
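To make the cycle above concrete, here is a minimal sketch of one GA generation over bit-string individuals (Python; the tournament selection, Pc and Pm values and function names are illustrative, not taken from the slides):

import random

def evolve(population, fitness, pc=0.6, pm=0.01):
    """One GA generation: evaluation, selection, crossover, mutation."""
    scores = [fitness(ind) for ind in population]          # evaluation
    def tournament():                                      # selection (tournament of size 2)
        a, b = random.sample(range(len(population)), 2)
        return population[a] if scores[a] >= scores[b] else population[b]
    next_pop = []
    while len(next_pop) < len(population):
        p1, p2 = tournament(), tournament()
        if random.random() < pc:                           # 1-point crossover with probability Pc
            cut = random.randint(1, len(p1) - 1)
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        next_pop += [[g ^ 1 if random.random() < pm else g for g in ind]   # bit-flip mutation
                     for ind in (p1, p2)]
    return next_pop[:len(population)]

Individuals are lists of 0/1 ints; calling evolve repeatedly with a problem-specific fitness function implements the loop in the diagram.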
Knowledge representations
• For nominal attributes
– Ternary representation
– GABIL representation
• For real-valued attributes
– Hyperrectangles
– Decision tree
– Synthetic prototypes
– Others
Ternary representation
• Used by XCS (Michigan LCS)
– Three-letter alphabet {0,1,#} for binary problems
• # means “don’t care”, that is, that the attribute is irrelevant
– If A1=0 and A2=1 and A3 is irrelevant → class 0
01#|0
– For non-binary nominal attributes:
• {0,1, 2, …, n,#}
– Crossover and mutation act as in a classic GA
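A minimal sketch of how a ternary rule matches a binary instance (the rule is the one from the slide; the instances are illustrative):

def ternary_match(rule, instance):
    """A rule over {0, 1, #} matches an instance if every non-# position agrees."""
    return all(r == '#' or r == x for r, x in zip(rule, instance))

# Rule 01#|0: if A1=0 and A2=1 and A3 is irrelevant, predict class 0
rule, rule_class = "01#", "0"
print(ternary_match(rule, "011"))  # True  -> predict class 0
print(ternary_match(rule, "111"))  # False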
GABIL representation
– Predicate  Class
– Predicate: Conjunctive Normal Form (CNF)
(A1=V11..  A1=V1n) ..... (An=Vn2..  An=Vnm)
• Ai : ith attribute
• Vij : jth value of the ith attribute
– The rules can be mapped into a binary string
1100|0010|1001|1
• 2 Variables:
– Sky = {clear, partially cloudy, dark clouds}
– Pressure = {Low, Medium, High}
• 2 Classes: {no rain, rain}
• Rule: If [sky is (partially cloudy or has dark clouds)] and [pressure is
low] then predict rain
• Genotype: “011|100|1”
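A sketch of matching with the GABIL genotype from the example above (the value orderings follow the slide; the helper function is illustrative):

# Genotype "011|100|1": one bit per value of Sky and Pressure, plus the class bit
sky_values = ["clear", "partially cloudy", "dark clouds"]
pressure_values = ["Low", "Medium", "High"]

def gabil_match(genotype, sky, pressure):
    """The rule matches if, for every attribute, the bit of the observed value is 1."""
    sky_bits, pressure_bits, _cls = genotype.split("|")
    return (sky_bits[sky_values.index(sky)] == "1" and
            pressure_bits[pressure_values.index(pressure)] == "1")

print(gabil_match("011|100|1", "dark clouds", "Low"))  # True  -> predicts rain
print(gabil_match("011|100|1", "clear", "Low"))        # False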
Hyper-rectangle representation
• The rule’s predicate encodes an interval for
each of the dimensions of the domain,
effectively generating an hyperrectangle
If (X<0.25 and Y<0.25) then …
• Different ways of encoding the interval
– X< value, X> value, X in [l,u]
– Encoding the actual bounds (UBR, NAX)
– Encoding the interval as center±spread (XCSR)
– What if u < l?
• Flipping them (UBR)
• Declaring the attribute as irrelevant (NAX)
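A sketch of interval matching with bound encoding, including the two policies mentioned above for the case u < l (names and data layout are illustrative):

def interval_match(value, lower, upper, policy="UBR"):
    """Match one attribute against an encoded interval [lower, upper]."""
    if lower > upper:
        if policy == "UBR":            # flip the bounds
            lower, upper = upper, lower
        elif policy == "NAX":          # treat the attribute as irrelevant
            return True
    return lower <= value <= upper

def rule_match(instance, intervals, policy="UBR"):
    """instance: list of real values; intervals: list of (lower, upper) pairs."""
    return all(interval_match(x, l, u, policy)
               for x, (l, u) in zip(instance, intervals))

# If (X < 0.25 and Y < 0.25) then ..., expressed as intervals over [0, 1]
print(rule_match([0.1, 0.2], [(0.0, 0.25), (0.0, 0.25)]))  # True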
Decision tree representation
• Each individual literally encodes a complete
decision tree [Llora, 02]
• Only suitable for the Pittsburgh approach
• Decision tree can be axis-parallel or oblique
• Crossover
– Exchange of sub-branches of a tree between parents
• Mutation
– Change of the definition of a node/leaf
– Total replacement of a tree’s sub-branch
Synthetic Prototypes
representation [Llora, 02]
• Each individual is a set of synthetic instances
• These instances are used as the core of a nearest-neighbor classifier
[Figure: four synthetic prototypes plotted in the (X, Y) plane:
1. (-0.125, 0, yellow)
2. (0.125, 0, red)
3. (0, -0.125, blue)
4. (0, 0.125, green)]
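A sketch of classification with the four prototypes from the figure, treating each prototype's colour as its class label and using them as the core of a nearest-neighbor classifier:

import math

# (x, y, class) triplets from the figure above
prototypes = [(-0.125, 0.0, "yellow"), (0.125, 0.0, "red"),
              (0.0, -0.125, "blue"), (0.0, 0.125, "green")]

def classify(x, y):
    """Predict the class of the nearest synthetic prototype."""
    return min(prototypes, key=lambda p: math.hypot(p[0] - x, p[1] - y))[2]

print(classify(0.3, 0.05))    # "red": closest prototype is (0.125, 0, red)
print(classify(-0.2, -0.01))  # "yellow"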
Other representations for
continuous problems
• Hyperellipsoid representation (XCS)
– Each rule encodes a (hyper)ellipse over the search space
• Smooth, non-linear frontiers
• Arbitrary rotation
– Encoded as
• Center
• Stretches across dimensions
• Rotation angles
• Neural representation (XCS)
– Each individual is a complete MLP, and evolution can
change both the weights and the network topology
Learning Paradigms
• Different ways of generating a solution
– Is each individual a rule, or a rule set?
– Is the solution the best individual, or the whole
population?
– Is the solution generated in a single GA run?
• The Pittsburgh approach (a type of LCS)
• The Michigan approach (a type of LCS)
• The Iterative Rule Learning approach
The Pittsburgh Approach
• Each individual is a complete solution to the
classification problem
• Traditionally this means that each individual is
a variable-length set of rules
• The final solution is the best individual from
the population after the GA run
• Fitness function is based on the rule set
accuracy on the training set (usually also on
complexity)
• GABIL [De Jong & Spears, 91] is a classic
example
Pittsburgh approach:
recombination
– Crossover operator: two parent rule sets exchange parts of their rule lists to produce two offspring (see the sketch below)
– Mutation operator: classic GA mutation of bit inversion
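A minimal sketch of Pittsburgh-style recombination over variable-length rule sets, simplified so that cut points fall on rule boundaries (a simplification of the operators used by real Pittsburgh systems such as GABIL):

import random

def rule_set_crossover(parent1, parent2):
    """Exchange the tails of two variable-length rule sets (cut at rule boundaries)."""
    c1 = random.randint(1, len(parent1) - 1)
    c2 = random.randint(1, len(parent2) - 1)
    return parent1[:c1] + parent2[c2:], parent2[:c2] + parent1[c1:]

# Rule sets represented as lists of rule strings (GABIL-style genotypes)
p1 = ["011|100|1", "101|010|0", "110|001|1"]
p2 = ["111|100|0", "001|110|1"]
print(rule_set_crossover(p1, p2))   # two offspring of (possibly) different lengths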
The Michigan Approach
• Each individual (classifier) is a single rule
• The whole population cooperates to solve the
classification problem
• A reinforcement learning system is used to
identify the good rules
• A GA is used to explore the search space for
more rules
• XCS [Wilson, 95] is the most well-known
Michigan LCS
The Michigan approach
• What is Reinforcement Learning?
– “a way of programming agents by reward and
punishment without needing to specify how the
task is to be achieved” [Kaelbling, Littman, &
Moore, 96]
– Rules will be evaluated example by example,
receiving a positive/negative reward
– Rule fitness will be updated incrementally with this
reward
– After enough trials, good rules should have high
fitness
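A minimal sketch of an incremental, reward-driven fitness update of the kind described above (the update rule and constants are illustrative; XCS actually maintains separate prediction, error and accuracy-based fitness estimates):

def update_fitness(rule_fitness, reward, beta=0.2):
    """Move the rule's fitness estimate a small step towards the latest reward."""
    return rule_fitness + beta * (reward - rule_fitness)

fitness = 0.0
for correct in [True, True, False, True, True]:   # rule evaluated example by example
    fitness = update_fitness(fitness, 1000.0 if correct else 0.0)
print(fitness)  # after enough trials, good rules accumulate high fitness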
Michigan system’s working cycle
Iterative Rule Learning approach
• This approach implements the separate-and-conquer method of rule learning
– Each individual is a rule
– A GA run ends up generating a single good rule
– Examples covered by the rule are removed from
the training set, and process starts again
• First used in evolutionary learning in the SIA
system [Venturini, 93]
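A sketch of the separate-and-conquer loop implemented by the Iterative Rule Learning approach (run_ga and matches are placeholders for the GA run and the rule-matching routine):

def iterative_rule_learning(training_set, run_ga, matches):
    """Learn rules one at a time, removing the examples each new rule covers."""
    rule_set = []
    remaining = list(training_set)
    while remaining:
        rule = run_ga(remaining)                 # one GA run -> one good rule
        rule_set.append(rule)
        covered = [ex for ex in remaining if matches(rule, ex)]
        if not covered:                          # safeguard against non-covering rules
            break
        remaining = [ex for ex in remaining if not matches(rule, ex)]
    return rule_set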
The Gassist Pittsburgh LCS
[Bacardit, 04]
• Genetic clASSIfier SysTem
• Designed with three aims
1. Generate compact and accurate solutions
2. Run-time reduction
3. Be able to cope with both continuous and discrete data
• Objectives achieved by several components (the number indicates which aim each one addresses)
– ADI rule representation (3)
– Explicit default rule mechanism (1)
– ILAS windowing scheme (2)
– MDL-based fitness function (1)
– Initialization policies (1)
– Rule deletion operator (1)
GAssist components in the GA
cycle
• Representation
– ADI representation
– Explicit default rule mechanism
• GA cycle (diagram): Initialization (initialization policies) → Evaluation (MDL fitness function, ILAS windowing) → Selection → Crossover → Mutation (standard operators)
GAssist: Default Rule mechanism
• When a rule set is encoded as a decision list we can observe an interesting behavior: the emergent generation of a default rule
• Using a default rule can help generate a more compact rule set
– Easier to learn (smaller search space)
– Potentially less sensitive to overlearning
• To maximize these benefits, the knowledge
representation is extended with an explicit
default rule
GAssist: Default Rule mechanism
• What class is assigned to the default rule?
– Simple policies such as using the
majority/minority class are not robust enough
– Automatic determination of default class
• The initial population contains individuals with all
default classes
• Evolution will choose the correct default class
• In the first few iterations the different default classes
will be isolated: each is a separate subpopulation
– Different default classes learn at different rates
• Afterwards, restrictions are lifted and the system is
free to pick the best policy
GAssist: Initialisation policy
• Initialization policy
– Probability of a rule matching a random instance
• In GABIL each gene associated with a value of an attribute
is independent of the other values
• Therefore the probability of matching an attribute
equals the probability P1 of setting a gene to 1 when initializing the
chromosome
P(match) = (P1)^a    (a = number of attributes)
– Probability of a rule set matching a random
instance (r = number of rules in the set)
P(match rule set) = 1 - (1 - P(match))^r = 1 - (1 - (P1)^a)^r
GAssist: Initialisation policy
• Initialization policy
– How can we derive a formula to adjust P1 ?
• We use an explicit default rule mechanism
• If we suppose equal class distribution, we have to make sure that
we match all but one of the classes
P(match rule set) = 1 - 1/nc
(1 - (P1)^a)^r = 1/nc
P1 = (1 - (1/nc)^(1/r))^(1/a)
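The adjusted P1 above can be computed directly; a small sketch (the parameter values are illustrative):

def adjusted_p1(a, r, nc):
    """P1 = (1 - (1/nc)**(1/r))**(1/a): probability of initialising a gene to 1 so
    that a rule set of r rules over a attributes matches a random instance with
    probability 1 - 1/nc (i.e. all classes but the default one)."""
    return (1.0 - (1.0 / nc) ** (1.0 / r)) ** (1.0 / a)

print(adjusted_p1(a=10, r=20, nc=3))  # ~0.75 for 10 attributes, 20 rules, 3 classes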
GAssist: Initialisation policy
• Covering operator
– Each time a new rule has to be created, an
instance is sampled from the training set
– The rule is created as a generalized version of the
example
• Makes sure it matches the example
• It covers not just the examples, but a larger area of the
search space
– Two methods of sampling instances from the
training set
• Uniform probability for each instance
• Class-wise sampling probability
GAssist: Rule deletion operator
• Operator applied after the fitness computation
• Rules that do not match any training example are
eliminated
• The operator leaves a small number of ‘dead’ rules
in each individual, acting as protective neutral code
– If crossover is applied over a dead rule, it does not matter,
it will not break a good rule
– However, if too many dead rules are present, exploration is
inefficient, and the population loses diversity
GAssist: ILAS windowing scheme
• Windowing: use of a subset of examples to perform
fitness computations
• Incremental Learning with Alternating Strata (ILAS)
• The mechanism uses a different subset of training
examples in each GA iteration
[Diagram: the training set of Ex examples is split into n strata of size Ex/n (boundaries at 0, Ex/n, 2·Ex/n, 3·Ex/n, …, Ex); successive GA iterations (0 … Iter) alternate over these strata]
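A minimal sketch of the ILAS idea: split the training set into strata and use a different stratum for fitness computation at each GA iteration (the partitioning scheme below is illustrative):

def ilas_strata(training_set, num_strata):
    """Partition the training set into num_strata roughly equal subsets."""
    return [training_set[i::num_strata] for i in range(num_strata)]

def stratum_for_iteration(strata, iteration):
    """Each GA iteration evaluates fitness on a different (alternating) stratum."""
    return strata[iteration % len(strata)]

strata = ilas_strata(list(range(12)), num_strata=3)
for it in range(4):
    print(it, stratum_for_iteration(strata, it))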
BioHEL [Bacardit et al, 09]
• BIO-inspired HiErarchical Learning
• Successor of GAssist, but changing paradigms:
uses the Iterative Rule Learning approach
• Created to overcome the scalability limitations
of GAssist
• It still employs
– The default rule mechanism (but without the automatic default-class policy)
– ILAS windowing scheme
BioHEL: fitness function
• Fitness function definition is trickier than in GAssist,
as it is impossible to have a global control over the
solution
• As in any separate-and-conquer method, the system
should favor rules that are
– Accurate (do not make mistakes)
– General (that cover many examples)
• These two objectives are contradictory, especially in
real-world problems: the best way of increasing the
accuracy is by creating very specific rules
• BioHEL redefines coverage as a piece-wise function,
which rewards rules that cover at least a certain
fraction of the training set
BioHEL: fitness function
• Coverage term penalizes rules that do not cover a minimum
percentage of examples
• Choice of the coverage break is crucial for the proper
performance of the system
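A hypothetical piece-wise coverage term, only to illustrate the shape described above (BioHEL's actual function and its coverage break value differ; a minimal sketch):

def coverage_term(covered, total, coverage_break=0.1):
    # Hypothetical: reward grows steeply up to the coverage break, then slowly beyond it
    cov = covered / total
    if cov < coverage_break:
        return cov / coverage_break       # strong penalty for rules below the break
    return 1.0 + (cov - coverage_break)   # small extra reward above the break

print(coverage_term(5, 100))   # 0.5 -> rule covers too few examples
print(coverage_term(30, 100))  # 1.2 -> rule reaches the required fraction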
BioHEL: ALKR
• The Attributes List Knowledge Representation (ALKR)
• This representation exploits a very frequent situation
– In high-dimensionality domains it is usual that each rule
only uses a very small subset of the attributes
• Example of a rule learned from a bioinformatics dataset
[Bacardit and Krasnogor, 2009]
• Att Leu-2 ∈ [-0.51,7] and Glu ∈ [0.19,8] and
Asp+1 ∈ [-5.01,2.67] and Met+1 ∈ [-3.98,10] and
Pro+2 ∈ [-7,-4.02] and Pro+3 ∈ [-7,-1.89] and
Trp+3 ∈ [-8,13] and Glu+4 ∈ [0.70,5.52] and
Lys+4 ∈ [-0.43,4.94] → alpha
• Only 9 attributes out of 300 were actually in the rule
BioHEL: ALKR
• Function match(instance x, rule r), in Python form:
def match(x, rule):
    # rule maps every attribute of the domain to (lower, upper), or None if irrelevant
    for att, interval in rule.items():       # loops over ALL attributes of the domain
        if interval is None:                 # attribute not relevant in this rule
            continue
        lower, upper = interval
        if x[att] < lower or x[att] > upper:
            return False
    return True
• Given the previous example of a rule, 291 of this loop’s 300 iterations are wasted!
• Can we get rid of them?
BioHEL: ALKR
• ALKR automatically identifies the relevant attributes
in the domain for each rule and tracks just them
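A sketch of the ALKR idea: the rule stores only its relevant attributes, so the match loop (compare with the function above) visits just those (the data layout and indices are illustrative):

def alkr_match(instance, attribute_list):
    """attribute_list: (attribute_index, lower, upper) entries for relevant attributes only."""
    for att, lower, upper in attribute_list:
        if instance[att] < lower or instance[att] > upper:
            return False
    return True

# Only the relevant attributes are tracked: a handful of entries instead of 300
rule = [(12, -0.51, 7.0), (57, 0.19, 8.0), (101, -5.01, 2.67)]  # made-up indices
print(alkr_match([1.0] * 300, rule))  # True: 1.0 falls inside all three intervals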
BioHEL’s ALKR
• Simulated 1-point crossover
BioHEL: ALKR
• In ALKR two operators (specialize and generalize) add
or remove attributes from the list with a given
probability, hence exploring the rule-wise space of
the relevant attributes
• The ALKR match process is more efficient; its recombination
operators, however, are costlier and two extra operators are needed
• Since an ALKR chromosome only contains relevant
information, the overall exploration process is more
efficient
BioHEL: CUDA-based fitness
computation
• NVIDIA’s Compute Unified Device Architecture (CUDA) is a
parallel computing architecture that exploits the capacity
within NVIDIA’s Graphic Processor Units
• CUDA runs thousands of threads at the same time  Single
Program, Multiple Data paradigm
• In the last few years GPUs have been extensively used in the
evolutionary computation field
– Many papers and applications are available at
http://www.gpgpgpu.com
• Using GPGPUs for machine learning is a greater
challenge because it involves larger volumes of data, but this also means
the computation is potentially more parallelizable
CUDA architecture
CUDA memory management
• Different types of memory with different access
speed
– Global memory (slow and large)
– Shared memory (block-wise; fast but quite small)
– Constant memory (very fast but very small)
• The memory is limited
• The memory copy operations involve a considerable
amount of execution time
• Since we are aiming to work with large-scale datasets,
a good strategy to minimize the execution time is to
manage memory usage carefully
CUDA for matching a set of rules
• The match process is the most
computationally expensive stage
• However, performing only the match
inside the GPU means downloading
from the card a structure of size
O(NxM) (N=population size,
M=training set size)
• In most cases we don’t need to
know the specific matches of a
classifier, just how many (reduce the
data)
• Performing the second stage also
inside the GPU allows the system to
reduce the memory traffic to O(N)
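A CPU-side sketch (NumPy) of the strategy described above: compute the N×M match matrix but reduce it to per-rule counts before transferring anything, so only O(N) numbers leave the device. This only mirrors the idea; the actual BioHEL implementation is a CUDA kernel.

import numpy as np

def match_counts(lowers, uppers, data):
    """lowers, uppers: (N, d) rule bounds; data: (M, d) instances.
    Returns an (N,) vector of match counts instead of the full (N, M) matrix."""
    # matches[i, j] is True if instance j falls inside every interval of rule i
    matches = np.all((data[None, :, :] >= lowers[:, None, :]) &
                     (data[None, :, :] <= uppers[:, None, :]), axis=2)
    return matches.sum(axis=1)          # reduction: O(N*M) booleans -> O(N) counts

rng = np.random.default_rng(0)
data = rng.random((1000, 5))            # M=1000 instances, d=5 attributes
lowers = rng.random((20, 5)) * 0.5      # N=20 rules with random intervals
uppers = lowers + 0.5
print(match_counts(lowers, uppers, data))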
CUDA in BioHEL
Performance of CUDA alone
• We used CUDA in a Tesla C1060 card with 4GB of global
memory, and compared the run-time to that of Intel Xeon
E5472 3.0GHz processors
• Biggest speedups obtained in large problems (|T| or #Att),
especially in domains with continuous attributes
• Run time for the largest dataset reduced from 2 weeks to 8
hours
CUDA fitness in combination with ILAS
• The speedups of CUDA and ILAS are
cumulative
Resources
• A very thorough survey on GBML is available
here
• Thesis of Martin Butz on XCS, including
theoretical models and advanced exploration
methods (later a book)
• My thesis, about Gassist (code)
• Complete description of BioHEL (code)
Questions?