An Introduction to Genetic Algorithms and the application


Institute of Information Science, Academia Sinica
Research Assistant: Lin, Hsin-Nan (林信男)

Outline

- Optimization Problems
- Optimization Techniques
- Genetic Algorithms
- Two brief examples
- NMR Backbone Assignment

Optimization Problems

- Definition: a computational problem in which the objective is to find the best of all possible solutions.
- Classical optimization problems
    - Traveling Salesman Problem
    - Knapsack Problem
    - Prisoner's Dilemma

Search spaces and Fitness Landscapes

- Search space: some collection of candidate solutions to a problem.
- Fitness landscape: a representation of the space of all possible genotypes along with their fitness.

Example: Traveling Salesman Problem

- Search space: n! orderings of n cities.
- No known efficient (polynomial-time) method guarantees the optimal solution; in the worst case, finding it amounts to exhaustively searching the possible tours.

Optimization Techniques

- Heuristic methods
    - Hill Climbing
    - Simulated Annealing
    - Genetic Algorithms
    - Ant Colony Optimization
    - ...

Hill Climbing

- Iterative improvement.
- Start at a random configuration; repeatedly consider various moves, accepting some and rejecting others. When stuck, restart.
- We must define a move set that describes which moves will be considered from any candidate solution.
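The procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy objective (count the 1s in a bit string), the single-bit-flip move set, and the names `count_ones` and `hill_climb` are not from the slides.

```python
import random

def count_ones(bits):
    """Toy objective (assumed for illustration): the number of 1s."""
    return sum(bits)

def hill_climb(score, n_bits, restarts=5, seed=0):
    """Hill climbing with random restarts over bit strings.

    Move set (an assumption): flip exactly one bit. A move is accepted
    only if it strictly improves the score; when no move helps, restart.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        current = [rng.randint(0, 1) for _ in range(n_bits)]
        improved = True
        while improved:
            improved = False
            for i in range(n_bits):
                neighbor = current.copy()
                neighbor[i] ^= 1  # flip bit i
                if score(neighbor) > score(current):
                    current = neighbor
                    improved = True
        if best is None or score(current) > score(best):
            best = current
    return best
```

On this toy objective every restart reaches the global optimum; on a landscape with many local optima, the restarts are what give the method a chance of escaping them.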

Genetic Algorithms

- A search technique based on the principle of evolution
    - Survival of the fittest in natural selection
- Developed by John Holland, University of Michigan (1970s)
- GAs use two basic processes from evolution
    - Inheritance (passing of features from one generation to the next)
    - Competition (survival of the fittest)

Basic Description of GAs

- The algorithm starts with a set of solutions (represented by chromosomes) called the population.
- Solutions from one population are taken and used to form a new population.
- The hope is that the new population (offspring) will be better than the old one (parents).
- Solutions chosen to form new solutions are selected according to their fitness: the more suitable they are, the more chances they have to reproduce.

Basic Genetic Algorithm

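A minimal sketch of one way to write the basic loop, assuming a bit-string encoding, fitness-proportionate selection, single-point crossover and bit-flip mutation. The function name `run_ga` and all default parameter values are illustrative choices, not taken from the slides.

```python
import random

def run_ga(fitness, n_bits, pop_size=30, generations=60,
           p_c=0.8, p_m=0.01, seed=0):
    """Evolve bit strings; returns the best chromosome seen.

    Assumes n_bits >= 2 and a non-negative fitness function.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection weights; the tiny offset avoids an all-zero wheel.
        weights = [fitness(ch) + 1e-9 for ch in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]
            if rng.random() < p_c:               # crossover
                cut = rng.randint(1, n_bits - 1)
                c1 = p1[:cut] + p2[cut:]
                c2 = p2[:cut] + p1[cut:]
            for child in (c1, c2):               # mutation
                for i in range(n_bits):
                    if rng.random() < p_m:
                        child[i] ^= 1
                new_pop.append(child)
        pop = new_pop[:pop_size]
        best = max(pop + [best], key=fitness)
    return best
```

For example, `run_ga(sum, 20)` searches for a 20-bit string with as many 1s as possible.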

Operators of GA: Encoding

- The chromosome should in some way contain information about the solution it represents.
- The most common encoding is a binary string.
    Chromosome 1: 1101100100110110
    Chromosome 2: 1101111000011110
- Each bit in the string can represent some characteristic of the solution.
- Chromosomes may be of fixed or variable length.

Operators of GA: Selection

- Chromosomes are selected from the population to be parents for crossover.
- According to Darwin's theory of evolution, the best ones should survive and create new offspring.
- Examples: roulette wheel selection, Boltzmann selection, tournament selection, rank selection, steady-state selection, and others.

Roulette Wheel Selection

- Parents are selected according to their fitness.
- The better the chromosomes are, the more chances they have of being selected.
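A small sketch of roulette wheel selection, in which each individual gets a slice of the wheel proportional to its fitness. The function name `roulette_select` and its signature are assumptions made for illustration; fitness values are assumed to be non-negative.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("roulette wheel needs a positive total fitness")
    spin = rng.uniform(0, total)      # where the wheel stops
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit                # each slice is as wide as the fitness
        if spin < running:
            return individual
    return population[-1]             # guard against float rounding
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.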

Operators of GA: Crossover

- Crossover selects genes from the parent chromosomes and creates new offspring.

    Chromosome 1: 11011 | 00100110110
    Chromosome 2: 11000 | 11000011110
    Offspring 1:  11011 | 11000011110
    Offspring 2:  11000 | 00100110110

    ("|" is the crossover point)

Crossover

- Single point crossover: 11001 011 + 11011 111 = 11001 111
- Two point crossover: 11 0010 11 + 11 0111 11 = 11 0111 11
- Uniform crossover: 11001011 + 11011101 = 11011111
- Arithmetic crossover: 11001011 + 11011111 = 11001011 (AND)
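The four crossover variants can be sketched on bit strings as follows. The function names, the explicit cut positions, and the mask convention for uniform crossover are assumptions made for illustration; a real GA would draw the cut points and the mask at random.

```python
def single_point(p1, p2, cut):
    """Head of p1 up to the cut, tail of p2 after it."""
    return p1[:cut] + p2[cut:]

def two_point(p1, p2, a, b):
    """Outer parts from p1, the segment between a and b from p2."""
    return p1[:a] + p2[a:b] + p1[b:]

def uniform(p1, p2, mask):
    """mask[i] == '1' copies bit i from p1, '0' copies it from p2."""
    return "".join(x if m == "1" else y
                   for x, y, m in zip(p1, p2, mask))

def bitwise_and(p1, p2):
    """'Arithmetic' crossover using the AND operation."""
    return "".join("1" if x == y == "1" else "0"
                   for x, y in zip(p1, p2))
```

For instance, `single_point("11001011", "11011111", 5)` returns `"11001111"`, and `two_point("11001011", "11011111", 2, 6)` returns `"11011111"`.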

Operators of GA: Mutation

- Mutation prevents all solutions in the population from falling into a local optimum of the problem being solved.
- Mutation randomly changes the new offspring.

    Original offspring 1: 1101111000011110
    Mutated offspring 1:  1100111000011110
    Original offspring 2: 1101100100110110
    Mutated offspring 2:  1101101100110110
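A minimal sketch of bit-flip mutation, assuming each bit is flipped independently with a small probability `p_m`; the function name `mutate` is illustrative.

```python
import random

def mutate(bits, p_m, rng=random):
    """Flip each bit of a bit string independently with probability p_m."""
    return "".join(
        ("0" if b == "1" else "1") if rng.random() < p_m else b
        for b in bits
    )
```

With `p_m = 0.0` the string is returned unchanged, and with `p_m = 1.0` every bit is flipped; useful values lie near the low end of that range.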

Control Parameters

- Population size and number of generations.
- A GA also has two control parameters: the crossover rate (p_c) and the mutation rate (p_m).
- p_c (0.5~1.0): the higher the value of p_c, the more quickly new solutions are introduced into the population.
- p_m (0.005~0.05): large values of p_m transform the GA into a purely random search algorithm, while some mutation is required to prevent premature convergence of the GA to suboptimal solutions.

How Do Genetic Algorithms Work?

- GAs work by discovering and recombining good "building blocks" of solutions in a highly parallel fashion.
- Good solutions tend to be made up of good building blocks.

An example of building block

- Password guessing
    - Chromosome 1: alg ehklppr (fitness: 3)
    - Chromosome 2: bgrek ithms (fitness: 5)
    - Child: algehithms (fitness: 8)
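The fitness values in this example are consistent with counting the characters that match a hidden password position by position. The sketch below assumes the password is "algorithms", which is inferred from the example rather than stated on the slide; the function name is illustrative.

```python
def password_fitness(guess, target="algorithms"):
    """Number of positions where the guess matches the target.

    The default target is an assumption inferred from the example.
    """
    return sum(g == t for g, t in zip(guess, target))

# The child keeps the good block "alg" from one parent and "ithms"
# from the other (single-point crossover after the fifth letter).
parent1, parent2 = "algehklppr", "bgrekithms"
child = parent1[:5] + parent2[5:]
```

Here `child` is `"algehithms"`, and its fitness is the sum of the two parents' fitness values because their good blocks do not overlap.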

Two Brief Examples

- Strategies for the Prisoner's Dilemma
- Global Sequence Alignment

Prisoner's Dilemma

- Two suspects, A and B, are arrested by the police. The police have insufficient evidence for a conviction and, having separated the prisoners, visit each of them to offer the same deal.

The strategies

- If you suspect that your opponent is going to
    - cooperate, then you should surely defect.
    - defect, then you should defect too.
- The dilemma is that if both players defect, each gets a worse score than if both cooperate.
- Both players remember the previous games.
- A simple and good strategy: TIT FOR TAT.
- Could a GA evolve better strategies?

Encoding the Chromosomes

- If memory is one previous game, there are four cases:
    - CC (case 1)
    - CD (case 2)
    - DC (case 3)
    - DD (case 4)
- Example of TIT FOR TAT
    - If CC (case 1), then C
    - If CD (case 2), then D
    - If DC (case 3), then C
    - If DD (case 4), then D
- Encoding of the strategy TIT FOR TAT: CDCD
- Encoding of the strategy "Loyal Guy": CCCC

Encoding the Chromosomes

- If memory is the three previous games, there are 64 possibilities for those games:
    - CC CC CC (case 1)
    - CC CC CD (case 2)
    - ...
    - DD DD DC (case 63)
    - DD DD DD (case 64)
- Encoding: a 64-letter string, CDCCCDDCCCDD…CCCDDD, in which the first letter is the strategy for case 1 and the last letter is the strategy for case 64.
- Search space: 2^64
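One way to look up the move for a given three-game history is to read the six letters as a binary number, with C as 0 and D as 1. This reproduces the case ordering listed above; the function name and the oldest-game-first ordering of `history` are assumptions for illustration.

```python
def case_index(history):
    """Zero-based position in the 64-letter chromosome.

    history: the three previous games, e.g. ["CC", "CC", "CD"].
    Case k on the slide corresponds to index k - 1.
    """
    letters = "".join(history)
    if len(letters) != 6 or set(letters) - {"C", "D"}:
        raise ValueError("expected three games written with C and D")
    return int(letters.replace("C", "0").replace("D", "1"), 2)

# Each of the 64 positions holds C or D, hence 2**64 strategies.
SEARCH_SPACE = 2 ** 64
```

A strategy's next move is then `chromosome[case_index(history)]`.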

Evaluation

- Each individual plays the game against several fixed strategies (coaches).
- The environment is static.
- For each game, an individual gets a payoff according to the payoff matrix.

                 Cooperate   Defect
    Cooperate    3, 3        5, 0
    Defect       0, 5        1, 1
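A minimal sketch of evaluating one-game-memory strategies by letting two chromosomes play repeatedly. Several things here are assumptions: the payoffs are read as 3 each for mutual cooperation, 1 each for mutual defection, and 5 versus 0 when one player defects on a cooperator; each case is read as (my last move, opponent's last move); and both players behave on the first round as if the previous game had been CC.

```python
# (my move, opponent's move) -> (my payoff, opponent's payoff)
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
CASES = ["CC", "CD", "DC", "DD"]  # (my last move, opponent's last move)

def play(strategy_a, strategy_b, rounds=100):
    """Play two 4-letter strategies against each other; return both scores."""
    last_a, last_b = "C", "C"     # assumed history before round one
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a[CASES.index(last_a + last_b)]
        move_b = strategy_b[CASES.index(last_b + last_a)]
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        last_a, last_b = move_a, move_b
    return score_a, score_b
```

Under these assumptions, TIT FOR TAT (`"CDCD"`) against itself cooperates throughout, while against an always-defect strategy (`"DDDD"`) it is exploited only in the first round.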

Observation

- The highest-scoring strategies produced by the GA were designed to exploit specific weaknesses of the fixed strategies.
- It is not necessarily true that these high-scoring strategies would also score well in a different environment.
- Conclusion: a GA is good at doing what evolution often does: developing highly specialized adaptations to specific characteristics of the environment.

Evaluation II

- Take any two individuals (chromosomes) and let them play the game iteratively 100 times.
- Because the populations are generated randomly, the performance of a strategy depends very much on its environment (its opponents).
- The environment is dynamic.
- The environments are evolving.
- The strategies that result from evolution resemble TIT FOR TAT.

Global Sequence Alignment

- In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Global Sequence Alignment

- A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming.
- To align two sequences of lengths m and n:
    - Time complexity: O(kmn)
    - Space complexity: O(kmn)
- If m = 4000 and n = 5000, it takes more than 100 MB.
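To make the cost concrete, here is a minimal sketch of the Needleman-Wunsch score computation. It fills a table of (m+1) x (n+1) cells, which is about 20 million cells for m = 4000 and n = 5000. The scoring values (+2 match, -2 mismatch, -1 gap) are taken as defaults for illustration, and the function name is an assumption.

```python
def needleman_wunsch(s, t, match=2, mismatch=-2, gap=-1):
    """Optimal global alignment score of s and t by dynamic programming."""
    m, n = len(s), len(t)
    # dp[i][j] = best score for aligning s[:i] with t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap            # s[:i] against gaps only
    for j in range(1, n + 1):
        dp[0][j] = j * gap            # t[:j] against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + pair,   # align s[i-1], t[j-1]
                           dp[i - 1][j] + gap,        # gap in t
                           dp[i][j - 1] + gap)        # gap in s
    return dp[m][n]
```

This version returns only the score; recovering the alignment itself requires a traceback through the same table, which is why the whole table is normally kept in memory.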

Global Sequence Alignment

- S1 = "ABC", S2 = "BDC"
- The alignments are variable in length.

[Figure: several alternative alignments of S1 and S2, of different lengths and with gaps (-) in different positions]

Global Sequence Alignment using a GA

- Encoding
    - Chromosome length: 2(m+n), where m+n is the maximum length of the alignment
    - The first m+n bits: the alignment result of the first sequence
    - The second m+n bits: the alignment result of the second sequence
- Binary bit string
    - 1: a character in the sequence
    - 0: a gap

Encoding

- Example: S = "AAAC", T = "AGC"

    S: A A A C
    T: A - G C

- chromosome = 1111000 1011000

    1 1 1 1 0 0 0     1 0 1 1 0 0 0
    A A A C - - -     A - G C - - -

Encoding Constraints

- The first m+n bits must contain exactly m 1's
    - They represent the m characters of the first sequence
- The second m+n bits must contain exactly n 1's
    - They represent the n characters of the second sequence

    chromosome = 1111000 1011000

    1 1 1 1 0 0 0     1 0 1 1 0 0 0
    A A A C - - -     A - G C - - -
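The encoding and its constraints can be sketched as a decoder that turns a chromosome back into two gapped rows. The function name `decode` and the choice to raise an error on an invalid chromosome are assumptions for illustration.

```python
def decode(chromosome, s, t):
    """Turn a 2(m+n)-bit string into the two rows of an alignment.

    A 1 consumes the next character of the sequence, a 0 is a gap.
    """
    length = len(s) + len(t)
    if len(chromosome) != 2 * length:
        raise ValueError("chromosome must have 2(m+n) bits")
    rows = []
    halves = (chromosome[:length], chromosome[length:])
    for bits, seq in zip(halves, (s, t)):
        if bits.count("1") != len(seq):
            raise ValueError("each half needs one 1 per character")
        letters = iter(seq)
        rows.append("".join(next(letters) if b == "1" else "-"
                            for b in bits))
    return rows[0], rows[1]
```

For the example above, `decode("11110001011000", "AAAC", "AGC")` gives `("AAAC---", "A-GC---")`.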

Evaluation

- The same scoring as in the dynamic programming method:
    - +2: if x_i = y_j
    - -2: if x_i ≠ y_j
    - -1: gap penalty (if x_i is aligned with a gap or y_j is aligned with a gap)

Crossover

- The normal crossover operation cannot simply be applied.
- The offspring must satisfy the encoding constraints.

[Figure: bit strings of two parents, P1 and P2, and of the two children produced from them]

An Extra Example for variable length of the chromosome encoding

- Genetic programming
- Evolving Lisp programs
- Example: design a program to compute the orbital period P of a planet given its average distance A from the sun.
    - Kepler's Third Law: P^2 = cA^3
    - In FORTRAN (taking c = 1): P = SQRT(A*A*A)
    - In Lisp: (SQRT (* A (* A A)))

Encoding

- Choose a set of possible functions and terminals for the programs.
    - Functions: {+, -, *, /, √}
    - Terminal: A
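A program built from these functions and the terminal A can be held as a nested tuple and evaluated recursively, which is also what makes the chromosome variable in length. The tuple representation, the name `evaluate`, and the "protected" versions of division and square root (so that random programs cannot crash) are assumptions for illustration.

```python
import math

FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    # Protected operators (an assumption): never raise on bad input.
    "/": lambda a, b: a / b if b != 0 else 1.0,
    "sqrt": lambda a: math.sqrt(abs(a)),
}

def evaluate(tree, A):
    """Evaluate an expression tree for a given value of the terminal A."""
    if tree == "A":
        return A
    op, *args = tree
    return FUNCTIONS[op](*(evaluate(arg, A) for arg in args))

# The Lisp program (SQRT (* A (* A A))) as a nested tuple.
KEPLER = ("sqrt", ("*", "A", ("*", "A", "A")))
```

For example, `evaluate(KEPLER, 4.0)` returns `8.0`, the square root of 4 cubed.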

Crossover


A GP demo on the web

- http://alphard.ethz.ch/gerber/approx/default.html

A GA demo on the web

- http://www.see.ed.ac.uk/~rjt/ga.html?http://oldeee.see.ed.ac.uk/~rjt/ga.html

Other applications

- Classification
- Scheduling
- TSP
    - http://www.ads.tuwien.ac.at/raidl/tspga/TSPGA.html
- Computer-Aided Design
- Composing music with GA

Data Source: Three important experiments

HSQC Spectra

- HSQC peaks (1 chemical shift for one amino acid)

    H        N         Intensity
    8.109    118.60    65920032

CBCANH Spectra

- CBCANH peaks (4 chemical shifts for one amino acid)
- Ca (+), Cb (-)

    H        N         C        Intensity
    8.116    118.25    16.37    79238811
    8.109    118.60    36.52    -65920032
    8.117    118.90    61.58    -51223894
    8.119    117.25    57.42    109928374

CBCA(CO)NH Spectra

- CBCA(CO)NH peaks (2 chemical shifts for one amino acid)

    H        N         C        Intensity
    8.116    118.25    16.37    79238811
    8.109    118.60    36.52    65920032

Spin System Groups

- A spin system contains the chemical shifts of the atoms within a residue.
- A spin system group contains two spin systems: one inter-residue and one intra-residue.
    - Inter-residue spin system: SS_inter(i)
    - Intra-residue spin system: SS_intra(i)

Chemical Shift Ranges of Amino Acids

Some amino acids have particular chemical shift ranges; some share common chemical shift ranges.

Chemical shift ranges of Ala, CS(A):

    14 < Cb < 24
    15 < Ca < 72

Ambiguities

- All 4-point experiments are mixed together.
- All 2-point experiments are mixed together.
- Each spin system can be mapped to several amino acids in the protein sequence.
- False positives, false negatives.

Ambiguous Spin System

    N        H       Intensity
    106.9    8.87    423879
    106.9    8.87    524522

    N         H       C        Intensity
    106.91    8.85    59.7     235673
    106.92    8.86    54.93    346234
    106.91    8.86    61.5     432432
    106.91    8.85    40.31    -335759
    106.92    8.86    30.5     -483759

Candidate Lists

Protein sequence: M L K V A R S T

Candidate list of R: CL(R) = { x | SS_inter(x) is within CS(A) and SS_intra(x) is within CS(R) }

    CL(L) = {1, 4, 17, 28, 33, 40, 52, 65}
    CL(K) = {2, 9, 11, 19, 21, 38, 45, 47, 57, 60, 79}
    CL(V) = {3, 8, 10, 12, 15, 18, 22, 29, 30, 32, 49, 51}
    ...
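The set definition above can be sketched as a filter over spin system groups. Everything concrete here is hypothetical: the dictionary layout, the atom names "CA" and "CB", and the shift ranges and group values used in the small example are made up for illustration and are not data from the slides.

```python
def within(shifts, ranges):
    """True if every atom's shift lies inside that atom's range."""
    return all(lo < shifts[atom] < hi
               for atom, (lo, hi) in ranges.items())

def candidate_list(prev_type, this_type, groups, cs):
    """SSGroups whose inter-residue shifts fit the preceding residue
    type and whose intra-residue shifts fit this residue type."""
    return [gid for gid, g in groups.items()
            if within(g["inter"], cs[prev_type])
            and within(g["intra"], cs[this_type])]

# Hypothetical ranges and groups, for illustration only.
CS = {"A": {"CA": (45.0, 60.0), "CB": (14.0, 24.0)},
      "R": {"CA": (50.0, 62.0), "CB": (26.0, 36.0)}}
GROUPS = {
    1: {"inter": {"CA": 51.9, "CB": 19.4},
        "intra": {"CA": 56.0, "CB": 30.5}},
    2: {"inter": {"CA": 57.5, "CB": 63.3},
        "intra": {"CA": 56.0, "CB": 30.5}},
}
```

With this toy data, `candidate_list("A", "R", GROUPS, CS)` keeps only group 1, because the inter-residue Cb shift of group 2 falls outside the range assumed for Ala.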

Adjacency Lists

    SSGroup    N         H       Inter-residue C shifts    Intra-residue C shifts
    1          122.30    8.15    44.30, 41.80              51.90, 19.40
    2          116.50    8.25    51.90, 19.40              57.50, 63.30
    3          111.20    7.75    57.50, 63.30              34.20, 42.90

SSGroup 1 -> SSGroup 2 -> SSGroup 3

    AL(SSGroup 1)    Left: (none)    Right: 2
    AL(SSGroup 2)    Left: 1         Right: 3
    AL(SSGroup 3)    Left: 2         Right: (none)
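A minimal sketch of how such adjacency lists could be built: group j is placed to the right of group i when the intra-residue carbon shifts of i agree with the inter-residue carbon shifts of j. The reading of the six numbers per group as (inter, intra) carbon pairs, the tolerances, and the function name are assumptions made for illustration.

```python
def build_adjacency(groups, tol_ca=0.2, tol_cb=0.4):
    """Left and right neighbour lists for every SSGroup.

    groups[g] = {"inter": (ca, cb), "intra": (ca, cb)}; the
    tolerances are illustrative defaults, not values from the method.
    """
    adj = {g: {"left": [], "right": []} for g in groups}
    for i, gi in groups.items():
        for j, gj in groups.items():
            if i == j:
                continue
            ca_i, cb_i = gi["intra"]
            ca_j, cb_j = gj["inter"]
            if abs(ca_i - ca_j) <= tol_ca and abs(cb_i - cb_j) <= tol_cb:
                adj[i]["right"].append(j)   # j can follow i
                adj[j]["left"].append(i)    # i can precede j
    return adj

# Carbon shifts of the three SSGroups, under the assumed reading.
SS_GROUPS = {
    1: {"inter": (44.30, 41.80), "intra": (51.90, 19.40)},
    2: {"inter": (51.90, 19.40), "intra": (57.50, 63.30)},
    3: {"inter": (57.50, 63.30), "intra": (34.20, 42.90)},
}
```

Under this reading, `build_adjacency(SS_GROUPS)` yields the chain 1, 2, 3: group 1 has only a right neighbour, group 3 only a left neighbour.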

Like a puzzle game

Using a Genetic Algorithm

[Figure: an example chromosome of SSGroup indices: 29 1 5 18 25 43 17]

- Step 1. Randomly select a position x.
- Step 2. Randomly select an SSGroup i from CL(x).
- Step 3. Extend connected fragments from i to both sides, using the adjacency lists, until no more extension can be found.
- Step 4. Repeat Steps 1 to 3 until all positions are assigned.
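The four steps can be sketched as follows. This is a rough illustration, not the published procedure: the data layout (`cl` as a list of candidate lists per position, `left` and `right` as dictionaries of neighbours), the rule that each SSGroup is used at most once, and leaving a position empty when no candidate remains are all assumptions.

```python
import random

def random_individual(cl, left, right, rng):
    """Build one chromosome: an SSGroup (or None) for every position."""
    n = len(cl)
    assigned = [None] * n
    used = set()
    pending = set(range(n))

    def place(pos, group):
        assigned[pos] = group
        used.add(group)
        pending.discard(pos)

    def extend(pos, step, neighbours):
        # Step 3: keep walking while an unused neighbour fits.
        while 0 <= pos + step < n and assigned[pos + step] is None:
            options = [g for g in neighbours.get(assigned[pos], [])
                       if g in cl[pos + step] and g not in used]
            if not options:
                break
            place(pos + step, rng.choice(options))
            pos += step

    while pending:                                # Step 4
        x = rng.choice(sorted(pending))           # Step 1
        options = [g for g in cl[x] if g not in used]
        if not options:
            pending.discard(x)                    # assumed: leave empty
            continue
        place(x, rng.choice(options))             # Step 2
        extend(x, +1, right)
        extend(x, -1, left)
    return assigned
```

With `cl = [[1], [2], [3]]`, `right = {1: [2], 2: [3]}` and `left = {2: [1], 3: [2]}`, every random choice leads to the same chromosome `[1, 2, 3]`.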

Evaluate the fitness of each individual

[Figure: an example chromosome of SSGroup indices: 1 5 18 25 29 43 17]

Fitness(ch) = the number of connected pairs, taking their chemical shift differences into account.

Two principles:
1. The more connected pairs a chromosome has, the higher its score.
2. The smaller its chemical shift differences, the higher its score.

Crossover operation

- The child inherits continuous connected fragments from its parents.

[Figure: two parents and the resulting child]

Mutation operations

- Once a position is selected for mutation, the following positions also mutate, so that a connected fragment is produced.

[Figure: a child and its mutant, with the mutation point marked]

Experiment Design: Simulated Error Data

- False positive: 75% error data
- False negative: 50% error
- Linking error
    - Ca: ±0.2
    - Cb: ±0.4
- Combined error

Experiment Results

- The accuracy on two real datasets
    - SBD: 95.1% (FP: 67%)
    - LBD: 100% (FP: 48%)
- The average accuracy on perfect BMRB datasets (902 proteins)

Compare with MARS

Compare with PACES

Reference

- Melanie Mitchell, An Introduction to Genetic Algorithms, The MIT Press
- Ming-Feng Yeh, Genetic Algorithm
- GANA: A Genetic Algorithm for NMR Backbone Resonance Assignment, Nucleic Acids Research, 2005, Vol. 33, No. 14

Programming Projects

- NMR Backbone Assignment
- Consecutive one's