An Introduction to Genetic Algorithms and the application

Institute of Information Science, Academia Sinica
Research Assistant: Lin, Hsin-Nan (林信男)

Outline

- Optimization Problems
- Optimization Techniques
- Genetic Algorithms
- Two brief examples
- NMR Backbone Assignment

Optimization Problems

- Definition: a computational problem in which the objective is to find the best of all possible solutions.
- Classical optimization problems
  - Traveling Salesman Problem
  - Knapsack Problem
  - Prisoner’s Dilemma

Search spaces and Fitness Landscapes

- Search space: a collection of candidate solutions to a problem.
- Fitness landscape: a representation of the space of all possible genotypes along with their fitness.

Example: Traveling Salesman Problem

- Search space: n! possible orderings of n cities.
- No polynomial-time method is known that guarantees the optimal solution; the straightforward exact method is an exhaustive search over all possible solutions.
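To make the size of the search space concrete, here is a minimal sketch of exhaustive search for the Traveling Salesman Problem. The four-city instance and the function names are assumptions made for illustration only.

```python
from itertools import permutations
from math import dist, factorial

def tour_length(tour, cities):
    """Total length of the closed tour that visits the cities in the given order."""
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def brute_force_tsp(cities):
    """Try every ordering of the cities and keep the shortest closed tour."""
    best = min(permutations(range(len(cities))),
               key=lambda tour: tour_length(tour, cities))
    return list(best), tour_length(best, cities)

# Hypothetical instance: the four corners of a unit square.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
best_tour, best_len = brute_force_tsp(cities)
print(best_tour, best_len)   # [0, 1, 2, 3] 4.0
print(factorial(10))         # 3628800 orderings for only 10 cities
```

The factorial growth is why exhaustive search stops being practical after a few dozen cities, which motivates the heuristic methods that follow.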

Optimization Techniques

- Heuristic methods
  - Hill Climbing
  - Simulated Annealing
  - Genetic Algorithms
  - Ant Colony Optimization
  - ...

Hill Climbing

- Iterative improvement
- Start at a random configuration; repeatedly consider various moves, accepting some and rejecting others. When you are stuck, restart.
- We must invent a moveset that describes which moves we will consider from any candidate solution.
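A minimal sketch of hill climbing with restarts. The one-dimensional landscape, the left/right moveset, and the choice to always take the best improving move are assumptions for illustration; other acceptance rules are possible.

```python
import random

# Hypothetical landscape: index = candidate solution, value = fitness.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 8, 9, 7, 3]

def fitness(x):
    return LANDSCAPE[x]

def moves(x):
    """Moveset: step one position left or right, staying inside the landscape."""
    return [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]

def hill_climb(start):
    """Take the best improving move until no neighbor is better (stuck)."""
    current = start
    while True:
        best_neighbor = max(moves(current), key=fitness)
        if fitness(best_neighbor) <= fitness(current):
            return current
        current = best_neighbor

def hill_climb_with_restarts(restarts, rng):
    """Restart from random configurations and keep the best local optimum found."""
    results = [hill_climb(rng.randrange(len(LANDSCAPE))) for _ in range(restarts)]
    return max(results, key=fitness)

print(hill_climb(0))   # 2: stuck on a local optimum
print(hill_climb(4))   # 7: reaches the global optimum
print(hill_climb_with_restarts(20, random.Random(0)))
```

Starting at index 0 the climb gets stuck on the smaller peak, which is the situation the restart step is meant to handle.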

Genetic Algorithms

- Search technique based on the principle of evolution
  - Survival of the fittest in natural selection
- Developed by John Holland, University of Michigan (1970s)
- GAs use two basic processes from evolution
  - Inheritance (passing of features from one generation to the next)
  - Competition (survival of the fittest)

Basic Description of GAs

- The algorithm starts with a set of solutions (represented by chromosomes) called the population.
- Solutions from one population are taken and used to form a new population.
- The hope is that the new population (offspring) will be better than the old one (parents).
- Solutions are selected to form new solutions according to their fitness: the more suitable they are, the more chances they have to reproduce.

Basic Genetic Algorithm

Operators of GA: Encoding

- The chromosome should in some way contain information about the solution it represents.
- The most common encoding is a binary string.
  - Chromosome 1: 1101100100110110
  - Chromosome 2: 1101111000011110
- Each bit in the string can represent some characteristic of the solution.
- Chromosomes can be of fixed length or variable length.

Operators of GA: Selection

- Chromosomes are selected from the population to be parents for crossover.
- According to Darwin's theory of evolution, the best ones should survive and create new offspring.
- Examples: roulette wheel selection, Boltzmann selection, tournament selection, rank selection, steady-state selection, and others.

Roulette Wheel Selection

- Parents are selected according to their fitness.
- The better the chromosomes are, the more chances they have to be selected.
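A minimal sketch of roulette wheel selection, assuming non-negative fitness values with a positive sum. The population and fitness numbers below are made up for illustration.

```python
import random

def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    spin = rng.uniform(0, total)       # where the ball lands on the wheel
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit                  # each individual owns a slice of size `fit`
        if spin < running:
            return individual
    return population[-1]               # guard against floating-point round-off

population = ["A", "B", "C", "D"]
fitnesses = [1, 2, 3, 14]               # hypothetical fitness values
rng = random.Random(42)
picks = [roulette_wheel_select(population, fitnesses, rng) for _ in range(1000)]
print({p: picks.count(p) for p in population})
```

Over many spins, "D" should be drawn about 70% of the time because it owns 14 of the 20 units on the wheel.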

Operators of GA: Crossover

- Crossover selects genes from the parent chromosomes and creates a new offspring.

Crossover

- Single point crossover: 11001|011 + 11011|111 = 11001|111
- Two point crossover: 11|0010|11 + 11|0111|11 = 11|0111|11
- Uniform crossover: 11001011 + 11011101 = 11011111
- Arithmetic crossover: 11001011 + 11011111 = 11001011 (AND)
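The four crossover variants can be sketched on bit strings as follows. The function names and the cut positions are assumptions chosen to reproduce the parent strings shown on the slide.

```python
import random

def single_point(p1, p2, point):
    """Head of p1 up to `point`, tail of p2 after it."""
    return p1[:point] + p2[point:]

def two_point(p1, p2, a, b):
    """p1 outside the segment [a, b), p2 inside it."""
    return p1[:a] + p2[a:b] + p1[b:]

def uniform(p1, p2, rng):
    """Each bit is copied from a randomly chosen parent."""
    return "".join(rng.choice(pair) for pair in zip(p1, p2))

def arithmetic_and(p1, p2):
    """Bitwise AND of the two parents."""
    return "".join("1" if a == b == "1" else "0" for a, b in zip(p1, p2))

print(single_point("11001011", "11011111", 5))   # 11001111
print(two_point("11001011", "11011111", 2, 6))   # 11011111
print(arithmetic_and("11001011", "11011111"))    # 11001011
print(uniform("11001011", "11011101", random.Random(1)))
```

Uniform crossover is random, so its output depends on the seed; wherever the parents agree, the child necessarily carries the same bit.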

Operators of GA: Mutation

- Mutation prevents all solutions in the population from falling into a local optimum of the problem being solved.
- Mutation randomly changes the new offspring.
  - Original offspring 1: 1101111000011110
  - Mutated offspring 1: 1100111000011110
  - Original offspring 2: 1101100100110110
  - Mutated offspring 2: 1101101100110110
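A minimal sketch of bit-flip mutation. The per-bit mutation probability `p_m` and the helper `flip` are illustrative names, not taken from the slides.

```python
import random

def mutate(chromosome, p_m, rng):
    """Flip each bit independently with probability p_m."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_m else bit
        for bit in chromosome
    )

def flip(chromosome, position):
    """Deterministic helper: flip exactly one bit (0-based position)."""
    bit = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + bit + chromosome[position + 1:]

print(flip("1101111000011110", 3))   # 1100111000011110, as for offspring 1
print(mutate("1101100100110110", 0.05, random.Random(7)))
```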

Control Parameters

- Population size and generation number
- A GA has two control parameters: the crossover rate (p_c) and the mutation rate (p_m).
  - p_c (0.5~1.0): the higher the value of p_c, the more quickly new solutions are introduced into the population.
  - p_m (0.005~0.05): large values of p_m transform the GA into a purely random search algorithm, while some mutation is required to prevent premature convergence of the GA to suboptimal solutions.
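Putting the operators and control parameters together, here is a minimal sketch of a complete GA loop. The toy fitness function (count the 1 bits), the default parameter values, and all function names are assumptions for illustration, not the configuration used in the talk.

```python
import random

def one_max(chromosome):
    """Toy fitness (an assumption): the number of 1 bits."""
    return sum(chromosome)

def select(population, rng):
    """Roulette wheel selection; the +1 keeps every weight positive."""
    weights = [one_max(c) + 1 for c in population]
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(p1, p2, p_c, rng):
    """With probability p_c do single point crossover, else copy the first parent."""
    if rng.random() < p_c:
        point = rng.randrange(1, len(p1))
        return p1[:point] + p2[point:]
    return p1[:]

def mutate(chromosome, p_m, rng):
    """Flip each bit with probability p_m."""
    return [1 - bit if rng.random() < p_m else bit for bit in chromosome]

def run_ga(length=20, pop_size=30, generations=50, p_c=0.8, p_m=0.01, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=one_max)
    for _ in range(generations):
        population = [
            mutate(crossover(select(population, rng), select(population, rng), p_c, rng), p_m, rng)
            for _ in range(pop_size)
        ]
        best = max(population + [best], key=one_max)   # remember the best seen so far
    return best

best = run_ga()
print(one_max(best), best)
```

Raising `p_m` toward 0.5 makes each child nearly independent of its parents, which is the "purely random search" regime the slide warns about.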

How Do Genetic Algorithms Work?

- GAs work by discovering and recombining good “building blocks” of solutions in a highly parallel fashion.
- Good solutions tend to be made up of good building blocks.

An example of building block

- Password guessing
  - Chromosome 1: algehklppr (fitness: 3)
  - Chromosome 2: bgrekithms (fitness: 5)
  - Child: algehithms (fitness: 8)
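The fitness values in this example are consistent with counting the positions that match a hidden password. The sketch below assumes the password is "algorithms" (inferred from the numbers, not stated on the slide) and that the child comes from a single point crossover after the fifth letter.

```python
TARGET = "algorithms"   # assumed password, inferred from the fitness values

def fitness(guess):
    """Number of positions where the guess matches the target."""
    return sum(g == t for g, t in zip(guess, TARGET))

def single_point(p1, p2, point):
    """Head of p1 up to `point`, tail of p2 after it."""
    return p1[:point] + p2[point:]

c1, c2 = "algehklppr", "bgrekithms"
child = single_point(c1, c2, 5)
print(fitness(c1), fitness(c2), child, fitness(child))   # 3 5 algehithms 8
```

The child inherits the good block "alg" from the first parent and "ithms" from the second, so its fitness is the sum of the two.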

Two Brief Examples

- The strategies for the Prisoner’s Dilemma
- Global Sequence Alignment

Prisoner’s Dilemma

- Two suspects, A and B, are arrested by the police. The police have insufficient evidence for a conviction and, having separated the two prisoners, visit each of them to offer the same deal:

The strategies

- If you suspect that your opponent is going to
  - cooperate, then you should surely defect.
  - defect, then you should defect too.
- The dilemma is
  - If both players defect, each gets a worse score than if they both cooperate.
- Both players memorize the previous games.
- A simple and good strategy: TIT FOR TAT
- Could the GA evolve better strategies?

Encoding the Chromosomes

- If memory is one previous game, there are four cases:
  - CC (case 1)
  - CD (case 2)
  - DC (case 3)
  - DD (case 4)
- Example of TIT FOR TAT
  - If CC (case 1), then C
  - If CD (case 2), then D
  - If DC (case 3), then C
  - If DD (case 4), then D
- Encoding of the strategy TIT FOR TAT: CDCD
- Encoding of the strategy “Loyal Guy”: CCCC

Encoding the Chromosomes

- If memory is the three previous games, there are 64 possibilities for the previous three games:
  - CC CC CC (case 1)
  - CC CC CD (case 2)
  - …
  - DD DD DC (case 63)
  - DD DD DD (case 64)
- Encoding: a 64-letter string, CDCCCDDCCCDD…CCCDDD, in which the first letter is the strategy for case 1 and the last letter is the strategy for case 64.
- Search space: 2^64

Evaluation

- Each individual plays the game against several fixed strategies (coaches).
- The environment is static.
- For each game, an individual receives a payoff according to the payoff matrix (payoffs listed as A, B):

                  A cooperates   A defects
  B cooperates    3, 3           5, 0
  B defects       0, 5           1, 1
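A minimal sketch of how a memory-one chromosome such as CDCD can be played and scored with the payoffs above (3 each for mutual cooperation, 5 and 0 when one player defects against a cooperator, 1 each for mutual defection). The slides do not say how the very first game is handled, so the sketch assumes both players act as if the previous game had been CC; `ALWAYS_DEFECT` is a hypothetical extra strategy added for comparison.

```python
# Payoff for (my move, opponent's move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
CASES = ["CC", "CD", "DC", "DD"]   # (my previous move, opponent's previous move)

def next_move(strategy, my_prev, opp_prev):
    """Look up the response encoded in a 4-letter chromosome."""
    return strategy[CASES.index(my_prev + opp_prev)]

def play(strategy_a, strategy_b, rounds=100):
    """Iterated game; returns the total payoff of each player."""
    prev_a, prev_b = "C", "C"      # assumption: pretend the game before the first was CC
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = next_move(strategy_a, prev_a, prev_b)
        move_b = next_move(strategy_b, prev_b, prev_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        prev_a, prev_b = move_a, move_b
    return score_a, score_b

TIT_FOR_TAT, LOYAL_GUY, ALWAYS_DEFECT = "CDCD", "CCCC", "DDDD"
print(play(TIT_FOR_TAT, TIT_FOR_TAT, 10))     # (30, 30)
print(play(TIT_FOR_TAT, ALWAYS_DEFECT, 10))   # (9, 14)
print(play(LOYAL_GUY, ALWAYS_DEFECT, 10))     # (0, 50)
```

A GA fitness function could then be the total payoff of a chromosome against each fixed opponent.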

Observation

- The highest-scoring strategies produced by the GA were designed to exploit specific weaknesses of the fixed strategies.
- It is not necessarily true that these high-scoring strategies would also score well in a different environment.
- Conclusion: the GA is good at doing what evolution often does: developing highly specialized adaptations to specific characteristics of the environment.

Evaluation II

- Any two individuals (chromosomes) play the game against each other for 100 iterations.
- Because the populations are randomly generated, the performance of a strategy depends very much on its environment (its opponents).
- The environment is dynamic.
- The environments are evolving.
- The strategies that evolve resemble TIT FOR TAT.

Global Sequence Alignment

- In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Global Sequence Alignment

- A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming.
- To align two sequences of lengths m and n
  - Time complexity: O(kmn)
  - Space complexity: O(kmn)
- If m = 4000 and n = 5000, it takes more than 100 MB of memory.
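A minimal sketch of the Needleman-Wunsch score computation with a full dynamic programming table. The default scores (+2 match, -2 mismatch, -1 gap) are borrowed from the evaluation slide later in this talk; the function name is an assumption.

```python
def needleman_wunsch_score(s, t, match=2, mismatch=-2, gap=-1):
    """Optimal global alignment score using a full (m+1) x (n+1) table."""
    m, n = len(s), len(t)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i * gap           # s[:i] aligned against gaps only
    for j in range(1, n + 1):
        table[0][j] = j * gap           # t[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = table[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[m][n]

print(needleman_wunsch_score("AAAC", "AGC"))   # 1
print(4001 * 5001)   # 20009001 table cells for m = 4000, n = 5000
```

With roughly 20 million cells, the table reaches tens to hundreds of megabytes depending on how many bytes each cell needs, which is the memory cost the slide refers to.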

Global Sequence Alignment

- S1 = "ABC", S2 = "BDC"
- The alignments are variable in length.

(Figure: several alignments of S1 and S2 with gaps, of different lengths.)

Global Sequence Alignment using a GA

- Encoding
  - Chromosome length: 2(m+n), where m+n is the maximum length of the alignment
    - The first m+n bits: the alignment result of the first sequence
    - The second m+n bits: the alignment result of the second sequence
  - Binary bit string
    - 1: a character in the sequence
    - 0: a gap

Encoding

- Example: S = "AAAC", T = "AGC"
- chromosome = 1111000 1011000

  first half:    1 1 1 1 0 0 0
  S aligned:     A A A C - - -
  second half:   1 0 1 1 0 0 0
  T aligned:     A - G C - - -

Encoding Constraints

- The first half, an (m+n)-bit string, must contain exactly m 1’s.
  - They represent the m characters of the first sequence.
- The second half, an (m+n)-bit string, must contain exactly n 1’s.
  - They represent the n characters of the second sequence.
- Example: chromosome = 1111000 1011000 for S = "AAAC" (four 1’s) and T = "AGC" (three 1’s)
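A minimal sketch of decoding such a chromosome into the two gapped rows while checking the encoding constraints. The function names are assumptions.

```python
def decode_half(bits, sequence):
    """Turn one half of the chromosome into a gapped row of the alignment."""
    if bits.count("1") != len(sequence):
        raise ValueError("half must contain exactly one 1 per sequence character")
    chars = iter(sequence)
    return "".join(next(chars) if bit == "1" else "-" for bit in bits)

def decode(chromosome, s, t):
    """Split a 2(m+n)-bit chromosome into the two aligned rows."""
    half = len(s) + len(t)
    if len(chromosome) != 2 * half:
        raise ValueError("chromosome length must be 2(m+n)")
    return decode_half(chromosome[:half], s), decode_half(chromosome[half:], t)

print(decode("11110001011000", "AAAC", "AGC"))   # ('AAAC---', 'A-GC---')
```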

Evaluation

- The scoring is the same as in the dynamic programming method.
  - +2: if x_i = y_i
  - -2: if x_i ≠ y_i
  - -1: gap penalty (if x_i is aligned with a gap or y_i is aligned with a gap)
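A minimal sketch of this scoring applied column by column to two gapped rows. The slides do not say how a column with a gap in both rows is scored; the sketch assumes it contributes nothing.

```python
def alignment_score(row_s, row_t, match=2, mismatch=-2, gap=-1):
    """Column-by-column score of two gapped rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("rows must have the same length")
    score = 0
    for x, y in zip(row_s, row_t):
        if x == "-" and y == "-":
            continue                  # assumption: a gap/gap column scores nothing
        elif x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

print(alignment_score("AAAC---", "A-GC---"))   # 2 - 1 - 2 + 2 = 1
```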

Crossover

- The normal crossover operation cannot simply be applied.
- The children must satisfy the encoding constraints.

(Figure: bit strings of the parents and of the two children.)

An Extra Example for variable length of the chromosome encoding

- Genetic programming
  - Evolving Lisp programs
- Example: design a program to compute the orbital period P of a planet given its average distance A from the sun.
  - Kepler’s Third Law: P^2 = cA^3
  - In FORTRAN: P = SQRT(A*A*A)
  - In Lisp: (SQRT (* A (* A A)))

Encoding

- Choose a set of possible functions and terminals for the programs.
  - Functions: {+, -, *, /, √}
  - Terminal: A
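A minimal sketch of how a program built from this function set and terminal can be represented as a tree and evaluated. Nested tuples stand in for Lisp lists; the "protected" division and square root (which avoid runtime errors on arbitrary evolved programs) are an assumed convention, not something stated on the slide.

```python
import math

FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 1.0,   # protected division (assumption)
    "sqrt": lambda a: math.sqrt(abs(a)),          # protected square root (assumption)
}

def evaluate(program, A):
    """Evaluate a program tree; the only terminal is the symbol "A"."""
    if program == "A":
        return A
    op, *args = program
    return FUNCTIONS[op](*(evaluate(arg, A) for arg in args))

# (SQRT (* A (* A A))) written as nested tuples.
kepler = ("sqrt", ("*", "A", ("*", "A", "A")))
print(evaluate(kepler, 4.0))   # 8.0
```

Because programs are trees of different shapes and sizes, the chromosome length is naturally variable.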

Crossover

A GP demo on the web

- http://alphard.ethz.ch/gerber/approx/default.html

A GA demo on the web

- http://www.see.ed.ac.uk/~rjt/ga.html?http://oldeee.see.ed.ac.uk/~rjt/ga.html

Other applications

- Classification
- Scheduling
- TSP
  - http://www.ads.tuwien.ac.at/raidl/tspga/TSPGA.html
- Computer-Aided Design
- Composing music with a GA

Data Source: Three important experiments

HSQC Spectra

- HSQC peaks (1 chemical shift peak for one amino acid)

  H       N        Intensity
  8.109   118.60   65920032

CBCANH Spectra

- CBCANH peaks (4 chemical shift peaks for one amino acid)
- Ca (+), Cb (-)

  H       N        C       Intensity
  8.116   118.25   16.37    79238811
  8.109   118.60   36.52   -65920032
  8.117   118.90   61.58   -51223894
  8.119   117.25   57.42   109928374

CBCA(CO)NH Spectra

- CBCA(CO)NH peaks (2 chemical shift peaks for one amino acid)

  H       N        C       Intensity
  8.116   118.25   16.37   79238811
  8.109   118.60   36.52   65920032

Spin System Groups

- A spin system contains the chemical shifts of atoms within a residue.
- A spin system group contains two spin systems, one inter-residue and one intra-residue.
  - Inter-residue spin system: SS inter ( )
  - Intra-residue spin system: SS intra ( )

Chemical Shift Ranges of Amino Acids

- Some amino acids have particular chemical shift ranges; some share common chemical shift ranges.
- Chemical shift ranges of Ala, CS(A):
  - 14 < Cb < 24
  - 15 < Ca < 72

Ambiguities

- All 4-point experiments are mixed together.
- All 2-point experiments are mixed together.
- Each spin system can be mapped to several amino acids in the protein sequence.
- False positives, false negatives

Ambiguous Spin System

  N        H      Intensity
  106.9    8.87   423879
  106.9    8.87   524522

  N        H      C       Intensity
  106.91   8.85   59.7     235673
  106.92   8.86   54.93    346234
  106.91   8.86   61.5     432432
  106.91   8.85   40.31   -335759
  106.92   8.86   30.5    -483759

Candidate Lists

- Example sequence: M L K V A R S T
- Candidate list of R (the residue before R is A): CL(R) = { spin system groups whose SS inter ( ) is within CS(A) and whose SS intra ( ) is within CS(R) }
- CL(L) = {1, 4, 17, 28, 33, 40, 52, 65}
- CL(K) = {2, 9, 11, 19, 21, 38, 45, 47, 57, 60, 79}
- CL(V) = {3, 8, 10, 12, 15, 18, 22, 29, 30, 32, 49, 51}
- …

Adjacency Lists

  SSGroup   N        H      Ca(i-1)   Cb(i-1)   Ca(i)   Cb(i)
  1         122.30   8.15   44.30     41.80     51.90   19.40
  2         116.50   8.25   51.90     19.40     57.50   63.30
  3         111.20   7.75   57.50     63.30     34.20   42.90

- SSGroup 1 → SSGroup 2 → SSGroup 3
- AL(SSGroup 1): Left: (none), Right: 2
- AL(SSGroup 2): Left: 1, Right: 3
- AL(SSGroup 3): Left: 2, Right: (none)
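A minimal sketch of how adjacency lists like these could be derived: group b can follow group a when the intra-residue carbon shifts of a match the inter-residue carbon shifts of b. Reading the slide's numbers as (inter Ca, inter Cb, intra Ca, intra Cb) is an interpretation, and the tolerances (0.2 for Ca, 0.4 for Cb) are an assumption borrowed from the linking-error settings mentioned in the experiment design.

```python
# (inter Ca, inter Cb, intra Ca, intra Cb) for the three example groups.
SS_GROUPS = {
    1: (44.30, 41.80, 51.90, 19.40),
    2: (51.90, 19.40, 57.50, 63.30),
    3: (57.50, 63.30, 34.20, 42.90),
}

def follows(left, right, tol_ca=0.2, tol_cb=0.4):
    """True if the intra shifts of `left` match the inter shifts of `right`."""
    _, _, intra_ca, intra_cb = left
    inter_ca, inter_cb, _, _ = right
    return abs(intra_ca - inter_ca) <= tol_ca and abs(intra_cb - inter_cb) <= tol_cb

def adjacency_lists(groups):
    """Map each group id to its possible left and right neighbor ids."""
    al = {gid: {"left": [], "right": []} for gid in groups}
    for a, shifts_a in groups.items():
        for b, shifts_b in groups.items():
            if a != b and follows(shifts_a, shifts_b):
                al[a]["right"].append(b)
                al[b]["left"].append(a)
    return al

print(adjacency_lists(SS_GROUPS))
# {1: {'left': [], 'right': [2]}, 2: {'left': [1], 'right': [3]}, 3: {'left': [2], 'right': []}}
```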

Like a puzzle game

Using a Genetic Algorithm

(Figure: an example individual with SSGroups 29 1 5 18 25 43 17 assigned to sequence positions.)

- Step 1. Randomly select a position x.
- Step 2. Randomly select an SSGroup from CL(x).
- Step 3. Extend connected fragments from the selected SSGroup to both sides by using the adjacency lists until no more extension can be found.
- Step 4. Repeat Step 1 to Step 3 until all positions are assigned.
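The four steps can be sketched as below. This is an illustrative reading, not the published implementation: it assumes each SSGroup is used at most once, that an extension must also appear in the candidate list of the neighboring position, and that positions with no usable candidate are marked with a placeholder `EMPTY`.

```python
import random

EMPTY = -1   # assumption: marks a position for which no SSGroup could be placed

def init_individual(length, candidates, adjacency, rng):
    """candidates[pos] -> SSGroup ids (CL); adjacency[g] -> {"left": [...], "right": [...]} (AL)."""
    assigned = [None] * length
    used = set()

    def place(pos, group):
        assigned[pos] = group
        used.add(group)

    def extend(pos, step, side):
        # Walk left (step=-1) or right (step=+1) while an unused neighbor fits.
        while 0 <= pos + step < length and assigned[pos + step] is None:
            neighbors = adjacency.get(assigned[pos], {}).get(side, [])
            options = [g for g in neighbors
                       if g not in used and g in candidates.get(pos + step, [])]
            if not options:
                break
            pos += step
            place(pos, rng.choice(options))

    while None in assigned:                                   # Step 4
        pos = rng.choice([p for p, g in enumerate(assigned) if g is None])   # Step 1
        options = [g for g in candidates.get(pos, []) if g not in used]
        if not options:
            assigned[pos] = EMPTY
            continue
        place(pos, rng.choice(options))                       # Step 2
        extend(pos, -1, "left")                               # Step 3
        extend(pos, +1, "right")
    return assigned

# Hypothetical three-position example using SSGroups 1 -> 2 -> 3.
cands = {0: [1], 1: [2], 2: [3]}
adj = {1: {"left": [], "right": [2]}, 2: {"left": [1], "right": [3]}, 3: {"left": [2], "right": []}}
print(init_individual(3, cands, adj, random.Random(0)))   # [1, 2, 3]
```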

Evaluate the fitness of each individual

(Figure: an example individual with SSGroups 1 5 18 25 29 43 17.)

- Fitness(ch) = the number of connected pairs, taking into account their chemical shift differences.
- Two principles:
  1. The more connected pairs an individual has, the higher its score.
  2. The smaller its chemical shift differences, the higher its score.

Crossover operation

- Inherit continuous connected fragments from the parents.

(Figure: two parents and their child.)

Mutation operations

- Once a position mutates, the following positions also mutate so as to produce a connected fragment.

(Figure: a child and its mutant, with the mutation point marked.)

Experiment Design: Simulated Error Data

- False positive: 75% error data
- False negative: 50% error
- Linking error
  - Ca: ±0.2
  - Cb: ±0.4
- Combined error

Experiment Results

- The accuracy on two real datasets
  - SBD: 95.1% (FP: 67%)
  - LBD: 100% (FP: 48%)
- The average accuracy on perfect BMRB datasets (902 proteins)

Compare with MARS

Compare with PACES

Reference

- Melanie Mitchell, An Introduction to Genetic Algorithms, The MIT Press.
- Ming-Feng Yeh, Genetic Algorithm.
- GANA: A Genetic Algorithm for NMR Backbone Resonance Assignment, Nucleic Acids Research, 2005, Vol. 33, No. 14.

Programming Projects

- NMR Backbone Assignment
- Consecutive ones
