Evolutionary Algorithms & Protein Folding

Download Report

Transcript Evolutionary Algorithms & Protein Folding

Evolutionary Algorithms for the Protein Folding Problem

Giuseppe Nicosia

Department of Mathematics and Computer Science University of Catania 30/04/2020 1 DMI Università di Catania

Talk Outline

1. An overview of Evolutionary Algorithms 2. The Protein Folding Problem 3. Genetic Algorithms for the

ab initio

prediction 30/04/2020 2 DMI Università di Catania

An overview of Evolutionary Algorithms

EAs are optimization methods based on an evolutionary metaphor that showed effective in solving difficult problems.

30/04/2020 3 DMI Università di Catania

Computational Intelligence and EAs

“Evolution is the natural way to program”

DMI Università di Catania Thomas Ray 30/04/2020 4

Evolutionary Algorithms

1. Set of candidate solutions (

individuals

):

Population

.

2. Generating candidates by: – – –

Reproduction

: Copying an individual.

Crossover

:  2 parents   2 children.

Mutation

: 1 parent  1 child.

3. Quality measure of individuals: Fitness function . 4.

Survival-of-the-fittest

principle.

DMI Università di Catania 30/04/2020 5

Main components of EAs

1. Representation of individuals:

Coding

.

2. Evaluation method for individuals:

Fitness

.

3. Initialization procedure for the

1st generation

.

4. Definition of variation operators (

mutation

and

crossover

).

5. Parent (

mating

) selection mechanism.

6. Survivor (

environmental

) selection mechanism.

7.

Technical parameters

(e.g. mutation rates, population size).

30/04/2020 6 DMI Università di Catania

Mutation and Crossover

EAs manipulate partial solutions in their search for the overall optimal solution

. These partial solutions or ` building blocks ' correspond to sub-strings of a trial solution - in our case local sub-structures within the overall conformation. 30/04/2020 7 DMI Università di Catania

`Optimal' Parameter Tuning:

Experimental tests.

• Adaptation based on measured quality.

• Self-adaptation based on evolution !

30/04/2020 8 DMI Università di Catania

Constraint handling strategies

[Michalewicz,

Evolutionary Computation

, 4(1), 1996] 1.

Repair strategy : whenever an unfeasible solution is produced "repair" it , i.e. find a feasible solution "close“ to the unfeasible one; 2.

Penalize

strategy :

admit unfeasible individuale them in the population, but penalize adding a suitable term to the energy .

DMI Università di Catania 30/04/2020 9

The evolution Loop

DMI Università di Catania 30/04/2020 10

Algorithm Outline

} procedure EA; { t = 0; initialize population P(t); evaluate P(t); until (done) { t = t + 1; parent_selection P(t); recombine P(t); mutate P(t); evaluate P(t); survive P(t); } DMI Università di Catania 30/04/2020 11

Example

DMI Università di Catania 30/04/2020 12

Evolutionary Programming

procedure EP; { t = 0; initialize population P(t); evaluate P(t); until (done) { t = t + 1; parent_selection P(t); mutate P(t); evaluate P(t); survive P(t); } • The individuals are real-valued vectors , ordered lists , graphs .

• All

N

individuals are selected to be parents , and then are mutated, producing

N

children. These children are evaluated and

N

survivors are chosen from the

2N

individuals, using a probabilistic function based on fitness (individuals with a greater fitness have a higher chance of survival). • Mutation is based on the representation used, and is often adaptive . For example, when using a real-valued vector, each variable within an individual may have an adaptive mutation rate that is normally distributed.

• No Recombination.

} DMI Università di Catania 30/04/2020 13

}

Evolution Strategies

procedure ES; { • ES typically use real-valued vector .

t = 0; • Individuals are selected uniformly randomly to be parents . initialize population P(t); evaluate P(t); • Pairs of parents produces children via recombination . The number of children created is greater than N .

until (done) { • Survival is deterministic : t = t + 1; parent_selection P(t); 1.

ES allows the

N

best children to survive, and replaces the parents with these children. recombine P(t) mutate P(t); 2.

ES allows the

N

best children and parents to survive. evaluate P(t); survive P(t); } • Like EP, adapting mutation .

• Unlike EP, recombination does play an important role in ES, especially in adapting mutation.

DMI Università di Catania 30/04/2020 14

Genetic Algorithms

procedure GA; { t = 0; • GAs traditionally use a more domain independent representation , namely, bit-strings . initialize population P(t); evaluate P(t); until (done) { } t = t + 1; parent_selection P(t); recombine P(t) mutate P(t); evaluate P(t); survive P(t); } • Parents are selected according to a probabilistic function based on relative fitness. •

N

children are created via recombination from the

N

parents . • The

N

children are mutated and survive, replacing the

N

parents in the population. • Emphasis on mutation and crossover is opposite to that in EP . • Mutation flips bits with some small probability (background operator). • Recombination , on the other hand, is emphasized as the primary search operator. DMI Università di Catania 30/04/2020 15

Genetic Programming 1/2

There are 5 major preparatory steps in using GP for a particular problem. 1) selection of the set of terminals (e.g., the actual variables of the problem, zero-argument functions, and random constants, if any) 2) selection of the set of functions 3) identication of the evaluation function 4) selection of parameters of the system for controlling the run 5) selection of the termination condition .

Each tree (program) is composed of functions and terminals appropriate to the particular problem domain; the set of all functions and terminals is selected a priori in such a way that some of the composed trees yield a solution.

The initial pop is composed of such trees. The evaluation function assigns a fitness value which evaluates the performance of a tree. The evaluation is based on a preselected set of test cases ,a fitness cases; in general, the fitness function returns the sum of distances between the correct and 30/04/2020 obtained results on all test cases.

DMI Università di Catania 16

Genetic Programming 2/2

procedure GP; { } t = 0; initialize population P(t); /* randomly create an initial pop of individuals computer program */ evaluate P(t); /* execute each program in the pop and assign it a fitness value */ until (done) { t = t + 1; parent_selection P(t); /* select one or two program(s) with a probability based on fitness (with reselection allowed) */ create P(t); /* create new programs for the pop by applying the following ops with specified probability */ reproduction; /* Copy the selected program to the new pop */ } crossover; /* create new offspring programs for the new pop by recombining randomly chosen parts from 2 selected prgs*/ mutation; /* Create one new offspring program for the new pop by randomly mutating a randomly chosen part of one selected program. */ Architecture-altering

ops; /*

Choose an architecture-altering operation from the available repertoire of such op. and create one new offspring program for the new pop by applying the chosen architecture-altering op. to the one selected prg */ 30/04/2020 DMI Università di Catania 17

Scaling

Suppose one has

two search spaces

. The first is described with a real-valued fitness function

F

. The second search space is described by a fitness function

G

that is equivalent to

F p

, where

p

is some constant. The

relative positions

of peaks and valleys in the two search spaces

correspond

exactly. Only the

relative heights differ

(i.e., the vertical scale is different).

Should our EA search both spaces in the same manner?

30/04/2020 18 DMI Università di Catania

Ranking selection (ES, EP)

If we believe that the EA should

search the two spaces in the same manner

, then selection should only be based on the relative ordering of fitnesses, only the rank of individuals is of importance. • •

ES

Parent selection is performed uniformly randomly, with no regard to fitness . Survival simply saves the

N

best individuals on the relative ordering of fitnesses . , which is only based • •

EP

All individuals are selected to be parents . Each parent is mutated once, producing

N

children.

A probabilistic ranking mechanism chooses the

N

best individuals for survival, from the union of the parents and children . DMI Università di Catania 30/04/2020 19

Probabilistic selection mechanism GA

• • Many people, in the GA community ,

believe that F and G should be searched differently

. Fitness proportional selection is the probabilistic selection mechanism of the traditional GA. Parent selection is performed based on

how fit an individual is with respect to the population average

. For example, an individual with fitness twice the population average will tend to have twice as many children as average individuals. Survival, though,

is not based on fitness

, since the parents are automatically replaced by the children.

DMI Università di Catania 30/04/2020 20

Lacking the killer instinct

One problem with this latter approach is that, as the search continues, more and more

individuals receive fitnesses with small relative differences

. This lessens the selection pressure,

slowing the progress of the search

. This effect, often referred to as "lacking the killer instinct", can be compensated somewhat by scaling mechanisms , that attempt to magnify relative differences as the search progresses..

30/04/2020 21 DMI Università di Catania

Mutation and Adaptation

GAs

typically use

mutation as a simple background operator

very , to ensure that a particular bit value is not lost forever. Mutation in GAs typically flips bits with a

low probability

(e.g., 1 bit out of 1000).

• Mutation is far more

important in ESs

and Instead of a global mutation rate,

mutation EP

.

probability distributions can be maintained for every variable of every individual

. More importantly, ESs and EP encode the probability distributions as extra information within each individual, and allow this information to evolve as well (

self adaptation

of mutation parameters,

while

the space is being searched). DMI Università di Catania 30/04/2020 22

Recombination and Adaptation

• There are a number of recombination methods for

ESs

, all of which assume that the individuals are composed of real-valued variables.

Either the values are exchanged or they are averaged

. The ES community has also considered

multi-parent versions

of these operators.

• The

GA

community places primary emphasis on crossover. –

One-point recombination

parents (e.g., between the 3rd and 4th variables, or bits). Then the information before the cut-point is swapped between the two parents. inserts a

cut-point

within the two –

Multi-point recombination

is a generalization of this idea, introducing a higher number of cut-points. Information is then swapped between pairs of cut-points. –

Uniform crossover

, however, does not use cut-points, but simply uses a global parameter to indicate the likelihood that each variable should be exchanged between two parents. 30/04/2020 23 DMI Università di Catania

Representation

• Traditionally,

GAs

use

bit strings

. In theory, this representation makes the GA more problem independent. We can also see this as a more

genotypic level of representation

, since the individual is in some sense encoded in the bit string. Recently, however, the GA community has investigated more

phenotypic representations

, including vectors of real values, ordered lists, neural networks. • The

ES

and

EP

communities focus on

real-valued vector

representations, although the EP community has also used

ordered list

and

finite state automata

representations.

• Very little has been done in the way of

adaptive representations.

DMI Università di Catania 30/04/2020 24

Strength of the selection and population’s

carrying capacit

y •

Strong selection

refers to a selection mechanism that concentrates quickly on the best individuals , while

weaker selection mechanisms

allow poor individuals to survive (and produce children) for a longer period of time. • Similarly, the population can be thought of as having a certain

carrying capacity

, which refers to the amount of information that the population can usefully maintain . A

small population

has less carrying capacity, which is usually adequate for simple problems.

Larger populations

, with larger carrying capacities, are often better for more difficult problems. • Perhaps the evolutionary algorithm can adapt both

selection pressure and the population size dynamically

, as it solves problems.

30/04/2020 DMI Università di Catania 25

Accumulated payoff

EP

and

ES

usually have

optimization for a goal

. In other words, they are typically most interested in finding the best solution as quickly as possible. • De Jong (1992) reminds us that

GAs are not function optimizers per se

, although they can be used as such. There is very little theory indicating how well GAs will perform optimization tasks. Instead,

theory concentrates on what is referred to as accumulated payoff

. • The difference can be illustrated by considering

financial investment planning over a period of time

(e.g., you play the stock market). Instead of trying to find the

best

stock , you are trying to maximize your returns as the various stocks are sampled . Clearly the two goals are somewhat different, and maximizing the return may or may not also be a good 30/04/2020 heuristic for finding the best stock. 26

Fitness correlatio

n

Fitness correlatio

n, which appears to be a measure of EA Hardness that places less emphasis on optimality (Manderick et al., 1991). Fitness correlation

measures the correlation between the fitness of children and their parents

. Manderick et al. found a strong relationship between GA performance and the strength of the correlations. • Another possibility is

problem modality

.

Those problems that have many suboptimal solutions

will, in general, be more difficult to search. • Finally, this issue is also very related to a concern of de Garis, which he refers to as

evolvability

. de Garis notes that often

his systems do not evolve at all

, namely, that fitness does not increase over time. The reasons for this are not clear and remain an important research topic.

DMI Università di Catania 30/04/2020 27

Distributed EAs

Because of the

inherent natural parallelism

within an EA, much recent work has concentrated on the

implementation of EAs on parallel machines

. Typically either one processor holds one individual (in SIMD machines), or a subpopulation (in MIMD machines). Clearly, such implementations hold promise of execution time decreases. More interestingly, are the

evolutionary effects

that can be naturally illustrated with parallel machines, namely,

speciation

,

nicheing

, and

punctuated equilibria

(Belew and Booker,1991).

DMI Università di Catania 30/04/2020 28

Resume

• Selection serves to

focus search into areas

high fitness.

of • Other genetic operators (recombination and mutation) perturb the individuals, providing exploration in nearby areas .

• •

Recombination

and

mutation

provide

different search biases

, which may or may not be appropriate for the task.

The key to more robust EA systems lies in the

adaptive selection of such genetic operators

.

DMI Università di Catania probably 30/04/2020 29

Decimal 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Binary 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111

Hint: Gray Code

Gray 0000 0001 0011 0010 0110 0111 0101 0100 1100 1101 1111 1110 1010 1011 1001 1000

{ } Procedure binaryToGray g 1 =b 1 ;

for

k=2

to

m

do

g k =b k-1 XOR b k ; Gray code is often used in GAs for mapping between a decimal number and a bit string.

Mapping each digit of a decimal number to a string of four bits corresponds to choosing 10 strings from 16 possibilities. In Gray code, neighbouring decimal digits are two always represented by adjacent bit strings that differ by only one bit position .

DMI Università di Catania 30/04/2020 30

Advantages of EAs

Widely applicable

, also in cases where no (good) problem specic techniques are available: – Multimodalities, discontinuities, constraints.

– Noisy objective functions.

– Multiple criteria decision making problems.

• •

No presumptions

with respect to the problem space.

Low development costs

; i.e. costs to adapt to new problem spaces.

• The

solutions

of EA's have

straightforward interpretations

.

• They can be run interactively (online parameter adjustment).

30/04/2020 31 DMI Università di Catania

Disadvantages of EAs

No guarantee for finding optimal solutions

within a finite amount of time: True for all global optimization methods.

No complete theoretical basis

progress is being made.

(yet), but much •

Parameter tuning

is largely based on trial and error (genetic algorithms); solution: Self adaptation (evolution strategies).

• Often computationally expensive:

Parallelism

.

DMI Università di Catania 30/04/2020 32

Applications of EAs

• Optimization and Problem Solving; • NP-Complete Problem ; • Protein Folding ; • Financial Forecasting; • Automated Synthesis of Analog Electrical Circuits; • Evolutionary Robotics; • Evolvable Hardware; • Modelling.

30/04/2020 33 DMI Università di Catania

Turing ‘s vision

DMI Università di Catania July 20, 2001 30/04/2020 34

Turing’s three approaches

for creating intelligent computer program One approach was

logic-driven

while a second was

knowledge-based

. The third approach that Turing specifically identified in 1948 for achieving machine intelligence is

“... the genetical or evolutionary search by which a combination of genes is looked for, the criterion being the survival value”. A. M Turing A. M.,Intelligent machines, Machine Intelligence, 1948.

DMI Università di Catania 30/04/2020 35

In 1950 Turing described how evolution and natural selection might be used to create an intelligent program: “... we cannot expect to find a

good child machine

at first attempt. One must

experiment with teaching

one such machine and see how well it learns. One can then try another and see if it is

better or worse

”.

Turing A. M.,

Computing machinery and intelligence

,Mind,1950.

DMI Università di Catania 30/04/2020 36

Turing’s third approach & EAs

There is an obvious connection between this process and evolution, by identifications: 1.

Structure of the child machine = Hereditary material ; 2.

Changes of the child machine = Mutations ; 3.

Judgment of the experimenter = Natural selection .

The above features of Turing's third approach to machine intelligence are common to the various forms of evolutionary computation developed over the past four decades.

30/04/2020 37 DMI Università di Catania

Genes were easy: The Protein Folding Problem

Genomics

Transcriptomics

Proteomics

30/04/2020 38 DMI Università di Catania

Reductionistic and synthetic approaches

DMI Università di Catania 30/04/2020 39

Basic principles

DMI Università di Catania 30/04/2020 40

Why are computer scientists interested in biology ?

30/04/2020 41 DMI Università di Catania

Scientific answer

• Biology is interesting as a domain for AI research (i.e. drug design) .

• Biology provides a rich set of metaphors for thinking about intelligence : genetic algorithms, neural networks and Darwinian automata are but a few of the computational approaches to behavior based on biological ideas. There will, no doubt, be many more ( Artificial Immune System ).

DMI Università di Catania 30/04/2020 42

• Pragmatic answer “

Gene sequencing’s Industrial Revolution”

[IEEE Spectrum, November 2000]

.

• IBM predicts the IT market for biology will grow from $3.5 billion to more $9 billion by 2003. The volume of life science data doubles every six months .

[IEEE Spectrum, January 2001]

• “Golden rice to Bioinformatics”

[Scientific American 2001]

.

• Bio technology , Bio XML , Bio Perl , Bio Java , Bio inspired models , Biological data analisys … Bio all .

DMI Università di Catania 30/04/2020 43

The building blocks: the 20 natural Amino acids

1. Ala A 2. Arg R 3. Asn N 4. Asp 5. Cys D C 6. Gln 7. Glu 8. Gly 9. His 10. Ile Q E G H I Alanine

H

Arginine

C+

Asparagine

P

Aspartic acid

C-

Cysteine

P

Glutamine

P

Glutamic acid

C-

Glycine

P

Histidine

P, C+

Isoleucine

H

11. Leu 12. Lys 13. Met L K M 14. Phe F Phenylalanine

H

15. Pro P Proline

H

16. Ser 17. Thr 18. Trp 19. Tyr 20. Val S T W Y V Leucine

H

Lysine

C+

Methionine

H

Serine Threonine Tryptophan Tyrosine Valine

P P H P H

3-letter code, single code, name residued,

c

harge,

P

olar,

H

ydrophobic

.

DMI Università di Catania 30/04/2020 44

Proteins are

necklaces

of amino acids

The protein is a linear polymer of the 20 different kinds of amino acids , which are linked by peptide bonds . Protein sequence length: 20 – 4500 aa.

DMI Università di Catania 30/04/2020 45

Hydrophobic

&

hydrophilic

residues

Hydrophobic residues

tend to come together to form compact core that exclude water. Because the environment inside cells is

aqueous

(primarily water), these hydrophobic residues will tend to be on the inside of a protein, rather than on its surface.

• Hydrophobicity is one of the key factors that determines how the chain of amino acids will fold up into an active protein (

Hydrophilic:

attracted to water,

Hydrophobic:

repelled by water).

• The

polarity

of a molecule refers to the degree that its electrons are distributed asymmetrically. A

non-polar

molecule has a relatively even distribution of charge. DMI Università di Catania 30/04/2020 46

Primary structure

•The scaffold is always the same. •The side-chain R determines the amino acid type .

DMI Università di Catania 30/04/2020 47

Grand Challenge Problems in Bioinformatics

[T.Lengauer, Informatics – 10 Years Back, 10 Years Ahed, LNCS 2000] 1. Finding Genes in Genomic Sequences 2.

Protein Folding and Protein Structure Prediction 3. Estimating the Free Energy of Biomolecules and their complexes 4. Simulating a Cell DMI Università di Catania 30/04/2020 48

The famed Protein Folding problem asks

how the amino-acid sequence of a protein adopts its native three-dimensional structure under natural conditions (e.g. in aqueous solution, with neutral pH at room temperature). DMI Università di Catania 30/04/2020 49

Sequence

Structure

Function

While the nature of the fold is determined by the sequence, it is encoded in a very complicated manner. Thus, protein folding

can be seen as a connection between the genome ( sequence ) and what the proteins actually do ( their function ).

DMI Università di Catania 30/04/2020 50

Protein Folding process

•Denaturated = Unfolded (Disruption of equilibrium and of the H-bonds)

No biological activity

• Native = Folded = Unique compact structure

Biologically active

Below folding transition temperature, the protein seems to exist under a unique conformation.

Folding time: 10 -2 to 1 s

Folding the smallest protein: a single alpha helix .

DMI Università di Catania 30/04/2020 51

Protein Folding Principles

Finding a low-energy shape:

Proteins tend to twist into shapes that achieve a “low energy” state in which amino acids fit comfortably together.

Attraction between neighbors:

aa will most strongly attract or repel those closest to themselves. Aa can also interact with each other through “H-bonds”, weak interactions that when multiplied throughout the chain, can hold a protein in a regular shape.

DMI Università di Catania 30/04/2020 52

The Ab initio protein structure prediction problem

Protein folding has to be distinguished from protein structure prediction (PSP).

In the PSP we are not interested in the folding process but just in the final structure attained .

30/04/2020 53 DMI Università di Catania

Why do Proteins “Fold“ ?

In order to carry out their function (e.g. as enzymes or Ab), they must take on a

particular shape

, also known as a

Fold

. Thus, proteins are truly amazing machines: before they do their work, they

assemble themselves!

This self-assembly is called

Folding

.

What happens if protein don’t fold correctly ?

Diseases such as Alzheimer's disease , cystic fibrosis , Mad Cow disease, an inherited form of emphysema , and even many cancers are believed to result from protein misfolding . When proteins misfold , then can clump together (aggregate).

DMI Università di Catania 30/04/2020 54

Computational Tools in Protein Folding

• Molecular Dynamics (MD); • Monte Carlo (MC); • Simulated Annealing (SA); • Evolutionary Algorithms (EA); • Convex Global Underestimator (CGU) [K.A.Dill & H.S.Chan,Nature Struct Biol, 4:10-19,1997] ; • Performance Guaranteed Approximation Algorithms [ W.E. Hart & S. Istrail,RECOMB, ACM SIGACT 1997] ; Search methods Compute native conformation 30/04/2020 55 DMI Università di Catania

Genetic Algorithms for the

ab initio

prediction

Using biologically inspired ideas to compute about biological problems.

30/04/2020 56 DMI Università di Catania

Why Genetic Algorithms for Protein Structure Prediction?

1. PSP is analytically difficult to solve . 2. The number of conformations

N

aa grows exponentially as  N of a protein with where  is the average number of conformations per residue (typically  10). 3. The PSP problem is NP-hard . 4. The energy landscape ` rugged '. of proteins must be 30/04/2020 57 DMI Università di Catania

The HP protein folding model

[K. A.Dill, Biochemistry, 24:1501, 1985].

• HP models abstract the hydrophobic interaction process in protein folding by reducing of hydrophobicity in the protein.

a protein to a heteropolymer that represents a predetermined pattern • Nonpolar amino acids are classifìed as hydrophobic ( H ) and polar amino acids are classified as hydrophilic ( P ). A sequence is s 

{H, P} +

• The HP modei restricts the space of conformations to, self-avoiding paths on a lattice ( square , cubic or triangular ) in which vertices axe labeled by the amino acids.

DMI Università di Catania 30/04/2020 58

The energy potential in the HP model

• The energy potential reflects the fact that hydrophobic amino acids have a propensity to form a hydrophobic core . To capture this feature of protein structures, the HP model adds a value topological contact.

 for every pair of hydrophobics that form a • A topological contact is formed by a pair of amino acids that are adjacent on the lattice and not consecutive in the sequence. The value of  is typically taken to be -1 .

30/04/2020 59 DMI Università di Catania

HP sequences embedded

Figure shows sequence embedded in the square lattice, with hydrophobic-hydrophobic contacts (HH contacts) highlighted with conformation has an energy of dotted –4 .

lines.

The DMI Università di Catania 30/04/2020 60

Why HP model for Protein structure prediction ?

• S. Istrail: “

Best Science for Protein Folding

”, [Lipari School on Computational Biology 1999] .

• HP model is powerful enough to capture a variety of properties of actual proteins [Dill

et al.,

Protein Science, 4:561, 1995] .

• The PSP problem for the HP model has been shown to be NP-complete on the square lattice [Crescenzi

et al

., J. Comp. Bio. 5(3), 1998] and cubic lattice 1998] .

[Berger

et al.

, J.Comp. Bio., 5(1), 30/04/2020 61 DMI Università di Catania

Internal coordinates with

Absolute

direction

Individuals are coded with a sequence in

{U,D,L,R,F,B} n-1

: which correspond to up, down, left, right, forward and backward moves in a cubic for a length n protein. 30/04/2020 62 DMI Università di Catania

Representation and Encoding

P seq

{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} +

P HP 

{H, P} +

P conf

{U,D,L,R,F,B} n-1

P conf-abs =RULLURURULU DMI Università di Catania 30/04/2020 63

Bond directions describing lattice conformations

Direction  r U p r z  r z +1 L eft r x  r x -1 F ront r y  r y +1 B ack r y  r y -1 R ight r x  r x +1 D own r z  r z –1 DMI Università di Catania A bond direction corresponds to a change,  r , in one of the Cartesian coordinates of the successive monomer, keeping all other coordinates the same as the previous monomer.

30/04/2020 64

The energy function

DMI Università di Catania where, •   strength of HH attraction (usually taken as 1).

•  2  energetic penalty parameter for sites containing two monomers.

•  3  energetic penalty parameter for sites containing three or more monomers.

30/04/2020 65

} {

Procedure GA (Protein Sequence)

t := 0; initialize population P(t); /* random pop. of conformations */ evaluate P(t); /* compute energy function */

while not

terminate

do {

/* terminate when free energy reaches equilibrium point*/ parent_selection P(t); crossover P(t); /* 1-point-Crossover acts stochastically with fixed probability P cros */ mutate P(t); /* randomly change the value of bond directions along the string with fixed probability P mut */ evaluate P(t); survive P(t);

} /*

All individuals are replaced except for 10% of the current best conformation elitism strategy

*/ Output

(Protein structure); DMI Università di Catania 30/04/2020 66

Selection

Selection is linearly proportional to fitness

so that the probability, P

i

, of selecting the i-th conformation, with a fitness value F

i

, to propagate to the next time step is given by:

DMI Università di Catania 30/04/2020 67

Fitness function Probabilities must be positive so a linear mapping with a cut-off value is used to

convert the energy (E) minimization problem to a fitness (F) maximization

:

DMI Università di Catania 30/04/2020 68

Population Free Energy

• Since GAs deal with an ensemble of solutions, a quantity analogous to the statistical mechanical free energy is used. The population free energy ,

F

, is calculated from its partition function ,

Z

: • where the sum is over the total number of conformers in a population and the i-th conformer. Hence, E i is the energy of DMI F = -ln(Z) Università di Catania 30/04/2020 69

Evolution of energy distributions

The GA dynamics converge the population of conformations to an equilibrium distribution. This is characterised by

F

as shown in next slide. Fig. 1 Evolution of a population: plotting the energy distributions at various time steps.

t

is the percentage of the total run-time elapsed . The population converges within 50% of the run time.

DMI Università di Catania 30/04/2020 70

Mean, Minimum & Free Energy of an Evolving Population

The mean energy fluctuates around the equilibrium making it difficult to use as a stop criterion The free energy, which characterises the energy distribution of a population, reaches a convergence, or equilibrium point.

DMI Università di Catania 30/04/2020 71

Compact HP conformation found by GA

This method determines the global minima of HP sequences by constructing conformations with a core of H (hydrophobic) residues that also minimize the surface area of the conformation.

Example of a compact HP conformation, with a hydrophobic core, found by the GA (number of nearest neighbours = 39). Dark=H=hydrophobic, Light=P=polar.

DMI Università di Catania 30/04/2020 72

Sequence

Lowest Energy # Energy eval

643d1 -27 433.533

643d.2 -30 167.017

643d.3 -38 172.192

643d.4 -34 107.143

643d.5 -36 154.168

643d.6 -31 454.727

643d.7 -25 320.396

643d.8 -34 315.036

643d.9 -33 151.705

643d.10 -26 191.019

643d.1 wwbbbbbwwwbbwwwwwbbwwwbwwwwwwbwb wwwbwwbwwbwwwwwbwwwwbbwbbwwbwwbw DMI Università di Catania

Results

Studies were carried out using HP sequences taken from Unger,Moult, J. Molecular Biology, 231,1993

.

30/04/2020 73

Energy Landascapes are Funnels

DMI Università di Catania 30/04/2020 74

Conformational search strategies

DMI Università di Catania 30/04/2020 75

Conclusions

• The evolutionary algorithms (GA) approach to the protein structure prediction problem offers a

promising potential method of solution

. • GAs are

fast

and

efficient

at searching the

conformational landscapes rugged

presented by protein molecules.

Working in Progress

• Work is under way

to design a EA to optimize real protein structures

(torsion angle space, lattice models). 30/04/2020 76 DMI Università di Catania

References • D. E. Clark (ed.),

Evolutionary Algorithms in Molecular Design

, Wiley-VCH, Weinheim, 2000.

• M. Kanehisa,

Post-Genome Informatics

, Oxford University Press, 2000. (

An overview of the types of data and databases used in bioinformatics).

30/04/2020 77 DMI Università di Catania

Final Remark

“The impact of Bioinformatics research critically depends on an accurate understanding of the biological process under investigation. It is essential to ask the right questions, and often modeling takes priority over optimization. Therefore, we need people that

understand

and

love

both

computer science and biology

forward.” to bring the field Thomas Lengauer Institute of Computer Science - University of Bonn DMI Università di Catania 30/04/2020 78