Transcript Document
Genetic Programming and Genetic Algorithms: General Introduction (7/21/2015)

Introduction

This series of lectures tries to cover the area of search from a different perspective. We first observe that every program is a function, from a domain to a range: a program takes an input from an “acceptable set of inputs” and generates an output (side-effects could be part of the output). Sometimes the program “is aware” of the exact set of acceptable inputs and reacts appropriately to inputs outside that set; most of the time such awareness is limited, and so the function the program corresponds to may produce unpredictable (or undesirable) input-output pairs outside a small part of the possible set of inputs.

If the function can be represented in terms of already known functions, either through algebraic formulae or through exact (domain, range) pairing rules, we can describe the function explicitly and we can end up with a program for computing the input-output relation. If the function cannot be so represented, we have a problem… If the function CAN be so represented, we may still (and, with high probability, do) have a problem (NP-Complete, anyone?)… Can we solve the problem(s)? The answer will be, by and large, NO, BUT…

If we do not have explicit rules to take us from an input to an output, what can we expect to have? a) A finite collection of valid input-output pairs; b) A way of evaluating whether a collection of input-output pairs produced is more or less desirable than some other collection of pairs with the same input components; c) A way of evaluating when our process of function construction can stop, either because we are not generating “better functions” or because some other cost (time or space) is becoming unacceptable.
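Criterion (b) above can be made concrete with a least-squares comparison - a minimal sketch; the squared-error metric is one possible choice of “desirability”, not one prescribed by the lecture:

```python
def preference(known, candidate_a, candidate_b):
    # known, candidate_a, candidate_b: output values at the same inputs.
    # Returns 'a' or 'b' according to which candidate's outputs sit closer
    # (in the least-squares sense) to the known ones.
    def err(cand):
        return sum((c - k) ** 2 for c, k in zip(cand, known))
    return 'a' if err(candidate_a) <= err(candidate_b) else 'b'

known = [0.0, 1.0, 4.0]  # outputs of the (unknown) target at inputs 0, 1, 2
print(preference(known, [0.1, 0.9, 4.2], [1.0, 0.0, 2.0]))  # prints 'a'
```

Any monotone cost on the pairwise output differences would serve the same role.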
In some instances, we have a function and we are trying to find an input-output pair with some desired characteristics - e.g., a maximum, a minimum or a saddle-point of the function. If the function is given in a simple analytical form, Calculus techniques may be adequate. If the function is given in a very complex form (or, at least, not amenable to simple analytic techniques), an “intelligent search” (using known properties of the function to reduce the search space) may be the only available strategy.

One of the early results was obtained by McCulloch and Pitts (starting in the 1940’s) via their study of neural networks: input-output networks with an input layer of nodes connected to an output layer of nodes, later formalized as perceptrons by Rosenblatt. By the late 1960’s this setup had been shown to be inadequate to represent useful functions (e.g.: XOR) (Minsky and Papert, Perceptrons, MIT Press, 1969). Later on, other people showed that the introduction of a third layer was adequate for the approximation of any desired “well-behaved” function: the level of approximation was tied to the number of nodes in this intermediate layer. More specifically, we have:

The Universal Approximation Theorem. Let φ(•) be a non-constant, bounded and monotone-increasing continuous function on R. Let Ip denote the p-dimensional closed unit hypercube [0, 1]^p. Let C(Ip) denote the space of continuous functions Ip --> R. Then, given any function f ∈ C(Ip) and ε > 0, there exists an integer M and sets of real constants a_i, θ_i and w_ij, where i = 1, …, M and j = 1, …, p, such that we may define

F(x_1, …, x_p) = Σ_{i=1}^{M} a_i φ( Σ_{j=1}^{p} w_ij x_j − θ_i )

as an approximate realization of the function f; that is, |F(x_1, …, x_p) − f(x_1, …, x_p)| < ε for all (x_1, …, x_p) ∈ Ip. Proof. See the references in: Simon Haykin, Neural Networks, a Comprehensive Foundation, Macmillan, 1994, pp. 181-182.
(There is a second edition out…)

We can observe: • the logistic function 1/[1 + exp(−v)], used as the nonlinearity in a neuron model, satisfies the conditions on φ(•) above; • the network can be thought of as having p input nodes and a single hidden layer with M neurons; the inputs are x_1, x_2, …, x_p; • hidden neuron i has synaptic weights w_i1, w_i2, …, w_ip and threshold θ_i; • the network output is a linear combination of the outputs of the hidden neurons, with a_1, …, a_M defining the coefficients of this combination.

The theorem itself is a simple existence theorem - a generalization of Fourier’s result going back to the 1820’s. It tells us that a single hidden layer suffices, but it does not tell us anything about efficiency of representation, minimality, ease of “training”, size of the required set of hidden nodes, etc. It allows for further results relating the accuracy of estimation of a function f by the approximation F to the accuracy of the empirical fit - i.e., a way of relating M, the number of hidden nodes, to N, the size of the training set. One can also show that there exists a choice of M so that the rate of convergence of training is O((1/N)^(1/2)) times a logarithmic factor.

One of the problems with single hidden layer perceptrons is that all intermediate information is “global”: it becomes impossible to interpret anything as a “local feature”. Introducing two hidden layers allows one to interpret the information at the layer closest to the input as “local” while the information at the layer closest to the output is “global”. Other types of problems lead in a natural way to two-layer networks.
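The single-hidden-layer form F described above can be sketched directly - a minimal sketch; the weights, thresholds and coefficients below are arbitrary placeholders, not a trained network:

```python
import math

def logistic(v):
    # phi(v) = 1 / (1 + exp(-v)): non-constant, bounded, monotone-increasing.
    return 1.0 / (1.0 + math.exp(-v))

def F(x, a, w, theta):
    # F(x_1,...,x_p) = sum_{i=1}^{M} a_i * phi( sum_{j=1}^{p} w_ij x_j - theta_i )
    return sum(a_i * logistic(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) - th_i)
               for a_i, w_i, th_i in zip(a, w, theta))

# Placeholder values (NOT trained): p = 2 inputs, M = 3 hidden neurons.
a = [0.5, -1.0, 2.0]
w = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
theta = [0.0, 0.1, -0.2]
print(F([0.25, 0.75], a, w, theta))
```

The theorem asserts only that, for the right M and constants, such an F gets within ε of the target; finding those constants is exactly the search problem discussed next.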
The point to be made is that, although mathematical theory supports the approximability of most reasonable classes of functions from an n-dimensional euclidean space to an m-dimensional one, coming up with a usable approximation in a reasonable amount of time (= use of computational resources), given a “small” amount of information, remains (for now and forever) a completely non-trivial problem. The problem is, essentially, one of search: among all possible approximations find one that a) fits the known input/output relation “best” (e.g., with minimum least square error); b) allows us to interpolate at points where we do not have data; c) has acceptable cost in both construction and evaluation.

What general method can we devise? • Random Search: start with a pair, add another, …, and so on, rejecting all those that do not meet acceptable (input, output) criteria. You may use continuity - which, essentially, says that nearby input values should lead to nearby output values. After a while, you may have enough of a function to perform (linear) interpolation with some feeling that you won’t have too many surprises. • Non-Random Search: find a clever general purpose algorithm that will allow us to build a “good” function with acceptable cost. For the second one, recall that just about all the NP-Complete problems we know require a complete enumeration of all possible configurations to guarantee finding a solution… so…

More formally: The No Free Lunch Theorems: no search algorithm is superior to any other algorithm on average across all possible problems. Consequence: algorithms that perform better than random search on one class of problems will perform worse than random search on another. Consequence: all algorithms we devise will have to be tailored to the search domain - it is the use of specific domain-dependent information that gives us algorithms better than random search.
Outside of a given domain, our results will be (much) worse.

We are going to concentrate on some aspects of search and of function construction through search - the techniques to be introduced attempt to find computational analogues to methods that (to the best of our current knowledge) appear to be used in biological evolution. Since biological evolution appears to be based on modification of the DNA exchanged from parent(s) to offspring, we are going to have to find ways to: a) encode all desired characteristics of a problem in a data structure that can support “DNA-type” modifications (whatever they will be, but they must include some analogues of chromosomal recombination and mutation applied to strings over some alphabet) and still remain meaningful in our context; b) construct an evaluation function (equivalent to evolutionary fitness determination) on either the underlying data structure (the genotype) or on a structure derivable from the original one; c) devise a strategy for “differential reproduction”, so that “fitter individuals” produce more offspring with a high chance of possessing “desirable characteristics”, while guarding against “genetic overspecialization”. A lot of the terms are in quotes, since it is not obvious how the analogy with biological systems will be implemented: as they say, the devil is in the details…

The Problem of Representation. How do we represent the information to be searched for? In DNA-based search we deal with an alphabet of 4 letters (A, C, G, T), and finite strings over that alphabet. The length of the strings is (more or less) fixed, and parent-child information exchange seems to consist (at least at a first approximation - and for two-parent species) of the joining together of two single parental chromosomal strands into a double descendant strand, with dominant and recessive alleles for nearly all the genes so transmitted.
Two other known mechanisms involve movement of sections of the string from one location to another and single-locus changes (point mutations). Beyond these known mechanisms, there may well be many unknown ones.

Some early representations used fixed-length strings of binary digits: each position had a 0 or a 1 (a bit-gene). A function scored the string. Evolution of a solution involved: 1. Determining which individuals would reproduce. 2. Selecting the pairs of individuals contributing to the offspring. 3. Determining how they would so contribute. 4. Determining the role and frequency of point-mutations. 5. Defining the new population. 6. Repeating the process from 1. above. 7. Determining when termination was reached. 8. Extracting the “best” individual as the solution.

There are several slightly different ways of looking at the evolution of a population via genetic-analogue methods: 1) Each generation corresponds to another subset of possible individuals. Convergence will correspond to a family of subsets of decreasing diameter (under some metric).

2) Given a population of M individuals with N binary genes each, the number of states such a population can be in (a state being a multiset of M individuals drawn from the 2^N possible ones) is given by the binomial coefficient

C(2^N + M − 1, 2^N − 1).

(For example, with N = 1 and M = 2 this gives C(3, 1) = 3 states: {0,0}, {0,1}, {1,1}.) Finding a solution means, essentially, finding a limiting state for the evolving family of populations. A “best” individual of the limit population is our desired solution.

3) We can interpret each possible individual as a point in some space. The evaluation function defines a function from this space to R. We look for the maximum of this function over the space. The graph of this function provides us with what is called a “fitness landscape”. The evolution mechanism creates sets of different points in the domain from one generation to the next.
We stop when several successive generations have not produced a “substantially better individual”, or when a given amount of computational resources has been expended.

We need to consider how the offspring will receive the information from the parents, and the mechanisms that correspond to point-mutation and, possibly, to larger scale mutation (e.g., repositioning of substrings within a chromosome). Start from binary strings: each parent is represented by a binary string of length N. A “reasonable analog” to the contribution of single chromosomal strands from each parent, with dominant and recessive alleles, may be to just take a prefix substring of (random) length 0 ≤ n ≤ N from the first parent, a suffix substring of length N − n from the second parent, and concatenate them in the same prefix-suffix order. This provides a new chromosome of length N. Mutation can be simulated by “walking down” the new string and randomly resetting the bits.

At the simplest level, we now have a data structure which is just a fixed length string of bits, an evaluation function to provide relative ranks of different individuals, a mechanism for enforcing differential reproduction, and a mechanism to provide the next generation. A variant may include allowing some of the “best” individuals of one generation to have exact copies appear in the next. A question that begs to be asked is: given a current generation, can we say anything about the evolution of genetic patterns from one generation to the next? This is crucial to our being able to believe that the process set in motion has some convergence properties, rather than just leading us to a completely random population.

Other Representations. Some problems have “natural representations” in terms of larger alphabets (actual DNA comes to mind - 4 letters) or in terms of continuous quantities (requiring floating point numbers over various ranges).
If the cardinality of the alphabet is a power of 2, we can still use the same bit-oriented mechanisms, and the “chromosome” at the next generation will remain meaningful. If the “alphabet” is made up of continuous ranges, the problem of representation becomes more complex. A possible solution involves using a contiguous range of bits to represent each more structured entity, with the caveat that mutation and recombination must be constrained not to exit the appropriate ranges. Another solution involves accepting floating point values (rather than bits) as “genes”.

This may simplify the interpretation (and implementation) of recombination and mutation, but we are still left with the problem of guaranteeing meaning for the results of such actions. Another representational problem arises in what is properly called “genetic programming”, where the object being “evolved” is a program that attempts to compute a specific (only partially known) function. “Natural” representations of programs may involve trees, where the nodes are functions (interior nodes) or parameter values (leaves - if the program does not require iteration or recursion), or graphs. Although it is always possible to reduce everything to bits, any intuition is likely to be so removed from the bit-representation as to be useless.

A difference in approach between genetic algorithms and genetic programming can be exemplified in the two diagrams below: in genetic algorithms, once the problem is codified into a data structure, we just apply the genetic algorithm; in genetic programming the interaction with the original problem remains more direct and ongoing.
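The prefix/suffix recombination and bit-flip mutation described earlier can be sketched as follows - a minimal sketch, with chromosomes as plain Python lists of bits:

```python
import random

def crossover(parent1, parent2):
    # Prefix of random length n from the first parent, suffix of length N - n
    # from the second, concatenated in that prefix-suffix order.
    n = random.randint(0, len(parent1))
    return parent1[:n] + parent2[n:]

def mutate(chromosome, pm):
    # "Walk down" the string, flipping each bit independently with probability pm.
    return [1 - b if random.random() < pm else b for b in chromosome]

random.seed(1)
child = mutate(crossover([0] * 8, [1] * 8), pm=0.05)
print(child)
```

With all-zero and all-one parents the crossover point is directly visible in the child: a run of 0s followed by a run of 1s, except where mutation has flipped a bit.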
More specifically, the genetic algorithm approach results in defining a binary string and then modifying and evaluating binary strings, constructing successive generations using (at least) crossover and mutation operators.

The genetic programming approach leads to a cycle. The “string” is replaced by a tree, and the tree can be modified in ways that are more complex than those supported by a string (the apparent cycle is not crucial - the methods are all cyclical in nature). More crucial is the observation that we cannot limit a program to a fixed number of “tree nodes”: the search would be much too limited. This implies that the usual methods (which we will study in more detail later) for evaluating “convergence” will have to be modified - if they are applicable at all.

Why should anything “converge”? By allowing a “best element” of the population P to survive from one generation to the next, we can ensure that the “derived evaluation function” F(t) = max_{x ∈ P_t} f(x) is monotone non-decreasing in t, but this does not mean that we should expect improvement or convergence. Another approach, based on the idea of a schema, provides a probabilistic approach, still with substantial drawbacks.

Some Examples: 1 - Function Optimization. You are given the function f(x) = x·sin(10πx) + 1.0 over the closed interval [−1.0, 2.0], and you are expected to find the value of x in that range that maximizes it [Z. Michalewicz, p. 18]. An analytic approach would first compute the zeros of the first derivative (the function possesses a first derivative at every point in (−1.0, 2.0), so any interior maxima or minima will appear only at zeros of the derivative). There are finitely many such values over the given interval.
We can now evaluate the function at all such values, plus the endpoints −1.0 and 2.0 (solving tan(10πx) + 10πx = 0 will require some numerics - nothing too hard). A value of x for which we attain the maximum of this finite set, plus endpoints, provides us with a correct (input, output) pair, and a solution to our problem.

In the absence of analytic techniques, what can we do? We can choose a finite random set of points in [−1.0, 2.0], evaluate the function at those points, choose a point where the function achieves a largest value, and stop. A modification would entail choosing a small random set, finding the x-value where the function has the largest value; choosing a second random set “near” this value, and repeating the process with smaller and smaller sets near better and better values. Stop when you don’t improve from one “generation” to the next, or when you run out of computational cycles. The second technique, just like the first, is likely to leave us stuck near a “relative maximum” which is not optimal… Can we do better?

How do we devise a genetic algorithm? Essentially, we want to add some “intelligence” to this random search: try to avoid getting stuck on local maxima, and direct the search so that it is - hopefully - more efficient than strictly random. How? Representation: what precision do we want? Let’s choose six places after the decimal point (there has to be a point beyond which we don’t care). Six decimal places over an interval of length 3 means 3·1000000 subintervals of [−1.0, 2.0]. Notice that 2097152 = 2^21 < 3000000 < 2^22 = 4194304, so we will use a string of 22 bits to represent numbers in the desired range. We now have binary strings b21 b20 … b1 b0, which we can convert to a decimal number x in [−1.0, 2.0]:

x' = (b21 b20 … b0)_2 = Σ_{i=0}^{21} b_i·2^i,    x = −1.0 + x'·3/(2^22 − 1).

The two “chromosomes” 00…0 and 11…1 correspond to the endpoints −1.0 and 2.0, respectively. All others correspond to interior points.
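The decoding just described can be sketched as - a minimal sketch, with bits stored most-significant (b21) first:

```python
def decode(bits):
    # bits: the 22 bits b21 ... b0, most significant first.
    x_prime = 0
    for b in bits:
        x_prime = 2 * x_prime + b               # x' = sum_i b_i * 2^i
    return -1.0 + x_prime * 3.0 / (2**22 - 1)   # map [0, 2^22 - 1] onto [-1.0, 2.0]

print(decode([0] * 22))  # -1.0
print(decode([1] * 22))  # 2.0
```

The two extreme chromosomes land exactly on the endpoints, as noted above.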
The evaluation function simply takes a binary chromosome, say v, transforms it into a decimal number, say dec(v), and evaluates f at that number: eval(v) = f(dec(v)). Initialization. Create a random initial population of “chromosomes”. Ranking. Rank the chromosomes according to the evaluation function. Reproduction. Normalize the rankings so that each rank corresponds to an appropriate subinterval of [0, 1]. Run the random number generator twice, using the subintervals to determine the probability of choice, to obtain two parents.

When the parents are found, we create the offspring. Two operators are used: crossover and mutation. Select, randomly, the gene after which the crossover takes place. The first parent contributes the part of its chromosome ending at that gene (inclusive); the second parent contributes the final part of its chromosome to the offspring. If mutation is to be included, one must successively use the random number generator to determine if each gene of the offspring is to be changed. Repeat the process until a number of offspring equal to the desired population is obtained. Next Generation. We now have a new generation, for the process to be repeated.

Several issues must be resolved: 1) the size of the initial population (50, in this case); 2) the number of generations the process is allowed to continue (150, in this case); 3) the probability of crossover (the probability that a chromosome undergoes crossover: pc = 0.25); 4) the probability of mutation (the probability that a gene is “flipped”: pm = 0.01). None of them can be “optimally determined”… A run provides the following results: the best individuals discovered within 150 generations are given in the table below.

Some Examples: 2 - The Prisoner’s Dilemma. This is discussed in Michalewicz’s book, in Mitchell’s, and, quite extensively, in D. B. Fogel’s.
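Looking back at Example 1, the cycle described above (initialization, evaluation, reproduction, crossover, mutation) can be sketched end-to-end - a minimal sketch with the quoted parameters, where fitness-proportional (roulette-wheel) selection stands in for the normalized-ranking scheme, and elitism keeps a copy of the best individual:

```python
import math, random

def decode(bits):
    x_prime = int("".join(map(str, bits)), 2)
    return -1.0 + x_prime * 3.0 / (2**22 - 1)

def fitness(bits):
    x = decode(bits)
    return x * math.sin(10 * math.pi * x) + 1.0

def select(population, scores):
    # Roulette wheel: shift scores so all are positive, then give each
    # individual a subinterval of [0, 1] proportional to its shifted score.
    lo = min(scores)
    weights = [s - lo + 1e-9 for s in scores]
    return random.choices(population, weights=weights, k=1)[0]

def evolve(pop_size=50, generations=150, pc=0.25, pm=0.01, n_bits=22):
    population = [[random.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        next_gen = [best[:]]                      # elitism (the "variant" above)
        while len(next_gen) < pop_size:
            p1, p2 = select(population, scores), select(population, scores)
            child = p1[:]
            if random.random() < pc:              # crossover with probability pc
                cut = random.randint(1, n_bits - 1)
                child = p1[:cut] + p2[cut:]
            child = [1 - b if random.random() < pm else b  # per-gene mutation
                     for b in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
    return decode(best), fitness(best)

random.seed(0)
x, fx = evolve()
print(x, fx)  # should approach the known optimum near x = 1.85, f(x) = 2.85
```

How closely a given run approaches the optimum depends on the seed and the parameter choices; as the text says, none of them can be “optimally determined”.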
The Problem: two individuals are held prisoner and are under pressure to confess to some “undesirable activity” implicating the other. The options (and rewards/punishments) are summarized in the table below.

Player1     Player2     P1  P2  Comment
Defect      Defect      1   1   Both punished (relatively mild)
Defect      Cooperate   5   0   Defector rewarded, holdout punished
Cooperate   Defect      0   5   Holdout punished, defector rewarded
Cooperate   Cooperate   3   3   Both rewarded - maybe not free??

The aim of the game is to find a strategy that, over the long run, will maximize one’s gains. At any one point, the maximizing strategy would have the winner defect, and the loser hold out, so there is always a temptation to defect. In fact, defection is always the “safest individual choice” at each point. On the other hand, a sequence of mutual defections has a combined payoff much smaller than that of a sequence of mutual cooperations. We can compute: 1. An infinite sequence of random choices (each configuration has probability 0.25) has an expected per-move return (for each of the players) of (1 + 5 + 0 + 3)/4 = 2.25, with an expected cumulative per-move return of 4.5; 2. An infinite sequence of defections, with the other player choosing randomly, has an expected per-move return (for the defector) of (1 + 5)/2 = 3 and of (1 + 0)/2 = 0.5 (for the other), so the expected cumulative per-move return is only 3.5; 3. Ex.: compute the expected returns, individual and cumulative, for at least two other sequences of actions.

Representing a solution. Ideally, we should have a complete memory of the past to determine the next decision. Since this is not possible, a decision has to be made on the basis of a “finite memory”. Michalewicz’s text uses the previous three moves. In fact (Mitchell’s and Fogel’s books), tournaments (human and computer) were organized by R. Axelrod in the ’70s and ’80s to try to determine “best strategies”, and he decided to use the “three-move memory” we will use here.
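With three remembered moves, each move having 4 possible outcomes, a strategy becomes a lookup table indexed by history. A minimal sketch - the encoding C = 0, D = 1 and the history ordering are one arbitrary canonical choice:

```python
C, D = 0, 1  # cooperate = 0, defect = 1 (one arbitrary canonical ordering)

def history_index(history):
    # history: three (my_move, opponent_move) pairs, oldest first.
    # Each pair takes 2 bits, so three pairs index into 0..63.
    idx = 0
    for mine, theirs in history:
        idx = idx * 4 + mine * 2 + theirs
    return idx

# A strategy is then just a 64-entry response table (plus, in the 70-bit
# chromosome described below, 6 priming bits for the initial history).
strategy = [D] * 64
strategy[history_index([(C, C), (C, C), (C, C)])] = C  # cooperate after (CC)(CC)(CC)
print(strategy[0])  # 0, i.e., C
```

Crossover and mutation on such a chromosome are the plain bit-string operators already described; only the interpretation of the bits changes.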
If a “chromosome” contains information about the three previous moves, and since there are 4 possible outcomes for each move, we have to keep track of 64 (= 4·4·4) possible game histories - 64 different histories.

If we order the histories in some canonical order, a 64-bit array allows us to associate a response to each history. We must also “prime the system” with three initial games, with two bits required for each (an index into the histories). Total: 70 bits for a chromosome. Choose an initial population. Create a number, N, of random 70-bit strings. Test each player to determine effectiveness. Use the strategy encoded in the chromosome to play games against all other players. The “fitness” is the average score over all the games played. The original study by Axelrod had each member of the population play against a set of given strategies - culled from one of the human and computer tournaments. The initial game sequence provides an index into the bit-string, so one could use the first six bits to determine the initial strategy for each player (player no. 1 uses its initial conditions to determine the game sequence to be “continued” by both). Select the players to breed. A player with an average score (within a standard deviation of the average) is given one mating; a player with a score one or more standard deviations above average is given two matings; one with a score one or more standard deviations below average is given no matings (some minor adjustment may be necessary to keep the population of constant size). Breed. Randomly pick pairs, and create two descendants per mating, using both crossover and mutation.

Results. Experiments lead to a number of strategies being “discovered”. 1. Continue cooperation after three initial cooperations: (CC)(CC)(CC) leads to C. 2. Defect when the other player defects: (CC)(CC)(CD) leads to D. 3.
Continue to cooperate after cooperation has been restored: (CD)(CD)(CC) leads to C. 4. Cooperate when cooperation has been restored after your own defection: (DC)(CC)(CC) leads to C. 5. Defect after three mutual defections: (DD)(DD)(DD) leads to D. 6. If the payoff for successful defection is increased to 6, strategies can develop with expected payoffs > 3.0 (Fogel).

The evolved strategies can be represented as finite-state machines, which also means that the final best strategy can be interpreted as a formal program - the best evolved. This was an example of co-evolution, where the individuals in a population were pitted against one another and “caused” the population to evolve through mutual interactions. For much more detail, see the papers by Axelrod or the book by D. B. Fogel: Evolutionary Computation.

Some Examples: 3 - The Traveling Salesman Problem. This is a well-known NP-Complete problem, which has some “reasonable” approximation schemes - at least in the case in which the distances between nodes satisfy a “triangle inequality” (see Cormen & al., Ch. 35). There can be no expectation of an exact solution (find a tour of minimum length) in anything but exponential time, but it may be possible to “beat” the quality of the known deterministic approximation algorithms without giving up too much time… What is a chromosome? A “reasonable” interpretation for a chromosome is that it represents a tour. Since the complete graph consists of N nodes, is the chromosome a binary string (array) or an integer array?

This is an important decision because of the genetic operators: do we split at a bit or at a full node index (integer)? 1. If we use a bit representation, our chromosome will need N·ceiling(log2 N) bits.
Using crossover or mutation at the bit level cannot guarantee that the new chromosome represents a new tour, or, possibly, even that we have a path in the graph (if 2^ceiling(log2 N) > N, some bit patterns represent non-existent nodes). 2. If we use an integer representation (N integers), and we use our genetic operators at integer boundaries, we can at least guarantee that we still have a path, although maybe not a tour.

The choice in this case is an integer representation for the chromosome, essentially because: a. it avoids one set of problems - the possible introduction of non-existent nodes - and, b. it permits some simpler “crossover repair algorithms” to make sure that no “stillborn” descendants are allowed into the population. The repair algorithms can be incorporated into the genetic operators. c. Mutation can be handled in a similar way. Initialization. For a population of size M, pick M random permutations of N items. Another possibility is to use a greedy algorithm to construct M approximate solutions, and start from those.

Evaluation and Ranking. Straightforward for each individual - just compute the value of the tour it represents. Breeding. Both crossover and mutation must be implemented carefully to preserve a tour and to maintain a relationship with the parents. We will look at the details at some later date. Results. The algorithm, as described, appears to be better than random search, but is not very efficient: a 100-city tour, after 20,000 generations, gives a value for the best tour found about 9.4% above optimum (Michalewicz, p. 26).

Some Examples: 4 - Tree-Based Genetic Programming. Assume we are given a function, either explicitly or, more likely, as a set of (input, output) pairs. Assume we have a finite set of pairs derived from the function y = x^2. We want to construct the “best interpolating function” we can, obviously starting only from the set of (input, output) pairs.
A program can be viewed as a tree structure, where the leaves are terminals (= parameter values), while the interior nodes are function calls whose children are values provided from farther down the tree.

One would start with an initial population of trees, and evaluate the individuals on the set of (input, output) pairs, assigning each a value dependent on how “good” the match is between the output values computed by the program and those given. One would then apply some tree-compatible genetic operators to generate the new population. (Another example of a “next generation” was shown in the slides.) Final result: we have a true match, although all we really know is that we have achieved an “exact interpolation” of the given (input, output) set.

The program simple-gp.c evolves such a solution. It develops a formula (with some understanding of the need to take care of zero-divisions) which, complex as it appears, can, in fact, be reduced to an actual function… unfortunately it does not look much like the actual function on a first reading. This is not an unusual problem, since the rules for “canonical representation” of rational functions are fairly hard to implement… see some of the early programs for symbolic computation.

This approach raises a number of representational questions: how do we represent a program? What is a chromosome? What is mutation? etc. Part of the problem is that a constant length chromosome might correspond to a rather limited family of trees, making the whole evolutionary process moot. On the other hand, introducing variable length chromosomes with very large alphabets (= primitive functions and parameter values) may grow our chromosomes to enormous size (although our own DNA may be trying to warn us).
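The tree representation and its evaluation against the (input, output) pairs can be sketched as follows - a minimal sketch; the tuple encoding and the “protected division” convention (returning 1.0 on a zero divisor, as simple-gp.c must also guard against) are illustrative choices:

```python
def evaluate(tree, x):
    # A tree is either the terminal 'x', a numeric constant, or a tuple
    # (op, left, right) whose children are evaluated recursively.
    if tree == 'x':
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    if op == '+': return a + b
    if op == '-': return a - b
    if op == '*': return a * b
    if op == '/': return a / b if b != 0 else 1.0  # protected division
    raise ValueError("unknown operator: " + op)

def fitness(tree, pairs):
    # Sum of squared errors over the given (input, output) pairs; 0 is exact.
    return sum((evaluate(tree, xi) - yi) ** 2 for xi, yi in pairs)

pairs = [(v, v * v) for v in (-2, -1, 0, 1, 2)]  # samples from y = x^2
print(fitness(('*', 'x', 'x'), pairs))  # 0 - an exact interpolation
```

Note that a fitness of 0 only certifies exact interpolation of the given pairs, not a match with the underlying function - exactly the caveat raised above.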
Furthermore, we are looking for some theoretical models to at least give plausibility to our methods; such theoretical models are likely to be far too complicated.

Some other ideas on Genetic Programming. Other questions arise on the meaning of a “program”: as we indicated in the Prisoner’s Dilemma, one can evolve finite-state machines that are quite efficient. Those are, undeniably, programs. The next question might be how to represent (and define and apply genetic operators for) stack machines (supporting recursion), assembly-language machines, graph-reduction machines (which are used in the compilation and optimization of functional language programs), etc. And all of this requires some kind of supporting theoretical framework.
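To make the finite-state-machine view of a Prisoner’s Dilemma strategy concrete, here is the classic Tit-for-Tat strategy (the winner of Axelrod’s tournaments, shown here as an illustration, not as one of the evolved strategies above) as a machine whose only state is the opponent’s last move:

```python
C, D = 'C', 'D'

class TitForTat:
    # Tit-for-Tat as a two-state machine: the state records the opponent's
    # last move, and the machine's output is that state.
    def __init__(self):
        self.state = C  # start by cooperating

    def play(self, opponent_last=None):
        if opponent_last is not None:
            self.state = opponent_last  # transition on the opponent's move
        return self.state

player = TitForTat()
print(player.play())   # C  (opening move)
print(player.play(D))  # D  (retaliate)
print(player.play(C))  # C  (forgive)
```

Evolving such machines means applying mutation and recombination to their state sets and transition tables rather than to flat bit-strings - one instance of the representational questions just raised.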