PowerPoint Transcript
CS 551/651 Search

Topics
• Local vs. global search
• Simulated annealing, genetic algorithms
• Neural networks
• Through-the-lens camera control
  – Paper by Gleicher, with two assistive handouts

Local Search Algorithms and Optimization Problems

What are search problems?

What are search problems? Examples
• Fit a line to data… fit a surface to data
• What quantities of quarters, dimes, and nickels add up to $17.45 while minimizing the total number of coins?
• Is the price of Microsoft stock going up tomorrow?
In each case, what are the DOFs and the evaluation function?

What are search problems? Constraints
• Some solutions must be avoided (hard constraints)
  – Minimize foo subject to constraint bar
  – Cull parts of the state space
• Some solutions are highly undesirable (soft constraints)
  – Minimize foo and try hard to avoid condition bar
  – Add some highly weighted terms to foo

Considering search some more
• How do you know what good is?
• Are good things near one another?
• Where do you search?
• Do you know when to stop?
• Where do you not search?
Local Search
Local search does not keep track of previous solutions
• Instead, it keeps track of the current solution (current state)
• Uses a method of generating alternative candidate solutions
Advantages
• Uses a small amount of memory (usually a constant amount)
• Can find reasonable (note: we aren't saying optimal) solutions in infinite search spaces

Optimization Problems
Objective function
• A function with vector inputs and scalar output
  – the goal is to search through candidate input vectors in order to minimize or maximize the objective function
Example
• f(q, d, n) = 1,000,000 if q*0.25 + d*0.10 + n*0.05 != 17.45
            = q + d + n otherwise
• minimize f

Search Space
The realm of feasible input vectors
• Also called the state-space landscape
• Usually described by
  – number of dimensions (3 for our change example)
  – domain of each dimension (#quarters is discrete, from 0 to 69…)
  – nature of the relationship between the input vector and the objective function output
    • no relationship
    • smoothly varying
    • discontinuities

Search Space
Looking for the global maximum (or minimum)

Hill Climbing
Also called greedy search
• Select a starting point and set current
• evaluate(current)
• loop do
  – neighbor = highest-valued successor of current
  – if evaluate(neighbor) <= evaluate(current), return current
  – else current = neighbor

Hill climbing gets stuck
Hiking metaphor (you are wearing glasses that limit your vision to 10 feet)
• Local maxima
  – Ridges (in cases when you can't walk along the ridge)
• Plateau – why is this a problem?
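The hill-climbing loop above can be sketched on the change-counting objective. This is a minimal sketch, not the course's code: the value-preserving move set (e.g. trading one dime for two nickels) is our own assumption, chosen so every neighbor of a feasible state stays feasible.

```python
def f(q, d, n):
    # Objective from the slides: a huge penalty unless the coins total
    # $17.45 (1745 cents, compared exactly in integer cents), otherwise
    # the coin count, which we want to minimize.
    if q * 25 + d * 10 + n * 5 != 1745:
        return 1_000_000
    return q + d + n

# Value-preserving swaps (an assumption of this sketch): each move trades
# coins without changing the total, e.g. (+1 quarter, -5 nickels).
MOVES = [(1, 0, -5), (-1, 0, 5), (0, 1, -2), (0, -1, 2),
         (1, -2, -1), (-1, 2, 1), (1, -1, -3), (-1, 1, 3)]

def neighbors(state):
    for dq, dd, dn in MOVES:
        s = (state[0] + dq, state[1] + dd, state[2] + dn)
        if min(s) >= 0:  # coin counts cannot go negative
            yield s

def hill_climb(start):
    # The greedy loop from the slide: move to the best successor until
    # no neighbor improves on the current state.
    current = start
    while True:
        best = min(neighbors(current), key=lambda s: f(*s))
        if f(*best) >= f(*current):
            return current
        current = best

print(hill_climb((0, 0, 349)))  # start from 349 nickels = $17.45
```

With this move set the climber reaches 69 quarters + 2 dimes (71 coins); with naive one-coin moves it would be stuck immediately, since every single-coin change breaks the $17.45 constraint.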
Hill Climbing Gadgets
Variants on hill climbing play special roles
• stochastic hill climbing
  – don't always choose the best successor
• first-choice hill climbing
  – pick the first good successor you find
  – useful if the number of successors is large
• random restart
  – follow steepest ascent from multiple starting states
  – probability of finding the global max increases with the number of starts

Hill Climbing Usefulness
It depends
• The shape of the state space greatly influences hill climbing
• Local maxima are its Achilles' heel
• What is the cost of evaluation?
• What is the cost of finding a random starting location?

Simulated Annealing
A term borrowed from metalworking
• We want metal molecules to find a stable location relative to their neighbors
• Heating causes metal molecules to jump around and to take on undesirable (high-energy) locations
• During cooling, molecules reduce their movement and settle into more stable (low-energy) positions
• Annealing is the process of heating metal and letting it cool slowly to lock in the stable locations of the molecules

Simulated Annealing
"Be the Ball"
• You have a wrinkled sheet of metal
• Place a BB on the sheet and what happens?
  – The BB rolls downhill
  – The BB stops at the bottom of a hill (local or global min?)
  – The BB's momentum may carry it out of one hill into another (local or global)
• By shaking the metal sheet, you are adding energy (heat)
• How hard do you shake?

Our Simulated Annealing Algorithm
"You're not being the ball, Danny" (Caddyshack)
• Gravity is great because it tells the ball which way is downhill at all times
• We don't have gravity, so how do we find a successor state?
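The random-restart variant can be sketched as follows. The 1-D landscape is invented for illustration (a local peak at x = 20, the global peak at x = 75); the point is only that multiple random starts let steepest ascent escape the smaller basin.

```python
import random

def evaluate(x):
    # Hypothetical bumpy landscape: a local peak at x = 20 (value 50)
    # and the global peak at x = 75 (value 80).
    return 50 - abs(x - 20) if x < 50 else 80 - abs(x - 75)

def hill_climb(x, lo=0, hi=99):
    # Steepest ascent among the integer neighbors of x.
    while True:
        succ = max((n for n in (x - 1, x + 1) if lo <= n <= hi),
                   key=evaluate)
        if evaluate(succ) <= evaluate(x):
            return x
        x = succ

def random_restart(k, rng):
    # Run hill climbing from k random starts; keep the best summit found.
    starts = (hill_climb(rng.randint(0, 99)) for _ in range(k))
    return max(starts, key=evaluate)

rng = random.Random(0)
print(random_restart(20, rng))  # with enough restarts, finds x = 75
```

A single climb started left of the valley tops out at x = 20; the probability that all k restarts do so shrinks exponentially in k, matching the slide's claim.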
  – Randomness, AKA Monte Carlo, AKA stochastic

Algorithm Outline
• Select some initial guess of the evaluation function parameters, x
• Evaluate the evaluation function: v = E(x)
• Compute a random displacement, x'
  – The Monte Carlo event
• Evaluate v' = E(x')
  – If v' < v, set the new state to x'
  – Else set the new state to x' with Prob(ΔE, T)
    • This is the Metropolis step
• Repeat with the updated state and temperature

Metropolis Step
We approximate nature's alignment of molecules by allowing uphill transitions with some probability
• Prob(in energy state E) ~ e^(-E/kT)
  – the Boltzmann probability distribution
  – even when T is small, there is still a chance of being in a high-energy state
• Prob(transferring from E1 to E2) = e^(-(E2 - E1)/kT)
  – the Metropolis step
  – if E2 < E1, prob() is greater than 1, so we always transfer
  – if E2 > E1, we may transfer to the higher-energy state
The rate at which T is decreased, and the amount by which it is decreased, is prescribed by an annealing schedule

Genetic Algorithms (GAs)
Another randomized search algorithm
Start with k initial guesses
• they form a population
• each individual in the population is a fixed-length string (gene)
• each individual's fitness is evaluated
• successors are generated from individuals according to the fitness function results

What's good about evolution? Think about mother nature…
What's bad about evolution?
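The outline above can be sketched directly. This is a minimal sketch, assuming a made-up 1-D energy surface, a Gaussian displacement, and a geometric cooling schedule; none of these specifics come from the slides (with k folded into T).

```python
import math
import random

def energy(x):
    # Hypothetical 1-D energy surface (our own example): a local minimum
    # near x = +1.28 and the global minimum near x = -1.26.
    return x**4 - 3 * x**2 + 0.5 * x

def anneal(rng, x0=2.0, t0=2.0, cooling=0.999, steps=5000):
    x, v = x0, energy(x0)
    t = t0
    for _ in range(steps):
        # Monte Carlo event: propose a random displacement.
        x_new = x + rng.gauss(0.0, 0.5)
        v_new = energy(x_new)
        # Metropolis step: always accept downhill moves; accept uphill
        # moves with probability e^(-(E2 - E1)/T).
        if v_new < v or rng.random() < math.exp(-(v_new - v) / t):
            x, v = x_new, v_new
        # Annealing schedule (an assumption): geometric cooling.
        t *= cooling
    return x, v

x, v = anneal(random.Random(0))
print(round(x, 2), round(v, 2))
```

Early on, the high temperature makes the exponential close to 1 and almost every uphill move is accepted (the hard shake); as T decays, the walk settles into a basin, ideally the global one.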
Genetic Algorithms
• Reproduction
  – Reuse
  – Crossover
• Mutation

Crossover
• Early states are diverse
  – Crossover explores the state space broadly
• Later states are more similar
  – Crossover fine-tunes within a small region
(like simulated annealing)

Mutation
• Could screw up a good solution
  – Like the Metropolis step in simulated annealing
• Could explore an untapped part of the search space

GA Analysis
Combines
• uphill tendency
• random exploration
• exchange of information between multiple threads
  – like stochastic beam search
Crossover is not needed – theoretically
• if the starting states are sufficiently random

GA Analysis
It's all in the representation
• GAs work best if the representation stores related pieces of the puzzle in neighboring cells of the string
• Not all problems are amenable to crossover
  – e.g., TSP

Model of Neurons
• Multiple inputs/dendrites (~10,000!!!)
• Cell body/soma performs computation
• Single output/axon
• Computation is typically modeled as linear
  – a change of d in the input corresponds to a change of kd in the output (not kd² or sin d…)

McCulloch-Pitts Neurons
• One or two inputs to the neuron
• Inputs are multiplied by weights
• If the sum of products exceeds a threshold, the neuron fires
What can we model with these?
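The reproduction/crossover/mutation loop can be sketched on a toy problem. The "count the 1 bits" fitness (OneMax), the truncation-style selection, and all the rates are our own choices for illustration, not anything the slides prescribe.

```python
import random

def fitness(ind):
    # OneMax (a stand-in problem of our choosing): count of 1 bits.
    return sum(ind)

def crossover(a, b, rng):
    # Single-point crossover: splice a prefix of one parent onto the
    # suffix of the other.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(ind, rng, rate=0.01):
    # Mutation: flip each bit with small probability.
    return [bit ^ 1 if rng.random() < rate else bit for bit in ind]

def ga(rng, n_bits=32, pop_size=40, generations=60):
    # Population of fixed-length bit strings (genes).
    pop = [[rng.randrange(2) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection according to fitness: parents come from the fitter half.
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        pop = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
               for _ in range(pop_size)]
    return max(pop, key=fitness)

best = ga(random.Random(1))
print(fitness(best))
```

OneMax also illustrates the representation point: every bit contributes independently, so crossover never tears apart "related pieces of the puzzle" — which is exactly what goes wrong when a tour for TSP is crossed over naively.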
Neuron thresholds (activation functions)
• It is desirable to have a differentiable activation function for automatic weight adjustment
http://www.csulb.edu/~cwallis/artificialn/History.htm

Perceptrons are linear classifiers
Consider a two-input neuron
• Two weights are "tuned" to fit the data
• The neuron uses the equation w1*x1 + w2*x2 to decide whether to fire
  – This is like the equation of a line, mx + b - y
http://www.compapp.dcu.ie/~humphrys/Notes/Neural/single.neural.html

Linearly separable
These single-layer perceptron networks can classify linearly separable systems

Linearly separable - AND
• Consider a system like AND

  x1  x2  x1 AND x2
   1   1      1
   0   1      0
   1   0      0
   0   0      0

[Figure: a two-input neuron – inputs x1 and x2, weights w1 and w2, a summing unit Σ, and a threshold θ(x●w)]

Not linearly separable - XOR
• Consider a system like XOR

  x1  x2  x1 XOR x2
   1   1      0
   0   1      1
   1   0      1
   0   0      0

Error Correction
Perceptron learning rule: wi ← wi + α (c - y) xi, where c is the correct output and y the actual output
• Only updates weights for non-zero inputs
• For positive inputs
  – If the perceptron should have fired but did not, the weight is increased
  – If the perceptron fired but should not have, the weight is decreased

Consider error in single-layer neural networks
Sum of squared errors (across training data)
• For one sample: E = ½ (y - hW(x))²
How can we minimize the error?
• Set the derivative equal to zero and solve for the weights
• Is that error affected by each of the weights in the weight vector?

Minimizing the error
What is the derivative?
• The gradient, ∇E
  – composed of the partials ∂E/∂wj

Computing the partial
By the Chain Rule:
  ∂E/∂wj = -(y - hW(x)) × g'(in) × xj
where g() = the activation function and in = Σj wj xj

Computing the partial
g'(in) = derivative of the activation function
       = g(1 - g) in the case of the sigmoid

Minimizing the error
Gradient descent: wj ← wj + α (y - hW(x)) × g'(in) × xj

Multi-layered Perceptrons
• Input layer, output layer, and "hidden" layers
• Not restricted to linearly separable problems
• Modification rules are more complicated!
Why are modification rules more complicated?
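The error-correction rule can be sketched on the AND data above. This assumes a step activation with the bias folded in as a weight on a constant input of 1; the learning rate and epoch count are arbitrary choices.

```python
# Training data for AND, from the truth table on the slide.
AND_DATA = [((1, 1), 1), ((0, 1), 0), ((1, 0), 0), ((0, 0), 0)]

def predict(w, x):
    # Fire (output 1) when the weighted sum w · (1, x1, x2) exceeds 0.
    total = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1 if total > 0 else 0

def train(data, alpha=0.1, epochs=25):
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, c in data:
            # Error correction: w_i <- w_i + alpha * (c - y) * x_i.
            # Only non-zero inputs move their weights.
            err = c - predict(w, x)
            w[0] += alpha * err          # bias input is always 1
            w[1] += alpha * err * x[0]
            w[2] += alpha * err * x[1]
    return w

w = train(AND_DATA)
print([predict(w, x) for x, _ in AND_DATA])  # prints [1, 0, 0, 0]
```

Running the same loop on the XOR table never converges, no matter how many epochs: there is no line w1*x1 + w2*x2 = θ separating the two classes, which is the point of the XOR slide.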
We can calculate the error of the output neuron by comparing to the training data
• We could use the previous update rule to adjust W3,5 and W4,5 to correct that error
• But how do W1,3, W1,4, W2,3, and W2,4 adjust?

What changes in multilayer?
We do not know the correct outputs for the hidden layers
• We will have to propagate errors backwards
  – back propagation (backprop)

Backprop at the output layer
Output-layer error is computed as in the single-layer case, and the weights are updated in the same fashion
• Let Erri be the ith component of the error vector y - hW
• Let Δi = Erri × g'(ini)
• Then Wj,i ← Wj,i + α × aj × Δi

Backprop in the hidden layer
Each hidden node is responsible for some fraction of the error Δi in each of the output nodes to which it is connected
• Δi is divided among all hidden nodes that connect to output i according to their strengths
• Error at hidden node j: Δj = g'(inj) Σi Wj,i Δi

Backprop in the hidden layer
Error is: Δj = g'(inj) Σi Wj,i Δi
Correction is: Wk,j ← Wk,j + α × ak × Δj

Summary of backprop
1. Compute the Δ values for the output units using the observed error
2. Starting with the output layer, repeat the following for each layer until done
  • Propagate the Δ values back to the previous layer
  • Update the weights between the two layers

Some general artificial neural network (ANN) info
• The entire network is a function g(inputs) = outputs
  – These functions frequently have sigmoids in them
  – These functions are frequently differentiable
  – These functions have coefficients (weights)
• Backpropagation networks are simply ways to tune the coefficients of a function so it produces the desired output

Function approximation
Consider fitting a line to data
• Coefficients: slope and y-intercept
• Training data: some (x, y) samples
• Use a least-squares fit
This is what an ANN does

Function approximation
A function of two inputs…
• Fit a smooth curve to the available data
  – Quadratic
  – Cubic
  – nth-order
  – ANN!
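The backprop summary can be sketched end to end on XOR, the problem a single layer could not solve. This is a minimal sketch: the 2-2-1 topology, the weight initialization, the learning rate, and the epoch count are all our own choices.

```python
import math
import random

def g(x):
    # Sigmoid activation; note g'(x) = g(x)(1 - g(x)).
    return 1.0 / (1.0 + math.exp(-x))

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

rng = random.Random(2)
w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden: [bias, w1, w2]
w_o = [rng.uniform(-1, 1) for _ in range(3)]                      # output: [bias, w1, w2]

def forward(x):
    a_h = [g(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_h]
    out = g(w_o[0] + w_o[1] * a_h[0] + w_o[2] * a_h[1])
    return a_h, out

def sse():
    # Sum of squared errors across the training data.
    return sum((y - forward(x)[1]) ** 2 for x, y in XOR)

def train(epochs=5000, alpha=0.5):
    for _ in range(epochs):
        for x, y in XOR:
            a_h, out = forward(x)
            # Output layer: delta_i = Err_i * g'(in_i).
            d_out = (y - out) * out * (1 - out)
            # Hidden layer: delta_j = g'(in_j) * W_j,i * delta_i.
            d_h = [a * (1 - a) * w_o[j + 1] * d_out for j, a in enumerate(a_h)]
            # Weight updates: W <- W + alpha * a * delta.
            w_o[0] += alpha * d_out
            for j in range(2):
                w_o[j + 1] += alpha * a_h[j] * d_out
                w_h[j][0] += alpha * d_h[j]
                w_h[j][1] += alpha * x[0] * d_h[j]
                w_h[j][2] += alpha * x[1] * d_h[j]

loss_before = sse()
train()
loss_after = sse()
print(round(loss_before, 3), round(loss_after, 3))
```

The hidden deltas are computed from the output delta before any weights change, exactly as in the summary: deltas propagate backwards first, then both layers of weights are updated.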
Curve fitting
• A neural network should be able to generate the input/output pairs from the training data
• You'd like it to be smooth (and well-behaved) in the voids between the training data
• There are risks of overfitting the data

When using ANNs
• Sometimes the output layer feeds back into the input layer
  – recurrent neural networks
• The backpropagation will tune the weights
• You determine the topology
  – Different topologies have different training outcomes (consider overfitting)
  – Sometimes a genetic algorithm is used to explore the space of neural network topologies