
CS 551/651
Search
Topics
Search
• Local vs. global search
• Simulated annealing, genetic algorithms
• Neural networks
Through-the-lens camera control
• Paper by Gleicher with two assistive handouts
Local Search Algorithms and
Optimization Problems
What are search problems?
Examples
• Fit a line to data… fit a surface to data
• What quantities of quarters, nickels, and dimes add up to
$17.45 while minimizing the total number of coins?
• Is the price of Microsoft stock going up tomorrow?
In each case, what are the DOFs (degrees of freedom) and the
evaluation function?
What are search problems?
Constraints
• Some solutions must be avoided (hard constraints)
– Minimize foo subject to constraint bar
 Cull parts of state space
• Some solutions are highly undesirable (soft constraints)
– Minimize foo and try hard to avoid condition bar
 Add some highly weighted terms to foo
Considering search some more
How do you know what good is?
• Are good things near one another?
Where do you search?
• Do you know when to stop?
Where do you not search?
Local Search
Local search does not keep track of previous
solutions
• Instead it keeps track of current solution (current state)
• Uses a method of generating alternative solution candidates
Advantages
• They use a small amount of memory (usually a constant amount)
• They can find reasonable (note we aren’t saying optimal)
solutions in infinite search spaces
Optimization Problems
Objective Function
• A function with vector inputs and scalar output
– goal is to search through candidate input vectors in order
to minimize or maximize objective function
Example
• f(q, d, n) = 1,000,000 if q*0.25 + d*0.10 + n*0.05 != 17.45
= q + d + n otherwise
• minimize f
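A minimal Python sketch of this penalty-style objective (the function and variable names are mine, and the check is done in integer cents to sidestep floating-point equality):

```python
def f(q, d, n):
    """Penalty-style objective: a huge value when the coins don't total
    $17.45, otherwise the coin count we want to minimize."""
    if 25 * q + 10 * d + 5 * n != 1745:   # compare in cents
        return 1_000_000                  # hard constraint as a penalty
    return q + d + n                      # feasible: total number of coins

print(f(69, 1, 2))   # 1725 + 10 + 10 = 1745 cents -> 72 coins
print(f(1, 1, 1))    # wrong total -> 1000000
```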
Search Space
The realm of feasible input vectors
• Also called state-space landscape
• Usually described by
– number of dimensions (3 for our change example)
– domain of each dimension (#quarters is discrete from 0 to 69…)
– nature of relationship between input vector and objective function
output
 no relationship
 smoothly varying
 discontinuities
Search Space
Looking for global
maximum (or minimum)
Hill Climbing
Also called Greedy Search
• Select a starting point and set current
• evaluate (current)
• loop do
– neighbor = highest value successor of
current
– if evaluate (neighbor) <= evaluate (current)
 return current
– else current = neighbor
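A sketch of that loop in Python; evaluate and neighbors are caller-supplied stand-ins (not named on the slide) for the evaluation function and the successor generator:

```python
def hill_climb(start, evaluate, neighbors):
    """Greedy search: repeatedly move to the best successor until
    no successor improves on the current state."""
    current = start
    while True:
        neighbor = max(neighbors(current), key=evaluate)  # best successor
        if evaluate(neighbor) <= evaluate(current):
            return current   # stuck: local (possibly global) maximum
        current = neighbor

# Example: maximize -(x - 3)^2 over the integers, stepping by +/- 1.
print(hill_climb(0, lambda x: -(x - 3) ** 2, lambda x: [x - 1, x + 1]))  # 3
```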
Hill climbing gets stuck
Hiking metaphor (you are wearing glasses that
limit your vision to 10 feet)
• Local maxima
– Ridges (in cases when you can’t walk along the ridge)
• Plateau
– why is this a problem?
Hill Climbing Gadgets
Variants on hill climbing play special roles
• stochastic hill climbing
– don’t always choose the best successor
• first-choice hill climbing
– pick the first good successor you find
 useful if number of successors is large
• random restart
– follow steepest ascent from multiple starting states
– probability of finding global max increases with number of starts
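Random restart is then a thin wrapper around a single climb; a sketch, with climb and random_state as caller-supplied stand-ins:

```python
def random_restart(climb, random_state, evaluate, restarts=20):
    """Follow steepest ascent from several random starting states and
    keep the best local maximum found; more starts raise the odds of
    hitting the global maximum."""
    results = (climb(random_state()) for _ in range(restarts))
    return max(results, key=evaluate)
```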
Hill Climbing Usefulness
It Depends
• Shape of state space greatly influences hill climbing
• local maxima are the Achilles heel
• what is cost of evaluation?
• what is cost of finding a random starting location?
Simulated Annealing
A term borrowed from metalworking
• We want metal molecules to find a stable location relative to
neighbors
• heating causes metal molecules to jump around and to take on
undesirable (high energy) locations
• during cooling, molecules reduce their movement and settle
into a more stable (low energy) position
• annealing is process of heating metal and letting it cool slowly
to lock in the stable locations of the molecules
Simulated Annealing
“Be the Ball”
• You have a wrinkled sheet of metal
• Place a BB on the sheet and what happens?
– BB rolls downhill
– BB stops at bottom of hill (local or global min?)
– BB momentum may carry it out of one minimum into another (local or global)
• By shaking the metal sheet, you are adding energy (heat)
• How hard do you shake?
Our Simulated Annealing Algorithm
“You’re not being the ball, Danny” (Caddyshack)
• Gravity is great because it tells the ball which way is downhill
at all times
• We don’t have gravity, so how do we find a successor state?
– Randomness
 AKA Monte Carlo
 AKA Stochastic
Algorithm Outline
Select some initial guess of the evaluation function parameters: x0
Evaluate the evaluation function, v = f(x)
Compute a random displacement, x′ = x + Δx
• The Monte Carlo event
Evaluate v′ = f(x′)
• If v′ < v, set the new state, x = x′
• Else set x = x′ with Prob(ΔE, T)
– This is the Metropolis step
Repeat with updated state and temperature
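A Python sketch of the outline, minimizing a caller-supplied f; the acceptance probability e^(−ΔE/T) and the geometric cooling schedule are common choices, not mandated by the slide:

```python
import math
import random

def simulated_annealing(x, f, random_displacement,
                        temp=1.0, cooling=0.995, t_min=1e-3):
    """Minimize f: always accept downhill moves, and accept uphill moves
    with probability exp(-dE / T) -- the Metropolis step."""
    v = f(x)
    while temp > t_min:
        x_new = random_displacement(x)   # the Monte Carlo event
        v_new = f(x_new)
        if v_new < v or random.random() < math.exp(-(v_new - v) / temp):
            x, v = x_new, v_new          # take the step
        temp *= cooling                  # annealing schedule
    return x
```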
Metropolis Step
We approximate nature’s alignment of molecules by allowing
uphill transitions with some probability
• Prob(in energy state E) ~ e^(−E/kT)
– Boltzmann Probability Distribution
– Even when T is small, there is still a chance of being in a high energy state
• Prob(transferring from E1 to E2) = e^(−(E2 − E1)/kT)
– Metropolis Step
– if E2 < E1, prob() is greater than 1 (we always transfer)
– if E2 > E1, we may transfer to a higher energy state
The rate at which T is decreased and the amount
it is decreased is prescribed by an annealing schedule
Genetic Algorithms (GAs)
Another randomized search algorithm
Start with k initial guesses
• they form a population
• each individual from the population is a fixed-length string
(gene)
• each individual’s fitness is evaluated
• successors are generated from individuals according to
fitness function results
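A minimal GA sketch in Python over fixed-length bit strings; tournament selection, single-point crossover, and the parameter values are illustrative choices:

```python
import random

def genetic_algorithm(fitness, length=8, pop_size=20, generations=100,
                      mutation_rate=0.01):
    """Evolve a population of fixed-length bit strings toward high fitness."""
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = []
        for _ in range(pop_size):
            # Fitter individuals are more likely to reproduce
            # (tournament selection, one common choice).
            p1 = max(random.sample(pop, 3), key=fitness)
            p2 = max(random.sample(pop, 3), key=fitness)
            cut = random.randrange(1, length)      # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with small probability.
            child = [b ^ 1 if random.random() < mutation_rate else b
                     for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Example: "one-max" -- fitness is simply the number of 1 bits.
print(genetic_algorithm(sum))
```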
What’s good about evolution?
Think about mother nature…
What’s bad about evolution?
Genetic Algorithms
• Reproduction
– Reuse
– Crossover
• Mutation
Crossover
• Early stages are diverse
– Crossover explores the state space broadly
• Later stages are more similar
– Crossover fine-tunes in a small region
• This broad-to-narrow progression is like simulated annealing
Mutation
Could screw up a good solution
• Like the Metropolis step in simulated annealing
Could explore an untapped part of the search space
GA Analysis
Combines
• uphill tendency
• random exploration
• exchange information between multiple threads
– like stochastic beam search
Crossover is not needed – theoretically
• if starting states are sufficiently random
GA Analysis
It’s all in the representation
• GA works best if representation stores related pieces of the
puzzle in neighboring cells of string
• Not all problems are amenable to crossover
– TSP
Model of Neurons
• Multiple inputs/dendrites
(~10,000!!!)
• Cell body/soma performs
computation
• Single output/axon
• Computation is typically modeled as linear
– a change of d in the input corresponds to a change of kd in
the output (not kd² or sin d…)
McCulloch-Pitts Neurons
• One or two inputs to neuron
• Inputs are multiplied by
weights
• If sum of products exceeds a
threshold, the neuron fires
What can we model with these?
Neuron thresholds (activation functions)
• It is desirable to have a differentiable activation function for
automatic weight adjustment
http://www.csulb.edu/~cwallis/artificialn/History.htm
Perceptrons are linear classifiers
Consider a two-input neuron
• Two weights are “tuned” to fit the data
• The neuron uses the sum w1 * x1 + w2 * x2 (against a threshold) to fire or not
– The decision boundary is like the equation of a line, mx + b − y = 0
http://www.compapp.dcu.ie/~humphrys/Notes/Neural/single.neural.html
Linearly separable
These single-layer perceptron networks can
classify linearly separable systems
• Consider a system like AND
x1  x2  x1 AND x2
 1   1      1
 0   1      0
 1   0      0
 0   0      0
Linearly separable - AND
• Consider a system like AND
[diagram: the AND truth table beside a single unit, with x1 and x2 feeding weights w1 and w2 into a summation Σ and then a threshold θ(x·w)]
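A sketch of that unit in Python; the weights w1 = w2 = 1 and threshold 1.5 are one choice (mine, not the slide's) that realizes AND:

```python
def fires(x, w, threshold):
    """McCulloch-Pitts unit: 1 if the weighted sum exceeds the threshold."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) > threshold else 0

# The line x1 + x2 = 1.5 separates (1,1) from the other three inputs.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, fires((x1, x2), (1, 1), 1.5))
```

No single choice of (w1, w2, threshold) makes this unit compute XOR, which is the point of the next slide.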
Not linearly separable - XOR
• Consider a system like XOR
x1  x2  x1 XOR x2
 1   1      0
 0   1      1
 1   0      1
 0   0      0

[diagram: the same unit, with x1 and x2 feeding weights w1 and w2 into a summation Σ and then a threshold θ(x·w); no single line separates the 1s from the 0s]
Error Correction
wi ← wi + c × xi × (t − o)
– t is the target output, o the actual output, and c the learning constant
Only updates weights for non-zero inputs
For positive inputs
• If the perceptron should have fired but did not, the weight
is increased
• If the perceptron fired but should not have, the weight is
decreased
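A sketch of this error-correction rule in Python (t is the target output, o the perceptron's output, c the learning constant):

```python
def update_weights(w, x, t, o, c=0.1):
    """Perceptron error correction: wi <- wi + c * xi * (t - o).
    Zero inputs leave their weights untouched; for positive inputs the
    weight rises if we should have fired but didn't, and falls if we
    fired but shouldn't have."""
    return [wi + c * xi * (t - o) for wi, xi in zip(w, x)]
```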
Consider error in single-layer neural
networks
Sum of squared errors (across training data): E = 1/2 Σ (y − hW(x))²
For one sample: E = 1/2 Err² = 1/2 (y − hW(x))²
How can we minimize the error?
• Set the derivative equal to zero and solve for the weights
• Is that error affected by each of the weights in the weight vector?
Minimizing the error
What is the derivative?
• The gradient, ∇E
– Composed of the partials ∂E/∂wj, one per weight
Computing the partial
By the Chain Rule:
∂E/∂wj = −Err × g′(in) × xj, where in = Σj wj xj
g( ) = the activation function
Computing the partial
g′(in) = derivative of the activation function
= g(1 − g) in the case of the sigmoid
Minimizing the error
Gradient descent: wj ← wj + α × Err × g′(in) × xj
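One gradient-descent step for a single sigmoid unit, as a Python sketch (alpha is the learning rate; the function names are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, x, y, alpha=0.5):
    """wj <- wj + alpha * Err * g'(in) * xj for a sigmoid unit,
    where in = w . x, Err = y - g(in), and g'(in) = g(in) * (1 - g(in))."""
    s = sum(wj * xj for wj, xj in zip(w, x))      # in = w . x
    g = sigmoid(s)
    err = y - g
    return [wj + alpha * err * g * (1.0 - g) * xj for wj, xj in zip(w, x)]
```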
Multi-layered Perceptrons
• Input layer, output
layer, and “hidden”
layers
• Not restricted to linearly separable problems
• Modification rules
are more
complicated!
Why are modification rules more
complicated?
We can calculate the error of the output neuron by
comparing to training data
• We could use previous update rule to adjust W3,5 and W4,5 to
correct that error
• But how do W1,3, W1,4, W2,3, and W2,4 adjust?
What changes in multilayer?
We do not know the correct outputs for the hidden
layers
• We will have to propagate errors backwards
Back propagation (backprop)
[diagram: a multilayer network]
Backprop at the output layer
Output layer error is computed as in single-layer
and weights are updated in same fashion
• Let Erri be the ith component of the error vector y − hW
– Let Δi = Erri × g′(ini); the output-layer update is then Wj,i ← Wj,i + α × aj × Δi
Backprop in the hidden layer
Each hidden node is responsible for some
fraction of the error Δi in each of the output nodes
to which it is connected
• Δi is divided among all hidden nodes that connect to output i
according to their strengths
Error at hidden node j: Δj = g′(inj) × Σi Wj,i × Δi
Backprop in the hidden layer
Error is: Δj = g′(inj) × Σi Wj,i × Δi
Correction is: Wk,j ← Wk,j + α × ak × Δj
Summary of backprop
1. Compute the Δ value for the output units using
the observed error
2. Starting with the output layer, repeat the
following for each layer until done
• Propagate Δ values back to the previous layer
• Update the weights between the two layers
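The whole recipe for one hidden layer, as a compact Python sketch (sigmoid units throughout; the names W1 and W2, the shapes, and the learning rate are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, W1, W2, alpha=0.5):
    """One backprop update for a network with one hidden layer.
    W1[j][k]: weight from input k to hidden j; W2[i][j]: hidden j to output i."""
    # Forward pass.
    hidden = [sigmoid(sum(wk * xk for wk, xk in zip(row, x))) for row in W1]
    output = [sigmoid(sum(wj * hj for wj, hj in zip(row, hidden))) for row in W2]
    # Step 1: delta at the output units from the observed error.
    d_out = [(yi - oi) * oi * (1 - oi) for yi, oi in zip(y, output)]
    # Step 2: propagate delta back, dividing it among hidden nodes by weight.
    d_hid = [hj * (1 - hj) * sum(W2[i][j] * d_out[i] for i in range(len(d_out)))
             for j, hj in enumerate(hidden)]
    # Update the weights between each pair of layers.
    for i, di in enumerate(d_out):
        W2[i] = [w + alpha * hj * di for w, hj in zip(W2[i], hidden)]
    for j, dj in enumerate(d_hid):
        W1[j] = [w + alpha * xk * dj for w, xk in zip(W1[j], x)]
    return W1, W2
```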
Some general artificial neural network
(ANN) info
• The entire network is a function g( inputs ) = outputs
– These functions frequently have sigmoids in them
– These functions are frequently differentiable
– These functions have coefficients (weights)
• Backpropagation networks are simply ways to tune the
coefficients of a function so it produces desired output
Function approximation
Consider fitting a line to data
• Coefficients: slope and y-intercept
• Training data: some samples
• Use least-squares fit
[plot: least-squares line fit through sample points in the x-y plane]
This is what an ANN does
Function approximation
A function of two inputs…
• Fit a smooth curve to the available data
– Quadratic
– Cubic
– nth-order
– ANN!
Curve fitting
• A neural network should be able to generate the input/output
pairs from the training data
• You’d like for it to be smooth (and well-behaved) in the voids
between the training data
• There are risks of overfitting the data
When using ANNs
• Sometimes the output layer feeds back into the input layer –
recurrent neural networks
• The backpropagation will tune the weights
• You determine the topology
– Different topologies have different training outcomes
(consider overfitting)
– Sometimes a genetic algorithm is used to explore the
space of neural network topologies