
Basic string algorithms
NP-hard problems
Heuristics
© Heiko Schröder 2004
1
What do algorithm people do?
Invent efficient recipes to solve
combinatorial problems
http://www.iis.sinica.edu.tw/~hil/
http://cmgm.stanford.edu/biochem218/
Dr. Zhiping Weng, Dept. Biomedical Engineering, Boston University
“Introduction to Algorithms”, Cormen, Leiserson, Rivest, The MIT Press, McGraw-Hill
2
Growth functions
f(N) = O(g(N)) ⇔ ∃c,k ∀N>k: f(N) ≤ c·g(N)
f(N) = Ω(g(N)) ⇔ ∃c,k ∀N>k: f(N) ≥ c·g(N)
f(N) = Θ(g(N)) ⇔ f(N) = O(g(N)) and f(N) = Ω(g(N))
[Figure: f(N) plotted against N, illustrating asymptotic optimality/comparison. Caveat: "Constant factor disaster!" – asymptotic comparison hides constant factors.]
3
N       N logN   N^2      1000N^2    2^N       N!
10      30       100      10^5       1000      3.6*10^6
20      80       400      4*10^5     10^6      2.5*10^18
100     700      10^4     10^7       10^30     10^158
1000    10^4     10^6     10^9       10^300    ???

N logN: fast sort        N^2: slow sort
2^N: powerset / partition        N!: permutations / traveling salesman

For scale: 2.5*10^18 operations ≈ ½ hour of the TOP500 together,
and roughly 3*10^24 operations/year are performed worldwide.
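The table's entries can be reproduced with a short Python sketch; exponents are computed via logarithms so that 2^N and N! never need to be converted to (overflowing) floats:

```python
import math

# Reproduce the growth-function table: operation counts for each N.
for n in (10, 20, 100, 1000):
    print(f"N={n:<5} NlogN={round(n * math.log2(n)):<6} N^2={n**2:<8} "
          f"1000N^2=10^{round(math.log10(1000 * n ** 2))} "
          f"2^N~10^{round(n * math.log10(2))} "
          f"N!~10^{round(math.log10(math.factorial(n)))}")
```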
4
NP-hard problems?
• t(n) = n^2:  n=10 → t=100;  n=100 → t=10^4;  n=1000 → t=10^6
  With time limit T: Maxproblem(T,1) = X and Maxproblem(T,1000) ≈ 32·X
  (a 1000-fold speed-up scales the solvable problem size by √1000 ≈ 32).
• t(n) = 2^n:  n=10 → t≈10^3;  n=20 → t≈10^6;  n=40 → t≈10^12
  With time limit T: Maxproblem(T,1) = X and Maxproblem(T,1000) = X+10
  (a 1000-fold speed-up adds only log2(1000) ≈ 10 to the solvable problem size).
Parallel computing does not solve NP-hard problems!
But it helps with corresponding heuristics.
5
Trivial string matching method

s:=0; m:=0
loop:
  while m<|P| and P(m+1) = S(s+m+1)
    m:=m+1
  If m=|P| then output “match position”
  If s ≥ |S|-|P| then EXIT
  s:=s+1; m:=0

Example: P = a a b a b,  S = a a b a a b a c a a b a b a a c a
[Figure: the pattern is slid along S one position at a time.]

Worst-case time complexity: O(|P|·|S|)
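For concreteness, here is the trivial method in Python (0-based indexing internally, 1-based match positions reported, matching the slides' convention):

```python
def naive_match(P, S):
    """Trivial string matching: try every shift s; worst case O(|P|*|S|)."""
    matches = []
    for s in range(len(S) - len(P) + 1):
        m = 0
        while m < len(P) and P[m] == S[s + m]:
            m += 1
        if m == len(P):
            matches.append(s + 1)  # report 1-based match positions
    return matches

# The slide's example: P = aabab, S = aabaabacaababaaca
print(naive_match("aabab", "aabaabacaababaaca"))  # → [9]
```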
6
Knuth-Morris-Pratt string matching algorithm (Google)

[Figure: after a mismatch at pattern offset m, the next candidate shift is s+m−π(m); the first π(m) characters of the pattern are then already known to match.]

For every m, π(m) is the length of the longest prefix of P which is also a
proper suffix of P(1..m). π(1) = π(0) = 0.
π(m) can be produced in time linear in |P| (pre-processing).

m:=0; s:=0
loop:
  while m<|P| and P(m+1) = S(s+m+1)
    m:=m+1
  If m=0 then s:=s+1
  If m=|P| then output “match position”
  s:=s+m-π(m); m:=π(m)   (together, s+m is unchanged)
  If s ≥ |S|-|P| then EXIT

Time complexity:
In the while-loop s+m+1 is increased;
“outside” the loop s is increased.
Each quantity is bounded by |S|, so the total number of increases is < 2|S|.
Thus T = O(|S|).
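The slide's algorithm can be sketched in Python; this is the standard index-based formulation of KMP rather than a literal transcription of the s/m loop above:

```python
def prefix_function(P):
    """pi[m] = length of the longest prefix of P that is a proper suffix
    of P(1..m); pi[0] = pi[1] = 0. Computed in O(|P|) time."""
    pi = [0] * (len(P) + 1)
    k = 0
    for m in range(2, len(P) + 1):
        while k > 0 and P[k] != P[m - 1]:
            k = pi[k]              # fall back to the next shorter border
        if P[k] == P[m - 1]:
            k += 1
        pi[m] = k
    return pi

def kmp_match(P, S):
    """Knuth-Morris-Pratt: O(|S|) matching after O(|P|) preprocessing."""
    pi = prefix_function(P)
    matches, m = [], 0
    for i, c in enumerate(S):
        while m > 0 and P[m] != c:
            m = pi[m]              # reuse the already-matched prefix
        if P[m] == c:
            m += 1
        if m == len(P):
            matches.append(i - m + 2)  # 1-based start of the match
            m = pi[m]
    return matches

print(kmp_match("aabab", "aabaabacaababaaca"))  # → [9]
```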
7
P = a a b a a b a c a
π = 0 1 0 1 2 3 4 0 1   (i.e. π(2)=1, π(3)=0, π(4)=1, π(5)=2, π(6)=3, π(7)=4, π(8)=0, π(9)=1)

S = a a b b a a b a a b a c a a b a
[Figure: running the algorithm on S; s marks the current shift.]
m:=0; s:=0
loop:
  while m<|P| and P(m+1) = S(s+m+1)
    m:=m+1
  If m=0 then s:=s+1
  If m=|P| then output “match position”
  s:=s+m-π(m); m:=π(m)
  If s ≥ |S|-|P| then EXIT
8
Producing π in linear time (the idea only)!

To compute π(k+1), first try to extend the border of length π(k): if
P(π(k)+1) = P(k+1), then π(k+1) = π(k)+1.
π never increases by more than 1 per step, at the cost of one comparison.

Case π(k+1) < π(k) (the extension fails): the next candidate border of
P(1..k+1) must also be a border of P(1..π(k)), so fall back to π(π(k)) and
try to extend that one, repeating if necessary.

Example from the slide: if π(k) = 7 and π(7) = 3, then after a failed
extension the next candidate is the border of length π(7) = 3, so
π(k+1) ≤ π(7)+1 = 4; a value π(k+1) = 5 would require π(7) > 3.

Each fallback strictly decreases the current border length, and the total
increase over all steps is at most |P|, so the total work is O(|P|).
9
I suspect that if exact string matching is used for protein string comparisons,
then the trivial method might also be the fastest (on average), even
though it is much worse in the worst case.
The reason is that “long matches” are very unlikely:
a match of length 1 has probability 1/20,
a match of length k has probability (1/20)^k.
Thus in every position a mismatch is very likely and m stays small.
Please, prove me wrong!

s:=0; m:=0
loop:
  while m<|P| and P(m+1) = S(s+m+1)
    m:=m+1
  If m=|P| then output “match position”
  If s ≥ |S|-|P| then EXIT
  s:=s+1; m:=0
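The conjecture can be probed empirically. This sketch counts the character comparisons the trivial matcher makes on random strings over the 20-letter amino-acid alphabet; the average per shift indeed stays near 1/(1-1/20) = 20/19 ≈ 1.05:

```python
import random

random.seed(0)
AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino-acid one-letter codes

def naive_comparisons(P, S):
    """Character comparisons made by the trivial matcher over all shifts."""
    comps = 0
    for s in range(len(S) - len(P) + 1):
        m = 0
        while m < len(P) and P[m] == S[s + m]:
            comps += 1
            m += 1
        if m < len(P):
            comps += 1  # the mismatching comparison also costs one
    return comps

S = "".join(random.choice(AA) for _ in range(10000))
P = "".join(random.choice(AA) for _ in range(50))
shifts = len(S) - len(P) + 1
print(naive_comparisons(P, S) / shifts)  # close to 20/19 ≈ 1.05
```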
10
Elementary algorithms related to sequences/strings

[Figure: a 0/1 sequence with its prefix-sum array P; the sum from i to j
equals P(j) - P(i-1), in the example 6 - 1 = 5.]

Density: sum/length

Following slides: http://www.iis.sinica.edu.tw/~hil/
11
Density of a segment
• prefix-sum(i) = S[1]+S[2]+…+S[ i ],
– all n prefix sums are computable in O(n)
time.
• sum(i, j) = prefix-sum(j) – prefix-sum(i-1)
• density(i, j) = sum(i, j) / (j-i+1)
• Where is the area with the closest match between
protein A and protein B?
• In which area is the density of amino acid “X” the
highest?
• Where is the sub-sequence of length greater than or
equal to L that has the highest density? This
sequence might be longer than L – why? But it will
be shorter than 2L – why?
Example: 0 1 1 1 0 1 1 1 0
For L=4 the best window is 1 1 1 0 1 1 1, with density 6/7.
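A brute-force sketch of the prefix-sum approach. It relies on the slide's observation that an optimal window never needs length ≥ 2L, so only lengths L..2L-1 are tried (faster algorithms exist, but are not shown here):

```python
def prefix_sums(S):
    """prefix[i] = S[1]+...+S[i]; prefix[0] = 0. Computed in O(n) time."""
    prefix = [0]
    for x in S:
        prefix.append(prefix[-1] + x)
    return prefix

def best_window_at_least(S, L):
    """Highest-density window of length >= L. Any window of length >= 2L
    splits into two halves of length >= L, one at least as dense, so it
    suffices to try lengths L..2L-1. Returns (density, i, j), 1-based."""
    prefix = prefix_sums(S)
    n = len(S)
    best = None
    for i in range(1, n + 1):
        for length in range(L, min(2 * L, n - i + 2)):
            j = i + length - 1
            d = (prefix[j] - prefix[i - 1]) / length
            if best is None or d > best[0]:
                best = (d, i, j)
    return best

# The slide's example: best window is positions 2..8 with density 6/7.
print(best_window_at_least([0, 1, 1, 1, 0, 1, 1, 1, 0], 4))
```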
12
From Suffix Trie to Suffix Tree
(retrieval)
13
Suffixes of S
S      = b b a b b a a b
S[1…8] = b b a b b a a b   (1st suffix)
S[2…8] =   b a b b a a b   (2nd suffix)
S[3…8] =     a b b a a b   (3rd suffix)
S[4…8] =       b b a a b   (4th suffix)
S[5…8] =         b a a b   (5th suffix)
S[6…8] =           a a b   (6th suffix)
S[7…8] =             a b   (7th suffix)
S[8…8] =               b   (8th suffix)
14
KEY: P occurs in S iff P is a prefix of a suffix of S.
15
T = Suffix Trie of S
[Figure: the suffix trie T of S = b b a b b a a b; every suffix of S is spelled along a root-to-leaf path, so every substring of S labels a path from the root.]
16
Why suffix trie?
• The following statements are equivalent.
– P occurs in S.
– P is a prefix of a suffix of S.
– P corresponds to a path of T starting from the root of T.
17
P=babba
[Figure: tracing P = b a b b a from the root of the suffix trie succeeds; every character of P has a matching edge.]
P occurs in S!
18
P=bbaaba
[Figure: tracing P = b b a a b a from the root fails at the final a: after b b a a b there is no further a-edge.]
P doesn’t occur in S!
19
Q: Where does P occur in S?
20
P=abbaa
[Figure: the suffix trie of S = b b a b b a a b with each leaf labeled by the starting position of its suffix. Tracing P = a b b a a leads into the subtree whose only leaf is labeled 3. Output: 3.]
21
Solving the Substring Problem using a suffix trie
• Question:
– If we are given a suffix trie of S, what is the time complexity for
finding an occurrence of P in S?
• Answer:
– O(|P|) time.
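A dictionary-based sketch of the suffix trie and its O(|P|) lookup (construction is quadratic, as the next slides discuss):

```python
def suffix_trie(S):
    """Suffix trie as nested dicts: O(|S|^2) time and space to build."""
    root = {}
    for i in range(len(S)):
        node = root
        for c in S[i:]:          # insert the i-th suffix character by character
            node = node.setdefault(c, {})
    return root

def occurs(trie, P):
    """P occurs in S iff P labels a path from the root: O(|P|) time."""
    node = trie
    for c in P:
        if c not in node:
            return False
        node = node[c]
    return True

T = suffix_trie("bbabbaab")
print(occurs(T, "babba"), occurs(T, "bbaaba"))  # → True False (slides 18/19)
```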
22
Constructing Suffix Trie

Q: Time complexity for constructing the suffix trie T of S?
  Θ(|S|)
  Θ(|S| log |S|)
  Θ(|S|²)
  Θ(|S|³)
23
Time = O(|S|²)
[Figure: constructing the suffix trie by inserting the suffixes of S = b b a b b a a b one after another; each insertion walks or extends at most |S| nodes, giving O(|S|²) total.]
24
Worst-case time = Ω(|S|²)
• How to establish a lower bound?
• Answer:
– Exhibit an example which takes Ω(|S|²) time for any algorithm (the trie itself already has that many nodes).
25
S=aaaabbbb – its suffix trie has Θ(|S|²) nodes, since the suffixes a^i b^4 share only short common prefixes.
26
Summary
• A suffix trie solves the Substring Problem well, but may
require Ω(|S|²) time and space.
• Question: is there a compact representation of suffix trie
that needs only O(|S|) time and space?
27
Suffix Tree
A compact representation of suffix trie
28
Observations on Trie T of S
• T has at most |S| leaves.
– Why?
• T has at most |S| – 1
branching nodes.
– Why?
29
S=aaaabbbb
[Figure: the suffix tree of S = aaaabbbb; each edge is labeled by an interval [i,j] meaning the substring S[i…j], e.g. [4,8] = abbbb, so every edge label takes O(1) space.]
30
S=bbabbaab
31
S=bbabbaab
[Figure: the suffix trie of S = b b a b b a a b with every edge label rewritten as an index interval [i,j] into S.]
32
S=bbabbaab
[Figure: the corresponding suffix tree: non-branching paths are compressed into single edges labeled with intervals such as [3,3], [2,3], [7,8], [4,8].]
33
Question
• The space complexity of suffix tree
– O(|S|)
– O(|S| log |S|)
– O(|S|2)
– O(|S|3)
• Why?
– Number of nodes = O(|S|).
– Number of edges = O(|S|).
– Space required by each edge = O(1).
34
Almost all optimisation tasks
are NP-hard
i.e. intractable
35
Intractability
•The theory of intractability deals with decision problems.
Thus the answer is “yes” or “no”.
•Any optimisation problem can be solved with at most polynomial
slow-down compared to the corresponding decision problem.
Example: What is the shortest path in ...? – This is an
optimisation problem. The corresponding decision problem is:
Is there a path of length at most k?
If we can solve the decision problem, then we can solve the
optimisation problem by binary search: solve the decision
problem several times with varying k.
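The binary-search reduction can be sketched generically; `decide` stands for any hypothetical polynomial-time decision oracle ("is there a path of length ≤ k?"):

```python
def shortest_via_decision(decide, lo, hi):
    """Turn a yes/no oracle decide(k) into the optimal value k* in the
    range [lo, hi] using O(log(hi - lo)) oracle calls."""
    if not decide(hi):
        return None                # no solution at all within the range
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(mid):
            hi = mid               # a solution of value <= mid exists
        else:
            lo = mid + 1           # the optimum is larger than mid
    return lo

# Hypothetical oracle: suppose the true shortest path has length 37.
print(shortest_via_decision(lambda k: k >= 37, 0, 1000))  # → 37
```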
•Decision problems can conveniently be translated into languages,
i.e. the problem instances that have the answer “yes” are the
members of the language.
36
NP-complete Problems
•A decision problem belongs to NP iff after a solution to the problem
has been guessed/suggested/provided it can be tested within
polynomial time, whether the solution satisfies the specified
criteria. (N=nondeterministic; P=polynomial)
•P is a subset of NP (proper subset? believed by most researchers)
•A problem is NP-complete iff it is as hard to solve as the “Traveling
Salesman Problem” or any of the other problems known to be
NP-complete, i.e. to prove NP-completeness of a problem A
it is necessary and sufficient to show that it belongs to NP and that
an algorithm solving problem A will provide an algorithm to solve
some problem B (which is known to be NP-complete) with at most
polynomial slow-down as compared to the algorithm solving A.
This process is called reducing B to A.
•NP-complete means that no polynomial time solution can be expected
37
Characteristics of NP-complete problems
•Most “interesting” optimisation problems are NP-hard
(i.e. at least as hard as NP-complete problems)
•Often NP-complete problems look “easy” at first glance – we give some examples.
•Many NP-complete problems allow “good” approximation
algorithms; for some, even a good approximation to the
optimal solution requires more than polynomial time.
•There is a wide range of techniques available that can be used
in order to find optimal or close to optimal solutions.
38
Partition
Instance: A finite set S and a weight function w: S → N.
Question: Is there a subset S’ of S such that S’ and S-S’ have
the same weight?
[Figure: S split into S’ and S-S’ of equal total weight.]
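A subset-sum sketch of the Partition decision. Note this is pseudo-polynomial (time grows with the total weight, which can be exponential in the input length), so it does not contradict NP-completeness:

```python
def can_partition(weights):
    """Partition via subset-sum dynamic programming: track every total
    weight reachable by some subset, then check for half the total."""
    total = sum(weights)
    if total % 2:
        return False               # odd total: no equal split possible
    reachable = {0}
    for w in weights:
        reachable |= {r + w for r in reachable}
    return total // 2 in reachable

print(can_partition([3, 1, 1, 2, 2, 1]), can_partition([1, 2, 4]))  # → True False
```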
39
Traveling Salesman Problem
Instance: A complete weighted graph and a target cost k.
Question: Is there a path (called a tour) of length less than
or equal to k that goes through all vertices of the graph?
[Figure: a four-vertex example with edge weights 1, 2, 2, 3, 3, 4; it has a tour of length 8.]
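A brute-force sketch of the TSP decision problem; the weight matrix below is one hypothetical assignment consistent with the slide's figure (edge weights 1, 2, 2, 3, 3, 4 and a best tour of length 8):

```python
from itertools import permutations

def tsp_decision(weight, n, k):
    """Brute-force TSP decision: is there a tour of total weight <= k?
    Tries all n! vertex orders -- fine for tiny n, hopeless in general."""
    best = min(
        sum(weight[t[i]][t[(i + 1) % n]] for i in range(n))
        for t in permutations(range(n))
    )
    return best <= k

W = [[0, 2, 4, 3],   # hypothetical weights matching the slide's figure
     [2, 0, 1, 3],
     [4, 1, 0, 2],
     [3, 3, 2, 0]]
print(tsp_decision(W, 4, 8))  # → True (best tour 0-1-2-3-0 has length 8)
```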
40
Graph Colorability
Instance: Undirected graph G=(V,E), integer k<|V|.
Question: Is G k-colorable? I.e., is there an assignment of k
colors to the vertices of G such that each edge connects
nodes of different color?
[Figures: example graphs with minimal k=3 and minimal k=2.]
41
Hamiltonian Cycle
Instance: Undirected graph G=(V,E).
Question: Does G have a Hamiltonian cycle? (I.e. a cycle in the
graph that contains each vertex exactly once.)
This graph does not have a Hamiltonian cycle.
42
Hitting Set
Instance: A collection C of subsets of a set S and a positive integer k.
Question: Does S contain a subset S’ of size smaller than or equal to k
such that S’ contains at least one member of each subset in C?
[Figure: each circle represents a member of C; in the drawn example S’ needs to be at least of size 3.]
43
Subgraph Isomorphism
Instance: Two undirected graphs G1 and G2.
Question: Does G1 contain G2 as a subgraph?
[Figure: an example where G1 contains G2 as a subgraph.]
44
Optimizing ATM-networks
Instance: A directed graph G=(V,E) and a positive integer k.
Permitted modification: replace any set of edge-disjoint paths
by edges from their first to their last nodes.
Question: Is it possible to modify G such that the diameter of the
modified graph is less than or equal to k?
[Figure: successive modifications reduce the diameter from 8 to 6 to 5 – is 5 optimal?]
45
Facts and Beliefs
What is the optimal solution for mesh networks?
For nxn networks the diameter
should become proportional to
log n.
Bidirectional cycle length: n
Path length: n
Diameter:< 3 n
46
“Addressing” NP-hard Problems
Heuristics:
Polynomial-time algorithms that often find the optimal solution
and on average find a solution close to the optimum.
Machine learning:
Neural networks
genetic algorithms / evolutionary algorithms
simulated annealing
ant algorithms
…
Good heuristics usually outperform the methods based on
machine learning. Thus they should be preferred (if they
can be found in reasonable time).
47
“Hill Climbing”
Take a candidate and try to improve it by a minimal change, i.e.
a small change in any direction:
If this “improves the solution” then modify the candidate
accordingly and try again
-- until minimal changes do not give further improvement.
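A minimal sketch of hill climbing, assuming a hypothetical `neighbors` function that enumerates the minimal changes to a candidate:

```python
def hill_climb(f, x, neighbors):
    """Repeatedly apply the best improving minimal change;
    stop when no minimal change improves the solution (local optimum)."""
    while True:
        best = max(neighbors(x), key=f, default=x)
        if f(best) <= f(x):
            return x               # no neighbour is strictly better
        x = best

# Toy example: maximize f(x) = -(x-7)^2 over the integers.
result = hill_climb(lambda x: -(x - 7) ** 2, 0, lambda x: [x - 1, x + 1])
print(result)  # → 7
```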
48
Genetic Algorithm
Generate a population:
Generate a population (a random selection from the solution space) of 10^3 to 10^6 members. The members
need to have “genes” (genes = parameters that determine the qualities of the individual,
i.e. coordinates in the solution space).
Determine the fitness of each member.
Breed another 10^3 to 10^6 members:
Pick pairs A, B from the population (probability of being picked can be a function of the
fitness). Create one child using “Crossover”, i.e. determine (at random) a subset of the genes
to be used from parent A (the other genes are used from parent B). Create second child
(complement).
Mutation: Do random (mostly minor) changes to some children. Make sure a
chain of such changes can reach all members of the solution space.
Select (reduce the population back to 10^3 to 10^6):
Delete weak members of the population: All the weakest? Use some probability? Keep at
least a fixed percentage of the parent generation? If all of the weakest are deleted, then you
are likely to get stuck in a local maximum.
Island model
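A toy-sized sketch of the recipe above (population 40 rather than 10^3 to 10^6, a hypothetical OneMax fitness, uniform crossover with complementary children, mutation, and elitist truncation selection):

```python
import random

random.seed(0)
GENES, POP = 20, 40

def fitness(ind):
    return sum(ind)  # toy OneMax problem: count the 1-genes

def crossover(a, b):
    """Uniform crossover: a random subset of genes from parent a,
    the rest from parent b; the second child is the complement."""
    mask = [random.random() < 0.5 for _ in range(GENES)]
    child1 = [x if m else y for m, x, y in zip(mask, a, b)]
    child2 = [y if m else x for m, x, y in zip(mask, a, b)]
    return child1, child2

def mutate(ind, rate=0.05):
    return [1 - g if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(60):
    children = []
    for _ in range(POP // 2):
        a, b = random.sample(pop, 2)     # parent selection (uniform here)
        for child in crossover(a, b):
            children.append(mutate(child))
    # truncation selection with elitism: keep the POP fittest of old + new
    pop = sorted(pop + children, key=fitness, reverse=True)[:POP]

print(fitness(pop[0]))
```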
49
Ant Colony Optimization is a metaheuristic that is inspired by the
following observation. Ants that leave the nest to search for food
mark their path with pheromone. Those ants that found shorter
paths return earlier. Since pheromone trails attract other ants, the
proportion of ants on the shorter paths increases.
www.aifb.uni-karlsruhe.de/ .../aik_08/poster.html
50
Simulated Annealing
Pick a number D close to half the diameter of the solution space.
Pick a random member X from the solution space and evaluate its quality.
Repeat 1000 times:
  Move in a random direction a random distance (up to D) away from X.
  Evaluate the quality there. If the quality is better there, this becomes
  the new X; otherwise it is dismissed.
  Reduce D (cooling down).
(In the standard formulation a worse candidate is also accepted, with
probability e^(-Δ/T) for temperature T, which is what lets the search
escape local optima.)
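A sketch of the standard Metropolis-style formulation on a toy landscape; the geometric cooling of the temperature T plays the role of the slide's shrinking distance D:

```python
import math
import random

def simulated_annealing(f, x, neighbor, T0=10.0, alpha=0.95, steps=2000):
    """Minimize f. A worse neighbour is accepted with probability
    exp(-delta/T); the temperature T is reduced geometrically (cooling)."""
    T = T0
    best = x
    for _ in range(steps):
        y = neighbor(x)
        delta = f(y) - f(x)
        if delta <= 0 or random.random() < math.exp(-delta / T):
            x = y                  # accept the move (possibly uphill)
        if f(x) < f(best):
            best = x               # remember the best point seen so far
        T *= alpha
    return best

random.seed(1)
# Toy landscape with many local minima; global minimum near x ≈ -0.5.
f = lambda x: x * x + 10 * math.sin(3 * x) + 10
result = simulated_annealing(f, 8.0, lambda x: x + random.uniform(-1, 1))
print(result)
```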
51
Base Station Transmitter Location and Frequency Allocation
for Cellular Radio Networks - a Simulated Annealing Approach
www.ti.inf.ethz.ch/pw/st_proj/base_station.html
52