
Output Sensitive Algorithm for
Finding Similar Objects
Takeaki Uno
National Institute of Informatics,
The Graduate University for
Advanced Studies (Sokendai)
Jul/2/2007 Combinatorial Algorithms Day
Motivation: Analyzing Huge Data
• Recent information technology has given us many huge databases
- Web, genome, POS, log, …
• "Construction" and "keyword search" can be done efficiently
• The next step is analysis; capture features of the data
- statistics, such as size, #rows, density, attributes, distribution…
• Can we get more?
→ Look at (simple) local structures,
but keep them simple and basic
[Figure: example databases: a table of experiment results (Experiment 1 to Experiment 4, entries marked ● or ▲) and a list of genome sequences such as ATGCGCCGTA, TAGCGGGTGG, TTCGCGTTAG]
Our Focus
• Find all pairs of similar objects (or structures)
(or binary relation instead of similarity)
• This is probably a very basic and fundamental task
→ There should be many applications:
- finding globally similar structures,
- constructing neighbor graphs,
- detecting locally dense structures (groups of related objects)
In this talk, we look at the strings
Existing Studies
• There are many studies on similarity search (homology search)
→ Given a database, construct a data structure that enables us to
quickly find the objects similar to a given query object
- strings with Hamming distance, edit distance
- points in the plane (k-d trees), Euclidean space
- sets
- constructing neighbor graphs (for smaller dimensions)
- genome sequence comparison (heuristics)
• Both exact and approximate approaches
• All pairs comparison is not popular
Our Problem
Problem:
Given a database composed of n strings of the same fixed
length l, and a threshold d,
find all pairs of strings whose Hamming distance is at most d
ATGCCGCG
GCGTGTAC
GCCTCTAT
TGCGTTTC
TGTAATGA
...
・ ATGCCGCG , AAGCCGCC
・ GCCTCTAT , GCTTCTAA
・ TGTAATGA , GGTAATGG
...
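As a baseline for the problem statement, here is a minimal sketch of the straightforward all-pairs comparison, which takes O(l n^2) time. The function names `hamming` and `similar_pairs_naive` are illustrative and do not come from the talk; the input is made up from the three output pairs shown on this slide.

```python
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def similar_pairs_naive(strings, d):
    """All index pairs (i, j), i < j, whose Hamming distance is at most d."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(strings), 2)
            if hamming(s, t) <= d]

# Toy input assembled from the example pairs on the slide.
strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT",
           "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(similar_pairs_naive(strings, 2))  # [(0, 1), (2, 3), (4, 5)]
```

Every pair is compared regardless of how many pairs are finally reported, which is what an output-sensitive algorithm tries to avoid.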
Trivial Bound of the Complexity
• If all the strings are exactly the same, we have to output all the
pairs, which takes Θ(n^2) time
→ Simple all-pairs comparison in O(l n^2) time is optimal
if l is a fixed constant
→ Is there no room for improvement?
• In practice, we would analyze the data only when the output is small;
otherwise the analysis is meaningless
→ Consider the complexity in terms of the output size
M: #outputs
We propose an O(2^l (n + lM)) time algorithm
Basic Idea: Fixed Position Subproblem
• Consider the following subproblem:
• Given l-d positions of letters, find all pairs of strings with
Hamming distance at most d such that
"the letters at the l-d positions are the same"
Ex) 2nd, 4th, 5th positions of strings of length 5 (l=5, d=2)
• We can solve it by a "radix sort" on the letters at those positions, in O(l n)
time.
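The subproblem can be sketched as follows. This is only an illustration under assumptions: a hash-based grouping stands in for the radix sort mentioned on the slide, positions are 0-based, and the name `solve_fixed_positions` is hypothetical. When exactly l-d positions are fixed, any two strings in the same group differ in at most d places, so the distance check is kept only as a safeguard for shorter position lists.

```python
from collections import defaultdict
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def solve_fixed_positions(strings, positions, d):
    """Pairs (i, j) that agree on every index in `positions`
    and have Hamming distance at most d."""
    buckets = defaultdict(list)
    for i, s in enumerate(strings):
        # Grouping by the projected letters replaces the radix sort.
        buckets[tuple(s[p] for p in positions)].append(i)
    pairs = []
    for group in buckets.values():
        for i, j in combinations(group, 2):
            if hamming(strings[i], strings[j]) <= d:
                pairs.append((i, j))
    return pairs

# 2nd, 4th, 5th positions of strings of length 5 (0-based: 1, 3, 4), d = 2.
toy = ["ABCDE", "XBYDE", "ABCDX", "QBZDE"]
print(solve_fixed_positions(toy, (1, 3, 4), 2))  # [(0, 1), (0, 3), (1, 3)]
```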
Examine All Cases
• Solve the subproblem for all combinations of the positions
→ If the distance between two strings S1 and S2 is at most d,
the letters at some l-d positions (say P) are the same
→ In at least one combination, the pair S1 and S2 is found
(in the subproblem of combination P)
• # combinations is lCd (the binomial coefficient "l choose d"). When l=5 and d=2, it is 10
→ Computation is "radix sorts +α": O(lCd l n) time for sorting
→ Apply branch-and-bound to the radix sorts: O(lCd n) time
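A rough sketch of examining all combinations is given below. It is not the branch-and-bound radix sort of the talk: each subproblem is solved independently by hash grouping, and the function name `pairs_over_all_combinations` and the toy strings are assumptions made for illustration. The counter records how many subproblems find each pair.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import comb

def pairs_over_all_combinations(strings, d):
    """Map each pair within distance d to the number of subproblems finding it."""
    l = len(strings[0])
    found = Counter()
    for positions in combinations(range(l), l - d):
        buckets = defaultdict(list)
        for i, s in enumerate(strings):
            buckets[tuple(s[p] for p in positions)].append(i)
        for group in buckets.values():
            # Agreement on l-d positions implies distance at most d.
            found.update(combinations(group, 2))
    return found

print(comb(5, 2))  # 10 combinations when l=5 and d=2

toy = ["ABC", "ABD", "ACC", "EFG", "FFG", "AFG", "GAB"]
print(sorted(pairs_over_all_combinations(toy, 1)))
# [(0, 1), (0, 2), (3, 4), (3, 5), (4, 5)]
```

A pair that agrees on more than l-d positions is counted more than once, which is why the counts, and not only the pairs, are kept here.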
Exercise
・ Find all pairs of strings with Hamming distance at most 1
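Input strings:
ABC, ABD, ACC, EFG, FFG, AFG, GAB
Sorted by the letters at positions 2 and 3:
GAB, ABC, ABD, ACC, EFG, FFG, AFG
Sorted by the letters at positions 1 and 3:
ABC, ACC, ABD, AFG, EFG, FFG, GAB
Sorted by the letters at positions 1 and 2:
ABC, ABD, ACC, AFG, EFG, FFG, GAB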
Duplication: How long is "+α"
• If two strings S1 and S2 are exactly the same, the pair is
found in all subproblems, i.e., lCd times
→ If we allow duplications, "+α" needs O(M lCd) time
→ To avoid duplications, use "canonical positions"
Avoid Duplications by Canonical Positions
• For two strings S1 and S2, their canonical positions are the first
l-d positions at which their letters are the same
• We output the pair S1 and S2 only in the subproblem of
their canonical positions
• Computation of canonical positions takes O(d) time, so "+α"
needs O(M d lCd) time
→ Duplications are avoided without keeping the solutions in memory
O(lCd (n + dM)) = O(2^l (n + lM)) time in total
(O(n + M) if l is a fixed constant)
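The canonical-position rule can be sketched as below, under stated assumptions: positions are 0-based, subproblems are solved by hash grouping instead of radix sort, and `canonical_positions` scans the whole string (O(l) time) where the slide states O(d). The names are illustrative. Each pair is yielded exactly once, and no set of already reported pairs is stored.

```python
from collections import defaultdict
from itertools import combinations

def canonical_positions(s, t, m):
    """First m positions where s and t have the same letter
    (None if they agree on fewer than m positions)."""
    same = [p for p, (a, b) in enumerate(zip(s, t)) if a == b]
    return tuple(same[:m]) if len(same) >= m else None

def similar_pairs(strings, d):
    """Yield every pair within Hamming distance d exactly once."""
    l = len(strings[0])
    m = l - d
    for positions in combinations(range(l), m):
        buckets = defaultdict(list)
        for i, s in enumerate(strings):
            buckets[tuple(s[p] for p in positions)].append(i)
        for group in buckets.values():
            for i, j in combinations(group, 2):
                # Report the pair only in its canonical subproblem.
                if canonical_positions(strings[i], strings[j], m) == positions:
                    yield (i, j)

print(list(similar_pairs(["AAA", "AAA", "AAB"], 1)))
# [(0, 1), (0, 2), (1, 2)]
```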
In Practice
• Is lCd small in practice?
→ In some cases, yes (e.g., genome sequences)
• If we want to find strings with at most 10% error:
20C2 = 190, 30C3 = 4060,
60C6 = 50063860…
possibly too large for a slightly larger l
• To deal with a slightly larger l, we use a variant of this algorithm
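The counts above can be reproduced with a few lines. The helper name `num_subproblems` is hypothetical; it simply evaluates the binomial coefficient lCd with d set to 10% of l.

```python
from math import comb

def num_subproblems(l, error_rate=0.1):
    """Number of position combinations lCd when d is error_rate * l, rounded."""
    d = round(l * error_rate)
    return comb(l, d)

for l in (20, 30, 60):
    print(l, num_subproblems(l))
# 20 190
# 30 4060
# 60 50063860
```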
Partition into Blocks
• Consider a partition of each string into k blocks
• Given k-d positions of blocks, find all pairs of strings with
distance at most d s.t. "the blocks at those positions are the same"
• Radix sorts are done in O(kCd n) time
Ex) 2nd, 4th, 5th positions of blocks of strings of length 5
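A minimal sketch of the block variant follows, assuming contiguous blocks of nearly equal length, 0-based block indices, and hash grouping in place of radix sort; all function names are illustrative. Because agreeing on k-d blocks does not guarantee distance at most d, each candidate pair is verified explicitly, and the canonical-position idea is applied to blocks to avoid duplicates.

```python
from collections import defaultdict
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def split_blocks(s, k):
    """Cut s into k contiguous blocks of nearly equal length."""
    bounds = [len(s) * b // k for b in range(k + 1)]
    return tuple(s[bounds[b]:bounds[b + 1]] for b in range(k))

def similar_pairs_blocks(strings, d, k):
    """Pairs within Hamming distance d, found via k-d matching blocks (k > d)."""
    blocks = [split_blocks(s, k) for s in strings]
    m = k - d
    result = []
    for chosen in combinations(range(k), m):
        buckets = defaultdict(list)
        for i, bl in enumerate(blocks):
            buckets[tuple(bl[b] for b in chosen)].append(i)
        for group in buckets.values():
            for i, j in combinations(group, 2):
                same = [b for b in range(k) if blocks[i][b] == blocks[j][b]]
                if tuple(same[:m]) != chosen:
                    continue  # reported in the canonical subproblem instead
                if hamming(strings[i], strings[j]) <= d:  # explicit verification
                    result.append((i, j))
    return result

strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT",
           "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(sorted(similar_pairs_blocks(strings, 2, 4)))  # [(0, 1), (2, 3), (4, 5)]
```

With k=4 only 4C2 = 6 subproblems are solved for these strings of length 8, instead of 8C2 = 28 in the letter-wise version.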
Small "+α" is Expected
• The Hamming distance of two strings may be larger than d, even if
their k-d blocks are the same
→ In the worst case, "+α" is not linear in #outputs
• However, if the number of letters in the k-d blocks is large enough, few strings
have the same blocks
→ "+α" is not large in practice; the running time is almost O(kCd n)
Experiments: l = 20 and d = 0,1,2,3
Prefixes of Y chromosome of Human
Notebook PC with Pentium M 1.1GHz, 256MB RAM
[Figure: plot of time (sec) against length (1000 letters), one curve each for d=0, d=1, d=2, d=3]
Comparison of Long Strings
• Slice one of the long strings with overlaps
• Partition the other long string without overlaps
• Compare all pairs
1. Draw a matrix: the intensity of a cell is
given by the number of pairs inside
2. Draw a point if there are 3 pairs in an area
of length α and width β:
→ if two substrings of length α have an error a bit
less than k%, they have at least some
short similar substrings
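The slicing scheme and the intensity matrix of step 1 can be sketched as follows. This is an illustration under assumptions: a plain double loop stands in for the similar-pair algorithm, square cells of side `cell` are a guess at how the matrix is binned, and all names are hypothetical.

```python
from collections import Counter

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def overlapping_windows(s, w):
    """(start, substring) for every start position: slices with overlaps."""
    return [(i, s[i:i + w]) for i in range(len(s) - w + 1)]

def disjoint_windows(s, w):
    """(start, substring) for consecutive pieces without overlap."""
    return [(i, s[i:i + w]) for i in range(0, len(s) - w + 1, w)]

def dot_matrix(x, y, w, d, cell):
    """Number of similar window pairs falling into each cell of the matrix."""
    counts = Counter()
    for i, a in overlapping_windows(x, w):
        for j, b in disjoint_windows(y, w):
            if hamming(a, b) <= d:
                counts[(i // cell, j // cell)] += 1
    return counts

print(dict(dot_matrix("ACGTACGT", "ACGTTTTT", w=4, d=0, cell=4)))
# {(0, 0): 1, (1, 0): 1}
```

The counts per cell would then be mapped to pixel intensity when the matrix is drawn.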
Comparison of Chromosome
Human 21st and chimpanzee 22nd chromosomes
• Take strings of 30 letters from both, with overlaps
• Intensity is given by # pairs
• White → possibly similar
• Black → never similar
• Grid lines detect "repetitions of similar structures"
• 20 min. by PC
[Figure: intensity matrix; axes: human 21st chr. and chimpanzee 22nd chr.]
Homology Search on Chromosomes
Human X and mouse X chromosomes (150M strings for each)
• Take strings of 30 letters beginning at every position
・ For human X, without overlaps
・ d=2, k=7
・ Dots if 3 points are in an area of width 300 and length 3000
• 1 hour by PC
[Figure: dot plot; axes: human X chr. and mouse X chr.]
Extensions ???
• Can we solve the problem for other objects?
(sets, sequences, graphs,…)
• For graphs, maybe yes, but the practical performance is uncertain
• For sets, Hamming distance is not preferable.
For large sets, many differences should be allowed.
• For continuous objects, such as points in Euclidean space, we can
hardly bound the complexity in the same way.
(In the discrete version, the neighbors are finite, and are actually
classified into a constant number of groups)
Conclusion
• Output-sensitive algorithm for finding pairs of similar strings
(in terms of Hamming distance)
• Multiple classification by the positions required to be the same
• Using blocks to reduce the practical computation
• Application to genome sequence comparison
Future work
• Extension to other objects (sets, sequences, graphs)
• Extension to continuous objects (points in Euclidean space)
• Efficient spin out heuristics for practice
• Genome analysis system