On supermaximal repeats and minimal forbidden words


On the suffix automaton with mismatches
Maxime Crochemore, Chiara Epifanio,
Alessandra Gabriele, and Filippo Mignosi
Outline
1. Motivations and basic definitions
2. Nerode’s congruence… with mismatches
3. Suffix automata with mismatches
4. Conclusions and open problems
Prague, 17/07/2007
CIAA 2007
In the literature, several data structures have been studied for storing
the suffixes of a text. Each of them is designed to give fast
access to all factors of the text. Among them:
suffix tries: representation of all the suffixes of a word by an
ordinary tree - quadratic size in the length of the word;

suffix trees: compact representations of suffix tries - linear size
in the length of the word;

suffix automata: minimization (related to automata) of suffix
tries - linear size in the length of the word;

compact suffix automata: compact representations of suffix
automata - linear size in the length of the word.

Why suffix automata?
Suffix automata, compact suffix automata and suffix trees
have many applications, such as indexing, pattern
matching, and data compression.

All three have linear size,
but
suffix trees and compact suffix automata represent
strings by pointers into the text, while suffix automata work
without needing to access it.
Why mismatches?
1. Data structures recognizing languages with mismatches for
approximate string matching and its applications, such as
- recovering the original signals after their transmission over
noisy channels;
- finding DNA subsequences after possible mutations;
- text searching where there are typing or spelling errors;
- retrieving musical passages.
2. Independent theoretical interest, such as, for instance, the
modelling of some evolutionary events in molecular biology.
In Blumer et al. (1985)
1. a linear-time algorithm for building the suffix
automaton of a word w over a fixed alphabet
is given (based on Nerode’s congruence);
2. it is shown that this suffix automaton has
at least |w|+1 states and at most 2|w| states
[Carpi and de Luca proved in 2001 that the
lower bound is attained for any prefix of the Fibonacci word].
In this paper we focus on the minimal deterministic finite
automaton, denoted by S_k, that recognizes the set
Suff(w,k) of suffixes of a word w up to k errors.
1. First main result: characterization of Nerode's right-invariant
congruence relative to S_k, and a conjecture on the size of S_k.
2. Second main result: description of an algorithm that makes
use of S_k in order to accept, in an efficient way, the language
of all suffixes of w up to k errors in every window of size r
(r = repetition index).
Basic definitions
The distance d(x,y) between two strings x and y is the minimal cost
of a sequence of operations that transform x into y (and ∞ if no such
sequence exists).
We consider the Hamming distance, which allows only substitutions,
with cost 1 (simplified definition). It is finite whenever |x|=|y|, and then
0 ≤ d(x,y) ≤ |x|.
Ex.:
x=acgtatct, y=aggttact
d(x,y)=3 (in the simplified definition)
A string x k-occurs in w if it occurs in w at position l, 1≤l≤|w|, up to k errors.
A string x that k-occurs in w as a suffix of w is a k-suffix of w.
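The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `hamming`, `k_occurs_at` and `is_k_suffix` are hypothetical and not taken from the talk, and positions are 1-based as on the slide.

```python
def hamming(x: str, y: str) -> float:
    """Hamming distance; infinite when the lengths differ."""
    if len(x) != len(y):
        return float("inf")
    return sum(a != b for a, b in zip(x, y))


def k_occurs_at(x: str, w: str, l: int, k: int) -> bool:
    """True if x occurs in w at 1-based position l with at most k mismatches."""
    if l < 1 or l + len(x) - 1 > len(w):
        return False
    return hamming(x, w[l - 1:l - 1 + len(x)]) <= k


def is_k_suffix(x: str, w: str, k: int) -> bool:
    """True if x k-occurs in w as a suffix of w."""
    return len(x) <= len(w) and hamming(x, w[len(w) - len(x):]) <= k
```

For the example on this slide, `hamming("acgtatct", "aggttact")` evaluates to 3.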
Suffixes with One Mismatch

“a”: Suff(a,1)={e,a,b}.
The minimal automaton has 2 states.

“ab”: Suff(ab,1) = {e,a,b,aa,ab,bb}.
The minimal automaton has 4 states.

“aba”: Suff(aba,1)={e,a,b,aa,ba,bb,aaa,aba,abb,bba}.
The minimal automaton has 6 states.

“abaa”: Suff(abaa,1) = {e,a,b,aa,ab,ba,aaa,baa,bab,
bba,aaaa,abaa,abab,abba,bbaa}.
The minimal automaton has 11 states.
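The four examples above can be reproduced by brute force. The sketch below is not the construction from the talk: it enumerates Suff(w,k) directly and counts the states of the minimal partial DFA (no sink state) by identifying equal residual languages. The names `suffixes_with_mismatches` and `minimal_dfa_size` are hypothetical, the binary alphabet is an assumption, and the empty word e is written as `""`.

```python
from itertools import product


def suffixes_with_mismatches(w, k, alphabet="ab"):
    """Brute-force Suff(w, k): strings within Hamming distance k of a suffix of w."""
    result = set()
    for n in range(len(w) + 1):
        suffix = w[len(w) - n:]
        for cand in map("".join, product(alphabet, repeat=n)):
            if sum(a != b for a, b in zip(cand, suffix)) <= k:
                result.add(cand)
    return result


def minimal_dfa_size(language):
    """Number of states of the minimal partial DFA of a finite language."""
    ids = {}  # canonical signature of a residual language -> state id

    def state(words):
        accepting = "" in words
        by_letter = {}
        for word in words:
            if word:
                by_letter.setdefault(word[0], set()).add(word[1:])
        # Two residual languages are equal iff their signatures are equal.
        sig = (accepting,
               tuple(sorted((a, state(frozenset(r))) for a, r in by_letter.items())))
        return ids.setdefault(sig, len(ids))

    state(frozenset(language))
    return len(ids)


sizes = [minimal_dfa_size(suffixes_with_mismatches(w, 1))
         for w in ("a", "ab", "aba", "abaa")]
```

Under these assumptions `sizes` reproduces the state counts 2, 4, 6, 11 stated on the slide. The enumeration is exponential in |w| and is only meant for tiny words.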
On Nerode’s congruence… with mismatches
Definition 1: Let w ∈ Σ*. For y ∈ Σ*, y ≠ e,
end-set_w(y,k) = {i | y k-occurs in w with final position i}.
Notice that end-set_w(e,k) = {0, 1, …, |w|}.
Definition 2: x, y ∈ Σ* are end_k-equivalent on w, written x ≡_{w,k} y, if
1. end-set_w(x,k) = end-set_w(y,k);
2. ∀ i ∈ end-set_w(x,k) = end-set_w(y,k), the number of errors available in
the suffix of w having i+1 as first position is the same after the reading of
x and of y, i.e.
min{|w|-i, k-err_i(x)} = min{|w|-i, k-err_i(y)},
where err_i(u) = number of mismatches of u in w with final position i.
[x]_{w,k} = equivalence class of x with respect to ≡_{w,k}.
In other words …
x ≡_{w,k} y if
1. x and y have the same end-set in w up to k mismatches, as in the exact
case [Blumer et al.],
2. the number of available errors in the suffix of w after the reading of x and of y is the
same.
The definition includes two cases depending on the considered final
position i ∈ end-set_w(x,k) = end-set_w(y,k):
2.a) |w|-i ≥ max{k-err_i(x), k-err_i(y)} ⇒ k-err_i(x) = k-err_i(y) ⇒ err_i(x) = err_i(y).
(In this case min{|w|-i, k-err_i(x)} = k-err_i(x) = k-err_i(y) = min{|w|-i, k-err_i(y)}.)
2.b) |w|-i ≤ min{k-err_i(x), k-err_i(y)} ⇒ it is possible to have mismatches in
any position of the suffix of w having length |w|-i.
This does not necessarily imply that err_i(x) = err_i(y).
(In this case min{|w|-i, k-err_i(x)} = |w|-i = min{|w|-i, k-err_i(y)}.)
Example
Let w = abaababaab, k = 2.
i: 1 2 3 4 5 6 7 8 9 10
w: a b a a b a b a a b
1. x = baba, y = babb:
end-set_w(x,2) = {5, 6, 8, 10} = end-set_w(y,2),
but x ≢_{w,k} y:
i = 5 ⇒ err_5(x) = 2, err_5(y) = 1 ⇒
min{|w|-5, 2-err_5(x)} = 0 ≠ 1 = min{|w|-5, 2-err_5(y)}.
Example (contd)
Let w = abaababaab, k = 2.
i: 1 2 3 4 5 6 7 8 9 10
w: a b a a b a b a a b
2. x = abaababa, y = baababa: x ≡_{w,k} y.
end-set_w(x,2) = {8} = end-set_w(y,2),
i = 8 ⇒ err_8(x) = 0 = err_8(y) ⇒
min{|w|-8, 2-err_8(x)} = 2 = min{|w|-8, 2-err_8(y)}.
Example (contd2)
Let w = abaababaab, k = 2.
i: 1 2 3 4 5 6 7 8 9 10
w: a b a a b a b a a b
3. x = abaababaa, y = baababab: x ≡_{w,k} y.
end-set_w(x,2) = {9} = end-set_w(y,2),
i = 9 ⇒ err_9(x) = 0 ≠ 1 = err_9(y), but
min{|w|-9, 2-err_9(x)} = 1 = min{|w|-9, 2-err_9(y)}.
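The three examples can be checked mechanically. The sketch below follows Definitions 1 and 2 literally; the helper names `end_set`, `class_key` and `equivalent` are hypothetical, and the comparison is only meaningful for strings that k-occur in w (strings with an empty end-set all get the same empty key).

```python
def end_set(x, w, k):
    """Map each final position i of a k-occurrence of x in w to err_i(x).
    Positions are 1-based; the empty word gets the positions 0..|w|."""
    out = {}
    for i in range(len(x), len(w) + 1):
        err = sum(a != b for a, b in zip(x, w[i - len(x):i]))
        if err <= k:
            out[i] = err
    return out


def class_key(x, w, k):
    """Final positions paired with min{|w|-i, k-err_i(x)}."""
    return tuple((i, min(len(w) - i, k - err))
                 for i, err in end_set(x, w, k).items())


def equivalent(x, y, w, k):
    """True if x and y satisfy both conditions of Definition 2."""
    return class_key(x, w, k) == class_key(y, w, k)


w = "abaababaab"
example_1 = equivalent("baba", "babb", w, 2)
example_2 = equivalent("abaababa", "baababa", w, 2)
example_3 = equivalent("abaababaa", "baababab", w, 2)
```

Here `example_1` is `False` while `example_2` and `example_3` are `True`, in agreement with the slides.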
Results
In Blumer et al. (exact case):
≡_w is a right-invariant equivalence relation on Σ*.
x ≡_w y ⇒ x is a suffix of y (or vice versa).
xy ≡_w y ⇔ every occurrence of y is immediately preceded by an occurrence of x.

Lemma 1 (approximate case):
≡_{w,k} is a right-invariant equivalence relation on Σ*.
x ≡_{w,k} y ⇒ x is a suffix of y up to 2k errors (or vice versa).
xy ≡_{w,k} y ⇔ ∀ i ∈ end-set_w(xy,k) = end-set_w(y,k),
the k-occurrence of y with final position i
is immediately preceded by a t-occurrence of x, where t = max{(k-err_i(y)) - (|w|-i), 0}.
Results (contd)
Theorem 1.
x ≡_{w,k} y ⇔ (∀ z ∈ Σ*, xz is a k-suffix of w ⇔ yz is a k-suffix of w), i.e. x and y have the
same future in w.
Corollary 1.
∀ w ∈ Σ*, the (partial) DFA S_k = (Σ, Q, q0, F, δ) having
• input alphabet Σ,
• state set Q = {[x]_{w,k} | x is a k-occurrence of w},
• initial state q0 = [e]_{w,k},
• accepting states F = those equivalence classes that include the k-suffixes of w
(i.e., whose end-sets include the position |w|),
• transition function δ: [x]_{w,k} → [xa]_{w,k} on the letter a, where x and xa are k-occurrences of w,
is the minimal deterministic finite automaton which recognizes the set Suff(w,k).
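As a sanity check of Corollary 1 on tiny words, the automaton can be built directly from the equivalence classes by brute force. This is a sketch under stated assumptions, not the construction algorithm of the talk: the names `class_key`, `build_states` and `accepts` are hypothetical, a class is represented by its end-set paired with the values min{|w|-i, k-err_i(x)}, and the binary alphabet is an assumption.

```python
from itertools import product


def class_key(x, w, k):
    """Key of the class of x, or None if x does not k-occur in w."""
    key = []
    for i in range(len(x), len(w) + 1):
        err = sum(a != b for a, b in zip(x, w[i - len(x):i]))
        if err <= k:
            key.append((i, min(len(w) - i, k - err)))
    return tuple(key) or None


def build_states(w, k, alphabet="ab"):
    """Return (states, transitions, accepting states) as in Corollary 1."""
    states, delta, accepting = set(), {}, set()
    for n in range(len(w) + 1):
        for x in map("".join, product(alphabet, repeat=n)):
            q = class_key(x, w, k)
            if q is None:
                continue
            states.add(q)
            if q[-1][0] == len(w):  # the end-set contains |w|: x is a k-suffix
                accepting.add(q)
            for a in alphabet:
                target = class_key(x + a, w, k)
                if target is not None:
                    delta[(q, a)] = target
    return states, delta, accepting


def accepts(x, w, k, alphabet="ab"):
    """Run x through the automaton built from the classes."""
    _, delta, accepting = build_states(w, k, alphabet)
    q = class_key("", w, k)
    for a in x:
        q = delta.get((q, a))
        if q is None:
            return False
    return q in accepting


state_counts = [len(build_states(w, 1)[0]) for w in ("a", "ab", "aba", "abaa")]
```

With these assumptions `state_counts` is `[2, 4, 6, 11]`, the same sizes as the minimal automata of the one-mismatch examples earlier in the talk.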
What about the size of S_k?
Gad Landau asked for a data structure of size “close” to
|w| that allows approximate pattern matching in time
proportional to the query length plus the number of occurrences.
In the non-approximate case, suffix trees and (compact)
suffix automata do the job.
What about the approximate case?
Prefixes of the Fibonacci word
Sizes of S_1 for the prefixes of length 1, 2, 3, ...:
2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96,
102, 107, 110, 113, 116, 119, 122, 125,
128, 134, 139, 145, 151, 157, 163, 169,
175, 180, 183, 186, 189, 192, 195, 198,
201, 204, 207, 210, 213, 216, 222, 227,
233, 239, 245, 251, 257, 263, 269, x?.....
It is not in the database of Sloane et al.
Writing the differences {a_{n+1} - a_n}_n we obtain
2, 2, 5, 4, 3, 5, 5, 5, 3, 3, 6, 5, 6, 5, 3, 3,
3, 3, 6, 5, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 6,
5, 6, 6, 6, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 5, 3, 3, 3, 3, 3, 3, 3, 3, x? .......
It seems easier. Let us Run-Length encode.
Run-Length encode
two 2, one 5, one 4, one 3, three 5, two 3,
one 6, one 5,
one 6, one 5,
four 3,
one 6, one 5,
three 6, one 5, seven 3,
one 6, one 5,
six 6, one 5, twelve 3,
one 6, one 5,
eleven 6, one 5, twenty 3,
one 6, one 5, nineteen 6, one 5, x? 3,
......
Which is the rule?
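The two transformations used on these slides, first differences and run-length encoding, are easy to reproduce. The sketch below uses only the first 24 sizes listed in the talk; the names `differences` and `run_length_encode` are hypothetical.

```python
from itertools import groupby

# First 24 sizes from the list above (prefixes of length 1..24).
sizes = [2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45,
         50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96, 102]


def differences(seq):
    """Consecutive differences a_{n+1} - a_n."""
    return [b - a for a, b in zip(seq, seq[1:])]


def run_length_encode(seq):
    """List of (run length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(seq)]


encoded = run_length_encode(differences(sizes))
```

The result starts with `(2, 2), (1, 5), (1, 4), (1, 3), (3, 5), (2, 3)`, that is "two 2, one 5, one 4, one 3, three 5, two 3", matching the slide.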
Conjecture on the size of S_1 for
prefixes of the Fibonacci word
An initial part, and then for i = 4, 5, ...:
one 6, one 5, (fib_{i-1} - 2) 6, one 5, (fib_i - 1) 3, …
(here fib_1 = 1, fib_2 = 2, fib_3 = 3, fib_4 = 5, ...).
Conjecture 1: The size of the suffix automaton with one
mismatch of the prefixes of the Fibonacci word grows
according to
a_{fib_n} = a_{fib_{n-1}} + 3(fib_{n-3} - 1) + 10 + 6(fib_{n-4} - 1),
where a_m denotes the size of S_1 for the prefix of length m.
We did not prove the rule. It holds for all prefixes
up to length 2000; that it describes the whole
sequence is a conjecture.
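The recurrence can be checked against the sizes listed earlier in the talk. The sketch below reads the rule as a_{fib_n} = a_{fib_{n-1}} + 3(fib_{n-3} - 1) + 10 + 6(fib_{n-4} - 1), with the indexing fib_1 = 1, fib_2 = 2; both the reading and the indexing are assumptions inferred from the listed data, and the check starts at n = 5.

```python
# Sizes a_m of S_1 read off the list of sizes, at Fibonacci prefix lengths m.
a = {3: 6, 5: 15, 8: 28, 13: 50, 21: 84, 34: 139, 55: 227}

# Indexing assumption: fib_1 = 1, fib_2 = 2, fib_3 = 3, fib_4 = 5, ...
fib = {1: 1, 2: 2}
for n in range(3, 10):
    fib[n] = fib[n - 1] + fib[n - 2]


def predicted(n):
    """Right-hand side of the conjectured recurrence for a_{fib_n}."""
    return a[fib[n - 1]] + 3 * (fib[n - 3] - 1) + 10 + 6 * (fib[n - 4] - 1)


# Maps each prefix length fib_n (n = 5..9) to whether the rule matches.
checks = {fib[n]: predicted(n) == a[fib[n]] for n in range(5, 10)}
```

For the listed values the rule matches at prefix lengths 8, 13, 21, 34 and 55. This only confirms consistency with the data shown and says nothing about longer prefixes.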
Other experiments and
Final Conjecture
Words bba^n, n ≥ 4 ⇒ a_{n+1} - a_n = 19 + 6(n-4).
Prefixes of the Thue-Morse word ⇒ |S_1| ≤ 2|w| log|w|.
Random words generated by memoryless
sources ⇒ |S_k| = O(|w| log^k |w|) [Epifanio Gabriele Mignosi
Restivo Sciortino 2003, 2005; Maas Nowak 2005].
Conjecture
The suffix automaton with k mismatches of any
text w has size O(|w| log^k |w|).
Allowing more mismatches
Definition:
Let w ∈ Σ*, k, r ∈ Z+ ∪ {0}, k ≤ r.
x occurs in w at position l up to k errors in a window of size r, or simply k_r-occurs in w at position l, if:
− if |x| < r ⇒ d(x, w(l, l+|x|-1)) ≤ k;
− if |x| ≥ r ⇒ ∀ i, 1 ≤ i ≤ |x|-r+1, d(x(i, i+r-1), w(l+i-1, l+i+r-2)) ≤ k.
A string x satisfying the above property is a k_r-occurrence of w.
A string x that k_r-occurs in w as a suffix of w is a k_r-suffix of w.
L(w,k,r) = {x | x k_r-occurs in w at position l, 1 ≤ l ≤ |w|-|x|+1}.
Suff(w,k,r) = {x | x is a k_r-suffix of w}.
Remark:
Suff(w,k) = Suff(w,k,r) when r = |w|.
Example
w = abaa, k = 1, r = 2
• L(w,1,2) = {e,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,
bab,bba,bbb,aaaa,aaab,abaa,abab,abba,bbaa,bbab,
bbba}
Remark: bbba ∈ L(w,1,2), but bbba ∉ L(w,1,4) = L(w,1).
• Suff(w,1,2) = {e,a,b,aa,ab,ba,aaa,aab,baa,bab,bba,
aaaa,aaab,abaa,abab,abba,bbaa,bbab,bbba}
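The windowed definition and this example can be reproduced by brute force. The sketch below is an illustration only: `kr_occurs_at` and `language` are hypothetical names, positions are 1-based as on the slides, the empty word e is written as `""`, and the binary alphabet is an assumption.

```python
from itertools import product


def hamming(x, y):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def kr_occurs_at(x, w, l, k, r):
    """True if x occurs at 1-based position l of w with at most k errors
    in every window of size r."""
    if l < 1 or l + len(x) - 1 > len(w):
        return False
    factor = w[l - 1:l - 1 + len(x)]
    if len(x) < r:
        return hamming(x, factor) <= k
    return all(hamming(x[i:i + r], factor[i:i + r]) <= k
               for i in range(len(x) - r + 1))


def language(w, k, r, alphabet="ab", suffix_only=False):
    """L(w,k,r), or Suff(w,k,r) when suffix_only is True."""
    out = set()
    for n in range(len(w) + 1):
        starts = [len(w) - n + 1] if suffix_only else range(1, len(w) - n + 2)
        for x in map("".join, product(alphabet, repeat=n)):
            if any(kr_occurs_at(x, w, l, k, r) for l in starts):
                out.add(x)
    return out


L_abaa = language("abaa", 1, 2)
Suff_abaa = language("abaa", 1, 2, suffix_only=True)
```

Under these assumptions `L_abaa` has the 23 elements listed above and `Suff_abaa` the 19 listed suffixes; `"bbba"` belongs to `L_abaa` but not to `language("abaa", 1, 4)`.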
The repetition index R(w,k,r) of w is the smallest integer
h such that all strings of length h k_r-occur at most once in w.
Remarks:
1. R(w,k,r) is well defined because the integer h = |w| belongs to
the set described above;
2. If k/r ≥ 1/2 then R(w,k,r) = |w|;
3. The equation r = R(w,k,r) admits a unique solution.
Lemma 2:
Given S_k, there exists a linear-time algorithm that returns r = R(w,k,r).
Remark: This algorithm labels each state of S_k with an integer that
represents a distance from this state to the end.
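For very small words the repetition index can be computed straight from its definition. The sketch below is an exponential brute force, not the linear-time algorithm of Lemma 2; `repetition_index` is a hypothetical name, and restricting candidate strings to the letters of w is an assumption (a letter foreign to w can be replaced by a letter of w without losing occurrences).

```python
from itertools import product


def kr_occurs_at(x, w, l, k, r):
    """True if x occurs at 1-based position l of w with at most k errors
    in every window of size r."""
    if l < 1 or l + len(x) - 1 > len(w):
        return False
    factor = w[l - 1:l - 1 + len(x)]
    mism = lambda u, v: sum(a != b for a, b in zip(u, v))
    if len(x) < r:
        return mism(x, factor) <= k
    return all(mism(x[i:i + r], factor[i:i + r]) <= k
               for i in range(len(x) - r + 1))


def repetition_index(w, k, r, alphabet=None):
    """Smallest h such that every string of length h k_r-occurs at most once."""
    alphabet = alphabet or sorted(set(w))
    for h in range(1, len(w) + 1):
        positions = range(1, len(w) - h + 2)
        if all(sum(kr_occurs_at(x, w, l, k, r) for l in positions) <= 1
               for x in map("".join, product(alphabet, repeat=h))):
            return h
    return len(w)
```

For w = abaa, k = 1, r = 2 this returns 4 = |w|, consistent with Remark 2 since k/r = 1/2.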
Algorithm that lets S_k recognize Suff(w,k,r)
Algorithm (x, r, S_k)
• Case |x| ≤ r = R(w,k,r):
if x is accepted by S_k then x ∈ Suff(w,k,r)
else x ∉ Suff(w,k,r)
• Case |x| > r = R(w,k,r):
let x’ = prefix of x such that |x’| = r = R(w,k,r);
let q be the state reached after reading x’ and i the integer associated to q;
j = |w|-i-r+1 is the unique possible initial position of x;
check if x k_r-occurs at position j in w.
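The two cases of the algorithm can be sketched without the automaton. In the sketch below, plain scans over w stand in for S_k and for the integer labels on its states, so it illustrates the case analysis only and not the efficiency. The name `in_suff_kr` is hypothetical, and the caller is assumed to pass r with r = R(w,k,r).

```python
def mismatches(u, v):
    """Number of mismatches between two strings of equal length."""
    return sum(a != b for a, b in zip(u, v))


def kr_occurs_at(x, w, l, k, r):
    """True if x occurs at 1-based position l of w with at most k errors
    in every window of size r."""
    if l < 1 or l + len(x) - 1 > len(w):
        return False
    factor = w[l - 1:l - 1 + len(x)]
    if len(x) < r:
        return mismatches(x, factor) <= k
    return all(mismatches(x[i:i + r], factor[i:i + r]) <= k
               for i in range(len(x) - r + 1))


def in_suff_kr(x, w, k, r):
    """Membership test for Suff(w,k,r), assuming r == R(w,k,r)."""
    if len(x) > len(w):
        return False
    if len(x) <= r:
        # Stand-in for "x is accepted by S_k".
        return mismatches(x, w[len(w) - len(x):]) <= k
    prefix = x[:r]
    # Stand-in for reading x' in S_k and looking up the state label:
    # since r is the repetition index, at most one start position exists.
    starts = [l for l in range(1, len(w) - r + 2)
              if kr_occurs_at(prefix, w, l, k, r)]
    if not starts:
        return False
    j = starts[0]
    # x must end exactly at |w| and k_r-occur at position j.
    return j + len(x) - 1 == len(w) and kr_occurs_at(x, w, j, k, r)
```

As a usage example, for w = abcdefgh, k = 1 and r = 3, the string dxfgx is accepted (two mismatches overall, but at most one in each window of size 3), while dxfxh is rejected.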
Conclusions and open problems
1. S_k can be useful for approximate indexing.
2. If Conjecture 2 is true and the constants involved in the O-notation are
small, our data structure is useful for some classical applications
of approximate indexing.
3. We think that it is possible to connect S_k with S_{k,r}, and we conjecture
that |S_{k,r}| = O(|S_k|).
4. We think that it is possible to obtain an online algorithm even
when dealing with mismatches. It would probably be more
complex than the classical one. How to define it
remains an open problem.