String Matching Algorithms (文字列照合アルゴリズム)


北海道大学 Hokkaido University
Lecture on Information Knowledge Network
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network,
Division of Computer Science,
Graduate School of Information Science and Technology,
Hokkaido University
Takuya KIDA
2011/1/7
Lecture on Information knowledge network
The 5th
Regular expression matching
– About regular expressions
– Flow of processing
– Construction of the syntax tree (parse tree)
– Construction of an NFA for the RE
– Simulating the NFA
What is a regular expression?

A notation for flexible and powerful pattern matching.
– Example: regular expressions over filenames:
  > rm *.txt                 (matches any file whose extension is ".txt")
  > cp Important[0-9].doc    (matches Important0.doc through Important9.doc)
– Example: the search tool grep:
  > grep -E "for.+(256|CHAR_SIZE)" *.c
– Example: the programming language Perl:
  $line =~ m|^http://.+\.jp/.+$|
  (matches strings that start with "http://" and contain ".jp/")

A regular expression denotes a regular set (regular language).
– It expresses a language L (a set of strings) that can be accepted by a finite automaton.
Definition of regular expressions

Definition:
A regular expression is a string over Σ∪{ε, |, ・, *, (, )}, defined recursively by the following rules.
– (1) Any element of {ε}∪Σ is a regular expression.
– (2) If α and β are regular expressions, then (α・β) is a regular expression.
– (3) If α and β are regular expressions, then (α|β) is a regular expression.
– (4) If α is a regular expression, then α* is a regular expression.
– (5) Only the expressions derived from the above rules are regular expressions.

Example: (A・((A・T)|(C・G))*) → A(AT|CG)*, since (α・β) is often written αβ for short.

※ The symbols '|', '・', and '*' are called operators.
Moreover, for a regular expression α, "+" is often used with the meaning α+ = α・α*.
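As a hedged sketch (not from the slides; the tuple encoding and the `show` helper are my own), the recursive definition maps directly onto a tree datatype in Python:

```python
# Tuple-encoded regular expressions, mirroring rules (1)-(4):
# ("eps",), ("sym", a), ("cat", α, β), ("union", α, β), ("star", α).
def show(re):
    """Render a tuple-encoded RE back into fully parenthesized notation."""
    op = re[0]
    if op == "eps":
        return "ε"
    if op == "sym":
        return re[1]
    if op == "cat":
        return "(" + show(re[1]) + "・" + show(re[2]) + ")"
    if op == "union":
        return "(" + show(re[1]) + "|" + show(re[2]) + ")"
    if op == "star":
        return show(re[1]) + "*"

# The slide's example A(AT|CG)* as a tree:
example = ("cat", ("sym", "A"),
           ("star", ("union",
                     ("cat", ("sym", "A"), ("sym", "T")),
                     ("cat", ("sym", "C"), ("sym", "G")))))
print(show(example))   # (A・((A・T)|(C・G))*)
```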
Semantics of regular expressions

A regular expression is mapped to a subset of Σ* (a language L):
– (i) ||ε|| = {ε}
– (ii) For a∈Σ, ||a|| = {a}
– (iii) For regular expressions α and β, ||(α・β)|| = ||α||・||β||
– (iv) For regular expressions α and β, ||(α|β)|| = ||α||∪||β||
– (v) For a regular expression α, ||α*|| = ||α||*

For example, for (a・(a|b)*):
||(a・(a|b)*)||
= ||a||・||(a|b)*||
= {a}・||(a|b)||*
= {a}・({a}∪{b})*
= { ax | x∈{a, b}* }
[Figure: an equivalent DFA for the example above, with states q0, q1, q2; q0 goes to q1 on a, q1 loops on a and b, and b from q0 leads to the dead state q2, which loops on a and b.]
※ Exercise: what is the language denoted by (AT|GA)(TT)* ?
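The semantics ||·|| above can be checked by brute force. A minimal sketch (my own tuple encoding, not the slides'): enumerate the language of an RE up to a length bound.

```python
def lang(re, n):
    """All strings of length <= n in the language denoted by the RE."""
    op = re[0]
    if op == "eps":
        return {""}
    if op == "sym":
        return {re[1]} if n >= 1 else set()
    if op == "cat":        # rule (iii): ||α・β|| = ||α||・||β||
        return {u + v for u in lang(re[1], n) for v in lang(re[2], n)
                if len(u + v) <= n}
    if op == "union":      # rule (iv): ||α|β|| = ||α||∪||β||
        return lang(re[1], n) | lang(re[2], n)
    if op == "star":       # rule (v): iterate concatenation to a fixed point
        result = {""}
        while True:
            bigger = result | {u + v for u in result for v in lang(re[1], n)
                               if len(u + v) <= n}
            if bigger == result:
                return result
            result = bigger

# (a・(a|b)*): every string over {a, b} that starts with a.
re_ab = ("cat", ("sym", "a"),
         ("star", ("union", ("sym", "a"), ("sym", "b"))))
print(sorted(lang(re_ab, 2)))   # ['a', 'aa', 'ab']
```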
What is the regular expression matching problem?

Regular expression matching problem:
– The problem of finding, in a given text, all occurrences of strings in L(α)=||α||, the language defined by a given regular expression α.

The ability of regular expressions to define languages equals that of finite automata!
– We can construct a finite automaton that accepts exactly the language expressed by a regular expression.
– Conversely, we can write a regular expression that expresses the language accepted by a finite automaton.
※ See "Automaton and computability" (2.5 Regular expressions and regular sets), by Setsuo Arikawa and Satoru Miyano.

To match a regular expression, then, we build an automaton (NFA/DFA) corresponding to the regular expression and simulate it.
– A regular expression is easier to convert to an NFA than to a DFA.
– The initial state of the automaton is always kept active.
– The pattern expressed by the regular expression occurs whenever the automaton reaches a final state while reading the text.
Flow of the pattern matching process

General flow:
Regular expression → (parsing) → parse tree → (constructing an NFA by the Thompson method or the Glushkov method) → NFA (possibly converted further into a DFA) → (scan the text) → report the occurrences

Flow with a filtering technique:
Regular expression → (extracting strings) → a set of strings → (multiple-pattern matching) → find the candidates → (verify) → report the occurrences
Construction of the parse tree

Parse tree: a tree structure built in preparation for making the NFA.
– Each leaf node is labeled by a symbol a∈Σ or the empty word ε.
– Each internal node is labeled by an operator symbol from {|, ・, *}.
– Although a parser tool like Lex or Flex can parse regular expressions, that is overkill here; the pseudocode on the next slide is enough.

Example: the parse tree TRE for RE=(AT|GA)((AG|AAA)*)
[Figure: the parse tree, with a ・ node at the root; its left subtree is the | of A・T and G・A, and its right subtree is the * of the | of A・G and A・A・A. The | operators are annotated with the depth of parentheses (1 and 2) at which they occur.]
Pseudo code

Parse (p=p1p2…pm, last)
 1  v ← θ;
 2  while plast ≠ $ do
 3      if plast∈Σ or plast=ε then           /* normal character */
 4          vr ← create a node with plast;
 5          if v≠θ then v ← [・](v, vr);
 6          else v ← vr;
 7          last ← last + 1;
 8      else if plast = '|' then             /* union operator */
 9          (vr, last) ← Parse(p, last + 1);
10          v ← [|](v, vr);
11      else if plast = '*' then             /* star operator */
12          v ← [*](v);
13          last ← last + 1;
14      else if plast = '(' then             /* open parenthesis */
15          (vr, last) ← Parse(p, last + 1);
16          last ← last + 1;
17          if v≠θ then v ← [・](v, vr);
18          else v ← vr;
19      else if plast = ')' then             /* close parenthesis */
20          return (v, last);
21      end of if
22  end of while
23  return (v, last);
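The pseudocode above translates almost line by line into Python. A hedged sketch (0-indexed, tuple-encoded nodes, my own names; as in the pseudocode, '*' applies to everything parsed so far, so inner groups must be parenthesized as in the slides' examples, and the input is assumed well-formed and '$'-terminated):

```python
def parse(p, last):
    """Port of Parse(p, last): returns (tree, position). Concatenation is
    implicit between adjacent items; nodes are ("sym", a), ("cat", l, r),
    ("union", l, r), ("star", v)."""
    v = None                               # θ in the pseudocode
    while p[last] != "$":
        c = p[last]
        if c not in "|*()":                # normal character
            vr = ("sym", c)
            v = vr if v is None else ("cat", v, vr)
            last += 1
        elif c == "|":                     # union operator
            vr, last = parse(p, last + 1)
            v = ("union", v, vr)
        elif c == "*":                     # star operator
            v = ("star", v)
            last += 1
        elif c == "(":                     # open parenthesis
            vr, last = parse(p, last + 1)
            last += 1                      # skip the ')'
            v = vr if v is None else ("cat", v, vr)
        else:                              # close parenthesis
            return v, last
    return v, last

tree, _ = parse("(AT|GA)((AG|AAA)*)$", 0)
```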
NFA construction by Thompson method
K. Thompson. Regular expression search algorithm. Communications of the ACM, 11:419-422, 1968.
Idea:
– Traversing the parse tree TRE for a given RE in post-order, we construct an automaton Th(v) that accepts the language L(REv) corresponding to the subtree rooted at node v.
– The key point is that Th(v) can be obtained by connecting, with ε-transitions, the automata corresponding to the subtrees rooted at the children of v.

Properties of the Thompson NFA:
– The number of states is < 2m and the number of transitions is < 4m → O(m).
– It contains many ε-transitions.
– Every non-ε transition goes from some state i to state i+1.
Example: the Thompson NFA for RE = (AT|GA)((AG|AAA)*)
[Figure: the NFA with states 0–17, using ε-transitions to branch between the A–T and G–A paths (states 1–5) and, after state 6, between the A–G and A–A–A paths of the starred part (states 7–16), ending at state 17.]
NFA construction algorithm

For the parse tree TRE, traversing the tree in post-order, the algorithm generates and connects an automaton for each node as follows:
– (i) When v is the empty word ε: a new initial state I with an ε-transition to a new final state F.
– (ii) When v is a character a: a new initial state I with an a-transition to a new final state F.
– (iii) When v is a concatenation ・ → (vL・vR): link the automata in series, connecting the final state FL of Th(vL) to the initial state IR of Th(vR) with an ε-transition.
– (iv) When v is a union | → (vL|vR): a new initial state I with ε-transitions to IL and IR, and ε-transitions from FL and FR to a new final state F.
– (v) When v is a repetition * → v*: a new initial state I and final state F, with ε-transitions I→F (skip), I→Th(v)'s initial state, Th(v)'s final state→F, and Th(v)'s final state back to its initial state (repeat).
Move of the NFA construction algorithm

Example: for RE=(AT|GA)((AG|AAA)*), the parse tree TRE is traversed in post-order (each leaf first, each operator node after its children), and the Thompson NFA of the previous example is assembled bottom-up in that order.
[Figure: the parse tree annotated with post-order visiting numbers 1–18, and the resulting Thompson NFA with states 0–17.]
Pseudo code

Thompson_recur (v)
 1  if v = [|](vL, vR) or v = [・](vL, vR) then
 2      Th(vL) ← Thompson_recur(vL);
 3      Th(vR) ← Thompson_recur(vR);
 4  else if v = [*](vC) then Th(vC) ← Thompson_recur(vC);
 5  /* the above is the recursive (post-order) traversal */
 6  if v = (ε) then return construction (i);
 7  if v = (α), α∈Σ then return construction (ii);
 8  if v = [・](vL, vR) then return construction (iii);
 9  if v = [|](vL, vR) then return construction (iv);
10  if v = [*](vC) then return construction (v);

Thompson (RE)
11  vRE ← Parse(RE$, 1);   /* construct the parse tree */
12  Th(vRE) ← Thompson_recur(vRE);
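The constructions and a direct NFA simulation can be sketched as follows (my own tuple encoding and names; note this sketch also wraps concatenation with an extra ε-linked state pair instead of fusing states, so the state count differs slightly from the < 2m bound, but the accepted language is the same):

```python
from itertools import count

def thompson(node, trans, fresh):
    """Return (initial, final) states for the subtree; transitions are
    appended to trans as (state, label, state); label "" means ε."""
    op = node[0]
    i, f = next(fresh), next(fresh)
    if op == "eps":
        trans.append((i, "", f))                      # construction (i)
    elif op == "sym":
        trans.append((i, node[1], f))                 # construction (ii)
    elif op == "cat":                                 # construction (iii)
        li, lf = thompson(node[1], trans, fresh)
        ri, rf = thompson(node[2], trans, fresh)
        trans += [(i, "", li), (lf, "", ri), (rf, "", f)]
    elif op == "union":                               # construction (iv)
        li, lf = thompson(node[1], trans, fresh)
        ri, rf = thompson(node[2], trans, fresh)
        trans += [(i, "", li), (i, "", ri), (lf, "", f), (rf, "", f)]
    elif op == "star":                                # construction (v)
        ci, cf = thompson(node[1], trans, fresh)
        trans += [(i, "", ci), (i, "", f), (cf, "", ci), (cf, "", f)]
    return i, f

def eps_closure(states, trans):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for a, lbl, b in trans:
            if a == s and lbl == "" and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def accepts(re, word):
    """Direct simulation: keep the set of active states, O(m) per symbol."""
    trans = []
    i, f = thompson(re, trans, count())
    cur = eps_closure({i}, trans)
    for ch in word:
        cur = eps_closure({b for a, lbl, b in trans
                           if a in cur and lbl == ch}, trans)
    return f in cur

sym = lambda c: ("sym", c)
cat = lambda a, b: ("cat", a, b)
# RE = (AT|GA)((AG|AAA)*)
re = cat(("union", cat(sym("A"), sym("T")), cat(sym("G"), sym("A"))),
         ("star", ("union", cat(sym("A"), sym("G")),
                   cat(cat(sym("A"), sym("A")), sym("A")))))
print(accepts(re, "GAAG"), accepts(re, "AG"))   # True False
```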
NFA construction by Glushkov method
V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961.
Idea:
– Make a new expression RE' by numbering each symbol a∈Σ sequentially from the beginning to the end. (Let Σ' be the alphabet with subscripts.)
  Example: RE = (AT|GA)((AG|AAA)*) → RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
– After constructing an NFA that accepts the language L(RE'), we obtain the final NFA by removing the subscript numbers.

Properties of the Glushkov NFA:
– The number of states is exactly m+1, and the number of transitions is O(m2).
– It contains no ε-transitions.
– For any state, all transitions entering it carry the same label.
Example: an NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
[Figure: states 0–9; transitions 0→1 on A1, 1→2 on T2, 0→3 on G3, 3→4 on A4; from states 2 and 4 into the starred part on A5 and A7; then 5→6 on G6, 7→8 on A8, 8→9 on A9; and from states 6 and 9 back on A5 and A7.]

Example: the Glushkov NFA (subscripts removed)
[Figure: the same automaton with every subscripted label replaced by its plain symbol.]
NFA construction algorithm (1)

Construction procedure:
– Make a new expression RE' by numbering each symbol a∈Σ sequentially from the beginning to the end.
  Pos(RE') = {1…m}, and Σ' is the alphabet with subscript numbers.
– Traversing the parse tree TRE' in post-order, for each language REv' corresponding to the subtree rooted at v, compute the set First(REv'), the set Last(REv'), the function Emptyv, and the function Follow(RE', x) for each position x:
  – First(RE') = {x∈Pos(RE') | ∃u∈Σ'*, αxu∈L(RE')}   (positions entered from the initial state)
  – Last(RE') = {x∈Pos(RE') | ∃u∈Σ'*, uαx∈L(RE')}   (positions of the final states)
  – Follow(RE', x) = {y∈Pos(RE') | ∃u, v∈Σ'*, uαxαyv∈L(RE')}   (the transition function)
  – EmptyRE: the function that returns {ε} if ε belongs to L(RE), and φ otherwise (i.e., whether the initial state of the NFA is also a final state). It can be computed recursively:
      Emptyε = {ε},
      Emptyα∈Σ = φ,
      EmptyRE1|RE2 = EmptyRE1 ∪ EmptyRE2,
      EmptyRE1・RE2 = EmptyRE1 ∩ EmptyRE2,
      EmptyRE* = {ε}.
– The NFA is then constructed from the values obtained above.
NFA construction algorithm (2)

The Glushkov NFA GL' = (S, Σ', I, F, δ') that accepts the language L(RE'):
– S: the set of states, S = {0, 1, …, m}
– Σ': the alphabet with subscript numbers
– I: the initial state, I = 0
– F: the final states, F = Last(RE') ∪ (EmptyRE・{0})
– δ': the transition function, defined by
    ∀x∈Pos(RE'), ∀y∈Follow(RE', x): δ'(x, αy) = y
  The transitions from the initial state are:
    ∀y∈First(RE'): δ'(0, αy) = y

Example: the NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
[Figure: the same Glushkov NFA as on the previous slide, states 0–9.]
Pseudo code

Glushkov_variables (v, lpos)
 1  if v = [|](vl, vr) or v = [・](vl, vr) then
 2      lpos ← Glushkov_variables(vl, lpos);
 3      lpos ← Glushkov_variables(vr, lpos);
 4  else if v = [*](v*) then lpos ← Glushkov_variables(v*, lpos);
 5  end of if
 6  if v = (ε) then
 7      First(v) ← φ, Last(v) ← φ, Emptyv ← {ε};
 8  else if v = (a), a∈Σ then
 9      lpos ← lpos + 1;
10      First(v) ← {lpos}, Last(v) ← {lpos}, Emptyv ← φ, Follow(lpos) ← φ;
11  else if v = [|](vl, vr) then
12      First(v) ← First(vl)∪First(vr);
13      Last(v) ← Last(vl)∪Last(vr);
14      Emptyv ← Emptyvl∪Emptyvr;
15  else if v = [・](vl, vr) then
16      First(v) ← First(vl)∪(Emptyvl・First(vr));
17      Last(v) ← (Emptyvr・Last(vl))∪Last(vr);
18      Emptyv ← Emptyvl∩Emptyvr;
19      for x∈Last(vl) do Follow(x) ← Follow(x)∪First(vr);   /* O(m3) time totally */
20  else if v = [*](v*) then
21      First(v) ← First(v*), Last(v) ← Last(v*), Emptyv ← {ε};
22      for x∈Last(v*) do Follow(x) ← Follow(x)∪First(v*);   /* takes O(m2) time */
23  end of if
24  return lpos;
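The computation of First, Last, Empty, and Follow can be sketched in Python (my own tuple encoding and names; Empty is represented as a boolean rather than {ε}/φ):

```python
def glushkov_vars(node, counter, follow, syms):
    """Compute (First, Last, Empty) bottom-up. Positions are numbered left
    to right as in RE'; follow maps each position to a set of positions."""
    op = node[0]
    if op == "eps":
        return set(), set(), True
    if op == "sym":
        counter[0] += 1
        x = counter[0]                    # this leaf's position in RE'
        syms[x] = node[1]
        follow[x] = set()
        return {x}, {x}, False
    if op == "union":
        f1, l1, e1 = glushkov_vars(node[1], counter, follow, syms)
        f2, l2, e2 = glushkov_vars(node[2], counter, follow, syms)
        return f1 | f2, l1 | l2, e1 or e2
    if op == "cat":
        f1, l1, e1 = glushkov_vars(node[1], counter, follow, syms)
        f2, l2, e2 = glushkov_vars(node[2], counter, follow, syms)
        for x in l1:                      # Follow(x) ← Follow(x) ∪ First(vr)
            follow[x] |= f2
        return f1 | (f2 if e1 else set()), (l1 if e2 else set()) | l2, e1 and e2
    if op == "star":
        f1, l1, _ = glushkov_vars(node[1], counter, follow, syms)
        for x in l1:                      # loop the last positions back
            follow[x] |= f1
        return f1, l1, True

sym = lambda c: ("sym", c)
cat = lambda a, b: ("cat", a, b)
# RE = (AT|GA)((AG|AAA)*) → RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
re = cat(("union", cat(sym("A"), sym("T")), cat(sym("G"), sym("A"))),
         ("star", ("union", cat(sym("A"), sym("G")),
                   cat(cat(sym("A"), sym("A")), sym("A")))))
counter, follow, syms = [0], {}, {}
first, last, empty = glushkov_vars(re, counter, follow, syms)
print(sorted(first), sorted(last))   # [1, 3] [2, 4, 6, 9]
```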
Pseudo code (cont.)

Glushkov (RE)
 1  /* make the parse tree by parsing the regular expression */
 2  vRE ← Parse(RE$, 1);
 3
 4  /* calculate each variable by using the parse tree */
 5  m ← Glushkov_variables(vRE, 0);
 6
 7  /* construct the NFA GL = (S, Σ, I, F, Δ) from the variables */
 8  Δ ← φ;
 9  for i∈0…m do create state i;
10  for x∈First(vRE) do Δ ← Δ∪{(0, αx, x)};
11  for i∈0…m do
12      for x∈Follow(i) do Δ ← Δ∪{(i, αx, x)};
13  end of for
14  for x∈Last(vRE)∪(EmptyvRE・{0}) do mark x as terminal;
Take a breath
Taiwan High-speed Railway@Taipei 2011.11.8
Flow of the pattern matching process

Regular expression → (parsing) → parse tree → (constructing an NFA by the Thompson method or the Glushkov method) → NFA → (scan the text) → report the occurrences
– An NFA can be simulated in O(mn) time.
– To translate an NFA into a DFA, we need O(2m) time and space.
– There also exists a method of converting a regular expression directly into a DFA.
※ See section 3.9 of "Compilers: Principles, Techniques, and Tools," by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.
Methods of simulating NFAs

Simulating the Thompson NFA directly
– The most naïve method.
– Storing the currently active states in a list of size O(m), it updates the states of the NFA in O(m) time for each symbol read from the text.
– It therefore takes O(mn) time overall.

Simulating the Thompson NFA by converting it into an equivalent DFA
– A classical technique; see "Compilers: Principles, Techniques, and Tools," by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.
– The conversion is done as preprocessing → it takes O(2m) time and space.
– There are also techniques that convert lazily (on the fly) while scanning the text.

Hybrid method
– E. W. Myers. A four-Russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2):430-448, 1992.
– A method that combines NFA and DFA for efficient matching.
– It divides the Thompson NFA into modules of O(k) nodes each, converts each module into a DFA, and simulates the transitions between modules as an NFA.

High-speed NFA simulation by bit-parallel techniques
– Simulating the Thompson NFA: proposed by S. Wu and U. Manber [1992].
– Simulating the Glushkov NFA: proposed by G. Navarro and M. Raffinot [1999].
Simulating by converting into an equivalent DFA
Example: a DFA converted from the Glushkov NFA for RE = (AT|GA)((AG|AAA)*)
[Figure: the subset-construction DFA whose states are sets of active NFA states (e.g. 0, 01, 02, 03, 04, 018, 019, 036, 0157, 0189, 01457, 01578, 01579), with transitions on A, C, G, T.]
DFA_Classical (N = (Q, Σ, I, F, Δ), T = t1t2…tn)
 1  Preprocessing:
 2      for σ∈Σ do Δ ← Δ∪{(I, σ, I)};   /* self-loop at the initial state */
 3      (Qd, Σ, Id, Fd, δ) ← BuildDFA(N);   /* make a DFA equivalent to the NFA N */
 4  Searching:
 5      s ← Id;
 6      for pos∈1…n do
 7          if s∈Fd then report an occurrence ending at pos − 1;
 8          s ← δ(s, tpos);
 9      end of for
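BuildDFA and the scan can be sketched with a subset construction over frozensets (my own sketch, on a hand-coded Glushkov NFA for the smaller RE (AT|GA); unlike the pseudocode above, it reports an occurrence right after reading a symbol, i.e., ending at pos rather than at pos−1 before the read):

```python
from collections import deque

def build_dfa(delta, start, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    trans, dfinals = {}, set()
    queue = deque([start_set])
    while queue:
        S = queue.popleft()
        if S in trans:
            continue
        trans[S] = {}
        if S & finals:
            dfinals.add(S)
        for ch in alphabet:
            T = frozenset(t for s in S for t in delta.get((s, ch), ()))
            trans[S][ch] = T
            queue.append(T)
    return trans, start_set, dfinals

def search(trans, start, dfinals, text):
    """Classical DFA scan; returns the end positions of occurrences."""
    occ, s = [], start
    for pos, ch in enumerate(text, 1):
        s = trans[s][ch]
        if s in dfinals:
            occ.append(pos)
    return occ

# Glushkov NFA for (AT|GA): positions A1 T2 G3 A4, finals {2, 4};
# the self-loop at state 0 on every symbol makes it find all occurrences.
delta = {(0, "A"): {0, 1}, (0, "C"): {0}, (0, "G"): {0, 3}, (0, "T"): {0},
         (1, "T"): {2}, (3, "A"): {4}}
trans, start, dfinals = build_dfa(delta, 0, {2, 4}, "ACGT")
print(search(trans, start, dfinals, "CCATGA"))   # [4, 6]
```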
Bit-parallel Thompson (BPThompson)
S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
Simulating the Thompson NFA by the bit-parallel technique
– For Thompson NFAs, note that every non-ε transition goes from state i to state i+1.
  → Bit-parallelism similar to the Shift-And method is applicable.
– The ε-transitions are simulated separately.

This needs a mask table of size 2L (where L is the number of states of the NFA).
– It takes O(2L + m|Σ|) time for preprocessing.
– It scans the text in O(n) time when L is small enough.

For the NFA N = (Q={s0,…,s|Q|-1}, Σ, I = s0, F, Δ):
– Bit-mask representation: Qn = {0,…,|Q|−1}, In = 0|Q|-11, Fn = |sj∈F 0|Q|-1-j10j
– Definitions of the mask tables (E(i) is the ε-closure of state si):
  – Bn[i, σ] = |(si,σ,sj)∈Δ 0|Q|-1-j10j
  – En[i] = |sj∈E(i) 0|Q|-1-j10j
  – Ed[D] = |{i : i=0 or D & 0L-i-110i ≠ 0L} En[i]
  – B[σ] = |i∈0…m Bn[i, σ]
Pseudo code

BuildEps (N = (Qn, Σ, In, Fn, Bn, En))
 1  for σ∈Σ do
 2      B[σ] ← 0L;
 3      for i∈0…L−1 do B[σ] ← B[σ] | Bn[i, σ];
 4  end of for
 5  Ed[0] ← En[0];
 6  for i∈0…L−1 do
 7      for j∈0…2i − 1 do
 8          Ed[2i + j] ← En[i] | Ed[j];
 9      end of for
10  end of for
11  return (B, Ed);

BPThompson (N = (Qn, Σ, In, Fn, Bn, En), T = t1t2…tn)
 1  Preprocessing:
 2      (B, Ed) ← BuildEps(N);
 3  Searching:
 4      D ← Ed[In];   /* initial state */
 5      for pos∈1…n do
 6          if D & Fn ≠ 0L then report an occurrence ending at pos−1;
 7          D ← Ed[(D << 1) & B[tpos]];
 8      end of for
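The doubling loop in BuildEps (lines 6–10) is an instance of a general trick: a table T indexed by bit masks, with T[D] equal to the OR of base values over the set bits of D, can be filled in a single O(2^L) pass. A hedged generic sketch (the slide's version additionally seeds Ed[0] with En[0], which this generic table does not):

```python
def build_or_table(base):
    """T[D] = OR of base[i] over all bits i set in D, filled by the same
    doubling recurrence as BuildEps: T[2**i + j] = base[i] | T[j]."""
    L = len(base)
    T = [0] * (1 << L)
    for i in range(L):
        for j in range(1 << i):          # every mask j with bits below i
            T[(1 << i) + j] = base[i] | T[j]
    return T

base = [0b0011, 0b0100, 0b1000]          # e.g. En[i]: ε-closures as bit masks
T = build_or_table(base)
print(bin(T[0b101]))   # 0b1011  (= base[0] | base[2])
```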
Bit-parallel Glushkov (BPGlushkov)
G. Navarro and M. Raffinot. Fast regular expression search. In Proc. of WAE99, LNCS1668, 199-213, 1999.
Simulating the Glushkov NFA by the bit-parallel technique
– For Glushkov NFAs, note that, for any state, all transitions entering it carry the same label.
  → Although bit-parallelism in the style of the Shift-And method is not directly applicable, each state transition can be computed as Td[D] & B[σ].
– The mask table has 2|Q| entries (versus 2L for BPThompson).
– It takes O(2m + m|Σ|) time for preprocessing.
– It scans the text in O(n) time when m is small enough.
– It is more efficient than BPThompson in almost all cases.

For the NFA GL = (Q={s0,…,s|Q|-1}, Σ, I = s0, F, Δ):
– Bit-mask representation: Qn = {0,…,|Q|−1}, In = 0|Q|-11, Fn = |sj∈F 0|Q|-1-j10j
– Definitions of the mask tables:
  – Bn[i, σ] = |(si,σ,sj)∈Δ 0|Q|-1-j10j
  – B[σ] = |i∈0…m Bn[i, σ]
  – Td[D] = |{(i,σ) : D & 0m-i10i ≠ 0m+1, σ∈Σ} Bn[i, σ]
Pseudo code

BuildTran (N = (Qn, Σ, In, Fn, Bn))
 1  for i∈0…m do A[i] ← 0m+1;
 2  for σ∈Σ do B[σ] ← 0m+1;
 3  for i∈0…m, σ∈Σ do
 4      A[i] ← A[i] | Bn[i, σ];
 5      B[σ] ← B[σ] | Bn[i, σ];
 6  end of for
 7  Td[0] ← 0m+1;
 8  for i∈0…m do
 9      for j∈0…2i − 1 do
10          Td[2i + j] ← A[i] | Td[j];
11      end of for
12  end of for
13  return (B, Td);

BPGlushkov (N = (Qn, Σ, In, Fn, Bn), T = t1t2…tn)
 1  Preprocessing:
 2      for σ∈Σ do Bn[0, σ] ← Bn[0, σ] | 0m1;   /* self-loop at the initial state */
 3      (B, Td) ← BuildTran(N);
 4  Searching:
 5      D ← 0m1;   /* initial state */
 6      for pos∈1…n do
 7          if D & Fn ≠ 0m+1 then report an occurrence ending at pos−1;
 8          D ← Td[D] & B[tpos];
 9      end of for
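The search step D ← Td[D] & B[tpos] can be sketched on a tiny hand-built example (my own masks for the Glushkov NFA of (AT|GA), with the initial self-loop already folded into B; here Td is computed on the fly from the per-state target masks, whereas the slides tabulate all 2^|Q| entries as preprocessing):

```python
# Masks for the Glushkov NFA of (AT|GA): bit i = state i, 5 bits total.
# B[σ]: states whose incoming label is σ (bit 0 set everywhere = self-loop).
B = {"A": 0b10011, "C": 0b00001, "G": 0b01001, "T": 0b00101}
# A_out[i]: all states reachable from state i in one step, as a bit mask.
A_out = {0: 0b01011, 1: 0b00100, 2: 0b00000, 3: 0b10000, 4: 0b00000}
F = 0b10100                               # final positions 2 (AT) and 4 (GA)

def td(D):
    """Td[D] = OR of A_out[i] over the bits set in D."""
    res, i = 0, 0
    while D:
        if D & 1:
            res |= A_out[i]
        D >>= 1
        i += 1
    return res

def bp_glushkov(text):
    occ, D = [], 0b00001                  # start with only state 0 active
    for pos, ch in enumerate(text, 1):
        D = td(D) & B[ch]                 # one word-parallel transition step
        if D & F:
            occ.append(pos)
    return occ

print(bp_glushkov("CCATGA"))   # [4, 6]
```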
Other topics

Extended regular expressions:
– Regular expressions allowing two more operations, intersection and complementation, in addition to concatenation, union, and repetition.
  Example: ¬(UNIX)∧(UNI(.)* | (.)*NIX)
– These are different from POSIX regular expressions.
– H. Yamamoto. An Automata-based Recognition Algorithm for Semi-extended Regular Expressions. Proc. MFCS2000, LNCS1893, 699-708, 2000.
– O. Kupferman and S. Zuhovitzky. An Improved Algorithm for the Membership Problem for Extended Regular Expressions. Proc. MFCS2002, LNCS2420, 446-458, 2002.

Research on speeding up regular expression matching:
– Filtration technique using BNDM + verification:
  G. Navarro and M. Raffinot. New Techniques for Regular Expression Searching. Algorithmica, 41(2): 89-116, 2004.
  (This paper also presents a method of simulating the Glushkov NFA with mask tables of O(m2m) bits.)
The 5th summary

Regular expressions
– Their ability to define languages is the same as that of finite automata.

Flow of regular expression matching
– After translating the RE into a parse tree, the corresponding NFA is constructed; matching is done by simulating the NFA.
– Alternatively: filtration + multiple-pattern matching for candidates + verification by NFA simulation.

Methods for constructing an NFA
– Thompson NFA:
  – The number of states is < 2m and the number of transitions is < 4m → O(m).
  – It contains many ε-transitions.
  – Every non-ε transition goes from some state i to state i+1.
– Glushkov NFA:
  – The number of states is exactly m+1, and the number of transitions is O(m2).
  – It contains no ε-transitions.
  – For any state, all transitions entering it carry the same label.

Methods of simulating NFAs
– Simulating the Thompson NFA directly → O(mn) time.
– Converting into an equivalent DFA → scanning runs in O(n), but preprocessing takes O(2m) time and space.
– Speeding up by bit-parallel techniques: bit-parallel Thompson and bit-parallel Glushkov.

The next theme
– Pattern matching on compressed texts: an introduction to Kida's research (a trend of the 90's in this field!)
Appendix

About definitions of terms that I didn't explain in the first lecture:
– A subset of Σ* is called a formal language, or a language for short.
– For languages L1, L2⊆Σ*, the set { xy | x∈L1 and y∈L2 } is called the product of L1 and L2, and is denoted by L1・L2, or L1L2 for short.
– For a language L⊆Σ*, we define L0 = {ε} and Ln = Ln-1・L (n≧1). Moreover, we define L* = ∪n=0…∞ Ln and call it the closure of L. We also write L+ = ∪n=1…∞ Ln.

About look-behind notations:
– I said in the lecture that I couldn't find a precise description of look-behind notations, but I eventually found one:
  Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press / Elsevier, 1990.
  (Japanese translation: コンピュータ基礎理論ハンドブックⅠ:アルゴリズムと複雑さ, 丸善, 1994.)
  – Chapter 5, sections 2.3 and 6.1.
– According to this, the notion of look-behind appeared in 1964.
– It exceeds the power of context-free grammars!
– Its matching problem is proved to be NP-complete.