Regular Expressions (RE)
• Used for specifying text search strings.
• Standardized and widely used (UNIX tools such as vi, perl and grep;
Microsoft Word and other text editors…)
• A RE is a notation for characterizing a set of strings.
Formally a language is defined as a (possibly infinite) set
of strings of a given alphabet.
• A regular expression search consists of a search pattern
and a text to search through.
Basic RE Patterns
• E.g. /woodchuck/
• Case sensitive: /Woodchuck/ is not the same as /woodchuck/
• Disjunction: /[Ww]oodchuck/ : Woodchuck or woodchuck
• Ranges
– /[A-Z]/ : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
– /[0-9]/ : [0123456789]
• Negation
– [^a] : anything that is not an “a”
– [^A-Z] : anything that is not an uppercase letter
– But: [a^b] : one of “a”, “^” or “b” (the caret negates only when it
comes first inside the brackets)
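The bracket patterns above can be tried out with a short sketch. It assumes Python's built-in re module, whose syntax for these patterns agrees with the notation on the slides; the sample strings are made up for illustration.

```python
import re

# Matching is case sensitive: /woodchuck/ does not match "Woodchuck".
case_sensitive = re.search(r"woodchuck", "Woodchuck")

# Bracket disjunction: either case of the first letter.
either_case = re.findall(r"[Ww]oodchuck", "Woodchuck, woodchuck")

# Ranges inside brackets.
capitals = re.findall(r"[A-Z]", "Drive the Car")
digits = re.findall(r"[0-9]", "Chapter 12")

# A leading caret negates the class ...
not_upper = re.findall(r"[^A-Z]", "AbC")
# ... but a caret anywhere else is an ordinary member of the class.
caret_literal = re.findall(r"[a^b]", "a^b")

print(case_sensitive, either_case, capitals, digits, not_upper, caret_literal)
```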
Basic RE Patterns
• Optional characters
– /woodchucks?/ : woodchuck or woodchucks
• Zero or more instances (Kleene star)
– /baa*!/ : ba! or baa! or baaa! or baaaa! …
– /c[ab]*c/ : cabababc or caaaac or cc …
– Note: /a*/ matches in any string, because zero occurrences (the
empty string) also count as a match.
• One or more instances
– /ba+!/ : ba! or baa! or baaa! or baaaa! …
– /[0-9]+/: A string of digits.
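A minimal sketch of the three counters (?, * and +), again assuming Python's re module. fullmatch is used so that the whole test string has to fit the pattern; the test strings are illustrative.

```python
import re

# ? makes the preceding character optional.
optional = [bool(re.fullmatch(r"woodchucks?", w))
            for w in ["woodchuck", "woodchucks", "woodchuckss"]]

# * allows zero or more repetitions of the preceding item.
star = [bool(re.fullmatch(r"baa*!", w)) for w in ["ba!", "baa!", "baaaa!", "b!"]]
star_class = [bool(re.fullmatch(r"c[ab]*c", w)) for w in ["cabababc", "caaaac", "cc"]]

# /a*/ finds a match even where there is no "a": the empty string at position 0.
empty_match = re.search(r"a*", "xyz")

# + requires at least one repetition.
plus = [bool(re.fullmatch(r"ba+!", w)) for w in ["ba!", "baaa!", "b!"]]
numbers = re.findall(r"[0-9]+", "the 35 boxes cost 120")

print(optional, star, star_class, empty_match.span(), plus, numbers)
```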
Basic RE Patterns
• Wildcards: /./ matches any character
– /beg.n/ : begin, begun, beg_n…
• Anchors:
– Pattern at beginning of string: /^the car/ matches “the
car I drive” but not “I drive the car”
– Pattern at end of string: /the car$/ matches “I drive the
car” but not “the car I drive”
– \b matches a word boundary: /\bthe\b/ matches “the”
but not “other”
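The wildcard and the anchors can be checked the same way. This is a sketch assuming Python's re module, where by default the period matches any character except a newline.

```python
import re

# The period stands for any single character.
wildcard = re.findall(r"beg.n", "begin begun beg_n")

# ^ anchors the pattern at the beginning of the string.
start_hit = re.search(r"^the car", "the car I drive")
start_miss = re.search(r"^the car", "I drive the car")

# $ anchors the pattern at the end of the string.
end_hit = re.search(r"the car$", "I drive the car")
end_miss = re.search(r"the car$", "the car I drive")

# \b matches at a word boundary, so "the" inside "other" is not found.
word_hit = re.search(r"\bthe\b", "in the car")
word_miss = re.search(r"\bthe\b", "other")

print(wildcard, bool(start_hit), bool(start_miss), bool(end_hit),
      bool(end_miss), bool(word_hit), bool(word_miss))
```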
Basic RE Patterns
• Parentheses: /(abc)+/ matches abc, abcabc, abcabcabc ...
• Disjunction: /cit(y|ies)/ matches city or cities
• Repetitions: /(abc){3}/ matches abcabcabc
• Backslash: Used for escaping special characters.
– \*, \+, \., \? ...
• Aliases
– \n: newline, \t: tab, \d: [0-9], \w: [a-zA-Z0-9_]
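A short sketch of grouping, disjunction, counted repetition, escaping and the aliases, assuming Python's re module. Two details are specific to Python and not part of the slides: (?:...) groups without capturing, so that findall returns whole matches, and the re.ASCII flag restricts \w to [a-zA-Z0-9_].

```python
import re

# Parentheses let + apply to the whole group.
group_plus = [bool(re.fullmatch(r"(abc)+", w)) for w in ["abc", "abcabc", "abca"]]

# Disjunction inside a (non-capturing) group.
cities = re.findall(r"cit(?:y|ies)", "one city, two cities")

# {3} means exactly three repetitions.
exactly_three = [bool(re.fullmatch(r"(abc){3}", w)) for w in ["abcabcabc", "abcabc"]]

# A backslash turns a special character into a literal one.
question_marks = re.findall(r"\?", "Why? How?")

# Aliases: \d for digits, \w for word characters (the underscore is included).
room = re.findall(r"\d+", "room 101")
words = re.findall(r"\w+", "snake_case ok", flags=re.ASCII)

print(group_plus, cities, exactly_three, question_marks, room, words)
```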
RE Substitution
• s/regexp/replacement/ E.g. s/colour/color/
• Back references: \1, \2, \3 …
– s/([0-9]+)/<\1>/ : the 35 boxes -> the <35> boxes
– s/^\s*(\w+)\W+(\w+)/\2 \1/ : reverses the first two
words of a sentence.
– Also used in search REs
• /A ([a-z]+) is a \1/ : matches “A car is a car”.
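The substitutions above can be sketched with re.sub, which plays the role of s/…/…/ in Python (an assumption about tooling; the slides use Perl-style notation). The input sentences other than “the 35 boxes” are invented for illustration.

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "the colour red")

# s/([0-9]+)/<\1>/ : \1 refers to what the first parenthesized group matched.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# s/^\s*(\w+)\W+(\w+)/\2 \1/ : swap the first two words.
swapped = re.sub(r"^\s*(\w+)\W+(\w+)", r"\2 \1", "hello world again")

# A back-reference inside the search pattern: the group must repeat verbatim.
same_noun = re.search(r"A ([a-z]+) is a \1", "A car is a car")
other_noun = re.search(r"A ([a-z]+) is a \1", "A car is a cat")

print(american, bracketed, swapped, bool(same_noun), bool(other_noun))
```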
ELIZA
• Simulated the responses of a psychologist based on simple
pattern substitution.
• First it cascades through a set of RE substitutions that switch the
speaker's point of view, for example s/I’m/YOU ARE/, s/my/YOUR/ ...
• Then it runs the input through RE substitutions looking for
relevant patterns and produces the appropriate output. e.g.
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR THAT YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1\?/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
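The two-stage cascade can be sketched as follows. This is a toy illustration, not the original ELIZA program: the rule lists contain only patterns from the slide, while the space padding, the fallback reply “PLEASE GO ON” and the function name respond are assumptions added here.

```python
import re

# Stage 1: point-of-view substitutions (from the slide).
# The character class accepts both a typographic and a plain apostrophe.
VIEWPOINT_RULES = [
    (r"\bI[’']m\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
]

# Stage 2: response patterns (from the slide), tried in order.
RESPONSE_RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR THAT YOU ARE \1"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    # Assumption: pad with spaces so patterns that expect a space on
    # either side of a keyword can also match at the edges of the input.
    text = " " + utterance + " "
    for pattern, replacement in VIEWPOINT_RULES:
        text = re.sub(pattern, replacement, text)
    for pattern, template in RESPONSE_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template)  # fill \1 from the matched group
    return "PLEASE GO ON"  # hypothetical fallback, not from the slide

print(respond("I'm sad today"))
print(respond("they always ignore me"))
```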
Finite State Automata (FSA)
• REs (that don’t use back-references) can be implemented
as finite-state automata.
• Conversely, any FSA can be described by a regular expression.
• REs and FSAs describe the same class of languages, called
Regular Languages (RL).
Finite State Automata
• A FSA is represented as a graph with a finite set of nodes
(called states) and directed arcs between pairs of states
(called transitions), labeled with symbols from the alphabet.
• One state is a start state, represented by an incoming arrow.
• Some states are final or accepting states represented by a
double circle.
FSA Example
Sheeptalk: baa! baaa! baaaa! baaaaa! …
Equivalent to RE: /baaa*!/
FSA Recognition
Examples:
baaa! Succeeds
aba!b Fails
FSA State Transition Table
• Alternative representation for FSA
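Since the table itself did not survive in this transcript, here is a sketch of what a state transition table for the sheeptalk FSA /baaa*!/ could look like, together with a table-driven recognizer. The numbering of the states (0 to 4, with 4 as the only final state) is an assumption.

```python
# Rows are states, columns are input symbols; a missing entry means
# "no transition", i.e. the string is rejected at that point.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINAL = {4}

def recognize(tape):
    """Return True if the deterministic FSA accepts the whole tape."""
    state = START
    for symbol in tape:
        next_state = TRANSITIONS[state].get(symbol)
        if next_state is None:
            return False  # no arc for this symbol: fail
        state = next_state
    return state in FINAL  # must end in an accepting state

print(recognize("baaa!"), recognize("aba!b"))
```

Because every table cell holds at most one state, the recognizer never has to choose between alternatives.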
FSA Example
Formal FSA Definition
• Q: a finite set of states (q0, q1, q2, …)
• Σ: a finite input alphabet of symbols
• q0: the start state
• F: the set of final states (a subset of Q)
• δ(q,i): the transition function from states and inputs to
states. Given a state q and an input i, it returns a new state
q’.
Deterministic FSA (DFSA). The recognition of a string has
no choice points.
Non Deterministic FSA (NFSA)
• When in state q2 with input a, the FSA has the choice to
move to state q3 or remain in state q2.
Empty Arcs
From state q3 the FSA can move to state q2, without looking at
the input (without advancing the tape).
NFSA Transition Tables
An extra ε column is added.
The transitions are now sets of states (instead of single states)
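A sketch of such a table for a non-deterministic sheeptalk automaton. It combines, as an assumption, both kinds of non-determinism mentioned on the preceding slides (two arcs for “a” out of q2, and an empty arc from q3 to q2); the key EPS and the helper epsilon_closure are names invented for this example.

```python
EPS = ""  # key of the extra ε column

# Each cell is now a set of states instead of a single state.
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},          # choice: stay in 2 or move to 3
    3: {"!": {4}, EPS: {2}},   # empty arc back to 2
    4: {},
}

def epsilon_closure(states):
    """All states reachable from `states` using only ε arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in NFSA[state].get(EPS, set()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

print(NFSA[2]["a"], epsilon_closure({3}))
```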
Accepting Strings with NFSA
• Since there is a choice of which arc to follow it is possible
to take the wrong path and reject a string that should be
accepted.
• All possible paths should be followed and if even one
reaches a final state then the string is accepted.
• Computational approaches
– Backup: At each choice point we store the current search-state (the
state of the FSA and the position on the tape); when we reach a dead
end we back up to that search-state and try another path from there.
– Lookahead: We look ahead in the input to decide which path to
take.
– Parallelism: Alternative paths are explored in parallel.
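The backup strategy can be sketched with an agenda of stored search-states. The automaton is an assumed non-deterministic sheeptalk FSA (states 0 to 4, ε written as the empty string); the names nd_recognize and agenda are illustrative.

```python
EPS = ""
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}, EPS: {2}},
    4: {},
}
START = 0
FINAL = {4}

def nd_recognize(tape):
    """Accept if at least one path through the NFSA consumes the tape
    and ends in a final state."""
    agenda = [(START, 0)]  # search-states: (FSA state, tape position)
    seen = set()           # avoid exploring the same search-state twice
    while agenda:
        state, pos = agenda.pop()  # back up to the latest stored alternative
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(tape) and state in FINAL:
            return True
        # ε arcs change the state without advancing the tape.
        for target in NFSA[state].get(EPS, ()):
            agenda.append((target, pos))
        # Ordinary arcs consume one symbol.
        if pos < len(tape):
            for target in NFSA[state].get(tape[pos], ()):
                agenda.append((target, pos + 1))
    return False  # every path hit a dead end

print(nd_recognize("baaa!"), nd_recognize("aba!b"))
```

A dead end simply adds nothing to the agenda, so the next pop returns to an earlier choice point.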
NFSA Recognition as Search
• The NFSA recognition can be seen as a search through a
space of search-states. This consists of all the possible
pairings of FSA-states and tape positions.
• The order that these search-states are visited (i.e. the
decision about which possible path to follow) is important
for performance.
• Depth-first or breadth-first search.
• For larger search spaces it may be necessary to use more
complex search techniques (e.g. dynamic programming or
A*).
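The effect of the visiting order can be seen in a small sketch where the only difference between depth-first and breadth-first search is the end of the agenda from which the next search-state is taken. The automaton (an assumed non-deterministic sheeptalk FSA) and all names are illustrative.

```python
from collections import deque

EPS = ""
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}, EPS: {2}},
    4: {},
}
START = 0
FINAL = {4}

def search(tape, breadth_first=False):
    """Return (accepted, visited), where visited lists the search-states
    (FSA state, tape position) in the order they were explored."""
    agenda = deque([(START, 0)])
    visited = []
    while agenda:
        # Queue behaviour gives breadth-first, stack behaviour depth-first.
        state, pos = agenda.popleft() if breadth_first else agenda.pop()
        if (state, pos) in visited:
            continue
        visited.append((state, pos))
        if pos == len(tape) and state in FINAL:
            return True, visited
        successors = [(t, pos) for t in sorted(NFSA[state].get(EPS, ()))]
        if pos < len(tape):
            successors += [(t, pos + 1)
                           for t in sorted(NFSA[state].get(tape[pos], ()))]
        agenda.extend(successors)
    return False, visited

dfs_ok, dfs_order = search("baaa!")
bfs_ok, bfs_order = search("baaa!", breadth_first=True)
print(dfs_ok, len(dfs_order), bfs_ok, len(bfs_order))
```

Both strategies reach the same verdict; they differ only in how many search-states they examine before reaching it.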
Relating DFSA and NFSA
• For every NFSA there exists an equivalent DFSA (i.e.
that accepts exactly the same set of strings).
• The proof converts the NFSA into an equivalent DFSA in which each
state stands for a set of NFSA states. The resulting DFSA may have
many more states than the original NFSA (up to 2^N states
for a NFSA with N states).
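A sketch of the conversion, usually called the subset construction, applied to an assumed non-deterministic sheeptalk FSA. Each DFSA state is a frozenset of NFSA states; only sets reachable from the start state are built, which is why the result is normally far smaller than the 2^N worst case. All names are illustrative.

```python
from collections import deque

EPS = ""
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}, EPS: {2}},
    4: {},
}
START = 0
FINAL = {4}

def epsilon_closure(states):
    """All states reachable from `states` using only ε arcs."""
    closure, stack = set(states), list(states)
    while stack:
        for target in NFSA[stack.pop()].get(EPS, set()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

def to_dfsa(alphabet):
    """Build a deterministic table whose states are sets of NFSA states."""
    start = frozenset(epsilon_closure({START}))
    table = {}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        table[current] = {}
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= NFSA[state].get(symbol, set())
            target = frozenset(epsilon_closure(moved))
            if target:  # the empty set (dead state) is left implicit
                table[current][symbol] = target
                queue.append(target)
    finals = {state for state in table if state & FINAL}
    return start, table, finals

def dfsa_accepts(tape, start, table, finals):
    state = start
    for symbol in tape:
        state = table[state].get(symbol)
        if state is None:
            return False
    return state in finals

start, table, finals = to_dfsa("ab!")
print(len(table), dfsa_accepts("baaa!", start, table, finals))
```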
Morphological Parsing and Recognition
• Morphological recognition: accepts or rejects word forms:
– Accept: geese
– Reject: gooses
• Morphological parsing: produces a morphological
analysis (stem followed by morphological features)
– geese: goose + N + PL
– cats: cat + N + PL
– ground: ground +N +SG, grind +V +PPart
Morphological Parsing
• A morphological parser is composed of
– lexicon: the list of stems and affixes in a language,
together with basic information about them.
– morphotactics: model of morpheme ordering, that
defines which morpheme classes may follow other
classes.
– orthographic rules: spelling rules used to model
changes that occur in the language (e.g. city+s -> cities)
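The spelling rule in the last bullet can be sketched as a regular-expression substitution. This is a deliberately simplified assumption covering only the city+s case: a “y” that follows a consonant becomes “ie” before the plural “s”. The function name attach_plural and the use of “+” as a morpheme-boundary marker are illustrative.

```python
import re

def attach_plural(stem):
    """Join a noun stem and the plural suffix, applying one spelling rule."""
    lexical = stem + "+s"  # "+" marks the morpheme boundary
    # y-replacement: consonant + y + boundary + s  ->  consonant + ies
    surface = re.sub(r"([^aeiou])y\+s$", r"\1ies", lexical)
    # No rule applied: just remove the boundary marker.
    return surface.replace("+", "")

print(attach_plural("city"), attach_plural("cat"), attach_plural("boy"))
```

Other English spelling changes would need further rules of the same shape.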
Lexicon
• A repository of words:
a, AAA, AA, Aachen, aardvark, aardwolf...
• Not practical to list every word in the language, and impossible for
some languages (e.g. Finnish, Turkish...). Usually only the stems and
the affixes are listed.
• Ideally every possible word (or stem) should be in the lexicon,
including abbreviations and proper names.
• Often along with stems in the lexicon we keep information about stem
classes.
– e.g. dog: reg-noun, goose: irreg-sg-noun,
– geese: irreg-pl-noun, -s: plural-suffix
Morphotactics
• Commonly represented as a FSA.
• e.g. Simple FSA for plural formation in English
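The figure for this slide is missing from the transcript, so the following is only a sketch of what such an FSA could look like. Its arcs are labeled with stem classes rather than letters; the lexicon entries come from the Lexicon slide, while the state numbers and the exact arcs are assumptions.

```python
# Lexicon: morpheme -> stem class (entries from the Lexicon slide).
LEXICON = {
    "dog": "reg-noun",
    "goose": "irreg-sg-noun",
    "geese": "irreg-pl-noun",
    "-s": "plural-suffix",
}

# Morphotactics: which class may follow which (assumed layout).
MORPHOTACTICS = {
    0: {"reg-noun": 1, "irreg-sg-noun": 2, "irreg-pl-noun": 2},
    1: {"plural-suffix": 2},  # only regular nouns take -s
    2: {},
}
START = 0
FINAL = {1, 2}

def accepts(morphemes):
    """Check a sequence of morphemes against the morphotactics FSA."""
    state = START
    for morpheme in morphemes:
        stem_class = LEXICON.get(morpheme)  # None for unknown morphemes
        next_state = MORPHOTACTICS[state].get(stem_class)
        if next_state is None:
            return False
        state = next_state
    return state in FINAL

print(accepts(["dog", "-s"]), accepts(["geese"]), accepts(["goose", "-s"]))
```

The rejected sequence goose + -s corresponds to the form *gooses from the recognition example.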
Morphotactics
• In cases where a morphological process is more complicated, or not
fully productive (unhappy, unreal but *unbig, *unred), the
morphotactics FSA may become quite complicated, and many
different stem classes may be necessary.