INFO 320
Server Technology I
Week 7
Regular expressions
INFO 320 week 7
1
www.ischool.drexel.edu
Overview
• One of the most powerful tools in
UNIX/Linux is the ability to match text
against regular expressions
– Regular expressions overview
– grep
– Character classes
– Applications
Regular expressions overview
Mostly from Regular-Expressions.info and the man pages cited
Regular expressions?
• “A regular expression (regex or regexp for
short) is a special text string for describing
a search pattern”
– While developed in UNIX, regular expressions
can also be used with little modification in
Windows, Perl, PHP, Java, or a .NET
language
– “little modification?” Yes, you have to be
careful which set of regex rules you’re using
Regular expressions
• The down side?
– They look like complete and utter gibberish
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
• The good news?
– There are zillions of cookbook recipes for
common uses of them
– And with commands (grep, ed, sed), they can
be used in scripts
Fancy wildcards?
• The basic idea is that regex are wildcards
on steroids
• We saw that, in bash scripting
– A star ‘*’ can substitute for zero or more of
any character
– A question mark ‘?’ can substitute for exactly
one character
– In a regex, neither ‘*’ nor ‘?’ has that meaning
– We’ll refine our use of brackets [ ] to include
or exclude any specific one character
Regex syntax
• Within UNIX, there are variations on
regex syntax
– GNU grep (our main tool) uses GNU Basic
Regular Expressions syntax (BRE)
– GNU egrep uses GNU Extended Regular
Expressions syntax (ERE)
– POSIX-compliant systems use POSIX Basic
Regular Expressions for grep, or POSIX
Extended Regular Expressions for egrep
BRE (grep) vs ERE (egrep)
• The main difference is that BREs use
backslashes to give various characters a
special meaning, while EREs use
backslashes to take away the special
meaning of the same characters
• egrep has the same functions as grep,
but reads the pattern as an ERE
– grep -E is the same as egrep
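The backslash rule can be tried directly with repetition braces, which exist in both syntaxes. This is only a minimal sketch: the three sample lines and the variable names are invented for illustration.

```shell
# Made-up sample input: one line with two b's, one with literal braces.
sample='abbc
ab{2}c
abc'

# BRE (plain grep): the backslashes GIVE the braces their "repeat" meaning.
bre_special=$(printf '%s\n' "$sample" | grep 'ab\{2\}c')
# BRE: without backslashes the braces are ordinary characters.
bre_literal=$(printf '%s\n' "$sample" | grep 'ab{2}c')
# ERE (grep -E): bare braces are special...
ere_special=$(printf '%s\n' "$sample" | grep -E 'ab{2}c')
# ...and the backslashes TAKE AWAY the special meaning.
ere_literal=$(printf '%s\n' "$sample" | grep -E 'ab\{2\}c')

printf '%s\n' "$bre_special" "$bre_literal" "$ere_special" "$ere_literal"
```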
Ed and sed
• Similar regex rules are used by grep,
ed, and sed
– ed is a text line editor
– sed is used to perform basic transformations
on an input text stream
grep
Regular expressions and grep
• Regular expressions came to UNIX in the
early 1970’s through the ‘ed’ editor and
the ‘grep’ command derived from it
– grep = g/re/p, the ed command to globally
search for a regular expression and print
– egrep = extended grep
• We’ll focus on grep
– grep matches BREs, which were defined by
IEEE Std 1003.1-2001, Section 9.3, Basic
Regular Expressions (now dated 2008)
grep syntax
• The basic form is
– grep -options pattern file
• The normal output from grep is a text list
of all the lines which matched the
pattern in the file
– A pattern like ‘remember’ that is split across two lines is not found!
Regex matches in grep cannot span multiple lines
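A small sketch of the basic form, using an invented four-line sample file in which ‘remember’ is deliberately broken across two lines. The file contents and variable names are assumptions made for the example.

```shell
# Invented sample file; 'remember' is split over lines 3 and 4.
speech=$(mktemp)
printf '%s\n' \
  'Four score and seven years ago' \
  'our fathers brought forth' \
  'the world will little note nor long remem' \
  'ber what we say here' > "$speech"

# grep pattern file: prints every line that contains the pattern.
found=$(grep 'fathers' "$speech")

# No single line contains 'remember', so grep prints nothing
# and signals "no match" through its exit status.
if grep 'remember' "$speech" > /dev/null; then
  split_found=yes
else
  split_found=no
fi

rm -f "$speech"
printf '%s\n' "$found" "$split_found"
```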
grep options
• Like most UNIX commands, grep has
many options (see handout), including
– -c shows the count of lines matched, instead
of the lines themselves
– -i ignores case when matching (!)
– -n gives the line number of each line matched
– -v gives lines which don’t match the
pattern(s) as output
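The four options can be compared on one small input. The sample lines below are made up, and the input is piped in rather than read from a file, so no file name appears in the output.

```shell
# Made-up sample input.
lines='Apple pie
banana bread
apple tart
Cherry cake'

# -c: count matching lines (case-sensitive, so only 'apple tart').
count=$(printf '%s\n' "$lines" | grep -c 'apple')
# -i: ignore case, so 'Apple pie' now counts too.
icount=$(printf '%s\n' "$lines" | grep -c -i 'apple')
# -n: prefix each matching line with its line number.
numbered=$(printf '%s\n' "$lines" | grep -n 'bread')
# -v: print the lines that do NOT match.
inverted=$(printf '%s\n' "$lines" | grep -v -i 'apple')

printf '%s\n' "$count" "$icount" "$numbered" "$inverted"
```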
grep options
• You can also include a list of patterns with
the -e option
• Or use a file with patterns using
the -f option
• You can match lines where the whole line
matches the pattern, with the -x option
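A short sketch of these three options, assuming an invented word list and a temporary pattern file created just for the example.

```shell
# Made-up sample input.
lines='cat
cats
dog
bird'

# -e: give several patterns; a line matching any of them is printed.
either=$(printf '%s\n' "$lines" | grep -e 'dog' -e 'bird')

# -f: read the patterns (one per line) from a file.
patterns=$(mktemp)
printf '%s\n' 'dog' 'bird' > "$patterns"
from_file=$(printf '%s\n' "$lines" | grep -f "$patterns")
rm -f "$patterns"

# -x: the pattern must match the whole line, so 'cats' is excluded.
whole=$(printf '%s\n' "$lines" | grep -x 'cat')

printf '%s\n' "$either" "$from_file" "$whole"
```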
Search patterns
• As a good habit, put the search pattern
in single or double quotes (single quotes are
safer, since the shell still expands $ and \
inside double quotes)
– The pattern is a regular expression
• If you give an empty pattern all lines
will be matched
– So what does grep -c '' filename do?
Metacharacters
• Regex metacharacters are characters that
have special meaning in this context
• We’ll look at them in groups
– The star ‘*’ is not a wildcard by itself here;
it repeats the preceding token zero or more
times (see Repetition)
– To match exactly one of any character
(except newline), use a period ‘.’
• Notice a ‘?’ did this in the context of scripting
– So ‘.*’ matches zero or more of any character
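A quick sketch of how the period and the star behave in grep. In a regex the star repeats whatever token comes before it, so "any run of characters" is written ‘.*’. The word list is invented, and -x is used so each pattern must cover the whole line.

```shell
# Made-up word list.
words='ct
cat
cot
coat
cart'

# '.' stands for exactly one character: cat, cot.
one_char=$(printf '%s\n' "$words" | grep -x 'c.t')
# '.*' is "any character, zero or more times": all five words.
any_run_count=$(printf '%s\n' "$words" | grep -c -x 'c.*t')
# 'o*' repeats only the letter o: ct, cot.
star_prev=$(printf '%s\n' "$words" | grep -x 'co*t')

printf '%s\n' "$one_char" "$any_run_count" "$star_prev"
```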
Metacharacters
• We can identify words at the start or end of
a line
• ‘^’ (the caret) marks the start of the line
– ‘^Four’
• ‘$’ (dollar) marks the end of the line
– ‘ago$’
– Again, different meaning than in scripting
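The two anchors can be tried on a few lines of text. The four sample lines below are invented for the example.

```shell
# Made-up sample text.
text='Four score and seven years ago
Four years on, the work continued
It happened long ago
long ago and far away'

# '^Four' matches only where Four begins the line.
starts=$(printf '%s\n' "$text" | grep '^Four')
# 'ago$' matches only where ago ends the line.
ends=$(printf '%s\n' "$text" | grep 'ago$')

printf '%s\n' "$starts" "$ends"
```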
Metacharacters
• We can identify the start or end of a word
• ‘\<’ marks the start of a word
– ‘\<eat’ would match eats or eating, not feat
• ‘\>’ marks the end of a word
– ‘ing\>’ would match loving but not sings
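A minimal sketch of the word anchors, assuming GNU grep (the \< and \> anchors are a GNU extension). The word list reuses the examples above plus nothing else.

```shell
# Sample words from the examples above.
words='eats
eating
feat
loving
sings'

# \<eat : 'eat' must begin a word, so 'feat' is excluded.
word_start=$(printf '%s\n' "$words" | grep '\<eat')
# ing\> : 'ing' must end a word, so 'sings' is excluded
# ('eating' also ends in ing, so it is printed too).
word_end=$(printf '%s\n' "$words" | grep 'ing\>')

printf '%s\n' "$word_start" "$word_end"
```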
Character classes
Character classes
• With a "character class" (or set) you can tell the
regex engine to match only one out of several
characters
– Simply place the possible characters you want to
match between square brackets
• If you want to match an a or an e, use [ae]
– You could use this in gr[ae]y to match either gray
or grey
• Very useful if you do not know whether the document you are
searching through is written in American or British English
From http://www.regular-expressions.info/charclass.html
Character classes
• The order of the characters inside a
character class does not matter
– The results are identical [ae] or [ea]
• The characters don’t have to be sequential
– [dptjgm583;] is fine
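– But outside a character class, if you want to
match the special characters [\^$.|?*+(){}
literally, you need to add a backslash before them
– Inside a class, Perl-style flavors only need the
backslash for ] \ ^ and -
• So there [abc\\\?] matches a b c \ or ?
• In POSIX tools such as grep, a backslash inside
brackets is already a literal character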
Character classes
• More generally in character classes
– ‘[abc]’ matches any one character specified
between the brackets
– ‘[^abc]’ matches any one character NOT
specified between the brackets
• That example means ‘one character that is not
a, b or c’, so a line can still contain a, b or c
as long as some other character matches
• Notice the ^ has very different meaning in a
character class or as its own metacharacter
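A sketch of a plain and a negated class in grep, on an invented word list. The last pattern illustrates that a negated class still has to match one character, so a line is rejected only when every character is excluded.

```shell
# Made-up word list.
words='gray
grey
groy
abc
abcd'

# [ae]: one character that is a or e.
either=$(printf '%s\n' "$words" | grep 'gr[ae]y')
# [^ae]: one character that is neither a nor e.
negated=$(printf '%s\n' "$words" | grep -x 'gr[^ae]y')
# [^abc] needs some character other than a, b, c on the line:
# 'abcd' qualifies because of the d, 'abc' does not.
not_abc_count=$(printf '%s\n' "$words" | grep -c '[^abc]')

printf '%s\n' "$either" "$negated" "$not_abc_count"
```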
Character classes
• Within character classes, ranges of
possible characters can be given
– [a-z] means any lower case letter
– [a-zA-Z] means any upper or lower case letter
– [a-zA-Z0-9] means any letter or digit
– [^a-zA-Z0-9] could be any character that isn’t
a letter or number
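The ranges can be checked one character at a time. This sketch assumes the C locale (set with LC_ALL=C) so that a range such as a-z means exactly the ASCII lower case letters; the four sample characters are invented.

```shell
# One made-up character per line.
chars='q
Q
7
#'

lower=$(printf '%s\n' "$chars" | LC_ALL=C grep '[a-z]')
letters=$(printf '%s\n' "$chars" | LC_ALL=C grep '[a-zA-Z]')
alnum=$(printf '%s\n' "$chars" | LC_ALL=C grep '[a-zA-Z0-9]')
# The leading ^ negates the class: anything that is not a letter or digit.
other=$(printf '%s\n' "$chars" | LC_ALL=C grep '[^a-zA-Z0-9]')

printf '%s\n' "$lower" "$letters" "$alnum" "$other"
```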
Metacharacters
• The pipe means logical OR in an
expression, here called alternation
– abc(def|xyz) matches abcdef or abcxyz
• Multiple alternations are allowed
– s(i|a|o)ng
– Inside square brackets a pipe is literal, so
s[i|a|o]ng would also match s|ng
• Notice the parentheses group a string of
characters to be treated as one
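A sketch of alternation with grep -E (in ERE the pipe and parentheses are special without backslashes). The word lists are invented. The last pattern shows that a pipe placed inside square brackets is just one more character in the class.

```shell
# Made-up word list, including a string with a literal pipe.
words='sing
sang
song
sung
s|ng'

# Grouped alternation: i, a or o between s and ng.
alt=$(printf '%s\n' "$words" | grep -E -x 's(i|a|o)ng')
# Inside brackets the pipe is literal, so 's|ng' matches as well.
class_count=$(printf '%s\n' "$words" | grep -E -c -x 's[i|a|o]ng')
# Alternation between longer strings.
longer=$(printf '%s\n' 'abcdef' 'abcxyz' 'abcd' | grep -E 'abc(def|xyz)')

printf '%s\n' "$alt" "$class_count" "$longer"
```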
Bracket expressions
• POSIX has bracket expressions to provide
abbreviations for common search terms
– The class name goes inside the brackets,
for example instead of [a-z] can use [[:lower:]]
– [a-zA-Z] becomes [[:alpha:]]
– [a-zA-Z0-9] becomes [[:alnum:]]
– What does [A-Fa-f0-9] = [[:xdigit:]] mean?
• So [^x-z[:digit:]] matches a single
character that is not x, y, z or a digit [0-9]
From http://www.regular-expressions.info/posixbrackets.html
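A sketch of POSIX classes in grep. Note that the class name sits inside an outer pair of brackets. The sample characters are invented, and LC_ALL=C is an assumption made here to keep the x-z range to plain ASCII.

```shell
# One made-up character per line.
chars='a
B
7
x
_'

# [[:lower:]]: lower case letters.
lower=$(printf '%s\n' "$chars" | LC_ALL=C grep '[[:lower:]]')
# A class name can be mixed with ranges and negation.
not_xz_digit=$(printf '%s\n' "$chars" | LC_ALL=C grep '[^x-z[:digit:]]')
# [[:xdigit:]]: hexadecimal digits 0-9, A-F, a-f.
hex=$(printf '%s\n' "$chars" | LC_ALL=C grep '[[:xdigit:]]')

printf '%s\n' "$lower" "$not_xz_digit" "$hex"
```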
Optional
• The question mark will attempt to match the
preceding token zero times or once, in
effect making it optional
– colou?r matches both colour and color
– Nov(ember)? will match Nov and November
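Both examples can be tried with grep -E, where the question mark is special without a backslash. The word list is invented, and -x forces the pattern to cover the whole line.

```shell
# Made-up word list.
words='color
colour
colouur
Nov
November
Novem'

# u? : the u may appear once or not at all.
optional_u=$(printf '%s\n' "$words" | grep -E -x 'colou?r')
# (ember)? : the whole group is optional.
month=$(printf '%s\n' "$words" | grep -E -x 'Nov(ember)?')

printf '%s\n' "$optional_u" "$month"
```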
Repetition
• The asterisk or star tells the engine to
attempt to match the preceding token
zero or more times.
– ‘<[A-Za-z][A-Za-z0-9]*>’ matches an
HTML tag without any attributes
• The plus tells the engine to attempt to
match the preceding token once or more.
– ‘<[A-Za-z0-9]+>’ will match a tag with
any one or more alphanumeric characters
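A sketch comparing the two tag patterns on a few invented strings. The star works in plain grep; the plus needs grep -E. LC_ALL=C is an assumption made here so the ranges cover only ASCII letters and digits.

```shell
# Made-up candidate tags.
tags='<b>
<h1>
<>
<1st>
<a href>'

# A letter followed by zero or more letters or digits.
star=$(printf '%s\n' "$tags" | LC_ALL=C grep -x '<[A-Za-z][A-Za-z0-9]*>')
# One or more letters or digits, so a leading digit is accepted too.
plus=$(printf '%s\n' "$tags" | LC_ALL=C grep -E -x '<[A-Za-z0-9]+>')

printf '%s\n' "$star" "$plus"
```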
Limiting repetition
• As a further refinement, it’s possible to
specify how many times a token will be
repeated, by adding {min,max} instead of
a star or plus
• Max is infinite if left out after the comma, so
– * = {0,} + = {1,} and ? = {0,1}
– But {0,3} would limit the previous token
to appear zero to three times
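The limits can be tried with grep -E on a few invented strings of a's followed by one b.

```shell
# Made-up input: zero, one, three and four a's before a b.
runs='b
ab
aaab
aaaab'

# {0,3}: at most three a's.
up_to_three=$(printf '%s\n' "$runs" | grep -E -x 'a{0,3}b')
# {2,}: two or more a's, no upper limit.
two_or_more=$(printf '%s\n' "$runs" | grep -E -x 'a{2,}b')
# {4}: exactly four a's.
exactly_four=$(printf '%s\n' "$runs" | grep -E -x 'a{4}b')

printf '%s\n' "$up_to_three" "$two_or_more" "$exactly_four"
```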
() [] [::]?
• So in the context of a regex
– Parentheses ( ) are used for grouping, to treat
a series of characters as one for repetition
– Square brackets [ ] define a character class,
matches any one character in that class
– Square brackets with colons [: :] define a
POSIX bracket expression
?*+{}?
• And following any kind of grouping,
character class, or bracket expression
– ? makes a group repeated zero or one time
(optional)
– + makes a group repeated one or more times
– * makes a group repeated zero or more times
– Curly brackets { } are used for controlling
repetition by giving min and max limits
Searching for special characters
• To match a ], put it as the first character
after the opening [ or the negating ^
• To match a -, put it right before the
closing ]
• To match a ^, put it before the final
literal - or the closing ]
• Put together, []\d^-] matches
], \, d, ^ or -
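The combined class can be checked in grep, where a backslash inside brackets is an ordinary character. The six test characters are chosen for the example; only the x should fail to match.

```shell
# One candidate character per line; x is not in the class.
chars=']
\
d
^
-
x'

# ] comes first, - comes last, ^ is not first: all are literal here.
matched=$(printf '%s\n' "$chars" | grep -c -x '[]\d^-]')

printf '%s\n' "$matched"
```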
Applications
From http://www.regular-expressions.info/examples.html
Ok, now what?
• Given this terribly complex set of rules for
defining a regular expression … so what?
• Regexes are very handy for searching for
specific terms, or validating inputs
• Here we’ll review a few cookbook
examples
Trimming Whitespace
• A mundane example is to use regular
expressions to get rid of spaces at the
start and end of lines
– Search for ^[ \t]+ and replace with nothing
to delete leading whitespace
– Search for [ \t]+$ and replace with nothing
to trim trailing whitespace
– [ \t] matches a space or tab in Perl-style
flavors; with grep, [[:blank:]] is the portable form
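A minimal sketch of the search-and-replace with sed. POSIX bracket expressions do not define \t, so this version uses [[:blank:]] (space or tab) and a star in place of the plus, which basic sed patterns lack. The input string is invented.

```shell
# Made-up line with spaces and tabs on both ends.
trimmed=$(printf '  \thello world \t\n' |
  sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//')

# Brackets make the result visible.
printf '[%s]\n' "$trimmed"
```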
Match IP addresses
• A simplified version is
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
• But that will catch invalid IP addresses with
numbers above 255; to fix that use
– \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
– Ok, matching numbers is tough in a text world
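The stricter pattern can be adapted to grep -E. This is a sketch under two assumptions: the \b word boundaries are replaced by -x (whole-line match), and the repeated group is held in a hypothetical shell variable named octet. The test addresses are invented.

```shell
# One number from 0 to 255.
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
# Four octets separated by literal dots.
ip_pattern="${octet}\.${octet}\.${octet}\.${octet}"

# Made-up candidates; only the first and third are valid.
candidates='192.168.1.1
256.1.1.1
10.0.0.255
1.2.3
999.999.999.999'

valid=$(printf '%s\n' "$candidates" | grep -E -x "$ip_pattern")

printf '%s\n' "$valid"
```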
Numbers are challenging
• To get a real number
– [-+]?[0-9]*\.?[0-9]+
• But if you might need exponential notation
– [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
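The exponential-notation pattern works unchanged in grep -E. The sketch below adds -x so the whole line must be a number; the candidate strings are invented.

```shell
# Made-up candidates; the last three are not valid numbers.
candidates='42
-3.14
.5
6.02e23
1e-9
abc
1.
1e'

number='[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?'
valid_count=$(printf '%s\n' "$candidates" | grep -E -c -x "$number")

printf '%s\n' "$valid_count"
```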
Validate email addresses
• If you get a string and want to see if it’s an
email address, could try
– ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
– What assumption is made here about case?
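The case question can be explored with grep -E, with and without -i. The addresses are invented, and LC_ALL=C is an assumption made here so that A-Z means only upper case ASCII letters.

```shell
email='^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

# Made-up candidates.
addresses='user@example.com
USER@EXAMPLE.COM
not-an-address.example.com
user@example.museum'

# As written, the pattern only accepts upper case letters.
strict_count=$(printf '%s\n' "$addresses" | LC_ALL=C grep -E -c "$email")
# With -i, lower case addresses are accepted as well.
relaxed_count=$(printf '%s\n' "$addresses" | LC_ALL=C grep -E -i -c "$email")

printf '%s\n' "$strict_count" "$relaxed_count"
```

The .museum address fails in both runs because the pattern allows at most four letters after the last dot.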
Validate a date
• (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
• Matches a date in yyyy-mm-dd format
from between 1900-01-01 and 2099-12-31
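A sketch of the date check in grep -E, under two assumptions: \d is written as [0-9] because POSIX syntax has no \d, and both separators accept the same set [- /.]. The sample dates are invented.

```shell
date_re='(19|20)[0-9][0-9][- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'

# Made-up candidates.
dates='1999-12-31
2024/02/30
2100-01-01
2024-13-01'

accepted=$(printf '%s\n' "$dates" | grep -E -x "$date_re")

printf '%s\n' "$accepted"
```

The pattern checks only the shape of the date, so an impossible day such as 2024/02/30 is still accepted.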
Validate credit cards
• To validate a credit card number, you need its
format, and must first strip out spaces & dashes
• Visa: ^4[0-9]{12}(?:[0-9]{3})?$
– All Visa card numbers start with a 4; new
cards have 16 digits, old cards have 13
• MasterCard: ^5[1-5][0-9]{14}$
– All MasterCard numbers start with the
numbers 51 through 55; all have 16 digits
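A sketch of both steps in shell: tr strips spaces and dashes, then grep -E -q tests the pattern. The function names is_visa and is_mastercard are hypothetical, the digit strings are made up, and the (?:...) group is written as a plain group because POSIX syntax has no non-capturing form.

```shell
# Succeeds if the argument, minus spaces and dashes, looks like a Visa number.
is_visa() {
  printf '%s\n' "$1" | tr -d ' -' | grep -E -q '^4[0-9]{12}([0-9]{3})?$'
}

# Succeeds if the argument, minus spaces and dashes, looks like a MasterCard number.
is_mastercard() {
  printf '%s\n' "$1" | tr -d ' -' | grep -E -q '^5[1-5][0-9]{14}$'
}

# Made-up digit strings: 16 digits, 16 digits, 15 digits.
for number in '4111 1111-1111 1111' '5500-0000-0000-0004' '4111 1111 1111 111'; do
  if is_visa "$number"; then
    kind=Visa
  elif is_mastercard "$number"; then
    kind=MasterCard
  else
    kind=unrecognized
  fi
  printf '%s: %s\n' "$number" "$kind"
done
```

These patterns check only the prefix and length; they do not verify that a number was ever issued.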
References
• Regular-expressions.info
http://www.regular-expressions.info/
• Grep man page
http://manpages.ubuntu.com/manpages/jaunty/en/man1/grep.1posix.html
• Lots of books are also available on regular
expressions