Information Encoding in Biological Molecules: DNA and
Download
Report
Transcript Information Encoding in Biological Molecules: DNA and
http://creativecommons.org/licenses/by-sa/2.0/
Lecture 6.0
1
An Introduction to Perl for
Bioinformatics – Part 1
Will Hsiao
Simon Fraser University
Department of Molecular Biology and Biochemistry
[email protected]
www.pathogenomics.sfu.ca/brinkman
Lecture 6.1
2
Important Rules
1. Please feel free to ask any questions at any
time
2. There is no stupid questions, this is a safe
learning environment
3. The lecture is interactive, you are highly
encouraged to actively participate (ask
questions and answer questions)
Lecture 6.0
3
Lecture 6.0
4
Lecture 6.0
5
Outline
• Session 1
–
–
–
–
–
–
Review of the previous day
Perl – historical perspective
Expand on Regular Expression
General Use of Perl
Expand on Perl Functions and introduce Modules
Interactive demo on Modules
• Break
• Session 2
–
–
–
–
–
Use of Perl in Bioinformatics
Object Oriented Perl
Bioperl Overview
Interactive demo on Bioperl
Introduction to the Perl assignment
Lecture 6.0
6
Today’s Goals
• Will have become familiar with a few more advanced
programming concepts
– Regular Expression
– Functions and Modules
– Object Oriented Perl
• Will have heard a few common uses of Perl
• Will have learned how Perl can be used in
bioinformatics
• Will have discovered Bioperl
• For the biologists, see some potential power of
programming and learn to communicate with your CS
colleagues
Lecture 6.0
7
Recap from Yesterday
• Which ones below are variables?
– $sequence, @sequences, 74, ‘I knew this’,
%seq_id, “exciting stuff”
• Examples of operators? < > , ++, =, ==
• What are functions?
• Which part(s) of the statement below is a
function:
– my $seq2 = reverse ($seq1);
• Types of loops
Lecture 6.0
8
Effects of “use strict”
• Requires you to declare variables
Correct
Incorrect
my $DNA;
$DNA = “ATCG”;
$DNA = “ATCG”;
or
my $DNA = “ATCG”;
• Warns you about possible typos in variables
Lecture 6.0
No warning
Warning
my $DNA = “ATCG”;
$DNA =~tr/ATCG/TAGC/
my $DNA = “ATCG”;
$DAN =~tr/ATCG/TAGC
9
Why bother “use strict”
•
•
•
•
Enforces some good programming rules
Helps to prevent silly errors
Makes trouble shooting your program easier
Becomes essential as your code becomes
longer
• We will use strict in all the code you see
today and in your assignment
• Bottom line: ALWAYS use strict
Lecture 6.0
10
What does this program do?
#!/usr/bin/perl –w
use strict;
#a mystery subroutine that does something
#body of the main program
my $DNA1 = “GATACAATAC”;
my $DNA2 = “ATCGTAATCC”;
my $rDNA = reverse $DNA1;
$DNA2 =~ tr/T/U/;
my $hybrid = $rDNA.$DNA2;
print “$hybrid\n”;
Lecture 6.0
11
Any Other Issues?
Lecture 6.0
12
Perl – a brief history
• Purpose: “… for scanning arbitrary text files,
extracting information from those text files, and
printing reports based on that information…”
- from perl manpage
• 1987-Perl 1.0 released
• 1993 – CPAN conceived
• 1995 – Perl 5.000 released
– Object oriented perl
– Modules for creating interactive web pages (CGI)
– Modules for connection to databases (DBI)
• Current stable version of Perl is 5.8.5
Lecture 6.0
13
How do we manipulate text?
Regular Expression
Lecture 6.0
14
What is Regular Expression
• REGEX provides pattern matching ability
• Tells you whether a string contains a pattern
or not (Note: it’s a yes or no question!)
“My dog ate my homework”
“Yesterday I saw a big
black dog”
“I have a golden retriever”
Dog! Human’s best friend
Regular Expression looking for “dog”
“Yes” or “True”
Lecture 6.0
“Yes” or “True”
“No” or “False”
“No” b/c REGEX
is case sensitive
15
Why need Regular Expression
• Human does this quite well
• But….
• Imagine trying to find all ATG’s in the human
genome by hand
• Furthermore, imagine trying to find all EcoRI
digestion sites (GAATTC) in the human
genome
Lecture 6.0
16
Perl REGEX example
my $text = “Bioinformatics Kicks Ass”;
if ($text=~m/Kicks/){
print “The text contains Kicks\n”;
}
• =~ is the binding operator
– It says: “does the string on the left contain the
pattern on the right?”
• /Kicks/ is my pattern
• The matching operation results in a true or false answer
Lecture 6.0
17
More Regular Expression
• A pattern that match only one string is not very
useful!
• We need symbols to represent classes of
characters
• REGEX is its own little language inside Perl
– Has different syntax and symbols!
– Symbols which you have used in perl such as $ . { }
[ ] have totally different meanings in REGEX
Lecture 6.0
18
REGEX Metacharacters
• Metacharacters allow a pattern to match
different strings
– Wildcards are examples of metacharacters
– /.ick/ will match “kick”, “sick”, “tick”, “stick”, “kicks”,
etc.
– Perl REGEX has much more powerful
metacharacters used to represent classes of
characters
Lecture 6.0
19
Types of Metacharacters
• . matches any one character or space except “\n”
• [ ] denotes a selection of characters and matches
ONE of the characters in the selection.
– What does [ATCG] match?
• \t, \s, \n match a tab, a space and a newline
respectively
• \w matches any characters in [a-zA-Z0-9]
• \d matches [0-9]
• \D matches anything except [0-9]
Lecture 6.0
20
An Example of Metacharacters
V1S 5A6?
/\w\d\D\s\d.[0-9]/
Is it a good pattern for postal code?
What else does it match?
Lecture 6.0
21
REGEX Quantifiers
• What if you want to match a character more
than once?
• What if you want to match an mRNA with a
polyA tail that is at least 5 – 12 A’s?
“ATG……AAAAAAAAAAA”
Lecture 6.0
22
REGEX Quantifiers
“ATG……AAAAAAAAAAA”
/ATG[ATCG]+A{5,12}/
• + matches one or more copies of the previous
character
• * matches zero or more copies of the previous
character
• ? matches zero or one copy of the previous character
• {min,max} matches a number of copies within the
specified range
Lecture 6.0
23
REGEX Anchors
• The previous pattern is not strictly correct
because:
– It’ll match a string that doesn’t start with ATG
– It’ll match a string that doesn’t end with poly A’s
• Anchors tell REGEX that a pattern must
occur at the beginning or at the end of a
string
Lecture 6.0
24
REGEX Anchors
• ^ anchors the pattern to the start of a
string
• $ anchors the pattern to the end of a
string
•/^ATG[ATCG]+A{5,12}$/
Lecture 6.0
25
REGEX is greedy!
• The revised pattern is still incorrect because
– It’ll match a string that has more than 12 A’s at the end
• quantifiers will try to match as many copies of a subpattern as possible!
/^ATG[ATCG]+A{5,12}$/
“ATGGCCCGGCCTTTCCCAAAAAAAAAAAA”
“ATGGCCCGGCCTTTCCCAAAAAAAAAAAA”
Lecture 6.0
26
Curb that Greed!
• ? after a quantifier prevensts REGEX from being greed
• note this is the second use of the question mark
• What is the other use of ? in REGEX?
/^ATG[ATCG]+?A{5,12}$/
“ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”
“ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”
Lecture 6.0
27
REGEX Capture
• What if you want to keep the part of a string
that matches to your pattern?
• Use ( ) “memory parentheses”
“ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”
/^ATG([ATCG]+?)A{5,12}$/
Lecture 6.0
28
REGEX Capture
/^ATG([ATCG]+?)(A{5,12})$/
$1
$2
• What’s inside the first ( ) is assigned to $1
• What’s inside the Second ( ) is $2 and so on
• So $2 eq “AAAAAAAAAAAA”
Lecture 6.0
29
REGEX Modifiers
• Modifiers come after a pattern and affect the
entire pattern
• You have seen //g already which does global
matching (/T/g) and global replacement(s/T/U/g)
• Other useful modifiers:
//i
//s
//m
///e
Lecture 6.0
make pattern case insensitive
let . match newline
let ^ and $ (anchors) match next to embedded newline
allow the replacement string to be a perl statement
30
REGEX Demo
•
•
•
•
•
Demonstrate quantifiers
Demonstrate anchors
Demonstrate //i
Demonstrate capture
Demonstrate the effect of greedy vs. nongreedy
• Demonstrate metacharacters
Lecture 6.0
31
Other binding operators
• =~ is called the binding operator which
“binds” the a string on the left to a pattern on
the right
– E.g. $text =~/PATTERN/
• Two other binding operators: s/// and tr///
– =~s/// (substitution) substitutes a matched pattern
by a string (kind of like the “replace” function in
MS Word)
– =~tr/// (translation) translates a character to
another
Lecture 6.0
32
Summary on REGEX
•
•
•
•
•
REGEX is its own little language!!!
REGEX is used in some functions (e.g. split)
Perl REGEX: extremely powerful and fast
REGEX is one of the main strengths of Perl
To learn more:
– Learning Perl (3rd ed.) Chapters 7, 8, 9
– Programming Perl (3rd ed.) Chapter 5
– Mastering Regular Expression (2nd ed.)
Electronic version of these books are available from
wiki day6 link (only during the workshop)
Lecture 6.0
33
Common Uses of Perl
• REGEX
– Complete set of tools for pattern matching text
• System administration
– Perl scripts can be written to automate many system
administration tasks
• CGI.pm
– Module for designing interactive web pages
• DBI.pm
– Database Interface – allows communication between all
major RDBMS systems (Oracle, MySQL, etc.)
Lecture 6.0
34
Functions
• Functions (sub-routines) are like small programs inside
your program
• Like programs, functions execute a series of statements
that process input to produce some desired output
• Functions help to organise your program
– parcel it into named functional units that can be called repeatedly
• These “functional units” are called “blocks”
• There are literally hundreds of functions built-in to Perl
– E.g. print, reverse, tr…
• You can make your own functions
Lecture 6.0
35
What happens when you call a
function?
$DNA = “ACATAATCAT”;
$rcDNA = reverse ($DNA);
$rcDNA =~ tr/ACGT/TGCA/;
sub reverse {
# process input
# return output
}
Everything between { } is called a block
Lecture 6.0
36
Calling a function
• Input is passed to a function by way of an ordered
parameter list
Basic syntax of calling a function
$result = function_name (parameter list);
$longDNA = "ACGACTAGCATGCATCGACTACGACTACGATCAGCATCGACT"
$shortDNA = substr ($longDNA, 0, 10);
String from
which to
extract the
substring
Lecture 6.0
Start from
this position
Length of the
substring
37
Useful string functions in Perl
•
chomp(STRING) OR chomp(ARRAY) –
–
•
chop(STRING) OR chop(ARRAY)
–
•
Returns a string that consists of all of the elements of ARRAY joined together by STRING. For instance, join(">>", ("AA", "BB", "cc")) returns
"AA>>BB>>cc".
lc(STRING)
–
•
Returns the position of the first occurrence of SUBSTRING in STRING at or after POSITION. If you don't specify POSITION, the search starts at the
beginning of STRING.
join(STRING, ARRAY)
–
•
Removes the last character from a string or the last character from every element in an array. The last character chopped is returned.
index(STRING, SUBSTRING, POSITION)
–
•
Uses the value of the $/ special variable to remove endings from STRING or each element of ARRAY. The line ending is only removed if it matches
the current value of $/.
Returns a string with every letter of STRING in lowercase. For instance, lc("ABCD") returns "abcd".
lcfirst(STRING)
–
Returns a string with the first letter of STRING in lowercase. For instance, lcfirst("ABCD") returns "aBCD".
•
length(STRING)
•
split(PATTERN, STRING, LIMIT)
–
–
•
Returns the length of STRING.
Breaks up a string based on some delimiter. In an array context, it returns a list of the things that were found. In a scalar context, it returns the number
of things found.
substr(STRING, OFFSET, LENGTH)
–
Returns a portion of STRING as determined by the OFFSET and LENGTH parameters. If LENGTH is not specified, then everything from OFFSET to
the end of STRING is returned. A negative OFFSET can be used to start from the right side of STRING.
•
uc(STRING)
•
ucfirst(STRING)
–
–
Returns a string with every letter of STRING in uppercase. For instance, uc("abcd") returns "ABCD".
Returns a string with the first letter of STRING in uppercase. For instance, ucfirst("abcd") returns "Abcd".
Lecture 6.0
source: http://www.cs.cf.ac.uk/Dave/PERL/
38
Useful array functions in Perl
•
pop(ARRAY)
–
•
push(ARRAY1, ARRAY2)
–
•
Appends the contents of ARRAY2 to ARRAY1. This increases the size of ARRAY1 as needed.
reverse(ARRAY)
–
•
Returns the last value of an array. It also reduces the size of the array by one.
Reverses the elements of a given array when used in an array context. When used in a scalar context, the
array is converted to a string, and the string is reversed.
scalar(ARRAY)
–
Evaluates the array in a scalar context and returns the number of elements in the array.
• shift(ARRAY)
– Returns the first value of an array. It also reduces the size of the array
by one.
•
sort(ARRAY)
–
•
Returns a list containing the elements of ARRAY in sorted order. See next Chapter 8on References for more
information.
split(PATTERN, STRING, LIMIT)
–
Breaks up a string based on some delimiter. In an array context, it returns a list of the things that were
found. In a scalar context, it returns the number of things found.
source: http://www.cs.cf.ac.uk/Dave/PERL/
Lecture 6.0
39
String functions - split
• ‘splits’ a string into an array based on a delimiter
• excellent for processing tab or comma delimited files
$line = “MacDonald,Old,The farm,Some city,BC,E1E 1O1”;
($lastname, $firstname, $address, $city, $province, $postalcode) = split (/,/, $line);
print (“LAST NAME: “, $lastname, “\n”,
“FIRST NAME: “, $firstname, “\n”,
“ADDRESS: “, $address, “\n”,
“CITY: “, $city, “\n”,
“PROVINCE: “, $province, “\n”,
“POSTAL CODE: “, $postalcode, “\n”);
LAST NAME: MacDonald
FIRST NAME: Old
ADDRESS: The Farm
CITY: Some city
PROVINCE: BC
POSTAL
CODE: E1E 1O1
Lecture 6.0
REGEX
goes here
String
goes here
40
Array functions - sort
• You can sort the elements in your array with ‘sort’
@myNumbers = ("one","two","three","four");
@sorted = sort(@myNumbers);
print “@sorted\n”;
four one three two
Lecture 6.0
sorts
alphabetically
41
Making your own function
‘sub’ tells the
interpreter you are
declaring a function
sub function_name {
(my $param1, my $param2, ...) = @_;
# do something with the parameters
my $result = ...
return $result;
}
‘return’ tells the interpreter to go
back to the place in the program
that called this function. When
followed by scalars or variables,
these values are passed back to
where the function was called.
This is the output of the function
Lecture 6.0
This is the function name. Use this
name to ‘call’ the function from
within your program
What is this? This is an array that
gets created automatically to hold
the parameter list.
What is the word ‘my’ doing here?
‘my’ is a variable qualifier that
makes it local to the function.
Without it, the variable is available
anywhere in the program. With
‘my’, $param1 is only available
inside the sub-routine block
SCOPE!!
42
Making your own function - example
Function definition
sub mean {
my @values = @_;
my $numValues = scalar @values;
local variables to be used
inside the function
my $mean;
foreach my $element (@values) {
do the work!
my $sum = $sum + $element;
}
$mean = $mean / $numValues;
return $mean;
return the answer
}
$avg = mean(1,2,3,4,5);
Lecture 6.0
43
A program with a function
#!/usr/bin/perl –w
use strict;
#a mystery subroutine that does something
sub mystery_function{
my ($seq1, $seq2)=@_;
my $rDNA = reverse $seq1;
$seq2 =~ tr/T/U/;
my $hybrid = $rDNA.$seq2;
return $hybrid;
}
#body of the main program
my $DNA1 = “GATACAATAC”;
my $DNA2 = “ATCGTAATCC”;
my $answer = mystery_function($DNA1, $DNA2);
print “$answer\n”;
Lecture 6.0
44
Functions Recap
• How do we “call” a function?
my $sum = add (2, 3)
Return Value
Function Name
Parameter list
• Functions can take some input values
(parameters) and can return some output
values
• You need to assign the return values to a
variable in order to use them
Lecture 6.0
45
More Review on Functions
• Benefits of subroutines
– Decompose a big problem into smaller, more
manageable problems
– Organize your code
– Improve code reuse
– Easier to test and debug your code
sub add {
#some code that adds numbers here
#return the sum
}
Lecture 6.0
46
What are Modules
• a “logical” collection of functions
• Each collection (or module) has its own
“name space”
– Name space: a table containing the names of
variables and functions used in your code
Lecture 6.0
47
Why is name space important
Package: SEQanalysis_DNA
Package: SEQanalysis_Prot
$DNA = “ATGAATACTACTAT…”
$exon1 = “MEDAVRSKNTMI”
$polyAtail = “AAAAAAAAAA”
$exon2 = “RSVADEGFLSMIRQH”
sub Revcom{}
#reverse complement sequence
sub findmotif{}
#find a peptide motif
sub concat{}
#concatenate two DNA sequences
sub concat{}
#concatenate two exon
sequences
SEQanalysis_DNA::concat
Lecture 6.0
SEQanalysis_Prot::concat
48
Why Use Modules
• Modules allow you to use others’ code to extend the
functionality of your program
• But, use other people’s modules is like going to other
people’s houses
– Not everything will be the way you like it
• Read the module documentation
– Be nice
• use a module as it is intended
• In Perl, each module is a file stored in some directory
in your system
– E.g. you can find cgi.pm in /usr/lib/… on your system
Lecture 6.0
49
Use modules to speedup the
development of your own program
• Modules are like Hamburger Helpers!
• It provides many of the ingredients for your
dish, but you are responsible for the “meat” of
the dish.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Lecture 6.0
50
Use Modules
• To use a module:
– use modulename;
• Examples:
– use strict;
– use Env;
– use cgi qw(:standard);
• To find out where modules are installed :
perl –V
• To find out what standard modules are
available: perldoc perlmodlib
Lecture 6.0
51
Module Demo
• Demonstrate perldoc as a method to read
module documentation
• Demonstrate the difference before and after
using a module (use strict and use Env)
• Demonstrate the perl –V and an example of
directory structure of modules
Lecture 6.0
52
Where to find modules
• CPAN: Comprehensive Perl Archive Network
• Central repository for Perl modules and more
• “If it’s written in Perl, and it’s helpful and free, it’s
probably on CPAN”
• http://www.perl.com/CPAN/
• To install modules from CPAN
– perl –MCPAN –e “install ‘Some::Module’”
• Module dependency is taken cared of automatically
• You’ll (usually) need to be root to install a module successfully
Lecture 6.0
53
CPAN Web Demo
• Demonstrate how to search for a module and
how to access the online documentation
• We’ll use Getopt::Long as an example
Lecture 6.0
54
Interactive Demo on Getopt::Long
•
•
•
•
Open your laptop!
Open a terminal window
Type cd ~/perl_two
Type gedit ./getopt_demo.pl&
• Let’s go over the example together
Lecture 6.0
55
Summary for Session 1
• Always “use strict”
• Regular Expression is its own language
inside Perl
• I encourage you to read the chapters on
REGEX in Learning Perl
• A module is a logical collection of functions
• You can find module documentation by using
perldoc (command line) or by going online to
CPAN
Lecture 6.0
56
Break
Lecture 6.0
57