Introduction to Perl

Part II
Associative arrays or Hashes
Like arrays, but can use strings instead of numbers as indices
@array = ('john', 'steve', 'aaron', 'max', 'juan', 'sue');
%hash = ( 'apple' => 12, 'pear' => 3, 'cherry' => 30,
          'lemon' => 2, 'peach' => 6, 'kiwi' => 3 );
Array (index: value)
0: 'john'   1: 'steve'   2: 'aaron'   3: 'max'   4: 'juan'   5: 'sue'
Hash (key => value)
apple => 12   pear => 3   cherry => 30   lemon => 2   peach => 6   kiwi => 3

Using hashes
{ } operator
Set a value
– $hash{'cherry'} = 10;
Access a value
– print $hash{'cherry'}, "\n";
Remove an entry
– delete $hash{'cherry'};

Get the Keys
keys function will return a list of the hash keys
@keys = keys %hash;
for my $key ( keys %hash ) {
    print "$key => $hash{$key}\n";
}
@keys would be 'apple', 'pear', ...
Order of keys is NOT guaranteed!

Get just the values
@values = values %hash;
for my $val ( @values ) {
    print "val is $val\n";
}

Iterate through a set
Order is not guaranteed!
while( my ($key,$value) = each %hash ) {
    print "$key => $value\n";
}

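The hash operations above can be combined in one short script. This is only a sketch: it reuses the fruit counts from the earlier slide, and it adds exists and sort, two built-ins the slides do not cover, to test for a key and to get a repeatable key order.

```perl
use strict;
use warnings;

# Fruit counts reused from the slides (a shortened list)
my %hash = ( 'apple' => 12, 'pear' => 3, 'cherry' => 30 );

$hash{'kiwi'}   = 3;     # add a new entry
$hash{'cherry'} = 10;    # overwrite an existing value
delete $hash{'pear'};    # remove an entry

# exists checks for a key without creating it
my $has_pear = exists $hash{'pear'} ? 1 : 0;

# keys come back in no particular order, so sort them for stable output
my @sorted = sort keys %hash;
for my $key ( @sorted ) {
    print "$key => $hash{$key}\n";
}
```

Sorting the keys is the usual way around the "order is not guaranteed" warning when output has to be reproducible.
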
Subroutines
Set of code that can be reused
Can also be referred to as procedures and functions

Defining a subroutine
sub routine_name {
}
Calling the routine
– routine_name();
– &routine_name; (& is optional; called this way without parentheses, the routine sees the caller's @_)

Passing data to a subroutine
Pass in a list of data
– &dosomething($var1,$var2);
– sub dosomething {
      my ($v1,$v2) = @_;
  }
  sub do2 {
      my $v1 = shift @_;
      my $v2 = shift;    # inside a sub, shift defaults to @_
  }

Returning data from a subroutine
The last statement evaluated in the routine sets the return value
sub dothis {
    my $c = 10 + 20;
}
print dothis(), "\n";    # 30
Can also use return to specify the return value and/or leave the routine early

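A small sketch that puts argument passing, an early return, and the implicit return value together. The routine name mean is made up for illustration and is not from the slides.

```perl
use strict;
use warnings;

# mean() is a hypothetical helper used only for this example
sub mean {
    my @nums = @_;              # copy the argument list
    return unless @nums;        # leave early: nothing to average
    my $sum = 0;
    $sum += $_ for @nums;
    $sum / @nums;               # last statement evaluated is the return value
}

my $avg   = mean(10, 20, 30);
my $empty = mean();             # bare return gives undef in scalar context

print "mean is $avg\n";
print "no data\n" unless defined $empty;
```
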
Write a subroutine which returns true if a codon
is a stop codon (for the standard genetic code).
-1 on error, 1 on true, 0 on false
(-1 is also a true value in Perl, so compare the result explicitly)
sub is_stopcodon {
    my $val = shift @_;
    if( length($val) != 3 ) {
        return -1;
    } elsif( $val eq 'TAA' ||
             $val eq 'TAG' ||
             $val eq 'TGA' ) {
        return 1;
    } else {
        return 0;
    }
}

Context
array versus scalar context
my $length = @lst;     # scalar context: number of elements
my ($first) = @lst;    # list context: first element
wantarray is used to report the context a subroutine is called in
Can force scalar context with scalar
my $len = scalar @lst;

subroutine context
sub dostuff {
    if( wantarray ) {
        print "array/list context\n";
    } else {
        print "scalar context\n";
    }
}
dostuff();            # void context: wantarray is undef, so this prints "scalar context"
my @a = dostuff();    # array
my %h = dostuff();    # array
my $s = dostuff();    # scalar

Why do you care about context?
sub dostuff {
    my @r = (10,20);
    return @r;
}
my @a = dostuff();    # array
my %h = dostuff();    # array
my $s = dostuff();    # scalar
print "@a\n";                        # 10 20
print join(" ", keys %h), "\n";      # 10
print "$s\n";                        # 2

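One way a routine can use wantarray deliberately is to return the full list in list context and a count in scalar context. A minimal sketch; the routine name stop_codons is an assumption made for this example.

```perl
use strict;
use warnings;

# stop_codons() is a hypothetical routine for illustration
sub stop_codons {
    my @stops = ('TAA', 'TAG', 'TGA');
    # list context: hand back the codons; scalar context: how many there are
    return wantarray ? @stops : scalar @stops;
}

my @list    = stop_codons();    # list context
my $count   = stop_codons();    # scalar context
my ($first) = stop_codons();    # the parentheses make this list context

print "@list\n";
print "$count\n";
print "$first\n";
```
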
References
Are “pointers” to the data object instead of the object itself
Allow us to have a shorthand to refer to something and pass it around
Must “dereference” a reference to get the actual value; the “reference” is just a location in memory

Reference Operators
\ in front of variable to get its memory location
– my $ptr = \@vals;
[ ] for arrays, { } for hashes
Can assign a pointer directly
– my $ptr = [ 'owlmonkey', 'lemur' ];
  my $hashptr = { 'chrom' => 'III',
                  'start' => 23 };

Dereferencing
Need to cast the reference back to its datatype
my @list = @$ptr;
my %hash = %$hashptr;
Can also use '{ }' to clarify
– my @list = @{$ptr};
– my %hash = %{$hashptr};

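Besides copying the whole structure out with @$ptr or %$hashptr, single elements can be reached through the reference with the arrow operator. The sketch below reuses $ptr and $hashptr from the previous slide; the extra value 'tarsier' and the use of ref are additions for illustration.

```perl
use strict;
use warnings;

my $ptr     = [ 'owlmonkey', 'lemur' ];
my $hashptr = { 'chrom' => 'III', 'start' => 23 };

my $first = $ptr->[0];            # one array element via the reference
my $chrom = $hashptr->{'chrom'};  # one hash value via the reference
my $n     = scalar @$ptr;         # number of elements in the referenced array
my $kind  = ref($ptr);            # ref reports what the reference points to

push @$ptr, 'tarsier';            # changes the original data, not a copy

print "$first $chrom $n $kind\n";
```
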
Really they are not so hard...
my @list = ('fugu', 'human', 'worm', 'fly');
my $list_ref = \@list;         # reference to @list itself
my $list_ref_copy = [@list];   # reference to a new copy of @list
for my $item ( @$list_ref ) {
    print "$item\n";
}

Why use references?
Simplify argument passing to subroutines
Allow updating of data without making multiple copies
What if we wanted to pass in 2 arrays to a subroutine?
sub func { my (@v1,@v2) = @_; }
How do we know when one stops and another starts?
We can't: both arrays are flattened into one list in @_, so @v1 takes every element and @v2 is left empty.

Why use references?
Passing in two arrays to intermix.
sub func {
    my ($v1,$v2) = @_;    # two array references
    my @mixed;
    while( @$v1 || @$v2 ) {
        push @mixed, shift @$v1 if @$v1;    # shift also empties the caller's array
        push @mixed, shift @$v2 if @$v2;
    }
    return \@mixed;
}

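A sketch of how the intermixing routine might be called. The arrays @letters and @numbers are made up for the example. Because func shifts elements off the arrays it is given, the call passes anonymous copies so the originals survive; passing \@letters and \@numbers instead would leave both arrays empty afterwards.

```perl
use strict;
use warnings;

# func as defined on the slide
sub func {
    my ($v1,$v2) = @_;
    my @mixed;
    while( @$v1 || @$v2 ) {
        push @mixed, shift @$v1 if @$v1;
        push @mixed, shift @$v2 if @$v2;
    }
    return \@mixed;
}

# hypothetical input data
my @letters = ('a', 'b', 'c');
my @numbers = (1, 2);

# [@letters] builds a reference to a copy, so @letters is left untouched
my $mixed = func( [@letters], [@numbers] );
print "@$mixed\n";
```
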
References also allow Arrays of Arrays
my @lst;
push @lst, ['milk', 'butter', 'cheese'];
push @lst, ['wine', 'sherry', 'port'];
push @lst, ['bread', 'bagels', 'croissants'];
my @matrix = ( [1, 0, 0],
               [0, 1, 0],
               [0, 0, 1] );

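Once the rows are stored as array references, individual cells are read and written with two subscripts. A minimal sketch using the identity matrix from the slide; the value 5 is arbitrary.

```perl
use strict;
use warnings;

my @matrix = ( [1, 0, 0],
               [0, 1, 0],
               [0, 0, 1] );

my $center = $matrix[1][1];       # shorthand for $matrix[1]->[1]
my $nrows  = @matrix;             # number of rows
my $ncols  = @{ $matrix[0] };     # number of columns in the first row

$matrix[0][2] = 5;                # update a single cell

for my $row ( @matrix ) {
    print join(" ", @$row), "\n";
}
```
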
Hashes of arrays
$hash{'dogs'} = ['beagle', 'shepherd', 'lab'];
$hash{'cats'} = ['calico', 'tabby', 'siamese'];
$hash{'fish'} = ['gold', 'beta', 'tuna'];
for my $key ( keys %hash ) {
    print "$key => ", join("\t", @{$hash{$key}}), "\n";
}

More matrix use
my @matrix;
open(IN, $file) || die $!;
# read in the matrix
# data looks like
# GENENAME EXPVALUE STATUS
while(<IN>) {
    push @matrix, [split];
}
# sort by 2nd column
for my $row ( sort { $a->[1] <=> $b->[1] } @matrix ) {
    print join("\t", @$row), "\n";
}

Funny operators
my @bases = qw(C A G T);    # qw() quotes a list of words
my $msg = <<EOF;            # here document: everything up to the EOF line
This is the message I wanted to tell you about
EOF

Regular Expressions
Part of “amazing power” of Perl
Allow matching of patterns
Syntax can be tricky
Worth the effort to learn!
A simple regexp
if( $fruit eq 'apple' ||
    $fruit eq 'Apple' ||
    $fruit eq 'pear' ) {
    print "got a fruit $fruit\n";
}
# the pattern is not anchored, so it also matches inside longer strings
if( $fruit =~ /[Aa]pple|pear/ ) {
    print "matched fruit $fruit\n";
}

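The two tests on this slide are not exactly equivalent: eq compares the whole string, while an unanchored pattern matches anywhere inside it. A short sketch of the difference, using a made-up word list and grep (a built-in not covered in the slides) to collect the matches.

```perl
use strict;
use warnings;

# hypothetical test words
my @words = ('apple', 'Apple', 'pear', 'pineapple');

# unanchored: matches if the pattern occurs anywhere in the word
my @loose = grep { /[Aa]pple|pear/ } @words;

# anchored with ^ and $: the whole word must match, like the eq tests
my @exact = grep { /^(?:[Aa]pple|pear)$/ } @words;

print "loose: @loose\n";
print "exact: @exact\n";
```
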
Regular Expression syntax
use the =~ operator to match
if( $var =~ /pattern/ ) {} - scalar context
my ($a,$b) = ( $var =~ /(\S+)\s+(\S+)/ ); - list context, returns the captured groups
if( $var !~ m/pattern/ ) { } - true if pattern doesn't match
m/REGEXPHERE/ - match
s/REGEXP/REPLACE/ - substitute
tr/VALUES/NEWVALUES/ - translate

m// operator (match)
Search a string for a pattern match
If no string is specified, will match $_
Pattern can contain variables which will be interpolated (and pattern recompiled)
while( <DATA> ) {
    if( /A$num/ ) { $num++ }     # pattern follows the current value of $num
}
while( <DATA> ) {
    if( /A$num/o ) { $num++ }    # /o: compiled once, later changes to $num are ignored
}

Pattern extras
m// - if you specify m, can replace / with anything, e.g. m##, m[], m!!
/i - case insensitive
/g - global match (more than one)
/x - extended regexps (allows comments and whitespace)
/o - compile regexp once

Shortcuts
\s - whitespace (tab, space, newline, etc)
\S - NOT whitespace
\d - numerics ([0-9])
\D - NOT numerics
\t, \n - tab, newline
. - any character (except newline, unless /s is used)

Regexp Operators
+ - 1 -> many (match 1,2,3,4,... instances)
/a+/ will match 'a', 'aa', 'aaaaa'
* - 0 -> many
? - 0 or 1
{N}, {M,N} - match exactly N, or M to N
[], [^] - anything in the brackets, anything but what is in the brackets

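The quantifiers above can be tried on a short string. This is just a sketch; the sequence and variable names are invented for the example.

```perl
use strict;
use warnings;

my $seq = 'GATTTTACA';    # made-up sequence

my ($run)  = $seq =~ /(T+)/;       # + : one or more T, as many as possible
my ($star) = $seq =~ /^(C*)/;      # * : zero C's at the start still matches
my ($notT) = $seq =~ /^([^T]+)/;   # [^T] : anything but T
my $has3   = $seq =~ /T{3}/ ? 1 : 0;            # {3} : three T's in a row
my $opt    = 'color' =~ /^colou?r$/ ? 1 : 0;    # ? : the u is optional

print "run=$run star='$star' notT=$notT has3=$has3 opt=$opt\n";
```
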
Saving what you matched
Things in parentheses can be retrieved via variables $1, $2, $3, etc for 1st, 2nd, 3rd matches
if( /(\S+)\s+([\d\.\+\-]+)/ ) {
    print "$1 --> $2\n";
}
my ($name,$score) =
    ($var =~ /(\S+)\s+([\d\.\+\-]+)/);

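A runnable version of the list-context form, as a sketch. The input line is invented for the example. When the pattern does not match, the list assignment receives an empty list, which is a convenient way to detect failure without looking at $1 and $2 (those keep their values from the last successful match).

```perl
use strict;
use warnings;

my $var = "YAR102W   -3.75";    # hypothetical name/score line

# list context: the match returns the captured groups
my ($name, $score) = $var =~ /(\S+)\s+([\d\.\+\-]+)/;

# no whitespace in the string, so the pattern fails and the list is empty
my @none = 'nospaces' =~ /(\S+)\s+([\d\.\+\-]+)/;

print "$name --> $score\n";
print "no match\n" unless @none;
```
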
Simple Regexp
my $line = "aardvark";
if( $line =~ /aa/ ) {
    print "has double a\n";
}
if( $line =~ /(a{2})/ ) {
    print "has double a\n";
}
if( $line =~ /(a+)/ ) {
    print "has 1 or more a\n";
}

Matching gene names
# File contains lots of gene names
# YFL001C YAR102W - yeast ORF names
# let-1, unc-7 - worm names
# http://biosci.umn.edu/CGC/Nomenclature/nomenguid.htm
# ENSG000000101 - human Ensembl gene names
while(<IN>) {
    if( /^(Y([A-P])(R|L)(\d{3})(W|C)(\-\w)?)/ ) {
        printf "yeast gene %s, chrom %d,%s arm, %d %s strand\n",
            $1, (ord($2)-ord('A'))+1, $3, $4, $5;
    } elsif( /^(ENSG\d+)/ ) { print "human gene $1\n"; }
    elsif( /^(\w{3,4}\-\d+)/ ) { print "worm gene $1\n"; }
}

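The same patterns can be wrapped in a subroutine so they are easy to try on single names. A sketch only: classify_gene is a hypothetical name, and the trailing $ anchors are an addition here so that the whole name has to match.

```perl
use strict;
use warnings;

# classify_gene() is a hypothetical helper built from the slide's patterns
sub classify_gene {
    my $name = shift;
    if( $name =~ /^Y([A-P])(R|L)(\d{3})(W|C)(\-\w)?$/ ) {
        # chromosome letter A..P maps to 1..16
        return sprintf("yeast chrom %d %s arm", ord($1) - ord('A') + 1, $2);
    } elsif( $name =~ /^ENSG\d+$/ ) {
        return 'human';
    } elsif( $name =~ /^\w{3,4}\-\d+$/ ) {
        return 'worm';
    }
    return 'unknown';
}

for my $gene ( 'YFL001C', 'unc-7', 'let-1', 'ENSG000000101', 'foo' ) {
    print "$gene: ", classify_gene($gene), "\n";
}
```
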
Putting it all together
A parser for output from a gene prediction program

GlimmerM (Version 3.0)
Sequence name: BAC1Contig11
Sequence length: 31797 bp

Predicted genes/exons

Gene Exon Strand  Exon         Exon Range     Exon
   #    #         Type                        Length

   1    1    +    Initial     13907  13985      79
   1    2    +    Internal    14117  14594     478
   1    3    +    Internal    14635  14665      31
   1    4    +    Internal    14746  15463     718
   1    5    +    Terminal    15497  15606     110

   2    1    +    Initial     20662  21143     482
   2    2    +    Internal    21190  21618     429
   2    3    +    Terminal    21624  21990     367

   3    1    -    Single      25351  25485     135

   4    1    +    Initial     27744  27804      61
   4    2    +    Internal    27858  27952      95
   4    3    +    Internal    28091  28576     486
   4    4    +    Internal    28636  28647      12
   4    5    +    Internal    28746  28792      47
   4    6    +    Terminal    28852  28954     103

Putting it together
my ($method, $version);
while(<>) {
    if( /^(Glimmer\S*)\s+\((.+)\)/ ) {
        $method = $1; $version = $2;
    } elsif( /^(?:Predicted genes|Gene|\s+\#)/ ||
             /^\s+$/ ) {
        next;    # skip header and blank lines
    } elsif( # glimmer 3.0 output
             /^\s+(\d+)\s+      # gene num
              (\d+)\s+          # exon num
              ([\+\-])\s+       # strand
              (\S+)\s+          # exon type
              (\d+)\s+(\d+)     # exon start, end
              \s+(\d+)          # exon length
             /ox ) {
        my ($genenum,$exonnum,$strand,$type,$start,$end,
            $len) = ( $1,$2,$3,$4,$5,$6,$7 );
    }
}
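
The parser above extracts the fields but does not store them. One possible next step, sketched here under assumptions, is to collect the exons into a hash of arrays keyed by gene number. The hash layout and field names (exon, strand, type, start, end, len) are choices made for this example, and three rows from the report are embedded in a here document so the script runs without an input file.

```perl
use strict;
use warnings;

# A few lines copied from the GlimmerM report, embedded for the example
my $report = <<'EOF';
GlimmerM (Version 3.0)
   1    1    +    Initial     13907  13985      79
   1    2    +    Internal    14117  14594     478
   3    1    -    Single      25351  25485     135
EOF

my %genes;                  # gene number => list of exon records
my ($method, $version);
for ( split /\n/, $report ) {
    if( /^(Glimmer\S*)\s+\((.+)\)/ ) {
        $method = $1; $version = $2;
    } elsif( /^\s+(\d+)\s+(\d+)\s+([\+\-])\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)/ ) {
        # each exon becomes a small hash, pushed onto its gene's list
        push @{ $genes{$1} }, { exon => $2, strand => $3, type => $4,
                                start => $5, end => $6, len => $7 };
    }
}

# summarize: exon count and total exon length per gene
for my $g ( sort { $a <=> $b } keys %genes ) {
    my $total = 0;
    $total += $_->{len} for @{ $genes{$g} };
    printf "gene %d: %d exon(s), %d bp\n", $g, scalar @{ $genes{$g} }, $total;
}
```
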
s/// operator (substitute)
Same as m// but will allow you to substitute whatever is matched in the first section with the value in the second section
$sport =~ s/soccer/football/;
$addto =~ s/(Gene)/$1-$genenum/;

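A short sketch of how s/// behaves with and without the /g flag from the "Pattern extras" slide. The sentence is made up for the example. In scalar context s/// also returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $sport = 'soccer fans watch soccer';    # hypothetical input

# copy first, then substitute on the copy, leaving $sport alone
(my $first_only = $sport) =~ s/soccer/football/;     # first match only
(my $all        = $sport) =~ s/soccer/football/g;    # every match

# s/// returns how many replacements were made (and edits $sport in place)
my $count = ( $sport =~ s/soccer/football/g );

print "$first_only\n$all\n$count\n";
```
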
The tr/// operator (translate)
Replace each character in the first section with the character at the same position in the second.
The sections are character lists, not regexps, so no [ ] is needed
lowercase - tr/A-Z/a-z/
shift cipher - tr/A-Z/B-ZA/
revcom - $dna =~ tr/ACGT/TGCA/;
$dna = reverse($dna);

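The reverse-complement recipe can be packaged as a subroutine. A minimal sketch: revcom is the slide's label used as a routine name, lowercase bases are handled as an extra, and the test sequence is invented. Note the scalar in front of reverse: in list context reverse would reverse the argument list instead of the string.

```perl
use strict;
use warnings;

# revcom(): reverse complement of a DNA string (sketch, plain ACGT only)
sub revcom {
    my $dna = shift;
    $dna =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    return scalar reverse $dna;       # scalar forces string reversal
}

my $seq = 'ATGCGC';                   # hypothetical sequence
my $rc  = revcom($seq);

# tr with an empty second section counts characters without changing them
my $gc = ( $seq =~ tr/GC// );

print "$seq -> $rc (GC count $gc)\n";
```
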
(aside) DNA ambiguity chars
aMino - {A,C}, Keto - {G,T}
puRines - {A,G}, pYrimidines - {C,T}
Strong - {G,C}, Weak - {A,T}
H (Not G) - {A,C,T}, B (Not A) - {C,G,T}, V (Not T) - {A,C,G}, D (Not C) - {A,G,T}
$str =~ tr{acgtrymkswhbvdnxACGTRYMKSWHBVDNX}
          {tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX};