BINF 634 Bioinformatics Programming

Download Report

Transcript BINF 634 Bioinformatics Programming

Topics






Logical expression
string functions: substr and index
random numbers and mutation
hashes
Transcription, translation, genetic code
Quiz 2
BINF634 FALL13 - LECTURE 4
1
Logical Value of Expressions



Any expression in Perl can be
interpreted as a logical value

true or false
Scalar context:

false if 0, "", or undefined

otherwise true
List context:

false if () or undefined

true otherwise
my $x; # or: my $x = 0;
if ($x) { $x++ }
else { $x = 17 }
print "$x\n";
17
my $x = 2;
while ($x) {
print $x--, "\n";
}
2
1
my @a = ("A", "T");
while (@a) {
print shift @a, "\n";
}
A
T
BINF634 FALL13 - LECTURE 4
2
Logical Operators
Logical operator "short-circuit": only evaluate second argument if necessary
$a and $b
$a if $a is false, $b otherwise
$a or $b
$a if $a is true, $b otherwise
not $a
true if $a is not true
$a xor $b
True if $a or $b is true, but not both
Example:
open(GRADES, "grades") or die "Can't open file grades\n";
See Wall (p 109-110) for discussion of C-style &&, || and ! operators
BINF634 FALL13 - LECTURE 4
3
String Functions: substr
substr EXPR, OFFSET, LENGTH, REPLACEMENT
Return substring of string EXPR at position OFFSET and length LENGTH
$s
$x
$x
$x
=
=
=
=
"hello, world!";
substr $s, 1, 1;
substr $s, 0, 3;
substr $s, 7;
# $x = "e"
# $x = "hel"
# $x = "world!"
# The substring is replaced by REPLACEMENT if used:
substr($s,0,5,"goodbye"); # $s = "goodbye, world!";
# This does the same thing:
$s = "hello, world!";
substr($s,0,5) = "goodbye";
Note: Predefined Perl functions may be used with or without parentheses around their arguments
BINF634 FALL13 - LECTURE 4
4
# From example7-4.pl
# matching_percentage
#
# Subroutine to calculate the percentage of identical bases in two
# equal length DNA sequences
sub matching_percentage {
my($string1, $string2) = @_;
# we assume that the strings have the same length
my $length = length($string1);
my $count = 0;
for (my $position=0; $position < $length ; ++$position) {
if(substr($string1,$position,1) eq substr($string2,$position,1)) {
++$count;
}
}
return $count / $length;
}
BINF634 FALL13 - LECTURE 4
5
#!/usr/bin/perl
use strict;
use warnings;
my $dna = "AAAAATTTTTGGGGGTTTTT";
print_sequence(dna,8);
exit;
# A subroutine to format and print sequence data
sub print_sequence {
my($sequence, $length) = @_;
# Print sequence in lines of $length
for (my $pos = 0 ; $pos < length($sequence); $pos += $length ) {
print substr($sequence, $pos, $length), "\n";
}
}
AAAAATTT
TTGGGGGC
CCCC
BINF634 FALL13 - LECTURE 4
6
String Functions: index
index STR, SUBSTR, OFFSET
Return the position of the first occurrence of SUBSTR in STR; If OFFSET
is given, skip this many letters before looking
Returns -1 if SUBSTR not found
$dna = "GATGCCATGAAATGC";
$pos = index $dna, "ATG";
print "ATG found at position $pos\n";
# answer: 1
pos = -1;
while (($pos = index($dna, "ATG", $pos)) > -1) {
print "ATG found at position $pos\n";
$pos++;
}
OUTPUT:
ATG found at position 1
ATG found at position 6
ATG found at position 11
BINF634 FALL13 - LECTURE 4
7
Random Number Generation
$r = rand;
1.0)
# $r is random between 0 and 1 (0<=$r<
$r = rand(100);
# random between 0 and 100
$d = int rand(101); # random integer from 0 to 100
srand($seed);
# seeds the random number generator
If a program doesn't call srand(), then it generates different random numbers
each time it is run.
If program does call srand($seed), we get the same sequence of random
number each time the program is run with the same value for $seed.
BINF634 FALL13 - LECTURE 4
8
An Example to Generate Random
Stories - I
#!/usr/bin/perl -w
# Example 7-1
Children's game, demonstrating primitive artificial intelligence,
# using a random number generator to randomly select parts of sentences.
use strict;
use warnings;
# Declare the variables
my $count;
my $input;
my $number;
my $sentence;
my $story;
# Here are the arrays of parts of sentences:
my @nouns = (
'Dad',
'TV',
'Mom',
'Groucho',
'Rebecca',
'Harpo',
'Robin Hood',
'Joe and Moe',
);
BINF634 FALL13 - LECTURE 4
9
An Example to Generate Random
Stories - II
my @verbs = (
'ran to',
'giggled with',
'put hot sauce into the orange juice
of',
'exploded',
'dissolved',
'sang stupid songs with',
'jumped with',
);
'in a dream',
'around the world',
);
my @prepositions = (
'at the store',
'over the rainbow',
'just for the fun of it',
'at the beach',
'before dinner',
'in New York City',
# This do-until loop composes sixsentence "stories".
# until the user types "quit".
do {
# (Re)set $story to the empty
string each time through the loop
$story = '';
# Seed the random number generator.
# time|$$ combines the current time
with the current process id
# in a somewhat weak attempt to come up
with a random seed.
srand(time|$$);
# Make 6 sentences per story.
for ($count = 0; $count < 6;
$count++) {
BINF634 FALL13 - LECTURE 4
10
An Example to Generate Random
Stories - III
#
Notes on the following statements:
# 1) scalar @array gives the
number of elements in the array.
# 2) rand returns a random
number greater than 0 and
#
less than scalar(@array).
# 3) int removes the
fractional part of a number.
# 4) . joins two strings
together.
$sentence
=
$nouns[int(rand(scalar @nouns))]
. " "
.
$verbs[int(rand(scalar @verbs))]
. " "
.
$nouns[int(rand(scalar @nouns))]
. " "
.
$prepositions[int(rand(scalar
@prepositions))]
. '. ';
}
# Print the story.
print "\n",$story,"\n";
# Get user input.
print "\nType \"quit\" to quit, or
press Enter to continue: ";
$input = <STDIN>;
}
# Exit loop at user's request
until($input =~ /^\s*q/i);
exit;
$story .= $sentence;
BINF634 FALL13 - LECTURE 4
11
Randomly Selecting and Element
of an Array

$verbs[int(rand(scalar @verbs)))]

verbs[int(rand(7))] #Why??

What does rand(7) do?

How about int(rand(7))?
BINF634 FALL13 - LECTURE 4
12
An Alternative Way to Randomly
Select the Elements in an Array

$verbs[int rand scalar @verbs]

This will actually work

$verbs[rand @verbs] #How does this work?
BINF634 FALL13 - LECTURE 4
13
# From example7-2.pl
# A subroutine to perform a mutation in a string of DNA
#
sub mutate {
my($dna) = @_;
my(@nucleotides) = ('A', 'C', 'G', 'T');
# Pick a random position in the DNA
my($position) = randomposition($dna);
# Pick a random nucleotide
my($newbase) = randomnucleotide(@nucleotides);
# Insert the random nucleotide into the random position in the DNA
substr($dna,$position,1,$newbase);
return $dna;
}
BINF634 FALL13 - LECTURE 4
14
# A subroutine to randomly select a position in a string.
sub randomposition {
my($string) = @_;
# rand returns a decimal number between 0 and its argument.
# int returns the integer portion of a decimal number.
#
# The whole expression returns a random number between 0 and
# length-1
return int rand length $string;
}
# Select at random one of the four nucleotides
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
return randomelement(@nucleotides);
}
# randomly select an element from an array
sub randomelement {
my(@array) = @_;
# return $array[int(rand(scalar @array)];
return $array[rand @array];
}
BINF634 FALL13 - LECTURE 4
15
# Subroutine to perform a mutation in a string of DNA-version 2, in
# which it is guaranteed that one base will change on each call.
sub mutate_better {
my($dna) = @_;
my(@nucleotides) = ('A', 'C', 'G', 'T');
# Pick a random position in the DNA
my($position) = randomposition($dna);
# Pick a random nucleotide
my($newbase);
do {
$newbase = randomnucleotide(@nucleotides);
# Make sure it's different than the nucleotide we're mutating
}until ( $newbase ne substr($dna, $position,1) );
# Insert the random nucleotide into the random position in the DNA
substr($dna,$position,1,$newbase);
return $dna;
}
BINF634 FALL13 - LECTURE 4
16
Hashes
(Associative Arrays)


A Hash is a collection of zero or more pairs of scalar values,
called keys and values
Hash variable names begin with a percent sign (%)
%genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA");

The values are indexed by the keys

Given a key, the hash returns the corresponding value
$seq = $genes{"gene2"};

#
$seq = "CTGCCATGA"
Note that $genes{"gene2"} is a scalar, so it starts with $
BINF634 FALL13 - LECTURE 4
17
Hashes

Hashes can be assigned values use key=>value notation:
%genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA");
%genes = ( gene1=>"ATTCGT", gene2=>"CTGCCATGA");

Hash elements can be created/altered by assignment statements:
$genes{"gene1"} = "ATTCGT";
$genes{gene2} = "CTGCCATGA";
# note: no quotes in key
BINF634 FALL13 - LECTURE 4
18
Hashes (Associative Arrays)
%genomes = ( );
# creates an empty hash
# two ways to do the same thing:
%genomes = ( "virus", 31, "bacteria", 89, "plants", 5 );
%genomes = ( virus => 31, bacteria => 89, plants => 5 );
$genomes{mammals} = 2;
# adds a new pair to the hash
@genome_list = keys %genomes;
# @genome list is now ("plants" , "mammals", "bacteria", "virus")
@genome_counts = values %genomes;
# @genome_counts is now (5, 2, 89, 31)
# keys and values are not guaranteed to return the data is same order
# as it was entered, but they are guaranteed to return the data in the
# same order as each other.
BINF634 FALL13 - LECTURE 4
19
Hashes
The keys function returns a list of all keys in a hash (in some random order)
%genes = (gene2=>"CTGCCATGA", gene1=>"ATTCGT");
@key_list = keys(%genes);
print "@key_list\n";
# prints: gene1 gene2
# often used to loop through a hash:
foreach $key (@key_list) {
print "The value of $key is $genes{$key}\n";
}
Output:
The value of gene1 is ATTCGT
The value of gene2 is CTGCCATGA
BINF634 FALL13 - LECTURE 4
20
Hashes
# exists($H{$key}) returns TRUE if $key occurs in hash %H
# print all the words in a file with their counts
my ($file) = @ARGV;
open FH, $file;
my @lines = <FH>;
close FH;
my %count = ();
for my $line (@lines) {
for my $word (split " ", $line) {
if (not exists $count{$word}) {
# initialize the count for a new word
$count{$word} = 1;
}
else {
# update the count for an existing word
$count{$word}++;
}
}
}
for my $key (sort keys %count) {
print "$key $count{$key}\n";
}
What does each line in this
program do?
BINF634 FALL13 - LECTURE 4
21
Hashes
% cat testfile2
a text file with lots of words
some words occur once and some
words occur more than once
% wordcount.pl testfile2
a 1
and 1
file 1
lots 1
more 1
occur 2
of 1
once 2
some 2
text 1
than 1
with 1
words 3
BINF634 FALL13 - LECTURE 4
22
Hashes (Associative Arrays)


The each function returns a two-element list containing one key from the hash
and its associated value
Subsequent calls to each will return another pair, until all pairs have been
returned (at which point an empty array is returned)
while ( ($genome, $count) = each %genomes ) ) {
print “$genome $count\n”;
}
OUTPUT: (possibly not in this order)
plants 5
virus 31
bacteria 89
mammals 2
BINF634 FALL13 - LECTURE 4
23
More on Hashes (Associative Arrays)

Assigning the return value from values or keys to a scalar gives the number
of pairs in a hash:
$genome_count = keys %genomes; # $genome_count is now 4
$genome_count = values %genomes; # $genome_count is now 4

The delete function removes a pair from a hash
delete $genomes{bacteria};
$genome_count = keys %genomes;
#
$genome_count is now 3
BINF634 FALL13 - LECTURE 4
24
Transcription and Translation




DNA is transcribed to mRNA, mRNA is translated to protein
Start with double stranded DNA
ATTCGAGCATGACATCATCGGTA (sense strand)
TAAGCTCGTACTGTAGTAGCCAT (complement strand)
DNA double helix separates, allowing polymerase to transcribe one strand:
ATTCGAGCATGACATCATCGGTA (sense strand)
AUUCGAGCAUGACAUCAUCGGUA (mRNA)
| | | | | | |
TAAGCTCGTACTGTAGTAGCCAT (complement strand)
mRNA codons translated into protein:
AUU CGA GCA UGA CAU CAU CGG … (mRNA)
BINF634 FALL13 - LECTURE 4
25
Codons
BINF634 FALL13 - LECTURE 4
26
Reading Frames




A genomic sequence has 6 reading frames, corresponding to the six possible ways of
translating the sequence into three-letter codons.

Frame 1 treats each group of three bases as a codon, starting from the first base

Frame 2 starts at the second base

Frame 3 starts at the third base
Start with the sense strand of DNA: ATTCGAGCATGACATCATCGGTA
 Reading frame 1: ATT CGA GCA TGA CAT CAT CGG TA
 Reading Frame 2: TTC GAG CAT GAC ATC ATC GGT A
 Reading Frame 3: TCG AGC ATG ACA TCA TCG GTA
Each reading frame can be translated into a different protein sequence
Frames 4, 5 and 6 are defined in a similar way, but refer to the opposite strand, which is the
reverse complement of the first strand.
BINF634 FALL13 - LECTURE 4
27
Reading Frames

Perl code to process first three reading frames:
$dna = ...;
for (my $r = 1; $r <= 3; $r++) {
my $frame = substr($dna, ?, ?); #
# process reading frame $r here
}
fill in the blanks!
# exercise: write a subroutine to return the nth reading frame for a DNA
string
BINF634 FALL13 - LECTURE 4
28
BINF634 FALL13 - LECTURE 4
29
Using the Genetic Code
We’ll look at four versions of translating DNA using the genetic code:

Look up the codon using if-then-else

Same as above, but use patterns to reflect redundancy of genetic code

Use a hash to look up each codon

Same as above, but more efficiently

See BeginPerlBioinfo.pm
BINF634 FALL13 - LECTURE 4
30
Version 1
# codon2aa
# A subroutine to translate a DNA 3-character codon to an amino acid
sub codon2aa_v1 {
my($codon) = @_;
if
( $codon =~ /TCA/i )
{ return "S" }
elsif ( $codon =~ /TCC/i )
{ return "S" }
elsif ( $codon =~ /TCG/i )
{ return "S" }
elsif ( $codon =~ /TCT/i )
{ return "S" }
elsif ( $codon =~ /TTC/i )
{ return "F" }
elsif ( $codon =~ /TTT/i )
{ return "F" }
elsif ( $codon =~ /TTA/i )
{ return "L" }
elsif ( $codon =~ /TTG/i )
{ return "L" }
elsif ( $codon =~ /TAC/i )
{ return "Y" }
elsif ( $codon =~ /TAT/i )
{ return "Y" }
elsif ( $codon =~ /TAA/i )
{ return "_" }
elsif ( $codon =~ /TAG/i )
{ return "_" }
elsif ( $codon =~ /TGC/i )
{ return "C" }
elsif ( $codon =~ /TGT/i )
{ return "C" }
elsif ( $codon =~ /TGA/i )
{ return "_" }
elsif ( $codon =~ /TGG/i )
{ return "W" }
.
.
elsif ( $codon =~ /GGT/i )
{ return "G" }
else {
print STDERR "Bad codon \"$codon\"!!\n";
exit;
}
}
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
Serine
Serine
Serine
Serine
Phenylalanine
Phenylalanine
Leucine
Leucine
Tyrosine
Tyrosine
Stop
Stop
Cysteine
Cysteine
Stop
Tryptophan
Remember
in relating
to the
translation
table on
slide 29
that the T’s
become U’s
# Glycine
BINF634 FALL13 - LECTURE 4
Problem: takes longer to look up codons near end of list.
31
Version 2
# codon2aa
# A subroutine to translate a DNA 3-character codon to an amino acid
sub codon2aa_v2 {
my($codon) = @_;
if ( $codon =~ /GC./i)
{ return "A"
elsif ( $codon =~ /TG[TC]/i)
{ return "C"
elsif ( $codon =~ /GA[TC]/i)
{ return "D"
elsif ( $codon =~ /GA[AG]/i)
{ return "E"
elsif ( $codon =~ /TT[TC]/i)
{ return "F"
elsif ( $codon =~ /GG./i)
{ return "G"
elsif ( $codon =~ /CA[TC]/i)
{ return "H"
elsif ( $codon =~ /AT[TCA]/i)
{ return "I"
elsif ( $codon =~ /AA[AG]/i)
{ return "K"
elsif ( $codon =~ /TT[AG]|CT./i) { return "L"
elsif ( $codon =~ /ATG/i)
{ return "M"
elsif ( $codon =~ /AA[TC]/i)
{ return "N"
elsif ( $codon =~ /CC./i)
{ return "P"
elsif ( $codon =~ /CA[AG]/i)
{ return "Q"
elsif ( $codon =~ /CG.|AG[AG]/i) { return "R"
elsif ( $codon =~ /TC.|AG[TC]/i) { return "S"
elsif ( $codon =~ /AC./i)
{ return "T"
elsif ( $codon =~ /GT./i)
{ return "V"
elsif ( $codon =~ /TGG/i)
{ return "W"
elsif ( $codon =~ /TA[TC]/i)
{ return "Y"
elsif ( $codon =~ /TA[AG]|TGA/i) { return "_"
else {
print STDERR "Bad codon \"$codon\"!!\n";
exit;
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
Alanine
Cysteine
Aspartic Acid
Glutamic Acid
Phenylalanine
Glycine
Histidine
Isoleucine
Lysine
Leucine
Methionine
Asparagine
Proline
Glutamine
Arginine
Serine
Threonine
Valine
Tryptophan
Tyrosine
Stop
BINF634 FALL13 - LECTURE 4
Still takes longer to look up codons near end of list.
32
Version 3
# codon2aa
# Translate codon using hash lookup (pp 160-162 and BeginPerlBioinfo.pm)
sub codon2aa_v3 {
my($codon) = @_;
$codon = uc $codon;
my(%genetic_code) = (
"TCA" => "S",
# Serine
"TCC" => "S",
# Serine
"TCG" => "S",
# Serine
"TCT" => "S",
# Serine
"TTC" => "F",
# Phenylalanine
"TTT" => "F",
# Phenylalanine
"TTA" => "L",
# Leucine
"TTG" => "L",
# Leucine
"TAC" => "Y",
# Tyrosine
"TAT" => "Y",
# Tyrosine
"TAA" => "_",
# Stop
"TAG" => "_",
# Stop
. . .
"GGT" => "G",
# Glycine
);
if(exists $genetic_code{$codon}) {
return $genetic_code{$codon};
}
else{ die "Bad codon: $codon";}
}
Efficiency Problem: the
hash is redefined for every
codon lookup.
Robustness Problem:
should the program die if
the codon is unknown?
BINF634 FALL13 - LECTURE 4
33
More Efficient Solution


Define the genetic code hash just once (outside subroutine)
Use subroutine to perform lookup
my(%genetic_code) = (
"TCA" => "S",
# Serine
"TCC" => "S",
# Serine
"TCG" => "S",
# Serine
"TCT" => "S",
# Serine
"TTC" => "F",
# Phenylalanine
"TTT" => "F",
# Phenylalanine
"TTA" => "L",
# Leucine
"TTG" => "L",
# Leucine
"TAC" => "Y",
# Tyrosine
"TAT" => "Y",
# Tyrosine
"TAA" => "_",
# Stop
"TAG" => "_",
# Stop
. . .
"GGA" => "G",
"GGC" => "G",
"GGG" => "G",
"GGT" => "G",
);
# Glycine
# Glycine
# Glycine
# Glycine
BINF634 FALL13 - LECTURE 4
34
Version 4
# codon2aa
# Translate a DNA 3-character codon to an amino acid
#
using hash lookup passed as argument
sub codon2aa {
my($codon) = @_;
$codon = uc $codon;
if (exists $genetic_code{$codon}) {
return $genetic_code{$codon};
} else {
return "X";
# alternative error response:
# die "Bad codon: $codon";
}
}
The exists function return true if an element with the given key occurs in the hash.
BINF634 FALL13 - LECTURE 4
35
Wrap up


Program #1 was due tonight. Please talk to me privately either via
email or in person if you did not turn it in.
Program #2 is posted on course website




hint: Read about BeginPerlBioinfo.pm in chapter 8
Read the code in BeginPerlBioinfo.pm
Start NOW!
Read:


Tisdall Ch. 9
Wall Ch. 5
BINF634 FALL13 - LECTURE 4
36