Slide 1

Transcript Slide 1

Introduction to Objects
1
Motivation
• Vast (and growing) repositories of data available
– genomic (sequence and annotation)
http://genome.ucsc.edu
http://www.ensembl.org
– literature
http://www.ncbi.nih.gov/pubmed
– expression (microarray)
http://www.ncbi.nih.gov/geo/index.cgi?qb=pro
Still being sorted out
100 * 22,000 = 2,200,000 data points locally
– structure
http://www.ncbi.nih.gov/Structure/
Exellexis (Geoff Duyk) does 1000/year
2
Extrapolates to 3,456,000/day
More than 12,000,000 citations
3
Structure
MMDB (Molecular Modeling DB) currently contains 15 000
structure entries, corresponding to 35 000 chains and
50 000 3D domains.
And oh by the way…. the information was only contained in
the literature.
Therefore, in some instances there is strong
interconnection between these data resources (but there
are also many false/erroneous relationships).
http://nar.oupjournals.org/cgi/content/full/30/1/249?ijkey=iG
xcC9hHh9esk&keytype=ref&siteid=nar
4
Structure Example
• Web page pex5
http://www.ncbi.nih.gov/Structure
• Structure viewer
• Data file
5
What does this all have to do
with "Bioinformatics Techniques"
• Accessing data
– most of this data resides in databases
– extracting/accessing data
• manual
• programatically –write programs to perform operations on
sets of data that could not be performed any other way
– download data (mirror data), process, analyze
– query databases directly (for smaller sets)
– "spider" data – write a program that "follows" html links to
access data being served in web pages (HGMD)
– locally generated data to contribute to community
• pipeline processing
6
caBIG (cancer Biomedical Informatics Grid)
cabig.nci.nih.gov
•
The cancer Biomedical Informatics Grid, or caBIG, is an informatics
infrastructure that is connecting teams of cancer and biomedical
researchers together to enable them to better develop and share
tools and data in an open environment with common standards.
caBIG is creating a voluntary virtual network (or grid) that links
individuals and institutions, both nationally and internationally,
effectively forming a World Wide Web of cancer research.
• caBIG will allow researchers to answer research questions more
rapidly and efficiently, thereby promising to accelerate progress in all
aspects of cancer research– from etiologic research to prevention,
early detection and treatment. Ultimately, because caBIG will
provide a common unifying force that facilitates progress in cancer
research and care, the most important beneficiaries will be cancer
patients and the public at large.
7
cancer Biomedical Informatics Grid (caBIG) -another model (63 Centers 2007)
Analytical
Services
Model lends itself well
to ideas such as "sharing" of
workflows, automatic
capture of workflows and
Development
autodetection of data, in
tools
and analytical resources.
Isolation
However there is still a large gap between theory and reality.
Data
Tools
8
Modules – a brief intro
• A Perl module is a collection of subroutines and
variables that typically implements some
common functionality – and "packaged" to
facilitate reusability
• Modules exist to perform some of these tasks
– access databases (Ensembl)
– read data formats (GenBank records, BLAST files)
– special/custom processing
• generate an image of a database
9
File Handles and "strict"
use strict;
my $fh = &open_file("gene.gb");
&parseFile($fh);
sub open_file
{
my $file_handel;
open(FH,$_[0]);
$file_handel = FH;
return($file_handel);
}
sub parseFile
{
my $sfh = $_[0];
my $line = <$sfh>;
print "line = $line\n";
}
10
Bareword "FH" not allowed while "strict
subs" in use at fileHandle.pl line 10.
Execution of fileHandle.pl aborted due to
compilation errors.
11
use strict;
use IO::Handle;
my $fh = IO::Handle->new();
open($fh,"gene.gb");
&parseFile($fh);
&t2($fh);
sub parseFile
{
my $sfh = $_[0];
my $line = <$sfh>;
print "line = $line\n";
}
sub t2
{
my $sfh = $_[0];
my $line = <$sfh>;
print "line = $line\n";
}
12
Module Example
• Statistics::ChisqIndep
• http://www.cpan.org/
• http://www.cpan.org/modules/01modules.i
ndex.html
• http://www.cpan.org/authors/id/Y/YU/YUN
FANG/Statistics-ChisqIndep-0.1.tar.gz
13
Install (my Mac)
•
•
•
•
•
Download, uncompress, untar
make Makefile.pl
make
make test
sudo make install
• Note: requires admin privilege (sort of)
• Also requires: Statistics::Distributions
14
Install (nxclient) – no admin
privaledges
• mkdir modules
• Download, extract Statistics::Distributions
• perl Makefile.PL
PREFIX=/user/eng/tbraun/modules
• (replace tbraun with your user ID)
• make
• make test
• make install
15
Statistics::ChisqIndep
• Download, extract
• Edit Makefile.PL, add the following line:
use lib
"/user/eng/tbraun/modules/lib/perl5/site_pe
rl/5.10.0";
Replace "tbraun" with YOUR userID
perl Makefile.PL PREFIX="/user/eng/tbraun"
make
16
make install
Perldoc ChisqIndep
Statistics::ChisqIndep The module to perform chi-square test of independence (a.k.a.
contingency tables)
Synopsis
#example for Statistics::ChisqIndep
use strict;
use Statistics::ChisqIndep;
use POSIX;
# input data in the form of the array of array references
my @obs = ([15, 68, 83], [23,47,65]);
my $chi = new Statistics::ChisqIndep;
$chi->load_data(\@obs);
# print the summary data along with the contingency table
$chi->print_summary();
#print the contingency table only
$chi->print_contingency_table();
17
Output
Added: use lib "/user/eng/tbraun/modules/lib/perl5/site_perl/5.10.0";
$ perl terry-test.pl
Rows: 2
Columns: 3
Degree of Freedom: 2
Total Count: 301
Chi-square Statistic: 4.5639003430935
p-value: 0.10208
Warning: some of the cell counts might be too low.
1
2
3
rtotal
1 15 68
83
166
(20.96)
(63.42) (81.62)
2 23 47
65
135
(17.04)
(51.58) (66.38)
ctotal
38
115
148
301
18
CPAN – more info
cpan faqs
Cpan search (though I have not had good
luck with the search – example: chi
square)
perldoc perllocal (display installed modules)
cpan (perl module to install modules)
Ex:
19
cpan –i Statistics::ChisqIndep
Example of a Module
Invocation Arguments
From unix:
grep pattern file – grep is the application, pattern is an
invocation argument, and file is an invocation argument
grep foreach arrayArray.pl
foreach $i (@numbers) # i becomes a reference to an
array
foreach $j (@$i) # @$ dereferences i (an reference to
an aray)
DEMO grep foreach arrayArray.pl
20
Invocation Arguments
(command line parameters)
Run program:
program.pl -flag1 -flag2 param1 param2
Ex:
translate.pl sequence 50 10
Special variable -- all command line parameters
are automatically stored in array @ARGV.
Behaves just like a “normal” array
21
Invocation Arguments in Perl
• @ARGV variables -- captures invocation
variables when program is executed from
command line
#!/usr/bin/perl
"The invocation arguments were:\n";
foreach $i (@ARGV)
{
print "$i\n";
}
22
How do we implement something
like 'grep' that uses invocation
arguments
#!/usr/bin/perl
#
# grep.pl
# example of invocation arguments --
$pattern = shift @ARGV; # get the pattern
$file = shift @ARGV; # get the file name
open(FH,$file);
while($line = <FH>) {
if($line =~ m/$pattern/)
{
print "$line";
}
}
23
@INC -- how Perl "finds" things
•
the "search path" is given in a special variable called @INC
perl –V
: (lots of stuff)
@INC:
/home/tabraun/perl/bioperl-live//ensembl/modules
/home/tabraun/perl/bioperl-live//bioperl-0.7
/home/tabraun/perl/bioperl-live//ensembl-external/modules
/home/tabraun/perl/bioperl-live//ensembl-lite/modules
/usr/lib/perl5/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/5.8.0
/usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.0
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.0
/usr/lib/perl5/vendor_perl
/usr/lib/perl5/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/5.8.0
.
( ^ "Dot")
24
Sample Output
DEMO grep.pl foreach arrayArray.pl
./grep.pl foreach arrayArray.pl
foreach $i (@numbers) # i becomes a
reference to an array
foreach $j (@$i) # @$ dereferences i
(an reference to an aray)
25
Invocation Arguments
The "-n" flag
grep -n foreach arrayArray.pl
19:foreach $i (@numbers) # i becomes a reference to an array
21: foreach $j (@$i) # @$ dereferences i (an reference to an aray)
grep foreach arrayArray.pl -n
19:foreach $i (@numbers) # i becomes a reference to an array
21: foreach $j (@$i) # @$ dereferences i (an reference to an aray)
26
Added Complexity
#!/usr/bin/perl
#
# grep1.pl
# example of invocation arguments --
$line_nums = 0;
$pattern = 0;
$file = 0;
# this ASSUMES correct ORDER of arguments
while($param = shift @ARGV) {
if($param eq "-n")
{
$line_nums = 1; # flag to be used to generate
line numbers
}
elsif(!$pattern){
$pattern = $param;
}
elsif (!$file){
$file = $param;
}
else {
print "Error\n";
exit(1);
}
DEMO grep1.pl
}
$count = 0;
open(FH,$file);
while($line = <FH>) {
$count++;
if($line =~ m/$pattern/)
{
if($line_nums) {
print "$count: $line";
}
else {
print "$line";
}
}
}
foreach arrayArray.pl
27
Perl Module: Getopt::Long
#!/usr/bin/perl
#
# grep2.pl
# example of invocation arguments -use Getopt::Long;
my $line_nums = 0;
my $pattern = 0;
my $file = 0;
&GetOptions(”n" => \$line_nums, # n is command line argument, no value, $line_nums set to 1
"e=s" => \$pattern,
# e is command line argument, S for string, stored in $pattern
"f=s" => \$file);
# f is command line argument, S for string, stored in $file, i for integer
$count = 0;
open(FH,$file);
while($line = <FH>) {
$count++;
if($line =~ m/$pattern/)
{
if($line_nums) {
print "$count: $line";
}
else {
print "$line";
}
}
}
28
man Getopt::Long
Getopt::Long(3)
perl v5.8.1
Perl Programmers Reference Guide
02
Getopt::Long(3)
Perl Programmers Reference Guid 2003-09-
NAME
Getopt::Long - Extended processing of command line options
SYNOPSIS
use Getopt::Long;
my $data = "file.dat";
my $length = 24;
my $verbose;
$result = GetOptions ("length=i" => \$length, # numeric
"file=s" => \$data,
# string
"verbose" => \$verbose); # flag
DESCRIPTION
The Getopt::Long module implements an extended getopt function called
GetOptions(). This function adheres to the POSIX syntax for command…
29
Getopt::Long
• Parses command line arguments off of @ARGV
$result = GetOptions ("length=i" => \$length, # numeric
"file=s" => \$data,
# string
"verbose" => \$verbose); # flag
length is command line argument, expecting
integer ($length is variable)
file is command line argument, expecting string
($data is variable)
verbose is command line argument (no values)
($verbose is variable, gets a 1)
“option variables”
30
Sample Output
DEMO ./grep2.pl –n –f arrayArray.pl –e
foreach
./grep2.pl -n -f arrayArray.pl -e foreach
19: foreach $i (@numbers) # i becomes a
reference to an array
21: foreach $j (@$i) # @$ dereferences
i (an reference to an aray)
31
Building Larger Programs
Write a subroutine called:
print_formated_sequence
- input is a sequence string, and line length
-output:
1 ATTTTTTTTT TTTTTTTTTT TTTTTTTTTT TTTTTTTTAc cccccccccc 50
51 cccccccccc cccccccccc cccccccccc cccccccccc gggggggggg 100
101 gggggggggg gggggggggg gggggggggg gggggggggg gggggggggg 150
151 gggggggggg gggggggggg gggggggggg gggggggggg g 191
Now I use this subroutine in 100's of programs
Later determine that is annoying how the first line is offset from second,
etc.
To fix this, I have to edit 100's of programs
Need a way to save the code ONCE, and share with 100's of programs
32
Sharing Code
Library of sequence processing subroutines in a file called "sequence_pro.pl"
These subroutines may be accessed by "do":
do "sequence_pro.pl";
This is the same as if the subroutine was in the new program.
formatted_seq.pl:
#!/usr/bin/perl
# formatted_seq.pl
do "sequence_pro.pl";
$seq = "ATGCCCCCCCGGGCG";
&print_formatted_sequence($seq,50);
33
Packages, Modules, Libraries
(software) Library -- generic term for shared code, subroutines,
variables
Package -- Perl declaration (often called a Perl Module Library, or Perl
Module, or Module)
do “filename.**” -- incorporate code, subroutines, variables into a
program
require “filename.**” -- incorporate code, subroutines, variables into a
program (with checking of duplication)
use FILE -- incorporate code that has been declared as a module
(File.pm)
require “MySeq.pl”;
use Seq;
# Seq.pm is a module file
34

Slide 1

Transcript Slide 1

Directory