Nimble Perl Programming Using Scriptome

Yannick Pouliot, PhD
Bioresearch Informationist
Lane Medical Library & Knowledge Management Center
1/22/2009
Lane Medical Library & Knowledge Management Center
http://lane.stanford.edu
Objectives
Determining whether Scriptome can …
1. Enable you to perform operations otherwise difficult/time-consuming/error-prone?
2. Help you learn Perl?
Also, we’ll be using anonymous polling to determine whether you’re happy with the material and speed of delivery …
And don’t worry: this experiment won’t hurt a bit!
So What Is Scriptome?
Scriptome is a resident Perl program that performs various data manipulation tasks useful to biologists
- Originally developed by Harvard’s FAS Center for Systems Biology
- Maintained and extended by lots more volunteers not associated with Harvard
Why Bother With Scriptome?
- Code is visible, so you can learn how to do things in Perl … or not
- Can handle arbitrarily large files
  - No size limitations, unlike e.g. Excel
- Free; runs on everything: PC, Mac, Linux
- It’s programmatic!
  - Much faster than manual operations
  - You can string operations together and save these in e.g. a .bat file
How Do You Use Scriptome?
- You tell Scriptome which function you want it to perform (more later)
- You can also string Scriptome functions into a protocol
- Input: Scriptome operates on text files
  - No binary files, but you could add that capability yourself
    - E.g., process Excel files in native form using Perl modules such as Spreadsheet::ParseExcel
- Output: printed to the command line, or redirected into another file
Scriptome: Pick Your Flavor
http://lane.stanford.edu/howto/index.html?id=_1257
http://sysbio.harvard.edu/csb/resources/computational/scriptome/
Installing Scriptome - Windows
1. Download Scriptome_exe.tar.gz using this link:
   http://sysbio.harvard.edu/csb/resources/computational/scriptome/bin/Scriptome_exe.tar.gz
   → Final location: I suggest C:/Program Files/Scriptome
2. Create a directory named “Scriptome”
3. Decompress Scriptome_exe.tar.gz by double-clicking
   → Notice the four files inside
4. Update the PATH variable: add this string at the END of the contents of the PATH variable:
   ;C:\Program Files\Scriptome\Scriptome;C:\Program Files\Scriptome\ScriptPack;C:\Program Files\Scriptome\Scriptome.bat;C:\Program Files\Scriptome\ScriptPack.bat
Scriptome Usage
1. Using a specific tool:
   Scriptome flags toolname [input_filenames] [> output_filename]
   Example:
   Scriptome -t change_fasta_to_tab LONGhmcad.fst
2. Finding a tool by type:
   Scriptome -t tooltype
   where tooltype =
   - Calc
   - Choose
   - Sort
   - Fetch
   - Merge
   - Change
   Example:
   Scriptome -t Calc
Let’s examine each area briefly before going over specifics…
Polling Time: How’s the speed?
1. Too fast
2. Too slow
3. More or less OK
4. I feel nauseous
Examples and noteworthy tools
Calc Tool Examples - 1
Compute column sums:
Scriptome -t calc_col_sum SubjectData1.tab
→ select columns to add
IMPORTANT: column numbers start at 0, not 1
Note visible Perl code → easy to modify, expand

perl -e "
  $col=1;
  while(<>) {
    s/\r?\n//;
    @F=split /\t/, $_;
    $sum += $F[$col];
  }
  warn qq~\nSum of column $col for $. lines\n\n~;
  print qq~$sum\n~
" file.tab
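The zero-based column numbering is easy to get wrong, so here is a minimal sketch of the same summing logic as a standalone Perl script. The helper name col_sum and the sample rows are hypothetical and not part of Scriptome; they only illustrate that index 0 is the first column.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Scriptome tool): sum one column of
# tab-delimited lines. $col is 0-based, as in the Scriptome code.
sub col_sum {
    my ($col, @lines) = @_;
    my $sum = 0;
    for my $line (@lines) {
        (my $clean = $line) =~ s/\r?\n\z//;   # strip Unix or Windows line ending
        my @F = split /\t/, $clean;
        # Skip rows where the requested column is missing or empty
        next unless defined $F[$col] && length $F[$col];
        $sum += $F[$col];
    }
    return $sum;
}

# Made-up sample: column 0 = subject ID, column 1 = age, column 2 = weight
my @sample = ("s1\t34\t70\n", "s2\t41\t82\n");
print "Sum of column 1: ", col_sum(1, @sample), "\n";
print "Sum of column 2: ", col_sum(2, @sample), "\n";
```

With this sample, column 1 (the second column) sums to 75 and column 2 sums to 152; asking for column 0 would try to add up the subject IDs.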
Calc Tool Examples - 2
Compute row sums:
Scriptome -t calc_row_sum SubjectData1.tab
→ enter the column numbers to sum; as with calc_col_sum, the code indexes $F[$col] from 0, so 1 selects the second column, 2 the third, etc.

perl -e "
  @cols=(1, 2, 3);
  while(<>) {
    s/\r?\n//;
    @F=split /\t/, $_;
    $sum = 0;
    foreach $col (@cols) {
      $sum += $F[$col]
    };
    print qq~$_\t$sum\n~;
  }
  warn qq~\nSum of columns @cols for each line ($. lines)\n\n~
" in.tab
Change Tool Examples - 1
Create tab-delimited file from FASTA file:
Scriptome -t change_fasta_to_tab LONGhmcad.fst > LONGhmcad.fst.tab
→ change_fasta_to_tab is an important tool because many Scriptome tools use tab-delimited files

perl -e "
  $count=0;
  $len=0;
  while(<>) {
    s/\r?\n//;
    s/\t/ /g;
    if (s/^>//) {
      if ($. != 1) {
        print qq~\n~
      }
      s/ |$/\t/;
      $count++;
      $_ .= qq~\t~;
    }
    else {
      s/ //g;
      $len += length($_)
    }
    print $_;
  }
  print qq~\n~;
  warn qq~\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n~;
" seqs.fna
Change Tool Examples - 2
Change rows to columns or vice versa:
Scriptome -t change_transpose_table SubjectData1.tab
Note: change_transpose_table operates on tab-delimited files
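This slide shows no Perl for the transpose, so here is a minimal sketch of what transposing a tab-delimited table involves. It is not Scriptome's own code: the helper name transpose_lines and the sample rows are hypothetical, and padding short rows with empty fields is an assumption about how ragged input should be handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not Scriptome's implementation): turn rows into
# columns for a list of tab-delimited lines.
sub transpose_lines {
    my @lines = @_;
    my @rows;
    for my $line (@lines) {
        (my $clean = $line) =~ s/\r?\n\z//;       # strip Unix or Windows line ending
        push @rows, [ split /\t/, $clean, -1 ];   # -1 keeps trailing empty fields
    }
    # The widest input row determines how many output rows there are
    my $width = 0;
    for my $row (@rows) {
        $width = @$row if @$row > $width;
    }
    my @out;
    for my $c (0 .. $width - 1) {
        # Assumption: rows shorter than $width are padded with empty fields
        push @out, join "\t", map { defined $_->[$c] ? $_->[$c] : '' } @rows;
    }
    return @out;
}

# Made-up sample table: a header line and two subjects
my @sample = ("id\tage\tweight\n", "s1\t34\t70\n", "s2\t41\t82\n");
print "$_\n" for transpose_lines(@sample);
```

For the sample, the three output lines hold id, s1, s2, then age, 34, 41, then weight, 70, 82, each tab-separated. Unlike the line-by-line tools above, this sketch holds the whole table in memory, which is the price of a transpose.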
Change Tool Examples - 3
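Convert a sequence file from one format to another:
Scriptome -t change_bio_format_to_bio_format LONGhmcad.fst
→ enter ‘fasta’ as input format (no quotes)
→ enter ‘genbank’ as output format (no quotes)
change_bio_format_to_bio_format addresses the common problem of converting formats
Note: the code shown for this tool goes in the opposite direction (genbank in, fasta out); set $informat and $outformat to match your files.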
Important: requires Bioperl to be installed

perl -MBio::SeqIO -e "
  $informat= qq~genbank~;
  $outformat= qq~fasta~;
  $count = 0;
  for $infile (@ARGV) {
    $in = Bio::SeqIO->newFh(-file => $infile , -format => $informat);
    $out = Bio::SeqIO->newFh(-format => $outformat);
    while (<$in>) {
      print $out $_;
      $count++;
    }
  }
  warn qq~Translated $count sequences from $informat to $outformat format\n~
" myseqs.genbank > myseqs.fasta
Conclusions
Scriptome is …
- A good solution for manipulating medium to large data files quickly and reliably
- A way to learn Perl in a “real” context (no toy problems)
- Able to perform a wide range of tasks, from simple, generic file manipulations to bio-specific complex tasks
Resources
- For Perl help, see resources in workshop description in Lane’s Perl Programming for Biologists
- Some recommended titles:
Polling Time: Do you think Scriptome will be useful to your research?
1. Definitely
2. Likely
3. Not likely
4. No way
5. What’s the question again?