Unix for Bioinformaticists: Unix
Tools, Emacs, and Perl
helpdesk at stat.rice.edu
Aug 2004
Some slides are borrowed from Dr. Worley’s
(BCM) presentation.
Do I Have to Know/Use Unix?
Simple answer: no.

Windows can do almost everything.
Complicated answer: yes, if you
are lazy (would like to automate things)
are good at reading manuals and writing scripts
want to make better use of your machine
are as poor as I am (cannot afford pricey
Windows software)
especially if you will be a bioinformaticist
Why Unix Is Useful in
Bioinformatics
Many tasks involve processing large text-based
datasets. Unix tools are in many cases
better than their Windows counterparts.
You may need to use several tools to
accomplish a task. Windows is not particularly
good at gluing them together.
When you need more CPU power, servers and
clusters are usually *nix-based.
Many tools are available only under Unix-like
systems.
Outline
Unix in general
Unix tools
Emacs
Perl
Unix Commands
Single command:
> sort -k1 file.txt
Combine other commands:
> sort -k1 file.txt | grep "Tag=Mouse" > output.txt
Operate on multiple files:
> foreach file (*.txt)
    sort -k1 $file > ${file:r}_sorted.txt
end
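The foreach loop above is csh syntax. A rough Bourne-shell (sh/bash) counterpart is sketched below, assuming a POSIX shell; the sample files a.txt and b.txt, their contents, and the output name mouse.out are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory with two small, made-up input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'zebra Tag=Mouse\napple Tag=Human\nmango Tag=Mouse\n' > a.txt
printf 'pear Tag=Rat\nfig Tag=Mouse\n' > b.txt

# Combine commands with a pipe: sort on field 1, keep the mouse lines.
sort -k1 a.txt | grep "Tag=Mouse" > mouse.out

# Operate on multiple files: ${file%.txt} strips the extension, like csh's :r.
for file in *.txt; do
    sort -k1 "$file" > "${file%.txt}_sorted.txt"
done

mouse_lines=$(cat mouse.out)
b_sorted=$(cat b_sorted.txt)
printf '%s\n' "$mouse_lines"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

The glob *.txt is expanded once, before the loop starts, so the newly written _sorted.txt files are not picked up again.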
More commands
> rename .html .htm *.html
There are many such convenient tools. If you cannot
find one, a short script will do:
> foreach f (*.html)
mv $f $f:r.htm
end
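For comparison, the same bulk rename can be sketched in sh/bash. This assumes a POSIX shell; the file names are invented for the example.

```shell
#!/bin/sh
# Scratch directory with a few made-up files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch index.html about.html notes.txt

# Rename every .html file to .htm; ${f%.html} drops the old extension.
for f in *.html; do
    mv "$f" "${f%.html}.htm"
done

listing=$(ls)
printf '%s\n' "$listing"

cd / && rm -rf "$workdir"
```

Files that do not end in .html (notes.txt here) are left untouched.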
More commands
> wget -r -l1 --no-parent -A.tar.gz -Ppackages \
    http://cran.r-project.org/src/contrib/PACKAGES.html
Downloads all .tar.gz files linked from that page into the
packages directory. This command can do everything that
tools such as Teleport do under Windows.
> convert -rotate 90 file.jpg file.png
Converts a .jpg file to .png format after rotating it 90
degrees.
A shell script: lyx2pdf
#!/bin/csh
set file = $1:r
lyx --export latex $file.lyx
latex $file.tex
dvips -o $file.ps $file.dvi
ps2pdf $file.ps
> lyx2pdf myfile.lyx
A Makefile
%.html: %.tex
	latex2html -local_icons -no_subdir -split 0 $*.tex
%.tex: %.lyx
	lyx2tex $*.lyx
%.dvi: %.tex
	latex $*.tex
%.ps: %.dvi
	dvips -o $*.ps $*.dvi
%.pdf: %.ps
	ps2pdf $*.ps
> make file.dvi
> make file.ps
> make file.pdf
A Perl Script
#!/usr/bin/perl
# read all the things at once
undef $/;
# read in the file and look for /* */
($comm) = <> =~ /.*\/\*(.*)\*\//ms;
# print comments
print $comm, "\n";
crontab
# do not forget to renew your library books
0 0 15 7 * mail [email protected] %subject reminder
Renew all the books!
# back up your files to the server every day at 6 AM
0 6 * * * /usr/local/bin/rsync -avz /home/bpeng thor.stat.rice.edu::backup > logfile
Graphviz
> dot -Tps try.dot -o try.eps
File: try.dot
digraph G
{
A->B->C
B->D->C
}
Useful (and free) tools
Servers: Apache, OpenSSH, OpenLDAP
Web: Mozilla/Firefox, Konqueror, Lynx
Mail clients: Pine, Mutt, Mozilla/Thunderbird, KMail, Evolution
Text processing: teTeX/LyX, OpenOffice, KOffice
Languages: gcc, Perl, Python, gmake, KDevelop
Scientific libraries and tools: GNU Scientific Library,
Biopython, BioPerl, R, Graphviz, gnuplot, Octave
Misc: VNC, wget
Unix text-processing tools
Access to Unix
Mac OS X + developer kit
Linux
Stat and ruf/owlnet servers (Solaris)
Windows + Cygwin
Tools: compared with Excel, they are faster and operate on larger files
Grep, Pipes, Sort, Comm, Diff, Join
Sed: regular expression substitution editor, replaced by Perl in most
contexts
Man: displays the manual pages, with options, for most commands (if
the pages are installed and match the installed version)
Grep
Grab lines that match a text phrase
Only the line that matches
Lines before or after the matched line
Lines that do not match
Piping multiple searches
GenBank Files
Grab the Locus, Definition and
Keyword lines
[Screenshots: phase2.txt.out, temp]
Select Non-Human Definition Lines
and Use Pipe
kworley% grep -v Homo temp | grep DEF
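Specify Lines to Return
grep -1 (one line of context before and after each match)
grep -B1 (one line before each match)
grep -A1 (one line after each match)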
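The grep steps above can be tried on a tiny stand-in file. This is only a sketch: the records are made up and heavily abbreviated, the first grep pattern is an assumption (the slides do not show the command used to build temp), and -A/-B are extensions found in GNU and BSD grep rather than in every grep.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up, heavily abbreviated GenBank-style records.
cat > records.txt <<'EOF'
LOCUS AB000001
DEFINITION Homo sapiens clone 1.
KEYWORDS HTG.
ORIGIN
LOCUS AB000002
DEFINITION Mus musculus clone 2.
KEYWORDS HTG.
ORIGIN
EOF

# Grab only the LOCUS, DEFINITION and KEYWORDS lines (assumed pattern).
grep -E '^(LOCUS|DEFINITION|KEYWORDS)' records.txt > temp

# Drop lines that mention Homo, then keep the definition lines.
non_human=$(grep -v Homo temp | grep DEF)

# -B1 adds the line before each match, -A1 the line after.
before=$(grep -B1 'Mus musculus' temp)
after=$(grep -A1 'Mus musculus' temp)

printf '%s\n' "$non_human"
cd / && rm -rf "$workdir"
```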
Sort
In dictionary (-d), month (-M), or numerical (-n) order
Ignore case (-f)
Specify output file (-o)
Specify the separator between fields (-t)
Unique lines only (-u)
Specify field on which to sort (-k POS1[,POS2]), numbered starting from 1, can
specify which character in the field
(field.char)
Merge more than one sorted file (-m)
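A few of these options in combination, as a minimal sketch. The table genes.tsv and its contents are invented; column 3 is treated as a numeric score.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up tab-separated table: name, chromosome, score (one duplicated row).
printf 'brca2\t13\t7\nTP53\t17\t42\nbrca2\t13\t7\nmyc\t8\t100\n' > genes.tsv
tab=$(printf '\t')

# -t sets the field separator; -k3,3n sorts numerically on field 3 only.
by_score=$(sort -t "$tab" -k3,3n genes.tsv | cut -f1)

# -f ignores case, -u keeps one copy of duplicate lines, -o names the output file.
sort -f -u -o unique.tsv genes.tsv
unique_names=$(cut -f1 unique.tsv)

printf '%s\n' "$by_score"
cd / && rm -rf "$workdir"
```

Without the n modifier the scores would be compared as text, and 100 would sort before 42 and 7.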
Comm
Select or reject lines in common
between two sorted files
Options suppress printing of the corresponding columns:
comm [-123] file1 file2
Column 1 is lines only in file 1
Column 2 is lines only in file 2
Column 3 is lines in both files
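A small sketch of the three columns, using two made-up gene lists that are already sorted. LC_ALL=C is set here only so that comm and the file order agree on collation.

```shell
#!/bin/sh
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up, already sorted lists.
printf 'BRCA1\nMYC\nTP53\n' > list1.txt
printf 'EGFR\nMYC\nTP53\n' > list2.txt

# Suppress columns 2 and 3: lines only in list1.txt.
only_in_1=$(comm -23 list1.txt list2.txt)
# Suppress columns 1 and 3: lines only in list2.txt.
only_in_2=$(comm -13 list1.txt list2.txt)
# Suppress columns 1 and 2: lines present in both files.
in_both=$(comm -12 list1.txt list2.txt)

printf 'only in 1: %s\nonly in 2: %s\n' "$only_in_1" "$only_in_2"
cd / && rm -rf "$workdir"
```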
Diff
Compares two files (or sets of files in a
directory) and outputs the lines that differ
Treat all files as text (-a)
Ignore changes in white space (-b), blank
lines (-B), or case differences (-i)
For directory comparisons
Report only which files differ, not the details (-q)
Compare subdirectories recursively (-r)
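The sketch below tries these options on made-up files and directories (old.txt, new.txt, run1, run2 are all invented names). diff exits with status 0 when it finds no differences and 1 when it finds some.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two files that differ only in spacing and letter case.
printf 'ACGT  ACGT\nhello\n' > old.txt
printf 'ACGT ACGT\nHello\n' > new.txt

# A plain comparison finds differences (status 1).
plain_status=0
diff old.txt new.txt > /dev/null || plain_status=$?

# Ignoring white space changes (-b) and case (-i), the files compare equal (status 0).
relaxed_status=0
diff -b -i old.txt new.txt > /dev/null || relaxed_status=$?

# Directory comparison: -r descends into subdirectories, -q only names the files that differ.
mkdir -p run1/sub run2/sub
echo same > run1/a.txt
echo same > run2/a.txt
echo one > run1/sub/b.txt
echo two > run2/sub/b.txt
brief=$(diff -q -r run1 run2 || true)

printf '%s\n' "$brief"
cd / && rm -rf "$workdir"
```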
Join
Combines lines from two files based on
a common field (-1 field -2 field)
Specify the fields from each file and the
order to output (-o file_number.field
file_number.field file_number.field)
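A minimal sketch of join on two made-up files that share an id in field 1. Both inputs must already be sorted on the join field; LC_ALL=C is set so the sort order is unambiguous.

```shell
#!/bin/sh
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up inputs: id + gene name, and id + chromosome.
printf 'g1 BRCA1\ng2 MYC\ng3 TP53\n' > names.txt
printf 'g1 17\ng3 17\ng4 7\n' > chrom.txt

# Join on field 1 of each file; only ids present in both files are printed.
joined=$(join -1 1 -2 1 names.txt chrom.txt)

# -o chooses the output fields as file_number.field, in the order given.
picked=$(join -1 1 -2 1 -o 1.2,2.2 names.txt chrom.txt)

printf '%s\n' "$joined"
cd / && rm -rf "$workdir"
```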
What is Emacs?
A Unix text editor with additional
functionality
Column functions
Settings for DNA mode
Settings for programming mode
Seamless integration with MATLAB, R, S-PLUS, SAS etc.
Emacs Demonstrations
Search and replace
By query
All
New lines
Counting things
Column functions
Select
Kill
Copy
Paste
Query replace
Esc %
Enter the phrase to replace, then the replacement phrase
To enter a newline in either phrase, type Control-Q Control-J
Answer y or n at each match
Type ! to replace all remaining matches
[Screenshots: starting file, query replace, end file]
Rectangle functions
Mark, select rectangle
Control-x r, followed by:
r a to copy the rectangle to register a
k to kill the rectangle
i a to insert the rectangle previously stored in register a
[Screenshots: select rectangle, kill; select rectangle, mark, insert]
What is Perl?
A general purpose programming language.
Invented to replace awk, sed, and sh.
A scripting language.
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
“There is more than one way to do it” TIMTOWTDI
How to Use Perl
Perl “scripts” (programs) are text and are interpreted by
the perl program.
TIMTOWTDI:
You can put the script on the command line:
>perl -e 'print "Hello, world!\n";'
You can pass it as an argument to perl:
>perl my_program.pl
You can make the script self-executing:
>my_program.pl
print, ", ', \n
'print "Hello, world!\n";'
In most programming languages, "print" means
"display" or "output".
The single and double quote characters ( " ' ) are used
to set apart blocks of "text". In this example, the
single quotes set apart the perl script, and the double
quotes set apart the text to display. (Perl has other
ways to quote.)
The backslash, '\', is used to change the meaning of a
character, e.g. to generate special characters. \n
means "start a new line" (like pressing Return or
Enter).
Example of a One Liner
(Thanks to Dr. Wheeler)
perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt >
blast_tbl_out.txt
perl -nle
'@f=split/\t/;
print if ($f[2] > 95);'
blast_tbl_in.txt >blast_tbl_out.txt
A One Liner: TIMTOWTDI
1. perl -nle
'@f=split/\t/; print if ($f[2] > 95 );'
blast_tbl_in.txt > blast_tbl_out1.txt
2. perl -ne
'@f=split; print if ($f[2] > 95 );'
blast_tbl_in.txt > blast_tbl_out2.txt
3. perl -ane
'print if ($F[2] > 95 );'
blast_tbl_in.txt > blast_tbl_out3.txt
split, if, variables
@f=split/\t/; print if ($f[2] > 95);
split is a function. It can be written with parens like in
most languages, and takes UP TO three arguments:
split( where_to_split, what_to_split, how_many_to_split)
split, like many Perl statements, uses defaults for missing
arguments.
Special characters mark @whole_arrays,
$array_members[1], %whole_hashes,
$hash_members{'one'}, $simple_variables.
if acts like its common English meaning. It can go before a
block or at the end of a statement (as above).
Perl converts between numbers and text. '>' is a numeric
operator so 95 and $f[2] are treated as numbers. If gt
replaced >, they would be treated as strings.
FASTA to XML
perl -pi.bak -e
's"^>(.*)$"</seq><title>\1</title><seq>";'
test.fa
[localhost:~/test] steffen% ls
test.fa
test.fa.bak
[localhost:~/test] steffen% perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa
test.fa.bak
[localhost:~/test] steffen% more test.fa
</seq><title>CSTAP1E0101A</title><seq>
gttgcctgcgtcttcggxaacaacgtagttctcagGCCGCCCGACCAGGT
ACTTTTTTGCTTTTTTTTTTTTTATTTTTTACAAATTATCAAAAGTTCTT
GTGCTTTCAGGAGCGATTAACATTCTCATGGGCCATACCCTTGTCAGGTT
TCATAAACTAAGTTAGATGGACCTGCTTGGTATTGTGGTGGAAGACCTCC
AAGAAAACAAAGTCCCGGAATCTCAACGTCCTCTGTCTTCTGGCATTTCA
TCTTCAAGAAACAATGTCTTATAGTTATTATTGCATGTTTTGGGAGGTTA
AAGGGTAAAGTTTGTAATGCCTTGACTAAAAACTTCCAGTTGTTATGGTG
cacaacaatttttggtatgctaacttatacttgtgcctaatccttaagga
aaagaaagagccatatacctaaaactgactttatttttcaaaaggta
</seq><title>CSTAP1E0102A</title><seq>
tttttgctggcgaactatcaggagactacagxaactacttttcagtxcga
actcacatcatcactggccgtcgttttacaacgtcgtgattgggaaaacc
ctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagc
tggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcg
cagcctgaatggcgaatggcgcctgatgcggtattttctccttacgcttt
caatgatgagcacttxtaaaggtctgx
</seq><title>CSTAP1E0103A</title><seq>
atttgagcagcatctattgaaaactaxcgxagxtcttcaggcgcgCCCAC
CCGAGGTACTACCAAGCCAGTGTCCTGCCCGGTTTTAAGCCCTCGTCCTC
TCCCTTCGCTCTCCTCCAAACTGAGCAGCATTAGTTCCACAAGCACAGAA
GTTAAACGAAAAACTGTCTTGCTCCACGGTCTCCTACAGTAGAATGCTGG
ATAATAATGCTTTCAGAAGCCACTTCTACAACCAGAACATTCTGACCACC
ACAATCATCAGGTTTACACACACCCTACGAAACACTAGCGAGTTAACAAG
actgatgaactacttgcagtcgaactccaatcattactggccgtcgtttt
aa
Executing a Perl Script in a File
$line = <>;
$line =~ s">(.*)"<title>\1</title><seq>";
print $line;
while( $line = <> ) {
$line =~ s">(.*)"</seq><title>\1</title><seq>";
print $line;
}
print "</seq>\n";
File Reading, Binding, while
$line = <>;
<> reads one line from the "current file"
$line =~ s">(.*)"<title>\1</title><seq>";
=~ applies the substitution to the preceding variable ($line)
instead of the default "current line" (Binding)
while( $line = <> ) {
print $line;
}
Repeats the statements between { and } while there is
another line.
Self-executing Perl Scripts
You need to know the path to your Perl program:
>which perl
/usr/bin/perl
The first line of your script must be:
#!/usr/bin/perl
Permissions need to allow execution
>chmod 755 my_program.pl
FASTA to XML Fleshed Out
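#!/usr/bin/perl
#
# fasta2xml by David Steffen 6/2/2004
# - Converts fasta file to mini-xml format
use strict;
use warnings;

my $inpfile = @ARGV ? shift( @ARGV ) : '';
# The input name must end in .fa; the part before .fa names the output file.
my ($basefile) = $inpfile =~ m/^(.*)\.fa$/;
if( not defined $basefile ) {
    die( "Input file, $inpfile, must be a fasta file and end in .fa\n" );
}
my $outfile = $basefile . '.xml';
# Three-argument open keeps the file name separate from the open mode.
open( my $inp, '<', $inpfile ) or die( "Can't open $inpfile: $!\n" );
open( my $out, '>', $outfile ) or die( "Can't open $outfile: $!\n" );
# The first header only opens a record; later headers also close the previous one.
my $line = <$inp>;
$line =~ s">(.*)"<title>$1</title><seq>";
print $out $line;
while( $line = <$inp> ) {
    $line =~ s">(.*)"</seq><title>$1</title><seq>";
    print $out $line;
}
print $out "</seq>\n";
close( $out ) or die( "Can't close $outfile: $!\n" );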
Running Other Programs from Perl
$files = `ls`;
The "backtick" (` `) characters execute the text in between
as a command to the operating system, returning the
output of that command (e.g. to the $files variable).
$error = system( "mv $file ${basefile}.abi" );
The system statement executes its argument as a
command to the operating system, returning the command's
STATUS (zero on success, nonzero on failure), not its
output. (Output and error messages are printed as
usual.) There are other, subtle differences between ` `
and system.