BINF 634 Bioinformatics Programming
Instructor: Jeff Solka Ph.D.
Office: Room 312C OB
Phone: 540-809-9799
Email: [email protected]
Office Hours: By appointment
Required texts:
  Beginning Perl for Bioinformatics by Tisdall and Waliszewski
  Programming Perl (3rd Edition) by Wall, Christiansen and Orwant
Course Meeting Place: Occoquan Prince William Rm. 304B
Course Meeting Times: M: 4:30 pm – 7:10 pm
Course webpage: http://binf.gmu.edu/~jsolka/fall13/binf634/Fall_2013BINF_634_Syllabus_rev1.html
BINF634 FALL013 LECTURE 1 1
Acknowledgements
Some of the material used in this course was previously developed by
John Grefenstette
John Kopecky
Experimental Biology Computational Biology and Bioinformatics
[Figure: workflow diagram. Computational Biology track: Problem Statement, Simulation, Results, SIMS. Experimental Biology track: Problem Statement, Experiment, Results, LIMS. The two tracks are linked through Analysis Tools and a Database. Credit: Rick Stevens]
Bioinformatics Programming Tasks
• Manage large experimental data sets
  • Sequence data
  • Microarray data (gene expression)
  • Mass spec data (proteomics)
  • Genotype project data (HapMap)
  • Clinical data
• Build tools for Knowledge Discovery
  • Find motifs in sequence data
  • Data clustering
  • Visualization
• Build analysis pipelines
  • Glue several analysis steps together into a single automated process
  • "Munge" data: take data from one application or database and format it for input to another application or database
Objectives
• Programming skills
  • Problem solving and Debugging
  • Reading and Writing Documentation
  • Data Munging: Data filtering and transformation
  • Pattern matching and data mining
  • Visualization and web presentation
  • Object-oriented programming
• Bioinformatics skills
  • Biological sequence analysis
  • Interacting with biological databases
  • Using Bioperl
Background and Prerequisites
• Molecular Biology
  • BIOL 482 or similar course
  • Recombinant DNA - Watson, Gilman, Witkowski, Zoller
    http://www.amazon.com/Recombinant-DNA-Genes-Genomes-Course/dp/0716728664/ref=dp_ob_title_bk
  • Online Tutorials
    http://www.biology-online.org/1/5_DNA.htm
• Computer Science
  • IT 108, CS 112 or similar
  • Previous programming experience
Course Policies
• Programming assignments (50%)
  • 3-4 graded programming assignments
• Exams: Midterm (20%) and Final (20%)
  • May include both closed-book section and open-book programming problems
• In-class Quizzes (10%)
• Weekly homework assignments
  • All HW assignments must be submitted to me via email by the beginning of the next class.
  • HW assignments will not be graded individually, but you may be called upon to discuss your work during the next class. Therefore, late assignments will not be accepted.
• Grading will be on the following scale: 93-100 (A), 90-92 (A-), 87-89 (B+), 83-86 (B), 80-82 (B-), 77-79 (C+), 73-76 (C), Below 70 F. Student averages will be rounded to the closest integer to determine final letter grades.
• Keep an eye on the webpage
  http://binf.gmu.edu/~jsolka/fall13/binf634/Fall_2013BINF_634_Syllabus_rev1.html
Honor Code Policies
I take honor code violations very seriously. Programming assignments must be your work. Each assignment will specify whether you may use code from other sources. Any material you take from another source must be acknowledged within the program documentation. You must read and understand the honor code handout. Violations of the honor code WILL be referred to the Honor Council.
All students must adhere to the GMU Honor Code. See: http://honorcode.gmu.edu/
Pragmatics
• Assignments and Announcements
  • Will be posted on the course webpage; check daily
  • Class email will be sent to your email address from Patriot Web
• Accounts
  • You should have an account on the server binf.gmu.edu
  • Systems administrator: Chris Ryan, [email protected]
• Accessing perl:
  • Login from Rooms 304B or 320
  • Login from off-campus using ssh
    • Go to ftp://ftp.ssh.com/pub/ssh/ for academic Windows client
    • Alternatively go to http://www.chiark.greenend.org.uk/~sgtatham/putty/
  • Install perl on your own computer -- see textbooks and backup slide materials
Pragmatics
• Unix
  • This class will focus on using the Unix operating system
  • We will be using Mac OS X (at least in the classroom)
  • There are numerous UNIX tutorials
    http://www.unixtools.com/tutorials.html
• Text Editors
  • Perl programs are stored in plain text files
  • I recommend emacs or vim for a Unix text editor (see links for Windows support)
    http://www.claremontmckenna.edu/math/ALee/emacs/emacs.html
    http://www.vim.org
  • If you are interested in an integrated development environment I recommend Eclipse (see backup slides)
    www.eclipse.org
  • There are tutorials for each online
    http://www.gnu.org/software/emacs/tour/
    http://www.yolinux.com/TUTORIALS/LinuxTutorialAdvanced_vi.html
Review: Molecular Biology
• Life evolved from a common origin about 3.5 billion years ago
• All life shares similar biochemistry
  • Proteins: active elements
  • Nucleic acids: informational elements
• Molecular Biology: the study of structure and function of proteins and nucleic acids
Proteins
• Functions:
  • Structural proteins
  • Enzymes
  • Transport
  • Antibody defense
• Structure:
  • Chains of amino acids
  • Typical size ~300 residues
  • Range from about 100 to over 5000 residues
N.B. – A residue is one of the 20 building blocks of proteins, also called an amino acid.
DNA
• Double stranded
• Four bases: adenine (A), cytosine (C), guanine (G) and thymine (T)
  • A and G are purines
  • C and T are pyrimidines
• A always paired with T (complementary)
• C always paired with G (complementary)
  => Watson-Crick base pairs (bp)
• DNA may consist of hundreds of millions of bp
• A short sequence (<100) is called an oligonucleotide
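The base-pairing rules on this slide map directly onto Perl's tr/// operator. The following is a minimal sketch, not code from the course materials; the subroutine name reverse_complement and the sample sequence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the reverse complement of a DNA string.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context: reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # A<->T and C<->G (Watson-Crick pairs)
    return $rc;
}

my $dna = 'ATGCCC';                  # made-up example sequence
print "Original:           $dna\n";
print "Reverse complement: ", reverse_complement($dna), "\n";
```

The strand is reversed as well as complemented because the two strands of a double helix run in opposite directions.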
• introns are not translated
• exons are translated
RNA:
• single stranded
• uses U (uracil) instead of T
• less stable than DNA
• also used in functional molecules (e.g. rRNA, tRNA)
  • rRNA = ribosomal RNA
  • tRNA = transfer RNA
• important regulatory functions (siRNA)
  • siRNA = small interfering RNA
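Since RNA uses U where DNA uses T, turning a coding-strand DNA string into the corresponding RNA string is a one-character substitution. This is a small sketch under that simplification (it ignores introns and strand choice); the name transcribe and the sample sequence are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite a coding-strand DNA string as RNA (T -> U).
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;   # copy first, then translate the copy
    return $rna;
}

my $dna = 'ATGGCCTAA';               # made-up example sequence
print "DNA: $dna\n";
print "RNA: ", transcribe($dna), "\n";
```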
Translation
• Translation involves mRNA and ribosomes
• Ribosomes are made of protein and ribosomal RNA (rRNA)
• Transfer RNA (tRNA) makes the connection between specific codons in mRNA and amino acids
• As tRNA binds to the next codon in mRNA, its amino acid is bound to the last amino acid in the protein chain
• When a STOP codon is encountered, the ribosome releases the mRNA and synthesis ends
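The codon-by-codon process described above can be mimicked with a hash that maps codons to one-letter amino acid codes. This is only a sketch: the table below is deliberately partial (a full table has 64 entries), and the names %codon_table and translate are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial codon table for illustration only; '*' marks a STOP codon.
my %codon_table = (
    AUG => 'M', GCC => 'A', AAA => 'K', UGG => 'W',
    UAA => '*', UAG => '*', UGA => '*',
);

# Hypothetical helper: read an mRNA string three bases at a time.
sub translate {
    my ($mrna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($mrna); $i += 3) {
        my $codon = substr($mrna, $i, 3);
        my $aa = $codon_table{$codon};
        $aa = 'X' unless defined $aa;   # codon missing from this toy table
        last if $aa eq '*';             # STOP codon: synthesis ends
        $protein .= $aa;                # extend the protein chain
    }
    return $protein;
}

print translate('AUGGCCAAAUGGUAAGCC'), "\n";
```

Anything after the first in-frame STOP codon is ignored, which mirrors the ribosome releasing the mRNA.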
DNA Structure
DNA contains:
• Genes
  "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions". [1]
• Promoters
  "a promoter is a region of DNA that facilitates the transcription of a particular gene"
• Non-coding regions
  DNA which does not contain instructions for making proteins
• Reading frames
  An open reading frame (ORF): a contiguous sequence of DNA starting at a start codon and ending at a STOP codon
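The ORF definition above (start codon, whole codons, STOP codon) can be written as a regular expression. The sketch below is a simplification: it scans only the forward strand, left to right, and reports non-overlapping matches. The name find_orfs and the test sequence are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return ORFs found on the forward strand of $dna.
sub find_orfs {
    my ($dna) = @_;
    my @orfs;
    # ATG, then as few whole codons as possible, then the first in-frame STOP
    while ($dna =~ /(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA))/g) {
        push @orfs, $1;
    }
    return @orfs;
}

my @orfs = find_orfs('CCATGAAATAGGGATGCCCTGA');   # made-up sequence
print "$_\n" for @orfs;
```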
Shotgun DNA Sequencing
More discussion can be found here: http://en.wikipedia.org/wiki/Shotgun_sequencing
Sequence Files -- FASTA Format
>gi|40457238|HIV-1 isolate 97KE128 from Kenya gag gene, partial cds
CTTTTGAATGCATGGGTAAAAGTAATAGAAGAAAGAGGTTTCAGTCCAGAAGTAATACCCATGTTCTCAG
CATTATCAGAAGGAGCCACCCCACAAGATTTAAATACGATGCTGAACATAGTGGGGGGACACCAGGCAGC
TATGCAAATGCTAAAGGATACCATCAATGAGGAAGCTGCAGAATGGGACAGGTTACATCCAGTACATGCA
GGGCCTATTCCGCCAGGCCAGATGAGAGAACCAAGGGGAAGTGACATAGCAGGAACTACTAGTACCCCTC
AAGAACAAGTAGGATGGATGACAAACAATCCACCTATCCCAGTGGGAGACATCTATAAAAGATGGATCAT
CCTGGGCTTAAATAAAATAGTAAGAATGTATAGCCCTGTTAGCATTTTGGACATAAAACAAGGGCCAAAA
GAACCCTTTAGAGACTATGTAGATAGGTTCTTTAAAACTCTCAGAGCCGAACAAGCTT
>gi|40457236| HIV-1 isolate 97KE127 from Kenya gag gene, partial cds
TTGAATGCATGGGTGAAAGTAATAGAAGAAAAGGCTTTCAGCCCAGAAGTAATACCCATGTTCTCAGCAT
TATCAGAAGGAGCCACCCCACAAGATTTAAATATGATGCTGAATATAGTGGGGGGACACCAGGCAGCTAT
GCAAATGTTAAAAGATACCATCAATGAGGAAGCTGCAGAATGGGACAGGTTACATCCAATACATGCAGGG
CCTATTCCACCAGGCCAAATGAGAGAACCAAGGGGAAGTGACATAGCAGGAACTACTAGTACCCCTCAAG
AGCAAATAGGATGGATGACAAGCAACCCACCTATCCCAGTGGGAGACATCTATAAAAGATGGATAATCCT
GGGATTAAATAAAATAGTAAGAATGTATAGCCCTGTTAGCATTTTGGACATAAAACAAGGGCCAAAAGAA
CCTTTCAGAGACTATGTAGATAGGTTTTTTAAAACTCTCAGAGCCGAACAAGCTT
>gi|40457234| HIV-1 isolate 97KE126 from Kenya gag gene, partial cds
CCTTTGAATGCATGGGTGAAAGTAATAGAAGAAAAGGCTTTCAGCCCAGAAGTAATACCCATGTTTTCAG
CATTATCAGAAGGAGCCACCCCACAAGATTTAAATATGATGCTGAACATAGTGGGGGGGCACCAGGCAGC
TATGCAAATGTTAAAAGATACCATCAATGAGGAAGCTGCAGAATGGGACAGGCTACATCCAGCACAGGCA
GGGCCTATTGCACCAGGCCAGATAAGAGAACCAAGGGGAAGTGATATAGCAGGAACTACTAGTACCCCTC
AAGAACAAATAGCATGGATGACAGGCAACCCGCCTATCCCAGTGGGAGACATCTATAAAAGATGGATAAT
CCTGGGATTAAATAAAATAGTAAGAATGTATAGCCCTGTTAGCATTTTGGATATAAAACAAGGGCCAAAA
GAACCATTCAGAGACTATGTAGACAGGTTCTTTAAAACTCTCAGAGCCGAACAAGCTT
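In a FASTA file each record starts with a ">" header line, and the sequence may be wrapped over several lines. The sketch below shows one possible way to collect records into a hash keyed by the first word of the header. The name parse_fasta and the two toy records are assumptions for illustration; a real script would read from a file instead of a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text into (id => sequence) pairs.
sub parse_fasta {
    my ($text) = @_;
    my (%seq_for, $id);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;          # skip blank lines
        if ($line =~ /^>(\S+)/) {          # header: id is the first word
            $id = $1;
            $seq_for{$id} = '';
        } elsif (defined $id) {
            $line =~ s/\s+//g;             # drop stray whitespace
            $seq_for{$id} .= $line;        # join wrapped sequence lines
        }
    }
    return %seq_for;
}

# Toy input, not real data
my $fasta = <<'END';
>seq1 toy record
ATGC
GGTA
>seq2 another toy record
TTAA
END

my %seqs = parse_fasta($fasta);
for my $id (sort keys %seqs) {
    printf "%s\t%d bp\n", $id, length($seqs{$id});
}
```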
GenBank Record
LOCUS       AK091721     2234 bp    mRNA    linear   PRI 20-JAN-2006
DEFINITION  Homo sapiens cDNA FLJ34402 fis, clone HCHON2001505.
ACCESSION   AK091721
VERSION     AK091721.1  GI:21750158
KEYWORDS    oligo capping; fis (full insert sequence).
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini;
            Hominidae; Homo.
  TITLE     Complete sequencing and characterization of 21,243 full-length
            human cDNAs
  JOURNAL   Nat. Genet. 36 (1), 40-45 (2004)
FEATURES             Location/Qualifiers
     source          1..2234
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
     CDS             529..1995
                     /note="unnamed protein product"
                     /codon_start=1
                     /protein_id="BAC03731.1"
                     /db_xref="GI:21750159"
                     /translation="MVAERSPARSPGSWLFPGLWLLVLSGPGGLLRAQEQPSCRRAFD
                     ...
                     RLDALWALLRRQYDRVSLMRPQEGDEGRCINFSRVPSQ"
ORIGIN
        1 gttttcggag tgcggaggga gttggggccg ccggaggaga agagtctcca ctcctagttt
        ...
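Because each GenBank field starts with a keyword at the beginning of a line, simple fields can be pulled out with anchored regular expressions. The sketch below uses the first four lines of the record on this slide; the variable names are illustrative, and a production script would more likely rely on a dedicated parser such as the ones in Bioperl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First lines of the record shown on the slide
my $record = <<'END';
LOCUS       AK091721     2234 bp    mRNA    linear   PRI 20-JAN-2006
DEFINITION  Homo sapiens cDNA FLJ34402 fis, clone HCHON2001505.
ACCESSION   AK091721
VERSION     AK091721.1  GI:21750158
END

# The /m modifier lets ^ and $ match at the start and end of every line
my ($accession)  = $record =~ /^ACCESSION\s+(\S+)/m;
my ($length)     = $record =~ /^LOCUS\s+\S+\s+(\d+)\s+bp/m;
my ($definition) = $record =~ /^DEFINITION\s+(.+)$/m;

print "Accession:  $accession\n";
print "Length:     $length bp\n";
print "Definition: $definition\n";
```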
Why Perl?
• Widely used in Bioinformatics
  • Bioperl http://www.bioperl.org/wiki/Main_Page
• Ease of Programming
  • Excellent pattern matching features
  • Good for gluing other programs together
  • Easy to learn (enough to get started)
• Rapid Prototyping
  • Few lines of code needed for many problems
  • One-liners
• Portability
  • Runs on Unix, Windows, Macs
• Open Source Culture
  • Many sources of help (try: % perldoc perldoc)
    % perldoc -f print
    http://perldoc.perl.org/index-tutorials.html
  • Many sources of useful modules (http://www.cpan.org/)
Variables
The types of Perl variables are indicated by the initial symbol:
$var stores a scalar (a single string or number)
  $x = 10;
  $s = "ATTGCGT";
  $x = 3.1417;
@var stores an array (a list of values)
  @a = (10, 20, 30);
  @a = (100, $x, "Jones", $s);
  print "@a\n";    # prints "100 3.1417 Jones ATTGCGT"
%var stores a hash (associative array)
  %ages = ( John => 30, Mary => 22, Lakshmi => 27 );
  print $ages{"Mary"}, "\n";    # prints 22
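Hashes come up constantly in later material, so here is a slightly longer sketch of common hash operations. A hash is initialized from a parenthesized list, and a single value is read with the $ sigil and curly braces. The extra name "Wei" and its value are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %ages = ( John => 30, Mary => 22, Lakshmi => 27 );

$ages{Wei} = 41;                      # add a key (hypothetical entry)
delete $ages{John};                   # remove a key

# keys returns the keys in no particular order, so sort them for display
for my $name (sort keys %ages) {
    print "$name is $ages{$name}\n";
}

print "Mary is listed\n" if exists $ages{Mary};
print "Entries: ", scalar(keys %ages), "\n";
```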
Declaring Variables
use strict;
Putting use strict; at the top of your programs will tell perl to slap your hands with a fatal error whenever you break certain rules.
• Requires us to declare all variables
• Avoids creating variables by typos
• Variables may be declared using my, our or local
• For now, we only need to use my:
  my $a;              # value of $a is undef
  my ($a, $b, $c);    # $a, $b, $c are all undef
  my @array;          # value of @array is ()
Can combine declaration and initialization:
  my @array = qw/A list of words/;
  my $a = "A string";
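To see what "avoids creating variables by typos" means in practice, the sketch below compiles a misspelled assignment inside a string eval, so the error can be caught and printed instead of stopping the script. In a normal program the same typo would abort compilation. The variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = 'ATG';

# The string is compiled at run time under the enclosing "use strict",
# so the misspelled (undeclared) variable is rejected and eval returns undef.
my $ok = eval q{ $sequense = 'CCC'; 1 };

print "Sequence is still $sequence\n";
print "strict caught the typo:\n$@" unless $ok;
```

Without use strict, the misspelled name would silently create a new global variable and $sequence would keep its old value with no warning about why.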
How Things Can Go Wrong
Come back and examine this after we have discussed references.
http://www.perlmonks.org/?node_id=269642
Scalar and List Context
All operations in Perl are evaluated in either scalar or list context, and may behave differently depending on context
  @array = ('one', 'two', 'three');
  $a = @array;                  # scalar context for assignment, returns size
  print $a;                     # prints 3
  ($a) = @array;                # list context for assignment
  print $a;                     # prints 'one'
  ($a, $b) = @array;
  print "$a, $b";               # prints 'one, two'
  ($a, $b, $c, $d) = @array;    # $d is undefined
In computer science a list is an ordered collection of values
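A few more places where context decides what an array yields are sketched below. The gene names are reused from the array slides; the variable names are otherwise arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @genes = qw(EGF1 TFEC CFTR);

my $count   = @genes;            # scalar context: number of elements
my ($first) = @genes;            # list context: first element only
my $explicit = scalar(@genes);   # scalar() forces scalar context

# The concatenation operator imposes scalar context on @genes
print "There are " . @genes . " genes\n";

# Inside double quotes the elements are interpolated, separated by spaces
print "Genes: @genes\n";

print "count=$count first=$first explicit=$explicit\n";
```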
String Operations
Ways to concatenate strings
  $DNA1 = "ATG";
  $DNA2 = "CCC";
  $DNA3 = $DNA1 . $DNA2;      # concatenation operator
  $DNA3 = "$DNA1$DNA2";       # string interpolation
  print "$DNA3";              # prints ATGCCC
  $DNA3 = '$DNA1$DNA2';       # no string interpolation
  print "$DNA3";              # prints $DNA1$DNA2
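Beyond concatenation, a handful of built-in string functions cover most sequence handling. The following sketch goes a little past what the slide shows; the sequence and variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'ATG' . 'CCC';             # concatenation gives ATGCCC

my $len    = length($dna);           # number of characters
my $codon  = substr($dna, 0, 3);     # 3 characters starting at offset 0
my $lower  = lc($dna);               # lowercase copy
my $repeat = 'AT' x 3;               # repetition operator
$dna .= 'TAA';                       # append in place

print "length=$len codon=$codon lower=$lower repeat=$repeat dna=$dna\n";
```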
Arrays
An array stores an ordered list of scalars:
  @gene_array = ('EGF1', 'TFEC', 'CFTR', 'LOC1691');
  print "@gene_array\n";
Output:
  EGF1 TFEC CFTR LOC1691
# there's more than one way to do it (see previous slide on declaring variables)
  @gene_array = qw/EGF1 TFEC CFTR LOC1691/;
http://www.perlmeme.org/howtos/perlfunc/qw_function.html
Arrays
An array stores an ordered list of scalars:
  @a = ('one', 'two', 'three', 'four');
The array is indexed by integers starting with 0:
  print "$a[1] $a[0] $a[3]\n";
prints: two one four
Notice: $a[$i] is a scalar, since we used the $ sigil to address a single element of the array
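Arrays can also grow and shrink at either end. The sketch below shows the usual built-ins on the gene list from the earlier slide; 'BRCA1' is an extra name added only for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @genes = qw(EGF1 TFEC CFTR LOC1691);

push @genes, 'BRCA1';          # add to the end (hypothetical extra gene)
my $last  = pop @genes;        # remove from the end
my $first = shift @genes;      # remove from the front
unshift @genes, $first;        # put it back at the front

print "Last index: $#genes\n";             # highest valid index
print join(', ', @genes), "\n";            # elements joined into one string

for my $i (0 .. $#genes) {                 # loop over the indices
    print "$i: $genes[$i]\n";
}
```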
Unix Commands I
cat --- for creating and displaying short files
chmod --- change permissions
cd --- change directory
cp --- for copying files
date --- display date
echo --- echo argument
ftp --- connect to a remote machine to download or upload files
grep --- search file
head --- display first part of file
ls --- see what files you have
lpr --- standard print command
more --- use to read files
mkdir --- create directory
mv --- for moving and renaming files
Unix Commands II
pwd --- find out what directory you are in
rm --- remove a file
rmdir --- remove directory
setenv --- set an environment variable
sort --- sort file
tail --- display last part of file
tar --- create an archive, add or extract files
ssh --- log in to another machine
wc --- count characters, words, lines
This site has a nice reference card:
http://www.digilife.be/quickreferences/QRC/UNIX%20commands%20reference%20card.pdf
chmod and tar
chmod There is a nice tutorial here http://www.perlfect.com/articles/chmod.shtml
tar There is a nice tutorial here http://www.apl.jhu.edu/Misc/Unix-info/tar/tar_2.html
Running perl on binf.gmu.edu
% ssh binf.gmu.edu
Password: ******
-- Create binf634 directory (don't type stuff in red)
% mkdir binf634
% cd binf634
% ls
-- Copy a file to current directory
-- (the "." means "current directory")
% cp ~jsolka/public_html/fall13/binf634/bookcode/examples/example4-1.pl .
% ls
% ls -l
% l
Running perl on binf.gmu.edu
% cat example4-1.pl
#!/usr/bin/perl -w
# Example 4-1 Storing DNA in a variable, and printing it out

# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen
print $DNA;

# Finally, we'll specifically tell the program to exit.
exit;

-- Changing permissions
% chmod 755 example4-1.pl
-- Running a perl script
% example4-1.pl
Editing a Perl Script
-- Read the Emacs or vi tutorial.
-- Make a copy and edit the copy
% cp example4-1.pl first.pl
% l
% e first.pl
-- 1. Change 'print $DNA;' to 'print $DNA, "\n";'
-- 2. Now add a comment: # Author: your name
% cat first.pl
#!/usr/bin/perl -w
# Author: Jeff Solka
# Example 4-1 Storing DNA in a variable, and printing it out

# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen
print $DNA, "\n";

# Finally, we'll specifically tell the program to exit.
exit;
For Next Week
Read Tisdall chapters 1-5.
Be ready to ask questions
Be ready to answer questions
HW 1: Write programs as described in the following exercises from "Beginning Perl for Bioinformatics" by Tisdall: 4.3, 4.4, 4.5, 5.2, 5.4 and 5.6
For each exercise, create a perl script called exX.Y.pl, for example, ex4.3.pl for the first exercise.
Email me the assignments at [email protected]
Use the following format: Binf634.initialoffirstname.lastname.ex.4.3
No class next week because of Labor Day
Some of the Details
Alternative Development Environments
What is Eclipse?
Eclipse is a multi-language software development platform comprising an IDE and a plug-in system to extend it. It is written primarily in Java and is used to develop applications in this language and, by means of the various plug-ins, in other languages as well: C/C++, Cobol, Python, Perl, PHP and more. The initial codebase originated from VisualAge. [1]
In its default form it is meant for Java developers, consisting of the Java Development Tools (JDT). Users can extend its capabilities by installing plug-ins written for the Eclipse software framework, such as development toolkits for other programming languages, and can write and contribute their own plug-in modules. Language packs provide translations into over a dozen natural languages. [2]
Released under the terms of the Eclipse Public License, Eclipse is free and open source software.
http://en.wikipedia.org/wiki/Eclipse_(software)
What Operating Systems Does Eclipse Run Under?
• LINUX
• MAC OSX
• WINDOWS
  • XP
  • Vista
Languages Supported by the Eclipse IDE
• JAVA
  • Out of the box
• PERL
  • Via EPIC library
  • Note one must also have a PERL interpreter installed
• PYTHON
  • Via PyDev library
  • Note one must also have a PYTHON interpreter installed
Advantages and Disadvantages of the Eclipse Development Environment
• Advantages
  • Support for a plethora of languages
  • Industrial strength
  • Used by many professional software developers
  • Has support for configuration management
• Disadvantages
  • Can be slow when developing in languages other than JAVA (this may be merely anecdotal evidence)
Installing Eclipse Under Windows XP - I
First make sure that you have a Java Runtime Environment installed

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Owner>java -version
java version "1.5.0_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_05-b05)
Java HotSpot(TM) Client VM (build 1.5.0_05-b05, mixed mode)

C:\Documents and Settings\Owner>

If you don't have a JRE installed go to
http://www.oracle.com/technetwork/java/javase/downloads/java-se-jre-7-download-432155.html
Installing Eclipse Under Windows XP - II
• Obtain the Eclipse zipped file from the Eclipse downloads link at http://www.eclipse.org/downloads/
  • I believe that I chose this one: Eclipse IDE for Java Developers (85 MB)
  • I think that the current version of Eclipse is 4.3
• Unzip it into an eclipse folder under your Windows Program Files directory
  • In my case here: C:\Program Files\eclipse
• Note that Eclipse does not modify your system's registry
Installing Eclipse Under Windows XP - III
• Once installed (unzipped), double click on the eclipse.exe icon
• There is a "hello world" Java tutorial
• There are a number of other tutorials
  • Eclipse3-1.pdf (http://www.cs.umanitoba.ca/~eclipse/Eclipse3-1.pdf)
Downloading ActiveState's ActivePerl
• Go here and click on the Windows download link
  http://www.activestate.com/activeperl/
• I previously set this up using version 5.10
• Use this self-extracting binary to install the program
• This takes a long time (30 minutes or more, go enjoy your favorite beverage)
Installing the Eclipse EPIC Library
This is my synopsis of this EPIC webpage tutorial http://www.epic-ide.org/download.php
This is also a helpful site http://www.epic-ide.org/faq.php
• Under Eclipse use the Help -> Software Updates menu
• Switch to the Available Software tab
• Choose Add Site and enter
  http://e-p-i-c.sf.net/updates
• Tick the newly created site and click the Install button
Creating Your First PERL Program Under the Eclipse IDE - I
• Under Eclipse go to Window -> Open Perspective -> Other
  • Choose PERL
• Under Eclipse go to Window -> Preferences
  • Click on the PERL + and enter the full path to the ActiveState PERL executable
  • In my case it is "C:\Perl\bin\perl5.10.0.exe"
Creating Your First PERL Program Under the Eclipse IDE - II
• Click on File -> New PERL Project
  • Call it something like HelloWorld
• Click on File -> New PERL File
  • Call it something like HelloWorldPerl
  • Left click on this file symbol and make sure its extension is .pl (now it should have a camel symbol)
• Enter your code
  print "Hello from ActivePerl!\n";
• Now you should be able to choose Run from the top menu, or left click on the program symbol and choose Run As Perl Local
• If all goes well, a console window with the output
  Hello from ActivePerl!
  should show up
Debugging With Eclipse and PERL
• The Perl PPM package PadWalker has to be installed before you can debug your PERL programs under Eclipse
• Follow the steps on the next two slides to install PadWalker within ActiveState PERL
First Find the Package (PadWalker)
Find a package.
To find a package in the repository:
• Click the All packages button
• Enter text from the package's name or abstract in the Filter field
  • As text is entered in the Filter field, the list of packages is automatically updated as the substring match becomes more precise.
  • Click the magnifying glass icon to filter on different meta-data (e.g. Author).
• Alternatively, just start typing the name of the package. The Package List will highlight the first package that matches the string you have typed.
Next Install the Package (PadWalker)
Install a package.
To install a package from the repository:
• Click on the desired package in the Package List to select it.
• Mark the package by:
  • Clicking the Mark for install button or,
  • Hitting the "+" key or,
  • Selecting Install
Installing PadWalker Via ppm
There are other interesting discussions here, but they seem to have been largely superseded by the GUI-based ActiveState PERL ppm interface:
http://trouchelle.com/perl/ppmrepview.pl
Editors
http://www.viemu.com/vi-vim-cheat-sheet.gif
http://refcards.com/docs/gildeas/gnu-emacs/emacs-refcard-a4.pdf