Open source tools for computational biology Craig A. Stewart, Richard Repasky, and Andrew Arenson [email protected]; [email protected]; [email protected] Indiana University SC2004 Tutorial S06 7 November 2004

Transcript Open source tools for computational biology Craig A. Stewart, Richard Repasky, and Andrew Arenson [email protected]; [email protected]; [email protected] Indiana University SC2004 Tutorial S06 7 November 2004

Open source tools for computational biology

Craig A. Stewart, Richard Repasky, and Andrew Arenson [email protected]; [email protected]; [email protected]

Indiana University SC2004 Tutorial S06 7 November 2004

• •

License terms

Please cite as: Stewart, C.A., R. Repasky, A. Arenson. 2004. Open source tools for computational biology. Tutorial presented at SC2004, 6-12 Nov 2004, Pittsburgh, PA. http://hdl.handle.net/2022/13997 _ Some figures are shown here taken from web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Planned schedule for the day

• 8:30-8:40 • 8:40-9:00 • 9:00-10:00 • 10:00-10:30 • 10:30-11:15 • 11:15-12:00 • 12:00-13:30 • 13:30-14:15 • 14:15-15:00 • 15:00-15:30 • 15:30-16:15 • 16:15-17:00 Introduction and objectives An introduction to the biological basis and biological data sources Pattern matching Break Hands-on problem session Multiple sequence alignment Lunch Microarray data analysis RNA and Protein Structure Break Problem session II Systems Biology 3

• • • • • • • • • • • •

Class Plan and Objectives A rapid introduction to key elements of biology Bioinformatics data sources Similarity matching Multiple sequence alignment Microarray data analysis RNA and Protein Structure Systems Biology Acknowledgements & references Appendix 1: DNA sequencing Appendix 2: Phylogenetics Appendix 3: Grand challenge problems 96 119 137 155 157 161 169 4 12 30 43 79 Note: Slides with the Indiana University wordmark in the bottom left corner were generated at Indiana University, with images sometimes from other sources. In such cases the url for the source of the image is indicated on the slide. Slides with a plain white background have been graciously provided by someone outside IU, and sources are attributed on such slides. Appendices cover material of potential interest to the attendees, but will not be covered in class 4

Class Plan & Objectives

• Class Plan & Strategy – Materials focus on open source software (generally not the presenters own work) – One critical application will be covered in great depth, and several others will be reviewed • Objectives. At the end of the class, participants should: – understand enough biology to understand key computational biology problems, and be familiar with some strategies for collaborating with biologists and biomedical scientists – be conversant with key open source applications in computational biology and bioinformatics, and current problems in these areas – Be ready to download some code and start making it better!

Motivation

• The “ -omics ” trend • Finding press pieces about huge computing problems is easy • How many bio codes really scale to hundreds of processors?

• What are the coming high performance needs of biologists?

• Importance of computational biology and bioinformatics to the HPC community • The challenges and promise are real • Successes and failures so far – Successes: Protein structure, Genome assembly, Surgical assistance, Phylogenetics – Mismatched priorities: Ab initio protein folding – Not yet successful: Drug discovery 6

What has changed recently?

• Bioinformatics not new – Protein structure – Phylogenetics • What is new is high throughput sequencing: – Lots more data – The possibility of going from a knowledge of the DNA sequence to an understanding of diseases and health

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Genome Projects Timeline

• 1978 First virus (SV40) sequenced (5224 base pairs) • 1986 DOE announces Human Genome Initiative • 1994 First complete map of all human chromosomes • 1995 First living organism sequenced (

H. influenzae

) 2 Mb • 1996 Yeast (

S. cerevisiae

) - 12 Mb • 1997 Intestinal bacterium (

E. coli

) - 5 Mb • 1998 Nematode worm (

C. elegans

) - 100 Mb • 1998 Celera announcement; Public effort regroups • 1999 Human Chromosome 22 – 34 Mb • 2000 Joint announcement by NHGRI – Celera • 2003 “As good as it gets” human genome

This slide based on slide by Manfred D. Zorn

Definitions

• Computational Biology: any use of advanced information technology in the study of biological problems.

• “Bioinformatics applies the principles of information sciences and technologies to make the vast, diverse and complex life sciences data mnore understandable and useful” (NIH BISTIC Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf) • Genomics – study of genomes and gene function • Proteomics – study of proteins and protein function • ___omics – 9

Challenges

• Different types of biological data at different scales • Data of varying quality • Much of the underlying biology is not well understood • Prior to the availability of high-throughput sequencing, scientists could only study small pieces of the genetic information of any organism. • Now the entire genome of several organisms has been completed, but knowing the genome is different than knowing how it works!

Comparison of Complexity

• Physics & Chemistry – 2 elementary particles – 4 forces – 112 elements – When random events occur it is often possible to study average behavior – Typically ahistoric (astrophysics an exception) • Biology – 3B base pairs in humans – Min. 30,000 genes in humans – ~1.5M species – Individual random events important; no law of large numbers – Intensely historic, heavily contingent 11

•

Complexity, Con't

Chip design – All components known – – – – Device physics for individual components known Itanium has 3 x 10^8 connections and 2 x 10^8 devices Unified basic currency (electrons) Computer program required to understand (e.g. SPICE) • Cells – Components not known – – – Function of individual components not known # components ~10^13 No unified basic currency – – Ecell, Karyote, etc. attempting to model cells Computer programs do not yet permit full understanding 12

A rapid introduction to key elements of biology

Why is it important to know some biology?

Anopheles gambiae From www.sciencemag.org/feature/data/ mosquito/mtm/index.html

Source Library:Centers for Disease Control Photo Credit:Jim Gathany

• Would you study numerical methods without knowing some mathematics?

• Much current biological knowledge is very specific to particular organisms, genes, or diseases • If you just wade into the available data online you can do some very silly things.

Central dogma of biology

• The central dogma of biology is that genes act to create phenotypes through a flow of information form DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population (first put forward by Crick in 1958)

http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html

Cell Structure

http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html

Eukaryotes • Chromosomes linear • Introns, exons, postprocessing • Nucleus & nuclear wall • Mictochondria and (in plants) Chloroplasts Prokaryotes • Chromosome circular • Location is everything • No nucleus • No plastids 16

Four (or Five) Bases

• DNA consists of four nucleotides: Cytosine, Thymine, Adenine, and Guanine. • In the double helix, A&T are always bound, and C&G are always bound to each other • RNA consists of four nucleotides as well: Cytosine, Uracil, Adenine, and Guanine • RNA may loop back on itself but it does not form a double helix http://

www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif

http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg

Genetic Code

Ala Alanine Arg Arginine Asn Asparagine Asp Aspartic acid Cys Cysteine Glu Glutamic acid Gln Glutamine Gly Glycine His Histidine Ile Isoleucine Leu Leucine Lys Lysine Met Methionine Phe Phenylalanine Pro Proline Ser Serine Thr Threonine Trp Tryptophan Tyr Tyrosine Val Valine 19

http://www.ncbi.nlm.nih.gov/Class/MLACourse/ Original8Hour/Genetics/geneticcode.html

Translating DNA to RNA and Transcribing RNA to Proteins

20 DNA AAAAAGGAGCAAATT 1 2 3 6 5 4 RNA One possible amino acid string UUUUUCCUCGUUUAA Phe Asn Asp Ala

Human Chromosomes

http://www.ncbi.nlm.nih.gov/Class/MLACourse /Original8Hour/Genetics/cytogenetic.html

http://www.ornl.gov/TechResources/ Human_Genome/graphics/slides/ elsikaryotype.html

Sickle Cell

Normal RBC • GAG codes for Glutamine • disc-Shaped, soft • easily flow through small blood vessels • lives for 120 days Sickle RBC • GTG codes for Valine • sickle-Shaped, hard • often get stuck in small blood vessels • lives for 20 days or less Malaria vs. Anaemia!

http://www.nlm.nih.gov/medlineplus/ ency/imagepages/1223.htm

What is a Gene?

• An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn has an influence on some characteristic phenotype of the organism.

– Early views: genes lined up on the chromosome like beads on a string; one gene => one protein – Examples of genes: color blindness, sickle-cell anaemia – Mendelian genes, Sex-linked genes, Quantitative traits • Annotation: Extraction, definition, and interpretation of features on the genome sequence • Annotations vs. genes: – Many annotations describe features that constitute a gene.

– Others may not always directly correspond in this way – An annotation is what we think… nature may disagree!

• Inheritance problem with annotations 23

Gene Components

• Prokaryotes – Location is everything – Essentially all of the DNA is transcribed (few mitochondrial diseases) • Eukaryotes – Non-contiguous pieces of DNA may comprise one gene – Start sequence (complicated and long) – Stop Codons – end transcription – Exons – portions of sequence that are transcribed and used – Introns – portions of sequence that are not used • Genes and Chromosomes – In eukaryotes, an organism has two of each chromosome (in pairs).

– Among sexually reproducing organisms, one chromosome comes from each parent – In “ simple Mendelian genes ” there are two alleles for each gene – one on each chromosome (e.g. wrinkly) 24

Alternate splicing

http://www.blc.arizona.edu/marty/411/Modules/altsplice.html

A (very) little about evolutionary genetics Hardy-Weinberg Law

WW Parents Ww Ww Offspring Ww Ww ww Based on this, can you explain why the gene for Sickle Cell Anaemia persists in populations of people in Africa?

Population genetics & evolution

• Mutations create the raw material for evolution • Natural selection and chance affect the frequency with which particular genes or DNA sequences are present in populations • Given enough time and enough change, evolution, speciation, and so forth happen • Genes can be ‘ fixed ’ or ‘ maintained in an equilibrium ’ a population by chance or through natural selection in

http://faculty.wm.edu/bsgran/

How do sequences differ?

• Differences in individual bases

CGTACCG T TAATAT CGTACCG A TAATAT

• Bases may be added to a sequence

CGTACC CC GTAATAT CGTACC . .

GTAATAT

• Bases may be deleted from a sequence

CGTACCG TTA ATAT CGTACCG . . .

ATAT

Random genetic change

• “ things happen ” • Molecular clock – theory – ~ 2% change per million years (2 x 10 -9 substitutions per base location per year) – Practice – a rule of thumb is different than something like Newton ’ s 2 nd law of motion • Random change may often be responsible for speciation – e.g. two populations of birds, separated by a geographic barrier, may at random eventually develop into two different species 29

Key points (so far)

• Biological processes are complicated; the historicity and complexity of biological processes and our lack of understanding of many matters makes biology an interesting topic!

• The fundamental dogma of molecular biology is that genes act to create phenotypes through a flow of information form DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.

• DNA consists of four base pairs (ATCG). A is always paired with T; C always paired with G. • DNA is translated into RNA. RNA consists of four base pairs as well (AUCG). • The linear structure of DNA is transcribed into RNA and then into proteins. Proteins have their 3D configuration as the basis for their structure. 30

Bioinformatics data sources

Bioinformatics Data Sources

• There are many • Characteristics vary • There are many ways to organize view of the biological data • A pragmatic approach: – Biomedical literature sources – Structured vocabularies – DNA, RNA, Protein etc. data sources 32

Biomedical literature

• Abstracts of biomedical lit. largely available online • Text processing itself is an interesting problem • U.S. National Library of Medicine – NLM Medline http://www.nlm.nih.gov/ • ~12 million references on life sciences/biomedicine. • Covers 1966 to present.

• Citations from over 4,600 journals; most published in English 33

PubMed

• Standard search tool for Medline http://www.ncbi.nlm.nih.gov/entrez/ • Structured Language - Medical Subject Heading http://www.nlm.nih.gov/mesh/MBrowser.html

• ~17,000 Thesaurus Terms, typically 10-15 used per article in MedLine; 3-4 as major points (indicated with * in PubMed) • Annotation done by humans • You can save queries 34

Genomic, Proteomic, etc. data sources

• A tremendous amount of data is available through public data sources via the Web, ftp, or by other means.

• To analyze biological data, we first have to get it…. • Several ways to organize presentation of material – by site, by type, etc. We will organize by data type.

• Types of Databases: – Chromosomal (http://www.ncbi.nlm.nih.gov/mapview) – DNA/Genes – Protein – Biochemistry and metabolic pathways – Microarray – Web collections 35

Types of genomic data

• Genomic DNA: DNA sequences, typically complete with coding and non-coding sequences • GSS: Genome survey sequence. Single pass sequence read directly from robot.

• mRNA: an RNA sequence from an mRNA molecule. May or may not cover all of a particular gene • cDNA: complement DNA – a DNA sequence generated by conversion of an mRNA sequence • EST: Expressed Sequence Tag – short cDNA sequences from studies of cells under particular circumstances. Typically incomplete.

• SNP – Single Nucleotide Polymorphism 36

DNA databases

• GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov

• European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes/ • DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp

• All share data daily. Update conflicts avoided by policy. • Differences are in “ value added ” and interfaces 37

http://www.ncbi.nlm.nih.gov

Data Structures

• Current – Primary DNA repository data based on ASN.1. Makes possible linkages among many types of biomedical info.

– The software libraries now often handle XML as well – Software toolkits and docs available at http://www.ncbi.nlm.nih.gov/IEB/ • Genbank Flat File format – http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

• FASTA > gi|532319|pir|TVFV2E|TVFV2E envelope protein ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC 39

Primary vs. Secondary Data sources

• Primary data sources: – Genetic sequences in NCBI, EMBL, DDJP – Protein sequences in PDB • Secondary data sources: – Inferred protein sequences (what do we know already about issues here?) – Curated data sources 40

Protein Structure

• NCBI (of course…) • Swiss-Prot/TrEMBL at http://www.expasy.org/ – Note: 125,744 chemically determined vs 861,482 inferred from automated translation of DNA sequences!!!!!

• Protein Data Base – PDB http://www.rcsb.org/pdb/ - one of the first online bioinformatics databases!!!

Biochemistry and pathways

• Biochemistry – ENZYME (part of the ExPASy system) – BIND (part of the NCBI system) • Pathways – PathDB http://www.ncgr.org/software/version_2_0.html

– Kegg WIT http://wit.mcs.anl.gov/WIT2/ 42

Web Resources - General • NCBI http://www.ncbi.nlm.nih.gov/ • EBI Biocatalog http://www.ebi.ac.uk/biocat/ • IUBio Archive http://iubio.bio.indiana.edu

http://www.ncbi.nlm.nih.gov/

Similarity matching

Why pattern matching (and what are the problems)

45 and…

US!

Bonobo

http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html

Problems!

• For proteins, 95% similarity is ~ identical, 80% similarity is a lot. Even less similarity than that needed for DNA • Database techniques inadequate – they are too precise!

• Datasets very large to search • Homology • Common ancestry • Sequence (and usually structure) conservation • Homology is inferred rather than measured • Identity • Objective and well defined • Can be quantified easily, but not very useful!

• Similarity • Most common method used, but not as easily defined 46

Alignment

• An alignment is an arrangement of two sequences opposite one another • It shows where they are different and where they are similar • We want to find the optimal alignment - the most similarity and the least differences • Alignments have two aspects: – Quantity: To what degree are the sequences similar (percentage, other scoring method) – Quality: Regions of similarity in a given sequence 47

Alignment

• Methods: – dynamic programming – Hidden Markov Models – Pattern matching • Key problem: keeping the calculation time manageable • Some alignment packages: – BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) – FASTA (http://gcg.nhri.org.tw/fasta.html) 48

Scoring Alignments

GCTAAATTC ++ x x GC AAGTT • Matches are good: they get a positive value • Mismatches are bad: they get a negative value • Gaps are bad: they get a negative value – Gap opening penalty – Gap extension penalty – Score = Matches –Mismatches -∑{gap opening penalty +(length)*gap length penalty}

CGTACCG TTA ATAT CGTACCG . . .

ATAT CGT A CCGTT A ATAT CGT . C . GTT .

ATAT

Now what?

• Taking a sequence and simply comparing it against all existing sequences in a database in all possible ways approaches O(N!) if you do it badly enough. Plus it would be silly.

• So: many algorithms possible • Algorithms are in some ways the same, and in some ways different, between DNA and proteins.

• We ’ ll start with DNA, and not do things in historical order 50

Dotter

• Simple way to get a feel for how sequences compare to each other.

• Used both with DNA and Protein sequences • http://www.cgr.ki.se/cgr/groups/son nhammer/Dotter.html/ • "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995) • Modular nature of proteins 51

Local Alignments with BLAST

• Basic Linear Alignment Search Tool • We ’ ll spend a LOT of time with BLAST • First a quick demo (hopefully) • http://www.ncbi.nlm.nih.gov/BLAST • So, what did we do?

– BLAST – Basic Linear Alignment Search Tool – In particular, BLASTn (for nucleotides) – Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic Local alignment search tool. Journal of Molecular Biology 215:403-410 52

(Original) BLAST Algorithm

• Original algorithm does not permit gaps • The original BLAST algorithm is a local (heuristic) alignment tool • Given a search sequence, e.g. ACGTAGGCATGAA • BLAST first makes a list of all “ words ” of a given length that would possibly have a score of at least T against the search string.

• In the case of this example there would be (at least) the following: – ACGTAGGCATG – CGTAGGCATGA – GTAGGCATGAA 53

(Original) BLAST Algorithm, 2

• BLAST takes the list of all words with a score of at least T against the string one is trying to match…. and then searches a database for any matches to these words. So if one were using the example and the NR database, BLAST would search NR for all occurrences of the words: – ACGTAGGCATG – CGTAGGCATGA – GTAGGCATGAA • Suppose BLAST finds in the NR database an exact match to – ACGTAGGCATG • BLAST then attempts to extend the match in both directions – – ACGTAGGCATGA ACGTAGGCATGA • So now we have an exact match of 12 letters 54

(Original) BLAST algorithm,3

• So BLAST keeps going, and in this case would stop at an exact match of 13 letters (if one existed), since 13 letters was the entire initial search string: – ACGTAGGCATGAA – ACGTAGGCATGAA • BLAST has a stopping algorithm for dropping particular search directions, or stopping altogether 55

Scoring of DNA

A C G T R Y M W S K D H V B N A 4 C -3 4 G -3 -3 4 T -3 -3 -3 4 R 1 -1 1 -1 1 Y -1 1 -1 1 -3 1 M 1 1 -2 -2 0 0 1 W 1 -2 -2 1 0 0 0 1 S -2 1 1 -2 0 0 0 0 1 K -2 -2 1 1 0 0 0 0 0 1 D 1 -2 1 1 1 0 0 1 0 1 1 H 1 1 -2 1 0 1 1 1 0 0 0 1 V 1 1 1 -2 1 0 1 0 1 0 0 0 1 B -2 1 1 1 0 1 0 0 1 1 0 0 0 1 N 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 56

BLAST algorithm in more detail

• The BLAST algorithm searches for MSPs – Maximal Scoring Pairs – such that the score of sequences cannot be improved either by lengthening it or shortening it. “ Pairs found in a database.

” here refers to a string – or a substring – of the initial string used as the search string – and one or more strings or substrings • The search starts with the creation of all possible subwords of a given length (default typically 11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when matched against the original search string. (T is short for Threshold) • BLAST searches for any occurrence of each of these words that have a score of at least T. This is a “ hit ” – or a “ High Scoring Pair (HSP) ” • The search then continues by trying to extend these HSPs. • Suppose “ S ” is the best score found for a word of length k. BLAST stops trying to extend words when the score drops a certain amount below the best value S in the previous round.

• BLAST continues on and on until it is no longer possible to improve the score of HSPs by making them longer.

• Then it generates a list of the best HSPs. Default is a cutoff E-value of 10 • BLAST (original) has an infinite gap penalty 57

BLAST Statistics

• BLAST reports E values rather than P values, but it turns out that when E < 0.01, E~P • What do we do about the fact that we have done many tests?

• If the sequence is length n, and the total length of the database being searched is N, then a reasonable approach is to multiply E by N/n • Edge effects – statistics tend to be conservative for short sequences • Problems: – Highly repetitive segments – Low complexity regions – Bias in composition • Solution: low complexity regions can be excluded 58

BLAST Options

• Set subsequence (of the submitted sequence) • Choose Database (NB: nr ≠ non redundant!) • Limit by entrez query or select an organism • Choose Filter • Expect Value • Word size (default = 11 for nucleotides) 59

Protein Sequence Alignment

• What most people do most of the time • DNA sequences are useful for relationships that are close, but DNA sequences are not nearly as well conserved as Amino Acid sequences • Now we need to talk about the characteristics of Amino Acids and ways to compare what is similar and what is not!

• Amino acids can have similar chemical properties, and similar functions as part of a protein, without being identical!

Point Accepted Mutations (PAM)

• For scoring amino acid sequence alignments • Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. 1978. "A model of evolutionary change in proteins." In Atlas of Protein Sequence and Structure 5(3) M.O. Dayhoff (ed.), 345 - 352, National Biomedical Research Foundation, Washington. • PAM N corresponds to N mutations in DNA sequence per 100 amino acids. N can be greater than 100.

• PAM 250 is most commonly used; PAM 100 is also used. PAM 250 => chains with ~20% identity • PAM matrix calculator at www.cmbi.kun.nl/bioinf/tools/pam.shtml

http://www.psc.edu/biomed/training/ tutorials/sequence/db/index.html

BLOSUM Matrices

• Henikoff and Henikoff (1992) Proc Natl Acad Sci 89(22):10915-9 • Based on analysis of the BLOCKS database (http://www.blocks.fhcrc.org/) • BLOSUM = BLOcks SUM database • Based on analysis of conserved and variable regions of proteins Naming convention is different than for PAM matrices.

• BLOSUMxy is based on likelihood ratios for two chains of amino acids that are xy% identical • BLOSUM62 is the ‘ typical default ’ • PAM250 is roughly equivalent to BLOSUM45 62

PSI BLAST

• Position Specific Iterative BLAST • http://nar.oupjournals.org/cgi/content/full/25/17/3389 • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402 • Required two non-overlapping similarities with search term to occur within a certain distance (A) on the genome • Permits gaps in the alignments • Can be iterated to allow for user-specified scoring matrices By default, uses the BLOSUM-62 Matrix 63

PSI BLAST

• In the original BLAST, the step of extending the length of the ‘ hits ’ took ~90% of execution time.

• The initial threshold value T must be lower than with the original BLAST, but far fewer hits are pursued, meaning that the extension time is lower Two hits, T=11 A=40 vs One hit, T=13 64

http://nar.oupjournals.org/content/ vol25/issue17/images/gka56202.gif

http://nar.oupjournals.org/content/vol25/ issue17/images/gka56201.gif

Gaps in PSI-Blast

• PSI BLAST seeks alignments with single gaps • Gaps are sought only when a two-hit score exceeds the value Sg • Gaps: handled by using a different gap cost function: -(a+bk+cj) a is the cost for opening a gap b is the per unit cost for the length of the gap k is the length of the gap c is the cost per of unaligned sequences in the gap j is the number of sequences left unaligned 66

Discontinuous MEGA Blast

• Useful especially for identifying diverged DNA sequences • Uses templates; within the template only those items with “ 1 ” s are compared.

• E.g. 1101101101101101 67

How many BLASTs?

http://www.ncbi.nlm.nih.gov/BLAST/producttable.html

mpiBLAST http://mpiblast.lanl.gov/

mpiBLAST Algorithm

• Darling, A.E., L. Carey, W.-C. Feng. 2003. The design, implementation, and evaluation of mpiBLAST. Presented at ClusterWorld2003. http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf

• Algorithm – Database is segmented. Portions of database are placed on data storage devices on multiple nodes in a HPC system. mpiformatdb is a wrapper for the BLAST formatdb program. Number of subdivisions specified by user – Foreman/worker algorithm. Portions of the database are assigned to workers, using a greedy algorithm 69

mpiBLAST performance

• Scaling can be super linear when pieces are small enough that they fit into memory • Scalability limitations due to communication, implicit barrier before assembly of results • If pieces of data distributed out to workers are larger than available RAM, then scaling is still good but not super linear • Blast is the most heavily used bioinformatics tool in existence. Parallelization of BLAST has huge payoff for practicing biologists 70

Motivation: BLAST with Low Memory

71 • Standard BLAST running on a system with 128 MB of memory.

Slide courtesy of Wu-chun Feng [email protected] Los Alamos National Laboratory

mpiBLAST: Low-Memory Performance

72 • Environment – 1, 2, or 4 nodes.

– Each node w/ dual 550-MHz CPUs and 128-MB memory.

– Same query and database used.

• Conclusions – blastn is I/O bound. Superlinear speed-up possible.

– tblastx is CPU bound.

Slide courtesy of Wu-chun Feng [email protected] Los Alamos National Laboratory

mpiBLAST on

Green Destiny

BLAST Run Time for 300-kB Query against nt

Nodes Runtime (s) Speedup over 1 node 1 80774.93

4 8 16 8751.97

4547.83

2436.60

32 64 128 1349.92

850.75

473.79

1.00

9.23

17.76

33.15

59.84

94.95

170.49

The Bottom Line: mpiBLAST reduces search time from 1346 minutes (or 22.4 hours) to under 8 minutes!

Slide courtesy of Wu-chun Feng [email protected] Los Alamos National Laboratory

Global Alignments: Needleman-Wunsch Algorithm

• • Start at the beginning, end at the end • Needleman, S.B., and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Bio. 48: 443-453.

“ The amino acid sequences of a number of proteins have been compared to determine whether the relationships existing between them could have occurred by chance. Generally, these sequences are from proteins having closely related functions and are so similar that simple visual comparisons can reveal sequence coincidence….

” 74

Needleman-Wunsch

• Amino acid sequences are lined up as column and row headers for a matrix • A i is the ith amino acid in protein A • B j is the jth amino acid in protein B • Start with a matrix where the matches between the A i s and B j s are 1 of there is a match, 0 otherwise • The optimal alignment can be represented as a path through the matrix • If MATmn is part of a pathway including MATij, the only permissible relationships are m> i and n>j, or m

Needleman-Wunsch and Smith-Watermann

76 • Shortcomings of Needleman-Wunsch?

• Can you think of biological situations in which you might want to use Needleman-Wunsch?

• Smith-Waterman: similar to Needleman-Wunsch, except – Requires a penalty for gaps – Will do partial alignments (e.g. has stopping point) • Computational requirements – Original Needleman-Wunsch and Smith Waterman both require O(N*M) time and O(N*M) memory – There are improvements of Smith-Waterman that require O(N*M) time and O(N) space

ALIGN

• Simple protein alignment tool • Included in FASTA distributions 2.x, but not 3.x

• Still, it ’ s a nice learning tool • Can be downloaded for Linux or for Windows • Can also be run from web at http://fasta.bioch.virginia.edu/fasta/align.htm

• Can also be run from web at http://us.expasy.org/tools 77

Protein Alignment with the FASTA family

• FASTA is one of the earliest protein alignment tools, and still actively maintained • Pronounced FAST and then a long A • A local alignment, heuristic tool • Can be downloaded from http://www.people.virginia.edu/~wrp/pearson.html

• FASTA family maintained by Prof. William R. Pearson • Can also be run from Web 78

FASTA Algorithm

• Ktup = word length (2 default) • FASTA searches for matching words, focuses on ungapped regions that have the highest number of identical ktups • FASTA scores the 10 ungapped alignments that have the highest number of identical ktups (default scoring is BLOSUM50) • FASTA merges the ungapped alignments into a single alignment with a stopping rule • FASTA uses the Smith-Waterman algorithm within the local alignment regions 79

Multiple Alignment

Richard Repasky 80

Multiple Alignment

• Sequences lined up so that homologous residues are next to one another • Why they are useful • Constructing multiple alignments • Abstracting multiple alignments 81

An alignment

82 • Color reflects residue type (e.g., green = hydrophobic)

Uses

• Alignments reveal the degree to which sequences have been conserved • Most functional sequences are conserved • Multiple alignment is used to locate them – Functional groups of enzymes – Predict protein structure – Gene promoters – Unknown functional units of non-coding regions of DNA • Alignments necessary to estimate evolutionary trees 83

The problem

• Pairwise dynamic programming alignment algorithms can be extended to multiple sequences but scale poorly to large numbers of sequences (or to long sequences) • Heuristic algorithms are employed 84

Progressive alignments

• Commonly used heuristic methods are progressive - build multiple alignment by aggregating from paired alignments • Order in which sequences are added determined by a guide tree that reflects similarity/distance • Guide tree constructed from sequences • Closely related sequences aggregated/added first • Errors in early additions tend to propagate • Algorithms differ in strategy for minimizing error propagation • Algorithms also differ in guide tree construction & scoring 85

Progressive alignment steps

• 1 - align B & C • 2 - align D & E • 3 - align (D & E) & A • 4 -align (D & E & A) & (B & C) 86

Three algorithms

• CLUSTAL W – Oldest of three & most widely used – Initial alignment error not addressed – Good performance by adding realistic details • T-COFFEE – Initial alignment error addressed by using consistency methods – Uses CLUSTAL W, improves performance • ProbCons – New this year – Initial alignment error also addressed by consistency methods – Uses hidden Markov models 87

CLUSTAL W

• http://www.ebi.ac.uk/clustalw/ • Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680

• Uses dynamic programming with distance matrices and gap penalties for alignments • Selective use of scoring matrices

– Strict matrices for closely related sequences – Permissive matrices for distantly related sequences – Relatedness determined by branch lengths in guide tree

• Uses residue-specific gap penalties from reference alignments

88

CLUSTAL W

• Gap penalties reduced in short stretches of hydrophilic residues (usually associated with bends and are gap-prone) • Gap penalties increased in areas within 8 residues of existing gaps because such gaps are rare in reference alignments • Sequences weighted by relatedness – Attempt to correct for unbalanced sampling across guide tree – Closely related sequences discounted in importance • Progression – Leaves of tree joined by dynamic programming – Leaves joined with internal nodes by sequence-profile alignment – Internal nodes joined by profile-profile alignment 89

FOS_RAT FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN FOS_RAT FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN

Example output

MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS *:..* .:*:: .***** **:.:* * *..***.* :.. :*: IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP :******:** **********:**:* **... ::. .**.:* : 90

ClustalW-MPI

• Li, K.-B.2003. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: 1585 1586 • Initial pairwise alignment process is parallelized and scales very well • Multiple alignment process is parallelized and scales modestly • Scaling tests published thus far up to 16 processors, reduces time from hours to minutes 91

Consistency Methods

• Make estimates based on more information “ averaging ” • Lazy teacher analogy • In progressive multiple alignment, use as much information as possible when adding sequences to the alignment • T-COFFEE: each position in one alignment is weighted by consistency in all alignments of all pairs of sequences that include the sequences being aligned • ProbCons: posterior probabilities in pairwise HMM alignments weighted by posterior probabilities of same positions in other alignments 92

T-COFFEE

• Http://igs-server.cnrs mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.ht

ml • Notredame, et al. (2000, J. Mol. Biol. 302:205-217) • Gives better alignments than CLUSTAL W on benchmark datasets • Avoids problem of early bad gaps using consistency methods • Progressive alignment based on weights pooled from all pairwise alignments rather than currently accumulated sequences 93

T-COFFEE steps

• Calculation of weights – All pairwise alignments using CLUSTAL W, local alignments using FASTA Lalign – For each aligned base pair in each pair of sequences calculate weight – Aggregate weights for aligned base pairs using triplets of sequences • Progression – Align all sequences pairwise using weights – Build guide tree from pairwise alignments – Progressively build multiple alignment using tree and weights 94

ProbCons

• Http://probcons.stanford.edu

• ISMB 2004, Bioinformatics 20:Supplement 1 • Constistency methods & HMM to align • Gives better alignments than CLUSTAL W & T-COFFEE on alignment benchmarks 95

ProbCons method

• Use HMM to align all pairs of sequences • Keep posterior probability matrices & update each value by averaging over all alignments in which the sequence position occurs. Do twice.

• Create a guide tree from expected accuracies (sums of posterior probabilities of highest summing path in matrix) • Progressive alignment objective function is sum of posterior probabilities for all aligned residues 96

Multiple alignment viewers

• CLUSTAL X - X windows – ftp://ftp-igbmc.u strasbg.fr/pub/ClustalX/ • Jalview - Java – http://www.ebi.ac.uk/~michel e/jalview/ • Variable color schemes • Editing • Front end to aligners 97

Abstracting Multiple Alignments

• Hidden Markov models can be used to describe alignments • Called profile HMMS • Think of them as definitions of proteins or averages • Useful for aligning newly discovered sequences • Search sequence databases for sequences that match the alignment profile (Consider the alternative!) • Build databases of profiles and search for profiles that match query sequences 98

HMMR

• http://hmmer.wustl.edu/ • Profile HMMs for protein sequence analysis • Builds profiles from existing alignments • Searches sequence databases for molecules that match profiles • Can be used to construct db ’ s of profiles and to search for profiles that match sequences • Generates random sequences from profiles • Also available as a parallel code using PVM • Scales reasonably well as regards number of processors. Does not scale as well as regards size of the biological problem 99

Gene Expression Microarray Analysis

Richard Repasky 100

Gene Expression Microarrays

• Detect which genes are relatively turned on and which off in paired samples (turned-on = expressed) • Uses • How microarrays work • Data handling • Available software 101

Example uses of microarrays

• Identify genes associated with disease • Reconstruct gene regulatory networks • Understand traits such as resistance to toxins or disease • Prognosticators 102

Use: Identify genes associated with disease

103 • Changes in expression may cause disease • Disease may cause changes in gene expression • Initially happy to know association • Cause demonstrated later by other methods • Often simple designs: normal tissue v. diseased tissue • Plot changes in gene expression as disease progresses

Use: Reconstruct gene regulatory networks

104 • Genes turn on and off through life • Gene expression is regulated by genes • Seek to identify and understand regulatory networks • Identify correlated sets of genes • Sometimes in controlled experiments • Often use data from variety of sources

Use: understand traits

• Are real traits dependent upon levels of gene expression?

• That is, what is responsible for trait levels, variation in gene expression or variation in the type of protein that is produced by the gene?

• Example: is variation in redness of tomatoes due to the quantity of red that is produced or due to the production of red pigments vs orange or yellow pigments?

• Example: is variation in fungal resistance to copper sulfate attributable to variation in level of gene expression?

• Look for relationship between gene expression levels and traits under controlled circumstances 105

Dream use: Prognosticators

• Earliest indicators of prognosis are sought in medicine and drug development • Useful for making decisions • Is your fate associated with which genes are on or off at some point in your disease?

– Example: bad prognosis? Try experimental treatments earlier.

• Is a candidate drug's fate associated with which genes that are on or off in early trials? – Example: if gene Q325 shuts down, liver failure is certain – Could save developers millions of dollars 106

How microarrays work

• Capture transcripts, label them, sort them and measure abundance • Capture & label – DNA -transcribed-> messenger RNA -translated-> protein – Transcript = messenger RNA – Capture messenger RNA & convert back to DNA (called cDNA) – Label cDNA with florescent markers – Label one sample green, one red 107

Sort &Measure Transcript Abundance

• Principles – DNA likes to stick to itself – DNA from one gene is more likely to stick to DNA from the same gene that to DNA from other genes – Let transcripts sort themselves by sticking to spots of DNA of known identity – Transcripts from labeled samples compete to stick to spots – Relative abundances of labels reflect relative abundances of transcripts 108

Sort &Measure Transcript Abundance

109 • Setup – Glue microspots of DNA of known identities to slides – Combine red-labeled and green-labeled samples – Flood slide with mixture – DNA sticks to appropriates spots – Shine red laser and green laser at spots and measure brightness – Red spots: gene relatively on in red-labeled sample – Green spots: gene relatively on in green-labeled sample

Variations on theme

• Whole genes glued to slides as implied so far – Called spot arrays • Small gene sequences (oligonucleotides) glued to slides – Sequences selected for genes they represent – Usually several sequences per gene to reduce error – All must be “ ON ” to conclude that gene is “ ON ” – Sequences selected for location in gene for efficacy – Known as oligonucleotide arrays – Affymetrix arrays of this type – Some newer spot arrays use oligonucleotides 110

Microarray capacity

• Hundreds to tens of thousands of spots per slide • Can measure hundreds to tens of thousands of genes • Slides are expensive. Expense limits numbers of slides used in experiments.

111

Data processing and software

• Data collection - getting from slide spots to numbers • Data curation - keeping data organized and stored • Data analysis - describing data & drawing inferences • Annotation - providing information about genes that emerge as significant results 112

Data collection

• Images scanned from slides, usually TIFF images • Data collected from spots • Identify boundaries of spots • Enumerate colors of pixels in spots • Software usually provided by scanner manufacturers • Public software – ScanAlyze http://rana.lbl.gov/EisenSoftware.htm

– Spot http://www.cmis.csiro.au/IAP/Spot/spotmanual.htm

– Spotfinder http://www.tigr.org/software/tm4/spotfinder.html

113

Data curation

• Hundreds to thousands of spots to store per slide • Associate spots with sequence identities • Hopefully dozens or more slides per experiment • Many experiments • Usual goal: keep everything - scanner settings, scanned images, settings of spot lifter, experimental conditions (treatments, etc) • Use relational database • Commercial & public offerings available • Databases often include analysis or interface to analysis 114

Curation for posterity

• People hope to mine mountains of accumulated array data • Public repositories are available to house data • Goal to have journals require submission to repositories • Standards are necessary to ensure that data are useful • MIAME - Minimum Information about a Microarray Experiment http://www.mged.org/Workgroups/MIAME/miame_1.1.html

• Some DB ’ s comply - some do not • Plan before you choose 115

Public microarray db

’

s

• Stanford Microarray Database (SMD) http://genome www5.stanford.edu

• BioArray Software Environment (BASE) http://base.thep.lu.se/ • ArrayDB http://genome.nhgri.nih.gov/arrraydb/ • Gene Expression Open Source System http://va-genex.sf.net/ 116

Data analysis

• Data - what are they?

• Methods – Inferential statistics – Descriptive methods – What ’ s used depends on starting point and goals • Approaches – Bayesian statistics – Frequentist statistics – Algorithms from CS 117

Data analysis: What are the data?

• Raw data are ratios of red/green brightness of spots • Ratios have undesirable properties • Data usually log transformed for analysis • For oligonucleotide arrays data from short sequences must be used to estimate expression levels of genes before genes can be analyzed.

• Data often standardized to remove variation from – Local glare on slides – Variation among slides – Batches of reagents – Methods simpler than blocking in analyses of variance 118

Data analysis - Inferential statistics

• Most often seen in hypothesis-driven experiments – Does treatment A affect expression of gene APE?

– Does treatment A affect expression of any gene?

– Is expression of gene X correlated with patient survival?

• Bayesian quite useful in situations in which sample sizes (number of subjects) is small • Heavy use of linear regression and Analysis of Variance • Many independent, single-method applications • Routines for Excel, Matlab, R, S-Plus 119

Data analysis - Descriptive/exploratory

120

• Usual goal to identify genes or subjects (e.g., patients) that behave similarly in an experiment • Principal Components Analysis (PCA) – Identify axes of covariation (linear) that account for greatest amount of variation in expression – Identify genes/subjects that are associated with axes • Clustering – Identify groups of genes or groups of subjects that behave similarly – No assumption of colinearity

PCA example

• Trivial example - 2 variables, 2 axes • Morphology – Measure size of many body parts – First axis size – Second may be limb size • Gene expression – Measure many genes – First axis often house keeping genes 121

Clustering

• Usual goal to find groups of genes and/or subjects that behave similarly in an experiment • Hierarchical clustering • K-means clustering • Self-organizing maps • Support vector machines 122

Analysis packages

• Many analyses are available in standard statistical packages such as SAS, SPSS, and S-Plus.

• Microarray packages usually contain suites of tools • Statistics for Microarray Analysis (R) http://stat www.berkeley.edu/users/terry/zarray/Software/smacode.html

• Bioconductor (R) http://www.bioconductor.org/ • BRB Array Tools http://linus.nci.nih.gov/BRB ArrayTools.html

• MicroArrayExplorer http://maexplorer.sourceforge.net/ 123

Specialized tools

• Michael Eisen ’ s lab - several tools http://rana.lbl.gov/EisenSoftware.htm

• SNOMAD - Standardization and Normalization http://pevsnerlab.kennedykrieger.org/snomadinput.html

• CAGED - cluster analysis http://genomethods.org/caged/ • RELNET - relevance networks http://www.chip.org/relnet/ • POE - probabilistic classifier http://astor.som.jhmi.edu/poe/ • Ebarrays - emprical Bayes analysis (R) http://www.biostat.wisc.edu/~kendzior/ 124

R

• Open source data analysis and graphics language modeled after S (S-Plus) • Functional language, like S • High-level functions for statistical analyses and graphics • Low-level functions available • R differs from S in management of memory • Similar to S in syntax but different enough to be infuriating • Popular among people who invent statistics, like S • Popular among people who invent methods for microarrays • http://www.r-project.org/ 125

Annotation

• Analyses produce lists of significant genes • What are they? What do they do? What is known about them?

• People need systems that retrieve information about significant genes • Affymetrix provides service for its chip users • Commercial products available • Public annotation systems – Go Miner http://discover.nci.nih.gov/gominer/ – Resourcerer http://pga.tigr.org/tigr-scripts/magic/r1.pl

126

NIH software

• NIH is in the microarray software business • National Cancer Centers need consistent methods of storage and analysis • caBIG - Cancer Biomedical Informatics Grid • Open source • Building data object framework, application programmer ’ s interface, applications • http://cabig.nci.nih.gov/ 127

RNA & Protein Structure

128

RNA & Protein Structure

• Want to know functions • Function dictated by structure • Need structure to understand function • Empirical determination of structure difficult/expensive • Shortcut: predict structure from sequence • Algorithms & software for predicting structure 129

Outline

• Example of protein structure & function • RNA structure • RNA software • Protein structure • Protein software • Open source?

130

Structure and function

• Enzymes receive most attention • Enzymes catalyze reactions = lower energy required • Place reactants in favorable positions for reaction • Location is everything • Example enolase 131

Enolase reaction

132

RNA

• Nucleotide sequence • Composition differs from DNA • Thymine replaced by uracil • Alphabet: C, G, A, U 133

RNA

’

s

• Messenger RNA - template from DNA - decoded to produce protein • Transfer RNA - interface attached to amino acids - identifies amino acid to protein producing machinery • Ribosomal RNA - protein producing machinery • Regulatory RNA - small polynucleotides that bind to other molecules and alter behavior • Catalytic RNA - most catalyze reactions of DNA 134

RNA secondary structure

• Single stranded • RNA folds on itself • Complementary bases join • A - U • G - C • Forms loops & hairpins • Transfer RNA shown 135

Predicting RNA secondary structure

• In nature, structure nearly minimizes energy • Energy - more or less bending/stress on bond angles • Zuker algorithm minimizes calculated energy • Uses dynamic programming algorithm • Includes interactions between adjacent nucleotide pairs (e.g., A-U followed by G-C has different energy than A-U followed by U-A) • Web services www.bio.rpi.edu/applications/mfold/old/rna/form1.cgi

• Vienna RNA 136

RNA Structure – Vienna RNA

• http://www.tbi.univie.ac.at/~ivo/RNA/ • Package consists of several parts (from the web site): – RNAfold - predict minimum energy secondary structures and pair probabilities – RNAeval - evaluate energy of RNA secondary structures – RNAheat - calculate the specific heat (melting curve) of an RNA sequence – RNAinverse - inverse fold (design) sequences with predefined structure – RNAdistance - compare secondary structures – RNApdist - compare base pair probabilities – RNAsubopt - complete suboptimal folding 137

http://www.tbi.univie.ac.at/~ivo/RNA/

Types of Proteins

• • • • • • • •

Enzymes -

catalysts

Regulatory -

bind with molecules to alter behavior

Transport

- move here to there as oxygen in hemoglobin

Storage

- e.g., caches of nitrogen or metal ions

Mobility -

contractile & motile proteins (muscle, flagella)

Structural proteins -

fill space, provide support

Scaffold

- supports for construction of macro molecules

Defense/Attack

- immune system proteins, venom 138

http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html

139

http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html

140

Secondary structure

• Reflects angles in amino acid chain • Shape of the peptide chain over short sequences • Determined by amino acid composition 141

Atoms, bonds & rotation

142

Bond rotation in peptide chain

143

Angles of rotation & secondary structure

144

http://www.cryst.bbk.ac.uk/PPS95/ course/3_geometry/rama.html

Beta-sheet & Alpha-helix

• Beta-sheets usually drawn as flat noodles • Alpha-helices usually drawn as spiral noodles • Folds between sheets and helices illustrated are tertiary structure 145

Empirical structure

• X-ray crystallography • Yields 3-D plot of electron density in space • Create model of molecule that matches electron density • ~127,863 entries in SwissProt • ~857,950 entries in TrEMBL

http://crystal.uah.edu/~carter/protein/xray.htm

146

www.msg.ucsf.edu/local/programs/solve-2.06

147

X-Ray Crystallography

• Some algorithms recognize backbone of amino acid chain • Some look for signatures of secondary structure in electron map • Iterative: apply an algorithm, visualize, apply algorithm, visualize, … • XtalView/Xfit - manual fitting - www.ccp14.ac.uc/ccp/web mirrors/xtalview-mcree/pub/dem-web • CNS - Crystallography & NMR System - framework for combining algorithms - cns.csb.yale.edu

148

Secondary structure determination

• Several methods • Calculate energies of backbone-backbone hydrogen bonds • Consensus from estimates made using different bonding thresholds • Calculate energies & bond angles and plot proximity to Ramachandran clusters • Compare path of backbone with paths of ideal secondary structures 149

Secondary structure prediction

• Induction: assign structure based on sequence similarity to proteins of known structure • Consensus methods do best • Search database for similar sequences • Align sequences • Apply several algorithms (e.g., neural network, nearest neighbor) to predict structure type • Take consensus of predictions • JPRED: www.compbio.dundee.ac.uk/~www-jpred/ 150

Tertiary structure

• Long segments fold • Folds are held in place by molecular forces (e.g., electrostatic, hydrogen bonds, some covalent bonds) • Proteins fold to minimize energy • Folding algorithms seek conformation with minimum energy 151

Criteria

• Goal: prediction of molecule position within 1 angstrom • Remember, location is everything in enzymes • Measuring quality of fit – Root mean square of atom distances from correct position RMSD = √ (∑d i 2 )/N – Q3 = (true positives + true negatives)/total residues • Better than 70% right is really good!

152

• Fold recognition • Ab initio

Methods

153

Fold Recognition (Threading)

• Impose known folds on molecule; evaluate fit • Dissimilar sequences may fold similarly • Number of possible folds is finite • Many methods of fitting (e.g., dynamic programming, Gibbs sampling, hidden Markov models) • Calculate energy or distances • Web services - many methods http://cubic.bioc.columbia.edu/predictprotein • General recommendation: use many methods and build a consensus 154

Ab Initio methods

• L. From the beginning (O. E. D.) • A scoring function is used to judge conformations • Search function used to explore conformational space • Criterion: usually minimize free energy • Scoring function types – Molecular mechanics calculations – Use empirically derived scoring functions based on probability distributions of data in Protein Data Bank • Search function may be coarse-grained or fine-grained, usually matches granularity of scoring function 155

Ab Initio methods

• Many methods, many players • Variations – Coarseness of scoring function – Coarseness of search function – Search techniques – Coarseness in representation of atoms in sequence 156

Ab initio methods: Amber

• Amber: http://amber.scripps.edu/#ff • sander: Simulated annealing with NMR-derived energy restraints. • gibbs: Free energy perturbation (FEP) and thermodynamic integration (TI) , and also allows potential of mean force (PMF) calculations. • roar: Allows mixed quantum-mechanical/molecular-mechanical (QM/MM) calculations, "true" Ewald simulations, and alternate molecular dynamics integrators. • nmode: Normal mode analysis program using first and second derivative information, used to find search for local minima, perform vibrational analysis, and search for transition states. • (from http://amber.scripps.edu/#code) 157

Ab initio methods - GAMESS

• M.W.Schmidt, M.W., K.K.Baldridge, J.A.Boatz, S.T.Elbert, M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga, K.A.Nguyen, S.Su, T.L.Windus, M.Dupuis, J.A.Montgomery. 1993. General Atomic and Molecular Electronic Structure System J. Comput. Chem.14: 1347-63.

• NPACI/SDSC Web portal for GAMESS: https://gridport.npaci.edu/gamess/ • It ’ s parallel 158

Ab Initio methods: Rosetta

• Work with fragments 3-9 amino acids in length • Restrict conformations of individual fragments to the distribution of conformations exhibited in real proteins • Fragment conformations modeled stochastically using distributions of observed conformations • Seek array of conformations that minimize energy • Local minima likely • Run many searches • Cluster results & pick large clusters as likely conformations 159

Visualization: ways to view molecules

• Wireframe – Often used by crystallographers while interpreting data – Frames fit nicely in electron density mesh • Space filled (Van der Waals radii) – Often used in docking applications – Van derWaals radii useful for thinking about hydrogen bonds 160

Visualization: ways to view molecules

• Richardson-type (also ribbon or noodle) – Depict secondary and tertiary structure – Omit details of atoms and polymerization • Ball and stick – Usually more useful for ligands than for whole proteins – Atoms & covalent bonds 161

Molecular viewing software

• Most programs do several types of visualizations • VRML – Cosmo Player http://www.karmanaut.com/cosmo/player/ • RASMOL - http://www.openrasmol.org/ • RasTop - http://www.geneinfinity.org/rastop/ • CHIME - http://www.mdl.com/chime/index.html

• Swiss Pdb Viewer - http://www.expasy.ch/spdbv/ • MICE - http://mice.sdsc.edu/ • Many tend to be touchy about browsers and plugins 162

Molecular Docking

• Will two molecules bind?

• Usually interested in docking of small molecules (e.g., drug candidates) to proteins • Small molecule called ligand (from Latin ligare - to bind) • Specific question: will ligand bind to a receptor in a protein?

• Receptor usually the largest pocket in the surface of a protein • Steps – Characterize receptor site – Orient ligand(s) & evaluate 163

Molecular Docking

• Process: usually create negative image of receptor site and ask whether ligands take that conformation • Rigid models use a grid search • Models with flexible surfaces – Usually assume receptor fixed and ligand flexible.

– Explore conformation of ligand – May use simulated annealing or genetic algorithm to explore conformations • Models with flexible surfaces do better than those with fixed surfaces 164

Molecular Docking

• Autodock is a commonly used package • http://www.scripps.edu/pub/olson-web/doc/autodock/ • AutoDock is a suite of automated docking tools.

– Flexible model – Can do simulated annealing and genetic algorithms 165

Systems Biology

166

Systems Biology

• • Special issue of Science: 295, Mar. 2002 • Special issue of Nature: 420, Nov. 2002 “ Systems biology is a new field in biology that aims at a systems-level understanding of biological systems.

” • Nobody ’ s quite sure what it is, but it sure is hot!

http://www.ornl.gov/TechResources/Human_Genome/ graphics/slides/images/01-0052_web.gif

167

Historical approach to biological experiments

• From Lazebnik, Y. 2002. Cancer cell 2:179: Traditional biological experimentation much like the process of trying to fix a broken radio (or if you are or were a 12-year old boy…) • Some typical steps: – Cataloguing components and their attributes – Perturbing the system – Knock-out experiments – Drawing diagrams • Eventually interactions

may

find a component that, when replaced, repairs the radio • In a very complex system, knowing what all of the parts are, and knowing the function of individual pathways, may still not tell you how the systems work. It may simply be impossible to deduce this from 1-st order • Interactions, multiple changes – Power supply and other components (well-known PC repair example!) – Change everything all at once so that we ’ ll never know what worked!

168

Systems Biology

• Systems biology emphasizes close integration of experiment, theory and computational modeling • Goal: understanding the structure and dynamics of biological systems, placing the parts in the context of the dynamic whole – Studies the complex interactions of many levels of biological information – Quantitative, predictive models are central – Computational modeling in particular is a key tool • Why model – You are forced to really state what you are hypothesizing – Allows you to understand an *approximation* of reality in great detail • Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).

• Foundations of systems biology. MIT Press, 2001. Kitano (ed) 169

From http://www.nrcam.uchc.edu/technology/modeling_process.html

170

• • • • • • • • • • • • BALSA BASIS BIOCHAM BioCharon biocyc2SBML BioGrid BioNetGen BioPathways Explorer Bio Sketch Pad BioSPICE Dashboard BioSpreadsheet BioUML

A small sampling

• • • • • • • • • • • • • Cellware Cytoscape DBsolve Dizzy E-CELL FluxAnalyzer Gepasi INSILICO discovery Jarnac JDesigner JSIM JWS Karyote 171

• • • •

Example - MCell

MCell

is: A General

M

onte

C

arlo Simulator of

Cell

ular Microphysiology. http://www.mcell.cnl.salk.edu/

MCell

focuses on simulations using a Brownian dynamics random walk algorithm.

MCell

's use to date has been focused on the microphysiology of synaptic transmission.

Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh SupercomputingCenter and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute.http://www.mcell.cnl.salk.edu/

172

MCell Scalability

173

Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/

M-Cell

• Uses MDL (Model Description Language (MDL), designed with biologically-oriented users in mind. • Embarrassingly parallel Monte Carlo application • Supports checkpointing!

Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/

174

CompuCell

• CompuCell currently uses a combination of "extended Potts model" for cell sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to establish the underlying chemical field to which cells respond and form typical patterns found in such biological systems as a growing chicken limb. • http://www.nd.edu/~icsb/ 175

Image courtesy of James Glazier http://www.biocomplexity.indiana.edu/software.php

Issue: Getting Tools to Interoperate

• There is currently a proliferation of software, but no single package answers all needs • No single tool is likely to do so in the near future • But: problems with using multiple packages • Among the efforts to address this problem: – Systems Biology Markup Language & Systems Biology Workbench Project • Purpose: develop software and standards to – Enable sharing of simulation & analysis software – Enable sharing of models • Goal: make it easier to

share

than to reimplement 176

SBML

• An XML-based markup language • Active and functional leadership and reasonable funding stream • SBML is focused on biochemical networks, but of all of the biology-oriented markup languages, it seems to be the one with the most traction • Permits storage, transmission, and reuse of models • Consists of “ levels ” 177

What does an SBML model look like?

178

2.2

SQUARE

SBML Levels

• Level 1 – Biochemical networks. Frozen.

• Level 2 – enhancements and extensions to level 1. Frozen June 2003.

– Uses MathML for equation specifications – Uses same metadatascheme as CellML (exp named function defs), catalysts, time delays – Fixes minor issues in Level 1 specification – Any Level 1 model can be run within software that supports Level 2 • Level 3 – current development effort – More on what ’ s in it later 179

Components of a Level 1 or 2 model

• Compartment: a well-stirred container • Species: chemical compounds • Reaction: transformation, transport, or binding process involving a species. May have a rate parameter • Parameter: a quantity that has a symbolic name (global and local) • Unit definition • Rule: added to set constraints, initial conditions, bounds, etc on the reactions • Everything in SBML is one of the above!

180

SBML Level 2

• • Model Composition - extensions to define an SBML model as a composition of submodels • Diagrams - extensions to include display and layout information in an SBML model • Complexes - species with multiple states, like phosphorylated/not phosphorylated • Alternative Reactions - extensions to allow multiple formalisms for describing reactions, such as stochastic and deterministic • Controlled Vocabularies - vocabularies for labeling models and their components • Dynamic Structures - extensions to allow model structures to vary during simulation • Spatial Features - extensions to represent 2D and 3D spatial characteristics of models and their components

From www.sbml.org

181

So you actually want to run one…

• MANY programs will handle a model written in SBML • libSBML provides a C/C++ API if you want to write your own • Math SBML – an open source toolbox for running SBML models within Mathematica • SBML Toolbox – the equivalent for MatLab • While an open source toolkit for a proprietary software package seems odd at first blush… 182 •

There is a KEGG to SBML converter!

JWS Online

183 From http://jjj.biochem.sun.ac.za/

184

185

186

The Systems Biology Workbench Project

• http://www.sbw-sbml.org/ • Simple framework for application interaction.

Visual Editor

• Cross-platform compatible & language-neutral • Modules are separately compiled executables. A module defines services which have methods

Stochastic Simulator

• SBW native-language libraries provide APIs. • SBW Broker acts as coordinator

SBW Script Interpreter Database Interface ODE-based Simulator

CellML

• Originally designed to describe and exchange models of cellular and subcellular processes. • http://www.cellml.org/public/about/what_is_cellml.html

• XML-based specification of interchange of cell model information • Includes: • Information about model structure • Math, based on MathML • Metadata about the model • Project of Bioengineering Institute of University of Auckland with support from Physiome Sciences Inc. 187

BioSpice

• Lead by Adam Arkin – a DARPA-backed effort • Described in some detail in two recent issues of “ -Omics ” • www.biospice.org

• More licensing term details than many open source efforts • The BioSpice Dashboard may be one of the better “ integrative ” tools under development at present • Uses SBML for model specification 188

Systems biology URLs

• SBW & SBML • NetBuilder • Gepasi • Virtual Cell www.sbw-sbml.org

strc.herts.ac.uk/bio/Maria/NetBuilder • CellML www.cellml.org

• Jarnac + JDesigner www.cds.caltech.edu/~hsauro www.gepasi.org

www.nrcam.uchc.edu/ (NIH-supported) • E-CELL • JigCell www.e-cell.org (based in Japan gnida.cs.vt.edu/~cellcyclepse/ • DARPA BioSPICEwww.biospice.org

• Karyote http://biodynamics.indiana.edu/overview/ 189

Some Good Books

• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBM 0-387-91562-1 • Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press.

• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O ’ Reilly.

• Tisdall, J. 2001. Beginning perl for bioinformatics. O ’ Reilly.

• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.

• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex 190

And a few semi-random other matters

191

BioPerl

• Duct tape for DNA/protein sequence bioinformatics • Perl API for – Reading many data formats/data sources – Writing many data formats/data sources – Manipulating data objects in well known ways • Object oriented Perl module(s) • As with all frameworks, it can be extremely painful for quick and dirty applications • http://bio.perl.org/ • Siblings: BIOPYTHON (www.biopython.org), BIOJAVA (www.biojava.org) 192

BioLinux

• Repository of bioinformatics programs packaged as RPMs – Red Hat 9 – Fedora Core 1 – Fedora Core 2 – SuSE 9.1

• Many packages discussed today – NCBI BLAST – CLUSTAL W – Vienna RNA – BioPerl • http://www.biolinux.org/ 193

Apple bioclusters

• Apple Xserve clusters: head node plus compute nodes • Batch & queuing with Platform LSF, Sun Grid Engine, Open PBS, or PBS Pro • Parallel API ’ s: MPICH, MPI Pro, LAM/MPI • Globus tookit available • iNquiry Bioinformatics tools available – Many open source bioinformatics packages pre-compiled – All available through Pise web interface • “ Easy ” system/cluster administration tools • http://www.apple.com/ 194

BIRN

• Biomedical Informatics Research Network • http://www.nbirn.net/ • NIH-sponsored attempt to create health-oriented cyberinfrastructure • Function BIRN – brain function and disorders, e.g. schizophrenia • Morphometry BIRN – brain structural disorders, e.g. Alzheimers • Mouse BIRN – studying mouse brain and mouse models of human brain disorders • Grid technology, using federated data system approach, based on Globus, SRB, etc.

195

What is the killer application in computational biology?

• Systems biology – latest buzzword, but…. • Goal: multiscale modeling from cell chemistry up to multiple populations • Current software tools still inadequate • Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications, and something that *might* actually run better on a grid than on a supercomputers • Current challenge examples: actin fiber creation, heart attack modeling • Opportunity for predictive biology?

196

Drug Design

• Target generation – so what • Target verification – that ’ s important!

• Toxicity prediction – VERY important!!

• (Cholesterol example) • Counterintuitive problem: the more personalized a therapy is, the smaller its target audience!

197

Computational biology, biomedical research, and HPC

• Two challenges: – Scalability of applications – Wall-clock time sensitivity • Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.

• Traditional biomedical researchers must take advantage of new possibilities • Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers 198

Peta-Scale applications?

• Many biologists are unfamiliar with the real possibilities • Useful – even lifesaving – applications may require straightforward application of well known principles. • The low hanging fruit taste just fine. e.g. “ Parallel ” Matlab, GeneIndex, batch scripts (www.indiana.edu/~rac/bioinformatics/iubatchscripts.html) • Writing a parallel application that can be used to treat people is a very difficult challenge • Attacks on all fronts simultaneously are needed • Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!) • Portals and the TeraGrid –> solutions to problems that biologists care about • All of these open source codes are out there waiting for you to parallelize and/or tune them!

199

So how do you find biologists with whom to collaborate?

• Chicken and egg problem?

• Or more like fishing?

• Or bank robbery? Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “ because that's where the money is.

” (This is, sadly, an urban legend: Sutton never said this).

• Cultivating collaborations with biologists in the short run will require: – Active outreach – Different expectations than we might usually have – Patience • There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships. To do this, we ’ ll all have to spend a bit of time “ going where the biologists are.

” 200

Acknowledgments

• Some of the research described herein was supported by the following:\ – The Indiana Genomics Initiative of Indiana University, supported in part by Lilly Endowment Inc.

– Shared University Research grants from IBM, Inc. to Indiana University. – National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

• Some of the ideas presented here were developed while the senior author was a visiting scientist at Höchstleistungsrechenzentrum Universität Stuttgart. Thanks to HLRS, Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, and Rainer Keller.

• John Herrin, Malinda Lingwall, & W. Lester Teach assisted with graphics • Thanks to Christina Deximo, E. Chris Garrison, and Matt Link for logistical help with the tutorial • Thanks to IBM for providing Thinkpads for the tutorial! 201

Final version of slides ...

… will be available for download from http://racinfo.uits.iu.edu by 19 November 2004 202

Appendix 1 – DNA sequencing

203

DNA sequencing

Send in the clones!

• DNA chopped into blocks • Blocks inserted into bacterial cells using viruses • The bacterial clones make lots of copies of DNA so that you have something to work with • The sequence of each chunk of genetic material is determined using gel electrophoresis 204

Sanger

•Cut DNA at various places (at T, G, C, A) •Add a radioactive molecule at the end of the DNA chain •Find out how long the chain is by gel electrophoresis •Read off the sequence

www.ornl.gov/TechResources/ Human_Genome/publicat/primer/

Dye-terminator Sequencing

www.ornl.gov/TechResources/ Human_Genome/graphics/slides/ images/standardRGB200.jpg

205

Sequence assembly

• Phred – base calling • Phrap – shotgun sequence assembly • Consed – finishing • http://www.phrap.org/ • High quality software 206

Appendix 2: Phylogenetics

207

Building Phylogenetic Trees

• Goal: an objective means by which phylogenetic trees can be estimated in tolerable amounts of wall clock time, producing phylogenetic trees with measures of their uncertainty • All evolutionary changes are described as bifurcating trees -genes or gene products -organisms 208

Phylogenetic trees from DNA sequences

• Changes DNA modeled as Markov processes • Sequences available: • DNA (sequences are series of the base molecules; aligned sequences will also contain +s for gaps) • Amino acid sequences (series of letters indicating the 20 amino acids). Computational challenges more severe than with DNA sequences.

• RNA • The availability of data at present exceeds the ability of researchers to analyze it!

209

Why is tree-building a HPC problem?

• The number of bifurcating unrooted trees for n taxa is (2n-5)!/ (n-3)! 2n-3 • for 50 taxa the number of possible trees is ~10 74 ; most scientists are interested in much larger problems • NP-hard problem • The number of rooted trees is (2n-5)!

210

Phylogenetic software

• Phylip. (J. Felsenstein). Collection of software packages that cover most types of analysis. One of the most popular software collections. Free.

• PAUP. (D. Swofford). Parsimony, distance, and ML methods. Also one of the most popular software collections. Not free, but not expensive.

• fastDNAml. (G. Olsen). Maximum likelihood method for DNA; becoming one of the more popular ML packages. MPI version available soon; well suited to tree searching in large data sets. Free.

• GRAPPA (Bader et al.): Breakpoint analysis program - scales well 211

Stochastic change of DNA

• Markov process, independent for each site: 4 x 4 matrix for DNA, 20 x 20 for amino acids • • A A p(A->A) C p(A->C) G p(A->G) … T • C • G .

p(C->A) p(C->C) p(C->G) … • T .

• Transitions more probable than transversions.

• Must account for heterogeneity in substitution rates among sites (DNArates – Olsen) 212

fastDNAml

• Developed by Gary Olsen • Derived from Felsensteins ’ s PHYLIP programs • One of the more commonly used ML methods • The first phylogenetic software implemented in a parallel program (at Argonne National Laboratory, using P4 libraries) • Olsen, G.J.,et al.1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in Biosciences 10: 41-48 • MPI version produced by Indiana University in collaboration with Gary Olsen available from http://www.indiana.edu/~rac/hpc/fastDNAml/ 213

fastDNAml algorithm – adding taxa

• Optimize tree for 3 (randomly chosen) taxa only one topology possible • Randomly pick another taxon – (2i-5) trees possible • Keep the best (maximum likelihood tree) 214

Basic fastDNAml algorithm - Branch rearrangement

• Move any subtree crossing n vertices (if n=1 there are 2i 6 possibilities) • Keep best resulting tree • Repeat this step until local swapping no longer improves likelihood value 215

fastDNAml algorithm con

’

t: Iterate

• Get sequence data for next taxon • Add new taxa (2i-5) • Keep best • Local rearrangements (2i-6) • Keep best • Keep going….

• When all taxa have been added, perform a full tree check 216

Overview of parallel program flow

• Program modules – Master (generates trees, receives back from Foreman best tree at each step) – Foreman (dispatches trees to workers, determines best tree, tracks activity of workers) – Worker – Monitor (instrumentation) – Parallel versions include fault tolerance features (useful in large clusters and grid computing) 217

Performance of fastDNAml

20 10 0 0 70 40 30 60 50 10 20 Perfect Scaling 30 40 Number of Processors 50 Taxa 101 Taxa 150 Taxa 50 60 70 218

Why bother with parallel code?

• Why not just achieve speedup of n on n processors by running n independent jobs?

• Practical benefits of seeing results quickly • Parallel program permits assault on much more complicated problems (e.g. protein sequences) 219

Appendix 3: Grand challenge problems and some thoughts about the future

220

Modeling Heart Function

• Based on Noble, D. 2002. Modeling the heart – from genes to cells to the whole organ. Science 295: 1678-1682 • Two mutations known for sodium channels – DeltaKPQ – deletion of 3 amino acids (lysine-proline glutamine) – causes persistent sodium flow through cell wall – Missense mutations in sodium channels which cause ventricular fibrillations that can be fatal • Models of heart function can produce counterintuitive predictions • Grand challenge problem: the full scale reconstruction of a heart attack 221

Real-time fMRI

222

3.0T MRI Scanner CRAY T3E SGI Onyx

In 1996, this required a supercomputer Today, it

’

s routine

Slide courtesy of Ralph Roskies, Pittsburgh Supercomputing Center, [email protected]

Gamma Knife

• Used to treat inoperable tumors • Treatment methods currently use a standardized head model • UITS is working with IU School of Medicine to adapt Penelope code to work with detailed model of an individual patient ’ s head 223

PENELOPE Basics

• “ PENELOPE performs Monte Carlo simulation of coupled electron-photon transport in arbitrary materials and complex quadric geometries ” (http://www.nea.fr/abs/html/nea-1525.html) • Improvement of targeting based on CT scans of patient ’ s head – 200 512 x 512 voxel slices • Simulation takes ~7 hours using a serial version of PENELOPE running on a 1 GHz PIII Windows system • Goal: 5 minutes to one hour 224

Parallelization of PENELOPE

• Each processor: – Views entire target – Generates its own random numbers – Generates a set number of independent trajectories – Accumulates data • Process 0: – Collects the raw data – Computes desired results • Uses F90 for parallel random number generator from MILC consortium • Uses MPI elsewhere 225

PENELOPE Scalability: processing time

100000 226 10000 1000 100 10 1 0 50 100 150 200

Number of Processors

250 300 On IBM SP/Power3

“

Simulation-only

”

studies

• Aquaporins -proteins which conduct large volumes of water through cell walls while filtering out charged particles like hydrogen ions. • Massive simulation (35,000 hours TCS) showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Half way through, the water molecule flips over. • That breaks the ‘ proton wire ’ • Work done at Pittsburgh Supercomputing Center • Klaus Schulten et al, U. of Illinois, SCIENCE (April 19, 2002) 227

Other example large-scale computational biology grid projects

• Department of Energy “ Genomes to Life ” http://doegenomestolife.org/ • Encyclopedia of Life (http://eol.sdsc.edu/) • Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/ • Asia Pacific BioGrid (http://www.apbionet.org/) • eDiamond – breast cancer/mammography grid (http://www.mirada-solutions.com/PH1.asp?PAGE_ID=739) 228

Visualization: OpenDX

• http://www.opendx.org/ • OpenDX is the open source software version of IBM's Visualization Data Explorer Product • Good sources of information in books, tutorials, etc.

• Interesting example of open source • Animations as well

http://www.opendx.org/highlights.php

229

Visualization: SciRUN

• Some of the most dramatic biological visualizations ever done • Has been used for surgical support • Scientific Computing and Imaging Institute – Christopher R. Johnson • http://www.sci.utah.edu/ 230

Genomes to Life

• http://www.doegenomestolife.org/ • Goals: – Identify and Characterize the Molecular Machines of Life — the Multiprotein Complexes That Execute Cellular Functions and Govern Cell Form – Characterize Gene Regulatory Networks – Characterize the Functional Repertoire of Complex Microbial Communities in Their Natural Environments at the Molecular Level – Develop the Computational Methods and Capabilities to Advance Understanding of Complex Biological Systems and Predict Their Behavior – (Goals taken directly from Genomes to Life web site) 231

EOL Basic Topology

Genomic Data Putative Functional and 3D Assignment Integration with Other Resources 232 Public and Private Databases

To Serve Thousands Worldwide

http://eol.sdsc.edu/methodology.html

Current Genomic Pipeline

structure info SCOP, PDB Building FOLDLIB: PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP 90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30) sequence info NR, PFAM Arabidopsis Protein sequences Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG) Create PSI-BLAST profiles for Protein sequences Structural assignment of domains by PSI-BLAST on FOLDLIB Only sequences w/out A-prediction Structural assignment of domains by 123D on FOLDLIB Only sequences w/out A-prediction Functional assignment by PFAM, NR, PSIPred assignments FOLDLIB Domain location prediction by sequence

http://eol.sdsc.edu/methodology.html

Store assigned regions in the DB 233

Scale of Multi-genome Analysis

234 ~800 genomes @ 10k-20k per =~10 7 ORF ’ s structure info SCOP, PDB Building FOLDLIB: PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP sequence info NR, PFAM 10 4 entries Genomes Protein sequences Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG) Create PSI-BLAST profiles for Protein sequences 4 CPU years 228 CPU years 90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30) FOLDLIB Structural assignment of domains by PSI-BLAST on FOLDLIB Only sequences w/out A-prediction 3 CPU years Structural assignment of domains by 123D on FOLDLIB Only sequences w/out A-prediction 9 CPU years Functional assignment by PFAM, NR, PSIPred assignments Domain location prediction by sequence

http://eol.sdsc.edu/methodology.html

252 CPU years 3 CPU years Store assigned regions in the DB

BIRN

• Biomedical Informatics Research Network • http://www.nbirn.net/ • NIH-sponsored attempt to create health-oriented cyberinfrastructure • Function BIRN – brain function and disorders, e.g. schizophrenia • Morphometry BIRN – brain structural disorders, e.g. Alzheimers • Mouse BIRN – studying mouse brain and mouse models of human brain disorders • Grid technology, using federated data system approach, based on Globus, SRB, etc.

235

Drug Design

• Target generation – so what • Target verification – that ’ s important!

• Toxicity prediction – VERY important!!

• (Cholesterol example) • Counterintuitive problem: the more personalized a therapy is, the smaller its target audience!

236

What is the killer application in computational biology?

• Systems biology – latest buzzword, but…. • Goal: multiscale modeling from cell chemistry up to multiple populations • Current software tools still inadequate • Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications, and something that *might* actually run better on a grid than on a supercomputers • Current challenge examples: actin fiber creation, heart attack modeling • Opportunity for predictive biology?

237

Computational biology, biomedical research, and HPC

• Two challenges: – Scalability of applications – Wall-clock time sensitivity • Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.

• Traditional biomedical researchers must take advantage of new possibilities • Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers 238

Peta-Scale applications?

• Is this what most biologist really need?

• Many biologists are unfamiliar with the real possibilities • Useful – even lifesaving – applications may require straightforward application of well known principles. • The low hanging fruit taste just fine. e.g. “ Parallel ” Matlab, GeneIndex, batch scripts (www.indiana.edu/~rac/bioinformatics/iubatchscripts.html) • Writing a parallel application that can be used to treat people is a very difficult challenge • Attacks on all fronts simultaneously are needed • Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!) • All of these open source codes are out there waiting for you to parallelize and/or tune them!

239

So how do you find biologists with whom to collaborate?

• Chicken and egg problem?

• Or more like fishing?

• Or bank robbery?

• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “ because that's where the money is.

” (This is, sadly, an urban legend: Sutton never said this) • Cultivating collaborations with biologists in the short run will require: – Active outreach – Different expectations than we might usually have – Patience • There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships. To do this, we ’ ll all have to spend a bit of time “ going where the biologists are.

” 240

Open source tools for computational biology Craig A. Stewart, Richard Repasky, and Andrew Arenson [email protected]; [email protected]; [email protected] Indiana University SC2004 Tutorial S06 7 November 2004

Transcript Open source tools for computational biology Craig A. Stewart, Richard Repasky, and Andrew Arenson [email protected]; [email protected]; [email protected] Indiana University SC2004 Tutorial S06 7 November 2004

Open source tools for computational biology

Craig A. Stewart, Richard Repasky, and Andrew Arenson [email protected]; [email protected]; [email protected]

Indiana University SC2004 Tutorial S06 7 November 2004

License terms

Planned schedule for the day

Table of Contents

Class Plan & Objectives

Motivation

What has changed recently?

Genome Projects Timeline

Definitions

Challenges

Comparison of Complexity

Complexity, Con't

A rapid introduction to key elements of biology

Why is it important to know some biology?

Central dogma of biology

Cell Structure

Four (or Five) Bases

Genetic Code

Translating DNA to RNA and Transcribing RNA to Proteins

Human Chromosomes

Sickle Cell

What is a Gene?

Gene Components

Alternate splicing

A (very) little about evolutionary genetics Hardy-Weinberg Law

Population genetics & evolution

How do sequences differ?

CGTACCG T TAATAT CGTACCG A TAATAT

CGTACC CC GTAATAT CGTACC . .

GTAATAT

CGTACCG TTA ATAT CGTACCG . . .

ATAT

Random genetic change

Key points (so far)

Bioinformatics data sources

Bioinformatics Data Sources

Biomedical literature

PubMed

Genomic, Proteomic, etc. data sources

Types of genomic data

DNA databases

http://www.ncbi.nlm.nih.gov

Data Structures

Primary vs. Secondary Data sources

Protein Structure

Biochemistry and pathways

Similarity matching

Why pattern matching (and what are the problems)

US!

Problems!

Alignment

Alignment

Scoring Alignments

CGTACCG TTA ATAT CGTACCG . . .

ATAT CGT A CCGTT A ATAT CGT . C . GTT .

ATAT

Now what?

Dotter

Local Alignments with BLAST

(Original) BLAST Algorithm

(Original) BLAST Algorithm, 2

(Original) BLAST algorithm,3

Scoring of DNA

BLAST algorithm in more detail

BLAST Statistics

BLAST Options

Protein Sequence Alignment

Point Accepted Mutations (PAM)

BLOSUM Matrices

PSI BLAST

PSI BLAST

Gaps in PSI-Blast

Discontinuous MEGA Blast

How many BLASTs?

mpiBLAST http://mpiblast.lanl.gov/

mpiBLAST Algorithm