Transcript Slide 1
Chapter 1 Introduction

What is bioinformatics?
•Quantitation is essential in biology
  •Counting bacterial colonies
  •Counting animals in a natural environment
  •Counting genetic variability among plants and fruit flies led to the laws of Mendelian inheritance
•More complex quantitative tools involve predictions of human population growth or enzyme kinetics
•Very sophisticated tools may involve
  •Application of “game theory” to model behavior and evolution
  •Non-linear partial differential equations to model cardiac blood flow or in situ cytoplasm flow
•None of these examples is bioinformatics
•Bioinformatics relates to macromolecules

•Earliest bioinformatics exercise: Margaret Dayhoff (1965), first protein sequence database, the Atlas of Protein Sequence and Structure (now PIR)
•Early 1970s: Brookhaven National Laboratory compiled the Protein Data Bank (PDB) of X-ray and NMR structures
•First sequence alignment algorithm: Needleman and Wunsch (1970)
•Routine sequence comparisons and database searching
•First protein structure prediction algorithm: Chou and Fasman (1974)
•1980s saw the establishment of GenBank and FASTA, followed by BLAST (1990)
•Human Genome Project started in the late 1980s
•Bioinformatics flourished and grew mainly because of the enormous volumes of sequence data

Definition
•Bioinformatics is the discipline that uses computers to store, retrieve, manipulate and distribute information related to biological macromolecules such as RNA, DNA and proteins
•Computational biology encompasses all areas of biology that involve computation

Goal
•Better understand a living cell and how it functions at a molecular level

Two major fields
1. Development of computational tools and databases
•Software for sequence analysis
  •Sequence alignment, sequence database searching, motif and pattern discovery, gene and promoter finding, reconstruction of evolutionary relationships, genome assembly and comparison
•Software for structural analysis
  •Protein and nucleic acid structural analysis, comparison, classification and prediction
•Software for functional analysis
  •Gene expression profiling, protein-protein interaction prediction, protein sub-cellular location prediction, metabolic pathway reconstruction
•Construction and curation of biological databases
2. Generation of biological knowledge to better understand living systems
•Often identifies new problems that require new software to analyze
•Bioinformatics is essential for basic genomic and molecular biology research
•Major impact in biotechnology and biomedical sciences
•Knowledge-based drug design
  •3D structure allows design of ligands that fit
  •Reduces time and cost to develop drugs
•Forensic DNA analysis
  •Bayesian statistics and likelihood-based methods
•Personalised healthcare
•Agricultural biotechnology
  •Plant genome databases
  •Gene expression profiles
  •New crop varieties

Limitations of bioinformatics
•The results are only as good as the data
  •Errors in sequences
•Hypothesis independent
  •Bioinformatics does not replace traditional hypothesis-driven approaches
  •It complements them and identifies new questions
•Integrate gene expression and protein functions in the cell
•Analysis at the level of systems: systems biology
  •Description of a cell as a mathematical model
  •Predictive value

Chapter 2 Biological Databases

What is a database?
•A database is a computerized archive used to store and organize data so that information can be retrieved by a variety of search criteria
•A database can be thought of as a stack of record cards, where each record card contains defined items of information, say Name, Address, Phone Number, Birth Date, etc.
•In a database, each such card is an entry, and each defined item of information is a field
•Each field of each entry contains a value (which can be NULL)
•Searching all entries retrieves those that contain a specific value in a field
•This process is called making a query
•Biological databases often have higher-level requirements such as knowledge discovery, where previously unknown relations between values are found

Different database formats
•Flat file
  •ASCII file
  •Rows of comma-delimited entries
  •The computer has to read the entire file to find all entries or relationships
  •Many databases are distributed as flat files
•Below is a simple ASCII data file from REBASE, a database of restriction enzyme cleavage sites (http://rebase.neb.com/rebase/rebase.html)

REBASE version 807 strider.807
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
REBASE, The Restriction Enzyme Database http://rebase.neb.com
Copyright (c) Dr. Richard J. Roberts, 2008. All rights reserved.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rich Roberts
#AarI,cacctgc,4,8,
AatII,gacgt/c,
AbsI,cc/tcgagg,
AccI,gt/mkac,
Acc65I,g/gtacc,
#AceIII,cagctc,7,11,
#AciI,ccgc,-3,-1,
AclI,aa/cgtt,
#AcuI,ctgaag,16,14,
AfeI,agc/gct,
AflII,c/ttaag,
Jun 30 2008

Relational database
•The term “relational” does not describe relations between entries
•“Relation” is the mathematical term for “table”
•Thus a relational database is composed of tables
•Each table is composed of rows (entries, or tuples) and columns (attributes), with a value in each cell
•Where multiple tables share a common column, relationships between columns in different tables can be found by combining rows that have identical values in that column

Rows are entries (tuples); columns are attributes

Student Number  Name  State
1               Jack  Kansas
2               John  Maryland
3               Jill  Washington

A simple three table relational database

Student Number  Name  Gender  State
1               Jack  M       Kansas
2               John  M       Maryland
3               Jill  F       Maryland

Student Number  Course
1               BOC314
2               BOC334
3               BOC364

Course  Description
BOC314  Biochemistry
BOC334  Proteomics
BOC364  Bioinformatics

Query: What courses do students from Maryland take?
Query: Do females take more courses in the first or second semester?

Object oriented databases
•Attributes of entries are represented as members of classes
•Each member can be a member of more than one class
•This gives rise to a hierarchical relationship, very much like a tree
•Parent objects point to child objects, which, in turn, point to their child objects
•Thus, all students from Maryland will be pointed to by the Maryland object
•All students who do BOC364 will be pointed to by the BOC364 object
•Great care must be taken when designing an object-oriented database to ensure efficient querying

Biological databases
•Primary databases
  •Raw sequence data
  •GenBank
  •PDB
•Secondary databases
  •Computationally processed or curated databases
  •SWISS-PROT
  •PIR
•Specialized databases
  •For specific interest groups
  •FlyBase
  •SGD

Primary Databases
•Three major databases
  •GenBank (http://www.ncbi.nlm.nih.gov/Genbank/)
  •EMBL
  •DDBJ
•Sequences are exchanged on a daily basis
•Each database is up to date (use any one)
•Deposition of data is a prerequisite to publication

Secondary databases
•Significant processing of original raw data
  •Annotation
  •ORFs
  •Functional links
•SWISS-PROT
  •Carefully curated database
  •High quality
•SWISS-PROT, TrEMBL and PIR are combined in UniProt
•Pfam: aligned protein sequences to define families
•BLOCKS: motifs and patterns
•DALI: protein structure comparisons to find evolutionary relationships

Specialized Databases
•Often focused on a specific aspect of an organism
•Curated by experts
•Highly annotated and processed data
•SGD
•FlyBase
•WormBase

Interconnection between biological databases
•Need to access both primary and secondary databases
•Provide links between databases
•Difficult to connect databases with different structures: ASCII, relational and object-oriented
•Common Object Request Broker Architecture (CORBA)
•eXtensible Markup Language (XML)

Information retrieval
Entrez (pronounced “Aahn-tray”)
•Gateway that allows text-based searches of a wide variety of data
Using “Limits” in Entrez
Preview/Index
History
Clipboard
Online Mendelian Inheritance in Man
PubMed
GenBank file format

FASTA format
•First line starts with a “>” sign followed by any information
•Sequence continues with 60 or 80 characters per line

Abstract Syntax Notation (ASN.1)
Sequence retrieval system (SRS) (http://srs6.ebi.ac.uk/)
Result of SRS search
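
The FASTA layout from the FASTA format slide (a “>” header line followed by sequence lines of 60 or 80 characters) can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function names `parse_fasta` and `format_fasta`, the header text and the demo sequence are all made up for this example, and real files are usually better handled with an established library.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (the text after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is free text.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines are joined back into one string.
            records[header] += line
    return records


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Write one record, wrapping the sequence at `width` characters per line."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + wrapped)


# Hypothetical record: 160 bases, wrapped at 60 characters per line.
example = format_fasta("demo_seq hypothetical example", "ACGT" * 40, width=60)
parsed = parse_fasta(example.splitlines())
```

Writing the record and reading it back returns the original sequence, which is a quick check that the line wrapping carries no information of its own.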
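
Returning to the simple three table relational database from Chapter 2, the query “What courses do students from Maryland take?” can be answered by combining rows that share a value in a common column. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names (`student`, `enrolment`, `course`, `student_number`) are assumptions chosen for illustration; only the data values come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three tables mirroring the slide; names are hypothetical.
conn.execute(
    "CREATE TABLE student (student_number INTEGER PRIMARY KEY, "
    "name TEXT, gender TEXT, state TEXT)"
)
conn.execute("CREATE TABLE enrolment (student_number INTEGER, course TEXT)")
conn.execute("CREATE TABLE course (course TEXT PRIMARY KEY, description TEXT)")

conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(1, "Jack", "M", "Kansas"), (2, "John", "M", "Maryland"), (3, "Jill", "F", "Maryland")],
)
conn.executemany(
    "INSERT INTO enrolment VALUES (?, ?)",
    [(1, "BOC314"), (2, "BOC334"), (3, "BOC364")],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [("BOC314", "Biochemistry"), ("BOC334", "Proteomics"), ("BOC364", "Bioinformatics")],
)

# Join on the shared columns: student_number links student to enrolment,
# and course links enrolment to course.
rows = conn.execute(
    """
    SELECT s.name, c.course, c.description
    FROM student AS s
    JOIN enrolment AS e ON e.student_number = s.student_number
    JOIN course AS c ON c.course = e.course
    WHERE s.state = ?
    ORDER BY s.student_number
    """,
    ("Maryland",),
).fetchall()

for name, code, description in rows:
    print(name, code, description)
```

The second query on the slide (courses per semester by gender) cannot be answered from these three tables, because none of them records a semester; it would need an additional column or table.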