MOLECULAR BIOLOGY WITH GENETICS – BIOINFORMATICS
Bioinformatics
David Brodin
BEA core facility
www.bea.ki.se
Course web page: www.bea.ki.se/biomedicin_v42/
Autumn term 2007
Lecture Content
Monday
• Introduction to Bioinformatics
-History of Bioinformatics
-Need for computers
-Computational Biology
-Fields of Bioinformatics
-Bioinformatic tools
• Homology, sequence analysis and phylogenetics
• Introduction to Microarrays & Lab
Tuesday
• Mass spectrometry
• Web Databases, bioinformatic tools etc
• Genotyping Arrays
• Tiling Arrays
Wednesday
• Computer Lab
Need for Computers
Major advances in
• the field of molecular biology
• genomic technologies
Explosive growth in
the biological information generated by the
scientific community
Need for
computerized databases to store, organize,
and index the data and for specialized
tools to view and analyze the data
Computer science is to molecular biology
what mathematics has been to physics ...
-- Larry Hunter, ISMB’94
History of Bioinformatics
[Figure: History of DNA Sequencing. Adapted from Messing & Llaca, PNAS (1998)]
• Early database: The Atlas of Protein Sequences was
available on digital tape in 1978, and by modem in 1980.
• Early programs: restriction enzyme sites, pattern
finding, promoters, etc., circa 1978.
• 1982: DDBJ/EMBL/GenBank are created as a public
repository of genetic sequence information.
• 1983: NIH funds the PIR (Protein Information Resource)
database.
• 1988: Pearson and Lipman create FASTA.
Number of published base pairs
• 1971 First published DNA sequence: 12
• 1977 PhiX174: 5,375
• 1982 Lambda: 48,502
• 1992 Yeast Chromosome III: 316,613
• 1995 Haemophilus influenzae: 1,830,138
• 1996 Saccharomyces: 12,068,000
• 1998 C. elegans: 97,000,000
• 2000 D. melanogaster: 120,000,000
• 2001 H. sapiens (draft): 2,600,000,000
• 2003 H. sapiens: 2,850,000,000
Computational Biology
In the early days of bioinformatics a major concern was
creation and maintenance of databases to store biological
information, involving design issues and development of
complex user interfaces.
Today the most pressing task involves the analysis and
interpretation of various types of data, including nucleotide
and amino acid sequences, protein domains, and protein
structures. The actual process of analyzing and
interpreting data is referred to as computational biology.
Bioinformatics: Research, development, or application
of computational tools and approaches for expanding the
use of biological, medical, behavioral or health data,
including those to acquire, store, organize, archive,
analyze, or visualize such data.
Computational Biology: The development and
application of data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and
social systems.
-National Institutes of Health (NIH)
Biology in the 21st century is being transformed from a
lab-based science to an information science as well.
Biological Databases
A biological database is a large, organized body of
persistent data, usually associated with
computerized software designed to update, query,
and retrieve components of the data stored within
the system.
For researchers to benefit from the data stored in a
database, two additional requirements must be met:
• easy access to the information
• a method for extracting only that information
needed to answer a specific biological question
Nucleotide sequence record:
• the input sequence with a description of the type
of molecule
• ID of sequence
• the scientific name of the source organism
• contact name
• literature citations associated with the
sequence.
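The record fields listed above, together with the requirement that a researcher can pull out only the records relevant to a question, can be sketched in a few lines of Python. This is only an illustration and not the schema of GenBank or any other real database: the class name NucleotideRecord, the field names, the function find_by_organism and the example entries are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NucleotideRecord:
    """Hypothetical record layout mirroring the fields on the slide."""
    seq_id: str            # ID of the sequence
    molecule_type: str     # description of the type of molecule
    organism: str          # scientific name of the source organism
    contact: str           # contact name
    sequence: str          # the input sequence itself
    citations: List[str] = field(default_factory=list)  # literature citations


def find_by_organism(records, organism):
    """Return only the records whose source organism matches (case-insensitive)."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Made-up example entries, not real database records.
records = [
    NucleotideRecord("EX0001", "DNA", "Homo sapiens", "A. Researcher", "ATGGCC"),
    NucleotideRecord("EX0002", "mRNA", "Mus musculus", "B. Researcher", "AUGGCC"),
]

hits = find_by_organism(records, "homo sapiens")
print([r.seq_id for r in hits])
```

A real database would index these fields and answer the query without scanning every record, but the idea is the same: structured fields make it possible to extract only the information needed.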
Sub-disciplines, challenges & goals
Important sub-disciplines:
• Analysis and interpretation of various types of
biological data
• Development of new algorithms and
statistics with which to assess biological
information
• Development and implementation of tools that
enable efficient access and management of
different types of information
Challenges of working in bioinformatics:
• Need to feel comfortable in an interdisciplinary area
• Depend on others for primary data
• Need to address important biological and
computer science problems
Important goal of bioinformatics:
understanding basic biological processes and, in
turn, advances in the diagnosis, treatment, and
prevention of many genetic diseases.
Fields of Bioinformatics
The "omics" Series:
• Genomics: Gene identification & characterization
• Transcriptomics: Expression profiles of mRNA
• Proteomics: Functions & interactions of proteins
• Structural Genomics: Large scale structure
determination
• Cellinomics: Metabolic Pathways, Cell-cell interactions
• Pharmacogenomics: Genome-based drug design
[Figure labels: Genomes, Nucleotide Sequences, Protein Sequences,
Macromolecular Structures, Small Molecules, Gene Expression,
Molecular Interactions, Reactions & Pathways, Protein Families,
Taxonomy, Ontologies, Sequence Similarity & Analysis,
Structure Analysis]
Typical Questions
Biological problems that computers
can help with:
• I cloned a gene – is it a known gene?
• Does the sequence match? Is the sequence any
good?
• Is the sequence similar to other known
sequences?
• Which gene family does it belong to?
• The gene I'm interested in was found in another
organism, but not in mine. How can I look for it?
• How is the gene expressed in different types of
tissues?
• What is the biological function of the protein
encoded by the gene?
• Is the gene associated with any disease?
Increasingly, biological studies begin with a scientist
conducting vast numbers of database and web site
searches to formulate specific hypotheses or to design
large-scale experiments.
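As a toy illustration of the question "Is the sequence similar to other known sequences?", the sketch below ranks a few sequences by percent identity to a query. It assumes the sequences are already aligned and of equal length, which is far simpler than what tools such as BLAST or FASTA do (they search for local alignments and score them statistically). The function name percent_identity and the example sequences are made up for this sketch.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two pre-aligned, equal-length sequences.

    Gap characters ('-') never count as matches.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(
        1 for x, y in zip(a.upper(), b.upper()) if x == y and x != "-"
    )
    return 100.0 * matches / len(a)


# Made-up sequences standing in for a query and a small "database".
query = "ATGGCCATTG"
known = {
    "gene_a": "ATGGCCATTG",
    "gene_b": "ATGGACATAG",
    "gene_c": "TTTTTTTTTT",
}

# Rank the known sequences from most to least similar to the query.
ranked = sorted(
    known, key=lambda name: percent_identity(query, known[name]), reverse=True
)
for name in ranked:
    print(name, percent_identity(query, known[name]))
```

In practice the hard part is finding the alignment in the first place; percent identity is only the final summary number reported for a match.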
Bioinformatic tools
• Many different bioinformatic tools are
available over the internet free of charge
to whoever wishes to use them
• Many commercial software
packages are also available
• Some bioinformaticians write their own
tools for specialized tasks
Many platforms are available for software
development...
Open Source & Open Access
Open Source in the life sciences:
• Present in all areas of bioinformatics
• Some very well known examples of tools used in
industry and academic circles include:
– BLAST
– EMBOSS
– EnsEMBL
– GenScan
– Bioconductor
Open Access:
• Unrestricted access to data
• Allows all to work and make discoveries
• Discoveries are not necessarily open access
• Open access is applicable to any kind of data
you want to apply it to:
– Sequence data (DNA, RNA or protein)
– Gene expression data
– Protein-protein interaction data
– Publications
Top 10 Future Challenges
• Precise, predictive model of transcription initiation and
termination: ability to predict where and when transcription will
occur in a genome
• Precise, predictive model of RNA splicing/alternative splicing:
ability to predict the splicing pattern of any primary transcript
in any tissue
• Precise, quantitative models of signal transduction pathways:
ability to predict cellular responses to external stimuli
• Determining effective protein:DNA, protein:RNA and
protein:protein recognition codes
• Accurate ab initio protein structure prediction
• Rational design of small molecule inhibitors of proteins
• Mechanistic understanding of protein evolution:
understanding exactly how new protein functions evolve
• Mechanistic understanding of speciation: molecular details of
how speciation occurs
• Continued development of effective gene ontologies:
systematic ways to describe the functions of any gene or
protein
• Education: development of appropriate bioinformatics
curricula for secondary, undergraduate and graduate
education
Chris Burge, Ewan Birney, Jim Fickett. Genome Technology, issue no. 17,
January 2002