Transcript Document

BIOINFORMATICS
CSC 2500
Survey of Information Science
Katherine W. McCain
Associate Dean and Professor
College of Information Science & Technology
Drexel University
Overview of Today’s Session
What is Bioinformatics
Bioinformatics History and Growth
The BIO in Bioinformatics
The INFO in Bioinformatics
What is Bioinformatics?
Bioinformatics is the field of science in which biology, computer
science, and information technology merge into a single discipline.
The ultimate goal of the field is to enable the discovery of new
biological insights as well as to create a global perspective from
which unifying principles in biology can be discerned. (NCBI)
Bioinformatics is a combination of Computer Science, Information
Technology, and Genetics to determine and analyse genetic
information (lead editorial BITS Journal)
What is Bioinformatics?
Bioinformatics is conceptualizing biology in terms of
macromolecules…and then applying “informatics” techniques…to
understand and organize the information associated with these
molecules, on a large scale (Gerstein’s lab at Yale)
Bioinformatics is the use of computers to aid in
extracting/processing biological information from a large set of data
points (Brad Jameson—MCP Hahnemann University)
Bioinformatics is the generation, visualization, analysis, storage and
retrieval of large quantities of biological information (Michael
Agostino)
What is Bioinformatics
Bioinformatics is the application of computer science to
the management and analysis of biological
information.
In genome projects, bioinformatics includes: the
development of methods to search databases quickly,
the creation of algorithms to analyse DNA and protein
sequences, and the creation of algorithms to predict
gene structure from DNA sequences (adapted from
DoubleTwist tutorial)
More Definitions
(from Gibas & Jambeck)
Computational Biology: the application of quantitative
analytical techniques in modeling biological systems.
Informatics: the representation, organization,
manipulation, distribution, maintenance, and use of
information, particularly in digital form.
Bioinformatics is frequently considered a subset of
Computational Biology
Subdisciplines of Bioinformatics
• the development of new algorithms and statistics with which to
assess relationships among members of large data sets;
• the analysis and interpretation of various types of data including
nucleotide and amino acid sequences, protein domains, and
protein structures; and
• the development and implementation of tools that enable
efficient access and management of different types of
information.
A “Map” of Bioinformatics
From the website of Cynthia Gibas, Ph.D., faculty member at VaTech and author of a good book on
Computational Methods for Bioinformatics
http://gibas.biotech.vt.edu/
Brief History of Bioinformatics
(focusing on the molecular end)
1956 Sanger sequences first protein – bovine insulin
1959 Journal of Molecular Biology published Vol. 1
1966 Holley et al sequence first nucleic acid – yeast alanine tRNA
1967 Dayhoff publishes Atlas of Protein Sequences and Structure
1972 Protein Databank established – X-ray crystallographic protein
structures
1974 Nucleic Acids Research published Vol. 1
1975 Sanger & Coulson publish technique for sequencing DNA—
cited in ~750 articles
1977 Sanger publishes nucleotide sequence of bacteriophage
φX174
Brief History of Bioinformatics
(continued)
1981 GenBank, EMBL, DDBJ established
1985 Computer Applications in the Biosciences (now
Bioinformatics) published Vol. 1
1987 SWISSPROT protein sequence database established
1990 Human Genome Project begun (DoE, NIH, many labs in US
and elsewhere)
1994-95 Craig Venter establishes TIGR—sequences H. influenzae
using “shotgun approach”
1996 Affymetrix produces the first commercial DNA chip
1998 Craig Venter establishes Celera (for-profit company)
2001 Human genome is “published” (DoE/NIH and Celera)
Growth of the Bioinformatics Literature in Medline
[Chart: x-axis years 1965 to 2001; y-axis 0 to 7000]
The three institutions (GenBank, European Molecular Biology Laboratory, and
DNA Data Bank of Japan) together have contributed 100 gigabases to GenBank.
These 100,000,000,000 bases, or "letters" of the genetic code, represent both
individual genes and partial and complete genomes of over 165,000
organisms. One hundred billion bases is about equal to the number of nerve
cells in a human brain and a bit less than the number of stars in the Milky Way.
MAJOR CHALLENGES FOR BIOINFORMATICS
(Mike Agostino)
•Data growth
•Data volume (disk space, memory)
•Data Retrieval
•Data Reduction and Visualization
•Data Integration
•Rapidly changing field
•New field—tools could be better
The BIO in Bioinformatics
Typical Animal Cell Structure
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookCELL2.html
Where is the genetic information?
In the nucleus of the cell (in eukaryotes)—in the form of
long double strands of DNA that are coiled and bundled
in the form of chromosomes.
The Mammalian Cell Nucleus
http://spectorlab.cshl.edu/domains.html
The Genome—Organizational Hierarchy
Genome: The total set of genes carried by an individual or cell.
Chromosome: The DNA of eukaryotes is subdivided into
chromosomes, each of which has a long length of DNA associated
with various proteins. Each chromosome has a characteristic length
and banding pattern.
Humans have 23 pairs of chromosomes
in the nucleus
Drosophila has 4 pairs of chromosomes
in the nucleus
http://homepages.uel.ac.uk/V.K.Sieber/human.htm
http://www.bio.psu.edu/flylabs/karotype.htm
Where ELSE is the genetic information?
In the mitochondrion, in the form of a loop of mtDNA.
There are MANY mitochondria per cell (vs only one
nucleus & 1 set of chromosomes). mtDNA has
important forensic uses (maternal inheritance means that
relatives can provide reference samples). See the website
of Mitotyping Technologies (a company that specializes in mtDNA
forensics) for an overview.
The Mitochondrion (1)
Fawcett, A Textbook of Histology, Chapman and Hall, 12th edition, 1994
The Mitochondrion (2)
About 90 percent of the energy needed by the body's tissues is made by the
mitochondria, which store it in a molecule known as ATP, the end result of a
long chain of chemical events. ATP then has to be carried out of the
mitochondria into the main part of the cell.
The Mitochondrion (3)
From: MITOCHONDRIAL and METABOLIC DISORDERS- a primary care physician’s guide
“The Spectrum of Mitochondrial Disease”. Robert K. Naviaux, MD, PhD
http://biochemgen.ucsd.edu/mmdc/ep-3-10.pdf
How many genes?
Organism        Genome size   Date   # Genes
E. coli         4.6 Mb        1997   4,200
Yeast           12.1 Mb       1996   6,034
Nematode        97 Mb         1998   12,099
Mustard plant   100 Mb        2000   25,000
Fruit Fly       137 Mb        2000   13,061
Human           3000 Mb       2001   39,000
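The figures above make more sense when genome size and gene count are put side by side. The short sketch below computes genes per megabase for each organism. The numbers are the ones on the slide; the names GENOMES and genes_per_mb are made up for this illustration.

```python
# (organism, genome size in Mb, number of genes) as listed on the slide
GENOMES = [
    ("E. coli", 4.6, 4200),
    ("Yeast", 12.1, 6034),
    ("Nematode", 97, 12099),
    ("Mustard plant", 100, 25000),
    ("Fruit Fly", 137, 13061),
    ("Human", 3000, 39000),
]

def genes_per_mb(size_mb, genes):
    """Gene density: number of genes per megabase of genome."""
    return genes / size_mb

for organism, size_mb, genes in GENOMES:
    print(f"{organism:<14}{genes_per_mb(size_mb, genes):8.1f} genes per Mb")
```

The small genomes pack in hundreds of genes per megabase, while the human figure is far lower: genome size does not grow in step with the number of genes.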
Chromosome Features
The Central Dogma
DNA → RNA → Protein
Replication: DNA strand unwinds and new matching
strands are constructed on the unwound templates
Transcription: DNA strand unwinds and a single strand of
RNA is constructed on one of the unwound templates
Translation: the sequence of nucleotides in the RNA
strand is “translated” into a sequence of amino acids that
are joined to make a protein
It’s really much more complicated than this……
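Transcription and translation, as described above, can be sketched in a few lines of Python. This is only a toy illustration: the function names are invented here, the codon table is deliberately partial (just the codons in the example, taken from the standard genetic code), and the DNA string is assumed to be the coding strand, so that transcription amounts to swapping T for U.

```python
# Partial codon table: only the codons used in this example (standard genetic code).
CODON_TABLE = {
    "AUG": "Met",   # also the usual start signal
    "GUG": "Val",
    "CAC": "His",
    "CUG": "Leu",
    "UAA": "STOP",
}

def transcribe(coding_dna):
    """Return the mRNA corresponding to a DNA coding strand (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGGTGCACCTGTAA")
print(mrna)
print(translate(mrna))
```

A real program would need the full 64-codon table and would have to locate the start codon rather than assume the sequence begins with it.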
Kinds of Nucleic Acids
DNA (deoxyribonucleic acid)– double stranded molecule consisting of 4
nucleotides (adenine, guanine, cytosine, thymine) and a backbone of
sugar/phosphate molecules.
RNA (ribonucleic acid) – single stranded molecule consisting of 4
nucleotides (adenine, guanine, cytosine, uracil)

messenger RNA (mRNA) – carries the genetic information as a
sequence of codons (3 nucleotide sequences) from the nucleus to the
cytoplasm

ribosomal RNA (rRNA) – RNA in a cell organelle (ribosome) that binds
to both mRNA and tRNA – the site of protein synthesis

transfer RNA (tRNA) – many kinds. Each binds to a specific amino acid
Complementary DNA (cDNA) is DNA that has been “reverse engineered”
in the laboratory by building a DNA strand on an mRNA template (using the
enzyme reverse transcriptase)
The Genome—Organizational Hierarchy
DNA: The genetic material of all cells and many viruses. A polymer of
nucleotides.
Each nucleotide consists of a sugar and a phosphate group (the “backbone”)
linked to one of four bases: adenine, cytosine, guanine or thymine.
Two complementary strands are wound in a right-handed helix and held
together by hydrogen bonds between complementary base pairs.
The sequence of bases encodes genetic information.
http://www.accessexcellence.org/AB/GG/dna_molecule.html
The Genome—Organizational Hierarchy
We speak of “base pairs” because DNA is double-stranded, with
adenine pairing with thymine (A-T) and guanine pairing with
cytosine (G-C).
The strands are directional and complementary, and they run in
opposite directions: one strand is read starting from the “top”
and the other starting from the “bottom.” Sequences are always
read from the 5’ end to the 3’ end (the two “connectors” of
the sugar backbone).
http://www.accessexcellence.org/AB/GG/dna_molecule.html
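The pairing and direction rules above are easy to express in code. The sketch below (function names invented for this illustration) builds the partner strand base by base and then writes it in its own 5' to 3' direction, which is what is usually called the reverse complement.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Partner bases, written in the same left-to-right order as the input."""
    return "".join(PAIRS[base] for base in strand)

def reverse_complement(strand):
    """Partner strand written 5' to 3', i.e. read from the opposite end."""
    return complement(strand)[::-1]

top = "ATGGTGCAC"                # one strand, written 5' to 3'
print(complement(top))           # partner bases, position by position
print(reverse_complement(top))   # the other strand as it would be written 5' to 3'
```

Because both strands carry the same information, sequence databases normally store only one of them; the other can always be recomputed this way.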
DNA Structure & Replication
Check out the RealPlayer video: http://www.ucsd.tv/sciencematters/lesson1-col.shtml
From Gene to Protein
More Complexity…
The genetic information transcribed from the DNA strand to the
RNA strand is a “gene” plus—a “transcription unit.”
It consists of several parts:
• control segments (e.g. start reading here, stop reading here)
• introns – internal noncoding regions of the transcript that are
removed before the mature mRNA is formed
• exons – the parts of the transcript that contain the
information to code for a protein strand. The pieces are
“spliced” back together after the introns are removed
Thus a cDNA strand that has been built (reverse engineered) on the
mRNA exon sequence only has part of the information in the
chromosomal DNA strand that was the “original” template.
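Splicing can be pictured as cutting known intron positions out of the transcript and joining what is left. The sketch below is a toy version under simple assumptions: the intron positions are already known and are passed in as (start, end) pairs, and the sequences are made up for illustration (the intron begins with GU and ends with AG, as real introns typically do).

```python
def splice(transcript, introns):
    """Remove introns from a primary transcript.

    introns is a list of (start, end) positions, 0-based with end exclusive,
    sorted by start and not overlapping.
    """
    exons = []
    position = 0
    for start, end in introns:
        exons.append(transcript[position:start])  # exon before this intron
        position = end                            # skip over the intron
    exons.append(transcript[position:])           # final exon
    return "".join(exons)

# Toy transcript: exon 1 + intron + exon 2
exon1, intron, exon2 = "AUGGUGCAC", "GUAAGUCCUUAG", "CUGACUUAA"
pre_mrna = exon1 + intron + exon2
mature = splice(pre_mrna, [(len(exon1), len(exon1) + len(intron))])
print(mature)
```

The hard part in practice is finding the intron boundaries in the first place, which is one of the gene-structure prediction problems mentioned earlier.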
Two Views of the Gene
The Genome—Organizational Hierarchy
Gene: Specific segments of DNA that control cell structure and
function; the functional units of inheritance. A sequence of DNA
bases usually codes for a polypeptide sequence of amino acids:
3 nucleotides → one amino acid (or a start or stop reading signal)
A sequence-level view of a gene
the mRNA sequence of beta globin
Depicted on the next page is the sequence of RNA bases
in the transcript of the human beta-globin gene. This is
the “message” that goes from the DNA strand in the
nucleus (the information on the gene) to the ribosome
(rRNA) in the cytoplasm.
The letters in magenta are the two introns which are removed from the
transcript in the maturation process resulting in mRNA.
The letters in blue are the bases at either end of the introns (GU...AG)
which are used as "endpoints" in the splicing process.
Proteins
Proteins are linear polymers (chains or strings) of amino acids joined by
peptide bonds in a specific sequence. They carry out most functional
activities in the cell; major classes of proteins include enzymes,
hormones, receptors (binding sites for signaling molecules) and
antibodies.
3D coordinates (e.g. determined by x-ray crystallography or NMR) are
stored in databases
3D structure is visualized using programs such as MolScript & RasMol.
Protein Structure
Each protein chain is made up of a string of amino acids.
Different amino acids have different properties – some attract each other,
some repel, to give the higher level structures
http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/protein.shtml
http://www.yangene.com/images/protein1c.jpg
Quaternary structure of Hemoglobin
The hemoglobin molecule is composed of 4
subunits – individual amino acid (polypeptide)
chains – each with an iron-containing heme group tucked inside
http://www.uic.edu/classes/bios/bios100/lectf03am/hemoglobin.jpg
Protein Synthesis
Mutations
Mutations ultimately derive from incorrect, un-repaired DNA
replication
Point mutations (Single Nucleotide Polymorphisms): changes in a
single base of the coding triplet. May or may not have an effect
on amino acid coding (because of code redundancy). Frameshift
mutations can also occur: inserting or deleting 1 base shifts ALL
the codons that follow
Segmental mutation—larger scale changes in the sequence
within a single chromosome (insertion, deletion, inversion,
duplication of longer nucleotide sequences)
Chromosome duplication, deletion (e.g. Down syndrome)
Sickle Cell Disease is a genetic disorder involving a change in a
single DNA nucleotide. The change in nucleotide alters the
hemoglobin molecule (protein), which in turn changes the shape of the
red blood cell. This affects the ability of the red blood cells to
move through the blood vessels and, ultimately, blood flow is
reduced and tissue damage results
http://www.nlm.nih.gov/medlineplus/ency/imagepages/1212.htm
A change in nucleotide sequence = change in mRNA =
change in amino acid = change in shape of hemoglobin
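The difference between a harmless and a harmful point mutation, and the effect of a frameshift, can be shown with a toy example. The 12-base fragment below is a made-up illustration modeled on the kind of change seen in sickle cell disease (a GAG codon for glutamic acid becoming GTG for valine); the function names and the partial codon table are assumptions for this sketch only.

```python
# Partial DNA codon table (standard genetic code), only what the example needs.
CODONS = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val", "CCT": "Pro", "ACT": "Thr"}

def codons(dna):
    """Split a coding sequence into complete triplets."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]

def classify_point_mutation(original, mutated):
    """Compare two equal-length coding sequences codon by codon."""
    changes = []
    for before, after in zip(codons(original), codons(mutated)):
        if before != after:
            same = CODONS[before] == CODONS[after]
            kind = "silent" if same else "amino acid change"
            changes.append((before, after, CODONS[before], CODONS[after], kind))
    return changes

normal = "ACTCCTGAGGAG"   # Thr Pro Glu Glu
print(classify_point_mutation(normal, "ACTCCTGTGGAG"))  # A -> T in the third codon
print(classify_point_mutation(normal, "ACTCCTGAAGAG"))  # G -> A in the third codon
print(codons(normal[:4] + normal[5:]))  # deleting one base shifts every later triplet
```

The first change swaps one amino acid for another, the second is silent because GAG and GAA both code for glutamic acid, and the deletion changes every codon after the point where it occurs.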
imiloa.wcc.hawaii.edu/.../present/ lcture17/sld022.htm
Reading the Genetic Code
The goal of the various genome projects has been to
chart the sequence of nucleotides in the chromosomes of
various organisms (humans, yeasts, fruit flies, mice, etc.)
and to connect the genetic information with other
biological information -- protein structure and function,
physiological processes, diseases and drug treatment,
etc.
How do you sequence the genome?
Basically, you need to cut up the chromosomal DNA into
pieces that are short enough (e.g. 2K – 150K base pairs
depending on method), determine the nucleotide
sequence of each piece, and then figure out how to string
the pieces back together in the proper order. This is aided
by computer processing.
http://nema.cap.ed.ac.uk/teaching/genomics/Genomics3.html
Random Sequencing Strategy
Randomly chunk the entire genome into pieces
Make multiple copies of the pieces
Sequence each piece
Look for sequence overlap to put all the sequences in
order (automated)
Celera assembled 27 million sequence chunk records using
the most powerful non-military computer in the world.
There are still lots of gaps AND we don’t know what most of
the sequences represent in terms of genes and their
expression
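The "look for sequence overlap" step can be sketched as a greedy merge: find the two pieces with the longest overlap, join them, and repeat. This is a classroom-sized illustration with made-up fragments and invented function names, not the method Celera or any real assembler used; real assemblers must also cope with sequencing errors, repeats, and both strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining pieces stay separate (gaps)
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

fragments = ["ATGGTGCACC", "GCACCTGACT", "TGACTCCTGA", "CCTGAGGAGA"]
print(greedy_assemble(fragments))
```

When no overlaps remain, the function returns more than one piece, which corresponds to the gaps mentioned above.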
Expressed Sequence Tag
It is possible to “reverse engineer” a gene by working backwards
from the mRNA to a strand of DNA with the complementary base
sequence (cDNA).
A partial sequence derived from cDNA is called an Expressed
Sequence Tag. It may or may not represent the complete original
genetic message for a protein—it certainly does not represent the
complete gene as it existed in the nuclear DNA (only exons are
present). ESTs DO represent genes that are active in a particular
cell at a particular time – as evidenced by high levels of mRNA
production.
ESTs can be used to identify genes because they will hybridize to
(match up with) known DNA sequences.
What about ESTs?
If you are looking at liver cells, you will be able to study
the genes that are active in the liver, though you are
likely to end up with lots of ESTs for abundant messages
and few if any copies of the rare messages
Only a subset of all the genes are turned on in liver cells,
and only a subset of the liver genes may be active in your
sample. So you have to sequence a LOT of ESTs AND figure
out how they fit together to approximate the genome of
interest.
The INFO in Bioinformatics
Major organizations:
Public/non-profit: NCBI, EBI, TIGR
Private sector: Incyte Pharm, Millennium Pharm,
Affymetrix
Databases: sequences (e.g. genes, proteins, ESTs),
structures, images, biological & medical info, publications,
etc.—for single organisms or broad collections
Software: database entry, searching, sequence alignment,
pattern recognition, clustering, tree-building, mapping,
visualization
Computers & Biology—perfect together
Collecting & processing signals detected by lab
equipment
Tracking samples and managing experiments
(industrial strength)
Storing, searching, retrieving data in public
databases (e.g. GenBank, PubMed)
Data mining in large data collections—looking for
rules and patterns
Annotation—assigning functional meaning to
uncharacterized data and linking different data
collections
Simulation of biological systems at all levels
(interacting proteins to interacting populations)
Does all of this “count” as BIOINFORMATICS?
DATABASE ISSUES
Primary (archival) vs secondary (curated)
Public access vs private/fee-based access
Cross-organism vs single organism
Flat file/relational/OODB
Federated? Combined in a Data Warehouse?
Annotation (metadata, verbal indexing of known or
predicted information about the gene or protein)
Vocabulary standards (or lack of them)
GenBank
GenBank® is the NIH genetic sequence database, an annotated
collection of all publicly available DNA (and RNA) sequences.
There were approximately 37,893,844,733 bases in 32,549,400
sequence records as of Feb 2004. (And 100 gigabases in Aug,
2005)
GenBank is part of the International Nucleotide Sequence Database
Collaboration, which is comprised of the DNA DataBank of Japan
(DDBJ), the European Molecular Biology Laboratory (EMBL), and
GenBank at the NCBI (National Center for Biotechnology
Information). These three organizations exchange data on a daily
basis.
GenBank
Many journals require submission of sequence information to a
database prior to publication so that an accession number may
appear in the paper.
Counterpart databases for proteins include the Protein Data
Bank (sequence + 3D structures (X-ray, NMR)), the Protein
Information Resource (protein sequences), and Swiss-Prot
(curated protein sequences)
GenBank is one of a number of interrelated public databases
that support bioinformatics and related research. ENTREZ is
the gateway.
Growth of Genbank
Sequence Analysis
Similarity searches – matching your sequence to those
in the database (e.g. using BLAST--set of similarity
search programs designed to explore all of the available
sequence databases regardless of whether the query is
protein or DNA)
Alignment and multi-alignment of sequences
Detection of protein coding regions
Statistical analyses based on linguistic approaches for
the identification of functional elements such as
promoters, splicing sites, etc.
Prediction of secondary structures in nucleic acids and
protein sequences
Prediction of protein tertiary structure
Molecular evolutionary studies.
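To give a feel for what "matching your sequence to those in the database" involves, the sketch below scores a query against a tiny made-up database using a textbook global alignment (Needleman-Wunsch dynamic programming). This is not how BLAST works internally; BLAST uses faster heuristics. The scoring values, sequences, and names here are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best end-to-end alignment of a and b."""
    # previous[j] is the best score for aligning the processed part of a with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + pair,   # align the two bases
                               previous[j] + gap,        # gap in b
                               current[j - 1] + gap))    # gap in a
        previous = current
    return previous[-1]

query = "GATTACA"
database = {"seq1": "GATTACA", "seq2": "GCATGCA", "seq3": "TTTTTTT"}
ranked = sorted(database,
                key=lambda name: global_alignment_score(query, database[name]),
                reverse=True)
print(ranked)
```

Ranking database entries by score is the basic idea behind a similarity search; the practical challenge is doing it quickly against billions of bases.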
Sequence Similarity
Genes may share high sequence similarity across their
entire length.
Genes may show sequence similarity that is limited to a
certain region—some parts of a protein will be similar
and other parts will be different.
Genes may share similar motifs, meaning that they
encode regions of similar amino acid sequence that
aren't located right next to each other in the linear
sequence of the protein. The sequence lying between
these regions of similarity can be quite different. When
they fold up, however, proteins sharing a motif form
similar three-dimensional structures (for example, "zinc
finger" or "leucine zipper" motifs).
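A motif of this kind can be searched for with a pattern that fixes a few key positions and lets everything in between vary. The sketch below uses a simplified leucine-repeat pattern (a leucine at every seventh position, four times) as a stand-in for a leucine zipper. The pattern and the three protein strings are made up for illustration and are not curated motif definitions or real proteins.

```python
import re

# Simplified, illustrative pattern: a leucine (L) followed by any six residues,
# three times over, then a final leucine.
LEUCINE_REPEAT = re.compile(r"L.{6}L.{6}L.{6}L")

def find_motif(protein):
    """Return (start, matched_text) for the first occurrence, or None."""
    match = LEUCINE_REPEAT.search(protein)
    return (match.start(), match.group()) if match else None

# Two made-up sequences that share the spaced leucines but little else
protein_a = "MK" + "LAEDKVE" + "LSKNYHE" + "LENEVAR" + "LKK"
protein_b = "GS" + "LQQRTGA" + "LTTDPWS" + "LIIMNCD" + "LPG"
protein_c = "MKTAYIAKQRQISFVKSHFSRQ"   # no leucines at all

for name, seq in [("a", protein_a), ("b", protein_b), ("c", protein_c)]:
    print(name, find_motif(seq))
```

Sequences a and b would look quite different in an ordinary alignment, yet both contain the pattern, which is the point of searching by motif.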
Bioinformatics Research Sequence
By looking for genes in model organisms that are similar to a given human
gene, researchers can learn about the protein the human gene encodes
and search for drugs to block it. The MLH1 gene, which is associated with
colon cancer in humans, is used in this example.
What are Microarrays?
Microarrays, or “gene chips,” are essentially an orderly
array of samples of cDNA (with known nucleotide
sequence) on a glass slide or nylon membrane.
By exposing the samples on the chip to an unknown DNA
sequence, you may be able to identify the unknown
because it binds to the known cDNA “probe.” The strength
of binding is generally indicated by fluorescence of the
cDNA probe spots that have been targeted by the
unknown sample.
Processing, storing, analyzing the data produced are a
major concern of bioinformatics. With microarrays, it is
possible to study the expression of 10K genes at a time!
Applications of Microarrays
http://www.gene-chips.com/
Gene discovery
Disease diagnosis
Drug discovery: Pharmacogenomics
Toxicological research: Toxicogenomics
See http://www.ebi.ac.uk/microarray/ for a discussion of the
informatics of microarrays.
Bioinformatics and Microarray Data
Data management -- LIMS
Database design
Algorithms for mining
What are scientists doing with these new data?
[stolen from Mike Agostino]
Going for the low hanging fruit
I’ve been looking for genes related to my favorite gene
I’ve been looking for the rest of this gene
I know a disease maps to a particular place
Going for the high altitude fruit
Does the overall organization of the genome tell us
anything about gene expression?
What are the functions of all these genes?
Are there clusters of genes that are significant?
Are any genes missing?
In Summary -- Types of Data
[stolen from Mike Agostino]
DNA
  Sequence
    • Genomic: BIG, largely uncharacterized
    • Genes: smaller pieces, highly characterized
    • cDNA: messenger RNA (mRNA) copies
    • ESTs: “expressed sequence tags”
    • SNPs: single nucleotide polymorphisms
  Mapping
    • Chromosomal location
Protein
  Sequence: determined or derived
  Structure: crystals or other methods reveal 3-D shape
Expression
  Where/when genes are active and how much
Questions?
www.nevtron.si/borderline/ old.html