CS 5263 Bioinformatics

CS5263 Bioinformatics
Lecture 1: Introduction
Outline
• Administrivia
• What is bioinformatics
• Why bioinformatics
• Topics in bioinformatics
• What you will & will not learn
• Introduction to molecular biology
Student info
• Your name
• Email
• Enrollment status
• Academic background
• Interests
Course Info
• Instructor: Jianhua Ruan
Office: S.B. 4.01.48
Phone: 458-6819
Email: [email protected]
Office hours: Tues 6:30-7:30, Wed 3-4pm
• Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2007/
Course description
• A survey of algorithms and methods in
bioinformatics, approached from a
computational viewpoint.
• Discussions balanced between algorithmic
analyses and biological applications
• Prerequisites:
– Knowledge of algorithms and data structures
– Programming experience
– Basic understanding of statistics and probability
– Appetite to learn some biology
Textbooks
• Required:
– An Introduction to Bioinformatics Algorithms
by Jones and Pevzner
• Recommended:
– Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids
by Durbin, Eddy, Krogh and Mitchison
• Additional resources
– See course website
Grading
• Attendance: 10%
– At most 2 classes missed without affecting
grade
• Homework: 50%
– No late submission accepted
– Read the collaboration policy!
• Final project and presentation: 40%
What is bioinformatics
• National Institutes of Health (NIH):
– Research, development, or application of
computational tools and approaches for
expanding the use of biological, medical,
behavioral or health data, including those to
acquire, store, organize, archive, analyze, or
visualize such data.
What is bioinformatics
• National Center for Biotechnology
Information (NCBI):
– the field of science in which biology, computer
science, and information technology merge to
form a single discipline. The ultimate goal of
the field is to enable the discovery of new
biological insights as well as to create a global
perspective from which unifying principles in
biology can be discerned.
What is bioinformatics
• Wikipedia
– Bioinformatics refers to the creation and
advancement of algorithms, computational
and statistical techniques, and theory to solve
formal and practical problems posed by or
inspired from the management and analysis
of biological data.
Why bioinformatics
• Modern biology generates huge amounts of data
– Human genome sequence has 3 billion bases
• Complex relationships among different types of data
– Challenges to integrate and analyze data
• Algorithmic challenges
– Biologists trained in programming are probably not sufficient
• Tremendous need in both academia and industry
– Job opportunities
• You get the chance to learn something different
Some examples of the central
role of CS in bioinformatics
1. Genome sequencing
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3 x 10^9 nucleotides
~500 nucleotides
A big puzzle
~60 million pieces
Computational Fragment Assembly
Introduced ~1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome
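The figures above can be turned into a quick calculation and a toy assembler. This is only a sketch: greedy overlap merging is a teaching heuristic, not the method behind the assemblies mentioned on the slide, and the three fragments below are hypothetical reads made up for illustration.

```python
def coverage(num_reads, read_length, genome_length):
    """Average number of reads covering each genome position."""
    return num_reads * read_length / genome_length


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; several contigs remain
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Slide figures: ~60 million pieces of ~500 nucleotides, 3 x 10^9 nucleotides.
print(coverage(60 * 10**6, 500, 3 * 10**9))  # 10.0

# Hypothetical overlapping fragments of a short made-up sequence.
reads = ["AGTAGCACAGA", "CACAGACTACG", "CTACGACGAGA"]
print(greedy_assemble(reads))  # ['AGTAGCACAGACTACGACGAGA']
```

With the slide's numbers, each position of the genome is covered about ten times on average, which is what makes overlaps between pieces usable for putting the puzzle together.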
2. Gene Finding
Where are the genes?
In humans:
~22,000 genes
~1.5% of human DNA
2. Gene Finding
Gene structure, 5’ to 3’: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3
Start codon: ATG
Splice sites: at the exon/intron boundaries
Stop codon: TAG/TGA/TAA
Hidden Markov Models
(Well studied for many years
in speech recognition)
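The start and stop codons listed above are enough for a very rough first pass at gene finding. The sketch below is not a Hidden Markov Model and ignores introns and splice sites entirely; it only scans one strand for open reading frames (ATG followed in the same frame by a stop codon). The function name, the `min_codons` parameter, and the test string are assumptions made for illustration.

```python
START = "ATG"
STOPS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of open reading frames on one strand.

    start is the index of the A in ATG; end is exclusive and includes the
    stop codon. min_codons counts the codons before the stop (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical sequence: ATG AAA TGG TAA embedded in flanking bases.
example = "CCATGAAATGGTAACC"
for begin, end in find_orfs(example):
    print(begin, end, example[begin:end])  # 2 14 ATGAAATGGTAA
```

Real gene finders have to model exons, introns and splice signals jointly, which is where probabilistic models such as HMMs come in.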
3. Protein Folding
• The amino-acid sequence of a protein determines the 3D
fold
• The 3D fold of a protein determines its function
• Can we predict 3D fold of a protein given its amino-acid
sequence?
– Holy grail of computational biology: a 40-year-old problem
– Molecular dynamics, computational geometry, machine learning,
robotics
4. Sequence Comparison: Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence Alignment
Introduced ~1970
BLAST: 1990, one of the most cited papers in history
Still very active area of research
Efficient string matching algorithms
Fast database index techniques
BLAST: search a query against a database (DB)
Sequence conservation implies function
Sequence comparison is key to
• Finding genes
• Determining function
• Uncovering the evolutionary processes
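A minimal sketch of how an alignment like the one shown earlier can be computed. It uses the standard dynamic-programming approach to global alignment (Needleman-Wunsch); the scoring values (match +1, mismatch -1, gap -1) are assumptions, since the slides do not give a scoring scheme, and BLAST itself relies on faster heuristics rather than this full table.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# The two sequences from the alignment slide.
s, x, y = global_alignment("AGGCTATCACCTGACCTCCAGGCCGATGCCC",
                           "TAGCTATCACGACCGCGGTCGATTTGCCCGAC")
print(s)
print(x)
print(y)
```

The table has (n+1) x (m+1) entries, so time and memory grow with the product of the two lengths. That is fine for two genes but not for searching a whole database, which is why fast heuristics and index techniques matter.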
5. Evolution
More than 200 complete
genomes have been
sequenced
6. Microarray analysis
Clinical prediction of Leukemia type
• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?
Bone marrow samples: ALL vs AML
Measure amount of each gene
Some goals of biology for the next 50 years
• List all molecular parts that build an organism
– Genes, proteins, other functional parts
• Understand the function of each part
• Understand how parts interact
• Study how function has evolved across all species
• Find genetic defects that cause diseases
• Design drugs rationally
• Sequence the genome of every human, use it for
personalized medicine
• Bioinformatics is an essential component for
all the goals above
Major conferences
• ISMB (Summer every year)
• RECOMB (and its satellites) (Spring every year)
• PSB (Jan every year, Hawaii)
• ECCB (Europe)
• CSB (July every year, Stanford)
• Conferences in computer science
– ICDM (conference on data mining)
– ICML (conference on machine learning)
– AAAI (conference on AI)
Major journals
• Bioinformatics
• Journal of Computational Biology
• PLoS Computational Biology
• BMC Bioinformatics
• Genome Biology
• Genome Research
• Nucleic Acids Research
• IEEE/ACM Trans on Computational Biology and Bioinformatics
• Science, Nature, PNAS, Cell, Nature Genetics,
Nature Biotech, …
Major Bioinfo research topics
Covered topics
• Sequence analysis
– Alignment
– Motif finding
– Pattern matching
– Phylogenetic trees
• Sequence-based predictions
– Gene components
– RNA structure
• Functional Genomics
– Microarray analysis
– Biological networks
What will you learn?
• Basic concepts in molecular biology and
genetics
• Selected topics in bioinformatics and
challenges
• Algorithms:
– DP, graph, string algorithms
– Statistical learning algorithms: HMM, EM,
Gibbs sampling
– Data mining: clustering / classification
What will you not learn?
• Existing tools / databases
• Design / perform biological experiments
• Protein structure prediction (commonly
avoided by most bioinfo researchers…)
• Building bioinformatics software tools (GUI,
database, Perl / Python, …)
Goals
• Basis of sequence analysis and other
computational biology algorithms
• Overall picture about the field
• Read / criticize research articles
• Think about the sub-field that best suits
your background to explore
• Communicate and exchange ideas with
(computational) biologists
Computer Scientists vs
Biologists
(courtesy Serafim Batzoglou, Stanford)
Biologists vs computer scientists
• (almost) Everything is true or false in
computer science
• (almost) Nothing is ever true or false in
Biology
Biologists vs computer scientists
• Biologists seek to understand the
complicated, messy natural world
• Computer scientists strive to build their
own clean and organized virtual world
Biologists vs computer scientists
• Computer scientists are obsessed with
being the first to invent or prove something
• Biologists are obsessed with being the first
to discover something
Biologists vs computer scientists
• Biologists are comfortable with the idea
that all data have errors, and every rule
has exceptions
• Computer scientists are not
Biologists vs computer scientists
• Computer scientists get high-paid jobs
after graduation
• Biologists typically have to complete one
or more 5-year post-docs...
Molecular biology 101
• Cell
• DNA, RNA, Protein
• Genome, chromosome, gene
• Central dogma
Life
• Categories
– Prokaryotes (e.g. bacteria)
• Unicellular
• No nucleus
– Eukaryotes (e.g. fungi, plant, animal)
• Unicellular or multicellular
• Has nucleus
• The most important distinction among
groups of organisms
Prokaryote vs Eukaryote
• Eukaryotes have many membrane-bound
compartments inside the cell
– Different biological processes occur at different
cellular locations
Chemical contents of cell
• Small molecules
–Sugar
–Ions (Na+, K+, Ca2+, Cl-, …)
–…
• Macromolecules (polymers):
–DNA
–RNA
–Protein
–…
• Polymers: “strings” made by linking monomers
from a specified set (alphabet)
Polymer    Monomer
DNA        Deoxyribonucleotides
RNA        Ribonucleotides
Protein    Amino acids
DNA
• DNA: forms the genetic material of all living
organisms
– Can be replicated and passed to descendants
– Contains information to produce proteins
• To computer scientists, DNA is a string made
from alphabet {A, C, G, T}
– e.g. ACAGAACGTAGTGCCGTGAGCG
• Each letter is called a base
– A deoxyribonucleotide
• Length varies, from hundreds to billions of bases
RNA
• Historically thought to be an information carrier only
– DNA => RNA => Protein
– New roles have since been found for RNA
• To computer scientists, RNA is a string made
from alphabet {A, C, G, U}
– e.g. ACAGAACGUAGUGCCGUGAGCG
• Each letter is called a base
– A ribonucleotide
• Length varies, from tens to thousands of bases
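Seen as strings, the DNA and RNA examples above differ only in T versus U. A minimal sketch of that view follows; the function names are made up for illustration, and the T-to-U rewrite is a string-level simplification of transcription, not a model of the biological process.

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def is_over_alphabet(seq, alphabet):
    """True if every letter of seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet


def dna_to_rna(dna):
    """Rewrite a DNA string in the RNA alphabet by replacing T with U."""
    if not is_over_alphabet(dna, DNA_ALPHABET):
        raise ValueError("not a DNA string over {A, C, G, T}")
    return dna.upper().replace("T", "U")


dna = "ACAGAACGTAGTGCCGTGAGCG"   # DNA example from the slide
rna = dna_to_rna(dna)
print(rna)                        # ACAGAACGUAGUGCCGUGAGCG
print(is_over_alphabet(rna, RNA_ALPHABET))  # True
```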
Protein
• Protein: the actual “worker” for almost all processes in
the cell
– Enzymes: speed up reactions
– Signaling: information transduction
– Structural support
– Production of other macromolecules
– Transport
• To computer scientists, protein is a string built from 20
letters
– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
• Each letter is called an amino acid
• Lengths: from tens to thousands
Central dogma of molecular biology
DNA/RNA zoom-in
• Commonly referred to as Nucleic Acid
• DNA: Deoxyribonucleic acid
• RNA: Ribonucleic acid
• Found mainly in the nucleus of a cell
(hence “nucleic”)
• Contain phosphoric acid as a component
(hence “acid”)
• They are made up of nucleotides
Nucleotides
• A nucleotide has 3 components
– Sugar (ribose in RNA, deoxyribose in DNA)
– Phosphoric acid
– Nitrogen base
• Adenine (A)
• Guanine (G)
• Cytosine (C)
• Thymine (T) or Uracil (U)
Monomers of RNA
• A ribonucleotide has 3 components
– Sugar - Ribose
– Phosphate group
– Nitrogen base
• Adenine (A)
• Guanine (G)
• Cytosine (C)
• Uracil (U)
Monomers of DNA
• A deoxyribonucleotide has 3 components
– Sugar - Deoxyribose
– Phosphoric acid
– Nitrogen base
• Adenine (A)
• Guanine (G)
• Cytosine (C)
• Thymine (T)
Polymerization: Nucleotides => nucleic acids
(Figure: three nucleotides linked into a chain; each unit consists of a sugar, a phosphate, and a nitrogen base)
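DNA
5’-AGCGACTG-3’, or simply AGCGACTG
(Figure: a DNA strand drawn from its 5’ end to its 3’ end, with the sugar carbons numbered 1 to 5 and the base and phosphate attached to the sugar)
Many biological processes go from 5’ to 3’,
e.g. DNA replication, transcription, etc.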
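RNA
5’-AGUGACUG-3’, or simply AGUGACUG
(Figure: an RNA strand drawn from its 5’ end to its 3’ end, with the sugar carbons numbered 1 to 5 and the base and phosphate attached to the sugar)
Many biological processes go from 5’ to 3’,
e.g. transcription.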
Base-pair: A=T, G=C
Forward (+) strand: 5’-AGCGACTG-3’
Backward (-) strand: 3’-TCGCTGAC-5’
One strand is said to be reverse-complementary to the other
Reverse-complementary
sequences
• 5’-ACGTTACAGTA-3’
• The reverse complement is:
3’-TGCAATGTCAT-5’
=>
5’-TACTGTAACGT-3’
• Or simply written as
TACTGTAACGT
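The two steps above (complement each base, then read the result 5’ to 3’) can be sketched in a few lines. The function name is an assumption, and the sketch handles only the four letters A, C, G, T.

```python
# Translation table for base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT
print(reverse_complement("AGCGACTG"))     # CAGTCGCT
```

Applying the function twice gives back the original string, which matches the picture of two strands that are reverse complements of each other.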
DNA double helix
Orientation of the double helix
• Double helix is anti-parallel
–5’ end of each strand at 3’ end of the other
–5’ to 3’ motion in one strand is 3’ to 5’ in the other
• Double helix has no orientation
–Biology has no “forward” and “reverse” strand
–Relative to any single strand, there is a “reverse
complement” or “reverse strand”
–Information can be encoded by either strand or both
strands
5’TTTTACAGGACCATG 3’
3’AAAATGTCCTGGTAC 5’
RNA Secondary structures
• RNAs are normally single-stranded
• Can form complex structures by self base-pairing
• A=U, C=G
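One classic way to explore self base-pairing is a Nussinov-style dynamic program that maximizes the number of nested base pairs. The sketch below is a simplification under stated assumptions: only the A=U and C=G pairs listed above are allowed (no G-U wobble pairs), at least `min_loop` unpaired bases must separate a pair (a value of 3 is assumed here), and counting pairs is a crude stand-in for real energy models. The example sequence is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in rna (Nussinov-style DP)."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]              # case 1: base j stays unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    value = max(value, left + 1 + inner)
            best[i][j] = value
    return best[0][n - 1]


# Hypothetical hairpin: GGG pairs with CCC around an unpaired AAAU loop.
print(max_base_pairs("GGGAAAUCCC"))  # 3
```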