Transcript Overture

CS 466
Introduction to Bioinformatics
Saurabh Sinha
What is the course about?
• Algorithmic concepts, applied to sample (toy)
problems in “bioinformatics”
– Follows the textbook
• “Real” bioinformatics research
– Follows the best journals
• Not about practical training in the use of
popular bioinformatics software
Grading
• Assignments: 40%
– About one every two weeks
• Mid Term: 30%
• Final: 30%
Expectations
• Some programming skills
– Any programming language is fine
Administrative Details
• Instructor:
– Saurabh Sinha
– Room 2122, Siebel Center
– Email: [email protected]
• Class hrs: Tue & Thu, 3:30 pm - 4:45 pm, 1131 SC
• Office hrs: Tue, before class (2:30 - 3:30 pm), 2122 SC
• Web site: http://veda.cs.uiuc.edu/courses/fa09/cs466/
• Welcome to sit in, if not taking for credit
Textbooks
• Jones and Pevzner: recommended
Other course
• CS 591 BIO: weekly seminar (“journal
club”) on bioinformatics research
• Thursdays at 11:00 am.
Motivating bioinformatics
Special issue of journal Science, July 1, 2005.
• What Is the Universe Made Of?
• What Is the Biological Basis of Consciousness?
• Why Do Humans Have So Few Genes?
• To What Extent Are Genetic Variation and Personal Health Linked?
• Can the Laws of Physics Be Unified?
• How Much Can Human Life Span Be Extended?
• What Controls Organ Regeneration?
• How Can a Skin Cell Become a Nerve Cell?
• How Does a Single Somatic Cell Become a Whole Plant?
• How Does Earth's Interior Work?
• Are We Alone in the Universe?
• How and Where Did Life on Earth Arise?
• What Determines Species Diversity?
• What Genetic Changes Made Us Uniquely Human?
• How Are Memories Stored and Retrieved?
• How Did Cooperative Behavior Evolve?
• How Will Big Pictures Emerge from a Sea of Biological Data?
• How Far Can We Push Chemical Self-Assembly?
• What Are the Limits of Conventional Computing?
• Can We Selectively Shut Off Immune Responses?
• Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
• Is an Effective HIV Vaccine Feasible?
• How Hot Will the Greenhouse World Be?
• What Can Replace Cheap Oil -- and When?
• Will Malthus Continue to Be Wrong?
Many of the most profound scientific
questions of today are within the
realm of bioinformatics research
“Why do humans have so few genes?”
A simple organism
[Figure: Environmental signal -> GENE -> Response (protein)]
A simple organism
[Figure: three genes, GENE1 to GENE3]
A simple organism
[Figure: ten genes, GENE1 to GENE10]
A complex organism
[Figure: the same ten genes, GENE1 to GENE10, linked by a complex circuit of interactions]
Regulatory networks
• This may be the reason why humans
have so few genes (the circuit, not the
number of switches, carries the
complexity)
• Bioinformatics can unravel such
networks, given the genome (DNA
sequence) and gene activity information
Decoding the regulatory network
• Find patterns (“motifs”) in DNA
sequence that occur more often than
expected by chance
• Statistics on DNA sequences and words
• Knowing these can tell us about edges
in the regulatory network
Decoding the regulatory network
• An example computational problem:
• Given a string of length 10,000 over the
alphabet {A,C,G,T}
• Count the number of occurrences Nw of
every 6 letter word w
• Are there specific words that occur
more frequently than expected by
chance?
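The counting step of this problem can be sketched in a few lines of Python. This is only an illustration: the function name `count_words` and the randomly generated input string are assumptions made for the example, not course material.

```python
from collections import Counter
import random


def count_words(sequence, k=6):
    """Count every overlapping k-letter word in a DNA string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical input: a random string of length 10,000 over {A, C, G, T}.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(10_000))

counts = count_words(genome, k=6)  # counts[w] plays the role of Nw

# Show the five most frequent 6-letter words and their counts.
for word, n_w in counts.most_common(5):
    print(word, n_w)
```

A string of length 10,000 contains 10,000 - 6 + 1 = 9,995 overlapping 6-letter windows, so the counts always sum to 9,995.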
Decoding the regulatory network
• What is expected by chance?
• What is “more frequently”?
• Interesting mathematical questions
• The Moby Dick example
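One possible way to make the first two questions concrete, assuming the simplest background model: letters are drawn independently, so a word's expected count is the number of windows times the product of its letter frequencies. "More frequently" is then scored with a Poisson tail probability, which is only a rough approximation because overlapping windows are not independent. The function names, the uniform letter frequencies, and the example word are assumptions for illustration.

```python
import math


def expected_count(word, seq_len, freqs):
    """Expected occurrences of `word` if letters are drawn independently."""
    p_word = math.prod(freqs[c] for c in word)
    return (seq_len - len(word) + 1) * p_word


def poisson_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean); a rough significance score."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i)
                for i in range(observed))
    return 1.0 - below


# Assumed background: all four letters equally likely.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

mean = expected_count("TGTCAC", 10_000, uniform)  # 9995 / 4096, about 2.44
print(round(mean, 2))
print(poisson_tail(10, mean))  # chance of seeing the word 10 or more times
```

A real analysis would estimate letter (or shorter-word) frequencies from the genome itself and correct for testing all 4,096 words at once.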
Motif (variants separated by "|")                      Observed  Expected  Classes
TGTCACCT|TGTGTCA|TGTGTCAC|TTGTGTC                      35        7         4
GTCAGTAA|TCAGTAAT                                      14        3         4
GACACA|GCCACA                                          52        2         4
CGCGACGC|GACGCGA|GACGCGAA|GAGGCGA|GAGGCGAC|GCGACGCG    18        3         1, 2, 5
GCGGCTAA|GGCGGCAAA|GGCGGCTAAA                          10        1         1
CGTCACAAA|GTCACAAAA                                    17        1         4
Source: http://www.pnas.org/content/97/18/10096/T4.expansion.html
Decoding the regulatory network
• Finding words that occur more often than expected by chance is called “motif finding”, and it helps decode the regulatory network!
Comparing DNA
• Humans are about 99.9% identical to each other at the DNA level.
• How do we know that?
• Compare the genomes of two individuals.
• The computational problem: are two sequences similar?
Sequence alignment
• Why is this a problem?
• The two sequences will differ by “substitutions”,
“insertions” and “deletions” accumulated during
evolution
• The comparison algorithm has to be robust to all three kinds of change.
– A technique called “dynamic programming” handles all of them, and is efficient: its running time is proportional to the product of the two sequence lengths
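A minimal sketch of the dynamic programming idea, computing only the best global alignment score (not the alignment itself). The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; real tools use more elaborate scoring schemes.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b by dynamic programming."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            # Substitution (or match), deletion from a, insertion into a.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The second sequence is the first with one "T" deleted.
print(alignment_score("ACGTTGCA", "ACGTGCA"))  # 7 matches, 1 gap -> 6
```

Each cell of the table is filled once from three neighbors, which is why the running time grows with the product of the two lengths rather than with the number of possible alignments.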
Sequence alignment
• Why should we care?
• Compare the human genome with a fish genome. You’ll see some portions that are highly similar.
• These “conserved” portions are often genes …
• … or regulatory sequences! The regulatory
network again.
On counting genes
• The original question was “Why do humans
have so few genes?”
• How do we know how many genes there are in the human genome, and where they are?
• Experiments can be designed, but
bioinformatics plays a major role
Gene prediction
• The task of predicting the locations of
genes in a new genome (“annotation”)
• Gene prediction software
• The more sophisticated ones use
“Hidden Markov models” (HMM) and
multiple species comparison
HMM for Gene Prediction
What is this graph?
It captures the “architecture”
of a gene.
It translates into a “probabilistic
model”.
It leads directly to a gene
finding algorithm
http://researchweb.watson.ibm.com/journal/rd/453/birney.html
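To show how a probabilistic model turns into an algorithm, here is a toy sketch of the Viterbi algorithm on a two-state HMM. It is far simpler than the gene architecture in the figure: the two states, the transition probabilities, and the emission probabilities are all made-up values for illustration, with "coding" regions assumed to be GC-rich.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy architecture)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # assumed AT-rich
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # assumed GC-rich
}


def viterbi(seq):
    """Return the most probable state path for a non-empty seq (log space)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (score[best] + math.log(TRANS[best][s])
                            + math.log(EMIT[s][ch]))
            pointers[s] = best
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ATATATTAGCGCGGCCGCATATTATA"))
```

With these made-up parameters the GC-rich middle of the example sequence is labeled C and the AT-rich flanks are labeled N. A real gene finder uses many more states (exons, introns, splice sites) and parameters estimated from known genes.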
“What controls organ
regeneration?”
“How does a single somatic
cell become a whole plant?”
Development and Regeneration
• Developmental biology
• The timeline from a single cell (with genetic
material from mother and father) to a
multicellular embryo, and to an adult
• A paradox: all cells in the adult body have the same DNA, so why are different cells different?
Regulatory networks again
[Figure: the same network of GENE1 to GENE10 drawn twice, once for a HEAD PRECURSOR CELL and once for a TAIL PRECURSOR CELL. The two cells receive different input values (1 and 0 in one cell, 0 and 1 in the other).]
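The idea in the figure, identical wiring but different inputs, can be illustrated with a toy Boolean network. Everything here is hypothetical: the three genes, their wiring, and the two input signals are invented for the sketch and do not describe any real organism.

```python
def step(state, inputs):
    """One synchronous update of a hypothetical three-gene Boolean network."""
    g1, g2, g3 = state
    head_signal, tail_signal = inputs
    return (
        head_signal and not g2,  # GENE1: activated by head signal, repressed by GENE2
        tail_signal and not g1,  # GENE2: activated by tail signal, repressed by GENE1
        g1 or g2,                # GENE3: on whenever GENE1 or GENE2 is on
    )


def settle(inputs, state=(False, False, False), max_steps=10):
    """Apply the same rules repeatedly until the expression pattern is stable."""
    for _ in range(max_steps):
        nxt = step(state, inputs)
        if nxt == state:
            break
        state = nxt
    return state


# Same rules (same "DNA"), different inputs, different gene activity.
print(settle((True, False)))   # head-like inputs -> (True, False, True)
print(settle((False, True)))   # tail-like inputs -> (False, True, True)
```

The update rules never change between the two calls. Only the inputs differ, yet the cells settle into different patterns of gene activity.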
Regulatory networks again
• Bioinformatics used to scan entire genome
for regions that participate in “segmenting”
the embryo
• Hidden Markov models used to detect such
regions
• Multiple species comparison aids discovery
“How did cooperative behavior evolve?”
Social behavior and bioinformatics?
• Social behavior in honey bees
• Young worker bees are nurses in the hive;
older ones go out to forage
• This behavioral maturation is determined by the needs of the colony. What is its genetic basis?
Social behavior and bioinformatics
• Illinois team scanned the genome to
understand this (2006)
• Regulatory network of social behavior
• Statistical tools, such as the hypergeometric test
• Machine learning tools such as “support
vector machine classification”
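As an illustration of the hypergeometric test, the sketch below asks whether a motif is over-represented in a gene set. The function name and all of the numbers (10,000 genes, 200 carrying the motif, a set of 100 genes of which 8 carry it) are invented for the example and are not taken from the honey bee study.

```python
from math import comb


def hypergeom_pvalue(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 10,000 genes, 200 with the motif; a behavior-related
# set of 100 genes contains 8 with the motif (2 would be expected by chance).
p = hypergeom_pvalue(10_000, 200, 100, 8)
print(p)
```

A small p-value suggests the motif occurs in the gene set more often than random sampling would explain, which is one way to propose an edge in the regulatory network.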
Other challenges
Protein structure prediction
http://www.denizyuret.com/students/vkurt/thesis-main_dosyalar/image006.gif
Protein structure prediction
• Can we predict the 3-D structure of a protein from its amino acid sequence?
• Why?
– One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function
– We could design amino acid sequences that fold into proteins that do what we want them to do. Drug design!
• Neural networks, a popular technique in
computer science, applied to this problem
Metagenomics
• Most studies to date are on genomes of one
species
• A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these?
• The Sorcerer II expedition
• http://www.sorcerer2expedition.org/version1/HTML/main.htm
Many more challenges
• New types of data keep arriving, thanks to technological breakthroughs in biology
• High-throughput data carries an unprecedented amount of information
• It also carries a lot of noise
• Bioinformatics helps separate the signal from the noise
Bioinformatics
• Is not about one problem (e.g., designing
better computer chips, better compilers,
better graphics, better networks, better
operating systems, etc.)
• Is about a family of very different problems,
all related to biology, all related to each other
• How can computers help solve any of this family of problems?
Bioinformatics and You
• You can learn the tools of bioinformatics
• These tools owe their origin to computer
science, information theory, probability theory,
statistics, etc.
• You can learn the language of biology,
enough to understand what the problems are
• You can apply the tools to these problems
and contribute to science