VERTEBRATE GENOME EVOLUTION AND FUNCTION …

Genome Comparisons and Gene
Regulation
Penn State University, Center for Comparative Genomics and
Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton
Nekrutenko, Ross Hardison; James Taylor, David King, Hao
Wang
University of California at Santa Cruz: David Haussler, Jim Kent
National Human Genome Research Institute: Laura Elnitski
Children’s Hospital of Philadelphia: Mitch Weiss
Lawrence Livermore National Laboratory: Ivan Ovcharenko
CSH Nov. 6, 2005
DNA sequences of mammalian genomes
• Human: 2.9 billion bp, “finished”
– High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
– This is conserved, but not all is under selection.
• About 5-6% of the human genome has been under purifying selection since the
rodent-primate divergence
• About 1.5% codes for protein
• The remaining 3.5-4.5% of the human genome, which is under selection but does
not code for protein, should include:
– Regulatory sequences
– Non-protein coding genes
– Other important sequences
Silent and repressed chromatin
• Repressors bind to negative control elements
Transcription initiation and pausing
• General transcription initiation factors (GTIFs) assemble on the promoter
Basal and activated transcription
• Activators bind to enhancers
• Contact for activation
[Figure labels: Enhancer, Coactivators, IID, Pol IIa, Pol II, Promoter]
• Coactivators and/or activators sometimes recruit enzymes that modify chromatin structure to facilitate transcription.
– Histone acetylation
– Nucleosome remodeling
Comparisons of promoters for eukaryotic RNA polymerases
• RNA polymerase I promoter
– Upstream control element (-180 to -107): UBF1 binds
– Core promoter (-45 to +20): UBF1 binds
– SL1 binds cooperatively with UBF1
• RNA polymerase II promoter
– Upstream binding sites, TATA box, Initiator, DPE, Gene
• RNA polymerase III promoter (5S RNA gene)
– Core promoter (+55 to +80)
Promoter for RNA Polymerase II
• Upstream binding sites
– Regulate efficiency at which minimal promoter is used
• Minimal promoter: binding of GTIFs and RNA Pol II
– TATA box, Initiator (Inr), DPE
– TFIID binds
• Bad news for prediction:
– TATA box is moderately well-defined, but in large datasets of mammalian promoters, only about 11% have TATA boxes!
– Inr (YANWYY) and DPE are not well-defined sequences.
• Good news for prediction of promoters:
– About 70% are in CpG islands
– Almost all encompass the 5’ end of genes
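To illustrate why a degenerate consensus such as the Inr (YANWYY) is a weak predictor on its own, here is a minimal sketch that expands IUPAC codes into a regular expression and computes the chance that a random 6-mer matches. It assumes equal base frequencies, which is a simplification and not a figure from the slides; the helper names and the test sequence are invented for illustration.

```python
import re

# IUPAC nucleotide codes mapped to the bases they allow
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Turn an IUPAC consensus such as YANWYY into a regex string."""
    return "".join("[" + IUPAC[code] + "]" for code in consensus)

def find_matches(consensus, seq):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]

def match_probability(consensus):
    """Chance that a random site matches, assuming equal base frequencies."""
    prob = 1.0
    for code in consensus:
        prob *= len(IUPAC[code]) / 4.0
    return prob

# Hypothetical test sequence, not taken from any real promoter
inr_hits = find_matches("YANWYY", "GGTCATTCTGGCACTCC")
inr_prob = match_probability("YANWYY")
```

Under the equal-frequency assumption, about 1 in 64 positions on each strand matches YANWYY by chance, so a match alone says little about promoter function.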
Enhancers: Specific DNA sequences that
cause an increase in transcription
• Can act in a variety of positions:
– 5’ to gene (similar to an upstream activation sequence)
– Internal to a gene (e.g. in an intron)
– 3’ to a gene
• Can act at a considerable distance from the gene
– Current studies implicate enhancers as far as 200kb to 500kb away
from genes.
– Other genes can be between an enhancer and its target gene.
• Contain a set of binding sites for transcriptional activators.
– Sequence-specific binding sites
– Short: roughly 6-8bp
Interferon beta Enhancer-Promoter
Many regulatory DNA sequences in SV40 control region
Sequence-specific
Domain opening is associated with movement to non-heterochromatic regions
Expected properties of regulatory elements
• Conserved between species
– Examine interspecies alignments
• Enhancers and promoters: clusters of binding sites for transcription factors
– Use TRANSFAC, TESS, MOTIF (GenomeNet), etc. to find matches to binding sites for transcription factors
• Binding sites conserved between species
– Servers to find conserved matches to factor binding sites
• Comparative genomics at Lawrence Livermore http://www.dcode.org/
– zPicture and rVista
– Mulan and multiTF
– ECR browser
• Consite http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite
– The database GALA records conserved (and nonconserved) matches to factor binding sites (http://www.bx.psu.edu/)
• Can be almost anywhere
– 5’ or 3’ to gene
– Within introns
– Close or far away
Conservation score S in different types of regions
[Figure legend: red, ancestral repeats (mostly neutral); blue, first class in label; green, second class in label. Waterston et al., Nature]
Use measures of alignment texture to
discriminate functional classes of DNA
• Mouse Cons track (L-scores) and phastCons are measures of
alignment quality.
– Match > Mismatch > Gap
• Alternatively, can analyze the patterns within alignments
(texture) to try to distinguish among functional classes
– Regulatory regions vs bulk DNA
– Patterns are short strings of matches, mismatches, gaps
– Find frequencies for each string using training sets
• 93 known regulatory regions
• 200 ancestral repeats (neutral)
• Regulatory potential genome-wide
– Elnitski et al. (2003) Genome Research 13: 64-72.
What types of regulatory sequences may we hope to find?
• Sequence signature: specific binding sites
– Promoters
– Enhancers
– Repressor binding sites
– But these “signatures” are short and occur frequently in any long sequence
• Sequence signature unknown, maybe none
– Compact, silent chromatin
– Insulators, boundaries
– Release from pausing
– Movement from inactive to active compartments
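How often a short signature occurs by chance can be estimated with a back-of-the-envelope calculation. The sketch below assumes a random sequence with equal base frequencies and exact matching on both strands, which real genomes and real binding sites do not satisfy; the function name is hypothetical.

```python
def expected_hits(genome_length, site_length, both_strands=True):
    """Expected number of exact matches to one fixed site in random DNA.

    Assumes independent positions and equal base frequencies (a simplification).
    """
    positions = genome_length - site_length + 1
    per_position = 0.25 ** site_length
    strands = 2 if both_strands else 1
    return strands * positions * per_position

# Site lengths of 6 and 8 bp in a genome of 2.9 billion bp
hits_6bp = expected_hits(2_900_000_000, 6)
hits_8bp = expected_hits(2_900_000_000, 8)
```

Under these assumptions a specific 6 bp site is expected roughly 1.4 million times and a specific 8 bp site roughly 88,000 times in the human genome, which is why a match alone is weak evidence for a regulatory element.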
Coverage of human by alignments with other vertebrates ranges from 1% to 91%
[Figure: percent of human not aligned (y-axis, 0-100) vs. divergence from common ancestor to human, Myr ago (x-axis, 0-500), for chimp, dog, cow, mouse, rat, opossum, platypus, chick, frog and fish. Tree labels in millions of years: 5.4, 91, 92, 173, 220, 310, 360, 450. Annotations: 5%; neutral DNA “cleared out” over 200 Myr.]
Most human DNA is not alignable to species separated by more than 200 Myr.
Divergence dates from Kumar and Hedges (Nature 1998) and Hedges (Nature Rev Genet 2002)
Distinctive divergence rates for different types of functional DNA sequences
[Figure, two panels: percent of human genome not in alignments and percent of regions not in alignments (y-axes, 0-100) vs. time of divergence from common ancestor to human, Myr ago (x-axes, 0-500). Series: Genome, Coding exons, Ultraconserved (HM), Log. (Genome).]
Large divergence in cis-regulatory modules
from opossum to platypus
Marsupial genome adds substantially to the
conserved fraction of regulatory regions
[Figure: Additive contribution of each 2nd species to conservation. Y-axis: percent (0-100). Series: Primate, Eutherian, Marsupial, Monotreme, Avian, Amphibian, Fish. Categories: whole genome, cTFBSs, functional promoters, known regulatory regions, CpG islands, miRNAs, coding exons, ultraconserved.]
The distal Major regulatory element of the human HBA gene
complex is conserved in opossum but not beyond
cis-Regulatory modules conserved from human to fish
[Tree: divergence times of 91, 173, 310 and 450 million years]
• About 20% of CRMs
• Tend to regulate genes whose products control transcription and development
• Recent reports:
– Sandelin, A. et al. (2004). BMC Genomics 5: 99.
– Woolfe, A. et al. (2005). PLoS Biol 3: e7.
– Plessy, C., Dickmeis, T., Chalmel, F., Strahle, U. (2005) Trends Genet. 21: 207-10.
cis-Regulatory modules conserved from human to chicken
[Tree: divergence times of 91, 173, 310 and 450 million years]
• About 40% of CRMs
• Noncoding sequences conserved from human to chicken tend to cluster in gene-poor regions
– Conservation jungles
– Hillier et al. (2004) Nature
• Stable gene deserts are conserved from human to chicken
– Ovcharenko et al. (2005) Genome Res. 15: 137-145.
• Conserved noncoding sequences in stable gene deserts tend to be long-range enhancers
– Nobrega, M.A., Ovcharenko, I., Afzal, V., Rubin, E.M. (2003) Science 302: 413.
cis-Regulatory modules conserved in eutherian
mammals (and marsupials?)
[Tree: divergence times of 91, 173, 310 and 450 million years]
• About 80-90% of CRMs
• Within aligned noncoding DNA
of eutherians, need to
distinguish constrained DNA
(purifying selection) from
neutral DNA.
Score multi-species alignments for features
associated with function
• Multiple alignment scores
– Binomial, parsimony (Margulies et al., 2003)
• PhastCons
– Siepel and Haussler, 2003; Siepel et al. 2005
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the 10% most highly
conserved sites
– Allows for variation in rates and autocorrelation in rates
• Factor binding sites conserved in human, mouse and rat
– Tffind (from M. Weirauch, Schwartz et al., 2003)
• Score alignments by frequency of matches to patterns
distinctive for CRMs
– Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)
Binding sites conserved between species
• tffind: Identify high-quality matches to a weight matrix in one sequence (e.g. human) that also aligns with other sequences (e.g. mouse and rat)
• Look for matches to the weight matrix in the 2nd and 3rd sequences, in the part of the alignment that aligns to the match to the weight matrix in the first species
• GALA records these matches
[Figure: aligned human (H), mouse (M) and rat (R) sequences. Annotation: Program does not find this, but some studies show that it can happen.]
Matt Weirauch
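The idea of a conserved weight-matrix match can be sketched as follows. This is not the tffind code; it is a minimal illustration that assumes a gapped multiple alignment given as equal-length uppercase strings, a weight matrix given as one dictionary of base scores per position, and a single score threshold applied to every species. All names, the toy matrix and the toy alignment are invented.

```python
def score_site(pwm, site):
    """Sum the per-position weights; None if the site cannot be scored."""
    if len(site) != len(pwm) or any(base not in "ACGT" for base in site):
        return None
    return sum(column[base] for column, base in zip(pwm, site))

def conserved_matches(pwm, alignment, threshold, ref="human"):
    """Find matches in the reference row whose aligned segments in every
    other row also score at or above the threshold."""
    ref_row = alignment[ref]
    width = len(pwm)
    # Alignment columns that hold a reference base rather than a gap
    ref_cols = [i for i, base in enumerate(ref_row) if base != "-"]
    hits = []
    for k in range(len(ref_cols) - width + 1):
        cols = ref_cols[k:k + width]
        ref_site = "".join(ref_row[c] for c in cols)
        ref_score = score_site(pwm, ref_site)
        if ref_score is None or ref_score < threshold:
            continue
        start, end = cols[0], cols[-1] + 1
        conserved = True
        for name, row in alignment.items():
            if name == ref:
                continue
            # Aligned segment in the other species, with gaps removed
            other_score = score_site(pwm, row[start:end].replace("-", ""))
            if other_score is None or other_score < threshold:
                conserved = False
                break
        if conserved:
            hits.append((k, ref_site))
    return hits

# Toy matrix: +1 for the preferred base at each position of GATA, -1 otherwise
toy_pwm = [{b: (1.0 if b == pref else -1.0) for b in "ACGT"} for pref in "GATA"]
toy_alignment = {
    "human": "CCGATAAGTTGATACC",
    "mouse": "CCGATAAGTTGACACC",
    "rat":   "CCGATAAG-TGATTCC",
}
toy_hits = conserved_matches(toy_pwm, toy_alignment, threshold=4.0)
```

In the toy alignment the first human match is kept because mouse and rat also match in the aligned segment, while the second is dropped because the aligned mouse and rat segments score below the threshold.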
Conserved transcription factor binding sites
• Track on UCSC Genome Browser (human)
• GALA (www.bx.psu.edu)
• rVista
– Can export alignments from zPicture and Mulan
– ECR browser
– All at dcode.org
• ConSite
Use measures of alignment texture to
discriminate functional classes of DNA
• Compute the probability of matching a pattern
characteristic of regulatory regions
– Analyze alignments as short strings of matches,
mismatches, gaps
– Find probabilities for each string using training sets
• 93 known regulatory regions
• 200 ancestral repeats (neutral)
– Construct Markov models that give good separation of
regulatory regions from neutral DNA
– Regulatory potential of all 100 bp windows in the
genome
Computing Regulatory Potential (RP)
Alignment (one column per line) and collapsed alphabet
seq1  seq2  seq3  symbol
G     G     A     1
T     T     T     2
A     G     G     1
C     T     T     3
C     C     C     4
T     G     A     5
A; the other two rows are gaps    7
C; the other two rows are gaps    7
T     A     A     6
A     G     A     8
C     C     T     3
G     C     G     6
C     C     T     3
A     A     A     9
• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).
• Train two order-t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
– positive (alignments in known regulatory regions)
– negative (alignments in ancestral repeats, a model for neutral DNA)
– E.g. frequency that 3 4 is followed by 5: 0.001 in regulatory regions, 0.0001 in ancestral repeats
• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.
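The RP calculation described in these bullets can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: the training strings are invented, add-one pseudocounts are used to avoid zero probabilities, natural logarithms are used because the slides do not state a base, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from training strings."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1.0

    def prob(symbol, context):
        seen = counts.get(tuple(context), {})
        total = sum(seen.values()) + pseudocount * len(alphabet)
        return (seen.get(symbol, 0.0) + pseudocount) / total

    return prob

def regulatory_potential(symbols, p_reg, p_ar, order):
    """Sum of log likelihood ratios over a string of collapsed symbols."""
    rp = 0.0
    for i in range(order, len(symbols)):
        context = symbols[i - order:i]
        rp += math.log(p_reg(symbols[i], context) / p_ar(symbols[i], context))
    return rp

ALPHABET = "123456789"   # collapsed alphabet, as in the 1-9 example
ORDER = 2                # t = 2, chosen only for this toy example

# Invented training strings standing in for real alignments
reg_training = ["3453453456", "1234534599"]   # "known regulatory regions"
ar_training = ["7776677767", "1616716167"]    # "ancestral repeats"

p_reg = train_markov(reg_training, ORDER, ALPHABET)
p_ar = train_markov(ar_training, ORDER, ALPHABET)

rp_regulatory_like = regulatory_potential("34534", p_reg, p_ar, ORDER)
rp_neutral_like = regulatory_potential("77677", p_reg, p_ar, ORDER)
```

With the frequencies quoted on the slide, the step "3 4 followed by 5" would contribute log(0.001/0.0001) = log 10 to the sum, so strings typical of regulatory regions score positive and strings typical of ancestral repeats score negative.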
\[
RP = \sum_{a \,\in\, \text{segment}} \log \frac{p_{REG}(s_a \mid s_{a-1} \cdots s_{a-t})}{p_{AR}(s_a \mid s_{a-1} \cdots s_{a-t})}
\]
RP and phastCons in HBB locus control region
[Figure: tracks across HBB, HBD, HBG2, HBG1, HBE and the LCR]
- Both RP and phastCons are high in exons
- RP peaks in many cis-regulatory modules
- phastCons peaks in more regions
http://genome.ucsc.edu/
More species and better models improve
discriminatory power of RP scores
ROC curves for different RP scores, tested on a set of known regulatory regions
from the HBB gene complex
RP and phastCons can discriminate most known
functional elements from neutral DNA
Leveraging genome evolution to discover
function
• Overall goals and core concepts
• All-vs-all whole-genome comparisons
– No single pairwise comparison is ideal for finding all
functional sequences
• Alignment scores:
– Aid in finding functional elements
– Discriminate between functional classes
• Example of experimental tests of the
bioinformatic predictions
Genes co-expressed in late erythroid maturation
• G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1.
– Can restore the activity of GATA-1 by expressing an estrogen-responsive form of GATA-1
– Allows cells to mature further to erythroblasts
• Use microarray analysis of each to find genes that increase or decrease expression upon induction.
– Walsh et al., (2004) BLOOD; Image from k-means cluster, GEO:
[Figure: k-means clusters of induced and repressed genes vs. time after restoration of GATA-1]
Predicting cis-regulatory modules (preCRMs)
• Identify a genomic region with a regulated gene.
• Find all intervals whose RP score exceeds an empirical threshold.
• Subtract exons.
• Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS).
• Intervals with RP scores above the threshold and with a cGATA-1_BS within 50 bp are preCRMs.
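These steps amount to a few interval operations, sketched below. The sketch assumes half-open (start, end) coordinates on one chromosome, RP scores supplied per window, and a list of conserved GATA-1 site intervals computed elsewhere; the function names, the threshold of 0.05 and all coordinates are invented for illustration.

```python
def high_rp_intervals(windows, threshold):
    """Merge overlapping or abutting windows whose RP score exceeds the threshold."""
    merged = []
    for start, end, score in sorted(windows):
        if score <= threshold:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

def subtract(intervals, exons):
    """Remove exon-covered bases from each interval."""
    result = []
    for start, end in intervals:
        pieces = [(start, end)]
        for ex_start, ex_end in exons:
            remaining = []
            for s, e in pieces:
                if ex_end <= s or ex_start >= e:
                    remaining.append((s, e))      # no overlap with this exon
                    continue
                if s < ex_start:
                    remaining.append((s, ex_start))
                if ex_end < e:
                    remaining.append((ex_end, e))
            pieces = remaining
        result.extend(pieces)
    return sorted(result)

def near_site(interval, sites, max_gap=50):
    """True if any site overlaps the interval or lies within max_gap bp of it."""
    start, end = interval
    return any(s_end >= start - max_gap and s_start <= end + max_gap
               for s_start, s_end in sites)

def predict_precrms(windows, exons, sites, threshold, max_gap=50):
    candidates = subtract(high_rp_intervals(windows, threshold), exons)
    return [iv for iv in candidates if near_site(iv, sites, max_gap)]

# Toy inputs (all values hypothetical)
rp_windows = [(0, 100, 0.02), (100, 200, 0.08), (200, 300, 0.09),
              (300, 400, -0.01), (600, 700, 0.07), (900, 1000, 0.10)]
exon_list = [(250, 320)]
cgata1_sites = [(120, 126), (730, 736)]
precrms = predict_precrms(rp_windows, exon_list, cgata1_sites, threshold=0.05)
```

In the toy data the interval at 900-1000 has a high RP score but no conserved GATA-1 site within 50 bp, so it is not reported as a preCRM.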
Predicted cis-regulatory modules
(preCRMs) around erythroid genes
Test predicted cis-regulatory modules (preCRMs)
• Enhancement in transient transfections of erythroid cells
[Figure: dual luciferase assay in K562 cells. Test fragment with HBG promoter and firefly (FF) luciferase; tk promoter with Renilla (Ren) luciferase.]
• Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
• Chromatin immunoprecipitation (ChIP) for GATA-1
Validation of preCRM in Alas2
Negative controls do not enhance transient expression
[Figure: fold change in reporter expression (y-axis, 0-7) for parent Luc and the negative-control fragments Fog1N1, Fog1N2, Hipk2N2, Gata2N2, Alas2N1, HS2N1, HS2N2, Alas2N2, Vav2N1, Vav2N2, CdmN1, Coro2aN1 and Gata2r.2N1]
Negative controls are segments of mouse DNA that align with rat and human but
have low RP scores and do not have a match to a GATA-1 binding site. They have
almost no effect on the level of expression of the reporter gene in erythroid cells.
7 of 24 Zfpm1 preCRMs enhance transient expression
9 of 24 Zfpm1 preCRMs enhance after stable
integration at RL5
All preCRMs in Gata2 are functional in at least one
assay
ChIP data are from publications from E. Bresnick’s lab.
About half of the preCRMs are validated as functional
Assay                      Number tested   Number positive   % validated
Transient transfections    62              21                34
Site-directed integrants   62              21                34
Either expression assay    62              33                53
GATA-1 ChIPs               17              11                65
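The percentages on this slide follow directly from the counts, as the short check below shows. The dictionary layout and function name are only for illustration; the counts are the ones given on the slide.

```python
# (number tested, number positive) per assay, as given on the slide
ASSAYS = {
    "Transient transfections": (62, 21),
    "Site-directed integrants": (62, 21),
    "Either expression assay": (62, 33),
    "GATA-1 ChIPs": (17, 11),
}

def percent_validated(tested, positive):
    """Percent of tested preCRMs that scored positive, rounded to a whole number."""
    return round(100.0 * positive / tested)

percentages = {assay: percent_validated(*counts) for assay, counts in ASSAYS.items()}
```

The "either expression assay" row (33 of 62, about 53%) is the basis for the statement that about half of the preCRMs are validated.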
Positive correlation between enhancer
activity and regulatory potential
[Figure: maximum fold increase, transient or stable (y-axis, 0-7) vs. mean RP score (x-axis, -0.15 to 0.25). Series: Max, validated; Max, nonvalidated; Max, NC.]
Developmental regulation of the HBB gene complex
[Figure: transcription in erythroid cells at embryonic, fetal and adult stages; locus control region]
High throughput DNase I hypersensitive sites find
known regulatory regions
Long transcripts run through OR genes into
globin genes
Conclusions
• Particular types of functional DNA sequences are
conserved over distinctive evolutionary distances.
• Multispecies alignments can be used to predict whether a
sequence is functional (signature of purifying selection).
• Alignments can be used to predict certain functional
regions, including some cis-regulatory elements.
• About half of the predicted cis-regulatory elements for erythroid
genes are validated as functional.
• Databases such as the UCSC Table Browser, GALA and
Galaxy provide access to these data.
• Expect improvements at all steps.
Many thanks …
Wet Lab: Yuepin Zhou, Hao Wang, Ying
Zhang, Yong Cheng, David King
Alignments, chains, nets, browsers, ideas, …
Webb Miller, Jim Kent, David Haussler
PSU Database crew: Belinda Giardine,
Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input:
Francesca Chiaromonte, James Taylor, Shan Yang,
Diana Kolbe, Laura Elnitski
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU