labs.bio.unc.edu

The gene family play and the
chromosomal theater
Todd Vision
Department of Biology
University of North Carolina at Chapel Hill
Outline
• Large-scale duplication and loss of genes in the angiosperms
• Looking into the future of plant phylogenomics
• A case study in gene family demography
• Duplication and functional divergence
Paul Franz, University of Amsterdam
Arabidopsis as a hub for plant
comparative maps
genome sizes in angiosperms
[Figure: bar chart of genome sizes in megabases (axis 0 to 1000); plotted values range from 145 to 907 Mb; species labels not recoverable]
data from Arumuganathan & Earle (1991) Plant Mol Biol Rep 9:208-218
Tomato-Arabidopsis synteny
Bancroft (2001) TIG 17, 89 after Ku et al (2000) PNAS 97, 9121
Duplicated genes in Arabidopsis
Modes of gene duplication
 Tandem (T)
• unequal crossing-over
• mostly young
 Dispersed (D)
• transposition
• all ages
 Segmental (S)
• polyploidy
• all old
Paleotetraploidy?
The Arabidopsis Genome Initiative. 2000. Nature 408:796
Vision et al. (2000) Science 290:2114-7.
Microsynteny within blocks
[Figure: distribution of dA (amino acid substitution, 0.0 to 1.0) against frequency f; series: in blocks, not in blocks]
Problems
• proteins diverge at different rates
• high dA is difficult to estimate
Solution
• average dA within blocks
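The averaging step can be sketched as follows. This is a minimal illustration, assuming each duplicated gene pair has already been assigned to a block and has a dA estimate; the function name, block identifiers, and values are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from statistics import mean


def mean_da_per_block(pairs):
    """Average dA over all duplicated gene pairs in each block.

    pairs: iterable of (block_id, dA) tuples (hypothetical input format).
    Returns a dict mapping block_id to the mean dA of its pairs.
    """
    by_block = defaultdict(list)
    for block_id, d_a in pairs:
        by_block[block_id].append(d_a)
    return {block: mean(values) for block, values in by_block.items()}


# Toy values for illustration only.
example_pairs = [("block_1", 0.30), ("block_1", 0.50), ("block_2", 0.80)]
block_means = mean_da_per_block(example_pairs)
```

Averaging over a block dampens the effect of individual proteins that diverge unusually fast or slowly, which is the problem the slide names.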
discrete duplication events
[Figure: frequency of blocks by amino acid substitution (0.0 to 0.9), grouped into age classes A to F, with a time scale of 0 to 200 Mya]
the 2-4 complex
(one ancestral segment broken up by 4 large
inversions)
[Figure: comparison of segments on chromosome 4 (4.6 Mb) and chromosome 2 (5.6 Mb)]
[Figure: frequency distribution of Ka (0 to 1); coefficient of variation = 0.67]
[Figure: frequency distribution of Ks (0 to 5); coefficient of variation = 0.53]
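For reference, the coefficient of variation reported for the Ka and Ks distributions is the standard deviation divided by the mean. The sketch below assumes the population standard deviation; the talk does not say whether the population or sample form was used, and the input values are toy numbers.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return standard deviation / mean for a list of positive values.

    Uses the population standard deviation (an assumption; the sample
    form, statistics.stdev, would give a slightly larger result).
    """
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return pstdev(values) / m


# Toy values for illustration only (not data from the talk).
toy_ks = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(toy_ks)
```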
Rice-Arabidopsis microsynteny
Mayer et al. (2001) Genome Res. 11, 1167
Blanc, Hokamp, Wolfe (2003) Genome Res. 13, 137-144.
[Figure: gene trees of rice and Arabidopsis genes used to date duplications. Block 37: after the Asterid-Rosid split. Block 57: before the monocot-dicot divergence.]
Raes, Vandepoele, Saeys, Simillion, Van de Peer (2003) J. Struct. Func. Genomics 3, 117-129
Divergence among duplicated
genes in rice
Goff et al. (2002) Science 296: 92
Hidden syntenies
Simillion, Vandepoele, Van Montagu, Zabeau, Van de Peer (2002) PNAS 99, 13627
Interspecies comparison can
reveal hidden syntenies
Vandepoele, Simillion, Van de Peer (2002) TIG 18, 606-608
Comparative mapping in a
phylogenetic context
Major plant genome datasets
Family          Genus / species
Aizoaceae       Mesembryanthemum crystallinum
Brassicaceae    Arabidopsis thaliana
                Brassica spp.
Fabaceae        Glycine max
                Medicago truncatula
                Phaseolus spp.
Malvaceae       Gossypium arboreum
Solanaceae      Capsicum annuum
                Lycopersicon esculentum
                Solanum tuberosum
Poaceae         Hordeum vulgare
                Oryza sativa
                Sorghum bicolor/propinquum
                Triticum aestivum
                Zea mays
Other           Beta vulgaris
                Chlamydomonas reinhardtii
                Pinus taeda
                Populus spp.
                Prunus spp.
[Table columns: genome, EST, map; the per-species X marks are not recoverable]
Plant unigene datasets
species          TIGR      PlantGDB
barley           49885     74621
beet             na        13565
chlamydomonas    30296     na
citrus           na        4266
coffee           na        392
cotton           24350     27854
grape            49885     74621
iceplant         8455      8945
lettuce          21960     na
lotus            11025     na
maize            55063     71655
marchantia       na        1059
medicago         36976     43384
oat              na        361
onion            11726     na
pine             26882     24668
poplar           na        20935
potato           24275     24839
rice             60778     52156
rye              5199      5384
sorghum          33273     34363
soybean          67826     73946
sunflower        20520     na
tomato           31012     35725
wheat            109509    95949
+ Arabidopsis 27170
Wikström et al (2001) Proc R Soc Lond B 268, 2211
Plant phylogenomics:
Phytome
The goal is to integrate
• Organismal phylogeny
• Gene family
sequence
alignment
phylogeny
• Genetic and physical maps
Some uses for Phytome
 Starting with a chromosome segment
• Identify homologous segments
• Predict unobserved gene content (candidate QTL)
 Starting with a gene family
• Resolve orthology/paralogy relationships
• Identify coevolving families
 Starting with a species
• Explore lineage-specific diversification
• Guide comparative mapping wet-work
Current pipeline
• Unigene collections
• Annotations
• Protein sequence prediction
• Homolog identification
• Protein family clustering
• Multiple sequence alignment
• Phytome
• Phylogenetic inference
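One simple way to go from pairwise homolog hits to protein families is single-linkage clustering, sketched below with a union-find structure. This is an illustrative assumption only: the talk does not state which clustering method the Phytome pipeline uses, and the gene names are hypothetical.

```python
def cluster_families(homolog_pairs):
    """Group genes into families by single-linkage clustering.

    homolog_pairs: iterable of (gene_a, gene_b) homolog hits.
    Returns a sorted list of families, each a sorted list of gene names.
    Single linkage is an assumed method, not the pipeline's documented one.
    """
    parent = {}

    def find(gene):
        # Locate the representative of the gene's set, halving paths as we go.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b in homolog_pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in families.values())


# Hypothetical gene names for illustration.
families = cluster_families([("g1", "g2"), ("g2", "g3"), ("g4", "g5")])
```

Single linkage tends to merge families through shared domains, so a real pipeline would likely need stricter criteria.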
Lineage specific diversification
[Figure: gene counts compared among Arabidopsis, cotton, Medicago, tomato, and rice]
152 genes are “single copy” in all four species
A tale of two sisters: the ARF
and the Aux/IAA gene families
Modulate whole plant response to auxin
Interact via dimerization
• ARFs are transcription factors
• Aux/IAAs bind and repress ARFs in the
absence of auxin
The chromosomal context
Diversification of ARFs
Diversification of the Aux/IAAs
Why the different patterns of
diversification?
12% (ARF) vs 40% (Aux/IAA)
segmental duplications
Presumably reflects differential retention
Possible explanations
• Dosage requirements
• Coevolution with other interacting genes
• Regional transcriptional regulation
Divergence of duplicated genes
Age of duplication
Duplicate pairs in yeast and human
(Gu et al. 2002, Makova and Li 2003)
 Appx. 50% of pairs diverge very rapidly
 Proportion of divergent pairs increases with
Ks and Ka
• Plateaus at Ka ~0.3 in human
 In humans,
• Immune response genes over-represented among
young, divergent pairs
• Distantly related pairs with conserved expression
tend to be either ubiquitous or very tissue specific
Retention of duplicated genes
 Nonfunctionalization, or loss of one copy
• The fate of most pairs
 Neofunctionalization (NF)
• Positive selection on a new mutation can maintain the pair
 Subfunctionalization (SF)
• Mutations that increase the specificity of duplicates can fix
due to drift provided that, combined, the two copies provide
the functionality of the ancestral gene. Once SF happens,
both copies are indispensable and are retained.
• One prediction of the model is that SF is more likely for tandem
than for dispersed pairs (due to linkage)
Digital expression profiling
 Massively Parallel Signature Sequencing (MPSS)
• Count occurrence of 17-20 bp mRNA signatures
• Cloning and sequencing is done on microbeads
• Similar to Serial Analysis of Gene Expression
(SAGE)
 “Bar-code” counting reduces concerns of
• cross-hybridization
• probe affinity
• background hybridization
 Advantages
• Accurate counts of low expression genes
• Can distinguish expression profiles of duplicate genes
MPSS library construction
Brenner et al., PNAS 97:1665-70.
[Figure: library construction steps, as labeled in the diagram]
• Extract mRNA from tissue
• Convert to cDNA
• Cut w/ Sau3A (GATC)
• 5’ - Add standard primer
• 3’ - Add unique 32 bp tag and standard primer (added by cloning)
• PCR
• Add linker
• Remove 3’ primer and expose single stranded unique tag (digest, 3’ to 5’ exonuclease)
• Anneal to beads coated with unique anti-tag (32 bp, complementary to tag on mRNA)
• Sort by FACS to remove ‘empty’ beads
The result of the library construction is a
set of microbeads. Each bead contains
many DNA molecules, all derived from the
3’ end of a single transcript.
Beads are loaded in a monolayer on a
microscope slide for the sequencing of
17 – 20 bp from the 5’ end.
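The origin of a signature can be sketched in a few lines. The assumption here is that the signature is the 17 bp beginning at the 3'-most GATC (Sau3A) site of the cDNA, which is consistent with every example signature in this talk starting with GATC; the function name and the toy transcript are hypothetical.

```python
def mpss_signature(cdna, site="GATC", length=17):
    """Return the signature at the 3'-most restriction site, or None.

    cdna: sense-strand sequence, 5' to 3'.
    The signature includes the site itself and extends 3' of it.
    Returns None if there is no site or too little sequence after it.
    """
    start = cdna.rfind(site)
    if start == -1 or start + length > len(cdna):
        return None
    return cdna[start:start + length]


# Toy transcript with two GATC sites; only the 3'-most one is used.
toy_cdna = "ATGCCGATCAAAGATCAATCGGACTTGTCAAAAAAA"
signature = mpss_signature(toy_cdna)
```

A transcript with no GATC site yields no signature, so such genes cannot be measured this way.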
MPSS Sequencing
Brenner et al., Nat. Biotech. 18:630-4.
[Figure: sequencing cycle, as labeled in the diagram]
• Add adaptors
• Sequence by hybridization: 16 cycles for 4 bp
• Digest with Type IIS enzyme to uncover next 4 bases
• Repeat cycle: steps of four bases; overhang is shifted by four bases in each round
MPSS Sequencing
Each bead provides a signature of 17-20 bp
Tag #     Signature Sequence     # of Beads (Frequency)
1         GATCAATCGGACTTGTC      2
2         GATCGTGCATCAGCAGT      53
3         GATCCGATACAGCTTTG      212
4         GATCTATGGGTATAGTC      349
5         GATCCATCGTTTGGTGC      417
6         GATCCCAGCAAGATAAC      561
7         GATCCTCCGTCTTCACA      672
8         GATCACTTCTCTCATTA      702
9         GATCTACCAGAACTCGG      814
.
.
30,285    GATCGGACCGATCGACT      2,935
Total # of tags: >1,000,000
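Bead counts from libraries of different depth are usually compared after scaling to a common total. The sketch below converts raw counts to parts per million (ppm), the unit used for the expression threshold later in the talk. The function name and the toy library total are assumptions for illustration.

```python
def counts_to_ppm(bead_counts):
    """Scale raw bead counts so that they sum to one million.

    bead_counts: dict mapping signature sequence to number of beads.
    Returns a dict mapping signature to abundance in ppm.
    """
    total = sum(bead_counts.values())
    if total == 0:
        raise ValueError("library contains no beads")
    return {sig: n * 1_000_000 / total for sig, n in bead_counts.items()}


# Toy library of 1,000 beads (real libraries exceed a million tags).
toy_counts = {
    "GATCAATCGGACTTGTC": 2,
    "GATCGTGCATCAGCAGT": 53,
    "GATCCGATACAGCTTTG": 945,
}
ppm = counts_to_ppm(toy_counts)
```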
Two sets of signatures are generated from each
sample in different reading frames staggered
by two bases
Classifying signatures
[Figure: signature positions relative to an annotated gene. Labels: typical signatures; potential alternative termination; potential alternative splicing or nested gene; potential anti-sense transcript; anti-sense transcript or nested gene?; potential un-annotated ORF; duplicated: expression may be from other site in genome]
Triangles refer to colors used on our web page:
Class 1 - in an exon, same strand as ORF.
Class 2 - within 500 bp after stop codon, same strand as ORF.
Class 3 - anti-sense of ORF (like Class 1, but on opposite strand).
Class 4 - in genome but NOT class 1, 2, 3, 5 or 6.
Class 5 - entirely within intron, same strand.
Class 6 - entirely within intron, anti-sense.
Grey = potential signature NOT expressed
Class 0 - signatures found in the expression libraries but not the genome.
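The class definitions can be written as a small decision procedure. The sketch below is a simplified illustration, not the classifier behind the web page: it assumes half-open coordinates, treats the end of the last coding exon as the stop codon, counts any overlap with an exon as "in an exon", and considers a single gene at a time. The `Gene` type and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    strand: str   # "+" or "-"
    exons: list   # coding exons as (start, end), half-open, ascending


def classify_signature(start, end, strand, gene, in_genome=True, window=500):
    """Return the signature class (0 to 6) relative to one gene model."""
    if not in_genome:
        return 0
    same = strand == gene.strand
    # Classes 1 and 3: signature overlaps an exon.
    if any(start < e_end and end > e_start for e_start, e_end in gene.exons):
        return 1 if same else 3
    # Classes 5 and 6: signature lies entirely within an intron.
    introns = zip((e for _, e in gene.exons[:-1]),
                  (s for s, _ in gene.exons[1:]))
    if any(i_start <= start and end <= i_end for i_start, i_end in introns):
        return 5 if same else 6
    # Class 2: within `window` bp downstream of the stop, same strand.
    if same:
        if gene.strand == "+":
            stop = gene.exons[-1][1]
            if stop <= start and end <= stop + window:
                return 2
        else:
            stop = gene.exons[0][0]
            if stop - window <= start and end <= stop:
                return 2
    # Class 4: in the genome but none of the above.
    return 4


toy_gene = Gene("+", [(100, 200), (300, 400)])
exon_class = classify_signature(150, 167, "+", toy_gene)
```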
Core Arabidopsis MPSS libraries
sequenced by Lynx for Blake Meyers, U. of Delaware
Library    Signatures sequenced    Distinct signatures
Root       3,645,414               48,102
Shoot      2,885,229               53,396
Flower     1,791,460               37,754
Callus     1,963,474               40,903
Silique    2,018,785               38,503
TOTAL      12,304,362              133,377
http://www.dbi.udel.edu/mpss
Query by
• Sequence
• Arabidopsis gene identifier
• chromosomal position
• BAC clone ID
• MPSS signature
• Library comparison
Site includes
• Library and tissue information
• FAQs and help pages
Genome-wide MPSS profile in Arabidopsis
[Figure: MPSS profile along chromosomes I to V]
Of the 29,084 gene models,
17,849 match unambiguous, expressed class 1 and/or 2 signatures
Dataset of duplicate pairs
 Gene families of size two in Arabidopsis
classified as
• Dispersed (280)
• Segmental (149)
• Tandem (63)
 For each pair
• Measure similarity/distance in expression profile
• Estimate Ks and Ka
Expression distance
[Figure: expression profiles plotted on axes library 1, library 2, library 3]
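One way to compute a distance between two expression profiles is sketched below. The talk does not give the formula behind its "normalized distance", so this is only an assumed stand-in: each profile is scaled to unit length, and the Euclidean distance is taken between the scaled vectors, so that the measure reflects profile shape rather than overall expression level.

```python
import math


def expression_distance(profile_a, profile_b):
    """Euclidean distance between unit-length expression profiles.

    profile_a, profile_b: expression values in the same library order.
    Returns None if either gene is not expressed in any library.
    Ranges from 0 (same shape) to sqrt(2) (no shared libraries).
    """
    def unit(profile):
        norm = math.sqrt(sum(x * x for x in profile))
        return None if norm == 0 else [x / norm for x in profile]

    a, b = unit(profile_a), unit(profile_b)
    if a is None or b is None:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy profiles over three libraries.
same_shape = expression_distance([1, 2, 3], [2, 4, 6])
disjoint = expression_distance([10, 0, 0], [0, 7, 0])
```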
The number of genes with >5 ppm expression in a
given number of libraries among the 984 genes
in pairs analyzed and among all Arabidopsis
genes with MPSS profiles.
Libraries    Genes in pairs    All genes
0            153 (15.5%)       4160 (23.3%)
1            124 (12.6%)       2643 (14.8%)
2            73 (7.4%)         1727 (9.6%)
3            93 (9.5%)         1777 (10.0%)
4            109 (11.1%)       1930 (10.8%)
5            432 (43.9%)       5612 (31.4%)
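The tabulation above amounts to counting, for each gene, the libraries in which expression exceeds 5 ppm, then counting genes per value. A minimal sketch follows; the function names, gene names, and profiles are hypothetical.

```python
from collections import Counter


def breadth_of_expression(profile_ppm, threshold=5.0):
    """Number of libraries in which expression exceeds the threshold."""
    return sum(1 for value in profile_ppm if value > threshold)


def breadth_table(profiles, threshold=5.0):
    """Count genes by the number of libraries in which they are expressed.

    profiles: dict mapping gene name to a list of ppm values, one per library.
    """
    return Counter(breadth_of_expression(p, threshold)
                   for p in profiles.values())


# Toy profiles over five libraries.
toy_profiles = {
    "geneA": [71, 59, 0, 140, 94],
    "geneB": [0, 11, 1, 8, 17],
    "geneC": [0, 0, 0, 0, 0],
}
table = breadth_table(toy_profiles)
```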
Asymmetry in levels of expression
among libraries within pairs
Symmetry of divergence
Type of Pair             A             B              C             D
(rows ordered from young to old)
Dispersed (Ks ≤ 0.5)     14 (15.7%)    61 (68.5%)     8 (9.0%)      6 (6.7%)
Tandem (Ks ≤ 0.5)        8 (14.3%)     29 (51.8%)     10 (17.9%)    9 (16.1%)
Dispersed (Ks > 0.5)     35 (18.3%)    111 (58.1%)    24 (12.6%)    21 (11.0%)
Segmental (All)          31 (20.8%)    104 (69.8%)    7 (4.7%)      7 (4.7%)
A: Each copy has higher expression in at least one library
B: One copy has higher expression in all libraries that differ and at least
two libraries differ
C: Copies differ in expression in only one library
D: Copies do not differ in expression in any libraries
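The four categories translate directly into code once a rule for "differ" is fixed. The talk does not state that rule, so the default below (an absolute difference of more than 5) is a placeholder assumption; a statistical test on the counts could be passed in instead. All names and profiles are hypothetical.

```python
def symmetry_class(copy1, copy2, differs=None):
    """Assign a duplicate pair to category A, B, C, or D.

    copy1, copy2: expression values in the same library order.
    differs: function (a, b) -> bool deciding whether two values differ.
    """
    if differs is None:
        # Placeholder rule, not the criterion used in the talk.
        def differs(a, b):
            return abs(a - b) > 5

    pairs = list(zip(copy1, copy2))
    higher1 = sum(1 for a, b in pairs if differs(a, b) and a > b)
    higher2 = sum(1 for a, b in pairs if differs(a, b) and b > a)
    n_differ = higher1 + higher2

    if n_differ == 0:
        return "D"   # no library differs
    if higher1 > 0 and higher2 > 0:
        return "A"   # each copy is higher somewhere
    if n_differ == 1:
        return "C"   # only one library differs
    return "B"       # one copy higher in all differing libraries


# Toy profiles over five libraries.
asymmetric = symmetry_class([88, 56, 44, 52, 107], [0, 0, 0, 0, 1])
```

Category A is the pattern expected if the two copies have partitioned expression between them, while B is consistent with one copy simply being expressed at a lower level.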
[Figure: normalized distance vs. synonymous substitution, by duplication mode (D, S, T)]
[Figure: normalized distance vs. nonsynonymous substitution, by duplication mode (D, S, T); fitted line dN = 0.48 + 0.37 Ka, p < 0.0001]
[Figure: total expression vs. nonsynonymous substitution, by duplication mode (D, S, T)]
[Figure: breadth of expression vs. nonsynonymous substitution, by duplication mode (D, S, T)]
Pairs with small Ks but dissimilar
expression profiles.
Expression values are given as first gene / second gene.
Ks     Ka      dup   gene pair               callus    flower    leaf      root      silique
0.03   <0.01   D     AT1G80700 / AT1G80980   71 / 0    59 / 11   0 / 1     140 / 8   94 / 17
0.17   0.05    T     AT2G46280 / AT2G46290   246 / 28  210 / 29  160 / 1   308 / 29  80 / 16
0.20   0.06    T     AT2G15400 / AT2G15430   4 / 42    14 / 128  5 / 14    5 / 136   34 / 18
0.22   0.05    D     AT1G36280 / AT4G18440   1 / 40    3 / 87    9 / 69    13 / 69   10 / 51
0.26   0.05    T     AT1G71270 / AT1G71300   88 / 0    56 / 0    44 / 0    52 / 0    107 / 1
0.27   0.07    T     AT3G13290 / AT3G13300   20 / 246  22 / 245  1 / 72    1 / 192   6 / 77
0.27   0.10    T     AT1G29390 / AT1G29395   18 / 0    238 / 63  89 / 5    8 / 0     165 / 36
0.27   0.06    T     AT3G26070 / AT3G26080   16 / 349  169 / 13  346 / 41  0 / 4     524 / 135
0.28   0.13    D     AT3G56190 / AT3G56450   216 / 15  115 / 0   144 / 6   239 / 4   56 / 1
Pairs with large Ks but similar
expression profiles.
Expression values are given as first gene / second gene.
Ks     Ka     dup   gene pair               callus    flower     leaf      root     silique
0.87   0.28   T     AT3G16220 / AT3G16230   16 / 21   10 / 57    12 / 35   3 / 13   19 / 13
0.89   0.13   D     AT3G03660 / AT5G17810   14 / 71   0 / 0      0 / 0     0 / 0    0 / 0
0.95   0.29   D     AT2G41180 / AT3G56710   57 / 75   14 / 15    78 / 39   4 / 3    29 / 14
0.97   0.28   D     AT1G31814 / AT5G16320   2 / 0     39 / 55    4 / 10    3 / 19   0 / 8
0.98   0.23   D     AT5G07230 / AT5G62080   0 / 0     344 / 288  0 / 0     0 / 0    0 / 0
0.99   0.26   D     AT3G22160 / AT4G15120   86 / 34   6 / 2      10 / 0    4 / 0    4 / 0
A closing thought
 1965
• The Ecological Theater and the Evolutionary Play,
G. E. Hutchinson
 2004
• The Chromosomal Theater and the Gene Family
Play
 Phylogenetics has a great deal to contribute
to understanding the evolutionary interplay of
genome structure and function
Dan Brown
Brandon Gaut
Steven Tanksley
Liqing Zhang
Jason Phillips
Dihui Lu
David Remington
Jason Reed
Tom Guilfoyle
Blake Meyers
NSF