Transcript Slide 1

Orthology predictions for whole
mammalian genomes
Leo Goodstadt
MRC Functional Genomics Unit
Oxford University
Mammalian Genomes
How does our genome,
and how do our genes,
differ from those of
other mammals and
other vertebrates?
Great Expectations
So why is it taking so long to
understand a simple genome?
1. We did not appreciate how much functional sequence there would be. (How much?)
2. We did not appreciate how hard it would be to ‘read off’ functions from the human genome. (Species-specific genes?)
3. We had no idea that individual human genomes can differ so much! (Human genomes)
How do we find function in the
genome?
• Nothing in Biology Makes Sense Except in the
Light of Evolution. Theodosius Dobzhansky
(1900-1975).
The dawn of mammalian comparative genomics
SANITY CHECKS FOR ALL MAMMALIAN PROJECTS:
LESSONS FROM THE MOUSE GENOME (2002)
Domain-regions are more conserved
Because they are under higher purifying selection
Secreted proteins evolve faster
Rapid duplicators are rapid evolvers
Higher purifying pressures in enzymes
Mouse-Human Orthologues % Identity
• sites not in domains: 64.4%
• cSNP sites: 67.1%
• all sites: 70.1%
• sites in domains: 88.9%
• disease sites: 90.3%
WHICH GENES HAVE LINEAGE
SPECIFIC DUPLICATES?
Large number of lineage specific duplications
10 – 20% of genes are lineage specific depending on
comparisons
20% of human genes have been duplicated
or do not have a rodent orthologue
Family trees for genes:
• 1 to 1 (80%): shared orthologues (present as a single gene in the common ancestor of human and mouse)
• Gene families shared with mouse but which have expanded in human (9%)
• Human-specific genes missing from mouse (8%). (In many cases, more distantly related mouse genes (homologues) can be found.)
Where do new genes come from?
• De novo (from non-coding)
• Rapid sequence change
• Gene duplication
M. Lynch and A. Force, The probability of duplicate gene preservation by subfunctionalisation.
Genetics 154 (2000), pp. 459–473
• Pseudogenisation
• Missing: Horizontal gene transfer
Inparalogues
• Chemosensation (OR, V1R and V2R )
• Reproduction (Vomeronasal Receptors, lipocalins, β-microseminoprotein (12:1))
• Immunity (IG chains, butyrophilins, leukocyte IG-like receptors,
T-cell receptor chains and carcinoembryonic antigen-related cell
adhesion molecules )
pancreatic RNAses
• Detoxification (hypoxanthine phosphoribosyltransferase
homologues nitrogen poor diets)
• KRAB ZnFingers
Reproduction Clusters
Gene cluster | No. in cluster | Function
Odorant binding proteins / aphrodisin | 8 | Aphrodisiac hormone
Hydroxysteroid dehydrogenase | 7 | Biosynthesis of hormonal steroids.
Class CYP4A Cytochromes P450 | 7 | Oxidation of compounds.
Seminal vesicle-antigen (SVA) | 4 | Suppression of spermatozoa motility.
Submandibular gland secretory proteins | 9 | Expression is androgen-dependent.
Obox, homeobox proteins | 6 | Homeobox proteins.
Androgen-binding protein-α | 9 | Mate selection.
Prolactin related proteins | 17 | Placentation.
Cathepsin J-like enzymes | 6 | Placentation.
Cystatins / Stefins | 7 | Placentation.
HOX cluster | 8 | Placentation.
Class CYP2D Cytochromes P450 | 5 | Regulated by androgens.
MHC class I | 8 | Immunity / Mate selection?
Elafin, eppin, and antileukoproteinase 1 | 7 | Anti-microbial.
Beta-defensin proteins (x 2 clusters) | 5/5 | Anti-microbial.
Eosinophil-associated ribonuclease | 11 | Pathogen response.
Weaker purifying selection for duplicate genes
Rapid evolvers in protein coding genes
KRAB Zn Fingers
TOXIN
DEGRADATION
Hypothesis: Darwinian evolution
Competition:
• Inter-specific (pathogens, predators)
• Intra-specific
– mating
– sub-speciation / kin-selection
– gender conflict
– clonal expansions in sperm
Immunity genes evolve the fastest
Rapidly-changing developmental or transcriptional
regulatory genes?
• KRAB-zinc finger genes
• Cancer-testis antigen genes (e.g. PRAMEs)
• Regulate chromatin structure and
therefore the timing of transcription.
Detecting biological signals among
inparalogues
Correlations with known annotations
• Biological Annotations (gene descriptions /
Gene Ontology)
• Tissue specificity
• Comparative changes across lineages (dating)
• Chromosomal Distribution
• Positive selection
• Genomic environment
Different genes duplicate at
different times
Leo Goodstadt et al. Genome Res. 2007; 17: 969-981
Look for differential evolution
GO-analysis of
innovations
Trends - Functions
Human - Chicken
CGSC (2005)
Trends - Tissues
Chicken - Human
CGSC (2005)
Exploring rapid evolution with protein structure
GENE FAMILIES
Independent expansions in the PRAME gene
family
Positive selection: PRAME genes
Amino acid sites under
positive selection in human
(red), mouse (blue) and rat
(purple) [or multiple species
(yellow)] PRAME genes.
Gene Duplication Remodels Genome
Androgen-binding proteins.
Produced by Sertoli cells in the seminiferous tubules of the testes
Emes et al. (2004) Genome Res. 14(8):1516-29
Lipocalins: Mouse Major Urinary Proteins
Rat α2u-globulin genes
sites subject to
positive selection
V2R olfactory receptor N-terminal domain:
positively selected sites: dark blue, ligand (glutamate) pink
(other monomer)
MHC class Ib, M10s
positively selected sites: in blue,
peptide ligand in MHC structure in green
Finding disease candidates within model organisms
ORTHOLOGY AND DISEASE
Few Mendelian disease genes
lack mouse orthologues
– Kallmann syndrome gene: C. elegans orthologue
– CETP (cholesteryl ester transfer protein): rabbit and hamster
– Glycophorin E: primate specific (MN and Ss blood types)
Mouse equivalents of human disease
variants
Hs normal:
MAETLFWTPLLVVLLAGLGDTEAQQTTLHPLVGRVFVHTLDHETFLSLPEHVAVPPAVHI
Hs variant:
MAETLFWTPLLVVLLAGLGDTEAQQTTLHLLVGRVFVHTLDHETFLSLPEHVAVPPAVHI
Mm normal:
MAAAVTWIPLLAGLLAGLRDTKAQQTTLHLLVGRVFVHPLEHATFLRLPEHVAVPPTVRL
Nick Dickens & Jörg Schultz
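The three-way comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name and the labels are assumptions, and it expects the three sequences to be pre-aligned without gaps, as they are on the slide. The first two labels follow the categories used later in the slides; the third ("mouse differs from both") is added here for completeness.

```python
# Sequences copied from the slide (60 aligned residues each, no gaps).
HS_NORMAL = "MAETLFWTPLLVVLLAGLGDTEAQQTTLHPLVGRVFVHTLDHETFLSLPEHVAVPPAVHI"
HS_VARIANT = "MAETLFWTPLLVVLLAGLGDTEAQQTTLHLLVGRVFVHTLDHETFLSLPEHVAVPPAVHI"
MM_NORMAL = "MAAAVTWIPLLAGLLAGLRDTKAQQTTLHLLVGRVFVHPLEHATFLRLPEHVAVPPTVRL"


def classify_disease_sites(hs_normal, hs_variant, mm_normal):
    """Return (position, wild type, variant, mouse, label) for each variant site.

    Hypothetical helper: assumes the three sequences are already aligned
    column by column and contain no gap characters.
    """
    if not (len(hs_normal) == len(hs_variant) == len(mm_normal)):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    columns = zip(hs_normal, hs_variant, mm_normal)
    for pos, (wt, var, mouse) in enumerate(columns, start=1):
        if wt == var:
            continue  # not a disease-variant site
        if mouse == wt:
            label = "mouse = human wild-type residue"
        elif mouse == var:
            label = "mouse = human disease residue"
        else:
            label = "mouse differs from both"
        results.append((pos, wt, var, mouse, label))
    return results


for site in classify_disease_sites(HS_NORMAL, HS_VARIANT, MM_NORMAL):
    print(site)
```

For the slide's sequences the only variant site is position 30 (P to L), where the mouse residue matches the human disease residue. This appears to correspond to the P30L entry in the variant list.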
Hirschsprung disease (142623): E251K
Leukoencephalopathy with vanishing white matter (603896): R113H
Mucopolysaccharidosis type IVA (253000): R376Q
Breast cancer (113705): L892S
Breast cancer (600185): V211A, Q2421H
Parkinson disease (601508): A53T
Tuberous sclerosis (605284): Q654E
Bardet-Biedl syndrome, type 6 (209900): T57A
Mesothelioma (156240): N93S
Long QT syndrome 5 (176261): V109I
Cystic fibrosis (602421): F87L, V754M
Porphyria variegata (176200): Q127H
Non-Hodgkin's lymphoma (605027): A25T, P183L
Severe combined immunodeficiency disease (102700): R142Q
Limb-girdle muscular dystrophy type 2D (254110): P30L
LCAD deficiency (201460): Q333K
Usher syndrome type 1B (276902): G955S
Chronic nonspherocytic hemolytic anemia (206400): A295V
Mantle cell lymphoma (in 208900): N750K
Becker muscular dystrophy (300377): H2921R
Complete Androgen Insensitivity syndrome (300068): G491S
Prostate cancer (176807): P269S, S647N
Crohn's disease (266600): W157R
Disease mutations do not always lead to pathological phenotypes
in mouse!
7293 SwissProt disease-associated variants
• 90.3%: mouse residue = human wild-type residue
• 7.5%: mouse residue ≠ human wild-type residue
• 2.2%: mouse residue = human disease residue
Genomes are not a bag of genes
GENOMIC CONTEXT IS IMPORTANT:
LESSONS FROM THE MONODELPHIS
Mutation rate is higher on the X chromosome
Human X has separate synteny with MDOX/4/7
Only the marsupial X shows an increase
in dS
Comparisons with a third genome
• Australian marsupial
silver-gray brushtail possum
Trichosurus vulpecula
• 8,237 orthologues from 111,634 ESTs
• More closely related to Monodelphis
Median dS:
Monodelphis - Trichosurus: 0.26
Homo - Monodelphis: 1.02
X chromosome increased mutation rate
is marsupial specific
Homo - Monodelphis 1:1 orthologues
dN/dS: 0.086
dS: 1.02
Amino acid sequence identity: 81.0%
Pairwise alignment coverage: 94.2%
Measure | Homo sapiens | Monodelphis domestica
Number of exons | 9 | 9
Sequence length (codons) | 471 | 445
Unspliced transcript length (bp) | 27,241 | 25,365
G+C content at 4D sites | 56.9% | 48.7%
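For readers unfamiliar with "G+C content at 4D sites": a fourfold degenerate (4D) site is a third codon position where any of the four bases encodes the same amino acid. The sketch below is a minimal illustration under the standard genetic code. The function name and the toy sequence are hypothetical, and it scores a single coding sequence rather than reproducing the orthologue-based figures on the slide.

```python
# Codon prefixes whose third position is fourfold degenerate in the
# standard genetic code: Leu (CTN), Val (GTN), Ser (TCN), Pro (CCN),
# Thr (ACN), Ala (GCN), Arg (CGN), Gly (GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def gc_at_4d_sites(cds):
    """Fraction of G or C bases at fourfold degenerate third positions.

    Hypothetical helper: `cds` is an in-frame coding sequence; trailing
    bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    gc = total = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            total += 1
            if codon[2] in "GC":
                gc += 1
    if total == 0:
        raise ValueError("no fourfold degenerate sites found")
    return gc / total


# Toy sequence (illustrative only): ATG is not a 4D codon; CTG, GTA and ACC are.
print(gc_at_4d_sites("ATGCTGGTAACC"))
```

Because 4D sites do not change the protein, their base composition is usually read as a proxy for mutational or gene-conversion bias rather than for selection on the protein.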
Higher G+C in Monodelphis X
Increased G+C
Lower G+C in Homo X
Decreased G+C
Mutation rate varies with G+C
Canis has many small
chromosomes
Monodelphis has few, big
chromosomes
Relative chromosome sizes in Monodelphis
[Figure: chromosome sizes of Monodelphis (1 to 8, X) shown against Homo (1 to X)]
Variations in Female recombination
rate
• Telomeric ends are highly recombining
– Short chromosomes have proportionally more
subtelomeric sequence
– Long chromosomes have proportionally more
interstitial sequence
• Obligatory chiasma per chromosome
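The link between chromosome length and recombination rate follows from simple arithmetic: one crossover per meiosis corresponds to 50 cM of genetic map, so an obligatory chiasma sets a floor on the average rate that is inversely proportional to physical length. The sketch below illustrates this; the function name and the chromosome lengths are hypothetical and are not taken from the slides.

```python
CM_PER_CROSSOVER = 50.0  # one crossover per meiosis = 50 cM of genetic map


def min_recombination_rate(length_mb, obligate_crossovers=1):
    """Lower bound on the average recombination rate, in cM per Mb.

    Hypothetical helper: assumes at least `obligate_crossovers`
    crossovers per chromosome per meiosis.
    """
    if length_mb <= 0:
        raise ValueError("chromosome length must be positive")
    return obligate_crossovers * CM_PER_CROSSOVER / length_mb


# Illustrative lengths only: a short and a long chromosome.
for name, length_mb in [("short chromosome", 50), ("long chromosome", 500)]:
    rate = min_recombination_rate(length_mb)
    print(f"{name} ({length_mb} Mb): at least {rate:.2f} cM/Mb")
```

Under this assumption a 50 Mb chromosome recombines at no less than 1 cM/Mb on average, while a 500 Mb chromosome can fall to 0.1 cM/Mb.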
Biased Gene Conversion during Recombination
Galtier, N. et al. Genetics 2001;159:907-911
Consequences of Recombination
Biased Gene Conversion
• High G + C
• High dS
In subtelomeres and X
chromosomes
Chicken chromosomes
Consequences of Recombination
• Increased selection efficiency
(disrupt linkage between
neighbouring mutations:
“Hill Robertson” effect)
• Most genes are under purifying
selection (dN/dS ~ 0.086)
• Highly recombining regions predicted
to have lower dN/dS
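As a reminder of how dN/dS is read, here is a minimal sketch. The function names and the tolerance around 1 are assumptions for illustration; the numbers are the Homo - Monodelphis values quoted in the slides (dN/dS of 0.086 and dS of 1.02).

```python
def dn_ds_ratio(dn, ds):
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive")
    return dn / ds


def selection_regime(ratio, tolerance=0.05):
    """Conventional reading of dN/dS; the tolerance is an arbitrary choice."""
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "neutral"


# Values from the slides: dN/dS = 0.086 and dS = 1.02 imply dN of about 0.088.
ds = 1.02
dn = 0.086 * ds
print(round(dn, 3), selection_regime(dn_ds_ratio(dn, ds)))
```

A ratio well below 1 means amino acid changes are removed much more often than silent changes, which is the pattern expected under purifying selection.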
Selection varies with G+C
Summary
X chromosome / Subtelomeric regions are:
• Highly recombining in Monodelphis
females
• Have high G+C
• High dS
• Increased purifying selection
• Short intron lengths
Consequences
• Some lineages have highly rearranged karyotypes
• Chromosome breakage highly correlated
• Rearrangements correlated with G+C content
• Gene function is not independent of
chromosomal location
• Many “high evolvers” may be under relaxed
selection
• Mutation rate variation has consequences for
finding disease genes
Are functionally-linked genes
genomic neighbours?
• Interacting Proteins: Very small effect mostly
arising from local gene duplication
• Co-expressed genes: Small effect mostly
arising from local gene duplication
• ‘Housekeeping genes’: Small effect with
unclear biological significance
• So function cannot be clearly ‘read off’
the genome
The future: Clade Genomics
EVOLUTIONARY ANALYSES ACROSS
ORTHOLOG CLADES?
Evolutionary rate analyses in clades
• Comparative genomics: two or few genomes
– highlights differences
• Clade genomics: sets of genomes
– highlights innovations
• Examples
– 12 flies
– 5 mammals + chicken
– 4 worms
Analysis pipeline
• Gene prediction
• Assignment of
orthology and
paralogy
• Rate analysis
Reference genome, Genome 1, Genome 2, Genome 3
Gene prediction
All-on-all BLAST
Pairwise orthology assignment
Clustering
Multiple alignment
Tree topology
Multiple orthology assignment
Rate analyses
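The slides do not say how the pairwise orthology step is implemented. As an illustration only, the sketch below uses reciprocal best hits (RBH), a common and much simpler stand-in that works on all-on-all BLAST scores. The gene names, scores and function names are hypothetical. RBH returns only 1:1 pairs, so it cannot capture the lineage-specific duplications (inparalogues) that the clustering and tree steps of the pipeline are there to resolve.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) to a similarity score, for example
    a BLAST bit score, for one search direction.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between species A (hs*) and species B (mm*).
a_vs_b = {("hsA", "mmA"): 200, ("hsA", "mmB"): 150,
          ("hsB", "mmB"): 180, ("hsC", "mmB"): 90}
b_vs_a = {("mmA", "hsA"): 200, ("mmB", "hsB"): 180, ("mmB", "hsA"): 150}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data hsC has a best hit in species B but is not that gene's best hit in return, so it is left unpaired.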
Duplication rate in flies shows
constant turnover
[Figure: cumulative frequency (%) against relative distance to tip/speciation event (%, relative to tree height), from recent (0) to ancient (100); series: all, lineage specific duplications, internal duplications]
Beware of naive comparisons of Gene
Duplications
Mouse has fewer recent duplications than rat?
[Figure: number of nodes (0 to 20) against KS distance from current time (0.00 to 0.30); series: mouse node, rat node, orthologue (Orth), ancestral node]
Extension to other clades
• 12 Flies
– D. melanogaster, D. simulans, D. sechellia, D.
yakuba, D. erecta, D. ananassae,
D. pseudoobscura, D. persimilis, D. willistoni, D.
grimshawi, D. mojavensis, D. virilis
• 6 Amniotes
– Human, mouse, dog, opossum, platypus, chicken
• 4 Nematodes
– C. elegans, C. briggsae, C. remanei, C. 2801
Extension to other clades
Nematodes
Drosophila
Amniotes
Orthology assignment
Species | Genes | Genes with orthologs | Orphaned genes
D. melanogaster* | 13836 | 13697 (99.0%) | 139 (1.0%)
D. simulans | 20101 | 18337 (91.2%) | 1764 (8.8%)
D. sechellia | 23482 | 21098 (89.8%) | 2384 (10.2%)
D. erecta | 19058 | 18273 (95.9%) | 785 (4.1%)
D. yakuba | 26099 | 24355 (93.3%) | 1744 (6.7%)
D. ananassae | 21143 | 18853 (89.2%) | 2290 (10.8%)
D. pseudoobscura | 15561 | 14494 (93.1%) | 1067 (6.9%)
D. persimilis | 19316 | 17227 (89.2%) | 2089 (10.8%)
D. willistoni | 19999 | 17161 (85.8%) | 2838 (14.2%)
D. virilis | 15030 | 14359 (95.5%) | 671 (4.5%)
D. mojavensis | 15055 | 14279 (94.8%) | 776 (5.2%)
D. grimshawi | 15266 | 14329 (93.9%) | 937 (6.1%)
C. elegans* | 20105 | 14079 (70.0%) | 6026 (30.0%)
C. remanei | 26567 | 15540 (58.5%) | 11027 (41.5%)
C. four | 34537 | 18428 (53.4%) | 16109 (46.6%)
C. briggsae | 22568 | 16424 (72.8%) | 6144 (27.2%)
H. sapiens* | 22810 | 19049 (83.5%) | 3761 (16.5%)
M. musculus* | 24442 | 20318 (83.1%) | 4124 (16.9%)
C. familiaris* | 19314 | 18316 (94.8%) | 998 (5.2%)
M. domestica* | 19597 | 17987 (91.8%) | 1610 (8.2%)
O. anatinus* | 18597 | 15753 (84.7%) | 2844 (15.3%)
G. gallus* | 16715 | 13834 (82.8%) | 2881 (17.2%)
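The columns of the orthology table are related by simple arithmetic: orphaned genes are the genes without an assigned orthologue, and both percentages are taken relative to the total gene count. The sketch below checks this for three rows copied from the slides; the function name is an assumption.

```python
# (species, total genes, genes with orthologs), copied from the slides.
ROWS = [
    ("D. melanogaster", 13836, 13697),
    ("C. remanei", 26567, 15540),
    ("H. sapiens", 22810, 19049),
]


def orphan_summary(genes, with_orthologs):
    """Return (orphaned count, orphaned %, with-ortholog %) to one decimal."""
    orphaned = genes - with_orthologs
    return (
        orphaned,
        round(100 * orphaned / genes, 1),
        round(100 * with_orthologs / genes, 1),
    )


for species, genes, with_orthologs in ROWS:
    orphaned, orphan_pct, ortholog_pct = orphan_summary(genes, with_orthologs)
    print(f"{species}: {orphaned} orphaned ({orphan_pct}%), "
          f"{ortholog_pct}% with orthologs")
```

The results match the slide values for these rows (139 and 1.0%, 11027 and 41.5%, 3761 and 16.5%).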
Lineage specific dN/dS reflects
population size?
[Figure: lineage specific dN/dS (axis 0.00 to 0.14) by species, grouped into Drosophila, Nematodes and Amniotes]
Population genomics
THE HUMAN GENOME OR
HUMAN GENOMES?
Most Human Duplications are Recent:
Some before the Chimp-Human Split, Most After
Polymorphisms?
Hominin-specific genes?
Unfixed genomic structural variants
explain population differences?
Differences in the number of copies of a gene
Copy Number / Structural Variation
Tuzun et al. Nature Genetics 2005
(Luckily!) humans have a relatively low
polymorphism rate
KAESSMANN, H. & PÄÄBO, S.
The genetical history of humans and the great apes.
Journal of Internal Medicine 251 (1), 1-18.
Structural variation and disease
• > 12% of the human genome is structurally
variable; structural variants represent more DNA than SNPs!
• More likely to be disease-associated than
SNPs
• Structural variants are often complex,
including changes in regulation
How can we find causative
differences?
Look at annotations / evolutionary history
(between species and in the population)
of corresponding genes (Orthology!)
How can we find causative
differences?
Over- /Under- representations of
– Disease (Rare and common alleles)
– GO annotations
– Pathways, protein-protein interactions
– Domain structure
– Sequence conservation / divergence:
indels, base changes (SNPs), rearrangements
– GC
– Tissue specificity
– Duplication history
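Over-representation of an annotation in a gene set is commonly tested with a one-sided hypergeometric test. The slides do not name the test used, so this is a sketch under that assumption, with hypothetical counts and a hypothetical function name.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, observed):
    """P(X >= observed) under the hypergeometric distribution.

    population: genes in the background set
    annotated:  background genes carrying the annotation (e.g. a GO term)
    sample:     genes in the set of interest
    observed:   genes in the set of interest carrying the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 3 annotated, 4 sampled, 2 hits.
print(enrichment_p_value(10, 3, 4, 2))
```

In practice the same test is run for many annotations at once, so the resulting p-values need a multiple-testing correction before any term is called over-represented.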
2 new genes associated with
coronary artery disease
The Wellcome Trust Case Control Consortium:
500,000 SNPs
7 diseases x 2000 samples per disease
3000 controls
[Figure: plot along chromosome 9 (position in Mb)]
USING ORTHOLOGY TO EXPLORE
DEVELOPMENTAL CHANGE
Exercise:
• Brain Anatomy presumably evolved step by
step
• How did neural anatomy evolve in vertebrates
/ mammals / primates?
• Which neural anatomical structures
correspond?
Exercise:
The future:
• Find brain genes in other species groups which
have shown apparent increase in “brain
power” (e.g. song birds, dolphins)
• Increases in brain capacity may involve
convergent evolution
• See if same trends also visible in our lineage
Exercise:
Current techniques:
• Homology by “inspection” (phenotypical comparisons of a few
anatomical traits)
• Marker genes which are shared in the same cell types across
species
The future:
• Sequence all the active genes (mRNA) in all
cell types across multiple species
• Use the patterns of all active genes (the
transcriptome)
FUTURE OF HUMAN /
MAMMALIAN GENOMICS
Other areas of genomic medicine
Cancer genomics
sequence the genome of cancer cells
• Some variants are associated with high morbidity
• A few genes are highly associated with increased risk of
cancers e.g. BRCA1
• Some variants may be associated with increased response
to chemotherapy
• However, apart from a few solid tumours, most cancer
cells appear to harbour a huge number of changes and
rearrangements
• It may be impossible to identify causative / facilitative /
therapeutic candidates
Other areas of genomic medicine
Pharmaceutical genomics
• Differential drug efficacy
Response to treatment varies within the
population
e.g. 15% of breast cancers have copy number amplification of HER2
and are thus candidates for Herceptin
• Differential side-effects
e.g. 1 in 300 patients has a lethal hematopoietic adverse response to
mercaptopurine for acute lymphoblastic leukemia, linked to
mutations in thiopurine S-methyltransferase
• Differential prognosis
Carriers of a CCR5 mutation are either HIV-1 resistant or have much slower
progression to AIDS
Coming changes
• Major reduction in cost: 100-1000x
• Major increase in throughput
$100k per mammalian genome to
$1,000 per resequenced human
• Large scale studies to gather phenotypic
differences and associate with genomic
variation
• Most labs will start doing some sort of
genomics
What is the future of genomics?
• We will be awash with data and genomic
variations
• Which of these variations correspond to:
(a) disease causation?
(b) natural phenotypic differences?
Must use evolutionary signal.
For protein coding genes, that requires
constructing family trees:
orthology will continue to be central