Transcript Slide 1
Orthology predictions for whole mammalian genomes
Leo Goodstadt
MRC Functional Genomics Unit
Oxford University

Mammalian Genomes
How does our genome, and how do our genes, differ from those of other mammals and other vertebrates?

Great Expectations
So why is it taking so long to understand a simple genome?
1. How much? We did not appreciate how much functional sequence there would be.
2. Species-specific genes? We did not appreciate how hard it would be to ‘read off’ functions from the human genome.
3. Human genomes. We had no idea that individual human genomes can differ so much!

How do we find function in the genome?
• Nothing in Biology Makes Sense Except in the Light of Evolution. Theodosius Dobzhansky (1900-1975).

The dawn of mammalian comparative genomics
SANITY CHECKS FOR ALL MAMMALIAN PROJECTS: LESSONS FROM THE MOUSE GENOME (2002)

Domain-regions are more conserved
Because they are under higher purifying selection

Secreted proteins evolve faster

Rapid duplicators are rapid evolvers

Higher purifying pressures in enzymes

Mouse-Human Orthologues % Identity
• sites not in domains: 64.4%
• cSNP sites: 67.1%
• all sites: 70.1%
• sites in domains: 88.9%
• disease sites: 90.3%

WHICH GENES HAVE LINEAGE SPECIFIC DUPLICATES?

Large number of lineage specific duplications
10 – 20% of genes are lineage specific depending on comparisons

20% of human genes have been duplicated or do not have a rodent orthologue
Family trees for genes:
• Human specific genes missing from mouse (8%). In many cases, more distantly related mouse genes (homologues) can be found.
• 1 to 1 (80%)
• Gene families shared with mouse but which have expanded in human (9%)
• Shared Orthologues (present as a single gene in the common ancestor to human and mouse)

Where do new genes come from?
• De novo (from non-coding)
• Rapid sequence change
• Gene duplication
M. Lynch and A. Force, The probability of duplicate gene preservation by subfunctionalisation.
Genetics 154 (2000), pp. 459–473
• Pseudogenisation
• Missing: Horizontal Gene transfer

Inparalogues
• Chemosensation (OR, V1R and V2R)
• Reproduction (Vomeronasal Receptors, lipocalins, β-microseminoprotein (12:1))
• Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor chains and carcinoembryonic antigen-related cell adhesion molecules)
• pancreatic RNAses
• Detoxification (hypoxanthine phosphoribosyltransferase homologues, nitrogen poor diets)
• KRAB ZnFingers

Reproduction
Clusters                                      No. in cluster   Function
Odorant binding proteins / aphrodisin         8                Aphrodisiac hormone.
Hydroxysteroid dehydrogenase                  7                Biosynthesis of hormonal steroids.
Class CYP4A Cytochromes P450                  7                Oxidation of compounds.
Seminal vesicle-antigen (SVA)                 4                Suppression of spermatozoa motility.
Submandibular gland secretory proteins        9                Expression is androgen-dependent.
Obox, homeobox proteins                       6                Homeobox proteins.
Androgen-binding protein-α                    9                Mate selection.
Prolactin related proteins                    17               Placentation.
Cathepsin J-like enzymes                      6                Placentation.
Cystatins / Stefins                           7                Placentation.
HOX cluster                                   8                Placentation.
Class CYP2D Cytochromes P450                  5                Regulated by androgens.
MHC class I                                   8                Immunity / Mate selection?
Elafin, eppin, and antileukoproteinase 1      7                Anti-microbial.
Beta-defensin proteins (x 2 clusters)         5/5              Anti-microbial.
Eosinophil-associated ribonuclease            11               Pathogen response.

Weaker purifying selection for duplicate genes

Rapid evolvers in protein coding genes

KRAB Zn Fingers

TOXIN DEGRADATION

Hypothesis: Darwinian evolution
Competition:
• Inter-specific (pathogens, predators)
• Intra-specific
– mating
– sub-speciation / kin-selection
– gender conflict
– clonal expansions in sperm

Immunity genes evolve the fastest

Rapidly-changing developmental or transcriptional regulatory genes?
• KRAB-zinc finger genes
• Cancer-testis antigen genes (e.g. PRAMEs)
• Regulate chromatin structure and therefore the timing of transcription.
Detecting biological signals among inparalogues
Correlations with known annotations
• Biological Annotations (gene descriptions / Gene Ontology)
• Tissue specificity
• Comparative changes across lineages (dating)
• Chromosomal Distribution
• Positive selection
• Genomic environment

Different genes duplicate at different times
Leo Goodstadt et al. Genome Res. 2007; 17: 969-981

Look for differential evolution
GO-analysis of innovations

Trends - Functions
Human - Chicken, CGSC (2005)

Trends - Tissues
Chicken - Human, CGSC (2005)

Exploring rapid evolution with protein structure
GENE FAMILIES

Independent expansions in the PRAME gene family

Positive selection: PRAME genes
Amino acid sites under positive selection in human (red), mouse (blue) and rat (purple) [or multiple species (yellow)] PRAME genes.

Gene Duplication Remodels Genome
Androgen-binding proteins: produced by Sertoli cells in testes seminiferous tubules
Emes et al. (2004) Genome Res. 14(8):1516-29

Lipocalins: Mouse Major Urinary Proteins, Rat α2u-globulin genes
Sites subject to positive selection

VR2 olfactory receptor N-terminal domain: sites in dark blue, ligand (glutamate) in pink (other monomer)

MHC class 1b, M10s: sites in blue, peptide ligand in MHC structure in green

Finding disease candidates within model organisms
ORTHOLOGY AND DISEASE

Few Mendelian disease genes lack mouse orthologues
– Kallmann syndrome gene: C. elegans orthologue.
– CETP (cholesteryl ester transfer protein): Rabbit and Hamster
– Glycophorin E: Primate specific, MN and Ss blood types

Mouse equivalents of human disease variants
Hs normal:  MAETLFWTPLLVVLLAGLGDTEAQQTTLHPLVGRVFVHTLDHETFLSLPEHVAVPPAVHI
Hs variant: MAETLFWTPLLVVLLAGLGDTEAQQTTLHLLVGRVFVHTLDHETFLSLPEHVAVPPAVHI
Mm normal:  MAAAVTWIPLLAGLLAGLRDTKAQQTTLHLLVGRVFVHPLEHATFLRLPEHVAVPPTVRL
Nick Dickens & Jörg Schultz

Hirschsprung disease (142623) E251K
Leukoencephalopathy with vanishing white matter (603896) R113H
Mucopolysaccharidosis type IVA (253000) R376Q
Breast cancer (113705) L892S
Breast cancer (600185) V211A, Q2421H
Parkinson disease (601508) A53T
Tuberous sclerosis (605284) Q654E
Bardet-Biedl syndrome, type 6 (209900) T57A
Mesothelioma (156240) N93S
Long QT syndrome 5 (176261) V109I
Cystic fibrosis (602421) F87L, V754M
Porphyria variegata (176200) Q127H
Non-Hodgkin's lymphoma (605027) A25T, P183L
Severe combined immunodeficiency disease (102700) R142Q
Limb-girdle muscular dystrophy type 2D (254110) P30L
LCAD deficiency (201460) Q333K
Usher syndrome type 1B (276902) G955S
Chronic nonspherocytic hemolytic anemia (206400) A295V
Mantle cell lymphoma (in 208900) N750K
Becker muscular dystrophy (300377) H2921R
Complete Androgen Insensitivity syndrome (300068) G491S
Prostate cancer (176807) P269S, S647N
Crohn's disease (266600) W157R

Disease mutations do not always lead to pathological phenotypes in mouse!
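The comparison behind the three sequences above can be sketched in a few lines of Python. This is only an illustration, not the method used by the authors: it assumes the three sequences are already aligned, gap-free and of equal length, and the function name and category labels are invented for the sketch.

```python
# Sequences copied from the "Mouse equivalents of human disease variants" slide.
HS_NORMAL = "MAETLFWTPLLVVLLAGLGDTEAQQTTLHPLVGRVFVHTLDHETFLSLPEHVAVPPAVHI"
HS_VARIANT = "MAETLFWTPLLVVLLAGLGDTEAQQTTLHLLVGRVFVHTLDHETFLSLPEHVAVPPAVHI"
MM_NORMAL = "MAAAVTWIPLLAGLLAGLRDTKAQQTTLHLLVGRVFVHPLEHATFLRLPEHVAVPPTVRL"


def classify_mouse_residues(hs_normal, hs_variant, mm_normal):
    """Classify the mouse residue at each human variant position.

    Assumption for this sketch: the sequences are pre-aligned, gap-free and
    of equal length. Returns (variant_name, mouse_residue, category) tuples,
    with 1-based positions in the variant name.
    """
    if not (len(hs_normal) == len(hs_variant) == len(mm_normal)):
        raise ValueError("sequences must be pre-aligned to the same length")
    results = []
    columns = zip(hs_normal, hs_variant, mm_normal)
    for pos, (wt, var, mouse) in enumerate(columns, start=1):
        if wt == var:
            continue  # not a variant position
        if mouse == wt:
            category = "mouse = human wild-type"
        elif mouse == var:
            category = "mouse = human disease"
        else:
            category = "mouse differs from both"
        results.append((f"{wt}{pos}{var}", mouse, category))
    return results


calls = classify_mouse_residues(HS_NORMAL, HS_VARIANT, MM_NORMAL)
print(calls)  # [('P30L', 'L', 'mouse = human disease')]
```

For the slide's example the only human difference is P30L, and the normal mouse sequence already carries the leucine found in the human disease variant.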
7293 SwissProt disease-associated variants
• 90.3% mouse residue = human wild-type residue
• 7.5% mouse residue ≠ human wild-type residue
• 2.2% mouse residue = human disease residue

Genomes are not a bag of genes
GENOMIC CONTEXT IS IMPORTANT: LESSONS FROM THE MONODELPHIS

Mutation rate is higher on the X chromosome

Human X has separate synteny with MDOX/4/7

Only Marsupial X shows an increase in dS

Comparisons with a third genome
• Australian marsupial silver-gray brushtail possum Trichosurus vulpecula
• 8,237 orthologues from 111,634 ESTs
• More closely related to Monodelphis
Median dS:
• Monodelphis - Trichosurus: 0.26
• Homo - Monodelphis: 1.02

X chromosome increased mutation rate is marsupial specific

Homo - Monodelphis 1:1 orthologues
• dN/dS: 0.086
• dS: 1.02
• Amino acid sequence identity: 81.0%
• Pairwise alignment coverage: 94.2%

                                   Homo sapiens   Monodelphis domestica
Number of exons                    9              9
Sequence length (codons)           471            445
Unspliced transcript length (bp)   27,241         25,365
G+C content at 4D sites            56.9%          48.7%

Higher G+C in Monodelphis X
Increased G+C
Lower G+C in Homo X
Decreased G+C

Mutation rate varies with G+C

Canis has many small chromosomes
Monodelphis has few, big chromosomes

Relative chromosome sizes in Monodelphis
[Figure: Monodelphis chromosomes 1 to 8 and X; labels: Monodelphis 1, X, Homo]

Variations in Female recombination rate
• Telomeric ends are highly recombining
– Short chromosomes have proportionally more subtelomeric sequence
– Long chromosomes have proportionally more interstitial sequence
• Obligatory chiasma per chromosome

Biased Gene Conversion during Recombination
Galtier, N. et al.
Genetics 2001;159:907-911

Consequences of Recombination
Biased Gene Conversion
• High G+C
• High dS
In subtelomeres and X chromosomes

[Figure panels: Chromosomes; Chicken chromosomes]

Consequences of Recombination
• Increased selection efficiency (disrupt linkage between neighbouring mutations: “Hill Robertson” effect)
• Most genes are under purifying selection (dN/dS ~ 0.086)
• Highly recombining regions predicted to have lower dN/dS

Selection varies with G+C

Summary
X chromosome / Subtelomeric regions:
• Are highly recombining in Monodelphis females
• Have high G+C
• Have high dS
• Show increased purifying selection
• Have short intron lengths

Consequences
• Some lineages have highly rearranged karyotypes
• Chromosome breakage highly correlated
• Rearrangements correlated with G+C content
• Gene function is not independent of chromosomal location
• Many “high evolvers” may be under relaxed selection
• Mutation rate variation has consequences for finding disease genes

Are functionally-linked genes genomic neighbours?
• Interacting Proteins: Very small effect mostly arising from local gene duplication
• Co-expressed genes: Small effect mostly arising from local gene duplication
• ‘Housekeeping genes’: Small effect with unclear biological significance
• So, function cannot be clearly ‘read off’ the genome

The future: Clade Genomics
EVOLUTIONARY ANALYSES ACROSS ORTHOLOG CLADES?
Evolutionary rate analyses in clades
• Comparative genomics: two or few genomes – highlights differences
• Clade genomics: sets of genomes – highlights innovations
• Examples
– 12 flies
– 5 mammals + chicken
– 4 worms

Analysis pipeline
• Gene prediction
• Assignment of orthology and paralogy
• Rate analysis
[Flowchart: Genome1, Reference Genome, Genome2, Genome3 → Gene prediction → All-on-all BLAST → Pairwise orthology assignment → Clustering → Multiple alignment → Tree topology → Multiple orthology assignment → Rate analyses]

Duplication rate in flies shows constant turnover
[Plot: cumulative frequency (%) against relative distance to tip/speciation event (%), from recent to ancient, relative to tree height; series: all lineage specific duplications, internal duplications]

Beware of naive comparisons of Gene Duplications
Mouse has fewer recent duplications than rat?
[Plot: number of nodes against KS distance from current time; legend: Mouse node, Rat node, Orth Ancestral node]

Extension to other clades
• 12 Flies
– D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. grimshawi, D. mojavensis, D. virilis
• 6 Amniotes
– Human, mouse, dog, opossum, platypus, chicken
• 4 Nematodes
– C. elegans, C. briggsae, C. remanei, C. 2801

Extension to other clades
[Figure panels: Nematodes; Drosophila; Amniotes]

Orthology assignment
Species            Genes    Genes with orthologs   Orphaned genes
D. melanogaster*   13836    13697  99.0%           139    1.0%
D. simulans        20101    18337  91.2%           1764   8.8%
D. sechellia       23482    21098  89.8%           2384   10.2%
D. erecta          19058    18273  95.9%           785    4.1%
D. yakuba          26099    24355  93.3%           1744   6.7%
D. ananassae       21143    18853  89.2%           2290   10.8%
D. pseudoobscura   15561    14494  93.1%           1067   6.9%
D. persimilis      19316    17227  89.2%           2089   10.8%
D. willistoni      19999    17161  85.8%           2838   14.2%
D. virilis         15030    14359  95.5%           671    4.5%
D. mojavensis      15055    14279  94.8%           776    5.2%
D. grimshawi       15266    14329  93.9%           937    6.1%
C. elegans*        20105    14079  70.0%           6026   30.0%
C. remanei         26567    15540  58.5%           11027  41.5%
C. four            34537    18428  53.4%           16109  46.6%
C. briggsae        22568    16424  72.8%           6144   27.2%
H. sapiens*        22810    19049  83.5%           3761   16.5%
M. musculus*       24442    20318  83.1%           4124   16.9%
C. familiaris*     19314    18316  94.8%           998    5.2%
M. domestica*      19597    17987  91.8%           1610   8.2%
O. anatinus*       18597    15753  84.7%           2844   15.3%
G. gallus*         16715    13834  82.8%           2881   17.2%

Lineage specific dN/dS reflects population size?
[Plot: lineage specific dN/dS (0.00 to 0.14) by species; groups: Drosophila, Nematodes, Amniotes]

Population genomics
THE HUMAN GENOME OR HUMAN GENOMES?

Most Human Duplications are Recent: Some before the Chimp-Human Split, Most After
Polymorphisms? Hominin-specific genes?

Unfixed genomic structural variants explain population differences?

Differences in the number of copies of a gene
Copy Number / Structural Variation
Tuzun et al. Nature Genetics 2005

(Luckily!) humans have a relatively low polymorphism rate
KAESSMANN, H. & PÄÄBO, S. The genetical history of humans and the great apes. Journal of Internal Medicine 251 (1), 1-18.

Structural variation and disease
• > 12% of the human genome is structurally variable: structural variants represent more DNA than SNPs!
• more likely to be disease-associated than SNPs
• Structural variants are often complex, including changes in regulation

How can we find causative differences?
Look at annotations / evolutionary history (between species and in the population) of corresponding genes (Orthology!)

How can we find causative differences?
Over- / Under-representations of
– Disease (Rare and common alleles)
– GO annotations
– Pathways, protein-protein interactions
– Domain structure
– Sequence conservation / divergence: indels, base changes (SNPs), rearrangements
– GC
– Tissue specificity
– Duplication history

2 new genes associated with Coronary artery disease
The Wellcome Trust Case Control Consortium: 500,000 SNPs, 7 x 2000 samples per disease, 3000 controls
[Plot: x-axis Chromosome 9 (Mb)]

USING ORTHOLOGY TO EXPLORE DEVELOPMENTAL CHANGE

Exercise:
• Brain Anatomy presumably evolved step by step
• How did neural anatomy evolve in vertebrates / mammals / primates?
• Which neural anatomical structures correspond?

Exercise:
The future:
• Find brain genes in other species groups which have shown an apparent increase in “brain power” (e.g. song birds, dolphins)
• Increases in brain capacity may involve convergent evolution
• See if the same trends are also visible in our lineage

Exercise:
Current techniques:
• Homology by “inspection” (phenotypical comparisons of a few anatomical traits)
• Marker genes which are shared in the same cell types across species
The future:
• Sequence all the active genes (mRNA) in all cell types across multiple species
• Use the patterns of all active genes (the transcriptome)

FUTURE OF HUMAN / MAMMALIAN GENOMICS

Other areas of genomic medicine
Cancer genomics: sequence the genome of cancer cells
• Some variants are associated with high morbidity
• A few genes are highly associated with increased risk of cancers, e.g. BRCA1
• Some variants may be associated with increased response to chemotherapy
• However, apart from a few solid tumours, most cancer cells appear to harbour a huge number of changes and rearrangements
• It may be impossible to identify causative / facilitative / therapeutic candidates

Other areas of genomic medicine
Pharmaceutical genomics
• Differential drug efficacy
Response to treatment varies within the population, e.g.
15% of breast cancers have copy number amplification of HER2 and are thus candidates for Herceptin
• Differential side-effects
e.g. 1 in 300 patients have a lethal, hematopoietic adverse response to mercaptopurine for acute lymphoblastic leukemia, linked to mutations in thiopurine S-methyltransferase
• Differential prognosis
Carriers of the CCR5 mutation are either HIV-1 resistant or have much slower progression of AIDS

Coming changes
• Major reduction in cost: 100-1000x
• Major increase in throughput
$100k per mammalian genome to $1,000 per resequenced human
• Large scale studies to gather phenotypic differences and associate with genomic variation
• Most labs will start doing some sort of genomics

What is the future of genomics?
• We will be awash with data and genomic variations
• Which of these variations correspond to:
(a) disease causation?
(b) natural phenotypic differences?
Must use evolutionary signal. For protein coding genes, that requires constructing family trees: orthology will continue to be central
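
To make the "pairwise orthology assignment" step of the analysis pipeline more concrete, here is a minimal Python sketch of one common heuristic, reciprocal best hits, applied to all-on-all similarity scores. The slides do not say which method was actually used, so this is an assumption made for illustration; the gene names and scores are invented.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    scores_ab maps each gene in genome A to {gene in genome B: score};
    scores_ba is the same for genome B against genome A.
    Higher scores mean better hits (e.g. BLAST bit scores).
    """
    def best(hits):
        # Gene with the highest score, or None if there are no hits.
        return max(hits, key=hits.get) if hits else None

    best_ab = {a: best(hits) for a, hits in scores_ab.items()}
    best_ba = {b: best(hits) for b, hits in scores_ba.items()}
    return sorted(
        (a, b) for a, b in best_ab.items()
        if b is not None and best_ba.get(b) == a
    )


# Hypothetical scores, for illustration only.
human_vs_mouse = {
    "hsA": {"mmA": 900.0, "mmB": 310.0},
    "hsB": {"mmB": 750.0, "mmA": 290.0},
    "hsC": {"mmB": 400.0},  # best hit is not reciprocated
}
mouse_vs_human = {
    "mmA": {"hsA": 880.0, "hsB": 300.0},
    "mmB": {"hsB": 760.0, "hsC": 410.0, "hsA": 305.0},
}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)  # [('hsA', 'mmA'), ('hsB', 'mmB')]
```

A gene such as the hypothetical hsC, whose best hit is not reciprocated, is left unpaired here. Resolving such cases into one-to-many relationships (inparalogues) is what the later clustering and tree-building steps of the pipeline are for.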