Skokie - Claude Bernard University Lyon 1


Orthology Analysis

Erik Sonnhammer
Center for Genomics and Bioinformatics
Karolinska Institutet, Stockholm

Outline
• Basic concepts
• BLAST-based approaches to orthology
• Tree-based approaches to orthology
• Domain-level orthology

Homologs = genes with a common origin
• May be genes in the same or in different organisms
• Does not say that function is identical
• Can only be true or false, and not a percentage!
• Homologs have the same 3D-structure layout

[Figure: homologs divide into orthologs and paralogs]

Orthologs: separated by speciation

[Figure: gene tree drawn against time, with branch points marked S (speciation) or D (duplication). Nodes: Gene X in ancient animal; Gene X in ancient mammal; Gene Y in ancient mammal; Gene X in human; Gene X in rat; Gene Y1 in human; Gene Y2 in human; Gene Y in rat]

In/Out-paralog definition

In-paralogs ~ co-orthologs
paralogs that were duplicated after the speciation and hence are orthologs to a cluster in the other species

Out-paralogs = not co-orthologs
paralogs that were duplicated before the speciation. Not necessarily in the same species.

Sonnhammer & Koonin, Trends Genet. 18:619-620 (2002)

Orthologs for functional genomics
• Co-orthologs / inparalogs are more likely than outparalogs to have identical biochemical functions and biological roles.
• Co-orthologs can be used to discover human gene function via model organism experiments
• Co-orthologs are key to exploiting functional genomics/proteomics data in model organisms

Orthology and function conservation
• Orthology does not say anything about evolutionary distance.
• Close orthologs, e.g. human-mouse, are very likely to have the same biological role in the organism.
• Distant orthologs, e.g. human-worm, are less likely to have the same phenotypical role, but may have the same role in the corresponding pathway.

Ortholog Databases

Sequence database     Orthology detection method   Ortholog database
SwTrembl proteomes    Inparanoid (blast)           Inparanoid
proteomes             COGs (blast)                 COGs / KOGs
TIGR gene index       COGs (blast)                 TOGA/EGO
proteomes             OrthoMCL (blast)             OrthoMCL
Pfam                  Orthostrapper (tree)         HOPS
Pfam                  RIO (tree)

How to find orthologs?

1. Calculate phylogenetic tree, look for orthologs in the tree (Orthostrapper, RIO).
2. Two-way best matches between two species can be used to find orthologs without trees.
   [However, in-paralogs are harder to find this way]
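The two-way best match idea in point 2 can be sketched in a few lines of Python. This is a minimal illustration, not the Inparanoid or COGs implementation: the hit lists, gene names and scores below are made up, and a real run would parse them from all-against-all BLAST output.

```python
from typing import Dict, List, Tuple

# One hit = (query, subject, score). Names and scores here are hypothetical toy data.
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map every query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def two_way_best_matches(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example: hY1 and hY2 are human in-paralogs that both hit worm gene wY.
human_vs_worm = [("hX", "wX", 500.0), ("hX", "wY", 120.0),
                 ("hY1", "wY", 400.0), ("hY2", "wY", 380.0)]
worm_vs_human = [("wX", "hX", 500.0), ("wY", "hY1", 400.0), ("wY", "hY2", 380.0)]

pairs = two_way_best_matches(human_vs_worm, worm_vs_human)
print(pairs)  # [('hX', 'wX'), ('hY1', 'wY')]
```

In the toy data hY2 is dropped even though it is nearly as close to wY as hY1, which is the in-paralog problem noted in brackets above.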

Two-way best match approach to finding orthologs

[Figure: COG2813, showing orthologs and out-paralogs]

COGs
[Figure legend: blue = species 1, red = species 2]

Inparanoid
[Figure legend: blue = species 1, red = species 2]

Resolve overlapping clusters
• No overlap: no problems
• Partial overlap: separate
• Complete overlap: merge

Inparalog score

[Figure: scale from 0 to 100% running from B to A, with the inparalog candidate P placed on it]

Score for inparalog P = (scoreAP - scoreAB) / (scoreAA - scoreAB)
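The inparalog score formula can be checked with a small Python function. The argument names are hypothetical, and the interpretation of A, B and P (main ortholog, its ortholog in the other species, and the candidate inparalog) is an assumption read off the scale on the slide.

```python
def inparalog_score(score_ap: float, score_ab: float, score_aa: float) -> float:
    """Place candidate P on a 0..1 scale between B (0) and A itself (1).

    score_ap: similarity score of A against candidate inparalog P
    score_ab: similarity score of A against its ortholog B in the other species
    score_aa: similarity score of A against itself
    """
    if score_aa == score_ab:
        raise ValueError("scoreAA equals scoreAB; the scale has zero length")
    return (score_ap - score_ab) / (score_aa - score_ab)


# Made-up scores: P sits halfway between B and A.
print(inparalog_score(score_ap=700.0, score_ab=400.0, score_aa=1000.0))  # 0.5
```

A candidate that scores no better against A than B does gets 0, and a sequence identical to A gets 1, matching the 0 to 100% axis.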

Confidence values for main orthologs from sampling

TVHIVDDEEPVR---KSLAFM---LTMNGFA
T+ ++DD   +R   K L  M   +T+ G A
TILLIDDHPMLRTGVKQLISMAPDITVVGEA

Sampling with replacement; insertions kept intact

GAFDEP---LVTHVR..........
GA +     ++T +R
GAEEHMAPDILTLLR..........

“Bootstrap alignment” -> “bootstrap score”

Confidence = (nr of bootstrap alignments with best-best matches) / (nr of bootstraps)
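The sampling scheme can be sketched as follows. This is only an illustration under stated assumptions: the scoring function is a toy (identity +1, gap column -1), the rival alignment is invented, and "best-best match" is reduced to "the main pair still outscores one rival". The slide does not give those details.

```python
import random
from typing import List, Tuple

Column = Tuple[str, str]  # one aligned position: (query residue, subject residue)


def split_units(query: str, subject: str) -> List[List[Column]]:
    """Cut an alignment into sampling units: single columns, or whole gap runs."""
    units: List[List[Column]] = []
    previous_was_gap = False
    for q, s in zip(query, subject):
        is_gap = q == "-" or s == "-"
        if is_gap and previous_was_gap:
            units[-1].append((q, s))  # keep the insertion intact
        else:
            units.append([(q, s)])
        previous_was_gap = is_gap
    return units


def bootstrap_alignment(query: str, subject: str, rng: random.Random) -> Tuple[str, str]:
    """Sample units with replacement, as many units as the original alignment has."""
    units = split_units(query, subject)
    columns = [col for _ in units for col in rng.choice(units)]
    return "".join(q for q, _ in columns), "".join(s for _, s in columns)


def toy_score(query: str, subject: str) -> int:
    """Toy score (assumption): +1 per identity, -1 per gap column, 0 otherwise."""
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score -= 1
        elif q == s:
            score += 1
    return score


def confidence(main: Tuple[str, str], rival: Tuple[str, str],
               n_boot: int = 1000, seed: int = 0) -> float:
    """Fraction of bootstraps in which the main pair still beats the rival pair."""
    rng = random.Random(seed)
    wins = sum(
        toy_score(*bootstrap_alignment(*main, rng)) > toy_score(*bootstrap_alignment(*rival, rng))
        for _ in range(n_boot)
    )
    return wins / n_boot


# Main pair: the alignment shown on the slide. Rival pair: hypothetical, no identities.
main = ("TVHIVDDEEPVR---KSLAFM---LTMNGFA", "TILLIDDHPMLRTGVKQLISMAPDITVVGEA")
rival = ("TVHIVDDEEPVRKSLAFMLTMNGFA", "GSWQNPAKKLYCHWERYQACPSRWM")
print(confidence(main, rival))
```

Treating each gap run as one unit is what keeps insertions intact: a resampled alignment can repeat or omit a whole insertion but never splits one.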

http://inparanoid.cgb.ki.se

Homo sapiens vs. C. elegans

Remm et al., J. Mol. Biol. 314:1041-1052 (2001)

Ortholog group sizes, human vs X

Version 2.5: 08-apr-03
151360 sequences from Swissprot-TREMBL
• 44996 sequences from Homo sapiens
• 26674 sequences from Mus musculus
• 20316 sequences from Drosophila melanogaster
• 20997 sequences from Caenorhabditis elegans
• 36751 sequences from Arabidopsis thaliana
• 6910 sequences from Saccharomyces cerevisiae
• 8709 sequences from Escherichia coli

Columns:
(a) Number of orthologs (orthologous groups) in H.sapiens
(b) Number of sequences (inparalogs) from H.sapiens in orthologous groups
(c) Number of sequences (inparalogs) from this species in orthologous groups

Species              (a)     (b)     (c)
M.musculus         12458   19532   17055
D.melanogaster      5549   15259    9854
C.elegans           4541   14222    6537
A.thaliana          3258   10863   12178
S.cerevisiae        2175    7265    2751
E.coli               599    2144    1037

Nr of inparalogs per ortholog group

Columns:
(a) Avg. inparalogs in model organism ortholog groups
(b) Avg. inparalogs in human ortholog groups

Species          (a)    (b)
Mouse           1.36   1.56
Fly             1.77   2.75
Worm            1.44   3.13
Mustard weed    3.73   3.33
Yeast           1.26   3.34
E. coli         1.73   3.57
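These averages appear to be the sequence counts from the ortholog group size table divided by the number of ortholog groups. The slides do not state this, so the short Python check below treats it as an assumption; the variable names are made up, and all numbers are copied from the two slides.

```python
# species: (ortholog groups, human sequences in groups, model organism sequences in groups)
group_sizes = {
    "Mouse": (12458, 19532, 17055),
    "Fly": (5549, 15259, 9854),
    "Worm": (4541, 14222, 6537),
    "Mustard weed": (3258, 10863, 12178),
    "Yeast": (2175, 7265, 2751),
    "E. coli": (599, 2144, 1037),
}

# species: (avg. inparalogs in model organism groups, avg. inparalogs in human groups)
slide_averages = {
    "Mouse": (1.36, 1.56),
    "Fly": (1.77, 2.75),
    "Worm": (1.44, 3.13),
    "Mustard weed": (3.73, 3.33),
    "Yeast": (1.26, 3.34),
    "E. coli": (1.73, 3.57),
}

computed = {
    species: (model_seqs / groups, human_seqs / groups)
    for species, (groups, human_seqs, model_seqs) in group_sizes.items()
}

for species, (model_avg, human_avg) in computed.items():
    print(f"{species:13s} model organism {model_avg:.2f}   human {human_avg:.2f}")
```

Under this assumption every computed ratio lands within 0.01 of the value on the slide; the small differences look like truncation rather than rounding in some cells.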

Drawbacks of Blast-based orthology assignment
• No guarantee that the same segment is used in different sequences
• No evolutionary distance model
• Does not take multiple domains into account

Domain orthology
• Inparanoid Human-Fly ortholog pairs with domains in Pfam-A 13.0: 20335
• Different domain architectures: 5411
  – Many of these are minor differences, e.g. 22 vs 21 Spectrin repeats
  – Sometimes the difference is big:
    [Figure: domain architectures with ef-hand, TBC, UCH and UCH domains]

Tree-based approaches

Distance-based tree building

A1 MKFYSLPNFPEN
A2 MKYYKLPDLPDE
A3 MRFYTACENPRS

Distance matrix

      A2   A3
A1     4    8
A2         10

[Figure: tree of A1, A2 and A3 with branch lengths 2, 1, 5 and 3]

• Bootstrapping:
  – randomly pick columns to bootstrap alignment, calculate tree
  – Repeat 1000 times, frequency of node = bootstrap support
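The bootstrapping recipe can be sketched in Python on the three sequences from the slide. Two simplifications are assumptions, not the slide's method: distances are plain mismatch counts (which give 6, 8 and 9 rather than the 4, 8 and 10 in the slide's matrix, so the slide evidently uses a different distance measure), and "calculate tree" is reduced to asking which pair is strictly closest, which is all a three-sequence tree can tell us.

```python
import random
from itertools import combinations
from typing import Dict, Tuple

Alignment = Dict[str, str]
Pair = Tuple[str, str]

alignment: Alignment = {"A1": "MKFYSLPNFPEN", "A2": "MKYYKLPDLPDE", "A3": "MRFYTACENPRS"}


def distance_matrix(aln: Alignment) -> Dict[Pair, int]:
    """Pairwise mismatch counts (assumed distance measure)."""
    return {
        (p, q): sum(x != y for x, y in zip(aln[p], aln[q]))
        for p, q in combinations(sorted(aln), 2)
    }


def supports_pair(aln: Alignment, pair: Pair) -> bool:
    """True if `pair` is strictly closer than every other pair (ties do not count)."""
    d = distance_matrix(aln)
    return all(d[pair] < dist for other, dist in d.items() if other != pair)


def bootstrap_support(aln: Alignment, pair: Pair, n_boot: int = 1000, seed: int = 0) -> float:
    """Resample alignment columns with replacement and count how often `pair` groups."""
    rng = random.Random(seed)
    length = len(next(iter(aln.values())))
    hits = 0
    for _ in range(n_boot):
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}
        hits += supports_pair(resampled, pair)
    return hits / n_boot


print(distance_matrix(alignment))
print(bootstrap_support(alignment, ("A1", "A2")))
```

With more than three sequences the inner step would build a full tree (for example by neighbor joining) and record every node, but the resampling loop stays the same.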

Orthology by tree reconciliation

[Figure: species tree and gene tree]

Infer 2 duplications and 2 losses
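The figure for this slide is not reproduced here, so the sketch below reconciles a different, smaller example: the human/rat genes X, Y, Y1 and Y2 from the speciation diagram earlier in the talk. It uses the standard LCA mapping to count duplications only; counting losses is left out. The tree encoding (nested tuples, leaves written as "gene|species") is an assumption made for illustration.

```python
from typing import FrozenSet, List, Tuple

# Trees are nested 2-tuples. Species-tree leaves are species names;
# gene-tree leaves are "gene|species" strings (hypothetical encoding).


def clades(tree) -> Tuple[FrozenSet[str], List[FrozenSet[str]]]:
    """Return (species under this node, species sets of all nodes in the subtree)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    left_set, left_all = clades(tree[0])
    right_set, right_all = clades(tree[1])
    both = left_set | right_set
    return both, left_all + right_all + [both]


def lca(species: FrozenSet[str], all_clades: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in all_clades if species <= c), key=len)


def count_duplications(gene_tree, species_tree) -> int:
    """A gene-tree node is a duplication if it maps to the same clade as a child."""
    _, all_clades = clades(species_tree)
    duplications = 0

    def visit(node) -> FrozenSet[str]:
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([node.split("|")[1]]), all_clades)
        mapped_children = [visit(child) for child in node]
        mapped = lca(mapped_children[0] | mapped_children[1], all_clades)
        if mapped in mapped_children:
            duplications += 1
        return mapped

    visit(gene_tree)
    return duplications


species_tree = ("human", "rat")
gene_tree = (("X|human", "X|rat"), (("Y1|human", "Y2|human"), "Y|rat"))
print(count_duplications(gene_tree, species_tree))  # 2
```

The two duplications found are the X/Y split at the root and the human-specific Y1/Y2 split; the X and Y human-rat nodes map to speciations.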

Drawbacks of tree reconciliation for orthology assignment
• Assumption that the species tree is fully known
• Does not give confidence values
• Gene trees become unreliable when involving a lot of sequences (more data -> less certainty)
• Computationally expensive

Partial tree reconciliation
• Find pairwise orthologs by computer parsing of tree.

Pairwise orthology confidence by ‘orthostrapping’

[Figure: the original tree with bootstrap support values (99, 99, 45, 85, 100, 82). Leaves: PIR-S67168, AAF52138.1, T04F8.1, C47D12.3, Y6E2A.9, F37H8.4, AH6.2, C14F5.4, AAF49194.1]

[Figure, repeated over three slides: a Worm versus Fly table of pairwise orthology values for the sequences AAF52138.1, AAF49194.1, T04F8.1, C47D12.3, Y6E2A.9, F37H8.4, AH6.2 and C14F5.4, shown next to a tree. Cell values as extracted from the successive slides:
  first slide:  0 0 0 0 0 0 0 0 1 0 1 0
  second slide: 0 0 0 0 0 0 0 0 2 1 2 0
  final slide:  0 0 77 77 0 77 0 0 99 81 98 0]
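The counting behind these tables can be sketched in Python: parse each bootstrap tree for cross-species ortholog pairs, then report how often each pair occurs. The per-tree rule used here (a node is a speciation when its two subtrees share no species) is an assumption chosen for brevity and may differ from what Orthostrapper does; the trees, gene names and "gene|species" leaf encoding are invented.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def leaves(tree) -> List[str]:
    """All leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def species_of(leaf: str) -> str:
    return leaf.split("|")[1]


def orthologs_in_tree(tree) -> Set[Pair]:
    """Pairs whose last common node has no species shared between its subtrees."""
    if isinstance(tree, str):
        return set()
    left, right = leaves(tree[0]), leaves(tree[1])
    pairs = orthologs_in_tree(tree[0]) | orthologs_in_tree(tree[1])
    if not {species_of(x) for x in left} & {species_of(x) for x in right}:
        pairs |= {tuple(sorted((a, b))) for a, b in product(left, right)}
    return pairs


def orthostrap_support(bootstrap_trees: List) -> Dict[Pair, float]:
    """Percentage of bootstrap trees in which each pair is called orthologous."""
    counts: Counter = Counter()
    for tree in bootstrap_trees:
        counts.update(orthologs_in_tree(tree))
    return {pair: 100.0 * n / len(bootstrap_trees) for pair, n in counts.items()}


# Four hypothetical bootstrap trees for two worm genes and one fly gene.
tree_a = (("w1|worm", "f1|fly"), "w2|worm")
tree_b = (("w2|worm", "f1|fly"), "w1|worm")
support = orthostrap_support([tree_a, tree_a, tree_a, tree_b])
print(support)
```

Because support is tallied per pair rather than per tree node, a pair can reach high confidence even when individual nodes of the original tree have low bootstrap values.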

orthostrapper.cgb.ki.se

Orthology is not transitive!

Multiple species at different distances may give erroneous groups that include out-paralogs.

[Figure: trees with leaves Y, H1, D1, H2, D2 and Y, D1, H2]

-> Orthology strictly defined for only 2 species/clades
Combining species of different distances is very dangerous
But OK to combine multiple equidistant ones
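A small Python example makes the non-transitivity concrete. It assumes the leaf labels in the figure stand for three species (Y the most distant, H and D closer to each other) and that a duplication happened after Y split off but before H and D diverged; the tree shape and the species-overlap rule used to call speciation nodes are assumptions for illustration.

```python
from typing import List


def leaf_names(tree) -> List[str]:
    """All leaf labels below a node of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaf_names(tree[0]) + leaf_names(tree[1])


def are_orthologs(tree, a: str, b: str) -> bool:
    """True if the last common node of a and b is a speciation.

    Assumed rule: a node is a duplication when its two subtrees share a species.
    The species is taken to be the first letter of the leaf name (Y, H or D).
    """
    left, right = leaf_names(tree[0]), leaf_names(tree[1])
    for side, subtree in ((left, tree[0]), (right, tree[1])):
        if a in side and b in side:
            return are_orthologs(subtree, a, b)
    return not ({name[0] for name in left} & {name[0] for name in right})


# Duplication after Y diverged, before the H/D speciation.
tree = ("Y", (("H1", "D1"), ("H2", "D2")))

print(are_orthologs(tree, "Y", "H1"))   # True
print(are_orthologs(tree, "Y", "D2"))   # True
print(are_orthologs(tree, "H1", "D2"))  # False: H1 and D2 are out-paralogs
```

Y is orthologous to both H1 and D2, yet H1 and D2 are not orthologous to each other, so chaining pairwise orthologs through the distant species would wrongly place out-paralogs in one group.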

Domain-level orthology

HOPS - Hierarchy of Orthologs and Paralogs
1. All species in Pfam are bundled in groups according to scheme:
   [Figure: hierarchy of the groups eukaryota, metazoa, viridiplantae, fungi, chordata, arthropoda, nematoda]
2. Apply Orthostrapper to groups at same level in Pfam families
3. Display results in NIFAS

Pfam

Pfam in brief:

[Figure: Pfam workflow. Boxes: SEED alignment (representative members), Description file, Profile-HMM (HMMer-2.0), Search database, FULL alignment. Legend: manually curated, automatically made]

• Release 13.0 (April 2004):
  – 7426 Pfam-A domain families
  – Based on 1160000 sequences (Swissprot & Trembl)
  – 21980 unique Pfam-A domain architectures
  – 73% of all proteins have >=1 Pfam-A domain

HOPS results
Pfam 10, 6190 families:
• 2450 families (40%) have HOPS orthologs
• 1319 families (21%) have HOPS orthologs in all 6 pairwise comparisons
• 286356 pairwise orthology assignments (> 75% orthostrap)

Storm and Sonnhammer, Genome Research 13:2353-2362 (2003)

Ways to access HOPS
• NIFAS graphical browser
• By sequence ID at Pfam.cgb.ki.se/HOPS
• Flatfiles (Orthostrap tables of 2 clades)

Evolution of Domain Architectures

NIFAS: ATP sulfurylase / APS kinase

[Figure: ATP sulfurylase domain, metazoa vs fungi]
Orthologous shuffled domains?

[Figure: APS kinase domain]

HOPS orthologs of PPS1_HUMAN (ATP sulfurylase/APS kinase)

Summary of ATP sulfurylases/APS kinases: Shuffled non-orthologous domains

[Figure panels: Metazoa, Fungi]

Conclusions
• Orthologs can be detected by
  – Blast: fast
  – tree: slow but less error-prone
• Species at different evolutionary distances should not be combined in orthology analysis
• Inparanoid and Orthostrapper were designed to find inparalogs but not outparalogs
• HOPS/NIFAS can be used to find domain orthologs and analyze domain architecture evolution

Future perspectives
• Multiparanoid – multiple species merging of pairwise Inparalogs.
• Functional divergence among inparalogs

Acknowledgments
– Christian Storm
– Maido Remm
– Andrey Alexeyenko
– Volker Hollich
– Mats Jonsson

http://sonnhammer.cgb.ki.se