Comparative identification of mammalian regulatory elements



Motif instance identification using comparative genomics

Pouya Kheradpour

Joint work with: Alexander Stark, Sushmita Roy and Manolis Kellis

Background and goal

• Genes are largely controlled through the binding of regulators
  – Transcription factors (TFs) are proteins that bind near the transcription start site (TSS) of genes and either activate or repress transcription
  – miRNAs bind to the 3’ untranslated region (UTR) of mRNAs to repress translation
• Regulators bind to short (5 to 20 bp), sequence-specific patterns (motifs)
• The goal of our work is to identify these binding sites (motif instances)

[Figure labels: TF1, TF2, microRNA1]

Motivation

• In all animals, genes are both temporally and spatially regulated to produce complex expression patterns
• Identifying the targets of regulators is vital to understanding this expression
• Conservation allows for identifying targets that are evolutionarily meaningful

[Figure credits: Network: Davidson and Erwin, Science (2006); Mouse: Pennacchio, et al., Nature (2006); Fly: Tomancak, et al., Genome Biology (2002)]

Previous work

• Single genome approaches
  – Generally use positional clustering of motif matches to increase signal (e.g. Berman, et al. 2002; Schroeder, et al. 2004; Philippakis, et al. 2006)
  – A single 5mer match occurs on average 3 million times in a mammalian genome
  – Require a set of specific factors that act together
  – Miss instances of motifs that may occur alone
• Multi-genome approaches (phylogenetic footprinting)
  – Blanchette and Tompa 2002 use an alignment-free phylogenetic approach to find k-mers that are unusually well conserved
  – Moses, et al. 2004 use a strict phylogenetic model to find regions that evolve according to the motif and not the background
  – Ettwiller, et al. 2005 use both nearby species and distant species (fish) to identify motif instances
  – Lewis, et al. 2005 find putative microRNA binding sites, requiring full conservation in five species

Approach outline

1. Produce a raw conservation score for each motif match (branch length score or BLS)
2. For each motif and region, produce a mapping from BLS to confidence

Advantages

• Now we have many, complete, closely related genomes
  – Gives enough power to identify binding sites (Eddy, 2005)
  – Do not have to worry about dramatic divergence
• Account for non-motif conservation using globally derived statistics
• Robust against errors and evolutionary turnover
• Computationally feasible to run genome wide for all available motifs

Large phylogeny challenges in instance identification

• Sequencing / assembly / alignment artifacts
  – Low coverage sequencing, mis-alignments
• Evolutionary variation
  – Individual binding sites can move / mutate
  – Some instances found only in a subset of species

Don’t require perfect conservation: Branch length score
Don’t require exact alignment: Search within a window

[Figure labels: motif instance movement; missing sequence]

Computing Branch Length Score (BLS)

[Figure: CTCF motif instance across the species tree, annotated with mutations, movement, missing matches, and short branches; BLS = 2.23 sps (78%)]

• Does not over count redundant branch length
• Allows for:
  1. Mutations permitted by motif degeneracy
  2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)
  3. Missing motif matches in dense species tree
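The score described above can be sketched in a few lines. This is a minimal illustration, assuming the BLS is the total branch length of the smallest subtree connecting the species in which a motif match was found within the window; the tree, its branch lengths, and the function name branch_length_score are hypothetical and not taken from the slides.

```python
def branch_length_score(parent, species_with_match):
    """Sum the branch lengths of the subtree connecting the matched species.

    parent maps each non-root node to (parent_node, branch_length).
    This tree encoding is an assumption made for the sketch.
    """
    matched = set(species_with_match)
    below = {}  # node -> number of matched species at or below that node
    for species in matched:
        node = species
        while node is not None:
            below[node] = below.get(node, 0) + 1
            node = parent[node][0] if node in parent else None
    # The branch above a node belongs to the connecting subtree only if
    # some, but not all, matched species sit below it. Each branch is
    # counted once, so shared branch length is never double counted.
    return sum(parent[n][1] for n, c in below.items()
               if n in parent and c < len(matched))


# Hypothetical three-species tree: ((human:0.1, mouse:0.3)A:0.05, dog:0.2)root
TREE = {
    "human": ("A", 0.1),
    "mouse": ("A", 0.3),
    "A": ("root", 0.05),
    "dog": ("root", 0.2),
}
TOTAL = sum(length for _, length in TREE.values())

bls = branch_length_score(TREE, ["human", "dog"])
print(round(bls, 3), "sps,", round(100 * bls / TOTAL), "% of total branch length")
```

A match missing from mouse still scores here, because the human and dog branches alone connect through the root; that is the sense in which the score tolerates missing matches in a dense tree.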

Branch Length Score → Confidence

1. Evaluate non-motif probability of a given score

• Sequence could also be conserved due to overlap with un-annotated element (e.g. non-coding RNA)

2. Account for differences in motif composition and length

• For example, short motif more likely to be conserved by chance

Control motifs

• Control motifs are the basis of our estimation of the background level of conservation and for evaluating enrichment
• Each motif has its own set of controls
• They are chosen to:
  – Have the same composition as the original motif
  – Match the target regions (e.g. promoters) with approximately the same frequency (+/- 20%)
  – Not be too similar to each other (to preserve diversity)
  – Not be similar to known motifs (including the one being shuffled)
• Background level is estimated separately in each region type (e.g. promoters or 3’ UTRs)
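The selection rules above can be sketched as a filter over shuffled versions of a motif. This is only an illustration under simplifying assumptions: motifs are IUPAC consensus strings, matching is on the forward strand only, "too similar" is approximated by a Hamming distance threshold, and the check against a library of known motifs is reduced to the motif being shuffled. The function names, the min_distance parameter, and the toy sequences are hypothetical.

```python
import random
import re

# IUPAC nucleotide codes expressed as regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_matches(consensus, sequences):
    """Count (possibly overlapping) forward-strand matches of a consensus."""
    pattern = re.compile("(?=" + "".join(IUPAC[c] for c in consensus) + ")")
    return sum(len(pattern.findall(seq)) for seq in sequences)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def select_controls(consensus, sequences, n_shuffles=100, max_controls=10,
                    tolerance=0.2, min_distance=2, seed=0):
    """Pick shuffled controls with the same composition and similar frequency."""
    rng = random.Random(seed)
    target = count_matches(consensus, sequences)
    controls = []
    for _ in range(n_shuffles):
        chars = list(consensus)
        rng.shuffle(chars)  # shuffling positions preserves composition
        candidate = "".join(chars)
        # Reject candidates too close to the original or to accepted controls
        if any(hamming(candidate, other) < min_distance
               for other in [consensus] + controls):
            continue
        # Keep candidates that match the regions about as often (+/- 20%)
        if abs(count_matches(candidate, sequences) - target) <= tolerance * target:
            controls.append(candidate)
            if len(controls) == max_controls:
                break
    return controls


# Toy "promoter" sequences, generated at random for the example
_rng = random.Random(1)
REGIONS = ["".join(_rng.choice("ACGT") for _ in range(2000)) for _ in range(20)]
print(select_controls("CANNTG", REGIONS))
```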

Branch Length Score → Confidence

1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone or due to non-motif conservation
2. Compute Confidence Score as fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Select movement window that leads to the most instances at each confidence
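Steps 1 and 2 can be illustrated with a short sketch. It assumes that "at a given BLS" means a cutoff (all instances with BLS at or above it), that the expected noise is the mean count over the control motifs, and that negative values are clamped to zero; the function name and the example scores are hypothetical.

```python
def confidence_by_cutoff(motif_bls, control_bls_lists, cutoffs):
    """Map each BLS cutoff to a confidence (1 - estimated false discovery rate).

    motif_bls: BLS values of the real motif's instances.
    control_bls_lists: one list of BLS values per shuffled control motif.
    """
    mapping = {}
    for cutoff in cutoffs:
        observed = sum(b >= cutoff for b in motif_bls)
        # Expected count by chance or non-motif conservation: mean over controls
        expected = sum(
            sum(b >= cutoff for b in control) for control in control_bls_lists
        ) / len(control_bls_lists)
        if observed == 0:
            mapping[cutoff] = 0.0
        else:
            # Fraction of instances above noise; clamping at 0 is an assumption
            mapping[cutoff] = max(0.0, 1.0 - expected / observed)
    return mapping


# Made-up scores for one motif and two controls
motif = [0.1, 0.5, 1.2, 2.0, 2.2]
controls = [[0.1, 0.2, 0.4], [0.3, 1.5]]
for cutoff, conf in confidence_by_cutoff(motif, controls, [0.0, 1.0, 2.0]).items():
    print(cutoff, round(conf, 3))
```

Step 3 would repeat this calculation for each candidate movement window and keep the window that yields the most instances at a given confidence.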

Confidence selects for functional instances

[Figure: transcription factor motifs and microRNA motifs by region (promoter, 5’UTR, CDS, intron, 3’UTR), with a strand bias panel]

1. Confidence selects for transcription factor motif instances in promoters and miRNA motifs in 3’ UTRs
2. miRNA motifs are found preferentially on the plus strand, whereas no such preference is found for TF motifs

Experimental identification of binding sites

[Figure: ChIP-seq, from Mardis 2007]

• Chromatin immunoprecipitation (ChIP) combined with either sequencing (ChIP-seq) or microarrays (ChIP-chip) is an experimental procedure used to identify binding sites
  – Not all binding is functional, so the false positive rate can be high
  – Only binding that is active in the surveyed conditions is found

Intersection with CTCF ChIP-Seq regions

• Conserved CTCF motif instances highly enriched in ChIP-Seq sites
• High enrichment does not require low sensitivity
• Many motif instances are verified

[Figure labels: CTCF; 50% of motifs verified; ≥ 50% of regions with a motif]

ChIP data from Barski, et al., Cell (2007)

Enrichment found for other factors in mammals and flies

[Figure panels: Mammals; Flies]

Enrichment increases in conserved bound regions

1. ChIP bound regions may not be conserved (Odom, et al. 2007)
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
4. Trend persists for other factors where we have multi-species ChIP data

H: Barski, et al., Cell (2007); Mouse: Bernstein, unpublished

Enrichment of instances in fly muscle genes

1. Motifs at 60% confidence and ChIP have similar enrichments (depletion for the repressor Snail) in the functional promoters
2. Enrichments persist even when you look at non-overlapping subsets
3. Intersection of the two has the strongest signal
4. Evolutionary and experimental evidence is complementary
  • ChIP includes species-specific regions and differentiates tissues
  • Conserved instances include binding sites not seen in tissues surveyed

ChIP data from: Zeitlinger, et al., G&D (2007); Sandmann, et al., G&D (2007); Sandmann, et al., Dev Cell (2006)

Fly regulatory network at 60% confidence

TFs: 67 of 83 (81%), 46k instances
miRNAs: 49 of 67 (86%), 4k instances

• Several connections confirmed by literature (either directly or indirectly)
• Global view of instances allows us to make network level observations:
  – TFs were more targeted by TFs (P < 10^-20) and by miRNAs (P < 5 x 10^-5)
  – TF in-degree associated with miRNA in-degree (high-high: P < 10^-4; low-low: P < 10^-6)

Contributions

• A general methodology for regulatory motif instance identification using many, closely related genomes
  – Robust against errors from sequencing, assembly and alignment
  – Allows limited functional turnover and motif movement
  – Provides statistical measurement of confidence for each instance, correcting for length, composition and overlap with other functional elements
• Validation and comparison to experimental data
  – High enrichment of binding sites in ChIP regions for a variety of factors
  – Functional enrichments suggest comparable ability to identify functional instances as ChIP

Future directions

• Our predicted network was static, but real regulatory networks are dynamic
  – They change throughout development and in different conditions
  – They can vary greatly in different species
• We want to expand this work to learn about these network dynamics
  – ChIP data is becoming increasingly available in a variety of conditions; we can use this to learn what causes changes in binding
  – Multi-species data is also becoming more available
    • Can match motif binding to cross-species expression changes
  – We can train on this data to find motifs that act together or compensate for each other

Acknowledgments

• Alexander Stark
• Sushmita Roy
• Manolis Kellis

MIT CSAIL

• Matt Rasmussen
• Mike Lin
• Issao Fujiwara
• Rogerio Candeias

Mouse CTCF ChIP-Seq

• Tarjei Mikkelsen
• Brad Bernstein

Funding

• William C.H. Chao Fellowship
• NSF Graduate Research Fellowship

Broad Institute

• Or Zuk
• Michele Clamp
• Manuel Garber
• Mitch Guttman
• Eric Lander

The End

Implementation details

• A table lookup on the next 8 bases of the genome is used to find potential motif matches in the target genome
  – Results in an order-of-magnitude increase in speed over scanning through all motifs
• In a first run, 100 shuffles of each motif are evaluated and up to 10 that fulfill the requirements are selected
• All motifs and their selected shuffles are matched to the target genome and their BLS scores are computed
• The matches are evaluated at each branch length cutoff and a mapping is produced for each motif from branch length score to confidence
• All code is designed to run on the Broad cluster (often with parallelization) and is written in C
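The table lookup in the first bullet can be illustrated as follows. The original code is in C; this Python sketch only shows the idea under stated assumptions: motifs are IUPAC consensus strings, the table is keyed on every concrete 8-mer that a motif's first 8 positions can match (shorter motifs are padded with N), and each candidate is then verified in full. The function names are hypothetical.

```python
from itertools import product

# IUPAC nucleotide codes as sets of allowed bases
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def build_lookup(motifs, k=8):
    """Map every concrete k-mer to the motifs whose prefix it satisfies."""
    table = {}
    for motif in motifs:
        prefix = motif[:k].ljust(k, "N")  # pad short motifs with wildcards
        for kmer in product(*(IUPAC_SETS[c] for c in prefix)):
            table.setdefault("".join(kmer), []).append(motif)
    return table


def matches_at(motif, seq, pos):
    """Full verification of one motif at one position."""
    if pos + len(motif) > len(seq):
        return False
    return all(seq[pos + i] in IUPAC_SETS[c] for i, c in enumerate(motif))


def scan(seq, motifs, k=8):
    """Return (position, motif) hits, using the table to skip most motifs."""
    table = build_lookup(motifs, k)
    hits = []
    for pos in range(len(seq)):
        window = seq[pos:pos + k]
        # Near the end of the sequence the window is short, so fall back
        # to checking every motif directly.
        candidates = table.get(window, []) if len(window) == k else motifs
        for motif in candidates:
            if matches_at(motif, seq, pos):
                hits.append((pos, motif))
    return hits


print(scan("TTGACGTCATCACGTGA", ["TGACGTCA", "CANNTG", "ACGTR"]))
```

With one dictionary lookup per genome position, only the motifs compatible with the next 8 bases are verified, rather than every motif at every position.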

Performance on mammalian TRANSFAC motifs

[Figure labels: 2.5x, 3.5x, and 6.5x increase]

• Most motifs have confident instances up to 90% confidence with 18 mammals
• Substantial increase in the number of instances compared to using only human, mouse, rat and dog

The promise of many genomes

• Eddy showed that with many genomes, resolving binding sites using conservation is possible
• The goal of our work is to make this practical
  – Integrate evidence from multiple informant species
  – Determine which of the thousands of motif matches are functional using conservation

Slides on motif discovery

Motif discovery pipeline

1. Enumerate motif seeds
  • Six non-degenerate characters with variable size gap in the middle
2. Score seed motifs
  • Use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs
  • Use expanded nucleotide IUPAC alphabet to fill unspecified bases around seed using hill climbing
4. Cluster to remove redundancy
  • Using sequence similarity

[Figure: example seeds with a central gap, before and after expansion with degenerate IUPAC characters (S, R)]
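Step 1 of the pipeline can be sketched as a simple generator. The split of the six specified characters into three on each side of the gap and the maximum gap size are assumptions made for illustration; the gap is written as a run of N characters, and the function name is hypothetical.

```python
from itertools import product


def enumerate_seeds(max_gap=10, half=3):
    """Yield seeds of the form XXX + N*gap + XXX over the alphabet ACGT.

    max_gap and the 3 + 3 split are illustrative assumptions.
    """
    for gap in range(max_gap + 1):
        for left in product("ACGT", repeat=half):
            for right in product("ACGT", repeat=half):
                yield "".join(left) + "N" * gap + "".join(right)


seeds = list(enumerate_seeds(max_gap=2))
print(len(seeds), seeds[:3], seeds[-1])
```

Each gap size contributes 4^6 = 4096 seeds, so the seed space stays small enough to score every seed before the expansion step.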

Top 30 discovered fly motifs

[Table: rank, consensus, MCS, matches to known, expression enrichment (promoters, enhancers); values listed by column]

Consensus: CTAATTAAA, TTKCAATTAA, WATTRATTK, AAATTTATGCK, GCAATAAA, DTAATTTRYNR, TGATTAAT, YMATTAAAA, AAACNNGTT, RATTKAATT, GCACGTGT, AACASCTG, AATTRMATTA, TATGCWAAT, TAATTATG, CATNAATCA, TTACATAA, RTAAATCAA, AATKNMATTT, ATGTCAAHT, ATAAAYAAA, YYAATCAAA, WTTTTATG, TTTYMATTA, TGTMAATA, TAAYGAG, AAAKTGA, AAANNAAA, RTAAWTTAT, TTATTTAYR

MCS: 65.6, 57.3, 54.9, 54.4, 51, 46.7, 45.7, 43.1, 41.2, 40, 39.5, 38.8, 38.2, 37.8, 37.5, 36.9, 36.3, 36, 35.6, 35.5, 33.9, 33.8, 33.6, 33.2, 33.1, 32.9

Matches to known: engrailed (en), reversed-polarity (repo), araucan (ara), paired (prd), ventral veins lacking (vvl), Ultrabithorax (Ubx), apterous (ap), abdominal A (abd-A), fushi tarazu (ftz), broad-Z3 (br-Z3), Antennapedia (Antp), Abdominal B (Abd-B), extradenticle (exd), gooseberry-neuro (gsb-n), Deformed (Dfd)

Expression enrichment, promoters: 25.4, 5.8, 11.7, 4.5, 13.2, 16, 7.1, 7, 20.1, 3.9, 17.9, 10.7, 19.5, 5.8, 14.1, 1.8, 5.4, 3.2, 3.6, 2.4, 57.2, 5.3, 6.3, 6.7, 8.9, 4.7, 7.6, 449.7, 11, 30.7

Expression enrichment, enhancers: 2, 4.2, 2.6, 16.5, 0.3, 3.3, 1.7, 2.2, 4.3, 0.7, 1.2, 2, 5.4, 1.7, 2.8, 0, 4.6, -0.5, 0.6, 6, 1.7, 1.6, 2.7, 0.3, 0.8

1. Many of the top discovered motifs match known motifs
2. Motifs are associated with genes that are preferentially expressed in tissues

Discovered motifs have functional enrichments

[Figure: enrichment or depletion of a motif in the promoters of genes expressed in a tissue, by tissue]

1. Most motifs avoided in ubiquitously expressed genes
2. Functional clusters emerge
