Lab Meeting #2 12-13-2010 Expression profiles and transcriptional networks in the CNS midline Wheeler et al., 2006
High-throughput sequencing of the midline transcriptome
  Isolate midline cells
  Generate adapter-ligated cDNA library
  Sequence

Steps to RNA-seq
  Isolate Cells
  Prepare cDNA Library
  Sequence
  Analyze

Isolate Cells
  Find midline-specific GAL4 driver
  Stage embryos at 25 °C
  Dissociate cells
  Sort cells via FACS
  Genotype: 3.7sim-GAL4; UAS-mCD8.GFP-LL6

Fluorescence Activated Cell Sorting (FACS)
  [FACS plots: w1118; sim > LL6; Sorted GFP-; Sorted GFP+]

Sort results
  2 hr collection
  ~75-125 uL of embryos
  ~20,000 cells / uL, or ~20 x 10^6 cells / collection

  Sort   Sample   Sorted events   Total RNA extracted
  #1     14Gp01   42,000          32 ng
  #1     14Gp02   43,000          32 ng
  #2     14Gp03   62,000          55 ng
  #2     14Gp04   59,000          65 ng
  #3     14Gp05   42,000          25 ng
  #3     14Gp06   45,000          35 ng

Prepare cDNA Library
  Isolate total RNA
  Purify mRNA
  Fragment mRNA
  Generate double-stranded cDNA
  Blunt, phosphorylate 5' end, adenylate 3' end
  Ligate adapter sequence
  Size-select fragments
  Enrich with PCR

Fragmenting and Size-Selecting
  [Gel images: Size Selection; Enrichment. Labeled bands: Product, Adapter dimer, Primer dimer, Size Selection Product]

Adapters and enrichment
  [Diagram: adapters ad1 and ad2 ligated to the fragment through T/A overhangs]

Sequence
  Walk to Mary Ellen Jones
  Take elevator to 9th floor
  Hand off sample and pay lots of money
  Wait

Analyze: Programs Involved
  Bowtie
    Ultra-fast, memory-efficient, short-sequence aligner
    Uses Burrows-Wheeler Transform (BWT) indexing
  SAMtools
    Sequence Alignment/Map tools for indexing, sorting, and formatting sequence alignment data
  TopHat
    Fast splice-junction mapping tool
    Aligns RNA-seq reads to the genome using Bowtie, taking into consideration the existence of splice junctions
  Cufflinks
    Takes TopHat alignment data and:
      Assembles transcripts
      Estimates transcript abundance
      Measures differential expression between
      samples

TopHat
  Input file: .txt, .fa, .fq
  Options:
    -i 20 (70)
    -I 142000 (500,000)
    --solexa1.3-quals
    -m (0)
    -g (40)
  Output files: accepted_hits.bam, junctions.bed
  [Figure, panel a: Map paired cDNA fragment sequences to genome; TopHat; Spliced fragment alignments. From Trapnell et al., 2010]

Cufflinks
  [Figure: overview of Cufflinks, from Trapnell et al., 2010 (Nature Biotechnology, Volume 28, Number 5, May 2010).
   Panel a: Map paired cDNA fragment sequences to genome (TopHat; spliced fragment alignments).
   Assembly: panel b, Mutually incompatible fragments; panel c, Overlap graph, Minimum path cover, Transcripts.
   Abundance estimation: panel d, Transcript coverage and compatibility, Fragment length distribution; panel e, Maximum likelihood abundances (Log-likelihood), Transcripts and their abundances.]
  Caption points still legible on the slide:
    Dilworth's Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to "explain" all the fragments.
    Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated.

Cufflinks
  Input file: .sam formatted alignment
    [Example record with labeled fields: Sequence name, Position of first base, CIGAR, Sequence, Sequence quality]
  Options:
    -G  Reference Annotation .gtf
    -M  Mask File .gtf
  Output:
    genes.expr
      FBgn #, bundle ID, Chromosome, Left Boundary, Right Boundary, FPKM, FPKM_conf_lo, FPKM_conf_hi, Status
    transcripts.expr
      FBtr #, bundle ID, Chromosome, Left Boundary, Right Boundary, FPKM, FMI, Frac, FPKM_conf_lo, FPKM_conf_hi, Coverage, Length, Effective Length, Status
    transcripts.gtf
      Chromosome name, "Cufflinks", feature ("exon" or "transcript"), Start, End, Score, Strand, Frame, Attributes

Cuffcompare
  Input: transcripts.gtf
  Options: -r
  Output:
    transcripts.refmap
      Gene symbol, FBtr #, Class code, FBgn#|FBtr#
    transcripts.tmap
      Gene symbol, FBtr #, Class code, FBgn #, FBtr #, FMI, FPKM, FPKM_conf_lo, FPKM_conf_hi, Coverage, Length, Major isoform FBtr #

  Example transcripts.tmap rows:
  Gene symbol  FBtr #       Class  FBgn #       FBtr #       FMI  FPKM        FPKM_conf_lo  FPKM_conf_hi  Coverage   Length
  Hcs          FBtr0078683  =      FBgn0037332  FBtr0078683  100  15.613557   7.710759      23.516356     10.575610  3970
  ksr          FBtr0078766  =      FBgn0015402  FBtr0078766  100  62.015780   46.265760     77.765800     42.005461  3737
  CG31550      FBtr0113409  =      FBgn0051550  FBtr0113409  100  59.515339   43.526016     75.504663     40.311825  2584
  CG31550      FBtr0078684  =      FBgn0051550  FBtr0078684  7    4.176125    0.000000      9.549375      2.828636   2702
  CG31550      FBtr0078685  =      FBgn0051550  FBtr0078685  13   7.805313    1.577164      14.033462     5.286812   1849
  CG31547      FBtr0078765  =      FBgn0051547  FBtr0078765  2    0.192514    0.000000      1.396952      0.130397   3297
  CG31547      FBtr0078764  =      FBgn0051547  FBtr0078764  100  8.893786    2.888758      14.898814     6.024073   3891
  Nmdar1       FBtr0078763  =      FBgn0010399  FBtr0078763  100  104.835138  84.357332     125.312945    71.008514  4186

Issues that had to be overcome…
  Chromosome names in genome file from UCSC: chr2L, chr2R, etc.
  Chromosome names in annotation file from Ensembl: 2L, 2R, etc.

Issues that had to be overcome…
  Genome chromosomes:     2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, Uextra, X, XHet, YHet
  Annotation chromosomes: 2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, X, XHet, YHet

Quality and Confidence
                               Midline Cells     Non-Midline Cells
  Number of reads generated    23,995,806        23,758,803
  Filtered due to low quality  67,921 (0.28%)    63,308 (0.27%)
  Good quality reads           23,927,885        23,695,495

  Gene     Midline FPKM  Non-Midline FPKM
  sim      91.4569       0.935763
  ple      77.3427       2.77772
  Vmat     2130.63       65.0964
  argos    53.4021       4.33003
  wrapper  47.9539       21.0768
  elav     176.195       131.928
  gcm      1.77936       6.59764

Cursory Analysis of TFs
  gene_ontology_obo.txt

  [Term]
  id: GO:0000117
  name: regulation of transcription involved in G2/M-phase of mitotic cell cycle
  namespace: biological_process
  def: "Any process that regulates transcription such that the target genes are transcribed as part of the G2/M phase of the mitotic cell cycle." [GOC:dph, GOC:mah, GOC:tb]
  related_synonym: "G2/M-specific transcription in mitotic cell cycle" []
  related_synonym: "regulation of transcription from RNA polymerase II promoter during G2/M-phase of mitotic cell cycle" []
  xref_analog: Reactome:69274 "G2/M-specific transcription in mitotic cell cycle"
  is_a: GO:0006357 ! regulation of transcription from RNA polymerase II promoter
  is_a: GO:0022402 ! cell cycle process
  relationship: part_of GO:0000086 ! G2/M transition of mitotic cell cycle

  is_a: GO:0010551 ! regulation of gene-specific transcription from RNA polymerase II promoter
  is_a: GO:0006357 ! regulation of transcription from RNA polymerase II promoter
  is_a: GO:0005667 ! transcription factor complex
  is_a: GO:0045892 ! negative regulation of transcription, DNA-dependent
  is_a: GO:0045990 ! carbon catabolite regulation of transcription
  is_a: GO:0000409 ! regulation of transcription by galactose
  is_a: GO:0045013 ! carbon catabolite repression of transcription
  is_a: GO:0045991 ! carbon catabolite activation of transcription
  is_a: GO:0000429 ! carbon catabolite regulation of transcription from RNA polymerase II promoter

  181 GO terms with "transcription"

Finding Transcription Factors
  gene_association.fb
    Columns: "FB", FBgn #, Gene symbol, Qualifier, GO #, 10 additional fields
  Is the GO # associated with "transcription"?
  If yes, copy FBgn #, gene symbol and GO #

  FBgn #       Gene symbol  GO #
  FBgn0052062  A2bp1        GO:0008134
  FBgn0052062  A2bp1        GO:0045941
  FBgn0250816  AGO3         GO:0035194
  FBgn0261953  AP-2         GO:0003704
  FBgn0261953  AP-2         GO:0010552
  FBgn0261953  AP-2         GO:0003700
  FBgn0261953  AP-2         GO:0006355
  FBgn0261953  AP-2         GO:0003702
  FBgn0261953  AP-2         GO:0003700
  FBgn0039946  ATbp         GO:0016563
  FBgn0039946  ATbp         GO:0006355
  FBgn0039946  ATbp         GO:0006357
  FBgn0039946  ATbp         GO:0006357
  FBgn0039946  ATbp         GO:0030528
  FBgn0039946  ATbp         GO:0030528

  Collapsed to one entry per gene:
  FBgn #       Gene symbol
  FBgn0052062  A2bp1
  FBgn0250816  AGO3
  FBgn0261953  AP-2
  FBgn0039946  ATbp
  FBgn0000015  Abd-B
  FBgn0027620  Acf1
  FBgn0037555  Ada2b
  FBgn0000054  Adf1
  FBgn0005694  Aef1
  FBgn0261238  Alh
  FBgn0010774  Aly
  FBgn0260642  Antp
  FBgn0029512  Aos1
  FBgn0026598  Apc2
  FBgn0261823  Asx

  842 genes

Extracting TFs from RNAseq Data
  Inputs:
    TF list (FBgn #, gene symbol; FBgn0052062 A2bp1 through FBgn0261823 Asx as above)
    transcripts.tmap (Gene symbol, FBtr #, Class code, FBgn #, FBtr #, FMI, FPKM, FPKM_conf_lo, FPKM_conf_hi, Coverage, Length, Major isoform FBtr #)

  Name    Transcript   Gene ID      FMI  FPKM
  hkb     FBtr0078951  FBgn0261434  0    0.000000
  CG9775  FBtr0078895  FBgn0037261  100  48.615585
  CG9775  FBtr0301297  FBgn0037261  87   42.165026
  CG9775  FBtr0078894  FBgn0037261  2    0.807909
  MED31   FBtr0078856  FBgn0037262  46   20.250985
  MED31   FBtr0078857  FBgn0037262  100  44.226801
  opa     FBtr0078836  FBgn0003002  0    0.000000
  corto   FBtr0078844  FBgn0010313  100  123.936859

Dangers discovered so far…
  biology462:cuff_with_ref Fontana$ grep MED8 transcripts.tmap
  MED8 FBtr0086297 = FBgn0034503 FBtr0086297 100 34.558291 0.000000 12.612187
  biology462:cuff_with_ref Fontana$ grep FBgn0034503 genes.expr
  FBgn0034503 15535 chr2R 16213345 16214311 34.5583 0 12.6122 FAIL

  biology462:cuff_with_ref Fontana$ grep Vmat transcripts.tmap
  Vmat FBtr0091491 = FBgn0260964 FBtr0091491 4 108.881571 91.482036 126.281106
  Vmat FBtr0091492 = FBgn0260964 FBtr0091492 0 9.646350 4.141926 15.150773
  biology462:cuff_with_ref Fontana$ grep FBgn0260964 genes.expr
  FBgn0260964 14703 chr2R 9400631 9420494 2635.91 2550.95 2720.87 OK

  Inputs now used for the TF list (FBgn #, gene symbol):
    transcripts.tmap (Gene symbol, FBtr #, Class code, FBgn #, FBtr #, FMI, FPKM, FPKM_conf_lo, FPKM_conf_hi, Coverage, Length, Major isoform FBtr #)
    genes.expr (FBgn #, bundle ID, Chromosome, Left Boundary, Right Boundary, FPKM, FPKM_conf_lo, FPKM_conf_hi, Status)

Transcription Factor Tally
  FPKM               > 5   > 20
  Midline cells      512   319
  Non-midline cells  604   340

FPKM ≥ 20 in midline, < 5 in non-midline
  Symbol   FBgn #       FPKM       Note
  MED8     FBgn0034503  34.558291  Status = Fail; FPKM 0.425
  Nurf-38  FBgn0016687  24.349477  FPKM = 32 in non-midline; FMI = 44
  cry      FBgn0025680  27.772590
  dmrt99B  FBgn0039683  55.270044
  sim      FBgn0004666  87.116834

  Symbol   FBgn         Mid FPKM  Mid Status  NonMid FPKM  NonMid Status
  cry      FBgn0025680  22.4535   OK          2.15868      OK
  dmrt99B  FBgn0039683  44.6845   OK          1.98577      OK
  HLH3B    FBgn0011276  157.235   OK          4.7074       OK
  per      FBgn0003068  31.2259   OK          3.31479      OK
  sim      FBgn0004666  91.4569   OK          0.935763     OK
  Tip60    FBgn0026080  20.3645   OK          4.5745       OK
  vg       FBgn0003975  27.9836   OK          4.21669      OK

Known expression patterns
  [Expression images: cryptochrome, period, dmrt99B, HLH3B, vg, sim]
  No images for Tip60…yet!

All genes: Midline ≥ 20; Non-midline < 5
  [Highlighted on slide: CG33056]

  Symbol    FBgn         Mid FPKM  Mid Status  NonMid FPKM  NonMid Status
  Amyrel    FBgn0020506  21.5749   OK          4.5629       OK
  argos     FBgn0004569  53.4021   OK          4.33003      OK
  CG1077    FBgn0037405  39.7551   OK          0.337708     OK
  CG13685   FBgn0035816  20.2799   OK          0            OK
  CG14044   FBgn0031650  29.8253   OK          3.27032      OK
  CG14052   FBgn0029606  22.3947   OK          0            OK
  CG14082   FBgn0036851  20.7989   OK          3.89604      OK
  CG14237   FBgn0039428  26.0363   OK          3.62562      OK
  CG14238   FBgn0039429  23.4602   OK          2.67148      OK
  CG31323   FBgn0051323  21.5139   OK          3.97617      OK
  CG33056   FBgn0053056  24.8664   OK          2.0825       OK
  CG34325   FBgn0085354  27.8598   OK          4.44943      OK
  CG42456   FBgn0259933  42.585    OK          6.7805e-09   OK
  CG6426    FBgn0034162  37.2291   OK          4.32297      OK
  CG6709    FBgn0036056  33.3694   OK          3.55502      OK
  CG7059    FBgn0038957  20.8566   OK          2.71942      OK
  CR31846   FBgn0051846  25.6772   OK          4.52441      OK
  cry       FBgn0025680  22.4535   OK          2.15868      OK
  dmrt99B   FBgn0039683  44.6845   OK          1.98577      OK
  HLH3B     FBgn0011276  157.235   OK          4.7074       OK
  Hsp67Ba   FBgn0001227  27.0698   OK          2.76941      OK
  Pdf       FBgn0023178  203.259   OK          3.65838      OK
  per       FBgn0003068  31.2259   OK          3.31479      OK
  ple       FBgn0005626  77.3427   OK          2.77772      OK
  Sh        FBgn0003380  29.2139   FAIL        4.74804      FAIL
  sim       FBgn0004666  91.4569   OK          0.935763     OK
  Tdc2      FBgn0050446  592.613   OK          3.05323      OK
  Tip60     FBgn0026080  20.3645   OK          4.5745       OK
  Tsp42En   FBgn0033135  28.5682   OK          4.60738      OK
  vg        FBgn0003975  27.9836   OK          4.21669      OK

Things to do:
  Biological replicate of midline sample
  Earlier time-point to compare with
  Validate with in situs
  Look at interesting mutants

Down the line…
  RNAseq of mutants or transgenically overexpressed genes
  Isolate individual cell types for RNAseq

Future Direction
  3.7sim-QF > QUAS-mtdTomato-3xHA
  CoolEnhancer-GAL4 > UAS-GFP

3.7sim > LL5
  2 insertions on second chromosome:
    Line A: 2R, 55E1 in 3'UTR of CG42697
    Line B: 2R, 47F7 in 5'UTR of TapΔ

CG32105
  [Diagram of factors, with Drosophila names in parentheses: Lmx1a (CG32105); Msx1; Otx (orthodenticle, ocelliless); Nkx6.1 (HGTX); (indirectly) Ngn2 (tap), TH (pale), Nurr1 (Hr38), Pitx3 (Ptx1)]

  [Image: Runt, CG32105, 3.7sim > tauGFP; thoracic segments T1 and T2; cells marked 1 and *]
  [Image: Abdominal segments; Runt, CG32105, 3.7sim > tauGFP; cells marked 1 and *]
  [Image: zfh1, CG32105; Stage 12, anterior segments; cells marked *]
  [Images (2): sim>tau-GFP, Castor, CG32105]

Summary of CG32105 expression
  Stage 11: not expressed early; expressed in mVUMs as soon as they are born
  Stage 12: expressed in mVUMs of all segments looked at
  Stage 15-16: not expressed in all thoracic mVUMs; may be expressed in 1-2 thoracic MNBp; expressed in abdominal mVUMs
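
The chromosome-naming mismatch listed under "Issues that had to be overcome…" (UCSC genome: chr2L, chr2R; Ensembl annotation: 2L, 2R) can be reconciled by rewriting the first column of the annotation file. The following is a minimal sketch, assuming a tab-delimited GTF and that adding a "chr" prefix is the only change needed. The function names and file paths are hypothetical and are not taken from the slides.

```python
def add_chr_prefix(gtf_line, prefix="chr"):
    """Return a GTF line with `prefix` added to the chromosome column."""
    # Leave blank lines and comment lines untouched.
    if not gtf_line.strip() or gtf_line.startswith("#"):
        return gtf_line
    fields = gtf_line.split("\t")
    # Only add the prefix once, so the function is safe to re-run.
    if not fields[0].startswith(prefix):
        fields[0] = prefix + fields[0]
    return "\t".join(fields)


def convert_gtf(in_path, out_path):
    """Rewrite every record of an annotation GTF (paths are placeholders)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(add_chr_prefix(line))
```

Renaming alone does not resolve the second issue on the slides: Uextra is present in the genome but not in the annotation, so it has no annotated features either way.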
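
The procedure on the "Cursory Analysis of TFs" and "Finding Transcription Factors" slides (collect the GO terms whose name mentions "transcription", then copy the FBgn # and gene symbol of every gene_association.fb row that carries one of those GO #s) might be sketched as below. This is an illustration under assumptions: it reads only the `[Term]`, `id:` and `name:` lines shown on the slide, it assumes gene_association.fb is tab-delimited with the column order given on the slide ("FB", FBgn #, gene symbol, qualifier, GO #), and both function names are hypothetical.

```python
def transcription_go_ids(obo_lines, keyword="transcription"):
    """Collect ids of [Term] stanzas whose name contains `keyword`."""
    ids = set()
    current_id = None
    in_term = False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            current_id = None
        elif in_term and line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif in_term and line.startswith("name:") and current_id:
            if keyword in line[len("name:"):]:
                ids.add(current_id)
    return ids


def genes_with_go(assoc_lines, go_ids):
    """Map FBgn # -> gene symbol for rows whose GO # is in `go_ids`."""
    genes = {}
    for raw in assoc_lines:
        if raw.startswith("!"):  # comment lines in GO annotation files
            continue
        cols = raw.rstrip("\n").split("\t")
        if len(cols) >= 5 and cols[4] in go_ids:
            genes[cols[1]] = cols[2]  # one entry per gene, duplicates collapse
    return genes
```

Using a dictionary keyed by FBgn # reproduces the collapse from many gene/GO rows to one row per gene shown on the slide.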
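
The cut-off used for the gene tables (FPKM ≥ 20 in midline, < 5 in non-midline), combined with the FAIL-status caveat from "Dangers discovered so far…", can be written as a small filter. This is a sketch, assuming each gene is represented by its gene-level FPKM and Status in both samples. The record layout and function name are hypothetical, and dropping non-OK rows is one possible policy rather than necessarily the one used for the slides (the "All genes" table still lists Sh, which has FAIL status).

```python
from typing import NamedTuple


class GeneExpr(NamedTuple):
    symbol: str
    mid_fpkm: float
    mid_status: str
    nonmid_fpkm: float
    nonmid_status: str


def midline_enriched(genes, mid_min=20.0, nonmid_max=5.0, require_ok=True):
    """Return genes with midline FPKM >= mid_min and non-midline FPKM < nonmid_max."""
    hits = []
    for g in genes:
        if require_ok and (g.mid_status != "OK" or g.nonmid_status != "OK"):
            continue  # skip estimates flagged as failed in either sample
        if g.mid_fpkm >= mid_min and g.nonmid_fpkm < nonmid_max:
            hits.append(g)
    return hits


# FPKM values are copied from the slides. The Sh statuses follow the
# "All genes" table; the statuses of elav and gcm are not shown on the
# slides and are set to "OK" here purely for illustration.
example = [
    GeneExpr("sim", 91.4569, "OK", 0.935763, "OK"),
    GeneExpr("Sh", 29.2139, "FAIL", 4.74804, "FAIL"),
    GeneExpr("elav", 176.195, "OK", 131.928, "OK"),
    GeneExpr("gcm", 1.77936, "OK", 6.59764, "OK"),
]
print([g.symbol for g in midline_enriched(example)])
```

With `require_ok=True` only sim passes in this example; calling `midline_enriched(example, require_ok=False)` also returns Sh, which matches how the slide table keeps it.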