Transcript ppt
NGS Bioinformatics Workshop
2.3 Tutorial – Transcriptome Assembly
May 16th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Workflow for Today
Erratum about last week
Questions from last week
Galaxy @ Westgrid now available
Transcriptome assembly
Mapped-based assembly:
Bowtie, TopHat,Cufflinks
de novo assembly:
Velvet + Oases
Trans-ABySS
Erratum: Running Velvet with
two paired end data files
Run velveth:
velveth outputdir k_mer –fastq
-shortPaired paired_data_file_1
-shortPaired2 paired_data_file_2
Run velvetg:
velvetg outputdir
-ins_length 200 -exp_cov 20
1st Question from last week…
What are the limits in read lengths (e.g. Sanger ~1000 - 1500)
to NGS assemblers?
From P.4 of ALLPATHS-LG manual:
Capabilities and limitations
ALLPATHS-LG is a short-read assembler. It has been designed
to use reads produced by new sequencing technology
machines such as the Illumina Genome Analyser. the version
described here has been optimized for, but not necessarily
limited to, reads of length 100 bases.
ALLPATHS is not designed to assemble Sanger or 454 FLX
reads, or a mix of these with short reads.
1st Question from last week…
On p5 of the Velvet manual:
Read lengths are stored on signed 16bit integers,
meaning that if you are assembling contigs longer than
32kb long, then more memory is required to store the
coordinates. To do so, simply add the following option to
the make command:
make 'LONGSEQUENCES=1‘
(Note the single quotes and absence of spacing.)
This will cost more memory overhead.
2nd Question from last week…
What are the limits to insert sizes of libraries?
From P.10 of ALLPATHS-LG manual:
Supported library constructions
…any input dataset should include as least one fragment library and one
jumping library... A jumping library has a longer separation, typically in the
3kbp-10kbp range...
…Additionally, ALLPATHS also supports long jumping libraries. A jumping
library is considered to be long if the insert size is larger than 20 kbp.
In Velvet Manual, P.10
Shows a command line switch example of –ins_length_long=40000
Now available: Galaxy @ WestGrid
https://joffre.westgrid.ca/galaxy/
Accessing the Westgrid Galaxy instance
Use your Westgrid ID (email name without @part)
to log into Joffre, e.g. if your email is
‘[email protected]’, your server access id is
‘rbruskie’, and use your WestGrid password
Logging into the Galaxy instance
Once into Galaxy, you need to register (initially) or
log in (if already registered) using your username
(your full email, e.g. ‘[email protected]’) and
(important!) use your WestGrid password as the
Galaxy password
Transcriptome Assembly - Overview
As in whole genome, one can have a reference
based (‘map based’) assembly, based on read
alignment, and a ‘de novo’ assembly, based on De
Bruijn graph construction.
In some respects, transcriptome assembly can be
more challenging due to splice isoforms and
overlapping transcripts, and other issues.
For a detailed review of the issues and available
software, see
Martin JA and Wang Z. 2011.
Next-generation transcriptome assembly.
Nature Reviews Genetics 12:671-682
Assembly by Mapping:
Bowtie/TopHat/Cufflinks Suite
Bowtie2: Ultrafast short read alignment
http://bowtie-bio.sourceforge.net/bowtie2
TopHat: is a fast splice junction mapper for RNA-Seq reads.
It aligns RNA-Seq reads to large genomes using the ultra
high-throughput short read aligner Bowtie, and then
analyzes the mapping results to identify splice junctions
between exons.
http://tophat.cbcb.umd.edu
Cufflinks: Isoform assembly and quantitation for RNA-Seq.
http://cufflinks.cbcb.umd.edu/
It is non-trivial to install this software suite…
Fortunately, the software is installed under Galaxy and
some useful tutorials are available (see
https://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seqanalysis-exercise)
de novo Assembly: Velvet (last week) + Oases
Obtain version of oases compatible with velvet
http://www.ebi.ac.uk/~zerbino/oases/
wget …oases_latest.tgz
tar –zxvf oases_latest.tgz
make VELVET_DIR=/path/to/velvet
Put on your $PATH
Velvet + Oases with (BAM) paired end read data
Running velveth:
velveth outputdir k_mer –bam
-shortPaired read_data.bam
Running velvetg:
velvetg outputdir
-ins_length 250
-exp_cov auto
Run oases:
oases outputdir -scaffolding yes
-min_trans_lgth 100 -ins_length2 250
-unused_reads yes
Sort, Filter and Cluster your Transcripts
Sorting and clustering transcripts. Can use the
‘usearch’ tool (http://www.drive5.com/usearch/)
usearch --sort transcripts.fa
--output transcripts.sorted.fa
--minlen min# --maxlen max# --log sorted.log
usearch --cluster transcripts.sorted.fa
--id 0.95
--seedsout $@ --uc results.uc
--minlen min# --maxlen max#
--log clustered.log
trans-Abyss
Obtain software:
Download
http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
tar –zxvf …/trans-ABySS-v1.3.2.tar.gz
Need to look under the release web page for the manual link.
http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss/
releases/1.3.2
Consult this file for full details about how to set up and run
trans-ABySS (non-trivial to set up, many dependencies)
To execute, first need to run ABySS (abyss-pe) over a series of kmer values, then run the pipeline.
Unfortunately, NOT installed (yet) under Galaxy…