Transcript NGS QC

Sequencing Data Quality
Saulo Aflitos
Assembly - Concepts
Read (≈100bp)
Contig (≈2Kbp)
Paired-End
Mate-Pair
Scaffold (≈ 2Mbp)
Pseudo Molecule
Low Complexity Region
(Super Scaffold)
Scaffolding
Paired-End
Mate-Pair
Scaffold (≈ 2Mbp)
Pseudo Molecule
(Super Scaffold)
Low Complexity Region
Assembly
Scaffolding
Repeats?!
Reality
1x
Consensus
3x
2x
3x
Contig
Reads
Depth of Coverage
Goldberg SMD et al. 2006
1x
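Depth of coverage can be illustrated with a small sketch. It counts how many reads overlap each contig position (the 1x, 2x, 3x labels above) and computes the expected average depth as total sequenced bases divided by genome size. The function names and the toy reads are assumptions made for illustration; they do not come from the slides.

```python
def depth_per_position(reads, contig_length):
    """Return the number of reads covering each position of a contig.

    `reads` is a list of (start, length) tuples with 0-based starts.
    """
    depth = [0] * contig_length
    for start, length in reads:
        # Clip reads that run past the end of the contig
        for pos in range(start, min(start + length, contig_length)):
            depth[pos] += 1
    return depth


def mean_coverage(n_reads, read_length, genome_size):
    """Expected average depth: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Toy example (hypothetical data): three 4 bp reads on an 8 bp contig
toy_reads = [(0, 4), (2, 4), (3, 4)]
print(depth_per_position(toy_reads, 8))
```

Positions covered by a single read (1x) leave the consensus with no second observation to confirm the base call.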
Heterozygosity
[Figure: three read-pileup columns with their consensus calls. A column of all A reads is called A; a column mixing A, C, G and T is called N; a column mixing A and C is called A/C.]
95% ±5
50% ±10
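One way to read the two percentages is as allele-fraction windows for calling a consensus base: about 95% ±5 of reads agreeing for a homozygous call, about 50% ±10 for each of two alleles for a heterozygous call, and N otherwise. The sketch below assumes that interpretation; the function name and the exact thresholds (0.90, 0.40, 0.60) are assumptions derived from those windows, not a method stated in the slides.

```python
from collections import Counter


def call_consensus(column, hom_min=0.90, het_low=0.40, het_high=0.60):
    """Call a consensus symbol for one pileup column (a string of bases).

    Thresholds are assumptions: 95% +/- 5 -> hom_min = 0.90,
    50% +/- 10 -> het window 0.40 to 0.60.
    """
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        return "N"
    ranked = counts.most_common(2)
    top_base, top_n = ranked[0]
    if top_n / total >= hom_min:
        return top_base  # homozygous call
    if len(ranked) == 2:
        second_base, second_n = ranked[1]
        fractions = (top_n / total, second_n / total)
        if all(het_low <= f <= het_high for f in fractions):
            # heterozygous call, written like the A/C on the slide
            return "/".join(sorted([top_base, second_base]))
    return "N"  # no confident call


print(call_consensus("A" * 20))             # all reads agree
print(call_consensus("A" * 10 + "C" * 10))  # two alleles at 50% each
print(call_consensus("ACGTACGT"))           # no dominant allele
```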
Consequences of Data Cleaning
Non-unique 31-mers volume (bp) / 950 Mbp
265.89
36.83
Raw
Filtered
48.65
27.04
50.37
23.66
41.61
29.72
57.60
1.33
Heinz
All Round
Pennellii
F5
Pimpinellifolium
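The idea of a non-unique k-mer can be sketched as follows: slide a window of k bases over every read, count each k-mer, and keep those seen more than once. This is a minimal in-memory illustration, not the tool used to produce the numbers above; the function name and its return values are assumptions, and how the slide converts counts into a volume in bp is not stated.

```python
from collections import Counter


def nonunique_kmers(sequences, k=31):
    """Count k-mers that occur more than once across all sequences.

    Returns (distinct_nonunique, total_occurrences), where
    total_occurrences sums the counts of those repeated k-mers.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    repeated = [n for n in counts.values() if n > 1]
    return len(repeated), sum(repeated)


# Toy example with k=3 (hypothetical data): ACG occurs twice
print(nonunique_kmers(["ACGTACG"], k=3))
```

A Python dictionary like this only suits small inputs; genome-scale data needs a dedicated k-mer counter.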
Sequencing
Shotgun
RNAseq
Sequencing
Paired End
Mate Pair
Genome
Sample Preparation
Ultrasound
Physical
RE
Shred
Gel
Beads
Size Selection
Adapter
Illumina
454
PacBio
Sequencing
ID
Binding to Surface
Circularization
Shredding
Size Selection
Sequencing
Illumina PE
Insert Size
150bp-2Kbp
100bp
Read Length
100bp
Sequencing
454 MP
Insert Size
2K-20Kbp
Read Length
500bp
150bp
150bp
150bp
Data
FastQ
Machine Name
Read ID (unique)
Encoded Quality
0-40
Chance of being wrong
FastQ Format
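A FastQ record spans four lines: a header starting with @ (which carries the machine name and the unique read ID), the sequence, a separator line starting with +, and the encoded quality string. The sketch below parses such records; the example header is made up for illustration, and the assumption that the machine name is the first colon-separated field follows the common Illumina header layout rather than anything shown on the slide.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality_string) from an iterable of lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        # The next three lines complete the record; "" marks a truncated file
        seq, plus, qual = (next(it, "").rstrip("\n") for _ in range(3))
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Hypothetical record, for illustration only
example = """@MACHINE1:1:FLOWCELL:1:1101:1000:2000 1:N:0:1
GATTACA
+
IIIIII.
"""

for read_id, seq, qual in read_fastq(example.splitlines()):
    machine = read_id.split(":")[0]  # assumed position of the machine name
    print(machine, seq, qual)
```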
FastQ Statistics
13
0.05
5%
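The numbers above follow from the Phred definition, P = 10^(-Q/10): a quality of 13 corresponds to a probability of about 0.05, or a 5% chance that the base is wrong. The sketch below decodes a quality string and converts scores to error probabilities. The ASCII offset of 33 is an assumption (the Sanger / Illumina 1.8+ convention); older Illumina files use an offset of 64.

```python
def decode_quality(qual, offset=33):
    """Convert an encoded quality string into a list of Phred scores.

    offset=33 is an assumption; older Illumina pipelines used 64.
    """
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# '!' encodes Q0, '.' encodes Q13 and 'I' encodes Q40 at offset 33
for q in decode_quality("!.I"):
    print(q, round(phred_to_error_prob(q), 4))
```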
Cleaning
FastQC Quality Checking Tool
Contamination screen
fastq screen
Per base sequence quality
Per base sequence content
Sequence duplication
Per sequence quality
Sequence length distribution
Per base N-content
Per base GC content
Per sequence GC content
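Some of the checks listed above reduce to simple per-position or per-read summaries. The sketch below computes a mean quality per base position, the fraction of N calls per position, and the GC fraction of a read. It only illustrates what the reports measure and is not FastQC's implementation; the function names and the offset of 33 are assumptions.

```python
def per_base_mean_quality(quals, offset=33):
    """Mean Phred score at each read position across all quality strings."""
    max_len = max((len(q) for q in quals), default=0)
    means = []
    for i in range(max_len):
        # Shorter reads simply do not contribute to later positions
        scores = [ord(q[i]) - offset for q in quals if i < len(q)]
        means.append(sum(scores) / len(scores))
    return means


def per_base_n_fraction(seqs):
    """Fraction of reads with an N at each position."""
    max_len = max((len(s) for s in seqs), default=0)
    fractions = []
    for i in range(max_len):
        bases = [s[i].upper() for s in seqs if i < len(s)]
        fractions.append(bases.count("N") / len(bases))
    return fractions


def gc_fraction(seq):
    """Fraction of G and C bases in one read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


print(per_base_mean_quality(["II", "!I"]))
print(per_base_n_fraction(["AN", "AC"]))
print(gc_fraction("GATC"))
```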
SolexaQA Cleaning Tool
Exercise
• Create “cleaning” folder
– mkdir cleaning; cd cleaning
• Inside it, run:
wget -O saulo.bash http://goo.gl/Tx8g6
• Run it with:
bash saulo.bash
• This will download FastQC and SolexaQA
– FASTQC HELP: http://goo.gl/EE8M7
– FASTQC TUTORIAL: http://goo.gl/rihyA
– FASTQC MANUAL: http://goo.gl/9yihC
– SolexaQA Help: http://solexaqa.sourceforge.net/
• Run FastQC:
./FastQC/fastqc &
• File > open [Files of Type = FastQ files]
Exercise
• Verify the two .fq files (you can use less):
– bad_MiSeq_dataset.fq
– good_MiSeq_dataset.fq
• Clean the bad dataset with SolexaQA’s DynamicTrim.pl script:
– perl SolexaQA_v.2.1/DynamicTrim.pl bad_MiSeq_dataset.fq -h 25
• Verify the improvement (or not) by opening
– bad_MiSeq_dataset.fq.trimmed
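To see what the trimming step is doing, here is a sketch of the idea usually described for DynamicTrim: keep the longest contiguous stretch of each read whose quality stays at or above the cutoff (25 in the command above). This is an illustration under that assumption, not the script's own code; the function name, the inclusive comparison, and the ASCII offset of 33 are all assumptions.

```python
def longest_high_quality_segment(seq, qual, cutoff=25, offset=33):
    """Return (sequence, quality) trimmed to the longest run with Q >= cutoff.

    Sketch only: the comparison (>=) and offset=33 are assumptions.
    """
    best_start = best_len = 0
    run_start = None
    for i, ch in enumerate(qual):
        if ord(ch) - offset >= cutoff:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    end = best_start + best_len
    return seq[best_start:end], qual[best_start:end]


# Hypothetical read: '!' is Q0 and 'I' is Q40 at offset 33
print(longest_high_quality_segment("ACGTACGT", "!!IIII!I"))
```

A read with no base at or above the cutoff comes back empty, which is one reason a trimmed file can contain fewer usable reads than the original.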
?