INT - NGS Workshop @ CBI

Transcript INT - NGS Workshop @ CBI

Samtools、BEDTools、
FASTX-Toolkit
北京大学生命科学学院叶永鑫
2011年5月23日星期一
Outline
 Samtools
 操作sam/bam文件(mapping)的工具
 附加BCFtools
 操作vcf/bcf文件(SNP/indel calling)的工具
 BEDTools
 操作bed文件(block feature annotation)的工具
 FASTX-Toolkit
 操作fasta/fastq文件的工具
Samtools
SAM文件格式
 Sequence Alignment/Map format
 存储mapping的结果的Tab分隔的文本文件
 header section (optional)
 @开头，例如：
 @SQ SN:ref LN:45
 alignment section
 11个必需的字段，例如：
 r001 163 ref 7 30 8M2I4M1D3M =
37 39 TTAGATAAAGGATACTG *
header section
 @HD header line
 VN* format version
 SO sorting order of alignments
 @SQ reference seq. dictionary
 SN* reference seq. name
 LN* ref. seq. length
 AS genome assembly identifier
 M5 MD5 checksum
 SP Species
 UR URI
header section
 @RG read group
 ID* read group identifier
 CN seq. center
 DS description
 DT date of the run
 FO flow order
 KS key sequence
 LB library
 PG programs
 PI predicted median insert size
 PL platform/technology
 PU platform unit
 SM sample
header section
 @PG program
 ID* program record identifier
 PN program name
 CL command line
 PP previous @PG-ID
 VN program version
 @CO comment (one-line text)
alignment section
--- mandatory fields






1
2
3
4
5
6
QNAME Query template NAME
FLAG bitwise flag
RNAME Ref. seq. NAME
POS 1-based leftmose POSition
MAPQ MAPping Quality
CIGAR CIGAR string
 M match/mismatch
I insertion D deletion （相对reference）
S soft clipping H hard clipping P padding
N skip = match X mismatch
alignment section
--- mandatory fields
 7 MRNM Mate Ref. sequence NaMe
('=' if same as RNAME)
 8 MPOS 1-based Mate POSition
 9 ISIZE Inferred insert SIZE
 10 SEQ query SEQuence
 11 QUAL query QUALity
(ASCII of Phred-scaled base quality)
alignment section
--- mandatory fields - FLAG











0x001 p the read is paired in sequencing
0x002 P the read is mapped in a proper pair
0x004 u the query seq. itself is unmapped
0x008 U the mate is unmapped
0x010 r strand of the query (1 for reverse)
0x020 R strand of the mate
0x040 1 the read is the first read in a pair
0x080 2 the read is the second read in a pair
0x100 s the alignment is not primary
0x200 f the read fails platform quality checks
0x400 d the read is either a PCR or optical dup.
alignment section
--- optional fields
 TAG:TYPE:VALUE
 CS color read seq.
 CQ color read quality
 MD string for mismatching position
 aim to achieve SNP/indel calling without looking at
the reference
 NM edit distance to the refernce
 RG read group
 match @RG - ID if @RG exists
 ……
BAM文件格式
 Binary SAM
 compressed in the BGZF format
 查看bam文件内容：
 samtools view [-h] aln.bam | less –S
 bam可被排序：
 samtools sort aln.bam aln.sorted
 排序的bam可建索引：
 samtools index aln.sorted.bam
Samtools
 sam、bam转换
 view、import△
 查看sam、bam文件
 view、tview
 排序bam文件
 sort
 建索引
 faidx、index、tabix*
Samtools
 初步统计
 flagstats、idxstats、depth*
 改动bam文件
 reheader、merge、cat*
 校正bam文件
 fixmate、calmd、rmdup
 call SNP/small indel
 pileup、mpileup
Samtools
 Samtools设计为可以在流（管道）上
执行，在需要FILE处填 '-' 就可表示
STDIN或STDOUT。
 命令行未跟足够参数时就会显示
Samtools各程序的简要帮助。eg.
 samtools
 samtools view
 samtools sort
sam、bam转换
 sam -> bam
 △samtools import ref.fa.fai in.sam out.bam
if in.sam w/ @SQ:
 samtools view -bS in.sam > out.bam
if in.sam w/o @SQ:
 samtools faidx ref.fa
 samtools view -bt ref.fa.fai in.sam > out.bam
or
 samtools view -bT ref.fa in.sam > out.bam
 bam -> sam
 samtools view [-h] in.bam > out.sam
samtools view
 samtools view [options] in.bam/sam
[region1 [...]]
 -? longer help
 -b output in BAM format
 -c instead, only count
 -f INT only output alignments with all bits in
FLAG [0]
 -F INT skip alignments with FLAG bits [0]
 -h include the header in output
 -H output the header only
 -l STR only output reads in library STR [null]
samtools view
 -o FILE output file [stdout]
 -q INT skip alignments with map quality
smaller than INT [0]
 -r STR only output reads in read group STR
[null]
 -R FILE output reads in read groups listed in
FILE [null]
 -S input is in SAM,
if @SQ is absent, -t or -T is required
 -T FILE 指定reference sequence file. 若尚未
建立索引，会自动先建立索引，类似ref.fa.fai
samtools view
 -t FILE tab-delimited file, each line contains
ref. name and length; can be the resultant
index file ref.fa.fai by 'samtools faidx ref.fa'
 -u output uncompressed bam, for pipe
 -x output FLAG in HEX (eg. 0x00)
 -X output FLAG in string
 region (1-based)
 chr2
 chr2:1000 (start from 1000)
 chr2:1,000-2,000 (include 2000)
samtools view
 eg.
 samtools view -bS in.sam > out.bam
 samtools view -bt ref.fa.fai in.sam > out.bam
 samtools view -bT ref.fa in.sam > out.bam
 samtools view [-h] in.bam > out.sam
 samtools view ex1.bam seq1:1-10 | less -S
NOTE: random retrieval need indexed
 samtools view -x ex1.bam | less -S
 samtools view -X ex1.bam | less -S
对bam文件排序和建索引
 samtools sort 可以对bam文件进行排序。
 samtools index 可以对已排序的bam文件
建立索引，从而使得对该bam文件的
random retrieval成为可能。
 eg.
 samtools sort aln.bam aln.sorted
（产生已排序的aln.sorted.bam文件）
 samtools index aln.sorted.bam
（产生aln.sorted.bam.bai索引文件）
samtools sort
 samtools sort [options] in.bam out.prefix
 -o output the final alignment to the standard
output
即把排序的bam文件输出到STDOUT；仍然需
要在命令行指定out.prefix，否则不能执行，但
不产生out.prefix.bam
 -n sort by read names rather than by
chromosomal coordinates
 -m INT approximately the maximum required
memory
建立索引
 建立索引的目的是使 random retrieval 成为
可能
 samtools index aln.bam
为已排序的aln.bam建索引，将创建
aln.bam.bai
 samtools faidx ref.fasta [region1 [...]]
为reference fasta建索引，将创建
ref.fasta.fai
samtools tview
 samtools tview in.sorted.bam [ref.fasta]
 in.sorted.bam需要先用samtools index建索引
 ref.fasta可不用先建索引；若无索引，
samtools tview会自动去建立ref.fasta.fai索引
 In samtools tview:
 ? help
 g goto a region
 chrM:1000; seq1:10
 =1000 (if same reference sequence)
 q exit
samtools tview
 h,j,k,l & ←,↑,↓,→ small scroll (1 base)
 H,J,K,L large scroll (20 bases)
 [space], [backspace] scroll (1 screen)
 m color for Mapping quality
 n color for Nucleotide
 b color for Base quality
 . dot view
 r read name
初步统计
 samtools idxstats in.bam
 in.bam需要已建索引
 输出tab分隔的文本，每个ref. seq一行：
 ref. seq. name
 seq. length
 # mapped reads
 # unmapped reads
初步统计
 samtools flagstat in.bam
 in.bam不需要先建索引
 输出多行文本，每行一种被统计项目，
统计reads数目
 *samtools depth [options] in.bam
 in.bam需要已排序
 输出tab分隔的文本，每个位点一行，统
计每个位点的覆盖的reads数
 output eg. chrM 112 3
初步统计
 *samtools depth [options] in.bam
 -r STR region
 -q INT base quality threshold (only
count base quality >= INT)
 -Q INT mapping quality threshold (only
count mapping quality >= INT)
 -b FILE bed file
 eg. samtools depth -r chrM:100-200 -q 30
aln.sorted.bam | awk '$3>2' | less -S
改动bam文件
 samtools reheader in.header.sam in.bam
 改变bam文件的header section
比BAM->SAM->BAM快
 samtools merge [options] [-h header.sam]
out.bam in1.bam in2.bam
 合并已排序的bam文件
（必需含有数量相同的reference sequences）
 *samtools cat [-h header.sam]
[-o out.bam] in1.bam in2.bam [...]
 直接上下连接bam文件内容
改动bam文件
 samtools merge [options]
out.bam in1.bam in2.bam
 -h FILE use this sam file's headers, copied
to out.bam
 -R STR merge files in the secified region
 -r attach an RG tag to each alignment. The
tag value is inferred from file names
 -n the input alignments are sorted by read
names rather than by chr. coordinates
 -u output uncompressed BAM
改动bam文件
 samtools merge eg.
 perl -e 'print
"@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n
@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"'
>rg.txt
 samtools merge -rh rg.txt merged.bam
ga.bam 454.bam
 NOTE: in the merged.bam, reads from
ga.bam will be attached RG:Z:ga, while reads
from 454.bam will be attached RG:Z:454
校正bam文件
 samtools fixmate in.bam out.bam
 in.bam需要已按read name排序，
out.bam也将会按read name排序
 fixmate用来校正mate的坐标、
ISIZE(Inferred insert SIZE)、相关FLAG
 samtools calmd [options] aln.bam ref.fa
 calmd用来生成或检查每个alignment的
MD(string for mismatching position)标签，
默认输出SAM到STDOUT
校正bam文件
 samtools calmd [options] aln.bam ref.fa
 -A modify the quality string
 -b compressed BAM output
 -e change identical bases to '='
 -S input is SAM with header
 -r compute the BQ tag (w/o -A)
or cap baseQ by BAQ (w/ -A)
 -u uncompressed BAM output
校正bam文件
 samtools rmdup [-sS] in.bam out.bam
 in.bam需要已排序
 -s remove dup. for SE reads
 -S treat PE reads as SE (force -s)
 (default) remove dup. for PE reads
 rmdup用来去除可能的PCR duplicates
 一致的外部坐标，仅mapping quality不同
 处理PE时，依赖正确的ISIZE
 对于PE，不能正确处理unpaired reads的情况
（Picard的MarkDuplicates能正确处理该情况）
rmdup example
 Merge sorted alignments and remove
potential PCR/optical duplicates:
 perl -e 'print
"@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n
@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"'
> rg.txt
 samtools merge -rh rg.txt - ga.bam 454.bam |
samtools rmdup - - |
samtools rmdup -s - aln.bam
call SNP / small indel
 pileup -> pileup format
 mpileup -> bcf format -> bcftools
 NOTE: since 0.1.10,
pileup is deprecated by mpileup
samtools pileup
 samtools pileup [options] in.bam/sam
 输入的in.bam需要先排序
 -B disable the BAQ (base alignment quality)
computation
 -c call the consensus sequence
 -C INT coefficient for adjusting mapping
quality of poor mappings [0]
 -d INT limit maximum depth for indels for
speed up. 0 for unlimited [1024]
samtools pileup
 -f FILE the reference sequence, in FASTA;
index file FILE.fai will be created if absent
 -G FLOAT prior of an indel between two
haplotypes (for -c) [0.00015]
 -i only show lines/consensus with indels
 -I INT phred prob. of an indel in
sequencing/prep. [40]
 -l FILE list of sites at which pileup is output
samtools pileup
 -m INT filter reads with flag containing bits in
INT [1796(0x704, usfd)]
 -M INT cap mapping quality at INT [60]
 -N INT # haplotypes in the sample (for -c)
(>=2) [2]
 -Q INT min base quality [13]
 -r FLOAT prior of a difference between two
haplotypes (for -c) [0.001]
 -S the input is in SAM
samtools pileup
 -t FILE list of reference names and
sequence lengths (force -S)
 -T FLOAT the theta parameter (error
dependency coefficient) in MAQ consensus
calling model (for -c) [0.83]
 -v print variants only (for -c)
 NOTE: pileup is deprecated, please use
mpileup
samtools pileup
 default output: pileup format
 每一行都是一个genomic position，
包含以下6列：
 1 chromosome name
 2 coordinate
 3 reference base
 4 read coverage/depth
 5 read bases
 6 alignment mapping qualities
samtools pileup
 read bases
 . for a match to the ref. on the + strand
 , for a match to the ref. on the - strand
 > < for a reference skip
 ACGTN for a mismatch on + strand
 acgtn for a mismatch on - strand
 +NumSeq insertion
 -NumSeq deletion
 ^ start of a read
 $ end of a read
 * deleted base
samtools pileup
 when -c is apllied,
 在原先第3、第4列之间增添4列，总共10列：
 4 consensus base
 5 consensus quality
 6 SNP quality
 7 root mean square (RMS) mapping quality
 8 read coverage/depth
 9 read bases
 10 alignment mapping qualities
samtools pileup
 an indel occupies an additional line (-i)
 1 chromosome same
 2 coordinate
 3 just '*'
 4 genotype
 5 consensus quality
 6 SNP quality
 7 RMS mapping quality
samtools pileup
 8 # covering reads
 9 the first allele
 10 the second allele
 11 # reads supporting the first allele
 12 # reads supporting the second allele
 13 # reads containing indels different from
the top two alleles
 ……
samtools pileup
 eg.
 samtools pileup -f human_chrM.fasta
bwa_result.sorted.bam | less -S
 samtools pileup -vcf human_chrM.fasta
bwa_result.sorted.bam | less -S
 samtools pileup -ivcf human_chrM.fasta
bwa_result.sorted.bam | less -S
samtools.pl varFilter
 samtools.pl varFilter [options]
in.cns_pileup
 输入 samtools pileup -vcf 的结果，进行过滤
 -D INT maximum read depth [100]
 -d INT minimum read depth [3]
 -G INT min indel score for nearby SNP
filtering [25]
 -i INT minimum indel quality []
 -l INT window size for filtering adjacent gaps
[30]
samtools.pl varFilter
 -N INT max number of SNPs in a window [2]
 -p print filtered variants to STDERR
 -Q INT minimum RMS mapping quality for
SNPs [25]
 -q INT minimum RMS mapping quality for
gaps [10]
 -S INT minimum SNP quality []
 -W INT window size for filtering dense SNPs
[10]
 -w INT SNP within INT bp around a gap to
be filtered [10]
samtools.pl pileup2fq
 samtools.pl pileup2fq [options]
in.cns_pileup
 输入 samtools pileup -cf 的结果，转换为
FASTQ格式
 -d INT min depth [3]
 -D INT max depth [255]
 -G INT min indel score [25]
 -l INT indel filter window size [10]
 -Q INT min RMS mapping quality [25]
pileup example
 Call Variants
 samtools pileup -vcf ref.fa aln.bam |
tee raw.txt |
samtools.pl varFilter -D 100 > flt.txt
 Please remember to set the -D to set a maximum
read depth according to the average read depth
(×2). In whole genome shotgun (WGS)
resequencing, SNPs with excessively high read
depth are usually caused by structural variations
or alignment artifacts and should not be trusted.
 awk '($3=="*"&&$6>=50)
||($3!="*"&&$6>=20)' flt.txt > final.txt
pileup example
 Get the consensus sequence
 samtools view -u aln.bam X |
samtools pileup -cf hsRef.fa - |
samtools.pl pileup2fq -D100 > var-X.fq
 if you want to call for one chromosome only,
use 'samtools view' to specify the region.
(just like above)
 http://sourceforge.net/apps/mediawiki/samtoo
ls/index.php?title=SAM_protocol
pileup example
 Before pileup, you can use 'samtools view'
or other command to filter the mapping
reads/alignments.
 'samtools view' can filter on mapping
quality (-q), flags (-f, -F), read group (-r, -R)
or library (-l).
 other filters can be specified using perl or
awk on SAM format
pileup example
 use reads with <=2 differenced:
 samtools view -h aln.bam |
perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))'
| samtools pileup -S - > out.txt
 (NM: edit distance to the refernce)
 exclude all gapped alignments:
 samtools view -h aln.bam |
awk '$6!~/[ID]/' | samtools pileup -S -
pileup example
 set a threashold on mapping quality
 samtools view -bq 1 aln.bam
> aln-reliable.bam
 exclude read group ERR00001
 samtools view -h in.bam |
grep -v "\<RG:Z:ERR00001\>" |
samtools view -bS - >out.bam
samtools mpileup
 samtools mpileup [options] in.bam
[in2.bam [...]]
 generate BCF or pileup for one or multiple
BAM files.
 Alignment records are grouped by sample
identifiers in @RG header lines.
 If sample identifiers are absent, each input
file is regarded as one sample.
samtools mpileup
 -B disable BAQ (base alignment quality)
computation
 -C INT coefficient for adjusting mapping
quality of poor mappings [0 (disable)]
 the recommended value for BWA is 50
 -e INT phred-scaled gap extension
sequencing error probability [20]
 reducing leads to longer indels
 -f FILE the reference file [null]
 -g compute genotype likelihoods,
and output them in BCF(binary call format)
samtools mpileup
 -h INT coefficient for modeling homopolymer
errors [100]
 -I do not perform indel calling
 -l FILE file containing a list of sites where
pileup or BCF is outputed [null]
 -o INT phred-scaled gap open sequencing
error probability [40]
 reducing leads to more indel calls
 -P STR comma dilimited list of platforms for
indels [all]
samtools mpileup
 -q INT min mapping quality for an alignment
to be used [0]
 -Q INT min base quality for a base to be
considered [13]
 -r STR only generate pileup in the region
[all sites]
 -u similar to -g, but output uncompressed
BCF
VCF and BCF format
 VCF(variant call format)
 BCF(binary VCF)
 查看BCF内容
 bcftools view in.bcf
VCF format
 meta-information lines
 以##开头
 the header line
 以#开头，tab分隔，data lines每列的解释
 #CHROM POS ID REF ALT QUAL
FILTER INFO FORMAT (Sample...)
 data lines
 tab分隔
VCF format





1 CHROM chr. name
2 POS 1-base position
3 ID variant ID
4 REF ref. seq. at POS
5 ALT comma delimited list of alternative
seq.
 6 QUAL phred-scaled prob. of all
samples being homozygous ref.
VCF format
 7 FILTER semicolon delimited list of
filters that the variat fails to pass
 8 INFO semicolon delimited list of
variant information
 DP combined depth across samples
 AF allele freq. for each ALT allele
 MQ RMS mapping quality
VCF format
 9 FORMAT colon delimited list of the
format of individual genotypes in the
following fields
 GT genotype; 0 for REF, 1 for ALT1, ...
 PL phred-scaled genotype likelihood
 order: 00, 01, 11, 02, 12, 22, ...
 GQ conditional genotype phred quality
 10+ Sample(s) individual genotype
information defined by FORMAT
BCFtools
 bcftools
 view 查看，转换文件格式
 index 为BCF建索引
 cat 连接BCF
 ld compute all-pair r^2
 ldpair compute r^2 between requested pairs
bcftools view
 bcftools view [options] in.bcf [region]
 -b output BCF instead of VCF
 -c SNP calling (force -e)
 -e likelihood based analyses
 -G suppress all individual genotype info.
 -g call genotypes at variant sites (force -c)
 -I skip indels
 -l FILE list of sites (chr pos) or regions (BED)
to output [all]
 -P STR type of prior: full, cond2, flat [full]
 -v output potential variant sites only (force -c)
vcfutils.pl
 vcfutils.pl
 varFilter
 vcf2fq
 subsam
 listsam
 fillac
 qstats
 hapmap2vcf
 ucscsnp2vcf
vcfutils.pl varFilter
 vcfutils.pl varFilter [options] in.vcf
 -a INT min number of alternate bases [2]
 -D INT max read depth [10000000]
 -d INT min read depth [2]
 -p print filtered variants to STDERR
 -Q INT min RMS mapping quality for SNPs
[10]
 -W INT window size for filtering adjacent
gaps [10]
 -w INT SNP with INT bp around a gap to be
filtered [10]
vcfutils.pl vcf2fq
 vcfutils.pl vcf2fq [options] all-site.vcf
 输入 samtools mpileup -uf | bcftools view -cg
的结果，转换为FASTQ格式
 -d INT min depth [3]
 -D INT max depth [100000]
 -l INT indel filter window size [5]
 -Q INT min RMS mapping quality [10]
mpileup example
 Call SNPs and short indels for one diploid
individual:
 samtools mpileup -uf ref.fa aln.bam |
bcftools view -bvcg - > var.raw.bcf
 usually add -C50 to mpileup if mapping quality is
overestimated, eg. using BWA-short
 bcftools view var.raw.bcf |
vcfutils.pl varFilter -D 100 > var.flt.vcf
 -D controls the max read depth, which should be
adjusted to about twice the average read depth
mpileup example
 Call SNPs and short indels for multiple
diploid individuals:
 samtools mpileup -P ILLUMINA -ugf ref.fa
*.bam | bcftools view -bvcg - > var.raw.bcf
 Individuals are identified from the SM tags in the
@RG header lines.
 -P specifies that indel candidates should be
collected only from read groups with @RG-PL tag
set to ILLUMINA
 bcftools view var.raw.bcf |
vcfutils.pl varFilter -D 2000 > var.flt.vcf
mpileup example
 Derive the AFS (allele frequency spectrum)
on a list of sites from multiple individuals:
 samtools mpileup -Igf ref.fa *.bam >all.bcf
 bcftools view -bl sites.list all.bcf >sites.bcf
 bcftools view -cGP cond2 sites.bcf
>/dev/null 2>sites.1.afs
 bcftools view -cGP sites.1.afs sites.bcf
>/dev/null 2>sites.2.afs
 bcftools view -cGP sites.2.afs sites.bcf
>/dev/null 2>sites.3.afs
 ……
mpileup example
 Generate the consensus sequence:
 samtools mpileup -uf ref.fa aln.bam |
bcftools view -cg - |
vcfutils.pl vcf2fq > cns.fq
 Dump BAQ applied alignment for other
SNP callers:
 samtools calmd -br aln.bam >aln.baq.bam
Java版本——Picard
 类似Samtools功能的Java版本
 Samtools可以在流上（管道）操作，
Picard好像一般对文件操作。
 Picard的MarkDuplicates能处理Samtools
的rmdup不能处理的一些问题。（注：
MarkDuplicates依赖mate的坐标）
Reference
 SAM format
 http://samtools.sourceforge.net/SAM1.pdf
 Samtools homepage
 http://samtools.sourceforge.net/
 Samtools manual
 http://samtools.sourceforge.net/samtools.shtml
 Samtools pileup
 http://samtools.sourceforge.net/cns0.shtml
 http://samtools.sourceforge.net/pileup.shtml
Reference
 Samtools protocal
 http://sourceforge.net/apps/mediawiki/samtools/index
.php?title=SAM_protocol
 Samtools mpileup
 http://samtools.sourceforge.net/mpileup.shtml
 VCF format
 http://vcftools.sourceforge.net/specs.html
 http://www.1000genomes.org/wiki/Analysis/Variant%
20Call%20Format/vcf-variant-call-format-version-40
 http://www.1000genomes.org/wiki/Analysis/Variant%
20Call%20Format/vcf-variant-call-format-version-41
BEDTools
BED文件格式
 BED format是UCSC Genome Browser和
Ensembl Genome Browser都支持的，描
述annotation track信息的文件格式。
 它是tab分隔的文本文件，可以有3-12列。
 前3列是必需的：
 1 chrom name of chr. or scaffold
 2 chromStart
start pos. of the feature (include) (0-based)
 3 chromEnd
end pos. of the feature (not include)
BED文件格式
 第4-9列是可选的：
 4 name the name of the BED line
 5 score 0~1000, higher, darker
 6 strand
 7 thickStart starting pos. drawn thickly
 8 thickEnd ending pos. drawn thickly
 9 itemRgb display color
BED文件格式
 第10-12列也是可选的，UCSC使用它们，
而Ensembl忽略它们：
 10 blockCount # blocks (exons) in the line
 11 blockSizes comma-separated, block
sizes, should correspond to blockCount
 12 blockStarts comma-separated, block
starts, should correspond to blockCount
 eg. snps.bed
 chr1 100 101 A/G 100
 chr1 200 201 C/G 1000
BedGraph文件格式
 4列tab分隔的文本文件
 1 chromName
 2 chromStart
 3 chromEnd
 4 dataValue
BEDTools
 文件格式转换 (bam->bed)
 bamToBed
 计算交(overlap)、差、补
 intersectBed
 pairToBed
 subtractBed
 pairToPair
 complementBed
 windowBed
BEDTools
 改动BED文件
 mergeBed
 slopBed
 shuffleBed
 sortBed
 计算coverage
 coverageBed
 genomeCoverageBed
BEDTools
 其他
 closestBed
 fastaFromBed
 maskFastaFromBed
 linksBed
 分组简单统计
 groupBy
 -h display help page
bamToBed
 bamToBed -i in.bam > out.bed
 convert BAM to BED6 format
 -bed12
 convert to BED12 format
 -bedpe
 convert to BEDPE format
 -ed
 use edit distance (NM tag) as BED score
intersectBed
 intersectBed -a a.bed -b b.bed
 report overlaps between two BED files
 -c
 for each A, report # overlaps with B (count)
 -u
 report original A once if any overlaps found
in B
 -v
 only report those A that have no overlaps
with B (similar to 'grep -v')
intersectBed
 intersectBed -a a.bed -b b.bed
 -wa
 report entire, original entry in A for each
overlap
 -wb
 report entire, original entry in B for each
overlap
intersectBed
 intersectBed -a a.bed -b b.bed
 -wa -wb
 report entire, original entry in A and B for
each overlap
 -wo
 report original A and B and # bp of overlap
between the two features
intersectBed
 intersectBed -a a.bed -b b.bed
 -f FLOAT
 minimum overlap required as a fraction of A
 -f FLOAT -r
 require that the fraction overlap be reciprocal
for A and B
 -s
 force strandedness, only report when A and B
on the same strand
intersectBed
 intersectBed -abam a.bam -b b.bed
>out.bam
 input file a is in BAM, and ouput will be BAM
 -ubam
 write uncompressed BAM output
 -bed
 when using BAM input, write output as BED
 -a 可跟 stdin，表示从STDIN读入
intersectBed
 eg.
 find genes that overlap Lines but not SINEs:
 intersectBed -a genes.bed -b LINEs.bed |
intersectBed -a stdin -b SINEs.bed -v
 retain SE BAM that overlap exons:
 intersectBed -abam reads.bam -b exons.bed
>reads.touchingExons.bam
intersectBed
 screen novel SNP
 intersectBed -a snp.calls.bed -b dbSnp.bed -v
| intersectBed -a stdin -b 1KG.bed -v
>snp.calls.novel.bed
pairToBed
 pairToBed -a a.bedpe -b b.bed
 report overlaps between a BEDPE and a BED
 -type either (default)
 report overlaps if either end of A overlaps B
 -type neither
 report A if neither end of A overlaps B
 -s
 force strandedness
pairToBed
 pairToBed -a a.bedpe -b b.bed
 -type both
 report overlaps if both ends of A overlap B
 -type notboth
 report overlaps if neither or only one end of A
overlap B
 -type xor
 report overlaps if only one end of A overlap B
pairToBed
 pairToBed -a a.bedpe -b b.bed
 -type ispan
 report overlaps between [end1, start2] of A,B
 -type ospan
 report overlaps between [start1, end2] of A,B
 -type notispan
 report A if ispan of A does not overlap B
 -type notospan
 report A if ospan of A does not overlap B
pairToBed
 pairToBed -abam a.bam -b b.bed >out.bam
 input file a is PE BAM, and output will be in
BAM
 -ubam
 write uncompressed BAM output
 -bedpe
 when using BAM input, write output as BEDPE
pairToBed
 eg.
 return all structural variants that overlap with
genes on either end:
 pairToBed -a sv.bedpe -b genes.bed
>sv.genes
 retain only PE alignments where neither or
only one end overlaps segmental duplications:
 pairToBed -abam reads.bam -b segdups.bed type notboth >reads.notbothSSRs.bam
pairToPair
 pairToPair -a a.bedpe -b b.bedpe
 report overlaps between two BEDPE
 -type both (default)
 report overlaps if both ends of A overlap B
 -type either
 report overlaps if either ends of A overlap B
 -type neither
 report overlaps if neither end of A overlaps B
pairToPair
 eg.
 find all SVs in sample 1 that are also in
sample 2:
 pairToPair -a 1.sv.bedpe -b 2.sv.bedpe |
cut -f 1-10 > 1.sv.in2.bedpe
 find all SVs in sample 1 that are not in sample
2:
 pairToPair -a 1.sv.bedpe -b 2.sv.bedpe -type
neither | cut -f 1-10 > 1.sv.notin2.bedpe
windowBed
 windowBed -a a.bed -b b.bed
 examines a window around each A, and
reports all features in B that overlap the
window
 for each overlap, A and B are reported
 -w INT
 bp added upstream and downstream of
each A (symterical windows around A)
windowBed
 windowBed -a a.bed -b b.bed
 -l INT
 bp added upstream (left) of each A
(assymterical windows around A)
 -r INT
 bp added downstream (right) of each A
(assymterical windows around A)
 -sw
 define -l and -r based on strand
windowBed
 windowBed -a a.bed -b b.bed
 -sm
 only report hits in B that overlap A on the
same strand
 -abam FILE
 -ubam
 -bed
windowBed
 eg.
 report all SNPs that are within 5000bp
upstream or 1000bp downstream of genes:
 windowBed -a genes.bed -b snps.bed
-l 5000 -r 1000 -sw
subtractBed
 subtractBed -a a.bed -b b.bed
 removes A that overlap by any feature in B
 -f FLOAT
 minimum overlap required as a fraction of A
 -s
 force strandedness
 eg.
 remove introns from gene features:
 subtractBed -a genes.bed -b intron.bed
complementBed
 complementBed -i in.bed -g genome
 return bp complement of the feature file
 genome file should tab-delimited and contain:
chromName chromSize
 how to get chromSize? see -h
 eg.
 report all intervals in human genome that are
not covered by repetitive elements:
 complementBed -i repeatMasker.bed -g
hg18.genome
mergeBed
 mergeBed -i in.bed
 merge overlapping entries into a single entry
 -n
 report # BED entries merged
 '1' is reported if no merging occurred
 -d INT
 maximum distance (bp) between features
allowed to be merged [0]
mergeBed
 mergeBed -i in.bed
 -s
 force strandedness
 -scores STR
 how to determine the scores of the merged
 sum, min, max, mean, median, mode,
antimode, collapse
slopBed
 slopBed -i in.bed -g genome
 将feature向两侧扩展
 -b INT/FLOAT
 in each direction
 -l INT/FLOAT
 in left (upstream) direction
 -r INT/FLOAT
 in right (downstream) direction
slopBed
 slopBed -i in.bed -g genome
 -s
 define -l and -r based on strand
 -pct
 specify a fraction of the feature's length,
instead of absolute bases
 only with -pct, -b/-l/-r can be given FLOAT
 eg.
 slopBed -i probes.bed -g hg19.genome
-b 500
shuffleBed
 shuffleBed -i in.bed -g genome
 randomly permute the locations of features
among a genome
 -excl BED_FILE
 prevent placing in these intervals
 -f FLOAT
 maximum overlap fraction of the feature with
an -excl feature
 -chrom
 keep features on the same chromosome
sortBed
 sortBed -i in.bed
 sort BED file, by chrom, then start position
 输出排序好的BED到STDOUT
 see -h for more options
coverageBed
 coverageBed -a a.bed -b b.bed
 compute the depth and breadth of coverage
of features from A on intervals in B
 default output: (for each entry in B)
 # A overlapped the B interval
 # bases in B had >0 coverage
 the length of B
 the fraction of bases in B had >0 coverage
coverageBed
 coverageBed -a a.bed -b b.bed
 -s
 force strandedness
 -d
 report the depth at each position (base) in
each B
 eg.
 compute coverage and create a BEDGRAPH
 coverageBed -a reads.bed -b win10kb.bed |
cut -f 1-4 > win10kb.cov.bedg
coverageBed
 by default, coverageBed counts any
feature in A that overlaps B by >= 1 bp.
 If you want to specify a minimum fraction,
you can first use intersectBed:
 intersectBed -a a.bed -b b.bed -f 1.0 |
coverageBed -a stdin -b b.bed
>a_totally_in_b.coverage
coverageBed
 work with samtools or bam file
 bamToBed -i reads.bam |
coverageBed -a stdin -b exons.bed
>exons.bed.coverage
 samtools view -bf 0x2 reads.bam |
bamToBed -i stdin |
coverageBed -a stdin -b exons.bed
>exons.bed.pair_proper_mapped.coverage
coverageBed
 compute separatedly for each strand
 bamToBed -i reads.bam | grep '\+$' |
coverageBed -a stdin -b genes.bed
>genes.bed.forward.coverage
 bamToBed -i reads.bam | grep '\-$' |
coverageBed -a stdin -b genes.bed
>genes.bed.reverse.coverage
genomeCoverageBed
 genomeCoverageBed -i in.bed -g genome
 compute the coverage among a genome
 input BED file must be grouped by chr.
 default output: (for each chromosome in B)
 depth (0~max)
 # bases in the chr. had this depth
 the length of the chr.
 the fraction of bases in the chr. that had this
depth
 -strand +/-
genomeCoverageBed
 -ibam FILE
 input in BAM
 must be sorted by position (samtools sort)
 -d
 report the depth at each position (base)
 -bg
 report depth in BedGraph format
 -bga
 report depth in BedGraph format, also report
zero coverage regions
genomeCoverageBed
 eg.
 genomeCoverageBed
-ibam bwa_result.sorted.bam
-g human_chrM.genome
 genomeCoverageBed
-ibam bwa_result.sorted.bam
-g human_chrM.genome |
grep '^chrM' |
awk 'BEGIN{SUM = 0} {SUM += ($2*$5)}
END {print SUM}'
closestBed
 closestBed -a a.bed -b b.bed
 for each A, finds the closest feature in B
 -t all (default)
 report all ties
 -t first
 report the first tie that occurred in B
 -t last
 report the last tie that occurred in B
closestBed
 closestBed -a a.bed -b b.bed
 -s
 force strandedness, find the closest feature
on the same strand
 eg.
 find the closest ALU to each gene:
 closestBed -a genes.bed -b ALUs.bed
fastaFromBed
 maskFastaFromBed -fi in.fasta
-bed extract.bed -fo out.fasta
 根据extract.bed文件，从in.fasta中抽取一些区
域，输出到out.fasta
 -s
 force strandedness
 -name
 use name field for fasta header
maskFastaFromBed
 maskFastaFromBed -fi in.fasta
-bed mask.bed -fo out.fasta
 根据mask.bed文件，mask掉in.fasta中的一些
区域，输出到out.fasta
 -soft
 soft masking: mask with lower-case instead
of N
 -mc CHAR
 replace masking char, instead of N
linksBed
 linksBed -i in.bed > out.html
 创建一个链到UCSC browser的HTML文件
 -base
 browser basename [http://genome.ucsc.edu]
 -org
 organism [human]
 -db
 reference build [hg18]
groupBy
 groupBy 可以指定几列(-g)作为组别，
然后分组简单统计(-o)指定列(-c)
 Deprecated. Now in the filo package.
 -i FILE
 input file, assume STDIN if omitted
 -g STR / -grp STR
 specify the columns (1-based) for grouping,
comma separated [1,2,3]
 -c INT / -opCols INT
 specify the column (1-based) to be
summarized, required
groupBy
 -o STR / -ops STR
 specify the operation applied to -c column
 sum, count, min, max, mean, median,
mode(most common), antimode(least
common), stdev, sstdev(sample stdev),
collapse(comma separated), concat(nondelimited), freqdesc, freqasc [sum]
 can be comma separated,
eg. sum,mean,max
groupBy
 -inheader
 input has a header line, would be ignored
 -outheader
 output header line, showing column names
BEDTools Notes
 All BEDTools load B file into memory and
process A file one-by-one. Therefore when
possible, set smaller file to be B file.
 Most of BEDTools allow A file to be STDIN
from pipes, by using '-a stdin'
Reference
 BED format
 http://genome-test.cse.ucsc.edu/FAQ/FAQformat
 http://www.ensembl.org/info/website/upload/bed.html
 BedGraph format
 http://genome.ucsc.edu/goldenPath/help/bedgraph.ht
ml
 BEDTools
 http://code.google.com/p/bedtools/
 filo package (for groupBy)
 https://github.com/arq5x/filo
FASTX-Toolkit
FASTX-Toolkit
 预处理FASTA/FASTQ文件的工具
（mapping之前）
 格式转换，切短，切去adapter，分
barcode，quality筛选……
 -h show usage information
 you can also use them with Galaxy
FASTX-Toolkit







fastq_to_fasta
fastx_quality_stats
fastq_quality_boxplot_graph.sh
fastx_nucleotide_distribution_graph.sh
fastx_clipper
fastx_renamer
fastx_trimmer
FASTX-Toolkit








fastx_collapser
fastx_artifacts_filter
fastx_quality_filter
fastx_reverse_complement
fasta_formatter
fasta_nucleotide_changer
fasta_clipping_histogram.pl
fasta_barcode_splitter.pl
Reference
 FASTX-Toolkit homepage
 http://hannonlab.cshl.edu/fastx_toolkit/
 FASTX-Toolkit manual
 http://hannonlab.cshl.edu/fastx_toolkit/comma
ndline.html
Thanks for your attention!

INT - NGS Workshop @ CBI

Transcript INT - NGS Workshop @ CBI

Directory