INT - NGS Workshop @ CBI
Download
Report
Transcript INT - NGS Workshop @ CBI
Samtools、BEDTools、
FASTX-Toolkit
北京大学 生命科学学院 叶永鑫
2011年5月23日 星期一
Outline
Samtools
操作sam/bam文件(mapping)的工具
附加BCFtools
操作vcf/bcf文件(SNP/indel calling)的工具
BEDTools
操作bed文件(block feature annotation)的工具
FASTX-Toolkit
操作fasta/fastq文件的工具
Samtools
SAM文件格式
Sequence Alignment/Map format
存储mapping的结果的Tab分隔的文本文件
header section (optional)
@开头,例如:
@SQ SN:ref LN:45
alignment section
11个必需的字段,例如:
r001 163 ref 7 30 8M2I4M1D3M =
37 39 TTAGATAAAGGATACTG *
header section
@HD header line
VN* format version
SO sorting order of alignments
@SQ reference seq. dictionary
SN* reference seq. name
LN* ref. seq. length
AS genome assembly identifier
M5 MD5 checksum
SP Species
UR URI
header section
@RG read group
ID* read group identifier
CN seq. center
DS description
DT date of the run
FO flow order
KS key sequence
LB library
PG programs
PI predicted median insert size
PL platform/technology
PU platform unit
SM sample
header section
@PG program
ID* program record identifier
PN program name
CL command line
PP previous @PG-ID
VN program version
@CO comment (one-line text)
alignment section
--- mandatory fields
1
2
3
4
5
6
QNAME Query template NAME
FLAG bitwise flag
RNAME Ref. seq. NAME
POS 1-based leftmose POSition
MAPQ MAPping Quality
CIGAR CIGAR string
M match/mismatch
I insertion D deletion (相对reference)
S soft clipping H hard clipping P padding
N skip = match X mismatch
alignment section
--- mandatory fields
7 MRNM Mate Ref. sequence NaMe
('=' if same as RNAME)
8 MPOS 1-based Mate POSition
9 ISIZE Inferred insert SIZE
10 SEQ query SEQuence
11 QUAL query QUALity
(ASCII of Phred-scaled base quality)
alignment section
--- mandatory fields - FLAG
0x001 p the read is paired in sequencing
0x002 P the read is mapped in a proper pair
0x004 u the query seq. itself is unmapped
0x008 U the mate is unmapped
0x010 r strand of the query (1 for reverse)
0x020 R strand of the mate
0x040 1 the read is the first read in a pair
0x080 2 the read is the second read in a pair
0x100 s the alignment is not primary
0x200 f the read fails platform quality checks
0x400 d the read is either a PCR or optical dup.
alignment section
--- optional fields
TAG:TYPE:VALUE
CS color read seq.
CQ color read quality
MD string for mismatching position
aim to achieve SNP/indel calling without looking at
the reference
NM edit distance to the refernce
RG read group
match @RG - ID if @RG exists
……
BAM文件格式
Binary SAM
compressed in the BGZF format
查看bam文件内容:
samtools view [-h] aln.bam | less –S
bam可被排序:
samtools sort aln.bam aln.sorted
排序的bam可建索引:
samtools index aln.sorted.bam
Samtools
sam、bam转换
view、import△
查看sam、bam文件
view、tview
排序bam文件
sort
建索引
faidx、index、tabix*
Samtools
初步统计
flagstats、idxstats、depth*
改动bam文件
reheader、merge、cat*
校正bam文件
fixmate、calmd、rmdup
call SNP/small indel
pileup、mpileup
Samtools
Samtools设计为可以在流(管道)上
执行,在需要FILE处填 '-' 就可表示
STDIN或STDOUT。
命令行未跟足够参数时就会显示
Samtools各程序的简要帮助。eg.
samtools
samtools view
samtools sort
sam、bam转换
sam -> bam
△samtools import ref.fa.fai in.sam out.bam
if in.sam w/ @SQ:
samtools view -bS in.sam > out.bam
if in.sam w/o @SQ:
samtools faidx ref.fa
samtools view -bt ref.fa.fai in.sam > out.bam
or
samtools view -bT ref.fa in.sam > out.bam
bam -> sam
samtools view [-h] in.bam > out.sam
samtools view
samtools view [options] in.bam/sam
[region1 [...]]
-? longer help
-b output in BAM format
-c instead, only count
-f INT only output alignments with all bits in
FLAG [0]
-F INT skip alignments with FLAG bits [0]
-h include the header in output
-H output the header only
-l STR only output reads in library STR [null]
samtools view
-o FILE output file [stdout]
-q INT skip alignments with map quality
smaller than INT [0]
-r STR only output reads in read group STR
[null]
-R FILE output reads in read groups listed in
FILE [null]
-S input is in SAM,
if @SQ is absent, -t or -T is required
-T FILE 指定reference sequence file. 若尚未
建立索引,会自动先建立索引,类似ref.fa.fai
samtools view
-t FILE tab-delimited file, each line contains
ref. name and length; can be the resultant
index file ref.fa.fai by 'samtools faidx ref.fa'
-u output uncompressed bam, for pipe
-x output FLAG in HEX (eg. 0x00)
-X output FLAG in string
region (1-based)
chr2
chr2:1000 (start from 1000)
chr2:1,000-2,000 (include 2000)
samtools view
eg.
samtools view -bS in.sam > out.bam
samtools view -bt ref.fa.fai in.sam > out.bam
samtools view -bT ref.fa in.sam > out.bam
samtools view [-h] in.bam > out.sam
samtools view ex1.bam seq1:1-10 | less -S
NOTE: random retrieval need indexed
samtools view -x ex1.bam | less -S
samtools view -X ex1.bam | less -S
对bam文件排序和建索引
samtools sort 可以对bam文件进行排序。
samtools index 可以对已排序的bam文件
建立索引,从而使得对该bam文件的
random retrieval成为可能。
eg.
samtools sort aln.bam aln.sorted
(产生已排序的aln.sorted.bam文件)
samtools index aln.sorted.bam
(产生aln.sorted.bam.bai索引文件)
samtools sort
samtools sort [options] in.bam out.prefix
-o output the final alignment to the standard
output
即把排序的bam文件输出到STDOUT;仍然需
要在命令行指定out.prefix,否则不能执行,但
不产生out.prefix.bam
-n sort by read names rather than by
chromosomal coordinates
-m INT approximately the maximum required
memory
建立索引
建立索引的目的是使 random retrieval 成为
可能
samtools index aln.bam
为已排序的aln.bam建索引,将创建
aln.bam.bai
samtools faidx ref.fasta [region1 [...]]
为reference fasta建索引,将创建
ref.fasta.fai
samtools tview
samtools tview in.sorted.bam [ref.fasta]
in.sorted.bam需要先用samtools index建索引
ref.fasta可不用先建索引;若无索引,
samtools tview会自动去建立ref.fasta.fai索引
In samtools tview:
? help
g goto a region
chrM:1000; seq1:10
=1000 (if same reference sequence)
q exit
samtools tview
h,j,k,l & ←,↑,↓,→ small scroll (1 base)
H,J,K,L large scroll (20 bases)
[space], [backspace] scroll (1 screen)
m color for Mapping quality
n color for Nucleotide
b color for Base quality
. dot view
r read name
初步统计
samtools idxstats in.bam
in.bam需要已建索引
输出tab分隔的文本,每个ref. seq一行:
ref. seq. name
seq. length
# mapped reads
# unmapped reads
初步统计
samtools flagstat in.bam
in.bam不需要先建索引
输出多行文本,每行一种被统计项目,
统计reads数目
*samtools depth [options] in.bam
in.bam需要已排序
输出tab分隔的文本,每个位点一行,统
计每个位点的覆盖的reads数
output eg. chrM 112 3
初步统计
*samtools depth [options] in.bam
-r STR region
-q INT base quality threshold (only
count base quality >= INT)
-Q INT mapping quality threshold (only
count mapping quality >= INT)
-b FILE bed file
eg. samtools depth -r chrM:100-200 -q 30
aln.sorted.bam | awk '$3>2' | less -S
改动bam文件
samtools reheader in.header.sam in.bam
改变bam文件的header section
比BAM->SAM->BAM快
samtools merge [options] [-h header.sam]
out.bam in1.bam in2.bam
合并已排序的bam文件
(必需含有数量相同的reference sequences)
*samtools cat [-h header.sam]
[-o out.bam] in1.bam in2.bam [...]
直接上下连接bam文件内容
改动bam文件
samtools merge [options]
out.bam in1.bam in2.bam
-h FILE use this sam file's headers, copied
to out.bam
-R STR merge files in the secified region
-r attach an RG tag to each alignment. The
tag value is inferred from file names
-n the input alignments are sorted by read
names rather than by chr. coordinates
-u output uncompressed BAM
改动bam文件
samtools merge eg.
perl -e 'print
"@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n
@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"'
>rg.txt
samtools merge -rh rg.txt merged.bam
ga.bam 454.bam
NOTE: in the merged.bam, reads from
ga.bam will be attached RG:Z:ga, while reads
from 454.bam will be attached RG:Z:454
校正bam文件
samtools fixmate in.bam out.bam
in.bam需要已按read name排序,
out.bam也将会按read name排序
fixmate用来校正mate的坐标、
ISIZE(Inferred insert SIZE)、相关FLAG
samtools calmd [options] aln.bam ref.fa
calmd用来生成或检查每个alignment的
MD(string for mismatching position)标签,
默认输出SAM到STDOUT
校正bam文件
samtools calmd [options] aln.bam ref.fa
-A modify the quality string
-b compressed BAM output
-e change identical bases to '='
-S input is SAM with header
-r compute the BQ tag (w/o -A)
or cap baseQ by BAQ (w/ -A)
-u uncompressed BAM output
校正bam文件
samtools rmdup [-sS] in.bam out.bam
in.bam需要已排序
-s remove dup. for SE reads
-S treat PE reads as SE (force -s)
(default) remove dup. for PE reads
rmdup用来去除可能的PCR duplicates
一致的外部坐标,仅mapping quality不同
处理PE时,依赖正确的ISIZE
对于PE,不能正确处理unpaired reads的情况
(Picard的MarkDuplicates能正确处理该情况)
rmdup example
Merge sorted alignments and remove
potential PCR/optical duplicates:
perl -e 'print
"@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n
@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"'
> rg.txt
samtools merge -rh rg.txt - ga.bam 454.bam |
samtools rmdup - - |
samtools rmdup -s - aln.bam
call SNP / small indel
pileup -> pileup format
mpileup -> bcf format -> bcftools
NOTE: since 0.1.10,
pileup is deprecated by mpileup
samtools pileup
samtools pileup [options] in.bam/sam
输入的in.bam需要先排序
-B disable the BAQ (base alignment quality)
computation
-c call the consensus sequence
-C INT coefficient for adjusting mapping
quality of poor mappings [0]
-d INT limit maximum depth for indels for
speed up. 0 for unlimited [1024]
samtools pileup
-f FILE the reference sequence, in FASTA;
index file FILE.fai will be created if absent
-G FLOAT prior of an indel between two
haplotypes (for -c) [0.00015]
-i only show lines/consensus with indels
-I INT phred prob. of an indel in
sequencing/prep. [40]
-l FILE list of sites at which pileup is output
samtools pileup
-m INT filter reads with flag containing bits in
INT [1796(0x704, usfd)]
-M INT cap mapping quality at INT [60]
-N INT # haplotypes in the sample (for -c)
(>=2) [2]
-Q INT min base quality [13]
-r FLOAT prior of a difference between two
haplotypes (for -c) [0.001]
-S the input is in SAM
samtools pileup
-t FILE list of reference names and
sequence lengths (force -S)
-T FLOAT the theta parameter (error
dependency coefficient) in MAQ consensus
calling model (for -c) [0.83]
-v print variants only (for -c)
NOTE: pileup is deprecated, please use
mpileup
samtools pileup
default output: pileup format
每一行都是一个genomic position,
包含以下6列:
1 chromosome name
2 coordinate
3 reference base
4 read coverage/depth
5 read bases
6 alignment mapping qualities
samtools pileup
read bases
. for a match to the ref. on the + strand
, for a match to the ref. on the - strand
> < for a reference skip
ACGTN for a mismatch on + strand
acgtn for a mismatch on - strand
+NumSeq insertion
-NumSeq deletion
^ start of a read
$ end of a read
* deleted base
samtools pileup
when -c is apllied,
在原先第3、第4列之间增添4列,总共10列:
4 consensus base
5 consensus quality
6 SNP quality
7 root mean square (RMS) mapping quality
8 read coverage/depth
9 read bases
10 alignment mapping qualities
samtools pileup
an indel occupies an additional line (-i)
1 chromosome same
2 coordinate
3 just '*'
4 genotype
5 consensus quality
6 SNP quality
7 RMS mapping quality
samtools pileup
8 # covering reads
9 the first allele
10 the second allele
11 # reads supporting the first allele
12 # reads supporting the second allele
13 # reads containing indels different from
the top two alleles
……
samtools pileup
eg.
samtools pileup -f human_chrM.fasta
bwa_result.sorted.bam | less -S
samtools pileup -vcf human_chrM.fasta
bwa_result.sorted.bam | less -S
samtools pileup -ivcf human_chrM.fasta
bwa_result.sorted.bam | less -S
samtools.pl varFilter
samtools.pl varFilter [options]
in.cns_pileup
输入 samtools pileup -vcf 的结果,进行过滤
-D INT maximum read depth [100]
-d INT minimum read depth [3]
-G INT min indel score for nearby SNP
filtering [25]
-i INT minimum indel quality []
-l INT window size for filtering adjacent gaps
[30]
samtools.pl varFilter
-N INT max number of SNPs in a window [2]
-p print filtered variants to STDERR
-Q INT minimum RMS mapping quality for
SNPs [25]
-q INT minimum RMS mapping quality for
gaps [10]
-S INT minimum SNP quality []
-W INT window size for filtering dense SNPs
[10]
-w INT SNP within INT bp around a gap to
be filtered [10]
samtools.pl pileup2fq
samtools.pl pileup2fq [options]
in.cns_pileup
输入 samtools pileup -cf 的结果,转换为
FASTQ格式
-d INT min depth [3]
-D INT max depth [255]
-G INT min indel score [25]
-l INT indel filter window size [10]
-Q INT min RMS mapping quality [25]
pileup example
Call Variants
samtools pileup -vcf ref.fa aln.bam |
tee raw.txt |
samtools.pl varFilter -D 100 > flt.txt
Please remember to set the -D to set a maximum
read depth according to the average read depth
(×2). In whole genome shotgun (WGS)
resequencing, SNPs with excessively high read
depth are usually caused by structural variations
or alignment artifacts and should not be trusted.
awk '($3=="*"&&$6>=50)
||($3!="*"&&$6>=20)' flt.txt > final.txt
pileup example
Get the consensus sequence
samtools view -u aln.bam X |
samtools pileup -cf hsRef.fa - |
samtools.pl pileup2fq -D100 > var-X.fq
if you want to call for one chromosome only,
use 'samtools view' to specify the region.
(just like above)
http://sourceforge.net/apps/mediawiki/samtoo
ls/index.php?title=SAM_protocol
pileup example
Before pileup, you can use 'samtools view'
or other command to filter the mapping
reads/alignments.
'samtools view' can filter on mapping
quality (-q), flags (-f, -F), read group (-r, -R)
or library (-l).
other filters can be specified using perl or
awk on SAM format
pileup example
use reads with <=2 differenced:
samtools view -h aln.bam |
perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))'
| samtools pileup -S - > out.txt
(NM: edit distance to the refernce)
exclude all gapped alignments:
samtools view -h aln.bam |
awk '$6!~/[ID]/' | samtools pileup -S -
pileup example
set a threashold on mapping quality
samtools view -bq 1 aln.bam
> aln-reliable.bam
exclude read group ERR00001
samtools view -h in.bam |
grep -v "\<RG:Z:ERR00001\>" |
samtools view -bS - >out.bam
samtools mpileup
samtools mpileup [options] in.bam
[in2.bam [...]]
generate BCF or pileup for one or multiple
BAM files.
Alignment records are grouped by sample
identifiers in @RG header lines.
If sample identifiers are absent, each input
file is regarded as one sample.
samtools mpileup
-B disable BAQ (base alignment quality)
computation
-C INT coefficient for adjusting mapping
quality of poor mappings [0 (disable)]
the recommended value for BWA is 50
-e INT phred-scaled gap extension
sequencing error probability [20]
reducing leads to longer indels
-f FILE the reference file [null]
-g compute genotype likelihoods,
and output them in BCF(binary call format)
samtools mpileup
-h INT coefficient for modeling homopolymer
errors [100]
-I do not perform indel calling
-l FILE file containing a list of sites where
pileup or BCF is outputed [null]
-o INT phred-scaled gap open sequencing
error probability [40]
reducing leads to more indel calls
-P STR comma dilimited list of platforms for
indels [all]
samtools mpileup
-q INT min mapping quality for an alignment
to be used [0]
-Q INT min base quality for a base to be
considered [13]
-r STR only generate pileup in the region
[all sites]
-u similar to -g, but output uncompressed
BCF
VCF and BCF format
VCF(variant call format)
BCF(binary VCF)
查看BCF内容
bcftools view in.bcf
VCF format
meta-information lines
以##开头
the header line
以#开头,tab分隔,data lines每列的解释
#CHROM POS ID REF ALT QUAL
FILTER INFO FORMAT (Sample...)
data lines
tab分隔
VCF format
1 CHROM chr. name
2 POS 1-base position
3 ID variant ID
4 REF ref. seq. at POS
5 ALT comma delimited list of alternative
seq.
6 QUAL phred-scaled prob. of all
samples being homozygous ref.
VCF format
7 FILTER semicolon delimited list of
filters that the variat fails to pass
8 INFO semicolon delimited list of
variant information
DP combined depth across samples
AF allele freq. for each ALT allele
MQ RMS mapping quality
VCF format
9 FORMAT colon delimited list of the
format of individual genotypes in the
following fields
GT genotype; 0 for REF, 1 for ALT1, ...
PL phred-scaled genotype likelihood
order: 00, 01, 11, 02, 12, 22, ...
GQ conditional genotype phred quality
10+ Sample(s) individual genotype
information defined by FORMAT
BCFtools
bcftools
view 查看,转换文件格式
index 为BCF建索引
cat 连接BCF
ld compute all-pair r^2
ldpair compute r^2 between requested pairs
bcftools view
bcftools view [options] in.bcf [region]
-b output BCF instead of VCF
-c SNP calling (force -e)
-e likelihood based analyses
-G suppress all individual genotype info.
-g call genotypes at variant sites (force -c)
-I skip indels
-l FILE list of sites (chr pos) or regions (BED)
to output [all]
-P STR type of prior: full, cond2, flat [full]
-v output potential variant sites only (force -c)
vcfutils.pl
vcfutils.pl
varFilter
vcf2fq
subsam
listsam
fillac
qstats
hapmap2vcf
ucscsnp2vcf
vcfutils.pl varFilter
vcfutils.pl varFilter [options] in.vcf
-a INT min number of alternate bases [2]
-D INT max read depth [10000000]
-d INT min read depth [2]
-p print filtered variants to STDERR
-Q INT min RMS mapping quality for SNPs
[10]
-W INT window size for filtering adjacent
gaps [10]
-w INT SNP with INT bp around a gap to be
filtered [10]
vcfutils.pl vcf2fq
vcfutils.pl vcf2fq [options] all-site.vcf
输入 samtools mpileup -uf | bcftools view -cg
的结果,转换为FASTQ格式
-d INT min depth [3]
-D INT max depth [100000]
-l INT indel filter window size [5]
-Q INT min RMS mapping quality [10]
mpileup example
Call SNPs and short indels for one diploid
individual:
samtools mpileup -uf ref.fa aln.bam |
bcftools view -bvcg - > var.raw.bcf
usually add -C50 to mpileup if mapping quality is
overestimated, eg. using BWA-short
bcftools view var.raw.bcf |
vcfutils.pl varFilter -D 100 > var.flt.vcf
-D controls the max read depth, which should be
adjusted to about twice the average read depth
mpileup example
Call SNPs and short indels for multiple
diploid individuals:
samtools mpileup -P ILLUMINA -ugf ref.fa
*.bam | bcftools view -bvcg - > var.raw.bcf
Individuals are identified from the SM tags in the
@RG header lines.
-P specifies that indel candidates should be
collected only from read groups with @RG-PL tag
set to ILLUMINA
bcftools view var.raw.bcf |
vcfutils.pl varFilter -D 2000 > var.flt.vcf
mpileup example
Derive the AFS (allele frequency spectrum)
on a list of sites from multiple individuals:
samtools mpileup -Igf ref.fa *.bam >all.bcf
bcftools view -bl sites.list all.bcf >sites.bcf
bcftools view -cGP cond2 sites.bcf
>/dev/null 2>sites.1.afs
bcftools view -cGP sites.1.afs sites.bcf
>/dev/null 2>sites.2.afs
bcftools view -cGP sites.2.afs sites.bcf
>/dev/null 2>sites.3.afs
……
mpileup example
Generate the consensus sequence:
samtools mpileup -uf ref.fa aln.bam |
bcftools view -cg - |
vcfutils.pl vcf2fq > cns.fq
Dump BAQ applied alignment for other
SNP callers:
samtools calmd -br aln.bam >aln.baq.bam
Java版本——Picard
类似Samtools功能的Java版本
Samtools可以在流上(管道)操作,
Picard好像一般对文件操作。
Picard的MarkDuplicates能处理Samtools
的rmdup不能处理的一些问题。(注:
MarkDuplicates依赖mate的坐标)
Reference
SAM format
http://samtools.sourceforge.net/SAM1.pdf
Samtools homepage
http://samtools.sourceforge.net/
Samtools manual
http://samtools.sourceforge.net/samtools.shtml
Samtools pileup
http://samtools.sourceforge.net/cns0.shtml
http://samtools.sourceforge.net/pileup.shtml
Reference
Samtools protocal
http://sourceforge.net/apps/mediawiki/samtools/index
.php?title=SAM_protocol
Samtools mpileup
http://samtools.sourceforge.net/mpileup.shtml
VCF format
http://vcftools.sourceforge.net/specs.html
http://www.1000genomes.org/wiki/Analysis/Variant%
20Call%20Format/vcf-variant-call-format-version-40
http://www.1000genomes.org/wiki/Analysis/Variant%
20Call%20Format/vcf-variant-call-format-version-41
BEDTools
BED文件格式
BED format是UCSC Genome Browser和
Ensembl Genome Browser都支持的,描
述annotation track信息的文件格式。
它是tab分隔的文本文件,可以有3-12列。
前3列是必需的:
1 chrom name of chr. or scaffold
2 chromStart
start pos. of the feature (include) (0-based)
3 chromEnd
end pos. of the feature (not include)
BED文件格式
第4-9列是可选的:
4 name the name of the BED line
5 score 0~1000, higher, darker
6 strand
7 thickStart starting pos. drawn thickly
8 thickEnd ending pos. drawn thickly
9 itemRgb display color
BED文件格式
第10-12列也是可选的,UCSC使用它们,
而Ensembl忽略它们:
10 blockCount # blocks (exons) in the line
11 blockSizes comma-separated, block
sizes, should correspond to blockCount
12 blockStarts comma-separated, block
starts, should correspond to blockCount
eg. snps.bed
chr1 100 101 A/G 100
chr1 200 201 C/G 1000
BedGraph文件格式
4列tab分隔的文本文件
1 chromName
2 chromStart
3 chromEnd
4 dataValue
BEDTools
文件格式转换 (bam->bed)
bamToBed
计算交(overlap)、差、补
intersectBed
pairToBed
subtractBed
pairToPair
complementBed
windowBed
BEDTools
改动BED文件
mergeBed
slopBed
shuffleBed
sortBed
计算coverage
coverageBed
genomeCoverageBed
BEDTools
其他
closestBed
fastaFromBed
maskFastaFromBed
linksBed
分组简单统计
groupBy
-h display help page
bamToBed
bamToBed -i in.bam > out.bed
convert BAM to BED6 format
-bed12
convert to BED12 format
-bedpe
convert to BEDPE format
-ed
use edit distance (NM tag) as BED score
intersectBed
intersectBed -a a.bed -b b.bed
report overlaps between two BED files
-c
for each A, report # overlaps with B (count)
-u
report original A once if any overlaps found
in B
-v
only report those A that have no overlaps
with B (similar to 'grep -v')
intersectBed
intersectBed -a a.bed -b b.bed
-wa
report entire, original entry in A for each
overlap
-wb
report entire, original entry in B for each
overlap
intersectBed
intersectBed -a a.bed -b b.bed
-wa -wb
report entire, original entry in A and B for
each overlap
-wo
report original A and B and # bp of overlap
between the two features
intersectBed
intersectBed -a a.bed -b b.bed
-f FLOAT
minimum overlap required as a fraction of A
-f FLOAT -r
require that the fraction overlap be reciprocal
for A and B
-s
force strandedness, only report when A and B
on the same strand
intersectBed
intersectBed -abam a.bam -b b.bed
>out.bam
input file a is in BAM, and ouput will be BAM
-ubam
write uncompressed BAM output
-bed
when using BAM input, write output as BED
-a 可跟 stdin,表示从STDIN读入
intersectBed
eg.
find genes that overlap Lines but not SINEs:
intersectBed -a genes.bed -b LINEs.bed |
intersectBed -a stdin -b SINEs.bed -v
retain SE BAM that overlap exons:
intersectBed -abam reads.bam -b exons.bed
>reads.touchingExons.bam
intersectBed
screen novel SNP
intersectBed -a snp.calls.bed -b dbSnp.bed -v
| intersectBed -a stdin -b 1KG.bed -v
>snp.calls.novel.bed
pairToBed
pairToBed -a a.bedpe -b b.bed
report overlaps between a BEDPE and a BED
-type either (default)
report overlaps if either end of A overlaps B
-type neither
report A if neither end of A overlaps B
-s
force strandedness
pairToBed
pairToBed -a a.bedpe -b b.bed
-type both
report overlaps if both ends of A overlap B
-type notboth
report overlaps if neither or only one end of A
overlap B
-type xor
report overlaps if only one end of A overlap B
pairToBed
pairToBed -a a.bedpe -b b.bed
-type ispan
report overlaps between [end1, start2] of A,B
-type ospan
report overlaps between [start1, end2] of A,B
-type notispan
report A if ispan of A does not overlap B
-type notospan
report A if ospan of A does not overlap B
pairToBed
pairToBed -abam a.bam -b b.bed >out.bam
input file a is PE BAM, and output will be in
BAM
-ubam
write uncompressed BAM output
-bedpe
when using BAM input, write output as BEDPE
pairToBed
eg.
return all structural variants that overlap with
genes on either end:
pairToBed -a sv.bedpe -b genes.bed
>sv.genes
retain only PE alignments where neither or
only one end overlaps segmental duplications:
pairToBed -abam reads.bam -b segdups.bed type notboth >reads.notbothSSRs.bam
pairToPair
pairToPair -a a.bedpe -b b.bedpe
report overlaps between two BEDPE
-type both (default)
report overlaps if both ends of A overlap B
-type either
report overlaps if either ends of A overlap B
-type neither
report overlaps if neither end of A overlaps B
pairToPair
eg.
find all SVs in sample 1 that are also in
sample 2:
pairToPair -a 1.sv.bedpe -b 2.sv.bedpe |
cut -f 1-10 > 1.sv.in2.bedpe
find all SVs in sample 1 that are not in sample
2:
pairToPair -a 1.sv.bedpe -b 2.sv.bedpe -type
neither | cut -f 1-10 > 1.sv.notin2.bedpe
windowBed
windowBed -a a.bed -b b.bed
examines a window around each A, and
reports all features in B that overlap the
window
for each overlap, A and B are reported
-w INT
bp added upstream and downstream of
each A (symterical windows around A)
windowBed
windowBed -a a.bed -b b.bed
-l INT
bp added upstream (left) of each A
(assymterical windows around A)
-r INT
bp added downstream (right) of each A
(assymterical windows around A)
-sw
define -l and -r based on strand
windowBed
windowBed -a a.bed -b b.bed
-sm
only report hits in B that overlap A on the
same strand
-abam FILE
-ubam
-bed
windowBed
eg.
report all SNPs that are within 5000bp
upstream or 1000bp downstream of genes:
windowBed -a genes.bed -b snps.bed
-l 5000 -r 1000 -sw
subtractBed
subtractBed -a a.bed -b b.bed
removes A that overlap by any feature in B
-f FLOAT
minimum overlap required as a fraction of A
-s
force strandedness
eg.
remove introns from gene features:
subtractBed -a genes.bed -b intron.bed
complementBed
complementBed -i in.bed -g genome
return bp complement of the feature file
genome file should tab-delimited and contain:
chromName chromSize
how to get chromSize? see -h
eg.
report all intervals in human genome that are
not covered by repetitive elements:
complementBed -i repeatMasker.bed -g
hg18.genome
mergeBed
mergeBed -i in.bed
merge overlapping entries into a single entry
-n
report # BED entries merged
'1' is reported if no merging occurred
-d INT
maximum distance (bp) between features
allowed to be merged [0]
mergeBed
mergeBed -i in.bed
-s
force strandedness
-scores STR
how to determine the scores of the merged
sum, min, max, mean, median, mode,
antimode, collapse
slopBed
slopBed -i in.bed -g genome
将feature向两侧扩展
-b INT/FLOAT
in each direction
-l INT/FLOAT
in left (upstream) direction
-r INT/FLOAT
in right (downstream) direction
slopBed
slopBed -i in.bed -g genome
-s
define -l and -r based on strand
-pct
specify a fraction of the feature's length,
instead of absolute bases
only with -pct, -b/-l/-r can be given FLOAT
eg.
slopBed -i probes.bed -g hg19.genome
-b 500
shuffleBed
shuffleBed -i in.bed -g genome
randomly permute the locations of features
among a genome
-excl BED_FILE
prevent placing in these intervals
-f FLOAT
maximum overlap fraction of the feature with
an -excl feature
-chrom
keep features on the same chromosome
sortBed
sortBed -i in.bed
sort BED file, by chrom, then start position
输出排序好的BED到STDOUT
see -h for more options
coverageBed
coverageBed -a a.bed -b b.bed
compute the depth and breadth of coverage
of features from A on intervals in B
default output: (for each entry in B)
# A overlapped the B interval
# bases in B had >0 coverage
the length of B
the fraction of bases in B had >0 coverage
coverageBed
coverageBed -a a.bed -b b.bed
-s
force strandedness
-d
report the depth at each position (base) in
each B
eg.
compute coverage and create a BEDGRAPH
coverageBed -a reads.bed -b win10kb.bed |
cut -f 1-4 > win10kb.cov.bedg
coverageBed
by default, coverageBed counts any
feature in A that overlaps B by >= 1 bp.
If you want to specify a minimum fraction,
you can first use intersectBed:
intersectBed -a a.bed -b b.bed -f 1.0 |
coverageBed -a stdin -b b.bed
>a_totally_in_b.coverage
coverageBed
work with samtools or bam file
bamToBed -i reads.bam |
coverageBed -a stdin -b exons.bed
>exons.bed.coverage
samtools view -bf 0x2 reads.bam |
bamToBed -i stdin |
coverageBed -a stdin -b exons.bed
>exons.bed.pair_proper_mapped.coverage
coverageBed
compute separatedly for each strand
bamToBed -i reads.bam | grep '\+$' |
coverageBed -a stdin -b genes.bed
>genes.bed.forward.coverage
bamToBed -i reads.bam | grep '\-$' |
coverageBed -a stdin -b genes.bed
>genes.bed.reverse.coverage
genomeCoverageBed
genomeCoverageBed -i in.bed -g genome
compute the coverage among a genome
input BED file must be grouped by chr.
default output: (for each chromosome in B)
depth (0~max)
# bases in the chr. had this depth
the length of the chr.
the fraction of bases in the chr. that had this
depth
-strand +/-
genomeCoverageBed
-ibam FILE
input in BAM
must be sorted by position (samtools sort)
-d
report the depth at each position (base)
-bg
report depth in BedGraph format
-bga
report depth in BedGraph format, also report
zero coverage regions
genomeCoverageBed
eg.
genomeCoverageBed
-ibam bwa_result.sorted.bam
-g human_chrM.genome
genomeCoverageBed
-ibam bwa_result.sorted.bam
-g human_chrM.genome |
grep '^chrM' |
awk 'BEGIN{SUM = 0} {SUM += ($2*$5)}
END {print SUM}'
closestBed
closestBed -a a.bed -b b.bed
for each A, finds the closest feature in B
-t all (default)
report all ties
-t first
report the first tie that occurred in B
-t last
report the last tie that occurred in B
closestBed
closestBed -a a.bed -b b.bed
-s
force strandedness, find the closest feature
on the same strand
eg.
find the closest ALU to each gene:
closestBed -a genes.bed -b ALUs.bed
fastaFromBed
maskFastaFromBed -fi in.fasta
-bed extract.bed -fo out.fasta
根据extract.bed文件,从in.fasta中抽取一些区
域,输出到out.fasta
-s
force strandedness
-name
use name field for fasta header
maskFastaFromBed
maskFastaFromBed -fi in.fasta
-bed mask.bed -fo out.fasta
根据mask.bed文件,mask掉in.fasta中的一些
区域,输出到out.fasta
-soft
soft masking: mask with lower-case instead
of N
-mc CHAR
replace masking char, instead of N
linksBed
linksBed -i in.bed > out.html
创建一个链到UCSC browser的HTML文件
-base
browser basename [http://genome.ucsc.edu]
-org
organism [human]
-db
reference build [hg18]
groupBy
groupBy 可以指定几列(-g)作为组别,
然后分组简单统计(-o)指定列(-c)
Deprecated. Now in the filo package.
-i FILE
input file, assume STDIN if omitted
-g STR / -grp STR
specify the columns (1-based) for grouping,
comma separated [1,2,3]
-c INT / -opCols INT
specify the column (1-based) to be
summarized, required
groupBy
-o STR / -ops STR
specify the operation applied to -c column
sum, count, min, max, mean, median,
mode(most common), antimode(least
common), stdev, sstdev(sample stdev),
collapse(comma separated), concat(nondelimited), freqdesc, freqasc [sum]
can be comma separated,
eg. sum,mean,max
groupBy
-inheader
input has a header line, would be ignored
-outheader
output header line, showing column names
BEDTools Notes
All BEDTools load B file into memory and
process A file one-by-one. Therefore when
possible, set smaller file to be B file.
Most of BEDTools allow A file to be STDIN
from pipes, by using '-a stdin'
Reference
BED format
http://genome-test.cse.ucsc.edu/FAQ/FAQformat
http://www.ensembl.org/info/website/upload/bed.html
BedGraph format
http://genome.ucsc.edu/goldenPath/help/bedgraph.ht
ml
BEDTools
http://code.google.com/p/bedtools/
filo package (for groupBy)
https://github.com/arq5x/filo
FASTX-Toolkit
FASTX-Toolkit
预处理FASTA/FASTQ文件的工具
(mapping之前)
格式转换,切短,切去adapter,分
barcode,quality筛选……
-h show usage information
you can also use them with Galaxy
FASTX-Toolkit
fastq_to_fasta
fastx_quality_stats
fastq_quality_boxplot_graph.sh
fastx_nucleotide_distribution_graph.sh
fastx_clipper
fastx_renamer
fastx_trimmer
FASTX-Toolkit
fastx_collapser
fastx_artifacts_filter
fastx_quality_filter
fastx_reverse_complement
fasta_formatter
fasta_nucleotide_changer
fasta_clipping_histogram.pl
fasta_barcode_splitter.pl
Reference
FASTX-Toolkit homepage
http://hannonlab.cshl.edu/fastx_toolkit/
FASTX-Toolkit manual
http://hannonlab.cshl.edu/fastx_toolkit/comma
ndline.html
Thanks for your attention!