BWA User Guide

bwa - Burrows-Wheeler Alignment Tool

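#1. Introduction

BWA is a software package for mapping low-divergent sequencing reads against a reference genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

  • BWA-backtrack: designed for Illumina reads up to 100bp; invoked with aln/samse/sampe
  • BWA-SW: longer sequences, 70bp-1Mbp; invoked with bwasw
  • BWA-MEM: longer sequences, 70bp-1Mbp; invoked with mem

Note: BWA-MEM and BWA-SW both target long reads, but BWA-MEM, the newest of the three, is generally recommended for high-quality data because it is faster and more accurate. BWA-MEM also performs better than BWA-backtrack on 70-100bp Illumina reads.

For all three algorithms, BWA first needs an FM-index of the reference genome, built with the index command. The alignment algorithms are then invoked with different subcommands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for BWA-MEM.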
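The choice of subcommand described above can be sketched as a small Python helper. This is only an illustration, not part of BWA: the function name `choose_bwa_subcommands` is hypothetical, and treating 70bp as a hard cutoff is an assumption taken from the read-length ranges listed above.

```python
def choose_bwa_subcommands(read_length, paired=False):
    """Return the BWA subcommands suggested by this guide for a read length.

    Hypothetical helper: the 70 bp cutoff comes from the ranges above and is
    treated as a hard threshold purely for illustration.
    """
    if read_length < 70:
        # BWA-backtrack: run `aln` per input file, then emit SAM.
        return ["aln", "sampe"] if paired else ["aln", "samse"]
    # BWA-MEM accepts single-end and paired-end input directly.
    return ["mem"]


print(choose_bwa_subcommands(36))
print(choose_bwa_subcommands(50, paired=True))
print(choose_bwa_subcommands(150, paired=True))
```

In every case the reference must already have been indexed with `bwa index`.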

#2. Installation

##2.1 Installing from GitHub

git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

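##2.2 Installing from a source tarball

bunzip2 bwa-0.5.9.tar.bz2
tar xvf bwa-0.5.9.tar
cd bwa-0.5.9
make
  • Add bwa to PATH (append the export line to ~/.bashrc so that it persists, then reload it)
echo 'export PATH=$PATH:/path/to/bwa-0.5.9' >> ~/.bashrc
source ~/.bashrc
  • Test (the sample output below comes from version 0.7.5a-r405, so the 0.5.9 tarball above will print a different version and command list)
$ bwa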

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.5a-r405
Contact: Heng Li <[email protected]>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'. There are
      three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If
      you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for
      for the manual.

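#3. Usage examples

  • Build the reference index
bwa index -p hg19bwaidx -a bwtsw wg.fa
  • Align with 4 CPUs; the reads are in s_3_sequence.txt.gz
bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa

Note: BWA can read gzip-compressed input directly. The source tutorial also reports problems in the SAM output when aligning with more than 10 CPUs.

  • Convert the alignment to SAM format
bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam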
  • Mapping short reads to RefSeq mRNAs
bwa samse RefSeqbwaidx s_3_sequence.txt.bwa s_3_sequence.txt > s_3_sequence.txt.sam
  • Mapping long reads (454)
bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
#Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly contigs up to a few megabases mapped to a closely related reference genome:
  bwa mem ref.fa reads.fq > aln.sam

#Illumina single-end reads shorter than ~70bp:
  bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam

#Illumina/454/IonTorrent paired-end reads longer than ~70bp:
  bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

#Illumina paired-end reads shorter than ~70bp:
  bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
  bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam

#PacBio subreads or Oxford Nanopore reads to a reference genome:
  bwa mem -x pacbio ref.fa reads.fq > aln.sam
  bwa mem -x ont2d ref.fa reads.fq > aln.sam

#4. Common command lines

bwa index ref.fa
bwa mem ref.fa reads.fq > aln-se.sam
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
bwa aln ref.fa short_read.fq > aln_sa.sai
bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
bwa bwasw ref.fa long_read.fq > aln.sam
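When these commands are driven from a script, the aln/samse/sampe steps can be generated rather than typed by hand. The sketch below only builds the argument lists and runs nothing. The function name `backtrack_commands`, the `.sai` naming scheme and the `aln.sam` output name are assumptions made for illustration.

```python
import shlex


def backtrack_commands(ref, reads, mates=None, threads=1):
    """Build (argv, output_file) steps for the aln -> samse/sampe workflow.

    Hypothetical helper: nothing is executed, and the output file names are
    illustrative choices rather than anything BWA requires.
    """
    fastqs = [reads] if mates is None else [reads, mates]
    sais = [fq + ".sai" for fq in fastqs]
    # One `bwa aln` call per FASTQ file, each writing its own .sai file.
    steps = [(["bwa", "aln", "-t", str(threads), ref, fq], sai)
             for fq, sai in zip(fastqs, sais)]
    # samse for single-end input, sampe for paired-end input.
    sam_cmd = "samse" if mates is None else "sampe"
    steps.append((["bwa", sam_cmd, ref] + sais + fastqs, "aln.sam"))
    return steps


for argv, out in backtrack_commands("ref.fa", "read1.fq", "read2.fq", threads=4):
    print(shlex.join(argv), ">", out)
```

Each step could then be run with `subprocess.run(argv, stdout=open(out, "wb"), check=True)`, since every BWA subcommand here writes its result to standard output.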

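#5. Commands and options

##5.1 index

  • Index database sequences in the FASTA format.

  • bwa index [-p prefix] [-a algoType] <in.db.fasta>

Options
-p STR Prefix of the output database
-a STR Algorithm for constructing the BWT index (is or bwtsw)

    • is: requires 5.37N memory, where N is the size of the database. IS is moderately fast but does not work with databases larger than 2GB. It is the default algorithm because of its simplicity. Use it for smaller reference genomes.
    • bwtsw: the algorithm implemented in BWT-SW. It works with the whole human genome. Use it for larger reference genomes.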
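The rule of thumb above can be turned into a quick check before indexing. This is a minimal sketch: the function name `pick_index_algorithm` is hypothetical, and reading "2GB" as 2 * 10^9 bytes is an assumption.

```python
GB = 10**9  # assumption: "2GB" is read as 2 * 10^9 bytes


def pick_index_algorithm(db_size_bytes):
    """Return (algorithm, estimated IS memory in bytes or None).

    Applies the guide's rule: `is` for databases under 2GB (needing about
    5.37N bytes of memory), `bwtsw` otherwise.
    """
    if db_size_bytes < 2 * GB:
        return "is", 5.37 * db_size_bytes
    return "bwtsw", None


# A 100 Mb genome fits the IS algorithm; a 3.1 Gb genome needs bwtsw.
print(pick_index_algorithm(100 * 10**6))
print(pick_index_algorithm(3_100_000_000))
```

The returned name can be passed to `bwa index -a`.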

##5.2 mem

  • bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] [-w bandWidth] [-d zDropoff] [-r seedSplitRatio] [-c maxOcc] [-A matchScore] [-B mmPenalty] [-O gapOpenPen] [-E gapExtPen] [-L clipPen] [-U unpairPen] [-R RGline] [-v verboseLevel] db.prefix reads.fq [mates.fq]

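Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm seeds alignments with maximal exact matches (MEMs) and then extends the seeds with the affine-gap Smith-Waterman algorithm (SW).

If mates.fq is absent and option -p is not set, the input reads are treated as single-end.

  • If mates.fq is present, the i-th read in reads.fq and the i-th read in mates.fq form a read pair.
  • If -p is used, the 2i-th and the (2i+1)-th reads in reads.fq form a read pair (an interleaved file), and mates.fq is ignored.

In paired-end mode, the mem command infers the read orientation and the insert-size distribution from a batch of reads.

BWA-MEM performs local alignment, so a single read may yield multiple alignments.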
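The two pairing conventions can be illustrated with plain Python lists standing in for FASTQ records. The function names are hypothetical and the sketch does no FASTQ parsing. It only shows which reads end up paired.

```python
def pair_two_files(reads, mates):
    """Pair the i-th read of reads.fq with the i-th read of mates.fq."""
    if len(reads) != len(mates):
        raise ValueError("both files must contain the same number of reads")
    return list(zip(reads, mates))


def pair_interleaved(reads):
    """Pair reads 2i and 2i+1 of a single interleaved file (the -p case)."""
    if len(reads) % 2:
        raise ValueError("an interleaved file must hold an even number of reads")
    return [(reads[i], reads[i + 1]) for i in range(0, len(reads), 2)]


print(pair_two_files(["a/1", "b/1"], ["a/2", "b/2"]))
print(pair_interleaved(["a/1", "a/2", "b/1", "b/2"]))
```

Both calls produce the same pairs, which is why an interleaved file can replace the two-file form.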

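Options

-t INT: Number of threads. [1]

-k INT: Minimum seed length. Matches shorter than INT will be missed. [19]

-w INT: Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely by this option. [100]

-d INT: Off-diagonal X-dropoff (Z-dropoff). [100]

-r FLOAT: Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning performance. A larger value yields fewer seeds, which gives faster alignment but lower accuracy. [1.5]

-c INT: Discard a MEM if it has more than INT occurrences in the genome. [10000]

-P: In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.

-A INT: Matching score. [1]

-B INT: Mismatch penalty. The sequence error rate is approximately {.75 * exp[-log(4) * B/A]}. [4]

-O INT: Gap open penalty. [6]

-E INT: Gap extension penalty. A gap of length k costs O + k*E (i.e. -O is for opening a zero-length gap). [1]

-L INT: Clipping penalty. During SW extension, BWA-MEM keeps track of the best score reaching the end of the query. If this score is larger than the best SW score minus the clipping penalty, clipping is not applied. [5]

-U INT: Penalty for an unpaired read pair. [9]

-p: Assume the first input query file is interleaved paired-end FASTA/Q. See also option -P.

-R STR: Complete read group header line, e.g. '@RG\tID:foo\tSM:bar'. [null]

-T INT: Do not output alignments with a score lower than INT. [30]

-a: Output all found alignments.

-C: Append the FASTA/Q comment to the SAM output.

-H: Use hard clipping 'H' in the SAM output.

-M: Mark shorter split hits as secondary (for Picard compatibility).

-v INT: Control the verbose level of the output.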
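Two of the scoring formulas quoted in the option list, the gap cost O + k*E and the approximate error rate .75 * exp[-log(4) * B/A], are easy to evaluate numerically. The sketch below uses the default values listed above. The function names are hypothetical, and log is taken to be the natural logarithm.

```python
import math


def gap_cost(k, gap_open=6, gap_ext=1):
    """Cost of a gap of length k: O + k*E (defaults are the -O and -E values above)."""
    return gap_open + k * gap_ext


def approx_error_rate(match=1, mismatch=4):
    """Approximate sequence error rate implied by -A and -B:
    0.75 * exp(-ln(4) * B / A)."""
    return 0.75 * math.exp(-math.log(4) * mismatch / match)


print(gap_cost(3))          # a 3 bp gap with the default penalties
print(approx_error_rate())  # error rate implied by A=1, B=4
```

With the defaults, a 3bp gap costs 6 + 3*1 = 9, and the implied error rate is 0.75 / 4^4, roughly 0.3%.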

##5.3 aln

  • bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>

Find the SA coordinates of the input reads. At most maxSeedDiff differences are allowed in the first seedLen subsequence, and at most maxDiff differences are allowed in the whole read.

-n NUM: Maximum edit distance if the value is an INT, or the fraction of missing alignments given a 2% uniform base error rate if it is a FLOAT. [0.04]

-o INT: Maximum number of gap opens. [1]

-e INT: Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps). [-1]

-d INT: Disallow a long deletion within INT bp towards the 3'-end. [16]

-i INT: Disallow an indel within INT bp towards the ends. [5]

-l INT: Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for '-k 2'. [inf]

-k INT: Maximum edit distance in the seed. [2]

-t INT: Number of threads.

-M INT: Mismatch penalty. [3]

-O INT: Gap open penalty. [11]
-E INT: Gap extension penalty. [4]

-R INT: Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).

-c: Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)

-N: Disable iterative search. All hits with no more than maxDiff differences will be found.

-q INT: Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]

-I: The input is in the Illumina 1.3+ read format (quality equals ASCII-64).

-B INT: Length of barcode starting from the 5'-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
-b: Specify that the input read sequence file is in the BAM format. For paired-end data, two ends in a pair must be grouped together and options -1 or -2 are usually applied to specify which end should be mapped. Typical command lines for mapping paired-end data in the BAM format are:

bwa aln ref.fa -b1 reads.bam > 1.sai
bwa aln ref.fa -b2 reads.bam > 2.sai
bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam

-0: When -b is specified, only use single-end reads in mapping.
-1: When -b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
-2: When -b is specified, only use the second read in a read pair in mapping.
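The trimming rule given for -q, argmax_x{\sum_{i=x+1}^l(INT-q_i)} when q_l<INT, may be easier to follow as code. The sketch below implements the formula exactly as written, taking Phred scores as integers. The function name `trimmed_length` is hypothetical, and the real BWA source may apply extra rules (for example a minimum retained length) that are not modelled here.

```python
def trimmed_length(quals, threshold):
    """Return how many leading bases are kept after -q style trimming.

    quals: per-base Phred qualities q_1..q_l; threshold: the INT given to -q.
    Implements argmax_x sum_{i=x+1..l} (INT - q_i), applied only if q_l < INT.
    """
    length = len(quals)
    # No trimming when disabled or when the last base is already good enough.
    if threshold <= 0 or length == 0 or quals[-1] >= threshold:
        return length
    best_keep, best_sum, running = length, 0, 0
    # keep plays the role of x: bases keep+1..l (1-based) are cut off.
    for keep in range(length - 1, -1, -1):
        running += threshold - quals[keep]  # quals[keep] is base keep+1 (1-based)
        if running > best_sum:
            best_keep, best_sum = keep, running
    return best_keep


quals = [30, 30, 30, 30, 2, 2, 30, 2]
print(trimmed_length(quals, 20))
```

In this example the single good base near the 3' end does not rescue the tail: cutting after the fourth base gives the largest sum, so four bases are kept.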

##5.4 samse

  • bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>

  • Generate alignments in the SAM format from single-end reads. Repetitive hits are randomly chosen.

-n INT: Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]

-r STR: Specify the read group in a format like '@RG\tID:foo\tSM:bar'. [null]

##5.5 sampe (paired-end)

  • bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
  • Generate alignments in the SAM format from paired-end reads. Repetitive read pairs are placed randomly.

-a INT: Maximum insert size for a read pair to be considered properly mapped. [500]
-o INT: Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
-P: Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
-n INT: Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
-N INT: Maximum number of alignments to output in the XA tag for discordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
-r STR: Specify the read group in a format like '@RG\tID:foo\tSM:bar'. [null]

##5.6 bwasw

bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq]

Align query sequences in the in.fq file. When mate.fq is present, perform paired-end alignment. The paired-end mode only works for reads from Illumina short-insert libraries. In the paired-end mode, BWA-SW may still output split alignments but they are all marked as not properly paired; the mate positions will not be written if the mate has multiple local hits.

Options:
-a INT Score of a match [1]
-b INT Mismatch penalty [3]
-q INT Gap open penalty [5]
-r INT Gap extension penalty. The penalty for a contiguous gap of size k is q+k*r. [2]
-t INT Number of threads [1]
-w INT Band width in the banded alignment [33]
-T INT Minimum score threshold divided by a [37]
-c FLOAT Coefficient for threshold adjustment according to query length. Given an l-long query, the threshold for a hit to be retained is a*max{T,c*log(l)}. [5.5]
-z INT Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
-s INT Maximum SA interval size for initiating a seed. Higher -s increases accuracy at the cost of speed. [3]
-N INT Minimum number of seeds supporting the resultant alignment to skip reverse alignment. [5]
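The length-dependent threshold in the -c description, a*max{T,c*log(l)}, can be evaluated for a few query lengths to see where the coefficient starts to matter. This is a sketch under one assumption: log is taken to be the natural logarithm, which the option text does not state. The function name `bwasw_hit_threshold` is hypothetical.

```python
import math


def bwasw_hit_threshold(query_len, a=1, T=37, c=5.5):
    """Score a hit must reach to be retained: a * max(T, c * ln(l)).

    Defaults are the -a, -T and -c values listed above; the natural
    logarithm is an assumption.
    """
    return a * max(T, c * math.log(query_len))


for length in (100, 1000, 100000):
    print(length, bwasw_hit_threshold(length))
```

Under this assumption, with the default values, the fixed threshold T dominates for queries of a few hundred bases, and the c*log(l) term takes over at around 1000bp.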

##5.7 SAM alignment format

The output of aln is binary and intended for use by BWA only. BWA converts it to the SAM (Sequence Alignment/Map) format, which has the following columns:

Col  Field  Description
1    QNAME  Query (pair) NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition/coordinate of clipped sequence
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  extended CIGAR string
7    MRNM   Mate Reference sequence NaMe ('=' if same as RNAME)
8    MPOS   1-based Mate POSition
9    ISIZE  Inferred insert SIZE
10   SEQ    query SEQuence on the same strand as the reference
11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12   OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
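The column layout above maps directly onto a tab-separated record. A minimal parsing sketch follows. The `SamRecord` type, the function name and the sample line are all invented for illustration, and only the `i` (integer) value type of optional fields is converted.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into the 11 mandatory columns plus tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for opt in fields[11:]:
        # Optional fields use the TAG:VTYPE:VALUE layout.
        tag, vtype, value = opt.split(":", 2)
        tags[tag] = int(value) if vtype == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Invented example record, not real BWA output.
line = "read1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXT:A:U"
record = parse_sam_line(line)
print(record.rname, record.pos, record.tags)
```

For real data a dedicated SAM/BAM library is the safer choice; the point here is only to show how the twelve columns line up.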

FLAG bits:

Chr  Flag    Description
p    0x0001  the read is paired in sequencing
P    0x0002  the read is mapped in a proper pair
u    0x0004  the query sequence itself is unmapped
U    0x0008  the mate is unmapped
r    0x0010  strand of the query (1 for reverse)
R    0x0020  strand of the mate
1    0x0040  the read is the first read in a pair
2    0x0080  the read is the second read in a pair
s    0x0100  the alignment is not primary
f    0x0200  QC failure
d    0x0400  optical or PCR duplicate
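Because FLAG is a bit field, a value such as 99 is read by testing each bit in turn. The sketch below decodes a FLAG into the one-character codes from the table. The function name `decode_flag` is hypothetical.

```python
# (bit, one-character code) pairs copied from the FLAG table above.
FLAG_BITS = [
    (0x0001, "p"), (0x0002, "P"), (0x0004, "u"), (0x0008, "U"),
    (0x0010, "r"), (0x0020, "R"), (0x0040, "1"), (0x0080, "2"),
    (0x0100, "s"), (0x0200, "f"), (0x0400, "d"),
]


def decode_flag(flag):
    """Return the codes of all bits set in a SAM FLAG, in table order."""
    return "".join(code for bit, code in FLAG_BITS if flag & bit)


print(decode_flag(99))   # 99 = 0x1 + 0x2 + 0x20 + 0x40
print(decode_flag(147))  # 147 = 0x1 + 0x2 + 0x10 + 0x80
```

The two values shown are the typical FLAGs of a properly paired read and its mate: 99 decodes to "pPR1" (first in pair, mate on the reverse strand) and 147 to "pPr2" (second in pair, itself on the reverse strand).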

BWA generates the following optional fields. Tags starting with 'X' are specific to BWA.

Tag  Meaning
NM   Edit distance
MD   Mismatching positions/bases
AS   Alignment score
BC   Barcode sequence
X0   Number of best hits
X1   Number of suboptimal hits found by BWA
XN   Number of ambiguous bases in the reference
XM   Number of mismatches in the alignment
XO   Number of gap opens
XG   Number of gap extensions
XT   Type: Unique/Repeat/N/Mate-sw
XA   Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS   Suboptimal alignment score
XF   Support from forward/reverse alignment
XE   Number of supporting seeds

Note that XO and XG are generated by BWT search while the CIGAR string by Smith-Waterman alignment. These two tags may be inconsistent with the CIGAR string. This is not a bug.
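The XA format (chr,pos,CIGAR,NM;)* is a semicolon-terminated list of comma-separated records, so it can be unpacked with two splits. In the sketch below the function name `parse_xa` and the sample tag are invented. The leading +/- on the position is assumed to encode the strand, and it simply survives as the sign of the parsed integer.

```python
def parse_xa(tag):
    """Parse an XA tag into a list of (chr, pos, cigar, nm) tuples.

    Accepts the value with or without the 'XA:Z:' prefix. A leading '+' or '-'
    on pos (assumed here to mark the strand) is kept as the integer's sign.
    """
    value = tag[len("XA:Z:"):] if tag.startswith("XA:Z:") else tag
    hits = []
    for entry in value.split(";"):
        if not entry:
            continue  # the trailing ';' leaves one empty entry
        chrom, pos, cigar, nm = entry.split(",")
        hits.append((chrom, int(pos), cigar, int(nm)))
    return hits


# Invented example value, not real BWA output.
print(parse_xa("XA:Z:chr1,+12345,50M,1;chr2,-678,50M,2;"))
```

Each tuple describes one alternative hit, which can then be filtered by edit distance (NM) or compared with the primary alignment.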

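##5.8 Alignment accuracy

When seeding is disabled, BWA guarantees to find an alignment containing at most maxDiff differences, including at most maxGapO gap opens, none of which occur within nIndelEnd bp of either end of the read. If maxGapE is positive, longer gaps may also be found, but BWA does not guarantee to find all such hits. When seeding is enabled, BWA additionally requires that the first seedLen subsequence contains no more than maxSeedDiff differences.

When gapped alignment is disabled, note that BWA changes 'N' in the reference to random nucleotides, so hits to these random sequences are also kept.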

#6. Sources

Manual Reference Pages - bwa
Elementolab/BWA tutorial
