Blat & BLAST

Blat

Introduction

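Blat, short for the BLAST-Like Alignment Tool, is a sequence aligner. On DNA, BLAT is designed to find sequences of at least 40 bases with 95% or greater similarity. On protein, it is designed to find sequences of at least 20 amino acids with 80% or greater similarity.

Blat's main strengths are its speed and its collinear output, which is simple and easy to read. For aligning relatively short sequences (such as cDNAs) against a large genome, blat is usually a better choice than blast. Blat stitches related, collinear alignments into larger alignments, which makes exons and introns easy to locate. It is therefore widely used for homology analysis between closely related species and for EST analysis.

Blat can be hundreds of times faster than Blast because the two tools search in fundamentally different ways. Blast indexes the query sequence and then scans the large target database linearly, reading from disk frequently, so its data accesses have poor temporal and spatial locality. Blat instead indexes the large target database and scans the query sequence linearly, which gives much better locality. Blat reads the database index into memory once and can then reuse it repeatedly at high speed without going back to disk. Once the index is built, the more query sequences there are, the greater Blat's advantage.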

Blat is an alignment tool like BLAST, but it is structured differently. Blat produces two major classes of alignments:

  • at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts.
  • at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts.

Installation

wget -c https://users.soe.ucsc.edu/~kent/src/blatSrc35.zip
unzip blatSrc35.zip
cd blatSrc
uname -a                       # check the machine type
export MACHTYPE="x86_64"
mkdir -p ~/bin/$MACHTYPE       # the executables are installed here
mkdir -p lib/$MACHTYPE         # build directory for the library
make > make$(date +%F).log
echo 'export PATH="$HOME/bin/x86_64:$PATH"' >> ~/.bash_profile

How it works

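Blat first splits the reference (database) sequence into tiles (k-mers). How it does so is controlled by two parameters, -tileSize and -stepSize. -tileSize sets the tile length; it is usually between 8 and 12 and defaults to 11 for DNA and 5 for protein. -stepSize sets the distance between the starts of consecutive tiles; it defaults to tileSize, so by default the tiles do not overlap.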
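As a rough illustration of how -tileSize and -stepSize interact, the sketch below cuts a sequence into tiles with awk. The function name tile_sequence is hypothetical and not part of BLAT; it only mimics the tiling step described above, not BLAT's actual index.

```shell
# Hypothetical helper, not part of BLAT: print "offset<TAB>tile" for each tile.
# Usage: tile_sequence SEQUENCE [TILE_SIZE] [STEP_SIZE]
tile_sequence() {
    # Defaults mirror the text: tileSize 11 for DNA, stepSize equal to tileSize.
    awk -v seq="$1" -v k="${2:-11}" -v step="${3:-${2:-11}}" 'BEGIN {
        if (k < 1 || step < 1) exit 1
        # awk strings are 1-based; offsets are reported 0-based
        for (i = 1; i + k - 1 <= length(seq); i += step)
            printf "%d\t%s\n", i - 1, substr(seq, i, k)
    }'
}

tile_sequence ACGTACGTAC 4      # tiles at offsets 0 and 4 (no overlap)
tile_sequence ACGTACGTAC 4 2    # tiles at offsets 0, 2, 4 and 6 (overlapping)
```

A smaller step produces more tiles for the same sequence, which is why lowering -stepSize tends to increase sensitivity at the cost of a larger index.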

Reference: Using blat

Reference: Blat-The BLAST-Like Alignment Tool (a detailed usage tutorial)

Common usage

# Common blat usage
# Run a single job
blat chr11.fa human/test.fa test.psl # output without sequence
blat chr11.fa human/test.fa -out=pslx test.pslx # output with sequence
blat chr11.fa human/test.fa -out=blast test.blast # output similar to NCBI blast format
# Run multiple jobs in parallel
time parallel blat chr{}.fa human/human.fa test_{}.psl ::: {1..22} X Y M
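The default .psl output is a tab-separated table, so it can be summarised with standard tools. The sketch below assumes the standard 21-column PSL layout (matches, misMatches and repMatches in columns 1 to 3, qName, qSize, qStart and qEnd in columns 10 to 13, tName in column 14). The function name psl_summary and its simplified identity formula are assumptions made for illustration, not something blat provides.

```shell
# Hypothetical helper: print query, target, % identity and % query coverage
# for every alignment in a PSL file (or on standard input).
psl_summary() {
    awk -F'\t' '$1 ~ /^[0-9]+$/ {            # skip the PSL header lines
        aligned = $1 + $2 + $3               # matches + misMatches + repMatches
        ident = aligned ? 100 * ($1 + $3) / aligned : 0
        cov = 100 * ($13 - $12) / $11        # aligned query span / query size
        printf "%s\t%s\t%.1f\t%.1f\n", $10, $14, ident, cov
    }' "$@"
}

# Example call on the file produced above:
# psl_summary test.psl
```

The identity here ignores gaps, so it is only a quick screening value and can differ from the figures reported by other tools.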

Parameter details

# blat parameters
# Usage: blat database query [-ooc=11.ooc] output.psl
#database  input file; must be a .fa, .nib or .2bit file
#query  input file; must be a .fa, .nib or .2bit file
#output.psl  output file
#-t=type  database type; one of dna/prot/dnax
#-q=type  query type; one of dna/rna/prot/dnax/rnax
#-prot  same as -t=prot -q=prot
#-ooc=N.ooc Use overused tile file N.ooc.  N should correspond to the tileSize
#-tileSize=N  size of the tiles/kmers
#-stepSize=N  step between tiles/kmers, i.e. the distance between the starts of two adjacent tiles; default is tileSize
#-oneOff=N  if set to 1, one mismatched base is allowed in a tile and it still triggers an alignment; default is 0
#-minMatch=N  number of tile matches required; usually set from 2 to 4; default is 2 for nucleotide, 1 for protein
#-minScore=N  minimum score. The score is the number of matching bases minus the mismatches minus a gap penalty (gaps always count negatively). Default is 30
#-minIdentity=N  minimum sequence identity (in percent). Default is 90 for nucleotide searches, 25 for protein or translated protein searches
#-maxGap=N  maximum gap allowed between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevant for minMatch > 1
#-noHead  suppress the .psl header, so the output is purely a tab-separated file
#-makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
#-repMatch=N  number of repetitions of a tile allowed before it is marked as overused. Typically this is 256 for tileSize 12, 1024 for tileSize 11, and 4096 for tileSize 10.
#-mask=type Mask out repeats.  Alignments won't be started in masked region but may extend through it in nucleotide searches.  Masked areas are ignored entirely in protein or translated searches. Types are
            #lower - mask out lower cased sequence
            #upper - mask out upper cased sequence
            #out   - mask according to database.out RepeatMasker .out file
            #file.out - mask database according to RepeatMasker file.out
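#-qMask=type Mask out repeats in query sequence. Types are the same as for -mask
#-repeats=type Types are the same as for -mask. Repeat bases are not masked in any way, but matches in repeat areas are reported separately from matches in other areas in the psl output.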
#-minRepDivergence=NN - minimum percent divergence of repeats to allow them to be unmasked.  Default is 15.  Only relevant for masking using RepeatMasker .out files.
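#-dots=N     output a dot every N sequences to show the program's progress
#-trimT      trim leading poly-T
#-noTrimA    do not trim trailing poly-A
#-trimHardA  remove the poly-A tail from qSize as well as from the alignments in the psl output
#-fastMap    fast DNA/DNA remapping; does not allow introns, requires high identity, and query sizes must not exceed 5000
#-out=type  output file format. Types are:
                  # psl - Default.  Tab separated format, no sequence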
                  # pslx - Tab separated format with sequence
                  # axt - blastz-associated axt format
                  # maf - multiz-associated maf format
                  # sim4 - similar to sim4 format
                  # wublast - similar to wublast format
                  # blast - similar to NCBI blast format 
                  # blast8- NCBI blast tabular format
                  # blast9 - NCBI blast tabular format with comments
#-fine  For high quality mRNAs look harder for small initial and terminal exons.  Not recommended for ESTs
#-maxIntron=N  Sets maximum intron size. Default is 750000
#-extendThroughN - Allows extension of alignment through large blocks of N's

BLAST+

Introduction

Given one or more query sequences (usually in FASTA format), BLAST looks for regions that match between the query sequences and the subject sequences.

A high-scoring segment pair (HSP): a sufficiently close match between a query subsequence and a subject subsequence.

If a query sequence and a subject sequence share one or more HSPs, the query sequence is said to hit that subject sequence.


Common BLAST applications
[Figure: common BLAST applications]

Common usage

# Short help; the other BLAST+ programs work the same way as blastn
blastn -h
# Detailed help
blastn -help

# Create a BLAST database
makeblastdb -in genome.fa -dbtype nucl -parse_seqids -hash_index

# Align sequences (for the caveats of -max_target_seqs 1, see the max_target_seqs section)
blastn -query test.fa -db /path/to/genome.fa \
-max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 8 > blastn_nucldb_test.outfmt6

# Fields in the output file (-outfmt 6)
#  1] query id    2] subject id    3] % identity    4] alignment length    5] mismatches    6] gap opens
#  7] q. start    8] q. end        9] s. start     10] s. end             11] evalue       12] bit score
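Because -outfmt 6 is plain tab-separated text, hits can be filtered after the search by column number. The sketch below keeps hits above a minimum percent identity (column 3) and a minimum alignment length (column 4). The function name filter_hits and the thresholds in the example call are assumptions chosen for illustration.

```shell
# Hypothetical helper: keep outfmt 6 hits with % identity >= MIN_IDENT
# and alignment length >= MIN_LEN.
# Usage: filter_hits MIN_IDENT MIN_LEN [FILE...]
filter_hits() {
    min_ident=$1
    min_len=$2
    shift 2
    awk -F'\t' -v id="$min_ident" -v minlen="$min_len" \
        '$3 + 0 >= id + 0 && $4 + 0 >= minlen + 0' "$@"
}

# Example call on the file produced above:
# filter_hits 90 100 blastn_nucldb_test.outfmt6
```

Filtering afterwards keeps the original search output untouched, so different thresholds can be tried without rerunning blastn.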

max_target_seqs explained

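Suppose you run BLAST with the parameter -max_target_seqs N. Read literally, the name suggests that BLAST searches the database for sequences similar to the query and returns the top N best hits. According to Shah et al. (quoted below), it instead returns the first N good hits found in the database that satisfy the threshold, and these are not necessarily the best hits. As a consequence, merely reordering the database sequences, without changing their content, can easily change the output.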

BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
Shah N, Nute M G, Warnow T, et al. Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics, 2018.
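One possible way to sidestep this is to run the search with a larger -max_target_seqs value and pick the best hit per query afterwards. The sketch below does that for -outfmt 6 output by keeping, for each query id (column 1), the line with the highest bit score (column 12). The function name best_hit_per_query and the file name in the example call are hypothetical, and ties are resolved in favour of the first line seen.

```shell
# Hypothetical helper: for each query in outfmt 6 input, print the hit with
# the highest bit score; queries are printed in order of first appearance.
best_hit_per_query() {
    awk -F'\t' '
        !($1 in best) { order[++n] = $1 }        # remember first-seen order
        !($1 in best) || $12 + 0 > best[$1] {    # strictly higher bit score wins
            best[$1] = $12 + 0
            line[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$@"
}

# Example call, assuming hits.outfmt6 came from a run with a larger -max_target_seqs:
# best_hit_per_query hits.outfmt6 > best_hits.outfmt6
```

This only chooses among the hits that BLAST reported, so it helps only when the search itself was allowed to return enough candidates.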

Parameter details

makeblastdb

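makeblastdb -help
## Required parameters
#-dbtype <String>: type of database to create; either nucl (nucleotide) or prot (protein)

## Input options
#-in mydb.fsa: input file or database name
#-input_type <String>: type of the input; default is fasta; one of asn1_bin, asn1_txt, blastdb, fasta

## Configuration options
#-title <String>: title of the BLAST database; defaults to the input file name
#-parse_seqids: parse the seqids in the FASTA input; mainly useful for retrieving individual sequences later
#-hash_index: create an index of sequence hash values

## Output options
#-out <String>: path and name of the database to create; if omitted, the output prefix is the same as the input file
#-max_file_sz <String>: maximum size of the BLAST database files; default is 1GB
#-logfile <File_Out>: log file for the program run

## Taxonomy options
#-taxid <Integer, >=0>: taxonomy ID to assign to all sequences; incompatible with taxid_map
#-taxid_map <File_In>: file mapping sequence IDs to taxonomy IDs, in the format <SequenceId> <TaxonomyId><newline>
#                      requires parse_seqids; incompatible with taxid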


## Sequence masking options
# -mask_data <String>:Comma-separated list of input files containing masking data as produced by NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
# -mask_id <String>:Comma-separated list of strings to uniquely identify the masking algorithm
#     * Requires:  mask_data
#     * Incompatible with:  gi_mask
# -mask_desc <String>: Comma-separated list of free form strings to describe the masking algorithm details
#     * Requires:  mask_id
# -gi_mask: Create GI indexed masking data.
#     * Requires:  parse_seqids
#     * Incompatible with:  mask_id
# -gi_mask_name <String>: Comma-separated list of masking data output files.
#     * Requires:  mask_data, gi_mask
blastn

blastn -help
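## Input query options
#-query <File_In>: query file
#-query_loc <String>: location on the query sequence to search (format: start-stop)
#-strand <String, 'both', 'minus', 'plus'>: search the plus strand, the minus strand, or both; default: 'both'

## General search options
#-task <String, permissible values: 'blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn'>: task to execute; default: 'megablast'
#-db <String>: path and name of a formatted BLAST database; incompatible with: subject, subject_loc
#-out <File_Out>: path and name of the output file; if omitted, the results are written to standard output
#-evalue <Real>: E-value threshold; only hits whose E-value is below the threshold are kept. (Note: the E-value is the number of hits with a score at least as high that would be expected by chance in a search of this size. It depends on the score S, which measures how similar the two sequences are: the higher S, the more similar the sequences and the lower the E-value.)
#-word_size <Integer, >=4>: Word size for wordfinder algorithm (length of the initial exact match)
#-gapopen <Integer>: Cost to open a gap
#-gapextend <Integer>: Cost to extend a gap
#-penalty <Integer, <=0>: penalty for a nucleotide mismatch
#-reward <Integer, >=0>: reward for a nucleotide match
#-use_index <Boolean>: use MegaBLAST database index; default: 'false'


## BLAST-2-Sequences options
#-subject <File_In>: Subject sequence(s) to search; incompatible with: db, gilist, seqidlist, negative_gilist, db_soft_mask, db_hard_mask
#-subject_loc <String>: location on the subject sequence to search (format: start-stop); incompatible with: db, gilist, seqidlist, negative_gilist, db_soft_mask, db_hard_mask, remote

## Formatting options
#-outfmt <String>: output format, see the help text for details; options 6, 7, 10 and 17 can additionally be customized, see the help text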
###   alignment view options:
###     0 = Pairwise,
###     1 = Query-anchored showing identities,
###     2 = Query-anchored no identities,
###     3 = Flat query-anchored showing identities,
###     4 = Flat query-anchored no identities,
###     5 = BLAST XML,
###     6 = Tabular,
###     7 = Tabular with comment lines,
###     8 = Seqalign (Text ASN.1),
###     9 = Seqalign (Binary ASN.1),
###    10 = Comma-separated values,
###    11 = BLAST archive (ASN.1),
###    12 = Seqalign (JSON),
###    13 = Multiple-file BLAST JSON,
###    14 = Multiple-file BLAST XML2,
###    15 = Single-file BLAST JSON,
###    16 = Single-file BLAST XML2,
###    17 = Sequence Alignment/Map (SAM)
#-show_gis: Show NCBI GIs in deflines?
#-num_descriptions < Integer, >=0 >: Number of database sequences to show one-line descriptions for; not applicable for outfmt > 4; Default = '500'; incompatible with: max_target_seqs
#-num_alignments < Integer, >=0 >: Number of database sequences to show alignments for; Default = '250'; incompatible with: max_target_seqs
#-line_length < Integer, >=1 >: Line length for formatting alignments; not applicable for outfmt > 4; Default = '60'
#-html:Produce HTML output?


## Query filtering options
#   -dust <String>:Filter query sequence with DUST (Format: 'yes', 'level window linker', or 'no' to disable);Default = '20 64 1'
#   -filtering_db <String>:BLAST database containing filtering elements (i.e.: repeats)
#   -window_masker_taxid <Integer>:Enable WindowMasker filtering using a Taxonomic ID
#   -window_masker_db <String>:Enable WindowMasker filtering using this repeats database.
#   -soft_masking <Boolean>:Apply filtering locations as soft masks; Default = 'true'
#   -lcase_masking:Use lower case filtering in query and subject sequence(s)?


## Options for restricting the search or the results
#  -gilist <String>: Restrict search of database to list of GI's; incompatible with: negative_gilist, seqidlist, remote, subject, subject_loc
#  -seqidlist <String>: Restrict search of database to list of SeqId's; incompatible with: gilist, negative_gilist, remote, subject, subject_loc
#  -negative_gilist <String>: Restrict search of database to everything except the listed GIs; incompatible with: gilist, seqidlist, remote, subject, subject_loc
#  -entrez_query <String>: Restrict search with the given Entrez query; incompatible with: remote
#  -db_soft_mask <String>: Filtering algorithm ID to apply to the BLAST database as soft masking; incompatible with: db_hard_mask, subject, subject_loc
#  -db_hard_mask <String>: Filtering algorithm ID to apply to the BLAST database as hard masking; incompatible with: db_soft_mask, subject, subject_loc
#  -perc_identity <Real, 0..100>:Percent identity
#  -qcov_hsp_perc <Real, 0..100>:Percent query coverage per hsp
#  -max_hsps < Integer, >=1 >:Set maximum number of HSPs per subject sequence to save for each query
#  -culling_limit < Integer, >=0 >: If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit; incompatible with: best_hit_overhang, best_hit_score_edge
#  -best_hit_overhang < Real, (>0 and <0.5) >: Best Hit algorithm overhang value (recommended value: 0.1); incompatible with: culling_limit
#  -best_hit_score_edge < Real, (>0 and <0.5) >: Best Hit algorithm score edge value (recommended value: 0.1); incompatible with: culling_limit
#  -max_target_seqs < Integer, >=1 >: Maximum number of aligned sequences to keep; not applicable for outfmt <= 4; Default = '500'; incompatible with: num_descriptions, num_alignments

##Discontiguous MegaBLAST options
# -template_type <String, `coding', `coding_and_optimal', `optimal'>: Discontiguous MegaBLAST template type; requires: template_length
# -template_length <Integer, Permissible values: '16' '18' '21' >: Discontiguous MegaBLAST template length; requires: template_type

##Statistical options
# -dbsize <Int8>:Effective length of the database
# -searchsp <Int8, >=0>:Effective length of the search space
# -sum_stats <Boolean>:Use sum statistics

##Search strategy options
#-import_search_strategy <File_In>: Search strategy to use; incompatible with: export_search_strategy
#-export_search_strategy <File_Out>: File name to record the search strategy used; incompatible with: import_search_strategy

## Extension options
# -xdrop_ungap <Real>:X-dropoff value (in bits) for ungapped extensions
# -xdrop_gap <Real>:X-dropoff value (in bits) for preliminary gapped extensions
# -xdrop_gap_final <Real>:X-dropoff value (in bits) for final gapped alignment
# -no_greedy:Use non-greedy dynamic programming extension
# -min_raw_gapped_score <Integer>:Minimum raw gapped score to keep an alignment in the preliminary gapped and traceback stages
# -ungapped:Perform ungapped alignment only?
# -window_size <Integer, >=0>:Multiple hits window size, use 0 to specify 1-hit algorithm
# -off_diagonal_range <Integer, >=0>:Number of off-diagonals to search for the 2nd hit, use 0 to turn off;Default = '0'

## Miscellaneous options
# -parse_deflines: Should the query and subject defline(s) be parsed?
# -num_threads < Integer, >=1 >: Number of threads (CPUs) to use in the BLAST search; Default = '1'; incompatible with: remote
# -remote: Execute search remotely? Incompatible with: gilist, seqidlist, negative_gilist, subject_loc, num_threads

GNU Parallel

Installing GNU Parallel

# Download, build, and install
wget ftp://ftp.gnu.org/gnu/parallel/parallel-20170822.tar.bz2
tar -jxvf parallel-20170822.tar.bz2 
cd parallel-20170822/
cat README 
./configure && make && sudo make install

Using GNU Parallel

  • parallel tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
  • parallel tutorial (in Chinese): http://my.oschina.net/enyo/blog/271612
  • Using parallel together with other Linux commands: http://www.vaikan.com/use-multiple-cpu-cores-with-your-linux-commands/