Blat & BLAST

Blat

Introduction

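BLAT, short for the BLAST-Like Alignment Tool, is an alignment tool similar to BLAST. On DNA, BLAT is designed to find sequences of 95% or greater similarity that are at least 40 bases long. On proteins, it is designed to find sequences of 80% or greater similarity that are at least 20 amino acids long.

BLAT's main strengths are speed and collinear output that is simple to read. For aligning relatively short sequences (cDNA, for example) against a large genome, BLAT is usually a better choice than BLAST. BLAT stitches related, collinear alignments into a single larger alignment, which makes exons and introns easy to spot. For this reason it is widely used for homology analysis between closely related species and for EST analysis.

BLAT can be hundreds of times faster than BLAST because the two tools search in fundamentally different ways. BLAST indexes the query sequences and then scans linearly through the large target database, reading from disk frequently, so its data accesses show little temporal or spatial locality. BLAT instead indexes the target database and scans linearly through the query sequences, which gives much better locality. The database index is read into memory once and then reused at high speed without further disk access, so repeated searches put little extra load on the system. Once the index is built, the more query sequences there are, the greater BLAT's advantage.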

Blat is an alignment tool like BLAST, but it is structured differently. Blat produces two major classes of alignments:

  • at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts.
  • at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts.

Installation

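#Download and unpack the source
wget -c https://users.soe.ucsc.edu/~kent/src/blatSrc35.zip
unzip blatSrc35.zip
cd blatSrc
#Check the machine architecture, then set MACHTYPE to match
uname -a
export MACHTYPE="x86_64"
#Binaries are installed into ~/bin/$MACHTYPE
mkdir -p ~/bin/$MACHTYPE
mkdir -p $MACHTYPE
make > make$(date +%F).log
#Add the install directory to PATH
echo 'export PATH="$HOME/bin/x86_64:$PATH"' >> ~/.bash_profile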

How it works

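BLAT first splits the reference (database) sequence into tiles (k-mers). How it does so depends on two parameters, -tileSize and -stepSize. -tileSize sets the tile length, usually between 8 and 12, with a default of 11 for DNA and 5 for protein. -stepSize sets how far the window advances from one tile to the next; it defaults to tileSize, which gives non-overlapping tiles.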
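
To make the two parameters concrete, the sketch below cuts a toy sequence into tiles with awk. It only illustrates how a tile size and a step size interact; it is not how blat builds its index internally, and the function name `make_tiles` and the toy sequence are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 0-based start offset and the tile itself.
# Usage: make_tiles SEQUENCE TILE_SIZE STEP_SIZE
make_tiles() {
    awk -v seq="$1" -v tile="$2" -v step="$3" 'BEGIN {
        # Slide a window of length tile across seq, advancing step bases each time.
        for (i = 1; i + tile - 1 <= length(seq); i += step)
            printf "%d\t%s\n", i - 1, substr(seq, i, tile)
    }'
}

# Non-overlapping tiles: step equal to tile size
make_tiles ACGTACGTACGTAC 4 4

# Overlapping tiles: a step smaller than the tile size yields more tiles
make_tiles ACGTACGTACGTAC 4 2
```

With a step equal to the tile size the 14-base toy sequence yields three tiles; halving the step doubles that to six, which is the trade-off between index size and the chance of seeding a match.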

Reference: Using blat

Reference: Blat-The BLAST-Like Alignment Tool (a detailed usage tutorial)

Common usage

#Common blat usage
#Run a single job
blat chr11.fa human/test.fa test.psl #output without sequence
blat chr11.fa human/test.fa -out=pslx test.pslx #output with sequence
blat chr11.fa human/test.fa -out=blast test.blast #output similar to NCBI blast format
#Run several jobs in parallel
time parallel blat chr{}.fa human/human.fa test_{}.psl ::: {1..22} X Y M
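
Once a .psl file exists, a quick per-alignment summary is often the next step. The sketch below reads PSL records and prints query name, target name, query coverage and a simplified identity. The helper name `psl_summary` and the sample record are made up for illustration, and the identity figure deliberately ignores gaps, so treat it as a rough screen rather than a standard score.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise PSL records read from stdin.
# Prints: qName, tName, query coverage (%), identity over aligned bases (%).
psl_summary() {
    awk -F'\t' '
        $1 ~ /^[0-9]+$/ && NF >= 21 {          # header lines do not start with a number
            aligned = $1 + $2 + $3             # matches + misMatches + repMatches
            if (aligned == 0) next
            ident = 100 * ($1 + $3) / aligned  # simplified identity, gaps ignored
            cov   = 100 * ($13 - $12) / $11    # (qEnd - qStart) / qSize
            printf "%s\t%s\t%.1f\t%.1f\n", $10, $14, cov, ident
        }'
}

# Made-up PSL record for illustration only (tab-separated, 21 columns)
printf '95\t5\t0\t0\t0\t0\t1\t200\t+\tread1\tchrX\n' > /dev/null  # placeholder line, not a PSL record
printf '95\t5\t0\t0\t0\t0\t1\t200\t+\tread1\t100\t0\t100\tchr11\t5000\t1000\t1300\t2\t60,40,\t0,60,\t1000,1260,\n' | psl_summary
```

In practice the function would be fed a real file, for example `psl_summary < test.psl`; the header that blat writes by default is skipped because its lines do not begin with a number.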

Parameter details

#blat parameters
#Usage: blat database query [-ooc=11.ooc] output.psl
#database  Input must be one of: a .fa, .nib or .2bit file
#query  Input must be one of: a .fa, .nib or .2bit file
#output.psl  Output file
#-t=type  Database type, one of: dna/prot/dnax
#-q=type  Query type, one of: dna/rna/prot/dnax/rnax
#-prot  Synonymous with -t=prot -q=prot
#-ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to the tileSize
#-tileSize=N  Size of the tiles (k-mers)
#-stepSize=N  Step between tiles, i.e. the distance between the starts of two adjacent tiles. Default is tileSize
#-oneOff=N  If set to 1, a tile may contain one mismatched base and still trigger an alignment. Default is 0
#-minMatch=N  Minimum number of tile matches. Usually set from 2 to 4. Default is 2 for nucleotide, 1 for protein
#-minScore=N  Minimum score. Gaps always count against the score (the gap penalty), so the score is the number of matches minus the mismatches minus a gap penalty. Default is 30
#-minIdentity=N  Minimum sequence identity in percent. Default is 90 for nucleotide searches, 25 for protein or translated protein searches
#-maxGap=N  Maximum gap allowed between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevant when minMatch > 1
#-noHead  Suppress the .psl header, so the output is a plain tab-separated file
#-makeOoc=N.ooc  Make overused tile file. Target needs to be complete genome.
#-repMatch=N  Number of times a tile may repeat before it is marked as overused. Typically 256 for tileSize 12, 1024 for tileSize 11 and 4096 for tileSize 10
#-mask=type Mask out repeats.  Alignments won't be started in masked region but may extend through it in nucleotide searches.  Masked areas are ignored entirely in protein or translated searches. Types are
            #lower - mask out lower cased sequence
            #upper - mask out upper cased sequence
            #out   - mask according to database.out RepeatMasker .out file
            #file.out - mask database according to RepeatMasker file.out
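#-qMask=type Mask out repeats in query sequence. Types are the same as for -mask
#-repeats=type Types are the same as for -mask. Repeat bases are not masked in any way, but matches in repeat areas are reported separately from matches in other areas in the psl output.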
#-minRepDivergence=NN - minimum percent divergence of repeats to allow them to be unmasked.  Default is 15.  Only relevant for masking using RepeatMasker .out files.
#-dots=N     Output a dot every N sequences to show the program's progress
#-trimT      Trim leading poly-T
#-noTrimA    Do not trim trailing poly-A
#-trimHardA  Remove the poly-A tail from qSize as well as from alignments in the psl output
#-fastMap    Fast DNA/DNA remapping: no introns allowed, high identity required, query length must not exceed 5000
#-out=type   Output file format. Type is one of:
                  # psl - Default.  Tab separated format, no sequence
                  # pslx - Tab separated format with sequence
                  # axt - blastz-associated axt format
                  # maf - multiz-associated maf format
                  # sim4 - similar to sim4 format
                  # wublast - similar to wublast format
                  # blast - similar to NCBI blast format
                  # blast8 - NCBI blast tabular format
                  # blast9 - NCBI blast tabular format with comments
#-fine       For high quality mRNAs, look harder for small initial and terminal exons. Not recommended for ESTs
#-maxIntron=N  Maximum intron size. Default is 750000
#-extendThroughN  Allow alignments to extend through large blocks of Ns

BLAST+

Introduction

Given one or more query sequences (usually in FASTA format), BLAST looks for regions that match between the query sequences and the subject sequences.

A high-scoring pair (HSP):A sufficiently close match between query subsequences and subject subsequences

If a query sequence and a subject sequence share one or more HSPs, the query sequence is said to hit that subject sequence.


Common BLAST programs
[Figure not available: overview of common BLAST programs]

Common usage

#Short help; the other BLAST+ programs work the same way as blastn
blastn -h
#Full help
blastn -help

#Build a BLAST database
makeblastdb -in genome.fa -dbtype nucl -parse_seqids -hash_index

#Align sequences (see the max_target_seqs section before relying on -max_target_seqs 1)
blastn -query test.fa -db /path/to/genome.fa \
-max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 8 > blastn_nucldb_test.outfmt6

#Fields in the output file (outfmt 6)
#1] query id    subject id    % identity    alignment length    mismatches    gap opens
#7] q. start    q. end        s. start      s. end              evalue        bit score
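
Because outfmt 6 is plain tab-separated text, the twelve fields listed above can be filtered with awk. The sketch below keeps hits above a percent-identity and alignment-length cut-off. The function name `filter_hits`, the cut-offs and the three sample hits are all made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep outfmt 6 hits above identity and length cut-offs.
# Usage: filter_hits MIN_IDENTITY MIN_LENGTH < hits.outfmt6
filter_hits() {
    awk -F'\t' -v min_id="$1" -v min_len="$2" '
        # Column 3 is percent identity, column 4 is alignment length.
        $3 + 0 >= min_id + 0 && $4 + 0 >= min_len + 0
    '
}

# Made-up hits for illustration only
hits=$(mktemp)
printf 'q1\tchr1\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-90\t350\n'  >  "$hits"
printf 'q1\tchr2\t85.00\t120\t18\t0\t1\t120\t90\t209\t2e-20\t99.0\n' >> "$hits"
printf 'q2\tchr1\t99.00\t40\t0\t0\t5\t44\t10\t49\t3e-12\t75.0\n'     >> "$hits"

# Keep hits with at least 90% identity over at least 50 aligned positions
filter_hits 90 50 < "$hits"
rm -f "$hits"
```

Only the first sample hit passes both cut-offs: the second fails on identity and the third on length.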

max_target_seqs explained

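Suppose BLAST is run with the parameter -max_target_seqs N. According to Shah et al. (2018), this returns the first N good hits found in the database that satisfy the thresholds, which is not what the name suggests (the top N best hits among the database sequences similar to the query). In other words, the best hits are not guaranteed to be returned. It also means that reordering the database sequences, without changing their content, may well change the output.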

BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
Shah N, Nute M G, Warnow T, et al. Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows[J]. Bioinformatics, 2018.
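
One way to sidestep the question is to run blastn without a tight -max_target_seqs limit and pick the top hit per query afterwards. The sketch below does that on outfmt 6 input by keeping, for each query id, the line with the highest bit score. It is a post-processing sketch only: the function name `best_hit_per_query` and the sample hits are made up, and ties are resolved in favour of the hit listed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the highest bit-score hit per query from outfmt 6.
# Usage: best_hit_per_query < hits.outfmt6
best_hit_per_query() {
    awk -F'\t' '
        # Column 1 is the query id, column 12 is the bit score.
        !($1 in best) || $12 + 0 > score[$1] {
            if (!($1 in best)) order[++n] = $1   # remember first-seen query order
            best[$1]  = $0
            score[$1] = $12 + 0
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    '
}

# Made-up hits: for q1 the weaker hit happens to be listed first
{
    printf 'q1\tchrA\t90.00\t100\t10\t0\t1\t100\t1\t100\t1e-20\t100\n'
    printf 'q1\tchrB\t99.00\t200\t2\t0\t1\t200\t1\t200\t1e-90\t350\n'
    printf 'q2\tchrA\t95.00\t150\t7\t0\t1\t150\t1\t150\t1e-50\t220\n'
} | best_hit_per_query
```

For the sample input this reports the chrB hit for q1 even though the chrA hit appears first, so the result no longer depends on the order of the lines.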

Parameter details

makeblastdb

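makeblastdb -help
## Required options
#-dbtype <String>: Type of database to create; one of nucl (nucleotide) or prot (protein)

## Input options
#-in mydb.fsa: Name of the input file or database
#-input_type <String>: Type of the input; default is fasta; one of asn1_bin, asn1_txt, blastdb, fasta

## Configuration options
#-title <String>: Title for the BLAST database; defaults to the name of the input file
#-parse_seqids: If set, sequence IDs are parsed from the FASTA input; this mainly makes it easier to retrieve individual sequences later
#-hash_index: Create an index of sequence hash values

## Output options
#-out <String>: Path and name of the database to create; if not set, the output prefix is the same as the input file name
#-max_file_sz <String>: Maximum size of the BLAST database files; default is 1GB (may differ between BLAST+ versions)
#-logfile <File_Out>: Log file for the run

## Taxonomy options
#-taxid <Integer, >=0>: Taxonomy ID to assign to all sequences; incompatible with taxid_map
#-taxid_map <File_In>: File that maps sequence IDs to taxonomy IDs, in the format: <SequenceId> <TaxonomyId><newline>
#                      Requires parse_seqids; incompatible with taxid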


## Sequence masking options
# -mask_data <String>:Comma-separated list of input files containing masking data as produced by NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
# -mask_id <String>:Comma-separated list of strings to uniquely identify the masking algorithm
    * Requires:  mask_data
    * Incompatible with:  gi_mask
# -mask_desc <String>: Comma-separated list of free form strings to describe the masking algorithm details
    * Requires:  mask_id
# -gi_mask: Create GI indexed masking data.
    * Requires:  parse_seqids
    * Incompatible with:  mask_id
# -gi_mask_name <String>: Comma-separated list of masking data output files.
    * Requires:  mask_data, gi_mask

blastn

blastn -help
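## Input query options
#-query <File_In>: Query file
#-query_loc <String>: Location on the query sequence to search (format: start-stop)
#-strand <String, 'both', 'minus', 'plus'>: Search the plus strand, the minus strand, or both; default: 'both'

## General search options
#-task <String, permissible values: 'blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn' >: Task to execute; default: 'megablast'
#-db <String>: Path and name of the formatted BLAST database; incompatible with: subject, subject_loc
#-out <File_Out>: Path and name of the output file; if not set, results are written to standard output
#-evalue: E-value threshold; only hits whose E-value is below the threshold are kept. (Note: the E-value is the number of hits with a score at least as high as S that would be expected by chance in a search of this size, so lower values are more significant. It is derived from the alignment score S, which is higher the more similar the two sequences are.)
#-word_size <Integer, >=4>: Word size for wordfinder algorithm (length of the initial exact match used as a seed)
#-gapopen <Integer>: Cost to open a gap
#-gapextend <Integer>: Cost to extend a gap
#-penalty <Integer, <=0>: Penalty for a nucleotide mismatch
#-reward <Integer, >=0>: Reward for a nucleotide match
#-use_index <Boolean>: Use MegaBLAST database index; default: 'false'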


## BLAST-2-Sequences options
#-subject <File_In>: Subject sequence(s) to search; incompatible with: db, gilist, seqidlist, negative_gilist, db_soft_mask, db_hard_mask
#-subject_loc <String>: Location on the subject sequence to search (format: start-stop); incompatible with: db, gilist, seqidlist, negative_gilist, db_soft_mask, db_hard_mask, remote

## Output formatting options
#-outfmt <String>: Output format, see the help text for details; formats 6, 7, 10 and 17 can additionally be customized, see the help text
###   alignment view options:
###     0 = Pairwise,
###     1 = Query-anchored showing identities,
###     2 = Query-anchored no identities,
###     3 = Flat query-anchored showing identities,
###     4 = Flat query-anchored no identities,
###     5 = BLAST XML,
###     6 = Tabular,
###     7 = Tabular with comment lines,
###     8 = Seqalign (Text ASN.1),
###     9 = Seqalign (Binary ASN.1),
###    10 = Comma-separated values,
###    11 = BLAST archive (ASN.1),
###    12 = Seqalign (JSON),
###    13 = Multiple-file BLAST JSON,
###    14 = Multiple-file BLAST XML2,
###    15 = Single-file BLAST JSON,
###    16 = Single-file BLAST XML2,
###    17 = Sequence Alignment/Map (SAM)
#-show_gis: Show NCBI GIs in deflines?
#-num_descriptions < Integer, >=0 >: Number of database sequences to show one-line descriptions for; not applicable for outfmt > 4; default: 500; incompatible with: max_target_seqs
#-num_alignments < Integer, >=0 >: Number of database sequences to show alignments for; default: 250; incompatible with: max_target_seqs
#-line_length < Integer, >=1 >: Line length for formatting alignments; not applicable for outfmt > 4; default: 60
#-html:Produce HTML output?


## Query filtering options
#   -dust <String>:Filter query sequence with DUST (Format: 'yes', 'level window linker', or 'no' to disable);Default = '20 64 1'
#   -filtering_db <String>:BLAST database containing filtering elements (i.e.: repeats)
#   -window_masker_taxid <Integer>:Enable WindowMasker filtering using a Taxonomic ID
#   -window_masker_db <String>:Enable WindowMasker filtering using this repeats database.
#   -soft_masking <Boolean>:Apply filtering locations as soft masks; Default = 'true'
#   -lcase_masking:Use lower case filtering in query and subject sequence(s)?


## Restrict search or results
#  -gilist <String>: Restrict search of database to list of GI's; incompatible with: negative_gilist, seqidlist, remote, subject, subject_loc
#  -seqidlist <String>: Restrict search of database to list of SeqId's; incompatible with: gilist, negative_gilist, remote, subject, subject_loc
#  -negative_gilist <String>: Restrict search of database to everything except the listed GIs; incompatible with: gilist, seqidlist, remote, subject, subject_loc
#  -entrez_query <String>: Restrict search with the given Entrez query; requires: remote
#  -db_soft_mask <String>: Filtering algorithm ID to apply to the BLAST database as soft masking; incompatible with: db_hard_mask, subject, subject_loc
#  -db_hard_mask <String>: Filtering algorithm ID to apply to the BLAST database as hard masking; incompatible with: db_soft_mask, subject, subject_loc
#  -perc_identity <Real, 0..100>: Percent identity
#  -qcov_hsp_perc <Real, 0..100>: Percent query coverage per hsp
#  -max_hsps < Integer, >=1 >: Set maximum number of HSPs per subject sequence to save for each query
#  -culling_limit < Integer, >=0 >: If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit; incompatible with: best_hit_overhang, best_hit_score_edge
#  -best_hit_overhang < Real, (>0 and <0.5) >: Best Hit algorithm overhang value (recommended value: 0.1); incompatible with: culling_limit
#  -best_hit_score_edge < Real, (>0 and <0.5) >: Best Hit algorithm score edge value (recommended value: 0.1); incompatible with: culling_limit
#  -max_target_seqs < Integer, >=1 >: Maximum number of aligned sequences to keep; not applicable for outfmt <= 4; Default = '500'; incompatible with: num_descriptions, num_alignments

##Discontiguous MegaBLAST options
# -template_type <String, `coding', `coding_and_optimal', `optimal'>: Discontiguous MegaBLAST template type; requires: template_length
# -template_length <Integer, Permissible values: '16' '18' '21' >: Discontiguous MegaBLAST template length; requires: template_type

##Statistical options
# -dbsize <Int8>:Effective length of the database
# -searchsp <Int8, >=0>:Effective length of the search space
# -sum_stats <Boolean>:Use sum statistics

##Search strategy options
#-import_search_strategy <File_In>: Search strategy to use; incompatible with: export_search_strategy
#-export_search_strategy <File_Out>: File name to record the search strategy used; incompatible with: import_search_strategy

## Extension options
# -xdrop_ungap <Real>:X-dropoff value (in bits) for ungapped extensions
# -xdrop_gap <Real>:X-dropoff value (in bits) for preliminary gapped extensions
# -xdrop_gap_final <Real>:X-dropoff value (in bits) for final gapped alignment
# -no_greedy:Use non-greedy dynamic programming extension
# -min_raw_gapped_score <Integer>:Minimum raw gapped score to keep an alignment in the preliminary gapped and traceback stages
# -ungapped:Perform ungapped alignment only?
# -window_size <Integer, >=0>:Multiple hits window size, use 0 to specify 1-hit algorithm
# -off_diagonal_range <Integer, >=0>:Number of off-diagonals to search for the 2nd hit, use 0 to turn off;Default = '0'

## Miscellaneous options
# -parse_deflines:Should the query and subject defline(s) be parsed?
# -num_threads < Integer, >=1 >: Number of threads (CPUs) to use in the BLAST search; Default = '1'; incompatible with: remote
# -remote: Execute search remotely? Incompatible with: gilist, seqidlist, negative_gilist, subject_loc, num_threads

GNU Parallel

Installing GNU Parallel

#Download, build and install
wget ftp://ftp.gnu.org/gnu/parallel/parallel-20170822.tar.bz2
tar -jxvf parallel-20170822.tar.bz2 
cd parallel-20170822/
cat README 
./configure && make && sudo make install

Using GNU Parallel

  • parallel tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
  • parallel tutorial (Chinese): http://my.oschina.net/enyo/blog/271612
  • Using parallel with other Linux commands: http://www.vaikan.com/use-multiple-cpu-cores-with-your-linux-commands/