blat database query [-ooc=11.ooc] output.psl
- where:
- database and query are each either a .fa, .nib or .2bit file,
- or a list of these files, one file name per line.
- -ooc=11.ooc tells the program to load over-occurring 11-mers from
- an external file. This will increase the speed
- by a factor of 40 in many cases, but is not required
- output.psl is where to put the output.
- Subranges of .nib and .2bit files may be specified using the syntax:
- /path/file.nib:seqid:start-end
- or
- /path/file.2bit:seqid:start-end
- or
- /path/file.nib:start-end
- With the second form, a sequence id of file:start-end will be used.
- options:
- -t=type Database type. Type is one of:
- dna - DNA sequence
- prot - protein sequence
- dnax - DNA sequence translated in six frames to protein
- The default is dna
- -q=type Query type. Type is one of:
- dna - DNA sequence
- rna - RNA sequence
- prot - protein sequence
- dnax - DNA sequence translated in six frames to protein
- rnax - DNA sequence translated in three frames to protein
- The default is dna
- -prot Synonymous with -t=prot -q=prot
- -ooc=N.ooc Use overused tile file N.ooc. N should correspond to
- the tileSize
- -tileSize=N sets the size of match that triggers an alignment.
- Usually between 8 and 12
- Default is 11 for DNA and 5 for protein.
- -stepSize=N spacing between tiles. Default is tileSize.
- -oneOff=N If set to 1 this allows one mismatch in a tile and still
- triggers an alignment. Default is 0.
- -minMatch=N sets the number of tile matches. Usually set from 2 to 4
- Default is 2 for nucleotide, 1 for protein.
- -minScore=N sets minimum score. This is the matches minus the
- mismatches minus some sort of gap penalty. Default is 30
- -minIdentity=N Sets minimum sequence identity (in percent). Default is
- 90 for nucleotide searches, 25 for protein or translated
- protein searches.
- -maxGap=N sets the size of maximum gap between tiles in a clump. Usually
- set from 0 to 3. Default is 2. Only relevant for minMatch > 1.
- -noHead suppress .psl header (so it's just a tab-separated file)
- -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
- -repMatch=N sets the number of repetitions of a tile allowed before
- it is marked as overused. Typically this is 256 for tileSize
- 12, 1024 for tile size 11, 4096 for tile size 10.
- Default is 1024. Typically only comes into play with makeOoc.
- Also affected by stepSize. When stepSize is halved repMatch is
- doubled to compensate.
- -mask=type Mask out repeats. Alignments won't be started in masked region
- but may extend through it in nucleotide searches. Masked areas
- are ignored entirely in protein or translated searches. Types are
- lower - mask out lower cased sequence
- upper - mask out upper cased sequence
- out - mask according to database.out RepeatMasker .out file
- file.out - mask database according to RepeatMasker file.out
- -qMask=type Mask out repeats in query sequence. Similar to -mask above but for query rather than target sequence.
- -repeats=type Type is same as mask types above. Repeat bases will not be
- masked in any way, but matches in repeat areas will be reported
- separately from matches in other areas in the psl output.
- -minRepDivergence=NN - minimum percent divergence of repeats to allow
- them to be unmasked. Default is 15. Only relevant for
- masking using RepeatMasker .out files.
- -dots=N Output dot every N sequences to show program's progress
- -trimT Trim leading poly-T
- -noTrimA Don't trim trailing poly-A
- -trimHardA Remove poly-A tail from qSize as well as alignments in
- psl output
- -fastMap Run for fast DNA/DNA remapping - not allowing introns,
- requiring high %ID
- -out=type Controls output file format. Type is one of:
- psl - Default. Tab separated format, no sequence
- pslx - Tab separated format with sequence
- axt - blastz-associated axt format
- maf - multiz-associated maf format
- sim4 - similar to sim4 format
- wublast - similar to wublast format
- blast - similar to NCBI blast format
- blast8 - NCBI blast tabular format
- blast9 - NCBI blast tabular format with comments
- -fine For high quality mRNAs look harder for small initial and
- terminal exons. Not recommended for ESTs
- -maxIntron=N Sets maximum intron size. Default is 750000
- -extendThroughN - Allows extension of alignment through large blocks of N's
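The usage line and subrange syntax above can be assembled from a script. The following is a minimal sketch, not part of blat itself: the helper names build_blat_command and subrange are hypothetical, and only flags that appear in the help text above are used.

```python
def subrange(path, start, end, seqid=None):
    """Format a .nib/.2bit subrange as described in the help text.

    With seqid: /path/file.2bit:seqid:start-end
    Without:    /path/file.nib:start-end
    """
    if seqid is None:
        return f"{path}:{start}-{end}"
    return f"{path}:{seqid}:{start}-{end}"


def build_blat_command(database, query, output, ooc=None,
                       t_type=None, q_type=None,
                       no_head=False, out_format=None):
    """Build the argument list: blat database query [options] output."""
    cmd = ["blat", database, query]
    if ooc:
        cmd.append(f"-ooc={ooc}")
    if t_type:
        cmd.append(f"-t={t_type}")
    if q_type:
        cmd.append(f"-q={q_type}")
    if no_head:
        cmd.append("-noHead")
    if out_format:
        cmd.append(f"-out={out_format}")
    cmd.append(output)
    return cmd


# File names here are placeholders for illustration only.
cmd = build_blat_command("db.2bit", "query.fa", "out.psl",
                         ooc="11.ooc", no_head=True)
print(" ".join(cmd))
```

If blat is installed and on the PATH, the returned list could be passed to subprocess.run(cmd, check=True) from the Python standard library.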
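Blat, short for the BLAST-Like Alignment Tool, was developed by W. James Kent in 2002. As the Human Genome Project advanced, there was a pressing need to map large numbers of genes and ESTs quickly onto large genomes. BLAST has several shortcomings for this kind of alignment: it is relatively slow, its results are hard to process, and it cannot represent a gene placement that contains introns. Blat was created in response to this situation.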
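Blat's main strengths are speed and collinear output that is simple and easy to read. For aligning relatively small sequences (such as cDNAs) against a large genome, blat is the natural first choice. Blat joins related, collinear alignments into larger alignments, from which exons and introns are easy to identify. As a result, blat is widely used in gene homology analysis between closely related species and in EST analysis.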
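As the figure in the original post illustrates, BLAST reports each alignment as a separate result, whereas blat joins alignments that are collinear with one another into a single result.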
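Blat takes FASTA input (the usage text above also lists .nib and .2bit files). It is simple to run and needs no separate database-building step before aligning. The basic command is:
blat database query [-options] output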
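When the program runs normally, it prints statistics about the database to the screen once it has read all subject sequences in the database:
Loaded 1493629 letters in 486 sequences ### 486 sequences containing 1493629 letters
Searched 1493629 bases in 486 sequences ### the file was aligned against itself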
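The default output is a tabular text file in psl format.
A psl file contains detailed alignment position information, and the meaning of each column is listed at the top of the file. Columns 1 to 8 are overall alignment statistics, including the number of exactly matching bases, the number of mismatches, and the number and total length of gaps in the query and the subject. Columns 9 to 17 give positional information: the strand, and for both query and subject the name, length, and alignment start and end. Columns 18 to 21 describe the exactly aligned blocks: the block count, the length of each block, and the position of each block in the query and the subject.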
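The column layout described above can be read into named fields. This is a minimal sketch in Python that assumes the standard 21-column, tab-separated psl layout and a header-free line (for example one produced with -noHead). The field names and the sample line are made up for illustration and are not blat output.

```python
from typing import List, NamedTuple


class PslRecord(NamedTuple):
    # Columns 1-8: overall alignment statistics
    matches: int
    mismatches: int
    rep_matches: int
    n_count: int
    q_num_insert: int
    q_base_insert: int
    t_num_insert: int
    t_base_insert: int
    # Columns 9-17: strand, names, sizes, start/end positions
    strand: str
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str
    t_size: int
    t_start: int
    t_end: int
    # Columns 18-21: per-block information
    block_count: int
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def _int_list(field):
    """Parse a comma-separated list with a trailing comma, e.g. '30,30,'."""
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line):
    f = line.rstrip("\n").split("\t")
    if len(f) < 21:
        raise ValueError(f"expected 21 columns, got {len(f)}")
    return PslRecord(
        *[int(x) for x in f[0:8]],
        f[8], f[9], int(f[10]), int(f[11]), int(f[12]),
        f[13], int(f[14]), int(f[15]), int(f[16]),
        int(f[17]), _int_list(f[18]), _int_list(f[19]), _int_list(f[20]),
    )


# Hypothetical record: a 60-base query aligned in two 30-base blocks.
sample = "\t".join([
    "59", "1", "0", "0", "0", "0", "1", "100", "+",
    "q1", "60", "0", "60", "chr1", "1000", "100", "260",
    "2", "30,30,", "0,30,", "100,230,",
])
rec = parse_psl_line(sample)
print(rec.q_name, rec.t_name, rec.block_count, rec.block_sizes)
```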
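Keep the following points in mind when reading psl output:
1. Blat allows very large gaps on the subject (intron regions), so the regions that one result covers on the query and on the subject can differ greatly in length. This differs from BLAST.
2. When aligning genes to a genome, the number of blocks is not the number of exons. Blat defines a block as an alignment with no insertions or deletions, so any inserted or deleted base ends a block, and a single exon may well consist of many blocks. The number of exons and introns therefore has to be judged from sufficiently large gaps.
3. Base positions in psl output are counted from 0, not from 1.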
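Points 2 and 3 can be illustrated with a short Python sketch. It merges blocks separated by small gaps on the subject into candidate exons and converts a 0-based interval to 1-based coordinates. The threshold min_intron=20 is an arbitrary assumption for illustration (the text only says the gap must be "sufficiently large"), and the sketch assumes psl intervals are half-open, with a 0-based start and an exclusive end. The function names are hypothetical.

```python
def blocks_to_exons(block_sizes, t_starts, min_intron=20):
    """Merge adjacent blocks into candidate exons.

    Blocks whose gap on the subject is shorter than min_intron are
    treated as one exon interrupted by a small indel; larger gaps
    are treated as introns. Intervals are 0-based, half-open.
    """
    exons = []
    for size, start in zip(block_sizes, t_starts):
        end = start + size
        if exons and start - exons[-1][1] < min_intron:
            exons[-1][1] = end          # extend the current exon
        else:
            exons.append([start, end])  # start a new exon
    return [(s, e) for s, e in exons]


def to_one_based(start, end):
    """Convert a 0-based half-open interval to 1-based inclusive."""
    return start + 1, end


# Three blocks: the first two are split by a 1-base gap (an indel),
# the third follows a 164-base gap (treated here as an intron).
exons = blocks_to_exons([30, 5, 40], [100, 131, 300])
print(exons)
print([to_one_based(s, e) for s, e in exons])
```

Here three blocks collapse into two candidate exons, which is why the block count alone overstates the exon count.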
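When running different types of alignment, note that "-t" and "-q" must be defined at the same level: both nucleotide or both protein. For example, when the database and the query are both protein sequences and both are set to "prot", the alignment runs normally. If the database is DNA and the query is protein, then "-q=prot" must be accompanied by "-t=dnax". The examples below use the DNA and protein sequences of the same gene.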
Command 1:
blat cdna.seq pro.seq -q=prot out.psl
The program exits with an error:
d and q must both be either protein or dna
Command 2:
blat cdna.seq pro.seq -t=dnax -q=prot -noHead out.psl
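This runs correctly.
Note how the output of a protein-level alignment differs from that of a nucleotide alignment: the strand column shows two "+" signs, meaning that both the query and the subject are aligned in the forward direction.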
Command 3, a protein-level alignment of nucleotide sequences:
blat cdna.seq cdna.seq -t=dnax -q=dnax -noHead out.psl
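The rule behind the three commands above can be written as a small pre-flight check in Python. This is a sketch inferred from the help text and the examples in this post, not taken from blat's source code. It assumes that prot, dnax and rnax all count as protein-level types and that dna and rna count as nucleotide-level types. The function name types_compatible is hypothetical.

```python
T_TYPES = {"dna", "prot", "dnax"}                # values listed for -t
Q_TYPES = {"dna", "rna", "prot", "dnax", "rnax"}  # values listed for -q
PROTEIN_LEVEL = {"prot", "dnax", "rnax"}          # assumption, see lead-in


def types_compatible(t_type="dna", q_type="dna"):
    """Return True if -t and -q are both nucleotide or both protein level."""
    if t_type not in T_TYPES:
        raise ValueError(f"unknown -t type: {t_type}")
    if q_type not in Q_TYPES:
        raise ValueError(f"unknown -q type: {q_type}")
    return (t_type in PROTEIN_LEVEL) == (q_type in PROTEIN_LEVEL)


# Command 1: -t defaults to dna, -q=prot  -> rejected
print(types_compatible("dna", "prot"))
# Command 2: -t=dnax -q=prot             -> accepted
print(types_compatible("dnax", "prot"))
# Command 3: -t=dnax -q=dnax             -> accepted
print(types_compatible("dnax", "dnax"))
```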
http://blog.sina.com.cn/s/blog_959d22480101k348.html