Using OrthoMCL to Find Homologous Genes

Original

2020-06-02 01:18

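OrthoMCL (http://orthomcl.org/orthomcl/) is mainly used to identify orthologous and paralogous genes, primarily by comparing reasonably complete genomes. The workflow has 13 main steps, which are documented in doc/OrthoMCLEngine/Main/UserGuide.txt. To make OrthoMCL easier to run, create a working directory named "my_orthomcl_dir".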

1> Configure OrthoMCL

Copy orthomcl.config.template into your working directory (my_orthomcl_dir), then edit it to match the name, user name, and password of the MySQL database you created.

There are two main threshold parameters:

percentMatchCutoff:

BLAST similarities with a percent match less than this value are ignored.

evalueExponentCutoff:

BLAST similarities with e-value exponents greater than this value are ignored.
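As a rough sketch of what the edited file might look like, the snippet below writes a configuration into a scratch directory. The key names are assumed to match the stock orthomcl.config.template shipped with v2.0.9; the database name (orthomcl), login, and password are placeholders, and the two cutoffs are set to 50 and -5 only as illustrative values.

```shell
# Write an example OrthoMCL config into a scratch working directory.
# Assumption: key names follow the stock orthomcl.config.template.
# The database name, login, and password below are placeholders.
workdir=$(mktemp -d)            # stands in for my_orthomcl_dir
cat > "$workdir/orthomcl.config.template" <<'EOF'
dbVendor=mysql
dbConnectString=dbi:mysql:orthomcl
dbLogin=my_db_login
dbPassword=my_db_password
similarSequencesTable=SimilarSequences
orthologTable=Ortholog
inParalogTable=InParalog
coOrthologTable=CoOrtholog
interTaxonMatchView=InterTaxonMatch
percentMatchCutoff=50
evalueExponentCutoff=-5
oracleIndexTblSpc=NONE
EOF
cat "$workdir/orthomcl.config.template"
```

With evalueExponentCutoff=-5, a hit with e-value 3e-2 (exponent -2) would be ignored, while one with e-value 1e-50 would be kept.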

2> Use orthomclInstallSchema to set up the schema in the Oracle or MySQL database

Usage:

orthomclInstallSchema config_file sql_log_file table_suffix

For example, run the following from the my_orthomcl_dir directory:

../orthomclSoftware-v2.0.9/bin/orthomclInstallSchema orthomcl.config.template orthomcl.config.log

3> Use orthomclAdjustFasta to convert the input files into the format OrthoMCL requires

Usage:

orthomclAdjustFasta taxon_code fasta_file id_field

Here, the proteome files of two species, Ustilago maydis and Saccharomyces cerevisiae, were downloaded from Ensembl.

../orthomclSoftware-v2.0.9/bin/orthomclAdjustFasta Ust Ustilago_maydis.fasta 1

../orthomclSoftware-v2.0.9/bin/orthomclAdjustFasta Sac Saccharomyces_cerevisiae.fasta 1

This produces two files: Ust.fasta and Sac.fasta. For convenience in the later steps, place them in my_orthomcl_dir/compliantFasta/.

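The parameters are: taxon_code, a short unique abbreviation for the species (Ust and Sac here); fasta_file, the input FASTA file; and id_field, the number of the field in the definition line that holds the protein ID.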
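To make the effect of taxon_code and id_field concrete, here is a minimal sketch of the header rewrite. It is not the real orthomclAdjustFasta: the function name adjust_fasta is hypothetical, the sample record is made up, and it assumes that definition-line fields are separated by a space or "|".

```shell
# Sketch of the header rewrite, NOT the real orthomclAdjustFasta.
# Assumption: definition-line fields are separated by ' ' or '|'.
adjust_fasta() {   # usage: adjust_fasta taxon_code id_field < in.fasta
  awk -v taxon="$1" -v field="$2" '
    /^>/ {
      line = $0
      sub(/^>[ ]*/, "", line)          # drop ">" and any spaces after it
      split(line, parts, /[ |]/)       # split on space or pipe
      print ">" taxon "|" parts[field] # emit ">taxon|protein_id"
      next
    }
    { print }                          # sequence lines pass through
  '
}

# Made-up record; the first field of the definition line is the ID.
printf '>UM00001P0 pep:known supercontig:1\nMKT\n' | adjust_fasta Ust 1
# prints:
# >Ust|UM00001P0
# MKT
```

If the protein ID sits in a different field of your headers, only the last argument changes.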

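4> Use orthomclFilterFasta to filter out poor-quality sequences

Usage:

orthomclFilterFasta input_dir min_length max_percent_stops [good_proteins_file poor_proteins_file]

For example, run (10 and 20 are the values suggested in the user guide for min_length and max_percent_stops):

../orthomclSoftware-v2.0.9/bin/orthomclFilterFasta ./compliantFasta/ 10 20

This produces two files: goodProteins.fasta and poorProteins.fasta.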

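The parameters are: input_dir, the directory holding the .fasta files; min_length, the minimum allowed protein length; max_percent_stops, the maximum allowed percentage of stop codons; and, optionally, the names of the two output files.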
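The two filter criteria can be illustrated with a small awk sketch. This is not the real orthomclFilterFasta: the function name filter_fasta is hypothetical, the sequences are made up, and a stop codon is assumed to appear as "*" in the protein sequence.

```shell
# Sketch of the filter criteria, NOT the real orthomclFilterFasta.
# Assumption: a stop codon is written as "*" in the protein sequence.
filter_fasta() {   # usage: filter_fasta min_length max_percent_stops good poor < in.fasta
  : > "$3"; : > "$4"                       # start with empty output files
  awk -v min_len="$1" -v max_stops="$2" -v good="$3" -v poor="$4" '
    function flush_record(   stops, tmp) {
      if (header == "") return
      tmp = seq
      stops = gsub(/\*/, "", tmp)          # count "*" characters
      if (length(seq) == 0 || length(seq) < min_len ||
          100 * stops / length(seq) > max_stops)
        printf "%s\n%s\n", header, seq > poor
      else
        printf "%s\n%s\n", header, seq > good
    }
    /^>/ { flush_record(); header = $0; seq = ""; next }
    { seq = seq $0 }                       # join wrapped sequence lines
    END { flush_record() }
  '
}

# Three made-up records: one acceptable, one too short, one with too many stops.
filter_dir=$(mktemp -d)
printf '>Ust|a\nMKTAYIAKQR\nQISFVKSH\n>Ust|b\nMK\n>Sac|c\nMKT*AY*IAK*\n' |
  filter_fasta 10 20 "$filter_dir/goodProteins.fasta" "$filter_dir/poorProteins.fasta"
cat "$filter_dir/goodProteins.fasta"
```

Here only Ust|a passes: Ust|b is shorter than 10 residues, and 3 of the 11 characters of Sac|c are stops, which is above 20%.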

5> BLAST alignment

Run an all-versus-all BLAST on the goodProteins.fasta file from the previous step. NCBI BLAST is recommended.

I used ncbi-blast-2.2.28+ here.

Run:

~/Universal_softwore_src/ncbi-blast-2.2.28+/bin/makeblastdb -in goodProteins.fasta -dbtype prot -out goodProteins.fasta

~/Universal_softwore_src/ncbi-blast-2.2.28+/bin/blastp -db goodProteins.fasta -query goodProteins.fasta -outfmt 7 -out goodProteins_blastp.out

This writes the tab-delimited output file goodProteins_blastp.out. The alignment output should be in tabular format; the option that selects it may differ between BLAST versions, and in this version it is -outfmt 7.

Before the later steps can use this file, it must be reduced to the hit lines only, discarding the comment lines.

This can be done with:

grep -v -P "^#" goodProteins_blastp.out > goodProteins_v1_blastp.out
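To show what this filter removes, the sketch below builds a tiny made-up excerpt in the style of -outfmt 7 output and strips its comment lines. The file names and the single hit row are invented for illustration; a plain grep without -P is enough here because the pattern is a simple anchor.

```shell
# Made-up excerpt in the style of BLAST+ "-outfmt 7":
# comment lines start with "#", hit lines are 12 tab-separated columns.
blast_dir=$(mktemp -d)
printf '# BLASTP 2.2.28+\n# Query: Ust|a\n# 1 hits found\n' > "$blast_dir/sample_blastp.out"
printf 'Ust|a\tSac|c\t45.00\t200\t110\t0\t1\t200\t5\t204\t1e-50\t180\n' >> "$blast_dir/sample_blastp.out"

# Keep only the hit lines.
grep -v '^#' "$blast_dir/sample_blastp.out" > "$blast_dir/sample_v1_blastp.out"
cat "$blast_dir/sample_v1_blastp.out"
```

BLAST+ also offers -outfmt 6, which writes the same tabular columns without comment lines, so the grep step would then be unnecessary.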

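6> Use orthomclBlastParser to parse the BLAST results from the previous step. The default thresholds are e-value 1e-5 and coverage 50%.

Usage:

orthomclBlastParser blast_file fasta_files_dir

Run:

../orthomclSoftware-v2.0.9/bin/orthomclBlastParser goodProteins_v1_blastp.out ./compliantFasta > similarSequences.txt

The parser prints its results to standard output, so the redirect is what creates the similarSequences.txt file.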

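The parameters are: blast_file, the tabular BLAST output; and fasta_files_dir, the directory of compliant FASTA files produced by orthomclAdjustFasta.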
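The effect of the two cutoffs from step 1 on parsed similarity rows can be sketched as follows. This is only an illustration, not OrthoMCL's own database logic: the function name apply_cutoffs is hypothetical, the rows are made up, and the column layout (query, subject, query taxon, subject taxon, e-value mantissa, e-value exponent, percent identity, percent match) is an assumption about similarSequences.txt.

```shell
# Sketch of the two cutoffs, NOT OrthoMCL's own SQL.
# Assumed columns: query, subject, query taxon, subject taxon,
# e-value mantissa, e-value exponent, percent identity, percent match.
apply_cutoffs() {   # usage: apply_cutoffs percentMatchCutoff evalueExponentCutoff < rows
  awk -F'\t' -v pm="$1" -v ee="$2" '$8 >= pm && $6 <= ee'
}

# Three made-up rows: only the first satisfies both cutoffs.
{
  printf 'Ust|a\tSac|c\tUst\tSac\t1\t-50\t45\t90\n'   # kept
  printf 'Ust|a\tSac|d\tUst\tSac\t3\t-2\t30\t80\n'    # exponent -2 is greater than -5
  printf 'Ust|b\tSac|c\tUst\tSac\t2\t-20\t60\t40\n'   # percent match 40 is below 50
} | apply_cutoffs 50 -5
```

Rows are dropped when the percent match is below percentMatchCutoff or the e-value exponent is above evalueExponentCutoff, mirroring the two parameter descriptions.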

7> Use orthomclLoadBlast to load the BLAST results into the MySQL database

Usage:

orthomclLoadBlast config_file similar_seqs_file

Run:

../orthomclSoftware-v2.0.9/bin/orthomclLoadBlast orthomcl.config.template similarSequences.txt

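The parameters are: config_file, the OrthoMCL configuration file; and similar_seqs_file, the file produced by orthomclBlastParser.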

8> Use orthomclPairs to compute pairs from the data in the SimilarSequences table of the database

Usage:

orthomclPairs config_file log_file cleanup=[yes|no|only|all] <startAfter=TAG>

Run:

../orthomclSoftware-v2.0.9/bin/orthomclPairs orthomcl.config.template pairs.log cleanup=yes

By default, this creates three tables in MySQL: PotentialOrthologs, PotentialInParalogs, and PotentialCoOrthologs.

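The parameters are: config_file, the configuration file; log_file, the file that receives the run log; cleanup, which controls whether temporary tables are removed; and the optional startAfter=TAG, which resumes a run after a previously completed step.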

9> Use orthomclDumpPairsFiles to dump the pairs tables from the database

Usage:

orthomclDumpPairsFiles config_file

Run:

../orthomclSoftware-v2.0.9/bin/orthomclDumpPairsFiles orthomcl.config.template

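The only parameter, config_file, is the same configuration file used in the earlier steps.

This produces the mclInput file and a pairs directory. The directory contains three files:

ortholog.txt, coortholog.txt, inparalog.txt.

Each file has three columns: protein A, protein B, and their normalized score (see the OrthoMCL Algorithm Document).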

10> Use mcl to cluster the results of the previous step

Run:

mcl ./mclInput --abc -I 1.5 -o ./mclOutput

See the mcl documentation for details on the parameters.

11> Use orthomclMclToGroups to convert the mcl output into groups.txt

Usage:

orthomclMclToGroups my_prefix 1 < mclOutput > groups.txt

Here my_prefix is the prefix used for the group IDs, and 1 is the number at which group numbering starts.

groups.txt is the final result file. Each line in it represents a putative protein family.
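As a quick way to inspect the result, the sketch below summarises each line of a groups.txt-style file as group ID, number of members, and number of distinct taxa. The function name summarize_groups is hypothetical, the two sample lines are made up, and each line is assumed to have the form "groupID: taxon|protein taxon|protein ...".

```shell
# Summarise a groups.txt-style file: group ID, member count, taxon count.
# Assumption: each line is "groupID: taxon|protein taxon|protein ...".
summarize_groups() {   # usage: summarize_groups < groups.txt
  awk '{
    n_taxa = 0
    split("", seen)                    # reset the per-line taxon set
    for (i = 2; i <= NF; i++) {
      split($i, id, "|")               # id[1] is the taxon code
      if (!(id[1] in seen)) { seen[id[1]] = 1; n_taxa++ }
    }
    sub(/:$/, "", $1)                  # strip the colon after the group ID
    print $1 "\t" (NF - 1) "\t" n_taxa
  }'
}

# Two made-up groups.
printf 'my_prefix1: Ust|a Sac|c Sac|d\nmy_prefix2: Ust|b Ust|e\n' | summarize_groups
# prints (tab-separated):
# my_prefix1	3	2
# my_prefix2	2	1
```

A group whose taxon count is 1, such as my_prefix2 here, contains proteins from a single species only.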

Source: http://ngs-assist.com/forum.php?mod=viewthread&tid=46
