Pathway analysis for GWAS data

Pathway analysis is a promising data-analysis approach for the post-GWAS era. Since the GSEA method was proposed in 2007, a steady stream of new methods has appeared.

This post summarizes the methods I have learned so far, one at a time.

I GSEA (gene set enrichment analysis) [software: GenGen]

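0 The software is a package that runs under Linux. On the command line, -- and - behave the same; for convenience, - is used here. Option names can also be abbreviated: -plink_tpedfile_format can be shortened to -plink. To avoid confusion, the full names are generally used below.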

1 Using calculate_association.pl for genome-wide association analysis
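1.1 Input files
1.1.1 There are two input files. The first is the ped_header file. Like a ped file, each line describes one individual; the six columns are family ID (fid), individual ID (iid), father ID, mother ID (mid), sex, and phenotype (coded 1/2, or a quantitative trait value). It can be generated from a tab-delimited ped file with cut -f 1-6.
1.1.2 The second is the GT file. Each line describes one SNP; the first three columns are the marker ID (rs1344..), the chromosome (1-22) and the position (2232343), followed by the genotype of each individual (AB or A B; missing calls are written as NC, --, 00 or 0).
1.1.3 plink tfam/tped files can be used as input instead, with the command ./calculate_association.pl xxx.tfam xxx.tped -plink_tpedfile_format

1.1.4 Individual list file, with only two columns: fid and iid. Suppose indlist.cau lists all Caucasian individuals and indlist.ger lists all German individuals; then the option -keep indlist.cau,indlist.ger restricts the analysis to these individuals. -remove does the opposite. A run also writes a remove file, which has to be passed to -remove in the next run to exclude those individuals.
1.1.5 Marker list file, with a single column: the marker ID (rs...). As above, use -extract or -exclude to include or exclude the listed markers.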
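
The file preparation in 1.1.1 and 1.1.4 can be sketched with cut. This is only an illustration on a made-up two-person ped file; toy.ped and indlist.all are hypothetical names, and the ped file is assumed to be tab-delimited as stated above.

```shell
# Hypothetical tab-delimited PED file: 6 header columns followed by genotypes.
printf 'F1\tI1\t0\t0\t1\t2\tA B\tA A\nF2\tI2\t0\t0\t2\t1\tB B\tA B\n' > toy.ped

# ped_header: keep only the first six columns (IDs, sex, phenotype).
cut -f 1-6 toy.ped > ped_header

# Individual list for -keep / -remove: just fid and iid.
cut -f 1-2 toy.ped > indlist.all
```

If the real ped file is space-delimited, cut needs -d ' ' as well, since its default delimiter is the tab.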

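1.2 Association analysis (case-control)
1.2.1 Command: ./calculate_association.pl ped_header gt.txt -cc -ab
-cc selects the analysis type (--tdt, --tsp, --pdt, --cc or --qt). -cc requests a case-control test and outputs the chi2 and P values of five association models: allelic association test, Cochran-Armitage trend association test, genotypic association test (2df test), dominant model association test and recessive model association test.
-ab states that genotypes are coded as A/B; omit this option if they are coded as ACGT.
-cellsize 5 (the default) sets the minimum count for each cell of the contingency table to 5. If a cell falls below this value, GENO_chi2, G_P, DOM_chi2, D_P, REC_chi2 and R_P cannot be computed and are shown as NA. With -cellsize 0 they are computed and reported. Note that this behavior differs from plink.
-output yy writes yy.log, which records the program's notices and warnings;
yy.minfo holds summary statistics for every marker: no-call rate, maf, HW p, fraction of markers with Mendelian inconsistency for all individuals
yy.sinfo holds summary statistics for every individual: no-call rate, fraction of markers with Mendelian inconsistency for all individuals
yy.finfo holds Mendelian errors per family: fraction of markers with Mendelian inconsistency for all nuclear families
yy.remove lists the individuals that fail the inclusion criteria and must be excluded in the next run with -remove yy.remove. The reason each one was excluded can be looked up in yy.sinfo.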

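1.2.2 Advanced command: ./calculate_association.pl -cc -ab ped_header gt.txt -allmarker -allind -snpprofile gt.snpgenmap 2> /dev/null
-exclude file             -remove file
-extract file             -keep file
-geno_threshold <float>   -mind <float>
-maf_threshold <float>    -fme
-mme_threshold <float>    -ime
-hwe_threshold <float>
Default filters applied to individuals: --mind 0.1 --fme 1 --ime 0.02. Individuals that fail are written to yy.remove for use in a second run.
Default filters applied to markers: --geno 0.1 --maf 0.01 --mme 0.1 --hwe 0.001. Markers that fail are reported on STDERR (i.e. in the terminal).
For example, markers that fail the maf threshold are listed between a pair of #dm_maf# #dm_maf# tags.
#dm_geno# lists markers that fail the genotyping-rate threshold
#dm_hwe# lists markers that fail the hwe threshold
#dm_invalidgt# lists markers with invalid genotype calls, e.g. values outside A, C, G, T, - and 0
#dm_invalidchr# lists markers that are not on an autosome or the X chromosome
#dm_gt2allele# lists markers with more than two alleles
2> /dev/null suppresses these warning/notification messages by redirecting STDERR.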

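1.2.3 Permutation test
-cycle 10 generates 10 permutations of the phenotype and adds two columns to the result file, chi2_perm and chi2_P_perm. The option below changes which test chi2_perm is taken from. These two columns can be extracted for the GSEA analysis.
-perm_method 1-5 (1=allelic as default, 2=trend, 3=genotypic, 4=dom, 5=rec)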

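1.2.4 Annotating SNPs
-snpprofile gt.snpgenmap This file has two columns: the marker name and the name of the gene it lies in. With this option the results include SNP_property, i.e. the gene each SNP belongs to, which makes the results easier to read.
The result file can be sorted at the same time: sort -k 7,7g output | head -n 1000 sorts on column 7 in general-numeric order and keeps the first 1000 lines.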
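
The g in -k 7,7g matters because P values are often printed in scientific notation. The sketch below compares it with plain numeric sorting on a made-up file; toy.assoc and its column layout are hypothetical and only assume that column 7 holds the P value.

```shell
# Hypothetical 7-column result lines; only the last column (a P value) matters here.
printf 'rs1\t1\t100\tA\tB\t4.7\t0.03\nrs2\t1\t200\tA\tB\t27.1\t2e-7\nrs3\t2\t300\tA\tB\t10.8\t1e-3\n' > toy.assoc

# General-numeric sort understands exponents: 2e-7 < 1e-3 < 0.03.
LC_ALL=C sort -k 7,7g toy.assoc | cut -f 1,7 > toy.sorted_g

# Plain numeric sort stops reading at the "e", so 2e-7 is treated as 2.
LC_ALL=C sort -k 7,7n toy.assoc | cut -f 1,7 > toy.sorted_n

cat toy.sorted_g
```

With -g the most significant marker (rs2) comes first; with -n it ends up last.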

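1.3 Association analysis (quantitative trait)
1.3.1 Command: ./calculate_association.pl ped_header gt.txt -qt
The test assumes an additive model for B alleles. To use a multiplicative model, log-transform the QT values when preparing the input file.
When a marker does not meet the requirements, e.g. it has more than two alleles or is located on mitochondrial DNA, F and F_P are shown as NA.

1.3.2 Permutation test: ./calculate_association.pl ped_header gt.txt -qt -cycle 10
This adds two columns, F_PERM and F_P_PERM; F_PERM can be used for the GSEA analysis.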

1.4 Parallel computation (e.g. 1000 permutations, split into 10 runs of 100 cycles with different seeds)
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 1 -perm 2 -noflush > combine.perm1
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 2 -perm 2 -noflush > combine.perm2
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 3 -perm 2 -noflush > combine.perm3
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 4 -perm 2 -noflush > combine.perm4
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 5 -perm 2 -noflush > combine.perm5
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 6 -perm 2 -noflush > combine.perm6
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 7 -perm 2 -noflush > combine.perm7
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 8 -perm 2 -noflush > combine.perm8
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 9 -perm 2 -noflush > combine.perm9
calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed 10 -perm 2 -noflush > combine.perm10
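
The ten commands above differ only in the seed and the output file, so they can also be generated by a loop. The sketch below only writes the command lines to a job file and runs nothing; perm_jobs.txt is a hypothetical name, and all options are copied unchanged from the commands above.

```shell
# Write the ten permutation commands (seeds 1..10) to a job file.
: > perm_jobs.txt
for seed in 1 2 3 4 5 6 7 8 9 10; do
    printf 'calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 100 -seed %s -perm 2 -noflush > combine.perm%s\n' "$seed" "$seed" >> perm_jobs.txt
done

# Ten lines, one independent job per seed.
wc -l < perm_jobs.txt
```

Because every line is an independent job, the file can be run as is with sh perm_jobs.txt, or its lines can be handed to whatever job scheduler is available so that they run side by side.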

2 Using calculate_gsea.pl for pathway-based association analysis on GWAS

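2.1 Input files
2.1.1 Association result file
2.1.2 SNP-Gene mapping file (2006 human assembly or RefSeq annotation)
2.1.3 Pathway definition file. The first two columns are the pathway id and the pathway description; the remaining columns are gene ids. Each line describes one pathway. The GenGen author has already compiled a file of ~2000 pathways from Gene Ontology, BioCarta and KEGG; the C2 and C5 collections of the MSigDB database can be used as well.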

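2.2 Running the analysis
2.2.1 Command: ./calculate_gsea.pl gsea.chi2 gsea.gmt -map gsea.snpgenemap -perm gsea.cc10 -log yy -dist 500k -setmin 20 -setmax 200
Because the permuted association results have to be computed in several batches, GSEA can be run on each batch separately. Each run produces a log file containing the computed ES, NES, nominal P, FDR and FWER. combine_gsea.pl then merges these results into a single set of statistics.
gsea.chi2 can be ignored if a perm file is available
-map gsea.snpgenemap the SNP-gene mapping file
-dist 500k SNPs outside the 500 kb range are discarded
gsea.gmt the gene set file
-setmin 20 (default) a gene set must contain at least 20 genes that appear in the SNP-gene mapping file
-setmax 200 (default) a gene set may contain at most 200 genes that appear in the SNP-gene mapping file
-perm gsea.cc10 contains the chi2 values of 10 permutations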
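
To see which pathways are likely to pass -setmin and -setmax, the genes of each set that also occur in the SNP-gene mapping file can be counted beforehand. The awk sketch below does this on toy files. It is not part of GenGen, and it rests on two assumptions: both files are tab-delimited, and the gene name is the second column of the mapping file. All file names are hypothetical.

```shell
# Hypothetical toy inputs; real files are far larger.
printf 'rs1\tGENE_A\nrs2\tGENE_B\nrs3\tGENE_B\n' > toy.snpgenemap
printf 'P1\tdemo pathway 1\tGENE_A\tGENE_B\tGENE_C\nP2\tdemo pathway 2\tGENE_C\n' > toy.gmt

awk -F '\t' '
    NR == FNR { mapped[$2] = 1; next }   # first file: remember genes that have a SNP
    {
        n = 0                            # second file: one pathway per line
        for (i = 3; i <= NF; i++)        # gene ids start in column 3
            if ($i in mapped) n++
        print $1 "\t" n                  # pathway id and number of mapped genes
    }
' toy.snpgenemap toy.gmt > toy.setsize

cat toy.setsize
```

In the toy data P1 keeps two mapped genes and P2 none. With the thresholds above, pathways whose count falls outside 20 to 200 would be expected to drop out of the analysis.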

2.2.2 Merging the result files of several runs: ./combine_gsea.pl *.log
