Gene chip (Affymetrix) analysis 3: finding differentially expressed genes

(This article was updated on 2013-09-04.)

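"Difference" is a statistical concept, so finding differentially expressed genes calls for statistical methods. R has strong statistical facilities and is well suited to the job. Read the data in the same way as in the previous parts: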

library(affy)
library(tcltk)
filters <- matrix(c("CEL file", ".[Cc][Ee][Ll]", "All", ".*"), ncol = 2, byrow = TRUE)
cel.files <- tk_choose.files(caption = "Select CELs", multi = TRUE,
                             filters = filters, index = 1)
data.raw <- ReadAffy(filenames = cel.files)
n.cel <- length(cel.files)
# Inspect the sample names
sampleNames(data.raw)

## [1] "NRID9780_Zarka_2-1_MT-0HCA(SOIL)_Rep1_ATH1.CEL" 
## [2] "NRID9781_Zarka_2-2_MT-0HCB(SOIL)_Rep2_ATH1.CEL" 
## [3] "NRID9782_Zarka_2-3_MT-1HCA(SOIL)_Rep1_ATH1.CEL" 
## [4] "NRID9783_Zarka_2-4_MT-1HCB(SOIL)_Rep2_ATH1.CEL" 
## [5] "NRID9784_Zarka_2-5_MT-24HCA(SOIL)_Rep1_ATH1.CEL"
## [6] "NRID9785_Zarka_2-6_MT-24HCB(SOIL)_Rep2_ATH1.CEL"
## [7] "NRID9786_Zarka_2-7_MT-7DCA(SOIL)_Rep1_ATH1.CEL" 
## [8] "NRID9787_Zarka_2-8_MT-7DCB(SOIL)_Rep2_ATH1.CEL"

# Simplify the names and set pData
sampleNames(data.raw)  <- paste("sample",1:n.cel, sep='')
pData(data.raw)$treatment <- rep(c("0h", "1h", "24h", "7d"), each=2)
pData(data.raw)

##         sample treatment
## sample1      1        0h
## sample2      2        0h
## sample3      3        1h
## sample4      4        1h
## sample5      5       24h
## sample6      6       24h
## sample7      7        7d
## sample8      8        7d

Preprocess the data with both the rma and mas5 methods:

eset.rma <- rma(data.raw)

## Background correcting
## Normalizing
## Calculating Expression

eset.mas5 <- mas5(data.raw)

## background correction: mas 
## PM/MM correction : mas 
## expression values: mas 
## background correcting...done.
## 22810 ids to be processed
## |                    |
## |####################|

1 Computing gene expression values

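This is simple: the exprs function extracts the expression values from an eset object as a matrix. Note, however, that the eset produced by rma is log2-transformed, while the eset produced by mas5 holds raw signal intensities. Expression is normally reported as log-transformed signal, but some calculations need the untransformed values, so compute both forms: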

emat.rma.log2 <- exprs(eset.rma)
emat.mas5.nologs <- exprs(eset.mas5)
class(emat.rma.log2)

## [1] "matrix"

head(emat.rma.log2, 1)

##           sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
## 244901_at    4.04   4.348   4.048   4.052   4.019   3.962    4.03   4.062

head(emat.mas5.nologs, 1)

##           sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
## 244901_at   39.79   94.94   57.35   60.23   61.08   55.57   53.95   85.98

emat.rma.nologs <- 2^emat.rma.log2
emat.mas5.log2 <- log2(emat.mas5.nologs)

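From here on only the rma results are used for the demonstration. Compute the mean expression of each group and the fold change relative to the 0h control: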

rm(eset.mas5)
rm(emat.mas5.nologs)
rm(emat.mas5.log2)
# Mean of the two replicates in each group (values are already on the log2 scale)
results.rma <- data.frame((emat.rma.log2[,c(1,3,5,7)] + emat.rma.log2[,c(2,4,6,8)])/2)
# log2 fold change relative to 0h (difference of the log2 means)
results.rma$fc.1h <- results.rma[,2]-results.rma[,1]
results.rma$fc.24h <- results.rma[,3]-results.rma[,1]
results.rma$fc.7d <- results.rma[,4]-results.rma[,1]
head(results.rma, 2)

##           sample1 sample3 sample5 sample7   fc.1h  fc.24h   fc.7d
## 244901_at   4.194   4.050   3.991   4.046 -0.1448 -0.2037 -0.1481
## 244902_at   4.293   4.159   4.061   3.937 -0.1340 -0.2316 -0.3557
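The averaging and subtraction above can be tried on a small made-up matrix. This is only a sketch: the probe names, column names and values are invented for illustration and are not taken from the data set used in this post.

```r
# Toy log2 expression matrix: 3 probes x 4 arrays, two replicates per group.
# All names and values are hypothetical.
emat <- matrix(c(4.0, 4.4, 5.0, 5.2,
                 8.0, 8.2, 6.9, 7.1,
                 6.0, 6.0, 6.1, 5.9),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("probe_a", "probe_b", "probe_c"),
                               c("ctrl1", "ctrl2", "trt1", "trt2")))

# Mean log2 signal per group
mean.ctrl <- rowMeans(emat[, c("ctrl1", "ctrl2")])
mean.trt  <- rowMeans(emat[, c("trt1", "trt2")])

# A difference of log2 means is a log2 fold change
log2.fc <- mean.trt - mean.ctrl

# The same change expressed on the linear scale
fc <- 2^log2.fc

round(cbind(mean.ctrl, mean.trt, log2.fc, fc), 3)
```

A log2 fold change of 1 corresponds to a doubling and -1 to a halving, which is why the 2-fold threshold used later is written as an absolute log2 value of 1.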

A brief aside on three ways to take subsets of data in R, mainly for matrices and data frames:

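By index: for example emat.rma.log2[, c(1,3,5,7)] as used above. emat.rma.log2 is a two-dimensional matrix; leaving the first dimension empty (nothing before the comma) selects all rows, and the index vector c(1,3,5,7) in the second dimension selects columns 1, 3, 5 and 7.

By row or column name: in emat.rma.log2[c("244901_at", "244902_at"), ] the first dimension (rows) is the name vector c("244901_at", "244902_at") and the second is left empty, so this returns all columns of the rows with those names. Columns can be selected by name in the same way.

By logical vector: for example, to select all rows of results.rma whose fc.7d is greater than 0, first build a logical vector and then subset with it. The two steps can also be combined into one.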

subset.logic <- results.rma$fc.7d>0
subset.data <- results.rma[subset.logic,]

Note that the logical vector must be as long as the dimension it indexes. Elements where it is TRUE are kept and those where it is FALSE are dropped:

length(subset.logic); nrow(results.rma)

## [1] 22810

head(subset.logic)

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE
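The three ways of subsetting can be compared side by side on a tiny data frame. This is a minimal sketch with invented row names and values, using base R only.

```r
# Hypothetical data frame with named rows
df <- data.frame(a = c(1.5, -0.2, 0.7),
                 b = c(10, 20, 30),
                 row.names = c("g1", "g2", "g3"))

# 1. By index: all rows, column 2 only (drop = FALSE keeps a data frame)
by.index <- df[, 2, drop = FALSE]

# 2. By name: rows g1 and g3, all columns
by.name <- df[c("g1", "g3"), ]

# 3. By logical vector, in two steps
keep <- df$a > 0
by.logic <- df[keep, ]

# The same selection in one step
one.step <- df[df$a > 0, ]

identical(by.logic, one.step)
```

Here the logical vector keep has one element per row of df, which is the length requirement described above.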

2 Selecting "expressed" genes

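Many people skip this step as unnecessary. In principle, though, not every gene the chip can detect is actually expressed in the sample, and for non-pooled samples with few replicates that is certainly the case. Treating every gene on the chip as expressed in the sample and feeding it into the statistical analysis is clearly inappropriate.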

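Two approaches are commonly used to select "expressed" genes: the genefilter package, and the mas5calls() function in the affy package. genefilter requires you to set filtering thresholds; different people use different criteria, which makes it somewhat arbitrary and hard to automate, so it is not covered here. mas5calls works on probe-level data (an AffyBatch object). Running it on chip data that has not been preprocessed is generally the more widely applicable choice, and the other arguments can be left at their defaults: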

data.mas5calls <- mas5calls(data.raw)

## Getting probe level data...
## Computing p-values
## Making P/M/A Calls

Use exprs again to extract the "expression" values; the result contains only three values, P, M and A. See ?mas5calls for details: P is present, A is absent, and M is marginal.

eset.mas5calls <- exprs(data.mas5calls)
head(eset.mas5calls)

##           sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
## 244901_at "A"     "P"     "P"     "A"     "P"     "P"     "A"     "P"    
## 244902_at "A"     "P"     "P"     "M"     "A"     "A"     "P"     "A"    
## 244903_at "P"     "P"     "P"     "P"     "P"     "P"     "P"     "P"    
## 244904_at "A"     "A"     "A"     "A"     "A"     "A"     "A"     "A"    
## 244905_at "A"     "A"     "A"     "A"     "A"     "A"     "A"     "A"    
## 244906_at "A"     "A"     "A"     "A"     "A"     "A"     "A"     "A"

Now select the genes that are called present on at least one chip: 16005 out of 22810.

AP <- apply(eset.mas5calls, 1, function(x)any(x=="P"))
present.probes <- names(AP[AP])
paste(length(present.probes),"/",length(AP))

## [1] "16005 / 22810"
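The "present on at least one chip" rule can be checked on a small invented call matrix. The sketch below uses only base R; the stricter variant at the end (present on at least two chips) is a hypothetical alternative threshold, not something used elsewhere in this post.

```r
# Hypothetical P/M/A call matrix: 3 probes x 4 arrays
calls <- matrix(c("A", "P", "A", "A",
                  "P", "P", "P", "P",
                  "A", "A", "M", "A"),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("probe_a", "probe_b", "probe_c"),
                                paste0("sample", 1:4)))

# TRUE for probes with at least one "P" call
any.present <- apply(calls, 1, function(x) any(x == "P"))
present.ids <- names(any.present[any.present])

# Number of "P" calls per probe, for a stricter filter
n.present <- rowSums(calls == "P")
strict.ids <- names(n.present[n.present >= 2])  # assumed threshold

present.ids
strict.ids
```

Marginal ("M") calls are treated as not present by both rules, which matches the x=="P" test in the code above.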

It is worth removing some intermediate data:

rm(data.mas5calls)
rm(eset.mas5calls)

present.probes is a vector of names; use it to subset the data:

results.present <- results.rma[present.probes,]

3 Finding differentially expressed genes

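In biological data analysis "difference" has two meanings: statistical difference and biological difference. Suppose a gene has three measurements under each of two conditions: 99, 100, 101 and 102, 103, 104. Statistically the expression values differ between the conditions, the second being higher. But is that biologically meaningful? Not necessarily. By the means, expression rose by 3%; what biological effect that could have depends on the gene. For this reason differentially expressed genes are usually selected with at least two thresholds: the size of the expression change and a measure of statistical significance (p-value, q-value, and so on).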
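The 99, 100, 101 versus 102, 103, 104 example is easy to run. The sketch below uses the default two-sample t-test in base R and treats the six numbers as untransformed signal values, which is an assumption made only for this illustration.

```r
# The two sets of measurements from the example
a <- c(99, 100, 101)
b <- c(102, 103, 104)

# Welch two-sample t-test (the default of t.test)
tt <- t.test(a, b)
tt$p.value               # below 0.05: statistically different

# Size of the change
fc <- mean(b) / mean(a)  # 1.03, a 3% increase
log2(fc)

# Does the gene pass a 2-fold threshold?
abs(log2(fc)) >= 1       # FALSE
```

The gene passes the significance threshold but fails the fold-change threshold, so a filter that uses both would not report it.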

3.1 Simple t-test

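This approach needs little statistical background; it comes naturally to biologists, and quite a few people do use it. A common threshold is an expression change of at least 2-fold, that is |log2(fc)| >= log2(2) = 1. First check whether any genes reach it: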

apply(abs(results.present[,5:7]), 2, max)

##  fc.1h fc.24h  fc.7d 
##  5.309  6.688  6.844

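apply is a very useful function: it applies a function to data along a chosen dimension. The first argument is a matrix or array (or something that can be coerced to one, such as a data frame), the second is the dimension to apply over, and the third is the function to use. The statement above applies max (the maximum) to abs(results.present[,5:7]) by column (the second dimension). For the 7d comparison, 842 genes change by at least 2-fold: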

sum(abs(results.present[,"fc.7d"])>=log2(2))

## [1] 842
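The effect of the second argument of apply, and the sum-of-a-logical-vector idiom used for counting, can be seen on a small invented matrix. This is a sketch with made-up values, base R only.

```r
# Hypothetical 2 x 3 matrix of log2 fold changes
m <- matrix(c( 1, -5,  3,
              -2,  4, -6),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Dimension 2: one result per column
col.max <- apply(abs(m), 2, max)

# Dimension 1: one result per row
row.max <- apply(abs(m), 1, max)

# TRUE counts as 1, so sum() of a logical vector counts the matches
n.big <- sum(abs(m[, "c3"]) >= 4)

col.max
row.max
n.big
```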

Select these 842 genes:

results.st <- results.present[abs(results.present$fc.7d)>=log2(2),]
sel.genes <- row.names(results.st)

Run the t-test and keep the differentially expressed genes with p<0.05:

p.value <- apply(emat.rma.log2[sel.genes,], 1, function(x){t.test(x[1:2], x[7:8])$p.value})
results.st$p.value <- p.value
names(results.st)

## [1] "sample1" "sample3" "sample5" "sample7" "fc.1h"   "fc.24h"  "fc.7d"  
## [8] "p.value"

results.st <- results.st[, c(1,4,7,8)]
results.st <- results.st[p.value<0.05,]
head(results.st, 2)

##           sample1 sample7  fc.7d p.value
## 245042_at   8.153   7.021 -1.133 0.01004
## 245088_at   7.041   5.419 -1.622 0.03381

nrow(results.st)

## [1] 347

The simple t-test approach yields 347 differentially expressed genes with a change of at least 2-fold.

3.2 SAM (Significance Analysis of Microarrays)

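The R package samr implements this. Install it first (the commands below match the Bioconductor release used here; current releases use BiocManager::install() in place of biocLite()):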

library(BiocInstaller)
biocLite("samr")

This method was popular for a while, but because its control of the FDR (false discovery rate) is poor it is now rarely used.

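It is not hard to use. Note, however, that the expression matrix passed to SAM here is the subset of "expressed" genes selected with present.probes. With unfiltered data the results are very different; try it yourself if you doubt it (this may be another weakness of the method):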

library(samr)
samfit <- SAM(emat.rma.nologs[present.probes,c(1,2,7,8)], c(1,1,2,2),
 resp.type="Two class unpaired", genenames=present.probes)

## perm= 1
## perm= 2
## perm= 3
## perm= 4
## perm= 5
## perm= 6
## perm= 7
## perm= 8
## perm= 9
## perm= 10
## perm= 11
## perm= 12
## perm= 13
## perm= 14
## perm= 15
## perm= 16
## perm= 17
## perm= 18
## perm= 19
## perm= 20
## perm= 21
## perm= 22
## perm= 23
## perm= 24
## 
## Computing delta table
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
## 33
## 34
## 35
## 36
## 37
## 38
## 39
## 40
## 41
## 42
## 43
## 44
## 45
## 46
## 47
## 48
## 49
## 50

SAM returns a list; see ?SAM for details. The differentially expressed genes are in siggenes.table, which is itself a list:

str(samfit$siggenes.table)

## List of 5
##  $ genes.up           : chr [1:6748, 1:7] "265483_at" "248583_at" "253183_at" "252409_at" ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:7] "Gene ID" "Gene Name" "Score(d)" "Numerator(r)" ...
##  $ genes.lo           : chr [1:5341, 1:7] "259382_s_at" "248748_at" "249658_s_at" "266863_at" ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:7] "Gene ID" "Gene Name" "Score(d)" "Numerator(r)" ...
##  $ color.ind.for.multi: NULL
##  $ ngenes.up          : int 6748
##  $ ngenes.lo          : int 5341

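Up-regulated genes are in genes.up of siggenes.table and down-regulated genes in genes.lo. The structure shown above also gives the numbers of differentially expressed genes, ngenes.up and ngenes.lo. Extract the data for the differentially expressed genes: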

results.sam <- data.frame(rbind(samfit$siggenes.table$genes.up,samfit$siggenes.table$genes.lo),
 row.names=1, stringsAsFactors=FALSE)
for(i in 1:ncol(results.sam)) results.sam[,i] <- as.numeric(results.sam[,i])
head(results.sam, 2)

##           Gene.Name Score.d. Numerator.r. Denominator.s.s0. Fold.Change
## 265483_at     14534    222.3        22.48             0.101       1.989
## 248583_at      2667    186.8        88.45             0.473       1.670
##           q.value...
## 265483_at          0
## 248583_at          0

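Filtering by fold change leaves 861 genes that change by at least 2-fold (19 more than the 842 from the direct calculation in section 3.1; the reason is discussed at the end of section 3.4):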

results.sam <- results.sam[abs(log2(results.sam$Fold.Change))>=log2(2), ] ; nrow(results.sam)

## [1] 861

Filtering by q-value, only 10 genes have q<0.05 while 685 have q<0.1, so choosing the threshold is itself a problem with this method:

# samr reports q-values as percentages, so 5 means 5%
nrow(results.sam[results.sam$q.val<5,])

## [1] 10

nrow(results.sam[results.sam$q.val<10,])

## [1] 685

3.3 Wilcoxon's signed-rank test

This method was published in Liu, W.-m. et al., Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics, 2002, 18, 1593-1599. It is implemented in the detection.p.val function of the R package simpleaffy and can be invoked through the pairwise.comparison function:

library(simpleaffy)
# Note the order of the groups in the statement below
sa.fit <- pairwise.comparison(eset.rma, "treatment", c("7d", "0h"))

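pairwise.comparison returns an object of simpleaffy's own "PairComp" class, and its contents are extracted with dedicated accessor functions: means for the group means, fc for the fold change (log2), and tt for the t-test p-values: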

class(sa.fit)

## [1] "PairComp"
## attr(,"package")
## [1] "simpleaffy"

results.sa <- data.frame(means(sa.fit), fc(sa.fit), tt(sa.fit))
# Keep only the expressed genes
results.sa <- results.sa[present.probes,]
head(results.sa, 2)

##             X7d   X0h fc.sa.fit. tt.sa.fit.
## 244901_at 4.047 4.203    -0.1562    0.43982
## 244902_at 3.938 4.295    -0.3570    0.05824

colnames(results.sa) <- c("7d", "0h", "fold.change", "p.val")
head(results.sa, 2)

##              7d    0h fold.change   p.val
## 244901_at 4.047 4.203     -0.1562 0.43982
## 244902_at 3.938 4.295     -0.3570 0.05824

Filtering by fold change gives 862 genes that change by at least 2-fold; filtering those by p-value leaves 562 differentially expressed genes:

results.sa <- results.sa[abs(results.sa$fold.change)>=log2(2), ]; nrow(results.sa)

## [1] 862

results.sa <- results.sa[results.sa$p.val<0.05,]; nrow(results.sa)

## [1] 562

3.4 Moderated T statistic

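This method is best implemented in the R package limma. limma was originally aimed at two-colour (two-channel) arrays; it now handles single-channel arrays as well, and recent versions add support for RNA-seq data, so it is well worth learning. It is installed like the other Bioconductor packages above. After loading limma, the limmaUsersGuide() function opens the user's guide in PDF format.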

limma does a great deal; only the workflow for differential expression is shown here. For the underlying algorithm see limma's online help and this paper: Smyth G K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments[J]. Statistical applications in genetics and molecular biology, 2004, 3(1): 3.

limma first needs a design matrix that describes the RNA samples:

library(limma)
treatment <- factor(pData(eset.rma)$treatment)
design <- model.matrix(~ 0 + treatment)
colnames(design) <- c("C0h", "T1h", "T24h", "T7d")
design

##   C0h T1h T24h T7d
## 1   1   0    0   0
## 2   1   0    0   0
## 3   0   1    0   0
## 4   0   1    0   0
## 5   0   0    1   0
## 6   0   0    1   0
## 7   0   0    0   1
## 8   0   0    0   1
## attr(,"assign")
## [1] 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treatment
## [1] "contr.treatment"

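As shown, each row of the matrix is one chip and each column is one RNA source (or treatment). You may also need a second matrix that specifies which comparisons between samples to make: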

contrast.matrix <- makeContrasts(T1h-C0h, T24h-C0h, T7d-C0h, levels=design)
contrast.matrix

##       Contrasts
## Levels T1h - C0h T24h - C0h T7d - C0h
##   C0h         -1         -1        -1
##   T1h          1          0         0
##   T24h         0          1         0
##   T7d          0          0         1
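What the design and contrast matrices mean numerically can be seen without limma. The sketch below uses base R's lm.fit on invented log2 values for a single probe, and builds the contrast matrix by hand instead of with makeContrasts. It only illustrates the arithmetic and is not a substitute for limma's moderated statistics.

```r
# Same layout as above: four groups, two replicates each
treatment <- factor(rep(c("0h", "1h", "24h", "7d"), each = 2))
design <- model.matrix(~ 0 + treatment)
colnames(design) <- c("C0h", "T1h", "T24h", "T7d")

# Hypothetical log2 values of one probe on the 8 arrays
y <- c(4.0, 4.4, 5.0, 5.2, 6.1, 5.9, 7.0, 7.4)

# With this design each fitted coefficient is simply a group mean
coefs <- lm.fit(design, y)$coefficients

# Hand-built contrast matrix: each column is one comparison against 0h
contrasts <- cbind("T1h-C0h"  = c(-1, 1, 0, 0),
                   "T24h-C0h" = c(-1, 0, 1, 0),
                   "T7d-C0h"  = c(-1, 0, 0, 1))
rownames(contrasts) <- colnames(design)

# Contrast estimates are differences of group means, i.e. log2 fold changes
log2.fc <- drop(t(contrasts) %*% coefs)

coefs
log2.fc
```

Each column of the contrast matrix sums to zero, so every comparison is a difference between a treated group and the 0h control.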

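Next, fit the linear model, compute the contrasts between groups, and apply the empirical Bayes moderation (the p-value adjustment itself is done later, in topTable):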

fit <- lmFit(eset.rma[present.probes,], design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

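First count the genes. For the T7d-C0h comparison (coef=3), 842 genes change by at least 2-fold (the lfc argument), and 740 genes both change by at least 2-fold and have an FDR-adjusted p<0.05: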

nrow(topTable(fit2, coef=3, adjust.method="fdr", lfc=1, number=30000))

## [1] 842

nrow(topTable(fit2, coef=3, adjust.method="fdr", p.value=0.05, lfc=1, number=30000))

## [1] 740
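The "fdr" adjustment requested from topTable is the Benjamini-Hochberg procedure, which base R exposes as p.adjust(p, method = "BH") ("fdr" is an alias for "BH" there). The sketch below applies it to eight invented p-values and repeats the calculation by hand; it shows the adjustment only and says nothing about how limma obtains its raw p-values.

```r
# Hypothetical raw p-values for eight genes
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205)

# Benjamini-Hochberg adjustment with the built-in function
p.bh <- p.adjust(p, method = "BH")

# The same by hand: scale each p-value by n / rank, then enforce
# monotonicity from the largest p-value downwards
n <- length(p)
o <- order(p, decreasing = TRUE)
by.hand <- numeric(n)
by.hand[o] <- pmin(1, cummin(p[o] * n / (n:1)))

all.equal(p.bh, by.hand)

# Genes passing 0.05 before and after adjustment
sum(p < 0.05)
sum(p.bh < 0.05)
```

Fewer genes pass after adjustment, which is the same effect as the drop from 842 to 740 genes above.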

Store the result of topTable in a variable; it is a data frame and can be saved to a file with write.table or write.csv:

results.lim <- topTable(fit2, coef=3, adjust.method="fdr", p.value=0.05, lfc=1, number=30000)
class(results.lim)

## [1] "data.frame"

head(results.lim)

##            logFC AveExpr      t   P.Value adj.P.Val      B
## 254818_at  6.215  10.363  41.38 7.304e-10 1.169e-05 11.538
## 254805_at  6.844   7.280  30.81 6.095e-09 4.878e-05 10.431
## 245998_at  2.778  10.011  25.44 2.411e-08 9.545e-05  9.528
## 265119_at  4.380   8.282  24.07 3.588e-08 9.545e-05  9.240
## 256114_at  4.461   7.668  23.92 3.745e-08 9.545e-05  9.208
## 265722_at -2.913   9.276 -23.91 3.760e-08 9.545e-05  9.205

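Why do these methods give slightly different counts when filtering by fold change (2-fold) alone? limma and the direct calculation both give 842, while simpleaffy and SAM give 862 and 861. The cause is the order of taking logarithms and averaging the eset signal values: limma takes logs first and then averages, whereas simpleaffy and SAM average first and then take logs.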
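The effect of the order of averaging and taking logs is easy to reproduce. In the sketch below the four signal values are invented so that one gene lands on different sides of the 2-fold threshold depending on the order.

```r
# Hypothetical raw (unlogged) signals, two replicates per group
ctrl <- c(100, 100)
trt  <- c(120, 290)

# Log first, then average (what the direct calculation and limma do)
lfc.log.first <- mean(log2(trt)) - mean(log2(ctrl))

# Average first, then log (what the text attributes to simpleaffy and SAM)
lfc.mean.first <- log2(mean(trt)) - log2(mean(ctrl))

lfc.log.first        # just under 1
lfc.mean.first       # just over 1

abs(lfc.log.first)  >= 1   # FALSE: fails the 2-fold filter
abs(lfc.mean.first) >= 1   # TRUE: passes the 2-fold filter
```

Averaging logs gives the log of the geometric mean, which is never larger than the log of the arithmetic mean, so replicates that disagree can shift a gene across the threshold.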

3.5 Other methods

One example is the rank products method, implemented in the R package RankProd. The method is described in: Breitling R, Armengaud P, Amtmann A, et al. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments[J]. FEBS letters, 2004, 573(1): 83-92.

Finally, save some of the data for next time:

# Names of all expressed genes
write(present.probes, "genes.expressed.txt")
# Differentially expressed genes after 7 days of treatment
write.csv(results.lim, "results.lim.7d.csv")
# emat.rma.log2 for the expressed genes
write.csv(emat.rma.log2[present.probes,], "emat.rma.log2.csv")
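Files written this way can be read back with readLines and read.csv. The sketch below round-trips a small invented table through temporary files so that it does not touch the files saved above; the table contents are hypothetical.

```r
# Hypothetical result table with probe IDs as row names
toy <- data.frame(logFC = c(6.2, -2.9),
                  adj.P.Val = c(1e-5, 9e-5),
                  row.names = c("probe_a", "probe_b"))

# write.csv stores the row names in the first column
csv.file <- tempfile(fileext = ".csv")
write.csv(toy, csv.file)

# row.names = 1 turns that first column back into row names
back <- read.csv(csv.file, row.names = 1)
all.equal(toy, back)

# write() puts a character vector one element per line
ids.file <- tempfile(fileext = ".txt")
write(rownames(toy), ids.file)
ids <- readLines(ids.file)
ids
```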

4 Session Info

sessionInfo()

## R version 3.0.1 (2013-05-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
##  [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  tcltk     stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] limma_3.17.22         simpleaffy_2.37.1     gcrma_2.33.1         
##  [4] genefilter_1.43.0     samr_2.0              matrixStats_0.8.5    
##  [7] impute_1.35.0         ath1121501cdf_2.12.0  AnnotationDbi_1.23.18
## [10] affy_1.39.2           Biobase_2.21.6        BiocGenerics_0.7.4   
## [13] zblog_0.0.1           knitr_1.4.1          
## 
## loaded via a namespace (and not attached):
##  [1] affyio_1.29.0         annotate_1.39.0       BiocInstaller_1.11.4 
##  [4] Biostrings_2.29.15    DBI_0.2-7             digest_0.6.3         
##  [7] evaluate_0.4.7        formatR_0.9           highr_0.2.1          
## [10] IRanges_1.19.28       preprocessCore_1.23.0 R.methodsS3_1.4.4    
## [13] RSQLite_0.11.4        splines_3.0.1         stats4_3.0.1         
## [16] stringr_0.6.2         survival_2.37-4       tools_3.0.1          
## [19] XML_3.98-1.1          xtable_1.7-1          XVector_0.1.0        
## [22] zlibbioc_1.7.0

Author: ZGUANG@LZU

Created: 2013-09-04 Wed 15:35

