GENIE3 || Gene regulatory network inference

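Previously in this series:
AUCell: identifying how cells respond to a "gene set" in single-cell transcriptomes
RcisTarget || From a gene list to a regulatory network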

We know that single-cell data analysis has two distinct branches:

  • Cells
  • Genes

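For cells, we can analyze clustering, pseudotime relationships, and cell-cell communication; for genes, we can look at expression levels, gene regulatory networks, and gene function prediction. The cell branch mainly addresses the biological question of identifying a cell's identity (new or already known), while the gene branch characterizes the features behind that identity. To analyze cells we have Seurat, monocle, Velocity, and so on; to analyze genes we have GO, KEGG, GSVA, RcisTarget, WGCNA, and so on. Today we follow the SCENIC workflow and look at the gene regulatory network tool at its core: GENIE3.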

As usual, we bring out our old friend:

library(Seurat)
library(GENIE3)
library(SeuratData)
library(tidyverse)
pbmc3k.final -> pbmc
pbmc
An object of class Seurat 
13714 features across 2638 samples within 1 assay 
Active assay: RNA (13714 features, 2000 variable features)
 2 dimensional reductions calculated: pca, umap
> levels(Idents(pbmc))
[1] "Naive CD4 T"  "Memory CD4 T" "CD14+ Mono"   "B"            "CD8 T"        "FCGR3A+ Mono" "NK"          
[8] "DC"           "Platelet"    

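GENIE3's core function accepts an expression matrix with genes as rows and samples as columns. It can be a raw, unnormalized matrix or a processed one; the results will differ between the two. We won't run the regulatory analysis on the whole expression profile, mainly because of the computation time. Let's first process the data to get the matrix we want to analyze.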

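mk <- FindAllMarkers(pbmc)
# Top 100 Naive CD4 T markers by log fold change (the column is named avg_log2FC in Seurat >= 4)
NaiveCD4T <- (mk %>% filter(cluster == 'Naive CD4 T') %>% top_n(100, avg_logFC))$gene
# Drop genes whose names start with "RP" (mostly ribosomal protein genes); grepl() is safe even when nothing matches
NaiveCD4T <- NaiveCD4T[!grepl("^RP", NaiveCD4T)]
cd4nt <- subset(pbmc, idents = "Naive CD4 T")
# Normalized expression of the selected genes, genes in rows and cells in columns
exprMatr <- as.matrix(cd4nt@assays$RNA@data[NaiveCD4T, ])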
exprMatr[1:4,1:4]
       AAACGCTGTAGCCA AAACTTGATCCAGA AAAGAGACGAGATA AAAGAGACGGACTT
LDHB         0.000000       3.088221       2.867757       2.270898
MALAT1       6.069288       5.988516       5.743655       7.045627
EEF1A1       4.374893       4.699368       4.326630       4.255680
CCR7         0.000000       2.607332       0.000000       0.000000

 dim(exprMatr)
[1] 141 697
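
Before handing a matrix to GENIE3 it can help to confirm its orientation and row names. The following is a minimal sketch on a made-up toy matrix (the `Gene*` and `Cell*` names are hypothetical); dropping genes with zero variance is an optional step assumed here, not something the package requires.

```r
# Toy values: 4 hypothetical genes (rows) x 3 hypothetical cells (columns)
toyExpr <- matrix(c(0, 3, 2,
                    6, 5, 7,
                    4, 4, 4,
                    0, 2, 0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("Gene", 1:4), paste0("Cell", 1:3)))

# The matrix should be numeric, with unique gene names as row names
stopifnot(is.matrix(toyExpr), is.numeric(toyExpr),
          !is.null(rownames(toyExpr)), !anyDuplicated(rownames(toyExpr)))

# Optional (assumption): remove genes that do not vary across cells,
# since a constant gene can neither be predicted nor act as a predictor
geneSd <- apply(toyExpr, 1, sd)
toyExpr <- toyExpr[geneSd > 0, , drop = FALSE]
dim(toyExpr)
```

Here `Gene3` is constant across the three cells, so it is removed and a 3 x 3 matrix remains.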

This is a matrix of Naive CD4 T marker genes. Running GENIE3 is simple. When we don't know which genes are the regulators, we can do this:

set.seed(1314) # For reproducibility of results
?GENIE3
weightMat <- GENIE3(exprMatr,nCores = 4)   # with the default parameters

This treats every gene as a candidate regulator. If the regulators are already known, put them in a vector, for example:

# Genes that are used as candidate regulators
regulators <- c(2, 4, 7)
# Or alternatively:
regulators <- c("Gene2", "Gene4", "Gene7")
weightMat <- GENIE3(exprMatr, regulators=regulators)
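
In practice a regulator list usually comes from somewhere else, and not every name in it will be a row of the expression matrix. A minimal sketch of that bookkeeping in base R follows; the `tfCandidates` vector is a hypothetical list made up for illustration, and `genesInMatrix` stands in for `rownames(exprMatr)`.

```r
# Stand-in for rownames(exprMatr); gene names taken from the output shown in this post
genesInMatrix <- c("LDHB", "MALAT1", "EEF1A1", "CCR7", "JUNB", "LEF1")

# Hypothetical candidate regulator list, for illustration only
tfCandidates <- c("JUNB", "LEF1", "SOCS3", "FOXP3")

# Keep only the candidates that are actually present in the matrix
regulators <- intersect(tfCandidates, genesInMatrix)
regulators

# Candidates that had to be dropped because they are not in the matrix
setdiff(tfCandidates, genesInMatrix)
```

The resulting `regulators` vector is the kind of character vector that can be passed as `regulators=` in the call above.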

GENIE3 is based on regression trees. The trees can be learned with either the Random Forest method (Breiman L., 2001) or the Extra-Trees method (Geurts P., Ernst D. and Wehenkel L., 2006).
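
As far as I recall from the package documentation, the tree method is chosen with the `treeMethod` argument ("RF" or "ET"), with `K` and `nTrees` controlling the number of candidate regulators tried at each split and the number of trees. Treat these argument names as something to confirm with `?GENIE3`. A small sketch on random toy data (not the PBMC matrix):

```r
library(GENIE3)

set.seed(123)  # for reproducibility of the toy data and the trees
# Toy matrix: 20 hypothetical genes x 5 hypothetical samples of random integers
exprToy <- matrix(sample(1:10, 100, replace = TRUE), nrow = 20)
rownames(exprToy) <- paste0("Gene", 1:20)
colnames(exprToy) <- paste0("Sample", 1:5)

# Random Forest with a reduced number of trees to keep the run short
wRF <- GENIE3(exprToy, treeMethod = "RF", nTrees = 50)

# Extra-Trees, trying 7 candidate regulators at each split
wET <- GENIE3(exprToy, treeMethod = "ET", K = 7, nTrees = 50)

dim(wRF)
dim(wET)
```

Both calls return a gene-by-gene weight matrix of the same shape; the weights themselves will differ between the two methods.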

weightMat[1:5,1:5]
              AAK1       ACAP1          AES       ALOX5       APBA2
AAK1  0.0000000000 0.005258715 0.0054358216 0.001129751 0.003247545
ACAP1 0.0063033230 0.000000000 0.0086014879 0.010451221 0.009816754
AES   0.0101961203 0.010506656 0.0000000000 0.009392012 0.006051315
ALOX5 0.0003247848 0.001475766 0.0009462572 0.000000000 0.006683968
APBA2 0.0025998370 0.003291497 0.0021388552 0.010121726 0.000000000
dim(weightMat)
[1] 59 59

This is a weight matrix: each entry is the regulatory weight between a pair of genes. From it we can build a network with igraph.

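library(igraph)
weightMat1 <- weightMat                 # work on a copy of the weight matrix
weightMat1[weightMat1 < 0.04] <- 0      # drop weak links (0.04 is an arbitrary cutoff)
# Rows (regulators) and columns (targets) become two separate sets of vertices,
# so a gene can appear twice in the resulting graph
net1 <- graph_from_incidence_matrix(weightMat1)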
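
`graph_from_incidence_matrix()` builds a bipartite graph, so each gene shows up once as a regulator and once as a target. Since the rows of the weight matrix are regulators and the columns are targets, the thresholded matrix can also be read as a directed, weighted adjacency matrix. A minimal sketch of that alternative on a made-up 3 x 3 matrix (the genes `A`, `B`, `C` and their weights are hypothetical):

```r
library(igraph)

# Toy weight matrix: entry [i, j] is the weight of the link from gene i to gene j
w <- matrix(c(0.00, 0.10, 0.01,
              0.05, 0.00, 0.02,
              0.00, 0.08, 0.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Same kind of arbitrary cutoff as above
w[w < 0.04] <- 0

# One vertex per gene; non-zero entries become directed edges with a weight attribute
g <- graph_from_adjacency_matrix(w, mode = "directed", weighted = TRUE)

vcount(g)
igraph::as_data_frame(g, what = "edges")
```

With this toy matrix the graph has three vertices and three edges (A to B, B to A, C to B), and the edge weights are kept in `E(g)$weight`.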

Visualizing the regulatory network with several layouts:

layouts <- grep("^layout_", ls("package:igraph"), value=TRUE)[-1] 
# Remove layouts that do not apply to our graph.
layouts <- layouts[!grepl("bipartite|merge|norm|sugiyama|tree", layouts)]

layouts<- c("layout_as_star","layout_in_circle"  ,   "layout_nicely",
            "layout_on_grid"   ,    "layout_on_sphere"   ,  "layout_randomly"  ,    "layout_with_dh",
            "layout_with_drl"   ,   "layout_with_fr"    ,   "layout_with_gem"  ,    "layout_with_graphopt",
            "layout_with_kk" ,      "layout_with_lgl"    ,  "layout_with_mds")
layouts


length(layouts)
par(mfrow=c(3,5), mar=c(1,1,1,1))
for (layout in layouts) {
    print(layout)
    l <- do.call(layout, list(net1)) 
    plot(net1, edge.arrow.mode=0, layout=l, main=layout) }

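After our filtering, many genes have no links to any other gene, possibly because there is no regulatory relationship. Let's see which genes do have regulatory links.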

V(net1)[igraph::degree(net1)>1]
+ 22/118 vertices, named, from 7a4fa09:
 [1] BTG1    EEF1A1  EEF1B2  JUNB    MALAT1  NPM1    TMEM66  TPT1    BTG1    CCR7    EEF1A1  EEF1B2  GLTSCR2 JUNB   
[15] LCK     LDHB    LEF1    MAL     MALAT1  SOCS3   TMEM66  TPT1   
plot(induced_subgraph(net1,  V(net1)[igraph::degree(net1)>1] ))

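Since many readers may find it hard to manipulate networks directly, GENIE3 provides a function, getLinkList, that filters the links by a given threshold: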

linkList <- getLinkList(weightMat, reportMax=5)      # keep only the 5 top-ranked links
linkList <- getLinkList(weightMat, threshold=0.03)   # or filter by weight instead (this overwrites the line above)
head(linkList)
  regulatoryGene targetGene     weight
1         MALAT1     EEF1A1 0.14649515
2         EEF1A1     MALAT1 0.08383228
3           BTG1       JUNB 0.07233470
4           JUNB       BTG1 0.06985202
5         MALAT1       TPT1 0.06641574
6         EEF1A1     EEF1B2 0.06631800
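
To make the link-list format concrete, here is a rough base R sketch of what this kind of filtering amounts to. It is not GENIE3's own implementation; the toy weights and the use of `>=` for the cutoff are assumptions made for illustration.

```r
# Toy weight matrix: rows are regulators, columns are targets (hypothetical genes)
w <- matrix(c(0.00, 0.10, 0.01,
              0.05, 0.00, 0.02,
              0.00, 0.08, 0.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Positions of off-diagonal entries at or above the cutoff
idx <- which(w >= 0.03 & row(w) != col(w), arr.ind = TRUE)

# One row per retained link, strongest first
links <- data.frame(regulatoryGene = rownames(w)[idx[, "row"]],
                    targetGene     = colnames(w)[idx[, "col"]],
                    weight         = w[idx],
                    stringsAsFactors = FALSE)
links <- links[order(-links$weight), ]
rownames(links) <- NULL
links
```

Three links survive the cutoff here, ranked A to B (0.10), C to B (0.08), and B to A (0.05).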

This format can of course also be used to build a network with igraph.

net<- graph_from_data_frame(linkList)
plot(net, edge.arrow.size=.2, edge.curved=0,
     vertex.color="orange", vertex.frame.color="#555555",
     vertex.label.color="black",
     vertex.label.cex=.7) 

Now let's map the expression levels onto the network graph:


avexp <- AverageExpression(cd4nt,features =names(V(net)),slot = 'data')
V(net)$size <- abs(scale(avexp$RNA$`Naive CD4 T`))
E(net)$width <- E(net)$weight*10
plot(net, edge.arrow.size=.2, edge.curved=0.5,
     vertex.color="orange", vertex.frame.color="#555555",
     vertex.label.color="black",
     vertex.label.cex=.7) 
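
One thing to keep in mind: `abs(scale(x))` measures distance from the mean, so a gene far below average is drawn as large as a gene far above it, and the sizes come out quite small. If vertex size should grow with expression instead, a min-max rescaling is one option. The helper below is a hypothetical function written for illustration, with an arbitrary size range of 5 to 20:

```r
# Hypothetical helper: map values linearly onto the range [lo, hi]
rescale_size <- function(x, lo = 5, hi = 20) {
  rng <- range(x, na.rm = TRUE)
  # If all values are equal, fall back to the middle of the range
  if (diff(rng) == 0) return(rep((lo + hi) / 2, length(x)))
  lo + (x - rng[1]) / diff(rng) * (hi - lo)
}

# Toy average expression values for three hypothetical genes
avg <- c(GeneA = 0.5, GeneB = 2.0, GeneC = 3.5)
rescale_size(avg)
```

The lowest value maps to 5, the highest to 20, and the middle one to 12.5; the result could be assigned to `V(net)$size` in place of the scaled values.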

The weights of the links returned by GENIE3() do not have any statistical meaning and only provide a way to rank the regulatory links. There is therefore no standard threshold value, and caution must be taken when choosing one.



https://bioconductor.org/packages/release/bioc/html/GENIE3.html
https://scenic.aertslab.org/examples/SCENIC_MouseBrain/
https://github.com/aertslab/SCENICprotocol
