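Previously in this series:
AUCell: identifying how cells respond to "gene sets" in single-cell transcriptomes
RcisTarget || From gene lists to regulatory networks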
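Single-cell data analysis has two obvious branches:
- cells
- genes
On the cell side, we can analyze cell clustering, pseudotime relationships, and cell-cell communication; on the gene side, we can look at expression levels, gene regulatory networks, and gene function prediction. The cell branch mainly addresses the biological question of identifying a cell's identity (new or known), while the gene branch characterizes the features underlying that identity. For cells we have Seurat, monocle, Velocity, and others; for genes we have GO, KEGG, GSVA, RcisTarget, WGCNA, and others. Today we follow the SCENIC workflow to look at the gene regulatory network tool at its core: GENIE3.
As usual, let's bring out our old friend: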
library(Seurat)
library(GENIE3)
library(SeuratData)
library(tidyverse)
pbmc3k.final -> pbmc
pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
2 dimensional reductions calculated: pca, umap
> levels(Idents(pbmc))
[1] "Naive CD4 T" "Memory CD4 T" "CD14+ Mono" "B" "CD8 T" "FCGR3A+ Mono" "NK"
[8] "DC" "Platelet"
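The core GENIE3 function takes an expression matrix with genes as rows and samples as columns. It accepts either a raw, unnormalized matrix or a processed one; the results will differ between the two. We will not run the regulatory analysis on the whole expression profile, mainly because of the computation time. Let's first process the data to get the matrix we want to analyze.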
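As a minimal sketch of the expected input layout (toy data; the gene and cell names are made up for illustration), the matrix should be numeric, with genes as rows, samples as columns, and gene names in the row names:

```r
# Toy expression matrix: 3 genes (rows) x 4 cells (columns).
# Gene and cell names are hypothetical, for illustration only.
set.seed(1)
toyExpr <- matrix(
  rpois(12, lambda = 5),
  nrow = 3,
  dimnames = list(paste0("Gene", 1:3), paste0("Cell", 1:4))
)
storage.mode(toyExpr) <- "double"

# Sanity checks worth running before passing a matrix to GENIE3():
stopifnot(is.matrix(toyExpr), is.numeric(toyExpr))  # numeric matrix
stopifnot(!is.null(rownames(toyExpr)))              # rows carry gene names
stopifnot(anyDuplicated(rownames(toyExpr)) == 0)    # gene names are unique
stopifnot(!anyNA(toyExpr))                          # no missing values

# If the data arrives as samples x genes, transpose it first.
samplesByGenes <- t(toyExpr)
genesBySamples <- t(samplesByGenes)
dim(genesBySamples)
```

The same checks can be applied to exprMatr once it has been built.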
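# Find marker genes for every cluster
mk <- FindAllMarkers(pbmc)
# Top 100 Naive CD4 T markers by log fold change
# (the column is named avg_log2FC in Seurat v4 and later)
NaiveCD4T <- (mk %>% filter(cluster == 'Naive CD4 T') %>% top_n(100, avg_logFC))$gene
# Drop genes whose names start with "RP"; grepl() stays correct even when nothing matches
NaiveCD4T <- NaiveCD4T[!grepl("^RP", NaiveCD4T)]
cd4nt <- subset(pbmc, idents = "Naive CD4 T")
# Normalized expression of the selected genes (genes x cells)
exprMatr <- as.matrix(cd4nt@assays$RNA@data[NaiveCD4T,])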
exprMatr[1:4,1:4]
AAACGCTGTAGCCA AAACTTGATCCAGA AAAGAGACGAGATA AAAGAGACGGACTT
LDHB 0.000000 3.088221 2.867757 2.270898
MALAT1 6.069288 5.988516 5.743655 7.045627
EEF1A1 4.374893 4.699368 4.326630 4.255680
CCR7 0.000000 2.607332 0.000000 0.000000
dim(exprMatr)
[1] 141 697
This is a matrix of Naive CD4 T marker genes. Running GENIE3 is simple. When the regulators are not known in advance, we can do this:
set.seed(1314) # For reproducibility of results
?GENIE3
weightMat <- GENIE3(exprMatr,nCores = 4) # with the default parameters
This treats every gene as a candidate regulator. If the regulators are already known, they can be supplied as a vector, for example:
# Genes that are used as candidate regulators
regulators <- c(2, 4, 7)
# Or alternatively:
regulators <- c("Gene2", "Gene4", "Gene7")
weightMat <- GENIE3(exprMatr, regulators=regulators)
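GENIE3 is based on regression trees. The trees can be learned with either the Random Forest method (Breiman, 2001) or the Extra-Trees method (Geurts, Ernst and Wehenkel, 2006); the method is selected with the treeMethod argument ("RF", the default, or "ET").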
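To make the idea concrete, the sketch below mimics the decomposition GENIE3 is built on: each gene in turn is treated as a target, predicted from all the other genes, and each predictor's importance becomes the weight of the link from regulator to target. As an assumption to keep the example dependency-free, a linear model with standardized coefficients stands in for the tree ensembles. This is not what GENIE3 computes, and the function name `toyImportanceMatrix` is hypothetical.

```r
# Illustration only: one regression per target gene, with a linear model
# standing in for the Random Forest / Extra-Trees ensembles used by GENIE3.
toyImportanceMatrix <- function(expr) {
  genes <- rownames(expr)
  x <- scale(t(expr))  # cells x genes, each gene centered and scaled
  w <- matrix(0, length(genes), length(genes),
              dimnames = list(genes, genes))
  for (target in genes) {
    predictors <- setdiff(genes, target)
    fit <- lm.fit(x = x[, predictors, drop = FALSE], y = x[, target])
    imp <- abs(fit$coefficients)
    imp[is.na(imp)] <- 0
    # Row = candidate regulator, column = target.
    # In this toy version the weights for one target sum to 1.
    w[predictors, target] <- imp / sum(imp)
  }
  w
}

# Simulated data: GeneB depends on GeneA, GeneC is independent noise.
set.seed(42)
geneA <- rnorm(200)
geneB <- 2 * geneA + rnorm(200, sd = 0.5)
geneC <- rnorm(200)
toyExpr <- rbind(GeneA = geneA, GeneB = geneB, GeneC = geneC)

toyWeights <- toyImportanceMatrix(toyExpr)
round(toyWeights, 3)
```

In the toy result, the link GeneA to GeneB should outrank GeneC to GeneB, which is the kind of ranking the real weight matrix is meant to provide.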
weightMat[1:5,1:5]
AAK1 ACAP1 AES ALOX5 APBA2
AAK1 0.0000000000 0.005258715 0.0054358216 0.001129751 0.003247545
ACAP1 0.0063033230 0.000000000 0.0086014879 0.010451221 0.009816754
AES 0.0101961203 0.010506656 0.0000000000 0.009392012 0.006051315
ALOX5 0.0003247848 0.001475766 0.0009462572 0.000000000 0.006683968
APBA2 0.0025998370 0.003291497 0.0021388552 0.010121726 0.000000000
dim(weightMat)
[1] 59 59
This is a weight matrix whose entries are the regulatory weights between pairs of genes. From it, we can build a network with igraph.
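library(igraph)
# Work on a copy so the original weights stay available
weightMat1 <- weightMat
# Set weak links to zero (0.04 is an arbitrary cutoff)
weightMat1[weightMat1 < 0.04] <- 0
# Rows (regulators) and columns (targets) become the two vertex sets of a
# bipartite graph, which is why a gene can appear twice in the vertex list
net1 <- graph_from_incidence_matrix(weightMat1)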
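Because rows of the weight matrix are regulators and columns are targets, a directed, weighted graph is another natural representation. The following is a minimal sketch on a hand-made toy matrix (the gene names, the weights, and the 0.2 cutoff are arbitrary assumptions), using `graph_from_adjacency_matrix()` with `mode = "directed"` and `weighted = TRUE`:

```r
library(igraph)

# Toy weight matrix: rows = regulators, columns = targets (made-up values).
genes <- c("GeneA", "GeneB", "GeneC", "GeneD")
toyWeights <- matrix(
  c(0.00, 0.50, 0.05, 0.00,
    0.30, 0.00, 0.10, 0.00,
    0.02, 0.25, 0.00, 0.00,
    0.01, 0.03, 0.04, 0.00),
  nrow = 4, byrow = TRUE, dimnames = list(genes, genes)
)

# Zero out weak links (arbitrary cutoff), then build a directed weighted graph.
filtered <- toyWeights
filtered[filtered < 0.2] <- 0
toyNet <- graph_from_adjacency_matrix(filtered, mode = "directed", weighted = TRUE)

# Drop genes left without any link after filtering.
toyNet <- induced_subgraph(toyNet, V(toyNet)[igraph::degree(toyNet) > 0])

# Edge table with from, to, and weight columns.
igraph::as_data_frame(toyNet, what = "edges")
```

With this representation each gene appears once, and the edge direction records which gene is the candidate regulator.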
Visualizing the regulatory network with several layouts:
# Collect the layout functions exported by igraph
layouts <- grep("^layout_", ls("package:igraph"), value=TRUE)[-1]
# Remove layouts that do not apply to our graph.
layouts <- layouts[!grepl("bipartite|merge|norm|sugiyama|tree", layouts)]
# Or list the layouts explicitly (this overrides the selection above)
layouts <- c("layout_as_star", "layout_in_circle", "layout_nicely",
             "layout_on_grid", "layout_on_sphere", "layout_randomly", "layout_with_dh",
             "layout_with_drl", "layout_with_fr", "layout_with_gem", "layout_with_graphopt",
             "layout_with_kk", "layout_with_lgl", "layout_with_mds")
layouts
length(layouts)
# 3 x 5 grid of panels with narrow margins
par(mfrow=c(3,5), mar=c(1,1,1,1))
for (layout in layouts) {
  print(layout)
  # Compute vertex coordinates with the layout function named by `layout`
  l <- do.call(layout, list(net1))
  plot(net1, edge.arrow.mode=0, layout=l, main=layout)
}
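After our filtering, many genes have no links to any other gene, possibly because there is no regulatory relationship. We want to see which genes do have regulatory relationships.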
V(net1)[igraph::degree(net1)>1]
+ 22/118 vertices, named, from 7a4fa09:
[1] BTG1 EEF1A1 EEF1B2 JUNB MALAT1 NPM1 TMEM66 TPT1 BTG1 CCR7 EEF1A1 EEF1B2 GLTSCR2 JUNB
[15] LCK LDHB LEF1 MAL MALAT1 SOCS3 TMEM66 TPT1
plot(induced_subgraph(net1, V(net1)[igraph::degree(net1)>1] ))
Since many readers may find manipulating networks difficult, GENIE3 provides a function that filters links by a given threshold:
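# Keep only the 5 highest-ranked links
linkList <- getLinkList(weightMat, reportMax=5)
# Or keep the links that pass a weight threshold (this overwrites the result above)
linkList <- getLinkList(weightMat, threshold=0.03)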
head(linkList)
regulatoryGene targetGene weight
1 MALAT1 EEF1A1 0.14649515
2 EEF1A1 MALAT1 0.08383228
3 BTG1 JUNB 0.07233470
4 JUNB BTG1 0.06985202
5 MALAT1 TPT1 0.06641574
6 EEF1A1 EEF1B2 0.06631800
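Conceptually, the link list is the weight matrix reshaped into long format, with self-links dropped and rows sorted by decreasing weight. A rough base R sketch of that idea follows; `toyLinkList` is a hypothetical helper written for illustration, not the package's implementation, and the toy values are made up.

```r
# Hypothetical helper: reshape a regulator x target weight matrix into a
# ranked link list. Illustration only, not GENIE3's own code.
toyLinkList <- function(weightMat, threshold = 0) {
  links <- data.frame(
    regulatoryGene = rownames(weightMat)[row(weightMat)],
    targetGene     = colnames(weightMat)[col(weightMat)],
    weight         = as.vector(weightMat),
    stringsAsFactors = FALSE
  )
  links <- links[links$regulatoryGene != links$targetGene, ]  # no self-links
  links <- links[links$weight >= threshold, ]                 # apply cutoff
  links <- links[order(links$weight, decreasing = TRUE), ]    # rank links
  rownames(links) <- NULL
  links
}

genes <- c("GeneA", "GeneB", "GeneC")
toyWeights <- matrix(
  c(0.00, 0.50, 0.05,
    0.30, 0.00, 0.10,
    0.02, 0.25, 0.00),
  nrow = 3, byrow = TRUE, dimnames = list(genes, genes)
)

toyLinks <- toyLinkList(toyWeights, threshold = 0.1)
toyLinks
```

The three column names mirror those in the output of getLinkList() shown above, so the toy result can be passed to the same downstream plotting code.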
A link list in this format can of course also be used to build a network with igraph.
net<- graph_from_data_frame(linkList)
plot(net, edge.arrow.size=.2, edge.curved=0,
vertex.color="orange", vertex.frame.color="#555555",
vertex.label.color="black",
vertex.label.cex=.7)
Let's map the expression levels onto the network graph:
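# Average expression of each network gene in the Naive CD4 T cells
avexp <- AverageExpression(cd4nt, features = names(V(net)), slot = 'data')
# Vertex size: absolute z-score of the average expression
# (as.numeric() flattens the one-column matrix returned by scale())
V(net)$size <- as.numeric(abs(scale(avexp$RNA$`Naive CD4 T`)))
# Edge width proportional to the GENIE3 weight
E(net)$width <- E(net)$weight*10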
plot(net, edge.arrow.size=.2, edge.curved=0.5,
vertex.color="orange", vertex.frame.color="#555555",
vertex.label.color="black",
vertex.label.cex=.7)
The weights of the links returned by GENIE3() do not have any statistical meaning and only provide a way to rank the regulatory links. There is therefore no standard threshold value, and caution must be taken when choosing one.
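Since only the ranking is meaningful, one option is to select links by rank rather than by a fixed weight, for example the top N links or those above a chosen quantile. A minimal sketch on simulated weights (the values, N = 10, and the 95% quantile are arbitrary assumptions, not recommendations):

```r
# Simulated link list with made-up weights, for illustration only.
set.seed(7)
toyLinks <- data.frame(
  regulatoryGene = paste0("Reg", sample(1:5, 100, replace = TRUE)),
  targetGene     = paste0("Tgt", sample(1:20, 100, replace = TRUE)),
  weight         = rexp(100, rate = 20)
)
toyLinks <- toyLinks[order(toyLinks$weight, decreasing = TRUE), ]

# Option 1: keep the N highest-ranked links.
topN <- head(toyLinks, 10)

# Option 2: keep links above a chosen quantile of the weight distribution.
cutoff <- quantile(toyLinks$weight, probs = 0.95)
topQuantile <- toyLinks[toyLinks$weight > cutoff, ]

nrow(topN)
nrow(topQuantile)
```

Either selection depends only on the order of the weights, so it is unaffected by their absolute scale.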
https://bioconductor.org/packages/release/bioc/html/GENIE3.html
https://scenic.aertslab.org/examples/SCENIC_MouseBrain/
https://github.com/aertslab/SCENICprotocol