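Computational identification of methylated regions in glioblastoma multiforme (GBM)
Objective: identify glioblastoma-specific methylated regions to provide a theoretical basis for clinical diagnosis.
Steps:
1. Data retrieval: download the GBM RNA-seq and methylation data from TCGA.
2. Methylation analysis: compare normal and tumor samples, run a differential methylation analysis, and find regions that are hypermethylated in tumor samples.
3. RNA-seq analysis: compare normal and tumor samples, screen for differentially expressed genes, and find genes with low expression in tumor samples.
4. Integration of methylation and RNA-seq data: intersect the hypermethylated genes with the low-expression genes (these are likely to be tumor suppressor genes), intersect the result with a list of known tumor suppressor genes, and then combine it with an analysis of the CpGs in the promoter region to find candidate targets.
5. Validation: check the candidate targets against PubMed and other databases to cross-validate their reliability.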
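I. Data download
First, the GBM RNA-seq and methylation data are downloaded from TCGA. As the table below shows, GBM has 172 RNA-seq datasets and 437 DNA methylation datasets. TCGA provides data from two array platforms, the Infinium HumanMethylation27 BeadChip and the Infinium HumanMethylation450 BeadChip. To avoid the difficulty of merging data across platforms later on, only the HumanMethylation450 array data are downloaded, 154 datasets in total.
Figure 1. TCGA data summary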
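II. Preliminary data processing
TCGA-Assembler 2.0.5 is used to batch-download and pre-process the GBM data. Box plots of the RNA-seq gene expression values and of the methylation array data were also drawn; because of the data volume, the plots are not shown here.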
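III. Overall visualization
For the methylation data, the sample with ID TCGA.06.AABW.11A.31D.A368.05 is selected first to inspect the overall level of methylation. Each site is in reality either methylated or unmethylated, so when the methylation levels of all sites are tallied, most sites should be close to "fully methylated" (methylation state = 1) or "fully unmethylated" (methylation state = 0). The figure below is a frequency histogram of the data. It is clearly high at both ends and low in the middle, which indirectly suggests that the array data are of good quality.
Figure 2. CpG methylation levels in a single sample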
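As a rough numeric companion to the histogram, the share of CpGs whose values lie near 0 or 1 can be computed directly. The sketch below runs on simulated beta values rather than the TCGA sample, and the 0.2 and 0.8 cut-offs are illustrative assumptions, not thresholds used in this analysis.

```r
# Simulated beta values standing in for one sample (assumption: not TCGA data)
set.seed(1)
beta <- c(rbeta(4000, 0.5, 8),   # mostly unmethylated sites
          rbeta(4000, 8, 0.5),   # mostly methylated sites
          runif(2000))           # sites with intermediate values

# Share of sites near either extreme; 0.2 and 0.8 are illustrative cut-offs
frac_extreme <- mean(beta < 0.2 | beta > 0.8, na.rm = TRUE)

# Counts in the low, middle and high ranges
counts <- table(cut(beta, breaks = c(0, 0.2, 0.8, 1), include.lowest = TRUE))

frac_extreme
counts
```

A sample whose middle range holds most of the sites would not show the "high at both ends" shape described above.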
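Next, violin plots of CpG methylation levels are drawn for several samples. Each row is one patient; the sample on the left comes from a Primary Solid Tumor and the sample on the right from a Recurrent Solid Tumor. Besides showing that most methylation values lie near 0 and 1, the plots show that tumors from the same patient still differ slightly in methylation.
TCGA barcode: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
Figure 3. Violin plots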
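The sample type is read from the fourth field of the TCGA barcode: codes 01 to 09 are tumor, 10 to 19 are normal, and 20 to 29 are control samples. A minimal sketch of that lookup follows; the function name classify_barcode is hypothetical, and unlike the appended script it keeps control samples as a separate class.

```r
# Hypothetical helper: classify TCGA barcodes by the sample type code
# stored in the first two characters of the fourth field.
classify_barcode <- function(barcodes) {
  # Barcodes may use "-" or, after R has converted column names, "."
  fields <- strsplit(barcodes, "[.-]")
  fourth <- vapply(fields, `[`, character(1), 4)
  code <- as.integer(substr(fourth, 1, 2))
  ifelse(code < 10, "Tumor",
         ifelse(code < 20, "Normal", "Control"))
}

sample_type <- classify_barcode(c("TCGA.06.AABW.11A.31D.A368.05",
                                  "TCGA.06.0125.01A.01D.A45W.05",
                                  "TCGA-3C-AAAU-01A-11R-A41B-07"))
sample_type
```

The three barcodes are the ones quoted in this document; the first is the normal sample used for the histogram.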
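The RNA-seq data can be given the same kind of preliminary visualization. Besides the box plots drawn after download, PCA gives a first look at how the data are distributed. The left panel below is the PCA scree plot, which shows the share of information (variance) carried by the first principal component, the second principal component, and so on. The right panel is a scatter plot of PC1 against PC2; the 5 normal samples lie close together, which indirectly suggests that the data are reasonably trustworthy.
Finally, hierarchical clustering is applied to the RNA-seq expression profiles and drawn as a dendrogram. The 5 normal samples again lie close together, so the data quality looks acceptable.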
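The two checks above can be sketched with base R alone. The example uses a small simulated expression matrix (an assumption, not the GBM data) in which the first three samples are shifted to play the role of the normal group, and it shows how the percentages on the PCA axes are derived from the component standard deviations.

```r
# Simulated expression matrix: 200 genes x 12 samples (assumption: toy data)
set.seed(42)
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("S", 1:12)))
# Shift 50 genes in the first three samples so they form a distinct group
expr[1:50, 1:3] <- expr[1:50, 1:3] + 4

# PCA on samples (rows must be samples, hence the transpose)
pca <- prcomp(t(expr), scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per PC
scores <- pca$x[, 1:2]                         # coordinates for the PC1/PC2 scatter

# Hierarchical clustering on Euclidean distances between samples
hc <- hclust(dist(t(expr)))
clusters <- cutree(hc, k = 2)

round(100 * var_explained[1:2], 1)
clusters
```

Calling plot(hc, hang = -1) and screeplot(pca, type = "lines") on these objects gives the same kinds of figures as described above.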
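IV. Screening for differentially methylated regions
To screen for differentially methylated sites in a more principled and efficient way, the analysis follows the Bioconductor methylation array workflow and uses the minfi package for differential methylation analysis, which yields the differentially methylated positions.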
http://www.bioconductor.org/help/workflows/methylationArrayAnalysis/
Of the 526733 CpG sites tested, 4927 have a P value < 0.01 and a mean methylation level above 0.7 in tumor samples; they correspond to 2054 genes.
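The two-part filter (significant difference, and high methylation in tumor) can be illustrated on simulated data. This sketch swaps in a per-row Welch t-test from base R in place of minfi's dmpFinder, so it shows the shape of the filter rather than the method actually used; the matrix, group sizes and planted sites are all assumptions.

```r
# Simulated beta matrix: 100 CpGs x 16 samples (assumption: toy data)
set.seed(1)
group <- rep(c("Tumor", "Normal"), times = c(10, 6))
n_cpg <- 100
beta <- matrix(runif(n_cpg * length(group), 0.05, 0.30), nrow = n_cpg)
# Plant five CpGs that are hypermethylated in tumor samples only
beta[1:5, group == "Tumor"] <- runif(5 * 10, 0.80, 0.95)

# Per-CpG test (stand-in for dmpFinder) and mean methylation in tumor
pval <- apply(beta, 1, function(x) {
  t.test(x[group == "Tumor"], x[group == "Normal"])$p.value
})
tumor_mean <- rowMeans(beta[, group == "Tumor"], na.rm = TRUE)

# Same thresholds as in the text: P < 0.01 and tumor mean > 0.7
keep <- which(!is.na(pval) & pval < 0.01 & tumor_mean > 0.7)
keep
```

Only the planted rows pass both conditions, because the remaining rows never reach a tumor mean of 0.7.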
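V. Screening for differentially expressed genes
Because the data come from RNA-seq, a mainstream choice of method is the DESeq2 package, which is based on a negative binomial model.
An MA-plot is used first to look at the overall distribution of the differentially expressed genes. Unexpectedly, the left side of the plot shows about seven line-like streaks. I initially suspected a batch effect between samples that would call for a better normalization method. I then used identify() to pick out the gene names and expression values on each streak, and found that these genes have almost zero expression across the 172 samples, with only one or two samples showing slight expression. Such genes are what produce the streaks.
Figure 4. MA-plot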
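Next, the differentially expressed gene with the smallest p-value is selected and its expression is plotted by group; the difference is indeed pronounced.
Figure 5. Expression scatter plot
Then a heatmap of the differentially expressed genes across groups is drawn; the normal samples are the five leftmost columns. To answer a specific biological question, the differentially expressed genes in each resulting cluster would need GO and KEGG annotation, combined with the relevant biological phenotypes, to explore their molecular mechanisms and significance.
Figure 6. Expression heatmap
Finally, the differentially expressed genes are filtered with a p-value below 0.01 (the script applies this to the adjusted value, padj) and log2FoldChange < -2, which selects genes with low expression in tumor samples: 1657 in total.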
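The filter itself is a plain row selection on the results table. A minimal sketch on a hand-made table follows; the gene labels and values are hypothetical, and only the column names log2FoldChange and padj follow DESeq2's results table.

```r
# Hypothetical results table with the two columns used by the filter
res <- data.frame(gene = c("A", "B", "C", "D"),
                  log2FoldChange = c(-3.1, -1.2, 2.5, -2.4),
                  padj = c(0.001, 0.0005, 0.002, NA),
                  stringsAsFactors = FALSE)

# Significant and strongly lower in tumor; rows with a missing padj are dropped
down <- subset(res, !is.na(padj) & padj < 0.01 & log2FoldChange < -2)
down$gene
```

Gene B is significant but its fold change is too small, gene C changes in the wrong direction, and gene D has no adjusted p-value, so only gene A remains.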
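VI. Obtaining tumor suppressor genes
A PubMed search for tumor suppressor gene lists curated by researchers finds the TSGene database, published in Nucleic Acids Res, which stores a list of tumor suppressor genes. https://bioinfo.uth.edu/TSGene/download.cgi
The full list of 1217 human tumor suppressor genes is downloaded (listed on the site as "All the 1217 human tumor suppressor genes with basic annotations").
Figure 7. TSGene database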
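VII. Data integration
Three gene sets are overlapped to find specific candidate targets for the subsequent analysis: genes near CpGs that are hypermethylated in tumor samples (methylation data), genes with low expression in tumor samples (RNA-seq), and the tumor suppressor gene list downloaded from the TSGene database. The figure below is the Venn diagram of the three sets.
Figure 8. Venn diagram of the integrated data
In total, 12 candidate target genes are found.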
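The overlap of the three sets is an ordinary set intersection. A minimal sketch with placeholder gene symbols (all hypothetical) follows.

```r
# Placeholder gene sets (hypothetical symbols, not the real lists)
low_expression   <- c("GENE1", "GENE2", "GENE3", "GENE4")
hypermethylated  <- c("GENE2", "GENE3", "GENE5")
tumor_suppressor <- c("GENE3", "GENE4", "GENE6")

# Intersect all three sets in one pass
candidates <- Reduce(intersect,
                     list(low_expression, hypermethylated, tumor_suppressor))

# Intersection of the first two sets only, as used later for GO enrichment
meth_and_expr <- intersect(low_expression, hypermethylated)

candidates
meth_and_expr
```

Reduce applies intersect pairwise, so the same call works for any number of gene lists.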
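VIII. Target screening
The earlier screen was based on differential methylation of single CpGs, whereas a clinical assay may need several CpGs as a reference. The methylation levels of all CpGs within 1.5 kb upstream of the TSS were therefore summarized for the 12 candidate target genes and drawn as a heatmap. Although these genes were all selected through CpG sites that are hypermethylated in tumor samples, many of them turn out to be hypomethylated in all samples once every CpG within 1.5 kb upstream of the TSS is taken into account. The gene that looks most convincing is NUAK1: the region within 1.5 kb upstream of its TSS is hypomethylated in normal samples and hypermethylated in tumor samples.
Figure 9. Methylation heatmap
Seven CpGs are assayed within 1.5 kb upstream of the NUAK1 TSS. The figure below shows their methylation levels across the 154 samples: all 7 CpGs have relatively high methylation in Tumor tissue and relatively low methylation in Normal tissue.
Figure 10. CpG methylation levels in the NUAK1 TSS region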
Cpgplot is used to predict the positions of CpG islands. The relative positions of the 7 CpGs on the promoter sequence, 5' to 3', are as follows:
1035 1094 1408 1413 1444 1448 1471
Figure 11. CpG island prediction
Parameters used:
NUAK1promoter from 1 to 1500
Observed/Expected ratio > 0.40
Percent C + Percent G > 40.00
Length > 100
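CpG island details: Length 101 (1086..1186); Length 105 (1366..1470)
Most of the seven CpGs lie within the predicted CpG islands: five fall inside them, position 1471 is one base past the end of the second island, and position 1035 is upstream of the first. The sequence is given in the appendix.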
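The quantities behind these parameters can be computed by hand. The sketch below scores one sequence as a whole, using the usual definition of the expected CpG count as (number of C x number of G) / length; tools such as Cpgplot evaluate comparable statistics over windows along the sequence, so this is an illustration of the criteria, not a reimplementation. The function names are hypothetical.

```r
# Hypothetical helper: GC content and observed/expected CpG ratio of a sequence
cpg_stats <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  n <- length(bases)
  n_c <- sum(bases == "C")
  n_g <- sum(bases == "G")
  # Observed CpG count: a C immediately followed by a G
  n_cpg <- sum(bases[-n] == "C" & bases[-1] == "G")
  expected <- n_c * n_g / n
  list(length = n,
       gc_percent = 100 * (n_c + n_g) / n,
       obs_exp = if (expected > 0) n_cpg / expected else NA_real_)
}

# Apply the three thresholds listed above to the computed statistics
passes_thresholds <- function(stats, min_oe = 0.40, min_gc = 40, min_len = 100) {
  !is.na(stats$obs_exp) &&
    stats$obs_exp > min_oe && stats$gc_percent > min_gc && stats$length > min_len
}

toy <- cpg_stats("CGCGATAT")   # 2 CpGs, 2 C, 2 G, length 8
toy$obs_exp                    # 2 / (2 * 2 / 8) = 4
toy$gc_percent                 # 50
passes_thresholds(toy)         # FALSE: far shorter than 100 bases
```

The toy sequence has a high ratio and enough GC content but fails on length, which is why all three conditions are checked together.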
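IX. Discussion of the target gene
A search for NUAK1 in the Gene database shows that its full name is NUAK family kinase 1. It is a kinase, and kinases can play a large role in regulation. In the HPA RNA-seq normal tissues project, this kinase is expressed at a clearly higher level in brain than in other tissues, which is consistent with GBM arising in the brain.
Figure 12. NUAK1 discussion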
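X. Exploring molecular mechanisms
The intersection of genes near CpGs that are hypermethylated in tumor tissue and genes with low expression in tumor samples contains 274 genes. Gene Ontology enrichment analysis of these genes shows clear enrichment in GO biological process terms such as "nervous system development", "chemical synaptic transmission" and "cell membrane organization". "Central nervous system myelination" in particular is enriched 26.95-fold. This ties in closely with GBM, which mostly arises in the brain, and indirectly supports the results.
Figure 13. GO enrichment analysis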
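XI. Furthermore
It is known from biology that CpG methylation regulates gene transcription, so the methylation level near the Transcript Start Site (TSS) deserves a closer look. Using the human genome build hg19, methylation levels around the TSS are summarized over 46489 transcription start sites from 23056 genes.
CpG methylation levels within 5000 bp on either side of the TSS are collected and fitted with a smooth curve. CpG methylation drops markedly at the TSS, which agrees with established knowledge.
Figure 14. Methylation levels near the TSS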
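The core of this summary is a relative coordinate for each CpG followed by an average per position or per bin. The sketch below is one way to express that on plain vectors. The strand handling and the sign convention (negative means upstream) are assumptions made for illustration and differ from the appended script, which stores TSS minus CpG without using strand; the function names and the 500 bp bin width are hypothetical.

```r
# Hypothetical helper: CpG position relative to the TSS, negative = upstream.
# On the minus strand, larger genomic coordinates are upstream, so the sign flips.
relative_position <- function(cpg_pos, tss_pos, strand) {
  offset <- cpg_pos - tss_pos
  ifelse(strand == "-", -offset, offset)
}

# Hypothetical helper: mean methylation per bin of relative position
bin_means <- function(rel, meth, width = 500) {
  bins <- floor(rel / width) * width   # left edge of each bin
  tapply(meth, bins, mean, na.rm = TRUE)
}

rel <- relative_position(cpg_pos = c(900, 950, 1010, 900),
                         tss_pos = c(1000, 1000, 1000, 1000),
                         strand  = c("+", "+", "+", "-"))
rel

means <- bin_means(rel[1:3], meth = c(0.2, 0.4, 0.9))
means
```

With real data, the binned means (or the per-position means used in the script) are what the smoothed curve is fitted to.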
Appendix
The CpG islands in the NUAK1 promoter region are marked in blue, and the assayed CpGs are marked in bold with underlining.
>NUAK1promoter
TATGAAAGGAGAAGGGGGAGCTTTGGAACTGGACAGGTAGGGTTTAAATCCTGGTCCTGC
CATTTACAAACTGTGTAACCTCTGGGAAATTACTTAACCTTTCTGATACGGTTTCTTCAA
TTGAAAATAGGGATTGTAAACAGCTACTTTACAGAGGAGGGTTTACTGTCATAAAACAGT
ACCAGCCTATGGTAGATGATGCTGTTGATTCAATAGATACTGATGAAGTCCCACATATCT
GGGAGTAACACTATCAGCCAGAATAAGCCAGGTTATGCTGCTTTAACAAATCCCACTGAC
TTAACATAAATAAAGTTTAATATTTGCTGACACTACTTTTCCAATGCAGATTGGCTGGGA
TTTTCTGCCATCTTTACACAGAGACTCAGGCTGGCTTGGGGAGGCTCCAACTTTGTACCA
CCATGATTGTCAAGATAGGAAAAAGAAGACATGGGGATTTGCTCACTGGCTCCTAAAGGT
TCCAAGTGAAAGCAACACAATCACTTCTGCTCTCATTCCATTGGCCTAAGCAAGTCCGAT
GGTCTCATCTAACTTCAAGAGGATGGGGTAGTGGATTTCTACCACATCGCAGGGGAGGGG
AAGAAGTAGAAAATATTTTAAAATACTACTGTAGACACAATGTGTTTATCTCTACAATAG
ATCTGCTAAATCCATATCTTAAGAGAAAACCGAAGCTCTGAGAGGTTATGTCATTTGACC
AAGGCCACACAGCCAATCCTCTGGGAAGCCAGGACTCAACCTGCATGCCACTCATCCTGA
CTGTGGGATCTATAGGCACAACGTCCACAAACTGTATAGAAGCAGAACAGTTTCAGGATG
GGGGTGGGTGGCAGGGAGGCCCTGCAAATGCGGTGGACAGAATAGGAATTGAATAGGCAT
GATGCGCCTTTGTGTCTGTGTTCTGTCATTGCTCCTGGGGAAGGGAAGAAGGGGCAGGGA
AGTTTGTGTGAAATATCAGTAGAAGGAAAGTTTGGACAGGTAAGAATATCACGCGTCAGA
GTACAGCAGAGACACGTGTGGAGGATGAGGGCAGTGTTTCAGGCCATTACTCTGGCAGCA
GTGAGGAGGCTCCCGGGGAGGGTTGGGGAGAGCGGGCTGTTGCTGGAGTCCGGGGTAAGG
TGACCAGGGGTAGACAGGAATGAAGGCGGCGGGAATGTAGAGGAAGGGCTGCACCTGAGA
GCTGCAGAGGAGGCCCCAGTGAGGTCCACTGGTGGAAAAGAACCGCCATGCGATCTGCAG
AACACCAGACCTTCTTCAGCCTTCACTCTTCCCACCCTAGTCTGGTACATTGCACCACTT
GTTAAAAAAAAAAATTTCCCTAAGACCTCTTTTTCTCTAGCCTTCTTCCTTATTTTTCAT
GGTCTCTTCTTAGAACGGCGGCAGCCACGCCGCGGTGGGAGGCCCTTCCTGCCTGACCCT
TACCGTGCGGGGTACCGTTCCTGTCACCATCGCCAGGATCTGGCCCTTTCAGTGAAAGGA
diffFind.R
setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
load("GBM__methylation_450.rda")
setwd("G:/AllShare/SkillTrainHomework")
DATAFRAME<- data.frame(Data)
TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
if(as.integer(substr(TCGASampletpye[i],1,2))<10){
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
}else{
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
}
}
library(ggplot2)
#h<- DATAFRAME$TCGA.06.0125.01A.01D.A45W.05
h<- DATAFRAME$TCGA.06.AABW.11A.31D.A368.05
pdf("CpGmethlation.pdf")
ggplot(NULL,aes(x=h))+
geom_histogram(binwidth = 0.02)+
labs(x = "% methylation per base",title = "Histogram of % CpG methylation\nSampleID:TCGA.06.AABW.11A.31D.A368.05")
dev.off()
library(reshape2)
tmp <- Data[,1:8]
tmp <- data.frame(tmp)
colnames(tmp) <- TCGASampleID[1:8]
tmp2 <- stack(tmp)
tmp2 <- data.frame(tmp2)
colnames(tmp2) <- c("CpG methylation","SampleID")
pdf("MethlationViolin.pdf",width =12,height = 10 )
ggplot(tmp2,aes(x=SampleID,y=`CpG methylation`,fill=SampleID))+
geom_violin()+
facet_wrap(~SampleID,ncol=2)
dev.off()
library(minfi)
dmp <- dmpFinder(Data, pheno = TCGASampletpyeResult, type = "categorical")
save(dmp,file = "diff.Rdata")
load("diff.Rdata")
meanTumorData<- apply(Data[,TCGASampletpyeResult=="Tumor"],1,mean)
#&!is.na(meanTumorData)
# head(meanTumorData)
# head(dmp)
# head(signifidmp,10)
# summary(dmp)
# rownames(dmp)
dmp$genelistID <- as.integer(rownames(dmp))
o<- order(dmp[,"genelistID"])
dmp<- dmp[o,]
# Keep CpGs with p < 0.01 and no missing values whose mean methylation in tumor samples is > 0.7, i.e. CpGs hypermethylated in tumor
signifidmp <- dmp[meanTumorData>0.7 & !is.na(meanTumorData) & dmp$pval<0.01
&!is.na(dmp$pval) & !is.na(dmp$qval) & !is.na(dmp$intercept),]
#signifidmp <- signifidmp[signifidmp$intercept<0.3,]dmp <-
signifiData<- Data[as.integer(rownames(signifidmp)),]
signifiDes<- Des[as.integer(rownames(signifidmp)),]
# Data[Des[,1]=="ch.3.2438620R",]
# meanTumorData[Des[,1]=="ch.3.2438620R"]>0.7
# mean(Data[103429,TCGASampletpyeResult=="Normal"])
# mean(Data[103429,TCGASampletpyeResult=="Tumor"])
# summary(Data[Des[,1]=="ch.3.2438620R",])
# apply((head(signifiData,10)[,TCGASampletpyeResult=="Tumor"]),1,mean)
signifiGene <- signifiDes[,2]
length(unique(signifiGene))
head(signifidmp)
# intercept: methylation level in normal samples
# Data[362831,118]
# Data[362831,48]
# Data[362831,]
#
# Data[310369,118]
# Data[310369,48]
# Data[310369,]
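# The following are some calculations on the CpGs
# Des has some missing entries; take the non-missing part to build Desfull
Desfull<- Des[!is.na(Des[,3]),]
# Take the matching non-missing part of Data and compute the mean
methtstat<- apply(Data[!is.na(Des[,3]),],1,mean,na.rm=TRUE)
# Drop the rows of Desfull whose mean is NA
Desfull<- Desfull[!is.na(methtstat),]
# Then drop the NA entries from the means in methtstat
methtstatresult<- methtstat[!is.na(methtstat)]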
grCpG <- GRanges(seqnames = paste("chr",Desfull[,3],sep = ""),
ranges = IRanges(start = as.integer(Desfull[,4]), width = 1))
grCpG$value <- methtstatresult
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
trans<- trans[trans$seqnames %in% c("chrX","chrY",paste("chr",1:22,sep="")),]
grTSS <- GRanges(seqnames = trans$seqnames,
ranges = IRanges(start = trans$start-5000,end = trans$start+5000))
grCpG
grTSS
hitObj<- findOverlaps(grTSS,grCpG)
# Computed for all overlaps at once: queryHits() indexes grTSS, subjectHits() indexes grCpG
# Actual position of each overlapping CpG
CpGsite <- start(grCpG)[subjectHits(hitObj)]
# Recover the actual TSS position (centre of the +/-5000 bp window)
TSSsite <- (start(grTSS) + end(grTSS))[queryHits(hitObj)] / 2
# Position of the CpG relative to the TSS
CpGRelativeSite <- TSSsite - CpGsite
# Mean methylation level of each CpG
CpGRelativeMeth <- grCpG$value[subjectHits(hitObj)]
save.image(file = "diffresult.Rdata")
load("diffresult.Rdata")
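# Per-overlap table of relative position and methylation, averaged position by position below
CpGraw <- data.frame(CpGRelativeSite,CpGRelativeMeth)
CpGResult <- c()
for(i in -5000:5000){
CpGResult <- c(CpGResult,mean(CpGraw[CpGraw$CpGRelativeSite==i,"CpGRelativeMeth"]))
}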
CpGframe <- data.frame(-5000:5000,CpGResult)
colnames(CpGframe) <- c("CpGRelativeSite","MethState")
ggplot(CpGframe, aes(x=CpGRelativeSite, y=MethState))+
#geom_point()+
geom_smooth()
ggsave(filename="TSS附近甲基化程度.pdf",width = 12,height=8)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
grTSS <- GRanges(seqnames = trans$seqnames,
ranges = IRanges(start = trans$start-5000,end = trans$start+5000))
genesall<- genes(txdb_hg19)
SERPIND1<- genesall[genesall$gene_id=="3053"]
SERPIND1promoter<- promoters(SERPIND1,upstream = 1500,downstream = 0)
SERPIND1promoter
SERPIND1hit<- findOverlaps(SERPIND1promoter,grCpG)
SERPIND1hit@to
grCpG[SERPIND1hit@to]
# Extract the matching rows from Data
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]
NUAK1<- genesall[genesall$gene_id=="9891"]
NUAK1promoter<- promoters(NUAK1,upstream = 1500,downstream = 0)
NUAK1promoter
NUAK1hit<- findOverlaps(NUAK1promoter,grCpG)
NUAK1hit@to
grCpG[NUAK1hit@to]
# Extract the matching rows from Data
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]
apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalNUAK1<- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"]))
tumorNUAK1 <- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"]))
normalNUAK1$sampletype <- "Normal"
normalNUAK1$site <- 1:7
tumorNUAK1$sampletype <- "Tumor"
tumorNUAK1$site <- 1:7
frame1<- data.frame(normalNUAK1$NA..1,normalNUAK1$sampletype,normalNUAK1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorNUAK1$NA..1,tumorNUAK1$sampletype,tumorNUAK1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) +
scale_x_discrete(limits=1:7)+
geom_point(position="jitter")
ggsave(filename="TargetCpGNUAK1.pdf",width = 8,height=6)
apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalSERPIND1<- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"]))
tumorSERPIND1 <- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"]))
normalSERPIND1$sampletype <- "Normal"
normalSERPIND1$site <- 1:2
tumorSERPIND1$sampletype <- "Tumor"
tumorSERPIND1$site <- 1:2
frame1<- data.frame(normalSERPIND1$NA..1,normalSERPIND1$sampletype,normalSERPIND1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorSERPIND1$NA..1,tumorSERPIND1$sampletype,tumorSERPIND1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) +
scale_x_discrete(limits=1:2)+
geom_point(position="jitter")
ggsave(filename="TargetCpGSERPIND1.pdf",width = 8,height=6)
library("BSgenome.Hsapiens.UCSC.hg19")
library(seqinr)
genome <- BSgenome.Hsapiens.UCSC.hg19
grCpG[NUAK1hit@to]
promoterSeq<- getSeq(genome,NUAK1promoter)
write.fasta(promoterSeq$`9891`,"NUAK1promoter","promoterSeq.txt")
normalNUAK1$NA..1
tumorNUAK1$NA..1
dim(Datafull)
dim(Desfull)
# Tried imputing missing values with the mice package; abandoned because the data are too large
# library(mice)
# library(VIM)
# #md.pattern(Data)
# tmp <- Data[,1:3]
# aggr_plot <- aggr(tmp, col = c('navyblue', 'red'), numbers=TRUE, sortVars=TRUE,
# labels=names(tmp),cex.axis=.7, gap=2,
# ylab=c("Histogram of missing data", "Pattern"))
# tempData <- mice(Data,m=1,maxit=50,meth='pmm',seed=500)
#
# Tried a t-test; abandoned because of missing values
# Ttest<- t.test(Data[1,1:3],Data[1,c(48,118)])
#
# Data[1,118]
# Data[1,48]
# tmp<- Data[1,]
# myfun <-function(x){
# Ttest<- t.test(x[c(1:47,49:117,119:154)],x[c(48,118)])
# return(Ttest$p.value)
# }
# myfun(Data[1,])
# result<- apply(Data,1,myfun)
# group <- factor(TCGASampletpyeResult,levels=c("Normal","Tumor"))
# design <- model.matrix(~-1+group)
# design
# fit.reduced <- lmFit(Data,design)
# fit.reduced <- eBayes(fit.reduced)
# summary(decideTests(fit.reduced))
#
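# Tried the missMethyl package; the design matrix was set up incorrectly, so the differential regions were misidentified and the differences were not clear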
# top<-topTable(fit.reduced,coef=1)
# top
# cpgs <- as.integer(rownames(top))
# Data[cpgs[10],]
# Data[cpgs[10],48]
# Data[cpgs[10],118]
#
# par(mfrow=c(2,2))
# for(i in 1:4){
# stripchart(Data[rownames(Data)==cpgs[i],]~design[,4],method="jitter",
# group.names=c("Normal","Tumor"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
# vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
# title(cpgs[i],cex.main=1.5)
# }
Downloaded.R
setwd("G:/AllShare/SkillTrainHomework/TCGA-Assembler.2.0.5/TCGA-Assembler")
#' Load functions
source("Module_A.R")
source("Module_B.R")
setwd("G:/AllShare/SkillTrainHomework")
#' choose a cancer type
#' See https://tcga-data.nci.nih.gov/docs/publications/tcga/
sCancer <- "GBM"
sPath1 <- "./DownloadedData"
sPath2 <- "./BasicDataProcessingResult"
sPath3 <- "./AdvancedDataProcessingResult"
# Download the RNA-Seq data
path_geneExp <- DownloadRNASeqData(cancerType = sCancer,
assayPlatform = "gene.normalized_RNAseq",
saveFolderName = sPath1)
#' Download DNA methylation 450 data
#' Methylation data measured on the Illumina methylation array
path_methylation_450 <- DownloadMethylationData(cancerType = sCancer,
assayPlatform = "methylation_450",
saveFolderName = sPath1)
# Format of the downloaded sample names:
# TCGA-3C-AAAU-01A-11R-A41B-07
# The first three fields, TCGA-3C-AAAU, are the patient ID
# The fourth field, 01A, is the sample type: tumor types range from 01 - 09,
# normal types from 10 - 19 and control samples from 20 - 29.
# The fifth field: order of portion in a sequence of 100 - 120 mg sample portions
# See https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
#' Process gene expression data
#' Process the downloaded RNA-seq results and clean up the gene names
list_geneExp <-
ProcessRNASeqData(inputFilePath = path_geneExp[1],
outputFileName = paste(sCancer,
"geneExp",
sep = "__"),
dataType = "geneExp",
outputFileFolder = sPath2)
# Process the methylation data
list_methylation_450 <-
ProcessMethylation450Data(inputFilePath =
path_methylation_450[1], outputFileName = paste(sCancer,
"methylation_450", sep = "__"), outputFileFolder = sPath2)
#'Perform advanced data processing using Module B functions
#' Advanced further processing of the methylation data
list_methylation_450_OverallAverage <-
CalculateSingleValueMethylationData(input = list_methylation_450,
regionOption = c("TSS1500", "TSS200"), DHSOption = "Both",
outputFileName = paste(sCancer, "methylation_450", sep = "__"),
outputFileFolder = sPath3,
chipAnnotationFile = "./SupportingFiles/MethylationChipAnnotation.rda")
save.image("tmpdata.Rdata")
load("tmpdata.Rdata")
#setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
#load("GBM__geneExp.rda")
setwd("G:/AllShare/SkillTrainHomework")
load("RNAdiff.RData")
DATAFRAME<- data.frame(Data)
TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
if(as.integer(substr(TCGASampletpye[i],1,2))<10){
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
}else{
TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
}
}
phenotype <- data.frame(TCGASampletpyeResult)
rownames(phenotype) <- colnames(Data)
rownames(Data) <- Des[,2]
library(DESeq2)
# DESeq2 requires integer counts, so the values are rounded down
Data<- floor(Data)
dds <- DESeqDataSetFromMatrix(countData = Data,
colData = phenotype,
design = ~ TCGASampletpyeResult)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
resSig <- subset(resOrdered, padj < 0.01)
res001 <- results(dds, alpha=0.01)
pdf(file = "MA-plot.pdf",width = 15,height = 9)
plotMA(res001, ylim=c(-3,3))
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
res[idx,]
Data[c("100131561","677811","414899","677846"),]
Data[c("9271","140893"),]
Data[c("390077","144742"),]
dev.off()
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="TCGASampletpyeResult",
returnData=TRUE)
library("ggplot2")
ggplot(d, aes(x=TCGASampletpyeResult, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))+
labs(title=paste("EntrezID:",res@rownames[which.min(res$padj)]))+
theme(plot.title = element_text(hjust = 0.5))
ggsave(filename="p值最小的基因不同表達量.pdf",width = 8,height=6)
ntd <- normTransform(dds)
library("pheatmap")
select <- rownames(resOrdered)[1:20]
df <- as.data.frame(colData(dds)[,"TCGASampletpyeResult"])
pdf(file = "熱圖.pdf",width = 20,height = 9)
pheatmap(assay(ntd)[select,],show_rownames=TRUE)
dev.off()
RnaSigGeneID<- rownames(resSig)
RnaSigGeneName<- Des[Des[,2]%in%RnaSigGeneID,1]
intersect(RnaSigGeneName,signifiGene)
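# Negative log2FoldChange: higher expression in normal; positive log2FoldChange: higher expression in tumor
# Keep negative log2FoldChange, i.e. genes with low expression in tumor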
resSigdown<- resSig[resSig$log2FoldChange<(-2),]
#dim(resSigdown)
RnaSigGeneIDdown<- rownames(resSigdown)
RnaSigGeneNamedown<- Des[Des[,2]%in%RnaSigGeneIDdown,1]
# Data["2498",c(38:41,76)]
# Data["2498",]
# Data["7153",c(38:41,76)]
# Data["7153",]
# Intersect with the tumor suppressor genes
Human_TSGs <- read.table("Human_TSGs.txt",header = TRUE,sep = "\t")
#TSGene <- read.table("TSGene-LOFdataset.txt",header = TRUE,sep = "\t")
#sig_exp <- read.table("sig_exp.txt",header = TRUE,sep = "\t")
TSGs <- Human_TSGs$GeneSymbol
#TSGs <- TSGene$GeneName
#TSGs <- sig_exp$Symbol
result1<- intersect(RnaSigGeneNamedown,TSGs)
result2<- intersect(result1,signifiGene) # intersection of all three sets
resultmuch <- intersect(RnaSigGeneNamedown,signifiGene)
write.table(resultmuch,"候選靶標基因多多多.txt",quote = FALSE,row.names = FALSE,col.name = FALSE)
#"CCHCR1"%in%signifiGene
# This section needs the TCGASampletpyeResult object from the methylation analysis
setwd("G:/AllShare/SkillTrainHomework/AdvancedDataProcessingResult")
load("GBM__methylation_450__TSS1500-TSS200__Both.rda")
MethTssFrame<- data.frame(Data)
rownames(MethTssFrame) <- Des[,1]
targetFrame<- MethTssFrame[result2,]
colnames(targetFrame) <- TCGASampletpyeResult
targetFrame<-na.omit(targetFrame)
library(pheatmap)
pdf(file = "target熱圖.pdf",width = 20,height = 6)
pheatmap(targetFrame,show_rownames=TRUE)
dev.off()
library (VennDiagram)
draw.triple.venn(area1=5, area2=5, area3=5
,n12=3, n23=3, n13=3, n123=3
,category = c('A','B','C'))
pdf("交集.pdf",width = 10,height = 10)
vennPlot<-venn.diagram(list(A=RnaSigGeneNamedown,B=TSGs,C=signifiGene),filename=NULL
,lwd=1,lty=2,category = c('RNA-seq down','Tumor suppressor gene','Hypermethylation'))
grid.draw(vennPlot)
dev.off()
#save.image("RNAdiff.RData")
PCAdata<-t(Data)
# Principal component analysis
PCAdata.pr<-prcomp(PCAdata,scale=FALSE)
# Project the samples onto the principal components
PCA_eset<- predict(PCAdata.pr)
colnames(PCA_eset)
pdf(file = "陡坡圖.pdf",width = 14,height = 7)
screeplot(PCAdata.pr,type="lines")
dev.off()
data.hc <- hclust( dist(PCAdata))
pdf(file = "樹狀圖.pdf",width = 22,height = 12)
plot(data.hc, hang = -1)
dev.off()
#plot(PCA_eset[,1:2])
library("ggplot2")
ggplot(NULL, aes(x=PCA_eset[,1], y=PCA_eset[,2], colour=TCGASampletpyeResult)) +
geom_point() +
guides(color=guide_legend(title=NULL)) +
labs(x = "PCA1 34.7%",y = "PCA2 15.3%",title = "RNA-seq Principal components analysis")
ggsave(filename="PCA RNA-seq.pdf",width = 8,height=6)
GOterm <- read.table("analysis.txt",header = TRUE,sep = "\t")
library(ggplot2)
ggplot(GOterm,aes(x=GO.biological.process,y=upload_1..over.under.,fill = P.value))+
geom_bar(stat="identity")+
labs(x = "GO terms",y = "Fold Enrichment",title = "GO biological process analysis")+
coord_flip()
ggsave(filename="GO analysis.pdf",width = 8,height=6)