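Computational Identification of Methylation-Specific Regions

Computational identification of methylated regions in glioblastoma multiforme (GBM)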

 

Goal: identify methylated regions that are specific to glioblastoma, as a theoretical basis for clinical diagnosis.

 

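Steps:

1. Find the data: download the GBM RNA-seq and methylation data from TCGA.

2. Analyze the methylation data: compare normal and tumor samples, run a differential methylation analysis, and find the regions that are hypermethylated in tumor samples.

3. Analyze the RNA-seq data: compare normal and tumor samples, screen for differentially expressed genes, and find the genes with low expression in tumor samples.

4. Combine the methylation and RNA-seq results: intersect the hypermethylated genes with the low-expression genes. These genes are likely to be tumor suppressors, so intersect them with a list of known tumor suppressor genes, then integrate the CpG data over the promoter region to look for candidate targets.

5. Validate the targets: use PubMed and other databases to cross-check how reliable each target is.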

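I. Data download

First, download the GBM RNA-seq and methylation data from TCGA. As the table below shows, GBM has 172 RNA-seq data sets and 437 DNA methylation data sets. TCGA provides data from two array platforms, the Infinium HumanMethylation27 BeadChip and the Infinium HumanMethylation450 BeadChip. To avoid the difficulty of merging data across platforms later on, only the HumanMethylation450 data were downloaded, 154 data sets in total.

Figure 1. TCGA data summary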

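II. Initial data processing

TCGA-Assembler.2.0.5 was used to batch-download the GBM data and do the initial processing, and box plots were drawn for the RNA-seq gene expression values and for the methylation array data. Because the data set is large, the plots are not shown here.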

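III. Overall visualization

For the methylation data, start with the sample TCGA.06.AABW.11A.31D.A368.05 and look at its overall methylation level. A single CpG on a single DNA molecule has only two real states, methylated or unmethylated, so across all sites most values should sit near "fully methylated" (methylation state = 1) or "fully unmethylated" (methylation state = 0). The histogram below has the expected shape, high at both ends and low in the middle, which is indirect evidence that the array data are of good quality.

Figure 2. CpG methylation levels in a single sample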
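The "high at both ends, low in the middle" check can also be put into a number. The following is a minimal sketch on simulated beta values; the mixture weights and the 0.2/0.8 cut-offs are illustrative assumptions, not values taken from the analysis.

```r
# Simulated beta values for one sample: a mixture of mostly unmethylated,
# mostly methylated and intermediate probes (weights are assumptions).
set.seed(1)
beta <- c(rbeta(4000, 1, 12),   # peak near 0
          rbeta(4000, 12, 1),   # peak near 1
          rbeta(2000, 3, 3))    # intermediate probes

# Share of probes that fall in the two extreme tails.
extreme_fraction <- function(b, lo = 0.2, hi = 0.8) {
  b <- b[!is.na(b)]
  mean(b < lo | b > hi)
}

frac <- extreme_fraction(beta)
frac
```

On real data, `beta` would be one column of the methylation matrix; a sample whose fraction is much lower than that of the others would be worth a second look.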

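Next, violin plots of CpG methylation were drawn for several samples. Each row is one patient; the sample on the left comes from a Primary Solid Tumor and the one on the right from a Recurrent Solid Tumor. Besides showing that most methylation values lie near 0 or 1, the plots show that tumors from the same patient still differ slightly in methylation.

TCGA barcode: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

Figure 3. Violin plots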
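The sample type is encoded in the fourth field of the TCGA barcode (01 to 09 tumor, 10 to 19 normal, 20 to 29 control). A small vectorized helper along these lines could replace the two loops used in the scripts below. The function name is hypothetical, the third barcode is made up for illustration, and the scripts themselves label everything from 10 upward as "Normal".

```r
# Classify TCGA barcodes by the two-digit sample-type code in the 4th field.
# Accepts "-" or "." as separator (R turns "-" into "." in column names).
tcga_sample_type <- function(barcodes) {
  fields <- strsplit(barcodes, "[.-]")
  code <- vapply(fields, function(f) as.integer(substr(f[4], 1, 2)), integer(1))
  ifelse(code < 10, "Tumor", ifelse(code < 20, "Normal", "Control"))
}

types <- tcga_sample_type(c("TCGA.06.AABW.11A.31D.A368.05",
                            "TCGA.06.0125.01A.01D.A45W.05",
                            "TCGA-00-0000-02A-00D-0000-05"))  # invented barcode
types  # "Normal" "Tumor" "Tumor"
```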

 

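Similarly, some initial visualization can be done for the RNA-seq data. In addition to the box plots drawn after download, PCA gives a first look at how the samples are distributed. The left panel below is the PCA scree plot, which shows the proportion of variance carried by the first principal component, the second, and so on. The right panel is a scatter plot of PC1 against PC2; the 5 normal samples lie close together, which indirectly suggests the data are reliable.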
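The percentages on the PC1/PC2 axes are hard-coded in the plotting script further down. They can be derived from the `prcomp` result instead, as in this sketch on a toy matrix (the data are simulated; only the `prcomp` call mirrors RNAdiff.R).

```r
set.seed(2)
# Toy expression matrix: 200 genes x 12 samples (samples in columns).
expr <- matrix(rnorm(200 * 12), nrow = 200)
expr[1:50, 1:4] <- expr[1:50, 1:4] + 3   # four samples share a shifted gene block

pca <- prcomp(t(expr), scale. = FALSE)   # samples as rows, as in RNAdiff.R
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

round(100 * var_explained[1:2], 1)       # percentages for the axis labels
```

The scree plot shows the same quantity, `pca$sdev^2`, before it is divided by the total.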

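Finally, hierarchical clustering of the RNA-seq expression profiles gives a dendrogram in which the 5 normal samples are again close together, so the data quality is acceptable.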

 

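IV. Screening for differentially methylated regions

To screen for differentially methylated sites in a principled and efficient way, the analysis follows the Bioconductor workflow for methylation arrays and uses the minfi package to obtain differentially methylated positions.

http://www.bioconductor.org/help/workflows/methylationArrayAnalysis/

Of the 526733 CpG sites tested, 4927 had a p-value < 0.01 and a mean methylation level above 0.7 in the tumor samples; they map to 2054 genes.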
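The two-part filter (significant difference, and mean tumor methylation above 0.7) can be sketched without minfi on a toy matrix. This uses a per-CpG Welch t-test as a simple stand-in for the F-test that `dmpFinder` runs, so it illustrates the filter rather than reproducing the actual analysis; all values are simulated.

```r
set.seed(3)
# Toy beta matrix: 6 CpGs x 12 samples, 5 normal and 7 tumor.
group <- rep(c("Normal", "Tumor"), times = c(5, 7))
beta <- matrix(runif(72, 0.05, 0.25), nrow = 6,
               dimnames = list(paste0("cpg", 1:6), NULL))
beta["cpg2", group == "Tumor"] <- runif(7, 0.80, 0.95)  # planted hypermethylated site

# Per-CpG test (stand-in for dmpFinder) and mean methylation in tumor samples.
pval <- apply(beta, 1, function(x)
  t.test(x[group == "Tumor"], x[group == "Normal"])$p.value)
mean_tumor <- rowMeans(beta[, group == "Tumor"], na.rm = TRUE)

hyper <- names(which(pval < 0.01 & mean_tumor > 0.7))
hyper
```

Using `which()` drops any CpG whose p-value or mean is `NA`, which has the same effect as the explicit `!is.na()` terms in diffFind.R.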

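V. Screening for differentially expressed genes

Because the data come from RNA-seq, the analysis uses DESeq2, a widely used package based on a negative binomial model.

First, an MA-plot gives a rough view of the differentially expressed genes. Unexpectedly, there are about seven line-like streaks on the left of the plot. My first guess was a batch effect between samples that would need a better normalization method. Using identify() to pick out the genes on each streak and look at their expression values showed something else: these genes have an expression value of almost exactly 0 across the 172 samples, with only one or two samples showing a little expression. Data like this produce the streaks.

Figure 4. MA-plot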
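Genes of this kind can be flagged before plotting instead of being picked out by hand. The sketch below counts, for each gene, how many samples have a non-zero count; the matrix and the threshold of two samples are assumptions made for illustration.

```r
# Toy count matrix: genes in rows, samples in columns (values are made up).
counts <- rbind(geneA = c(0, 0, 0, 0, 3, 0),
                geneB = c(120, 98, 143, 110, 87, 131),
                geneC = c(0, 1, 0, 0, 0, 0))

# Genes detected in at most `max_samples` samples. With so few non-zero
# integer counts, mean and fold change can only take a handful of values,
# which is a plausible source of the streaks at the low-count end.
sparse_genes <- function(m, max_samples = 2) {
  rownames(m)[rowSums(m > 0) <= max_samples]
}

flagged <- sparse_genes(counts)
flagged  # "geneA" "geneC"
```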

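Next, the differentially expressed gene with the smallest adjusted p-value was selected and its expression plotted by group; the difference is indeed very clear.

Figure 5. Expression scatter plot

Then a heat map of the differentially expressed genes across groups was drawn; the normal samples are the five leftmost columns. To address a specific biological question, the genes in each cluster would need GO and KEGG annotation and would have to be interpreted alongside the relevant phenotype to explore the molecular mechanism and its significance.

Figure 6. Expression heat map

Finally, the genes with an adjusted p-value (padj in RNAdiff.R) below 0.01 and log2FoldChange < -2 were selected, that is, genes with low expression in tumor samples: 1657 in total.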
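As a sketch, the same filter on a plain data frame with the two relevant DESeq2 result columns looks like this. Gene IDs and numbers are invented; DESeq2 can report `NA` for `padj`, so the filter has to cope with that.

```r
# Toy results table with the two columns used by the filter.
res_df <- data.frame(
  row.names      = c("g1", "g2", "g3", "g4"),
  log2FoldChange = c(-3.1, -2.5, 1.8, -4.0),
  padj           = c(0.001, 0.20, 0.0005, NA)
)

# Keep genes that are significant and more than 4-fold lower in tumor.
# which() drops rows whose padj is NA.
keep <- which(res_df$padj < 0.01 & res_df$log2FoldChange < -2)
down_genes <- rownames(res_df)[keep]
down_genes  # "g1"
```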

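VI. Obtaining tumor suppressor genes

A PubMed search for curated lists of tumor suppressor genes turns up the TSGene database, published in Nucleic Acids Res, which stores a list of tumor suppressor genes. https://bioinfo.uth.edu/TSGene/download.cgi

The full list of 1217 human tumor suppressor genes was downloaded.

All the 1217 human tumor suppressor genes with basic annotations

Figure 7. TSGene database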

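VII. Data integration

Three gene lists were overlapped to find specific candidate targets for the later analysis: genes near CpGs that are hypermethylated in tumor samples (methylation data), genes with low expression in tumor samples (RNA-seq), and the tumor suppressor genes downloaded from TSGene. The Venn diagram below shows the overlap.

Figure 8. Venn diagram of the integrated data

In total, 12 candidate target genes were found.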
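The three-way overlap is a plain set intersection. A minimal sketch with toy lists follows; apart from NUAK1, the gene symbols are placeholders.

```r
# Toy gene lists standing in for the three inputs.
hypermethylated  <- c("NUAK1", "GENE_A", "GENE_B", "GENE_C")
low_expression   <- c("NUAK1", "GENE_B", "GENE_D")
tumor_suppressor <- c("NUAK1", "GENE_C", "GENE_D", "GENE_E")

# Reduce() applies intersect() pairwise, so it works for any number of lists.
sets <- list(hypermethylated, low_expression, tumor_suppressor)
candidates <- Reduce(intersect, sets)
candidates  # "NUAK1"
```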

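VIII. Target screening

The screening so far used differential methylation at single CpGs, but a clinical assay would probably need several CpGs to corroborate each other. So the methylation levels of all CpGs within 1.5 kb upstream of the TSS were summarized for the 12 candidate genes and drawn as a heat map. Although every one of these genes was selected through a CpG that is hypermethylated in tumors, many of them turn out to be hypomethylated in all samples once all CpGs within 1.5 kb upstream of the TSS are taken into account. The gene that looks most convincing is NUAK1: the 1.5 kb upstream of its TSS is hypomethylated in normal samples and hypermethylated in tumor samples.

Figure 9. Methylation heatmap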
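The difference between a single-CpG call and a region-level call can be shown with a small aggregation. This is a sketch on invented numbers, not the TCGA-Assembler computation: GENE_B has one probe that would pass a "tumor mean > 0.7" filter on its own, but its region average does not.

```r
# Toy probe-level table: mean beta per group for probes annotated to the
# upstream window of two genes (all numbers invented).
probes <- data.frame(
  gene        = c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  normal_mean = c(0.10, 0.15, 0.12, 0.08, 0.11),
  tumor_mean  = c(0.75, 0.80, 0.70, 0.85, 0.09)
)

# Average over all probes of a gene, so one outlying CpG cannot carry the call.
region <- aggregate(cbind(normal_mean, tumor_mean) ~ gene, data = probes, FUN = mean)
region$delta <- region$tumor_mean - region$normal_mean
region
```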

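Within 1.5 kb upstream of the NUAK1 TSS, 7 CpGs were assayed. Their methylation levels across the 154 samples are shown below: all 7 CpGs are relatively highly methylated in Tumor tissue and relatively lowly methylated in Normal tissue.

Figure 10. CpG methylation in the NUAK1 TSS region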

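Cpgplot was used to predict CpG island positions. On the promoter sequence, read 5' to 3', the 7 CpGs sit at the following positions:

1035 1094 1408 1413 1444 1448 1471

Figure 11. CpG island prediction

Parameters used: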

NUAK1promoter from 1 to 1500

     Observed/Expected ratio > 0.40

     Percent C + Percent G > 40.00

     Length > 100
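For reference, the two sequence statistics behind these thresholds can be computed for a single window as follows. This is a sketch of the usual definitions (expected CpG count = number of C times number of G divided by window length); it does not reproduce Cpgplot's sliding-window procedure, and the function name and test sequence are made up.

```r
# Observed/expected CpG ratio and GC percentage for one sequence window.
cpg_stats <- function(seq) {
  s <- toupper(seq)
  n <- nchar(s)
  bases <- strsplit(s, "")[[1]]
  n_c <- sum(bases == "C")
  n_g <- sum(bases == "G")
  # Count "CG" dinucleotides by comparing each base with its successor.
  n_cpg <- sum(bases[-n] == "C" & bases[-1] == "G")
  expected <- n_c * n_g / n
  c(obs_exp    = if (expected > 0) n_cpg / expected else NA_real_,
    gc_percent = 100 * (n_c + n_g) / n)
}

stats <- cpg_stats("ACGCGTTACGGC")
stats  # obs_exp 2.25, gc_percent about 66.7
```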

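CpG island details: Length 101 (1086..1186)   Length 105 (1366..1470)

Five of the seven CpGs (1094, 1408, 1413, 1444 and 1448) lie inside the predicted islands; 1471 sits one base past the end of the second island and 1035 lies upstream of the first. The full sequence is in the appendix.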
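The overlap between the assayed CpGs and the predicted islands can be checked directly from the coordinates reported above:

```r
# Positions of the seven assayed CpGs and the two predicted islands
# (1-based coordinates on the 1500 bp promoter sequence).
cpg_pos <- c(1035, 1094, 1408, 1413, 1444, 1448, 1471)
islands <- data.frame(start = c(1086, 1366), end = c(1186, 1470))

in_island <- vapply(cpg_pos,
                    function(p) any(p >= islands$start & p <= islands$end),
                    logical(1))
data.frame(position = cpg_pos, in_island)
sum(in_island)  # 5
```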

 

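IX. Discussion of the target gene

Searching the Gene database for NUAK1 shows that its full name is NUAK family kinase 1. As a kinase, it is likely to play a substantial regulatory role. In the HPA RNA-seq normal tissues project, its expression in brain is clearly higher than in other tissues, which fits with GBM arising in the brain.

Figure 12. NUAK1 discussion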

 

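X. Exploring the molecular mechanism

The genes that lie near CpGs hypermethylated in tumor tissue and also have low expression in tumor samples intersect in 274 genes. A Gene Ontology enrichment analysis of these genes shows clear enrichment in GO biological process terms such as "nervous system development", "chemical synaptic transmission" and "cell membrane organization". "Central nervous system myelination" stands out with a 26.95-fold enrichment. This ties in closely with GBM, which mostly occurs in the brain, and indirectly supports the results.

Figure 13. GO enrichment analysis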
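Fold enrichment for a GO term is the share of query genes annotated to the term divided by the share of background genes annotated to it. The sketch below pairs it with a one-sided hypergeometric p-value, which is one common choice; the test used by the enrichment tool in this analysis is not stated, and apart from the query size of 274 the counts are hypothetical.

```r
# Fold enrichment and one-sided hypergeometric p-value for one GO term.
# N: genes in the background, K: background genes annotated to the term,
# n: genes in the query list, k: query genes annotated to the term.
go_enrichment <- function(k, n, K, N) {
  fold <- (k / n) / (K / N)
  # P(X >= k) when drawing n genes without replacement.
  p <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
  c(fold = fold, p_value = p)
}

# Hypothetical counts, for illustration only.
enr <- go_enrichment(k = 5, n = 274, K = 20, N = 20000)
enr
```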

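XI. Further analysis

CpG methylation is known to regulate gene transcription, so methylation near the transcription start site (TSS) deserves a closer look. Using the hg19 human genome build, methylation was summarized around 46489 transcription start sites from 23056 genes.

Methylation of CpGs within 5000 bp either side of the TSS was collected and fitted with a smooth curve. CpG methylation drops markedly at the TSS, which agrees with established knowledge.

Figure 14. Methylation near the TSS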
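The script below takes the transcript start coordinate as the TSS for every transcript and stores TSS minus CpG position. A strand-aware variant, in which the TSS of a minus-strand transcript is its end coordinate and negative values always mean upstream, might look like this. It is a suggested sketch with invented coordinates, not the computation that produced Figure 14.

```r
# Signed distance of each CpG from its TSS: negative = upstream, positive = downstream.
relative_position <- function(cpg_pos, tx_start, tx_end, strand) {
  tss <- ifelse(strand == "+", tx_start, tx_end)  # TSS is the 5' end of the transcript
  ifelse(strand == "+", cpg_pos - tss, tss - cpg_pos)
}

# Toy example: two CpGs per transcript, one transcript on each strand.
rel <- relative_position(cpg_pos  = c(900, 1100, 5200, 4900),
                         tx_start = c(1000, 1000, 3000, 3000),
                         tx_end   = c(2000, 2000, 5000, 5000),
                         strand   = c("+", "+", "-", "-"))
rel  # -100 100 -200 100

# Mean methylation in 1 kb bins around the TSS (empty bins come out as NA).
meth <- c(0.80, 0.10, 0.70, 0.20)
bin  <- cut(rel, breaks = seq(-5000, 5000, by = 1000), include.lowest = TRUE)
profile <- tapply(meth, bin, mean)
profile[!is.na(profile)]
```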

Appendix

The CpG islands in the NUAK1 promoter are marked in blue, and the assayed CpGs are marked in bold, underlined subscript.

>NUAK1promoter

TATGAAAGGAGAAGGGGGAGCTTTGGAACTGGACAGGTAGGGTTTAAATCCTGGTCCTGC

CATTTACAAACTGTGTAACCTCTGGGAAATTACTTAACCTTTCTGATACGGTTTCTTCAA

TTGAAAATAGGGATTGTAAACAGCTACTTTACAGAGGAGGGTTTACTGTCATAAAACAGT

ACCAGCCTATGGTAGATGATGCTGTTGATTCAATAGATACTGATGAAGTCCCACATATCT

GGGAGTAACACTATCAGCCAGAATAAGCCAGGTTATGCTGCTTTAACAAATCCCACTGAC

TTAACATAAATAAAGTTTAATATTTGCTGACACTACTTTTCCAATGCAGATTGGCTGGGA

TTTTCTGCCATCTTTACACAGAGACTCAGGCTGGCTTGGGGAGGCTCCAACTTTGTACCA

CCATGATTGTCAAGATAGGAAAAAGAAGACATGGGGATTTGCTCACTGGCTCCTAAAGGT

TCCAAGTGAAAGCAACACAATCACTTCTGCTCTCATTCCATTGGCCTAAGCAAGTCCGAT

GGTCTCATCTAACTTCAAGAGGATGGGGTAGTGGATTTCTACCACATCGCAGGGGAGGGG

AAGAAGTAGAAAATATTTTAAAATACTACTGTAGACACAATGTGTTTATCTCTACAATAG

ATCTGCTAAATCCATATCTTAAGAGAAAACCGAAGCTCTGAGAGGTTATGTCATTTGACC

AAGGCCACACAGCCAATCCTCTGGGAAGCCAGGACTCAACCTGCATGCCACTCATCCTGA

CTGTGGGATCTATAGGCACAACGTCCACAAACTGTATAGAAGCAGAACAGTTTCAGGATG

GGGGTGGGTGGCAGGGAGGCCCTGCAAATGCGGTGGACAGAATAGGAATTGAATAGGCAT

GATGCGCCTTTGTGTCTGTGTTCTGTCATTGCTCCTGGGGAAGGGAAGAAGGGGCAGGGA

AGTTTGTGTGAAATATCAGTAGAAGGAAAGTTTGGACAGGTAAGAATATCACGCGTCAGA

GTACAGCAGAGACACGTGTGGAGGATGAGGGCAGTGTTTCAGGCCATTACTCTGGCAGCA

GTGAGGAGGCTCCCGGGGAGGGTTGGGGAGAGCGGGCTGTTGCTGGAGTCCGGGGTAAGG

TGACCAGGGGTAGACAGGAATGAAGGCGGCGGGAATGTAGAGGAAGGGCTGCACCTGAGA

GCTGCAGAGGAGGCCCCAGTGAGGTCCACTGGTGGAAAAGAACCGCCATGCGATCTGCAG

AACACCAGACCTTCTTCAGCCTTCACTCTTCCCACCCTAGTCTGGTACATTGCACCACTT

GTTAAAAAAAAAAATTTCCCTAAGACCTCTTTTTCTCTAGCCTTCTTCCTTATTTTTCAT

GGTCTCTTCTTAGAACGGCGGCAGCCACGCCGCGGTGGGAGGCCCTTCCTGCCTGACCCT

TACCGTGCGGGGTACCGTTCCTGTCACCATCGCCAGGATCTGGCCCTTTCAGTGAAAGGA


diffFind.R

setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
load("GBM__methylation_450.rda")
setwd("G:/AllShare/SkillTrainHomework")
DATAFRAME<- data.frame(Data)

TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
  TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
  TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
  if(as.integer(substr(TCGASampletpye[i],1,2))<10){
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
  }else{
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
  }
}

library(ggplot2)
#h<- DATAFRAME$TCGA.06.0125.01A.01D.A45W.05
h<- DATAFRAME$TCGA.06.AABW.11A.31D.A368.05
pdf("CpGmethlation.pdf")
ggplot(NULL,aes(x=h))+
  geom_histogram(binwidth = 0.02)+
  labs(x = "% methylation per base",title = "Histogram of % CpG methylation\nSampleID:TCGA.06.AABW.11A.31D.A368.05")
dev.off()


library(reshape2)
tmp <- Data[,1:8]
tmp <- data.frame(tmp)
colnames(tmp) <- TCGASampleID[1:8]
tmp2 <- stack(tmp)
tmp2 <- data.frame(tmp2)
colnames(tmp2) <- c("CpG methylation","SampleID")
pdf("MethlationViolin.pdf",width =12,height = 10 )
ggplot(tmp2,aes(x=SampleID,y=`CpG methylation`,fill=SampleID))+
  geom_violin()+
  facet_wrap(~SampleID,ncol=2)
dev.off()


library(minfi)
dmp <- dmpFinder(Data, pheno = TCGASampletpyeResult, type = "categorical")
save(dmp,file = "diff.Rdata")
load("diff.Rdata")

meanTumorData<- apply(Data[,TCGASampletpyeResult=="Tumor"],1,mean)

#&!is.na(meanTumorData)

# head(meanTumorData)
# head(dmp)
# head(signifidmp,10)
# summary(dmp)
# rownames(dmp)
dmp$genelistID <- as.integer(rownames(dmp))
o<- order(dmp[,"genelistID"])
dmp<- dmp[o,]
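# Keep CpGs with p < 0.01 and no missing values whose mean methylation in tumor samples is > 0.7,
# i.e. sites that are hypermethylated in tumor
signifidmp <- dmp[meanTumorData>0.7 & !is.na(meanTumorData) & dmp$pval<0.01
                  &!is.na(dmp$pval) & !is.na(dmp$qval) & !is.na(dmp$intercept),]
# Optional extra filter (disabled): methylation level in normal samples < 0.3
#signifidmp <- signifidmp[signifidmp$intercept<0.3,]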
signifiData<- Data[as.integer(rownames(signifidmp)),]
signifiDes<- Des[as.integer(rownames(signifidmp)),]

# Data[Des[,1]=="ch.3.2438620R",]
# meanTumorData[Des[,1]=="ch.3.2438620R"]>0.7
# mean(Data[103429,TCGASampletpyeResult=="Normal"])
# mean(Data[103429,TCGASampletpyeResult=="Tumor"])
# summary(Data[Des[,1]=="ch.3.2438620R",])

# apply((head(signifiData,10)[,TCGASampletpyeResult=="Tumor"]),1,mean)

signifiGene <- signifiDes[,2]

length(unique(signifiGene))

head(signifidmp)

# intercept: methylation level in normal samples
# Data[362831,118]
# Data[362831,48]
# Data[362831,]
# 
# Data[310369,118]
# Data[310369,48]
# Data[310369,]



# CpG-level calculations follow


# Des has some missing entries; keep the annotated rows as Desfull
Desfull<- Des[!is.na(Des[,3]),]
# Take the matching rows of Data and compute the per-CpG mean
methtstat<- apply(Data[!is.na(Des[,3]),],1,mean,na.rm=TRUE)
# Drop the Desfull rows whose mean is still NA
Desfull<- Desfull[!is.na(methtstat),]
# Drop the NA means from methtstat as well
methtstatresult<- methtstat[!is.na(methtstat)]
grCpG <- GRanges(seqnames = paste("chr",Desfull[,3],sep = ""),
              ranges = IRanges(start = as.integer(Desfull[,4]), width = 1))
grCpG$value <- methtstatresult


library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
trans<- trans[trans$seqnames %in% c("chrX","chrY",paste("chr",1:22,sep="")),]
grTSS <- GRanges(seqnames = trans$seqnames,
                 ranges = IRanges(start = trans$start-5000,end = trans$start+5000))

grCpG
grTSS
hitObj<- findOverlaps(grTSS,grCpG)
CpGRelativeSite <- c()
CpGRelativeMeth <- c()
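for(i in 1:length(hitObj)){
  # genomic position of the overlapping CpG
  CpGsite <- grCpG[hitObj[i]@to]@ranges@start
  # recover the TSS position (midpoint of the +/-5000 bp window)
  TSSsite<- mean(grTSS[hitObj[i]@from]@ranges)
  # relative position of the CpG (TSS minus CpG), stored for later
  CpGRelativeSite <- c(CpGRelativeSite,TSSsite - CpGsite)
  # mean methylation level of that CpG
  CpGRelativeMeth <- c(CpGRelativeMeth,grCpG[hitObj[i]@to]$value)
}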
save.image(file = "diffresult.Rdata")

load("diffresult.Rdata")

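# Per-hit table of relative position and methylation, built from the vectors computed above
CpGhits <- data.frame(CpGRelativeSite,CpGRelativeMeth)
# Mean methylation at every relative position from -5000 to +5000
CpGResult <- c()
for(i in -5000:5000){
  CpGResult <- c(CpGResult,mean(CpGhits[CpGhits$CpGRelativeSite==i,"CpGRelativeMeth"]))
}
CpGframe <- data.frame(-5000:5000,CpGResult)
colnames(CpGframe) <- c("CpGRelativeSite","MethState")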
ggplot(CpGframe, aes(x=CpGRelativeSite, y=MethState))+
  #geom_point()+
  geom_smooth()
ggsave(filename="TSS附近甲基化程度.pdf",width = 12,height=8)





library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
grTSS <- GRanges(seqnames = trans$seqnames,
                 ranges = IRanges(start = trans$start-5000,end = trans$start+5000))


genesall<- genes(txdb_hg19)
SERPIND1<- genesall[genesall$gene_id=="3053"]
SERPIND1promoter<- promoters(SERPIND1,upstream = 1500,downstream = 0)
SERPIND1promoter
SERPIND1hit<- findOverlaps(SERPIND1promoter,grCpG)
SERPIND1hit@to
grCpG[SERPIND1hit@to]
# Take the matching rows of Data
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]

NUAK1<- genesall[genesall$gene_id=="9891"]
NUAK1promoter<- promoters(NUAK1,upstream = 1500,downstream = 0)
NUAK1promoter

NUAK1hit<- findOverlaps(NUAK1promoter,grCpG)
NUAK1hit@to
grCpG[NUAK1hit@to]
# Take the matching rows of Data
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]


apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalNUAK1<- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"]))
tumorNUAK1 <- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"]))
normalNUAK1$sampletype <- "Normal"
normalNUAK1$site <- 1:7
tumorNUAK1$sampletype <- "Tumor"
tumorNUAK1$site <- 1:7
frame1<- data.frame(normalNUAK1$NA..1,normalNUAK1$sampletype,normalNUAK1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorNUAK1$NA..1,tumorNUAK1$sampletype,tumorNUAK1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) + 
  scale_x_discrete(limits=1:7)+
  geom_point(position="jitter")
ggsave(filename="TargetCpGNUAK1.pdf",width = 8,height=6)


apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalSERPIND1<- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"]))
tumorSERPIND1 <- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"]))
normalSERPIND1$sampletype <- "Normal"
normalSERPIND1$site <- 1:2
tumorSERPIND1$sampletype <- "Tumor"
tumorSERPIND1$site <- 1:2
frame1<- data.frame(normalSERPIND1$NA..1,normalSERPIND1$sampletype,normalSERPIND1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorSERPIND1$NA..1,tumorSERPIND1$sampletype,tumorSERPIND1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) + 
  scale_x_discrete(limits=1:2)+
  geom_point(position="jitter")
ggsave(filename="TargetCpGSERPIND1.pdf",width = 8,height=6)

library("BSgenome.Hsapiens.UCSC.hg19")
library(seqinr)
genome <- BSgenome.Hsapiens.UCSC.hg19
grCpG[NUAK1hit@to]
promoterSeq<- getSeq(genome,NUAK1promoter)
write.fasta(promoterSeq$`9891`,"NUAK1promoter","promoterSeq.txt")


normalNUAK1$NA..1
tumorNUAK1$NA..1

dim(Datafull)
dim(Desfull)
# Tried imputing missing values with the mice package; abandoned because the data set is too large
# library(mice)
# library(VIM)
# #md.pattern(Data)
# tmp <- Data[,1:3]
# aggr_plot <- aggr(tmp, col = c('navyblue', 'red'), numbers=TRUE, sortVars=TRUE, 
#                   labels=names(tmp),cex.axis=.7, gap=2, 
#                   ylab=c("Histogram of missing data", "Pattern"))
# tempData <- mice(Data,m=1,maxit=50,meth='pmm',seed=500)
# 
# Tried a t-test; abandoned because of missing values
# Ttest<- t.test(Data[1,1:3],Data[1,c(48,118)])
# 
# Data[1,118]
# Data[1,48]
# tmp<- Data[1,]
# myfun <-function(x){
#   Ttest<- t.test(x[c(1:47,49:117,119:154)],x[c(48,118)])
#   return(Ttest$p.value)
# }
# myfun(Data[1,])

# result<- apply(Data,1,myfun)
# group <- factor(TCGASampletpyeResult,levels=c("Normal","Tumor"))
# design <- model.matrix(~-1+group)
# design
# fit.reduced <- lmFit(Data,design)
# fit.reduced <- eBayes(fit.reduced)
# summary(decideTests(fit.reduced))
# 
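# Tried the missMethyl package; the design matrix was set up incorrectly, so the differential regions were misidentified and the differences were not clear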
# top<-topTable(fit.reduced,coef=1)
# top
# cpgs <- as.integer(rownames(top))
# Data[cpgs[10],]
# Data[cpgs[10],48]
# Data[cpgs[10],118]
# 
# par(mfrow=c(2,2))
# for(i in 1:4){
#   stripchart(Data[rownames(Data)==cpgs[i],]~design[,4],method="jitter",
#              group.names=c("Normal","Tumor"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
#              vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
#   title(cpgs[i],cex.main=1.5)
# }


Downloaded.R

setwd("G:/AllShare/SkillTrainHomework/TCGA-Assembler.2.0.5/TCGA-Assembler")
#' Load functions
source("Module_A.R")
source("Module_B.R")
setwd("G:/AllShare/SkillTrainHomework")
#' choose a cancer type
#' See https://tcga-data.nci.nih.gov/docs/publications/tcga/
sCancer <- "GBM"
sPath1 <- "./DownloadedData"
sPath2 <- "./BasicDataProcessingResult"
sPath3 <- "./AdvancedDataProcessingResult"
# Download RNA-Seq data
path_geneExp <- DownloadRNASeqData(cancerType = sCancer,
                        assayPlatform = "gene.normalized_RNAseq",
                        saveFolderName = sPath1)
#' Download DNA methylation 450 data
#' Methylation data measured on the Illumina methylation array
path_methylation_450 <- DownloadMethylationData(cancerType = sCancer,
                        assayPlatform = "methylation_450",
                        saveFolderName = sPath1)
# Format of the downloaded sample names:
# TCGA-3C-AAAU-01A-11R-A41B-07
# The first three fields (TCGA-3C-AAAU) are the patient ID
# The fourth field (01A) is the sample type: tumor types range from 01 - 09,
# normal types from 10 - 19 and control samples from 20 - 29.
# The fifth field: order of portion in a sequence of 100 - 120 mg sample portions
# See https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

#' Process gene expression data
#' Process the downloaded RNA-seq data and clean up the gene names
list_geneExp <-
  ProcessRNASeqData(inputFilePath = path_geneExp[1],
                    outputFileName = paste(sCancer,
                                           "geneExp",
                                           sep = "__"),
                    dataType = "geneExp",
                    outputFileFolder = sPath2)
# Process the methylation data
list_methylation_450 <-
  ProcessMethylation450Data(inputFilePath =
                    path_methylation_450[1], outputFileName = paste(sCancer,
                    "methylation_450", sep = "__"), outputFileFolder = sPath2)

#'Perform advanced data processing using Module B functions
#'Further (advanced) processing of the methylation data
list_methylation_450_OverallAverage <-
  CalculateSingleValueMethylationData(input = list_methylation_450,
                    regionOption = c("TSS1500", "TSS200"), DHSOption = "Both",
                    outputFileName = paste(sCancer, "methylation_450", sep = "__"),
                    outputFileFolder = sPath3,
                    chipAnnotationFile = "./SupportingFiles/MethylationChipAnnotation.rda")
save.image("tmpdata.Rdata")
load("tmpdata.Rdata")


RNAdiff.R

#setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
#load("GBM__geneExp.rda")
setwd("G:/AllShare/SkillTrainHomework")
load("RNAdiff.RData")
DATAFRAME<- data.frame(Data)
TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
  TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
  TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
  if(as.integer(substr(TCGASampletpye[i],1,2))<10){
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
  }else{
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
  }
}
phenotype <- data.frame(TCGASampletpyeResult)
rownames(phenotype) <- colnames(Data)
rownames(Data) <- Des[,2]
library(DESeq2)
Data<- floor(Data)
dds <- DESeqDataSetFromMatrix(countData = Data,
                              colData = phenotype,
                              design = ~ TCGASampletpyeResult)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
resSig <- subset(resOrdered, padj < 0.01)
res001 <- results(dds, alpha=0.01)
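pdf(file = "MA-plot.pdf",width = 15,height = 9)
plotMA(res001, ylim=c(-3,3))
dev.off()
# identify() needs an interactive graphics device, so redraw on screen before clicking points
plotMA(res001, ylim=c(-3,3))
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
res[idx,]
# Genes picked from the streaks: inspect their counts across all samples
Data[c("100131561","677811","414899","677846"),]
Data[c("9271","140893"),]
Data[c("390077","144742"),]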
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="TCGASampletpyeResult", 
                returnData=TRUE)

library("ggplot2")
ggplot(d, aes(x=TCGASampletpyeResult, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))+
  labs(title=paste("EntrezID:",res@rownames[which.min(res$padj)]))+
  theme(plot.title = element_text(hjust = 0.5))
ggsave(filename="p值最小的基因不同表達量.pdf",width = 8,height=6)

ntd <- normTransform(dds)
library("pheatmap")
select <- rownames(resOrdered)[1:20]
df <- as.data.frame(colData(dds)[,"TCGASampletpyeResult"])
pdf(file = "熱圖.pdf",width = 20,height = 9)
pheatmap(assay(ntd)[select,],show_rownames=TRUE)
dev.off()

RnaSigGeneID<- rownames(resSig)
RnaSigGeneName<- Des[Des[,2]%in%RnaSigGeneID,1]

intersect(RnaSigGeneName,signifiGene)

# Negative log2FoldChange: higher in normal; positive log2FoldChange: higher in tumor
# Keep log2FoldChange < -2, i.e. genes with low expression in tumor
resSigdown<- resSig[resSig$log2FoldChange<(-2),]
#dim(resSigdown)
RnaSigGeneIDdown<- rownames(resSigdown)
RnaSigGeneNamedown<- Des[Des[,2]%in%RnaSigGeneIDdown,1]

# Data["2498",c(38:41,76)]
# Data["2498",]
# Data["7153",c(38:41,76)]
# Data["7153",]

# Intersect with the tumor suppressor genes
Human_TSGs <- read.table("Human_TSGs.txt",header = TRUE,sep = "\t")
#TSGene <- read.table("TSGene-LOFdataset.txt",header = TRUE,sep = "\t")
#sig_exp <- read.table("sig_exp.txt",header = TRUE,sep = "\t")
TSGs <- Human_TSGs$GeneSymbol
#TSGs <- TSGene$GeneName
#TSGs <- sig_exp$Symbol
result1<- intersect(RnaSigGeneNamedown,TSGs)
result2<- intersect(result1,signifiGene)# intersection of all three lists
resultmuch <- intersect(RnaSigGeneNamedown,signifiGene)
write.table(resultmuch,"候選靶標基因多多多.txt",quote = FALSE,row.names = FALSE,col.names = FALSE)

#"CCHCR1"%in%signifiGene




# This block needs the TCGASampletpyeResult object built from the methylation data
setwd("G:/AllShare/SkillTrainHomework/AdvancedDataProcessingResult")
load("GBM__methylation_450__TSS1500-TSS200__Both.rda")
MethTssFrame<- data.frame(Data)
rownames(MethTssFrame) <- Des[,1]
targetFrame<- MethTssFrame[result2,]
colnames(targetFrame) <- TCGASampletpyeResult
targetFrame<-na.omit(targetFrame)
library(pheatmap)
pdf(file = "target熱圖.pdf",width = 20,height = 6)
pheatmap(targetFrame,show_rownames=TRUE)
dev.off()

library (VennDiagram)
draw.triple.venn(area1=5, area2=5, area3=5
                 ,n12=3, n23=3, n13=3, n123=3
                 ,category = c('A','B','C'))
pdf("交集.pdf",width = 10,height = 10)
# Avoid the name T, which would mask the built-in shorthand for TRUE
vennPlot<-venn.diagram(list(A=RnaSigGeneNamedown,B=TSGs,C=signifiGene),filename=NULL
                ,lwd=1,lty=2,category = c('RNA-seq down','Tumor suppressor gene','Hypermethylation'))
grid.draw(vennPlot)
dev.off()


#save.image("RNAdiff.RData")


PCAdata<-t(Data)
# Principal component analysis
PCAdata.pr<-prcomp(PCAdata,scale=FALSE)
# Project the samples onto the principal components
PCA_eset<- predict(PCAdata.pr)
colnames(PCA_eset)
pdf(file = "陡坡圖.pdf",width = 14,height = 7)
screeplot(PCAdata.pr,type="lines")
dev.off()

data.hc <- hclust( dist(PCAdata))
pdf(file = "樹狀圖.pdf",width = 22,height = 12)
plot(data.hc, hang = -1)
dev.off()

#plot(PCA_eset[,1:2])

library("ggplot2")
ggplot(NULL, aes(x=PCA_eset[,1], y=PCA_eset[,2], colour=TCGASampletpyeResult)) + 
  geom_point() + 
  guides(color=guide_legend(title=NULL)) +
  labs(x = "PCA1 34.7%",y = "PCA2 15.3%",title = "RNA-seq Principal components analysis")
ggsave(filename="PCA RNA-seq.pdf",width = 8,height=6)


GOterm <- read.table("analysis.txt",header = TRUE,sep = "\t")
library(ggplot2)
ggplot(GOterm,aes(x=GO.biological.process,y=upload_1..over.under.,fill = P.value))+
  geom_bar(stat="identity")+
  labs(x = "GO terms",y = "Fold Enrichment",title = "GO biological process analysis")+
  coord_flip()
ggsave(filename="GO analysis.pdf",width = 8,height=6)


