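Previously we explored how to handle V4 PE150 data whose read pairs cannot be merged: the forward and reverse reads were each truncated to 120 bp according to their quality, then joined directly with a 10-N spacer using the dada2 R package to produce an ASV table, and taxonomy was assigned with both dada2 and qiime2. That essentially completed the simplest possible analysis workflow. Here, we use the popular phyloseq package to run a diversity analysis.
To be honest, I had never used R for 16S data analysis before; R is generally considered slower, and I was not familiar with it. After trying it once, it feels pretty good. Because there is only one sample, some arguments raised errors, so I changed the code (mostly by simply deleting those arguments), and in the end I did get plots out of it. The detailed steps follow:
1. Install and load packages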
BiocManager::install("phyloseq", version = "3.10")
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
library(ggplot2); packageVersion("ggplot2")
theme_set(theme_bw())
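2. Prepare the sample metadata table
Since there is only one sample, I did the simplest possible thing here. The only goal was to get the code to run without errors, so parts of it may not be sensible; corrections are welcome.
samples.out <- "NULL" # placeholder sample name; the multi-sample version was rownames(seqtab.nochim)
subject <- "join" # placeholder; the multi-sample version was sapply(strsplit(samples.out, "D"), `[`, 1)
gender <- "man" # placeholder, not used below; the multi-sample version was substr(subject,1,1)
#subject <- substr(subject,2,999) # disabled: meant to strip a leading gender letter, but would turn "join" into "oin"
day <- "2019" # placeholder, not used below; the multi-sample version was as.integer(sapply(strsplit(samples.out, "D"), `[`, 2))
samdf <- data.frame(Subject=subject) #, Gender=gender, Day=day)
samdf$When <- "Early"
samdf$When[samdf$Day>100] <- "Late" # has no effect here, because samdf has no Day column
rownames(samdf) <- samples.out
# sample_names(): the error message suggested running this; it is not needed once the lines above have been run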
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
               sample_data(samdf),
               tax_table(taxa))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
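The sample_names() error mentioned above comes up when the sample names of the abundance table and the metadata table have nothing in common; as far as I understand, phyloseq() keeps only the samples present in every component. A minimal base-R sketch of that check, using hypothetical toy stand-ins (seqtab_toy, samdf_toy and the name "S1" are made up for illustration, not taken from the real data):

```r
# Toy stand-ins (hypothetical values) for seqtab.nochim and samdf
seqtab_toy <- matrix(c(10L, 5L, 0L), nrow = 1,
                     dimnames = list("S1", c("ACGT", "ACGA", "ACGG")))
samdf_toy <- data.frame(Subject = "join", When = "Early",
                        row.names = "S1", stringsAsFactors = FALSE)

# Sample names found in both tables
shared <- intersect(rownames(seqtab_toy), rownames(samdf_toy))
# TRUE only when both tables name exactly the same samples
names_match <- setequal(rownames(seqtab_toy), rownames(samdf_toy))

print(shared)
print(names_match)
```

If shared came back empty, the row names of one of the two tables would need fixing before building the phyloseq object.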
3. OTU table processing, diversity, and bar plots
dna <- Biostrings::DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
ps
# Output:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 348 taxa and 1 samples ]
sample_data() Sample Data: [ 1 samples by 1 sample variables ]
tax_table() Taxonomy Table: [ 348 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 348 reference sequences ]
# Richness (alpha diversity) plot
plot_richness(ps, measures=c("Shannon", "Simpson"))#, color="When")
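# What the two measures mean can be checked by hand. The sketch below is base R only and uses
# hypothetical counts; it assumes the usual definitions (Shannon with natural log, Simpson as
# 1 - sum(p^2)), which, as I understand it, are what phyloseq obtains from vegan.

```r
# Hypothetical ASV counts for a single sample
counts <- c(ASV1 = 50, ASV2 = 30, ASV3 = 20)

p <- counts / sum(counts)    # relative abundances
p <- p[p > 0]                # drop zeros so log() is defined

shannon <- -sum(p * log(p))  # Shannon index, natural log
simpson <- 1 - sum(p^2)      # Simpson index in the 1 - D form

print(c(Shannon = shannon, Simpson = simpson))
```

# With a single sample, each measure is just one number, so the plot shows one point per panel.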
# Transform data to proportions as appropriate for Bray-Curtis distances
# The next two lines are for beta diversity; with only one sample, they were not run
#ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))
#ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")
# Plot the 20 most abundant ASVs, colored by family
top20 <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 <- prune_taxa(top20, ps.top20)
plot_bar(ps.top20, fill="Family") #+ facet_wrap(~When, scales="free_x")
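One detail of the step above is the order of operations: the top taxa are picked from raw counts, the proportions are computed over all taxa, and only then is the table pruned. The bars therefore show each taxon's share of the whole sample and need not sum to 1. A small base-R sketch with hypothetical counts (top 2 instead of top 20):

```r
# Hypothetical counts for four ASVs in one sample
counts <- c(ASV1 = 60, ASV2 = 25, ASV3 = 10, ASV4 = 5)

# Pick the most abundant taxa from the raw counts
top2 <- names(sort(counts, decreasing = TRUE))[1:2]

# Convert to proportions over ALL taxa, then keep only the top ones
prop <- counts / sum(counts)
prop_top2 <- prop[top2]

print(prop_top2)
print(sum(prop_top2))  # less than 1: the dropped taxa account for the rest
```

Pruning first and normalizing afterwards would instead rescale the kept taxa to sum to 1, which answers a different question.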
![Richness plot (Shannon, Simpson)](https://github.com/zd200572/mp_kejijizhe/raw/master/blogs/Rplot-shannon.png)
![Bar plot of the top 20 ASVs by family](https://github.com/zd200572/mp_kejijizhe/raw/master/blogs/Rplot-bar.png)
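4. Another taxonomy assignment method
The official documentation describes another taxonomy assignment method, so let's try it here too. It requires downloading a pre-trained reference training set, RDP_v16-mod_March2018.RData. Download page: http://DECIPHER.codes/Downloads.html
# Install and load the package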
BiocManager::install("DECIPHER", version = "3.10")
library(DECIPHER); packageVersion("DECIPHER")
dna <- DNAStringSet(getSequences(seqtab.nochim)) # Create a DNAStringSet from the ASVs
load("~/Biodata/Refrence/RDP_v16-mod_March2018.RData") # CHANGE TO THE PATH OF YOUR TRAINING SET
ids <- IdTaxa(dna, trainingSet, strand="top", processors=NULL, verbose=FALSE) # use all processors
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species") # ranks of interest
# Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy
taxid <- t(sapply(ids, function(x) {
  m <- match(ranks, x$rank)
  taxa <- x$taxon[m]
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
}))
colnames(taxid) <- ranks; rownames(taxid) <- getSequences(seqtab.nochim)
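To see what the conversion above does, here is a base-R sketch that applies the same function to a hypothetical stand-in for the IdTaxa result. ids_toy is a plain list made up for illustration; it only mimics the rank and taxon fields used above and is not real DECIPHER output.

```r
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical stand-ins for two classified sequences
ids_toy <- list(
  seqA = list(rank  = c("rootrank", "domain", "phylum", "class"),
              taxon = c("Root", "Bacteria", "Firmicutes", "unclassified_Firmicutes")),
  seqB = list(rank  = c("rootrank", "domain"),
              taxon = c("Root", "Bacteria"))
)

taxid_toy <- t(sapply(ids_toy, function(x) {
  m <- match(ranks, x$rank)                       # position of each wanted rank, NA if absent
  taxa <- x$taxon[m]                              # absent ranks become NA
  taxa[startsWith(taxa, "unclassified_")] <- NA   # treat "unclassified_*" labels as missing
  taxa
}))
colnames(taxid_toy) <- ranks

print(taxid_toy)
```

Each sequence becomes one row with one column per rank, and any rank that was not assigned (or was only "unclassified_...") ends up as NA, which is the same shape as the taxonomy matrix used to build the phyloseq object earlier.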
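Comparing the three annotation methods
At this point the taxonomy assignment methods are essentially all done, so next let's compare how the three of them differ.
The explorations carried out so far are: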