Using preseq to compute library complexity and estimate additional sequencing

Original post

2020-07-03 09:29

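When evaluating data fresh off the sequencer, if the data cannot reach the target coverage after duplicate removal, more sequencing is needed. However, some libraries have very low complexity, and even a large amount of additional sequencing will not yield more useful information. So how do you assess library complexity and decide whether additional sequencing is worthwhile?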

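The preseq software can estimate, from the sequencing data you already have, both the complexity of the sequenced data and the complexity of the whole library. Its c_curve subcommand conveniently computes the relationship between the total amount sequenced (total reads) and the amount of useful data (distinct reads) in the existing data. The workflow is as follows: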

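Download and install preseq
http://smithlabresearch.org/software/preseq/
The installation guide says you need to run make all after unpacking, but in my case the unpacked package could be run directly. For it to run properly, samtools must already be set up in your environment.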

Computing complexity with preseq

preseq c_curve -P -B inbam > out.c_curve.txt

Here -P indicates paired-end data and -B indicates that the input is a BAM file (a BED file can also be used as input, but I have not tested that).
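To make the meaning of the two output columns concrete, here is a minimal Python sketch of what a complexity curve represents: take growing random subsamples of the reads and count how many distinct ones each subsample contains. This is only an illustration of the quantity, not preseq's implementation. The function name `complexity_curve`, the `step` parameter and the toy data (integers standing in for fragment identities) are all assumptions made for the example.

```python
import random


def complexity_curve(read_keys, step, seed=0):
    """Return (total_reads, distinct_reads) pairs for growing random subsamples.

    read_keys: one hashable key per read; duplicate reads share the same key.
    step: record a point every `step` reads (and at the very end).
    """
    rng = random.Random(seed)
    shuffled = list(read_keys)
    rng.shuffle(shuffled)          # a random order emulates random subsampling
    seen = set()
    curve = [(0, 0)]
    for n, key in enumerate(shuffled, start=1):
        seen.add(key)
        if n % step == 0 or n == len(shuffled):
            curve.append((n, len(seen)))
    return curve


# Toy data (hypothetical): 1000 reads drawn from only 200 distinct fragments,
# i.e. a low-complexity library.
toy_rng = random.Random(1)
toy_reads = [toy_rng.randrange(200) for _ in range(1000)]

for total, distinct in complexity_curve(toy_reads, step=250):
    print(total, distinct)
```

On a low-complexity library such as this toy one, the distinct count flattens out quickly while total reads keep growing, which is the pattern to look for in the real c_curve output.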

If you have multiple files, they can be merged with the following code:

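#!/usr/bin/perl
# 2017-12-17
# zxcippo
# Merge several c_curve output files into one table keyed by total reads.
# Pass the c_curve files as arguments and redirect stdout to the merged file.
use strict;
use warnings;

my $header = "gene";    # first column holds total reads; it is renamed later in R
my %total;

foreach my $i (0 .. $#ARGV) {
        $header .= "\t$ARGV[$i]";
        open(my $in, '<', $ARGV[$i]) or die "Cannot open $ARGV[$i]: $!";
        while (<$in>) {
                chomp;
                my @tmp = split;
                # skip blank lines and any non-numeric header line
                next unless @tmp >= 2 && $tmp[0] =~ /^[\d.eE+-]+$/;
                $total{$tmp[0]}[$i] = $tmp[1];
        }
        close $in;
}

print "$header\n";
# sort rows numerically by total reads so the output order is stable
foreach my $key (sort { $a <=> $b } keys %total) {
        foreach my $i (0 .. $#ARGV) {
                # test with defined() so that a genuine 0 is not turned into NA
                $total{$key}[$i] = "NA" unless defined $total{$key}[$i];
        }
        print $key, "\t", join("\t", @{ $total{$key} }), "\n";
}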

Then plot the result with ggplot. Reference code:

library(ggplot2)
library(reshape2)
library(stringr)
options(stringsAsFactors = FALSE)

# read the merged table; the first column ("gene") holds total reads
ccurve <- read.table("./merged.c_curve.txt", header = TRUE)
ccurve_melt <- melt(data = ccurve, id.vars = "gene", variable.name = "sample", value.name = "distinct_reads")
# strip the file suffix from the sample names; the result must be assigned back
ccurve_melt$sample <- str_replace(ccurve_melt$sample, pattern = fixed(".c_curve.txt"), replacement = "")
colnames(ccurve_melt)[1] <- "total_reads"

# all samples in one panel, coloured by sample
ggplot(data = ccurve_melt) + geom_line(aes(x = total_reads, y = distinct_reads, group = sample, color = sample))
dev.off()

# one panel per sample; the dashed y = x line marks the ideal case with no duplicates
p <- ggplot(data = ccurve_melt) +
        geom_line(aes(x = total_reads, y = distinct_reads, group = sample)) +
        geom_abline(intercept = 0, slope = 1, lty = 2) +
        ylim(0, 12e07) + xlim(0, 12e07) +
        theme(axis.text.x = element_text(angle = 270)) +
        facet_wrap(~sample, nrow = 4)
ggsave(plot = p, filename = "estimate_complexity_single_mini_abline.pdf", device = "pdf", units = "cm", width = 25, height = 25)
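Besides eyeballing how far each curve bends away from the dashed y = x line, you can put a rough number on it. The Python sketch below takes the slope of the last segment of a curve, i.e. how many new distinct reads each additional read is currently contributing at the deepest sampled point. The function name `marginal_yield` and both example curves are hypothetical, and where to draw the line between "worth sequencing more" and "not worth it" is a judgment call that this sketch does not settle.

```python
def marginal_yield(curve):
    """Slope of the last segment of a complexity curve.

    curve: list of (total_reads, distinct_reads) pairs sorted by total_reads.
    Returns new distinct reads gained per additional read at the deepest
    sampled point (close to 1.0 = few duplicates, close to 0.0 = saturated).
    """
    if len(curve) < 2:
        raise ValueError("need at least two points")
    (t0, d0), (t1, d1) = curve[-2], curve[-1]
    if t1 <= t0:
        raise ValueError("total_reads must be strictly increasing")
    return (d1 - d0) / (t1 - t0)


# Hypothetical curves for illustration only, not real data.
flat = [(0, 0), (1_000_000, 600_000), (2_000_000, 700_000), (3_000_000, 720_000)]
steep = [(0, 0), (1_000_000, 980_000), (2_000_000, 1_940_000), (3_000_000, 2_880_000)]

print("flat library :", marginal_yield(flat))
print("steep library:", marginal_yield(steep))
```

A value near zero means the curve has flattened and extra reads will mostly be duplicates; a value near one means the library still has plenty of unseen molecules. Keep in mind this only describes the depth already sequenced, not how the curve will behave further out.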
