Reference-based transcriptome analysis with HISAT2, StringTie, and Ballgown

Reference:

Pertea M, Kim D, Pertea G M, et al. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols, 2016, 11(9): 1650.

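1. Quality check of clean reads:

Check the quality of the clean reads with FastQC:
fastqc -o <output_directory> --extract -f fastq *.fq.gz
(-o: directory for the reports, here the directory that holds the reads; --extract: unpack the zipped reports; *.fq.gz: the read files to check)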

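QC results:
Every sample gets FAIL for Kmer Content, and each has at least one WARN among Per Base Sequence Content, Per Tile Sequence Quality, and Sequence Duplication Levels.
In the Per Base Sequence Content plot the four base lines cross one another instead of running parallel, which points to overrepresented sequences.

Guess: in a transcriptome, reads from highly expressed transcripts are flagged as overrepresented sequences, so this is normal.

See: http://www.cnblogs.com/longjianggu/p/5078782.html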
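With --extract, FastQC unpacks one <name>_fastqc/ directory per input file, and each contains a tab-separated summary.txt (status, module, file name). As a rough sketch, the flagged modules can be tabulated across all samples instead of opening every report. The function name list_flagged is a hypothetical helper, not part of FastQC, and the directory argument is assumed to be the one passed to fastqc -o.

```shell
#!/bin/bash
# list_flagged DIR: print "file<TAB>status<TAB>module" for every FastQC
# module that did not PASS, reading DIR/*_fastqc/summary.txt.
# (list_flagged is a hypothetical helper name, not part of FastQC.)
list_flagged() {
    local dir=$1
    local f
    for f in "$dir"/*_fastqc/summary.txt; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # summary.txt columns: status, module, file name
        awk -F'\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$f"
    done
}
```

Usage: list_flagged <output_directory> | sort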

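2. Building the index:

Download the Brassica napus (rapeseed) genome sequence (.fa.gz) and genome annotation (.gff3.gz) from GENOSCOPE: http://www.genoscope.cns.fr/brassicanapus/data/

hisat2-build /genome/GCF_000686985.1_Brassica_napus_assembly_v1.0_genomic.fna brassica_tran
(first argument: the genome FASTA file; second argument: the index basename, which also determines where the index files are written)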

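3. Alignment:

Basic command:

hisat2 -p 16 --dta -x indexes/brassica_tran -1 /data/CleanData/$reads1 -2 /data/CleanData/$reads2 -S /data/hisat2_results/<sample_name>.sam
(-p: number of threads; --dta: report alignments tailored for transcript assemblers such as StringTie; -x: index basename; -1 and -2: the two paired-end read files; -S: output SAM file)

Every sample has to be aligned with this command, so wrap it in a Bash script:

vi alignment (create a new file named alignment and type the following in vi)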

#!/bin/bash
# Run a HISAT2 alignment for each sample.
# /files/samples.txt lists the sample names, one per line.
while read -r line; do
    reads1=${line}_1.fq.gz
    reads2=${line}_2.fq.gz
    hisat2 -p 16 --dta -x /indexes/brassica_tran -1 "/data/CleanData/$reads1" -2 "/data/CleanData/$reads2" -S "/data/hisat2_results/${line}.sam"
done < /files/samples.txt

Press <Esc> and type :wq to save and exit.

chmod +x alignment (make the script executable)

nohup ./alignment & (run the script in the background; its output is written to nohup.out in the current directory)
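Because the loop runs unattended for a long time, it can help to confirm beforehand that every sample in samples.txt has both read files. The following is a minimal sketch under the naming scheme used above (<sample>_1.fq.gz and <sample>_2.fq.gz); check_pairs is a hypothetical helper name, not part of HISAT2.

```shell
#!/bin/bash
# check_pairs SAMPLES READS_DIR: for each sample name listed in SAMPLES
# (one per line), report any <name>_1.fq.gz / <name>_2.fq.gz that is
# missing or empty in READS_DIR. Returns non-zero if any file is flagged.
# (check_pairs is a hypothetical helper, not part of HISAT2.)
check_pairs() {
    local samples=$1 reads_dir=$2
    local missing=0 name mate
    # "|| [ -n "$name" ]" keeps a last line that lacks a trailing newline
    while read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue          # ignore blank lines
        for mate in 1 2; do
            if [ ! -s "$reads_dir/${name}_${mate}.fq.gz" ]; then
                echo "missing or empty: $reads_dir/${name}_${mate}.fq.gz" >&2
                missing=1
            fi
        done
    done < "$samples"
    return "$missing"
}
```

Usage: check_pairs /files/samples.txt /data/CleanData && echo "all read pairs present"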

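4. Sorting and format conversion with samtools

Basic command:

samtools sort -@ 16 -o /data/bamfiles/<sample_name>.bam /data/hisat2_results/<sample_name>.sam
(-@: number of threads; -o: output BAM file; last argument: input SAM file)

Wrapped in a Bash script: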

#!/bin/bash
# Sort each SAM file and write it out as BAM.
while read -r line; do
    samtools sort -@ 16 -o "/data/bamfiles/${line}.bam" "/data/hisat2_results/${line}.sam"
done < /files/samples.txt

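5. Initial transcript assembly

Basic command:

stringtie -p 16 -G /genes/brassica.gff -o /data/transcripts/<sample_name>.gtf -l <sample_name> /data/bamfiles/<sample_name>.bam
(-G: genome annotation file; -o: output GTF of assembled transcripts; -l: name prefix for the output transcripts; last argument: input BAM file)

Wrapped in a Bash script: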

#!/bin/bash
# StringTie step 1: assemble transcripts for each sample.
while read -r line; do
    stringtie -p 16 -G /genes/brassica.gff -o "/data/transcripts/${line}.gtf" -l "${line}" "/data/bamfiles/${line}.bam"
done < /files/samples.txt

6. Merge

stringtie --merge -p 16 -G /genes/brassica.gff -o stringtie_merged.gtf /data/transcripts/mergelist.txt
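stringtie --merge reads the per-sample GTF paths from a plain text list, one path per line. A minimal sketch of building mergelist.txt from the same samples.txt, assuming the per-sample GTFs were written to /data/transcripts/<sample>.gtf as in step 5; make_mergelist is a hypothetical helper name, not part of StringTie.

```shell
#!/bin/bash
# make_mergelist SAMPLES GTF_DIR: print one GTF path per sample name
# listed in SAMPLES (one name per line).
# (make_mergelist is a hypothetical helper, not part of StringTie.)
make_mergelist() {
    local samples=$1 gtf_dir=$2 name
    # "|| [ -n "$name" ]" keeps a last line that lacks a trailing newline
    while read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue      # ignore blank lines
        printf '%s/%s.gtf\n' "$gtf_dir" "$name"
    done < "$samples"
}
```

Usage: make_mergelist /files/samples.txt /data/transcripts > /data/transcripts/mergelist.txt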

Check the assembly against the annotation with gffcompare:

gffcompare -r /genes/brassica.gff -G -o merged stringtie_merged.gtf

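7. Estimating expression and writing Ballgown input

Basic command:

stringtie -e -B -p 16 -G /data/transcripts/stringtie_merged.gtf -o /data/ballgown/<sample_name>/<sample_name>.gtf /data/bamfiles/<sample_name>.bam
(-G: the merged GTF from step 6, used in place of the genome annotation; -e: estimate expression only for the transcripts given with -G; -B: write the table files that Ballgown takes as input, next to the output GTF; last argument: input BAM file)

Wrapped in a Bash script: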

#!/bin/bash
# StringTie step 3: estimate abundances and write Ballgown tables.
while read -r line; do
    stringtie -e -B -p 16 -G /data/transcripts/stringtie_merged.gtf -o "/data/ballgown/${line}/${line}.gtf" "/data/bamfiles/${line}.bam"
done < /files/samples.txt
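With -B, StringTie writes five tables (e2t.ctab, e_data.ctab, i2t.ctab, i_data.ctab, t_data.ctab) into the same directory as the output GTF, and Ballgown expects one such directory per sample. As a sketch, the following checks that every sample directory is complete before moving to R; check_ctabs is a hypothetical helper name, not part of StringTie or Ballgown.

```shell
#!/bin/bash
# check_ctabs SAMPLES BALLGOWN_DIR: verify that each sample directory holds
# the five .ctab tables that `stringtie -B` writes next to the output GTF.
# Returns non-zero if any table is missing.
# (check_ctabs is a hypothetical helper, not part of StringTie or Ballgown.)
check_ctabs() {
    local samples=$1 bg_dir=$2
    local status=0 name tab
    while read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue      # ignore blank lines
        for tab in e2t e_data i2t i_data t_data; do
            if [ ! -f "$bg_dir/$name/$tab.ctab" ]; then
                echo "missing: $bg_dir/$name/$tab.ctab" >&2
                status=1
            fi
        done
    done < "$samples"
    return "$status"
}
```

Usage: check_ctabs /files/samples.txt /data/ballgown && echo "Ballgown input complete"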

8. Differential expression analysis with Ballgown (in R)

>install.packages('BiocManager')

>BiocManager::install(c('ballgown','genefilter','dplyr','devtools'))

>library(ballgown)

>library(genefilter)

>library(dplyr)

>library(devtools)

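>file <- ballgown(dataDir='/data/ballgown',samplePattern='<sample name prefix>')  # load the expression data

>sampleNames(file)  # list the samples in the order they are stored in the ballgown object

>pData(file) <- read.csv('/data/geuvadis_phenodata.csv')  # import the hand-made phenotype table; its rows must be in the same order as sampleNames(file)

>filefilt <- subset(file,'rowVars(texpr(file))>1',genomesubset=TRUE)  # drop transcripts whose expression variance across samples is 1 or less

>diff_genes <- stattest(filefilt,feature='gene',covariate='<variable of interest>',adjustvars=c('<confounding variable>'),meas='FPKM')  # test for differentially expressed genes; both placeholders are column names in pData(file)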

Sort the genes by p-value in ascending order and write them to a CSV file:

>diff_genes <- arrange(diff_genes,pval)

>write.csv(diff_genes,'/data/diff_genes',row.names=FALSE)
