STAR manual

Source: STARmanual.pdf
Source: Calling variants in RNAseq

PART0 Preparation

    #Tools that must be installed before STAR
    #Red Hat, CentOS, Fedora.
     sudo yum update
     sudo yum install make
     sudo yum install gcc-c++
     sudo yum install glibc-static

PART1 Quick start


#STAR Basic workflow
###1. Generating genome index files
###2. Mapping reads to the genome

#Example
##1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
##(Building the index, using the human genome as an example)
genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir \
--genomeFastaFiles hg19.fa --runThreadN <n>

##2) Alignment jobs were executed as follows:
##(Alignment)
runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq \
 --runThreadN <n>

##3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:
##(Take the splice junctions found by the 1-pass STAR run and build the index again)
genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir \
--genomeFastaFiles hg19.fa \
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab \
--sjdbOverhang 75 --runThreadN <n>

##4) The resulting index is then used to produce the final alignments as follows:
##(Align the reads against the new index to produce the final output files)
runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq \
--runThreadN <n>

PART2 Generating genome index files


 #Basic parameters for building the index
--runThreadN NumberOfThreads (number of threads)

--runMode genomeGenerate (run mode that generates the index)

--genomeDir /path/to/genomeDir (directory where the index is stored)

--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ... (reference genome sequence files, e.g. Ref.fa)

--sjdbGTFfile /path/to/annotations.gtf (reference genome annotation file, e.g. Ref.gtf)
    ### STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is highly recommended whenever they are available. Starting from 2.4.1a, the annotations can also be included on the fly at the mapping step.

--sjdbOverhang ReadLength-1

    ###Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.
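
    ###A minimal sketch of deriving the ideal --sjdbOverhang value, max(ReadLength)-1, from the reads themselves. The helper name max_read_length and the demo reads are made up for illustration, and the input is assumed to be plain 4-line FASTQ records.

```shell
# Hypothetical helper: print the longest read length found in FASTQ data on stdin.
# Assumes standard 4-line FASTQ records (the sequence is line 2 of every 4).
max_read_length() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Made-up demo input: two reads, 8 and 10 bases long.
demo_fastq=$(printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n')

read_len=$(printf '%s\n' "$demo_fastq" | max_read_length)
sjdb_overhang=$((read_len - 1))
echo "--sjdbOverhang $sjdb_overhang"
```

    ###For a real gzipped file, the same helper could be fed with something like zcat mate1.fq.gz | max_read_length.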

    ###Genome files comprise binary genome sequence, suffix arrays, text chromosome names/lengths, splice junction coordinates, and transcripts/genes information.

PART3 mapping jobs


 #Basic parameters for mapping jobs
--runThreadN NumberOfThreads    (number of threads)
--genomeDir /path/to/genomeDir  (directory where the index is stored)
--readFilesIn /path/to/read1 [/path/to/read2]
(RNA-seq FASTQ/FASTA files)
    ###If the read files are compressed, add this parameter:
    #####--readFilesCommand UncompressionCommand
    #####For example, for gzipped files (*.gz):
    --readFilesCommand zcat
or
    --readFilesCommand gunzip -c

    #####For example, for bzip2-compressed files:
    --readFilesCommand bunzip2 -c
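
    ###A small sketch of choosing the decompression option from the file extension, so the same wrapper script can take plain, gzipped or bzip2 reads. The function name read_files_command and the file names are hypothetical; the option values are the ones listed above.

```shell
# Hypothetical helper: print the --readFilesCommand option matching a read file,
# chosen by file extension. Prints an empty line for uncompressed input.
read_files_command() {
  case "$1" in
    *.gz)  echo "--readFilesCommand zcat" ;;
    *.bz2) echo "--readFilesCommand bunzip2 -c" ;;
    *)     echo "" ;;
  esac
}

# Example file names, not real files.
read_files_command sample_1.fq.gz
read_files_command sample_1.fq.bz2
```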

PART4 Output files


#STAR produces several output files. By default they are written to the current working directory, but the output directory and file name prefix can be set as follows:
--outFileNamePrefix /path/to/output/dir/prefix.

#log files
###Log.out/Log.progress.out/Log.final.out

#SAM
###Aligned.out.sam - alignments in standard SAM format.

#STAR can write the alignments in other formats
--outSAMtype BAM Unsorted
#Writes BAM instead of SAM without sorting; the output is Aligned.out.bam
--outSAMtype BAM SortedByCoordinate
#Writes BAM sorted by coordinate; the output is Aligned.sortedByCoord.out.bam, similar to the result of samtools sort
--outSAMtype BAM Unsorted SortedByCoordinate
#Writes both files: the unsorted Aligned.out.bam and the sorted Aligned.sortedByCoord.out.bam
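
#The naming rules above (prefix plus a fixed file name per output type) can be sketched as a lookup, which is handy when a downstream script needs to know which file to pick up. The function name expected_alignment_files and the prefix are hypothetical; this only mirrors the file names listed in these notes and does not call STAR.

```shell
# Hypothetical helper: list the alignment file(s) expected for a given
# --outFileNamePrefix (arg 1) and --outSAMtype value (remaining args).
# The body runs in a subshell so its variables do not leak.
expected_alignment_files() (
  prefix=$1
  shift
  if [ "$1" = "SAM" ]; then
    echo "${prefix}Aligned.out.sam"
  else
    shift   # drop the leading "BAM" keyword
    for sort_mode in "$@"; do
      case "$sort_mode" in
        Unsorted)           echo "${prefix}Aligned.out.bam" ;;
        SortedByCoordinate) echo "${prefix}Aligned.sortedByCoord.out.bam" ;;
      esac
    done
  fi
)

# Example prefix, ending with a dot as in the --outFileNamePrefix line above.
expected_alignment_files /path/to/output/dir/sampleA. BAM Unsorted SortedByCoordinate
```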


#Splice junctions
#SJ.out.tab contains the following 9 columns
column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang
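
#As a sketch of how these columns can be used, the snippet below keeps unannotated junctions (column 6 = 0) with a minimum number of uniquely mapping reads (column 7). The function name novel_junctions, the threshold of 5 and the demo rows are all made up for illustration.

```shell
# Made-up SJ.out.tab rows (tab-separated, 9 columns) for the demo.
sj_demo=$(mktemp)
printf 'chr1\t1000\t2000\t1\t1\t1\t50\t2\t40\n'  > "$sj_demo"
printf 'chr1\t3000\t4000\t2\t2\t0\t12\t0\t35\n' >> "$sj_demo"
printf 'chr2\t5000\t6000\t0\t0\t0\t1\t3\t12\n'  >> "$sj_demo"

# Hypothetical helper: print unannotated junctions from file $1 that are
# supported by at least $2 uniquely mapping reads.
novel_junctions() {
  awk -F '\t' -v min_unique="$2" '$6 == 0 && $7 >= min_unique' "$1"
}

novel_count=$(novel_junctions "$sj_demo" 5 | wc -l | tr -d ' ')
echo "novel junctions with >=5 unique reads: $novel_count"
rm -f "$sj_demo"
```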

STAR-Fusion is a software package for detecting fusion transcripts from STAR chimeric output.


PART5 Parameters related to downstream analysis


With the --quantMode TranscriptomeSAM option, STAR will output alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file (in addition to alignments in genomic coordinates in Aligned.*.sam/bam files).

With the --quantMode GeneCounts option, STAR will count the number of reads per gene while mapping.
The counts coincide with those produced by htseq-count with default parameters.
The output file for this option is ReadsPerGene.out.tab
- column 1: gene ID
- column 2: counts for unstranded RNA-seq
- column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
- column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
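
A minimal sketch of pulling one count column out of ReadsPerGene.out.tab, for example column 4 for a reverse-stranded library. It assumes the file starts with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) before the per-gene rows; the function name gene_counts and all numbers are made up.

```shell
# Made-up ReadsPerGene.out.tab content: 4 summary rows, then one row per gene.
rpg_demo=$(mktemp)
printf 'N_unmapped\t100\t100\t100\n'   > "$rpg_demo"
printf 'N_multimapping\t20\t20\t20\n' >> "$rpg_demo"
printf 'N_noFeature\t5\t60\t8\n'      >> "$rpg_demo"
printf 'N_ambiguous\t3\t1\t2\n'       >> "$rpg_demo"
printf 'geneA\t40\t2\t38\n'           >> "$rpg_demo"
printf 'geneB\t10\t0\t10\n'           >> "$rpg_demo"

# Hypothetical helper: print "geneID<TAB>count" from file $1 using count
# column $2 (2 = unstranded, 3 = -s yes, 4 = -s reverse), skipping the
# assumed 4 summary rows.
gene_counts() {
  awk -F '\t' -v col="$2" 'NR > 4 { print $1 "\t" $col }' "$1"
}

counts=$(gene_counts "$rpg_demo" 4)
printf '%s\n' "$counts"
rm -f "$rpg_demo"
```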

With --quantMode TranscriptomeSAM GeneCounts
both files are produced:
1) Aligned.toTranscriptome.out.bam
2) ReadsPerGene.out.tab


PART6 2-pass mapping


For accurate discovery of novel splice junctions, running STAR in 2-pass mode is strongly recommended.
It does not increase the number of detected novel junctions, but it allows more spliced reads that map to novel junctions to be detected.

The basic idea is:
First, run a 1-pass STAR mapping (basic parameters are enough) and collect the splice junctions;
then run the 2-pass STAR mapping using the junctions collected in the previous step.

#For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.
    ###1. Run 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either at the genome generation step, or at the mapping step.
    ###2. Run 2nd mapping pass for all samples, listing SJ.out.tab files from all samples in --sjdbFileChrStartEnd /path/to/sj1.tab /path/to/sj2.tab ....
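
    ###A sketch of step 2 under an assumed directory layout (one 1st-pass directory per sample, each holding its SJ.out.tab). The sample names, the temporary directory and the thread count are made up; the command is only printed with echo, not run.

```shell
# Hypothetical layout for the demo: <pass1_root>/<sample>/SJ.out.tab
pass1_root=$(mktemp -d)
for sample in s1 s2 s3; do
  mkdir -p "$pass1_root/$sample"
  : > "$pass1_root/$sample/SJ.out.tab"   # empty placeholder files
done

# Collect every 1st-pass junction file into the positional parameters.
set -- "$pass1_root"/*/SJ.out.tab
sj_file_count=$#

threads=4   # example value only

# Print (rather than run) the 2nd-pass command for one sample.
echo STAR --genomeDir /path/to/genomeDir \
     --readFilesIn mate1.fq mate2.fq \
     --sjdbFileChrStartEnd "$@" \
     --runThreadN "$threads"

rm -rf "$pass1_root"
```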
#Per-sample 2-pass mapping.
    ###Annotated junctions will be included in both the 1st and 2nd passes. To run STAR 2-pass mapping for each sample separately, use --twopassMode Basic option. STAR will perform the 1st pass mapping, then it will automatically extract junctions, insert them into the genome index, and, finally, re-map all reads in the 2nd mapping pass. This option can be used with annotations, which can be included either at the run-time (see #1), or at the genome generation step.
    ###For per-sample 2-pass mapping, the recommended option is: --twopassMode Basic
    ### --twopass1readsN defines the number of reads to be mapped in the 1st pass.
    #####The default and most sensitive approach is to set it to -1 (or make it bigger than the number of reads in the sample) in which case all reads in the input read file(s) are used in the 1st pass.
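
    ###A sketch of what a per-sample 2-pass command line might look like, built from the options described above. The function name star_twopass_cmd, the paths and the thread count are placeholders; the function only prints the command so it can be inspected before running.

```shell
# Hypothetical helper: print a per-sample 2-pass STAR command.
# Args: genome dir, mate 1 file, mate 2 file, number of threads.
star_twopass_cmd() {
  echo "STAR --genomeDir $1 --readFilesIn $2 $3" \
       "--twopassMode Basic --twopass1readsN -1 --runThreadN $4"
}

# Placeholder paths and thread count.
star_twopass_cmd /path/to/genomeDir mate1.fq mate2.fq 4
```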

2-pass mapping with re-generated genome

  1. Run 1st pass STAR for all samples with "usual" parameters. Genome indices generated with annotations are recommended.
  2. Collect all junctions detected in the 1st pass by merging SJ.out.tab files from all runs. Filter the junctions by removing likely false positives, e.g. junctions in the mitochondrial genome, or non-canonical junctions supported by only a few reads. If you are using annotations, only novel junctions need to be considered here, since annotated junctions will be re-used in the 2nd pass anyway.
  3. Use the filtered list of junctions from the 1st pass with the --sjdbFileChrStartEnd option, together with annotations (via the --sjdbGTFfile option), to generate the new genome indices for the 2nd pass mapping. This needs to be done only once for all samples.
  4. Run the 2nd pass mapping for all samples with the new genome index.
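
The merge-and-filter part of step 2 can be sketched with awk. Everything specific here is an assumption for illustration: the mitochondrial chromosome is taken to be named chrM, "a few reads" is taken as fewer than 3 uniquely mapping reads, and the threshold is applied per row rather than to counts summed across samples. The demo rows are made up.

```shell
work=$(mktemp -d)

# Made-up 1st-pass junctions for two samples (tab-separated SJ.out.tab rows).
{
  printf 'chr1\t100\t200\t1\t1\t0\t9\t0\t30\n'    # novel, canonical: keep
  printf 'chrM\t10\t90\t1\t1\t0\t50\t0\t30\n'     # mitochondrial: drop
  printf 'chr2\t300\t400\t0\t0\t0\t1\t0\t20\n'    # non-canonical, 1 read: drop
  printf 'chr3\t500\t600\t2\t2\t1\t7\t0\t25\n'    # annotated: drop
} > "$work/sampleA.SJ.out.tab"
{
  printf 'chr1\t100\t200\t1\t1\t0\t4\t0\t28\n'    # duplicate of a kept junction
  printf 'chr2\t700\t800\t0\t0\t0\t6\t0\t33\n'    # non-canonical, 6 reads: keep
} > "$work/sampleB.SJ.out.tab"

# Keep novel, non-mitochondrial junctions; non-canonical ones need min_reads
# unique reads. The first row seen for each chr/start/end/strand is kept.
awk -F '\t' -v mito="chrM" -v min_reads=3 '
  $1 != mito && $6 == 0 && ($5 > 0 || $7 >= min_reads) && !seen[$1 FS $2 FS $3 FS $4]++
' "$work"/*.SJ.out.tab > "$work/SJ.filtered.tab"

kept=$(wc -l < "$work/SJ.filtered.tab" | tr -d ' ')
echo "junctions kept: $kept"
rm -rf "$work"
```

The filtered file keeps the SJ.out.tab layout, so it could be passed to --sjdbFileChrStartEnd in step 3 in the same way as the unfiltered SJ.out.tab in the Quick start example.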