Quantifying transcriptomes and genomes with eXpress

General workflow

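eXpress is a general-purpose abundance estimation tool that can be applied to any set of target sequences and high-throughput sequencing reads. The targets can be any genomic regions, for example the transcripts in an RNA-seq experiment. A typical workflow therefore looks like this:

1. Choose the data you want to analyze

2. Generate the set of target sequences

3. Align the sequenced fragments to the target sequences

4. Pass the fragment alignments to eXpress to estimate the abundance of each target

5. Perform any additional downstream analysis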

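This tutorial uses the following tools:

Gene Expression Omnibus (GEO) - obtaining data
The Short Read Archive (SRA) - obtaining data
SRA Toolkit - extracting FASTQ files from compressed sequencing data
UCSC Genome Browser - obtaining annotations
Bowtie - alignment
Bowtie 2 - alignment
eXpress - target abundance estimation

Other useful tools (not limited to RNA-seq) are listed here.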

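Example: no reference genome and no annotation

Sometimes you will study a species that has no reference genome, or whose reference is of poor quality. This usually means that no transcriptome sequences are available either. The usual next step is then a de novo transcriptome assembly. After assembling, we will align the fragments with Bowtie 2.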

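Getting the data

To fetch a dataset we will use a GEO accession number. If you don't have a GEO accession and just want to browse the data, you can follow this tutorial. For demonstration purposes, we will study the yak transcriptome.

Why yak? Because yaks are naturally odorless. More precisely, it is yak hair that is odorless. To download the data, go to GEO, enter the accession "GSE33300" in the "GEO accession" box, and click "GO".

This takes you to the main experiment page, which lists six samples from different organs. We will look at "GSM823609 brain". Click the sample, then follow the ftp link to download the SRA file. Browsing the directory shows SRR361433.sra. This is paired-end RNA-seq data, which we extract with the following command:

$ fastq-dump --split-3 SRR361433.sra

This produces two files, SRR361433_1.fastq and SRR361433_2.fastq. Note the use of --split-3. You only need this option when the downloaded data is paired-end.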
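Paired-end alignment expects the two files to list mates in the same order, so a quick sanity check is to confirm that both files hold the same number of records. The following is a minimal sketch, assuming standard four-line FASTQ records (no wrapped sequence lines). The function names `count_fastq_records` and `check_pairs` are illustrative and not part of the SRA Toolkit.

```shell
# Count records in a FASTQ file, assuming exactly four lines per record.
count_fastq_records() {
    # awk prints the total line count divided by 4 after reading the file
    awk 'END { print NR / 4 }' "$1"
}

# Compare the record counts of a left/right FASTQ pair.
# Returns 0 when they match, 1 otherwise.
check_pairs() {
    left_n=$(count_fastq_records "$1")
    right_n=$(count_fastq_records "$2")
    if [ "$left_n" = "$right_n" ]; then
        echo "OK: $left_n read pairs"
    else
        echo "MISMATCH: $left_n left vs $right_n right" >&2
        return 1
    fi
}
```

For the files above, this would be called as `check_pairs SRR361433_1.fastq SRR361433_2.fastq`.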

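Annotation through de novo assembly

In this demonstration we take the time to do a fully de novo assembly, without using any genome information. The tool we use is Trinity.
Run Trinity with the following parameters:

$ Trinity.pl --seqType fq --JM 200G --left SRR361433_1.fastq --right SRR361433_2.fastq --CPU 2

Within a few hours, several files appear in trinity_out_dir, including Trinity.fasta, which serves as our new annotation. From this point on, the analysis is much the same as when you have a reference genome (as in the second example below).

The assembly can be downloaded here.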

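Alignment

Building the index

Before doing any alignment, you first need to build a set of index files for the target sequences.

$ cd trinity_out_dir
$ bowtie2-build --offrate 1 Trinity.fasta Trinity

This builds the index in trinity_out_dir with the base name Trinity. When you pass the index to bowtie2, give only the base name, without the file suffixes. The index lets bowtie2 align reads to the targets quickly. --offrate 1 speeds up alignment at the cost of a larger index on disk.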
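Since the index only needs to be built once, it can help to test whether it already exists before rebuilding. The sketch below assumes the usual Bowtie 2 layout of six files per base name (`.1`, `.2`, `.3`, `.4`, `.rev.1`, `.rev.2`, with the suffix `.bt2`, or `.bt2l` for large indexes). The function name `bowtie2_index_complete` is hypothetical.

```shell
# Return 0 if all six Bowtie 2 index files exist for the given base name,
# either with the small-index suffix (.bt2) or the large-index one (.bt2l).
bowtie2_index_complete() {
    base=$1
    for ext in bt2 bt2l; do
        missing=0
        for part in 1 2 3 4 rev.1 rev.2; do
            [ -e "$base.$part.$ext" ] || missing=1
        done
        if [ "$missing" -eq 0 ]; then
            return 0
        fi
    done
    return 1
}
```

One possible use is `bowtie2_index_complete trinity_out_dir/Trinity || echo "index missing, run bowtie2-build"`.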

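Running the alignment

From the directory that contains trinity_out_dir, bowtie2 can be run with a single command:

$ bowtie2 -a -X 800 -p 4 -x trinity_out_dir/Trinity \
    -1 SRR361433_1.fastq -2 SRR361433_2.fastq | samtools view -Sb - > hits.bam

-a - Report all alignments. This is what you want when aligning to a transcriptome, because eXpress resolves the multi-mapping itself. It is not suitable for aligning to a genome (very slow!)
-X 800 - Allow fragment lengths of up to 800 (in practice a limit of about 300 is often enough). This ensures that essentially all RNA-seq fragments are considered
-p 4 - Use 4 CPUs for the alignment. Use as many CPUs as you have available to speed things up.
-x ... - The Bowtie 2 index
-1 ... -2 ... - The left and right reads of the RNA-seq experiment

After tens of minutes to a few hours (depending on how many CPUs you use), you should see console output like this: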

[samopen] SAM header is present: 165714 sequences.
32959992 reads; of these:
  32959992 (100.00%) were paired; of these:
    4385612 (13.31%) aligned concordantly 0 times
    22571641 (68.48%) aligned concordantly exactly 1 time
    6002739 (18.21%) aligned concordantly >1 times
    ----
    4385612 pairs aligned concordantly 0 times; of these:
      352368 (8.03%) aligned discordantly 1 time
    ----
    4033244 pairs aligned 0 times concordantly or discordantly; of these:
      8066488 mates make up the pairs; of these:
        5942057 (73.66%) aligned 0 times
        1356189 (16.81%) aligned exactly 1 time
        768242 (9.52%) aligned >1 times
90.99% overall alignment rate
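When many samples are aligned, it is convenient to pull the final rate out of each summary automatically. A minimal sketch follows, assuming the summary was saved to a file (Bowtie 2 writes it to standard error, so something like `2> bowtie2.log` on the bowtie2 command would capture it). The log file name and the function name `overall_alignment_rate` are illustrative.

```shell
# Print the overall alignment rate (for example "90.99%") from a
# Bowtie 2 alignment summary read on standard input.
overall_alignment_rate() {
    # The rate is the first field of the line that ends the summary
    awk '/overall alignment rate/ { print $1 }'
}
```

With a saved log this would be run as `overall_alignment_rate < bowtie2.log`.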

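Running eXpress

We can now run eXpress. In short, you need the following data:

The target reference sequences (the annotation) in multi-FASTA format
The alignments in BAM format

Running eXpress is very simple:

$ express -o xprs_out trinity_out_dir/Trinity.fasta hits.bam

The -o option creates a new directory, xprs_out, and stores the results in it. The next argument is the multi-FASTA file containing the target sequences. Finally, hits.bam is the file of aligned reads.

You will see output similar to this: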

Attempting to read 'hits.bam' in BAM format...
Parsing BAM header...
Loading target sequences and measuring bias background...
Processing input fragment alignments...
Synchronized parameter tables.
Fragments Processed (hits.bam): 1000000          Number of Bundles: 145443
Synchronized parameter tables.
Fragments Processed (hits.bam): 2000000          Number of Bundles: 144074

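In the end you will get the following files:

results.xprs
params.xprs

On a personal computer using only two CPU cores, eXpress quantified 28,613,101 fragments in under 30 minutes.

Analyzing `results.xprs`

results.xprs is a tab-delimited file with the following columns: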

bundle_id - The bundle that the particular target belongs to
target_id - The name of the target from the multi-FASTA annotation
length - The true length of the target sequence
effective length - The length adjusted for biases
total counts - The number of fragments that mapped to this target
unique counts - The number of unique fragments that mapped to this target
estimated counts - The number of counts estimated by the online EM algorithm
effective counts - The number of fragments adjusted for biases: est_counts * length / effective_length
alpha - Beta-Binomial alpha parameter
beta - Beta-binomial beta parameter
FPKM - Fragments Per Kilobase per Million Mapped. Proportional to (estimated counts) / (effective length)
FPKM low
FPKM high
solveable flag - Whether or not the likelihood has a unique solution

You can import this results file into tools such as Excel. You will probably want to sort it by FPKM.

The results file can be downloaded here.
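Sorting by FPKM can also be done on the command line without a spreadsheet. The sketch below looks the column up by its header name rather than by position. It assumes the header of your results.xprs names the column `fpkm` (check the first line of your file), and it relies on `sort -g`, a GNU/BSD extension that understands scientific notation. The function name `sort_by_column` is hypothetical.

```shell
# Print the header of a tab-delimited table, followed by its rows sorted in
# descending numeric order by the named column.
# Usage: sort_by_column FILE COLUMN_NAME
sort_by_column() {
    file=$1
    name=$2
    # Find the 1-based index of the requested column in the header line
    col=$(head -n 1 "$file" | tr '\t' '\n' | grep -n -x -F -- "$name" | head -n 1 | cut -d: -f1)
    if [ -z "$col" ]; then
        echo "column '$name' not found in $file" >&2
        return 1
    fi
    head -n 1 "$file"
    # -g compares general numeric values such as 1.5e+02; r reverses the order
    tail -n +2 "$file" | LC_ALL=C sort -t "$(printf '\t')" -k "${col},${col}gr"
}
```

For example, `sort_by_column xprs_out/results.xprs fpkm | head` would show the header and the most highly expressed targets.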

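Second example: analyzing a large sequencing dataset with a known annotation

Here we look at a very deeply sequenced dataset from Peng et al. (published in Nature Biotechnology). There are ~520,000,000 reads in total. We will focus only on the polyA dataset, which contains 402,000,000 reads.

Getting the data

If you already have a dataset of interest, simply enter its accession number in GEO. In this case the authors give an SRA accession in the paper. If you click it, or search SRA for the accession as printed (SRA043767.1), you will not find it. Remove the '.1' and search again, and it turns up.

There are many datasets here, but we only care about the polyA data. Sometimes you can decide which files to download by reading the file descriptions, but not here. To find this information we read the "Supplemental Information" and found the table that describes the data (Supplementary Table 1. Overview of sequencing samples).

According to this table, we are interested in HUMwktTBRAAPE_* and HUMwktTBRBAPE. We first click HUMwktTBRAAPE. Before extracting FASTQ from an SRA download, you need to determine whether it is single-end or paired-end, because the extraction differs. The "Spot description" shows that these reads are paired-end. Click the link after "size" and download SRR324687.sra and SRR325616.sra. The same applies to HUMwktTBRBAPE, whose SRA file is SRR324688.sra.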

SRA stores reads in a compressed format that most tools cannot read directly. They expect the flat FASTQ format instead. Fortunately, FASTQ files are easy to extract with the fastq-dump program from the SRA Toolkit.

$ fastq-dump --split-3 SRR324687.sra
$ fastq-dump --split-3 SRR324688.sra
$ fastq-dump --split-3 SRR325616.sra

Note the use of --split-3. You must use it if you have paired-end data. Once fastq-dump has finished, you should have six files ending in .fastq.

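Generating target sequences from known information

Before doing any quantification, make sure the target sequences are adequate for your analysis. For RNA-seq this usually means a transcriptome annotation in GTF format. Because eXpress works on generic target sequences, the GTF file must be used to generate one FASTA record per transcript.

If you are analyzing a common species such as human, plenty of annotations already exist. For a less-studied species (such as yak), you may need to do a de novo transcriptome assembly yourself.

By far the easiest way to build the target file is to use a known annotation, and it is easier still if your species is available at UCSC. For an annotation that is not available at UCSC, see the appendix first. The steps for obtaining an annotation from UCSC are: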

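Go to http://genome.ucsc.edu
Click "Tables" in the main menu
Select the genome
- e.g. Mammal -> Human -> hg19
Select the annotation
- e.g. choose 'Ensembl', then click "Ensembl Genes"
Change "output format" to "sequence"
Enter a descriptive file name
- e.g. ensembl-hg19.fa.gz
Choose a compressed download to save time
Make sure everything is correct, then click "get output". On the next screen, click "genomic" and then submit. If you don't see these options, something went wrong. Go back and try again!
Important: on the following screen, make sure "Introns" is NOT selected. The other defaults are fine. Click "get sequence" and your download should start.

Decompress the downloaded file and take a quick look at it.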

$ gunzip ensembl-hg19.fa.gz
$ head ensembl-hg19.fa
>hg19_ensGene_ENST00000237247 range=chr1:66999066-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AGTTTGATTCCAGAGCCCCACTCGGCGGACGGAATAGACCTCAGCAGCGG
CGTGGTGAGGACTTAGCTGGGACCTGGAATCGTATCCTCCTGTGTTTTTT
CAGACTCCTTGGAAATTAAGGAATGCAATTCTGCCACCATGATGGAAGGA
TTGAAAAAACGTACAAGGAAGGCCTTTGGAATACGGAAGAAAGAAAAGGA
CACTGATTCTACAGGTTCACCAGATAGAGATGGAATTCAGGGGAAAAAAA
AGACCCAGAAGACTCAGCTTCTCCTCACCTCTTGCTTCTGGCTCAGAGCC
CTCTCGTTAACTCTGTCTCAGAAGAAAAGCAATGGGGCACCAAATGGATT
TTATGCGGAAATTGATTGGGAAAGATATAACTCACCTGAGCTGGATGAAG
AAGGCTACAGCATCAGACCCGAGGAACCCGGCTCTACCAAAGGAAAGCAC

With this multi-FASTA target file, you can now build a Bowtie index.
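Before indexing, a quick summary of the target file can catch a truncated or empty download. The sketch below counts the records and the total number of bases in a FASTA file. The function name `fasta_summary` is illustrative. The target count it reports should match the number of sequences that samtools later reports for the SAM header, since the header lists one entry per reference sequence.

```shell
# Print the number of records and the total sequence length of a FASTA file.
fasta_summary() {
    awk '
        /^>/ { n++; next }                            # header line: one more target
        { gsub(/[ \t\r]/, ""); bases += length($0) }  # sequence line: add its length
        END { printf "%d targets, %d bases\n", n, bases }
    ' "$1"
}
```

For the file above this would be run as `fasta_summary ensembl-hg19.fa`.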

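Alignment

Building the index

With Bowtie:

$ bowtie-build --offrate 1 ensembl-hg19.fa ensembl-hg19

bowtie-build indexes the annotation. Bowtie needs this index to align reads to the targets quickly. --offrate 1 speeds up alignment at the cost of a larger index on disk.

Building this index takes about 25 minutes. The Ensembl annotation is a relatively large one.

Note that you do not need to rebuild the index every time: build it once per annotation and reuse it afterwards.

Running the alignment

Now we run the actual alignment.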

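$ bowtie -aS --chunkmbs 128 -v 3 -X 800 -p 40 /home/lmcb/bowtie_indices/ensembl-hg19/ensembl-hg19 \
    -1 SRR324687_1.fastq,SRR324688_1.fastq,SRR325616_1.fastq \
    -2 SRR324687_2.fastq,SRR324688_2.fastq,SRR325616_2.fastq \
    | samtools view -Sb - > hits.bam

The options differ somewhat from the first example because we are using Bowtie:

-a - Report all alignments. This is what you want when aligning to a transcriptome, because eXpress resolves the multi-mapping itself. It is not suitable for aligning to a genome (very slow!)
-S - Output in SAM format
-v 3 - Allow up to 3 mismatches
-X 800 - Allow fragment lengths of up to 800 (in practice a limit of about 300 is often enough). This ensures that essentially all RNA-seq fragments are considered
-p 40 - Use 40 CPUs for the alignment. The more CPUs, the better.
-1 ... -2 ... - The left and right reads (paired-end FASTQ files) of the RNA-seq experiment

We then pipe (|) the SAM output straight into samtools to produce the BAM file hits.bam, which saves disk space. You can also use the script in the appendix to reduce the amount of typing.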

[samopen] SAM header is present: 181648 sequences.
# reads processed: 421836549
# reads with at least one reported alignment: 184591763 (43.76%)
# reads that failed to align: 237244786 (56.24%)
Reported 242921711 paired-end alignments to 1 output stream(s)

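With 40 CPU cores, this alignment finishes in about 11 hours. You may notice that the alignment rate is rather low. This is not entirely surprising, since the sequenced individual is Chinese and may differ somewhat from the reference genome.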
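The percentage in the Bowtie summary can be recomputed from the two raw counts, which is handy when comparing runs or checking a log. The sketch below parses the summary lines shown above from standard input. The function name `bowtie_alignment_rate` is hypothetical, and the line prefixes are taken from the output in this tutorial.

```shell
# Recompute the alignment rate from a Bowtie (version 1) summary on stdin.
bowtie_alignment_rate() {
    awk -F': ' '
        /^# reads processed/ { total = $2 + 0 }
        /^# reads with at least one reported alignment/ { aligned = $2 + 0 }
        END { if (total > 0) printf "%.2f%%\n", 100 * aligned / total }
    '
}
```

Feeding it the summary above reproduces the reported value: 184591763 / 421836549 is about 43.76%.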

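Running eXpress to estimate abundances

Because we used the same file names as in the previous example, the eXpress command is essentially the same.

$ express -o xprs_out /home/lmcb/bowtie_indices/ensembl-hg19/ensembl-hg19.fa hits.bam

The -o option creates a new directory, xprs_out, and stores the results in it. The next argument is the multi-FASTA file containing the target sequences. Finally, hits.bam is the file of aligned reads.

You will see output similar to this: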

Attempting to read 'hits.bam' in BAM format...
Parsing BAM header...
Loading target sequences and measuring bias background...
Processing input fragment alignments...
Synchronized parameter tables.
Fragments Processed (hits.bam): 1000000      Number of Bundles: 135798
Fragments Processed (hits.bam): 2000000      Number of Bundles: 124769
Fragments Processed (hits.bam): 3000000      Number of Bundles: 118656
Fragments Processed (hits.bam): 4000000      Number of Bundles: 114476
....

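In the end you will get the following files:

results.xprs
params.xprs

On a personal computer using only two CPU cores, eXpress finishes in under 120 minutes.

Analyzing `results.xprs`

results.xprs is a tab-delimited file with the following columns: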

bundle_id - The bundle that the particular target belongs to
target_id - The name of the target from the multi-FASTA annotation
length - The true length of the target sequence
effective length - The length adjusted for biases
total counts - The number of fragments that mapped to this target
unique counts - The number of unique fragments that mapped to this target
estimated counts - The number of counts estimated by the online EM algorithm
effective counts - The number of fragments adjusted for biases: est_counts * length / effective_length
alpha - Beta-Binomial alpha parameter
beta - Beta-binomial beta parameter
FPKM - Fragments Per Kilobase per Million Mapped. Proportional to (estimated counts) / (effective length)
FPKM low
FPKM high
solveable flag - Whether or not the likelihood has a unique solution
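The effective counts formula in the list above is simple enough to verify by hand for a single target. A small sketch, with the hypothetical helper name `effective_counts`: a 1000-base target with an effective length of 800 and 100 estimated counts gives 100 * 1000 / 800 = 125 effective counts.

```shell
# effective counts = estimated counts * length / effective length
# Usage: effective_counts EST_COUNTS LENGTH EFFECTIVE_LENGTH
effective_counts() {
    awk -v est="$1" -v len="$2" -v eff="$3" \
        'BEGIN { if (eff > 0) printf "%.2f\n", est * len / eff; else print "NA" }'
}
```

The helper prints NA when the effective length is zero, to avoid dividing by zero.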

You can import this results file into tools such as Excel. You will probably want to sort it by FPKM.

The results file can be downloaded here.

Appendix: a useful script

#!/bin/bash
# Author: Harold Pimentel

if [ $# -eq 0 ]; then
    echo "Usage: bowtie2.sh DIR1 [DIR2 DIR3 ... DIRN]" 1>&2
    exit 1
fi

# TODO: Configure the variables in this block!
BWT2=$(which bowtie2)
IDX=/home/lmcb/bowtie2_indices/ensembl-mm9/ensembl-mm9
N_THR=40

for DIR in "$@"
do
    echo "Aligning reads in $DIR..."

    if [ ! -e "$DIR" ]; then
        echo "ERROR: Couldn't find directory '${DIR}'" 1>&2
        printf '\tskipping....\n' 1>&2
        continue
    fi

    # Build comma-separated lists of the left and right FASTQ files
    LEFT=$(ls "${DIR}"/*_1.fastq 2>/dev/null | sed ':a;N;$!ba;s/\n/,/g')
    RIGHT=$(ls "${DIR}"/*_2.fastq 2>/dev/null | sed ':a;N;$!ba;s/\n/,/g')

    if [ -z "$LEFT" ] || [ -z "$RIGHT" ]; then
        echo "Didn't find matching paired-end reads in $DIR"
        echo "Trying single-end..."

        SINGLE=$(ls "${DIR}"/*_1.fastq 2>/dev/null | sed ':a;N;$!ba;s/\n/,/g')
        if [ -z "$SINGLE" ]; then
            echo "ERROR: Couldn't find any single-end reads either... skipping $DIR"
            continue
        fi
    fi

    if [ -z "$SINGLE" ]; then
        CMD="$BWT2 -a -X 800 -p $N_THR -x $IDX -1 $LEFT -2 $RIGHT | samtools view -Sb - > ${DIR}/hits.bam"
    else
        CMD="$BWT2 -a -X 800 -p $N_THR -x $IDX -U $SINGLE | samtools view -Sb - > ${DIR}/hits.bam"
    fi

    echo "Executing..."
    echo "$CMD"

    # Before you run it, you might want to comment the next line so that you can do a "dry run"
    # eval is needed so that the pipe and redirection stored in $CMD are interpreted by the shell
    eval "$CMD" 2>&1 | tee "${DIR}/bwt2.log"

    unset SINGLE
done

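There is no guarantee that this script will work for you, but it works for me and hopefully you will find it useful. As the comment near the end of the script suggests, you may want to comment out the line that executes $CMD for a first "dry run", so that the script only prints the commands. Once they look right, uncomment the line and run it for real.
If you copy and paste the script into a file called bowtie2.sh, you can make it executable and run it with the following commands:

$ chmod 755 bowtie2.sh
$ ./bowtie2.sh exp1 exp2 exp3

where exp1, exp2 and exp3 are directories containing *.fastq reads. The reads must follow the naming convention the script assumes: *_1.fastq for left reads and *_2.fastq for right reads.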
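Because the script relies on this naming convention, it can be worth checking a directory for left files without a mate before starting a long alignment. The following is a minimal sketch. The function name `find_unpaired` is hypothetical and not part of the script above.

```shell
# List every *_1.fastq file in a directory whose *_2.fastq mate is missing.
# Prints nothing and returns 0 when every left file has a mate.
find_unpaired() {
    dir=$1
    rc=0
    for left in "$dir"/*_1.fastq; do
        [ -e "$left" ] || continue          # the glob matched nothing
        right="${left%_1.fastq}_2.fastq"    # expected name of the mate file
        if [ ! -e "$right" ]; then
            echo "$left"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `find_unpaired exp1` prints the offending files, if any, for the first experiment directory.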

Appendix: generating a multi-FASTA file from a custom annotation

Using the UCSC Genome Browser

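Use this method if:

the annotation you are interested in is not available at UCSC
UCSC does host the reference genome for your species (if it does not, use the local method further below)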

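Go to the UCSC Genome Browser and click "Tables" again. Click "add custom tracks", then "Choose file" to upload your annotation, and click submit.

After a successful upload you are taken to a new page. Click "go to table browser".

Then extract the sequences for your own annotation in the same way as described for the human dataset.

Using a custom reference on your local machine

If you have TopHat installed, it includes a tool called gtf_to_fasta. This tool requires the following files:

The genome sequence in multi-FASTA format
The annotation in GFF/GTF format

Using the tool is quite simple:

$ gtf_to_fasta ensembl-hg19.gtf hg19.fa ensembl-hg19.fa

The arguments are, in order, [annotation.gtf] [genome.fa] [target out]. The resulting target file should look like this: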

$ head ensembl-hg19.fa
>99061 chr10:134210672-134231367 ENST00000305233
ggcggcggcggcggcggcggcggcggcggggcgggggcggcCTGGGACGCGGCGGGAGCA
TGGAGCCGCGCGCCGGCTGCCGGCTGCCGGTGCGGGTGGAGCAGGTCGTCAACGGCGCGC
TGGTGGTCACGGTGAGCTGCGGCGAGCGGAGCTTCGCGGGGATCCTGCTGGACTGCACGA
AAAAGTCCGGCCTCTTTGGCCTACCCCCGTTGGCTCCGCTGCCCCAGGTCGATGAGTCCC
CTGTCAACGACAGCCATGGCCGGGCTCCCGAGGAGGGGGATGCAGAGGTGATGCAGCTGG
GGTCCAGCTCCCCCCCTCCTGCCCGCGGGGTTCAGCCCCCCGAGACCACCCGCCCCGAGC
CACCCCCGCCCCTCGTGCCGCCGCTGCCCGCCGGAAGCCTGCCCCCGTACCCTCCCTACT
TCGAAGGCGCCCCCTTCCCTCACCCGCTGTGGCTCCGGGACACGTACAAGCTGTGGGTGC
CCCAGCCGCCGCCCAGGACCATCAAGCGCACGCGGCGGCGTCTGTCCCGCAACCGCGACC

If you prefer transcript names to headers made of an index and a position, you can process the file with an awk command:

$ awk '{if ($1 ~ /^>/) print ">"$3; else print $0}' ensembl-hg19.fa > ensembl-hg19_fixed_names.fa
$ head ensembl-hg19_fixed_names.fa
>ENST00000305233
ggcggcggcggcggcggcggcggcggcggggcgggggcggcCTGGGACGCGGCGGGAGCA
TGGAGCCGCGCGCCGGCTGCCGGCTGCCGGTGCGGGTGGAGCAGGTCGTCAACGGCGCGC
TGGTGGTCACGGTGAGCTGCGGCGAGCGGAGCTTCGCGGGGATCCTGCTGGACTGCACGA
AAAAGTCCGGCCTCTTTGGCCTACCCCCGTTGGCTCCGCTGCCCCAGGTCGATGAGTCCC
CTGTCAACGACAGCCATGGCCGGGCTCCCGAGGAGGGGGATGCAGAGGTGATGCAGCTGG
GGTCCAGCTCCCCCCCTCCTGCCCGCGGGGTTCAGCCCCCCGAGACCACCCGCCCCGAGC
CACCCCCGCCCCTCGTGCCGCCGCTGCCCGCCGGAAGCCTGCCCCCGTACCCTCCCTACT
TCGAAGGCGCCCCCTTCCCTCACCCGCTGTGGCTCCGGGACACGTACAAGCTGTGGGTGC
CCCAGCCGCCGCCCAGGACCATCAAGCGCACGCGGCGGCGTCTGTCCCGCAACCGCGACC

All better!
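After rewriting the headers it is worth confirming that every target name is still unique, since targets are identified by name in the alignments and duplicates would be ambiguous. A minimal sketch follows. The function name `duplicate_fasta_names` is illustrative.

```shell
# Print every FASTA record name (the first word of the header, without ">")
# that occurs more than once. No output means all names are unique.
duplicate_fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort | uniq -d
}
```

For the renamed file this would be run as `duplicate_fasta_names ensembl-hg19_fixed_names.fa`.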

Posted by biocq
