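To make sequencing data easier to publish and share, high-throughput sequencing data are recorded in the FASTQ format, which stores the base calls of each read together with their quality scores. As shown in the figure below, FASTQ is organized by read, and each read occupies four lines. Lines 1 and 3 consist of a marker character followed by the read name (ID): line 1 begins with "@" and line 3 begins with "+". The ID on line 3 may be omitted, but the "+" may not. Line 2 holds the base sequence, and line 4 holds the corresponding sequencing quality scores.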
FASTQ data format
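The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name `read_fastq` is made up for illustration, and it assumes each record occupies exactly four lines (no wrapped sequences).

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 of a record must start with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 of a record must start with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lines differ in length")
        # Drop the leading '@'; the ID after '+' on line 3 is optional, so it is ignored.
        yield header[1:], seq, qual


# Illustrative input: one made-up record.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHHG\n")
records = list(read_fastq(example))
```

Checking that the sequence and quality lines have equal length is a cheap way to catch truncated files early.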
1. Sequence name:
Every FASTQ record has a name that uniquely identifies it, for example:
@HWUSI-EAS100R:6:73:941:1973#0/1

Field | Meaning |
---|---|
HWUSI-EAS100R | the unique instrument name |
6 | flowcell lane |
73 | tile number within the flowcell lane |
941 | 'x'-coordinate of the cluster within the tile |
1973 | 'y'-coordinate of the cluster within the tile |
#0 | index number for a multiplexed sample (0 for no indexing) |
/1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
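As a rough illustration, the fields of this older read-name layout can be pulled apart with a regular expression. The pattern and the function name `parse_old_illumina_name` are assumptions made for this sketch. It accepts either `#0` or a multiplex tag such as `#NNNNNN` after the coordinates, and treats both the index and the pair suffix as optional.

```python
import re

# Hypothetical pattern for names like @HWUSI-EAS100R:6:73:941:1973#0/1
OLD_ILLUMINA_NAME = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/]+))?(?:/(?P<pair>[12]))?$"
)


def parse_old_illumina_name(name):
    """Split a pre-Casava-1.8 read name into its named fields."""
    match = OLD_ILLUMINA_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized read name: {name!r}")
    fields = match.groupdict()
    # Lane, tile and cluster coordinates are numeric; index and pair stay as strings.
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    return fields


old_fields = parse_old_illumina_name("@HWUSI-EAS100R:6:73:941:1973#0/1")
```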
With Casava 1.8 the format of the '@' line has changed:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Field | Meaning |
---|---|
EAS139 | the unique instrument name |
136 | the run id |
FC706VJ | the flowcell id |
2 | flowcell lane |
2104 | tile number within the flowcell lane |
15343 | 'x'-coordinate of the cluster within the tile |
197393 | 'y'-coordinate of the cluster within the tile |
1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
Y | Y if the read fails filter (read is bad), N otherwise |
18 | 0 when none of the control bits are on, otherwise it is an even number |
ATCACG | index sequence |
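The Casava 1.8 layout has two colon-separated groups divided by a space, so it can be split without a regular expression. The sketch below assumes exactly seven fields in the first group and four in the second, as in the example above. The function name `parse_casava18_header` and the dictionary keys are hypothetical.

```python
def parse_casava18_header(header):
    """Split a Casava 1.8 '@' line into a dictionary of named fields."""
    if header.startswith("@"):
        header = header[1:]
    location, _, description = header.partition(" ")

    loc = location.split(":")
    if len(loc) != 7:
        raise ValueError("expected 7 colon-separated fields before the space")
    instrument, run_id, flowcell_id, lane, tile, x, y = loc
    fields = {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell_id": flowcell_id,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }

    if description:
        desc = description.split(":")
        if len(desc) != 4:
            raise ValueError("expected 4 colon-separated fields after the space")
        pair, filtered, control_bits, index = desc
        fields.update(
            pair=int(pair),
            failed_filter=(filtered == "Y"),  # Y means the read failed the filter
            control_bits=int(control_bits),
            index=index,
        )
    return fields


casava_fields = parse_casava18_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
```

Storing the filter flag as a boolean makes it easy to drop failed reads later, for example with `if not fields["failed_filter"]`.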
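2. Quality values: every base in a read has a corresponding sequencing quality value.
Traditional sequencing quality values are based on Phred quality scores, defined as follows: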
Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilities P.
$$Q = -10 \log_{10} P$$

Phred quality scores are logarithmically linked to error probabilities:
Phred Quality Score | Probability of incorrect base call | Base call accuracy |
---|---|---|
10 | 1 in 10 | 90 % |
20 | 1 in 100 | 99 % |
30 | 1 in 1000 | 99.9 % |
40 | 1 in 10000 | 99.99 % |
50 | 1 in 100000 | 99.999 % |
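The rows of the table follow directly from the definition of Q. A small sketch of the conversion in both directions, with function names chosen only for illustration:

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Rebuild the table rows: (Q, error probability, base call accuracy).
table = [(q, phred_to_error_prob(q), 1 - phred_to_error_prob(q)) for q in (10, 20, 30, 40, 50)]
```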
The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p:

$$Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$$
Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13).
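To see where the two mappings diverge, both can be evaluated for the same error probability. The sketch below uses the odds-based definition for the Solexa score. The conversion in `solexa_to_phred` follows algebraically from the two definitions and is not taken from any particular pipeline. All function names are illustrative.

```python
import math


def phred_quality(p):
    """Phred score: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Odds-based Solexa score: Q = -10 * log10(p / (1 - p)), for 0 < p < 1."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the Phred score for the same error probability."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# The gap between the two scores shrinks as p gets smaller.
gaps = {p: phred_quality(p) - solexa_quality(p) for p in (0.5, 0.05, 0.0001)}
```

At p = 0.5 the Solexa score is 0 while the Phred score is about 3, whereas at p = 0.0001 the two agree to within a thousandth of a unit.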
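To keep the files compact, each quality value is usually encoded as a single character. Decoding works as follows: subtract a platform-specific offset from the character's ASCII code to obtain Q, then use the formula given earlier to compute the base-calling error probability.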
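The decoding steps can be sketched as below. The offsets are assumptions stated here rather than taken from the text: 33 is the value commonly used by Sanger-style and recent Illumina files, and 64 by some older Illumina pipelines. The function names are illustrative.

```python
def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of integer Q values.

    `offset` is platform specific (assumed here: 33 or 64).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probabilities(quality_string, offset=33):
    """Per-base error probabilities, assuming Phred-scaled Q values."""
    return [10 ** (-q / 10) for q in decode_qualities(quality_string, offset)]


# 'I' has ASCII code 73, '5' has 53, '+' has 43.
q_values = decode_qualities("I5+", offset=33)
probs = error_probabilities("I5+", offset=33)
```

Using the wrong offset shifts every score by a constant, so negative or implausibly high Q values are a quick hint that the offset does not match the platform.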
The character ranges used by different sequencing platforms are shown below: