Introduction to the FASTQ Format

Original post

2020-02-20 19:52


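To make sequencing data easy to publish and share, high-throughput sequencing data are recorded in FASTQ format, which stores the base calls of each read together with their quality scores. As the figure below shows, a FASTQ file is organized by read, and each read occupies four lines. Lines 1 and 3 each consist of a marker character followed by the read name (ID): line 1 begins with "@" and line 3 begins with "+". The ID on line 3 may be omitted, but the "+" may not. Line 2 holds the base sequence, and line 4 holds the corresponding sequencing quality scores, one per base.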

Figure: FASTQ data format
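The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes every read occupies exactly four lines (no wrapped sequences), and the names `FastqRecord`, `read_fastq` and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1 without the leading "@"
    sequence: str  # line 2: base calls
    quality: str   # line 4: one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses exactly four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two invented reads; the second repeats its ID on line 3, the first omits it.
EXAMPLE = "@read1\nACGTN\n+\nIIII#\n@read2\nGGCC\n+read2\n5555\n"

for record in read_fastq(io.StringIO(EXAMPLE)):
    print(record.name, record.sequence, record.quality)
```

Reading four lines at a time, rather than looking for "@" at the start of a line, matters because "@" is also a legal quality character and can appear first on line 4.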

1. Sequence name:

Every FASTQ record has a sequence name that uniquely identifies it, for example:

`@HWUSI-EAS100R:6:73:941:1973#0/1`

HWUSI-EAS100R	the unique instrument name
6	flowcell lane
73	tile number within the flowcell lane
941	'x'-coordinate of the cluster within the tile
1973	'y'-coordinate of the cluster within the tile
#0	index number for a multiplexed sample (0 for no indexing)
/1	the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
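As a rough illustration, the fields listed above can be pulled out of an old-style name with a regular expression. The pattern and the names `IlluminaName` and `parse_illumina_name` are assumptions made for this sketch, not part of any Illumina tool. The pair suffix is treated as optional, and the index is kept as text so that both `#0` and a tag such as `#ATCACG` are accepted.

```python
import re
from typing import NamedTuple, Optional


class IlluminaName(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str           # "0" or the multiplex tag sequence
    pair: Optional[int]  # 1 or 2, None when no /1 or /2 suffix is present


_OLD_STYLE = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_name(line: str) -> IlluminaName:
    """Split a pre-Casava-1.8 read name into its documented fields."""
    match = _OLD_STYLE.match(line.strip())
    if match is None:
        raise ValueError(f"not an old-style Illumina read name: {line!r}")
    pair = match.group("pair")
    return IlluminaName(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair=int(pair) if pair is not None else None,
    )


print(parse_illumina_name("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```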

With Casava 1.8 the format of the '@' line has changed:

`@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG`

EAS139	the unique instrument name
136	the run id
FC706VJ	the flowcell id
2	flowcell lane
2104	tile number within the flowcell lane
15343	'x'-coordinate of the cluster within the tile
197393	'y'-coordinate of the cluster within the tile
1	the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y	Y if the read fails filter (read is bad), N otherwise
18	0 when none of the control bits are on, otherwise it is an even number
ATCACG	index sequence
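The Casava 1.8 name is easier to split than the older one, because it is two colon-separated groups joined by a single space. The sketch below assumes exactly that layout (seven fields, a space, four fields); `CasavaName` and `parse_casava_name` are hypothetical names used only here.

```python
from typing import NamedTuple


class CasavaName(NamedTuple):
    instrument: str
    run_id: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    pair: int          # 1 or 2
    filtered: bool     # True when the flag is "Y" (read failed the filter)
    control_bits: int  # 0, or an even number when control bits are set
    index: str         # index sequence


def parse_casava_name(line: str) -> CasavaName:
    """Split a Casava 1.8 style '@' line into its documented fields."""
    text = line.strip()
    if text.startswith("@"):
        text = text[1:]
    location, _, description = text.partition(" ")
    # Tuple unpacking raises ValueError if the field count is wrong.
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    pair, flag, control, index = description.split(":")
    if flag not in ("Y", "N"):
        raise ValueError(f"filter flag must be Y or N, got {flag!r}")
    return CasavaName(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(pair), flag == "Y", int(control), index,
    )


print(parse_casava_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
```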

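2. Quality values: for each sequence, every base has a corresponding sequencing quality value.

Traditional sequencing quality values are based on Phred quality scores, which are defined as follows: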

Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilities P.

$Q = -10 \log_{10} P$

Phred quality scores are logarithmically linked to error probabilities:

Phred Quality Score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90 %
20	1 in 100	99 %
30	1 in 1000	99.9 %
40	1 in 10000	99.99 %
50	1 in 100000	99.999 %
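The rows of the table follow directly from the definition of Q. A small sketch of the conversion in both directions is shown here; the function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Recompute the three columns of the table for Q = 10 ... 50.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"{q}\t1 in {round(1 / p)}\t{100 * (1 - p):.3f} %")
```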

The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p:

$Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$

Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13).
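To see how the two mappings drift apart at low quality, both can be evaluated for the same error probability. In this sketch the Solexa score is taken to be -10 log10 of the odds p/(1-p), as described above. The conversion in `solexa_to_phred` is derived from that assumption by solving for p, and all function names are illustrative.

```python
import math


def phred_quality(p: float) -> float:
    """Phred mapping: encodes the error probability p."""
    return -10 * math.log10(p)


def solexa_quality(p: float) -> float:
    """Older Solexa mapping: encodes the odds p / (1 - p)."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the Phred scale.

    From Q_solexa = -10 * log10(p / (1 - p)) it follows that
    p = 1 / (10 ** (Q_solexa / 10) + 1), and therefore
    Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# The gap is large for error-prone bases and shrinks as p gets small.
for p in (0.5, 0.2, 0.05, 0.01, 0.0001):
    print(f"p={p}\tPhred={phred_quality(p):.2f}\tSolexa={solexa_quality(p):.2f}")
```

Note that a Solexa score can be negative (any p above 0.5), whereas a Phred score never drops below 0.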

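To keep the files compact, each quality value is usually stored as a single character. The value is recovered with a simple calculation: subtract a fixed offset from the character's ASCII code to obtain Q (the offset depends on the sequencing platform, commonly 33 or 64), then use the formula given earlier to compute the sequencing error rate of that base.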
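The decoding step can be sketched as below. The default offset of 33 is an assumption that fits Sanger-style and recent Illumina files; older Illumina files use 64, so the offset is left as a parameter. The sketch also assumes the scores are on the Phred scale, and the function names and the quality string `"II5!"` are invented for the example.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string into Q values: ASCII code minus the offset."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        # A negative value usually means the wrong offset was chosen.
        raise ValueError(f"negative score; is offset {offset} correct?")
    return scores


def error_probabilities(scores: List[int]) -> List[float]:
    """Apply P = 10 ** (-Q / 10) to each Phred-scaled score."""
    return [10 ** (-q / 10) for q in scores]


scores = decode_qualities("II5!")
print(scores)
print(error_probabilities(scores))
```

Picking the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before interpreting its quality values.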