Deep Learning with Caffe, Handwritten Digit Recognition Example (Part 3): The MNIST Dataset File Format in Detail

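        The MNIST dataset used by the handwritten digit recognition example consists of four files: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, and t10k-labels-idx1-ubyte. They are, respectively, the training-set images, the training-set labels, the test-set images, and the test-set labels. The training set contains 60000 images and the test set contains 10000. All four are binary files, so how are the images actually stored in them?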

        We use the test-set files as the example, namely t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte. Their layout is defined as follows.

Layout of t10k-images-idx3-ubyte

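Offset  Type    Value  Description
0000    uint32  2051   Magic number (big-endian)
0004    uint32  10000  Number of items in the file
0008    uint32  28     Number of rows
000c    uint32  28     Number of columns
0010    uint8   ?      Pixel gray value (0~255)
0011    uint8   ?      Pixel gray value (0~255)
...     ...     ...    ...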

Layout of t10k-labels-idx1-ubyte

Offset  Type    Value  Description
0000    uint32  2049   Magic number (big-endian)
0004    uint32  10000  Number of items in the file
0008    uint8   ?      Label value (0~9)
0009    uint8   ?      Label value (0~9)
...     ...     ...    ...

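The two training-set files use exactly the same layout as the test-set files; only the item count differs (60000 instead of 10000).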
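The two layouts above can be read by one small routine. The following is a minimal Python sketch, not code from the Caffe example. It relies on a detail of the IDX format that the tables do not spell out: within the 4-byte magic number, the third byte is the element type (0x08 for unsigned byte) and the fourth byte is the number of dimensions, which is why 2051 is 0x00000803 (3 dimensions) and 2049 is 0x00000801 (1 dimension). The function name `parse_idx` and the tiny synthetic inputs are made up for illustration.

```python
import struct

def parse_idx(data: bytes):
    """Split an IDX byte string into (dims, payload). Illustrative helper."""
    # Bytes 0-1 are zero, byte 2 is the element type, byte 3 the dimension count.
    zero, dtype, ndim = struct.unpack_from(">HBB", data, 0)
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian uint32 right after the magic number.
    dims = struct.unpack_from(f">{ndim}I", data, 4)
    payload = data[4 + 4 * ndim:]
    expected = 1
    for d in dims:
        expected *= d
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return dims, payload

# Synthetic stand-ins for the real files: 2 images of 2x3 pixels, 2 labels.
sample_images = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
sample_labels = struct.pack(">II", 2049, 2) + bytes([7, 2])

print(parse_idx(sample_images)[0])  # (2, 2, 3)
print(parse_idx(sample_labels)[0])  # (2,)
```

For a real file, the same function could be applied to the bytes returned by `open(path, "rb").read()`.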

        From the tables above we can see that t10k-images-idx3-ubyte stores 10000 images, each 28x28 pixels. Every pixel occupies one byte holding its gray value, where 0 means white and 255 means black.
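These numbers fix the file sizes and the position of every pixel. The sketch below works through the arithmetic in Python. It assumes the pixels of each image are stored row by row, as described on the original MNIST page; the helper names are hypothetical.

```python
HEADER_IMAGES = 16  # magic + count + rows + cols, 4 bytes each
HEADER_LABELS = 8   # magic + count, 4 bytes each
ROWS, COLS, COUNT = 28, 28, 10000

def image_offset(index: int) -> int:
    """Byte offset of the first pixel of image `index` (0-based)."""
    return HEADER_IMAGES + index * ROWS * COLS

def pixel_offset(index: int, row: int, col: int) -> int:
    """Byte offset of one pixel, assuming row-major storage."""
    return image_offset(index) + row * COLS + col

images_file_size = HEADER_IMAGES + COUNT * ROWS * COLS
labels_file_size = HEADER_LABELS + COUNT

print(images_file_size)            # 7840016
print(labels_file_size)            # 10008
print(hex(image_offset(1)))        # 0x320
print(hex(pixel_offset(0, 7, 6)))  # 0xda
```

Comparing these values with the output of `ls -l` on the downloaded files is a quick check that they were decompressed correctly.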

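        t10k-labels-idx1-ubyte stores the label for each image, a digit from 0 to 9. Images and labels correspond one to one: the i-th label belongs to the i-th image. To analyze the files in detail we look at their raw contents. Since they are binary, we can view them with hexdump. If the tool is not installed, it can be installed with the command below (on some Debian/Ubuntu releases hexdump ships inside a package such as bsdmainutils, so the package name may need adjusting).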

sudo apt-get install hexdump

View the file with the following command:

hexdump -Cv t10k-images-idx3-ubyte | more

The beginning of the output looks like this:

00000000  00 00 08 03 00 00 27 10  00 00 00 1c 00 00 00 1c  |......'.........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  00 00 00 00 00 00 00 00  00 00 54 b9 9f 97 3c 24  |..........T...<$|
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000f0  00 00 00 00 00 00 de fe  fe fe fe f1 c6 c6 c6 c6  |................|
00000100  c6 c6 c6 c6 aa 34 00 00  00 00 00 00 00 00 00 00  |.....4..........|
00000110  00 00 43 72 48 72 a3 e3  fe e1 fe fe fe fa e5 fe  |..CrHr..........|
00000120  fe 8c 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000130  00 00 00 11 42 0e 43 43  43 3b 15 ec fe 6a 00 00  |....B.CCC;...j..|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000150  00 00 00 00 00 00 53 fd  d1 12 00 00 00 00 00 00  |......S.........|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000170  00 16 e9 ff 53 00 00 00  00 00 00 00 00 00 00 00  |....S...........|
00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 81 fe ee  |................|
00000190  2c 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |,...............|
000001a0  00 00 00 00 00 00 00 00  3b f9 fe 3e 00 00 00 00  |........;..>....|
000001b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001c0  00 00 00 00 85 fe bb 05  00 00 00 00 00 00 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 09  |................|
000001e0  cd f8 3a 00 00 00 00 00  00 00 00 00 00 00 00 00  |..:.............|
000001f0  00 00 00 00 00 00 00 00  00 00 00 7e fe b6 00 00  |...........~....|
00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000210  00 00 00 00 00 00 4b fb  f0 39 00 00 00 00 00 00  |......K..9......|
00000220  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000230  00 13 dd fe a6 00 00 00  00 00 00 00 00 00 00 00  |................|
00000240  00 00 00 00 00 00 00 00  00 00 00 00 03 cb fe db  |................|
00000250  23 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |#...............|
00000260  00 00 00 00 00 00 00 00  26 fe fe 4d 00 00 00 00  |........&..M....|
00000270  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000280  00 00 00 1f e0 fe 73 01  00 00 00 00 00 00 00 00  |......s.........|
00000290  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 85  |................|
000002a0  fe fe 34 00 00 00 00 00  00 00 00 00 00 00 00 00  |..4.............|
000002b0  00 00 00 00 00 00 00 00  00 00 3d f2 fe fe 34 00  |..........=...4.|
000002c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000002d0  00 00 00 00 00 00 79 fe  fe db 28 00 00 00 00 00  |......y...(.....|
000002e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000002f0  00 00 79 fe cf 12 00 00  00 00 00 00 00 00 00 00  |..y.............|
00000300  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000310  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000320  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000330  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000340  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000350  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000360  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000370  00 00 00 00 00 00 00 00  00 00 00 00 00 00 74 7d  |..............t}|
00000380  ab ff ff 96 5d 00 00 00  00 00 00 00 00 00 00 00  |....]...........|
00000390  00 00 00 00 00 00 00 00  00 a9 fd fd fd fd fd fd  |................|
000003a0  da 1e 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000003b0  00 00 00 00 a9 fd fd fd  d5 8e b0 fd fd 7a 00 00  |.............z..|
000003c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 34  |...............4|
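If hexdump is not available, a similar view can be produced with a few lines of Python. This is only a rough sketch that imitates the `hexdump -C` line layout for full 16-byte lines; the function name `hexdump_lines` is made up, and the padding of a short final line is an approximation.

```python
def hexdump_lines(data: bytes, start: int = 0):
    """Format bytes as offset, two groups of 8 hex bytes, and an ASCII column."""
    lines = []
    for i in range(0, len(data), 16):
        chunk = data[i:i + 16]
        left = " ".join(f"{b:02x}" for b in chunk[:8])
        right = " ".join(f"{b:02x}" for b in chunk[8:])
        # Non-printable bytes are shown as '.' in the ASCII column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start + i:08x}  {left:<23}  {right:<23}  |{text}|")
    return lines

# The 16 header bytes of t10k-images-idx3-ubyte, copied from the dump above.
header = bytes.fromhex("00000803 00002710 0000001c 0000001c")
print(hexdump_lines(header)[0])
```

To dump a real file, pass the result of `open("t10k-images-idx3-ubyte", "rb").read(1024)` instead of `header`.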

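        Multi-byte values are stored in big-endian order, that is, most significant byte first. With that in mind we can go through the file contents following the table definitions above.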

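First, look at the first 4 bytes, 00 00 08 03. Read as 0x00000803, this is 2051 in decimal, exactly the magic number defined in the table.

The next 4 bytes are 00 00 27 10, or 0x00002710, which is 10000 in decimal, exactly the item count defined in the table.

After that come two 4-byte values of 0x0000001c, both 28 in decimal. They are the number of rows and the number of columns of each image.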
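The conversion of these four fields can be checked in Python. The sketch below decodes the 16 header bytes copied from the dump with `int.from_bytes`, and also shows what the magic number would look like if the bytes were misread as little-endian.

```python
# The first 16 bytes of t10k-images-idx3-ubyte, copied from the dump.
header = bytes.fromhex("00000803 00002710 0000001c 0000001c")

# Decode four consecutive big-endian uint32 fields.
fields = [int.from_bytes(header[i:i + 4], "big") for i in range(0, 16, 4)]
print(fields)  # [2051, 10000, 28, 28]

# Reading the same 4 bytes as little-endian gives a very different number.
wrong_magic = int.from_bytes(header[0:4], "little")
print(wrong_magic)  # 50855936
```

On a little-endian machine such as x86, a reader that copies these bytes straight into a native 32-bit integer gets the second result, so the bytes have to be swapped or decoded explicitly.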

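What follows are the pixel values, 784 bytes (28x28) per image. In the data above, 00 occurs most often and represents the white background, while the other values are the inked pixels of the stroke. Plotting these 784 pixel values on a 28x28 grid reveals the image.

It is clearly the digit 7. The rest of the file holds the remaining 9999 images, stored in the same way; plot a few if you are curious. With that, the contents of t10k-images-idx3-ubyte are fully laid out.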
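One simple way to draw an image without any plotting library is to print it as text. The sketch below is a hypothetical helper, not part of the original example. It marks every pixel at or above a threshold with `#` and everything else with `.`, and demonstrates this on row 7 of the first image, whose 28 bytes are copied from offsets 0xd4 to 0xef of the dump.

```python
def render_ascii(pixels: bytes, rows: int, cols: int, threshold: int = 128):
    """Return one string per image row; '#' marks pixels >= threshold."""
    lines = []
    for r in range(rows):
        row = pixels[r * cols:(r + 1) * cols]
        lines.append("".join("#" if p >= threshold else "." for p in row))
    return lines

# Row 7 of the first test image: 6 background bytes, 6 stroke bytes, 16 background bytes.
row7 = bytes.fromhex("000000000000" "54b99f973c24") + bytes(16)
print(render_ascii(row7, 1, 28)[0])
```

Passing the 784 bytes that start at offset 0x10 with `rows=28, cols=28` would print the whole digit.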

        Next, let's look at the t10k-labels-idx1-ubyte file, which stores the label values. View it with the following command:

hexdump -Cv t10k-labels-idx1-ubyte | more

The beginning of the output looks like this:

00000000  00 00 08 01 00 00 27 10  07 02 01 00 04 01 04 09  |......'.........|
00000010  05 09 00 06 09 00 01 05  09 07 03 04 09 06 06 05  |................|
00000020  04 00 07 04 00 01 03 01  03 04 07 02 07 01 02 01  |................|
00000030  01 07 04 02 03 05 01 02  04 04 06 03 05 05 06 00  |................|
00000040  04 01 09 05 07 08 09 03  07 04 06 04 03 00 07 00  |................|
00000050  02 09 01 07 03 02 09 07  07 06 02 07 08 04 07 03  |................|
00000060  06 01 03 06 09 03 01 04  01 07 06 09 06 00 05 04  |................|
00000070  09 09 02 01 09 04 08 07  03 09 07 04 04 04 09 02  |................|
00000080  05 04 07 06 07 09 00 05  08 05 06 06 05 07 08 01  |................|
00000090  00 01 06 04 06 07 03 01  07 01 08 02 00 02 09 09  |................|
000000a0  05 05 01 05 06 00 03 04  04 06 05 04 06 05 04 05  |................|
000000b0  01 04 04 07 02 03 02 07  01 08 01 08 01 08 05 00  |................|
000000c0  08 09 02 05 00 01 01 01  00 09 00 03 01 06 04 02  |................|
000000d0  03 06 01 01 01 03 09 05  02 09 04 05 09 03 09 00  |................|
000000e0  03 06 05 05 07 02 02 07  01 02 08 04 01 07 03 03  |................|
000000f0  08 08 07 09 02 02 04 01  05 09 08 07 02 03 00 04  |................|
00000100  04 02 04 01 09 05 07 07  02 08 02 06 08 05 07 07  |................|
00000110  09 01 08 01 08 00 03 00  01 09 09 04 01 08 02 01  |................|
00000120  02 09 07 05 09 02 06 04  01 05 08 02 09 02 00 04  |................|
00000130  00 00 02 08 04 07 01 02  04 00 02 07 04 03 03 00  |................|
00000140  00 03 01 09 06 05 02 05  09 02 09 03 00 04 02 00  |................|
00000150  07 01 01 02 01 05 03 03  09 07 08 06 05 06 01 03  |................|
00000160  08 01 00 05 01 03 01 05  05 06 01 08 05 01 07 09  |................|
00000170  04 06 02 02 05 00 06 05  06 03 07 02 00 08 08 05  |................|
00000180  04 01 01 04 00 03 03 07  06 01 06 02 01 09 02 08  |................|
00000190  06 01 09 05 02 05 04 04  02 08 03 08 02 04 05 00  |................|
000001a0  03 01 07 07 05 07 09 07  01 09 02 01 04 02 09 02  |................|
000001b0  00 04 09 01 04 08 01 08  04 05 09 08 08 03 07 06  |................|
000001c0  00 00 03 00 02 06 06 04  09 03 03 03 02 03 09 01  |................|
000001d0  02 06 08 00 05 06 06 06  03 08 08 02 07 05 08 09  |................|
000001e0  06 01 08 04 01 02 05 09  01 09 07 05 04 00 08 09  |................|
000001f0  09 01 00 05 02 03 07 08  09 04 00 06 03 09 05 02  |................|
00000200  01 03 01 03 06 05 07 04  02 02 06 03 02 06 05 04  |................|
00000210  08 09 07 01 03 00 03 08  03 01 09 03 04 04 06 04  |................|
00000220  02 01 08 02 05 04 08 08  04 00 00 02 03 02 07 07  |................|
00000230  00 08 07 04 04 07 09 06  09 00 09 08 00 04 06 00  |................|
00000240  06 03 05 04 08 03 03 09  03 03 03 07 08 00 08 02  |................|
00000250  01 07 00 06 05 04 03 08  00 09 06 03 08 00 09 09  |................|
00000260  06 08 06 08 05 07 08 06  00 02 04 00 02 02 03 01  |................|
00000270  09 07 05 01 00 08 04 06  02 06 07 09 03 02 09 08  |................|
00000280  02 02 09 02 07 03 05 09  01 08 00 02 00 05 02 01  |................|
00000290  03 07 06 07 01 02 05 08  00 03 07 02 04 00 09 01  |................|

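Again we examine the contents against the table above.

The first 4 bytes are 00 00 08 01, or 0x00000801, which is 2049 in decimal, exactly the magic number defined in the table.

The next 4 bytes are 00 00 27 10, or 0x00002710, which is 10000 in decimal, exactly the item count defined in the table.

The label values follow. The first one is 07, which matches the digit we drew from the first image in t10k-images-idx3-ubyte. The rest of the file holds the remaining 9999 labels, each a value from 0 to 9.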
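The same reading can be reproduced in Python. The sketch below decodes only the first line of the label dump, copied by hand, so it yields the header and the first eight labels rather than the whole file.

```python
import struct

# First 16 bytes of t10k-labels-idx1-ubyte, copied from the dump.
first_line = bytes.fromhex("00000801 00002710 07 02 01 00 04 01 04 09")

# Two big-endian uint32 header fields, then one byte per label.
magic, count = struct.unpack_from(">II", first_line, 0)
labels = list(first_line[8:])

print(magic, count)  # 2049 10000
print(labels)        # [7, 2, 1, 0, 4, 1, 4, 9]
```

The label at position i describes the image at position i in the image file, so the first label, 7, belongs to the first image.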

       We will not go through the training-set files train-images-idx3-ubyte and train-labels-idx1-ubyte here, since their layout is exactly the same as that of the test-set files.
