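A while ago I trained a CNN on the MNIST dataset. I have recently been studying machine learning algorithms, so I plan to try an SVM next. Before training with an SVM, though, let's first learn how to read the MNIST dataset.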
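【Dataset overview】
First, here is how the official source describes it:
The training set (train) and the test set (test) are each split into two files: a label file and an image file.
A label file begins with two integers: the magic number and the number of items.
An image file begins with four integers: the magic number, the number of images, the number of rows, and the number of columns.
From this description, the training set contains 60,000 samples, the test set contains 10,000, and each image is 28×28 pixels.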
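To make the header layout concrete, here is a minimal sketch that parses the header fields from a few hand-built bytes rather than from the real files. The magic values 2049 (label file) and 2051 (image file) come from the official format description. The byte strings and the helper name `read_header` are made up for illustration.

```python
import struct

# Hypothetical stand-ins for the start of a label file and an image file.
# Header fields are 32-bit unsigned integers stored big-endian ('>I').
label_bytes = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])
image_bytes = struct.pack('>IIII', 2051, 3, 28, 28) + bytes(3 * 28 * 28)

def read_header(buf):
    """Return (magic, dims), using the magic number to pick the header layout."""
    (magic,) = struct.unpack_from('>I', buf, 0)
    if magic == 2049:      # label file: magic, number of items
        dims = struct.unpack_from('>I', buf, 4)
    elif magic == 2051:    # image file: magic, number of images, rows, columns
        dims = struct.unpack_from('>III', buf, 4)
    else:
        raise ValueError('unexpected magic number: %d' % magic)
    return magic, dims

label_magic, label_dims = read_header(label_bytes)
image_magic, image_dims = read_header(image_bytes)
print(label_magic, label_dims)  # 2049 (3,)
print(image_magic, image_dims)  # 2051 (3, 28, 28)
```

Checking the magic number first is a cheap way to catch a swapped file name or a file that is still compressed.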
【Reading the MNIST dataset】
Reading the MNIST dataset really just means reading binary files.
Method 1:
import numpy as np
import struct
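def load_images(file_name):
    ## Before reading or writing a file, open it with Python's built-in open().
    ## file object = open(file_name [, access_mode][, buffering])
    ## file_name is a string containing the name of the file to access.
    ## access_mode sets how the file is opened: read, write, append, and so on.
    ## buffering: 0 disables buffering (binary mode only); 1 selects line buffering (text mode only).
    ## 'rb' opens the file read-only in binary mode.
    with open(file_name, 'rb') as binfile:
        ## Read the whole file into memory; the with block closes the file afterwards.
        buffers = binfile.read()
    ## Read the four header integers of the image file (big-endian, unsigned).
    magic, num, rows, cols = struct.unpack_from('>IIII', buffers, 0)
    ## Total number of pixel bytes, e.g. 60000*28*28 for the training set.
    bits = num * rows * cols
    ## Read the pixel data, starting right after the header.
    images = struct.unpack_from('>' + str(bits) + 'B', buffers, struct.calcsize('>IIII'))
    ## Convert to an array of shape [num, rows*cols], i.e. [60000, 784] for the training set.
    images = np.reshape(images, [num, rows * cols])
    return images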
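def load_labels(file_name):
    ## Open the file read-only in binary mode; the with block closes it afterwards.
    with open(file_name, 'rb') as binfile:
        ## Read the whole file into memory.
        buffers = binfile.read()
    ## Read the two header integers of the label file; num is the number of labels.
    magic, num = struct.unpack_from('>II', buffers, 0)
    ## Read the label data, starting right after the header.
    labels = struct.unpack_from('>' + str(num) + 'B', buffers, struct.calcsize('>II'))
    ## Convert to a one-dimensional array.
    labels = np.reshape(labels, [num])
    return labels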
Usage:
filename_train_images = '<absolute path>\\train-images.idx3-ubyte'
filename_train_labels = '<absolute path>\\train-labels.idx1-ubyte'
filename_test_images = '<absolute path>\\t10k-images.idx3-ubyte'
filename_test_labels = '<absolute path>\\t10k-labels.idx1-ubyte'
train_images = load_images(filename_train_images)
train_labels = load_labels(filename_train_labels)
test_images = load_images(filename_test_images)
test_labels = load_labels(filename_test_labels)
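One thing to keep in mind with the struct-based approach: `struct.unpack_from` turns every pixel byte into a separate Python integer before NumPy sees it, which can be slow and memory-hungry for the full training set. The sketch below compares it with `np.frombuffer`, which views the same bytes directly as `uint8`. The tiny 2×3 "images" are synthetic, not real MNIST data.

```python
import struct
import numpy as np

# Synthetic image file contents: 2 images of 2x3 pixels (not real MNIST data).
header = struct.pack('>IIII', 2051, 2, 2, 3)
pixels = bytes(range(12))
buffers = header + pixels

magic, num, rows, cols = struct.unpack_from('>IIII', buffers, 0)
offset = struct.calcsize('>IIII')  # the header occupies 16 bytes

# Struct-based approach: unpack every byte into a Python tuple, then reshape.
via_struct = np.reshape(
    struct.unpack_from('>' + str(num * rows * cols) + 'B', buffers, offset),
    [num, rows * cols])

# Alternative: interpret the bytes after the header directly as uint8.
via_frombuffer = np.frombuffer(buffers, dtype=np.uint8, offset=offset)
via_frombuffer = via_frombuffer.reshape(num, rows * cols)

print(np.array_equal(via_struct, via_frombuffer))  # True
print(via_frombuffer.dtype)                        # uint8
```

Note that an array created by `np.frombuffer` from a `bytes` object is read-only; call `.copy()` on it if the pixels need to be modified in place.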
Method 2:
import numpy as np
import struct
import os
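def load_mnist_train(path, kind='train'):
    ## Build the label and image file paths from the directory and the file-name prefix.
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        ## Read the 8-byte header (magic number, number of items), then the labels.
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        ## Read the 16-byte header, then the pixels, one flattened image per row.
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), rows * cols)
    return images, labels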
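def load_mnist_test(path, kind='t10k'):
    ## Same as load_mnist_train, but defaults to the 't10k' (test set) file-name prefix.
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        ## Read the 8-byte header (magic number, number of items), then the labels.
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        ## Read the 16-byte header, then the pixels, one flattened image per row.
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), rows * cols)
    return images, labels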
Usage:
path = '<absolute path>'
train_images, train_labels = load_mnist_train(path)
test_images, test_labels = load_mnist_test(path)
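Since the two loaders differ only in the default value of `kind`, they could be merged into one. Below is a rough sketch of that idea, not the version used in this post: the name `load_mnist` and the header checks are my own additions. It writes a tiny fake dataset into a temporary directory so it can run without the real files.

```python
import os
import struct
import tempfile
import numpy as np

def load_mnist(path, kind='train'):
    """Hypothetical merged loader; kind is 'train' or 't10k'."""
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        if magic != 2049:
            raise ValueError('not a label file: magic=%d' % magic)
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        if magic != 2051:
            raise ValueError('not an image file: magic=%d' % magic)
        # Take the row width from the header instead of assuming 28*28.
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(num, rows * cols)
    if num != n or len(labels) != n:
        raise ValueError('image and label counts do not match')
    return images, labels

# Write a tiny fake test set (2 blank 28x28 images) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 't10k-labels.idx1-ubyte'), 'wb') as f:
        f.write(struct.pack('>II', 2049, 2) + bytes([7, 2]))
    with open(os.path.join(tmp, 't10k-images.idx3-ubyte'), 'wb') as f:
        f.write(struct.pack('>IIII', 2051, 2, 28, 28) + bytes(2 * 28 * 28))
    images, labels = load_mnist(tmp, kind='t10k')

print(images.shape, labels.tolist())  # (2, 784) [7, 2]
```

With the real files, the call would be `load_mnist(path, kind='train')` or `load_mnist(path, kind='t10k')`.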
Let's plot the first 30 digits to take a look, using the same steps as for the digits dataset earlier.
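import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 8))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(30):
    ## Each row of train_images is a flattened image; restore it to 28x28.
    image = np.reshape(train_images[i], [28, 28])
    ax = fig.add_subplot(6, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(image, cmap=plt.cm.binary, interpolation='nearest')
    ## Write the label near the top-left corner of each subplot.
    ax.text(0, 7, str(train_labels[i]))
plt.show()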
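The reshape to 28×28 in the plotting code works because each image is stored row by row, so a flat vector of 784 values maps back to pixel (row, column) at index row * 28 + column. A small sketch with a synthetic image (not taken from MNIST) shows the round trip:

```python
import numpy as np

# Synthetic 28x28 "image" with a single bright pixel at row 3, column 5.
image = np.zeros((28, 28), dtype=np.uint8)
image[3, 5] = 255

flat = image.reshape(784)              # how one row of train_images is laid out
restored = np.reshape(flat, [28, 28])  # the same reshape as in the plotting loop

print(int(np.argmax(flat)))             # 89, i.e. 3 * 28 + 5
print(np.array_equal(restored, image))  # True
```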
OK, the data is loaded and ready for the training that follows~