Machine Learning - Algorithm Models - Parameter Tuning - Validation Curve | Learning Curve

Rakish Leilie

2020-06-23 20:43

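Validation Curve

A validation curve shows model performance as a function of one hyperparameter: model performance = f(hyperparameter).

API for computing a validation curve: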

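train_scores, test_scores = ms.validation_curve(
    model,                               # the estimator
    x, y,                                # input samples, output labels
    param_name='n_estimators',           # hyperparameter name
    param_range=np.arange(50, 550, 50),  # hyperparameter values to try
    cv=5                                 # number of cross-validation folds
)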

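Structure of train_scores: the array has one row per hyperparameter value and one column per fold. The first column below only labels the rows and is not part of the array.

Hyperparameter value	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5
50	0.91823444	0.91968162	0.92619392	0.91244573	0.91040462
100	0.91968162	0.91823444	0.91244573	0.92619392	0.91244573
…	…	…	…	…	…

test_scores has the same structure as train_scores.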
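
As a small illustration of how such a score array is typically reduced, the sketch below takes the two rows listed in the table, averages each row over its five folds, and picks the hyperparameter value with the highest mean. The variable names (`param_range`, `scores`, `best`) are illustrative only, and a real array returned by `validation_curve` would have one row per candidate value.

```python
import numpy as np

# Candidate hyperparameter values (only the two rows shown in the table)
param_range = np.array([50, 100])

# One row per hyperparameter value, one column per fold
scores = np.array([
    [0.91823444, 0.91968162, 0.92619392, 0.91244573, 0.91040462],
    [0.91968162, 0.91823444, 0.91244573, 0.92619392, 0.91244573],
])

# Average over the folds (axis=1) to get one score per hyperparameter value
means = scores.mean(axis=1)
for param, mean in zip(param_range, means):
    print(param, '->', round(float(mean), 6))

# The hyperparameter value with the highest mean score
best = int(param_range[means.argmax()])
print('best:', best)
```

In practice the same reduction would usually be applied to test_scores, since training scores alone say little about how well the model generalizes.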

Example: use validation curves to choose better hyperparameters in the car evaluation case.

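import numpy as np
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as mp

# train_x, train_y: the encoded training samples from the car evaluation case

# Validation curve for n_estimators
model = se.RandomForestClassifier(max_depth=6, random_state=7)
n_estimators = np.arange(50, 550, 50)
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='n_estimators', param_range=n_estimators, cv=5)
print(train_scores, test_scores)
# Average over the folds: one mean score per hyperparameter value
train_means1 = train_scores.mean(axis=1)
test_means1 = test_scores.mean(axis=1)
for param, train_score, test_score in zip(n_estimators, train_means1, test_means1):
    print(param, '->', train_score, test_score)

mp.figure('n_estimators', facecolor='lightgray')
mp.title('n_estimators', fontsize=20)
mp.xlabel('n_estimators', fontsize=14)
# With no scoring argument, a classifier is scored by accuracy
mp.ylabel('Accuracy', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(n_estimators, train_means1, 'o-', c='dodgerblue', label='Training')
mp.plot(n_estimators, test_means1, 'o-', c='orangered', label='Validation')
mp.legend()
mp.show()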

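# Validation curve for max_depth (same imports and data as the n_estimators curve)
model = se.RandomForestClassifier(n_estimators=200, random_state=7)
max_depth = np.arange(1, 11)
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='max_depth', param_range=max_depth, cv=5)
# Average over the folds: one mean score per hyperparameter value
train_means2 = train_scores.mean(axis=1)
test_means2 = test_scores.mean(axis=1)
for param, train_score, test_score in zip(max_depth, train_means2, test_means2):
    print(param, '->', train_score, test_score)

mp.figure('max_depth', facecolor='lightgray')
mp.title('max_depth', fontsize=20)
mp.xlabel('max_depth', fontsize=14)
# With no scoring argument, a classifier is scored by accuracy
mp.ylabel('Accuracy', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(max_depth, train_means2, 'o-', c='dodgerblue', label='Training')
mp.plot(max_depth, test_means2, 'o-', c='orangered', label='Validation')
mp.legend()
mp.show()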

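Learning Curve

A learning curve shows model performance as a function of the training set size: model performance = f(training set size).

API for computing a learning curve: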

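_, train_scores, test_scores = ms.learning_curve(
    model,                        # the estimator
    x, y,                         # input samples, output labels
    train_sizes=[0.9, 0.8, 0.7],  # training set sizes, as fractions of the largest training set
    cv=5                          # number of cross-validation folds
)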

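Structure of train_scores: one row per training set size and one column per fold. test_scores has the same structure. The first return value, discarded here, holds the absolute number of training samples used for each row.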
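
When train_sizes holds fractions, scikit-learn documents them as fractions of the largest training set that the cross-validation scheme allows. The sketch below approximates that conversion for a hypothetical data set of 1000 samples with 5-fold cross-validation. The value of `n_samples` and the rounding rule are assumptions made for illustration; the exact sizes in a real run are the ones returned as the first value of `learning_curve`.

```python
import numpy as np

n_samples = 1000   # hypothetical data set size
cv = 5             # number of folds

# With k-fold cross-validation, each split trains on about (k - 1) / k of the samples
n_max_train = n_samples - n_samples // cv

# Fractions of the largest training set, from 10% to 100%
train_sizes = np.linspace(0.1, 1, 10)

# Approximate absolute training set sizes (rounded to the nearest integer here)
train_sizes_abs = np.rint(train_sizes * n_max_train).astype(int)
for frac, size in zip(train_sizes, train_sizes_abs):
    print(round(float(frac), 1), '->', int(size))
```

So a fraction of 1.0 does not mean all 1000 samples: with 5 folds, one fifth is always held out for validation.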

Example: use a learning curve in the car evaluation case to choose a suitable training set size.

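import numpy as np
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as mp

# x, y: the encoded samples and labels from the car evaluation case

# Compute the learning curve
model = se.RandomForestClassifier(max_depth=9, n_estimators=200, random_state=7)
train_sizes = np.linspace(0.1, 1, 10)
_, train_scores, test_scores = ms.learning_curve(
    model, x, y, train_sizes=train_sizes, cv=5)
# Mean validation score over the folds for each training set size
test_means = test_scores.mean(axis=1)
for size, score in zip(train_sizes, test_means):
    print(size, '->', score)
mp.figure('Learning Curve', facecolor='lightgray')
mp.title('Learning Curve', fontsize=20)
mp.xlabel('train_size', fontsize=14)
# With no scoring argument, a classifier is scored by accuracy
mp.ylabel('Accuracy', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(train_sizes, test_means, 'o-', c='dodgerblue', label='Validation')
mp.legend()
mp.show()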

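Example: predicting workers' income.

Read adult.txt, choose a different type of encoder for each kind of feature, train a model, and predict the income level of a worker.

First define a custom label encoder. Features whose values are numeric strings use this encoder, so that the numeric meaning of the values is preserved.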

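class DigitEncoder:
    """Encoder for numeric strings: converts them to integers instead of category codes."""

    def fit_transform(self, y):
        # Nothing to learn; just convert the numeric strings to integers
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        # Convert the integers back to strings
        return y.astype(str)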
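
To see why numeric strings get their own encoder, the sketch below compares it with plain label encoding on three made-up age values. Label encoding is imitated here with `np.unique(..., return_inverse=True)`, which, like `LabelEncoder`, assigns codes in the sorted order of the distinct values. This stand-in is an assumption chosen to keep the example free of other dependencies, and the sample values are hypothetical.

```python
import numpy as np


class DigitEncoder:
    """Same idea as the custom encoder: numeric strings become integers."""

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)


ages = np.array(['39', '50', '9'])  # hypothetical numeric strings

# Stand-in for label encoding: codes follow the sorted order of the strings
classes, codes = np.unique(ages, return_inverse=True)
print(codes)  # '9' sorts after '50' as a string, so it gets the largest code

# DigitEncoder keeps the numeric values, so 9 < 39 < 50 still holds
encoder = DigitEncoder()
digits = encoder.fit_transform(ages)
print(digits)

# The round trip gives back the original strings
restored = encoder.inverse_transform(digits)
print(restored)
```

With label codes, the model would treat age '9' as larger than age '50', which is the distortion the custom encoder avoids.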

Read the file, organize the sample data, and label-encode each column of the sample matrix.

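import numpy as np
import sklearn.preprocessing as sp

num_less, num_more, max_each = 0, 0, 7500
data = []

# Read every field as a string, then strip the space that follows each comma
txt = np.char.strip(
    np.loadtxt('../data/adult.txt', dtype='U20', delimiter=','))
for row in txt:
    if '?' in row:
        # Skip samples with missing values
        continue
    elif row[-1] == '<=50K' and num_less < max_each:
        # Keep at most max_each samples per class so the classes stay balanced
        num_less += 1
        data.append(row)
    elif row[-1] == '>50K' and num_more < max_each:
        num_more += 1
        data.append(row)

# Transpose so that each row of data holds one feature column
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    # Numeric columns keep their values; all other columns get label codes
    if str(data[row, 0]).isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        # The last column is the income label
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)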

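Split the data into a training set and a test set, build a model with the Gaussian naive Bayes classifier, print the cross-validation score, and check the model on the test set.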

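import sklearn.model_selection as ms
import sklearn.naive_bayes as nb

# Back to one row per sample, one column per feature
x = np.array(x).T
train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y, test_size=0.25, random_state=5)
model = nb.GaussianNB()
# Mean weighted F1 score over 10-fold cross-validation
print(ms.cross_val_score(
    model, x, y, cv=10, scoring='f1_weighted').mean())
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
# Accuracy on the held-out test set
print((pred_test_y == test_y).sum() / pred_test_y.size)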

Simulate a new sample and predict its income level.

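# One made-up sample with the 14 feature values (no income label)
data = [['39', 'State-gov', '77516', 'Bachelors',
         '13', 'Never-married', 'Adm-clerical', 'Not-in-family',
         'White', 'Male', '2174', '0', '40', 'United-States']]
data = np.array(data).T
x = []
for row in range(len(data)):
    # Reuse the encoder that was fitted on the same column of the training data
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
# Decode the predicted class back to its income label
print(encoders[-1].inverse_transform(pred_y))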
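
One caveat when reusing fitted encoders on new samples: a `LabelEncoder` can only transform values it saw during fitting. The sketch below, with made-up category values, shows the `ValueError` that scikit-learn raises for an unseen label, which is why a simulated sample has to use category strings that occur in the training data.

```python
import numpy as np
import sklearn.preprocessing as sp

# Fit on a few example categories (a hypothetical subset of one column)
encoder = sp.LabelEncoder()
encoder.fit(np.array(['Private', 'State-gov', 'Local-gov']))

# A value seen during fitting maps to its code
known = encoder.transform(np.array(['State-gov']))
print(known)

# A value that was never seen cannot be encoded
try:
    encoder.transform(np.array(['Unknown-employer']))
    unseen_failed = False
except ValueError as err:
    unseen_failed = True
    print('cannot encode:', err)
```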
