Assignment translation: @胡楊 ([email protected]) && @胥可 ([email protected])
Solutions & editing: 寒小陽 && 龍心塵
Date: June 2016
Source: http://blog.csdn.net/han_xiaoyang/article/details/51760923
Note: this post is a set of Chinese-language notes for Stanford's CS224d course; translation and publication were authorized by the course's Prof. @Richard Socher.
0 Preface
After sitting through one Lecture after another, this old monk was left thoroughly dazed. But did you really think you could just be a quiet pretty boy (I have always felt that any girl brave enough to do deep learning is one tough cookie) and cruise to graduation from a school like Stanford? Too young, too simple. Beyond the brutally steep learning curve, the assignments and exams at a top school like this will make you advance (read: suffer) by leaps and bounds.
Then again, this is a Stanford course we are talking about. A school that prizes turning frontier research into real industrial practice is bound to push theory and application together and to demand serious hands-on skill. So we dug up the assignments and quizzes too (you dare call THIS a quiz?!), tidied them up, and are sharing them so everyone can get a taste. Every time this blogger grinds through one of these university assignments, he sighs: "Good thing I wasn't born into the scorching misery of an evil capitalist country, or I'd never have made it through undergrad and my master's (what, a PhD? Ha. This blogger is a potato whose IQ balance is perpetually overdrawn; I'll sit out that high-end party)."
Enough rambling. Let's get straight to the assignment and the exam…
1 Softmax (10 points)
(part a) (5 points)
Prove that softmax is invariant to constant offsets in its input: for any input vector $x$ and any constant $c$,
$$\mathrm{softmax}(x) = \mathrm{softmax}(x + c)$$
where $x + c$ means adding the constant $c$ to every dimension of $x$, and
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Note: in practice we make use of this property. To compute softmax probabilities stably, we choose $c = -\max_i x_i$.
Blogger: I survived high school, and now proofs show up again. What a (dreadful) delight. Hand over the answer!!!
Solution:
For every dimension $i$ ($1 \le i \le \dim x$):
$$\mathrm{softmax}(x + c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \frac{e^c \, e^{x_i}}{e^c \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}} = \mathrm{softmax}(x)_i$$
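A minimal numerical check of this property (illustrative only; the values here are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
c = 100.0
sm = lambda v: np.exp(v) / np.sum(np.exp(v))
# the two results should agree to floating-point precision
print np.allclose(sm(x), sm(x + c))  # True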
(part b) (5 points)
Given an input matrix of N rows and d columns, compute the softmax probabilities of each row. Write your implementation in q1_softmax.py and run it with python q1_softmax.py.
Requirement: your code should be as efficient as possible and vectorized. A non-vectorized implementation will not receive full marks.
Blogger: I could cry myself unconscious in the bathroom. Back in the day my entire graduation project, thesis included, took about a week, and here a 5-point homework item comes with this many requirements… socialism is great… hand over the answer!!!
import numpy as np

def softmax(x):
    """
    Softmax function
    """
    assert len(x.shape) > 1, "softmax expects a matrix of scores, one row per sample"
    # shift each row by its max for numerical stability (part a)
    x -= np.max(x, axis=1, keepdims=True)
    x = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
    return x
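A quick sanity check (the expected values below are the standard softmax outputs for these inputs; treat the snippet as illustrative):

print softmax(np.array([[1001.0, 1002.0], [3.0, 4.0]]))
# both rows should be approximately [0.26894142, 0.73105858]:
# softmax is shift-invariant and the two rows differ only by a constant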
2 Neural network basics (30 points)
(part a) (3 points)
Derive the gradient of the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ and express it purely in terms of the function's value (the derivative's expression may contain only $\sigma(x)$, not $x$ itself).
Aside: why did young me ever start down deep learning, this road of no return? Life has lost all meaning.
Answer:
$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,(1 - \sigma(x))$$
(part b) (3 points)
Using the cross-entropy loss as the criterion, derive the gradient of the loss with respect to the input vector $\theta$ of the softmax, i.e. find $\frac{\partial CE}{\partial \theta}$ when the prediction is $\hat{y} = \mathrm{softmax}(\theta)$,
where $CE(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$ and $y$ is the one-hot ground-truth label vector.
Answer:
$$\frac{\partial CE}{\partial \theta} = \hat{y} - y$$
Or, equivalently, assuming $k$ is the correct class:
$$\frac{\partial CE}{\partial \theta_i} = \begin{cases}\hat{y}_i - 1, & i = k\\ \hat{y}_i, & i \ne k\end{cases}$$
(part c) (6 points)
Derive the gradient of a single-hidden-layer neural network with respect to its input $x$, i.e. find $\frac{\partial J}{\partial x}$ where $J = CE(y, \hat{y})$.
The forward propagation is:
$$h = \mathrm{sigmoid}(xW_1 + b_1), \qquad \hat{y} = \mathrm{softmax}(hW_2 + b_2)$$
In the programming problems we assume the input vector (like the hidden variables and the output probabilities) is always a row vector. We also adopt the convention that applying the sigmoid to a vector means applying it to every element of the vector.
Aside: a perfectly good 100-point total, and you insist on carving it up 5 and 6 points at a time. Elsewhere 5 or 6 points is one multiple-choice question; yours is a whole damn graduation project!! Fine, no tears; I'll finish the problems and the code on my knees if I have to. Ah, this blogger is still too young and has much to learn.
Answer: Let $z_1 = xW_1 + b_1$ and $z_2 = hW_2 + b_2$. Then
$$\delta_2 = \frac{\partial J}{\partial z_2} = \hat{y} - y, \qquad \delta_1 = \frac{\partial J}{\partial z_1} = \delta_2 W_2^\top \circ h \circ (1 - h), \qquad \frac{\partial J}{\partial x} = \delta_1 W_1^\top$$
(part d) (2 points)
How many parameters does the network above have? Assume the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units.
Aside: there's a part d too?!
Answer:
$$n = (D_x + 1)\cdot H + (H + 1)\cdot D_y$$
For example, with $D_x = 10$, $H = 5$, $D_y = 3$: $11 \cdot 5 + 6 \cdot 3 = 73$ parameters (the $+1$ terms count the biases).
(part e) (4 points) In q2_sigmoid.py, fill in the code for the sigmoid activation function and its gradient, then test with python q2_sigmoid.py. As before, the test cases are not exhaustive, so check your own code as thoroughly as you can.
Aside: if the blogger hasn't fallen yet, he is well on his way to falling…
def sigmoid_grad(f):
    """
    Compute the gradient of the sigmoid, given f = sigmoid(x)
    """
    # thank goodness for numpy
    f = f * (1 - f)
    return f
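The post omits the forward sigmoid itself; here is a minimal sketch of the companion function q2_sigmoid.py also asks for, assuming the standard definition:

import numpy as np

def sigmoid(x):
    """
    Element-wise sigmoid function: 1 / (1 + exp(-x))
    """
    return 1.0 / (1.0 + np.exp(-x))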
(part f) (4 points)
To make debugging easier, we will write a gradient checker. Fill it in in q2_gradcheck.py and test your code with python q2_gradcheck.py.
Aside: grinding until the sky goes dark; sleep it off and rise again a hero…
import random
import numpy as np

def gradcheck_naive(f, x):
    """
    Naive gradient check for a function f
    - f: a function that takes x and returns (loss, gradient)
    - x: the input, obviously
    """
    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x)
    h = 1e-4

    # iterate over every dimension of x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_val = x[ix]
        # loss at x - h (re-seeding so f is deterministic)
        x[ix] = old_val - h
        random.setstate(rndstate)
        (fxh1, _) = f(x)
        # loss at x + h
        x[ix] = old_val + h
        random.setstate(rndstate)
        (fxh2, _) = f(x)
        # central-difference estimate of the gradient
        numgrad = (fxh2 - fxh1) / (2 * h)
        x[ix] = old_val
        # compare the analytic and numerical gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print "Gradient check failed."
            print "First gradient error found at index %s" % str(ix)
            print "Your gradient: %f \t Numerical gradient: %f" % (grad[ix], numgrad)
            return
        it.iternext()  # step to the next dimension
    print "Gradient check passed!"
(part g) (8 points)
Now, in q2_neural.py, write the forward- and back-propagation code for a neural network with one hidden layer and a sigmoid activation. Test your code with python q2_neural.py.
Aside: once you step into DL, it's deep as the sea…
import numpy as np
from q1_softmax import softmax
from q2_sigmoid import sigmoid, sigmoid_grad

def forward_backward_prop(data, labels, params, dimensions, verbose=False):
    """
    Forward and backward propagation for a neural network with one
    sigmoid hidden layer and a softmax output layer.
    dimensions = [Dx, H, Dy]: the input, hidden, and output sizes,
    passed in as in the assignment's starter code.
    """
    if len(data.shape) >= 2:
        (N, _) = data.shape

    ### unpack the parameters of each layer
    t = 0
    W1 = np.reshape(params[t:t + dimensions[0] * dimensions[1]], (dimensions[0], dimensions[1]))
    t += dimensions[0] * dimensions[1]
    b1 = np.reshape(params[t:t + dimensions[1]], (1, dimensions[1]))
    t += dimensions[1]
    W2 = np.reshape(params[t:t + dimensions[1] * dimensions[2]], (dimensions[1], dimensions[2]))
    t += dimensions[1] * dimensions[2]
    b2 = np.reshape(params[t:t + dimensions[2]], (1, dimensions[2]))

    ### forward pass
    # hidden layer: affine transform followed by sigmoid
    a1 = sigmoid(data.dot(W1) + b1)
    # output layer: affine transform followed by softmax
    a2 = softmax(a1.dot(W2) + b2)
    cost = -np.sum(np.log(a2[labels == 1])) / N

    ### backward pass
    # analytic gradient of the cross-entropy loss w.r.t. the softmax input
    grad_a2 = (a2 - labels) / N
    # backpropagate through the output layer
    gradW2 = np.dot(a1.T, grad_a2)
    gradb2 = np.sum(grad_a2, axis=0, keepdims=True)
    # backpropagate through the hidden layer
    grad_a1 = np.dot(grad_a2, W2.T) * sigmoid_grad(a1)
    gradW1 = np.dot(data.T, grad_a1)
    gradb1 = np.sum(grad_a1, axis=0, keepdims=True)

    if verbose:  # verbose mode for logging information
        print "W1 shape: {}".format(str(W1.shape))
        print "W1 gradient shape: {}".format(str(gradW1.shape))
        print "b1 shape: {}".format(str(b1.shape))
        print "b1 gradient shape: {}".format(str(gradb1.shape))

    ### stack all gradients into one flat vector
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
                           gradW2.flatten(), gradb2.flatten()))
    return cost, grad
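A sketch of checking this implementation with the part (f) gradient checker, along the lines of the assignment's sanity check (the shapes here are arbitrary):

from q2_gradcheck import gradcheck_naive
import random

dimensions = [10, 5, 10]   # Dx, H, Dy
N = 20
data = np.random.randn(N, dimensions[0])
labels = np.zeros((N, dimensions[2]))
for i in xrange(N):
    labels[i, random.randint(0, dimensions[2] - 1)] = 1
params = np.random.randn((dimensions[0] + 1) * dimensions[1] +
                         (dimensions[1] + 1) * dimensions[2])
gradcheck_naive(lambda params:
    forward_backward_prop(data, labels, params, dimensions), params)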
3 word2vec (40 points + 5 bonus points)
(part a) (3 points)
Assume you are given a predicted word vector $v_c$ for the center word $c$ in skip-gram, and that word prediction uses the softmax function found in word2vec models:
$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}$$
where $w$ indexes the words and $u_w$ ($w = 1, \dots, W$) are the "output" vectors of all words in the vocabulary. Assume the cross-entropy loss is applied to this prediction, with word $o$ the expected word. Derive the gradient with respect to $v_c$.
Hint: the notation from problem 2 will be helpful here. For instance: let $\hat{y}$ be the vector of softmax predictions over all words and $y$ the one-hot expected-word vector, so that the loss is
$$J = CE(y, \hat{y})$$
where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors.
Aside: yes, the aside has run out of things to say. Thank the Party and thank the motherland.
Solution: Let $\hat{y}$ be the softmax prediction vector and $y$ the one-hot label vector; then
$$\frac{\partial J}{\partial v_c} = U^\top (\hat{y} - y)$$
Or, equivalently:
$$\frac{\partial J}{\partial v_c} = -u_o + \sum_{w=1}^{W} \hat{y}_w \, u_w$$
(part b) (3 points)
Under the same conditions as the previous part, derive the gradient with respect to the output word vectors $u_w$ (including $u_o$).
Aside: I should just keep quietly stacking bricks back home in the Celestial Empire…
Solution:
$$\frac{\partial J}{\partial U} = v_c \, (\hat{y} - y)^\top$$
Or, equivalently, per word:
$$\frac{\partial J}{\partial u_o} = (\hat{y}_o - 1)\, v_c, \qquad \frac{\partial J}{\partial u_w} = \hat{y}_w \, v_c \ \ (w \ne o)$$
(part c) (6 points)
Continuing from (part a) and (part b), now assume we use the negative sampling loss for the predicted vector $v_c$, with expected output word $o$ and $K$ negative samples (words) drawn, indexed $1, \dots, K$ (with $o \notin \{1, \dots, K\}$). The negative sampling loss is:
$$J(o, v_c, U) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c)$$
where $\sigma$ is the sigmoid function. Derive the gradients with respect to $v_c$, $u_o$, and the negative-sample vectors $u_k$.
When you have done that, briefly describe why this loss function is much cheaper to compute than the softmax-CE loss (you may give the speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative-sampling loss).
Note: since we minimize the objective rather than maximize it, the loss given here is the negation of the one first described in Mikolov et al.'s original paper.
Aside: suddenly I remember how anxious I was as a kid over whether to go to Tsinghua or Peking University. Turns out I was overthinking it: even if I had lucked my way into your esteemed T or esteemed P, I could never have finished the degree.
Solution:
$$\frac{\partial J}{\partial v_c} = \left(\sigma(u_o^\top v_c) - 1\right) u_o + \sum_{k=1}^{K} \left(1 - \sigma(-u_k^\top v_c)\right) u_k$$
$$\frac{\partial J}{\partial u_o} = \left(\sigma(u_o^\top v_c) - 1\right) v_c, \qquad \frac{\partial J}{\partial u_k} = \left(1 - \sigma(-u_k^\top v_c)\right) v_c, \quad k = 1, \dots, K$$
Only $K + 1$ output vectors are touched instead of all $W$ of them, so computing this loss is roughly $W / (K + 1)$ times faster than softmax-CE.
(part d) (8 points)
Derive the gradients of all word vectors for skip-gram and CBOW, given the preceding parts and a context word set $[word_{c-m}, \dots, word_{c-1}, word_c, word_{c+1}, \dots, word_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" vectors of $word_k$ by $v_k$ and $u_k$ respectively.
Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the softmax-CE or negative-sampling loss from the previous parts.
Recall that in skip-gram, the loss of a context window centered at $c$ is:
$$J_{\text{skip-gram}}(word_{c-m \dots c+m}) = \sum_{-m \le j \le m,\ j \ne 0} F(w_{c+j}, v_c)$$
CBOW is slightly different: instead of $v_c$, it uses the sum of the context input vectors as the predicted vector,
$$\hat{v} = \sum_{-m \le j \le m,\ j \ne 0} v_{c+j}$$
and the CBOW loss is then defined as:
$$J_{\text{CBOW}}(word_{c-m \dots c+m}) = F(w_c, \hat{v})$$
Note: to stay consistent with the $\hat{v}$ notation, the expected word here is written $w_c$.
Aside: honest confession — this part I really did just copy down from the lecture slides.
Solution: for clarity, write the set of all output vectors in the vocabulary as $U$. For skip-gram, the loss gradients of one context window are:
$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial U}, \qquad \frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}, \qquad \frac{\partial J_{\text{skip-gram}}}{\partial v_j} = 0 \ (j \ne c)$$
Likewise, for CBOW:
$$\frac{\partial J_{\text{CBOW}}}{\partial U} = \frac{\partial F(w_c, \hat{v})}{\partial U}, \qquad \frac{\partial J_{\text{CBOW}}}{\partial v_j} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \ \ (j \in \{c-m, \dots, c-1, c+1, \dots, c+m\}), \qquad \frac{\partial J_{\text{CBOW}}}{\partial v_j} = 0 \ \text{otherwise}$$
(part e) (12 points)
In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function in q3_word2vec.py that normalizes the rows of a matrix. In the same file, implement the softmax and negative-sampling loss functions and their gradients, then implement the loss/gradient function for skip-gram. When all of that is done, test your code with:
python q3_word2vec.py
Note: if you choose not to implement CBOW (part h), simply remove the NotImplementedError so that your tests complete.
Aside: high-energy warning ahead — the amount of code explodes!
import numpy as np
import random

from q1_softmax import softmax
from q2_gradcheck import gradcheck_naive
from q2_sigmoid import sigmoid, sigmoid_grad

def normalizeRows(x):
    """
    Row-normalization function
    """
    N = x.shape[0]
    x /= np.sqrt(np.sum(x ** 2, axis=1)).reshape((N, 1)) + 1e-30
    return x

def test_normalize_rows():
    print "Testing normalizeRows..."
    x = normalizeRows(np.array([[3.0, 4.0], [1, 2]]))
    # expected result: [[0.6, 0.8], [0.4472, 0.8944]]
    print x
    assert (np.amax(np.fabs(x - np.array([[0.6, 0.8], [0.4472136, 0.89442719]]))) <= 1e-6)
    print ""
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """
    Softmax loss function for word2vec
    """
    # Inputs:
    # - predicted: numpy array for the predicted (center) word vector
    # - target: index of the target word
    # - outputVectors: "output" vectors for all tokens (one per row)
    # - dataset: used for negative sampling; unused here
    # Outputs:
    # - cost: the cross-entropy loss
    # - gradPred: the gradient with respect to the predicted word vector
    # - grad: the gradient with respect to all the other word vectors
    probabilities = softmax(predicted.dot(outputVectors.T))
    cost = -np.log(probabilities[target])
    delta = probabilities
    delta[target] -= 1
    N = delta.shape[0]
    D = predicted.shape[0]
    grad = delta.reshape((N, 1)) * predicted.reshape((1, D))
    gradPred = (delta.reshape((1, N)).dot(outputVectors)).flatten()
    return cost, gradPred, grad
def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
                               K=10):
    """
    Loss function and gradients for word2vec with negative sampling
    """
    grad = np.zeros(outputVectors.shape)
    gradPred = np.zeros(predicted.shape)

    # sample K negative indices, avoiding the target word itself
    indices = [target]
    for k in xrange(K):
        newidx = dataset.sampleTokenIdx()
        while newidx == target:
            newidx = dataset.sampleTokenIdx()
        indices += [newidx]

    # label +1 for the true word, -1 for every negative sample
    labels = np.array([1] + [-1 for k in xrange(K)])
    vecs = outputVectors[indices, :]

    t = sigmoid(vecs.dot(predicted) * labels)
    cost = -np.sum(np.log(t))

    delta = labels * (t - 1)
    gradPred = delta.reshape((1, K + 1)).dot(vecs).flatten()
    gradtemp = delta.reshape((K + 1, 1)).dot(predicted.reshape(
        (1, predicted.shape[0])))
    for k in xrange(K + 1):
        grad[indices[k]] += gradtemp[k, :]

    return cost, gradPred, grad
def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ Skip-gram model in word2vec """
    # Implementation of the skip-gram model
    # Inputs:
    # - currentWord: the string for the current center word
    # - C: context size (window size)
    # - contextWords: at most 2*C words
    # - tokens: dictionary mapping words to their indices in the vector lists
    # - inputVectors: "input" word vectors (as rows) for all tokens
    # - outputVectors: "output" word vectors (as rows) for all tokens
    # - word2vecCostAndGradient: the cost and gradient function for a
    #   prediction vector given the target word vectors; either of the two
    #   cost functions implemented above
    # Outputs:
    # - cost: the loss value computed by the skip-gram model
    # - grad: the gradients with respect to the word vectors
    currentI = tokens[currentWord]
    predicted = inputVectors[currentI, :]
    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    for cwd in contextWords:
        idx = tokens[cwd]
        cc, gp, gg = word2vecCostAndGradient(predicted, idx, outputVectors, dataset)
        cost += cc
        gradOut += gg
        gradIn[currentI, :] += gp
    return cost, gradIn, gradOut
def word2vec_sgd_wrapper(word2vecModel, tokens, wordVectors, dataset, C,
                         word2vecCostAndGradient=softmaxCostAndGradient):
    batchsize = 50
    cost = 0.0
    grad = np.zeros(wordVectors.shape)
    N = wordVectors.shape[0]
    # first half of the rows are input vectors, second half output vectors
    inputVectors = wordVectors[:N/2, :]
    outputVectors = wordVectors[N/2:, :]
    for i in xrange(batchsize):
        C1 = random.randint(1, C)
        centerword, context = dataset.getRandomContext(C1)
        if word2vecModel == skipgram:
            denom = 1
        else:
            denom = 1
        c, gin, gout = word2vecModel(centerword, C1, context, tokens,
                                     inputVectors, outputVectors, dataset,
                                     word2vecCostAndGradient)
        cost += c / batchsize / denom
        grad[:N/2, :] += gin / batchsize / denom
        grad[N/2:, :] += gout / batchsize / denom
    return cost, grad
def test_word2vec():
    # Interface to the dataset for negative sampling
    dataset = type('dummy', (), {})()
    def dummySampleTokenIdx():
        return random.randint(0, 4)
    def getRandomContext(C):
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0, 4)], \
            [tokens[random.randint(0, 4)] for i in xrange(2 * C)]
    dataset.sampleTokenIdx = dummySampleTokenIdx
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)
    dummy_vectors = normalizeRows(np.random.randn(10, 3))
    dummy_tokens = dict([("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)])

    print "==== Gradient check for skip-gram ===="
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(skipgram, dummy_tokens, vec, dataset, 5), dummy_vectors)
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(skipgram, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient), dummy_vectors)
    print "\n==== Gradient check for CBOW ===="
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(cbow, dummy_tokens, vec, dataset, 5), dummy_vectors)
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(cbow, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient), dummy_vectors)

    print "\n=== Results ==="
    print skipgram("c", 3, ["a", "b", "e", "d", "b", "c"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset)
    print skipgram("c", 1, ["a", "b"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset, negSamplingCostAndGradient)
    print cbow("a", 2, ["a", "b", "c", "a"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset)
    print cbow("a", 2, ["a", "b", "a", "c"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset, negSamplingCostAndGradient)

if __name__ == "__main__":
    test_normalize_rows()
    test_word2vec()
(part f) (4 points)
In q3_sgd.py, complete the implementation of the stochastic-gradient-descent optimizer, and run that file to test your implementation.
Aside: the thought that countless big shots whose IQ could crush mine may read this post makes my face burn.
# Implementation of stochastic gradient descent.
# Every 1000 iterations of SGD, save the parameters trained so far.
SAVE_PARAMS_EVERY = 1000

import glob
import random
import os.path as op
import cPickle as pickle
import sys

def load_saved_params():
    """
    Load previously saved parameters to avoid restarting training from scratch
    """
    st = 0
    for f in glob.glob("saved_params_*.npy"):
        iter = int(op.splitext(op.basename(f))[0].split("_")[2])
        if (iter > st):
            st = iter
    if st > 0:
        with open("saved_params_%d.npy" % st, "r") as f:
            params = pickle.load(f)
            state = pickle.load(f)
        return st, params, state
    else:
        return st, None, None

def save_params(iter, params):
    with open("saved_params_%d.npy" % iter, "w") as f:
        pickle.dump(params, f)
        pickle.dump(random.getstate(), f)
def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10, ANNEAL_EVERY=20000):
    """ Stochastic gradient descent """
    ###########################################################
    # Inputs:
    # - f: the function to optimize
    # - x0: the initial point for SGD
    # - step: the SGD step size
    # - iterations: total number of iterations
    # - postprocessing: parameter post-processing (e.g. word2vec
    #   normalizes the word vectors after each update)
    # - PRINT_EVERY: print the status every this many iterations
    # Output:
    # - x: the parameters after SGD finishes
    ###########################################################
    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)
        if state:
            random.setstate(state)
    else:
        start_iter = 0
    x = x0
    if not postprocessing:
        postprocessing = lambda x: x
    expcost = None
    for iter in xrange(start_iter + 1, iterations + 1):
        cost, grad = f(x)
        x = x - step * grad
        x = postprocessing(x)
        if iter % PRINT_EVERY == 0:
            print "Iter#{}, cost={}".format(iter, cost)
            sys.stdout.flush()
        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)
        if iter % ANNEAL_EVERY == 0:
            # anneal: halve the step size every ANNEAL_EVERY iterations
            step *= 0.5
    return x
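As a quick illustrative use (in the spirit of the assignment's own sanity check), sgd can minimize any function that returns a (loss, gradient) pair:

import numpy as np

quad = lambda x: (np.sum(x ** 2), x * 2)  # minimum at x = 0
t = sgd(quad, 0.5, 0.01, 1000, PRINT_EVERY=100)
print "final result: %f" % t  # should approach 0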
(part g) (4 points)
Showtime! Now we will load real data and train word vectors with everything you just implemented! We will train word vectors on the Stanford Sentiment Treebank (SST) dataset and later apply them to a sentiment analysis task. No new code needs to be written for this part; just run:
python q3_run.py
Note: training may take a long time, depending on how efficient your implementation is (an efficient implementation takes about one hour). Try to get close to that target!
When the script finishes, it produces a visualization of the word vectors, saved as q3_word_vectors.png in the project directory. Include the plot in your homework, and briefly explain, in at most three sentences, what you see in it.
Solution:
(part h) Bonus (5 points)
Complete the implementation of CBOW in q3_word2vec.py. Note: this part is optional, but the derivation of the CBOW gradients in part (d) is not!
def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
         dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """
    CBOW model for word2vec
    """
    cost = 0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    D = inputVectors.shape[1]
    # the predicted vector is the sum of the context input vectors
    predicted = np.zeros((D,))
    indices = [tokens[cwd] for cwd in contextWords]
    for idx in indices:
        predicted += inputVectors[idx, :]
    cost, gp, gradOut = word2vecCostAndGradient(predicted, tokens[currentWord], outputVectors, dataset)
    # each context word receives a copy of the predicted-vector gradient
    for idx in indices:
        gradIn[idx, :] += gp
    return cost, gradIn, gradOut
4 Sentiment analysis (20 points)
Now that we have trained word vectors, we are ready to put together a simple sentiment analysis demo. For each sentence in the Stanford Sentiment Treebank dataset, we use the average of all its word vectors as the feature and try to predict the sentiment level of the sentence. In the raw dataset the sentiment level of a phrase is a real number; here we bucket it into 5 categories, from "very negative" to "very positive", encoded 0 through 4. In this part you will learn to train a softmax regressor with SGD and to improve its generalization through train/dev validation.
(part a) (10 points)
Implement a sentence featurizer and softmax regression. Write your implementation in q4_softmaxreg.py, and run python q4_softmaxreg.py to test the functions you have just completed.
import numpy as np
import random

from cs224d.data_utils import *
from q1_softmax import softmax
from q2_gradcheck import gradcheck_naive
from q3_sgd import load_saved_params

def getSentenceFeature(tokens, wordVectors, sentence):
    """
    Quick-and-dirty featurization: use the average of all word vectors
    in the sentence as the input for sentiment analysis
    """
    # Inputs:
    # - tokens: a dictionary that maps words to their indices in the word vector list
    # - wordVectors: word vectors (each row) for all tokens
    # - sentence: a list of words in the sentence of interest
    # Output:
    # - sentVector: feature vector for the sentence
    sentVector = np.zeros((wordVectors.shape[1],))
    indices = [tokens[word] for word in sentence]
    sentVector = np.mean(wordVectors[indices, :], axis=0)
    return sentVector
def softmaxRegression(features, labels, weights, regularization=0.0, nopredictions=False):
    """ Softmax Regression """
    # Implement softmax regression with weight regularization.
    # Inputs:
    # - features: feature vectors, each row is a feature vector
    # - labels: labels corresponding to the feature vectors
    # - weights: weights of the regressor
    # - regularization: L2 regularization constant
    # Outputs:
    # - cost: cost of the regressor
    # - grad: gradient of the regressor cost with respect to its weights
    # - pred: label predictions of the regressor (you might find np.argmax helpful)
    prob = softmax(features.dot(weights))
    if len(features.shape) > 1:
        N = features.shape[0]
    else:
        N = 1
    # A vectorized implementation of
    # 1/N * sum(cross_entropy(x_i, y_i)) + regularization/2 * |w|^2
    cost = np.sum(-np.log(prob[range(N), labels])) / N
    cost += 0.5 * regularization * np.sum(weights ** 2)
    grad = np.array(prob)
    grad[range(N), labels] -= 1.0
    grad = features.T.dot(grad) / N
    grad += regularization * weights
    if N > 1:
        pred = np.argmax(prob, axis=1)
    else:
        pred = np.argmax(prob)
    if nopredictions:
        return cost, grad
    else:
        return cost, grad, pred
def accuracy(y, yhat):
    """ Classification accuracy """
    assert (y.shape == yhat.shape)
    return np.sum(y == yhat) * 100.0 / y.size

def softmax_wrapper(features, labels, weights, regularization=0.0):
    cost, grad, _ = softmaxRegression(features, labels, weights,
                                      regularization)
    return cost, grad
def sanity_check():
    """
    Run python q4_softmaxreg.py.
    """
    random.seed(314159)
    np.random.seed(265)

    dataset = StanfordSentiment()
    tokens = dataset.tokens()
    nWords = len(tokens)

    _, wordVectors0, _ = load_saved_params()
    # combine the input and output vector halves into one embedding table
    wordVectors = (wordVectors0[:nWords, :] + wordVectors0[nWords:, :])
    dimVectors = wordVectors.shape[1]

    dummy_weights = 0.1 * np.random.randn(dimVectors, 5)
    dummy_features = np.zeros((10, dimVectors))
    dummy_labels = np.zeros((10,), dtype=np.int32)
    for i in xrange(10):
        words, dummy_labels[i] = dataset.getRandomTrainSentence()
        dummy_features[i, :] = getSentenceFeature(tokens, wordVectors, words)
    print "==== Gradient check for softmax regression ===="
    gradcheck_naive(lambda weights: softmaxRegression(dummy_features,
        dummy_labels, weights, 1.0, nopredictions=True), dummy_weights)
    print "\n=== Results ==="
    print softmaxRegression(dummy_features, dummy_labels, dummy_weights, 1.0)

if __name__ == "__main__":
    sanity_check()
(part b) (2 points)
Explain, in fewer than three sentences, why we want to introduce regularization when doing classification (in fact, in most machine learning tasks).
Solution: to avoid overfitting the training set, which would generalize poorly to unseen data.
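Concretely, the objective being minimized (matching the code in part (a)) is average cross-entropy plus an L2 penalty; written out as a sketch:

$$J(W) = \frac{1}{N} \sum_{i=1}^{N} CE\left(y^{(i)}, \mathrm{softmax}(x^{(i)} W)\right) + \frac{\lambda}{2} \lVert W \rVert_2^2$$

The penalty term shrinks the weights, trading a little training accuracy for better generalization; its strength $\lambda$ is exactly the hyperparameter tuned in part (c).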
(part c) (4 points)
In q4_sentiment.py, implement the hyperparameter search to find the "best" regularization penalty. How did you choose it? Report your train, dev, and test accuracies, and justify your choice of hyperparameter in at most one sentence. Note: you should be able to reach at least 30% accuracy on dev.
Solution: a reference value is 1e-4, which gives train, dev, and test accuracies of 29.1%, 31.4%, and 27.6% respectively.
import numpy as np
import matplotlib.pyplot as plt
import random

from cs224d.data_utils import *
from q3_sgd import load_saved_params, sgd
from q4_softmaxreg import softmaxRegression, getSentenceFeature, accuracy, softmax_wrapper

# Try out different regularization constants and pick the best one
REGULARIZATION = [0.0, 0.00001, 0.00003, 0.0001, 0.0003, 0.001, 0.003, 0.01]

# Load the dataset
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)

# Load the pre-trained word vectors
_, wordVectors0, _ = load_saved_params()
wordVectors = (wordVectors0[:nWords, :] + wordVectors0[nWords:, :])
dimVectors = wordVectors.shape[1]

# Load the training set and prepare its features
trainset = dataset.getTrainSentences()
nTrain = len(trainset)
trainFeatures = np.zeros((nTrain, dimVectors))
trainLabels = np.zeros((nTrain,), dtype=np.int32)
for i in xrange(nTrain):
    words, trainLabels[i] = trainset[i]
    trainFeatures[i, :] = getSentenceFeature(tokens, wordVectors, words)

# Load the dev set and prepare its features
devset = dataset.getDevSentences()
nDev = len(devset)
devFeatures = np.zeros((nDev, dimVectors))
devLabels = np.zeros((nDev,), dtype=np.int32)
for i in xrange(nDev):
    words, devLabels[i] = devset[i]
    devFeatures[i, :] = getSentenceFeature(tokens, wordVectors, words)
# Try each regularization constant
results = []
for regularization in REGULARIZATION:
    random.seed(3141)
    np.random.seed(59265)
    weights = np.random.randn(dimVectors, 5)
    print "Training for reg=%f" % regularization
    # batch optimization over the full training set
    weights = sgd(lambda weights: softmax_wrapper(trainFeatures, trainLabels,
        weights, regularization), weights, 3.0, 10000, PRINT_EVERY=100)
    # evaluate on the training set
    _, _, pred = softmaxRegression(trainFeatures, trainLabels, weights)
    trainAccuracy = accuracy(trainLabels, pred)
    print "Train accuracy (%%): %f" % trainAccuracy
    # evaluate on the dev set
    _, _, pred = softmaxRegression(devFeatures, devLabels, weights)
    devAccuracy = accuracy(devLabels, pred)
    print "Dev accuracy (%%): %f" % devAccuracy
    # save the resulting weights
    results.append({
        "reg": regularization,
        "weights": weights,
        "train": trainAccuracy,
        "dev": devAccuracy})

# Print the accuracies
print ""
print "=== Recap ==="
print "Reg\t\tTrain\t\tDev"
for result in results:
    print "%E\t%f\t%f" % (
        result["reg"],
        result["train"],
        result["dev"])
print ""
# Pick the best regularization constant by dev accuracy
BEST_REGULARIZATION = None
BEST_WEIGHTS = None
best_dev = 0
for result in results:
    if result["dev"] > best_dev:
        best_dev = result["dev"]
        BEST_REGULARIZATION = result["reg"]
        BEST_WEIGHTS = result["weights"]

# Test your findings on the test set
testset = dataset.getTestSentences()
nTest = len(testset)
testFeatures = np.zeros((nTest, dimVectors))
testLabels = np.zeros((nTest,), dtype=np.int32)
for i in xrange(nTest):
    words, testLabels[i] = testset[i]
    testFeatures[i, :] = getSentenceFeature(tokens, wordVectors, words)
_, _, pred = softmaxRegression(testFeatures, testLabels, BEST_WEIGHTS)
print "Best regularization value: %E" % BEST_REGULARIZATION
print "Test accuracy (%%): %f" % accuracy(testLabels, pred)

# Plot the relationship between regularization and accuracy
plt.plot(REGULARIZATION, [x["train"] for x in results])
plt.plot(REGULARIZATION, [x["dev"] for x in results])
plt.xscale('log')
plt.xlabel("regularization")
plt.ylabel("accuracy")
plt.legend(['train', 'dev'], loc='upper left')
plt.savefig("q4_reg_v_acc.png")
plt.show()
(part d) (4 points) Plot the classification accuracy on the train and dev sets, with the regularization values on the x axis on a log scale. This should be produced automatically. Include the plot q4_reg_v_acc.png in your homework, and briefly explain, in at most three sentences, what it shows.
Solution: