Preface
The evaluation metrics for recommender systems vary with the type of recommendation scenario. Some of these metrics did not originate in recommender systems at all but were carried over from related fields such as search, information retrieval, and machine learning, so many explanations found online are not written from a recommender-system perspective, which confused me while learning. This series is my attempt to organize things strictly from the recommender-system angle and to spell out the concrete computation procedures.
If you are interested in this series (unfinished, continuously updated), use the links below:
The dataset for this series: [Recommendation Algorithms] Building a Recommender from Scratch (1) — Getting to Know the Data
The evaluation metrics for this series: [Recommendation Algorithms] Building a Recommender from Scratch (2) — Evaluation Metrics for Recommender Systems: Computation Principles and Worked Examples
[Recommendation Algorithms] Building a Recommender from Scratch (3) — TopK Recommendation in Practice with Traditional Matrix Factorization
[Recommendation Algorithms] Building a Recommender from Scratch (4) — Matrix Factorization TopK Recommendation with Embedding in the Python Keras Framework
[Recommendation Algorithms] Building a Recommender from Scratch (5) — Bayesian Personalized Ranking Matrix Factorization (BPRMF) in Practice
[Recommendation Algorithms] Building a Recommender from Scratch (6) — A TensorFlow Version of Bayesian Personalized Ranking Matrix Factorization (BPRMF)
The Relationship Between Matrix Factorization and Embedding
Since word vectors (Word2Vec) came out, embedding methods have appeared one after another, and some recommendation work borrows the embedding idea as well. Embedding is best understood as a way of extracting features; as the saying goes, "everything can be embedded." Below we bring this idea into recommendation algorithms.
In NLP, we map each word to a K-dimensional word vector and then use these vectors for more complex tasks; for a simple task like finding related words, we can compute similarities between word vectors directly. In the recommendation setting we have two entities, users and items. If we could embed users and items into the same space and then compute similarities between them, wouldn't that directly accomplish the goal of recommendation?
Matrix factorization makes exactly this split: each row of the User matrix is a user's embedding vector, each column of the Item matrix is an item's embedding vector, and both live in the same K-dimensional space. Matrix multiplication is, at its core, nothing but dot products: each row of User is dotted with each column of Item. And since a·b = |a||b|cosθ, isn't each predicted entry essentially a similarity score?
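This row-times-column view is easy to check numerically. Below is a toy NumPy sketch (the matrix sizes and values are made up purely for illustration, not taken from the training code later in this article) showing that every entry of the reconstructed rating matrix is exactly the dot product of one user embedding with one item embedding:

```python
import numpy as np

np.random.seed(0)
K = 3                      # embedding dimension
U = np.random.rand(4, K)   # 4 users, one K-dim embedding per row
V = np.random.rand(5, K)   # 5 items, one K-dim embedding per row

# The full predicted rating matrix is just User x Item^T ...
R_hat = U @ V.T            # shape (4, 5)

# ... and any single entry is the dot product of the two embeddings,
# i.e. |a||b|cos(theta) up to the vector norms.
u, i = 1, 2
assert np.isclose(R_hat[u, i], np.dot(U[u], V[i]))
```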
By now you can see that matrix factorization is itself a kind of embedding; the two roads lead to the same place. That, in turn, makes it possible to implement matrix factorization within a neural-network framework; the overall architecture is shown in the figure below.
Introduction to the Keras Framework
Keras is a framework for building neural networks. It is well encapsulated, simple to use, and quite friendly to beginners. It offers two model styles: the sequential model, which proceeds strictly layer by layer along a single path, and the functional model, which supports multiple inputs. Since our input consists of both User and Item, we use the functional model.
Core Algorithm
My first implementation followed this article, but readers pointed out that it has no regularization and that the factorized matrices can contain negative values. Addressing these two problems gives the code below:
def Recmand_model(num_user, num_item, d):
    K.clear_session()
    input_user = Input(shape=[None, ], dtype="int32")
    model_user = Embedding(num_user, d, input_length=1,
                           embeddings_constraint=non_neg()  # non-negative, same below
                           )(input_user)
    model_user = Dropout(0.2)(model_user)
    model_user = BatchNormalization()(model_user)
    model_user = Reshape((d,))(model_user)
    input_item = Input(shape=[None, ], dtype="int32")
    model_item = Embedding(num_item, d, input_length=1,
                           embeddings_constraint=non_neg()
                           )(input_item)
    model_item = Dropout(0.2)(model_item)
    model_item = BatchNormalization()(model_item)
    model_item = Reshape((d,))(model_item)
    out = Dot(1)([model_user, model_item])  # dot product
    model = Model(inputs=[input_user, input_item], outputs=out)
    model.compile(loss='mse', optimizer='sgd')
    model.summary()
    return model
On non-negativity: it requires every entry of the two factor matrices to be non-negative. Implementing it is simple, since the Embedding layer happens to take a constraint parameter. Conceptually, though, things are fuzzier. Movie ratings lie in [1, 5], so why must the predicted scores be non-negative? Couldn't a negative score simply mean the user dislikes the item? And since the recommendation result only depends on the ordering of the scores, is non-negativity really guaranteed to help?
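The last point, that TopK recommendation depends only on the ordering of the scores, can be checked directly: adding any constant to all predictions (for example, to push negative scores up to non-negative values) leaves the TopK list unchanged. A small sketch with made-up scores:

```python
import numpy as np

scores = np.array([-1.2, 0.7, -0.3, 2.1, 0.0])  # hypothetical predictions, some negative
k = 3

topk_raw = np.argsort(-scores)[:k]               # indices of the k largest scores
topk_shifted = np.argsort(-(scores + 10.0))[:k]  # shift every score to be non-negative

# The ordering, and hence the TopK list, is identical.
assert np.array_equal(topk_raw, topk_shifted)
```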
Beyond that, a few standard neural-network tricks were tried: Dropout to curb overfitting and BatchNormalization to make it easier to approach the optimum. The Keras structure diagram below shows the model at a glance.
Verifying Effectiveness
As before, we first check whether the core algorithm really performs matrix factorization. In the result below, the non-zero entries are close to the original matrix, which completes the verification. The verification code follows.
'''
Created on Fri Oct 18 15:08:00 2019
@author: YLC
'''
import os
import numpy as np
import pandas as pd
import time
import math
from keras import Model
import keras.backend as K
from keras.layers import Embedding, Reshape, Input, Dot, Dense, Dropout, concatenate, BatchNormalization
from keras.models import load_model
from keras.utils import to_categorical
from keras import regularizers
from keras.constraints import non_neg

def Recmand_model(num_user, num_item, d):
    K.clear_session()
    input_user = Input(shape=[None, ], dtype="int32")
    model_user = Embedding(num_user, d, input_length=1,
                           embeddings_constraint=non_neg()  # non-negative, same below
                           )(input_user)
    model_user = Dropout(0.2)(model_user)
    model_user = BatchNormalization()(model_user)
    model_user = Reshape((d,))(model_user)
    input_item = Input(shape=[None, ], dtype="int32")
    model_item = Embedding(num_item, d, input_length=1,
                           embeddings_constraint=non_neg()
                           )(input_item)
    model_item = Dropout(0.2)(model_item)
    model_item = BatchNormalization()(model_item)
    model_item = Reshape((d,))(model_item)
    out = Dot(1)([model_user, model_item])  # dot product
    model = Model(inputs=[input_user, input_item], outputs=out)
    model.compile(loss='mse', optimizer='sgd')
    model.summary()
    return model

def train(num_user, num_item, train_data, d, step):
    model = Recmand_model(num_user, num_item, d)
    train_user = train_data[:, 0]
    train_item = train_data[:, 1]
    train_x = [train_user, train_item]
    train_y = train_data[:, 2]
    model.fit(train_x, train_y, batch_size=4, epochs=step)
    model.save("./MFmodel.h5")

def test(num_user, num_item, R):
    model = load_model('./MFmodel.h5')
    nR = np.zeros([num_user, num_item])
    for i in range(num_user):
        for j in range(num_item):
            nR[i][j] = model.predict([[i], [j]])
    return nR

def cal_e(R, nR):  # mean squared error over the observed (non-zero) entries
    e = 0
    cnt = 0
    for i in range(len(R)):
        for j in range(len(R[0])):
            if R[i][j] != 0:
                cnt = cnt + 1
                e = e + math.pow(R[i][j] - nR[i][j], 2)
    e = 1.0 * e / cnt
    return e

def RtransT(R):  # flatten the rating matrix R into (user, item, rating) triples
    user = [u for u in range(len(R))]
    item = [i for i in range(len(R[0]))]
    Table = []
    for i in user:
        for j in item:
            if R[i][j] != 0:
                Table.append([i, j, R[i][j]])
    Table = np.array(Table)
    return Table

def NMF(R, d, step):
    T = RtransT(R)
    M = len(R)
    N = len(R[0])
    train(M, N, T, d, step)
    nR = test(M, N, R)
    e = cal_e(R, nR)
    return e, nR

if __name__ == '__main__':
    R = [
        [5, 2, 0, 3, 1],
        [0, 2, 1, 4, 5],
        [1, 1, 0, 2, 4],
        [2, 2, 0, 5, 0]
    ]
    R = np.array(R)
    dimension = 3
    step = 2000
    e, nR = NMF(R, dimension, step)
    print('----- original matrix R ------')
    print(R)
    print('----- approximated matrix nR ------')
    print(nR)
    print('e is:', e)
Training
The model is built from three parameters: the number of users, the number of items, and the embedding dimension. Its inputs are the user records, the item records, and the true ratings from the training set. batch_size is the mini-batch size, epochs is the number of training passes, and .h5 is the HDF5 file format used to save the model.
def train(all_user, all_item, train_data, d):
    num_user = max(all_user) + 1
    num_item = max(all_item) + 1
    model = Recmand_model(num_user, num_item, d)
    train_user = train_data['user'].values
    train_item = train_data['item'].values
    train_x = [train_user, train_item]
    # train_data['rating'] = 1  # uncomment to train on 0-1 feedback instead of ratings
    train_y = train_data['rating'].values
    model.fit(train_x, train_y, batch_size=128, epochs=8)
    plot_model(model, to_file='./NN MF/NNMF.png', show_shapes=True)  # export the architecture diagram
    model.save("./NN MF/MFmodel.h5")
Testing
As in (III), we add the restriction of only recommending items the user has not seen before.
def test(train_data, test_data, all_item, k):
    model = load_model('./NN MF/MFmodel.h5')
    PRE = 0
    REC = 0
    MAP = 0
    MRR = 0
    AP = 0
    HITS = 0
    sum_R = 0
    sum_T = 0
    valid_cnt = 0
    stime = time.time()
    test_user = np.unique(test_data['user'])
    for user in test_user:
        visited_item = list(train_data[train_data['user'] == user]['item'])
        # print('visited items:', visited_item)
        if len(visited_item) == 0:  # no training data for this user, skip
            continue
        per_st = time.time()
        testlist = list(test_data[test_data['user'] == user]['item'].drop_duplicates())  # deduplicate, keep first
        testlist = list(set(testlist) - set(testlist).intersection(set(visited_item)))  # drop already-visited items
        if len(testlist) == 0:  # nothing left after filtering, skip
            continue
        valid_cnt = valid_cnt + 1  # number of valid test users
        poss = {}
        for item in all_item:
            if item in visited_item:
                continue
            poss[item] = float(model.predict([[user], [item]]))
        rankedlist, test_score = topk(poss, k)
        # print("TopK recommendations:", rankedlist)
        # print("actually visited:", testlist)
        # print("time for this recommendation:", time.time() - per_st)
        AP_i, len_R, len_T, MRR_i, HITS_i = cal_indicators(rankedlist, testlist)
        AP += AP_i
        sum_R += len_R
        sum_T += len_T
        MRR += MRR_i
        HITS += HITS_i
    etime = time.time()
    PRE = HITS / (sum_R * 1.0)
    REC = HITS / (sum_T * 1.0)
    MAP = AP / (valid_cnt * 1.0)
    MRR = MRR / (valid_cnt * 1.0)
    p_time = (etime - stime) / valid_cnt
    print('evaluation metrics:')
    print('PRE@', k, ':', PRE)
    print('REC@', k, ':', REC)
    print('MAP@', k, ':', MAP)
    print('MRR@', k, ':', MRR)
    print('average time per recommendation:', p_time)
    with open('./NN MF/result_' + dsname + '.txt', 'w') as f:
        f.write('evaluation metrics:\n')
        f.write('PRE@' + str(k) + ':' + str(PRE) + '\n')
        f.write('REC@' + str(k) + ':' + str(REC) + '\n')
        f.write('MAP@' + str(k) + ':' + str(MAP) + '\n')
        f.write('MRR@' + str(k) + ':' + str(MRR) + '\n')
        f.write('average time per recommendation@' + str(k) + ':' + str(p_time) + '\n')
Main Function
Unlike in (III), the parameters here are set as follows.
if __name__ == '__main__':
    dsname = 'ML100K'
    dformat = ['user', 'item', 'rating', 'time']
    all_user, all_item, train_data, test_data = getUI(dsname, dformat)  # uncomment on first use
    d = 60  # latent-factor dimension
    steps = 10
    k = 10
    train(all_user, all_item, train_data, d)
    test(train_data, test_data, all_item, k)
Experimental Results
First, the results on ML100K.
Next, the results on ML1M.
Going Deeper: Hard Questions
Question 1. Which is better: matrix factorization with a neural network (hereafter NNMF, Neural Network Matrix Factorization) or traditional matrix factorization?
Judging purely from the experimental results, Basic MF does better; the traditional matrix factorization results are shown below. In my view, however, NNMF has more potential. Compared with the traditional factorization procedure, doing matrix factorization with a neural network has the following advantages:
1) NNMF trains far faster than Basic MF, because NNMF embeds the IDs directly without constructing the full rating matrix, and the neural-network optimizers come ready-made.
2) NNMF enjoys the dividends of neural networks: in principle, any technique that benefits neural networks can be carried over, such as the Dropout and BatchNormalization used in this article.
Why, then, are its results worse? My take is that on one hand the hyperparameters were not tuned to their best, and on the other hand there is not enough data; as everyone knows, neural networks are data-driven. Still, in this experiment the gap between the two is not large.
Question 2. How do the rating matrix and the 0-1 matrix compare in this experiment?
Below are the 0-1 matrix results. They are clearly almost useless, though with a large enough dataset there might still be a chance.
Complete Code
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 18 15:08:00 2019
@author: YLC
"""
import os
import numpy as np
import pandas as pd
import time
import math
from keras import Model
import keras.backend as K
from keras.layers import Embedding, Reshape, Input, Dot, Dense, Dropout, concatenate, BatchNormalization
from keras.models import load_model
from keras.utils import plot_model, to_categorical
from keras import regularizers
from keras.constraints import non_neg
from keras import optimizers

def getUI(dsname, dformat):  # collect all users and items
    st = time.time()
    train = pd.read_csv(dsname + '_train.txt', header=None, names=dformat)
    test = pd.read_csv(dsname + '_test.txt', header=None, names=dformat)
    data = pd.concat([train, test])
    all_user = np.unique(data['user'])
    all_item = np.unique(data['item'])
    train.sort_values(by=['user', 'item'], axis=0, inplace=True)  # sort by user, then by item
    if not os.path.exists('./NN MF'):
        os.mkdir('./NN MF')
    train.to_csv('./NN MF/train.txt', index=False, header=0)
    test.to_csv('./NN MF/test.txt', index=False, header=0)
    et = time.time()
    print("get UI complete! cost time:", et - st)
    return all_user, all_item, train, test

def topk(dic, k):  # take the k keys with the largest values
    keys = []
    values = []
    for i in range(0, k):
        key, value = max(dic.items(), key=lambda x: x[1])
        keys.append(key)
        values.append(value)
        dic.pop(key)
    return keys, values

def cal_indicators(rankedlist, testlist):
    HITS_i = 0
    sum_precs = 0
    AP_i = 0
    len_R = 0
    len_T = 0
    MRR_i = 0
    for n in range(len(rankedlist)):
        if rankedlist[n] in testlist:
            HITS_i += 1
            sum_precs += HITS_i / (n + 1.0)
            if MRR_i == 0:
                MRR_i = 1.0 / (n + 1)  # reciprocal rank of the first hit
    if HITS_i > 0:
        AP_i = sum_precs / len(testlist)
    len_R = len(rankedlist)
    len_T = len(testlist)
    return AP_i, len_R, len_T, MRR_i, HITS_i

def Recmand_model(num_user, num_item, d):
    K.clear_session()
    input_user = Input(shape=[None, ], dtype="int32")
    model_user = Embedding(num_user, d, input_length=1,
                           embeddings_constraint=non_neg()  # non-negative, same below
                           )(input_user)
    model_user = Dropout(0.2)(model_user)
    model_user = BatchNormalization()(model_user)
    model_user = Reshape((d,))(model_user)
    input_item = Input(shape=[None, ], dtype="int32")
    model_item = Embedding(num_item, d, input_length=1,
                           embeddings_constraint=non_neg()
                           )(input_item)
    model_item = Dropout(0.2)(model_item)
    model_item = BatchNormalization()(model_item)
    model_item = Reshape((d,))(model_item)
    out = Dot(1)([model_user, model_item])  # dot product
    model = Model(inputs=[input_user, input_item], outputs=out)
    model.compile(loss='mse', optimizer='sgd')
    model.summary()
    return model

def train(all_user, all_item, train_data, d):
    num_user = max(all_user) + 1
    num_item = max(all_item) + 1
    model = Recmand_model(num_user, num_item, d)
    train_user = train_data['user'].values
    train_item = train_data['item'].values
    train_x = [train_user, train_item]
    # train_data['rating'] = 1  # uncomment to train on 0-1 feedback instead of ratings
    train_y = train_data['rating'].values
    model.fit(train_x, train_y, batch_size=128, epochs=8)
    plot_model(model, to_file='./NN MF/NNMF.png', show_shapes=True)  # export the architecture diagram
    model.save("./NN MF/MFmodel.h5")

def test(train_data, test_data, all_item, k):
    model = load_model('./NN MF/MFmodel.h5')
    PRE = 0
    REC = 0
    MAP = 0
    MRR = 0
    AP = 0
    HITS = 0
    sum_R = 0
    sum_T = 0
    valid_cnt = 0
    stime = time.time()
    test_user = np.unique(test_data['user'])
    for user in test_user:
        visited_item = list(train_data[train_data['user'] == user]['item'])
        # print('visited items:', visited_item)
        if len(visited_item) == 0:  # no training data for this user, skip
            continue
        per_st = time.time()
        testlist = list(test_data[test_data['user'] == user]['item'].drop_duplicates())  # deduplicate, keep first
        testlist = list(set(testlist) - set(testlist).intersection(set(visited_item)))  # drop already-visited items
        if len(testlist) == 0:  # nothing left after filtering, skip
            continue
        valid_cnt = valid_cnt + 1  # number of valid test users
        poss = {}
        for item in all_item:
            if item in visited_item:
                continue
            poss[item] = float(model.predict([[user], [item]]))
        rankedlist, test_score = topk(poss, k)
        # print("TopK recommendations:", rankedlist)
        # print("actually visited:", testlist)
        # print("time for this recommendation:", time.time() - per_st)
        AP_i, len_R, len_T, MRR_i, HITS_i = cal_indicators(rankedlist, testlist)
        AP += AP_i
        sum_R += len_R
        sum_T += len_T
        MRR += MRR_i
        HITS += HITS_i
    etime = time.time()
    PRE = HITS / (sum_R * 1.0)
    REC = HITS / (sum_T * 1.0)
    MAP = AP / (valid_cnt * 1.0)
    MRR = MRR / (valid_cnt * 1.0)
    p_time = (etime - stime) / valid_cnt
    print('evaluation metrics:')
    print('PRE@', k, ':', PRE)
    print('REC@', k, ':', REC)
    print('MAP@', k, ':', MAP)
    print('MRR@', k, ':', MRR)
    print('average time per recommendation:', p_time)
    with open('./NN MF/result_' + dsname + '.txt', 'w') as f:
        f.write('evaluation metrics:\n')
        f.write('PRE@' + str(k) + ':' + str(PRE) + '\n')
        f.write('REC@' + str(k) + ':' + str(REC) + '\n')
        f.write('MAP@' + str(k) + ':' + str(MAP) + '\n')
        f.write('MRR@' + str(k) + ':' + str(MRR) + '\n')
        f.write('average time per recommendation@' + str(k) + ':' + str(p_time) + '\n')

if __name__ == '__main__':
    dsname = 'ML100K'
    dformat = ['user', 'item', 'rating', 'time']
    all_user, all_item, train_data, test_data = getUI(dsname, dformat)  # uncomment on first use
    d = 60  # latent-factor dimension
    steps = 10
    k = 10
    train(all_user, all_item, train_data, d)
    test(train_data, test_data, all_item, k)