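While working my way through Python (somewhere on the road from beginner to giving up), I built a movie recommender on the MovieLens ml-100k dataset. It uses the Pearson correlation coefficient to measure how similar every other user in the dataset is to the target user, takes the 50 most similar users, computes a similarity-weighted recommendation score for each movie, sorts the scores, and recommends the 10 highest-scoring movies.
The implementation is described step by step below: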
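0. Preparation
First we need to understand the layout of the dataset files. The dataset is old and hard to read at a glance, so I looked it up and found the following:
u.data: the full dataset, 100,000 ratings by 943 users on 1,682 movies; each line holds user id, movie id, rating, and timestamp, separated by tabs
Ratings range from 1 to 5
Each user has rated at least 20 movies
u.info: the number of users, items, and ratings
u.item: movie information; the fields are separated by the '|' character
u.genre: movie genres, numbered 0-18
u.user: basic user information: id, age, gender, occupation, zip code; the ids match those in u.data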
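To make the two layouts concrete, here is a small sketch that parses one line in each format. The sample lines are illustrative strings typed in by hand, and parse_rating and parse_user are helper names I made up for this example; they are not part of the program below.

```python
# Sample lines written in the layouts described above (illustrative values).
sample_rating = "196\t242\t3\t881250949\n"   # u.data: user id, movie id, rating, timestamp
sample_user = "1|24|M|technician|85711\n"     # u.user: id, age, gender, occupation, zip code

def parse_rating(line):
    """Split one tab-separated u.data line into four integers."""
    user_id, item_id, rating, timestamp = map(int, line.split("\t"))
    return user_id, item_id, rating, timestamp

def parse_user(line):
    """Split one '|'-separated u.user line; the zip code stays a string."""
    user_id, age, gender, occupation, zip_code = line.rstrip("\n").split("|")
    return int(user_id), int(age), gender, occupation, zip_code

print(parse_rating(sample_rating))
print(parse_user(sample_user))
```

The zip code is kept as text on purpose, since a zip code is an identifier rather than a number to do arithmetic on.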
1. Load the data files: the ratings file u.data and the movie-title file u.item.
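def loadData():
    # Read u.data and keep [user id, movie id, rating] from every line.
    data = []
    with open('u.data') as f:
        for line in f:
            fields = line.split('\t')
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            data.append([int(v) for v in fields[0:3]])
    return data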
def loadMovieName():
    # Read u.item and return the movie titles in file order (movie id 1 first).
    name = []
    with open('u.item', encoding='ISO-8859-1') as f:
        for line in f:
            fields = line.split('|')  # movie id | title | release date | ...
            if len(fields) < 2:
                continue  # skip blank lines
            name.append(fields[1])  # keep only the title
    return name
When loading the movie titles I hit an encoding problem: the gbk codec could not read this file, so the open() call needs encoding='ISO-8859-1'.
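As a rough illustration of why the explicit encoding helps, the sketch below decodes a byte string containing 0xE9 (an accented "e" in ISO-8859-1). The title bytes are a made-up example rather than a line copied from u.item, and I use UTF-8 as the stricter codec here because its failure on this input is easy to reproduce on any machine.

```python
raw = b"Les Mis\xe9rables (1995)"   # hypothetical title bytes containing 0xE9

# Python's ISO-8859-1 codec maps every byte value to a character,
# so decoding with it cannot raise an error.
title = raw.decode("ISO-8859-1")
print(title)

# UTF-8 rejects the same bytes: 0xE9 starts a multi-byte sequence,
# but the byte that follows is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

A codec that never fails can still produce the wrong characters if the file was written in a different encoding, so it is worth checking a few titles by eye.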
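2. Consolidate and process the data
Now that we know the file layout, we can write a function that turns the ratings into a list with one row per user, where each row holds that user's rating for every movie; in other words, a 943 x 1682 matrix.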
def manageDate(data):
    # Build a 943 x 1682 matrix; 0 means the user has not rated the movie.
    outdata = [[0] * 1682 for _ in range(943)]
    for h in data:
        outdata[h[0] - 1][h[1] - 1] = h[2]  # user and movie ids start at 1
    return outdata
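The same idea on a toy scale may be easier to follow. In this sketch build_matrix is a hypothetical, size-parameterised helper, and the four ratings are invented. It also shows a pitfall worth knowing when creating nested lists: repeating a row with * copies the reference, not the row.

```python
def build_matrix(triples, n_users, n_items):
    """Return an n_users x n_items list of lists; 0 marks 'not rated'."""
    matrix = [[0] * n_items for _ in range(n_users)]  # one fresh row per user
    for user_id, item_id, rating in triples:
        matrix[user_id - 1][item_id - 1] = rating     # ids are 1-based
    return matrix

toy = [(1, 1, 5), (1, 3, 2), (2, 2, 4), (3, 1, 1)]    # invented (user, movie, rating) triples
matrix = build_matrix(toy, n_users=3, n_items=3)
for row in matrix:
    print(row)

# Pitfall: this creates two references to the SAME row object.
aliased = [[0] * 2] * 2
aliased[0][0] = 1
print(aliased)  # both rows change together
```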
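3. Compute the correlation coefficients
This step has two parts and three functions: first compute the mean of each vector (list), then compute the correlation coefficient; the third function applies this to every user in the dataset.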
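For reference, this is the standard Pearson formula that the code in this step evaluates, where x and y are the rating rows of two users, n is the number of movies, and the bars denote the means of each row:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The result lies between -1 and 1, and it is undefined when either row is constant, because the denominator is then zero.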
import math

def calcMean(x, y):
    # Mean of each vector.
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return x_mean, y_mean

def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)  # means of x and y
    n = len(x)
    sumTop = 0.0
    x_pow = 0.0
    y_pow = 0.0
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
        x_pow += (x[i] - x_mean) ** 2
        y_pow += (y[i] - y_mean) ** 2
    sumBottom = math.sqrt(x_pow * y_pow)
    if sumBottom == 0:
        return 0.0  # a constant vector has no defined correlation
    return sumTop / sumBottom
def calcAttribute(dataSet, num):
    # Pearson coefficient between user `num` (1-based id) and every user.
    prr = []
    y = dataSet[num - 1]  # ratings of the target user
    for x in dataSet:
        prr.append(calcPearson(x, y))
    return prr
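A quick sanity check on tiny vectors can confirm that a Pearson function behaves as expected. The sketch below uses a compact standalone pearson function written for this example. It also includes pearson_corated, a hypothetical variant that is not what the program above does: the functions above treat an unrated movie (0) as if it were an actual rating, whereas the variant only compares movies that both users have rated.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if undefined."""
    n = len(x)
    if n == 0:
        return 0.0
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    top = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    bottom = math.sqrt(sum((a - x_mean) ** 2 for a in x) *
                       sum((b - y_mean) ** 2 for b in y))
    return top / bottom if bottom else 0.0

def pearson_corated(x, y):
    """Hypothetical variant: use only positions where both ratings are non-zero."""
    pairs = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    return pearson([a for a, _ in pairs], [b for _, b in pairs])

print(pearson([1, 2, 3], [2, 4, 6]))      # perfectly aligned
print(pearson([1, 2, 3], [3, 2, 1]))      # perfectly opposed
print(pearson([2, 2, 2], [1, 2, 3]))      # constant vector, reported as 0.0
print(pearson_corated([5, 0, 3, 4], [4, 2, 0, 5]))
```

Whether to count unrated movies as zeros is a modelling choice; with only a couple of co-rated movies the restricted variant becomes very noisy, so neither option is obviously better.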
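4. Choose the movies
We use the strategy from the introduction: take the 50 most similar users, compute a similarity-weighted recommendation score for each movie, sort, and recommend the 10 movies with the highest scores.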
def choseMovie(outdata, num):
    prr = calcAttribute(outdata, num)
    # Rank the other users by similarity, leaving out the target user.
    users = [[i, prr[i]] for i in range(len(outdata)) if i != num - 1]
    users.sort(key=lambda u: u[1], reverse=True)
    # Score every movie: similarity-weighted ratings of the 50 most similar users.
    movie_rank = [[j, 0] for j in range(1682)]
    for i, similarity in users[:50]:
        for j in range(1682):
            movie_rank[j][1] += outdata[i][j] * similarity / 50
    movie_rank.sort(key=lambda m: m[1], reverse=True)
    # Keep the 10 best-scored movies that the target user has not rated yet.
    out_list = []
    for movie in movie_rank:
        if outdata[num - 1][movie[0]] == 0:
            out_list.append(movie)
            if len(out_list) == 10:
                break
    return out_list
This returns the index and score of each recommended movie.
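To see the scoring rule at work, here is a toy run with 4 users, 3 movies, and the 2 nearest neighbours instead of 50. The function score_items, the ratings, and the similarity values are all made up for this sketch; the similarities are typed in directly rather than computed.

```python
def score_items(ratings, similarities, target, k):
    """Rank the target's unrated items using its k most similar users."""
    neighbours = sorted(
        (u for u in range(len(ratings)) if u != target),
        key=lambda u: similarities[u],
        reverse=True,
    )[:k]
    n_items = len(ratings[target])
    # Score of an item = sum of (neighbour rating * neighbour similarity) / k.
    scores = [
        sum(ratings[u][j] * similarities[u] for u in neighbours) / k
        for j in range(n_items)
    ]
    unrated = [j for j in range(n_items) if ratings[target][j] == 0]
    return sorted(((j, scores[j]) for j in unrated),
                  key=lambda pair: pair[1], reverse=True)

ratings = [
    [5, 0, 0],   # target user (index 0) has rated only movie 0
    [4, 2, 4],
    [5, 4, 2],
    [1, 5, 5],
]
similarities = [1.0, 0.5, 1.0, -0.5]   # invented similarity of each user to the target

result = score_items(ratings, similarities, target=0, k=2)
print(result)
```

Users 2 and 1 are the two nearest neighbours, so movie 1 scores (4 * 1.0 + 2 * 0.5) / 2 = 2.5 and movie 2 scores (2 * 1.0 + 4 * 0.5) / 2 = 2.0. User 3, who disagrees with the target, is ignored.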
5. Output
This simply prints each movie title with its recommendation score.
def printMovie(out_list, name):
    print("Based on the data, we think you may like these movies:")
    for movie in out_list[:10]:  # print at most 10 recommendations
        print(name[movie[0]], " rank score:", movie[1])
The output is shown in the figure below:
The complete code follows:
import numpy as np
import math
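def loadData():
    # Read u.data and keep [user id, movie id, rating] from every line.
    data = []
    with open('u.data') as f:
        for line in f:
            fields = line.split('\t')
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            data.append([int(v) for v in fields[0:3]])
    return data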
def loadMovieName():
    # Read u.item and return the movie titles in file order (movie id 1 first).
    name = []
    with open('u.item', encoding='ISO-8859-1') as f:
        for line in f:
            fields = line.split('|')  # movie id | title | release date | ...
            if len(fields) < 2:
                continue  # skip blank lines
            name.append(fields[1])  # keep only the title
    return name
def manageDate(data):
    # Build a 943 x 1682 matrix; 0 means the user has not rated the movie.
    outdata = [[0] * 1682 for _ in range(943)]
    for h in data:
        outdata[h[0] - 1][h[1] - 1] = h[2]  # user and movie ids start at 1
    return outdata
def calcMean(x, y):
    # Mean of each vector.
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return x_mean, y_mean
def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)  # means of x and y
    n = len(x)
    sumTop = 0.0
    x_pow = 0.0
    y_pow = 0.0
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
        x_pow += (x[i] - x_mean) ** 2
        y_pow += (y[i] - y_mean) ** 2
    sumBottom = math.sqrt(x_pow * y_pow)
    if sumBottom == 0:
        return 0.0  # a constant vector has no defined correlation
    return sumTop / sumBottom
def calcAttribute(dataSet, num):
    # Pearson coefficient between user `num` (1-based id) and every user.
    prr = []
    y = dataSet[num - 1]  # ratings of the target user
    for x in dataSet:
        prr.append(calcPearson(x, y))
    return prr
def choseMovie(outdata, num):
    prr = calcAttribute(outdata, num)
    # Rank the other users by similarity, leaving out the target user.
    users = [[i, prr[i]] for i in range(len(outdata)) if i != num - 1]
    users.sort(key=lambda u: u[1], reverse=True)
    # Score every movie: similarity-weighted ratings of the 50 most similar users.
    movie_rank = [[j, 0] for j in range(1682)]
    for i, similarity in users[:50]:
        for j in range(1682):
            movie_rank[j][1] += outdata[i][j] * similarity / 50
    movie_rank.sort(key=lambda m: m[1], reverse=True)
    # Keep the 10 best-scored movies that the target user has not rated yet.
    out_list = []
    for movie in movie_rank:
        if outdata[num - 1][movie[0]] == 0:
            out_list.append(movie)
            if len(out_list) == 10:
                break
    return out_list
def printMovie(out_list, name):
    print("Based on the data, we think you may like these movies:")
    for movie in out_list[:10]:  # print at most 10 recommendations
        print(name[movie[0]], " rank score:", movie[1])
i_data = loadData()
name = loadMovieName()
out_data = manageDate(i_data)
a = int(input("please input the id of user:"))  # int() instead of eval(): never execute user input
out_list = choseMovie(out_data, a)
printMovie(out_list,name)