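Let's build a simple recommender system step by step, using movie and TV recommendations as the example. It is simple enough that the whole source is under 100 lines (roughly 70 to 80), so it should be easy to follow. To prototype quickly, the code is written in Python.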
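1. The first step in a recommender system is to collect information
Different businesses and different recommender systems need different information. For the movie recommender built here, that is each user's ratings of the movies they have watched, as shown below: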
Name,Friends,Bedtime Stories,Dawn of the Planet of the Apes,RoboCop,Fargo,Cougar Town
Kai Zhou,4,3,5,,1,2
Shuai Ge,,3.5,3,4,2.5,4.5
Mei Nv,3,4,2,3,2,3
xiaoxianrou,2.5,3.5,3,3.5,2.5,3
fengzhi,3,4,,5,3.5,3
meinv,,4.5,,4,1,
mincat,3,3.5,1.5,5,3.5,3
alex,2.5,3,,3.5,,4
First, load the two-dimensional matrix from the csv file. The code is as follows:
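def load_matrix():
    # Map (user name, movie title) -> rating string; "" means not rated.
    matrix = {}
    with open("d:\\train.csv") as f:
        # Strip the newline so the last title is not stored as "Cougar Town\n".
        columns = f.readline().strip("\r\n").split(',')
        for line in f:
            scores = line.strip("\r\n").split(',')
            for i in range(1, len(scores)):
                matrix[(scores[0], columns[i])] = scores[i]
    return matrix

matrix = load_matrix()
print("matrix: %s" % matrix)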
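To try the loader without a file at d:\train.csv, the same parsing can be run on an in-memory string. This is only a sketch for Python 3: the helper name load_matrix_from_text and the two-row sample are made up for illustration and are not part of the original code.

```python
import io

# Hypothetical two-row sample in the same layout as train.csv.
SAMPLE = (
    "Name,Friends,RoboCop,Fargo\n"
    "Kai Zhou,4,,1\n"
    "Shuai Ge,,4,2.5\n"
)

def load_matrix_from_text(text):
    """Parse CSV text into a {(row, column): value} dict of strings."""
    matrix = {}
    f = io.StringIO(text)
    columns = f.readline().strip("\n").split(",")
    for line in f:
        scores = line.strip("\n").split(",")
        for i in range(1, len(scores)):
            matrix[(scores[0], columns[i])] = scores[i]
    return matrix

sample_matrix = load_matrix_from_text(SAMPLE)
print(sample_matrix[("Kai Zhou", "Friends")])        # 4
print(repr(sample_matrix[("Kai Zhou", "RoboCop")]))  # '' (not rated)
```

Every (user, movie) pair gets a key, and a missing rating is kept as an empty string, which is why the later functions compare values against "".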
from math import sqrt

def sim_distance(matrix, row1, row2):
    # Similarity based on Euclidean distance over the columns both rows have rated.
    columns = set(key[1] for key in matrix)
    si = [c for c in columns
          if matrix.get((row1, c), "") != "" and matrix.get((row2, c), "") != ""]
    if len(si) == 0: return 0
    sum_of_distance = sum([pow(float(matrix[(row1, column)]) - float(matrix[(row2, column)]), 2) for column in si])
    # 1 for identical ratings, approaching 0 as the distance grows.
    return 1 / (1 + sqrt(sum_of_distance))

print(sim_distance(matrix, "Kai Zhou", "Shuai Ge"))
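The similarity between Kai Zhou and Shuai Ge can be checked by hand. The sketch below repeats the same arithmetic on the two rows from the sample data, using plain dicts with None for a missing rating (a representation chosen for this example only, not the one the loader produces).

```python
from math import sqrt

# Ratings copied from the sample data; None marks a missing rating.
kai = {"Friends": 4, "Bedtime Stories": 3, "Dawn of the Planet of the Apes": 5,
       "RoboCop": None, "Fargo": 1, "Cougar Town": 2}
shuai = {"Friends": None, "Bedtime Stories": 3.5, "Dawn of the Planet of the Apes": 3,
         "RoboCop": 4, "Fargo": 2.5, "Cougar Town": 4.5}

# Only movies rated by both users take part in the distance.
shared = [m for m in kai if kai[m] is not None and shuai[m] is not None]

# Squared differences: 0.25 + 4 + 2.25 + 6.25 = 12.75
sum_of_squares = sum((kai[m] - shuai[m]) ** 2 for m in shared)

similarity = 1 / (1 + sqrt(sum_of_squares))
print(sorted(shared))
print(round(similarity, 4))  # roughly 0.219
```

Friends and RoboCop drop out because only one of the two users rated them, so four movies contribute to the distance.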
def top_matches(matrix, row, similarity=sim_distance):
    # Score every other row against `row`, most similar first.
    rows = set(key[0] for key in matrix)
    scores = [(similarity(matrix, row, r), r) for r in rows if r != row]
    scores.sort(reverse=True)
    return scores

person = "Kai Zhou"
print("top match for: %s" % person)
print(top_matches(matrix, person))
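Because top_matches takes the similarity function as a parameter, another measure can be swapped in. As an illustration only, here is a sketch of a Pearson correlation over the same {(row, column): value} layout. The name sim_pearson and the demo data are hypothetical and not part of the original code.

```python
from math import sqrt

def sim_pearson(matrix, row1, row2):
    """Pearson correlation over the columns both rows have rated."""
    columns = set(key[1] for key in matrix)
    shared = [c for c in columns
              if matrix.get((row1, c), "") != "" and matrix.get((row2, c), "") != ""]
    n = len(shared)
    if n == 0:
        return 0
    x = [float(matrix[(row1, c)]) for c in shared]
    y = [float(matrix[(row2, c)]) for c in shared]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0  # correlation is undefined when one row is constant
    return cov / sqrt(var_x * var_y)

# Hypothetical mini matrix: b rates like a (scaled up), c rates the opposite way.
demo = {
    ("a", "m1"): "1", ("a", "m2"): "2", ("a", "m3"): "3",
    ("b", "m1"): "2", ("b", "m2"): "4", ("b", "m3"): "6",
    ("c", "m1"): "3", ("c", "m2"): "2", ("c", "m3"): "1",
}
print(sim_pearson(demo, "a", "b"))  # 1.0
print(sim_pearson(demo, "a", "c"))  # -1.0
```

It would be passed as top_matches(matrix, person, similarity=sim_pearson). Unlike sim_distance, this measure can be negative, and it does not penalize a user who rates consistently higher or lower than another.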
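b. Finding movies similar to a given movie takes a small change. The input data has users as rows and movies as columns; once it is flipped so that movies are rows and users are columns, the same call works. So a function is needed to transpose the matrix:
def transform(matrix):
    # Swap rows and columns; assumes every (row, column) pair exists in matrix.
    rows = set(key[0] for key in matrix)
    columns = set(key[1] for key in matrix)
    transform_matrix = {}
    for row in rows:
        for column in columns:
            transform_matrix[(column, row)] = matrix[(row, column)]
    return transform_matrix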
Find the movies similar to Friends:
trans_matrix = transform(matrix)
print("trans: %s" % trans_matrix)
film = "Friends"
print("top match for: %s" % film)
print(top_matches(trans_matrix, film))
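Since the matrix is a dict keyed by (row, column) tuples, transposing it amounts to swapping the two parts of each key. A more compact variant, sketched here under the hypothetical name transpose, does that in a single dict comprehension and also works when some (row, column) pairs are absent.

```python
def transpose(matrix):
    """Swap the (row, column) order of every key."""
    return {(column, row): value for (row, column), value in matrix.items()}

# Hypothetical mini matrix in the same layout as the loaded data.
demo = {
    ("Kai Zhou", "Friends"): "4",
    ("Kai Zhou", "Fargo"): "1",
    ("Shuai Ge", "Friends"): "",
    ("Shuai Ge", "Fargo"): "2.5",
}
flipped = transpose(demo)
print(flipped[("Fargo", "Shuai Ge")])  # 2.5
print(transpose(flipped) == demo)      # True
```

Transposing twice returns the original dict, which is a quick sanity check for either version.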
def get_recommendations(matrix, row, similarity=sim_distance):
    # Predict a score per column as the similarity-weighted average of other rows' ratings.
    # Note: columns that `row` has already rated are scored too.
    rows = set(key[0] for key in matrix)
    columns = set(key[1] for key in matrix)
    sum_of_column_sim = {}
    sum_of_column = {}
    for r in rows:
        if r == row: continue
        sim = similarity(matrix, row, r)
        if sim <= 0: continue
        for c in columns:
            if matrix[(r, c)] == "": continue
            sum_of_column_sim.setdefault(c, 0)
            sum_of_column_sim[c] += sim
            sum_of_column.setdefault(c, 0)
            sum_of_column[c] += float(matrix[(r, c)]) * sim
    scores = [(sum_of_column[c] / sum_of_column_sim[c], c) for c in sum_of_column]
    scores.sort(reverse=True)
    return scores

print(get_recommendations(matrix, person))
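The predicted score for a movie is a weighted average: each other user's rating is multiplied by that user's similarity to the target user, and the total is divided by the sum of the similarities used. The function above scores every movie, including ones the target user has already rated. The sketch below works through the arithmetic on made-up numbers and, as one possible variation, skips the movies already rated. All names and values in it are hypothetical.

```python
# Hypothetical similarities between the target user and two other users.
similarities = {"u1": 0.5, "u2": 0.25}

# Hypothetical ratings; an empty string means "not rated", as in the matrix.
ratings = {
    ("u1", "RoboCop"): "4", ("u2", "RoboCop"): "2",
    ("u1", "Fargo"): "",    ("u2", "Fargo"): "3",
}
already_rated = {"Fargo"}  # movies the target user has rated

weighted_sum = {}
sim_sum = {}
for (user, movie), value in ratings.items():
    if value == "" or movie in already_rated:
        continue
    sim = similarities[user]
    weighted_sum[movie] = weighted_sum.get(movie, 0) + float(value) * sim
    sim_sum[movie] = sim_sum.get(movie, 0) + sim

# RoboCop: (4 * 0.5 + 2 * 0.25) / (0.5 + 0.25) = 2.5 / 0.75
predictions = sorted(((weighted_sum[m] / sim_sum[m], m) for m in weighted_sum),
                     reverse=True)
print(predictions)
```

Dividing by the sum of similarities keeps a movie from ranking high only because many users rated it.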