"Factorization Machines" | The FM Model and a Python Implementation

Original

蠡1204

2020-06-01 03:53

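1 Factorization Machines: the original paper

2 The FM model

2.1 Background

In computational advertising and recommender systems, CTR (click-through rate) prediction is a critical step: whether an item should be recommended is decided by its predicted click-through rate. Besides individual features, CTR prediction usually also needs combinations of features.

One-hot encoding makes the sample data very sparse and also enlarges the feature space. FM was proposed to solve the problem of how to combine features when the data is sparse (one-hot encoded).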
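As a rough illustration of the sparsity described above, the sketch below one-hot encodes a made-up set of (user, item) pairs with NumPy. The toy IDs and all variable names are assumptions for illustration and do not come from the original post.

```python
import numpy as np

# Hypothetical toy data: four (user, item) interactions.
users = np.array([0, 1, 2, 0])
items = np.array([0, 0, 1, 2])
n_users, n_items = 3, 3

# One row per sample, one column per distinct user plus one per distinct item.
X = np.zeros((len(users), n_users + n_items))
X[np.arange(len(users)), users] = 1.0            # user block
X[np.arange(len(items)), n_users + items] = 1.0  # item block

# Each row has exactly two non-zeros, however many columns there are.
density = X.sum() / X.size
print(X)
print(density)
```

With U users and I items, every row still has only 2 non-zero entries out of U + I columns, so the matrix becomes sparser as the catalogue grows.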

2.2 Solving the FM model

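Ordinary linear models, such as logistic regression, treat each feature on its own and do not consider relationships between features. The usual model is:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i$$

where n is the number of features. As the equation above shows, the features are not combined, so any association between them is ignored. The FM model combines features pairwise and so accounts for the interactions between them:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$

where v_i is a k-dimensional latent vector for feature i and \langle v_i, v_j \rangle is the dot product of two such vectors.

Comparing the two models, FM differs from the linear model only by the last term.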

Computing the last term
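The derivation that belonged here is not reproduced in the text. The standard simplification of the pairwise term, which is the form the TensorFlow code in section 3 computes as pair_interactions, can be written as follows. Here n is the number of features, k the latent dimension, and v_{i,f} the f-th component of the latent vector of feature i.

```latex
\begin{aligned}
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
&= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \langle v_i, v_j \rangle x_i x_j
 - \frac{1}{2}\sum_{i=1}^{n} \langle v_i, v_i \rangle x_i^{2} \\
&= \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2}
 - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]
\end{aligned}
```

The first line uses the symmetry of the double sum; the second expands the dot product over the k latent dimensions. The result costs O(kn) to evaluate instead of O(kn^2).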

Solving for the model parameters with SGD
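The formulas for this step are likewise missing from the text. Differentiating the FM prediction with respect to each parameter gives the gradients below. The update rule in the last line is a sketch that assumes a squared loss (\hat{y} - y)^2 with an L2 penalty \lambda_\theta \theta^2 and a learning rate \eta; these symbols are assumptions and are not taken from the original post.

```latex
\begin{gathered}
\frac{\partial \hat{y}}{\partial w_0} = 1, \qquad
\frac{\partial \hat{y}}{\partial w_i} = x_i, \qquad
\frac{\partial \hat{y}}{\partial v_{i,f}} = x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^{2} \\
\theta \leftarrow \theta - \eta \left( 2\,(\hat{y} - y)\,\frac{\partial \hat{y}}{\partial \theta} + 2\,\lambda_\theta\, \theta \right)
\end{gathered}
```

The sum \sum_j v_{j,f} x_j does not depend on i, so it can be computed once per sample and reused for every v_{i,f}.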

3 Python implementation

1. Dataset: MovieLens, with four columns: [user ID | movie ID | rating | timestamp]

2. Libraries used

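from itertools import count # iterator that yields consecutive integers
from collections import defaultdict # a plain dict raises KeyError for a missing key; defaultdict returns a default value instead
from scipy.sparse import csr # provides csr_matrix (Compressed Sparse Row), which compresses a matrix row by row. CSR stores three arrays: values, column indices, and row offsets. Values and column indices mean the same as in COO; a row offset is the position in values where that row's first element starts.
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
import tensorflow as tf # the code below uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session)
from tqdm import tqdm_notebook as tqdm # progress bar for loops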

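3. Convert the data into a sparse matrix of size (number of samples) x (total number of features), where the features are the one-hot encoded user IDs and movie IDs, using csr_matrix from scipy.sparse. In the form csr_matrix((data, indices, indptr)), the first argument holds the non-zero values (data), the second holds the column index of each value (column indices), and the third holds the offset at which each row starts (row offsets). The function below uses the alternative form csr_matrix((data, (row_ix, col_ix))), which takes explicit row and column indices.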
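To make the two constructor forms concrete, here is a minimal sketch that builds the same small matrix both ways. The 3 x 4 matrix and the variable names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Target matrix:
# [[1, 0, 2, 0],
#  [0, 0, 3, 0],
#  [4, 5, 0, 6]]
data = np.array([1, 2, 3, 4, 5, 6])     # non-zero values, row by row
indices = np.array([0, 2, 2, 0, 1, 3])  # column index of each value
indptr = np.array([0, 2, 3, 6])         # row i is data[indptr[i]:indptr[i + 1]]

# Form 1: values, column indices, row offsets
A = csr_matrix((data, indices, indptr), shape=(3, 4))

# Form 2: values with explicit (row, column) coordinates
rows = np.array([0, 0, 1, 2, 2, 2])
B = csr_matrix((data, (rows, indices)), shape=(3, 4))

print(A.toarray())
```

Both calls produce the same CSR matrix; the coordinate form is often easier to fill in when the row and column of each entry are already known.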

def vectorize_dic(dic, ix=None, p=None):
    """
    Creates a scipy csr matrix from a dictionary of lists (each list holds the values of one feature for all samples)

    parameters:
    -----------
    dic -- dictionary of feature lists. Keys are the names of features
    ix -- index generator (default None)
    p -- dimension of feature space (number of columns in the sparse matrix) (default None)
    """
    if ix is None:
        d = count(0)
        # each unseen key is assigned the next free column index
        ix = defaultdict(lambda: next(d))

    n = len(list(dic.values())[0]) # num samples
    g = len(dic) # num groups
    nz = n * g # number of non-zeros

    col_ix = np.empty(nz, dtype=int)

    for i, (k, lis) in enumerate(dic.items()):
        # append k to each element so that equal ids from different feature groups do not map to the same column
        col_ix[i::g] = [ix[str(el) + str(k)] for el in lis]

    row_ix = np.repeat(np.arange(0, n), g)
    data = np.ones(nz)

    if p is None:
        p = len(ix)

    # keep only entries whose column index fits within the p columns
    ixx = np.where(col_ix < p)

    return csr.csr_matrix((data[ixx], (row_ix[ixx], col_ix[ixx])), shape=(n, p)), ix

cols = ['user', 'item', 'rating', 'timestamp']

train = pd.read_csv('data/ua.base', delimiter='\t', names=cols)
test = pd.read_csv('data/ua.test', delimiter='\t', names=cols)

# vectorize_dic computes the number of samples and feature groups itself, so only the feature dict is passed
x_train, ix = vectorize_dic({'users': train['user'].values, 'items': train['item'].values})

# reuse the training index and column count so that test columns line up with training columns
x_test, ix = vectorize_dic({'users': test['user'].values, 'items': test['item'].values}, ix, x_train.shape[1])

print(x_train)
y_train = train['rating'].values
y_test = test['rating'].values

x_train = x_train.toarray() # toarray returns an ndarray; todense returns an np.matrix
x_test = x_test.toarray()

4. Batch generator

def batcher(X_, y_=None, batch_size=-1):
    n_samples = X_.shape[0]

    # batch_size == -1 means one batch containing all samples
    if batch_size == -1:
        batch_size = n_samples
    if batch_size < 1:
        raise ValueError('Parameter batch_size={} is unsupported'.format(batch_size))

    for i in range(0, n_samples, batch_size):
        upper_bound = min(i + batch_size, n_samples)
        ret_x = X_[i:upper_bound]
        ret_y = None
        if y_ is not None:
            ret_y = y_[i:upper_bound]
        # yield every batch; ret_y stays None when no labels are given
        yield (ret_x, ret_y)

5. Computing the prediction

n,p = x_train.shape

k = 10

x = tf.placeholder('float',[None,p])

y = tf.placeholder('float',[None,1])

w0 = tf.Variable(tf.zeros([1]))
w = tf.Variable(tf.zeros([p]))

v = tf.Variable(tf.random_normal([k,p],mean=0,stddev=0.01))

#y_hat = tf.Variable(tf.zeros([n,1]))

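# keepdims replaces the deprecated keep_dims argument
linear_terms = tf.add(w0,tf.reduce_sum(tf.multiply(w,x),1,keepdims=True)) # n * 1
# pairwise term: 0.5 * sum over latent factors of ((x v^T)^2 - x^2 (v^2)^T)
pair_interactions = 0.5 * tf.reduce_sum(
    tf.subtract(
        tf.pow(
            tf.matmul(x,tf.transpose(v)),2),
        tf.matmul(tf.pow(x,2),tf.transpose(tf.pow(v,2)))
    ),axis = 1 , keepdims=True)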

y_hat = tf.add(linear_terms,pair_interactions)
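As a sanity check on the pair_interactions expression, the NumPy sketch below compares it with the naive sum over all feature pairs for one random sample. The toy sizes and variable names are assumptions; V uses the same [k, p] layout as v above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 3                  # number of features, number of latent factors
x = rng.normal(size=p)       # a single sample
V = rng.normal(size=(k, p))  # column i is the latent vector of feature i

# Naive version: sum over all pairs i < j of <v_i, v_j> * x_i * x_j
naive = sum((V[:, i] @ V[:, j]) * x[i] * x[j]
            for i in range(p) for j in range(i + 1, p))

# Reformulated version, matching the TensorFlow expression:
# 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)
fast = 0.5 * np.sum((V @ x) ** 2 - (V ** 2) @ (x ** 2))

print(naive, fast)
```

The two values agree up to floating-point error, while the second form needs only matrix-vector products.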

6. Loss function: in addition to the squared loss, an L2 regularization term is added, and the parameters are updated with gradient descent:

lambda_w = tf.constant(0.001,name='lambda_w')
lambda_v = tf.constant(0.001,name='lambda_v')

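# sum each penalty separately: adding w (shape [p]) to v (shape [k, p]) before summing
# would broadcast w and count its penalty k times
l2_norm = tf.add(
    tf.multiply(lambda_w,tf.reduce_sum(tf.pow(w,2))),
    tf.multiply(lambda_v,tf.reduce_sum(tf.pow(v,2)))
)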

error = tf.reduce_mean(tf.square(y-y_hat))
loss = tf.add(error,l2_norm)

train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
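TensorFlow derives the gradients for train_op automatically. To show what gradient descent is working with for the latent factors, the NumPy sketch below computes the gradient of the prediction with respect to V by hand and checks it against finite differences. The function names fm_predict and fm_grad_V and the toy sizes are hypothetical, introduced only for this illustration.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction for one sample x; V has shape [k, p]."""
    pair = 0.5 * np.sum((V @ x) ** 2 - (V ** 2) @ (x ** 2))
    return w0 + w @ x + pair

def fm_grad_V(x, V):
    """d y_hat / d V[f, i] = x_i * sum_j V[f, j] x_j - V[f, i] * x_i^2."""
    return np.outer(V @ x, x) - V * x ** 2

rng = np.random.default_rng(0)
p, k = 5, 2
x = rng.normal(size=p)
w = rng.normal(size=p)
V = rng.normal(size=(k, p))
w0 = 0.1

# Central finite differences as an independent check of the analytic gradient
eps = 1e-6
numeric = np.zeros_like(V)
for f in range(k):
    for i in range(p):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[f, i] += eps
        V_minus[f, i] -= eps
        numeric[f, i] = (fm_predict(x, w0, w, V_plus)
                         - fm_predict(x, w0, w, V_minus)) / (2 * eps)

analytic = fm_grad_V(x, V)
print(np.max(np.abs(analytic - numeric)))
```

V @ x is computed once and reused for every entry, so the whole gradient costs O(kp) per sample.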

7. Training the model

epochs = 10
batch_size = 1000

# Launch the graph
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    for epoch in tqdm(range(epochs), unit='epoch'):
        perm = np.random.permutation(x_train.shape[0]) # shuffle and permutation both randomly reorder an array; shuffle works in place and returns nothing, while permutation returns a new shuffled array and leaves the original unchanged.
        # iterate over batches
        for bX, bY in batcher(x_train[perm], y_train[perm], batch_size):
            _,t = sess.run([train_op,loss], feed_dict={x: bX.reshape(-1, p), y: bY.reshape(-1, 1)})
            print(t)


    errors = []
    for bX, bY in batcher(x_test, y_test):
        errors.append(sess.run(error, feed_dict={x: bX.reshape(-1, p), y: bY.reshape(-1, 1)}))
        print(errors)
    RMSE = np.sqrt(np.array(errors).mean())
    print(RMSE)
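The evaluation loop above runs the test set as a single batch, so the mean of errors is the exact test MSE. If the test set were split into batches of unequal size, a plain mean of per-batch MSEs would no longer be exact. The sketch below, with made-up numbers, shows the difference and a size-weighted alternative.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0, 3.0])  # only the last prediction is off

sq_err = (y_true - y_pred) ** 2
exact_rmse = np.sqrt(sq_err.mean())

# Two batches of sizes 4 and 1: a plain mean over-weights the small batch
batch_mse = [sq_err[:4].mean(), sq_err[4:].mean()]
naive_rmse = np.sqrt(np.mean(batch_mse))

# Weighting each batch MSE by its size recovers the exact value
sizes = [4, 1]
weighted_rmse = np.sqrt(np.average(batch_mse, weights=sizes))

print(exact_rmse, naive_rmse, weighted_rmse)
```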
