[CTR Prediction] The Wide and Deep Learning Model (Translation + TensorFlow Source Code Analysis)

This article covers Google's Wide & Deep Learning model. We start from the original paper and work through it step by step until it is fully understood, and then analyze the official open-source TensorFlow implementation: how each kind of feature is built and how the model itself is constructed.
I have open-sourced a library of CTR prediction models on GitHub that reproduces several classic networks such as DIN, ESMM, DIEN and DeepFM. It also includes the Wide & Deep model; feel free to take a look:
https://github.com/Shicoder/Deep_Rec/tree/master/Deep_Rank

Let's start with the model diagram:

(Figure: the Wide & Deep model architecture from the paper.)
1. Paper Translation

ABSTRACT

Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning—jointly trained wide linear models and deep neural networks—to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open sourced our implementation in TensorFlow.

Translation:
Generalized linear models combined with nonlinear transformations of sparse input features are widely used for large-scale regression and classification problems. Memorizing feature interactions through a wide set of cross-product feature transformations is effective and interpretable, but it requires a great deal of feature engineering. In contrast, by learning low-dimensional dense embeddings for the sparse features, a deep neural network can generalize to unseen feature combinations with far less feature engineering. However, when the user-item interactions are sparse and high-rank, deep networks with embeddings tend to over-generalize and recommend items that are barely relevant. In this paper we present Wide & Deep learning, which jointly trains a wide linear model and a deep neural network, combining the benefits of memorization and generalization for recommender systems. We productionized and evaluated the approach on Google Play; online experiments show that, compared with wide-only and deep-only models, Wide & Deep significantly increased app acquisitions. We have also open-sourced the implementation in TensorFlow.

Commentary:
The paper combines a linear model built on nonlinear (crossed) features with a deep network that embeds the features, and optimizes the two with joint training. The intuition: a linear model with cross-product features can only capture nonlinear patterns that have already appeared in the historical data (explicit nonlinearity), while the deep network can also discover combinations that have never appeared (implicit nonlinearity).


INTRODUCTION

A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank the items based on certain objectives, such as clicks or purchases.
One challenge in recommender systems, similar to the general search ranking problem, is to achieve both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper, we focus on the apps recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.
Translation:

A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank them according to certain objectives such as clicks or purchases.
As in general search ranking, one challenge in recommender systems is achieving both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlations available in the historical data. Generalization, on the other hand, is based on the transitivity of correlations and explores feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and tied directly to items the user has already acted on. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper we focus on the app recommendation problem for the Google Play store, but the approach applies to recommender systems in general.

Commentary:
I keep the terms memorization and generalization as they are throughout. Roughly speaking, memorization means digging the explicit nonlinear patterns out of the historical data, while generalization means capturing the implicit ones that have not appeared before.

For massive-scale online recommendation and ranking systems in an industrial setting, generalized linear models such as logistic regression are widely used because they are simple, scalable and interpretable. The models are often trained on binarized sparse features with one-hot encoding. E.g., the binary feature “user_installed_app=netflix” has value 1 if the user installed Netflix. Memorization can be achieved effectively using cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and then is later shown Pandora. This explains how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using features that are less granular, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.

Translation:
For massive-scale online recommendation and ranking systems in industry, generalized linear models such as logistic regression are very widely used: they are simple, scalable and interpretable. They are typically fed binarized sparse features with one-hot encoding; for example, the binary feature "user_installed_app=netflix" is 1 if the user has installed Netflix. Memorization can then be achieved effectively with cross-product transformations over the sparse features, i.e. crossed features such as AND(user_installed_app=netflix, impression_app=pandora), which is 1 only when the user has installed Netflix and is later shown Pandora, and 0 otherwise. Such a crossed feature shows how the co-occurrence of a feature pair correlates with the target label. Generalization can be added with coarser-grained features such as AND(user_installed_category=video, impression_category=music), but all of this requires manual feature engineering. One limitation of cross-product transformations is that they cannot generalize to query-item feature pairs that never appeared in the training data.

Commentary:
This paragraph explains the features the linear model needs: one-hot features, which are sparse, and crossed features, which are simply ANDs, i.e. Cartesian products of features. They let the linear model pick up the explicit nonlinearity.

Embedding-based models, such as factorization machines[5] or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering. However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal. In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these “exception rules” with much fewer parameters.

Translation:
Embedding-based models such as FM or DNNs can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, and they need much less feature engineering. However, when the underlying query-item matrix is sparse and high-rank, e.g. users with very specific preferences or niche items with narrow appeal, it is difficult to learn effective low-dimensional representations for queries and items. In such cases most query-item pairs should have no interaction at all, yet dense embeddings still produce nonzero predictions for every pair, so the model can over-generalize and make less relevant recommendations. A linear model with cross-product features, on the other hand, can memorize these "exception rules" with far fewer parameters.

Commentary:
This paragraph describes the features the deep network needs: embedding features, which map the sparse data into a dense, low-dimensional space.

In this paper, we present the Wide & Deep learning framework to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component as shown in Figure 1.
The main contributions of the paper include:
• The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear model with feature transformations for generic recommender systems with sparse inputs.
• The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
• We have open-sourced our implementation along with a high-level API in TensorFlow.
While the idea is simple, we show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.

Translation:

In this paper we present the Wide & Deep learning framework, which achieves both memorization and generalization in a single model by jointly training a linear model component and a neural network component.
The main contributions of the paper:
1. Joint training of a feed-forward neural network with embeddings and a linear model with crossed features, for recommender systems with sparse inputs.
2. The Wide & Deep recommender system implemented and evaluated in production on Google Play.
3. An open-sourced implementation with a high-level API in TensorFlow.
Although the idea is simple, Wide & Deep significantly improves the app acquisition rate while still meeting the training and serving speed requirements.

RECOMMENDER SYSTEM OVERVIEW

An overview of the app recommender system is shown in Figure 2. A query, which can include various user and contextual features, is generated when a user visits the app store. The recommender system returns a list of apps (also referred to as impressions) on which users can perform certain actions such as clicks or purchases. These user actions, along with the queries and impressions, are recorded in the logs as the training data for the learner. Since there are over a million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After reducing the candidate pool, the ranking system ranks all items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, including user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). In this paper, we focus on the ranking model using the Wide & Deep learning framework.

(Figure 2: overview of the app recommender system.)

Figure 2 shows an overview of the app recommender system.

Query: the set of user and contextual features generated when a user visits the app store.
The recommender system returns a list of apps (also called impressions), on which the user can then perform actions such as clicks or purchases. These user actions, together with the queries and impressions, are logged as the training data.
Since there are over a million apps in the database, it is infeasible to score every app for every query within the serving latency requirements (often around 10 ms). Therefore the first step upon receiving a query is retrieval: the retrieval system uses a combination of machine-learned models and human-defined rules to return a short list of items that best match the query. With the candidate pool reduced, the ranking system then ranks these items by score. The score is usually P(y|x), the probability of the user action label y given the features x, which include user features (country, language, demographics, ...), contextual features (device, hour of the day, ...) and impression features (historical statistics of an app, ...). In this paper we focus on applying the Wide & Deep model to the ranking system.

WIDE&DEEP LEARNING

The Wide Component

The wide component is a generalized linear model of the form y = wT x + b, as illustrated in Figure 1 (left). y is the prediction, x = [x1, x2, …, xd] is a vector of d features, w =[w1, w2, …, wd] are the model parameters and b is the bias.The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:
$\phi_{k}(X)=\prod_{i=1}^{d}x_{i}^{c_{ki}},\qquad c_{ki}\in\{0,1\}$    (1)
where $c_{ki}$ is a boolean variable that is 1 if the i-th feature is part of the k-th transformation $\phi_{k}$, and 0 otherwise. For binary features, a cross-product transformation (e.g., "AND(gender=female, language=en)") is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.

Translation:

The wide component is a generalized linear model of the form $y=W^{T}X+b$, as shown in Figure 1 (left). $y$ is the prediction, $X=[x_{1},x_{2},...,x_{d}]$ is a vector of d features, $W=[w_{1},w_{2},...,w_{d}]$ are the model parameters, and b is the bias. The feature set includes the raw input features and transformed features, the most important of which is the cross-product transformation, defined as:
$\phi_{k}(X)=\prod_{i=1}^{d}x_{i}^{c_{ki}},\qquad c_{ki}\in\{0,1\}$
where $c_{ki}$ is a boolean variable that is 1 if the i-th feature is part of the k-th transformation $\phi_{k}$ and 0 otherwise. For binary features, the cross-product feature can be understood simply as an AND, e.g. AND(gender=female, language=en): it is 1 if and only if gender=female and language=en, and 0 in all other cases. This captures the interactions between features and adds nonlinearity to the model.

Commentary:
This step simply generates the crossed features.
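As a quick illustration (my own sketch, not code from the paper or from TensorFlow), a binary cross-product feature can be computed in plain Python like this; the example dict and condition list are made up:

# Illustrative sketch: a binary cross-product feature AND(gender=female, language=en).
def cross_product(example, conditions):
    # Fires (returns 1) only if every (feature, value) condition holds for the example.
    return int(all(example.get(feat) == val for feat, val in conditions))

example = {"gender": "female", "language": "en"}
print(cross_product(example, [("gender", "female"), ("language", "en")]))  # -> 1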

The Deep Component

The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the raw inputs are feature strings such as "language=en". These sparse, high-dimensional categorical features are first converted into low-dimensional, dense real-valued vectors, i.e. embedding vectors. The embeddings are initialized randomly and updated through back-propagation. Once the high-dimensional features have been converted to embeddings, the low-dimensional embedding vectors are fed into the hidden layers of the network, where each hidden layer computes:
$a^{(l+1)}=f(W^{(l)}a^{(l)}+b^{(l)})$
where $l$ is the layer index and $f$ is the activation function, usually ReLU; $a^{(l)}$, $b^{(l)}$ and $W^{(l)}$ are the activations, bias and weights of the l-th layer.

Commentary:
In short, the categorical inputs are strings, so they have to be converted and embedded first; the network itself is just a stack of fully connected layers.
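To make the layer equation concrete, here is a minimal numpy sketch of a single hidden layer, a^(l+1) = ReLU(W^(l) a^(l) + b^(l)); all shapes and values are made up for illustration:

import numpy as np

# One hidden layer of the deep component: a_next = ReLU(W a + b).
rng = np.random.default_rng(0)
a_l = rng.normal(size=(1, 32))             # e.g. a 32-dim embedding as the layer input
W_l = rng.normal(size=(32, 16)) * 0.1      # layer weights
b_l = np.zeros(16)                         # layer bias
a_next = np.maximum(0.0, a_l @ W_l + b_l)  # ReLU activation
print(a_next.shape)                        # (1, 16)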

Joint Training of Wide & Deep Model

The wide component and the deep component are combined by summing their output log-odds as the prediction, which is then fed to one common logistic loss function for joint training. Note that joint training is not the same as ensembling. In an ensemble, the models are trained independently of each other and their predictions are only combined at the end. Because of that independence, each individual ensemble member usually has to be larger (more features and more feature engineering) to reach reasonable accuracy. With joint training, the two components are trained together with all of their parameters, so each part only needs to complement the other's weaknesses.
Joint training of the Wide & Deep model is done by back-propagating the gradient of the output error to both the wide and the deep part simultaneously, using mini-batch stochastic optimization. In the experiments, the wide part is optimized with FTRL with $L_{1}$ regularization and the deep part with AdaGrad.
The combined model is shown in Figure 1 (center). For a logistic regression problem, the model's prediction is:
$P(Y=1|X)=\sigma(W_{wide}^{T}[X,\phi(X)]+W_{deep}^{T}a^{(l_{f})}+b)$  (3)
where Y is the binary class label, $\sigma(\cdot)$ is the sigmoid function, $\phi(X)$ are the cross-product transformations of the raw features X, b is the bias term, $W_{wide}$ is the weight vector of the wide model, and $W_{deep}$ are the weights applied to the final hidden-layer activations $a^{(l_{f})}$.

Commentary:
Joint training simply means the two parts are trained together, with FTRL and AdaGrad as their respective optimizers, and the two outputs are summed at the end. $W_{deep}$ is just the weight matrix between the last hidden layer and the output unit.
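A minimal sketch of how these optimizer choices map onto the TF 1.x canned estimator (illustrative only; crossed_columns and deep_columns are the feature-column lists built later in this post, and the hyper-parameters are placeholders, not the paper's settings):

import tensorflow as tf

# FTRL with L1 for the wide part, Adagrad for the deep part, as in the paper.
model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=crossed_columns,   # wide part: crossed / sparse columns
    dnn_feature_columns=deep_columns,         # deep part: embeddings + dense columns
    dnn_hidden_units=[100, 50],
    linear_optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.01, l1_regularization_strength=0.5),
    dnn_optimizer=tf.train.AdagradOptimizer(learning_rate=0.01))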

Data Generation

In this stage, user and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise. Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution function P(X ≤ x), divided into n_q quantiles. The normalized value is (i−1)/(n_q−1) for values in the i-th quantile. Quantile boundaries are computed during data generation.

Translation:

The app recommendation pipeline consists of three stages: data generation, model training and model serving, as shown in Figure 3.
In the data generation stage, user and app impression data from a period of time are used to generate the training data. Each example corresponds to one impression, and the label is app acquisition: 1 if the impressed app was installed, 0 otherwise.
Vocabularies, tables that map categorical feature strings to integer IDs, are also generated in this stage; the system computes an ID space for every string feature that occurs more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution P(X ≤ x), discretized into n_q quantiles: a value in the i-th quantile is normalized to (i−1)/(n_q−1). The quantile boundaries are also computed in this stage.

Commentary:
The whole pipeline is split into three parts. Part one, data generation: build the crossed features for the linear model and the embedding features for the deep model, and convert string-valued categorical features into integer IDs.
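A small numpy sketch (my own illustration, not the paper's code) of the quantile normalization described above, mapping a value in the i-th of n_q quantiles to (i−1)/(n_q−1); the data and n_q are made up:

import numpy as np

def quantile_normalize(values, n_q=10):
    # Quantile boundaries are computed from the data, as in the data generation stage.
    boundaries = np.quantile(values, np.linspace(0, 1, n_q + 1))[1:-1]
    # searchsorted returns the 0-based quantile index (i - 1) for each value.
    idx = np.searchsorted(boundaries, values, side="right")
    return idx / (n_q - 1)

ages = np.random.default_rng(0).integers(18, 90, size=1000)
print(quantile_normalize(ages)[:5])  # normalized values in [0, 1]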

(Figure 3: apps recommendation pipeline overview: data generation, model training, model serving.)

Model Training

The model structure we used in the experiment is shown in Figure 4. During training, our input layer takes in training data and vocabularies and generate sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, A 32 dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit. The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be re-trained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model.

To tackle this challenge, we implemented a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model. Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic. We empirically validate the model quality against the previous model as a sanity check.

Translation:

The model structure used in the experiments is shown in Figure 4. During training, the input layer takes in the training data and the vocabularies and produces the sparse and dense features together with the label. The wide component consists of the cross-product transformation of the user's installed apps and the impression apps. For the deep component, a 32-dimensional embedding vector is learned for each categorical feature; all the embeddings are concatenated with the dense features, giving a dense vector of roughly 1200 dimensions. This vector is fed through 3 ReLU layers and finally into the logistic output unit.
The Wide & Deep models are trained on more than 500 billion examples. Every time a new set of training data arrives the model has to be retrained, but retraining from scratch each time is computationally expensive and delays serving an updated model. To tackle this, we implemented a warm-starting system that initializes the new model with the embeddings and linear-model weights of the previous model.
Before loading a model onto the model servers, a dry run is performed to make sure it will not cause problems when serving live traffic, and the model quality is validated against the previous model as a sanity check.

Model Serving

Once the model is trained and verified, we load it into the model servers. For each request, the servers receive a set of app candidates from the app retrieval system and user features to score each app. Then, the apps are ranked from the highest scores to the lowest, and we show the apps to the users in this order. The scores are calculated by running a forward inference pass over the Wide & Deep model. In order to serve each request on the order of 10 ms, we optimized the performance using multithreading parallelism by running smaller batches in parallel, instead of scoring all candidate apps in a single batch inference step.

Translation:

Once the model has been trained and verified, it is loaded onto the model servers. For each request, the servers receive a set of candidate apps from the app retrieval system together with the user features, score each app with the model, rank the apps from the highest score to the lowest, and show them to the user in that order.
To serve each request within roughly 10 ms, we optimized performance with multithreaded parallelism, scoring smaller batches in parallel instead of scoring all candidate apps in a single batch inference step.

Experiments

The remainder of the paper is the experiments section, which I will not translate here.

Code Analysis

The TensorFlow source for this model seems to have been updated recently, and it no longer uses the same modules as before: feature construction used to be done with tf.contrib, while the current code uses tf.feature_column, and the model has moved from tf.contrib.learn.DNNLinearCombinedClassifier to tf.estimator.DNNLinearCombinedClassifier. I tried the old code and it still runs, so both versions coexist (TF is getting a bit bloated). I will walk through the version I downloaded; the earlier version has already been analyzed very thoroughly here: http://geek.csdn.net/news/detail/235465 (an excellent write-up).

Data

The demo uses census income data.
It contains both continuous and categorical columns, and the last column, after being discretized, is used as the label. Inspecting the data in a Jupyter notebook, and looking at a few of the categorical columns (years of education, occupation, native country), some categories have many distinct values while others have only a few.

Data Input

The data is read with pandas and passed in as the raw DataFrame, keyed by the original column names; all of the subsequent feature engineering is built on top of these keys.
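A minimal sketch of what this input stage looks like (my own illustration; the column names follow the standard census income dataset and the file path is a placeholder, not necessarily what the demo uses):

import pandas as pd
import tensorflow as tf

# Read the raw census CSV with pandas; each column name becomes a feature key.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]
df = pd.read_csv("adult.data", names=COLUMNS, skipinitialspace=True)
labels = df.pop("income_bracket").apply(lambda v: int(">50K" in v))  # binarized label

# The estimator consumes the DataFrame keyed by column name.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df, y=labels, batch_size=128, num_epochs=None, shuffle=True)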

Feature Engineering

The feature_column module provides the following functions:
'crossed_column',
'numeric_column',
'bucketized_column',
'categorical_column_with_hash_bucket',
'categorical_column_with_vocabulary_file',
'categorical_column_with_vocabulary_list',
'categorical_column_with_identity',
'weighted_categorical_column',
'indicator_column',
crossed_column builds crossed features; numeric_column handles real-valued features; bucketized_column discretizes continuous features; categorical_column_with_hash_bucket hashes categorical values into buckets; categorical_column_with_vocabulary_file keeps all possible values of a categorical feature in a file; categorical_column_with_vocabulary_list keeps them in a list; categorical_column_with_identity returns IDs identical to the feature values themselves; weighted_categorical_column attaches weights to categorical values; indicator_column one-hot encodes a categorical feature.
Below I go through these methods as they are used in the demo.

1. For categorical features with only a few possible values, the demo uses tf.feature_column.categorical_column_with_vocabulary_list() to map the feature from string to integer. Take the gender feature, whose raw values are Female or Male; we can map it with:

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])

The strings Female and Male are encoded by their position in the vocabulary, starting from 0, so here Female is 0 and Male is 1.
categorical_column_with_vocabulary_list() also has an OOV (out-of-vocabulary) option: values that do not appear in the vocabulary we defined can be routed to OOV buckets. Under the hood the method is essentially a hash table mapping strings to ints.
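For example (my own illustration, not from the demo), unseen values can be sent to extra OOV buckets via the num_oov_buckets argument:

# Values outside the vocabulary are hashed into one extra OOV id after ids 0..1.
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"], num_oov_buckets=1)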

2. For categorical features whose set of possible values is unknown, or simply very large, tf.feature_column.categorical_column_with_hash_bucket() can be used. The idea is the same as categorical_column_with_vocabulary_list, but since we do not know all the possible values we cannot define a vocabulary, so the values are hashed directly into buckets; the method assigns every possible value an integer ID. For example:

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

This hashes the values of occupation into 1000 buckets numbered 0 to 999, so every occupation value is mapped to one of these 1000 integers.

# Excerpt from the feature_column internals: non-string values are first
# converted to strings, then hashed into hash_bucket_size buckets.
if self.dtype == dtypes.string:
    sparse_values = input_tensor.values
else:
    sparse_values = string_ops.as_string(input_tensor.values)
sparse_id_values = string_ops.string_to_hash_bucket_fast(
    sparse_values, self.hash_bucket_size, name='lookup')

So under the hood it converts the value to a string and then hashes it; effectively it does:
output_id = Hash(input_feature_string) % bucket_size
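A toy version of that formula in plain Python (illustrative only; TensorFlow uses a stable fingerprint hash, so the actual IDs will differ):

import hashlib

def hash_bucket(value, bucket_size=1000):
    # Stand-in for Hash(input_feature_string) % bucket_size.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_size

print(hash_bucket("Exec-managerial"))  # some id in [0, 1000)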

3. Continuous features
There is not much to say about continuous features: they are simply converted to floats.

# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

4. Continuous features with uneven distributions
Continuous features whose density varies a lot across their range can be bucketized with tf.feature_column.bucketized_column(). Bucketization turns a continuous feature into a discrete one; boundaries is a list of floats and must be strictly increasing, as in the following code:

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

This splits age into 11 buckets according to the given boundaries; for example a sample with age 34 is mapped to bucket 3, and age 21 to bucket 1.
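The bucket indices can be double-checked with numpy (my own sketch; bucketized_column uses left-inclusive buckets, which matches numpy.digitize's default behaviour):

import numpy as np

# Bucket 0 is (-inf, 18), bucket 1 is [18, 25), ..., bucket 10 is [65, +inf).
boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
print(np.digitize([34, 21, 70], boundaries))  # -> [ 3  1 10]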

5. Crossed features
Crossed features are there to capture nonlinear patterns and are built with tf.feature_column.crossed_column(). For example,
tf.feature_column.crossed_column(["education", "occupation"], hash_bucket_size=1000)
crosses education with occupation and then hashes the result into 1000 buckets.

Below is an example from the source code's docstring:

SparseTensor referred by first key:
shape = [2, 2]
{
    [0, 0]: "a"
    [1, 0]: "b"
    [1, 1]: "c"
}
SparseTensor referred by second key:
shape = [2, 1]
{
    [0, 0]: "d"
    [1, 0]: "e"
}
then crossed feature will look like:
shape = [2, 2]
{
    [0, 0]: Hash64("d", Hash64("a")) % hash_bucket_size
    [1, 0]: Hash64("e", Hash64("b")) % hash_bucket_size
    [1, 1]: Hash64("e", Hash64("c")) % hash_bucket_size
}
Here [0, 0] and the like are the coordinates of the filled positions in the input batch (these are SparseTensors; in TensorFlow a SparseTensor is defined by three dense tensors: indices, which marks the positions that hold values, values, which holds the value at each of those positions, and dense_shape, the shape of the data). For the first key, the first example only has [0, 0], i.e. only its first position is filled, while the second example has [1, 0] and [1, 1], so both of its positions are filled.
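For reference, a sketch of how the first of those inputs would be written as a tf.SparseTensor (my own illustration):

import tensorflow as tf

# The "first key" input above as a SparseTensor: indices marks the filled
# positions, values holds the strings, dense_shape is the logical [2, 2] shape.
first_key = tf.SparseTensor(
    indices=[[0, 0], [1, 0], [1, 1]],
    values=["a", "b", "c"],
    dense_shape=[2, 2])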
The diagram in the write-up linked earlier illustrates the same thing: the crossed feature in this example is just that per-row cross, with one row per sample.

6. Indicator features. A DNN cannot consume a SparseColumn directly: the categorical transforms above all end up mapping strings to integers, but what comes back for each value is still just an integer ID, and we cannot feed that ID straight into the network (the linear model, by contrast, can take these columns directly, implementing the linear part through an embedding-style lookup of its weights, as we will see below).
So next comes tf.feature_column.indicator_column(), which one-hot encodes a categorical feature (multi-hot if the feature is multivalent). It is backed by the _IndicatorColumn class, which is essentially a one_hot():

one_hot_id_tensor = array_ops.one_hot(
    dense_id_tensor,
    depth=self._variable_shape[-1],
    on_value=1.0,
    off_value=0.0)

For multivalent features, the individual one-hot encodings are collapsed (summed) before being returned:

return math_ops.reduce_sum(one_hot_id_tensor, axis=[-2])
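Usage is just wrapping a categorical column, e.g. (illustrative) one-hot encoding the hashed occupation column for the DNN input:

# One-hot encode the occupation hash column so the DNN can consume it.
occupation_one_hot = tf.feature_column.indicator_column(occupation)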

Embedding_column

tf.feature_column.embedding_column(native_country, dimension=8)

Looking at the underlying implementation, it essentially builds a table and fetches the embedding vectors from that table by ID.
The concrete implementation is:

return _EmbeddingColumn(
    categorical_column=categorical_column, dimension=dimension,
    combiner=combiner, initializer=initializer,
    ckpt_to_load_from=ckpt_to_load_from,
    tensor_name_in_ckpt=tensor_name_in_ckpt,
    max_norm=max_norm, trainable=trainable)

which then initializes the embedding matrix:

embedding_weights = variable_scope.get_variable(
    name='embedding_weights',
    shape=(self.categorical_column._num_buckets, self.dimension),
    dtype=dtypes.float32,
    initializer=self.initializer,
    trainable=self.trainable and trainable,
    collections=weight_collections)

This weight matrix is effectively a set of network weights: if it is trainable, it gets trained as part of the network, but when it is used it is treated as an embedding table, and each feature's embedding is fetched from it by ID. The lookup goes through _safe_embedding_lookup_sparse().
To keep the matrix from becoming too large, the implementation also supports partitioning it, i.e. splitting one big matrix into several smaller ones, which is what the partition_strategy is for. It defines two ways of assigning IDs to the partitions, explained in detail here:
https://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do
In short, given several shard matrices and an incoming ID, you either assign IDs to the shards round-robin, one to this shard, the next to the following shard, and so on, which is 'mod' (id % num_shards); or you fill the shards with contiguous ID ranges, one shard after another, which is 'div'.
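A tiny sketch of the two strategies (my own illustration), assigning 10 IDs to 3 shards:

# 'mod': id goes to shard id % num_shards (round-robin).
# 'div': contiguous id ranges per shard; the first (num_ids % num_shards)
#        shards get one extra id, which is how embedding_lookup's 'div' splits them.
num_shards, num_ids = 3, 10
mod_assignment = {i: i % num_shards for i in range(num_ids)}

ids_per_shard = [num_ids // num_shards + (1 if s < num_ids % num_shards else 0)
                 for s in range(num_shards)]
div_assignment, next_id = {}, 0
for shard, count in enumerate(ids_per_shard):
    for _ in range(count):
        div_assignment[next_id] = shard
        next_id += 1

print(mod_assignment)  # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, ...}
print(div_assignment)  # {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, ...}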

Model Construction

m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])

The model is built by calling this class directly; it inherits from Estimator.

class DNNLinearCombinedClassifier(estimator.Estimator)

The actual implementation lives in

_dnn_linear_combined_model_fn

which mainly does two things: set up the optimizers and assemble the model structure (the top-level output, loss and so on are defined separately in the head).

Building the DNN Part

The input layer is built first:

net = feature_column_lib.input_layer()

This creates input nodes sized to each feature's dimension and concatenates all the features into the output tensors. The input arrives batch by batch: the network trains on a whole batch of examples at a time, rather than computing examples one by one and summing them over the batch.
With the input layer in place, the hidden layers are built according to the hidden unit counts you pass in.
The only things you can change here are the activation function and the number of units; the weight initializer is the default one, with no interface exposed for it.
Finally, the output:
Since this is binary classification, the output layer has a single unit; head.logits_dimension is what returns the number of output units for the problem (for multi-class problems it equals the number of classes, so a three-class problem would get 3 output units). Note that there is no activation function here, because the output still has to be added to the linear part before going into the sigmoid. This dnn_logits is the output of the deep part.
In other words, this whole block just computes the deep logits.
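A condensed sketch of what the deep part amounts to (illustrative; it uses the TF 1.x layers API rather than the estimator internals, and assumes features and deep_columns are already defined):

# Dense input from the feature columns, a stack of ReLU hidden layers,
# and a final linear layer producing dnn_logits (no activation).
net = tf.feature_column.input_layer(features, deep_columns)
for units in [100, 50]:                                   # dnn_hidden_units
    net = tf.layers.dense(net, units, activation=tf.nn.relu)
dnn_logits = tf.layers.dense(net, 1, activation=None)     # logits_dimension = 1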

The Linear Part

The linear part just computes $y=W^{T}X+b$, which is simple enough; it is implemented in linear_logits = feature_column_lib.linear_model(). What differs from an ordinary linear model is that categorical features and real-valued features are handled by different code paths:

for column in sorted(feature_columns, key=lambda x: x.name):
    with variable_scope.variable_scope(None, default_name=column.name):
        ordered_columns.append(column)
        if isinstance(column, _CategoricalColumn):
            weighted_sums.append(_create_categorical_column_weighted_sum(
                column, builder, units, sparse_combiner, weight_collections,
                trainable))
        else:
            weighted_sums.append(_create_dense_column_weighted_sum(
                column, builder, units, weight_collections, trainable))

Categorical columns and dense columns are implemented with an embedding lookup and a matrix multiplication respectively, in _create_categorical_column_weighted_sum() and _create_dense_column_weighted_sum(); each column's contribution is appended to weighted_sums and summed up at the end.

_create_categorical_column_weighted_sum()

weight = variable_scope.get_variable(
    name='weights',
    shape=(column._num_buckets, units),  # pylint: disable=protected-access
    initializer=init_ops.zeros_initializer(),
    trainable=trainable,
    collections=weight_collections)
return _safe_embedding_lookup_sparse(
    weight,
    id_tensor,
    sparse_weights=weight_tensor,
    combiner=sparse_combiner,
    name='weighted_sum')

As you can see, it first creates a weight matrix initialized with zeros and then calls _safe_embedding_lookup_sparse to fetch the weight for each ID; it is really just an embedding lookup.

_create_dense_column_weighted_sum()

For the remaining real-valued features it is more direct: a plain weighted sum (a matrix multiplication of the dense values with a weight matrix).
Finally, all of the weighted_sums are added together and a bias is added:

predictions_no_bias = math_ops.add_n(
    weighted_sums, name='weighted_sum_no_bias')
bias = variable_scope.get_variable(
    'bias_weights',
    shape=[units],
    initializer=init_ops.zeros_initializer(),
    trainable=trainable,
    collections=weight_collections)
predictions = nn_ops.bias_add(
    predictions_no_bias, bias, name='weighted_sum')

In short, the linear part is a per-column weighted sum (an embedding-style weight lookup for categorical columns, a matrix multiplication for dense ones) plus a bias.
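A compact sketch of the wide part in a single call (illustrative; features and crossed_columns are assumed to be defined): linear_model builds the per-column weights, sums them and adds the bias, giving linear_logits.

# The whole wide part: per-column weighted sums plus bias, producing logits.
linear_logits = tf.feature_column.linear_model(
    features, crossed_columns, units=1)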

Combining the Two Parts

if dnn_logits is not None and linear_logits is not None:
    logits = dnn_logits + linear_logits

Finally the outputs of the two parts are simply added together, passed through a sigmoid, and the loss is computed with cross entropy; all of this is handled inside head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss().
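Numerically, that head boils down to something like the following sketch (my own illustration, not the head's actual code; labels is assumed to be a 0/1 app-acquisition tensor with the same shape as the logits):

# Sigmoid + cross-entropy on the combined logits.
logits = dnn_logits + linear_logits
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
probabilities = tf.sigmoid(logits)  # P(Y=1|X), used for ranking at serving time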

Back-propagation of the loss is also straightforward:
def _train_op_fn(loss):
    """Returns the op to optimize the loss."""
    ...
    if dnn_logits is not None:
        train_ops.append(
            dnn_optimizer.minimize(
                loss, ...))
    if linear_logits is not None:
        train_ops.append(
            linear_optimizer.minimize(
                loss, ...))

One thing I do not fully understand: the linear model should converge much faster than the network, so how do you make sure both sides end up well converged? Presumably with good regularization; as long as it does not overfit, both parts should converge reasonably well in the end.
Also, this version of the wide & deep code differs a bit from the earlier one: some of the default learning rates have been adjusted, and the old center_bias option is gone.
