Kaggle: StumbleUpon Evergreen Classification Challenge


Competition period: 2013/08/16 - 2013/10/31

1. Background

Build a classifier to categorize webpages as evergreen or non-evergreen

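StumbleUpon is a US user-generated content (UGC) site. Users share content, and the site uses their behavioral data to build an interest graph and personalize recommendations to each user's tastes.

StumbleUpon ran a competition and supplied the data: a labeled training set and an unlabeled test set. The task is to build a classifier from this historical data that predicts whether a given webpage is evergreen (popular over the long term) or only briefly popular.

The training set contains the webpage content and a label (whether the page is evergreen, i.e. popular over the long term).

The test set contains the webpage content only.

Prediction target y: 0 or 1 (0: non-evergreen, 1: evergreen)

The data fields, as described on the competition site: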

FieldName	Type	Description
url	string	Url of the webpage to be classified
urlid	integer	StumbleUpon's unique identifier for each url
boilerplate	json	Boilerplate text
alchemy_category	string	Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score	double	Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize	double	Average number of words in each link
commonLinkRatio_1	double	# of links sharing at least 1 word with 1 other links / # of links
commonLinkRatio_2	double	# of links sharing at least 1 word with 2 other links / # of links
commonLinkRatio_3	double	# of links sharing at least 1 word with 3 other links / # of links
commonLinkRatio_4	double	# of links sharing at least 1 word with 4 other links / # of links
compression_ratio	double	Compression achieved on this page via gzip (measure of redundancy)
embed_ratio	double	Count of number of <embed> usage
frameBased	integer (0 or 1)	A page is frame-based (1) if it has no body markup but have a frameset markup
frameTagRatio	double	Ratio of iframe markups over total number of markups
hasDomainLink	integer (0 or 1)	True (1) if it contains an <a> with an url with domain
html_ratio	double	Ratio of tags vs text in the page
image_ratio	double	Ratio of <img> tags vs text in the page
is_news	integer (0 or 1)	True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain	integer (0 or 1)	True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore	double	Percentage of words on the page that are in hyperlink's text
news_front_page	integer (0 or 1)	True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters	integer	Page's text's number of alphanumeric characters
numberOfLinks	integer	Number of <a> markups
numwords_in_url	double	Number of words in url
parametrizedLinkRatio	double	A link is parametrized if it's url contains parameters or has an attached onClick event
spelling_errors_ratio	double	Ratio of words not found in wiki (considered to be a spelling mistake)
label	integer (0 or 1)	User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

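2. Evaluation metric

Submissions are scored by the AUC (area under the ROC curve) of the predictions: the higher the AUC, the higher the ranking. AUC is a common metric for classification problems, especially binary classification where the positive and negative classes are imbalanced.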
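To make the metric concrete, here is a minimal sketch of AUC computed directly from its ranking definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The function name `auc_score` and the toy data are my own illustration, not part of the competition code; in practice a library routine would be used instead of this quadratic loop.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data: labels follow the competition convention (1 = evergreen).
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1]
print(auc_score(y_true, y_score))
```

Because only the ordering of the scores matters, AUC is unaffected by how many samples fall in each class, which is why it suits imbalanced problems.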

3. Data distribution

Training set size: 6706 samples

Test set size: 3171 samples

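4. Classification and prediction model

4.1 Feature extraction

(1) Extract the text terms

Extract the text from three fields of each sample: title, body, and url.

(2) Extract the numeric features

Extract 22 numeric features: avglinksize, alchemy_category_score, linkwordscore, numwords_in_url, and so on.

(3) Preprocess the text features

Preprocess the text (tokenize, remove stop words, drop very rare and very frequent terms, and so on), then run feature selection on the terms.

The feature selection code: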

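fs_num = 75000  # number of top-ranked terms to keep
# feature_selection is a helper module (not scikit-learn); fs_method names the scoring method
term_set_fs = feature_selection.feature_selection(train_doc_terms_list, labels, fs_method)
term_set_fs = term_set_fs[:fs_num]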

Feature selection here uses information gain; mutual information and the chi-squared test are alternatives.
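As a rough illustration of what ranking terms by information gain means, the sketch below scores each term by how much knowing its presence reduces the entropy of the label. The functions `entropy` and `information_gain` and the toy corpus are assumptions of mine for illustration; they are not the helper module used in the snippet above.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` is present."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    gain = entropy([labels.count(0), labels.count(1)])
    for subset in (with_term, without_term):
        if subset:
            gain -= len(subset) / n * entropy([subset.count(0), subset.count(1)])
    return gain


# Toy corpus: each document is a set of terms; 1 = evergreen.
docs = [{"recipe", "chicken"}, {"recipe", "cake"}, {"election", "news"}, {"news", "cake"}]
labels = [1, 1, 0, 0]
vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
fs_num = 2  # keep only the top-ranked terms, as term_set_fs[:fs_num] does above
print(ranked[:fs_num])
```

In this toy corpus "recipe" and "news" separate the classes perfectly, while "cake" appears once in each class and so carries no information.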

The text preprocessing uses the following vectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=3, max_df=1.0, token_pattern=r'\w{1,}', strip_accents='unicode', ngram_range=(1, 2), stop_words='english', sublinear_tf=True)

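(4) tf-idf vectorization of the text features

Vectorize the preprocessed training and test text with tf-idf, producing sparse feature vectors.

# vectorizer is assumed to have been fitted already (the fit call is not shown here)
X_train = vectorizer.transform(train_doc_str_list)
X_test = vectorizer.transform(test_doc_str_list)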
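For readers who want to see the arithmetic behind the sparse vectors, here is a small pure-Python sketch of the weighting scheme: sublinear term frequency, smoothed inverse document frequency, and L2 normalization per document, which is what TfidfVectorizer applies with sublinear_tf=True and its default smoothing. The function `tfidf_vectors` and the toy documents are my own; tokenization, n-grams, stop words, and the min_df cutoff are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses sublinear tf (1 + ln tf), smoothed idf (ln((1 + n) / (1 + df)) + 1)
    and L2 normalization of each document vector.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {}
        for term, tf in Counter(doc).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            weights[term] = (1 + math.log(tf)) * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    ["easy", "chicken", "recipe", "recipe"],
    ["breaking", "news", "today"],
    ["chicken", "news"],
]
for vec in tfidf_vectors(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first document "recipe" gets the largest weight because it is both repeated and rare across documents, while "chicken" is down-weighted for appearing in two of the three documents.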

(5) Normalize the numeric features

Here the numeric features of the training and test sets are stacked together and then norm-normalized.

from scipy import sparse
from sklearn.preprocessing import normalize
all_num_feature = sparse.vstack((train_num_feature_matrix, test_num_feature_matrix)).tocsr()
all_num_feature = normalize(all_num_feature, axis=0)  # axis=0: scale each feature column to unit L2 norm
n_train = train_num_feature_matrix.shape[0]
train_num_feature_matrix = all_num_feature[:n_train, :]
test_num_feature_matrix = all_num_feature[n_train:, :]
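The effect of normalizing along axis 0 can be seen on a tiny dense example. This is only a sketch with made-up numbers and a hypothetical helper `normalize_columns`; it mirrors the stack, scale, and split steps above without the sparse matrices.

```python
import math


def normalize_columns(rows):
    """Scale every column of a row-major matrix to unit L2 norm."""
    n_cols = len(rows[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n_cols)]
    # A column of all zeros is left as zeros to avoid dividing by zero.
    return [[row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)] for row in rows]


train_rows = [[3.0, 100.0], [4.0, 0.0]]  # toy numeric features
test_rows = [[0.0, 0.0], [12.0, 0.0]]

stacked = normalize_columns(train_rows + test_rows)  # scale train and test together
train_scaled = stacked[:len(train_rows)]
test_scaled = stacked[len(train_rows):]
print(train_scaled)
print(test_scaled)
```

Scaling per column puts features with very different ranges (a link count versus a ratio, say) on a comparable scale before they are joined with the tf-idf features.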

(6) Combine the tf-idf text features with the numeric features

X_train = sparse.hstack((X_train, train_num_feature_matrix)).tocsr()
X_test = sparse.hstack((X_test, test_num_feature_matrix)).tocsr()

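4.2 Classification models

(1) Naive Bayes

Naive Bayes is a common algorithm for text classification, and it was the first classifier I tried in this competition.

The features used for Naive Bayes differ slightly from those described above: only the text features are used, and as binary presence/absence values rather than tf-idf weights.

CountVectorizer(min_df=3, max_df=1.0, token_pattern=r'\w{1,}', strip_accents='unicode', ngram_range=(1, 2), stop_words='english', binary=True)

The classifier is model = BernoulliNB(alpha=a).

With 6000 term features and a = 0.5, the mean AUC over 10-fold CV is 0.87.

The leaderboard score was about 0.85.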
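To show what a Bernoulli Naive Bayes model does with binary term features, here is a minimal pure-Python sketch. The names `train_bernoulli_nb` and `predict_proba_evergreen` and the toy documents are assumptions for illustration only; the smoothing follows the usual additive scheme controlled by alpha, and both classes are assumed to be present in the training data.

```python
import math


def train_bernoulli_nb(docs, labels, vocabulary, alpha=0.5):
    """Estimate class priors and smoothed P(term present | class)."""
    model = {}
    for cls in (0, 1):
        cls_docs = [d for d, y in zip(docs, labels) if y == cls]
        prior = len(cls_docs) / len(docs)
        # Additive smoothing: alpha pseudo-counts for "present" and for "absent".
        p_term = {
            t: (sum(t in d for d in cls_docs) + alpha) / (len(cls_docs) + 2 * alpha)
            for t in vocabulary
        }
        model[cls] = (prior, p_term)
    return model


def predict_proba_evergreen(model, doc):
    """Posterior probability that `doc` (a set of terms) is evergreen."""
    log_joint = {}
    for cls, (prior, p_term) in model.items():
        score = math.log(prior)
        for term, p in p_term.items():
            # Absent terms also contribute, via log(1 - p).
            score += math.log(p if term in doc else 1.0 - p)
        log_joint[cls] = score
    top = max(log_joint.values())
    exp = {cls: math.exp(s - top) for cls, s in log_joint.items()}
    return exp[1] / (exp[0] + exp[1])


docs = [{"recipe", "chicken"}, {"recipe", "cake"}, {"election", "news"}, {"news", "today"}]
labels = [1, 1, 0, 0]
vocabulary = sorted(set().union(*docs))
model = train_bernoulli_nb(docs, labels, vocabulary, alpha=0.5)
print(predict_proba_evergreen(model, {"recipe", "cake"}))
print(predict_proba_evergreen(model, {"news"}))
```

Unlike the multinomial variant, the Bernoulli model explicitly penalizes the absence of terms that are typical for a class, which is why binary features are the natural input here.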

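(2) Logistic Regression

LR uses the features described in 4.1.

The model:

# recent scikit-learn: 'balanced' replaces the removed class_weight='auto', and dual=True requires solver='liblinear'
model = LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight='balanced', solver='liblinear')

Mean AUC over 10-fold CV: 0.886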
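The 10-fold procedure behind these averages can be sketched as follows. Everything here is illustrative: `k_fold_indices`, `mean_cv_score`, the toy threshold model, and the toy data are my own, the folds are taken in order without shuffling, and accuracy stands in for AUC to keep the example short.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is held out exactly once."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


def mean_cv_score(fit, score, X, y, n_folds=10):
    """Average `score` over folds; `fit(X, y)` must return a predict function."""
    fold_scores = []
    for train_idx, valid_idx in k_fold_indices(len(X), n_folds):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(X[i]) for i in valid_idx]
        fold_scores.append(score([y[i] for i in valid_idx], preds))
    return sum(fold_scores) / len(fold_scores)


def fit_threshold(X_train, y_train):
    """Toy model: predict 1 when x is above the midpoint of the class means."""
    pos = [x for x, label in zip(X_train, y_train) if label == 1]
    neg = [x for x, label in zip(X_train, y_train) if label == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


X = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.3, 0.7, 0.25, 0.75] * 2
y = [0, 1] * 10
cv_mean = mean_cv_score(fit_threshold, accuracy, X, y, n_folds=10)
print(cv_mean)
```

For the real task, `fit` would train the LR model on the fold's training rows and `score` would compute AUC on the held-out rows.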

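Early on, the leaderboard score was about 0.884, but the final published result was 0.87894, so the model had overfit. Even funnier, the entry that was in first place at the time also dropped to around 200th.

There are two likely causes of the overfitting. First, the sample size is small, with more feature variables than samples. Second, some terms in the test set never appear in the training set. In particular, term selection was driven by the training set and threw out terms that looked weakly related to the label there, but dropping them introduces classification errors on the test set and leaves some test documents with too few term features. So when selecting term features, the number of terms should not be tuned to the optimum on the training data, or the result is severe overfitting.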
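One way to check for this problem before submitting is to measure how much of the test vocabulary is unseen in training, and how many selected terms each test document still has after feature selection. The sketch below is only an illustration with hypothetical helper names and toy data, not something I ran during the competition.

```python
def oov_rate(train_docs, test_docs):
    """Fraction of distinct test terms that never occur in the training docs."""
    train_vocab = set(term for doc in train_docs for term in doc)
    test_vocab = set(term for doc in test_docs for term in doc)
    if not test_vocab:
        return 0.0
    return len(test_vocab - train_vocab) / len(test_vocab)


def kept_terms_per_doc(docs, selected_terms):
    """Number of distinct selected terms that survive in each document."""
    return [len(set(doc) & selected_terms) for doc in docs]


train_docs = [["chicken", "recipe"], ["news", "today"]]
test_docs = [["chicken", "soup"], ["budget", "news"]]
selected_terms = {"recipe", "news"}  # pretend output of term selection on the training set

print(oov_rate(train_docs, test_docs))
print(kept_terms_per_doc(test_docs, selected_terms))
```

A test document with zero surviving terms, like the first one here, gives the classifier nothing but the numeric features to work with.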

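(3) Ensemble

At the end I did ensemble several LR models with different parameters, giving each model its own weight. The weights were set by hand, not learned.

I considered using a linear model to learn weights for combining LR, RandomForest, GBT, and SVM, but gave up because I lacked experience tuning GBT and RandomForest.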
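The hand-weighted ensemble amounts to a weighted average of each model's predicted probabilities. A minimal sketch, assuming three LR models; the prediction values and the weights below are made up for illustration and are not the ones used in the competition.

```python
def weighted_average(predictions, weights):
    """Blend per-model probability lists using fixed, hand-chosen weights."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical outputs of three LR models (e.g. different C) on four test pages.
lr_c1 = [0.90, 0.20, 0.60, 0.40]
lr_c2 = [0.80, 0.30, 0.70, 0.30]
lr_c3 = [0.70, 0.10, 0.50, 0.50]
weights = [0.5, 0.3, 0.2]  # illustrative values only

blended = weighted_average([lr_c1, lr_c2, lr_c3], weights)
print([round(p, 3) for p in blended])
```

Dividing by the sum of the weights keeps the blended values in the same range as the inputs even if the weights do not add up to 1. Learning the weights instead would mean fitting a second-level model on out-of-fold predictions.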

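5. Summary

For classification problems like this one, the method and workflow are largely the same as described above.

Preprocess and extract features, set up cross-validation folds, then train and test on them. In terms of method it comes down to single models plus an ensemble, and my own ensembling skills still need work.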
