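[Feature Engineering for Machine Learning] Study Notes (Part 3)

[Feature Engineering for Machine Learning] Study Notes: Day 3 & 2.13 & Chapter 4 & pp. 52-64

4. The Effects of Feature Scaling: From Bag-of-Words to tf-idf

4.1 tf-idf: A Simple Extension of Bag-of-Words

tf-idf: term frequency - inverse document frequency
Instead of the raw count of each word in each document, tf-idf computes a normalized count: each word count is divided by the number of documents in which that word appears.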

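bow(w, d) = number of times word w appears in document d
tf-idf(w, d) = bow(w, d) * N / (number of documents in which word w appears)
where N is the total number of documents in the dataset.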
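
The two definitions above can be checked by hand on a tiny corpus. The following is a minimal sketch, not code from the book: the three documents and the helper name `tf_idf` are made up for illustration, and it implements exactly the unlogged formula given above.

```python
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
docs = [
    "the food was great",
    "the bar was loud",
    "great cocktails at the bar",
]

tokenized = [doc.split() for doc in docs]
N = len(tokenized)  # total number of documents

# bow(w, d): raw count of word w in document d
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents in which word w appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(word, doc_index):
    """tf-idf(w, d) = bow(w, d) * N / (number of documents containing w)."""
    return bow[doc_index][word] * N / doc_freq[word]


print(tf_idf("the", 0))   # "the" is in all 3 documents: 1 * 3 / 3 = 1.0
print(tf_idf("food", 0))  # "food" is in 1 document:     1 * 3 / 1 = 3.0
print(tf_idf("bar", 1))   # "bar" is in 2 documents:     1 * 3 / 2 = 1.5
```

A word that appears in every document keeps its raw count, while a rare word is scaled up, which is the intended effect of the inverse document frequency.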

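4.2 Putting tf-idf to the Test

tf-idf transforms the word-count features by multiplying each one by a constant (one constant per word). It is therefore a feature scaling method.

Step 1: Load and clean the Yelp reviews dataset in Python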

>>> import json
>>> import pandas as pd
# Load the Yelp business data
>>> biz_f = open('yelp_academic_dataset_business.json')
>>> biz_df = pd.DataFrame([json.loads(x) for x in biz_f.readlines()])
>>> biz_f.close()
# Load the Yelp review data
>>> review_file = open('yelp_academic_dataset_review.json')
>>> review_df = pd.DataFrame([json.loads(x) for x in review_file.readlines()])
>>> review_file.close()
# Select the nightlife venues and the restaurants
>>> two_biz = biz_df[biz_df.apply(lambda x: 'Nightlife' in x['categories'] or
...                                         'Restaurants' in x['categories'],
...                               axis=1)]
# Join with the review data to get all reviews for the two types of business
>>> twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')
# Drop the features we do not need
>>> twobiz_reviews = twobiz_reviews[['business_id',
...                                  'name',
...                                  'stars_y',
...                                  'text',
...                                  'categories']]
# Create the target column: True for nightlife businesses, False otherwise
>>> twobiz_reviews['target'] = \
...     twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'],
...                          axis=1)

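4.2.1 Creating a Classification Dataset

The Yelp review data is class-imbalanced, so it is processed as follows:

Take a 10% random sample of the nightlife reviews and a 2.1% random sample of the restaurant reviews (these fractions make the two sampled classes roughly equal in size).
Split the resulting dataset into training and test sets in a 70/30 ratio. In this example, the training set has 29,264 reviews and the test set has 12,542 reviews.
The training data contains 46,924 unique words, which is the number of features in the bag-of-words representation.

Step 2: Create a balanced classification dataset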

>>> import sklearn.model_selection as modsel
# Create a class-balanced subsample to practice on
>>> nightlife = \
...     twobiz_reviews[twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'],
...                                         axis=1)]
>>> restaurants = \
...     twobiz_reviews[twobiz_reviews.apply(lambda x: 'Restaurants' in x['categories'],
...                                         axis=1)]
>>> nightlife_subset = nightlife.sample(frac=0.1, random_state=123)
>>> restaurant_subset = restaurants.sample(frac=0.021, random_state=123)
>>> combined = pd.concat([nightlife_subset, restaurant_subset])
# Split into training and test sets
>>> training_data, test_data = modsel.train_test_split(combined,
...                                                    train_size=0.7,
...                                                    random_state=123)
>>> training_data.shape
(29264, 5)
>>> test_data.shape
(12542, 5)

4.2.2 Scaling Bag-of-Words with the tf-idf Transformation

Step 3: Transform the features

>>> from sklearn.feature_extraction import text
>>> import sklearn.preprocessing as preproc
# Represent the review text as a bag-of-words
>>> bow_transform = text.CountVectorizer()
>>> X_tr_bow = bow_transform.fit_transform(training_data['text'])
>>> X_te_bow = bow_transform.transform(test_data['text'])
>>> len(bow_transform.vocabulary_)
46924
>>> y_tr = training_data['target']
>>> y_te = test_data['target']
# Create the tf-idf representation from the bag-of-words matrix
>>> tfidf_trfm = text.TfidfTransformer(norm=None)
>>> X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)
>>> X_te_tfidf = tfidf_trfm.transform(X_te_bow)
# For practice only, l2-normalize the bag-of-words representation
>>> X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
>>> X_te_l2 = preproc.normalize(X_te_bow, axis=0)

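Note: feature scaling on the test set. A subtle point about feature scaling is that it requires feature statistics that we most likely do not know in practice, such as the mean, variance, document frequency, l2 norm, and so on. To compute the tf-idf representation, we must compute the inverse document frequencies from the training data and use those statistics to scale both the training data and the test data. In scikit-learn, fitting a feature transformer on the training data amounts to collecting the relevant statistics. The fitted transformer can then be applied to the test data.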
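
The fit-on-train, apply-to-test pattern described in the note can be seen on a toy example. This is a minimal sketch under stated assumptions: the two short text lists are hypothetical stand-ins for the Yelp reviews. Note also that scikit-learn's `TfidfTransformer` uses a smoothed logarithmic idf by default, not the plain N / (document count) ratio from section 4.1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data standing in for the training and test reviews.
train_texts = ["great food great service", "loud bar", "great bar food"]
test_texts = ["great cocktails", "loud service"]

# The vocabulary is learned from the training text only.
bow = CountVectorizer()
X_tr_bow = bow.fit_transform(train_texts)
X_te_bow = bow.transform(test_texts)  # unseen words such as "cocktails" are dropped

# fit_transform collects the document frequencies from the training data ...
tfidf = TfidfTransformer(norm=None)
X_tr_tfidf = tfidf.fit_transform(X_tr_bow)
# ... and transform reuses those same statistics to scale the test data.
X_te_tfidf = tfidf.transform(X_te_bow)

# One idf weight per vocabulary word, computed from the training set alone.
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
for word, idf in zip(vocab, tfidf.idf_):
    print(f"{word:10s} idf={idf:.3f}")

# The test matrix is scaled with the training-set weights.
print(np.allclose(X_te_tfidf.toarray(), X_te_bow.toarray() * tfidf.idf_))
```

Calling `fit_transform` on the test data instead would recompute the document frequencies from the test set, so the two matrices would no longer be on the same scale.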

4.2.3 Classification with Logistic Regression

Step 4: Train logistic regression classifiers with default parameters

>>> from sklearn.linear_model import LogisticRegression
>>> def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
...     # Helper function: train a logistic regression classifier and score it on the test data.
...     # _C is the inverse regularization strength; 1.0 is the scikit-learn default.
...     m = LogisticRegression(C=_C).fit(X_tr, y_tr)
...     s = m.score(X_test, y_test)
...     print('Test score with', description, 'features:', s)
...     return m
...
>>> m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
>>> m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
>>> m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')

Test score with bow features: 0.775873066497
Test score with l2-normalized features: 0.763514590974
Test score with tf-idf features: 0.743182905438

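The results show that the most accurate classifier is the one using bag-of-words features. In fact, this happens because the classifiers have not been properly tuned.

4.2.4 Tuning Logistic Regression with Regularization

GridSearchCV in scikit-learn runs a grid search with cross-validation.

Step 5: Tune logistic regression with a grid search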

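>>> import sklearn.model_selection as modsel
# Specify a search grid, then run a 5-fold grid search for each feature set
>>> param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}
# Tune the classifier for the bag-of-words representation
# (recent scikit-learn versions need return_train_score=True
# to produce the *_train_score entries shown below)
>>> bow_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
...                                  param_grid=param_grid_,
...                                  return_train_score=True)
>>> bow_search.fit(X_tr_bow, y_tr)
# Tune the classifier for the l2-normalized word vectors
>>> l2_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
...                                 param_grid=param_grid_)
>>> l2_search.fit(X_tr_l2, y_tr)
# Tune the classifier for tf-idf
>>> tfidf_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
...                                    param_grid=param_grid_)
>>> tfidf_search.fit(X_tr_tfidf, y_tr)
# Inspect one of the grid search outputs to see how it ran
>>> bow_search.cv_results_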
{'mean_fit_time': array([ 0.43648252, 0.94630651,
               5.64090128,  15.31248307,  31.47010217,  42.44257565]),
     'mean_score_time': array([ 0.00080056,  0.00392466,  0.00864897,  0.00784755,
              0.01192751,  0.0072515 ]),
     'mean_test_score': array([ 0.57897075,  0.7518111 ,  0.78283898,  0.77381766,
              0.75515992,  0.73937261]),
'mean_train_score': array([ 0.5792185 ,  0.76731652,  0.87697341,  0.94629064,
         0.98357195,  0.99441294]),
'param_C': masked_array(data = [1e-05 0.001 0.1 1.0 10.0 100.0],
              mask = [False False False False False False],
        fill_value = ?),
'params': ({'C': 1e-05},
  {'C': 0.001},
  {'C': 0.1},
  {'C': 1.0},
  {'C': 10.0},
  {'C': 100.0}),
'rank_test_score': array([6, 4, 1, 2, 3, 5]),
'split0_test_score': array([ 0.58028698,  0.75025624,  0.7799795 ,  0.7726341 ,
         0.75247694,  0.74086095]),
'split0_train_score': array([ 0.57923964,  0.76860316,  0.87560871,  0.94434003,
         0.9819308 ,  0.99470312]),
'split1_test_score': array([ 0.5786776 ,  0.74628396,  0.77669571,  0.76627371,
         0.74867589,  0.73176149]),
'split1_train_score': array([ 0.57917218,  0.7684849 ,  0.87945837,  0.94822946,
         0.98504976,  0.99538678]),
'split2_test_score': array([ 0.57816504,  0.75533914,  0.78472578,  0.76832394,
         0.74799248,  0.7356911 ]),
'split2_train_score': array([ 0.57977019,  0.76613558,  0.87689548,  0.94566657,
         0.98368288,  0.99397719]),
'split3_test_score': array([ 0.57894737,  0.75051265,  0.78332194,  0.77682843,
         0.75768968,  0.73855092]),
'split3_train_score': array([ 0.57914745,  0.76678626,  0.87634546,  0.94558346,
         0.98385443,  0.99474628]),
'split4_test_score': array([ 0.57877649,  0.75666439,  0.78947368,  0.78503076,
         0.76896787,  0.75      ]),
'split4_train_score': array([ 0.57876303,  0.7665727 ,  0.87655903,  0.94763369,
         0.98334188,  0.99325132]),
'std_fit_time': array([ 0.03874582,  0.02297261,  1.18862097,  1.83901079,
         4.21516797,  2.93444269]),
'std_score_time': array([ 0.00160112,  0.00605009,  0.00623053,  0.00698687,
         0.00713112,  0.00570195]),
'std_test_score': array([ 0.00070799,  0.00375907,  0.00432957,  0.00668246,
         0.00612049]),
'std_train_score': array([ 0.00032232,  0.00102466,  0.00131222,  0.00143229,
         0.00100223,  0.00073252])}
# Plot the cross-validation results in a box plot
# to compare classifier performance visually
>>> search_results = pd.DataFrame.from_dict({
...     'bow': bow_search.cv_results_['mean_test_score'],
...     'tfidf': tfidf_search.cv_results_['mean_test_score'],
...     'l2': l2_search.cv_results_['mean_test_score']
... })
# The usual matplotlib setup
# seaborn is used to make the plot prettier
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set_style("whitegrid")
>>> ax = sns.boxplot(data=search_results, width=0.4)
>>> ax.set_ylabel('Accuracy', size=14)
>>> ax.tick_params(labelsize=14)

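Step 6: Final training and testing to compare the feature sets

# Train a final model on the whole training set using the best hyperparameter found above,
# then measure its accuracy on the test set.
# simple_logistic_classify must accept a _C keyword and pass it to LogisticRegression(C=_C).
>>> m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow',
...                               _C=bow_search.best_params_['C'])
>>> m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized',
...                               _C=l2_search.best_params_['C'])
>>> m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf',
...                               _C=tfidf_search.best_params_['C'])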
Test score with bow features: 0.78360708021
Test score with l2-normalized features: 0.780178599904
Test score with tf-idf features: 0.788470738319
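
The Yelp files are large, so it can help to rehearse Steps 3 to 6 on something small first. The sketch below is not from the book. It assumes a hypothetical eight-sentence corpus (repeated so that 5-fold cross-validation has enough rows), adds `stratify` and `max_iter=1000` for stability on such a tiny sample, and relies on the default `refit=True` of `GridSearchCV` to retrain on the full training set with the best `C`. The accuracies it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical miniature corpus: label True marks nightlife-style reviews.
nightlife = ["great cocktails and loud music", "the bar had a fun dance floor",
             "late night drinks with a dj", "crowded club with strong drinks"]
restaurant = ["the pasta was fresh and tasty", "friendly waiter and good steak",
              "brunch menu with great pancakes", "soup and salad were delicious"]
texts = (nightlife + restaurant) * 5   # repeated so 5-fold CV has enough rows
labels = ([True] * 4 + [False] * 4) * 5

tr_text, te_text, y_tr, y_te = train_test_split(
    texts, labels, train_size=0.7, random_state=123, stratify=labels)

# Step 3: bag-of-words, then tf-idf fitted on the training rows only
bow = CountVectorizer()
X_tr_bow = bow.fit_transform(tr_text)
X_te_bow = bow.transform(te_text)
tfidf = TfidfTransformer(norm=None)
X_tr_tfidf = tfidf.fit_transform(X_tr_bow)
X_te_tfidf = tfidf.transform(X_te_bow)

# Steps 5 and 6: tune C per feature set, then score the refitted model on the test set
param_grid = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}
feature_sets = {'bow': (X_tr_bow, X_te_bow), 'tf-idf': (X_tr_tfidf, X_te_tfidf)}
results = {}
for name, (X_tr, X_te) in feature_sets.items():
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
    search.fit(X_tr, y_tr)  # refit=True retrains on all training rows with the best C
    results[name] = (search.best_params_['C'], search.score(X_te, y_te))
    print(name, 'best C:', results[name][0], 'test accuracy:', results[name][1])
```

Tuning `C` separately for each feature set is what makes the comparison fair, since differently scaled features generally call for different amounts of regularization.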

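4.3 Deep Dive: What Is Happening?

tf-idf = column scaling
tf-idf and l2 normalization are both column operations on the data matrix.
Proper feature scaling helps classification. The right scaling accentuates informative words and down-weights common words. It can also reduce the condition number of the data matrix. The right scaling is not necessarily uniform column scaling.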
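
The column-scaling view can be verified numerically. The following is a minimal sketch on a hypothetical three-document corpus: it checks that scikit-learn's tf-idf (with `norm=None`) and column-wise l2 normalization each multiply every column of the bag-of-words matrix by one constant, then prints the condition number of each matrix so they can be compared. It makes no claim about which matrix comes out better conditioned on this toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

# Hypothetical toy corpus, used only for illustration.
docs = ["great food great service", "loud bar", "great bar food"]

X_sparse = CountVectorizer().fit_transform(docs)
X_bow = X_sparse.toarray().astype(float)

# tf-idf without row normalization: column j is multiplied by the constant idf_[j].
tfidf = TfidfTransformer(norm=None).fit(X_sparse)
X_tfidf = tfidf.transform(X_sparse).toarray()
assert np.allclose(X_tfidf, X_bow * tfidf.idf_)

# l2 normalization along axis=0 divides each column by its own l2 norm.
X_l2 = normalize(X_bow, axis=0)
col_norms = np.linalg.norm(X_bow, axis=0)
assert np.allclose(X_l2, X_bow / col_norms)

# Condition number (largest / smallest singular value) of each matrix.
for name, X in [("bow", X_bow), ("tf-idf", X_tfidf), ("l2", X_l2)]:
    print(name, np.linalg.cond(X))
```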

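Reference: Feature Engineering for Machine Learning (《精通特徵工程》), Alice Zheng and Amanda Casari

Study notes on feature engineering for machine learning:
[Feature Engineering for Machine Learning] Study Notes (Part 1)
[Feature Engineering for Machine Learning] Study Notes (Part 2)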
