Reading notes on 《Python數據分析與機器學習實戰》 (Python Data Analysis and Machine Learning in Practice, by 唐宇迪), Chapter 10: Feature Engineering

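Chapter 10 Feature Engineering

Feature engineering is one of the most important parts of any machine learning project: how features are extracted from the data has a large effect on the final result. During modeling, people tend to think about algorithms and parameters first, but it is the features that set the upper bound on what can be achieved; algorithms and parameters only determine how closely that bound is approached. In essence, feature engineering means finding the most valuable information in the raw data and converting it into a form that a computer can work with. This chapter uses numeric data and text data to show how numeric features and text features are extracted.

10.1 Numeric Features

Numeric features are the most common kind in real data. This section introduces several commonly used methods and functions for extracting them. As usual, we first load a dataset and pick a few of its columns to experiment with; the actual meaning of the columns does not matter here, only the feature operations.

10.1.1 String Encoding

import pandas as pd
import numpy as np

# Video game sales data; the file is not UTF-8, so the encoding is given explicitly
vg_df = pd.read_csv('datasets/vgsales.csv', encoding="ISO-8859-1")
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Many of the columns shown by the code above are strings. Suppose the Genre column is the class label we ultimately want to predict. A computer cannot work with these strings directly, so they have to be converted to numbers.

# List the distinct values in the Genre column
genres = np.unique(vg_df['Genre'])
genres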

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

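After loading data, it is very common to find that many features are described by strings rather than numbers. The printed result shows that the Genre column has 12 distinct values, so they only need to be converted to numbers. The simplest way is to map each value to an integer:

from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
# Learn the distinct classes and map every value to its integer code
genre_labels = gle.fit_transform(vg_df['Genre'])
# Build an {integer code: original label} dictionary for inspection
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings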

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

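LabelEncoder from sklearn does this mapping quickly, numbering the classes from 0 by default. fit_transform() is the call that actually does the work: it learns the classes and maps every value in one step. Once the transformation is done, the result can be added to the original DataFrame for comparison:

vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]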
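As a small aside, the integer codes can be mapped back to the original strings with inverse_transform(). The sketch below uses a made-up list of genres rather than the vgsales data, so the names toy_genres, toy_le, toy_codes and toy_decoded are illustrative only:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the Genre column (illustrative values, not the real data)
toy_genres = ['Sports', 'Action', 'Puzzle', 'Action']

toy_le = LabelEncoder()
# Classes are sorted alphabetically and then numbered from 0
toy_codes = toy_le.fit_transform(toy_genres)

# Map the integer codes back to the original strings
toy_decoded = toy_le.inverse_transform(toy_codes)

print(list(toy_codes))
print(list(toy_decoded))
```

Keeping the fitted encoder around is what makes the round trip possible, which is useful when predictions have to be reported as readable labels.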

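All the string values in this column have now been converted to numbers. A custom mapping can be defined as well.

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()

# Shuffle the rows (frac=1 keeps all of them) and rebuild the index
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)

np.unique(poke_df['Generation'])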

array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

This dataset also has a column with several values that need mapping. The mapping can be written by hand and applied with map(), this time starting from 1:

gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

For a simple mapping, both the hand-written version and the ready-made tool are easy to use. More often, though, one-hot encoding is chosen for this kind of categorical feature. It takes a little more work, but the result is easier to read:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Step 1: LabelEncoder
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary']]

# Step 2: OneHotEncoder
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)

# Combine the encoded features with the original columns in one DataFrame
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
poke_df_ohe = pd.concat([poke_df_sub, gen_features], axis=1)
poke_df_ohe.head()

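The code above imports OneHotEncoder, first maps the values to integers, and then one-hot encodes them. As the output shows, one-hot encoding expands all possible values into separate columns and uses 0 and 1 to mark which one applies: 0 means the row does not have that column's value, 1 means it does. For example, when Gen_Label=3, Gen 4 is 1 and every other position is 0 (the labels start from 0, so Gen_Label=3 is the fourth position).

That code is somewhat cumbersome. Is there a simpler way? Using Pandas directly is more convenient:

# Note: drop_first=True drops the first level ('Gen 1'), so there is one column fewer than in a full one-hot encoding
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

get_dummies() does the one-hot encoding. When there are many features, naming the columns one by one is tedious; a prefix can be specified instead:

gen_onehot_features = pd.get_dummies(poke_df['Generation'], prefix='one-hot')
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

All the one-hot encoded columns now carry the "one-hot" prefix. By comparison, get_dummies() is the more convenient choice: a single line of code does the job.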
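The effect of the drop_first argument is easy to miss, so here is a minimal sketch on a made-up Series (toy_gen is an illustrative name, not part of the Pokemon data). It compares the full one-hot encoding with the version that drops the first level:

```python
import pandas as pd

# Made-up categorical values standing in for the Generation column
toy_gen = pd.Series(['Gen 1', 'Gen 3', 'Gen 2', 'Gen 1'], name='Generation')

full_onehot = pd.get_dummies(toy_gen)                   # one column per level
dummy_coded = pd.get_dummies(toy_gen, drop_first=True)  # first level ('Gen 1') dropped

print(list(full_onehot.columns))
print(list(dummy_coded.columns))
# Row sums: always 1 for full one-hot; 0 for rows that belong to the dropped level
print(full_onehot.astype(int).sum(axis=1).tolist())
print(dummy_coded.astype(int).sum(axis=1).tolist())
```

With drop_first=True a row of all zeros stands for the dropped level, which avoids perfectly collinear columns in linear models.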

# Reload the original data
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()

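10.1.2 Binary and Polynomial Features

Next, open a music dataset:

popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
popsong_df.head(10)

The data record how many times each user played each song. Many play counts are 0, meaning the user has not played that song yet. In this case a binary feature can be added to indicate whether the user has listened to the song:

# Copy the play counts into an array, then set every count of 1 or more to 1
watched = np.array(popsong_df['listen_count'])
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(10)

The new watched feature indicates whether the song has been played. The same binary feature can be produced with Binarizer from sklearn:

from sklearn.preprocessing import Binarizer

# Values above the threshold become 1, all others become 0
bn = Binarizer(threshold=0.9)
# Binarizer expects 2-D input, so the column is wrapped in a list and unwrapped again
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(10)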
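To check that the two approaches agree, the sketch below runs both on a single row of made-up play counts (toy_counts is an illustrative array, not the song_views data):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# One row of made-up play counts
toy_counts = np.array([[0.0, 3.0, 1.0, 0.0, 12.0]])

# Manual version: 1 where the count is at least 1, otherwise 0
by_hand = (toy_counts >= 1).astype(int)

# Binarizer version: values strictly greater than the threshold become 1
by_binarizer = Binarizer(threshold=0.9).transform(toy_counts)

print(by_hand)
print(by_binarizer)
```

For integer-valued counts, "greater than 0.9" and "at least 1" select the same entries, which is why the threshold of 0.9 works here.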

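There are many more ways to transform features, and features can also be combined. Next come polynomial features: for two features a and b, the degree-2 polynomial features are (1, a, b, a², ab, b²). The sklearn transformer below does this:

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()

from sklearn.preprocessing import PolynomialFeatures

# include_bias=False leaves out the constant column, so the output is (a, b, a², ab, b²)
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res[:5]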

　　Attack     Defense
0     49     49
1     62     63
2     82     83
3     100     123
4     52     43

array([[   49.,    49.,  2401.,  2401.,  2401.],
       [   62.,    63.,  3844.,  3906.,  3969.],
       [   82.,    83.,  6724.,  6806.,  6889.],
       [  100.,   123., 10000., 12300., 15129.],
       [   52.,    43.,  2704.,  2236.,  1849.]])

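PolynomialFeatures() takes the following three parameters.

degree: the degree of the polynomial; the larger the value, the more features are generated.
interaction_only: False by default. If set to True, terms that combine a feature with itself are left out, so the degree-2 terms above would not include a² and b².
include_bias: True by default. If True, a column of ones (the bias term) is added.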
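The three parameters can be compared side by side on a single made-up sample with a = 2 and b = 3 (the variable names below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: a = 2, b = 3
X = np.array([[2.0, 3.0]])

# Defaults: bias column plus all terms up to degree 2 -> [1, a, b, a^2, ab, b^2]
full = PolynomialFeatures(degree=2).fit_transform(X)

# interaction_only=True drops a^2 and b^2 -> [1, a, b, ab]
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)

# include_bias=False drops the leading column of ones -> [a, b, a^2, ab, b^2]
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(full)
print(inter)
print(no_bias)
```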

To make the result easier to read, add column names:

intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)

10.1.3 Discretizing Continuous Values

Discretizing continuous values is very useful, and continuous features often need this kind of treatment. Whether it helps has to be checked on the test set, but in the early stage of feature engineering it is good to have as many options as possible.

fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()

The code above loads a dataset that contains age information. Next the age feature is discretized, that is, divided into intervals. Before doing so, look at its distribution:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import scipy.stats as spstats

# Jupyter magic: render figures inline in the notebook
%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
mpl.rcParams['figure.figsize'] = [6.0, 4.0]
mpl.rcParams['figure.dpi'] = 100

fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

The output shows that ages range from about 10 to 90. Discretization maps the data in an interval to one group; for example, age could be divided into children, youth, middle-aged, and elderly. For simplicity, equal-width intervals are used here:

# Divide by 10 and round down, so ages 30-39 all fall into bin 3
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]

In the code above, np.floor rounds down; for example, 3.3 becomes 3. This completes the discretization: every value is assigned to its interval.
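The same equal-width binning can be written with pd.cut, which makes the interval edges explicit. The sketch below uses a handful of made-up ages (toy_age is illustrative, not the survey data) and checks that both routes give the same bin index:

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration
toy_age = pd.Series([9, 17, 25, 39, 40, 63])

# Route 1: divide by 10 and round down
decade_bin = np.floor(toy_age / 10).astype(int)

# Route 2: pd.cut with left-closed intervals [0, 10), [10, 20), ..., [60, 70)
edges = np.arange(0, 80, 10)
cut_bin = pd.cut(toy_age, bins=edges, right=False, labels=False)

print(decade_bin.tolist())
print(cut_bin.tolist())
```

pd.cut is the more flexible choice once the intervals stop being equally wide, since only the edges need to change.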

Binning can also be done by quantiles. Try another feature and first look at income:

#fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

Quantiles split the data by proportion, and the proportions can be customized:

quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles

0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0
Name: Income, dtype: float64

fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')

# Draw a vertical red line at each quantile cut point
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)

ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

quantile() returns the cut points for the chosen proportions; these are then applied to the data to discretize it:

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
# Without labels, qcut returns the numeric interval each value falls into
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
                                                 q=quantile_list)
# With labels, each interval is given a readable name
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
                                                 q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income',
               'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]

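All the data have now been binned. How to choose the proportions for real data depends on the problem at hand; there is no fixed rule, and the most sensible approach is to decide based on the actual business context.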
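A property worth knowing about quantile binning is that each bin receives roughly the same number of samples, however skewed the data are. The sketch below illustrates this on synthetic, made-up incomes (the lognormal parameters and the name toy_income are assumptions chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "incomes"; the distribution parameters are arbitrary
rng = np.random.default_rng(0)
toy_income = pd.Series(rng.lognormal(mean=10.5, sigma=0.6, size=1000))

quantile_list = [0, .25, .5, .75, 1.]
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']

toy_bins = pd.qcut(toy_income, q=quantile_list, labels=quantile_labels)

# Count how many samples landed in each quantile bin
counts = toy_bins.value_counts().sort_index()
print(counts)
```

Equal-width bins on the same data would put most samples into the lowest few bins, which is the main reason to prefer quantiles for skewed features.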

10.1.4 Log and Time Transformations

A feature column can follow all sorts of distributions, but many machine learning algorithms prefer the values being predicted to follow a Gaussian distribution, so a transformation is needed. The most direct one is the log transform:

# Add 1 before taking the log so that a value of 0 does not produce -inf
fcc_survey_df['Income_log'] = np.log((1 + fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]

income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)

fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
# Mark the mean of the log-transformed income
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)

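After the log transform, the distribution is closer to Gaussian. It is still not perfect, but it is an improvement. Interested readers can also look into the Box-Cox transformation, which has the same goal and differs only in its formula.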
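For readers who want to try both transforms, here is a minimal sketch on synthetic data (the lognormal sample and all variable names are assumptions for illustration, not the survey data). The Box-Cox transform is y = (x^λ - 1) / λ for λ ≠ 0 and y = ln(x) for λ = 0; scipy.stats.boxcox estimates λ from the data and requires strictly positive input:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data; parameters chosen only for illustration
rng = np.random.default_rng(42)
toy_income = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

# Log transform (log1p computes log(1 + x))
log_income = np.log1p(toy_income)

# Box-Cox transform; returns the transformed data and the fitted lambda
bc_income, bc_lambda = stats.boxcox(toy_income)

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = stats.skew(toy_income)
skew_log = stats.skew(log_income)
skew_bc = stats.skew(bc_income)

print(round(float(skew_raw), 2), round(float(skew_log), 2), round(float(skew_bc), 2))
print(round(float(bc_lambda), 3))
```

On this sample the fitted λ should come out close to 0, which is the case where Box-Cox reduces to the plain log transform.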

Time-related data can also yield many features, such as year, month, day, and hour; even the early, middle, or late part of the month, working hours, and off hours can be used as input features.

import datetime
import numpy as np
import pandas as pd
from dateutil.parser import parse
import pytz

# Four timestamps with different UTC offsets
time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
               '2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
df = pd.DataFrame(time_stamps, columns=['Time'])
df

                    Time
0     2015-03-08 10:30:00.360000+00:00
1     2017-07-13 15:45:05.755000-07:00
2     2012-01-20 22:30:00.254000+05:30
3     2016-12-25 00:30:00.000000+10:00

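Next, detailed time features are extracted. If the data are in a standard format, the attributes of the timestamp objects can be read directly, which is more convenient: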

ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
ts_objs

array([Timestamp('2015-03-08 10:30:00.360000+0000', tz='UTC'),
       Timestamp('2017-07-13 15:45:05.755000-0700', tz='pytz.FixedOffset(-420)'),
       Timestamp('2012-01-20 22:30:00.254000+0530', tz='pytz.FixedOffset(330)'),
       Timestamp('2016-12-25 00:30:00+1000', tz='pytz.FixedOffset(600)')],
      dtype=object)

df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
# Timestamp.weekday_name raises AttributeError in newer pandas versions; day_name() replaces it
df['DayName'] = df['TS_obj'].apply(lambda d: d.day_name())
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)

df[['Time', 'Year', 'Month', 'Day', 'Quarter',
    'DayOfWeek', 'DayName', 'DayOfYear', 'WeekOfYear']]

# The Hour column has to exist before it can be binned
df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
# Bin edges are right-inclusive: (-1, 5], (5, 11], (11, 16], (16, 21], (21, 23]
hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'],
                            bins=hour_bins, labels=bin_names)
df[['Time', 'Hour', 'TimeOfDayBin']]

      Time                                 Hour   TimeOfDayBin
0     2015-03-08 10:30:00.360000+00:00     10     Morning
1     2017-07-13 15:45:05.755000-07:00     15     Afternoon
2     2012-01-20 22:30:00.254000+05:30     22     Night
3     2016-12-25 00:30:00.000000+10:00     0      Late Night

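A single raw timestamp yields a surprising number of derived features. Once concrete time data are available, related information can be merged in as well, such as the weather: weather station data are easy to obtain, which provides temperature, rainfall, and similar indicators.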
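When a whole column holds timestamps, the vectorized .dt accessor is a common alternative to calling apply() with a lambda per row. The sketch below is one possible variant under a stated assumption: it first normalizes all timestamps to UTC (utc=True), so the extracted hour is the UTC hour rather than the local hour of each original offset. The names toy_times, ts_utc and toy_features are illustrative:

```python
import pandas as pd

# Same instants as in the example above, without fractional seconds
toy_times = pd.Series(['2015-03-08 10:30:00+00:00',
                       '2017-07-13 15:45:05-07:00',
                       '2012-01-20 22:30:00+05:30',
                       '2016-12-25 00:30:00+10:00'])

# Assumption: mixed offsets are converted to a single UTC-based column
ts_utc = pd.to_datetime(toy_times, utc=True)

# Vectorized attribute access through the .dt accessor
toy_features = pd.DataFrame({
    'Year': ts_utc.dt.year,
    'Month': ts_utc.dt.month,
    'Hour': ts_utc.dt.hour,
    'DayName': ts_utc.dt.day_name(),
})
print(toy_features)
```

Whether to keep local time or convert to UTC depends on the task: features such as "working hours" only make sense in local time, while comparing events across regions usually calls for a common time zone.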

10.2 Text Features

Text features appear frequently in data: a sentence or an article is a text feature. The problem is the same as before: a computer cannot work with raw text, so it first has to be converted to numbers, that is, vectors. Text feature extraction is introduced only briefly here; the news classification task in the next chapter explains it in more detail.

10.2.1 Bag-of-Words Model

First build a dataset. For simplicity it is in English; Chinese data would first need word segmentation, whereas English text is already separated into words:

import pandas as pd
import numpy as np
import re
import nltk  # pip install nltk
# For Chinese text, a segmenter such as jieba would be needed first

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
                          'Category': labels})
# Fix the column order
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

      Document                                              Category
0     The sky is blue and beautiful.                        weather
1     Love this blue and beautiful sky!                     weather
2     The quick brown fox jumps over the lazy dog.          animals
3     The brown fox is quick and the blue dog is lazy!      animals
4     The sky is very blue and the sky is very beaut...     weather
5     The dog is lazy but the brown fox is quick!           animals

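NLTK is a very useful toolkit for natural language processing. It has to be installed before use, but a fresh installation is essentially an empty shell without the actual resources; the required components have to be installed selectively (see Figure 10-1).

Figure 10-1 The NLTK toolkit

Running nltk.download() opens an installer window where the required components can be selected. NLTK also provides many datasets for practice, so it is quite powerful.

For NLTK installation, see the reading notes on 《數據分析實戰-托馬茲.卓巴斯》, Chapter 9: natural language processing with NLTK (analyzing text, part-of-speech tagging, topic extraction, text classification).

nltk.download()
# nltk.download('wordnet')
# Then move the files from the default path C:\Users\tony zhang\AppData\Roaming\nltk_data\ to D:\download\nltk_data\

from nltk import data
data.path.append(r'D:\download\nltk_data')  # replace with the path where your own data files were downloaded

The first step with text data is always preprocessing. The usual routine is to remove special characters and stop words that carry little information.

Stop words are words that have little effect on the final result, such as "we", "today", and "but".

import re
import nltk
from nltk import data
data.path.append(r'D:\download\nltk_data')  # replace with the path where your own data files were downloaded
# Load the tokenizer and the English stop words
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove special characters; flags must be passed by keyword,
    # because the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Join the remaining tokens back into one document
    doc = ' '.join(filtered_tokens)
    return doc

# Vectorize so the function can be applied to a whole array of documents
normalize_corpus = np.vectorize(normalize_document)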

norm_corpus = normalize_corpus(corpus)
norm_corpus
# First document before normalization: 'The sky is blue and beautiful.'

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

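Words such as "the" and "this", which contribute nothing to the topic of a sentence, have all been removed. Next comes feature extraction, that is, converting each sentence into a numeric vector.

from sklearn.feature_extraction.text import CountVectorizer
print(norm_corpus)
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
# The result is a sparse matrix; convert it to a dense array for display
cv_matrix = cv_matrix.toarray()
cv_matrix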

['sky blue beautiful' 'love blue beautiful sky'
 'quick brown fox jumps lazy dog' 'brown fox quick blue dog lazy'
 'sky blue sky beautiful today' 'dog lazy brown fox quick']
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]], dtype=int64)

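vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)

The vector has as many dimensions as there are distinct words in the corpus, and each position holds the number of times the corresponding word occurs in the document. The code above considers single words only. Combinations of adjacent words can be taken into account as well; the principle is the same, and the result below makes it clear:

# ngram_range=(2,2) keeps only two-word combinations (bigrams)
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)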

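The code above sets the ngram_range parameter, which brings in some of each word's context. Only two-word combinations are considered here; setting ngram_range to (1,2) would include both single words and two-word combinations.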
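The (1,2) setting can be seen on two tiny made-up documents (toy_docs and the other names are illustrative). The vocabulary is read from the fitted vocabulary_ attribute, which avoids depending on a particular scikit-learn version's feature-name method:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up documents
toy_docs = ['blue sky', 'sky blue sky']

# Keep both single words (unigrams) and two-word combinations (bigrams)
uni_bi = CountVectorizer(ngram_range=(1, 2))
uni_bi_matrix = uni_bi.fit_transform(toy_docs).toarray()

# Columns follow the alphabetically sorted vocabulary
uni_bi_vocab = sorted(uni_bi.vocabulary_)

print(uni_bi_vocab)
print(uni_bi_matrix)
```

Note that 'blue sky' and 'sky blue' are separate columns, so some word-order information is kept that a pure unigram model would lose.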

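The bag-of-words model is simple in both principle and use, but the vectors it produces have no soul. A sentence or an article has word order, yet bag-of-words considers only word frequency, and the importance of each word depends entirely on how often it occurs. In addition, the document vectors usually form a very large sparse matrix, which is inconvenient for computation.

The bag-of-words model has quite a few problems, but it also has an advantage: it is simple and convenient. In a real modeling task it is hard to know in advance which feature extraction method works best, so all of them are worth trying.

10.2.2 Common Ways to Construct Text Features

There are many more ways to extract text features. Some commonly used ones are introduced below; in practice, the standard routines can also be combined with less conventional tricks.

(1) TF-IDF features. Bag-of-words considers only word frequency and not the word itself, whereas TF-IDF takes the importance of each word into account. How TF-IDF extracts keywords is explained in detail later; first look at the result it produces:

from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)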

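The output shows that every word gets a decimal value, and its magnitude indicates how important the word is in that document. The news classification task in the next chapter discusses this further.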
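To see where these decimal values come from, the sketch below recomputes TF-IDF by hand and compares it with TfidfVectorizer. It assumes scikit-learn's default settings (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); other libraries and textbooks define IDF slightly differently. The three documents are taken from the normalized corpus above; the variable names are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ['sky blue beautiful',
            'love blue beautiful sky',
            'dog lazy brown fox quick']

# Term frequencies: raw counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each word
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default formula)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, then scale every row to unit Euclidean length
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Reference result from the library with default settings
tfidf_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()

print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Words that occur in fewer documents get a larger IDF, so within one document a rare word ends up with a higher weight than a common one with the same count.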

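(2) Similarity features. Similarity can be computed only after the features have been fixed and all converted to numbers. There are many ways to compute it; cosine similarity is used as the example here. sklearn already implements it, and the TF-IDF features from the previous example can be used directly as input:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all documents
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df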
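Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch on three made-up vectors (all names are illustrative):

```python
import numpy as np

# Three made-up feature vectors, one per row
vectors = np.array([[1.0, 1.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 0.0, 2.0]])

# Scale every row to unit length, then take all pairwise dot products
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
manual_similarity = unit @ unit.T

print(np.round(manual_similarity, 4))
```

Rows 0 and 1 share one direction and score about 0.7071, while row 2 is orthogonal to both and scores 0; vector length plays no role, only direction.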

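(3) Cluster features. Clustering divides the data into groups and gives each group a label. The data must first be converted to numeric features before clustering, and the resulting cluster labels can then be used as a discrete feature (clustering algorithms are covered in Chapter 16).

from sklearn.cluster import KMeans

# Cluster the documents into two groups based on their similarity rows
km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)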

(4) Topic models. A topic model is an unsupervised method. Its input is the preprocessed corpus, and it yields the topics together with the weight of each word in them:

from sklearn.decomposition import LatentDirichletAllocation

# Older scikit-learn versions called the number of topics n_topics:
# lda = LatentDirichletAllocation(n_topics=2, max_iter=100, random_state=42)
# From help(LatentDirichletAllocation):
#   n_components : int, optional (default=10)
#       Number of topics.

lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
# Document-topic matrix: one row per document, one column per topic
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features

# Topic-term matrix: one row per topic, one weight per vocabulary word
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    # Sort by weight in descending order and keep words with weight above 0.6
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()

     T1              T2
0     0.190548     0.809452
1     0.176804     0.823196
2     0.846184     0.153816
3     0.814863     0.185137
4     0.180516     0.819484
5     0.839172     0.160828

[('brown', 1.7273638692668465), ('dog', 1.7273638692668465), ('fox', 1.7273638692668465), ('lazy', 1.7273638692668465), ('quick', 1.7273638692668465), 
('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]

[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), 
('today', 1.0068251160429935)]

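The code above sets n_components=2 (called n_topics in older versions), which asks for two topics. The result is the weight of each keyword under each topic. This looks reasonably good: even an unsupervised method yields this many useful indicators. In the author's opinion, however, the LDA topic model is not very practical and its results are usually mediocre, so it is not recommended for feature processing or modeling; being familiar with it is enough.

(5) Word vector models. The feature extraction methods above are fairly easy to understand. Now consider the word vector model, commonly known as word2vec, which is based on a neural network. Informally: each word is first initialized, for example as a random vector of length 10. The model then makes predictions between each word and its context; for example, the input is the vector for "go home" and the output is "have a meal". All inputs and output labels come from the context in the corpus, so labels do not need to be specified by hand. During training, an optimization algorithm such as gradient descent updates not only the weight parameters but also the input vectors themselves: the vector for "go home" starts out random and changes at every iteration until a suitable result is reached.

Word vector models are currently the most widely used approach in natural language processing, and they give each word an actual meaning in space. Recall that the vectors produced by the methods above have no real meaning; they are just numbers. In a word vector model, each word has a meaningful position in space: "like" and "love" are close together because they mean similar things, while both are far from "mobile phone" because they have little to do with it. After neural networks have been covered, the movie review classification task in Chapter 20 shows a practical application. To use the model, the vector of every word in the text has to be constructed first. The most commonly used toolkit is Gensim, which also ships with corpora:

from gensim.models import word2vec
import nltk
from nltk import data
data.path.append(r'D:\download\nltk_data')  # replace with the path where your own data files were downloaded
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# A few parameters need to be set
feature_size = 10    # dimension of the word vectors
window_context = 10  # sliding window size
min_word_count = 1   # minimum word frequency

# Note: gensim 4.0 renamed the size parameter to vector_size
w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
                          window=window_context, min_count = min_word_count)

w2v_model.wv['sky']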

array([-0.02571594, -0.02806569, -0.01904523, -0.03620922,  0.01884929,
       -0.04410132,  0.02005241, -0.00504071,  0.01696092,  0.01301065],
      dtype=float32)

def average_word_vectors(words, model, vocabulary, num_features):
    # Average the vectors of all words in one document that are in the vocabulary
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # Look the vector up through model.wv; indexing the model directly is not supported in newer gensim
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    # Note: gensim 4.0 renamed index2word to index_to_key
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array)  # lstm

In the output, each document in the input corpus is represented by the average of its word vectors. Word vectors are used very widely, and at present they are usually combined with neural networks (later case studies show how effective this is).
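The averaging step does not depend on Gensim and can be tried in isolation. The sketch below replaces the trained model with a hand-made dictionary of 2-dimensional vectors (toy_vectors and average_vector are hypothetical names, and the numbers are invented for illustration):

```python
import numpy as np

# Hand-made "word vectors" standing in for a trained model
toy_vectors = {
    'sky':  np.array([1.0, 0.0]),
    'blue': np.array([0.0, 1.0]),
    'dog':  np.array([1.0, 1.0]),
}

def average_vector(tokens, vectors, num_features):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known word at all: fall back to a zero vector
        return np.zeros(num_features)
    return np.mean(known, axis=0)

doc_vec = average_vector(['sky', 'blue', 'unknown'], toy_vectors, 2)
empty_vec = average_vector(['unknown'], toy_vectors, 2)

print(doc_vec)
print(empty_vec)
```

Averaging gives every document a vector of the same length regardless of how many words it has, at the cost of discarding word order.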

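10.3 Papers and Benchmarks

Feature engineering is especially important in data mining tasks. The fields of a dataset can contain all kinds of information, so how do we extract the most valuable features? The first thing that comes to mind is probably experience: recalling how other datasets were handled, or falling back on general-purpose routines. But it is hard to be sure that such methods are appropriate, and trying out every idea is not realistic either. A recommended approach is to look for solutions by combining papers with benchmarks, which should get much more done with much less effort.

The best starting point is papers. A paper can be treated as a solution to a real task. You may never have studied a complex task in depth, but others have already explored it, and a paper summarizes their ideas, experimental results, and the problems they ran into. Studying and improving on their methods and then applying them to your own task is an attractive route.

But how do you find suitable papers for reference? If you do not specialize in a field, you may not be familiar with these resources. This is where a benchmark helps. It is essentially a database that holds datasets for a particular field, along with many papers in that field and their test results.

Figure 10-2 shows a benchmark that 迪哥 (the book's author) has experimented with; its home page gives an overview. For example, for the image recognition task of detecting human body keypoints, it provides a human pose dataset and also lists many related papers. Papers listed by a benchmark have usually been shown to perform very well.

Figure 10-2 The MPII human pose recognition benchmark

Figure 10-3 shows some of the listed papers; the classic pose recognition papers from 2013 to 2018 can all be found there. Readers familiar with computer vision will recognize that these papers were published at very high-level venues. The experimental results are on the right, including recognition results for the head, shoulders, and individual joints. The results improve steadily over the years, and the task is now handled quite maturely.

Figure 10-3 Results of the listed papers

If you do not know how to choose papers, read the classics. Papers found by a plain search may be of limited value, whereas those recommended by a benchmark are classics worth studying.

Another feature of benchmarks is that many of the listed papers come with public code. Figures 10-4 and 10-5 show a paper's home page, which offers not only the source code of the experiments but also trained models. This helps a great deal both for getting real tasks done and for learning. Suppose you need to do human pose recognition: you now have the best-performing code currently available plus the original authors' trained model. Deploy it to a server, and within a day you can say that the task is basically done and that nothing currently performs better (a real shortcut for our work).

▲ Figure 10-4 Public source code for a paper (1)

▲ Figure 10-5 Public source code for a paper (2)

When starting out, it is best to combine theory and practice. The paper is the guiding idea that tells you what to do step by step, and the accompanying code is the practice. In the author's view, studying without source code is painful because papers gloss over many details; many readers probably feel the same, and reading the code often makes the paper's ideas easier to grasp.

How should the source code be used? Real projects are usually complex, and reading them line by line is hard going. The best approach is to step through them with a debugger, see what each step does, and then relate it to the paper.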

10.4 Image Features

pip install scikit-image

import skimage
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage import io
# Alternatives: opencv, tensorflow
%matplotlib inline

# Read the two images as arrays of shape (height, width, channels)
cat = io.imread('./datasets/cat.png')
dog = io.imread('./datasets/dog.png')
df = pd.DataFrame(['Cat', 'Dog'], columns=['Image'])


print(cat.shape, dog.shape)

(168, 300, 3) (168, 300, 3)

cat  # pixel values range from 0 to 255; smaller values are darker, larger values are brighter

array([[[114, 105,  90],
        [113, 104,  89],
        [112, 103,  88],
        ...,
        [127, 130, 121],
        [130, 133, 124],
        [133, 136, 127]],

       [[113, 104,  89],
        [112, 103,  88],
        [111, 102,  87],
        ...,
        [129, 132, 125],
        [132, 135, 128],
        [135, 138, 131]],

       [[111, 102,  87],
        [111, 102,  87],
        [110, 101,  86],
        ...,
        [132, 134, 133],
        [136, 138, 137],
        [139, 141, 140]],

       ...,

       [[ 32,  26,  28],
        [ 32,  26,  28],
        [ 30,  24,  26],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]],

       [[ 33,  27,  29],
        [ 32,  26,  28],
        [ 31,  25,  27],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]],

       [[ 33,  27,  29],
        [ 32,  26,  28],
        [ 31,  25,  27],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]]], dtype=uint8)

#coffee = skimage.transform.resize(coffee, (300, 451), mode='reflect')
# Show the two images side by side
fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(1,2, 1)
ax1.imshow(cat)
ax2 = fig.add_subplot(1,2, 2)
ax2.imshow(dog)

<matplotlib.image.AxesImage at 0x233c9b53988>

dog_r = dog.copy() # Red Channel
dog_r[:,:,1] = dog_r[:,:,2] = 0 # set G,B pixels = 0
dog_g = dog.copy() # Green Channel
dog_g[:,:,0] = dog_g[:,:,2] = 0 # set R,B pixels = 0
dog_b = dog.copy() # Blue Channel
dog_b[:,:,0] = dog_b[:,:,1] = 0 # set R,G pixels = 0

# Place the three single-channel images next to each other
plot_image = np.concatenate((dog_r, dog_g, dog_b), axis=1)
plt.figure(figsize = (10,4))
plt.imshow(plot_image)

dog_r

array([[[160,   0,   0],
        [160,   0,   0],
        [160,   0,   0],
        ..., 
        [113,   0,   0],
        [113,   0,   0],
        [112,   0,   0]],

       [[160,   0,   0],
        [160,   0,   0],
        [160,   0,   0],
        ..., 
        [113,   0,   0],
        [113,   0,   0],
        [112,   0,   0]],

       [[160,   0,   0],
        [160,   0,   0],
        [160,   0,   0],
        ..., 
        [113,   0,   0],
        [113,   0,   0],
        [112,   0,   0]],

       ..., 
       [[165,   0,   0],
        [165,   0,   0],
        [165,   0,   0],
        ..., 
        [212,   0,   0],
        [211,   0,   0],
        [210,   0,   0]],

       [[165,   0,   0],
        [165,   0,   0],
        [165,   0,   0],
        ..., 
        [210,   0,   0],
        [210,   0,   0],
        [209,   0,   0]],

       [[164,   0,   0],
        [164,   0,   0],
        [164,   0,   0],
        ..., 
        [209,   0,   0],
        [209,   0,   0],
        [209,   0,   0]]], dtype=uint8)

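Grayscale images:

from skimage.color import rgb2gray

# cgs and dgs are not defined anywhere in these notes;
# assumption: they are the grayscale versions of the cat and dog images
cgs = rgb2gray(cat)
dgs = rgb2gray(dog)

fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(2,2, 1)
ax1.imshow(cgs, cmap="gray")
ax2 = fig.add_subplot(2,2, 2)
ax2.imshow(dgs, cmap='gray')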

<matplotlib.image.AxesImage at 0x1fca2353358>
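A grayscale image is commonly obtained as a weighted sum of the red, green, and blue channels. The NumPy-only sketch below illustrates the idea on a made-up 2x2 image. The weights (0.2125, 0.7154, 0.0721) are the ones I understand scikit-image's rgb2gray to use; treat them as an assumption, since other libraries use slightly different coefficients:

```python
import numpy as np

# Made-up 2x2 RGB image: pure red, pure green, pure blue, and white
toy_rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                    [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Assumed luminance weights for the R, G and B channels
weights = np.array([0.2125, 0.7154, 0.0721])

# Scale to [0, 1], then take the weighted sum over the channel axis
toy_gray = (toy_rgb / 255.0) @ weights

print(toy_gray.shape)
print(toy_gray)
```

Green gets the largest weight because the eye is most sensitive to it, so a pure green pixel comes out much brighter in grayscale than a pure blue one.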

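Chapter summary:

This chapter introduced common feature extraction methods, mainly for numeric and text features; each method has its own strengths and weaknesses. At the start of a task, try as many extraction methods as possible. Having too many features is not a problem, since they can be filtered by experiment during modeling, but having too few leaves nothing to work with. So think hard during feature engineering and plan the modeling approach ahead of time: once massive data are involved, feature extraction takes a long time, and proceeding one step at a time without a plan greatly reduces efficiency.

When working on a task, always consult papers and try the various solutions. The best approach is to first learn how others did it and then apply that to your own task.

End of Chapter 10.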


Resources for the book can be downloaded from 異步社區: https://www.epubit.com
