Feature Engineering Made Easy ---- Feature Improvement

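Feature improvement is a further round of modification to the data: this is where we start cleaning and enhancing it. The main operations are:

Identifying missing values in the data
Removing harmful data
Imputing (filling in) missing values
Normalizing/standardizing the data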

1. Identifying missing values in the data

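The first feature improvement technique is identifying missing values, which gives us a better sense of how to work with real-world data. Data is often incomplete for one reason or another, so we need to find the missing values and then deal with them. This post uses the Pima Indians diabetes prediction dataset. It contains 768 rows and 9 columns (8 features plus the class label), and the task is to predict whether a female Pima Indian aged 21 or older will develop diabetes within 5 years. The columns mean the following:
(1) Number of times pregnant
(2) Plasma glucose concentration at 2 hours in an oral glucose tolerance test
(3) Diastolic blood pressure
(4) Triceps skinfold thickness
(5) 2-hour serum insulin concentration
(6) Body mass index (BMI)
(7) Diabetes pedigree function
(8) Age
(9) Class variable (0 or 1, meaning without or with diabetes)
First, let's get to know the data.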

# Import the packages needed for exploratory data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')

# Supply the column names (passed to read_csv through names=)
pima_column_names = ['times_pregnant', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness',
                    'serum_insulin', 'bmi', 'pedigree_function', 'age', 'onset_diabetes']
pima = pd.read_csv('./data/pima.data', names=pima_column_names)
pima.head()

# Compute the null accuracy
pima['onset_diabetes'].value_counts(normalize=True)

# Plot histograms of plasma_glucose_concentration for the two classes
col = 'plasma_glucose_concentration'
plt.hist(pima[pima['onset_diabetes'] == 0][col], 10, alpha=0.5, label='non-diabetes')   # without diabetes
plt.hist(pima[pima['onset_diabetes'] == 1][col], 10, alpha=0.5, label='diabetes')  # with diabetes
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()

# Quantify the relationships between variables with a linear correlation matrix
# Heatmap of the correlation matrix of the data
sns.heatmap(pima.corr())
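A heatmap is easy to scan but hard to read exact values from. As a minimal sketch on toy data (the numbers below are hypothetical and not taken from the Pima dataset), the same correlation matrix can be reduced to a ranked list of how strongly each feature correlates with the label:

```python
import pandas as pd

# Toy data with hypothetical values, standing in for the real dataset
toy = pd.DataFrame({
    'glucose': [85, 90, 120, 150, 160, 180],
    'age':     [50, 23, 31, 45, 22, 60],
    'label':   [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the label,
# ranked by absolute strength (strongest first)
strength = (
    toy.corr()['label']
       .drop('label')
       .abs()
       .sort_values(ascending=False)
)
print(strength)
```

On the real data, the same idea would be applied to `pima.corr()['onset_diabetes']`.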

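From the analysis above, we can first see that plasma glucose concentration differs markedly between people with and without diabetes, and that it is strongly correlated with whether a person has the disease. Next, let's check whether the data contains missing values.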

# Check whether the data contains missing values
pima.isnull().sum()

The result above shows no missing values. Let's take another look, this time at the basic descriptive statistics.

# View the basic descriptive statistics
pima.describe()

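We can see that the minimum BMI is 0, which contradicts medical common sense. Most likely, missing or nonexistent values were filled in with 0. Several columns in the data have a minimum of 0. For onset_diabetes, 0 means no diabetes, and a person can have been pregnant 0 times, so those two columns are fine. In the remaining columns, listed here, missing values were filled with 0: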

plasma_glucose_concentration
diastolic_blood_pressure
triceps_thickness
serum_insulin
bmi
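To make the point concrete, here is a small sketch on a toy frame (hypothetical values; the choice of `suspect_columns` is an assumption made for this illustration). It shows why `isnull()` reports nothing while counting zeros exposes the hidden gaps:

```python
import pandas as pd

# Toy frame with hypothetical values; 0 stands in for "not recorded"
toy = pd.DataFrame({
    'times_pregnant': [0, 2, 1, 0],
    'bmi':            [33.6, 0.0, 26.6, 0.0],
    'serum_insulin':  [0, 94, 0, 0],
})

# Columns in which a value of 0 is physically implausible
suspect_columns = ['bmi', 'serum_insulin']

# isnull() finds nothing, because the gaps are encoded as 0 rather than NaN
total_nulls = toy.isnull().sum().sum()
print(total_nulls)

# Counting zeros in the suspect columns reveals the disguised missing values
zero_counts = (toy[suspect_columns] == 0).sum()
print(zero_counts)
```

Zeros in `times_pregnant` are deliberately left out of the count, since 0 is a legitimate value there.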

2. Handling missing values in the data

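First, in the columns that contain missing values, replace 0 with NaN, the missing-value marker that pandas recognizes. Then check again for missing values.

# Apply the replacement to every affected column
columns = ['serum_insulin', 'bmi', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness']
for col in columns:
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which may not modify pima in recent pandas versions
    pima[col] = pima[col].replace(0, np.nan)

# Check the missing values
pima.isnull().sum()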

(1) Removing harmful rows

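We first remove the harmful rows, then compare the data before and after the removal, and finally use a machine learning algorithm to evaluate how the resulting data performs.
Dropping the rows with missing values: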

# Drop the rows that contain missing values
pima_dropped = pima.dropna()
# Check what percentage of the rows was lost
num_rows_lost = round(100*(pima.shape[0] - pima_dropped.shape[0]) / float(pima.shape[0]))
print("lost {}% of rows".format(num_rows_lost))

lost 49% of rows
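Row-wise dropping can remove far more data than the per-column gap counts suggest, because a row is discarded as soon as any one of its columns is missing. A minimal sketch on toy data (hypothetical values) that reports both the retained and the lost share:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; each column has only one or two gaps
toy = pd.DataFrame({
    'bmi':           [33.6, np.nan, 26.6, 28.1],
    'serum_insulin': [np.nan, 94.0, 168.0, np.nan],
})

# dropna() keeps only the rows with no missing value in any column
toy_dropped = toy.dropna()

pct_retained = 100 * len(toy_dropped) / len(toy)
pct_lost = 100 - pct_retained
print("retained {:.0f}% of rows, lost {:.0f}%".format(pct_retained, pct_lost))
```

Here only one of the four rows is complete, so three quarters of the rows disappear even though no single column is missing more than half of its values.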
Data analysis:

# Continue the exploratory analysis
# Null accuracy before dropping rows
pima['onset_diabetes'].value_counts(normalize=True)

0 0.651042
1 0.348958
Name: onset_diabetes, dtype: float64

# Null accuracy after dropping rows
pima_dropped['onset_diabetes'].value_counts(normalize=True)

0 0.668367
1 0.331633
Name: onset_diabetes, dtype: float64
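For readers new to the term: null accuracy is the accuracy a model would reach by always predicting the most frequent class, so it serves as the baseline any real model has to beat. A small sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical labels: five of class 0 and three of class 1
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 1])

# Share of each class in the data
class_shares = labels.value_counts(normalize=True)

# Always predicting the majority class is correct this often
null_accuracy = class_shares.max()
majority_class = class_shares.idxmax()
print(majority_class, null_accuracy)  # 0 0.625
```

Applied to the output above, the null accuracy of the full dataset is about 0.651 (always predicting class 0).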

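Judging by the null accuracy, the shares of classes 0 and 1 barely change after dropping rows. Next, let's compare the mean of each attribute before and after.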

# Column means before dropping rows
pima.mean()

# Column means after dropping rows
pima_dropped.mean()

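# Visualize with a bar chart
# Bar chart of the percentage change in each column mean (multiplied by 100 so the axis really is in percent)
ax = (100 * (pima_dropped.mean() - pima.mean()) / pima.mean()).plot(kind='bar', title='% change in average column values')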
ax.set_ylabel('% change')

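We can see that after dropping the rows with missing values, the mean number of pregnancies fell by 14% and the mean of the diabetes pedigree function rose by 11%. Both are sizable shifts. Dropping rows can seriously distort the shape of the data, so we should keep as much data as possible. Before moving on to other operations, we use a machine learning algorithm to check model performance on the data as it currently stands.
Evaluating performance: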

# Import the machine learning tools
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Separate the features from the label
X_dropped = pima_dropped.drop('onset_diabetes', axis=1)  # features
print("learning from {} rows".format(X_dropped.shape[0]))
y_dropped = pima_dropped['onset_diabetes']   # label

# Parameter grid for the KNN model
knn_params = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

# KNN model
knn = KNeighborsClassifier()

# Tune the model with a grid search
grid = GridSearchCV(knn, knn_params)
grid.fit(X_dropped, y_dropped)

# Print the result
print(grid.best_score_, grid.best_params_)

Result: 0.7448979591836735 {'n_neighbors': 7}
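Conceptually, the grid search above tries every candidate `n_neighbors`, scores each one with cross-validation, and keeps the best. The sketch below spells that loop out by hand. It is only an illustration: the data is synthetic (generated with `make_classification`), and `cv=5` is an assumption that matches the default of recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the real features and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

scores = {}
for k in [1, 2, 3, 4, 5, 6, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this value of k
    scores[k] = cross_val_score(model, X_toy, y_toy, cv=5).mean()

# The candidate with the highest mean cross-validated accuracy wins
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

`GridSearchCV` does the same bookkeeping and additionally refits the best model on all the data, which is why it is the more convenient choice in practice.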

(2) Imputing missing values

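First we check the missing values. Then we fill them in using scikit-learn, check the missing values once more, and finally validate model performance with a machine learning method.
Checking the missing values: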

# Check the missing values again
pima.isnull().sum()

Imputing the missing values:

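# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute is its replacement
from sklearn.impute import SimpleImputer

# Instantiate the imputer
imputer = SimpleImputer(strategy='mean')
# Fit on the data and fill in the missing values
pima_imputed = imputer.fit_transform(pima)
# Convert the resulting ndarray back into a DataFrame
pima_imputed = pd.DataFrame(pima_imputed, columns=pima_column_names)
pima_imputed.head()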
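What mean imputation does can be reproduced by hand in a few lines. This is only a sketch of the idea on a tiny hypothetical array, not the library's implementation:

```python
import numpy as np

# Hypothetical values; np.nan marks a missing entry
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Per-column mean computed over the observed values only
col_means = np.nanmean(X, axis=0)   # [2.0, 15.0]

# Wherever a value is missing, substitute the mean of its column
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled)
```

Each gap is replaced by the average of the values that were actually observed in the same column, so the column mean stays the same while the column's spread shrinks slightly.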

Checking for missing values and evaluating performance:

# Confirm that no missing values remain
pima_imputed.isnull().sum()

# Try filling with a different value and see how it affects the KNN model
# Fill with 0
pima_zero = pima.fillna(0)
X_zero = pima_zero.drop('onset_diabetes', axis=1)
y_zero = pima_zero['onset_diabetes']

# Parameter grid for the KNN model
knn_params = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
# Grid search
grid = GridSearchCV(knn, knn_params)
grid.fit(X_zero, y_zero)

# Print the result
print(grid.best_score_, grid.best_params_)

Result: 0.7330729166666666 {'n_neighbors': 6}

3. Standardization and normalization

We now want to strengthen the machine learning pipeline further, starting with some more exploratory data analysis.

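# SimpleImputer replaces the Imputer class removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

impute = SimpleImputer(strategy='mean')
# Fill in all the missing values
pima_imputed_mean = pd.DataFrame(impute.fit_transform(pima), columns=pima_column_names)
# Plot a histogram of every column
pima_imputed_mean.hist(figsize=(15, 15))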

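This analysis shows that the features sit on very different scales, and some machine learning models are strongly affected by the scale of the data. We can therefore apply some kind of normalization/standardization.
**Normalization:** aligns rows and columns and transforms them to follow a consistent set of rules, for example by mapping every quantitative column into the same fixed range.
**Standardization:** keeps the treatment of the data consistent by making sure all rows and columns are treated equally by the machine learning model.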
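Distance-based models such as KNN are a clear case of scale sensitivity. The sketch below uses three hypothetical points described by two features on very different scales (the feature names and values are made up for illustration) and shows that the nearest neighbour of `a` changes once both features are rescaled to the same range:

```python
import numpy as np

# Three hypothetical people described by (serum_insulin, pedigree_function)
a = np.array([100.0, 0.2])
b = np.array([110.0, 2.0])   # close to a in insulin, far from a in pedigree
c = np.array([130.0, 0.2])   # far from a in insulin, identical pedigree

def euclidean(u, v):
    """Plain Euclidean distance between two vectors."""
    return np.sqrt(np.sum((u - v) ** 2))

# On the raw values, the large-scale feature dominates: b looks nearer to a
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Rescale every column to the range [0, 1]
points = np.vstack([a, b, c])
col_min = points.min(axis=0)
col_range = points.max(axis=0) - col_min
scaled = (points - col_min) / col_range

# After rescaling, both features count equally: c is now nearer to a
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])
print(raw_ab < raw_ac, scaled_ab > scaled_ac)
```

Which neighbour is the "right" one depends on the problem. The point is only that, without rescaling, the answer is decided by the units the features happen to be measured in.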

(1) Z-score standardization

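Z-score standardization uses the z-score, one of the simplest ideas in statistics. It rescales each feature to have a mean of 0 and a standard deviation of 1. Putting all features on a common mean and variance can help scale-sensitive models perform better. The formula is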
$z=\frac{x-\mu }{\sigma }$
where $\mu$ is the mean and $\sigma$ is the standard deviation.

# Mean of the column
mu = pima['plasma_glucose_concentration'].mean()
# Standard deviation of the column
sigma = pima['plasma_glucose_concentration'].std()
# Compute the z-score of every value
print(((pima['plasma_glucose_concentration'] - mu) / sigma).head())

# Use scikit-learn's built-in z-score standardization
from sklearn.preprocessing import StandardScaler

# Standardize with z-scores
scaler = StandardScaler()

glucose_z_score_standardized = scaler.fit_transform(pima[['plasma_glucose_concentration']])
# Histogram
ax = pd.Series(glucose_z_score_standardized.reshape(-1,)).hist()
ax.set_title('Distribution of plasma_glucose_concentration after Z Score Scaling')
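One detail worth knowing when comparing the manual computation with `StandardScaler`: pandas' `Series.std()` divides by n - 1 by default (sample standard deviation), whereas `StandardScaler` uses the population standard deviation (dividing by n, like `numpy.std`). The two sets of z-scores therefore differ slightly. A small sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = x.std()                 # pandas default: ddof=1
population_std = np.std(x.values)    # numpy default: ddof=0, equals 2.0 here

# z-scores computed with each convention
z_sample = (x - x.mean()) / sample_std
z_population = (x - x.mean()) / population_std

# Only the population version has a standard deviation of exactly 1 under ddof=0
print(z_population.std(ddof=0), z_sample.std(ddof=0))
```

On 768 rows the gap between the two is tiny, so it does not change the conclusions here, but it explains why the two outputs may not match to the last digit.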

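# Plug z-score standardization into the machine learning pipeline
from sklearn.impute import SimpleImputer   # replaces the removed Imputer class
from sklearn.pipeline import Pipeline

knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('standardize', StandardScaler()), ('classify', knn)])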

X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

Result: 0.7421875 {'classify__n_neighbors': 7, 'imputer__strategy': 'median'}
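One benefit of placing the imputer and scaler inside a `Pipeline` is that, during cross-validation, they are fitted on the training folds only and then applied to the held-out fold, so no statistics leak from the validation data. The sketch below illustrates that fit/transform split with tiny hypothetical arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and held-out data with one feature
X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                          # learns the training mean: 2.0

# The gap in the held-out data is filled with the TRAINING mean,
# not with anything computed from X_test itself
X_test_filled = imputer.transform(X_test)
print(imputer.statistics_, X_test_filled.ravel())
```

Imputing the whole dataset before splitting would instead let the value 100.0 influence the fill value, which is the kind of leakage the pipeline avoids.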

(2) Min-max standardization

$m=\frac{x-x_{min} }{x_{max}-x_{min}}$
where $x_{min}$ is the minimum of the column and $x_{max}$ is its maximum.
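As a worked example of the formula, the sketch below rescales a small hypothetical column by hand. It also shows a known side effect: because the range is set by the extremes, a single outlier squeezes all the other values toward 0.

```python
import numpy as np

# Hypothetical column values
x = np.array([20.0, 30.0, 50.0, 70.0])

# m = (x - x_min) / (x_max - x_min), giving values in [0, 1]
m = (x - x.min()) / (x.max() - x.min())
print(m)  # [0.  0.2 0.6 1. ]

# Same column, but with one extreme value
x_outlier = np.array([20.0, 30.0, 50.0, 1000.0])
m_outlier = (x_outlier - x_outlier.min()) / (x_outlier.max() - x_outlier.min())
print(m_outlier)
```

The minimum always maps to 0 and the maximum to 1; everything else is placed proportionally in between.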
Standardization

# Import the scaler from scikit-learn
from sklearn.preprocessing import MinMaxScaler

# Instantiate
min_max = MinMaxScaler()

# Apply min-max standardization
pima_min_maxed = pd.DataFrame(min_max.fit_transform(pima_imputed), columns=pima_column_names)

# Get the descriptive statistics
pima_min_maxed.describe()

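Evaluating performance

from sklearn.impute import SimpleImputer   # replaces the removed Imputer class
from sklearn.pipeline import Pipeline

knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('standardize', MinMaxScaler()), ('classify', knn)])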

X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

Result: 0.74609375 {'classify__n_neighbors': 4, 'imputer__strategy': 'mean'}

(3) Row normalization

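Row normalization operates on rows rather than columns. It ensures that every row has unit norm, that is, every row vector has the same length. Each row is divided by its L2 norm: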
$\left \| x \right \|=\sqrt{(x_{1}^{2}+x_{2}^{2}+...+x_{n}^{2})}$
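A short worked example of the norm formula, using hypothetical rows: each row is divided by its own L2 norm, so every row ends up with length 1.

```python
import numpy as np

# Hypothetical rows; the second is exactly twice the first
X = np.array([
    [3.0, 4.0],
    [6.0, 8.0],
    [1.0, 0.0],
])

# L2 norm of each row: sqrt(x_1^2 + ... + x_n^2)
row_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))   # 5, 10, 1

# Divide every row by its own norm
X_unit = X / row_norms
print(X_unit)
```

Note that the first two rows become identical, `[0.6, 0.8]`: row normalization keeps the direction of each row but discards its magnitude.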
Normalization

# Import the row normalizer
from sklearn.preprocessing import Normalizer

# Instantiate
normalize = Normalizer()

pima_normalized = pd.DataFrame(normalize.fit_transform(pima_imputed), columns=pima_column_names)

# Average row norm of the matrix after row normalization
np.sqrt((pima_normalized**2).sum(axis=1)).mean()

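Result: 1.0
Evaluating performance

from sklearn.impute import SimpleImputer   # replaces the removed Imputer class
from sklearn.pipeline import Pipeline

knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('normalize', Normalizer()), ('classify', knn)])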

X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

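Result: 0.6822916666666666 {'classify__n_neighbors': 6, 'imputer__strategy': 'mean'}

In this chapter we handled the missing values in the data and then processed it further with standardization/normalization, evaluating performance along the way. The best result came from imputing with the mean and then applying min-max standardization, which gave an accuracy of 0.7461. Note that this is not much higher than the accuracy obtained after dropping the rows with missing values (0.7449), but it was obtained by training on all of the data, so the model should generalize better.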

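Note: the content and figures in this post come from the book 《特徵工程入門與實踐》 (Feature Engineering Made Easy). If you would like to learn more about these topics, consider buying a copy.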
