Column | A Jupyter-Based Feature Engineering Handbook: Data Preprocessing (Part 1)


Authors: Yingxiang Chen & Zihan Yang

Editor: 紅色石頭

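The importance of feature engineering in machine learning is hard to overstate: well-chosen feature engineering can significantly improve model performance. We have compiled a systematic feature engineering tutorial on GitHub for reference and study.

Project repository: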

https://github.com/YC-Coder-Chen/feature-engineering-handbook

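This article covers the data preprocessing part of the handbook: how to handle static continuous variables with scikit-learn, static categorical variables with Category Encoders, and common time series variables with Featuretools.

We present data preprocessing for feature engineering in three parts:

Static continuous variables
Static categorical variables
Time series variables

This article covers 1.1, data preprocessing for static continuous variables, and walks through it in Jupyter using sklearn.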

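1.1 Static Continuous Variables

1.1.1 Discretization

Discretizing continuous variables can make a model more robust. For example, when predicting customer purchasing behavior, a customer with 30 past purchases probably behaves much like a customer with 32. Excess precision in a feature can sometimes be noise, which is one reason LightGBM uses a histogram-based algorithm to help prevent overfitting. There are two ways to discretize a continuous variable.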
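
As a minimal illustration of the purchase-count example, the sketch below bins a few made-up purchase counts with NumPy. The counts and the bin edges are hypothetical and chosen only to show that 30 and 32 purchases end up in the same bin.

```python
import numpy as np

# Hypothetical purchase counts for six customers (illustrative values only).
purchases = np.array([1, 4, 12, 30, 32, 75])

# Hypothetical bin edges giving four bins: below 5, 5 to 19, 20 to 49, 50 and above.
edges = np.array([5, 20, 50])

# np.digitize returns, for each value, the index of the bin it falls into.
bins = np.digitize(purchases, edges)

print(bins)  # [0 0 1 2 2 3]
```

After binning, the customers with 30 and 32 purchases share bin index 2, so a model sees them as identical on this feature.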

1.1.1.1 Binarization

Binarize a numerical feature.

# load the sample data
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
X, y = dataset.data, dataset.target # we will take the first column as the example later

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots()
sns.distplot(X[:,0], hist = True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

from sklearn.preprocessing import Binarizer


sample_columns = X[0:10,0] # select the top 10 samples
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


model = Binarizer(threshold=6) # set 6 to be the threshold
# if value <= 6, then return 0 else return 1
result = model.fit_transform(sample_columns.reshape(-1,1)).reshape(-1)
# return array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])

1.1.1.2 Binning

Bin a numerical feature.

Uniform binning:

from sklearn.preprocessing import KBinsDiscretizer


# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # set 5 bins
# return ordinal bin number, set all bins to have identical widths


model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])
bin_edge = model.bin_edges_[0]
# return array([ 0.4999 ,  3.39994,  6.29998,  9.20002, 12.10006, 15.0001 ]), the bin edges

# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)


for edge in bin_edge: # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);

Quantile binning:

from sklearn.preprocessing import KBinsDiscretizer


# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') # set 5 bins
# return ordinal bin number, set all bins based on quantile


model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])
bin_edge = model.bin_edges_[0]
# return array([ 0.4999 ,  2.3523 ,  3.1406 ,  3.9667 ,  5.10824, 15.0001 ]), the bin edges
# 2.3523 is the 20% quantile
# 3.1406 is the 40% quantile, etc..

# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)


for edge in bin_edge: # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantiles Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);

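1.1.2 Scaling

Features on different scales are hard to compare, especially in linear models such as linear regression and logistic regression. Models based on Euclidean distance, such as k-means clustering or KNN, require feature scaling; otherwise the distance measure is meaningless. Scaling also speeds up convergence for any algorithm that uses gradient descent.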
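
To make the point about Euclidean distance concrete, the sketch below measures how much each feature contributes to the squared distance between two samples, before and after z-score scaling. The income and age values are hypothetical, and `squared_distance_shares` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical samples: column 0 is annual income, column 1 is age.
X_demo = np.array([[50000.0, 25.0],
                   [50200.0, 65.0],
                   [51000.0, 26.0]])


def squared_distance_shares(a, b):
    """Return each feature's share of the squared Euclidean distance between a and b."""
    squared_diff = (a - b) ** 2
    return squared_diff / squared_diff.sum()


# Raw features: the income column dominates the distance.
raw_shares = squared_distance_shares(X_demo[0], X_demo[1])

# Z-score each column (population standard deviation, as StandardScaler uses).
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled_shares = squared_distance_shares(X_scaled[0], X_scaled[1])

print(raw_shares)     # income share is about 0.96
print(scaled_shares)  # age share is about 0.96
```

On the raw data, a 40-year age gap barely registers next to a 200-unit income gap. After scaling, the age gap accounts for most of the distance.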

Some commonly used models:

Note: skewness affects PCA, so it is best to remove it with a power transformation.

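1.1.2.1 Standard Scaling (Z-Score Standardization)

Formula:

$$X' = \frac{X - \mu}{\sigma}$$

where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.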

from sklearn.preprocessing import StandardScaler


# in order to mimic the operation in real-world, we shall fit the StandardScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = StandardScaler()


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.34539745,  2.33286782,  1.78324852,  0.93339178, -0.0125957 ,
# 0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as ((X[0:10,0] - X[10:,0].mean())/X[10:,0].std())

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = StandardScaler()
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

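1.1.2.2 MinMaxScaler (Scaling to a Value Range)

Suppose we want to scale the feature to the range (a, b).

Formula:

$$X' = \frac{X - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \cdot (b - a) + a$$

where Min is the minimum of X and Max is the maximum of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.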
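
The sketch below applies min-max scaling to a general target range (a, b) with plain NumPy: the minimum maps to a, the maximum maps to b, and everything in between is interpolated linearly. The function name `min_max_scale` and the sample values are assumptions made for illustration.

```python
import numpy as np


def min_max_scale(x, a=0.0, b=1.0):
    """Scale x linearly so that its minimum maps to a and its maximum maps to b."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min) * (b - a) + a


values = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical feature values

scaled = min_max_scale(values, a=-1.0, b=1.0)
print(scaled)  # values: -1.0, -0.5, 0.0, 1.0
```

This sketch does not guard against a constant feature, where Max equals Min and the division is undefined.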

from sklearn.preprocessing import MinMaxScaler


# in order to mimic the operation in real-world, we shall fit the MinMaxScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = MinMaxScaler(feature_range=(0,1)) # set the range to be (0,1)


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
# 0.24392077, 0.21787286, 0.18069406, 0.1089985 , 0.22008662])
# result is the same as (X[0:10,0] - X[10:,0].min())/(X[10:,0].max()-X[10:,0].min())

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = MinMaxScaler(feature_range=(0,1))
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout() # now the scale change to [0,1]

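1.1.2.3 RobustScaler (Outlier-Robust Scaling)

Scale the feature using statistics that are robust to outliers (quantiles). Suppose the quantile range used for scaling is (a, b).

Formula:

$$X' = \frac{X - \mathrm{median}(X)}{\mathrm{quantile}(X, b) - \mathrm{quantile}(X, a)}$$

This method is more robust to outliers.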
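
A small sketch of why quantile-based statistics are more robust: it compares the mean and standard deviation with the median and interquartile range on a toy array, with and without one extreme value. The arrays are hypothetical, and `iqr` is an illustrative helper.

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dirty = np.array([1.0, 2.0, 3.0, 4.0, 500.0])  # hypothetical: last value replaced by an outlier


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25


for name, data in [("clean", clean), ("dirty", dirty)]:
    print(name,
          "mean:", data.mean(), "std:", round(data.std(), 2),
          "median:", np.median(data), "IQR:", iqr(data))
```

The single outlier shifts the mean and inflates the standard deviation, while the median and the interquartile range stay the same in this example.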

import numpy as np
from sklearn.preprocessing import RobustScaler


# in order to mimic the operation in real-world, we shall fit the RobustScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
# with_centering = True => recenter the feature by set X' = X - X.median()
# with_scaling = True => rescale the feature by the quantile set by user
# set the quantile to the (25%, 75%)


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.19755974,  2.18664281,  1.7077657 ,  0.96729508,  0.14306683,
# 0.23049401,  0.05724508, -0.19003715, -0.66689601,  0.07196918])
# result is the same as (X[0:10,0] - np.quantile(X[10:,0], 0.5))/(np.quantile(X[10:,0],0.75)-np.quantile(X[10:,0], 0.25))

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

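1.1.2.4 Power Transformation (Nonlinear Transformation)

All the scaling methods introduced above preserve the shape of the original distribution. However, normality is an important assumption of many statistical models. We can use a power transformation to map the original distribution to an approximately normal one.

Box-Cox transformation:

The Box-Cox transformation applies only to strictly positive values and takes the following form:

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln(x), & \lambda = 0
\end{cases}
$$

All values of λ are considered, and the optimal value, which stabilizes variance and minimizes skewness, is selected by maximum likelihood estimation.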
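
As a sketch of the transform itself (not of the maximum likelihood search for λ), the code below implements the standard Box-Cox definition: (x^λ − 1)/λ for λ ≠ 0, and ln(x) for λ = 0. The function name `box_cox` and the sample values are assumptions for illustration. The result is not standardized, unlike the output of PowerTransformer with standardize=True.

```python
import numpy as np


def box_cox(x, lam):
    """Apply the Box-Cox transform with a fixed lambda to strictly positive values."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam


x = np.array([1.0, 2.0, 4.0])  # hypothetical positive feature values

print(box_cox(x, 1.0))  # lambda = 1 only shifts the data: x - 1
print(box_cox(x, 0.5))  # square-root-like compression of large values
print(box_cox(x, 0.0))  # lambda = 0 is the natural log
```

As λ approaches 0, the power branch converges to the logarithm, which is why the two cases fit together smoothly.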

from sklearn.preprocessing import PowerTransformer


# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = PowerTransformer(method='box-cox', standardize=True)
# apply box-cox transformation


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.91669292,  1.91009687,  1.60235867,  1.0363095 ,  0.19831579,
# 0.30244247,  0.09143411, -0.24694006, -1.08558469,  0.11011933])

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

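Yeo-Johnson transformation:

The Yeo-Johnson transformation applies to both positive and negative values and takes the following form:

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x \geq 0 \\
\ln(x + 1), & \lambda = 0,\ x \geq 0 \\
-\dfrac{(-x + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ x < 0 \\
-\ln(-x + 1), & \lambda = 2,\ x < 0
\end{cases}
$$

All values of λ are considered, and the optimal value, which stabilizes variance and minimizes skewness, is selected by maximum likelihood estimation.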
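
The sketch below implements the standard Yeo-Johnson definition for a fixed λ. Non-negative values are handled like Box-Cox applied to x + 1, and negative values use a mirrored power with exponent 2 − λ. The function name `yeo_johnson` and the sample values are assumptions for illustration. As with the library version, choosing λ is a separate step that this sketch does not cover.

```python
import numpy as np


def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson transform with a fixed lambda to values of any sign."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative branch: Box-Cox applied to x + 1.
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])

    # Negative branch: mirrored transform with exponent 2 - lambda.
    if lam != 2:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # hypothetical values with mixed signs

print(yeo_johnson(x, 1.0))  # lambda = 1 leaves the data unchanged
print(yeo_johnson(x, 0.0))  # log-like compression on the positive side
```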

from sklearn.preprocessing import PowerTransformer


# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply yeo-johnson transformation


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.90367888,  1.89747091,  1.604735  ,  1.05166306,  0.20617221,
# 0.31245176,  0.09685566, -0.25011726, -1.10512438,  0.11598074])

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

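1.1.3 Normalization

All the scaling methods above operate column by column. Normalization, by contrast, works on each row: it rescales each sample to have unit norm. Because it operates row by row, it distorts the relationships between features and is therefore less common. It is, however, very useful in text classification and clustering contexts.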
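
One reason row-wise normalization suits text and clustering tasks is that, after rescaling every row to unit L2 norm, the dot product of two rows equals their cosine similarity. The sketch below shows this with NumPy; the term-count matrix is hypothetical.

```python
import numpy as np

# Hypothetical term counts for three documents over a three-word vocabulary.
docs = np.array([[3.0, 0.0, 4.0],
                 [6.0, 0.0, 8.0],   # same word mix as the first document, twice as long
                 [0.0, 5.0, 0.0]])  # shares no words with the other two

# Rescale every row (sample) to unit L2 norm.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# For unit-norm rows, the dot product is the cosine similarity.
similarity = unit @ unit.T
print(similarity)
```

The first two documents differ only in length, so their similarity is 1 after normalization, while the third has similarity 0 with both.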

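Let X[i][j] denote the value of feature j in sample i.

L1 normalization formula:

$$X'[i][j] = \frac{X[i][j]}{\sum_{k} |X[i][k]|}$$

L2 normalization formula:

$$X'[i][j] = \frac{X[i][j]}{\sqrt{\sum_{k} X[i][k]^{2}}}$$

L1 normalization: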

from sklearn.preprocessing import Normalizer


# Normalizer performs operation on each row independently
# So train set and test set are processed independently


###### for L1 Norm
sample_columns = X[0:2,0:3] # select the first two samples, and the first three features
# return array([[ 8.3252, 41., 6.98412698],
# [ 8.3014 , 21.,  6.23813708]])


model = Normalizer(norm='l1')
# use L1 Norm to normalize each sample


model.fit(sample_columns) 


result = model.transform(sample_columns) # test set are processed similarly
# return array([[0.14784762, 0.72812094, 0.12403144],
# [0.23358211, 0.59089121, 0.17552668]])
# result = sample_columns/np.sum(np.abs(sample_columns), axis=1).reshape(-1,1)

L2 normalization:

###### for L2 Norm
sample_columns = X[0:2,0:3] # select the first three features
# return array([[ 8.3252, 41., 6.98412698],
# [ 8.3014 , 21.,  6.23813708]])


model = Normalizer(norm='l2')
# use L2 Norm to normalize each samples


model.fit(sample_columns) 


result = model.transform(sample_columns)
# return array([[0.19627663, 0.96662445, 0.16465922],
# [0.35435076, 0.89639892, 0.26627902]])
# result = sample_columns/np.sqrt(np.sum(sample_columns**2, axis=1)).reshape(-1,1)

# visualize the difference in the distribution after Normalization
# compare it with the distribution after RobustScaling
# fit and transform the entire first & second feature


import seaborn as sns
import matplotlib.pyplot as plt


# RobustScaler
fig, ax = plt.subplots(2,1, figsize = (13,9))


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
model.fit(X[:,0:2]) 
result = model.transform(X[:,0:2])


sns.scatterplot(x=result[:,0], y=result[:,1], ax=ax[0])
ax[0].set_title('Scatter Plot of RobustScaling result', fontsize=12)
ax[0].set_xlabel('Feature 1', fontsize=12)
ax[0].set_ylabel('Feature 2', fontsize=12);


model = Normalizer(norm='l2')


model.fit(X[:,0:2]) 
result = model.transform(X[:,0:2])


sns.scatterplot(x=result[:,0], y=result[:,1], ax=ax[1])
ax[1].set_title('Scatter Plot of Normalization result', fontsize=12)
ax[1].set_xlabel('Feature 1', fontsize=12)
ax[1].set_ylabel('Feature 2', fontsize=12);
fig.tight_layout()  # Normalization distorts the original distribution

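1.1.4 Imputation of Missing Values

In practice, a dataset may contain missing values. Such datasets are incompatible with most scikit-learn models, which assume that all features are numerical and have no missing values. So before applying a scikit-learn model, we need to impute the missing values.

However, some newer models implemented in other packages, such as XGBoost, LightGBM, and CatBoost, support missing values natively. When using these models, we no longer need to fill in the missing values.

1.1.4.1 Univariate Feature Imputation

Suppose column i has missing values. We impute them with a constant or with a statistic of column i (mean, median, or mode).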

from sklearn.impute import SimpleImputer


test_set = X[0:10,0].copy() # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now test_set becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the training samples
# in practice, we fit the imputer on the train set and transform the test set
train_set = X[10:,0].copy()
train_set[3] = np.nan
train_set[6] = np.nan


imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', 'constant'
imputer.fit(train_set.reshape(-1,1))
result = imputer.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([8.3252    , 8.3014    , 7.2574    , 3.87023658, 3.8462    ,
# 4.0368    , 3.87023658, 3.12      , 2.0804    , 3.6912    ])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set) 
# which is the mean of the trainset ignoring missing values

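1.1.4.2 Multivariate Feature Imputation

Multivariate feature imputation uses information from the entire dataset to estimate and impute missing values. In scikit-learn, it is implemented in a round-robin, iterative fashion.

At each step, one feature column is designated as the output y and the other feature columns are treated as the input X. A regressor is fit on (X, y) for the rows where y is known, and is then used to predict the missing values of y. This is done for each feature in turn, and the whole pass is repeated for max_iter imputation rounds.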
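
The round-robin procedure can be sketched in a few lines of NumPy. This is a simplified illustration that uses ordinary least squares as the regressor and column means as the starting fill. It is not the IterativeImputer implementation, and the function name `iterative_impute` and the toy data are assumptions.

```python
import numpy as np


def iterative_impute(X, n_rounds=10):
    """Round-robin imputation sketch using ordinary least squares as the regressor."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from a simple univariate fill (column means).
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            # Inputs: an intercept plus all other (currently filled) columns.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            known = ~missing[:, j]
            # Fit on rows where column j was observed ...
            coef, *_ = np.linalg.lstsq(design[known], filled[known, j], rcond=None)
            # ... then predict the rows where it was missing.
            filled[~known, j] = design[~known] @ coef
    return filled


# Hypothetical data in which the second column equals 2 * first column + 1.
data = np.array([[1.0, 3.0],
                 [2.0, 5.0],
                 [3.0, 7.0],
                 [4.0, np.nan]])
imputed = iterative_impute(data)
print(imputed[3, 1])  # close to 9.0
```

Because the toy data follow an exact linear relation, the missing entry is recovered almost exactly. On real data, the result depends on the choice of regressor.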

Using a linear model (BayesianRidge as an example):

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the training samples
# in practice, we fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252    , 8.3014    , 7.2574    , 4.6237195 , 3.8462    ,
# 4.0368    , 4.00258149, 3.12      , 2.0804    , 3.6912    ])

Using a tree-based model (ExtraTrees as an example):

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the training samples
# in practice, we fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be tuned with cross-validation through an sklearn pipeline
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252 , 8.3014 , 7.2574 , 4.63813, 3.8462 , 4.0368 , 3.24721,
# 3.12   , 2.0804 , 3.6912 ])

Using K-nearest neighbors (KNN):

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the training samples
# in practice, we fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = KNeighborsRegressor(n_neighbors=10, 
                                       p = 1)  # set p=1 to use Manhattan distance
# use Manhattan distance to reduce the effect of outliers


# parameters can be tuned with cross-validation through an sklearn pipeline
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052 , 3.12  ,
# 2.0804, 3.6912])

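1.1.4.3 Marking Imputed Values

Sometimes the fact that a value is missing is itself informative. scikit-learn therefore also provides a way to transform a dataset with missing values into a corresponding binary matrix that indicates where values are missing.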
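
One common way to use such a mask is to append it to the imputed features, so that a downstream model still sees where values were missing. The NumPy sketch below shows the idea on a hypothetical matrix with mean imputation. scikit-learn's SimpleImputer offers an `add_indicator` option for a similar purpose.

```python
import numpy as np

# Hypothetical feature matrix with two missing entries.
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan]])

# Binary mask: 1.0 where a value is missing, 0.0 elsewhere.
missing_mask = np.isnan(data).astype(float)

# Mean-impute each column, ignoring the missing entries.
column_means = np.nanmean(data, axis=0)
imputed = np.where(np.isnan(data), column_means, data)

# Append the mask so a downstream model can still see where values were missing.
augmented = np.hstack([imputed, missing_mask])
print(augmented)
```

The first two columns of `augmented` hold the imputed features and the last two hold the missing-value flags for the same columns.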

from sklearn.impute import MissingIndicator


# illustrate this function on trainset only
# since the process is independent in train set and test set
train_set = X[10:,:].copy() # select all features
train_set[3,0] = np.nan # manually create some missing values
train_set[6,0] = np.nan
train_set[3,1] = np.nan


indicator = MissingIndicator(missing_values=np.nan, features='all') 
# show the results on all the features
result = indicator.fit_transform(train_set) # result have the same shape with train_set
# contains only True & False, True corresponds with missing value


result[:,0].sum() # should return 2, the first column has two missing values
result[:,1].sum(); # should return 1, the second column has one missing value

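1.1.5 Feature Transformation

1.1.5.1 Polynomial Transformation

Sometimes we want to introduce nonlinear features into a model to increase its complexity. For a simple linear model, this increases model complexity considerably. More complex models, such as tree-based ML models, already capture nonlinear relationships through their nonparametric tree structure, so this kind of feature transformation may not help them much.

For example, with the degree set to 3, two features (X1, X2) are expanded to:

$$(X_1, X_2) \rightarrow (1,\ X_1,\ X_2,\ X_1^2,\ X_1 X_2,\ X_2^2,\ X_1^3,\ X_1^2 X_2,\ X_1 X_2^2,\ X_2^3)$$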
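
To see where the expanded terms come from, the sketch below enumerates every product of input features up to a given degree, ordered by total degree. `polynomial_terms` is an illustrative helper written for this example, not a library function.

```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_terms(row, degree):
    """Return all monomials of the entries of row up to the given total degree."""
    terms = []
    for d in range(degree + 1):
        # Each combination picks d feature indices (with repetition) to multiply.
        for combo in combinations_with_replacement(range(len(row)), d):
            # The empty combination (d = 0) gives the constant term 1.
            terms.append(np.prod([row[i] for i in combo]))
    return np.array(terms)


terms = polynomial_terms([2, 3], degree=3)
print(terms)  # values: 1, 2, 3, 4, 6, 9, 8, 12, 18, 27
```

For the sample (2, 3), this produces ten terms: the constant, the two original features, three degree-2 products, and four degree-3 products.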

from sklearn.preprocessing import PolynomialFeatures


# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])


poly = PolynomialFeatures(degree = 3, interaction_only = False)
# the highest degree is set to 3, and we want more than just interaction terms


result = poly.fit_transform(train_set) # have shape (1, 10)
# array([[ 1.,  2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.]])

1.1.5.2 Custom Transformation

from sklearn.preprocessing import FunctionTransformer


# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])


transformer = FunctionTransformer(func = np.log1p, validate=True)
# perform log transformation, X' = log(1 + x)
# func can be any numpy function such as np.exp
result = transformer.fit_transform(train_set)
# return array([[1.09861229, 1.38629436]]), the same as np.log1p(train_set)

That concludes the introduction to data preprocessing for static continuous variables. We recommend working through the code in Jupyter.
