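Key points covered in this case study:
1. Classification when the label distribution is imbalanced (undersampling, oversampling, and the differences between them)
2. Implementing a logistic regression model (cross-validation, the regularization coefficient C, choosing the decision threshold)
3. Simple data preprocessing (standardization)
4. Precision, recall, and the confusion matrix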
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv(r"C:\Users\Administrator\01_machinelearning\1-2\creditcard.csv")
data.head(3)
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
1. Check whether the classes are balanced
count_classes = data["Class"].value_counts()
# [Tip] Use pandas' built-in plotting to draw the chart
count_classes.plot(kind="bar")
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
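Explanation: the label distribution is extremely imbalanced, so the data cannot be modelled directly. Consider undersampling or oversampling; this example starts with undersampling.
2. Rescale the data and create a new feature
Q: Why rescale the data here, and how?
A: In machine learning the features should be on comparable scales; the usual methods are normalization and standardization. Here the Amount column needs rescaling, which is done with sklearn's preprocessing module.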
from sklearn.preprocessing import StandardScaler
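# [Tip] reshape(-1, n) lets NumPy infer the missing dimension from the array size
# e.g. shape [2,3] -> reshape(-1,2) -> [2*3/2,2] -> [3,2]; here (-1,1) puts all values into a single column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the redundant Time and Amount columns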
data = data.drop(["Time","Amount"],axis=1)
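To make the standardization step more concrete, here is a minimal NumPy sketch of the same transformation done by hand. The amounts are made-up values for illustration only; the point is the formula z = (x - mean) / std, which is what StandardScaler applies per column with its default settings (using the population standard deviation).

```python
import numpy as np

# Hypothetical transaction amounts, used only for illustration.
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99])

# reshape(-1, 1) turns the 1-D array into a single column;
# -1 lets NumPy infer the number of rows from the array length.
column = amounts.reshape(-1, 1)

# Standardization: subtract the column mean, then divide by the
# column standard deviation (population std, ddof=0).
standardized = (column - column.mean(axis=0)) / column.std(axis=0)

print(column.shape)
print(standardized.ravel())
```

After the transformation the column has mean 0 and standard deviation 1, so Amount ends up on a scale comparable to the V1..V28 features.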
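3-1. Building the undersampled dataset
Q: What is undersampling?
A: One way to handle classification when the label distribution is imbalanced: draw a subset of the majority class so that its size equals that of the minority class.
Q: What problems does undersampling cause, and how are they handled?
A: The sample becomes small (addressed with cross-validation). Recall is good, but precision drops because normal transactions get wrongly flagged as fraud (discussed later).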
# Split features and label
X = data.loc[:,data.columns != "Class"]
y = data.loc[:,data.columns == "Class"]
# [Tip] Use .index to get the row labels of fraud and normal transactions, then sample with np.random.choice
fraud_indices = data[data["Class"] == 1].index
normal_indices = data[data["Class"] == 0].index
# Randomly draw number_records_fraud normal transactions
# [Tip] replace=False samples without replacement; np.random.seed() can make the draw reproducible, though it is not needed here
number_records_fraud = len(data[data["Class"] == 1])
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace = False)
random_normal_indices = np.array(random_normal_indices)
# [Tip] np.concatenate joins all fraud indices with the sampled normal indices into one ndarray
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
# Select the undersampled rows (.loc, because the indices are row labels), then split features and label
under_sample_data = data.loc[under_sample_indices,:]
X_undersample = under_sample_data.loc[:,under_sample_data.columns != "Class" ]
y_undersample = under_sample_data.loc[:,under_sample_data.columns == "Class" ]
# Show the result
total = len(under_sample_data)
p_o_n_t = len(under_sample_data[under_sample_data["Class"]==0])/total
p_o_f_t = len(under_sample_data[under_sample_data["Class"]==1])/total
print("Total number of transactions in resampled data: {}".format(total))
print("Percentage of normal transactions: {}".format(p_o_n_t))
print("Percentage of fraud transactions: {}".format(p_o_f_t))
Total number of transactions in resampled data: 984
Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
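The undersampling step can also be wrapped in a small reusable function. The sketch below is an illustration on synthetic labels, not the notebook's code: the function name `undersample_indices` and the `seed` parameter are my own, and it assumes the minority class is labelled 1. It uses a seeded NumPy generator so the draw is reproducible.

```python
import numpy as np

def undersample_indices(labels, seed=0):
    """Return positions of a balanced subset: all minority rows (label 1)
    plus an equally sized random draw, without replacement, from the
    majority rows (label 0)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(labels == 1)
    majority = np.flatnonzero(labels == 0)
    sampled_majority = rng.choice(majority, size=len(minority), replace=False)
    return np.concatenate([minority, sampled_majority])

# Synthetic labels: 990 normal (0) and 10 fraud (1) transactions.
labels = np.array([0] * 990 + [1] * 10)
idx = undersample_indices(labels, seed=42)
balanced = labels[idx]
print(len(idx), balanced.mean())
```

Because the generator is seeded, calling the function twice with the same seed returns the same rows, which makes later comparisons between runs easier.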
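3-2. Cross-validation on the undersampled dataset
Q: What is cross-validation?
A: A dataset is usually split into a training set and a test set. The training set is then divided into several equal parts that take turns serving as the validation set while the remaining parts are used for training, so the training data is used as fully as possible. This is called cross-validation; KFold is typically used to set how many parts the original training set is split into.
Q: Why split both the undersampled dataset and the original dataset?
A: The undersampled dataset does not reflect all the characteristics of the full data and is small, so the model is built on the undersampled data and then validated directly on the test set drawn from the full data.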
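To show what K-fold splitting does with the row indices, here is a minimal NumPy sketch. It is a simplified stand-in for sklearn's KFold, not its implementation; the function name `kfold_indices` is hypothetical, and 688 is the size of the undersampled training set in this example.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for
    validation exactly once across the folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for k, f in enumerate(folds) if k != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=688, n_splits=5))
for fold_no, (train_idx, val_idx) in enumerate(splits):
    print(fold_no, len(train_idx), len(val_idx))
```

Every row ends up in exactly one validation fold, so averaging a metric over the folds uses all of the training data for evaluation without ever scoring a model on rows it was fitted on.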
from sklearn.model_selection import train_test_split
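# [Tip] Use sklearn's train_test_split to split the data into a training set and a test set
# test_size: fraction of the data held out for testing; random_state=0 fixes the split so the four subsets stay the same across runs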
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
total = len(X_train) + len(X_test)
n_t_tr_d = len(X_train)
n_t_te_d = len(X_test)
print("Total number of transactions: {}".format(total))
print("Number transactions train dataset: {}".format(n_t_tr_d))
print("Number transactions test dataset: {}".format(n_t_te_d))
print("")
# Likewise, split the undersampled dataset
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample = train_test_split(X_undersample,y_undersample,test_size=0.3,random_state=0)
total = len(X_train_undersample)+len(X_test_undersample)
n_t_tr_d = len(X_train_undersample)
n_t_te_d = len(X_test_undersample)
print("Total number of transactions: {}".format(total))
print("Number transactions train dataset: {}".format(n_t_tr_d))
print("Number transactions test dataset: {}".format(n_t_te_d))
Total number of transactions: 284807
Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 984
Number transactions train dataset: 688
Number transactions test dataset: 296
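3-3. Tuning and evaluating the model on the undersampled dataset
Q: Which parameters are tuned?
A: The regularization coefficient and the classifier's decision threshold, starting with the regularization coefficient.
Q: What is a regularization penalty?
A: Why introduce it? A penalty on the size of the weights makes the model less sensitive to fluctuations in the training set. Among models with similar recall it favours the more stable one and helps avoid overfitting (good on training data, poor on test data). Common penalties are L2 regularization, loss + (w^2)/2, and L1 regularization, loss + abs(w). A coefficient controls how strongly the penalty is applied, and that coefficient is what we tune. Note that in sklearn's LogisticRegression the parameter C is the inverse of the regularization strength, so a smaller C means a stronger penalty.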
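The role of C can be written out as a small sketch. This is not sklearn's code, only an illustration of the objective as I understand it from the scikit-learn documentation: penalty(w) + C * (sum of logistic losses). The helper names, the toy data and the omission of an intercept are my own simplifications.

```python
import numpy as np

def penalty(w, kind="l2"):
    """L1 penalty: sum of |w_i|; L2 penalty: 0.5 * sum of w_i^2."""
    w = np.asarray(w, dtype=float)
    if kind == "l1":
        return float(np.abs(w).sum())
    if kind == "l2":
        return float(0.5 * np.dot(w, w))
    raise ValueError("kind must be 'l1' or 'l2'")

def log_loss_sum(w, X, y):
    """Sum of logistic losses for labels y in {0, 1} (no intercept)."""
    signs = 2 * np.asarray(y) - 1          # map {0, 1} to {-1, +1}
    margins = signs * (X @ w)
    return float(np.logaddexp(0.0, -margins).sum())  # log(1 + exp(-margin))

def objective(w, X, y, C=1.0, kind="l2"):
    # C multiplies the data term, so a small C lets the penalty dominate.
    return penalty(w, kind) + C * log_loss_sum(w, X, y)

# Toy data, for illustration only.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]])
y = np.array([1, 0, 1])
w = np.array([3.0, -4.0])

print(penalty(w, "l1"), penalty(w, "l2"))
print(objective(w, X, y, C=0.01), objective(w, X, y, C=100))
```

With C = 0.01 the objective is almost entirely the penalty, so the optimizer is pushed toward small weights; with C = 100 the fit to the training data dominates.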
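Q: Why use recall rather than accuracy as the evaluation criterion?
A: When the classes are imbalanced, accuracy is easily misleading. Detection tasks usually use recall as the criterion: recall = TP/(TP+FN)
TP (true positives): positives predicted as positive; FP (false positives): negatives predicted as positive; FN (false negatives): positives predicted as negative; TN (true negatives): negatives predicted as negative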
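A quick way to see why accuracy misleads on imbalanced data is to score a "classifier" that never predicts fraud. The sketch below uses synthetic labels (998 normal, 2 fraud); the helper name `confusion_counts` is my own.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

# Synthetic labels: 998 normal transactions and 2 frauds.
y_true = np.array([0] * 998 + [1] * 2)
# A useless model that always predicts "normal".
y_pred = np.zeros_like(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(accuracy, recall)
```

Accuracy comes out at 0.998 even though not a single fraud is caught, while recall is 0.0, which is why recall is the criterion used here.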
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.model_selection import KFold, cross_val_score # K-fold cross-validation
from sklearn.metrics import confusion_matrix,recall_score,classification_report # confusion matrix and recall
# Find the best regularization coefficient C under l1/l2 regularization with n-fold cross-validation
# Defaults: l1 regularization, 5 folds
def print_Kfold_scores(x_train_data,y_train_data,fold_times=5,penalty="l1"):
    fold = KFold(fold_times,shuffle=True)
    # Candidate regularization coefficients
    c_param_range = [0.01,0.1,1,10,100]
    # Results table, one row per candidate C
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=["C_parameter","Mean recall score"])
    results_table["C_parameter"] = c_param_range
    # Search for the best coefficient
    j = 0
    for c_param in c_param_range:
        print("-------------------------------")
        print("c parameter: {}".format(c_param))
        print("-------------------------------")
        recall_accs = []
        # Cross-validation
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # Model; liblinear supports both l1 and l2 (the default lbfgs solver in recent sklearn versions does not support l1)
            lr = LogisticRegression(C=c_param,penalty=penalty,solver="liblinear")
            # Fit: in each iteration indices[0] are the training rows and indices[1] the validation rows
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            # Predict on the validation rows
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:])
            # Evaluate: compute recall
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print("Iteration {} : recall score = {}".format(iteration,recall_acc))
        # Mean recall over the folds
        results_table.loc[j,"Mean recall score"] = np.mean(recall_accs)
        j += 1
        # Show the result
        print("")
        print("Mean recall score {}".format(np.mean(recall_accs)))
        print("")
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    print("==================================================================================")
    print("Best model to choose from cross validation is with C parameter = {}".format(best_c))
    print("==================================================================================")
    return best_c
best_c = print_Kfold_scores(X_train_undersample,y_train_undersample)
-------------------------------
c parameter: 0.01
-------------------------------
Iteration 0 : recall score = 0.9726027397260274
Iteration 1 : recall score = 0.9552238805970149
Iteration 2 : recall score = 0.9508196721311475
Iteration 3 : recall score = 1.0
Iteration 4 : recall score = 0.9420289855072463
Mean recall score 0.9641350555922872
-------------------------------
c parameter: 0.1
-------------------------------
Iteration 0 : recall score = 0.810126582278481
Iteration 1 : recall score = 0.927536231884058
Iteration 2 : recall score = 0.9230769230769231
Iteration 3 : recall score = 0.8888888888888888
Iteration 4 : recall score = 0.9420289855072463
Mean recall score 0.8983315223271194
-------------------------------
c parameter: 1
-------------------------------
Iteration 0 : recall score = 0.918918918918919
Iteration 1 : recall score = 0.9104477611940298
Iteration 2 : recall score = 0.8823529411764706
Iteration 3 : recall score = 0.9041095890410958
Iteration 4 : recall score = 0.9365079365079365
Mean recall score 0.9104674293676904
-------------------------------
c parameter: 10
-------------------------------
Iteration 0 : recall score = 0.90625
Iteration 1 : recall score = 0.9264705882352942
Iteration 2 : recall score = 0.9054054054054054
Iteration 3 : recall score = 0.92
Iteration 4 : recall score = 0.921875
Mean recall score 0.9160001987281399
-------------------------------
c parameter: 100
-------------------------------
Iteration 0 : recall score = 0.9315068493150684
Iteration 1 : recall score = 0.8666666666666667
Iteration 2 : recall score = 0.88
Iteration 3 : recall score = 0.9104477611940298
Iteration 4 : recall score = 0.9142857142857143
Mean recall score 0.9005813982922959
==================================================================================
Best model to choose from cross validation is with C parameter = 0.01
==================================================================================
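Explanation: with 5-fold cross-validation and the mean recall over the 5 folds as the criterion, C = 0.01 performs best.
Note: running the model directly on the raw data gives a recall of roughly 0.6, which again shows that a model cannot be applied directly to a dataset with such an imbalanced label distribution.
3-4. The confusion matrix and how to plot it
Q: What is a confusion matrix, and why introduce it?
A: A confusion matrix, also called an error matrix, is a standard format for presenting classification results. Drawn as a heatmap, it makes precision and recall much easier to read at a glance.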
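The layout of the matrix is worth spelling out, because the recall computation in this post indexes into it directly. The sketch below builds a 2x2 matrix by hand on made-up labels, following the convention sklearn's confusion_matrix uses (rows are true labels, columns are predicted labels); the function name `confusion_matrix_2x2` is hypothetical.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels, columns are predicted labels:
    [[TN, FP],
     [FN, TP]]"""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels: 6 normal and 4 fraud transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
recall = cm[1, 1] / (cm[1, 0] + cm[1, 1])      # TP / (TP + FN)
precision = cm[1, 1] / (cm[0, 1] + cm[1, 1])   # TP / (TP + FP)
print(cm)
print(recall, precision)
```

Here recall is 0.75 (3 of 4 frauds caught) and precision is 0.6 (3 of 5 fraud alerts are real), so the top-right cell, the false positives, is the one to watch when precision is the concern.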
import itertools
# Plot a confusion matrix as a heatmap
def plot_confusion_matrix(cm, classes,title="Confusion matrix",cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Write each count into its cell, in white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],horizontalalignment="center",color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    # plt.show() is left to the caller so the function can also draw into subplots
# Confusion matrix on the undersampled test set
lr = LogisticRegression(C = best_c, penalty = "l1", solver = "liblinear")
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")
plt.show()
# The result on the undersampled test set is quite good
Recall metric in the testing dataset: 0.9387755102040817
# Confusion matrix on the full-data test set for the model fitted on the undersampled training set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")
plt.show()
# Recall looks good, but many normal transactions are misclassified (8581), so precision is rather low; this is one of the problems with undersampling
Recall metric in the testing dataset: 0.918367346939
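3-5. How changing the decision threshold affects the classification
Q: Why do this?
A: To choose a threshold that meets the precision and recall requirements of the actual business scenario.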
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
# Use predict_proba to get probabilities rather than 0/1 labels
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(10,10))
j = 1
for i in thresholds:
    # Predict fraud when the fraud probability exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    plt.subplot(3,3,j)
    j += 1
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot the confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title="Threshold > %s"%i)
plt.show()
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.9931972789115646
Recall metric in the testing dataset: 0.9251700680272109
Recall metric in the testing dataset: 0.8707482993197279
Recall metric in the testing dataset: 0.8367346938775511
Recall metric in the testing dataset: 0.7414965986394558
Recall metric in the testing dataset: 0.5918367346938775
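Explanation: with C = best_c, recall falls as the threshold rises, while the number of misclassified normal transactions falls as well. A threshold between 0.5 and 0.6 gives a reasonable trade-off.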
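The trade-off can be reproduced on a handful of numbers. The probabilities and labels below are made up for illustration and are not taken from the model above; the helper name `recall_and_false_positives` is my own.

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels.
proba = np.array([0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.30, 0.58, 0.75, 0.95])
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

def recall_and_false_positives(y_true, proba, threshold):
    """Classify as fraud when proba > threshold, then return
    (recall, number of normal transactions wrongly flagged)."""
    y_pred = proba > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fn = np.sum(~y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return float(tp / (tp + fn)), int(fp)

results = [recall_and_false_positives(y_true, proba, t) for t in (0.2, 0.5, 0.7)]
for t, (recall, fp) in zip((0.2, 0.5, 0.7), results):
    print(t, recall, fp)
```

On this toy data a threshold of 0.2 catches every fraud but wrongly flags 4 normal transactions, while 0.7 flags none but catches only half of the frauds; the right point depends on what a missed fraud costs compared with a false alarm.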
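4. Tuning and evaluating the model on the oversampled dataset
Q: What is oversampling?
A: The opposite of undersampling: for a dataset with an imbalanced label distribution, oversampling generates new minority-class samples, here with the SMOTE strategy, until the minority class is as large as the majority class.
Q: How is oversampling done?
A: The SMOTE sample generation strategy:
1. For a minority-class sample x, compute its Euclidean distance to every other minority-class sample and take its k nearest neighbours
2. Set a sampling rate N according to the class imbalance ratio; for each minority-class sample x, randomly pick several samples from its k nearest neighbours, and call a picked neighbour xn
3. For each randomly picked neighbour xn, take the difference d = xn - x and build a new sample from the original one as xnew = x + rand(0,1) * d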
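The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation idea, not imbalanced-learn's implementation; the function name `smote_like_samples` and its parameters are hypothetical, and the toy points are made up.

```python
import numpy as np

def smote_like_samples(minority, k=3, n_new_per_sample=2, seed=0):
    """Generate synthetic minority samples by interpolating between each
    sample and randomly chosen members of its k nearest neighbours."""
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: Euclidean distance from x to every minority sample.
        dists = np.linalg.norm(minority - x, axis=1)
        dists[i] = np.inf                      # exclude x itself
        neighbours = np.argsort(dists)[:k]
        # Step 2: pick neighbours at random (with replacement).
        chosen = rng.choice(neighbours, size=n_new_per_sample, replace=True)
        # Step 3: xnew = x + rand(0,1) * (xn - x), a point on the segment x..xn.
        for n in chosen:
            gap = rng.random()
            synthetic.append(x + gap * (minority[n] - x))
    return np.array(synthetic)

# Toy minority class with 5 points in 2-D.
minority = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2]])
new_points = smote_like_samples(minority)
print(new_points.shape)
```

Each synthetic point lies on the line segment between a real minority sample and one of its neighbours, so the new samples stay inside the region the minority class already occupies instead of being exact duplicates.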
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data = pd.read_csv(r"C:\Users\Administrator\01_machinelearning\1-2\creditcard.csv")
# Standardize the Amount column and drop the redundant columns
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(["Time","Amount"],axis=1)
# Split into training and test sets
columns = data.columns
X = data.loc[:,data.columns != "Class"]
y = data.loc[:,data.columns == "Class"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
# [Note] Never oversample the test set, only the training set; random_state makes the generated data reproducible
oversampler = SMOTE(random_state=0)
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_train_oversample,y_train_oversample = oversampler.fit_resample(X_train,y_train.values.ravel())
# Number of fraud samples after oversampling
print(int((y_train_oversample==1).sum()))
# Use print_Kfold_scores to find the best regularization coefficient for the oversampled data
X_train_oversample = pd.DataFrame(X_train_oversample)
y_train_oversample = pd.DataFrame(y_train_oversample)
best_c = print_Kfold_scores(X_train_oversample,y_train_oversample)
-------------------------------
c parameter: 0.01
-------------------------------
Iteration 0 : recall score = 0.9100065977567627
Iteration 1 : recall score = 0.9092681430855841
Iteration 2 : recall score = 0.9098477379778727
Iteration 3 : recall score = 0.9085449326562155
Iteration 4 : recall score = 0.9075730208379224
Mean recall score 0.9090480864628715
-------------------------------
c parameter: 0.1
-------------------------------
Iteration 0 : recall score = 0.9090748937482108
Iteration 1 : recall score = 0.9086166458429232
Iteration 2 : recall score = 0.9101831920221412
Iteration 3 : recall score = 0.9110724586626742
Iteration 4 : recall score = 0.9123952767332938
Mean recall score 0.9102684934018488
-------------------------------
c parameter: 1
-------------------------------
Iteration 0 : recall score = 0.9086415956418592
Iteration 1 : recall score = 0.9094835145562961
Iteration 2 : recall score = 0.910353618601789
Iteration 3 : recall score = 0.9129352137505509
Iteration 4 : recall score = 0.9116281117249576
Mean recall score 0.9106084108550906
-------------------------------
c parameter: 10
-------------------------------
Iteration 0 : recall score = 0.9125590221084683
Iteration 1 : recall score = 0.911245454943707
Iteration 2 : recall score = 0.9095706087988918
Iteration 3 : recall score = 0.909858471848684
Iteration 4 : recall score = 0.9091008699844411
Mean recall score 0.9104668855368384
-------------------------------
c parameter: 100
-------------------------------
Iteration 0 : recall score = 0.9084761341999869
Iteration 1 : recall score = 0.9109394599590678
Iteration 2 : recall score = 0.9128188214599824
Iteration 3 : recall score = 0.90803865583132
Iteration 4 : recall score = 0.912469653498124
Mean recall score 0.9105485449896962
==================================================================================
Best model to choose from cross validation is with C parameter = 1.0
==================================================================================
# Confusion matrix on the full-data test set for the model fitted on the oversampled training set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_oversample,y_train_oversample.values.ravel())
y_pred_oversample = lr.predict(X_test.values)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred_oversample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# Recall looks good, and there are fewer misclassifications than with undersampling, so precision has improved
Recall metric in the testing dataset: 0.9405940594059405
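5. Summary and reference notes
Credit card fraud classification is a fairly classic logistic regression case. I took some time over the National Day holiday to rewrite it, updating some of the sklearn imports and Python usage found in earlier notes by other authors. To summarize: when the class counts differ greatly, oversampling or undersampling is generally used. In this case oversampling is the better choice: with similar recall, precision is clearly higher (which is almost to be expected, because oversampling uses more of the source data).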