2. Credit card fraud case study (19.10.7)

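Key points covered in this case study:

1. Classification when the label distribution is imbalanced (undersampling, oversampling, and how the two differ)

2. Fitting a logistic regression model (cross-validation, the regularization parameter C, and the decision threshold)

3. Simple data preprocessing (standardization)

4. The concepts of precision, recall, and the confusion matrix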


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv(r"C:\Users\Administrator\01_machinelearning\1-2\creditcard.csv")
data.head(3)

Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0

1. Check whether the classes are balanced

count_classes = data["Class"].value_counts()

# [Tip] Plot directly with pandas' built-in plotting
count_classes.plot(kind="bar")
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

 

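Explanation: the label distribution is extremely imbalanced, so a model fitted directly on this data tends to favour the majority class. Consider undersampling or oversampling; this example starts with undersampling.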


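2. Scale the data and create a new feature

Q: Why scale the data here, and how?

A: Many machine learning models work better when features are on similar scales; the usual approaches are normalization and standardization. In this example the Amount column is rescaled with sklearn's preprocessing module.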

from sklearn.preprocessing import StandardScaler

# [Tip] reshape(-1, n) lets NumPy infer the first dimension from the array size
# e.g. an array of shape [2,3] -> reshape(-1,2) -> [2*3/2,2] -> [3,2]; here (-1,1) puts the values in a single column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the Time and Amount columns, which are no longer needed
data = data.drop(["Time","Amount"],axis=1)
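
As a rough sketch of what the scaling step does: as far as I understand it, StandardScaler subtracts the column mean and divides by the population standard deviation. The plain-Python snippet below reproduces that arithmetic on a few made-up amounts. The helper name `standardize` and the sample values are assumptions for illustration only; they are not part of sklearn or of the dataset.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std (divides by n, not n - 1)
    return [(v - mean) / std for v in values]

amounts = [149.62, 2.69, 378.66, 123.50]  # made-up transaction amounts
scaled = standardize(amounts)
print([round(s, 3) for s in scaled])
```

After scaling, the values have mean 0 and standard deviation 1, which puts Amount on roughly the same scale as the V1..V28 columns.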

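3-1. Selecting the undersampled dataset

Q: What is undersampling?

A: One way to handle classification with an imbalanced label distribution: randomly draw a subset of the majority class so that its size equals that of the minority class.

Q: What problems does undersampling cause, and how are they handled?

A: The sample becomes small (cross-validation helps make the most of it). Recall tends to look good, but precision drops because many normal transactions are wrongly flagged (discussed later).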

# Split into features and labels
X = data.loc[:,data.columns != "Class"]
y = data.loc[:,data.columns == "Class"]

# [Tip] Use .index to get the row labels of fraud and normal transactions, then sample with np.random.choice
fraud_indices = data[data["Class"] == 1].index
normal_indices = data[data["Class"] == 0].index

# Randomly draw number_records_fraud normal transactions
# [Tip] replace=False samples without replacement; np.random.seed() would make the draw repeatable, though it matters little here
number_records_fraud = len(data[data["Class"] == 1])
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace = False)
random_normal_indices = np.array(random_normal_indices)

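# [Tip] np.concatenate joins all fraud indices with the sampled normal indices into one ndarray
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Select the undersampled rows (by index label), then split into features and labels
under_sample_data = data.loc[under_sample_indices,:]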
X_undersample = under_sample_data.loc[:,under_sample_data.columns != "Class" ]
y_undersample = under_sample_data.loc[:,under_sample_data.columns == "Class" ]

# Show the result
total = len(under_sample_data)
p_o_n_t = len(under_sample_data[under_sample_data["Class"]==0])/total
p_o_f_t = len(under_sample_data[under_sample_data["Class"]==1])/total
print("Total number of transactions in resampled data: {}".format(total))
print("Percentage of normal transactions: {}".format(p_o_n_t))
print("Percentage of fraud transactions: {}".format(p_o_f_t))
Total number of transactions in resampled data: 984
Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
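
For readers who want the draw to be repeatable, here is a minimal plain-Python sketch of the same undersampling idea with a fixed seed. The function name `undersample_indices` and the toy labels are hypothetical; the post itself uses np.random.choice on the real index.

```python
import random

def undersample_indices(labels, minority=1, seed=0):
    """Return indices of all minority rows plus an equal-sized random draw of the rest."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    rng = random.Random(seed)  # fixed seed so the same rows are drawn every run
    # sample() draws without replacement, like replace=False in np.random.choice
    sampled_majority = rng.sample(majority_idx, len(minority_idx))
    return minority_idx + sampled_majority

labels = [0] * 95 + [1] * 5  # toy labels: 95 normal, 5 fraud
picked = undersample_indices(labels)
print(len(picked), sum(labels[i] for i in picked))  # prints: 10 5
```

Half of the selected rows are fraud and half are normal, which matches the 0.5 / 0.5 split printed above.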

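3-2. Cross-validation on the undersampled dataset

Q: What is cross-validation?

A: The data is usually split into a training set and a test set. The training set is then divided into k equal folds; each fold serves once as the validation set while the remaining folds are used for training, so every training sample is used for both fitting and validation. KFold sets the number of folds.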
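
The fold rotation can be sketched in a few lines of plain Python. This is only an illustration of the idea, not sklearn's KFold: the generator name `kfold_indices` is made up, and it does not shuffle, whereas the code later in this post uses shuffle=True.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # The first (n_samples % n_splits) folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each of the 5 iterations trains on 8 samples and validates on the remaining 2, and averaging the 5 scores gives a steadier estimate than a single split.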

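Q: Why split both the undersampled dataset and the original dataset?

A: The undersampled dataset is small and does not reflect the class distribution of the full data, so the model is built on the undersampled training set and then checked directly against the test set taken from the full data.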

from sklearn.model_selection import train_test_split

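# [Tip] Use sklearn's train_test_split to split the data into training and test sets
# test_size: fraction held out for testing; random_state=0 fixes the split so the four subsets stay the same between runs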
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
 
total = len(X_train) + len(X_test)
n_t_tr_d = len(X_train)
n_t_te_d = len(X_test)
print("Total number of transactions: {}".format(total))
print("Number transactions train dataset: {}".format(n_t_tr_d))
print("Number transactions test dataset: {}".format(n_t_te_d))

print("")

# Likewise, split the undersampled dataset
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample = train_test_split(X_undersample,y_undersample,test_size=0.3,random_state=0)

total = len(X_train_undersample)+len(X_test_undersample)
n_t_tr_d = len(X_train_undersample)
n_t_te_d = len(X_test_undersample)
print("Total number of transactions: {}".format(total))
print("Number transactions train dataset: {}".format(n_t_tr_d))
print("Number transactions test dataset: {}".format(n_t_te_d))
Total number of transactions: 284807
Number transactions train dataset: 199364
Number transactions test dataset: 85443

Total number of transactions: 984
Number transactions train dataset: 688
Number transactions test dataset: 296

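3-3. Tuning and evaluating the model on the undersampled dataset

Q: Which parameters are tuned?

A: The regularization parameter and the classification threshold, starting with the regularization parameter.

Q: What is a regularization penalty?

A: Why introduce it? A penalty on the size of the weights makes the fit less sensitive to fluctuations in the training set, so among models with similar recall the more stable one is preferred; this helps avoid overfitting (good on training data, poor on test data). Common penalties: L2 regularization, loss + (w^2)/2; L1 regularization, loss + abs(w). A coefficient in front of the penalty term controls its strength, and that coefficient is what gets tuned. Note that sklearn's C is the inverse of the regularization strength, so a smaller C means a stronger penalty.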
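
To make the role of C concrete, here is a small sketch of the two penalties. It assumes the penalized loss can be written as data_loss + (1 / C) * penalty. As I understand it, sklearn's objective scales the data loss by C instead, which has the same minimizer. The function names and the weights are made up for illustration.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Half the sum of squared weights."""
    return sum(w * w for w in weights) / 2

def penalized_loss(data_loss, weights, C, penalty="l2"):
    """data_loss + (1 / C) * penalty: a smaller C gives a stronger penalty."""
    term = l1_penalty(weights) if penalty == "l1" else l2_penalty(weights)
    return data_loss + term / C

weights = [0.5, -2.0, 0.0]  # made-up coefficients
for C in [0.01, 1, 100]:
    print(C, penalized_loss(0.3, weights, C, "l1"), penalized_loss(0.3, weights, C, "l2"))
```

With C = 0.01 the penalty dominates the loss, which pushes the fit toward small weights; with C = 100 it barely matters.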

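Q: Why use recall rather than accuracy as the evaluation metric?

A: When the classes are imbalanced, accuracy is easily misleading. Detection tasks usually use recall as the metric: recall = TP/(TP+FN)

TP (true positives): positives predicted as positive; FP (false positives): negatives predicted as positive; FN (false negatives): positives predicted as negative; TN (true negatives): negatives predicted as negative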
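
A toy example shows why accuracy misleads on imbalanced data. The sketch below counts the four cells by hand for a "model" that labels every transaction as normal. The data and the helper `confusion_counts` are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with 1 as the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Toy data: 100 transactions, 5 of them fraud; the model predicts "normal" for all
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print("accuracy:", accuracy, "recall:", recall)  # accuracy: 0.95 recall: 0.0
```

The accuracy is 95% even though not a single fraud case is caught, which is why recall is the metric tracked in the rest of this post.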

from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.model_selection import KFold, cross_val_score # cross-validation folds
from sklearn.metrics import confusion_matrix,recall_score,classification_report # confusion matrix, recall

# Find the best regularization parameter C under an l1/l2 penalty with n-fold cross-validation
# Defaults: l1 penalty, 5 folds
def print_Kfold_scores(x_train_data,y_train_data,fold_times=5,penalty="l1"):
    fold = KFold(fold_times,shuffle=True)
    
    # Candidate values of C
    c_param_range = [0.01,0.1,1,10,100]
    
    # Table of results, one row per candidate C
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=["C_parameter","Mean recall score"])
    results_table["C_parameter"] = c_param_range
    
    # Search for the best value of C
    j = 0
    for c_param in c_param_range:
        print("-------------------------------")
        print("c parameter: {}".format(c_param))
        print("-------------------------------")
        
        recall_accs = []
        
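        # Cross-validation
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # Choose the model; the liblinear solver supports both l1 and l2 penalties
            lr = LogisticRegression(C=c_param,penalty=penalty,solver="liblinear")
            # Fit on indices[0] (training rows); indices[1] holds the validation rows
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            # Predict on the validation rows
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
            # Evaluate: compute recall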
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print("Iteration {} : recall score = {}".format(iteration,recall_acc))
        
        # Mean recall across folds
        results_table.loc[j,"Mean recall score"] = np.mean(recall_accs)
        j += 1
        
        # Show the result
        print("")
        print("Mean recall score {}".format(np.mean(recall_accs)))
        print("")
        
    best_c = results_table.iloc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    print("==================================================================================")
    print("Best model to choose from cross validation is with C parameter = {}".format(best_c))
    print("==================================================================================")
    return best_c  
best_c = print_Kfold_scores(X_train_undersample,y_train_undersample)
-------------------------------
c parameter: 0.01
-------------------------------
Iteration 0 : recall score = 0.9726027397260274
Iteration 1 : recall score = 0.9552238805970149
Iteration 2 : recall score = 0.9508196721311475
Iteration 3 : recall score = 1.0
Iteration 4 : recall score = 0.9420289855072463

Mean recall score 0.9641350555922872

-------------------------------
c parameter: 0.1
-------------------------------
Iteration 0 : recall score = 0.810126582278481
Iteration 1 : recall score = 0.927536231884058
Iteration 2 : recall score = 0.9230769230769231
Iteration 3 : recall score = 0.8888888888888888
Iteration 4 : recall score = 0.9420289855072463

Mean recall score 0.8983315223271194

-------------------------------
c parameter: 1
-------------------------------
Iteration 0 : recall score = 0.918918918918919
Iteration 1 : recall score = 0.9104477611940298
Iteration 2 : recall score = 0.8823529411764706
Iteration 3 : recall score = 0.9041095890410958
Iteration 4 : recall score = 0.9365079365079365

Mean recall score 0.9104674293676904

-------------------------------
c parameter: 10
-------------------------------
Iteration 0 : recall score = 0.90625
Iteration 1 : recall score = 0.9264705882352942
Iteration 2 : recall score = 0.9054054054054054
Iteration 3 : recall score = 0.92
Iteration 4 : recall score = 0.921875

Mean recall score 0.9160001987281399

-------------------------------
c parameter: 100
-------------------------------
Iteration 0 : recall score = 0.9315068493150684
Iteration 1 : recall score = 0.8666666666666667
Iteration 2 : recall score = 0.88
Iteration 3 : recall score = 0.9104477611940298
Iteration 4 : recall score = 0.9142857142857143

Mean recall score 0.9005813982922959

==================================================================================
Best model to choose from cross validation is with C parameter = 0.01
==================================================================================

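Explanation: with 5-fold cross-validation and the mean recall over the 5 folds as the criterion, C = 0.01 gives the best result.

Note: fitting the model directly on the original data gives a recall of roughly 0.6, which again shows that a model cannot be applied naively to a dataset with such an imbalanced label distribution.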

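3-4. The confusion matrix and how to plot it

Q: What is a confusion matrix, and why introduce it?

A: A confusion matrix (also called an error matrix) is a standard layout for evaluating classification results. Drawn as a heatmap, it makes precision and recall much easier to read at a glance.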

import itertools

# Plot a confusion matrix as a heatmap
def plot_confusion_matrix(cm, classes,title="Confusion matrix",cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each count into its cell, in white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],horizontalalignment="center",color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()

# Compute and plot the confusion matrix on the undersampled test set
lr = LogisticRegression(C = best_c, penalty = "l1", solver = "liblinear")
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")
# The result on the undersampled test set looks fairly good
Recall metric in the testing dataset:  0.9387755102040817

# Fit on the undersampled training set, then compute and plot the confusion matrix on the full test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")
plt.show()
# Recall still looks good, but many normal transactions are misclassified (8581), so precision is rather low; this is one drawback of undersampling
Recall metric in the testing dataset:  0.918367346939

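3-5. How changing the threshold affects classification

Q: Why do this?

A: To pick a threshold that meets the requirements of the actual business scenario (the required precision and recall).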

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
# Use predict_proba here to get probabilities rather than 0/1 labels
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot the confusion matrix (the title matches the strict > comparison used above)
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title="Threshold > %s"%i)
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.9931972789115646
Recall metric in the testing dataset:  0.9251700680272109
Recall metric in the testing dataset:  0.8707482993197279
Recall metric in the testing dataset:  0.8367346938775511
Recall metric in the testing dataset:  0.7414965986394558
Recall metric in the testing dataset:  0.5918367346938775

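Explanation: with C = best_c, recall falls as the threshold rises, while the number of normal transactions wrongly flagged also falls. A threshold between 0.5 and 0.6 gives a reasonable balance here.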
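
The trade-off is easy to see on a handful of made-up probabilities. This is a plain-Python sketch; the labels, probabilities, and the helper name `recall_and_false_positives` are assumptions for illustration, not output of the model above.

```python
def recall_and_false_positives(y_true, fraud_proba, threshold):
    """Flag a transaction as fraud when its probability exceeds the threshold."""
    y_pred = [1 if p > threshold else 0 for p in fraud_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), fp

# Made-up labels and predicted fraud probabilities
y_true      = [1,    1,    1,    1,    0,    0,    0,    0]
fraud_proba = [0.95, 0.80, 0.55, 0.30, 0.65, 0.40, 0.20, 0.05]

for threshold in [0.1, 0.5, 0.9]:
    recall, fp = recall_and_false_positives(y_true, fraud_proba, threshold)
    print("threshold", threshold, "recall", recall, "false positives", fp)
```

On this toy data, a low threshold catches every fraud case but flags three normal transactions, while a high threshold flags none but misses most of the fraud.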


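4. Tuning and evaluating the model on the oversampled dataset

Q: What is oversampling?

A: The opposite of undersampling: for a dataset with an imbalanced label distribution, oversampling generates new minority-class samples (here with the SMOTE strategy) until the minority class is as large as the majority class.

Q: How is oversampling done?

A: The SMOTE sample-generation strategy:

1. For a minority-class sample x, compute the Euclidean distance to every other minority-class sample and take its k nearest neighbours.

2. Choose a sampling rate N from the imbalance ratio. For each minority-class sample x, randomly pick several samples from its k nearest neighbours; call a chosen neighbour xn.

3. For each chosen neighbour xn, take the difference d = xn - x and build a new sample from the original one as xnew = x + rand(0,1) * d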
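
The three steps can be sketched in plain Python as follows. This is a simplified illustration of the interpolation idea only, not the imbalanced-learn implementation. The function names, the 2-D toy points, and the default parameters are all assumptions.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def smote_like_samples(minority, k=2, n_new_per_sample=1, seed=0):
    """Create synthetic points on the segments between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest neighbours of x among the other minority samples
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda o: euclidean(x, o))[:k]
        for _ in range(n_new_per_sample):
            # Step 2: pick one of the neighbours at random
            xn = rng.choice(neighbours)
            # Step 3: xnew = x + rand(0,1) * (xn - x)
            gap = rng.random()
            synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, xn)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]  # toy minority-class points
new_points = smote_like_samples(minority)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.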

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


data = pd.read_csv(r"C:\Users\Administrator\01_machinelearning\1-2\creditcard.csv")

# Standardize the Amount column and drop the columns that are no longer needed
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(["Time","Amount"],axis=1)

# Split into training and test sets
columns = data.columns
X = data.loc[:,data.columns != "Class"]
y = data.loc[:,data.columns == "Class"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

# [Note] Never oversample the test set; oversample the training set only. random_state makes the generated data repeatable
oversampler = SMOTE(random_state=0)
# fit_resample replaces the older fit_sample in current imbalanced-learn releases
X_train_oversample,y_train_oversample = oversampler.fit_resample(X_train,y_train.values.ravel())
print(int((y_train_oversample == 1).sum()))
# Use print_Kfold_scores to find the best C for the oversampled data
X_train_oversample = pd.DataFrame(X_train_oversample)
y_train_oversample = pd.DataFrame(y_train_oversample)
best_c = print_Kfold_scores(X_train_oversample,y_train_oversample)
-------------------------------
c parameter: 0.01
-------------------------------
Iteration 0 : recall score = 0.9100065977567627
Iteration 1 : recall score = 0.9092681430855841
Iteration 2 : recall score = 0.9098477379778727
Iteration 3 : recall score = 0.9085449326562155
Iteration 4 : recall score = 0.9075730208379224

Mean recall score 0.9090480864628715

-------------------------------
c parameter: 0.1
-------------------------------
Iteration 0 : recall score = 0.9090748937482108
Iteration 1 : recall score = 0.9086166458429232
Iteration 2 : recall score = 0.9101831920221412
Iteration 3 : recall score = 0.9110724586626742
Iteration 4 : recall score = 0.9123952767332938

Mean recall score 0.9102684934018488

-------------------------------
c parameter: 1
-------------------------------
Iteration 0 : recall score = 0.9086415956418592
Iteration 1 : recall score = 0.9094835145562961
Iteration 2 : recall score = 0.910353618601789
Iteration 3 : recall score = 0.9129352137505509
Iteration 4 : recall score = 0.9116281117249576

Mean recall score 0.9106084108550906

-------------------------------
c parameter: 10
-------------------------------
Iteration 0 : recall score = 0.9125590221084683
Iteration 1 : recall score = 0.911245454943707
Iteration 2 : recall score = 0.9095706087988918
Iteration 3 : recall score = 0.909858471848684
Iteration 4 : recall score = 0.9091008699844411

Mean recall score 0.9104668855368384

-------------------------------
c parameter: 100
-------------------------------
Iteration 0 : recall score = 0.9084761341999869
Iteration 1 : recall score = 0.9109394599590678
Iteration 2 : recall score = 0.9128188214599824
Iteration 3 : recall score = 0.90803865583132
Iteration 4 : recall score = 0.912469653498124

Mean recall score 0.9105485449896962

==================================================================================
Best model to choose from cross validation is with C parameter = 1.0
==================================================================================
# Fit on the oversampled training set, then compute and plot the confusion matrix on the full test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_oversample,y_train_oversample.values.ravel())
y_pred_oversample = lr.predict(X_test.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred_oversample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# Recall looks good, fewer transactions are misclassified than with undersampling, and precision improves
Recall metric in the testing dataset:  0.9405940594059405


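5. Summary and reference notes

Credit card fraud classification is a fairly classic logistic regression case study. I took some time over the National Day holiday to rewrite it, updating some sklearn imports and some Python usage from earlier notes by others. To summarize: when the class counts differ greatly, oversampling or undersampling is the usual remedy, and in this case oversampling is the better choice: with similar recall, precision improves noticeably (this is almost to be expected, since oversampling keeps more of the original data).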

https://www.itread01.com/content/1542590587.html

https://blog.csdn.net/stranger_man/article/details/79055095
