2. Credit Card Fraud Case Study (19.10.7)

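Key points covered in this case study:

1. Classification when the label distribution is imbalanced (undersampling, oversampling, and how the two differ)

2. Fitting a logistic regression model (cross-validation, the regularization parameter C, and the decision threshold)

3. Simple data preprocessing (standardization)

4. Precision, recall, and the confusion matrix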

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv(r"C:\Users\Administrator\01_machinelearning\1-2\creditcard.csv")
data.head(3)

Time	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V21	V22	V23	V24	V25	V26	V27	V28	Amount
0.0	-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	...	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62
0.0	1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	...	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69
1.0	-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	...	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	378.66

1. Check whether the classes are balanced

count_classes = data["Class"].value_counts()

# [Tip] Plot directly with pandas' built-in plotting
count_classes.plot(kind="bar")
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

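Explanation: the label distribution is extremely imbalanced, so training directly on the raw data works poorly. Undersampling or oversampling can address this; this example starts with undersampling.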

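2. Scale the data and create a new feature

Q: Why scale the data here, and how?

A: Many machine learning models work better when features are on similar scales; the usual approaches are normalization and standardization. As the sample rows above show, the Amount column is on a very different scale from the V columns, so it is standardized with sklearn's preprocessing module.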

from sklearn.preprocessing import StandardScaler

# [Tip] reshape(-1, n) lets NumPy infer the number of rows from the array size
# e.g. shape (2, 3) -> reshape(-1, 2) -> (2*3/2, 2) -> (3, 2); here (-1, 1) produces a single column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the now-redundant Time and Amount columns
data = data.drop(["Time","Amount"],axis=1)
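As a rough illustration of what StandardScaler does to the Amount column, the sketch below standardizes a short list by hand with only the standard library. The three amounts are the values shown in the data.head(3) output; the function name standardize is made up for this example. StandardScaler divides by the population standard deviation, which is why pstdev is used here rather than stdev.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

amounts = [149.62, 2.69, 378.66]  # Amount values from the first three rows
norm_amounts = standardize(amounts)
print(norm_amounts)
```

After the transform the values have mean 0 and unit standard deviation, which puts them in roughly the same range as the V columns.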

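3-1. Building the undersampled dataset

Q: What is undersampling?

A: One way to handle classification with imbalanced labels: randomly draw a subset of the majority class so that its size equals that of the minority class.

Q: What problems does undersampling cause, and how are they addressed?

A: The sample becomes small (mitigated with cross-validation). Recall comes out well, but precision drops because many normal transactions are wrongly flagged as fraud (discussed later).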
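The idea can be sketched without pandas. The helper below is hypothetical (its name, the toy labels, and the seed are assumptions for illustration): it keeps every minority row and draws an equally sized random subset of the majority rows without replacement.

```python
import random

def undersample_indices(labels, minority=1, seed=0):
    """Return indices of a balanced subset: all minority rows plus an
    equally sized random draw (without replacement) from the rest."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled_majority = rng.sample(majority_idx, len(minority_idx))
    return minority_idx + sampled_majority

# Hypothetical toy labels: 3 fraud rows among 20 transactions
labels = [0] * 17 + [1] * 3
picked = undersample_indices(labels)
print(sorted(picked))
```

The notebook code that follows does the same thing on the real data with np.random.choice and np.concatenate.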

# Split into features and labels
X = data.loc[:,data.columns != "Class"]
y = data.loc[:,data.columns == "Class"]

# [Tip] Use .index to get the row labels of the fraud and normal transactions, then sample with np.random.choice
fraud_indices = data[data["Class"] == 1].index
normal_indices = data[data["Class"] == 0].index

# Randomly draw number_records_fraud normal rows
# [Tip] replace=False samples without replacement; np.random.seed() would make the draw reproducible, though it is not needed here
number_records_fraud = len(data[data["Class"] == 1])
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace = False)
random_normal_indices = np.array(random_normal_indices)

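# [Tip] Use np.concatenate to join the indices of all fraud rows and the sampled normal rows (an ndarray)
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Select the undersampled rows, then split into features and labels
# (.loc is used because the indices are row labels rather than positions)
under_sample_data = data.loc[under_sample_indices,:]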
X_undersample = under_sample_data.loc[:,under_sample_data.columns != "Class" ]
y_undersample = under_sample_data.loc[:,under_sample_data.columns == "Class" ]

# Show the class proportions
total = len(under_sample_data)
p_o_n_t = len(under_sample_data[under_sample_data["Class"]==0])/total
p_o_f_t = len(under_sample_data[under_sample_data["Class"]==1])/total
print("Total number of transactions in resampled data: {}".format(total))
print("Percentage of normal transactions: {}".format(p_o_n_t))
print("Percentage of fraud transactions: {}".format(p_o_f_t))

Total number of transactions in resampled data: 984
Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5

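3-2. Cross-validation on the undersampled dataset

Q: What is cross-validation?

A: A dataset is usually split into a training set and a test set. Within the training set, the data is further divided into k equal folds; each fold serves once as the validation set while the remaining folds are used for training. This makes the most of the training data. KFold sets the number of folds.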
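To make the fold mechanics concrete, here is a minimal sketch of k-fold index splitting in plain Python. The function name kfold_indices is an assumption for illustration, and unlike KFold(shuffle=True) it does not shuffle, so the folds are contiguous.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, n_splits=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each row index appears in the validation part of exactly one fold, so every training sample is used for both fitting and validation across the k rounds.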

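Q: Why split both the undersampled dataset and the original dataset?

A: The undersampled dataset is small and does not reflect the class distribution of the full data. The model is therefore trained on the undersampled training set and finally evaluated on the test set drawn from the full data.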

from sklearn.model_selection import train_test_split

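# [Tip] Use sklearn's train_test_split to split the data into training and test sets
# test_size: fraction held out for testing; random_state=0 fixes the split so the four subsets stay the same between runs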
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
 
total = len(X_train) + len(X_test)
n_t_tr_d = len(X_train)
n_t_te_d = len(X_test)
print("Total number of transactions: {}".format(total))
print("Number transactions train dataset: {}".format(n_t_tr_d))
print("Number transactions test dataset: {}".format(n_t_te_d))

print("")

# Likewise, split the undersampled dataset
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample = train_test_split(X_undersample,y_undersample,test_size=0.3,random_state=0)

total = len(X_train_undersample)+len(X_test_undersample)
n_t_tr_d = len(X_train_undersample)
n_t_te_d = len(X_test_undersample)
print("Total number of transactions: {}".format(total))
print("Number transactions train dataset: {}".format(n_t_tr_d))
print("Number transactions test dataset: {}".format(n_t_te_d))

Total number of transactions: 284807
Number transactions train dataset: 199364
Number transactions test dataset: 85443

Total number of transactions: 984
Number transactions train dataset: 688
Number transactions test dataset: 296

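3-3. Tuning and evaluating the model on the undersampled dataset

Q: Which parameters are tuned?

A: The regularization parameter C and the classifier's decision threshold; C is tuned first.

Q: What is a regularization penalty?

A: Why introduce it? A regularization penalty discourages large weights, which makes the model less sensitive to fluctuations in the training set and helps avoid overfitting (good training performance, poor test performance). Common penalties: L2 regularization, loss + (w^2)/2, and L1 regularization, loss + abs(w). The coefficient C controls the penalty strength and is the parameter tuned here. Note that in sklearn, C is the inverse of the regularization strength, so a smaller C means a stronger penalty.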
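The two penalties can be written out as a small sketch. The function names, weights, and loss value below are hypothetical, and the way C enters (multiplying the data loss, so that a small C lets the penalty dominate) follows the usual reading of sklearn's C as an inverse regularization strength; treat the exact scaling as an assumption rather than sklearn's precise objective.

```python
def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Half the sum of squared weights."""
    return 0.5 * sum(w * w for w in weights)

def penalized_objective(data_loss, weights, C, penalty="l2"):
    """C scales the data loss, so a small C lets the penalty dominate."""
    reg = l1_penalty(weights) if penalty == "l1" else l2_penalty(weights)
    return C * data_loss + reg

weights = [1.0, -2.0]  # hypothetical model weights
data_loss = 4.0        # hypothetical unpenalized loss
for C in (0.01, 1, 100):
    print(C, penalized_objective(data_loss, weights, C, penalty="l1"))
```

With C = 0.01 the objective is almost entirely the penalty term, which pushes the optimizer toward small weights; with C = 100 the data loss dominates.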

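Q: Why use recall rather than accuracy as the evaluation metric?

A: With imbalanced data, accuracy is misleading. Detection tasks usually use recall as the metric: recall = TP/(TP+FN)

TP (true positives): positives predicted as positive. FP (false positives): negatives predicted as positive. FN (false negatives): positives predicted as negative. TN (true negatives): negatives predicted as negative.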
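A small sketch shows why accuracy misleads on imbalanced data. The counts are hypothetical: 1000 transactions with 5 frauds, and a model that labels everything as normal.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Share of actual positives that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Share of predicted positives that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical: 1000 transactions, 5 frauds, model predicts "normal" for all
tp, fp, fn, tn = 0, 0, 5, 995
print("accuracy:", accuracy(tp, fp, fn, tn))
print("recall:", recall(tp, fn))
```

The accuracy is very high even though not a single fraud is detected, which is exactly the failure that recall exposes.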

from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.model_selection import KFold, cross_val_score # K-fold cross-validation
from sklearn.metrics import confusion_matrix,recall_score,classification_report # confusion matrix and recall

# Find the best regularization parameter C under an l1/l2 penalty using n-fold cross-validation
# Defaults: l1 penalty, 5 folds
def print_Kfold_scores(x_train_data,y_train_data,fold_times=5,penalty="l1"):
    fold = KFold(fold_times,shuffle=True)
    
    # Candidate values for the regularization parameter C
    c_param_range = [0.01,0.1,1,10,100]
    
    # Table of results, one row per candidate C
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=["C_parameter","Mean recall score"])
    results_table["C_parameter"] = c_param_range
    
    # Search for the best C
    j = 0
    for c_param in c_param_range:
        print("-------------------------------")
        print("c parameter: {}".format(c_param))
        print("-------------------------------")
        
        recall_accs = []
        
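        # Cross-validation
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # Build the model; liblinear is a solver that supports both l1 and l2 penalties
            lr = LogisticRegression(C=c_param,penalty=penalty,solver="liblinear")
            # Fit: in each iteration indices[0] are the training rows and indices[1] the validation rows
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            # Predict on the validation rows
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
            # Evaluate: compute recall on the validation rows
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)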
            recall_accs.append(recall_acc)
            print("Iteration {} : recall score = {}".format(iteration,recall_acc))
        
        # Store the mean recall across folds
        results_table.loc[j,"Mean recall score"] = np.mean(recall_accs)
        j += 1
        
        # Print the result
        print("")
        print("Mean recall score {}".format(np.mean(recall_accs)))
        print("")
        
    best_c = results_table.iloc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    print("==================================================================================")
    print("Best model to choose from cross validation is with C parameter = {}".format(best_c))
    print("==================================================================================")
    return best_c

best_c = print_Kfold_scores(X_train_undersample,y_train_undersample)

-------------------------------
c parameter: 0.01
-------------------------------
Iteration 0 : recall score = 0.9726027397260274
Iteration 1 : recall score = 0.9552238805970149
Iteration 2 : recall score = 0.9508196721311475
Iteration 3 : recall score = 1.0
Iteration 4 : recall score = 0.9420289855072463

Mean recall score 0.9641350555922872

-------------------------------
c parameter: 0.1
-------------------------------
Iteration 0 : recall score = 0.810126582278481
Iteration 1 : recall score = 0.927536231884058
Iteration 2 : recall score = 0.9230769230769231
Iteration 3 : recall score = 0.8888888888888888
Iteration 4 : recall score = 0.9420289855072463

Mean recall score 0.8983315223271194

-------------------------------
c parameter: 1
-------------------------------
Iteration 0 : recall score = 0.918918918918919
Iteration 1 : recall score = 0.9104477611940298
Iteration 2 : recall score = 0.8823529411764706
Iteration 3 : recall score = 0.9041095890410958
Iteration 4 : recall score = 0.9365079365079365

Mean recall score 0.9104674293676904

-------------------------------
c parameter: 10
-------------------------------
Iteration 0 : recall score = 0.90625
Iteration 1 : recall score = 0.9264705882352942
Iteration 2 : recall score = 0.9054054054054054
Iteration 3 : recall score = 0.92
Iteration 4 : recall score = 0.921875

Mean recall score 0.9160001987281399

-------------------------------
c parameter: 100
-------------------------------
Iteration 0 : recall score = 0.9315068493150684
Iteration 1 : recall score = 0.8666666666666667
Iteration 2 : recall score = 0.88
Iteration 3 : recall score = 0.9104477611940298
Iteration 4 : recall score = 0.9142857142857143

Mean recall score 0.9005813982922959

==================================================================================
Best model to choose from cross validation is with C parameter = 0.01
==================================================================================

Explanation: with 5-fold cross-validation and the mean recall across the folds as the criterion, C = 0.01 performs best.

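Note: fitting the model directly on the raw data gives a recall of roughly 0.6, which again shows that a model should not be trained directly on a dataset with such an imbalanced label distribution.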

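3-4. The confusion matrix and how to plot it

Q: What is a confusion matrix, and why use one?

A: A confusion matrix, also called an error matrix, is a standard layout for reporting classification results: it counts the samples for each combination of true and predicted label. Drawn as a heatmap, it makes precision and recall easy to read at a glance.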
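The layout can be sketched by hand. The helper name and the eight labels below are hypothetical; the row/column convention (rows are true labels, columns are predicted labels) matches sklearn's confusion_matrix, which is why the notebook computes recall as cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]).

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels, columns are predicted labels (0 then 1)."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Hypothetical labels for eight transactions
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]
cm = confusion_matrix_2x2(y_true, y_pred)
tn, fp = cm[0]  # first row: true class 0
fn, tp = cm[1]  # second row: true class 1
print(cm)
print("recall:", tp / (tp + fn))
```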

# Plot a confusion matrix as a heatmap
def plot_confusion_matrix(cm, classes,title="Confusion matrix",cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],horizontalalignment="center",color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()

import itertools

# Compute and plot the confusion matrix on the undersampled test set
lr = LogisticRegression(C = best_c, penalty = "l1", solver = "liblinear")
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")

# The model performs well on the undersampled test set
Recall metric in the testing dataset:  0.9387755102040817

# Fit on the undersampled training set, then compute and plot the confusion matrix on the full test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title="Confusion matrix")
plt.show()

# Recall still looks good, but many normal transactions are misclassified as fraud (8581), so precision is low; this is one drawback of undersampling
Recall metric in the testing dataset:  0.918367346939

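3-5. How changing the decision threshold affects classification

Q: Why do this?

A: To choose a threshold that meets the precision and recall requirements of the actual business scenario.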

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
# Use predict_proba to get class probabilities rather than 0/1 labels
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot the confusion matrix (the title matches the strict > comparison used above)
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title="Threshold > %s"%i)

Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.9931972789115646
Recall metric in the testing dataset:  0.9251700680272109
Recall metric in the testing dataset:  0.8707482993197279
Recall metric in the testing dataset:  0.8367346938775511
Recall metric in the testing dataset:  0.7414965986394558
Recall metric in the testing dataset:  0.5918367346938775

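Explanation: with C fixed at best_c, recall falls as the threshold rises, while the number of misclassified normal transactions also falls. A threshold between 0.5 and 0.6 gives a reasonable balance.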
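The trade-off can be reproduced on a toy example. The labels and probabilities below are hypothetical, and recall_at_threshold is a made-up helper; it applies the same strict > comparison to the fraud-class probability that the loop above applies to the second column of predict_proba.

```python
def recall_at_threshold(y_true, fraud_proba, threshold):
    """Flag a row as fraud when its probability exceeds the threshold.
    Returns (recall, number of false alarms)."""
    y_pred = [1 if p > threshold else 0 for p in fraud_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), fp

# Hypothetical true labels and predicted fraud probabilities
y_true      = [1,    1,    1,    1,    0,    0,    0,    0]
fraud_proba = [0.95, 0.80, 0.55, 0.30, 0.65, 0.40, 0.20, 0.05]

results = []
for threshold in (0.1, 0.5, 0.9):
    rec, false_alarms = recall_at_threshold(y_true, fraud_proba, threshold)
    results.append((threshold, rec, false_alarms))
    print(threshold, rec, false_alarms)
```

A low threshold catches every fraud at the cost of more false alarms; a high threshold removes the false alarms but misses most frauds.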

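4. Tuning and evaluating the model on the oversampled dataset

Q: What is oversampling?

A: The opposite of undersampling: for a dataset with imbalanced labels, oversampling generates new minority-class samples (here with the SMOTE strategy) until the minority class is as large as the majority class.

Q: How is oversampling done?

A: The SMOTE sample generation strategy:

1. For a minority-class sample x, compute its Euclidean distance to every other minority-class sample and take its k nearest neighbours.

2. Set a sampling rate N from the class imbalance ratio. For each minority-class sample x, randomly pick several of its k nearest neighbours; call a chosen neighbour xn.

3. For each chosen neighbour xn, take the difference d = xn - x and build a new sample from the original one as xnew = x + rand(0,1) * d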
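The three steps can be sketched for a single synthetic point. This is a simplified illustration, not the imblearn implementation: the function name, the 2-D points, k, and the seed are all assumptions, and the new point is placed by interpolating along the difference vector xn - x.

```python
import math
import random

def smote_sample(minority, k=2, seed=0):
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    x = rng.choice(minority)
    # Step 1: k nearest neighbours of x among the other minority samples
    others = [p for p in minority if p is not x]
    neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
    # Step 2: pick one neighbour at random
    xn = rng.choice(neighbours)
    # Step 3: interpolate, x_new = x + gap * (xn - x) with gap uniform in [0, 1)
    gap = rng.random()
    x_new = [xi + gap * (xni - xi) for xi, xni in zip(x, xn)]
    return x, xn, x_new

# Hypothetical 2-D minority-class points
minority = [[1.0, 1.0], [1.5, 1.2], [3.0, 3.5], [0.8, 1.4]]
x, xn, x_new = smote_sample(minority)
print(x, xn, x_new)
```

Because the new point always lies on the segment between two existing minority samples, the synthetic data stays inside the region the minority class already occupies.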

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


data = pd.read_csv(r"C:\Users\Administrator\01_machinelearning\1-2\creditcard.csv")

# Standardize the Amount column and drop the redundant columns
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(["Time","Amount"],axis=1)

# Split into training and test sets
columns = data.columns
X = data.loc[:,data.columns != "Class"]
y = data.loc[:,data.columns == "Class"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

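# [Note] Never oversample the test set; oversample only the training set. random_state makes the generated data reproducible
oversampler = SMOTE(random_state=0)
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_train_oversample,y_train_oversample = oversampler.fit_resample(X_train,y_train)
print(int((np.ravel(y_train_oversample)==1).sum()))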

# Use print_Kfold_scores to find the best C for the oversampled data
X_train_oversample = pd.DataFrame(X_train_oversample)
y_train_oversample = pd.DataFrame(y_train_oversample)
best_c = print_Kfold_scores(X_train_oversample,y_train_oversample)

-------------------------------
c parameter: 0.01
-------------------------------
Iteration 0 : recall score = 0.9100065977567627
Iteration 1 : recall score = 0.9092681430855841
Iteration 2 : recall score = 0.9098477379778727
Iteration 3 : recall score = 0.9085449326562155
Iteration 4 : recall score = 0.9075730208379224

Mean recall score 0.9090480864628715

-------------------------------
c parameter: 0.1
-------------------------------
Iteration 0 : recall score = 0.9090748937482108
Iteration 1 : recall score = 0.9086166458429232
Iteration 2 : recall score = 0.9101831920221412
Iteration 3 : recall score = 0.9110724586626742
Iteration 4 : recall score = 0.9123952767332938

Mean recall score 0.9102684934018488

-------------------------------
c parameter: 1
-------------------------------
Iteration 0 : recall score = 0.9086415956418592
Iteration 1 : recall score = 0.9094835145562961
Iteration 2 : recall score = 0.910353618601789
Iteration 3 : recall score = 0.9129352137505509
Iteration 4 : recall score = 0.9116281117249576

Mean recall score 0.9106084108550906

-------------------------------
c parameter: 10
-------------------------------
Iteration 0 : recall score = 0.9125590221084683
Iteration 1 : recall score = 0.911245454943707
Iteration 2 : recall score = 0.9095706087988918
Iteration 3 : recall score = 0.909858471848684
Iteration 4 : recall score = 0.9091008699844411

Mean recall score 0.9104668855368384

-------------------------------
c parameter: 100
-------------------------------
Iteration 0 : recall score = 0.9084761341999869
Iteration 1 : recall score = 0.9109394599590678
Iteration 2 : recall score = 0.9128188214599824
Iteration 3 : recall score = 0.90803865583132
Iteration 4 : recall score = 0.912469653498124

Mean recall score 0.9105485449896962

==================================================================================
Best model to choose from cross validation is with C parameter = 1.0
==================================================================================

# Fit on the oversampled training set, then compute and plot the confusion matrix on the full test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_oversample,y_train_oversample.values.ravel())
y_pred_oversample = lr.predict(X_test.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred_oversample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# Recall looks good, and fewer normal transactions are misclassified than with undersampling, so precision improves
Recall metric in the testing dataset:  0.9405940594059405

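5. Summary and references

Credit card fraud classification is a classic logistic regression case study. I rewrote it over the National Day holiday, updating some sklearn imports and Python idioms from earlier write-ups by others. To summarize: when the class counts differ greatly, oversampling or undersampling is the usual remedy. In this case oversampling is the better choice: with similar recall, precision is clearly higher, which is almost to be expected because oversampling trains on more data.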

https://www.itread01.com/content/1542590587.html

https://blog.csdn.net/stranger_man/article/details/79055095
