Custom Loss Functions (Part 2)

Custom loss functions
The previous article introduced a custom loss function for regression.

Because a project called for it, I also tried a custom loss function for classification.
I searched online for write-ups on custom classification losses, but many of them leave out important details, so here is a summary that I hope will be helpful. If anything is wrong or missing, corrections are welcome!

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import missingno
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import random
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE,RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.metrics import roc_auc_score,roc_curve,precision_score,auc,precision_recall_curve, \
                            accuracy_score,recall_score,f1_score,confusion_matrix,classification_report

%matplotlib inline

The test data is again the Kaggle credit-card fraud dataset used in earlier posts:

data = pd.read_csv("creditcard.csv")
data.shape


droplist = ['V8', 'V13', 'V15', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28','Time']
data_new = data.drop(droplist, axis = 1)
data_new.shape

A few features are simply dropped here; the reasons for dropping them were explained in the earlier article.

data_new.Class.value_counts().loc[0]/data_new.Class.value_counts().loc[1]
577.8760162601626

The good/bad label distribution is heavily imbalanced: there are roughly 578 negative samples for every positive one.

C_TP = 0   # cost of a true positive
C_FN = 3   # cost of a false negative (a missed fraud case)
C_FP = 1   # cost of a false positive (a false alarm)
C_TN = 0   # cost of a true negative
cost_matrix = [
               [C_TP, C_FN],
               [C_FP, C_TN]
              ]
cost_matrix = pd.DataFrame(cost_matrix).reset_index()
cost_matrix.loc[0, 'index'] = 'Positive'
cost_matrix.loc[1, 'index'] = 'Negative'
cost_matrix.columns = ['Actual\\Predicted', 'Positive', 'Negative']
cost_matrix

Actual\Predicted   Positive   Negative
Positive                  0          3
Negative                  1          0
Definition of the cost matrix.
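With these values, the total cost charged to a classifier is

$$\text{cost} \;=\; C_{FN}\cdot FN + C_{FP}\cdot FP + C_{TP}\cdot TP + C_{TN}\cdot TN \;=\; 3\cdot FN + 1\cdot FP$$

so missing a fraud case is three times as expensive as raising a false alarm; the get_cost evaluation function defined further down computes exactly this sum.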

from sklearn.preprocessing import LabelBinarizer
def custom_loss(y_true,y_pred):
    eps=1e-15
    # the raw model output (margin) is first mapped to a probability
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    # clip away exact 0/1 so the divisions below cannot produce inf/nan
    y_pred = np.clip(y_pred, eps, 1-eps)
#     y_pred = y_pred[:,1]
#     loss = y_true*(C_FN*np.log(y_pred)+C_TP*np.log(1-y_pred))+(1-y_true)*(C_FP*np.log(1-y_pred)+C_TN*np.log(y_pred))
#     loss = -1*loss
#     return loss
    # first- and second-order derivatives of the (negated) loss above w.r.t. y_pred
    grad = C_FN*(-1*y_true)/y_pred + C_TP*y_true/(1-y_pred) + C_FP*(1-y_true)/(1-y_pred) + C_TN*(y_true-1)/y_pred
    hess = C_FN*y_true/y_pred**2 + C_TP*y_true/(1-y_pred)**2 + C_FP*(1-y_true)/(1-y_pred)**2 + C_TN*(1-y_true)/y_pred**2
    return grad,hess

The custom loss function (objective):

    eps=1e-15
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    y_pred = np.clip(y_pred, eps, 1-eps)

Two lines here deserve special attention:
y_pred = 1.0 / (1.0 + np.exp(-y_pred)): the model's raw output is a continuous value (essentially a regression output, the margin); the sigmoid maps it to a probability for the class.
y_pred = np.clip(y_pred, eps, 1-eps): if y_pred were exactly 0 (or pushed all the way to 1), the divisions in the gradient would produce nan; this is the detail that needs the most care when writing a custom objective.
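To make both points concrete, here is a minimal sketch (toy numbers of my own, not from the original post) that (a) shows the gradient turning into nan when a probability of exactly 0 slips through, and (b) checks the grad formula against a finite-difference derivative of the commented-out loss with respect to y_pred:

import numpy as np

C_TP, C_FN, C_FP, C_TN = 0, 3, 1, 0
eps = 1e-15

def cost_loss(y_true, p):
    # the (negated) loss that is commented out inside custom_loss, per sample
    return -(y_true*(C_FN*np.log(p) + C_TP*np.log(1-p))
             + (1-y_true)*(C_FP*np.log(1-p) + C_TN*np.log(p)))

def cost_grad(y_true, p):
    # the grad expression used in custom_loss
    return (C_FN*(-1*y_true)/p + C_TP*y_true/(1-p)
            + C_FP*(1-y_true)/(1-p) + C_TN*(y_true-1)/p)

# (a) without clipping, p = 0 for a positive sample makes the gradient nan
with np.errstate(divide='ignore', invalid='ignore'):
    print(cost_grad(np.array([1.0]), np.array([0.0])))                   # [nan]
print(cost_grad(np.array([1.0]), np.clip(np.array([0.0]), eps, 1-eps)))  # huge but finite

# (b) finite-difference check of the gradient w.r.t. p at an ordinary point
y, p, h = 1.0, 0.3, 1e-7
numeric = (cost_loss(y, p+h) - cost_loss(y, p-h)) / (2*h)
print(numeric, cost_grad(y, p))   # both should be about -10.0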

def get_cost(y_true,y_pred):
#     y_pred = y_pred[:,1]
    # raw margins -> probabilities -> hard labels at the 0.5 cutoff
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    y_pred = np.where(y_pred>0.5,1,0)
    tn, fp, fn, tp = confusion_matrix(y_true,y_pred).ravel()
    # total cost = C_TP*TP + C_FN*FN + C_FP*FP + C_TN*TN
    cost = np.sum(np.array([C_TP,C_FN,C_FP,C_TN])*np.array([tp,fn,fp,tn]))
    # (metric_name, value, is_higher_better) -- lower cost is better, hence False
    return "custom_asymmetric_eval", cost, False

Definition of the evaluation function (feval):
build the confusion matrix, multiply it element-wise by the cost vector, and sum to get the total cost.
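A quick toy check (labels invented purely for illustration) of how the pieces fit together: for binary labels, sklearn's confusion_matrix returns [[tn, fp], [fn, tp]], so .ravel() unpacks in the order tn, fp, fn, tp, and the element-wise product with [C_TP, C_FN, C_FP, C_TN] aligned to [tp, fn, fp, tn] sums to the total cost:

import numpy as np
from sklearn.metrics import confusion_matrix

C_TP, C_FN, C_FP, C_TN = 0, 3, 1, 0

y_true = np.array([1, 1, 0, 0, 0, 1])   # toy ground truth
y_hat  = np.array([1, 0, 0, 1, 0, 1])   # toy predictions: 2 TP, 1 FN, 1 FP, 2 TN

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)                                  # 2 1 1 2
cost = np.sum(np.array([C_TP, C_FN, C_FP, C_TN]) * np.array([tp, fn, fp, tn]))
print(cost)                                            # 3*1 + 1*1 = 4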

While working on this I also read the custom-objective example in the xgboost source:

# user define objective function, given prediction, return gradient and second
# order gradient this is log likelihood loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess


# user defined evaluation function, return a pair metric_name, result

# NOTE: when you do customized loss function, the default prediction value is
# margin. this may make builtin evaluation metric not function properly for
# example, we are doing logistic loss, the prediction is score before logistic
# transformation the builtin evaluation error assumes input is after logistic
# transformation Take this in mind when you use the customization, and maybe
# you need write customized evaluation function
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # return a pair metric_name, result. The metric name must not contain a
    # colon (:) or a space since preds are margin(before logistic
    # transformation, cutoff at 0)
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

This defines a custom log-likelihood (logistic) loss. According to the documentation, a custom objective normally has to supply both the first- and second-order derivatives of the objective, although the second-order derivative can sometimes be left out. XGBoost's implementation uses second-order derivative information, unlike other GBDT implementations that rely only on first-order gradients. In the function above, grad and hess are given directly in their simplified form;
the simplification mainly relies on the properties of the natural logarithm.
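For completeness, here is the simplification written out (a standard derivation, added for clarity): with label $y\in\{0,1\}$, raw margin $x$ and $p=\sigma(x)=1/(1+e^{-x})$, the log loss is

$$L = -\big[\,y\ln p + (1-y)\ln(1-p)\,\big].$$

Because $\dfrac{dp}{dx}=p(1-p)$,

$$\frac{\partial L}{\partial x}=\Big(-\frac{y}{p}+\frac{1-y}{1-p}\Big)\,p(1-p)=p-y,\qquad \frac{\partial^2 L}{\partial x^2}=\frac{\partial (p-y)}{\partial x}=p(1-p),$$

which is exactly grad = preds - labels and hess = preds * (1.0 - preds) in the snippet above.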

return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

In the evaluation function, the raw prediction is thresholded at 0.0, which is equivalent to thresholding the sigmoid output at 0.5.
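Concretely, since $\sigma(x)=1/(1+e^{-x})$ is monotonically increasing and $\sigma(0)=1/(1+e^{0})=0.5$, the condition preds > 0.0 holds exactly when $\sigma(\text{preds})>0.5$.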

X, Y = data_new.drop(['Class'],axis=1),data_new['Class']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,random_state=2019)

lgb = LGBMClassifier(objective=custom_loss,n_estimators=10000)
lgb.fit(X_train, Y_train,eval_set=[(X_train,Y_train),(X_test,Y_test)],eval_metric=get_cost,early_stopping_rounds=100,verbose=True)
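To check that early stopping and the custom metric behaved as intended, the best_iteration_ and evals_result_ attributes of the fitted model can be inspected (attribute names from the LightGBM sklearn API; availability can vary by version), for example:

print(lgb.best_iteration_)    # iteration chosen by early stopping on the custom metric
for name, metrics in lgb.evals_result_.items():
    # each eval_set entry records per-iteration values of "custom_asymmetric_eval"
    print(name, {k: v[-1] for k, v in metrics.items()})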


print(classification_report(Y_test,lgb.predict(X_test)))

Finally, let's look at the model's output:

lgb.predict_proba(X_test)


1.0 / (1.0 + np.exp(-lgb.predict_proba(X_test)))
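Because the objective is a custom callable, the values returned by predict_proba are raw margins rather than calibrated probabilities, so the sigmoid (and the same 0.5 cutoff used in get_cost) has to be applied by hand to recover hard labels. A small sketch, treating the output as a 1-D raw score per sample as in this post (shapes can differ across LightGBM versions):

raw = lgb.predict_proba(X_test)         # raw margins, because a custom objective is used
proba = 1.0 / (1.0 + np.exp(-raw))      # manual sigmoid, as in custom_loss / get_cost
labels = (proba > 0.5).astype(int)      # 0.5 cutoff, equivalent to raw > 0.0
print(labels[:10])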

