Custom Loss Functions (Part 2)

Custom Loss Functions
The article above introduced a custom loss function for regression.

For a project I also needed a custom loss function for classification. There are many write-ups on custom classification losses online, but most of them are missing details, so I have summarized the full process below in the hope that it helps. If anything is lacking, corrections are welcome!

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import missingno
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import random
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE,RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.metrics import roc_auc_score,roc_curve,precision_score,auc,precision_recall_curve, \
                            accuracy_score,recall_score,f1_score,confusion_matrix,classification_report

%matplotlib inline

The test data is again the Kaggle credit-card fraud dataset used in earlier posts:

data = pd.read_csv("creditcard.csv")
data.shape

(284807, 31)

droplist = ['V8', 'V13', 'V15', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28','Time']
data_new = data.drop(droplist, axis = 1)
data_new.shape

A few features are dropped up front; the reasons for dropping them were covered in the earlier article.

data_new.Class.value_counts().loc[0]/data_new.Class.value_counts().loc[1]
577.8760162601626

The class labels are heavily imbalanced: roughly 578 negative (non-fraud) samples for every positive (fraud) sample.

C_TP = 0
C_FN = 3
C_FP = 1
C_TN = 0
# rows = actual class, columns = predicted class
cost_matrix = [
               [C_TP,C_FN],
               [C_FP,C_TN]
              ]
cost_matrix = pd.DataFrame(cost_matrix).reset_index()
cost_matrix.loc[0, 'index'] = 'Positive'
cost_matrix.loc[1, 'index'] = 'Negative'
cost_matrix.columns = ['Actual \\ Predicted', 'Positive', 'Negative']
cost_matrix

The cost matrix (rows = actual class, columns = predicted class):

Actual \ Predicted    Positive    Negative
Positive                  0           3
Negative                  1           0

Misclassifying a fraud as normal (a false negative) costs 3, a false alarm (false positive) costs 1, and correct predictions cost nothing.

def custom_loss(y_true,y_pred):
    # y_pred arrives as the raw margin; squash it to a probability first
    eps=1e-15
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    y_pred = np.clip(y_pred, eps, 1-eps)
    # The cost-sensitive loss being minimized (kept for reference):
    # loss = -( y_true*(C_FN*np.log(y_pred) + C_TP*np.log(1-y_pred))
    #           + (1-y_true)*(C_FP*np.log(1-y_pred) + C_TN*np.log(y_pred)) )
    # First and second derivatives of that loss with respect to y_pred:
    grad = C_FN*(-1*y_true)/y_pred + C_TP*y_true/(1-y_pred) + C_FP*(1-y_true)/(1-y_pred) + C_TN*(y_true-1)/y_pred
    hess = C_FN*y_true/y_pred**2 + C_TP*y_true/(1-y_pred)**2 + C_FP*(1-y_true)/(1-y_pred)**2 + C_TN*(1-y_true)/y_pred**2
    return grad,hess

This is the custom loss function (the objective).
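Written out, the loss encoded above and the derivatives the code returns (taken with respect to the sigmoid output $p$, i.e. y_pred after the transform) are:

$$L(p) = -\big[\, y\,(C_{FN}\ln p + C_{TP}\ln(1-p)) + (1-y)\,(C_{FP}\ln(1-p) + C_{TN}\ln p) \,\big]$$

$$\frac{\partial L}{\partial p} = -\frac{C_{FN}\,y}{p} + \frac{C_{TP}\,y}{1-p} + \frac{C_{FP}\,(1-y)}{1-p} - \frac{C_{TN}\,(1-y)}{p}$$

$$\frac{\partial^2 L}{\partial p^2} = \frac{C_{FN}\,y}{p^2} + \frac{C_{TP}\,y}{(1-p)^2} + \frac{C_{FP}\,(1-y)}{(1-p)^2} + \frac{C_{TN}\,(1-y)}{p^2}$$

With $C_{FN}=3$ and $C_{FP}=1$ this is simply a weighted log loss that penalizes a missed fraud three times as heavily as a false alarm.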

    eps=1e-15
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    y_pred = np.clip(y_pred, eps, 1-eps)

This part needs special attention.
y_pred = 1.0 / (1.0 + np.exp(-y_pred)): the model's raw output is a continuous value (a margin; in effect the model outputs a regression score), and the sigmoid converts it into a probability.
y_pred = np.clip(y_pred, eps, 1-eps): if y_pred hits exactly 0 or 1, the gradient expressions divide by zero and produce nan; this is the detail that most often goes wrong when writing custom loss functions.
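One caveat worth noting: grad and hess above are derivatives with respect to the probability p, while LightGBM actually expects derivatives with respect to the raw margin z (with p = sigmoid(z)). Multiplying through by dp/dz = p(1-p) gives a chain-rule-consistent variant. The sketch below is my own reformulation, not the original post's code; for C_FN = C_FP = 1 and C_TP = C_TN = 0 it reduces to the familiar grad = p - y, hess = p(1-p):

def custom_loss_margin(y_true, y_pred):
    # y_pred is the raw margin z; p = sigmoid(z)
    p = 1.0 / (1.0 + np.exp(-y_pred))
    # dL/dz = dL/dp * p*(1-p), simplified; no divisions, so no clipping needed
    grad = -C_FN*y_true*(1-p) + C_TP*y_true*p + C_FP*(1-y_true)*p - C_TN*(1-y_true)*(1-p)
    # d2L/dz2, simplified
    hess = p*(1-p)*(C_FN*y_true + C_TP*y_true + C_FP*(1-y_true) + C_TN*(1-y_true))
    return grad, hess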

def get_cost(y_true,y_pred):
    # y_pred is the raw margin: apply the sigmoid, then threshold at 0.5
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    y_pred = np.where(y_pred>0.5,1,0)
    tn, fp, fn, tp = confusion_matrix(y_true,y_pred).ravel()
    # total cost = cost constants multiplied elementwise by the confusion-matrix counts
    cost = np.sum(np.array([C_TP,C_FN,C_FP,C_TN])*np.array([tp,fn,fp,tn]))
    # returns (name, value, is_higher_better); lower cost is better, hence False
    return "custom_asymmetric_eval", cost, False

Defining the evaluation function (feval):
build the confusion matrix, multiply its counts elementwise by the cost constants, and sum to get the total cost.
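A quick sanity check of the arithmetic, using made-up confusion-matrix counts (the numbers are purely illustrative):

import numpy as np
C_TP, C_FN, C_FP, C_TN = 0, 3, 1, 0
tn, fp, fn, tp = 99000, 20, 5, 75  # hypothetical counts
cost = np.sum(np.array([C_TP,C_FN,C_FP,C_TN])*np.array([tp,fn,fp,tn]))
print(cost)  # 3*5 + 1*20 = 35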

Along the way I also read the custom-objective demo code that ships with XGBoost:

# user-defined objective function: given the prediction, return the gradient
# and the second-order gradient; this is the log-likelihood loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess


# user-defined evaluation function: return a pair (metric_name, result)

# NOTE: when you use a customized loss function, the default prediction value
# is the margin. This may make built-in evaluation metrics not function
# properly. For example, with logistic loss the prediction is the score before
# the logistic transformation, while the built-in evaluation error assumes the
# input is after the logistic transformation. Keep this in mind when you use
# the customization, and maybe you need to write a customized evaluation
# function.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # return a pair (metric_name, result); the metric name must not contain a
    # colon (:) or a space. Since preds are margins (before the logistic
    # transformation), the cutoff is at 0.
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

This defines a custom log-likelihood (logistic) loss. A custom objective has to supply the first- and second-order derivatives of the loss with respect to the raw prediction. XGBoost exploits second-order derivative information in its implementation, unlike other GBDT implementations that use only first-order gradients, which is why the code returns both grad and hess. The expressions above are the simplified forms;
the simplification relies mainly on the properties of the ln function.
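Spelled out (a standard derivation, reproduced here for completeness): with $p = \sigma(z) = 1/(1+e^{-z})$ and label $y \in \{0,1\}$, the log loss is

$$L = -\big[\, y \ln p + (1-y)\ln(1-p) \,\big]$$

and, using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = p(1-p)$,

$$\frac{\partial L}{\partial z} = -y(1-p) + (1-y)p = p - y, \qquad \frac{\partial^2 L}{\partial z^2} = p(1-p)$$

which is exactly the grad and hess returned by logregobj.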

return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

In the evaluation function, the raw prediction is thresholded at 0.0; since sigmoid(0) = 0.5, this is equivalent to thresholding the sigmoid-transformed probability at 0.5.
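For completeness, the demo wires these two functions into xgb.train roughly as follows. This is a sketch with the classic API (newer XGBoost releases rename feval to custom_metric); the toy data and booster parameters are purely illustrative:

import xgboost as xgb
import numpy as np
# toy data, just to make the sketch self-contained
rng = np.random.RandomState(0)
X_demo, y_demo = rng.randn(200, 5), rng.randint(0, 2, 200)
dtrain = xgb.DMatrix(X_demo[:150], label=y_demo[:150])
dtest = xgb.DMatrix(X_demo[150:], label=y_demo[150:])
param = {'max_depth': 2, 'eta': 1}  # no 'objective': it is supplied via obj=
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
bst = xgb.train(param, dtrain, num_boost_round=10, evals=watchlist,
                obj=logregobj, feval=evalerror)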

X, Y = data_new.drop(['Class'],axis=1),data_new['Class']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,random_state=2019)

lgb = LGBMClassifier(objective=custom_loss,n_estimators=10000)
lgb.fit(X_train, Y_train,eval_set=[(X_train,Y_train),(X_test,Y_test)],eval_metric=get_cost,early_stopping_rounds=100,verbose=True)
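A version note (depending on the LightGBM release installed): in recent versions, early_stopping_rounds and verbose were removed from fit() in favor of callbacks, so an equivalent call would be:

import lightgbm
lgb.fit(X_train, Y_train,
        eval_set=[(X_train,Y_train),(X_test,Y_test)],
        eval_metric=get_cost,
        callbacks=[lightgbm.early_stopping(100),  # stop after 100 rounds without improvement
                   lightgbm.log_evaluation(1)])   # log eval results every iteration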


print(classification_report(Y_test,lgb.predict(X_test)))

Finally, let's look at the model's raw output:

lgb.predict_proba(X_test)


1.0 / (1.0 + np.exp(-lgb.predict_proba(X_test)))
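Because the model was trained with a custom objective, LightGBM does not know the link function, so predict_proba returns raw margins rather than probabilities; that is why the sigmoid has to be applied by hand here. A one-liner for just the positive class (assuming, as in current LightGBM builds, that the positive-class score sits in the second column):

p_fraud = 1.0 / (1.0 + np.exp(-lgb.predict_proba(X_test)[:, 1]))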

