Handling class imbalance with class weight and sample weight

Original

pyxiea

2020-06-21 08:02

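class weight: assigns a weight to each class in the training set. The more samples a class has, the lower its weight; the fewer samples it has, the higher its weight.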

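sample weight: assigns a weight to each individual sample. The idea is similar to class weight: samples from classes with many samples get a lower weight, and samples from classes with few samples get a higher weight $^{[1]}$.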

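PS: most classification algorithms in sklearn support class weight and/or sample weight, typically as a `class_weight` constructor argument and a `sample_weight` argument to `fit`.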
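As a rough illustration of the two definitions above, the following dependency-free sketch computes inverse-frequency class weights and then derives per-sample weights from them. The formula `n_samples / (n_classes * class_count)` is the heuristic that scikit-learn documents for `class_weight="balanced"`; the helper names `balanced_class_weights` and `sample_weights_from_classes` and the 9:1 label list are made up for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def sample_weights_from_classes(labels, class_weights):
    """Give every sample the weight of the class it belongs to."""
    return [class_weights[y] for y in labels]


# Hypothetical imbalanced data set: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10

cw = balanced_class_weights(labels)
sw = sample_weights_from_classes(labels, cw)

print(cw)        # the minority class (1) receives the larger weight
print(sum(sw))   # total weight stays close to the number of samples
```

With these weights, both classes contribute the same total weight to the loss, and the weights sum to the number of samples, so the overall scale of the loss stays roughly where it was.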

	PyTorch	TensorFlow 2 & Keras
class weight	Multi-class: torch.nn.CrossEntropyLoss(weight=…); binary / multi-label: torch.nn.BCEWithLogitsLoss(pos_weight=…)	Binary: tf.nn.weighted_cross_entropy_with_logits(pos_weight=…); binary or multi-class $^{[2]}$: model.fit(class_weight=…)
sample weight	Multi-label: torch.nn.BCEWithLogitsLoss(weight=…)	model.fit(sample_weight=…)
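To make concrete what the weight arguments in the table do to the loss, here is a minimal pure-Python sketch of a weighted cross-entropy. It is not the implementation used by either framework. The function name `weighted_cross_entropy`, the toy logits, and the choice to normalise by the sum of the weights are assumptions of this sketch. PyTorch documents that weighted-mean normalisation for `CrossEntropyLoss` with `reduction='mean'`; other APIs may normalise differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits_batch, targets, class_weight=None, sample_weight=None):
    """Weighted mean of per-sample cross-entropy losses.

    class_weight:  optional list indexed by class id.
    sample_weight: optional list with one weight per sample.
    The sum of weighted losses is divided by the sum of the weights
    (an assumption of this sketch).
    """
    weighted_losses, weights = [], []
    for i, (logits, y) in enumerate(zip(logits_batch, targets)):
        w = 1.0
        if class_weight is not None:
            w *= class_weight[y]
        if sample_weight is not None:
            w *= sample_weight[i]
        loss = -math.log(softmax(logits)[y])
        weighted_losses.append(w * loss)
        weights.append(w)
    return sum(weighted_losses) / sum(weights)


# Hypothetical toy batch: three samples, two classes.
logits_batch = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0, 1, 1]

plain = weighted_cross_entropy(logits_batch, targets)
via_class = weighted_cross_entropy(logits_batch, targets, class_weight=[1.0, 9.0])
via_sample = weighted_cross_entropy(logits_batch, targets, sample_weight=[1.0, 9.0, 9.0])

print(plain, via_class, via_sample)
```

Giving each sample the weight of its class reproduces the class-weighted loss, which is why class weight can be viewed as a special case of sample weight.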

Points to note when using class weight $^{[2]}$:

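1. Using class_weight changes the range of the loss, which may affect the stability of training. Optimizers whose step size depends on the magnitude of the gradient (such as SGD) may fail, whereas optimizers such as Adam are unaffected by the change in scale. Also, the loss of a model trained with class_weight cannot be compared directly with the loss of a model trained without it.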

Note: Using class_weights changes the range of the loss. This may affect the stability of the training depending on the optimizer. Optimizers whose step size is dependent on the magnitude of the gradient, like optimizers.SGD, may fail. The optimizer used here, optimizers.Adam, is unaffected by the scaling change. Also note that because of the weighting, the total losses are not comparable between the two models.
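The difference between the two kinds of optimizer can be sketched with a single scalar parameter. This is a simplified illustration, not a training recipe. It assumes that class weighting multiplies the gradient by one uniform factor (`scale`, a made-up value), and it looks only at the very first update step, with Adam's moment estimates starting at zero.

```python
import math


def sgd_step(grad, lr=0.1):
    """Plain SGD update: directly proportional to the gradient."""
    return -lr * grad


def adam_first_step(grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update from zero-initialised moments, with bias correction."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad ** 2) / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


grad = 0.5     # hypothetical gradient of the unweighted loss
scale = 10.0   # hypothetical factor by which class weights enlarge the loss

# Ratio of the update with the scaled gradient to the update without it.
sgd_ratio = sgd_step(scale * grad) / sgd_step(grad)
adam_ratio = adam_first_step(scale * grad) / adam_first_step(grad)

print(f"SGD step ratio:  {sgd_ratio:.3f}")
print(f"Adam step ratio: {adam_ratio:.3f}")
```

The SGD step grows by the same factor as the gradient, so a learning rate tuned for the unweighted loss may become too large. The Adam step is divided by the gradient's own magnitude and therefore stays essentially unchanged.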

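2. Choosing the class weight values takes some care. For an imbalanced binary classification problem, reference [2] computes the class weights with the following code so that the magnitude of the loss stays close to what it was before: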