Solving class imbalance with the Python imblearn toolbox (Part 4): combined sampling, ensemble sampling, and other details

Original link: https://blog.csdn.net/mathlxj/article/details/89677701

1. Combination of over- and under-sampling

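SMOTE can generate noisy samples. Combined sampling addresses this by cleaning the space that results from over-sampling.
The idea is to over-sample with SMOTE first, then apply Tomek's links or edited nearest-neighbours to obtain a
cleaner space. The corresponding classes are SMOTETomek and SMOTEENN.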

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)

from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)

2. Ensemble of samplers

2.1 Bagging classifier

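Bagging: draw samples with replacement to produce different subsets of the data, then build a classifier on each subset (the classifier type must be specified).
scikit-learn provides the BaggingClassifier class, but for imbalanced data it does not guarantee that each subset is balanced, so the predictions are biased toward the majority class.
In imblearn, the BalancedBaggingClassifier class resamples each subset before training each classifier. Its parameters are the same as those of sklearn's BaggingClassifier, plus two extra parameters, sampling_strategy and replacement, which control how the random under-sampling is performed.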

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
from imblearn.ensemble import BalancedBaggingClassifier
# Note: imbalanced-learn releases before 0.10 name the first argument base_estimator.
bbc = BalancedBaggingClassifier(estimator=DecisionTreeClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
balanced_accuracy_score(y_test, y_pred)  # compute the balanced accuracy

2.2 Forest of randomized trees (random forest)

Each tree is built on a balanced bootstrap subset of the data.

from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100,random_state=0)
brf.fit(X_train, y_train)
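
To make "balanced bootstrap" concrete, here is a minimal NumPy sketch that draws the same number of samples from every class. The helper balanced_bootstrap_indices is hypothetical and is not how BalancedRandomForestClassifier is implemented internally. It only illustrates the idea, under the assumption that every class is reduced to the size of the smallest one.

```python
import numpy as np


def balanced_bootstrap_indices(y, rng):
    """Hypothetical helper: draw, with replacement, equally many indices per class."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the smallest class
    picks = [rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
             for c in classes]
    return np.concatenate(picks)


rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # 90 majority samples, 10 minority samples
idx = balanced_bootstrap_indices(y, rng)
print(np.bincount(y[idx]))          # -> [10 10]
```

Each tree in a balanced forest would then be fit on X[idx], y[idx] for its own draw of idx.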

2.3 Boosting

Train n weak classifiers on subsets of the dataset, then combine them with a weighted vote to produce the final classifier.

2.3.1 RUSBoostClassifier

Randomly under-samples the dataset before each boosting iteration.

from imblearn.ensemble import RUSBoostClassifier
rusboost = RUSBoostClassifier(random_state=0)
rusboost.fit(X_train, y_train)

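2.3.2 EasyEnsembleClassifier (based on AdaBoost)

EasyEnsembleClassifier bags AdaBoost learners, each trained on a balanced bootstrap sample. AdaBoost computes the error rate of each weak classifier, then assigns larger weights to misclassified samples and smaller weights to correctly classified ones. Any weak classifier whose accuracy exceeds 0.5 can join the final classifier, and the more accurate a weak classifier is, the larger its weight.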

from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(random_state=0)
eec.fit(X_train, y_train)
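
The weighting scheme described above can be traced by hand for one boosting round. The sketch below uses the textbook discrete AdaBoost update for binary labels in {-1, +1}. It is an assumption for illustration and not necessarily the exact variant that scikit-learn or imblearn runs internally.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])        # labels in {-1, +1}
y_pred = np.array([1, -1, -1, -1, 1])       # weak learner's predictions (one mistake)
w = np.full(len(y_true), 1 / len(y_true))   # start from uniform sample weights

miss = y_pred != y_true
err = np.sum(w[miss])                        # weighted error rate: 0.2
alpha = 0.5 * np.log((1 - err) / err)        # learner weight, positive only if err < 0.5

# Misclassified samples are multiplied by exp(+alpha), correct ones by exp(-alpha).
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()                         # renormalise so the weights sum to 1

print("error = {:.2f}, alpha = {:.3f}".format(err, alpha))
print("new weights:", np.round(w_new, 3))
```

With one mistake out of five equally weighted samples, the misclassified sample ends up carrying half of the total weight (0.5) while each correct sample carries 0.125, so the next weak learner concentrates on the sample that was missed.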

3. Miscellaneous samplers

3.1 Custom sampler: FunctionSampler

from imblearn import FunctionSampler
def func(X, y):
    return X[:10], y[:10]
sampler = FunctionSampler(func=func)
X_res, y_res = sampler.fit_resample(X, y)

3.2 Custom generators (generating balanced mini-batches for TensorFlow and Keras)

3.2.1 Tensorflow generator: imblearn.tensorflow.balanced_batch_generator

import numpy as np
X = X.astype(np.float32)
from imblearn.under_sampling import RandomUnderSampler
from imblearn.tensorflow import balanced_batch_generator
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y, sample_weight=None, sampler=RandomUnderSampler(),
    batch_size=10, random_state=42)

# How training_generator and steps_per_epoch are used:
learning_rate, epochs = 0.01, 10
input_size, output_size = X.shape[1], 3
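# The code below uses the TensorFlow 1.x graph API; under TensorFlow 2.x it can be run through the compat.v1 module.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()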
def init_weights(shape):
     return tf.Variable(tf.random_normal(shape, stddev=0.01))
def accuracy(y_true, y_pred):
     return np.mean(np.argmax(y_pred, axis=1) == y_true)
# input and output
data = tf.placeholder("float32", shape=[None, input_size])
targets = tf.placeholder("int32", shape=[None])
# build the model and weights
W = init_weights([input_size, output_size])
b = init_weights([output_size])
out_act = tf.nn.sigmoid(tf.matmul(data, W) + b)
# build the loss, predict, and train operator
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
     logits=out_act, labels=targets)
loss = tf.reduce_sum(cross_entropy)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
predict = tf.nn.softmax(out_act)
# Initialization of all variables in the graph
init = tf.global_variables_initializer()
with tf.Session() as sess:
     print('Starting training')
     sess.run(init)
     for e in range(epochs):
         for i in range(steps_per_epoch):  # key part: one epoch is steps_per_epoch batches
             X_batch, y_batch = next(training_generator)  # key part: fetch the next balanced batch
             sess.run([train_op, loss], feed_dict={data: X_batch, targets: y_batch})
         # For each epoch, run accuracy on train and test
         feed_dict = dict()
         predicts_train = sess.run(predict, feed_dict={data: X})
         print("epoch: {} train accuracy: {:.3f}"
               .format(e, accuracy(y, predicts_train)))
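
To show what the pair (training_generator, steps_per_epoch) represents without needing TensorFlow, here is a NumPy-only sketch of a balanced batch generator. The function balanced_batches is hypothetical and is not imblearn's implementation. It assumes random under-sampling to the minority class size and rounds the number of steps up.

```python
import numpy as np


def balanced_batches(X, y, batch_size, rng):
    """Hypothetical sketch: under-sample to the minority size, then yield batches forever."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Keep n_min randomly chosen samples of every class (no replacement).
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
                          for c in classes])
    steps_per_epoch = int(np.ceil(len(idx) / batch_size))

    def generator():
        while True:
            rng.shuffle(idx)  # new sample order at the start of every epoch
            for start in range(0, len(idx), batch_size):
                batch = idx[start:start + batch_size]
                yield X[batch], y[batch]

    return generator(), steps_per_epoch


rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2)).astype(np.float32)
y = np.array([0] * 90 + [1] * 10)
training_generator, steps_per_epoch = balanced_batches(X, y, batch_size=10, rng=rng)

# Consume exactly one epoch and count the labels that were seen.
epoch_labels = np.concatenate(
    [next(training_generator)[1] for _ in range(steps_per_epoch)])
print(steps_per_epoch, np.bincount(epoch_labels))  # -> 2 [10 10]
```

The generator never terminates on its own, which is why the training loop needs steps_per_epoch to know when one pass over the balanced data is complete.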

3.2.2 Keras generator

## Define a logistic regression model
import keras
y = keras.utils.to_categorical(y, 3)
model = keras.Sequential()
model.add(keras.layers.Dense(y.shape[1], input_dim=X.shape[1],
                              activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy',
               metrics=['accuracy'])
## imblearn.keras.balanced_batch_generator produces balanced mini-batches
from imblearn.under_sampling import RandomUnderSampler
from imblearn.keras import balanced_batch_generator
training_generator, steps_per_epoch = balanced_batch_generator(
     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)

## Alternatively, use imblearn.keras.BalancedBatchGenerator
from imblearn.keras import BalancedBatchGenerator
training_generator = BalancedBatchGenerator(
     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)
# Recent Keras versions accept the generator in fit(); older versions use fit_generator().
callback_history = model.fit(training_generator,
                             epochs=10, verbose=0)


4. Metrics

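At the time of writing, the only metric sklearn offers specifically for imbalanced data is sklearn.metrics.balanced_accuracy_score.
imblearn.metrics provides two further groups of metrics for evaluating classifier quality.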

4.1 Sensitivity and specificity metrics

  • Sensitivity: the true positive rate, i.e. recall.
  • Specificity: the true negative rate.

Accordingly, three metrics are added:

  • sensitivity_specificity_support: returns the sensitivity, specificity, and support
  • sensitivity_score
  • specificity_score
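
As a worked example of the two definitions, the following NumPy sketch counts the four confusion-matrix cells for a small binary problem and derives both rates from them. The labels are made up for illustration, and class 1 is assumed to be the positive class.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives:  2
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives:  4
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 1

sensitivity = tp / (tp + fn)  # true positive rate (recall): 2/3
specificity = tn / (tn + fp)  # true negative rate: 4/5

print("sensitivity = {:.3f}, specificity = {:.3f}".format(sensitivity, specificity))
```

Sensitivity only looks at the positive samples and specificity only at the negative ones, so neither is inflated by a large majority class the way plain accuracy is.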

4.2 Additional metrics specific to imbalanced datasets

Metrics added specifically for imbalanced data:

  • geometric_mean_score: computes the geometric mean (G-mean, the root of the product of the per-class sensitivities), described as follows:

The geometric mean (G-mean) is the root of the product of class-wise sensitivity. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. For binary classification G-mean is the square root of the product of the sensitivity and specificity. For multi-class problems it is a higher root of the product of sensitivity for each class.

  • make_index_balanced_accuracy: balances any scoring function using the index balanced accuracy

 
