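Training neural networks in practice often runs into imbalanced data. The imbalance hurts training, and class weights in the loss can correct for it.
The reason is simple: when one class has far too few or far too many samples, a model that guesses blindly can still reach a high accuracy.
With 1% positive and 99% negative examples, predicting negative for everything gives 99% accuracy.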
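The 1% / 99% arithmetic above can be checked with a few lines of NumPy. This is a minimal sketch on made-up labels (the array size and the always-negative "model" are assumptions for illustration); it also shows that recall on the positive class is zero even though accuracy looks excellent.

```python
import numpy as np

# Hypothetical toy labels: 1% positive (1), 99% negative (0).
labels = np.zeros(1000, dtype=int)
labels[:10] = 1

# A "model" that always predicts the negative class.
predictions = np.zeros_like(labels)

accuracy = np.mean(predictions == labels)
# Recall on the positive class: fraction of true positives that were found.
positive_recall = np.mean(predictions[labels == 1] == 1)

print(accuracy)         # 0.99
print(positive_recall)  # 0.0
```

This is why the experiment below tracks the accuracy of the scarce class separately instead of relying on overall accuracy.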
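Modifying MNIST for a class-balance experiment
Train on MNIST, but preprocess the training set: pick one class and delete most of its samples.
The test set stays unchanged, but for comparison it can be evaluated class by class.
The number of training samples per class (digits 0 to 9) is:
# 5444,6179,5470,5638,5307,4987,5417,5715,5389,5454
specified_class_idx = 3  # the class whose data is cut down
delete_nums = 5000  # delete 5000 samples, leaving 638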
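As a sketch of the same "delete the first N samples of one class" step, the index list can also be built without a Python loop. The data here is random stand-in data with integer labels rather than the one-hot MNIST labels, and the sizes are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 1000 samples, 10 classes, 4 features each.
labels = rng.integers(0, 10, size=1000)
images = rng.random((1000, 4))

specified_class_idx = 3
delete_nums = 50  # hypothetical value, scaled down for the toy data

# Indices of every sample of the specified class, in dataset order.
class_indices = np.flatnonzero(labels == specified_class_idx)
# Keep only the first `delete_nums` of them as the ones to delete.
del_idx = class_indices[:delete_nums]

new_images = np.delete(images, del_idx, axis=0)
new_labels = np.delete(labels, del_idx, axis=0)

print(np.bincount(labels, minlength=10))
print(np.bincount(new_labels, minlength=10))
```

With one-hot labels, `np.argmax(labels, axis=1) == specified_class_idx` would give the same mask.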
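Observing the prediction results on each set
The lowest line is the test accuracy of the specified class, the "victim" of the imbalance.
The highest line is the test accuracy of all classes other than the victim.
The two lines in the middle are the overall training-set accuracy and the overall test-set accuracy; as expected, the training set is slightly higher.
The overall test-set accuracy combines the specified class and the other classes.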
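To make the relation between the curves concrete, here is a small sketch that computes the "spec", "other" and overall accuracy from one set of predictions using boolean masks. The labels and predictions are made up, and `split_accuracies` is a hypothetical helper, not part of the script in this post.

```python
import numpy as np

def split_accuracies(pred, true, specified_class_idx):
    """Accuracy on the specified class, on all other classes, and overall."""
    correct = pred == true
    spec_mask = true == specified_class_idx
    return {
        "all": correct.mean(),
        "spec": correct[spec_mask].mean(),
        "other": correct[~spec_mask].mean(),
    }

# Hypothetical predictions: class 3 is mostly missed, the rest are right.
true = np.array([3, 3, 3, 3, 0, 1, 2, 4, 5, 6])
pred = np.array([3, 8, 5, 8, 0, 1, 2, 4, 5, 6])

acc = split_accuracies(pred, true, specified_class_idx=3)
print(acc["spec"], acc["other"], acc["all"])  # 0.25 1.0 0.7
```

The overall value is the sample-weighted average of the other two: (4 * 0.25 + 6 * 1.0) / 10 = 0.7. When the specified class is only about a tenth of the test set, a poor "spec" accuracy barely moves the overall curve.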
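With custom weights in a 10:1 ratio between the specified class and the other classes, the spec curve rises smoothly to over 70%, using only 638 samples.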
specified_class_weight = 10  # the specified class
other_class_weight = 1  # the other classes
cross_entropy = tf.reduce_mean(
    -specified_class_weight*tf.reduce_sum((y_[:,specified_class_idx] * tf.log(y[:,specified_class_idx])))
    -other_class_weight*tf.reduce_sum((y_[:,:specified_class_idx] * tf.log(y[:,:specified_class_idx])))
    -other_class_weight*tf.reduce_sum((y_[:,specified_class_idx+1:] * tf.log(y[:,specified_class_idx+1:])))
)
I have not tuned the weight ratio any further; this is the general idea.
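The three-slice formula is equivalent to multiplying the one-hot labels by a per-class weight vector before summing. The NumPy sketch below checks that equivalence on random data; the function names and the toy batch are assumptions for illustration. Like the TensorFlow expression above, it sums over the whole batch.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights):
    """Batch-summed cross-entropy with one weight per class.

    y_true: one-hot labels, shape (batch, n_classes)
    y_pred: softmax probabilities, same shape
    class_weights: shape (n_classes,), broadcast across the batch
    """
    return -np.sum(class_weights * y_true * np.log(y_pred))

def sliced_cross_entropy(y_true, y_pred, idx, w_spec, w_other):
    """The same loss written with three slices, as in the post."""
    return (
        -w_spec * np.sum(y_true[:, idx] * np.log(y_pred[:, idx]))
        - w_other * np.sum(y_true[:, :idx] * np.log(y_pred[:, :idx]))
        - w_other * np.sum(y_true[:, idx + 1:] * np.log(y_pred[:, idx + 1:]))
    )

rng = np.random.default_rng(0)
n_classes, idx = 10, 3
# Hypothetical batch of 8 samples with random labels and random logits.
y_true = np.eye(n_classes)[rng.integers(0, n_classes, size=8)]
logits = rng.normal(size=(8, n_classes))
y_pred = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

class_weights = np.ones(n_classes)
class_weights[idx] = 10.0

a = weighted_cross_entropy(y_true, y_pred, class_weights)
b = sliced_cross_entropy(y_true, y_pred, idx, 10.0, 1.0)
print(np.isclose(a, b))  # True
```

The weight-vector form makes it easy to try a different weight for every class by editing one array.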
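Pitfalls:
Do not use ReLU carelessly: it led straight to NaN outputs and dead neurons. I use only a single layer, W*x+b followed by softmax, and do not use clip to force the log output into range.
Without an activation, do not casually initialise W and b with zeros. In my runs this made training ineffective: 9% before training and 11% after, both at the level of random guessing.
tf.clip_by_value cannot be used carelessly either. A bound on the order of 1e-10 is meant to limit the input of tf.log, not its output; if you clamp at 1e-10, many values in actual training fall below it.
With a somewhat larger network NaN may appear outright, and then clip is needed.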
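The NaN problem comes from taking the log of a probability that has underflowed to 0 or overflowed. The sketch below, in NumPy with deliberately extreme made-up logits, contrasts the naive `log(softmax(x))` with two remedies: clipping the input of the log, and computing log-softmax directly from the logits. The second remedy is a technique I am swapping in for illustration; it is not what the script in this post does, though the built-in TensorFlow loss ops take logits as input, which may explain why they did not produce NaN.

```python
import numpy as np

def log_softmax(logits):
    """Log of softmax computed from logits, without dividing by a sum that can overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # largest entry becomes 0
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Hypothetical extreme logits, chosen to make the naive formula break.
logits = np.array([[1000.0, 0.0, -1000.0]])

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    naive = np.log(probs)  # contains nan and -inf

stable = log_softmax(logits)

# If clipping is used, it goes on the input of log, not on its output.
clipped = np.log(np.clip(np.array([[1.0, 0.0, 0.0]]), 1e-10, 1.0))

print(naive)
print(stable)
print(clipped)
```

Clipping keeps the loss finite but flattens everything below the bound to the same value, which matches the observation above that it can hurt training.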
https://blog.csdn.net/huqinweI987/article/details/87884341
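The code runs as-is and depends on no other modules; change the cross-entropy formula or the weights to see the effect:
Splitting the test set could have been done with simple slicing; what I wrote is a little verbose.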
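# Using mnist: process the original data so that one class is cut to about a tenth, compare that class's accuracy with the other classes', then give that class's loss a 10x weight
# Idea: the labels carry the class information; with preprocessing, or an argmax inside the graph, each sample in a batch can carry its own class, which decides whether to multiply by the coefficient
# The catch: tf.cond presumably applies to the whole flow and cannot act on a single sample, whereas tf.where works per sample. Slicing also works, and it is what this code uses.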
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
mnist = input_data.read_data_sets(train_dir = 'mnist_data',one_hot = True)
train_images = mnist.train.images#(55000, 784)
train_labels = mnist.train.labels#(55000, 10)
n_classes = 10
specified_class_idx = 3  # the class whose data is cut down (defined outside the if, since the test split below needs it in both branches)
if 1:
    # 5444,6179,5470,5638,5307,4987,5417,5715,5389,5454
    delete_nums = 5000  # delete 5000 samples, leaving 638
    del_idx_list = []  # collect the matching indices, then delete them in one go
    for i in range(train_images.shape[0]):
        if np.argmax(train_labels[i]) == specified_class_idx:
            if delete_nums == 0:
                break
            del_idx_list.append(i)
            delete_nums -= 1
    print('del_idx_list:',del_idx_list)
    new_train_images = np.delete(train_images,del_idx_list,axis=0)
    new_train_labels = np.delete(train_labels,del_idx_list,axis=0)
    print('new_train_images.shape:',new_train_images.shape)
    print('new_train_labels.shape:',new_train_labels.shape)
    # np.random.shuffle(new_train_images)  # cannot use this shuffle: images and labels would no longer line up. Skipping the shuffle for now, it matters little (a shared random index would shuffle both in step)
    # np.random.shuffle(new_train_labels)
else:  # keep the data unchanged, for comparison
    new_train_images = train_images
    new_train_labels = train_labels
test_size = mnist.test.images.shape[0]
print('test_size:',test_size)
specified_test_images = []
specified_test_labels = []
other_test_images = []
other_test_labels = []
for i in range(test_size):
    if np.argmax(mnist.test.labels[i]) == specified_class_idx:
        specified_test_images.append(mnist.test.images[i])
        specified_test_labels.append(mnist.test.labels[i])
    else:
        other_test_images.append(mnist.test.images[i])
        other_test_labels.append(mnist.test.labels[i])
ndarray_specified_test_images = np.array(specified_test_images)
ndarray_specified_test_labels = np.array(specified_test_labels)
ndarray_other_test_images = np.array(other_test_images)
ndarray_other_test_labels = np.array(other_test_labels)
print('ndarray_specified_test_labels:',ndarray_specified_test_labels)
learning_rate = 0.0005
training_epochs = 200
batch_size = 256
display_step = 20
x = tf.placeholder(tf.float32, [None, 784]) # mnist data image of shape 28*28=784
y_ = tf.placeholder(tf.float32, [None, 10]) # 0-9 digits recognition => 10 classes
# Set model weights
# If W and b are both initialised with zeros: 9% before training, 11% after
W = tf.Variable(tf.random_normal([784, 10]))
b = tf.Variable(tf.ones([10])/10.)
tf.summary.histogram(name='W',values=W)
tf.summary.histogram(name='b',values=b)
# One layer only for now. With two layers, possibly because of relu, the hand-written cross_entropy became NaN. Oddly, the built-in tf loss still runs.
y_without_softmax = tf.matmul(x, W) + b
tf.summary.histogram(name='y_without_softmax',values=y_without_softmax)
y = tf.nn.softmax(y_without_softmax) # Softmax
tf.summary.histogram(name='y',values=y)
# Minimize error using cross entropy
specified_class_weight = 10  # the specified class
other_class_weight = 1  # the other classes
# Slicing version. NaN and infinity appeared and tf.clip_by_value did not fix them; the ReLU activation was the cause. So the network keeps a single layer with no activation, and no clip either, since clip also hurts training.
# Mind the order: mean outside, sum inside, and the prediction goes inside the log
cross_entropy = tf.reduce_mean(
    -specified_class_weight*tf.reduce_sum((y_[:,specified_class_idx] * tf.log(y[:,specified_class_idx])))
    -other_class_weight*tf.reduce_sum((y_[:,:specified_class_idx] * tf.log(y[:,:specified_class_idx])))
    -other_class_weight*tf.reduce_sum((y_[:,specified_class_idx+1:] * tf.log(y[:,specified_class_idx+1:])))
)
#tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
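# Without reduction_indices=1, reduce_sum already returns a scalar, so the outer reduce_mean has no effect and the loss is a sum over the batch.
# The form above also uses an outer reduce_mean across samples, so per-sample weights via tf.where might work there; I have not tried it, since it is more fiddly.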
# cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y_without_softmax,labels=tf.argmax(y_,axis=1)))
cost = cross_entropy
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()
# Test model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
# Calculate accuracy for 3000 examples
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
tf.summary.scalar('accuracy:',accuracy)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('mnist_unbalanced_sample_log')
writer_test = tf.summary.FileWriter('mnist_unbalanced_sample_log_test')
writer_spec = tf.summary.FileWriter('mnist_unbalanced_sample_log_specified')
writer_other = tf.summary.FileWriter('mnist_unbalanced_sample_log_other')
# Start training
with tf.Session() as sess:
    sess.run(init)
    print("Accuracy:", accuracy.eval({x: mnist.test.images[:3000], y_: mnist.test.labels[:3000]}))
    # Training cycle
    n_samples = new_train_images.shape[0]
    batch_nums = n_samples//batch_size
    print('batch_nums:',batch_nums)
    for i in range(training_epochs * batch_nums):  # total step count, which keeps the loop below simple
        # avg_cost = 0.
        start = i*batch_size % n_samples
        end = (i+1)*batch_size % n_samples
        if end < start:  # skip the batch that wraps around the end of the data
            print('continue!')
            continue
        batch_xs = new_train_images[start: end]
        batch_ys = new_train_labels[start: end]
        _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs,
                                                      y_: batch_ys})
        if (i+1) % display_step == 0:  # c is only the current batch's cost, so it is not a great indicator
            # accu,y_val,y_label = sess.run([accuracy,y,y_],feed_dict={x: mnist.train.images[:3000], y_: mnist.train.labels[:3000]})
            # print("train Accuracy:", accu)
            # print('y_val:',y_val)
            # print('y_label:',y_label)
            accu,summary = sess.run([accuracy,merged],feed_dict={x: mnist.train.images[:3000], y_: mnist.train.labels[:3000]})
            print("train Accuracy:", accu)
            writer.add_summary(summary,i)
            accu,summary = sess.run([accuracy,merged],feed_dict={x: mnist.test.images[:3000], y_: mnist.test.labels[:3000]})
            print("test Accuracy:", accu)
            writer_test.add_summary(summary, i)
            accu,summary = sess.run([accuracy,merged],feed_dict={x: ndarray_specified_test_images, y_: ndarray_specified_test_labels})
            print("spec Accuracy:", accu)
            writer_spec.add_summary(summary, i)
            accu,summary = sess.run([accuracy,merged],feed_dict={x: ndarray_other_test_images, y_: ndarray_other_test_labels})
            print("other Accuracy:", accu)
            writer_other.add_summary(summary, i)
    print("Optimization Finished!")
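The script skips shuffling because calling `np.random.shuffle` on images and labels separately would break their pairing. A shared random index, as hinted in the comment, avoids that. The sketch below is one possible way to do it, with a hypothetical `shuffled_batches` generator and toy arrays in which image i stores the value i so the pairing is easy to check. Like the training loop above, it drops the incomplete final batch.

```python
import numpy as np

def shuffled_batches(images, labels, batch_size, rng):
    """Yield (images, labels) batches for one epoch using one shared permutation."""
    perm = rng.permutation(len(images))
    # Stop before the last partial batch, so every batch has batch_size samples.
    for start in range(0, len(images) - batch_size + 1, batch_size):
        idx = perm[start:start + batch_size]
        yield images[idx], labels[idx]

rng = np.random.default_rng(0)
# Hypothetical toy data: image i holds the value i, and its label is i.
images = np.arange(10, dtype=float).reshape(10, 1)
labels = np.arange(10)

batches = list(shuffled_batches(images, labels, batch_size=4, rng=rng))
for batch_x, batch_y in batches:
    print(batch_x[:, 0], batch_y)
```

Calling the generator once per epoch gives a fresh order each time, which also spreads the few remaining samples of the specified class across batches instead of leaving them wherever they happened to sit in the file.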