TF-day6: Simple CNN Classification

Main contents:

  • What a CNN is
  • Code

I. What is a CNN?
An illustrated introduction to CNNs:
http://www.jianshu.com/p/6daa1af1cf37

For a deeper understanding:
http://study.163.com/course/courseMain.htm?courseId=1003223001

II. Code and explanation

import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

1. Loading the data

## Load the data
def get_data(argv=None):
    df = pd.read_excel('/home/xp/下载/jxdc/项目评分表last.xls')
    df_feature = df.iloc[:, 2:].fillna(df.mean())   # fill NaNs with column means
    # print(df_feature.shape)  ## (449, 70)
    df_label = df.iloc[:, 1]

    ## Balance the classes
    smote = SMOTE('auto')   # 'auto': oversample every minority class up to the majority class size
    x_sample, y_sample = smote.fit_sample(df_feature, df_label)   # fit_resample in newer imblearn
    # print(x_sample.shape)    ## (690, 70)
    # print(y_sample.shape)    ## (690,)

    ## Convert the labels to one-hot vectors
    X = x_sample  # balanced inputs
    # X = df_feature  # unbalanced inputs
    Y = []
    for i in y_sample:
    # for i in df_label:
        if i == 'A':
            Y.append([1, 0, 0, 0])
        elif i == 'B':
            Y.append([0, 1, 0, 0])
        elif i == 'C':
            Y.append([0, 0, 1, 0])
        else:
            Y.append([0, 0, 0, 1])
    # train(X, Y)
    return X, Y
  1. fillna(df.mean()): the data contains missing values; they are filled with the mean of each column (df.mean() returns the per-column means).
  2. Class balancing: class A has few samples while class D has many. With 'auto', SMOTE oversamples the minority classes by generating synthetic samples until each class has as many samples as the largest class.
    https://pypi.python.org/pypi/imbalanced-learn
    For the details of how SMOTE works, see:
    http://blog.csdn.net/Yaphat/article/details/52463304?locationNum=7
    http://blog.csdn.net/yaphat/article/details/60347968
    Core idea: analyze the minority-class samples and synthesize new artificial samples from them to add to the dataset.
  3. One-hot vectors (see the sketch below):
    Why convert the labels to one-hot vectors? Training a neural network means driving the loss function down, and the cross-entropy tf.nn.softmax_cross_entropy_with_logits takes two probability distributions as input; a one-hot vector can be viewed as a probability distribution.
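As an aside, the label-to-one-hot loop above can also be written as an index lookup into an identity matrix. A minimal sketch (toy labels assumed; the class order matches the if/elif chain above):

import numpy as np

classes = ['A', 'B', 'C', 'D']              # same order as the loop above
labels = ['A', 'D', 'B', 'A']               # toy example labels
idx = [classes.index(c) for c in labels]    # 'A' -> 0, 'B' -> 1, ...
Y = np.eye(len(classes))[idx]               # each row is a one-hot vector
print(Y[0])                                 # [1. 0. 0. 0.] -- a probability distribution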

2. The forward pass

## Forward pass
INPUT_NODE = 70
OUTPUT_NODE = 4

IMAGE_LONGTH = 70
IMAGE_WIDTH = 1
NUM_CHANNELS = 1
NUM_LABELS = 4

## Size and depth of the first convolutional layer
CONV1_DEEP = 16
CONV1_SIZE_L = 4
CONV1_SIZE_W = 1

## Size and depth of the second convolutional layer
CONV2_DEEP = 32
CONV2_SIZE_L = 2
CONV2_SIZE_W = 1

## Number of nodes in the fully-connected layer
FC_SIZE = 128

def inference(input_tensor, train, regularizer):
    ### Separate variable scopes isolate each layer's variables, so duplicate names are not a concern.
    ## Layer 1: convolutional layer
    with tf.variable_scope("layer1-conv1"):
        conv1_weights = tf.get_variable("weight", [CONV1_SIZE_L, CONV1_SIZE_W, NUM_CHANNELS, CONV1_DEEP], initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv1_biases = tf.get_variable("bias", [CONV1_DEEP], initializer=tf.constant_initializer(0.1))
    conv1 = tf.nn.conv2d(input_tensor, conv1_weights, strides=[1, 2, 1, 1], padding='SAME')
    relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv1_biases))

The convolutional layer computes a weighted sum over the corresponding nodes.
1. tf.variable_scope() is used for variable management: a network has so many variables that scopes save you from worrying about name collisions.
2. The filter: conv1_weights, Tensor("layer1-conv1/weight:0", shape=(4, 1, 1, 16), dtype=float32_ref)
name: 'layer1-conv1/weight:0'
The shape has four dimensions: the first two are the filter size, the third is the depth of the current layer's input, and the fourth is the number of filters (the output depth).
3. strides: the step size along each dimension. The first and last entries must be 1, because the convolution stride applies only to the length and width dimensions.
4. padding: 'SAME' adds zero padding; 'VALID' adds none.
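With 'SAME' padding the output length depends only on the input length and the stride: out = ceil(in / stride). A quick standalone check of conv1's output shape (a sketch with an assumed all-zero toy input):

import numpy as np
import tensorflow as tf

x = tf.constant(np.zeros((1, 70, 1, 1), dtype=np.float32))   # one sample of length 70
w = tf.zeros([4, 1, 1, 16])                                  # 4x1 filter, 1 -> 16 channels
conv = tf.nn.conv2d(x, w, strides=[1, 2, 1, 1], padding='SAME')
print(conv.shape)   # (1, 35, 1, 16): ceil(70 / 2) = 35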


    ## Layer 2: pooling layer
    with tf.name_scope("layer2-pool1"):
        pool1 = tf.nn.max_pool(relu1, ksize=[1, 3, 1, 1], strides=[1, 2, 1, 1], padding="SAME")
        # pool1 = tf.nn.avg_pool(relu1, ksize=[1, 3, 1, 1], strides=[1, 2, 1, 1], padding="SAME")

Pooling layer: shrinks the matrix, which reduces the number of parameters in the final fully-connected layers; this both speeds up computation and helps prevent overfitting.
1. Max pooling (max_pool) and average pooling (avg_pool).
2. ksize: the first and last entries must be 1, meaning the pooling filter may not span across samples or across the node depth.
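The same ceil(in / stride) rule holds for pooling with 'SAME' padding, regardless of ksize, and the channel dimension passes through unchanged. A standalone check on conv1's output shape (toy input assumed):

import numpy as np
import tensorflow as tf

r = tf.constant(np.zeros((1, 35, 1, 16), dtype=np.float32))  # shape after conv1/relu1
pool = tf.nn.max_pool(r, ksize=[1, 3, 1, 1], strides=[1, 2, 1, 1], padding="SAME")
print(pool.shape)   # (1, 18, 1, 16): ceil(35 / 2) = 18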


    ## Layer 3: convolutional layer
    with tf.variable_scope("layer3-conv2"):
        conv2_weights = tf.get_variable("weight", [CONV2_SIZE_L, CONV2_SIZE_W, CONV1_DEEP, CONV2_DEEP], initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv2_biases = tf.get_variable("bias", [CONV2_DEEP], initializer=tf.constant_initializer(0.1))
    conv2 = tf.nn.conv2d(pool1, conv2_weights, strides=[1, 2, 1, 1], padding='SAME')
    relu2 = tf.nn.relu(tf.nn.bias_add(conv2, conv2_biases))

    ## Layer 4: pooling layer
    with tf.name_scope('layer4-pool2'):
        pool2 = tf.nn.max_pool(relu2, ksize=[1, 2, 1, 1], strides=[1, 3, 1, 1], padding='SAME')

    ## Layer 5: fully-connected layer
    pool_shape = pool2.get_shape().as_list()
    ## Length of the vector obtained by flattening the matrix; pool_shape[0] is the number of samples in a batch
    nodes = pool_shape[1] * pool_shape[2] * pool_shape[3]    # nodes: 96
    reshaped = tf.reshape(pool2, [-1, nodes])   ### -1: the batch size need not be known in advance

    with tf.variable_scope('layers-fc1'):
        fc1_weights = tf.get_variable("weight", [nodes, FC_SIZE], initializer=tf.truncated_normal_initializer(stddev=0.1))
        if regularizer is not None:
            tf.add_to_collection("losses", regularizer(fc1_weights))
        fc1_biases = tf.get_variable("bias", [FC_SIZE], initializer=tf.constant_initializer(0.1))

        fc1 = tf.nn.relu(tf.matmul(reshaped, fc1_weights) + fc1_biases)
        if train:
            fc1 = tf.nn.dropout(fc1, 0.5)   # keep probability 0.5, applied only during training

The fully-connected layer:
1. First flatten the fourth layer's pooled output pool2 into one vector per sample: reshaped = tf.reshape(pool2, [-1, nodes]). Since the number of samples varies, -1 stands in for the batch dimension.
2. The L2 regularization term on the weights is added to the 'losses' collection.
3. dropout (applied only during training):
http://blog.csdn.net/stdcoutzyx/article/details/49022443
https://yq.aliyun.com/articles/68901
http://www.cnblogs.com/tornadomeet/p/3258122.html
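Tracing the length through the four layers shows where nodes = 96 comes from; every layer uses 'SAME' padding, so each output length is ceil(in / stride) (a sketch using the constants defined above):

import math

length = 70                      # IMAGE_LONGTH
length = math.ceil(length / 2)   # conv1, stride 2 -> 35
length = math.ceil(length / 2)   # pool1, stride 2 -> 18
length = math.ceil(length / 2)   # conv2, stride 2 -> 9
length = math.ceil(length / 3)   # pool2, stride 3 -> 3
nodes = length * 1 * 32          # length * width * CONV2_DEEP
print(nodes)                     # 96, matching the "nodes: 96" comment above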


    ## Layer 6: softmax layer
    with tf.variable_scope("layer6-fc2"):
        fc2_weights = tf.get_variable('weight', [FC_SIZE, NUM_LABELS], initializer=tf.truncated_normal_initializer(stddev=0.1))
        if regularizer is not None:
            tf.add_to_collection('losses', regularizer(fc2_weights))
        fc2_biases = tf.get_variable('bias', [NUM_LABELS], initializer=tf.constant_initializer(0.1))
        logit = tf.matmul(fc1, fc2_weights) + fc2_biases
    return logit

The softmax layer:
1. This layer is the same as the fully-connected layer, except that it omits dropout and applies no activation: the raw logits are returned.
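Why no softmax here? tf.nn.softmax_cross_entropy_with_logits, used in the training step below, applies the softmax internally, so inference must return the unscaled logits. A minimal standalone check with assumed toy values:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1, -1.0]])   # raw scores for the 4 classes
labels = tf.constant([[1.0, 0.0, 0.0, 0.0]])    # one-hot ground truth
fused = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
manual = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits)), axis=1)
with tf.Session() as sess:
    print(sess.run([fused, manual]))   # the two agree (both ≈ 0.449 here)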

3. Training the network

BATCH_SIZE = 100
LEARNING_RATE_BASE = 0.8
LEARNING_RATE_DECAY = 0.96
REGULARAZTION_RATE = 0.001
TRAINING_STEPS = 4000
MOVING_AVERAGE_DECAY = 0.99

## Path and file name for saving the model
MODEL_SAVE_PATH = "/home/panxie/PycharmProjects/ML/jxdc/code/cnnclassify/model.ckpt"
# MODEL_NAME = "model.ckpt"

## Train the neural network
def train(X, Y):
    x = tf.placeholder(tf.float32, [None, IMAGE_LONGTH, IMAGE_WIDTH, NUM_CHANNELS], name='x_input')
    y_ = tf.placeholder(tf.float32, [None, OUTPUT_NODE], name='y-input')

    regularizer = tf.contrib.layers.l2_regularizer(REGULARAZTION_RATE)

    y = inference(x, True, regularizer)   # the second argument is the train flag; True enables dropout
    ## Multiply y by 1 so the output tensor can be given the name 'y'
    b = tf.constant(value=1, dtype=tf.float32)
    y = tf.multiply(y, b, name='y')

    ## Define the training-step counter, marked non-trainable
    global_step = tf.Variable(0, trainable=False)
    variable_average = tf.train.ExponentialMovingAverage(MOVING_AVERAGE_DECAY, global_step)
    variable_average_op = variable_average.apply(tf.trainable_variables())

    ## Cross-entropy plus the regularization terms
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_))
    loss = cross_entropy + tf.add_n(tf.get_collection('losses'))

    ## Learning rate; the commented lines show exponential decay schedules
    learning_rate = 0.01
    # learning_rate = tf.train.exponential_decay(learning_rate, global_step=global_step, decay_steps=100, decay_rate=0.9, staircase=True)
    # learning_rate = tf.train.exponential_decay(learning_rate, global_step=global_step, decay_steps=1, decay_rate=0.96, staircase=False)
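    # If exponential decay were enabled, the schedule would follow (per the TF docs):
    #   decayed_lr = learning_rate * decay_rate ** (global_step / decay_steps)
    # e.g. decay_steps=100, decay_rate=0.9, staircase=True multiplies the rate
    # by 0.9 once every 100 steps (staircase=False decays smoothly every step).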

    ### Optimizer
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

    ### Each pass over the data must both update the network parameters and refresh every parameter's moving average.
    with tf.control_dependencies([train_step, variable_average_op]):
        train_op = tf.no_op(name='train')

    ## Accuracy
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))  ### argmax returns indices
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    ## Set up the TensorFlow saver for model persistence
    saver = tf.train.Saver()
    # saver.export_meta_graph("/home/pan-xie/PycharmProjects/ML/jxdc/code/cnnclassify/model.deda.json",as_text=True)
    with tf.Session() as sess:
        tf.global_variables_initializer().run()

        for i in range(TRAINING_STEPS):

            # random_state=0 makes this split deterministic, so it could equally be hoisted out of the loop
            trainX, validationX, trainY, validationY = train_test_split(X, Y, test_size=0.25, random_state=0)

            data_size = len(trainX)
            start = (i * BATCH_SIZE) % data_size
            end = min(start + BATCH_SIZE, data_size)

            ### Training batch
            xs_train_reshape = np.reshape(trainX[start:end], (-1, IMAGE_LONGTH, IMAGE_WIDTH, NUM_CHANNELS))
            train_feed = {x: xs_train_reshape, y_: trainY[start:end]}

            ## Validation set
            xs_valid_reshape = np.reshape(validationX, (-1, IMAGE_LONGTH, IMAGE_WIDTH, NUM_CHANNELS))
            validation_feed = {x: xs_valid_reshape, y_: validationY}

            _, loss_value, step, accuracy_train = sess.run([train_op, loss, global_step, accuracy], feed_dict=train_feed)

            loss_valid, acc_valid = sess.run([loss, accuracy], feed_dict=validation_feed)

            # if i % 500 == 0:
            #     print("After %d training steps, loss and accuracy on training are %g and %g; loss and accuracy on validation are %g and %g" % (step, loss_value, accuracy_train, loss_valid, acc_valid))
                # saver.save(sess,os.path.join(MODEL_SAVE_PATH,MODEL_NAME),global_step=global_step)

        # Same deterministic split as above (random_state=0), so this "test" set coincides with the validation set
        trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.25, random_state=0)
        xs_test_reshape = np.reshape(testX, (-1, IMAGE_LONGTH, IMAGE_WIDTH, NUM_CHANNELS))
        test_feed = {x: xs_test_reshape, y_: testY}

        # Evaluate classification metrics on the test set
        target_names = ['A', 'B', 'C', 'D']
        rating_test_ = sess.run(tf.argmax(y_, 1), feed_dict=test_feed)
        test_preds1 = sess.run(tf.argmax(y, 1), feed_dict=test_feed)
        print(classification_report(rating_test_, test_preds1, target_names=target_names))

        saver.save(sess, MODEL_SAVE_PATH)
        # saver.save(sess, os.path.join(MODEL_SAVE_PATH, MODEL_NAME), global_step=global_step)
        # (no explicit sess.close() needed: the with-block closes the session)

Training the network:
1. from sklearn.model_selection import train_test_split randomly splits the data into two sets (the module was named sklearn.cross_validation in older scikit-learn releases).
For more on cross-validation, see: http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
2. from sklearn.metrics import classification_report
For model evaluation and quantifying prediction quality, see: http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report
https://blog.argcv.com/articles/1036.c

         precision    recall  f1-score   support

      A       0.97      0.97      0.97        70
      B       0.73      0.62      0.67        13
      C       0.62      0.45      0.53        22
      D       0.86      0.96      0.90        68

avg / total       0.86      0.87      0.86       173

Taking class A as an example:
precision: among the test samples retrieved (predicted) as A, the fraction that are correctly classified.
recall: among all true A samples in the test set, the fraction that were retrieved as A.
The F1-score is the harmonic mean of precision and recall, i.e. 2/F1 = 1/P + 1/R.
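Plugging the class-A numbers from the report into these formulas as a quick check:

p, r = 0.97, 0.97          # precision and recall for class A, from the report above
f1 = 2 * p * r / (p + r)   # harmonic mean: 2/F1 = 1/P + 1/R
print(round(f1, 2))        # 0.97, matching the f1-score column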

if __name__ == '__main__':
    X, Y = get_data()
    train(X, Y)
    # tf.app.run()

github: http://www.cnblogs.com/schaepher/p/5561193.html
Full code: https://github.com/PanXiebit/CNN_Classify
