[Deep Learning with TF2][RNN-LSTM] Text Sentiment Analysis (Data Preprocessing - Training - Prediction)

1. Downloading the Data

Dataset: http://ai.stanford.edu/~amaas/data/sentiment/
After downloading and extracting it, you will see two folders, test and train:
Inside train, the positive and negative samples are already separated:
neg and pos hold the negative and positive samples respectively; unsup contains unlabeled samples that can be used later if needed. Have a look at the rest yourself.

Open the pos folder and take a look inside:

Each file is a single review text.

Note that these reviews are generally not short…
The dataset contains 50,000 reviews in total, split evenly between the test and train sets; within each set, pos and neg are also split evenly.
This article only uses the 25,000 reviews in the train set.

2. About the Training Data

Sentiment analysis is one of the easiest NLP tasks to start with: it is simply a text classification problem, judging the sentiment polarity of a piece of text. The simplest version is binary classification, positive vs. negative; a harder variant is three-class classification, which adds a neutral class; more complex still is sentiment scoring, for example rating a movie from 1 to 5, which is a five-class problem. In essence they are all the same task; more classes just make it harder to learn.
IMDB is a well-known movie review site, similar to Douban in China. Its review data is a popular sentiment analysis dataset for practice, for competitions such as those on Kaggle, and for academic research.

In fact, tensorflow.keras ships with a nicely preprocessed version of the IMDB dataset that can be downloaded with one line of code and trained on without any further processing, with fairly good results. But that would be no fun. In real-world scenarios the data we get is messy, and we have to learn to read, clean, and filter it ourselves and split it into training and test sets. In my own experience, data preprocessing is the real skill: models are easy to build now that the frameworks make it so convenient, but preprocessing has to be done by hand. In practice it usually consumes the most time and effort, and it directly affects the final results.

Also, keep in mind that to analyze text we first have to turn it into numbers, because computers do not understand characters, only numbers; the fully processed text must therefore be in numeric form. The dataset bundled with tensorflow.keras is already fully numericalized, but it does not provide the lookup dictionary that maps each number back to a word, so we can only train the model and look at the metrics; we cannot apply it to other corpora or analyze it in depth. For these reasons, the dataset recommended above is the raw one: real text, already sorted into classes by the Stanford folks for convenience, but the numericalization is up to us.
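
As a quick illustration of what numericalization looks like, here is a minimal sketch using keras.preprocessing.text.Tokenizer, the same tool used in the preprocessing code of section 4; the two example sentences are made up.

import tensorflow.keras as keras

# Toy corpus; the sentences are made-up examples.
texts = ["this movie was great", "this movie was terrible"]

t = keras.preprocessing.text.Tokenizer()
t.fit_on_texts(texts)               # build the word -> index dictionary
print(t.word_index)                 # e.g. {'this': 1, 'movie': 2, 'was': 3, ...}

seqs = t.texts_to_sequences(texts)  # each sentence becomes a list of indices
print(seqs)

# Pad to a fixed length so all sequences can be stacked into one array.
padded = keras.preprocessing.sequence.pad_sequences(seqs, maxlen=6, padding='post')
print(padded)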

3. About the Word Vectors (Word2Vec / GloVe)

Google has already trained a Word2Vec model on a massive corpus of roughly 100 billion words; it contains about 3 million word vectors, each with 300 dimensions. Ideally we would use these vectors to build our model, but that word-vector matrix is very large (about 3.6 GB), so for this exercise we use a more manageable matrix trained with GloVe instead. It contains 400,000 word vectors, each of dimension 50.

We will load two data structures: a Python list of 400,000 words (wordsList.npy), and a 400000 x 50 embedding matrix that holds the corresponding word vectors (wordVectors.npy).

GloVe word vectors (Baidu Netdisk download)
Link: https://pan.baidu.com/s/1PJx_ahSaPfVgMjmLpMz8cw  extraction code: di2e
After downloading and extracting:

About wordsList.npy

This is a Python list of 400,000 words; the position of each word in the list is the row index of its word vector in wordVectors.
Example: look up the word vector for the word baseball.

import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from os import listdir
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')

wordsList = wordsList.tolist()  # Originally loaded as numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode bytes to UTF-8 strings
wordVectors = np.load('./training_data/wordVectors.npy')
print('Loaded the word vectors!')

print(len(wordsList))
# print(wordsList)
print(wordVectors.shape)

baseballIndex = wordsList.index('baseball')
print(baseballIndex)
print(wordVectors[baseballIndex])

Output:

Loaded the word list!
Loaded the word vectors!
400000
(400000, 50)
1444
[-1.9327    1.0421   -0.78515   0.91033   0.22711  -0.62158  -1.6493
  0.07686  -0.5868    0.058831  0.35628   0.68916  -0.50598   0.70473
  1.2664   -0.40031  -0.020687  0.80863  -0.90566  -0.074054 -0.87675
 -0.6291   -0.12685   0.11524  -0.55685  -1.6826   -0.26291   0.22632
  0.713    -1.0828    2.1231    0.49869   0.066711 -0.48226  -0.17897
  0.47699   0.16384   0.16537  -0.11506  -0.15962  -0.94926  -0.42833
 -0.59457   1.3566   -0.27506   0.19918  -0.36008   0.55667  -0.70315
  0.17157 ]

About wordVectors.npy

This is the 400000 x 50 embedding matrix that contains all the word vectors.
Suppose we have the sentence "I thought the movie was incredible and inspiring" (8 words, padded to a sequence length of 10). What are its word vectors in wordVectors?
The code below looks them up.

import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from os import listdir
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')

wordsList = wordsList.tolist()  # Originally loaded as numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode bytes to UTF-8 strings
wordVectors = np.load('./training_data/wordVectors.npy')
maxSeqLength = 10  # Maximum length of sentence
numDimensions = 50  # Dimensions for each word vector (the GloVe vectors used here are 50-d)
firstSentence = np.zeros((maxSeqLength), dtype='int32')
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
# firstSentence[8] and firstSentence[9] are going to be 0
print(firstSentence.shape)
print(firstSentence)  # Shows the row index for each word
# TF2 executes eagerly, so no Session is needed
print(tf.nn.embedding_lookup(wordVectors, firstSentence).shape)

Output:

(10,)
[    41    804 201534   1005     15   7446      5  13767      0      0]
(10, 50)

4. Data Preprocessing

4.1 The generate_train_data function

4.1.1 Load the raw data
4.1.2 Drop low-frequency words (see the sketch after this list)
4.1.3 Convert each word to a unique index, because the computer only understands numbers
4.1.4 Save a trainData.npz dataset so training can load it directly, since reading the raw text files every time is too slow
4.1.5 Save a small_word_index dictionary, dict(word, index). It is needed at prediction time: every word was mapped to a unique index during training, so when you later predict on a new review you must map each of its words to the very same indices, and that is what this dictionary is for
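
One detail worth knowing, and the reason the code below trims word_index by hand: the Tokenizer's num_words argument only limits texts_to_sequences, while word_index always keeps the full vocabulary. A minimal sketch with a made-up corpus:

import tensorflow.keras as keras

# Made-up corpus: "movie" is frequent, "obscure" and "reference" appear only once.
texts = ["good movie", "bad movie", "great movie", "obscure reference"]

t = keras.preprocessing.text.Tokenizer(num_words=3)  # keep only the most frequent words
t.fit_on_texts(texts)

print(len(t.word_index))             # full vocabulary size: word_index ignores num_words
print(t.texts_to_sequences(texts))   # rare words are silently dropped from the sequences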

4.2 The generate_embedding_matrix function

4.2.1 Use wordVectors, wordsList and the small_word_index dictionary to build an embedding matrix
4.2.2 embedding_matrix.npy is a matrix that maps a word index to its word vector (see the sketch below)
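
To make that mapping concrete, here is a minimal sketch; it assumes the files produced by the code below already exist and that the word 'movie' survived the frequency cut. Row idx of embedding_matrix holds the 50-d GloVe vector of the word whose Tokenizer index is idx (or a random row if the word is not in GloVe), and this is exactly the matrix later handed to the Embedding layer via weights=[embedding_matrix].

import numpy as np

# Assumes generate_train_data() and generate_embedding_matrix() have already been run.
small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True).item()
embedding_matrix = np.load('./train_data_new/embedding_matrix.npy')

idx = small_word_index['movie']   # Tokenizer index of the word (assumed to be among the kept 20000)
print(idx)
print(embedding_matrix[idx])      # its 50-dimensional vector
print(embedding_matrix.shape)     # (20001, 50): row 0 is reserved for the padding index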

4.3 The test_load function

Verify the generated files (trainData.npz, small_word_index.npy, embedding_matrix.npy).

import numpy as np
import tensorflow as tf
import os
import tensorflow.keras as keras

def generate_train_data():
    datapath = r'D:\train_data\aclImdb\train'
    pos_files = os.listdir(datapath + '/pos')
    neg_files = os.listdir(datapath + '/neg')
    print(len(pos_files))
    print(len(neg_files))

    pos_all = []
    neg_all = []
    for pf, nf in zip(pos_files, neg_files):
        with open(datapath + '/pos' + '/' + pf, encoding='utf-8') as f:
            s = f.read()
            pos_all.append(s)
        with open(datapath + '/neg' + '/' + nf, encoding='utf-8') as f:
            s = f.read()
            neg_all.append(s)
    print(len(pos_all))
    print(pos_all[0])
    print(len(neg_all))
    X_orig = np.array(pos_all + neg_all)
    Y_orig = np.array([1 for _ in range(len(pos_all))] + [0 for _ in range(len(neg_all))])
    print("X_orig:", X_orig.shape)
    print("Y_orig:", Y_orig.shape)

    import time
    vocab_size = 20000
    maxlen = 200
    print("Start fitting the corpus......")
    t = keras.preprocessing.text.Tokenizer(vocab_size)  # passing num_words makes the Tokenizer drop low-frequency words when vectorizing
    tik = time.time()
    t.fit_on_texts(X_orig)  # fit on the whole review corpus to gather word statistics
    tok = time.time()
    word_index = t.word_index  # not affected by vocab_size: contains the full vocabulary
    print(X_orig)
    print('all_vocab_size', len(word_index), type(word_index))
    print(word_index)
    print("Fitting time: ", (tok - tik), 's')
    print("Start vectorizing the sentences.......")
    v_X = t.texts_to_sequences(X_orig)  # affected by vocab_size: words outside the top 20000 are dropped
    print("Start padding......")
    print(v_X)
    pad_X = keras.preprocessing.sequence.pad_sequences(v_X, maxlen=maxlen, padding='post')
    print(pad_X.shape)
    print("Finished!")

    np.savez('./train_data_new/trainData', x=pad_X, y=Y_orig)
    import copy
    x = list(t.word_counts.items())
    s = sorted(x, key=lambda p: p[1], reverse=True)
    small_word_index = copy.deepcopy(word_index)  # deep copy so the original dict is not modified
    print("Removing less freq words from word-index dict...")
    for item in s[20000:]:
        small_word_index.pop(item[0])
    print("Finished!")
    print(len(small_word_index))
    print(len(word_index))
    np.save('./train_data_new/small_word_index', small_word_index)

def generate_embedding_matrix():
    small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True)
    vocab_size = 20000

    wordVectors = np.load('./training_data/wordVectors.npy')
    wordsList = np.load('./training_data/wordsList.npy')
    wordsList = [word.decode('UTF-8') for word in wordsList]

    embedding_matrix = np.random.uniform(size=(vocab_size + 1, 50))  # +1 reserves row 0 for the padding index
    print("Transfering to the embedding matrix......")
    for word, index in small_word_index.item().items():
        try:
            word_index = wordsList.index(word)
            word_vector = wordVectors[word_index]
            embedding_matrix[index] = word_vector
        except Exception:
            print("Word: [", word, "] not in wvmodel! Use random embedding instead.")
    print("Finished!")
    print("Embedding matrix shape:\n", embedding_matrix.shape)
    np.save('./train_data_new/embedding_matrix', embedding_matrix)

def test_load():
    small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True)
    trainDataNew = np.load('./train_data_new/trainData.npz')
    X = trainDataNew['x']
    Y = trainDataNew['y']
    print(X.shape)
    print(Y.shape)
    print(X[0])
    print(Y[0])
    print(small_word_index.shape)
    print(small_word_index.item()['is'])

if __name__ == '__main__':
    generate_train_data()
    generate_embedding_matrix()
    test_load()

5. Training and Evaluating the Model

Load the training set trainData.npz and the embedding matrix embedding_matrix.npy:

import os
import numpy as np
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
import tensorflow as tf

trainDataNew = np.load('./train_data_new/trainData.npz')
X = trainDataNew['x']
Y = trainDataNew['y']

vocab_size=20000
maxlen=200

embedding_matrix = np.load('./train_data_new/embedding_matrix.npy')

from sklearn.model_selection import train_test_split
np.random.seed(1)  # seed must be called, not assigned
random_indexs = np.random.permutation(len(X))
X = X[random_indexs]
Y = Y[random_indexs]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)

def lstm_model(use_pretrained_wv =True):
    if use_pretrained_wv:
        model = keras.Sequential([
            layers.Embedding(input_dim=20000+1, output_dim=50, input_length=maxlen , weights=[embedding_matrix]),
			#layers.BatchNormalization(),
            layers.LSTM(32, return_sequences=True),
            #layers.BatchNormalization(),
            layers.LSTM(1, activation='sigmoid', return_sequences=False)
        ])
    else:
        model = keras.Sequential([
            layers.Embedding(input_dim=20000 + 1, output_dim=50, input_length=maxlen),
            #layers.BatchNormalization(),
            layers.LSTM(32, return_sequences=True),
            #layers.BatchNormalization(),
            layers.LSTM(1, activation='sigmoid', return_sequences=False)
        ])

    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model(True)

if os.path.isfile('./weights/model.h5'):
    print('load weight')
    model.load_weights('./weights/model.h5')

def save_weight(epoch, logs):
    print('save_weight', epoch, logs)
    model.save_weights('./weights/model.h5')
    
batch_print_callback = keras.callbacks.LambdaCallback(
    on_epoch_end=save_weight
)
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=4, monitor='loss'),
    batch_print_callback,
    # keras.callbacks.ModelCheckpoint('./weights/model.h5', save_best_only=True),
    tf.keras.callbacks.TensorBoard(log_dir='logs')
]

history = model.fit(X_train, y_train, batch_size=32, epochs=10,validation_split=0.1, callbacks= callbacks)

test_result = model.evaluate(X_test, y_test)
print('test Result', test_result)

import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'], loc='upper left')
plt.show()

Training results:

17856/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9830
17888/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9831
17920/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9830
17952/18000 [============================>.] - ETA: 0s - loss: 0.0722 - accuracy: 0.9830
17984/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9830save_weight 9 {'loss': 0.0720114214974973, 'accuracy': 0.98305553, 'val_loss': 0.429014113843441, 'val_accuracy': 0.8625}
test Result [0.40394473695755007, 0.8724]

6. Prediction

You can write a few movie reviews of your own and test them.

import os
import numpy as np
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
import tensorflow as tf

small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True)
embedding_matrix = np.load('./train_data_new/embedding_matrix.npy')

vocab_size=20000
maxlen=200

def lstm_model():
    model = keras.Sequential([
        layers.Embedding(input_dim=20000+1, output_dim=50, input_length=maxlen , weights=[embedding_matrix]),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model()
model.summary()


if os.path.isfile('./weights/model.h5'):
    print('load weight')
    model.load_weights('./weights/model.h5')

review_index = np.zeros((1, 200), dtype=int)
#review = 'I like it so much'
#review = "This is bad movie"
#review = "This is good movie"
#review = "this is not good movie"
review = "It is perfect movie"
counter = 0
for word in review.split():
    try:
        print(word, small_word_index.item()[word])
        review_index[0][counter] = small_word_index.item()[word]
        counter = counter+1
    except Exception:
        print('Word error', word)

print(review_index.shape)
s = model.predict(x = review_index)
print(s)

Prediction result:

[[0.9824787]]

7. Training with TensorFlow's Built-in IMDB Dataset

If you just want to train the model, you can train directly on the dataset that ships with TensorFlow.

import tensorflow.keras as keras
import tensorflow.keras.layers as layers

num_words = 30000
maxlen = 200
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words)
#print(len(x_train[0]))
#print(x_train[0])
print(x_train.shape, ' ', y_train.shape)
print(x_test.shape, ' ', y_test.shape)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen, padding='post')
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen, padding='post')
#print(x_train[0])
print(x_train.shape, ' ', y_train.shape)
print(x_test.shape, ' ', y_test.shape)

def lstm_model():
    model = keras.Sequential([
        layers.Embedding(input_dim=30000, output_dim=32, input_length=maxlen),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model()
model.summary()

history = model.fit(x_train, y_train, batch_size=64, epochs=10,validation_split=0.1)

import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'], loc='upper left')
plt.show()

8. Results Comparison and Summary

| Setup | Train acc | Val acc | Test acc | Time | Prediction stats |
| --- | --- | --- | --- | --- | --- |
| word2vec as embedding weights, frozen | 98.30% | 88.78% | 87.24% | longer | |
| No word2vec embedding weights (random init) | 80.58% | 78.54% | 77.54% | longest | |
| No word2vec embedding weights (with BatchNormalization) | 99.58% | 84.80% | 89.14% | 1347s | |
| TF keras built-in IMDB dataset | 98.46% | 87.68% | 85.14% | 464s | |
| word2vec as embedding weights, fine-tuned | ? | ? | | fastest | |
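
For reference, the difference between the frozen and fine-tuned rows comes down to the Embedding layer's trainable flag. Note that the training code in section 5 passes weights=[embedding_matrix] without setting trainable=False, so Keras keeps updating the embedding during training unless you freeze it explicitly. A minimal sketch of both variants, reusing the model definition from section 5:

import tensorflow.keras as keras
import tensorflow.keras.layers as layers

def lstm_model(embedding_matrix, maxlen=200, freeze_embedding=True):
    # freeze_embedding=True keeps the pretrained vectors fixed; False fine-tunes them.
    model = keras.Sequential([
        layers.Embedding(input_dim=embedding_matrix.shape[0], output_dim=50,
                         input_length=maxlen,
                         weights=[embedding_matrix],
                         trainable=not freeze_embedding),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    return model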

9. References

https://www.oreilly.com/content/perform-sentiment-analysis-with-lstms-using-tensorflow/
https://zhuanlan.zhihu.com/p/63852350
