[Deep Learning with TF2][RNN-LSTM] Text Sentiment Analysis (Data Preprocessing - Training - Prediction)

1. Downloading the Data

Dataset URL: http://ai.stanford.edu/~amaas/data/sentiment/
After downloading and unpacking, you will see two folders, test and train:
Going into train, we find that the positive and negative samples are already separated: neg and pos hold the negative and positive samples respectively, and unsup contains unlabeled samples that can be used later if needed. Have a look at the other files yourself.

Open the pos folder to see what it contains:

It is simply a collection of individual text files.

Note that these reviews are generally not short...
The dataset contains 50,000 reviews in total, split evenly between the test and train sets; within each set, pos and neg are also split evenly.
This article only uses the 25,000 reviews in the train set.
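
As a quick sanity check after unpacking, the sketch below (a minimal example, assuming the archive was extracted to ./aclImdb) counts the files in each class folder:

import os

# Hypothetical extraction path; adjust it to wherever you unpacked aclImdb.
data_dir = './aclImdb/train'

for label in ('pos', 'neg', 'unsup'):
    folder = os.path.join(data_dir, label)
    if os.path.isdir(folder):
        # Expect 12500 for pos, 12500 for neg, and 50000 for unsup.
        print(label, len(os.listdir(folder)))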

2. About the Training Data

Sentiment analysis is one of the easiest NLP tasks to get started with: it is simply a text-classification problem, judging the sentiment polarity of a piece of text. The simplest form is binary classification, positive vs. negative; a slightly harder one is three-way classification, adding a "no clear sentiment" class; more complex still is sentiment scoring, e.g. rating a movie from 1 to 5, which makes it a five-class problem. In essence they are all the same, just harder to learn as the number of classes grows.
IMDB is a dedicated movie-review site, similar to Douban in China. Its movie-review data is a sentiment-analysis dataset that people frequently use for practice, in competitions such as Kaggle, and in academic research.

In fact, tensorflow.keras ships with a nicely preprocessed version of the IMDB dataset that can be downloaded with one line of code and trained on without any further processing, with fairly good results. But that would be no fun. In real scenarios, the data we get is messy, and we have to learn to read, clean, and filter it ourselves and split it into training and test sets. From my own experience, data preprocessing is the real skill: models are easy to build, and today's frameworks keep making that easier, but preprocessing you have to do by hand. That is why, in practical tasks, preprocessing usually consumes the most time and effort and directly affects the final results.

Also, remember that to analyze text we must first turn it into numbers, because a computer does not understand characters, only numbers, so the fully processed text must be in numeric form. The dataset bundled with tensorflow.keras is already fully numericalized, but it does not provide the lookup dictionary that tells us which word each number corresponds to, so we can only train a model and look at the results; we cannot apply it to other corpora or analyze it more deeply. For these reasons, the dataset recommended above is the raw one: real text, already split into classes by the Stanford researchers for convenience, but the numericalization is up to us.
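
To make numericalization concrete, here is a toy sketch with a made-up three-word vocabulary: each word is mapped to an integer index, and a sentence becomes a list of indices.

# Toy example only: a hypothetical three-word vocabulary.
vocab = {'good': 1, 'bad': 2, 'movie': 3}  # 0 is reserved for padding / unknown words

sentence = "good movie"
indices = [vocab.get(word, 0) for word in sentence.lower().split()]
print(indices)  # [1, 3]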

3. The Word Vectors Used (Word2Vec / GloVe)

Google has already trained a Word2Vec model on a massive dataset (a corpus of roughly 100 billion words). That model contains about 3 million word vectors, each with 300 dimensions. Ideally we would use these vectors to build our model, but the word-vector matrix is quite large (3.6 GB), so for this exercise we will use a more manageable matrix trained with GloVe instead. That matrix contains 400,000 word vectors, each of dimension 50.

We will load two data structures: a Python list of 400,000 words (wordsList.npy) and a 400,000 x 50 embedding matrix holding all the word-vector values (wordVectors.npy).

GloVe word vectors (Baidu netdisk download)
Link: https://pan.baidu.com/s/1PJx_ahSaPfVgMjmLpMz8cw  Extraction code: di2e
After downloading and unpacking:

About wordsList.npy

This is a Python list containing 400,000 words; the position of each word in the list is also the position of its word vector in wordVectors.
Example: to look up the word vector for the word "baseball":

import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from os import listdir
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')

wordsList = wordsList.tolist()  # Originally loaded as numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode the bytes into UTF-8 strings
wordVectors = np.load('./training_data/wordVectors.npy')
print('Loaded the word vectors!')

print(len(wordsList))
# print(wordsList)
print(wordVectors.shape)

baseballIndex = wordsList.index('baseball')
print(baseballIndex)
print(wordVectors[baseballIndex])

Output:

Loaded the word list!
Loaded the word vectors!
400000
(400000, 50)
1444
[-1.9327    1.0421   -0.78515   0.91033   0.22711  -0.62158  -1.6493
  0.07686  -0.5868    0.058831  0.35628   0.68916  -0.50598   0.70473
  1.2664   -0.40031  -0.020687  0.80863  -0.90566  -0.074054 -0.87675
 -0.6291   -0.12685   0.11524  -0.55685  -1.6826   -0.26291   0.22632
  0.713    -1.0828    2.1231    0.49869   0.066711 -0.48226  -0.17897
  0.47699   0.16384   0.16537  -0.11506  -0.15962  -0.94926  -0.42833
 -0.59457   1.3566   -0.27506   0.19918  -0.36008   0.55667  -0.70315
  0.17157 ]
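
One practical note: wordsList.index() does a linear scan over 400,000 entries, so repeated lookups are slow. A minimal sketch of a dictionary-based lookup, assuming wordsList and wordVectors have been loaded and decoded as above:

# Build a word -> row-index dictionary once; each lookup is then O(1).
word2idx = {word: i for i, word in enumerate(wordsList)}

baseballIndex = word2idx['baseball']
print(baseballIndex)               # same index as wordsList.index('baseball')
print(wordVectors[baseballIndex])  # same 50-dimensional vector as above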

About wordVectors.npy

This is the 400,000 x 50 embedding matrix containing all the word-vector values.
Suppose we have the sentence "I thought the movie was incredible and inspiring" and a maximum sequence length of 10 words; what are the corresponding word vectors in wordVectors?
The code below looks up their word vectors:

import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from os import listdir
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')

wordsList = wordsList.tolist()  # Originally loaded as numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode the bytes into UTF-8 strings
wordVectors = np.load('./training_data/wordVectors.npy')
maxSeqLength = 10  # Maximum length of sentence
numDimensions = 50  # Dimensions of each word vector (the GloVe vectors used here are 50-dimensional)
firstSentence = np.zeros((maxSeqLength), dtype='int32')
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
# firstSentence[8] and firstSentence[9] are going to be 0
print(firstSentence.shape)
print(firstSentence)  # Shows the row index for each word
# TF2 executes eagerly, so no Session is needed; embedding_lookup returns a tensor directly.
print(tf.nn.embedding_lookup(wordVectors, firstSentence).shape)

Output:

(10,)
[    41    804 201534   1005     15   7446      5  13767      0      0]
(10, 50)
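
Since wordVectors is just a NumPy array, plain fancy indexing gives the same result without going through TensorFlow at all; a quick sketch, reusing firstSentence from above:

# NumPy fancy indexing: select the rows of wordVectors named by the index array.
sentence_vectors = wordVectors[firstSentence]
print(sentence_vectors.shape)  # (10, 50), same shape as the embedding_lookup result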

4. Data Preprocessing

4.1 The generate_train_data function

4.1.1 Loads the raw data
4.1.2 Drops low-frequency words
4.1.3 Converts each word into a unique index, since the computer only understands numbers
4.1.4 Produces a trainData.npz dataset so training can load it directly (reading the raw text files each time is too slow)
4.1.5 Produces a small_word_index dictionary, dict(word, index), which is needed at prediction time: every word was converted into a unique index for training, so when you want to score a new review after training, each of its words must be converted with those same indices, and that is exactly what this dictionary is for

4.2 The generate_embedding_matrix function

4.2.1 Uses wordVectors, wordsList, and the small_word_index dictionary to build an embedding matrix
4.2.2 embedding_matrix.npy is a matrix that maps a word index to its word vector.

4.3 The test_load function

Verifies the generated artifacts (trainData.npz, small_word_index.npy, embedding_matrix.npy)

import numpy as np
import tensorflow as tf
import os
import tensorflow.keras as keras

def generate_train_data():
    datapath = r'D:\train_data\\aclImdb\train'
    pos_files = os.listdir(datapath + '/pos')
    neg_files = os.listdir(datapath + '/neg')
    print(len(pos_files))
    print(len(neg_files))

    pos_all = []
    neg_all = []
    for pf, nf in zip(pos_files, neg_files):
        with open(datapath + '/pos' + '/' + pf, encoding='utf-8') as f:
            s = f.read()
            pos_all.append(s)
        with open(datapath + '/neg' + '/' + nf, encoding='utf-8') as f:
            s = f.read()
            neg_all.append(s)
    print(len(pos_all))
    print(pos_all[0])
    print(len(neg_all))
    X_orig = np.array(pos_all + neg_all)
    Y_orig = np.array([1 for _ in range(len(pos_all))] + [0 for _ in range(len(neg_all))])
    print("X_orig:", X_orig.shape)
    print("Y_orig:", Y_orig.shape)

    import time
    vocab_size = 20000
    maxlen = 200
    print("Start fitting the corpus......")
    t = keras.preprocessing.text.Tokenizer(vocab_size)  # set num_words so low-frequency words are dropped when vectorizing
    tik = time.time()
    t.fit_on_texts(X_orig)  # fit on the whole review corpus to collect word statistics
    tok = time.time()
    word_index = t.word_index  # contains every word, not limited by vocab_size
    print(X_orig)
    print('all_vocab_size', len(word_index), type(word_index))
    print(word_index)
    print("Fitting time: ", (tok - tik), 's')
    print("Start vectorizing the sentences.......")
    v_X = t.texts_to_sequences(X_orig)  # limited to the top vocab_size words
    print("Start padding......")
    print(v_X)
    pad_X = keras.preprocessing.sequence.pad_sequences(v_X, maxlen=maxlen, padding='post')
    print(pad_X.shape)
    print("Finished!")

    np.savez('./train_data_new/trainData', x=pad_X, y=Y_orig)
    import copy
    x = list(t.word_counts.items())
    s = sorted(x, key=lambda p: p[1], reverse=True)  # sort words by frequency, most frequent first
    small_word_index = copy.deepcopy(word_index)  # deep copy so the original dict is left unchanged
    print("Removing less freq words from word-index dict...")
    for item in s[20000:]:  # drop everything beyond the 20000 (= vocab_size) most frequent words
        small_word_index.pop(item[0])
    print("Finished!")
    print(len(small_word_index))
    print(len(word_index))
    np.save('./train_data_new/small_word_index', small_word_index)

def generate_embedding_matrix():
    small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True)
    vocab_size = 20000

    wordVectors = np.load('./training_data/wordVectors.npy')
    wordsList = np.load('./training_data/wordsList.npy')
    wordsList = [word.decode('UTF-8') for word in wordsList]

    embedding_matrix = np.random.uniform(size=(vocab_size + 1, 50))  # +1 reserves row 0 for the padding index 0
    print("Transfering to the embedding matrix......")
    for word, index in small_word_index.item().items():
        try:
            word_index = wordsList.index(word)
            word_vector = wordVectors[word_index]
            embedding_matrix[index] = word_vector
        except Exception:
            print("Word: [", word, "] not in wvmodel! Use random embedding instead.")
    print("Finished!")
    print("Embedding matrix shape:\n", embedding_matrix.shape)
    np.save('./train_data_new/embedding_matrix', embedding_matrix)

def test_load():
    small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True)
    trainDataNew = np.load('./train_data_new/trainData.npz')
    X = trainDataNew['x']
    Y = trainDataNew['y']
    print(X.shape)
    print(Y.shape)
    print(X[0])
    print(Y[0])
    print(small_word_index.shape)
    print(small_word_index.item()['is'])

if __name__ == '__main__':
    generate_train_data()
    generate_embedding_matrix()
    test_load()

5. Training and Testing the Model

Load the training dataset trainData.npz and the embedding matrix embedding_matrix.npy:

import os
import numpy as np
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
import tensorflow as tf

trainDataNew = np.load('./train_data_new/trainData.npz')
X = trainDataNew['x']
Y = trainDataNew['y']

vocab_size=20000
maxlen=200

embedding_matrix = np.load('./train_data_new/embedding_matrix.npy')

from sklearn.model_selection import train_test_split
np.random.seed(1)  # seed the RNG so the shuffle below is reproducible
random_indexs = np.random.permutation(len(X))
X = X[random_indexs]
Y = Y[random_indexs]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

def lstm_model(use_pretrained_wv=True):
    if use_pretrained_wv:
        model = keras.Sequential([
            layers.Embedding(input_dim=20000 + 1, output_dim=50, input_length=maxlen, weights=[embedding_matrix]),
            # layers.BatchNormalization(),
            layers.LSTM(32, return_sequences=True),
            # layers.BatchNormalization(),
            layers.LSTM(1, activation='sigmoid', return_sequences=False)
        ])
    else:
        model = keras.Sequential([
            layers.Embedding(input_dim=20000 + 1, output_dim=50, input_length=maxlen),
            #layers.BatchNormalization(),
            layers.LSTM(32, return_sequences=True),
            #layers.BatchNormalization(),
            layers.LSTM(1, activation='sigmoid', return_sequences=False)
        ])

    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model(True)

if os.path.isfile('./weights/model.h5'):
    print('load weight')
    model.load_weights('./weights/model.h5')

def save_weight(epoch, logs):
    print('save_weight', epoch, logs)
    model.save_weights('./weights/model.h5')
    
batch_print_callback = keras.callbacks.LambdaCallback(
    on_epoch_end=save_weight
)
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=4, monitor='loss'),
    batch_print_callback,
    # keras.callbacks.ModelCheckpoint('./weights/model.h5', save_best_only=True),
    tf.keras.callbacks.TensorBoard(log_dir='logs')
]

history = model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1, callbacks=callbacks)

test_result = model.evaluate(X_test, y_test)
print('test Result', test_result)

import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'], loc='upper left')
plt.show()

Training output:

17856/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9830
17888/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9831
17920/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9830
17952/18000 [============================>.] - ETA: 0s - loss: 0.0722 - accuracy: 0.9830
17984/18000 [============================>.] - ETA: 0s - loss: 0.0721 - accuracy: 0.9830save_weight 9 {'loss': 0.0720114214974973, 'accuracy': 0.98305553, 'val_loss': 0.429014113843441, 'val_accuracy': 0.8625}
test Result [0.40394473695755007, 0.8724]

6. Prediction

You can write some movie reviews of your own and test them.

import os
import numpy as np
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
import tensorflow as tf

small_word_index = np.load('./train_data_new/small_word_index.npy', allow_pickle=True)
embedding_matrix = np.load('./train_data_new/embedding_matrix.npy')

vocab_size=20000
maxlen=200

def lstm_model():
    model = keras.Sequential([
        layers.Embedding(input_dim=20000+1, output_dim=50, input_length=maxlen , weights=[embedding_matrix]),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model()
model.summary()


if os.path.isfile('./weights/model.h5'):
    print('load weight')
    model.load_weights('./weights/model.h5')

review_index = np.zeros((1, 200), dtype=int)
#review = 'I like it so much'
#review = "This is bad movie"
#review = "This is good movie"
#review = "this is not good movie"
review = "It is perfect movie"
counter = 0
for word in review.split():
    try:
        print(word, small_word_index.item()[word])
        review_index[0][counter] = small_word_index.item()[word]
        counter = counter+1
    except Exception:
        print('Word error', word)

print(review_index.shape)
s = model.predict(x = review_index)
print(s)

Prediction output:

[[0.9824787]]
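
For convenience, the lookup-and-predict steps can be wrapped in a small helper. A minimal sketch, assuming model, small_word_index, and maxlen are defined as above; words outside the vocabulary are simply skipped, and the input is lowercased to match the Tokenizer's default behavior:

def predict_sentiment(review, model, word_index, maxlen=200):
    # Convert the review into the same indices that were used during training.
    indices = np.zeros((1, maxlen), dtype=int)
    counter = 0
    for word in review.lower().split():
        idx = word_index.get(word)
        if idx is not None and counter < maxlen:
            indices[0][counter] = idx
            counter += 1
    score = model.predict(indices)[0][0]
    return score  # a score above 0.5 suggests a positive review

print(predict_sentiment("It is perfect movie", model, small_word_index.item()))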

7. Training with TensorFlow's Built-in IMDB Dataset

If you only want to train a model, you can train directly on the dataset bundled with TensorFlow.

import tensorflow.keras as keras
import tensorflow.keras.layers as layers

num_words = 30000
maxlen = 200
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words)
#print(len(x_train[0]))
#print(x_train[0])
print(x_train.shape, ' ', y_train.shape)
print(x_test.shape, ' ', y_test.shape)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen, padding='post')
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen, padding='post')
#print(x_train[0])
print(x_train.shape, ' ', y_train.shape)
print(x_test.shape, ' ', y_test.shape)

def lstm_model():
    model = keras.Sequential([
        layers.Embedding(input_dim=30000, output_dim=32, input_length=maxlen),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model()
model.summary()

history = model.fit(x_train, y_train, batch_size=64, epochs=10,validation_split=0.1)

import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'], loc='upper left')
plt.show()

8. Results Comparison and Summary

| Training setup | Train set result | Validation set result | Test set result | Time | Prediction stats |
| --- | --- | --- | --- | --- | --- |
| word2vec-initialized embedding, weights frozen | 98.30% | 88.78% | 87.24% | relatively long | |
| Randomly initialized embedding (no word2vec) | 80.58% | 78.54% | 77.54% | longest | |
| Randomly initialized embedding + BatchNormalization | 99.58% | 84.80% | 89.14% | 1347 s | |
| Built-in tf.keras IMDB dataset | 98.46% | 87.68% | 85.14% | 464 s | |
| word2vec-initialized embedding, fine-tuned | ? | ? | | fastest | |
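
The "frozen" and "fine-tune" rows differ only in whether the pretrained embedding weights are updated during training. In Keras this is controlled by the Embedding layer's trainable flag (the training script above does not set it explicitly, so it defaults to True). A minimal sketch of how the switch could be made explicit:

import tensorflow.keras as keras
import tensorflow.keras.layers as layers

def lstm_model_pretrained(embedding_matrix, maxlen=200, freeze=True):
    model = keras.Sequential([
        # freeze=True keeps the GloVe vectors fixed; freeze=False fine-tunes them.
        layers.Embedding(input_dim=20000 + 1, output_dim=50, input_length=maxlen,
                         weights=[embedding_matrix], trainable=not freeze),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    return model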

9. References

https://www.oreilly.com/content/perform-sentiment-analysis-with-lstms-using-tensorflow/
https://zhuanlan.zhihu.com/p/63852350
