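Preface:
This was a project I did before graduating, and I finally found time to write it up as a blog post. It is built with Keras + gensim and draws on a lot of related material from the internet. The final accuracy is only about 88%, but there is plenty of room for optimization, so treat it as a learning demo only.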

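The dataset is Tan Songbo's hotel review dataset. For stop words, I compiled my own stop-word dictionary, which I am sharing here:
Link: https://pan.baidu.com/s/1ZkMGAUH7VSxJALWBs41iKQ
Extraction code: 2c1e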

1. Data processing

This step cleans the review text. Here we only do simple stop-word removal.
First, write a function that removes stop words:

import jieba

with open('./stop_words.txt', encoding='utf-8') as f:       # load the stop words
    stopwords = {line.rstrip("\n") for line in f}            # stop-word set, for fast lookup

def del_stop_words(text):
    """
    Segment a text with jieba and remove its stop words.
    :param text: raw review text (str)
    :return: list of the remaining words
    """
    word_ls = jieba.lcut(text)
    word_ls = [i for i in word_ls if i not in stopwords]
    return word_ls
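
The filtering step itself does not depend on jieba. As a minimal sketch, assuming the text has already been tokenized, the stop words can be held in a set so that each membership test is a constant-time lookup on average. The helper name `remove_stop_words` and the sample tokens are hypothetical and only serve as an illustration.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, keeping their order."""
    stop_set = set(stop_words)  # set membership is O(1) on average
    return [tok for tok in tokens if tok not in stop_set]


# Hypothetical, already-tokenized review and stop-word list.
tokens = ["房間", "的", "環境", "很", "不錯"]
stop_words = ["的", "很", "了"]

filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

With a list instead of a set, every lookup scans the whole stop-word list, which becomes noticeable on a corpus of several thousand reviews.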

Then read the positive and negative review corpora and clean them:

with open("./test_data/neg.txt", "r", encoding='UTF-8') as e:     # load the negative corpus
    neg_data1 = e.readlines()

with open("./test_data/pos.txt", "r", encoding='UTF-8') as s:     # load the positive corpus
    pos_data1 = s.readlines()

neg_data = sorted(set(neg_data1), key=neg_data1.index)  # deduplicate while keeping the original order
pos_data = sorted(set(pos_data1), key=pos_data1.index)

neg_data = [del_stop_words(data.replace("\n", "")) for data in neg_data]   # clean the negative corpus
pos_data = [del_stop_words(data.replace("\n", "")) for data in pos_data]   # clean the positive corpus
all_sentences = neg_data + pos_data  # the full corpus, used to train word2vec
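
The `sorted(set(x), key=x.index)` idiom works, but it calls `list.index` once per unique line, so its cost grows roughly quadratically with the corpus size. A possible alternative, sketched here with hypothetical sample lines and a hypothetical helper name `dedupe_keep_order`, relies on dictionaries preserving insertion order (Python 3.7+):

```python
def dedupe_keep_order(lines):
    """Remove duplicate lines, keeping the first occurrence of each."""
    # dict keys are unique and come back in first-seen order
    return list(dict.fromkeys(lines))


lines = ["good\n", "bad\n", "good\n", "fine\n", "bad\n"]  # hypothetical sample lines
unique_lines = dedupe_keep_order(lines)
same_as_post = sorted(set(lines), key=lines.index)  # the idiom used in the post
```

Both expressions give the same list; the difference only matters for speed on larger files.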

2. Text vectorization

There are many ways to vectorize text, including one-hot encoding, the bag-of-words model, term frequency-inverse document frequency (TF-IDF), and word2vec.
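
To make the count-based options concrete, here is a small sketch of bag-of-words and one common unsmoothed TF-IDF variant on a toy corpus. The corpus and function names are hypothetical, and library implementations usually add smoothing or normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["room", "clean", "clean"],
    ["room", "dirty"],
    ["service", "good"],
]

vocab = sorted({tok for doc in docs for tok in doc})


def bag_of_words(doc):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def tf_idf(doc):
    """tf = count / len(doc); idf = log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)  # number of documents containing w
        vec.append(counts[w] / len(doc) * math.log(n_docs / df))
    return vec


bow = bag_of_words(docs[0])
weights = tf_idf(docs[0])
```

In the first document, "clean" gets a higher TF-IDF weight than "room" because it is both more frequent in that document and rarer across the corpus. Unlike these sparse vectors, word2vec produces a dense vector per word.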

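In this project we use word2vec to extract word vectors, since it is widely reported to work well. word2vec uses a shallow neural network to map each word to a dense multi-dimensional vector.

First, define the model, then train and save it: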

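from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
import pickle
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # print logs to the console

model = Word2Vec(all_sentences,     # the full corpus processed above
                 size=150,  # word-vector dimension, default 100 (renamed vector_size in gensim 4.x)
                 min_count=1,  # frequency threshold: words occurring fewer times than this are dropped
                 window=5  # window size: maximum distance between the current word and the predicted word within a sentence
                 )
model.save('./models/Word2vec_v1')  # save the model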
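
To illustrate what the `window` parameter controls, the following sketch lists the (center, context) pairs that a fixed symmetric window selects from one sentence. It is only an illustration with a hypothetical helper and a made-up sentence; the real training loop in gensim does considerably more than enumerate these pairs.

```python
def context_pairs(sentence, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


sentence = ["酒店", "位置", "方便", "服務", "熱情"]  # hypothetical tokenized review
pairs = context_pairs(sentence, window=1)
```

With `window=1` only direct neighbors are paired; a larger window pairs each word with more distant words in the same sentence.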

Then load the model and extract each word's index and vector:

def create_dictionaries(model):
    """
    Build a word dictionary and return the index and the word vector
    of every word in the word2vec model.
    """
    gensim_dict = Dictionary()    # create the word dictionary
    gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)  # gensim 3.x API; 4.x uses model.wv.key_to_index

    w2indx = {v: k + 1 for k, v in gensim_dict.items()}  # word -> index, numbered from 1
    w2vec = {word: model.wv[word] for word in w2indx.keys()}  # word -> word vector
    return w2indx, w2vec


model = Word2Vec.load('./models/Word2vec_v1')         # load the model
index_dict, word_vectors = create_dictionaries(model)  # index dictionary, word-vector dictionary

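Use pickle to store the serialized data. pickle is a very convenient library that can save objects created while a Python program runs, such as dictionaries and lists, to a file on disk.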

with open(pkl_name + ".pkl", 'wb') as output:  # pkl_name: output path without extension (not defined in this post)
    pickle.dump(index_dict, output)    # index dictionary
    pickle.dump(word_vectors, output)  # word-vector dictionary
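
Writing two objects into one file works because each `pickle.load` call reads exactly one object, in the order the objects were dumped. A small sketch with hypothetical data, using an in-memory buffer in place of the `.pkl` file:

```python
import io
import pickle

index_dict = {"酒店": 1, "服務": 2}                      # hypothetical index dictionary
word_vectors = {"酒店": [0.1, 0.2], "服務": [0.3, 0.4]}  # hypothetical 2-d vectors

buffer = io.BytesIO()              # in-memory stand-in for the .pkl file
pickle.dump(index_dict, buffer)    # first object written
pickle.dump(word_vectors, buffer)  # second object written

buffer.seek(0)                        # rewind, as if reopening the file for reading
loaded_index = pickle.load(buffer)    # returns the first object
loaded_vectors = pickle.load(buffer)  # returns the second object
```

The reading code therefore has to call `pickle.load` in the same order as the writing code called `pickle.dump`.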

3. LSTM training

Next we build an LSTM model with the Keras library and train it.
First, define a few required parameters:

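# parameter settings
vocab_dim = 150   # word-vector dimension
maxlen = 150      # maximum length kept for each text
batch_size = 100  # number of samples fed to the model per training step
n_epoch = 4       # number of epochs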

Load the word-vector data and fill the embedding matrix:

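import pickle
import numpy as np

with open("./model/評價語料索引及詞向量2.pkl", 'rb') as f:  # trained in the previous step
    index_dict = pickle.load(f)    # index dictionary, {word: index}
    word_vectors = pickle.load(f)  # word vectors, {word: 150-dimensional array}

n_symbols = len(index_dict) + 1  # number of indices; +1 because index 0 is reserved for padding and unknown words
embedding_weights = np.zeros((n_symbols, vocab_dim))  # an n_symbols x 150 zero matrix

for w, index in index_dict.items():  # fill the matrix with word vectors, starting from index 1
    embedding_weights[index, :] = word_vectors[w]  # row 0 stays a zero vector (no word has index 0)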
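
The role of this matrix is easier to see on a tiny example. The sketch below uses hypothetical 3-dimensional vectors and plain Python lists instead of NumPy arrays, and the `lookup` helper is only an illustration of what an embedding layer conceptually does with the matrix.

```python
# Hypothetical 3-dimensional vectors; the post uses 150 dimensions.
vocab_dim = 3
index_dict = {"酒店": 1, "服務": 2}
word_vectors = {"酒店": [0.1, 0.2, 0.3], "服務": [0.4, 0.5, 0.6]}

n_symbols = len(index_dict) + 1  # one extra row for index 0
embedding_weights = [[0.0] * vocab_dim for _ in range(n_symbols)]

for word, index in index_dict.items():
    embedding_weights[index] = list(word_vectors[word])  # row `index` holds that word's vector


def lookup(indices):
    """Replace each index by its row of the matrix."""
    return [embedding_weights[i] for i in indices]


embedded = lookup([2, 0, 1])
```

Index 0 maps to the all-zero row, which is why padding and out-of-vocabulary words share it.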

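Next, map all the review data to numbers.
Loading the word vectors has already given us an index dictionary,
so we only need to convert each word that appears in the index dictionary to its index, and every word that does not appear to 0.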

def text_to_index_array(p_new_dic, p_sen):
    """
    Convert a text or a list of tokenized texts to index numbers.
    :param p_new_dic: index dictionary, {word: index}
    :param p_sen: a list of word lists, or a single space-separated string
    :return: the corresponding index sequences
    """
    if isinstance(p_sen, list):
        new_sentences = []
        for sen in p_sen:
            new_sen = []
            for word in sen:
                new_sen.append(p_new_dic.get(word, 0))  # word -> index; words not in the dictionary become 0
            new_sentences.append(new_sen)
        return np.array(new_sentences, dtype=object)   # rows differ in length, so store them as an object array
    else:
        new_sentences = []
        sentences = []
        p_sen = p_sen.split(" ")
        for word in p_sen:
            sentences.append(p_new_dic.get(word, 0))  # word -> index; words not in the dictionary become 0
        new_sentences.append(sentences)
        return new_sentences
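
As a quick illustration of the mapping, here is a minimal sketch on one tokenized review. The dictionary contents and the helper name `to_indices` are hypothetical.

```python
def to_indices(index_dict, tokens):
    """Map each token to its index; tokens missing from index_dict become 0."""
    return [index_dict.get(tok, 0) for tok in tokens]


index_dict = {"酒店": 1, "服務": 2, "不錯": 3}  # hypothetical index dictionary
review = ["酒店", "早餐", "不錯"]                # "早餐" is not in the dictionary

indices = to_indices(index_dict, review)
print(indices)
```

Because unknown words and padding both use 0, the model cannot tell them apart; that is acceptable for this demo but worth keeping in mind.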

Load the features and labels, map all the features to numbers, and split them into a training set and a test set:

from sklearn.model_selection import train_test_split

with open("./原始語料/neg.txt", "r", encoding='UTF-8') as f:
    neg_data1 = f.readlines()

with open("./原始語料/pos.txt", "r", encoding='UTF-8') as g:
    pos_data1 = g.readlines()

neg_data = sorted(set(neg_data1), key=neg_data1.index)  # deduplicate while keeping the original order
pos_data = sorted(set(pos_data1), key=pos_data1.index)

# process_txt is not shown in this post; it presumably does the same cleaning as del_stop_words in section 1
neg_data = [process_txt(data) for data in neg_data]
pos_data = [process_txt(data) for data in pos_data]
data = neg_data + pos_data


# build the class labels: 0 = negative, 1 = positive
label_list = ([0] * len(neg_data) + [1] * len(pos_data))


# split into training and test sets; both are still Python lists here
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(data, label_list, test_size=0.2)

# convert to index form

# token = Tokenizer(num_words=3000)   # vocabulary size
# token.fit_on_texts(train_text)

X_train = text_to_index_array(index_dict, X_train_l)
X_test = text_to_index_array(index_dict, X_test_l)

y_train = np.array(y_train_l)  # convert to NumPy arrays
y_test = np.array(y_test_l)

print("Training set shape: ", X_train.shape)
print("Test set shape: ", X_test.shape)
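
For readers unfamiliar with `train_test_split`, the idea can be sketched in plain Python: shuffle the samples together with their labels, then hold out a fraction for testing. The helper below is a hypothetical illustration, not scikit-learn's implementation, and its rounding and shuffling details differ from the library's.

```python
import random


def split_train_test(samples, labels, test_size=0.2, seed=0):
    """Shuffle paired samples and labels, then hold out a fraction for testing."""
    paired = list(zip(samples, labels))
    random.Random(seed).shuffle(paired)  # fixed seed so the split is reproducible
    n_test = int(round(len(paired) * test_size))
    test, train = paired[:n_test], paired[n_test:]
    x_train = [s for s, _ in train]
    y_train = [l for _, l in train]
    x_test = [s for s, _ in test]
    y_test = [l for _, l in test]
    return x_train, x_test, y_train, y_test


# Hypothetical data: 5 negative and 5 positive one-word "reviews".
samples = [["w%d" % i] for i in range(10)]
labels = [0] * 5 + [1] * 5

x_train, x_test, y_train, y_test = split_train_test(samples, labels)
```

Shuffling matters here because the data list is built as all negatives followed by all positives; taking the last 20% without shuffling would give a test set containing only positive reviews.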

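Every input to the model must have the same length, so we need to define a maximum length, max_len (maxlen in the code).
When a feature sequence is shorter than max_len, the remaining positions are filled with 0.
When it is longer than max_len, it is truncated.
In this project I set max_len to 150, which is roughly the average length. To make sure no information is lost, you can instead print the maximum length over all features and use that as max_len.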
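
The padding and truncation rule can be sketched in a few lines of plain Python. This hypothetical helper mirrors the default behavior of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`), where zeros are added at the front and overly long sequences lose their first items.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]          # drop items from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front


short = pad_or_truncate([5, 8], maxlen=4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
```

Here `short` gains two leading zeros and `long` keeps only its last four indices, so with these defaults a long review loses its beginning rather than its end.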

from keras.preprocessing import sequence

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

Define, train, and evaluate the model:

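from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

def train_lstm(p_n_symbols, p_embedding_weights, p_X_train, p_y_train, p_X_test, p_y_test, X_test_l):
    print('Building the model...')
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,  # dimension of the output vectors
                        input_dim=p_n_symbols,  # vocabulary size (number of indices)
                        mask_zero=True,         # mask the padded 0s so they do not affect training
                        weights=[p_embedding_weights],   # initialize with the pretrained word2vec vectors
                        input_length=maxlen))      # length of each input sequence

    model.add(LSTM(units=100,  # Keras 2 name for output_dim
                   activation='sigmoid',
                   recurrent_activation='hard_sigmoid'))  # Keras 2 name for inner_activation
    model.add(Dropout(0.5))   # randomly drop 50% of the units at each training step to reduce overfitting
    model.add(Dense(units=512,
                    activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(units=1,  # one output neuron: 1 means positive, 0 means negative
                    activation='sigmoid'))
    model.summary()

    print('Compiling the model...')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    print("Training...")
    train_history = model.fit(p_X_train, p_y_train, batch_size=batch_size, epochs=n_epoch,  # epochs replaces nb_epoch in Keras 2
              validation_data=(p_X_test, p_y_test))

    print("Evaluating...")
    score, acc = model.evaluate(p_X_test, p_y_test, batch_size=batch_size)
    label = model.predict(p_X_test)
    print('Test score:', score)
    print('Test accuracy:', acc)
    for (a, b, c) in zip(p_y_test, X_test_l, label):
        print("Original text: " + "".join(b))
        print("Predicted sentiment:", c)  # c is the model output (probability of the positive class)
        print("Actual sentiment:", a)     # a is the true label

    # history keys are 'acc'/'val_acc' in older Keras; newer versions use 'accuracy'/'val_accuracy'
    show_train_history(train_history, 'acc', 'val_acc')    # training vs. validation accuracy
    show_train_history(train_history, 'loss', 'val_loss')  # training vs. validation loss

    # save the model
    model.save('./model/emotion_model_LSTM.h5')
    print("Model saved")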
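
Note that `model.predict` returns the sigmoid output for each sample, a value between 0 and 1, rather than a 0/1 label. A minimal sketch of turning such outputs into labels and computing accuracy by hand follows; the 0.5 threshold, the helper names, and the numbers are assumptions for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into class labels: 1 = positive, 0 = negative."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


probs = [0.91, 0.12, 0.55, 0.40]  # hypothetical model outputs
y_true = [1, 0, 0, 0]             # hypothetical true labels

y_pred = to_labels(probs)
acc = accuracy(y_true, y_pred)
```

In practice the array returned by `model.predict` has one column per output neuron, so it would need flattening before being passed to a helper like this.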

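The curves plotted by show_train_history can be used to judge whether the model is overfitting,
which makes it easier to choose the number of epochs when tuning. The function is as follows: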

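import matplotlib.pyplot as plt

def show_train_history(train_history, train, validation):
    """
    Plot a training metric against its validation counterpart.
    :param train_history: History object returned by model.fit
    :param train: key of the training metric, e.g. 'loss'
    :param validation: key of the validation metric, e.g. 'val_loss'
    :return: None
    """
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title("Train History")   # title
    plt.xlabel('Epoch')    # x-axis label
    plt.ylabel(train)  # y-axis label
    plt.legend(['train', 'test'], loc='upper left')  # legend in the upper-left corner
    plt.show()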
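
The same judgment can be made numerically. As a rough sketch, assuming a dictionary shaped like `train_history.history`, the epoch with the lowest validation loss is a reasonable candidate for the number of epochs; the curve values below are made up for illustration.

```python
def best_epoch(history, key="val_loss"):
    """Return the 1-based epoch with the lowest value of history[key]."""
    values = history[key]
    return min(range(len(values)), key=values.__getitem__) + 1


# Hypothetical curves shaped like the dict stored in train_history.history.
history = {
    "loss": [0.60, 0.42, 0.30, 0.21],
    "val_loss": [0.55, 0.45, 0.47, 0.52],
}

epoch = best_epoch(history)
# Training loss keeps falling while validation loss rises again: a sign of overfitting.
overfitting = (history["val_loss"][-1] > min(history["val_loss"])
               and history["loss"][-1] < history["loss"][0])
```

In this made-up run the validation loss bottoms out at epoch 2, so training for 4 epochs would already be past the point where the model generalizes best.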

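Project on GitHub: https://github.com/sph116/lstm_emotion
I have not checked this carefully, so there may be some small problems. Please bear with me; discussion is welcome.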

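Future improvements:
1. Train word2vec on a larger corpus
2. Clean the data beyond simple stop-word removal
3. Enrich the model architecture and increase its complexity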

Sentiment analysis based on long short-term memory networks (LSTM) + word2vec
