[Python Machine Learning and Practice] Hands-On: Estimating IMDB Movie Review Sentiment



Estimating IMDB Movie Review Sentiment

The task is to analyze user comments from a movie review website and determine the sentiment polarity of each comment.

About the IMDB Movie Review Dataset
     The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

File descriptions

  • labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows, each containing an id, a sentiment label, and the text of a review.
  • testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and the text of each review. The task is to predict the sentiment of each one.
  • unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and the text of each review.
  • sampleSubmission - A comma-delimited sample submission file in the correct format.

Data fields

  • id - Unique ID of each review
  • sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
  • review - Text of the review

Model Building

We use scikit-learn's naive Bayes model together with the gradient boosting classifier from the ensemble module to perform sentiment analysis on the movie reviews.

For the naive Bayes models, each review is turned into a feature vector with the "bag of words" approach, using CountVectorizer and TfidfVectorizer respectively. For the gradient boosting classifier, word vectors are first trained on the unlabeled reviews, and each review is then represented by the average of the vectors of all the words it contains.
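
To make the bag-of-words idea concrete, here is a minimal standalone sketch (the two example sentences are made up for illustration): CountVectorizer encodes each document as raw term counts, while TfidfVectorizer down-weights terms that appear in many documents.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the movie was great', 'the movie was terrible truly terrible']

# Raw term counts: one column per vocabulary word
count_vec = CountVectorizer(analyzer='word')
print(count_vec.fit_transform(docs).toarray())

# TF-IDF: words that appear in every document, such as 'the', get lower weights
tfidf_vec = TfidfVectorizer(analyzer='word')
print(tfidf_vec.fit_transform(docs).toarray())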

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : IMDB.py
@Author: Xinzhe.Pang
@Date  : 2019/7/27 21:53
@Desc  : 
"""
import pandas as pd
import re
import numpy as np
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
# Import the text feature extractors CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Import the naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Import Pipeline to conveniently chain processing steps
from sklearn.pipeline import Pipeline
# Import GridSearchCV for grid search over hyperparameter combinations
from sklearn.model_selection import GridSearchCV
# Import nltk.data
import nltk.data
# Import word2vec and Word2Vec from gensim.models
# (Word2Vec.load is used later to read a trained word vector model back in)
from gensim.models import word2vec, Word2Vec
# Import GradientBoostingClassifier from sklearn.ensemble for review sentiment analysis
from sklearn.ensemble import GradientBoostingClassifier

# Read the training and test data
train = pd.read_csv('./labeledTrainData.tsv', delimiter='\t')
test = pd.read_csv('./testData.tsv', delimiter='\t')

# Inspect the first few rows of the training data
print(train.head())
# Inspect the first few rows of the test data
print(test.head())

# Define review_to_text, which performs three preprocessing steps on a raw review
def review_to_text(review, remove_stopwords):
    # Step 1: strip HTML markup
    raw_text = BeautifulSoup(review, 'lxml').get_text()
    # Step 2: remove non-alphabetic characters
    letters = re.sub('[^a-zA-Z]', ' ', raw_text)
    words = letters.lower().split()
    # Step 3: if remove_stopwords is set, also remove English stopwords
    # (run nltk.download('stopwords') once beforehand to fetch the corpus;
    # calling the interactive nltk.download() inside the function would block every call)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        words = [w for w in words if w not in stop_words]
    # Return the list of words left after these three preprocessing steps
    return words
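
# For example (hypothetical input, assuming the stopword corpus is installed):
# review_to_text('<br />What a GREAT movie!', True) returns ['great', 'movie']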

# Apply the three preprocessing steps above to the raw training and test sets
X_train = []
for review in train['review']:
    X_train.append(' '.join(review_to_text(review, True)))
X_test = []
for review in test['review']:
    X_test.append(' '.join(review_to_text(review, True)))
y_train = train['sentiment']

# Use Pipeline to build two naive Bayes classifiers that differ only in the text
# feature extractor: one uses CountVectorizer, the other TfidfVectorizer
pip_count = Pipeline([('count_vec', CountVectorizer(analyzer='word')), ('mnb', MultinomialNB())])
pip_tfidf = Pipeline([('tfidf_vec', TfidfVectorizer(analyzer='word')), ('mnb', MultinomialNB())])

# Configure the hyperparameter grids for each model
params_count = {'count_vec__binary': [True, False], 'count_vec__ngram_range': [(1, 1), (1, 2)],
                'mnb__alpha': [0.1, 1.0, 10.0]}
params_tfidf = {'tfidf_vec__binary': [True, False], 'tfidf_vec__ngram_range': [(1, 1), (1, 2)],
                'mnb__alpha': [0.1, 1.0, 10.0]}
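# Note: GridSearchCV addresses a parameter of a Pipeline step as
# '<step_name>__<param_name>' with a double underscore, so 'mnb__alpha' tunes
# the alpha of the step named 'mnb'. All addressable parameter names can be
# listed with pip_count.get_params().keys().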

# Run a parallelized grid search with 4-fold cross-validation for the naive Bayes model using CountVectorizer
gs_count = GridSearchCV(pip_count, params_count, cv=4, n_jobs=-1, verbose=1)
gs_count.fit(X_train, y_train)

# Print the best cross-validation accuracy score and the corresponding hyperparameters
print(gs_count.best_score_)
print(gs_count.best_params_)

# Predict on the test data with the model configured with the best hyperparameter combination
count_y_pred = gs_count.predict(X_test)

# Run a parallelized grid search with 4-fold cross-validation for the naive Bayes model using TfidfVectorizer
gs_tfidf = GridSearchCV(pip_tfidf, params_tfidf, cv=4, n_jobs=-1, verbose=1)
gs_tfidf.fit(X_train, y_train)
# Print the best cross-validation accuracy score and the corresponding hyperparameters
print(gs_tfidf.best_score_)
print(gs_tfidf.best_params_)

# Predict on the test data with the model configured with the best hyperparameter combination
tfidf_y_pred = gs_tfidf.predict(X_test)

# Use pandas to format the predictions for submission
submission_count = pd.DataFrame({'id': test['id'], 'sentiment': count_y_pred})
submission_tfidf = pd.DataFrame({'id': test['id'], 'sentiment': tfidf_y_pred})

# Write the results to local disk
submission_count.to_csv('./submission_count.csv', index=False)
submission_tfidf.to_csv('./submission_tfidf.csv', index=False)

# Read the unlabeled data from local disk
unlabeled_train = pd.read_csv('./unlabeledTrainData.tsv', delimiter='\t', quoting=3)

# Load nltk's punkt tokenizer to split the English reviews into sentences
# (run nltk.download('punkt') once beforehand to fetch the tokenizer data)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define review_to_sentences to split each review into sentences
def review_to_sentences(review, tokenizer):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_text(raw_sentence, False))
    return sentences
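
# For example (hypothetical input): review_to_sentences('Great film. Bad ending.', tokenizer)
# returns [['great', 'film'], ['bad', 'ending']]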

corpora = []
# Prepare the sentence corpus used to train the word vectors
# (under Python 3 each review is already a str, so no .decode('utf8') is needed)
for review in unlabeled_train['review']:
    corpora += review_to_sentences(review, tokenizer)
# Configure the hyperparameters for training the word vector model
num_features = 300    # dimensionality of the word vectors
min_word_count = 20   # ignore words with fewer total occurrences
num_workers = 4       # number of worker threads
context = 10          # context window size
downsampling = 1e-3   # downsampling rate for frequent words

# Train the word vector model
# (this code targets gensim 3.x; gensim >= 4.0 renames 'size' to 'vector_size')
model = word2vec.Word2Vec(corpora, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                          sample=downsampling)
# Normalize the vectors in place and free training-only memory
model.init_sims(replace=True)
model_name = "./300features_20minwords_10context"
# The trained word vector model can be saved to local disk for long-term reuse
model.save(model_name)
# Read the trained word vector model back in directly
model = Word2Vec.load("./300features_20minwords_10context")

# Probe the training result of the word vector model
# (accessing vectors through model.wv works in both gensim 3 and 4)
print(model.wv.most_similar("man"))

# Define a function that builds a text feature vector from word vectors
def makeFeatureVec(words, model, num_features):
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    # Vocabulary of the trained model (model.wv.index_to_key in gensim >= 4.0)
    index2word_set = set(model.wv.index2word)
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    # Average the word vectors; guard against reviews with no in-vocabulary words
    if nwords > 0.:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

# Define another function that converts each review into an averaged word vector feature
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")

    for review in reviews:
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter += 1
    return reviewFeatureVecs

# Build the new word-vector-based training and test feature matrices
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_to_text(review, remove_stopwords=True))
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)
clean_test_reviews = []
for review in test['review']:
    clean_test_reviews.append(review_to_text(review, remove_stopwords=True))
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)

gbc = GradientBoostingClassifier()

# Configure the hyperparameter grid for the search
params_gdc = {'n_estimators': [10, 100, 500], 'learning_rate': [0.01, 0.1, 1.0], 'max_depth': [2, 3, 4]}
gs = GridSearchCV(gbc, params_gdc, cv=4, n_jobs=-1, verbose=1)
gs.fit(trainDataVecs, y_train)

# Print the best score and the optimal hyperparameter combination found by the grid search
print(gs.best_score_)
print(gs.best_params_)

# Predict on the test vectors with the hyperparameter-tuned gradient boosting model
result = gs.predict(testDataVecs)
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("./submission_w2v.csv", index=False, quoting=3)
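
Because the test set sentiments are only scored after submission, a quick local sanity check of the tuned model is useful. Below is a minimal sketch, assuming trainDataVecs, y_train, and the fitted grid search gs from the script above; it holds out a quarter of the training vectors and measures accuracy on them.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 25% of the training vectors for validation
X_tr, X_val, y_tr, y_val = train_test_split(trainDataVecs, y_train, test_size=0.25, random_state=33)

# Refit a classifier with the best hyperparameters found by the grid search
clf = GradientBoostingClassifier(**gs.best_params_)
clf.fit(X_tr, y_tr)
print(accuracy_score(y_val, clf.predict(X_val)))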

 
