[Python Machine Learning and Practice] Hands-On Project: Estimating IMDB Movie Review Sentiment


IMDB Movie Review Sentiment Estimation

The task is to analyze reviews posted on a movie review website and determine the sentiment polarity of each review.

The IMDB Movie Review Dataset

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

File descriptions

  • labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows, each containing an id, a sentiment label, and the review text.
  • testData - The test set. The tab-delimited file has a header row followed by 25,000 rows, each containing an id and the review text. The task is to predict the sentiment of each of these reviews.
  • unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows, each containing an id and the review text.
  • sampleSubmission - A comma-delimited sample submission file in the correct format.

Data fields

  • id - Unique ID of each review.
  • sentiment - Sentiment of the review: 1 for positive reviews, 0 for negative reviews.
  • review - Text of the review.
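
As a quick sanity check before modeling, the files can be loaded and inspected directly (a minimal sketch; the file names follow the descriptions above, and quoting=3, i.e. csv.QUOTE_NONE, disables quote handling so double quotes inside the review text do not confuse the parser):

import pandas as pd

train = pd.read_csv('./labeledTrainData.tsv', delimiter='\t', quoting=3)
print(train.shape)                        # 25,000 rows, 3 columns
print(train.columns.tolist())             # ['id', 'sentiment', 'review']
print(train['sentiment'].value_counts())  # count of positive (1) and negative (0) labels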

Model Building

We use Scikit-learn's naive Bayes classifier, together with the gradient boosting classifier from the ensemble module, to perform sentiment analysis on the movie review text.

For the naive Bayes models, each review is turned into a feature vector with the bag-of-words approach, using CountVectorizer and TfidfVectorizer respectively. For the gradient boosting classifier, word vectors are first trained on the unlabeled reviews, and each review is then represented by the average of the vectors of all its words.
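
Before walking through the full script, here is a minimal, self-contained sketch of the two feature-extraction ideas on made-up toy sentences (the sentences and word vectors below are illustrative only, not part of the actual pipeline):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_reviews = ["the movie was great", "the movie was terrible"]

# Bag of words: each review becomes a vector of term counts (or tf-idf weights)
count_vec = CountVectorizer(analyzer='word')
print(count_vec.fit_transform(toy_reviews).toarray())  # raw term counts
print(sorted(count_vec.vocabulary_))                   # the learned vocabulary

tfidf_vec = TfidfVectorizer(analyzer='word')
print(tfidf_vec.fit_transform(toy_reviews).toarray())  # tf-idf weighted counts

# Averaged word vectors: each review becomes the mean of its words' vectors.
# Toy 3-dimensional vectors stand in for a trained word2vec model.
toy_vectors = {'movie': np.array([0.1, 0.2, 0.3]), 'great': np.array([0.4, 0.0, 0.1])}
review_words = ['movie', 'great']
print(np.mean([toy_vectors[w] for w in review_words], axis=0))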

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : IMDB.py
@Author: Xinzhe.Pang
@Date  : 2019/7/27 21:53
@Desc  : 
"""
import pandas as pd
import re
import numpy as np
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
# Import the text feature extractors CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Import the naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Import Pipeline to conveniently chain the processing steps
from sklearn.pipeline import Pipeline
# Import GridSearchCV for grid search over hyperparameter combinations
from sklearn.model_selection import GridSearchCV
# Import nltk.data
import nltk.data
# Import word2vec from gensim.models
from gensim.models import word2vec
# Import GradientBoostingClassifier from sklearn.ensemble for review sentiment analysis
from sklearn.ensemble import GradientBoostingClassifier
# Import Word2Vec so a previously trained word-vector model can be loaded directly
from gensim.models import Word2Vec

# Read the training and test data
train = pd.read_csv('./labeledTrainData.tsv', delimiter='\t')
test = pd.read_csv('./testData.tsv', delimiter='\t')

# Inspect the first few rows of the training data
print(train.head())
# Inspect the first few rows of the test data
print(test.head())

# Download the NLTK stopword list once (a no-op if it is already installed)
nltk.download('stopwords')

# Define review_to_text, which performs three preprocessing steps on a raw review
def review_to_text(review, remove_stopwords):
    # Step 1: remove HTML markup
    raw_text = BeautifulSoup(review, 'lxml').get_text()
    # Step 2: remove non-letter characters
    letters = re.sub('[^a-zA-Z]', ' ', raw_text)
    words = letters.lower().split()
    # Step 3: if remove_stopwords is set, also remove English stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        words = [w for w in words if w not in stop_words]
    # Return the list of words left after these three preprocessing steps
    return words
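
# Usage sketch (the review string below is made up, for illustration only):
#   review_to_text('<br />This movie was GREAT!', True) -> ['movie', 'great']
#   ("this" and "was" are dropped as stopwords; the <br /> tag is stripped)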

# Apply the three preprocessing steps above to the raw training and test sets
X_train = []
for review in train['review']:
    X_train.append(' '.join(review_to_text(review, True)))
X_test = []
for review in test['review']:
    X_test.append(' '.join(review_to_text(review, True)))
y_train = train['sentiment']

# Use Pipeline to build two naive Bayes classifiers that differ only in the text feature extractor: one uses CountVectorizer, the other TfidfVectorizer
pip_count = Pipeline([('count_vec', CountVectorizer(analyzer='word')), ('mnb', MultinomialNB())])
pip_tfidf = Pipeline([('tfidf_vec', TfidfVectorizer(analyzer='word')), ('mnb', MultinomialNB())])

# Configure the hyperparameter grids for each model
# (GridSearchCV addresses a pipeline step's parameters as <step_name>__<param_name>)
params_count = {'count_vec__binary': [True, False], 'count_vec__ngram_range': [(1, 1), (1, 2)],
                'mnb__alpha': [0.1, 1.0, 10.0]}
params_tfidf = {'tfidf_vec__binary': [True, False], 'tfidf_vec__ngram_range': [(1, 1), (1, 2)],
                'mnb__alpha': [0.1, 1.0, 10.0]}

# Use 4-fold cross-validation to run a parallelized grid search over the naive Bayes model that uses CountVectorizer
gs_count = GridSearchCV(pip_count, params_count, cv=4, n_jobs=-1, verbose=1)
gs_count.fit(X_train, y_train)

# Print the best cross-validation accuracy and the corresponding hyperparameter combination
print(gs_count.best_score_)
print(gs_count.best_params_)

# Configure the model with the best hyperparameter combination and predict on the test data
count_y_pred = gs_count.predict(X_test)

# Use 4-fold cross-validation to run a parallelized grid search over the naive Bayes model that uses TfidfVectorizer
gs_tfidf = GridSearchCV(pip_tfidf, params_tfidf, cv=4, n_jobs=-1, verbose=1)
gs_tfidf.fit(X_train, y_train)
# Print the best cross-validation accuracy and the corresponding hyperparameter combination
print(gs_tfidf.best_score_)
print(gs_tfidf.best_params_)

# Configure the model with the best hyperparameter combination and predict on the test data
tfidf_y_pred = gs_tfidf.predict(X_test)

# Use pandas to format the results for submission
submission_count = pd.DataFrame({'id': test['id'], 'sentiment': count_y_pred})
submission_tfidf = pd.DataFrame({'id': test['id'], 'sentiment': tfidf_y_pred})

# Write the results to local disk
submission_count.to_csv('./submission_count.csv', index=False)
submission_tfidf.to_csv('./submission_tfidf.csv', index=False)

# Read the unlabeled data from local disk
unlabeled_train = pd.read_csv('./unlabeledTrainData.tsv', delimiter='\t', quoting=3)

# Use NLTK's tokenizer to split the English reviews into sentences
nltk.download('punkt')  # fetch the punkt sentence tokenizer once (a no-op if already installed)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define review_to_sentences, which splits each review into sentences and preprocesses each one
def review_to_sentences(review, tokenizer):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_text(raw_sentence, False))
    return sentences

corpora = []
# Prepare the corpus used to train the word vectors
# (pandas already yields unicode strings, so no explicit .decode() is needed)
for review in unlabeled_train['review']:
    corpora += review_to_sentences(review, tokenizer)
# Configure the hyperparameters of the word-vector model
num_features = 300      # dimensionality of the word vectors
min_word_count = 20     # ignore words that appear fewer than 20 times
num_workers = 4         # number of worker threads used for training
context = 10            # context window size
downsampling = 1e-3     # downsampling rate for very frequent words

# Train the word-vector model
# (this follows the older gensim API, 3.x and earlier; in gensim 4+ the `size`
# argument is named `vector_size` and lookups such as most_similar move to model.wv)
model = word2vec.Word2Vec(corpora, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                          sample=downsampling)
model.init_sims(replace=True)
model_name = "./300features_20minwords_10context"
# The trained word-vector model can be saved to local disk for later reuse
model.save(model_name)
model = Word2Vec.load("./300features_20minwords_10context")

# Take a quick look at what the trained word-vector model has learned
print(model.most_similar("man"))

# Define a function that builds a text feature vector from word vectors
def makeFeatureVec(words, model, num_features):
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    index2word_set = set(model.index2word)
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    featureVec = np.divide(featureVec, nwords)
    return featureVec
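
# Note (an edge case not handled above): if none of a review's words are in the
# model vocabulary, nwords stays 0 and the division yields NaN values; with
# min_count=20 over 50,000 reviews this is unlikely, but worth keeping in mind.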

# Define another function that converts each review into a word-vector-based feature vector (the average of its word vectors)
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")

    for review in reviews:
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter += 1
    return reviewFeatureVecs

# Prepare the new word-vector-based training and test feature vectors
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_to_text(review, remove_stopwords=True))
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)
clean_test_reviews = []
for review in test['review']:
    clean_test_reviews.append(review_to_text(review, remove_stopwords=True))
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)

gbc = GradientBoostingClassifier()

# Configure the hyperparameter search grid
params_gbc = {'n_estimators': [10, 100, 500], 'learning_rate': [0.01, 0.1, 1.0], 'max_depth': [2, 3, 4]}
gs = GridSearchCV(gbc, params_gbc, cv=4, n_jobs=-1, verbose=1)
gs.fit(trainDataVecs, y_train)

# Print the best score and the best hyperparameter combination found by the grid search
print(gs.best_score_)
print(gs.best_params_)

# Use the tuned gradient boosting tree model to predict on the test data
result = gs.predict(testDataVecs)
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("./submission_w2v.csv", index=False, quoting=3)

 
