English Text Classification: Movie Review Sentiment Analysis

Original

Asia-Lee

2020-05-31 14:37

1. Import the required libraries

import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import nltk
from nltk.corpus import stopwords

2. Read the training data with Pandas

# Read the training data with pandas
datafile=os.path.join('E:\\english_data','labeledTrainData.tsv')
df=pd.read_csv(datafile,sep='\t',escapechar='\\')
print('Number of reviews:{}'.format(len(df)))
df.head()

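3. Build the stop word list

# Alternative: filter with NLTK's built-in list instead of a file, e.g.
# words_nostop=[w for w in words if w not in stopwords.words('english')]
# Load one stop word per line; a set gives fast membership tests.
# (The name `stopwords` is not reused here, so the NLTK import is not shadowed.)
with open('E:\\english_data\\stopwords.txt') as f:
    eng_stopwords={line.strip() for line in f}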

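4. Preprocess the data

(1) Remove HTML tags

(2) Remove punctuation (the regex below drops every non-letter character, digits included)

(3) Split each sentence into words

(4) Remove stop words

(5) Rejoin the remaining words into a new sentence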

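def clean_text(text):
    # (1) Strip HTML tags and keep only the text
    text=BeautifulSoup(text,'html.parser').get_text()
    # (2) Replace every non-letter character with a space
    text=re.sub('[^a-zA-Z]',' ',text)
    # (3) Lowercase and split on whitespace
    words=text.lower().split()
    # (4) Drop stop words
    words=[w for w in words if w not in eng_stopwords]
    # (5) Rejoin the remaining words into one string
    return ' '.join(words)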
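
To see what the five preprocessing steps do to a single review, here is a small dependency-free sketch. It is not the code from this post: it swaps BeautifulSoup for the standard library's `html.parser`, and the names `strip_tags`, `clean_text_demo`, and `demo_stopwords`, as well as the sample review, are made up for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    # Stand-in for BeautifulSoup(...).get_text() in this sketch
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# Hypothetical tiny stop word list, only for this demo
demo_stopwords = {'this', 'is', 'a', 'the', 'it', 'i'}


def clean_text_demo(text, stop_words):
    text = strip_tags(text)                              # (1) remove HTML tags
    text = re.sub('[^a-zA-Z]', ' ', text)                # (2) non-letters become spaces
    words = text.lower().split()                         # (3) lowercase and tokenize
    words = [w for w in words if w not in stop_words]    # (4) drop stop words
    return ' '.join(words)                               # (5) rejoin


sample = 'This is a <b>great</b> movie!<br />I loved it.'
cleaned = clean_text_demo(sample, demo_stopwords)
print(cleaned)  # great movie loved
```

Note that step (2) is what keeps "movie!" and "I" apart once the `<br />` tag is gone: the exclamation mark is replaced by a space rather than deleted.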

5. Add the cleaned data to the DataFrame

df['clean_review']=df.review.apply(clean_text)
df.head()

6. Compute a vector for each review in the training set

(1) Extract bag-of-words features with sklearn's CountVectorizer

vectorizer=CountVectorizer(max_features=5000)
train_data_features=vectorizer.fit_transform(df.clean_review).toarray()
train_data_features.shape
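
For intuition about what a bag-of-words matrix with a `max_features` cap contains, the following pure-Python sketch builds one by hand. It is a simplified illustration, not scikit-learn's implementation: the function `bag_of_words`, the toy documents, and the alphabetical tie-breaking rule are assumptions made for this example, and the real CountVectorizer also applies its own tokenization.

```python
from collections import Counter


def bag_of_words(docs, max_features):
    """Return (vocabulary, count matrix) for whitespace-tokenized docs."""
    # Total frequency of every word across the whole corpus
    totals = Counter(w for doc in docs for w in doc.split())
    # Keep the most frequent words; ties are broken alphabetically
    # (a choice made for this sketch only)
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    vocab = sorted(top)
    # One row per document, one column per vocabulary word
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows


docs = ['great movie loved', 'boring movie', 'great great acting']
vocab, matrix = bag_of_words(docs, max_features=3)
print(vocab)   # ['acting', 'great', 'movie']
print(matrix)  # [[0, 1, 1], [0, 0, 1], [1, 2, 0]]
```

Each row plays the role of one row of `train_data_features`: word order is discarded and only per-word counts for the retained vocabulary remain.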

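(2) Train a word embedding model with Gensim's Word2Vec

from gensim.models.word2vec import Word2Vec

# Word2Vec expects an iterable of token lists. The original post does not
# define `sentences`; building it from the cleaned reviews is an assumption.
sentences = [review.split() for review in df.clean_review]

# Word2Vec training parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# gensim >= 4.0 names the dimensionality argument vector_size;
# on gensim < 4.0 pass size=num_features instead.
model = Word2Vec(sentences, workers=num_workers,
            vector_size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
# (On gensim >= 4.0 this call is deprecated and no longer needed.)
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
# `model_name` is not defined in the original post; this value is a placeholder.
model_name = 'word2vec_300features'
os.makedirs(os.path.join('..', 'models'), exist_ok=True)
model.save(os.path.join('..', 'models', model_name))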

7. Build and train a random forest classifier

forest=RandomForestClassifier(n_estimators=100)
forest=forest.fit(train_data_features,df.sentiment)

# Delete variables that are no longer needed to free memory
del df
del train_data_features
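
The imports in step 1 include `confusion_matrix`, but the post never calls it. One way it could be used is to hold out part of the labeled data and check the classifier before predicting on the unlabeled test set. The sketch below shows that idea on made-up toy data; the reviews, labels, variable names, and split settings are all assumptions for illustration, not results or code from this post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df.clean_review and df.sentiment
reviews = ['great movie loved', 'wonderful acting great story',
           'boring plot terrible acting', 'awful movie hated',
           'loved wonderful story', 'terrible boring awful',
           'great wonderful loved', 'hated boring plot']
labels = [1, 1, 0, 0, 1, 0, 1, 0]

# Hold out 25% of the labeled reviews, keeping both classes represented
train_text, val_text, y_train, y_val = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the training split only, then reuse it for validation
demo_vectorizer = CountVectorizer(max_features=5000)
X_train = demo_vectorizer.fit_transform(train_text).toarray()
X_val = demo_vectorizer.transform(val_text).toarray()

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)
demo_forest.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
cm = confusion_matrix(y_val, demo_forest.predict(X_val), labels=[0, 1])
print(cm)
```

With the real data, the same pattern would need to run before `df` and `train_data_features` are deleted, since both are required to build the split.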

8. Read the test data and make predictions

datafile=os.path.join('E:\\english_data','testData.tsv')
df=pd.read_csv(datafile,sep='\t',escapechar='\\')
print('Number of reviews:{}'.format(len(df)))
df['clean_review']=df.review.apply(clean_text)
df.head()

test_data_features=vectorizer.transform(df.clean_review).toarray()
test_data_features.shape

result=forest.predict(test_data_features)
output=pd.DataFrame({'id':df.id,'sentiment':result})
output.head()

9. Write the predictions to a CSV file

output.to_csv(os.path.join('E:\\english_data','Bag_of_Words_model.csv'),index=False)


del df
del test_data_features
