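Example: shopping reviews from an online shopping site ("某寶"); the workbook has one sheet of positive reviews and one of negative reviews
Positive reviews
import pandas as pd
df = pd.read_excel("F:/文本大數據/購物評論.xlsx",sheet_name="正向",header=None)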
df
Label as 1
df['class']=1
df.head()
Negative reviews
df1 = pd.read_excel("F:/文本大數據/購物評論.xlsx",sheet_name="負向",header=None)
df1
Label as 0
df1['class']=0
df1.head()
Merge
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df2 = pd.concat([df,df1],ignore_index=True)
df2
Tokenization and preprocessing
import jieba
# Segment each review with jieba and join the tokens with spaces
cuttxt = lambda x:" ".join(jieba.lcut(x))
df2["segment"]=df2[0].apply(cuttxt)
df2.head()
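Joining the tokens with spaces matters because the vectorizer used in the next step splits text with a regular expression, not with a Chinese segmenter. The sketch below uses only the standard library to show what scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`) keeps from a space-joined string. The sample sentence is made up for illustration and is not taken from the review data.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# it keeps only tokens made of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical pre-segmented review, in the same space-joined form as df2["segment"].
segmented = "質量 很 好 非常 滿意"

kept = re.findall(DEFAULT_TOKEN_PATTERN, segmented)
dropped = [tok for tok in segmented.split() if tok not in kept]

print(kept)     # tokens with two or more characters
print(dropped)  # single-character tokens
```

Single-character tokens are dropped under that default. If such words carry sentiment in your data, one option is to pass `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` so that they are kept.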
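Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words model: keep the fitted vectorizer in countvec so it can transform new text later
countvec = CountVectorizer()
X = countvec.fit_transform(df2["segment"])  # sparse document-term count matrix
X
from sklearn.model_selection import train_test_split
# Hold out part of the data for evaluation; test_size=0 is not a valid split and leaves no test set.
# The 0.2 fraction and random_state=0 are illustrative choices.
x_train,x_test,y_train,y_test = train_test_split(X,df2['class'],test_size=0.2,random_state=0)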
x_train
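To make the document-term matrix concrete, here is a minimal hand-rolled sketch of the bag-of-words idea using only the standard library. The three documents are invented for illustration, and the result is a dense list of lists, whereas `CountVectorizer` returns a sparse matrix and applies its own token filtering.

```python
from collections import Counter

# Hypothetical space-joined documents, standing in for df2["segment"].
docs = ["物流 很快 質量 不錯", "質量 太差 物流 太慢", "不錯 不錯 非常 滿意"]

# Vocabulary: every distinct token, sorted so each one gets a stable column index.
vocabulary = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocabulary)}

# One row per document, one column per vocabulary entry, cell = token count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(tok, 0) for tok in vocabulary])

for row in matrix:
    print(row)
```

Each row is the feature vector a classifier sees for one review: word order is discarded and only the counts remain.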
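from sklearn.svm import SVC
clf = SVC()# support vector machine
clf.fit(x_train,y_train)
clf.score(x_train,y_train)# accuracy on the training data
This step takes a fairly long time, so be patient!
To improve accuracy, you can adjust the model's parameters.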
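The original does not say which parameters to change. As one possible approach, the sketch below searches a small grid of `SVC` settings with cross-validation using scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` and the grid values are assumptions made for illustration; they are not the review data and not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the review features (assumption: not the original data).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid; the values are illustrative only.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

On the real data, the same pattern would be `search.fit(x_train, y_train)`. Expect the run time to grow with the number of grid combinations, since each one is fitted once per fold.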
Logistic regression model
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(x_train,y_train)
logistic.score(x_train,y_train)
Model evaluation
SVM
from sklearn.metrics import classification_report
print(classification_report(y_test,clf.predict(x_test)))
s = """輸入你想預測的評論"""# placeholder: replace with the review text to classify
s_seg = " ".join(jieba.lcut(s))
s_seg_vec = countvec.transform([s_seg])
result = clf.predict(s_seg_vec)
result
Logistic regression
from sklearn.metrics import classification_report
print(classification_report(y_test,logistic.predict(x_test)))
s = """輸入你想預測的評論"""# placeholder: replace with the review text to classify
s_seg = " ".join(jieba.lcut(s))
s_seg_vec = countvec.transform([s_seg])
result = logistic.predict(s_seg_vec)
result
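The reports printed by `classification_report` list precision, recall, and F1 for each class. As a reading aid, the sketch below computes those three numbers by hand with the standard library. The helper name `precision_recall_f1` and the label lists are made up for illustration and are not model output.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed instances of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

for label in (1, 0):
    p, r, f = precision_recall_f1(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

Precision answers "of the reviews predicted as this class, how many were right", while recall answers "of the reviews that truly belong to this class, how many were found".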