Tieba Comment Sensitive-Word Detection and Sentiment Analysis, a Basic Implementation: Sentiment Analysis

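This study of sensitive-word detection and sentiment analysis for Tieba comments is implemented as three modules: “comment crawling”, “comment sensitive-word detection”, and “comment sentiment analysis (positive or negative)”. The data is stored in MongoDB, in a database named “spiders” and a collection named users. The other two modules are covered in my other blog posts.
This basic implementation covers only the fundamentals; the code has not been upgraded and the individual modules have not been technically refined.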

Comment Sentiment Analysis

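There are currently two approaches to analyzing the sentiment orientation of short texts: one based on dictionary matching and one based on machine learning.
Dictionary matching splits the text under test into sentences, finds the sentiment words, degree words, negation words, and so on in each sentence, and computes a sentiment score for each sentence. Dictionary matching applies to a wider range of corpora, but it is limited by the richness of semantic expression.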
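As a rough illustration of the dictionary-matching idea, the sketch below scores one pre-tokenized sentence. The three toy lexicons, their weights, and the rule that modifiers apply to the next sentiment word are all assumptions made for this example; they are not taken from any particular sentiment dictionary.

```python
# Hypothetical toy lexicons; a real system would load much larger dictionaries.
SENTIMENT = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
DEGREE = {"very": 2.0, "slightly": 0.5}
NEGATION = {"not", "never"}

def sentence_score(tokens):
    """Score one tokenized sentence using sentiment, degree, and negation words."""
    score = 0.0
    weight = 1.0  # multiplier accumulated from degree and negation words
    for tok in tokens:
        if tok in DEGREE:
            weight *= DEGREE[tok]
        elif tok in NEGATION:
            weight *= -1.0
        elif tok in SENTIMENT:
            score += weight * SENTIMENT[tok]
            weight = 1.0  # modifiers only affect the next sentiment word
    return score

example = sentence_score(["not", "very", "good"])
```

A sentence with a positive score would be read as positive and one with a negative score as negative. Sarcasm and implicit sentiment are not captured by this kind of matching, which is the limitation noted above.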
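Sentiment analysis with Python and machine learning selects a set of positive texts and a set of negative texts, trains on them with a machine learning method to obtain a sentiment classifier, and then uses that classifier to sort all texts into positive or negative.
This module aims to run sentiment analysis on all user comments and decide whether each one is positive or negative, using machine learning to judge the sentiment orientation of the text under test. Because the task is to predict an outcome from a given input, and example input/output pairs are available, it is supervised binary classification, with the class labels neg (negative) and pos (positive). The workflow of supervised machine learning is shown in the figure.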

Module Implementation

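(1) Processing the corpus
The corpus contains 10,000 reviews of wine-purchasing experiences. All reviews are sorted into “good.txt” and “bad.txt”; that is, each text is manually assigned a class label.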
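(2) Feature extraction
Four feature extraction methods:

Single characters as features
Two-character pairs (words) as features
Single characters plus two-character pairs as features
Words produced by jieba segmentation as features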
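To make the first three options concrete, here is a minimal sketch on one short sample string. The helper names are hypothetical, and the sketch simply enumerates adjacent character pairs; it does not do the chi-square ranking or the jieba segmentation used later in this post.

```python
def char_unigrams(text):
    """Every single character is one feature."""
    return list(text)

def char_bigrams(text):
    """Every pair of adjacent characters is one feature."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def to_features(items):
    """Represent the presence of each item as a {feature: True} dict."""
    return {item: True for item in items}

sample = "包裝很仔細"
unigrams = char_unigrams(sample)             # single characters
bigrams = char_bigrams(sample)               # two-character pairs
combined = to_features(unigrams + bigrams)   # single characters plus pairs
```

The fourth option replaces these fixed-width windows with the tokens produced by a word segmenter, which usually gives fewer but more meaningful features.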

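(3) Feature dimensionality reduction

All characters as features
Two-character pairs as features, using the chi-square statistic to select the top n most informative pairs
Single characters and two-character pairs together as features, using the chi-square statistic to select the top n most informative single characters and pairs
jieba segmentation, using the chi-square statistic to select the top n most informative words as features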
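The chi-square selection can be illustrated with a small worked example. The sketch below computes the standard 2x2 chi-square statistic for one word against one class from four counts, then keeps the top n words by score. The function names and the toy counts are assumptions made for illustration.

```python
def chi_square(word_in_class, word_total, class_total, grand_total):
    """Chi-square of the 2x2 table (word vs. other words, class vs. other class)."""
    a = word_in_class                                           # word, in class
    b = word_total - word_in_class                              # word, other class
    c = class_total - word_in_class                             # other words, in class
    d = grand_total - word_total - class_total + word_in_class  # other words, other class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return grand_total * (a * d - b * c) ** 2 / denom

def top_n(scores, n):
    """Return the n words with the highest scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Toy counts: 100 positive tokens and 100 negative tokens in total.
# "不錯" appears 8 times in positive text and 2 times in negative text;
# "酒" appears 5 times in each, so it carries no class information.
scores = {
    "不錯": chi_square(8, 10, 100, 200) + chi_square(2, 10, 100, 200),
    "酒": chi_square(5, 10, 100, 200) + chi_square(5, 10, 100, 200),
}
selected = top_n(scores, 1)
```

A word that is spread evenly across both classes scores zero, while a word concentrated in one class scores high, which is why the top n words make useful features.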

(4) Feature representation

def text():
    """Read good.txt and bad.txt and return their contents as one string."""
    with open('good.txt', 'r', encoding='utf-8') as f1, \
         open('bad.txt', 'r', encoding='utf-8') as f2:
        return f1.read() + f2.read()

# Single characters as features
def bag_of_words(words):
    return {word: True for word in words}
# print(bag_of_words(text()))

# Treat two-character pairs (words) as collocations and use the chi-square statistic to keep the top 1000
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def bigram(words, score_fn=BigramAssocMeasures.chi_sq, n=1000):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)  # rank pairs by chi-square and keep the top n
    newBigrams = [u + v for (u, v) in bigrams]
    return bag_of_words(newBigrams)
# print(bigram(text(), score_fn=BigramAssocMeasures.chi_sq, n=1000))

# Use single characters and two-character pairs together as features
def bigram_words(words, score_fn=BigramAssocMeasures.chi_sq, n=1000):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    newBigrams = [u + v for (u, v) in bigrams]
    a = bag_of_words(words)
    b = bag_of_words(newBigrams)
    a.update(b)  # merge dict b into dict a
    return a  # all single characters plus the selected pairs
# print(bigram_words(text(), score_fn=BigramAssocMeasures.chi_sq, n=1000))

import jieba
# Return one token list per review, e.g. [['我','愛','北京','天安門'],['你','好'],['hello']]
def readfile(filename):
    with open('stop.txt', 'r', encoding='utf-8') as f:
        stop = set(line.strip() for line in f)   # stop words
    tokens_per_line = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            s = line.split('\t')
            fenci = jieba.cut(s[0], cut_all=False)
            # Deduplicate tokens and drop stop words, the BOM, and newlines
            tokens_per_line.append(list(set(fenci) - stop - set(['\ufeff', '\n'])))
    return tokens_per_line

from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures
# Select the most informative features (the top `number`) by chi-square
def jieba_feature(number):
    posWords = [item for items in readfile('good.txt') for item in items]
    negWords = [item for items in readfile('bad.txt') for item in items]
    word_fd = FreqDist()                   # frequency of every word overall
    con_word_fd = ConditionalFreqDist()    # word frequency within positive and within negative texts
    for word in posWords:
        word_fd[word] += 1
        con_word_fd['pos'][word] += 1
    for word in negWords:
        word_fd[word] += 1
        con_word_fd['neg'][word] += 1
    pos_word_count = con_word_fd['pos'].N()    # number of positive words
    neg_word_count = con_word_fd['neg'].N()    # number of negative words
    total_word_count = pos_word_count + neg_word_count
    # A word's information score is its positive chi-square plus its negative chi-square
    word_scores = {}
    for word, freq in word_fd.items():
        pos_score = BigramAssocMeasures.chi_sq(con_word_fd['pos'][word],
                                               (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(con_word_fd['neg'][word],
                                               (freq, neg_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score
    # Sort once, after every word has been scored, and keep the top `number`
    best_vals = sorted(word_scores.items(), key=lambda item: item[1],
                       reverse=True)[:number]
    best_words = set(w for w, s in best_vals)
    return {word: True for word in best_words}
# Build the data format needed for training:
# [[{'買': 'True', '京東': 'True', '物流': 'True', '包裝': 'True', '\n': 'True', '很快': 'True', '不錯': 'True', '酒': 'True', '正品': 'True', '感覺': 'True'},  'pos'],
# [{'買': 'True', '\n':  'True', '葡萄酒': 'True', '活動': 'True', '澳洲': 'True'}, 'pos'],
# [{'\n': 'True', '價格': 'True'}, 'pos']]
def build_features():
    # feature = bag_of_words(text())
    # feature = bigram(text(), score_fn=BigramAssocMeasures.chi_sq, n=900)
    # feature = bigram_words(text(), score_fn=BigramAssocMeasures.chi_sq, n=900)
    feature = jieba_feature(1000)  # jieba segmentation
    posFeatures = []
    for items in readfile('good.txt'):
        a = {}
        for item in items:
            if item in feature:
                a[item] = 'True'
        posWords = [a, 'pos']  # label positive texts as "pos"
        posFeatures.append(posWords)
    negFeatures = []
    for items in readfile('bad.txt'):
        a = {}
        for item in items:
            if item in feature:
                a[item] = 'True'
        negWords = [a, 'neg']  # label negative texts as "neg"
        negFeatures.append(negWords)
    return posFeatures, negFeatures

(5) Splitting into training and test sets

from random import shuffle
posFeatures, negFeatures = build_features()
shuffle(posFeatures)    # randomize the order of the texts
shuffle(negFeatures)
train = posFeatures[200:] + negFeatures[200:]   # only appropriate for 1000 records; 80/20 rule
test = posFeatures[:200] + negFeatures[:200]
data, tag = zip(*test)    # separate test data from test labels for evaluation
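Because the hard-coded 200 only suits one corpus size, a ratio-based split may be easier to reuse. The following is a minimal sketch; the function name, the test_ratio and seed parameters, and the toy data are assumptions for illustration and are not part of the original code.

```python
import random

def split_features(pos, neg, test_ratio=0.2, seed=None):
    """Shuffle each class and hold out test_ratio of it for testing."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)   # copy so the caller's lists are untouched
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos = int(len(pos) * test_ratio)
    n_neg = int(len(neg) * test_ratio)
    train = pos[n_pos:] + neg[n_neg:]
    test = pos[:n_pos] + neg[:n_neg]
    return train, test

# Toy feature sets in the same [features, label] shape that build_features() returns
toy_pos = [[{"不錯": "True"}, "pos"] for _ in range(10)]
toy_neg = [[{"假貨": "True"}, "neg"] for _ in range(10)]
toy_train, toy_test = split_features(toy_pos, toy_neg, test_ratio=0.2, seed=0)
```

Splitting each class separately keeps the positive and negative proportions the same in the training and test sets.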

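(6) Selecting the best classifier
Compare the accuracy score to select the best feature extraction method, the best feature dimension, and the best classification algorithm.
First fix the feature dimension at 1000 and run the experiments below (the table data is not accurate); each cell can be the average of 5 runs.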

Classifier	bag_of_words	bigram	bigram_words	jieba_feature
Bernoulli naive Bayes
Multinomial naive Bayes
Logistic regression
SVC
LinearSVC
NuSVC

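The table gives the best combination of classification algorithm and feature extraction method, for example jieba segmentation plus chi-square selection with logistic regression. Then, with the feature extraction method and the algorithm fixed, vary the feature dimension and compare scores to find the best dimension. For example (jieba segmentation + chi-square + logistic regression):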

Feature dimension	100	600	1100	1600	2100	2600	3100
Run 1	0.9075	0.9325	0.9475	0.9400	0.9350	0.9375	0.9550
Run 2	0.9150	0.9600	0.9375	0.9550	0.9425	0.9600	0.9350
Run 3	0.9250	0.9475	0.9550	0.9625	0.9550	0.9525	0.9525
Run 4	0.9075	0.9300	0.9625	0.9475	0.9450	0.9500	0.9500
Run 5	0.9175	0.9550	0.9375	0.9550	0.9525	0.9375	0.9525
Average	0.9145	0.9450	0.9480	0.9520	0.9460	0.9475	0.9490
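The averaging and the choice of dimension can be reproduced from the table with a few lines. This is only a sketch of the bookkeeping; the variable names are hypothetical, and the scores are copied from the five runs above.

```python
from statistics import mean

# Accuracy of each of the 5 runs, keyed by feature dimension (from the table)
runs = {
    100:  [0.9075, 0.9150, 0.9250, 0.9075, 0.9175],
    600:  [0.9325, 0.9600, 0.9475, 0.9300, 0.9550],
    1100: [0.9475, 0.9375, 0.9550, 0.9625, 0.9375],
    1600: [0.9400, 0.9550, 0.9625, 0.9475, 0.9550],
    2100: [0.9350, 0.9425, 0.9550, 0.9450, 0.9525],
    2600: [0.9375, 0.9600, 0.9525, 0.9500, 0.9375],
    3100: [0.9550, 0.9350, 0.9525, 0.9500, 0.9525],
}

# Average the runs per dimension, then pick the dimension with the highest average
averages = {dim: round(mean(scores), 4) for dim, scores in runs.items()}
best_dim = max(averages, key=averages.get)
```

With these numbers the highest average is at dimension 1600, although the averages from 600 upward differ by well under one percentage point, so more runs would be needed to separate them reliably.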

from nltk.classify.scikitlearn import SklearnClassifier
def score(classifier):
    classifier = SklearnClassifier(classifier)
    classifier.train(train)
    pred = classifier.classify_many(data)
    # Accuracy: fraction of test items whose predicted label matches the true label
    n = sum(1 for p, t in zip(pred, tag) if p == t)
    return n / len(pred)
# Compare the accuracy score across experiments to find the best feature
# extraction method, the best feature dimension, and the best classification algorithm
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
# print("LogisticRegression's accuracy is %f" % score(LogisticRegression()))

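(7) Processing the input text
Convert the comment text into a format the classifier can predict on. Take, for example, “京東的物流是沒得說了，很快。這次買酒水類，包裝很仔細，沒有出現意外。酒到手了，絕對是正品。感覺很不錯喲！” (roughly: JD's delivery is beyond reproach, very fast. This time I bought drinks, the packaging was careful, and nothing went wrong. The wine arrived and is definitely genuine. It feels great!)
With jieba segmentation as the feature extraction method, it is represented as
['很快', '正品', '不錯', '類', '沒得說', '感覺', '仔細', '酒', '酒水', '物流', '出現意外', '包裝', '京東', '到手', '買'],
and after feature dimensionality reduction (chi-square) and feature representation, its predictable format is
{'物流': 'True', '到手': 'True', '仔細': 'True', '感覺': 'True', '很快': 'True', '京東': 'True', '正品': 'True', '酒水': 'True', '不錯': 'True', '出現意外': 'True', '沒得說': 'True', '類': 'True', '酒': 'True', '包裝': 'True', '買': 'True'}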

# Convert an input comment into the format the classifier can predict on
def build_page(page):
    # Four feature extraction methods
    # feature1 = bag_of_words(text())
    # n is the feature dimension and can be adjusted
    # feature2 = bigram(text(), score_fn=BigramAssocMeasures.chi_sq, n=1000)
    # feature3 = bigram_words(text(), score_fn=BigramAssocMeasures.chi_sq, n=1000)
    feature4 = jieba_feature(1000)  # jieba segmentation; dimension 1000, adjustable
    temp = {}
    '''
     # Single characters as features
     for word in page:
         if word in feature1:
             temp[word]='True'
     # Two-character pairs as features
     bigrams= BigramCollocationFinder.from_words(page)
     text=[u + v for (u, v) in bigrams.ngram_fd]
     for words in text:
         if words in feature2:
             temp[words]='True'
     # Single characters and two-character pairs as features
     bigrams= BigramCollocationFinder.from_words(page)
     text=[u + v for (u, v) in bigrams.ngram_fd]
     for word in page:
         text.append(word)
     for words in text:
         if words in feature3:
             temp[words]='True'
      '''
    # The text under test is processed with jieba segmentation here
    fenci0 = jieba.cut(page, cut_all=False)
    with open('stop.txt', 'r', encoding='utf-8') as f:
        stop = [line.strip() for line in f]   # stop words
    for words in list(set(fenci0) - set(stop)):
        if words in feature4:
            temp[words] = 'True'
    return temp

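(8) Saving the best classifier and predicting

# Build and keep the classifier for the best algorithm (classifier_ag) found in the experiments
def classfier_model(classifier_ag):
    classifier = SklearnClassifier(classifier_ag)
    classifier.train(train)
    return classifier
# Assume logistic regression is the best algorithm (pass an instance, not the class)
# classifier = classfier_model(classifier_ag=LogisticRegression())
# Predict the text under test with the best classifier
def predict_page(page):
    # Convert the comment to a feature dict first, then classify that single item
    pred = classifier.classify(build_page(page))
    return pred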
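The code above keeps the trained classifier in memory only. If it should survive between runs, one option is to serialize it with the standard pickle module. The sketch below assumes the trained classifier object is picklable; the helper names and the file name are hypothetical, and a plain dict stands in for the classifier so the example runs on its own.

```python
import os
import pickle
import tempfile

def save_classifier(classifier, path):
    """Serialize a trained classifier object to disk."""
    with open(path, "wb") as f:
        pickle.dump(classifier, f)

def load_classifier(path):
    """Load a classifier written by save_classifier (only load files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a stand-in object; in the real pipeline this would be
# the trained classifier returned by classfier_model(...).
stand_in = {"labels": ["pos", "neg"], "dimension": 1000}
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "classifier.pkl")   # hypothetical file name
    save_classifier(stand_in, model_path)
    restored = load_classifier(model_path)
```

Loading a saved classifier avoids retraining on every run, but the same feature set used in training must still be available when new comments are converted for prediction.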

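(9) Labeling the sentiment of each comment (page) in the collection
Note: make sure the classifier has been saved before this step.

# Decide the sentiment of every comment in users.
# collection1 is assumed to be the MongoDB users collection and emotional the
# module that defines predict_page; neither is defined in this post.
def emotion_decide():
    results = collection1.find()
    for result in results:
        if result['page']:   # skip sentiment analysis for empty comments
            condition = {'page': result['page']}
            if collection1.update_many(condition, {'$set': {
                    'emotion': emotional.predict_page(page=result['page'])}}):
                print("successful")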
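A slightly more defensive variant of this step is sketched below. It assumes `collection` behaves like a pymongo Collection (with `find` and `update_one`) and that `predict` is any function mapping comment text to a label such as 'pos' or 'neg'; the function name label_comments is hypothetical. It updates each document by its `_id` and counts how many documents were actually modified.

```python
def label_comments(collection, predict):
    """Write a predicted sentiment label into every non-empty comment document."""
    updated = 0
    for doc in collection.find():
        page = doc.get("page")
        if not page:          # skip documents with a missing or empty comment
            continue
        result = collection.update_one(
            {"_id": doc["_id"]},                  # target exactly this document
            {"$set": {"emotion": predict(page)}},
        )
        updated += result.modified_count
    return updated

# Possible wiring, assuming a local MongoDB and the names used in this post:
# from pymongo import MongoClient
# users = MongoClient()["spiders"]["users"]
# print(label_comments(users, predict_page))
```

Filtering on `_id` avoids predicting once per document and then updating every document that happens to share the same text, and the returned count gives a more informative check than the always-truthy update result.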


Resources

Review corpus and Chinese stop-word list

Machine-learning-based comment sentiment analysis
