Sentiment Analysis of Text with NLTK

Original

2019-02-01 23:07

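Sentiment analysis is one of the most popular applications of NLP. It is the process of determining whether a given piece of text is positive or negative. The code below is borrowed from another blogger; I have added a short explanation of the format and types of its input data for reference.

I also found that NLTK does not segment and count Chinese text well. The approach to sentiment analysis is the same for Chinese and English. The difference is that Chinese text must be segmented into words before NLTK can process it (NLTK generally works at word granularity), so jieba can be used for the segmentation step. After segmentation, the text is one long list of words: [word1, word2, word3, …, wordn]. NLTK's tools can then be applied to that list, for example FreqDist to count word frequencies, or bigrams to turn the text into pairs of adjacent words: [(word1, word2), (word2, word3), (word3, word4), …, (wordn-1, wordn)]. From these you can compute statistics such as the information entropy and mutual information of the words, use those statistics to select features for machine learning, build a classifier, and classify the text. I will add my experiments on Chinese sentiment analysis later.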
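The statistics mentioned above can be sketched without any dependencies. This is only an illustration under stated assumptions: the word list is a made-up example standing in for the output of a segmenter such as jieba, and `entropy` and `pmi` are hypothetical helper names, not NLTK functions. `Counter` holds the same counts that `nltk.FreqDist(words)` would, and the `zip` call produces the same adjacent pairs as `nltk.bigrams(words)`.

```python
import math
from collections import Counter

# Hypothetical, already-segmented text (assumption: a segmenter such as
# jieba has produced this list; it is not real data from the post).
words = ["這部", "電影", "很", "好看", "這部", "電影", "很", "感人"]

# Word frequencies: the same counts nltk.FreqDist(words) would hold.
word_freq = Counter(words)

# Adjacent word pairs: the same pairs nltk.bigrams(words) would yield.
bigram_list = list(zip(words, words[1:]))
bigram_freq = Counter(bigram_list)


def entropy(freq):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(freq.values())
    return -sum((c / total) * math.log2(c / total) for c in freq.values())


def pmi(w1, w2):
    """Pointwise mutual information (in bits) of an adjacent word pair."""
    p_pair = bigram_freq[(w1, w2)] / len(bigram_list)
    p_w1 = word_freq[w1] / len(words)
    p_w2 = word_freq[w2] / len(words)
    return math.log2(p_pair / (p_w1 * p_w2))


print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
print(entropy(word_freq))
print(pmi("這部", "電影"))
```

Pairs with a high PMI co-occur more often than their individual frequencies would suggest, which is one common way to rank candidate features before training a classifier.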

Here is the code:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


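# Sentiment analysis is one of the most popular applications of NLP: it determines whether a given piece of text is positive or negative.
# In some scenarios "neutral" is added as a third option. Sentiment analysis is often used to find out how people feel about a particular topic.


# Define a function for extracting features.
# Given a list of words, it returns a mapping such as: {'It': True, 'movie': True, 'amazing': True, 'is': True, 'an': True}
# The return type is a dict.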
def extract_features(word_list):
    return dict([(word, True) for word in word_list])


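# We need training data; here we use the movie review corpus that ships with NLTK
# (if it is not installed yet, download it once with nltk.download('movie_reviews')).
if __name__ == '__main__':
    # Load the file IDs of the positive and negative reviews
    positive_fileids = movie_reviews.fileids('pos')     # a list of 1000 entries, each one a .txt file
    negative_fileids = movie_reviews.fileids('neg')
    # print(type(positive_fileids), len(negative_fileids))

    # Build labelled feature sets for the positive and the negative reviews.
    # movie_reviews.words(fileids=[f]) gives the content of one .txt file as a sequence of words: ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
    # features_positive is a list of (feature dict, label) tuples,
    # for example: [({'shakesp': True, 'limit': True, 'mouth': True, ..., 'such': True, 'prophetic': True}, 'Positive'), ..., ({...}, 'Positive'), ...]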
    features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 'Negative') for f in negative_fileids]

    # Split into a training set (80%) and a test set (20%)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))  # 800
    threshold_negative = int(threshold_factor * len(features_negative))  # 800
    # 800 positive and 800 negative reviews form the training set; the remaining 200 + 200 form the test set
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training datapoints:", len(features_train))
    print("Number of test datapoints:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))

    print("\nTop 10 most informative words:")
    for item in classifier.most_informative_features()[:10]:
        print(item[0])

    # Some simple reviews to classify
    input_reviews = [
        "It is an amazing movie",
        "This is a dull movie. I would never recommend it to anyone.",
        "The cinematography is pretty great in this movie",
        "The direction was terrible and the story was all over the place"
    ]
    # Run the classifier and get the predictions
    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        # Print the predicted label and its probability
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))


'''
Output:
Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 0.735

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
astounding
avoids
fascination
symbol
animators

Predictions:

Review: It is an amazing movie
Predicted sentiment: Positive
Probability: 0.61

Review: This is a dull movie. I would never recommend it to anyone.
Predicted sentiment: Negative
Probability: 0.77

Review: The cinematography is pretty great in this movie
Predicted sentiment: Positive
Probability: 0.67

Review: The direction was terrible and the story was all over the place
Predicted sentiment: Negative
Probability: 0.63

'''
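One detail worth noting about the prediction step: the script tokenizes each review with `review.split()`, which keeps capital letters and leaves punctuation attached to words ("This", "movie."). As far as I know, the movie review corpus is lowercase with punctuation split into separate tokens, so such tokens would not match any feature seen in training. The sketch below shows one possible way to normalize input first. `tokenize_review` is a hypothetical helper, and the regular expression is my assumption about how the corpus was tokenized, not something taken from the original code. The results quoted above were produced with `review.split()`, so the probabilities may differ if this is swapped in.

```python
import re


def tokenize_review(text):
    """Lowercase the text and split it into word and punctuation tokens.

    Assumption: runs of word characters and runs of punctuation are
    separate tokens, which is how the training corpus appears to be split.
    """
    return re.findall(r"\w+|[^\w\s]+", text.lower())


def extract_features(word_list):
    """Same bag-of-words features as in the script above."""
    return {word: True for word in word_list}


review = "This is a dull movie. I would never recommend it to anyone."

naive_tokens = review.split()           # what the script currently does
clean_tokens = tokenize_review(review)  # normalized alternative

print(naive_tokens)
print(clean_tokens)
print(extract_features(clean_tokens))
```

To try it in the script, one would pass `extract_features(tokenize_review(review))` to `classifier.prob_classify` instead of `extract_features(review.split())`.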

Reference (reposted from): https://blog.csdn.net/qq_41251963/article/details/81702821
