A simple analysis of 《秦吏》 with the jieba library

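I just finished reading 秦吏 and wanted to know which character, apart from 黑夫, appears most often, so I ran a quick analysis with Python's jieba library.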

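import jieba
import wordcloud

# Read the whole novel as UTF-8 text; the with-block closes the file afterwards
with open("秦吏.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# High-frequency words that are not character names; removed from the counts later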
excludes = {"他们","自己","一个","没有","就是","虽然","还是","不是","知道","已经","继续","什么","有些","只是","因为","众人","还有","如此","眼下","如今","所以", "那些",
          "将军","起来","这些","不过","开始","可以","只能","郡守","甚至","便是","这个","不能","这是","最后","出来","作为","说道","看着","于是","一样","过去","地方",
          "以为","时候","觉得","兵卒","为了","可能","立刻","而是","现在","之后","今日","发现","不知","二人","不会","这样","除了","这种","如何","这么","只有","真是",
          "不少","官府","大军","恐怕","依然","看到","一直","都尉","的话","离开","不敢","不同","几个","一起","却是","十分","郡尉","需要","时间","下来","这时候","为何",
          "一边","有人","抵达","记住","一些","两个","当年","明白","得到","此事","一般", "听说", "南方", "后世", "一点", "看来", "无法", "心里", "com", "阅读网", "www",
          "mayitxt","只要","才能","东西","希望","想要","一次","的确","必须","士卒","粮食","一切","过来","战争","商贾","这场","摇头","朝廷","就算","楚人","这里","回来",
          "律令","官吏","不必","当然","认为","秦朝","秦人","匈奴","秦军","咸阳","天下","秦国","胶东","南郡","楚国","安陆","关中","秦吏","百姓","中原","蚂蚁","楚军",
          "直接",}
words = jieba.lcut(txt)  # precise-mode segmentation
counts = {}
for word in words:
    if len(word) == 1:
        # Skip single-character tokens
        continue
    elif word == "亭长" or word == "武忠侯":
        rword = "黑夫"
    elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
        rword = "始皇"
    elif word == "公子":
        rword = "扶苏"
    elif word == "东门":
        rword = "东门豹"
    elif word == "大胡子" or word == "美髯公":
        rword = "刘季"
    else:
        rword = word
    # Count the word under its merged name
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    # pop() with a default does not fail if the word never occurred
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the 15 most frequent entries (fewer if there are not that many)
for word, count in items[:15]:
    print("{0:<15}{1:>5}".format(word, count))

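# Keep only multi-character tokens that are not on the exclusion list
filtered = " ".join(t for t in words if len(t) > 1 and t not in excludes)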
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")

The idea is to segment the text with jieba in precise mode:

words = jieba.lcut(txt)

Next, drop the single-character tokens:

if len(word) == 1:
    continue
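As a rough sketch of these first two steps taken together, the helpers below wrap segmentation and the single-character filter in small functions. The function names and the `names` parameter are my own inventions for illustration. Registering names with `jieba.add_word` before cutting is an optional extra that the original script does not use; it may help when jieba splits an unusual name apart.

```python
def segment(text, names=()):
    """Cut text in jieba's precise mode after registering each name as a word."""
    import jieba  # third-party; imported here so the helper below works without it

    for name in names:
        jieba.add_word(name)  # ask jieba to treat the name as a single token
    return jieba.lcut(text)


def drop_single_chars(tokens):
    """Discard one-character tokens, which are mostly particles and punctuation."""
    return [t for t in tokens if len(t) > 1]


# Hand-written tokens standing in for real jieba output
sample_tokens = ["黑夫", "的", "亭长", "了"]
kept = drop_single_chars(sample_tokens)
```

With jieba installed, something like `drop_single_chars(segment(txt, names=["黑夫", "扶苏"]))` would run both steps on the novel's text.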

Different words that refer to the same character are merged into a single name:

elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
    rword = "始皇"
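One possible way to keep the merging rules easier to maintain is a lookup table instead of a chain of `elif` branches. The sketch below is an alternative formulation, not the script's own code: `ALIASES` and `count_names` are hypothetical names, the alias pairs are copied from the full script, and the sample tokens are made up.

```python
from collections import Counter

# Alias -> merged name, copied from the branches in the full script
ALIASES = {
    "亭长": "黑夫", "武忠侯": "黑夫",
    "嬴政": "始皇", "皇帝": "始皇", "陛下": "始皇", "始皇帝": "始皇", "秦王": "始皇",
    "公子": "扶苏",
    "东门": "东门豹",
    "大胡子": "刘季", "美髯公": "刘季",
}


def count_names(tokens):
    """Count multi-character tokens, folding each alias into its merged name."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue
        # Fall back to the token itself when it is not a known alias
        counts[ALIASES.get(token, token)] += 1
    return counts


sample = ["黑夫", "亭长", "的", "陛下", "始皇", "武忠侯"]  # made-up tokens
counts = count_names(sample)
print(counts["黑夫"], counts["始皇"])  # 3 2
```

Adding a new alias then only means adding one entry to the table.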

Then sort by number of appearances:

items.sort(key=lambda x:x[1],reverse=True)
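If the counts are kept in a `collections.Counter`, the standard library's `most_common` gives the same ranking without an explicit sort. This is just a sketch with made-up numbers, not the counts from the novel.

```python
from collections import Counter

counts = Counter({"黑夫": 5, "扶苏": 2, "李斯": 3})  # made-up counts

# most_common(n) returns at most n (word, count) pairs, highest count first
top = counts.most_common(2)
for word, count in top:
    print("{0:<15}{1:>5}".format(word, count))
```

A small convenience here is that `most_common(15)` simply returns fewer pairs when there are fewer than 15 distinct words, so no index check is needed.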

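The first run exposed a problem: words I wasn't interested in, such as 他们 ("they") and 自己 ("oneself"), turned out to be among the most frequent. So I filtered repeatedly, adding to the exclusion list after each run: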

excludes = {"他们","自己","一个","没有","就是","虽然","还是","不是","知道","已经","继续","什么","有些","只是","因为","众人","还有","如此","眼下","如今","所以", "那些",
          "将军","起来","这些","不过","开始","可以","只能","郡守","甚至","便是","这个","不能","这是","最后","出来","作为","说道","看着","于是","一样","过去","地方",
          "以为","时候","觉得","兵卒","为了","可能","立刻","而是","现在","之后","今日","发现","不知","二人","不会","这样","除了","这种","如何","这么","只有","真是",
          "不少","官府","大军","恐怕","依然","看到","一直","都尉","的话","离开","不敢","不同","几个","一起","却是","十分","郡尉","需要","时间","下来","这时候","为何",
          "一边","有人","抵达","记住","一些","两个","当年","明白","得到","此事","一般", "听说", "南方", "后世", "一点", "看来", "无法", "心里", "com", "阅读网", "www",
          "mayitxt","只要","才能","东西","希望","想要","一次","的确","必须","士卒","粮食","一切","过来","战争","商贾","这场","摇头","朝廷","就算","楚人","这里","回来",
          "律令","官吏","不必","当然","认为","秦朝","秦人","匈奴","秦军","咸阳","天下","秦国","胶东","南郡","楚国","安陆","关中","秦吏","百姓","中原","蚂蚁","楚军",
          "直接",}
for word in excludes:
    # pop() with a default does not fail if the word never occurred
    counts.pop(word, None)
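Another way to apply the exclusion list is to build a new dictionary that skips the excluded words, which works whether or not every excluded word actually occurs in the text. A minimal sketch, with a hypothetical helper name and made-up data:

```python
def remove_excluded(counts, excludes):
    """Return a copy of counts without any word that is in excludes."""
    return {word: n for word, n in counts.items() if word not in excludes}


counts = {"黑夫": 5, "他们": 9, "扶苏": 2}  # made-up counts
excludes = {"他们", "自己"}  # "自己" is deliberately absent from counts

cleaned = remove_excluded(counts, excludes)
print(cleaned)
```

The original dictionary is left untouched, so the list can be adjusted and the filter rerun without counting again.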

The final output was:

黑夫             16914
秦始皇            2705
扶苏              1940
陈平              1427
韩信              1398
李斯              1105
赵高               923
季婴               899
李由               810
刘季               772
李信               726
王贲               702
张苍               641
萧何               612
项籍               583

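What I didn't expect was that 始皇 and 扶苏 would come in second and third.
Of course, the code still has plenty of problems and leaves many factors out. For example, 秦始皇 shows up in the output as its own token rather than under the merged name 始皇, so these counts should be treated as rough.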

Finally, I visualized the result as a word cloud:

w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")
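The word cloud can also be drawn directly from the counts rather than from a joined string, using `WordCloud.generate_from_frequencies`. The sketch below assumes the `wordcloud` package is installed and that the `msyh.ttc` font is available, as in the script. `top_frequencies` and `render_cloud` are hypothetical helper names, and the sample counts are made up.

```python
def top_frequencies(counts, n=100):
    """Return the n most frequent entries of counts as a plain dict."""
    ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    return dict(ranked[:n])


def render_cloud(frequencies, path, font_path="msyh.ttc"):
    """Draw a word cloud sized by the given word -> count mapping."""
    import wordcloud  # third-party; imported here so the helper above works without it

    w = wordcloud.WordCloud(font_path=font_path, width=1000, height=700)
    w.generate_from_frequencies(frequencies)
    w.to_file(path)


sample = {"黑夫": 5, "扶苏": 2, "李斯": 3}  # made-up counts
freqs = top_frequencies(sample, n=2)
```

Calling `render_cloud(top_frequencies(counts), "秦吏.png")` would then size each name by its merged count, so the aliases and the exclusion list carry over to the picture.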

[Image: the generated word cloud]
